The block scanner is a simple, independent operation of the DN that runs periodically and does its work in small phases, to ensure that no blocks exist that don't match their checksums (it's an automatic data validator). This lets it report corrupt/rotting blocks and keep the cluster healthy.
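For reference, you can watch the scanner's progress from the DN web UI, and its interval is configurable. A minimal sketch, assuming the DN info port quoted later in this thread, the Hadoop 1.x property name dfs.datanode.scan.period.hours (default 504 hours, i.e. the three-week cycle mentioned below), and a config path of /etc/hadoop/conf, all of which may differ on your setup:

    # Per-DataNode scan progress (the same page quoted further down in this thread)
    curl "http://<datanode-host>:15075/blockScannerReport"

    # Scan interval in hours; 504 (the default) = 3 weeks. Config path is an assumption.
    grep -A 1 dfs.datanode.scan.period.hours /etc/hadoop/conf/hdfs-site.xml

Shortening the period surfaces corruption sooner but adds steady background disk reads on every DN.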
The scanner's runtime shouldn't cause any issues, unless your DN has a lot of blocks (more than normal, due to an overload of small, inefficient files) but too little heap to perform block retention plus block scanning.

> 1. Will the data node not allow writing data during the DataBlockScanning process?

No such thing. As I said, it's independent and mostly lock-free. Writes and reads are not hampered.

> 2. Will the data node come back to normal only when "Not yet verified" comes down to zero in the data node blockScannerReport?

Yes, but note that this runs over and over again (once every 3 weeks IIRC).

On Wed, May 1, 2013 at 11:33 AM, selva <[email protected]> wrote:
> Thanks Harsh & Manoj for the inputs.
>
> Now I found that the data node is busy with block scanning. I have TBs of data
> attached to each data node, so it's taking days to complete the data block
> scanning. I have two questions.
>
> 1. Will the data node not allow writing data during the DataBlockScanning process?
>
> 2. Will the data node come back to normal only when "Not yet verified" comes down
> to zero in the data node blockScannerReport?
>
> # Data node logs
>
> 2013-05-01 05:53:50,639 INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification succeeded for blk_-7605405041820244736_20626608
> 2013-05-01 05:53:50,664 INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification succeeded for blk_-1425088964531225881_20391711
> 2013-05-01 05:53:50,692 INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification succeeded for blk_2259194263704433881_10277076
> 2013-05-01 05:53:50,740 INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification succeeded for blk_2653195657740262633_18315696
> 2013-05-01 05:53:50,818 INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification succeeded for blk_-5124560783595402637_20821252
> 2013-05-01 05:53:50,866 INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification succeeded for blk_6596021414426970798_19649117
> 2013-05-01 05:53:50,931 INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification succeeded for blk_7026400040099637841_20741138
> 2013-05-01 05:53:50,992 INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification succeeded for blk_8535358360851622516_20694185
> 2013-05-01 05:53:51,057 INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification succeeded for blk_7959856580255809601_20559830
>
> # One of my data node block scanner reports
>
> http://<datanode-host>:15075/blockScannerReport
>
> Total Blocks : 2037907
> Verified in last hour : 4819
> Verified in last day : 107355
> Verified in last week : 686873
> Verified in last four weeks : 1589964
> Verified in SCAN_PERIOD : 1474221
> Not yet verified : 447943
> Verified since restart : 318433
> Scans since restart : 318058
> Scan errors since restart : 0
> Transient scan errors : 0
> Current scan rate limit KBps : 3205
> Progress this period : 101%
> Time left in cur period : 86.02%
>
> Thanks
> Selva
>
>
> -----Original Message-----
> From: "S, Manoj" <[email protected]>
> Subject: RE: High IO Usage in Datanodes due to Replication
> Date: Mon, 29 Apr 2013 06:41:31 GMT
>
> Adding to Harsh's comments:
>
> You can also tweak a few OS-level parameters to improve the I/O performance.
> 1) Mount the filesystem with the "noatime" option.
> 2) Check if changing the I/O scheduling algorithm improves the
> cluster's performance.
> (Check this file: /sys/block/<device_name>/queue/scheduler)
> 3) If there are lots of I/O requests and your cluster hangs because of that,
> you can increase the queue length by increasing the value in
> /sys/block/<device_name>/queue/nr_requests.
>
> -----Original Message-----
> From: Harsh J [mailto:[email protected]]
> Sent: Sunday, April 28, 2013 12:03 AM
> To: <[email protected]>
> Subject: Re: High IO Usage in Datanodes due to Replication
>
> They seem to be transferring blocks between one another. This is most
> likely due to under-replication, and the NN UI will have numbers on the
> work left to perform. The inter-DN transfer is controlled by the balancing
> bandwidth, though, so you can lower that if you want to cripple it, but
> you'll then take longer to get back to a perfectly replicated state.
>
> On Sat, Apr 27, 2013 at 11:33 PM, selva <[email protected]> wrote:
>> Hi All,
>>
>> I have lost the Amazon instances of my Hadoop cluster, but I had all the
>> data on AWS EBS volumes, so I launched new instances and attached the volumes.
>>
>> But all of the datanode logs keep printing the lines below, and it has caused
>> a high IO rate. Due to the IO usage I am not able to run any jobs.
>>
>> Can anyone help me understand what it is doing? Thanks in advance.
>>
>> 2013-04-27 17:51:40,197 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
>> DatanodeRegistration(10.157.10.242:10013, storageID=DS-407656544-10.28.217.27-10013-1353165843727,
>> infoPort=15075, ipcPort=10014) Starting thread to transfer block
>> blk_2440813767266473910_11564425 to 10.168.18.178:10013
>> 2013-04-27 17:51:40,230 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
>> DatanodeRegistration(10.157.10.242:10013, storageID=DS-407656544-10.28.217.27-10013-1353165843727,
>> infoPort=15075, ipcPort=10014):Transmitted block blk_2440813767266473910_11564425 to
>> /10.168.18.178:10013
>> 2013-04-27 17:51:40,433 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
>> Receiving block blk_2442656050740605335_10906493 src: /10.171.11.11:60744 dest: /10.157.10.242:10013
>> 2013-04-27 17:51:40,450 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
>> Received block blk_2442656050740605335_10906493 src: /10.171.11.11:60744 dest: /10.157.10.242:10013 of size 25431
>>
>> Thanks
>> Selva
>>
>>
>
>
> --
> Harsh J

--
Harsh J
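The OS-level tweaks and the balancing-bandwidth knob discussed above boil down to a handful of commands. A rough sketch, run as root on a DataNode host; the device name (sdb), the mount point (/mnt/dfs1), the scheduler choice, and the Hadoop 1.x names (dfs.balance.bandwidthPerSec, dfsadmin -setBalancerBandwidth) are all assumptions to adapt to your own cluster and version:

    # 1) Remount the DFS data volume with noatime (add it to /etc/fstab to persist across reboots)
    mount -o remount,noatime /mnt/dfs1                # mount point is an assumption

    # 2) Inspect and switch the I/O scheduler; the active one is shown in [brackets]
    cat /sys/block/sdb/queue/scheduler                # e.g. "noop [deadline] cfq"
    echo deadline > /sys/block/sdb/queue/scheduler

    # 3) Raise the request queue depth if the device is saturated with pending I/O
    cat /sys/block/sdb/queue/nr_requests              # commonly 128 by default
    echo 512 > /sys/block/sdb/queue/nr_requests

    # Throttle inter-DN block transfers (bytes per second). The persistent form is
    # dfs.balance.bandwidthPerSec in hdfs-site.xml (DN restart required); releases
    # that support it can also change the limit at runtime:
    hadoop dfsadmin -setBalancerBandwidth 1048576

Note that throttling the transfers trades I/O relief for a longer window in which blocks stay under-replicated, as Harsh points out above.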
