Hey Sanjeev,
While figuring out the UI ports and statistics, I'd like to add
another question.
So let's assume our scenario once more (data nodes *dn01* to *dn10*
exist; each data node has 10 x 10TB drives), and remember that the
scan period is 3 weeks (504 hours) and the scanner runs at 1MB/s.
In order to check the CRC32C checksums, assuming a maximum disk speed of
300MB/s, drives at 75% capacity (7.5TB of data each), and disks scanned
in parallel, these are the per-drive timings I'm getting:
1) At 300MB/s: 7.5TB = 7864320MB; 7864320 / 300 / 3600 ≈ 7.3 hours.
2) At 1MB/s (the scanner default): 7864320 / 1 / 3600 ≈ 2184 hours,
i.e. about 91 days. That is far longer than the 3-week (21-day) scan
period, so a full scan cycle can never effectively complete.
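Quick sanity check of that arithmetic (just a throwaway shell sketch,
same assumptions as above):

data_mb=$(( 75 * 1024 * 1024 / 10 ))   # 7.5TB per drive, expressed in MB (binary units) = 7864320
for rate in 300 1; do
    # scan time in hours = data volume / rate / 3600
    echo "at ${rate} MB/s: $(echo "$data_mb / $rate / 3600" | bc -l) hours"
done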
Given this, I'm thinking that in order for HDFS to properly check larger
clusters, the default values of
*dfs.block.scanner.volume.bytes.per.second* and
*dfs.datanode.scan.period.hours* would need to be adjusted. Otherwise,
there is a real potential that the HDFS block (volume) scanner isn't
checking things well.
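As a rough sketch, I was picturing something like this in hdfs-site.xml
(the numbers are purely illustrative, not a recommendation):

<property>
  <name>dfs.block.scanner.volume.bytes.per.second</name>
  <!-- example: raise the throttle from the 1MiB/s default to ~30MiB/s (value is in bytes/s) -->
  <value>31457280</value>
</property>
<property>
  <name>dfs.datanode.scan.period.hours</name>
  <!-- 3-week default of 504 hours, shown only to make the pairing explicit -->
  <value>504</value>
</property>

At ~30MiB/s, a 7.5TB drive would take roughly 3 days to scan, which at
least fits inside the 504-hour window.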
Am I correct in the above?
Thx,
TK
I did these tests to get a rough estimate of the time it takes for the
various checksums to complete:
root /home # time gsutil hash BigFile17GiBinRAM;
Hashes [base64] for BigFile17GiBinRAM:7 MiB/s
Hash (crc32c): lRucsQ==
Hash (md5): v3yZMSuBlJ/q1N8Xavshgg==
Operation completed over 1 objects/16.0 GiB.
real 4m7.993s
user 3m44.568s
sys 0m22.338s
root /home # time sha512sum BigFile17GiBinRAM
ac2c90c6acfeed591530ad9db639734308ef1bf494a3608c10f6334b933149d35c8098791a460fcdc302dcc921a4e16e2821712d8bbbb5d51ea8f5dc329de0ad
BigFile17GiBinRAM
real 2m21.008s
user 1m51.815s
sys 0m15.877s
root /home # time sha1sum BigFile17GiBinRAM
69d37b2c75a13fc02e99bf8e0205ce2a09d98b24 BigFile17GiBinRAM
real 1m18.117s
user 1m1.428s
sys 0m15.505s
root /home # time gsutil hash BigFile17GiBinRAM;
Hashes [base64] for BigFile17GiBinRAM:3 MiB/s
Hash (crc32c): lRucsQ==
Hash (md5): v3yZMSuBlJ/q1N8Xavshgg==
Operation completed over 1 objects/16.0 GiB.
real 4m9.437s
user 3m43.473s
sys 0m23.597s
root / home
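Back-of-the-envelope throughput from those runs (16 GiB file; just awk,
with the wall-clock seconds rounded):

awk 'BEGIN {
  size_mib = 16 * 1024;                                        # 16 GiB test file, in MiB
  printf "gsutil (crc32c+md5): %.0f MiB/s\n", size_mib / 248;  # real 4m08s
  printf "sha512sum:           %.0f MiB/s\n", size_mib / 141;  # real 2m21s
  printf "sha1sum:             %.0f MiB/s\n", size_mib / 78;   # real 1m18s
}'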
On 10/22/2020 9:48 PM, TomK wrote:
Hey Austin, Sanjeev,
The ports defined are as follows in hdfs-site.xml:
cm-r01wn01.mws.mds.xyz root … run cloudera-scm-agent process grep
-Ei "dfs.datanode.http.address|dfs.datanode.https.address" -A 2
./3370-hdfs-DATANODE/hdfs-site.xml
<name>dfs.datanode.http.address</name>
<value>cm-r01wn01.mws.mds.xyz:1006</value>
</property>
--
<name>dfs.datanode.https.address</name>
<value>cm-r01wn01.mws.mds.xyz:9865</value>
</property>
cm-r01wn01.mws.mds.xyz root … run cloudera-scm-agent process
Checking the ports used:
cm-r01wn01.mws.mds.xyz root ~ netstat -pnltu|grep -Ei
"9866|1004|9864|9865|1006|9867"
tcp 0 0 10.3.0.160:9867 0.0.0.0:*
LISTEN 30096/jsvc.exec
tcp 0 0 10.3.0.160:1004 0.0.0.0:*
LISTEN 30096/jsvc.exec
tcp 0 0 10.3.0.160:1006 0.0.0.0:*
LISTEN 30096/jsvc.exec
cm-r01wn01.mws.mds.xyz root ~
cm-r01wn01.mws.mds.xyz root ~
cm-r01wn01.mws.mds.xyz root ~ hds getconf -confKey
dfs.datanode.address
-bash: hds: command not found
cm-r01wn01.mws.mds.xyz root ~ hdfs getconf -confKey
dfs.datanode.address
0.0.0.0:9866
cm-r01wn01.mws.mds.xyz root ~ hdfs getconf -confKey
dfs.datanode.http.address
0.0.0.0:9864
cm-r01wn01.mws.mds.xyz root ~ hdfs getconf -confKey
dfs.datanode.https.address
0.0.0.0:9865
cm-r01wn01.mws.mds.xyz root ~ hdfs getconf -confKey
dfs.datanode.ipc.address
0.0.0.0:9867
cm-r01wn01.mws.mds.xyz root ~
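While at it, the scanner settings themselves can presumably be queried the
same way; my assumption is that getconf falls back to the built-in defaults
when nothing is set in hdfs-site.xml, so these should come back as 1048576
and 504:

hdfs getconf -confKey dfs.block.scanner.volume.bytes.per.second
hdfs getconf -confKey dfs.datanode.scan.period.hours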
The scanner looks to be initialized:
cm-r01wn01.mws.mds.xyz root /var/log/hadoop-hdfs # grep -EiR "Periodic block scanner is not running" *
cm-r01wn01.mws.mds.xyz root /var/log/hadoop-hdfs # grep -EiR "Initialized block scanner with targetBytesPerSec" * | wc -l;
32
cm-r01wn01.mws.mds.xyz root /var/log/hadoop-hdfs #
And yes, indeed it is started up. It kicked off around the time when
I restarted the DataNode service.
cm-r01wn01.mws.mds.xyz root /var/log/hadoop-hdfs # vi hadoop-cmf-hdfs-DATANODE-cm-r01wn01.mws.mds.xyz.log.out
STARTUP_MSG: build = http://github.com/cloudera/hadoop -r
7f07ef8e6df428a8eb53009dc8d9a249dbbb50ad; compiled by 'jenkins' on
2019-07-18T17:09Z
STARTUP_MSG: java = 1.8.0_181
************************************************************/
2020-10-22 20:54:58,488 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: registered UNIX
signal handlers for [TERM, HUP, INT]
2020-10-22 20:54:59,762 INFO
org.apache.hadoop.security.UserGroupInformation: Login successful for
user hdfs/cm-r01wn01.mws.mds....@mws.mds.xyz using keytab file
hdfs.keytab. Keytab auto renewal enabled : false
2020-10-22 20:55:00,265 INFO
org.apache.hadoop.hdfs.server.datanode.checker.ThrottledAsyncChecker:
Scheduling a check for [DISK]file:/hdfs/1/dfs/dn
2020-10-22 20:55:00,295 INFO
org.apache.hadoop.hdfs.server.datanode.checker.ThrottledAsyncChecker:
Scheduling a check for [DISK]file:/hdfs/2/dfs/dn
2020-10-22 20:55:00,296 INFO
org.apache.hadoop.hdfs.server.datanode.checker.ThrottledAsyncChecker:
Scheduling a check for [DISK]file:/hdfs/3/dfs/dn
2020-10-22 20:55:00,297 INFO
org.apache.hadoop.hdfs.server.datanode.checker.ThrottledAsyncChecker:
Scheduling a check for [DISK]file:/hdfs/4/dfs/dn
2020-10-22 20:55:00,521 INFO
org.apache.hadoop.metrics2.impl.MetricsConfig: Loaded properties from
hadoop-metrics2.properties
2020-10-22 20:55:00,723 INFO
org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled Metric
snapshot period at 10 second(s).
2020-10-22 20:55:00,723 INFO
org.apache.hadoop.metrics2.impl.MetricsSystemImpl: DataNode metrics
system started
2020-10-22 20:55:00,947 INFO
org.apache.hadoop.hdfs.server.common.Util:
dfs.datanode.fileio.profiling.sampling.percentage set to 0. Disabling
file IO profiling
2020-10-22 20:55:00,953 INFO
org.apache.hadoop.hdfs.server.datanode.BlockScanner: *Initialized
block scanner with targetBytesPerSec 1048576*
2020-10-22 20:55:00,961 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: File descriptor
passing is enabled.
2020-10-22 20:55:00,963 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Configured hostname
is cm-r01wn01.mws.mds.xyz
2020-10-22 20:55:00,965 INFO
org.apache.hadoop.hdfs.server.common.Util:
dfs.datanode.fileio.profiling.sampling.percentage set to 0. Disabling
file IO profiling
2020-10-22 20:55:00,995 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Starting DataNode
with maxLockedMemory = 299892736
2020-10-22 20:55:01,018 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Opened streaming
server at /10.3.0.160:1004
2020-10-22 20:55:01,023 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Balancing bandwidth
is 10485760 bytes/s
2020-10-22 20:55:01,024 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Number threads for
balancing is 50
2020-10-22 20:55:01,029 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Balancing bandwidth
is 10485760 bytes/s
2020-10-22 20:55:01,029 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Number threads for
balancing is 50
2020-10-22 20:55:01,029 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Listening on UNIX
domain socket: /var/run/hdfs-sockets/dn
2020-10-22 20:55:01,304 INFO org.eclipse.jetty.util.log: Logging
initialized @8929ms
2020-10-22 20:55:01,559 INFO org.apache.hadoop.http.HttpRequestLog:
Http request log for http.requests.datanode is not defined
2020-10-22 20:55:01,585 INFO org.apache.hadoop.http.HttpServer2: Added
global filter 'safety'
(class=org.apache.hadoop.http.HttpServer2$QuotingInputFilter)
2020-10-22 20:55:01,589 INFO org.apache.hadoop.http.HttpServer2: Added
filter authentication
(class=org.apache.hadoop.security.authentication.server.AuthenticationFilter)
to context datanode
This answers another question I had: under what conditions does the
block / volume checker kick off? When removing a datanode and adding
it back in, it appears the checker gets kicked off on that worker at
that time.
Only the Secure DataNode port returns a login, as is to be expected. (
http://cm-r01wn01.mws.mds.xyz:1006/ )
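Assuming that's the right web port, the scanner report should in theory be
reachable with something like the following (the cluster is Kerberized, so
curl needs SPNEGO; I haven't verified this exact URL, so treat it as a guess):

curl --negotiate -u : "http://cm-r01wn01.mws.mds.xyz:1006/blockScannerReport?listblocks"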
Thx,
TK
On 10/22/2020 11:56 AM, संजीव (Sanjeev Tripurari) wrote:
Hi Tom,
Can you start your datanode service and share the datanode logs?
Check if it started properly or not.
Regards
-Sanjeev
On Thu, 22 Oct 2020 at 20:33, Austin Hackett <hacketta...@me.com> wrote:
Hi Tom
It might be worth restarting the DataNode process? I didn’t think
you could disable the DataNode Web UI as such, but I could be
wrong on this point. Out of interest, what does hdfs-site.xml say
with regards to dfs.datanode.http.address/dfs.datanode.https.address?
Regarding the logs, a quick look on GitHub suggests there may be
a couple of useful log messages:
https://github.com/apache/hadoop/blob/88a9f42f320e7c16cf0b0b424283f8e4486ef286/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockScanner.java
For example, LOG.warn(“Periodic block scanner is not running”) or
LOG.info(“Initialized block scanner with targetBytesPerSec {}”).
Of course, you’d need to make sure those LOG statements are present
in the Hadoop version included with CDH 6.3. Git “blame” suggests
the LOG statements were added 6 years ago, so chances are you have them...
Thanks
Austin
On 22 Oct 2020, at 14:44, TomK <tomk...@mdevsys.com> wrote:
Thanks Austin. However none of these are open on a standard
Cloudera 6.3 build.
# netstat -pnltu|grep -Ei "9866|1004|9864|9865|1006|9867"
#
Would there be anything in the logs to indicate whether or not
the block / volume scanner is running?
Thx,
TK
On 10/22/2020 3:09 AM, Austin Hackett wrote:
Hi Tom
I’m not too familiar with the CDH distribution, but this page has
the default ports used by the DataNode:
https://docs.cloudera.com/documentation/enterprise/latest/topics/cdh_ports.html
I believe it’s the settings for
dfs.datanode.http.address/dfs.datanode.https.address that
you’re interested in (9864/9865)
Since the data block scanner related config parameters are not
set, the defaults of 3 weeks and 1MB/s should apply.
Thanks
Austin
On 22 Oct 2020, at 06:35, TomK <tomk...@mdevsys.com> wrote:
Hey Austin, Sanjeev,
Thanks once more! Took some time to review the pages. That
was certainly very helpful. Appreciated!
However, I tried to access https://dn01/blockScannerReport
on a test Cloudera 6.3 cluster. It didn't work. I tried the
following as well:
http://dn01:50075/blockscannerreport?listblocks
https://dn01:50075/blockscannerreport
https://dn01:10006/blockscannerreport
Checked whether port 50075 is up (netstat -pnltu); there's no
service on that port on the workers. Checked the pages:
https://docs.cloudera.com/documentation/enterprise/5-14-x/topics/cdh_ig_ports_cdh5.html
It is defined on the pages. Checked if the following is set. The
following 2 configurations in /hdfs-site.xml/ are the most used for
block scanners:
* *dfs.block.scanner.volume.bytes.per.second* to throttle the scan
bandwidth to configurable bytes per second. *Default value is 1M*.
Setting this to 0 will disable the block scanner.
* *dfs.datanode.scan.period.hours* to configure the scan period,
which defines how often a whole scan is performed. This should be
set to a long enough interval to really take effect, for the reasons
explained above. *Default value is 3 weeks (504 hours)*. Setting this
to 0 will use the default value. Setting this to a negative value
will disable the block scanner.
These are NOT explicitly set. Checked hdfs-site.xml. Nothing
defined there. Checked the Configuration tab in the cluster.
It's not defined either.
Does this mean that the defaults are applied OR does it mean
that the block / volume scanner is disabled? I see the pages
detail what values for these settings mean but I didn't see
any notes pertaining to the situation where both values are
not explicitly set.
Thx,
TK
On 10/21/2020 1:34 PM, संजीव (Sanjeev Tripurari) wrote:
Yes Austin,
you are right: every datanode will do its own block verification,
which is sent as a health check report to the namenode.
Regards
-Sanjeev
On Wed, 21 Oct 2020 at 21:53, Austin Hackett <hacketta...@me.com> wrote:
Hi Tom
It is my understanding that in addition to block
verification on client reads, each data node runs a
DataBlockScanner in a background thread that periodically
verifies all the blocks stored on the data node. The
dfs.datanode.scan.period.hours property controls how
often this verification occurs.
I think the reports are available via the data node
/blockScannerReport HTTP endpoint, although I’m not sure
I ever actually looked at one. (add ?listblocks to get
the verification status of each block).
More info here:
https://blog.cloudera.com/hdfs-datanode-scanners-and-disk-checker-explained/
Thanks
Austin
On 21 Oct 2020, at 16:47, TomK <tomk...@mdevsys.com> wrote:
Hey Sanjeev,
Alright. Thank you once more. This is clear.
However, this poses an issue then. If, during the two years, disk
drives develop bad blocks but do not fail to the point that they
cannot be mounted, the stored data would no longer match its
checksum, since those filesystem blocks can no longer be read
correctly. However, from an HDFS perspective, since no checks are
done regularly, that is not known. So HDFS still reports that the
file is fine, in other words, no missing blocks. For example, if a
disk is going bad but those files are not read for two years, the
system won't know that there is a problem. Even when removing a data
node temporarily and re-adding it, HDFS isn't checking, because that
HDFS file isn't read.
So let's assume this scenario. Data nodes *dn01* to
*dn10* exist. Each data node has 10 x 10TB drives.
And let's assume that there is one large file on those
drives and it's replicated with a factor of 3 (X3).
If during the two years the file isn't read, and 10 of those drives
develop bad blocks or other underlying hardware issues, then it is
possible that HDFS will still report everything as fine, even with a
replication factor of 3. With 10 disks failing, it's possible a block
or sector has failed under each of the 3 copies of the data, but HDFS
would NOT know, since nothing triggered a read of that HDFS file.
Based on everything below, corruption is very much possible even with
a replication factor of 3. At this point the file is unreadable, but
HDFS still reports no missing blocks.
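(By "reports no missing blocks" I mean the summary you get from fsck;
as far as I understand, fsck only consults the namenode's block reports
and never reads the block data, so in this scenario it would still come
back healthy. The paths below are just placeholders:)

hdfs fsck /path/to/bigfile -files -blocks
hdfs fsck / -list-corruptfileblocks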
Similarly, if, while I have a data node taken out, I adjust one
of the files on its data disks, HDFS will not know and will still
report everything as fine. That is, until someone reads the file.
Sounds like this is a very real possibility.
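(I suppose one crude workaround is to force a read, which should make
the client verify the checksums and flag any bad replica; that's my
assumption based on the client-side verification described below, and
the path is again just a placeholder:)

hdfs dfs -cat /path/to/bigfile > /dev/null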
Thx,
TK
On 10/21/2020 10:26 AM, संजीव (Sanjeev Tripurari) wrote:
Hi Tom
Therefore, if I write a file to HDFS but access it two
years later, then the checksum will be computed only
twice, at the beginning of the two years and again at
the end when a client connects? Correct? As long as no
process ever accesses the file between now and two
years from now, the checksum is never redone and
compared to the two year old checksum in the fsimage?
Yes, exactly: unless data is read, the checksum is not
verified (it is checked only when data is written and when
data is read).
If the checksum is mismatched, there is no way to correct
it; you will have to re-write that file.
When datanode is added back in, there is no real read
operation on the files themselves. The datanode just
reports the blocks but doesn't really read the blocks
that are there to re-verify the files and ensure
consistency?
Yes, exactly: the datanode maintains a list of files and
their blocks, which it reports, along with total disk
size and used size.
The namenode only has the list of blocks; unless the datanodes
are connected, it won't know where the blocks are stored.
Regards
-Sanjeev
On Wed, 21 Oct 2020 at 18:31, TomK <tomk...@mdevsys.com> wrote:
Hey Sanjeev,
Thank you very much again. This confirms my suspicion.
Therefore, if I write a file to HDFS but access it
two years later, then the checksum will be computed
only twice, at the beginning of the two years and
again at the end when a client connects? Correct?
As long as no process ever accesses the file
between now and two years from now, the checksum is
never redone and compared to the two year old
checksum in the fsimage?
When datanode is added back in, there is no real
read operation on the files themselves. The
datanode just reports the blocks but doesn't really
read the blocks that are there to re-verify the
files and ensure consistency?
Thx,
TK
On 10/21/2020 12:38 AM, संजीव (Sanjeev Tripurari) wrote:
Hi Tom,
Every datanode sends a heartbeat to the namenode with the
list of blocks it has.
When a datanode has been disconnected for a while, after
reconnecting it will send a heartbeat to the namenode with
the list of blocks it has (until then, the namenode will
show under-replicated blocks).
As soon as the datanode is connected to the namenode,
it will clear the under-replicated blocks.
*When a client connects to read or write a file,
it will run a checksum to validate the file.*
There is no independent process running to do
checksums, as it would be a heavy process on each node.
Regards
-Sanjeev
On Wed, 21 Oct 2020 at 00:18, Tom <t...@mdevsys.com> wrote:
Thank you. That part I understand and am OK with it.
What I would like to know next is when the CRC32C
checksum is run again and checked against the fsimage
to confirm that the block file has not changed or
become corrupted.
For example, if I take a datanode out and, within 15
minutes, plug it back in, does HDFS rerun the CRC32C
on all data disks on that node to make sure the blocks
are OK?
Cheers,
TK
Sent from my iPhone
On Oct 20, 2020, at 1:39 PM, संजीव (Sanjeev
Tripurari) <sanjeevtripur...@gmail.com> wrote:
It's done as soon as a file is stored on disk.
Sanjeev
On Tuesday, 20 October 2020, TomK <tomk...@mdevsys.com> wrote:
Thanks again.
At what points is the checksum validated
(checked) after that? For example, is it
done on a daily basis or is it done only
when the file is accessed?
Thx,
TK
On 10/20/2020 10:18 AM, संजीव (Sanjeev
Tripurari) wrote:
As soon as the file is written for the first
time, the checksum is calculated and updated
in the fsimage (first in the edit logs), and the
same is replicated to the other replicas.
On Tue, 20 Oct 2020 at 19:15, TomK <tomk...@mdevsys.com> wrote:
Hi Sanjeev,
Thank you. It does help.
At what points is the checksum
calculated?
Thx,
TK
On 10/20/2020 3:03 AM, संजीव
(Sanjeev Tripurari) wrote:
For missing blocks and corrupted blocks, do check
that all the datanode services are up, that none of
the disks where HDFS data is stored are inaccessible
or have issues, and that the hosts are reachable from
the namenode.
If you are able to re-generate the data and write it
again, great; otherwise Hadoop cannot correct it itself.
Could you please elaborate on this?
Does it mean I have to continuously
access a file for HDFS to be able to
detect corrupt blocks and correct
itself?
*"Does HDFS check that the data
node is up, data disk is mounted,
path to
the file exists and file can be read?"*
-- Yes; only after it fails will it
say missing blocks.
*Or does it also do a filesystem
check on that data disk as well as
perhaps a checksum to ensure block
integrity?*
-- Yes; every file checksum is
maintained and cross-checked, and if it
fails it will say corrupted blocks.
hope this helps.
-Sanjeev
On Tue, 20 Oct 2020 at 09:52, TomK <tomk...@mdevsys.com> wrote:
Hello,
HDFS Missing Blocks / Corrupt Blocks Logic: What are the
specific checks done to determine a block is bad and needs
to be replicated?
Does HDFS check that the data node is up, data disk is
mounted, path to the file exists and file can be read?
Or does it also do a filesystem check on that data disk as
well as perhaps a checksum to ensure block integrity?
I've googled on this quite a bit. I don't see the exact
answer I'm looking for. I would like to know exactly what
happens during file integrity verification that then
constitutes missing blocks or corrupt blocks in the reports.
--
Thank You,
TK.
--
Thx,
TK.