Hi Tom,

Can you start your datanode service and share the datanode logs, so we can
check whether it started properly or not?
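
A couple of quick checks may help here (a rough sketch, assuming you have
shell access to the datanode host and an HDFS client configured; adjust for
your environment):

    # On the datanode host: is a DataNode JVM actually running?
    jps | grep DataNode

    # From any HDFS client: which datanodes does the namenode see as live?
    hdfs dfsadmin -report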

Regards
-Sanjeev

On Thu, 22 Oct 2020 at 20:33, Austin Hackett <hacketta...@me.com> wrote:

> Hi Tom
>
> It might be worth restarting the DataNode process? I didn't think you
> could disable the DataNode Web UI as such, but I could be wrong on this
> point. Out of interest, what does hdfs-site.xml say with regards
> to dfs.datanode.http.address/dfs.datanode.https.address?
>
> Regarding the logs, a quick look on GitHub suggests there may be a couple
> of useful log messages:
>
> https://github.com/apache/hadoop/blob/88a9f42f320e7c16cf0b0b424283f8e4486ef286/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockScanner.java
>
> For example, LOG.warn("Periodic block scanner is not running") or
> LOG.info("Initialized block scanner with targetBytesPerSec {}").
>
> Of course, you'd need to make sure those LOG statements are present in the
> Hadoop version included with CDH 6.3. Git "blame" suggests the LOG
> statements were added 6 years ago, so chances are you have them...
>
> Thanks
>
> Austin
>
> On 22 Oct 2020, at 14:44, TomK <tomk...@mdevsys.com> wrote:
>
> Thanks Austin. However, none of these are open on a standard Cloudera 6.3
> build.
>
> # netstat -pnltu|grep -Ei "9866|1004|9864|9865|1006|9867"
> #
>
> Would there be anything in the logs to indicate whether or not the block /
> volume scanner is running?
>
> Thx,
> TK
>
> On 10/22/2020 3:09 AM, Austin Hackett wrote:
>
> Hi Tom
>
> I'm not too familiar with the CDH distribution, but this page has the
> default ports used by the DataNode:
>
> https://docs.cloudera.com/documentation/enterprise/latest/topics/cdh_ports.html
>
> I believe it's the settings for
> dfs.datanode.http.address/dfs.datanode.https.address
> that you're interested in (9864/9865).
>
> Since the data block scanner related config parameters are not set, the
> defaults of 3 weeks and 1MB should be applied.
>
> Thanks
>
> Austin
>
> On 22 Oct 2020, at 06:35, TomK <tomk...@mdevsys.com> wrote:
>
> Hey Austin, Sanjeev,
>
> Thanks once more! Took some time to review the pages. That was certainly
> very helpful. Appreciated!
> However, I tried to access https://dn01/blockScannerReport on a test
> Cloudera 6.3 cluster. Didn't work. Tried the following as well:
>
> http://dn01:50075/blockscannerreport?listblocks
> https://dn01:50075/blockscannerreport
> https://dn01:10006/blockscannerreport
>
> Checked that port 50075 is up ( netstat -pnltu ). There's no service on
> that port on the workers. Checked the pages:
>
> https://docs.cloudera.com/documentation/enterprise/5-14-x/topics/cdh_ig_ports_cdh5.html
>
> It is defined on the pages. Checked if the following is set:
>
> The following 2 configurations in *hdfs-site.xml* are the most used for
> block scanners:
>
> - *dfs.block.scanner.volume.bytes.per.second* to throttle the scan
> bandwidth to configurable bytes per second. *Default value is 1M*.
> Setting this to 0 will disable the block scanner.
> - *dfs.datanode.scan.period.hours* to configure the scan period, which
> defines how often a whole scan is performed. This should be set to a long
> enough interval to really take effect, for the reasons explained above.
> *Default value is 3 weeks (504 hours)*. Setting this to 0 will use the
> default value. Setting this to a negative value will disable the block
> scanner.
>
> These are NOT explicitly set. Checked hdfs-site.xml. Nothing defined
> there. Checked the Configuration tab in the cluster. It's not defined
> either.
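>
> For what it's worth, I can also query what the client-side configuration
> resolves these to (a quick sketch; I'm assuming the hdfs client on the
> worker picks up the same configuration as the datanode role, which may not
> strictly hold under Cloudera Manager):
>
> hdfs getconf -confKey dfs.datanode.scan.period.hours
> hdfs getconf -confKey dfs.block.scanner.volume.bytes.per.second
> hdfs getconf -confKey dfs.datanode.http.address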
>
> Does this mean that the defaults are applied OR does it mean that the
> block / volume scanner is disabled? I see the pages detail what the values
> for these settings mean, but I didn't see any notes pertaining to the
> situation where both values are not explicitly set.
>
> Thx,
> TK
>
> On 10/21/2020 1:34 PM, संजीव (Sanjeev Tripurari) wrote:
>
> Yes Austin,
>
> you are right, every datanode will do its block verification, which is
> sent as a health check report to the namenode.
>
> Regards
> -Sanjeev
>
> On Wed, 21 Oct 2020 at 21:53, Austin Hackett <hacketta...@me.com> wrote:
>
>> Hi Tom
>>
>> It is my understanding that in addition to block verification on client
>> reads, each data node runs a DataBlockScanner in a background thread that
>> periodically verifies all the blocks stored on the data node. The
>> dfs.datanode.scan.period.hours property controls how often this
>> verification occurs.
>>
>> I think the reports are available via the data node /blockScannerReport
>> HTTP endpoint, although I'm not sure I ever actually looked at one. (Add
>> ?listblocks to get the verification status of each block.)
>>
>> More info here:
>> https://blog.cloudera.com/hdfs-datanode-scanners-and-disk-checker-explained/
>>
>> Thanks
>>
>> Austin
>>
>> On 21 Oct 2020, at 16:47, TomK <tomk...@mdevsys.com> wrote:
>>
>> Hey Sanjeev,
>>
>> All right. Thank you once more. This is clear.
>>
>> However, this poses an issue then. If during the two years, disk drives
>> develop bad blocks but do not necessarily fail to the point that they
>> cannot be mounted, that checksum would have changed, since those
>> filesystem blocks can no longer be read. However, from an HDFS
>> perspective, since no checks are done regularly, that is not known. So
>> HDFS still reports that the file is fine, in other words, no missing
>> blocks. For example, if a disk is going bad, but those files are not read
>> for two years, the system won't know that there is a problem. Even when
>> removing a data node temporarily and re-adding the datanode, HDFS isn't
>> checking, because that HDFS file isn't read.
>>
>> So let's assume this scenario. Data nodes *dn01* to *dn10* exist. Each
>> data node has 10 x 10TB drives.
>>
>> And let's assume that there is one large file on those drives and it's
>> replicated to a factor of 3.
>> If during the two years the file isn't read, and 10 of those drives
>> develop bad blocks or other underlying hardware issues, then it is
>> possible that HDFS will still report everything fine, even with a
>> replication factor of 3. Because with 10 disks failing, it's possible a
>> block or sector has failed under each of the 3 copies of the data. But
>> HDFS would NOT know, since nothing triggered a read of that HDFS file.
>> Based on everything below, corruption is then very much possible even
>> with a replication factor of 3. At this point the file is unreadable, but
>> HDFS still reports no missing blocks.
>>
>> Similarly, if, once I take a data node out, I adjust one of the files on
>> the data disks, HDFS will not know and will still report everything fine.
>> That is, until someone reads the file.
>>
>> Sounds like this is a very real possibility.
>>
>> Thx,
>> TK
>>
>> On 10/21/2020 10:26 AM, संजीव (Sanjeev Tripurari) wrote:
>>
>> Hi Tom
>>
>> Therefore, if I write a file to HDFS but access it two years later, then
>> the checksum will be computed only twice, at the beginning of the two
>> years and again at the end when a client connects? Correct? As long as no
>> process ever accesses the file between now and two years from now, the
>> checksum is never redone and compared to the two year old checksum in the
>> fsimage?
>>
>> Yes, exactly: unless the data is read, the checksum is not verified (it
>> is checked when the data is written and when the data is read).
>> If the checksum is mismatched, there is no way to correct it; you will
>> have to re-write that file.
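>>
>> If you do not want to wait for a client to come along, simply reading the
>> file through the HDFS client is enough to trigger that read-time
>> verification. A small sketch (the path here is only an illustration, not
>> from your cluster):
>>
>> hdfs dfs -cat /data/some/large-file > /dev/null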
>>
>> When the datanode is added back in, there is no real read operation on
>> the files themselves. The datanode just reports the blocks but doesn't
>> really read the blocks that are there to re-verify the files and ensure
>> consistency?
>>
>> Yes, exactly. The datanode maintains a list of files and their blocks,
>> which it reports, along with the total disk size and used size.
>> The namenode only has the list of blocks; unless the datanodes are
>> connected, it won't know where the blocks are stored.
>>
>> Regards
>> -Sanjeev
>>
>> On Wed, 21 Oct 2020 at 18:31, TomK <tomk...@mdevsys.com> wrote:
>>
>>> Hey Sanjeev,
>>>
>>> Thank you very much again. This confirms my suspicion.
>>>
>>> Therefore, if I write a file to HDFS but access it two years later, then
>>> the checksum will be computed only twice, at the beginning of the two
>>> years and again at the end when a client connects? Correct? As long as no
>>> process ever accesses the file between now and two years from now, the
>>> checksum is never redone and compared to the two year old checksum in the
>>> fsimage?
>>>
>>> When the datanode is added back in, there is no real read operation on
>>> the files themselves. The datanode just reports the blocks but doesn't
>>> really read the blocks that are there to re-verify the files and ensure
>>> consistency?
>>>
>>> Thx,
>>> TK
>>>
>>> On 10/21/2020 12:38 AM, संजीव (Sanjeev Tripurari) wrote:
>>>
>>> Hi Tom,
>>>
>>> Every datanode sends a heartbeat to the namenode with the list of blocks
>>> it has.
>>>
>>> When a datanode that has been disconnected for a while reconnects, it
>>> will send a heartbeat to the namenode with the list of blocks it has
>>> (until then, the namenode will have under-replicated blocks).
>>> As soon as the datanode is connected to the namenode, it will clear the
>>> under-replicated blocks.
>>>
>>> *When a client connects to read or write a file, it will run a checksum
>>> to validate the file.*
>>>
>>> There is no independent process running to do checksums, as it would be
>>> a heavy process on each node.
>>>
>>> Regards
>>> -Sanjeev
>>>
>>> On Wed, 21 Oct 2020 at 00:18, Tom <t...@mdevsys.com> wrote:
>>>
>>>> Thank you. That part I understand and am OK with it.
>>>>
>>>> What I would like to know next is when the CRC32C checksum is run again
>>>> and checked against the fsimage to confirm that the block file has not
>>>> changed or become corrupted?
>>>>
>>>> For example, if I take a datanode out, and within 15 minutes, plug it
>>>> back in, does HDFS rerun the CRC32C on all data disks on that node to
>>>> make sure blocks are ok?
>>>>
>>>> Cheers,
>>>> TK
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On Oct 20, 2020, at 1:39 PM, संजीव (Sanjeev Tripurari) <
>>>> sanjeevtripur...@gmail.com> wrote:
>>>>
>>>> It's done as soon as a file is stored on disk.
>>>>
>>>> Sanjeev
>>>>
>>>> On Tuesday, 20 October 2020, TomK <tomk...@mdevsys.com> wrote:
>>>>
>>>>> Thanks again.
>>>>>
>>>>> At what points is the checksum validated (checked) after that? For
>>>>> example, is it done on a daily basis or is it done only when the file
>>>>> is accessed?
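>>>>>
>>>>> (If it helps frame the question: I know I can ask for a file-level
>>>>> checksum on demand, e.g.
>>>>>
>>>>> hdfs dfs -checksum /user/tom/example.dat
>>>>>
>>>>> where the path is just an example, but I'm not clear whether a command
>>>>> like that re-reads and re-verifies the on-disk blocks or only returns
>>>>> checksum information the datanodes already have stored.)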
>>>>>
>>>>> Thx,
>>>>> TK
>>>>>
>>>>> On 10/20/2020 10:18 AM, संजीव (Sanjeev Tripurari) wrote:
>>>>>
>>>>> As soon as the file is written the first time, the checksum is
>>>>> calculated and updated in the fsimage (first in the edit logs), and the
>>>>> same is replicated to the other replicas.
>>>>>
>>>>> On Tue, 20 Oct 2020 at 19:15, TomK <tomk...@mdevsys.com> wrote:
>>>>>
>>>>>> Hi Sanjeev,
>>>>>>
>>>>>> Thank you. It does help.
>>>>>>
>>>>>> At what points is the checksum calculated?
>>>>>>
>>>>>> Thx,
>>>>>> TK
>>>>>>
>>>>>> On 10/20/2020 3:03 AM, संजीव (Sanjeev Tripurari) wrote:
>>>>>>
>>>>>> For missing blocks and corrupted blocks, do check that all the
>>>>>> datanode services are up, that all of the disks where hdfs data is
>>>>>> stored are accessible and have no issues, and that the hosts are
>>>>>> reachable from the namenode.
>>>>>>
>>>>>> If you are able to re-generate the data and write it, great; otherwise
>>>>>> hadoop cannot correct it by itself.
>>>>>>
>>>>>> Could you please elaborate on this? Does it mean I have to
>>>>>> continuously access a file for HDFS to be able to detect corrupt
>>>>>> blocks and correct itself?
>>>>>>
>>>>>> *"Does HDFS check that the data node is up, data disk is mounted,
>>>>>> path to the file exists and file can be read?"*
>>>>>> -- yes, only after it fails will it say missing blocks.
>>>>>>
>>>>>> *Or does it also do a filesystem check on that data disk as well as
>>>>>> perhaps a checksum to ensure block integrity?*
>>>>>> -- yes, every file checksum is maintained and cross checked; if it
>>>>>> fails, it will say corrupted blocks.
>>>>>>
>>>>>> hope this helps.
>>>>>>
>>>>>> -Sanjeev
>>>>>>
>>>>>> On Tue, 20 Oct 2020 at 09:52, TomK <tomk...@mdevsys.com> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> HDFS Missing Blocks / Corrupt Blocks Logic: What are the specific
>>>>>>> checks done to determine a block is bad and needs to be replicated?
>>>>>>>
>>>>>>> Does HDFS check that the data node is up, data disk is mounted, path
>>>>>>> to the file exists and file can be read?
>>>>>>>
>>>>>>> Or does it also do a filesystem check on that data disk as well as
>>>>>>> perhaps a checksum to ensure block integrity?
>>>>>>>
>>>>>>> I've googled on this quite a bit. I don't see the exact answer I'm
>>>>>>> looking for. I would like to know exactly what happens during file
>>>>>>> integrity verification that then constitutes missing blocks or
>>>>>>> corrupt blocks in the reports.
>>>>>>>
>>>>>>> --
>>>>>>> Thank You,
>>>>>>> TK.
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: user-unsubscr...@hadoop.apache.org
>>>>>>> For additional commands, e-mail: user-h...@hadoop.apache.org
>>>>>>>
>>>>>>
>>>>> --
>>>>> Thx,
>>>>> TK.
>>>>
>>> --
>>> Thx,
>>> TK.
>>
>> --
>> Thx,
>> TK.
>
> --
> Thx,
> TK.
>
> --
> Thx,
> TK.
>
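
PS: to see what the namenode currently counts as missing or corrupt blocks,
these may help (a rough sketch; run them from any host with an HDFS client
and sufficient privileges):

    # Cluster-wide block summary, including missing blocks and blocks with
    # corrupt replicas, as reported by the namenode
    hdfs dfsadmin -report

    # List the files the namenode considers to have corrupt blocks
    hdfs fsck / -list-corruptfileblocks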