Hi Omprakash!

This is *not* OK. Please go through the datanode logs of the inactive
datanode and figure out why it's inactive. If you set dfs.replication to 2,
at least that many datanodes (and ideally a LOT more) should be active and
participating in the cluster.
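
To confirm which datanode the Namenode considers dead and to start digging,
something like this should help (a sketch; the -dead/-live flags exist on
recent 2.x releases, and the log path below is the usual default, which may
differ in your install):

  hdfs dfsadmin -report -dead
  hdfs dfsadmin -report -live

  # on the inactive datanode host, check the end of its log for the reason it stopped
  tail -n 200 $HADOOP_HOME/logs/hadoop-*-datanode-*.log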

Do you have the hdfs-site.xml you posted to the mailing list on all the
nodes (including the Namenode)? Was the file containing block
*blk_1074074104_337394* created while the cluster was still misconfigured
with dfs.replication=3? You can find out which file the block belongs to
with this command:

hdfs fsck -blockId blk_1074074104

Once you have the file, you can set its replication with:

hdfs dfs -setrep 2 <Filename>

I'm guessing you probably have a lot of files with this problem, in which
case you can set it on / instead (this overwrites the replication factor on
every file in the filesystem).
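
For example (a sketch; -w makes the command wait until the target
replication is actually reached, which may take a long time with this many
blocks):

  hdfs dfs -setrep -w 2 /

  # afterwards, verify that the under-replicated count is dropping
  hdfs fsck /

Note that new files are still written with whatever dfs.replication the
*client* is configured with, so make sure the hdfs-site.xml on the machine
doing the writes also says 2.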

If the data on this cluster is important, I would be very worried about the
condition it's in.

HTH
Ravi

On Mon, Jun 26, 2017 at 11:22 PM, omprakash <ompraka...@cdac.in> wrote:

> Hi all,
>
>
>
> I started HDFS in DEBUG mode. After examining the logs I found the lines
> below, which say that the required replication factor is 3 (as against the
> specified *dfs.replication=2*).
>
>
>
> *DEBUG BlockStateChange: BLOCK* NameSystem.UnderReplicationBlock.add:
> blk_1074074104_337394 has only 1 replicas and need 3 replicas so is added
> to neededReplications at priority level 0*
>
>
>
> *P.S.: I have 1 datanode active out of 2.*
>
>
>
> I can also see from the Namenode UI that the number of under-replicated
> blocks is growing.
>
>
>
> Any idea? Or is this OK?
>
>
>
> regards
>
>
>
>
>
> *From:* omprakash [mailto:ompraka...@cdac.in]
> *Sent:* 23 June 2017 11:02
> *To:* 'Ravi Prakash' <ravihad...@gmail.com>; 'Arpit Agarwal' <
> aagar...@hortonworks.com>
> *Cc:* 'user' <user@hadoop.apache.org>
> *Subject:* RE: Lots of warning messages and exception in namenode logs
>
>
>
> Hi Arpit,
>
>
>
> I will enable the settings as suggested and will post the results.
>
>
>
> I am just curious about setting the *Namenode service RPC port*. I have
> checked the *hdfs-site.xml* properties and *dfs.namenode.rpc-address* is
> already set, which would also be the default for the service RPC port. Does
> specifying a separate port have any advantage over the default one?
>
>
>
> Regarding the JvmPauseMonitor error, there are 5-6 instances of it in the
> namenode logs. Here is one of them.
>
>
>
> How do I decide the heap size in such cases, given that I have 4 GB of RAM
> on the namenode VM?
>
>
>
> *@Ravi* Since the files are very small, I have only configured a VM with
> 20 GB of space. The additional disk is a simple SATA disk, not an SSD.
>
>
>
> As I can see from the Namenode UI, more than 50% of the blocks are
> under-replicated: I now have 400K blocks, of which 200K are under-replicated.
>
> I will post the results again after changing the value of 
> *dfs.namenode.replication.work.multiplier.per.iteration*
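>
> (For reference, a sketch of the hdfs-site.xml entry; the value 10 here is
> just an example, the default is 2:)
>
>   <property>
>     <name>dfs.namenode.replication.work.multiplier.per.iteration</name>
>     <value>10</value>
>   </property>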
>
>
>
>
>
> Thanks
>
> Om Prakash
>
>
>
> *From:* Ravi Prakash [mailto:ravihad...@gmail.com]
> *Sent:* 22 June 2017 23:04
> *To:* Arpit Agarwal <aagar...@hortonworks.com>
> *Cc:* omprakash <ompraka...@cdac.in>; user <user@hadoop.apache.org>
>
> *Subject:* Re: Lots of warning messages and exception in namenode logs
>
>
>
> Hi Omprakash!
>
> How big are your disks? Just 20 GB? Just out of curiosity, are these SSDs?
>
> In addition to Arpit's reply, I'm also concerned with the number of
> under-replicated blocks you have: Under replicated blocks: 141863
>
> When there are fewer replicas for a block than there are supposed to be
> (in your case e.g. when there's 1 replica when there ought to be 2), the
> namenode will order the datanodes to create more replicas. The rate at
> which it does this is controlled by
> dfs.namenode.replication.work.multiplier.per.iteration. Given you have
> only 2 datanodes, you'll only be re-replicating 4 blocks every 3 seconds.
> So, it will take quite a while to re-replicate all the blocks.
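>
> A rough back-of-the-envelope, assuming the default multiplier of 2 (which is
> where the 4 blocks per 3 seconds comes from):
>
>   2 datanodes x 2 (multiplier)    = 4 blocks scheduled per 3-second interval
>   141863 blocks / 4 per interval  = ~35,500 intervals  = ~106,000 s  = ~30 hours
>
> so even in the best case you are looking at more than a day of catch-up
> re-replication.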
>
> Also, please know that you want files to be much bigger than 1 KB. Ideally
> you'd have a couple of blocks (the default block size is 128 MB) per file.
> You should append to existing files rather than create new ones when they
> are this small.
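>
> For example (a sketch with hypothetical file names):
>
>   # instead of writing a brand-new 1 KB file every second...
>   hdfs dfs -put record-000001.json /data/incoming/record-000001.json
>
>   # ...append the small records to one growing file
>   hdfs dfs -appendToFile record-000001.json /data/incoming/current.log
>
> Batching many small records into one larger file keeps the file and block
> count (and therefore the Namenode's memory footprint) under control.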
>
> Please do let us know how things turn out.
>
> Cheers,
>
> Ravi
>
>
>
> On Wed, Jun 21, 2017 at 11:23 PM, Arpit Agarwal <aagar...@hortonworks.com>
> wrote:
>
> Hi Omprakash,
>
>
>
> Your description suggests DataNodes cannot send timely reports to the
> NameNode. You can check it by looking for ‘stale’ DataNodes in the NN web
> UI when this situation is occurring. A few ideas:
>
>
>
>    - Try increasing the NameNode RPC handler count a bit (set
>    dfs.namenode.handler.count to 20 in hdfs-site.xml).
>    - Enable the NameNode service RPC port. This requires downtime and
>    reformatting the ZKFC znode. (A sample hdfs-site.xml snippet for these
>    two settings follows the log excerpt below.)
>    - Search for JvmPauseMonitor messages in your service logs. If you see
>    any, try increasing JVM heap for that service.
>    - Enable debug logging as suggested here:
>
>
>
> *2017-06-21 12:11:30,626 WARN
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed
> to place enough replicas, still in need of 1 to reach 2
> (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7,
> storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]},
> newBlock=true) For more information, please enable DEBUG log level on
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy and 
> org.apache.hadoop.net.NetworkTopology*
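>
> For the first two items above, a minimal hdfs-site.xml sketch (the service
> RPC port 8040 is only a placeholder; use the same hosts as your existing
> dfs.namenode.rpc-address.* entries and set it for both nn1 and nn2):
>
>   <property>
>     <name>dfs.namenode.handler.count</name>
>     <value>20</value>
>   </property>
>   <property>
>     <name>dfs.namenode.servicerpc-address.hdfsCluster.nn1</name>
>     <value>node1:8040</value>
>   </property>
>   <property>
>     <name>dfs.namenode.servicerpc-address.hdfsCluster.nn2</name>
>     <value>node2:8040</value>
>   </property>
>
> The DEBUG log level can also be flipped at runtime without a restart, e.g.:
>
>   hadoop daemonlog -setlevel node1:50070 org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy DEBUG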
>
>
>
>
>
> *From: *omprakash <ompraka...@cdac.in>
> *Date: *Wednesday, June 21, 2017 at 9:23 PM
> *To: *'Ravi Prakash' <ravihad...@gmail.com>
> *Cc: *'user' <user@hadoop.apache.org>
> *Subject: *RE: Lots of warning messages and exception in namenode logs
>
>
>
> Hi Ravi,
>
>
>
> Pasting below my core-site and hdfs-site configurations. I have kept a bare
> minimal configuration for my cluster. The cluster started fine and I was able
> to put a couple of hundred thousand files on HDFS, but when I checked the
> logs there were errors/exceptions. After a restart of the datanodes they work
> well for a few thousand files, but then the same problem occurs again. No
> idea what is wrong.
>
>
>
> *PS: I am pumping 1 file per second into HDFS, each approximately 1 KB in size.*
>
>
>
> I thought it might be due to a space quota on the datanodes, but here is the
> output of *hdfs dfsadmin -report*. It looks fine to me.
>
>
>
> $ hdfs dfsadmin -report
>
>
>
> Configured Capacity: 42005069824 (39.12 GB)
>
> Present Capacity: 38085839568 (35.47 GB)
>
> DFS Remaining: 34949058560 (32.55 GB)
>
> DFS Used: 3136781008 (2.92 GB)
>
> DFS Used%: 8.24%
>
> Under replicated blocks: 141863
>
> Blocks with corrupt replicas: 0
>
> Missing blocks: 0
>
> Missing blocks (with replication factor 1): 0
>
> Pending deletion blocks: 0
>
>
>
> -------------------------------------------------
>
> Live datanodes (2):
>
>
>
> Name: 192.168.9.174:50010 (node5)
>
> Hostname: node5
>
> Decommission Status : Normal
>
> Configured Capacity: 21002534912 (19.56 GB)
>
> DFS Used: 1764211024 (1.64 GB)
>
> Non DFS Used: 811509424 (773.92 MB)
>
> DFS Remaining: 17067913216 (15.90 GB)
>
> DFS Used%: 8.40%
>
> DFS Remaining%: 81.27%
>
> Configured Cache Capacity: 0 (0 B)
>
> Cache Used: 0 (0 B)
>
> Cache Remaining: 0 (0 B)
>
> Cache Used%: 100.00%
>
> Cache Remaining%: 0.00%
>
> Xceivers: 2
>
> Last contact: Wed Jun 21 14:38:17 IST 2017
>
>
>
>
>
> Name: 192.168.9.225:50010 (node4)
>
> Hostname: node5
>
> Decommission Status : Normal
>
> Configured Capacity: 21002534912 (19.56 GB)
>
> DFS Used: 1372569984 (1.28 GB)
>
> Non DFS Used: 658353792 (627.86 MB)
>
> DFS Remaining: 17881145344 (16.65 GB)
>
> DFS Used%: 6.54%
>
> DFS Remaining%: 85.14%
>
> Configured Cache Capacity: 0 (0 B)
>
> Cache Used: 0 (0 B)
>
> Cache Remaining: 0 (0 B)
>
> Cache Used%: 100.00%
>
> Cache Remaining%: 0.00%
>
> Xceivers: 1
>
> Last contact: Wed Jun 21 14:38:19 IST 2017
>
>
>
> *core-site.xml*
>
> <?xml version="1.0" encoding="UTF-8"?>
>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
> <configuration>
>
> <property>
>
>   <name>fs.defaultFS</name>
>
>   <value>hdfs://hdfsCluster</value>
>
> </property>
>
> <property>
>
>   <name>dfs.journalnode.edits.dir</name>
>
>   <value>/mnt/hadoopData/hadoop/journal/node/local/data</value>
>
> </property>
>
> </configuration>
>
>
>
> *hdfs-site.xml*
>
> <?xml version="1.0" encoding="UTF-8"?>
>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
> <configuration>
>
> *<property>*
>
> *<name>dfs.replication</name>*
>
> *<value>2</value>*
>
> *</property>*
>
> <property>
>
>   <name>dfs.name.dir</name>
>
>     <value>file:///mnt/hadoopData/hadoop/hdfs/namenode</value>
>
> </property>
>
> <property>
>
>   <name>dfs.data.dir</name>
>
>     <value>file:///mnt/hadoopData/hadoop/hdfs/datanode</value>
>
> </property>
>
> <property>
>
> <name>dfs.nameservices</name>
>
> <value>hdfsCluster</value>
>
> </property>
>
> <property>
>
>   <name>dfs.ha.namenodes.hdfsCluster</name>
>
>   <value>nn1,nn2</value>
>
> </property>
>
>
>
> <property>
>
>   <name>dfs.namenode.rpc-address.hdfsCluster.nn1</name>
>
>   <value>node1:8020</value>
>
> </property>
>
> <property>
>
>   <name>dfs.namenode.rpc-address.hdfsCluster.nn2</name>
>
>   <value>node22:8020</value>
>
> </property>
>
>
>
> <property>
>
>   <name>dfs.namenode.http-address.hdfsCluster.nn1</name>
>
>   <value>node1:50070</value>
>
> </property>
>
> <property>
>
>   <name>dfs.namenode.http-address.hdfsCluster.nn2</name>
>
>   <value>node2:50070</value>
>
> </property>
>
>
>
> <property>
>
>   <name>dfs.namenode.shared.edits.dir</name>
>
>   <value>qjournal://node1:8485;node2:8485;node3:8485;node4:
> 8485;node5:8485/hdfsCluster</value>
>
> </property>
>
> <property>
>
>   <name>dfs.client.failover.proxy.provider.hdfsCluster</name>
>
>   <value>org.apache.hadoop.hdfs.server.namenode.ha.
> ConfiguredFailoverProxyProvider</value>
>
> </property>
>
> <property>
>
>    <name>ha.zookeeper.quorum</name>
>
>    <value>node1:2181,node2:2181,node3:2181,node4:2181,node5:2181</value>
>
> </property>
>
> <property>
>
> <name>dfs.ha.fencing.methods</name>
>
> <value>sshfence</value>
>
> </property>
>
> <property>
>
> <name>dfs.ha.fencing.ssh.private-key-files</name>
>
> <value>/home/hadoop/.ssh/id_rsa</value>
>
> </property>
>
> <property>
>
>    <name>dfs.ha.automatic-failover.enabled</name>
>
>    <value>true</value>
>
> </property>
>
> </configuration>
>
>
>
>
>
> *From:* Ravi Prakash [mailto:ravihad...@gmail.com]
> *Sent:* 22 June 2017 02:38
> *To:* omprakash <ompraka...@cdac.in>
> *Cc:* user <user@hadoop.apache.org>
> *Subject:* Re: Lots of warning messages and exception in namenode logs
>
>
>
> Hi Omprakash!
>
> What is your default replication set to? What kind of disks do your
> datanodes have? Were you able to start a cluster with a simple
> configuration before you started tuning it?
>
> HDFS tries to create the default number of replicas for a block on
> different datanodes. The Namenode tries to give a list of datanodes that
> the client can write replicas of the block to. If the Namenode is not able
> to construct a list with adequate number of datanodes, you will see the
> message you are seeing. This may mean that datanodes are unhealthy (failed
> disks), full (disks have no more space), being decommissioned (HDFS will
> not write replicas on decommissioning datanodes), or misconfigured (I'd
> suggest turning on storage policies only after a simple configuration works).
>
> When a client that was writing a file is killed (e.g. if you killed your MR
> job), the Namenode will try to recover the file's lease after some time (when
> the hard limit expires). In your case the namenode is also not able to find
> enough datanodes to recover the files.
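>
> If any of those half-written files stay stuck, you can usually trigger lease
> recovery by hand (a sketch, using a path from your log):
>
>   hdfs debug recoverLease -path /user/hadoop/2106201707/02d5adda-d90f-47cb-85d5-999a079f4d79 -retries 3
>
> Note that it will keep failing with the same "Committed blocks are waiting to
> be minimally replicated" message until enough datanodes are available again.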
>
>
>
> HTH
>
> Ravi
>
>
>
>
>
> On Tue, Jun 20, 2017 at 11:50 PM, omprakash <ompraka...@cdac.in> wrote:
>
> Hi,
>
>
>
> I am receiving lots of *warning messages in the namenode logs* on the ACTIVE
> NN in my *HA Hadoop setup*. Below are the logs:
>
>
>
> *“2017-06-21 12:11:26,523 WARN
> org.apache.hadoop.hdfs.protocol.BlockStoragePolicy: Failed to place enough
> replicas: expected size is 1 but only 0 storage types can be selected
> (replication=2, selected=[], unavailable=[DISK], removed=[DISK],
> policy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[],
> replicationFallbacks=[ARCHIVE]})*
>
> *2017-06-21 12:11:26,523 WARN
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed
> to place enough replicas, still in need of 1 to reach 2
> (unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7,
> storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]},
> newBlock=true) All required storage types are unavailable:
> unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7,
> storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}*
>
> *2017-06-21 12:11:26,523 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
> allocate blk_1073894332_153508, replicas=192.168.9.174:50010 for /36962._COPYING_*
>
> *2017-06-21 12:11:26,810 INFO org.apache.hadoop.hdfs.StateChange: DIR*
> completeFile: /36962._COPYING_ is closed by
> DFSClient_NONMAPREDUCE_146762699_1*
>
> *2017-06-21 12:11:30,626 WARN
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed
> to place enough replicas, still in need of 1 to reach 2
> (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7,
> storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]},
> newBlock=true) For more information, please enable DEBUG log level on
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy and
> org.apache.hadoop.net.NetworkTopology*
>
> *2017-06-21 12:11:30,626 WARN
> org.apache.hadoop.hdfs.protocol.BlockStoragePolicy: Failed to place enough
> replicas: expected size is 1 but only 0 storage types can be selected
> (replication=2, selected=[], unavailable=[DISK], removed=[DISK],
> policy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[],
> replicationFallbacks=[ARCHIVE]})”*
>
>
>
> I am also encountering exceptions in active namenode related to
> LeaseManager
>
>
>
> *2017-06-21 12:13:16,706 INFO
> org.apache.hadoop.hdfs.server.namenode.LeaseManager: [Lease.  Holder:
> DFSClient_NONMAPREDUCE_409197282_362092, pending creates: 1] has expired
> hard limit*
>
> *2017-06-21 12:13:16,706 INFO
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease.
> Holder: DFSClient_NONMAPREDUCE_409197282_362092, pending creates: 1],
> src=/user/hadoop/2106201707/02d5adda-d90f-47cb-85d5-999a079f4d79*
>
> *2017-06-21 12:13:16,706 WARN org.apache.hadoop.hdfs.StateChange: DIR*
> NameSystem.internalReleaseLease: Failed to release lease for file
> /user/hadoop/2106201707/02d5adda-d90f-47cb-85d5-999a079f4d79.
> Committed blocks are waiting to be minimally replicated. Try again later.*
>
> *2017-06-21 12:13:16,706 ERROR
> org.apache.hadoop.hdfs.server.namenode.LeaseManager: Cannot release the
> path /user/hadoop/2106201707/02d5adda-d90f-47cb-85d5-999a079f4d79
> in the lease [Lease.  Holder: DFSClient_NONMAPREDUCE_409197282_362092,
> pending creates: 1]*
>
> *org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: DIR*
> NameSystem.internalReleaseLease: Failed to release lease for file
> /user/hadoop/2106201707/02d5adda-d90f-47cb-85d5-999a079f4d79.
> Committed blocks are waiting to be minimally replicated. Try again later.*
>
> *        at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.internalReleaseLease(FSNamesystem.java:3200)*
>
> *        at
> org.apache.hadoop.hdfs.server.namenode.LeaseManager.checkLeases(LeaseManager.java:383)*
>
> *        at
> org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor.run(LeaseManager.java:329)*
>
> *        at java.lang.Thread.run(Thread.java:745)*
>
>
>
> I have checked the two datanodes. Both are running and have enough space
> for new data.
>
>
>
> *PS: I have 2 Namenodes and 2 datanodes in a Hadoop HA setup. HA is set up
> using the Quorum Journal Manager and ZooKeeper.*
>
>
>
> Any idea why these errors occur?
>
>
>
> *Regards*
>
> *Omprakash Paliwal*
>
> HPC-Medical and Bioinformatics Applications Group
>
> Centre for Development of Advanced Computing (C-DAC)
>
> Pune University campus,
>
> PUNE-411007
>
> Maharashtra, India
>
> email:ompraka...@cdac.in
>
> Contact : +91-20-25704231
>
>
>
>
>
>
>
>
>
>
>
>
