In our HA setup, the active NameNode crashes about once a week. The
cluster is mostly idle, with few jobs running and little user activity.
Below are the logs from the journal nodes. Can someone please help us with this?
2015-08-04 13:00:20,054 INFO server.Journal (Journal.java:updateLastPromisedEpoch(315)) - Updating lastPromisedEpoch from 9 to 10 for client /172.26.44.133
2015-08-04 13:00:20,175 INFO server.Journal (Journal.java:scanStorageForLatestEdits(188)) - Scanning storage FileJournalManager(root=/hadoop/hdfs/journal/HDPPROD)
2015-08-04 13:00:20,220 INFO server.Journal (Journal.java:scanStorageForLatestEdits(194)) - Latest log is EditLogFile(file=/hadoop/hdfs/journal/HDPPROD/current/edits_inprogress_0000000000000523903,first=0000000000000523903,last=0000000000000523925,inProgress=true,hasCorruptHeader=false)
2015-08-04 13:00:20,891 INFO server.Journal (Journal.java:getSegmentInfo(687)) - getSegmentInfo(523903): EditLogFile(file=/hadoop/hdfs/journal/HDPPROD/current/edits_inprogress_0000000000000523903,first=0000000000000523903,last=0000000000000523925,inProgress=true,hasCorruptHeader=false) -> startTxId: 523903 endTxId: 523925 isInProgress: true
2015-08-04 13:00:20,891 INFO server.Journal (Journal.java:prepareRecovery(731)) - Prepared recovery for segment 523903: segmentState { startTxId: 523903 endTxId: 523925 isInProgress: true } lastWriterEpoch: 9 lastCommittedTxId: 523924
2015-08-04 13:00:20,956 INFO server.Journal (Journal.java:getSegmentInfo(687)) - getSegmentInfo(523903): EditLogFile(file=/hadoop/hdfs/journal/HDPPROD/current/edits_inprogress_0000000000000523903,first=0000000000000523903,last=0000000000000523925,inProgress=true,hasCorruptHeader=false) -> startTxId: 523903 endTxId: 523925 isInProgress: true
2015-08-04 13:00:20,956 INFO server.Journal (Journal.java:acceptRecovery(817)) - Skipping download of log startTxId: 523903 endTxId: 523925 isInProgress: true: already have up-to-date logs
2015-08-04 13:00:20,989 INFO server.Journal (Journal.java:acceptRecovery(850)) - Accepted recovery for segment 523903: segmentState { startTxId: 523903 endTxId: 523925 isInProgress: true } acceptedInEpoch: 10
2015-08-04 13:00:21,791 INFO server.Journal (Journal.java:finalizeLogSegment(584)) - Validating log segment /hadoop/hdfs/journal/HDPPROD/current/edits_inprogress_0000000000000523903 about to be finalized
2015-08-04 13:00:21,805 INFO namenode.FileJournalManager (FileJournalManager.java:finalizeLogSegment(133)) - Finalizing edits file /hadoop/hdfs/journal/HDPPROD/current/edits_inprogress_0000000000000523903 -> /hadoop/hdfs/journal/HDPPROD/current/edits_0000000000000523903-0000000000000523925
2015-08-04 13:00:22,257 INFO server.Journal (Journal.java:startLogSegment(532)) - Updating lastWriterEpoch from 9 to 10 for client /172.26.44.133
2015-08-04 13:00:23,699 INFO ipc.Server (Server.java:run(2060)) - IPC Server handler 4 on 8485, call org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocol.journal from 172.26.44.135:43678 Call#304302 Retry#0
java.io.IOException: IPC's epoch 9 is less than the last promised epoch 10
    at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:414)
    at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:442)
    at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:342)
    at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:148)
    at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
    at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
2015-08-06 19:13:14,012 INFO httpclient.HttpMethodDirector (HttpMethodDirector.java:executeWithRetry(439)) - I/O exception (org.apache.commons.httpclient.NoHttpResponseException) caught when processing request: The server az-easthdpmnp02.metclouduseast.com failed to respond
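
For anyone reading the log above: the rejected call at 13:00:23 appears to be the journal node's epoch check. The node promised epoch 10 to the writer at 172.26.44.133 and then refused a write from 172.26.44.135, which was still using epoch 9. Below is a minimal sketch of that check in Python, assuming a much simplified model. The class and method names (JournalSketch, new_epoch, journal) are hypothetical and are not Hadoop's API; only the error message text is taken from the log.

```python
class JournalSketch:
    """Hypothetical, simplified model of a journal node's epoch check."""

    def __init__(self, last_promised_epoch: int) -> None:
        # Highest epoch this node has promised to any writer so far.
        self.last_promised_epoch = last_promised_epoch

    def new_epoch(self, epoch: int) -> None:
        # Assumption: a new writer must propose an epoch strictly higher
        # than anything promised before (9 -> 10 in the log above).
        if epoch <= self.last_promised_epoch:
            raise OSError(
                f"Proposed epoch {epoch} <= last promise {self.last_promised_epoch}"
            )
        self.last_promised_epoch = epoch

    def journal(self, writer_epoch: int) -> None:
        # A write from a writer holding an older epoch is rejected,
        # mirroring the IOException message in the stack trace.
        if writer_epoch < self.last_promised_epoch:
            raise OSError(
                f"IPC's epoch {writer_epoch} is less than the last "
                f"promised epoch {self.last_promised_epoch}"
            )
        # A real journal node would append the edits here.


# Replay the sequence seen in the log.
jn = JournalSketch(last_promised_epoch=9)
jn.new_epoch(10)      # writer at .133 takes over with epoch 10
try:
    jn.journal(9)     # writer at .135 still sends epoch 9
    message = ""
except OSError as err:
    message = str(err)
print(message)
```

If this reading is right, the exception is a symptom of a failover that already happened around 13:00:20 rather than its cause, so the NameNode logs from just before that time are probably the more useful place to look.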
Thank you
Suresh.