From what I heard, the reporting of CORRUPT for WAL-related files was a false alarm.
There is no evidence that HBase 1.1 produces corrupt WAL files.

Cheers

On Fri, Aug 7, 2015 at 7:59 PM, James Estes <[email protected]> wrote:

> There is this
>
> http://mail-archives.apache.org/mod_mbox/hbase-user/201507.mbox/%3CCAE8tVdmyUfG%2BajK0gvMG_tLjoStZ0HjrQxJuuJzQ3Z%2B4vbzSuQ%40mail.gmail.com%3E
>
> Which points to
> https://issues.apache.org/jira/browse/HDFS-8809
>
> But (at least for us) this hasn't led to region servers crashing...
> though I'm definitely interested in what issues it may be able to cause.
>
> James
>
> On Fri, Aug 7, 2015 at 11:05 AM, Ted Yu <[email protected]> wrote:
> > Some WAL related files were marked corrupt.
> >
> > Can you try repairing them?
> >
> > Please check the namenode log.
> > Search HDFS JIRA for any pending fix - I haven't tracked HDFS movement
> > closely recently.
> >
> > Thanks
> >
> > On Fri, Aug 7, 2015 at 7:54 AM, Adrià Vilà <[email protected]> wrote:
> >
> >> About the logs attached in this conversation: only the w-0 and w-1 nodes
> >> had failed, first w-0 and then w-1.
> >> 10.240.187.182 = w-2
> >> w-0 internal IP address is 10.240.164.0
> >> w-1 IP is 10.240.2.235
> >> m IP is 10.240.200.196
> >>
> >> FSCK (hadoop fsck / | egrep -v '^\.+$' | grep -v eplica) output:
> >> -
> >> Connecting to namenode via http://hdp-m.c.dks-hadoop.internal:50070/fsck?ugi=root&path=%2F
> >> FSCK started by root (auth:SIMPLE) from /10.240.200.196 for path / at Fri Aug 07 14:51:22 UTC 2015
> >> /apps/hbase/data/WALs/hdp-w-0.c.dks-hadoop.internal,16020,1438946915810-splitting/hdp-w-0.c.dks-hadoop.internal%2C16020%2C1438946915810..meta.1438950914376.meta: MISSING 1 blocks of total size 90 B......
> >> /apps/hbase/data/WALs/hdp-w-1.c.dks-hadoop.internal,16020,1438959061234/hdp-w-1.c.dks-hadoop.internal%2C16020%2C1438959061234.default.1438959069800: MISSING 1 blocks of total size 90 B...
> >> /apps/hbase/data/WALs/hdp-w-2.c.dks-hadoop.internal,16020,1438959056208/hdp-w-2.c.dks-hadoop.internal%2C16020%2C1438959056208..meta.1438959068352.meta: MISSING 1 blocks of total size 90 B.
> >> /apps/hbase/data/WALs/hdp-w-2.c.dks-hadoop.internal,16020,1438959056208/hdp-w-2.c.dks-hadoop.internal%2C16020%2C1438959056208.default.1438959061922: MISSING 1 blocks of total size 90 B...........................
> >>
> >> .........Status: CORRUPT
> >> Total size: 54919712019 B (Total open files size: 360 B)
> >> Total dirs: 1709
> >> Total files: 2628
> >> Total symlinks: 0 (Files currently being written: 6)
> >> Total blocks (validated): 2692 (avg. block size 20401081 B) (Total open file blocks (not validated): 4)
> >> ********************************
> >> UNDER MIN REPL'D BLOCKS: 4 (0.1485884 %)
> >> CORRUPT FILES: 4
> >> MISSING BLOCKS: 4
> >> MISSING SIZE: 360 B
> >> ********************************
> >> Corrupt blocks: 0
> >> Number of data-nodes: 4
> >> Number of racks: 1
> >> FSCK ended at Fri Aug 07 14:51:26 UTC 2015 in 4511 milliseconds
> >>
> >> The filesystem under path '/' is CORRUPT
> >> -
> >>
> >> Thank you for your time.
> >>
> >> From: "Ted Yu" <[email protected]>
> >> Sent: Friday, August 7, 2015 16:07
> >> To: "[email protected]" <[email protected]>, [email protected]
> >> Subject: Re: RegionServers shutdown randomly
> >>
> >> Does 10.240.187.182 correspond with w-0 or m?
> >>
> >> Looks like hdfs was intermittently unstable.
> >> Have you run fsck?
> >>
> >> Cheers
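For anyone hitting the same report: the four files flagged by fsck are all WAL files with a single 90 B "missing" block and the report also counts 6 files currently being written, which is consistent with the "false alarm" reading at the top of the thread (open-for-write WALs, not real data loss). A minimal sketch of how one might verify that and, if an abandoned WAL stays stuck open, recover its lease; WAL_FILE is a placeholder for one of the paths printed by fsck, and the retry count is an assumption:

  # Confirm the "corrupt" files are simply still open for write
  hdfs fsck / -openforwrite -files -blocks | grep -i wal

  # If a WAL left behind (e.g. under a *-splitting directory) never gets closed,
  # ask the NameNode to recover its lease (WAL_FILE = one of the paths above)
  hdfs debug recoverLease -path "$WAL_FILE" -retries 5

  # Last resort only: sideline the file into /lost+found so fsck stops flagging it
  # hdfs fsck "$WAL_FILE" -move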
> >> On Fri, Aug 7, 2015 at 12:59 AM, Adrià Vilà <[email protected]> wrote:
> >>>
> >>> Hello,
> >>>
> >>> HBase RegionServers fail once in a while:
> >>> - it can be any regionserver, not always the same one
> >>> - it can happen when all the cluster is idle (at least not executing any human-launched task)
> >>> - it can happen at any time, not always the same time
> >>>
> >>> The cluster versions:
> >>> - Phoenix 4.4 (or 4.5)
> >>> - HBase 1.1.1
> >>> - Hadoop/HDFS 2.7.1
> >>> - Zookeeper 3.4.6
> >>>
> >>> Some configs:
> >>> - ulimit -a
> >>>   core file size (blocks, -c) 0
> >>>   data seg size (kbytes, -d) unlimited
> >>>   scheduling priority (-e) 0
> >>>   file size (blocks, -f) unlimited
> >>>   pending signals (-i) 103227
> >>>   max locked memory (kbytes, -l) 64
> >>>   max memory size (kbytes, -m) unlimited
> >>>   open files (-n) 1024
> >>>   pipe size (512 bytes, -p) 8
> >>>   POSIX message queues (bytes, -q) 819200
> >>>   real-time priority (-r) 0
> >>>   stack size (kbytes, -s) 10240
> >>>   cpu time (seconds, -t) unlimited
> >>>   max user processes (-u) 103227
> >>>   virtual memory (kbytes, -v) unlimited
> >>>   file locks (-x) unlimited
> >>> - have increased the default timeouts for: hbase rpc, zookeeper session, dfs socket, regionserver lease and client scanner.
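One detail in the ulimit output above that is worth ruling out: open files is still at the default 1024, which the HBase reference guide flags as too low for RegionServer/DataNode hosts and recommends raising substantially (along with nproc). It is not necessarily what is aborting these regionservers, but it is cheap to check. A minimal sketch; the hbase/hdfs usernames and the 32768 value are assumptions, not settings from this thread:

  # Limit actually applied to the running RegionServer (a login shell's ulimit -a
  # does not necessarily reflect what the daemon got)
  cat /proc/$(pgrep -f HRegionServer)/limits | grep -Ei 'open files|processes'

  # Example /etc/security/limits.conf entries; restart the daemons from a fresh
  # login so the new limits take effect
  #   hbase  -  nofile  32768
  #   hdfs   -  nofile  32768
  #   hbase  -  nproc   32768
  #   hdfs   -  nproc   32768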
> >>>
> >>> Next you can find the logs for the master, the regionserver that failed first, another one that failed, and the datanode logs for the master and a worker.
> >>>
> >>> The timing was approximately:
> >>> 14:05 start hbase
> >>> 14:11 w-0 down
> >>> 14:14 w-1 down
> >>> 14:15 stop hbase
> >>>
> >>> -------------
> >>> hbase master log (m)
> >>> -------------
> >>> 2015-08-06 14:11:13,640 ERROR [PriorityRpcServer.handler=19,queue=1,port=16000] master.MasterRpcServices: Region server hdp-w-0.c.dks-hadoop.internal,16020,1438869946905 reported a fatal error:
> >>> ABORTING region server hdp-w-0.c.dks-hadoop.internal,16020,1438869946905: Unrecoverable exception while closing region SYSTEM.SEQUENCE,]\x00\x00\x00,1438013446516.888f017eb1c0557fbe7079b50626c891., still finishing close
> >>> Cause:
> >>> java.io.IOException: All datanodes DatanodeInfoWithStorage[10.240.187.182:50010,DS-8c63ac70-2f98-4084-91ee-a847b4f48ce2,DISK] are bad. Aborting...
> >>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1117)
> >>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:909)
> >>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:412)
> >>>
> >>> --------------
> >>> hbase regionserver log (w-0)
> >>> --------------
> >>> 2015-08-06 14:11:13,611 INFO [PriorityRpcServer.handler=0,queue=0,port=16020] regionserver.RSRpcServices: Close 888f017eb1c0557fbe7079b50626c891, moving to hdp-m.c.dks-hadoop.internal,16020,1438869954062
> >>> 2015-08-06 14:11:13,615 INFO [StoreCloserThread-SYSTEM.SEQUENCE,]\x00\x00\x00,1438013446516.888f017eb1c0557fbe7079b50626c891.-1] regionserver.HStore: Closed 0
> >>> 2015-08-06 14:11:13,616 FATAL [regionserver/hdp-w-0.c.dks-hadoop.internal/10.240.164.0:16020.append-pool1-t1] wal.FSHLog: Could not append. Requesting close of wal
> >>> java.io.IOException: All datanodes DatanodeInfoWithStorage[10.240.187.182:50010,DS-8c63ac70-2f98-4084-91ee-a847b4f48ce2,DISK] are bad. Aborting...
> >>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1117)
> >>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:909)
> >>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:412)
> >>> 2015-08-06 14:11:13,617 ERROR [sync.4] wal.FSHLog: Error syncing, request close of wal
> >>> java.io.IOException: All datanodes DatanodeInfoWithStorage[10.240.187.182:50010,DS-8c63ac70-2f98-4084-91ee-a847b4f48ce2,DISK] are bad. Aborting...
> >>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1117)
> >>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:909)
> >>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:412)
> >>> 2015-08-06 14:11:13,617 FATAL [RS_CLOSE_REGION-hdp-w-0:16020-0] regionserver.HRegionServer: ABORTING region server hdp-w-0.c.dks-hadoop.internal,16020,1438869946905: Unrecoverable exception while closing region SYSTEM.SEQUENCE,]\x00\x00\x00,1438013446516.888f017eb1c0557fbe7079b50626c891., still finishing close
> >>> java.io.IOException: All datanodes DatanodeInfoWithStorage[10.240.187.182:50010,DS-8c63ac70-2f98-4084-91ee-a847b4f48ce2,DISK] are bad. Aborting...
> >>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1117)
> >>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:909)
> >>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:412)
> >>> 2015-08-06 14:11:13,617 FATAL [RS_CLOSE_REGION-hdp-w-0:16020-0] regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: [org.apache.phoenix.coprocessor.ServerCachingEndpointImpl, org.apache.hadoop.hbase.regionserver.LocalIndexSplitter, org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver, org.apache.phoenix.coprocessor.GroupedAggregateRegionObserver, org.apache.phoenix.coprocessor.ScanRegionObserver, org.apache.phoenix.hbase.index.Indexer, org.apache.phoenix.coprocessor.SequenceRegionObserver, org.apache.phoenix.coprocessor.MetaDataEndpointImpl]
> >>> 2015-08-06 14:11:13,627 INFO [RS_CLOSE_REGION-hdp-w-0:16020-0] regionserver.HRegionServer: Dump of metrics as JSON on abort: {
> >>>   "beans" : [ { "name" : "java.lang:type=Memory", "modelerType" : "sun.management.MemoryImpl", "Verbose" : true,
> >>>     "HeapMemoryUsage" : { "committed" : 2104754176, "init" : 2147483648, "max" : 2104754176, "used" : 262288688 },
> >>>     "ObjectPendingFinalizationCount" : 0,
> >>>     "NonHeapMemoryUsage" : { "committed" : 137035776, "init" : 136773632, "max" : 184549376, "used" : 49168288 },
> >>>     "ObjectName" : "java.lang:type=Memory" } ],
> >>>   "beans" : [ { "name" : "Hadoop:service=HBase,name=RegionServer,sub=IPC", "modelerType" : "RegionServer,sub=IPC", "tag.Context" : "regionserver", "tag.Hostname" : "hdp-w-0" } ],
> >>>   "beans" : [ { "name" : "Hadoop:service=HBase,name=RegionServer,sub=Replication", "modelerType" : "RegionServer,sub=Replication", "tag.Context" : "regionserver", "tag.Hostname" : "hdp-w-0" } ],
> >>>   "beans" : [ { "name" : "Hadoop:service=HBase,name=RegionServer,sub=Server", "modelerType" : "RegionServer,sub=Server", "tag.Context" : "regionserver", "tag.Hostname" : "hdp-w-0" } ]
> >>> }
> >>> 2015-08-06 14:11:13,640 ERROR [sync.0] wal.FSHLog: Error syncing, request close of wal
> >>> java.io.IOException: All datanodes DatanodeInfoWithStorage[10.240.187.182:50010,DS-8c63ac70-2f98-4084-91ee-a847b4f48ce2,DISK] are bad. Aborting...
> >>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1117)
> >>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:909)
> >>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:412)
> >>> 2015-08-06 14:11:13,640 WARN [regionserver/hdp-w-0.c.dks-hadoop.internal/10.240.164.0:16020.logRoller] wal.FSHLog: Failed last sync but no outstanding unsync edits so falling through to close; java.io.IOException: All datanodes DatanodeInfoWithStorage[10.240.187.182:50010,DS-8c63ac70-2f98-4084-91ee-a847b4f48ce2,DISK] are bad. Aborting...
> >>> 2015-08-06 14:11:13,641 ERROR [regionserver/hdp-w-0.c.dks-hadoop.internal/10.240.164.0:16020.logRoller] wal.ProtobufLogWriter: Got IOException while writing trailer
> >>> java.io.IOException: All datanodes DatanodeInfoWithStorage[10.240.187.182:50010,DS-8c63ac70-2f98-4084-91ee-a847b4f48ce2,DISK] are bad. Aborting...
> >>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1117)
> >>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:909)
> >>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:412)
> >>> 2015-08-06 14:11:13,641 WARN [regionserver/hdp-w-0.c.dks-hadoop.internal/10.240.164.0:16020.logRoller] wal.FSHLog: Riding over failed WAL close of hdfs://hdp-m.c.dks-hadoop.internal:8020/apps/hbase/data/WALs/hdp-w-0.c.dks-hadoop.internal,16020,1438869946905/hdp-w-0.c.dks-hadoop.internal%2C16020%2C1438869946905.default.1438869949576, cause="All datanodes DatanodeInfoWithStorage[10.240.187.182:50010,DS-8c63ac70-2f98-4084-91ee-a847b4f48ce2,DISK] are bad. Aborting...", errors=1; THIS FILE WAS NOT CLOSED BUT ALL EDITS SYNCED SO SHOULD BE OK
> >>> 2015-08-06 14:11:13,642 INFO [regionserver/hdp-w-0.c.dks-hadoop.internal/10.240.164.0:16020.logRoller] wal.FSHLog: Rolled WAL /apps/hbase/data/WALs/hdp-w-0.c.dks-hadoop.internal,16020,1438869946905/hdp-w-0.c.dks-hadoop.internal%2C16020%2C1438869946905.default.1438869949576 with entries=101, filesize=30.38 KB; new WAL /apps/hbase/data/WALs/hdp-w-0.c.dks-hadoop.internal,16020,1438869946905/hdp-w-0.c.dks-hadoop.internal%2C16020%2C1438869946905.default.1438870273617
> >>> 2015-08-06 14:11:13,643 INFO [RS_CLOSE_REGION-hdp-w-0:16020-0] regionserver.HRegionServer: STOPPED: Unrecoverable exception while closing region SYSTEM.SEQUENCE,]\x00\x00\x00,1438013446516.888f017eb1c0557fbe7079b50626c891., still finishing close
> >>> 2015-08-06 14:11:13,643 INFO [regionserver/hdp-w-0.c.dks-hadoop.internal/10.240.164.0:16020.logRoller] wal.FSHLog: Archiving hdfs://hdp-m.c.dks-hadoop.internal:8020/apps/hbase/data/WALs/hdp-w-0.c.dks-hadoop.internal,16020,1438869946905/hdp-w-0.c.dks-hadoop.internal%2C16020%2C1438869946905.default.1438869949576 to hdfs://hdp-m.c.dks-hadoop.internal:8020/apps/hbase/data/oldWALs/hdp-w-0.c.dks-hadoop.internal%2C16020%2C1438869946905.default.1438869949576
> >>> 2015-08-06 14:11:13,643 ERROR [RS_CLOSE_REGION-hdp-w-0:16020-0] executor.EventHandler: Caught throwable while processing event M_RS_CLOSE_REGION
> >>> java.lang.RuntimeException: java.io.IOException: All datanodes DatanodeInfoWithStorage[10.240.187.182:50010,DS-8c63ac70-2f98-4084-91ee-a847b4f48ce2,DISK] are bad. Aborting...
> >>>   at org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler.process(CloseRegionHandler.java:152)
> >>>   at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
> >>>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >>>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >>>   at java.lang.Thread.run(Thread.java:745)
> >>> Caused by: java.io.IOException: All datanodes DatanodeInfoWithStorage[10.240.187.182:50010,DS-8c63ac70-2f98-4084-91ee-a847b4f48ce2,DISK] are bad. Aborting...
> >>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1117)
> >>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:909)
> >>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:412)
> >>>
> >>> ------------
> >>> hbase regionserver log (w-1)
> >>> ------------
> >>> 2015-08-06 14:11:14,267 INFO [main-EventThread] replication.ReplicationTrackerZKImpl: /hbase-unsecure/rs/hdp-w-0.c.dks-hadoop.internal,16020,1438869946905 znode expired, triggering replicatorRemoved event
> >>> 2015-08-06 14:12:08,203 INFO [ReplicationExecutor-0] replication.ReplicationQueuesZKImpl: Atomically moving hdp-w-0.c.dks-hadoop.internal,16020,1438869946905's wals to my queue
> >>> 2015-08-06 14:12:56,252 INFO [PriorityRpcServer.handler=5,queue=1,port=16020] regionserver.RSRpcServices: Close 918ed7c6568e7500fb434f4268c5bbc5, moving to hdp-m.c.dks-hadoop.internal,16020,1438869954062
> >>> 2015-08-06 14:12:56,260 INFO [StoreCloserThread-SYSTEM.SEQUENCE,\x7F\x00\x00\x00,1438013446516.918ed7c6568e7500fb434f4268c5bbc5.-1] regionserver.HStore: Closed 0
> >>> 2015-08-06 14:12:56,261 FATAL [regionserver/hdp-w-1.c.dks-hadoop.internal/10.240.2.235:16020.append-pool1-t1] wal.FSHLog: Could not append. Requesting close of wal
> >>> java.io.IOException: All datanodes DatanodeInfoWithStorage[10.240.187.182:50010,DS-8c63ac70-2f98-4084-91ee-a847b4f48ce2,DISK] are bad. Aborting...
> >>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1117)
> >>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:909)
> >>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:412)
> >>> 2015-08-06 14:12:56,261 ERROR [sync.3] wal.FSHLog: Error syncing, request close of wal
> >>> java.io.IOException: All datanodes DatanodeInfoWithStorage[10.240.187.182:50010,DS-8c63ac70-2f98-4084-91ee-a847b4f48ce2,DISK] are bad. Aborting...
> >>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1117)
> >>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:909)
> >>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:412)
> >>> 2015-08-06 14:12:56,262 FATAL [RS_CLOSE_REGION-hdp-w-1:16020-0] regionserver.HRegionServer: ABORTING region server hdp-w-1.c.dks-hadoop.internal,16020,1438869946909: Unrecoverable exception while closing region SYSTEM.SEQUENCE,\x7F\x00\x00\x00,1438013446516.918ed7c6568e7500fb434f4268c5bbc5., still finishing close
> >>> java.io.IOException: All datanodes DatanodeInfoWithStorage[10.240.187.182:50010,DS-8c63ac70-2f98-4084-91ee-a847b4f48ce2,DISK] are bad. Aborting...
> >>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1117)
> >>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:909)
> >>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:412)
> >>> 2015-08-06 14:12:56,262 FATAL [RS_CLOSE_REGION-hdp-w-1:16020-0] regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: [org.apache.phoenix.coprocessor.ServerCachingEndpointImpl, org.apache.hadoop.hbase.regionserver.LocalIndexSplitter, org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver, org.apache.phoenix.coprocessor.GroupedAggregateRegionObserver, org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint, org.apache.phoenix.coprocessor.ScanRegionObserver, org.apache.phoenix.hbase.index.Indexer, org.apache.phoenix.coprocessor.SequenceRegionObserver]
> >>> 2015-08-06 14:12:56,281 INFO [RS_CLOSE_REGION-hdp-w-1:16020-0] regionserver.HRegionServer: Dump of metrics as JSON on abort: {
> >>>   "beans" : [ { "name" : "java.lang:type=Memory", "modelerType" : "sun.management.MemoryImpl",
> >>>     "ObjectPendingFinalizationCount" : 0,
> >>>     "NonHeapMemoryUsage" : { "committed" : 137166848, "init" : 136773632, "max" : 184549376, "used" : 48667528 },
> >>>     "HeapMemoryUsage" : { "committed" : 2104754176, "init" : 2147483648, "max" : 2104754176, "used" : 270075472 },
> >>>     "Verbose" : true, "ObjectName" : "java.lang:type=Memory" } ],
> >>>   "beans" : [ { "name" : "Hadoop:service=HBase,name=RegionServer,sub=IPC", "modelerType" : "RegionServer,sub=IPC", "tag.Context" : "regionserver", "tag.Hostname" : "hdp-w-1" } ],
> >>>   "beans" : [ { "name" : "Hadoop:service=HBase,name=RegionServer,sub=Replication", "modelerType" : "RegionServer,sub=Replication", "tag.Context" : "regionserver", "tag.Hostname" : "hdp-w-1" } ],
> >>>   "beans" : [ { "name" : "Hadoop:service=HBase,name=RegionServer,sub=Server", "modelerType" : "RegionServer,sub=Server", "tag.Context" : "regionserver", "tag.Hostname" : "hdp-w-1" } ]
> >>> }
> >>> 2015-08-06 14:12:56,284 ERROR [sync.4] wal.FSHLog: Error syncing, request close of wal
> >>> java.io.IOException: All datanodes DatanodeInfoWithStorage[10.240.187.182:50010,DS-8c63ac70-2f98-4084-91ee-a847b4f48ce2,DISK] are bad. Aborting...
> >>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1117)
> >>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:909)
> >>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:412)
> >>> 2015-08-06 14:12:56,285 WARN [regionserver/hdp-w-1.c.dks-hadoop.internal/10.240.2.235:16020.logRoller] wal.FSHLog: Failed last sync but no outstanding unsync edits so falling through to close; java.io.IOException: All datanodes DatanodeInfoWithStorage[10.240.187.182:50010,DS-8c63ac70-2f98-4084-91ee-a847b4f48ce2,DISK] are bad. Aborting...
> >>> 2015-08-06 14:12:56,285 ERROR [regionserver/hdp-w-1.c.dks-hadoop.internal/10.240.2.235:16020.logRoller] wal.ProtobufLogWriter: Got IOException while writing trailer
> >>> java.io.IOException: All datanodes DatanodeInfoWithStorage[10.240.187.182:50010,DS-8c63ac70-2f98-4084-91ee-a847b4f48ce2,DISK] are bad. Aborting...
> >>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1117)
> >>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:909)
> >>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:412)
> >>> 2015-08-06 14:12:56,285 WARN [regionserver/hdp-w-1.c.dks-hadoop.internal/10.240.2.235:16020.logRoller] wal.FSHLog: Riding over failed WAL close of hdfs://hdp-m.c.dks-hadoop.internal:8020/apps/hbase/data/WALs/hdp-w-1.c.dks-hadoop.internal,16020,1438869946909/hdp-w-1.c.dks-hadoop.internal%2C16020%2C1438869946909.default.1438869950359, cause="All datanodes DatanodeInfoWithStorage[10.240.187.182:50010,DS-8c63ac70-2f98-4084-91ee-a847b4f48ce2,DISK] are bad. Aborting...", errors=1; THIS FILE WAS NOT CLOSED BUT ALL EDITS SYNCED SO SHOULD BE OK
> >>> 2015-08-06 14:12:56,287 INFO [regionserver/hdp-w-1.c.dks-hadoop.internal/10.240.2.235:16020.logRoller] wal.FSHLog: Rolled WAL /apps/hbase/data/WALs/hdp-w-1.c.dks-hadoop.internal,16020,1438869946909/hdp-w-1.c.dks-hadoop.internal%2C16020%2C1438869946909.default.1438869950359 with entries=100, filesize=30.73 KB; new WAL /apps/hbase/data/WALs/hdp-w-1.c.dks-hadoop.internal,16020,1438869946909/hdp-w-1.c.dks-hadoop.internal%2C16020%2C1438869946909.default.1438870376262
> >>> 2015-08-06 14:12:56,288 INFO [regionserver/hdp-w-1.c.dks-hadoop.internal/10.240.2.235:16020.logRoller] wal.FSHLog: Archiving hdfs://hdp-m.c.dks-hadoop.internal:8020/apps/hbase/data/WALs/hdp-w-1.c.dks-hadoop.internal,16020,1438869946909/hdp-w-1.c.dks-hadoop.internal%2C16020%2C1438869946909.default.1438869950359 to hdfs://hdp-m.c.dks-hadoop.internal:8020/apps/hbase/data/oldWALs/hdp-w-1.c.dks-hadoop.internal%2C16020%2C1438869946909.default.1438869950359
> >>> 2015-08-06 14:12:56,315 INFO [RS_CLOSE_REGION-hdp-w-1:16020-0] regionserver.HRegionServer: STOPPED: Unrecoverable exception while closing region SYSTEM.SEQUENCE,\x7F\x00\x00\x00,1438013446516.918ed7c6568e7500fb434f4268c5bbc5., still finishing close
> >>> 2015-08-06 14:12:56,315 INFO [regionserver/hdp-w-1.c.dks-hadoop.internal/10.240.2.235:16020] regionserver.SplitLogWorker: Sending interrupt to stop the worker thread
> >>> 2015-08-06 14:12:56,315 ERROR [RS_CLOSE_REGION-hdp-w-1:16020-0] executor.EventHandler: Caught throwable while processing event M_RS_CLOSE_REGION
> >>> java.lang.RuntimeException: java.io.IOException: All datanodes DatanodeInfoWithStorage[10.240.187.182:50010,DS-8c63ac70-2f98-4084-91ee-a847b4f48ce2,DISK] are bad. Aborting...
> >>>   at org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler.process(CloseRegionHandler.java:152)
> >>>   at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
> >>>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >>>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >>>   at java.lang.Thread.run(Thread.java:745)
> >>> Caused by: java.io.IOException: All datanodes DatanodeInfoWithStorage[10.240.187.182:50010,DS-8c63ac70-2f98-4084-91ee-a847b4f48ce2,DISK] are bad. Aborting...
> >>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1117)
> >>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:909)
> >>>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:412)
> >>>
> >>> -------------
> >>> m datanode log
> >>> -------------
> >>> 2015-07-27 14:11:16,082 INFO datanode.DataNode (BlockReceiver.java:run(1348)) - PacketResponder: BP-369072949-10.240.200.196-1437998325049:blk_1073742677_1857, type=HAS_DOWNSTREAM_IN_PIPELINE terminating
> >>> 2015-07-27 14:11:16,132 INFO datanode.DataNode (DataXceiver.java:writeBlock(655)) - Receiving BP-369072949-10.240.200.196-1437998325049:blk_1073742678_1858 src: /10.240.200.196:56767 dest: /10.240.200.196:50010
> >>> 2015-07-27 14:11:16,155 INFO DataNode.clienttrace (BlockReceiver.java:finalizeBlock(1375)) - src: /10.240.200.196:56767, dest: /10.240.200.196:50010, bytes: 117761, op: HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_177514816_1, offset: 0, srvID: 329bbe62-bcea-4a6d-8c97-e800631deb81, blockid: BP-369072949-10.240.200.196-1437998325049:blk_1073742678_1858, duration: 6385289
> >>> 2015-07-27 14:11:16,155 INFO datanode.DataNode (BlockReceiver.java:run(1348)) - PacketResponder: BP-369072949-10.240.200.196-1437998325049:blk_1073742678_1858, type=HAS_DOWNSTREAM_IN_PIPELINE terminating
> >>> 2015-07-27 14:11:16,267 ERROR datanode.DataNode (DataXceiver.java:run(278)) - hdp-m.c.dks-hadoop.internal:50010:DataXceiver error processing unknown operation src: /127.0.0.1:60513 dst: /127.0.0.1:50010
> >>> java.io.EOFException
> >>>   at java.io.DataInputStream.readShort(DataInputStream.java:315)
> >>>   at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.readOp(Receiver.java:58)
> >>>   at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:227)
> >>>   at java.lang.Thread.run(Thread.java:745)
> >>> 2015-07-27 14:11:16,405 INFO datanode.DataNode (DataNode.java:transferBlock(1943)) - DatanodeRegistration(10.240.200.196:50010, datanodeUuid=329bbe62-bcea-4a6d-8c97-e800631deb81, infoPort=50075, infoSecurePort=0, ipcPort=8010, storageInfo=lv=-56;cid=CID-1247f294-77a9-4605-b6d3-4c1398bb5db0;nsid=2032226938;c=0) Starting thread to transfer BP-369072949-10.240.200.196-1437998325049:blk_1073742649_1829 to 10.240.2.235:50010 10.240.164.0:50010
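A side note on the DataXceiver "error processing unknown operation ... java.io.EOFException" entries (one in the m datanode log above, another in the w-0 datanode log below): the source is 127.0.0.1, and this pattern commonly comes from something on the host (a monitoring agent or port check) opening the DataNode transfer port and closing it without sending a request, so it is likely unrelated to the aborts. A quick way to test that theory; the nc usage and the log path are assumptions about what is installed and where the distribution writes logs:

  # Opening and immediately closing port 50010 should reproduce the same
  # EOFException line in the datanode log if the theory is right
  nc -z 127.0.0.1 50010
  tail -n 10 /var/log/hadoop/hdfs/hadoop-hdfs-datanode-*.log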
> >>>
> >>> -------------
> >>> w-0 datanode log
> >>> -------------
> >>> 2015-07-27 14:11:25,019 ERROR datanode.DataNode (DataXceiver.java:run(278)) - hdp-w-0.c.dks-hadoop.internal:50010:DataXceiver error processing unknown operation src: /127.0.0.1:47993 dst: /127.0.0.1:50010
> >>> java.io.EOFException
> >>>   at java.io.DataInputStream.readShort(DataInputStream.java:315)
> >>>   at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.readOp(Receiver.java:58)
> >>>   at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:227)
> >>>   at java.lang.Thread.run(Thread.java:745)
> >>> 2015-07-27 14:11:25,077 INFO DataNode.clienttrace (DataXceiver.java:requestShortCircuitFds(369)) - src: 127.0.0.1, dest: 127.0.0.1, op: REQUEST_SHORT_CIRCUIT_FDS, blockid: 1073742631, srvID: a5eea5a8-5112-46da-9f18-64274486c472, success: true
> >>>
> >>> -----------------------------
> >>> Thank you in advance,
> >>>
> >>> Adrià
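Since every abort above traces back to the same WAL write pipeline failing against 10.240.187.182 (w-2) with "All datanodes ... are bad. Aborting...", a few checks that are commonly run for that error are sketched below. The log path and the 4096 value are assumptions; the HBase reference guide does recommend raising dfs.datanode.max.transfer.threads (the old dfs.datanode.max.xcievers) for HBase workloads, since too low a value is a documented cause of write-pipeline failures:

  # Datanode liveness / last-contact times as seen by the namenode
  hdfs dfsadmin -report

  # On w-2 (10.240.187.182), the datanode in the failing pipeline: look for
  # pipeline, xceiver or disk errors around 14:11-14:13 (log path depends on
  # the distribution's layout)
  grep -iE 'error|exception|xceiver' /var/log/hadoop/hdfs/hadoop-hdfs-datanode-*.log

  # Verify the datanode transfer-thread limit; 4096 is a commonly used value
  grep -A1 dfs.datanode.max.transfer.threads /etc/hadoop/conf/hdfs-site.xml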
