答复: What cause region server to timeout other than long gc?

谢良 Tue, 22 Oct 2013 23:56:56 -0700

Maybe you can try to add "-XX:+PrintGCApplicationStoppedTime", then if other 
ops(not gc) caused the long safepoint duration, you could find the log.
btw, did you have a high load during that time:)


Best,
Liang
________________________________________
发件人: Henry Hung [[email protected]]
发送时间: 2013年10月23日 11:41
收件人: [email protected]
主题: What cause region server to timeout other than long gc?

Hi All,

Today I have 1 of 9 region servers down because of zookeeper timeout, this is 
the log:
2013-10-23 07:41:34,139 INFO org.apache.hadoop.hbase.regionserver.Store: 
Starting compaction of 4 file(s) in cf of 
MES_PERF_LOG_TIME,\x00\x04\x00\x00\x01A\x9D\xDD\xD9\x8D\x00\x00\x08\xD0fcap2\x00\x00=\xFC,1381922811424.cc6325e7896e3844d124321c315b02b0.
 into 
tmpdir=hdfs://fchddn1.ctfab.com:9000/hbase/MES_PERF_LOG_TIME/cc6325e7896e3844d124321c315b02b0/.tmp,
 seqid=2655801231, totalSize=923.8m
2013-10-23 07:41:34,139 DEBUG org.apache.hadoop.hbase.regionserver.Compactor: 
Compacting 
hdfs://fchddn1.ctfab.com:9000/hbase/MES_PERF_LOG_TIME/cc6325e7896e3844d124321c315b02b0/cf/3ae8adddd77f4a669b93eba2068f5778,
 keycount=34504344, bloomtype=NONE, size=501.3m, encoding=NONE
2013-10-23 07:41:34,139 DEBUG org.apache.hadoop.hbase.regionserver.Compactor: 
Compacting 
hdfs://fchddn1.ctfab.com:9000/hbase/MES_PERF_LOG_TIME/cc6325e7896e3844d124321c315b02b0/cf/055361f74fb644f1b3eb48f5da7e3ec0,
 keycount=18581704, bloomtype=NONE, size=269.5m, encoding=NONE
2013-10-23 07:41:34,139 DEBUG org.apache.hadoop.hbase.regionserver.Compactor: 
Compacting 
hdfs://fchddn1.ctfab.com:9000/hbase/MES_PERF_LOG_TIME/cc6325e7896e3844d124321c315b02b0/cf/a92cf779dca44843ab4c5130cca68d4f,
 keycount=9953404, bloomtype=NONE, size=143.9m, encoding=NONE
2013-10-23 07:41:34,139 DEBUG org.apache.hadoop.hbase.regionserver.Compactor: 
Compacting 
hdfs://fchddn1.ctfab.com:9000/hbase/MES_PERF_LOG_TIME/cc6325e7896e3844d124321c315b02b0/cf/51b89f361eb64f2f8c894a5c1516c604,
 keycount=664008, bloomtype=NONE, size=9.2m, encoding=NONE
2013-10-23 07:41:34,182 DEBUG org.apache.hadoop.hbase.util.FSUtils: Creating 
file=hdfs://fchddn1.ctfab.com:9000/hbase/MES_PERF_LOG_TIME/cc6325e7896e3844d124321c315b02b0/.tmp/16a28f8fc9f049c0a8b014502faf6339
 with permission=rwxrwxrwx
2013-10-23 07:41:34,190 DEBUG org.apache.hadoop.hbase.io.hfile.HFileWriterV2: 
Initialized with CacheConfig:enabled [cacheDataOnRead=true] 
[cacheDataOnWrite=false] [cacheIndexesOnWrite=false] [cacheBloomsOnWrite=false] 
[cacheEvictOnClose=false] [cacheCompressed=false]
2013-10-23 07:41:34,191 INFO org.apache.hadoop.hbase.regionserver.StoreFile: 
Delete Family Bloom filter type for 
hdfs://fchddn1.ctfab.com:9000/hbase/MES_PERF_LOG_TIME/cc6325e7896e3844d124321c315b02b0/.tmp/16a28f8fc9f049c0a8b014502faf6339:
 CompoundBloomFilterWriter
2013-10-23 07:43:16,793 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 
51848ms instead of 3000ms, this is likely due to a long garbage collecting 
pause and it's usually bad, see 
http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2013-10-23 07:43:16,799 INFO org.apache.zookeeper.ClientCnxn: Client session 
timed out, have not heard from server in 60261ms for sessionid 
0x1412996eb5d2170, closing socket connection and attempting reconnect
2013-10-23 07:43:16,800 INFO org.apache.zookeeper.ClientCnxn: Client session 
timed out, have not heard from server in 60184ms for sessionid 
0x1412996eb5d217e, closing socket connection and attempting reconnect
2013-10-23 07:43:16,807 WARN org.apache.hadoop.ipc.HBaseServer: 
(responseTooSlow): 
{"processingtimems":50480,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@3b17c1aa),
 rpc version=1, client version=29, 
methodsFingerPrint=-1368823753","client":"10.16.10.181:40987","starttimems":1382485346321,"queuetimems":0,"class":"HRegionServer","responsesize":0,"method":"multi"}
2013-10-23 07:43:16,807 WARN org.apache.hadoop.ipc.HBaseServer: 
(responseTooSlow): 
{"processingtimems":50159,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@71f59d1a),
 rpc version=1, client version=29, 
methodsFingerPrint=-1368823753","client":"10.16.10.181:40023","starttimems":1382485346321,"queuetimems":0,"class":"HRegionServer","responsesize":0,"method":"multi"}
2013-10-23 07:43:16,808 WARN org.apache.hadoop.ipc.HBaseServer: 
(responseTooSlow): 
{"processingtimems":50480,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@380e6d06),
 rpc version=1, client version=29, 
methodsFingerPrint=-1368823753","client":"10.16.10.181:40963","starttimems":1382485346321,"queuetimems":0,"class":"HRegionServer","responsesize":0,"method":"multi"}
2013-10-23 07:43:18,154 INFO org.apache.zookeeper.ClientCnxn: Opening socket 
connection to server fchddn1.ctfab.com/10.16.10.181:2222. Will not attempt to 
authenticate using SASL (Unable to locate a login configuration)
2013-10-23 07:43:18,155 INFO org.apache.zookeeper.ClientCnxn: Socket connection 
established to fchddn1.ctfab.com/10.16.10.181:2222, initiating session
2013-10-23 07:43:18,201 INFO 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: 
This client just lost it's session with ZooKeeper, will automatically reconnect 
when needed.
2013-10-23 07:43:18,201 INFO 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: ZK 
session expired. This disconnect could have been caused by a network partition 
or a long-running GC pause, either way it's recommended that you verify your 
environment.
2013-10-23 07:43:18,230 INFO org.apache.zookeeper.ClientCnxn: EventThread shut 
down
2013-10-23 07:43:18,230 INFO org.apache.zookeeper.ClientCnxn: Unable to 
reconnect to ZooKeeper service, session 0x1412996eb5d217e has expired, closing 
socket connection
2013-10-23 07:43:18,804 INFO org.apache.zookeeper.ClientCnxn: Opening socket 
connection to server fchddn1.ctfab.com/10.16.10.181:2222. Will not attempt to 
authenticate using SASL (Unable to locate a login configuration)
2013-10-23 07:43:18,805 INFO org.apache.zookeeper.ClientCnxn: Socket connection 
established to fchddn1.ctfab.com/10.16.10.181:2222, initiating session
2013-10-23 07:43:18,805 INFO org.apache.zookeeper.ClientCnxn: Unable to 
reconnect to ZooKeeper service, session 0x1412996eb5d2170 has expired, closing 
socket connection
2013-10-23 07:43:18,805 FATAL 
org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server 
fchddn2.ctfab.com,60020,1382059486345: 
regionserver:60020-0x1412996eb5d2170-0x1412996eb5d2170-0x1412996eb5d2170-0x1412996eb5d2170-0x1412996eb5d2170
 
regionserver:60020-0x1412996eb5d2170-0x1412996eb5d2170-0x1412996eb5d2170-0x1412996eb5d2170-0x1412996eb5d2170
 received expired from ZooKeeper, aborting
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = 
Session expired
        at 
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:363)
        at 
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:282)
        at 
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)


The message mention that this caused by a long garbage collection, but when I 
go to see the gc log, I cannot find any long running gc:
2013-10-23T07:42:16.396+0800: 425852.718: [GC 425852.718: [ParNew: 
211937K->5070K(235968K), 0.0106610 secs] 2412876K->2206256K(4801752K), 
0.0107230 secs] [Times: user=0.04 sys=0.00, real=0.01 secs]
2013-10-23T07:42:16.760+0800: 425853.082: [GC 425853.082: [ParNew: 
214826K->4962K(235968K), 0.0108200 secs] 2416013K->2206428K(4801752K), 
0.0109040 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
2013-10-23T07:42:17.101+0800: 425853.422: [GC 425853.422: [ParNew: 
214754K->8823K(235968K), 0.0111210 secs] 2416220K->2210549K(4801752K), 
0.0111890 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
2013-10-23T07:42:17.461+0800: 425853.783: [GC 425853.783: [ParNew: 
218615K->10874K(235968K), 0.0117700 secs] 2420341K->2212905K(4801752K), 
0.0118380 secs] [Times: user=0.04 sys=0.00, real=0.01 secs]
2013-10-23T07:42:17.827+0800: 425854.149: [GC 425854.149: [ParNew: 
220666K->10443K(235968K), 0.0120950 secs] 2422697K->2214840K(4801752K), 
0.0121640 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
2013-10-23T07:42:18.216+0800: 425854.537: [GC 425854.538: [ParNew: 
220235K->10855K(235968K), 0.0115060 secs] 2424632K->2215478K(4801752K), 
0.0115940 secs] [Times: user=0.04 sys=0.00, real=0.01 secs]
2013-10-23T07:42:18.592+0800: 425854.914: [GC 425854.914: [ParNew: 
220711K->2730K(235968K), 0.0116310 secs] 2425333K->2209654K(4801752K), 
0.0117080 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
2013-10-23T07:42:18.962+0800: 425855.284: [GC 425855.284: [ParNew: 
212522K->2292K(235968K), 0.0108730 secs] 2419446K->2209540K(4801752K), 
0.0109640 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
2013-10-23T07:42:19.322+0800: 425855.644: [GC 425855.644: [ParNew: 
212084K->3115K(235968K), 0.0106050 secs] 2419332K->2210656K(4801752K), 
0.0106740 secs] [Times: user=0.04 sys=0.00, real=0.01 secs]
2013-10-23T07:42:19.722+0800: 425856.044: [GC 425856.044: [ParNew: 
212907K->2373K(235968K), 0.0107250 secs] 2420448K->2210234K(4801752K), 
0.0108060 secs] [Times: user=0.04 sys=0.00, real=0.01 secs]
2013-10-23T07:42:20.258+0800: 425856.579: [GC 425856.579: [ParNew: 
212165K->2999K(235968K), 0.0107150 secs] 2420026K->2211194K(4801752K), 
0.0107950 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
2013-10-23T07:42:20.640+0800: 425856.962: [GC 425856.962: [ParNew: 
212791K->2687K(235968K), 0.0107090 secs] 2420986K->2211198K(4801752K), 
0.0107890 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
2013-10-23T07:42:21.008+0800: 425857.330: [GC 425857.330: [ParNew: 
212479K->10731K(235968K), 0.0124240 secs] 2420990K->2219520K(4801752K), 
0.0125040 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
2013-10-23T07:42:21.390+0800: 425857.711: [GC 425857.711: [ParNew: 
220523K->11031K(235968K), 0.0117490 secs] 2429312K->2220131K(4801752K), 
0.0118380 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
2013-10-23T07:42:21.737+0800: 425858.059: [GC 425858.059: [ParNew: 
220829K->11034K(235968K), 0.0116510 secs] 2429928K->2220524K(4801752K), 
0.0117290 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
2013-10-23T07:42:22.088+0800: 425858.410: [GC 425858.410: [ParNew: 
220762K->8764K(235968K), 0.0182030 secs] 2430252K->2218531K(4801752K), 
0.0182760 secs] [Times: user=0.05 sys=0.00, real=0.02 secs]
2013-10-23T07:42:22.452+0800: 425858.774: [GC 425858.774: [ParNew: 
218398K->3125K(235968K), 0.0109820 secs] 2428166K->2215240K(4801752K), 
0.0110600 secs] [Times: user=0.04 sys=0.00, real=0.01 secs]
2013-10-23T07:42:22.806+0800: 425859.128: [GC 425859.128: [ParNew: 
212917K->3093K(235968K), 0.0129060 secs] 2425032K->2215725K(4801752K), 
0.0129740 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
2013-10-23T07:42:23.191+0800: 425859.513: [GC 425859.513: [ParNew: 
212885K->3037K(235968K), 0.0105990 secs] 2425517K->2215997K(4801752K), 
0.0106680 secs] [Times: user=0.04 sys=0.00, real=0.01 secs]
2013-10-23T07:42:23.547+0800: 425859.869: [GC 425859.869: [ParNew: 
212829K->2592K(235968K), 0.0155460 secs] 2425789K->2215818K(4801752K), 
0.0156380 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
2013-10-23T07:42:23.913+0800: 425860.235: [GC 425860.235: [ParNew: 
212384K->6389K(235968K), 0.0185530 secs] 2425610K->2219920K(4801752K), 
0.0186300 secs] [Times: user=0.06 sys=0.00, real=0.02 secs]
2013-10-23T07:42:23.932+0800: 425860.254: [GC 425860.254: [ParNew: 
6398K->4591K(235968K), 0.0101310 secs] 2219929K->2218390K(4801752K), 0.0101880 
secs] [Times: user=0.04 sys=0.00, real=0.01 secs]
2013-10-23T07:42:24.281+0800: 425860.603: [GC 425860.603: [ParNew: 
214383K->10297K(235968K), 0.0113230 secs] 2428182K->2224387K(4801752K), 
0.0114040 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
2013-10-23T07:42:24.607+0800: 425860.929: [GC 425860.929: [ParNew: 
220089K->10532K(235968K), 0.0131330 secs] 2434179K->2224894K(4801752K), 
0.0132140 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
2013-10-23T07:42:25.013+0800: 425861.335: [GC 425861.335: [ParNew: 
220324K->6430K(235968K), 0.0117640 secs] 2434686K->2223155K(4801752K), 
0.0118420 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
2013-10-23T07:42:25.534+0800: 425861.856: [GC 425861.856: [ParNew: 
216222K->13486K(235968K), 0.0117460 secs] 2432947K->2230212K(4801752K), 
0.0118280 secs] [Times: user=0.04 sys=0.00, real=0.01 secs]
2013-10-23T07:42:25.969+0800: 425862.290: [GC 425862.290: [ParNew: 
223236K->7543K(235968K), 0.0119530 secs] 2439961K->2226646K(4801752K), 
0.0120310 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
2013-10-23T07:43:16.467+0800: 425912.789: [GC 425912.789: [ParNew: 
217335K->10714K(235968K), 0.0131220 secs] 2436438K->2230095K(4801752K), 
0.0132030 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
2013-10-23T07:43:17.177+0800: 425913.499: [GC 425913.499: [ParNew: 
220506K->3502K(235968K), 0.0114700 secs] 2439887K->2225224K(4801752K), 
0.0115550 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
2013-10-23T07:43:17.557+0800: 425913.879: [GC 425913.879: [ParNew: 
213294K->2873K(235968K), 0.0110120 secs] 2435016K->2224997K(4801752K), 
0.0110940 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
2013-10-23T07:43:17.898+0800: 425914.220: [GC 425914.220: [ParNew: 
212665K->2783K(235968K), 0.0182350 secs] 2434789K->2225246K(4801752K), 
0.0183390 secs] [Times: user=0.05 sys=0.00, real=0.02 secs]
2013-10-23T07:43:18.303+0800: 425914.624: [GC 425914.625: [ParNew: 
212541K->3134K(235968K), 0.0108970 secs] 2435004K->2225888K(4801752K), 
0.0110010 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
2013-10-23T07:43:18.691+0800: 425915.013: [GC 425915.013: [ParNew: 
212926K->2851K(235968K), 0.0114990 secs] 2435680K->2226095K(4801752K), 
0.0115790 secs] [Times: user=0.04 sys=0.00, real=0.01 secs]
2013-10-23T07:43:19.331+0800: 425915.653: [GC 425915.653: [ParNew: 
212643K->2503K(235968K), 0.0105020 secs] 2435887K->2226096K(4801752K), 
0.0105700 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
Heap
par new generation   total 235968K, used 86748K [0x0000000683e00000, 
0x0000000693e00000, 0x0000000693e00000)
  eden space 209792K,  40% used [0x0000000683e00000, 0x0000000689045380, 
0x0000000690ae0000)
  from space 26176K,   9% used [0x0000000692470000, 0x00000006926e1f70, 
0x0000000693e00000)
  to   space 26176K,   0% used [0x0000000690ae0000, 0x0000000690ae0000, 
0x0000000692470000)
concurrent mark-sweep generation total 4565784K, used 2223592K 
[0x0000000693e00000, 0x00000007aa8c6000, 0x00000007fae00000)
concurrent-mark-sweep perm gen total 43908K, used 26321K [0x00000007fae00000, 
0x00000007fd8e1000, 0x0000000800000000)

Is there any other reason that can cause the region server to idle more than 60 
seconds?

Best regards,
Henry

________________________________
The privileged confidential information contained in this email is intended for 
use only by the addressees as indicated by the original sender of this email. 
If you are not the addressee indicated in this email or are not responsible for 
delivery of the email to such a person, please kindly reply to the sender 
indicating this fact and delete all copies of it from your computer and network 
server immediately. Your cooperation is highly appreciated. It is advised that 
any unauthorized use of confidential information of Winbond is strictly 
prohibited; and any information in this email irrelevant to the official 
business of Winbond shall be deemed as neither given nor endorsed by Winbond.

答复: What cause region server to timeout other than long gc?

Reply via email to