Maybe you can try to add "-XX:+PrintGCApplicationStoppedTime", then if other ops(not gc) caused the long safepoint duration, you could find the log. btw, did you have a high load during that time:)
Best, Liang ________________________________________ 发件人: Henry Hung [[email protected]] 发送时间: 2013年10月23日 11:41 收件人: [email protected] 主题: What cause region server to timeout other than long gc? Hi All, Today I have 1 of 9 region servers down because of zookeeper timeout, this is the log: 2013-10-23 07:41:34,139 INFO org.apache.hadoop.hbase.regionserver.Store: Starting compaction of 4 file(s) in cf of MES_PERF_LOG_TIME,\x00\x04\x00\x00\x01A\x9D\xDD\xD9\x8D\x00\x00\x08\xD0fcap2\x00\x00=\xFC,1381922811424.cc6325e7896e3844d124321c315b02b0. into tmpdir=hdfs://fchddn1.ctfab.com:9000/hbase/MES_PERF_LOG_TIME/cc6325e7896e3844d124321c315b02b0/.tmp, seqid=2655801231, totalSize=923.8m 2013-10-23 07:41:34,139 DEBUG org.apache.hadoop.hbase.regionserver.Compactor: Compacting hdfs://fchddn1.ctfab.com:9000/hbase/MES_PERF_LOG_TIME/cc6325e7896e3844d124321c315b02b0/cf/3ae8adddd77f4a669b93eba2068f5778, keycount=34504344, bloomtype=NONE, size=501.3m, encoding=NONE 2013-10-23 07:41:34,139 DEBUG org.apache.hadoop.hbase.regionserver.Compactor: Compacting hdfs://fchddn1.ctfab.com:9000/hbase/MES_PERF_LOG_TIME/cc6325e7896e3844d124321c315b02b0/cf/055361f74fb644f1b3eb48f5da7e3ec0, keycount=18581704, bloomtype=NONE, size=269.5m, encoding=NONE 2013-10-23 07:41:34,139 DEBUG org.apache.hadoop.hbase.regionserver.Compactor: Compacting hdfs://fchddn1.ctfab.com:9000/hbase/MES_PERF_LOG_TIME/cc6325e7896e3844d124321c315b02b0/cf/a92cf779dca44843ab4c5130cca68d4f, keycount=9953404, bloomtype=NONE, size=143.9m, encoding=NONE 2013-10-23 07:41:34,139 DEBUG org.apache.hadoop.hbase.regionserver.Compactor: Compacting hdfs://fchddn1.ctfab.com:9000/hbase/MES_PERF_LOG_TIME/cc6325e7896e3844d124321c315b02b0/cf/51b89f361eb64f2f8c894a5c1516c604, keycount=664008, bloomtype=NONE, size=9.2m, encoding=NONE 2013-10-23 07:41:34,182 DEBUG org.apache.hadoop.hbase.util.FSUtils: Creating file=hdfs://fchddn1.ctfab.com:9000/hbase/MES_PERF_LOG_TIME/cc6325e7896e3844d124321c315b02b0/.tmp/16a28f8fc9f049c0a8b014502faf6339 with permission=rwxrwxrwx 2013-10-23 07:41:34,190 DEBUG org.apache.hadoop.hbase.io.hfile.HFileWriterV2: Initialized with CacheConfig:enabled [cacheDataOnRead=true] [cacheDataOnWrite=false] [cacheIndexesOnWrite=false] [cacheBloomsOnWrite=false] [cacheEvictOnClose=false] [cacheCompressed=false] 2013-10-23 07:41:34,191 INFO org.apache.hadoop.hbase.regionserver.StoreFile: Delete Family Bloom filter type for hdfs://fchddn1.ctfab.com:9000/hbase/MES_PERF_LOG_TIME/cc6325e7896e3844d124321c315b02b0/.tmp/16a28f8fc9f049c0a8b014502faf6339: CompoundBloomFilterWriter 2013-10-23 07:43:16,793 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 51848ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired 2013-10-23 07:43:16,799 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 60261ms for sessionid 0x1412996eb5d2170, closing socket connection and attempting reconnect 2013-10-23 07:43:16,800 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 60184ms for sessionid 0x1412996eb5d217e, closing socket connection and attempting reconnect 2013-10-23 07:43:16,807 WARN org.apache.hadoop.ipc.HBaseServer: (responseTooSlow): {"processingtimems":50480,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@3b17c1aa), rpc version=1, client version=29, methodsFingerPrint=-1368823753","client":"10.16.10.181:40987","starttimems":1382485346321,"queuetimems":0,"class":"HRegionServer","responsesize":0,"method":"multi"} 2013-10-23 07:43:16,807 WARN org.apache.hadoop.ipc.HBaseServer: (responseTooSlow): {"processingtimems":50159,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@71f59d1a), rpc version=1, client version=29, methodsFingerPrint=-1368823753","client":"10.16.10.181:40023","starttimems":1382485346321,"queuetimems":0,"class":"HRegionServer","responsesize":0,"method":"multi"} 2013-10-23 07:43:16,808 WARN org.apache.hadoop.ipc.HBaseServer: (responseTooSlow): {"processingtimems":50480,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@380e6d06), rpc version=1, client version=29, methodsFingerPrint=-1368823753","client":"10.16.10.181:40963","starttimems":1382485346321,"queuetimems":0,"class":"HRegionServer","responsesize":0,"method":"multi"} 2013-10-23 07:43:18,154 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server fchddn1.ctfab.com/10.16.10.181:2222. Will not attempt to authenticate using SASL (Unable to locate a login configuration) 2013-10-23 07:43:18,155 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to fchddn1.ctfab.com/10.16.10.181:2222, initiating session 2013-10-23 07:43:18,201 INFO org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: This client just lost it's session with ZooKeeper, will automatically reconnect when needed. 2013-10-23 07:43:18,201 INFO org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: ZK session expired. This disconnect could have been caused by a network partition or a long-running GC pause, either way it's recommended that you verify your environment. 2013-10-23 07:43:18,230 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down 2013-10-23 07:43:18,230 INFO org.apache.zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x1412996eb5d217e has expired, closing socket connection 2013-10-23 07:43:18,804 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server fchddn1.ctfab.com/10.16.10.181:2222. Will not attempt to authenticate using SASL (Unable to locate a login configuration) 2013-10-23 07:43:18,805 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to fchddn1.ctfab.com/10.16.10.181:2222, initiating session 2013-10-23 07:43:18,805 INFO org.apache.zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x1412996eb5d2170 has expired, closing socket connection 2013-10-23 07:43:18,805 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server fchddn2.ctfab.com,60020,1382059486345: regionserver:60020-0x1412996eb5d2170-0x1412996eb5d2170-0x1412996eb5d2170-0x1412996eb5d2170-0x1412996eb5d2170 regionserver:60020-0x1412996eb5d2170-0x1412996eb5d2170-0x1412996eb5d2170-0x1412996eb5d2170-0x1412996eb5d2170 received expired from ZooKeeper, aborting org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:363) at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:282) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495) The message mention that this caused by a long garbage collection, but when I go to see the gc log, I cannot find any long running gc: 2013-10-23T07:42:16.396+0800: 425852.718: [GC 425852.718: [ParNew: 211937K->5070K(235968K), 0.0106610 secs] 2412876K->2206256K(4801752K), 0.0107230 secs] [Times: user=0.04 sys=0.00, real=0.01 secs] 2013-10-23T07:42:16.760+0800: 425853.082: [GC 425853.082: [ParNew: 214826K->4962K(235968K), 0.0108200 secs] 2416013K->2206428K(4801752K), 0.0109040 secs] [Times: user=0.05 sys=0.00, real=0.01 secs] 2013-10-23T07:42:17.101+0800: 425853.422: [GC 425853.422: [ParNew: 214754K->8823K(235968K), 0.0111210 secs] 2416220K->2210549K(4801752K), 0.0111890 secs] [Times: user=0.05 sys=0.00, real=0.01 secs] 2013-10-23T07:42:17.461+0800: 425853.783: [GC 425853.783: [ParNew: 218615K->10874K(235968K), 0.0117700 secs] 2420341K->2212905K(4801752K), 0.0118380 secs] [Times: user=0.04 sys=0.00, real=0.01 secs] 2013-10-23T07:42:17.827+0800: 425854.149: [GC 425854.149: [ParNew: 220666K->10443K(235968K), 0.0120950 secs] 2422697K->2214840K(4801752K), 0.0121640 secs] [Times: user=0.05 sys=0.00, real=0.01 secs] 2013-10-23T07:42:18.216+0800: 425854.537: [GC 425854.538: [ParNew: 220235K->10855K(235968K), 0.0115060 secs] 2424632K->2215478K(4801752K), 0.0115940 secs] [Times: user=0.04 sys=0.00, real=0.01 secs] 2013-10-23T07:42:18.592+0800: 425854.914: [GC 425854.914: [ParNew: 220711K->2730K(235968K), 0.0116310 secs] 2425333K->2209654K(4801752K), 0.0117080 secs] [Times: user=0.05 sys=0.00, real=0.01 secs] 2013-10-23T07:42:18.962+0800: 425855.284: [GC 425855.284: [ParNew: 212522K->2292K(235968K), 0.0108730 secs] 2419446K->2209540K(4801752K), 0.0109640 secs] [Times: user=0.05 sys=0.00, real=0.01 secs] 2013-10-23T07:42:19.322+0800: 425855.644: [GC 425855.644: [ParNew: 212084K->3115K(235968K), 0.0106050 secs] 2419332K->2210656K(4801752K), 0.0106740 secs] [Times: user=0.04 sys=0.00, real=0.01 secs] 2013-10-23T07:42:19.722+0800: 425856.044: [GC 425856.044: [ParNew: 212907K->2373K(235968K), 0.0107250 secs] 2420448K->2210234K(4801752K), 0.0108060 secs] [Times: user=0.04 sys=0.00, real=0.01 secs] 2013-10-23T07:42:20.258+0800: 425856.579: [GC 425856.579: [ParNew: 212165K->2999K(235968K), 0.0107150 secs] 2420026K->2211194K(4801752K), 0.0107950 secs] [Times: user=0.05 sys=0.00, real=0.01 secs] 2013-10-23T07:42:20.640+0800: 425856.962: [GC 425856.962: [ParNew: 212791K->2687K(235968K), 0.0107090 secs] 2420986K->2211198K(4801752K), 0.0107890 secs] [Times: user=0.05 sys=0.00, real=0.01 secs] 2013-10-23T07:42:21.008+0800: 425857.330: [GC 425857.330: [ParNew: 212479K->10731K(235968K), 0.0124240 secs] 2420990K->2219520K(4801752K), 0.0125040 secs] [Times: user=0.05 sys=0.00, real=0.01 secs] 2013-10-23T07:42:21.390+0800: 425857.711: [GC 425857.711: [ParNew: 220523K->11031K(235968K), 0.0117490 secs] 2429312K->2220131K(4801752K), 0.0118380 secs] [Times: user=0.05 sys=0.00, real=0.01 secs] 2013-10-23T07:42:21.737+0800: 425858.059: [GC 425858.059: [ParNew: 220829K->11034K(235968K), 0.0116510 secs] 2429928K->2220524K(4801752K), 0.0117290 secs] [Times: user=0.05 sys=0.00, real=0.01 secs] 2013-10-23T07:42:22.088+0800: 425858.410: [GC 425858.410: [ParNew: 220762K->8764K(235968K), 0.0182030 secs] 2430252K->2218531K(4801752K), 0.0182760 secs] [Times: user=0.05 sys=0.00, real=0.02 secs] 2013-10-23T07:42:22.452+0800: 425858.774: [GC 425858.774: [ParNew: 218398K->3125K(235968K), 0.0109820 secs] 2428166K->2215240K(4801752K), 0.0110600 secs] [Times: user=0.04 sys=0.00, real=0.01 secs] 2013-10-23T07:42:22.806+0800: 425859.128: [GC 425859.128: [ParNew: 212917K->3093K(235968K), 0.0129060 secs] 2425032K->2215725K(4801752K), 0.0129740 secs] [Times: user=0.05 sys=0.00, real=0.01 secs] 2013-10-23T07:42:23.191+0800: 425859.513: [GC 425859.513: [ParNew: 212885K->3037K(235968K), 0.0105990 secs] 2425517K->2215997K(4801752K), 0.0106680 secs] [Times: user=0.04 sys=0.00, real=0.01 secs] 2013-10-23T07:42:23.547+0800: 425859.869: [GC 425859.869: [ParNew: 212829K->2592K(235968K), 0.0155460 secs] 2425789K->2215818K(4801752K), 0.0156380 secs] [Times: user=0.05 sys=0.00, real=0.01 secs] 2013-10-23T07:42:23.913+0800: 425860.235: [GC 425860.235: [ParNew: 212384K->6389K(235968K), 0.0185530 secs] 2425610K->2219920K(4801752K), 0.0186300 secs] [Times: user=0.06 sys=0.00, real=0.02 secs] 2013-10-23T07:42:23.932+0800: 425860.254: [GC 425860.254: [ParNew: 6398K->4591K(235968K), 0.0101310 secs] 2219929K->2218390K(4801752K), 0.0101880 secs] [Times: user=0.04 sys=0.00, real=0.01 secs] 2013-10-23T07:42:24.281+0800: 425860.603: [GC 425860.603: [ParNew: 214383K->10297K(235968K), 0.0113230 secs] 2428182K->2224387K(4801752K), 0.0114040 secs] [Times: user=0.05 sys=0.00, real=0.01 secs] 2013-10-23T07:42:24.607+0800: 425860.929: [GC 425860.929: [ParNew: 220089K->10532K(235968K), 0.0131330 secs] 2434179K->2224894K(4801752K), 0.0132140 secs] [Times: user=0.05 sys=0.00, real=0.01 secs] 2013-10-23T07:42:25.013+0800: 425861.335: [GC 425861.335: [ParNew: 220324K->6430K(235968K), 0.0117640 secs] 2434686K->2223155K(4801752K), 0.0118420 secs] [Times: user=0.05 sys=0.00, real=0.01 secs] 2013-10-23T07:42:25.534+0800: 425861.856: [GC 425861.856: [ParNew: 216222K->13486K(235968K), 0.0117460 secs] 2432947K->2230212K(4801752K), 0.0118280 secs] [Times: user=0.04 sys=0.00, real=0.01 secs] 2013-10-23T07:42:25.969+0800: 425862.290: [GC 425862.290: [ParNew: 223236K->7543K(235968K), 0.0119530 secs] 2439961K->2226646K(4801752K), 0.0120310 secs] [Times: user=0.05 sys=0.00, real=0.01 secs] 2013-10-23T07:43:16.467+0800: 425912.789: [GC 425912.789: [ParNew: 217335K->10714K(235968K), 0.0131220 secs] 2436438K->2230095K(4801752K), 0.0132030 secs] [Times: user=0.05 sys=0.00, real=0.01 secs] 2013-10-23T07:43:17.177+0800: 425913.499: [GC 425913.499: [ParNew: 220506K->3502K(235968K), 0.0114700 secs] 2439887K->2225224K(4801752K), 0.0115550 secs] [Times: user=0.05 sys=0.00, real=0.01 secs] 2013-10-23T07:43:17.557+0800: 425913.879: [GC 425913.879: [ParNew: 213294K->2873K(235968K), 0.0110120 secs] 2435016K->2224997K(4801752K), 0.0110940 secs] [Times: user=0.05 sys=0.00, real=0.01 secs] 2013-10-23T07:43:17.898+0800: 425914.220: [GC 425914.220: [ParNew: 212665K->2783K(235968K), 0.0182350 secs] 2434789K->2225246K(4801752K), 0.0183390 secs] [Times: user=0.05 sys=0.00, real=0.02 secs] 2013-10-23T07:43:18.303+0800: 425914.624: [GC 425914.625: [ParNew: 212541K->3134K(235968K), 0.0108970 secs] 2435004K->2225888K(4801752K), 0.0110010 secs] [Times: user=0.05 sys=0.00, real=0.01 secs] 2013-10-23T07:43:18.691+0800: 425915.013: [GC 425915.013: [ParNew: 212926K->2851K(235968K), 0.0114990 secs] 2435680K->2226095K(4801752K), 0.0115790 secs] [Times: user=0.04 sys=0.00, real=0.01 secs] 2013-10-23T07:43:19.331+0800: 425915.653: [GC 425915.653: [ParNew: 212643K->2503K(235968K), 0.0105020 secs] 2435887K->2226096K(4801752K), 0.0105700 secs] [Times: user=0.05 sys=0.00, real=0.01 secs] Heap par new generation total 235968K, used 86748K [0x0000000683e00000, 0x0000000693e00000, 0x0000000693e00000) eden space 209792K, 40% used [0x0000000683e00000, 0x0000000689045380, 0x0000000690ae0000) from space 26176K, 9% used [0x0000000692470000, 0x00000006926e1f70, 0x0000000693e00000) to space 26176K, 0% used [0x0000000690ae0000, 0x0000000690ae0000, 0x0000000692470000) concurrent mark-sweep generation total 4565784K, used 2223592K [0x0000000693e00000, 0x00000007aa8c6000, 0x00000007fae00000) concurrent-mark-sweep perm gen total 43908K, used 26321K [0x00000007fae00000, 0x00000007fd8e1000, 0x0000000800000000) Is there any other reason that can cause the region server to idle more than 60 seconds? Best regards, Henry ________________________________ The privileged confidential information contained in this email is intended for use only by the addressees as indicated by the original sender of this email. If you are not the addressee indicated in this email or are not responsible for delivery of the email to such a person, please kindly reply to the sender indicating this fact and delete all copies of it from your computer and network server immediately. Your cooperation is highly appreciated. It is advised that any unauthorized use of confidential information of Winbond is strictly prohibited; and any information in this email irrelevant to the official business of Winbond shall be deemed as neither given nor endorsed by Winbond.
