Yes we are running MR on hbase. I tried running MR on snapshot but the data in our HBase changes very frequently and we end up occupying twice the space and end up running into full disks.
I think we are hitting large HBase heap and M/R problem. I will try to add some more space to our cluster and try MR on snapshot and see how it goes. Thanks for suggestion. Thanks, Rahul On Fri, May 29, 2015 at 9:14 AM, Vladimir Rodionov <[email protected]> wrote: > Is there any reason for 27G heap? It seems you run M/R job? If yes, then I > would recommend you trying M/R over snapshots. Combination of large HBase > heap and M/R is very hard to tune if possible at all > You can also try reducing number of map tasks and check your MR job > resource consumption > On May 29, 2015 8:58 AM, "Bryan Beaudreault" <[email protected]> > wrote: > > > That one long GC aside, look at the timings of the others as well. Even > > the smaller GCs are taking up the majority of each second. > > > > For a heap that size you might want to try a java version over java7u60 > and > > use the G1GC. Otherwise there are a bunch of resources on the web > > including in the refguide for how to tune GCs for hbase. > > > > > > On Fri, May 29, 2015 at 11:49 AM, rahul malviya < > > [email protected]> > > wrote: > > > > > This is a sample from gc log file. At the end I see long gc pauses. Is > > > there a way I can tune this ? > > > > > > 2015-04-29T22:46:12.387+0000: 98061.660: > [GC2015-04-29T22:46:12.387+0000: > > > 98061.661: [ParNew: 572757K->63867K(580608K), 0.6549550 secs] > > > 13294553K->12811090K(20001132K), 0.6551600 secs] [Times: user=8.09 > > > sys=0.07, real=0.65 secs] > > > 2015-04-29T22:46:13.165+0000: 98062.438: > [GC2015-04-29T22:46:13.165+0000: > > > 98062.439: [ParNew: 579901K->55337K(580608K), 1.1941730 secs] > > > 13327124K->12846913K(20001132K), 1.1944150 secs] [Times: user=5.39 > > > sys=0.08, real=1.20 secs] > > > 2015-04-29T22:46:14.485+0000: 98063.759: > [GC2015-04-29T22:46:14.485+0000: > > > 98063.759: [ParNew: 571496K->62270K(580608K), 0.8990460 secs] > > > 13363071K->12891412K(20001132K), 0.8992880 secs] [Times: user=7.07 > > > sys=0.05, real=0.90 secs] > > > 2015-04-29T22:46:15.519+0000: 98064.793: > [GC2015-04-29T22:46:15.519+0000: > > > 98064.793: [ParNew: 578417K->41496K(580608K), 0.8849310 secs] > > > 13407559K->12907053K(20001132K), 0.8851300 secs] [Times: user=4.98 > > > sys=0.08, real=0.88 secs] > > > 2015-04-29T22:46:16.547+0000: 98065.821: > [GC2015-04-29T22:46:16.547+0000: > > > 98065.821: [ParNew: 557652K->50791K(580608K), 0.6345050 secs] > > > 13423210K->12940694K(20001132K), 0.6347240 secs] [Times: user=7.99 > > > sys=0.04, real=0.63 secs] > > > 2015-04-29T22:46:17.273+0000: 98066.547: > [GC2015-04-29T22:46:17.273+0000: > > > 98066.547: [ParNew: 566889K->48425K(580608K), 0.6372610 secs] > > > 13456792K->12966994K(20001132K), 0.6375250 secs] [Times: user=3.72 > > > sys=0.05, real=0.64 secs] > > > 2015-04-29T22:46:18.035+0000: 98067.309: > [GC2015-04-29T22:46:18.035+0000: > > > 98067.309: [ParNew: 564444K->52607K(580608K), 0.7116560 secs] > > > 13483013K->12997986K(20001132K), 0.7119250 secs] [Times: user=7.16 > > > sys=0.05, real=0.72 secs] > > > 2015-04-29T22:46:18.875+0000: 98068.149: > [GC2015-04-29T22:46:18.875+0000: > > > 98068.149: [ParNew: 568635K->54078K(580608K), 0.5286160 secs] > > > 13514014K->13026734K(20001132K), 0.5288800 secs] [Times: user=4.70 > > > sys=0.05, real=0.53 secs] > > > 2015-04-29T22:46:19.521+0000: 98068.795: > [GC2015-04-29T22:46:19.521+0000: > > > 98068.795: [ParNew: 570106K->64416K(580608K), 1.0709370 secs] > > > 13542763K->13091755K(20001132K), 1.0712320 secs] [Times: user=9.57 > > > sys=0.08, real=1.08 secs] > > > 2015-04-29T22:46:20.751+0000: 98070.024: > [GC2015-04-29T22:46:20.751+0000: > > > 98070.024: [ParNew: 580387K->47393K(580608K), 0.5688620 secs] > > > 13607725K->13099551K(20001132K), 0.5690960 secs] [Times: user=6.48 > > > sys=0.05, real=0.57 secs] > > > 2015-04-29T22:46:21.393+0000: 98070.667: > [GC2015-04-29T22:46:21.393+0000: > > > 98070.667: [ParNew: 563351K->26473K(580608K), 0.7063100 secs] > > > 13615509K->13429978K(20001132K), 0.7065560 secs] [Times: user=10.49 > > > sys=0.07, real=0.71 secs] > > > 2015-04-29T22:46:22.214+0000: 98071.487: > [GC2015-04-29T22:46:22.214+0000: > > > 98071.488: [ParNew: 292415K->39128K(580608K), 0.0293500 > > > secs]2015-04-29T22:46:22.243+0000: 98071.517: [CMS: > > > 13403504K->7807614K(19420524K), 284.4275150 secs] > > > 13695919K->7807614K(20001132K), [CMS Perm : 30509K->30432K(51000K)], > > > 284.4594370 secs] [Times: user=5.55 sys=2.50, real=284.42 secs] > > > Heap > > > par new generation total 2146944K, used 868806K [0x0000000163600000, > > > 0x00000001f4f90000, 0x00000001f4f90000) > > > eden space 1908416K, 45% used [0x0000000163600000, > 0x00000001986718f8, > > > 0x00000001d7db0000) > > > from space 238528K, 0% used [0x00000001d7db0000, > 0x00000001d7db0000, > > > 0x00000001e66a0000) > > > to space 238528K, 0% used [0x00000001e66a0000, > 0x00000001e66a0000, > > > 0x00000001f4f90000) > > > concurrent mark-sweep generation total 19420524K, used 7807614K > > > [0x00000001f4f90000, 0x00000006964eb000, 0x00000007fae00000) > > > concurrent-mark-sweep perm gen total 51000K, used 30498K > > > [0x00000007fae00000, 0x00000007fdfce000, 0x0000000800000000) > > > > > > Thanks, > > > Rahul > > > > > > > > > On Fri, May 29, 2015 at 8:41 AM, Bryan Beaudreault < > > > [email protected] > > > > wrote: > > > > > > > > > > > > > 2014-08-14 21:35:16,740 WARN org.apache.hadoop.hbase.util.Sleeper: > We > > > > > slept > > > > > 14912ms instead of 3000ms, this is likely due to a long garbage > > > > collecting > > > > > pause and it's usually bad, see > > > > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired > > > > > > > > > > > > I would check your gc logs for long gc pauses. > > > > > > > > On Fri, May 29, 2015 at 11:38 AM, rahul malviya < > > > > [email protected]> > > > > wrote: > > > > > > > > > Hi All, > > > > > > > > > > In our cluster region server logs are filled with response too slow > > > > > message. This is causing jobs to slow down. How can I debug what is > > the > > > > > reason for this slowness. > > > > > > > > > > We have enabled short circuit reads and region server has 27GB RAM. > > > > > > > > > > Here is a trace when regionserver starts. > > > > > > > > > > Thu Aug 14 20:23:51 GMT 2014 Starting regionserver on nodex > > > > > core file size (blocks, -c) 0 > > > > > data seg size (kbytes, -d) unlimited > > > > > scheduling priority (-e) 0 > > > > > file size (blocks, -f) unlimited > > > > > pending signals (-i) 966365 > > > > > max locked memory (kbytes, -l) 64 > > > > > max memory size (kbytes, -m) unlimited > > > > > open files (-n) 32768 > > > > > pipe size (512 bytes, -p) 8 > > > > > POSIX message queues (bytes, -q) 819200 > > > > > real-time priority (-r) 0 > > > > > stack size (kbytes, -s) 10240 > > > > > cpu time (seconds, -t) unlimited > > > > > max user processes (-u) 966365 > > > > > virtual memory (kbytes, -v) unlimited > > > > > file locks (-x) unlimited > > > > > 2014-08-14 20:23:53,341 WARN org.apache.hadoop.conf.Configuration: > > > > > fs.default.name is deprecated. Instead, use fs.defaultFS > > > > > 2014-08-14 20:23:53,342 WARN org.apache.hadoop.conf.Configuration: > > > > > mapred.task.id is deprecated. Instead, use > mapreduce.task.attempt.id > > > > > 2014-08-14 20:23:53,884 WARN org.apache.hadoop.conf.Configuration: > > > > > slave.host.name is deprecated. Instead, use > > > > > mapreduce.tasktracker.host.name > > > > > 2014-08-14 20:24:03,999 WARN org.apache.hadoop.conf.Configuration: > > > > > hadoop.native.lib is deprecated. Instead, use > io.native.lib.available > > > > > 2014-08-14 20:26:47,605 ERROR > > > > > org.apache.hadoop.hbase.regionserver.metrics.SchemaMetrics: > > > Inconsistent > > > > > configuration. Previous configuration for using table name in > > metrics: > > > > > true, new configuration: false > > > > > 2014-08-14 20:28:23,491 WARN org.apache.hadoop.ipc.HBaseServer: > > > > > (responseTooSlow): > > > > > {"processingtimems":18725,"call":"next(-8041903839443097981, 10), > rpc > > > > > version=1, client version=29, > > > methodsFingerPrint=-1368823753","client":" > > > > > 17.170.176.248:58716 > > > > > > > > > > > > > > > > > > > > ","starttimems":1408048084720,"queuetimems":0,"class":"HRegionServer","responsesize":5031595,"method":"next"} > > > > > 2014-08-14 21:35:16,740 WARN org.apache.hadoop.hbase.util.Sleeper: > We > > > > slept > > > > > 14912ms instead of 3000ms, this is likely due to a long garbage > > > > collecting > > > > > pause and it's usually bad, see > > > > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired > > > > > 2014-08-14 21:42:28,477 WARN org.apache.hadoop.ipc.HBaseServer: > > > > > (responseTooSlow): > > > > > {"processingtimems":16968,"call":"next(5487686201525374976, 10), > rpc > > > > > version=1, client version=29, > > > methodsFingerPrint=-1368823753","client":" > > > > > 17.170.176.249:36657 > > > > > > > > > > > > > > > > > > > > ","starttimems":1408052531504,"queuetimems":0,"class":"HRegionServer","responsesize":1959532,"method":"next"} > > > > > 2014-08-14 21:42:56,923 WARN org.apache.hadoop.ipc.HBaseServer: > > > > > (responseTooSlow): > > > > > {"processingtimems":10591,"call":"next(5487686201525374976, 10), > rpc > > > > > version=1, client version=29, > > > methodsFingerPrint=-1368823753","client":" > > > > > 17.170.176.249:40818 > > > > > > > > > > > > > > > > > > > > ","starttimems":1408052566327,"queuetimems":1,"class":"HRegionServer","responsesize":2987578,"method":"next"} > > > > > 2014-08-14 21:44:24,372 WARN org.apache.hadoop.ipc.HBaseServer: > > > > > (responseTooSlow): > > > > > {"processingtimems":10656,"call":"next(5487686201525374976, 10), > rpc > > > > > version=1, client version=29, > > > methodsFingerPrint=-1368823753","client":" > > > > > 17.170.176.249:41993 > > > > > > > > > > > > > > > > > > > > ","starttimems":1408052653710,"queuetimems":1,"class":"HRegionServer","responsesize":3039779,"method":"next"} > > > > > 2014-08-14 21:45:50,598 WARN org.apache.hadoop.ipc.HBaseServer: > > > > > (responseTooSlow): > > > > > {"processingtimems":12418,"call":"next(5487686201525374976, 10), > rpc > > > > > version=1, client version=29, > > > methodsFingerPrint=-1368823753","client":" > > > > > 17.170.176.249:45197 > > > > > > > > > > > > > > > > > > > > ","starttimems":1408052738174,"queuetimems":10,"class":"HRegionServer","responsesize":2476903,"method":"next"} > > > > > 2014-08-14 21:46:15,187 WARN org.apache.hadoop.ipc.HBaseServer: > > > > > (responseTooSlow): > > > > > {"processingtimems":23766,"call":"next(5487686201525374976, 10), > rpc > > > > > version=1, client version=29, > > > methodsFingerPrint=-1368823753","client":" > > > > > 17.170.176.249:49425 > > > > > > > > > > > > > > > > > > > > ","starttimems":1408052751414,"queuetimems":0,"class":"HRegionServer","responsesize":5681175,"method":"next"} > > > > > 2014-08-14 21:47:09,041 WARN org.apache.hadoop.ipc.HBaseServer: > > > > > (responseTooSlow): > > > > > {"processingtimems":12320,"call":"next(5487686201525374976, 10), > rpc > > > > > version=1, client version=29, > > > methodsFingerPrint=-1368823753","client":" > > > > > 17.170.176.249:50269 > > > > > > > > > > > > > > > > > > > > ","starttimems":1408052816698,"queuetimems":1,"class":"HRegionServer","responsesize":2986949,"method":"next"} > > > > > 2014-08-14 21:49:23,833 WARN org.apache.hadoop.ipc.HBaseServer: > > > > > (responseTooSlow): > > > > > {"processingtimems":11389,"call":"next(1227841280814011139, 10), > rpc > > > > > version=1, client version=29, > > > methodsFingerPrint=-1368823753","client":" > > > > > 17.170.176.122:41976 > > > > > > > > > > > > > > > > > > > > ","starttimems":1408052952415,"queuetimems":0,"class":"HRegionServer","responsesize":3160025,"method":"next"} > > > > > 2014-08-14 21:49:23,869 WARN org.apache.hadoop.ipc.HBaseServer: > > > Exception > > > > > while changing ops : java.nio.channels.CancelledKeyException > > > > > 2014-08-14 21:49:23,900 WARN org.apache.hadoop.ipc.HBaseServer: > > > > > (responseTooSlow): > > > > > {"processingtimems":11428,"call":"next(9103372947568217267, 10), > rpc > > > > > version=1, client version=29, > > > methodsFingerPrint=-1368823753","client":" > > > > > 17.170.176.41:35241 > > > > > > > > > > > > > > > > > > > > ","starttimems":1408052952469,"queuetimems":0,"class":"HRegionServer","responsesize":1809158,"method":"next"} > > > > > 2014-08-14 21:49:23,902 WARN org.apache.hadoop.ipc.HBaseServer: > > > > > (responseTooSlow): > > > > > {"processingtimems":11415,"call":"next(-3120240140302998196, 10), > rpc > > > > > version=1, client version=29, > > > methodsFingerPrint=-1368823753","client":" > > > > > 17.170.176.195:46046 > > > > > > > > > > > > > > > > > > > > ","starttimems":1408052952468,"queuetimems":0,"class":"HRegionServer","responsesize":1826929,"method":"next"} > > > > > 2014-08-14 21:49:24,050 WARN org.apache.hadoop.ipc.HBaseServer: > > > > > (responseTooSlow): > > > > > {"processingtimems":11438,"call":"next(3799907609071248384, 10), > rpc > > > > > version=1, client version=29, > > > methodsFingerPrint=-1368823753","client":" > > > > > 17.170.176.154:42797 > > > > > > > > > > > > > > > > > > > > ","starttimems":1408052952459,"queuetimems":0,"class":"HRegionServer","responsesize":2628568,"method":"next"} > > > > > 2014-08-14 21:49:24,057 WARN org.apache.hadoop.ipc.HBaseServer: > > > > > (responseTooSlow): > > > > > {"processingtimems":11843,"call":"next(-1679362783893333095, 10), > rpc > > > > > version=1, client version=29, > > > > > methodsFingerPrint=-1368823753","client":"17.170 > > > > > > > > > > Thanks, > > > > > Rahul > > > > > > > > > > > > > > >
