Yes, we are using 0.94.7. The RS log has no errors, since the system killed the process. This is from the syslog:
Jul 22 07:25:12 hbasetest-e-regionserver-684ab93a monit[10188]: 'hbasetest-e-regionserver-684ab93a' cpu user usage of 84.4% matches resource limit [cpu user usage>70.0%]
Jul 22 07:25:12 hbasetest-e-regionserver-684ab93a monit[10188]: 'hbasetest-e-regionserver-684ab93a' loadavg(15min) of 9.1 matches resource limit [loadavg(15min)>4.0]
Jul 22 07:25:12 hbasetest-e-regionserver-684ab93a monit[10188]: 'hbasetest-e-regionserver-684ab93a' loadavg(5min) of 12.8 matches resource limit [loadavg(5min)>4.0]
Jul 22 07:25:12 hbasetest-e-regionserver-684ab93a postfix/smtpd[1373]: connect from localhost[127.0.0.1]
Jul 22 07:25:12 hbasetest-e-regionserver-684ab93a postfix/smtpd[1373]: disconnect from localhost[127.0.0.1]
Jul 22 07:25:23 hbasetest-e-regionserver-684ab93a kernel: [4715957.327793] java invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
Jul 22 07:25:24 hbasetest-e-regionserver-684ab93a kernel: [4715957.327813] java cpuset=/ mems_allowed=0
Jul 22 07:25:24 hbasetest-e-regionserver-684ab93a kernel: [4715957.327820] Pid: 5255, comm: java Not tainted 3.2.0-57-virtual #87-Ubuntu
Jul 22 07:25:24 hbasetest-e-regionserver-684ab93a kernel: [4715957.327827] Call Trace:
Jul 22 07:25:24 hbasetest-e-regionserver-684ab93a kernel: [4715957.327841] [<ffffffff81119e41>] dump_header+0x91/0xe0
Jul 22 07:25:24 hbasetest-e-regionserver-684ab93a kernel: [4715957.327848] [<ffffffff8111a1c5>] oom_kill_process+0x85/0xb0
Jul 22 07:25:24 hbasetest-e-regionserver-684ab93a kernel: [4715957.327874] [<ffffffff8111a56a>] out_of_memory+0xfa/0x220
Jul 22 07:25:24 hbasetest-e-regionserver-684ab93a kernel: [4715957.327884] [<ffffffff8111ff43>] __alloc_pages_nodemask+0x8c3/0x8e0
Jul 22 07:25:24 hbasetest-e-regionserver-684ab93a kernel: [4715957.327896] [<ffffffff81157076>] alloc_pages_current+0xb6/0x120
Jul 22 07:25:24 hbasetest-e-regionserver-684ab93a kernel: [4715957.327905] [<ffffffff81116d67>] __page_cache_alloc+0xb7/0xd0
Jul 22 07:25:24 hbasetest-e-regionserver-684ab93a kernel: [4715957.327912] [<ffffffff81118d32>] filemap_fault+0x212/0x3c0
Jul 22 07:25:24 hbasetest-e-regionserver-684ab93a kernel: [4715957.327920] [<ffffffff81139412>] __do_fault+0x72/0x550
Jul 22 07:25:24 hbasetest-e-regionserver-684ab93a kernel: [4715957.327930] [<ffffffff81162172>] ? __cmpxchg_double_slab.isra.22+0x12/0x90
Jul 22 07:25:24 hbasetest-e-regionserver-684ab93a kernel: [4715957.327938] [<ffffffff8113caca>] handle_pte_fault+0xfa/0x200
Jul 22 07:25:24 hbasetest-e-regionserver-684ab93a kernel: [4715957.327947] [<ffffffff810063ee>] ? xen_pmd_val+0xe/0x10
Jul 22 07:25:24 hbasetest-e-regionserver-684ab93a kernel: [4715957.327954] [<ffffffff81005369>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
Jul 22 07:25:24 hbasetest-e-regionserver-684ab93a kernel: [4715957.327962] [<ffffffff8113dda9>] handle_mm_fault+0x269/0x370
Jul 22 07:25:24 hbasetest-e-regionserver-684ab93a kernel: [4715957.327973] [<ffffffff8165de84>] do_page_fault+0x184/0x550
Jul 22 07:25:24 hbasetest-e-regionserver-684ab93a kernel: [4715957.327980] [<ffffffff81004dc2>] ? xen_mc_flush+0xb2/0x1c0
Jul 22 07:25:24 hbasetest-e-regionserver-684ab93a kernel: [4715957.327988] [<ffffffff8165a35e>] ? _raw_spin_lock+0xe/0x20
Jul 22 07:25:24 hbasetest-e-regionserver-684ab93a kernel: [4715957.327995] [<ffffffff81187fe8>] ? setfl+0x118/0x170
Jul 22 07:25:24 hbasetest-e-regionserver-684ab93a kernel: [4715957.328002] [<ffffffff81188800>] ? do_fcntl+0x240/0x340
Jul 22 07:25:24 hbasetest-e-regionserver-684ab93a kernel: [4715957.328009] [<ffffffff8165aab5>] page_fault+0x25/0x30
Jul 22 07:25:24 hbasetest-e-regionserver-684ab93a kernel: [4715957.328015] Mem-Info:
Jul 22 07:25:24 hbasetest-e-regionserver-684ab93a kernel: [4715957.328018] Node 0 DMA per-cpu:
Jul 22 07:25:24 hbasetest-e-regionserver-684ab93a kernel: [4715957.328024] CPU 0: hi: 0, btch: 1 usd: 0
Jul 22 07:25:24 hbasetest-e-regionserver-684ab93a kernel: [4715957.328029] CPU 1: hi: 0, btch: 1 usd: 0
Jul 22 07:25:24 hbasetest-e-regionserver-684ab93a kernel: [4715957.328034] CPU 2: hi: 0, btch: 1 usd: 0
Jul 22 07:25:24 hbasetest-e-regionserver-684ab93a kernel: [4715957.328038] CPU 3: hi: 0, btch: 1 usd: 0
Jul 22 07:25:24 hbasetest-e-regionserver-684ab93a kernel: [4715957.328043] Node 0 DMA32 per-cpu:
Jul 22 07:25:24 hbasetest-e-regionserver-684ab93a kernel: [4715957.328049] CPU 0: hi: 186, btch: 31 usd: 24
Jul 22 07:25:24 hbasetest-e-regionserver-684ab93a kernel: [4715957.328072] CPU 1: hi: 186, btch: 31 usd: 0
Jul 22 07:25:24 hbasetest-e-regionserver-684ab93a kernel: [4715957.328077] CPU 2: hi: 186, btch: 31 usd: 64
Jul 22 07:25:24 hbasetest-e-regionserver-684ab93a kernel: [4715957.328082] CPU 3: hi: 186, btch: 31 usd: 0
Jul 22 07:25:24 hbasetest-e-regionserver-684ab93a kernel: [4715957.328088] Node 0 Normal per-cpu:
Jul 22 07:25:24 hbasetest-e-regionserver-684ab93a kernel: [4715957.328098] CPU 0: hi: 186, btch: 31 usd: 42
Jul 22 07:25:24 hbasetest-e-regionserver-684ab93a kernel: [4715957.328104] CPU 1: hi: 186, btch: 31 usd: 0
Jul 22 07:25:24 hbasetest-e-regionserver-684ab93a kernel: [4715957.328112] CPU 2: hi: 186, btch: 31 usd: 114
Jul 22 07:25:24 hbasetest-e-regionserver-684ab93a kernel: [4715957.328118] CPU 3: hi: 186, btch: 31 usd: 30
Jul 22 07:25:24 hbasetest-e-regionserver-684ab93a kernel: [4715957.328128] active_anon:3747089 inactive_anon:35 isolated_anon:0
Jul 22 07:25:24 hbasetest-e-regionserver-684ab93a kernel: [4715957.328129] active_file:467 inactive_file:576 isolated_file:45
Jul 22 07:25:24 hbasetest-e-regionserver-684ab93a kernel: [4715957.328131] unevictable:0 dirty:36 writeback:0 unstable:0
Jul 22 07:25:24 hbasetest-e-regionserver-684ab93a kernel: [4715957.328132] free:16894 slab_reclaimable:8871 slab_unreclaimable:6855

On Tue, Jul 22, 2014 at 8:35 AM, Ted Yu <[email protected]> wrote:

> Can you post a region server log snippet prior to (and including) the OOME?
>
> Are you using a 0.94 release?
>
> Cheers
>
> On Tue, Jul 22, 2014 at 8:15 AM, Tianying Chang <[email protected]> wrote:
>
> > Hi
> >
> > I was running WALPlayer to output HFiles for a future bulk load. There
> > are 6200 hlogs, and the total size is about 400G.
> >
> > The mapreduce job finished, but I saw two bad things:
> >
> > 1. More than half of the RSs died. I checked the syslog; it seems they
> > were killed by the OOM killer. They also had a very high CPU spike the
> > whole time WALPlayer was running:
> >
> > cpu user usage of 84.4% matches resource limit [cpu user usage>70.0%]
> >
> > 2. The mapreduce job also failed with Java heap space errors. My job set
> > the heap to 2G:
> >
> > mapred.child.java.opts=-Xmx2048m
> >
> > Does this mean WALPlayer cannot support this load with this kind of
> > setting?
> >
> > Thanks
> > Tian-Ying
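For reference, a minimal sketch of how the per-mapper heap from point 2 could be raised for a WALPlayer run. The property name `mapred.child.java.opts` is the one quoted in the thread (the old MR1-style name; on MR2/YARN the map-side equivalent is `mapreduce.map.java.opts`). The input directory, table name, and 4G heap value below are placeholders, not taken from the thread:

```shell
# Sketch: re-run WALPlayer with a larger heap per map task.
# /hbase/.logs and mytable are hypothetical arguments; substitute the real
# WAL input directory and target table.
hbase org.apache.hadoop.hbase.mapreduce.WALPlayer \
  -Dmapred.child.java.opts=-Xmx4096m \
  /hbase/.logs mytable
```

Note this only addresses the mapper-side heap errors; the region-server deaths in the syslog above are the kernel OOM killer acting on the host, which a job-side `-Xmx` does not fix.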
