It's a kernel bug from the looks of it; here's a similar stack trace: https://bugs.centos.org/view.php?id=7538
But apparently they fixed it in the kernel you're running.
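
If you want to double-check that, one thing you could try (just a sketch; I'm assuming the relevant changelog entry actually mentions "cgroup", which I haven't verified) is grepping the installed kernel package's changelog:

    # show changelog entries for the running kernel that mention cgroup
    rpm -q --changelog kernel-$(uname -r) | grep -i cgroup | head -20

If nothing related shows up there, the fix may not actually be in 2.6.32-504.3.3.el6 after all.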

On Sat, Jan 24, 2015 at 8:26 AM, Gary Ogden <[email protected]> wrote:
> I'm using this:
> 3 Node Cluster, 8GB / 2 CPU each
> CentOS 6.5, 2.6.32-504.3.3.el6.x86_64
>
> On each node we run Cassandra and Mesos, where we run Java Spark jobs.
> This is a testing environment, so it's actually shared between test groups.
> So we are actually running 3 instances of mesos-slave on each node
> (integration, qa and preprod). We want to ensure these Spark jobs don't
> slow down Cassandra.
>
> If I don't use cgroups, we don't get a kernel panic. No matter how I try
> to configure cgroups, I still get the panic and reboot. Is there an issue
> with having multiple slaves on the same machine? Here's the kernel panic
> text:
>
> <4>Process mesos-slave (pid: 19593, threadinfo ffff88023a224000, task ffff8800aa1af540)
> <4>Stack:
> <4> ffff88023a225dd8 ffff8802395b6580 ffff88023a225e08 ffffffff810cdaa2
> <4><d> ffff88008f740440 ffff8800aa0df938 0000000000000000 ffff8800aa0df950
> <4><d> ffff88023a225e58 ffffffff810577e9 ffff88008f740440 0000000300000001
> <4>Call Trace:
> <4> [<ffffffff810cdaa2>] cgroup_event_wake+0x42/0x70
> <4> [<ffffffff810577e9>] __wake_up_common+0x59/0x90
> <4> [<ffffffff8105bd18>] __wake_up+0x48/0x70
> <4> [<ffffffff811dad2d>] eventfd_release+0x2d/0x40
> <4> [<ffffffff8118f8d5>] __fput+0xf5/0x210
> <4> [<ffffffff8118fa15>] fput+0x25/0x30
> <4> [<ffffffff8118ac6d>] filp_close+0x5d/0x90
> <4> [<ffffffff8118ad45>] sys_close+0xa5/0x100
> <4> [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
> <4>Code: 01 01 01 01 01 48 0f af c2 48 c1 e8 38 c3 90 90 90 90 90 90 90 90 90 90 90 90 90 55 48 89 e5 53 48 89 fb 48 83 ec 08 48 8b 47 08 <4c> 8b 00 4c 39 c7 75 39 48 8b 03 4c 8b 40 08 4c 39 c3 75 4c 48
> <1>RIP [<ffffffff8129e870>] list_del+0x10/0xa0
> <4> RSP <ffff88023a225dc8>
>
> Here's the cgconfig.conf:
>
> mount {
>     cpu = /cgroup/cpu;
>     cpuacct = /cgroup/cpuacct;
>     memory = /cgroup/memory;
> }
>
> group cassandra {
>     cpu {
>         cpu.shares="800";
>     }
>     cpuacct {
>         cpuacct.usage="0";
>     }
>     memory {
>         memory.limit_in_bytes="5G";
>         memory.memsw.limit_in_bytes="5G";
>     }
> }
>
> group mesos {
>     cpu {
>         cpu.shares="200";
>     }
>     cpuacct {
>         cpuacct.usage="0";
>     }
>     memory {
>         memory.limit_in_bytes="1G";
>         memory.memsw.limit_in_bytes="1G";
>     }
> }
>
> Here's the cgrules.conf:
>
> @mesos      cpu,cpuacct,memory    mesos
> @cassandra  cpu,cpuacct,memory    cassandra
>
> And here's how we start each slave:
>
> cgexec -g cpu,cpuacct,memory:mesos /usr/sbin/mesos-slave \
>     --isolation=cgroups/cpu,cgroups/mem --cgroups_limit_swap \
>     --cgroups_hierarchy=/cgroup \
>     --resources="mem(*):256;cpus(*):1;ports(*):[20000-25000];disk(*):5000" \
>     --gc_delay=2days --cgroups_root=mesos --log_dir=/var/log/mesos/int \
>     --master=zk://intMesosMaster01:2181,intMesosMaster02:2181,intMesosMaster03:2181/mesos \
>     --port=5150 --work_dir=/tmp/mesos/int
>
> I've tried lots of different settings in cgroups and in how the slaves are
> started, but nothing seems to matter. We've also disabled swap on these
> boxes since Cassandra doesn't like swap.
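
One more thing that might be worth checking (just a guess, not a known fix): cgconfig and the mesos-slave cgroups/cpu,cgroups/mem isolator are both working under the same /cgroup hierarchy, so it could help to confirm what actually ends up mounted and created there. Something like the following, using the libcgroup tools you already have for cgexec/cgrules, should show it:

    # where each cgroup subsystem is mounted
    lssubsys -am

    # which groups exist under the cpu and memory controllers
    lscgroup cpu:/ memory:/ | grep -E 'mesos|cassandra'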

