Public bug reported:

We are gathering more data, but it feels like we have enough to start a
bug report.

On Precise, on a 16 core/32 thread EC2 machine (instance type
cc2.8xlarge), when starting approximately eight or more Lucid containers
simultaneously, after the first seven or so connect to dnsmasq within
just a few seconds, the next group takes just under three minutes to
connect.  We will provide short steps to duplicate this.

We also have past experience indicating that more grouping occurs,
because simultaneously starting 32 containers occasionally (about once
in 15 attempts) will result in one container that still has not
connected to dnsmasq after ten minutes.  We have not tried to duplicate
this, because we hope that tackling the smaller issue will also address
the larger one.

I have not yet had enough disk space on this machine to duplicate this
without lxc-start-ephemeral.  I intend to do so.

I have tried with a precise container.  *Precise containers do not
exhibit this bug* in the experiment I performed, at least up to 16
containers.  I plan to duplicate this at least a couple more times to
verify, but I'm comfortable saying initially that I believe this is
specific to Lucid containers.  I'll report back.

To duplicate (again, I am doing this on a specific EC2 machine with a
lot of cores, but I believe you can see something similar to this even
on a 4 core/8 hyperthread machine):

sudo apt-get install moreutils
sudo lxc-create -t ubuntu -n lucid -- -r lucid -a i386 -b ubuntu
parallel -j 16 bash -c "lxc-start-ephemeral -d -o lucid" -- 1 2 3 4 5 6 7 8 9 
10 11 12 13 14 15 16

Now look at /var/log/syslog at the most recent dnsmasq entries.  You
will see about seven containers get DHCPACK from dnsmasq within about 20
seconds; then you will see a pause of about three minutes; then you will
see more containers report to dnsmasq.  Here's an excerpt (the
approximate three minute jump happens between lines 5 and 6):

...
Jun 18 20:40:20 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPACK(lxcbr0) 10.0.3.161 
00:16:3e:38:9f:e2 lptests-temp-5uKcEua
Jun 18 20:40:20 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPREQUEST(lxcbr0) 
10.0.3.37 00:16:3e:9c:ae:16 
Jun 18 20:40:20 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPACK(lxcbr0) 10.0.3.37 
00:16:3e:9c:ae:16 lptests-temp-1HxiwbU
Jun 18 20:40:20 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPREQUEST(lxcbr0) 
10.0.3.62 00:16:3e:69:83:2b 
Jun 18 20:40:20 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPACK(lxcbr0) 10.0.3.62 
00:16:3e:69:83:2b lptests-temp-6o30rvH
Jun 18 20:43:05 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPDISCOVER(lxcbr0) 
10.0.3.123 00:16:3e:0d:7b:65 
Jun 18 20:43:05 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPOFFER(lxcbr0) 
10.0.3.203 00:16:3e:0d:7b:65 
Jun 18 20:43:08 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPDISCOVER(lxcbr0) 
10.0.3.123 00:16:3e:32:5b:99 
Jun 18 20:43:08 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPOFFER(lxcbr0) 10.0.3.68 
00:16:3e:32:5b:99 
Jun 18 20:43:11 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPDISCOVER(lxcbr0) 
10.0.3.123 00:16:3e:7c:21:38 
Jun 18 20:43:11 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPOFFER(lxcbr0) 10.0.3.35 
00:16:3e:7c:21:38 
Jun 18 20:43:14 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPDISCOVER(lxcbr0) 
10.0.3.123 00:16:3e:72:0c:64 
Jun 18 20:43:14 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPOFFER(lxcbr0) 
10.0.3.208 00:16:3e:72:0c:64 
Jun 18 20:43:17 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPDISCOVER(lxcbr0) 
10.0.3.123 00:16:3e:ed:b3:a0 
Jun 18 20:43:17 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPOFFER(lxcbr0) 10.0.3.82 
00:16:3e:ed:b3:a0 
Jun 18 20:43:20 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPDISCOVER(lxcbr0) 
10.0.3.123 00:16:3e:e3:8f:1a 
Jun 18 20:43:20 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPOFFER(lxcbr0) 
10.0.3.247 00:16:3e:e3:8f:1a 
Jun 18 20:43:20 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPDISCOVER(lxcbr0) 
10.0.3.123 00:16:3e:5c:99:20 
Jun 18 20:43:20 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPOFFER(lxcbr0) 
10.0.3.225 00:16:3e:5c:99:20 
Jun 18 20:43:20 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPDISCOVER(lxcbr0) 
10.0.3.123 00:16:3e:ed:b3:a0 
Jun 18 20:43:20 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPOFFER(lxcbr0) 10.0.3.82 
00:16:3e:ed:b3:a0 
Jun 18 20:43:20 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPREQUEST(lxcbr0) 
10.0.3.203 00:16:3e:0d:7b:65 
Jun 18 20:43:20 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPACK(lxcbr0) 10.0.3.203 
00:16:3e:0d:7b:65 lptests-temp-C5po26y
...

What I expect to see is all of the containers reporting without a three
minute break, so that all containers are ready for work much sooner.

I looked in the syslogs of the started containers.  The first one to
connect to dnsmasq *after* the three minute pause showed a three minute
pause in its own syslog.  This is an excerpt from the syslog of that
machine (notice the jump between 20:40 to 20:43).  The "udevtrigger
post-stop process" lines and following dhclient startup lines, down
through the "plymouth-splash" line, are not found in the two fast
machine syslogs that I looked at.

Jun 18 20:40:10 lptests-temp-C5po26y ntpd[542]: frequency initialized 0.000 PPM 
from /var/lib/ntp/ntp.drift
Jun 18 20:40:12 lptests-temp-C5po26y kernel: 6n[8su4vP >25vs3e6e3 rr[2zIr
Jun 18 20:40:12 lptests-temp-C5po26y kernel: <7[9]zPrt26t6pl3 w
Jun 18 20:40:13 lptests-temp-C5po26y kernel: 
03rntt8hot<7[91h:6st42nre72.]Oorr67eq6 44vqvs
Jun 18 20:43:02 lptests-temp-C5po26y init: udevtrigger post-stop process (65) 
terminated with status 1
Jun 18 20:43:02 lptests-temp-C5po26y dhclient: Internet Systems Consortium DHCP 
Client V3.1.3
Jun 18 20:43:02 lptests-temp-C5po26y dhclient: Copyright 2004-2009 Internet 
Systems Consortium.
Jun 18 20:43:02 lptests-temp-C5po26y dhclient: All rights reserved.
Jun 18 20:43:02 lptests-temp-C5po26y dhclient: For info, please visit 
https://www.isc.org/software/dhcp/
Jun 18 20:43:02 lptests-temp-C5po26y dhclient: 
Jun 18 20:43:02 lptests-temp-C5po26y init: plymouth-splash main process (639) 
terminated with status 1
Jun 18 20:43:02 lptests-temp-C5po26y dhclient: Listening on 
LPF/eth0/00:16:3e:0d:7b:65
Jun 18 20:43:02 lptests-temp-C5po26y dhclient: Sending on 
LPF/eth0/00:16:3e:0d:7b:65
Jun 18 20:43:02 lptests-temp-C5po26y dhclient: Sending on Socket/fallback
Jun 18 20:43:02 lptests-temp-C5po26y dhclient: DHCPDISCOVER on eth0 to 
255.255.255.255 port 67 interval 5
Jun 18 20:43:05 lptests-temp-C5po26y dhclient: DHCPOFFER of 10.0.3.203 from 
10.0.3.1
Jun 18 20:43:05 lptests-temp-C5po26y dhclient: DHCPREQUEST of 10.0.3.203 on 
eth0 to 255.255.255.255 port 67

Finally, according to the stats we gathered while doing this, as far as
we could tell, the following are not causes of contention, even when
starting 24 simultaneous containers on a 16 core hyperthreaded machine:

* cpu (never got higher than about 80% utilization, load average still had some 
headroom)
* RAM (never got even half utilization of RAM)
* ioutil% never got above 1 or 2%
* dnsmasq itself does not seem to be an issue, looking at syslog, since we do 
not see restarts there

We plan to do a bit more analysis, as mentioned above, but we think this
data might be good enough to ask for your input at this point.  Thank
you.

** Affects: lxc (Ubuntu)
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Ubuntu
Server Team, which is subscribed to lxc in Ubuntu.
https://bugs.launchpad.net/bugs/1014916

Title:
  simultaneously started lucid containers pause while starting after the
  first seven

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/lxc/+bug/1014916/+subscriptions

-- 
Ubuntu-server-bugs mailing list
Ubuntu-server-bugs@lists.ubuntu.com
Modify settings or unsubscribe at: 
https://lists.ubuntu.com/mailman/listinfo/ubuntu-server-bugs

Reply via email to