Re: [uml-user] Kernel Not Syncing: Bug!

Anthony Brock Fri, 01 Sep 2006 08:24:15 -0700

Since nobody else has replied, comments are inline below.

> -----Original Message-----
> I'm trying to run about 50 UML hosts on a dual-processor (each
> dual-core) Opteron system, running 64-bit Ubuntu 6.06LTS .


In my experience, this is a LOT for one machine. Remember that each UML is
probably dealing with 20-50 subprocesses. This means you're attempting to run
between 1000 and 2500 processes simultaneously on your machine. That's a lot
of work, even for a modern computer. I would consider backing this down to
15-25 UML instances per host.
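
To put numbers on that, here is a quick back-of-the-envelope sketch in
Python. The 20-50 processes per instance is only my rough estimate, not a
measurement, and the function name is made up for illustration.

```python
def host_process_range(instances, per_instance_low=20, per_instance_high=50):
    """Return (low, high) estimates of host processes for N UML instances.

    The per-instance bounds are a rough guess, not measured values.
    """
    return instances * per_instance_low, instances * per_instance_high


# Compare the suggested 15-25 instances against the 50 being attempted.
for n in (15, 25, 50):
    low, high = host_process_range(n)
    print(f"{n} instances: roughly {low}-{high} host processes")
```

Checking the real count on your host (for example with ps) would tell you
whether my per-instance guess is anywhere near right.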

Out of curiosity, what's the load average?

> I have 4 GB of physical RAM in the Ubuntu box + 3 GB swap.  It is
> running a vanilla 2.6.17.8 kernel (no skas3 patches, because I haven't
> successfully managed to apply any sets of skas3 patches)

Here is an interesting problem. Each instance has 48M. You're launching 50
instances. This means you're allocating up to 2.4 GB for instances. However,
you're also giving them 3 GB of swap. Do they need the swap? If not, get rid
of it (or at least lock your existing memory into RAM).
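
The same arithmetic as a small Python sketch, using the 48M per instance and
4 GB of host RAM from your message. It deliberately ignores host overhead
(page cache, the host's own processes, per-instance bookkeeping), so treat
the headroom figure as optimistic. The names are mine, for illustration only.

```python
def guest_memory_mb(instances, mem_per_instance_mb=48):
    """Total memory the guests may claim, in MB (mem=48M per instance)."""
    return instances * mem_per_instance_mb


HOST_RAM_MB = 4 * 1024  # 4 GB of physical RAM, as stated in the original post

total_mb = guest_memory_mb(50)
headroom_mb = HOST_RAM_MB - total_mb  # what is left for the host itself

print(f"guests: {total_mb} MB, host headroom: {headroom_mb} MB")
```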

My worst fear in your case would be each of these instances becoming
aggressive with memory management and trying to swap processes out to disk.
Disk IO will kill you if you're not careful. In fact, I would guess that this
is your problem. Active processes are probably being pushed out to disk and
the wait time for IO (with IO contention) is probably causing the network
timeouts.

> /home/uml/kernel64 ubda=/home/$user/root_fs ubdb=/home/$user/swap
> eth0=daemon,08:00:07:26:c0:04,unix,/opt/uml/run/uml_switch.ctl mem=48M

Also, try using a bridge instead of the daemon. This will move some of your
network activity into the host kernel. I used to have instructions for this
on the wiki. However, the networking link appears to have been disabled.
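
Since the wiki page is gone, here is a rough sketch of what I mean, written
as Python that only builds the command strings and does not run anything. It
assumes a bridge (called br0 here, a name I made up) already exists on the
host, that tunctl and brctl are installed, and that each guest gets its own
tap device. The tap name and user below are placeholders, so check
everything against your own setup before running any of it as root.

```python
def bridge_setup_commands(user, tap, bridge="br0"):
    """Host-side commands (to be run as root) for one guest's tap device.

    Assumes the bridge already exists; 'br0' is a placeholder name.
    """
    return [
        f"tunctl -u {user} -t {tap}",   # create a persistent tap owned by user
        f"ip link set {tap} up",        # bring the tap interface up
        f"brctl addif {bridge} {tap}",  # attach the tap to the bridge
    ]


def uml_eth_arg(tap, mac):
    """UML argument using the tuntap transport with a preconfigured tap."""
    return f"eth0=tuntap,{tap},{mac}"


# Example with placeholder names; the MAC is the one from the quoted command.
for cmd in bridge_setup_commands("student1", "tap1"):
    print(cmd)
print(uml_eth_arg("tap1", "08:00:07:26:c0:04"))
```

The eth0=tuntap,... argument would replace the eth0=daemon,... argument on
your UML command line.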

> Here's my list of troubles:
>
> the guest machines are dropping off the network regularly.  They may be
> pingable for a few minutes, but then dhclient seems to crash on them.
> stopping them and restarting them puts them back on the network for a
> while, but invariably, the network dies again.
>
> guest machines randomly crash.  They aren't sending me anything via
> syslog (I've configured them to log to the host machine), and they don't
> appear to log anything to STDOUT.  I can make machines crash with the
> message in the subject line, by simply working them too hard (e.g. log
> in, fire off a few dozen shells, add about 1000 users as fast as bash
> can go through a for loop).  My students, however (up to 25 working
> simultaneously on UML guests), are making the machines crash by doing
> nothing more than adding two or three users manually, looking at man
> pages, and typing 'ls'.

50 machines doing ls (including swapping memory to and fro) could cause
enough IO to cause significant pauses. However, I can't explain the crashes.
UML has been quite stable for me even under heavy load.

> I've tried to build a debugging kernel, and successfully started the
> kernel in gdb, and got it running.  Unfortunately, when I got the
> message in the subject line, it blew me all the way out of gdb.

Unfortunately, I can't help with this. Have you looked at the debugging
instructions for modules on the wiki? They may give you some pointers:

http://uml.harlowhill.com/index.php/DebuggingModules

> I have been experiencing these crashes using guest kernels 2.6.17.11,
> 2.6.18-rc4, 2.6.18-rc5.

Another possibility is downgrading your guest kernel to a 2.6.16.x kernel. I
don't know about a 64-bit kernel, but it helped me with a 32-bit kernel.
The 2.6.16.x series does not support TLS, but I've had problems with some
processes not wanting to start with 32-bit 2.6.17.x and newer kernels.
Unfortunately, nobody on the list is interested in diagnosing this issue.

> One possible thing which may be connected to the general network failure
> is the fact that when I'm starting these machines, I dare not start them
> any faster than 1 machine per 40 seconds.  If I start the machines at
> that rate, they will successfully DHCP an address.  If I start them,
> say, every 10 seconds, most machines will never grab an address.  In
> fact, the DHCP server never sees their request come across tap0.

I would suspect that IO is the culprit. Is it obvious that I've had a lot of
problems with disk IO?

Try starting the first 3 at 10 second intervals. Do you still have the same
problem?
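
If it helps to script that test, here is a minimal Python sketch of a
staggered launcher. The function names are hypothetical, and the command
lists you pass in would be your own UML command lines. I have not run this
against real UML instances.

```python
import subprocess
import time


def start_offsets(count, interval_s):
    """Seconds after the first launch at which each instance is started."""
    return [i * interval_s for i in range(count)]


def launch_staggered(commands, interval_s,
                     runner=subprocess.Popen, sleep=time.sleep):
    """Start each command in turn, pausing interval_s seconds between starts.

    'commands' is a list of argument lists, one per instance. The runner and
    sleep parameters exist only so the function can be dry-run without
    launching anything.
    """
    procs = []
    for i, cmd in enumerate(commands):
        if i > 0:
            sleep(interval_s)  # wait before every launch except the first
        procs.append(runner(cmd))
    return procs


# The test suggested above: three instances, ten seconds apart.
print(start_offsets(3, 10))
```

Varying the interval between 10 and 40 seconds with only a handful of
instances should show whether the DHCP failures track the start rate or the
total number of running guests.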

> My math indicates that I should be able to fit 50 48MB machines in 4 GB
> of RAM, but I'm willing to further constrain these machines if that will
> help.
>
> I've disabled /lib/tls and /lib64/tls on both host and guest operating
> systems.

This will help with 2.6.16.x and older kernels (on 32-bit). However, newer
kernels (such as the ones you're using) have TLS support, at least on
32-bit. Unfortunately, I'm not that familiar with 64-bit guest kernels.

Also, try the same thing with 32-bit guest kernels (2.6.17.x and newer).
Leave the host 64-bit. Do you experience the same crashing issue?

Tony


_______________________________________________
User-mode-linux-user mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/user-mode-linux-user