Since nobody else has replied, comments are inline below.

> -----Original Message-----
> I'm trying to run about 50 UML hosts on a dual-processor (each
> dual-core) Opteron system, running 64-bit Ubuntu 6.06LTS.

In my experience, this is a LOT for one machine. Remember that each UML is probably dealing with 20-50 subprocesses. This means you're attempting to run between 1000 and 2500 processes simultaneously on your machine. That's a lot of work, even for a modern computer. I would consider backing this down to 15-25 UML instances per host. Out of curiosity, what's the load average?

> I have 4 GB of physical RAM in the Ubuntu box + 3 GB swap. It is
> running a vanilla 2.6.17.8 kernel (no skas3 patches, because I haven't
> successfully managed to apply any sets of skas3 patches)

Here is an interesting problem. Each instance has 48M. You're launching 50 instances. This means you're allocating up to 2.4 GB for instances. However, you're also giving them 3 GB of swap. Do they need the swap? If not, get rid of it (or at least lock your existing memory into RAM). My worst fear in your case would be each of these instances becoming aggressive with memory management and trying to swap processes out to disk. Disk IO will kill you if you're not careful. In fact, I would guess that this is your problem. Active processes are probably being pushed out to disk, and the wait time for IO (with IO contention) is probably causing the network timeouts.

> /home/uml/kernel64 ubda=/home/$user/root_fs ubdb=/home/$user/swap
> eth0=daemon,08:00:07:26:c0:04,unix,/opt/uml/run/uml_switch.ctl mem=48M

Also, try using a bridge instead of the daemon. This will move some of your network activity into the host kernel. I used to have instructions for this on the wiki. However, the networking link appears to have been disabled.

> Here's my list of troubles:
>
> the guest machines are dropping off the network regularly. They may be
> pingable for a few minutes, but then dhclient seems to crash on them.
> stopping them and restarting them puts them back on the network for a
> while, but invariably, the network dies again.
>
> guest machines randomly crash.
> They aren't sending me anything via
> syslog (I've configured them to log to the host machine), and they don't
> appear to log anything to STDOUT. I can make machines crash with the
> message in the subject line, by simply working them too hard (e.g. log
> in, fire off a few dozen shells, add about 1000 users as fast as bash
> can go through a for loop). My students, however (up to 25 working
> simultaneously on UML guests), are making the machines crash by doing
> nothing more than adding two or three users manually, looking at man
> pages, and typing 'ls'.

50 machines doing ls (including swapping memory to and fro) could cause enough IO to produce significant pauses. However, I can't explain the crashes. UML has been quite stable for me even under heavy load.

> I've tried to build a debugging kernel, and successfully started the
> kernel in gdb, and got it running. Unfortunately, when I got the
> message in the subject line, it blew me all the way out of gdb.

Unfortunately, I can't help with this. Have you looked at the debugging instructions for modules on the wiki? They may give you some pointers: http://uml.harlowhill.com/index.php/DebuggingModules

> I have been experiencing these crashes using guest kernels 2.6.17.11,
> 2.6.18-rc4, 2.6.18-rc5.

Another possibility is downgrading your guest kernel to a 2.6.16.x kernel. I don't know about a 64-bit kernel, but it helped me with a 32-bit kernel. 2.6.16.x does not support TLS, but I've had problems with some processes not wanting to start with 32-bit 2.6.17.x and newer kernels. Unfortunately, nobody on the list is interested in diagnosing this issue.

> One possible thing which may be connected to the general network failure
> is the fact that when I'm starting these machines, I dare not start them
> any faster than 1 machine per 40 seconds. If I start the machines at
> that rate, they will successfully DHCP an address. If I start them,
> say, every 10 seconds, most machines will never grab an address.
> In fact, the DHCP server never sees their request come across tap0.

I would suspect that IO is the culprit. Is it obvious that I've had a lot of problems with disk IO? Try starting the first 3 at 10-second intervals. Do you still have the same problem?

> My math indicates that I should be able to fit 50 48MB machines in 4 GB
> of RAM, but I'm willing to further constrain these machines if that will
> help.
>
> I've disabled /lib/tls and /lib64/tls on both host and guest operating
> systems.

This will help with 2.6.16.x and older kernels (on 32-bit). However, newer kernels (such as the ones you're using) have TLS support, at least on 32-bit. Unfortunately, I'm not that familiar with 64-bit guest kernels.

Also, try the same thing with 32-bit guest kernels (2.6.17.x and newer). Leave the host 64-bit. Do you experience the same crashing issue?

Tony
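
The process and memory arithmetic in the reply above can be checked with a rough sketch like the one below. It uses only the figures quoted in this thread (20-50 processes per UML instance, mem=48M per instance) and decimal gigabytes, which is how the 2.4 GB figure comes out. The function name `estimate` and its parameters are hypothetical, chosen for illustration; nothing here is part of UML itself.

```python
def estimate(instances, mem_mb=48, procs_low=20, procs_high=50):
    """Return (min_procs, max_procs, guest_mem_gb) for a number of UML instances.

    All defaults are the figures from the thread, not measured values.
    """
    min_procs = instances * procs_low
    max_procs = instances * procs_high
    # Decimal GB (1 GB = 1000 MB), an assumption made to match the 2.4 GB figure.
    guest_mem_gb = instances * mem_mb / 1000
    return min_procs, max_procs, guest_mem_gb


# Compare the suggested 15-25 instances with the original 50.
for n in (15, 25, 50):
    lo, hi, gb = estimate(n)
    print(f"{n} instances: {lo}-{hi} processes, up to {gb:.1f} GB of guest memory")
```

This counts only the memory handed to the guests via mem=; host overhead, page cache, and the guests' own swap files are not modelled, so it is a lower bound on what the host has to juggle.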
