On Wednesday, March 30, 2011 11:21:04 AM Alvin wrote:
> On Wednesday 30 March 2011 14:52:14 Serge E. Hallyn wrote:
> > Quoting Scott Kitterman ([email protected]):
> > > There was a lot of discussion around improving the server boot
> > > experience before the UDS-M. A number of people expressed interest in
> > > seeing more useful diagnostic information during boot. Others
> > > expressed concerns with boot reliability on the more complex hardware
> > > typically found in servers.
> > >
> > > How are we doing on this? Personally, I can't remember the last time I
> > > rebooted a server and it wasn't via SSH and the hardware I use is the
> > > sort there were problems with. Are these still issues for the Ubuntu
> > > Server community?
> > >
> > > Scott K
> >
> > I think right now these issues are overshadowed by the fact that a
> > great deal of server software is not yet upstartified. I think that
> > needs to be addressed for O.
>
> Yes, they are certainly still issues (and the primary reason the company I
> work for is abandoning Ubuntu.)
>
> I agree that a lot of servers are not often rebooted, but not every server
> is a webserver. Some are used only during certain hours and can be booted
> automatically (BIOS or WOL) when needed in order to keep the electricity
> bill down. Booting should be a reliable and automated process. Accurate
> logging is important in order to know what went wrong in case the
> unthinkable happens.
>
> The current boot.log looks like:
> > mount.nfs: DNS resolution failed for 192.168.xxx.3: Name or service not known
>
> > mount.nfs4: Failed to resolve server exampleserver: Name or service not known
>
> > mountall: mount /srv/example [1134] terminated with status 32
> > mount error(101): Network is unreachable
>
> while in reality filesystems are mounted. Now, when something goes wrong,
> the log is identical. Conclusion: boot.log is useless. (Actually, the log
> is probably correct.
> It can't resolve server names at that specific time.)
> Proper boot logging would be popular[1].
>
> Take the following example of a server boot. Let's also assume that nothing
> goes wrong that could lead to a busybox console. (It certainly can![2][3])
> So, you're now sitting in front of a nice prompt. Everything looks ok, but
> is it? The server mounts NFS shares from another server, it runs
> KVM/libvirt with a netfs storage pool for its virtual machines and a
> quasselcore for IRC that stores its data in a PostgreSQL database on
> another server. The local filesystem uses mdadm for RAID1 and LVM on top
> of that. Very server-like. (I once made this setup to test some things.)
> In order to keep things under control, there are /no/ LVM snapshots. That
> is another ugly story.
>
> So, what happens now:
> - The RAID will be broken! [4][5]
> - The NFS shares in /etc/fstab might not be mounted, [6][7]
>   even when you told the system to wait with _netdev. [8]
> - Your virtual machines on netfs will not be running. [9]
> - The quasselcore with external db will not be started. [10]
>
> The array can be assembled by running a command and all of the above
> daemons can be started manually.
>
> I talked about some of those topics on IRC, and the following workarounds
> came up. There are also some workarounds in the bug reports.
> - Put NFS shares in /etc/fstab, and don't configure them as netfs storage
>   pools.
> - Put the IP addresses of your NFS servers in /etc/hosts.
>
> For most servers, speeding up the boot process is less important than
> reliability. Why not take a look at how Debian does it? You can disable
> running the boot scripts in parallel with 'CONCURRENCY=none' in
> /etc/default/rcS.
>
> Also, think about daemons of commercial software without upstart scripts.
> You never know whether they will start at boot or not.
>
> Links:
> [1] "init: support logging of job output"
> https://bugs.launchpad.net/bugs/328881
>
> [2] "Gave up waiting for root device after upgrade then busybox console"
> https://bugs.launchpad.net/bugs/360378
>
> [3] "karmic rc: root device sometimes not found"
> https://bugs.launchpad.net/bugs/460914
>
> [4] "mdadm cannot assemble array as cannot open drive with O_EXCL"
> https://bugs.launchpad.net/bugs/27037
>
> [5] "mdadm cannot assemble array"
> https://bugs.launchpad.net/bugs/599135
>
> [6] "nfs mounts specified in fstab is not mounted on boot."
> https://bugs.launchpad.net/bugs/275451
>
> [7] "nfs shares are not automounted anymore in intrepid"
> https://bugs.launchpad.net/bugs/285013
>
> [8] "_netdev not working"
> https://bugs.launchpad.net/bugs/384347
>
> [9] "Libvirt NFS mount on boot."
> https://bugs.launchpad.net/bugs/351307
>
> [10] "quasselcore does not connect to database at boot"
> https://bugs.launchpad.net/bugs/612729

This is exactly the kind of detailed feedback I was hoping to get. Thank you.
I suspect we'll need to have several UDS sessions around server boot in order
to lay out a comprehensive plan of attack. The release before the next LTS is
definitely the cycle to hit this.

Scott K

--
ubuntu-server mailing list
[email protected]
https://lists.ubuntu.com/mailman/listinfo/ubuntu-server
More info: https://wiki.ubuntu.com/ServerTeam
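
Since boot.log reportedly looks the same whether or not the NFS mounts succeeded, one way to answer "everything looks ok, but is it?" after boot is to compare what /etc/fstab declares with what is actually mounted. The following is only a rough sketch of that idea, not something proposed in the thread: the function names, the set of filesystem types treated as "network" and the decision to ignore mount options such as noauto are all assumptions made for illustration.

```python
# Hypothetical post-boot check: which network filesystems listed in fstab
# are missing from the kernel's mount table?

# Assumption: these are the filesystem types we care about.
NETWORK_FS_TYPES = {"nfs", "nfs4", "cifs"}


def parse_mount_table(text):
    """Return (device, mount_point, fs_type) tuples from fstab-style text."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 3:
            continue  # not a complete entry
        entries.append((fields[0], fields[1], fields[2]))
    return entries


def missing_network_mounts(fstab_text, mounts_text):
    """List network mount points declared in fstab but not currently mounted.

    Limitation: mount options are ignored, so 'noauto' entries are reported
    too, and mount points are compared as plain strings.
    """
    mounted = {mount_point for _, mount_point, _ in parse_mount_table(mounts_text)}
    return [
        mount_point
        for _, mount_point, fs_type in parse_mount_table(fstab_text)
        if fs_type in NETWORK_FS_TYPES and mount_point not in mounted
    ]


def check_system(fstab_path="/etc/fstab", mounts_path="/proc/mounts"):
    """Run the comparison against the live system (Linux paths assumed)."""
    with open(fstab_path) as fstab, open(mounts_path) as mounts:
        return missing_network_mounts(fstab.read(), mounts.read())
```

Calling check_system() from a cron @reboot job or a late init script and mailing any non-empty result would at least flag the silent failure case, although it does nothing about the underlying ordering problem.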
