Hi,

this is my first post on this mailing list, so I apologize for starting with a problem report.

I am experiencing a rather strange process crash problem inside UML. In short: processes started inside a UML instance crash randomly (but reproducibly) when UML is running on a specific host. Everything works perfectly when the *same* filesystem image and UML kernel are used on other hosts. The filesystem image has been created using debootstrap (detailed commands in the report below).

Examples of reproducible problems that I have been observing:

* The following command always segfaults:

    localedef -i en_US -A /usr/share/locale/locale.alias -f UTF-8 en_US.UTF-8

  A line similar to the following is correspondingly logged by the kernel:

    localedef[49]: segfault at 0 ip 00000000004079fd sp 0000007fbfce84e0 error 4 in localedef[400000+47000]

* After clearing /var/lib/apt/lists, 'apt-get update' successfully downloads some index files, then hangs forever with a message "[Connecting to]" (yes, without a host name). It is not just the output that lacks the server name: a sniffer reveals that apt-get attempts to resolve exactly one more server name than those listed in /etc/apt/sources.list, and this extra server has an empty name (an SRV query is sent for '_http._tcp', which should instead be followed by a valid name, as in '_http._tcp.httpredir.debian.org'). Of course, the DNS server replies with a failure.

* When systemd is used as init, from time to time I have observed crashes of the udev service or failures to write the journal. Errors similar to the following are logged by the UML kernel:

    systemd-journald[58]: Failed to write entry (25 items, 576 bytes), ignoring: Invalid argument

  Please note that the latter symptom has nothing to do with the problem discussed at this URL (which, however, I have also been experiencing in certain experimental settings):
  https://www.mail-archive.com/debian-bugs-dist@lists.debian.org/msg1440903.html
  Indeed, I am *not* using systemd as init in the tests I am reporting about below.
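
The fields of the kernel segfault line quoted above can be decoded mechanically. The following is only a minimal sketch: decode_segfault is a hypothetical helper name, and the interpretation of the "error" value assumes that the UML kernel reports the usual x86 page-fault error code bits (bit 0: protection violation vs. page not present, bit 1: write vs. read, bit 2: user vs. kernel mode).

```shell
#!/bin/sh
# decode_segfault: hypothetical helper (not an existing tool) that splits a
# kernel "segfault at ..." line into its fields and interprets the error
# code, ASSUMING the standard x86 page-fault bit layout.
decode_segfault() {
    sf_line=$1
    # All numeric fields in the log line are hexadecimal without a 0x prefix.
    sf_addr=$(printf '%s\n' "$sf_line" | sed -n 's/.*segfault at \([0-9a-f]*\) ip .*/\1/p')
    sf_ip=$(printf '%s\n' "$sf_line" | sed -n 's/.* ip \([0-9a-f]*\) sp .*/\1/p')
    sf_err=$(printf '%s\n' "$sf_line" | sed -n 's/.* error \([0-9a-f]*\) in .*/\1/p')
    sf_base=$(printf '%s\n' "$sf_line" | sed -n 's/.*\[\([0-9a-f]*\)+[0-9a-f]*\]$/\1/p')

    sf_e=$((0x$sf_err))

    # Address whose access triggered the fault.
    printf 'fault_addr=0x%x\n' "$((0x$sf_addr))"
    # Offset of the faulting instruction from the start of the mapping.
    printf 'offset=0x%x\n' "$((0x$sf_ip - 0x$sf_base))"

    if [ $(( $sf_e & 1 )) -ne 0 ]; then
        echo 'cause=protection-violation'
    else
        echo 'cause=page-not-present'
    fi
    if [ $(( $sf_e & 2 )) -ne 0 ]; then
        echo 'access=write'
    else
        echo 'access=read'
    fi
    if [ $(( $sf_e & 4 )) -ne 0 ]; then
        echo 'mode=user'
    else
        echo 'mode=kernel'
    fi
}

decode_segfault 'localedef[49]: segfault at 0 ip 00000000004079fd sp 0000007fbfce84e0 error 4 in localedef[400000+47000]'
```

Under those assumptions, the logged line describes a user-mode read from address 0 (a NULL pointer dereference) by the instruction at offset 0x79fd from the start of the localedef mapping.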

I have performed tests with several host/UML/filesystem/configuration combinations, as explained in the following.

==============================================================

To document test outcomes, I am using the following abbreviations:

- Host A: Intel i7-6700 3.4GHz, 32GB RAM, running Debian GNU/Linux testing (stretch) with kernel 4.6.0-1-amd64 (supplied by Debian, no modifications)

- Host B: VirtualBox 5.1.0r108711 VM with 1 (virtual) CPU, PIIX3 chipset, no execution cap, 7GB RAM, running Ubuntu 16.04.1 LTS (xenial) with kernel 4.4.0-31-generic (supplied by Ubuntu, no modifications), on top of Host A

- Host C: VirtualBox 5.1.0r108711 VM with 4 (virtual) CPUs, PIIX3 chipset, no execution cap, 2GB RAM, running Ubuntu 16.04 LTS (xenial) with kernel 4.4.0-28-generic (supplied by Ubuntu, no modifications), on top of a host with an Intel i7 2.2GHz CPU and 8GB RAM running Mac OS X 10.11.6 (El Capitan)

- UML 1: 4.3.2 64-bit UML kernel, custom compiled into a static executable from vanilla sources, with a few patches of little relevance applied (https://github.com/maxonthegit/netkit/tree/master/devel/kernel-patches). For reference, the configuration file is https://github.com/maxonthegit/netkit/blob/master/devel/netkit-kernel-config-4.2.1-x86_64

- UML 2: 4.3.5 64-bit UML kernel, obtained from http://uml.devloop.org.uk

- Network SLIRP: using slirp as transport, compiled from the Debian source package (https://packages.debian.org/source/stretch/slirp) with Debian patches applied (by 'apt-get source') and FULL_BOLT enabled.
  UML command line argument: eth0=slirp,,/path/to/slirp

- Network NAT: using tuntap as transport, and enabling netfilter's masquerading on the host. Set up on the host using the following commands:

    tunctl -u user -g user -t tun
    ifconfig tun 10.0.0.1 up
    echo 1 >/proc/sys/net/ipv4/ip_forward
    iptables -t nat -A POSTROUTING -j MASQUERADE

  A corresponding default route is set up inside UML.
  UML command line argument: eth0=tuntap,tun

- Suite: I tried to run UML on filesystem images created using debootstrap for both the 'sid' (unstable) and the 'stretch' (testing) Debian releases. I used the following commands to generate the images:

    dd if=/dev/zero of=/path/to/fs/image.img bs=1 count=0 seek=10G
    /sbin/mkfs.ext4 -t ext4 -F /path/to/fs/image.img
    mount -o loop /path/to/fs/image.img fs-mount-location
    debootstrap --include=debconf-utils,locales sid fs-mount-location
    cat > fs-mount-location/startup.sh <<EOF
    #!/bin/bash
    mount -t proc none /proc
    mount -t sysfs none /sys
    mount -t tmpfs none /run
    mountpoint /dev || mount -t devtmpfs none /dev
    echo "nameserver 8.8.8.8" > /etc/resolv.conf
    ip addr add 10.0.2.15/8 dev eth0
    ip link set eth0 up
    ip route add default dev eth0
    /bin/bash
    EOF
    chmod +x fs-mount-location/startup.sh
    umount fs-mount-location

  I have been using the *same* 2 filesystem image files for all the tests, reverting them to their original state after each test from a backup copy.

Test outcomes are as follows:

- OK: everything works flawlessly inside UML.
- FAIL: software crashes are observed inside UML, as in the examples discussed above (localedef, apt-get, sometimes systemd-journald).

==============================================================

The UML command line was as follows:

    kernel-x86_64 umid=test-vm mem=1073741824 ubd0=image.img rw con=null con0=fd:0,fd:1 eth0=<as above> init=/startup.sh

Here are the outcomes:

+------+-----+---------+---------+---------+
| Host | UML | Network | Suite   | OUTCOME |
+------+-----+---------+---------+---------+
| A    | 1   | SLIRP   | sid     | FAIL    |
| A    | 1   | NAT     | sid     | FAIL    | <= strace
| A    | 2   | SLIRP   | sid     | FAIL    |
| A    | 2   | NAT     | sid     | FAIL    |
| B    | 1   | SLIRP   | sid     | OK      |
| B    | 1   | NAT     | sid     | OK      |
| B    | 2   | SLIRP   | sid     | OK      |
| B    | 2   | NAT     | sid     | OK      |
| C    | 1   | SLIRP   | sid     | OK      | <= strace
| C    | 1   | NAT     | sid     | OK      |
| C    | 2   | SLIRP   | sid     | OK      |
| C    | 2   | NAT     | sid     | OK      |
| A    | 1   | SLIRP   | stretch | FAIL    |
| A    | 1   | NAT     | stretch | FAIL    |
| A    | 2   | SLIRP   | stretch | FAIL    |
| A    | 2   | NAT     | stretch | FAIL    |
| B    | 1   | SLIRP   | stretch | OK      |
| B    | 1   | NAT     | stretch | OK      |
| B    | 2   | SLIRP   | stretch | OK      |
| B    | 2   | NAT     | stretch | OK      |
| C    | 1   | SLIRP   | stretch | OK      |
| C    | 1   | NAT     | stretch | OK      |
| C    | 2   | SLIRP   | stretch | OK      |
| C    | 2   | NAT     | stretch | OK      |
+------+-----+---------+---------+---------+

So, one could come to an easy conclusion: something is wrong with host A. But what? Here are some additional clues:

- I have performed an 'apt-get upgrade' on host A only a few days ago.
- Everything else works perfectly on host A.
- The problem occurs with two different host kernel versions (4.4.0-1-amd64 and 4.6.0-1-amd64).
- If I 'chroot' into the *same* filesystem image(s) used for the UML tests, everything works perfectly on host A.
- Changing the amount of RAM assigned to UML instances does not solve the problem on host A.
- It does not depend on the processes running on host A: I made a test after booting the host with init=/bin/bash and the problem still occurred.
- Although I am assigning the same amount of RAM to all the UML instances (purposely without 'M' or 'G' suffixes to avoid ambiguities), when UML boots it reports slightly different memory amounts:

  On host A:
    Memory: 1024952K/1056060K available (3108K kernel code, 862K rwdata, 1032K rodata, 121K init, 294K bss, 31108K reserved, 0K cma-reserved)
  On host C:
    Memory: 1024824K/1063076K available (3108K kernel code, 862K rwdata, 1032K rodata, 121K init, 294K bss, 38252K reserved, 0K cma-reserved)

- I have collected straces of a working and a failing run of localedef, in the scenarios marked with '<= strace' in the table above, using:

    strace -r -s 256 -T -yy -o out-strace-file -ff -v /usr/bin/localedef -i en_US -A /usr/share/locale/locale.alias -f UTF-8 en_US.UTF-8

  These straces are available at the following addresses:

    http://pastebin.com/bfjPERkK (failing, PID 378)
    http://pastebin.com/BaWTCvjn (failing, PID 379)
    http://pastebin.com/pCTttzPz (working, PID 56)
    http://pastebin.com/ejXseFsq (working, PID 57)

  After sanitization (removing timestamps, call durations and bracketed numbers), it is pretty evident that everything is essentially identical up to the SIGSEGV at a 'brk' call. The following command can be used to compare a sane and a faulty output:

    diff -y \
      <(awk '{$1=""; $NF=""; print}' localedef-strace-failing.378 | sed -r 's/\[[0-9]+\]//g') \
      <(awk '{$1=""; $NF=""; print}' localedef-strace-working.56 | sed -r 's/\[[0-9]+\]//g') \
      | less -S

==============================================================

I have more or less run out of ideas about how to overcome this problem. Any suggestions are welcome. Thank you very much even for just reading this far.

As a side note, this resembles the kind of problem that was previously reported here:
https://sourceforge.net/p/user-mode-linux/mailman/message/34978201/

Regards,
Massimo
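
For reference, the strace comparison described above boils down to finding the first line at which the two sanitized traces differ. The following is a minimal sketch of that idea, not the exact procedure used for this report: normalize_strace and first_divergence are hypothetical helper names, and the sanitization simply repeats the awk/sed filters of the diff command (drop the first and last field of every line, then drop bracketed numbers).

```shell
#!/bin/sh
# normalize_strace: hypothetical helper that strips the relative timestamp
# (first field, from 'strace -r'), the call duration (last field, from
# 'strace -T') and bracketed numbers from one strace output file.
normalize_strace() {
    awk '{ $1 = ""; $NF = ""; print }' "$1" | sed -E 's/\[[0-9]+\]//g'
}

# first_divergence: hypothetical helper that prints the first line number at
# which two sanitized traces differ, together with both versions of the line.
# It prints nothing when the second trace matches the first one line by line.
first_divergence() {
    fd_a=$(mktemp)
    fd_b=$(mktemp)
    normalize_strace "$1" > "$fd_a"
    normalize_strace "$2" > "$fd_b"
    awk '
        NR == FNR { left[FNR] = $0; next }      # remember the first trace
        left[FNR] != $0 {                       # compare the second against it
            printf "first difference at line %d\n< %s\n> %s\n", FNR, left[FNR], $0
            exit
        }
    ' "$fd_a" "$fd_b"
    rm -f "$fd_a" "$fd_b"
}

# Example invocation (file names as in the diff command above):
#   first_divergence localedef-strace-failing.378 localedef-strace-working.56
```

Unlike the side-by-side diff, this only reports the first mismatch, which is enough to locate the system call after which the failing run departs from the working one.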
------------------------------------------------------------------------------
_______________________________________________
User-mode-linux-user mailing list
User-mode-linux-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/user-mode-linux-user