Hi,

this is my first appearance on this mailing list, therefore I apologize for 
writing a problem report as first post.

I am experiencing a rather strange process crash problem inside UML. In short: 
processes started inside a UML instance randomly (but reproducibly) crash 
when UML is running on a specific host. Everything works perfectly when the 
*same* filesystem image and UML kernel are used on other hosts.
The filesystem image has been created using debootstrap (detailed commands in 
the report below).
Examples of reproducible problems that I have been observing:
* The following command always segfaults:
    localedef -i en_US -A /usr/share/locale/locale.alias -f UTF-8 en_US.UTF-8
A line similar to the following is correspondingly logged by the kernel:
    localedef[49]: segfault at 0 ip 00000000004079fd sp 0000007fbfce84e0 error 4 in localedef[400000+47000]
* After clearing /var/lib/apt/lists, 'apt-get update' successfully downloads 
some index files, then hangs forever with a message "[Connecting to]" (yes, 
without a host name). It is not just the output that misses the server name: a 
sniffer reveals attempts by apt-get to resolve exactly one more server name 
than those listed in /etc/apt/sources.list, but this excess server has an empty 
name (a SRV query is sent for '_http._tcp', which instead should be followed by 
a valid name as in '_http._tcp.httpredir.debian.org'). Of course, the DNS 
server replies with a failure.
* When systemd is used as init, from time to time I have observed crashes of 
the udev service or failures to write the journal. Errors similar to the 
following are logged by the UML kernel:
    systemd-journald[58]: Failed to write entry (25 items, 576 bytes), ignoring: Invalid argument
Please consider that the latter symptom has nothing to do with the problem 
discussed at this URL (which, however, I have also been experiencing in certain 
experimental settings): 
https://www.mail-archive.com/debian-bugs-dist@lists.debian.org/msg1440903.html. 
Indeed, I am *not* using systemd as init in the tests I am reporting about below.

I have performed tests with several host/UML/filesystem/configuration 
combinations, as explained in the following.

==============================================================

To document test outcomes, I am using the following abbreviations:

- Host A: Intel i7-6700 3.4GHz, 32GB RAM, running Debian GNU/Linux testing 
(stretch) with kernel 4.6.0-1-amd64 (supplied by Debian, no modifications)
- Host B: VirtualBox 5.1.0r108711 VM with 1 (virtual) CPU, PIIX3 chipset, no 
execution cap, 7GB RAM, running Ubuntu 16.04.1 LTS (xenial) with kernel 
4.4.0-31-generic (supplied by Ubuntu, no modifications), on top of Host A
- Host C: VirtualBox 5.1.0r108711 VM with 4 (virtual) CPUs, PIIX3 chipset, no 
execution cap, 2GB RAM, running Ubuntu 16.04 LTS (xenial) with kernel 
4.4.0-28-generic (supplied by Ubuntu, no modifications), on top of a host with 
an Intel i7 2.2GHz CPU, 8GB RAM running Mac OS X 10.11.6 (El Capitan)

- UML 1: 4.3.2 64-bit UML kernel, custom compiled into a static executable from 
vanilla, with a few patches of little relevance applied 
(https://github.com/maxonthegit/netkit/tree/master/devel/kernel-patches). For 
reference, the configuration file is 
https://github.com/maxonthegit/netkit/blob/master/devel/netkit-kernel-config-4.2.1-x86_64
- UML 2: 4.3.5 64-bit UML kernel, obtained from http://uml.devloop.org.uk

- Network SLIRP: using slirp as transport, compiled from the Debian source 
package (https://packages.debian.org/source/stretch/slirp) with Debian patches 
applied (by 'apt-get source') and FULL_BOLT enabled.
UML command line argument: eth0=slirp,,/path/to/slirp
- Network NAT: using tuntap as transport, and enabling netfilter's masquerading 
on the host. Set up on the host using the following commands:
  tunctl -u user -g user -t tun
  ifconfig tun 10.0.0.1 up
  echo 1 >/proc/sys/net/ipv4/ip_forward
  iptables -t nat -A POSTROUTING -j MASQUERADE
A corresponding default route is set up inside UML.
UML command line argument: eth0=tuntap,tun

- Suite: I tried to run UML on filesystem images created using debootstrap for 
both the 'sid' (unstable) and the 'stretch' (testing) Debian releases. I used the 
following commands to generate the images:
  dd if=/dev/zero of=/path/to/fs/image.img bs=1 count=0 seek=10G
  /sbin/mkfs.ext4 -t ext4 -F /path/to/fs/image.img
  mount -o loop /path/to/fs/image.img fs-mount-location
  debootstrap --include=debconf-utils,locales sid fs-mount-location
  cat > fs-mount-location/startup.sh <<EOF
  #!/bin/bash
    mount -t proc none /proc
    mount -t sysfs none /sys
    mount -t tmpfs none /run
    mountpoint /dev || mount -t devtmpfs none /dev
    echo "nameserver 8.8.8.8" > /etc/resolv.conf
    ip addr add 10.0.2.15/8 dev eth0
    ip link set eth0 up
    ip route add default dev eth0
    /bin/bash
  EOF
  chmod +x fs-mount-location/startup.sh
  umount fs-mount-location
I have been using the *same* two filesystem image files for all the tests, 
reverting them to the original state after each test by keeping a backup copy.
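
As a minimal sketch of the revert step described above: the images are created 
as sparse files (dd with count=0 and seek=10G), so restoring them with a 
sparse-aware copy avoids writing out the full 10G each time. The function name 
and the file names are hypothetical, and '--sparse=always' assumes GNU cp.

```shell
# Hypothetical helper: restore a working image from its pristine backup.
# Neither argument is a path taken from the report; both are placeholders.
revert_image() {
    backup=$1
    image=$2
    # --sparse=always (GNU cp) keeps holes in the restored copy, so a mostly
    # empty 10G image does not occupy 10G on disk after each revert.
    cp --sparse=always -- "$backup" "$image"
}
```

It could be called as 'revert_image image.img.orig image.img' after each test 
run (again, both names are placeholders).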

Test outcomes are as follows:
- OK: everything works flawlessly inside UML.
- FAIL: software crashes are observed inside UML, as in the examples discussed 
above (localedef, apt-get, sometimes systemd-journald).

==============================================================

The UML command line was as follows:
  kernel-x86_64 umid=test-vm mem=1073741824 ubd0=image.img rw con=null \
    con0=fd:0,fd:1 eth0=<as above> init=/startup.sh

Here are the outcomes:

+------+-----+---------+---------+---------+
| Host | UML | Network |  Suite  | OUTCOME |
+------+-----+---------+---------+---------+
|  A   |  1  |  SLIRP  |   sid   | FAIL    |
|  A   |  1  |   NAT   |   sid   | FAIL    | <= strace
|  A   |  2  |  SLIRP  |   sid   | FAIL    |
|  A   |  2  |   NAT   |   sid   | FAIL    |
|  B   |  1  |  SLIRP  |   sid   | OK      |
|  B   |  1  |   NAT   |   sid   | OK      |
|  B   |  2  |  SLIRP  |   sid   | OK      |
|  B   |  2  |   NAT   |   sid   | OK      |
|  C   |  1  |  SLIRP  |   sid   | OK      | <= strace
|  C   |  1  |   NAT   |   sid   | OK      |
|  C   |  2  |  SLIRP  |   sid   | OK      |
|  C   |  2  |   NAT   |   sid   | OK      |
|  A   |  1  |  SLIRP  | stretch | FAIL    |
|  A   |  1  |   NAT   | stretch | FAIL    |
|  A   |  2  |  SLIRP  | stretch | FAIL    |
|  A   |  2  |   NAT   | stretch | FAIL    |
|  B   |  1  |  SLIRP  | stretch | OK      |
|  B   |  1  |   NAT   | stretch | OK      |
|  B   |  2  |  SLIRP  | stretch | OK      |
|  B   |  2  |   NAT   | stretch | OK      |
|  C   |  1  |  SLIRP  | stretch | OK      |
|  C   |  1  |   NAT   | stretch | OK      |
|  C   |  2  |  SLIRP  | stretch | OK      |
|  C   |  2  |   NAT   | stretch | OK      |
+------+-----+---------+---------+---------+

So, one could come to an easy conclusion: something is wrong with host A. But 
what?
Here are some additional clues:
- I have performed an 'apt-get upgrade' on host A only a few days ago.
- Everything else works perfectly on host A.
- The problem occurs with two different host kernel versions (4.4.0-1-amd64 and 
4.6.0-1-amd64).
- If I 'chroot' in the *same* filesystem image(s) used for the UML tests, 
everything works perfectly on host A.
- Changing the amount of RAM assigned to UML instances does not solve the 
problem on host A.
- It does not depend on the processes running on host A: I made a test after 
booting the host with init=/bin/bash and the problem still occurred.
- Although I am assigning the same amount of RAM to all the UML instances 
(purposely without 'M' or 'G' suffixes to avoid ambiguities), when UML boots it 
reports slightly different memory amounts:
On host A: Memory: 1024952K/1056060K available (3108K kernel code, 862K rwdata, 
1032K rodata, 121K init, 294K bss, 31108K reserved, 0K cma-reserved)
On host C: Memory: 1024824K/1063076K available (3108K kernel code, 862K rwdata, 
1032K rodata, 121K init, 294K bss, 38252K reserved, 0K cma-reserved)
- I have collected straces of a working and a failing instance of localedef 
in the scenarios marked in the table above, using:
  strace -r -s 256 -T -yy -o out-strace-file -ff -v /usr/bin/localedef -i en_US \
    -A /usr/share/locale/locale.alias -f UTF-8 en_US.UTF-8
These straces are available at the following addresses:
http://pastebin.com/bfjPERkK (failing, PID 378)
http://pastebin.com/BaWTCvjn (failing, PID 379)
http://pastebin.com/pCTttzPz (working, PID 56)
http://pastebin.com/ejXseFsq (working, PID 57)
After sanitization, it is pretty evident that everything is essentially 
identical up to the SIGSEGV at a 'brk' call. The following command can be used 
to compare a sane and a faulty output:
  diff -y \
    <(awk '{$1=""; $NF=""; print}' localedef-strace-failing.378 | sed -r 's/\[[0-9]+\]//g') \
    <(awk '{$1=""; $NF=""; print}' localedef-strace-working.56 | sed -r 's/\[[0-9]+\]//g') \
    | less -S
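
Regarding the boot-time memory figures quoted in the clue above, the following 
is a small arithmetic sketch (not part of the original tests) that compares the 
numbers reported on hosts A and C. All values are copied from the two "Memory:" 
lines and from the mem= argument; the variable names are made up for 
illustration.

```shell
# Figures in KiB, copied from the two "Memory:" boot messages quoted above.
requested=$((1073741824 / 1024))                  # mem= value, in KiB
a_avail=1024952 a_total=1056060 a_reserved=31108  # host A
c_avail=1024824 c_total=1063076 c_reserved=38252  # host C

echo "requested:        ${requested}K"
echo "A total - avail:  $((a_total - a_avail))K (reported reserved: ${a_reserved}K)"
echo "C total - avail:  $((c_total - c_avail))K (reported reserved: ${c_reserved}K)"
echo "total, C minus A: $((c_total - a_total))K"
echo "avail, A minus C: $((a_avail - c_avail))K"
```

On these figures, "reserved" equals total minus available on both hosts; the 
totals differ by 7016K while the available amounts differ by only 128K, so 
almost all of the discrepancy sits in the reserved portion.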

==============================================================

I have more or less run out of ideas about how to overcome this problem. Any 
suggestions are welcome. Thank you very much even for just reading so far.

As a side note, this resembles the kind of problem that was previously reported 
here: https://sourceforge.net/p/user-mode-linux/mailman/message/34978201/

Regards,
Massimo

------------------------------------------------------------------------------
_______________________________________________
User-mode-linux-user mailing list
User-mode-linux-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/user-mode-linux-user
