[Bug 1637601] [NEW] UbuntuKVM: migration using NFS mount fails #190

bugproxy Fri, 28 Oct 2016 11:01:50 -0700

Public bug reported:

---Problem Description---
We set up two Ubuntu KVM hosts with the same NFS mount point and tried to migrate a guest between them. The migration succeeds and the guest appears on the other host, but the guest then reports I/O errors.


On the first host, run:
root@micro:~# virsh migrate --live --domain microg5 qemu+ssh://10.33.10.115/system --verbose --undefinesource --persistent --timeout 60
Migration: [100 %]

The guest appears on the other host:
root@tiny:~# virsh list --all
Id Name State

2 tinyg1 running
3 tinyg2 running
5 tinyg4 running
6 tinyg5 running
7 tinyg6 running
9 tinyg3 running
12 microg5 running <<< this guest is from HOST "Micro"

Checking the status of the guest, I see these errors:
root@microg5:~# dmesg |tail -20
[ 60.818955] blk_update_request: I/O error, dev vdc, sector 96749232
[ 60.819113] Aborting journal on device vdc2-8.
[ 60.820121] blk_update_request: I/O error, dev vdc, sector 9084320
[ 60.820643] EXT4-fs warning (device vdc2): ext4_end_bio:329: I/O error -5 writing to inode 393279 (offset 0 size 0 starting block 1135541)
[ 60.820652] Buffer I/O error on device vdc2, logical block 1133492
[ 60.820655] EXT4-fs (vdc2): previous I/O error to superblock detected
[ 60.821394] blk_update_request: I/O error, dev vdc, sector 96747520
[ 60.821397] blk_update_request: I/O error, dev vdc, sector 96747520
[ 60.821402] Buffer I/O error on dev vdc2, logical block 12091392, lost sync page write
[ 60.821466] JBD2: Error -5 detected when updating journal superblock for vdc2-8.
[ 60.822214] blk_update_request: I/O error, dev vdc, sector 16384
[ 60.822216] blk_update_request: I/O error, dev vdc, sector 16384
[ 60.822218] Buffer I/O error on dev vdc2, logical block 0, lost sync page write
[ 60.822227] EXT4-fs error (device vdc2): ext4_journal_check_start:56: Detected aborted journal
[ 60.822228] EXT4-fs (vdc2): Remounting filesystem read-only
[ 60.822229] EXT4-fs (vdc2): previous I/O error to superblock detected
[ 60.823201] blk_update_request: I/O error, dev vdc, sector 16384
[ 60.823203] blk_update_request: I/O error, dev vdc, sector 16384
[ 60.823204] Buffer I/O error on dev vdc2, logical block 0, lost sync page write
[ 96.736959] nfsd4: failed to purge old clients from recovery directory v4recovery
root@microg5:~#
haochanh commented:

Moving the guest back to the original host succeeds, but we still see the I/O errors:
root@tiny:~# virsh migrate --live --domain microg5 qemu+ssh://10.33.9.187/system --verbose --undefinesource --persistent --timeout 60
Migration: [100 %]
root@tiny:~# virsh list --all
Id Name State

2 tinyg1 running
3 tinyg2 running
5 tinyg4 running
6 tinyg5 running
7 tinyg6 running
9 tinyg3 running

On the original host:
root@micro:~# virsh list --all
Id Name State

2 microg1 running
3 microg2 running
4 microg3 running
5 microg4 running
9 microg6 running
16 microg5 running

Here is our configuration: both hosts (micro and tiny) share the same NFS mount, /kvm_nfs/.
Micro KVM:
root@micro:~# ls -l /kvm_nfs/microg6.raw.img
-rw-r--r-- 1 nobody 4294967294 107374182400 Aug 3 18:23 /kvm_nfs/microg6.raw.img

Tiny KVM:
root@tiny:~# ls -l /kvm_nfs/microg6.raw.img
-rw-r--r-- 1 nobody 4294967294 107374182400 Aug 3 2016 /kvm_nfs/microg6.raw.img

We then migrate the guest microg6 from "micro" to "tiny":
root@micro:~# virsh domblklist microg6
Target Source

vda /kvm_nfs/microg6.raw.img

root@micro:~# virsh migrate --live --domain microg6 qemu+ssh://10.33.10.115/system --verbose --undefinesource --persistent --timeout 60
Migration: [100 %] <<<< it successfully goes to tiny KVM.

The guest "microg6" is now on tiny KVM:
root@tiny:~# virsh domblklist microg6
Target Source

vda /kvm_nfs/microg6.raw.img

On the guest "microg6", we see these errors:
root@microg6:~# dmesg |tail
[24371.936814] blk_update_request: I/O error, dev vda, sector 16384
[24371.936900] Buffer I/O error on dev vda2, logical block 0, lost sync page write
[24373.661328] blk_update_request: I/O error, dev vda, sector 16416
[24373.661552] Buffer I/O error on dev vda2, logical block 4, lost async page write
[24373.661778] Buffer I/O error on dev vda2, logical block 6, lost async page write
[24373.662023] Buffer I/O error on dev vda2, logical block 13107201, lost async page write
[24373.662253] Buffer I/O error on dev vda2, logical block 21004406, lost async page write
[24373.662427] Buffer I/O error on dev vda2, logical block 21010963, lost async page write
[24373.662713] Buffer I/O error on dev vda2, logical block 21495820, lost async page write
[24373.662957] Buffer I/O error on dev vda2, logical block 16777222, lost async page write

Both hosts share the same NFS mount:
root@tiny:~# df -T |grep "kvm_nfs"
tmp-lte:/kvm_lpm nfs4 1238417408 786754560 388756480 67% /kvm_nfs
root@tiny:~# rsh micro "df -T |grep kvm_nfs"
tmp-lte:/kvm_lpm nfs4 1238417408 786754560 388756480 67% /kvm_nfs

root@micro:~# df -T |grep "kvm_nfs"
tmp-lte:/kvm_lpm nfs4 1238417408 786754560 388756480 67% /kvm_nfs

The problem is caused by the configuration of the NFSv4 server and/or clients, probably related to the NFSv4 ID <-> name mapping (idmap).

Thus, this is not an I/O-related problem.

The solution/requirement is to make sure that the libvirt-qemu user has the same UID/GID on all systems that the guest might be migrated to.
(A workaround is to use NFSv3, for example, which has no ID mapping.)
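To check this requirement before migrating, one can compare the libvirt-qemu UID on every host that mounts the share. Below is a minimal shell sketch, assuming non-interactive ssh from the current host to each host; the function names passwd_uid and check_same_uid are hypothetical. It compares only the UID, since in the test results later in this report the primary group (kvm) had different GIDs on micro and tiny while migration still worked.

```shell
#!/bin/sh
# Minimal sketch: check that a user has the same UID on several hosts.
# Assumes non-interactive ssh to each host; function names are hypothetical.

# Print the numeric UID (3rd field) of a passwd(5)-format line.
passwd_uid() {
    printf '%s\n' "$1" | cut -d: -f3
}

# Print "host uid" for each host; return 1 if the user is missing
# on any host or if the UIDs differ.
check_same_uid() {
    user=$1
    shift
    first=""
    status=0
    for host in "$@"; do
        line=$(ssh "$host" getent passwd "$user")
        uid=$(passwd_uid "$line")
        echo "$host ${uid:-missing}"
        if [ -z "$uid" ]; then
            status=1            # user does not exist on this host
        elif [ -z "$first" ]; then
            first=$uid          # reference UID from the first host found
        elif [ "$uid" != "$first" ]; then
            status=1            # mismatch: expect nobody-owned files on NFSv4
        fi
    done
    return $status
}
```

For example, `check_same_uid libvirt-qemu micro tiny` prints one line per host and returns non-zero on a mismatch.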

        "NFSv4 mount incorrectly shows all files with ownership as nobody:nobody"
        https://access.redhat.com/solutions/33455

         - Ensure the client and server have matching UID's and GID's.
           It is a common misconception that the UID's and GID's can differ when using NFSv4.

           The sole purpose of id mapping is to map an id to a name and vice-versa.
           ID mapping is not intended as some sort of replacement for managing id's.

         - On a non-Ubuntu linux, if the above settings have been applied and UID/GID's are matched on server and client
           and users are still being mapped to nobody:nobody then a clearing of the idmapd cache may be required:

           # nfsidmap -c

There are several articles online on 'linux how to change uid'.
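For reference, renumbering an existing account is usually done with usermod -u and groupmod -g. usermod only re-owns files under the user's home directory, and it refuses to change the UID of a user with running processes, so the guests using the account should be stopped first and files elsewhere fixed by hand. The dry-run sketch below just prints those commands for review; the function name is hypothetical and it assumes the group has the same name as the user (as libvirt-qemu does).

```shell
#!/bin/sh
# Dry-run sketch: print (do not run) the commands that move a user and its
# same-named group to new numeric IDs. Assumes the group is named like the user.

renumber_user_cmds() {
    user=$1
    new_uid=$2
    new_gid=$3
    old_uid=$(id -u "$user") || return 1
    old_gid=$(getent group "$user" | cut -d: -f3)
    [ -n "$old_gid" ] || return 1

    echo "usermod -u $new_uid $user"
    echo "groupmod -g $new_gid $user"
    # usermod only re-owns files under the home directory; fix the rest.
    # -xdev keeps find on the root filesystem, so NFS mounts are not walked.
    echo "find / -xdev -user $old_uid -exec chown -h $new_uid {} +"
    echo "find / -xdev -group $old_gid -exec chgrp -h $new_gid {} +"
}
```

Only the printed commands change anything; review them, then run them as root.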


Details:
---

The user 'developer' on the NFS server has UID 529.

        [developer@tmp-lte ~]$ grep developer /etc/passwd
        developer:x:529:531:developer login id:/home/developer:/bin/bash

Create a user 'developer' on tiny with a UID other than 529.
Changing the owner to the 'developer' user does not work (the owner remains 'nobody').

        root@tiny:~# useradd --uid 1234 developer

        root@tiny:~# chown developer /kvm_nfs/test.mauricfo

        root@tiny:~# ls -l /kvm_nfs/test.mauricfo 
        -rw-r--r-- 1 nobody root 18 Oct 11 17:17 /kvm_nfs/test.mauricfo

Remove the user 'developer' for the next test.

        root@tiny:~# userdel developer

Create a user 'developer' on tiny with UID 529 (the same as on the NFS server).
Changing the owner to the 'developer' user DOES work (the owner is no longer 'nobody').


        root@tiny:~# useradd --uid=529 developer

        root@tiny:~# chown developer /kvm_nfs/test.mauricfo

        root@tiny:~# ls -l /kvm_nfs/test.mauricfo 
        -rw-r--r-- 1 developer root 18 Oct 11 17:17 /kvm_nfs/test.mauricfo


And that works on tiny, but does NOT work on micro UNTIL you clear the nfsidmap cache.

        root@micro:~# ls -l /kvm_nfs/test.mauricfo 
        -rw-r--r-- 1 nobody root 18 Oct 11 17:17 /kvm_nfs/test.mauricfo

        root@micro:~# useradd --uid 529 developer

        root@micro:~# chown developer /kvm_nfs/test.mauricfo

        root@micro:~# ls -l /kvm_nfs/test.mauricfo 
        -rw-r--r-- 1 nobody root 18 Oct 11 17:17 /kvm_nfs/test.mauricfo

        root@micro:~# nfsidmap -c
        nfsidmap: clearing '2a251810 I------     1 perm 1f030000     0     0 keyring   .id_resolver: 8'

        root@micro:~# ls -l /kvm_nfs/test.mauricfo 
        -rw-r--r-- 1 developer root 18 Oct 11 17:17 /kvm_nfs/test.mauricfo


So, this is clearly NFSv4-specific.


Details 2:
---

idmapd uses the domain from 'hostname --domain', which is the same across all hosts.

        [developer@tmp-lte ~]$ head -n5 /etc/idmapd.conf 
        [General]
        #Verbosity = 0
        # The following should be set to the local NFSv4 domain name
        # The default is the host's DNS domain name.
        #Domain = local.domain.edu

        [developer@tmp-lte ~]$ hostname --domain
        isst.aus.stglabs.ibm.com

        root@tiny:~# hostname --domain
        isst.aus.stglabs.ibm.com

        root@micro:~# hostname --domain
        isst.aus.stglabs.ibm.com

Started nfs-idmapd.service on tiny and micro (apt-get install nfs-server; systemctl start nfs-idmapd.service).
Increased the verbosity in /etc/idmapd.conf to 3.
Verified that the domain values used by the nfs-idmapd service are the same on all hosts (journalctl -u nfs-idmapd).
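The domain comparison can also be scripted per host. The sketch below prints the Domain value from an idmapd.conf-style file if an uncommented one is set, and otherwise falls back to hostname --domain, the default described in the file's own comments. It assumes Domain appears at most once (in [General]) and does not reproduce every rule nfs-utils applies; the function name idmap_domain is hypothetical.

```shell
#!/bin/sh
# Sketch: print the NFSv4 idmap domain a host would be expected to use.
# Takes an uncommented "Domain = ..." line from the config if present,
# otherwise falls back to the DNS domain (the documented default).

idmap_domain() {
    conf=${1:-/etc/idmapd.conf}
    d=""
    if [ -r "$conf" ]; then
        # Lines starting with '#' never match, so commented defaults are skipped.
        d=$(sed -n 's/^[[:space:]]*Domain[[:space:]]*=[[:space:]]*//p' "$conf" \
            | sed 's/[[:space:]]*$//' | head -n 1)
    fi
    if [ -n "$d" ]; then
        echo "$d"
    else
        hostname --domain
    fi
}
```

Running it on tiny, micro, and the NFS server and comparing the output is a quick first check alongside the journalctl check above.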


References:
---

These links discuss this problem (user/group shown as nobody/nogroup), in case they're useful to the next person.


https://access.redhat.com/solutions/33455

https://help.ubuntu.com/community/NFSv4Howto

        If all directory listings show just "nobody" and "nogroup" instead of real user and group names,
        then you might want to check the Domain parameter set in /etc/idmapd.conf.

        NFSv4 client and server should be in the same domain.
        Other operating systems might derive the NFSv4 domain name from the domain name mentioned in /etc/resolv.conf (e.g. Solaris 10).

http://www.enterprisenetworkingplanet.com/netos/article.php/3644471/Implement-NFSv4-Domains-and-Authentication.htm
http://unix.stackexchange.com/questions/138479/mounting-nfs-owners-are-nobodynogroup
https://www.novell.com/support/kb/doc.php?id=7005060
https://community.netapp.com/t5/Network-Storage-Protocols-Discussions/NFSv4-Linux-client-Netapp-Server-gt-Problem-with-id-mapping/td-p/17895

Reserving a UID/GID in Debian/Ubuntu follows an allocation process governed by Debian.
I have submitted an allocation request and will prepare the patches for the libvirt-qemu user in Debian and Ubuntu.

Hi Chanh,

Can you please test the libvirt-bin & libvirt0 packages?
    http://ausgsa.ibm.com/~mauricfo/public/bugs/bz145069/v2/

Please confirm if they resolve the issue.

Thanks!


Details
---

The test packages assume that UID/GID 64055 will be allocated by Debian, and use this value for the libvirt-qemu user/group.

You can remove all traces of the current libvirt-bin package (which creates the user/group) and of the user/group itself with:
# apt-get purge libvirt-bin
# userdel libvirt-qemu
# groupdel libvirt-qemu

Existing files owned by the libvirt-qemu user/group will not have their ownership changed, so I'm not sure this is a clean transition for an existing system and its files; however, the permissions should be correct for all installed files and for new files created from then on.

Hi Mauricio,
Yes, after applying this patch, the migration works fine without any I/O errors.

I am able to migrate between the two hosts (tiny and micro) using the NFS mount without any I/O issues.
root@micro:~# id libvirt-qemu
uid=64055(libvirt-qemu) gid=117(kvm) groups=117(kvm),64055(libvirt-qemu)

root@tiny:~# id libvirt-qemu
uid=64055(libvirt-qemu) gid=116(kvm) groups=116(kvm),64055(libvirt-qemu)

Hi Canonical,
@taco-screen-team

The attached patches are for Zesty, Xenial, and Debian sid (which I plan
to submit if the UID/GID allocation request is granted).

AFAIK, @cjwatson is the maintainer of base-passwd in Debian and could review, grant, or deny the allocation request.
Per Ubuntu policy, we'd need it acked in Debian first.

As for the libvirt patches, all packages were verified.
For Zesty (Z), Xenial (X), and sid, the test case follows the pattern below and shows correct behavior/results.

Thanks!


Test-case
===

Package is not installed -- no libvirt-qemu user/group:
---

# getent passwd libvirt-qemu
# 

# getent group libvirt-qemu
# 


Package is installed -- libvirt-qemu user/group is created w/ allocated uid/gid:
---

# dpkg -i libvirt{-bin,0}_1.3.1-1ubuntu10.5uidgid1_*.deb

# getent passwd libvirt-qemu
libvirt-qemu:x:64055:117:Libvirt Qemu,,,:/var/lib/libvirt:/bin/false

# getent group libvirt-qemu
libvirt-qemu:x:64055:libvirt-qemu


Package is uninstalled -- libvirt-qemu user/group is removed:
---

# apt-get --yes purge libvirt-bin

# getent passwd libvirt-qemu
# 

# getent group libvirt-qemu
# 


Package is installed with uid/gid taken -- libvirt-qemu user/group is created with other uid/gid:
---

# useradd --uid 64055 testuser
# groupadd --gid 64055 testgroup

# dpkg -i libvirt{-bin,0}_1.3.1-1ubuntu10.5uidgid1_*.deb
# echo $?
0

# getent passwd libvirt-qemu
libvirt-qemu:x:113:117:Libvirt Qemu,,,:/var/lib/libvirt:/bin/false

# getent group libvirt-qemu
libvirt-qemu:x:118:libvirt-qemu
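The last test shows the intended fallback: use the reserved ID 64055 when it is free, and let the system allocate one otherwise. The actual patches are attached to the bug and are not reproduced here; the following is only an illustrative sketch of that decision, printing adduser/addgroup commands of the kind Debian maintainer scripts commonly use instead of running them. The function names uid_opt, gid_opt, and qemu_user_cmds are hypothetical.

```shell
#!/bin/sh
# Illustrative sketch only (not the attached patch): prefer the reserved ID.
RESERVED_ID=64055   # UID/GID requested from Debian for libvirt-qemu

# Print "--uid N" if no user has UID N; print nothing if it is taken,
# so adduser falls back to a dynamically allocated system UID.
uid_opt() {
    getent passwd "$1" >/dev/null || echo "--uid $1"
}

# Same decision for group IDs.
gid_opt() {
    getent group "$1" >/dev/null || echo "--gid $1"
}

# Print (do not run) the commands a postinst could use; the primary
# group kvm matches the getent output shown in the test case above.
qemu_user_cmds() {
    echo "addgroup --system $(gid_opt "$RESERVED_ID") libvirt-qemu"
    echo "adduser --system $(uid_opt "$RESERVED_ID") --ingroup kvm" \
         "--home /var/lib/libvirt --no-create-home --disabled-password" \
         "--gecos 'Libvirt Qemu' libvirt-qemu"
    echo "adduser libvirt-qemu libvirt-qemu"
}
```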

** Affects: libvirt (Ubuntu)
     Importance: Undecided
     Assignee: Taco Screen team (taco-screen-team)
         Status: New


** Tags: architecture-ppc64le bugnameltc-145069 severity-high targetmilestone-inin16041

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1637601
