On 10/02/2017 11:00 AM, Yaniv Kaul wrote:


On Mon, Oct 2, 2017 at 5:57 PM, Jason Keltz <j...@cse.yorku.ca> wrote:


    On 10/02/2017 10:51 AM, Yaniv Kaul wrote:


    On Mon, Oct 2, 2017 at 5:14 PM, Jason Keltz <j...@cse.yorku.ca> wrote:


        On 10/02/2017 01:22 AM, Yaniv Kaul wrote:


        On Mon, Oct 2, 2017 at 5:11 AM, Jason Keltz <j...@cse.yorku.ca> wrote:

            Hi.

            For my data domain, I have one NFS server with a large
            RAID filesystem (9 TB).
            I'm only using 2 TB of that at the moment. Today, my NFS
            server hung with the following error:

                xfs: possible memory allocation deadlock in kmem_alloc


        Can you share more of the log so we'll see what happened
        before and after?
        Y.


            Here is the engine log from yesterday. The problem started
            around 14:29:
            http://www.eecs.yorku.ca/~jas/ovirt-debug/10012017/engine-log.txt

            Here is the vdsm log on one of the virtualization hosts,
            virt01:
            http://www.eecs.yorku.ca/~jas/ovirt-debug/10012017/vdsm.log.2

            Doing further investigation, I found that the XFS error
            messages didn't start yesterday.  You'll see they
            started at the very end of the day on September 23. See:

            http://www.eecs.yorku.ca/~jas/ovirt-debug/messages-20170924
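            (I found those dates with a quick grep across the rotated
            logs on the file server:

            # grep -h "memory allocation deadlock" /var/log/messages-*

            ...the earliest hits are from late on the 23rd.)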



        Our storage guys do NOT think it's an XFS fragmentation
        issue, but we'll be looking at it.
        Hmmm... almost sorry to hear that because that would be easy
        to "fix"...


            They continued on the 24th, then on the 26th... I think
            there were a few "hangs" at those times that people were
            complaining about, but we didn't catch the problem.
            However, the errors hit big time yesterday at 14:27...
            see here:

            http://www.eecs.yorku.ca/~jas/ovirt-debug/messages-20171001

            If you want any other logs, I'm happy to provide them. I
            just don't know exactly what to provide.

            Do you know if I can run the XFS defrag command live?
            Rather than doing it disk by disk, I'd rather just run it
            on the whole filesystem. There really aren't that many
            files, since it's just oVirt disk images.  However, I
            don't understand the implications for running VMs.  I
            wouldn't want to do anything to create more downtime.
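
            For reference, something like this is what I had in mind,
            run on the whole mounted filesystem (the mount point is
            just an example from my setup):

            # xfs_fsr -v /export/data

            ...or across every mounted XFS filesystem, with the
            default two-hour time limit:

            # xfs_fsr -v -t 7200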


        Should be enough to copy the disks to make them less fragmented.
        Yes, but that requires downtime... there's plenty of
        additional storage, though, so this would fix things well.


Live storage migration could be used.
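For example, via the REST API the disk "move" action should do it;
something like this (the IDs and engine address are placeholders,
check the API docs for your version):

$ curl -k -u admin@internal:PASSWORD \
    -H 'Content-Type: application/xml' \
    -d '<action><storage_domain id="TARGET_SD_ID"/></action>' \
    https://ENGINE_FQDN/ovirt-engine/api/disks/DISK_ID/move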
Y.




        I had upgraded the engine server + 4 virtualization hosts
        from 4.1.1 to current on September 20, along with upgrading
        them from CentOS 7.3 to CentOS 7.4.  virtfs, the NFS file
        server, was running CentOS 7.3 and kernel
        vmlinuz-3.10.0-514.16.1.el7.x86_64. Only yesterday did I
        upgrade it to CentOS 7.4 and hence kernel
        vmlinuz-3.10.0-693.2.2.el7.x86_64.

        I believe the problem is fully XFS-related, and not oVirt at
        all.  Although, I must admit, oVirt didn't help either. When
        I rebooted the file server, the iso and export domains were
        immediately active, but the data domain took quite a long
        time. I kept trying to activate it, and it couldn't do it. I
        couldn't make a host the SPM. I found that the data domain
        directory on the virtualization host was a "stale NFS file
        handle".  I rebooted one of the virtualization hosts (virt1)
        and tried to make it the SPM. Again, it wouldn't work.
        Finally, I ended up putting everything into maintenance
        mode, then activating just that host, and I was able to make
        it the SPM. I was then able to bring everything up. I would
        have expected oVirt to handle the problem a little more
        gracefully and give me more information, because I was
        sweating thinking I had to restore all the VMs!
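
        In case it helps anyone else: a quick way to spot a stale
        handle on a host is to stat each domain mount point (the
        /rhev path below is where vdsm mounts them on my hosts;
        yours may differ):

        # for m in /rhev/data-center/mnt/*; do stat "$m" >/dev/null || echo "stale? $m"; done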


    Stale NFS is on our todo list to handle. Quite challenging.
    Thanks..


        I didn't think when I chose XFS as the filesystem for my
        virtualization NFS server that I would have to defragment the
        filesystem manually.  This is like the old days of running
        Norton SpeedDisk to defrag my 386...


    We are still not convinced it's an issue - but we'll look into it
    (and perhaps ask for more stats and data).
    Thanks!


    Y.


        Thanks for any help you can provide...

        Jason.



            All 4 virtualization hosts of course had problems since
            there was no
            longer any storage.

            In the end, it seems like the problem is related to XFS
            fragmentation...

            I read this great blog here:

            
            https://blog.codecentric.de/en/2017/04/xfs-possible-memory-allocation-deadlock-kmem_alloc/

            In short, I tried this:

            # xfs_db -r -c "frag -f" /dev/sdb1
            actual 4314253, ideal 43107, fragmentation factor 99.00%

            Apparently the fragmentation factor alone doesn't mean
            much, but the fact that the "actual" number of extents
            is considerably higher than the "ideal" number suggests
            that it may be the problem.
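
            If I read the output right, the factor is just the
            excess of actual over ideal extents, as a percentage:

            # awk 'BEGIN { printf "%.2f%%\n", (4314253 - 43107) / 4314253 * 100 }'
            99.00%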

            I saw that, unsurprisingly, the virtual disks that are
            written to a lot have a lot of extents...

            For example, on our main web server disk image, there
            were 247,597
            extents alone!  I took the web server down, and ran the
            XFS defrag
            command on the disk...

            # xfs_fsr -v 9a634692-1302-471f-a92e-c978b2b67fd0
            9a634692-1302-471f-a92e-c978b2b67fd0
            extents before:247597 after:429 DONE
            9a634692-1302-471f-a92e-c978b2b67fd0

            247,597 before and 429 after!  WOW!

            Are virtual disks a problem with XFS?  Why isn't this
            memory allocation deadlock issue more prevalent?  I do
            see it mentioned in many web posts.  I don't
            specifically see any recommendation *not* to use XFS for
            the data domain, though.
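
            One mitigation I've seen suggested for VM images on XFS
            is setting an extent size hint on the images directory,
            so new files allocate in larger chunks (the path and
            hint size here are just an example):

            # xfs_io -c "extsize 16m" /export/data/images
            # xfs_io -c "extsize" /export/data/images
            [16777216] /export/data/images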

            I was running CentOS 7.3 on the file server, but before
            rebooting the server, I upgraded to the latest kernel
            and CentOS 7.4 in the hope that if there was a kernel
            issue, this would solve it.

            I took a few virtual systems down, and ran the defrag on
            the disks. However,
            with over 30 virtual systems, I don't really want to do
            this individually.
            I was wondering if I could run xfs_fsr on all the disks
            LIVE?  It says in the
            manual that you can run it live, but I can't see how
            this would be good when
            a system is using that disk, and I don't want to deal
            with major
            corruption across the board. Any thoughts?
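
            In the meantime, I may just survey the worst offenders
            and defragment them one VM at a time; something like
            this is what I have in mind (the paths reflect my
            export layout):

            for img in /export/data/*/images/*/*; do
                [ -f "$img" ] || continue
                printf '%8d %s\n' "$(( $(xfs_bmap "$img" | wc -l) - 1 ))" "$img"
            done | sort -rn | head -20

            (xfs_bmap prints one line per extent after the filename,
            so the count is approximate but fine for ranking.)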

            Thanks,

            Jason.








_______________________________________________
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users
