Re: some [big] changes to ZPL (ZFS<->VFS )

2016-08-08 Thread Alan Somers
On r303834 I can no longer reproduce the problem.
-Alan

On Sat, Aug 6, 2016 at 5:05 AM, Andriy Gapon  wrote:
> On 05/08/2016 23:31, Alan Somers wrote:
>> I'm not certain it's related, but on a head build at r303767 I see a
>> LOR and a reproducible panic that involve the snapdir code.
>
> Alan,
>
> thank you very much for the clear report and for the very easy
> reproduction scenario.  I am not sure how I missed this simple and
> severe bug.
> Please try r303791, it should fix the problem.
>
> I believe that the LOR is not new and has been there since we started
> using distinct tags for .zfs special vnodes.
>
>> [LOR report, panic report, and the rest of the quoted thread snipped;
>> quoted in full in the original messages below]

Re: some [big] changes to ZPL (ZFS<->VFS )

2016-08-06 Thread Andriy Gapon
On 05/08/2016 23:31, Alan Somers wrote:
> I'm not certain it's related, but on a head build at r303767 I see a
> LOR and a reproducible panic that involve the snapdir code.

Alan,

thank you very much for the clear report and for the very easy
reproduction scenario.  I am not sure how I missed this simple and
severe bug.
Please try r303791, it should fix the problem.

I believe that the LOR is not new and has been there since we started
using distinct tags for .zfs special vnodes.

> [LOR report, panic report, and the rest of the quoted thread snipped;
> quoted in full in the original messages below]

Re: some [big] changes to ZPL (ZFS<->VFS )

2016-08-05 Thread Alan Somers
I'm not certain it's related, but on a head build at r303767 I see a
LOR and a reproducible panic that involve the snapdir code.

First, the LOR:
$ zpool destroy foo

lock order reversal:
 1st 0xf800404c8b78 zfs (zfs) @
/usr/home/alans/freebsd/head/sys/kern/vfs_mount.c:1244
 2nd 0xf800404c85f0 zfs_gfs (zfs_gfs) @
/usr/home/alans/freebsd/head/sys/cddl/contrib/opensolaris/uts/common/fs/gfs.c:484
stack backtrace:
#0 0x80aa90b0 at witness_debugger+0x70
#1 0x80aa8fa4 at witness_checkorder+0xe54
#2 0x80a22072 at __lockmgr_args+0x4c2
#3 0x80af8e7c at vop_stdlock+0x3c
#4 0x81018880 at VOP_LOCK1_APV+0xe0
#5 0x80b19f2a at _vn_lock+0x9a
#6 0x821b9c53 at gfs_file_create+0x73
#7 0x821b9cfd at gfs_dir_create+0x1d
#8 0x8228aa07 at zfsctl_mknode_snapdir+0x47
#9 0x821ba1a5 at gfs_dir_lookup+0x185
#10 0x821ba68d at gfs_vop_lookup+0x1d
#11 0x82289a42 at zfsctl_root_lookup+0xf2
#12 0x8228a8c3 at zfsctl_umount_snapshots+0x83
#13 0x822a1d2b at zfs_umount+0x7b
#14 0x80b02a14 at dounmount+0x6f4
#15 0x80b0228d at sys_unmount+0x35d
#16 0x80ebbb7b at amd64_syscall+0x2db
#17 0x80e9b72b at Xfast_syscall+0xfb


Here's the panic:
$ zpool create testpool da0
$ touch /testpool/testfile
$ zfs snapshot testpool@testsnap
$ cd /testpool/.zfs/snapshots

Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 04
fault virtual address   = 0x8
fault code  = supervisor read data, page not present
instruction pointer = 0x20:0x80b19f1c
stack pointer   = 0x28:0xfe0b54bf7430
frame pointer   = 0x28:0xfe0b54bf74a0
code segment= base 0x0, limit 0xf, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags= interrupt enabled, resume, IOPL = 0
current process = 966 (bash)
trap number = 12
panic: page fault
cpuid = 2
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfe0b54bf6fc0
vpanic() at vpanic+0x182/frame 0xfe0b54bf7040
panic() at panic+0x43/frame 0xfe0b54bf70a0
trap_fatal() at trap_fatal+0x351/frame 0xfe0b54bf7100
trap_pfault() at trap_pfault+0x1fd/frame 0xfe0b54bf7160
trap() at trap+0x284/frame 0xfe0b54bf7370
calltrap() at calltrap+0x8/frame 0xfe0b54bf7370
--- trap 0xc, rip = 0x80b19f1c, rsp = 0xfe0b54bf7440, rbp
= 0xfe0b54bf74a0 ---
_vn_lock() at _vn_lock+0x8c/frame 0xfe0b54bf74a0
zfs_lookup() at zfs_lookup+0x50d/frame 0xfe0b54bf7540
zfs_freebsd_lookup() at zfs_freebsd_lookup+0x91/frame 0xfe0b54bf7680
VOP_CACHEDLOOKUP_APV() at VOP_CACHEDLOOKUP_APV+0xda/frame 0xfe0b54bf76b0
vfs_cache_lookup() at vfs_cache_lookup+0xd6/frame 0xfe0b54bf7710
VOP_LOOKUP_APV() at VOP_LOOKUP_APV+0xda/frame 0xfe0b54bf7740
lookup() at lookup+0x5a2/frame 0xfe0b54bf77d0
namei() at namei+0x5b2/frame 0xfe0b54bf7890
kern_statat() at kern_statat+0xa8/frame 0xfe0b54bf7a40
sys_stat() at sys_stat+0x2d/frame 0xfe0b54bf7ae0
amd64_syscall() at amd64_syscall+0x2db/frame 0xfe0b54bf7bf0
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfe0b54bf7bf0


I can provide core files, test scripts, whatever you need.  Thanks for
tackling this difficult problem.

-Alan

On Fri, Aug 5, 2016 at 12:36 AM, Andriy Gapon  wrote:
> [quoted text snipped; see Andriy's original messages below]
___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: some [big] changes to ZPL (ZFS<->VFS )

2016-08-05 Thread Andriy Gapon
On 03/08/2016 17:25, Andriy Gapon wrote:
> Another change that was not strictly required and which is probably too
> intrusive is killing the support for case insensitive operations.   My
> thinking was that FreeBSD VFS does not provide support for those anyway.
>  But I'll probably restore the code, at least in the bottom half of the
> ZPL, before committing the change.

It turned out that most of the removed code was dead anyway and it took
just a few lines of code to restore support for case-insensitive
filesystems.  Filesystems with mixed case sensitivity behave exactly the
same as case-sensitive filesystems, as has always been the case on FreeBSD.

Anyway the big change has just been committed:
https://svnweb.freebsd.org/changeset/base/303763
Please test away.

Another note is that the filesystem name cache is now disabled for case
insensitive filesystems and filesystems with normalization other than
none.  That may hurt the lookup performance, but should ensure
correctness of operations.
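
For anyone who wants to exercise the new code path, a hypothetical
example (the pool and dataset names are made up; casesensitivity can only
be set at dataset creation time):

```sh
# Create a dataset whose lookups should now bypass the name cache
# because of these property settings:
$ zfs create -o casesensitivity=insensitive -o normalization=formD tank/winshare

# Inspect the properties that trigger the cache-bypass behavior:
$ zfs get casesensitivity,normalization tank/winshare
```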

-- 
Andriy Gapon


some [big] changes to ZPL (ZFS<->VFS )

2016-08-03 Thread Andriy Gapon

I would like to get more comments about the following review request:
https://reviews.freebsd.org/D6533
Both about the code changes and about the general direction of the changes.

I've been asked to try to get this change into 11.0 because of
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=209158
So, there is some urgency to my request.
To be honest I've never wanted to rush in this change, but I probably
have been sitting on it for too long.

The ZFS POSIX Layer was originally written for the Solaris VFS, which is
very different from the FreeBSD VFS.  Many things that the FreeBSD VFS
manages on behalf of all filesystems are also (re-)implemented in the ZPL
in a different way.  As a result, the ZPL contains code that is redundant
on FreeBSD, duplicates VFS functionality, or, in the worst cases,
interacts badly with the VFS.

I am talking mostly about operations that work on the filesystem
namespace rather than on the contents and properties of files.
Such operations typically work on two or more vnodes.  The vnodes are
usually locked in a particular way, or are expected to be locked in a
particular way by the filesystem.

The ZPL, on the other hand, has quite an elaborate locking system of its
own.  It includes a lock that protects a znode, a lock that protects the
znode's notion of its parent, and locks that protect the name-to-znode
relationships for the znode's children.

So, we have two competing locking systems for the filesystem nodes and
they are not aware of each other.

The most prominent problem is a deadlock caused by the lock order
reversal of vnode locks that may happen with concurrent zfs_rename() and
lookup().  The deadlock is a result of zfs_rename() not observing the
vnode locking contract expected by the VFS.  I should mention here that
VOP_RENAME is the weirdest method with respect to locking requirements.
Every filesystem implementation is expected to perform quite an
elaborate locking "dance" in VOP_RENAME.  This is in contrast to all
other operations where all necessary locking is done by the VFS before
calling into the filesystem code.
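
As a rough sketch of that contract (simplified and from memory; the
authoritative description is in the VFS sources, e.g.
sys/kern/vnode_if.src, and in VOP_RENAME(9)):

```c
/*
 * Entry conditions for VOP_RENAME (simplified sketch):
 *
 *   fdvp, fvp : referenced but NOT locked (source directory and entry)
 *   tdvp, tvp : referenced and locked (tvp is NULL when the target
 *               name does not exist yet)
 *
 * The filesystem must itself lock the source vnodes without
 * deadlocking against concurrent lookups and renames -- the locking
 * "dance" -- and must drop every lock and reference before returning,
 * even on error.  No other VOP places this burden on the filesystem.
 */
```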

So, I started by fixing zfs_rename(), but that required making some
changes to the "bottom half" of the ZPL (the "upper half" being the code
that implements the vnode and VFS operations), and that in turn required
making changes in other VOPs anyway.  So the next step was to remove
redundant re-lookup of child nodes in methods like zfs_rmdir and
zfs_remove.  Before I could stop I removed all ZPL internal locking that
was supposed to protect parent-child relationships of filesystem nodes.
Those relationships are now protected exclusively by the vnode locks and
the code is changed to take advantage of that fact and to properly
interact with VFS.

As a minor detail, the removal of the internal locking allowed all ZPL
dmu_tx_assign calls to use TXG_WAIT mode.
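
For context, the difference is roughly the following (a sketch based on
the usual DMU transaction pattern; holds and error handling abbreviated):

```c
/*
 * With ZPL-internal locks held, a caller could not block inside
 * dmu_tx_assign() and had to use the retry pattern:
 */
tx = dmu_tx_create(zfsvfs->z_os);
dmu_tx_hold_zap(tx, dzp->z_id, TRUE, name);	/* example hold */
error = dmu_tx_assign(tx, TXG_NOWAIT);
if (error == ERESTART) {
	dmu_tx_wait(tx);
	dmu_tx_abort(tx);
	/* ... drop and re-take the ZPL locks, then retry from the top ... */
}

/*
 * With those locks gone, the call can simply sleep until a transaction
 * group is available:
 */
error = dmu_tx_assign(tx, TXG_WAIT);
```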

Another change that was not strictly required and which is probably too
intrusive is killing the support for case insensitive operations.   My
thinking was that FreeBSD VFS does not provide support for those anyway.
 But I'll probably restore the code, at least in the bottom half of the
ZPL, before committing the change.

On a more general level, I decided that the upper half of ZPL would
necessarily be quite different between different VFS models and so it
does not make much sense to keep huge chunks of ifdef-ed out code (or
compiled in, but never reached code) that's not useful in FreeBSD at
all.  I think that keeping that code makes it harder to maintain the
FreeBSD code and to _correctly_ merge upstream changes.  E.g., if a
change can be merged without conflicts into an ifdef-ed out block, that
does not mean that the merge is correct.

Last, but not least, the work was sponsored by HybridCluster / ClusterHQ
Inc back in 2013.

-- 
Andriy Gapon