Re: Maxphys on -current?

2023-08-04 Thread Jaromír Doleček
On Fri, Aug 4, 2023 at 17:27, Jason Thorpe  wrote:
> If someone does pick this up, I think it would be a good idea to start from 
> scratch, because MAXPHYS, as it stands, is used for multiple things.  
> Thankfully, I think it would be relatively straightforward to do the work 
> that I am suggesting incrementally.
>

I believe I've been the last one to look at the tls-maxphys branch and
at least update it.

I agree that it's likely more useful to start from scratch and do
incremental updates.
Certainly physio before the block interface, and also going by
controller type, e.g. ATA first and then SCSI or USB.

On the branch, I particularly disliked that there were quite a few
changes which looked either unrelated or avoidable.
As one of the first steps I'd planned to reduce the diffs against HEAD;
I've not gotten to it yet, however.

On Fri, Aug 4, 2023 at 08:04, Brian Buhrow  wrote:
> speed of the transfers on either system.  Interestingly enough, however, the 
> FreeBSD
> performance is markedly worse on this test.
> ...
> NetBSD-99.77/amd64 with SATA3 disk
> # dd if=/dev/rwd0a of=/dev/null bs=1m count=50000
> 52428800000 bytes transferred in 292.067 secs (179509496 bytes/sec)
>
> FreeBSD-13.1/AMD64 with SATA3 disk
> # dd if=/dev/da4 of=/dev/null bs=1m count=50000
> 52428800000 bytes transferred in 322.433936 secs (162603232 bytes/sec)

Interesting. FreeBSD da(4) is a character device since FreeBSD has no
block devices anymore, so it's not a raw-vs-block device difference.
Is the hardware really similar enough to be a fair comparison?

In a broader view, I have doubts whether there is any practical reason
to support block sizes bigger than 64 KiB at all.

For HDDs over SATA, maybe - bigger blocks mean potentially more
sequential I/O and hence higher total throughput.
Also, you can queue more I/O and hence avoid seeks - the current NetBSD
maximum on SATA is 32 x 64 KiB = 2048 KiB of queued I/O.
For SCSI, you can queue way more I/O than the usual disk cache can
hold even with 64KiB blocks, so bigger block size is not very
important.
Still, I doubt >64KiB blocks on HDD would achieve more than a couple
of percent increase over 64KiB ones.

For SSDs over SATA, there is no seek to worry about, but the command
latency is a concern. But even there, according to Linux benchmarks,
the total transfer rate tops out with 64KiB blocks already.

For NVMe, the command latency is close to irrelevant - it has a very
low latency command interface and very deep command queues. I don't
see bigger blocks helping much, if at all.

Jaromir


Re: Module autounload proposal: opt-in, not opt-out

2022-08-07 Thread Jaromír Doleček
On Mon, Aug 8, 2022 at 07:23, Taylor R Campbell  wrote:
> I propose we remove the assumption that ENOTTY means no objection, and
> switch from opt-out to opt-in: if a module _has_ been audited to
> verify MODULE_CMD_AUTOUNLOAD is safe, it can return 0; otherwise it
> will never be autounloaded, even if the author forgot to deny
> MODULE_CMD_AUTOUNLOAD.

Very reasonable. Please do this.

Jaromir



Re: maximum limit of files open with O_EXLOCK

2021-06-19 Thread Jaromír Doleček
On Sat, Jun 19, 2021 at 10:12, nia  wrote:
> The Zig developer found the kernel limit:
> https://nxr.netbsd.org/xref/src/sys/kern/vfs_lockf.c#116
>
> but it doesn't seem to be adjustable through sysctl.
> I wonder if it should be.

Yes, it should be a sysctl. The default should probably also be bumped higher.

Jaromir


Re: Ext4 support

2021-04-29 Thread Jaromír Doleček
On Thu, Apr 29, 2021 at 22:06, Vincent DEFERT <20@defert.com> wrote:
> I'd like to have full ext4 support so an ext4-formatted disk could be
> used to exchange data between Linux and NetBSD, for instance.

Most of the commonly used features for this should actually already be implemented.

The big one missing on the NetBSD side is journal support, but that
would be Hard (TM). If you just want an interoperable disk partition,
you can switch the journal off on the Linux side and you should be good.
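For the record, switching the journal off on the Linux side looks like
this (assuming the filesystem is unmounted and lives on /dev/sdb1 -
adjust the device name):

	# tune2fs -O ^has_journal /dev/sdb1
	# e2fsck -f /dev/sdb1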

Jaromir


Re: one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0

2021-04-14 Thread Jaromír Doleček
On Wed, Apr 14, 2021 at 03:21, Greg A. Woods  wrote:
> However their front-end code does detect it and seems to make use of it,
> and has done for some 6 years now according to "git blame" (with no
> recent fixes beyond fixing a memory leak on their end).  Here we see it
> live from FreeBSD's sysctl output, thus my concern that this feature may
> be the source of the problem:

You can test whether this is the problem by disabling the feature
negotiation in NetBSD's xbdback.c - comment out the code which sets
feature-max-indirect-segments in xbdback_backend_changed(). With the
feature disabled, the FreeBSD DomU should not use indirect segments.
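The code in question looks roughly like this (a sketch from memory -
the exact variable and constant names in xbdback_backend_changed() may
differ):

	/* comment this out so the frontend never sees the feature */
	xenbus_printf(xbt, xbusd->xbusd_path,
	    "feature-max-indirect-segments", "%u", VBD_MAX_INDIRECT_SEGMENTS);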

Jaromir


Re: I think I've found why Xen domUs can't mount some file-backed disk images! (vnd(4) hides labels!)

2021-04-11 Thread Jaromír Doleček
On Sun, Apr 11, 2021 at 17:51, Robert Elz  wrote:
>
> Date:Sun, 11 Apr 2021 14:25:40 - (UTC)
> From:mlel...@serpens.de (Michael van Elst)
> Message-ID:  
>
>   | +   dg->dg_secperunit = vnd->sc_size / DEV_BSIZE;
>
> While it shouldn't make any difference for any properly created image
> file, make it be
>
> (vnd->sc_size + DEV_BSIZE - 1) / DEV_BSIZE;
>
> so that any trailing partial sector remains in the image.

I don't think it's a good idea to include the sector which can't be written to.
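For example, take an image of 2048 full 512-byte sectors plus a
100-byte tail: the rounded-up calculation reports 2049 sectors, but the
last sector has only 100 bytes of backing store, so a full-sector write
to it has nowhere to put the remaining 412 bytes.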

Jaromir


Re: Bounties for xhci features: scatter-gather, suspend/resume

2021-03-31 Thread Jaromír Doleček
On Thu, Mar 25, 2021 at 21:36,  wrote:
> xHCI scatter-gather support
>
> The infamous "failed to create xfers". xhci wants contiguous
> allocations. With too much memory fragmentation, they're hard to do.
> This shows up as USB drivers randomly failing on machines that have been
> Up for a while.
> (chuq attempted to handle this UVM-side recently).

I'll take care of this part.

Jaromir


Re: Issues with older wd* / IDE on various platforms

2020-11-13 Thread Jaromír Doleček
On Fri, Nov 13, 2020 at 22:31, Robert Swindells  wrote:
>
>
> John Klos  wrote:
> >I've noticed problems in three places and only recently did it occur to me
> >that they may all be related.
>
> [snip]
>
> >All three of these machines have much older IDE, so I'm wondering what in
> >NetBSD changed that may've have caused this.
>
> I would guess that one thing all have in common is lack of DMA.
>
> I have been using a local patch for an ofppc machine that can only do
> PIO with the IDE controller, can try latest -current on it over the
> weekend.

That could very well be.

There is no automatic mode downgrade from DMA to PIO any more, so if
the controller advertises DMA support but it doesn't actually work,
then I/O will just fail.

Jaromir


Re: notes from running will-it-scale

2020-07-19 Thread Jaromír Doleček
Very interesting, particularly the outrageous assembly for
pmap_{zero,copy}_page().

Is there some way to tell the compiler that the address is already
4096-aligned and avoid the conditionals? Failing that, we could just
adopt the FreeBSD assembly for this.
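Something like __builtin_assume_aligned() might do it - a minimal
sketch, assuming the direct-map address is in fact always page-aligned
(pmap_zero_page_aligned is an illustrative name, not existing code):

	/*
	 * The alignment hint lets the compiler expand memset() as a
	 * plain rep stos without the head/tail alignment checks.
	 */
	static void
	pmap_zero_page_aligned(void *va)
	{
		void *p = __builtin_assume_aligned(va, 4096);

		memset(p, 0, 4096);
	}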

Does anyone see a problem with introducing a vfs.timestamp_precision
sysctl to avoid the rdtscp?

Jaromir

On Sun, Jul 19, 2020 at 13:21, Mateusz Guzik  wrote:
>
> Hello,
>
> I recently took an opportunity to run cross-systems microbenchmarks
> with will-it-scale and included NetBSD (amd64).
>
> https://people.freebsd.org/~mjg/freebsd-dragonflybsd-netbsd-v2.txt
> [no linux in this doc, I will probably create a new one soon(tm)]
>
> The system has a lot of problems in the vfs layer, vm is a mixed bag
> with multithreaded cases lagging behind and some singlethreaded being
> pretty good (and at least one winning against the other systems).
>
> Notes:
> - rtdscp is very expensive in vms, yet the kernel unconditionally
> performs by calling vfs_timestamp. Both FreeBSD and DragonflyBSD have
> a knob to change the resolution (and consequently avoid the
> instruction), I think you should introduce it and default to less
> accuracy on vms. Sample results:
> stock pipe1: 2413901
> patched pipe1: 3147312
> stock vfsmix: 13889
> patched vfsmix: 73477
> - sched_yield is apparently a nop when the binary is not linked with
> pthread. this does not match other systems and is probably a bug.
> - pmap_zero_page/pmap_copy_page compile to atrocious code which keeps
> checking for alignment. The compiler does not know what values can be
> assigned to pmap_direct_base and improvises.
>
>0x805200c3 <+0>:   add0xf93b46(%rip),%rdi#
> 0x814b3c10 
>0x805200ca <+7>:   mov$0x1000,%edx
>0x805200cf <+12>:  xor%eax,%eax
>0x805200d1 <+14>:  test   $0x1,%dil
>0x805200d5 <+18>:  jne0x805200ff 
> 
>0x805200d7 <+20>:  test   $0x2,%dil
>0x805200db <+24>:  jne0x8052010b 
> 
>0x805200dd <+26>:  test   $0x4,%dil
>0x805200e1 <+30>:  jne0x80520116 
> 
>0x805200e3 <+32>:  mov%edx,%ecx
>0x805200e5 <+34>:  shr$0x3,%ecx
>0x805200e8 <+37>:  rep stos %rax,%es:(%rdi)
>0x805200eb <+40>:  test   $0x4,%dl
>0x805200ee <+43>:  je 0x805200f1 
> 
>0x805200f0 <+45>:  stos   %eax,%es:(%rdi)
>0x805200f1 <+46>:  test   $0x2,%dl
>0x805200f4 <+49>:  je 0x805200f8 
> 
>0x805200f6 <+51>:  stos   %ax,%es:(%rdi)
>0x805200f8 <+53>:  and$0x1,%edx
>0x805200fb <+56>:  je 0x805200fe 
> 
>0x805200fd <+58>:  stos   %al,%es:(%rdi)
>0x805200fe <+59>:  retq
>0x805200ff <+60>:  stos   %al,%es:(%rdi)
>0x80520100 <+61>:  mov$0xfff,%edx
>0x80520105 <+66>:  test   $0x2,%dil
>0x80520109 <+70>:  je 0x805200dd 
> 
>0x8052010b <+72>:  stos   %ax,%es:(%rdi)
>0x8052010d <+74>:  sub$0x2,%edx
>0x80520110 <+77>:  test   $0x4,%dil
>0x80520114 <+81>:  je 0x805200e3 
> 
>0x80520116 <+83>:  stos   %eax,%es:(%rdi)
>0x80520117 <+84>:  sub$0x4,%edx
>0x8052011a <+87>:  jmp0x805200e3 
> 
>
> The thing to do in my opinion is to just provide dedicated asm funcs.
> This is the equivalent on FreeBSD (ifunc'ed):
>
> ENTRY(pagezero_std)
> PUSH_FRAME_POINTER
> movl$PAGE_SIZE/8,%ecx
> xorl%eax,%eax
> rep
> stosq
> POP_FRAME_POINTER
> ret
> END(pagezero_std)
>
> ENTRY(pagezero_erms)
> PUSH_FRAME_POINTER
> movl$PAGE_SIZE,%ecx
> xorl%eax,%eax
> rep
> stosb
> POP_FRAME_POINTER
> ret
> END(pagezero_erms)
>
> --
> Mateusz Guzik 
>


Re: kernel stack usage

2020-07-04 Thread Jaromír Doleček
On Sat, Jul 4, 2020 at 15:30, Kamil Rytarowski  wrote:
> > Kamil - what's the difference in gcc between -Wframe-larger-than= and
> > -Wstack-size= ?
> >
>
> -Wstack-size doesn't exist?

Sorry, meant -Wstack-usage=

> > I see according to gcc documentation -Wframe-larger-than doesn't count
> > size for alloca() and variable-length arrays, which makes it much less
> > useful than -Wstack-usage.
> >
>
> It's not a problem.
>
> Whenever alloca or VLA is in use, it's already breaking the stack
> protector. There are a few exceptions registered in sys/conf/ssp.mk.
>
> We could add -Walloca -Wvla for USE_SSP=yes builds to catch quickly
> inappropriate usage (frequently violated by code optimizer).

alloca() is already not used in our kernel code - I checked, and found
it used only in some arch-specific stand/ code. So -Walloca should be
safe to turn on unconditionally, regardless of SSP - unless the
compiler emits alloca() calls itself.
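A minimal sketch of what enabling it could look like, reusing the
make-conditional style already used for other gcc-only flags in
sys/conf/Makefile.kern.inc:

	CFLAGS+=	${${ACTIVE_CC} == "gcc":? -Walloca :}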

Jaromir


Re: kernel stack usage

2020-07-04 Thread Jaromír Doleček
Can anybody using clang please confirm the kernel builds with
-Wframe-larger-than=3584?

Kamil - what's the difference in gcc between -Wframe-larger-than= and
-Wstack-size= ?

I see according to gcc documentation -Wframe-larger-than doesn't count
size for alloca() and variable-length arrays, which makes it much less
useful than -Wstack-usage.

Jaromir

On Sun, May 31, 2020 at 16:39, Kamil Rytarowski  wrote:
>
> Can we adopt -Wframe-larger-than=1024 and mark it fatal?
>
> This option is supported by GCC and Clang.
>
> On 31.05.2020 15:55, Jaromír Doleček wrote:
> > I think it would make sense to add -Wstack-usage=X to kernel builds.
> >
> > Either 2KB or 1KB seems to be good limit, not too many offenders
> > between 1KB and 2KB so maybe worth it to use 1KB and change them.
> >
> > I'm sure there would be something similar for LLVM too.
> >
> > Jaromir
> >
> > On Sun, May 31, 2020 at 15:39, Simon Burge  wrote:
> >>
> >> matthew green wrote:
> >>
> >>> glad to see this effort and the clean up already!
> >>>
> >>> ideally, we can break the kernel build if large stack consumers
> >>> are added to the kernel.  i'd be OK with it being default on,
> >>> with of course a way to skip it, and if necessary it can have
> >>> a whitelist of "OK large users."
> >>
> >> I started to look at -fstack-usage which gives a nice MI way of getting
> >> the data instead of parsing MD objdump output, but didn't get any
> >> further than the below patch.  The find(1) command referenced in the
> >> patch gives the following
> >>
> >>   1008 nfs_serv.c:2181:1:nfsrv_link
> >>   1056 procfs_linux.c:422:1:procfs_do_pid_stat
> >>   1056 vfs_subr.c:1521:1:vfs_buf_print
> >>   1072 dl_print.c:80:1:sdl_print
> >>   1104 core_elf32.c:300:1:coredump_getseghdrs_elf32
> >>   1104 core_elf32.c:300:1:coredump_getseghdrs_elf64
> >>   1104 sys_ptrace_common.c:1595:1:proc_regio
> >>   1120 subr_bufq.c:131:1:bufq_alloc
> >>   1136 db_lwp.c:64:1:db_lwp_whatis
> >>   1152 kern_ktrace.c:1269:1:ktrwrite
> >>   1168 uvm_swap.c:768:1:uvm_swap_stats.part.1
> >>   1312 nfs_serv.c:1905:1:nfsrv_rename
> >>   2112 tty_60.c:68:1:compat_60_ptmget_ioctl
> >>   2144 memmem.c:83:14:twoway_memmem
> >>
> >> Anyone else want to run with adding a bit more to this to do some build
> >> failure as mrg@ suggests and documenting for options(4)?
> >>
> >> Cheers,
> >> Simon.
> >> --
> >> Index: Makefile.kern.inc
> >> ===
> >> RCS file: /cvsroot/src/sys/conf/Makefile.kern.inc,v
> >> retrieving revision 1.270
> >> diff -d -p -u -r1.270 Makefile.kern.inc
> >> --- Makefile.kern.inc   21 May 2020 18:44:19 -  1.270
> >> +++ Makefile.kern.inc   31 May 2020 13:34:13 -
> >> @@ -104,6 +104,11 @@ CFLAGS+=   -ffreestanding -fno-zero-initia
> >>  CFLAGS+=   ${${ACTIVE_CC} == "gcc":? -fno-delete-null-pointer-checks 
> >> :}
> >>  CFLAGS+=   ${DEBUG} ${COPTS}
> >>  AFLAGS+=   -D_LOCORE -Wa,--fatal-warnings
> >> +.if defined(KERN_STACK_USAGE)
> >> +# example usage to find largest stack users in kernel compile directory:
> >> +#find . -name \*.su | xargs awk '{ printf "%6d %s\n", $2, $1 }' | 
> >> sort +1n
> >> +CFLAGS+=   -fstack-usage
> >> +.endif
> >>
> >>  # XXX
> >>  .if defined(HAVE_GCC) || defined(HAVE_LLVM)
> >> @@ -338,8 +343,8 @@ ${_s:T:R}.o: ${_s}
> >>  .if !target(__CLEANKERNEL)
> >>  __CLEANKERNEL: .USE
> >> ${_MKMSG} "${.TARGET}ing the kernel objects"
> >> -   rm -f ${KERNELS} *.map eddep tags *.[io] *.ko *.ln [a-z]*.s vers.c 
> >> \
> >> -   [Ee]rrs linterrs makelinks assym.h.tmp assym.h \
> >> +   rm -f ${KERNELS} *.map *.[io] *.ko *.ln [a-z]*.s *.su vers.c \
> >> +   eddep tags [Ee]rrs linterrs makelinks assym.h.tmp assym.h \
> >> ${EXTRA_KERNELS} ${EXTRA_CLEAN}
> >>  .endif
> >>
>
>


Re: Straw proposal: MI kthread vector/fp unit API

2020-06-23 Thread Jaromír Doleček
On Tue, Jun 23, 2020 at 00:14, Eduardo Horvath  wrote:
> The SPARC has always had a lazy FPU save logic.  The fpstate structure is
> not part of the pcb and is allocated on first use.

No lazy FPU save logic, please. It was eradicated from x86 for a reason:
https://access.redhat.com/solutions/3485131
The same should really be done for SPARC.

Most certainly, we should not add any new interface using lazy FPU
save logic, particularly not for something involving cryptography.

Jaromir


Re: UBSan: Undefined Behavior in lf_advlock

2020-06-05 Thread Jaromír Doleček
On Fri, Jun 5, 2020 at 21:49, syzbot  wrote:
> [  44.1699615] panic: UBSan: Undefined Behavior in 
> /syzkaller/managers/netbsd-kubsan/kernel/sys/kern/vfs_lockf.c:843:16, signed 
> integer overflow: 131072 + 9223372036854771712 cannot be represented in type 
> 'long int'
>
> [  44.1931600] cpu0: Begin traceback...
> [  44.1999503] vpanic() at netbsd:vpanic+0x287 sys/kern/subr_prf.c:290
> [  44.2299515] isAlreadyReported() at netbsd:isAlreadyReported
> [  44.2599494] HandleOverflow() at netbsd:HandleOverflow+0x1c9 
> sys/../common/lib/libc/misc/ubsan.c:375
> [  44.2899499] lf_advlock() at netbsd:lf_advlock+0x2124 
> sys/kern/vfs_lockf.c:843

This happens in:

	if (fl->l_len > 0)
		end = start + fl->l_len - 1;
	else {

when the call to fcntl() is arranged so that 'start' ends up 0x20000
and fl->l_len 0x7ffffffffffff000, overflowing the off_t.

I wonder - is it important to fix cases like that?
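If we do want to harden it, a minimal sketch of an overflow-checked
variant (illustrative only, not a tested fix; __builtin_add_overflow()
is available in both gcc and clang):

	if (fl->l_len > 0) {
		/* fl->l_len - 1 cannot itself overflow since fl->l_len > 0 */
		if (__builtin_add_overflow(start, fl->l_len - 1, &end))
			return EINVAL;
	} else {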

Jaromir


UEFI boot and PCI interrupt routing (amd64)

2020-06-03 Thread Jaromír Doleček
Hi,

I'm working on a driver for a PCI device. I'm far enough along to
execute operations which should trigger an interrupt, but the interrupt
handler (registered via pci_intr_establish()) is never called. It looks
like there is some kind of interrupt routing problem, maybe.

Any hints on what I could/should check?

Unfortunately I can't boot the machine in legacy mode, to check whether
it's due to missing initialisation by the BIOS.

Jaromir


Re: kernel stack usage

2020-05-31 Thread Jaromír Doleček
I think it would make sense to add -Wstack-usage=X to kernel builds.

Either 2KB or 1KB seems to be a good limit; there are not too many
offenders between 1KB and 2KB, so it may be worth using 1KB and
changing them.

I'm sure there would be something similar for LLVM too.

Jaromir

On Sun, May 31, 2020 at 15:39, Simon Burge  wrote:
>
> matthew green wrote:
>
> > glad to see this effort and the clean up already!
> >
> > ideally, we can break the kernel build if large stack consumers
> > are added to the kernel.  i'd be OK with it being default on,
> > with of course a way to skip it, and if necessary it can have
> > a whitelist of "OK large users."
>
> I started to look at -fstack-usage which gives a nice MI way of getting
> the data instead of parsing MD objdump output, but didn't get any
> further than the below patch.  The find(1) command referenced in the
> patch gives the following
>
>   1008 nfs_serv.c:2181:1:nfsrv_link
>   1056 procfs_linux.c:422:1:procfs_do_pid_stat
>   1056 vfs_subr.c:1521:1:vfs_buf_print
>   1072 dl_print.c:80:1:sdl_print
>   1104 core_elf32.c:300:1:coredump_getseghdrs_elf32
>   1104 core_elf32.c:300:1:coredump_getseghdrs_elf64
>   1104 sys_ptrace_common.c:1595:1:proc_regio
>   1120 subr_bufq.c:131:1:bufq_alloc
>   1136 db_lwp.c:64:1:db_lwp_whatis
>   1152 kern_ktrace.c:1269:1:ktrwrite
>   1168 uvm_swap.c:768:1:uvm_swap_stats.part.1
>   1312 nfs_serv.c:1905:1:nfsrv_rename
>   2112 tty_60.c:68:1:compat_60_ptmget_ioctl
>   2144 memmem.c:83:14:twoway_memmem
>
> Anyone else want to run with adding a bit more to this to do some build
> failure as mrg@ suggests and documenting for options(4)?
>
> Cheers,
> Simon.
> --
> Index: Makefile.kern.inc
> ===
> RCS file: /cvsroot/src/sys/conf/Makefile.kern.inc,v
> retrieving revision 1.270
> diff -d -p -u -r1.270 Makefile.kern.inc
> --- Makefile.kern.inc   21 May 2020 18:44:19 -  1.270
> +++ Makefile.kern.inc   31 May 2020 13:34:13 -
> @@ -104,6 +104,11 @@ CFLAGS+=   -ffreestanding -fno-zero-initia
>  CFLAGS+=   ${${ACTIVE_CC} == "gcc":? -fno-delete-null-pointer-checks :}
>  CFLAGS+=   ${DEBUG} ${COPTS}
>  AFLAGS+=   -D_LOCORE -Wa,--fatal-warnings
> +.if defined(KERN_STACK_USAGE)
> +# example usage to find largest stack users in kernel compile directory:
> +#find . -name \*.su | xargs awk '{ printf "%6d %s\n", $2, $1 }' | sort 
> +1n
> +CFLAGS+=   -fstack-usage
> +.endif
>
>  # XXX
>  .if defined(HAVE_GCC) || defined(HAVE_LLVM)
> @@ -338,8 +343,8 @@ ${_s:T:R}.o: ${_s}
>  .if !target(__CLEANKERNEL)
>  __CLEANKERNEL: .USE
> ${_MKMSG} "${.TARGET}ing the kernel objects"
> -   rm -f ${KERNELS} *.map eddep tags *.[io] *.ko *.ln [a-z]*.s vers.c \
> -   [Ee]rrs linterrs makelinks assym.h.tmp assym.h \
> +   rm -f ${KERNELS} *.map *.[io] *.ko *.ln [a-z]*.s *.su vers.c \
> +   eddep tags [Ee]rrs linterrs makelinks assym.h.tmp assym.h \
> ${EXTRA_KERNELS} ${EXTRA_CLEAN}
>  .endif
>


Re: kernel stack usage

2020-05-30 Thread Jaromír Doleček
On Sat, May 30, 2020 at 18:41, Jason Thorpe  wrote:
> These two seem slightly bogus.  coredump_note_elf64() was storing register 
> state not the stack, but not nearly 3K worth.  procfs_domounts() has nearly 
> nothing on the stack as far as I can tell, and the one function that could be 
> auto-inlined that it calls doesn't have much either.
>

struct statvfs is certainly over 3 KB - line 619

Jaromir


Re: kernel stack usage

2020-05-30 Thread Jaromír Doleček
I've fixed several where I felt comfortable, feel free to do more:
4096pci_conf_print at pci_subr.c:4812
4096dtv_demux_read at dtv_demux.c:493
3408genfb_calc_hsize at genfb.c:630
2240bwfm_rx_event_cb at bwfm.c:2099
1664wdcprobe_with_reset at wdc.c:491

Jaromir

On Sat, May 30, 2020 at 16:18, Christos Zoulas  wrote:
>
> In article <20200530095218.gb28...@mail.duskware.de>,
> Martin Husemann   wrote:
> >Hey folks,
> >
> >triggered by some experiments simonb did on mips I wrote a script to find
> >the functions using the bigest stack frame in my current sparc64 kernel.
> >
> >The top 15 list is:
> >
> >Frame/b Function
> >4096pci_conf_print at pci_subr.c:4812
> >4096dtv_demux_read at dtv_demux.c:493
> >3536SHA3_Selftest at sha3.c:430
> >3408genfb_calc_hsize at genfb.c:630
> >3248radeonfb_pickres at radeonfb.c:4127
> >2304radeonfb_set_cursor at radeonfb.c:3690
> >2272gem_pci_attach at if_gem_pci.c:147
> >2256twoway_memmem at memmem.c:84
> >2240bwfm_rx_event_cb at bwfm.c:2099
> >2240compat_60_ptmget_ioctl at tty_60.c:70
> >2112db_stack_trace_print at db_trace.c:77
> >1664wdcprobe_with_reset at wdc.c:491
> >1424nfsrv_rename at nfs_serv.c:1906
> >1408OF_mapintr at ofw_machdep.c:728
> >1344sysctl_hw_firmware_path at firmload.c:81
> >1280fw_bmr at firewire.c:2296
> >1264cdioctl at cd.c:1204
> >1248cpu_reset_fpustate at cpu.c:400
> >1248aubtfwl_attach_hook at aubtfwl.c:273
> >1248uvm_swap_stats at uvm_swap.c:726
> >
> >(left column is size of the frame on sparc64 in bytes)
> >
> >I think anything > 1k is dubious and should be checked.
>
> I agree, here is the same for x86_64/GENERIC...
>
> 4408 8027af14:pci_conf_print+0xd
> 4128 80a8dca0:dtv_demux_read+0xb
> 3352 80b940bb:procfs_domounts+0xd
> 3272 80e36b4b:SHA3_Selftest+0xd
> 3264 80c677da:coredump_note_elf64+0xb
> 3240 80b537c6:genfb_calc_hsize.isra.0+0x5
> 2704 80c66a88:coredump_note_elf32+0xb
> 2408 80227a71:process_machdep_doxstate+0xd
> 2184 804381fd:linux_ioctl_termios+0xd
> 2168 80440b2d:linux32_ioctl_termios+0xd
> 2112 802c5579:gem_pci_attach+0xb
> 2104 80e465c3:twoway_memmem+0xd
> 2088 806b5c18:bwfm_rx_event_cb+0xd
> 2072 8097e221:compat_60_ptmget_ioctl+0xd
> 2064 8053ce72:db_stack_trace_print+0x11
> 1488 8064f943:wdcprobe_with_reset+0xb
> 1384 80d7ee2b:ipmi_match+0x9
> 1328 80467a95:usb_add_event+0x7
> 1304 80ba85bb:nfsrv_rename+0xd
> 1256 8053069f:acpicpu_md_pstate_sysctl_all+0xd
> 1256 8052cd13:acpicpu_start+0x9
> 1240 8043b162:linux_sys_rt_sigreturn+0x9
> 1192 80b9392a:procfs_do_pid_stat+0xd
> 1176 80d70810:sysctl_hw_firmware_path+0xd
> 1160 807b30a3:radeon_cs_ioctl+0xd
> 1128 8044610f:oss_ioctl_mixer+0xd
> 1128 8024dd3c:cdioctl+0xd
> 1112 80c633ca:uvm_swap_stats.part.1+0xd
> 1104 804fdf59:fw_bmr+0xb
> 1096 80c8c0ab:ktrwrite+0xd
> 1080 80573ca6:ahc_print_register+0xd
> 1080 80550de8:procfs_getonecpu+0xd
> 1080 8048d95e:aubtfwl_attach_hook+0x9
> 1064 80cf925d:proc_regio+0xd
> 1064 80ccb407:bufq_alloc+0xd
> 1064 80ae8090:ar5112SetPowerTable+0xd
> 1064 80582aa0:ahd_print_register+0xd
> 1064 80568a53:tpmread+0xd
> 1064 80382262:txp_attach+0xd
> 1064 802636da:ata_probe_caps+0xd
> 1048 80e06097:ar9003_paprd_tx_tone_done+0xd
> 1048 80d87e9b:sdl_print+0x9
> 1048 80c67604:coredump_getseghdrs_elf64+0xd
>
>


ACPI SCI interrupt flooding - any ideas?

2020-04-27 Thread Jaromír Doleček
Hi,

we have the SCI interrupt storm - http://gnats.netbsd.org/53687

Linux has this quirk, and it claims such thing happens due to some
unsupported hw features. Would it make sense to have some quirk like
that?

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9c4aa1eecb48cfac18ed5e3aca9d9ae58fbafc11

Any ideas what to do? I can try things if given some guidance, but I
simply don't know where to start looking.

Jaromir


Network interface deferred start - does it work?

2020-03-27 Thread Jaromír Doleček
Hi,

I'm looking on some performance improvements for xennet(4) and
particularly xvif(4).

The drivers use deferred start, and there are lots of comments about
how important it is to batch requests, but in practice it's really rare
for the start routine to be called with more than one mbuf in the Tx
queue.
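For reference, the pattern in question is roughly this (a sketch from
memory; details differ per driver):

	/* in the attach routine */
	if_deferred_start_init(ifp, NULL);

	/* on the enqueue path, instead of calling if_start directly */
	if_schedule_deferred_start(ifp);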

It seems the softnet handler is triggered more or less immediately
after each packet is enqueued, in most cases. Maybe this used to be
different when the giant lock was held more, but not so much nowadays.

So, is it supposed to still work?

Jaromir


Re: config_mounroot - spinout while attaching nouveaufb0 on amd64 with LOCKDEBUG

2020-02-19 Thread Jaromír Doleček
On Mon, Feb 17, 2020 at 17:55, matthew green  wrote:
>
> FWIW, i've been running my radeon with a patch that exlicitly drops
> kernel lock around the "real attach" function (the one that config
> mountroot ends up calling.)
>
> we really need to MPSAFE-ify the autoconf subsystem.  right now, it
> is expected that autoconf runs with kernel lock... i am not sure of
> the path we should take for this -- but let's actually have a design
> in place we are happy with, while my change below works, it's ugly
> and wrong.

Would it make sense to simply require all callers of
config_mountroot() to have MPSAFE config routines, and call the
callbacks for them without the kernel lock?

There are just a few of those, and they can simply be changed to do the
KERNEL_LOCK()/UNLOCK() themselves.
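A minimal sketch of what such a caller would look like
(foo_mountroot_cb is a hypothetical name):

	static void
	foo_mountroot_cb(device_t self)
	{
		KERNEL_LOCK(1, NULL);
		/* ... existing, not-yet-MPSAFE attach work ... */
		KERNEL_UNLOCK_ONE(NULL);
	}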

Jaromir


config_mounroot - spinout while attaching nouveaufb0 on amd64 with LOCKDEBUG

2020-02-16 Thread Jaromír Doleček
Hi,

while debugging the MSI attachment for nouveaufb0, I've got several
times spinout panic like one below. It doesn't happen on every boot,
but on almost every one.

I confirmed via ddb that this happens due to config_mountroot_thread()
holding the kernel lock for too long - that's where backtrack for cpu
0 (which holds the lock) leads.

Would it be worth it to spend some efford to run that thread without
kernel lock, or postpone the driver initialization even further?

Or possibly bump the spinout count (spinout after 1.6 seconds seems a
bit too quick), or maybe just disable the spinout check during boot
altogether?

Jaromir

The end of boot:
[  12.1233898] nouveau0: info: No connectors reported connected with modes
[  12.1392346] kern info: [drm] Cannot find any crtc or sizes - going 1024x768
[  12.1461457] nouveaufb0 at nouveau0
[  13.7057542] Kernel lock error: _kernel_lock,244: spinout

[  13.7157692] lock address : 0x80a330c0 type :   spin
[  13.7157692] initialized  : 0x8070c259
[  13.7257828] shared holds :  0 exclusive:  1
[  13.7257828] shares wanted:  0 exclusive:  3
[  13.7357995] relevant cpu :  3 last held:  0
[  13.7458129] relevant lwp : 0x846d0bc7d740 last held: 0x846d0daf2b80
[  13.7458129] last locked* : 0x804e82aa unlocked : 0x80238583
[  13.7558296] curcpu holds :  0 wanted by: 0x846d0bc7d740

[  13.7658445] panic: LOCKDEBUG: Kernel lock error: _kernel_lock,244: spinout
[  13.7758590] cpu3: Begin traceback...
[  13.7758590] vpanic() at netbsd:vpanic+0x146
[  13.7758590] snprintf() at netbsd:snprintf
[  13.7858758] lockdebug_more() at netbsd:lockdebug_more
[  13.7858758] _kernel_lock() at netbsd:_kernel_lock+0x244
[  13.7958892] ip_slowtimo() at netbsd:ip_slowtimo+0x1a
[  13.7958892] pfslowtimo() at netbsd:pfslowtimo+0x31
[  13.8059048] callout_softclock() at netbsd:callout_softclock+0x10f
[  13.8059048] softint_dispatch() at netbsd:softint_dispatch+0x105
[  13.8159175] DDB lost frame for netbsd:Xsoftintr+0x4f, trying
0x9a00adca1ff0
[  13.8259329] Xsoftintr() at netbsd:Xsoftintr+0x4f
[  13.8259329] --- interrupt ---
[  13.8259329] 28f783ada3ba8615:
[  13.8359485] cpu3: End traceback...
[  13.8359485] fatal breakpoint trap in supervisor mode
[  13.8359485] trap type 1 code 0 rip 0x8021e3a5 cs 0x8 rflags
0x202 cr2 0 ilevel 0x2 rsp 0x9a00adca1d70
[  13.8459661] curlwp 0x846d0bc7d740 pid 0.36 lowest kstack
0x9a00adc9d2c0
Stopped in pid 0.36 (system) at netbsd:breakpoint+0x5:  leave


Re: Proposal to remove netsmb / smbfs

2020-01-20 Thread Jaromír Doleček
On Mon, Jan 20, 2020 at 21:07, Jason Thorpe  wrote:
>
> (Cross-posted to tech-kern / tech-net because it affects networking and file 
> systems.)
>
> I would like to propose that we remove netsmb and smbfs.  Two reasons:

Yes, please.

> 1- They only support SMB1, which is an ancient flavor of the protocol.  SMB2 
> was introduced in 2006 and SMB3 in 2012.  SMB3 is the flavor of the protocol 
> favored by all of the major implementations, and for those that still support 
> SMB1, you sometimes have to go out of your way to enable it.
>
> 2- It is not MP-safe.  The effort to make it MP-safe would be non-trivial.

I remember that netsmb in particular works in some fairly strange ways
internally; I'd been meaning to look at it ever since importing it,
which however never happened.

IIRC SMB1 is not even supported any more in current versions of
Windows, so the utility of smbfs is close to zero.

Jaromir


Re: New "ahcisata0 port 1: device present" messages with NetBSD 9

2020-01-17 Thread Jaromír Doleček
All right, let's do that. Penalising quirky old gear makes more sense
than penalising new gear.

I didn't like that the controller approach means we penalise ALL
drives on the controller. However, it's better than penalising EVO
drives everywhere, when we have evidence they work without errors on
other controllers.

Please go ahead.

Jaromir

On Fri, Jan 17, 2020 at 10:24, Simon Burge  wrote:
>
> matthew green wrote (in kern/54855):
>
> > most people with 860 EVOs are successful with NCQ, i'm one
> > of them with a 500GB model, with netbsd-9 for months now,
> > and they're a wildly successful line i'm sure many others
> > also have working happily with NCQ.
> >
> > we don't know what the real issue is.  perhaps it is a
> > problem that the controller had that these drives trigger,
> > rather than a problem with them, so perhaps we should
> > disable it on the controller instead of the drive, if
> > we're going to pick one of the two triggers here.
>
> Izumi Tsutsui wrote:
>
> > Note mine is:
> > >> ahcisata0 at pci0 dev 18 function 0: vendor 1002 product 4380 (rev. 0x00)
> >  [ ... ]
> > >> 000:18:0: ATI Technologies SB600 SATA Controller (SATA mass storage, 
> > >> AHCI 1.0)
> >
> > Note it already has AHCI_QUIRK_BADPM, but the Samsung EVO 860 NCQ problem
> > should be an independent quirk?
> >  https://nxr.netbsd.org/xref/src/sys/dev/ic/ahcisatavar.h?r=1.23#57
> >  https://nxr.netbsd.org/xref/src/sys/dev/pci/ahcisata_pci.c?r=1.56#59
>
> Using the approach suggested by mrg, where we "penalise" the (older)
> AMD chipsets that exhibit the problem rather than the more recent and
> popular Samsung EVO 860 disks, I've added a new AHCI_QUIRK_BADNCQ quirk
> and enabled it for the same AMD chipsets that had the AHCI_QUIRK_BADPM
> quirk.  I do wonder, out of caution, if maybe this quirk should be
> enabled for PCI_PRODUCT_ATI_SB600_SATA_1 as well?
>
> Before patch:
>
> wd0(ahcisata0:0:0): using PIO mode 4, DMA mode 2, Ultra-DMA mode 6 
> (Ultra/133) (using DMA), NCQ (31 tags)
>
> After patch:
>
> wd0(ahcisata0:0:0): using PIO mode 4, DMA mode 2, Ultra-DMA mode 6 
> (Ultra/133) (using DMA), WRITE DMA FUA EXT
>
> Jaromir - are you happy with this alternate approach?  If you are
> happy with this I'll commit this patch and revert your EVO 860 quirk
> and request a pull-up to the netbsd-9 branch.
>
> tsutsui - are you able to test the patch too?
>
> Cheers,
> Simon.
>
> --
> Index: sys/dev/ic/ahcisata_core.c
> ===
> RCS file: /cvsroot/src/sys/dev/ic/ahcisata_core.c,v
> retrieving revision 1.75.4.2
> diff -d -p -u -r1.75.4.2 ahcisata_core.c
> --- sys/dev/ic/ahcisata_core.c  24 Dec 2019 17:34:33 -  1.75.4.2
> +++ sys/dev/ic/ahcisata_core.c  17 Jan 2020 08:45:41 -
> @@ -276,6 +276,11 @@ ahci_attach(struct ahci_softc *sc)
> "ignoring broken port multiplier support\n");
> sc->sc_ahci_cap &= ~AHCI_CAP_SPM;
> }
> +   if (sc->sc_ahci_quirks & AHCI_QUIRK_BADNCQ) {
> +   aprint_verbose_dev(sc->sc_atac.atac_dev,
> +   "ignoring broken NCQ support\n");
> +   sc->sc_ahci_cap &= ~AHCI_CAP_NCQ;
> +   }
> sc->sc_atac.atac_nchannels = (sc->sc_ahci_cap & AHCI_CAP_NPMASK) + 1;
> sc->sc_ncmds = ((sc->sc_ahci_cap & AHCI_CAP_NCS) >> 8) + 1;
> ahci_rev = AHCI_READ(sc, AHCI_VS);
> Index: sys/dev/ic/ahcisatavar.h
> ===
> RCS file: /cvsroot/src/sys/dev/ic/ahcisatavar.h,v
> retrieving revision 1.22
> diff -d -p -u -r1.22 ahcisatavar.h
> --- sys/dev/ic/ahcisatavar.h14 Jan 2019 21:29:56 -  1.22
> +++ sys/dev/ic/ahcisatavar.h17 Jan 2020 08:45:41 -
> @@ -59,6 +59,7 @@ struct ahci_softc {
>  #define AHCI_PCI_QUIRK_BAD64   __BIT(1)  /* broken 64-bit DMA */
>  #define AHCI_QUIRK_BADPMP  __BIT(2)  /* broken PMP support, ignore */
>  #define AHCI_QUIRK_SKIP_RESET  __BIT(4)  /* skip drive reset sequence */
> +#define AHCI_QUIRK_BADNCQ  __BIT(5)  /* possibly broken NCQ support, 
> ignore */
>
> uint32_t sc_ahci_cap;   /* copy of AHCI_CAP */
> int sc_ncmds; /* number of command slots */
> Index: sys/dev/pci/ahcisata_pci.c
> ===
> RCS file: /cvsroot/src/sys/dev/pci/ahcisata_pci.c,v
> retrieving revision 1.55.4.1
> diff -d -p -u -r1.55.4.1 ahcisata_pci.c
> --- sys/dev/pci/ahcisata_pci.c  23 Oct 2019 18:09:18 -  1.55.4.1
> +++ sys/dev/pci/ahcisata_pci.c  17 Jan 2020 08:45:41 -
> @@ -179,17 +179,17 @@ static const struct ahci_pci_quirk ahci_
> AHCI_PCI_QUIRK_FORCE },
> /* ATI SB600 AHCI 64-bit DMA only works on some boards/BIOSes */
> { PCI_VENDOR_ATI, PCI_PRODUCT_ATI_SB600_SATA_1,
> -   AHCI_PCI_QUIRK_BAD64 | AHCI_QUIRK_BADPMP },
> +   AHCI_PCI_QUIRK_BAD64 | AHCI_QUIRK_BADPMP | AHCI_QUIRK_BADNCQ },
> { PCI_VENDOR_ATI, PCI_PRODUCT_ATI_SB7

Re: New "ahcisata0 port 1: device present" messages with NetBSD 9

2020-01-13 Thread Jaromír Doleček
On Mon, Jan 13, 2020 at 11:56, Simon Burge  wrote:
> A bit more digging shows that this seems to be a (somewhat) known
> problem with Samsung EVO 860 disks and AMD SB710/750 chipsets.  The
> problem also occurs on Windows and Linux with these drives and chipsets.
>
> Here's a couple of links:
>
>   
> https://eu.community.samsung.com/t5/Cameras-IT-Everything-Else/860-EVO-250GB-causing-freezes-on-AMD-system/td-p/575813
>   https://bugzilla.kernel.org/show_bug.cgi?id=201693
>
> In the first, there's discussion of how Samsung haven't released a
> firmware update to address this.
>
> Jaromir - do you have any suggestions on how we can deal with this, in
> particular thinking about the soon to be released NetBSD 9.0?  Maybe
> somehow detecting this combination of disks and chipsets and disabling
> NCQ?  My concern right now is that people upgrading from NetBSD 8 or
> earlier to NetBSD 9 may experience data corruption.

If they are known broken, it seems we indeed need to add a quirk entry
to disable NCQ for these drives, and push it into the NetBSD 9.0 branch.

I'll make the change this week.

Jaromir


Re: adding linux syscall fallocate

2019-11-16 Thread Jaromír Doleček
On Sat, Nov 16, 2019 at 17:14, HRISHIKESH GOYAL  wrote:

> Does *posix_fallocate()* implemented (has support) for any of the
> underlying filesystems in NetBSD/FreeBSD?
>

As far as I know, no filesystem actually implements the support in NetBSD.


> Also, as per my understanding, all calls to *posix_fallocate()*
> internally go to *fallocate()* systemcall which are redirected to 
> *vop_fallocate_desc().
> *So in the project https://wiki.netbsd.org/projects/project/ffs-fallocate/ we
> only need to implement *ffs_fallocate() *right?
>

Yes, pretty much.
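i.e. write ffs_fallocate() and hook it into the FFS vnode op tables in
ffs_vnops.c, replacing the current genfs_eopnotsupp entry - a sketch of
the table line:

	{ &vop_fallocate_desc, ffs_fallocate },		/* fallocate */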

Jaromir


Re: adding linux syscall fallocate

2019-11-09 Thread Jaromír Doleček
R0ller,

is your problem only that the Linux binary used for cross-compiling
errors out, because the system call returns ENOSYS rather than the
potentially expected EOPNOTSUPP?

In that case the right fix is indeed to add the mapping into the compat
code. Then the call would fail in the way the tool expects.

Linux also doesn't support fallocate() universally; actually, even
where it is supported, some filesystems return EOPNOTSUPP when e.g. an
unsupported flag combination is used.

Jaromir

On Wed, Nov 6, 2019 at 10:12, r0ller  wrote:

> Hi Maciej,
>
> Thanks for the detailed answer! Unfortunately, I don't think that I could
> accomplish this task in my spare time:(
>
> Please, don't take this as an offence but as a fix for my case I thought
> of an utter hack:
>
> I do know that writing a file by calling fallocate can be tricked and
> redirected into a named pipe of which the data can simply be pulled into a
> normal file. This is what I'm already doing in my project as a workaround
> when building it as 32bit arm lib:
>
> mkfifo mypipe
> cat mypipe > myfile
> 
>
> The problem with this is that it cannot be used when crosscompiling
> autoconf projects where a configure script starts creating many files as
> I'd need to edit the script at too many places to implement this trick.
>
> However, if I could carry out this trick with the pipe when intercepting
> the linux fallocate call, it could work. Do you think it feasible?
>
> Best regards,
> r0ller
>
>  Original message 
> From: Maciej < maciej.grochow...@protonmail.com >
> Date: November 4, 2019 23:32:56
> Subject: Re: adding linux syscall fallocate
> To: r0ller < r0l...@freemail.hu >
>
> Hi r0ller,
>
> A couple of weeks ago I also run to the issue when I found lack of
> fallocate or POSIX_FALLOCATE(2) (to be precise) a little bit sad.
> From the very first view typical usage of POSIX_FALLOCATE(2) seems
> straight forward, comparing to the Linux fallocate(2) where different modes
> have to be supported. However, things can go a little bit more complicated
> if you are dealing with an existing file with a more complex structure.
> Because of that, I found difficult to provide good quality implementation
> without a proper way to test it.
> Before EuroBSD 2019 as a part of work about fuzzing the FFS, I decided to
> bring some well known FS test (namely speaking "XFS tests') suit to make
> sure that bugs that we fix did not introduce a regression.
> The same thing applies to the new features of FS, is relatively easy to
> port implementation from other OS/FS, but without a proper way to test
> them, I would be very careful to introduce such things too quickly to the
> end-users.
>
> One thing that I was missing for XFS tests, and going to publish part of
> it soon, is a way to view the internal structure of inodes and other
> metadata of Filesystem. My primary use case was to debug mount issues, in
> the example the issue that I showed during my presentation about the
> fuzzing. But also same apply to the code that manipulates inode blocks.
>
> Hopefully, we can elaborate on that, and as I pointed earlier I would be
> careful with merging new FS features especially such wide used as
> POSIX_FALLOCATE(2) without a proper FS test suit or extensive enough
> testing that would require proper too like i.e. FSDB.
>
> Thanks
> Maciej
>
> ‐‐‐ Original Message ‐‐‐
> On Sunday, November 3, 2019 6:06 PM, r0ller  wrote:
>
> Hi Jaromir,
>
> Indeed. That's bad news but thanks for your answer! I've even found this:
> https://wiki.netbsd.org/projects/project/ffs-fallocate/
> Are there any details for this project besides that page? I don't know
> anything about NetBSD internals though if it's not meant for gurus, I'd
> have a look at it and give it a try.
>
> Best regards,
> r0ller
>
>  Original message 
> From: Jaromír Doleček < jaromir.dole...@gmail.com >
> Date: November 3, 2019 15:16:34
> Subject: Re: adding linux syscall fallocate
> To: r0ller < r0l...@freemail.hu >
> On Sun, Nov 3, 2019 at 08:57, r0ller  wrote:
>
> > As you can see on the attached screenshot, "line 4741" gets printed out

Re: adding linux syscall fallocate

2019-11-03 Thread Jaromír Doleček
On Sun, Nov 3, 2019 at 08:57, r0ller  wrote:
> As you can see on the attached screenshot, "line 4741" gets printed out. So I 
> went on to check what happens in VOP_FALLOCATE but it gets really internal 
> there.
>
> Does anyone have any hint?

fallocate VOP is not implemented for FFS:

> grep fallocate *
ffs_vnops.c: { &vop_fallocate_desc, genfs_eopnotsupp }, /* fallocate */
ffs_vnops.c: { &vop_fallocate_desc, spec_fallocate }, /* fallocate */
ffs_vnops.c: { &vop_fallocate_desc, vn_fifo_bypass }, /* fallocate */

Jaromir


Re: Proposal, again: Disable autoload of compat_xyz modules

2019-09-27 Thread Jaromír Doleček
On Thu, Sep 26, 2019 at 18:08, Manuel Bouyer  wrote:
>
> On Thu, Sep 26, 2019 at 05:10:01PM +0200, Maxime Villard wrote:
> > issues for a clearly marginal use case, and given the current general
>  ^^^
>
> This is where we dissagree. You guess it's marginal but there's no
> evidence of that (and there's no evidence of the opposite either).

FYI - I've also put a lot of effort into fixing & enhancing
compat_linux in the past. I also greatly appreciate all the work of
other folks on the layer; it's super useful in some situations - a
browser with flash support used to be important (thankfully not
anymore), also vmware and matlab, and I also used some Oracle dev
tools. However, that is not the topic of this discussion.

Let's concentrate on whether it should be enabled by default.

Given the history, to me it's completely clear that compat_linux
shouldn't be on by default. Any possible Linux-specific exploits should
only be a problem for people who actually explicitly enable it. Let's
just stop pretending that we'd set up any kind of reasonable testing
suite for this - it has not been done in the last >20 years, and it's
even less likely to happen now that most of the major use cases are
actually moot.

As Maya suggested, let's keep this concentrated on COMPAT_LINUX only,
to avoid further bikeshedding. So basically I propose doing this:
1) Comment out COMPAT_LINUX from all kernel configs for all archs
which support modules
2) Disable autoload for compat_linux, requiring the user to explicitly
configure the system to load it. No extra sysctl.

Any major and specific objections?

Jaromir


Re: Proposal, again: Disable autoload of compat_xyz modules

2019-09-26 Thread Jaromír Doleček
On Thu, Sep 26, 2019 at 10:54, Maxime Villard  wrote:
> I recently made a big set of changes to fix many bugs and vulnerabilities in
> compat_linux and compat_linux32, the majority of which have a security impact
> bigger than the Intel CPU bugs we hear about so much. These compat layers are
> enabled by default, so everybody is affected.
> ...
> Therefore, I am making today the same proposal as Taylor in 2017, because the
> problem is still there exactly as-is and we just hit it again; the solution
> however is more straightforward.

Yes please, it's the right thing to do.

Just please still keep also the option to compile it into the kernel.

I also now have no objection to moving all the arch-specific bits
under sys/compat/ as you&Taylor suggested in the old thread.

Jaromir


Re: Enable functionality by default

2019-04-16 Thread Jaromír Doleček
On Tue, Apr 16, 2019 at 17:55, Sevan Janiyan  wrote:
> altq, veriexec, BUFQ_PRIOCSCAN, CARP
>
...
> Are there any reasons not enable these features by default on all ports
> (with the exception of resource constrained systems such as sun2, sun3,
> zaurus) ?

I think enabling veriexec used to incur some overhead on exec or
file copy, but I don't recall the details.

Jaromir


Re: Removing PF

2019-04-01 Thread Jaromír Doleček
On Mon, Apr 1, 2019 at 14:32, Stephen Borrill  wrote:
>Your two statements are mutually inconsistent:
> 1) No-one is maintaining ipf or pf
> and
> 2) If the effort had been on one firewall instead of three, the one chosen
> would be more functional.

IMO it's consistent - if we had one, it would be clear which one to
contribute to, and clear whether a feature is really missing or
working; i.e. right now nobody contributes anything because they don't
know which of the firewalls to contribute to.

In either case, let's return to a constructive discussion, and see
what needs to be done. NPF-only is the future, so let's get to that
future.

In the past discussion, I've only seen people mentioning only two
features missing in NPF and present in PF:

1. ftp-proxy support - Maxime volunteered to implement this in NPF,
I'm sure help there would be welcome
2. group support for config (mentioned by Manuel) - anyone feel like
taking it? (it might be enough to have some kind of config preprocessor
initially, if that's easier to do)

Is there anything else?

Jaromir


Re: Regarding the ULTRIX and OSF1 compats

2019-03-16 Thread Jaromír Doleček
On Sat, Mar 16, 2019 at 16:12, Robert Elz  wrote:
> Sorry, I must have missed that.   All I ever seem to see is that xxx is
> unmaintained and full of unspecified bugs, and that obviously no-one cares,
> and so we should delete it.That's not an argument for anything.

You suggest that there are people who care about the compat, who are
not on the mailing list. How do you suggest we reach them?

> Yes, this makes it a long slow, and tedious process (involving a bunch of
> additional work) to delete something - but so it should be.  Someone spent
> a lot of time making the thing you want to discard work in the first place.

Max did everything reasonable to gather feedback about this, waiting a
long time for it to come. Now it's time to proceed with the removal.
The stuff will stay in the Attic, so if somebody steps in to actually
maintain it, they will be able to revive it just as easily as if it had
simply been unlinked from the build.

This is me speaking as a person who put a lot of effort into making
compat_linux work, and into trying to keep the compat code updated. I
know what it means to update esoteric code down in the compat layers.
Most of it is simply not testable by reasonable means, which
unfortunately means it needs to go.

Jaromir


Re: Reserve device major numbers for pkgsrc

2019-02-16 Thread Jaromír Doleček
Perhaps do not pre-reserve anything and simply reserve consecutive
numbers as the need arises?

On Sat, Feb 16, 2019 at 23:55, Kamil Rytarowski  wrote:
>
> On 16.02.2019 23:40, Jaromír Doleček wrote:
> > On Sat, Feb 16, 2019 at 23:24, Kamil Rytarowski  wrote:
> >>
> >> We started to build and ship kernel modules through pkgsrc.
> >>
> >> I would like to reserve 3 major numbers for the HAXM case from the base
> >> pool of devices and prevent potential future conflicts and compatibility
> >> breakage due to picking up another major number in a 3rd party software.
> >>
> >> Where and how to reserve these major numbers?
> >>
> >> I expect that we will start growing software that uses our kernel module
> >> framework. (I have got few candidates to get ported in future)
> >
> > See sys/conf/majors
> >
> > Jaromir
> >
>
> Proposed patch:
>
> http://netbsd.org/~kamil/patch-00086-pkgsrc-majors.txt
>
> I have no opinion on numbers to pick, please suggest the right ranges
> and I will follow it.
>


Re: Reserve device major numbers for pkgsrc

2019-02-16 Thread Jaromír Doleček
On Sat, Feb 16, 2019 at 23:24, Kamil Rytarowski  wrote:
>
> We started to build and ship kernel modules through pkgsrc.
>
> I would like to reserve 3 major numbers for the HAXM case from the base
> pool of devices and prevent potential future conflicts and compatibility
> breakage due to picking up another major number in a 3rd party software.
>
> Where and how to reserve these major numbers?
>
> I expect that we will start growing software that uses our kernel module
> framework. (I have got few candidates to get ported in future)

See sys/conf/majors
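The entries there look roughly like this (355 is purely illustrative -
pick whatever the next free numbers are):

	device-major	haxm	char 355		haxm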

Jaromir


Re: scsipi: physio split the request

2018-12-28 Thread Jaromír Doleček
> On Dec 27, 12:29pm, buh...@nfbcal.org (Brian Buhrow) wrote:
> -- Subject: Re: scsipi: physio split the request
>
> |   hello.  Just out of curiosity, why did the tls-maxphys branch never
> | get merged with head once the work was done or mostly done?

Simply nobody finished it up yet.

On Thu, Dec 27, 2018 at 22:07, Christos Zoulas  wrote:
> mostly done...

I did the last tls-maxphys sync. It will need at least one more resync,
since more stuff has moved around (e.g. wd(4) was converted to dksubr
since then).

I want to eliminate some duplicated code in PCI-IDE drivers before the
merge. Other than that, I don't see any other blockers for the merge.

Of course, it requires proper round of testing before it could be
really considered merge-worthy.

As others noted, the performance benefit of tls-maxphys is likely to
be small. However, it removes artificial limits in the stack, unlocking
new opportunities for further development, so it's both necessary and
good to do nevertheless.

Jaromir


Re: scsipi: physio split the request

2018-12-28 Thread Jaromír Doleček
On Thu, Dec 27, 2018 at 15:41, Emmanuel Dreyfus  wrote:
>
> On Thu, Dec 27, 2018 at 02:33:28PM +, Christos Zoulas wrote:
> > I think you need resurrect the tls-maxphys branch... It was close to working
> > IIRC.
>
> What happens if I just #define MAXPHYS (1024*1204*1024) ?

Several drivers use MAXPHYS to allocate DMA memory (e.g. nvme); they
usually try to allocate MAXPHYS * (max number of scatter-gather
vectors). These allocations will either fail, or tie up huge amounts
of RAM.
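(For illustration: with a ~1 GiB MAXPHYS, even a modest multiplier of,
say, 128 would mean trying to reserve on the order of 128 GiB up front.)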

Also, 64 KiB just happens to be the hard I/O size limit for ISA, and
for ATA/IDE drives not supporting LBA48.

Even for hw which does support bigger I/O (stuff behind PCI, SCSI, ATA
drivers with LBA48), the drivers might be buggy and not cope.

Nothing good comes out of this; unfortunately, the lower layers need
to be fixed first.

Jaromir


Re: svr4, again

2018-12-18 Thread Jaromír Doleček
On Tue, Dec 18, 2018 at 13:16, Maxime Villard  wrote:
> It is clear that COMPAT_SVR4 is completely buggy, but to be clear on the
> use of the code:

+1 to removal of COMPAT_SVR4; there is always the Attic.

I remember also doing some mechanical changes in the area in the past,
and encountering things which were broken just from casual browsing -
which I did not fix, because I did not have an SVR4 binary to run.
It's simply not useful any more.
Jaromir


Re: [uvm_hotplug] Fixing the build of tests

2018-12-15 Thread Jaromír Doleček
On Sat, Dec 15, 2018 at 15:27, Jason Thorpe  wrote:
> We can buy ourselves some time on aarch64 by using a different page size, 
> yes?  iOS, for example, uses 16KiB VM pages (backed by 4KiB or 16KiB physical 
> pages, depending on the specific CPU type).  I don't think the ARM eabi has 
> the same section alignment issue that made the 4K/8K page size thing on m68k 
> such a pain.
>

Bigger physical pages also reduce the size of page tables (a big deal
if it allows reducing e.g. a 4-level to a 3-level mapping, but even
"just" a factor of two or four is very good) and hence speed up page
faults, and improve TLB utilization. It's a really good thing overall
and something which is very desirable on monster-memory machines.
Jaromir


Re: pci_intr_alloc() vs pci_intr_establish() - retry type?

2018-12-05 Thread Jaromír Doleček
On Wed, Dec 5, 2018 at 08:39, Masanobu SAITOH  wrote:
>   I suspect Serial ATA AHCI 1.2.1 specification page 111 has the hint.
>
>   Figure 23: Port/CCC and MSI Message Mapping, Example 1
>   Figure 24: Port and MSI Message Mapping, Example 2
>
> I suspect MSI-X also assume this layout. pci_msix_alloc_map() might
> be used.

Very likely we need to clear the MSICAP.MME field if the MSI capability
happens to advertise more vectors, to force single-MSI operation.

There is no such bit for MSI-X. Let's work on this off-list; I have
some ideas for testing the interrupt mapping, and for actually adding
support for multiple vectors. Since in your case the thing only fails
for IDENTIFY, it looks like the registers mapped via the regular BAR
work and just the interrupt routing is bad. Performance-wise it doesn't
buy much besides avoiding one register read per interrupt, because the
whole ATA subsystem is not MPSAFE and hence the interrupt handlers are
serialized via KERNEL_LOCK() anyway.

>   My SuperMicro A2SDi-H-TP4F's disk is connected to not channel 0
> but channel 6. I also verified that GHC.MRSM is 0.

GHC.MRSM is only set if the hardware has fewer MSI/MSI-X vectors than
AHCI ports and falls back to single MSI because of that. So it's
actually normal for this to be 0. It'd be quite strange for hardware
to have fewer MSI/MSI-X vectors than supported ports.

Jaromir


Re: pci_intr_alloc() vs pci_intr_establish() - retry type?

2018-12-04 Thread Jaromír Doleček
I've now disabled MSI-X for ahcisata(4); I need to figure out what
needs to be set up there.

On Mon, Dec 3, 2018 at 22:41, Jared McNeill  wrote:
> IIUC we don't actually have confirmation that ThunderX AHCI works yet..
> Nick is having issues with his board.

Okay - I misunderstood; I thought it was confirmed working.

The BAR used for Cavium is slightly suspicious. The AHCI spec mentions
(v1.3.1, section 2.4 Serial ATA Capability) the 0x10 BAR as one of the
'legacy' access methods. It specifically only talks about 0x24 as the
proper BAR for AHCI; it doesn't mention the option to have it elsewhere.

Jaromir


Re: pci_intr_alloc() vs pci_intr_establish() - retry type?

2018-12-03 Thread Jaromír Doleček
On Mon, Dec 3, 2018 at 12:09, Masanobu SAITOH  wrote:
> C3000's AHCI has multi-vector MSI-X table and it doesn't work since
> Nobember 20th...

Can you try whether by chance this code, adapted from the nvme(4)
changes, changes anything on your system?

http://www.netbsd.org/~jdolecek/ahcisata_msixoff.diff

ahcisata(4) works for me on two different computers in MSI mode. I
don't have any ahcisata device which supports MSI-X, but it is working
on Cavium, which has MSI-X. Perhaps some different BAR weirdness than
Cavium, or simply the boards using MSI-X should use the "cavium" BAR?

So another thing to try would be to make ahci_pci_abar() always return
AHCI_PCI_ABAR_CAVIUM and see if it changes things.

Also you can try rev 1.44 of ahcisata_pci.c with AHCISATA_DISABLE_MSIX
to see if it works if forced to MSI only.

I wonder whether there is something we still need to adjust during
general initialization, I see this on my hardware:

re(4) - on my system works when attached either MSI or MSI-X
nvme(4) - on my system works only when attached via MSI-X (regardless
of the PCI_CAP_MSIX), doesn't work when forced to MSI
ahcisata(4) works in MSI on my system, as MSI-X for skrll@, MSI-X
doesn't work for you

Jaromir


Re: pci_intr_alloc() vs pci_intr_establish() - retry type?

2018-11-27 Thread Jaromír Doleček
On Wed, Nov 28, 2018 at 00:42, Taylor R Campbell  wrote:
> > Date: Tue, 27 Nov 2018 21:14:04 +
> > From: Robert Swindells 
> >
> > It looks to me that drivers try MSI and/or MSIX first based on the
> > device type not on whether the host controller can handle it.

pci_intr_alloc() checks what the device claims to support down the
call stack, reading the PCI config space. I assume there is some
negotiation between the PCI bus and the device; I hope a device can't
claim to support e.g. MSI-X if the bus doesn't.

> Can we make the same API work for MSI, MSI-X, and INTx interrupts, so
> that we don't have to remember two or three different APIs?  Maybe
> just pass a subset of PCI_INTR_INTx|PCI_INTR_MSI|PCI_INTR_MSIX to say
> which ones to try or something?

Yes, the goal would be to have some shared API to do this fallback
logic. But I need to understand whether it's actually necessary - it
would be much simpler if we could rely on establish working regardless
of the type of interrupt.

Especially in the 1/1/1 case, I don't see a path in the code where the
MSI/MSI-X establish could fail but the INTx one would succeed. For
sure, the MD allocation can fail due to some memory shortage, or
because we run out of interrupts supported by the port. But I don't
see how establishing 1 MSI/MSI-X interrupt could fail while
establishing 1 INTx interrupt succeeds.

So perhaps it's okay to just try to establish once and remove all
those fallbacks - unless the driver does something fancy, like trying
to use ncpu MSI interrupts and falling back to 1 INTx interrupt.

I'd be in favour of simplifying the code to do just that in xhci(4)
and ahcisata(4) - opinions?

Jaromir


pci_intr_alloc() vs pci_intr_establish() - retry type?

2018-11-27 Thread Jaromír Doleček
I see several drivers (e.g. xhci(4), ahcisata(4), bge(4), nvme(4))
retry the pci_intr_alloc()+pci_intr_establish() with 'lower' types
when pci_intr_establish() fails.

Is this a real case - can it actually happen that pci_intr_alloc()
returns the interrupt handles, but pci_intr_establish() then fails
for them?

If it's not a real case, it would be nice to remove all that boilerplate code.

Jaromir


Re: 8.0 performance issue when running build.sh?

2018-08-09 Thread Jaromír Doleček
2018-08-09 19:40 GMT+02:00 Thor Lancelot Simon :
> On Thu, Aug 09, 2018 at 10:10:07AM +0200, Martin Husemann wrote:
>> 100.00  2054  14.18  kernel_lock
>>  47.43   846   6.72  kernel_lock  fileassoc_file_delete+20
>>  23.73   188   3.36  kernel_lock  intr_biglock_wrapper+16
>>  16.01   203   2.27  kernel_lock  scsipi_adapter_request+63
> Actually, I wonder if we could kill off the time spent by fileassoc.  Is
> it still used only by veriexec?  We can easily option that out of the build
> box kernels.

Or even better, make it less heavy?

It's not really intuitive that you could improve filesystem
performance by removing this obscure component.

Jaromir


Re: panic: biodone2 already

2018-08-07 Thread Jaromír Doleček
2018-08-07 18:42 GMT+02:00 Emmanuel Dreyfus :
>  kern/53506

Thanks. Could you please try a -current kernel for the DOMU and see if
it crashes the same way? If possible a DOMU kernel from the daily
builds, to rule out a local compiler issue.

There are not really many differences in the xbd/evtchn code itself
between 8.0 and -current, however. There was some interrupt code
reorganization which might have affected this, but it happened after
netbsd-8 was branched.

xbd is not mpsafe, so it shouldn't even be a race due to parallel
processing on different CPUs. Maybe it would be useful to check if the
problem still happens when you assign just a single CPU to the DOMU.
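(In the domU configuration that is a one-liner - assuming the xl/xm
config syntax:)

vcpus = 1   # give the domU a single vCPU to rule out cross-CPU races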

Jaromir


Re: panic: biodone2 already

2018-08-06 Thread Jaromír Doleček
This is always a bug - the driver processed the same buf twice. It can
do harm: if the buf is reused for some other I/O, the system can fail
to store data, or claim to have read data when it didn't.

Can you give full backtrace?

Jaromir

2018-08-06 17:56 GMT+02:00 Emmanuel Dreyfus :
> Hello
>
> I have a Xen domU that now frequently panics with "biodone2 already". I
> suspect it started after upgrading to NetBSD 8.0.
>
> The offending code is in src/sys/kern/vfs_bio.c (see below). I wonder if
> we could not just release the mutex and exit in that case. Can it make
> some harm?
>
> static void
> biodone2(buf_t *bp)
> {
> void (*callout)(buf_t *);
>
> SDT_PROBE1(io, kernel, ,done, bp);
>
> BIOHIST_FUNC(__func__);
> BIOHIST_CALLARGS(biohist, "bp=%#jx", (uintptr_t)bp, 0, 0, 0);
>
> mutex_enter(bp->b_objlock);
> /* Note that the transfer is done. */
> if (ISSET(bp->b_oflags, BO_DONE))
> panic("biodone2 already");
>
>
> --
> Emmanuel Dreyfus
> http://hcpnet.free.fr/pubz
> m...@netbsd.org


Re: Removing dbregs

2018-07-13 Thread Jaromír Doleček
2018-07-13 22:54 GMT+02:00 Kamil Rytarowski :
> I disagree with disabling it. The code is not broken, it's covered by
> tests, it's in use.

This looks like a perfect candidate for an optional (default off)
feature. It is useless and even dangerous for general-purpose use by
virtue of being root-only, but useful for specialized use.

Honestly, I don't see any reason to have this on by default.

Jaromir


Re: CVS commit: src/sys/arch/x86/x86

2018-07-08 Thread Jaromír Doleček
Le dim. 8 juil. 2018 à 15:29, Kamil Rytarowski  a écrit :
> I've introduced the change to mpbios.c as it was small, selfcontained
> and without the need to decorate the whole function.

Am I reading the code wrong, or did you actually introduce a bug in mpbios.c?

Shouldn't this:

memtop |= (uint16_t)mpbios_page[0x414] << 8;

actually be << 16 to keep the same semantics?

Jaromir


Re: interrupt cleanup #1

2018-06-24 Thread Jaromír Doleček
ne 24. 6. 2018 v 12:13 odesílatel Cherry G.Mathew  napsal:
> Sometime tomorrow I'll send a first re-org patch (no functional changes)
> followed by the actual meat later in the week.

I have some small changes for x86/intr.c and xen/evtchn.c too. In
intr.c it's pretty much just some #ifdef shuffling to make the intrctl
list work on Xen; in evtchn.c there is some new code at the end.

http://www.netbsd.org/~jdolecek/xen_intrctl_list.diff

Would you mind if I pushed this before your changes, so that it's
not interleaved with your intr refactoring?

Jaromir


Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86 - emap, direct map

2018-06-10 Thread Jaromír Doleček
2018-05-12 21:24 GMT+02:00 Chuck Silvers :
> the problem with keeping the pages locked (ie. PG_BUSY) while accessing
> the user address space is that it can lead to deadlock if the page

Meanwhile I've tested this scenario - I wrote a test program which
mmap(2)s a file and then calls write(2) using the mmaped memory as
the buffer, targeting the same file offset.
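(Reconstructed as a minimal sketch - an assumption of what such a test
program looks like, not the original code; pwrite(2) stands in for the
write(2)-at-the-same-offset described above:)

#include <sys/mman.h>
#include <err.h>
#include <fcntl.h>
#include <unistd.h>

int
main(void)
{
        int fd = open("foo", O_RDWR);
        char *p;

        if (fd == -1)
                err(1, "open");
        p = mmap(NULL, 65536, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
                err(1, "mmap");

        /* write the mapped pages back over themselves; the uiomove()
         * in the write path faults on the very pages being written */
        if (pwrite(fd, p, 65536, 0) == -1)
                err(1, "pwrite");
        return 0;
}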

With the new code it actually caused the process to get SIGSEGV, where
the old code just made write(2) return EINVAL. I think this happens
because fault resolution returns EINVAL when the page has PG_BUSY.
The funny thing is that on Mac OS X the same code actually ends up in
a deadlock, with an unkillable process :D

Maybe the SIGSEGV is actually acceptable behaviour? I don't think
there would be a valid reason to do an mmap(2)/write(2) combination
like that.

Jaromir


Re: Revisiting uvm_loan() for 'direct' write pipes

2018-06-10 Thread Jaromír Doleček
2018-05-25 23:19 GMT+02:00 Jason Thorpe :
> BTW, I was thinking about this, and I think you need to also handle the case
> where you try the ubc_uiomove_direct() case, but then *fall back* onto the
> non-direct case if some magic error is returned.
> ...
> These cache flushes could be potentially very expensive.

I think it would not actually make sense performance-wise to do
this - the decision to fall back would have to be made within the
pmap_direct_process() function, which runs only after all the
preparation steps have already been done. If we abort the operation
here and go call the non-direct code path, it could be even slower
than the cache flush. For UBC the fallback would not be too terrible
to code and debug, but it would be very messy e.g. for the
uvm_loan() in sys_pipe.c.

I don't know what to do for such archs, but I guess it would need to
be evaluated whether the direct map could be used in a way which
actually improves performance there in the general case.

2018-05-26 16:52 GMT+02:00 Thor Lancelot Simon :
> In this case, could we do better gathering the IPIs so the cost were
> amortized over many pages?

Yes, it could gather the IPIs for this particular TLB flush. It seems
on my system it would be enough to gather at least 3-4 to match the
speed. There is some limit to how this could scale though - right
now we have only 16 TLB IPI slots, which would be enough for maybe 1
parallel pipe setup. I've noticed DragonflyBSD did some work on
asynchronous TLB IPIs, maybe it would be possible to look for
inspiration there.

I do plan some tweaks for the TLB flush IPIs. I want to change the x86
pmap to not trigger TLB flush IPIs to idle processors, as was
discussed earlier for kern/53124. This would reduce the performance
impact of the IPIs when there are many idle processors. It seems ACPI
wakeup takes time linear in the number of CPUs - I think this is
something we'd want to fix too.

That said, regardless of which way we slice it, anything which triggers
IPIs or TLB flushes inherently just does not scale on MP systems. If
uvm_loan() is to be used as a high-performance zero-copy optimization,
it simply must avoid those. Adding something which triggers an
interrupt on all active processors for every processed page simply
can't perform.

Jaromir


Re: Revisiting uvm_loan() for 'direct' write pipes

2018-05-25 Thread Jaromír Doleček
2018-05-21 21:49 GMT+02:00 Jaromír Doleček :
> It turned out uvm_loan() incurs most of the overhead. I'm still on my
> way to figure what it is exactly which makes it so much slower than
> uiomove().

I've now pinned the problem down to the pmap_page_protect(...,
VM_PROT_READ), that code does page table manipulations and triggers
synchronous IPIs. So basically the same problem as the UBC code in
uvm_bio.c.

If I comment out the pmap_page_protect() in uvm_loan.c and hence do
not change vm_page attributes, the uvm_loan() + direct map pipe
variant manages about 13 GB/s, compared to about 12 GB/s for the
regular pipe. 8% speedup is not much, but as extra it removes all the
KVA limits. Since it should scale well, it should be possible to
reduce reduce the direct threshold, and reduce the size of the fixed
in-kernel pipe buffer to save kernel memory.

So, I'm actually thinking of changing uvm_loan() to not enforce R/O
mappings and to leave the page attributes unchanged. It would require
the callers to deal with possible COW or PG_RDONLY if they need to do
writes. In other words, allow using the 'loan' mechanics also for
writes, and eventually use this also for UBC writes to replace the
global PG_BUSY lock there.

WYT?

Jaromir


Revisiting uvm_loan() for 'direct' write pipes

2018-05-21 Thread Jaromír Doleček
Hello,

I've been playing a little with revisiting kern/sys_pipe.c to take
advantage of the direct map, in order to avoid the pmap_enter() et al.
via the new uvm_direct_process() interface. Mostly because I want to
have at least one other consumer of the interface before I consider it
final, to make sure it covers the general use cases.

I've managed to get it working. However, the loan mechanism is way
slower than just copying the data via the intermediate 16KB pipe
buffer - I get about 12GB/s on the regular pipe, and about 4GB/s on
the 'direct' pipe, even when using huge input buffers like 1MB at a
time.

For the regular pipe, CPU usage is about the same for producer and
consumer, around 50%; for the 'direct' pipe the producer (the one who
does the loan) is at about 80% and the consumer at 5%.

It turned out uvm_loan() incurs most of the overhead. I'm still trying
to figure out what exactly makes it so much slower than uiomove().

Any hints on which parts I should concentrate?

Jaromir


Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86 - emap, direct map

2018-05-13 Thread Jaromír Doleček
2018-05-12 21:24 GMT+02:00 Chuck Silvers :
> the problem with keeping the pages locked (ie. PG_BUSY) while accessing
> the user address space is that it can lead to deadlock if the page
> we are accessing via the kernel mapping is the same page that we are
> accessing via the user mapping.

The pages marked PG_BUSY for the UBC transfer are part of the page
cache. I sincerely hope it's impossible to have any user mapping for
them.

Is there any other scenario for the deadlock in this context?

That said, the PG_BUSY is at least a scalability problem, since it
restricts the copy operations within the UBC window to only one at a
time. The old code holds PG_BUSY only during the fault - with the
exception of the UBC_FAULTBUSY branch, which still keeps the lock
during the whole operation.

Jaromir


Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86 - emap, direct map

2018-05-12 Thread Jaromír Doleček
2018-05-12 7:07 GMT+02:00 Michael van Elst :
> On Fri, May 11, 2018 at 05:28:18PM +0200, Jaromír Doleček wrote:
>> I've implemented a tweak to read-ahead code to skip the full
>> read-ahead if last page of the range is already in cache, this
>> improved things a lot:
>
> That looks a bit like a hack. Can't you just call uvm_pagelookup()
> to determine if the last page is cached? Or if that's too UVM internal,
> call uvm_findpages() with UFP_NOALLOC.

I thought it would be less of an abstraction violation when done in
getpages. But actually using uvm_pagelookup() seems fine from that
standpoint too. Furthermore, it avoids some page flag manipulation,
and makes the change much more localized, to uvm_readahead.c only. So
I'll go with that.
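A minimal sketch of that heuristic, assuming the uvm_readahead.c
context (uvm_pagelookup() is the real function discussed above, the
variable names and locking are simplified):

	struct vm_page *pg;

	mutex_enter(uobj->vmobjlock);
	pg = uvm_pagelookup(uobj, trunc_page(off + size - 1));
	mutex_exit(uobj->vmobjlock);
	if (pg != NULL)
		return;	/* tail of the window already cached, skip the RA */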

Thanks.

Jaromir


Removing uvm_emap (aka ephemeral mapping interface)?

2018-05-11 Thread Jaromír Doleček
Hello,

I suggest to remove emap code.

Unfortunately it turned out to be too tricky to implement and use
correctly. It's not currently used anywhere; the only use, in
sys_pipe.c, was ifdeffed out due to stability problems on 2009-08-31.

While the commit states the stability problems were present also on
sparc64, which used the emap fallback, I think the uvm_emap.c fallback
pmap_kenter_pa()/pmap_kremove() is actually not safe either.

Actual MD support for it was only ever implemented on x86. The emap
support complicates the already quite unwieldy x86 pmap, and I think
it makes sense to drop this unused code to lower the maintenance
overhead. Dropping this would certainly simplify my other work on PCID
support for x86.

Opinions?

Jaromir


Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86 - emap, direct map

2018-05-11 Thread Jaromír Doleček
I've modified the uvm_bio.c code to use the direct map on amd64. The
change seems to be stable; I've also run 'build.sh tools' to exercise
the system - the build succeeded and I didn't observe any stability
problems. The test was on a 6-core Intel CPU with HT on (i.e. "12"
cores visible to the OS), an NVMe disk, and an approximately 2GB file
read done via:

dd if=foo of=/dev/null bs=64k msgfmt=human

Thanks to msgfmt=human, the speeds below are in proper gibibytes.

1. non-cached read: old: 1.7 GiB/s, new: 1.9 GiB/s
2. cache read: old: 2.2 GiB/s, new: 5.6 GiB/s

Seems the 1.9 GiB/s is device limit.

The patch for this is at:
http://www.netbsd.org/~jdolecek/uvm_bio_direct.diff

During testing I noticed that the read-ahead code slows down the
cached case quite a lot:
no read-ahead:
1. non-cached read: old: 851 MiB/s, new: 1.1 GiB/s
2. cached read: old: 2.3 GiB/s, new: 6.8 GiB/s

I've implemented a tweak to read-ahead code to skip the full
read-ahead if last page of the range is already in cache, this
improved things a lot:
smarter read-ahead:
1. non-cached read: old: 1.7 GiB/s, new: 1.9 GiB/s (no change compared
to old read-ahead)
2. cached read: old: 2.2 GiB/s, new: 6.6 GiB/s

Patch for this is at:
http://www.netbsd.org/~jdolecek/uvm_readahead_skip.diff

For the direct map, I've implemented a new helper pmap function, which
in turn calls a callback with the appropriate virtual address. This
way the arch pmap is more free to choose an alternative
implementation for this, e.g. without actually having a direct map,
like sparc64.
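Roughly, the helper looks like this on amd64 (an illustrative sketch
of the patch, not the authoritative signature):

int
pmap_direct_process(paddr_t pa, voff_t pgoff, size_t len,
    int (*process)(void *, size_t, void *), void *arg)
{
	vaddr_t va = PMAP_DIRECT_MAP(pa);

	/* no pmap_enter()/pmap_remove() and thus no TLB shootdowns */
	return (*process)((void *)(va + pgoff), len, arg);
}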

The code right now has some instrumentation for testing, but the
general idea should be visible.

Opinions? In particular, I'm not quite sure whether it is safe to avoid
triggering a write fault in ubc_uiomove(), and whether the new
heuristic for read-ahead is good enough for the general case.

Jaromir

2018-04-19 22:39 GMT+02:00 Jaromír Doleček :
> I've finally got my test rig setup, so was able to check the
> performance difference when using emap.
>
> Good news: there is a significant speedup on the NVMe device, without
> observing any bad side effects so far.
>
> I've setup a test file of 2G, smaller than RAM to fit all in cache, test was:
> dd if=foo of=/dev/null bs=64k
>
> First read (not-cached): old: 1.7 GB/s, new: 2.1 GB/s
> Cached read: old: 2.2 GB/s, new: 3.1 GB/s
>
> Reads from the raw device were the same in both cases, around 1.7 GB/s,
> which is a bit bizarre.
>
> If we want to modify the uvm_bio.c code to optionally use direct map,
> there is the problem with the fault technique it uses for read I/O.
> The code doesn't actually enter the pages into the KVA window, it lets
> uiomove() trigger the faults. Is there some advantage to this; why
> is this better than just mapping those 1-2 pages before calling
> uiomove()?
>
> Jaromir
>
> 2018-04-02 21:28 GMT+02:00 Jaromír Doleček :
>> 2018-03-31 13:42 GMT+02:00 Jaromír Doleček :
>>> 2018-03-25 17:27 GMT+02:00 Joerg Sonnenberger :
>>>> Yeah, that's what ephemeral mappings where supposed to be for. The other
>>>> question is whether we can't just use the direct map for this on amd64
>>>> and similar platforms?
>>>
>>> Right, we could/should use emap. I hadn't realized emap is actually already
>>> implemented. It's currently used for pipe for the loan/"direct" write.
>>>
>>> I don't know anything about emap though. Are there any known issues;
>>> do you reckon it's ready to be used for general I/O handling?
>>
>> Okay, so I've hacked together a patch to switch uvm_bio.c to ephemeral
>> mapping:
>>
>> http://www.netbsd.org/~jdolecek/uvm_bio_emap.diff
>>
>> Seems to boot, no idea what else it will break.
>>
>> Looking at the state of usage though, emap is only used for a disabled code
>> path in sys_pipe and nowhere else. That code had several on-and-off passes
>> of being enabled in 2009, and no further use since then. Doesn't give too
>> much confidence.
>>
>> The only port actually having optimization for emap is x86. Since amd64
>> is also the only one supporting direct map, we are really at liberty to pick
>> either one. I'd lean towards direct map, since that doesn't require
>> adding/removing any mapping in pmap_kernel() at all. From looking on the 
>> code,
>> I gather direct map is quite easy to implement for other archs like
>> sparc64. I'd say
>> significantly easier than adding the necessary emap hooks into MD pmaps.
>>
>> Jaromir


Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86 - emap, direct map

2018-04-19 Thread Jaromír Doleček
I've finally got my test rig setup, so was able to check the
performance difference when using emap.

Good news: there is a significant speedup on the NVMe device, without
observing any bad side effects so far.

I've set up a test file of 2G, smaller than RAM so it all fits in cache; the test was:
dd if=foo of=/dev/null bs=64k

First read (not-cached): old: 1.7 GB/s, new: 2.1 GB/s
Cached read: old: 2.2 GB/s, new: 3.1 GB/s

Reads from the raw device were the same in both cases, around 1.7 GB/s,
which is a bit bizarre.

If we want to modify the uvm_bio.c code to optionally use the direct
map, there is the problem with the fault technique it uses for read
I/O. The code doesn't actually enter the pages into the KVA window, it
lets uiomove() trigger the faults. Is there some advantage to this;
why is this better than just mapping those 1-2 pages before calling
uiomove()?

Jaromir

2018-04-02 21:28 GMT+02:00 Jaromír Doleček :
> 2018-03-31 13:42 GMT+02:00 Jaromír Doleček :
>> 2018-03-25 17:27 GMT+02:00 Joerg Sonnenberger :
>>> Yeah, that's what ephemeral mappings where supposed to be for. The other
>>> question is whether we can't just use the direct map for this on amd64
>>> and similar platforms?
>>
>> Right, we could/should use emap. I hadn't realized emap is actually already
>> implemented. It's currently used for pipe for the loan/"direct" write.
>>
>> I don't know anything about emap though. Are there any known issues;
>> do you reckon it's ready to be used for general I/O handling?
>
> Okay, so I've hacked together a patch to switch uvm_bio.c to ephemeral
> mapping:
>
> http://www.netbsd.org/~jdolecek/uvm_bio_emap.diff
>
> Seems to boot, no idea what else it will break.
>
> Looking at the state of usage though, emap is only used for a disabled code
> path in sys_pipe and nowhere else. That code had several on-and-off passes
> of being enabled in 2009, and no further use since then. Doesn't give too
> much confidence.
>
> The only port actually having optimization for emap is x86. Since amd64
> is also the only one supporting direct map, we are really at liberty to pick
> either one. I'd lean towards direct map, since that doesn't require
> adding/removing any mapping in pmap_kernel() at all. From looking on the code,
> I gather direct map is quite easy to implement for other archs like
> sparc64. I'd say
> significantly easier than adding the necessary emap hooks into MD pmaps.
>
> Jaromir


Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86 - emap, direct map

2018-04-02 Thread Jaromír Doleček
2018-03-31 13:42 GMT+02:00 Jaromír Doleček :
> 2018-03-25 17:27 GMT+02:00 Joerg Sonnenberger :
>> Yeah, that's what ephemeral mappings where supposed to be for. The other
>> question is whether we can't just use the direct map for this on amd64
>> and similar platforms?
>
> Right, we could/should use emap. I hadn't realized emap is actually already
> implemented. It's currently used for pipe for the loan/"direct" write.
>
> I don't know anything about emap though. Are there any known issues;
> do you reckon it's ready to be used for general I/O handling?

Okay, so I've hacked together a patch to switch uvm_bio.c to ephemeral
mapping:

http://www.netbsd.org/~jdolecek/uvm_bio_emap.diff

Seems to boot, no idea what else it will break.

Looking at the state of usage though, emap is only used for a disabled
code path in sys_pipe and nowhere else. That code had several
on-and-off passes of being enabled in 2009, and no further use since
then. Doesn't give too much confidence.

The only port actually having optimization for emap is x86. Since amd64
is also the only one supporting the direct map, we are really at
liberty to pick either one. I'd lean towards the direct map, since that
doesn't require adding/removing any mapping in pmap_kernel() at all.
From looking at the code, I gather the direct map is quite easy to
implement for other archs like sparc64 - I'd say significantly easier
than adding the necessary emap hooks into the MD pmaps.

Jaromir


Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86

2018-03-31 Thread Jaromír Doleček
2018-03-25 17:27 GMT+02:00 Joerg Sonnenberger :
> Yeah, that's what ephemeral mappings where supposed to be for. The other
> question is whether we can't just use the direct map for this on amd64
> and similar platforms?

Right, we could/should use emap. I hadn't realized emap is actually
already implemented. It's currently used for pipe for the loan/"direct"
write.

I don't know anything about emap though. Are there any known issues;
do you reckon it's ready to be used for general I/O handling?

Jaromir


Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86

2018-03-27 Thread Jaromír Doleček
2018-03-27 0:51 GMT+02:00 Michael van Elst :
> UBC_WINSHIFT=17 (i.e. 2*MAXPHYS) works fine, but sure, for KVA limited
> platforms you have to be careful. On the other hand, it's a temporary
> mapping that only exists while doing I/O. How many concurrent I/O
> operations would you expect on a 32bit system?

I think fragmentation might be a bigger issue. It could happen that
there just won't be a contiguous 128k KVA block any more, and then it
would hang forever, like kern/47030. But maybe this is not a concern,
e.g. i386 KVA uses the top 1G.

> I'd also like to get another tuning to 8.0. Bumping RA_WINWIZE_MAX
> from 8*MAXPHYS to 16*MAXPHYS helps iSCSI and maybe some nvme devices
> too.

Yes please :) For some reason I thought we had this already. I remember
it being mentioned in the discussion around the time nvme(4) got
integrated into the tree.

Regarding your issue, I'd also like to turn attention back to acpi
idle. The way I read dev/acpi/acpi_cpu_cstate.c, we try to go straight
to C3 without any backoff. It may be good to make this more granular,
maybe only going into the deeper states once enough idle time has
passed, for example longer than their wakeup latency.

For example, wakeup from C1 might have a 1 usec latency, wakeup from C2
150 usec, and from C3 500 usec.
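As an illustration of the idea only (the function and thresholds here
are invented, not the actual acpi_cpu_cstate.c code; the ACPI_STATE_C*
constants are real):

static int
acpicpu_pick_cstate(uint64_t idle_us)
{
	if (idle_us > 500)	/* past the C3 wakeup latency */
		return ACPI_STATE_C3;
	if (idle_us > 150)	/* past the C2 wakeup latency */
		return ACPI_STATE_C2;
	return ACPI_STATE_C1;	/* ~1 usec, always cheap */
}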

Any opinion on that, any ACPI experts around?

Jaromir


Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86

2018-03-26 Thread Jaromír Doleček
2018-03-25 13:48 GMT+02:00 Michael van Elst :
> For some reason it's not per block. There is a mechanism that moves
> an 8kbyte window, independent of the block size. You can easily change
> the window size (I'm currently experimenting with 128kbyte) by building
> a kernel with e.g. options UBC_WINSHIFT=17.
>
> Independent of how the scaling problem will be handled, I suggest to
> increase that window to at least MAXPHYS.

I think the file mapping window doesn't really need to be tied to
MAXPHYS. But values over MAXPHYS will probably currently [*] not give
further significant speedup for filesystem I/O, unless data is being
read from the cache.

One thing to note is that this size is used for all file mapping
windows, so having too big a value could cause KVA shortages on 32-bit
platforms. I think the original UBC_WINSHIFT tried to balance the most
usual file size and overall KVA use in the kernel pmap, similar to how
e.g. MAXSLP works. Maybe bump it to a higher value on 64-bit archs only?

If the bump works, I think we should pull it up to 8.0 too, as this is
probably the least likely to break anything out of the proposed options.

Did the experiment work?

Jaromir

[*] I hope to get tls-maxphys integrated for 9.0; by then maybe we need
to revisit this. It would also be nice to revisit the need to map the
memory into the kernel pmap in order to do the I/O.


Fixing excessive shootdowns during FFS read (kern/53124) on x86

2018-03-25 Thread Jaromír Doleček
Hello,

I'd like to gather some feedback on how to best tackle kern/53124.

The problem there is that FFS triggers a pathological case. The I/O
transfer maps and then unmaps each block into the kernel pmap, so that
the data can be copied into user memory. This triggers TLB shootdown
IPIs for each FS block, sent to all CPUs which happen to be idle, or
otherwise running on the kernel pmap. On systems with many idle CPUs
these TLB shootdowns cause a lot of synchronization overhead.

I see three possible ways how to fix this particular case:
1. make it possible to lazily invalidate the TLB also for the kernel
pmap, i.e. make pmap_deactivate()/pmap_reactivate() work with the
kernel pmap - this avoids the TLB shootdown IPIs being sent to idle
CPUs
2. make the idle lwp run in its own pmap and address space - a variant
of #1, which avoids changing the invariant about the kernel map being
always loaded
3. change the UVM code to not do this mapping via the kernel map, so
pmap_update() and the IPIs are not triggered in the first place

I reckon #2 would add a lot of overhead to the idle routine; it would
require at least an address space switch. This switch might be fairly
cheap on architectures with address space IDs (avoiding TLB flushes),
but still much more than what the idle entry/exit does now.

Variants of the problems with #3 have been discussed on and off over
the years as I recall, but there is no resolution yet, and I'm not
aware of anyone actively working on it. I understand this would be
Hard, with nobody currently having the time and courage.

This leaves #1 as a short-term practical solution. Does anyone foresee
any particular problems with this - does it have a chance to fly? Any
other ideas?

Jaromir


Re: amd64: svs

2018-01-11 Thread Jaromír Doleček
2018-01-10 19:56 GMT+01:00 Maxime Villard :
> That's what we do, too. Doing this is faster than "unmapping the kernel
pages
> on return to user space".
>
> Switching the address space costs one "movq %rax,%cr3".

Yes, but an address space switch also causes an implicit TLB/paging-
structure cache flush when not using something like PCID, doesn't it?

>> Do you reckon SVS has potential of having lower overall overhead than
>> separate address spaces + PCID? Alternatively, do you think SVS would
benefit
>> if we added PCID support to x86 pmap?
>
> Yes, it would benefit if we added PCID.

Okay, I'll look into this. The feature seems pretty simple to use,
though it will need some care to allow inactive processes to relinquish
their PCID when there is a shortage. I'll try not to disturb your SVS
work.
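For the record, the architectural interface is tiny - a sketch where
the two constants are the architectural ones and the surrounding
function is hypothetical:

#define CR3_PCID_MASK	0x0fffULL
#define CR3_NOFLUSH	(1ULL << 63)	/* keep this PCID's TLB entries */

static inline void
load_cr3_pcid(uint64_t pdirpa, uint64_t pcid, bool noflush)
{
	/* with CR4.PCIDE set, the low 12 bits of %cr3 select the PCID */
	uint64_t cr3 = (pdirpa & ~CR3_PCID_MASK) | pcid;

	if (noflush)
		cr3 |= CR3_NOFLUSH;
	__asm volatile("movq %0,%%cr3" :: "r"(cr3) : "memory");
}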

> Probably, but there are still many CPUs without PCID out there, and for
them
> a blunt separation will unfortunately still be needed.

Haswell and later have it. The latest benchmarks for Linux and Windows
confirm that the performance impact of their software mitigations is
significantly higher on CPUs without PCID, so it's all the more
important to have this too.

Jaromir


Re: amd64: svs

2018-01-09 Thread Jaromír Doleček
2018-01-08 10:24 GMT+01:00 Maxime Villard :
>
> As far as SVS is concerned, it is not needed: each time an L4 slot is
added
> (pmap_get_ptp) or removed (pmap_free_ptp), SVS only applies the change in
the
> user page tables.
>
> The TLB is then flushed as usual: the slots that are stale in the pmap
page
> tables are also stale in the user page tables, so the flush has the same
effect
> regardless of whether the CPU is executing in userland or in the kernel.
>
> There are possible optimizations, like not sending TLB-flush IPIs to CPUs
that
> are known to have performed a kernel<->user transition between the time
when a
> page got removed from the page tables and the time we decided to flush
the TLB
> as a result.


If I understand correctly, the main idea is basically that we keep the
same address space, but simply unmap the kernel pages on return to user
space.

Linux went with completely separate address spaces for kernel and user
space on the affected Intel CPUs. The address space switch can be made
cheaper when the CPU has PCID, which Linux already supports.

Do you reckon SVS has potential of having lower overall overhead than
separate address spaces + PCID? Alternatively, do you think SVS would
benefit if we added PCID support to x86 pmap?

Since we already have support for the Alpha ASN, which is a similar
concept to the x86 PCID, perhaps it would not be overly complex to
implement the PCID support.

Jaromir


Re: amd64: svs

2018-01-07 Thread Jaromír Doleček
BTW Maxime, I've updated
https://en.wikipedia.org/wiki/Supervisor_Mode_Access_Prevention to note
your work on NetBSD support for SMAP and SMEP. Would it be feasible to get
SMAP support pulled up to netbsd-8 by chance?

The AMD Epyc processors support SMEP/SMAP as well, hopefully this will
eventually trickle down to their consumer lines like Ryzen too.

I really hope this huge Intel CPU security bug, Meltdown, will help
propel AMD into being a viable competitor.

Jaromir

2018-01-07 11:05 GMT+01:00 Maxime Villard :

> Le 06/01/2018 à 16:03, cl...@csel.org a écrit :
>
>> Runtime detection and configuration is desired as the
>> Linux did, rather than fixed compile-time option.
>>
>
> It would be nice, but it is more complicated than it seems. We would have
> to hotpatch the entry points, and to do that we need a way to hotpatch
> macros,
> and right now we don't have that. It would also be useful for SMAP.
>


Re: increase softint_bytes

2017-11-16 Thread Jaromír Doleček
softint_establish() already fails if there is no space, and prints a
WARNING in that case. The caller however needs to check for the
failure, which ixgbe(4) mostly doesn't. I'm not sure that is really the
root cause though - the panic seems to be in a different area, and I
didn't actually see the warning in the dmesg.

Making the softint size fully dynamic is unfortunately not
straightforward - it would need to reallocate the softint memory on all
CPUs in a way that keeps any serviced softints working during the
transition. It would be simple however to adjust the size based on a
percentage of physical memory, instead of a fixed value.
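A hypothetical sketch of such scaling (softint_bytes is the real
variable in kern_softint.c, the formula is invented here): keep the
8KiB floor and add 8KiB per GiB of RAM:

	/* hypothetical sizing, not committed code */
	softint_bytes = 8192 + 8192 * (ptoa((psize_t)physmem) >> 30);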

If I count correctly, with the current 8192 bytes the system supports
some 100 softints, which seems quite low - more advanced hardware and
drivers usually use a queue and a softint per CPU, so the space can
quickly run out on a system with e.g. 32 cores.

I agree that the sysctl seems somewhat unnecessary; since the limit is
only relevant during boot anyway, it's not very helpful.

Jaromir

2017-11-16 21:05 GMT+01:00 matthew green :

> Masanobu SAITOH writes:
> >   Hi, all.
> >
> >   Some device drivers now allocate a lot of softints.
> > See:
> >
> >   http://mail-index.netbsd.org/current-users/2017/11/09/
> msg032581.html
> >
> > To avoid this panic, I wrote the following patch:
> >
> >   http://www.netbsd.org/~msaitoh/softint-20171116-0.dif
> >
> > Summary:
> >
> >   - Increase the default size from 8192 bytes to 32768 bytes.
> >   - Add new option SOFTINT_BYTES to change the value.
> >   - Add two new read-only sysctls kern.softint.{max,count}
> >
> > Any comment?
>
> can't this be fixed by making it dynamic?  ie, fix it properly so
> that it never has this problem again -- if in 10 years we're adding
> 5x as many softints as today, we'll hit it again.
>
> i also don't see why the crash can't be fixed to become an error.
> the code tries to avoid adding more than it can, but soemthing must
> be wrong for the crash to occur anyway.  ie, at worst we'd fix it
> to return an error instead of crashing.
>
> if that is done, the sysctl exports seem unneccesary as well.
>
> thanks.
>
>
> .mrg.
>


Re: ext3 support

2017-10-30 Thread Jaromír Doleček
I share pretty much the same sentiment.

ext3/ext4 journalling support is not very useful, as it's not likely to
be used for anything more critical than sharing files between Linux and
NetBSD. The ext3 journal would have to be a complete reimplementation -
NetBSD's wapbl is neither usable nor sufficient for ext3's needs.

Practically speaking, it would be more useful to finish support for the
other miscellaneous features, like extents, dirhash, or extended
attributes. You'd be more than welcome to help there.

Jaromir

2017-10-29 18:39 GMT+01:00 Christos Zoulas :

> In article  namprd13.prod.outlook.com>,
> Abhirup Das   wrote:
> >Hi,
> >
> >I just came across the ext3 implementation project listed on the netbsd
> >website, https://wiki.netbsd.org/projects/project/ext3fs/
> >and I would like to take the time to implement this. But I am not too
> >sure on how to start this project, I have read up on journaling and I
> >understand the added work that journaling needs to perform.
> >However I am sure ext3 needs more than just journaling, but I havent
> >come across a specifications list for ext3 yet. Any advice on how to
> >start this project would be greatly appreciated.
>
> So this is a bit complicated issue. I am not sure if adding journaling
> support is worth it (compared to the amount of time it will take to
> implement it) because you can always mount the filesystem on linux,
> recover from the journal and then use it on NetBSD. OTOH there has
> been some ext3/ext4 work done as part of GSoC 2016...
>
> https://summerofcode.withgoogle.com/archive/2016/
> projects/5183714946449408/
>
> Finishing the RW support for it might be a more useful endeavor.
>
> christos
>
>


Re: Deadlock on fragmented memory?

2017-10-24 Thread Jaromír Doleček
Thanks for doing this.

How difficult do you gather it would be to convert the execargs to use
4K pages rather than contiguous KVA - are there any particular gotchas?
I would be interested to look at this - it sounds like a nice
high-kernel-only project for me, to take a break from the ATA oddities.

Jaromir

2017-10-23 15:25 GMT+02:00 Manuel Bouyer :

> On Mon, Oct 23, 2017 at 03:36:53AM +0000, Taylor R Campbell wrote:
> > I should maybe also signal when pool_grow is done so it doesn't wait
> > forever, and/or I should maybe just go to bed now.
>
> > [patch stripped]
>
>
> I applied this patch to a netbsd-7 tree (along with the execargs one),
> and I have a Xen i386 domU doing a bulk build now. It's still running
> pbulk-scan so it's way too soon to see if it helps, but a last the
> system boots so it's not worse than before :)
>
> --
> Manuel Bouyer 
>  NetBSD: 26 ans d'experience feront toujours la difference
> --
>


Removal of 'wd* at umass?' driver?

2017-10-08 Thread Jaromír Doleček
Hello,

The code for this umass attachment (dev/usb/umass_isdata.c) doesn't use
the atabus, and sidesteps the normal atabus processing. Today I
discovered that after the NCQ branch was merged, some kernels which
included dev/usb/usbdevices.config but no atabus failed to compile due
to missing symbols, and I added a band-aid for that.

As part of the NCQ branch, I made an effort for the umass_isdata.c code
to continue to compile, but had no way to test whether it actually
still works. In fact, I have no idea if the code worked before I
started on NCQ. Nobody responded either when I asked for testers for
this particular device in the last half a year.

As the hardware is relatively rare, it's actually quite difficult to
keep maintaining this while working on more advanced ATA features. I
think it would be fair to simply drop it from the tree. If some brave
soul wants to revive it, it will remain available in the history.

Opinions?

Jaromir


Re: Snapshot performances

2017-09-26 Thread Jaromír Doleček
The WAPBL resource exhaustion caught my attention.

Does this still happen with netbsd-8? I fixed something like this in
WAPBL a while ago, and the fixes are on the release branch.

Jaromir

2017-09-26 14:52 GMT+02:00 Edgar Fuß :

> > How is an internal snapshot defined?
> I'm not completely sure. Ask hannken@.
>
> > The term seems to only exist in dump(8).
> And fsck_ffs(8).
>
> > Is it a snapshot for which the backing store is on the same disk
> > the snapshot is done?
> I suppose. hannken@ should know.
>
> My experience is that I regularily take external snapshots (using
> fssconfig(8))
> while doing TSM backups, and this works fine even for multi-TB filesystems.
> However, taking an internal snapshot (fsck -X) on that filesystem starved
> the
> machine, eventually paniced it (some WAPBL resource exhaustion) and the log
> replay on reboot took something like half an hour. On much smaller
> filesystems,
> fsck -X works fine.
>


Re: Proposal: Disable autoload of compat_xyz modules

2017-08-01 Thread Jaromír Doleček
I like that all the arch-specific code is under sys/arch, and not
randomly spread around the tree, i.e. I prefer to keep the compat
things under sys/arch.

For sure, the same argument could be used in the opposite direction -
that it would be neater to have all the compat code together.

But IMO the arch-specific bits are more tied to the specific platform, so
it makes sense to have them together with their native counterparts - like
e.g. signal trampoline code.

Jaromir

2017-08-01 13:12 GMT+02:00 Maxime Villard :
> Yes, that's the right thing to do.
>
> I haven't rechecked, but as far as I remember, there will be some issues
with
> the compat layers that have several #ifdefs deep into the machdep code.
>
> Apart from that, several files could usefully be moved into the
appropriate
> compat_xyz directory, instead of polluting sys/arch/. For example:
>
> sys/arch/amd64/amd64/linux_*->   sys/compat/linux/arch/amd64/
> sys/arch/amd64/amd64/linux32_*  ->
sys/compat/linux32/arch/amd64/
>
> This would provide a better encapsulation.


SATA PMP and ATAPI - apparently not working, is it supposed to?

2017-06-28 Thread Jaromír Doleček
Hi,

during my testing on the jdolecek-ncq branch, I found that when I
attach an ATAPI device via a port multiplier, the device is never
actually detected. The behaviour I see is that the atapibus is detected
on the port, but not the actual device cd1, like this:

...
atabus10 at siisata0 channel 0
...
atabus10: SATA port multiplier, 15 ports
atabus10: vendor 0x197b, product 0x0325, revision 1.1, level 0
atabus10 PMP port 0: device present, speed: 1.5Gb/s
atabus10 PMP port 1: device present, speed: 3.0Gb/s
atabus10 PMP port 2: device present, speed: 3.0Gb/s
atabus10 PMP port 3: device present, speed: 1.5Gb/s
...
atapibus0 at atabus10: 1 targets
...

It gets detected like this when attached directly:

...
atabus11 at siisata0 channel 1
...
atapibus0 at atabus11: 1 targets
cd1 at atapibus0 drive 0:  cdrom
removable
cd1: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 5 (Ultra/100)
cd1(siisata0:1:0): using PIO mode 4, DMA mode 2, Ultra-DMA mode 5
(Ultra/100) (using DMA)

This behaviour is same on kernel from HEAD sources and my ncq branch.

Is ATAPI behind a PMP supposed to work, or does this maybe depend on
the particular card? I thought one of the use cases for a port
multiplier is to attach all the 'slow' devices through it, to save the
direct ports for fast disks.

Could this maybe be a driver issue? Do ATAPI devices connected via a
PMP work e.g. on ahcisata(4)?

This particular port multiplier has an issue with port detection
(satapmp claims it has 15 ports when it has only 5), so it's entirely
possible it could have a problem with ATAPI pass-through as well.

Jaromir


Re: HEADS-UP: jdolecek-ncq branch merge imminent

2017-06-28 Thread Jaromír Doleček
I've made some improvements to the error handling, and fixed the PMP &
dump issues I discovered on siisata/mvsata/ahcisata/pciide. I've also
tested siisata(4) and ahcisata(4) with an ATAPI device to some extent,
and didn't see any regression there so far.

I do want to do some more ATAPI testing, and run the wd(4) chaos monkey
to exercise and test the 'regular' error paths better, but other than
that, I think the branch is in quite decent shape. I consider it ready
for merge.

Jonathan, do you have any specific concerns still? Particularly, do you
think the remaining XXXs in siisata(4) prevent the branch to be merged?

Jaromir

2017-06-13 22:34 GMT+02:00 Jaromír Doleček :

> I'll see if I can do some chaos monkey thing for wd(4) to test the error
> paths.
>
> I'm not quite sure what kind of special error recovery would be necessary
> to add though. With NCQ, if there is any error, all pending commands are
> cancelled by the device. There is nothing to recover by HBA. The commands
> just need to be reissued after channel reset, which is done after a timeout
> by wd(4). I think that eventual further error path tweaks really could be
> done on HEAD already, and that it's not necessary to hold off making this
> available for general public.
>
> I'd really prefer not to stretch this relatively simple branch to too
> ambitious rework, like converting ata to thread context or doing base MP
> work.
>
> I do want to test PMP though, thanks for reminder. In my opinion, once
> that is confirmed working, there is no reason really to wait any further
> with the merge.
>
> Is there anything on HEAD which is useful to sync to the branch? I'm
> wondering whether it's useful to sync it to ncq branch right now at all.
>
> Jaromir
>
> 2017-06-08 10:45 GMT+02:00 Jonathan A. Kollasch :
>
>> On Wed, Jun 07, 2017 at 09:45:48PM +0200, Jaromír Doleček wrote:
>> > Hello,
>> >
>> > I plan to merge the branch to HEAD very soon, likely over the weekend.
>> > Eventual further fixes will be done on HEAD already, including mvsata(4)
>> > restabilization, and potential switch of siisata(4) to support NCQ.
>> >
>> > The plan is to get this pulled up to netbsd-8 branch soon also, so that
>> it
>> > will be part of 8.0.
>> >
>> > Status:
>> > - ahci(4) fully working with NCQ (confirmed with qemu, and real hw)
>> > - piixide(4) continues working (no NCQ support of course) (confirmed in
>> > qemu)
>> > - siisata(4) continues working (without NCQ still) (confirmed with real
>> hw)
>> > - mvsata(4) not yet confirmed working after changes, mainly due the DMA
>> not
>> > really working on Marvell 88SX6042 which I have available - I have same
>> > issue as kern/52126
>> > - other ide/sata drivers received mechanical changes, should continue
>> > working as before
>> >
>> > Jaromir
>>
>> My limited testing has shown that the device error handling paths need
>> work yet.  It's unfortunately been a few weeks since I've tested things,
>> so my memory isn't fresh with the details.  The general issue is that
>> ata(4) is doing error handling in interrupt context, and NCQ error
>> handling is straining that assumption.
>>
>> At this point I advise that the branch could be synced with HEAD and
>> further testing done before merge.  I can not advise that this be
>> pulled up to netbsd-8 before more extensive and comprehensive testing
>> has been done.
>>
>> I've got siisata(4) NCQ stuff I should commit to the branch.
>>
>> Jonathan Kollasch
>>
>
>


Re: HEADS-UP: jdolecek-ncq branch merge imminent

2017-06-14 Thread Jaromír Doleček
I'll see if I can do some chaos monkey thing for wd(4) to test the error
paths.

I'm not quite sure what kind of special error recovery would be
necessary to add, though. With NCQ, if there is any error, all pending
commands are cancelled by the device. There is nothing for the HBA to
recover. The commands just need to be reissued after a channel reset,
which is done after a timeout by wd(4). I think that eventual further
error path tweaks really could be done on HEAD already, and that it's
not necessary to hold off making this available to the general public.

I'd really prefer not to stretch this relatively simple branch to too
ambitious rework, like converting ata to thread context or doing base MP
work.

I do want to test PMP though, thanks for the reminder. In my opinion,
once that is confirmed working, there is no reason to wait any further
with the merge.

Is there anything on HEAD which would be useful to sync to the branch?
I'm wondering whether it's useful to sync the ncq branch at all right
now.

Jaromir

2017-06-08 10:45 GMT+02:00 Jonathan A. Kollasch :

> On Wed, Jun 07, 2017 at 09:45:48PM +0200, Jaromír Doleček wrote:
> > Hello,
> >
> > I plan to merge the branch to HEAD very soon, likely over the weekend.
> > Eventual further fixes will be done on HEAD already, including mvsata(4)
> > restabilization, and potential switch of siisata(4) to support NCQ.
> >
> > The plan is to get this pulled up to netbsd-8 branch soon also, so that
> it
> > will be part of 8.0.
> >
> > Status:
> > - ahci(4) fully working with NCQ (confirmed with qemu, and real hw)
> > - piixide(4) continues working (no NCQ support of course) (confirmed in
> > qemu)
> > - siisata(4) continues working (without NCQ still) (confirmed with real
> hw)
> > - mvsata(4) not yet confirmed working after changes, mainly due the DMA
> not
> > really working on Marvell 88SX6042 which I have available - I have same
> > issue as kern/52126
> > - other ide/sata drivers received mechanical changes, should continue
> > working as before
> >
> > Jaromir
>
> My limited testing has shown that the device error handling paths need
> work yet.  It's unfortunately been a few weeks since I've tested things,
> so my memory isn't fresh with the details.  The general issue is that
> ata(4) is doing error handling in interrupt context, and NCQ error
> handling is straining that assumption.
>
> At this point I advise that the branch could be synced with HEAD and
> further testing done before merge.  I can not advise that this be
> pulled up to netbsd-8 before more extensive and comprehensive testing
> has been done.
>
> I've got siisata(4) NCQ stuff I should commit to the branch.
>
> Jonathan Kollasch
>


HEADS-UP: jdolecek-ncq branch merge imminent

2017-06-07 Thread Jaromír Doleček
Hello,

I plan to merge the branch to HEAD very soon, likely over the weekend.
Eventual further fixes will be done on HEAD already, including mvsata(4)
restabilization, and potential switch of siisata(4) to support NCQ.

The plan is to get this pulled up to netbsd-8 branch soon also, so that it
will be part of 8.0.

Status:
- ahci(4) fully working with NCQ (confirmed with qemu, and real hw)
- piixide(4) continues working (no NCQ support of course) (confirmed in
qemu)
- siisata(4) continues working (without NCQ still) (confirmed with real hw)
- mvsata(4) not yet confirmed working after the changes, mainly due to
the DMA not really working on the Marvell 88SX6042 which I have
available - I have the same issue as kern/52126
- other ide/sata drivers received mechanical changes, should continue
working as before

Jaromir


Re: Modules and bus_dmamap_create() failing

2017-05-28 Thread Jaromír Doleček
The allocation size is a function of MAXPHYS / PAGE_SIZE * (queue
size), which is e.g. for i386 and a queue size of 128:
64k / 4k * 128 = 16 * 128 = 2048. It's MAXPHYS _divided_ by PAGE_SIZE,
times the queue size.

I've looked at virtio_alloc_vq() in a little bit of detail. The way I
compute the allocsize, I arrive at something like 36104 bytes plus
twice an alignment to a 4096 boundary.

When I boot a kernel printing the allocsize, it prints 45056 bytes for
the vioscsi queue of size 128, and some 81920 bytes for the vioif
queue of 256. So in line with expectations.

So is it really the case that physical memory is so fragmented after
boot to multiuser that you cannot get even 11 contiguous memory pages?

It confuses me that this behaviour doesn't change when I increase the
available physical memory, but I only used relatively small increases
(128MB/256MB/512MB IIRC). Maybe reducing the page cache size would help?
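To make the constraint concrete: asking bus_dmamem_alloc() for a single
segment is what forces physically contiguous pages (the API is the real
bus_dma(9) one; sc->sc_dmat and the sizes are just the vioscsi numbers
from above):

	bus_dma_segment_t seg;
	int rseg, error;

	/* nsegs = 1: a 45056-byte queue needs 11 physically contiguous
	 * pages, which a fragmented system may no longer have */
	error = bus_dmamem_alloc(sc->sc_dmat, 45056, PAGE_SIZE, 0,
	    &seg, 1, &rseg, BUS_DMA_WAITOK);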

Jaromir

2017-05-28 13:48 GMT+02:00 Jonathan A. Kollasch :
>
> On Sat, May 27, 2017 at 11:21:14AM +0200, Jaromír Doleček wrote:
> > The driver allocates sizeable, but not excessive amounts of memory -
mostly
> > working with values like MAXPHYS / PAGE_SIZE * (queue size) for
dmamaps/dma
> > memory, where queue size is 128 or so.
>
> No, that's excessive for a contiguous chunk.  You can only expect such
> large contiguous allocations to work at boot.  Physical memory becomes
> too fragmented after the machine has been up and doing things.  We
> could really use a way to defrag physical memory, but doing that is not
> easy.
>
> Jonathan Kollasch


Modules and bus_dmamap_create() failing

2017-05-27 Thread Jaromír Doleček
Hello,

while working on vioscsi(4) improvements, I had quite tedious problems
with the driver failing to allocate the DMA map in bus_dmamap_create().

The symptoms were that the bus_dmamap_create()/bus_dmamem_alloc() calls
were very often, but not always, failing when the driver was loaded as
a module. For example, when the module was loaded immediately after a
VM reboot, the dmamap/dmamem calls would succeed very often, but also
not always. Once it failed, it would consistently keep failing over
further module unload/load cycles until another reboot.

When the driver was compiled into the kernel and hence attached during
boot, the bus dmamem/dmamap allocation never failed for me. So I mostly
ended up rebooting kernels.

The driver allocates sizeable, but not excessive, amounts of memory -
mostly working with values like MAXPHYS / PAGE_SIZE * (queue size) for
dmamaps/DMA memory, where the queue size is 128 or so.

The memory allocation failure was not dependent on the amount of
available physical memory - it failed regardless of how much memory I
allocated to the virtual host, and regardless of whether I used an i386
or x86_64 virtual machine. It also failed regardless of using ALLOCNOW
or not, and I think also regardless of using WAITOK/NOWAIT.

Is there something which differs in bus dmamem/dmamap allocation during
boot and after it, which would explain this behaviour?

Could I possibly tweak some parameters to make
bus_dmamem_alloc()/bus_dmamap_create() fail less often for dynamically
loaded modules? Failing that, is there maybe some other physical memory
allocation limit I should observe in the driver code and try to keep
the allocations within?

Jaromir


Initial ahci(4) NCQ support on jdolecek-ncq branch, test/review wanted

2017-04-19 Thread Jaromír Doleček
Hi,

I've committed the initial NCQ support; it's available on the
jdolecek-ncq branch (branched only over sys). It works under Parallels
for now, and is ready for brave adventurers. There is also
sys/dev/ata/TODO.ncq which you might want to check.

If people want to try it out and give feedback, I'd really appreciate it.

I'll be busy elsewhere for the next 2-3 weeks, so I probably won't have
much time for bugfixing during that period. After that I plan to start
testing with real hardware, and look at the remaining bugs.

One of the more annoying bugs right now is that AHCI NCQ transfers
pretty frequently time out for me under Parallels. Maybe the interrupt
handling is not quite right yet in the driver. I've only gotten far
enough to note that every transfer gets two interrupts, one with IS 0x8
and one with 0x0. The 0x8 is the Set Device Bits interrupt and is
expected for NCQ; the other not so much.
My understanding of the ATA and AHCI specs is not good enough yet, so
if somebody could look there and sanity-check it, I would be really
glad.

Also, I've not yet modified the mvsata(4) and siisata(4) drivers to
enable NCQ. While the drivers seem to be mostly prepared, I lack the
hardware, so I have no way to test. I would really appreciate it if
folks with the hardware could check it out and possibly provide patches
to enable the support there.

Last but not least, the changes contain simply no consideration for PMP
in AHCI. I have no idea if it needs special support. I hope not, but it
would be nice if our PMP experts could sanity-check that.

Jaromir


Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL

2017-04-03 Thread Jaromír Doleček
2017-04-02 17:28 GMT+02:00 Thor Lancelot Simon :
> However -- I believe for the 20-30% of SAS drives you mention as shipping
> with WCE set, it should be possible to obtain nearly identical performance
> and more safety by setting the Queue Algorithm Modifier bit in the control
> mode page to 1.  This allows the drive to arbitrarily reorder SIMPLE
> writes so long as the precedence rules with HEAD and ORDERED commands
> are respected.

Is there any reason we wouldn't want to set QAM=1 by default for
sd(4)? It seems like a pretty obvious performance tweak.

MODE SENSE has a flag to tell us whether this field is settable, so it
should be pretty safe to set it relying on that.

I can make this change (maybe with a sysctl to control it?);
unfortunately I don't have any hardware on which to test whether it
makes a measurable difference. Any volunteers to test a patch?
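For concreteness, the bit in question (the page layout is per SPC -
control mode page 0x0A, byte 3, upper nibble; the struct/field names
below are hypothetical, and the scsipi_mode_sense()/scsipi_mode_select()
plumbing is omitted):

#define CTRL_QAM_SHIFT	4
#define CTRL_QAM_MASK	(0xf << CTRL_QAM_SHIFT)

/* QAM=1: allow unrestricted reordering of SIMPLE tagged commands */
ctrl_page.byte3 &= ~CTRL_QAM_MASK;
ctrl_page.byte3 |= 1 << CTRL_QAM_SHIFT;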

Jaromir


Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL

2017-03-31 Thread Jaromír Doleček
2017-03-31 22:16 GMT+02:00 Thor Lancelot Simon :
> It's not obvious, but in fact ORDERED gets set for writes
> as a default, I believe -- in sd.c, I think?
>
> This confused me for some time when I last looked at it.

It confused me also, that's why I changed the code a while back to be
less confusing :)

It used to be easy to draw the conclusion that we use ORDERED all the
time, as that was the default if the tag flag was unset. But in fact
sd(4) and cd(4) always set XS_CTL_SIMPLE_TAG explicitly, so it was
actually always SIMPLE.

scsipi_base.c
revision 1.166
date: 2016-10-02 21:40:35 +0200;  author: jdolecek;  state: Exp;
lines: +4 -4;  commitid: iiAGFJk9looTyBoz;
change scsipi_execute_xs() to default to simple tags for !XS_CTL_URGENT
if not specified by caller; this is mostly for documentation purposes
only, as sd(4) and cd(4) explicitly use simple tags already

Jaromir



Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL

2017-03-31 Thread Jaromír Doleček
> The problem is that it does not always use SIMPLE and ORDERED tags in a
> way that would facilitate the use of ORDERED tags to enforce barriers.

Our scsipi layer actually never issues ORDERED tags right now as far
as I can see, and there is currently no interface to get it set for an
I/O.

> Also, that we may not know enough about the behavior of our filesystems
> in the real world to be 100% sure it's safe to set the other mode page
> bits that allow the drive to arbitrarily reorder SIMPLE commands (which
> under some conditions is necessary to match the performance of running
> with WCE set).

I lived under the assumption that SIMPLE tagged commands can be and are
reordered by the controller/drive at will already, without setting any
other flags.

> When SCSI tagged queueing is used properly, it is not necessary to set WCE
> to get good write performance, and doing so is in fact harmful, since it
> allows the drive to return ORDERED commands as complete before any of the
> data for those or prior commands have actually been committed to stable
> storage.

This was what I meant when I said "even ordered tags couldn't avoid
the cache flushes". Using ORDERED tags doesn't provide on-media
integrity when WCE is set.

Now, it might be the case that the on-media integrity is not the
primary goal. Then the flush is only a write barrier, not an integrity
measure. In that case yes, ORDERED does keep the semantics (e.g.
earlier journal writes are written before later journal writes). It
also makes things much easier to code - simply mark the I/O as ORDERED
and fire, with no need to explicitly wait for completion, and e.g.
journal locks can be dropped faster.

I do think that it's important to concentrate on the case where WCE is
on, since that is realistically what the majority of systems run with.
Just for the record, I can see these practical problems with ORDERED:
1. it is only available on SCSI, so we still need fallback barrier
logic for less awesome hw
2. Windows and Linux used to always use SIMPLE tags and wait for
completion; this suggests the avenue may have already been explored and
found not interesting enough, or too buggy (remember scheduler
activations?)
3. bufq processing needs special care for MPSAFE SCSI drivers, to
prevent processing any further commands while an I/O with an ORDERED
tag is being submitted to the controller.

I still see my FUA effort as a more direct replacement for the cache
flushes, as it keeps both the logical and the on-media integrity. Also,
it will benefit SATA disks too, once/if NCQ is integrated.

I think that implementing barrier/ORDERED can be a parallel effort,
similar to the maxphys branch. I don't think barriers will make FUA
irrelevant, as it's still needed for systems with WCE on.
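As a usage sketch with the flag names from the patch elsewhere in this
thread (the surrounding code is hypothetical):

	/* journal commit write: bypass the write cache for this buf
	 * only, instead of issuing DIOCCACHESYNC for the whole disk */
	bp->b_flags |= B_MEDIA_FUA;
	VOP_STRATEGY(vp, bp);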

Jaromir


Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL

2017-03-27 Thread Jaromír Doleček
Attached is the final version of the patch. It uses the MEDIA prefix
for the flags, but keeps FUA/DPO - i.e. the names are B_MEDIA_FUA and
B_MEDIA_DPO. For wapbl it introduces a sysctl to use the feature; the
default is off for now.

I plan to commit this later in the week or early next week, unless
there are serious objections.

Jaromir

2017-03-05 23:22 GMT+01:00 Jaromír Doleček :
> Here is an updated patch. It was updated to check for the FUA support
> for SCSI, using the MODE SENSE device-specific flag. Code was tested
> with QEMU emulated bha(4) and nvme. WAPBL code was updated to use the
> flag. It keeps the flag naming for now.
>
> In the patch, WAPBL sets the flag for journal writes, and also for the
> metadata buffer for bawrite() call after journal commit.
>
> There is possible layer violation for metadata write - b_flags are
> supposed to be set by owner of the buffer. Not sure how strict we
> want/need to be there - perhaps introduce another flag field? Also the
> flag
> probably needs to be unset in biodone hook, so that the code
> guarantees the buffer in buffer cache doesn't accidentaly keep it over
> to another I/O.
>
> Jaromir
? dev/ic/TODO.nvme
Index: sys/buf.h
===
RCS file: /cvsroot/src/sys/sys/buf.h,v
retrieving revision 1.126
diff -u -p -r1.126 buf.h
--- sys/buf.h   26 Dec 2016 23:12:33 -  1.126
+++ sys/buf.h   27 Mar 2017 22:31:22 -
@@ -198,16 +198,21 @@ struct buf {
 #defineB_RAW   0x0008  /* Set by physio for raw 
transfers. */
 #defineB_READ  0x0010  /* Read buffer. */
 #defineB_DEVPRIVATE0x0200  /* Device driver private flag. 
*/
+#defineB_MEDIA_FUA 0x0800  /* Set Force Unit Access for 
media. */
+#defineB_MEDIA_DPO 0x1000  /* Set Disable Page Out for 
media. */
 
 #define BUF_FLAGBITS \
 "\20\1AGE\3ASYNC\4BAD\5BUSY\10DELWRI" \
 "\12DONE\13COWDONE\15GATHERED\16INVAL\17LOCKED\20NOCACHE" \
-"\23PHYS\24RAW\25READ\32DEVPRIVATE\33VFLUSH"
+"\23PHYS\24RAW\25READ\32DEVPRIVATE\33VFLUSH\34MEDIAFUA\35MEDIADPO"
 
 /* Avoid weird code due to B_WRITE being a "pseudo flag" */
 #define BUF_ISREAD(bp) (((bp)->b_flags & B_READ) == B_READ)
 #define BUF_ISWRITE(bp)(((bp)->b_flags & B_READ) == B_WRITE)
 
+/* Media flags, to be passed for nested I/O */
+#define B_MEDIA_FLAGS  (B_MEDIA_FUA|B_MEDIA_DPO)
+
 /*
  * This structure describes a clustered I/O.  It is stored in the b_saveaddr
  * field of the buffer on which I/O is done.  At I/O completion, cluster
Index: sys/dkio.h
===
RCS file: /cvsroot/src/sys/sys/dkio.h,v
retrieving revision 1.22
diff -u -p -r1.22 dkio.h
--- sys/dkio.h  8 Dec 2015 20:36:15 -   1.22
+++ sys/dkio.h  27 Mar 2017 22:31:22 -
@@ -85,6 +85,8 @@
 #defineDKCACHE_RCHANGE 0x000100 /* read enable is changeable */
 #defineDKCACHE_WCHANGE 0x000200 /* write enable is changeable */
 #defineDKCACHE_SAVE0x01 /* cache parameters are savable/save 
them */
+#defineDKCACHE_FUA 0x02 /* Force Unit Access supported */
+#defineDKCACHE_DPO 0x04 /* Disable Page Out supported */
 
/* sync disk cache */
 #defineDIOCCACHESYNC   _IOW('d', 118, int) /* sync cache (force?) 
*/
Index: kern/vfs_bio.c
===
RCS file: /cvsroot/src/sys/kern/vfs_bio.c,v
retrieving revision 1.271
diff -u -p -r1.271 vfs_bio.c
--- kern/vfs_bio.c  21 Mar 2017 10:46:49 -  1.271
+++ kern/vfs_bio.c  27 Mar 2017 22:31:22 -
@@ -2027,7 +2027,7 @@ nestiobuf_iodone(buf_t *bp)
 void
 nestiobuf_setup(buf_t *mbp, buf_t *bp, int offset, size_t size)
 {
-   const int b_read = mbp->b_flags & B_READ;
+   const int b_pass = mbp->b_flags & (B_READ|B_MEDIA_FLAGS);
struct vnode *vp = mbp->b_vp;
 
KASSERT(mbp->b_bcount >= offset + size);
@@ -2035,14 +2035,14 @@ nestiobuf_setup(buf_t *mbp, buf_t *bp, i
bp->b_dev = mbp->b_dev;
bp->b_objlock = mbp->b_objlock;
bp->b_cflags = BC_BUSY;
-   bp->b_flags = B_ASYNC | b_read;
+   bp->b_flags = B_ASYNC | b_pass;
bp->b_iodone = nestiobuf_iodone;
bp->b_data = (char *)mbp->b_data + offset;
bp->b_resid = bp->b_bcount = size;
bp->b_bufsize = bp->b_bcount;
bp->b_private = mbp;
BIO_COPYPRIO(bp, mbp);
-   if (!b_read && vp != NULL) {
+   if (BUF_ISWRITE(bp) && vp != NULL) {
mutex_enter(vp->v_interlock);
vp->v_numoutput++;
mutex_exit(vp->v_interlock);
Index: kern/vfs_wapbl.c
=

Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL

2017-03-27 Thread Jaromír Doleček
2017-03-12 11:15 GMT+01:00 Edgar Fuß :
> Some comments as I probably count as one of the larger WAPBL consumers (we
> have ~150 employee's Home and Mail on NFS on FFS2+WAPBL on RAIDframe on SAS):

I've not changed the code in RF to pass the cache flags, so the patch
doesn't actually enable FUA there. Mainly because disks come and go,
and I'm not aware of a mechanism to make WAPBL aware of such changes. It
would be easy to add the passthrough if needed; it can reuse code from
the DIOCCACHESYNC handling. Or maybe we could assume that everybody uses
the same kind of disks in a raid volume :)

Maybe we could handle this just by polling - i.e. adding some flag
like DKCACHE_VOLATILE for use in RF, and WAPBL code could implement a
periodic callout checking for changes.

The SAS drivers attach normal sd(4), right? My patch only updates
sd(4) and nvme(4) to support the flags. Of course it would be neat to
get anything else on board, if docs and testers are available. It
would good to first test it on real hardware, to see if there is
actual benefit.

> Also, I remember each journal flush to actually cause two cache syncs, one
> before and one after writing the actual data.

Yes. It needs to do this because it needs to make sure that journal
data are saved before we save the journal commit block. Implicitly,
the pre-commit flush also makes sure that all asynchronously written
metadata updates are written to the media, before the commit makes
them impossible to replay. The pre-commit sync can be avoided by
adding a checksum to the journal commit record - I plan to add
this sometime later, too.
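
As a rough illustration of that idea (field and function names here are
mine, not WAPBL's):

/*
 * Sketch only: replay trusts a commit record only if the checksum over
 * the journal data it covers matches; a torn transaction then simply
 * fails the check and is skipped, so the pre-commit flush can go.
 * journal_crc32() is a made-up helper.
 */
#include <stdint.h>

struct commit_record_sketch {
	uint64_t cr_head;	/* journal range covered */
	uint64_t cr_tail;
	uint32_t cr_checksum;	/* e.g. CRC32 over that range */
};

static int
replay_check_sketch(const struct commit_record_sketch *cr)
{
	if (journal_crc32(cr->cr_head, cr->cr_tail) != cr->cr_checksum)
		return 1;	/* incomplete transaction - do not replay */
	return 0;
}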

> JD> it adds (another) incentive to actually integrate AHCI NCQ support
> What about SCSI TCQ? I seem to remember all that flushing could be avoided
> if the FS used queueing. DHs comments on barriers seem to second that.

Even SCSI ORDERED tags wouldn't help to avoid the need for cache flushes.

Jaromir


Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL

2017-03-05 Thread Jaromír Doleček
Here is an updated patch. It was updated to check for the FUA support
for SCSI, using the MODE SENSE device-specific flag. Code was tested
with QEMU emulated bha(4) and nvme. WAPBL code was updated to use the
flag. It keeps the flag naming for now.

In the patch, WAPBL sets the flag for journal writes, and also for the
metadata buffer for bawrite() call after journal commit.

There is a possible layer violation for the metadata write - b_flags are
supposed to be set by the owner of the buffer. Not sure how strict we
want/need to be there - perhaps introduce another flag field? Also the
flag probably needs to be unset in the biodone hook, so that the code
guarantees the buffer in the buffer cache doesn't accidentally keep it
over to another I/O.

Jaromir
? dev/ic/TODO.nvme
Index: sys/buf.h
===
RCS file: /cvsroot/src/sys/sys/buf.h,v
retrieving revision 1.126
diff -u -p -r1.126 buf.h
--- sys/buf.h   26 Dec 2016 23:12:33 -  1.126
+++ sys/buf.h   5 Mar 2017 22:08:35 -
@@ -198,11 +198,13 @@ struct buf {
 #defineB_RAW   0x0008  /* Set by physio for raw 
transfers. */
 #defineB_READ  0x0010  /* Read buffer. */
 #defineB_DEVPRIVATE0x0200  /* Device driver private flag. 
*/
+#defineB_FUA   0x0800  /* Force Unit Access flag 
(mandatory). */
+#defineB_DPO   0x1000  /* Disable Page Out flag 
(advisory). */
 
 #define BUF_FLAGBITS \
 "\20\1AGE\3ASYNC\4BAD\5BUSY\10DELWRI" \
 "\12DONE\13COWDONE\15GATHERED\16INVAL\17LOCKED\20NOCACHE" \
-"\23PHYS\24RAW\25READ\32DEVPRIVATE\33VFLUSH"
+"\23PHYS\24RAW\25READ\32DEVPRIVATE\33VFLUSH\34FUA\35DPO"
 
 /* Avoid weird code due to B_WRITE being a "pseudo flag" */
 #define BUF_ISREAD(bp) (((bp)->b_flags & B_READ) == B_READ)
Index: sys/dkio.h
===
RCS file: /cvsroot/src/sys/sys/dkio.h,v
retrieving revision 1.22
diff -u -p -r1.22 dkio.h
--- sys/dkio.h  8 Dec 2015 20:36:15 -   1.22
+++ sys/dkio.h  5 Mar 2017 22:08:35 -
@@ -85,6 +85,8 @@
 #defineDKCACHE_RCHANGE 0x000100 /* read enable is changeable */
 #defineDKCACHE_WCHANGE 0x000200 /* write enable is changeable */
 #defineDKCACHE_SAVE0x01 /* cache parameters are savable/save 
them */
+#defineDKCACHE_FUA 0x02 /* Force Unit Access supported */
+#defineDKCACHE_DPO 0x04 /* Disable Page Out supported */
 
/* sync disk cache */
 #defineDIOCCACHESYNC   _IOW('d', 118, int) /* sync cache (force?) 
*/
Index: kern/vfs_wapbl.c
===
RCS file: /cvsroot/src/sys/kern/vfs_wapbl.c,v
retrieving revision 1.87
diff -u -p -r1.87 vfs_wapbl.c
--- kern/vfs_wapbl.c5 Mar 2017 13:57:29 -   1.87
+++ kern/vfs_wapbl.c5 Mar 2017 22:08:35 -
@@ -70,6 +70,7 @@ __KERNEL_RCSID(0, "$NetBSD: vfs_wapbl.c,
 static struct sysctllog *wapbl_sysctl;
 static int wapbl_flush_disk_cache = 1;
 static int wapbl_verbose_commit = 0;
+static int wapbl_use_fua = 1;
 
 static inline size_t wapbl_space_free(size_t, off_t, off_t);
 
@@ -229,6 +230,12 @@ struct wapbl {
u_char *wl_buffer;  /* l:   buffer for wapbl_buffered_write() */
daddr_t wl_buffer_dblk; /* l:   buffer disk block address */
size_t wl_buffer_used;  /* l:   buffer current use */
+
+   int wl_dkcache; /* r:   disk cache flags */
+#define WAPBL_USE_FUA(wl)  \
+   (wapbl_use_fua && ISSET(wl->wl_dkcache, DKCACHE_FUA))
+   int wl_jwrite_flags;/* r:   journal write flags */
+   int wl_mwrite_flags;/* r:   metadata write flags */
 };
 
 #ifdef WAPBL_DEBUG_PRINT
@@ -280,6 +287,8 @@ static void wapbl_deallocation_free(stru
 static void wapbl_evcnt_init(struct wapbl *);
 static void wapbl_evcnt_free(struct wapbl *);
 
+static void wapbl_dkcache_init(struct wapbl *);
+
 #if 0
 int wapbl_replay_verify(struct wapbl_replay *, struct vnode *);
 #endif
@@ -390,6 +399,30 @@ wapbl_evcnt_free(struct wapbl *wl)
evcnt_detach(&wl->wl_ev_cacheflush);
 }
 
+static void
+wapbl_dkcache_init(struct wapbl *wl)
+{
+   int error;
+
+   /* Get disk cache flags */
+   error = VOP_IOCTL(wl->wl_devvp, DIOCGCACHE, &wl->wl_dkcache,
+   FWRITE, FSCRED);
+   if (error) {
+   /* behave as if there is a write cache */
+   wl->wl_dkcache = DKCACHE_WRITE;
+   }
+
+   /* Use FUA instead of cache flush if available */
+   if (WAPBL_USE_FUA(wl)) {
+   wl->wl_jwrite_flags |= B_FUA;
+   wl->wl_mwrite_flags |= B_FUA;
+   }
+
+   /* Use DPO for journal writes if available */
+   if (ISSET(wl->wl_dkcache, DKCACHE_DPO))
+   wl->wl_jwrite_flags |= B_DPO;
+}
+
 static int
 wapbl_start_flush_inodes(struct wapbl *wl, struct wapbl_replay *wr)
 {
@@ -562,6 +595,8 @@ wapb

Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL

2017-03-03 Thread Jaromír Doleček
2017-03-03 18:11 GMT+01:00 David Holland :
> Yes and no; there's also standard terminology for talking about
> caches, so my inclination would be to call it something like
> B_MEDIASYNC: synchronous at the media level.

Okay, this might be good. Words better than acronyms :)

>  > For DPO it's not so clear cut maybe. We could reuse B_NOCACHE maybe
>  > for the same functionality, but not sure if it matches with  what swap
>  > is using this flag for. DPO is ideal for journal writes however,
>  > that's why I want to add the support for it now.
>
> What does DPO do?

It tells the hardware not to store the data in its cache. Or more
precisely, not to put it into the cache if that would mean evicting
something else from it. It should improve general performance
of the disk - the journal writes will not trash the device cache.

B_MEDIANOCACHE?
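
For reference, on the wire both bits live in byte 1 of the SCSI
WRITE(10) CDB - DPO is bit 4, FUA is bit 3. A minimal sketch (the
struct and macro names below are mine, only loosely modeled on the
scsipi headers):

#include <stdint.h>

#define	SRW10_DPO	0x10	/* Disable Page Out */
#define	SRW10_FUA	0x08	/* Force Unit Access */

struct scsi_rw_10_sketch {
	uint8_t	opcode;		/* 0x2a for WRITE(10) */
	uint8_t	flags;		/* DPO/FUA bits go here */
	uint8_t	addr[4];	/* LBA, big-endian */
	uint8_t	reserved;
	uint8_t	length[2];	/* transfer length in blocks */
	uint8_t	control;
};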

> Perfect abstractions for any of these would be more complex, but
> barriers serve pretty well.

Perhaps we could start with reworking DIOCCACHESYNC into a barrier :)
Currently it is not actually guaranteed to be executed after the
already queued writes - the ioctl is executed out of band, bypassing
the bufq queue. Hence it doesn't actually quite work if there are any
in-flight async writes, as queued e.g. by the bawrite() calls in
ffs_fsync().

In Linux the block write interface accounts for that: there are flags
to ask for a sync to be done before or after the I/O, and it is also
possible to send an empty I/O with only the sync flags. Thus the
sync is always queued along with the writes. It would be good to adopt
something like this, but that would require bufq interface changes and
possibly device driver changes, with much broader tree disturbance.

At least with FUA the caller can ensure all the writes are safely
on media, without depending on an out-of-band ioctl.

> Single synchronous block writes are a bad way to implement barriers
> and it maybe makes sense to have two models and force every fs to be
> able to do things two different ways; but single synchronous block
> writes are also a bad way to implement any of the above invariants.
> E.g. I'm not convinced that writing out journal blocks synchronously
> one at a time will be faster than flushing the cache at the end of a
> journal write, even though the latter inflicts collateral damage in
> the sense of waiting for perhaps many blocks that don't need to be
> waited for.

Indeed - writing journal blocks sync one by one is unlikely to be
faster than sending them all async and doing a cache flush at the end;
that wouldn't make sense.

I plan to change WAPBL to do the journal writes partially async. It
will use several bufs, issue the I/O asynchronously and only wait if
it runs out of buffers, or if it needs to do the commit. It seems there
are usually three or four block writes done as part of a transaction
commit, so there is a decent parallelism opportunity.
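
Roughly like this - a sketch only, with made-up names, not the eventual
WAPBL code:

/*
 * Sketch: a small fixed pool of bufs for journal writes.  Writes go
 * out async; we block only when the pool is exhausted or when the
 * commit record has to go out.  All names here are made up.
 */
#define	NJBUFS	4			/* max in-flight journal writes */

struct wl_sketch {
	kmutex_t	wl_mtx;
	kcondvar_t	wl_jbuf_cv;
	int		wl_jbufs_busy;	/* decremented in the biodone hook */
	struct vnode	*wl_devvp;
};

static void
jwrite_sketch(struct wl_sketch *wl, struct buf *bp)
{
	mutex_enter(&wl->wl_mtx);
	while (wl->wl_jbufs_busy == NJBUFS)	/* pool exhausted */
		cv_wait(&wl->wl_jbuf_cv, &wl->wl_mtx);
	wl->wl_jbufs_busy++;
	mutex_exit(&wl->wl_mtx);
	VOP_STRATEGY(wl->wl_devvp, bp);		/* async write */
}

static void
jdrain_sketch(struct wl_sketch *wl)
{
	/* before the commit record: wait until everything completed */
	mutex_enter(&wl->wl_mtx);
	while (wl->wl_jbufs_busy > 0)
		cv_wait(&wl->wl_jbuf_cv, &wl->wl_mtx);
	mutex_exit(&wl->wl_mtx);
}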

> I guess it would help if I knew what you were intending to do with
> wapbl in this regard; have you posted that? (I've been at best
> skimming tech-kern the past few months...)

I haven't posted details on the WAPBL part of the changes. I'll put
together a patch over the weekend, and send it over. It will be useful
to show my thinking on how the proposed interface could be used.

> Getting it working first is great but I'm not sure a broadly exposed
> piece of infrastructure should be committed in a preliminary design
> state... especially in a place (the legacy buffer cache) that's
> already a big ol' mess.

That's one of the reasons I want to keep the current changes minimal :)

The proposed patch doesn't actually touch the legacy buffer cache code
at all. It only adds another B_* flag, and changes hardware device
drivers to react upon it. The flag is supposed to be set by the
caller, for example by WAPBL itself. Nothing in e.g. ffs would set the
flags.

> I guess what worries me is the possibility of getting interrupted by
> real life and then all this remaining in a half-done state in the long
> term; there are few things worse for maintainability in general than
> half-finished reorgs that end up getting left to bitrot. :-/

There is a semi-good chance this will be finished into a workable state
soon - I picked up journaling improvements as my Bachelor thesis
material, so it either gets done or I will fail :)

> Is there something more generic / less hardware-specific that we can
> put in the fs in the near term?

Well, the FUA support looks like a good candidate for being useful and
could have direct positive performance impact, so I picked up that. If
we have code taking advantage of FUA, it adds (another) incentive to
actually integrate AHCI NCQ support, as that is the only way to
get FUA support on more contemporary hardware. Also, it's my
understanding that using FUA instead of a full cache sync should be a
huge win for raid as well, so it's worth it for that avenue too.

> keep in mind that whatever it is might end up in -8 and needing to be
> mai

Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL

2017-03-02 Thread Jaromír Doleček
> Some quick thoughts, though:
>
> (1) ultimately it's necessary to patch each driver to crosscheck the
> flag, because otherwise eventually there'll be silent problems.

Maybe. I think I like having this as a responsibility of the caller for
now; it avoids too-broad tree changes. Ultimately it might indeed be
necessary, if we find out that it can't reasonably be handled by
the caller - like maybe raidframe kicking a spare disk without FUA
into a set with FUA.

> (2) it would be better not to expose hardware-specific flags in the
> buffercache, so it would be better to come up with a name that
> reflects the semantics, and a semantic guarantee that's at least
> notionally not hardware-specific.

I want to avoid unnecessary private NetBSD nomenclature. If the storage
industry calls it FUA, it's probably good to just call it FUA.

For DPO it's maybe not so clear cut. We could perhaps reuse B_NOCACHE
for the same functionality, but I'm not sure if it matches what swap
is using this flag for. DPO is ideal for journal writes however,
that's why I want to add the support for it now.

> (3) as I recall (can you remind those of us not currently embedded in
> this stuff what the semantics of FUA actually are?) FUA is *not* a
> write barrier (as in, all writes before happen before all writes
> after) and since write barriers are a natural expression of the
> requirements for many fses, it would be well to make sure the
> implementation of this doesn't conflict with that.

FUA doesn't enforce any barriers. It merely changes the semantics of
the write request - the hardware will return a success response only
after the data is written to non-volatile media.

Any barriers required by filesystem semantics need to be handled by the
fs code, same as now with DIOCCACHESYNC.

I've talked about adding some kind of generic barrier support in the
previous thread. After thinking about it, and reading more, I'm not
convinced it's necessary. Incidentally, Linux has moved away from the
generic barriers and pushed the logic into their fs code, which can
DTRT with e.g. journal transactions, too.

> (3a) Also, past discussion of this stuff has centered around trying to
> identify a single coherent interface for fs code to use, with the
> expansion into whatever hardware semantics are available happening in
> the bufferio layer. This would prevent needing conditional logic on
> device features in every fs. However, AFAICR these discussions have
> never reached any clear conclusion. Do you have any opinion on that?

I think that I'd like to have at least two different places in the kernel
needing a particular interface before generalizing this into a bufferio
level. Or at minimum, I'd like to have it working in one place
correctly, and then it can be generalized before using it in a second
place. It would be awesome to use FUA e.g. for fsync(2), but let's not
get too ahead of ourselves.

We aren't committing to much right now besides a B_* flag. I'd rather
keep this raw and lean for now, and concentrate on fixing the device
drivers to work with the flags correctly. Only then maybe come up with
an interface to make it easier for general use.

I want to avoid broadening the scope too much. Especially since I want
to introduce SATA NCQ support within next few months, which might need
some tweaks to the semantics again.

> We don't want to block improvements to wapbl while we figure out the
> one true device interface, but on the other hand I'd rather not
> acquire a new set of long-term hacks. Stuff like the "logic" wapbl
> uses to intercept the synchronous writes issued by the FFS code is
> very expensive to get rid of later.

Yes, that funny bwrite() not being a real bwrite() until issued for
the second time from WAPBL :) Quite ugly. It's a shame the B_LOCKED hack
is not really extensible to also cover the data in the journal, as it
holds all transaction data in memory.

Jaromir


Exposing FUA as alternative to DIOCCACHESYNC for WAPBL

2017-03-01 Thread Jaromír Doleček
Hi,

I'm working on an interface for WAPBL to use Force Unit Access (FUA)
feature on compatible hardware (currently SCSI and NVMe), as a
replacement to full disk cache flushes. I'd also like to add support
for DPO (Disable Page Out), as that is trivial extension of FUA
support at least for SCSI.

Scope is currently limited to I/O via the traditional buffer cache
(metadata only) interface for now, as that is all that is needed for
the WAPBL case. First support direct use over sd(4)/ld(4), then fix the
layered drivers like dk(4), cgd(4) and eventually raid(4).

In order to be a reliable cache flush replacement, the FUA flag needs
to be exposed in a way that it's used only when the underlying driver
and hardware supports it. When it's specified, the driver needs to
honor it.

For DPO, it should be safe to simply ignore the flag when not
supported, because it's just disk cache optimization.

Since I wasn't able to spot any specific support for this in FreeBSD
or OpenBSD, I've come up with a new, rather simplistic kernel change
to expose the feature. It basically relies on the caller to DTRT.

Code extends DIOCGCACHE to return also information about FUA/DPO
support by the underlying hardware. Then, I added new flags for buffer
cache I/O, and modified the drivers to pass the flags to the hardware
when present in the struct buf.

It's supposed to be used in a way where the caller checks for FUA
support via DIOCGCACHE once at the beginning, and then issues any
further I/O requests with the FUA flag when DIOCGCACHE indicated FUA is
supported. This way there is no need to modify drivers to refuse the
flag when it's not supported. It assumes that if DIOCGCACHE indicates
support, then the I/O will reach the same hardware also with the
struct buf flag kept.
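
For instance, the caller-side check could look like this - a sketch;
DKCACHE_FUA is the new bit from the patch below, the rest is existing
API:

#include <sys/ioctl.h>
#include <sys/dkio.h>
#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* query the disk cache flags once, up front */
int
main(int argc, char **argv)
{
	int fd, dkcache = 0;

	if (argc != 2)
		errx(1, "usage: %s /dev/rsd0d", argv[0]);
	if ((fd = open(argv[1], O_RDONLY)) == -1)
		err(1, "open");
	if (ioctl(fd, DIOCGCACHE, &dkcache) == -1)
		err(1, "DIOCGCACHE");
	printf("write cache %s, FUA %ssupported\n",
	    (dkcache & DKCACHE_WRITE) ? "enabled" : "disabled",
	    (dkcache & DKCACHE_FUA) ? "" : "not ");	/* proposed bit */
	close(fd);
	return 0;
}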

So far all drivers seem to keep struct buf b_flags when splitting or
processing the I/O buffer, i.e. cgd(4), dk(4), raid(4) all seem to do
it. This needs to be tested of course, and we need to fix any disk
pseudo devices which don't pass the flags intact.

Attached is the patch. It's really small and compile-tested only at
the moment. The main goal now is to get feedback on whether this
approach is sufficient and suitable, or whether we need something
completely different or more complex.

Comments?

Jaromir
Index: sys/buf.h
===
RCS file: /cvsroot/src/sys/sys/buf.h,v
retrieving revision 1.126
diff -u -p -r1.126 buf.h
--- sys/buf.h   26 Dec 2016 23:12:33 -  1.126
+++ sys/buf.h   1 Mar 2017 21:19:41 -
@@ -198,11 +198,13 @@ struct buf {
 #defineB_RAW   0x0008  /* Set by physio for raw 
transfers. */
 #defineB_READ  0x0010  /* Read buffer. */
 #defineB_DEVPRIVATE0x0200  /* Device driver private flag. 
*/
+#defineB_FUA   0x0800  /* Force Unit Access flag 
(mandatory). */
+#defineB_DPO   0x1000  /* Disable Page Out flag 
(advisory). */
 
 #define BUF_FLAGBITS \
 "\20\1AGE\3ASYNC\4BAD\5BUSY\10DELWRI" \
 "\12DONE\13COWDONE\15GATHERED\16INVAL\17LOCKED\20NOCACHE" \
-"\23PHYS\24RAW\25READ\32DEVPRIVATE\33VFLUSH"
+"\23PHYS\24RAW\25READ\32DEVPRIVATE\33VFLUSH\34FUA\35DPO"
 
 /* Avoid weird code due to B_WRITE being a "pseudo flag" */
 #define BUF_ISREAD(bp) (((bp)->b_flags & B_READ) == B_READ)
Index: sys/dkio.h
===
RCS file: /cvsroot/src/sys/sys/dkio.h,v
retrieving revision 1.22
diff -u -p -r1.22 dkio.h
--- sys/dkio.h  8 Dec 2015 20:36:15 -   1.22
+++ sys/dkio.h  1 Mar 2017 21:19:41 -
@@ -85,6 +85,8 @@
 #defineDKCACHE_RCHANGE 0x000100 /* read enable is changeable */
 #defineDKCACHE_WCHANGE 0x000200 /* write enable is changeable */
 #defineDKCACHE_SAVE0x01 /* cache parameters are savable/save 
them */
+#defineDKCACHE_FUA 0x02 /* Force Unit Access supported */
+#defineDKCACHE_DPO 0x04 /* Disable Page Out supported */
 
/* sync disk cache */
 #defineDIOCCACHESYNC   _IOW('d', 118, int) /* sync cache (force?) 
*/
Index: dev/ic/ld_nvme.c
===
RCS file: /cvsroot/src/sys/dev/ic/ld_nvme.c,v
retrieving revision 1.14
diff -u -p -r1.14 ld_nvme.c
--- dev/ic/ld_nvme.c28 Feb 2017 20:55:09 -  1.14
+++ dev/ic/ld_nvme.c1 Mar 2017 21:19:41 -
@@ -152,11 +152,15 @@ static int
 ld_nvme_start(struct ld_softc *ld, struct buf *bp)
 {
struct ld_nvme_softc *sc = device_private(ld->sc_dv);
+   int flags = BUF_ISWRITE(bp) ? 0 : NVME_NS_CTX_F_READ;
+
+   if (bp->b_flags & B_FUA)
+   flags |= NVME_NS_CTX_F_FUA;
 
return nvme_ns_dobio(sc->sc_nvme, sc->sc_nsid, sc,
bp, bp->b_data, bp->b_bcount,
sc->sc_ld.sc_secsize, bp->b_rawblkno,
-   BUF_ISWRITE(bp) ? 0 : NVME_NS_CTX_F_READ,
+   flags,
ld_nvme_biodone)

Re: Plan: journalling fixes for WAPBL

2017-01-02 Thread Jaromír Doleček
2017-01-02 18:31 GMT+01:00 David Holland :
> Well, there's two things going on here. One is the parallelism limit,
> which you can work around by having more drives, e.g. in an array.
> The other is netbsd's 64k MAXPHYS issue, which is our own problem that
> we could put time into. (And in fact, it would be nice to get
> tls-maxphys into -8... anyone have time? I don't. sigh)

It would be very nice to have this integrated, yes. It won't have
a dramatic performance effect, but it's relatively low-hanging fruit,
it's already almost done and we should just get rid of this arbitrary
system limit. I'd like to look into this, but I won't manage sooner
than autumn 2017 - I'd like to first work on FUA/DPO support, and then
SATA NCQ. I think those could have a bigger performance impact.

I hope I'll have some patches to make FUA available for I/O on SCSI
drives by the second half of February, plus the changes for WAPBL to
use it. It's slightly difficult to DTRT with FUA on nested drivers like
raidframe/cgd/vnd, but maybe we can ignore those for the first
iteration :) I'll send a proposal once I figure out the details.

Jaromir


Re: Possible buffer cache race?

2016-10-30 Thread Jaromír Doleček
2016-10-24 19:34 GMT+02:00 David Holland :
> Is this correlated with the syncer running? I have been seeing a
> problem where every time the syncer runs it locks out everything else,
> and once that happens some things take several seconds to complete
> afterwards.

Doesn't seem to me to be just the syncer being slow - it behaves like a
regular deadlock. If I just wait, nothing happens - it does not recover
by itself. Actually, when I waited too long (a couple of minutes), the
sync command itself blocked and the system didn't recover.

> The tstile ones are unlikely to be interesting (if you have a bunch of
> tstiles, pick one that isn't, which'll be what they're waiting behind)
> but the others might be. Knowing where it gets stuck will help a lot;
> or if it doesn't get stuck in any one place that too.

Got these example tracebacks:

#1
trace: pid 540 lid 1 at 0xfe90f090fb10
sleepq_block() at sleepq_block+0x97
cv_timedwait() at cv_timedwait+0x123
bbusy() at bbusy+0xad
getblk() at getblk+0x5a
bio_doread() at bio_doread+0x1d
bread() at bread+0x1a
ffs_update.part.3() at ffs_update.part.3+0x112
ufs_inactive() at ufs_inactive+0x18e
VOP_INACTIVE() at VOP_INACTIVE+0x33
vrelel() at vrelel+0x259
vn_close() at vn_close+0x41
closef() at closef+0x54
fd_close() at fd_close+0x2da
sys_close() at sys_close+0x20
syscall() at syscall+0x164

#2
trace: pid 222 lid 1 at 0xfe90f0927c20
sleepq_block() at sleepq_block+0x97
turnstile_block() at turnstile_block+0x3a9
mutex_enter() at mutex_enter+0x36c
genfs_renamelock_enter() at genfs_renamelock_enter+0xd
do_sys_renameat() at do_sys_renameat+0x478
sys_rename() at sys_rename+0x25
syscall() at syscall+0x164

#3
trace: pid 697 lid 1 at 0xfe90f089f850
sleepq_block() at sleepq_block+0x97
cv_timedwait() at cv_timedwait+0x123
bbusy() at bbusy+0xad
getblk() at getblk+0x5a
bio_doread() at bio_doread+0x1d
bread() at bread+0x1a
ffs_update.part.3() at ffs_update.part.3+0x112
ufs_inactive() at ufs_inactive+0x18e
VOP_INACTIVE() at VOP_INACTIVE+0x33
vrelel() at vrelel+0x259
genfs_rename_exit() at genfs_rename_exit+0x10a
genfs_sane_rename() at genfs_sane_rename+0x7a3
ufs_sane_rename() at ufs_sane_rename+0x3a
genfs_insane_rename() at genfs_insane_rename+0x15d
VOP_RENAME() at VOP_RENAME+0x43
do_sys_renameat() at do_sys_renameat+0x77f
sys_rename() at sys_rename+0x25
syscall() at syscall+0x164

Those then seem to be for inode updates. I've spent some time trying
to trigger this locally, so far without success.

Once I get the remote machine reset again, I'll try to trigger the
condition using an ATA disk - I want to rule out a driver bug.

> The most common cause of missed wakeups is not going to sleep
> atomically, so it's usually a property of a particular sleep site.
> Next most common is races between changing things affecting the sleep
> condition and posting a wakeup; in particular, unlocked wakeup before
> change (or unlocked wakeup after change without a memory barrier to
> enforce the ordering) leads to problems.

Seems this is more likely an async mount issue, not so much concurrency -
similar symptoms are described in PR kern/47030 for a sysinst stall.
Sysinst also mounts the new filesystem async, and the condition there
is observed on a VM with 1 CPU. I didn't manage to repeat this with qemu
2.7.0, but I'll check later with the qemu0 package.

The traceback in that PR is slightly different though (it stalls
waiting for a new buf), so that can actually be a different issue.

Jaromir


Re: WAPBL fix for deallocation exhaustion + slow file removal

2016-10-28 Thread Jaromír Doleček
The fix was committed, with only minor changes (some comments, and a
fixed mishandling of the error return value in ffs_indirtrunc()).

Jaromir

2016-10-06 22:36 GMT+02:00 Jaromír Doleček :
> I've incorporated the mutex fix, here is the final patch relative to
> trunk. I'd like to commit this sometime next week.
>
> Jaromir
>
> 2016-10-01 19:00 GMT+02:00 Taylor R Campbell
> :
>>Date: Sat, 1 Oct 2016 18:40:31 +0200
>>From: Jaromír Doleček
>>
>>> Thanks for taking a shot at this!  But I think it needs a little more
>>> time for review -- certainly I can't digest it in the 24 hours you're
>>> giving.
>>
>>Sure.
>>
>> OK, thanks!  I'll try to find some time to review this more carefully
>> in the next few days.  For now, one comment:
>>
>>> From a quick glance at the patch, I see one bug immediately in
>>> vfs_wapbl.c that must have been introduced in a recent change:
>>> pool_get(PR_WAITOK) is forbidden while holding a mutex, but
>>> wapbl_register_deallocation does just that.
>>
>>I'll recheck this, thank you. I run it under rump only, I think I had
>>it compiled LOCKDEBUG, so should trigger the same assertions as
>>LOCKDEBUG kernel. But apparently not.
>>
>> Since wl->wl_mtx is an adaptive lock, not a spin lock, nothing
>> automatically detects this.  But in general, with minor exceptions,
>> one is not supposed to sleep while holding an adaptive lock.
>> (Exception example: waiting for a cheap cross-call to complete in
>> pserialize_perform(9).)
>>
>> It's also suboptimal that we sleep while holding rwlocks for vnode
>> locks, since rw_enter is uninterruptable, so if a wait inside the
>> kernel hangs with a vnode lock held, anyone else trying to examine
>> that vnode will hang uninterruptably too.  But it will take some
>> effort to undo that state of affairs by teaching all callers of
>> vn_lock to accept failure.
>>
>> The same goes for the the long-term file system transaction lock
>> wl->wl_rwlock.  However, suboptimality of sleeping with those
>> long-term locks aside, certainly one should not sleep while holding
>> the short-term wl->wl_mtx.


Re: PCIVERBOSE causing kernel stack overflow during boot - why?

2016-10-26 Thread Jaromír Doleček
In my case it crashed on the same device, Core i7-6xxxK/Xeon-D Memory
Controller (Target Address, Thermal, RAS) ID 0x6fa8. The last pci line
before the trap was for the device immediately preceding that one.

Thanks Paul for getting to the bottom of this.

Jaromir

2016-10-25 7:02 GMT+02:00 Paul Goyette :
> OK, here's the problem...
>
> The device in question is Intel product code 6fa8 which has a product name
> of (deliberately line-split to facilitate character counting)
>
> "Core i7-6xxxK/Xeon-D Memory Cont"
> "roller (Target Address, Thermal,"
> " RAS)"
>
> That's a total of 65 characters plus trailing '\0'.
>
> Unfortunately pci/pci_verbose.h defines PCI_PRODUCTSTR_LEN to 64 (which
> would include the trailing '\0' as well as the space after the final word -
> this final space is removed in dev_untokenstring()), so this device's name
> is too long.
>
> It's still not clear to me why this is triggering SSP checks, but I think we
> can sweep the problem under the rug if we simply increase the value of
> PCI_PRODUCTSTR_LEN to a larger value.
>
> Index: pci_verbose.h
> ===
> RCS file: /cvsroot/src/sys/dev/pci/pci_verbose.h,v
> retrieving revision 1.4
> diff -u -p -r1.4 pci_verbose.h
> --- pci_verbose.h   21 Sep 2014 14:30:22 -  1.4
> +++ pci_verbose.h   25 Oct 2016 04:58:08 -
> @@ -36,7 +36,7 @@ typedef uint32_t pcireg_t;/* configura
>  #endif
>
>  #definePCI_VENDORSTR_LEN   64
> -#definePCI_PRODUCTSTR_LEN  64
> +#definePCI_PRODUCTSTR_LEN  96
>
>  DEV_VERBOSE_DECLARE(pci);
>
>
>
>
>
>> On Sun, 23 Oct 2016, Jaromír Doleček wrote:
>
>> Here is the output from lspci/pcictl.
>>
>> I'll try that DDB_COMMANDONENTER also - the machine is remote though,
>> so I'll send it later when I get it.
>>
>> Thanks.
>>
>> Jaromir
>>
>> 2016-10-19 7:23 GMT+02:00 Paul Goyette :
>>>
>>> On Tue, 18 Oct 2016, Paul Goyette wrote:
>>>
 Just as an added experiment, can you try to boot the non-PCIVERBOSE
 kernel, and at the boot prompt enter

 load pciverbose

 before actually booting?

 As far as getting a back-trace, you could set DDB_COMMANDONENTER="bt" in
 your config file 

 The dmesg looks interesting, especially with that strange pci9 bus.  My
 machine has a similar "management devices" pci bus.
>>>
>>>
>>>
>>> Also, if you have installed pkgsrc/sysutils/pciutils it would be useful
>>> to
>>> get the output from
>>>
>>> lspci -tvnn
>>>
>>> Otherwise, please provide output from following two commands:
>>>
>>> pcictl pci0 list -N
>>> pcictl pci0 list -n
>>>
>>>
>>>



 On Mon, 17 Oct 2016, Jaromír Doleček wrote:

> Hi,
>
> I've got an amd64 system which panics with 'stack overflow detected'
> on boot, somewhere halfway through probing pci9 bus, when booted with
> kernel with PCIVERBOSE. Same kernel config without PCIVERBOSE boots
> fine. dmesg without PCIVERBOSE is attached.
>
> Any idea what might be causing this?
>
> I've had cursory look at pci code, it doesn't seem as if anything
> would be allocating extra space there. Maybe some interaction with
> dev_verbose module code? Unfortunately can't get backtrace as this
> happens before the keyboard is probed and attached.
>
> Jaromir
>
>
>
>

 +--+--++
 | Paul Goyette | PGP Key fingerprint: | E-mail addresses:  |
 | (Retired)| FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com   |
 | Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd.org |
 +--+--++
>>>
>>>
>>>
>>> +--+--++
>>> | Paul Goyette | PGP Key fingerprint: | E-mail addresses:  |
>>> | (Retired)| FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com   |
>>> | Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd.org |
>>> +--+--++
>>
>>
>>
>>
>
> +--+--++
> | Paul Goyette | PGP Key fingerprint: | E-mail addresses:  |
> | (Retired)| FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com   |
> | Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd.org |
> +--+--++



Re: PCIVERBOSE causing kernel stack overflow during boot - why?

2016-10-23 Thread Jaromír Doleček
Here is the output from lspci/pcictl.

I'll try that DDB_COMMANDONENTER also - the machine is remote though,
so I'll send it later when I get it.

Thanks.

Jaromir

2016-10-19 7:23 GMT+02:00 Paul Goyette :
> On Tue, 18 Oct 2016, Paul Goyette wrote:
>
>> Just as an added experiment, can you try to boot the non-PCIVERBOSE
>> kernel, and at the boot prompt enter
>>
>> load pciverbose
>>
>> before actually booting?
>>
>> As far as getting a back-trace, you could set DDB_COMMANDONENTER="bt" in
>> your config file 
>>
>> The dmesg looks interesting, especially with that strange pci9 bus.  My
>> machine has a similar "management devices" pci bus.
>
>
> Also, if you have installed pkgsrc/sysutils/pciutils it would be useful to
> get the output from
>
> lspci -tvnn
>
> Otherwise, please provide output from following two commands:
>
> pcictl pci0 list -N
> pcictl pci0 list -n
>
>
>
>>
>>
>>
>> On Mon, 17 Oct 2016, Jaromír Doleček wrote:
>>
>>> Hi,
>>>
>>> I've got an amd64 system which panics with 'stack overflow detected'
>>> on boot, somewhere halfway through probing pci9 bus, when booted with
>>> kernel with PCIVERBOSE. Same kernel config without PCIVERBOSE boots
>>> fine. dmesg without PCIVERBOSE is attached.
>>>
>>> Any idea what might be causing this?
>>>
>>> I've had cursory look at pci code, it doesn't seem as if anything
>>> would be allocating extra space there. Maybe some interaction with
>>> dev_verbose module code? Unfortunately can't get backtrace as this
>>> happens before the keyboard is probed and attached.
>>>
>>> Jaromir
>>>
>>>
>>>
>>
>> +--+--++
>> | Paul Goyette | PGP Key fingerprint: | E-mail addresses:  |
>> | (Retired)| FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com   |
>> | Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd.org |
>> +--+--++
>
>
> +--+--++
> | Paul Goyette | PGP Key fingerprint: | E-mail addresses:  |
> | (Retired)| FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com   |
> | Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd.org |
> +--+--++
-[:00]-+-00.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D 
DMI2 [8086:6f00]
   +-01.0-[01]00.0  LSI Logic / Symbios Logic MegaRAID SAS-3 3108 
[Invader] [1000:005d]
   +-02.0-[02]--+-00.0  Intel Corporation Ethernet Controller 
10-Gigabit X540-AT2 [8086:1528]
   |\-00.1  Intel Corporation Ethernet Controller 
10-Gigabit X540-AT2 [8086:1528]
   +-02.2-[03]00.0  Intel Corporation PCIe Data Center SSD 
[8086:0953]
   +-03.0-[04]--
   +-04.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D 
Crystal Beach DMA Channel 0 [8086:6f20]
   +-04.1  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D 
Crystal Beach DMA Channel 1 [8086:6f21]
   +-04.2  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D 
Crystal Beach DMA Channel 2 [8086:6f22]
   +-04.3  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D 
Crystal Beach DMA Channel 3 [8086:6f23]
   +-04.4  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D 
Crystal Beach DMA Channel 4 [8086:6f24]
   +-04.5  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D 
Crystal Beach DMA Channel 5 [8086:6f25]
   +-04.6  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D 
Crystal Beach DMA Channel 6 [8086:6f26]
   +-04.7  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D 
Crystal Beach DMA Channel 7 [8086:6f27]
   +-05.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D 
Map/VTd_Misc/System Management [8086:6f28]
   +-05.1  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D 
IIO Hot Plug [8086:6f29]
   +-05.2  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D 
IIO RAS/Control Status/Global Errors [8086:6f2a]
   +-05.4  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D 
I/O APIC [8086:6f2c]
   +-11.0  Intel Corporation C610/X99 series chipset SPSR [8086:8d7c]
   +-14.0  Intel Corporation C610/X99 series chipset USB xHCI Host 
Controller [8086:8d31]
   +-16.0  Intel Corporation C610/X99 series chipset MEI Controller #1 
[8086:8d3a]
   +-16.1  Intel Corporation C610/X99 series chipset MEI Controller #2 
[8086:8d3b]
   +-1a.0  Intel Corporation C610/X99 series chipset USB Enhanced Host 
Controller #2 [8086:8d2d]
   +-1c.0-[05-06]00.0-[06]00.0  ASPEED Technology, Inc. ASPEED 
Graphics Family [1a03:2000]
   +-1c.2-[07]00.0  Intel Corporation I210 Gigabit Network 
Connection [8086:1533]
   +-1c.3-[

Possible buffer cache race?

2016-10-23 Thread Jaromír Doleček
Hi,

I'm doing some testing with nvme(4) with quite deep I/O queues with a
-current kernel on an MP system. At this moment I'm basically just
confirming functionality, so it's just a bunch of parallel tars
extracting files within a filesystem to a subdirectory.

I have the filesystem mounted async and the machine has a huge amount of
RAM, without logging at the moment. So it's mostly a buffer cache
exercise, with I/O spikes on sync.

I see an interesting thing - periodically, all of the tar processes get
blocked, sleeping on either tstile, biolock or pager_map. When I just
wait they stay blocked. When I call sync(8), all of them unblock and
continue running, until they again all hit the same condition later.
When I keep calling sync, eventually all processes finish.

Most often they block on biolock, then somewhat less frequently on
tstile; pager_map is more rare. It's usually a mix of these - most
processes block on biolock, some on tstile and zero/one/two on
pager_map.

I understand "biolock" is used to wait for a busy buf. I looked into the
code; nothing really obviously wrong there. The code uses
cv_broadcast(&bp->b_busy) and cv_timedwait(), so there shouldn't be a
missed-wakeup race.
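
For reference, the wait side looks roughly like this (a simplified
sketch modeled on bbusy() in kern/vfs_bio.c, not the literal code;
error/timeout handling trimmed):

	int error = 0;

	mutex_enter(&bufcache_lock);
	while (ISSET(bp->b_cflags, BC_BUSY)) {
		SET(bp->b_cflags, BC_WANTED);
		/* sleep atomically, with bufcache_lock as the interlock */
		error = cv_timedwait(&bp->b_busy, &bufcache_lock, timo);
		if (error != 0)
			break;
	}
	if (error == 0)
		SET(bp->b_cflags, BC_BUSY);	/* we own the buffer now */
	mutex_exit(&bufcache_lock);

The interlocked sleep is what should rule out the classic lost-wakeup
scenario here.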

Any idea where I should try to start poking?

Jaromir


PCIVERBOSE causing kernel stack overflow during boot - why?

2016-10-17 Thread Jaromír Doleček
Hi,

I've got an amd64 system which panics with 'stack overflow detected'
on boot, somewhere halfway through probing pci9 bus, when booted with
kernel with PCIVERBOSE. Same kernel config without PCIVERBOSE boots
fine. dmesg without PCIVERBOSE is attached.

Any idea what might be causing this?

I've had a cursory look at the pci code; it doesn't seem as if anything
would be allocating extra space there. Maybe some interaction with the
dev_verbose module code? Unfortunately I can't get a backtrace, as this
happens before the keyboard is probed and attached.

Jaromir
Copyright (c) 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005,
2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016
The NetBSD Foundation, Inc.  All rights reserved.
Copyright (c) 1982, 1986, 1989, 1991, 1993
The Regents of the University of California.  All rights reserved.

NetBSD 7.99.39 (BEAST) #8: Mon Oct 17 22:17:26 CEST 2016
dolecek@.../netbsd/sys/arch/amd64/compile/BEAST
total memory = 127 GB
avail memory = 124 GB
cpu_rng: RDSEED
timecounter: Timecounters tick every 10.000 msec
Kernelized RAIDframe activated
timecounter: Timecounter "i8254" frequency 1193182 Hz quality 100
Supermicro SYS-1028R-MCTR (0123456789)
mainbus0 (root)
ACPI: RSDP 0x000F0580 24 (v02 SUPERM)
ACPI: XSDT 0x797330A8 CC (v01 01072009 AMI  
00010013)
ACPI: FACP 0x79763EB8 00010C (v05 SUPERM SMCI--MB 01072009 AMI  
00010013)
ACPI: DSDT 0x79733208 030CB0 (v02 SUPERM SMCI--MB 01072009 INTL 
20091013)
ACPI: FACS 0x79BDEF80 40
ACPI: APIC 0x79763FC8 0001E0 (v03 SUPERM SMCI--MB 01072009 AMI  
00010013)
ACPI: FPDT 0x797641A8 44 (v01 SUPERM SMCI--MB 01072009 AMI  
00010013)
ACPI: FIDT 0x797641F0 9C (v01 SUPERM SMCI--MB 01072009 AMI  
00010013)
ACPI: SPMI 0x79764290 40 (v05 SUPERM SMCI--MB  AMI. 
)
ACPI: MCFG 0x797642D0 3C (v01 SUPERM SMCI--MB 01072009 MSFT 
0097)
ACPI: UEFI 0x79764310 42 (v01 SUPERM SMCI--MB 01072009  
)
ACPI: HPET 0x79764358 38 (v01 SUPERM SMCI--MB 0001 INTL 
20091013)
ACPI: MSCT 0x79764390 90 (v01 SUPERM SMCI--MB 0001 INTL 
20091013)
ACPI: SLIT 0x79764420 2D (v01 SUPERM SMCI--MB 0001 INTL 
20091013)
ACPI: SRAT 0x79764450 001158 (v03 SUPERM SMCI--MB 0001 INTL 
20091013)
ACPI: WDDT 0x797655A8 40 (v01 SUPERM SMCI--MB  INTL 
20091013)
ACPI: SSDT 0x797655E8 016F57 (v02 SUPERM PmMgt0001 INTL 
20120913)
ACPI: SSDT 0x7977C540 00264C (v02 SUPERM SpsNm0002 INTL 
20120913)
ACPI: SSDT 0x7977EB90 64 (v02 SUPERM SpsNvs   0002 INTL 
20120913)
ACPI: PRAD 0x7977EBF8 000102 (v02 SUPERM SMCI--MB 0002 INTL 
20120913)
ACPI: DMAR 0x7977ED00 C4 (v01 SUPERM SMCI--MB 0001 INTL 
20091013)
ACPI: HEST 0x7977EDC8 00027C (v01 SUPERM SMCI--MB 0001 INTL 
0001)
ACPI: BERT 0x7977F048 30 (v01 SUPERM SMCI--MB 0001 INTL 
0001)
ACPI: ERST 0x7977F078 000230 (v01 SUPERM SMCI--MB 0001 INTL 
0001)
ACPI: EINJ 0x7977F2A8 000130 (v01 SUPERM SMCI--MB 0001 INTL 
0001)
ACPI: 4 ACPI AML tables successfully acquired and loaded

ioapic0 at mainbus0 apid 1: pa 0xfec0, version 0x20, 24 pins
ioapic1 at mainbus0 apid 2: pa 0xfec01000, version 0x20, 24 pins
cpu0 at mainbus0 apid 0
cpu0: Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz, id 0x406f1
cpu1 at mainbus0 apid 2
cpu1: Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz, id 0x406f1
cpu2 at mainbus0 apid 4
cpu2: Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz, id 0x406f1
cpu3 at mainbus0 apid 6
cpu3: Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz, id 0x406f1
cpu4 at mainbus0 apid 8
cpu4: Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz, id 0x406f1
cpu5 at mainbus0 apid 10
cpu5: Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz, id 0x406f1
cpu6 at mainbus0 apid 12
cpu6: Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz, id 0x406f1
cpu7 at mainbus0 apid 16
cpu7: Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz, id 0x406f1
cpu8 at mainbus0 apid 18
cpu8: Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz, id 0x406f1
cpu9 at mainbus0 apid 20
cpu9: Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz, id 0x406f1
cpu10 at mainbus0 apid 22
cpu10: Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz, id 0x406f1
cpu11 at mainbus0 apid 24
cpu11: Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz, id 0x406f1
cpu12 at mainbus0 apid 26
cpu12: Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz, id 0x406f1
cpu13 at mainbus0 apid 28
cpu13: Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz, id 0x406f1
cpu14 at mainbus0 apid 1
cpu14: Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz, id 0x406f1
cpu15 at mainbus0 apid 3
cpu15: Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz, id 0x406f1
cpu16 at mainbus0 apid 5
cpu16: Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz, id 0x406f1
cpu17 at mainbus0 apid 7
cpu17: Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz, id 0x406f1
cpu18 at mainbus0 apid 9
cpu18: Inte

Re: WAPBL fix for deallocation exhaustion + slow file removal

2016-10-06 Thread Jaromír Doleček
I've incorporated the mutex fix, here is the final patch relative to
trunk. I'd like to commit this sometime next week.

Jaromir

2016-10-01 19:00 GMT+02:00 Taylor R Campbell
:
>Date: Sat, 1 Oct 2016 18:40:31 +0200
>From: Jaromír Doleček
>
>> Thanks for taking a shot at this!  But I think it needs a little more
>> time for review -- certainly I can't digest it in the 24 hours you're
>> giving.
>
>Sure.
>
> OK, thanks!  I'll try to find some time to review this more carefully
> in the next few days.  For now, one comment:
>
>> From a quick glance at the patch, I see one bug immediately in
>> vfs_wapbl.c that must have been introduced in a recent change:
>> pool_get(PR_WAITOK) is forbidden while holding a mutex, but
>> wapbl_register_deallocation does just that.
>
>I'll recheck this, thank you. I run it under rump only, I think I had
>it compiled LOCKDEBUG, so should trigger the same assertions as
>LOCKDEBUG kernel. But apparently not.
>
> Since wl->wl_mtx is an adaptive lock, not a spin lock, nothing
> automatically detects this.  But in general, with minor exceptions,
> one is not supposed to sleep while holding an adaptive lock.
> (Exception example: waiting for a cheap cross-call to complete in
> pserialize_perform(9).)
>
> It's also suboptimal that we sleep while holding rwlocks for vnode
> locks, since rw_enter is uninterruptable, so if a wait inside the
> kernel hangs with a vnode lock held, anyone else trying to examine
> that vnode will hang uninterruptably too.  But it will take some
> effort to undo that state of affairs by teaching all callers of
> vn_lock to accept failure.
>
> The same goes for the the long-term file system transaction lock
> wl->wl_rwlock.  However, suboptimality of sleeping with those
> long-term locks aside, certainly one should not sleep while holding
> the short-term wl->wl_mtx.


wapbl_fix_dealloc_2.patch
Description: Binary data


Re: WAPBL fix for deallocation exhaustion + slow file removal

2016-10-01 Thread Jaromír Doleček
> Thanks for taking a shot at this!  But I think it needs a little more
> time for review -- certainly I can't digest it in the 24 hours you're
> giving.

Sure.

> From a quick glance at the patch, I see one bug immediately in
> vfs_wapbl.c that must have been introduced in a recent change:
> pool_get(PR_WAITOK) is forbidden while holding a mutex, but
> wapbl_register_deallocation does just that.

I'll recheck this, thank you. I run it under rump only; I think I had
it compiled with LOCKDEBUG, so it should trigger the same assertions as
a LOCKDEBUG kernel. But apparently not.

> What happens if wapbl_register_deallocation fails in the middle of
> ffs_truncate, and then the power goes out before the caller retries
> the truncation?  It looks like we may end up committing a transaction
> to disk that partially zeros a file -- and it is not immediately clear
> to me who will be responsible for freeing the remaining blocks when
> the system comes back up.

The indirect disk blocks go via the journal also, so if power fails
while the journal is not yet committed, it would work just the same as
usual, i.e. no change would be visible on the filesystem. If the
truncate fails once, this partial truncate would be committed; if power
then fails before the truncate is completely finished, the log replay
would replay just the first part, creating the hole.

> I don't understand why the possible failure in the fragment extension
> code should not be worrying.  `Only 7 times during a file's lifetime'
> says to me that if I check out and delete the NetBSD CVS repository
> once, there are about two million times this can happen.  If we can't
> restart the transaction safely, we need to find another way to do it.

The fragment deallocation is only registered when a file is extended
from a size less than a full block, and only when there is no free block
available just next to the fragment. This can't happen more than once
per data write operation, hence once per each
wapbl_begin()/wapbl_end() block.

The two million file creates are all in different vnode calls, each
call checking the limits in wapbl_start() and flushing if necessary.
It would be extremely difficult to trigger a transaction overflow there.

The deletions won't cause any problem, since that path uses the !force
variant, so it can't overflow the transaction.

> Can you say anything about alternatives you might have considered that
> don't entail adding undo/restart logic to every user of UFS_TRUNCATE
> and why they were unfit?

Right now every caller of UFS_TRUNCATE() does it implicitly, via
ufs_truncate(). That function currently tries to avoid a big transaction
by truncating on an indirect block boundary with wapbl, but that fails
to account for bigger directories and is of course slow.

There is an alternative patch in kern/49175, which merely changes
ufs_truncate() to check the number of real blocks and adjust the
truncate size to fit the transaction within that limit. That patch
however locks the transaction exclusively in order to prevent parallel
operations from exhausting the dealloc limit, and I don't like how it
moves logic about allocated blocks away from ffs_alloc/ffs_inode.

I considered another alternative with dealloc pre-registration, also
avoiding the exclusive lock. But that would still require partial
truncate support similar to my patch in order to avoid a stall when
there are several parallel requests, so basically it doesn't bring any
extra value. Pre-registration also requires some accounting, hence
extra code, so it looked simpler to go without.

Jaromir


WAPBL fix for deallocation exhaustion + slow file removal

2016-10-01 Thread Jaromír Doleček
Hi,

the attached patch contains a fix for the WAPBL deallocation structure
exhaustion and panic (kern/47146), and avoids the need to do slow
partial truncates in a loop, fixing kern/49175.

The patch changes wapbl_register_deallocation() to fail with EAGAIN when
we run into the limit, and changes ffs_truncate() and ffs_indirtrunc()
to bail out early as well when this happens. Callers were updated to
either explicitly not care, or to loop with a transaction restart when
required by semantics (inactive, setattr); see the sketch below.
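
Roughly, the restart loop in such a caller has this shape - a sketch
only, not the committed code, with error/mp/vp/length/ioflag/cred as in
the surrounding vnode op:

	/*
	 * Sketch: full truncate with transaction restart on
	 * deallocation exhaustion, as done conceptually by the
	 * inactive/setattr paths.
	 */
	for (;;) {
		error = UFS_WAPBL_BEGIN(mp);
		if (error)
			break;
		error = ffs_truncate(vp, length, ioflag, cred);
		UFS_WAPBL_END(mp);
		if (error != EAGAIN)
			break;
		/*
		 * A partial truncate was registered and committed;
		 * retry to deallocate the rest in a new transaction.
		 */
	}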

There is one place in the code that really can't easily cope with
failure: the fragment extension code, when we deallocate the previous
fragment block. Since that particular condition is very rare (it happens
at most 7 times during a file's lifetime, and only when the filesystem
is very fragmented), I've just added a forced registration there.

There are some slight semantics changes. ffs_indirtrunc() now always
holds the block buf for the upper indirect node when descending, and the
block is written only when everything is done. Also, the code assumes
that nothing in ffs_truncate() triggers EAGAIN for any other purpose.
Besides this, the partial deallocation might leave holes in the file, as
the code deallocates the whole indirect blocks before the last partial
one. I think this is okay even for dirs, so this shouldn't cause a
problem.

I plan to commit this sometime tomorrow.

Comments welcome.

Jaromir


wapbl_fix_dealloc.patch
Description: Binary data


Re: Plan: journalling fixes for WAPBL

2016-09-28 Thread Jaromír Doleček
I think it's a fair assessment to say that on SATA with NCQ/31 tags (the
max is actually 31, not 32 tags), it's pretty much impossible to have
acceptable write performance without using the write cache. We could
never saturate even a drive with a 16MB cache with just 31 tags and 64k
maxphys - that's at most 31 x 64KiB = 1984KiB in flight. So it's IMO not
useful to design for a world without disk drive write cache.

Back to discussion about B_ORDERED:

As was said before, the SCSI ORDERED tag does precisely what we want for
the journal commit record - it forces all previous commands sent to the
controller to finish before the one with the ORDERED tag is processed,
and any commands sent after the ORDERED-tagged one are executed only
after the ordered command is finished. No need for any bufq magic there,
which is wonderful. Too bad that NCQ doesn't provide this.

That said, we still need to be sure that all the previous commands
were sent prior to pushing the ORDERED command to the SCSI controller. Are
there any SCSI controllers with multiple submission queues (like
NVMe), regardless of our scsipi layer MP limitations?

FWIW AHCI is single-threaded by design; every command submission has
to write to the same set of registers.

Jaromir

2016-09-23 19:51 GMT+02:00 Manuel Bouyer :
> On Fri, Sep 23, 2016 at 01:46:09PM -0400, Thor Lancelot Simon wrote:
>> > > This seems like the key thing needed to avoid FUA: to implement fsync() 
>> > > you just wait for notifications of completion to be received, and once 
>> > > you have those for all requests pending when fsync was called, or 
>> > > started as part of the fsync, then you're done.
>> >
>> > *if you have the write cache disabled*
>>
>> *Running with the write cache enabled is a bad idea*
>
> On ATA devices, you can't permanently disable the write cache. You have
> to do it on every power cycles.
>
> Well this really needs to be carefully evaluated. With only 32 tags I'm not
> sure you can efficiently use recent devices with the write cache
> disabled (most enterprise disks have a 64M cache these days)
>
> --
> Manuel Bouyer 
>  NetBSD: 26 ans d'experience feront toujours la difference
> --


Re: CVS commit: src/sys/arch

2016-09-23 Thread Jaromír Doleček
Hey Maxime,

It seems the KASSERTs() are too aggressive, or there is some other bug.

I can trigger the kassert by simply attaching to rump_ffs, setting a
breakpoint and continuing, i.e.:

> rump_ffs -o log ./ffs ./mnt
> gdb rump_ffs
...
(gdb) attach RUMP_PID
(gdb) break ffs_truncate
Breakpoint 1 at 0xad0b951f: file
/usr/home/dolecek/netbsd/sys/rump/fs/lib/libffs/../../../../ufs/ffs/ffs_inode.c,
line 210.
(gdb) cont
panic: kernel diagnostic assertion "onfault == kcopy_fault || rcr2() <
VM_MAXUSER_ADDRESS" failed: file "../../../../arch/i386/i386/trap.c",
line 358

Could you please look at it?

I'll disable the KASSERT() in my local tree, so that I'll be able to
develop. But it would be a good idea to check what gdb is doing that is
so special that it trips over this.

Thank you.

Jaromir

2016-09-16 13:48 GMT+02:00 Maxime Villard :
> Module Name:src
> Committed By:   maxv
> Date:   Fri Sep 16 11:48:10 UTC 2016
>
> Modified Files:
> src/sys/arch/amd64/amd64: trap.c
> src/sys/arch/i386/i386: trap.c
>
> Log Message:
> Put two KASSERTs, to make sure the fault is happening in the correct
> half of the vm space when using special copy functions. It can detect
> bugs where the kernel would fault when copying a kernel buffer which
> it wrongly believes comes from userland.
>
>
> To generate a diff of this commit:
> cvs rdiff -u -r1.84 -r1.85 src/sys/arch/amd64/amd64/trap.c
> cvs rdiff -u -r1.278 -r1.279 src/sys/arch/i386/i386/trap.c
>
> Please note that diffs are not public domain; they are subject to the
> copyright notices on the relevant files.
>

