Hi Robert,

Thanks for looking into this issue.

I tried MSI interrupt type on my own but it didn't work, but I will try
your patch again and then report back.

I've studied the nvme driver in Solaris 11.3, it seems they do the same
thing as Linux - MSI-X first, then MSI, finally FIXED, see attached file
for the assembly code and my comments. By the way, if I
set nvmex_enable_msi and nvmex_enable_msix to 0 (false) in /etc/system,
Solaris crashed immediately upon reboot.

MSI-X works well on our host with a minor issue:

In nvme_var.h, NVME_ADMIN_CMD_TIMEOUT is defined as 100000, i.e. 100ms I
think. It's too small. One of the INTEL NVMe SSDs took 366ms to execute GET
LOG PAGE command. Once I bumped up the value to 1000,000, nvme driver
happily attached all the 24 SSDs in many reboots.

Here comes new issues/lack of functionality:

- The INTEL drives are formatted to use 512 bytes block size, but they
advertise 4096 bytes block size as the best performing one (see the data in
issue report https://www.illumos.org/issues/7279). So far we don't have the
ability to perform such low level FORMAT and I had to install other OS such
as Linux and use their tool (nvme-cli).

- With NVMe SSD formatted to use 4096 block size, many things don't work.
It seems our blkdev driver never intended to support device with block size
larger than 512 bytes. I tried hacking bd_strategy() function to modify
bp->b_lblkno (in 512 bytes size, passed all the way down from zfs layer) to
be in 4096 block size, it appeared to be working for most I/O ops, but I
know it's just a hack and I will open a new thread discussing this
particular issue.

http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/io/blkdev/blkdev.c#1142
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/vdev_disk.c#783

- Using 512-bytes block size, I was able to create zpool, do filesystem
ops, everything appears to be great. But, once I do zpool scrub, it reports
checksum errors. It can be reproduced easily with one drive. I tried the
same think using the same drive in Solaris, it worked so I don't think it's
a hardware issue.

The checksum errors issue just bothered me, I don't know where to start
with. I've done some dtracing in nvme driver and found no error returned
for the read/write cmds.

Thanks,

-Youzhong




On Wed, Aug 10, 2016 at 10:41 PM, Robert Mustacchi <r...@joyent.com> wrote:

> On 8/4/16 11:25 , Youzhong Yang wrote:
> > Thanks for the input Robert.
> >
> > I believe the issue is now resolved by using MSI-X (instead of FIXED)
> > interrupt type inside nvme_init() for the admin queue.
> >
> > Here is the issue report I just filed:
> >
> > https://www.illumos.org/issues/7273
> >
> > I don't know why FIXED interrupt would cause issue, probably because we
> > have too many NVMe SSDs?
> 
> Following up on this aspect, I've put together the following:
> 
> https://us-east.manta.joyent.com/rmustacc/public/webrevs/7273/index.html
> 
> Youzhong, would it be possible for you to use this? We've had success
> with this with someone who was seeing issues. I'll hopefully get some
> time to look at the offlining issue, but it's still a bit out, sorry.
> 
> Hans, can you review this and take a look?
> 
> Thanks,
> Robert
> 

int nvmex_register_intrs(nvmex_t *);
int nvmex_setup_intr(nvmex_t *, int);

nvmex_register_intrs:           pushq  %rbp
nvmex_register_intrs+1:         movq   %rsp,%rbp
nvmex_register_intrs+4:         subq   $0x8,%rsp
nvmex_register_intrs+8:         movq   %rdi,-0x8(%rbp)
nvmex_register_intrs+0xc:       pushq  %rbx
nvmex_register_intrs+0xd:       pushq  %r12
nvmex_register_intrs+0xf:       pushq  %r13
nvmex_register_intrs+0x11:      pushq  %r14
nvmex_register_intrs+0x13:      subq   $0x18,%rsp
nvmex_register_intrs+0x17:      movq   %rdi,%rbx         <---- nvmex_t pointer
nvmex_register_intrs+0x1a:      movl   $0xffff,%eax
nvmex_register_intrs+0x1f:      movl   %eax,0x514(%rbx)  <---- int nv_intr_type 
of nvmex_t
nvmex_register_intrs+0x25:      movl   $0x1,%r12d
nvmex_register_intrs+0x2b:      movl   %r12d,0x51c(%rbx) <---- boolean_t 
nv_intr_numa
nvmex_register_intrs+0x32:      movq   (%rbx),%rdi       <---- dev_info_t 
*nv_dip 
nvmex_register_intrs+0x35:      leaq   -0x34(%rbp),%rsi  <---- int *typesp arg 
nvmex_register_intrs+0x39:      call   +0x3b42fae       
<ddi_intr_get_supported_types>
nvmex_register_intrs+0x3e:      testl  %eax,%eax
nvmex_register_intrs+0x40:      jne    +0x141   <nvmex_register_intrs+0x187>  
<--- in case of error
nvmex_register_intrs+0x46:      cmpl   $0x0,-0x37e1db75(%rip)   
<nvmex_enable_msix>
nvmex_register_intrs+0x4d:      je     +0x1b    <nvmex_register_intrs+0x6a>
nvmex_register_intrs+0x4f:      movl   -0x34(%rbp),%eax  <---- supported intr 
types
nvmex_register_intrs+0x52:      testl  $0x4,%eax         <---- 
DDI_INTR_TYPE_MSIX = 0x4
nvmex_register_intrs+0x57:      je     +0x11    <nvmex_register_intrs+0x6a>
nvmex_register_intrs+0x59:      movq   %rbx,%rdi
nvmex_register_intrs+0x5c:      movl   $0x4,%esi
nvmex_register_intrs+0x61:      call   -0x289e  <nvmex_setup_intr>
nvmex_register_intrs+0x66:      testl  %eax,%eax
nvmex_register_intrs+0x68:      je     +0x78    <nvmex_register_intrs+0xe2>
nvmex_register_intrs+0x6a:      cmpl   $0x0,-0x37e1db9d(%rip)   
<nvmex_enable_msi>
nvmex_register_intrs+0x71:      je     +0x1b    <nvmex_register_intrs+0x8e>
nvmex_register_intrs+0x73:      movl   -0x34(%rbp),%eax
nvmex_register_intrs+0x76:      testl  $0x2,%eax         <---- 
DDI_INTR_TYPE_MSI = 0x2
nvmex_register_intrs+0x7b:      je     +0x11    <nvmex_register_intrs+0x8e>
nvmex_register_intrs+0x7d:      movq   %rbx,%rdi
nvmex_register_intrs+0x80:      movl   $0x2,%esi
nvmex_register_intrs+0x85:      call   -0x28c2  <nvmex_setup_intr>
nvmex_register_intrs+0x8a:      testl  %eax,%eax
nvmex_register_intrs+0x8c:      je     +0x41    <nvmex_register_intrs+0xcf>
nvmex_register_intrs+0x8e:      movl   -0x34(%rbp),%eax
nvmex_register_intrs+0x91:      testl  $0x1,%eax         <---- 
DDI_INTR_TYPE_FIXED = 0x1
nvmex_register_intrs+0x96:      je     +0x11    <nvmex_register_intrs+0xa9>
nvmex_register_intrs+0x98:      movq   %rbx,%rdi
nvmex_register_intrs+0x9b:      movl   $0x1,%esi
nvmex_register_intrs+0xa0:      call   -0x28dd  <nvmex_setup_intr>
nvmex_register_intrs+0xa5:      testl  %eax,%eax
nvmex_register_intrs+0xa7:      je     +0x16    <nvmex_register_intrs+0xbf>
nvmex_register_intrs+0xa9:      cmpl   $0xffff,0x514(%rbx)
nvmex_register_intrs+0xb3:      jne    +0x3e    <nvmex_register_intrs+0xf3>
nvmex_register_intrs+0xb5:      movl   $-0x1,%eax       <0xffffffff>
nvmex_register_intrs+0xba:      jmp    +0xcd    <nvmex_register_intrs+0x18c>
nvmex_register_intrs+0xbf:      movl   %r12d,0x514(%rbx)
nvmex_register_intrs+0xc6:      movl   $0x0,-0x34(%rbp)
nvmex_register_intrs+0xcd:      jmp    +0x24    <nvmex_register_intrs+0xf3>
nvmex_register_intrs+0xcf:      movl   $0x2,0x514(%rbx)
nvmex_register_intrs+0xd9:      movl   $0x0,-0x34(%rbp)
nvmex_register_intrs+0xe0:      jmp    +0x11    <nvmex_register_intrs+0xf3>
nvmex_register_intrs+0xe2:      movl   $0x4,0x514(%rbx)
nvmex_register_intrs+0xec:      movl   $0x0,-0x34(%rbp)
nvmex_register_intrs+0xf3:      movq   0x4f8(%rbx),%rax
nvmex_register_intrs+0xfa:      movq   (%rax),%rdi
nvmex_register_intrs+0xfd:      leaq   0x520(%rbx),%rsi
nvmex_register_intrs+0x104:     call   +0x3b438cb       <ddi_intr_get_pri>
nvmex_register_intrs+0x109:     testl  %eax,%eax
nvmex_register_intrs+0x10b:     jne    +0x73    <nvmex_register_intrs+0x180>
nvmex_register_intrs+0x10d:     xorl   %r13d,%r13d
nvmex_register_intrs+0x110:     cmpl   $0x0,0x508(%rbx)
nvmex_register_intrs+0x117:     jle    +0x63    <nvmex_register_intrs+0x17c>
nvmex_register_intrs+0x119:     xorq   %r12,%r12
nvmex_register_intrs+0x11c:     xorq   %r14,%r14
nvmex_register_intrs+0x11f:     movq   0x4f8(%rbx),%rax
nvmex_register_intrs+0x126:     addq   %r14,%rax
nvmex_register_intrs+0x129:     movq   (%rax),%rdi
nvmex_register_intrs+0x12c:     leaq   -0x2fb7(%rip),%rsi       <nvmex_intr>
nvmex_register_intrs+0x133:     movq   %rbx,%rdx
nvmex_register_intrs+0x136:     movq   %r12,%rcx
nvmex_register_intrs+0x139:     call   +0x3b439ea       <ddi_intr_add_handler>
nvmex_register_intrs+0x13e:     testl  %eax,%eax
nvmex_register_intrs+0x140:     jne    +0x33    <nvmex_register_intrs+0x175>
nvmex_register_intrs+0x142:     cmpl   $0x0,-0x37e1dc79(%rip)   
<nvmex_disable_numa_io>
nvmex_register_intrs+0x149:     jne    +0x15    <nvmex_register_intrs+0x160>
nvmex_register_intrs+0x14b:     movq   0x4f8(%rbx),%rax
nvmex_register_intrs+0x152:     addq   %r14,%rax
nvmex_register_intrs+0x155:     movq   (%rax),%rsi
nvmex_register_intrs+0x158:     movq   %rbx,%rdi
nvmex_register_intrs+0x15b:     call   -0x400   <nvmex_numaio_intr_added>
nvmex_register_intrs+0x160:     incq   %r12
nvmex_register_intrs+0x163:     addq   $0x8,%r14
nvmex_register_intrs+0x167:     incl   %r13d
nvmex_register_intrs+0x16a:     cmpl   0x508(%rbx),%r13d
nvmex_register_intrs+0x171:     jl     -0x54    <nvmex_register_intrs+0x11f>
nvmex_register_intrs+0x173:     jmp    +0x7     <nvmex_register_intrs+0x17c>
nvmex_register_intrs+0x175:     movl   $-0x1,%eax       <0xffffffff>
nvmex_register_intrs+0x17a:     jmp    +0x10    <nvmex_register_intrs+0x18c>
nvmex_register_intrs+0x17c:     xorl   %eax,%eax
nvmex_register_intrs+0x17e:     jmp    +0xc     <nvmex_register_intrs+0x18c>
nvmex_register_intrs+0x180:     movl   $-0x1,%eax       <0xffffffff>
nvmex_register_intrs+0x185:     jmp    +0x5     <nvmex_register_intrs+0x18c>
nvmex_register_intrs+0x187:     movl   $-0x1,%eax       <0xffffffff>
nvmex_register_intrs+0x18c:     addq   $0x18,%rsp
nvmex_register_intrs+0x190:     popq   %r14
nvmex_register_intrs+0x192:     popq   %r13
nvmex_register_intrs+0x194:     popq   %r12
nvmex_register_intrs+0x196:     popq   %rbx
nvmex_register_intrs+0x197:     leave  
nvmex_register_intrs+0x198:     ret


-------------------------------------------
smartos-discuss
Archives: https://www.listbox.com/member/archive/184463/=now
RSS Feed: https://www.listbox.com/member/archive/rss/184463/25769125-55cfbc00
Modify Your Subscription: 
https://www.listbox.com/member/?member_id=25769125&id_secret=25769125-7688e9fb
Powered by Listbox: http://www.listbox.com

Reply via email to