Hi Robert, Thanks for looking into this issue.
I tried MSI interrupt type on my own but it didn't work, but I will try your patch again and then report back. I've studied the nvme driver in Solaris 11.3, it seems they do the same thing as Linux - MSI-X first, then MSI, finally FIXED, see attached file for the assembly code and my comments. By the way, if I set nvmex_enable_msi and nvmex_enable_msix to 0 (false) in /etc/system, Solaris crashed immediately upon reboot. MSI-X works well on our host with a minor issue: In nvme_var.h, NVME_ADMIN_CMD_TIMEOUT is defined as 100000, i.e. 100ms I think. It's too small. One of the INTEL NVMe SSDs took 366ms to execute GET LOG PAGE command. Once I bumped up the value to 1000,000, nvme driver happily attached all the 24 SSDs in many reboots. Here comes new issues/lack of functionality: - The INTEL drives are formatted to use 512 bytes block size, but they advertise 4096 bytes block size as the best performing one (see the data in issue report https://www.illumos.org/issues/7279). So far we don't have the ability to perform such low level FORMAT and I had to install other OS such as Linux and use their tool (nvme-cli). - With NVMe SSD formatted to use 4096 block size, many things don't work. It seems our blkdev driver never intended to support device with block size larger than 512 bytes. I tried hacking bd_strategy() function to modify bp->b_lblkno (in 512 bytes size, passed all the way down from zfs layer) to be in 4096 block size, it appeared to be working for most I/O ops, but I know it's just a hack and I will open a new thread discussing this particular issue. http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/io/blkdev/blkdev.c#1142 http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/vdev_disk.c#783 - Using 512-bytes block size, I was able to create zpool, do filesystem ops, everything appears to be great. But, once I do zpool scrub, it reports checksum errors. It can be reproduced easily with one drive. I tried the same think using the same drive in Solaris, it worked so I don't think it's a hardware issue. The checksum errors issue just bothered me, I don't know where to start with. I've done some dtracing in nvme driver and found no error returned for the read/write cmds. Thanks, -Youzhong On Wed, Aug 10, 2016 at 10:41 PM, Robert Mustacchi <r...@joyent.com> wrote: > On 8/4/16 11:25 , Youzhong Yang wrote: > > Thanks for the input Robert. > > > > I believe the issue is now resolved by using MSI-X (instead of FIXED) > > interrupt type inside nvme_init() for the admin queue. > > > > Here is the issue report I just filed: > > > > https://www.illumos.org/issues/7273 > > > > I don't know why FIXED interrupt would cause issue, probably because we > > have too many NVMe SSDs? > > Following up on this aspect, I've put together the following: > > https://us-east.manta.joyent.com/rmustacc/public/webrevs/7273/index.html > > Youzhong, would it be possible for you to use this? We've had success > with this with someone who was seeing issues. I'll hopefully get some > time to look at the offlining issue, but it's still a bit out, sorry. > > Hans, can you review this and take a look? > > Thanks, > Robert >
int nvmex_register_intrs(nvmex_t *); int nvmex_setup_intr(nvmex_t *, int); nvmex_register_intrs: pushq %rbp nvmex_register_intrs+1: movq %rsp,%rbp nvmex_register_intrs+4: subq $0x8,%rsp nvmex_register_intrs+8: movq %rdi,-0x8(%rbp) nvmex_register_intrs+0xc: pushq %rbx nvmex_register_intrs+0xd: pushq %r12 nvmex_register_intrs+0xf: pushq %r13 nvmex_register_intrs+0x11: pushq %r14 nvmex_register_intrs+0x13: subq $0x18,%rsp nvmex_register_intrs+0x17: movq %rdi,%rbx <---- nvmex_t pointer nvmex_register_intrs+0x1a: movl $0xffff,%eax nvmex_register_intrs+0x1f: movl %eax,0x514(%rbx) <---- int nv_intr_type of nvmex_t nvmex_register_intrs+0x25: movl $0x1,%r12d nvmex_register_intrs+0x2b: movl %r12d,0x51c(%rbx) <---- boolean_t nv_intr_numa nvmex_register_intrs+0x32: movq (%rbx),%rdi <---- dev_info_t *nv_dip nvmex_register_intrs+0x35: leaq -0x34(%rbp),%rsi <---- int *typesp arg nvmex_register_intrs+0x39: call +0x3b42fae <ddi_intr_get_supported_types> nvmex_register_intrs+0x3e: testl %eax,%eax nvmex_register_intrs+0x40: jne +0x141 <nvmex_register_intrs+0x187> <--- in case of error nvmex_register_intrs+0x46: cmpl $0x0,-0x37e1db75(%rip) <nvmex_enable_msix> nvmex_register_intrs+0x4d: je +0x1b <nvmex_register_intrs+0x6a> nvmex_register_intrs+0x4f: movl -0x34(%rbp),%eax <---- supported intr types nvmex_register_intrs+0x52: testl $0x4,%eax <---- DDI_INTR_TYPE_MSIX = 0x4 nvmex_register_intrs+0x57: je +0x11 <nvmex_register_intrs+0x6a> nvmex_register_intrs+0x59: movq %rbx,%rdi nvmex_register_intrs+0x5c: movl $0x4,%esi nvmex_register_intrs+0x61: call -0x289e <nvmex_setup_intr> nvmex_register_intrs+0x66: testl %eax,%eax nvmex_register_intrs+0x68: je +0x78 <nvmex_register_intrs+0xe2> nvmex_register_intrs+0x6a: cmpl $0x0,-0x37e1db9d(%rip) <nvmex_enable_msi> nvmex_register_intrs+0x71: je +0x1b <nvmex_register_intrs+0x8e> nvmex_register_intrs+0x73: movl -0x34(%rbp),%eax nvmex_register_intrs+0x76: testl $0x2,%eax <---- DDI_INTR_TYPE_MSI = 0x2 nvmex_register_intrs+0x7b: je +0x11 <nvmex_register_intrs+0x8e> nvmex_register_intrs+0x7d: movq %rbx,%rdi nvmex_register_intrs+0x80: movl $0x2,%esi nvmex_register_intrs+0x85: call -0x28c2 <nvmex_setup_intr> nvmex_register_intrs+0x8a: testl %eax,%eax nvmex_register_intrs+0x8c: je +0x41 <nvmex_register_intrs+0xcf> nvmex_register_intrs+0x8e: movl -0x34(%rbp),%eax nvmex_register_intrs+0x91: testl $0x1,%eax <---- DDI_INTR_TYPE_FIXED = 0x1 nvmex_register_intrs+0x96: je +0x11 <nvmex_register_intrs+0xa9> nvmex_register_intrs+0x98: movq %rbx,%rdi nvmex_register_intrs+0x9b: movl $0x1,%esi nvmex_register_intrs+0xa0: call -0x28dd <nvmex_setup_intr> nvmex_register_intrs+0xa5: testl %eax,%eax nvmex_register_intrs+0xa7: je +0x16 <nvmex_register_intrs+0xbf> nvmex_register_intrs+0xa9: cmpl $0xffff,0x514(%rbx) nvmex_register_intrs+0xb3: jne +0x3e <nvmex_register_intrs+0xf3> nvmex_register_intrs+0xb5: movl $-0x1,%eax <0xffffffff> nvmex_register_intrs+0xba: jmp +0xcd <nvmex_register_intrs+0x18c> nvmex_register_intrs+0xbf: movl %r12d,0x514(%rbx) nvmex_register_intrs+0xc6: movl $0x0,-0x34(%rbp) nvmex_register_intrs+0xcd: jmp +0x24 <nvmex_register_intrs+0xf3> nvmex_register_intrs+0xcf: movl $0x2,0x514(%rbx) nvmex_register_intrs+0xd9: movl $0x0,-0x34(%rbp) nvmex_register_intrs+0xe0: jmp +0x11 <nvmex_register_intrs+0xf3> nvmex_register_intrs+0xe2: movl $0x4,0x514(%rbx) nvmex_register_intrs+0xec: movl $0x0,-0x34(%rbp) nvmex_register_intrs+0xf3: movq 0x4f8(%rbx),%rax nvmex_register_intrs+0xfa: movq (%rax),%rdi nvmex_register_intrs+0xfd: leaq 0x520(%rbx),%rsi nvmex_register_intrs+0x104: call +0x3b438cb <ddi_intr_get_pri> nvmex_register_intrs+0x109: testl %eax,%eax nvmex_register_intrs+0x10b: jne +0x73 <nvmex_register_intrs+0x180> nvmex_register_intrs+0x10d: xorl %r13d,%r13d nvmex_register_intrs+0x110: cmpl $0x0,0x508(%rbx) nvmex_register_intrs+0x117: jle +0x63 <nvmex_register_intrs+0x17c> nvmex_register_intrs+0x119: xorq %r12,%r12 nvmex_register_intrs+0x11c: xorq %r14,%r14 nvmex_register_intrs+0x11f: movq 0x4f8(%rbx),%rax nvmex_register_intrs+0x126: addq %r14,%rax nvmex_register_intrs+0x129: movq (%rax),%rdi nvmex_register_intrs+0x12c: leaq -0x2fb7(%rip),%rsi <nvmex_intr> nvmex_register_intrs+0x133: movq %rbx,%rdx nvmex_register_intrs+0x136: movq %r12,%rcx nvmex_register_intrs+0x139: call +0x3b439ea <ddi_intr_add_handler> nvmex_register_intrs+0x13e: testl %eax,%eax nvmex_register_intrs+0x140: jne +0x33 <nvmex_register_intrs+0x175> nvmex_register_intrs+0x142: cmpl $0x0,-0x37e1dc79(%rip) <nvmex_disable_numa_io> nvmex_register_intrs+0x149: jne +0x15 <nvmex_register_intrs+0x160> nvmex_register_intrs+0x14b: movq 0x4f8(%rbx),%rax nvmex_register_intrs+0x152: addq %r14,%rax nvmex_register_intrs+0x155: movq (%rax),%rsi nvmex_register_intrs+0x158: movq %rbx,%rdi nvmex_register_intrs+0x15b: call -0x400 <nvmex_numaio_intr_added> nvmex_register_intrs+0x160: incq %r12 nvmex_register_intrs+0x163: addq $0x8,%r14 nvmex_register_intrs+0x167: incl %r13d nvmex_register_intrs+0x16a: cmpl 0x508(%rbx),%r13d nvmex_register_intrs+0x171: jl -0x54 <nvmex_register_intrs+0x11f> nvmex_register_intrs+0x173: jmp +0x7 <nvmex_register_intrs+0x17c> nvmex_register_intrs+0x175: movl $-0x1,%eax <0xffffffff> nvmex_register_intrs+0x17a: jmp +0x10 <nvmex_register_intrs+0x18c> nvmex_register_intrs+0x17c: xorl %eax,%eax nvmex_register_intrs+0x17e: jmp +0xc <nvmex_register_intrs+0x18c> nvmex_register_intrs+0x180: movl $-0x1,%eax <0xffffffff> nvmex_register_intrs+0x185: jmp +0x5 <nvmex_register_intrs+0x18c> nvmex_register_intrs+0x187: movl $-0x1,%eax <0xffffffff> nvmex_register_intrs+0x18c: addq $0x18,%rsp nvmex_register_intrs+0x190: popq %r14 nvmex_register_intrs+0x192: popq %r13 nvmex_register_intrs+0x194: popq %r12 nvmex_register_intrs+0x196: popq %rbx nvmex_register_intrs+0x197: leave nvmex_register_intrs+0x198: ret ------------------------------------------- smartos-discuss Archives: https://www.listbox.com/member/archive/184463/=now RSS Feed: https://www.listbox.com/member/archive/rss/184463/25769125-55cfbc00 Modify Your Subscription: https://www.listbox.com/member/?member_id=25769125&id_secret=25769125-7688e9fb Powered by Listbox: http://www.listbox.com