Hi,

Is there any update on this issue?

I have the same system (SYS-2028U-TN24R4T+) with 11 x Intel P3600 2.0TB, NVMe
PCIe 3.0, HET MLC 20nm 3DWPD (SSDPE2ME020T4). I installed it successfully with
the 20170116T154141Z platform image and it worked for a few days, then the
headnode lost connection with it. I cannot log in to it from anywhere, not even
from the console: it accepts the password, then the screen freezes. No error
shows up. I have also rebooted it several times and booted it with many images,
including the latest (20170201T235645Z), with no luck.

Thank you
Viraphan


> On Aug 11, 2559 BE, at 8:38 PM, Youzhong Yang <[email protected]> wrote:
> 
> Hi Robert,
> 
> Thanks for looking into this issue.
> 
> I tried the MSI interrupt type on my own and it didn't work, but I will try
> your patch again and then report back.
> 
> I've studied the nvme driver in Solaris 11.3, and it seems to do the same
> thing as Linux: MSI-X first, then MSI, and finally FIXED (see the attached file
> for the assembly code and my comments). By the way, if I set nvmex_enable_msi
> and nvmex_enable_msix to 0 (false) in /etc/system, Solaris crashed immediately
> upon reboot.
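The MSI-X, then MSI, then FIXED preference order described above can be sketched as a small selection function. This is a hypothetical illustration, not the Solaris or illumos nvme code: the INTR_* flags and pick_intr_type() are local stand-ins (on illumos a driver would obtain the real bitmask of DDI_INTR_TYPE_* flags from ddi_intr_get_supported_types(9F)), and the two enable arguments only loosely model tunables such as nvmex_enable_msix and nvmex_enable_msi.

```c
#include <assert.h>

/* Hypothetical stand-in flags; these are NOT the DDI_INTR_TYPE_* values. */
enum {
    INTR_FIXED = 1 << 0,
    INTR_MSI   = 1 << 1,
    INTR_MSIX  = 1 << 2
};

/*
 * Pick an interrupt type from a bitmask of supported types, preferring
 * MSI-X, then MSI, then FIXED. enable_msix/enable_msi model hypothetical
 * tunables that disable the corresponding type. Returns 0 when no usable
 * type remains, giving the caller an explicit "no interrupts" case.
 */
int
pick_intr_type(int supported, int enable_msix, int enable_msi)
{
    /* This sketch only understands the three flags defined above. */
    assert((supported & ~(INTR_FIXED | INTR_MSI | INTR_MSIX)) == 0);

    if (enable_msix && (supported & INTR_MSIX))
        return INTR_MSIX;
    if (enable_msi && (supported & INTR_MSI))
        return INTR_MSI;
    if (supported & INTR_FIXED)
        return INTR_FIXED;
    return 0;
}
```

For example, a device reporting all three types gets MSI-X by default, MSI with MSI-X disabled, and FIXED with both disabled; a device reporting only MSI-X gets 0 once MSI-X is disabled.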
> 
> MSI-X works well on our host with a minor issue:
> 
> In nvme_var.h, NVME_ADMIN_CMD_TIMEOUT is defined as 100000, i.e. 100ms, I
> think. That's too small: one of the Intel NVMe SSDs took 366ms to execute the
> GET LOG PAGE command. Once I bumped the value up to 1,000,000, the nvme driver
> happily attached all 24 SSDs across many reboots.
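The arithmetic behind the timeout bump can be checked with a few lines of C, assuming (as guessed above) that NVME_ADMIN_CMD_TIMEOUT is expressed in microseconds: 100,000 microseconds is 100 ms, well below the observed 366 ms, while 1,000,000 microseconds is 1 s. The macro and helper names below are hypothetical and exist only to check that reasoning; they are not part of the nvme driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: the timeout constant is in microseconds. */
#define OLD_ADMIN_CMD_TIMEOUT_US 100000u   /* original value: 100 ms */
#define NEW_ADMIN_CMD_TIMEOUT_US 1000000u  /* bumped value: 1 s */

/* Convert a microsecond timeout to whole milliseconds. */
uint32_t
us_to_ms(uint32_t us)
{
    return us / 1000u;
}

/* True if a command observed to take observed_ms fits within timeout_us. */
bool
completes_in_time(uint32_t observed_ms, uint32_t timeout_us)
{
    assert(timeout_us > 0);
    /* Widen before multiplying so large observed values cannot overflow. */
    return (uint64_t)observed_ms * 1000u <= timeout_us;
}
```

With these helpers, a 366 ms GET LOG PAGE fails the old limit and fits comfortably within the new one.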
> 
> Here are some new issues / missing functionality:
> 
> - The Intel drives are formatted to use a 512-byte block size, but they
> advertise a 4096-byte block size as the best-performing one (see the data in
> issue report https://www.illumos.org/issues/7279). So far we don't have the
> ability to perform such a low-level FORMAT, and I had to install another OS
> such as Linux and use its tool (nvme-cli).
> 
> - With the NVMe SSD formatted to use a 4096-byte block size, many things don't
> work. It seems our blkdev driver was never intended to support devices with a
> block size larger than 512 bytes. I tried hacking the bd_strategy() function
> to convert bp->b_lblkno (in 512-byte units, passed all the way down from the
> zfs layer) to 4096-byte blocks. It appeared to work for most I/O ops, but I
> know it's just a hack, and I will open a new thread discussing this particular
> issue.
> 
> http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/io/blkdev/blkdev.c#1142
> http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/vdev_disk.c#783
> 
> - Using a 512-byte block size, I was able to create a zpool and do filesystem
> ops, and everything appeared to be great. But once I run zpool scrub, it
> reports checksum errors. It can be reproduced easily with one drive. I tried
> the same thing using the same drive in Solaris and it worked, so I don't think
> it's a hardware issue.
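The two block-size points above (picking the advertised best-performing format, and translating 512-byte block numbers into the device's native block size) can be sketched in C. The field layout follows the LBA Format data structure in the NVMe Identify Namespace data: Metadata Size in bits 15:0, LBA Data Size as a power of two in bits 23:16 (0 means the format is unsupported), and Relative Performance in bits 25:24, where 0 means best. The function names and the choice to reject misaligned requests rather than round them are assumptions for illustration; this is neither the blkdev code nor the bd_strategy() hack described above.

```c
#include <assert.h>
#include <stdint.h>

/* Field extraction for one NVMe LBA Format dword (Identify Namespace). */
static unsigned lbaf_ms(uint32_t lbaf)    { return lbaf & 0xffffu; }       /* metadata bytes */
static unsigned lbaf_lbads(uint32_t lbaf) { return (lbaf >> 16) & 0xffu; } /* log2(block size) */
static unsigned lbaf_rp(uint32_t lbaf)    { return (lbaf >> 24) & 0x3u; }  /* 0 = best */

/* Index of the best-performing format without metadata, or -1 if none. */
int
best_lba_format(const uint32_t *lbaf, int nlbaf)
{
    int best = -1;

    for (int i = 0; i < nlbaf; i++) {
        /* Skip formats carrying metadata and unsupported entries (< 512 B). */
        if (lbaf_ms(lbaf[i]) != 0 || lbaf_lbads(lbaf[i]) < 9)
            continue;
        /* Lower Relative Performance value means better performance. */
        if (best < 0 || lbaf_rp(lbaf[i]) < lbaf_rp(lbaf[best]))
            best = i;
    }
    return best;
}

/*
 * Translate a start block and length given in 512-byte (DEV_BSIZE) units
 * into native blocks of size 2^lbads. Returns the native start block, or
 * -1 if the start or length is not aligned to the native block size.
 */
int64_t
to_native_lba(uint64_t lblkno512, uint64_t nblks512, unsigned lbads)
{
    assert(lbads >= 9 && lbads < 32);
    uint64_t ratio = 1ull << (lbads - 9);   /* 8 for 4096-byte blocks */

    if (lblkno512 % ratio != 0 || nblks512 % ratio != 0)
        return -1;                          /* misaligned: refuse, don't round */
    return (int64_t)(lblkno512 / ratio);
}
```

For example, with a 4096-byte format (LBA Data Size 12) the ratio is 8, so 512-byte block 800 maps to native block 100, while block 801 is rejected as misaligned instead of silently touching the wrong data.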
> 
> The checksum errors issue really bothers me, and I don't know where to start.
> I've done some DTracing in the nvme driver and found no errors returned for
> the read/write commands.
> 
> Thanks,
> 
> -Youzhong
> 
> 
> 
> 
> On Wed, Aug 10, 2016 at 10:41 PM, Robert Mustacchi <[email protected]> wrote:
> On 8/4/16 11:25 , Youzhong Yang wrote:
> > Thanks for the input, Robert.
> >
> > I believe the issue is now resolved by using the MSI-X (instead of FIXED)
> > interrupt type inside nvme_init() for the admin queue.
> >
> > Here is the issue report I just filed:
> >
> > https://www.illumos.org/issues/7273
> >
> > I don't know why the FIXED interrupt type would cause this issue; perhaps
> > because we have too many NVMe SSDs?
> 
> Following up on this aspect, I've put together the following:
> 
> https://us-east.manta.joyent.com/rmustacc/public/webrevs/7273/index.html
> 
> Youzhong, would it be possible for you to try this? We've had success with it
> for someone else who was seeing issues. I'll hopefully get some time to look
> at the offlining issue, but that's still a while out, sorry.
> 
> Hans, can you review this and take a look?
> 
> Thanks,
> Robert
> 
> <nvme_register_intrs.txt>



-------------------------------------------
smartos-discuss
Archives: https://www.listbox.com/member/archive/184463/=now
RSS Feed: https://www.listbox.com/member/archive/rss/184463/25769125-55cfbc00
Modify Your Subscription: 
https://www.listbox.com/member/?member_id=25769125&id_secret=25769125-7688e9fb
Powered by Listbox: http://www.listbox.com
