Hi,

I am having problems getting NVMe to work on SmartOS and hope that
someone can enlighten me on what I need to do to get it flying.

I am running the newest SmartOS (joyent_20160330T234717Z) on
a new Skylake-based Intel NUC6. 

My problem: the Samsung PM951 M.2 NVMe SSD is properly recognized
and supported in Linux; I could even boot Ubuntu from that SSD.

On SmartOS, the device is not recognized and does not show up in
diskinfo, sysinfo, or format.
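For example, none of these turn up the device (piping sysinfo through
json here just to pick out the Disks section):

[root@nuc6 ~]# diskinfo
[root@nuc6 ~]# format </dev/null
[root@nuc6 ~]# sysinfo | json Disks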

The relevant device info (from Linux) is:

02:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller (rev 01)

02:00.0 0108: 144d:a802 (rev 01) (prog-if 02 [NVM Express])
        Subsystem: 144d:a801
        Flags: bus master, fast devsel, latency 0
        Memory at df000000 (64-bit, non-prefetchable) [size=16K]
        I/O ports at e000 [size=256]
        Capabilities: [40] Power Management version 3
        Capabilities: [50] MSI: Enable- Count=1/8 Maskable- 64bit+
        Capabilities: [70] Express Endpoint, MSI 00
        Capabilities: [b0] MSI-X: Enable+ Count=9 Masked-
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [148] Device Serial Number 00-00-00-00-00-00-00-00
        Capabilities: [158] Power Budgeting <?>
        Capabilities: [168] #19
        Capabilities: [188] Latency Tolerance Reporting
        Capabilities: [190] L1 PM Substates
        Kernel driver in use: nvme
        Kernel modules: nvme

root@lubuntu:~# nvme list
Node             Model                Version  Namespace Usage                      Format           FW Rev
---------------- -------------------- -------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     SAMSUNG MZVLV512HCJH 1.1      1         0.00   B / 512.11  GB      512   B +  0 B   BXV7000Q

root@dws-desktop:~# nvme fw-log /dev/nvme0n1
Firmware Log for device:nvme0n1
afi  : 0x1
frs1 : 0x5130303037565842 (BXV7000Q)

The PM951 is an NVMe 1.1 device, so I uncommented "strict-version=0;"
in /kernel/drv/nvme.conf and ran "update_drv -vf nvme". That was not enough.
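For reference, on the live system that boils down to the following (the
sed one-liner is just one way to uncomment the line; editing nvme.conf
by hand works just as well):

[root@nuc6 ~]# sed -i '' -e '/#strict-version=0;/{s:.*:strict-version=0;:;}' /kernel/drv/nvme.conf
[root@nuc6 ~]# update_drv -vf nvme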

"grep nvme /etc/driver_aliases" only yields: nvme "pciex8086,953"

I tried:
[root@nuc6 ~]# update_drv -a -i '"pci144d,a802"' nvme

That actually attached the driver and made the SSD visible. I could
install the zones pool on that SSD.
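To double-check that the alias stuck and the disk shows up, something
like this should do (the second alias line is the one update_drv added):

[root@nuc6 ~]# grep nvme /etc/driver_aliases
nvme "pciex8086,953"
nvme "pci144d,a802"
[root@nuc6 ~]# diskinfo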

To have SmartOS find the zones pool on boot I had to patch the boot_archive:

mount /zones/smartos/platform/i86pc/amd64/boot_archive /mnt
sed -i '' -e '/#strict-version=0;/{s:.*:strict-version=0;:;}' \
    /mnt/kernel/drv/nvme.conf
sed -i '' -e '/nvme/{p;s:.*:nvme "pci144d,a802":;}' /mnt/etc/driver_aliases
umount /mnt
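Note: there is also a boot_archive.hash next to the boot_archive in the
platform directory. I am not sure whether anything actually verifies it
at boot, but to be safe it can be regenerated to match the patched
archive, along these lines:

[root@nuc6 ~]# cd /zones/smartos/platform/i86pc/amd64
[root@nuc6 ~]# digest -a sha1 boot_archive > boot_archive.hash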

That did the trick. I could boot from USB straight into the zones pool on the 
NVMe.

To test things out, I copied some large files into the pool and did a
scrub afterwards.
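The scrub itself was nothing fancy:

[root@nuc6 ~]# zpool scrub zones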
OMG!! This was the outcome:

[root@nuc6 ~]# zpool status
  pool: zones
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 0h0m with 1 errors on Tue Apr 12 18:04:37 2016
config:

        NAME        STATE     READ WRITE CKSUM
        zones       ONLINE       0     0     5
          c0t1d0    ONLINE       0     0    14

errors: 1 data errors, use '-v' for a list

Obviously something is very wrong here. I do NOT like to see this.

I looked at the file that was supposedly corrupted. Since I had copied
that file over from another machine, I could verify via its md5/sha256
checksums that the file was NOT actually corrupt. The checksums on both
machines were identical! I played around a little more, copied a few
files and ran a few scrubs. I got many more checksum errors while
scrubbing (how can there be fewer errors for the pool in total than for
the one disk that makes up the pool?) and also a second file with
permanent errors (let's say the first file is "a" and the second "b").
I could verify that file "b" was not corrupted either.
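For the record, the comparison was along these lines (the paths are just
examples; on the source machine I used md5sum/sha256sum for the same
files):

[root@nuc6 ~]# digest -a md5 /zones/testdata/a
[root@nuc6 ~]# digest -a sha256 /zones/testdata/a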

Now I had both files a and b in the list of files with permanent errors.
After some time, file a disappeared from that list and only b remained!!!
Eventually that list of files was empty again! And that pool has no
redundancy, nor did I use copies=2 or anything like that.

[root@nuc6 ~]# zpool status -v
  pool: zones
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 0 in 0h0m with 0 errors on Tue Apr 12 18:20:28 2016
config:

        NAME        STATE     READ WRITE CKSUM
        zones       ONLINE       0     0     5
          c0t1d0    ONLINE       0     0    15

errors: No known data errors
[root@nuc6 ~]#

This output simultaneously says "unrecoverable error" AND "No known data
errors"! And the files with "Permanent errors" magically disappeared.
How is that possible?

BTW, the memory is good; I ran a memory test overnight.

It appears that these errors occur only sporadically, and only in the
read path exercised by scrubbing. I have no explanation for this.

Does anyone have any clues as to
a) what is going on?
b) what can be done about it?

Although I could prove that all of the "permanent errors" reported
during scrubbing were not real errors after all, the whole thing
inspires no confidence and cannot be used like this IMHO.

Does anyone have NVMe running reliably on SmartOS? If so, with which devices?

Thanks for any help.

Cheers
Dirk



