Re: [lustre-discuss] bad performance with Lustre/ZFS on NVMe SSD
Yes, I tested every single disk, and also the disks in a raidz pool without
Lustre. The disks perform to spec, 1.2GB/s each and up to 6GB/s in the zpool.
With Lustre on top, the zpool performs really badly, no more than 1.5GB/s.

I then configured one OST per disk, without any raidz (6 OSTs total). I can
scale performance up by distributing processes across the OSTs this way, but
if I use striping across all OSTs instead of manually binding processes to a
specific OST, performance decreases. Also, with a single process on a single
OST I can never get more than 700MB/s, while I can reach 1.2GB/s using at
least 4 processes on the same OST.

I tested with obdfilter-survey; this is what I got:

  ost 1 sz 524288000K rsz 1024K obj 4 thr 4 write 4872.92 [1525.83, 6120.75]

I ran LNet selftest and got 6GB/s using FDR. But when I write from the client
side, performance drops dramatically, especially with Lustre on raidz. So I
was wondering: is there any RPC parameter that I need to set to get better
performance out of Lustre?

thank you

On 4/9/18 4:15 PM, Dilger, Andreas wrote:
> On Apr 6, 2018, at 23:04, Riccardo Veraldi wrote:
>> [...]
>
> Riccardo,
> to take a step back for a minute, have you tested all of the devices
> individually, and also concurrently with some low-level tool like
> sgpdd or vdbench? After that is known to be working, have you tested
> with obdfilter-survey locally on the OSS, then remotely on the client(s)
> so that we can isolate where the bottleneck is being hit?
>
> Cheers, Andreas
> [...]
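For reference, the client-side RPC tuning being asked about here is normally
inspected and changed with lctl on the Lustre client. A minimal sketch,
assuming the drpffb filesystem name from this thread; the values are
illustrative starting points, not recommendations:

  # inspect the current per-OST RPC settings on a client
  lctl get_param osc.drpffb-*.max_pages_per_rpc
  lctl get_param osc.drpffb-*.max_rpcs_in_flight
  lctl get_param osc.drpffb-*.max_dirty_mb

  # example: 16MB RPCs (4096 x 4KB pages), more RPCs in flight per OST,
  # and a larger per-OSC dirty cache
  lctl set_param osc.drpffb-*.max_pages_per_rpc=4096
  lctl set_param osc.drpffb-*.max_rpcs_in_flight=16
  lctl set_param osc.drpffb-*.max_dirty_mb=512

Note that raising max_pages_per_rpc above 1024 (4MB) also requires the OSS to
advertise a matching brw_size (obdfilter.*.brw_size) in Lustre 2.10.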
Re: [lustre-discuss] bad performance with Lustre/ZFS on NVMe SSD
Riccardo,

It can be helpful to see the output of these commands on the ZFS pool host
while you read files through a Lustre client, and again directly through ZFS:

  # zpool iostat -lq -y zpool_name 1
  # zpool iostat -w -y zpool_name 5
  # zpool iostat -r -y zpool_name 5

  -q  queue statistics
  -l  latency statistics
  -r  request size histogram
  -w  (undocumented) latency statistics

I have seen different behavior in ZFS reads on the pool for the same dd/fio
command when reading a file from a Lustre mount on a different host versus
directly from ZFS on the OSS. I created a separate ZFS dataset with similar
ZFS settings on the Lustre zpool. Lustre I/O shows up on the pool as 128KB
requests, while dd/fio directly on ZFS issues 1MB requests; the dd/fio
commands themselves used 1MB I/O.

  zptevlfs6    sync_read    sync_write   async_read   async_write     scrub
  req_size     ind    agg   ind    agg   ind    agg   ind    agg   ind    agg
  ----------  -----  -----  -----  -----  -----  ----  ----  ----  ----  ----
  512             0      0      0      0      0     0     0     0     0     0
  1K              0      0      0      0      0     0     0     0     0     0
  2K              0      0      0      0      0     0     0     0     0     0
  4K              0      0      0      0      0     0     0     0     0     0
  8K              0      0      0      0      0     0     0     0     0     0
  16K             0      0      0      0      0     0     0     0     0     0
  32K             0      0      0      0      0     0     0     0     0     0
  64K             0      0      0      0      0     0     0     0     0     0
  128K            0      0      0      0  2.00K     0     0     0     0     0  <
  256K            0      0      0      0      0     0     0     0     0     0
  512K            0      0      0      0      0     0     0     0     0     0
  1M              0      0      0      0    125     0     0     0     0     0  <
  2M              0      0      0      0      0     0     0     0     0     0
  4M              0      0      0      0      0     0     0     0     0     0
  8M              0      0      0      0      0     0     0     0     0     0
  16M             0      0      0      0      0     0     0     0     0     0

Alex.

On 4/9/18, 6:15 PM, "lustre-discuss on behalf of Dilger, Andreas" wrote:

    On Apr 6, 2018, at 23:04, Riccardo Veraldi wrote:
    > [...]

    Riccardo,
    to take a step back for a minute, have you tested all of the devices
    individually, and also concurrently with some low-level tool like
    sgpdd or vdbench? After that is known to be working, have you tested
    with obdfilter-survey locally on the OSS, then remotely on the client(s)
    so that we can isolate where the bottleneck is being hit?

    Cheers, Andreas
    [...]
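To see the same contrast, identical reads can be issued against the Lustre
mount and the local ZFS dataset while watching the request-size histogram. A
minimal sketch, assuming hypothetical paths /mnt/lustre/testfile and
/zptevlfs6/test/testfile:

  # on a Lustre client: 1MB sequential read through Lustre
  fio --name=lustre-read --rw=read --bs=1M --size=4G \
      --filename=/mnt/lustre/testfile

  # on the OSS: the same read directly against the ZFS dataset
  fio --name=zfs-read --rw=read --bs=1M --size=4G \
      --filename=/zptevlfs6/test/testfile

  # on the pool host, in another terminal: request-size histogram
  zpool iostat -r -y zptevlfs6 5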
Re: [lustre-discuss] bad performance with Lustre/ZFS on NVMe SSD
On Apr 6, 2018, at 23:04, Riccardo Veraldi wrote:
>
> So I've been struggling for months with these low performances on Lustre/ZFS.
>
> Looking for hints.
>
> 3 OSSes, RHEL 7.4, Lustre 2.10.3 and zfs 0.7.6
>
> each OSS has one OST raidz
>
>   pool: drpffb-ost01
>  state: ONLINE
>   scan: none requested
>   trim: completed on Fri Apr 6 21:53:04 2018 (after 0h3m)
> config:
>
>         NAME          STATE     READ WRITE CKSUM
>         drpffb-ost01  ONLINE       0     0     0
>           raidz1-0    ONLINE       0     0     0
>             nvme0n1   ONLINE       0     0     0
>             nvme1n1   ONLINE       0     0     0
>             nvme2n1   ONLINE       0     0     0
>             nvme3n1   ONLINE       0     0     0
>             nvme4n1   ONLINE       0     0     0
>             nvme5n1   ONLINE       0     0     0
>
> While the raidz without Lustre performs well at 6GB/s (1GB/s per disk),
> with Lustre on top of it performance is really poor. Most of all, it is
> not stable at all and goes up and down between 1.5GB/s and 6GB/s. I
> tested with obdfilter-survey.
> LNET is ok and working at 6GB/s (using InfiniBand FDR).
>
> What could be the cause of OST performance going up and down like a
> roller coaster?

Riccardo,
to take a step back for a minute, have you tested all of the devices
individually, and also concurrently with some low-level tool like
sgpdd or vdbench? After that is known to be working, have you tested
with obdfilter-survey locally on the OSS, then remotely on the client(s)
so that we can isolate where the bottleneck is being hit?

Cheers, Andreas

> for reference here are a few considerations:
>
> filesystem parameters:
>
>   zfs set mountpoint=none drpffb-ost01
>   zfs set sync=disabled drpffb-ost01
>   zfs set atime=off drpffb-ost01
>   zfs set redundant_metadata=most drpffb-ost01
>   zfs set xattr=sa drpffb-ost01
>   zfs set recordsize=1M drpffb-ost01
>
> NVMe SSDs are 4KB/sector
>
>   ashift=12
>
> ZFS module parameters:
>
>   options zfs zfs_prefetch_disable=1
>   options zfs zfs_txg_history=120
>   options zfs metaslab_debug_unload=1
>   #
>   options zfs zfs_vdev_scheduler=deadline
>   options zfs zfs_vdev_async_write_active_min_dirty_percent=20
>   #
>   options zfs zfs_vdev_scrub_min_active=48
>   options zfs zfs_vdev_scrub_max_active=128
>   #options zfs zfs_vdev_sync_write_min_active=64
>   #options zfs zfs_vdev_sync_write_max_active=128
>   #
>   options zfs zfs_vdev_sync_write_min_active=8
>   options zfs zfs_vdev_sync_write_max_active=32
>   options zfs zfs_vdev_sync_read_min_active=8
>   options zfs zfs_vdev_sync_read_max_active=32
>   options zfs zfs_vdev_async_read_min_active=8
>   options zfs zfs_vdev_async_read_max_active=32
>   options zfs zfs_top_maxinflight=320
>   options zfs zfs_txg_timeout=30
>   options zfs zfs_dirty_data_max_percent=40
>   options zfs zfs_vdev_scheduler=deadline
>   options zfs zfs_vdev_async_write_min_active=8
>   options zfs zfs_vdev_async_write_max_active=32

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation
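For reference, an obdfilter-survey run along the lines suggested above (the
script ships with lustre-iokit) might look like the sketch below. The OST
name drpffb-OST0000 and the OSS hostname oss01 are assumptions, and size is
the per-OST test file size in MB:

  # on the OSS: exercise the OST through the obdfilter layer, no network
  nobjhi=4 thrhi=4 size=8192 case=disk targets="drpffb-OST0000" obdfilter-survey

  # from a client: survey the network path alone, to isolate the bottleneck
  nobjhi=4 thrhi=4 size=8192 case=network targets="oss01" obdfilter-survey

Comparing the disk-only and network-only numbers against the end-to-end
client throughput is what separates an OST-side bottleneck from an LNet or
client-side one.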