[developer] Re: [openzfs/openzfs] 9112 Improve allocation performance on high-end systems (#548)
It's probably mixed. For low-throughput workloads, it probably reduces locality. For high-throughput systems, my understanding is that in the steady state you're switching between metaslabs during a txg anyway, so the effect is reduced significantly. One thing we could do is change the default of spa_allocators down to 1 or 2 to reduce the odds of harm; people could then tune it up if they think it would give them more performance.

View it on GitHub: https://github.com/openzfs/openzfs/pull/548#issuecomment-365827055
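As a rough illustration of why the default matters (a minimal sketch with invented names, not code from the patch): with spa_allocators greater than 1, concurrent writers are spread across allocators and therefore across different active metaslabs, while a value of 1 keeps the old behavior of a single shared active metaslab.

```c
#include <stdint.h>
#include <stdio.h>

/* illustrative default under discussion; the real tunable is read at pool open */
static int spa_allocators = 2;

/*
 * Hypothetical helper, made up for this sketch: map a caller-provided hint
 * (say, an object number) to an allocator index.  With spa_allocators = 1
 * every writer gets allocator 0 and shares one active metaslab, as before.
 */
static int
pick_allocator(uint64_t hint)
{
	return ((int)(hint % (uint64_t)spa_allocators));
}

int
main(void)
{
	for (uint64_t obj = 0; obj < 4; obj++)
		printf("object %llu -> allocator %d\n",
		    (unsigned long long)obj, pick_allocator(obj));
	return (0);
}
```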
[developer] Re: [openzfs/openzfs] 9112 Improve allocation performance on high-end systems (#548)
It looks interesting. I just haven't decided yet whether distributing writes for different objects across several allocators, and therefore across several metaslabs, is good or bad for data locality. It should be mostly irrelevant for SSDs, but I worry about HDDs, which may have to seek across the whole media, possibly many times per TXG, due to allocation queue limitations.

View it on GitHub: https://github.com/openzfs/openzfs/pull/548#issuecomment-365826628
[developer] [openzfs/openzfs] 9112 Improve allocation performance on high-end systems (#548)
Reviewed by: Matthew Ahrens mahr...@delphix.com
Reviewed by: George Wilson george.wil...@delphix.com
Reviewed by: Serapheim Dimitropoulos serapheim.dimi...@delphix.com

## Overview

We parallelize the allocation process by creating the concept of "allocators". There are a certain number of allocators per metaslab group, defined by the value of a tunable at pool open time. Each allocator for a given metaslab group has up to two active metaslabs: one "primary" and one "secondary". The primary and secondary weights mean the same thing they did in the pre-allocator world; primary metaslabs are used for most allocations, and secondary metaslabs are used for ditto blocks being allocated in the same metaslab group. There is also the CLAIM weight, which has been separated out from the other weights, but that is less important to understanding the patch.

The active metaslabs for each allocator are moved from their normal place in the group's metaslab tree to the back of the tree. This way, they will not be selected for use by other allocators searching for new metaslabs unless all the passive metaslabs are unsuitable for allocations. If that does happen, the allocators will "steal" from each other to ensure that IOs don't fail until there is truly no space left to perform allocations.

In addition, the alloc queue for each metaslab group has been broken into a separate queue for each allocator. We don't want to dramatically increase the number of inflight IOs on low-end systems, because it can significantly increase txg times. On the other hand, we want to ensure that there are enough IOs for each allocator to allow for good coalescing before sending the IOs to the disk. As a result, we take a compromise path: each allocator's alloc queue max depth starts at a certain value for every txg, and every time an IO completes, we increase the max depth. This should provide a good balance between the two failure modes while not dramatically increasing complexity.

We also parallelize the spa_alloc_tree and spa_alloc_lock, which cause very similar contention when selecting IOs to allocate. This parallelization uses the same allocator scheme as metaslab selection.

## Performance results

Performance improvements from this change can vary significantly based on the number of CPUs in the system, whether or not the system has a NUMA architecture, the speed of the drives, the values for the various tunables, and the workload being performed. For an fio async sequential write workload on a 24-core NUMA system with 256 GB of RAM and eight 128 GB SSDs, there is a roughly 25% performance improvement.

## Future work

Analysis of the performance of the system with this patch applied shows that a significant new bottleneck is the vdev disk queues, which also need to be parallelized. Prototyping of this change has occurred, and there was a performance improvement, but more work needs to be done before its stability has been verified and it is ready to be upstreamed.
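To make the queue-depth ramp in the Overview concrete, here is a minimal standalone sketch. It is written for this summary only, under the assumption that each allocator tracks its own inflight count and cap; the names (`alloc_queue_t`, `alloc_queue_start_depth`, `alloc_queue_depth_increment`) and values are invented and do not come from the patch.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* illustrative tunables; names and values are made up for this sketch */
static uint64_t alloc_queue_start_depth = 16;
static uint64_t alloc_queue_depth_increment = 1;

/* one of these would exist per allocator, per metaslab group */
typedef struct alloc_queue {
	uint64_t aq_max_depth;	/* current cap on inflight allocation IOs */
	uint64_t aq_inflight;	/* allocation IOs currently outstanding */
} alloc_queue_t;

static void
alloc_queue_txg_reset(alloc_queue_t *aq)
{
	/* every txg starts with a modest cap so low-end systems aren't flooded */
	aq->aq_max_depth = alloc_queue_start_depth;
}

static bool
alloc_queue_can_issue(const alloc_queue_t *aq)
{
	return (aq->aq_inflight < aq->aq_max_depth);
}

static void
alloc_queue_io_issued(alloc_queue_t *aq)
{
	aq->aq_inflight++;
}

static void
alloc_queue_io_done(alloc_queue_t *aq)
{
	/* each completion raises the cap, letting fast pools queue deeper */
	aq->aq_inflight--;
	aq->aq_max_depth += alloc_queue_depth_increment;
}

int
main(void)
{
	alloc_queue_t aq = { 0, 0 };

	alloc_queue_txg_reset(&aq);

	/* issue and complete a few IOs; watch the cap grow */
	for (int i = 0; i < 4; i++) {
		if (alloc_queue_can_issue(&aq))
			alloc_queue_io_issued(&aq);
		alloc_queue_io_done(&aq);
		printf("max depth now %llu\n",
		    (unsigned long long)aq.aq_max_depth);
	}
	return (0);
}
```

The design point is that a txg on a slow or small system ends before many completions have raised the cap, so queues stay shallow, while a fast system quickly earns enough depth for good IO coalescing.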
You can view, comment on, or merge this pull request online at: https://github.com/openzfs/openzfs/pull/548

-- Commit Summary --

* 9112 Improve allocation performance on high-end systems

-- File Changes --

M usr/src/cmd/mdb/common/modules/zfs/zfs.c (11)
M usr/src/test/zfs-tests/tests/functional/cli_root/zpool_import/import_rewind_config_changed.ksh (6)
M usr/src/test/zfs-tests/tests/functional/slog/slog_014_pos.ksh (14)
M usr/src/uts/common/fs/zfs/metaslab.c (513)
M usr/src/uts/common/fs/zfs/spa.c (37)
M usr/src/uts/common/fs/zfs/spa_misc.c (28)
M usr/src/uts/common/fs/zfs/sys/metaslab.h (18)
M usr/src/uts/common/fs/zfs/sys/metaslab_impl.h (79)
M usr/src/uts/common/fs/zfs/sys/spa_impl.h (12)
M usr/src/uts/common/fs/zfs/sys/vdev_impl.h (3)
M usr/src/uts/common/fs/zfs/sys/zio.h (7)
M usr/src/uts/common/fs/zfs/vdev.c (4)
M usr/src/uts/common/fs/zfs/vdev_queue.c (11)
M usr/src/uts/common/fs/zfs/vdev_removal.c (9)
M usr/src/uts/common/fs/zfs/zil.c (8)
M usr/src/uts/common/fs/zfs/zio.c (86)

-- Patch Links --

https://github.com/openzfs/openzfs/pull/548.patch
https://github.com/openzfs/openzfs/pull/548.diff
[developer] Re: [openzfs/openzfs] 9091 Add mdb smart write (#544)
@sdimitro pushed 1 commit.

a23de99 get rid of redundant check

View it on GitHub: https://github.com/openzfs/openzfs/pull/544/files/667119b38ecaf1f1a540975a8b63a7189ee9eaeb..a23de999d1a50e209c43bcdaadb8a29029c50e53
[developer] Re: [openzfs/openzfs] 9091 Add mdb smart write (#544)
sdimitro commented on this pull request.

> @@ -198,6 +204,100 @@ write_uint64(mdb_tgt_as_t as, mdb_tgt_addr_t addr, uint64_t n, uint_t rdback)
>  	return (addr + sizeof (n));
>  }
>
> +/*
> + * Writes to objects of size 1, 2, 4, or 8 bytes. The function
> + * doesn't care if the object is a number or not (e.g. it could
> + * be a byte array, or a struct) as long as the size of the write
> + * is one of the aforementioned ones.
> + */
> +static mdb_tgt_addr_t
> +write_var_uint(mdb_tgt_as_t as, mdb_tgt_addr_t addr, uint64_t val, size_t size,
> +    uint_t rdback)
> +{
> +	if (size > MDB_UINT_WRITE_MAXBYTES) {

I agree. Since that is pretty easy to test and we end up getting rid of the extra constant, I can re-iterate once more.

View it on GitHub: https://github.com/openzfs/openzfs/pull/544#discussion_r168258623
[developer] Re: [openzfs/openzfs] 9091 Add mdb smart write (#544)
rmustacc commented on this pull request.

> @@ -198,6 +204,100 @@ write_uint64(mdb_tgt_as_t as, mdb_tgt_addr_t addr, uint64_t n, uint_t rdback)
>  	return (addr + sizeof (n));
>  }
>
> +/*
> + * Writes to objects of size 1, 2, 4, or 8 bytes. The function
> + * doesn't care if the object is a number or not (e.g. it could
> + * be a byte array, or a struct) as long as the size of the write
> + * is one of the aforementioned ones.
> + */
> +static mdb_tgt_addr_t
> +write_var_uint(mdb_tgt_as_t as, mdb_tgt_addr_t addr, uint64_t val, size_t size,
> +    uint_t rdback)
> +{
> +	if (size > MDB_UINT_WRITE_MAXBYTES) {

Looking at this again, given that we are checking the size down below and adjusting it based on a uint64_t, do we actually need this block here? I just realized that, since we'll check the size in the switch statement, it may end up being redundant. But, honestly, it doesn't matter either way.

View it on GitHub: https://github.com/openzfs/openzfs/pull/544#pullrequestreview-96594391
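For readers following along, the shape under discussion is roughly the following standalone sketch (`write_sized_value` and its memcpy-based body are invented for illustration; this is not the mdb code). Because the switch's default case already rejects any size other than 1, 2, 4, or 8, a separate upper-bound pre-check adds nothing:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/*
 * Write a value of size 1, 2, 4, or 8 bytes to dst; any other size falls
 * through to the default case and is rejected there, so no separate
 * "size too large" check is needed beforehand.
 */
static int
write_sized_value(void *dst, uint64_t val, size_t size)
{
	uint8_t v8;
	uint16_t v16;
	uint32_t v32;

	switch (size) {
	case 1:
		v8 = (uint8_t)val;
		memcpy(dst, &v8, sizeof (v8));
		break;
	case 2:
		v16 = (uint16_t)val;
		memcpy(dst, &v16, sizeof (v16));
		break;
	case 4:
		v32 = (uint32_t)val;
		memcpy(dst, &v32, sizeof (v32));
		break;
	case 8:
		memcpy(dst, &val, sizeof (val));
		break;
	default:
		return (-1);	/* invalid size */
	}
	return (0);
}

int
main(void)
{
	uint64_t buf = 0;

	/* a 2-byte write succeeds; a 3-byte write is rejected by the switch */
	return (write_sized_value(&buf, 0x1234, 2) == 0 &&
	    write_sized_value(&buf, 0x1234, 3) == -1 ? 0 : 1);
}
```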
[developer] Re: [openzfs/openzfs] 8727 Native data and metadata encryption for zfs (#489)
Got a dump with kmem_flags set, but I am not sure if this is the same issue, as I applied the patch to a more recent build.

panic[cpu0]/thread=ff000f5e7c40: BAD TRAP: type=e (#pf Page fault) rp=ff000f5e7790 addr=0 occurred in module "zfs" due to a NULL pointer dereference

zpool-zones: #pf Page fault
Bad kernel fault at addr=0x0 pid=89, pc=0xf7d9975b, sp=0xff000f5e7880, eflags=0x10282
cr0: 8005003b cr4: 6f8
cr2: 0 cr3: 1dc0 cr8: c
rdi: 0 rsi: 2 rdx: 80041
rcx: 0 r8: 0 r9: 0
rax: 2 rbx: 0 rbp: ff000f5e78b0
r10: fb8580d8 r11: feedfacefeedface r12: 2
r13: ff03ecaeb000 r14: ff03eee62800 r15: ff000f5e7ad0
fsb: 0 gsb: fbc399c0 ds: 4b
es: 4b fs: 0 gs: 1c3
trp: e err: 0 rip: f7d9975b
cs: 30 rfl: 10282 rsp: ff000f5e7880
ss: 38

ff000f5e7670 unix:real_mode_stop_cpu_stage2_end+b30c ()
ff000f5e7780 unix:trap+e08 ()
ff000f5e7790 unix:_cmntrap+e6 ()
ff000f5e78b0 zfs:abd_borrow_buf+1b ()
ff000f5e78f0 zfs:abd_borrow_buf_copy+23 ()
ff000f5e7930 zfs:vdev_disk_io_start+c7 ()
ff000f5e7990 zfs:zio_vdev_io_start+b1 ()
ff000f5e79c0 zfs:zio_execute+7f ()
ff000f5e7a00 zfs:vdev_queue_io_done+68 ()
ff000f5e7a40 zfs:zio_vdev_io_done+80 ()
ff000f5e7a70 zfs:zio_execute+7f ()
ff000f5e7b30 genunix:taskq_thread+2d0 ()
ff000f5e7b40 unix:thread_start+8 ()

Dump available at https://www.magentacloud.de/lnk/sprBVjAK

View it on GitHub: https://github.com/openzfs/openzfs/pull/489#issuecomment-365670841