Re: [LKP] Re: [ext4] d3b6f23f71: stress-ng.fiemap.ops_per_sec -60.5% regression
I looked at what the stress-ng fiemap workload is doing, and it's interesting. It runs 4 processes which call FIEMAP on a particular file in a loop, with a 25ms sleep every 64 calls. Then there is a fifth process which randomly writes to the file and punches holes at random offsets in the file. So this is quite different from what Ritesh has been benchmarking, which is fiemap in isolation, as opposed to fiemap racing against 3 other fiemap processes plus a process which is actively modifying the file.

In the original code, if I remember correctly, we used a shared reader/writer lock to look at the extent tree blocks directly, and we held the i_data_sem rw_semaphore for the duration of the fiemap call. In the new code, we go through the extent_status cache, which grabs the rw_spinlock each time we do a lookup in the extent status tree. This is much finer-grained locking, and that is probably the explanation for the increased time for running fiemap in the contended case. If this theory is correct, we would probably get back the performance by wrapping the call to iomap_fiemap() with {down,up}_read(&EXT4_I(inode)->i_data_sem) in ext4_fiemap().

That being said, however, it's not clear what real-life workload cares about FIEMAP performance, especially with multiple threads all calling FIEMAP racing against a file which is being actively modified. Having stress-ng do this to find potential kernel bugs is a great thing, so I understand why stress-ng might be doing this as a QA tool. Why we should care about stress-ng as a performance benchmark, at least in this case, is much less clear to me.

Cheers,

- Ted
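[For concreteness, here is a minimal sketch of the change Ted is suggesting, assuming the post-d3b6f23f71 shape of ext4_fiemap(), which passes ext4_iomap_report_ops to iomap_fiemap(). This is an untested illustration, not a vetted patch; argument checking and the rest of the function are elided.]

/*
 * Sketch of Ted's suggestion: hold the inode's i_data_sem in read mode
 * across the whole iomap_fiemap() call, so concurrent fiemap callers
 * contend on one rw_semaphore instead of taking the extent-status
 * tree's rw_spinlock once per extent lookup.
 */
static int ext4_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
		       __u64 start, __u64 len)
{
	int error;

	/* ... existing argument checking elided ... */

	down_read(&EXT4_I(inode)->i_data_sem);
	error = iomap_fiemap(inode, fieinfo, start, len,
			     &ext4_iomap_report_ops);
	up_read(&EXT4_I(inode)->i_data_sem);

	return error;
}

[A real patch would also have to confirm that nothing under the iomap report path already takes i_data_sem, since a nested down_read() can deadlock against a queued writer; whether the coarser lock actually wins under contention is exactly what the stress-ng numbers would need to confirm.]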
Re: [LKP] Re: [ext4] d3b6f23f71: stress-ng.fiemap.ops_per_sec -60.5% regression
On 7/22/2020 2:17 PM, Xing Zhengjun wrote:
> On 7/15/2020 7:04 PM, Ritesh Harjani wrote:
>> Hello Xing,
>>
>> On 4/7/20 1:30 PM, kernel test robot wrote:
>>> Greeting,
>>>
>>> FYI, we noticed a -60.5% regression of stress-ng.fiemap.ops_per_sec due to commit:
>>>
>>>   commit: d3b6f23f71670007817a5d59f3fbafab2b794e8c ("ext4: move ext4_fiemap to use iomap framework")
>>>   https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>>>
>>> in testcase: stress-ng
>>> on test machine: 96 threads Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz with 192G memory
>>> with following parameters: nr_threads: 10%, disk: 1HDD, testtime: 1s, class: os, cpufreq_governor: performance, ucode: 0x52c, fs: ext4
>>
>> I started looking into this issue, but with my unit testing I didn't find any perf issue with the fiemap ioctl call. I haven't yet explored how stress-ng takes its fiemap performance numbers; it could be doing something differently. In my testing I just made sure to create a file with a large number of extents and used the xfs_io -c "fiemap -v" command to check how much time it takes to read all the entries in the 1st and subsequent iterations.
>>
>> The setup was a qemu machine on x86_64 with the latest linux branch.
>>
>> 1. Created a file of 10G using fallocate (this allocated unwritten extents for the file).
>>
>> 2. Then punched a hole on every alternate block of the file. This step took a long time, and after a sufficiently long time I had to cancel it:
>>
>>    for i in $(seq 1 2 x); do echo $i; fallocate -p -o $(($i*4096)) -l 4096 bigfile; done
>>
>> 3. Then issued the fiemap call via xfs_io and took the time measurement:
>>
>>    time xfs_io -c "fiemap -v" bigfile > /dev/null
>>
>> Perf numbers on the latest default kernel build for the above command:
>>
>> 1st iteration
>> =============
>>    real    0m31.684s
>>    user    0m1.593s
>>    sys     0m24.174s
>>
>> 2nd and subsequent iterations
>>    real    0m3.379s
>>    user    0m1.300s
>>    sys     0m2.080s
>>
>> 4. Then I reverted all the iomap_fiemap patches and re-tested. This exercises the older ext4_fiemap implementation:
>>
>> 1st iteration
>> =============
>>    real    0m31.591s
>>    user    0m1.400s
>>    sys     0m24.243s
>>
>> 2nd and subsequent iterations (had to cancel it since it was taking more than 15m)
>>    ^C^C
>>    real    15m49.884s
>>    user    0m0.032s
>>    sys     15m49.722s
>>
>> I guess the reason the 2nd iteration with the older implementation takes so much time is that the previous implementation never cached extent entries in the extent_status tree. Also, in the 1st iteration the page cache may get filled with a lot of buffer_head entries, so page reclaim may be taking more time. With the latest implementation using iomap_fiemap(), the call to query extent blocks is done using ext4_map_blocks(), which by default will also cache the extent entries in the extent_status tree. Hence during the 2nd iteration we will directly read the entries from the extent_status tree and will not do any disk I/O.
>>
>> -ritesh

I re-tested on v5.9-rc1; the regression still exists. Have you tried the stress-ng test cases?

> Could you try the stress-ng (https://kernel.ubuntu.com/~cking/stress-ng/) test cases? The tarballs can be downloaded from https://kernel.ubuntu.com/~cking/tarballs/stress-ng/. The command for this case is:
>
>   stress-ng --timeout 1 --times --verify --metrics-brief --sequential 9 --class os --minimize --exclude spawn,exec,swap
>
> I re-tested on v5.8-rc6; the regression still existed.
=========================================================================================
tbox_group/testcase/rootfs/kconfig/compiler/debug-setup/nr_threads/disk/testtime/fs/class/cpufreq_governor/ucode:
  lkp-csl-2sp5/stress-ng/debian-10.4-x86_64-20200603.cgz/x86_64-rhel-8.3/gcc-9/test/10%/1HDD/1s/ext4/os/performance/0x5002f01

commit:
  b2c5764262edded1b1cfff5a6ca82c3d61bb4a4a
  d3b6f23f71670007817a5d59f3fbafab2b794e8c
  v5.8-rc6

b2c5764262edded1  d3b6f23f71670007817a5d59f3f          v5.8-rc6
----------------  ---------------------------  ----------------
       %stddev      %change      %stddev         %change      %stddev
           \            |            \               |            \
     20419 ± 3%       -4.9%     19423 ± 4%        +27.1%     25959         stress-ng.af-alg.ops
     19655 ± 3%       -5.7%     18537 ± 4%        +27.8%     25111         stress-ng.af-alg.ops_per_sec
     64.67 ± 5%      -17.0%     53.67 ± 38%       +22.2%     79.00 ± 9%    stress-ng.chdir.ops
     55.34 ± 3%      -13.3%     47.99 ± 38%       +26.4%     69.96 ± 10%   stress-ng.chdir.ops_per_sec
     64652 ± 7%      -14.1%     55545 ± 11%       -13.6%     55842 ± 6%    stress-ng.chown.ops
     64683 ± 7%      -14.1%     55565 ± 11%       -13.6%     55858 ± 6%    stress-ng.chown.ops_per_sec
      2805 ± 2%       +0.6%      2820 ± 2%       +130.0%      6452         stress-ng.clone.ops
      2802 ± 2%       +0.6%      2818 ± 2%       +129.9%      6443         stress-ng.clone.ops_per_sec
     34.67            +1.9%
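[An aside on step 2 of the recipe quoted above: the shell loop forks the fallocate(1) utility once per block, so a C loop around fallocate(2) is much cheaper per hole. A rough equivalent, as a sketch only; the "bigfile" name and 10G/4k geometry mirror the recipe, and the loop bound stands in for its unspecified "x".]

/*
 * Punch a hole in every other 4k block of "bigfile" so the file ends
 * up with a very large number of small extents.
 */
#define _GNU_SOURCE
#include <fcntl.h>      /* fallocate(), FALLOC_FL_* */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("bigfile", O_RDWR);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	long nblocks = (10L << 30) / 4096;	/* 10G file, 4k blocks */

	for (long i = 1; i < nblocks; i += 2) {
		/* PUNCH_HOLE must be combined with KEEP_SIZE */
		if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			      (off_t)i * 4096, 4096) < 0) {
			perror("fallocate");
			return 1;
		}
	}

	close(fd);
	return 0;
}

[Each call still has to update the extent tree under the inode's locks, so this step stays slow regardless of the driver, but it removes the per-block fork/exec overhead of the shell version.]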
Re: [ext4] d3b6f23f71: stress-ng.fiemap.ops_per_sec -60.5% regression
On 7/15/2020 7:04 PM, Ritesh Harjani wrote:
> Hello Xing,
>
> On 4/7/20 1:30 PM, kernel test robot wrote:
>> Greeting,
>>
>> FYI, we noticed a -60.5% regression of stress-ng.fiemap.ops_per_sec due to commit:
>>
>>   commit: d3b6f23f71670007817a5d59f3fbafab2b794e8c ("ext4: move ext4_fiemap to use iomap framework")
>>   https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>>
>> in testcase: stress-ng
>> on test machine: 96 threads Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz with 192G memory
>> with following parameters: nr_threads: 10%, disk: 1HDD, testtime: 1s, class: os, cpufreq_governor: performance, ucode: 0x52c, fs: ext4
>
> I started looking into this issue, but with my unit testing I didn't find any perf issue with the fiemap ioctl call. I haven't yet explored how stress-ng takes its fiemap performance numbers; it could be doing something differently. In my testing I just made sure to create a file with a large number of extents and used the xfs_io -c "fiemap -v" command to check how much time it takes to read all the entries in the 1st and subsequent iterations.
>
> The setup was a qemu machine on x86_64 with the latest linux branch.
>
> 1. Created a file of 10G using fallocate (this allocated unwritten extents for the file).
>
> 2. Then punched a hole on every alternate block of the file. This step took a long time, and after a sufficiently long time I had to cancel it:
>
>    for i in $(seq 1 2 x); do echo $i; fallocate -p -o $(($i*4096)) -l 4096 bigfile; done
>
> 3. Then issued the fiemap call via xfs_io and took the time measurement:
>
>    time xfs_io -c "fiemap -v" bigfile > /dev/null
>
> Perf numbers on the latest default kernel build for the above command:
>
> 1st iteration
> =============
>    real    0m31.684s
>    user    0m1.593s
>    sys     0m24.174s
>
> 2nd and subsequent iterations
>    real    0m3.379s
>    user    0m1.300s
>    sys     0m2.080s
>
> 4. Then I reverted all the iomap_fiemap patches and re-tested. This exercises the older ext4_fiemap implementation:
>
> 1st iteration
> =============
>    real    0m31.591s
>    user    0m1.400s
>    sys     0m24.243s
>
> 2nd and subsequent iterations (had to cancel it since it was taking more than 15m)
>    ^C^C
>    real    15m49.884s
>    user    0m0.032s
>    sys     15m49.722s
>
> I guess the reason the 2nd iteration with the older implementation takes so much time is that the previous implementation never cached extent entries in the extent_status tree. Also, in the 1st iteration the page cache may get filled with a lot of buffer_head entries, so page reclaim may be taking more time. With the latest implementation using iomap_fiemap(), the call to query extent blocks is done using ext4_map_blocks(), which by default will also cache the extent entries in the extent_status tree. Hence during the 2nd iteration we will directly read the entries from the extent_status tree and will not do any disk I/O.
>
> -ritesh

Could you try the stress-ng (https://kernel.ubuntu.com/~cking/stress-ng/) test cases? The tarballs can be downloaded from https://kernel.ubuntu.com/~cking/tarballs/stress-ng/. The command for this case is:

  stress-ng --timeout 1 --times --verify --metrics-brief --sequential 9 --class os --minimize --exclude spawn,exec,swap

I re-tested on v5.8-rc6; the regression still exists.
=========================================================================================
tbox_group/testcase/rootfs/kconfig/compiler/debug-setup/nr_threads/disk/testtime/fs/class/cpufreq_governor/ucode:
  lkp-csl-2sp5/stress-ng/debian-10.4-x86_64-20200603.cgz/x86_64-rhel-8.3/gcc-9/test/10%/1HDD/1s/ext4/os/performance/0x5002f01

commit:
  b2c5764262edded1b1cfff5a6ca82c3d61bb4a4a
  d3b6f23f71670007817a5d59f3fbafab2b794e8c
  v5.8-rc6

b2c5764262edded1  d3b6f23f71670007817a5d59f3f          v5.8-rc6
----------------  ---------------------------  ----------------
       %stddev      %change      %stddev         %change      %stddev
           \            |            \               |            \
     20419 ± 3%       -4.9%     19423 ± 4%        +27.1%     25959         stress-ng.af-alg.ops
     19655 ± 3%       -5.7%     18537 ± 4%        +27.8%     25111         stress-ng.af-alg.ops_per_sec
     64.67 ± 5%      -17.0%     53.67 ± 38%       +22.2%     79.00 ± 9%    stress-ng.chdir.ops
     55.34 ± 3%      -13.3%     47.99 ± 38%       +26.4%     69.96 ± 10%   stress-ng.chdir.ops_per_sec
     64652 ± 7%      -14.1%     55545 ± 11%       -13.6%     55842 ± 6%    stress-ng.chown.ops
     64683 ± 7%      -14.1%     55565 ± 11%       -13.6%     55858 ± 6%    stress-ng.chown.ops_per_sec
      2805 ± 2%       +0.6%      2820 ± 2%       +130.0%      6452         stress-ng.clone.ops
      2802 ± 2%       +0.6%      2818 ± 2%       +129.9%      6443         stress-ng.clone.ops_per_sec
     34.67            +1.9%     35.33 ± 3%        -9.6%     31.33 ± 3%    stress-ng.copy-file.ops
     22297 ± 23%     +26.7%     28258 ± 2%       +38.1%     30783 ± 14%   stress-ng.dir.ops_per_sec
Re: [ext4] d3b6f23f71: stress-ng.fiemap.ops_per_sec -60.5% regression
Hello Xing,

On 4/7/20 1:30 PM, kernel test robot wrote:
> Greeting,
>
> FYI, we noticed a -60.5% regression of stress-ng.fiemap.ops_per_sec due to commit:
>
>   commit: d3b6f23f71670007817a5d59f3fbafab2b794e8c ("ext4: move ext4_fiemap to use iomap framework")
>   https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>
> in testcase: stress-ng
> on test machine: 96 threads Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz with 192G memory
> with following parameters:
>
>   nr_threads: 10%
>   disk: 1HDD
>   testtime: 1s
>   class: os
>   cpufreq_governor: performance
>   ucode: 0x52c
>   fs: ext4

I started looking into this issue, but with my unit testing I didn't find any perf issue with the fiemap ioctl call. I haven't yet explored how stress-ng takes its fiemap performance numbers; it could be doing something differently. In my testing I just made sure to create a file with a large number of extents and used the xfs_io -c "fiemap -v" command to check how much time it takes to read all the entries in the 1st and subsequent iterations.

The setup was a qemu machine on x86_64 with the latest linux branch.

1. Created a file of 10G using fallocate (this allocated unwritten extents for the file).

2. Then punched a hole on every alternate block of the file. This step took a long time, and after a sufficiently long time I had to cancel it:

   for i in $(seq 1 2 x); do echo $i; fallocate -p -o $(($i*4096)) -l 4096 bigfile; done

3. Then issued the fiemap call via xfs_io and took the time measurement:

   time xfs_io -c "fiemap -v" bigfile > /dev/null

Perf numbers on the latest default kernel build for the above command:

1st iteration
=============
   real    0m31.684s
   user    0m1.593s
   sys     0m24.174s

2nd and subsequent iterations
   real    0m3.379s
   user    0m1.300s
   sys     0m2.080s

4. Then I reverted all the iomap_fiemap patches and re-tested. This exercises the older ext4_fiemap implementation:

1st iteration
=============
   real    0m31.591s
   user    0m1.400s
   sys     0m24.243s

2nd and subsequent iterations (had to cancel it since it was taking more than 15m)
   ^C^C
   real    15m49.884s
   user    0m0.032s
   sys     15m49.722s

I guess the reason the 2nd iteration with the older implementation takes so much time is that the previous implementation never cached extent entries in the extent_status tree. Also, in the 1st iteration the page cache may get filled with a lot of buffer_head entries, so page reclaim may be taking more time. With the latest implementation using iomap_fiemap(), the call to query extent blocks is done using ext4_map_blocks(), which by default will also cache the extent entries in the extent_status tree. Hence during the 2nd iteration we will directly read the entries from the extent_status tree and will not do any disk I/O.

-ritesh
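[For anyone who wants to time the kernel side without xfs_io: `xfs_io -c "fiemap -v"` boils down to iterating the FS_IOC_FIEMAP ioctl. A minimal standalone walker looks roughly like the sketch below; the 512-extent batch size is an arbitrary choice, and error handling is minimal.]

/*
 * Walk all extents of a file via FS_IOC_FIEMAP, batching 512 extents
 * per ioctl, and print how many extents were returned in total.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>		/* FS_IOC_FIEMAP */
#include <linux/fiemap.h>	/* struct fiemap, FIEMAP_* */

#define BATCH 512

int main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}

	int fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	size_t sz = sizeof(struct fiemap) + BATCH * sizeof(struct fiemap_extent);
	struct fiemap *fm = malloc(sz);
	unsigned long long nextents = 0;
	__u64 start = 0;

	for (;;) {
		memset(fm, 0, sz);
		fm->fm_start = start;
		fm->fm_length = FIEMAP_MAX_OFFSET - start;	/* to EOF */
		fm->fm_extent_count = BATCH;

		if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
			perror("FS_IOC_FIEMAP");
			return 1;
		}
		if (fm->fm_mapped_extents == 0)
			break;

		nextents += fm->fm_mapped_extents;

		struct fiemap_extent *last =
			&fm->fm_extents[fm->fm_mapped_extents - 1];
		if (last->fe_flags & FIEMAP_EXTENT_LAST)
			break;
		/* resume the walk just past the last extent returned */
		start = last->fe_logical + last->fe_length;
	}

	printf("%llu extents\n", nextents);
	free(fm);
	close(fd);
	return 0;
}

[Running this under `time` against the prepared bigfile should show the same first-vs-second iteration difference, since the extent_status caching happens in the kernel regardless of which userspace tool issues the ioctl.]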
Re: [LKP] [ext4] d3b6f23f71: stress-ng.fiemap.ops_per_sec -60.5% regression
Hi Ritesh,

I tested; the regression still exists in v5.8-rc1. Do you have time to take a look at it? Thanks.

On 4/14/2020 1:49 PM, Xing Zhengjun wrote:
> Thanks for your quick response. If you need any more test information about the regression, please let me know.
>
> On 4/13/2020 6:56 PM, Ritesh Harjani wrote:
>> On 4/13/20 2:07 PM, Xing Zhengjun wrote:
>>> Hi Harjani,
>>>
>>> Do you have time to take a look at this? Thanks.
>>
>> Hello Xing,
>>
>> I do want to look into this, but as of now I am stuck with another mballoc failure issue. I will get back to this once I have some handle on that one. BTW, are you planning to take a look at this?
>>
>> -ritesh
>>
>>> On 4/7/2020 4:00 PM, kernel test robot wrote:
>>>> Greeting,
>>>>
>>>> FYI, we noticed a -60.5% regression of stress-ng.fiemap.ops_per_sec due to commit:
>>>>
>>>>   commit: d3b6f23f71670007817a5d59f3fbafab2b794e8c ("ext4: move ext4_fiemap to use iomap framework")
>>>>   https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>>>>
>>>> in testcase: stress-ng
>>>> on test machine: 96 threads Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz with 192G memory
>>>> with following parameters: nr_threads: 10%, disk: 1HDD, testtime: 1s, class: os, cpufreq_governor: performance, ucode: 0x52c, fs: ext4
>>>>
>>>> Details are as below:
>>>>
>>>> To reproduce:
>>>>
>>>>   git clone https://github.com/intel/lkp-tests.git
>>>>   cd lkp-tests
>>>>   bin/lkp install job.yaml   # job file is attached in this email
>>>>   bin/lkp run job.yaml
>>>>
>>>> =========================================================================================
>>>> class/compiler/cpufreq_governor/disk/fs/kconfig/nr_threads/rootfs/tbox_group/testcase/testtime/ucode:
>>>>   os/gcc-7/performance/1HDD/ext4/x86_64-rhel-7.6/10%/debian-x86_64-20191114.cgz/lkp-csl-2sp5/stress-ng/1s/0x52c
>>>>
>>>> commit:
>>>>   b2c5764262 ("ext4: make ext4_ind_map_blocks work with fiemap")
>>>>   d3b6f23f71 ("ext4: move ext4_fiemap to use iomap framework")
>>>>
>>>> b2c5764262edded1  d3b6f23f71670007817a5d59f3f
>>>> ----------------  ---------------------------
>>>>        fail:runs  %reproduction   fail:runs
>>>>          :4           25%            1:4    dmesg.WARNING:at#for_ip_interrupt_entry/0x
>>>>         2:4            5%            2:4    perf-profile.calltrace.cycles-pp.sync_regs.error_entry
>>>>         2:4            6%            3:4    perf-profile.calltrace.cycles-pp.error_entry
>>>>         3:4            9%            3:4    perf-profile.children.cycles-pp.error_entry
>>>>         0:4            1%            0:4    perf-profile.self.cycles-pp.error_entry
>>>>
>>>>        %stddev      %change      %stddev
>>>>            \            |            \
>>>>      28623           +28.2%      36703 ± 12%   stress-ng.daemon.ops
>>>>      28632           +28.2%      36704 ± 12%   stress-ng.daemon.ops_per_sec
>>>>     566.00 ± 22%     -53.2%     265.00 ± 53%   stress-ng.dev.ops
>>>>     278.81 ± 22%     -53.0%     131.00 ± 54%   stress-ng.dev.ops_per_sec
>>>>      73160           -60.6%      28849 ± 3%    stress-ng.fiemap.ops
>>>>      72471           -60.5%      28612 ± 3%    stress-ng.fiemap.ops_per_sec
>>>>      23421 ± 12%     +21.2%      28388 ± 6%    stress-ng.filename.ops
>>>>      22638 ± 12%     +20.3%      27241 ± 6%    stress-ng.filename.ops_per_sec
>>>>      21.25 ± 7%      -10.6%      19.00 ± 3%    stress-ng.iomix.ops
>>>>      38.75 ± 49%     -47.7%      20.25 ± 96%   stress-ng.memhotplug.ops
>>>>      34.45 ± 52%     -51.8%      16.62 ±106%   stress-ng.memhotplug.ops_per_sec
>>>>       1734 ± 10%     +31.4%       2278 ± 10%   stress-ng.resources.ops
>>>>     807.56 ± 5%      +35.2%       1091 ± 8%    stress-ng.resources.ops_per_sec
>>>>    1007356 ± 3%      -16.5%     840642 ± 9%    stress-ng.revio.ops
>>>>    1007692 ± 3%      -16.6%     840711 ± 9%    stress-ng.revio.ops_per_sec
>>>>      21812 ± 3%      +16.0%      25294 ± 5%    stress-ng.sysbadaddr.ops
>>>>      21821 ± 3%      +15.9%      25294 ± 5%    stress-ng.sysbadaddr.ops_per_sec
>>>>     440.75 ± 4%      +21.9%     537.25 ± 9%    stress-ng.sysfs.ops
>>>>     440.53 ± 4%      +21.9%     536.86 ± 9%    stress-ng.sysfs.ops_per_sec
>>>>   13286582           -11.1%   11805520 ± 6%    stress-ng.time.file_system_outputs
>>>>   68253896            +2.4%   69860122         stress-ng.time.minor_page_faults
>>>>     197.00 ± 4%      -15.9%     165.75 ± 12%   stress-ng.xattr.ops
>>>>     192.45 ± 5%      -16.1%     161.46 ± 11%   stress-ng.xattr.ops_per_sec
>>>>      15310           +62.5%      24875 ± 22%   stress-ng.zombie.ops
>>>>      15310           +62.5%      24874 ± 22%   stress-ng.zombie.ops_per_sec
>>>>     203.50 ± 12%     -47.3%     107.25 ± 49%   vmstat.io.bi
>>>>     861318 ± 18%     -29.7%     605884 ± 5%    meminfo.AnonHugePages
>>>>    1062742 ± 14%     -20.2%     847853 ± 3%    meminfo.AnonPages
>>>>      31093 ± 6%       +9.6%      34090 ± 3%    meminfo.KernelStack
>>>>       7151 ± 34%     +55.8%      11145 ± 9%    meminfo.Mlocked
>>>>  1.082e+08 ± 5%      -40.2%   64705429 ± 31%   numa-numastat.node0.local_node
>>>>  1.082e+08