Re: [LKP] Re: [ext4] d3b6f23f71: stress-ng.fiemap.ops_per_sec -60.5% regression

2020-08-19 Thread Theodore Y. Ts'o
I looked at what the stress-ng fiemap workload is doing, and
it's interesting.

It is running 4 processes which are calling FIEMAP on a particular
file in a loop, with a 25ms sleep every 64 calls.  And then there is a
fifth process which is randomly writing to the file and calling
punch_hole at random offsets in the file.
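
For reference, here is a minimal sketch of that access pattern written
directly against the FIEMAP ioctl.  It is only an illustration, not
stress-ng's actual stressor: the file name, file size, and 64-extent
batch size are made up.

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/fiemap.h>
#include <linux/fs.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Reader: loop FS_IOC_FIEMAP, pausing 25ms every 64 calls. */
static void fiemap_reader(int fd)
{
    struct { struct fiemap fm; struct fiemap_extent ext[64]; } buf;
    unsigned long i;

    for (i = 0; ; i++) {
        memset(&buf, 0, sizeof(buf));
        buf.fm.fm_length = FIEMAP_MAX_OFFSET;   /* map the whole file */
        buf.fm.fm_extent_count = 64;
        ioctl(fd, FS_IOC_FIEMAP, &buf.fm);
        if ((i & 63) == 63)
            usleep(25000);
    }
}

/* Writer: random 4k writes plus punch-hole at random offsets. */
static void punch_writer(int fd, off_t fsize)
{
    char data[4096] = { 0 };
    off_t off;

    for (;;) {
        off = (rand() % (fsize / 4096)) * 4096;
        pwrite(fd, data, sizeof(data), off);
        off = (rand() % (fsize / 4096)) * 4096;
        fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, off, 4096);
    }
}

int main(void)
{
    off_t fsize = 64 << 20;                     /* illustrative size */
    int fd = open("fiemap-test", O_RDWR | O_CREAT, 0644);
    int i;

    if (fd < 0 || ftruncate(fd, fsize))
        return 1;
    for (i = 0; i < 4; i++)                     /* four FIEMAP readers */
        if (fork() == 0)
            fiemap_reader(fd);
    punch_writer(fd, fsize);                    /* fifth process: writer */
    return 0;
}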

So this is quite different from what Ritesh has been benchmarking,
which was fiemap in isolation, as opposed to fiemap racing against
three other fiemap processes plus a process which is actively
modifying the file.

In the original code, if I remember correctly, we were using a shared
reader/writer lock to look at the extent tree blocks directly, and we
held the i_data_sem rw_semaphore for the duration of the fiemap call.

In the new code, we're going through the extent_status cache, which
grabs the rw_spinlock each time we do a lookup in the extent status
tree.  So this is much finer-grained locking, and that is probably the
explanation for the increased time for running fiemap in the contended
case.

If this theory is correct, we would probably get back the performance
by wrapping the call to iomap_fiemap() in ext4_fiemap() with
{down,up}_read() of i_data_sem.
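
A rough sketch of that idea (not a tested patch; it assumes the
ext4_iomap_report_ops and EXT4_I() helpers from the ext4 tree of that
era, and it ignores the inline-data/xattr special cases that
ext4_fiemap() handles first):

/* Sketch only: serialize the whole FIEMAP walk against writers once,
 * instead of taking the extent-status tree lock per extent lookup. */
int ext4_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
                __u64 start, __u64 len)
{
    int ret;

    /* ... existing inline-data / xattr handling elided ... */

    down_read(&EXT4_I(inode)->i_data_sem);
    ret = iomap_fiemap(inode, fieinfo, start, len,
                       &ext4_iomap_report_ops);
    up_read(&EXT4_I(inode)->i_data_sem);

    return ret;
}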

That being said, however, it's not clear what real-life workload cares
about FIEMAP performance, especially with multiple threads all calling
FIEMAP racing against a file which is being actively modified.  Having
stress-ng do this to find potential kernel bugs is a great thing, so I
understand why stress-ng might be doing this as a QA tool.  Why we
should care about stress-ng as a performance benchmark, at least in
this case, is much less clear to me.

Cheers,

- Ted


Re: [LKP] Re: [ext4] d3b6f23f71: stress-ng.fiemap.ops_per_sec -60.5% regression

2020-08-18 Thread Xing Zhengjun




On 7/22/2020 2:17 PM, Xing Zhengjun wrote:



On 7/15/2020 7:04 PM, Ritesh Harjani wrote:

Hello Xing,

On 4/7/20 1:30 PM, kernel test robot wrote:

Greeting,

FYI, we noticed a -60.5% regression of stress-ng.fiemap.ops_per_sec 
due to commit:



commit: d3b6f23f71670007817a5d59f3fbafab2b794e8c ("ext4: move 
ext4_fiemap to use iomap framework")

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master

in testcase: stress-ng
on test machine: 96 threads Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz 
with 192G memory

with following parameters:

nr_threads: 10%
disk: 1HDD
testtime: 1s
class: os
cpufreq_governor: performance
ucode: 0x52c
fs: ext4


I started looking into this issue. But with my unit testing, I didn't
find any perf issue with the fiemap ioctl call. I haven't yet explored
how stress-ng takes its fiemap performance numbers; it could be doing
something differently. But in my testing I just made sure to create a
file with a large number of extents and used the xfs_io -c "fiemap -v"
cmd to check how much time it takes to read all the entries in the 1st
and subsequent iterations.


The setup was a qemu machine on x86_64 running the latest Linux branch.

1. Created a 10G file using fallocate (this allocated unwritten
extents for the file).

2. Then I punched a hole in every alternate block of the file. This step
took a long time, and after a sufficiently long time I had to cancel it.
for i in $(seq 1 2 x); do echo $i; fallocate -p -o $(($i*4096)) -l 4096 bigfile; done


3. Then issued fiemap call via xfs_io and took the time measurement.
time xfs_io -c "fiemap -v" bigfile > /dev/null


Perf numbers on latest default kernel build for above cmd.

1st iteration
==
real    0m31.684s
user    0m1.593s
sys 0m24.174s

2nd and subsequent iteration

real    0m3.379s
user    0m1.300s
sys 0m2.080s


4. Then I reverted all the iomap_fiemap patches and re-tested this.
With this the older ext4_fiemap implementation will be tested:-


1st iteration
==
real    0m31.591s
user    0m1.400s
sys 0m24.243s


2nd and subsequent iteration (had to cancel it since it was taking
more time than 15m)


^C^C
real    15m49.884s
user    0m0.032s
sys 15m49.722s

I guess the reason the 2nd iteration with the older implementation takes
so much time is that the previous implementation never cached extent
entries in the extent_status tree. Also, in the 1st iteration the page
cache may get filled with a lot of buffer_head entries, so maybe page
reclaims are taking more time.

With the latest implementation using iomap_fiemap(), the call to
query extent blocks is done using ext4_map_blocks(), which by default
also caches the extent entries in the extent_status tree. Hence, during
the 2nd iteration we read the entries directly from the extent_status
tree and do no disk I/O.


-ritesh


I re-tested it on v5.9-rc1 and the regression still exists. Have you
tried the stress-ng test cases?




Could you try the stress-ng (https://kernel.ubuntu.com/~cking/stress-ng/)
test cases? The tarballs can be downloaded from
https://kernel.ubuntu.com/~cking/tarballs/stress-ng/.
For this case you can try the command "stress-ng --timeout 1 --times
--verify --metrics-brief --sequential 9 --class os --minimize --exclude
spawn,exec,swap"

I re-tested it on v5.8-rc6 and the regression still exists.

=========================================================================================
tbox_group/testcase/rootfs/kconfig/compiler/debug-setup/nr_threads/disk/testtime/fs/class/cpufreq_governor/ucode:
  lkp-csl-2sp5/stress-ng/debian-10.4-x86_64-20200603.cgz/x86_64-rhel-8.3/gcc-9/test/10%/1HDD/1s/ext4/os/performance/0x5002f01


commit:
   b2c5764262edded1b1cfff5a6ca82c3d61bb4a4a
   d3b6f23f71670007817a5d59f3fbafab2b794e8c
   v5.8-rc6

b2c5764262edded1 d3b6f23f71670007817a5d59f3f            v5.8-rc6
---------------- --------------------------- ---------------------------
         %stddev     %change         %stddev     %change         %stddev
             \          |                \          |                \
     20419 ±  3%      -4.9%      19423 ±  4%     +27.1%      25959        stress-ng.af-alg.ops
     19655 ±  3%      -5.7%      18537 ±  4%     +27.8%      25111        stress-ng.af-alg.ops_per_sec
     64.67 ±  5%     -17.0%      53.67 ± 38%     +22.2%      79.00 ±  9%  stress-ng.chdir.ops
     55.34 ±  3%     -13.3%      47.99 ± 38%     +26.4%      69.96 ± 10%  stress-ng.chdir.ops_per_sec
     64652 ±  7%     -14.1%      55545 ± 11%     -13.6%      55842 ±  6%  stress-ng.chown.ops
     64683 ±  7%     -14.1%      55565 ± 11%     -13.6%      55858 ±  6%  stress-ng.chown.ops_per_sec
      2805 ±  2%      +0.6%       2820 ±  2%    +130.0%       6452        stress-ng.clone.ops
      2802 ±  2%      +0.6%       2818 ±  2%    +129.9%       6443        stress-ng.clone.ops_per_sec
     34.67            +1.9%

Re: [ext4] d3b6f23f71: stress-ng.fiemap.ops_per_sec -60.5% regression

2020-07-22 Thread Xing Zhengjun




On 7/15/2020 7:04 PM, Ritesh Harjani wrote:

Hello Xing,

On 4/7/20 1:30 PM, kernel test robot wrote:

Greeting,

FYI, we noticed a -60.5% regression of stress-ng.fiemap.ops_per_sec 
due to commit:



commit: d3b6f23f71670007817a5d59f3fbafab2b794e8c ("ext4: move 
ext4_fiemap to use iomap framework")

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master

in testcase: stress-ng
on test machine: 96 threads Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz 
with 192G memory

with following parameters:

nr_threads: 10%
disk: 1HDD
testtime: 1s
class: os
cpufreq_governor: performance
ucode: 0x52c
fs: ext4


I started looking into this issue. But with my unit testing, I didn't
find any perf issue with the fiemap ioctl call. I haven't yet explored
how stress-ng takes its fiemap performance numbers; it could be doing
something differently. But in my testing I just made sure to create a
file with a large number of extents and used the xfs_io -c "fiemap -v"
cmd to check how much time it takes to read all the entries in the 1st
and subsequent iterations.


The setup was a qemu machine on x86_64 running the latest Linux branch.

1. Created a 10G file using fallocate (this allocated unwritten
extents for the file).

2. Then I punched a hole in every alternate block of the file. This step
took a long time, and after a sufficiently long time I had to cancel it.
for i in $(seq 1 2 x); do echo $i; fallocate -p -o $(($i*4096)) -l 4096 bigfile; done


3. Then issued fiemap call via xfs_io and took the time measurement.
time xfs_io -c "fiemap -v" bigfile > /dev/null


Perf numbers on latest default kernel build for above cmd.

1st iteration
==
real    0m31.684s
user    0m1.593s
sys 0m24.174s

2nd and subsequent iteration

real    0m3.379s
user    0m1.300s
sys 0m2.080s


4. Then I reverted all the iomap_fiemap patches and re-tested this.
With this the older ext4_fiemap implementation will be tested:-


1st iteration
==
real    0m31.591s
user    0m1.400s
sys 0m24.243s


2nd and subsequent iteration (had to cancel it since it was taking
more time than 15m)


^C^C
real    15m49.884s
user    0m0.032s
sys 15m49.722s

I guess the reason the 2nd iteration with the older implementation takes
so much time is that the previous implementation never cached extent
entries in the extent_status tree. Also, in the 1st iteration the page
cache may get filled with a lot of buffer_head entries, so maybe page
reclaims are taking more time.

With the latest implementation using iomap_fiemap(), the call to
query extent blocks is done using ext4_map_blocks(), which by default
also caches the extent entries in the extent_status tree. Hence, during
the 2nd iteration we read the entries directly from the extent_status
tree and do no disk I/O.


-ritesh


Could you try the stress-ng (https://kernel.ubuntu.com/~cking/stress-ng/)
test cases? The tarballs can be downloaded from
https://kernel.ubuntu.com/~cking/tarballs/stress-ng/.
For this case you can try the command "stress-ng --timeout 1 --times
--verify --metrics-brief --sequential 9 --class os --minimize --exclude
spawn,exec,swap"

I re-tested it on v5.8-rc6 and the regression still exists.

=========================================================================================
tbox_group/testcase/rootfs/kconfig/compiler/debug-setup/nr_threads/disk/testtime/fs/class/cpufreq_governor/ucode:
  lkp-csl-2sp5/stress-ng/debian-10.4-x86_64-20200603.cgz/x86_64-rhel-8.3/gcc-9/test/10%/1HDD/1s/ext4/os/performance/0x5002f01

commit:
  b2c5764262edded1b1cfff5a6ca82c3d61bb4a4a
  d3b6f23f71670007817a5d59f3fbafab2b794e8c
  v5.8-rc6

b2c5764262edded1 d3b6f23f71670007817a5d59f3f            v5.8-rc6
---------------- --------------------------- ---------------------------
         %stddev     %change         %stddev     %change         %stddev
             \          |                \          |                \
     20419 ±  3%      -4.9%      19423 ±  4%     +27.1%      25959        stress-ng.af-alg.ops
     19655 ±  3%      -5.7%      18537 ±  4%     +27.8%      25111        stress-ng.af-alg.ops_per_sec
     64.67 ±  5%     -17.0%      53.67 ± 38%     +22.2%      79.00 ±  9%  stress-ng.chdir.ops
     55.34 ±  3%     -13.3%      47.99 ± 38%     +26.4%      69.96 ± 10%  stress-ng.chdir.ops_per_sec
     64652 ±  7%     -14.1%      55545 ± 11%     -13.6%      55842 ±  6%  stress-ng.chown.ops
     64683 ±  7%     -14.1%      55565 ± 11%     -13.6%      55858 ±  6%  stress-ng.chown.ops_per_sec
      2805 ±  2%      +0.6%       2820 ±  2%    +130.0%       6452        stress-ng.clone.ops
      2802 ±  2%      +0.6%       2818 ±  2%    +129.9%       6443        stress-ng.clone.ops_per_sec
     34.67            +1.9%      35.33 ±  3%      -9.6%      31.33 ±  3%  stress-ng.copy-file.ops
     22297 ± 23%     +26.7%      28258 ±  2%     +38.1%      30783 ± 14%  stress-ng.dir.ops_per_sec

Re: [ext4] d3b6f23f71: stress-ng.fiemap.ops_per_sec -60.5% regression

2020-07-15 Thread Ritesh Harjani

Hello Xing,

On 4/7/20 1:30 PM, kernel test robot wrote:

Greeting,

FYI, we noticed a -60.5% regression of stress-ng.fiemap.ops_per_sec due to 
commit:


commit: d3b6f23f71670007817a5d59f3fbafab2b794e8c ("ext4: move ext4_fiemap to use 
iomap framework")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master

in testcase: stress-ng
on test machine: 96 threads Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz with 192G 
memory
with following parameters:

nr_threads: 10%
disk: 1HDD
testtime: 1s
class: os
cpufreq_governor: performance
ucode: 0x52c
fs: ext4


I started looking into this issue. But with my unit testing, I didn't
find any perf issue with the fiemap ioctl call. I haven't yet explored
how stress-ng takes its fiemap performance numbers; it could be doing
something differently. But in my testing I just made sure to create a
file with a large number of extents and used the xfs_io -c "fiemap -v"
cmd to check how much time it takes to read all the entries in the 1st
and subsequent iterations.


The setup was a qemu machine on x86_64 running the latest Linux branch.

1. Created a 10G file using fallocate (this allocated unwritten
extents for the file).

2. Then I punched a hole in every alternate block of the file. This step
took a long time, and after a sufficiently long time I had to cancel it.
for i in $(seq 1 2 x); do echo $i; fallocate -p -o $(($i*4096)) -l 4096 bigfile; done


3. Then issued fiemap call via xfs_io and took the time measurement.
time xfs_io -c "fiemap -v" bigfile > /dev/null


Perf numbers on latest default kernel build for above cmd.

1st iteration
==
real    0m31.684s
user    0m1.593s
sys     0m24.174s

2nd and subsequent iteration

real    0m3.379s
user    0m1.300s
sys     0m2.080s


4. Then I reverted all the iomap_fiemap patches and re-tested this.
With this the older ext4_fiemap implementation will be tested:-


1st iteration
==
real    0m31.591s
user    0m1.400s
sys     0m24.243s


2nd and subsequent iteration (had to cancel it since it was taking
more time than 15m)


^C^C
real    15m49.884s
user    0m0.032s
sys     15m49.722s

I guess the reason the 2nd iteration with the older implementation takes
so much time is that the previous implementation never cached extent
entries in the extent_status tree. Also, in the 1st iteration the page
cache may get filled with a lot of buffer_head entries, so maybe page
reclaims are taking more time.

With the latest implementation using iomap_fiemap(), the call to
query extent blocks is done using ext4_map_blocks(), which by default
also caches the extent entries in the extent_status tree. Hence, during
the 2nd iteration we read the entries directly from the extent_status
tree and do no disk I/O.
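
For reference, the cold vs. warm behaviour described above can also be
timed directly against the FIEMAP ioctl, without xfs_io. A minimal
sketch follows; it assumes the same "bigfile" used in step 3, and the
256-extent batch size is an arbitrary choice.

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/fiemap.h>
#include <linux/fs.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <time.h>
#include <unistd.h>

/* Walk the whole extent map via FS_IOC_FIEMAP and return elapsed seconds. */
static double fiemap_walk(int fd)
{
    struct { struct fiemap fm; struct fiemap_extent ext[256]; } buf;
    struct timespec t0, t1;
    __u64 start = 0;
    int done = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    while (!done) {
        memset(&buf, 0, sizeof(buf));
        buf.fm.fm_start = start;
        buf.fm.fm_length = FIEMAP_MAX_OFFSET - start;
        buf.fm.fm_extent_count = 256;
        if (ioctl(fd, FS_IOC_FIEMAP, &buf.fm) < 0 ||
            buf.fm.fm_mapped_extents == 0)
            break;
        struct fiemap_extent *last = &buf.ext[buf.fm.fm_mapped_extents - 1];
        start = last->fe_logical + last->fe_length;   /* next batch */
        done = last->fe_flags & FIEMAP_EXTENT_LAST;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    int fd = open("bigfile", O_RDONLY);

    if (fd < 0) {
        perror("open");
        return 1;
    }
    printf("1st pass: %.3fs\n", fiemap_walk(fd));   /* may read extent blocks */
    printf("2nd pass: %.3fs\n", fiemap_walk(fd));   /* should hit the cache */
    return 0;
}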


-ritesh


Re: [LKP] [ext4] d3b6f23f71: stress-ng.fiemap.ops_per_sec -60.5% regression

2020-06-16 Thread Xing Zhengjun

Hi Ritesh,

   I tested it, and the regression still exists in v5.8-rc1. Do you have
time to take a look at it? Thanks.


On 4/14/2020 1:49 PM, Xing Zhengjun wrote:
Thanks for your quick response. If you need any more test information
about the regression, please let me know.


On 4/13/2020 6:56 PM, Ritesh Harjani wrote:



On 4/13/20 2:07 PM, Xing Zhengjun wrote:

Hi Harjani,

    Do you have time to take a look at this? Thanks.


Hello Xing,

I do want to look into this. But as of now I am stuck with another
mballoc failure issue. I will get back to this once I have a handle
on that one.

BTW, are you planning to take a look at this?

-ritesh




On 4/7/2020 4:00 PM, kernel test robot wrote:

Greeting,

FYI, we noticed a -60.5% regression of stress-ng.fiemap.ops_per_sec 
due to commit:



commit: d3b6f23f71670007817a5d59f3fbafab2b794e8c ("ext4: move 
ext4_fiemap to use iomap framework")

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master

in testcase: stress-ng
on test machine: 96 threads Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz 
with 192G memory

with following parameters:

nr_threads: 10%
disk: 1HDD
testtime: 1s
class: os
cpufreq_governor: performance
ucode: 0x52c
fs: ext4






Details are as below:




To reproduce:

 git clone https://github.com/intel/lkp-tests.git
 cd lkp-tests
 bin/lkp install job.yaml  # job file is attached in this email
 bin/lkp run job.yaml

=========================================================================================
class/compiler/cpufreq_governor/disk/fs/kconfig/nr_threads/rootfs/tbox_group/testcase/testtime/ucode:
  os/gcc-7/performance/1HDD/ext4/x86_64-rhel-7.6/10%/debian-x86_64-20191114.cgz/lkp-csl-2sp5/stress-ng/1s/0x52c


commit:
   b2c5764262 ("ext4: make ext4_ind_map_blocks work with fiemap")
   d3b6f23f71 ("ext4: move ext4_fiemap to use iomap framework")

b2c5764262edded1 d3b6f23f71670007817a5d59f3f
---------------- ---------------------------
       fail:runs  %reproduction    fail:runs
           |             |             |
          :4            25%           1:4   dmesg.WARNING:at#for_ip_interrupt_entry/0x
         2:4             5%           2:4   perf-profile.calltrace.cycles-pp.sync_regs.error_entry
         2:4             6%           3:4   perf-profile.calltrace.cycles-pp.error_entry
         3:4             9%           3:4   perf-profile.children.cycles-pp.error_entry
         0:4             1%           0:4   perf-profile.self.cycles-pp.error_entry

         %stddev     %change         %stddev
             \          |                \
     28623           +28.2%      36703 ± 12%  stress-ng.daemon.ops
     28632           +28.2%      36704 ± 12%  stress-ng.daemon.ops_per_sec
    566.00 ± 22%     -53.2%     265.00 ± 53%  stress-ng.dev.ops
    278.81 ± 22%     -53.0%     131.00 ± 54%  stress-ng.dev.ops_per_sec
     73160           -60.6%      28849 ±  3%  stress-ng.fiemap.ops
     72471           -60.5%      28612 ±  3%  stress-ng.fiemap.ops_per_sec
     23421 ± 12%     +21.2%      28388 ±  6%  stress-ng.filename.ops
     22638 ± 12%     +20.3%      27241 ±  6%  stress-ng.filename.ops_per_sec
     21.25 ±  7%     -10.6%      19.00 ±  3%  stress-ng.iomix.ops
     38.75 ± 49%     -47.7%      20.25 ± 96%  stress-ng.memhotplug.ops
     34.45 ± 52%     -51.8%      16.62 ±106%  stress-ng.memhotplug.ops_per_sec
      1734 ± 10%     +31.4%       2278 ± 10%  stress-ng.resources.ops
    807.56 ±  5%     +35.2%       1091 ±  8%  stress-ng.resources.ops_per_sec
   1007356 ±  3%     -16.5%     840642 ±  9%  stress-ng.revio.ops
   1007692 ±  3%     -16.6%     840711 ±  9%  stress-ng.revio.ops_per_sec
     21812 ±  3%     +16.0%      25294 ±  5%  stress-ng.sysbadaddr.ops
     21821 ±  3%     +15.9%      25294 ±  5%  stress-ng.sysbadaddr.ops_per_sec
    440.75 ±  4%     +21.9%     537.25 ±  9%  stress-ng.sysfs.ops
    440.53 ±  4%     +21.9%     536.86 ±  9%  stress-ng.sysfs.ops_per_sec
  13286582           -11.1%   11805520 ±  6%  stress-ng.time.file_system_outputs
  68253896            +2.4%   69860122        stress-ng.time.minor_page_faults
    197.00 ±  4%     -15.9%     165.75 ± 12%  stress-ng.xattr.ops
    192.45 ±  5%     -16.1%     161.46 ± 11%  stress-ng.xattr.ops_per_sec
     15310           +62.5%      24875 ± 22%  stress-ng.zombie.ops
     15310           +62.5%      24874 ± 22%  stress-ng.zombie.ops_per_sec
    203.50 ± 12%     -47.3%     107.25 ± 49%  vmstat.io.bi
    861318 ± 18%     -29.7%     605884 ±  5%  meminfo.AnonHugePages
   1062742 ± 14%     -20.2%     847853 ±  3%  meminfo.AnonPages
     31093 ±  6%      +9.6%      34090 ±  3%  meminfo.KernelStack
      7151 ± 34%     +55.8%      11145 ±  9%  meminfo.Mlocked
 1.082e+08 ±  5%     -40.2%   64705429 ± 31%  numa-numastat.node0.local_node
 1.082e+08