Re: [PATCH 0/2][V2] io.latency test for blktests

2018-12-06 Thread Omar Sandoval
On Wed, Dec 05, 2018 at 10:34:02AM -0500, Josef Bacik wrote:
> v1->v2:
> - dropped my python library, TIL about jq.
> - fixed the spelling mistakes in the test.
> 
> -- Original message --
> 
> This patchset adds a test to verify that io.latency is working properly,
> along with all the supporting code to run that test.
> 
> First is the cgroup2 infrastructure, which is fairly straightforward.  It
> just verifies we have cgroup2 and gives us the helpers to check that the
> right controllers are in place for the test.
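
For reference, the checks described here boil down to something like the
sketch below. This is only an illustration; the helper names and the
SKIP_REASON convention are assumptions about how this slots into blktests,
not necessarily what the patch itself does:

	# Hypothetical helpers; the names are illustrative.
	_have_cgroup2() {
		if ! findmnt -n -t cgroup2 >/dev/null; then
			SKIP_REASON="kernel does not have cgroup2 mounted"
			return 1
		fi
		return 0
	}

	# Check that a controller (e.g. "io") is available on the cgroup2 root.
	_have_cgroup2_controller() {
		local root
		root="$(findmnt -n -t cgroup2 -o TARGET | head -1)"
		grep -qw "$1" "$root/cgroup.controllers"
	}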
> 
> The second patch brings over some python scripts I put in xfstests for
> parsing the fio json output.  I looked at the existing fio performance
> stuff in blktests, but we only capture bandwidth numbers, which are wonky
> with this particular test because once the fast group is finished, the
> slow group is allowed to go as fast as it wants.  So I needed this to pull
> out the actual job time spent.  This will give us flexibility to pull out
> other fio performance data in the future.
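
Since v2 switched to jq for this, pulling the per-job runtime out of fio's
JSON output becomes a one-liner along these lines (assuming fio was run
with --output-format=json, and assuming job_runtime, reported in
milliseconds, is the field wanted):

	# Print each job's name and runtime (ms) from a fio JSON results file.
	jq -r '.jobs[] | "\(.jobname) \(.job_runtime)"' results.json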
> 
> The final patch is the test itself.  It first runs a job by itself to get
> a baseline view of the disk's performance.  Then it creates two cgroups,
> one fast and one slow, and runs the same job simultaneously in both
> groups.  The result should be that the fast group takes only slightly
> longer than the baseline (I use a 15% threshold to be safe), and that the
> slow one takes considerably longer.  Thanks,
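
The pass/fail logic then has roughly the shape below. This is purely
illustrative: the device number, target values, and variable names are
made up, and the io.latency target units should be checked against
Documentation/admin-guide/cgroup-v2.rst rather than trusted from this
sketch:

	# Enable the io controller and create the two groups.
	echo "+io" > "$cg/cgroup.subtree_control"
	mkdir "$cg/fast" "$cg/slow"
	# Tight latency target for the protected group, loose for the other.
	echo "259:0 target=10000" > "$cg/fast/io.latency"
	echo "259:0 target=2000000" > "$cg/slow/io.latency"
	# ... run the same fio job in both groups simultaneously ...
	# The protected group must finish within 115% of the baseline job time.
	if ((fast_ms * 100 > baseline_ms * 115)); then
		echo "Too much of a performance drop for the protected workload"
	fi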

I cleaned up a ton of shellcheck warnings (from `make check`) and pushed
the result to https://github.com/osandov/blktests/tree/josef. I tested with
QEMU on Jens' for-next branch. With an emulated NVMe device, the test
failed with "Too much of a performance drop for the protected workload".
On virtio-blk, I hit this hung task:

[ 1843.056452] INFO: task fio:20750 blocked for more than 120 seconds.
[ 1843.057495]   Not tainted 4.20.0-rc5-00251-g90efb26fa9a4 #19
[ 1843.058487] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1843.059769] fio             D    0 20750  20747 0x00000080
[ 1843.060688] Call Trace:
[ 1843.061123]  ? __schedule+0x286/0x870
[ 1843.061735]  ? blkcg_iolatency_done_bio+0x680/0x680
[ 1843.062574]  ? blkcg_iolatency_cleanup+0x60/0x60
[ 1843.063347]  schedule+0x32/0x80
[ 1843.063874]  io_schedule+0x12/0x40
[ 1843.064449]  rq_qos_wait+0x9a/0x120
[ 1843.065007]  ? karma_partition+0x210/0x210
[ 1843.065661]  ? blkcg_iolatency_done_bio+0x680/0x680
[ 1843.066435]  blkcg_iolatency_throttle+0x185/0x360
[ 1843.067196]  __rq_qos_throttle+0x23/0x30
[ 1843.067958]  blk_mq_make_request+0x101/0x5c0
[ 1843.068637]  generic_make_request+0x1b3/0x3c0
[ 1843.069329]  submit_bio+0x45/0x140
[ 1843.069876]  blkdev_direct_IO+0x3db/0x440
[ 1843.070527]  ? aio_complete+0x2f0/0x2f0
[ 1843.071146]  generic_file_direct_write+0x96/0x160
[ 1843.071880]  __generic_file_write_iter+0xb3/0x1c0
[ 1843.072599]  ? blk_mq_dispatch_rq_list+0x3aa/0x550
[ 1843.073340]  blkdev_write_iter+0xa0/0x120
[ 1843.073960]  ? __fget+0x6e/0xa0
[ 1843.074452]  aio_write+0x11f/0x1d0
[ 1843.074979]  ? __blk_mq_run_hw_queue+0x6f/0xe0
[ 1843.075658]  ? __check_object_size+0xa0/0x189
[ 1843.076345]  ? preempt_count_add+0x5a/0xb0
[ 1843.077086]  ? aio_read_events+0x259/0x380
[ 1843.077819]  ? kmem_cache_alloc+0x16e/0x1c0
[ 1843.078427]  io_submit_one+0x4a8/0x790
[ 1843.078975]  ? read_events+0x76/0x150
[ 1843.079510]  __se_sys_io_submit+0x98/0x1a0
[ 1843.080116]  ? syscall_trace_enter+0x1d3/0x2d0
[ 1843.080785]  do_syscall_64+0x55/0x160
[ 1843.081404]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1843.082210] RIP: 0033:0x7f6e571fc4ed
[ 1843.082763] Code: Bad RIP value.
[ 1843.083268] RSP: 002b:00007ffc212b76f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000d1
[ 1843.084445] RAX: ffffffffffffffda RBX: 00007f6e4c876870 RCX: 00007f6e571fc4ed
[ 1843.085545] RDX: 0000557c4bc11208 RSI: 0000000000000001 RDI: 00007f6e4c85e000
[ 1843.086251] RBP: 00007f6e4c85e000 R08: 0000557c4bc2b130 R09: 00000000000002f8
[ 1843.087308] R10: 0000557c4bbf4470 R11: 0000000000000246 R12: 0000000000000001
[ 1843.088310] R13: 0000000000000000 R14: 0000557c4bc11208 R15: 00007f6e2b17f070

