Re: recent issues with heavy delete's causing soft lockups

2018-11-21 Thread Thomas Fjellstrom
On Friday, November 2, 2018 2:37:08 PM MST Jens Axboe wrote:
> On 11/2/18 2:32 PM, Thomas Fjellstrom wrote:
> > On Saturday, October 27, 2018 1:20:10 PM MDT Jens Axboe wrote:
> >> On Oct 27, 2018, at 12:40 PM, Thomas Fjellstrom 
> > 
> > [snip]
> > 
> >> Can you try 4.19? A patch went in since 4.18 that fixes a starvation
> >> issue around requeue conditions, which SATA is the one to most often hit.
> >> 
> >> Jens
> > 
> > I just had to do a clean, and I have the mq kernel options I mentioned in
> > my previous mail set (so mq should be disabled), and it still appears to
> > be causing issues. The current io scheduler appears to be cfq, and that
> > "make clean" took about 4 minutes. A lot of that time was spent with
> > plasma, IntelliJ, and chrome all starved of IO.
> > 
> > I did switch to a terminal and checked iostat -d 1, and it showed very
> > little actual io for the time I was looking at it.
> > 
> > I have no idea what's going on.
> 
> If you're using cfq, then it's not using mq at all. Maybe do something ala:

Yeah, I switched off mq to test. I mentioned it in a previous mail.

> # perf record -ag -- sleep 10
> 
> while the slowdown is happening and then do perf report -g --no-children and
> see if that yields anything interesting. Sounds like time is being spent
> elsewhere and you aren't actually waiting on IO.

Ok, with the 4.19.1 kernel from linux-stable I've managed to catch the issue 
during real use, rather than just a dd command.

I should note that I have swap turned off, so I'm not sure what 
the "swapper" process in the below log is doing.

I also see the problem with swap enabled, but right now I'd rather have 
certain apps die than slow the entire system down.

I also have a perf report -t log if that'd be helpful. It shows a lot of "use" 
in do_idle/acpi_idle_do_entry, though I presume that's real idle time, not 
actual use. The next most eye-catching item in the -t log is chrome spending 
17% of its time in glibc's free function.

(the top ~100 lines from perf report -g)

# Total Lost Samples: 0
#
# Samples: 456K of event 'cycles'
# Event count (approx.): 136347735217
#
# Overhead  Command  Shared Object      Symbol
# ........  .......  .................  ....................
#
    25.64%  swapper  [kernel.kallsyms]  [k] acpi_idle_do_entry
            |
            ---0xa16000d4
               |
               |--22.23%--start_secondary
               |          cpu_startup_entry
               |          do_idle
               |          cpuidle_enter_state
               |          acpi_idle_enter
               |          acpi_idle_do_entry
               |
                --3.41%--start_kernel
                          cpu_startup_entry
                          do_idle
                          cpuidle_enter_state
                          acpi_idle_enter
                          acpi_idle_do_entry

     0.61%  swapper  [kernel.kallsyms]  [k] apic_timer_interrupt
            |
            ---0xa16000d4
               |
                --0.52%--start_secondary

Re: recent issues with heavy delete's causing soft lockups

2018-11-02 Thread Thomas Fjellstrom
On Saturday, October 27, 2018 1:20:10 PM MDT Jens Axboe wrote:
> On Oct 27, 2018, at 12:40 PM, Thomas Fjellstrom  
[snip]
> 
> Can you try 4.19? A patch went in since 4.18 that fixes a starvation issue
> around requeue conditions, which SATA is the one to most often hit.
> 
> Jens

I just had to do a clean, and I have the mq kernel options I mentioned in my 
previous mail set (so mq should be disabled), and it still appears to be 
causing issues. The current io scheduler appears to be cfq, and that "make 
clean" took about 4 minutes. A lot of that time was spent with plasma, 
IntelliJ, and chrome all starved of IO.

I did switch to a terminal and checked iostat -d 1, and it showed very little 
actual io for the time I was looking at it.

I have no idea what's going on.

-- 
Thomas Fjellstrom
tho...@fjellstrom.ca





Re: recent issues with heavy delete's causing soft lockups

2018-11-02 Thread Thomas Fjellstrom
On Saturday, October 27, 2018 1:20:10 PM MDT Jens Axboe wrote:
> On Oct 27, 2018, at 12:40 PM, Thomas Fjellstrom  
wrote:
> > Hi
[snip explanation of problem]
> 
> Can you try 4.19? A patch went in since 4.18 that fixes a starvation issue
> around requeue conditions, which SATA is the one to most often hit.

Gave it a shot with the vanilla kernel from git linux-stable/v4.19. It was a 
bit of a pain, as the amdgpu driver seems to be broken for my R9 390 on many 
kernels, including 4.19. I had to reconfigure to the radeon driver, which I 
must say seems to work a lot better than it used to.

At any rate, it doesn't seem to have helped a lot so far. I did end up adding 
"scsi_mod.use_blk_mq=0 dm_mod.use_blk_mq=0" to the default kernel boot command 
line in grub. It seems to have helped a little, but I haven't tested fully 
with a full delete of the build directory. I haven't had time to sit and wait 
the 40+ minutes it takes to rebuild the entire thing, and I'm low enough on 
disk space that I can't easily make a copy of the 109GB build folder. I've got 
about 25GB free out of 780GB. I'll try and test some more soon.
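
A minimal sketch for confirming that those two boot options actually reached 
the running kernel. It assumes a Linux system that exposes /proc/cmdline; the 
helper name cmdline_has is hypothetical and only for illustration.

```shell
#!/bin/sh
# cmdline_has PARAM CMDLINE
# Succeeds if PARAM appears as a whole, space-separated word in CMDLINE.
# (Hypothetical helper, not part of any existing tool.)
cmdline_has() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Report on the two options added to the grub command line.
# Skipped silently where /proc/cmdline is not readable.
if [ -r /proc/cmdline ]; then
    cmdline=$(cat /proc/cmdline)
    for p in scsi_mod.use_blk_mq=0 dm_mod.use_blk_mq=0; do
        if cmdline_has "$p" "$cmdline"; then
            echo "$p: set"
        else
            echo "$p: not set"
        fi
    done
fi
```

Matching on whole words avoids a false positive from a longer value such as 
"scsi_mod.use_blk_mq=01".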

> Jens


-- 
Thomas Fjellstrom
tho...@fjellstrom.ca





recent issues with heavy delete's causing soft lockups

2018-10-27 Thread Thomas Fjellstrom
Hi

For the past few months or so I've been dealing with my workstation locking 
up for minutes at a time when deleting a large directory tree. I don't recall 
this being a problem before.

Current setup is 3 SATA SSDs in an LVM VG. Most space is allocated to an ext4 
/home where my work projects live.

The main use case causing problems is deleting the "out" directory of an 
Android AOSP build tree. It can be upwards of 95GB in size with 240k or more 
files. If I run `rm -fr out` or `make clean` it will lock up anything 
attempting to use the disk (eg: plasma, intellij, android studio, chrome, etc), 
sometimes for minutes.

I have tried different block scheduler settings, including none, mq-deadline, 
kyber and bfq, none of which seem to improve things much at all.
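
A rough sketch for checking which scheduler each disk is actually using after 
a change. It assumes the usual sysfs layout, /sys/block/<dev>/queue/scheduler, 
where the active scheduler is the entry shown in square brackets; the helper 
name active_sched is hypothetical.

```shell
#!/bin/sh
# active_sched LINE
# Print the bracketed (active) entry of a sysfs scheduler line,
# e.g. "[mq-deadline] kyber bfq none" -> "mq-deadline".
# Prints nothing if the line has no bracketed entry.
# (Hypothetical helper for illustration.)
active_sched() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Show the active scheduler for every block device that exposes one.
# Where /sys/block does not exist the glob stays literal and is skipped.
for f in /sys/block/*/queue/scheduler; do
    [ -r "$f" ] || continue
    dev=${f#/sys/block/}
    dev=${dev%%/*}
    printf '%s: %s\n' "$dev" "$(active_sched "$(cat "$f")")"
done
```

An active entry of none, mq-deadline, kyber or bfq means the device is on 
blk-mq, while cfq, deadline or noop means the legacy path.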

It may be worth noting that disk space is starting to run low. Perhaps there's 
some interaction going on with free space handling or SSD wear leveling...

That said, it seems to have started happening (or at least gotten worse) some 
time around when mq was made the default and only implementation for SATA.

If it helps, my system specs are:

Kernel: Debian Sid's 4.18.0-2-amd64 (4.18.10-2)
CPU: AMD FX-8320 OCed to 4.4GHz
RAM: 32GB DDR3 1866
MB: Asus 970 Aura Pro Gaming
Storage: Kingston HyperX 3K 240G + Samsung 850 Evo 250G + SanDisk X300 500G

I'm thinking of testing with a different or older kernel. Which would be the 
best to test with?

Thanks for any assistance.

-- 
Thomas Fjellstrom
tho...@fjellstrom.ca