Re: [f2fs-dev] [PATCH] f2fs: move f2fs to use reader-unfair rwsems

2022-01-10 Thread Tim Murray via Linux-f2fs-devel
On Mon, Jan 10, 2022 at 7:02 AM Peter Zijlstra  wrote:
> Can we start by describing the actual problem?

Of course. Sorry, I didn't realize Jaegeuk was going to move so
quickly with this patch :)

We have a few thousand traces from internal Android devices where
large numbers of threads throughout the system end up blocked on
f2fs_issue_checkpoint(), which userspace code usually reaches via
fsync(). In the same traces, the f2fs-ckpt kthread is often blocked
around f2fs_write_checkpoint(), which does the actual fsync()
operation on behalf of some number of callers (coalescing multiple
fsync calls into a single checkpoint op). I think the problem is
something like:

1. f2fs-ckpt thread is running f2fs_write_checkpoint(), holding the
cp_rwsem write lock while doing so via f2fs_lock_all() in
block_operations().
2. Random very-low-priority thread A makes some other f2fs call that
tries to get the cp_rwsem read lock by atomically adding on the rwsem,
fails and deschedules in uninterruptible sleep. cp_rwsem now has a
non-zero reader count but is write-locked.
3. f2fs-ckpt thread releases the cp_rwsem write lock. cp_rwsem now has
a non-zero reader count and is not write-locked, so is reader-locked.
4. Other threads call fsync(), which requests checkpoints from
f2fs-ckpt, and block on a completion event that f2fs-ckpt dispatches.
cp_rwsem still has a non-zero reader count because the low-prio thread
A from (2) has not been scheduled again yet.
5. f2fs-ckpt wakes up to perform checkpoints, but it stalls on the
write lock via cmpxchg in block_operations() until the low-prio thread
A has run and released the cp_rwsem read lock. Because f2fs-ckpt can't
run, all fsync() callers are also effectively blocked by the
low-priority thread holding the read lock.

I think this is the rough shape of the problem (vs readers holding the
lock for too long or something like that) because the low-priority
thread is never run between when it is initially made runnable by
f2fs-ckpt and when it runs tens/hundreds of milliseconds later then
immediately unblocks f2fs-ckpt.

When there's a lot of oversubscription, threads running IO, etc, we
see f2fs-ckpt stalls of tens to hundreds of milliseconds per f2fs-ckpt
iteration. In one recent experiment of the immediate aftermath of
booting and unlocking a phone for the first time, over a 10s period,
we saw f2fs-ckpt blocked on the rwsem for 9.7s, running for <50ms over
110 separate slices (so probably ~110 fsyncs), and blocked on IO
operations for <100ms. The worst 10s period I can find in a similar
experiment with the unfair reader approach has f2fs-ckpt blocked on
the rwsem for 130ms, running for 20ms over 95 slices, blocked on IO
for 40ms, and sleeping the rest of the time.

Unfair rwsems are rarely appropriate, but I think the lack of fairness
makes sense here because of the relative importance of f2fs-ckpt vs
any particular reader of that rwsem. The one concern I have about
down_read_unfair() is whether the fairness should be a property of the
down_read operation or the rw_semaphore itself. In the f2fs case, it
seems like all read locks of the rw_semaphores should be unfair and
that adding a single fair down_read() could introduce a similar
regression, but duplicating rw_semaphore also seems bad. If it does
make sense to make the fairness a property of the semaphore, could
that property be flagged in an init_unfair_rwsem() macro and then
checked in lockdep?


___
Linux-f2fs-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH] f2fs: move f2fs to use reader-unfair rwsems

2022-01-10 Thread Tim Murray via Linux-f2fs-devel
On Mon, Jan 10, 2022 at 8:15 PM Waiman Long  wrote:
> That is not how rwsem works. A reader which fails to get the lock
> because it is write-locked will remove its reader count before going to
> sleep. So the reader count will be zero eventually. Of course, there is
> a short period of time where the reader count will be non-zero until the
> reader removes its own reader count. So if a new writer comes in at that
> time, it will fail its initial trylock and probably go to optimistic
> spinning mode. If the writer that owns the lock release it at the right
> moment, the reader may acquire the read lock.

Thanks for the correction, that makes sense. I haven't spent too much
time on rwsem internals and I'm not confident about when flags are set
and cleared in sem->count; is there a case where sem->count after
up_write() could be nonzero?

An example from one trace:

1. Low-priority userspace thread 4764 is blocked in f2fs_unlink,
probably at f2fs_lock_op, which is a wrapper around
down_read(cp_rwsem).
2. f2fs-ckpt runs at t=0ms and wakes thread 4764, making it runnable.
3. At t=1ms, f2fs-ckpt enters uninterruptible sleep and blocks at
rwsem_down_write_slowpath per sched_blocked_reason.
4. At t=26ms, thread 4764 runs for the first time since being made
runnable. Within 40us, thread 4764 unblocks f2fs-ckpt and makes it
runnable.

Since thread 4764 is awakened by f2fs-ckpt but never runs before it
unblocks f2fs-ckpt in down_write_slowpath(), the only idea I had is
that cp_rwsem->count is nonzero after f2fs-ckpt's up_write() in step 2
(maybe because of rwsem_mark_wake()?).

> I do have a question about the number of readers in such a case compared
> with the number of writers. Are there a large number of low priority
> hanging around? What is an average read lock hold time?
>
> Blocking for 9.7s for a write lock is quite excessive and we need to
> figure out how this happen.,

Just to be 100% clear, it's not a single 9.7s stall, it's many smaller
stalls of 10-500+ms in f2fs-ckpt that add up to 9.7s over that range.

f2fs is not my area of expertise, but my understanding is that
cp_rwsem in f2fs has many (potentially unbounded) readers and a single
writer. Arbitrary userspace work (fsync, creating/deleting/truncating
files, atomic writes) may grab the read lock, but assuming the
merge_checkpoint option is enabled, only f2fs-ckpt will ever grab the
write lock during normal operation. However, in this particular
example, it looks like there may have been 5-10 threads blocked on
f2fs-ckpt that were awakened alongside thread 4764 in step 2.

I'll defer to the f2fs experts on the average duration that the read
lock is held.


___
Linux-f2fs-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH] f2fs: move f2fs to use reader-unfair rwsems

2022-01-12 Thread Tim Murray via Linux-f2fs-devel
On Wed, Jan 12, 2022 at 6:06 AM Peter Zijlstra  wrote:
> *urgh*... so you're making the worst case less likely but fundamentally
> you don't change anything.
>
> If one of those low prio threads manages to block while holding
> cp_rwsem your checkpoint thread will still block for a very long time.
>
> So while you improve the average case, the worst case doesn't improve
> much I think.

You're right about the worst case latency, but I think a lock
optimization is still useful. If I understand the rwsem implementation
correctly, if there's one pending reader while a writer holds the
rwsem and the reader is sleeping when the writer releases the lock,
the writer will mark the rwsem as reader-owned by incrementing the
reader count then wake the reader in rwsem_mark_wake(). Assuming
that's correct, hitting this case guarantees the lowest possible
latency of a second write lock is the sum of reader's scheduling
delay, the reader's critical section duration, and the writer's
scheduling delay. What we see here is that the reader's scheduling
delay is orders of magnitude larger than the read lock duration when
the readers are low priority (tens of milliseconds of scheduling delay
when the device is heavily loaded vs tens of microseconds of holding
the read lock). The scheduling delay of f2fs-ckpt seems insignificant
compared to the scheduling delay of the reader (and is easily
addressed because it's a kthread vs arbitrary userspace code).

If the stalls were due to preemption of a low-priority reader that
holds the cp_rwsem read lock, I would expect to see some traces like:

1. low-prio userspace thread is running
2. low-prio userspace thread grabs cp_rwsem read lock, is preempted,
switches back to runnable
3. f2fs-ckpt wakes up, blocks on cp_rwsem write lock
4. low-prio userspace thread switches to running and unblocks f2fs-ckpt

Specifically, the low-prio userspace thread would need to wake up
f2fs-ckpt without the userspace thread ending up in uninterruptible
sleep around that period. To my knowledge, we've never seen that in a
trace. The bad behavior we see always comes after the low-prio thread
was blocked acquiring the read lock and switches back to runnable from
uninterruptible sleep. I think the patches that Waiman mentioned don't
address the case we're seeing; I think they fix different patterns of
new readers preventing a writer from becoming runnable vs a long
scheduling delay of a reader awakened by a writer that in turn makes a
writer runnable once the reader finally runs.

With the existing rwsem usage in f2fs, f2fs-ckpt is guaranteed to wait
for the reader's scheduling delay plus the read lock time every time
there's a pending reader, which is the common case with f2fs-ckpt
(because the write lock is held for milliseconds of IO ops vs
microseconds for readers) and why we often see hundreds of 10-100ms
checkpoints in a row. By moving to a reader-unfair rwsem, f2fs-ckpt
would only pay the latency of the reader's scheduling delay if the
reader is preempted while holding the lock. I'm sure that preemption
does happen and that the max latency is about the same as a result,
but given how short the read locks are held vs the write lock and the
relative likelihood of a preempted reader vs any reader arriving at
the lock while the writer is doing IO, the reader-unfair approach
should be a dramatic improvement in 99th percentile (and greater) f2fs
checkpoint latency.

> Also, given that this is a system wide rwsem, would percpu-rwsem not be
> 'better' ? Arguably with the same hack cgroups uses for it (see
> cgroup_init()) to lower the cost of percpu_down_write().

Funny you should mention this--the next thing on my todo list is to
root causing stalls around cgroup_threadgroup_rwsem (mostly in
cgroup_css_set_fork() hit from pthread_create in userspace).

If I'm reading percpu-rwsem.c correctly, it's not fair to readers in
the same way that this patch is. It's probably okay to switch to
percpu_rwsem in f2fs given the distribution of readers to writers, but
I am concerned about giving up spin-on-owner given how often these
locks will be hit from different processes at very different
priorities. Does that seem overly paranoid?

> Now, I'm not a filesystem developer and I'm not much familiar with the
> problem space, but this locking reads like a fairly big problem. I'm not
> sure optimizing the lock is the answer.

I'm not a filesystem developer either, but rewriting the locking was
my first suggestion too. It seems much harder than tweaking the lock.
Rewriting the locking might still be necessary if we see regular
preemption while the cp_rwsem read lock is held, but I don't have any
data showing that preemption is happening regularly right now.


___
Linux-f2fs-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [GIT PULL] f2fs for 5.18

2022-03-22 Thread Tim Murray via Linux-f2fs-devel
Hi Linus,

On Tue, Mar 22, 2022 at 10:50 AM Linus Torvalds
 wrote:
>
> Even when people get the semantics and memory ordering right (which is
> not always the case, but at least the f2fs code uses real lock
> primitives - just oddly - and should thus be ok), it invariably tends
> to be a sign of something else being very wrong.
>
> And I can easily believe that in this case it's due to a rmsem issue
> that was already fixed long long ago as per Waiman.
>
> Can people please test with the actual modern rwsem code and with the
> odd reader-unfair locks disabled?

I did try the patch from Waiman (backported to a local Android device
running 5.10 without the unfair-rwsem patch) when we were first
discussing this, and as far as I could tell, his patch did not make a
difference for this problem. I think I can explain why it solves a
different problem, and it is related to how f2fs uses locking
primitives and worker threads. In this case, it's important to know
that f2fs's checkpoint thread coalesces checkpoints on behalf of all
userspace threads calling fsync(), userspace threads calling many f2fs
functions take the read lock on the rwsem, and the checkpoint thread
takes the write lock. The userspace threads can be any CFS priority
and may be throttled for various reasons (constrained cpuset or cpuctl
runtime limits).

Here's one example from a trace associated with a very slow app start
from a 5.10 device before the unfair-rwsem patch:

1. pid 6711, a random userspace thread, is running on a CPU at time 0ms.
2. After 2.6ms of running, pid 6711 enters state D, deschedules, and
is blocked in f2fs_new_inode according to sched_blocked_reason (a
tracepoint we added in the Android kernel that emits the function
where a thread is blocked in uninterruptible sleep).
3. The f2fs_checkpoint thread runs 5ms after pid 6711 deschedules.
f2fs_ckpt runs for 63us, makes pid 6711 runnable, and then deschedules
in state D blocked at rwsem_down_write_slowpath.
4. pid 6711 remains runnable for 341ms--it's a low priority thread in
a throttled process and the system is busy. Meanwhile, f2fs_ckpt
remains in D for 341ms.
5. pid 6711 finally runs and makes f2fs_ckpt runnable after 11us.

If the problem were related to optimistic spinning that would be
addressed by Waiman's patch, I'd expect to see pid 6711 running
concurrently with f2fs_ckpt. However, that's not what happens; there's
a 5ms gap in between pid 6711 blocking on the read lock and f2fs_ckpt
running. AFAICT, what's happening is that rwsem_down_read_slowpath
modifies sem->count to indicate that there's a pending reader while
f2fs_ckpt holds the write lock, and when f2fs_ckpt releases the write
lock, it wakes pending readers and hands the lock over to readers.
This means that any subsequent attempt to grab the write lock from
f2fs_ckpt will stall until the newly-awakened reader releases the read
lock, which depends on the readers' arbitrarily long scheduling
delays.

AFAICT, any solution for this problem that doesn't require an overhaul
of f2fs locking requires that the newly-awakened readers can't block
f2fs_ckpt until they are scheduled and grab the read lock. Either we
could do something trylock-related like this patch implemented, or we
could move f2fs to percpu_rwsem. However, moving to percpu_rwsem means
no optimistic spinning, which also seems bad given how short the
read-locked critical sections are. We have seen similar problems with
contention on the cgroup percpu_rwsem that spin-on-owner could have
alleviated, so I'm hesitant to get rid of optimistic spinning for f2fs
by switching to percpu_rwsem.

The most obvious objection to the patch is that f2fs_ckpt will still
stall if a reader deschedules while holding the lock. That is
absolutely true! However, in practice, we were hitting the worst case
every time because any attempt to grab the read lock while f2fs_ckpt
holds the write lock exposes f2fs_ckpt to arbitrary scheduling delay
from the reader. The reader-unfair rwsem patch prevents this problem.
We have evidence this is true in practice, too. I just ran a quick
comparison on 150 slow app start traces from our internal population
comparing kernels pre- and post-unfair-rwsem patch, and the percentage
of time the thread responsible for UI was stalled on f2fs locks went
from ~50% of all uninterruptible sleep time on that thread to less
than 5%.

I think this is a special case where fairness to readers is explicitly
harmful because any thread at any priority can block on the only
writer (f2fs_ckpt) but any low priority thread could be a reader
blocking the only writer. I don't know if it's worth explicit support
in rw_semaphore for this kind of use case, but I don't think an
existing rwsem patch addresses this issue because it is so unusual.


___
Linux-f2fs-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel