I've been poking around in the WAPBL sources and some of the email
threads, and have also read the doc/roadmaps comments, so I'm aware of
some of the sentiment.

I think it would still be useful to get WAPBL safe to enable by
default again in NetBSD. Neither lfs64 nor the Harvard journalling fs
is currently in tree, so it's unknown when they would be stable enough
to replace ffs by default. Also, I think it is useful to keep some
kind of generic[*] journalling code, perhaps also for use by ext2fs
or maybe xfs one day.

In either case, IMO it would also be good to make some generic system
improvements usable by any journalling solution.

I see the following groups of useful changes. Realistically, for the
-8 timeframe, IMO only group one really needs to be resolved to safely
enable wapbl journalling by default.

1. critical fixes for WAPBL
2. less critical fixes for WAPBL
3. performance improvements for WAPBL
4. disk subsystem and journalling-related improvements

1. Critical fixes for WAPBL
1.1 kern/47146 kernel panic when many files are unlinked
1.2 kern/50725 discard handling
1.3 kern/49175 degenerate truncate() case - too embarrassing to leave in

2. Less critical fixes for WAPBL
2.1 kern/45676 flush semantics

2.2 (no PR) make group descriptor updates part of change transaction
The transaction that changes the group descriptor should also contain
the cg block write. Currently the group descriptor blocks are written
to disk during filesystem sync via a separate transaction, so quite
frequently they do not survive a crash that happens before the sync.
Normally fsck fixes these easily using inode metadata, but fsck is
skipped for journalled filesystems. This IMO can lead to incorrect
block allocation until fsck is actually run.

2.3 file data leaks on crashes
File data blocks are written asynchronously, so some of them can make
it to the disk before the journal is committed; hence blocks can end
up in a different file after a system crash. FFS has always had this
problem, even with softdep, albeit to a more limited extent.

2.4 buffer blocks kept in memory until commit
Buffer cache bufs are kept in memory with B_LOCKED flag by wapbl,
starving the buffer cache subsystem.

3. WAPBL performance fixes
3.1 checksum journal data for commit
Avoid one of the two DIOCCACHESYNC calls by computing a checksum over
the data and storing it in the commit record; there is even a field
for it already, so it's just a matter of implementation. CPU use may
be a concern, however. crc32c is a good candidate; do we need to
support alternative hashes? This seems reasonably simple to implement;
it just needs some hooks into the journal write and journal replay
logic.
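
To make the checksum idea a bit more concrete, here is a minimal
userland sketch, not the real WAPBL code. The struct sketch_commit
layout and the sketch_commit_*() helpers are made up for illustration
only; the real commit record looks different. The CRC is a plain
bitwise CRC32C; a real implementation would more likely be table-driven
or use the CPU instruction where available.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78.
 * Pass 0 as the initial crc; the result can be fed back in to
 * checksum data that arrives in several chunks.
 */
static uint32_t
crc32c_update(uint32_t crc, const void *buf, size_t len)
{
	const uint8_t *p = buf;

	crc = ~crc;
	while (len-- > 0) {
		crc ^= *p++;
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return ~crc;
}

/* HYPOTHETICAL commit record, not the real WAPBL on-disk layout. */
struct sketch_commit {
	uint64_t sc_generation;
	uint32_t sc_len;	/* bytes of journal data covered */
	uint32_t sc_checksum;	/* CRC32C over those bytes */
};

/* Commit side: record the checksum of the journal data just written. */
static void
sketch_commit_fill(struct sketch_commit *sc, uint64_t gen,
    const void *data, uint32_t len)
{
	assert(sc != NULL && data != NULL);
	sc->sc_generation = gen;
	sc->sc_len = len;
	sc->sc_checksum = crc32c_update(0, data, len);
}

/* Replay side: accept the transaction only if the data still matches. */
static bool
sketch_commit_valid(const struct sketch_commit *sc, const void *data,
    uint32_t len)
{
	return sc->sc_len == len &&
	    sc->sc_checksum == crc32c_update(0, data, len);
}
```

On replay, a commit record whose checksum does not match the journal
data would be treated as if the commit never happened, which is what
lets the cache flush between the data and the commit record go away.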

3.2 use FUA (Force Unit Access) for commit record write
This avoids the need to issue even the second DIOCCACHESYNC, as flushing
the disk cache is not really all that useful, I like the thread over
Slightly less controversially, this would allow the rest of the
journal records to be written asynchronously, leaving them to execute
even after commit if so desired. It may be useful to have this
behaviour optional. I lean towards skipping the disk cache flush as
default behaviour however, if we implement write barrier for the
commit record (see below).
WAPBL would need to deal with drives without FUA, i.e. fall back to a
cache flush.
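
The fallback could be as small as the following sketch. The enum and
the commit_plan() function are hypothetical names used only to show
the ordering of steps; nothing like this exists in the tree today.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* HYPOTHETICAL step names, for illustration only. */
enum commit_step {
	STEP_WRITE_COMMIT_FUA,	/* commit record write with FUA set */
	STEP_WRITE_COMMIT,	/* plain commit record write */
	STEP_CACHE_SYNC		/* DIOCCACHESYNC-style full cache flush */
};

/*
 * Fill plan[] with the steps needed to make the commit record durable
 * and return how many steps were stored (at most 2).
 */
static size_t
commit_plan(bool drive_has_fua, enum commit_step plan[2])
{
	size_t n = 0;

	assert(plan != NULL);
	if (drive_has_fua) {
		/* One write; the drive must put it on stable media. */
		plan[n++] = STEP_WRITE_COMMIT_FUA;
	} else {
		/* No FUA: write normally, then flush the whole cache. */
		plan[n++] = STEP_WRITE_COMMIT;
		plan[n++] = STEP_CACHE_SYNC;
	}
	return n;
}
```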

3.3 async, or 'group sync' writes
Submit all the journal block writes to the drive at once, instead of
writing the blocks synchronously one by one. We could even have the
journal block writes completely async if we have the commit record
Implementing 'group sync' writes would be quite simple; making them
fully async is more difficult and actually not very useful for
journalling, since the commit would force those writes to the disk
drive anyway if it's a write barrier (see below).

4. disk subsystem and journalling-related improvements
4.1 write barriers
The current DIOCCACHESYNC has a problem in that it could quite easily
be I/O starved if the drive is heavily loaded. Normally, the drive
firmware flushes the disk buffer very soon (within milliseconds, e.g.
when it has a full track of data), but concurrent disk activity might
prevent it from doing so soon enough.
A more serious NetBSD kernel problem, however, is that DIOCCACHESYNC
bypasses bufq, so if there are any queued writes, DIOCCACHESYNC sends
the command to the disk before those writes are sent to the drive.
In order to avoid both problems, it would be good to have a way to
mark a buf as a barrier. bufq and/or the disk routines would be
changed to drain the write queue before a barrier write is sent to the
drive, and any later writes would wait until the barrier write
completes. On sane hardware like SCSI/SAS, this could be almost
completely offloaded to the controller by just using ORDERED tags,
without any need to drain the queue.
This would be semi-hard to implement, especially if it required
changes to disk drivers.
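
The intended ordering rule can be sketched with a toy userland model
of a write queue. This is not bufq code; struct wqueue and the wq_*()
functions are hypothetical and only model which request may go to the
drive next.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define WQ_MAX 16

enum wreq_state { WREQ_QUEUED, WREQ_INFLIGHT, WREQ_DONE };

struct wreq {
	bool		barrier;	/* write is marked as a barrier */
	enum wreq_state	state;
};

struct wqueue {
	struct wreq	req[WQ_MAX];	/* in submission order */
	size_t		nreq;
};

/* Append a write; returns its index, or -1 if the queue is full. */
static int
wq_add(struct wqueue *q, bool barrier)
{
	if (q->nreq == WQ_MAX)
		return -1;
	q->req[q->nreq].barrier = barrier;
	q->req[q->nreq].state = WREQ_QUEUED;
	return (int)q->nreq++;
}

/*
 * Return the index of the next write that may be sent to the drive,
 * or -1 if nothing may be sent right now.
 */
static int
wq_next(const struct wqueue *q)
{
	bool earlier_done = true;

	for (size_t i = 0; i < q->nreq; i++) {
		const struct wreq *r = &q->req[i];

		if (r->state == WREQ_DONE)
			continue;
		if (r->barrier) {
			/* Barrier goes out only once the queue has drained. */
			if (r->state == WREQ_QUEUED && earlier_done)
				return (int)i;
			/* Later writes wait until the barrier completes. */
			return -1;
		}
		if (r->state == WREQ_QUEUED)
			return (int)i;
		earlier_done = false;	/* plain write still in flight */
	}
	return -1;
}

static void
wq_issue(struct wqueue *q, int i)
{
	assert(q->req[i].state == WREQ_QUEUED);
	q->req[i].state = WREQ_INFLIGHT;
}

static void
wq_complete(struct wqueue *q, int i)
{
	assert(q->req[i].state == WREQ_INFLIGHT);
	q->req[i].state = WREQ_DONE;
}
```

With ORDERED tags the controller would enforce the same rule by
itself, so the drain step in wq_next() would not be needed there.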

4.2 scsipi default to ORDERED tags, change to SIMPLE
From a quick scsipi_base.c inspection, it seems we use an ordered tag
if none was specified in the request. This seems like a waste. It
probably assumes disksort() does a miracle job, but bufq disksort
can't account for e.g. head positions, so this is actually a
misoptimization even for spinning rust, and not useful at all for
SSDs. We should change the default to SIMPLE and rely on the disk
firmware to do its job.
This is very simple to do.

4.3 generic FUA flag support
In order to avoid a full cache sync after journal commit, it would be
useful to mark certain writes (like the journal commit record write)
to bypass the disk write cache. There is a FUA bit for SCSI/SAS, for
SATA with NCQ, and for NVMe. This could be as simple as a struct buf
flag, which the disk driver would act upon. Very simple to do for SCSI
and NVMe since the support for tags is already there; for SATA we'd
need to first implement NCQ support :D
This is quite easy to do, it's just a struct buf flag and tweaks to
the scsipi/nvme code.
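
For the SCSI side, this is roughly what "the driver acts upon the
flag" would mean. A userland sketch only: SKETCH_B_FUA, struct
sketch_buf and sketch_build_write10() are hypothetical names, not
existing NetBSD identifiers. The CDB layout itself is the standard
WRITE(10) one, where FUA is bit 3 of byte 1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_B_FUA		0x01	/* HYPOTHETICAL buf flag */

#define SCSI_WRITE_10		0x2a	/* WRITE(10) opcode */
#define SCSI_WRITE_10_FUA	0x08	/* byte 1, bit 3: Force Unit Access */

/* HYPOTHETICAL stand-in for the few struct buf fields used here. */
struct sketch_buf {
	uint32_t sb_flags;
	uint32_t sb_lba;	/* starting logical block */
	uint16_t sb_nblocks;	/* transfer length in blocks */
};

/* Build a WRITE(10) CDB, setting FUA if the buf asks for it. */
static void
sketch_build_write10(const struct sketch_buf *bp, uint8_t cdb[10])
{
	assert(bp != NULL && cdb != NULL);
	memset(cdb, 0, 10);
	cdb[0] = SCSI_WRITE_10;
	if (bp->sb_flags & SKETCH_B_FUA)
		cdb[1] |= SCSI_WRITE_10_FUA;
	/* LBA and transfer length are big-endian in the CDB. */
	cdb[2] = (uint8_t)(bp->sb_lba >> 24);
	cdb[3] = (uint8_t)(bp->sb_lba >> 16);
	cdb[4] = (uint8_t)(bp->sb_lba >> 8);
	cdb[5] = (uint8_t)bp->sb_lba;
	cdb[7] = (uint8_t)(bp->sb_nblocks >> 8);
	cdb[8] = (uint8_t)bp->sb_nblocks;
}
```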

4.4 NCQ support for AHCI
We need NCQ to support the FUA flag. We have AHCI support, but without
NCQ. FreeBSD has support for AHCI NCQ, and OpenBSD seems to have some
kind of support also, so both can be used as references besides the
official AHCI specification. Most recent motherboards support AHCI
mode. Some non-AHCI PCI SATA controllers support NCQ too, but those
would be out of scope for now.
This is semi-hard to do.

I plan to start on group 1, followed by 3.1 checksum, 4.1 write
barrier, 4.3 generic FUA support, and finally 3.2 FUA usage.

Comments are welcome :)


[*] WAPBL is of course not so generic since it forces the on-disk
format right now, but it could eventually be made more flexible.
