Re: [Cluster-devel] [PATCH v1 0/4] fs: fix race between llseek SEEK_END and write
On Thu, Nov 22, 2018 at 02:40:50PM +0900, Eiichi Tsukata wrote:
> On Wed, Nov 21, 2018 at 13:54, Al Viro wrote:
> >
> > On Wed, Nov 21, 2018 at 11:43:56AM +0900, Eiichi Tsukata wrote:
> > > Some file systems (including ext4, xfs, ramfs ...) have the following
> > > problem as I've described in the commit message of the 1/4 patch.
> > >
> > > The commit ef3d0fd27e90 ("vfs: do (nearly) lockless generic_file_llseek")
> > > removed almost all locks in llseek() including SEEK_END. It was based on
> > > the idea that write() updates the size atomically. But in fact, write()
> > > can be divided into two or more parts in generic_perform_write() when pos
> > > straddles a PAGE_SIZE boundary, which results in the size being updated
> > > multiple times in one write(). It means that llseek() can see the size
> > > being updated during write().
> >
> > And? Who has ever promised anything that insane? write(2) can take an
> > arbitrary amount of time; another process doing lseek() on an independently
> > opened descriptor is *not* going to wait for that (e.g. page-in of the
> > buffer being written, which just happens to be mmapped from a file on NFS
> > over an RFC1149 link).
>
> Thanks.
>
> The lock I added in NFS did nothing but slow down lseek(), because the file
> size is updated atomically there. Even `spin_lock(&inode->i_lock)` is
> unnecessary.
>
> I'll fix the commit message so that it only refers to the specific local
> file systems that use generic_perform_write(), and remove unnecessary locks
> in some distributed file systems (e.g. nfs, cifs, or more) by replacing
> generic_file_llseek() with generic_file_llseek_unlocked(), so that `tail`
> doesn't have to wait for avian carriers.

	fd = open("/mnt/slw/foo", O_RDONLY);
	p = mmap(NULL, 8192, PROT_READ, MAP_SHARED, fd, 0);
	local_fd = open("/tmp/foo", O_RDWR);
	write(local_fd, p, 8192);

and there you go - an extremely slow write(2). To a file on a local
filesystem. Can you show me where POSIX/SuS/whatever it's called these days
promises that kind of atomicity?
TBH, I would be very surprised if any Unix promised to have the file size
updated no more than once per write(2). Any userland code that relies on
such a property is, at a minimum, non-portable, and I would argue that it
is outright broken.

Note, BTW, that your example depends upon rather non-obvious details of the
echo implementation, starting with the buffering logic used by the
particular shell. And AFAICS ash(1) ends up with the possibility of more
than one write(2) anyway - get SIGSTOP delivered, followed by SIGCONT, and
you've got just that, transparently for echo(1). I'm fairly sure that the
same holds for anything that isn't outright broken - write(2) *IS*
interruptible (must be, for obvious reasons) and that pair of signals will
lead to a short write if it comes at the right time. The only thing
userland can do (and does) is to issue further write(2) on anything that
hadn't been written the last time around.
Re: [Cluster-devel] [PATCH v1 3/4] f2fs: fix race between llseek SEEK_END and write
On Wed, Nov 21, 2018 at 18:23, Christoph Hellwig wrote:
>
> On Wed, Nov 21, 2018 at 11:43:59AM +0900, Eiichi Tsukata wrote:
> > This patch itself seems to be just a cleanup, but with the
> > commit b25bd1d9fd87 ("vfs: fix race between llseek SEEK_END and write")
> > it fixes a race.
>
> Please move this patch to the beginning of the series and replace
> the commit log with something like the one below. Note that your
> commit id is different from the one that will appear once applied
> upstream, so the above isn't too helpful.
>
> ---
> f2fs: use generic_file_llseek
>
> f2fs always passes inode->i_sb->s_maxbytes to generic_file_llseek_size,
> and thus should simply use generic_file_llseek. For now this is just
> a cleanup, but it will allow f2fs to pick up a race fix in
> generic_file_llseek for free.

Thanks, I'll fix it in v2.
Re: [Cluster-devel] [PATCH v1 0/4] fs: fix race between llseek SEEK_END and write
On Wed, Nov 21, 2018 at 13:54, Al Viro wrote:
>
> On Wed, Nov 21, 2018 at 11:43:56AM +0900, Eiichi Tsukata wrote:
> > Some file systems (including ext4, xfs, ramfs ...) have the following
> > problem as I've described in the commit message of the 1/4 patch.
> >
> > The commit ef3d0fd27e90 ("vfs: do (nearly) lockless generic_file_llseek")
> > removed almost all locks in llseek() including SEEK_END. It was based on
> > the idea that write() updates the size atomically. But in fact, write()
> > can be divided into two or more parts in generic_perform_write() when pos
> > straddles a PAGE_SIZE boundary, which results in the size being updated
> > multiple times in one write(). It means that llseek() can see the size
> > being updated during write().
>
> And? Who has ever promised anything that insane? write(2) can take an
> arbitrary amount of time; another process doing lseek() on an independently
> opened descriptor is *not* going to wait for that (e.g. page-in of the
> buffer being written, which just happens to be mmapped from a file on NFS
> over an RFC1149 link).

Thanks.

The lock I added in NFS did nothing but slow down lseek(), because the file
size is updated atomically there. Even `spin_lock(&inode->i_lock)` is
unnecessary.

I'll fix the commit message so that it only refers to the specific local
file systems that use generic_perform_write(), and remove unnecessary locks
in some distributed file systems (e.g. nfs, cifs, or more) by replacing
generic_file_llseek() with generic_file_llseek_unlocked(), so that `tail`
doesn't have to wait for avian carriers.
Re: [Cluster-devel] [PATCH V11 03/19] block: introduce bio_for_each_bvec()
On Wed, Nov 21, 2018 at 05:10:25PM +0100, Christoph Hellwig wrote:
> On Wed, Nov 21, 2018 at 11:31:36PM +0800, Ming Lei wrote:
> > > But while looking over this I wonder why we even need the max_seg_len
> > > here. The only thing __bvec_iter_advance does is to move bi_bvec_done
> > > and bi_idx forward, with corresponding decrements of bi_size. As far
> > > as I can tell the only thing that max_seg_len does is that we need
> > > more iterations of the while loop to achieve the same thing.
> > >
> > > And the actual bvec used by the caller will be obtained using
> > > bvec_iter_bvec or segment_iter_bvec depending on if they want multi-page
> > > or single-page variants.
> >
> > Right, we let __bvec_iter_advance() serve for both the multi-page and
> > single-page case, then we have to tell it via one way or another; now we
> > use the constant 'max_seg_len'.
> >
> > Or do you suggest to implement two versions of __bvec_iter_advance()?
>
> No - I think we can always use the code without any segment in
> bvec_iter_advance. Because bvec_iter_advance only operates on the
> iterator, the generation of an actual single-page or multi-page
> bvec is left to the caller using the bvec_iter_bvec or segment_iter_bvec
> helpers. The only difference is how many bytes you can move the
> iterator forward in a single loop iteration - so if you pass in
> PAGE_SIZE as the max_seg_len you just will have to loop more often
> for a large enough byte count, but not actually do anything different.

Yeah, I see that. The difference is made by
bio_iter_iovec()/bio_iter_mp_iovec() in
__bio_for_each_segment()/__bio_for_each_bvec().

Thanks,
Ming
Re: [Cluster-devel] [PATCH V11 02/19] block: introduce multi-page bvec helpers
On Wed, Nov 21, 2018 at 05:08:11PM +0100, Christoph Hellwig wrote:
> On Wed, Nov 21, 2018 at 11:06:11PM +0800, Ming Lei wrote:
> > bvec_iter_* is used for single-page bvecs in the current linus tree, and
> > there are lots of users now:
> >
> > [linux]$ git grep -n "bvec_iter_*" ./ | wc
> >     191     995   13242
> >
> > If we have to switch it first, it can be a big change; just wondering if
> > Jens is happy with that?
>
> Your above grep statement seems to catch every use of struct bvec_iter,
> due to the *.
>
> Most uses of bvec_iter_ are either in the block headers, or are
> ceph wrappers that match the above and can easily be redefined.

OK, looks like you are right; it seems not so widely used:

$ git grep -n -w -E "bvec_iter_len|bvec_iter_bvec|bvec_iter_advance|bvec_iter_page|bvec_iter_offset" ./ | wc
     36     194    2907

I will switch to that, given the affected drivers are only dm, nvdimm and
ceph.

Thanks,
Ming
[Cluster-devel] [PATCH 1/2] gfs2: Ignore recovery attempts if gfs2 has io error or is withdrawn
This patch addresses various problems with gfs2/dlm recovery.

For example, suppose a node with a bunch of gfs2 mounts suddenly reboots due
to a kernel panic, and dlm determines it should perform recovery. DLM does
so from a pseudo-state machine, calling various callbacks into lock_dlm to
perform a sequence of steps. It uses generation numbers and recover bits in
the dlm "control" lock lvbs.

Now suppose another node tries to recover the failed node's journal, but in
so doing, encounters an IO error or withdraws due to unforeseen
circumstances, such as an hba driver failure. In these cases, the recovery
would eventually bail out, but it would still update its generation number
in the lvb. The other nodes would all see the newer generation number and
think they don't need to do recovery, because the generation number is newer
than the last one they saw, and therefore someone else must have already
taken care of it.

If the file system has an io error or is withdrawn, it cannot safely replay
any journals (its own or others') but someone else still needs to do it.
Therefore we don't want it messing with the journal recovery generation
numbers: the local generation numbers eventually get put into the lvb
generation numbers to be seen by all nodes.

This patch adds checks to many of the callbacks used by dlm in its recovery
state machine so that the functions are ignored and skipped if an io error
has occurred or if the file system is withdrawn.
Signed-off-by: Bob Peterson
---
 fs/gfs2/lock_dlm.c | 36
 1 file changed, 36 insertions(+)

diff --git a/fs/gfs2/lock_dlm.c b/fs/gfs2/lock_dlm.c
index 31df26ed7854..68ca648cf918 100644
--- a/fs/gfs2/lock_dlm.c
+++ b/fs/gfs2/lock_dlm.c
@@ -1081,6 +1081,14 @@ static void gdlm_recover_prep(void *arg)
 	struct gfs2_sbd *sdp = arg;
 	struct lm_lockstruct *ls = &sdp->sd_lockstruct;
 
+	if (test_bit(SDF_AIL1_IO_ERROR, &sdp->sd_flags)) {
+		fs_err(sdp, "recover_prep ignored due to io error.\n");
+		return;
+	}
+	if (test_bit(SDF_SHUTDOWN, &sdp->sd_flags)) {
+		fs_err(sdp, "recover_prep ignored due to withdraw.\n");
+		return;
+	}
 	spin_lock(&ls->ls_recover_spin);
 	ls->ls_recover_block = ls->ls_recover_start;
 	set_bit(DFL_DLM_RECOVERY, &ls->ls_recover_flags);
@@ -1103,6 +,16 @@ static void gdlm_recover_slot(void *arg, struct dlm_slot *slot)
 	struct lm_lockstruct *ls = &sdp->sd_lockstruct;
 	int jid = slot->slot - 1;
 
+	if (test_bit(SDF_AIL1_IO_ERROR, &sdp->sd_flags)) {
+		fs_err(sdp, "recover_slot jid %d ignored due to io error.\n",
+		       jid);
+		return;
+	}
+	if (test_bit(SDF_SHUTDOWN, &sdp->sd_flags)) {
+		fs_err(sdp, "recover_slot jid %d ignored due to withdraw.\n",
+		       jid);
+		return;
+	}
 	spin_lock(&ls->ls_recover_spin);
 	if (ls->ls_recover_size < jid + 1) {
 		fs_err(sdp, "recover_slot jid %d gen %u short size %d\n",
@@ -1127,6 +1145,14 @@ static void gdlm_recover_done(void *arg, struct dlm_slot *slots, int num_slots,
 	struct gfs2_sbd *sdp = arg;
 	struct lm_lockstruct *ls = &sdp->sd_lockstruct;
 
+	if (test_bit(SDF_AIL1_IO_ERROR, &sdp->sd_flags)) {
+		fs_err(sdp, "recover_done ignored due to io error.\n");
+		return;
+	}
+	if (test_bit(SDF_SHUTDOWN, &sdp->sd_flags)) {
+		fs_err(sdp, "recover_done ignored due to withdraw.\n");
+		return;
+	}
 	/* ensure the ls jid arrays are large enough */
 	set_recover_size(sdp, slots, num_slots);
@@ -1154,6 +1180,16 @@ static void gdlm_recovery_result(struct gfs2_sbd *sdp, unsigned int jid,
 {
 	struct lm_lockstruct *ls = &sdp->sd_lockstruct;
 
+	if (test_bit(SDF_AIL1_IO_ERROR, &sdp->sd_flags)) {
+		fs_err(sdp, "recovery_result jid %d ignored due to io error.\n",
+		       jid);
+		return;
+	}
+	if (test_bit(SDF_SHUTDOWN, &sdp->sd_flags)) {
+		fs_err(sdp, "recovery_result jid %d ignored due to withdraw.\n",
+		       jid);
+		return;
+	}
 	if (test_bit(DFL_NO_DLM_OPS, &ls->ls_recover_flags))
 		return;
-- 
2.19.1
[Cluster-devel] [PATCH 2/2] gfs2: initiate journal recovery as soon as a node withdraws
This patch uses the "live" glock and some new lvbs to signal when a node has
withdrawn from a file system. Nodes that see this try to initiate journal
recovery.

Definitions:

1. The "withdrawer" is a node that has withdrawn from a gfs2 file system
   due to some inconsistency.
2. The "recoverer" is a node that sees the need to replay the withdrawer's
   journal.

The way it works is as follows:

When things are running normally, all nodes with the file system mounted
have the "live" (non-disk) glock enqueued in SH (shared) mode. When a node
withdraws from the file system, the withdrawer dequeues the "live" glock and
re-enqueues it in EX mode. This forces all other nodes to see a demote
request for the glock. In the act of demoting and promoting the live glock,
a new lvb is used to inform the other nodes that it has withdrawn and its
journal needs recovery. Although unlikely, there may be multiple
withdrawers; dlm keeps the lvb consistent between them, so several of these
lvb bits may be set at any given time, indicating that more than one node
has withdrawn and all their journals need recovery.

Once the withdrawer has the live glock in EX, it knows all the other nodes
have the live glock in UN (unlocked) and have gone through their callback,
so they know it has withdrawn. The withdrawer then dequeues the live glock,
demoting it back to UN, then pauses to give the other nodes a chance to
reacquire it. This pause will hopefully prevent gfs2's lock caching from
collapsing the EX->UN->SH transition into a DLM lock convert directly from
EX->SH, which would keep the other nodes from doing recovery by making them
wait forever for the glock (because once it's back in SH, nothing else is
going to demote it, short of an unmount). After the pause, it re-enqueues
the live glock in shared (SH) mode once again, thus enabling everyone to
resume working normally after recovery has taken place.
When the other nodes see the demote of the live lock, they set a flag in the
superblock indicating recovery is in progress. Unfortunately, lvbs are only
passed by dlm on lock conversions, so the nodes need to re-enqueue the live
glock. They all need it back in SH mode eventually, but only after journal
recovery. They also need to prevent other nodes from doing recovery of the
failed node's journal at the same time. For node failures, DLM takes care of
this problem for us by setting its BLOCK_LOCKS flag and going through its
recovery state machine. However, in the case of a withdraw, we can't use the
same mechanism. So the recoverer acquires the live glock in EX, then
performs journal recovery for all journals indicated in the lvb.

Taking the live glock in EX at this time may cause a race with the original
withdrawer, who enqueued in EX, then UN, then back to SH. There's a new
superblock flag GLF_IN_RECOVERY that is set when the node withdraws and
isn't cleared until recovery is complete. That prevents the recoverer's new
demote callback for the live lock in EX from mistaking it for another
withdraw of the recoverer. The recoverer's promote to EX does not cause a
glock demote on any other potential recoverer, since by definition they must
have already demoted the glock to UN before the recoverer got it.

The first recoverer is granted the live glock in EX and performs journal
recovery, clearing the recovery bits in the live glock lvb as it does. The
recoverer then demotes the glock from EX back to SH. All subsequent nodes
also acquire the glock in EX, but the lvb bits are then clear, so no
additional recovery is done. They skip the bulk of it and transition the
live glock back to SH too.

In order to accommodate this process, I had to make several additional
changes:

(1) I had to allow non-disk glocks (like the live glock) to be enqueued and
promoted in do_xmote, even when the file system is withdrawn.
This should never cause a problem, since they are "non-disk" and therefore
have no tangible resources to corrupt.

(2) The glock demote callback for the live lock cannot call journal recovery
directly. That's because it's a callback made from dlm, which means it has
dlm resources locked. If you tried to call recovery there, it would create a
circular dependency that could never be satisfied: dlm's callback would try
to lock dlm resources it already has locked. Therefore I had to separate the
actual recovery from the callback: the callback queues work to the
workqueue, which causes gfs2_control_func to run in a different context
after the callback is complete. To that end, I modified gfs2_control_func
(the node failure recovery state machine) and added a step 0 (as Dave
Teigland suggested) that recovers withdrawers' journals with new function
remote_withdraw rather than its normal "dead node" recovery sequence.

(3) GFS2 now needs to know the difference between a with
[Cluster-devel] [PATCH 0/2] gfs2: improvements to recovery and withdraw process (v2)
Hi,

This is a second draft of a two-patch set to fix some of the nasty journal
recovery problems I've found lately. The original post from 08 November had
horribly bad and inaccurate comments, as Steve Whitehouse and Andreas
Gruenbacher pointed out. This version is hopefully better and more
accurately describes what the patches do and how they work. Also, I fixed a
superblock flag that was improperly declared as a glock flag. Other than the
renamed and re-valued superblock flag, the code remains unchanged from the
previous version. It probably needs a bit more testing, but it seems to work
well.
---
The problems have to do with file system corruption caused when recovery
replays a journal after the resource group blocks have been unlocked by the
recovery process. In other words, when no cluster node takes responsibility
for replaying the journal of a withdrawing node, it gets replayed later on,
after the blocks' contents have been changed.

The first patch prevents gfs2 from attempting recovery if the file system is
withdrawn or has journal IO errors. Trying to recover your own journal from
either of these unstable conditions is dangerous and likely to corrupt the
file system.

The second patch is more extensive. When a node withdraws from a file
system, it signals all other nodes with the file system mounted to perform
recovery on its journal, since it cannot safely recover its own journal.
This is accomplished by a new non-disk callback glop used exclusively by the
"live" glock, which sets up an lvb in the glock to indicate which journal(s)
need to be recovered.
Regards,

Bob Peterson
---
Bob Peterson (2):
  gfs2: Ignore recovery attempts if gfs2 has io error or is withdrawn
  gfs2: initiate journal recovery as soon as a node withdraws

 fs/gfs2/glock.c    |  5 ++-
 fs/gfs2/glops.c    | 47 +++
 fs/gfs2/incore.h   |  3 ++
 fs/gfs2/lock_dlm.c | 95 ++
 fs/gfs2/log.c      | 62 --
 fs/gfs2/super.c    |  5 ++-
 fs/gfs2/super.h    |  1 +
 fs/gfs2/util.c     | 84
 fs/gfs2/util.h     | 13 +++
 9 files changed, 282 insertions(+), 33 deletions(-)
-- 
2.19.1
Re: [Cluster-devel] [PATCH V11 14/19] block: handle non-cluster bio out of blk_bio_segment_split
Actually.. I think we can kill this code entirely. If we look at what the
clustering setting is really about, it is to avoid ever merging a segment
that spans a page boundary. And we should be able to do that with something
like this before your series:

---
From 0d46fa76c376493a74ea0dbe77305bd5fa2cf011 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig
Date: Wed, 21 Nov 2018 18:39:47 +0100
Subject: block: remove the "cluster" flag

The cluster flag implements some very old SCSI behavior. As far as I can
tell the original intent was to enable or disable any kind of segment
merging. But the actually visible effect to the LLDD is that it limits
each segment to be inside a single page, which we can also affect by
setting the maximum segment size and the virt boundary.

Signed-off-by: Christoph Hellwig
---
 block/blk-merge.c       | 20
 block/blk-settings.c    |  3 ---
 block/blk-sysfs.c       |  5 +
 drivers/scsi/scsi_lib.c | 16 +---
 include/linux/blkdev.h  |  6 --
 5 files changed, 22 insertions(+), 28 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 6be04ef8da5b..e69d8f8ba819 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -195,7 +195,7 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
 			goto split;
 	}
 
-	if (bvprvp && blk_queue_cluster(q)) {
+	if (bvprvp) {
 		if (seg_size + bv.bv_len > queue_max_segment_size(q))
 			goto new_segment;
 		if (!biovec_phys_mergeable(q, bvprvp, &bv))
@@ -295,10 +295,10 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
 					     bool no_sg_merge)
 {
 	struct bio_vec bv, bvprv = { NULL };
-	int cluster, prev = 0;
 	unsigned int seg_size, nr_phys_segs;
 	struct bio *fbio, *bbio;
 	struct bvec_iter iter;
+	bool prev = false;
 
 	if (!bio)
 		return 0;
@@ -313,7 +313,6 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
 	}
 
 	fbio = bio;
-	cluster = blk_queue_cluster(q);
 	seg_size = 0;
 	nr_phys_segs = 0;
 	for_each_bio(bio) {
@@ -325,7 +324,7 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
 			if (no_sg_merge)
 				goto new_segment;
 
-			if (prev && cluster) {
+			if (prev) {
 				if (seg_size + bv.bv_len
 				    > queue_max_segment_size(q))
 					goto new_segment;
@@ -343,7 +342,7 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
 			nr_phys_segs++;
 			bvprv = bv;
-			prev = 1;
+			prev = true;
 			seg_size = bv.bv_len;
 		}
 		bbio = bio;
@@ -396,9 +395,6 @@ static int blk_phys_contig_segment(struct request_queue *q, struct bio *bio,
 {
 	struct bio_vec end_bv = { NULL }, nxt_bv;
 
-	if (!blk_queue_cluster(q))
-		return 0;
-
 	if (bio->bi_seg_back_size + nxt->bi_seg_front_size >
 	    queue_max_segment_size(q))
 		return 0;
@@ -415,12 +411,12 @@ static int blk_phys_contig_segment(struct request_queue *q, struct bio *bio,
 static inline void
 __blk_segment_map_sg(struct request_queue *q, struct bio_vec *bvec,
 		     struct scatterlist *sglist, struct bio_vec *bvprv,
-		     struct scatterlist **sg, int *nsegs, int *cluster)
+		     struct scatterlist **sg, int *nsegs)
 {
 	int nbytes = bvec->bv_len;
 
-	if (*sg && *cluster) {
+	if (*sg) {
 		if ((*sg)->length + nbytes > queue_max_segment_size(q))
 			goto new_segment;
 		if (!biovec_phys_mergeable(q, bvprv, bvec))
@@ -466,12 +462,12 @@ static int __blk_bios_map_sg(struct request_queue *q, struct bio *bio,
 {
 	struct bio_vec bvec, bvprv = { NULL };
 	struct bvec_iter iter;
-	int cluster = blk_queue_cluster(q), nsegs = 0;
+	int nsegs = 0;
 
 	for_each_bio(bio)
 		bio_for_each_segment(bvec, bio, iter)
 			__blk_segment_map_sg(q, &bvec, sglist, &bvprv, sg,
-					     &nsegs, &cluster);
+					     &nsegs);
 
 	return nsegs;
 }
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 3abe831e92c8..3e7038e475ee 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -56,7 +56,6 @@ void blk_set_default_limits(struct queue_limits *lim)
 	lim->alignment_offset = 0;
 	lim->io_opt = 0;
 	lim->misaligned = 0;
-	lim->cluster = 1;
 	lim->zoned = BLK_ZONED_
Re: [Cluster-devel] [PATCH V11 03/19] block: introduce bio_for_each_bvec()
On Wed, Nov 21, 2018 at 05:10:25PM +0100, Christoph Hellwig wrote:
> No - I think we can always use the code without any segment in
> bvec_iter_advance. Because bvec_iter_advance only operates on the
> iterator, the generation of an actual single-page or multi-page
> bvec is left to the caller using the bvec_iter_bvec or segment_iter_bvec
> helpers. The only difference is how many bytes you can move the
> iterator forward in a single loop iteration - so if you pass in
> PAGE_SIZE as the max_seg_len you just will have to loop more often
> for a large enough byte count, but not actually do anything different.

FYI, this patch reverts the max_seg_len related changes back to where we are
in mainline, and as expected everything works fine for me:

diff --git a/include/linux/bio.h b/include/linux/bio.h
index e5b975fa0558..926550ce2d21 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -137,24 +137,18 @@ static inline bool bio_full(struct bio *bio)
 	for (i = 0, iter_all.idx = 0; iter_all.idx < (bio)->bi_vcnt; iter_all.idx++)	\
 		bvec_for_each_segment(bvl, &((bio)->bi_io_vec[iter_all.idx]), i, iter_all)
 
-static inline void __bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
-				      unsigned bytes, unsigned max_seg_len)
+static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
+				    unsigned bytes)
 {
 	iter->bi_sector += bytes >> 9;
 
 	if (bio_no_advance_iter(bio))
 		iter->bi_size -= bytes;
 	else
-		__bvec_iter_advance(bio->bi_io_vec, iter, bytes, max_seg_len);
+		bvec_iter_advance(bio->bi_io_vec, iter, bytes);
 		/* TODO: It is reasonable to complete bio with error here. */
 }
 
-static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
-				    unsigned bytes)
-{
-	__bio_advance_iter(bio, iter, bytes, PAGE_SIZE);
-}
-
 #define __bio_for_each_segment(bvl, bio, iter, start)		\
 	for (iter = (start);					\
 	     (iter).bi_size &&					\
@@ -168,7 +162,7 @@ static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
 	for (iter = (start);					\
 	     (iter).bi_size &&					\
 		((bvl = bio_iter_mp_iovec((bio), (iter))), 1);	\
-	     __bio_advance_iter((bio), &(iter), (bvl).bv_len, BVEC_MAX_LEN))
+	     bio_advance_iter((bio), &(iter), (bvl).bv_len))
 
 /* returns one real segment(multi-page bvec) each time */
 #define bio_for_each_bvec(bvl, bio, iter)			\
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index cab36d838ed0..138b4007b8f2 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -25,8 +25,6 @@
 #include
 #include
 
-#define BVEC_MAX_LEN  ((unsigned int)-1)
-
 /*
  * was unsigned short, but we might as well be ready for > 64kB I/O pages
  */
@@ -102,8 +100,8 @@ struct bvec_iter_all {
 	.bv_offset	= segment_iter_offset((bvec), (iter)),	\
 })
 
-static inline bool __bvec_iter_advance(const struct bio_vec *bv,
-		struct bvec_iter *iter, unsigned bytes, unsigned max_seg_len)
+static inline bool bvec_iter_advance(const struct bio_vec *bv,
+		struct bvec_iter *iter, unsigned bytes)
 {
 	if (WARN_ONCE(bytes > iter->bi_size,
		     "Attempted to advance past end of bvec iter\n")) {
@@ -112,18 +110,12 @@ static inline bool __bvec_iter_advance(const struct bio_vec *bv,
 	}
 
 	while (bytes) {
-		unsigned segment_len = segment_iter_len(bv, *iter);
-
-		if (max_seg_len < BVEC_MAX_LEN)
-			segment_len = min_t(unsigned, segment_len,
-					    max_seg_len -
-					    bvec_iter_offset(bv, *iter));
+		unsigned iter_len = bvec_iter_len(bv, *iter);
+		unsigned len = min(bytes, iter_len);
 
-		segment_len = min(bytes, segment_len);
-
-		bytes -= segment_len;
-		iter->bi_size -= segment_len;
-		iter->bi_bvec_done += segment_len;
+		bytes -= len;
+		iter->bi_size -= len;
+		iter->bi_bvec_done += len;
 
 		if (iter->bi_bvec_done == __bvec_iter_bvec(bv, *iter)->bv_len) {
 			iter->bi_bvec_done = 0;
@@ -157,13 +149,6 @@ static inline bool bvec_iter_rewind(const struct bio_vec *bv,
 	return true;
 }
 
-static inline bool bvec_iter_advance(const struct bio_vec *bv,
-				     struct bvec_iter *iter,
-				     unsigned bytes)
-{
-	return __bvec_iter_advance(bv, iter, bytes, PAGE_SIZE);
-}
-
 #define for_each_bvec(bvl, bio_vec,
Re: [Cluster-devel] [PATCH V11 15/19] block: enable multipage bvecs
On Wed, Nov 21, 2018 at 11:48:13PM +0800, Ming Lei wrote:
> I guess the correct check should be:
>
> 	end_addr = vec_addr + bv->bv_offset + bv->bv_len;
> 	if (same_page &&
> 	    (end_addr & PAGE_MASK) != (page_addr & PAGE_MASK))
> 		return false;

Indeed.
Re: [Cluster-devel] [PATCH V11 14/19] block: handle non-cluster bio out of blk_bio_segment_split
On Wed, Nov 21, 2018 at 11:37:27PM +0800, Ming Lei wrote:
> > +	bio = bio_alloc_bioset(GFP_NOIO, bio_segments(*bio_orig),
> > +			       &non_cluster_bio_set);
>
> bio_segments(*bio_orig) may be > 256, so bio_alloc_bioset() may fail.

Nothing a little min with BIO_MAX_PAGES couldn't fix. This patch was just
intended as an idea of how I think this code could work.
Re: [Cluster-devel] [PATCH V11 03/19] block: introduce bio_for_each_bvec()
On Wed, Nov 21, 2018 at 11:31:36PM +0800, Ming Lei wrote:
> > But while looking over this I wonder why we even need the max_seg_len
> > here. The only thing __bvec_iter_advance does is to move bi_bvec_done
> > and bi_idx forward, with corresponding decrements of bi_size. As far
> > as I can tell the only thing that max_seg_len does is that we need
> > more iterations of the while loop to achieve the same thing.
> >
> > And the actual bvec used by the caller will be obtained using
> > bvec_iter_bvec or segment_iter_bvec depending on if they want multi-page
> > or single-page variants.
>
> Right, we let __bvec_iter_advance() serve for both the multi-page and
> single-page case, then we have to tell it via one way or another; now we
> use the constant 'max_seg_len'.
>
> Or do you suggest to implement two versions of __bvec_iter_advance()?

No - I think we can always use the code without any segment in
bvec_iter_advance. Because bvec_iter_advance only operates on the
iterator, the generation of an actual single-page or multi-page
bvec is left to the caller using the bvec_iter_bvec or segment_iter_bvec
helpers. The only difference is how many bytes you can move the
iterator forward in a single loop iteration - so if you pass in
PAGE_SIZE as the max_seg_len you just will have to loop more often
for a large enough byte count, but not actually do anything different.
Re: [Cluster-devel] [PATCH V11 02/19] block: introduce multi-page bvec helpers
On Wed, Nov 21, 2018 at 11:06:11PM +0800, Ming Lei wrote:
> bvec_iter_* is used for single-page bvecs in the current linus tree, and
> there are lots of users now:
>
> [linux]$ git grep -n "bvec_iter_*" ./ | wc
>     191     995   13242
>
> If we have to switch it first, it can be a big change; just wondering if
> Jens is happy with that?

Your above grep statement seems to catch every use of struct bvec_iter,
due to the *.

Most uses of bvec_iter_ are either in the block headers, or are
ceph wrappers that match the above and can easily be redefined.
Re: [Cluster-devel] [PATCH V11 15/19] block: enable multipage bvecs
On Wed, Nov 21, 2018 at 03:55:02PM +0100, Christoph Hellwig wrote:
> On Wed, Nov 21, 2018 at 11:23:23AM +0800, Ming Lei wrote:
> > 	if (bio->bi_vcnt > 0) {
> > -		struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
> > +		struct bio_vec bv;
> > +		struct bio_vec *seg = &bio->bi_io_vec[bio->bi_vcnt - 1];
> >
> > -		if (page == bv->bv_page && off == bv->bv_offset + bv->bv_len) {
> > -			bv->bv_len += len;
> > +		bvec_last_segment(seg, &bv);
> > +
> > +		if (page == bv.bv_page && off == bv.bv_offset + bv.bv_len) {
>
> I think this we can simplify the try to merge into bio case a bit,
> and also document it better with something like this:
>
> diff --git a/block/bio.c b/block/bio.c
> index 854676edc438..cc913281a723 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -822,54 +822,40 @@ EXPORT_SYMBOL(bio_add_pc_page);
>   * @page: page to add
>   * @len: length of the data to add
>   * @off: offset of the data in @page
> + * @same_page: if %true only merge if the new data is in the same physical
> + *	page as the last segment of the bio.
>   *
> - * Try to add the data at @page + @off to the last page of @bio. This is a
> + * Try to add the data at @page + @off to the last bvec of @bio. This is a
>   * a useful optimisation for file systems with a block size smaller than the
>   * page size.
>   *
>   * Return %true on success or %false on failure.
>   */
>  bool __bio_try_merge_page(struct bio *bio, struct page *page,
> -		unsigned int len, unsigned int off)
> +		unsigned int len, unsigned int off, bool same_page)
>  {
>  	if (WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED)))
>  		return false;
>  
>  	if (bio->bi_vcnt > 0) {
> -		struct bio_vec bv;
> -		struct bio_vec *seg = &bio->bi_io_vec[bio->bi_vcnt - 1];
> -
> -		bvec_last_segment(seg, &bv);
> -
> -		if (page == bv.bv_page && off == bv.bv_offset + bv.bv_len) {
> -			seg->bv_len += len;
> -			bio->bi_iter.bi_size += len;
> -			return true;
> -		}
> +		struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
> +		phys_addr_t vec_addr = page_to_phys(bv->bv_page);
> +		phys_addr_t page_addr = page_to_phys(page);
> +
> +		if (vec_addr + bv->bv_offset + bv->bv_len != page_addr + off)
> +			return false;
> +		if (same_page &&
> +		    (vec_addr & PAGE_SIZE) != (page_addr & PAGE_SIZE))
> +			return false;

I guess the correct check should be:

	end_addr = vec_addr + bv->bv_offset + bv->bv_len;
	if (same_page &&
	    (end_addr & PAGE_MASK) != (page_addr & PAGE_MASK))
		return false;

And this approach is good, will take it in V12.

Thanks,
Ming
Re: [Cluster-devel] [PATCH V11 14/19] block: handle non-cluster bio out of blk_bio_segment_split
On Wed, Nov 21, 2018 at 03:33:55PM +0100, Christoph Hellwig wrote: > > + non-cluster.o > > Do we really need a new source file for these few functions? > > > default: > > + if (!blk_queue_cluster(q)) { > > + blk_queue_non_cluster_bio(q, bio); > > + return; > > I'd name this blk_bio_segment_split_singlepage or similar. OK. > > > +static __init int init_non_cluster_bioset(void) > > +{ > > + WARN_ON(bioset_init(&non_cluster_bio_set, BIO_POOL_SIZE, 0, > > + BIOSET_NEED_BVECS)); > > + WARN_ON(bioset_integrity_create(&non_cluster_bio_set, BIO_POOL_SIZE)); > > + WARN_ON(bioset_init(&non_cluster_bio_split, BIO_POOL_SIZE, 0, 0)); > > Please only allocate the resources once a queue without the cluster > flag is registered, there are only very few modern drivers that do that. OK. > > > +static void non_cluster_end_io(struct bio *bio) > > +{ > > + struct bio *bio_orig = bio->bi_private; > > + > > + bio_orig->bi_status = bio->bi_status; > > + bio_endio(bio_orig); > > + bio_put(bio); > > +} > > Why can't we use bio_chain for the split bios? The parent bio is a multi-page bvec bio, so we can't submit it as-is in the non-cluster case. > > > + bio_for_each_segment(from, *bio_orig, iter) { > > + if (i++ < max_segs) > > + sectors += from.bv_len >> 9; > > + else > > + break; > > + } > > The easy to read way would be: > > bio_for_each_segment(from, *bio_orig, iter) { > if (i++ == max_segs) > break; > sectors += from.bv_len >> 9; > } OK. > > > + if (sectors < bio_sectors(*bio_orig)) { > > + bio = bio_split(*bio_orig, sectors, GFP_NOIO, > > + &non_cluster_bio_split); > > + bio_chain(bio, *bio_orig); > > + generic_make_request(*bio_orig); > > + *bio_orig = bio; > > I don't think this is very efficient, as this means we now > clone the bio twice, first to split it at the sector boundary, > and then again when converting it to single-page bio_vec. That is exactly what the bounce code does. The problem for both bounce and non-cluster is the same, actually, because the bvec table itself has to be changed. 
> > I think this could be something like this (totally untested): > > diff --git a/block/non-cluster.c b/block/non-cluster.c > index 9c2910be9404..60389f275c43 100644 > --- a/block/non-cluster.c > +++ b/block/non-cluster.c > @@ -13,58 +13,59 @@ > > #include "blk.h" > > -static struct bio_set non_cluster_bio_set, non_cluster_bio_split; > +static struct bio_set non_cluster_bio_set; > > static __init int init_non_cluster_bioset(void) > { > WARN_ON(bioset_init(&non_cluster_bio_set, BIO_POOL_SIZE, 0, > BIOSET_NEED_BVECS)); > WARN_ON(bioset_integrity_create(&non_cluster_bio_set, BIO_POOL_SIZE)); > - WARN_ON(bioset_init(&non_cluster_bio_split, BIO_POOL_SIZE, 0, 0)); > > return 0; > } > __initcall(init_non_cluster_bioset); > > -static void non_cluster_end_io(struct bio *bio) > -{ > - struct bio *bio_orig = bio->bi_private; > - > - bio_orig->bi_status = bio->bi_status; > - bio_endio(bio_orig); > - bio_put(bio); > -} > - > void blk_queue_non_cluster_bio(struct request_queue *q, struct bio > **bio_orig) > { > - struct bio *bio; > struct bvec_iter iter; > - struct bio_vec from; > - unsigned i = 0; > - unsigned sectors = 0; > - unsigned short max_segs = min_t(unsigned short, BIO_MAX_PAGES, > - queue_max_segments(q)); > + struct bio *bio; > + struct bio_vec bv; > + unsigned short max_segs, segs = 0; > + > + bio = bio_alloc_bioset(GFP_NOIO, bio_segments(*bio_orig), > + &non_cluster_bio_set); bio_segments(*bio_orig) may be > 256, so bio_alloc_bioset() may fail. Thanks, Ming
Re: [Cluster-devel] [PATCH V11 03/19] block: introduce bio_for_each_bvec()
On Wed, Nov 21, 2018 at 02:32:44PM +0100, Christoph Hellwig wrote: > > +#define bio_iter_mp_iovec(bio, iter) \ > > + segment_iter_bvec((bio)->bi_io_vec, (iter)) > > Besides the mp naming we'd like to get rid off there also is just > a single user of this macro, please just expand it there. OK. > > > +#define segment_iter_bvec(bvec, iter) \ > > +((struct bio_vec) { > > \ > > + .bv_page= segment_iter_page((bvec), (iter)),\ > > + .bv_len = segment_iter_len((bvec), (iter)), \ > > + .bv_offset = segment_iter_offset((bvec), (iter)), \ > > +}) > > And for this one please keep the segment vs bvec versions of these > macros close together in the file please, right now it follow the > bvec_iter_bvec variant closely. OK. > > > +static inline void __bio_advance_iter(struct bio *bio, struct bvec_iter > > *iter, > > + unsigned bytes, unsigned max_seg_len) > > { > > iter->bi_sector += bytes >> 9; > > > > if (bio_no_advance_iter(bio)) > > iter->bi_size -= bytes; > > else > > - bvec_iter_advance(bio->bi_io_vec, iter, bytes); > > + __bvec_iter_advance(bio->bi_io_vec, iter, bytes, max_seg_len); > > /* TODO: It is reasonable to complete bio with error here. */ > > } > > > > +static inline void bio_advance_iter(struct bio *bio, struct bvec_iter > > *iter, > > + unsigned bytes) > > +{ > > + __bio_advance_iter(bio, iter, bytes, PAGE_SIZE); > > +} > > Btw, I think the remaining users of bio_advance_iter() in bio.h > should probably switch to using __bio_advance_iter to make them a little > more clear to read. Good point. > > > +/* returns one real segment(multi-page bvec) each time */ > > space before the brace, please. OK. 
> > > +#define BVEC_MAX_LEN ((unsigned int)-1) > > > while (bytes) { > > + unsigned segment_len = segment_iter_len(bv, *iter); > > > > - iter->bi_bvec_done += len; > > + if (max_seg_len < BVEC_MAX_LEN) > > + segment_len = min_t(unsigned, segment_len, > > + max_seg_len - > > + bvec_iter_offset(bv, *iter)); > > + > > + segment_len = min(bytes, segment_len); > > Please stick to passing the magic zero here as can often generate more > efficient code. But zero may decrease the code readability. Actually the passed 'max_seg_len' is just a constant, and the compiler should generate equally efficient code for any constant, whether 0 or something else. > > Talking about efficent code - I wonder how much code size we'd save > by moving this function out of line.. That is a good point, see the following diff: [mingl@hp kernel]$ diff -u inline.size non_inline.size --- inline.size 2018-11-21 23:24:52.305312076 +0800 +++ non_inline.size 2018-11-21 23:24:59.908393010 +0800 @@ -1,2 +1,2 @@ text data bss dec hex filename -13429213 6893922 4292692 24615827 1779b93 vmlinux.inline +13429153 6893346 4292692 24615191 1779917 vmlinux.non_inline vmlinux(non_inline) is built by just moving/exporting __bvec_iter_advance() into block/bio.c. The difference is about 276 bytes. > > But while looking over this I wonder why we even need the max_seg_len > here. The only thing __bvec_iter_advance does it to move bi_bvec_done > and bi_idx forward, with corresponding decrements of bi_size. As far > as I can tell the only thing that max_seg_len does is that we need > to more iterations of the while loop to archive the same thing. > > And actual bvec used by the caller will be obtained using > bvec_iter_bvec or segment_iter_bvec depending on if they want multi-page > or single-page variants. Right, we let __bvec_iter_advance() serve both the multi-page and single-page cases, so we have to tell it which one via some means; currently that is the 'max_seg_len' constant. 
Or do you suggest implementing two versions of __bvec_iter_advance()? Thanks, Ming
Re: [Cluster-devel] [PATCH V11 02/19] block: introduce multi-page bvec helpers
On Wed, Nov 21, 2018 at 02:19:28PM +0100, Christoph Hellwig wrote: > On Wed, Nov 21, 2018 at 11:23:10AM +0800, Ming Lei wrote: > > This patch introduces helpers of 'segment_iter_*' for multipage > > bvec support. > > > > The introduced helpers treate one bvec as real multi-page segment, > > which may include more than one pages. > > Unless I'm missing something these bvec vs segment names are exactly > inverted vs how we use it elsewhere. > > In the iterators we use segment for single-page bvec, and bvec for multi > page ones, and here it is inverse. Please switch it around. bvec_iter_* is used for single-page bvec in current linus tree, and there are lots of users now: [linux]$ git grep -n "bvec_iter_*" ./ | wc 191 995 13242 If we have to switch it first, it can be a big change, just wondering if Jens is happy with that? Thanks, Ming
Re: [Cluster-devel] [PATCH V11 15/19] block: enable multipage bvecs
On Wed, Nov 21, 2018 at 11:23:23AM +0800, Ming Lei wrote: > if (bio->bi_vcnt > 0) { > - struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1]; > + struct bio_vec bv; > + struct bio_vec *seg = &bio->bi_io_vec[bio->bi_vcnt - 1]; > > - if (page == bv->bv_page && off == bv->bv_offset + bv->bv_len) { > - bv->bv_len += len; > + bvec_last_segment(seg, &bv); > + > + if (page == bv.bv_page && off == bv.bv_offset + bv.bv_len) { I think this we can simplify the try to merge into bio case a bit, and also document it better with something like this: diff --git a/block/bio.c b/block/bio.c index 854676edc438..cc913281a723 100644 --- a/block/bio.c +++ b/block/bio.c @@ -822,54 +822,40 @@ EXPORT_SYMBOL(bio_add_pc_page); * @page: page to add * @len: length of the data to add * @off: offset of the data in @page + * @same_page: if %true only merge if the new data is in the same physical + * page as the last segment of the bio. * - * Try to add the data at @page + @off to the last page of @bio. This is a + * Try to add the data at @page + @off to the last bvec of @bio. This is a * a useful optimisation for file systems with a block size smaller than the * page size. * * Return %true on success or %false on failure. 
*/ bool __bio_try_merge_page(struct bio *bio, struct page *page, - unsigned int len, unsigned int off) + unsigned int len, unsigned int off, bool same_page) { if (WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED))) return false; if (bio->bi_vcnt > 0) { - struct bio_vec bv; - struct bio_vec *seg = &bio->bi_io_vec[bio->bi_vcnt - 1]; - - bvec_last_segment(seg, &bv); - - if (page == bv.bv_page && off == bv.bv_offset + bv.bv_len) { - seg->bv_len += len; - bio->bi_iter.bi_size += len; - return true; - } + struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1]; + phys_addr_t vec_addr = page_to_phys(bv->bv_page); + phys_addr_t page_addr = page_to_phys(page); + + if (vec_addr + bv->bv_offset + bv->bv_len != page_addr + off) + return false; + if (same_page && + (vec_addr & PAGE_SIZE) != (page_addr & PAGE_SIZE)) + return false; + + bv->bv_len += len; + bio->bi_iter.bi_size += len; + return true; } return false; } EXPORT_SYMBOL_GPL(__bio_try_merge_page); -static bool bio_try_merge_segment(struct bio *bio, struct page *page, - unsigned int len, unsigned int off) -{ - if (WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED))) - return false; - - if (bio->bi_vcnt > 0) { - struct bio_vec *seg = &bio->bi_io_vec[bio->bi_vcnt - 1]; - - if (page_to_phys(seg->bv_page) + seg->bv_offset + seg->bv_len == - page_to_phys(page) + off) { - seg->bv_len += len; - bio->bi_iter.bi_size += len; - return true; - } - } - return false; -} - /** * __bio_add_page - add page to a bio in a new segment * @bio: destination bio @@ -910,7 +896,7 @@ EXPORT_SYMBOL_GPL(__bio_add_page); int bio_add_page(struct bio *bio, struct page *page, unsigned int len, unsigned int offset) { - if (!bio_try_merge_segment(bio, page, len, offset)) { + if (!__bio_try_merge_page(bio, page, len, offset, false)) { if (bio_full(bio)) return 0; __bio_add_page(bio, page, len, offset); diff --git a/fs/iomap.c b/fs/iomap.c index ccc2ba115f4d..d918acb9bfc9 100644 --- a/fs/iomap.c +++ b/fs/iomap.c @@ -313,7 +313,7 @@ iomap_readpage_actor(struct inode 
*inode, loff_t pos, loff_t length, void *data, */ sector = iomap_sector(iomap, pos); if (ctx->bio && bio_end_sector(ctx->bio) == sector) { - if (__bio_try_merge_page(ctx->bio, page, plen, poff)) + if (__bio_try_merge_page(ctx->bio, page, plen, poff, true)) goto done; is_contig = true; } diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c index 5c2190216614..b9fd44168f61 100644 --- a/fs/xfs/xfs_aops.c +++ b/fs/xfs/xfs_aops.c @@ -616,7 +616,7 @@ xfs_add_to_ioend( bdev, sector); } - if (!__bio_try_merge_page(wpc->ioend->io_bio, page, len, poff)) { + if (!__bio_try_merge_page(wpc->ioend->io_bio, page, len, poff, true)) { if (iop) atomic_inc(&iop->write_count); if (bio_full(wpc->ioend->io_bio)) diff --git a/include/linux/bio.h b/include/linux/bio.h index e5b97
Re: [Cluster-devel] [PATCH V11 17/19] block: document usage of bio iterator helpers
On Wed, Nov 21, 2018 at 09:45:25AM +0200, Nikolay Borisov wrote: > > + bio_for_each_segment_all() > > + bio_first_bvec_all() > > + bio_first_page_all() > > + bio_last_bvec_all() > > + > > +* The following helpers iterate over single-page bvecs. The passed 'struct > > +bio_vec' will contain a single-page IO vector during the iteration > > + > > + bio_for_each_segment() > > + bio_for_each_segment_all() > > + > > +* The following helpers iterate over single-page bvecs. The passed 'struct > > +bio_vec' will contain a single-page IO vector during the iteration > > + > > + bio_for_each_bvec() > > Just put this helper right below the above 2, no need to repeat the > explanation. Also I'd suggest introducing another catch-all sentence > "All other helpers are assumed to iterate multipage bio vecs" and > perhaps give an example with 1-2 helpers. Well, I think the second explanation is wrong - bio_for_each_bvec iterates over whole (multi-page) bvecs, not just single pages.
Re: [Cluster-devel] [PATCH V11 14/19] block: handle non-cluster bio out of blk_bio_segment_split
> + non-cluster.o Do we really need a new source file for these few functions? > default: > + if (!blk_queue_cluster(q)) { > + blk_queue_non_cluster_bio(q, bio); > + return; I'd name this blk_bio_segment_split_singlepage or similar. > +static __init int init_non_cluster_bioset(void) > +{ > + WARN_ON(bioset_init(&non_cluster_bio_set, BIO_POOL_SIZE, 0, > +BIOSET_NEED_BVECS)); > + WARN_ON(bioset_integrity_create(&non_cluster_bio_set, BIO_POOL_SIZE)); > + WARN_ON(bioset_init(&non_cluster_bio_split, BIO_POOL_SIZE, 0, 0)); Please only allocate the resources once a queue without the cluster flag is registered, there are only very few modern drivers that do that. > +static void non_cluster_end_io(struct bio *bio) > +{ > + struct bio *bio_orig = bio->bi_private; > + > + bio_orig->bi_status = bio->bi_status; > + bio_endio(bio_orig); > + bio_put(bio); > +} Why can't we use bio_chain for the split bios? > + bio_for_each_segment(from, *bio_orig, iter) { > + if (i++ < max_segs) > + sectors += from.bv_len >> 9; > + else > + break; > + } The easy to read way would be: bio_for_each_segment(from, *bio_orig, iter) { if (i++ == max_segs) break; sectors += from.bv_len >> 9; } > + if (sectors < bio_sectors(*bio_orig)) { > + bio = bio_split(*bio_orig, sectors, GFP_NOIO, > + &non_cluster_bio_split); > + bio_chain(bio, *bio_orig); > + generic_make_request(*bio_orig); > + *bio_orig = bio; I don't think this is very efficient, as this means we now clone the bio twice, first to split it at the sector boundary, and then again when converting it to single-page bio_vec. 
I think this could be something like this (totally untested): diff --git a/block/non-cluster.c b/block/non-cluster.c index 9c2910be9404..60389f275c43 100644 --- a/block/non-cluster.c +++ b/block/non-cluster.c @@ -13,58 +13,59 @@ #include "blk.h" -static struct bio_set non_cluster_bio_set, non_cluster_bio_split; +static struct bio_set non_cluster_bio_set; static __init int init_non_cluster_bioset(void) { WARN_ON(bioset_init(&non_cluster_bio_set, BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS)); WARN_ON(bioset_integrity_create(&non_cluster_bio_set, BIO_POOL_SIZE)); - WARN_ON(bioset_init(&non_cluster_bio_split, BIO_POOL_SIZE, 0, 0)); return 0; } __initcall(init_non_cluster_bioset); -static void non_cluster_end_io(struct bio *bio) -{ - struct bio *bio_orig = bio->bi_private; - - bio_orig->bi_status = bio->bi_status; - bio_endio(bio_orig); - bio_put(bio); -} - void blk_queue_non_cluster_bio(struct request_queue *q, struct bio **bio_orig) { - struct bio *bio; struct bvec_iter iter; - struct bio_vec from; - unsigned i = 0; - unsigned sectors = 0; - unsigned short max_segs = min_t(unsigned short, BIO_MAX_PAGES, - queue_max_segments(q)); + struct bio *bio; + struct bio_vec bv; + unsigned short max_segs, segs = 0; + + bio = bio_alloc_bioset(GFP_NOIO, bio_segments(*bio_orig), + &non_cluster_bio_set); + bio->bi_disk= (*bio_orig)->bi_disk; + bio->bi_partno = (*bio_orig)->bi_partno; + bio_set_flag(bio, BIO_CLONED); + if (bio_flagged(*bio_orig, BIO_THROTTLED)) + bio_set_flag(bio, BIO_THROTTLED); + bio->bi_opf = (*bio_orig)->bi_opf; + bio->bi_ioprio = (*bio_orig)->bi_ioprio; + bio->bi_write_hint = (*bio_orig)->bi_write_hint; + bio->bi_iter.bi_sector = (*bio_orig)->bi_iter.bi_sector; + bio->bi_iter.bi_size= (*bio_orig)->bi_iter.bi_size; + + if (bio_integrity(*bio_orig)) + bio_integrity_clone(bio, *bio_orig, GFP_NOIO); - bio_for_each_segment(from, *bio_orig, iter) { - if (i++ < max_segs) - sectors += from.bv_len >> 9; - else + bio_clone_blkcg_association(bio, *bio_orig); + + max_segs = 
min_t(unsigned short, queue_max_segments(q), BIO_MAX_PAGES); + bio_for_each_segment(bv, *bio_orig, iter) { + bio->bi_io_vec[segs++] = bv; + if (segs++ == max_segs) break; } - if (sectors < bio_sectors(*bio_orig)) { - bio = bio_split(*bio_orig, sectors, GFP_NOIO, - &non_cluster_bio_split); - bio_chain(bio, *bio_orig); - generic_make_request(*bio_orig); - *bio_orig = bio; - } - bio = bio_clone_bioset(*bio_orig, GFP_NOIO, &non_cluster_bio_set); + bio->bi_vcnt = se
Re: [Cluster-devel] [PATCH V11 12/19] block: allow bio_for_each_segment_all() to iterate over multi-page bvec
On Wed, Nov 21, 2018 at 11:23:20AM +0800, Ming Lei wrote: > This patch introduces one extra iterator variable to > bio_for_each_segment_all(), > then we can allow bio_for_each_segment_all() to iterate over multi-page bvec. > > Given it is just one mechannical & simple change on all > bio_for_each_segment_all() > users, this patch does tree-wide change in one single patch, so that we can > avoid to use a temporary helper for this conversion. > > Signed-off-by: Ming Lei Looks good, Reviewed-by: Christoph Hellwig
Re: [Cluster-devel] [PATCH V11 11/19] bcache: avoid to use bio_for_each_segment_all() in bch_bio_alloc_pages()
On Wed, Nov 21, 2018 at 11:23:19AM +0800, Ming Lei wrote: > bch_bio_alloc_pages() is always called on one new bio, so it is safe > to access the bvec table directly. Given it is the only kind of this > case, open code the bvec table access since bio_for_each_segment_all() > will be changed to support for iterating over multipage bvec. Looks good, Reviewed-by: Christoph Hellwig
Re: [Cluster-devel] [PATCH V11 10/19] block: loop: pass multi-page bvec to iov_iter
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h > index 1ad6eafc43f2..a281b6737b61 100644 > --- a/include/linux/blkdev.h > +++ b/include/linux/blkdev.h > @@ -805,6 +805,10 @@ struct req_iterator { > __rq_for_each_bio(_iter.bio, _rq) \ > bio_for_each_segment(bvl, _iter.bio, _iter.iter) > > +#define rq_for_each_bvec(bvl, _rq, _iter)\ > + __rq_for_each_bio(_iter.bio, _rq) \ > + bio_for_each_bvec(bvl, _iter.bio, _iter.iter) > + I think this should go into the patch adding bio_for_each_bvec and friends. Otherwise looks fine: Reviewed-by: Christoph Hellwig
Re: [Cluster-devel] [PATCH V11 03/19] block: introduce bio_for_each_bvec()
> +#define bio_iter_mp_iovec(bio, iter) \ > + segment_iter_bvec((bio)->bi_io_vec, (iter)) Besides the mp naming we'd like to get rid off there also is just a single user of this macro, please just expand it there. > +#define segment_iter_bvec(bvec, iter)\ > +((struct bio_vec) { \ > + .bv_page= segment_iter_page((bvec), (iter)),\ > + .bv_len = segment_iter_len((bvec), (iter)), \ > + .bv_offset = segment_iter_offset((bvec), (iter)), \ > +}) And for this one please keep the segment vs bvec versions of these macros close together in the file please, right now it follow the bvec_iter_bvec variant closely. > +static inline void __bio_advance_iter(struct bio *bio, struct bvec_iter > *iter, > + unsigned bytes, unsigned max_seg_len) > { > iter->bi_sector += bytes >> 9; > > if (bio_no_advance_iter(bio)) > iter->bi_size -= bytes; > else > - bvec_iter_advance(bio->bi_io_vec, iter, bytes); > + __bvec_iter_advance(bio->bi_io_vec, iter, bytes, max_seg_len); > /* TODO: It is reasonable to complete bio with error here. */ > } > > +static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter, > + unsigned bytes) > +{ > + __bio_advance_iter(bio, iter, bytes, PAGE_SIZE); > +} Btw, I think the remaining users of bio_advance_iter() in bio.h should probably switch to using __bio_advance_iter to make them a little more clear to read. > +/* returns one real segment(multi-page bvec) each time */ space before the brace, please. > +#define BVEC_MAX_LEN ((unsigned int)-1) > while (bytes) { > + unsigned segment_len = segment_iter_len(bv, *iter); > > - iter->bi_bvec_done += len; > + if (max_seg_len < BVEC_MAX_LEN) > + segment_len = min_t(unsigned, segment_len, > + max_seg_len - > + bvec_iter_offset(bv, *iter)); > + > + segment_len = min(bytes, segment_len); Please stick to passing the magic zero here as can often generate more efficient code. Talking about efficent code - I wonder how much code size we'd save by moving this function out of line.. 
But while looking over this I wonder why we even need the max_seg_len here. The only thing __bvec_iter_advance does is to move bi_bvec_done and bi_idx forward, with corresponding decrements of bi_size. As far as I can tell the only thing that max_seg_len does is that we need more iterations of the while loop to achieve the same thing. And the actual bvec used by the caller will be obtained using bvec_iter_bvec or segment_iter_bvec depending on if they want multi-page or single-page variants.
Re: [Cluster-devel] [PATCH V11 02/19] block: introduce multi-page bvec helpers
On Wed, Nov 21, 2018 at 11:23:10AM +0800, Ming Lei wrote: > This patch introduces helpers of 'segment_iter_*' for multipage > bvec support. > > The introduced helpers treate one bvec as real multi-page segment, > which may include more than one pages. Unless I'm missing something these bvec vs segment names are exactly inverted vs how we use it elsewhere. In the iterators we use segment for single-page bvec, and bvec for multi page ones, and here it is inverse. Please switch it around.
Re: [Cluster-devel] [PATCH V10 09/19] block: introduce bio_bvecs()
On Tue, Nov 20, 2018 at 09:35:07PM -0800, Sagi Grimberg wrote: > > > > Wait, I see that the bvec is still a single array per bio. When you said > > > a table I thought you meant a 2-dimentional array... > > > > I mean a new 1-d table A has to be created for multiple bios in one rq, > > and build it in the following way > > > > rq_for_each_bvec(tmp, rq, rq_iter) > > *A = tmp; > > > > Then you can pass A to iov_iter_bvec() & send(). > > > > Given it is over TCP, I guess it should be doable for you to preallocate one > > 256-bvec table in one page for each request, then sets the max segment size > > as > > (unsigned int)-1, and max segment number as 256, the preallocated table > > should work anytime. > > 256 bvec table is really a lot to preallocate, especially when its not > needed, I can easily initialize the bvec_iter on the bio bvec. If this > involves preallocation of the worst-case than I don't consider this to > be an improvement. If you don't provide one single bvec table, I understand you may not send this req via one send(). The bvec_iter initialization is easy to do: bvec_iter = bio->bi_iter when you move to a new bio; please refer to __bio_for_each_bvec() or __bio_for_each_segment(). Thanks, Ming
Re: [Cluster-devel] [PATCH v1 3/4] f2fs: fix race between llseek SEEK_END and write
On Wed, Nov 21, 2018 at 11:43:59AM +0900, Eiichi Tsukata wrote: > This patch itself seems to be just a cleanup but with the > commit b25bd1d9fd87 ("vfs: fix race between llseek SEEK_END and write") > it fixes race. Please move this patch to the beginning of the series and replace the commit log with something like the one below. Note that your commit id is different from the one that will appear once applied upstream, so the above isn't too helpful. --- f2fs: use generic_file_llseek f2fs always passes inode->i_sb->s_maxbytes to generic_file_llseek_size, and thus should simply use generic_file_llseek. For now this is just a cleanup, but it will allow f2fs to pick up a race fix in generic_file_llseek for free.
[Cluster-devel] [PATCH v1 0/4] fs: fix race between llseek SEEK_END and write
Some file systems (including ext4, xfs, ramfs ...) have the following problem as I've described in the commit message of the 1/4 patch. The commit ef3d0fd27e90 ("vfs: do (nearly) lockless generic_file_llseek") removed almost all locks in llseek() including SEEK_END. It was based on the idea that write() updates the size atomically. But in fact, write() can be divided into two or more parts in generic_perform_write() when pos straddles a PAGE_SIZE boundary, which results in updating the size multiple times in one write(). It means that llseek() can see the size being updated during write(). This race changes the behavior of some applications. 'tail' is one of those applications. It reads range [pos, pos_end] where pos_end is obtained via llseek() SEEK_END. Sometimes, a read line could be broken. reproducer: $ while true; do echo 123456 >> out; done $ while true; do tail out | grep -v 123456 ; done example output (takes 30 secs): 12345 1 1234 1 12 1234 Note: Some file systems which individually implement llseek() and hold the inode mutex in it are not affected, e.g. btrfs, ocfs2 Patch 1: re-implements the inode lock for SEEK_END in generic_file_llseek() Patches 2 to 4: implement the inode lock for SEEK_END in individual file systems Eiichi Tsukata (4): vfs: fix race between llseek SEEK_END and write ext4: fix race between llseek SEEK_END and write f2fs: fix race between llseek SEEK_END and write overlayfs: fix race between llseek SEEK_END and write fs/btrfs/file.c | 2 +- fs/ext4/file.c | 10 ++ fs/f2fs/file.c | 6 +- fs/fuse/file.c | 5 +++-- fs/gfs2/file.c | 3 ++- fs/overlayfs/file.c | 23 --- fs/read_write.c | 37 ++--- include/linux/fs.h | 2 ++ 8 files changed, 73 insertions(+), 15 deletions(-) -- 2.19.1
[Cluster-devel] [PATCH v1 3/4] f2fs: fix race between llseek SEEK_END and write
This patch itself seems to be just a cleanup but with the commit b25bd1d9fd87 ("vfs: fix race between llseek SEEK_END and write") it fixes a race. Signed-off-by: Eiichi Tsukata --- fs/f2fs/file.c | 6 +- 1 file changed, 1 insertion(+), 5 deletions(-) diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c index 88b124677189..14a923607eff 100644 --- a/fs/f2fs/file.c +++ b/fs/f2fs/file.c @@ -447,15 +447,11 @@ static loff_t f2fs_seek_block(struct file *file, loff_t offset, int whence) static loff_t f2fs_llseek(struct file *file, loff_t offset, int whence) { - struct inode *inode = file->f_mapping->host; - loff_t maxbytes = inode->i_sb->s_maxbytes; - switch (whence) { case SEEK_SET: case SEEK_CUR: case SEEK_END: - return generic_file_llseek_size(file, offset, whence, - maxbytes, i_size_read(inode)); + return generic_file_llseek(file, offset, whence); case SEEK_DATA: case SEEK_HOLE: if (offset < 0) -- 2.19.1
Re: [Cluster-devel] [PATCH V11 17/19] block: document usage of bio iterator helpers
On 21.11.18 г. 5:23 ч., Ming Lei wrote: > Now multi-page bvec is supported, some helpers may return page by > page, meantime some may return segment by segment, this patch > documents the usage. > > Signed-off-by: Ming Lei > --- > Documentation/block/biovecs.txt | 24 > 1 file changed, 24 insertions(+) > > diff --git a/Documentation/block/biovecs.txt b/Documentation/block/biovecs.txt > index 25689584e6e0..bb008f7afb05 100644 > --- a/Documentation/block/biovecs.txt > +++ b/Documentation/block/biovecs.txt > @@ -117,3 +117,27 @@ Other implications: > size limitations and the limitations of the underlying devices. Thus > there's no need to define ->merge_bvec_fn() callbacks for individual block > drivers. > + > +Usage of helpers: > += > + > +* The following helpers whose names have the suffix of "_all" can only be > used > +on non-BIO_CLONED bio. They are usually used by filesystem code. Drivers > +shouldn't use them because the bio may have been split before it reached the > +driver. > + > + bio_for_each_segment_all() > + bio_first_bvec_all() > + bio_first_page_all() > + bio_last_bvec_all() > + > +* The following helpers iterate over single-page bvecs. The passed 'struct > +bio_vec' will contain a single-page IO vector during the iteration > + > + bio_for_each_segment() > + bio_for_each_segment_all() > + > +* The following helpers iterate over single-page bvecs. The passed 'struct > +bio_vec' will contain a single-page IO vector during the iteration > + > + bio_for_each_bvec() Just put this helper right below the above 2, no need to repeat the explanation. Also I'd suggest introducing another catch-all sentence "All other helpers are assumed to iterate multipage bio vecs" and perhaps give an example with 1-2 helpers. >
[Cluster-devel] [PATCH v1 2/4] ext4: fix race between llseek SEEK_END and write
Implement individual lock for SEEK_END for ext4 which directly calls generic_file_llseek_size(). Signed-off-by: Eiichi Tsukata --- fs/ext4/file.c | 10 ++ 1 file changed, 10 insertions(+) diff --git a/fs/ext4/file.c b/fs/ext4/file.c index 69d65d49837b..6479f3066043 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -477,6 +477,16 @@ loff_t ext4_llseek(struct file *file, loff_t offset, int whence) default: return generic_file_llseek_size(file, offset, whence, maxbytes, i_size_read(inode)); + case SEEK_END: + /* +* protects against inode size race with write so that llseek +* doesn't see inode size being updated in generic_perform_write +*/ + inode_lock_shared(inode); + offset = generic_file_llseek_size(file, offset, whence, + maxbytes, i_size_read(inode)); + inode_unlock_shared(inode); + return offset; case SEEK_HOLE: inode_lock_shared(inode); offset = iomap_seek_hole(inode, offset, &ext4_iomap_ops); -- 2.19.1
[Cluster-devel] [PATCH v1 4/4] overlayfs: fix race between llseek SEEK_END and write
Implement individual lock for SEEK_END for overlayfs which directly calls generic_file_llseek_size(). Signed-off-by: Eiichi Tsukata --- fs/overlayfs/file.c | 23 --- 1 file changed, 20 insertions(+), 3 deletions(-) diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c index 84dd957efa24..57bc6538eea8 100644 --- a/fs/overlayfs/file.c +++ b/fs/overlayfs/file.c @@ -146,10 +146,27 @@ static int ovl_release(struct inode *inode, struct file *file) static loff_t ovl_llseek(struct file *file, loff_t offset, int whence) { struct inode *realinode = ovl_inode_real(file_inode(file)); + loff_t ret; - return generic_file_llseek_size(file, offset, whence, - realinode->i_sb->s_maxbytes, - i_size_read(realinode)); + switch (whence) { + default: + return generic_file_llseek_size(file, offset, whence, + realinode->i_sb->s_maxbytes, + i_size_read(realinode)); + case SEEK_END: + case SEEK_DATA: + case SEEK_HOLE: + /* +* protects against inode size race with write so that llseek +* doesn't see inode size being updated in individual fs write +*/ + inode_lock(realinode); + ret = generic_file_llseek_size(file, offset, whence, + realinode->i_sb->s_maxbytes, + i_size_read(realinode)); + inode_unlock(realinode); + return ret; + } } static void ovl_file_accessed(struct file *file) -- 2.19.1
[Cluster-devel] [PATCH v1 1/4] vfs: fix race between llseek SEEK_END and write
The commit ef3d0fd27e90 ("vfs: do (nearly) lockless generic_file_llseek") removed almost all locks in llseek() including SEEK_END. It was based on the idea that write() updates the size atomically. But in fact, write() can be divided into two or more parts in generic_perform_write() when pos straddles a PAGE_SIZE boundary, which results in updating the size multiple times in one write(). It means that llseek() can see the size being updated during write(). This race changes the behavior of some applications. 'tail' is one of those applications. It reads range [pos, pos_end] where pos_end is obtained via llseek() SEEK_END. Sometimes, a read line could be broken. reproducer: $ while true; do echo 123456 >> out; done $ while true; do tail out | grep -v 123456 ; done example output (takes 30 secs): 12345 1 1234 1 12 1234 This patch re-introduces generic_file_llseek_unlocked() and implements a lock for SEEK_END/DATA/HOLE in generic_file_llseek(). I replaced all generic_file_llseek() callers with _unlocked() if they are called with an inode lock held. All file systems which call generic_file_llseek_size() directly are fixed in the later commits. 
Fixes: ef3d0fd27e90 ("vfs: do (nearly) lockless generic_file_llseek")
Signed-off-by: Eiichi Tsukata
---
 fs/btrfs/file.c    |  2 +-
 fs/fuse/file.c     |  5 +++--
 fs/gfs2/file.c     |  3 ++-
 fs/read_write.c    | 37 ++++++++++++++++++++++++++++++++++---
 include/linux/fs.h |  2 ++
 5 files changed, 42 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index a3c22e16509b..ec932fa0f8a9 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -3256,7 +3256,7 @@ static loff_t btrfs_file_llseek(struct file *file, loff_t offset, int whence)
 	switch (whence) {
 	case SEEK_END:
 	case SEEK_CUR:
-		offset = generic_file_llseek(file, offset, whence);
+		offset = generic_file_llseek_unlocked(file, offset, whence);
 		goto out;
 	case SEEK_DATA:
 	case SEEK_HOLE:
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index b52f9baaa3e7..e220b848929b 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -2336,13 +2336,14 @@ static loff_t fuse_file_llseek(struct file *file, loff_t offset, int whence)
 	case SEEK_SET:
 	case SEEK_CUR:
 		/* No i_mutex protection necessary for SEEK_CUR and SEEK_SET */
-		retval = generic_file_llseek(file, offset, whence);
+		retval = generic_file_llseek_unlocked(file, offset, whence);
 		break;
 	case SEEK_END:
 		inode_lock(inode);
 		retval = fuse_update_attributes(inode, file);
 		if (!retval)
-			retval = generic_file_llseek(file, offset, whence);
+			retval = generic_file_llseek_unlocked(file, offset,
+							      whence);
 		inode_unlock(inode);
 		break;
 	case SEEK_HOLE:
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 45a17b770d97..171df9550c27 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -66,7 +66,8 @@ static loff_t gfs2_llseek(struct file *file, loff_t offset, int whence)
 		error = gfs2_glock_nq_init(ip->i_gl, LM_ST_SHARED, LM_FLAG_ANY,
 					   &i_gh);
 		if (!error) {
-			error = generic_file_llseek(file, offset, whence);
+			error = generic_file_llseek_unlocked(file, offset,
+							     whence);
 			gfs2_glock_dq_uninit(&i_gh);
 		}
 		break;
diff --git a/fs/read_write.c b/fs/read_write.c
index bfcb4ced5664..859dbac5b2f6 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -131,6 +131,24 @@ generic_file_llseek_size(struct file *file, loff_t offset, int whence,
 }
 EXPORT_SYMBOL(generic_file_llseek_size);
 
+/**
+ * generic_file_llseek_unlocked - lockless generic llseek implementation
+ * @file:	file structure to seek on
+ * @offset:	file offset to seek to
+ * @whence:	type of seek
+ */
+loff_t generic_file_llseek_unlocked(struct file *file, loff_t offset,
+				    int whence)
+{
+	struct inode *inode = file->f_mapping->host;
+
+	return generic_file_llseek_size(file, offset, whence,
+					inode->i_sb->s_maxbytes,
+					i_size_read(inode));
+}
+EXPORT_SYMBOL(generic_file_llseek_unlocked);
+
 /**
  * generic_file_llseek - generic llseek implementation for regular files
  * @file:	file structure to seek on
@@ -144,10 +162,23 @@ EXPORT_SYMBOL(generic_file_llseek_size);
 loff_t generic_file_llseek(struct file *file, loff_t offset, int whence)
 {
 	struct inode *inode = file->f_mapping->host;
+	loff_t retval;
 
-	return generic_file_llseek_size(file, offset, whence,
-
Re: [Cluster-devel] [PATCH V10 09/19] block: introduce bio_bvecs()
On Tue, Nov 20, 2018 at 09:35:07PM -0800, Sagi Grimberg wrote:
>> Given it is over TCP, I guess it should be doable for you to preallocate one
>> 256-bvec table in one page for each request, then set the max segment size as
>> (unsigned int)-1 and the max segment number as 256; the preallocated table
>> should work anytime.
>
> A 256-bvec table is really a lot to preallocate, especially when it's not
> needed; I can easily initialize the bvec_iter on the bio bvec. If this
> involves preallocation of the worst case, then I don't consider this to
> be an improvement.

Ok, I took a look at the nvme-tcp code, and it seems you care about bios
because you want a contiguous bio chunk to send down through the
networking code. Yes, in that case we sort of need to iterate over bios.
But you already have a special case for discard, so you don't really need
any of the magic in the bio_bvecs() helper and can just count bi_vcnt in
the bio.