Re: Silent data corruption in blkdev_direct_IO()
On Thu, 2018-07-12 at 10:42 -0600, Jens Axboe wrote:
> On 7/12/18 10:20 AM, Jens Axboe wrote:
> > On 7/12/18 10:14 AM, Hannes Reinecke wrote:
> > > On 07/12/2018 05:08 PM, Jens Axboe wrote:
> > > > On 7/12/18 8:36 AM, Hannes Reinecke wrote:
> > > > > Hi Jens, Christoph,
> > > > >
> > > > > we're currently hunting down a silent data corruption occurring
> > > > > due to commit 72ecad22d9f1 ("block: support a full bio worth of
> > > > > IO for simplified bdev direct-io").
> > > > >
> > > > > While the whole thing is still hazy on the details, the one
> > > > > thing we've found is that reverting that patch fixes the data
> > > > > corruption.
> > > > >
> > > > > And looking closer, I've found this:
> > > > >
> > > > > static ssize_t
> > > > > blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
> > > > > {
> > > > > 	int nr_pages;
> > > > >
> > > > > 	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
> > > > > 	if (!nr_pages)
> > > > > 		return 0;
> > > > > 	if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
> > > > > 		return __blkdev_direct_IO_simple(iocb, iter, nr_pages);
> > > > >
> > > > > 	return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
> > > > > }
> > > > >
> > > > > When checking the call path
> > > > > __blkdev_direct_IO()->bio_alloc_bioset()->bvec_alloc()
> > > > > I found that bvec_alloc() will fail if nr_pages > BIO_MAX_PAGES.
> > > > >
> > > > > So why is there the check for 'nr_pages <= BIO_MAX_PAGES' ?
> > > > > It's not that we can handle it in __blkdev_direct_IO() ...
> > > >
> > > > The logic could be cleaned up like below, the sync part is really
> > > > all we care about. What is the test case for this? async or sync?
> > > >
> > > > I also don't remember why it's BIO_MAX_PAGES + 1...
> > > >
> > > > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > > > index 0dd87aaeb39a..14ef3d71b55f 100644
> > > > --- a/fs/block_dev.c
> > > > +++ b/fs/block_dev.c
> > > > @@ -424,13 +424,13 @@ blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
> > > >  {
> > > >  	int nr_pages;
> > > >  
> > > > -	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
> > > > +	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES);
> > > >  	if (!nr_pages)
> > > >  		return 0;
> > > > -	if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
> > > > +	if (is_sync_kiocb(iocb))
> > > >  		return __blkdev_direct_IO_simple(iocb, iter, nr_pages);
> > > >  
> > > > -	return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
> > > > +	return __blkdev_direct_IO(iocb, iter, nr_pages);
> > > >  }
> > > >  
> > > >  static __init int blkdev_init(void)
> > > >
> > > Hmm. We'll give it a go, but somehow I feel this won't solve our
> > > problem.
> >
> > It probably won't, the only joker here is the BIO_MAX_PAGES + 1. But
> > it does simplify that part...
>
> OK, now I remember. The +1 is just to check if there are actually more
> pages. __blkdev_direct_IO_simple() only does one bio, so it has to fit
> within that one bio. __blkdev_direct_IO() will loop just fine and will
> finish any size, BIO_MAX_PAGES at a time.

Right. Hannes, I think we (or at least I) have been misled by looking
at outdated code. The key point which I missed is that
__blkdev_direct_IO() is called with min(nr_pages, BIO_MAX_PAGES), and
advances beyond that later in the loop.

> Hence the patch I sent is wrong, the code actually looks fine. Which
> means we're back to trying to figure out what is going on here. It'd
> be great to have a test case...

Unfortunately we have no reproducer just yet; only the customer can
reproduce it. The scenario is a database running on a KVM virtual
machine on top of a virtio-scsi volume backed by a multipath map, with
cache='none' in qemu.

My latest thinking is that if, for whatever reason I don't yet
understand, blkdev_direct_IO() resulted in a short write,
__generic_file_write_iter() would fall back to buffered writing, which
might be a possible explanation for the data corruption we observe.
That's just speculation at the current stage.

Regards
Martin

-- 
Dr. Martin Wilck , Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
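For reference, the fallback Martin mentions lives in
__generic_file_write_iter(). The following is a simplified paraphrase of
that function's O_DIRECT branch in mm/filemap.c of this era, not
verbatim source; error handling and the page-cache invalidation
bookkeeping are omitted:

	if (iocb->ki_flags & IOCB_DIRECT) {
		written = generic_file_direct_write(iocb, from);
		/*
		 * A short direct write is not reported as an error: if
		 * bytes remain in the iterator, the remainder is retried
		 * as a buffered write through the page cache.
		 */
		if (written < 0 || !iov_iter_count(from))
			goto out;
		status = generic_perform_write(file, from, iocb->ki_pos);
		/* 'written' and 'status' are then merged and the dirty
		 * range is written back and invalidated */
	}

If that path is taken, part of a nominally direct write goes through the
page cache, which is at least consistent with the corruption Martin
describes; whether blkdev_direct_IO() actually returns short here is the
open question.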
Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce
On 07/13/2018 06:24 AM, Bart Van Assche wrote:
> Hello Keith,
>
> Before commit 12f5b9314545 ("blk-mq: Remove generation seqeunce"), if a
> request completion was reported after request timeout processing had
> started, completion handling was skipped. The following code in
> blk_mq_complete_request() realized that:
>
> 	if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
> 		__blk_mq_complete_request(rq);

Even before Tejun's patch we had this, for both the blk-mq and the
blk-legacy code:

blk_rq_check_expired

	if (time_after_eq(jiffies, rq->deadline)) {
		list_del_init(&rq->timeout_list);
		/*
		 * Check if we raced with end io completion
		 */
		if (!blk_mark_rq_complete(rq))
			blk_rq_timed_out(rq);
	}

blk_complete_request

	if (!blk_mark_rq_complete(req))
		__blk_complete_request(req);

blk_mq_check_expired

	if (time_after_eq(jiffies, rq->deadline)) {
		if (!blk_mark_rq_complete(rq))
			blk_mq_rq_timed_out(rq, reserved);
	}

blk_mq_complete_request

	if (!blk_mark_rq_complete(rq))
		__blk_mq_complete_request(rq);

Thanks
Jianchao

> Since commit 12f5b9314545, if a completion occurs after request timeout
> processing has started, that completion is processed if the request has
> the state MQ_RQ_IN_FLIGHT. blk_mq_rq_timed_out() does not modify the
> request state unless the block driver timeout handler modifies it, e.g.
> by calling blk_mq_end_request() or by calling blk_mq_requeue_request().
> The typical behavior of scsi_times_out() is to queue sending of a SCSI
> abort and hence not to change the request state immediately. In other
> words, if a request completion occurs during or shortly after a timeout
> occurred then blk_mq_complete_request() will call
> __blk_mq_complete_request() and will complete the request, although
> that is not allowed because timeout handling has already started. Do
> you agree with this analysis?
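All four excerpts above hinge on blk_mark_rq_complete(). For context, in
those pre-4.18 kernels it was roughly the following one-liner (in
block/blk.h, if memory serves); test_and_set_bit() is atomic and returns
the old bit value, so exactly one of the completion and timeout paths
sees 0 and claims the request, while the loser backs off:

	/*
	 * Rough sketch of the pre-4.18 helper, shown for context; not
	 * guaranteed verbatim. The atomic test-and-set is what makes the
	 * "if (!blk_mark_rq_complete(rq))" pattern above race-free.
	 */
	static inline int blk_mark_rq_complete(struct request *rq)
	{
		return test_and_set_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);
	}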
Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce
On 07/13/2018 06:24 AM, Bart Van Assche wrote:
> On Thu, 2018-07-12 at 13:24 -0600, Keith Busch wrote:
>> On Thu, Jul 12, 2018 at 06:16:12PM +, Bart Van Assche wrote:
>>> What prevents a request from finishing and getting reused after the
>>> blk_mq_req_expired() call has finished and before
>>> kref_get_unless_zero() is called? Is this perhaps a race condition
>>> that has not yet been triggered by any existing block layer test?
>>> Please note that there is no such race condition in the patch I had
>>> posted ("blk-mq: Rework blk-mq timeout handling again" -
>>> https://www.spinics.net/lists/linux-block/msg26489.html).
>>
>> I don't think there's any such race in the merged implementation
>> either. If the request is reallocated, then the kref check may succeed,
>> but the blk_mq_req_expired() check would surely fail, allowing the
>> request to proceed as normal. The code comments at least say as much.
>
> Hello Keith,
>
> Before commit 12f5b9314545 ("blk-mq: Remove generation seqeunce"), if a
> request completion was reported after request timeout processing had
> started, completion handling was skipped. The following code in
> blk_mq_complete_request() realized that:
>
> 	if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
> 		__blk_mq_complete_request(rq);
>
> Since commit 12f5b9314545, if a completion occurs after request timeout
> processing has started, that completion is processed if the request has
> the state MQ_RQ_IN_FLIGHT. blk_mq_rq_timed_out() does not modify the
> request state unless the block driver timeout handler modifies it, e.g.
> by calling blk_mq_end_request() or by calling blk_mq_requeue_request().
> The typical behavior of scsi_times_out() is to queue sending of a SCSI
> abort and hence not to change the request state immediately. In other
> words, if a request completion occurs during or shortly after a timeout
> occurred then blk_mq_complete_request() will call
> __blk_mq_complete_request() and will complete the request, although
> that is not allowed because timeout handling has already started. Do
> you agree with this analysis?

Oh, thank the gods for hearing Bart say this. I was always saying the
same thing in this mail:
https://marc.info/?l=linux-block&m=152950093831738&w=2

Even though my voice is tiny, I definitely support Bart's point.

Thanks
Jianchao

> Thanks,
>
> Bart.
Re: Silent data corruption in blkdev_direct_IO()
On Thu, Jul 12, 2018 at 10:36 PM, Hannes Reinecke wrote:
> Hi Jens, Christoph,
>
> we're currently hunting down a silent data corruption occurring due to
> commit 72ecad22d9f1 ("block: support a full bio worth of IO for
> simplified bdev direct-io").
>
> While the whole thing is still hazy on the details, the one thing we've
> found is that reverting that patch fixes the data corruption.
>
> And looking closer, I've found this:
>
> static ssize_t
> blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
> {
> 	int nr_pages;
>
> 	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
> 	if (!nr_pages)
> 		return 0;
> 	if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
> 		return __blkdev_direct_IO_simple(iocb, iter, nr_pages);
>
> 	return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
> }
>
> When checking the call path
> __blkdev_direct_IO()->bio_alloc_bioset()->bvec_alloc()
> I found that bvec_alloc() will fail if nr_pages > BIO_MAX_PAGES.
>
> So why is there the check for 'nr_pages <= BIO_MAX_PAGES' ?
> It's not that we can handle it in __blkdev_direct_IO() ...
>
> Thanks for any clarification.

Maybe you can try the following patch from Christoph to see if it makes
a difference:

https://marc.info/?l=linux-kernel&m=153013977816825&w=2

thanks,
Ming Lei
Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce
On Thu, 2018-07-12 at 13:24 -0600, Keith Busch wrote:
> On Thu, Jul 12, 2018 at 06:16:12PM +, Bart Van Assche wrote:
> > What prevents a request from finishing and getting reused after the
> > blk_mq_req_expired() call has finished and before
> > kref_get_unless_zero() is called? Is this perhaps a race condition
> > that has not yet been triggered by any existing block layer test?
> > Please note that there is no such race condition in the patch I had
> > posted ("blk-mq: Rework blk-mq timeout handling again" -
> > https://www.spinics.net/lists/linux-block/msg26489.html).
>
> I don't think there's any such race in the merged implementation
> either. If the request is reallocated, then the kref check may succeed,
> but the blk_mq_req_expired() check would surely fail, allowing the
> request to proceed as normal. The code comments at least say as much.

Hello Keith,

Before commit 12f5b9314545 ("blk-mq: Remove generation seqeunce"), if a
request completion was reported after request timeout processing had
started, completion handling was skipped. The following code in
blk_mq_complete_request() realized that:

	if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
		__blk_mq_complete_request(rq);

Since commit 12f5b9314545, if a completion occurs after request timeout
processing has started, that completion is processed if the request has
the state MQ_RQ_IN_FLIGHT. blk_mq_rq_timed_out() does not modify the
request state unless the block driver timeout handler modifies it, e.g.
by calling blk_mq_end_request() or by calling blk_mq_requeue_request().
The typical behavior of scsi_times_out() is to queue sending of a SCSI
abort and hence not to change the request state immediately. In other
words, if a request completion occurs during or shortly after a timeout
occurred then blk_mq_complete_request() will call
__blk_mq_complete_request() and will complete the request, although that
is not allowed because timeout handling has already started. Do you
agree with this analysis?

Thanks,

Bart.
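To make the ordering in this analysis concrete, the suspected
interleaving can be written out as a timeline (an editorial paraphrase
of the prose above, not kernel source):

	/*
	 * timeout worker                     completion path (driver IRQ)
	 * --------------                     ----------------------------
	 * blk_mq_check_expired()
	 *   blk_mq_req_expired() -> true
	 *   kref_get_unless_zero() -> ok
	 *   blk_mq_rq_timed_out(rq)
	 *     scsi_times_out() queues a
	 *     SCSI abort; rq stays in
	 *     state MQ_RQ_IN_FLIGHT
	 *                                    blk_mq_complete_request(rq)
	 *                                      rq is still MQ_RQ_IN_FLIGHT,
	 *                                      so __blk_mq_complete_request()
	 *                                      runs and completes rq while
	 *                                      the abort is still outstanding
	 */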
Re: [PATCH] block: skd: Use %pad printk format for dma_addr_t values
On 7/12/18 2:29 PM, Helge Deller wrote:
> Use the existing %pad printk format to print dma_addr_t values.
> This avoids the following warnings when compiling on the parisc64
> platform:
>
> drivers/block/skd_main.c: In function 'skd_preop_sg_list':
> drivers/block/skd_main.c:660:4: warning: format '%llx' expects argument
> of type 'long long unsigned int', but argument 6 has type 'dma_addr_t
> {aka unsigned int}' [-Wformat=]

Applied, thanks.

-- 
Jens Axboe
Re: [PATCH] block: skd: Use %pad printk format for dma_addr_t values
On 07/12/18 13:29, Helge Deller wrote:
> Use the existing %pad printk format to print dma_addr_t values.
> This avoids the following warnings when compiling on the parisc64
> platform:

Reviewed-by: Bart Van Assche
[PATCH] block: skd: Use %pad printk format for dma_addr_t values
Use the existing %pad printk format to print dma_addr_t values.
This avoids the following warnings when compiling on the parisc64
platform:

drivers/block/skd_main.c: In function 'skd_preop_sg_list':
drivers/block/skd_main.c:660:4: warning: format '%llx' expects argument
of type 'long long unsigned int', but argument 6 has type 'dma_addr_t
{aka unsigned int}' [-Wformat=]

Signed-off-by: Helge Deller

diff --git a/drivers/block/skd_main.c b/drivers/block/skd_main.c
index bc7aea6d7b7c..9ba61c194b4c 100644
--- a/drivers/block/skd_main.c
+++ b/drivers/block/skd_main.c
@@ -657,8 +657,8 @@ static bool skd_preop_sg_list(struct skd_device *skdev,
 
 	if (unlikely(skdev->dbg_level > 1)) {
 		dev_dbg(&skdev->pdev->dev,
-			"skreq=%x sksg_list=%p sksg_dma=%llx\n",
-			skreq->id, skreq->sksg_list, skreq->sksg_dma_address);
+			"skreq=%x sksg_list=%p sksg_dma=%pad\n",
+			skreq->id, skreq->sksg_list, &skreq->sksg_dma_address);
 
 		for (i = 0; i < n_sg; i++) {
 			struct fit_sg_descriptor *sgd = &skreq->sksg_list[i];
@@ -1190,8 +1190,8 @@ static void skd_send_fitmsg(struct skd_device *skdev,
 {
 	u64 qcmd;
 
-	dev_dbg(&skdev->pdev->dev, "dma address 0x%llx, busy=%d\n",
-		skmsg->mb_dma_address, skd_in_flight(skdev));
+	dev_dbg(&skdev->pdev->dev, "dma address %pad, busy=%d\n",
+		&skmsg->mb_dma_address, skd_in_flight(skdev));
 	dev_dbg(&skdev->pdev->dev, "msg_buf %p\n", skmsg->msg_buf);
 
 	qcmd = skmsg->mb_dma_address;
@@ -1250,9 +1250,9 @@ static void skd_send_special_fitmsg(struct skd_device *skdev,
 	}
 
 	dev_dbg(&skdev->pdev->dev,
-		"skspcl=%p id=%04x sksg_list=%p sksg_dma=%llx\n",
+		"skspcl=%p id=%04x sksg_list=%p sksg_dma=%pad\n",
 		skspcl, skspcl->req.id, skspcl->req.sksg_list,
-		skspcl->req.sksg_dma_address);
+		&skspcl->req.sksg_dma_address);
 	for (i = 0; i < skspcl->req.n_sg; i++) {
 		struct fit_sg_descriptor *sgd = &skspcl->req.sksg_list[i];
 
@@ -2685,8 +2685,8 @@ static int skd_cons_skmsg(struct skd_device *skdev)
 
 		WARN(((uintptr_t)skmsg->msg_buf | skmsg->mb_dma_address) &
 		     (FIT_QCMD_ALIGN - 1),
-		     "not aligned: msg_buf %p mb_dma_address %#llx\n",
-		     skmsg->msg_buf, skmsg->mb_dma_address);
+		     "not aligned: msg_buf %p mb_dma_address %pad\n",
+		     skmsg->msg_buf, &skmsg->mb_dma_address);
 		memset(skmsg->msg_buf, 0, SKD_N_FITMSG_BYTES);
 	}
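One detail worth calling out: %pad (like %pap for phys_addr_t) takes a
pointer to the value rather than the value itself, which is how the
printk core prints the correct width whether dma_addr_t is 32 or 64 bits
on the platform. A minimal usage sketch; the function and variable names
here are made up for illustration:

	#include <linux/device.h>
	#include <linux/dma-mapping.h>

	static void example_log_mapping(struct device *dev, void *buf,
					size_t len)
	{
		dma_addr_t dma_handle = dma_map_single(dev, buf, len,
						       DMA_TO_DEVICE);

		/* %pad takes the *address* of the dma_addr_t, not the value */
		dev_dbg(dev, "buf %p mapped at %pad\n", buf, &dma_handle);

		dma_unmap_single(dev, dma_handle, len, DMA_TO_DEVICE);
	}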
Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce
On Thu, Jul 12, 2018 at 06:16:12PM +, Bart Van Assche wrote:
> On Mon, 2018-05-21 at 17:11 -0600, Keith Busch wrote:
> >  	/*
> > -	 * We marked @rq->aborted_gstate and waited for RCU. If there were
> > -	 * completions that we lost to, they would have finished and
> > -	 * updated @rq->gstate by now; otherwise, the completion path is
> > -	 * now guaranteed to see @rq->aborted_gstate and yield. If
> > -	 * @rq->aborted_gstate still matches @rq->gstate, @rq is ours.
> > +	 * Just do a quick check if it is expired before locking the request in
> > +	 * so we're not unnecessarilly synchronizing across CPUs.
> >  	 */
> > -	if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) &&
> > -	    READ_ONCE(rq->gstate) == rq->aborted_gstate)
> > +	if (!blk_mq_req_expired(rq, next))
> > +		return;
> > +
> > +	/*
> > +	 * We have reason to believe the request may be expired. Take a
> > +	 * reference on the request to lock this request lifetime into its
> > +	 * currently allocated context to prevent it from being reallocated in
> > +	 * the event the completion by-passes this timeout handler.
> > +	 *
> > +	 * If the reference was already released, then the driver beat the
> > +	 * timeout handler to posting a natural completion.
> > +	 */
> > +	if (!kref_get_unless_zero(&rq->ref))
> > +		return;
> > +
> > +	/*
> > +	 * The request is now locked and cannot be reallocated underneath the
> > +	 * timeout handler's processing. Re-verify this exact request is truly
> > +	 * expired; if it is not expired, then the request was completed and
> > +	 * reallocated as a new request.
> > +	 */
> > +	if (blk_mq_req_expired(rq, next))
> >  		blk_mq_rq_timed_out(rq, reserved);
> > +	blk_mq_put_request(rq);
> >  }
>
> Hello Keith and Christoph,
>
> What prevents a request from finishing and getting reused after the
> blk_mq_req_expired() call has finished and before
> kref_get_unless_zero() is called? Is this perhaps a race condition that
> has not yet been triggered by any existing block layer test? Please
> note that there is no such race condition in the patch I had posted
> ("blk-mq: Rework blk-mq timeout handling again" -
> https://www.spinics.net/lists/linux-block/msg26489.html).

I don't think there's any such race in the merged implementation
either. If the request is reallocated, then the kref check may succeed,
but the blk_mq_req_expired() check would surely fail, allowing the
request to proceed as normal. The code comments at least say as much.
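Annotating the quoted hunk makes Keith's argument explicit. The numbered
comments below are editorial; the code is the same check/pin/re-check
sequence from the merged patch:

	if (!blk_mq_req_expired(rq, next))	/* (1) cheap, unlocked check */
		return;

	if (!kref_get_unless_zero(&rq->ref))	/* (2) pin the lifetime; fails */
		return;				/*     if rq already completed  */

	/*
	 * (3) Re-check under the pinned reference. If rq was freed and
	 * reallocated between (1) and (2), the new incarnation carries a
	 * fresh deadline, so this re-check fails and the reference is
	 * simply dropped; nothing is timed out by mistake.
	 */
	if (blk_mq_req_expired(rq, next))
		blk_mq_rq_timed_out(rq, reserved);
	blk_mq_put_request(rq);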
Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce
On Mon, 2018-05-21 at 17:11 -0600, Keith Busch wrote:
>  	/*
> -	 * We marked @rq->aborted_gstate and waited for RCU. If there were
> -	 * completions that we lost to, they would have finished and
> -	 * updated @rq->gstate by now; otherwise, the completion path is
> -	 * now guaranteed to see @rq->aborted_gstate and yield. If
> -	 * @rq->aborted_gstate still matches @rq->gstate, @rq is ours.
> +	 * Just do a quick check if it is expired before locking the request in
> +	 * so we're not unnecessarilly synchronizing across CPUs.
>  	 */
> -	if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) &&
> -	    READ_ONCE(rq->gstate) == rq->aborted_gstate)
> +	if (!blk_mq_req_expired(rq, next))
> +		return;
> +
> +	/*
> +	 * We have reason to believe the request may be expired. Take a
> +	 * reference on the request to lock this request lifetime into its
> +	 * currently allocated context to prevent it from being reallocated in
> +	 * the event the completion by-passes this timeout handler.
> +	 *
> +	 * If the reference was already released, then the driver beat the
> +	 * timeout handler to posting a natural completion.
> +	 */
> +	if (!kref_get_unless_zero(&rq->ref))
> +		return;
> +
> +	/*
> +	 * The request is now locked and cannot be reallocated underneath the
> +	 * timeout handler's processing. Re-verify this exact request is truly
> +	 * expired; if it is not expired, then the request was completed and
> +	 * reallocated as a new request.
> +	 */
> +	if (blk_mq_req_expired(rq, next))
>  		blk_mq_rq_timed_out(rq, reserved);
> +	blk_mq_put_request(rq);
>  }

Hello Keith and Christoph,

What prevents a request from finishing and getting reused after the
blk_mq_req_expired() call has finished and before kref_get_unless_zero()
is called? Is this perhaps a race condition that has not yet been
triggered by any existing block layer test? Please note that there is no
such race condition in the patch I had posted ("blk-mq: Rework blk-mq
timeout handling again" -
https://www.spinics.net/lists/linux-block/msg26489.html).

Thanks,

Bart.
Re: Silent data corruption in blkdev_direct_IO()
On 7/12/18 10:20 AM, Jens Axboe wrote:
> On 7/12/18 10:14 AM, Hannes Reinecke wrote:
>> On 07/12/2018 05:08 PM, Jens Axboe wrote:
>>> On 7/12/18 8:36 AM, Hannes Reinecke wrote:
>>>> Hi Jens, Christoph,
>>>>
>>>> we're currently hunting down a silent data corruption occurring due
>>>> to commit 72ecad22d9f1 ("block: support a full bio worth of IO for
>>>> simplified bdev direct-io").
>>>>
>>>> While the whole thing is still hazy on the details, the one thing
>>>> we've found is that reverting that patch fixes the data corruption.
>>>>
>>>> And looking closer, I've found this:
>>>>
>>>> static ssize_t
>>>> blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>>>> {
>>>> 	int nr_pages;
>>>>
>>>> 	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
>>>> 	if (!nr_pages)
>>>> 		return 0;
>>>> 	if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
>>>> 		return __blkdev_direct_IO_simple(iocb, iter, nr_pages);
>>>>
>>>> 	return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
>>>> }
>>>>
>>>> When checking the call path
>>>> __blkdev_direct_IO()->bio_alloc_bioset()->bvec_alloc()
>>>> I found that bvec_alloc() will fail if nr_pages > BIO_MAX_PAGES.
>>>>
>>>> So why is there the check for 'nr_pages <= BIO_MAX_PAGES' ?
>>>> It's not that we can handle it in __blkdev_direct_IO() ...
>>>
>>> The logic could be cleaned up like below, the sync part is really all
>>> we care about. What is the test case for this? async or sync?
>>>
>>> I also don't remember why it's BIO_MAX_PAGES + 1...
>>>
>>> diff --git a/fs/block_dev.c b/fs/block_dev.c
>>> index 0dd87aaeb39a..14ef3d71b55f 100644
>>> --- a/fs/block_dev.c
>>> +++ b/fs/block_dev.c
>>> @@ -424,13 +424,13 @@ blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>>>  {
>>>  	int nr_pages;
>>>  
>>> -	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
>>> +	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES);
>>>  	if (!nr_pages)
>>>  		return 0;
>>> -	if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
>>> +	if (is_sync_kiocb(iocb))
>>>  		return __blkdev_direct_IO_simple(iocb, iter, nr_pages);
>>>  
>>> -	return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
>>> +	return __blkdev_direct_IO(iocb, iter, nr_pages);
>>>  }
>>>  
>>>  static __init int blkdev_init(void)
>>>
>> Hmm. We'll give it a go, but somehow I feel this won't solve our
>> problem.
>
> It probably won't, the only joker here is the BIO_MAX_PAGES + 1. But it
> does simplify that part...

OK, now I remember. The +1 is just to check if there are actually more
pages. __blkdev_direct_IO_simple() only does one bio, so it has to fit
within that one bio. __blkdev_direct_IO() will loop just fine and will
finish any size, BIO_MAX_PAGES at a time.

Hence the patch I sent is wrong, the code actually looks fine. Which
means we're back to trying to figure out what is going on here. It'd be
great to have a test case...

-- 
Jens Axboe
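Jens's explanation of the +1 can be made concrete. iov_iter_npages()
stops counting at the cap it is given, so probing with BIO_MAX_PAGES + 1
is a cheap way to learn whether the iterator holds more than one bio's
worth of pages without walking the whole iterator. A sketch of the
dispatch (assuming BIO_MAX_PAGES == 256, its value at the time):

	/*
	 * nr_pages <= 256: the whole request fits in a single bio, so a
	 * sync kiocb can take the one-bio fast path.
	 * nr_pages == 257: there is more than one bio's worth; take the
	 * looping path, which submits at most BIO_MAX_PAGES pages per
	 * bio and advances the iterator until it is drained.
	 */
	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
	if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
		return __blkdev_direct_IO_simple(iocb, iter, nr_pages);
	return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));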
Re: Silent data corruption in blkdev_direct_IO()
On 7/12/18 10:14 AM, Hannes Reinecke wrote:
> On 07/12/2018 05:08 PM, Jens Axboe wrote:
>> On 7/12/18 8:36 AM, Hannes Reinecke wrote:
>>> Hi Jens, Christoph,
>>>
>>> we're currently hunting down a silent data corruption occurring due
>>> to commit 72ecad22d9f1 ("block: support a full bio worth of IO for
>>> simplified bdev direct-io").
>>>
>>> While the whole thing is still hazy on the details, the one thing
>>> we've found is that reverting that patch fixes the data corruption.
>>>
>>> And looking closer, I've found this:
>>>
>>> static ssize_t
>>> blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>>> {
>>> 	int nr_pages;
>>>
>>> 	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
>>> 	if (!nr_pages)
>>> 		return 0;
>>> 	if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
>>> 		return __blkdev_direct_IO_simple(iocb, iter, nr_pages);
>>>
>>> 	return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
>>> }
>>>
>>> When checking the call path
>>> __blkdev_direct_IO()->bio_alloc_bioset()->bvec_alloc()
>>> I found that bvec_alloc() will fail if nr_pages > BIO_MAX_PAGES.
>>>
>>> So why is there the check for 'nr_pages <= BIO_MAX_PAGES' ?
>>> It's not that we can handle it in __blkdev_direct_IO() ...
>>
>> The logic could be cleaned up like below, the sync part is really all
>> we care about. What is the test case for this? async or sync?
>>
>> I also don't remember why it's BIO_MAX_PAGES + 1...
>>
>> diff --git a/fs/block_dev.c b/fs/block_dev.c
>> index 0dd87aaeb39a..14ef3d71b55f 100644
>> --- a/fs/block_dev.c
>> +++ b/fs/block_dev.c
>> @@ -424,13 +424,13 @@ blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>>  {
>>  	int nr_pages;
>>  
>> -	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
>> +	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES);
>>  	if (!nr_pages)
>>  		return 0;
>> -	if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
>> +	if (is_sync_kiocb(iocb))
>>  		return __blkdev_direct_IO_simple(iocb, iter, nr_pages);
>>  
>> -	return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
>> +	return __blkdev_direct_IO(iocb, iter, nr_pages);
>>  }
>>  
>>  static __init int blkdev_init(void)
>>
> Hmm. We'll give it a go, but somehow I feel this won't solve our
> problem.

It probably won't, the only joker here is the BIO_MAX_PAGES + 1. But it
does simplify that part...

> Another question (which probably shows my complete ignorance):
> What happens if the iter holds >= BIO_MAX_PAGES?
> 'iov_iter_npages' will only ever return BIO_MAX_PAGES, independent of
> whether the iter contains more pages.
> What happens to those left-over pages?
> Will they ever be processed?

Short read or write, we rely on being called again for the remainder.

-- 
Jens Axboe
Re: Silent data corruption in blkdev_direct_IO()
On 07/12/2018 05:08 PM, Jens Axboe wrote:
> On 7/12/18 8:36 AM, Hannes Reinecke wrote:
>> Hi Jens, Christoph,
>>
>> we're currently hunting down a silent data corruption occurring due
>> to commit 72ecad22d9f1 ("block: support a full bio worth of IO for
>> simplified bdev direct-io").
>>
>> While the whole thing is still hazy on the details, the one thing
>> we've found is that reverting that patch fixes the data corruption.
>>
>> And looking closer, I've found this:
>>
>> static ssize_t
>> blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>> {
>> 	int nr_pages;
>>
>> 	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
>> 	if (!nr_pages)
>> 		return 0;
>> 	if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
>> 		return __blkdev_direct_IO_simple(iocb, iter, nr_pages);
>>
>> 	return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
>> }
>>
>> When checking the call path
>> __blkdev_direct_IO()->bio_alloc_bioset()->bvec_alloc()
>> I found that bvec_alloc() will fail if nr_pages > BIO_MAX_PAGES.
>>
>> So why is there the check for 'nr_pages <= BIO_MAX_PAGES' ?
>> It's not that we can handle it in __blkdev_direct_IO() ...
>
> The logic could be cleaned up like below, the sync part is really all
> we care about. What is the test case for this? async or sync?
>
> I also don't remember why it's BIO_MAX_PAGES + 1...
>
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 0dd87aaeb39a..14ef3d71b55f 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -424,13 +424,13 @@ blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>  {
>  	int nr_pages;
>  
> -	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
> +	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES);
>  	if (!nr_pages)
>  		return 0;
> -	if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
> +	if (is_sync_kiocb(iocb))
>  		return __blkdev_direct_IO_simple(iocb, iter, nr_pages);
>  
> -	return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
> +	return __blkdev_direct_IO(iocb, iter, nr_pages);
>  }
>  
>  static __init int blkdev_init(void)

Hmm. We'll give it a go, but somehow I feel this won't solve our
problem.

Another question (which probably shows my complete ignorance):
What happens if the iter holds >= BIO_MAX_PAGES?
'iov_iter_npages' will only ever return BIO_MAX_PAGES, independent of
whether the iter contains more pages.
What happens to those left-over pages?
Will they ever be processed?

Cheers,

Hannes
Re: Silent data corruption in blkdev_direct_IO()
On Thu, 2018-07-12 at 09:08 -0600, Jens Axboe wrote:
> On 7/12/18 8:36 AM, Hannes Reinecke wrote:
> > Hi Jens, Christoph,
> >
> > we're currently hunting down a silent data corruption occurring due
> > to commit 72ecad22d9f1 ("block: support a full bio worth of IO for
> > simplified bdev direct-io").
> >
> > While the whole thing is still hazy on the details, the one thing
> > we've found is that reverting that patch fixes the data corruption.
> >
> > And looking closer, I've found this:
> >
> > static ssize_t
> > blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
> > {
> > 	int nr_pages;
> >
> > 	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
> > 	if (!nr_pages)
> > 		return 0;
> > 	if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
> > 		return __blkdev_direct_IO_simple(iocb, iter, nr_pages);
> >
> > 	return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
> > }
> >
> > When checking the call path
> > __blkdev_direct_IO()->bio_alloc_bioset()->bvec_alloc()
> > I found that bvec_alloc() will fail if nr_pages > BIO_MAX_PAGES.
> >
> > So why is there the check for 'nr_pages <= BIO_MAX_PAGES' ?
> > It's not that we can handle it in __blkdev_direct_IO() ...
>
> The logic could be cleaned up like below, the sync part is really all
> we care about. What is the test case for this? async or sync?

It's sync, and the corruption is in the __blkdev_direct_IO_simple()
path. It starts to occur with your "block: support a full bio worth of
IO for simplified bdev direct-io" patch, which causes the "simple" path
to be taken for larger IO sizes.

Martin

> I also don't remember why it's BIO_MAX_PAGES + 1...
>
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 0dd87aaeb39a..14ef3d71b55f 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -424,13 +424,13 @@ blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>  {
>  	int nr_pages;
>  
> -	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
> +	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES);
>  	if (!nr_pages)
>  		return 0;
> -	if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
> +	if (is_sync_kiocb(iocb))
>  		return __blkdev_direct_IO_simple(iocb, iter, nr_pages);
>  
> -	return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
> +	return __blkdev_direct_IO(iocb, iter, nr_pages);
>  }
>  
>  static __init int blkdev_init(void)

-- 
Dr. Martin Wilck , Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Re: Silent data corruption in blkdev_direct_IO()
On 7/12/18 8:36 AM, Hannes Reinecke wrote:
> Hi Jens, Christoph,
>
> we're currently hunting down a silent data corruption occurring due to
> commit 72ecad22d9f1 ("block: support a full bio worth of IO for
> simplified bdev direct-io").
>
> While the whole thing is still hazy on the details, the one thing we've
> found is that reverting that patch fixes the data corruption.
>
> And looking closer, I've found this:
>
> static ssize_t
> blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
> {
> 	int nr_pages;
>
> 	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
> 	if (!nr_pages)
> 		return 0;
> 	if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
> 		return __blkdev_direct_IO_simple(iocb, iter, nr_pages);
>
> 	return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
> }
>
> When checking the call path
> __blkdev_direct_IO()->bio_alloc_bioset()->bvec_alloc()
> I found that bvec_alloc() will fail if nr_pages > BIO_MAX_PAGES.
>
> So why is there the check for 'nr_pages <= BIO_MAX_PAGES' ?
> It's not that we can handle it in __blkdev_direct_IO() ...

The logic could be cleaned up like below, the sync part is really all
we care about. What is the test case for this? async or sync?

I also don't remember why it's BIO_MAX_PAGES + 1...

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 0dd87aaeb39a..14ef3d71b55f 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -424,13 +424,13 @@ blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 {
 	int nr_pages;
 
-	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
+	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES);
 	if (!nr_pages)
 		return 0;
-	if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
+	if (is_sync_kiocb(iocb))
 		return __blkdev_direct_IO_simple(iocb, iter, nr_pages);
 
-	return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
+	return __blkdev_direct_IO(iocb, iter, nr_pages);
 }
 
 static __init int blkdev_init(void)

-- 
Jens Axboe
Silent data corruption in blkdev_direct_IO()
Hi Jens, Christoph,

we're currently hunting down a silent data corruption occurring due to
commit 72ecad22d9f1 ("block: support a full bio worth of IO for
simplified bdev direct-io").

While the whole thing is still hazy on the details, the one thing we've
found is that reverting that patch fixes the data corruption.

And looking closer, I've found this:

static ssize_t
blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
{
	int nr_pages;

	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
	if (!nr_pages)
		return 0;
	if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
		return __blkdev_direct_IO_simple(iocb, iter, nr_pages);

	return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
}

When checking the call path
__blkdev_direct_IO()->bio_alloc_bioset()->bvec_alloc()
I found that bvec_alloc() will fail if nr_pages > BIO_MAX_PAGES.

So why is there the check for 'nr_pages <= BIO_MAX_PAGES' ?
It's not that we can handle it in __blkdev_direct_IO() ...

Thanks for any clarification.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
h...@suse.de			   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
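For reference, the failure mode Hannes points at in bvec_alloc() comes
from its size dispatch: the bvec slabs are sized in fixed steps up to
BIO_MAX_PAGES, and anything larger cannot be satisfied. A from-memory
sketch of that switch in block/bio.c of this era (not verbatim):

	switch (nr) {
	case 1:
		*idx = 0;
		break;
	case 2 ... 4:
		*idx = 1;
		break;
	case 5 ... 16:
		*idx = 2;
		break;
	case 17 ... 64:
		*idx = 3;
		break;
	case 65 ... 128:
		*idx = 4;
		break;
	case 129 ... BIO_MAX_PAGES:
		*idx = 5;
		break;
	default:
		/* nr > BIO_MAX_PAGES: the failure Hannes describes */
		return NULL;
	}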