Re: Silent data corruption in blkdev_direct_IO()

2018-07-12 Thread Martin Wilck
On Thu, 2018-07-12 at 10:42 -0600, Jens Axboe wrote:
> On 7/12/18 10:20 AM, Jens Axboe wrote:
> > On 7/12/18 10:14 AM, Hannes Reinecke wrote:
> > > On 07/12/2018 05:08 PM, Jens Axboe wrote:
> > > > On 7/12/18 8:36 AM, Hannes Reinecke wrote:
> > > > > Hi Jens, Christoph,
> > > > > 
> > > > > we're currently hunting down a silent data corruption
> > > > > occurring due to
> > > > > commit 72ecad22d9f1 ("block: support a full bio worth of IO
> > > > > for
> > > > > simplified bdev direct-io").
> > > > > 
> > > > > While the whole thing is still hazy on the details, the one
> > > > > thing we've
> > > > > found is that reverting that patch fixes the data corruption.
> > > > > 
> > > > > And looking closer, I've found this:
> > > > > 
> > > > > static ssize_t
> > > > > blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
> > > > > {
> > > > >   int nr_pages;
> > > > > 
> > > > >   nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
> > > > >   if (!nr_pages)
> > > > >   return 0;
> > > > >   if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
> > > > >   return __blkdev_direct_IO_simple(iocb, iter,
> > > > > nr_pages);
> > > > > 
> > > > >   return __blkdev_direct_IO(iocb, iter, min(nr_pages,
> > > > > BIO_MAX_PAGES));
> > > > > }
> > > > > 
> > > > > When checking the call path
> > > > > __blkdev_direct_IO()->bio_alloc_bioset()->bvec_alloc()
> > > > > I found that bvec_alloc() will fail if nr_pages >
> > > > > BIO_MAX_PAGES.
> > > > > 
> > > > > So why is there the check for 'nr_pages <= BIO_MAX_PAGES' ?
> > > > > It's not that we can handle it in __blkdev_direct_IO() ...
> > > > 
> > > > The logic could be cleaned up like below, the sync part is
> > > > really all
> > > > we care about. What is the test case for this? async or sync?
> > > > 
> > > > I also don't remember why it's BIO_MAX_PAGES + 1...
> > > > 
> > > > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > > > index 0dd87aaeb39a..14ef3d71b55f 100644
> > > > --- a/fs/block_dev.c
> > > > +++ b/fs/block_dev.c
> > > > @@ -424,13 +424,13 @@ blkdev_direct_IO(struct kiocb *iocb,
> > > > struct iov_iter *iter)
> > > >   {
> > > > int nr_pages;
> > > >   
> > > > -   nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
> > > > +   nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES);
> > > > if (!nr_pages)
> > > > return 0;
> > > > -   if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
> > > > +   if (is_sync_kiocb(iocb))
> > > > return __blkdev_direct_IO_simple(iocb, iter,
> > > > nr_pages);
> > > >   
> > > > -   return __blkdev_direct_IO(iocb, iter, min(nr_pages,
> > > > BIO_MAX_PAGES));
> > > > +   return __blkdev_direct_IO(iocb, iter, nr_pages);
> > > >   }
> > > >   
> > > >   static __init int blkdev_init(void)
> > > > 
> > > 
> > > Hmm. We'll give it a go, but somehow I feel this won't solve our
> > > problem.
> > 
> > It probably won't, the only joker here is the BIO_MAX_PAGES + 1.
> > But it
> > does simplify that part...
> 
> OK, now I remember. The +1 is just to check if there are actually
> more
> pages. __blkdev_direct_IO_simple() only does one bio, so it has to
> fit
> within that one bio. __blkdev_direct_IO() will loop just fine and
> will finish any size, BIO_MAX_PAGES at a time.

Right. Hannes, I think we (or at least I) have been misled by looking at
outdated code. The key point I missed is that __blkdev_direct_IO() is
called with min(nr_pages, BIO_MAX_PAGES) and advances past that later in
its loop.
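To make the BIO_MAX_PAGES + 1 trick concrete, here is a tiny userspace sketch
(illustrative only; npages_capped() merely mimics the capping behaviour of
iov_iter_npages()): asking for at most BIO_MAX_PAGES + 1 pages lets the caller
tell "fits in one bio" apart from "needs the looping path" without counting
every page.

/* Illustration only -- npages_capped() stands in for iov_iter_npages(). */
#include <stdio.h>

#define BIO_MAX_PAGES 256

static int npages_capped(long total_pages, int maxpages)
{
	return total_pages < maxpages ? (int)total_pages : maxpages;
}

int main(void)
{
	long sizes[] = { 255, 256, 257, 1024 };

	for (int i = 0; i < 4; i++) {
		int n = npages_capped(sizes[i], BIO_MAX_PAGES + 1);

		/*
		 * n <= BIO_MAX_PAGES: the request fits in a single bio,
		 * so the "simple" path may be used.
		 * n == BIO_MAX_PAGES + 1: more than one bio's worth, so
		 * the looping path must be taken.
		 */
		printf("%ld pages -> capped %d -> %s\n", sizes[i], n,
		       n <= BIO_MAX_PAGES ? "one bio" : "needs looping");
	}
	return 0;
}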

> Hence the patch I sent is wrong, the code actually looks fine. Which
> means we're back to trying to figure out what is going on here. It'd
> be great with a test case...

Unfortunately we have no reproducer just yet; only the customer can
reproduce it. The scenario is a database running in a KVM virtual
machine on top of a virtio-scsi volume backed by a multipath map, with
cache='none' in qemu.

My latest thinking is that if, for a reason I don't yet understand,
blkdev_direct_IO() resulted in a short write, __generic_file_write_iter()
would fall back to buffered writing, which might explain the data
corruption we observe. That's just speculation at this stage.
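For what it's worth, below is a rough sketch of the kind of userspace test we
would start from -- purely a starting point, not a known reproducer; /dev/sdX
is a placeholder and writing to it destroys data:

/*
 * Rough reproducer sketch (userspace, NOT a confirmed test case):
 * issue a single 1 MiB O_DIRECT write -- with 4 KiB pages that is
 * exactly BIO_MAX_PAGES (256) pages, the largest I/O that still takes
 * __blkdev_direct_IO_simple() for a sync kiocb -- and report whether
 * the write came back short, which is the condition under which
 * __generic_file_write_iter() would fall back to buffered writing.
 *
 * WARNING: /dev/sdX is a placeholder; writing to a real device
 * destroys whatever is stored there.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const size_t len = 1024 * 1024;		/* 1 MiB */
	void *buf;
	ssize_t ret;
	int fd;

	if (posix_memalign(&buf, 4096, len))	/* O_DIRECT needs alignment */
		return 1;
	memset(buf, 0xa5, len);

	fd = open("/dev/sdX", O_WRONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	ret = write(fd, buf, len);
	if (ret < 0)
		perror("write");
	else if ((size_t)ret != len)
		fprintf(stderr, "short write: %zd of %zu bytes\n", ret, len);
	else
		fprintf(stderr, "full write completed\n");

	close(fd);
	free(buf);
	return ret == (ssize_t)len ? 0 : 1;
}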

Regards
Martin

-- 
Dr. Martin Wilck , Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)



Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce

2018-07-12 Thread jianchao.wang


On 07/13/2018 06:24 AM, Bart Van Assche wrote:
> Hello Keith,
> 
> Before commit 12f5b9314545 ("blk-mq: Remove generation seqeunce"), if a
> request completion was reported after request timeout processing had
> started, completion handling was skipped. The following code in
> blk_mq_complete_request() realized that:
> 
>   if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
>   __blk_mq_complete_request(rq);

Even before Tejun's patch, we already had this kind of check for both
blk-mq and legacy block code:

blk_rq_check_expired:

	if (time_after_eq(jiffies, rq->deadline)) {
		list_del_init(&rq->timeout_list);

		/*
		 * Check if we raced with end io completion
		 */
		if (!blk_mark_rq_complete(rq))
			blk_rq_timed_out(rq);
	}

blk_complete_request:

	if (!blk_mark_rq_complete(req))
		__blk_complete_request(req);

blk_mq_check_expired:

	if (time_after_eq(jiffies, rq->deadline)) {
		if (!blk_mark_rq_complete(rq))
			blk_mq_rq_timed_out(rq, reserved);
	}

blk_mq_complete_request:

	if (!blk_mark_rq_complete(rq))
		__blk_mq_complete_request(rq);
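
In other words, both the timeout path and the completion path try to claim the
request by atomically marking it complete, and only the winner proceeds. A
minimal userspace illustration of that claim pattern with C11 atomics, modelled
loosely on blk_mark_rq_complete() (illustrative only, not the kernel code):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct request_model {
	atomic_flag complete;	/* models the "complete" mark on a request */
};

/* Returns true only for the first caller to claim the request. */
static bool mark_rq_complete(struct request_model *rq)
{
	return !atomic_flag_test_and_set(&rq->complete);
}

int main(void)
{
	struct request_model rq = { .complete = ATOMIC_FLAG_INIT };

	/* The completion path gets there first and wins the claim... */
	if (mark_rq_complete(&rq))
		printf("completion path handles the request\n");

	/* ...so a racing timeout path sees the mark and backs off. */
	if (!mark_rq_complete(&rq))
		printf("timeout path skips the already-claimed request\n");

	return 0;
}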

Thanks
Jianchao

> 
> Since commit 12f5b9314545, if a completion occurs after request timeout
> processing has started, that completion is processed if the request has the
> state MQ_RQ_IN_FLIGHT. blk_mq_rq_timed_out() does not modify the request
> state unless the block driver timeout handler modifies it, e.g. by calling
> blk_mq_end_request() or by calling blk_mq_requeue_request(). The typical
> behavior of scsi_times_out() is to queue sending of a SCSI abort and hence
> not to change the request state immediately. In other words, if a request
> completion occurs during or shortly after a timeout occurred then
> blk_mq_complete_request() will call __blk_mq_complete_request() and will
> complete the request, although that is not allowed because timeout handling
> has already started. Do you agree with this analysis?


Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce

2018-07-12 Thread jianchao.wang



On 07/13/2018 06:24 AM, Bart Van Assche wrote:
> On Thu, 2018-07-12 at 13:24 -0600, Keith Busch wrote:
>> On Thu, Jul 12, 2018 at 06:16:12PM +, Bart Van Assche wrote:
>>> What prevents that a request finishes and gets reused after the
>>> blk_mq_req_expired() call has finished and before kref_get_unless_zero() is
>>> called? Is this perhaps a race condition that has not yet been triggered by
>>> any existing block layer test? Please note that there is no such race
>>> condition in the patch I had posted ("blk-mq: Rework blk-mq timeout handling
>>> again" - 
>>> https://www.spinics.net/lists/linux-block/msg26489.html).
>>
>> I don't think there's any such race in the merged implementation
>> either. If the request is reallocated, then the kref check may succeed,
>> but the blk_mq_req_expired() check would surely fail, allowing the
>> request to proceed as normal. The code comments at least say as much.
> 
> Hello Keith,
> 
> Before commit 12f5b9314545 ("blk-mq: Remove generation seqeunce"), if a
> request completion was reported after request timeout processing had
> started, completion handling was skipped. The following code in
> blk_mq_complete_request() realized that:
> 
>   if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
>   __blk_mq_complete_request(rq);
> 
> Since commit 12f5b9314545, if a completion occurs after request timeout
> processing has started, that completion is processed if the request has the
> state MQ_RQ_IN_FLIGHT. blk_mq_rq_timed_out() does not modify the request
> state unless the block driver timeout handler modifies it, e.g. by calling
> blk_mq_end_request() or by calling blk_mq_requeue_request(). The typical
> behavior of scsi_times_out() is to queue sending of a SCSI abort and hence
> not to change the request state immediately. In other words, if a request
> completion occurs during or shortly after a timeout occurred then
> blk_mq_complete_request() will call __blk_mq_complete_request() and will
> complete the request, although that is not allowed because timeout handling
> has already started. Do you agree with this analysis?
>

Oh, thank goodness Bart has said this.
I have been saying the same thing in this mail:
https://marc.info/?l=linux-block&m=152950093831738&w=2
Even though my voice is small, I definitely support Bart's point.

Thanks
Jianchao
 
> Thanks,
> 
> Bart.


Re: Silent data corruption in blkdev_direct_IO()

2018-07-12 Thread Ming Lei
On Thu, Jul 12, 2018 at 10:36 PM, Hannes Reinecke  wrote:
> Hi Jens, Christoph,
>
> we're currently hunting down a silent data corruption occurring due to
> commit 72ecad22d9f1 ("block: support a full bio worth of IO for
> simplified bdev direct-io").
>
> While the whole thing is still hazy on the details, the one thing we've
> found is that reverting that patch fixes the data corruption.
>
> And looking closer, I've found this:
>
> static ssize_t
> blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
> {
> int nr_pages;
>
> nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
> if (!nr_pages)
> return 0;
> if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
> return __blkdev_direct_IO_simple(iocb, iter, nr_pages);
>
> return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
> }
>
> When checking the call path
> __blkdev_direct_IO()->bio_alloc_bioset()->bvec_alloc()
> I found that bvec_alloc() will fail if nr_pages > BIO_MAX_PAGES.
>
> So why is there the check for 'nr_pages <= BIO_MAX_PAGES' ?
> It's not that we can handle it in __blkdev_direct_IO() ...
>
> Thanks for any clarification.

Maybe you can try the following patch from Christoph to see if it makes a
difference:

https://marc.info/?l=linux-kernel&m=153013977816825&w=2


thanks,
Ming Lei


Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce

2018-07-12 Thread Bart Van Assche
On Thu, 2018-07-12 at 13:24 -0600, Keith Busch wrote:
> On Thu, Jul 12, 2018 at 06:16:12PM +, Bart Van Assche wrote:
> > What prevents that a request finishes and gets reused after the
> > blk_mq_req_expired() call has finished and before kref_get_unless_zero() is
> > called? Is this perhaps a race condition that has not yet been triggered by
> > any existing block layer test? Please note that there is no such race
> > condition in the patch I had posted ("blk-mq: Rework blk-mq timeout handling
> > again" - https://www.spinics.net/lists/linux-block/msg26489.html).
> 
> I don't think there's any such race in the merged implementation
> either. If the request is reallocated, then the kref check may succeed,
> but the blk_mq_req_expired() check would surely fail, allowing the
> request to proceed as normal. The code comments at least say as much.

Hello Keith,

Before commit 12f5b9314545 ("blk-mq: Remove generation seqeunce"), if a
request completion was reported after request timeout processing had
started, completion handling was skipped. The following code in
blk_mq_complete_request() realized that:

if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
__blk_mq_complete_request(rq);

Since commit 12f5b9314545, if a completion occurs after request timeout
processing has started, that completion is processed if the request has the
state MQ_RQ_IN_FLIGHT. blk_mq_rq_timed_out() does not modify the request
state unless the block driver timeout handler modifies it, e.g. by calling
blk_mq_end_request() or by calling blk_mq_requeue_request(). The typical
behavior of scsi_times_out() is to queue sending of a SCSI abort and hence
not to change the request state immediately. In other words, if a request
completion occurs during or shortly after a timeout occurred then
blk_mq_complete_request() will call __blk_mq_complete_request() and will
complete the request, although that is not allowed because timeout handling
has already started. Do you agree with this analysis?

Thanks,

Bart.










Re: [PATCH] block: skd: Use %pad printk format for dma_addr_t values

2018-07-12 Thread Jens Axboe
On 7/12/18 2:29 PM, Helge Deller wrote:
> Use the existing %pad printk format to print dma_addr_t values.
> This avoids the following warnings when compiling on the parisc64 platform:
> 
> drivers/block/skd_main.c: In function 'skd_preop_sg_list':
> drivers/block/skd_main.c:660:4: warning: format '%llx' expects argument of 
> type 'long long unsigned int', but argument 6 has type 'dma_addr_t {aka 
> unsigned int}' [-Wformat=]

Applied, thanks.

-- 
Jens Axboe



Re: [PATCH] block: skd: Use %pad printk format for dma_addr_t values

2018-07-12 Thread Bart Van Assche

On 07/12/18 13:29, Helge Deller wrote:

Use the existing %pad printk format to print dma_addr_t values.
This avoids the following warnings when compiling on the parisc64 platform:


Reviewed-by: Bart Van Assche 


[PATCH] block: skd: Use %pad printk format for dma_addr_t values

2018-07-12 Thread Helge Deller
Use the existing %pad printk format to print dma_addr_t values.
This avoids the following warnings when compiling on the parisc64 platform:

drivers/block/skd_main.c: In function 'skd_preop_sg_list':
drivers/block/skd_main.c:660:4: warning: format '%llx' expects argument of type 
'long long unsigned int', but argument 6 has type 'dma_addr_t {aka unsigned 
int}' [-Wformat=]

Signed-off-by: Helge Deller 

diff --git a/drivers/block/skd_main.c b/drivers/block/skd_main.c
index bc7aea6d7b7c..9ba61c194b4c 100644
--- a/drivers/block/skd_main.c
+++ b/drivers/block/skd_main.c
@@ -657,8 +657,8 @@ static bool skd_preop_sg_list(struct skd_device *skdev,
 
if (unlikely(skdev->dbg_level > 1)) {
dev_dbg(&skdev->pdev->dev,
-   "skreq=%x sksg_list=%p sksg_dma=%llx\n",
-   skreq->id, skreq->sksg_list, skreq->sksg_dma_address);
+   "skreq=%x sksg_list=%p sksg_dma=%pad\n",
+   skreq->id, skreq->sksg_list, &skreq->sksg_dma_address);
for (i = 0; i < n_sg; i++) {
struct fit_sg_descriptor *sgd = &skreq->sksg_list[i];
 
@@ -1190,8 +1190,8 @@ static void skd_send_fitmsg(struct skd_device *skdev,
 {
u64 qcmd;
 
-   dev_dbg(&skdev->pdev->dev, "dma address 0x%llx, busy=%d\n",
-   skmsg->mb_dma_address, skd_in_flight(skdev));
+   dev_dbg(&skdev->pdev->dev, "dma address %pad, busy=%d\n",
+   &skmsg->mb_dma_address, skd_in_flight(skdev));
dev_dbg(&skdev->pdev->dev, "msg_buf %p\n", skmsg->msg_buf);
 
qcmd = skmsg->mb_dma_address;
@@ -1250,9 +1250,9 @@ static void skd_send_special_fitmsg(struct skd_device 
*skdev,
}
 
dev_dbg(&skdev->pdev->dev,
-   "skspcl=%p id=%04x sksg_list=%p sksg_dma=%llx\n",
+   "skspcl=%p id=%04x sksg_list=%p sksg_dma=%pad\n",
skspcl, skspcl->req.id, skspcl->req.sksg_list,
-   skspcl->req.sksg_dma_address);
+   &skspcl->req.sksg_dma_address);
for (i = 0; i < skspcl->req.n_sg; i++) {
struct fit_sg_descriptor *sgd =
&skspcl->req.sksg_list[i];
@@ -2685,8 +2685,8 @@ static int skd_cons_skmsg(struct skd_device *skdev)
 
WARN(((uintptr_t)skmsg->msg_buf | skmsg->mb_dma_address) &
 (FIT_QCMD_ALIGN - 1),
-"not aligned: msg_buf %p mb_dma_address %#llx\n",
-skmsg->msg_buf, skmsg->mb_dma_address);
+"not aligned: msg_buf %p mb_dma_address %pad\n",
+skmsg->msg_buf, &skmsg->mb_dma_address);
memset(skmsg->msg_buf, 0, SKD_N_FITMSG_BYTES);
}
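
For context on the format itself: %pad takes a pointer to the dma_addr_t and
lets the kernel's vsprintf() emit the correct width whether dma_addr_t is
32-bit (as in the parisc64 warning above) or 64-bit. A minimal sketch of the
pattern -- 'dev', 'handle' and print_dma_handle() are placeholder names, not
part of this driver:

#include <linux/device.h>
#include <linux/types.h>

/* Illustrative only: the point is passing &handle rather than the value. */
static void print_dma_handle(struct device *dev, dma_addr_t handle)
{
	/* dev_dbg(dev, "dma=%llx\n", handle);  wrong: width varies by arch */
	dev_dbg(dev, "dma=%pad\n", &handle);	/* right: %pad takes a pointer */
}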
 


Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce

2018-07-12 Thread Keith Busch
On Thu, Jul 12, 2018 at 06:16:12PM +, Bart Van Assche wrote:
> On Mon, 2018-05-21 at 17:11 -0600, Keith Busch wrote:
> > /*
> > -* We marked @rq->aborted_gstate and waited for RCU.  If there were
> > -* completions that we lost to, they would have finished and
> > -* updated @rq->gstate by now; otherwise, the completion path is
> > -* now guaranteed to see @rq->aborted_gstate and yield.  If
> > -* @rq->aborted_gstate still matches @rq->gstate, @rq is ours.
> > +* Just do a quick check if it is expired before locking the request in
> > +* so we're not unnecessarilly synchronizing across CPUs.
> >  */
> > -   if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) &&
> > -   READ_ONCE(rq->gstate) == rq->aborted_gstate)
> > +   if (!blk_mq_req_expired(rq, next))
> > +   return;
> > +
> > +   /*
> > +* We have reason to believe the request may be expired. Take a
> > +* reference on the request to lock this request lifetime into its
> > +* currently allocated context to prevent it from being reallocated in
> > +* the event the completion by-passes this timeout handler.
> > +* 
> > +* If the reference was already released, then the driver beat the
> > +* timeout handler to posting a natural completion.
> > +*/
> > +   if (!kref_get_unless_zero(&rq->ref))
> > +   return;
> > +
> > +   /*
> > +* The request is now locked and cannot be reallocated underneath the
> > +* timeout handler's processing. Re-verify this exact request is truly
> > +* expired; if it is not expired, then the request was completed and
> > +* reallocated as a new request.
> > +*/
> > +   if (blk_mq_req_expired(rq, next))
> > blk_mq_rq_timed_out(rq, reserved);
> > +   blk_mq_put_request(rq);
> >  }
> 
> Hello Keith and Christoph,
> 
> What prevents that a request finishes and gets reused after the
> blk_mq_req_expired() call has finished and before kref_get_unless_zero() is
> called? Is this perhaps a race condition that has not yet been triggered by
> any existing block layer test? Please note that there is no such race
> condition in the patch I had posted ("blk-mq: Rework blk-mq timeout handling
> again" - https://www.spinics.net/lists/linux-block/msg26489.html).

I don't think there's any such race in the merged implementation
either. If the request is reallocated, then the kref check may succeed,
but the blk_mq_req_expired() check would surely fail, allowing the
request to proceed as normal. The code comments at least say as much.
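
To spell out the "lock the lifetime" idea: the timeout path only proceeds if it
can take a reference while the count is still non-zero, and then re-checks
expiry so a request that was completed (and possibly reallocated) in the
meantime is left alone. A minimal userspace model of that get-unless-zero
pattern with C11 atomics -- illustrative only, not the kernel kref API;
ref_get_unless_zero() and ref_put() are made-up names:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct ref {
	atomic_int count;
};

/* Take a reference only if the object is still alive (count > 0). */
static bool ref_get_unless_zero(struct ref *r)
{
	int c = atomic_load(&r->count);

	while (c > 0) {
		if (atomic_compare_exchange_weak(&r->count, &c, c + 1))
			return true;
		/* c was reloaded by the failed exchange; retry */
	}
	return false;
}

static void ref_put(struct ref *r)
{
	if (atomic_fetch_sub(&r->count, 1) == 1)
		printf("last reference dropped, object may be freed/reused\n");
}

int main(void)
{
	struct ref rq_ref = { .count = 1 };	/* request in flight */

	/* Timeout path: pin the request before touching it. */
	if (ref_get_unless_zero(&rq_ref)) {
		printf("timeout handler pinned the request\n");
		/* ...re-check expiry here before acting on it... */
		ref_put(&rq_ref);
	}

	/* Completion path drops the original reference. */
	ref_put(&rq_ref);

	/* A later timeout attempt fails: the request is gone. */
	if (!ref_get_unless_zero(&rq_ref))
		printf("request already released, timeout handler backs off\n");

	return 0;
}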


Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce

2018-07-12 Thread Bart Van Assche
On Mon, 2018-05-21 at 17:11 -0600, Keith Busch wrote:
>   /*
> -  * We marked @rq->aborted_gstate and waited for RCU.  If there were
> -  * completions that we lost to, they would have finished and
> -  * updated @rq->gstate by now; otherwise, the completion path is
> -  * now guaranteed to see @rq->aborted_gstate and yield.  If
> -  * @rq->aborted_gstate still matches @rq->gstate, @rq is ours.
> +  * Just do a quick check if it is expired before locking the request in
> +  * so we're not unnecessarilly synchronizing across CPUs.
>*/
> - if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) &&
> - READ_ONCE(rq->gstate) == rq->aborted_gstate)
> + if (!blk_mq_req_expired(rq, next))
> + return;
> +
> + /*
> +  * We have reason to believe the request may be expired. Take a
> +  * reference on the request to lock this request lifetime into its
> +  * currently allocated context to prevent it from being reallocated in
> +  * the event the completion by-passes this timeout handler.
> +  * 
> +  * If the reference was already released, then the driver beat the
> +  * timeout handler to posting a natural completion.
> +  */
> + if (!kref_get_unless_zero(&rq->ref))
> + return;
> +
> + /*
> +  * The request is now locked and cannot be reallocated underneath the
> +  * timeout handler's processing. Re-verify this exact request is truly
> +  * expired; if it is not expired, then the request was completed and
> +  * reallocated as a new request.
> +  */
> + if (blk_mq_req_expired(rq, next))
>   blk_mq_rq_timed_out(rq, reserved);
> + blk_mq_put_request(rq);
>  }

Hello Keith and Christoph,

What prevents that a request finishes and gets reused after the
blk_mq_req_expired() call has finished and before kref_get_unless_zero() is
called? Is this perhaps a race condition that has not yet been triggered by
any existing block layer test? Please note that there is no such race
condition in the patch I had posted ("blk-mq: Rework blk-mq timeout handling
again" - https://www.spinics.net/lists/linux-block/msg26489.html).

Thanks,

Bart.






Re: Silent data corruption in blkdev_direct_IO()

2018-07-12 Thread Jens Axboe
On 7/12/18 10:20 AM, Jens Axboe wrote:
> On 7/12/18 10:14 AM, Hannes Reinecke wrote:
>> On 07/12/2018 05:08 PM, Jens Axboe wrote:
>>> On 7/12/18 8:36 AM, Hannes Reinecke wrote:
 Hi Jens, Christoph,

 we're currently hunting down a silent data corruption occurring due to
 commit 72ecad22d9f1 ("block: support a full bio worth of IO for
 simplified bdev direct-io").

 While the whole thing is still hazy on the details, the one thing we've
 found is that reverting that patch fixes the data corruption.

 And looking closer, I've found this:

 static ssize_t
 blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 {
int nr_pages;

nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
if (!nr_pages)
return 0;
if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
return __blkdev_direct_IO_simple(iocb, iter, nr_pages);

return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
 }

 When checking the call path
 __blkdev_direct_IO()->bio_alloc_bioset()->bvec_alloc()
 I found that bvec_alloc() will fail if nr_pages > BIO_MAX_PAGES.

 So why is there the check for 'nr_pages <= BIO_MAX_PAGES' ?
 It's not that we can handle it in __blkdev_direct_IO() ...
>>>
>>> The logic could be cleaned up like below, the sync part is really all
>>> we care about. What is the test case for this? async or sync?
>>>
>>> I also don't remember why it's BIO_MAX_PAGES + 1...
>>>
>>> diff --git a/fs/block_dev.c b/fs/block_dev.c
>>> index 0dd87aaeb39a..14ef3d71b55f 100644
>>> --- a/fs/block_dev.c
>>> +++ b/fs/block_dev.c
>>> @@ -424,13 +424,13 @@ blkdev_direct_IO(struct kiocb *iocb, struct iov_iter 
>>> *iter)
>>>   {
>>> int nr_pages;
>>>   
>>> -   nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
>>> +   nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES);
>>> if (!nr_pages)
>>> return 0;
>>> -   if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
>>> +   if (is_sync_kiocb(iocb))
>>> return __blkdev_direct_IO_simple(iocb, iter, nr_pages);
>>>   
>>> -   return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
>>> +   return __blkdev_direct_IO(iocb, iter, nr_pages);
>>>   }
>>>   
>>>   static __init int blkdev_init(void)
>>>
>> Hmm. We'll give it a go, but somehow I feel this won't solve our problem.
> 
> It probably won't, the only joker here is the BIO_MAX_PAGES + 1. But it
> does simplify that part...

OK, now I remember. The +1 is just to check if there are actually more
pages. __blkdev_direct_IO_simple() only does one bio, so it has to fit
within that one bio. __blkdev_direct_IO() will loop just fine and
will finish any size, BIO_MAX_PAGES at a time.

Hence the patch I sent is wrong, the code actually looks fine. Which
means we're back to trying to figure out what is going on here. It'd
be great with a test case...

-- 
Jens Axboe



Re: Silent data corruption in blkdev_direct_IO()

2018-07-12 Thread Jens Axboe
On 7/12/18 10:14 AM, Hannes Reinecke wrote:
> On 07/12/2018 05:08 PM, Jens Axboe wrote:
>> On 7/12/18 8:36 AM, Hannes Reinecke wrote:
>>> Hi Jens, Christoph,
>>>
>>> we're currently hunting down a silent data corruption occurring due to
>>> commit 72ecad22d9f1 ("block: support a full bio worth of IO for
>>> simplified bdev direct-io").
>>>
>>> While the whole thing is still hazy on the details, the one thing we've
>>> found is that reverting that patch fixes the data corruption.
>>>
>>> And looking closer, I've found this:
>>>
>>> static ssize_t
>>> blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>>> {
>>> int nr_pages;
>>>
>>> nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
>>> if (!nr_pages)
>>> return 0;
>>> if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
>>> return __blkdev_direct_IO_simple(iocb, iter, nr_pages);
>>>
>>> return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
>>> }
>>>
>>> When checking the call path
>>> __blkdev_direct_IO()->bio_alloc_bioset()->bvec_alloc()
>>> I found that bvec_alloc() will fail if nr_pages > BIO_MAX_PAGES.
>>>
>>> So why is there the check for 'nr_pages <= BIO_MAX_PAGES' ?
>>> It's not that we can handle it in __blkdev_direct_IO() ...
>>
>> The logic could be cleaned up like below, the sync part is really all
>> we care about. What is the test case for this? async or sync?
>>
>> I also don't remember why it's BIO_MAX_PAGES + 1...
>>
>> diff --git a/fs/block_dev.c b/fs/block_dev.c
>> index 0dd87aaeb39a..14ef3d71b55f 100644
>> --- a/fs/block_dev.c
>> +++ b/fs/block_dev.c
>> @@ -424,13 +424,13 @@ blkdev_direct_IO(struct kiocb *iocb, struct iov_iter 
>> *iter)
>>   {
>>  int nr_pages;
>>   
>> -nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
>> +nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES);
>>  if (!nr_pages)
>>  return 0;
>> -if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
>> +if (is_sync_kiocb(iocb))
>>  return __blkdev_direct_IO_simple(iocb, iter, nr_pages);
>>   
>> -return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
>> +return __blkdev_direct_IO(iocb, iter, nr_pages);
>>   }
>>   
>>   static __init int blkdev_init(void)
>>
> Hmm. We'll give it a go, but somehow I feel this won't solve our problem.

It probably won't, the only joker here is the BIO_MAX_PAGES + 1. But it
does simplify that part...

> Another question (which probably shows my complete ignorance):
> What happens if the iter holds >= BIO_MAX_PAGES?
> 'iov_iter_npages' will only ever return BIO_MAX_PAGES, independent of
> whether the iter contains more pages.
> What happens to those left-over pages?
> Will they ever be processed?

They result in a short read or write, and we rely on being called again for
the remainder.
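
The contract is the same one a plain userspace write() loop follows: if fewer
bytes were accepted than requested, advance and call again. A trivial
illustration (userspace, not kernel code; write_all() is just an example
helper):

#include <unistd.h>

/* Keep calling write() until the whole buffer is accepted or an error. */
static ssize_t write_all(int fd, const char *buf, size_t len)
{
	size_t done = 0;

	while (done < len) {
		ssize_t ret = write(fd, buf + done, len - done);

		if (ret < 0)
			return -1;	/* caller inspects errno */
		done += ret;		/* short write: advance and retry */
	}
	return (ssize_t)done;
}

int main(void)
{
	const char msg[] = "short writes are retried by the caller\n";

	return write_all(STDOUT_FILENO, msg, sizeof(msg) - 1) < 0;
}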

-- 
Jens Axboe



Re: Silent data corruption in blkdev_direct_IO()

2018-07-12 Thread Hannes Reinecke

On 07/12/2018 05:08 PM, Jens Axboe wrote:

On 7/12/18 8:36 AM, Hannes Reinecke wrote:

Hi Jens, Christoph,

we're currently hunting down a silent data corruption occurring due to
commit 72ecad22d9f1 ("block: support a full bio worth of IO for
simplified bdev direct-io").

While the whole thing is still hazy on the details, the one thing we've
found is that reverting that patch fixes the data corruption.

And looking closer, I've found this:

static ssize_t
blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
{
int nr_pages;

nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
if (!nr_pages)
return 0;
if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
return __blkdev_direct_IO_simple(iocb, iter, nr_pages);

return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
}

When checking the call path
__blkdev_direct_IO()->bio_alloc_bioset()->bvec_alloc()
I found that bvec_alloc() will fail if nr_pages > BIO_MAX_PAGES.

So why is there the check for 'nr_pages <= BIO_MAX_PAGES' ?
It's not that we can handle it in __blkdev_direct_IO() ...


The logic could be cleaned up like below, the sync part is really all
we care about. What is the test case for this? async or sync?

I also don't remember why it's BIO_MAX_PAGES + 1...

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 0dd87aaeb39a..14ef3d71b55f 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -424,13 +424,13 @@ blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 {
	int nr_pages;

-	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
+	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES);
	if (!nr_pages)
		return 0;
-	if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
+	if (is_sync_kiocb(iocb))
		return __blkdev_direct_IO_simple(iocb, iter, nr_pages);

-	return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
+	return __blkdev_direct_IO(iocb, iter, nr_pages);
 }

 static __init int blkdev_init(void)



Hmm. We'll give it a go, but somehow I feel this won't solve our problem.
Another question (which probably shows my complete ignorance):
What happens if the iter holds >= BIO_MAX_PAGES?
'iov_iter_npages' will only ever return BIO_MAX_PAGES, independent of
whether the iter contains more pages.

What happens to those left-over pages?
Will they ever be processed?

Cheers,

Hannes


Re: Silent data corruption in blkdev_direct_IO()

2018-07-12 Thread Martin Wilck
On Thu, 2018-07-12 at 09:08 -0600, Jens Axboe wrote:
> On 7/12/18 8:36 AM, Hannes Reinecke wrote:
> > Hi Jens, Christoph,
> > 
> > we're currently hunting down a silent data corruption occurring due
> > to
> > commit 72ecad22d9f1 ("block: support a full bio worth of IO for
> > simplified bdev direct-io").
> > 
> > While the whole thing is still hazy on the details, the one thing
> > we've
> > found is that reverting that patch fixes the data corruption.
> > 
> > And looking closer, I've found this:
> > 
> > static ssize_t
> > blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
> > {
> > int nr_pages;
> > 
> > nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
> > if (!nr_pages)
> > return 0;
> > if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
> > return __blkdev_direct_IO_simple(iocb, iter, nr_pages);
> > 
> > return __blkdev_direct_IO(iocb, iter, min(nr_pages,
> > BIO_MAX_PAGES));
> > }
> > 
> > When checking the call path
> > __blkdev_direct_IO()->bio_alloc_bioset()->bvec_alloc()
> > I found that bvec_alloc() will fail if nr_pages > BIO_MAX_PAGES.
> > 
> > So why is there the check for 'nr_pages <= BIO_MAX_PAGES' ?
> > It's not that we can handle it in __blkdev_direct_IO() ...
> 
> The logic could be cleaned up like below, the sync part is really all
> we care about. What is the test case for this? async or sync?

It's sync, and the corruption is in the __blkdev_direct_IO_simple()
path. It starts to occur with your "block: support a full bio worth of
IO for simplified bdev direct-io" patch, which causes the "simple" path
to be taken for larger IO sizes.

Martin



> 
> I also don't remember why it's BIO_MAX_PAGES + 1...
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 0dd87aaeb39a..14ef3d71b55f 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -424,13 +424,13 @@ blkdev_direct_IO(struct kiocb *iocb, struct
> iov_iter *iter)
>  {
>   int nr_pages;
>  
> - nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
> + nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES);
>   if (!nr_pages)
>   return 0;
> - if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
> + if (is_sync_kiocb(iocb))
>   return __blkdev_direct_IO_simple(iocb, iter,
> nr_pages);
>  
> - return __blkdev_direct_IO(iocb, iter, min(nr_pages,
> BIO_MAX_PAGES));
> + return __blkdev_direct_IO(iocb, iter, nr_pages);
>  }
>  
>  static __init int blkdev_init(void)
> 

-- 
Dr. Martin Wilck , Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)



Re: Silent data corruption in blkdev_direct_IO()

2018-07-12 Thread Jens Axboe
On 7/12/18 8:36 AM, Hannes Reinecke wrote:
> Hi Jens, Christoph,
> 
> we're currently hunting down a silent data corruption occurring due to
> commit 72ecad22d9f1 ("block: support a full bio worth of IO for
> simplified bdev direct-io").
> 
> While the whole thing is still hazy on the details, the one thing we've
> found is that reverting that patch fixes the data corruption.
> 
> And looking closer, I've found this:
> 
> static ssize_t
> blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
> {
>   int nr_pages;
> 
>   nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
>   if (!nr_pages)
>   return 0;
>   if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
>   return __blkdev_direct_IO_simple(iocb, iter, nr_pages);
> 
>   return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
> }
> 
> When checking the call path
> __blkdev_direct_IO()->bio_alloc_bioset()->bvec_alloc()
> I found that bvec_alloc() will fail if nr_pages > BIO_MAX_PAGES.
> 
> So why is there the check for 'nr_pages <= BIO_MAX_PAGES' ?
> It's not that we can handle it in __blkdev_direct_IO() ...

The logic could be cleaned up like below, the sync part is really all
we care about. What is the test case for this? async or sync?

I also don't remember why it's BIO_MAX_PAGES + 1...

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 0dd87aaeb39a..14ef3d71b55f 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -424,13 +424,13 @@ blkdev_direct_IO(struct kiocb *iocb, struct iov_iter 
*iter)
 {
int nr_pages;
 
-   nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
+   nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES);
if (!nr_pages)
return 0;
-   if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
+   if (is_sync_kiocb(iocb))
return __blkdev_direct_IO_simple(iocb, iter, nr_pages);
 
-   return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
+   return __blkdev_direct_IO(iocb, iter, nr_pages);
 }
 
 static __init int blkdev_init(void)

-- 
Jens Axboe



Silent data corruption in blkdev_direct_IO()

2018-07-12 Thread Hannes Reinecke
Hi Jens, Christoph,

we're currently hunting down a silent data corruption occurring due to
commit 72ecad22d9f1 ("block: support a full bio worth of IO for
simplified bdev direct-io").

While the whole thing is still hazy on the details, the one thing we've
found is that reverting that patch fixes the data corruption.

And looking closer, I've found this:

static ssize_t
blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
{
int nr_pages;

nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
if (!nr_pages)
return 0;
if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
return __blkdev_direct_IO_simple(iocb, iter, nr_pages);

return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
}

When checking the call path
__blkdev_direct_IO()->bio_alloc_bioset()->bvec_alloc()
I found that bvec_alloc() will fail if nr_pages > BIO_MAX_PAGES.

So why is there the check for 'nr_pages <= BIO_MAX_PAGES' ?
It's not that we can handle it in __blkdev_direct_IO() ...

Thanks for any clarification.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

