Re: Silent data corruption in blkdev_direct_IO()

2018-07-16 Thread Martin Wilck
On Fri, 2018-07-13 at 14:52 -0600, Jens Axboe wrote:
> On 7/13/18 2:48 PM, Martin Wilck wrote:
> > >
> > > > However, so far I've only identified a minor problem, see below - it
> > > > doesn't explain the data corruption we're seeing.
> > >
> > > What would help is trying to

Re: [PATCH 1/2] nbd: don't requeue the same request twice.

2018-07-16 Thread Jens Axboe
On 7/16/18 10:11 AM, Josef Bacik wrote:
> We can race with the snd timeout and the per-request timeout and end up
> requeuing the same request twice. We can't use the send_complete
> completion to tell if everything is ok because we hold the tx_lock
> during send, so the timeout stuff will block

Re: [PATCH 1/2] blk-iolatency: don't change the latency window

2018-07-16 Thread Jens Axboe
On 7/16/18 10:12 AM, Josef Bacik wrote:
> From: Josef Bacik
>
> Early versions of these patches had us waiting for seconds at a time
> during submission, so we had to adjust the timing window we monitored
> for latency. Now we don't do things like that so this is unnecessary
> code.

Applied

[PATCH 2/2] blk-iolatency: truncate our current time

2018-07-16 Thread Josef Bacik
From: Josef Bacik

In our longer tests we noticed that some boxes would degrade to the point of uselessness. This is because we truncate the current time when saving it in our bio, but I was using the raw current time to subtract from. So once the box had been up a certain amount of time it
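
The mismatch described in this preview (a truncated timestamp stored at issue time, a raw timestamp used at completion time) can be illustrated with a small standalone sketch. This is not the code from the patch, which is not shown here. The 51-bit field width and the names `TIME_BITS`, `TIME_MASK`, `truncate_time`, `elapsed_mismatched`, and `elapsed_consistent` are all assumptions made for illustration.

```c
#include <stdint.h>

/* Hypothetical layout: only the low TIME_BITS bits of a 64-bit field hold
 * a nanosecond timestamp; the upper bits are assumed to be used elsewhere. */
#define TIME_BITS 51
#define TIME_MASK ((UINT64_C(1) << TIME_BITS) - 1)

/* Keep only the bits that fit in the stored field. */
static uint64_t truncate_time(uint64_t ns)
{
	return ns & TIME_MASK;
}

/* Mismatched: raw "now" minus a truncated start time. Once the clock has
 * passed TIME_MASK, the result is inflated by the dropped high bits. */
static uint64_t elapsed_mismatched(uint64_t stored_start, uint64_t now)
{
	return now - stored_start;
}

/* Consistent: truncate "now" the same way, then subtract modulo
 * 2^TIME_BITS so a wrap of the truncated clock is also tolerated. */
static uint64_t elapsed_consistent(uint64_t stored_start, uint64_t now)
{
	return (truncate_time(now) - stored_start) & TIME_MASK;
}
```

With a start time just past the 51-bit boundary and a completion 500 ns later, `elapsed_consistent` reports 500 while `elapsed_mismatched` reports a value larger than 2^51, which is consistent with the symptom of latency checks going wrong only after enough uptime. The modular subtraction is one possible way to handle wraparound in this sketch, not necessarily the approach the patch takes.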

[PATCH 1/2] blk-iolatency: don't change the latency window

2018-07-16 Thread Josef Bacik
From: Josef Bacik

Early versions of these patches had us waiting for seconds at a time during submission, so we had to adjust the timing window we monitored for latency. Now we don't do things like that so this is unnecessary code.

Signed-off-by: Josef Bacik
---
 block/blk-iolatency.c | 10

[PATCH 1/2] nbd: don't requeue the same request twice.

2018-07-16 Thread Josef Bacik
We can race with the snd timeout and the per-request timeout and end up requeuing the same request twice. We can't use the send_complete completion to tell if everything is ok because we hold the tx_lock during send, so the timeout stuff will block waiting to mark the socket dead, and we could be

[PATCH 2/2] nbd: handle unexpected replies better

2018-07-16 Thread Josef Bacik
If the server or network is misbehaving and we get an unexpected reply we can sometimes miss the request not being started and wait on a request and never get a response, or even double complete the same request. Fix this by replacing the send_complete completion with just a per command lock.

Re: Silent data corruption in blkdev_direct_IO()

2018-07-16 Thread Ming Lei
On Sat, Jul 14, 2018 at 6:29 AM, Martin Wilck wrote:
> Hi Ming & Jens,
>
> On Fri, 2018-07-13 at 12:54 -0600, Jens Axboe wrote:
>> On 7/12/18 5:29 PM, Ming Lei wrote:
>> >
>> > Maybe you can try the following patch from Christoph to see if it
>> > makes a
>> > difference:
>> >
>> >