Re: [Nbd] [PATCH][V3] nbd: add multi-connection support
> On 6 Oct 2016, at 11:15, Wouter Verhelst <w...@uter.be> wrote:
>
>> but I still think it would be helpful if the protocol helped out
>> the end user of the client and refused to negotiate multichannel
>> connections when they are unsafe. How is the end client meant to know
>> whether the back end is not on Linux, not on a block device, done
>> via a Ceph driver etc?
>
> Well, it isn't. The server, if it provides certain functionality, should
> also provide particular guarantees. If it can't provide those
> guarantees, it should not provide that functionality.
>
> e.g., if a server runs on a backend with cache coherency issues, it
> should not allow multiple connections to the same device, etc.

Sure. I'm simply saying that the connection flags should say "I can't
support multiple connections to this device" (available at NBD_OPT_INFO
time) rather than erroring out. This is a userspace protocol issue.

>> I still think it's pretty damn awkward that with a ceph back end
>> (for instance) which would be one of the backends to benefit the
>> most from multichannel connections (as it's inherently parallel),
>> no one has explained how flush could be done safely.
>
> If ceph doesn't have any way to guarantee that a write is available to
> all readers of a particular device, then it *cannot* be used to map
> block device semantics with multiple channels.

Thinking about it, I believe Ceph actually may be able to do that; it's
just harder than a straightforward flush.

> Therefore, it should not
> allow writing to the device from multiple clients, period, unless the
> filesystem (or other thing) making use of the nbd device above the ceph
> layer actually understands how things may go wrong and can take care of
> it.
>
> As such, I don't think that the problems inherent in using multiple
> connections to a ceph device (which I do not deny) have any place in a
> discussion on how NBD should work in the face of multiple channels with
> a sane/regular backend.
On which note, I am still not convinced that fsync() provides such
semantics on all operating systems, or on Linux on non-block devices.
I'm not sure all those backends are 'insane'! However, if the server
could signal lack of support for multiple connections (see above) my
concerns would be significantly reduced. Note this requires no kernel
change (as you pointed out).

--
Alex Bligh
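To illustrate the sort of negotiation-time signal I have in mind: a server could advertise a per-export transmission flag, and the client would then refuse (or clamp) multichannel use. The flag name and bit position below are purely hypothetical, not part of the protocol as it stands:

```python
# Real NBD transmission flags, for context:
NBD_FLAG_HAS_FLAGS  = 1 << 0
NBD_FLAG_SEND_FLUSH = 1 << 2

# Hypothetical flag for this proposal: "I can't support multiple
# connections to this device". The name and bit are illustrative only.
NBD_FLAG_NO_MULTI_CONN = 1 << 9

def usable_connections(transmission_flags: int, requested: int) -> int:
    """Clamp the number of channels a client may open, based on what
    the server advertised (e.g. at NBD_OPT_INFO time)."""
    if transmission_flags & NBD_FLAG_NO_MULTI_CONN:
        return 1
    return requested
```

The point being that the decision is made in userspace at negotiation time, with no kernel involvement.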
Re: [Nbd] [PATCH][V3] nbd: add multi-connection support
Wouter,

> On 6 Oct 2016, at 10:04, Wouter Verhelst <w...@uter.be> wrote:
>
> Hi Alex,
>
> On Tue, Oct 04, 2016 at 10:35:03AM +0100, Alex Bligh wrote:
>> Wouter,
>>> I see now that it should be closer
>>> to the former; a more useful definition is probably something along the
>>> following lines:
>>>
>>>   All write commands (that includes NBD_CMD_WRITE and NBD_CMD_TRIM)
>>>   for which a reply was received on the client side prior to the
>>
>> No, that's wrong as the server has no knowledge of whether the client
>> has actually received them so no way of knowing to which writes that
>> would reply.
>
> I realise that, but I don't think it's a problem.
>
> In the current situation, a client could opportunistically send a number
> of write requests immediately followed by a flush and hope for the best.
> However, in that case there is no guarantee that for the write requests
> that the client actually cares about to have hit the disk, a reply
> arrives on the client side before the flush reply arrives. If that
> doesn't happen, that would then mean the client would have to issue
> another flush request, probably at a performance hit.

Sure, but the client knows (currently) that any write request to which
it has a reply before it receives the reply to the flush request has
been written to disk. Such a client might simply note whether it has
issued any subsequent write requests.

> As I understand Christoph's explanations, currently the Linux kernel
> *doesn't* issue flush requests unless and until the necessary writes
> have already completed (i.e., the reply has been received and processed
> on the client side).

Sure, but it is not the only client.
> Given that, given the issue in the previous
> paragraph, and given the uncertainty introduced with multiple
> connections, I think it is reasonable to say that a client should just
> not assume a flush touches anything except for the writes for which it
> has already received a reply by the time the flush request is sent out.

OK. So you are proposing weakening the semantic for flush: it is only
guaranteed to cover those writes for which the client has actually
received a reply prior to sending the flush, as opposed to prior to
receiving the flush reply. This is based on the view that the Linux
kernel client wouldn't be affected, and that if other clients were
affected, their behaviour would be 'somewhat unusual'.

We do have one significant other client out there that uses flush,
which is Qemu. I think we should get a view on whether it would be
affected.

> Those are semantics that are actually useful and can be guaranteed in
> the face of multiple connections. Other semantics can not.

Well, there is another semantic which would work just fine, and which
also cures the other problem (synchronisation between channels): simply
that a flush is only guaranteed to affect writes issued on the same
channel. Then flush would do the natural thing, i.e. flush all the
writes that had been done *on that channel*.

> It is indeed impossible for a server to know what has been received by
> the client by the time it (the client) sent out the flush request.
> However, the server doesn't need that information, at all. The flush
> request's semantics do not say that any request not covered by the flush
> request itself MUST NOT have hit disk; instead, it just says that there
> is no guarantee on whether or not that is the case. That's fine; all a
> server needs to know is that when it receives a flush, it needs to
> fsync() or some such, and then send the reply. All a *client* needs to
> know is which requests have most definitely hit the disk.
> In my
> proposal, those are the requests that finished before the flush request
> was sent, and not the requests that finished between that and when the
> flush reply is received. Those are *likely* to also be covered
> (especially on single-connection NBD setups), but in my proposal,
> they're no longer *guaranteed* to be.

I think my objection was more that you were writing mandatory language
for a server's behaviour based on what the client perceives.

What you are saying from the client's point of view is that, under your
proposed change, it can only rely on writes for which the reply was
received prior to issuing the flush being persisted to disk (more might
be persisted, but the client can't rely on it). So far so good.

However, I don't think you can usefully make the guarantee weaker from
the SERVER'S point of view, because the server doesn't know how things
got reordered. IE it still needs to persist to disk any write that it
has completed when it processes the flush. Yes, the client doesn't get
the same guarantee, but the server can't know whether it can be slacker
about a particular write which it has performed.
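Under the weakened semantic, the set of writes a client may rely on is exactly those whose replies arrived before it sent the flush. A client could track that explicitly; the bookkeeping below is a sketch of my own, not any existing client's code:

```python
class FlushTracker:
    """Track which writes a flush is guaranteed to cover under the
    weakened semantic: only writes whose replies were received before
    the flush request was sent out."""

    def __init__(self):
        self.completed_writes = set()  # write handles with replies received
        self.pending_flushes = {}      # flush handle -> snapshot of the above

    def write_reply_received(self, handle):
        self.completed_writes.add(handle)

    def flush_sent(self, flush_handle):
        # Snapshot now: writes completing after this point get no guarantee.
        self.pending_flushes[flush_handle] = frozenset(self.completed_writes)

    def flush_reply_received(self, flush_handle):
        # Returns the set of writes now known to be on stable storage.
        return self.pending_flushes.pop(flush_handle)
```

A write whose reply races in between `flush_sent` and `flush_reply_received` is *likely* covered, but under the proposal no longer *guaranteed* to be, which is exactly what the snapshot captures.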
Re: [Nbd] [PATCH][V3] nbd: add multi-connection support
ose). So to the client it means that the server has persisted to disk
all those commands to which it has received a reply. However, to the
server, the 'MUST' condition needs to refer to the replies it sent, not
the replies the client receives. I think "before processing" here
really just means "before sending a reply to the NBD_CMD_FLUSH". I
believe the 'before processing' phrase came from the kernel wording.

> I see now that it should be closer
> to the former; a more useful definition is probably something along the
> following lines:
>
>    All write commands (that includes NBD_CMD_WRITE and NBD_CMD_TRIM)
>    for which a reply was received on the client side prior to the

No, that's wrong, as the server has no knowledge of whether the client
has actually received them, so no way of knowing to which writes that
would reply. It thus has to ensure it covers a potentially larger set
of writes - those for which a reply has been sent, as all those MIGHT
have been received by the client.

>    transmission of the NBD_CMD_FLUSH message MUST be written to

No, that's wrong, because the server has no idea when the client
transmitted the NBD_CMD_FLUSH message. It can only deal with events it
knows about. And the delimiter is (in essence) those write commands
that were replied to prior to the sending of the reply to the
NBD_CMD_FLUSH - of course it can flush others too.

>    non-volatile storage prior to replying to that NBD_CMD_FLUSH. A
>    server MAY process this command in ways that result in committing
>    more data to non-volatile storage than is strictly required.

I think the wording is basically right for the current semantic, but
here's a slight improvement:

    The server MUST NOT send a non-error reply to NBD_CMD_FLUSH until
    it has ensured that the contents of all writes which it has already
    completed (i.e. replied to) have been persisted to non-volatile
    storage.
However, given that the replies can subsequently be reordered, I now
think we do have a problem, as the client can't tell to which writes
those refer.

> [...]
>> I don't think there is actually a problem here - Wouter if I'm wrong
>> about this, I'd like to understand your argument better.
>
> No, I now see that there isn't, and I misunderstood things. However, I
> do think we should update the spec to clarify this.

Haha - I now think there is. You accidentally convinced me!

>> b) What I'm describing - which is the lack of synchronisation between
>> channels.
> [... long explanation snipped...]
>
> Yes, and I acknowledge that. However, I think that should not be a
> blocker. It's fine to mark this feature as experimental; it will not
> ever be required to use multiple connections to connect to a server.
>
> When this feature lands in nbd-client, I plan to ensure that the man
> page and -help output says something along the following lines:
>
>   use N connections to connect to the NBD server, improving performance
>   at the cost of a possible loss of reliability.

So in essence we are relying on (userspace) nbd-client not to open more
connections if it's unsafe? IE we can sort out all the negotiation of
whether it's safe or unsafe within userspace and not bother Josef about
it? I suppose that's fine in that we can at least shorten the CC: line,
but I still think it would be helpful if the protocol

>> Now, in the reference server, NBD_CMD_FLUSH is implemented through an
>> fdatasync().
>
> Actually, no, the reference server uses fsync() for reasons that I've
> forgotten (side note: you wrote it that way ;-)
>
> [...]

I vaguely remember why - something to do with files expanding when
holes were written to. However, I don't think that makes much
difference to the question I asked, or at most s/fdatasync/fsync/g

--
Alex Bligh
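The server-side rule suggested above - persist everything already replied to before acknowledging the flush - reduces to very simple control flow, since fsync() covers at least all completed writes (and possibly more, which the proposed wording explicitly permits). This is a sketch of the logic only, not the reference server's actual C code:

```python
import os

def handle_flush(export_fd, replied_writes):
    """On NBD_CMD_FLUSH: persist at least every write we have already
    replied to, then (and only then) send a non-error reply."""
    os.fsync(export_fd)     # covers all completed writes, possibly more
    replied_writes.clear()  # nothing already replied to is now unflushed
    return "NBD_CMD_FLUSH: ok"
```

The key property is the ordering: the fsync() happens strictly before the reply is sent, so the server never needs to know which replies the client has actually received.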
Re: [Nbd] [PATCH][V3] nbd: add multi-connection support
Josef,

> On 3 Oct 2016, at 15:32, Josef Bacik <jba...@fb.com> wrote:
>
> Ok I understand your objections now. You aren't arguing that we are unsafe
> by default, only that we are unsafe with servers that do something special
> beyond simply writing to a single disk or file. I agree this is problematic,
> but you simply don't use this feature if your server can't deal with it well.

Not quite. I am arguing that:

1. Parallel channels as currently implemented are inherently unsafe on
   some servers - I have given an example of such servers.

2. Parallel channels as currently implemented may be unsafe on the
   reference server on some or all platforms (depending on the
   behaviour of fdatasync(), which might vary between platforms).

3. Either the parallel channels stuff should issue flush requests on
   all channels, or the protocol should protect the unwitting user by
   negotiating an option to do so (or refusing multi-channel connects).
   It is not reasonable for an nbd client to have to know the intimate
   details of the server and its implementation of synchronisation
   primitives - saying 'the user should disable multiple channels' is
   not good enough, as this is what we have a protocol for.

"Unsafe" here means 'does not conform to the kernel's requirements for
flush semantics, whereas it did without multiple channels'.

So from my point of view, either we need to have an additional
connection flag ("don't use multiple channels unless you are going to
issue a flush on all channels") OR flush on all channels should be the
default. Whatever happens, SOME more work needs to be done on your
patch, because it currently doesn't issue flush on all channels, and it
currently does not have any way to prevent use of multiple channels.

I'd really like someone with Linux / POSIX knowledge to confirm the
Linux and POSIX position on fdatasync() of an fd opened by one process
on writes made to the same file by another process.
If on all POSIX platforms and all filing systems this is guaranteed to
flush the second process's writes, then I think we could argue "OK,
there may be some weird servers which might not support multiple
channels, and they just need a way of signalling that". If on the other
hand there is no such cross-platform guarantee, I think this means that
in essence, even with the reference server, this patch is unsafe, and
it needs adapting to send flushes on all channels - yes, it might
theoretically be possible to introduce IPC to the current server, but
you'd still need some way of tying together channels from a single
client.

--
Alex Bligh
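The pattern whose guarantees are in doubt can at least be written down concretely: write through one descriptor, then fdatasync() a second, independently opened descriptor on the same file. (That this happens to do the right thing on one platform proves nothing about POSIX in general, which is precisely the problem.)

```python
import os

def write_then_sync_other_fd(path, data):
    """Write via fd1; sync via fd2, a separately opened fd on the same
    file. Whether the sync is guaranteed to persist fd1's write on all
    POSIX platforms and filing systems is the open question here."""
    fd1 = os.open(path, os.O_WRONLY)
    fd2 = os.open(path, os.O_RDWR)
    # Fall back to fsync() where fdatasync() is unavailable.
    sync = getattr(os, "fdatasync", os.fsync)
    try:
        os.write(fd1, data)
        sync(fd2)  # the ambiguous call: does it cover fd1's write?
    finally:
        os.close(fd1)
        os.close(fd2)
```

Note that reading the data back afterwards only demonstrates visibility, not durability across a crash - the latter is what the flush semantics actually require, and what no userspace test can easily verify.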
Re: [Nbd] [PATCH][V3] nbd: add multi-connection support
"All write commands (that includes NBD_CMD_WRITE and NBD_CMD_TRIM)
***ASSOCIATED WITH ANY CLIENT AT ALL*** that the server completes (i.e.
replies to) prior to processing a NBD_CMD_FLUSH MUST be written to
non-volatile storage prior to replying to that NBD_CMD_FLUSH" - i.e. a
flush on any channel of any client must flush every channel of every
client, because we have no easy way to tell which clients are in fact
two channels. I have concerns over the scalability of this.

Now, in the reference server, NBD_CMD_FLUSH is implemented through an
fdatasync(). Each client (and therefore each channel) runs in a
different process.

Earlier in this thread, someone suggested that if this happens:

    Process A                Process B
    =========                =========
    fd1=open("file123")
                             fd2=open("file123")
    write(fd1, ...)
                             fdatasync(fd2)

then the fdatasync() is guaranteed to sync the write that Process A has
written. This may or may not be the case under Linux (wiser minds than
me will know). Is it guaranteed to be the case with (e.g.) the file on
NFS? On all POSIX platforms? Looking at
http://pubs.opengroup.org/onlinepubs/009695399/functions/fdatasync.html
I'd say it was a little ambiguous as to whether it will ALWAYS flush
all data associated with the file, even if it is being written by a
different process (and a different fd).

If fdatasync() is always guaranteed to flush data associated with
writes by a different process (and a separately opened fd), then as it
happens there won't be a problem on the reference server, just on
servers that don't happen to use fdatasync() or similar to implement
flushes, and which don't maintain their own caches. If fdatasync() is
not so guaranteed, we have a problem with the reference server too, at
least on some platforms and filing systems.

What I'm therefore asking for is either:

a) that the server can say 'if you are multichannel, you will need to
   send flush on each channel' (best); OR

b) that the server can say 'don't go multichannel' as part of the
   negotiation stage.
Of course, as this is dependent on the backend, this is going to be
something that is per-target (i.e. it needs to come as a transmission
flag or similar).

--
Alex Bligh
Re: [Nbd] [PATCH][V3] nbd: add multi-connection support
> On 29 Sep 2016, at 17:59, Josef Bacik <jba...@fb.com> wrote:
>
> On 09/29/2016 12:41 PM, Wouter Verhelst wrote:
>> On Thu, Sep 29, 2016 at 10:03:50AM -0400, Josef Bacik wrote:
>>> So think of it like normal disks with multiple channels. We don't send
>>> flushes down all the hwq's to make sure they are clear, we leave that
>>> decision up to the application (usually a FS of course).
>>
>> Well, when I asked earlier, Christoph said[1] that blk-mq assumes that
>> when a FLUSH is sent over one channel, and the reply comes in, that all
>> commands which have been received, regardless of which channel they were
>> received over, have reached disk.
>>
>> [1] Message-ID: <20160915122304.ga15...@infradead.org>
>>
>> It is impossible for nbd to make such a guarantee, due to head-of-line
>> blocking on TCP.
>
> Huh, I missed that. Yeah that's not possible for us for sure, I think my
> option idea is the less awful way forward if we want to address that
> limitation. Thanks,

I think if the server supports flush (which you can tell), sending
flush on all channels is the only safe thing to do without substantial
protocol changes (and I'm not sure how one would make such changes,
given flush is in a sense a synchronisation point). I think it's thus
imperative this gets fixed before the change gets merged.

--
Alex Bligh
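Concretely, "sending flush on all channels" means the client fans one block-layer flush out to every connection and completes it only once every reply is in. The channel interface here is hypothetical - this is not Josef's patch, just the shape of the fix:

```python
def flush_all_channels(channels):
    """Fan one logical flush out to every channel, then wait for every
    reply before completing the flush toward the block layer. A flush
    reply on one TCP connection says nothing about writes still in
    flight on another, hence all channels must be covered."""
    handles = [ch.send_flush() for ch in channels]
    # Wait on every channel (no short-circuiting): all must succeed
    # before the logical flush can be completed.
    replies = [ch.wait_reply(h) for ch, h in zip(channels, handles)]
    return all(replies)
```

The cost is one round trip per channel per flush, which is part of why a negotiated opt-out (or opt-in) at the protocol level matters.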
Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements
Wouter,

> On 15 Sep 2016, at 17:27, Wouter Verhelst <w...@uter.be> wrote:
>
> On Thu, Sep 15, 2016 at 05:08:21PM +0100, Alex Bligh wrote:
>> Wouter,
>>
>>> The server can always refuse to allow multiple connections.
>>
>> Sure, but it would be neater to warn the client of that at negotiation
>> stage (it would only be one flag, e.g. 'multiple connections unsafe').
>
> I suppose that's not a bad idea.

Good.

> [...]
>>> I was thinking of changing the spec as follows:
>>>
>>> diff --git a/doc/proto.md b/doc/proto.md
>>> index 217f57e..cb099e2 100644
>>> --- a/doc/proto.md
>>> +++ b/doc/proto.md
>>> @@ -308,6 +308,23 @@ specification, the
>>> [kernel
>>> documentation](https://www.kernel.org/doc/Documentation/block/writeback_cache_control.txt)
>>> may be useful.
>>>
>>> +For performance reasons, clients MAY open multiple connections to the
>>> +same server. To support such clients, servers SHOULD ensure that at
>>> +least one of the following conditions hold:
>>> +
>>> +* Flush commands are processed for ALL connections. That is, when an
>>> +  `NBD_CMD_WRITE` is processed on one connection, and then an
>>> +  `NBD_CMD_FLUSH` is processed on another connection, the data of the
>>> +  `NBD_CMD_WRITE` on the first connection MUST reach permanent storage
>>> +  before the reply of the `NBD_CMD_FLUSH` is sent.
>>> +* The server allows `NBD_CMD_WRITE` and `NBD_CMD_FLUSH` on at most one
>>> +  connection
>>> +* Multiple connections are not allowed
>>> +
>>> +In addition, clients using multiple connections SHOULD NOT send
>>> +`NBD_CMD_FLUSH` if an `NBD_CMD_WRITE` for which they care in relation to
>>> +the flush has not been replied to yet.
>>> +
>>
>> I don't think that should be a mandatory behaviour.
>
> Which part of it?

I've read it again :-) The wording was slightly contorted. I think what I mean is that if you don't support flush at all, that's another option. The final paragraph I am not sure is right, as that's not what the kernel currently does. If we are going to suggest a change in our main client's behaviour, should we not just request that flush is done on all channels?

>> For once, it would
>> be reasonably easy on gonbdserver but a PITA on the reference server.
>> You'd need to put in IPC between each of the forked processes OR rely
>> on fdatasync() only - I'm not sure that would necessarily work
>> safely with (e.g.) the 'treefiles' / COW options.
>>
>> I think better would be to say that the server MUST either
>>
>> * Not support NBD_CMD_FLUSH at all
>
> I think we should discourage not supporting FLUSH, rather than
> suggesting it.

Sure, but some backends just don't support flush. For them, this aspect at least is not a problem.

>> * Support NBD_CMD_FLUSH across channels (as you set out above), or
>> * Indicate that it does not support multiple channels.
>
> You dropped the one with no writes. I said "at most" there for a reason.
> Originally I was going to say "if the server is read-only", but then
> thought that it could work to do the "at most" thing. After having given
> that some more thought, I now realize that if you write, you need to
> flush across to other channels, regardless of whether they write too, so
> that bit of it is moot now anyway.
>
> Still, a server which exports read-only should still be safe for
> multiple connections, even if there is no cache coherency (since
> presumably nothing's going to change anyway).

Yes.

>> Actually I think this is a problem anyway. A simpler failure case is
>> one where (by chance) one channel gets the writes, and one channel
>> gets the flushes. The flush reply is delayed beyond the replies to
>> unconnected writes (on the other channel) and hence the kernel thinks
>> replied-to writes have been persisted when they have not been.
>
> Yes, that is another example of essentially the same problem.

Yeah, I was just trying to simplify things.

>> The only way to fix that (as far as I can see) without changing flush
>> semantics is for the block layer to issue flush requests on each
>> channel when flushing on one channel.
>
> Christoph just said that that doesn't (currently) happen; I don't know
> whether the kernel currently already (optimistically) sends out flush
> requests before the writes that it expects to hit permanent storage have
> finished, but if it doesn't do that, then there is no problem and my
> suggested bit of spec would be okay.
>
> If there are good reasons to do so, h
Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements
Wouter,

>> The server can always refuse to allow multiple connections.

Sure, but it would be neater to warn the client of that at negotiation stage (it would only be one flag, e.g. 'multiple connections unsafe'). That way the connection won't fail with a cryptic EBUSY or whatever, but will just negotiate a single connection.

> I was thinking of changing the spec as follows:
>
> diff --git a/doc/proto.md b/doc/proto.md
> index 217f57e..cb099e2 100644
> --- a/doc/proto.md
> +++ b/doc/proto.md
> @@ -308,6 +308,23 @@ specification, the
> [kernel
> documentation](https://www.kernel.org/doc/Documentation/block/writeback_cache_control.txt)
> may be useful.
>
> +For performance reasons, clients MAY open multiple connections to the
> +same server. To support such clients, servers SHOULD ensure that at
> +least one of the following conditions hold:
> +
> +* Flush commands are processed for ALL connections. That is, when an
> +  `NBD_CMD_WRITE` is processed on one connection, and then an
> +  `NBD_CMD_FLUSH` is processed on another connection, the data of the
> +  `NBD_CMD_WRITE` on the first connection MUST reach permanent storage
> +  before the reply of the `NBD_CMD_FLUSH` is sent.
> +* The server allows `NBD_CMD_WRITE` and `NBD_CMD_FLUSH` on at most one
> +  connection
> +* Multiple connections are not allowed
> +
> +In addition, clients using multiple connections SHOULD NOT send
> +`NBD_CMD_FLUSH` if an `NBD_CMD_WRITE` for which they care in relation to
> +the flush has not been replied to yet.
> +

I don't think that should be a mandatory behaviour. For once, it would be reasonably easy on gonbdserver but a PITA on the reference server. You'd need to put in IPC between each of the forked processes OR rely on fdatasync() only - I'm not sure that would necessarily work safely with (e.g.) the 'treefiles' / COW options.

I think better would be to say that the server MUST either

* Not support NBD_CMD_FLUSH at all,
* Support NBD_CMD_FLUSH across channels (as you set out above), or
* Indicate that it does not support multiple channels.

> The latter bit (on the client side) is because even if your backend has
> no cache coherency issues, TCP does not guarantee ordering between
> multiple connections. I don't know if the above is in line with what
> blk-mq does, but consider the following scenario:
>
> - A client sends two writes to the server, followed (immediately) by a
>   flush, where at least the second write and the flush are not sent over
>   the same connection.
> - The first write is a small one, and it is handled almost immediately.
> - The second write takes a little longer, so the flush is handled
>   earlier than the second write.
> - The network packet containing the flush reply gets lost for whatever
>   reason, so the client doesn't get it, and we fall into TCP
>   retransmits.
> - The second write finishes, and its reply header does not get lost.
> - After the second write reply reaches the client, the TCP retransmits
>   for the flush reply are handled.
>
> In the above scenario, the flush reply arrives on the client side after
> a write reply which it did not cover; so the client will (incorrectly)
> assume that the write has reached permanent storage when in fact it may
> not have done so yet.
>
> If the kernel does not care about the ordering of the two writes versus
> the flush, then there is no problem. I don't know how blk-mq works in
> that context, but if the above is a likely scenario, we may have to
> reconsider adding blk-mq to nbd.

Actually I think this is a problem anyway. A simpler failure case is one where (by chance) one channel gets the writes, and one channel gets the flushes. The flush reply is delayed beyond the replies to unconnected writes (on the other channel) and hence the kernel thinks replied-to writes have been persisted when they have not been.

The only way to fix that (as far as I can see) without changing flush semantics is for the block layer to issue flush requests on each channel when flushing on one channel. This, incidentally, would resolve every other issue!

-- 
Alex Bligh
Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements
Eric,

> I doubt that qemu-nbd would ever want to support the situation with more
> than one client connection writing to the same image at the same time;
> the implications of sorting out data consistency between multiple
> writers is rather complex and not worth coding into qemu. So I think
> qemu would probably prefer to just prohibit the multiple writer
> situation.

Yeah, I was thinking about a 'no multiple connection' flag.

> And while multiple readers with no writer should be fine,
> I'm not even sure if multiple readers plus one writer can always be made
> to appear sane (if there is no coordination between the different
> connections, on an image where the writer changes AA to BA then flushes
> then changes to BB, it is still feasible that a reader could see AB
> (pre-flush state of the first sector, post-flush changes to the second
> sector), even though the writer never flushed that particular content to
> disk).

Agree.

-- 
Alex Bligh
Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements
Paolo,

> On 15 Sep 2016, at 15:07, Paolo Bonzini <pbonz...@redhat.com> wrote:
>
> I don't think QEMU forbids multiple clients to the single server, and
> guarantees consistency as long as there is no overlap between writes and
> reads. These are the same guarantees you have for multiple commands on
> a single connection.
>
> In other words, from the POV of QEMU there's no difference whether
> multiple commands come from one or more connections.

This isn't really about ordering, it's about cache coherency and persisting things to disk. What you say is correct as far as it goes in terms of ordering. However, consider the scenario with reads and writes on two channels, as follows, of the same block:

  Channel 1                              Channel 2
  ---------                              ---------
  R: block read, and cached in user
     space in channel 1's cache
  Reply sent
                                         W: new value written; channel 2's
                                            cache updated, channel 1's not
  R: value returned from channel 1's
     cache

In the above scenario, there is a problem if the server(s) handling the two channels each use a read cache which is not coherent between the two channels. An example would be a read-through cache on a server that did fork() and shared no state between connections.

Similarly, if there is a write on channel 1 that has completed, and the flush goes to channel 2, it may not (if state is not shared) guarantee that the write on channel 1 (which has completed) is persisted to non-volatile media. Obviously if the 'state' is OS block cache/buffers/whatever, it will, but if it's (e.g.) a user-space per-process write-through cache, it won't.

I don't know whether qemu-nbd is likely to suffer from either of these.

-- 
Alex Bligh
Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements
Josef,

> On 15 Sep 2016, at 14:57, Josef Bacik <jba...@fb.com> wrote:
>
> This isn't an NBD problem, this is an application problem. The application
> must wait for all writes it cares about _before_ issuing a flush. This is
> the same as for normal storage as it is for NBD. It is not NBD's
> responsibility to maintain coherency between multiple requests across
> connections, just simply to act on and respond to requests.
>
> I think changing the specification to indicate that this is the case for
> multiple connections is a good thing, to keep NBD servers from doing weird
> things like sending different connections to the same export to different
> backing stores without some sort of synchronization. It should definitely be
> explicitly stated somewhere that NBD does not provide any ordering guarantees
> and that is up to the application. Thanks,

I don't think that's correct. The block stack issues a flush to mean (precisely) "do not reply to this until all preceding writes THAT HAVE BEEN REPLIED TO have been persisted to non-volatile storage".

The danger with multiple connections (where apparently only one flush is sent - let's say down connection 1) is that not all the writes that have been replied to on connection 2 have been persisted to non-volatile storage. Only the ones on connection 1 have been persisted (this is assuming the nbd server doesn't 'link' the connections in some way).

There's nothing the 'application' (here meaning the kernel or higher level) can do to mitigate this. Sure, it can wait for all the replies, but this doesn't guarantee the writes have been persisted to non-volatile storage, precisely because writes may return prior to this.

-- 
Alex Bligh
Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements
> On 15 Sep 2016, at 13:41, Christoph Hellwig <h...@infradead.org> wrote:
>
> On Thu, Sep 15, 2016 at 01:39:11PM +0100, Alex Bligh wrote:
>> That's probably right in the case of file-based back ends that
>> are running on a Linux OS. But gonbdserver for instance supports
>> (e.g.) Ceph based backends, where each connection might be talking
>> to a completely separate ceph node, and there may be no cache
>> consistency between connections.
>
> Yes, if you don't have a cache coherent backend you are generally
> screwed with a multiqueue protocol.

I wonder if the ability to support multiqueue should be visible in the negotiation stage. That would allow the client to refuse to select multiqueue where it isn't safe. Wouter?

-- 
Alex Bligh
Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements
> On 15 Sep 2016, at 13:36, Christoph Hellwig <h...@infradead.org> wrote:
>
> On Thu, Sep 15, 2016 at 01:33:20PM +0100, Alex Bligh wrote:
>> At an implementation level that is going to be a little difficult
>> for some NBD servers, e.g. ones that fork() a different process per
>> connection. There is in general no IPC to speak of between server
>> instances. Such servers would thus be unsafe with more than one
>> connection if FLUSH is in use.
>>
>> I believe such servers include the reference server, where there is a
>> process per connection (albeit possibly with several threads).
>>
>> Even single process servers (including mine - gonbdserver) would
>> require logic to pair up multiple connections to the same
>> device.
>
> Why? If you only send the completion after your I/O syscall returned
> you are fine if fsync comes from a different process, no matter
> if you're using direct or buffered I/O underneath.

That's probably right in the case of file-based back ends that are running on a Linux OS. But gonbdserver, for instance, supports (e.g.) Ceph based backends, where each connection might be talking to a completely separate Ceph node, and there may be no cache consistency between connections.

-- 
Alex Bligh
Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements
> On 15 Sep 2016, at 13:23, Christoph Hellwig <h...@infradead.org> wrote:
>
> On Thu, Sep 15, 2016 at 02:21:20PM +0200, Wouter Verhelst wrote:
>> Right. So do I understand you correctly that blk-mq currently doesn't
>> look at multiple queues, and just assumes that if a FLUSH is sent over
>> any one of the queues, it applies to all queues?
>
> Yes. The same is true at the protocol level for NVMe or SCSI transports
> that can make use of multiple queues.

At an implementation level that is going to be a little difficult for some NBD servers, e.g. ones that fork() a different process per connection. There is in general no IPC to speak of between server instances. Such servers would thus be unsafe with more than one connection if FLUSH is in use.

I believe such servers include the reference server, where there is a process per connection (albeit possibly with several threads). Even single process servers (including mine - gonbdserver) would require logic to pair up multiple connections to the same device. I suspect ALL nbd servers would require a change.

So I think we should think carefully before introducing this protocol change. It might be easier to make the client code change work around this issue (somehow).

-- 
Alex Bligh
Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements
> On 15 Sep 2016, at 13:18, Christoph Hellwig <h...@infradead.org> wrote:
>
> Yes, please do that. A "barrier" implies draining of the queue.

Done.

--
Alex Bligh
Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements
Christoph,

> It's not a write barrier - a write barrier was a command that ensured
> that
>
> a) all previous writes were completed to the host/client
> b) all previous writes were on non-volatile storage
>
> and
>
> c) the actual write with the barrier bit was on non-volatile storage

Ah! The bit you are complaining about is not the bit I pointed you to,
but:

> NBD_CMD_FLUSH (3)
>
> A flush request; a write barrier.

I can see that's potentially confusing, as it isn't meant to mean 'an
old-style Linux kernel block device write barrier'. I think in general
terms it probably is some form of barrier, but I see no problem in
deleting the words "a write barrier" from the spec text, if only to make
it clearer.

However, I think the description of the command itself:

> The server MUST NOT send a successful reply header for this request
> before all write requests for which a reply has already been sent to
> the client have reached permanent storage (using fsync() or similar).

and the ordering section I pointed you to before were both correct, yes?

>> The point still remains that "X was sent before Y" is difficult to
>> determine on the client side if X was sent over a different TCP channel
>> than Y, because a packet might be dropped (requiring a retransmit) for
>> X, and similar things. If blk-mq can deal with that, we're good and
>> nothing further needs to be done. If not, this should be evaluated by
>> someone more familiar with the internals of the kernel block subsystem
>> than me.
>
> The important bit in all the existing protocols, and which Linux relies
> on, is that any write the Linux block layer got a completion for earlier
> needs to be flushed out to non-volatile storage when a FLUSH command is
> sent. Anything that still is in flight does not matter. Which for
> NBD means anything that you have already replied to needs to be flushed.

... that's what it says (I hope).

> Or to say it more practically - in the nbd server you simply need to
> call fdatasync on the backing device or file whenever you get a FLUSH
> request, and it will do the right thing.

Actually fdatasync() technically does more than is necessary, as it will
also flush commands that have been processed but for which no reply has
yet been sent - but that's no bad thing.

--
Alex Bligh
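The fdatasync() approach described above can be sketched as follows. This is a minimal illustration, not code from any real NBD server; the function name `handle_flush` is invented for the example:

```c
#include <fcntl.h>
#include <unistd.h>

/* Sketch of the FLUSH handling described above: the server must not
 * reply to NBD_CMD_FLUSH until every write it has already replied to is
 * on stable storage. Calling fdatasync() on the backing file before
 * sending the reply satisfies that. It also happens to sync writes that
 * have been processed but not yet replied to - which is harmless. */
int handle_flush(int backing_fd)
{
    /* Persist all completed writes on the backing device or file. */
    if (fdatasync(backing_fd) != 0)
        return -1;            /* reply to the client with an error */
    return 0;                 /* now safe to send the success reply */
}
```

Note that this gives no ordering guarantee beyond the one the spec requires: writes still in flight when the flush arrives may or may not be persisted.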
Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements
> On 15 Sep 2016, at 12:52, Christoph Hellwig <h...@infradead.org> wrote:
>
> On Thu, Sep 15, 2016 at 12:46:07PM +0100, Alex Bligh wrote:
>> Essentially NBD does support FLUSH/FUA like this:
>>
>> https://www.kernel.org/doc/Documentation/block/writeback_cache_control.txt
>>
>> IE it supports the same FLUSH/FUA primitives as other block drivers
>> (AIUI).
>>
>> Link to protocol (per last email) here:
>>
>> https://github.com/yoe/nbd/blob/master/doc/proto.md#ordering-of-messages-and-writes
>
> Flush as defined by the Linux block layer (and supported that way in
> SCSI, ATA, NVMe) only requires flushing all already completed writes
> to non-volatile media. It does not impose any ordering, unlike the
> nbd spec.

As maintainer of the NBD spec, I'm confused as to why you think it
imposes any ordering - if you think this, clearly I need to clean up the
wording. Here's what it says:

> The server MAY process commands out of order, and MAY reply out of
> order, except that:
>
> • All write commands (that includes NBD_CMD_WRITE, and NBD_CMD_TRIM)
>   that the server completes (i.e. replies to) prior to processing a
>   NBD_CMD_FLUSH MUST be written to non-volatile storage prior to
>   replying to that NBD_CMD_FLUSH. This paragraph only applies if
>   NBD_FLAG_SEND_FLUSH is set within the transmission flags, as
>   otherwise NBD_CMD_FLUSH will never be sent by the client to the
>   server.

(and another bit re FUA that isn't relevant here). Here's the Linux
kernel documentation:

> The REQ_PREFLUSH flag can be OR ed into the r/w flags of a bio
> submitted from the filesystem and will make sure the volatile cache of
> the storage device has been flushed before the actual I/O operation is
> started. This explicitly guarantees that previously completed write
> requests are on non-volatile storage before the flagged bio starts. In
> addition the REQ_PREFLUSH flag can be set on an otherwise empty bio
> structure, which causes only an explicit cache flush without any
> dependent I/O. It is recommended to use the blkdev_issue_flush() helper
> for a pure cache flush.

I believe that NBD treats NBD_CMD_FLUSH the same as a REQ_PREFLUSH on an
empty bio. If you don't read those two as compatible, I'd like to
understand why not (i.e. what additional constraints one is applying
that the other is not), as they are meant to be the same (save that NBD
only has FLUSH as a command, i.e. the 'empty bio' version). I am happy
to improve the docs to make this clearer.

(sidenote: I am interested in the change from REQ_FLUSH to REQ_PREFLUSH,
but in an empty bio it's not really relevant I think)

> FUA as defined by the Linux block layer (and supported that way in
> SCSI, ATA, NVMe) only requires the write operation the FUA bit is set
> on to be on non-volatile media before completing the write operation.
> It does not impose any ordering, which seems to match the nbd spec.
> Unlike the NBD spec, Linux does not allow FUA to be set on anything by
> WRITE commands. Some other storage protocols allow a FUA bit on READ
> commands or other commands that write data to the device, though.

I think you mean "anything *but* WRITE commands". In NBD, setting FUA on
a command that does not write will do nothing, but FUA can be set on
NBD_CMD_TRIM and has the expected effect. Interestingly, the kernel docs
are silent on which commands REQ_FUA can be set on.

--
Alex Bligh
Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements
> On 15 Sep 2016, at 12:46, Christoph Hellwig <h...@infradead.org> wrote:
>
> On Thu, Sep 15, 2016 at 12:43:35PM +0100, Alex Bligh wrote:
>> Sure, it's at:
>>
>> https://github.com/yoe/nbd/blob/master/doc/proto.md#ordering-of-messages-and-writes
>>
>> and that link takes you to the specific section.
>>
>> The treatment of FLUSH and FUA is meant to mirror exactly the
>> Linux block layer (or rather how the Linux block layer was a few
>> years ago). I even asked on LKML to verify a few points.
>
> Linux never expected ordering on the wire. Before 2010 we had barriers
> in the kernel that provided ordering to the caller, but we never
> required it from the protocol / hardware.

Sure. And I think the doc section reflects exactly Linux's post-2010
expectations (note the link to the kernel documentation). IE servers are
not required to order reads/writes, save that all writes that the server
completes prior to replying to an NBD_CMD_FLUSH must be persisted prior
to the reply to that NBD_CMD_FLUSH - which I believe to be exactly the
same behaviour as the kernel dealing with an empty bio with REQ_FLUSH
set. Perhaps the section should be called "No ordering of messages and
writes"!

My point was that *in practice* disordering is not well tested, as *in
practice* many server implementations do in fact process each command in
order, though that's changed recently.

--
Alex Bligh
Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements
> On 15 Sep 2016, at 12:40, Christoph Hellwig <h...@infradead.org> wrote:
>
> On Thu, Sep 15, 2016 at 01:29:36PM +0200, Wouter Verhelst wrote:
>> Yes, and that is why I was asking about this. If the write barriers
>> are expected to be shared across connections, we have a problem. If,
>> however, they are not, then it doesn't matter that the commands may be
>> processed out of order.
>
> There is no such thing as a write barrier in the Linux kernel. We'd
> much prefer protocols not to introduce any pointless synchronization
> if we can avoid it.

I suspect the issue is terminological. Essentially NBD does support
FLUSH/FUA like this:

https://www.kernel.org/doc/Documentation/block/writeback_cache_control.txt

IE it supports the same FLUSH/FUA primitives as other block drivers
(AIUI). Link to protocol (per last email) here:

https://github.com/yoe/nbd/blob/master/doc/proto.md#ordering-of-messages-and-writes

--
Alex Bligh
Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements
Christoph,

> On 15 Sep 2016, at 12:38, Christoph Hellwig <h...@infradead.org> wrote:
>
> On Thu, Sep 15, 2016 at 12:49:35PM +0200, Wouter Verhelst wrote:
>> A while back, we spent quite some time defining the semantics of the
>> various commands in the face of the NBD_CMD_FLUSH and NBD_CMD_FLAG_FUA
>> write barriers. At the time, we decided that it would be unreasonable
>> to expect servers to make these write barriers effective across
>> different connections.
>
> Do you have a nbd protocol specification? treating a flush or fua
> as any sort of barrier is incredibly stupid. Is it really documented
> that way, and if yes, why?

Sure, it's at:

https://github.com/yoe/nbd/blob/master/doc/proto.md#ordering-of-messages-and-writes

and that link takes you to the specific section.

The treatment of FLUSH and FUA is meant to mirror exactly the Linux
block layer (or rather how the Linux block layer was a few years ago). I
even asked on LKML to verify a few points.

--
Alex Bligh
Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements
Wouter, Josef (& Eric),

> On 15 Sep 2016, at 11:49, Wouter Verhelst <w...@uter.be> wrote:
>
> Hi,
>
> On Fri, Sep 09, 2016 at 10:02:03PM +0200, Wouter Verhelst wrote:
>> I see some practical problems with this:
> [...]
>
> One more that I didn't think about earlier:
>
> A while back, we spent quite some time defining the semantics of the
> various commands in the face of the NBD_CMD_FLUSH and NBD_CMD_FLAG_FUA
> write barriers. At the time, we decided that it would be unreasonable
> to expect servers to make these write barriers effective across
> different connections.

Actually I wonder if there is a wider problem, in that implementations
might mediate access to a device by the presence of an extant TCP
connection, i.e. only permit one TCP connection to access a given block
device at once. If you think about (for instance) a forking daemon that
does writeback caching, that would be an entirely reasonable thing to do
for data consistency.

I also wonder whether any servers that can do caching per connection
will always share a consistent cache between connections. The one I'm
worried about in particular here is qemu-nbd - Eric Blake CC'd.

A more general point is that with multiple queues, requests may be
processed in a different order even by those servers that currently
process the requests in strict order, or in something similar to strict
order. The server is permitted by the spec (save as mandated by
NBD_CMD_FLUSH and NBD_CMD_FLAG_FUA) to process commands out of order
anyway, but I suspect this has to date been little tested.

Lastly, I confess to a lack of familiarity with the kernel-side code,
but how is NBD_CMD_DISCONNECT synchronised across each of the
connections? Presumably you need to send it on each channel, but cannot
assume the NBD connection as a whole is dead until the last TCP
connection has closed?

--
Alex Bligh
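The per-channel disconnect question above can be made concrete with a sketch. This is illustrative client-side code, not from the kernel driver; the function names are invented, but the 28-byte request header layout and the NBD_CMD_DISC value are from the NBD protocol spec:

```c
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>

#define NBD_REQUEST_MAGIC 0x25609513
#define NBD_CMD_DISC      2

/* Send a disconnect request (28-byte header, no payload) on one TCP
 * channel: magic(4), command flags(2), type(2), handle(8), offset(8),
 * length(4), all multi-byte fields in network byte order. */
int nbd_send_disc(int fd)
{
    unsigned char req[28];
    memset(req, 0, sizeof(req));
    uint32_t magic = htonl(NBD_REQUEST_MAGIC);
    uint16_t type  = htons(NBD_CMD_DISC);
    memcpy(req, &magic, 4);        /* magic */
    /* bytes 4-5: command flags, zero */
    memcpy(req + 6, &type, 2);     /* type  */
    /* bytes 8-27: handle, offset, length - all zero for disconnect */
    return write(fd, req, sizeof(req)) == (ssize_t)sizeof(req) ? 0 : -1;
}

/* A multi-connection client presumably has to do this on EVERY channel,
 * and only treat the device as down once all sockets have closed. */
int nbd_disconnect_all(int *fds, int n)
{
    int rc = 0;
    for (int i = 0; i < n; i++)
        if (nbd_send_disc(fds[i]) != 0)
            rc = -1;
    return rc;
}
```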
Re: [Nbd] [PATCH 2/2] nbd: Disallow ioctls on disconnected block device
On 16 Jul 2016, at 08:42, Pranay Srivastava <pran...@gmail.com> wrote:

> So instead can't we put a mechanism in place for network address + mac
> to be the same for allowing clients to reconnect? Do let me know if
> this is not of concern.

MAC address?! nbd clients connect over IP, and if a router reboots
between them, you could easily see two packets from the same client come
from different MAC addresses. Similarly, all clients not on the same L2
network will carry the same MAC address. So MAC address is a very poor
indicator of 'same client'.

IP address is also a poor indicator (think NAT), but is substantially
less bad.

--
Alex Bligh
Block layer - meaning of REQ_FUA on non-write requests
I am trying to clean up the documentation of the NBD protocol. NBD's
support for Force Unit Access (FUA) was modelled on the Linux kernel
block layer. When I added support a few years ago, I omitted to find out
exactly what types of request it applies to. Obviously it applies to
write requests, but how about others (e.g. read)?

https://www.kernel.org/doc/Documentation/block/writeback_cache_control.txt
seems somewhat ambiguous. It says:

> The REQ_FUA flag can be OR ed into the r/w flags of a bio submitted
> from the filesystem and will make sure that I/O completion for this
> request is only signaled after the data has been committed to
> non-volatile storage.

and

> For devices that also support the FUA bit the block layer needs
> to be told to pass through the REQ_FUA bit using:
>
> blk_queue_flush(sdkp->disk->queue, REQ_FLUSH | REQ_FUA);
>
> and the driver must handle write requests that have the REQ_FUA bit
> set in prep_fn/request_fn. If the FUA bit is not natively supported
> the block layer turns it into an empty REQ_FLUSH request after the
> actual write.

I can see four possible interpretations for reads:

1. REQ_FUA is not permitted on requests other than write, and it is an
   error to include it.

2. REQ_FUA does nothing on requests other than writes, and whilst
   permitted is ignored.

3. REQ_FUA has the following meaning on read: that area of data
   referred to by the read must be written to persistent storage if it
   is dirty.

and the following, which is not an interpretation of the above but is
logical:

4. REQ_FUA means that any in-memory copy of the data to be read must be
   discarded and the data reread from persistent storage (in case it
   has been written to by another client, for instance).

What meaning does REQ_FUA have for reads? Similarly, what meaning does
it have for other block layer requests (e.g. trim)?

--
Alex Bligh
Re: kernel panic in skb_copy_bits
--On 4 July 2013 03:12:10 -0700 Eric Dumazet wrote:

> It looks like a typical COW issue to me.
>
> If the page content is written while there is still a reference on this
> page, we should allocate a new page and copy the previous content.
>
> And this has little to do with networking.

I suspect this would get more attention if we could make Ian's case
below trigger (a) outside Xen, (b) outside networking.

    memset(buf, 0xaa, 4096);
    write(fd, buf, 4096);
    memset(buf, 0x55, 4096);

(where fd is O_DIRECT on NFS) can result in 0x55 being seen on the wire
in the TCP retransmit.

We know this should fail using O_DIRECT+NFS. We've had reports
suggesting it fails in O_DIRECT+iSCSI. However, that's been with a
kernel panic (under Xen) rather than data corruption as per the above.
Historical trawling suggests this is an issue with DRBD (see Ian's
original thread from the mists of time). I don't quite understand why we
aren't seeing corruption with standard ATA devices + O_DIRECT and no Xen
involved at all.

My memory is a bit misty on this, but I had thought the reason why this
would NOT be solved simply by O_DIRECT taking a reference to the page
was that the O_DIRECT I/O completed (and thus the reference would be
freed up) before the networking stack had actually finished with the
page. If the O_DIRECT I/O did not complete until the page was actually
finished with, we wouldn't see the problem in the first place. I may be
completely off base here.

--
Alex Bligh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
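Ian's three-line case can be wrapped into a self-contained sketch. The path and function name are invented for illustration; on a local filesystem this completes harmlessly - the hazard only arises when a remote backend (NFS, iSCSI, DRBD) may retransmit from the page after write() has already returned:

```c
#define _GNU_SOURCE            /* for O_DIRECT on glibc */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Sketch of the reproduce sequence quoted above: write a page of 0xaa,
 * then scribble 0x55 over the buffer immediately after write() returns.
 * If the network stack still holds a reference to the page for a later
 * retransmit, the 0x55 pattern can appear on the wire. Falls back to
 * buffered I/O where the filesystem rejects O_DIRECT (e.g. tmpfs). */
ssize_t write_then_scribble(const char *path)
{
    void *buf;
    /* O_DIRECT requires a block-aligned buffer. */
    if (posix_memalign(&buf, 4096, 4096) != 0)
        return -1;

    int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0600);
    if (fd < 0)
        fd = open(path, O_WRONLY | O_CREAT, 0600); /* buffered fallback */
    if (fd < 0) {
        free(buf);
        return -1;
    }

    memset(buf, 0xaa, 4096);
    ssize_t n = write(fd, buf, 4096);  /* I/O "completes" here... */
    memset(buf, 0x55, 4096);           /* ...but is the page still in use? */

    close(fd);
    free(buf);
    return n;
}
```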
Re: kernel panic in skb_copy_bits
Joe,

> Do you know if have a fix for above? so far we also suspected the grant
> page be unmapped earlier, we using 4.1 stable during our test.

A true fix? No, but I posted a patch set (see later email message for a link) that you could forward port. The workaround is:

> A workaround is to turn off O_DIRECT use by Xen as that ensures the
> pages are copied. Xen 4.3 does this by default. I believe fixes for
> this are in 4.3 and 4.2.2 if using the qemu upstream DM. Note these
> aren't real fixes, just a workaround of a kernel bug.

> The guest is pvm, and disk model is xvbd, guest config file as below:
> ...
> I think this only for pvhvm/hvm?

I don't have much experience outside pvhvm/hvm, but I believe it should work for any device. Testing was simple - just find all (*) the references to O_DIRECT in your device model and remove them!

(*)=you could be less lazy than me and find the right ones. I am guessing it will be the same ones though.

-- 
Alex Bligh
Re: kernel panic in skb_copy_bits
--On 30 June 2013 10:13:35 +0100 Alex Bligh wrote:

> The nature of the bug is extensively discussed in that thread - you'll
> also find a reference to a thread on linux-nfs which concludes it isn't
> an nfs problem, and even some patches to fix it in the kernel adding
> reference counting.

Some more links for anyone interested in fixing the kernel bug:

http://lists.xen.org/archives/html/xen-devel/2013-01/msg01618.html
http://www.spinics.net/lists/linux-nfs/msg34913.html
http://www.spinics.net/lists/netdev/msg224106.html

-- 
Alex Bligh
Re: kernel panic in skb_copy_bits
--On 28 June 2013 12:17:43 +0800 Joe Jin wrote:

> Find a similar issue http://www.gossamer-threads.com/lists/xen/devel/265611
> So copied to Xen developer as well.

I thought this sounded familiar. I haven't got the start of this thread, but what version of Xen are you running, and what device model?

If before 4.3, there is a page lifetime bug in the kernel (not the xen code) which can affect anything where the guest accesses the host's block stack and that in turn accesses the networking stack (it may in fact be wider than that). So, e.g., domU on iSCSI will do it. It tends to get triggered by a TCP retransmit or (on NFS) the RPC equivalent. Essentially the block operation is considered complete, returning through xen and freeing the grant table entry, and yet something in the kernel (e.g. a tcp retransmit) can still access the data. The nature of the bug is extensively discussed in that thread - you'll also find a reference to a thread on linux-nfs which concludes it isn't an nfs problem, and even some patches to fix it in the kernel adding reference counting.

A workaround is to turn off O_DIRECT use by Xen as that ensures the pages are copied. Xen 4.3 does this by default. I believe fixes for this are in 4.3 and 4.2.2 if using the qemu upstream DM. Note these aren't real fixes, just a workaround of a kernel bug.

To fix on a local build of xen you will need something like this:
https://github.com/abligh/qemu-upstream-4.2-testing/commit/9a97c011e1a682eed9bc7195a25349eaf23ff3f9
and something like this (NB: obviously insert your own git repo and commit numbers)
https://github.com/abligh/xen/commit/f5c344afac96ced8b980b9659fb3e81c4a0db5ca

Also note those fixes are (technically) unsafe for live migration unless there is an ordering change made in qemu's block open call.

Of course this might be something completely different.

-- 
Alex Bligh
Re: [PATCH 1/3] nbd: support FLUSH requests
On 13 Feb 2013, at 16:02, Paolo Bonzini wrote:

> If you do not have REQ_FUA, as is the case with this patch, the block
> layer converts it to a REQ_FLUSH *after* the write.

OK. Well, +/- it converting an fdatasync() into a full fsync(), that should be ok.

-- 
Alex Bligh
Re: [PATCH 1/3] nbd: support FLUSH requests
Paolo,

On 13 Feb 2013, at 13:00, Paolo Bonzini wrote:

> But as far as I can test with free servers, the FUA bits have no
> advantage over flush. Also, I wasn't sure if SEND_FUA without
> SEND_FLUSH is valid, and if so how to handle this combination (treat it
> as writethrough and add FUA to all requests? warn and do nothing?).

On the main opensource nbd client, the following applies:

What REQ_FUA does is an fdatasync() after the write. Code extract and comments below from Christoph Hellwig.

What REQ_FLUSH does is to do an fsync().

The way I read Christoph's comment, provided the linux block layer always issues a REQ_FLUSH before a REQ_FUA, there is no performance problem. However, a REQ_FUA is going to do an f(data)?sync AFTER the write, whereas the preceding REQ_FLUSH is going to do an fsync() BEFORE the write. It seems to me that either the FUA and FLUSH semantics are therefore different (and we need FUA), or that Christoph's comment is wrong and you are guaranteed a REQ_FLUSH *after* the write with REQ_FUA.

-- 
Alex Bligh

    } else if (fua) {
            /* This is where we would do the following
             * #ifdef USE_SYNC_FILE_RANGE
             * However, we don't, for the reasons set out below
             * by Christoph Hellwig
             *
             * [BEGINS]
             * fdatasync is equivalent to fsync except that it does not flush
             * non-essential metadata (basically just timestamps in practice),
             * but it does flush metadata required to find the data again, e.g.
             * allocation information and extent maps. sync_file_range does
             * nothing but flush out pagecache content - it means you basically
             * won't get your data back in case of a crash if you either:
             *
             * a) have a volatile write cache in your disk (e.g. any normal SATA disk)
             * b) are using a sparse file on a filesystem
             * c) are using a fallocate-preallocated file on a filesystem
             * d) use any file on a COW filesystem like btrfs
             *
             * e.g. it only does anything useful for you if you do not have a
             * volatile write cache, and either use a raw block device node, or
             * just overwrite an already fully allocated (and not preallocated)
             * file on a non-COW filesystem.
             * [ENDS]
             *
             * What we should do is open a second FD with O_DSYNC set, then
             * write to that when appropriate. However, with a Linux client,
             * every REQ_FUA immediately follows a REQ_FLUSH, so fdatasync
             * does not cause performance problems.
             *
             */
    #if 0
            sync_file_range(fhandle, foffset, len,
                            SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE |
                            SYNC_FILE_RANGE_WAIT_AFTER);
    #else
            fdatasync(fhandle);
    #endif
    }
Re: [PATCH 1/3] nbd: support FLUSH requests
On 12 Feb 2013, at 21:32, Andrew Morton wrote:

> Obviously the changelog was inadequate. Please send along a new one
> which fully describes the reasons for this change.

To be clear, I have no complaints about the rest of the patch being merged. Supporting FLUSH but not FUA is far better than supporting neither. I just didn't understand dropping FUA, given the semantics of nbd are in essence 'linux bios over tcp'.

-- 
Alex Bligh
Re: [PATCH 1/3] nbd: support FLUSH requests
Paolo,

On 12 Feb 2013, at 18:06, Paolo Bonzini wrote:

> Il 12/02/2013 18:37, Alex Bligh ha scritto:
>> For my education, why remove the FUA stuff?
>
> Because I had no way to test it.

I think my mods to the official NBD code support FUA (albeit not very efficiently).

>>>> Hmmm... the underlying storage could be md/dm RAIDs in which case FUA
>>>> should be cheaper than FLUSH.
>>>
>>> If someone ever wrote a virtio-blk backend that sits directly on top
>>> of the Linux block layer that would be true.
>>
>> In this case we don't know what the backend is sitting on top of
>> a-priori. It might be the current nbd server code, but it might
>> not be.
>
> Do you know of any other NBD server than the "official" one and qemu-nbd?

Yes. I know one well (but it's not open source). NBD seems to be the 'goto protocol' for writing distributed block store drivers in user space.

-- 
Alex Bligh
Re: [PATCH 1/3] nbd: support FLUSH requests
Paolo,

> Add support for a new flag that the server can pass. If the flag is
> enabled, we translate REQ_FLUSH requests into the NBD_CMD_FLUSH
> command.
>
> Cc: 
> Cc: Paul Clements 
> Cc: Andrew Morton 
> Signed-off-by: Alex Bligh 
> [ Removed FUA support for reasons similar to those outlined in
>   https://lkml.org/lkml/2010/8/17/234 for virtio - Paolo ]
> Signed-off-by: Paolo Bonzini 

For my education, why remove the FUA stuff? The link you posted was for virtio drivers, and says:

>> Hmmm... the underlying storage could be md/dm RAIDs in which case FUA
>> should be cheaper than FLUSH.
>
> If someone ever wrote a virtio-blk backend that sits directly on top
> of the Linux block layer that would be true.

In this case we don't know what the backend is sitting on top of a-priori. It might be the current nbd server code, but it might not be.

-- 
Alex Bligh
Re: Local DoS through write heavy I/O on CFQ & Deadline
--On 13 October 2012 21:53:09 +0800 Hillf Danton wrote:

> Take a look at the "wait for writeback" problem please.
>
> Linux 3.0+ Disk performance problem - wrong pdflush behaviour
> https://lkml.org/lkml/2012/10/10/412

I'm guessing that's related, but may not be the whole story. My test case is rather simpler, and Viktor says that with the patch causing his regression reverted, "After I've set the dirty_bytes over the file size the writes are never blocked."

That suggests to me that in order to avoid write blocking he needs dirty_bytes larger than the file size. As the bytes written in my test case exceed RAM, that's going to be an issue, as dirty_bytes is always going to be hit; I think in Viktor's case he is trying to avoid it being hit at all. Or perhaps I have the wrong end of the stick.

-- 
Alex Bligh
Re: Local DoS through write heavy I/O on CFQ & Deadline
Michal,

--On 12 October 2012 16:58:39 +0200 Michal Hocko wrote:

> Once dirty_ratio (resp. dirty_bytes) limit is hit then the process
> which writes gets throttled. If this is not the case then there is a
> bug in the throttling code.

I believe that is the problem.

>> Isn't the only thing that is going to change that it ends up
>> triggering the writeback earlier?
>
> Set the limit lowe?

I think you mean 'lower'. If I do that, what I think will happen is that it will start the write-back earlier, but the writeback once started will not keep up with the generation of data, possibly because the throttling isn't going to work. Note that for instance using ionice to set priority or class to 'idle' has no effect. So, to test my hypothesis ...

>> Happy to test etc - what would you suggest, dirty_ratio=5,
>> dirty_background_ratio=2 ?
>
> These are measured in percentage. On the other hand if you use
> dirty_bytes resp. dirty_background_bytes then you get absolute numbers
> independent on the amount of memory.

... what would you suggest I set any of these to in order to test (assuming the same box), so that it's 'low enough' that if it still hangs, it's a bug, rather than simply 'not low enough'? It's an 8G box, and clearly I'm happy to set either the _ratio or _bytes entries.

-- 
Alex Bligh
Re: Local DoS through write heavy I/O on CFQ & Deadline
--On 12 October 2012 15:30:45 +0200 Michal Hocko wrote:

>> Full info, including logs and scripts can be found at:
>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1064521
>
> You seem to have 8G of RAM and dirty_ratio=20 resp.
> dirty_background_ratio=10 which means that 1.5G worth of dirty data
> until writer gets throttled which is a lot. Background writeback
> starts at 800M which is probably not sufficient as well. Have you
> tried to set dirty_bytes at a reasonable value (wrt. to your storage)?

This is for an appliance install where we have no idea how much memory the box has in advance other than 'at least 4G', so it is difficult to tune by default.

However, I don't think that would solve the problem, as the zcat/dd can always generate data faster than it can be written to disk unless or until it is throttled, which it never is. Isn't the only thing that is going to change that it ends up triggering the writeback earlier?

Happy to test etc - what would you suggest, dirty_ratio=5, dirty_background_ratio=2 ?

-- 
Alex Bligh
Re: Local DoS through write heavy I/O on CFQ & Deadline
Alan,

--On 11 October 2012 14:46:34 +0100 Alan Cox wrote:

>> We have reproduced this on multiple hardware environments, using 3.2
>> (/proc/version_signature gives "Ubuntu 3.2.0-29.46-generic 3.2.24").
>> Anecdotally we believe the situation has worsened since 3.0.
>
> I've certainly seen this on 3.0 and 3.2, but do you still see it on
> 3.5/6 ?

We've just tested this. We see exactly the same issue on 3.6.1 using the current build of the Ubuntu Quantal kernel, which is:

Linux version 3.6.1-030601-generic (apw@gomeisa) (gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5) ) #201210071322 SMP Sun Oct 7 17:23:28 UTC 2012

More details (including full logs for that kernel) at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1064521

-- 
Alex Bligh
Local DoS through write heavy I/O on CFQ & Deadline
We have noticed significant I/O scheduling issues on both the CFQ and the deadline scheduler, where a non-root user can starve any other process of any I/O for minutes at a time. The problem is more serious using CFQ, but is still an effective local DoS vector using Deadline.

A simple way to generate the problem is:

    dd if=/dev/zero of=- bs=1M count=5 | dd if=- of=myfile bs=1M count=5

(note use of 2 dd's is to avoid alleged optimisation of the writing dd from /dev/zero). zcat-ing a large file with stdout redirected to a file produces a similar error. Using ionice to set idle priority makes no difference.

To instrument the problem we produced a python script which does a MySQL select and update every 10 seconds, and times the execution of the update. This is normally milliseconds, but under user-generated load conditions we can take this to indefinite (on CFQ) and over a minute (on deadline). Postgres is affected in a similar manner (i.e. it is not MySQL specific). Simultaneously we have captured the output of 'vmstat 1 2' and /proc/meminfo, with appropriate timestamps.

We have reproduced this on multiple hardware environments, using 3.2 (/proc/version_signature gives "Ubuntu 3.2.0-29.46-generic 3.2.24"). Anecdotally we believe the situation has worsened since 3.0.

We believe the problem is that dirty page writeout is starving the system of any other I/O, and whilst the process concerned should be penalised to allow other processes I/O time, this is not happening.

Full info, including logs and scripts can be found at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1064521

I believe this represents a local DoS vector, as an unprivileged user can effectively stall any root-owned process that is performing I/O.

-- 
Alex Bligh
Re: Controversy over dynamic linking -- how to end the panic (long)
rights of distribution of the GPL'd code if you distribute stuff yourself. This is set out clearly in the preamble.

Areas of doubt:

A. Are all contributed patches / files /always/ under a license no more restrictive than 'COPYING' - including Linus' preamble?

B. Is it clear that inclusion of kernel headers when compiling a program does not make that program a derived work of the kernel?

Implications:

A. It would be worth Linus publicising the fact (or some URL being added to Linux kernel postings) that patches by default will be assumed to be under the 'COPYING' license.

B. It would be worth someone clearing up the status of the license on header files.

C. It would be preferable if people read the COPYING file before commenting on license issues.

-- Alex Bligh
RE: LANANA: To Pending Device Number Registrants
> I believe that's why there are persistent superblocks on the RAID
> partitions. You can switch them around, and it still knows which drive
> holds which RAID partition... That's the only way booting off RAID
> works, and the only reason for the "RAID Autodetect" partition type...
> you can find those shuffled partitions correctly. The only time it
> really looks at the file is if you try to rebuild the partition, I
> believe... and some other circumstance that doesn't come to mind.

OK. I obviously asked slightly the wrong question, as I was talking about a theoretical case and chose it badly. What I meant was: wouldn't it be better to use some form of static identifier / volume serial / MAC address / whatever, which /dev/[device-type]/[ID] exactly identifies, and continues to identify, even across config changes, than to (a) go look for these IDs, then (b) sequence them as disk0..N, so that if one is removed, all the rest shuffle down?

I.e., isn't the point being argued here 'don't identify devices by major/minor device number pairs chosen on a rather arbitrary and wasteful basis which describes how the given device is physically connected', and not 'the devices shall be dynamically ordered 0..N and rely on the order of detection, and the number of other devices, to determine their numbering'?

As has been pointed out elsewhere, if you expose /dev/disk/0/serial-number and/or /dev/disk/0/interface/scsi/ or whatever, in a format trivially readable by a script, the problem largely goes away (+/- a linear search).
However, if you don't at least try to keep (say) disk names static across a reboot with a trivial reconfig change (like inserting a PCMCIA flash card), I can see people are going to need to rewrite stuff like mount to look at (say) an fstab where the first field is not /dev/hda0 but something which identifies a disk beyond /dev/disk/0, and actually go and look at the tree under that for some form of serial number (or whatever), or mount -a will produce surprising results (for instance, there's no reason a PCMCIA flash card might not be detected before a PCMCIA HD). I understood 2.4 was meant to be pretty static in terms of external interface.

-- Alex Bligh
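The lookup a rewritten mount would have to perform — resolving a config entry by a static identifier rather than by positional /dev name — can be sketched as a toy linear search over a /dev/disk/N-style tree. The serial numbers and tree layout here are invented purely for illustration.

```python
def find_device_by_serial(device_tree, wanted_serial):
    """Linear search over a /dev/disk/N-style tree, matching on a
    config-invariant identifier instead of the positional index."""
    for index, attrs in device_tree.items():
        if attrs.get("serial-number") == wanted_serial:
            return index
    return None

# Positional numbering shuffles when a device is inserted or removed,
# but the serial-number lookup still finds the same physical disk.
before = {0: {"serial-number": "HD-1234"}, 1: {"serial-number": "FLASH-9"}}
after = {0: {"serial-number": "FLASH-9"}, 1: {"serial-number": "HD-1234"}}
print(find_device_by_serial(before, "HD-1234"))  # 0
print(find_device_by_serial(after, "HD-1234"))   # 1
```

The cost is exactly the "+/- a linear search" conceded above; the benefit is that mount -a no longer depends on detection order.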
Re: LANANA: To Pending Device Number Registrants
> The argument that "if you use numbering based on where in the SCSI chain
> the disk is, disks don't pop in and out" is absolute crap. It's not true
> even for SCSI any more (there are devices that will acquire their location
> dynamically), and it has never been true anywhere else. Give it up.

Q: Let us assume you have dynamic numbering disk0..N as you suggest, and you have some s/w RAID of SCSI disks. A disk fails and is (hot) removed. Life continues. You reboot the machine. Disks are now numbered disk0..(N-1). If the RAID config specifies its members as disk0..N, it is going to be very confused, as stripes will appear in the wrong place.

Doesn't that mean the file specifying the RAID config is going to have to enumerate SCSI IDs (or something configuration-invariant) rather than use the disk0..N numbering anyway? Sure, it can interrogate each disk0..N to see which has the ID it actually wanted, but doesn't this rather subvert the purpose?

I.e., given one could create /dev/disk/?.+, isn't the important argument that they share common major device numbers etc., not whether they linearly reorder precisely to 0..N as opposed to having some form of identifier guaranteed to be static across reboot & config change?

-- Alex Bligh
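The failure mode argued above can be sketched in a few lines: a RAID config recorded positionally mis-assembles after a disk is removed and the survivors renumber, while one recorded by a configuration-invariant ID merely reports the member as missing. The disk names and SCSI IDs here are invented for illustration.

```python
def resolve(members, available):
    """Map each configured RAID member to a disk, or None if absent."""
    return [available.get(m) for m in members]

# Config written when all three disks (A, B, C) were present:
positional_members = [0, 1, 2]
id_members = ["scsi-0", "scsi-1", "scsi-2"]

# Disk B (scsi-1, position 1) fails and is removed; on reboot the
# survivors renumber to positions 0..1 but keep their SCSI IDs.
by_position = {0: "A", 1: "C"}
by_id = {"scsi-0": "A", "scsi-2": "C"}

# Positional resolution silently puts C where B's stripes belong;
# ID-based resolution degrades predictably, with B known to be absent.
print(resolve(positional_members, by_position))  # ['A', 'C', None]
print(resolve(id_members, by_id))                # ['A', None, 'C']
```

The positional result is the "stripes appear in the wrong place" confusion; the ID-based result is what lets the array run degraded but correct.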
Re: [PATCH] allocation looping + kswapd CPU cycles
> The real fix is to measure fragmentation and the progress of kswapd, but
> that is too drastic for 2.4.x.

I suspect the real fix might, in general, be:

a) to reduce use of kmalloc() etc., which gives physically contiguous memory, where virtually contiguous memory will do (and is, presumably, far easier to come by) - or perhaps to add some flag to kmalloc to allocate out of virtual rather than physical memory;

b) to bias flush or swap-out routines to create physically contiguous higher-order blocks. Many heuristics will give you that ability.

Disclaimer: I haven't looked at this issue for years, but Linux seems to fail on >4k allocations now, and fragment memory far more, than it did on much smaller systems doing lots of nasty (8k, thus 3 pages including header) NFS stuff back in '94.

-- Alex Bligh
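Why virtually contiguous memory is "far easier to come by" can be shown with a toy model: plenty of pages may be free, yet no physically contiguous run is long enough for a higher-order allocation. This only simulates the effect; the real kernel's buddy allocator and writeout paths are far more involved.

```python
def largest_contiguous_run(free_pages):
    """Length of the longest run of adjacent free page frame numbers."""
    best = run = 0
    prev = None
    for page in sorted(free_pages):
        run = run + 1 if prev is not None and page == prev + 1 else 1
        best = max(best, run)
        prev = page
    return best

# 8 pages of memory; every other page is pinned by a long-lived allocation.
free_pages = {0, 2, 4, 6}
print(len(free_pages))                     # 4 pages free (half of memory)...
print(largest_contiguous_run(free_pages))  # ...but longest run is 1, so a
# 2-page physically contiguous (kmalloc-style) request fails even though a
# virtually contiguous (vmalloc-style) one of 4 pages would succeed.
```

Point b) above amounts to steering writeout so that freed pages land adjacent to existing free pages, lengthening these runs.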
Re: ioctl arg passing
> And there is very low chance that kmalloc() for
> anything bigger than 4KB will succeed. You should either use
> vmalloc unconditionally, or at least as fallback.

The phrase 'very low chance' is inaccurate. How do you think NFS works with -rsize, -wsize > 4096? kmalloc() uses get_free_pages() to allocate more than one physically contiguous memory page, which in turn uses a sort of modified buddy system to distribute them. And if you specify the right GFP_ flags, it will page out, if necessary, to do so. (I know this the hard way, as it is the one substantial thing I fixed in the Linux kernel just after 1.0, in about '95.)

If you need physically (as opposed to virtually) contiguous memory, unless lots has changed since then, kmalloc() is the right call. However, you are correct that it draws on scarce resources.

-- Alex Bligh
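The "sort of modified buddy system" behind get_free_pages() can be sketched minimally: free blocks are kept per order (a block of order n is 2**n pages), and a request is served by splitting a larger free block when no block of the requested order exists. This is a deliberately simplified illustration, not the kernel's implementation (which keeps per-zone free lists and coalesces buddies on free).

```python
def alloc(free_lists, order):
    """Allocate a 2**order-page block from per-order free lists,
    splitting a larger block when no block of that order is free."""
    for o in range(order, len(free_lists)):
        if free_lists[o]:
            addr = free_lists[o].pop()
            # Hand the surplus halves back to the lower free lists.
            while o > order:
                o -= 1
                free_lists[o].append(addr + 2 ** o)
            return addr
    return None  # no physically contiguous block large enough

# One free order-3 block (8 pages) starting at page 0.
free_lists = [[], [], [], [0]]
print(alloc(free_lists, 1))  # 0 -- carved out of the order-3 block
print(free_lists)            # [[], [2], [4], []] -- the leftover buddies
```

The failure case is when every free list at or above the requested order is empty, which is exactly the fragmentation situation kmalloc() can hit and vmalloc() sidesteps.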
Re: Sources of entropy - /dev/random problem for network servers
Jeff et al.,

>> However, only 3 drivers in drivers/net actually set
>> SA_SAMPLE_RANDOM when calling request_irq(). I believe
>> all of them should.
>
> No, because an attacker can potentially control input and make it
> non-random.

Point taken, but:

1. In that case, if we follow your logic, no drivers (rather than 3 arbitrary ones) should set SA_SAMPLE_RANDOM.

2. Given that otherwise, in at least my application (and a machine without keyboard and mouse can't be too uncommon), there is *no* entropy at all, which is rather easier for an attacker. At least with entropy from an (albeit predictable) network, his attack is going to shift the state of the seed (in an albeit predictable manner), though he's going to have to make some accurate guesses each time about the forwarding delay of the router in front of the server, and the stream of other packets hitting it. I'd rather rely on this than rely on cron (which is effectively what is driving any disk entropy every few minutes, and is extremely predictable).

-- Alex Bligh
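The hedge in point 2 — that even mostly-predictable packet timings perturb the pool state as long as some jitter escapes the attacker — can be sketched with a toy hash-based mixer. This is an illustration only, not the kernel's actual timer-randomness path; the sample values stand in for inter-arrival timestamps and are invented.

```python
import hashlib

def mix(pool, sample):
    """Fold one timing sample into the pool by hashing. Any jitter the
    attacker cannot fully predict shifts the whole resulting state."""
    return hashlib.sha256(pool + sample.to_bytes(8, "little")).digest()

pool = b"\x00" * 32
# Two packet streams identical except for one microsecond of
# router-forwarding jitter in the second inter-arrival time.
a = b = pool
for sample in (1000, 1500, 2000):
    a = mix(a, sample)
for sample in (1000, 1501, 2000):
    b = mix(b, sample)
print(a != b)  # True: the single unpredicted microsecond diverges the state
```

This is why an attacker who controls most, but not all, of the timing still has to guess every forwarding delay exactly to track the pool.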
Sources of entropy - /dev/random problem for network servers
In debugging why my (unloaded) IMAP server takes many seconds to open folders, I discovered what looks like a problem in 2.4's feeding of entropy into /dev/random. When there is insufficient entropy in the random number generator, reading from /dev/random blocks for several seconds. /dev/random is used (correctly) for cryptographic key verification.

Entropy comes, it seems, from 4 sources: keyboard, mouse, disk I/O and IRQs. The machine in question is locked in a data center (it can't be the only one) and thus sees none of the first two. IDE entropy comes from executed IDE commands, and the disk is physically largely inactive due to caching. But there's plenty of network traffic, which should generate IRQs.

However, only 3 drivers in drivers/net actually set SA_SAMPLE_RANDOM when calling request_irq(). I believe all of them should. And indeed this fixed the problem for me using an eepro100. The following patch fixes eepro100.c - others can be patched similarly.

-- Alex Bligh

/usr/src/linux# diff -C3 drivers/net/eepro100.c{.keep,}
*** drivers/net/eepro100.c.keep	Tue Feb 13 21:15:05 2001
--- drivers/net/eepro100.c	Sun Apr 8 22:17:00 2001
***************
*** 923,929 ****
  	sp->in_interrupt = 0;
  	/* .. we can safely take handler calls during init. */
! 	retval = request_irq(dev->irq, speedo_interrupt, SA_SHIRQ, dev->name, dev);
  	if (retval) {
  		MOD_DEC_USE_COUNT;
  		return retval;
--- 923,929 ----
  	sp->in_interrupt = 0;
  	/* .. we can safely take handler calls during init. */
! 	retval = request_irq(dev->irq, speedo_interrupt, SA_SHIRQ | SA_SAMPLE_RANDOM, dev->name, dev);
  	if (retval) {
  		MOD_DEC_USE_COUNT;
  		return retval;

[ENDS]