Re: [RFC, PATCH] Extensible AIO interface
On Thu, Oct 04, 2012 at 06:58:06AM +0900, Tejun Heo wrote: > Hello, Kent. > > On Tue, Oct 02, 2012 at 08:00:20PM -0700, Kent Overstreet wrote: > > > However, I don't think it's a good idea to try to implement something > > > which is a neutral transport of opaque data between userland and lower > > > layers. Things like that sound attractive with unlimited > > > possibilities but reality seems to have the tendancy to make a big > > > mess out of setups like that. > > > > I don't see how the "neutral transport of opaque data" itself is a bad > > thing. We want something simple and sane to build actual interfaces on > > top of - once we've got that, we can either build clean generic well > > defined interfaces or we can make a mess like with ioctls :P > > > > It's like any other mechanism. There's good syscalls and bad syscalls... > > Depending on what a feature aims for, the design and implementation > vary greatly. If you go for completely generic extensible stuff which > can be used to warp space-time continuum, it's easy to end up with a > monstrosity with generic and programmable parser, verifier, accessor > and so on. I don't think that's concrete enough that I can comment - I think this is becoming too abstract. You didn't have any complaints when I showed you the code I posted, I don't plan on making it really any more complicated than that - I think we do need explicit return values but honestly that makes it less generic. > > Say we implement an attr to control a block layer cache. That attr could > > be parsed/validated in high level code (if there's any to do) - that I > > don't object to. But the high level code isn't going to /know/ whether > > there was any block cache in the stack that handled the attr. If the > > attr is passed down to the block cache, that block cache can return that > > it was handled. > > My point is that if it doesn't fit the generic abstract model as in > fadvise(2), it probably isn't worth supporting in any generic manner. How so? Do you mean the file range part? I think that's orthogonal to the rest (the hints fadvise specifies could be used per IO or with a file range like they are now), but the hints themselves are inadequate for SSD caches. > > > It's okay to allow some side channel thing for specific hacky uses but > > > I really hope the general design were focused around properly > > > abstracted attributes which can be understood and handled by the upper > > > layer. > > > > Completely agreed. I want to leave that side channel open for > > experimentation, and so we have a way of implementing one off hacky > > stuff when we need to - but normal mainline stuff should be sane and > > well designed. > > So, I think we can aim for something simple and modest (the only thing > I can think of at the moment is task association) and provide simple > framework which can be used for specific custom usages. Let's please > not go overboard with generic parser / verifier which supports pointer > indirection or whatnot. I wasn't seriously proposing implementing a generic parser/verifier - certainly not just for this, that was idle musing; all I'm saying is that when an attr needs parsing/verification, that should be done in the attr code, not driver code. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC, PATCH] Extensible AIO interface
On Wed, Oct 03, 2012 at 03:15:26PM -0400, Jeff Moyer wrote: > Kent Overstreet writes: > > > On Tue, Oct 02, 2012 at 01:41:17PM -0400, Jeff Moyer wrote: > >> Kent Overstreet writes: > >> > >> > So, I and other people keep running into things where we really need to > >> > add an interface to pass some auxiliary... stuff along with a pread() or > >> > pwrite(). > >> > > >> > A few examples: > >> > > >> > * IO scheduler hints. Some userspace program wants to, per IO, specify > >> > either priorities or a cgroup - by specifying a cgroup you can have a > >> > fileserver in userspace that makes use of cfq's per cgroup bandwidth > >> > quotas. > >> > >> You can do this today by splitting I/O between processes and placing > >> those processes in different cgroups. For io priority, there is > >> ioprio_set, which incurs an extra system call, but can be used. Not > >> elegant, but possible. > > > > Yes - those are things I'm trying to replace. Doing it that way is a > > real pain, both as it's a lousy interface for this and it does impact > > performance (ioprio_set doesn't really work too well with aio, too). > > ioprio_set works fine with aio, since the I/O is issued in the caller's > context. Perhaps you're thinking of writeback I/O? Until you want to issue different IOs with different priorities... > >> > * Cache hints. For bcache and other things, userspace may want to specify > >> > "this data should be cached", "this data should bypass the cache", etc. > >> > >> Please explain how you will differentiate this from posix_fadvise. > > > > Oh sorry, I think about SSD caching so much I forget to say that's what > > I'm talking about. posix_fadvise is for the page cache, we want > > something different for an SSD cache (IMO it'd be really ugly to use it > > for both, and posix_fadvise() can't really specifify everything we'd > > want to for an SSD cache). > > DESCRIPTION >Programs can use posix_fadvise() to announce an intention to >access file data in a specific pattern in the future, thus >allowing the kernel to perform appropriate optimizations. > > That description seems broad enough to include disk caches as well. You > haven't exactly stated what's missing. It _could_ work for SSD caches, but that doesn't mean you'd want it to - it doesn't have any way of specifying which cache you want the hint to apply to, and there are certainly circumstances under which you _wouldn't_ want it to apply to both. And making it apply to SSD caches would be silently changing the behavior, and also like I mentioned it's not sufficient for SSD caches. > >> > Hence, AIO attributes. > >> > >> *No.* Start with the non-AIO case first. > > > > Why? It is orthogonal to AIO (and I should make that clearer), but to do > > it for sync IO we'd need new syscalls that take an extra argument so IMO > > it's a bit easier to start with AIO. > > > > Might be worth implementing the sync interface sooner rather than later > > just to discover any potential issues, I suppose. > > Looking back to preadv and pwritev, it was wrong to only add them to > libaio (and that later got corrected). I'd just like to see things > start out with the sync interfaces, since you'll get more eyes on the > code (not everyone cares about aio) and that helps to work out any > interface issues. I agree we don't want to leave out sync versions, but honestly this stuff is more useful with AIO and that's the easier place to start. > > It's not possible in general - consider stacking block devices, and > > attrs that are supported only by specific block drivers. I.e. if you've > > got lvm on top of bcache or bcache on top of md, we can pass the attr > > down with the IO but we can't determine ahead of time, in general, where > > the IO is going to go. > > If the io stack is static (meaning you setup a device once, then open it > and do io to it, and it doesn't change while you're doing io), how would > you not know where the IO is going to go? With something like dm, md or bcache - you've got multiple underlying devices, and of those underlying devices which one the IO goes to is not something you can in general predict ahead of time. > > But that probably isn't true for most attrs so it probably would be a > > good idea to have an interface for querying what's supported, and even > > for device specific ones you could query what a device supports. > > OK. > > >> > One could imagine sticking the return in the attribute itself, but I > >> > don't want to do this. For some things (checksums), the attribute will > >> > contain a pointer to a buffer - that's fine. But I don't want the > >> > attributes themselves to be writeable. > >> > >> One could imagine that attributes don't return anything, because, well, > >> they're properties of something else, and properties don't return > >> anything. > > > > With a strict definition of attribute, yeah. One of the real uses cases > > we have for this is per IO timings, for aio -
Re: [RFC, PATCH] Extensible AIO interface
On Wed, Oct 03, 2012 at 03:15:26PM -0400, Jeff Moyer wrote: Kent Overstreet koverstr...@google.com writes: On Tue, Oct 02, 2012 at 01:41:17PM -0400, Jeff Moyer wrote: Kent Overstreet koverstr...@google.com writes: So, I and other people keep running into things where we really need to add an interface to pass some auxiliary... stuff along with a pread() or pwrite(). A few examples: * IO scheduler hints. Some userspace program wants to, per IO, specify either priorities or a cgroup - by specifying a cgroup you can have a fileserver in userspace that makes use of cfq's per cgroup bandwidth quotas. You can do this today by splitting I/O between processes and placing those processes in different cgroups. For io priority, there is ioprio_set, which incurs an extra system call, but can be used. Not elegant, but possible. Yes - those are things I'm trying to replace. Doing it that way is a real pain, both as it's a lousy interface for this and it does impact performance (ioprio_set doesn't really work too well with aio, too). ioprio_set works fine with aio, since the I/O is issued in the caller's context. Perhaps you're thinking of writeback I/O? Until you want to issue different IOs with different priorities... * Cache hints. For bcache and other things, userspace may want to specify this data should be cached, this data should bypass the cache, etc. Please explain how you will differentiate this from posix_fadvise. Oh sorry, I think about SSD caching so much I forget to say that's what I'm talking about. posix_fadvise is for the page cache, we want something different for an SSD cache (IMO it'd be really ugly to use it for both, and posix_fadvise() can't really specifify everything we'd want to for an SSD cache). DESCRIPTION Programs can use posix_fadvise() to announce an intention to access file data in a specific pattern in the future, thus allowing the kernel to perform appropriate optimizations. That description seems broad enough to include disk caches as well. You haven't exactly stated what's missing. It _could_ work for SSD caches, but that doesn't mean you'd want it to - it doesn't have any way of specifying which cache you want the hint to apply to, and there are certainly circumstances under which you _wouldn't_ want it to apply to both. And making it apply to SSD caches would be silently changing the behavior, and also like I mentioned it's not sufficient for SSD caches. Hence, AIO attributes. *No.* Start with the non-AIO case first. Why? It is orthogonal to AIO (and I should make that clearer), but to do it for sync IO we'd need new syscalls that take an extra argument so IMO it's a bit easier to start with AIO. Might be worth implementing the sync interface sooner rather than later just to discover any potential issues, I suppose. Looking back to preadv and pwritev, it was wrong to only add them to libaio (and that later got corrected). I'd just like to see things start out with the sync interfaces, since you'll get more eyes on the code (not everyone cares about aio) and that helps to work out any interface issues. I agree we don't want to leave out sync versions, but honestly this stuff is more useful with AIO and that's the easier place to start. It's not possible in general - consider stacking block devices, and attrs that are supported only by specific block drivers. I.e. if you've got lvm on top of bcache or bcache on top of md, we can pass the attr down with the IO but we can't determine ahead of time, in general, where the IO is going to go. If the io stack is static (meaning you setup a device once, then open it and do io to it, and it doesn't change while you're doing io), how would you not know where the IO is going to go? With something like dm, md or bcache - you've got multiple underlying devices, and of those underlying devices which one the IO goes to is not something you can in general predict ahead of time. But that probably isn't true for most attrs so it probably would be a good idea to have an interface for querying what's supported, and even for device specific ones you could query what a device supports. OK. One could imagine sticking the return in the attribute itself, but I don't want to do this. For some things (checksums), the attribute will contain a pointer to a buffer - that's fine. But I don't want the attributes themselves to be writeable. One could imagine that attributes don't return anything, because, well, they're properties of something else, and properties don't return anything. With a strict definition of attribute, yeah. One of the real uses cases we have for this is per IO timings, for aio - right now we've got an interface for the kernel to tell userspace how long a syscall took (don't think it's upstream yet - Paul's been behind that
Re: [RFC, PATCH] Extensible AIO interface
On Thu, Oct 04, 2012 at 06:58:06AM +0900, Tejun Heo wrote: Hello, Kent. On Tue, Oct 02, 2012 at 08:00:20PM -0700, Kent Overstreet wrote: However, I don't think it's a good idea to try to implement something which is a neutral transport of opaque data between userland and lower layers. Things like that sound attractive with unlimited possibilities but reality seems to have the tendancy to make a big mess out of setups like that. I don't see how the neutral transport of opaque data itself is a bad thing. We want something simple and sane to build actual interfaces on top of - once we've got that, we can either build clean generic well defined interfaces or we can make a mess like with ioctls :P It's like any other mechanism. There's good syscalls and bad syscalls... Depending on what a feature aims for, the design and implementation vary greatly. If you go for completely generic extensible stuff which can be used to warp space-time continuum, it's easy to end up with a monstrosity with generic and programmable parser, verifier, accessor and so on. I don't think that's concrete enough that I can comment - I think this is becoming too abstract. You didn't have any complaints when I showed you the code I posted, I don't plan on making it really any more complicated than that - I think we do need explicit return values but honestly that makes it less generic. Say we implement an attr to control a block layer cache. That attr could be parsed/validated in high level code (if there's any to do) - that I don't object to. But the high level code isn't going to /know/ whether there was any block cache in the stack that handled the attr. If the attr is passed down to the block cache, that block cache can return that it was handled. My point is that if it doesn't fit the generic abstract model as in fadvise(2), it probably isn't worth supporting in any generic manner. How so? Do you mean the file range part? I think that's orthogonal to the rest (the hints fadvise specifies could be used per IO or with a file range like they are now), but the hints themselves are inadequate for SSD caches. It's okay to allow some side channel thing for specific hacky uses but I really hope the general design were focused around properly abstracted attributes which can be understood and handled by the upper layer. Completely agreed. I want to leave that side channel open for experimentation, and so we have a way of implementing one off hacky stuff when we need to - but normal mainline stuff should be sane and well designed. So, I think we can aim for something simple and modest (the only thing I can think of at the moment is task association) and provide simple framework which can be used for specific custom usages. Let's please not go overboard with generic parser / verifier which supports pointer indirection or whatnot. I wasn't seriously proposing implementing a generic parser/verifier - certainly not just for this, that was idle musing; all I'm saying is that when an attr needs parsing/verification, that should be done in the attr code, not driver code. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC, PATCH] Extensible AIO interface
On Tue, Oct 02, 2012 at 07:41:10PM -0700, Kent Overstreet wrote: > On Wed, Oct 03, 2012 at 11:28:25AM +1000, Dave Chinner wrote: > > On Tue, Oct 02, 2012 at 05:20:29PM -0700, Kent Overstreet wrote: > > > On Tue, Oct 02, 2012 at 01:41:17PM -0400, Jeff Moyer wrote: > > > > Kent Overstreet writes: > > > > > > > > > So, I and other people keep running into things where we really need > > > > > to > > > > > add an interface to pass some auxiliary... stuff along with a pread() > > > > > or > > > > > pwrite(). > > > > > > > > > > A few examples: > > > > > > > > > > * IO scheduler hints. Some userspace program wants to, per IO, specify > > > > > either priorities or a cgroup - by specifying a cgroup you can have a > > > > > fileserver in userspace that makes use of cfq's per cgroup bandwidth > > > > > quotas. > > > > > > > > You can do this today by splitting I/O between processes and placing > > > > those processes in different cgroups. For io priority, there is > > > > ioprio_set, which incurs an extra system call, but can be used. Not > > > > elegant, but possible. > > > > > > Yes - those are things I'm trying to replace. Doing it that way is a > > > real pain, both as it's a lousy interface for this and it does impact > > > performance (ioprio_set doesn't really work too well with aio, too). > > > > > > > > * Cache hints. For bcache and other things, userspace may want to > > > > > specify > > > > > "this data should be cached", "this data should bypass the cache", > > > > > etc. > > > > > > > > Please explain how you will differentiate this from posix_fadvise. > > > > > > Oh sorry, I think about SSD caching so much I forget to say that's what > > > I'm talking about. posix_fadvise is for the page cache, we want > > > something different for an SSD cache (IMO it'd be really ugly to use it > > > for both, and posix_fadvise() can't really specifify everything we'd > > > want to for an SSD cache). > > > > Similar discussions about posix_fadvise() are being had for marking > > ranges of files as volatile (i.e. useful for determining what can be > > evicted from a cache when space reclaim is required). > > > > https://lkml.org/lkml/2012/10/2/501 > > Hmm, interesting > > Speaking as an implementor though, hints that aren't associated with any > specific IO are harder to make use of - stuff is in the cache. What you > really want is to know, for a given IO, whether to cache it or not, and > possibly where in the LRU to stick it. I can see how it might be useful, but it needs to have a defined set of attributes that a file IO is allowed to have. If you don't define the set, then what really have is an arbitrary set of storage-device specific interfaces. Of course, once we have a defined set of per-file IO policy attributes, we don't really need per-IO attributes - you can just set them through a range interface like fadvise() or fallocate(). > Well, it's quite possible that different implementations would have no > trouble making use of those kinds of hints, I'm no doubt biased by > having implemented bcache. With bcache though, cache replacement is done > in terms of physical address space, not logical (i.e. the address space > of the device being cached). > > So to handle posix_fadvise, we'd have to walk the btree and chase > pointers to buckets, and modify the bucket priorities up or down... but > what about the other data in those buckets? It's not clear what should > happen, but there isn't any good way to take that into account. > > (The exception is dropping data from the cache entirely, we can just > invalidate that part of the keyspace and garbage collection will reclaim > the buckets they pointed to. Bcache does that for discard requests, > currently). It sounds to me like you are saying is that the design of bcache is unsuited to file-level management of caching policy, and that is why you want to pass attributes directly to bcache with specific IOs. Is that a fair summary of the problem you are describing here? My problem with this approach has nothing to do with the per-IO nature of it - it's to do with the layering violations and the amount of storage specific knowledge needed to make effective use of it. i.e. it seems like an interface that can only be used by people intimately familiar with underlying storage implementation. You happen to be one of those people, so I figure that you don't see a problem with that. ;) However, it also implies that an application must understand and use a specific storage configuration that matches the attributes an application sends. I understand how this model is appealling to Google because they control the whole application and storage stack (hardware and software) from top to bottom. However, I just don't think that it is a solution that the rest of the world can use effectively. The scope of data locality, aging and IO priority policy control is much larger than just controlling SSD caching. SSD caching is just a another implementation of
Re: [RFC, PATCH] Extensible AIO interface
Hello, Kent. On Tue, Oct 02, 2012 at 08:00:20PM -0700, Kent Overstreet wrote: > > However, I don't think it's a good idea to try to implement something > > which is a neutral transport of opaque data between userland and lower > > layers. Things like that sound attractive with unlimited > > possibilities but reality seems to have the tendancy to make a big > > mess out of setups like that. > > I don't see how the "neutral transport of opaque data" itself is a bad > thing. We want something simple and sane to build actual interfaces on > top of - once we've got that, we can either build clean generic well > defined interfaces or we can make a mess like with ioctls :P > > It's like any other mechanism. There's good syscalls and bad syscalls... Depending on what a feature aims for, the design and implementation vary greatly. If you go for completely generic extensible stuff which can be used to warp space-time continuum, it's easy to end up with a monstrosity with generic and programmable parser, verifier, accessor and so on. > Say we implement an attr to control a block layer cache. That attr could > be parsed/validated in high level code (if there's any to do) - that I > don't object to. But the high level code isn't going to /know/ whether > there was any block cache in the stack that handled the attr. If the > attr is passed down to the block cache, that block cache can return that > it was handled. My point is that if it doesn't fit the generic abstract model as in fadvise(2), it probably isn't worth supporting in any generic manner. > > It's okay to allow some side channel thing for specific hacky uses but > > I really hope the general design were focused around properly > > abstracted attributes which can be understood and handled by the upper > > layer. > > Completely agreed. I want to leave that side channel open for > experimentation, and so we have a way of implementing one off hacky > stuff when we need to - but normal mainline stuff should be sane and > well designed. So, I think we can aim for something simple and modest (the only thing I can think of at the moment is task association) and provide simple framework which can be used for specific custom usages. Let's please not go overboard with generic parser / verifier which supports pointer indirection or whatnot. Thanks. -- tejun -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC, PATCH] Extensible AIO interface
Kent Overstreet writes: > On Tue, Oct 02, 2012 at 01:41:17PM -0400, Jeff Moyer wrote: >> Kent Overstreet writes: >> >> > So, I and other people keep running into things where we really need to >> > add an interface to pass some auxiliary... stuff along with a pread() or >> > pwrite(). >> > >> > A few examples: >> > >> > * IO scheduler hints. Some userspace program wants to, per IO, specify >> > either priorities or a cgroup - by specifying a cgroup you can have a >> > fileserver in userspace that makes use of cfq's per cgroup bandwidth >> > quotas. >> >> You can do this today by splitting I/O between processes and placing >> those processes in different cgroups. For io priority, there is >> ioprio_set, which incurs an extra system call, but can be used. Not >> elegant, but possible. > > Yes - those are things I'm trying to replace. Doing it that way is a > real pain, both as it's a lousy interface for this and it does impact > performance (ioprio_set doesn't really work too well with aio, too). ioprio_set works fine with aio, since the I/O is issued in the caller's context. Perhaps you're thinking of writeback I/O? >> > * Cache hints. For bcache and other things, userspace may want to specify >> > "this data should be cached", "this data should bypass the cache", etc. >> >> Please explain how you will differentiate this from posix_fadvise. > > Oh sorry, I think about SSD caching so much I forget to say that's what > I'm talking about. posix_fadvise is for the page cache, we want > something different for an SSD cache (IMO it'd be really ugly to use it > for both, and posix_fadvise() can't really specifify everything we'd > want to for an SSD cache). DESCRIPTION Programs can use posix_fadvise() to announce an intention to access file data in a specific pattern in the future, thus allowing the kernel to perform appropriate optimizations. That description seems broad enough to include disk caches as well. You haven't exactly stated what's missing. >> > Hence, AIO attributes. >> >> *No.* Start with the non-AIO case first. > > Why? It is orthogonal to AIO (and I should make that clearer), but to do > it for sync IO we'd need new syscalls that take an extra argument so IMO > it's a bit easier to start with AIO. > > Might be worth implementing the sync interface sooner rather than later > just to discover any potential issues, I suppose. Looking back to preadv and pwritev, it was wrong to only add them to libaio (and that later got corrected). I'd just like to see things start out with the sync interfaces, since you'll get more eyes on the code (not everyone cares about aio) and that helps to work out any interface issues. >> > * FUTURE STUFF: >> > >> > Return values: >> > >> > Some attributes are probably going to want to return something to >> > userspace. >> > >> > If nothing else, we want this so that userspace can tell if anything >> > handled the attributes it specified - as dynamic as the io stack can be, >> > with something extensible like this there really isn't any generic way >> > of knowing ahead of time if something is going to interpret any >> > attribute - we want to return at least an error code. >> >> Seems odd to me. Why not expose supported attributes via some other >> call? fcntl? > > It's not possible in general - consider stacking block devices, and > attrs that are supported only by specific block drivers. I.e. if you've > got lvm on top of bcache or bcache on top of md, we can pass the attr > down with the IO but we can't determine ahead of time, in general, where > the IO is going to go. If the io stack is static (meaning you setup a device once, then open it and do io to it, and it doesn't change while you're doing io), how would you not know where the IO is going to go? > But that probably isn't true for most attrs so it probably would be a > good idea to have an interface for querying what's supported, and even > for device specific ones you could query what a device supports. OK. >> > One could imagine sticking the return in the attribute itself, but I >> > don't want to do this. For some things (checksums), the attribute will >> > contain a pointer to a buffer - that's fine. But I don't want the >> > attributes themselves to be writeable. >> >> One could imagine that attributes don't return anything, because, well, >> they're properties of something else, and properties don't return >> anything. > > With a strict definition of attribute, yeah. One of the real uses cases > we have for this is per IO timings, for aio - right now we've got an > interface for the kernel to tell userspace how long a syscall took > (don't think it's upstream yet - Paul's been behind that stuff), but it > only really makes sense with synchronous syscalls. Something beyond recording the time spent in the kernel? Paul who? I agree the per io timing for aio may be coarse-grained today (you can time the difference between io_submit returning and the event
Re: [RFC, PATCH] Extensible AIO interface
Kent Overstreet koverstr...@google.com writes: On Tue, Oct 02, 2012 at 01:41:17PM -0400, Jeff Moyer wrote: Kent Overstreet koverstr...@google.com writes: So, I and other people keep running into things where we really need to add an interface to pass some auxiliary... stuff along with a pread() or pwrite(). A few examples: * IO scheduler hints. Some userspace program wants to, per IO, specify either priorities or a cgroup - by specifying a cgroup you can have a fileserver in userspace that makes use of cfq's per cgroup bandwidth quotas. You can do this today by splitting I/O between processes and placing those processes in different cgroups. For io priority, there is ioprio_set, which incurs an extra system call, but can be used. Not elegant, but possible. Yes - those are things I'm trying to replace. Doing it that way is a real pain, both as it's a lousy interface for this and it does impact performance (ioprio_set doesn't really work too well with aio, too). ioprio_set works fine with aio, since the I/O is issued in the caller's context. Perhaps you're thinking of writeback I/O? * Cache hints. For bcache and other things, userspace may want to specify this data should be cached, this data should bypass the cache, etc. Please explain how you will differentiate this from posix_fadvise. Oh sorry, I think about SSD caching so much I forget to say that's what I'm talking about. posix_fadvise is for the page cache, we want something different for an SSD cache (IMO it'd be really ugly to use it for both, and posix_fadvise() can't really specifify everything we'd want to for an SSD cache). DESCRIPTION Programs can use posix_fadvise() to announce an intention to access file data in a specific pattern in the future, thus allowing the kernel to perform appropriate optimizations. That description seems broad enough to include disk caches as well. You haven't exactly stated what's missing. Hence, AIO attributes. *No.* Start with the non-AIO case first. Why? It is orthogonal to AIO (and I should make that clearer), but to do it for sync IO we'd need new syscalls that take an extra argument so IMO it's a bit easier to start with AIO. Might be worth implementing the sync interface sooner rather than later just to discover any potential issues, I suppose. Looking back to preadv and pwritev, it was wrong to only add them to libaio (and that later got corrected). I'd just like to see things start out with the sync interfaces, since you'll get more eyes on the code (not everyone cares about aio) and that helps to work out any interface issues. * FUTURE STUFF: Return values: Some attributes are probably going to want to return something to userspace. If nothing else, we want this so that userspace can tell if anything handled the attributes it specified - as dynamic as the io stack can be, with something extensible like this there really isn't any generic way of knowing ahead of time if something is going to interpret any attribute - we want to return at least an error code. Seems odd to me. Why not expose supported attributes via some other call? fcntl? It's not possible in general - consider stacking block devices, and attrs that are supported only by specific block drivers. I.e. if you've got lvm on top of bcache or bcache on top of md, we can pass the attr down with the IO but we can't determine ahead of time, in general, where the IO is going to go. If the io stack is static (meaning you setup a device once, then open it and do io to it, and it doesn't change while you're doing io), how would you not know where the IO is going to go? But that probably isn't true for most attrs so it probably would be a good idea to have an interface for querying what's supported, and even for device specific ones you could query what a device supports. OK. One could imagine sticking the return in the attribute itself, but I don't want to do this. For some things (checksums), the attribute will contain a pointer to a buffer - that's fine. But I don't want the attributes themselves to be writeable. One could imagine that attributes don't return anything, because, well, they're properties of something else, and properties don't return anything. With a strict definition of attribute, yeah. One of the real uses cases we have for this is per IO timings, for aio - right now we've got an interface for the kernel to tell userspace how long a syscall took (don't think it's upstream yet - Paul's been behind that stuff), but it only really makes sense with synchronous syscalls. Something beyond recording the time spent in the kernel? Paul who? I agree the per io timing for aio may be coarse-grained today (you can time the difference between io_submit returning and the event being returned by io_getevents, but that says nothing of when the io was issued to the block layer). I'm curious to
Re: [RFC, PATCH] Extensible AIO interface
Hello, Kent. On Tue, Oct 02, 2012 at 08:00:20PM -0700, Kent Overstreet wrote: However, I don't think it's a good idea to try to implement something which is a neutral transport of opaque data between userland and lower layers. Things like that sound attractive with unlimited possibilities but reality seems to have the tendancy to make a big mess out of setups like that. I don't see how the neutral transport of opaque data itself is a bad thing. We want something simple and sane to build actual interfaces on top of - once we've got that, we can either build clean generic well defined interfaces or we can make a mess like with ioctls :P It's like any other mechanism. There's good syscalls and bad syscalls... Depending on what a feature aims for, the design and implementation vary greatly. If you go for completely generic extensible stuff which can be used to warp space-time continuum, it's easy to end up with a monstrosity with generic and programmable parser, verifier, accessor and so on. Say we implement an attr to control a block layer cache. That attr could be parsed/validated in high level code (if there's any to do) - that I don't object to. But the high level code isn't going to /know/ whether there was any block cache in the stack that handled the attr. If the attr is passed down to the block cache, that block cache can return that it was handled. My point is that if it doesn't fit the generic abstract model as in fadvise(2), it probably isn't worth supporting in any generic manner. It's okay to allow some side channel thing for specific hacky uses but I really hope the general design were focused around properly abstracted attributes which can be understood and handled by the upper layer. Completely agreed. I want to leave that side channel open for experimentation, and so we have a way of implementing one off hacky stuff when we need to - but normal mainline stuff should be sane and well designed. So, I think we can aim for something simple and modest (the only thing I can think of at the moment is task association) and provide simple framework which can be used for specific custom usages. Let's please not go overboard with generic parser / verifier which supports pointer indirection or whatnot. Thanks. -- tejun -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC, PATCH] Extensible AIO interface
On Tue, Oct 02, 2012 at 07:41:10PM -0700, Kent Overstreet wrote: On Wed, Oct 03, 2012 at 11:28:25AM +1000, Dave Chinner wrote: On Tue, Oct 02, 2012 at 05:20:29PM -0700, Kent Overstreet wrote: On Tue, Oct 02, 2012 at 01:41:17PM -0400, Jeff Moyer wrote: Kent Overstreet koverstr...@google.com writes: So, I and other people keep running into things where we really need to add an interface to pass some auxiliary... stuff along with a pread() or pwrite(). A few examples: * IO scheduler hints. Some userspace program wants to, per IO, specify either priorities or a cgroup - by specifying a cgroup you can have a fileserver in userspace that makes use of cfq's per cgroup bandwidth quotas. You can do this today by splitting I/O between processes and placing those processes in different cgroups. For io priority, there is ioprio_set, which incurs an extra system call, but can be used. Not elegant, but possible. Yes - those are things I'm trying to replace. Doing it that way is a real pain, both as it's a lousy interface for this and it does impact performance (ioprio_set doesn't really work too well with aio, too). * Cache hints. For bcache and other things, userspace may want to specify this data should be cached, this data should bypass the cache, etc. Please explain how you will differentiate this from posix_fadvise. Oh sorry, I think about SSD caching so much I forget to say that's what I'm talking about. posix_fadvise is for the page cache, we want something different for an SSD cache (IMO it'd be really ugly to use it for both, and posix_fadvise() can't really specifify everything we'd want to for an SSD cache). Similar discussions about posix_fadvise() are being had for marking ranges of files as volatile (i.e. useful for determining what can be evicted from a cache when space reclaim is required). https://lkml.org/lkml/2012/10/2/501 Hmm, interesting Speaking as an implementor though, hints that aren't associated with any specific IO are harder to make use of - stuff is in the cache. What you really want is to know, for a given IO, whether to cache it or not, and possibly where in the LRU to stick it. I can see how it might be useful, but it needs to have a defined set of attributes that a file IO is allowed to have. If you don't define the set, then what really have is an arbitrary set of storage-device specific interfaces. Of course, once we have a defined set of per-file IO policy attributes, we don't really need per-IO attributes - you can just set them through a range interface like fadvise() or fallocate(). Well, it's quite possible that different implementations would have no trouble making use of those kinds of hints, I'm no doubt biased by having implemented bcache. With bcache though, cache replacement is done in terms of physical address space, not logical (i.e. the address space of the device being cached). So to handle posix_fadvise, we'd have to walk the btree and chase pointers to buckets, and modify the bucket priorities up or down... but what about the other data in those buckets? It's not clear what should happen, but there isn't any good way to take that into account. (The exception is dropping data from the cache entirely, we can just invalidate that part of the keyspace and garbage collection will reclaim the buckets they pointed to. Bcache does that for discard requests, currently). It sounds to me like you are saying is that the design of bcache is unsuited to file-level management of caching policy, and that is why you want to pass attributes directly to bcache with specific IOs. Is that a fair summary of the problem you are describing here? My problem with this approach has nothing to do with the per-IO nature of it - it's to do with the layering violations and the amount of storage specific knowledge needed to make effective use of it. i.e. it seems like an interface that can only be used by people intimately familiar with underlying storage implementation. You happen to be one of those people, so I figure that you don't see a problem with that. ;) However, it also implies that an application must understand and use a specific storage configuration that matches the attributes an application sends. I understand how this model is appealling to Google because they control the whole application and storage stack (hardware and software) from top to bottom. However, I just don't think that it is a solution that the rest of the world can use effectively. The scope of data locality, aging and IO priority policy control is much larger than just controlling SSD caching. SSD caching is just a another implementation of automatic tiering, and HSMs have been doing this for years and years. It's the same problem - space management and deciding what to leave in the frequently accessed pool
Re: [RFC, PATCH] Extensible AIO interface
On Wed, Oct 03, 2012 at 10:41:06AM +0900, Tejun Heo wrote: > Hello, Kent. > > On Tue, Oct 02, 2012 at 02:41:13PM -0700, Kent Overstreet wrote: > > Seems to me it'd be no different from security considerations when > > introducing new ioctls. I.e., messy, ad hoc, easy to get wrong, but > > sometimes no way around it. > > > > It really has to be ad hoc if it's extensible, unfortunately. > > > > The only way of getting around _that_ would be with some kind of > > reflective type system, so that generic code could parse (in some > > fashion) the types of the various attributes, and for pointers copy the > > user data into the kernel and do whatever access controls in generic > > code. > > I'm not userland API expert by any stretch of imagination but the > basic mechanism to pass data around seems sane to me and aio as stinky > as it is seems to be the only logical stuff for IOs w/ extra > attributes although alignment is always painful with any form of > concatenated opaque structures. > > However, I don't think it's a good idea to try to implement something > which is a neutral transport of opaque data between userland and lower > layers. Things like that sound attractive with unlimited > possibilities but reality seems to have the tendancy to make a big > mess out of setups like that. I don't see how the "neutral transport of opaque data" itself is a bad thing. We want something simple and sane to build actual interfaces on top of - once we've got that, we can either build clean generic well defined interfaces or we can make a mess like with ioctls :P It's like any other mechanism. There's good syscalls and bad syscalls... > So, if we're gonna do this, let's define what attributes we want to > have and let them be processed at the interface layer and fed to lower > layers afterwards - e.g. for cgroup context association, associate the > resulting bios with the target cgroup from the aio layer. That sounds perfectly reasonable to me (the emails with Zach ended up heading in that direction). > I'm quite skeptical of general usefulness of having opaque knobs to > lower IO stack which don't have proper generic abstraction. Nobody > can make proper use of things like that. Well, not nobody, maybe if > the lower stack, the interface and the application are implemented by > a single organization over relatively short span of time, maybe it > would be useful for them, but that isn't something which generic > interface design should focus on. I think it could work fine, but I'm not convinced either way on what the correct approach is. Say we implement an attr to control a block layer cache. That attr could be parsed/validated in high level code (if there's any to do) - that I don't object to. But the high level code isn't going to /know/ whether there was any block cache in the stack that handled the attr. If the attr is passed down to the block cache, that block cache can return that it was handled. Or if we implement an attr that says "return whether it was a cache hit or miss, and maybe other statistics". Similar thing. We could conceivably confine attrs to the upper layer, and define in kernel interfaces for passing all this stuff around, but I'm doubtful it's worth the trouble - at least if attrs themselves are implemented sanely. And honestly I think it'd make (more of) a mess of the block layer and the rest of the io stack to have to explicitly pass around cache hints/cache controlling options/whatever else we think of in the future - especially when most of this stuff isn't going to be in use most of the time. But, like I sort of mentioned in another email with Zach - passing the attrs from userspace around is _not_ mutually exclusive with standard code for parsing/validating them that exists in one place. We shouldn't have driver code reaching in and grabbing the raw attr unless there _is_ no parsing/validation to do. > It's okay to allow some side channel thing for specific hacky uses but > I really hope the general design were focused around properly > abstracted attributes which can be understood and handled by the upper > layer. Completely agreed. I want to leave that side channel open for experimentation, and so we have a way of implementing one off hacky stuff when we need to - but normal mainline stuff should be sane and well designed. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC, PATCH] Extensible AIO interface
On Wed, Oct 03, 2012 at 11:28:25AM +1000, Dave Chinner wrote: > On Tue, Oct 02, 2012 at 05:20:29PM -0700, Kent Overstreet wrote: > > On Tue, Oct 02, 2012 at 01:41:17PM -0400, Jeff Moyer wrote: > > > Kent Overstreet writes: > > > > > > > So, I and other people keep running into things where we really need to > > > > add an interface to pass some auxiliary... stuff along with a pread() or > > > > pwrite(). > > > > > > > > A few examples: > > > > > > > > * IO scheduler hints. Some userspace program wants to, per IO, specify > > > > either priorities or a cgroup - by specifying a cgroup you can have a > > > > fileserver in userspace that makes use of cfq's per cgroup bandwidth > > > > quotas. > > > > > > You can do this today by splitting I/O between processes and placing > > > those processes in different cgroups. For io priority, there is > > > ioprio_set, which incurs an extra system call, but can be used. Not > > > elegant, but possible. > > > > Yes - those are things I'm trying to replace. Doing it that way is a > > real pain, both as it's a lousy interface for this and it does impact > > performance (ioprio_set doesn't really work too well with aio, too). > > > > > > * Cache hints. For bcache and other things, userspace may want to > > > > specify > > > > "this data should be cached", "this data should bypass the cache", etc. > > > > > > Please explain how you will differentiate this from posix_fadvise. > > > > Oh sorry, I think about SSD caching so much I forget to say that's what > > I'm talking about. posix_fadvise is for the page cache, we want > > something different for an SSD cache (IMO it'd be really ugly to use it > > for both, and posix_fadvise() can't really specifify everything we'd > > want to for an SSD cache). > > Similar discussions about posix_fadvise() are being had for marking > ranges of files as volatile (i.e. useful for determining what can be > evicted from a cache when space reclaim is required). > > https://lkml.org/lkml/2012/10/2/501 Hmm, interesting Speaking as an implementor though, hints that aren't associated with any specific IO are harder to make use of - stuff is in the cache. What you really want is to know, for a given IO, whether to cache it or not, and possibly where in the LRU to stick it. Well, it's quite possible that different implementations would have no trouble making use of those kinds of hints, I'm no doubt biased by having implemented bcache. With bcache though, cache replacement is done in terms of physical address space, not logical (i.e. the address space of the device being cached). So to handle posix_fadvise, we'd have to walk the btree and chase pointers to buckets, and modify the bucket priorities up or down... but what about the other data in those buckets? It's not clear what should happen, but there isn't any good way to take that into account. (The exception is dropping data from the cache entirely, we can just invalidate that part of the keyspace and garbage collection will reclaim the buckets they pointed to. Bcache does that for discard requests, currently). > If you have requirements for specific cache management, then it > might be worth seeing if you can steer an existing interface > proposal for some form of cache management in the direction you > need. Certainly - I don't plan on implementing anything bcache specific, or implementing anything from scratch if there's a good proposal out there. But a per-io interface does seem useful from an implementation pov and natural to use for at least some classes of applications. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC, PATCH] Extensible AIO interface
Hello, Kent. On Tue, Oct 02, 2012 at 02:41:13PM -0700, Kent Overstreet wrote: > Seems to me it'd be no different from security considerations when > introducing new ioctls. I.e., messy, ad hoc, easy to get wrong, but > sometimes no way around it. > > It really has to be ad hoc if it's extensible, unfortunately. > > The only way of getting around _that_ would be with some kind of > reflective type system, so that generic code could parse (in some > fashion) the types of the various attributes, and for pointers copy the > user data into the kernel and do whatever access controls in generic > code. I'm not userland API expert by any stretch of imagination but the basic mechanism to pass data around seems sane to me and aio as stinky as it is seems to be the only logical stuff for IOs w/ extra attributes although alignment is always painful with any form of concatenated opaque structures. However, I don't think it's a good idea to try to implement something which is a neutral transport of opaque data between userland and lower layers. Things like that sound attractive with unlimited possibilities but reality seems to have the tendancy to make a big mess out of setups like that. So, if we're gonna do this, let's define what attributes we want to have and let them be processed at the interface layer and fed to lower layers afterwards - e.g. for cgroup context association, associate the resulting bios with the target cgroup from the aio layer. I'm quite skeptical of general usefulness of having opaque knobs to lower IO stack which don't have proper generic abstraction. Nobody can make proper use of things like that. Well, not nobody, maybe if the lower stack, the interface and the application are implemented by a single organization over relatively short span of time, maybe it would be useful for them, but that isn't something which generic interface design should focus on. It's okay to allow some side channel thing for specific hacky uses but I really hope the general design were focused around properly abstracted attributes which can be understood and handled by the upper layer. Thanks. -- tejun -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC, PATCH] Extensible AIO interface
On Tue, Oct 02, 2012 at 05:20:29PM -0700, Kent Overstreet wrote: > On Tue, Oct 02, 2012 at 01:41:17PM -0400, Jeff Moyer wrote: > > Kent Overstreet writes: > > > > > So, I and other people keep running into things where we really need to > > > add an interface to pass some auxiliary... stuff along with a pread() or > > > pwrite(). > > > > > > A few examples: > > > > > > * IO scheduler hints. Some userspace program wants to, per IO, specify > > > either priorities or a cgroup - by specifying a cgroup you can have a > > > fileserver in userspace that makes use of cfq's per cgroup bandwidth > > > quotas. > > > > You can do this today by splitting I/O between processes and placing > > those processes in different cgroups. For io priority, there is > > ioprio_set, which incurs an extra system call, but can be used. Not > > elegant, but possible. > > Yes - those are things I'm trying to replace. Doing it that way is a > real pain, both as it's a lousy interface for this and it does impact > performance (ioprio_set doesn't really work too well with aio, too). > > > > * Cache hints. For bcache and other things, userspace may want to specify > > > "this data should be cached", "this data should bypass the cache", etc. > > > > Please explain how you will differentiate this from posix_fadvise. > > Oh sorry, I think about SSD caching so much I forget to say that's what > I'm talking about. posix_fadvise is for the page cache, we want > something different for an SSD cache (IMO it'd be really ugly to use it > for both, and posix_fadvise() can't really specifify everything we'd > want to for an SSD cache). Similar discussions about posix_fadvise() are being had for marking ranges of files as volatile (i.e. useful for determining what can be evicted from a cache when space reclaim is required). https://lkml.org/lkml/2012/10/2/501 If you have requirements for specific cache management, then it might be worth seeing if you can steer an existing interface proposal for some form of cache management in the direction you need. Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC, PATCH] Extensible AIO interface
On Tue, Oct 02, 2012 at 01:41:17PM -0400, Jeff Moyer wrote: > Kent Overstreet writes: > > > So, I and other people keep running into things where we really need to > > add an interface to pass some auxiliary... stuff along with a pread() or > > pwrite(). > > > > A few examples: > > > > * IO scheduler hints. Some userspace program wants to, per IO, specify > > either priorities or a cgroup - by specifying a cgroup you can have a > > fileserver in userspace that makes use of cfq's per cgroup bandwidth > > quotas. > > You can do this today by splitting I/O between processes and placing > those processes in different cgroups. For io priority, there is > ioprio_set, which incurs an extra system call, but can be used. Not > elegant, but possible. Yes - those are things I'm trying to replace. Doing it that way is a real pain, both as it's a lousy interface for this and it does impact performance (ioprio_set doesn't really work too well with aio, too). > > * Cache hints. For bcache and other things, userspace may want to specify > > "this data should be cached", "this data should bypass the cache", etc. > > Please explain how you will differentiate this from posix_fadvise. Oh sorry, I think about SSD caching so much I forget to say that's what I'm talking about. posix_fadvise is for the page cache, we want something different for an SSD cache (IMO it'd be really ugly to use it for both, and posix_fadvise() can't really specifify everything we'd want to for an SSD cache). > > * Passing checksums out to userspace. We've got bio integrity, which is > > a (somewhat) generic interface for passing data checksums between the > > filesystem and the hardware. There are various circumstances under which > > you may want to pass these checksums out to userspace, and if so we > > ought to have a generic way of doing it. > > Yes, that needs a new interface. > > > Hence, AIO attributes. > > *No.* Start with the non-AIO case first. Why? It is orthogonal to AIO (and I should make that clearer), but to do it for sync IO we'd need new syscalls that take an extra argument so IMO it's a bit easier to start with AIO. Might be worth implementing the sync interface sooner rather than later just to discover any potential issues, I suppose. > > * FUTURE STUFF: > > > > Return values: > > > > Some attributes are probably going to want to return something to > > userspace. > > > > If nothing else, we want this so that userspace can tell if anything > > handled the attributes it specified - as dynamic as the io stack can be, > > with something extensible like this there really isn't any generic way > > of knowing ahead of time if something is going to interpret any > > attribute - we want to return at least an error code. > > Seems odd to me. Why not expose supported attributes via some other > call? fcntl? It's not possible in general - consider stacking block devices, and attrs that are supported only by specific block drivers. I.e. if you've got lvm on top of bcache or bcache on top of md, we can pass the attr down with the IO but we can't determine ahead of time, in general, where the IO is going to go. But that probably isn't true for most attrs so it probably would be a good idea to have an interface for querying what's supported, and even for device specific ones you could query what a device supports. > > One could imagine sticking the return in the attribute itself, but I > > don't want to do this. For some things (checksums), the attribute will > > contain a pointer to a buffer - that's fine. But I don't want the > > attributes themselves to be writeable. > > One could imagine that attributes don't return anything, because, well, > they're properties of something else, and properties don't return > anything. With a strict definition of attribute, yeah. One of the real uses cases we have for this is per IO timings, for aio - right now we've got an interface for the kernel to tell userspace how long a syscall took (don't think it's upstream yet - Paul's been behind that stuff), but it only really makes sense with synchronous syscalls. These AIO attributes would be useful for that too, but I'd _much_ prefer if the timing information was explicitly returned instead of using a pointer to a buffer. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC, PATCH] Extensible AIO interface
> "Kent" == Kent Overstreet writes: >> Hmm, careful here. I think that in DIF/DIX the checksums are >> per-sector, not per IO, right? That'd mean that the PAGE_SIZE attr >> limit in this patch would be magically creating different max IO size >> limits on different architectures. That doesn't seem great. Kent> Not just per sector, Per hardware sector. Per logical block (or for some devices less). Kent> For passing around checksums userspace would have to find out the Kent> hardware sector size and checksum type/size via a different Kent> interface, The relevant information is already exported in sysfs. Including the format, how many bytes of integrity metadata go with how many bytes of data, etc. -- Martin K. Petersen Oracle Linux Engineering -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC, PATCH] Extensible AIO interface
On Tue, Oct 02, 2012 at 10:43:23AM -0700, Zach Brown wrote: > > The generic code wouldn't know about any user pointers inside > > attributes, so it'd have to be downstream consumers. Hopefully there > > won't be many attributes with user pointers in them (I don't expect > > there to be), so we won't have too much of this messyness. > > I really don't like this. We should have learned this lesson with ioctl > structs that are nested pointers. I will in no way claim to be skilled at kernel/userspace interface design (my expertise is more in the block layer), so feel free to educate me as much as you have the patience for :) The comparision with ioctl is definitely apt. (And I don't care for ioctl either, my first preference is for textual interfaces wherever possible - but I don't think that's a reasonable approach here :p) I'm not sure what you're exact complaint about nested pointers is - I agree we want to avoid pointers as much as they can, for multiple reasons. I do expect to get rid of one layer of indirection eventually, but that'll probably require new syscalls. Also, I emphatically hope and expect that nested pointers will _not_ be the norm. > What about security bits that are trying to determine if attributes are > OK? Seems to me it'd be no different from security considerations when introducing new ioctls. I.e., messy, ad hoc, easy to get wrong, but sometimes no way around it. It really has to be ad hoc if it's extensible, unfortunately. The only way of getting around _that_ would be with some kind of reflective type system, so that generic code could parse (in some fashion) the types of the various attributes, and for pointers copy the user data into the kernel and do whatever access controls in generic code. Which might not be a crazy idea if it could be applied to ioctls... hmm... > What about contexts that can't easily deal with userspace pointers? We > tend to copy into relatively more accessible kernel memory as early as > possible. That tends not to be the case submission side. submit_bio() and make_request functions are all run in normal process context. And also, like I mentioned I expect nested pointers to be fairly unusual so this shouldn't come up often. > What about fuse trying to extend this interface out to their fs clients? > Look at the horrible mess it implements to bounce nested ioctl data > parsing between the kernel's user pointer copying and the fuse client's > parsing of that copied data. Ugh, yeah that's something I should look into/think about. It's been ages since I looked at fuse, and I never looked at how it handled ioctls (not sure if that even existed when i was looking at it). > Let's not do this again, please. I think it's a fallacy to claim that > the interface can be opaque to high levels and only parsed by lower > levels. The interface should be explicit and fully specified on entry > so that all levels have trivial access to it. Hmm - that is definitely a good principle. To restate a bit, the parsing and validation ought to be done in generic centralized code to the greatest extent possible (I _think_ that it's not always possible in general, but maybe it will be in practice). Also - IMO, one of the nastier things about ioctls is that parsing/validation tends to be completely mixed with implementation. It would be nice to get away from that. Pondering a bit - that actually is the situation with my patch for attributes that are simple data; it's just data and the data is trivially accessible at all levels. But, this isn't just an issue for attributes that contain user pointers - the first attribute we prototyped was the proxy_pid attr, so a process can have io done in another cgroup's context. That requires validation, permission checking, and depending on how it's used refcounting. That stuff _definitely_ shouldn't be buried in the middle of block/cfq-iosched.c. Two approaches I can think of: For every attr, define a struct user_iocb_attr_foo and struct kern_iocb_attr_foo. Outside of the attr parsing code, nothing in the kernel sees the user version - they see a kern version which has had said parsing/validation done on it. I do like this approach in that it ought to make things easy to audit/hard to screw up. One difficulty I can see with that approach is refcounting - if parsing/validation is done on kernel entry, we've got to take refcounts to whatever kernel structures are referred to (like a different task_struct). This can often be avoided if we delay parsing/validation until when it's actually used, and then use rcu. Or maybe we could go with a halfway approach - the userspace structs are passed around within the kernel like in my existing patch, but for any attr that isn't simple data we define a parse_iocb_attr_foo function, and those functions are implemented in a common location - not scattered around defined where they're used. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a
Re: [RFC, PATCH] Extensible AIO interface
Kent Overstreet writes: > So, I and other people keep running into things where we really need to > add an interface to pass some auxiliary... stuff along with a pread() or > pwrite(). How would you enumerate this? How does the application know what the underlying stack supports/need? How is versioning handled? Frankly, in this form it looks more like a special purpose hack than a general facility. -Andi -- a...@linux.intel.com -- Speaking for myself only -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC, PATCH] Extensible AIO interface
> The generic code wouldn't know about any user pointers inside > attributes, so it'd have to be downstream consumers. Hopefully there > won't be many attributes with user pointers in them (I don't expect > there to be), so we won't have too much of this messyness. I really don't like this. We should have learned this lesson with ioctl structs that are nested pointers. What about security bits that are trying to determine if attributes are OK? What about contexts that can't easily deal with userspace pointers? We tend to copy into relatively more accessible kernel memory as early as possible. What about fuse trying to extend this interface out to their fs clients? Look at the horrible mess it implements to bounce nested ioctl data parsing between the kernel's user pointer copying and the fuse client's parsing of that copied data. Let's not do this again, please. I think it's a fallacy to claim that the interface can be opaque to high levels and only parsed by lower levels. The interface should be explicit and fully specified on entry so that all levels have trivial access to it. - z -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC, PATCH] Extensible AIO interface
Kent Overstreet writes: > So, I and other people keep running into things where we really need to > add an interface to pass some auxiliary... stuff along with a pread() or > pwrite(). > > A few examples: > > * IO scheduler hints. Some userspace program wants to, per IO, specify > either priorities or a cgroup - by specifying a cgroup you can have a > fileserver in userspace that makes use of cfq's per cgroup bandwidth > quotas. You can do this today by splitting I/O between processes and placing those processes in different cgroups. For io priority, there is ioprio_set, which incurs an extra system call, but can be used. Not elegant, but possible. > * Cache hints. For bcache and other things, userspace may want to specify > "this data should be cached", "this data should bypass the cache", etc. Please explain how you will differentiate this from posix_fadvise. > * Passing checksums out to userspace. We've got bio integrity, which is > a (somewhat) generic interface for passing data checksums between the > filesystem and the hardware. There are various circumstances under which > you may want to pass these checksums out to userspace, and if so we > ought to have a generic way of doing it. Yes, that needs a new interface. > Hence, AIO attributes. *No.* Start with the non-AIO case first. > * FUTURE STUFF: > > Return values: > > Some attributes are probably going to want to return something to > userspace. > > If nothing else, we want this so that userspace can tell if anything > handled the attributes it specified - as dynamic as the io stack can be, > with something extensible like this there really isn't any generic way > of knowing ahead of time if something is going to interpret any > attribute - we want to return at least an error code. Seems odd to me. Why not expose supported attributes via some other call? fcntl? > One could imagine sticking the return in the attribute itself, but I > don't want to do this. For some things (checksums), the attribute will > contain a pointer to a buffer - that's fine. But I don't want the > attributes themselves to be writeable. One could imagine that attributes don't return anything, because, well, they're properties of something else, and properties don't return anything. Cheers, Jeff -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC, PATCH] Extensible AIO interface
Kent Overstreet koverstr...@google.com writes: So, I and other people keep running into things where we really need to add an interface to pass some auxiliary... stuff along with a pread() or pwrite(). A few examples: * IO scheduler hints. Some userspace program wants to, per IO, specify either priorities or a cgroup - by specifying a cgroup you can have a fileserver in userspace that makes use of cfq's per cgroup bandwidth quotas. You can do this today by splitting I/O between processes and placing those processes in different cgroups. For io priority, there is ioprio_set, which incurs an extra system call, but can be used. Not elegant, but possible. * Cache hints. For bcache and other things, userspace may want to specify this data should be cached, this data should bypass the cache, etc. Please explain how you will differentiate this from posix_fadvise. * Passing checksums out to userspace. We've got bio integrity, which is a (somewhat) generic interface for passing data checksums between the filesystem and the hardware. There are various circumstances under which you may want to pass these checksums out to userspace, and if so we ought to have a generic way of doing it. Yes, that needs a new interface. Hence, AIO attributes. *No.* Start with the non-AIO case first. * FUTURE STUFF: Return values: Some attributes are probably going to want to return something to userspace. If nothing else, we want this so that userspace can tell if anything handled the attributes it specified - as dynamic as the io stack can be, with something extensible like this there really isn't any generic way of knowing ahead of time if something is going to interpret any attribute - we want to return at least an error code. Seems odd to me. Why not expose supported attributes via some other call? fcntl? One could imagine sticking the return in the attribute itself, but I don't want to do this. For some things (checksums), the attribute will contain a pointer to a buffer - that's fine. But I don't want the attributes themselves to be writeable. One could imagine that attributes don't return anything, because, well, they're properties of something else, and properties don't return anything. Cheers, Jeff -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC, PATCH] Extensible AIO interface
The generic code wouldn't know about any user pointers inside attributes, so it'd have to be downstream consumers. Hopefully there won't be many attributes with user pointers in them (I don't expect there to be), so we won't have too much of this messyness. I really don't like this. We should have learned this lesson with ioctl structs that are nested pointers. What about security bits that are trying to determine if attributes are OK? What about contexts that can't easily deal with userspace pointers? We tend to copy into relatively more accessible kernel memory as early as possible. What about fuse trying to extend this interface out to their fs clients? Look at the horrible mess it implements to bounce nested ioctl data parsing between the kernel's user pointer copying and the fuse client's parsing of that copied data. Let's not do this again, please. I think it's a fallacy to claim that the interface can be opaque to high levels and only parsed by lower levels. The interface should be explicit and fully specified on entry so that all levels have trivial access to it. - z -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC, PATCH] Extensible AIO interface
Kent Overstreet koverstr...@google.com writes: So, I and other people keep running into things where we really need to add an interface to pass some auxiliary... stuff along with a pread() or pwrite(). How would you enumerate this? How does the application know what the underlying stack supports/need? How is versioning handled? Frankly, in this form it looks more like a special purpose hack than a general facility. -Andi -- a...@linux.intel.com -- Speaking for myself only -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC, PATCH] Extensible AIO interface
On Tue, Oct 02, 2012 at 10:43:23AM -0700, Zach Brown wrote: The generic code wouldn't know about any user pointers inside attributes, so it'd have to be downstream consumers. Hopefully there won't be many attributes with user pointers in them (I don't expect there to be), so we won't have too much of this messyness. I really don't like this. We should have learned this lesson with ioctl structs that are nested pointers. I will in no way claim to be skilled at kernel/userspace interface design (my expertise is more in the block layer), so feel free to educate me as much as you have the patience for :) The comparision with ioctl is definitely apt. (And I don't care for ioctl either, my first preference is for textual interfaces wherever possible - but I don't think that's a reasonable approach here :p) I'm not sure what you're exact complaint about nested pointers is - I agree we want to avoid pointers as much as they can, for multiple reasons. I do expect to get rid of one layer of indirection eventually, but that'll probably require new syscalls. Also, I emphatically hope and expect that nested pointers will _not_ be the norm. What about security bits that are trying to determine if attributes are OK? Seems to me it'd be no different from security considerations when introducing new ioctls. I.e., messy, ad hoc, easy to get wrong, but sometimes no way around it. It really has to be ad hoc if it's extensible, unfortunately. The only way of getting around _that_ would be with some kind of reflective type system, so that generic code could parse (in some fashion) the types of the various attributes, and for pointers copy the user data into the kernel and do whatever access controls in generic code. Which might not be a crazy idea if it could be applied to ioctls... hmm... What about contexts that can't easily deal with userspace pointers? We tend to copy into relatively more accessible kernel memory as early as possible. That tends not to be the case submission side. submit_bio() and make_request functions are all run in normal process context. And also, like I mentioned I expect nested pointers to be fairly unusual so this shouldn't come up often. What about fuse trying to extend this interface out to their fs clients? Look at the horrible mess it implements to bounce nested ioctl data parsing between the kernel's user pointer copying and the fuse client's parsing of that copied data. Ugh, yeah that's something I should look into/think about. It's been ages since I looked at fuse, and I never looked at how it handled ioctls (not sure if that even existed when i was looking at it). Let's not do this again, please. I think it's a fallacy to claim that the interface can be opaque to high levels and only parsed by lower levels. The interface should be explicit and fully specified on entry so that all levels have trivial access to it. Hmm - that is definitely a good principle. To restate a bit, the parsing and validation ought to be done in generic centralized code to the greatest extent possible (I _think_ that it's not always possible in general, but maybe it will be in practice). Also - IMO, one of the nastier things about ioctls is that parsing/validation tends to be completely mixed with implementation. It would be nice to get away from that. Pondering a bit - that actually is the situation with my patch for attributes that are simple data; it's just data and the data is trivially accessible at all levels. But, this isn't just an issue for attributes that contain user pointers - the first attribute we prototyped was the proxy_pid attr, so a process can have io done in another cgroup's context. That requires validation, permission checking, and depending on how it's used refcounting. That stuff _definitely_ shouldn't be buried in the middle of block/cfq-iosched.c. Two approaches I can think of: For every attr, define a struct user_iocb_attr_foo and struct kern_iocb_attr_foo. Outside of the attr parsing code, nothing in the kernel sees the user version - they see a kern version which has had said parsing/validation done on it. I do like this approach in that it ought to make things easy to audit/hard to screw up. One difficulty I can see with that approach is refcounting - if parsing/validation is done on kernel entry, we've got to take refcounts to whatever kernel structures are referred to (like a different task_struct). This can often be avoided if we delay parsing/validation until when it's actually used, and then use rcu. Or maybe we could go with a halfway approach - the userspace structs are passed around within the kernel like in my existing patch, but for any attr that isn't simple data we define a parse_iocb_attr_foo function, and those functions are implemented in a common location - not scattered around defined where they're used. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to
Re: [RFC, PATCH] Extensible AIO interface
Kent == Kent Overstreet koverstr...@google.com writes: Hmm, careful here. I think that in DIF/DIX the checksums are per-sector, not per IO, right? That'd mean that the PAGE_SIZE attr limit in this patch would be magically creating different max IO size limits on different architectures. That doesn't seem great. Kent Not just per sector, Per hardware sector. Per logical block (or for some devices less). Kent For passing around checksums userspace would have to find out the Kent hardware sector size and checksum type/size via a different Kent interface, The relevant information is already exported in sysfs. Including the format, how many bytes of integrity metadata go with how many bytes of data, etc. -- Martin K. Petersen Oracle Linux Engineering -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC, PATCH] Extensible AIO interface
On Tue, Oct 02, 2012 at 01:41:17PM -0400, Jeff Moyer wrote: Kent Overstreet koverstr...@google.com writes: So, I and other people keep running into things where we really need to add an interface to pass some auxiliary... stuff along with a pread() or pwrite(). A few examples: * IO scheduler hints. Some userspace program wants to, per IO, specify either priorities or a cgroup - by specifying a cgroup you can have a fileserver in userspace that makes use of cfq's per cgroup bandwidth quotas. You can do this today by splitting I/O between processes and placing those processes in different cgroups. For io priority, there is ioprio_set, which incurs an extra system call, but can be used. Not elegant, but possible. Yes - those are things I'm trying to replace. Doing it that way is a real pain, both as it's a lousy interface for this and it does impact performance (ioprio_set doesn't really work too well with aio, too). * Cache hints. For bcache and other things, userspace may want to specify this data should be cached, this data should bypass the cache, etc. Please explain how you will differentiate this from posix_fadvise. Oh sorry, I think about SSD caching so much I forget to say that's what I'm talking about. posix_fadvise is for the page cache, we want something different for an SSD cache (IMO it'd be really ugly to use it for both, and posix_fadvise() can't really specifify everything we'd want to for an SSD cache). * Passing checksums out to userspace. We've got bio integrity, which is a (somewhat) generic interface for passing data checksums between the filesystem and the hardware. There are various circumstances under which you may want to pass these checksums out to userspace, and if so we ought to have a generic way of doing it. Yes, that needs a new interface. Hence, AIO attributes. *No.* Start with the non-AIO case first. Why? It is orthogonal to AIO (and I should make that clearer), but to do it for sync IO we'd need new syscalls that take an extra argument so IMO it's a bit easier to start with AIO. Might be worth implementing the sync interface sooner rather than later just to discover any potential issues, I suppose. * FUTURE STUFF: Return values: Some attributes are probably going to want to return something to userspace. If nothing else, we want this so that userspace can tell if anything handled the attributes it specified - as dynamic as the io stack can be, with something extensible like this there really isn't any generic way of knowing ahead of time if something is going to interpret any attribute - we want to return at least an error code. Seems odd to me. Why not expose supported attributes via some other call? fcntl? It's not possible in general - consider stacking block devices, and attrs that are supported only by specific block drivers. I.e. if you've got lvm on top of bcache or bcache on top of md, we can pass the attr down with the IO but we can't determine ahead of time, in general, where the IO is going to go. But that probably isn't true for most attrs so it probably would be a good idea to have an interface for querying what's supported, and even for device specific ones you could query what a device supports. One could imagine sticking the return in the attribute itself, but I don't want to do this. For some things (checksums), the attribute will contain a pointer to a buffer - that's fine. But I don't want the attributes themselves to be writeable. One could imagine that attributes don't return anything, because, well, they're properties of something else, and properties don't return anything. With a strict definition of attribute, yeah. One of the real uses cases we have for this is per IO timings, for aio - right now we've got an interface for the kernel to tell userspace how long a syscall took (don't think it's upstream yet - Paul's been behind that stuff), but it only really makes sense with synchronous syscalls. These AIO attributes would be useful for that too, but I'd _much_ prefer if the timing information was explicitly returned instead of using a pointer to a buffer. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC, PATCH] Extensible AIO interface
On Tue, Oct 02, 2012 at 05:20:29PM -0700, Kent Overstreet wrote: On Tue, Oct 02, 2012 at 01:41:17PM -0400, Jeff Moyer wrote: Kent Overstreet koverstr...@google.com writes: So, I and other people keep running into things where we really need to add an interface to pass some auxiliary... stuff along with a pread() or pwrite(). A few examples: * IO scheduler hints. Some userspace program wants to, per IO, specify either priorities or a cgroup - by specifying a cgroup you can have a fileserver in userspace that makes use of cfq's per cgroup bandwidth quotas. You can do this today by splitting I/O between processes and placing those processes in different cgroups. For io priority, there is ioprio_set, which incurs an extra system call, but can be used. Not elegant, but possible. Yes - those are things I'm trying to replace. Doing it that way is a real pain, both as it's a lousy interface for this and it does impact performance (ioprio_set doesn't really work too well with aio, too). * Cache hints. For bcache and other things, userspace may want to specify this data should be cached, this data should bypass the cache, etc. Please explain how you will differentiate this from posix_fadvise. Oh sorry, I think about SSD caching so much I forget to say that's what I'm talking about. posix_fadvise is for the page cache, we want something different for an SSD cache (IMO it'd be really ugly to use it for both, and posix_fadvise() can't really specifify everything we'd want to for an SSD cache). Similar discussions about posix_fadvise() are being had for marking ranges of files as volatile (i.e. useful for determining what can be evicted from a cache when space reclaim is required). https://lkml.org/lkml/2012/10/2/501 If you have requirements for specific cache management, then it might be worth seeing if you can steer an existing interface proposal for some form of cache management in the direction you need. Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC, PATCH] Extensible AIO interface
Hello, Kent. On Tue, Oct 02, 2012 at 02:41:13PM -0700, Kent Overstreet wrote: Seems to me it'd be no different from security considerations when introducing new ioctls. I.e., messy, ad hoc, easy to get wrong, but sometimes no way around it. It really has to be ad hoc if it's extensible, unfortunately. The only way of getting around _that_ would be with some kind of reflective type system, so that generic code could parse (in some fashion) the types of the various attributes, and for pointers copy the user data into the kernel and do whatever access controls in generic code. I'm not userland API expert by any stretch of imagination but the basic mechanism to pass data around seems sane to me and aio as stinky as it is seems to be the only logical stuff for IOs w/ extra attributes although alignment is always painful with any form of concatenated opaque structures. However, I don't think it's a good idea to try to implement something which is a neutral transport of opaque data between userland and lower layers. Things like that sound attractive with unlimited possibilities but reality seems to have the tendancy to make a big mess out of setups like that. So, if we're gonna do this, let's define what attributes we want to have and let them be processed at the interface layer and fed to lower layers afterwards - e.g. for cgroup context association, associate the resulting bios with the target cgroup from the aio layer. I'm quite skeptical of general usefulness of having opaque knobs to lower IO stack which don't have proper generic abstraction. Nobody can make proper use of things like that. Well, not nobody, maybe if the lower stack, the interface and the application are implemented by a single organization over relatively short span of time, maybe it would be useful for them, but that isn't something which generic interface design should focus on. It's okay to allow some side channel thing for specific hacky uses but I really hope the general design were focused around properly abstracted attributes which can be understood and handled by the upper layer. Thanks. -- tejun -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC, PATCH] Extensible AIO interface
On Wed, Oct 03, 2012 at 11:28:25AM +1000, Dave Chinner wrote: On Tue, Oct 02, 2012 at 05:20:29PM -0700, Kent Overstreet wrote: On Tue, Oct 02, 2012 at 01:41:17PM -0400, Jeff Moyer wrote: Kent Overstreet koverstr...@google.com writes: So, I and other people keep running into things where we really need to add an interface to pass some auxiliary... stuff along with a pread() or pwrite(). A few examples: * IO scheduler hints. Some userspace program wants to, per IO, specify either priorities or a cgroup - by specifying a cgroup you can have a fileserver in userspace that makes use of cfq's per cgroup bandwidth quotas. You can do this today by splitting I/O between processes and placing those processes in different cgroups. For io priority, there is ioprio_set, which incurs an extra system call, but can be used. Not elegant, but possible. Yes - those are things I'm trying to replace. Doing it that way is a real pain, both as it's a lousy interface for this and it does impact performance (ioprio_set doesn't really work too well with aio, too). * Cache hints. For bcache and other things, userspace may want to specify this data should be cached, this data should bypass the cache, etc. Please explain how you will differentiate this from posix_fadvise. Oh sorry, I think about SSD caching so much I forget to say that's what I'm talking about. posix_fadvise is for the page cache, we want something different for an SSD cache (IMO it'd be really ugly to use it for both, and posix_fadvise() can't really specifify everything we'd want to for an SSD cache). Similar discussions about posix_fadvise() are being had for marking ranges of files as volatile (i.e. useful for determining what can be evicted from a cache when space reclaim is required). https://lkml.org/lkml/2012/10/2/501 Hmm, interesting Speaking as an implementor though, hints that aren't associated with any specific IO are harder to make use of - stuff is in the cache. What you really want is to know, for a given IO, whether to cache it or not, and possibly where in the LRU to stick it. Well, it's quite possible that different implementations would have no trouble making use of those kinds of hints, I'm no doubt biased by having implemented bcache. With bcache though, cache replacement is done in terms of physical address space, not logical (i.e. the address space of the device being cached). So to handle posix_fadvise, we'd have to walk the btree and chase pointers to buckets, and modify the bucket priorities up or down... but what about the other data in those buckets? It's not clear what should happen, but there isn't any good way to take that into account. (The exception is dropping data from the cache entirely, we can just invalidate that part of the keyspace and garbage collection will reclaim the buckets they pointed to. Bcache does that for discard requests, currently). If you have requirements for specific cache management, then it might be worth seeing if you can steer an existing interface proposal for some form of cache management in the direction you need. Certainly - I don't plan on implementing anything bcache specific, or implementing anything from scratch if there's a good proposal out there. But a per-io interface does seem useful from an implementation pov and natural to use for at least some classes of applications. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC, PATCH] Extensible AIO interface
On Wed, Oct 03, 2012 at 10:41:06AM +0900, Tejun Heo wrote: Hello, Kent. On Tue, Oct 02, 2012 at 02:41:13PM -0700, Kent Overstreet wrote: Seems to me it'd be no different from security considerations when introducing new ioctls. I.e., messy, ad hoc, easy to get wrong, but sometimes no way around it. It really has to be ad hoc if it's extensible, unfortunately. The only way of getting around _that_ would be with some kind of reflective type system, so that generic code could parse (in some fashion) the types of the various attributes, and for pointers copy the user data into the kernel and do whatever access controls in generic code. I'm not userland API expert by any stretch of imagination but the basic mechanism to pass data around seems sane to me and aio as stinky as it is seems to be the only logical stuff for IOs w/ extra attributes although alignment is always painful with any form of concatenated opaque structures. However, I don't think it's a good idea to try to implement something which is a neutral transport of opaque data between userland and lower layers. Things like that sound attractive with unlimited possibilities but reality seems to have the tendancy to make a big mess out of setups like that. I don't see how the neutral transport of opaque data itself is a bad thing. We want something simple and sane to build actual interfaces on top of - once we've got that, we can either build clean generic well defined interfaces or we can make a mess like with ioctls :P It's like any other mechanism. There's good syscalls and bad syscalls... So, if we're gonna do this, let's define what attributes we want to have and let them be processed at the interface layer and fed to lower layers afterwards - e.g. for cgroup context association, associate the resulting bios with the target cgroup from the aio layer. That sounds perfectly reasonable to me (the emails with Zach ended up heading in that direction). I'm quite skeptical of general usefulness of having opaque knobs to lower IO stack which don't have proper generic abstraction. Nobody can make proper use of things like that. Well, not nobody, maybe if the lower stack, the interface and the application are implemented by a single organization over relatively short span of time, maybe it would be useful for them, but that isn't something which generic interface design should focus on. I think it could work fine, but I'm not convinced either way on what the correct approach is. Say we implement an attr to control a block layer cache. That attr could be parsed/validated in high level code (if there's any to do) - that I don't object to. But the high level code isn't going to /know/ whether there was any block cache in the stack that handled the attr. If the attr is passed down to the block cache, that block cache can return that it was handled. Or if we implement an attr that says return whether it was a cache hit or miss, and maybe other statistics. Similar thing. We could conceivably confine attrs to the upper layer, and define in kernel interfaces for passing all this stuff around, but I'm doubtful it's worth the trouble - at least if attrs themselves are implemented sanely. And honestly I think it'd make (more of) a mess of the block layer and the rest of the io stack to have to explicitly pass around cache hints/cache controlling options/whatever else we think of in the future - especially when most of this stuff isn't going to be in use most of the time. But, like I sort of mentioned in another email with Zach - passing the attrs from userspace around is _not_ mutually exclusive with standard code for parsing/validating them that exists in one place. We shouldn't have driver code reaching in and grabbing the raw attr unless there _is_ no parsing/validation to do. It's okay to allow some side channel thing for specific hacky uses but I really hope the general design were focused around properly abstracted attributes which can be understood and handled by the upper layer. Completely agreed. I want to leave that side channel open for experimentation, and so we have a way of implementing one off hacky stuff when we need to - but normal mainline stuff should be sane and well designed. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC, PATCH] Extensible AIO interface
On Mon, Oct 01, 2012 at 04:44:39PM -0700, Zach Brown wrote: > And what about duplicate instances of a given attribute id? Use the > first? The last? Error? Depends on the id? I thought of a better idea, instead of explicitly checking for disallowed dups: We want to return -ENOTHANDLED for not handled attributes anyways, so let's just do that for dups - that'll catch erronious usage just fine and a generic mechanism's better than a one off hack any day. This does mean we can't punt on return values, which isn't a bad thing. Also, if we've got duplicate attributes userspace needs to be able to figure out which return value was for which attribute. Two possibilities: one, return values come out in the same order attributes went in. That'd work, but I dislike the subtlety and I expect it'd be a pain for userspace. Instead, let's just stick a u64 cookie in the attribute and include that in the return, just like we do everywhere else. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC, PATCH] Extensible AIO interface
On Mon, Oct 01, 2012 at 04:44:39PM -0700, Zach Brown wrote: > > Not just per sector, Per hardware sector. For passing around checksums > > userspace would have to find out the hardware sector size and checksum > > type/size via a different interface, and then the attribute would > > contain a pointer to a buffer that can hold the appropriate number of > > checksums. > > All problems fall to another layer of indirection? :) Yep :) > But yes, that's > fair. I was obviously just assuming that the checksums would be in the > attribute. > > But now we're talking about layers of user pointers. Would the > attribute parser need to verify/copy pointers before downstream kernel > code tries to work with it? Would it be up to the attribute consumers > to verify the pointers that the core doesn't really touch? The generic code wouldn't know about any user pointers inside attributes, so it'd have to be downstream consumers. Hopefully there won't be many attributes with user pointers in them (I don't expect there to be), so we won't have too much of this messyness. > Are these > second pointers native (enter compat goo) or u64s? Outside the scope of the generic implementation, but AFAIK u64s are the sane way so I'd prefer to just enforce that. > > I don't think there's anything fragile about the basic idea though. Or > > do you have some way of improving upon it in mind? > > Nothing super great is springing to mind, no. > > > The idea with the size field is that it's just sizeof(the particular > > attribute struct), so when userspace is appending attributes it just > > sets size = sizeof() and attr_list->size += attr->size. > > I suppose. But this also raises the spectre of aligning the packed > attributes to match their struct definitions. It's the netlink(3) > macros all over again, right? I guess unaligned accesses aren't *that* > big a deal. But still. I was just thinking about that a few minutes ago. Since the way I'm doing it embeds struct iocb_attr inside the specific attributes, if we stick __aligned(8) on struct iocb_attr that should solve alignment. > And what about duplicate instances of a given attribute id? Use the > first? The last? Error? Depends on the id? I suspect duplicate instances will be useful and the sanest approach for a few things, so I don't want to disallow it. But - perhaps we could define whether duplicates are allowed of a given attribute along with the attribute ids, then the kernel in generic code could check for duplicates of attrs for which it was disallowed and error. I think I really like that idea. > It just seems like there are a lot of corner cases that can go wrong > with an API that is so free form. I'd like something a lot harder to > make mistakes with. > > - z > (being That Guy today, apparently :/) I think it's got to be free form to be useful, but that doesn't mean we can't avoid as many mistakes as possible ahead of time so please continue to point out anything you can think of :) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC, PATCH] Extensible AIO interface
> Not just per sector, Per hardware sector. For passing around checksums > userspace would have to find out the hardware sector size and checksum > type/size via a different interface, and then the attribute would > contain a pointer to a buffer that can hold the appropriate number of > checksums. All problems fall to another layer of indirection? :) But yes, that's fair. I was obviously just assuming that the checksums would be in the attribute. But now we're talking about layers of user pointers. Would the attribute parser need to verify/copy pointers before downstream kernel code tries to work with it? Would it be up to the attribute consumers to verify the pointers that the core doesn't really touch? Are these second pointers native (enter compat goo) or u64s? > I don't think there's anything fragile about the basic idea though. Or > do you have some way of improving upon it in mind? Nothing super great is springing to mind, no. > The idea with the size field is that it's just sizeof(the particular > attribute struct), so when userspace is appending attributes it just > sets size = sizeof() and attr_list->size += attr->size. I suppose. But this also raises the spectre of aligning the packed attributes to match their struct definitions. It's the netlink(3) macros all over again, right? I guess unaligned accesses aren't *that* big a deal. But still. And what about duplicate instances of a given attribute id? Use the first? The last? Error? Depends on the id? It just seems like there are a lot of corner cases that can go wrong with an API that is so free form. I'd like something a lot harder to make mistakes with. - z (being That Guy today, apparently :/) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC, PATCH] Extensible AIO interface
On Mon, Oct 01, 2012 at 04:12:22PM -0700, Zach Brown wrote: > On Mon, Oct 01, 2012 at 03:23:41PM -0700, Kent Overstreet wrote: > > So, I and other people keep running into things where we really need to > > add an interface to pass some auxiliary... stuff along with a pread() or > > pwrite(). > > Sure. Martin (cc:ed) will sympathize. > > > A few examples: > > > > * IO scheduler hints... > > * Cache hints... > > > > * Passing checksums out to userspace. We've got bio integrity, which is > > a (somewhat) generic interface for passing data checksums between the > > filesystem and the hardware. > > Hmm, careful here. I think that in DIF/DIX the checksums are > per-sector, not per IO, right? That'd mean that the PAGE_SIZE attr > limit in this patch would be magically creating different max IO size > limits on different architectures. That doesn't seem great. Not just per sector, Per hardware sector. For passing around checksums userspace would have to find out the hardware sector size and checksum type/size via a different interface, and then the attribute would contain a pointer to a buffer that can hold the appropriate number of checksums. > > > Hence, AIO attributes. > > I have to be honest: I really don't like tying the interface to AIO, but > I guess it's the only per-io facility we have today. It'd be nice to > include sync O_DIRECT when designing the interface to make sure that it > is possible to use generic syscalls in the future without running up > against unexpected problems. It'd certainly useful with regular sync IO, I just want to take it one step at a time particularly since for sync IO we'll probably need new syscalls. But yes you're right, it would be good to keep in mind. > > An iocb_attr has an id field, and a size field - and some amount of data > > specific to that attribute. > > I'd hope that we can come up with a less fragile interface. The kernel > would have to scan the attributes to make sure that there aren't > malicious sizes. I only quickly glanced at the loops, but it seemed > like you could have a 0 size attribute in there and _next() would spin > forever. Ouch, yeah that's wrong :/ I don't think there's anything fragile about the basic idea though. Or do you have some way of improving upon it in mind? The idea with the size field is that it's just sizeof(the particular attribute struct), so when userspace is appending attributes it just sets size = sizeof() and attr_list->size += attr->size. The kernel is going to have to sanity check the size fields of the individual attributes anyways to verify the size of the last attr doesn't extend off the end of the attr list, so I think it makes sense to keep the current semantics of the size fields and just also check that the size field is nonzero (actually >= sizeof(struct iocb_attr)). -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC, PATCH] Extensible AIO interface
On Mon, Oct 01, 2012 at 03:23:41PM -0700, Kent Overstreet wrote: > So, I and other people keep running into things where we really need to > add an interface to pass some auxiliary... stuff along with a pread() or > pwrite(). Sure. Martin (cc:ed) will sympathize. > A few examples: > > * IO scheduler hints... > * Cache hints... > > * Passing checksums out to userspace. We've got bio integrity, which is > a (somewhat) generic interface for passing data checksums between the > filesystem and the hardware. Hmm, careful here. I think that in DIF/DIX the checksums are per-sector, not per IO, right? That'd mean that the PAGE_SIZE attr limit in this patch would be magically creating different max IO size limits on different architectures. That doesn't seem great. > Hence, AIO attributes. I have to be honest: I really don't like tying the interface to AIO, but I guess it's the only per-io facility we have today. It'd be nice to include sync O_DIRECT when designing the interface to make sure that it is possible to use generic syscalls in the future without running up against unexpected problems. > An iocb_attr has an id field, and a size field - and some amount of data > specific to that attribute. I'd hope that we can come up with a less fragile interface. The kernel would have to scan the attributes to make sure that there aren't malicious sizes. I only quickly glanced at the loops, but it seemed like you could have a 0 size attribute in there and _next() would spin forever. - z -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC, PATCH] Extensible AIO interface
So, I and other people keep running into things where we really need to add an interface to pass some auxiliary... stuff along with a pread() or pwrite(). A few examples: * IO scheduler hints. Some userspace program wants to, per IO, specify either priorities or a cgroup - by specifying a cgroup you can have a fileserver in userspace that makes use of cfq's per cgroup bandwidth quotas. * Cache hints. For bcache and other things, userspace may want to specify "this data should be cached", "this data should bypass the cache", etc. * Passing checksums out to userspace. We've got bio integrity, which is a (somewhat) generic interface for passing data checksums between the filesystem and the hardware. There are various circumstances under which you may want to pass these checksums out to userspace, and if so we ought to have a generic way of doing it. Hence, AIO attributes. The way it works is I stole the reserved2 field in struct iocb. This becomes a pointer to struct iocb_attr_list. An iocb_attr_list is some number of iocb_attrs appended together, along with the total number of bytes of attributes. An iocb_attr has an id field, and a size field - and some amount of data specific to that attribute. The size fields mean we can iterate through the attributes and find one with a specific id with generic code - we don't have to know anything about the attributes we don't care about. I also added a pointer to struct iocb_attr_list to struct bio. Now, we can define new attributes, and then anywhere in the block layer (say cfq, or some driver code) we can lookup any attributes that were specified for this io. cfq can then schedule as userspace wants, or bcache can get its cache hints... That's pretty much it, for the moment. It's intended to be simple and extensible. * FUTURE STUFF: Return values: Some attributes are probably going to want to return something to userspace. If nothing else, we want this so that userspace can tell if anything handled the attributes it specified - as dynamic as the io stack can be, with something extensible like this there really isn't any generic way of knowing ahead of time if something is going to interpret any attribute - we want to return at least an error code. One could imagine sticking the return in the attribute itself, but I don't want to do this. For some things (checksums), the attribute will contain a pointer to a buffer - that's fine. But I don't want the attributes themselves to be writeable. The reason for this is that struct iocb point to the attributes isn't completely ideal, and is IMO a stopgap solution. It would be better if the attributes were inline with the iocbs. But for this we may (?) want new aio syscalls to replace io_submit()/io_getevents(), but that isn't something I want to rush into. If the attributes are inline with the iocbs, the good and natural thing to do for return values is stick them inline with the io_event completions. But I'm not quite sure what I want to do in the meantime, if we end up needing return values in the short term. commit 34232e6f28112f049d633f4ecd2d34e536f8c7cf Author: Kent Overstreet Date: Mon Oct 1 13:17:56 2012 -0700 Extensible AIO interface diff --git a/fs/aio.c b/fs/aio.c index 71f613c..54acc17 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -424,6 +424,7 @@ static struct kiocb *__aio_get_req(struct kioctx *ctx) req->private = NULL; req->ki_iovec = NULL; INIT_LIST_HEAD(>ki_run_list); + req->ki_attrs = NULL; req->ki_eventfd = NULL; return req; @@ -538,6 +539,7 @@ static inline void really_put_req(struct kioctx *ctx, struct kiocb *req) req->ki_dtor(req); if (req->ki_iovec != >ki_inline_vec) kfree(req->ki_iovec); + kfree(req->ki_attrs); kmem_cache_free(kiocb_cachep, req); ctx->reqs_active--; @@ -1504,6 +1506,33 @@ static ssize_t aio_setup_iocb(struct kiocb *kiocb, bool compat) return 0; } +static int aio_setup_attrs(struct iocb *iocb, struct kiocb *req) +{ + u64 size; + + if (!iocb->aio_attrs) + return 0; + + if (unlikely(get_user(size, (u64 *) iocb->aio_attrs))) + return -EFAULT; + + if (unlikely(size > PAGE_SIZE)) + return -EFAULT; + + req->ki_attrs = kmalloc(size, GFP_KERNEL); + if (unlikely(!req->ki_attrs)) + return -ENOMEM; + + if (unlikely(copy_from_user(req->ki_attrs, + (void *) iocb->aio_attrs, size))) { + kfree(req->ki_attrs); + req->ki_attrs = NULL; + return -EFAULT; + } + + return 0; +} + static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb, struct iocb *iocb, struct kiocb_batch *batch, bool compat) @@ -1513,7 +1542,9 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb, ssize_t
[RFC, PATCH] Extensible AIO interface
So, I and other people keep running into things where we really need to add an interface to pass some auxiliary... stuff along with a pread() or pwrite(). A few examples: * IO scheduler hints. Some userspace program wants to, per IO, specify either priorities or a cgroup - by specifying a cgroup you can have a fileserver in userspace that makes use of cfq's per cgroup bandwidth quotas. * Cache hints. For bcache and other things, userspace may want to specify this data should be cached, this data should bypass the cache, etc. * Passing checksums out to userspace. We've got bio integrity, which is a (somewhat) generic interface for passing data checksums between the filesystem and the hardware. There are various circumstances under which you may want to pass these checksums out to userspace, and if so we ought to have a generic way of doing it. Hence, AIO attributes. The way it works is I stole the reserved2 field in struct iocb. This becomes a pointer to struct iocb_attr_list. An iocb_attr_list is some number of iocb_attrs appended together, along with the total number of bytes of attributes. An iocb_attr has an id field, and a size field - and some amount of data specific to that attribute. The size fields mean we can iterate through the attributes and find one with a specific id with generic code - we don't have to know anything about the attributes we don't care about. I also added a pointer to struct iocb_attr_list to struct bio. Now, we can define new attributes, and then anywhere in the block layer (say cfq, or some driver code) we can lookup any attributes that were specified for this io. cfq can then schedule as userspace wants, or bcache can get its cache hints... That's pretty much it, for the moment. It's intended to be simple and extensible. * FUTURE STUFF: Return values: Some attributes are probably going to want to return something to userspace. If nothing else, we want this so that userspace can tell if anything handled the attributes it specified - as dynamic as the io stack can be, with something extensible like this there really isn't any generic way of knowing ahead of time if something is going to interpret any attribute - we want to return at least an error code. One could imagine sticking the return in the attribute itself, but I don't want to do this. For some things (checksums), the attribute will contain a pointer to a buffer - that's fine. But I don't want the attributes themselves to be writeable. The reason for this is that struct iocb point to the attributes isn't completely ideal, and is IMO a stopgap solution. It would be better if the attributes were inline with the iocbs. But for this we may (?) want new aio syscalls to replace io_submit()/io_getevents(), but that isn't something I want to rush into. If the attributes are inline with the iocbs, the good and natural thing to do for return values is stick them inline with the io_event completions. But I'm not quite sure what I want to do in the meantime, if we end up needing return values in the short term. commit 34232e6f28112f049d633f4ecd2d34e536f8c7cf Author: Kent Overstreet koverstr...@google.com Date: Mon Oct 1 13:17:56 2012 -0700 Extensible AIO interface diff --git a/fs/aio.c b/fs/aio.c index 71f613c..54acc17 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -424,6 +424,7 @@ static struct kiocb *__aio_get_req(struct kioctx *ctx) req-private = NULL; req-ki_iovec = NULL; INIT_LIST_HEAD(req-ki_run_list); + req-ki_attrs = NULL; req-ki_eventfd = NULL; return req; @@ -538,6 +539,7 @@ static inline void really_put_req(struct kioctx *ctx, struct kiocb *req) req-ki_dtor(req); if (req-ki_iovec != req-ki_inline_vec) kfree(req-ki_iovec); + kfree(req-ki_attrs); kmem_cache_free(kiocb_cachep, req); ctx-reqs_active--; @@ -1504,6 +1506,33 @@ static ssize_t aio_setup_iocb(struct kiocb *kiocb, bool compat) return 0; } +static int aio_setup_attrs(struct iocb *iocb, struct kiocb *req) +{ + u64 size; + + if (!iocb-aio_attrs) + return 0; + + if (unlikely(get_user(size, (u64 *) iocb-aio_attrs))) + return -EFAULT; + + if (unlikely(size PAGE_SIZE)) + return -EFAULT; + + req-ki_attrs = kmalloc(size, GFP_KERNEL); + if (unlikely(!req-ki_attrs)) + return -ENOMEM; + + if (unlikely(copy_from_user(req-ki_attrs, + (void *) iocb-aio_attrs, size))) { + kfree(req-ki_attrs); + req-ki_attrs = NULL; + return -EFAULT; + } + + return 0; +} + static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb, struct iocb *iocb, struct kiocb_batch *batch, bool compat) @@ -1513,7 +1542,9 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
Re: [RFC, PATCH] Extensible AIO interface
On Mon, Oct 01, 2012 at 03:23:41PM -0700, Kent Overstreet wrote: So, I and other people keep running into things where we really need to add an interface to pass some auxiliary... stuff along with a pread() or pwrite(). Sure. Martin (cc:ed) will sympathize. A few examples: * IO scheduler hints... * Cache hints... * Passing checksums out to userspace. We've got bio integrity, which is a (somewhat) generic interface for passing data checksums between the filesystem and the hardware. Hmm, careful here. I think that in DIF/DIX the checksums are per-sector, not per IO, right? That'd mean that the PAGE_SIZE attr limit in this patch would be magically creating different max IO size limits on different architectures. That doesn't seem great. Hence, AIO attributes. I have to be honest: I really don't like tying the interface to AIO, but I guess it's the only per-io facility we have today. It'd be nice to include sync O_DIRECT when designing the interface to make sure that it is possible to use generic syscalls in the future without running up against unexpected problems. An iocb_attr has an id field, and a size field - and some amount of data specific to that attribute. I'd hope that we can come up with a less fragile interface. The kernel would have to scan the attributes to make sure that there aren't malicious sizes. I only quickly glanced at the loops, but it seemed like you could have a 0 size attribute in there and _next() would spin forever. - z -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC, PATCH] Extensible AIO interface
On Mon, Oct 01, 2012 at 04:12:22PM -0700, Zach Brown wrote: On Mon, Oct 01, 2012 at 03:23:41PM -0700, Kent Overstreet wrote: So, I and other people keep running into things where we really need to add an interface to pass some auxiliary... stuff along with a pread() or pwrite(). Sure. Martin (cc:ed) will sympathize. A few examples: * IO scheduler hints... * Cache hints... * Passing checksums out to userspace. We've got bio integrity, which is a (somewhat) generic interface for passing data checksums between the filesystem and the hardware. Hmm, careful here. I think that in DIF/DIX the checksums are per-sector, not per IO, right? That'd mean that the PAGE_SIZE attr limit in this patch would be magically creating different max IO size limits on different architectures. That doesn't seem great. Not just per sector, Per hardware sector. For passing around checksums userspace would have to find out the hardware sector size and checksum type/size via a different interface, and then the attribute would contain a pointer to a buffer that can hold the appropriate number of checksums. Hence, AIO attributes. I have to be honest: I really don't like tying the interface to AIO, but I guess it's the only per-io facility we have today. It'd be nice to include sync O_DIRECT when designing the interface to make sure that it is possible to use generic syscalls in the future without running up against unexpected problems. It'd certainly useful with regular sync IO, I just want to take it one step at a time particularly since for sync IO we'll probably need new syscalls. But yes you're right, it would be good to keep in mind. An iocb_attr has an id field, and a size field - and some amount of data specific to that attribute. I'd hope that we can come up with a less fragile interface. The kernel would have to scan the attributes to make sure that there aren't malicious sizes. I only quickly glanced at the loops, but it seemed like you could have a 0 size attribute in there and _next() would spin forever. Ouch, yeah that's wrong :/ I don't think there's anything fragile about the basic idea though. Or do you have some way of improving upon it in mind? The idea with the size field is that it's just sizeof(the particular attribute struct), so when userspace is appending attributes it just sets size = sizeof() and attr_list-size += attr-size. The kernel is going to have to sanity check the size fields of the individual attributes anyways to verify the size of the last attr doesn't extend off the end of the attr list, so I think it makes sense to keep the current semantics of the size fields and just also check that the size field is nonzero (actually = sizeof(struct iocb_attr)). -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC, PATCH] Extensible AIO interface
Not just per sector, Per hardware sector. For passing around checksums userspace would have to find out the hardware sector size and checksum type/size via a different interface, and then the attribute would contain a pointer to a buffer that can hold the appropriate number of checksums. All problems fall to another layer of indirection? :) But yes, that's fair. I was obviously just assuming that the checksums would be in the attribute. But now we're talking about layers of user pointers. Would the attribute parser need to verify/copy pointers before downstream kernel code tries to work with it? Would it be up to the attribute consumers to verify the pointers that the core doesn't really touch? Are these second pointers native (enter compat goo) or u64s? I don't think there's anything fragile about the basic idea though. Or do you have some way of improving upon it in mind? Nothing super great is springing to mind, no. The idea with the size field is that it's just sizeof(the particular attribute struct), so when userspace is appending attributes it just sets size = sizeof() and attr_list-size += attr-size. I suppose. But this also raises the spectre of aligning the packed attributes to match their struct definitions. It's the netlink(3) macros all over again, right? I guess unaligned accesses aren't *that* big a deal. But still. And what about duplicate instances of a given attribute id? Use the first? The last? Error? Depends on the id? It just seems like there are a lot of corner cases that can go wrong with an API that is so free form. I'd like something a lot harder to make mistakes with. - z (being That Guy today, apparently :/) -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC, PATCH] Extensible AIO interface
On Mon, Oct 01, 2012 at 04:44:39PM -0700, Zach Brown wrote: Not just per sector, Per hardware sector. For passing around checksums userspace would have to find out the hardware sector size and checksum type/size via a different interface, and then the attribute would contain a pointer to a buffer that can hold the appropriate number of checksums. All problems fall to another layer of indirection? :) Yep :) But yes, that's fair. I was obviously just assuming that the checksums would be in the attribute. But now we're talking about layers of user pointers. Would the attribute parser need to verify/copy pointers before downstream kernel code tries to work with it? Would it be up to the attribute consumers to verify the pointers that the core doesn't really touch? The generic code wouldn't know about any user pointers inside attributes, so it'd have to be downstream consumers. Hopefully there won't be many attributes with user pointers in them (I don't expect there to be), so we won't have too much of this messyness. Are these second pointers native (enter compat goo) or u64s? Outside the scope of the generic implementation, but AFAIK u64s are the sane way so I'd prefer to just enforce that. I don't think there's anything fragile about the basic idea though. Or do you have some way of improving upon it in mind? Nothing super great is springing to mind, no. The idea with the size field is that it's just sizeof(the particular attribute struct), so when userspace is appending attributes it just sets size = sizeof() and attr_list-size += attr-size. I suppose. But this also raises the spectre of aligning the packed attributes to match their struct definitions. It's the netlink(3) macros all over again, right? I guess unaligned accesses aren't *that* big a deal. But still. I was just thinking about that a few minutes ago. Since the way I'm doing it embeds struct iocb_attr inside the specific attributes, if we stick __aligned(8) on struct iocb_attr that should solve alignment. And what about duplicate instances of a given attribute id? Use the first? The last? Error? Depends on the id? I suspect duplicate instances will be useful and the sanest approach for a few things, so I don't want to disallow it. But - perhaps we could define whether duplicates are allowed of a given attribute along with the attribute ids, then the kernel in generic code could check for duplicates of attrs for which it was disallowed and error. I think I really like that idea. It just seems like there are a lot of corner cases that can go wrong with an API that is so free form. I'd like something a lot harder to make mistakes with. - z (being That Guy today, apparently :/) I think it's got to be free form to be useful, but that doesn't mean we can't avoid as many mistakes as possible ahead of time so please continue to point out anything you can think of :) -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC, PATCH] Extensible AIO interface
On Mon, Oct 01, 2012 at 04:44:39PM -0700, Zach Brown wrote: And what about duplicate instances of a given attribute id? Use the first? The last? Error? Depends on the id? I thought of a better idea, instead of explicitly checking for disallowed dups: We want to return -ENOTHANDLED for not handled attributes anyways, so let's just do that for dups - that'll catch erronious usage just fine and a generic mechanism's better than a one off hack any day. This does mean we can't punt on return values, which isn't a bad thing. Also, if we've got duplicate attributes userspace needs to be able to figure out which return value was for which attribute. Two possibilities: one, return values come out in the same order attributes went in. That'd work, but I dislike the subtlety and I expect it'd be a pain for userspace. Instead, let's just stick a u64 cookie in the attribute and include that in the return, just like we do everywhere else. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/