Re: [RFC, PATCH] Extensible AIO interface

2012-10-04 Thread Kent Overstreet
On Thu, Oct 04, 2012 at 06:58:06AM +0900, Tejun Heo wrote:
> Hello, Kent.
> 
> On Tue, Oct 02, 2012 at 08:00:20PM -0700, Kent Overstreet wrote:
> > > However, I don't think it's a good idea to try to implement something
> > > which is a neutral transport of opaque data between userland and lower
> > > layers.  Things like that sound attractive with unlimited
> > > possibilities but reality seems to have the tendancy to make a big
> > > mess out of setups like that.
> > 
> > I don't see how the "neutral transport of opaque data" itself is a bad
> > thing. We want something simple and sane to build actual interfaces on
> > top of - once we've got that, we can either build clean generic well
> > defined interfaces or we can make a mess like with ioctls :P
> > 
> > It's like any other mechanism. There's good syscalls and bad syscalls...
> 
> Depending on what a feature aims for, the design and implementation
> vary greatly.  If you go for completely generic extensible stuff which
> can be used to warp space-time continuum, it's easy to end up with a
> monstrosity with generic and programmable parser, verifier, accessor
> and so on.

I don't think that's concrete enough that I can comment - I think this
is becoming too abstract.

You didn't have any complaints when I showed you the code I posted, I
don't plan on making it really any more complicated than that - I think
we do need explicit return values but honestly that makes it less
generic.


> > Say we implement an attr to control a block layer cache. That attr could
> > be parsed/validated in high level code (if there's any to do) - that I
> > don't object to. But the high level code isn't going to /know/ whether
> > there was any block cache in the stack that handled the attr. If the
> > attr is passed down to the block cache, that block cache can return that
> > it was handled.
> 
> My point is that if it doesn't fit the generic abstract model as in
> fadvise(2), it probably isn't worth supporting in any generic manner.

How so? Do you mean the file range part? I think that's orthogonal to
the rest (the hints fadvise specifies could be used per IO or with a
file range like they are now), but the hints themselves are inadequate
for SSD caches.

> > > It's okay to allow some side channel thing for specific hacky uses but
> > > I really hope the general design were focused around properly
> > > abstracted attributes which can be understood and handled by the upper
> > > layer.
> > 
> > Completely agreed. I want to leave that side channel open for
> > experimentation, and so we have a way of implementing one off hacky
> > stuff when we need to - but normal mainline stuff should be sane and
> > well designed.
> 
> So, I think we can aim for something simple and modest (the only thing
> I can think of at the moment is task association) and provide simple
> framework which can be used for specific custom usages.  Let's please
> not go overboard with generic parser / verifier which supports pointer
> indirection or whatnot.

I wasn't seriously proposing implementing a generic parser/verifier -
certainly not just for this, that was idle musing; all I'm saying is
that when an attr needs parsing/verification, that should be done in the
attr code, not driver code.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC, PATCH] Extensible AIO interface

2012-10-04 Thread Kent Overstreet
On Wed, Oct 03, 2012 at 03:15:26PM -0400, Jeff Moyer wrote:
> Kent Overstreet  writes:
> 
> > On Tue, Oct 02, 2012 at 01:41:17PM -0400, Jeff Moyer wrote:
> >> Kent Overstreet  writes:
> >> 
> >> > So, I and other people keep running into things where we really need to
> >> > add an interface to pass some auxiliary... stuff along with a pread() or
> >> > pwrite().
> >> >
> >> > A few examples:
> >> >
> >> > * IO scheduler hints. Some userspace program wants to, per IO, specify
> >> > either priorities or a cgroup - by specifying a cgroup you can have a
> >> > fileserver in userspace that makes use of cfq's per cgroup bandwidth
> >> > quotas.
> >> 
> >> You can do this today by splitting I/O between processes and placing
> >> those processes in different cgroups.  For io priority, there is
> >> ioprio_set, which incurs an extra system call, but can be used.  Not
> >> elegant, but possible.
> >
> > Yes - those are things I'm trying to replace. Doing it that way is a
> > real pain, both as it's a lousy interface for this and it does impact
> > performance (ioprio_set doesn't really work too well with aio, too).
> 
> ioprio_set works fine with aio, since the I/O is issued in the caller's
> context.  Perhaps you're thinking of writeback I/O?

Until you want to issue different IOs with different priorities...

> >> > * Cache hints. For bcache and other things, userspace may want to specify
> >> > "this data should be cached", "this data should bypass the cache", etc.
> >> 
> >> Please explain how you will differentiate this from posix_fadvise.
> >
> > Oh sorry, I think about SSD caching so much I forget to say that's what
> > I'm talking about. posix_fadvise is for the page cache, we want
> > something different for an SSD cache (IMO it'd be really ugly to use it
> > for both, and posix_fadvise() can't really specifify everything we'd
> > want to for an SSD cache).
> 
> DESCRIPTION
>Programs can use posix_fadvise() to announce an intention to
>access file data in a specific pattern in the future, thus
>allowing the kernel to perform appropriate optimizations.
> 
> That description seems broad enough to include disk caches as well.  You
> haven't exactly stated what's missing.

It _could_ work for SSD caches, but that doesn't mean you'd want it to -
it doesn't have any way of specifying which cache you want the hint to
apply to, and there are certainly circumstances under which you
_wouldn't_ want it to apply to both.

And making it apply to SSD caches would be silently changing the
behavior, and also like I mentioned it's not sufficient for SSD caches.

> >> > Hence, AIO attributes.
> >> 
> >> *No.*  Start with the non-AIO case first.
> >
> > Why? It is orthogonal to AIO (and I should make that clearer), but to do
> > it for sync IO we'd need new syscalls that take an extra argument so IMO
> > it's a bit easier to start with AIO.
> >
> > Might be worth implementing the sync interface sooner rather than later
> > just to discover any potential issues, I suppose.
> 
> Looking back to preadv and pwritev, it was wrong to only add them to
> libaio (and that later got corrected).  I'd just like to see things
> start out with the sync interfaces, since you'll get more eyes on the
> code (not everyone cares about aio) and that helps to work out any
> interface issues.

I agree we don't want to leave out sync versions, but honestly this
stuff is more useful with AIO and that's the easier place to start.

> > It's not possible in general - consider stacking block devices, and
> > attrs that are supported only by specific block drivers. I.e. if you've
> > got lvm on top of bcache or bcache on top of md, we can pass the attr
> > down with the IO but we can't determine ahead of time, in general, where
> > the IO is going to go.
> 
> If the io stack is static (meaning you setup a device once, then open it
> and do io to it, and it doesn't change while you're doing io), how would
> you not know where the IO is going to go?

With something like dm, md or bcache - you've got multiple underlying
devices, and of those underlying devices which one the IO goes to is not
something you can in general predict ahead of time.

> > But that probably isn't true for most attrs so it probably would be a
> > good idea to have an interface for querying what's supported, and even
> > for device specific ones you could query what a device supports.
> 
> OK.
> 
> >> > One could imagine sticking the return in the attribute itself, but I
> >> > don't want to do this. For some things (checksums), the attribute will
> >> > contain a pointer to a buffer - that's fine. But I don't want the
> >> > attributes themselves to be writeable.
> >> 
> >> One could imagine that attributes don't return anything, because, well,
> >> they're properties of something else, and properties don't return
> >> anything.
> >
> > With a strict definition of attribute, yeah. One of the real uses cases
> > we have for this is per IO timings, for aio - 

Re: [RFC, PATCH] Extensible AIO interface

2012-10-04 Thread Kent Overstreet
On Wed, Oct 03, 2012 at 03:15:26PM -0400, Jeff Moyer wrote:
 Kent Overstreet koverstr...@google.com writes:
 
  On Tue, Oct 02, 2012 at 01:41:17PM -0400, Jeff Moyer wrote:
  Kent Overstreet koverstr...@google.com writes:
  
   So, I and other people keep running into things where we really need to
   add an interface to pass some auxiliary... stuff along with a pread() or
   pwrite().
  
   A few examples:
  
   * IO scheduler hints. Some userspace program wants to, per IO, specify
   either priorities or a cgroup - by specifying a cgroup you can have a
   fileserver in userspace that makes use of cfq's per cgroup bandwidth
   quotas.
  
  You can do this today by splitting I/O between processes and placing
  those processes in different cgroups.  For io priority, there is
  ioprio_set, which incurs an extra system call, but can be used.  Not
  elegant, but possible.
 
  Yes - those are things I'm trying to replace. Doing it that way is a
  real pain, both as it's a lousy interface for this and it does impact
  performance (ioprio_set doesn't really work too well with aio, too).
 
 ioprio_set works fine with aio, since the I/O is issued in the caller's
 context.  Perhaps you're thinking of writeback I/O?

Until you want to issue different IOs with different priorities...

   * Cache hints. For bcache and other things, userspace may want to specify
   this data should be cached, this data should bypass the cache, etc.
  
  Please explain how you will differentiate this from posix_fadvise.
 
  Oh sorry, I think about SSD caching so much I forget to say that's what
  I'm talking about. posix_fadvise is for the page cache, we want
  something different for an SSD cache (IMO it'd be really ugly to use it
  for both, and posix_fadvise() can't really specifify everything we'd
  want to for an SSD cache).
 
 DESCRIPTION
Programs can use posix_fadvise() to announce an intention to
access file data in a specific pattern in the future, thus
allowing the kernel to perform appropriate optimizations.
 
 That description seems broad enough to include disk caches as well.  You
 haven't exactly stated what's missing.

It _could_ work for SSD caches, but that doesn't mean you'd want it to -
it doesn't have any way of specifying which cache you want the hint to
apply to, and there are certainly circumstances under which you
_wouldn't_ want it to apply to both.

And making it apply to SSD caches would be silently changing the
behavior, and also like I mentioned it's not sufficient for SSD caches.

   Hence, AIO attributes.
  
  *No.*  Start with the non-AIO case first.
 
  Why? It is orthogonal to AIO (and I should make that clearer), but to do
  it for sync IO we'd need new syscalls that take an extra argument so IMO
  it's a bit easier to start with AIO.
 
  Might be worth implementing the sync interface sooner rather than later
  just to discover any potential issues, I suppose.
 
 Looking back to preadv and pwritev, it was wrong to only add them to
 libaio (and that later got corrected).  I'd just like to see things
 start out with the sync interfaces, since you'll get more eyes on the
 code (not everyone cares about aio) and that helps to work out any
 interface issues.

I agree we don't want to leave out sync versions, but honestly this
stuff is more useful with AIO and that's the easier place to start.

  It's not possible in general - consider stacking block devices, and
  attrs that are supported only by specific block drivers. I.e. if you've
  got lvm on top of bcache or bcache on top of md, we can pass the attr
  down with the IO but we can't determine ahead of time, in general, where
  the IO is going to go.
 
 If the io stack is static (meaning you setup a device once, then open it
 and do io to it, and it doesn't change while you're doing io), how would
 you not know where the IO is going to go?

With something like dm, md or bcache - you've got multiple underlying
devices, and of those underlying devices which one the IO goes to is not
something you can in general predict ahead of time.

  But that probably isn't true for most attrs so it probably would be a
  good idea to have an interface for querying what's supported, and even
  for device specific ones you could query what a device supports.
 
 OK.
 
   One could imagine sticking the return in the attribute itself, but I
   don't want to do this. For some things (checksums), the attribute will
   contain a pointer to a buffer - that's fine. But I don't want the
   attributes themselves to be writeable.
  
  One could imagine that attributes don't return anything, because, well,
  they're properties of something else, and properties don't return
  anything.
 
  With a strict definition of attribute, yeah. One of the real uses cases
  we have for this is per IO timings, for aio - right now we've got an
  interface for the kernel to tell userspace how long a syscall took
  (don't think it's upstream yet - Paul's been behind that 

Re: [RFC, PATCH] Extensible AIO interface

2012-10-04 Thread Kent Overstreet
On Thu, Oct 04, 2012 at 06:58:06AM +0900, Tejun Heo wrote:
 Hello, Kent.
 
 On Tue, Oct 02, 2012 at 08:00:20PM -0700, Kent Overstreet wrote:
   However, I don't think it's a good idea to try to implement something
   which is a neutral transport of opaque data between userland and lower
   layers.  Things like that sound attractive with unlimited
   possibilities but reality seems to have the tendancy to make a big
   mess out of setups like that.
  
  I don't see how the neutral transport of opaque data itself is a bad
  thing. We want something simple and sane to build actual interfaces on
  top of - once we've got that, we can either build clean generic well
  defined interfaces or we can make a mess like with ioctls :P
  
  It's like any other mechanism. There's good syscalls and bad syscalls...
 
 Depending on what a feature aims for, the design and implementation
 vary greatly.  If you go for completely generic extensible stuff which
 can be used to warp space-time continuum, it's easy to end up with a
 monstrosity with generic and programmable parser, verifier, accessor
 and so on.

I don't think that's concrete enough that I can comment - I think this
is becoming too abstract.

You didn't have any complaints when I showed you the code I posted, I
don't plan on making it really any more complicated than that - I think
we do need explicit return values but honestly that makes it less
generic.


  Say we implement an attr to control a block layer cache. That attr could
  be parsed/validated in high level code (if there's any to do) - that I
  don't object to. But the high level code isn't going to /know/ whether
  there was any block cache in the stack that handled the attr. If the
  attr is passed down to the block cache, that block cache can return that
  it was handled.
 
 My point is that if it doesn't fit the generic abstract model as in
 fadvise(2), it probably isn't worth supporting in any generic manner.

How so? Do you mean the file range part? I think that's orthogonal to
the rest (the hints fadvise specifies could be used per IO or with a
file range like they are now), but the hints themselves are inadequate
for SSD caches.

   It's okay to allow some side channel thing for specific hacky uses but
   I really hope the general design were focused around properly
   abstracted attributes which can be understood and handled by the upper
   layer.
  
  Completely agreed. I want to leave that side channel open for
  experimentation, and so we have a way of implementing one off hacky
  stuff when we need to - but normal mainline stuff should be sane and
  well designed.
 
 So, I think we can aim for something simple and modest (the only thing
 I can think of at the moment is task association) and provide simple
 framework which can be used for specific custom usages.  Let's please
 not go overboard with generic parser / verifier which supports pointer
 indirection or whatnot.

I wasn't seriously proposing implementing a generic parser/verifier -
certainly not just for this, that was idle musing; all I'm saying is
that when an attr needs parsing/verification, that should be done in the
attr code, not driver code.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC, PATCH] Extensible AIO interface

2012-10-03 Thread Dave Chinner
On Tue, Oct 02, 2012 at 07:41:10PM -0700, Kent Overstreet wrote:
> On Wed, Oct 03, 2012 at 11:28:25AM +1000, Dave Chinner wrote:
> > On Tue, Oct 02, 2012 at 05:20:29PM -0700, Kent Overstreet wrote:
> > > On Tue, Oct 02, 2012 at 01:41:17PM -0400, Jeff Moyer wrote:
> > > > Kent Overstreet  writes:
> > > > 
> > > > > So, I and other people keep running into things where we really need 
> > > > > to
> > > > > add an interface to pass some auxiliary... stuff along with a pread() 
> > > > > or
> > > > > pwrite().
> > > > >
> > > > > A few examples:
> > > > >
> > > > > * IO scheduler hints. Some userspace program wants to, per IO, specify
> > > > > either priorities or a cgroup - by specifying a cgroup you can have a
> > > > > fileserver in userspace that makes use of cfq's per cgroup bandwidth
> > > > > quotas.
> > > > 
> > > > You can do this today by splitting I/O between processes and placing
> > > > those processes in different cgroups.  For io priority, there is
> > > > ioprio_set, which incurs an extra system call, but can be used.  Not
> > > > elegant, but possible.
> > > 
> > > Yes - those are things I'm trying to replace. Doing it that way is a
> > > real pain, both as it's a lousy interface for this and it does impact
> > > performance (ioprio_set doesn't really work too well with aio, too).
> > > 
> > > > > * Cache hints. For bcache and other things, userspace may want to 
> > > > > specify
> > > > > "this data should be cached", "this data should bypass the cache", 
> > > > > etc.
> > > > 
> > > > Please explain how you will differentiate this from posix_fadvise.
> > > 
> > > Oh sorry, I think about SSD caching so much I forget to say that's what
> > > I'm talking about. posix_fadvise is for the page cache, we want
> > > something different for an SSD cache (IMO it'd be really ugly to use it
> > > for both, and posix_fadvise() can't really specifify everything we'd
> > > want to for an SSD cache).
> > 
> > Similar discussions about posix_fadvise() are being had for marking
> > ranges of files as volatile (i.e. useful for determining what can be
> > evicted from a cache when space reclaim is required).
> > 
> > https://lkml.org/lkml/2012/10/2/501
> 
> Hmm, interesting
> 
> Speaking as an implementor though, hints that aren't associated with any
> specific IO are harder to make use of - stuff is in the cache. What you
> really want is to know, for a given IO, whether to cache it or not, and
> possibly where in the LRU to stick it.

I can see how it might be useful, but it needs to have a defined set
of attributes that a file IO is allowed to have. If you don't define
the set, then what really have is an arbitrary set of storage-device
specific interfaces.

Of course, once we have a defined set of per-file IO policy
attributes, we don't really need per-IO attributes - you can just
set them through a range interface like fadvise() or fallocate().

> Well, it's quite possible that different implementations would have no
> trouble making use of those kinds of hints, I'm no doubt biased by
> having implemented bcache. With bcache though, cache replacement is done
> in terms of physical address space, not logical (i.e. the address space
> of the device being cached). 
> 
> So to handle posix_fadvise, we'd have to walk the btree and chase
> pointers to buckets, and modify the bucket priorities up or down... but
> what about the other data in those buckets? It's not clear what should
> happen, but there isn't any good way to take that into account.
> 
> (The exception is dropping data from the cache entirely, we can just
> invalidate that part of the keyspace and garbage collection will reclaim
> the buckets they pointed to. Bcache does that for discard requests,
> currently).

It sounds to me like you are saying is that the design of bcache is
unsuited to file-level management of caching policy, and that is why
you want to pass attributes directly to bcache with specific IOs. Is
that a fair summary of the problem you are describing here?

My problem with this approach has nothing to do with the per-IO
nature of it - it's to do with the layering violations and the
amount of storage specific knowledge needed to make effective use of
it. i.e. it seems like an interface that can only be used by people
intimately familiar with underlying storage implementation. You
happen to be one of those people, so I figure that you don't see a
problem with that. ;)

However, it also implies that an application must understand and use
a specific storage configuration that matches the attributes an
application sends. I understand how this model is appealling to
Google because they control the whole application and storage stack
(hardware and software) from top to bottom. However, I just don't
think that it is a solution that the rest of the world can use
effectively.

The scope of data locality, aging and IO priority policy
control is much larger than just controlling SSD caching.
SSD caching is just a another implementation of 

Re: [RFC, PATCH] Extensible AIO interface

2012-10-03 Thread Tejun Heo
Hello, Kent.

On Tue, Oct 02, 2012 at 08:00:20PM -0700, Kent Overstreet wrote:
> > However, I don't think it's a good idea to try to implement something
> > which is a neutral transport of opaque data between userland and lower
> > layers.  Things like that sound attractive with unlimited
> > possibilities but reality seems to have the tendancy to make a big
> > mess out of setups like that.
> 
> I don't see how the "neutral transport of opaque data" itself is a bad
> thing. We want something simple and sane to build actual interfaces on
> top of - once we've got that, we can either build clean generic well
> defined interfaces or we can make a mess like with ioctls :P
> 
> It's like any other mechanism. There's good syscalls and bad syscalls...

Depending on what a feature aims for, the design and implementation
vary greatly.  If you go for completely generic extensible stuff which
can be used to warp space-time continuum, it's easy to end up with a
monstrosity with generic and programmable parser, verifier, accessor
and so on.

> Say we implement an attr to control a block layer cache. That attr could
> be parsed/validated in high level code (if there's any to do) - that I
> don't object to. But the high level code isn't going to /know/ whether
> there was any block cache in the stack that handled the attr. If the
> attr is passed down to the block cache, that block cache can return that
> it was handled.

My point is that if it doesn't fit the generic abstract model as in
fadvise(2), it probably isn't worth supporting in any generic manner.

> > It's okay to allow some side channel thing for specific hacky uses but
> > I really hope the general design were focused around properly
> > abstracted attributes which can be understood and handled by the upper
> > layer.
> 
> Completely agreed. I want to leave that side channel open for
> experimentation, and so we have a way of implementing one off hacky
> stuff when we need to - but normal mainline stuff should be sane and
> well designed.

So, I think we can aim for something simple and modest (the only thing
I can think of at the moment is task association) and provide simple
framework which can be used for specific custom usages.  Let's please
not go overboard with generic parser / verifier which supports pointer
indirection or whatnot.

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC, PATCH] Extensible AIO interface

2012-10-03 Thread Jeff Moyer
Kent Overstreet  writes:

> On Tue, Oct 02, 2012 at 01:41:17PM -0400, Jeff Moyer wrote:
>> Kent Overstreet  writes:
>> 
>> > So, I and other people keep running into things where we really need to
>> > add an interface to pass some auxiliary... stuff along with a pread() or
>> > pwrite().
>> >
>> > A few examples:
>> >
>> > * IO scheduler hints. Some userspace program wants to, per IO, specify
>> > either priorities or a cgroup - by specifying a cgroup you can have a
>> > fileserver in userspace that makes use of cfq's per cgroup bandwidth
>> > quotas.
>> 
>> You can do this today by splitting I/O between processes and placing
>> those processes in different cgroups.  For io priority, there is
>> ioprio_set, which incurs an extra system call, but can be used.  Not
>> elegant, but possible.
>
> Yes - those are things I'm trying to replace. Doing it that way is a
> real pain, both as it's a lousy interface for this and it does impact
> performance (ioprio_set doesn't really work too well with aio, too).

ioprio_set works fine with aio, since the I/O is issued in the caller's
context.  Perhaps you're thinking of writeback I/O?

>> > * Cache hints. For bcache and other things, userspace may want to specify
>> > "this data should be cached", "this data should bypass the cache", etc.
>> 
>> Please explain how you will differentiate this from posix_fadvise.
>
> Oh sorry, I think about SSD caching so much I forget to say that's what
> I'm talking about. posix_fadvise is for the page cache, we want
> something different for an SSD cache (IMO it'd be really ugly to use it
> for both, and posix_fadvise() can't really specifify everything we'd
> want to for an SSD cache).

DESCRIPTION
   Programs can use posix_fadvise() to announce an intention to
   access file data in a specific pattern in the future, thus
   allowing the kernel to perform appropriate optimizations.

That description seems broad enough to include disk caches as well.  You
haven't exactly stated what's missing.

>> > Hence, AIO attributes.
>> 
>> *No.*  Start with the non-AIO case first.
>
> Why? It is orthogonal to AIO (and I should make that clearer), but to do
> it for sync IO we'd need new syscalls that take an extra argument so IMO
> it's a bit easier to start with AIO.
>
> Might be worth implementing the sync interface sooner rather than later
> just to discover any potential issues, I suppose.

Looking back to preadv and pwritev, it was wrong to only add them to
libaio (and that later got corrected).  I'd just like to see things
start out with the sync interfaces, since you'll get more eyes on the
code (not everyone cares about aio) and that helps to work out any
interface issues.

>> > * FUTURE STUFF:
>> >
>> > Return values:
>> >
>> > Some attributes are probably going to want to return something to
>> > userspace.
>> >
>> > If nothing else, we want this so that userspace can tell if anything
>> > handled the attributes it specified - as dynamic as the io stack can be,
>> > with something extensible like this there really isn't any generic way
>> > of knowing ahead of time if something is going to interpret any
>> > attribute - we want to return at least an error code.
>> 
>> Seems odd to me.  Why not expose supported attributes via some other
>> call?  fcntl?
>
> It's not possible in general - consider stacking block devices, and
> attrs that are supported only by specific block drivers. I.e. if you've
> got lvm on top of bcache or bcache on top of md, we can pass the attr
> down with the IO but we can't determine ahead of time, in general, where
> the IO is going to go.

If the io stack is static (meaning you setup a device once, then open it
and do io to it, and it doesn't change while you're doing io), how would
you not know where the IO is going to go?

> But that probably isn't true for most attrs so it probably would be a
> good idea to have an interface for querying what's supported, and even
> for device specific ones you could query what a device supports.

OK.

>> > One could imagine sticking the return in the attribute itself, but I
>> > don't want to do this. For some things (checksums), the attribute will
>> > contain a pointer to a buffer - that's fine. But I don't want the
>> > attributes themselves to be writeable.
>> 
>> One could imagine that attributes don't return anything, because, well,
>> they're properties of something else, and properties don't return
>> anything.
>
> With a strict definition of attribute, yeah. One of the real uses cases
> we have for this is per IO timings, for aio - right now we've got an
> interface for the kernel to tell userspace how long a syscall took
> (don't think it's upstream yet - Paul's been behind that stuff), but it
> only really makes sense with synchronous syscalls.

Something beyond recording the time spent in the kernel?  Paul who?  I
agree the per io timing for aio may be coarse-grained today (you can
time the difference between io_submit returning and the event 

Re: [RFC, PATCH] Extensible AIO interface

2012-10-03 Thread Jeff Moyer
Kent Overstreet koverstr...@google.com writes:

 On Tue, Oct 02, 2012 at 01:41:17PM -0400, Jeff Moyer wrote:
 Kent Overstreet koverstr...@google.com writes:
 
  So, I and other people keep running into things where we really need to
  add an interface to pass some auxiliary... stuff along with a pread() or
  pwrite().
 
  A few examples:
 
  * IO scheduler hints. Some userspace program wants to, per IO, specify
  either priorities or a cgroup - by specifying a cgroup you can have a
  fileserver in userspace that makes use of cfq's per cgroup bandwidth
  quotas.
 
 You can do this today by splitting I/O between processes and placing
 those processes in different cgroups.  For io priority, there is
 ioprio_set, which incurs an extra system call, but can be used.  Not
 elegant, but possible.

 Yes - those are things I'm trying to replace. Doing it that way is a
 real pain, both as it's a lousy interface for this and it does impact
 performance (ioprio_set doesn't really work too well with aio, too).

ioprio_set works fine with aio, since the I/O is issued in the caller's
context.  Perhaps you're thinking of writeback I/O?

  * Cache hints. For bcache and other things, userspace may want to specify
  this data should be cached, this data should bypass the cache, etc.
 
 Please explain how you will differentiate this from posix_fadvise.

 Oh sorry, I think about SSD caching so much I forget to say that's what
 I'm talking about. posix_fadvise is for the page cache, we want
 something different for an SSD cache (IMO it'd be really ugly to use it
 for both, and posix_fadvise() can't really specifify everything we'd
 want to for an SSD cache).

DESCRIPTION
   Programs can use posix_fadvise() to announce an intention to
   access file data in a specific pattern in the future, thus
   allowing the kernel to perform appropriate optimizations.

That description seems broad enough to include disk caches as well.  You
haven't exactly stated what's missing.

  Hence, AIO attributes.
 
 *No.*  Start with the non-AIO case first.

 Why? It is orthogonal to AIO (and I should make that clearer), but to do
 it for sync IO we'd need new syscalls that take an extra argument so IMO
 it's a bit easier to start with AIO.

 Might be worth implementing the sync interface sooner rather than later
 just to discover any potential issues, I suppose.

Looking back to preadv and pwritev, it was wrong to only add them to
libaio (and that later got corrected).  I'd just like to see things
start out with the sync interfaces, since you'll get more eyes on the
code (not everyone cares about aio) and that helps to work out any
interface issues.

  * FUTURE STUFF:
 
  Return values:
 
  Some attributes are probably going to want to return something to
  userspace.
 
  If nothing else, we want this so that userspace can tell if anything
  handled the attributes it specified - as dynamic as the io stack can be,
  with something extensible like this there really isn't any generic way
  of knowing ahead of time if something is going to interpret any
  attribute - we want to return at least an error code.
 
 Seems odd to me.  Why not expose supported attributes via some other
 call?  fcntl?

 It's not possible in general - consider stacking block devices, and
 attrs that are supported only by specific block drivers. I.e. if you've
 got lvm on top of bcache or bcache on top of md, we can pass the attr
 down with the IO but we can't determine ahead of time, in general, where
 the IO is going to go.

If the io stack is static (meaning you setup a device once, then open it
and do io to it, and it doesn't change while you're doing io), how would
you not know where the IO is going to go?

 But that probably isn't true for most attrs so it probably would be a
 good idea to have an interface for querying what's supported, and even
 for device specific ones you could query what a device supports.

OK.

  One could imagine sticking the return in the attribute itself, but I
  don't want to do this. For some things (checksums), the attribute will
  contain a pointer to a buffer - that's fine. But I don't want the
  attributes themselves to be writeable.
 
 One could imagine that attributes don't return anything, because, well,
 they're properties of something else, and properties don't return
 anything.

 With a strict definition of attribute, yeah. One of the real uses cases
 we have for this is per IO timings, for aio - right now we've got an
 interface for the kernel to tell userspace how long a syscall took
 (don't think it's upstream yet - Paul's been behind that stuff), but it
 only really makes sense with synchronous syscalls.

Something beyond recording the time spent in the kernel?  Paul who?  I
agree the per io timing for aio may be coarse-grained today (you can
time the difference between io_submit returning and the event being
returned by io_getevents, but that says nothing of when the io was
issued to the block layer).  I'm curious to 

Re: [RFC, PATCH] Extensible AIO interface

2012-10-03 Thread Tejun Heo
Hello, Kent.

On Tue, Oct 02, 2012 at 08:00:20PM -0700, Kent Overstreet wrote:
  However, I don't think it's a good idea to try to implement something
  which is a neutral transport of opaque data between userland and lower
  layers.  Things like that sound attractive with unlimited
  possibilities but reality seems to have the tendancy to make a big
  mess out of setups like that.
 
 I don't see how the neutral transport of opaque data itself is a bad
 thing. We want something simple and sane to build actual interfaces on
 top of - once we've got that, we can either build clean generic well
 defined interfaces or we can make a mess like with ioctls :P
 
 It's like any other mechanism. There's good syscalls and bad syscalls...

Depending on what a feature aims for, the design and implementation
vary greatly.  If you go for completely generic extensible stuff which
can be used to warp space-time continuum, it's easy to end up with a
monstrosity with generic and programmable parser, verifier, accessor
and so on.

 Say we implement an attr to control a block layer cache. That attr could
 be parsed/validated in high level code (if there's any to do) - that I
 don't object to. But the high level code isn't going to /know/ whether
 there was any block cache in the stack that handled the attr. If the
 attr is passed down to the block cache, that block cache can return that
 it was handled.

My point is that if it doesn't fit the generic abstract model as in
fadvise(2), it probably isn't worth supporting in any generic manner.

  It's okay to allow some side channel thing for specific hacky uses but
  I really hope the general design were focused around properly
  abstracted attributes which can be understood and handled by the upper
  layer.
 
 Completely agreed. I want to leave that side channel open for
 experimentation, and so we have a way of implementing one off hacky
 stuff when we need to - but normal mainline stuff should be sane and
 well designed.

So, I think we can aim for something simple and modest (the only thing
I can think of at the moment is task association) and provide simple
framework which can be used for specific custom usages.  Let's please
not go overboard with generic parser / verifier which supports pointer
indirection or whatnot.

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC, PATCH] Extensible AIO interface

2012-10-03 Thread Dave Chinner
On Tue, Oct 02, 2012 at 07:41:10PM -0700, Kent Overstreet wrote:
 On Wed, Oct 03, 2012 at 11:28:25AM +1000, Dave Chinner wrote:
  On Tue, Oct 02, 2012 at 05:20:29PM -0700, Kent Overstreet wrote:
   On Tue, Oct 02, 2012 at 01:41:17PM -0400, Jeff Moyer wrote:
Kent Overstreet koverstr...@google.com writes:

 So, I and other people keep running into things where we really need 
 to
 add an interface to pass some auxiliary... stuff along with a pread() 
 or
 pwrite().

 A few examples:

 * IO scheduler hints. Some userspace program wants to, per IO, specify
 either priorities or a cgroup - by specifying a cgroup you can have a
 fileserver in userspace that makes use of cfq's per cgroup bandwidth
 quotas.

You can do this today by splitting I/O between processes and placing
those processes in different cgroups.  For io priority, there is
ioprio_set, which incurs an extra system call, but can be used.  Not
elegant, but possible.
   
   Yes - those are things I'm trying to replace. Doing it that way is a
   real pain, both as it's a lousy interface for this and it does impact
   performance (ioprio_set doesn't really work too well with aio, too).
   
 * Cache hints. For bcache and other things, userspace may want to 
 specify
 this data should be cached, this data should bypass the cache, 
 etc.

Please explain how you will differentiate this from posix_fadvise.
   
   Oh sorry, I think about SSD caching so much I forget to say that's what
   I'm talking about. posix_fadvise is for the page cache, we want
   something different for an SSD cache (IMO it'd be really ugly to use it
   for both, and posix_fadvise() can't really specifify everything we'd
   want to for an SSD cache).
  
  Similar discussions about posix_fadvise() are being had for marking
  ranges of files as volatile (i.e. useful for determining what can be
  evicted from a cache when space reclaim is required).
  
  https://lkml.org/lkml/2012/10/2/501
 
 Hmm, interesting
 
 Speaking as an implementor though, hints that aren't associated with any
 specific IO are harder to make use of - stuff is in the cache. What you
 really want is to know, for a given IO, whether to cache it or not, and
 possibly where in the LRU to stick it.

I can see how it might be useful, but it needs to have a defined set
of attributes that a file IO is allowed to have. If you don't define
the set, then what really have is an arbitrary set of storage-device
specific interfaces.

Of course, once we have a defined set of per-file IO policy
attributes, we don't really need per-IO attributes - you can just
set them through a range interface like fadvise() or fallocate().

 Well, it's quite possible that different implementations would have no
 trouble making use of those kinds of hints, I'm no doubt biased by
 having implemented bcache. With bcache though, cache replacement is done
 in terms of physical address space, not logical (i.e. the address space
 of the device being cached). 
 
 So to handle posix_fadvise, we'd have to walk the btree and chase
 pointers to buckets, and modify the bucket priorities up or down... but
 what about the other data in those buckets? It's not clear what should
 happen, but there isn't any good way to take that into account.
 
 (The exception is dropping data from the cache entirely, we can just
 invalidate that part of the keyspace and garbage collection will reclaim
 the buckets they pointed to. Bcache does that for discard requests,
 currently).

It sounds to me like you are saying is that the design of bcache is
unsuited to file-level management of caching policy, and that is why
you want to pass attributes directly to bcache with specific IOs. Is
that a fair summary of the problem you are describing here?

My problem with this approach has nothing to do with the per-IO
nature of it - it's to do with the layering violations and the
amount of storage specific knowledge needed to make effective use of
it. i.e. it seems like an interface that can only be used by people
intimately familiar with underlying storage implementation. You
happen to be one of those people, so I figure that you don't see a
problem with that. ;)

However, it also implies that an application must understand and use
a specific storage configuration that matches the attributes an
application sends. I understand how this model is appealling to
Google because they control the whole application and storage stack
(hardware and software) from top to bottom. However, I just don't
think that it is a solution that the rest of the world can use
effectively.

The scope of data locality, aging and IO priority policy
control is much larger than just controlling SSD caching.
SSD caching is just a another implementation of automatic tiering,
and HSMs have been doing this for years and years. It's the same
problem - space management and deciding what to leave in the
frequently accessed pool 

Re: [RFC, PATCH] Extensible AIO interface

2012-10-02 Thread Kent Overstreet
On Wed, Oct 03, 2012 at 10:41:06AM +0900, Tejun Heo wrote:
> Hello, Kent.
> 
> On Tue, Oct 02, 2012 at 02:41:13PM -0700, Kent Overstreet wrote:
> > Seems to me it'd be no different from security considerations when
> > introducing new ioctls. I.e., messy, ad hoc, easy to get wrong, but
> > sometimes no way around it.
> > 
> > It really has to be ad hoc if it's extensible, unfortunately.
> > 
> > The only way of getting around _that_ would be with some kind of
> > reflective type system, so that generic code could parse (in some
> > fashion) the types of the various attributes, and for pointers copy the
> > user data into the kernel and do whatever access controls in generic
> > code.
> 
> I'm not userland API expert by any stretch of imagination but the
> basic mechanism to pass data around seems sane to me and aio as stinky
> as it is seems to be the only logical stuff for IOs w/ extra
> attributes although alignment is always painful with any form of
> concatenated opaque structures.
> 
> However, I don't think it's a good idea to try to implement something
> which is a neutral transport of opaque data between userland and lower
> layers.  Things like that sound attractive with unlimited
> possibilities but reality seems to have the tendancy to make a big
> mess out of setups like that.

I don't see how the "neutral transport of opaque data" itself is a bad
thing. We want something simple and sane to build actual interfaces on
top of - once we've got that, we can either build clean generic well
defined interfaces or we can make a mess like with ioctls :P

It's like any other mechanism. There's good syscalls and bad syscalls...

> So, if we're gonna do this, let's define what attributes we want to
> have and let them be processed at the interface layer and fed to lower
> layers afterwards - e.g. for cgroup context association, associate the
> resulting bios with the target cgroup from the aio layer.

That sounds perfectly reasonable to me (the emails with Zach ended up
heading in that direction).

> I'm quite skeptical of general usefulness of having opaque knobs to
> lower IO stack which don't have proper generic abstraction.  Nobody
> can make proper use of things like that.  Well, not nobody, maybe if
> the lower stack, the interface and the application are implemented by
> a single organization over relatively short span of time, maybe it
> would be useful for them, but that isn't something which generic
> interface design should focus on.

I think it could work fine, but I'm not convinced either way on what the
correct approach is.

Say we implement an attr to control a block layer cache. That attr could
be parsed/validated in high level code (if there's any to do) - that I
don't object to. But the high level code isn't going to /know/ whether
there was any block cache in the stack that handled the attr. If the
attr is passed down to the block cache, that block cache can return that
it was handled.

Or if we implement an attr that says "return whether it was a cache hit
or miss, and maybe other statistics". Similar thing.

We could conceivably confine attrs to the upper layer, and define in
kernel interfaces for passing all this stuff around, but I'm doubtful
it's worth the trouble - at least if attrs themselves are implemented
sanely. And honestly I think it'd make (more of) a mess of the block
layer and the rest of the io stack to have to explicitly pass around
cache hints/cache controlling options/whatever else we think of in the
future - especially when most of this stuff isn't going to be in use
most of the time.

But, like I sort of mentioned in another email with Zach - passing the
attrs from userspace around is _not_ mutually exclusive with standard
code for parsing/validating them that exists in one place. We shouldn't
have driver code reaching in and grabbing the raw attr unless there _is_
no parsing/validation to do.


> It's okay to allow some side channel thing for specific hacky uses but
> I really hope the general design were focused around properly
> abstracted attributes which can be understood and handled by the upper
> layer.

Completely agreed. I want to leave that side channel open for
experimentation, and so we have a way of implementing one off hacky
stuff when we need to - but normal mainline stuff should be sane and
well designed.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC, PATCH] Extensible AIO interface

2012-10-02 Thread Kent Overstreet
On Wed, Oct 03, 2012 at 11:28:25AM +1000, Dave Chinner wrote:
> On Tue, Oct 02, 2012 at 05:20:29PM -0700, Kent Overstreet wrote:
> > On Tue, Oct 02, 2012 at 01:41:17PM -0400, Jeff Moyer wrote:
> > > Kent Overstreet  writes:
> > > 
> > > > So, I and other people keep running into things where we really need to
> > > > add an interface to pass some auxiliary... stuff along with a pread() or
> > > > pwrite().
> > > >
> > > > A few examples:
> > > >
> > > > * IO scheduler hints. Some userspace program wants to, per IO, specify
> > > > either priorities or a cgroup - by specifying a cgroup you can have a
> > > > fileserver in userspace that makes use of cfq's per cgroup bandwidth
> > > > quotas.
> > > 
> > > You can do this today by splitting I/O between processes and placing
> > > those processes in different cgroups.  For io priority, there is
> > > ioprio_set, which incurs an extra system call, but can be used.  Not
> > > elegant, but possible.
> > 
> > Yes - those are things I'm trying to replace. Doing it that way is a
> > real pain, both as it's a lousy interface for this and it does impact
> > performance (ioprio_set doesn't really work too well with aio, too).
> > 
> > > > * Cache hints. For bcache and other things, userspace may want to 
> > > > specify
> > > > "this data should be cached", "this data should bypass the cache", etc.
> > > 
> > > Please explain how you will differentiate this from posix_fadvise.
> > 
> > Oh sorry, I think about SSD caching so much I forget to say that's what
> > I'm talking about. posix_fadvise is for the page cache, we want
> > something different for an SSD cache (IMO it'd be really ugly to use it
> > for both, and posix_fadvise() can't really specifify everything we'd
> > want to for an SSD cache).
> 
> Similar discussions about posix_fadvise() are being had for marking
> ranges of files as volatile (i.e. useful for determining what can be
> evicted from a cache when space reclaim is required).
> 
> https://lkml.org/lkml/2012/10/2/501

Hmm, interesting

Speaking as an implementor though, hints that aren't associated with any
specific IO are harder to make use of - stuff is in the cache. What you
really want is to know, for a given IO, whether to cache it or not, and
possibly where in the LRU to stick it.

Well, it's quite possible that different implementations would have no
trouble making use of those kinds of hints, I'm no doubt biased by
having implemented bcache. With bcache though, cache replacement is done
in terms of physical address space, not logical (i.e. the address space
of the device being cached). 

So to handle posix_fadvise, we'd have to walk the btree and chase
pointers to buckets, and modify the bucket priorities up or down... but
what about the other data in those buckets? It's not clear what should
happen, but there isn't any good way to take that into account.

(The exception is dropping data from the cache entirely, we can just
invalidate that part of the keyspace and garbage collection will reclaim
the buckets they pointed to. Bcache does that for discard requests,
currently).

> If you have requirements for specific cache management, then it
> might be worth seeing if you can steer an existing interface
> proposal for some form of cache management in the direction you
> need.

Certainly - I don't plan on implementing anything bcache specific, or
implementing anything from scratch if there's a good proposal out there.
But a per-io interface does seem useful from an implementation pov and
natural to use for at least some classes of applications.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC, PATCH] Extensible AIO interface

2012-10-02 Thread Tejun Heo
Hello, Kent.

On Tue, Oct 02, 2012 at 02:41:13PM -0700, Kent Overstreet wrote:
> Seems to me it'd be no different from security considerations when
> introducing new ioctls. I.e., messy, ad hoc, easy to get wrong, but
> sometimes no way around it.
> 
> It really has to be ad hoc if it's extensible, unfortunately.
> 
> The only way of getting around _that_ would be with some kind of
> reflective type system, so that generic code could parse (in some
> fashion) the types of the various attributes, and for pointers copy the
> user data into the kernel and do whatever access controls in generic
> code.

I'm not userland API expert by any stretch of imagination but the
basic mechanism to pass data around seems sane to me and aio as stinky
as it is seems to be the only logical stuff for IOs w/ extra
attributes although alignment is always painful with any form of
concatenated opaque structures.

However, I don't think it's a good idea to try to implement something
which is a neutral transport of opaque data between userland and lower
layers.  Things like that sound attractive with unlimited
possibilities but reality seems to have the tendancy to make a big
mess out of setups like that.

So, if we're gonna do this, let's define what attributes we want to
have and let them be processed at the interface layer and fed to lower
layers afterwards - e.g. for cgroup context association, associate the
resulting bios with the target cgroup from the aio layer.

I'm quite skeptical of general usefulness of having opaque knobs to
lower IO stack which don't have proper generic abstraction.  Nobody
can make proper use of things like that.  Well, not nobody, maybe if
the lower stack, the interface and the application are implemented by
a single organization over relatively short span of time, maybe it
would be useful for them, but that isn't something which generic
interface design should focus on.

It's okay to allow some side channel thing for specific hacky uses but
I really hope the general design were focused around properly
abstracted attributes which can be understood and handled by the upper
layer.

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC, PATCH] Extensible AIO interface

2012-10-02 Thread Dave Chinner
On Tue, Oct 02, 2012 at 05:20:29PM -0700, Kent Overstreet wrote:
> On Tue, Oct 02, 2012 at 01:41:17PM -0400, Jeff Moyer wrote:
> > Kent Overstreet  writes:
> > 
> > > So, I and other people keep running into things where we really need to
> > > add an interface to pass some auxiliary... stuff along with a pread() or
> > > pwrite().
> > >
> > > A few examples:
> > >
> > > * IO scheduler hints. Some userspace program wants to, per IO, specify
> > > either priorities or a cgroup - by specifying a cgroup you can have a
> > > fileserver in userspace that makes use of cfq's per cgroup bandwidth
> > > quotas.
> > 
> > You can do this today by splitting I/O between processes and placing
> > those processes in different cgroups.  For io priority, there is
> > ioprio_set, which incurs an extra system call, but can be used.  Not
> > elegant, but possible.
> 
> Yes - those are things I'm trying to replace. Doing it that way is a
> real pain, both as it's a lousy interface for this and it does impact
> performance (ioprio_set doesn't really work too well with aio, too).
> 
> > > * Cache hints. For bcache and other things, userspace may want to specify
> > > "this data should be cached", "this data should bypass the cache", etc.
> > 
> > Please explain how you will differentiate this from posix_fadvise.
> 
> Oh sorry, I think about SSD caching so much I forget to say that's what
> I'm talking about. posix_fadvise is for the page cache, we want
> something different for an SSD cache (IMO it'd be really ugly to use it
> for both, and posix_fadvise() can't really specifify everything we'd
> want to for an SSD cache).

Similar discussions about posix_fadvise() are being had for marking
ranges of files as volatile (i.e. useful for determining what can be
evicted from a cache when space reclaim is required).

https://lkml.org/lkml/2012/10/2/501

If you have requirements for specific cache management, then it
might be worth seeing if you can steer an existing interface
proposal for some form of cache management in the direction you
need.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC, PATCH] Extensible AIO interface

2012-10-02 Thread Kent Overstreet
On Tue, Oct 02, 2012 at 01:41:17PM -0400, Jeff Moyer wrote:
> Kent Overstreet  writes:
> 
> > So, I and other people keep running into things where we really need to
> > add an interface to pass some auxiliary... stuff along with a pread() or
> > pwrite().
> >
> > A few examples:
> >
> > * IO scheduler hints. Some userspace program wants to, per IO, specify
> > either priorities or a cgroup - by specifying a cgroup you can have a
> > fileserver in userspace that makes use of cfq's per cgroup bandwidth
> > quotas.
> 
> You can do this today by splitting I/O between processes and placing
> those processes in different cgroups.  For io priority, there is
> ioprio_set, which incurs an extra system call, but can be used.  Not
> elegant, but possible.

Yes - those are things I'm trying to replace. Doing it that way is a
real pain, both as it's a lousy interface for this and it does impact
performance (ioprio_set doesn't really work too well with aio, too).

> > * Cache hints. For bcache and other things, userspace may want to specify
> > "this data should be cached", "this data should bypass the cache", etc.
> 
> Please explain how you will differentiate this from posix_fadvise.

Oh sorry, I think about SSD caching so much I forget to say that's what
I'm talking about. posix_fadvise is for the page cache, we want
something different for an SSD cache (IMO it'd be really ugly to use it
for both, and posix_fadvise() can't really specifify everything we'd
want to for an SSD cache).

> > * Passing checksums out to userspace. We've got bio integrity, which is
> > a (somewhat) generic interface for passing data checksums between the
> > filesystem and the hardware. There are various circumstances under which
> > you may want to pass these checksums out to userspace, and if so we
> > ought to have a generic way of doing it.
> 
> Yes, that needs a new interface.
> 
> > Hence, AIO attributes.
> 
> *No.*  Start with the non-AIO case first.

Why? It is orthogonal to AIO (and I should make that clearer), but to do
it for sync IO we'd need new syscalls that take an extra argument so IMO
it's a bit easier to start with AIO.

Might be worth implementing the sync interface sooner rather than later
just to discover any potential issues, I suppose.


> > * FUTURE STUFF:
> >
> > Return values:
> >
> > Some attributes are probably going to want to return something to
> > userspace.
> >
> > If nothing else, we want this so that userspace can tell if anything
> > handled the attributes it specified - as dynamic as the io stack can be,
> > with something extensible like this there really isn't any generic way
> > of knowing ahead of time if something is going to interpret any
> > attribute - we want to return at least an error code.
> 
> Seems odd to me.  Why not expose supported attributes via some other
> call?  fcntl?

It's not possible in general - consider stacking block devices, and
attrs that are supported only by specific block drivers. I.e. if you've
got lvm on top of bcache or bcache on top of md, we can pass the attr
down with the IO but we can't determine ahead of time, in general, where
the IO is going to go.

But that probably isn't true for most attrs so it probably would be a
good idea to have an interface for querying what's supported, and even
for device specific ones you could query what a device supports.

> > One could imagine sticking the return in the attribute itself, but I
> > don't want to do this. For some things (checksums), the attribute will
> > contain a pointer to a buffer - that's fine. But I don't want the
> > attributes themselves to be writeable.
> 
> One could imagine that attributes don't return anything, because, well,
> they're properties of something else, and properties don't return
> anything.

With a strict definition of attribute, yeah. One of the real uses cases
we have for this is per IO timings, for aio - right now we've got an
interface for the kernel to tell userspace how long a syscall took
(don't think it's upstream yet - Paul's been behind that stuff), but it
only really makes sense with synchronous syscalls.

These AIO attributes would be useful for that too, but I'd _much_ prefer
if the timing information was explicitly returned instead of using a
pointer to a buffer.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC, PATCH] Extensible AIO interface

2012-10-02 Thread Martin K. Petersen
> "Kent" == Kent Overstreet  writes:

>> Hmm, careful here.  I think that in DIF/DIX the checksums are
>> per-sector, not per IO, right?  That'd mean that the PAGE_SIZE attr
>> limit in this patch would be magically creating different max IO size
>> limits on different architectures.  That doesn't seem great.

Kent> Not just per sector, Per hardware sector. 

Per logical block (or for some devices less).


Kent> For passing around checksums userspace would have to find out the
Kent> hardware sector size and checksum type/size via a different
Kent> interface,

The relevant information is already exported in sysfs. Including the
format, how many bytes of integrity metadata go with how many bytes of
data, etc.

-- 
Martin K. Petersen  Oracle Linux Engineering
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC, PATCH] Extensible AIO interface

2012-10-02 Thread Kent Overstreet
On Tue, Oct 02, 2012 at 10:43:23AM -0700, Zach Brown wrote:
> > The generic code wouldn't know about any user pointers inside
> > attributes, so it'd have to be downstream consumers. Hopefully there
> > won't be many attributes with user pointers in them (I don't expect
> > there to be), so we won't have too much of this messyness.
> 
> I really don't like this.  We should have learned this lesson with ioctl
> structs that are nested pointers.

I will in no way claim to be skilled at kernel/userspace interface
design (my expertise is more in the block layer), so feel free to
educate me as much as you have the patience for :)

The comparision with ioctl is definitely apt. (And I don't care for
ioctl either, my first preference is for textual interfaces wherever
possible - but I don't think that's a reasonable approach here :p)

I'm not sure what you're exact complaint about nested pointers is - I
agree we want to avoid pointers as much as they can, for multiple
reasons. I do expect to get rid of one layer of indirection eventually,
but that'll probably require new syscalls.

Also, I emphatically hope and expect that nested pointers will _not_ be
the norm.

> What about security bits that are trying to determine if attributes are
> OK?

Seems to me it'd be no different from security considerations when
introducing new ioctls. I.e., messy, ad hoc, easy to get wrong, but
sometimes no way around it.

It really has to be ad hoc if it's extensible, unfortunately.

The only way of getting around _that_ would be with some kind of
reflective type system, so that generic code could parse (in some
fashion) the types of the various attributes, and for pointers copy the
user data into the kernel and do whatever access controls in generic
code.

Which might not be a crazy idea if it could be applied to ioctls...
hmm...

> What about contexts that can't easily deal with userspace pointers?  We
> tend to copy into relatively more accessible kernel memory as early as
> possible.

That tends not to be the case submission side. submit_bio() and
make_request functions are all run in normal process context.

And also, like I mentioned I expect nested pointers to be fairly unusual
so this shouldn't come up often.

> What about fuse trying to extend this interface out to their fs clients?
> Look at the horrible mess it implements to bounce nested ioctl data
> parsing between the kernel's user pointer copying and the fuse client's
> parsing of that copied data.

Ugh, yeah that's something I should look into/think about. It's been
ages since I looked at fuse, and I never looked at how it handled ioctls
(not sure if that even existed when i was looking at it).

> Let's not do this again, please.  I think it's a fallacy to claim that
> the interface can be opaque to high levels and only parsed by lower
> levels.  The interface should be explicit and fully specified on entry
> so that all levels have trivial access to it.

Hmm - that is definitely a good principle.

To restate a bit, the parsing and validation ought to be done in generic
centralized code to the greatest extent possible (I _think_ that it's
not always possible in general, but maybe it will be in practice).

Also - IMO, one of the nastier things about ioctls is that
parsing/validation tends to be completely mixed with implementation. It
would be nice to get away from that.

Pondering a bit - that actually is the situation with my patch for
attributes that are simple data; it's just data and the data is
trivially accessible at all levels.

But, this isn't just an issue for attributes that contain user pointers
- the first attribute we prototyped was the proxy_pid attr, so a process
can have io done in another cgroup's context.

That requires validation, permission checking, and depending on how it's
used refcounting. That stuff _definitely_ shouldn't be buried in the
middle of block/cfq-iosched.c.

Two approaches I can think of:

For every attr, define a struct user_iocb_attr_foo and struct
kern_iocb_attr_foo. Outside of the attr parsing code, nothing in the
kernel sees the user version - they see a kern version which has had
said parsing/validation done on it.

I do like this approach in that it ought to make things easy to
audit/hard to screw up.

One difficulty I can see with that approach is refcounting - if
parsing/validation is done on kernel entry, we've got to take refcounts
to whatever kernel structures are referred to (like a different
task_struct). This can often be avoided if we delay parsing/validation
until when it's actually used, and then use rcu.

Or maybe we could go with a halfway approach - the userspace structs are
passed around within the kernel like in my existing patch, but for any
attr that isn't simple data we define a parse_iocb_attr_foo function,
and those functions are implemented in a common location - not scattered
around defined where they're used.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a 

Re: [RFC, PATCH] Extensible AIO interface

2012-10-02 Thread Andi Kleen
Kent Overstreet  writes:

> So, I and other people keep running into things where we really need to
> add an interface to pass some auxiliary... stuff along with a pread() or
> pwrite().

How would you enumerate this? 
How does the application know what the underlying stack supports/need?
How is versioning handled?

Frankly, in this form it looks more like a special purpose hack than a general
facility. 

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC, PATCH] Extensible AIO interface

2012-10-02 Thread Zach Brown
> The generic code wouldn't know about any user pointers inside
> attributes, so it'd have to be downstream consumers. Hopefully there
> won't be many attributes with user pointers in them (I don't expect
> there to be), so we won't have too much of this messyness.

I really don't like this.  We should have learned this lesson with ioctl
structs that are nested pointers.

What about security bits that are trying to determine if attributes are
OK?

What about contexts that can't easily deal with userspace pointers?  We
tend to copy into relatively more accessible kernel memory as early as
possible.

What about fuse trying to extend this interface out to their fs clients?
Look at the horrible mess it implements to bounce nested ioctl data
parsing between the kernel's user pointer copying and the fuse client's
parsing of that copied data.

Let's not do this again, please.  I think it's a fallacy to claim that
the interface can be opaque to high levels and only parsed by lower
levels.  The interface should be explicit and fully specified on entry
so that all levels have trivial access to it.

- z
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC, PATCH] Extensible AIO interface

2012-10-02 Thread Jeff Moyer
Kent Overstreet  writes:

> So, I and other people keep running into things where we really need to
> add an interface to pass some auxiliary... stuff along with a pread() or
> pwrite().
>
> A few examples:
>
> * IO scheduler hints. Some userspace program wants to, per IO, specify
> either priorities or a cgroup - by specifying a cgroup you can have a
> fileserver in userspace that makes use of cfq's per cgroup bandwidth
> quotas.

You can do this today by splitting I/O between processes and placing
those processes in different cgroups.  For io priority, there is
ioprio_set, which incurs an extra system call, but can be used.  Not
elegant, but possible.

> * Cache hints. For bcache and other things, userspace may want to specify
> "this data should be cached", "this data should bypass the cache", etc.

Please explain how you will differentiate this from posix_fadvise.

> * Passing checksums out to userspace. We've got bio integrity, which is
> a (somewhat) generic interface for passing data checksums between the
> filesystem and the hardware. There are various circumstances under which
> you may want to pass these checksums out to userspace, and if so we
> ought to have a generic way of doing it.

Yes, that needs a new interface.

> Hence, AIO attributes.

*No.*  Start with the non-AIO case first.

> * FUTURE STUFF:
>
> Return values:
>
> Some attributes are probably going to want to return something to
> userspace.
>
> If nothing else, we want this so that userspace can tell if anything
> handled the attributes it specified - as dynamic as the io stack can be,
> with something extensible like this there really isn't any generic way
> of knowing ahead of time if something is going to interpret any
> attribute - we want to return at least an error code.

Seems odd to me.  Why not expose supported attributes via some other
call?  fcntl?

> One could imagine sticking the return in the attribute itself, but I
> don't want to do this. For some things (checksums), the attribute will
> contain a pointer to a buffer - that's fine. But I don't want the
> attributes themselves to be writeable.

One could imagine that attributes don't return anything, because, well,
they're properties of something else, and properties don't return
anything.

Cheers,
Jeff
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC, PATCH] Extensible AIO interface

2012-10-02 Thread Jeff Moyer
Kent Overstreet koverstr...@google.com writes:

 So, I and other people keep running into things where we really need to
 add an interface to pass some auxiliary... stuff along with a pread() or
 pwrite().

 A few examples:

 * IO scheduler hints. Some userspace program wants to, per IO, specify
 either priorities or a cgroup - by specifying a cgroup you can have a
 fileserver in userspace that makes use of cfq's per cgroup bandwidth
 quotas.

You can do this today by splitting I/O between processes and placing
those processes in different cgroups.  For io priority, there is
ioprio_set, which incurs an extra system call, but can be used.  Not
elegant, but possible.

 * Cache hints. For bcache and other things, userspace may want to specify
 this data should be cached, this data should bypass the cache, etc.

Please explain how you will differentiate this from posix_fadvise.

 * Passing checksums out to userspace. We've got bio integrity, which is
 a (somewhat) generic interface for passing data checksums between the
 filesystem and the hardware. There are various circumstances under which
 you may want to pass these checksums out to userspace, and if so we
 ought to have a generic way of doing it.

Yes, that needs a new interface.

 Hence, AIO attributes.

*No.*  Start with the non-AIO case first.

 * FUTURE STUFF:

 Return values:

 Some attributes are probably going to want to return something to
 userspace.

 If nothing else, we want this so that userspace can tell if anything
 handled the attributes it specified - as dynamic as the io stack can be,
 with something extensible like this there really isn't any generic way
 of knowing ahead of time if something is going to interpret any
 attribute - we want to return at least an error code.

Seems odd to me.  Why not expose supported attributes via some other
call?  fcntl?

 One could imagine sticking the return in the attribute itself, but I
 don't want to do this. For some things (checksums), the attribute will
 contain a pointer to a buffer - that's fine. But I don't want the
 attributes themselves to be writeable.

One could imagine that attributes don't return anything, because, well,
they're properties of something else, and properties don't return
anything.

Cheers,
Jeff
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC, PATCH] Extensible AIO interface

2012-10-02 Thread Zach Brown
 The generic code wouldn't know about any user pointers inside
 attributes, so it'd have to be downstream consumers. Hopefully there
 won't be many attributes with user pointers in them (I don't expect
 there to be), so we won't have too much of this messyness.

I really don't like this.  We should have learned this lesson with ioctl
structs that are nested pointers.

What about security bits that are trying to determine if attributes are
OK?

What about contexts that can't easily deal with userspace pointers?  We
tend to copy into relatively more accessible kernel memory as early as
possible.

What about fuse trying to extend this interface out to their fs clients?
Look at the horrible mess it implements to bounce nested ioctl data
parsing between the kernel's user pointer copying and the fuse client's
parsing of that copied data.

Let's not do this again, please.  I think it's a fallacy to claim that
the interface can be opaque to high levels and only parsed by lower
levels.  The interface should be explicit and fully specified on entry
so that all levels have trivial access to it.

- z
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC, PATCH] Extensible AIO interface

2012-10-02 Thread Andi Kleen
Kent Overstreet koverstr...@google.com writes:

 So, I and other people keep running into things where we really need to
 add an interface to pass some auxiliary... stuff along with a pread() or
 pwrite().

How would you enumerate this? 
How does the application know what the underlying stack supports/need?
How is versioning handled?

Frankly, in this form it looks more like a special purpose hack than a general
facility. 

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC, PATCH] Extensible AIO interface

2012-10-02 Thread Kent Overstreet
On Tue, Oct 02, 2012 at 10:43:23AM -0700, Zach Brown wrote:
  The generic code wouldn't know about any user pointers inside
  attributes, so it'd have to be downstream consumers. Hopefully there
  won't be many attributes with user pointers in them (I don't expect
  there to be), so we won't have too much of this messyness.
 
 I really don't like this.  We should have learned this lesson with ioctl
 structs that are nested pointers.

I will in no way claim to be skilled at kernel/userspace interface
design (my expertise is more in the block layer), so feel free to
educate me as much as you have the patience for :)

The comparision with ioctl is definitely apt. (And I don't care for
ioctl either, my first preference is for textual interfaces wherever
possible - but I don't think that's a reasonable approach here :p)

I'm not sure what you're exact complaint about nested pointers is - I
agree we want to avoid pointers as much as they can, for multiple
reasons. I do expect to get rid of one layer of indirection eventually,
but that'll probably require new syscalls.

Also, I emphatically hope and expect that nested pointers will _not_ be
the norm.

 What about security bits that are trying to determine if attributes are
 OK?

Seems to me it'd be no different from security considerations when
introducing new ioctls. I.e., messy, ad hoc, easy to get wrong, but
sometimes no way around it.

It really has to be ad hoc if it's extensible, unfortunately.

The only way of getting around _that_ would be with some kind of
reflective type system, so that generic code could parse (in some
fashion) the types of the various attributes, and for pointers copy the
user data into the kernel and do whatever access controls in generic
code.

Which might not be a crazy idea if it could be applied to ioctls...
hmm...

 What about contexts that can't easily deal with userspace pointers?  We
 tend to copy into relatively more accessible kernel memory as early as
 possible.

That tends not to be the case submission side. submit_bio() and
make_request functions are all run in normal process context.

And also, like I mentioned I expect nested pointers to be fairly unusual
so this shouldn't come up often.

 What about fuse trying to extend this interface out to their fs clients?
 Look at the horrible mess it implements to bounce nested ioctl data
 parsing between the kernel's user pointer copying and the fuse client's
 parsing of that copied data.

Ugh, yeah that's something I should look into/think about. It's been
ages since I looked at fuse, and I never looked at how it handled ioctls
(not sure if that even existed when i was looking at it).

 Let's not do this again, please.  I think it's a fallacy to claim that
 the interface can be opaque to high levels and only parsed by lower
 levels.  The interface should be explicit and fully specified on entry
 so that all levels have trivial access to it.

Hmm - that is definitely a good principle.

To restate a bit, the parsing and validation ought to be done in generic
centralized code to the greatest extent possible (I _think_ that it's
not always possible in general, but maybe it will be in practice).

Also - IMO, one of the nastier things about ioctls is that
parsing/validation tends to be completely mixed with implementation. It
would be nice to get away from that.

Pondering a bit - that actually is the situation with my patch for
attributes that are simple data; it's just data and the data is
trivially accessible at all levels.

But, this isn't just an issue for attributes that contain user pointers
- the first attribute we prototyped was the proxy_pid attr, so a process
can have io done in another cgroup's context.

That requires validation, permission checking, and depending on how it's
used refcounting. That stuff _definitely_ shouldn't be buried in the
middle of block/cfq-iosched.c.

Two approaches I can think of:

For every attr, define a struct user_iocb_attr_foo and struct
kern_iocb_attr_foo. Outside of the attr parsing code, nothing in the
kernel sees the user version - they see a kern version which has had
said parsing/validation done on it.

I do like this approach in that it ought to make things easy to
audit/hard to screw up.

One difficulty I can see with that approach is refcounting - if
parsing/validation is done on kernel entry, we've got to take refcounts
to whatever kernel structures are referred to (like a different
task_struct). This can often be avoided if we delay parsing/validation
until when it's actually used, and then use rcu.

Or maybe we could go with a halfway approach - the userspace structs are
passed around within the kernel like in my existing patch, but for any
attr that isn't simple data we define a parse_iocb_attr_foo function,
and those functions are implemented in a common location - not scattered
around defined where they're used.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to 

Re: [RFC, PATCH] Extensible AIO interface

2012-10-02 Thread Martin K. Petersen
 Kent == Kent Overstreet koverstr...@google.com writes:

 Hmm, careful here.  I think that in DIF/DIX the checksums are
 per-sector, not per IO, right?  That'd mean that the PAGE_SIZE attr
 limit in this patch would be magically creating different max IO size
 limits on different architectures.  That doesn't seem great.

Kent Not just per sector, Per hardware sector. 

Per logical block (or for some devices less).


Kent For passing around checksums userspace would have to find out the
Kent hardware sector size and checksum type/size via a different
Kent interface,

The relevant information is already exported in sysfs. Including the
format, how many bytes of integrity metadata go with how many bytes of
data, etc.

-- 
Martin K. Petersen  Oracle Linux Engineering
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC, PATCH] Extensible AIO interface

2012-10-02 Thread Kent Overstreet
On Tue, Oct 02, 2012 at 01:41:17PM -0400, Jeff Moyer wrote:
 Kent Overstreet koverstr...@google.com writes:
 
  So, I and other people keep running into things where we really need to
  add an interface to pass some auxiliary... stuff along with a pread() or
  pwrite().
 
  A few examples:
 
  * IO scheduler hints. Some userspace program wants to, per IO, specify
  either priorities or a cgroup - by specifying a cgroup you can have a
  fileserver in userspace that makes use of cfq's per cgroup bandwidth
  quotas.
 
 You can do this today by splitting I/O between processes and placing
 those processes in different cgroups.  For io priority, there is
 ioprio_set, which incurs an extra system call, but can be used.  Not
 elegant, but possible.

Yes - those are things I'm trying to replace. Doing it that way is a
real pain, both as it's a lousy interface for this and it does impact
performance (ioprio_set doesn't really work too well with aio, too).

  * Cache hints. For bcache and other things, userspace may want to specify
  this data should be cached, this data should bypass the cache, etc.
 
 Please explain how you will differentiate this from posix_fadvise.

Oh sorry, I think about SSD caching so much I forget to say that's what
I'm talking about. posix_fadvise is for the page cache, we want
something different for an SSD cache (IMO it'd be really ugly to use it
for both, and posix_fadvise() can't really specifify everything we'd
want to for an SSD cache).

  * Passing checksums out to userspace. We've got bio integrity, which is
  a (somewhat) generic interface for passing data checksums between the
  filesystem and the hardware. There are various circumstances under which
  you may want to pass these checksums out to userspace, and if so we
  ought to have a generic way of doing it.
 
 Yes, that needs a new interface.
 
  Hence, AIO attributes.
 
 *No.*  Start with the non-AIO case first.

Why? It is orthogonal to AIO (and I should make that clearer), but to do
it for sync IO we'd need new syscalls that take an extra argument so IMO
it's a bit easier to start with AIO.

Might be worth implementing the sync interface sooner rather than later
just to discover any potential issues, I suppose.


  * FUTURE STUFF:
 
  Return values:
 
  Some attributes are probably going to want to return something to
  userspace.
 
  If nothing else, we want this so that userspace can tell if anything
  handled the attributes it specified - as dynamic as the io stack can be,
  with something extensible like this there really isn't any generic way
  of knowing ahead of time if something is going to interpret any
  attribute - we want to return at least an error code.
 
 Seems odd to me.  Why not expose supported attributes via some other
 call?  fcntl?

It's not possible in general - consider stacking block devices, and
attrs that are supported only by specific block drivers. I.e. if you've
got lvm on top of bcache or bcache on top of md, we can pass the attr
down with the IO but we can't determine ahead of time, in general, where
the IO is going to go.

But that probably isn't true for most attrs so it probably would be a
good idea to have an interface for querying what's supported, and even
for device specific ones you could query what a device supports.

  One could imagine sticking the return in the attribute itself, but I
  don't want to do this. For some things (checksums), the attribute will
  contain a pointer to a buffer - that's fine. But I don't want the
  attributes themselves to be writeable.
 
 One could imagine that attributes don't return anything, because, well,
 they're properties of something else, and properties don't return
 anything.

With a strict definition of attribute, yeah. One of the real uses cases
we have for this is per IO timings, for aio - right now we've got an
interface for the kernel to tell userspace how long a syscall took
(don't think it's upstream yet - Paul's been behind that stuff), but it
only really makes sense with synchronous syscalls.

These AIO attributes would be useful for that too, but I'd _much_ prefer
if the timing information was explicitly returned instead of using a
pointer to a buffer.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC, PATCH] Extensible AIO interface

2012-10-02 Thread Dave Chinner
On Tue, Oct 02, 2012 at 05:20:29PM -0700, Kent Overstreet wrote:
 On Tue, Oct 02, 2012 at 01:41:17PM -0400, Jeff Moyer wrote:
  Kent Overstreet koverstr...@google.com writes:
  
   So, I and other people keep running into things where we really need to
   add an interface to pass some auxiliary... stuff along with a pread() or
   pwrite().
  
   A few examples:
  
   * IO scheduler hints. Some userspace program wants to, per IO, specify
   either priorities or a cgroup - by specifying a cgroup you can have a
   fileserver in userspace that makes use of cfq's per cgroup bandwidth
   quotas.
  
  You can do this today by splitting I/O between processes and placing
  those processes in different cgroups.  For io priority, there is
  ioprio_set, which incurs an extra system call, but can be used.  Not
  elegant, but possible.
 
 Yes - those are things I'm trying to replace. Doing it that way is a
 real pain, both as it's a lousy interface for this and it does impact
 performance (ioprio_set doesn't really work too well with aio, too).
 
   * Cache hints. For bcache and other things, userspace may want to specify
   this data should be cached, this data should bypass the cache, etc.
  
  Please explain how you will differentiate this from posix_fadvise.
 
 Oh sorry, I think about SSD caching so much I forget to say that's what
 I'm talking about. posix_fadvise is for the page cache, we want
 something different for an SSD cache (IMO it'd be really ugly to use it
 for both, and posix_fadvise() can't really specifify everything we'd
 want to for an SSD cache).

Similar discussions about posix_fadvise() are being had for marking
ranges of files as volatile (i.e. useful for determining what can be
evicted from a cache when space reclaim is required).

https://lkml.org/lkml/2012/10/2/501

If you have requirements for specific cache management, then it
might be worth seeing if you can steer an existing interface
proposal for some form of cache management in the direction you
need.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC, PATCH] Extensible AIO interface

2012-10-02 Thread Tejun Heo
Hello, Kent.

On Tue, Oct 02, 2012 at 02:41:13PM -0700, Kent Overstreet wrote:
 Seems to me it'd be no different from security considerations when
 introducing new ioctls. I.e., messy, ad hoc, easy to get wrong, but
 sometimes no way around it.
 
 It really has to be ad hoc if it's extensible, unfortunately.
 
 The only way of getting around _that_ would be with some kind of
 reflective type system, so that generic code could parse (in some
 fashion) the types of the various attributes, and for pointers copy the
 user data into the kernel and do whatever access controls in generic
 code.

I'm not userland API expert by any stretch of imagination but the
basic mechanism to pass data around seems sane to me and aio as stinky
as it is seems to be the only logical stuff for IOs w/ extra
attributes although alignment is always painful with any form of
concatenated opaque structures.

However, I don't think it's a good idea to try to implement something
which is a neutral transport of opaque data between userland and lower
layers.  Things like that sound attractive with unlimited
possibilities but reality seems to have the tendancy to make a big
mess out of setups like that.

So, if we're gonna do this, let's define what attributes we want to
have and let them be processed at the interface layer and fed to lower
layers afterwards - e.g. for cgroup context association, associate the
resulting bios with the target cgroup from the aio layer.

I'm quite skeptical of general usefulness of having opaque knobs to
lower IO stack which don't have proper generic abstraction.  Nobody
can make proper use of things like that.  Well, not nobody, maybe if
the lower stack, the interface and the application are implemented by
a single organization over relatively short span of time, maybe it
would be useful for them, but that isn't something which generic
interface design should focus on.

It's okay to allow some side channel thing for specific hacky uses but
I really hope the general design were focused around properly
abstracted attributes which can be understood and handled by the upper
layer.

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC, PATCH] Extensible AIO interface

2012-10-02 Thread Kent Overstreet
On Wed, Oct 03, 2012 at 11:28:25AM +1000, Dave Chinner wrote:
 On Tue, Oct 02, 2012 at 05:20:29PM -0700, Kent Overstreet wrote:
  On Tue, Oct 02, 2012 at 01:41:17PM -0400, Jeff Moyer wrote:
   Kent Overstreet koverstr...@google.com writes:
   
So, I and other people keep running into things where we really need to
add an interface to pass some auxiliary... stuff along with a pread() or
pwrite().
   
A few examples:
   
* IO scheduler hints. Some userspace program wants to, per IO, specify
either priorities or a cgroup - by specifying a cgroup you can have a
fileserver in userspace that makes use of cfq's per cgroup bandwidth
quotas.
   
   You can do this today by splitting I/O between processes and placing
   those processes in different cgroups.  For io priority, there is
   ioprio_set, which incurs an extra system call, but can be used.  Not
   elegant, but possible.
  
  Yes - those are things I'm trying to replace. Doing it that way is a
  real pain, both as it's a lousy interface for this and it does impact
  performance (ioprio_set doesn't really work too well with aio, too).
  
* Cache hints. For bcache and other things, userspace may want to 
specify
this data should be cached, this data should bypass the cache, etc.
   
   Please explain how you will differentiate this from posix_fadvise.
  
  Oh sorry, I think about SSD caching so much I forget to say that's what
  I'm talking about. posix_fadvise is for the page cache, we want
  something different for an SSD cache (IMO it'd be really ugly to use it
  for both, and posix_fadvise() can't really specifify everything we'd
  want to for an SSD cache).
 
 Similar discussions about posix_fadvise() are being had for marking
 ranges of files as volatile (i.e. useful for determining what can be
 evicted from a cache when space reclaim is required).
 
 https://lkml.org/lkml/2012/10/2/501

Hmm, interesting

Speaking as an implementor though, hints that aren't associated with any
specific IO are harder to make use of - stuff is in the cache. What you
really want is to know, for a given IO, whether to cache it or not, and
possibly where in the LRU to stick it.

Well, it's quite possible that different implementations would have no
trouble making use of those kinds of hints, I'm no doubt biased by
having implemented bcache. With bcache though, cache replacement is done
in terms of physical address space, not logical (i.e. the address space
of the device being cached). 

So to handle posix_fadvise, we'd have to walk the btree and chase
pointers to buckets, and modify the bucket priorities up or down... but
what about the other data in those buckets? It's not clear what should
happen, but there isn't any good way to take that into account.

(The exception is dropping data from the cache entirely, we can just
invalidate that part of the keyspace and garbage collection will reclaim
the buckets they pointed to. Bcache does that for discard requests,
currently).

 If you have requirements for specific cache management, then it
 might be worth seeing if you can steer an existing interface
 proposal for some form of cache management in the direction you
 need.

Certainly - I don't plan on implementing anything bcache specific, or
implementing anything from scratch if there's a good proposal out there.
But a per-io interface does seem useful from an implementation pov and
natural to use for at least some classes of applications.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC, PATCH] Extensible AIO interface

2012-10-02 Thread Kent Overstreet
On Wed, Oct 03, 2012 at 10:41:06AM +0900, Tejun Heo wrote:
 Hello, Kent.
 
 On Tue, Oct 02, 2012 at 02:41:13PM -0700, Kent Overstreet wrote:
  Seems to me it'd be no different from security considerations when
  introducing new ioctls. I.e., messy, ad hoc, easy to get wrong, but
  sometimes no way around it.
  
  It really has to be ad hoc if it's extensible, unfortunately.
  
  The only way of getting around _that_ would be with some kind of
  reflective type system, so that generic code could parse (in some
  fashion) the types of the various attributes, and for pointers copy the
  user data into the kernel and do whatever access controls in generic
  code.
 
 I'm not userland API expert by any stretch of imagination but the
 basic mechanism to pass data around seems sane to me and aio as stinky
 as it is seems to be the only logical stuff for IOs w/ extra
 attributes although alignment is always painful with any form of
 concatenated opaque structures.
 
 However, I don't think it's a good idea to try to implement something
 which is a neutral transport of opaque data between userland and lower
 layers.  Things like that sound attractive with unlimited
 possibilities but reality seems to have the tendancy to make a big
 mess out of setups like that.

I don't see how the neutral transport of opaque data itself is a bad
thing. We want something simple and sane to build actual interfaces on
top of - once we've got that, we can either build clean generic well
defined interfaces or we can make a mess like with ioctls :P

It's like any other mechanism. There's good syscalls and bad syscalls...

 So, if we're gonna do this, let's define what attributes we want to
 have and let them be processed at the interface layer and fed to lower
 layers afterwards - e.g. for cgroup context association, associate the
 resulting bios with the target cgroup from the aio layer.

That sounds perfectly reasonable to me (the emails with Zach ended up
heading in that direction).

 I'm quite skeptical of general usefulness of having opaque knobs to
 lower IO stack which don't have proper generic abstraction.  Nobody
 can make proper use of things like that.  Well, not nobody, maybe if
 the lower stack, the interface and the application are implemented by
 a single organization over relatively short span of time, maybe it
 would be useful for them, but that isn't something which generic
 interface design should focus on.

I think it could work fine, but I'm not convinced either way on what the
correct approach is.

Say we implement an attr to control a block layer cache. That attr could
be parsed/validated in high level code (if there's any to do) - that I
don't object to. But the high level code isn't going to /know/ whether
there was any block cache in the stack that handled the attr. If the
attr is passed down to the block cache, that block cache can return that
it was handled.

Or if we implement an attr that says return whether it was a cache hit
or miss, and maybe other statistics. Similar thing.

We could conceivably confine attrs to the upper layer, and define in
kernel interfaces for passing all this stuff around, but I'm doubtful
it's worth the trouble - at least if attrs themselves are implemented
sanely. And honestly I think it'd make (more of) a mess of the block
layer and the rest of the io stack to have to explicitly pass around
cache hints/cache controlling options/whatever else we think of in the
future - especially when most of this stuff isn't going to be in use
most of the time.

But, like I sort of mentioned in another email with Zach - passing the
attrs from userspace around is _not_ mutually exclusive with standard
code for parsing/validating them that exists in one place. We shouldn't
have driver code reaching in and grabbing the raw attr unless there _is_
no parsing/validation to do.


 It's okay to allow some side channel thing for specific hacky uses but
 I really hope the general design were focused around properly
 abstracted attributes which can be understood and handled by the upper
 layer.

Completely agreed. I want to leave that side channel open for
experimentation, and so we have a way of implementing one off hacky
stuff when we need to - but normal mainline stuff should be sane and
well designed.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC, PATCH] Extensible AIO interface

2012-10-01 Thread Kent Overstreet
On Mon, Oct 01, 2012 at 04:44:39PM -0700, Zach Brown wrote:
> And what about duplicate instances of a given attribute id?  Use the
> first?  The last?  Error?  Depends on the id?

I thought of a better idea, instead of explicitly checking for
disallowed dups:

We want to return -ENOTHANDLED for not handled attributes anyways, so
let's just do that for dups - that'll catch erronious usage just fine
and a generic mechanism's better than a one off hack any day.

This does mean we can't punt on return values, which isn't a bad thing.

Also, if we've got duplicate attributes userspace needs to be able to
figure out which return value was for which attribute.

Two possibilities: one, return values come out in the same order
attributes went in. That'd work, but I dislike the subtlety and I expect
it'd be a pain for userspace.

Instead, let's just stick a u64 cookie in the attribute and include that
in the return, just like we do everywhere else.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC, PATCH] Extensible AIO interface

2012-10-01 Thread Kent Overstreet
On Mon, Oct 01, 2012 at 04:44:39PM -0700, Zach Brown wrote:
> > Not just per sector, Per hardware sector. For passing around checksums
> > userspace would have to find out the hardware sector size and checksum
> > type/size via a different interface, and then the attribute would
> > contain a pointer to a buffer that can hold the appropriate number of
> > checksums.
> 
> All problems fall to another layer of indirection? :) 

Yep :)

> But yes, that's
> fair.  I was obviously just assuming that the checksums would be in the
> attribute.
> 
> But now we're talking about layers of user pointers.  Would the
> attribute parser need to verify/copy pointers before downstream kernel
> code tries to work with it?  Would it be up to the attribute consumers
> to verify the pointers that the core doesn't really touch? 

The generic code wouldn't know about any user pointers inside
attributes, so it'd have to be downstream consumers. Hopefully there
won't be many attributes with user pointers in them (I don't expect
there to be), so we won't have too much of this messyness.

> Are these
> second pointers native (enter compat goo) or u64s?

Outside the scope of the generic implementation, but AFAIK u64s are the
sane way so I'd prefer to just enforce that.

> > I don't think there's anything fragile about the basic idea though. Or
> > do you have some way of improving upon it in mind?
> 
> Nothing super great is springing to mind, no.
> 
> > The idea with the size field is that it's just sizeof(the particular
> > attribute struct), so when userspace is appending attributes it just
> > sets size = sizeof() and attr_list->size += attr->size.
> 
> I suppose.  But this also raises the spectre of aligning the packed
> attributes to match their struct definitions.  It's the netlink(3)
> macros all over again, right?  I guess unaligned accesses aren't *that*
> big a deal.  But still.

I was just thinking about that a few minutes ago. Since the way I'm
doing it embeds struct iocb_attr inside the specific attributes, if we
stick __aligned(8) on struct iocb_attr that should solve alignment.

> And what about duplicate instances of a given attribute id?  Use the
> first?  The last?  Error?  Depends on the id?

I suspect duplicate instances will be useful and the sanest approach
for a few things, so I don't want to disallow it.

But - perhaps we could define whether duplicates are allowed of a given
attribute along with the attribute ids, then the kernel in generic code
could check for duplicates of attrs for which it was disallowed and
error.

I think I really like that idea.

> It just seems like there are a lot of corner cases that can go wrong
> with an API that is so free form.  I'd like something a lot harder to
> make mistakes with.
> 
> - z
> (being That Guy today, apparently :/)

I think it's got to be free form to be useful, but that doesn't mean we
can't avoid as many mistakes as possible ahead of time so please
continue to point out anything you can think of :)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC, PATCH] Extensible AIO interface

2012-10-01 Thread Zach Brown
> Not just per sector, Per hardware sector. For passing around checksums
> userspace would have to find out the hardware sector size and checksum
> type/size via a different interface, and then the attribute would
> contain a pointer to a buffer that can hold the appropriate number of
> checksums.

All problems fall to another layer of indirection? :)  But yes, that's
fair.  I was obviously just assuming that the checksums would be in the
attribute.

But now we're talking about layers of user pointers.  Would the
attribute parser need to verify/copy pointers before downstream kernel
code tries to work with it?  Would it be up to the attribute consumers
to verify the pointers that the core doesn't really touch?  Are these
second pointers native (enter compat goo) or u64s?

> I don't think there's anything fragile about the basic idea though. Or
> do you have some way of improving upon it in mind?

Nothing super great is springing to mind, no.

> The idea with the size field is that it's just sizeof(the particular
> attribute struct), so when userspace is appending attributes it just
> sets size = sizeof() and attr_list->size += attr->size.

I suppose.  But this also raises the spectre of aligning the packed
attributes to match their struct definitions.  It's the netlink(3)
macros all over again, right?  I guess unaligned accesses aren't *that*
big a deal.  But still.

And what about duplicate instances of a given attribute id?  Use the
first?  The last?  Error?  Depends on the id?

It just seems like there are a lot of corner cases that can go wrong
with an API that is so free form.  I'd like something a lot harder to
make mistakes with.

- z
(being That Guy today, apparently :/)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC, PATCH] Extensible AIO interface

2012-10-01 Thread Kent Overstreet
On Mon, Oct 01, 2012 at 04:12:22PM -0700, Zach Brown wrote:
> On Mon, Oct 01, 2012 at 03:23:41PM -0700, Kent Overstreet wrote:
> > So, I and other people keep running into things where we really need to
> > add an interface to pass some auxiliary... stuff along with a pread() or
> > pwrite().
> 
> Sure.  Martin (cc:ed) will sympathize.
> 
> > A few examples:
> > 
> > * IO scheduler hints...
> > * Cache hints...
> > 
> > * Passing checksums out to userspace. We've got bio integrity, which is
> > a (somewhat) generic interface for passing data checksums between the
> > filesystem and the hardware.
> 
> Hmm, careful here.  I think that in DIF/DIX the checksums are
> per-sector, not per IO, right?  That'd mean that the PAGE_SIZE attr
> limit in this patch would be magically creating different max IO size
> limits on different architectures.  That doesn't seem great.

Not just per sector, Per hardware sector. For passing around checksums
userspace would have to find out the hardware sector size and checksum
type/size via a different interface, and then the attribute would
contain a pointer to a buffer that can hold the appropriate number of
checksums.

> 
> > Hence, AIO attributes.
> 
> I have to be honest: I really don't like tying the interface to AIO, but
> I guess it's the only per-io facility we have today.  It'd be nice to
> include sync O_DIRECT when designing the interface to make sure that it
> is possible to use generic syscalls in the future without running up
> against unexpected problems. 

It'd certainly useful with regular sync IO, I just want to take it
one step at a time particularly since for sync IO we'll probably need
new syscalls.

But yes you're right, it would be good to keep in mind.

> > An iocb_attr has an id field, and a size field - and some amount of data
> > specific to that attribute.
> 
> I'd hope that we can come up with a less fragile interface.  The kernel
> would have to scan the attributes to make sure that there aren't
> malicious sizes.  I only quickly glanced at the loops, but it seemed
> like you could have a 0 size attribute in there and _next() would spin
> forever.

Ouch, yeah that's wrong :/

I don't think there's anything fragile about the basic idea though. Or
do you have some way of improving upon it in mind?

The idea with the size field is that it's just sizeof(the particular
attribute struct), so when userspace is appending attributes it just
sets size = sizeof() and attr_list->size += attr->size.

The kernel is going to have to sanity check the size fields of the
individual attributes anyways to verify the size of the last attr
doesn't extend off the end of the attr list, so I think it makes sense
to keep the current semantics of the size fields and just also check
that the size field is nonzero (actually >= sizeof(struct iocb_attr)).
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC, PATCH] Extensible AIO interface

2012-10-01 Thread Zach Brown
On Mon, Oct 01, 2012 at 03:23:41PM -0700, Kent Overstreet wrote:
> So, I and other people keep running into things where we really need to
> add an interface to pass some auxiliary... stuff along with a pread() or
> pwrite().

Sure.  Martin (cc:ed) will sympathize.

> A few examples:
> 
> * IO scheduler hints...
> * Cache hints...
> 
> * Passing checksums out to userspace. We've got bio integrity, which is
> a (somewhat) generic interface for passing data checksums between the
> filesystem and the hardware.

Hmm, careful here.  I think that in DIF/DIX the checksums are
per-sector, not per IO, right?  That'd mean that the PAGE_SIZE attr
limit in this patch would be magically creating different max IO size
limits on different architectures.  That doesn't seem great.

> Hence, AIO attributes.

I have to be honest: I really don't like tying the interface to AIO, but
I guess it's the only per-io facility we have today.  It'd be nice to
include sync O_DIRECT when designing the interface to make sure that it
is possible to use generic syscalls in the future without running up
against unexpected problems. 

> An iocb_attr has an id field, and a size field - and some amount of data
> specific to that attribute.

I'd hope that we can come up with a less fragile interface.  The kernel
would have to scan the attributes to make sure that there aren't
malicious sizes.  I only quickly glanced at the loops, but it seemed
like you could have a 0 size attribute in there and _next() would spin
forever.

- z
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC, PATCH] Extensible AIO interface

2012-10-01 Thread Kent Overstreet
So, I and other people keep running into things where we really need to
add an interface to pass some auxiliary... stuff along with a pread() or
pwrite().

A few examples:

* IO scheduler hints. Some userspace program wants to, per IO, specify
either priorities or a cgroup - by specifying a cgroup you can have a
fileserver in userspace that makes use of cfq's per cgroup bandwidth
quotas.

* Cache hints. For bcache and other things, userspace may want to specify
"this data should be cached", "this data should bypass the cache", etc.

* Passing checksums out to userspace. We've got bio integrity, which is
a (somewhat) generic interface for passing data checksums between the
filesystem and the hardware. There are various circumstances under which
you may want to pass these checksums out to userspace, and if so we
ought to have a generic way of doing it.

Hence, AIO attributes.

The way it works is I stole the reserved2 field in struct iocb. This
becomes a pointer to struct iocb_attr_list.

An iocb_attr_list is some number of iocb_attrs appended together, along
with the total number of bytes of attributes.

An iocb_attr has an id field, and a size field - and some amount of data
specific to that attribute.

The size fields mean we can iterate through the attributes and find one
with a specific id with generic code - we don't have to know anything
about the attributes we don't care about.

I also added a pointer to struct iocb_attr_list to struct bio. Now, we
can define new attributes, and then anywhere in the block layer (say
cfq, or some driver code) we can lookup any attributes that were
specified for this io.

cfq can then schedule as userspace wants, or bcache can get its cache
hints...

That's pretty much it, for the moment. It's intended to be simple and
extensible.

* FUTURE STUFF:

Return values:

Some attributes are probably going to want to return something to
userspace.

If nothing else, we want this so that userspace can tell if anything
handled the attributes it specified - as dynamic as the io stack can be,
with something extensible like this there really isn't any generic way
of knowing ahead of time if something is going to interpret any
attribute - we want to return at least an error code.

One could imagine sticking the return in the attribute itself, but I
don't want to do this. For some things (checksums), the attribute will
contain a pointer to a buffer - that's fine. But I don't want the
attributes themselves to be writeable.

The reason for this is that struct iocb point to the attributes isn't
completely ideal, and is IMO a stopgap solution. It would be better if
the attributes were inline with the iocbs. But for this we may (?) want
new aio syscalls to replace io_submit()/io_getevents(), but that isn't
something I want to rush into.

If the attributes are inline with the iocbs, the good and natural thing
to do for return values is stick them inline with the io_event
completions.

But I'm not quite sure what I want to do in the meantime, if we end up
needing return values in the short term.


commit 34232e6f28112f049d633f4ecd2d34e536f8c7cf
Author: Kent Overstreet 
Date:   Mon Oct 1 13:17:56 2012 -0700

Extensible AIO interface

diff --git a/fs/aio.c b/fs/aio.c
index 71f613c..54acc17 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -424,6 +424,7 @@ static struct kiocb *__aio_get_req(struct kioctx *ctx)
req->private = NULL;
req->ki_iovec = NULL;
INIT_LIST_HEAD(>ki_run_list);
+   req->ki_attrs = NULL;
req->ki_eventfd = NULL;
 
return req;
@@ -538,6 +539,7 @@ static inline void really_put_req(struct kioctx *ctx, 
struct kiocb *req)
req->ki_dtor(req);
if (req->ki_iovec != >ki_inline_vec)
kfree(req->ki_iovec);
+   kfree(req->ki_attrs);
kmem_cache_free(kiocb_cachep, req);
ctx->reqs_active--;
 
@@ -1504,6 +1506,33 @@ static ssize_t aio_setup_iocb(struct kiocb *kiocb, bool 
compat)
return 0;
 }
 
+static int aio_setup_attrs(struct iocb *iocb, struct kiocb *req)
+{
+   u64 size;
+
+   if (!iocb->aio_attrs)
+   return 0;
+
+   if (unlikely(get_user(size, (u64 *) iocb->aio_attrs)))
+   return -EFAULT;
+
+   if (unlikely(size > PAGE_SIZE))
+   return -EFAULT;
+
+   req->ki_attrs = kmalloc(size, GFP_KERNEL);
+   if (unlikely(!req->ki_attrs))
+   return  -ENOMEM;
+
+   if (unlikely(copy_from_user(req->ki_attrs,
+   (void *) iocb->aio_attrs, size))) {
+   kfree(req->ki_attrs);
+   req->ki_attrs = NULL;
+   return -EFAULT;
+   }
+
+   return 0;
+}
+
 static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
 struct iocb *iocb, struct kiocb_batch *batch,
 bool compat)
@@ -1513,7 +1542,9 @@ static int io_submit_one(struct kioctx *ctx, struct iocb 
__user *user_iocb,
ssize_t 

[RFC, PATCH] Extensible AIO interface

2012-10-01 Thread Kent Overstreet
So, I and other people keep running into things where we really need to
add an interface to pass some auxiliary... stuff along with a pread() or
pwrite().

A few examples:

* IO scheduler hints. Some userspace program wants to, per IO, specify
either priorities or a cgroup - by specifying a cgroup you can have a
fileserver in userspace that makes use of cfq's per cgroup bandwidth
quotas.

* Cache hints. For bcache and other things, userspace may want to specify
this data should be cached, this data should bypass the cache, etc.

* Passing checksums out to userspace. We've got bio integrity, which is
a (somewhat) generic interface for passing data checksums between the
filesystem and the hardware. There are various circumstances under which
you may want to pass these checksums out to userspace, and if so we
ought to have a generic way of doing it.

Hence, AIO attributes.

The way it works is I stole the reserved2 field in struct iocb. This
becomes a pointer to struct iocb_attr_list.

An iocb_attr_list is some number of iocb_attrs appended together, along
with the total number of bytes of attributes.

An iocb_attr has an id field, and a size field - and some amount of data
specific to that attribute.

The size fields mean we can iterate through the attributes and find one
with a specific id with generic code - we don't have to know anything
about the attributes we don't care about.

I also added a pointer to struct iocb_attr_list to struct bio. Now, we
can define new attributes, and then anywhere in the block layer (say
cfq, or some driver code) we can lookup any attributes that were
specified for this io.

cfq can then schedule as userspace wants, or bcache can get its cache
hints...

That's pretty much it, for the moment. It's intended to be simple and
extensible.

* FUTURE STUFF:

Return values:

Some attributes are probably going to want to return something to
userspace.

If nothing else, we want this so that userspace can tell if anything
handled the attributes it specified - as dynamic as the io stack can be,
with something extensible like this there really isn't any generic way
of knowing ahead of time if something is going to interpret any
attribute - we want to return at least an error code.

One could imagine sticking the return in the attribute itself, but I
don't want to do this. For some things (checksums), the attribute will
contain a pointer to a buffer - that's fine. But I don't want the
attributes themselves to be writeable.

The reason for this is that struct iocb point to the attributes isn't
completely ideal, and is IMO a stopgap solution. It would be better if
the attributes were inline with the iocbs. But for this we may (?) want
new aio syscalls to replace io_submit()/io_getevents(), but that isn't
something I want to rush into.

If the attributes are inline with the iocbs, the good and natural thing
to do for return values is stick them inline with the io_event
completions.

But I'm not quite sure what I want to do in the meantime, if we end up
needing return values in the short term.


commit 34232e6f28112f049d633f4ecd2d34e536f8c7cf
Author: Kent Overstreet koverstr...@google.com
Date:   Mon Oct 1 13:17:56 2012 -0700

Extensible AIO interface

diff --git a/fs/aio.c b/fs/aio.c
index 71f613c..54acc17 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -424,6 +424,7 @@ static struct kiocb *__aio_get_req(struct kioctx *ctx)
req-private = NULL;
req-ki_iovec = NULL;
INIT_LIST_HEAD(req-ki_run_list);
+   req-ki_attrs = NULL;
req-ki_eventfd = NULL;
 
return req;
@@ -538,6 +539,7 @@ static inline void really_put_req(struct kioctx *ctx, 
struct kiocb *req)
req-ki_dtor(req);
if (req-ki_iovec != req-ki_inline_vec)
kfree(req-ki_iovec);
+   kfree(req-ki_attrs);
kmem_cache_free(kiocb_cachep, req);
ctx-reqs_active--;
 
@@ -1504,6 +1506,33 @@ static ssize_t aio_setup_iocb(struct kiocb *kiocb, bool 
compat)
return 0;
 }
 
+static int aio_setup_attrs(struct iocb *iocb, struct kiocb *req)
+{
+   u64 size;
+
+   if (!iocb-aio_attrs)
+   return 0;
+
+   if (unlikely(get_user(size, (u64 *) iocb-aio_attrs)))
+   return -EFAULT;
+
+   if (unlikely(size  PAGE_SIZE))
+   return -EFAULT;
+
+   req-ki_attrs = kmalloc(size, GFP_KERNEL);
+   if (unlikely(!req-ki_attrs))
+   return  -ENOMEM;
+
+   if (unlikely(copy_from_user(req-ki_attrs,
+   (void *) iocb-aio_attrs, size))) {
+   kfree(req-ki_attrs);
+   req-ki_attrs = NULL;
+   return -EFAULT;
+   }
+
+   return 0;
+}
+
 static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
 struct iocb *iocb, struct kiocb_batch *batch,
 bool compat)
@@ -1513,7 +1542,9 @@ static int io_submit_one(struct kioctx *ctx, struct iocb 
__user *user_iocb,

Re: [RFC, PATCH] Extensible AIO interface

2012-10-01 Thread Zach Brown
On Mon, Oct 01, 2012 at 03:23:41PM -0700, Kent Overstreet wrote:
 So, I and other people keep running into things where we really need to
 add an interface to pass some auxiliary... stuff along with a pread() or
 pwrite().

Sure.  Martin (cc:ed) will sympathize.

 A few examples:
 
 * IO scheduler hints...
 * Cache hints...
 
 * Passing checksums out to userspace. We've got bio integrity, which is
 a (somewhat) generic interface for passing data checksums between the
 filesystem and the hardware.

Hmm, careful here.  I think that in DIF/DIX the checksums are
per-sector, not per IO, right?  That'd mean that the PAGE_SIZE attr
limit in this patch would be magically creating different max IO size
limits on different architectures.  That doesn't seem great.

 Hence, AIO attributes.

I have to be honest: I really don't like tying the interface to AIO, but
I guess it's the only per-io facility we have today.  It'd be nice to
include sync O_DIRECT when designing the interface to make sure that it
is possible to use generic syscalls in the future without running up
against unexpected problems. 

 An iocb_attr has an id field, and a size field - and some amount of data
 specific to that attribute.

I'd hope that we can come up with a less fragile interface.  The kernel
would have to scan the attributes to make sure that there aren't
malicious sizes.  I only quickly glanced at the loops, but it seemed
like you could have a 0 size attribute in there and _next() would spin
forever.

- z
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC, PATCH] Extensible AIO interface

2012-10-01 Thread Kent Overstreet
On Mon, Oct 01, 2012 at 04:12:22PM -0700, Zach Brown wrote:
 On Mon, Oct 01, 2012 at 03:23:41PM -0700, Kent Overstreet wrote:
  So, I and other people keep running into things where we really need to
  add an interface to pass some auxiliary... stuff along with a pread() or
  pwrite().
 
 Sure.  Martin (cc:ed) will sympathize.
 
  A few examples:
  
  * IO scheduler hints...
  * Cache hints...
  
  * Passing checksums out to userspace. We've got bio integrity, which is
  a (somewhat) generic interface for passing data checksums between the
  filesystem and the hardware.
 
 Hmm, careful here.  I think that in DIF/DIX the checksums are
 per-sector, not per IO, right?  That'd mean that the PAGE_SIZE attr
 limit in this patch would be magically creating different max IO size
 limits on different architectures.  That doesn't seem great.

Not just per sector, Per hardware sector. For passing around checksums
userspace would have to find out the hardware sector size and checksum
type/size via a different interface, and then the attribute would
contain a pointer to a buffer that can hold the appropriate number of
checksums.

 
  Hence, AIO attributes.
 
 I have to be honest: I really don't like tying the interface to AIO, but
 I guess it's the only per-io facility we have today.  It'd be nice to
 include sync O_DIRECT when designing the interface to make sure that it
 is possible to use generic syscalls in the future without running up
 against unexpected problems. 

It'd certainly useful with regular sync IO, I just want to take it
one step at a time particularly since for sync IO we'll probably need
new syscalls.

But yes you're right, it would be good to keep in mind.

  An iocb_attr has an id field, and a size field - and some amount of data
  specific to that attribute.
 
 I'd hope that we can come up with a less fragile interface.  The kernel
 would have to scan the attributes to make sure that there aren't
 malicious sizes.  I only quickly glanced at the loops, but it seemed
 like you could have a 0 size attribute in there and _next() would spin
 forever.

Ouch, yeah that's wrong :/

I don't think there's anything fragile about the basic idea though. Or
do you have some way of improving upon it in mind?

The idea with the size field is that it's just sizeof(the particular
attribute struct), so when userspace is appending attributes it just
sets size = sizeof() and attr_list-size += attr-size.

The kernel is going to have to sanity check the size fields of the
individual attributes anyways to verify the size of the last attr
doesn't extend off the end of the attr list, so I think it makes sense
to keep the current semantics of the size fields and just also check
that the size field is nonzero (actually = sizeof(struct iocb_attr)).
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC, PATCH] Extensible AIO interface

2012-10-01 Thread Zach Brown
 Not just per sector, Per hardware sector. For passing around checksums
 userspace would have to find out the hardware sector size and checksum
 type/size via a different interface, and then the attribute would
 contain a pointer to a buffer that can hold the appropriate number of
 checksums.

All problems fall to another layer of indirection? :)  But yes, that's
fair.  I was obviously just assuming that the checksums would be in the
attribute.

But now we're talking about layers of user pointers.  Would the
attribute parser need to verify/copy pointers before downstream kernel
code tries to work with it?  Would it be up to the attribute consumers
to verify the pointers that the core doesn't really touch?  Are these
second pointers native (enter compat goo) or u64s?

 I don't think there's anything fragile about the basic idea though. Or
 do you have some way of improving upon it in mind?

Nothing super great is springing to mind, no.

 The idea with the size field is that it's just sizeof(the particular
 attribute struct), so when userspace is appending attributes it just
 sets size = sizeof() and attr_list-size += attr-size.

I suppose.  But this also raises the spectre of aligning the packed
attributes to match their struct definitions.  It's the netlink(3)
macros all over again, right?  I guess unaligned accesses aren't *that*
big a deal.  But still.

And what about duplicate instances of a given attribute id?  Use the
first?  The last?  Error?  Depends on the id?

It just seems like there are a lot of corner cases that can go wrong
with an API that is so free form.  I'd like something a lot harder to
make mistakes with.

- z
(being That Guy today, apparently :/)
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC, PATCH] Extensible AIO interface

2012-10-01 Thread Kent Overstreet
On Mon, Oct 01, 2012 at 04:44:39PM -0700, Zach Brown wrote:
  Not just per sector, Per hardware sector. For passing around checksums
  userspace would have to find out the hardware sector size and checksum
  type/size via a different interface, and then the attribute would
  contain a pointer to a buffer that can hold the appropriate number of
  checksums.
 
 All problems fall to another layer of indirection? :) 

Yep :)

 But yes, that's
 fair.  I was obviously just assuming that the checksums would be in the
 attribute.
 
 But now we're talking about layers of user pointers.  Would the
 attribute parser need to verify/copy pointers before downstream kernel
 code tries to work with it?  Would it be up to the attribute consumers
 to verify the pointers that the core doesn't really touch? 

The generic code wouldn't know about any user pointers inside
attributes, so it'd have to be downstream consumers. Hopefully there
won't be many attributes with user pointers in them (I don't expect
there to be), so we won't have too much of this messyness.

 Are these
 second pointers native (enter compat goo) or u64s?

Outside the scope of the generic implementation, but AFAIK u64s are the
sane way so I'd prefer to just enforce that.

  I don't think there's anything fragile about the basic idea though. Or
  do you have some way of improving upon it in mind?
 
 Nothing super great is springing to mind, no.
 
  The idea with the size field is that it's just sizeof(the particular
  attribute struct), so when userspace is appending attributes it just
  sets size = sizeof() and attr_list-size += attr-size.
 
 I suppose.  But this also raises the spectre of aligning the packed
 attributes to match their struct definitions.  It's the netlink(3)
 macros all over again, right?  I guess unaligned accesses aren't *that*
 big a deal.  But still.

I was just thinking about that a few minutes ago. Since the way I'm
doing it embeds struct iocb_attr inside the specific attributes, if we
stick __aligned(8) on struct iocb_attr that should solve alignment.

 And what about duplicate instances of a given attribute id?  Use the
 first?  The last?  Error?  Depends on the id?

I suspect duplicate instances will be useful and the sanest approach
for a few things, so I don't want to disallow it.

But - perhaps we could define whether duplicates are allowed of a given
attribute along with the attribute ids, then the kernel in generic code
could check for duplicates of attrs for which it was disallowed and
error.

I think I really like that idea.

 It just seems like there are a lot of corner cases that can go wrong
 with an API that is so free form.  I'd like something a lot harder to
 make mistakes with.
 
 - z
 (being That Guy today, apparently :/)

I think it's got to be free form to be useful, but that doesn't mean we
can't avoid as many mistakes as possible ahead of time so please
continue to point out anything you can think of :)
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC, PATCH] Extensible AIO interface

2012-10-01 Thread Kent Overstreet
On Mon, Oct 01, 2012 at 04:44:39PM -0700, Zach Brown wrote:
 And what about duplicate instances of a given attribute id?  Use the
 first?  The last?  Error?  Depends on the id?

I thought of a better idea, instead of explicitly checking for
disallowed dups:

We want to return -ENOTHANDLED for not handled attributes anyways, so
let's just do that for dups - that'll catch erronious usage just fine
and a generic mechanism's better than a one off hack any day.

This does mean we can't punt on return values, which isn't a bad thing.

Also, if we've got duplicate attributes userspace needs to be able to
figure out which return value was for which attribute.

Two possibilities: one, return values come out in the same order
attributes went in. That'd work, but I dislike the subtlety and I expect
it'd be a pain for userspace.

Instead, let's just stick a u64 cookie in the attribute and include that
in the return, just like we do everywhere else.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/