Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-06 Thread Jens Axboe

On Tue, Feb 06 2001, [EMAIL PROTECTED] wrote:
> >It depends on the device driver.  Different controllers will have
> >different maximum transfer size.  For IDE, for example, we get wakeups
> >all over the place.  For SCSI, it depends on how many scatter-gather
> >entries the driver can push into a single on-the-wire request.  Exceed
> >that limit and the driver is forced to open a new scsi mailbox, and
> >you get independent completion signals for each such chunk.

SCSI does not build a request bigger than the low-level driver
can handle. If you exceed the scatter count in a single request,
you just stop and fire off that request, later on restarting I/O
on the remainder.
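
A minimal sketch of that split-and-restart loop (names like submit_chunk
and max_segments are invented for illustration; this is not the actual
2.4 scsi/ll_rw_blk code):

/* Walk a buffer_head-style chain, firing off a piece each time the
 * driver's scatter-gather limit is reached, then restarting on the
 * remainder.  Illustration only. */
struct seg { struct seg *next; };

static void submit_chunk(struct seg *first, int nr)
{
        /* hand 'nr' segments starting at 'first' to the low-level driver */
}

static void submit_in_chunks(struct seg *chain, int max_segments)
{
        while (chain) {
                struct seg *first = chain;
                int nr = 0;

                while (chain && nr < max_segments) {
                        chain = chain->next;
                        nr++;
                }
                submit_chunk(first, nr);   /* fire off this piece... */
        }                                  /* ...and restart on the rest */
}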

> I see. I remember Jens Axboe mentioning something like this with IDE.
> So, in this case, you want every such chunk to check if its completed
> filling up a buffer and then trigger a wakeup on that ?

Yes. Which is why dealing with buffer heads is so nice in this
regard: you never have problems with ending I/O on a single "piece".

> But, does this also mean that in such a case combining requests beyond this
> limit doesn't really help ? (Reordering requests to get contiguity would
> help of course in terms of seek times, I guess, but not merging beyond this
> limit)

There's a slight benefit in building bigger requests than the driver
can handle, in that you can have more I/O pending on the queue. It's
not worth spending too much time on though.

-- 
Jens Axboe




Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-06 Thread bsuparna


>Hi,
>
>On Mon, Feb 05, 2001 at 08:01:45PM +0530, [EMAIL PROTECTED] wrote:
>>
>> >It's the very essence of readahead that we wake up the earlier buffers
>> >as soon as they become available, without waiting for the later ones
>> >to complete, so we _need_ this multiple completion concept.
>>
>> I can understand this in principle, but when we have a single request going
>> down to the device that actually fills in multiple buffers, do we get
>> notified (interrupted) by the device before all the data in that request
>> got transferred ?
>
>It depends on the device driver.  Different controllers will have
>different maximum transfer size.  For IDE, for example, we get wakeups
>all over the place.  For SCSI, it depends on how many scatter-gather
>entries the driver can push into a single on-the-wire request.  Exceed
>that limit and the driver is forced to open a new scsi mailbox, and
>you get independent completion signals for each such chunk.

I see. I remember Jens Axboe mentioning something like this with IDE.
So, in this case, you want every such chunk to check if it's completed
filling up a buffer and then trigger a wakeup on that ?
But, does this also mean that in such a case combining requests beyond this
limit doesn't really help ? (Reordering requests to get contiguity would
help of course in terms of seek times, I guess, but not merging beyond this
limit)

>> >Which is exactly why we have one kiobuf per higher-level buffer, and
>> >we chain together kiobufs when we need to for a long request, but we
>> >still get the independent completion notifiers.
>>
>> As I mentioned above, the alternative is to have the i/o completion related
>> linkage information within the wakeup structures instead. That way, it
>> doesn't matter to the lower level driver what higher level structure we
>> have above (maybe buffer heads, may be page cache structures, may be
>> kiobufs). We only chain together memory descriptors for the buffers during
>> the io.
>
>You forgot IO failures: it is essential, once the IO completes, to
>know exactly which higher-level structures completed successfully and
>which did not.  The low-level drivers have to have access to the
>independent completion notifications for this to work.
>
No, I didn't forget IO failures; it's just that I expect the wait structure
containing the wakeup function to be embedded in a cev structure that
contains a pointer to the wait_queue_head field in the higher level
structure. The rest is for the wakeup function to interpret (it can always
access the other fields in the higher level structure - just like
list_entry() does).

Later I realized that instead of having multiple wakeup functions queued on
the low level structure's wait queue, it's perhaps better to just sort of
turn the cev_wait structure upside down (the entry on the lower level
structure's queue should link to the parent entries instead).
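
A toy userspace rendering of that idea (every name here - cev_wait,
wake_buffer, buffer_head_like - is made up; the point is only that the
entry queued on the low-level object carries a wakeup callback plus a
link to its parent, and the callback recovers its enclosing structure
the way list_entry() would):

#include <stdio.h>
#include <stddef.h>

/* Hypothetical compound-event wait entry. */
struct cev_wait {
        struct cev_wait *parent;               /* entry on the parent's chain    */
        void (*wake)(struct cev_wait *, int);  /* called on (partial) completion */
};

struct buffer_head_like {                      /* stand-in for bh/page/kiobuf */
        const char *name;
        int uptodate;
        struct cev_wait wait;
};

static void wake_buffer(struct cev_wait *w, int err)
{
        /* recover the enclosing structure, as list_entry() would */
        struct buffer_head_like *b = (void *)((char *)w -
                        offsetof(struct buffer_head_like, wait));

        b->uptodate = !err;
        printf("%s completed, err=%d\n", b->name, err);
        if (w->parent && w->parent->wake)
                w->parent->wake(w->parent, err);   /* propagate up the chain */
}

int main(void)
{
        struct buffer_head_like b = { "bh0", 0, { NULL, wake_buffer } };

        /* the low-level code only ever sees the cev_wait entry */
        b.wait.wake(&b.wait, 0);
        return 0;
}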







Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-05 Thread Manfred Spraul

"Stephen C. Tweedie" wrote:
> 
> The original multi-page buffers came from the map_user_kiobuf
> interface: they represented a user data buffer.  I'm not wedded to
> that format --- we can happily replace it with a fine-grained sg list
>
Could you change that interface?

<<< from Linus mail:

struct buffer {
        struct page *page;
        u16 offset, length;
};

>>

/* returns the number of used buffers, or <0 on error */
int map_user_buffer(struct buffer *ba, int max_bcount,
                    void *addr, int len);
void unmap_buffer(struct buffer *ba, int bcount);

That's enough for the zero copy pipe code ;-)
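
For illustration, a caller of that interface might look like the sketch
below (driver_queue_sg is an invented placeholder for whatever consumes
the page/offset/length array; error handling abbreviated):

/* Hypothetical zero-copy write using the proposed interface above. */
#define MAX_BUFS 16

int zero_copy_write(void *uaddr, int len)
{
        struct buffer ba[MAX_BUFS];
        int n, err;

        n = map_user_buffer(ba, MAX_BUFS, uaddr, len);
        if (n < 0)
                return n;               /* could not map the user range */

        err = driver_queue_sg(ba, n);   /* invented: hand the sg array on */

        unmap_buffer(ba, n);
        return err;
}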

Real hw drivers probably need a replacement for pci_map_single()
(pci_map_and_align_and_bounce_buffer_array())

The kiobuf structure could contain these 'struct buffer' instead of the
current 'struct page' pointers.

> 
> In other words, even if we expand the kiobuf into a sg vector list,
> when it comes to merging requests in ll_rw_blk.c we still need to
> track the callbacks on each independent source kiobufs.
>
Probably.


--
Manfred




Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-05 Thread Stephen C. Tweedie

Hi,

On Mon, Feb 05, 2001 at 11:06:48PM +, Alan Cox wrote:
> > do you then tell the application _above_ raid0 if one of the
> > underlying IOs succeeds and the other fails halfway through?
> 
> struct 
> {
>   u32 flags;  /* because everything needs flags */
>   struct io_completion *completions;
>   kiovec_t sglist[0];
> } thingy;
> 
> now kmalloc one object of the header the sglist of the right size and the
> completion list. Shove the completion list on the end of it as another
> array of objects and what is the problem.

XFS uses both small metadata items in the buffer cache and large
pagebufs.  You may have merged a 512-byte read with a large pagebuf
read: one completion callback is associated with a single sg fragment,
the next callback belongs to a dozen different fragments.  Associating
the two lists becomes non-trivial, although it could be done.

--Stephen



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-05 Thread Alan Cox

> do you then tell the application _above_ raid0 if one of the
> underlying IOs succeeds and the other fails halfway through?

struct 
{
        u32 flags;  /* because everything needs flags */
        struct io_completion *completions;
        kiovec_t sglist[0];
} thingy;

now kmalloc one object holding the header, the sglist of the right size and the
completion list. Shove the completion list on the end of it as another
array of objects, and what is the problem?
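
Spelled out, that single allocation might look roughly like this
(illustrative only; the allocator name is invented, io_completion and
kiovec_t are as in the struct above):

/* One kmalloc() for header + sglist + completion array, with the
 * completions shoved on the end as another array of objects. */
struct thingy *alloc_thingy(int nfrags)
{
        struct thingy *t;
        size_t size = sizeof(*t)
                    + nfrags * sizeof(kiovec_t)
                    + nfrags * sizeof(struct io_completion);

        t = kmalloc(size, GFP_KERNEL);
        if (!t)
                return NULL;

        t->flags = 0;
        /* completion list lives immediately after the sglist */
        t->completions = (struct io_completion *)&t->sglist[nfrags];
        return t;
}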

> In other words, even if we expand the kiobuf into a sg vector list,
> when it comes to merging requests in ll_rw_blk.c we still need to
> track the callbacks on each independent source kiobufs.  

But that can be two arrays




Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-05 Thread Stephen C. Tweedie

Hi,

On Mon, Feb 05, 2001 at 10:28:37PM +0100, Ingo Molnar wrote:
> 
> On Mon, 5 Feb 2001, Stephen C. Tweedie wrote:
> 
> it's exactly these 'compound' structures i'm vehemently against. I do
> think it's a design nightmare. I can picture these monster kiobufs
> complicating the whole code for no good reason - we couldnt even get the
> bh-list code in block_device.c right - why do you think kiobufs *all
> across the kernel* will be any better?
> 
> RAID0 is not an issue. Split it up, use separate kiobufs for every
> different disk.

Umm, that's not the point --- of course you can use separate kiobufs
for the communication between raid0 and the underlying disks, but what
do you then tell the application _above_ raid0 if one of the
underlying IOs succeeds and the other fails halfway through?

And what about raid1?  Are you really saying that raid1 doesn't need
to know which blocks succeeded and which failed?  That's the level of
completion information I'm worrying about at the moment.

> fragmented skbs are a different matter: they are simply a bit more generic
> abstractions of 'memory buffer'. Clear goal, clear solution. I do not
> think kiobufs have clear goals.

The goal: allow arbitrary IOs to be pushed down through the stack in
such a way that the callers can get meaningful information back about
what worked and what did not.  If the write was a 128kB raw IO, then
you obviously get coarse granularity of completion callback.  If the
write was a series of independent pages which happened to be
contiguous on disk, you actually get told which pages hit disk and
which did not.

> and what is the goal of having multi-page kiobufs. To avoid having to do
> multiple function calls via a simpler interface? Shouldnt we optimize that
> codepath instead?

The original multi-page buffers came from the map_user_kiobuf
interface: they represented a user data buffer.  I'm not wedded to
that format --- we can happily replace it with a fine-grained sg list
--- but the reason they have been pushed so far down the IO stack is
the need for accurate completion information on the originally
requested IOs.

In other words, even if we expand the kiobuf into a sg vector list,
when it comes to merging requests in ll_rw_blk.c we still need to
track the callbacks on each independent source kiobufs.  

--Stephen



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains

2001-02-05 Thread Ingo Molnar


On Mon, 5 Feb 2001, Stephen C. Tweedie wrote:

> > Obviously the disk access itself must be sector aligned and the total
> > length must be a multiple of the sector length, but there shouldn't be
> > any restrictions on the data buffers.
>
> But there are. Many controllers just break down and corrupt things
> silently if you don't align the data buffers (Jeff Merkey found this
> by accident when he started generating unaligned IOs within page
> boundaries in his NWFS code). And a lot of controllers simply cannot
> break a sector dma over a page boundary (at least not without some
> form of IOMMU remapping).

so we are putting workarounds for hardware bugs into the design?

Ingo




Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains

2001-02-05 Thread Ingo Molnar


On Mon, 5 Feb 2001, Stephen C. Tweedie wrote:

> And no, the IO success is *not* necessarily sequential from the start
> of the IO: if you are doing IO to raid0, for example, and the IO gets
> striped across two disks, you might find that the first disk gets an
> error so the start of the IO fails but the rest completes.  It's the
> completion code which notifies the caller of what worked and what did
> not.

it's exactly these 'compound' structures i'm vehemently against. I do
think it's a design nightmare. I can picture these monster kiobufs
complicating the whole code for no good reason - we couldn't even get the
bh-list code in block_device.c right - why do you think kiobufs *all
across the kernel* will be any better?

RAID0 is not an issue. Split it up, use separate kiobufs for every
different disk. We need simple constructs - i do not understand why nobody
sees that these big fat monster-trucks of IO workload are *trouble*. They
keep things localized, instead of putting workload components into the
system immediately. We'll have performance bugs nobody has seen before.
bhs have one very nice property: they are simple, modularized. I think
this is like CISC vs. RISC: CISC designs ended up splitting 'fat
instructions' up into RISC-like instructions.

fragmented skbs are a different matter: they are simply a bit more generic
abstractions of 'memory buffer'. Clear goal, clear solution. I do not
think kiobufs have clear goals.

and i do not buy the performance arguments. In 2.4.1 we improved block-IO
performance dramatically by fixing high-load IO scheduling. Write
performance suddenly improved dramatically, there is a 30-40% improvement
in dbench performance. To put it another way: *we needed 5 years to fix a
serious IO-subsystem performance bug*. Block IO was already too complex -
and Alex & Andrea have done a nice job streamlining and cleaning it up for
2.4. We should simplify it further - and optimize the components, instead
of bringing in yet another *big* complication into the API.

and what is the goal of having multi-page kiobufs? To avoid having to do
multiple function calls via a simpler interface? Shouldn't we optimize that
codepath instead?

Ingo




Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-05 Thread Stephen C. Tweedie

Hi,

On Mon, Feb 05, 2001 at 08:36:31AM -0800, Linus Torvalds wrote:

> Have you ever thought about other things, like networking, special
> devices, stuff like that? They can (and do) have packet boundaries that
> have nothing to do with pages what-so-ever. They can have such notions as
> packets that contain multiple streams in one packet, where it ends up
> being split up into several pieces. Where neither the original packet
> _nor_ the final pieces have _anything_ to do with "pages".
> 
> THERE IS NO PAGE ALIGNMENT.

And kiobufs don't require IO to be page aligned, and they never have.
The only page alignment they assume is that if a *single*
scatter-gather element spans multiple pages, then the joins between
those pages occur on page boundaries.

Remember, a kiobuf is only designed to represent one scatter-gather
fragment, not a full sg list.  That was the whole reason for having a
kiovec as a separate concept: if you have more than one independent
fragment in the sg-list, you need more than one kiobuf.

And the reason why we created sg fragments which can span pages was so
that we can encode IOs which interact with the VM: any arbitrary
virtually-contiguous user data buffer can be mapped into a *single*
kiobuf for a write() call, so it's a generic way of supporting things
like O_DIRECT without the IO layers having to know anything about VM
(and Ben's async IO patches also use kiobufs in this way to allow
read()s to write to the user's data buffer once the IO completes,
without having to have a context switch back into that user's
context.)  Similarly, any extent of a file in the page cache can be
encoded in a single kiobuf.

And no, the simpler networking-style sg-list does not cut it for block
device IO, because for block devices, we want to have separate
completion status made available for each individual sg fragment in
the IO.  *That* is why the kiobuf is more heavyweight than the
networking variant: each fragment [kiobuf] in the scatter-gather list
[kiovec] has its own completion information.  

If we have a bunch of separate data buffers queued for sequential disk
IO as a single request, then we still want things like readahead and
error handling to work.  That means that we want the first kiobuf in
the chain to get its completion wakeup as soon as that segment of the
IO is complete, without having to wait for the remaining sectors of
the IO to be transferred.  It also means that if we've done something
like split the IO over a raid stripe, then when an error occurs, we
still want to know which of the callers' buffers succeeded and which
failed.

Yes, I agree that the original kiovec mechanism of using a *kiobuf[]
array to assemble the scatter-gather fragments sucked.  But I don't
believe that just throwing away the concept of kiobuf as an sg-fragment
will work either when it comes to disk IOs: the need for per-fragment
completion is too compelling.  I'd rather shift to allowing kiobufs to
be assembled into linked lists for IO to avoid *kiobuf[] vectors, in
just the same way that we currently chain buffer_heads for IO.  
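
A skeleton of what such chaining could look like (the next/errno/end_io
members below are hypothetical, not fields of the existing struct kiobuf):

/* Chain kiobuf-like fragments for one long I/O while keeping a
 * completion notifier per fragment, much as buffer_heads are chained
 * today.  Purely illustrative. */
struct kio_frag {
        struct kio_frag *next;          /* next fragment of the request */
        int              errno;         /* this fragment's own status   */
        void           (*end_io)(struct kio_frag *);
};

/* Called by the low-level code as each fragment finishes (or fails), so
 * earlier fragments can complete before later ones are even transferred. */
static void end_kio_frag(struct kio_frag *f, int err)
{
        f->errno = err;
        if (f->end_io)
                f->end_io(f);
}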

Cheers,
 Stephen



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains

2001-02-05 Thread Linus Torvalds



On Mon, 5 Feb 2001, Stephen C. Tweedie wrote:
> > Thats true for _block_ disk devices but if we want a generic kiovec then
> > if I am going from video capture to network I dont need to force anything more
> > than 4 byte align
> 
> Kiobufs have never, ever required the IO to be aligned on any
> particular boundary.  They simply make the assumption that the
> underlying buffered object can be described in terms of pages with
> some arbitrary (non-aligned) start/offset.  Every video framebuffer
> I've ever seen satisfies that, so you can easily map an arbitrary
> contiguous region of the framebuffer with a kiobuf already.

Stop this idiocy, Stephen. You're _this_ close to being the first person I
ever blacklist from my mailbox.

Network. Packets. Fragmentation. Or just non-page-sized MTU's. 

It is _not_ a "series of contiguous pages". Never has been. Never will be.
So stop making excuses.

Also, think of protocols that may want to gather stuff from multiple
places, where the boundaries have little to do with pages but are
specified some other way. Imagine doing "writev()" style operations to
disk, gathering stuff from multiple sources into one operation.

Think of GART remappings - you can have multiple pages that show up as one
"linear" chunk to the graphics device behind the AGP bridge, but that are
_not_ contiguous in real memory.

There just is NO excuse for the "linear series of pages" view. And if you
cannot realize that, then I don't know what's wrong with you. Your
arguments are obviously crap, and the stuff you seem unable to argue
against (like networking) you decide to just ignore. Get your act
together.

Linus




Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-05 Thread Alan Cox

> Kiobufs have never, ever required the IO to be aligned on any
> particular boundary.  They simply make the assumption that the
> underlying buffered object can be described in terms of pages with
> some arbitrary (non-aligned) start/offset.  Every video framebuffer

start/length per page ?

> I've ever seen satisfies that, so you can easily map an arbitrary
> contiguous region of the framebuffer with a kiobuf already.

Video is non-contiguous ranges. In fact, if you are blitting to a card with
tiled memory it gets very interesting in its video lists.




Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-05 Thread Stephen C. Tweedie

Hi,

On Mon, Feb 05, 2001 at 05:29:47PM +, Alan Cox wrote:
> > 
> > _All_ drivers would have to do that in the degenerate case, because
> > none of our drivers can deal with a dma boundary in the middle of a
> > sector, and even in those places where the hardware supports it in
> > theory, you are still often limited to word-alignment.
> 
> Thats true for _block_ disk devices but if we want a generic kiovec then
> if I am going from video capture to network I dont need to force anything more
> than 4 byte align

Kiobufs have never, ever required the IO to be aligned on any
particular boundary.  They simply make the assumption that the
underlying buffered object can be described in terms of pages with
some arbitrary (non-aligned) start/offset.  Every video framebuffer
I've ever seen satisfies that, so you can easily map an arbitrary
contiguous region of the framebuffer with a kiobuf already.

--Stephen



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-05 Thread Alan Cox

> > kiovec_align(kiovec, 512);
> > and have it do the bounce buffers ?
> 
> _All_ drivers would have to do that in the degenerate case, because
> none of our drivers can deal with a dma boundary in the middle of a
> sector, and even in those places where the hardware supports it in
> theory, you are still often limited to word-alignment.

That's true for _block_ disk devices, but if we want a generic kiovec then
if I am going from video capture to network I don't need to force anything more
than 4-byte alignment.




Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-05 Thread Stephen C. Tweedie

Hi,

On Mon, Feb 05, 2001 at 03:19:09PM +, Alan Cox wrote:
> > Yes, it's the sort of thing that you would hope should work, but in
> > practice it's not reliable.
> 
> So the less smart devices need to call something like
> 
>   kiovec_align(kiovec, 512);
> 
> and have it do the bounce buffers ?

_All_ drivers would have to do that in the degenerate case, because
none of our drivers can deal with a dma boundary in the middle of a
sector, and even in those places where the hardware supports it in
theory, you are still often limited to word-alignment.

--Stephen




Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains

2001-02-05 Thread Linus Torvalds



On Mon, 5 Feb 2001, Manfred Spraul wrote:
> "Stephen C. Tweedie" wrote:
> > 
> > You simply cannot do physical disk IO on
> > non-sector-aligned memory or in chunks which aren't a multiple of
> > sector size.
> 
> Why not?
> 
> Obviously the disk access itself must be sector aligned and the total
> length must be a multiple of the sector length, but there shouldn't be
> any restrictions on the data buffers.

In fact, regular IDE DMA allows arbitrary scatter-gather at least in
theory. Linux has never used it, so I don't know how well it works in
practice - I would not be surprised if it ends up causing no end of nasty 
corner-cases that have bugs. It's not as if IDE controllers always follow 
the documentation ;)

The _total_ length of the buffers has to be a multiple of the sector
size, and there are some alignment issues (each scatter-gather area has to
be at least 16-bit aligned both in physical memory and in length, and
apparently many controllers need 32-bit alignment). And I'd almost be
surprised if there wouldn't be hardware that wanted cache alignment
because they always expect to burst. 
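
As a concrete reading of those constraints (the sg element layout and the
32-bit alignment figure below are assumptions, not a claim about any
particular controller):

/* Check an sg list against the rules above: each element aligned in
 * address and length, total a multiple of the 512-byte sector size. */
struct sg_elem { unsigned long addr; unsigned int len; };

static int sg_ok_for_ide_dma(const struct sg_elem *sg, int n)
{
        unsigned int total = 0;
        int i;

        for (i = 0; i < n; i++) {
                if ((sg[i].addr | sg[i].len) & 3)   /* 32-bit alignment */
                        return 0;
                total += sg[i].len;
        }
        return (total % 512) == 0;
}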

But despite a lot of likely practical reasons why it won't work for
arbitrary sg lists on plain IDE DMA, there is no _theoretical_ reason it
wouldn't. And there are bound to be better controllers that could handle
it.

Linus




Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains

2001-02-05 Thread Linus Torvalds



On Mon, 5 Feb 2001, Stephen C. Tweedie wrote:
> 
> On Sat, Feb 03, 2001 at 12:28:47PM -0800, Linus Torvalds wrote:
> > 
> > Neither the read nor the write are page-aligned. I don't know where you
> > got that idea. It's obviously not true even in the common case: it depends
> > _entirely_ on what the file offsets are, and expecting the offset to be
> > zero is just being stupid. It's often _not_ zero. With networking it is in
> > fact seldom zero, because the network packets are seldom aligned either in
> > size or in location.
> 
> The underlying buffer is.  The VFS (and the current kiobuf code) is
> already happy about IO happening at odd offsets within a page.

Stephen. 

Don't bother even talking about this. You're so damn hung up about the
page cache that it's not funny.

Have you ever thought about other things, like networking, special
devices, stuff like that? They can (and do) have packet boundaries that
have nothing to do with pages what-so-ever. They can have such notions as
packets that contain multiple streams in one packet, where it ends up
being split up into several pieces. Where neither the original packet
_nor_ the final pieces have _anything_ to do with "pages".

THERE IS NO PAGE ALIGNMENT.

So stop blathering about it.

Of _course_ the current kiobuf code has page-alignment assumptions. You
_designed_ it that way. So bringing it up as an example is a circular
argument. And a really stupid one at that, as that's the thing I've been
quoting as the single biggest design bug in all of kiobufs. It's the thing
that makes them entirely useless for things like describing "struct
msghdr" etc. 

We should get _away_ from this page-alignment fallacy. It's not true. It's
not necessarily even true for the page cache - which has no real
fundamental reasons any more for not being able to be a "variable-size"
cache some time in the future (ie it might be a per-address-space decision
on whether the granularity is 1, 2, 4 or more pages).

Anything that designs for "everything is a page" will automatically be
limited for cases where you might sometimes have 64kB chunks of data.

Instead, just face the realization that "everything is a bunch of ranges",
and leave it at that. It's true _already_ - think about fragmented IP
packets. We may not handle it that way completely yet, but the zero-copy
networking is going in this direction.

And as long as you keep on harping about page alignment, you're not going
to play in this game. End of story. 

Linus




Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-05 Thread Alan Cox

> Yes, it's the sort of thing that you would hope should work, but in
> practice it's not reliable.

So the less smart devices need to call something like

kiovec_align(kiovec, 512);

and have it do the bounce buffers ?
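
Roughly what such a helper might do (kiovec_align does not exist; the
kiovec/frag fields and the bounce bookkeeping are invented for this sketch):

/* Any fragment whose offset or length is not aligned to 'boundary' gets
 * a bounce buffer; the driver DMAs to/from the bounce copy and the data
 * is copied back to the real pages on completion. */
int kiovec_align(struct kiovec *vec, int boundary)
{
        int i;

        for (i = 0; i < vec->nr_frags; i++) {
                struct frag *f = &vec->frags[i];

                if ((f->offset | f->length) & (boundary - 1)) {
                        f->bounce = kmalloc(f->length, GFP_KERNEL);
                        if (!f->bounce)
                                return -ENOMEM;
                }
        }
        return 0;
}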





Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-05 Thread Stephen C. Tweedie

Hi,

On Mon, Feb 05, 2001 at 01:00:51PM +0100, Manfred Spraul wrote:
> "Stephen C. Tweedie" wrote:
> > 
> > You simply cannot do physical disk IO on
> > non-sector-aligned memory or in chunks which aren't a multiple of
> > sector size.
> 
> Why not?
> 
> Obviously the disk access itself must be sector aligned and the total
> length must be a multiple of the sector length, but there shouldn't be
> any restrictions on the data buffers.

But there are.  Many controllers just break down and corrupt things
silently if you don't align the data buffers (Jeff Merkey found this
by accident when he started generating unaligned IOs within page
boundaries in his NWFS code).  And a lot of controllers simply cannot
break a sector dma over a page boundary (at least not without some
form of IOMMU remapping).

Yes, it's the sort of thing that you would hope should work, but in
practice it's not reliable.

Cheers,
 Stephen



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-05 Thread Stephen C. Tweedie

Hi,

On Mon, Feb 05, 2001 at 08:01:45PM +0530, [EMAIL PROTECTED] wrote:
> 
> >It's the very essence of readahead that we wake up the earlier buffers
> >as soon as they become available, without waiting for the later ones
> >to complete, so we _need_ this multiple completion concept.
> 
> I can understand this in principle, but when we have a single request going
> down to the device that actually fills in multiple buffers, do we get
> notified (interrupted) by the device before all the data in that request
> got transferred ?

It depends on the device driver.  Different controllers will have
different maximum transfer size.  For IDE, for example, we get wakeups
all over the place.  For SCSI, it depends on how many scatter-gather
entries the driver can push into a single on-the-wire request.  Exceed
that limit and the driver is forced to open a new scsi mailbox, and
you get independent completion signals for each such chunk.

> >Which is exactly why we have one kiobuf per higher-level buffer, and
> >we chain together kiobufs when we need to for a long request, but we
> >still get the independent completion notifiers.
> 
> As I mentioned above, the alternative is to have the i/o completion related
> linkage information within the wakeup structures instead. That way, it
> doesn't matter to the lower level driver what higher level structure we
> have above (maybe buffer heads, may be page cache structures, may be
> kiobufs). We only chain together memory descriptors for the buffers during
> the io.

You forgot IO failures: it is essential, once the IO completes, to
know exactly which higher-level structures completed successfully and
which did not.  The low-level drivers have to have access to the
independent completion notifications for this to work.

Cheers,
 Stephen



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-05 Thread bsuparna



>Hi,
>
>On Sun, Feb 04, 2001 at 06:54:58PM +0530, [EMAIL PROTECTED] wrote:
>>
>> Can't we define a kiobuf structure as just this ? A combination of a
>> frag_list and a page_list ?
>

>Then all code which needs to accept an arbitrary kiobuf needs to be
>able to parse both --- ugh.
>

Making this a little more explicit to help analyse tradeoffs:

/* Memory descriptor portion of a kiobuf - this is something that may get
passed around between layers and subsystems */
struct kio_mdesc {
 int nr_frags;
 struct frag *frag_list;
 int nr_pages;
 struct page **page_list;
 /* list follows */
};

For block i/o requiring #1 type descriptors, the list could have allocated
extra space for:
struct kio_type1_ext {
 struct frag frag;
 struct page *pages[NUM_STATIC_PAGES];
};

For n/w i/o or cases requiring  #2 type descriptors, the list could have
allocated extra space for:

struct kio_type2_ext {
 struct frag frags[NUM_STATIC_FRAGS];
 struct page *page[NUM_STATIC_FRAGS];
};


struct kiobuf {
 int                 status;
 wait_queue_head_t   waitq;
 struct kio_mdesc    mdesc;
 /* list follows - leaves room for allocation for mem descs, completion
    sub structs etc */
};

Code that accepts an arbitrary kiobuf needs to do the following:
 process the fragments one by one
  - type #1 case: only one fragment would typically be there, but
    processing it would involve crossing all pages in the page list.
    So extra processing vs a kiobuf with a single <offset, len>
    pair involves the following:
        dereferencing the frag_list pointer
        checking the nr_frags field
  - type #2 case: the number of fragments would be equal to or
    greater than the number of pages, so processing will typically go over
    each fragment and thus cross each page in the list one by one.
    So extra processing vs a kiobuf with per-page <offset, len>
    pairs involves:
        dereferencing the page list entry (involves computing the
        page-index in the page_list from the offset value)
        checking that offset+len doesn't fall outside the page
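
That walk might look like the following sketch (struct frag is assumed to
carry an offset into the shared page_list plus a length; the helper name
is made up):

/* Iterate a kio_mdesc fragment by fragment, stepping page by page
 * through the shared page_list within each fragment. */
void for_each_kio_page(struct kio_mdesc *md,
                       void (*fn)(struct page *pg, int offset, int len))
{
        int i;

        for (i = 0; i < md->nr_frags; i++) {
                struct frag *f = &md->frag_list[i];
                int off = f->offset;
                int left = f->len;

                while (left > 0) {
                        int pgidx = off / PAGE_SIZE;   /* index into page_list */
                        int pgoff = off % PAGE_SIZE;
                        int chunk = PAGE_SIZE - pgoff;

                        if (chunk > left)
                                chunk = left;
                        fn(md->page_list[pgidx], pgoff, chunk);
                        off += chunk;
                        left -= chunk;
                }
        }
}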


This boils down to approximately one extra dereference and one comparison per
kiobuf for the common cases (have I missed something critical ?) vs the most
optimized choice of descriptors for those cases.

In terms of resource consumption (extra bytes taken up), two fields extra
per kiobuf chain (e.g. nr_frags and frag_list pointer when it comes to #1),
i.e. a total of 8 bytes, for the common cases vs the most optimized choice
of structures for those cases.

This seems to be more uniformly balanced across #1 and #2 cases than an
<offset, len> for every page, as well as an overall <offset, len>. But,
then, come to think of it, since the need for lightweight structures is
greater in the case of #2, should the point of balance (if at all we want
to find one) be tilted towards #2 ?

On the other hand, since having a common structure does involve extra bytes
and cycles, if there are very few situations where we need both #1 and #2,
then conversion only at subsystem boundaries (like i2o does) may turn out to
be better.

Oh well ...


>> BTW, We could have a higher level io container that includes a <status>
>> field and a <wait_queue_head> to take care of i/o completion
>
>IO completion requirements are much more complex.  Think of disk
>readahead: we can create a single request struct for an IO of a
>hundred buffer heads, and as the device driver satisfies that request,
>it wakes up the buffer heads as it goes.  There is a separete
>completion notification for every single buffer head in the chain.
>
I understand the requirement of independent completion notifiers for higher
level buffers/other structures, since they are indeed independently usable
structures. That was one aspect that I thought I was able to address
in the cev_wait design based on wait_queue wakeup functions.
The way it would work is that there would be multiple wakeup functions
registered on the container for the big request, each wakeup function being
responsible for waking up a higher level buffer. This way, the linkage
information is actually external to the buffer structures (which seems
reasonable, since it is only required while the i/o is happening, unless
there is another reason to keep a more lasting association)

>It's the very essence of readahead that we wake up the earlier buffers
>as soon as they become available, without waiting for the later ones
>to complete, so we _need_ this multiple completion concept.
>

I can understand this in principle, but when we have a single request going
down to the device that actually fills in multiple buffers, do we get
notified (interrupted) by the device before all the data in that request
got transferred ? I mean, how do we know that some buffers have become
available until the overall device request has completed (unless of course
the request actually gets broken up at this level and completed bit by
bit).


>Which is exactly why we have one kiobuf per higher-level buffer, and
>we chain together kiobufs when we need to for a long request, but we
>still get the independent completion notifiers.

As I mentioned above, the alternative is to have the i/o completion related
linkage information within the wakeup structures instead. That way, it
doesn't matter to the lower level driver what higher level structure we
have above (maybe buffer heads, may be page cache structures, may be
kiobufs). We only chain together memory descriptors for the buffers during
the io.

Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-05 Thread Stephen C. Tweedie

Hi,

On Fri, Feb 02, 2001 at 01:02:28PM +0100, Christoph Hellwig wrote:
> 
> > I may still be persuaded that we need the full scatter-gather list
> > fields throughout, but for now I tend to think that, at least in the
> > disk layers, we may get cleaner results by allow linked lists of
> > page-aligned kiobufs instead.  That allows for merging of kiobufs
> > without having to copy all of the vector information each time.
> 
> But it will have the same problems as the array solution: there will
> be one complete kio structure for each kiobuf, with its own end_io
> callback, etc.

And what's the problem with that?

You *need* this.  You have to have that multiple-completion concept in
the disk layers.  Think about chains of buffer_heads being sent to
disk as a single IO --- you need to know which buffers make it to disk
successfully and which had IO errors.

And no, the IO success is *not* necessarily sequential from the start
of the IO: if you are doing IO to raid0, for example, and the IO gets
striped across two disks, you might find that the first disk gets an
error so the start of the IO fails but the rest completes.  It's the
completion code which notifies the caller of what worked and what did
not.

And for readahead, you want to notify the caller as early as possible
about completion for the first part of the IO, even if the device
driver is still processing the rest.

Multiple completions are a necessary feature of the current block
device interface.  Removing that would be a step backwards.

Cheers,
 Stephen



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-05 Thread Stephen C. Tweedie

Hi,

On Sun, Feb 04, 2001 at 06:54:58PM +0530, [EMAIL PROTECTED] wrote:
> 
> Can't we define a kiobuf structure as just this ? A combination of a
> frag_list and a page_list ?

Then all code which needs to accept an arbitrary kiobuf needs to be
able to parse both --- ugh.

> BTW, We could have a higher level io container that includes a <status>
> field and a <wait_queue_head> to take care of i/o completion

IO completion requirements are much more complex.  Think of disk
readahead: we can create a single request struct for an IO of a
hundred buffer heads, and as the device driver satisfies that request,
it wakes up the buffer heads as it goes.  There is a separate
completion notification for every single buffer head in the chain.

It's the very essence of readahead that we wake up the earlier buffers
as soon as they become available, without waiting for the later ones
to complete, so we _need_ this multiple completion concept.

Which is exactly why we have one kiobuf per higher-level buffer, and
we chain together kiobufs when we need to for a long request, but we
still get the independent completion notifiers.

Cheers,
 Stephen



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-05 Thread Manfred Spraul

"Stephen C. Tweedie" wrote:
> 
> You simply cannot do physical disk IO on
> non-sector-aligned memory or in chunks which aren't a multiple of
> sector size.

Why not?

Obviously the disk access itself must be sector aligned and the total
length must be a multiple of the sector length, but there shouldn't be
any restrictions on the data buffers.

I remember that even Windoze 95 has scatter-gather support for physical
disk IO with arbitrary buffer chunks. (If the hardware supports it,
otherwise the io subsystem will copy the data into a contiguous
temporary buffer)

--
Manfred



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-05 Thread Stephen C. Tweedie

Hi,

On Sat, Feb 03, 2001 at 12:28:47PM -0800, Linus Torvalds wrote:
> 
> On Thu, 1 Feb 2001, Stephen C. Tweedie wrote:
> > 
> Neither the read nor the write are page-aligned. I don't know where you
> got that idea. It's obviously not true even in the common case: it depends
> _entirely_ on what the file offsets are, and expecting the offset to be
> zero is just being stupid. It's often _not_ zero. With networking it is in
> fact seldom zero, because the network packets are seldom aligned either in
> size or in location.

The underlying buffer is.  The VFS (and the current kiobuf code) is
already happy about IO happening at odd offsets within a page.
However, the more general case --- doing zero-copy IO on arbitrary
unaligned buffers --- simply won't work if you expect to be able to
push those buffers to disk without a copy.  

The splice case you talked about is fine because it's doing the normal
prepare/commit logic where the underlying buffer is page aligned, even
if the splice IO is not to a page aligned location.  That's _exactly_
what kiobufs were intended to support.  The prepare_read/prepare_write/
pull/push cycle lets the caller tell the pull() function where to
store its data, because there are alignment constraints which just
can't be ignored: you simply cannot do physical disk IO on
non-sector-aligned memory or in chunks which aren't a multiple of
sector size.  (The buffer address alignment can sometimes be relaxed
--- obviously if you're doing PIO then it doesn't matter --- but the
length granularity is rigidly enforced.)
 
Cheers,
 Stephen



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-05 Thread Stephen C. Tweedie

Hi,

On Sat, Feb 03, 2001 at 12:28:47PM -0800, Linus Torvalds wrote:
 
 On Thu, 1 Feb 2001, Stephen C. Tweedie wrote:
  
 Neither the read nor the write are page-aligned. I don't know where you
 got that idea. It's obviously not true even in the common case: it depends
 _entirely_ on what the file offsets are, and expecting the offset to be
 zero is just being stupid. It's often _not_ zero. With networking it is in
 fact seldom zero, because the network packets are seldom aligned either in
 size or in location.

The underlying buffer is.  The VFS (and the current kiobuf code) is
already happy about IO happening at odd offsets within a page.
However, the more general case --- doing zero-copy IO on arbitrary
unaligned buffers --- simply won't work if you expect to be able to
push those buffers to disk without a copy.  

The splice case you talked about is fine because it's doing the normal
prepare/commit logic where the underlying buffer is page aligned, even
if the splice IO is not to a page aligned location.  That's _exactly_
what kiobufs were intended to support.  The prepare_read/prepare_write/
pull/push cycle lets the caller tell the pull() function where to
store its data, becausse there are alignment constraints which just
can't be ignored: you simply cannot do physical disk IO on
non-sector-aligned memory or in chunks which aren't a multiple of
sector size.  (The buffer address alignment can sometimes be relaxed
--- obviously if you're doing PIO then it doesn't matter --- but the
length granularity is rigidly enforced.)
 
Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-05 Thread Manfred Spraul

"Stephen C. Tweedie" wrote:
 
 You simply cannot do physical disk IO on
 non-sector-aligned memory or in chunks which aren't a multiple of
 sector size.

Why not?

Obviously the disk access itself must be sector aligned and the total
length must be a multiple of the sector length, but there shouldn't be
any restrictions on the data buffers.

I remember that even Windoze 95 has scatter-gather support for physical
disk IO with arbitraty buffer chunks. (If the hardware supports it,
otherwise the io subsystem will copy the data into a contiguous
temporary buffer)

--
Manfred
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-05 Thread Stephen C. Tweedie

Hi,

On Sun, Feb 04, 2001 at 06:54:58PM +0530, [EMAIL PROTECTED] wrote:
 
 Can't we define a kiobuf structure as just this ? A combination of a
 frag_list and a page_list ?

Then all code which needs to accept an arbitrary kiobuf needs to be
able to parse both --- ugh.

 BTW, We could have a higher level io container that includes a status
 field and a wait_queue_head to take care of i/o completion

IO completion requirements are much more complex.  Think of disk
readahead: we can create a single request struct for an IO of a
hundred buffer heads, and as the device driver satisfies that request,
it wakes up the buffer heads as it goes.  There is a separete
completion notification for every single buffer head in the chain.

It's the very essence of readahead that we wake up the earlier buffers
as soon as they become available, without waiting for the later ones
to complete, so we _need_ this multiple completion concept.

Which is exactly why we have one kiobuf per higher-level buffer, and
we chain together kiobufs when we need to for a long request, but we
still get the independent completion notifiers.

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-05 Thread Stephen C. Tweedie

Hi,

On Fri, Feb 02, 2001 at 01:02:28PM +0100, Christoph Hellwig wrote:
 
  I may still be persuaded that we need the full scatter-gather list
  fields throughout, but for now I tend to think that, at least in the
  disk layers, we may get cleaner results by allow linked lists of
  page-aligned kiobufs instead.  That allows for merging of kiobufs
  without having to copy all of the vector information each time.
 
 But it will have the same problems as the array soloution: there will
 be one complete kio structure for each kiobuf, with it's own end_io
 callback, etc.

And what's the problem with that?

You *need* this.  You have to have that multiple-completion concept in
the disk layers.  Think about chains of buffer_heads being sent to
disk as a single IO --- you need to know which buffers make it to disk
successfully and which had IO errors.

And no, the IO success is *not* necessarily sequential from the start
of the IO: if you are doing IO to raid0, for example, and the IO gets
striped across two disks, you might find that the first disk gets an
error so the start of the IO fails but the rest completes.  It's the
completion code which notifies the caller of what worked and what did
not.

And for readahead, you want to notify the caller as early as posssible
about completion for the first part of the IO, even if the device
driver is still processing the rest.

Multiple completions are a necessary feature of the current block
device interface.  Removing that would be a step backwards.

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-05 Thread bsuparna



Hi,

On Sun, Feb 04, 2001 at 06:54:58PM +0530, [EMAIL PROTECTED] wrote:

 Can't we define a kiobuf structure as just this ? A combination of a
 frag_list and a page_list ?


Then all code which needs to accept an arbitrary kiobuf needs to be
able to parse both --- ugh.


Making this a little more explicit to help analyse tradeoffs:

/* Memory descriptor portion of a kiobuf - this is something that may get
passed around between layers and subsystems */
struct kio_mdesc {
 int nr_frags;
 struct frag *frag_list;
 int nr_pages;
 struct page **page_list;
 /* list follows */
};

For block i/o requiring #1 type descriptors, the list could have allocated
extra space for:
struct kio_type1_ext {
 struct frag frag;
 struct page *pages[NUM_STATIC_PAGES];
}

For n/w i/o or cases requiring  #2 type descriptors, the list could have
allocated extra space for:

struct kio_type2_ext {
 struct frag frags[NUM_STATIC_FRAGS];
 struct page *page[NUM_STATIC_FRAGS];
}


struct  kiobuf {
 intstatus;
 wait_queue_head_t   waitq;
 struct kio_mdescmdesc;
 /* list follows - leaves room for allocation for mem descs, completion
sub structs etc */
}

Code that accepts an arbitrary kiobuf needs to do the following :
 process the fragments one by one
  - type #1 case, only one fragment would typically be there, but
processing it would involve crossing all pages in the page list
   So extra processing vs a kiobuf with single offset, len
pair, involves the following:
dereferencing the frag_list pointer
checking the nr_frags field
  - type #2 case, the number of fragments would be equal to or
greater than number of pages, so processing will typically go over each
fragments and thus cross each page in the list one by one
   So extra processing vs a kiobuf with per-page offset, len
pairs, involves
deferencing the page list entry (involves computing the
page-index in the page_list from the offset value)
check if offset+len doesn't fall outside the page


Boils down to approx one extra dereference and one comparison  per kiobuf
for the common cases (have I missed something critical ?)  vs the most
optimized choices of descriptors for those cases.

In terms of resource consumption (extra bytes taken up), two fields extra
per kiobuf chain (e.g. nr_frags and frag_list pointer when it comes to #1),
i.e. a total of 8 bytes, for the common cases vs the most optimized choice
of structures for those cases.

This seems to be more uniformly balanced across #1 and #2 cases, than an
offset, len for every page, as well as an overall offset, len. But,
then, come to think of it, since the need for lightweight structures is
greater in the case of #2, should the point of balance (if at all we want
to find one) be tilted towards #2 ?

On the other hand, since having a common structure does involve extra bytes
and cycles, if there are very few situations where we need both #1 and #2 -
conversion only at subsystem boundaries like i2o does may turn out to be
better.

Oh well ...


 BTW, We could have a higher level io container that includes a status
 field and a wait_queue_head to take care of i/o completion

IO completion requirements are much more complex.  Think of disk
readahead: we can create a single request struct for an IO of a
hundred buffer heads, and as the device driver satisfies that request,
it wakes up the buffer heads as it goes.  There is a separate
completion notification for every single buffer head in the chain.

I understand the requirement of independent completion notifiers for higher
level buffers/other structures, since they are indeed independently usable
structures. That was one aspect that I thought I was being able to address
in the cev_wait design based on wait_queue wakeup functions.
The way it would work is that there would be multiple wakeup functions
registered on the container for the big request, each wakeup function being
responsible for waking up a higher level buffer. This way, the linkage
information is actually external to the buffer structures (which seems
reasonable, since it is only required while the i/o is happening, unless
there is another reason to keep a more lasting association)
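
As a purely illustrative sketch (the wakeup callback signature here is an
assumption made for the example, not Ben's actual interface), each registered
entry could carry its linkage like this:

struct buf_waiter {
        wait_queue_t            wait;   /* entry on the container's waitq */
        struct buffer_head      *bh;    /* higher-level buffer it completes */
        unsigned long           start, len;     /* its slice of the request */
};

/* assumed callback, run whenever part of the container's i/o completes;
   'done_upto' is a hypothetical progress value supplied by the driver */
static int buf_waiter_wakeup(wait_queue_t *wait, int uptodate,
                             unsigned long done_upto)
{
        struct buf_waiter *bw = (struct buf_waiter *)wait;

        if (!uptodate || done_upto >= bw->start + bw->len)
                bw->bh->b_end_io(bw->bh, uptodate);  /* wake just this buffer */
        return 0;
}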

It's the very essence of readahead that we wake up the earlier buffers
as soon as they become available, without waiting for the later ones
to complete, so we _need_ this multiple completion concept.


I can understand this in principle, but when we have a single request going
down to the device that actually fills in multiple buffers, do we get
notified (interrupted) by the device before all the data in that request
got transferred ? I mean, how do we know that some buffers have become
available until the overall device request has completed (unless of course
the request actually gets broken up at this level and completed bit by
bit).


Which is exactly why we have one kiobuf per higher-level buffer, and
we chain together kiobufs when we need to for a long request, but we
still get the independent completion notifiers.

As I mentioned above, the alternative is to have the i/o completion related
linkage information within the wakeup structures instead. That way, it
doesn't matter to the lower level driver what higher level structure we
have above (maybe buffer heads, may be page cache structures, may be
kiobufs). We only chain together memory descriptors for the buffers during
the io.

Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-05 Thread Stephen C. Tweedie

Hi,

On Mon, Feb 05, 2001 at 08:01:45PM +0530, [EMAIL PROTECTED] wrote:
 
 It's the very essence of readahead that we wake up the earlier buffers
 as soon as they become available, without waiting for the later ones
 to complete, so we _need_ this multiple completion concept.
 
 I can understand this in principle, but when we have a single request going
 down to the device that actually fills in multiple buffers, do we get
 notified (interrupted) by the device before all the data in that request
 got transferred ?

It depends on the device driver.  Different controllers will have
different maximum transfer size.  For IDE, for example, we get wakeups
all over the place.  For SCSI, it depends on how many scatter-gather
entries the driver can push into a single on-the-wire request.  Exceed
that limit and the driver is forced to open a new scsi mailbox, and
you get independent completion signals for each such chunk.

 Which is exactly why we have one kiobuf per higher-level buffer, and
 we chain together kiobufs when we need to for a long request, but we
 still get the independent completion notifiers.
 
 As I mentioned above, the alternative is to have the i/o completion related
 linkage information within the wakeup structures instead. That way, it
 doesn't matter to the lower level driver what higher level structure we
 have above (maybe buffer heads, may be page cache structures, may be
 kiobufs). We only chain together memory descriptors for the buffers during
 the io.

You forgot IO failures: it is essential, once the IO completes, to
know exactly which higher-level structures completed successfully and
which did not.  The low-level drivers have to have access to the
independent completion notifications for this to work.

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-05 Thread Stephen C. Tweedie

Hi,

On Mon, Feb 05, 2001 at 01:00:51PM +0100, Manfred Spraul wrote:
 "Stephen C. Tweedie" wrote:
  
  You simply cannot do physical disk IO on
  non-sector-aligned memory or in chunks which aren't a multiple of
  sector size.
 
 Why not?
 
 Obviously the disk access itself must be sector aligned and the total
 length must be a multiple of the sector length, but there shouldn't be
 any restrictions on the data buffers.

But there are.  Many controllers just break down and corrupt things
silently if you don't align the data buffers (Jeff Merkey found this
by accident when he started generating unaligned IOs within page
boundaries in his NWFS code).  And a lot of controllers simply cannot
break a sector dma over a page boundary (at least not without some
form of IOMMU remapping).

Yes, it's the sort of thing that you would hope should work, but in
practice it's not reliable.

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-05 Thread Alan Cox

 Yes, it's the sort of thing that you would hope should work, but in
 practice it's not reliable.

So the less smart devices need to call something like

kiovec_align(kiovec, 512);

and have it do the bounce buffers ?
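
Something along these lines, perhaps -- a sketch only, since kiovec_align()
doesn't exist and the fragment layout is simplified here to a flat
(page, offset, length) array:

struct frag {
        struct page     *page;
        struct page     *bounce;        /* substitute page, if needed */
        unsigned int    offset, len;
};

struct kiovec {
        int             nr_frags;
        struct frag     *frags;
};

/* mark every fragment that violates the device's alignment rule so the
   i/o path can copy through a bounce page instead */
int kiovec_align(struct kiovec *vec, int align)
{
        int i;

        for (i = 0; i < vec->nr_frags; i++) {
                struct frag *fr = &vec->frags[i];

                if ((fr->offset | fr->len) & (align - 1)) {
                        fr->bounce = alloc_page(GFP_KERNEL);
                        if (!fr->bounce)
                                return -ENOMEM;
                }
        }
        return 0;
}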


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains

2001-02-05 Thread Linus Torvalds



On Mon, 5 Feb 2001, Stephen C. Tweedie wrote:
 
 On Sat, Feb 03, 2001 at 12:28:47PM -0800, Linus Torvalds wrote:
  
  Neither the read nor the write are page-aligned. I don't know where you
  got that idea. It's obviously not true even in the common case: it depends
  _entirely_ on what the file offsets are, and expecting the offset to be
  zero is just being stupid. It's often _not_ zero. With networking it is in
  fact seldom zero, because the network packets are seldom aligned either in
  size or in location.
 
 The underlying buffer is.  The VFS (and the current kiobuf code) is
 already happy about IO happening at odd offsets within a page.

Stephen. 

Don't bother even talking about this. You're so damn hung up about the
page cache that it's not funny.

Have you ever thought about other things, like networking, special
devices, stuff like that? They can (and do) have packet boundaries that
have nothing to do with pages what-so-ever. They can have such notions as
packets that contain multiple streams in one packet, where it ends up
being split up into several pieces. Where neither the original packet
_nor_ the final pieces have _anything_ to do with "pages".

THERE IS NO PAGE ALIGNMENT.

So stop blathering about it.

Of _course_ the current kiobuf code has page-alignment assumptions. You
_designed_ it that way. So bringing it up as an example is a circular
argument. And a really stupid one at that, as that's the thing I've been
quoting as the single biggest design bug in all of kiobufs. It's the thing
that makes them entirely useless for things like describing "struct
msghdr" etc. 

We should get _away_ from this page-alignment fallacy. It's not true. It's
not necessarily even true for the page cache - which has no real
fundamental reasons any more for not being able to be a "variable-size"
cache some time in the future (ie it might be a per-address-space decision
on whether the granularity is 1, 2, 4 or more pages).

Anything that designs for "everything is a page" will automatically be
limited for cases where you might sometimes have 64kB chunks of data.

Instead, just face the realization that "everything is a bunch of ranges",
and leave it at that. It's true _already_ - think about fragmented IP
packets. We may not handle it that way completely yet, but the zero-copy
networking is going in this direction.

And as long as you keep on harping about page alignment, you're not going
to play in this game. End of story. 

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains

2001-02-05 Thread Linus Torvalds



On Mon, 5 Feb 2001, Manfred Spraul wrote:
 "Stephen C. Tweedie" wrote:
  
  You simply cannot do physical disk IO on
  non-sector-aligned memory or in chunks which aren't a multiple of
  sector size.
 
 Why not?
 
 Obviously the disk access itself must be sector aligned and the total
 length must be a multiple of the sector length, but there shouldn't be
 any restrictions on the data buffers.

In fact, regular IDE DMA allows arbitrary scatter-gather at least in
theory. Linux has never used it, so I don't know how well it works in
practice - I would not be surprised if it ends up causing no end of nasty 
corner-cases that have bugs. It's not as if IDE controllers always follow 
the documentation ;)

The _total_ length of the buffers have to be a multiple of the sector
size, and there are some alignment issues (each scatter-gather area has to
be at least 16-bit aligned both in physical memory and in length, and
apparently many controllers need 32-bit alignment). And I'd almost be
surprised if there wouldn't be hardware that wanted cache alignment
because they always expect to burst. 

But despite a lot of likely practical reasons why it won't work for
arbitrary sg lists on plain IDE DMA, there is no _theoretical_ reason it
wouldn't. And there are bound to be better controllers that could handle
it.
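
In driver terms the per-entry checks would look something like this (a
sketch; the 64K-per-entry figure is the usual PRD byte-count limit, and
the struct is illustrative rather than any real driver's):

struct sg_ent {
        unsigned long   phys;   /* physical address of the piece */
        unsigned int    len;    /* length in bytes */
};

static int ide_dma_sg_ok(const struct sg_ent *sg, int nents)
{
        unsigned int total = 0;
        int i;

        for (i = 0; i < nents; i++) {
                if ((sg[i].phys | sg[i].len) & 1)       /* 16-bit alignment */
                        return 0;
                if (sg[i].len == 0 || sg[i].len > 0x10000)
                        return 0;
                total += sg[i].len;
        }
        return (total & 511) == 0;      /* whole transfer: sector multiple */
}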

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-05 Thread Stephen C. Tweedie

Hi,

On Mon, Feb 05, 2001 at 03:19:09PM +, Alan Cox wrote:
  Yes, it's the sort of thing that you would hope should work, but in
  practice it's not reliable.
 
 So the less smart devices need to call something like
 
   kiovec_align(kiovec, 512);
 
 and have it do the bounce buffers ?

_All_ drivers would have to do that in the degenerate case, because
none of our drivers can deal with a dma boundary in the middle of a
sector, and even in those places where the hardware supports it in
theory, you are still often limited to word-alignment.

--Stephen

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-05 Thread Alan Cox

  kiovec_align(kiovec, 512);
  and have it do the bounce buffers ?
 
 _All_ drivers would have to do that in the degenerate case, because
 none of our drivers can deal with a dma boundary in the middle of a
 sector, and even in those places where the hardware supports it in
 theory, you are still often limited to word-alignment.

Thats true for _block_ disk devices but if we want a generic kiovec then
if I am going from video capture to network I dont need to force anything more
than 4 byte align

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-05 Thread Stephen C. Tweedie

Hi,

On Mon, Feb 05, 2001 at 05:29:47PM +, Alan Cox wrote:
  
  _All_ drivers would have to do that in the degenerate case, because
  none of our drivers can deal with a dma boundary in the middle of a
  sector, and even in those places where the hardware supports it in
  theory, you are still often limited to word-alignment.
 
 Thats true for _block_ disk devices but if we want a generic kiovec then
 if I am going from video capture to network I dont need to force anything more
 than 4 byte align

Kiobufs have never, ever required the IO to be aligned on any
particular boundary.  They simply make the assumption that the
underlying buffered object can be described in terms of pages with
some arbitrary (non-aligned) start/offset.  Every video framebuffer
I've ever seen satisfies that, so you can easily map an arbitrary
contiguous region of the framebuffer with a kiobuf already.

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-05 Thread Alan Cox

 Kiobufs have never, ever required the IO to be aligned on any
 particular boundary.  They simply make the assumption that the
 underlying buffered object can be described in terms of pages with
 some arbitrary (non-aligned) start/offset.  Every video framebuffer

start/length per page ?

 I've ever seen satisfies that, so you can easily map an arbitrary
 contiguous region of the framebuffer with a kiobuf already.

Video is non contiguous ranges. In fact if you are blitting to a card with
tiled memory it gets very interesting in its video lists

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains

2001-02-05 Thread Linus Torvalds



On Mon, 5 Feb 2001, Stephen C. Tweedie wrote:
  Thats true for _block_ disk devices but if we want a generic kiovec then
  if I am going from video capture to network I dont need to force anything more
  than 4 byte align
 
 Kiobufs have never, ever required the IO to be aligned on any
 particular boundary.  They simply make the assumption that the
 underlying buffered object can be described in terms of pages with
 some arbitrary (non-aligned) start/offset.  Every video framebuffer
 I've ever seen satisfies that, so you can easily map an arbitrary
 contiguous region of the framebuffer with a kiobuf already.

Stop this idiocy, Stephen. You're _this_ close to be the first person I
ever blacklist from my mailbox. 

Network. Packets. Fragmentation. Or just non-page-sized MTU's. 

It is _not_ a "series of contiguous pages". Never has been. Never will be.
So stop making excuses.

Also, think of protocols that may want to gather stuff from multiple
places, where the boundaries have little to do with pages but are
specified some other way. Imagine doing "writev()" style operations to
disk, gathering stuff from multiple sources into one operation.

Think of GART remappings - you can have multiple pages that show up as one
"linear" chunk to the graphics device behind the AGP bridge, but that are
_not_ contiguous in real memory.

There just is NO excuse for the "linear series of pages" view. And if you
cannot realize that, then I don't know what's wrong with you. Your
arguments are obviously crap, and the stuff you seem unable to argue
against (like networking) you decide to just ignore. Get your act
together.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-05 Thread Stephen C. Tweedie

Hi,

On Mon, Feb 05, 2001 at 08:36:31AM -0800, Linus Torvalds wrote:

 Have you ever thought about other things, like networking, special
 devices, stuff like that? They can (and do) have packet boundaries that
 have nothing to do with pages what-so-ever. They can have such notions as
 packets that contain multiple streams in one packet, where it ends up
 being split up into several pieces. Where neither the original packet
 _nor_ the final pieces have _anything_ to do with "pages".
 
 THERE IS NO PAGE ALIGNMENT.

And kiobufs don't require IO to be page aligned, and they have never
done.  The only page alignment they assume is that if a *single*
scatter-gather element spans multiple pages, then the joins between
those pages occur on page boundaries.

Remember, a kiobuf is only designed to represent one scatter-gather
fragment, not a full sg list.  That was the whole reason for having a
kiovec as a separate concept: if you have more than one independent
fragment in the sg-list, you need more than one kiobuf.

And the reason why we created sg fragments which can span pages was so
that we can encode IOs which interact with the VM: any arbitrary
virtually-contiguous user data buffer can be mapped into a *single*
kiobuf for a write() call, so it's a generic way of supporting things
like O_DIRECT without the IO layers having to know anything about VM
(and Ben's async IO patches also use kiobufs in this way to allow
read()s to write to the user's data buffer once the IO completes,
without having to have a context switch back into that user's
context.)  Similarly, any extent of a file in the page cache can be
encoded in a single kiobuf.
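
For example, the O_DIRECT-style usage being described looks roughly like
this (signatures quoted from memory, error paths trimmed, and
submit_raw_io() standing in for the real driver entry point):

static int raw_read_example(char *user_buf, size_t count)
{
        struct kiobuf *iobuf;
        int err;

        err = alloc_kiovec(1, &iobuf);
        if (err)
                return err;

        /* pin the (arbitrarily aligned) user buffer: one kiobuf, one
           offset/length pair, a page list underneath */
        err = map_user_kiobuf(READ, iobuf, (unsigned long)user_buf, count);
        if (!err) {
                err = submit_raw_io(iobuf);     /* hypothetical */
                unmap_kiobuf(iobuf);
        }
        free_kiovec(1, &iobuf);
        return err;
}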

And no, the simpler networking-style sg-list does not cut it for block
device IO, because for block devices, we want to have separate
completion status made available for each individual sg fragment in
the IO.  *That* is why the kiobuf is more heavyweight than the
networking variant: each fragment [kiobuf] in the scatter-gather list
[kiovec] has its own completion information.  

If we have a bunch of separate data buffers queued for sequential disk
IO as a single request, then we still want things like readahead and
error handling to work.  That means that we want the first kiobuf in
the chain to get its completion wakeup as soon as that segment of the
IO is complete, without having to wait for the remaining sectors of
the IO to be transferred.  It also means that if we've done something
like split the IO over a raid stripe, then when an error occurs, we
still want to know which of the callers' buffers succeeded and which
failed.

Yes, I agree that the original kiovec mechanism of using a *kiobuf[]
array to assemble the scatter-gather fragments sucked.  But I don't
believe that just throwing away the concept of kiobuf as a sg-fragment
will work either when it comes to disk IOs: the need for per-fragment
completion is too compelling.  I'd rather shift to allowing kiobufs to
be assembled into linked lists for IO to avoid *kiobuf[] vectors, in
just the same way that we currently chain buffer_heads for IO.  
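
A sketch of that kind of chaining (illustrative only -- these link fields
are not in the current kiobuf):

struct kiobuf_link {
        struct kiobuf           *iobuf;
        struct kiobuf_link      *next;
        unsigned long           end_sector;     /* last sector of this piece */
        void                    (*end_io)(struct kiobuf *iobuf, int uptodate,
                                          int errno);
};

/* called as the device reports progress: complete every fragment that is
   now wholly transferred, leave the rest of the chain pending */
static struct kiobuf_link *kio_chain_advance(struct kiobuf_link *cur,
                                             unsigned long done_sector)
{
        while (cur && cur->end_sector <= done_sector) {
                cur->end_io(cur->iobuf, 1, 0);
                cur = cur->next;
        }
        return cur;
}

On an error the same walk lets the failing fragment, and only that
fragment, be reported with its own errno.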

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains

2001-02-05 Thread Ingo Molnar


On Mon, 5 Feb 2001, Stephen C. Tweedie wrote:

 And no, the IO success is *not* necessarily sequential from the start
 of the IO: if you are doing IO to raid0, for example, and the IO gets
 striped across two disks, you might find that the first disk gets an
 error so the start of the IO fails but the rest completes.  It's the
 completion code which notifies the caller of what worked and what did
 not.

it's exactly these 'compound' structures i'm vehemently against. I do
think it's a design nightmare. I can picture these monster kiobufs
complicating the whole code for no good reason - we couldnt even get the
bh-list code in block_device.c right - why do you think kiobufs *all
across the kernel* will be any better?

RAID0 is not an issue. Split it up, use separate kiobufs for every
different disk. We need simple constructs - i do not believe why nobody
sees that these big fat monster-trucks of IO workload are *trouble*. They
keep things localized, instead of putting workload components into the
system immediately. We'll have performance bugs nobody has seen before.
bhs have one very nice property: they are simple, modularized. I think
this is like CISC vs. RISC: CISC designs ended up splitting 'fat
instructions' up into RISC-like instructions.

fragmented skbs are a different matter: they are simply a bit more generic
abstractions of 'memory buffer'. Clear goal, clear solution. I do not
think kiobufs have clear goals.

and i do not buy the performance arguments. In 2.4.1 we improved block-IO
performance dramatically by fixing high-load IO scheduling. Write
performance suddenly improved dramatically, there is a 30-40% improvement
in dbench performance. To put in another way: *we needed 5 years to fix a
serious IO-subsystem performance bug*. Block IO was already too complex -
and Alex & Andrea have done a nice job streamlining and cleaning it up for
2.4. We should simplify it further - and optimize the components, instead
of bringing in yet another *big* complication into the API.

and what is the goal of having multi-page kiobufs. To avoid having to do
multiple function calls via a simpler interface? Shouldnt we optimize that
codepath instead?

Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains

2001-02-05 Thread Ingo Molnar


On Mon, 5 Feb 2001, Stephen C. Tweedie wrote:

  Obviously the disk access itself must be sector aligned and the total
  length must be a multiple of the sector length, but there shouldn't be
  any restrictions on the data buffers.

 But there are. Many controllers just break down and corrupt things
 silently if you don't align the data buffers (Jeff Merkey found this
 by accident when he started generating unaligned IOs within page
 boundaries in his NWFS code). And a lot of controllers simply cannot
 break a sector dma over a page boundary (at least not without some
 form of IOMMU remapping).

so we are putting workarounds for hardware bugs into the design?

Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-05 Thread Stephen C. Tweedie

Hi,

On Mon, Feb 05, 2001 at 10:28:37PM +0100, Ingo Molnar wrote:
 
 On Mon, 5 Feb 2001, Stephen C. Tweedie wrote:
 
 it's exactly these 'compound' structures i'm vehemently against. I do
 think it's a design nightmare. I can picture these monster kiobufs
 complicating the whole code for no good reason - we couldnt even get the
 bh-list code in block_device.c right - why do you think kiobufs *all
 across the kernel* will be any better?
 
 RAID0 is not an issue. Split it up, use separate kiobufs for every
 different disk.

Umm, that's not the point --- of course you can use separate kiobufs
for the communication between raid0 and the underlying disks, but what
do you then tell the application _above_ raid0 if one of the
underlying IOs succeeds and the other fails halfway through?

And what about raid1?  Are you really saying that raid1 doesn't need
to know which blocks succeeded and which failed?  That's the level of
completion information I'm worrying about at the moment.

 fragmented skbs are a different matter: they are simply a bit more generic
 abstractions of 'memory buffer'. Clear goal, clear solution. I do not
 think kiobufs have clear goals.

The goal: allow arbitrary IOs to be pushed down through the stack in
such a way that the callers can get meaningful information back about
what worked and what did not.  If the write was a 128kB raw IO, then
you obviously get coarse granularity of completion callback.  If the
write was a series of independent pages which happened to be
contiguous on disk, you actually get told which pages hit disk and
which did not.

 and what is the goal of having multi-page kiobufs. To avoid having to do
 multiple function calls via a simpler interface? Shouldnt we optimize that
 codepath instead?

The original multi-page buffers came from the map_user_kiobuf
interface: they represented a user data buffer.  I'm not wedded to
that format --- we can happily replace it with a fine-grained sg list
--- but the reason they have been pushed so far down the IO stack is
the need for accurate completion information on the originally
requested IOs.

In other words, even if we expand the kiobuf into a sg vector list,
when it comes to merging requests in ll_rw_blk.c we still need to
track the callbacks on each independent source kiobufs.  
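
Concretely -- a sketch, with kio_sources as an assumed extra field on
struct request rather than anything that exists today -- the merge path
would just remember each absorbed kiobuf so its callback can still be
delivered individually:

struct kio_source {
        struct kiobuf           *iobuf;
        unsigned long           start_sector;   /* where it sits in the merge */
        struct kio_source       *next;
};

static void request_add_source(struct request *req, struct kiobuf *iobuf,
                               unsigned long start_sector)
{
        struct kio_source *src = kmalloc(sizeof(*src), GFP_ATOMIC);

        if (!src)
                return;                 /* real code would refuse the merge */
        src->iobuf = iobuf;
        src->start_sector = start_sector;
        src->next = req->kio_sources;   /* assumed field, see above */
        req->kio_sources = src;
}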

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-05 Thread Alan Cox

 do you then tell the application _above_ raid0 if one of the
 underlying IOs succeeds and the other fails halfway through?

struct 
{
u32 flags;  /* because everything needs flags */
struct io_completion *completions;
kiovec_t sglist[0];
} thingy;

now kmalloc one object of the header the sglist of the right size and the
completion list. Shove the completion list on the end of it as another
array of objects and what is the problem.
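
For illustration, the single allocation would look something like this
(thingy, kiovec_t and io_completion as in the fragment above, everything
else assumed):

struct thingy *alloc_thingy(int nfrags)
{
        struct thingy *t;
        size_t sz = sizeof(struct thingy)
                  + nfrags * sizeof(kiovec_t)           /* sglist[] */
                  + nfrags * sizeof(struct io_completion);

        t = kmalloc(sz, GFP_KERNEL);
        if (!t)
                return NULL;
        t->flags = 0;
        /* completions sit immediately after the variable-sized sglist */
        t->completions = (struct io_completion *)&t->sglist[nfrags];
        return t;
}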

 In other words, even if we expand the kiobuf into a sg vector list,
 when it comes to merging requests in ll_rw_blk.c we still need to
 track the callbacks on each independent source kiobufs.  

But that can be two arrays

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-05 Thread Stephen C. Tweedie

Hi,

On Mon, Feb 05, 2001 at 11:06:48PM +, Alan Cox wrote:
  do you then tell the application _above_ raid0 if one of the
  underlying IOs succeeds and the other fails halfway through?
 
 struct 
 {
   u32 flags;  /* because everything needs flags */
   struct io_completion *completions;
   kiovec_t sglist[0];
 } thingy;
 
 now kmalloc one object of the header the sglist of the right size and the
 completion list. Shove the completion list on the end of it as another
 array of objects and what is the problem.

XFS uses both small metadata items in the buffer cache and large
pagebufs.  You may have merged a 512-byte read with a large pagebuf
read: one completion callback is associated with a single sg fragment,
the next callback belongs to a dozen different fragments.  Associating
the two lists becomes non-trivial, although it could be done.

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-05 Thread Manfred Spraul

"Stephen C. Tweedie" wrote:
 
 The original multi-page buffers came from the map_user_kiobuf
 interface: they represented a user data buffer.  I'm not wedded to
 that format --- we can happily replace it with a fine-grained sg list

Could you change that interface?

 from Linus mail:

struct buffer {
struct page *page;
u16 offset, length;
};



/* returns the number of used buffers, or 0 on error */
int map_user_buffer(struct buffer *ba, int max_bcount,
void* addr, int len);
void unmap_buffer(struct buffer *ba, int bcount);

That's enough for the zero copy pipe code ;-)
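
Typical usage on a write path might look like this (a sketch; do_io_on()
is a stand-in for whatever consumes the fragments):

#define MAX_FRAGS 64

ssize_t example_write(void *uaddr, size_t len)
{
        struct buffer ba[MAX_FRAGS];
        ssize_t done = 0;
        int n, i;

        n = map_user_buffer(ba, MAX_FRAGS, uaddr, len);
        if (n <= 0)
                return -EFAULT;

        for (i = 0; i < n; i++)
                done += do_io_on(ba[i].page, ba[i].offset, ba[i].length);

        unmap_buffer(ba, n);
        return done;
}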

Real hw drivers probably need a replacement for pci_map_single()
(pci_map_and_align_and_bounce_buffer_array())

The kiobuf structure could contain these 'struct buffer' instead of the
current 'struct page' pointers.

 
 In other words, even if we expand the kiobuf into a sg vector list,
 when it comes to merging requests in ll_rw_blk.c we still need to
 track the callbacks on each independent source kiobufs.

Probably.


--
Manfred

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-04 Thread bsuparna


Hi,

On Fri, Feb 02, 2001 at 12:51:35PM +0100, Christoph Hellwig wrote:
 
  If I have a page vector with a single offset/length pair, I can build
  a new header with the same vector and modified offset/length to split
  the vector in two without copying it.

 You just say in the higher-level structure ignore from x to y even if
 they have an offset in their own vector.

Exactly --- and so you end up with something _much_ uglier, because
you end up with all sorts of combinations of length/offset fields all
over the place.

This is _precisely_ the mess I want to avoid.

Cheers,
 Stephen

It appears that we are coming across 2 kinds of requirements for kiobuf
vectors - and quite a bit of debate centering around that.

1. In the block device i/o world, where large i/os may be involved, we'd
like to be able to describe chunks/fragments that contain multiple pages;
which is why it makes sense to have a single offset, length pair for the
entire set of pages in a kiobuf, rather than having to deal with per page
offset/len fields.

2. In the networking world, we deal with smaller fragments (for protocol
headers and stuff, and small packets) ideally chained together, typically
not page aligned, with the ability to extend the list at least at the head
and tail (and maybe some reshuffling in case of ip fragmentation?); so I
guess that's why it seems good to have an offset, length pair per
page/fragment. (If there can be multiple fragments in a page, even this
might not be frugal enough ... )

Looks like there are 2 kinds of entities that we are looking for in the kio
descriptor:
 - A collection of physical memory pages (call it say, a page_list)
 - A collection of fragments of memory described as offset, len
tuples w.r.t this collection
 (offset in turn could be index in page-list, offset-in-page if it
helps) (call this collection a frag_list)

Can't we define a kiobuf structure as just this ? A combination of a
frag_list and a page_list ? (Clone kiobufs might share the original
kiobuf's page_list, but just split parts of the frag_list)
How hard is it to maintain and to manipulate such a structure ?
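
To illustrate the clone idea -- a sketch only, assuming a kio_mdesc-style
layout with nr_frags/frag_list/nr_pages/page_list fields:

/* share the parent's page_list, take a sub-range of its frag_list */
struct kio_mdesc *kio_clone_frags(struct kio_mdesc *parent,
                                  int first_frag, int nr_frags)
{
        struct kio_mdesc *clone = kmalloc(sizeof(*clone), GFP_KERNEL);

        if (!clone)
                return NULL;
        clone->nr_frags  = nr_frags;
        clone->frag_list = parent->frag_list + first_frag;
        clone->nr_pages  = parent->nr_pages;    /* shared, not copied */
        clone->page_list = parent->page_list;
        return clone;
}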

BTW, We could have a higher level io container that includes a status
field and a wait_queue_head to take care of i/o completion (If we have a
wait queue head, then I don't think we need a separate callback function if
we have Ben's wakeup functions in place).

Or, is this going in the direction of a cross between an elephant and a
bicycle :-)  ?

Regards
Suparna


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains

2001-02-03 Thread Linus Torvalds



On Thu, 1 Feb 2001, Stephen C. Tweedie wrote:
> 
> On Thu, Feb 01, 2001 at 09:33:27PM +0100, Christoph Hellwig wrote:
> 
> > I think you want the whole kio concept only for disk-like IO.  
> 
> No.  I want something good for zero-copy IO in general, but a lot of
> that concerns the problem of interacting with the user, and the basic
> center of that interaction in 99% of the interesting cases is either a
> user VM buffer or the page cache --- all of which are page-aligned.  
> 
> If you look at the sorts of models being proposed (even by Linus) for
> splice, you get
> 
>   len = prepare_read();
>   prepare_write();
>   pull_fd();
>   commit_write();
> 
> in which the read is being pulled into a known location in the page
> cache -- it's page-aligned, again.

Wrong.

Neither the read nor the write are page-aligned. I don't know where you
got that idea. It's obviously not true even in the common case: it depends
_entirely_ on what the file offsets are, and expecting the offset to be
zero is just being stupid. It's often _not_ zero. With networking it is in
fact seldom zero, because the network packets are seldom aligned either in
size or in location.

Also, there are many reasons why "page" may have different meaning. We
will eventually have a page-cache where the pagecache granularity is not
the same as the user-level visible one. User-level may do mmap at 4kB
boundaries, even if the page cache itself uses 8kB or 16kB pages.

THERE IS NO PAGE-ALIGNMENT. And anything that even _mentions_ the word
page-aligned is going into my trash-can faster than you can say "bug".

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-02 Thread bsuparna


>Hi,
>
>On Thu, Feb 01, 2001 at 01:28:33PM +0530, [EMAIL PROTECTED] wrote:
>>
>> Here's a second pass attempt, based on Ben's wait queue extensions:
> Does this sound any better ?
>
>It's a mechanism, all right, but you haven't described what problems
>it is trying to solve, and where it is likely to be used, so it's hard
>to judge it. :)

Hmm .. I thought I had done that in my first posting, but obviously, I
mustn't have done a good job at expressing it, so let me take another stab
at trying to convey why I started on this.

There are certain specific situations that I have in mind right now, but to
me it looks like the very nature of the abstraction is such that it is
quite likely that there would be uses in some other situations which I may
not have thought of yet, or just do not understand well enough to vouch for
at this point. What those situations could be, and the associated issues
involved (especially performance related) is something that I hope other
people on this forum would be able to help pinpoint, based on their
experiences and areas of expertise.

I do realize that generic and yet simple and performance optimal in all
kinds of situations is a really difficult (if not impossible :-) ) thing to
achieve, but even then, won't it be nice to at least abstract out
uniformity in patterns across situations in a way which can be
tweaked/tuned for each specific class of situations ?

And the nice thing which I see about Ben's wait queue extensions is that it
gives us a route to try to do that ...

Some needs considered (and associated problems):

a. Stacking of completion events - asynchronously, through multiple layers
 - layered drivers  (encryption, conversion)
 - filter filesystems
Key aspects:
 1. It should be possible to pass the same (original) i/o container
structure all the way down (no copies/clones should need to happen, unless
actual i/o splitting, or extra buffer space or multiple sub-ios are
involved)
 2. Transparency: Neither the upper layer nor the layer below it should
need to have any specific knowledge about the existence/absense of an
intermediate filter layer (the mechanism should hide all that)
 3. LIFO ordering of completion actions
 4. The i/o structure should be marked as up-to-date only after all the
completion actions are done.
 5. Preferably have waiters on the i/o structure woken up only after
all completion actions are through (to avoid spurious/redundant wakeups
since the data won't be ready for use)
 6. Possible to have completion actions execute later in task context

b. Co-relation between multiple completion events and their associated
operations and data structures
 -  (bottom up aspect) merging results of split i/o requests, and
marking the completion of the compound i/o through multiple such layers
(tree), e.g
  - lvm
  - md / raid
  - evms aggregator features
 - (top down aspect) cascading down i/o cancellation requests /
sub-event waits , monitoring sub-io status etc
  Some aspects:
 1. Result of collation of sub-i/os may be driver specific  (In some
situations like lvm  - each sub i/o maps to a particular portion of a
buffer; with software raid or some other kind of scheme the collation may
involve actually interpreting the data read)
 2. Re-start/retries of sub-ios (in case of errors) can be handled.
 3. Transparency : Neither the upper layer nor the layer below it
should need to have any specific knowledge about the existence/absense of
an intermediate layer (that sends out multiple sub i/os)
 4. The system should be devised to avoid extra logic/fields in the
generic i/o structures being passed around, in situations where no compound
i/o is involved (i.e. in the simple i/o cases and most common situations).
As far as possible it is desirable to keep the linkage information outside
of the i/o structure for this reason.
 5. Possible to have collation/completion actions execute later in task
context


Ben LaHaise's wait queue extensions takes care of most of the aspects of
(a), if used with a little care to ensure a(4).
[This just means that function that marks the i/o structure as up-to-date
should be put in the completion queue first]
With this, we don't even need an explicit end_io() in bh/kiobufs etc. Just
the wait queue would do.

Only a(5) needs some thought since cache efficiency is upset by changing
the ordering of waits.

But, (b) needs a little more work as a higher level construct/mechanism
that latches on to the wait queue extensions. That is what the cev_wait
structure was designed for.
It keeps the chaining information outside of the i/o structures by default
(They can be allocated together, if desired anyway)
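
As a purely illustrative sketch of that kind of external linkage (not the
actual cev_wait code -- the names and fields here are assumptions):

struct cev_wait {
        wait_queue_t            wait;           /* hooks into a sub-i/o's waitq */
        struct cev_wait         *parent;        /* compound i/o this belongs to */
        atomic_t                pending;        /* sub-i/os still outstanding */
        int                     errno;          /* first error seen, if any */
        void                    (*collate)(struct cev_wait *cw); /* driver hook */
        wait_queue_head_t       waitq;          /* waiters on the compound i/o */
};

/* assumed wakeup callback, run when one sub-i/o completes */
static int cev_child_done(struct cev_wait *cw, int err)
{
        struct cev_wait *parent = cw->parent;

        if (err && !parent->errno)
                parent->errno = err;            /* first error wins (sketch) */
        if (atomic_dec_and_test(&parent->pending)) {
                if (parent->collate)
                        parent->collate(parent); /* e.g. lvm/raid result merge */
                wake_up(&parent->waitq);         /* compound waiters woken last */
        }
        return 0;
}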

Is this still too much in the air ? Maybe I should describe the flow in a
specific scenario to illustrate ?

Regards
Suparna


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]

Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-02 Thread Stephen C. Tweedie

Hi,

On Fri, Feb 02, 2001 at 12:51:35PM +0100, Christoph Hellwig wrote:
> > 
> > If I have a page vector with a single offset/length pair, I can build
> > a new header with the same vector and modified offset/length to split
> > the vector in two without copying it.
> 
> You just say in the higher-level structure ignore from x to y even if
> they have an offset in their own vector.

Exactly --- and so you end up with something _much_ uglier, because
you end up with all sorts of combinations of length/offset fields all
over the place.

This is _precisely_ the mess I want to avoid.

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-02 Thread Christoph Hellwig

On Thu, Feb 01, 2001 at 11:18:56PM -0500, [EMAIL PROTECTED] wrote:
> On Thu, 1 Feb 2001, Christoph Hellwig wrote:
> 
> > A kiobuf is 124 bytes, a buffer_head 96.  And a buffer_head is additionally
> > used for caching data, a kiobuf not.
> 
> Go measure the cost of a distant cache miss, then complain about having
> everything in one structure.  Also, 1 kiobuf maps 16-128 times as much
> data as a single buffer head.

I'd never dispute that.  It was just an answer to Stephen's "a kiobuf is
already smaller".

> > enum kio_flags {
> > KIO_LOANED, /* the calling subsystem wants this buf back*/
> > KIO_GIFTED, /* thanks for the buffer, man!  */
> > KIO_COW /* copy on write (XXX: not yet) */
> > };
> 
> This is a Really Bad Idea.  Having semantics depend on a subtle flag
> determined by a caller is a sure way to

The semantics aren't different for the using subsystem.  LOANED vs GIFTED
is an issue for the free function, COW will probably be a page-level mm
thing - though I haven't thought a lot about it yet and am not sure whether
it actually makes sense.

> 
> >
> >
> > struct kio {
> > struct kiovec * kio_data;   /* our kiovecs  */
> > int kio_ndata;  /* # of kiovecs */
> > int kio_flags;  /* loaned or gifted? */
> > void *  kio_priv;   /* caller private data  */
> > wait_queue_head_t   kio_wait;   /* wait queue   */
> > };
> >
> > makes it a lot simpler for the subsystems to integrate.
> 
> Keep in mind that using distant memory allocations for kio_data will incur
> additional cache misses.

It could also be a [0] array at the end, allowing for a single allocation,
but that looks more like an implementation detail than a design problem to me.

> The atomic count is probably going to be widely
> used; I see it being applicable to the network stack, block io layers and
> others.

Hmm.  Currently it is used only for the multiple buffer_head's per iobuf
cruft, and I don't see why multiple outstanding IOs should be noted in a
kiobuf.

> Also, how is information about io completion status passed back
> to the caller?

Yes, there needs to be a kio_errno field - though I wanted to get rid of
it, I had to re-add it in later versions of my design.

Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-02 Thread Christoph Hellwig

On Thu, Feb 01, 2001 at 10:07:44PM +, Stephen C. Tweedie wrote:
> No.  I want something good for zero-copy IO in general, but a lot of
> that concerns the problem of interacting with the user, and the basic
> center of that interaction in 99% of the interesting cases is either a
> user VM buffer or the page cache --- all of which are page-aligned.

Yes.

> If you look at the sorts of models being proposed (even by Linus) for
> splice, you get
> 
>   len = prepare_read();
>   prepare_write();
>   pull_fd();
>   commit_write();

Yepp.

> in which the read is being pulled into a known location in the page
> cache -- it's page-aligned, again.  I'm perfectly willing to accept
> that there may be a need for scatter-gather boundaries including
> non-page-aligned fragments in this model, but I can't see one if
> you're using the page cache as a mediator, nor if you're doing it
> through a user mmapped buffer.

True.

> The only reason you need finer scatter-gather boundaries --- and it
> may be a compelling reason --- is if you are merging multiple IOs
> together into a single device-level IO.  That makes perfect sense for
> the zerocopy tcp case where you're doing MSG_MORE-type coalescing.  It
> doesn't help the existing SGI kiobuf block device code, because that
> performs its merging in the filesystem layers and the block device
> code just squirts the IOs to the wire as-is,

Yes - but that is no solution for a generic model.  AFAICS even XFS
falls back to buffer_head's for small requests.

> but if we want to start
> merging those kiobuf-based IOs within make_request() then the block
> device layer may want it too.

Yes.

> And Linus is right, the old way of using a *kiobuf[] for that was
> painful, but the solution of adding start/length to every entry in
> the page vector just doesn't sit right with many components of the
> block device environment either.

What do you think is the alternative?

> I may still be persuaded that we need the full scatter-gather list
> fields throughout, but for now I tend to think that, at least in the
> disk layers, we may get cleaner results by allow linked lists of
> page-aligned kiobufs instead.  That allows for merging of kiobufs
> without having to copy all of the vector information each time.

But it will have the same problems as the array solution: there will
be one complete kio structure for each kiobuf, with its own end_io
callback, etc.

Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-02 Thread Christoph Hellwig

On Thu, Feb 01, 2001 at 09:25:08PM +, Stephen C. Tweedie wrote:
> > No.  Just allow passing the multiple of the devices blocksize over
> > ll_rw_block.
> 
> That was just one example: you need the sub-ios just as much when
> you split up an IO over stripe boundaries in LVM or raid0, for
> example.

IIRC that's why you designed (and I thought of independently) clone-kiobufs.

> Secondly, ll_rw_block needs to die anyway: you can expand
> the blocksize up to PAGE_SIZE but not beyond, whereas something like
> ll_rw_kiobuf can submit a much larger IO atomically (and we have
> devices which don't start to deliver good throughput until you use
> IO sizes of 1MB or more).

Completely agreed.

> If I've got a vector (page X, offset 0, length PAGE_SIZE) and I want
> to split it in two, I have to make two new vectors (page X, offset 0,
> length n) and (page X, offset n, length PAGE_SIZE-n).  That implies
> copying both vectors.
> 
> If I have a page vector with a single offset/length pair, I can build
> a new header with the same vector and modified offset/length to split
> the vector in two without copying it.

You just say in the higher-level structure ignore from x to y even if
they have an offset in their own vector.

Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-02 Thread Christoph Hellwig

On Thu, Feb 01, 2001 at 09:25:08PM +, Stephen C. Tweedie wrote:
  No.  Just allow passing the multiple of the devices blocksize over
  ll_rw_block.
 
 That was just one example: you need the sub-ios just as much when
 you split up an IO over stripe boundaries in LVM or raid0, for
 example.

IIRC that's why you designed (and I thought of independandly) clone-kiobufs.

 Secondly, ll_rw_block needs to die anyway: you can expand
 the blocksize up to PAGE_SIZE but not beyond, whereas something like
 ll_rw_kiobuf can submit a much larger IO atomically (and we have
 devices which don't start to deliver good throughput until you use
 IO sizes of 1MB or more).

Completly agreed.

 If I've got a vector (page X, offset 0, length PAGE_SIZE) and I want
 to split it in two, I have to make two new vectors (page X, offset 0,
 length n) and (page X, offset n, length PAGE_SIZE-n).  That implies
 copying both vectors.
 
 If I have a page vector with a single offset/length pair, I can build
 a new header with the same vector and modified offset/length to split
 the vector in two without copying it.

You just say in the higher-level structure ignore from x to y even if
they have an offset in their own vector.

Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-02 Thread Christoph Hellwig

On Thu, Feb 01, 2001 at 10:07:44PM +, Stephen C. Tweedie wrote:
 No.  I want something good for zero-copy IO in general, but a lot of
 that concerns the problem of interacting with the user, and the basic
 center of that interaction in 99% of the interesting cases is either a
 user VM buffer or the page cache --- all of which are page-aligned.

Yes.

 If you look at the sorts of models being proposed (even by Linus) for
 splice, you get
 
   len = prepare_read();
   prepare_write();
   pull_fd();
   commit_write();

Yepp.

 in which the read is being pulled into a known location in the page
 cache -- it's page-aligned, again.  I'm perfectly willing to accept
 that there may be a need for scatter-gather boundaries including
 non-page-aligned fragments in this model, but I can't see one if
 you're using the page cache as a mediator, nor if you're doing it
 through a user mmapped buffer.

True.

 The only reason you need finer scatter-gather boundaries --- and it
 may be a compelling reason --- is if you are merging multiple IOs
 together into a single device-level IO.  That makes perfect sense for
 the zerocopy tcp case where you're doing MSG_MORE-type coalescing.  It
 doesn't help the existing SGI kiobuf block device code, because that
 performs its merging in the filesystem layers and the block device
 code just squirts the IOs to the wire as-is,

Yes - but that is no soloution for a generic model.  AFAICS even XFS
falls back to buffer_head's for small requests.

 but if we want to start
 merging those kiobuf-based IOs within make_request() then the block
 device layer may want it too.

Yes.

 And Linus is right, the old way of using a *kiobuf[] for that was
 painful, but the solution of adding start/length to every entry in
 the page vector just doesn't sit right with many components of the
 block device environment either.

What do you thing is the alternative?

 I may still be persuaded that we need the full scatter-gather list
 fields throughout, but for now I tend to think that, at least in the
 disk layers, we may get cleaner results by allow linked lists of
 page-aligned kiobufs instead.  That allows for merging of kiobufs
 without having to copy all of the vector information each time.

But it will have the same problems as the array soloution: there will
be one complete kio structure for each kiobuf, with it's own end_io
callback, etc.

Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-02 Thread Christoph Hellwig

On Thu, Feb 01, 2001 at 11:18:56PM -0500, [EMAIL PROTECTED] wrote:
 On Thu, 1 Feb 2001, Christoph Hellwig wrote:
 
  A kiobuf is 124 bytes, a buffer_head 96.  And a buffer_head is additionally
  used for caching data, a kiobuf not.
 
 Go measure the cost of a distant cache miss, then complain about having
 everything in one structure.  Also, 1 kiobuf maps 16-128 times as much
 data as a single buffer head.

I'd never dipute that.  It was just an answers to Stephen's "a kiobuf is
already smaller".

  enum kio_flags {
  KIO_LOANED, /* the calling subsystem wants this buf back*/
  KIO_GIFTED, /* thanks for the buffer, man!  */
  KIO_COW /* copy on write (XXX: not yet) */
  };
 
 This is a Really Bad Idea.  Having semantics depend on a subtle flag
 determined by a caller is a sure way to

The semantics aren't different for the using subsystem.  LOANED vs GIFTED
is an issue for the free function, COW will probably be a page-level mm
thing - though I haven't thought a lot about it yet an am not sure wether
it actually makes sense.

 
 
 
  struct kio {
  struct kiovec * kio_data;   /* our kiovecs  */
  int kio_ndata;  /* # of kiovecs */
  int kio_flags;  /* loaned or giftet?*/
  void *  kio_priv;   /* caller private data  */
  wait_queue_head_t   kio_wait;   /* wait queue   */
  };
 
  makes it a lot simpler for the subsytems to integrate.
 
 Keep in mind that using distant memory allocations for kio_data will incur
 additional cache misses.

It could also be a [0] array at the end, allowing for a single allocation,
but that looks more like a implementation detail then a design problem to me.

 The atomic count is probably going to be widely
 used; I see it being applicable to the network stack, block io layers and
 others.

Hmm.  Currently it is used only for the multiple buffer_head's per iobuf
cruft, and I don't see why multiple outstanding IOs should be noted in a
kiobuf.

 Also, how is information about io completion status passed back
 to the caller?

Yes, there needs to be a kio_errno field - though I wanted to get rid of
it, I had to re-add it in later versions of my design.

Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-02 Thread Stephen C. Tweedie

Hi,

On Fri, Feb 02, 2001 at 12:51:35PM +0100, Christoph Hellwig wrote:
  
  If I have a page vector with a single offset/length pair, I can build
  a new header with the same vector and modified offset/length to split
  the vector in two without copying it.
 
 You just say in the higher-level structure ignore from x to y even if
 they have an offset in their own vector.

Exactly --- and so you end up with something _much_ uglier, because
you end up with all sorts of combinations of length/offset fields all
over the place.

This is _precisely_ the mess I want to avoid.

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-02 Thread bsuparna


Hi,

On Thu, Feb 01, 2001 at 01:28:33PM +0530, [EMAIL PROTECTED] wrote:

 Here's a second pass attempt, based on Ben's wait queue extensions:
 Does this sound any better ?

It's a mechanism, all right, but you haven't described what problems
it is trying to solve, and where it is likely to be used, so it's hard
to judge it. :)

Hmm .. I thought I had done that in my first posting, but obviously, I
mustn't have done a good job at expressing it, so let me take another stab
at trying to convey why I started on this.

There are certain specific situations that I have in mind right now, but to
me it looks like the very nature of the abstraction is such that it is
quite likely that there would be uses in some other situations which I may
not have thought of yet, or just do not understand well enough to vouch for
at this point. What those situations could be, and the associated issues
involved (especially performance related) is something that I hope other
people on this forum would be able to help pinpoint, based on their
experiences and areas of expertise.

I do realize that generic and yet simple and performance optimal in all
kinds of situations is a really difficult (if not impossible :-) ) thing to
achieve, but even then, won't it be nice to at least abstract out
uniformity in patterns across situations in a way which can be
tweaked/tuned for each specific class of situations ?

And the nice thing which I see about Ben's wait queue extensions is that it
gives us a route to try to do that ...

Some needs considered (and associated problems):

a. Stacking of completion events - asynchronously, through multiple layers
 - layered drivers  (encryption, conversion)
 - filter filesystems
Key aspects:
 1. It should be possible to pass the same (original) i/o container
structure all the way down (no copies/clones should need to happen, unless
actual i/o splitting, or extra buffer space or multiple sub-ios are
involved)
 2. Transparency: Neither the upper layer nor the layer below it should
need to have any specific knowledge about the existence/absence of an
intermediate filter layer (the mechanism should hide all that)
 3. LIFO ordering of completion actions
 4. The i/o structure should be marked as up-to-date only after all the
completion actions are done.
 5. Preferably have waiters on the i/o structure woken up only after
all completion actions are through (to avoid spurious/redundant wakeups
since the data won't be ready for use)
 6. Possible to have completion actions execute later in task context

b. Co-relation between multiple completion events and their associated
operations and data structures
 -  (bottom up aspect) merging results of split i/o requests, and
marking the completion of the compound i/o through multiple such layers
(tree), e.g
  - lvm
  - md / raid
  - evms aggregator features
 - (top down aspect) cascading down i/o cancellation requests /
sub-event waits, monitoring sub-io status etc.
  Some aspects:
 1. Result of collation of sub-i/os may be driver specific  (In some
situations like lvm  - each sub i/o maps to a particular portion of a
buffer; with software raid or some other kind of scheme the collation may
involve actually interpreting the data read)
 2. Re-start/retries of sub-ios (in case of errors) can be handled.
 3. Transparency : Neither the upper layer nor the layer below it
should need to have any specific knowledge about the existence/absence of
an intermediate layer (that sends out multiple sub i/os)
 4. The system should be devised to avoid extra logic/fields in the
generic i/o structures being passed around, in situations where no compound
i/o is involved (i.e. in the simple i/o cases and most common situations).
As far as possible it is desirable to keep the linkage information outside
of the i/o structure for this reason.
 5. Possible to have collation/completion actions execute later in task
context


Ben LaHaise's wait queue extensions takes care of most of the aspects of
(a), if used with a little care to ensure a(4).
[This just means that the function that marks the i/o structure as up-to-date
should be put in the completion queue first]
With this, we don't even need an explicit end_io() in bh/kiobufs etc. Just
the wait queue would do.
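
As a rough sketch of how that ordering could look with stackable completion
actions (hypothetical types and names, not Ben's actual wait-queue extension):

struct cev_action {
	struct cev_action *	next;
	void			(*func)(void *data);
	void *			data;
};

/* Push an action onto the chain: LIFO, so whatever is registered first
 * (e.g. "mark the iobuf up-to-date and wake waiters") runs last. */
static void cev_push(struct cev_action **chain, struct cev_action *act)
{
	act->next = *chain;
	*chain = act;
}

/* Run the chain at i/o completion time: filter layers added later run
 * first, so waiters only ever see fully post-processed data. */
static void cev_run(struct cev_action *chain)
{
	while (chain) {
		struct cev_action *act = chain;

		chain = act->next;
		act->func(act->data);
	}
}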

Only a(5) needs some thought since cache efficiency is upset by changing
the ordering of waits.

But, (b) needs a little more work as a higher level construct/mechanism
that latches on to the wait queue extensions. That is what the cev_wait
structure was designed for.
It keeps the chaining information outside of the i/o structures by default
(They can be allocated together, if desired anyway)

Is this still too much in the air ? Maybe I should describe the flow in a
specific scenario to illustrate ?

Regards
Suparna


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains

2001-02-01 Thread bcrl

On Thu, 1 Feb 2001, Christoph Hellwig wrote:

> A kiobuf is 124 bytes, a buffer_head 96.  And a buffer_head is additionally
> used for caching data, a kiobuf not.

Go measure the cost of a distant cache miss, then complain about having
everything in one structure.  Also, 1 kiobuf maps 16-128 times as much
data as a single buffer head.

> enum kio_flags {
>   KIO_LOANED, /* the calling subsystem wants this buf back*/
>   KIO_GIFTED, /* thanks for the buffer, man!  */
>   KIO_COW /* copy on write (XXX: not yet) */
> };

This is a Really Bad Idea.  Having semantics depend on a subtle flag
determined by a caller is a sure way to

>
>
> struct kio {
>   struct kiovec * kio_data;   /* our kiovecs  */
>   int kio_ndata;  /* # of kiovecs */
>   int kio_flags;  /* loaned or giftet?*/
>   void *  kio_priv;   /* caller private data  */
>   wait_queue_head_t   kio_wait;   /* wait queue   */
> };
>
> makes it a lot simpler for the subsystems to integrate.

Keep in mind that using distant memory allocations for kio_data will incur
additional cache misses.  The atomic count is probably going to be widely
used; I see it being applicable to the network stack, block io layers and
others.  Also, how is information about io completion status passed back
to the caller?  That information is required across layers so that io can
be properly aborted or proceed with the partial amount of io.  Add those
back in and we're right back to the original kiobuf structure.

-ben

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains

2001-02-01 Thread Chaitanya Tumuluri

On Thu, 1 Feb 2001, Stephen C. Tweedie wrote:
> Hi,
> 
> On Thu, Feb 01, 2001 at 06:05:15PM +0100, Christoph Hellwig wrote:
> > On Thu, Feb 01, 2001 at 04:16:15PM +, Stephen C. Tweedie wrote:
> > > > 
> > > > No, and with the current kiobufs it would not make sense, because they
> > > > are to heavy-weight.
> > > 
> > > Really?  In what way?  
> > 
> > We can't allocate a huge kiobuf structure just for requesting one page of
> > IO.  It might get better with VM-level IO clustering though.
> 
> A kiobuf is *much* smaller than, say, a buffer_head, and we currently
> allocate a buffer_head per block for all IO.
> 
> A kiobuf contains enough embedded page vector space for 16 pages by
> default, but I'm happy enough to remove that if people want.  However,
> note that that memory is not initialised, so there is no memory access
> cost at all for that empty space.  Remove that space and instead of
> one memory allocation per kiobuf, you get two, so the cost goes *UP*
> for small IOs.
> 
> > > > With page,length,offsett iobufs this makes sense
> > > > and is IMHO the way to go.
> > > 
> > > What, you mean adding *extra* stuff to the heavyweight kiobuf makes it
> > > lean enough to do the job??
> > 
> > No.  I was speaking about the light-weight kiobuf Linus & me discussed on
> > lkml some time ago (though I'd much more like to call it kiovec analogous
> > to BSD iovecs).
> 
> What is so heavyweight in the current kiobuf (other than the embedded
> vector, which I've already noted I'm willing to cut)?


Hi,

It'd seem that "array_len", "locked", "bounced", "io_count" and "errno" 
are the fields that need to go away (apart from the "maplist").

The field "nr_pages" would reincarnate in the kiovec struct (which is
is not a plain array anymore) as the field "nbufs". See below.

Based on what I've seen fly by on the lists here's my understanding of 
the proposed new kiobuf/kiovec structures:

===
/*
 * a simple page,offset,length tuple like Linus wants it
 */
struct kiobuf {
struct page *   page;   /* The page itself   */
        u_16            offset; /* Offset to start of valid data */
        u_16            length; /* Number of valid bytes of data */
};

struct kiovec {
int nbufs;  /* Kiobufs actually referenced */
struct kiobuf * bufs;
}

/*
 * the name is just plain stupid, but that shouldn't matter
 */
struct vfs_kiovec {
struct kiovec * iov;

/* private data, mostly for the callback */
void * private;

/* completion callback */
void (*end_io)  (struct vfs_kiovec *);
wait_queue_head_t wait_queue;
};
===

Is this correct? 

If so, I have a few questions/clarifications:

- The [ll_rw_blk, scsi/ide request-functions, scsi/ide 
  I/O completion handling] functions would be handed the 
  "X_kiovec" struct, correct?

- So, the soft-RAID / LVM layers need to construct their 
  own "lvm_kiovec" structs to handle request splits and
  the partial completions, correct? 

- Then, what are the semantics of request-merges containing 
  the "X_kiovec" structs in the block I/O queueing layers?
  Do we add "X_kiovec->next", "X_kiovec->prev" etc. fields?

  It will also require a re-allocation of a new and longer
  kiovec->bufs array, correct?
  
- How are I/O error codes to be propagated back to the 
  higher (calling) layers? I think that needs to be added
  into the "X_kiovec" struct, no?

- How is bouncing to be handled with this setup? (some state 
  is needed to (a) determine that bouncing occurred, (b) find 
  out which pages have been bounced and, (c) find out the 
  bounce-page for each of these bounced pages).

Cheers,
-Chait.






-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-01 Thread Stephen C. Tweedie

Hi,

On Thu, Feb 01, 2001 at 09:33:27PM +0100, Christoph Hellwig wrote:

> I think you want the whole kio concept only for disk-like IO.  

No.  I want something good for zero-copy IO in general, but a lot of
that concerns the problem of interacting with the user, and the basic
center of that interaction in 99% of the interesting cases is either a
user VM buffer or the page cache --- all of which are page-aligned.  

If you look at the sorts of models being proposed (even by Linus) for
splice, you get

len = prepare_read();
prepare_write();
pull_fd();
commit_write();

in which the read is being pulled into a known location in the page
cache -- it's page-aligned, again.  I'm perfectly willing to accept
that there may be a need for scatter-gather boundaries including
non-page-aligned fragments in this model, but I can't see one if
you're using the page cache as a mediator, nor if you're doing it
through a user mmapped buffer.

The only reason you need finer scatter-gather boundaries --- and it
may be a compelling reason --- is if you are merging multiple IOs
together into a single device-level IO.  That makes perfect sense for
the zerocopy tcp case where you're doing MSG_MORE-type coalescing.  It
doesn't help the existing SGI kiobuf block device code, because that
performs its merging in the filesystem layers and the block device
code just squirts the IOs to the wire as-is, but if we want to start
merging those kiobuf-based IOs within make_request() then the block
device layer may want it too.

And Linus is right, the old way of using a *kiobuf[] for that was
painful, but the solution of adding start/length to every entry in
the page vector just doesn't sit right with many components of the
block device environment either.

I may still be persuaded that we need the full scatter-gather list
fields throughout, but for now I tend to think that, at least in the
disk layers, we may get cleaner results by allow linked lists of
page-aligned kiobufs instead.  That allows for merging of kiobufs
without having to copy all of the vector information each time.
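
A sketch of what such chaining might look like (the chain type here is
hypothetical; the stock kiobuf has no link field):

/* Stand-in chain node so the sketch stays self-contained. */
struct kiobuf_chain {
	struct kiobuf *		kiobuf;
	struct kiobuf_chain *	next;
};

/* Merge another page-aligned kiobuf onto a request by linking it; the
 * page vectors are never copied, and each chained kiobuf keeps its own
 * end_io, so completion can still be signalled per higher-level buffer. */
static void kiobuf_chain_append(struct kiobuf_chain **head,
				struct kiobuf_chain *link)
{
	while (*head)			/* walk to the end of the chain */
		head = &(*head)->next;
	link->next = NULL;
	*head = link;
}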

The killer, however, is what happens if you want to split such a
merged kiobuf.  Right now, that's something that I can only imagine
happening in the block layers if we start encoding buffer_head chains
as kiobufs, but if we do that in the future, or if we start merging
genuine kiobuf requests, then doing that split later on (for
raid0 etc) may require duplicating whole chains of kiobufs.  At that
point, just doing scatter-gather lists is cleaner.

But for now, the way to picture what I'm trying to achieve is that
kiobufs are a bit like buffer_heads --- they represent the physical
pages of some VM object that a higher layer has constructed, such as
the page cache or a user VM buffer.  You can chain these objects
together for IO, but that doesn't stop the individual objects from
being separate entities with independent IO completion callbacks to be
honoured.  

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-01 Thread Stephen C. Tweedie

Hi,

On Thu, Feb 01, 2001 at 09:33:27PM +0100, Christoph Hellwig wrote:
> 
> > On Thu, Feb 01, 2001 at 05:34:49PM +, Alan Cox wrote:
> > In the disk IO case, you basically don't get that (the only thing
> > which comes close is raid5 parity blocks).  The data which the user
> > started with is the data sent out on the wire.  You do get some
> > interesting cases such as soft raid and LVM, or even in the scsi stack
> > if you run out of mailbox space, where you need to send only a
> > sub-chunk of the input buffer. 
> 
> Though your description is right, I don't think the case is very common:
> Sometimes in LVM on a pv boundary and maybe sometimes in the scsi code.

On raid0 stripes, it's common to have stripes of between 16k and 64k,
so it's rather more common there than you'd like.  In any case, you
need the code to handle it, and I don't want to make the code paths
any more complex than necessary.

> In raid1 you need some kind of clone iobuf, which should work with both
> cases.  In raid0 you need a complete new pagelist anyway

No you don't.  You take the existing one, specify which region of it
is going to the current stripe, and send it off.  Nothing more.

> > In that case, having offset/len as the kiobuf limit markers is ideal:
> > you can clone a kiobuf header using the same page vector as the
> > parent, narrow down the start/end points, and continue down the stack
> > without having to copy any part of the page list.  If you had the
> > offset/len data encoded implicitly into each entry in the sglist, you
> > would not be able to do that.
> 
> Sure you could: you embed that information in a higher-level structure.

What's the point in a common data container structure if you need
higher-level information to make any sense out of it?

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-01 Thread Stephen C. Tweedie

Hi,

On Thu, Feb 01, 2001 at 09:46:27PM +0100, Christoph Hellwig wrote:

> > Right now we can take a kiobuf and turn it into a bunch of
> > buffer_heads for IO.  The io_count lets us track all of those sub-IOs
> > so that we know when all submitted IO has completed, so that we can
> > pass the completion callback back up the chain without having to
> > allocate yet more descriptor structs for the IO.
> 
> > Again, remove this and the IO becomes more heavyweight because we need
> > to create a separate struct for the info.
> 
> No.  Just allow passing multiples of the device's blocksize over
> ll_rw_block.

That was just one example: you need the sub-ios just as much when
you split up an IO over stripe boundaries in LVM or raid0, for
example.  Secondly, ll_rw_block needs to die anyway: you can expand
the blocksize up to PAGE_SIZE but not beyond, whereas something like
ll_rw_kiobuf can submit a much larger IO atomically (and we have
devices which don't start to deliver good throughput until you use
IO sizes of 1MB or more).

> >> and the lack of
> >> scatter gather in one kiobuf struct (you always need an array)
> 
> > Again, _all_ data being sent down through the block device layer is
> > either in buffer heads or is page aligned.
> 
> That's the point.  You are always talking about the block-layer only.

I'm talking about why the minimal, generic solution doesn't provide
what the block layer needs.


> > Obviously, extra code will be needed to scan kiobufs if we do that,
> > and unless we have both per-page _and_ per-kiobuf start/offset pairs
> > (adding even further to the complexity), those scatter-gather lists
> > would prevent us from carving up a kiobuf into smaller sub-ios without
> > copying the whole (expanded) vector.
> 
> No.  I think I explained that in my last mail.

How?

If I've got a vector (page X, offset 0, length PAGE_SIZE) and I want
to split it in two, I have to make two new vectors (page X, offset 0,
length n) and (page X, offset n, length PAGE_SIZE-n).  That implies
copying both vectors.

If I have a page vector with a single offset/length pair, I can build
a new header with the same vector and modified offset/length to split
the vector in two without copying it.
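
In code, the split described here might look roughly like this (hypothetical
helper; it assumes the kiobuf's single offset/length pair over a shared
maplist, and glosses over io_count, locking and the wait queue):

static void kiobuf_clone_range(struct kiobuf *child, struct kiobuf *parent,
			       unsigned int off, unsigned int len)
{
	*child = *parent;		/* share maplist, nr_pages, ... */
	child->offset = parent->offset + off;
	child->length = len;
	/* The page vector itself is not copied: both headers point at the
	 * same array of struct page *.  (Skipping whole leading pages when
	 * off >= PAGE_SIZE is omitted here.) */
}

/* Splitting (page X, offset 0, length PAGE_SIZE) in two:
 *	kiobuf_clone_range(&front, &orig, 0, n);
 *	kiobuf_clone_range(&back,  &orig, n, PAGE_SIZE - n);
 */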

> > Possibly, but I remain to be convinced, because you may end up with a
> > mechanism which is generic but is not well-tuned for any specific
> > case, so everything goes slower.
> 
> As kiobufs are widely used for real IO, just as containers, this is
> better than nothing.

Surely having all of the subsystems working fast is better still?

> And IMHO a nice generic concept that lets different subsystems work
> together is a _lot_ better than a bunch of over-optimized, rather isolated
> subsystems.  The IO-Lite people have done nice research on the effect of
> a unified IO-caching system vs. the typical isolated systems.

I know, and IO-Lite has some major problems (the close integration of
that code into the cache, for example, makes it harder to expose the
zero-copy to user-land).

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-01 Thread Steve Lord

> On Thu, Feb 01, 2001 at 02:56:47PM -0600, Steve Lord wrote:
> > And if you are writing to a striped volume via a filesystem which can do
> > it's own I/O clustering, e.g. I throw 500 pages at LVM in one go and LVM
> > is striped on 64K boundaries.
> 
> But usually I want to have pages 0-63, 128-191, etc together, because they are
> contiguous on disk, or?

I was just giving an example of how kiobufs might need splitting up more often
than you think; crossing a stripe boundary is one obvious case. Yes you do
want to keep the pages which are contiguous on disk together, but you will
often get requests which cover multiple stripes, otherwise you don't really
get much out of stripes and may as well just concatenate drives.
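
To put numbers on that (illustrative only, assuming 4K pages and a 64K stripe
unit): the 500-page example above is roughly 2000KB, i.e. about 32
stripe-unit-sized chunks, each of which may land on a different disk.

#define STRIPE_UNIT	(64 * 1024)	/* assumed stripe size */

/* How many stripe-unit chunks a request of 'bytes' covers. */
static inline unsigned long stripe_chunks(unsigned long bytes)
{
	return (bytes + STRIPE_UNIT - 1) / STRIPE_UNIT;
}

/* Which disk a given byte offset maps to in a simple round-robin layout. */
static inline int stripe_disk(unsigned long offset, int ndisks)
{
	return (offset / STRIPE_UNIT) % ndisks;
}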

Ideally the file is striped across the various disks in the volume, and one
large write (direct or from the cache) gets scattered across the disks. All
the I/O's run in parallel (and on different controllers if you have the 
budget).

Steve

> 
>   Christoph
> 
> -- 
> Of course it doesn't work. We've performed a software upgrade.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-01 Thread Christoph Hellwig

On Thu, Feb 01, 2001 at 02:56:47PM -0600, Steve Lord wrote:
> And if you are writing to a striped volume via a filesystem which can do
> it's own I/O clustering, e.g. I throw 500 pages at LVM in one go and LVM
> is striped on 64K boundaries.

But usually I want to have pages 0-63, 128-191, etc together, because they are
contiguous on disk, or?

Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-01 Thread Steve Lord

> In article <[EMAIL PROTECTED]> you wrote:
> > Hi,
> 
> > On Thu, Feb 01, 2001 at 05:34:49PM +, Alan Cox wrote:
> > In the disk IO case, you basically don't get that (the only thing
> > which comes close is raid5 parity blocks).  The data which the user
> > started with is the data sent out on the wire.  You do get some
> > interesting cases such as soft raid and LVM, or even in the scsi stack
> > if you run out of mailbox space, where you need to send only a
> > sub-chunk of the input buffer. 
> 
> Though your description is right, I don't think the case is very common:
> Sometimes in LVM on a pv boundary and maybe sometimes in the scsi code.


And if you are writing to a striped volume via a filesystem which can do
it's own I/O clustering, e.g. I throw 500 pages at LVM in one go and LVM
is striped on 64K boundaries.

Steve


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-01 Thread Christoph Hellwig

In article <[EMAIL PROTECTED]> you wrote:
> Buffer_heads are _sometimes_ used for caching data.

Actually they are mostly used, but that shouldn't matter for the
discussion...

> That's one of the
> big problems with them, they are too overloaded, being both IO
> descriptors _and_ cache descriptors.

Agreed.

> If you've got 128k of data to
> write out from user space, do you want to set up one kiobuf or 256
> buffer_heads?  Buffer_heads become really very heavy indeed once you
> start doing non-trivial IO.

Sure - I was never arguing in favor of buffer_head's ...

>> > What is so heavyweight in the current kiobuf (other than the embedded
>> > vector, which I've already noted I'm willing to cut)?
>> 
>> array_len

> kiobufs can be reused after IO.  You can depopulate a kiobuf,
> repopulate it with new pages and submit new IO without having to
> deallocate the kiobuf.  You can't do this without knowing how big the
> data vector is.  Removing that functionality will prevent reuse,
> making them _more_ heavyweight.

>> io_count,

> Right now we can take a kiobuf and turn it into a bunch of
> buffer_heads for IO.  The io_count lets us track all of those sub-IOs
> so that we know when all submitted IO has completed, so that we can
> pass the completion callback back up the chain without having to
> allocate yet more descriptor structs for the IO.

> Again, remove this and the IO becomes more heavyweight because we need
> to create a separate struct for the info.

No.  Just allow passing multiples of the device's blocksize over
ll_rw_block.  XFS is doing that and it just needs an audit of the lesser
used block drivers.

>> and the lack of
>> scatter gather in one kiobuf struct (you always need an array)

> Again, _all_ data being sent down through the block device layer is
> either in buffer heads or is page aligned.

That's the point.  You are always talking about the block-layer only.
And I think it should be generic instead.
Looks like that is the major point.

> You want us to triple the
> size of the "heavyweight" kiobuf's data vector for what gain, exactly?

double.

> Obviously, extra code will be needed to scan kiobufs if we do that,
> and unless we have both per-page _and_ per-kiobuf start/offset pairs
> (adding even further to the complexity), those scatter-gather lists
> would prevent us from carving up a kiobuf into smaller sub-ios without
> copying the whole (expanded) vector.

No.  I think I explained that in my last mail.

> That's a _lot_ of extra complexity in the disk IO layers.

> Possibly, but I remain to be convinced, because you may end up with a
> mechanism which is generic but is not well-tuned for any specific
> case, so everything goes slower.

As kiobufs are widely used for real IO, just as containers, this is
better than nothing.
And IMHO a nice generic concept that lets different subsystems work
together is a _lot_ better than a bunch of over-optimized, rather isolated
subsystems.  The IO-Lite people have done nice research on the effect of
a unified IO-caching system vs. the typical isolated systems.

Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-01 Thread Christoph Hellwig

In article <[EMAIL PROTECTED]> you wrote:
> Hi,

> On Thu, Feb 01, 2001 at 05:34:49PM +, Alan Cox wrote:
> In the disk IO case, you basically don't get that (the only thing
> which comes close is raid5 parity blocks).  The data which the user
> started with is the data sent out on the wire.  You do get some
> interesting cases such as soft raid and LVM, or even in the scsi stack
> if you run out of mailbox space, where you need to send only a
> sub-chunk of the input buffer. 

Though your description is right, I don't think the case is very common:
Sometimes in LVM on a pv boundary and maybe sometimes in the scsi code.

In raid1 you need some kind of clone iobuf, which should work with both
cases.  In raid0 you need a complete new pagelist anyway, same for raid5.


> In that case, having offset/len as the kiobuf limit markers is ideal:
> you can clone a kiobuf header using the same page vector as the
> parent, narrow down the start/end points, and continue down the stack
> without having to copy any part of the page list.  If you had the
> offset/len data encoded implicitly into each entry in the sglist, you
> would not be able to do that.

Sure you could: you embed that information in a higher-level structure.
I think you want the whole kio concept only for disk-like IO.  Then many
of the things you do are completely right and I don't see many problems
(besides thinking that some things may go away - but that's no major point).

With a generic object that is used over subsytem boundaries things are
different.

Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains

2001-02-01 Thread Chaitanya Tumuluri

On Thu, 1 Feb 2001, Stephen C. Tweedie wrote:
> Hi,
> 
> On Thu, Feb 01, 2001 at 05:34:49PM +, Alan Cox wrote:
> > > 
> > > I don't see any real advantage for disk IO.  The real advantage is that
> > > we can have a generic structure that is also usefull in e.g. networking
> > > and can lead to a unified IO buffering scheme (a little like IO-Lite).
> > 
> > Networking wants something lighter rather than heavier. Adding tons of
> > base/limit pairs to kiobufs makes it worse not better
> 
> Networking has fundamentally different requirements.  In a network
> stack, you want the ability to add fragments to unaligned chunks of
> data to represent headers at any point in the stack.
> 
> In the disk IO case, you basically don't get that (the only thing
> which comes close is raid5 parity blocks).  The data which the user
> started with is the data sent out on the wire.  You do get some
> interesting cases such as soft raid and LVM, or even in the scsi stack
> if you run out of mailbox space, where you need to send only a
> sub-chunk of the input buffer.  

Or the case of BSD-style UIO implementing the readv() and writev() calls.
This may or may not align perfectly, so address-length lists per page could
be helpful.

I did try an implementation of this for rawio and found that I had to
restrict the a-len lists coming in via the user iovecs to be aligned.

> In that case, having offset/len as the kiobuf limit markers is ideal:
> you can clone a kiobuf header using the same page vector as the
> parent, narrow down the start/end points, and continue down the stack
> without having to copy any part of the page list.  If you had the
> offset/len data encoded implicitly into each entry in the sglist, you
> would not be able to do that.

This would solve the issue with UIO, yes. Also, I think Martin Peterson
(mkp) had taken a stab at doing "clone-kiobufs" for LVM at some point.

Martin?

Cheers,
-Chait.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-01 Thread yodaiken

On Thu, Feb 01, 2001 at 04:32:48PM -0200, Rik van Riel wrote:
> On Thu, 1 Feb 2001, Alan Cox wrote:
> 
> > > Sure.  But Linus saying that he doesn't want more of that (shit, crap,
> > > I don't remember what he said exactly) in the kernel is a very good reason
> > > for thinking a little more about it.
> > 
> > No. Linus is not a God, Linus is fallible, regularly makes mistakes and
> > frequently opens his mouth and says stupid things when he is far too busy.
> 
> People may remember Linus saying a resolute no to SMP
> support in Linux ;)

And perhaps he was right!

-- 
-
Victor Yodaiken 
Finite State Machine Labs: The RTLinux Company.
 www.fsmlabs.com  www.rtlinux.com

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-01 Thread Stephen C. Tweedie

Hi,

On Thu, Feb 01, 2001 at 07:14:03PM +0100, Christoph Hellwig wrote:
> On Thu, Feb 01, 2001 at 05:41:20PM +, Stephen C. Tweedie wrote:
> > > 
> > > We can't allocate a huge kiobuf structure just for requesting one page of
> > > IO.  It might get better with VM-level IO clustering though.
> > 
> > A kiobuf is *much* smaller than, say, a buffer_head, and we currently
> > allocate a buffer_head per block for all IO.
> 
> A kiobuf is 124 bytes,

... the vast majority of which is room for the page vector to expand
without having to be copied.  You don't touch that in the normal case.

> a buffer_head 96.  And a buffer_head is additionally
> used for caching data, a kiobuf not.

Buffer_heads are _sometimes_ used for caching data.  That's one of the
big problems with them, they are too overloaded, being both IO
descriptors _and_ cache descriptors.  If you've got 128k of data to
write out from user space, do you want to set up one kiobuf or 256
buffer_heads?  Buffer_heads become really very heavy indeed once you
start doing non-trivial IO.

> > What is so heavyweight in the current kiobuf (other than the embedded
> > vector, which I've already noted I'm willing to cut)?
> 
> array_len

kiobufs can be reused after IO.  You can depopulate a kiobuf,
repopulate it with new pages and submit new IO without having to
deallocate the kiobuf.  You can't do this without knowing how big the
data vector is.  Removing that functionality will prevent reuse,
making them _more_ heavyweight.

> io_count,

Right now we can take a kiobuf and turn it into a bunch of
buffer_heads for IO.  The io_count lets us track all of those sub-IOs
so that we know when all submitted IO has completed, so that we can
pass the completion callback back up the chain without having to
allocate yet more descriptor structs for the IO.

Again, remove this and the IO becomes more heavyweight because we need
to create a separate struct for the info.
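
A minimal sketch of that accounting (hypothetical helper names; the real
kiobuf code differs in detail):

/* Called once per buffer_head (sub-IO) submitted for the kiobuf. */
static void kiobuf_sub_io_start(struct kiobuf *kio)
{
	atomic_inc(&kio->io_count);
}

/* Called from each buffer_head's completion; the last sub-IO to finish
 * fires the kiobuf's end_io exactly once. */
static void kiobuf_sub_io_end(struct kiobuf *kio, int uptodate)
{
	if (!uptodate)
		kio->errno = -EIO;
	if (atomic_dec_and_test(&kio->io_count))
		kio->end_io(kio);
}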

> the presence of wait_queue AND end_io,

That's fine, I'm happy scrapping the wait queue: people can always use
the kiobuf private data field to refer to a wait queue if they want
to.

> and the lack of
> scatter gather in one kiobuf struct (you always need an array)

Again, _all_ data being sent down through the block device layer is
either in buffer heads or is page aligned.  You want us to triple the
size of the "heavyweight" kiobuf's data vector for what gain, exactly?
Obviously, extra code will be needed to scan kiobufs if we do that,
and unless we have both per-page _and_ per-kiobuf start/offset pairs
(adding even further to the complexity), those scatter-gather lists
would prevent us from carving up a kiobuf into smaller sub-ios without
copying the whole (expanded) vector.

That's a _lot_ of extra complexity in the disk IO layers.

I'm all for a fast kiobuf_to_sglist converter.  But I haven't seen any
evidence that such scatter-gather lists will do anything in the block
device case except complicate the code and decrease performance.
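
For what it's worth, such a converter could be as simple as the following
sketch (the sg_entry type is hypothetical; real drivers would fill in their
own sglist format):

struct sg_entry {
	struct page *	page;
	unsigned int	offset;
	unsigned int	length;
};

static int kiobuf_to_sglist(struct kiobuf *kio, struct sg_entry *sg)
{
	unsigned int left = kio->length;
	unsigned int off  = kio->offset;
	int i;

	for (i = 0; left && i < kio->nr_pages; i++) {
		unsigned int chunk = PAGE_SIZE - off;

		if (chunk > left)
			chunk = left;
		sg[i].page   = kio->maplist[i];
		sg[i].offset = off;
		sg[i].length = chunk;
		left -= chunk;
		off = 0;	/* only the first page can start mid-page */
	}
	return i;		/* number of entries filled */
}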

> S.th. like:
...
> makes it a lot simpler for the subsystems to integrate.

Possibly, but I remain to be convinced, because you may end up with a
mechanism which is generic but is not well-tuned for any specific
case, so everything goes slower.

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-01 Thread Stephen C. Tweedie

Hi,

On Thu, Feb 01, 2001 at 06:49:50PM +0100, Christoph Hellwig wrote:
> 
> > Adding tons of base/limit pairs to kiobufs makes it worse not better
> 
> For disk I/O it makes the handling a little easier for the cost of the
> additional offset/length fields.

Umm, actually, no, it makes it much worse for many of the cases.  

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains

2001-02-01 Thread bcrl

On Thu, 1 Feb 2001, Alan Cox wrote:

> Linus' list of reasons, like the amount of state, is more interesting

The state is required, not optional, if we are to have a decent basis for
building asynchronous io into the kernel.

> Networking wants something lighter rather than heavier. Adding tons of
> base/limit pairs to kiobufs makes it worse not better

I'm still not seeing what I consider valid arguments from the networking
people regarding the use of kiobufs as the interface they present to the
VFS for asynchronous/bulk io.  I agree with their needs for a lightweight
mechanism for getting small io requests from userland, and even the need
for using lightweight scatter gather lists within the network layer
itself.  If the statement is that map_user_kiobuf is too heavy for use on
every single io, sure.  But that is a separate issue.

-ben


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains

2001-02-01 Thread Chaitanya Tumuluri

On Thu, 1 Feb 2001, Stephen C. Tweedie wrote:
> Hi,
> 
> On Thu, Feb 01, 2001 at 10:25:22AM +0530, [EMAIL PROTECTED] wrote:
> > 
> > Being able to track the children of a kiobuf would help with I/O
> > cancellation (e.g. to pull sub-ios off their request queues if I/O
> > cancellation for the parent kiobuf was issued). Not essential, I guess, in
> > general, but useful in some situations.
> 
> What exactly is the justification for IO cancellation?  It really
> upsets the normal flow of control through the IO stack to have
> voluntary cancellation semantics.
> 
XFS does something called a "forced shutdown" of the filesystem in which
it requires outstanding I/Os issued against file data to be cancelled. 
This is triggered by (among other things) errors in writing out file 
metadata. I'm cc'ing Steve Lord so he can provide more information.

Of course, I was thinking along the lines of an API flushing the requests
out of the elevator at that time; didn't get too far with it, though.

Cheers,
-Chait.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-01 Thread Christoph Hellwig

On Thu, Feb 01, 2001 at 06:25:16PM +, Alan Cox wrote:
> > array_len, io_count, the presence of wait_queue AND end_io, and the lack of
> > scatter gather in one kiobuf struct (you always need an array), and AFAICS
> > that is what the networking guys dislike.
> 
> > You need a completion pointer. It's arguable whether you want the wait_queue
> in the default structure or as part of whatever its contained in and handled
> by the completion pointer.

I personally think that Ben's function-pointer-on-wakeup work is the alternative in
this area.

> And I've actually bothered to talk to the networking people and they don't have
> a problem with the completion pointer.

I have never said that they don't like it - but having both the waitqueue and the
completion handler in the kiobuf makes it bigger.

> > Now one could say: just let the networkers use their own kind of buffers
> > (and that's exactly what is done in the zerocopy patches), but that again leds
> > to inefficient buffer passing and ungeneric IO handling.
> 
> Careful.  This is the line of reasoning which also says
> 
> Aeroplanes are good for travelling long distances
> Cars are better for getting to my front door
> Therefore everyone should drive a 747 home

Hehe ;)

> It is quite possible that the right thing to do is to do conversions in the
> cases it happens.

Yes, this would be THE alternative to my suggestion.

> That might seem a good reason for having offset/length
> pairs on each block, because streaming from the network to disk you may well
> get a collection of partial pages of data you need to write to disk. 
> Unfortunately the reality of DMA support on almost (but not quite) all
> disk controllers is that you don't get that degree of scatter gather.
> 
> My I2O controllers and I think the fusion controllers could indeed benefit
> and cope with being given a pile of randomly located 1480 byte chunks of 
> data and being asked to put them on disk.

It doesn't really matter that much, because we write to the pagecache
first anyway.

The real thing is that we want to have some common data structure for
describing physical memory used for IO.  We could either use special
structures in every subsystem and then copy between them or pass
struct page * and lose meta information.  Or we could try to find a
structure that holds enough information to make passing it from one
subsystem to another usefull.  The cut-down kio design (heavily inspired
by Larry McVoy's splice paper) should allow just that, nothing more an
nothing less.  For use in disk-io and networking or v4l there are probably
other primary data structures needed, and that's ok.

Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-01 Thread Christoph Hellwig

On Thu, Feb 01, 2001 at 06:57:41PM +, Alan Cox wrote:
> Not for raw I/O. Although for the drivers that can't cope then going via
> the page cache is certainly the next best alternative

True - but raw-io has its own alignment issues anyway.

> Yes. You also need a way to describe it in terms of page * in order to do
> mm locking for raw I/O (like the video capture stuff wants)

Right. (That's why we have the struct page * always as part of the structure)

> Certainly having the lightweight one a subset of the heavyweight one is a good
> target. 

Yes, I'm trying to address that...

Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-01 Thread Alan Cox

> It doesn't really matter that much, because we write to the pagecache
> first anyway.

Not for raw I/O. Although for the drivers that can't cope then going via
the page cache is certainly the next best alternative

> The real thing is that we want to have some common data structure for
> describing physical memory used for IO.  We could either use special

Yes. You also need a way to describe it in terms of page * in order to do
mm locking for raw I/O (like the video capture stuff wants)

> by Larry McVoy's splice paper) should allow just that, nothing more an
> nothing less.  For use in disk-io and networking or v4l there are probably
> other primary data structures needed, and that's ok.

Certainly having the lightweight one a subset of the heavyweight one is a good
target. 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains

2001-02-01 Thread Rik van Riel

On Thu, 1 Feb 2001, Alan Cox wrote:

> > Now one could say: just let the networkers use their own kind of buffers
> > (and that's exactly what is done in the zerocopy patches), but that again leads
> > to inefficient buffer passing and ungeneric IO handling.

[snip]
> It is quite possible that the right thing to do is to do
> conversions in the cases it happens.

OTOH, somehow a zero-copy system which converts the zero-copy
metadata every time the buffer is handed to another subsystem
just doesn't sound right ...

(well, maybe it _is_, but it looks quite inefficient at first
glance)

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/
http://www.conectiva.com/   http://distro.conectiva.com.br/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains

2001-02-01 Thread Rik van Riel

On Thu, 1 Feb 2001, Alan Cox wrote:

> > Sure.  But Linus saying that he doesn't want more of that (shit, crap,
> > I don't remember what he said exactly) in the kernel is a very good reason
> > for thinking a little more about it.
> 
> No. Linus is not a God, Linus is fallible, regularly makes mistakes and
> frequently opens his mouth and says stupid things when he is far too busy.

People may remember Linus saying a resolute no to SMP
support in Linux ;)

In my experience, when Linus says "NO" to a certain
idea, he's usually objecting to bad design decisions
in the proposed implementation of the idea and the
lack of a nice alternative solution ...

... but as soon as a clean, efficient and maintainable
alternative to the original bad idea surfaces, it seems
to be quite easy to convince Linus to include it.

cheers,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/
http://www.conectiva.com/   http://distro.conectiva.com.br/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-01 Thread Alan Cox

> array_len, io_count, the presence of wait_queue AND end_io, and the lack of
> scatter gather in one kiobuf struct (you always need an array), and AFAICS
> that is what the networking guys dislike.

You need a completion pointer. It's arguable whether you want the wait_queue
in the default structure or as part of whatever its contained in and handled
by the completion pointer.

And I've actually bothered to talk to the networking people and they don't have
a problem with the completion pointer.

> Now one could say: just let the networkers use their own kind of buffers
> (and that's exactly what is done in the zerocopy patches), but that again leads
> to inefficient buffer passing and ungeneric IO handling.

Careful.  This is the line of reasoning which also says

Aeroplanes are good for travelling long distances
Cars are better for getting to my front door
Therefore everyone should drive a 747 home

It is quite possible that the right thing to do is to do conversions in the
cases it happens. That might seem a good reason for having offset/length
pairs on each block, because streaming from the network to disk you may well
get a collection of partial pages of data you need to write to disk. 
Unfortunately the reality of DMA support on almost (but not quite) all
disk controllers is that you don't get that degree of scatter gather.

My I2O controllers and I think the fusion controllers could indeed benefit
and cope with being given a pile of randomly located 1480 byte chunks of 
data and being asked to put them on disk.

I do seriously doubt there are any real-world situations where this is useful.




-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-01 Thread Christoph Hellwig

On Thu, Feb 01, 2001 at 05:41:20PM +, Stephen C. Tweedie wrote:
> Hi,
> 
> On Thu, Feb 01, 2001 at 06:05:15PM +0100, Christoph Hellwig wrote:
> > On Thu, Feb 01, 2001 at 04:16:15PM +, Stephen C. Tweedie wrote:
> > > > 
> > > > No, and with the current kiobufs it would not make sense, because they
> > > > are to heavy-weight.
> > > 
> > > Really?  In what way?  
> > 
> > We can't allocate a huge kiobuf structure just for requesting one page of
> > IO.  It might get better with VM-level IO clustering though.
> 
> A kiobuf is *much* smaller than, say, a buffer_head, and we currently
> allocate a buffer_head per block for all IO.

A kiobuf is 124 bytes, a buffer_head 96.  And a buffer_head is additionally
used for caching data, a kiobuf not.

> 
> A kiobuf contains enough embedded page vector space for 16 pages by
> default, but I'm happy enough to remove that if people want.  However,
> note that that memory is not initialised, so there is no memory access
> cost at all for that empty space.  Remove that space and instead of
> one memory allocation per kiobuf, you get two, so the cost goes *UP*
> for small IOs.

You could still embed it into a surrounding structure, even if there are cases
where an additional memory allocation is needed, yes.

> 
> > > > With page,length,offsett iobufs this makes sense
> > > > and is IMHO the way to go.
> > > 
> > > What, you mean adding *extra* stuff to the heavyweight kiobuf makes it
> > > lean enough to do the job??
> > 
> > No.  I was speaking about the light-weight kiobuf Linus & me discussed on
> > lkml some time ago (though I'd much more like to call it kiovec analogous
> > to BSD iovecs).
> 
> What is so heavyweight in the current kiobuf (other than the embedded
> vector, which I've already noted I'm willing to cut)?

array_len, io_count, the presence of wait_queue AND end_io, and the lack of
scatter gather in one kiobuf struct (you always need an array), and AFAICS
that is what the networking guys dislike.

They often just want multiple buffers in one physical page, and an array of
those.

Now one could say: just let the networkers use their own kind of buffers
(and that's exactly what is done in the zerocopy patches), but that again leds
to inefficient buffer passing and ungeneric IO handling.

S.th. like:

struct kiovec {
struct page *   kv_page;/* physical page*/
u_short kv_offset;  /* offset into page */
u_short kv_length;  /* data length  */
};
 
enum kio_flags {
KIO_LOANED, /* the calling subsystem wants this buf back*/
KIO_GIFTED, /* thanks for the buffer, man!  */
KIO_COW /* copy on write (XXX: not yet) */
};


struct kio {
struct kiovec * kio_data;   /* our kiovecs  */
int kio_ndata;  /* # of kiovecs */
int kio_flags;  /* loaned or giftet?*/
void *  kio_priv;   /* caller private data  */
wait_queue_head_t   kio_wait;   /* wait queue   */
};

makes it a lot simpler for the subsystems to integrate.
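
As a usage sketch (illustrative only, assuming the definitions above): two
partial-page fragments, e.g. network-style buffers, described without forcing
page alignment:

static void example_build_kio(struct page *page0, struct page *page1,
			      struct kiovec vec[2], struct kio *kio)
{
	vec[0].kv_page   = page0;
	vec[0].kv_offset = 128;		/* header skipped */
	vec[0].kv_length = 1480;
	vec[1].kv_page   = page1;
	vec[1].kv_offset = 0;
	vec[1].kv_length = 512;

	kio->kio_data  = vec;
	kio->kio_ndata = 2;
	kio->kio_flags = KIO_LOANED;	/* caller wants the buffers back */
	kio->kio_priv  = NULL;
	init_waitqueue_head(&kio->kio_wait);

	/* 'kio' can now be handed to the next subsystem, which may merge
	 * or split it without touching the pages themselves. */
}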

Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-01 Thread Alan Cox

> > Linus basically designed the original kiobuf scheme of course so I guess
> > he's allowed to dislike it. Linus disliking something however doesn't mean
> it's wrong. It's not a technically valid basis for argument.
> 
Sure.  But Linus saying that he doesn't want more of that (shit, crap,
I don't remember what he said exactly) in the kernel is a very good reason
for thinking a little more about it.

No. Linus is not a God, Linus is fallible, regularly makes mistakes and
frequently opens his mouth and says stupid things when he is far too busy.

Especially if most arguments look right to one after thinking more about
> it...

I agree with the issues about networking wanting lightweight objects; I'm
unconvinced, however, that the existing setup for networking is sanely
applicable to real-world applications in other spaces.

Take video capture. I want to stream 60Mbytes/second in multi-megabyte
chunks between my capture cards and a high end raid array. The array wants
1Mbyte or larger blocks per I/O to reach 60Mbytes/second performance.

This, btw, isn't benchmark crap like most of the zero-copy networking; this is
a real-world application.

The current buffer head stuff is already heavier than the kio stuff. The
networking stuff isn't oriented to that kind of I/O and would end up
needing to do tons of extra processing.

> For disk I/O it makes the handling a little easier for the cost of the
> additional offset/length fields.

I remain to be convinced by that. However you do get 64 bytes/cacheline on
a real processor nowadays, so if you touch any of that 64-byte block the
rest is practically zero cost to fill.

Alan

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-01 Thread Stephen C. Tweedie

Hi,

On Thu, Feb 01, 2001 at 05:34:49PM +, Alan Cox wrote:
> > 
> > I don't see any real advantage for disk IO.  The real advantage is that
> > we can have a generic structure that is also usefull in e.g. networking
> > and can lead to a unified IO buffering scheme (a little like IO-Lite).
> 
> Networking wants something lighter rather than heavier. Adding tons of
> base/limit pairs to kiobufs makes it worse not better

Networking has fundamentally different requirements.  In a network
stack, you want the ability to add fragments to unaligned chunks of
data to represent headers at any point in the stack.

In the disk IO case, you basically don't get that (the only thing
which comes close is raid5 parity blocks).  The data which the user
started with is the data sent out on the wire.  You do get some
interesting cases such as soft raid and LVM, or even in the scsi stack
if you run out of mailbox space, where you need to send only a
sub-chunk of the input buffer.  

In that case, having offset/len as the kiobuf limit markers is ideal:
you can clone a kiobuf header using the same page vector as the
parent, narrow down the start/end points, and continue down the stack
without having to copy any part of the page list.  If you had the
offset/len data encoded implicitly into each entry in the sglist, you
would not be able to do that.

--Stephen

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-01 Thread Christoph Hellwig

On Thu, Feb 01, 2001 at 05:34:49PM +, Alan Cox wrote:
> > > I'm in the middle of some parts of it, and am actively soliciting
> > > feedback on what cleanups are required.  
> > 
> > The real issue is that Linus dislikes the current kiobuf scheme.
> > I do not like everything he proposes, but lots of things makes sense.
> 
> Linus basically designed the original kiobuf scheme of course so I guess
> he's allowed to dislike it. Linus disliking something however doesn't mean
> its wrong. Its not a technically valid basis for argument.

Sure.  But Linus saying that he doesn't want more of that (shit, crap,
I don't remember what he said exactly) in the kernel is a very good reason
for thinking a little more about it.

Especially if most arguments look right to one after thinking more about
it...

> Linus' list of reasons, like the amount of state, is more interesting

True.  The argument that they are too heavyweight, also.
And that they should allow scatter-gather without an array of structs.


> > > So, what are the benefits in the disk IO stack of adding length/offset
> > > pairs to each page of the kiobuf?
> > 
> > I don't see any real advantage for disk IO.  The real advantage is that
> > we can have a generic structure that is also usefull in e.g. networking
> > and can lead to a unified IO buffering scheme (a little like IO-Lite).
> 
> Networking wants something lighter rather than heavier.

Right.  That's what the new design was about, besides adding an offset and
length to every page instead of the plain page array, something the
networking people wanted in the first place.
Look at the skb_frag struct in the zero-copy patch for what networking
thinks it needs for physical-page-based buffers.
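
For reference, the fragment descriptor there is, if I remember the patch
correctly, essentially a (page, offset, length) tuple of roughly this shape
(check the actual patch for the exact definition):

#include <linux/mm.h>
#include <linux/types.h>

typedef struct skb_frag_struct skb_frag_t;

struct skb_frag_struct {
        struct page     *page;          /* physical page holding the data */
        __u16           page_offset;    /* byte offset within that page   */
        __u16           size;           /* bytes in this fragment         */
};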

> Adding tons of base/limit pairs to kiobufs makes it worse not better

From looking at the networking code and listening to Dave and Ingo, it
looks like it makes things better for networking, although I cannot verify
this due to my lack of familiarity with the networking code.

For disk I/O it makes the handling a little easier at the cost of the
additional offset/length fields.

Christoph

P.S. The tuple thing is also what Larry had in his initial slice paper.
-- 
Of course it doesn't work. We've performed a software upgrade.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-01 Thread Stephen C. Tweedie

Hi,

On Thu, Feb 01, 2001 at 06:05:15PM +0100, Christoph Hellwig wrote:
> On Thu, Feb 01, 2001 at 04:16:15PM +, Stephen C. Tweedie wrote:
> > > 
> > > No, and with the current kiobufs it would not make sense, because they
> > > are too heavy-weight.
> > 
> > Really?  In what way?  
> 
> We can't allocate a huge kiobuf structure just for requesting one page of
> IO.  It might get better with VM-level IO clustering though.

A kiobuf is *much* smaller than, say, a buffer_head, and we currently
allocate a buffer_head per block for all IO.

A kiobuf contains enough embedded page vector space for 16 pages by
default, but I'm happy enough to remove that if people want.  However,
note that that memory is not initialised, so there is no memory access
cost at all for that empty space.  Remove that space and instead of
one memory allocation per kiobuf, you get two, so the cost goes *UP*
for small IOs.
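
To spell out the allocation-count argument, compare these two simplified,
purely illustrative layouts:

#include <linux/mm.h>

#define KIO_STATIC_PAGES_SKETCH 16      /* illustrative, matching the 16 above */

/* Embedded vector: one kmalloc() covers the common, small-IO case. */
struct kiobuf_embedded_sketch {
        int             nr_pages;
        struct page     **maplist;      /* points at map_array for small IOs */
        struct page     *map_array[KIO_STATIC_PAGES_SKETCH];
};

/* No embedded vector: every kiobuf needs a second allocation for maplist. */
struct kiobuf_separate_sketch {
        int             nr_pages;
        struct page     **maplist;      /* always allocated separately */
};

For IOs of up to 16 pages the first form needs a single allocation, with
the embedded array left untouched (and hence cost-free) until it is used;
the second form always pays for two allocations.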

> > > With page,length,offset iobufs this makes sense
> > > and is IMHO the way to go.
> > 
> > What, you mean adding *extra* stuff to the heavyweight kiobuf makes it
> > lean enough to do the job??
> 
> No.  I was speaking about the light-weight kiobuf Linux & Me discussed on
> lkml some time ago (though I'd much more like to call it kiovec analogous
> to BSD iovecs).

What is so heavyweight in the current kiobuf (other than the embedded
vector, which I've already noted I'm willing to cut)?

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-01 Thread Alan Cox

> > I'm in the middle of some parts of it, and am actively soliciting
> > feedback on what cleanups are required.  
> 
> The real issue is that Linus dislikes the current kiobuf scheme.
> I do not like everything he proposes, but lots of things make sense.

Linus basically designed the original kiobuf scheme of course so I guess
he's allowed to dislike it. Linus disliking something however doesn't mean
it's wrong. It's not a technically valid basis for argument.

Linus' list of reasons, like the amount of state, is more interesting

> > So, what are the benefits in the disk IO stack of adding length/offset
> > pairs to each page of the kiobuf?
> 
> I don't see any real advantage for disk IO.  The real advantage is that
> we can have a generic structure that is also useful in e.g. networking
> and can lead to a unified IO buffering scheme (a little like IO-Lite).

Networking wants something lighter rather than heavier. Adding tons of
base/limit pairs to kiobufs makes it worse not better

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-01 Thread Steve Lord

Christoph Hellwig wrote:
> On Thu, Feb 01, 2001 at 08:14:58PM +0530, [EMAIL PROTECTED] wrote:
> > 
> > That would require the vfs interfaces themselves (address space
> > readpage/writepage ops) to take kiobufs as arguments, instead of struct
> > page *  . That's not the case right now, is it ?
> 
> No, and with the current kiobufs it would not make sense, because they
> are too heavy-weight.  With page,length,offset iobufs this makes sense
> and is IMHO the way to go.
> 
>   Christoph
> 

Enquiring minds would like to know if you are working towards this 
revamp of the kiobuf structure at the moment; you have been very quiet
recently. 

Steve


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-01 Thread Christoph Hellwig

On Thu, Feb 01, 2001 at 04:49:58PM +, Stephen C. Tweedie wrote:
> > Enquiring minds would like to know if you are working towards this 
> > revamp of the kiobuf structure at the moment, you have been very quiet
> > recently. 
> 
> I'm in the middle of some parts of it, and am actively soliciting
> feedback on what cleanups are required.  

The real issue is that Linus dislikes the current kiobuf scheme.
I do not like everything he proposes, but lots of things make sense.

> I've been merging all of the 2.2 fixes into a 2.4 kiobuf tree, and
> have started doing some of the cleanups needed --- removing the
> embedded page vector, and adding support for lightweight stacking of
> kiobufs for completion callback chains.

Ok, great.

> However, filesystem IO is almost *always* page aligned: O_DIRECT IO
> comes from VM pages, and internal filesystem IO comes from page cache
> pages.  Buffer cache IOs are the only exception, and kiobufs only fail
> for such IOs once you have multiple buffer_heads being merged into
> single requests.
> 
> So, what are the benefits in the disk IO stack of adding length/offset
> pairs to each page of the kiobuf?

I don't see any real advantage for disk IO.  The real advantage is that
we can have a generic structure that is also useful in e.g. networking
and can lead to a unified IO buffering scheme (a little like IO-Lite).

Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-01 Thread Stephen C. Tweedie

Hi,

On Thu, Feb 01, 2001 at 04:09:53PM +0100, Christoph Hellwig wrote:
> On Thu, Feb 01, 2001 at 08:14:58PM +0530, [EMAIL PROTECTED] wrote:
> > 
> > That would require the vfs interfaces themselves (address space
> > readpage/writepage ops) to take kiobufs as arguments, instead of struct
> > page *  . That's not the case right now, is it ?
> 
> No, and with the current kiobufs it would not make sense, because they
> are too heavy-weight.

Really?  In what way?  

> With page,length,offset iobufs this makes sense
> and is IMHO the way to go.

What, you mean adding *extra* stuff to the heavyweight kiobuf makes it
lean enough to do the job??

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-01 Thread Stephen C. Tweedie

Hi,

On Thu, Feb 01, 2001 at 10:08:45AM -0600, Steve Lord wrote:
> Christoph Hellwig wrote:
> > On Thu, Feb 01, 2001 at 08:14:58PM +0530, [EMAIL PROTECTED] wrote:
> > > 
> > > That would require the vfs interfaces themselves (address space
> > > readpage/writepage ops) to take kiobufs as arguments, instead of struct
> > > page *  . That's not the case right now, is it ?
> > 
> > No, and with the current kiobufs it would not make sense, because they
> > are too heavy-weight.  With page,length,offset iobufs this makes sense
> > and is IMHO the way to go.
> 
> Enquiring minds would like to know if you are working towards this 
> revamp of the kiobuf structure at the moment, you have been very quiet
> recently. 

I'm in the middle of some parts of it, and am actively soliciting
feedback on what cleanups are required.  

I've been merging all of the 2.2 fixes into a 2.4 kiobuf tree, and
have started doing some of the cleanups needed --- removing the
embedded page vector, and adding support for lightweight stacking of
kiobufs for completion callback chains.

However, filesystem IO is almost *always* page aligned: O_DIRECT IO
comes from VM pages, and internal filesystem IO comes from page cache
pages.  Buffer cache IOs are the only exception, and kiobufs only fail
for such IOs once you have multiple buffer_heads being merged into
single requests.

So, what are the benefits in the disk IO stack of adding length/offset
pairs to each page of the kiobuf?  Basically, the only advantage right
now is that it would allow us to merge requests together without
having to chain separate kiobufs.  However, chaining kiobufs in this
case is actually much better than merging them if the original IOs
came in as kiobufs: merging kiobufs requires us to reallocate a new,
longer (page/offset/len) vector, whereas chaining kiobufs is just a
list operation.
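
A sketch of the contrast (types and names invented for illustration):
chaining is a constant-time list operation, whereas merging means building
and filling a fresh, longer vector:

#include <linux/list.h>
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/string.h>

struct kiobuf_sketch {
        struct list_head list;          /* links kiobufs making up one request */
        int              nr_pages;
        struct page      **maplist;
};

/* Chaining: just link the new kiobuf onto the request's chain. */
static inline void kiobuf_chain(struct list_head *request,
                                struct kiobuf_sketch *iobuf)
{
        list_add_tail(&iobuf->list, request);
}

/* Merging, for contrast: allocate a longer page vector and copy both. */
static struct page **kiobuf_merge_maps(struct kiobuf_sketch *a,
                                       struct kiobuf_sketch *b)
{
        int n = a->nr_pages + b->nr_pages;
        struct page **map = kmalloc(n * sizeof(*map), GFP_KERNEL);

        if (!map)
                return NULL;
        memcpy(map, a->maplist, a->nr_pages * sizeof(*map));
        memcpy(map + a->nr_pages, b->maplist, b->nr_pages * sizeof(*map));
        return map;
}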

Having true scatter-gather lists in the kiobuf would let us represent
arbitrary lists of buffer_heads as a single kiobuf, though, and that
_is_ a big win if we can avoid using buffer_heads below the
ll_rw_block layer at all.  (It's not clear that this is really
possible, though, since we still need to propagate completion
information back up into each individual buffer head's status and wait
queue.)
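
As a sketch of what that propagation might involve (the aggregate structure
is hypothetical; only the b_end_io callback is taken from the existing
buffer_head):

#include <linux/fs.h>

/* Hypothetical: a kiobuf that aggregates several buffer_heads still has
 * to service each one's status and wait queue on completion. */
struct kiobuf_bh_sketch {
        int                 nr_bh;
        struct buffer_head  **bh;       /* the buffer_heads covered */
};

static void kiobuf_end_bh_io(struct kiobuf_bh_sketch *iobuf, int uptodate)
{
        int i;

        for (i = 0; i < iobuf->nr_bh; i++) {
                struct buffer_head *bh = iobuf->bh[i];

                /* lets each bh update its state and wake its waiters */
                bh->b_end_io(bh, uptodate);
        }
}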

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-01 Thread Christoph Hellwig

On Thu, Feb 01, 2001 at 06:05:15PM +0100, Christoph Hellwig wrote:
> > What, you mean adding *extra* stuff to the heavyweight kiobuf makes it
> > lean enough to do the job??
> 
> No.  I was speaking abou the light-weight kiobuf Linux & Me discussed on
   ^ Linus ...
> lkml some time ago (though I'd much more like to call it kiovec analogous
> to BSD iovecs).
> 
> And a page,offset,length tuple is pretty cheap compared to a current kiobuf.

Christoph (slapping himself for the stupid typo and selfreply ...)

-- 
Of course it doesn't work. We've performed a software upgrade.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-01 Thread Christoph Hellwig

On Thu, Feb 01, 2001 at 04:16:15PM +, Stephen C. Tweedie wrote:
> Hi,
> 
> On Thu, Feb 01, 2001 at 04:09:53PM +0100, Christoph Hellwig wrote:
> > On Thu, Feb 01, 2001 at 08:14:58PM +0530, [EMAIL PROTECTED] wrote:
> > > 
> > > That would require the vfs interfaces themselves (address space
> > > readpage/writepage ops) to take kiobufs as arguments, instead of struct
> > > page *  . That's not the case right now, is it ?
> > 
> > No, and with the current kiobufs it would not make sense, because they
> > are too heavy-weight.
> 
> Really?  In what way?  

We can't allocate a huge kiobuf structure just for requesting one page of
IO.  It might get better with VM-level IO clustering though.

> 
> > With page,length,offset iobufs this makes sense
> > and is IMHO the way to go.
> 
> What, you mean adding *extra* stuff to the heavyweight kiobuf makes it
> lean enough to do the job??

No.  I was speaking about the light-weight kiobuf Linux & Me discussed on
lkml some time ago (though I'd much more like to call it kiovec analogous
to BSD iovecs).

And a page,offset,length tuple is pretty cheap compared to a current kiobuf.

Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-01 Thread Christoph Hellwig

On Thu, Feb 01, 2001 at 08:14:58PM +0530, [EMAIL PROTECTED] wrote:
> 
> >Hi,
> >
> >On Thu, Feb 01, 2001 at 10:25:22AM +0530, [EMAIL PROTECTED] wrote:
> >>
> >> >We _do_ need the ability to stack completion events, but as far as the
> >> >kiobuf work goes, my current thoughts are to do that by stacking
> >> >lightweight "clone" kiobufs.
> >>
> >> Would that work with stackable filesystems ?
> >
> >Only if the filesystems were using VFS interfaces which used kiobufs.
> >Right now, the only filesystem using kiobufs is XFS, and it only
> >passes them down to the block device layer, not to other filesystems.
> 
> That would require the vfs interfaces themselves (address space
> readpage/writepage ops) to take kiobufs as arguments, instead of struct
> page *  . That's not the case right now, is it ?

No, and with the current kiobufs it would not make sense, because they
are too heavy-weight.  With page,length,offset iobufs this makes sense
and is IMHO the way to go.

Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-01 Thread bsuparna


>Hi,
>
>On Thu, Feb 01, 2001 at 10:25:22AM +0530, [EMAIL PROTECTED] wrote:
>>
>> >We _do_ need the ability to stack completion events, but as far as the
>> >kiobuf work goes, my current thoughts are to do that by stacking
>> >lightweight "clone" kiobufs.
>>
>> Would that work with stackable filesystems ?
>
>Only if the filesystems were using VFS interfaces which used kiobufs.
>Right now, the only filesystem using kiobufs is XFS, and it only
>passes them down to the block device layer, not to other filesystems.

That would require the vfs interfaces themselves (address space
readpage/writepage ops) to take kiobufs as arguments, instead of struct
page *  . That's not the case right now, is it ?
A filter filesystem would be layered over XFS, to take this example.
So right now a filter filesystem only sees the struct page * and passes
it along; any completion event stacking has to be applied with reference
to that.


>> Being able to track the children of a kiobuf would help with I/O
>> cancellation (e.g. to pull sub-ios off their request queues if I/O
>> cancellation for the parent kiobuf was issued). Not essential, I guess,
>> in general, but useful in some situations.
>
>What exactly is the justification for IO cancellation?  It really
>upsets the normal flow of control through the IO stack to have
>voluntary cancellation semantics.

One reason that I saw is that if the results of an I/O are no longer
required due to some condition (e.g. aio cancellation, or the process that
issued the I/O getting killed), then cancellation avoids the unnecessary
disk I/O, provided the request hasn't been scheduled yet.

Too remote a requirement?  If the capability/support doesn't exist at the
driver level, I guess it's difficult.
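
Purely as a hypothetical sketch of that idea (none of these names exist in
the kernel; it assumes a per-request flag the submitter can set before the
request is handed to the driver):

/* Hypothetical: skip a queued request whose result is no longer wanted,
 * provided the driver has not started on it yet. */
struct queued_io_sketch {
        int     cancelled;      /* set on aio cancel or process exit */
        int     dispatched;     /* set once the driver owns the request */
};

static int io_try_cancel(struct queued_io_sketch *req)
{
        if (req->dispatched)
                return 0;       /* too late, the hardware may be busy on it */
        if (req->cancelled)
                return 1;       /* complete with an error, no disk I/O done */
        return 0;
}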

--Stephen

___
Kiobuf-io-devel mailing list
[EMAIL PROTECTED]
http://lists.sourceforge.net/lists/listinfo/kiobuf-io-devel



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-01 Thread bsuparna


sct wrote:
>> >
>> > Thanks for mentioning this. I didn't know about it earlier. I've been
>> > going through the 4/00 kqueue patch on freebsd ...
>>
>> Linus has already denounced them as massively over-engineered...
>
>That shouldn't stop anyone from looking at them and learning, though.
>There might be a good idea or two hiding in there somewhere.
>- Dan
>

There is always scope to learn from a different approach to a problem of a
similar nature - both from good ideas and from over-engineered ones -
sometimes more from the latter :-)

As far as I have understood so far from looking at the original kevent
patch and notes (which perhaps isn't enough, and maybe out of date as well),
the concept of knotes and filter ops, and the event queuing mechanism in
itself, is interesting and generic, but most of it seems to have been
designed with linkage to user-mode-issuable event waits in mind - like
poll/select/aio/signal etc., at least as it appears from the way it's been
used in the kernel.  That's a little different from what I had in mind,
though it's perhaps possible to use it otherwise.  But maybe I've just not
thought about it enough or understood it.

Regards
Suparna

  Suparna Bhattacharya
  Systems Software Group, IBM Global Services, India
  E-mail : [EMAIL PROTECTED]
  Phone : 91-80-5267117, Extn : 2525


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-01 Thread Stephen C. Tweedie

Hi,

On Thu, Feb 01, 2001 at 01:28:33PM +0530, [EMAIL PROTECTED] wrote:
> 
> Here's a second pass attempt, based on Ben's wait queue extensions:
> Does this sound any better ?

It's a mechanism, all right, but you haven't described what problems
it is trying to solve, and where it is likely to be used, so it's hard
to judge it. :)

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/


