Re: [9fans] zero copy & 9p (was Re: PDP11 (Was: Re: what heavy negativity!)

2018-10-17 Thread Charles Forsyth
> I'll see if I wrote up some of it. I think there were manual pages for the
> Messages replacing Blocks.

Here are the three manual pages: https://goo.gl/Qykprf
It's not obvious from them, but internally a Fragment can represent a slice
of a Segment*
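
As a rough picture of that (the names and fields below are guesses for
illustration, not the declarations from those pages), a Fragment would just
record which refcounted Segment it slices, and where:

	/* illustrative sketch only; not the real declarations */
	#include <stddef.h>

	typedef struct Segment Segment;
	typedef struct Fragment Fragment;

	struct Segment {
		unsigned char *base;	/* the immutable underlying data */
		size_t	len;
		int	ref;		/* freed when the last fragment lets go */
	};

	struct Fragment {
		Segment	*seg;		/* segment this fragment slices */
		unsigned char *p;	/* start of the slice inside seg->base */
		size_t	n;		/* length of the slice */
	};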


Re: [9fans] zero copy & 9p (was Re: PDP11 (Was: Re: what heavy negativity!)

2018-10-15 Thread erik quanstrom
> It's useful internally in protocol implementation, specifically to avoid
> copying in transport protocols (for later retransmission), and the
> modifications aren't vast.
> A few changes were trickier, often because of small bugs in the original
> code. icmp does some odd things i think.

that makes sense.  likewise, if it were essentially free to add file systems
in the i/o path from user space, one could build micro file systems that took
care of small details without incurring much cost.  ramfs is enough of a file
system if you have other programs to do other things like dump.

> I'll see if I wrote up some of it. I think there were manual pages for the
> Messages replacing Blocks.

that would be great.  thanks.


> My mcs lock implementation was probably more useful, and I use that in my
> copy of the kernel known as 9k

indeed.  i've seen great performance with mcs in my kernel.

- erik



Re: [9fans] zero copy & 9p (was Re: PDP11 (Was: Re: what heavy negativity!)

2018-10-15 Thread Charles Forsyth
They are machines designed to run programs most people do not write!

On Mon, 15 Oct 2018 at 19:20, hiro <23h...@gmail.com> wrote:

> > Also, NUMA effects are more important in practice on big multicores. Some
> > of the off-chip delays are brutal.
>
> yeah, we've been talking about this on #cat-v. even inside one CPU
> package amd puts multiple dies nowadays, and the cross-die cpu cache
> access delays are approaching the same order of magnitude as memory access!
>
> also on each die, they have what they call ccx (cpu complex),
> groupings of 4 cores, which are connected much faster internally than
> to the other ccx.
>
>


Re: [9fans] zero copy & 9p (was Re: PDP11 (Was: Re: what heavy negativity!)

2018-10-15 Thread hiro
> Also, NUMA effects are more important in practice on big multicores. Some
> of the off-chip delays are brutal.

yeah, we've been talking about this on #cat-v. even inside one CPU
package amd puts multiple dies nowadays, and the cross-die cpu cache
access delays are approaching the same order of magnitude as memory access!

also on each die, they have what they call ccx (cpu complex),
groupings of 4 cores, which are connected much faster internally than
to the other ccx.



Re: [9fans] zero copy & 9p (was Re: PDP11 (Was: Re: what heavy negativity!)

2018-10-15 Thread hiro
> Btw, "zero copy" isn't the right term and I preferred another term that I've 
> now forgotten. Minimal copying, perhaps.

I like that, "zero-copy" makes me imply other linux-specifics, and
those are hard to not get emotional about.



Re: [9fans] zero copy & 9p (was Re: PDP11 (Was: Re: what heavy negativity!)

2018-10-15 Thread Charles Forsyth
It's useful internally in protocol implementation, specifically to avoid
copying in transport protocols (for later retransmission), and the
modifications aren't vast.
A few changes were trickier, often because of small bugs in the original
code. icmp does some odd things i think.

Btw, "zero copy" isn't the right term and I preferred another term that
I've now forgotten. Minimal copying, perhaps.
For one thing, messages can eventually end up being copied to contiguous
blocks for devices without decent scatter-gather DMA.

Messages are a tuple (mutable header stack, immutable slices of immutable
data).
Originally the data was organised as a tree, but nemo suggested using just
an array, so I changed it.
It's important that it's (logically) immutable. Headers are pushed onto and
popped from the header stack, and the current stack top is mutable.
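
A sketch of that shape (the names below are invented, not taken from the
actual source): the header stack grows downward in a small private buffer,
while the body is an array of slices referring to immutable, shared storage.

	/* illustrative sketch of a Message as (mutable header stack,
	 * immutable data slices); not the real structure */
	#include <stddef.h>

	enum { HDRSPACE = 128, NFRAG = 16 };

	typedef struct {
		unsigned char *data;	/* points into immutable, refcounted storage */
		size_t	len;
	} Slice;

	typedef struct {
		unsigned char hdr[HDRSPACE];	/* header stack storage */
		unsigned char *hp;		/* current stack top */
		Slice	frag[NFRAG];		/* body: an array, not a tree */
		int	nfrag;
	} Msg;

	void
	msginit(Msg *m)
	{
		m->hp = m->hdr + HDRSPACE;	/* empty header stack */
		m->nfrag = 0;
	}

	/* push room for an n-byte header; only this new top gets written */
	unsigned char*
	pushhdr(Msg *m, size_t n)
	{
		if((size_t)(m->hp - m->hdr) < n)
			return NULL;		/* no room left */
		m->hp -= n;
		return m->hp;
	}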

There were new readmsg and writemsg system calls to carry message
structures between kernel and user level.
The message was immutable on writemsg. Between processes in the same
program, message transfers could be done by exchanging pointers into a
shared region.
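
The real interface is whatever the manual pages linked earlier in the thread
say; purely as a hypothetical shape, reusing the Msg sketch above, the calls
might have looked something like:

	/* hypothetical prototypes only; the actual signatures are not shown here */
	long	readmsg(int fd, Msg *m);	/* receive: slices refer to shared, immutable data */
	long	writemsg(int fd, Msg *m);	/* send: the message must not be modified afterwards */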

I'll see if I wrote up some of it. I think there were manual pages for the
Messages replacing Blocks.

My mcs lock implementation was probably more useful, and I use that in my
copy of the kernel known as 9k
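
For anyone who hasn't met it, the point of an MCS lock is that each waiter
spins on its own queue node rather than on one shared word, so contention
doesn't bounce a single cache line between cores. A generic C11 sketch of the
textbook algorithm (not the 9k code):

	/* generic MCS queue lock sketch, C11 atomics */
	#include <stdatomic.h>
	#include <stdbool.h>
	#include <stddef.h>

	typedef struct MCSNode MCSNode;
	struct MCSNode {
		_Atomic(MCSNode*) next;
		atomic_bool locked;
	};

	typedef struct {
		_Atomic(MCSNode*) tail;		/* last waiter in the queue, or NULL */
	} MCSLock;

	void
	mcslock(MCSLock *l, MCSNode *n)
	{
		MCSNode *prev;

		atomic_store(&n->next, (MCSNode*)NULL);
		atomic_store(&n->locked, true);
		prev = atomic_exchange(&l->tail, n);	/* join the queue */
		if(prev != NULL){
			atomic_store(&prev->next, n);	/* link behind the previous waiter */
			while(atomic_load(&n->locked))
				;			/* spin on our own node only */
		}
	}

	void
	mcsunlock(MCSLock *l, MCSNode *n)
	{
		MCSNode *succ = atomic_load(&n->next);

		if(succ == NULL){
			MCSNode *expect = n;
			/* no successor visible: try to mark the lock free */
			if(atomic_compare_exchange_strong(&l->tail, &expect, (MCSNode*)NULL))
				return;
			/* a successor is between its exchange and its link; wait */
			while((succ = atomic_load(&n->next)) == NULL)
				;
		}
		atomic_store(&succ->locked, false);	/* pass ownership */
	}

Each caller supplies its own MCSNode (per-CPU or on the stack) to both calls.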

Also, NUMA effects are more important in practice on big multicores. Some
of the off-chip delays are brutal.

On Sun, 14 Oct 2018 at 09:50, hiro <23h...@gmail.com> wrote:

> thanks, this will allow us to know where to look more closely.


Re: [9fans] zero copy & 9p (was Re: PDP11 (Was: Re: what heavy negativity!)

2018-10-14 Thread hiro
thanks, this will allow us to know where to look more closely.

On 10/14/18, Francisco J Ballesteros  wrote:
> Pure "producer/cosumer" stuff, like sending things through a pipe as long as
> the source didn't need to touch the data ever more.
> Regarding bugs, I meant "producing bugs" not "fixing bugs", btw.



Re: [9fans] zero copy & 9p (was Re: PDP11 (Was: Re: what heavy negativity!)

2018-10-14 Thread Francisco J Ballesteros
Pure "producer/cosumer" stuff, like sending things through a pipe as long as 
the source didn't need to touch the data ever more.
Regarding bugs, I meant "producing bugs" not "fixing bugs", btw.

> On 14 Oct 2018, at 09:34, hiro <23h...@gmail.com> wrote:
> 
> well, finding bugs is always good :)
> but since i got curious could you also tell which things exactly got
> much faster, so that we know what might be possible?




Re: [9fans] zero copy & 9p (was Re: PDP11 (Was: Re: what heavy negativity!)

2018-10-14 Thread hiro
well, finding bugs is always good :)
but since i got curious could you also tell which things exactly got
much faster, so that we know what might be possible?

On 10/14/18, FJ Ballesteros  wrote:
> yes. bugs, on my side at least.
> The copy isolates from others.
> But some experiments in nix and in a thing I wrote for leanxcale show that
> some things can be much faster.
> It’s fun either way.



Re: [9fans] zero copy & 9p (was Re: PDP11 (Was: Re: what heavy negativity!)

2018-10-13 Thread FJ Ballesteros
yes. bugs, on my side at least. 
The copy isolates from others. 
But some experiments in nix and in a thing I wrote for leanxcale show that some 
things can be much faster. 
It’s fun either way. 

> On 13 Oct 2018, at 23:11, hiro <23h...@gmail.com> wrote:
> 
> and, did it improve anything noticeably?




Re: [9fans] zero copy & 9p (was Re: PDP11 (Was: Re: what heavy negativity!)

2018-10-13 Thread hiro
and, did it improve anything noticeably?

On 10/13/18, Charles Forsyth  wrote:
> I did several versions of one part of zero copy, inspired by several things
> in x-kernel, replacing Blocks by another structure throughout the network
> stacks and kernel, then made messages visible to user level. Nemo did
> another part, on his way to Clive



Re: [9fans] zero copy & 9p (was Re: PDP11 (Was: Re: what heavy negativity!)

2018-10-13 Thread Charles Forsyth
I did several versions of one part of zero copy, inspired by several things
in x-kernel, replacing Blocks by another structure throughout the network
stacks and kernel, then made messages visible to user level. Nemo did
another part, on his way to Clive

On Fri, 12 Oct 2018, 07:05 Ori Bernstein,  wrote:

>
> I don't believe anyone has done the work yet. I'd be interested
> to see what you come up with.
>
>
> --
> Ori Bernstein
>
>


Re: [9fans] zero copy & 9p (was Re: PDP11 (Was: Re: what heavy negativity!)

2018-10-12 Thread Ori Bernstein
On Thu, 11 Oct 2018 13:43:00 -0700, Lyndon Nerenberg  wrote:

> Another case to ponder ...   We're handling the incoming I/Q data
> stream, but need to fan that out to many downstream consumers.  If
> we already read the data into a page, then flip it to the first
> consumer, is there a benefit to adding a reference counter to that
> read-only page and leaving the page live until the counter expires?
> 
> Hiro clamours for benchmarks.  I agree.  Some basic searches I've
> done don't show anyone trying this out with P9 (and publishing
> their results).  Anybody have hints/references to prior work?
> 
> --lyndon
> 

I don't believe anyone has done the work yet. I'd be interested
to see what you come up with.


-- 
Ori Bernstein



Re: [9fans] zero copy & 9p (was Re: PDP11 (Was: Re: what heavy negativity!)

2018-10-11 Thread hiro
i'm not even saying you should measure a lot. just trying to get you
to verify my point that this is not your bottleneck: check whether you
already hit a cpu limit with that single processing stage (my guess
was FFT).

the reason i think my guess is right is because of experience with
the low bandwidth of the SEQUENTIAL data you're claiming could create
problems.

in contrast i'm happy stallione at least brought up something more
demanding earlier, like finding true limits during small block-size
random access.

On 10/11/18, Lyndon Nerenberg  wrote:
> Another case to ponder ...   We're handling the incoming I/Q data
> stream, but need to fan that out to many downstream consumers.  If
> we already read the data into a page, then flip it to the first
> consumer, is there a benefit to adding a reference counter to that
> read-only page and leaving the page live until the counter expires?
>
> Hiro clamours for benchmarks.  I agree.  Some basic searches I've
> done don't show anyone trying this out with P9 (and publishing
> their results).  Anybody have hints/references to prior work?
>
> --lyndon
>
>



Re: [9fans] zero copy & 9p (was Re: PDP11 (Was: Re: what heavy negativity!)

2018-10-11 Thread Lyndon Nerenberg
Another case to ponder ...   We're handling the incoming I/Q data
stream, but need to fan that out to many downstream consumers.  If
we already read the data into a page, then flip it to the first
consumer, is there a benefit to adding a reference counter to that
read-only page and leaving the page live until the counter expires?
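
The bookkeeping itself is small either way; here is a user-level sketch of
just the refcounting part, with every name below invented and nothing Plan 9
specific about it:

	/* a read-only page fanned out to several consumers and freed when
	 * the last reference is dropped; illustration only */
	#include <stdatomic.h>
	#include <stdlib.h>
	#include <string.h>

	enum { PAGESIZE = 4096 };

	typedef struct {
		atomic_int ref;			/* one count per consumer still holding the page */
		size_t	len;
		unsigned char data[PAGESIZE];	/* filled once, then treated as read-only */
	} Page;

	Page*
	pagenew(const void *src, size_t n)	/* the single copy, at ingest time */
	{
		Page *p;

		if(n > PAGESIZE)
			return NULL;
		p = malloc(sizeof *p);
		if(p == NULL)
			return NULL;
		memcpy(p->data, src, n);
		p->len = n;
		atomic_init(&p->ref, 1);
		return p;
	}

	void
	pagehold(Page *p)			/* one extra reference per additional consumer */
	{
		atomic_fetch_add(&p->ref, 1);
	}

	void
	pageput(Page *p)			/* consumer finished; the last one frees the page */
	{
		if(atomic_fetch_sub(&p->ref, 1) == 1)
			free(p);
	}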

Hiro clamours for benchmarks.  I agree.  Some basic searches I've
done don't show anyone trying this out with P9 (and publishing
their results).  Anybody have hints/references to prior work?

--lyndon



Re: [9fans] zero copy & 9p (was Re: PDP11 (Was: Re: what heavy negativity!)

2018-10-10 Thread Steven Stallion
On Wed, Oct 10, 2018 at 9:32 PM Bakul Shah  wrote:
> Steve wrote "1:1 mapping of the virtual kernel address space such
> that something like zero-copy could be possible"
>
> Not sure what he meant. For zero copy you need to *directly*
> write to the memory allocated to a process. 1:1 mapping is
> really not needed.

Ugh. I could have worded that better. That was a (very) clumsy attempt
at stating that the kernel would have to support remapping the user
buffer to virtual kernel space. Fortunately Plan 9 doesn't page out
kernel memory, so pinning wouldn't be required.

Cheers,
Steve



Re: [9fans] zero copy & 9p (was Re: PDP11 (Was: Re: what heavy negativity!)

2018-10-10 Thread Bakul Shah
On Wed, 10 Oct 2018 20:56:20 -0400 Dan Cross  wrote:
>
> On Wed, Oct 10, 2018 at 7:58 PM  wrote:
>
> > > Fundamentally zero-copy requires that the kernel and user process
> > > share the same virtual address space mapped for the given operation.
> >
> > and it is. this doesn't make your point clear. the kernel is always mapped.
> >
>
> Meltdown has shown this to be a bad idea.

People still do this.

> > (you meant 1:1 identity mapping *PHYSICAL* pages to make the lookup cheap?)

Steve wrote "1:1 mapping of the virtual kernel address space such
that something like zero-copy could be possible"

Not sure what he meant. For zero copy you need to *directly*
write to the memory allocated to a process. 1:1 mapping is
really not needed.

> plan9 doesn't use an identity mapping; it uses an offset mapping for most
> of the address space and on 64-bit systems a separate mapping for the
> kernel. An identity mapping from P to V is a function f such that f(a) = a.
> But on 32-bit plan9, VADDR(p) = p + KZERO and PADDR(v) = v - KZERO. On
> 64-bit plan9 systems it's a little more complex because of the two
> mappings, which vary between sub-projects: 9front appears to map the kernel
> into the top 2 gigs of the address space which means that, on large
> machines, the entire physical address space can't fit into the kernel.  Of
> course in such situations one maps the top part of the canonical address
> space for the exclusive use of supervisor code, so in that way it's a
> distinction without a difference.
>
> Of course, there are tricks to make lookups of arbitrary addresses
> relatively cheap by using the MMU hardware and dedicating part of the
> address space to a recursive self-map. That is, if you don't want to walk
> page tables yourself, or keep a more elaborate data structure to describe
> the address space.
>
> > the difference is that *USER* pages are (unless you use special segments)
> > scattered randomly in physical memory or not even realized and you need
> > to lookup the pages in the virtual page table to get to the physical
> > addresses needed to hand them to the hardware for DMA.

If you don't copy, you do need to find all the physical pages.
This is not really expensive and many OSes do precisely this.

If you copy, you can avoid walking the page table. But for
that to work, the kernel virtual space needs to be mapped 1:1 in
*every* process -- this is because any cached data will be in
kernel space and must be available in all processes.

In fact the *main* reason this was done was to facilitate such
copying. Had we always done zero-copy, we could've avoided
Meltdown altogether. copyin/copyout of syscall arguments
shouldn't be expensive.

> So...walking page tables is hard? Ok
>
> > now the *INTERESTING* thing is what happens to the original virtual
> > address space that covered the I/O when someone touches into it while
> > the I/O is in flight. so do we cut it out of the TLB's of ALL processes
> > *SHARING* the segment? and then have the pagefault handler wait until
> > the I/O is finished?

In general, the way this works is a bit different. In an
mmap() scenario, the initial mapping simply allocates the
necessary PTEs and marks them so that *any* read/write access
will incur a page fault.  At this time if the underlying page
is found to be cached, it is linked to the PTE and the
relevant access bit changed to allow the access. If not, the
process has to wait until the page is read in, at which time
it is linked with the relevant PTE(s). Even if the same file
page is mapped in N processes, the same thing happens. The
kernel does have to do some bookkeeping as the same file data
may be referenced from multiple places.
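
A toy model of that flow, with every name below invented purely for
illustration: the "fault handler" either finds the file page already cached
or reads it in, then links the same cached page into whichever address space
faulted.

	/* toy model of demand paging from a shared page cache */
	#include <stdio.h>
	#include <string.h>

	enum { NPAGES = 8, PGSZ = 4096 };

	typedef struct {
		int	present;		/* a cached copy of this file page exists */
		char	data[PGSZ];
	} CachePage;

	typedef struct {
		CachePage *pte[NPAGES];		/* "PTE": NULL means any access faults */
	} AddressSpace;

	static CachePage filecache[NPAGES];	/* shared page cache, one slot per file page */

	static void
	readpagein(int pageno)			/* stand-in for the real, sleeping file read */
	{
		memset(filecache[pageno].data, 'A' + pageno, PGSZ);
		filecache[pageno].present = 1;
	}

	/* "fault handler": link the PTE to the cached page, reading it in if needed */
	static void
	fault(AddressSpace *as, int pageno)
	{
		if(!filecache[pageno].present)
			readpagein(pageno);	/* a real kernel would sleep the process here */
		as->pte[pageno] = &filecache[pageno];
	}

	int
	main(void)
	{
		AddressSpace a = {{0}}, b = {{0}};

		fault(&a, 3);			/* first access: page read in, then linked */
		fault(&b, 3);			/* second process: already cached, just linked */
		printf("same page shared: %d\n", a.pte[3] == b.pte[3]);
		return 0;
	}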

> You seem to be mixing multiple things here. The physical page has to be
> pinned while the DMA operation is active (unless it can be reliably
> canceled). This can be done any number of ways; but so what? It's not new
> and it's not black magic. Who cares about the virtual address space? If
> some other processor (nb, not process -- processes don't have TLB entries,
> processors do) might have a TLB entry for that mapping that you just
> changed you need to shoot it down anyway: what's that have to do with
> making things wait for page faulting?

Indeed.

> The simplicity of the current scheme comes from the fact that the kernel
> portion of the address *space* is effectively immutable once the kernel
> gets going. That's easy, but it's not particularly flexible and other
> systems do things differently (not just Linux and its ilk). I'm not saying
> you *should* do it in plan9, but it's not like it hasn't been done
> elegantly before.
>
>
> > fuck your go routines... he wants the D.

What?!

> > > This can't always be done and the kernel will be forced to perform a
> > > copy anyway.

In general this is wrong. None of this is new. By decades.
Theoretically even regular read/write can use mapping behind
the scenes. [Save the old V->P map to deal with any IO error,
remove the same 

Re: [9fans] zero copy & 9p (was Re: PDP11 (Was: Re: what heavy negativity!)

2018-10-10 Thread Steven Stallion
On Wed, Oct 10, 2018 at 8:20 PM Dan Cross  wrote:
>> don't forget the code complexity with dealing with these scattered
>> pages in the *DRIVERS*.
>
> It's really not that hard. The way Linux does it is pretty bad, but it's not 
> like that's the only way to do it.

SunOS and Win32 (believe it or not) managed to get this "right";
dealing with zero-copy in those kernels was a non-event. I'm not sure
I understand the assertion how this would affect constituent drivers.
This sort of detail is handled at a higher level - the driver
generally operates on a buffer that gets jammed into a ring for DMA
transfer. Apart from grabbing the physical address, the worst you may
have to do is pin/unpin the block for the DMA operation. From the
driver's perspective, it's memory. It doesn't matter where it came
from (or who owns it for that matter).
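
Roughly the shape of that, with every name below invented for illustration:
the ring code only ever sees a physical address and a length, wherever the
buffer came from.

	/* illustrative transmit descriptor ring */
	#include <stdint.h>

	enum { NDESC = 256, OWN = 1 };		/* OWN: "device owns this descriptor" */

	typedef struct {
		uint64_t pa;			/* physical address handed to the device */
		uint32_t len;
		uint32_t flags;
	} Desc;

	typedef struct {
		Desc	ring[NDESC];
		unsigned head;			/* next slot the driver fills */
	} TxRing;

	/* queue one already-pinned buffer; pinning and the physical-address
	 * lookup happen above this layer, the ring just records the result */
	int
	txqueue(TxRing *r, uint64_t pa, uint32_t len)
	{
		Desc *d = &r->ring[r->head % NDESC];

		if(d->flags & OWN)
			return -1;		/* ring full: the device still owns this slot */
		d->pa = pa;
		d->len = len;
		d->flags = OWN;			/* hand the slot to the hardware */
		r->head++;
		return 0;
	}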

> Or don't.

There's a lot to be said for keeping it simple...



Re: [9fans] zero copy & 9p (was Re: PDP11 (Was: Re: what heavy negativity!)

2018-10-10 Thread Dan Cross
On Wed, Oct 10, 2018 at 7:58 PM  wrote:

> > Fundamentally zero-copy requires that the kernel and user process
> > share the same virtual address space mapped for the given operation.
>
> and it is. this doesn't make your point clear. the kernel is always mapped.
>

Meltdown has shown this to be a bad idea.

> (you meant 1:1 identity mapping *PHYSICAL* pages to make the lookup cheap?)
>

plan9 doesn't use an identity mapping; it uses an offset mapping for most
of the address space and on 64-bit systems a separate mapping for the
kernel. An identity mapping from P to V is a function f such that f(a) = a.
But on 32-bit plan9, VADDR(p) = p + KZERO and PADDR(v) = v - KZERO. On
64-bit plan9 systems it's a little more complex because of the two
mappings, which vary between sub-projects: 9front appears to map the kernel
into the top 2 gigs of the address space which means that, on large
machines, the entire physical address space can't fit into the kernel.  Of
course in such situations one maps the top part of the canonical address
space for the exclusive use of supervisor code, so in that way it's a
distinction without a difference.
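
Written out, keeping the VADDR/PADDR names used above (the KZERO value below
is just the usual 32-bit pc one and is only illustrative):

	/* the 32-bit offset mapping, spelled out */
	#include <stdint.h>
	#include <stdio.h>

	#define KZERO		0xF0000000u			/* base of the kernel's window onto physical memory */
	#define VADDR(p)	((uintptr_t)(p) + KZERO)	/* physical -> kernel virtual */
	#define PADDR(v)	((uintptr_t)(v) - KZERO)	/* kernel virtual -> physical */

	int
	main(void)
	{
		uintptr_t pa = 0x00123000u;

		printf("pa %#lx -> va %#lx -> pa %#lx\n",
			(unsigned long)pa,
			(unsigned long)VADDR(pa),
			(unsigned long)PADDR(VADDR(pa)));
		return 0;
	}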

Of course, there are tricks to make lookups of arbitrary addresses
relatively cheap by using the MMU hardware and dedicating part of the
address space to a recursive self-map. That is, if you don't want to walk
page tables yourself, or keep a more elaborate data structure to describe
the address space.

> the difference is that *USER* pages are (unless you use special segments)
> scattered randomly in physical memory or not even realized and you need
> to lookup the pages in the virtual page table to get to the physical
> addresses needed to hand them to the hardware for DMA.
>

So...walking page tables is hard? Ok

> now the *INTERESTING* thing is what happens to the original virtual
> address space that covered the I/O when someone touches into it while
> the I/O is in flight. so do we cut it out of the TLB's of ALL processes
> *SHARING* the segment? and then have the pagefault handler wait until
> the I/O is finished?


You seem to be mixing multiple things here. The physical page has to be
pinned while the DMA operation is active (unless it can be reliably
canceled). This can be done any number of ways; but so what? It's not new
and it's not black magic. Who cares about the virtual address space? If
some other processor (nb, not process -- processes don't have TLB entries,
processors do) might have a TLB entry for that mapping that you just
changed you need to shoot it down anyway: what's that have to do with
making things wait for page faulting?

The simplicity of the current scheme comes from the fact that the kernel
portion of the address *space* is effectively immutable once the kernel
gets going. That's easy, but it's not particularly flexible and other
systems do things differently (not just Linux and its ilk). I'm not saying
you *should* do it in plan9, but it's not like it hasn't been done
elegantly before.


> fuck your go routines... he wants the D.
>
> > This can't always be done and the kernel will be forced to perform a
> > copy anyway.
>
> explain *WHEN*, that would be an insight into what you're trying to
> explain.
>
> > To wit, one of the things I added to the exynos kernel
> > early on was a 1:1 mapping of the virtual kernel address space such
> > that something like zero-copy could be possible in the future (it was
> > also very convenient to limit MMU swaps on the Cortex-A15). That said,
> > the problem gets harder when you're working on something more general
> > that can handle the entire address space. In the end, you trade the
> > complexity/performance hit of MMU management versus making a copy.
>
> don't forget the code complexity with dealing with these scattered
> pages in the *DRIVERS*.
>

It's really not that hard. The way Linux does it is pretty bad, but it's
not like that's the only way to do it.

Or don't.

- Dan C.


Re: [9fans] zero copy & 9p (was Re: PDP11 (Was: Re: what heavy negativity!)

2018-10-10 Thread cinap_lenrek
> Fundamentally zero-copy requires that the kernel and user process
> share the same virtual address space mapped for the given operation.

and it is. this doesn't make your point clear. the kernel is always mapped.
(you meant 1:1 identity mapping *PHYSICAL* pages to make the lookup cheap?)

the difference is that *USER* pages are (unless you use special segments)
scattered randomly in physical memory or not even realized and you need
to lookup the pages in the virtual page table to get to the physical
addresses needed to hand them to the hardware for DMA.
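
That lookup, reduced to a sketch in generic C (va2pa() below stands in for
the real page-table walk, and every name is invented for illustration):

	/* split a user buffer into per-page chunks and translate each one
	 * for a DMA scatter list */
	#include <stdint.h>
	#include <stddef.h>

	enum { PGSZ = 4096 };

	typedef struct {
		uint64_t pa;		/* physical address of this chunk */
		size_t	len;		/* bytes in this chunk, at most one page */
	} Sgent;

	/* stand-in for walking the page table (and faulting in or pinning
	 * pages as needed); here it just returns the address unchanged */
	static uint64_t
	va2pa(uintptr_t va)
	{
		return (uint64_t)va;
	}

	int
	buildsg(void *buf, size_t n, Sgent *sg, int max)
	{
		uintptr_t va = (uintptr_t)buf;
		int i = 0;

		while(n > 0 && i < max){
			size_t off = va & (PGSZ-1);
			size_t len = PGSZ - off;	/* never cross a page boundary */

			if(len > n)
				len = n;
			sg[i].pa = va2pa(va);		/* scattered pages, scattered entries */
			sg[i].len = len;
			va += len;
			n -= len;
			i++;
		}
		return i;	/* number of scatter-gather entries built */
	}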

now the *INTERESTING* thing is what happens to the original virtual
address space that covered the I/O when someone touches into it while
the I/O is in flight. so do we cut it out of the TLB's of ALL processes
*SHARING* the segment? and then have the pagefault handler wait until
the I/O is finished? fuck your go routines... he wants the D.

> This can't always be done and the kernel will be forced to perform a
> copy anyway.

explain *WHEN*, that would be an insight into what you're trying to
explain.

> To wit, one of the things I added to the exynos kernel
> early on was a 1:1 mapping of the virtual kernel address space such
> that something like zero-copy could be possible in the future (it was
> also very convenient to limit MMU swaps on the Cortex-A15). That said,
> the problem gets harder when you're working on something more general
> that can handle the entire address space. In the end, you trade the
> complexity/performance hit of MMU management versus making a copy.

don't forget the code complexity with dealing with these scattered
pages in the *DRIVERS*.

> Believe it or not, sometimes copies can be faster, especially on
> larger NUMA systems.

--
cinap



Re: [9fans] zero copy & 9p (was Re: PDP11 (Was: Re: what heavy negativity!)

2018-10-10 Thread Steven Stallion
> On Oct 10, 2018, at 2:54 PM, Steven Stallion  wrote:
>
> You seem to be saying zero-copy wouldn't buy anything until these
> other problems are solved, right?

Fundamentally zero-copy requires that the kernel and user process
share the same virtual address space mapped for the given operation.
This can't always be done and the kernel will be forced to perform a
copy anyway. To wit, one of the things I added to the exynos kernel
early on was a 1:1 mapping of the virtual kernel address space such
that something like zero-copy could be possible in the future (it was
also very convenient to limit MMU swaps on the Cortex-A15). That said,
the problem gets harder when you're working on something more general
that can handle the entire address space. In the end, you trade the
complexity/performance hit of MMU management versus making a copy.
Believe it or not, sometimes copies can be faster, especially on
larger NUMA systems.

> Suppose you could replace 9p based FS with something of your choice.
> Would it have made your jobs easier? Code less grotty? In other
> words, is the complexity of the driver to achieve high throughput
> due to the complexity of hardware or is it due to 9p's RPC model?
> For streaming data you pretty much have to have some sort of
> windowing protocol (data prefetch or write behind with mmap is a
> similar thing).

This is one of those problems that afflicts storage more than any
other subsystem, but like most things it's a tradeoff. Having a
filesystem that doesn't support 9P doesn't seem to make much sense on
Plan 9 given the ubiquity of the protocol. Dealing with the multiple
outstanding issue does make filesystem support much more complex and
would have a far-reaching effect on existing code (not to mention the
kernel).

It's completely possible to support prefetch and/or streaming I/O
using existing kernel interfaces. cinap's comment about read not
returning until the entire buffer is read is an implementation detail
of the underlying device. A read call is free to return fewer bytes
than requested; it's not uncommon for a driver to return partial data
to favor latency over throughput. In other words, there's no magic
behind mmap - it's a convenience interface. If you look at how other
kernels tend to implement I/O, there are generally fundamental calls
to a read/write interface - there are no special provisions for
mmap beyond the syscall layer.
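
That is also why callers that genuinely need a full buffer just loop; in
plain C the whole thing is a few lines (Plan 9's readn does the same job):

	/* read until the buffer is full, end of file, or an error */
	#include <unistd.h>

	long
	readfull(int fd, void *buf, long n)
	{
		char *p = buf;
		long got = 0;

		while(got < n){
			long r = read(fd, p + got, n - got);

			if(r < 0)
				return -1;	/* error */
			if(r == 0)
				break;		/* end of file / stream closed */
			got += r;		/* partial read: keep going */
		}
		return got;
	}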

The beauty of 9P is you can wrap driver filesystems for added
functionality. Want a block caching interface? Great! Slap a kernel
device on top of a storage driver that handles caching and prefetch.
I'm sure you can see where this is going...

> Looks like people who have worked on the plan9 kernel have learned
> a lot of lessons and have a lot of good advice to offer. I'd love
> to learn from that. Except that I rarely see anyone criticizing
> plan9.

Something, something, in polite company :-)