Re: [9fans] zero copy & 9p (was Re: PDP11 (Was: Re: what heavy negativity!)

2018-10-15 Thread erik quanstrom
> It's useful internally in protocol implementation, specifically to avoid
> copying in transport protocols (for later retransmission), and the
> modifications aren't vast.
> A few changes were trickier, often because of small bugs in the original
> code. icmp does some odd things i think.

that makes sense.  likewise, if it were essentially free to add file systems
in the i/o path, from user space, one could build micro file systems that
took care of small details without incurring much cost.  ramfs is enough of
a file system if you have other programs to do other things like dump.
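
for reference, such a micro file system is already only a screenful of
lib9p code from user space; this sketch (the file name and mount point are
invented for illustration) serves a single read-only file, and the question
erik raises is about the cost of stacking several of these in the i/o path,
not about the code itself:

    #include <u.h>
    #include <libc.h>
    #include <fcall.h>
    #include <thread.h>
    #include <9p.h>

    /* sketch only: a user-level "micro file system" serving one file */
    static char msg[] = "hello from a micro file system\n";

    static void
    fsread(Req *r)
    {
        readstr(r, msg);
        respond(r, nil);
    }

    static Srv fs = {
        .read = fsread,
    };

    void
    threadmain(int argc, char **argv)
    {
        USED(argc); USED(argv);
        fs.tree = alloctree(nil, nil, DMDIR|0555, nil);
        createfile(fs.tree->root, "hello", nil, 0444, nil);
        threadpostmountsrv(&fs, nil, "/mnt/micro", MREPL);
        threadexits(nil);
    }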

> I'll see if I wrote up some of it. I think there were manual pages for the
> Messages replacing Blocks.

that would be great.  thanks.


> My mcs lock implementation was probably more useful, and I use that in my
> copy of the kernel known as 9k

indeed.  i've seen great performance with mcs in my kernel.

- erik



Re: [9fans] zero copy & 9p (was Re: PDP11 (Was: Re: what heavy negativity!)

2018-10-15 Thread Charles Forsyth
They are machines designed to run programs most people do not write!

On Mon, 15 Oct 2018 at 19:20, hiro <23h...@gmail.com> wrote:

> > Also, NUMA effects are more important in practice on big multicores. Some
> > of the off-chip delays are brutal.
>
> yeah, we've been talking about this on #cat-v. even inside one CPU
> package amd puts multiple dies nowadays, and the cross-die cpu cache
> access delays are approaching the same order of magnitude as memory
> access!
>
> also on each die, they have what they call ccx (cpu complex),
> groupings of 4 cores, which are connected much faster internally than
> to the other ccx
>
>


Re: [9fans] zero copy & 9p (was Re: PDP11 (Was: Re: what heavy negativity!)

2018-10-15 Thread hiro
> Also, NUMA effects are more important in practice on big multicores. Some
> of the off-chip delays are brutal.

yeah, we've been talking about this on #cat-v. even inside one CPU
package amd puts multiple dies nowadays, and the cross-die cpu cache
access delays are approaching the same order of magnitude as memory
access!

also on each die, they have what they call ccx (cpu complex),
groupings of 4 cores, which are connected much faster internally than
to the other ccx



Re: [9fans] zero copy & 9p (was Re: PDP11 (Was: Re: what heavy negativity!)

2018-10-15 Thread hiro
> Btw, "zero copy" isn't the right term and I preferred another term that I've 
> now forgotten. Minimal copying, perhaps.

I like that. "zero-copy" makes me think of other linux-specifics, and
those are hard not to get emotional about.



Re: [9fans] zero copy & 9p (was Re: PDP11 (Was: Re: what heavy negativity!)

2018-10-15 Thread Charles Forsyth
It's useful internally in protocol implementation, specifically to avoid
copying in transport protocols (for later retransmission), and the
modifications aren't vast.
A few changes were trickier, often because of small bugs in the original
code. icmp does some odd things i think.

Btw, "zero copy" isn't the right term and I preferred another term that
I've now forgotten. Minimal copying, perhaps.
For one thing, messages can eventually end up being copied to contiguous
blocks for devices without decent scatter-gather DMA.

Messages are a tuple (mutable header stack, immutable slices of immutable
data).
Originally the data was organised as a tree, but nemo suggested using just
an array, so I changed it.
It's important that it's (logically) immutable. Headers are pushed onto and
popped from the header stack, and the current stack top is mutable.
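
Roughly, and with all the names here invented rather than taken from the
real code (it assumes the usual kernel typedefs: uchar, ulong, Ref, nil),
such a Message might look like this:

    /* sketch only; field and function names are invented */
    typedef struct Slice Slice;
    typedef struct Msg Msg;

    struct Slice {          /* immutable view of immutable, shared data */
        uchar   *base;      /* never written after creation */
        ulong   len;
        Ref     *ref;       /* underlying block freed when the last slice goes */
    };

    struct Msg {
        uchar   hdr[128];   /* header stack; headers are pushed and popped here */
        uchar   *hp;        /* stack top, starts at hdr+sizeof hdr and moves down;
                             * only the region above hp is mutable */
        Slice   *body;      /* array of slices (originally a tree) */
        int     nbody;
    };

    /* pushing a header only reserves space on the stack; the body
     * slices are never touched, so the same data can sit in a
     * retransmit queue or go to a device without being copied. */
    uchar*
    pushhdr(Msg *m, int n)
    {
        if(m->hp - m->hdr < n)
            return nil;
        m->hp -= n;
        return m->hp;
    }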

There were new readmsg and writemsg system calls to carry message
structures between kernel and user level.
The message was immutable on writemsg. Between processes in the same
program, message transfers could be done by exchanging pointers into a
shared region.
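
The last point, handing over a pointer into shared memory rather than the
bytes themselves, is the ordinary libthread pattern; a toy stand-in (Payload
here is just a placeholder for a real message, not the actual structure):

    #include <u.h>
    #include <libc.h>
    #include <thread.h>

    /* allocated once, then only the pointer crosses between procs */
    typedef struct Payload Payload;
    struct Payload {
        char    *data;
        long    len;
    };

    static Channel *mq;     /* Channel of Payload* */

    static void
    consumer(void *arg)
    {
        Payload *p;

        USED(arg);
        while((p = recvp(mq)) != nil){
            /* read-only use of the shared bytes; no copy was made */
            write(1, p->data, p->len);
            free(p->data);
            free(p);
        }
        threadexits(nil);
    }

    void
    threadmain(int argc, char **argv)
    {
        Payload *p;

        USED(argc); USED(argv);
        mq = chancreate(sizeof(Payload*), 16);
        proccreate(consumer, nil, 8192);

        p = mallocz(sizeof *p, 1);
        p->data = strdup("one allocation, zero copies\n");
        p->len = strlen(p->data);
        sendp(mq, p);       /* only the pointer is transferred */
        sendp(mq, nil);     /* tell the consumer to stop */
        threadexits(nil);
    }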

I'll see if I wrote up some of it. I think there were manual pages for the
Messages replacing Blocks.

My mcs lock implementation was probably more useful, and I use that in my
copy of the kernel known as 9k
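
For anyone who hasn't met it, the MCS lock is the Mellor-Crummey/Scott
queue lock: each waiter spins on its own node rather than on a shared word,
which is exactly what you want once off-chip traffic starts to hurt. A
generic C11 rendering of the textbook algorithm (not the 9k code):

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct MCSNode MCSNode;
    struct MCSNode {
        _Atomic(MCSNode*)   next;
        atomic_bool         locked;
    };

    typedef struct {
        _Atomic(MCSNode*)   tail;   /* last waiter in the queue, or NULL */
    } MCSLock;

    void
    mcslock(MCSLock *l, MCSNode *me)
    {
        MCSNode *prev;

        atomic_store(&me->next, NULL);
        atomic_store(&me->locked, true);
        prev = atomic_exchange(&l->tail, me);   /* join the queue */
        if(prev != NULL){
            atomic_store(&prev->next, me);
            while(atomic_load(&me->locked))
                ;   /* spin on our own node, not on a shared word */
        }
    }

    void
    mcsunlock(MCSLock *l, MCSNode *me)
    {
        MCSNode *next = atomic_load(&me->next);

        if(next == NULL){
            MCSNode *expect = me;
            /* nobody visibly queued behind us: try to empty the lock */
            if(atomic_compare_exchange_strong(&l->tail, &expect, NULL))
                return;
            /* a successor is mid-enqueue; wait for it to link itself */
            while((next = atomic_load(&me->next)) == NULL)
                ;
        }
        atomic_store(&next->locked, false);     /* hand the lock over */
    }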

Also, NUMA effects are more important in practice on big multicores. Some
of the off-chip delays are brutal.

On Sun, 14 Oct 2018 at 09:50, hiro <23h...@gmail.com> wrote:

> thanks, this will allow us to know where to look more closely.
>
> On 10/14/18, Francisco J Ballesteros  wrote:
> > Pure "producer/cosumer" stuff, like sending things through a pipe as
> long as
> > the source didn't need to touch the data ever more.
> > Regarding bugs, I meant "producing bugs" not "fixing bugs", btw.
> >
> >> On 14 Oct 2018, at 09:34, hiro <23h...@gmail.com> wrote:
> >>
> >> well, finding bugs is always good :)
> >> but since i got curious could you also tell which things exactly got
> >> much faster, so that we know what might be possible?
> >>
> >> On 10/14/18, FJ Ballesteros  wrote:
> >>> yes. bugs, on my side at least.
> >>> The copy isolates from others.
> >>> But some experiments in nix and in a thing I wrote for leanxcale show
> >>> that
> >>> some things can be much faster.
> >>> It’s fun either way.
> >>>
>  On 13 Oct 2018, at 23:11, hiro <23h...@gmail.com> wrote:
> 
>  and, did it improve anything noticeably?
> 
> > On 10/13/18, Charles Forsyth  wrote:
> > I did several versions of one part of zero copy, inspired by several
> > things
> > in x-kernel, replacing Blocks by another structure throughout the
> > network
> > stacks and kernel, then made messages visible to user level. Nemo did
> > another part, on his way to Clive
> >
> >> On Fri, 12 Oct 2018, 07:05 Ori Bernstein wrote:
> >>
> >> On Thu, 11 Oct 2018 13:43:00 -0700, Lyndon Nerenberg wrote:
> >>
> >>> Another case to ponder ...   We're handling the incoming I/Q data
> >>> stream, but need to fan that out to many downstream consumers.  If
> >>> we already read the data into a page, then flip it to the first
> >>> consumer, is there a benefit to adding a reference counter to that
> >>> read-only page and leaving the page live until the counter expires?
> >>>
> >>> Hiro clamours for benchmarks.  I agree.  Some basic searches I've
> >>> done don't show anyone trying this out with P9 (and publishing
> >>> their results).  Anybody have hints/references to prior work?
> >>>
> >>> --lyndon
> >>>
> >>
> >> I don't believe anyone has done the work yet. I'd be interested
> >> to see what you come up with.
> >>
> >>
> >> --
> >>   Ori Bernstein
> >>
> >>
> >
> 
> >>>
> >>>
> >>>
> >>
> >
> >
> >
>
>


Re: [9fans] PDP11 (Was: Re: what heavy negativity!)

2018-10-15 Thread Giacomo Tesio
On Sun, 14 Oct 2018 at 19:39, Ole-Hjalmar Kristensen wrote:
>
> OK, that makes sense. So it would not stop a client from, for example, first
> reading an index block in a B-tree, waiting for the result, and then issuing
> read operations for all the data blocks in parallel.

If the client is the kernel, that's true.
If the client is directly speaking 9P, that's true again.

But if the client is a userspace program using pread/pwrite, that
wouldn't work unless it forks a new process for each read, since the
syscall blocks.
Which is what fcp actually does:
https://github.com/brho/plan9/blob/master/sys/src/cmd/fcp.c
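
A stripped-down version of that trick (not the real fcp.c, just the shape
of it): one blocking proc per outstanding pread.

    #include <u.h>
    #include <libc.h>

    enum { Nproc = 4, Blksz = 8192 };

    char data[Nproc][Blksz];    /* shared with the children via RFMEM */

    static void
    reader(int fd, int i)
    {
        long n;

        /* pread blocks only this proc; the other reads stay in flight */
        n = pread(fd, data[i], Blksz, (vlong)i*Blksz);
        if(n < 0)
            sysfatal("pread: %r");
        exits(nil);
    }

    void
    main(int argc, char **argv)
    {
        int i, fd;

        if(argc != 2)
            sysfatal("usage: preads file");
        fd = open(argv[1], OREAD);
        if(fd < 0)
            sysfatal("open %s: %r", argv[1]);
        for(i = 0; i < Nproc; i++)
            switch(rfork(RFPROC|RFMEM)){    /* one proc per outstanding read */
            case -1:
                sysfatal("rfork: %r");
            case 0:
                reader(fd, i);
            }
        while(waitpid() >= 0)               /* wait for all the readers */
            ;
        exits(nil);
    }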


Giacomo