Re: [9fans] PDP11 (Was: Re: what heavy negativity!)

2018-10-10 Thread erik quanstrom
I have been able to copy 1 GiB/s to userspace from an nvme device.  I should think a radio should be no problem.

- Erik

Re: [9fans] PDP11 (Was: Re: what heavy negativity!)

2018-10-10 Thread Bakul Shah
On Oct 9, 2018, at 3:06 PM, erik quanstrom  wrote:
> 
> with meltdown/Spectre mitigations in place, I would like to see evidence that 
> flip is faster than copy.

If your system is well balanced, you should be able to
stream data as fast as memory allows[1]. In such a system
copying things N times will reduce throughput by a similar
factor. It may be that plan9 normally underperforms so much
that this doesn't matter.

But the reason I want this is to reduce latency to the first
access, especially for very large files. With read() I have
to wait until the read completes. With mmap() processing can
start much earlier and can be interleaved with background
data fetch or prefetch. With read() a lot more resources
are tied down. If I need random access and don't need to
read all of the data, the application has to do pread() and
pwrite() a lot, thus complicating it. With mmap() I can just
map in the whole file and excess reading (beyond what the
app needs) will not be a large fraction.
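
To make the access pattern concrete, here is a minimal POSIX
sketch (hypothetical file name; this is the mmap() model I
mean, not something plan9 offers today):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int
main(void)
{
    struct stat st;
    unsigned char *p;
    long sum;
    off_t off;
    int fd;

    fd = open("bigfile.dat", O_RDONLY);
    if(fd < 0 || fstat(fd, &st) < 0){
        perror("bigfile.dat");
        return 1;
    }
    p = mmap(0, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if(p == MAP_FAILED){
        perror("mmap");
        return 1;
    }
    /* processing starts before the rest of the file arrives;
     * touching one byte per MiB faults in only those pages */
    sum = 0;
    for(off = 0; off < st.st_size; off += 1<<20)
        sum += p[off];
    printf("%ld\n", sum);
    munmap(p, st.st_size);
    close(fd);
    return 0;
}

With read() the same sparse scan either drags in the whole
file or turns into a pile of pread() calls.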

The default assumption here seems to be that doing this
will be very complicated and be as bad as on Linux. But
Linux is not a good model of what to do and examples of what
not to do are not useful guides in system design. There are
other OSes such as the old Apollo Aegis (AKA Apollo/Domain),
KeyKOS & seL4 that avoid copying[2].

Though none of this matters right now as we don't even have
a paper design so please put down your clubs and swords :-)

[1] See: https://code.kx.com/q/cloud/aws/benchmarking/
A single q process can ingest data at 1.9GB/s from a
single drive. 16 can achieve 2.7GB/s, with theoretical
max being 2.8GB/s.

[2] Liedtke's original L4 evolved into the provably secure
seL4 and in the process became very much like KeyKOS.
Capability systems do pass around pages as protected
objects and avoid copying. Sort of like how in a program
you'd pass a huge array by reference and not by value
to a function.





Re: [9fans] PDP11 (Was: Re: what heavy negativity!)

2018-10-10 Thread erik quanstrom
zero copy is also the source of the dreaded 'D' state.

- Erik

Re: [9fans] PDP11 (Was: Re: what heavy negativity!)

2018-10-10 Thread Giacomo Tesio
On Tue, Oct 9, 2018 at 05:33, Lucio De Re wrote:
>
> On 10/9/18, Bakul Shah  wrote:
> >
> > One thing I have mused about is recasting plan9 as a
> > microkernel and pushing out a lot of its kernel code into user
> > mode code.  It is already half way there -- it is basically a
> > mux for 9p calls, low level device drivers,
> >
> There are religious reasons not to go there

Indeed: as a heretic, one of the first things I did with Jehanne was
to move the console filesystem out of the kernel.
Then I moved several syscalls into userspace, or turned them into files
or into operations on existing files.
More syscall/kernel services will move to user space as I find time
to hack on it again.

You know... heretics ruin everything!

I'm not going to turn Jehanne into a microkernel, but I'm looking for
the simplest possible set of kernel abstractions that can support a
distributed operating system able to replace the mainstream Web+OS
mess.
You know... heretics are crazy, too!


Giacomo



Re: [9fans] PDP11 (Was: Re: what heavy negativity!)

2018-10-10 Thread Digby R.S. Tarvin
I don't know which other ARM board you tried, but I have always found
the terrible I/O performance of the Pi to be a bigger problem than the ARM
speed.  The USB2 interface is really slow, and there aren't many
other (documented) alternatives. The Ethernet goes through the same
slow USB interface, and there is only so much you can do bit-banging
data through the GPIOs.  The SD card interface seems to be the only non-USB
filesystem I/O available, and that in turn limits the viability of
relieving the RAM constraints with virtual memory. So the ARM processor
itself is not usually the problem for me.

In general I find the Pi a nice little device for quite a few things - like
low power, low bandwidth, low cost servers, or displays with plenty of open
source compatibility. Or hacking/prototyping where I don't want to have to
worry too much about blowing things up. But it is not good for high-throughput
I/O, memory-intensive applications, or anything requiring a lot of
processing power.

The validity of your conclusion regarding low-power ARM in general probably
depends on what the other board you tried was.

DigbyT

On Wed, 10 Oct 2018 at 17:51, hiro <23h...@gmail.com> wrote:

> > Eliminating as much of the copy in/out WRT the kernel cannot but
> > help, especially when you're doing SDR decoding near the radios
> > using low-powered compute hardware (think Pies and the like).
>
> Does this include demodulation on the pi? cause even when i dumped the
> pi i was given for that purpose (with a <2Mbit I/Q stream) and
> replaced it with some similar ARM platform that at least had neon cpu
> instruction extensions for faster floating point operations, I was
> barely able to run a small FFT.
>
> My conclusion was that these low-powered ARM systems are just good
> enough for gathering low-bandwidth, non-critical USB traffic, like
> those raw I/Q samples from a dongle, but unfit for anything else.
>
>


Re: [9fans] PDP11 (Was: Re: what heavy negativity!)

2018-10-10 Thread hiro
I agree: if you have a choice, avoid the rpi at all costs.
Even if the software side of that other board was less pleasant, at least
it worked with my mouse and keyboard!! :)

As I said, I was looking at 2Mbit/s stuff, which is nothing, even over USB.
But my point is that even though this number is low, the rpi is too limited
to do any meaningful processing anyway (ignoring the usb troubles and lack
of ethernet). It's a mobile phone SoC after all, where the modulation is
done by dedicated chips, not on the cpu! :)

On Wednesday, October 10, 2018, Digby R.S. Tarvin 
wrote:
> I don't know which other ARM board you tried, but I have always found
> the terrible I/O performance of the Pi to be a bigger problem than the ARM
> speed. [...]


Re: [9fans] PDP11 (Was: Re: what heavy negativity!)

2018-10-10 Thread Ethan Gardener
On Tue, Oct 9, 2018, at 11:22 PM, Digby R.S. Tarvin wrote:
> 
> 
> On Tue, 9 Oct 2018 at 23:00, Ethan Gardener  wrote:
>> 
>> Fascinating thread, but I think you're off by a decade with the 16-bit 
>> address bus comment, unless you're not actually talking about Plan 9.  The 
>> 8086 and 8088 were introduced with 20-bit addressing in 1978 and 1979 
>> respectively.  The IBM PC, launched in 1982, had its ROM at the top of that 
>> 1MByte space, so it couldn't have been constrained in that way.  By the end 
>> of the 80s, all my schoolmates had 68k-powered computers from Commodore and 
>> Atari, showing hardware with a 24-bit address space was very much affordable 
>> and ubiquitous at the time Plan 9 development started.  Almost all of them 
>> had 512KB at the time.  A few flashy gits had 1MB machines. :)
> 
> Not sure I would agree with that. The 20 bit addressing of the 8086 and 8088 
> did not change their 16 bit nature. They were still 16 bit program counter, 
> with segmentation to provide access to a larger memory - similar in principle 
> to the PDP11 with MMU. 

That's not at all the same as being constrained to 64KB memory.  Are we 
communicating at cross purposes here?  If we're not, if I haven't misunderstood 
you, you might want to read up on creating .exe files for MS-DOS.  

> The first 32 bit x86 processor was the 386, which I think came out in 1985, 
> very close to when work on Plan9 was rumored to have  started. So it seemed 
> not impossible that work might have started on an older 16 bit machine, but  
> at Bell Labs probably a long shot.

Mmh, rumors. I read they were starting to think about Plan 9 in 1985, but I 
haven't read anything about it being up and running until '89 or '90.  There's 
not much to go on.

>> I still wish I'd kept the better of the Atari STs which made their way down 
>> to me -- a "1040 STE" -- 1MB with a better keyboard and ROM than the earlier 
>> "STFM" models.  I remember wanting to try to run Plan 9 on it.  Let's 
>> estimate how tight it would be...
>>  
>>  I think it would be terrible, because I got frustrated enough trying to run 
>> a 4e CPU server with graphics on a 2GB x86.  I kept running out of image 
>> memory!  The trouble was the draw device in 4th edition stores images in the 
>> same "image memory" the kernel loads programs into, and the 386 CPU kernel 
>> 'only' allocates 64MB of that. :)  
>>  
>>  1 bit per pixel would obviously improve matters by a factor of 16 compared 
>> to my setup, and 640x400 (Atari ST high resolution) would be another 5 times 
>> smaller than my screen.  Putting these numbers together with my experience, 
>> you'd have to be careful to use images sparingly on a machine with 800KB 
>> free RAM after the kernel is loaded.  That's better than I thought, probably 
>> achievable on that Atari I had, but it couldn't be used as intensively as I 
>> used Plan 9 back then.  
>>  
>>  How could it be used?  I think it would be a good idea to push the draw 
>> device back to user space and make very sure to have it check for failing 
>> malloc!  I certainly wouldn't want a terminal with a filesystem and graphics 
>> all on a single 1MByte 68000-powered computer, because a filesystem on a 
>> terminal runs in user space, and thus requires some free memory to run the 
>> programs to shut it down.  Actually, Plan 9's separation of terminal from 
>> filesystem seems quite the obvious choice when I look at it like this. :)  
> 
> I went Commodore Amiga at about that time - because it at least supported 
> some form of multi-tasking out out the box, and I spent many happy hours 
> getting OS9 running on it.. An interesting architecture, capable of some 
> impressive graphics, but subject to quite severe limitations which made 
> general purpose graphics difficult. (Commodore later released SVR4 Unix for 
> the A3000, but limited X11 to monochrome when using the inbuilt graphics).

It does sound like fun. :)  I'm not surprised by the monochrome graphics 
limitation after my calculations.  Still, X11 or any other window system which 
lacks a backing store may do better in low-memory environments than Plan 9's 
present draw device.  It's a shame, as a backing store is a great 
simplification for programmers.

> But being 32 bit didn't give it a huge advantage over the 16 bit x86 systems 
> for tinkering with operating system, because the 68000 had no MMU.  It was 
> easier to get a Unix like system going with 16 bit segmentation than a 32 bit 
> linear space and no hardware support for run time relocation.
> (OS9 used position independent code throughout to work without an MMU, but 
> didn't try to implement fork() semantics).

I'm sometimes tempted to think that fork() is freakishly high-level crazy 
stuff. :)  Still, like backing store, it's very nice to have.

> It wasn't till the 68030 based Amiga 3000 came out in 1990 that it really did 
> everything I wanted. The 68020 with an optional MMU was equivalent, but not 
> so common in 

Re: [9fans] PDP11 (Was: Re: what heavy negativity!)

2018-10-10 Thread Steve Simon


people come down very hard on the pi.

here are my times for building the pi kernel. i rebuilt it a few times to push 
data into any caches available.

pi3+ with a high-ish spec sd card: 23 secs
dual intel atom 1.8Ghz with an SSD: 9 secs

the pi is slower, but not 10 times slower.
However, it costs a 10th of the price and consumes a 10th of the electricity.

i use the order of magnitude test as that is (in my experience) what you need 
to make a really noticeable difference (to stuff in general).

i use one daily as a plan9 terminal, for which i feel it's ideal.

-Steve






Re: [9fans] PDP11 (Was: Re: what heavy negativity!)

2018-10-10 Thread erik quanstrom
> > with meltdown/Spectre mitigations in place, I would like to see evidence 
> > that flip is faster than copy.
> 
> If your system is well balanced, you should be able to
> stream data as fast as memory allows[1]. In such a system
> copying things N times will reduce throughput by similar
> factor. It may be that plan9 underperforms so much this
> doesn't matter normally.

sure.  but flipping page tables is also not free.  there is a huge cost in
processor stalls, etc.  spectre and meltdown mitigations make this worse, as
each page flip has to be accompanied by a complete pipeline flush or other
costly mitigation.  (not that this was cheap to begin with)

it's also not an object to move data as fast as possible.  the object is to do 
work as fast as possible.

> [1] See: https://code.kx.com/q/cloud/aws/benchmarking/
> A single q process can ingest data at 1.9GB/s from a
> single drive. 16 can achieve 2.7GB/s, with theoretical
> max being 2.8GB/s.

with my same crappy un-optimized nvme driver, i was able to hit 2.5-2.6 GiB/s
with two very crappy nvme drives.  (are your numbers really GB rather than
GiB?)  i am sure i could scale that linearly.  there's plenty of memory
bandwidth left, but i haven't got any more nvme.  :-)

similarly, coraid built an appliance that did copying (due to cache) and hit
1 million 4k iops.  this was in 2011 or so.

but, so what.  all this proves is that with copying or without, we can ingest 
enough
data for even the most hungry programs.

unless you have data that shows otherwise.  :-)

- erik



Re: [9fans] PDP11 (Was: Re: what heavy negativity!)

2018-10-10 Thread cinap_lenrek
oh! you wrote an nvme driver TOO? where can i find it?

maybe we can share some knowledge, especially regarding
some quirks. i don't own the hardware myself, so i wrote it
using an emulator over a weekend and tested it on a
work machine after work.

http://code.9front.org/hg/plan9front/log/9df9ef969856/sys/src/9/pc/sdnvme.c

--
cinap



Re: [9fans] PDP11 (Was: Re: what heavy negativity!)

2018-10-10 Thread cinap_lenrek
> But the reason I want this is to reduce latency to the first
> access, especially for very large files. With read() I have
> to wait until the read completes. With mmap() processing can
> start much earlier and can be interleaved with background
> data fetch or prefetch. With read() a lot more resources
> are tied down. If I need random access and don't need to
> read all of the data, the application has to do pread(),
> pwrite() a lot thus complicating it. With mmap() I can just
> map in the whole file and excess reading (beyond what the
> app needs) will not be a large fraction.

you think doing single 4K page-sized reads in the pagefault
handler is better than doing precise >4K reads from your
application? possibly in a background thread so you can
overlap processing with data fetching?

the advantage of mmap is not prefetch. it's about not doing
any I/O when the data is already in the *SHARED* buffer cache!
which plan9 does not have (except the mntcache, but that is
optional and only works for the disk fileservers that maintain
their file qid version info consistently). it *IS* really a linux
thing, where all block device i/o goes through the buffer cache.
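
that overlap is cheap to get with plain reads. a minimal POSIX
threads sketch (hypothetical file name and process() consumer),
double-buffered so the next block is fetched while the current
one is chewed on:

#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

enum { Bufsz = 1<<20 };

static int fd;
static char buf[2][Bufsz];

typedef struct Job Job;
struct Job { int slot; off_t off; ssize_t n; };

static void*
fill(void *a)
{
    Job *j = a;

    j->n = pread(fd, buf[j->slot], Bufsz, j->off);
    return 0;
}

static void
process(char *p, ssize_t n)
{
    (void)p; (void)n;    /* hypothetical consumer; real work goes here */
}

int
main(void)
{
    pthread_t t;
    Job j;
    ssize_t n;
    off_t off;
    int cur;

    fd = open("bigfile.dat", O_RDONLY);
    if(fd < 0){
        perror("open");
        return 1;
    }
    n = pread(fd, buf[0], Bufsz, 0);    /* prime the first buffer */
    for(cur = 0, off = Bufsz; n > 0; off += Bufsz, cur ^= 1){
        j.slot = cur^1;
        j.off = off;
        pthread_create(&t, 0, fill, &j);    /* fetch the next block... */
        process(buf[cur], n);               /* ...while processing this one */
        pthread_join(&t, 0);
        n = j.n;
    }
    close(fd);
    return 0;
}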

--
cinap



Re: [9fans] PDP11 (Was: Re: what heavy negativity!)

2018-10-10 Thread Ethan Gardener
On Tue, Oct 9, 2018, at 8:14 PM, Lyndon Nerenberg wrote:
> hiro writes:
> 
> > Huh? What exactly do you mean? Can you describe the scenario and the
> > measurements you made?
> 
> The big one is USB.  disk/radio->kernel->user-space-usbd->kernel->application.
> Four copies.
> 
> I would like to start playing with software defined radio on Plan
> 9, but that amount of data copying is going to put a lot of pressure
> on the kernel to keep up.  UNIX/Linux suffers the same copy bloat,
> and it's having trouble keeping up, too.

References, please.  Programmers are notoriously bad at determining the cause 
of performance problems.  Examining the source will help to see if "copy bloat" 
is the actual problem.

> 
> --lyndon
> 


-- 
Progress might have been all right once, but it has gone on too long -- Ogden 
Nash



Re: [9fans] PDP11 (Was: Re: what heavy negativity!)

2018-10-10 Thread Digby R.S. Tarvin
Well, I think 'avoid at all costs' is a bit strong.

The Raspberry Pi is a good little platform for the right applications, so
long as you are aware of its limitations. I use one as my 'always on' home
server to give me access to files when travelling (the networking is slow by
LAN standards, but ok for WAN), and another for my energy monitoring
system. It is good for experimenting with OSes, especially networking OSes
like Plan9 where price matters if you want to try a large number of
hosts. It's good for teaching/learning, or for running/trying different
operating systems without having to spend time and resources setting up VMs
(downloading and flashing an SD card image is quick and takes up no space
on my main systems).

Just don't plan on deploying RPi's for mission critical applications that
have demanding I/O or processing requirements. It was never intended to
compete in that market.

On Wed, 10 Oct 2018 at 20:54, hiro <23h...@gmail.com> wrote:

> I agree: if you have a choice, avoid the rpi at all costs.
> Even if the software side of that other board was less pleasant, at least
> it worked with my mouse and keyboard!! :)
>
> As I said, I was looking at 2Mbit/s stuff, which is nothing, even over USB.
> But my point is that even though this number is low, the rpi is too limited
> to do any meaningful processing anyway (ignoring the usb troubles and lack
> of ethernet). It's a mobile phone SoC after all, where the modulation is
> done by dedicated chips, not on the cpu! :)
>
> [...]


Re: [9fans] PDP11 (Was: Re: what heavy negativity!)

2018-10-10 Thread Skip Tavakkolian
For operations that matter in this context (read, write), there can be
multiple outstanding tags. A while back rsc implemented fcp, partly to
prove this point.
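
The shape of it, as a POSIX threads sketch (hypothetical file
name, short-read handling elided): each worker keeps its own
pread in flight on the same fd, so the server sees several
outstanding requests instead of one at a time.

#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

enum { Nworker = 4, Bufsz = 128*1024 };

static int fd;

static void*
worker(void *arg)
{
    long id = (long)arg;
    char buf[Bufsz];
    off_t off;
    ssize_t n;

    /* worker i handles stripes i, i+N, i+2N, ... */
    for(off = (off_t)id*Bufsz;; off += (off_t)Nworker*Bufsz){
        n = pread(fd, buf, Bufsz, off);    /* independent offset, no shared file pointer */
        if(n <= 0)
            break;
        /* a real copier would pwrite(out, buf, n, off) here */
    }
    return 0;
}

int
main(void)
{
    pthread_t t[Nworker];
    long i;

    fd = open("bigfile.dat", O_RDONLY);
    if(fd < 0){
        perror("open");
        return 1;
    }
    for(i = 0; i < Nworker; i++)
        pthread_create(&t[i], 0, worker, (void*)i);
    for(i = 0; i < Nworker; i++)
        pthread_join(t[i], 0);
    close(fd);
    return 0;
}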

On Wed, Oct 10, 2018 at 2:54 PM Steven Stallion  wrote:

> [...]
>
> In the end, it doesn't matter how "fast" a storage driver is in Plan 9
> - as soon as you put a 9P-based filesystem on it, it's going to be
> limited to a single outstanding operation. This is the tyranny of 9P.
> We (Coraid) got around this by avoiding filesystems altogether.
>
> Go solve that problem first.
> On Wed, Oct 10, 2018 at 12:36 PM  wrote:
> > [...]
>


Re: [9fans] PDP11 (Was: Re: what heavy negativity!)

2018-10-10 Thread Kurt H Maier
On Wed, Oct 10, 2018 at 04:54:22PM -0500, Steven Stallion wrote:
> As the guy 

might be worth keeping in mind the current most common use case for nvme
is laptop storage and not building jet engines in coraid's basement

so the nvme driver that cinap wrote works on my thinkpad today and is 
about infinity times faster than the one you guys locked up in the 
warehouse at the end of raiders of the lost ark, because my laptop can't
seem to boot off nostalgia.

so no, nobody gets an award for writing a driver.  but cinap won the
9front Order of Valorous Service (with bronze oak leaf cluster,
signifying working code) for *releasing* one.  I was there when field
marshal aiju presented the award; it was a very nice ceremony.

anyway, someone once said communication is not a zero-sum game.  the
hyperspecific use case you describe is fine but there are other reasons
to care about how well this stuff works, you know?

khm



Re: [9fans] zero copy & 9p (was Re: PDP11 (Was: Re: what heavy negativity!)

2018-10-10 Thread Steven Stallion
> On Oct 10, 2018, at 2:54 PM, Steven Stallion  wrote:
>
> You seem to be saying zero-copy wouldn't buy anything until these
> other problems are solved, right?

Fundamentally zero-copy requires that the kernel and user process
share the same virtual address space mapped for the given operation.
This can't always be done and the kernel will be forced to perform a
copy anyway. To wit, one of the things I added to the exynos kernel
early on was a 1:1 mapping of the virtual kernel address space such
that something like zero-copy could be possible in the future (it was
also very convenient to limit MMU swaps on the Cortex-A15). That said,
the problem gets harder when you're working on something more general
that can handle the entire address space. In the end, you trade the
complexity/performance hit of MMU management versus making a copy.
Believe it or not, sometimes copies can be faster, especially on
larger NUMA systems.

> Suppose you could replace 9p based FS with something of your choice.
> Would it have made your jobs easier? Code less grotty? In other
> words, is the complexity of the driver to achieve high throughput
> due to the complexity of hardware or is it due to 9p's RPC model?
> For streaming data you pretty much have to have some sort of
> windowing protocol (data prefetch or write behind with mmap is a
> similar thing).

This is one of those problems that afflicts storage more than any
other subsystem, but like most things it's a tradeoff. Having a
filesystem that doesn't support 9P doesn't seem to make much sense on
Plan 9 given the ubiquity of the protocol. Dealing with the multiple
outstanding issue does make filesystem support much more complex and
would have a far-reaching effect on existing code (not to mention the
kernel).

It's completely possible to support prefetch and/or streaming I/O
using existing kernel interfaces. cinap's comment about read not
returning until the entire buffer is read is an implementation detail
of the underlying device. A read call is free to return fewer bytes
than requested; it's not uncommon for a driver to return partial data
to favor latency over throughput. In other words, there's no magic
behind mmap - it's a convenience interface. If you look at how other
kernels tend to implement I/O, there are generally fundamental calls
to a read/write interface - there are no special provisions for
mmap beyond the syscall layer.
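
Callers that need the full count just loop; this is roughly the
shape of Plan 9's readn(2):

long
readn(int f, void *av, long n)
{
    char *a;
    long m, t;

    a = av;
    t = 0;
    while(t < n){
        m = read(f, a+t, n-t);
        if(m <= 0){
            if(t == 0)
                return m;    /* error or immediate eof */
            break;
        }
        t += m;
    }
    return t;
}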

The beauty of 9P is you can wrap driver filesystems for added
functionality. Want a block caching interface? Great! Slap a kernel
device on top of a storage driver that handles caching and prefetch.
I'm sure you can see where this is going...

> Looks like people who have worked on the plan9 kernel have learned
> a lot of lessons and have a lot of good advice to offer. I'd love
> to learn from that, except I rarely see anyone criticizing
> plan9.

Something, something, in polite company :-)



Re: [9fans] PDP11 (Was: Re: what heavy negativity!)

2018-10-10 Thread Steven Stallion
As the guy who wrote the majority of the code that pushed those 1M 4K
random IOPS erik mentioned, this thread annoys the shit out of me. You
don't get an award for writing a driver. In fact, it's probably better
not to be known at all considering the bloody murder one has to commit
to marry hardware and software together.

Let's be frank, the I/O handling in the kernel is anachronistic. To
hit those rates, I had to add support for asynchronous and vectored
I/O not to mention a sizable bit of work by a co-worker to properly
handle NUMA on our appliances to hit those speeds. As I recall, we had
to rewrite the scheduler and re-implement locking, which even Charles
Forsyth had a hand in. Had we the time and resources to implement
something like zero-copy we'd have done it in a heartbeat.

In the end, it doesn't matter how "fast" a storage driver is in Plan 9
- as soon as you put a 9P-based filesystem on it, it's going to be
limited to a single outstanding operation. This is the tyranny of 9P.
We (Coraid) got around this by avoiding filesystems altogether.

Go solve that problem first.
On Wed, Oct 10, 2018 at 12:36 PM  wrote:
> [...]



Re: [9fans] zero copy & 9p (was Re: PDP11 (Was: Re: what heavy negativity!)

2018-10-10 Thread cinap_lenrek
> Fundamentally zero-copy requires that the kernel and user process
> share the same virtual address space mapped for the given operation.

and it is. this doesn't make your point clear. the kernel is always mapped.
(you meant 1:1 identity mapping *PHYSICAL* pages to make the lookup cheap?)

the difference is that *USER* pages are (unless you use special segments)
scattered randomly in physical memory or not even realized and you need
to lookup the pages in the virtual page table to get to the physical
addresses needed to hand them to the hardware for DMA.

now the *INTERESTING* thing is what happens to the original virtual
address space that covered the I/O when someone touches into it while
the I/O is in flight. so do we cut it out of the TLB's of ALL processes
*SHARING* the segment? and then have the pagefault handler wait until
the I/O is finished? fuck your go routines... he wants the D.

> This can't always be done and the kernel will be forced to perform a
> copy anyway.

explain *WHEN*; that would be an insight into what you're trying to
explain.

> To wit, one of the things I added to the exynos kernel
> early on was a 1:1 mapping of the virtual kernel address space such
> that something like zero-copy could be possible in the future (it was
> also very convenient to limit MMU swaps on the Cortex-A15). That said,
> the problem gets harder when you're working on something more general
> that can handle the entire address space. In the end, you trade the
> complexity/performance hit of MMU management versus making a copy.

don't forget the code complexity with dealing with these scattered
pages in the *DRIVERS*.

> Believe it or not, sometimes copies can be faster, especially on
> larger NUMA systems.

--
cinap



Re: [9fans] zero copy & 9p (was Re: PDP11 (Was: Re: what heavy negativity!)

2018-10-10 Thread Dan Cross
On Wed, Oct 10, 2018 at 7:58 PM  wrote:

> > Fundamentally zero-copy requires that the kernel and user process
> > share the same virtual address space mapped for the given operation.
>
> and it is. this doesn't make your point clear. the kernel is always mapped.
>

Meltdown has shown this to be a bad idea.

(you meant 1:1 identity mapping *PHYSICAL* pages to make the lookup cheap?)
>

plan9 doesn't use an identity mapping; it uses an offset mapping for most
of the address space and on 64-bit systems a separate mapping for the
kernel. An identity mapping from P to V is a function f such that f(a) = a.
But on 32-bit plan9, VADDR(p) = p + KZERO and PADDR(v) = v - KZERO. On
64-bit plan9 systems it's a little more complex because of the two
mappings, which vary between sub-projects: 9front appears to map the kernel
into the top 2 gigs of the address space which means that, on large
machines, the entire physical address space can't fit into the kernel.  Of
course in such situations one maps the top part of the canonical address
space for the exclusive use of supervisor code, so in that way it's a
distinction without a difference.
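
In code, the 32-bit mapping is just an offset (a sketch of the
shape, not a quote of any tree; KZERO is 0xF0000000 on the pc
port, other ports differ):

/* offset mapping between kernel virtual and physical addresses */
#define KZERO       0xF0000000
#define VADDR(pa)   ((void*)((ulong)(pa) + KZERO))
#define PADDR(va)   ((ulong)(va) - KZERO)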

Of course, there are tricks to make lookups of arbitrary addresses
relatively cheap by using the MMU hardware and dedicating part of the
address space to a recursive self-map. That is, if you don't want to walk
page tables yourself, or keep a more elaborate data structure to describe
the address space.

the difference is that *USER* pages are (unless you use special segments)
> scattered randomly in physical memory or not even realized and you need
> to lookup the pages in the virtual page table to get to the physical
> addresses needed to hand them to the hardware for DMA.
>

So...walking page tables is hard? Ok

now the *INTERESTING* thing is what happens to the original virtual
> address space that covered the I/O when someone touches into it while
> the I/O is in flight. so do we cut it out of the TLB's of ALL processes
> *SHARING* the segment? and then have the pagefault handler wait until
> the I/O is finished?


You seem to be mixing multiple things here. The physical page has to be
pinned while the DMA operation is active (unless it can be reliably
canceled). This can be done any number of ways; but so what? It's not new
and it's not black magic. Who cares about the virtual address space? If
some other processor (nb, not process -- processes don't have TLB entries,
processors do) might have a TLB entry for that mapping that you just
changed you need to shoot it down anyway: what's that have to do with
making things wait for page faulting?

The simplicity of the current scheme comes from the fact that the kernel
portion of the address *space* is effectively immutable once the kernel
gets going. That's easy, but it's not particularly flexible and other
systems do things differently (not just Linux and its ilk). I'm not saying
you *should* do it in plan9, but it's not like it hasn't been done
elegantly before.


> fuck your go routines... he wants the D.
>
> > This can't always be done and the kernel will be forced to perform a
> > copy anyway.
>
> explain *WHEN*; that would be an insight into what you're trying to
> explain.
>
> > To wit, one of the things I added to the exynos kernel
> > early on was a 1:1 mapping of the virtual kernel address space such
> > that something like zero-copy could be possible in the future (it was
> > also very convenient to limit MMU swaps on the Cortex-A15). That said,
> > the problem gets harder when you're working on something more general
> > that can handle the entire address space. In the end, you trade the
> > complexity/performance hit of MMU management versus making a copy.
>
> don't forget the code complexity with dealing with these scattered
> pages in the *DRIVERS*.
>

It's really not that hard. The way Linux does it is pretty bad, but it's
not like that's the only way to do it.

Or don't.

- Dan C.


Re: [9fans] PDP11 (Was: Re: what heavy negativity!)

2018-10-10 Thread cinap_lenrek
hahahahahahahaha

--
cinap



[9fans] zero copy & 9p (was Re: PDP11 (Was: Re: what heavy negativity!)

2018-10-10 Thread Bakul Shah
Excellent response! Just what I was hoping for!

On Oct 10, 2018, at 2:54 PM, Steven Stallion  wrote:
> 
> As the guy who wrote the majority of the code that pushed those 1M 4K
> random IOPS erik mentioned, this thread annoys the shit out of me. You
> don't get an award for writing a driver. In fact, it's probably better
> not to be known at all considering the bloody murder one has to commit
> to marry hardware and software together.
> 
> Let's be frank, the I/O handling in the kernel is anachronistic. To
> hit those rates, I had to add support for asynchronous and vectored
> I/O not to mention a sizable bit of work by a co-worker to properly
> handle NUMA on our appliances to hit those speeds. As I recall, we had
> to rewrite the scheduler and re-implement locking, which even Charles
> Forsyth had a hand in. Had we the time and resources to implement
> something like zero-copy we'd have done it in a heartbeat.
> 
> In the end, it doesn't matter how "fast" a storage driver is in Plan 9
> - as soon as you put a 9P-based filesystem on it, it's going to be
> limited to a single outstanding operation. This is the tyranny of 9P.
> We (Coraid) got around this by avoiding filesystems altogether.
> 
> Go solve that problem first.

You seem to be saying zero-copy wouldn't buy anything until these
other problems are solved, right?

Suppose you could replace 9p based FS with something of your choice.
Would it have made your jobs easier? Code less grotty? In other
words, is the complexity of the driver to achieve high throughput
due to the complexity of hardware or is it due to 9p's RPC model?
For streaming data you pretty much have to have some sort of
windowing protocol (data prefetch or write behind with mmap is a
similar thing).

Looks like people who have worked on the plan9 kernel have learned
a lot of lessons and have a lot of good advice to offer. I'd love
to learn from that, except I rarely see anyone criticizing
plan9.


> On Wed, Oct 10, 2018 at 12:36 PM  wrote:
>> [...]




Re: [9fans] PDP11 (Was: Re: what heavy negativity!)

2018-10-10 Thread Steven Stallion
Posted August 15th, 2013: https://9p.io/sources/contrib/stallion/src/sdmpt2.c
Corresponding announcement:
https://groups.google.com/forum/#!topic/comp.os.plan9/134-YyYnfbQ
On Wed, Oct 10, 2018 at 5:31 PM Kurt H Maier  wrote:
>
> [...]
>



Re: [9fans] PDP11 (Was: Re: what heavy negativity!)

2018-10-10 Thread Digby R.S. Tarvin
On Wed, 10 Oct 2018 at 21:40, Ethan Gardener  wrote:

> >
> > Not sure I would agree with that. The 20 bit addressing of the 8086 and
> 8088 did not change their 16 bit nature. They were still 16 bit program
> counter, with segmentation to provide access to a larger memory - similar
> in principle to the PDP11 with MMU.
>
> That's not at all the same as being constrained to 64KB memory.  Are we
> communicating at cross purposes here?  If we're not, if I haven't
> misunderstood you, you might want to read up on creating .exe files for
> MS-DOS.


Agreed, but the PDP11/70 was not constrained to 64KB memory either.

I do recall the MS-DOS small/large/medium etc models that used the
segmentation in various ways to mitigate the limitations of being a 16 bit
computer. Similar techniques were possible on the PDP11, for example
Modula-2/VRS under RT-11 used the MMU to transparently support 4MB programs
back in 1984 (it used trap instructions to implement subroutine calls).

It wasn't possible under Unix, of course, because there were no system
calls for manipulating the mmu. Understandable, as it would have
complicated the security model in a multi-tasking system. Something neither
MS-DOS or RT-11 had to deal with.

Address space manipulation was more convenient with Intel segmentation
because the instruction set included procedure call/return instructions
that manipulated the segmentation registers, but the situation was not
fundamentally different.  They were both 16 bit machines with hacks to give
access to a larger than 64K physical memory.

The OS9 operating system allowed some control of application memory maps in
a unix-like environment by supporting dynamic (but explicit) link and
unlink of subroutine and data modules, which would be added and removed
from your 64K address space as required. So it was more analogous to
memory-based overlays.


> > I went Commodore Amiga at about that time - because it at least
> supported some form of multi-tasking out out the box, and I spent many
> happy hours getting OS9 running on it.. An interesting architecture,
> capable of some impressive graphics, but subject to quite severe
> limitations which made general purpose graphics difficult. (Commodore later
> released SVR4 Unix for the A3000, but limited X11 to monochrome when using
> the inbuilt graphics).
>
> It does sound like fun. :)  I'm not surprised by the monochrome graphics
> limitation after my calculations.  Still, X11 or any other window system
> which lacks a backing store may do better in low-memory environments than
> Plan 9's present draw device.  It's a shame, a backing store is a great
> simplification for programmers.
>

X11 does, of course, support the concept of a backing store. It just
doesn't mandate it. It was an expensive thing to provide back when X11 was
young, so it was pretty rare. I remember finding the need to re-create
windows on demand rather annoying when I first learned to program in Xlib,
but once you get used to it, it can lead to benefits when you have to
retain knowledge of how an image is created, not just the end result.


> > But being 32 bit didn't give it a huge advantage over the 16 bit x86
> systems for tinkering with operating system, because the 68000 had no MMU.
> It was easier to get a Unix like system going with 16 bit segmentation than
> a 32 bit linear space and no hardware support for run time relocation.
> > (OS9 used position independent code throughout to work without an MMU,
> but didn't try to implement fork() semantics).
>
> I'm sometimes tempted to think that fork() is freakishly high-level crazy
> stuff. :)  Still, like backing store, it's very nice to have.
>

I agree. Very elegant when you compare it to the hoops you have to jump
through to initialize the child process environment in systems with the
more common combined 'forkexec' semantics, but a real sticking point for
low end hardware.


> > It wasn't till the 68030 based Amiga 3000 came out in 1990 that it
> really did everything I wanted. The 68020 with an optional MMU was
> equivalent, but not so common in consumer machines.
> >
> > Hardware progress seems to have been rather uninteresting since then.
> Sure, hardware is *much* faster and *much* bigger, but fundamentally the
> same architecture. Intel had a brief flirtation with a novel architecture
> with the iAPX 432 in 81, but obviously found that was more profitable
> making the familiar architecture bigger and faster .
>
> I rather agree.  Multi-core and hyperthreading don't bring in much from an
> operating system designer's perspective, and I think all the interesting
> things about caches are means of working around their problems.


I don't think anyone would bother with multiple cores or caches if that
same performance could be achieved without them.  They just buy a bit more
performance at the cost of additional software complexity.

I would very much like to get my hands on a ga144 to see what sort of
> operating system structure 

Re: [9fans] PDP11 (Was: Re: what heavy negativity!)

2018-10-10 Thread Steven Stallion
Interesting - was this ever generalized? It's been several years since
I last looked, but I seem to recall that unless you went out of your
way to write your own 9P implementation, you were limited to a single
tag.
On Wed, Oct 10, 2018 at 7:51 PM Skip Tavakkolian
 wrote:
>
> For operations that matter in this context (read, write), there can be 
> multiple outstanding tags. A while back rsc implemented fcp, partly to prove 
> this point.
>
> On Wed, Oct 10, 2018 at 2:54 PM Steven Stallion  wrote:
>> [...]



Re: [9fans] zero copy & 9p (was Re: PDP11 (Was: Re: what heavy negativity!)

2018-10-10 Thread Bakul Shah
On Wed, 10 Oct 2018 20:56:20 -0400 Dan Cross  wrote:
>
> On Wed, Oct 10, 2018 at 7:58 PM  wrote:
>
> > > Fundamentally zero-copy requires that the kernel and user process
> > > share the same virtual address space mapped for the given operation.
> >
> > and it is. this doesn't make your point clear. the kernel is always mapped.
> >
>
> Meltdown has shown this to be a bad idea.

People still do this.

> > (you meant 1:1 identity mapping *PHYSICAL* pages to make the lookup cheap?)

Steve wrote "1:1 mapping of the virtual kernel address space such
that something like zero-copy could be possible"

Not sure what he meant. For zero copy you need to *directly*
write to the memory allocated to a process. 1:1 mapping is
really not needed.

> plan9 doesn't use an identity mapping; it uses an offset mapping for most
> of the address space and on 64-bit systems a separate mapping for the
> kernel. An identity mapping from P to V is a function f such that f(a) = a.
> But on 32-bit plan9, VADDR(p) = p + KZERO and PADDR(v) = v - KZERO. On
> 64-bit plan9 systems it's a little more complex because of the two
> mappings, which vary between sub-projects: 9front appears to map the kernel
> into the top 2 gigs of the address space which means that, on large
> machines, the entire physical address space can't fit into the kernel.  Of
> course in such situations one maps the top part of the canonical address
> space for the exclusive use of supervisor code, so in that way it's a
> distinction without a difference.
>
> Of course, there are tricks to make lookups of arbitrary addresses
> relatively cheap by using the MMU hardware and dedicating part of the
> address space to a recursive self-map. That is, if you don't want to walk
> page tables yourself, or keep a more elaborate data structure to describe
> the address space.
>
> > the difference is that *USER* pages are (unless you use special segments)
> > scattered randomly in physical memory or not even realized and you need
> > to lookup the pages in the virtual page table to get to the physical
> > addresses needed to hand them to the hardware for DMA.

If you don't copy, you do need to find all the physical pages.
This is not really expensive and many OSes do precisely this.

If you copy, you can avoid walking the page table. But for
that to work, the kernel virtual space needs to be mapped 1:1 in
*every* process -- this is because any cached data will be in
kernel space and must be available in all processes.

In fact the *main* reason this was done was to facilitate such
copying. Had we always done zero-copy, we could've avoided
Meltdown altogether. copyin/copyout of syscall arguments
shouldn't be expensive.

> So...walking page tables is hard? Ok
>
> > now the *INTERESTING* thing is what happens to the original virtual
> > address space that covered the I/O when someone touches into it while
> > the I/O is in flight. so do we cut it out of the TLB's of ALL processes
> > *SHARING* the segment? and then have the pagefault handler wait until
> > the I/O is finished?

In general, the way this works is a bit different. In an
mmap() scenario, the initial mapping simply allocates the
necessary PTEs and marks them so that *any* read/write access
will incur a page fault.  At that point, if the underlying page
is found to be cached, it is linked to the PTE and the
relevant access bit is changed to allow the access. If not, the
process has to wait until the page is read in, at which time
it will be linked with the relevant PTE(s). Even if the same file
page is mapped in N processes, the same thing happens. The
kernel does have to do some bookkeeping, as the same file data
may be referenced from multiple places.
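
Schematically, that fault path looks like this (C-shaped
pseudocode; cachelookup, pagein, mappage are made-up names, not
any real kernel's API):

typedef struct Page Page;
typedef struct File File;

Page*   cachelookup(File*, long off);       /* already in the page cache? */
Page*   pagein(File*, long off);            /* allocate, schedule the read, sleep */
void    cacheinsert(File*, long off, Page*);
void    mappage(void *va, Page*, int prot);

void
fault(File *f, void *va, long off)
{
    Page *pg;

    pg = cachelookup(f, off);
    if(pg == 0){
        pg = pagein(f, off);    /* faulting process waits here */
        cacheinsert(f, off, pg);
    }
    mappage(va, pg, 1);    /* link the PTE, permit the access */
}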

> You seem to be mixing multiple things here. The physical page has to be
> pinned while the DMA operation is active (unless it can be reliably
> canceled). This can be done any number of ways; but so what? It's not new
> and it's not black magic. Who cares about the virtual address space? If
> some other processor (nb, not process -- processes don't have TLB entries,
> processors do) might have a TLB entry for that mapping that you just
> changed you need to shoot it down anyway: what's that have to do with
> making things wait for page faulting?

Indeed.

> The simplicity of the current scheme comes from the fact that the kernel
> portion of the address *space* is effectively immutable once the kernel
> gets going. That's easy, but it's not particularly flexible and other
> systems do things differently (not just Linux and its ilk). I'm not saying
> you *should* do it in plan9, but it's not like it hasn't been done
> elegantly before.
>
>
> > fuck your go routines... he wants the D.

What?!

> > > This can't always be done and the kernel will be forced to perform a
> > > copy anyway.

In general this is wrong. None of this is new. By decades.
Theoretically even regular read/write can use mapping behind
the scenes. [Save the old V->P map to deal with any IO error,
remove the same 

Re: [9fans] zero copy & 9p (was Re: PDP11 (Was: Re: what heavy negativity!)

2018-10-10 Thread Steven Stallion
On Wed, Oct 10, 2018 at 9:32 PM Bakul Shah  wrote:
> Steve wrote "1:1 mapping of the virtual kernel address space such
> that something like zero-copy could be possible"
>
> Not sure what he meant. For zero copy you need to *directly*
> write to the memory allocated to a process. 1:1 mapping is
> really not needed.

Ugh. I could have worded that better. That was a (very) clumsy attempt
at stating that the kernel would have to support remapping the user
buffer to virtual kernel space. Fortunately Plan 9 doesn't page out
kernel memory, so pinning wouldn't be required.

Cheers,
Steve



Re: [9fans] zero copy & 9p (was Re: PDP11 (Was: Re: what heavy negativity!)

2018-10-10 Thread Steven Stallion
On Wed, Oct 10, 2018 at 8:20 PM Dan Cross  wrote:
>> don't forget the code complexity with dealing with these scattered
>> pages in the *DRIVERS*.
>
> It's really not that hard. The way Linux does it is pretty bad, but it's not 
> like that's the only way to do it.

SunOS and Win32 (believe it or not) managed to get this "right";
dealing with zero-copy in those kernels was a non-event. I'm not sure
I understand the assertion how this would affect constituent drivers.
This sort of detail is handled at a higher level - the driver
generally operates on a buffer that gets jammed into a ring for DMA
transfer. Apart from grabbing the physical address, the worst you may
have to do is pin/unpin the block for the DMA operation. From the
driver's perspective, it's memory. It doesn't matter where it came
from (or who owns it for that matter).
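
To illustrate, the view from the ring is something like this (a
generic sketch; the field names are made up and real controllers
differ):

/* generic DMA descriptor ring, schematically */
typedef struct Desc Desc;
struct Desc {
    unsigned long long  pa;     /* physical address of the buffer */
    unsigned int        len;
    unsigned int        flags;  /* ownership bit etc. */
};

enum { Nring = 256, Own = 1 };

Desc ring[Nring];
int head;

void
enqueue(unsigned long long pa, unsigned int len)
{
    Desc *d;

    d = &ring[head];
    d->pa = pa;         /* all the driver needs is the physical address */
    d->len = len;
    d->flags = Own;     /* hand the descriptor to the device */
    head = (head+1) % Nring;
}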

> Or don't.

There's a lot to be said for keeping it simple...