Re: AMD64 buffer cache 4GB cap anything new, multiqueueing plans? ("64bit DMA on amd64" cont)

2018-11-06 Thread Philip Guenther
On Tue, Nov 6, 2018 at 9:51 PM Joseph Mayer wrote:

> Previously there was a years-long thread about a 4GB (32bit) buffer
> cache constraint on AMD64, ref
> https://marc.info/?t=14682443664&r=1&w=2 .
>
> What I gather is,
>
>  * The problem is that on AMD64, DMA is limited to 32bit
>    addressing, I guess because, unlike AMD64-arch CPUs, which all
>    have 64bit DMA support, popular PCI accessories and supporting
>    hardware out there, like bridges, have DMA functionality limited
>    to 32bit addressing.
>

My read of that thread, particularly Theo's comments, is that no one
actually demonstrated a case where lack of 64bit DMA caused any problems or
limitations.

If you have a system and use case where lack of 64bit DMA creates a
performance limitation, then describe it and, *more importantly*, *why*
you think the DMA limit is involved.


Philip Guenther


AMD64 buffer cache 4GB cap anything new, multiqueueing plans? ("64bit DMA on amd64" cont)

2018-11-06 Thread Joseph Mayer
Hi,

Previously there was a years-long thread about a 4GB (32bit) buffer
cache constraint on AMD64, ref
https://marc.info/?t=14682443664&r=1&w=2 .

What I gather is,

 * The problem is that on AMD64, DMA is limited to 32bit
   addressing, I guess because, unlike AMD64-arch CPUs, which all
   have 64bit DMA support, popular PCI accessories and supporting
   hardware out there, like bridges, have DMA functionality limited
   to 32bit addressing.

   (Is this a limitation of lower-quality hardware, or of very old PCI
   devices, or is it systemic to the whole AMD64 ecosystem today?

   Could a system be configured to use 64bit DMA on AMD64 and be
   expected to work, presuming recent or higher-quality /
   well-selected hardware?)

 * The OS asks the disk hardware to load disk data into given memory
   locations via DMA, and then userland fread() and mmap() are fed
   that data - no further data moving or mapping is needed. These are
   the dynamics leading to the 4GB cap.

   And the 4GB cap is quite constraining for any computer with much
   RAM and lots of disk reading: it means many reads that wouldn't
   need to hit the disk (as the data could be cached using all this
   free memory) aren't cached and go to disk anyhow, which takes a
   lot of time, yes?

 * This was recognized a long time ago, and Bob wrote a solution in
   the form of a "buffer cache flipper" that would push buffer cache
   data out of the 32bit area (to "high memory", as in >32bit), hence
   lifting the limit, via a "(generic) backpressure" mechanism. As a
   bonus it used the DMA engine to do the memory moving, which I
   guess means the buffer cache would be pretty much zero-cost to the
   CPU - sounds incredibly neat!

   And then it didn't really work: it malfunctioned and irritated
   people (was "busted" - for unknown reasons; actually, why was it?)
   and Theo wrote it would be fixed in the future.
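To make the flipper idea concrete, here is a minimal userland C sketch
of the concept as I understand it from the thread - all names are made
up for illustration, this is not OpenBSD's actual vfs_bio code: buffers
land in a small DMA-reachable zone, and under pressure get flipped to
unconstrained high memory, where they stay cached.

/*
 * Hypothetical sketch of the "flipper" concept: buffers land in a
 * small DMA-reachable zone and under pressure are "flipped" (moved)
 * into unconstrained high memory, where they remain cached.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BUFSZ     4096
#define DMA_SLOTS 4                 /* stand-in for the 4GB DMA zone */

struct cbuf {
	long  blkno;                /* which disk block this caches */
	char *data;                 /* BUFSZ bytes, "dma" or "high" */
	int   in_dma;               /* 1 = DMA zone, 0 = high memory */
};

static struct cbuf cache[64];
static int ncache, ndma;

/* Under DMA-zone pressure, move the oldest DMA buffer to high memory. */
static void
flip_one_high(void)
{
	for (int i = 0; i < ncache; i++) {
		if (cache[i].in_dma) {
			char *high = malloc(BUFSZ); /* "high" memory */
			memcpy(high, cache[i].data, BUFSZ);
			free(cache[i].data);
			cache[i].data = high;
			cache[i].in_dma = 0;
			ndma--;
			return;
		}
	}
}

/* "Read" a block: a cache hit needs no DMA reachability at all. */
static char *
read_block(long blkno)
{
	for (int i = 0; i < ncache; i++)
		if (cache[i].blkno == blkno)
			return cache[i].data;   /* hit, dma or high */
	if (ndma == DMA_SLOTS)
		flip_one_high();                /* make room below "4GB" */
	struct cbuf *b = &cache[ncache++];
	b->blkno = blkno;
	b->in_dma = 1;
	b->data = malloc(BUFSZ);                /* "dma" memory */
	ndma++;
	snprintf(b->data, BUFSZ, "block %ld", blkno); /* fake device DMA */
	return b->data;
}

int
main(void)
{
	for (long i = 0; i < 8; i++)
		read_block(i);
	printf("%s is a hit; %d bufs still in the dma zone\n",
	    read_block(2), ndma);
	return 0;
}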


Has it been fixed since?


Also - when fixed, fread() and mmap() reads of data that's in the
buffer cache will be incredibly fast, right? As, in optimal conditions,
the mmap'ed addresses will already be mapped to the buffer cache data,
mmap'ed buffer cache reads will then have the speed of any memory
access, right?
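As a plain POSIX illustration of that access pattern (nothing
OpenBSD-specific): once the file's pages are cached and mapped, the
loop below is ordinary memory access, with no syscall or disk I/O per
byte.

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(int argc, char *argv[])
{
	struct stat st;
	int fd;

	if (argc != 2 || (fd = open(argv[1], O_RDONLY)) == -1)
		return 1;
	if (fstat(fd, &st) == -1)
		return 1;
	char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
	if (p == MAP_FAILED)
		return 1;
	unsigned long sum = 0;
	for (off_t i = 0; i < st.st_size; i++)
		sum += (unsigned char)p[i];     /* cache-hot: memory speed */
	printf("sum %lu over %lld bytes\n", sum, (long long)st.st_size);
	munmap(p, st.st_size);
	close(fd);
	return 0;
}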


(The ML thread also mentioned an undeadly.org post discussing this
topic; however, both searching and browsing, I can't find it. The
closest I find is five words here,
https://undeadly.org/cgi?action=article;sid=20170815171854 - do you
have any URL?)


Last, OpenBSD's biggest limit as an OS seems to be that the disk/file
subsystem is sequential. A modern SSD can read at 2.8GB/sec, but that
requires parallelism; without multiqueueing, and with small reads,
e.g. 4KB or smaller, speeds stay around 70-120MB/sec = ~3.5% of the
hardware's potential performance. This would be a really worthy goal
to donate to, for instance, in particular as OpenBSD leads the way in
many other areas.
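
To illustrate the kind of parallelism meant here, a sketch (not a
benchmark) of several threads issuing independent pread() calls over
disjoint stripes of one file - the access pattern that multiqueue
hardware can service concurrently:

#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

#define NTHREADS 8
#define CHUNK    (1024 * 1024)

static int fd;

static void *
reader(void *arg)
{
	long id = (long)arg;
	char *buf = malloc(CHUNK);

	/* Each thread reads its own disjoint stripe of the file. */
	for (off_t off = (off_t)id * CHUNK; ;
	    off += (off_t)NTHREADS * CHUNK)
		if (pread(fd, buf, CHUNK, off) <= 0)
			break;
	free(buf);
	return NULL;
}

int
main(int argc, char *argv[])
{
	pthread_t t[NTHREADS];

	if (argc != 2 || (fd = open(argv[1], O_RDONLY)) == -1)
		return 1;
	for (long i = 0; i < NTHREADS; i++)
		pthread_create(&t[i], NULL, reader, (void *)i);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(t[i], NULL);
	close(fd);
	return 0;
}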

Are there any thoughts about implementing this in the future?

Thanks,
Joseph



Re: 64bit DMA on amd64

2017-11-11 Thread Philip Guenther
On Sat, Nov 11, 2017 at 4:22 PM,  wrote:

> Theo 2016-07-11 15:09:48,
> https://marc.info/?l=openbsd-tech&m=146824981122013&w=2 ,
> https://marc.info/?l=openbsd-tech&m=146825098022380&w=2 :
> > And bufs don't need it either.  Have you actually cranked your buffer
> > cache that high?  I have tested this, on sparc64 which has unlimited DMA
> > reach due to the iommu.  The system comes to a crawl when there are
> > too many mbufs or bufs, probably due to management structures unable
> > to handle the pressure.
>
> Theo 2016-07-11 16:16:13 ,
> https://marc.info/?l=openbsd-tech&m=146825379723312&w=2 :
> > I was simply pointing out that massive (well above 4GB) buffer cache
> > on a 64-bit DMA-reachable machine worked poorly.  Likely due to data
> > structures managing the memory with rather large O...
>
> What algorithms drive the buffer cache structure now?


If I recall Bob and Ted's undeadly posts correctly, the buffers are in
both per-vnode red-black trees and a global 2Q structure that manages
the total set of buffers.

(How those names will be useful I don't know.)
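For anyone unfamiliar with 2Q, the rough idea: a buffer goes onto a
cold queue on first use and is promoted to a hot queue on its second
reference, so a one-shot sequential scan cannot flush the hot working
set. A toy sketch, not OpenBSD's actual bufcache code:

#include <sys/queue.h>
#include <stddef.h>

struct qbuf {
	TAILQ_ENTRY(qbuf) entries;
	int hot;                        /* 0 = cold queue, 1 = hot queue */
};
TAILQ_HEAD(bufq, qbuf);

static struct bufq cold = TAILQ_HEAD_INITIALIZER(cold);
static struct bufq hot  = TAILQ_HEAD_INITIALIZER(hot);

/* Called on every insertion or cache hit. */
void
touch(struct qbuf *b, int isnew)
{
	if (isnew) {
		b->hot = 0;                     /* first touch: cold */
		TAILQ_INSERT_TAIL(&cold, b, entries);
	} else if (!b->hot) {
		TAILQ_REMOVE(&cold, b, entries);
		b->hot = 1;                     /* second touch: promote */
		TAILQ_INSERT_TAIL(&hot, b, entries);
	} else {
		TAILQ_REMOVE(&hot, b, entries); /* keep hot in LRU order */
		TAILQ_INSERT_TAIL(&hot, b, entries);
	}
}

/* Eviction prefers cold buffers, then the LRU end of the hot queue. */
struct qbuf *
evict(void)
{
	struct qbuf *b = TAILQ_FIRST(&cold);

	if (b == NULL)
		b = TAILQ_FIRST(&hot);
	if (b != NULL)
		TAILQ_REMOVE(b->hot ? &hot : &cold, b, entries);
	return b;
}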


Philip Guenther


Re: 64bit DMA on amd64

2017-11-11 Thread tinkr
Theo 2016-07-11 15:09:48,
https://marc.info/?l=openbsd-tech&m=146824981122013&w=2 ,
https://marc.info/?l=openbsd-tech&m=146825098022380&w=2 :
> And bufs don't need it either.  Have you actually cranked your buffer
> cache that high?  I have tested this, on sparc64 which has unlimited DMA
> reach due to the iommu.  The system comes to a crawl when there are
> too many mbufs or bufs, probably due to management structures unable
> to handle the pressure.

Theo 2016-07-11 16:16:13 ,
https://marc.info/?l=openbsd-tech&m=146825379723312&w=2 :
> I was simply pointing out that massive (well above 4GB) buffer cache
> on a 64-bit DMA-reachable machine worked poorly.  Likely due to data
> structures managing the memory with rather large O...

What algorithms drive the buffer cache structure now?

Re: 64bit DMA on amd64

2016-11-13 Thread Tinker

(Reply to misc@ I presume.)

Hi Theo / list,

Some humble follow-up questions regarding the previous buffer cache
conversation. In particular, I'm curious what the crawl was that you
saw in the very large buffer cache test you made on sparc64?


On 2016-07-12 00:16, Theo de Raadt wrote:
[...]

The buffer cache flipper was going to give us very large buffer cache
compared to other systems.  Until it is finished, we are still doing
fine.


What do you mean by very large compared to other systems; do other
OSes have any limit to it within their software architecture? Just to
get the idea.


[...]

I was simply pointing out that massive (well above 4GB) buffer cache
on a 64-bit DMA-reachable machine worked poorly.  Likely due to data
structures managing the memory with rather large O...


(What did you mean by "O..."?)

On 2016-07-11 23:09, Theo de Raadt wrote:
[...]

And bufs don't need it either.  Have you actually cranked your buffer
cache that high?  I have tested this, on sparc64 which has unlimited DMA
reach due to the iommu.  The system comes to a crawl when there are
too many mbufs or bufs, probably due to management structures unable
to handle the pressure.


At what kind of sizes does it start to crawl? How is the crawling
experienced at the user level? Why the crawling / what kind of
pressure on management structures are we talking about? And how can it
be CPU-expensive?


(
On 2016-07-11 23:29, Theo de Raadt wrote:
[...]

BTW, my tests were on a 128GB sun4v machine.  Sun T5140.  They are
actually fairly cheap used these days.


Not sure how that affects the benchmark, as I don't understand the
performance characteristics of the Sun T2 CPU,
http://johnjmclaughlin.blogspot.hk/2007/10/utrasparc-t2-server-benchmark-results.html
)


On 2016-07-12 00:07, Mark Kettenis wrote:
[...]

Except that the flipper isn't enabled yet and that the backpressure
mechanism is busted somehow.  At least that is what the recent
experiment with cranking up the buffer cache limit showed us.  People
screamed and we backed the change out again.  And there were problems
on amd64 and sparc64 alike.


What function does/would the backpressure mechanism serve on sparc64?

Also, last and very much secondarily: if you have any guess on whether
ARM64 and Power8 would have 64bit DMA (and hence, like sparc64, no
buffer cache size limit) or not (and hence be like AMD64 with a 32bit
buffer cache size limit), that would be interesting to know.


Thanks!
Tinker



Re: 64bit DMA on amd64

2016-07-11 Thread Theo de Raadt
> Except that the flipper isn't enabled yet and that the backpressure
> mechanism is busted somehow.  At least that is what the recent
> experiment with cranking up the buffer cache limit showed us. 

> People screamed and we backed the change out again.  And there were
> problems on amd64 and sparc64 alike.

Which means the generic backpressure mechanism is busted.  As a
result, we currently rely on the 4GB dma limit as a forwardpressure
subsystem, and tuneables which keep the buffer cache small.
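
The tuneable in question is, I believe, kern.bufcachepercent; a
minimal sketch reading it via sysctl(2), assuming the
KERN_BUFCACHEPERCENT MIB name:

#include <sys/param.h>
#include <sys/sysctl.h>
#include <stdio.h>

int
main(void)
{
	int mib[2] = { CTL_KERN, KERN_BUFCACHEPERCENT };
	int percent;
	size_t len = sizeof(percent);

	if (sysctl(mib, 2, &percent, &len, NULL, 0) == -1)
		return 1;
	printf("bufcache capped at %d%% of DMA-reachable memory\n",
	    percent);
	return 0;
}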

The buffer cache flipper was going to give us very large buffer cache
compared to other systems.  Until it is finished, we are still doing
fine.

> What we probably need is help fixing the buffer cache.  Then we can
> enable the flipper.  And then we see if 64-bit DMA is still a
> requirement.

I was simply pointing out that massive (well above 4GB) buffer cache
on a 64-bit DMA-reachable machine worked poorly.  Likely due to data
structures managing the memory with rather large O...

Chasing DMA-reachability on a theory that it helps some subsystem...
some substantiation is required.  In my experience (and I think
yours), there are other hurdles.



Re: 64bit DMA on amd64

2016-07-11 Thread Mark Kettenis
> From: "Theo de Raadt" 
> Date: Mon, 11 Jul 2016 09:29:16 -0600
> 
> > > And bufs don't need it either.  Have you actually cranked your buffer
> > > cache that high?  I have tested this, on sparc64 which has unlimited DMA
> > > reach due to the iommu.  The system comes to a crawl when there are
> > > too many mbufs or bufs, probably due to management structures unable
> > > to handle the pressure.
> > 
> > No, I didn't know that. I assumed that having a few more GBs of bufcache 
> > would help the performance. Until that is the case, 64bit dma does not 
> > make much sense.
> 
> BTW, my tests were on a 128GB sun4v machine.  Sun T5140.  They are
> actually fairly cheap used these days.
> 
> A maximum sized buffer cache should be fast.  However there is no need
> for it to be dma-reachable.  Bob's buffer cache flipper can bounce it
> to high memory easily after it is read the first time, and preserve it
> in otherwise unused memory.  A buffer cache object of that sort is
> never written back to the io path.  Also, it can be discarded in any
> memory shortage condition without cost.

Except that the flipper isn't enabled yet and that the backpressure
mechanism is busted somehow.  At least that is what the recent
experiment with cranking up the buffer cache limit showed us.  People
screamed and we backed the change out again.  And there were problems
on amd64 and sparc64 alike.

What we probably need is help fixing the buffer cache.  Then we can
enable the flipper.  And then we see if 64-bit DMA is still a
requirement.



Re: 64bit DMA on amd64

2016-07-11 Thread Theo de Raadt
> On Mon, 11 Jul 2016, Theo de Raadt wrote:
> > > No, I didn't know that. I assumed that having a few more GBs of bufcache 
> > > would help the performance. Until that is the case, 64bit dma does not 
> > > make much sense.
> > 
> > BTW, my tests were on a 128GB sun4v machine.  Sun T5140.  They are
> > actually fairly cheap used these days.
> > 
> > A maximum sized buffer cache should be fast.  However there is no need
> > for it to be dma-reachable.  Bob's buffer cache flipper can bounce it
> > to high memory easily after it is read the first time, and preserve it
> > in otherwise unused memory.  A buffer cache object of that sort is
> > never written back to the io path.  Also, it can be discarded in any
> > memory shortage condition without cost.
> 
> But flipping buffers is not without cost. Especially for an SSD at rates of 
> >200 MB/s (or even > 500 MB/s). With 64bit DMA, one could have a large 
> buffer cache without this cost. But actual benchmarks would be required to 
> see how relevant this is.

Stefan -- you don't understand the system.

Buffers are not flipped at the moment of read or write.  They are read
into available dma memory.  They are used by the process immediately,
without latency.  At a later time, when they are about to be thrown
away (to conserve dma memory), they are not thrown away but
asynchronously / low-cost flipped to high memory, and conserved.  Then
future reads can find that the on-disk blocks are still cached in
(high) memory.  DMA reachability is not required to copy that memory
to processes.

You are suggesting that buf storage is latency sensitive.  That is not
the case.



Re: 64bit DMA on amd64

2016-07-11 Thread Stefan Fritsch
On Mon, 11 Jul 2016, Theo de Raadt wrote:
> > No, I didn't know that. I assumed that having a few more GBs of bufcache 
> > would help the performance. Until that is the case, 64bit dma does not 
> > make much sense.
> 
> BTW, my tests were on a 128GB sun4v machine.  Sun T5140.  They are
> actually fairly cheap used these days.
> 
> A maximum sized buffer cache should be fast.  However there is no need
> for it to be dma-reachable.  Bob's buffer cache flipper can bounce it
> to high memory easily after it is read the first time, and preserve it
> in otherwise unused memory.  A buffer cache object of that sort is
> never written back to the io path.  Also, it can be discarded in any
> memory shortage condition without cost.

But flipping buffers is not without cost. Especially for an SSD at rates of 
>200 MB/s (or even > 500 MB/s). With 64bit DMA, one could have a large 
buffer cache without this cost. But actual benchmarks would be required to 
see how relevant this is.



Re: 64bit DMA on amd64

2016-07-11 Thread Theo de Raadt
> > And bufs don't need it either.  Have you actually cranked your buffer
> > cache that high?  I have tested this, on sparc64 which has unlimited DMA
> > reach due to the iommu.  The system comes to a crawl when there are
> > too many mbufs or bufs, probably due to management structures unable
> > to handle the pressure.
> 
> No, I didn't know that. I assumed that having a few more GBs of bufcache 
> would help the performance. Until that is the case, 64bit dma does not 
> make much sense.

BTW, my tests were on a 128GB sun4v machine.  Sun T5140.  They are
actually fairly cheap used these days.

A maximum sized buffer cache should be fast.  However there is no need
for it to be dma-reachable.  Bob's buffer cache flipper can bounce it
to high memory easily after it is read the first time, and preserve it
in otherwise unused memory.  A buffer cache object of that sort is
never written back to the io path.  Also, it can be discarded in any
memory shortage condition without cost.



Re: 64bit DMA on amd64

2016-07-11 Thread Stefan Fritsch
On Mon, 11 Jul 2016, Theo de Raadt wrote:

> > Openbsd on amd64 assumes that DMA is only possible to the lower 4GB.
> 
> Not exactly.  On an architecture-by-architecture basis, OpenBSD is
> capable of insisting DMA reachable memory only lands in a smaller zone
> of memory -- because it makes the other layers of code easier.
> 
> > More interesting would be bufs and mbufs.
> 
> Why is it interesting for mbufs?  Please describe the environment
> where anywhere near that many mbufs make sense.
>
> And bufs don't need it either.  Have you actually cranked your buffer
> > cache that high?  I have tested this, on sparc64 which has unlimited DMA
> reach due to the iommu.  The system comes to a crawl when there are
> too many mbufs or bufs, probably due to management structures unable
> to handle the pressure.

No, I didn't know that. I assumed that having a few more GBs of bufcache 
would help the performance. Until that is the case, 64bit dma does not 
make much sense.

> 
> What is the usage case for this diff, if it cannot be enabled?
> 



Re: 64bit DMA on amd64

2016-07-11 Thread Theo de Raadt
> BTW, for usb devices, it probably depends on the host controller if 64bit 
> dma is possible or not. I guess most xhci controllers will be able to do 
> it.

The 4GB limitation is a simple solution to a wide variety of problems.

Please describe a situation where 4GB of dma memory is a limitation.

> > That said, I'm not 100% convinced the fear of bounce buffers is justified. 
> > If
> > a USB device requires bouncing, it's already pretty slow. What are we
> > optimizing for again?
> 
> True for spinning disks or usb storage sticks. But an SSD attached via USB 
> 3.x is not slow.

The buffer cache is capable of doing flipping at the right point.

I still cannot identify a need for 64 bit dma.   Why?



Re: 64bit DMA on amd64

2016-07-11 Thread Theo de Raadt
> Openbsd on amd64 assumes that DMA is only possible to the lower 4GB.

Not exactly.  On an architecture-by-architecture basis, OpenBSD is
capable of insisting DMA reachable memory only lands in a smaller zone
of memory -- because it makes the other layers of code easier.

> More interesting would be bufs and mbufs.

Why is it interesting for mbufs?  Please describe the environment
where anywhere near that many mbufs make sense.

And bufs don't need it either.  Have you actually cranked your buffer
cache that high?  I have tested this, on sparc64 which has unlimited DMA
reach due to the iommu.  The system comes to a crawl when there are
too many mbufs or bufs, probably due to management structures unable
to handle the pressure.

What is the usage case for this diff, if it cannot be enabled?



Re: 64bit DMA on amd64

2016-07-11 Thread Stefan Fritsch
On Mon, 11 Jul 2016, Ted Unangst wrote:

> Stefan Fritsch wrote:
> > On Mon, 11 Jul 2016, Reyk Floeter wrote:
> > > The intentional 4GB limit is for forwarding: what if you forward mbufs 
> > > from a 64bit-capable interface to another one that doesn't support 64bit 
> > > DMA? And even if you would only enable it if all interfaces are 
> > > 64bit-capable, what if you plug in a 32bit USB/hotplug interface? We did 
> > > not want to support bounce buffers in OpenBSD.
> > 
> > Yes, I have understood that. My mail was more about non-mbuf DMA: Does it 
> > make sense to allow 64bit DMA in other cases while keeping the 4GB 
> > limitation for mbufs?
> 
> every kind of device can be attached via usb now. for the code that supports
> flipping, like bufcache, this is still tricky to handle dynamic limit changes.
> what happens to buffers marked DMA that suddenly aren't?

I guess the flipping would have to be done just before the device driver 
is called, but after it is clear which device driver will be called. Not 
sure if that is feasible or worth the effort. That's what I wanted to find 
out with my mail ;)

BTW, for usb devices, it probably depends on the host controller if 64bit 
dma is possible or not. I guess most xhci controllers will be able to do 
it.
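
For what it's worth, xHCI advertises this per controller: by my
reading of the spec, bit 0 (AC64) of the HCCPARAMS1 capability
register at offset 0x10 says whether the controller can address 64
bits. A sketch of the check, where read_reg32() is a hypothetical
stand-in for the driver's MMIO accessor (bus_space_read_4(9) on
OpenBSD):

#include <stdint.h>

#define XHCI_HCCPARAMS1 0x10            /* capability space offset */
#define XHCI_HCC_AC64   (1U << 0)       /* 64-bit addressing capable */

extern uint32_t read_reg32(uint32_t offset);    /* hypothetical */

int
xhci_has_64bit_dma(void)
{
	return (read_reg32(XHCI_HCCPARAMS1) & XHCI_HCC_AC64) != 0;
}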

> That said, I'm not 100% convinced the fear of bounce buffers is justified. If
> a USB device requires bouncing, it's already pretty slow. What are we
> optimizing for again?

True for spinning disks or usb storage sticks. But an SSD attached via USB 
3.x is not slow.

> Or something could be done to bring iommu to life.

The problem is that there are many systems that don't have any. Or for
OpenBSD in VMs, it may be expensive for the host system to emulate the
iommu.


Cheers,
Stefan



Re: 64bit DMA on amd64

2016-07-11 Thread Mark Kettenis
> From: "Ted Unangst" 
> Date: Mon, 11 Jul 2016 10:45:19 -0400
> 
> Stefan Fritsch wrote:
> > On Mon, 11 Jul 2016, Reyk Floeter wrote:
> > > The intentional 4GB limit is for forwarding: what if you forward mbufs 
> > > from a 64bit-capable interface to another one that doesn't support 64bit 
> > > DMA? And even if you would only enable it if all interfaces are 
> > > 64bit-capable, what if you plug in a 32bit USB/hotplug interface? We did 
> > > not want to support bounce buffers in OpenBSD.
> > 
> > Yes, I have understood that. My mail was more about non-mbuf DMA: Does it 
> > make sense to allow 64bit DMA in other cases while keeping the 4GB 
> > limitation for mbufs?
> 
> every kind of device can be attached via usb now. for the code that
> supports flipping, like bufcache, this is still tricky to handle
> dynamic limit changes.  what happens to buffers marked DMA that
> suddenly aren't?

Actually, as long as the usb controller implements 64-bit DMA, all
these devices should work just fine.  It's just that not all USB
controllers support this and that our uhci(4), ohci(4) and ehci(4)
drivers don't support this.

> That said, I'm not 100% convinced the fear of bounce buffers is justified. If
> a USB device requires bouncing, it's already pretty slow. What are we
> optimizing for again?

Right.  At some point the vast majority of the amd64 hardware we run
on will be 64-bit "clean".  The major issue here is that we don't
really trust all the legacy drivers to do the proper bus_dmamap_sync()
operations that are needed for bounce buffers to work.  But perhaps
that's an argument to do this sooner rather than later, so that we can
fix things while hardware is still around.
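
For reference, the sync discipline meant here, sketched for a
device-to-host ("read") transfer; the tag, map and length stand in
for real driver state, and the device start is stubbed out:

#include <sys/param.h>
#include <machine/bus.h>

void
driver_rx(bus_dma_tag_t tag, bus_dmamap_t map, bus_size_t len)
{
	/*
	 * Before the device writes into the buffer.  With bounce
	 * buffers this is where the implementation can prepare the
	 * bounce copy.
	 */
	bus_dmamap_sync(tag, map, 0, len, BUS_DMASYNC_PREREAD);

	/* ...start the hardware and wait for completion... */

	/*
	 * After completion, before the CPU reads the data.  With
	 * bounce buffers POSTREAD is where bounced data is copied
	 * back to the real buffer, so a driver that skips it works
	 * on directly-reachable memory but breaks silently once
	 * bouncing is in play.
	 */
	bus_dmamap_sync(tag, map, 0, len, BUS_DMASYNC_POSTREAD);
}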



Re: 64bit DMA on amd64

2016-07-11 Thread Ted Unangst
Stefan Fritsch wrote:
> On Mon, 11 Jul 2016, Reyk Floeter wrote:
> > The intentional 4GB limit is for forwarding: what if you forward mbufs 
> > from a 64bit-capable interface to another one that doesn't support 64bit 
> > DMA? And even if you would only enable it if all interfaces are 
> > 64bit-capable, what if you plug in a 32bit USB/hotplug interface? We did 
> > not want to support bounce buffers in OpenBSD.
> 
> Yes, I have understood that. My mail was more about non-mbuf DMA: Does it 
> make sense to allow 64bit DMA in other cases while keeping the 4GB 
> limitation for mbufs?

every kind of device can be attached via usb now. for the code that supports
flipping, like bufcache, this is still tricky to handle dynamic limit changes.
what happens to buffers marked DMA that suddenly aren't?

That said, I'm not 100% convinced the fear of bounce buffers is justified. If
a USB device requires bouncing, it's already pretty slow. What are we
optimizing for again?

Or something could be done to bring iommu to life.



Re: 64bit DMA on amd64

2016-07-11 Thread Mark Kettenis
> Date: Mon, 11 Jul 2016 16:10:04 +0200 (CEST)
> From: Stefan Fritsch 
> 
> On Mon, 11 Jul 2016, Reyk Floeter wrote:
> > The intentional 4GB limit is for forwarding: what if you forward mbufs 
> > from a 64bit-capable interface to another one that doesn't support 64bit 
> > DMA? And even if you would only enable it if all interfaces are 
> > 64bit-capable, what if you plug in a 32bit USB/hotplug interface? We did 
> > not want to support bounce buffers in OpenBSD.
> 
> Yes, I have understood that. My mail was more about non-mbuf DMA: Does it 
> make sense to allow 64bit DMA in other cases while keeping the 4GB 
> limitation for mbufs?

My guess is: not really.  I have a hard time coming up with a driver
that will allocate significant amounts of DMA memory that isn't a disk
or a network driver.



Re: 64bit DMA on amd64

2016-07-11 Thread Stefan Fritsch
On Mon, 11 Jul 2016, Reyk Floeter wrote:
> The intentional 4GB limit is for forwarding: what if you forward mbufs 
> from a 64bit-capable interface to another one that doesn't support 64bit 
> DMA? And even if you would only enable it if all interfaces are 
> 64bit-capable, what if you plug in a 32bit USB/hotplug interface? We did 
> not want to support bounce buffers in OpenBSD.

Yes, I have understood that. My mail was more about non-mbuf DMA: Does it 
make sense to allow 64bit DMA in other cases while keeping the 4GB 
limitation for mbufs?

Cheers,
Stefan

> 
> Reyk
> 
> > On 11.07.2016, at 15:37, Stefan Fritsch  wrote:
> > 
> > Hi,
> > 
> > following the discussion about mbufs, I have some questions about 64bit 
> > DMA in general.
> > 
> > Openbsd on amd64 assumes that DMA is only possible to the lower 4GB. But 
> > there are many devices (PCIe, virtio, ...) that can do DMA to the whole 
> > memory. Is it feasible to have known good devices opt in to 64bit DMA?
> > 
> > I have done a patch that allows virtio to do 64bit DMA. This works insofar 
> > as the queues used by the device will now be allocated above 4GB. But this 
> > is only a small amount of memory. More interesting would be bufs and 
> > mbufs.
> > 
> > 
> > For bufs, Bob added some code for copying bufs above/below 4GB. But this 
> > code only has a single flag B_DMA to denote if DMA is possible into a buf 
> > or not. Would it make sense to replace that by a mechanism that is device 
> > specific, so that we can use devices efficiently that allow 64bit DMA? 
> > Maybe a flag in the device vnode?
> > 
> > 
> > Does it make sense to commit something like the diff below (not tested 
> > much), even if it saves at most a few MB below 4GB right now?
> > 
> > Cheers,
> > Stefan
> > 
> > 
> > diff --git sys/arch/amd64/amd64/bus_dma.c sys/arch/amd64/amd64/bus_dma.c
> > index 8eaa2e7..1aba7c0 100644
> > --- sys/arch/amd64/amd64/bus_dma.c
> > +++ sys/arch/amd64/amd64/bus_dma.c
> > @@ -293,6 +293,7 @@ _bus_dmamap_load_raw(bus_dma_tag_t t, bus_dmamap_t map, 
> > bus_dma_segment_t *segs,
> > {
> > bus_addr_t paddr, baddr, bmask, lastaddr = 0;
> > bus_size_t plen, sgsize, mapsize;
> > +   struct uvm_constraint_range *constraint = t->_cookie;
> > int first = 1;
> > int i, seg = 0;
> > 
> > @@ -320,7 +321,7 @@ _bus_dmamap_load_raw(bus_dma_tag_t t, bus_dmamap_t map, 
> > bus_dma_segment_t *segs,
> > if (plen < sgsize)
> > sgsize = plen;
> > 
> > -   if (paddr > dma_constraint.ucr_high)
> > +   if (paddr > constraint->ucr_high)
> > panic("Non dma-reachable buffer at paddr 
> > %#lx(raw)",
> > paddr);
> > 
> > @@ -405,15 +406,11 @@ _bus_dmamem_alloc(bus_dma_tag_t t, bus_size_t size, 
> > bus_size_t alignment,
> > bus_size_t boundary, bus_dma_segment_t *segs, int nsegs, int *rsegs,
> > int flags)
> > {
> > +   struct uvm_constraint_range *constraint = t->_cookie;
> > 
> > -   /*
> > -* XXX in the presence of decent (working) iommus and bouncebuffers
> > -* we can then fallback this allocation to a range of { 0, -1 }.
> > -* However for now  we err on the side of caution and allocate dma
> > -* memory under the 4gig boundary.
> > -*/
> > return (_bus_dmamem_alloc_range(t, size, alignment, boundary,
> > -   segs, nsegs, rsegs, flags, (bus_addr_t)0, (bus_addr_t)0xffffffff));
> > +   segs, nsegs, rsegs, flags, (bus_addr_t)constraint->ucr_low,
> > +   (bus_addr_t)constraint->ucr_high));
> > }
> > 
> > /*
> > @@ -567,6 +564,7 @@ _bus_dmamap_load_buffer(bus_dma_tag_t t, bus_dmamap_t 
> > map, void *buf,
> > bus_size_t sgsize;
> > bus_addr_t curaddr, lastaddr, baddr, bmask;
> > vaddr_t vaddr = (vaddr_t)buf;
> > +   struct uvm_constraint_range *constraint = t->_cookie;
> > int seg;
> > pmap_t pmap;
> > 
> > @@ -584,7 +582,7 @@ _bus_dmamap_load_buffer(bus_dma_tag_t t, bus_dmamap_t 
> > map, void *buf,
> >  */
> > pmap_extract(pmap, vaddr, (paddr_t *)&curaddr);
> > 
> > -   if (curaddr > dma_constraint.ucr_high)
> > +   if (curaddr > constraint->ucr_high)
> > panic("Non dma-reachable buffer at curaddr %#lx(raw)",
> > curaddr);
> > 
> > diff --git sys/arch/amd64/amd64/machdep.c sys/arch/amd64/amd64/machdep.c
> > index de9f481..7640532 100644
> > --- sys/arch/amd64/amd64/machdep.c
> > +++ sys/arch/amd64/amd64/machdep.c
> > @@ -201,6 +201,12 @@ struct vm_map *phys_map = NULL;
> > 
> > /* UVM constraint ranges. */
> > struct uvm_constraint_range  isa_constraint = { 0x0, 0x00ffffffUL };
> > +   /*
> > +* XXX in the presence of decent (working) iommus and bouncebuffers
> > +* we can then fallback this allocation to a range of { 0, -1 }.
> > +* However for now  we err on the side of caution and allocate dma
> > +* memory under the 4gig boundary.
> 

Re: 64bit DMA on amd64

2016-07-11 Thread Reyk Floeter
Hi,

The intentional 4GB limit is for forwarding: what if you forward mbufs from a 
64bit-capable interface to another one that doesn't support 64bit DMA? And even 
if you would only enable it if all interfaces are 64bit-capable, what if you 
plug in a 32bit USB/hotplug interface? We did not want to support bounce 
buffers in OpenBSD.

Reyk

> On 11.07.2016, at 15:37, Stefan Fritsch  wrote:
> 
> Hi,
> 
> following the discussion about mbufs, I have some questions about 64bit 
> DMA in general.
> 
> Openbsd on amd64 assumes that DMA is only possible to the lower 4GB. But 
> there are many devices (PCIe, virtio, ...) that can do DMA to the whole 
> memory. Is it feasible to have known good devices opt in to 64bit DMA?
> 
> I have done a patch that allows virtio to do 64bit DMA. This works insofar 
> as the queues used by the device will now be allocated above 4GB. But this 
> is only a small amount of memory. More interesting would be bufs and 
> mbufs.
> 
> 
> For bufs, Bob added some code for copying bufs above/below 4GB. But this 
> code only has a single flag B_DMA to denote if DMA is possible into a buf 
> or not. Would it make sense to replace that by a mechanism that is device 
> specific, so that we can use devices efficiently that allow 64bit DMA? 
> Maybe a flag in the device vnode?
> 
> 
> Does it make sense to commit something like the diff below (not tested 
> much), even if it saves at most a few MB below 4GB right now?
> 
> Cheers,
> Stefan
> 
> 
> diff --git sys/arch/amd64/amd64/bus_dma.c sys/arch/amd64/amd64/bus_dma.c
> index 8eaa2e7..1aba7c0 100644
> --- sys/arch/amd64/amd64/bus_dma.c
> +++ sys/arch/amd64/amd64/bus_dma.c
> @@ -293,6 +293,7 @@ _bus_dmamap_load_raw(bus_dma_tag_t t, bus_dmamap_t map, 
> bus_dma_segment_t *segs,
> {
>   bus_addr_t paddr, baddr, bmask, lastaddr = 0;
>   bus_size_t plen, sgsize, mapsize;
> + struct uvm_constraint_range *constraint = t->_cookie;
>   int first = 1;
>   int i, seg = 0;
> 
> @@ -320,7 +321,7 @@ _bus_dmamap_load_raw(bus_dma_tag_t t, bus_dmamap_t map, 
> bus_dma_segment_t *segs,
>   if (plen < sgsize)
>   sgsize = plen;
> 
> - if (paddr > dma_constraint.ucr_high)
> + if (paddr > constraint->ucr_high)
>   panic("Non dma-reachable buffer at paddr 
> %#lx(raw)",
>   paddr);
> 
> @@ -405,15 +406,11 @@ _bus_dmamem_alloc(bus_dma_tag_t t, bus_size_t size, 
> bus_size_t alignment,
> bus_size_t boundary, bus_dma_segment_t *segs, int nsegs, int *rsegs,
> int flags)
> {
> + struct uvm_constraint_range *constraint = t->_cookie;
> 
> - /*
> -  * XXX in the presence of decent (working) iommus and bouncebuffers
> -  * we can then fallback this allocation to a range of { 0, -1 }.
> -  * However for now  we err on the side of caution and allocate dma
> -  * memory under the 4gig boundary.
> -  */
>   return (_bus_dmamem_alloc_range(t, size, alignment, boundary,
> - segs, nsegs, rsegs, flags, (bus_addr_t)0, (bus_addr_t)0xffffffff));
> + segs, nsegs, rsegs, flags, (bus_addr_t)constraint->ucr_low,
> + (bus_addr_t)constraint->ucr_high));
> }
> 
> /*
> @@ -567,6 +564,7 @@ _bus_dmamap_load_buffer(bus_dma_tag_t t, bus_dmamap_t 
> map, void *buf,
>   bus_size_t sgsize;
>   bus_addr_t curaddr, lastaddr, baddr, bmask;
>   vaddr_t vaddr = (vaddr_t)buf;
> + struct uvm_constraint_range *constraint = t->_cookie;
>   int seg;
>   pmap_t pmap;
> 
> @@ -584,7 +582,7 @@ _bus_dmamap_load_buffer(bus_dma_tag_t t, bus_dmamap_t 
> map, void *buf,
>*/
>   pmap_extract(pmap, vaddr, (paddr_t *)&curaddr);
> 
> - if (curaddr > dma_constraint.ucr_high)
> + if (curaddr > constraint->ucr_high)
>   panic("Non dma-reachable buffer at curaddr %#lx(raw)",
>   curaddr);
> 
> diff --git sys/arch/amd64/amd64/machdep.c sys/arch/amd64/amd64/machdep.c
> index de9f481..7640532 100644
> --- sys/arch/amd64/amd64/machdep.c
> +++ sys/arch/amd64/amd64/machdep.c
> @@ -201,6 +201,12 @@ struct vm_map *phys_map = NULL;
> 
> /* UVM constraint ranges. */
> struct uvm_constraint_range  isa_constraint = { 0x0, 0x00ffffffUL };
> + /*
> +  * XXX in the presence of decent (working) iommus and bouncebuffers
> +  * we can then fallback this allocation to a range of { 0, -1 }.
> +  * However for now  we err on the side of caution and allocate dma
> +  * memory under the 4gig boundary.
> +  */
> struct uvm_constraint_range  dma_constraint = { 0x0, 0xffffffffUL };
> struct uvm_constraint_range *uvm_md_constraints[] = {
> &isa_constraint,
> diff --git sys/arch/amd64/include/pci_machdep.h 
> sys/arch/amd64/include/pci_machdep.h
> index 27b833b..bf54f31 100644
> --- sys/arch/amd64/include/pci_machdep.h
> +++ sys/arch/amd64/include/pci_machdep.h
> @@ 

64bit DMA on amd64

2016-07-11 Thread Stefan Fritsch
Hi,

following the discussion about mbufs, I have some questions about 64bit 
DMA in general.

Openbsd on amd64 assumes that DMA is only possible to the lower 4GB. But 
there are many devices (PCIe, virtio, ...) that can do DMA to the whole 
memory. Is it feasible to have known good devices opt in to 64bit DMA?

I have done a patch that allows virtio to do 64bit DMA. This works insofar 
as the queues used by the device will now be allocated above 4GB. But this 
is only a small amount of memory. More interesting would be bufs and 
mbufs.


For bufs, Bob added some code for copying bufs above/below 4GB. But this 
code only has a single flag B_DMA to denote if DMA is possible into a buf 
or not. Would it make sense to replace that by a mechanism that is device 
specific, so that we can use devices efficiently that allow 64bit DMA? 
Maybe a flag in the device vnode?


Does it make sense to commit something like the diff below (not tested 
much), even if it saves at most a few MB below 4GB right now?

Cheers,
Stefan


diff --git sys/arch/amd64/amd64/bus_dma.c sys/arch/amd64/amd64/bus_dma.c
index 8eaa2e7..1aba7c0 100644
--- sys/arch/amd64/amd64/bus_dma.c
+++ sys/arch/amd64/amd64/bus_dma.c
@@ -293,6 +293,7 @@ _bus_dmamap_load_raw(bus_dma_tag_t t, bus_dmamap_t map, 
bus_dma_segment_t *segs,
 {
bus_addr_t paddr, baddr, bmask, lastaddr = 0;
bus_size_t plen, sgsize, mapsize;
+   struct uvm_constraint_range *constraint = t->_cookie;
int first = 1;
int i, seg = 0;
 
@@ -320,7 +321,7 @@ _bus_dmamap_load_raw(bus_dma_tag_t t, bus_dmamap_t map, 
bus_dma_segment_t *segs,
if (plen < sgsize)
sgsize = plen;
 
-   if (paddr > dma_constraint.ucr_high)
+   if (paddr > constraint->ucr_high)
panic("Non dma-reachable buffer at paddr 
%#lx(raw)",
paddr);
 
@@ -405,15 +406,11 @@ _bus_dmamem_alloc(bus_dma_tag_t t, bus_size_t size, 
bus_size_t alignment,
 bus_size_t boundary, bus_dma_segment_t *segs, int nsegs, int *rsegs,
 int flags)
 {
+   struct uvm_constraint_range *constraint = t->_cookie;
 
-   /*
-* XXX in the presence of decent (working) iommus and bouncebuffers
-* we can then fallback this allocation to a range of { 0, -1 }.
-* However for now  we err on the side of caution and allocate dma
-* memory under the 4gig boundary.
-*/
return (_bus_dmamem_alloc_range(t, size, alignment, boundary,
-   segs, nsegs, rsegs, flags, (bus_addr_t)0, (bus_addr_t)0xffffffff));
+   segs, nsegs, rsegs, flags, (bus_addr_t)constraint->ucr_low,
+   (bus_addr_t)constraint->ucr_high));
 }
 
 /*
@@ -567,6 +564,7 @@ _bus_dmamap_load_buffer(bus_dma_tag_t t, bus_dmamap_t map, 
void *buf,
bus_size_t sgsize;
bus_addr_t curaddr, lastaddr, baddr, bmask;
vaddr_t vaddr = (vaddr_t)buf;
+   struct uvm_constraint_range *constraint = t->_cookie;
int seg;
pmap_t pmap;
 
@@ -584,7 +582,7 @@ _bus_dmamap_load_buffer(bus_dma_tag_t t, bus_dmamap_t map, 
void *buf,
 */
	pmap_extract(pmap, vaddr, (paddr_t *)&curaddr);
 
-   if (curaddr > dma_constraint.ucr_high)
+   if (curaddr > constraint->ucr_high)
panic("Non dma-reachable buffer at curaddr %#lx(raw)",
curaddr);
 
diff --git sys/arch/amd64/amd64/machdep.c sys/arch/amd64/amd64/machdep.c
index de9f481..7640532 100644
--- sys/arch/amd64/amd64/machdep.c
+++ sys/arch/amd64/amd64/machdep.c
@@ -201,6 +201,12 @@ struct vm_map *phys_map = NULL;
 
 /* UVM constraint ranges. */
 struct uvm_constraint_range  isa_constraint = { 0x0, 0x00ffffffUL };
+   /*
+* XXX in the presence of decent (working) iommus and bouncebuffers
+* we can then fallback this allocation to a range of { 0, -1 }.
+* However for now  we err on the side of caution and allocate dma
+* memory under the 4gig boundary.
+*/
 struct uvm_constraint_range  dma_constraint = { 0x0, 0xffffffffUL };
 struct uvm_constraint_range *uvm_md_constraints[] = {
 &isa_constraint,
diff --git sys/arch/amd64/include/pci_machdep.h 
sys/arch/amd64/include/pci_machdep.h
index 27b833b..bf54f31 100644
--- sys/arch/amd64/include/pci_machdep.h
+++ sys/arch/amd64/include/pci_machdep.h
@@ -41,6 +41,7 @@
  */
 
 extern struct bus_dma_tag pci_bus_dma_tag;
+extern struct bus_dma_tag virtio_pci_bus_dma_tag;
 
 /*
  * Types provided to machine-independent PCI code
diff --git sys/arch/amd64/isa/isa_machdep.c sys/arch/amd64/isa/isa_machdep.c
index 74dc907..ec35edead 100644
--- sys/arch/amd64/isa/isa_machdep.c
+++ sys/arch/amd64/isa/isa_machdep.c
@@ -140,7 +140,7 @@ void _isa_dma_free_bouncebuf(bus_dma_tag_t, 
bus_dmamap_t);
  * buffers, if necessary.
  */
 struct bus_dma_tag isa_bus_dma_tag = {
-   NULL,