Re: Maxphys on -current?

2023-08-05 Thread Thor Lancelot Simon
On Fri, Aug 04, 2023 at 09:48:18PM +0200, Jaromír Doleček wrote:
> 
> For the branch, I particularly disliked that there were quite a few
> changes which looked either unrelated or avoidable.

There should not have been any "unrelated" changes.  I would not be
surprised if there were changes that could have been avoided.

It has been a very, very long time, but I think there are a few things
worth noting that I discovered in the course of the work I did on this
years ago.

1) It really is important to propagate maximum-transfer-size information
   down the bus hierarchy, because we have many cases where the same device
   could be connected to a different bus.  (A sketch of what I mean follows
   this list.)

2) RAIDframe and its ilk are tough to get right, because there are many ugly
   corner cases such as someone trying to replace a failed component of a
   RAID set with a spare that is attached via a bus that has a smaller
   transfer size limit.  Ensuring both that this doesn't happen and that
   errors are propagated back correctly is pretty hard.  I have a vague
   recollection that this might be one source of the "unrelated" changes you
   mention.

3) With MAXPHYS increased to a very large value, we have filesystem code that
   can behave very poorly because it uses naive readahead or write clustering
   strategies that were previously held in check only by the 64K MAXPHYS
   limit.  I didn't even make a start at handling this, honestly, and the
   apparent difficulty of getting it right is one reason I eventually decided
   I didn't have time to finish the work I started on the tls-maxphys branch.
   Beware!  Don't trust linear-read or linear-write benchmarks that say your
   work in this area is done.  You may have massacred performance for real
   world use cases other than your own.
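
To make (1) concrete, here is a minimal sketch of the clamping walk I have
in mind.  The names here are hypothetical, not NetBSD's actual interfaces;
the point is only that the effective limit for a leaf device is the minimum
over everything between it and the root:

#include <stddef.h>

/*
 * Hypothetical device-tree node: each device or bus records its own
 * transfer-size limit and a pointer to its parent bus.
 */
struct dev_node {
        struct dev_node *parent;        /* NULL at the root */
        size_t maxxfer;                 /* this layer's limit, in bytes */
};

/* Walk from the leaf toward the root, keeping the smallest limit seen. */
static size_t
effective_maxxfer(const struct dev_node *dev)
{
        size_t limit = dev->maxxfer;
        const struct dev_node *p;

        for (p = dev->parent; p != NULL; p = p->parent) {
                if (p->maxxfer < limit)
                        limit = p->maxxfer;
        }
        return limit;
}

Doing that walk when a device attaches is also what makes the RAIDframe case
in (2) tractable: a disk that advertises a large limit may still sit behind
a bus that can only do 64K, and the RAID set has to honor the smaller number.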

One thing we should probably do, if we have not already, is remove any ISA
DMA devices and some old things like the wdc and pciide IDE attachments from
the GENERIC kernels for ports like amd64, and then bump MAXPHYS to at least
128K, maybe 256K, for those kernels.  Beyond that, though, I think you will
quickly see the filesystem and paging misbehaviors I mention in #3 above.
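
For scale: MAXPHYS today is a per-port compile-time constant, so the bump
I'm suggesting amounts to changing (or overriding) a definition of roughly
this shape.  This is a sketch, not the literal contents of any port's
param.h:

/*
 * machine/param.h (sketch): the per-architecture cap on a single transfer.
 * With an #ifndef guard like this, a kernel config can override the value,
 * e.g. with something like "options MAXPHYS=0x40000".
 */
#ifndef MAXPHYS
#define MAXPHYS         (256 * 1024)    /* bumped from the usual (64 * 1024) */
#endif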

Thor


Re: Maxphys on -current?

2023-08-04 Thread J. Hannken-Illjes
> On 4. Aug 2023, at 21:48, Jaromír Doleček  wrote:

> In a broader view, I have doubts whether there is any practical reason
> to support block sizes bigger than 64KB at all.

Having tapes with record size > 64k would be a real benefit though.

--
J. Hannken-Illjes - hann...@mailbox.org





Re: Maxphys on -current?

2023-08-04 Thread Jaromír Doleček
On Fri, Aug 4, 2023 at 17:27, Jason Thorpe  wrote:
> If someone does pick this up, I think it would be a good idea to start from 
> scratch, because MAXPHYS, as it stands, is used for multiple things.  
> Thankfully, I think it would be relatively straightforward to do the work 
> that I am suggesting incrementally.
>

I believe I've been the last one to look at the tls-maxphys branch and
at least update it.

I agree that it's likely more useful to start from scratch and do
incremental updates.
Certainly physio before the block interface, and also by controller type,
e.g. ATA first and then SCSI or USB.

For the branch, I particularly disliked that there were quite a few
changes which looked either unrelated or avoidable.
As one of the first steps I planned to reduce the diff against HEAD;
I haven't gotten to it yet, however.

On Fri, Aug 4, 2023 at 08:04, Brian Buhrow  wrote:
> speed of the transfers on either system.  Interestingly enough, however,
> the FreeBSD performance is markedly worse on this test.
> ...
> NetBSD-9.99.77/amd64 with SATA3 disk
> # dd if=/dev/rwd0a of=/dev/null bs=1m count=50000
> 52428800000 bytes transferred in 292.067 secs (179509496 bytes/sec)
>
> FreeBSD-13.1/AMD64 with SATA3 disk
> # dd if=/dev/da4 of=/dev/null bs=1m count=50000
> 52428800000 bytes transferred in 322.433936 secs (162603232 bytes/sec)

Interesting. FreeBSD da(4) is a character device since FreeBSD has no
block devices anymore, so it's not a raw-vs-block device difference.
Is the hardware really similar enough to be a fair comparison?

In a broader view, I have doubts whether there is any practical reason
to support block sizes bigger than 64KB at all.

For HDDs over SATA, maybe - bigger blocks mean potentially more
sequential I/O and hence higher total throughput.
You can also queue more I/O and thereby avoid seeks - the current NetBSD
maximum on SATA is 32 x 64KiB = 2048KiB of queued I/O.
For SCSI, you can queue way more I/O than the usual disk cache can
hold even with 64KiB blocks, so bigger block size is not very
important.
Still, I doubt >64KiB blocks on HDD would achieve more than a couple
of percent increase over 64KiB ones.

For SSDs over SATA, there is no seek to worry about, but the command
latency is a concern.  Even there, though, according to Linux benchmarks,
the total transfer rate already tops out with 64KiB blocks.

For NVMe, the command latency is close to irrelevant - it has a very
low latency command interface and very deep command queues. I don't
see bigger blocks helping much, if at all.

Jaromir


Re: Maxphys on -current?

2023-08-04 Thread Jason Thorpe


> On Aug 3, 2023, at 2:19 PM, Brian Buhrow  wrote:
> 
> hello.  I know that this has been a very long term project, but I'm
> wondering about the status of this effort?  I note that FreeBSD-13 has a
> Maxphys value of 1048576 bytes.
> Have we found other ways, which I'm not aware of, to get more throughput
> from ATA disks that obviate the need for this setting?
> If not, is anyone working on this project?  The wiki page says the
> project is stalled.

If someone does pick this up, I think it would be a good idea to start from 
scratch, because MAXPHYS, as it stands, is used for multiple things.  
Thankfully, I think it would be relatively straightforward to do the work that 
I am suggesting incrementally.

Here goes...

MAXPHYS is really supposed to be “maximum transfer via physio”, which is the 
code path you use when you open /dev/rsd0e and read/write to it.  The 
user-space pages are wired and mapped into the kernel for the purpose of doing 
I/O.  MAXPHYS is a per-architecture constant because some systems have 
different constraints as to how much KVA space can be used for that at any 
given time.
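
For reference, here is the sort of userland read that ends up in physio;
nothing below is special, and all of the clamping happens in the kernel
(a sketch with abbreviated error handling; the device name is just an
example):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(void)
{
        char *buf = malloc(1024 * 1024);        /* one 1MB request */
        int fd = open("/dev/rsd0e", O_RDONLY);  /* raw character device */
        ssize_t n;

        if (buf == NULL || fd == -1) {
                perror("setup");
                return EXIT_FAILURE;
        }
        /*
         * physio wires and maps these pages into the kernel, then the
         * request is split as needed to fit the MAXPHYS limit.
         */
        n = read(fd, buf, 1024 * 1024);
        printf("read %zd bytes\n", n);
        close(fd);
        free(buf);
        return EXIT_SUCCESS;
}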

Unfortunately, some of the adjacent physio machinery (e.g. minphys()) is also 
used for other purposes, specifically to clamp I/O sizes to constraints defined 
by the physical device and/or the controllers / busses they’re connected to.
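
To illustrate the conflation: a driver's minphys routine conventionally
clamps to its own hardware limit and then falls through to the generic
clamp, so the device constraint and the physio/KVA constraint travel through
the same hook.  Roughly like this - the driver name and limit are made up:

#include <sys/param.h>
#include <sys/buf.h>

#define XX_MAX_HW_XFER  (128 * 1024)    /* hypothetical controller limit */

/*
 * Called for each buf before physio issues it.  Note that the two
 * clamps happen for two logically distinct reasons.
 */
void
xx_minphys(struct buf *bp)
{
        /* 1. device/controller constraint */
        if (bp->b_bcount > XX_MAX_HW_XFER)
                bp->b_bcount = XX_MAX_HW_XFER;

        /* 2. generic physio constraint: clamps to MAXPHYS */
        minphys(bp);
}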

Logically, these are two totally separate things, and IMO they should be 
cleanly separated.

What we *should* have is the notion of “I/O parameters” that are defined by
the device… max I/O size, max queue depth, preferred I/O size, preferred I/O
alignment, physical block size, logical block size, etc.  The base values for
these parameters should come from the leaf device (e.g. the disk), and then
be clamped as needed by its connective tissue (the controller, the system bus
the controller is connected to, and ultimately the platform-specific e.g. DMA
constraints).
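
A rough sketch of the shape I mean - every name here is invented for
illustration, not a proposed API:

#include <stddef.h>

/*
 * Per-device I/O parameters.  Base values come from the leaf device;
 * each layer above it clamps what it cares about at attach time.
 */
struct io_params {
        size_t   maxxfer;       /* hard maximum transfer size */
        size_t   pref_xfer;     /* preferred transfer size */
        size_t   pref_align;    /* preferred I/O alignment */
        size_t   pblocksize;    /* physical block size */
        size_t   lblocksize;    /* logical block size */
        unsigned queue_depth;   /* maximum outstanding commands */
};

/*
 * Clamp a device's parameters against one layer of connective tissue
 * (controller, bus, or platform DMA constraints).
 */
static void
io_params_clamp(struct io_params *p, const struct io_params *limit)
{
        if (limit->maxxfer < p->maxxfer)
                p->maxxfer = limit->maxxfer;
        if (limit->pref_xfer < p->pref_xfer)
                p->pref_xfer = limit->pref_xfer;
        if (limit->queue_depth < p->queue_depth)
                p->queue_depth = limit->queue_depth;
}

The consumers described below - physio, the buffer cache, the UBC - would
then read these parameters and apply their own caps on top, rather than
everyone sharing a single MAXPHYS.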

The interface layers (the page cache / UBC, the traditional block I/O
buffer cache, and the physio interface for user-space) can then further
impose their own constraints, as necessary per their API contract.  There is
zero reason that MAXPHYS should limit the maximum I/O that a file system can
do via the UBC, for example.

(In a perfect world, we wouldn’t even have to consume virtual address space
to bring data in and out of the page cache / UBC, because we already know
the physical addresses of the pages that are being pulled in / cleaned.)

> Any thoughts or news would be greatly appreciated.

Anyway, those are mine :-)

-- thorpej



Re: Maxphys on -current?

2023-08-04 Thread Michael van Elst
On Thu, Aug 03, 2023 at 11:04:18PM -0700, Brian Buhrow wrote:

> speed of the transfers on either system.  Interestingly enough, however,
> the FreeBSD performance is markedly worse on this test.


162MB/s or 179MB/s is just the speed of the disk, so I would guess the
disks are different.

There might also be some difference in command queuing parameters,
some disks get slower for this kind of test.


-- 
Michael van Elst
Internet: mlel...@serpens.de
"A potential Snark may lurk in every tree."


Re: Maxphys on -current?

2023-08-04 Thread Edgar Fuß
Hasn't there been a tls-maxphys branch?


Re: Maxphys on -current?

2023-08-04 Thread Brian Buhrow
hello.  Michael's e-mail explains the behavior I'm seeing when trying
different block sizes with NetBSD and FreeBSD.
The scripts below show transfers of the same number of bytes using 1m and
64k block sizes for NetBSD-9.99.77/amd64 and FreeBSD-13.1/amd64.  NetBSD is
using SATA3 disks with NCQ enabled and FreeBSD is using SATA3 disks with
command queueing enabled.  The block size doesn't change the speed of the
transfers on either system.  Interestingly enough, however, the FreeBSD
performance is markedly worse on this test.

-thanks
-Brian


NetBSD-9.99.77/amd64 with SATA3 disk
# dd if=/dev/rwd0a of=/dev/null bs=1m count=50000
50000+0 records in
50000+0 records out
52428800000 bytes transferred in 292.078 secs (179502735 bytes/sec)
# dd if=/dev/rwd0a of=/dev/null bs=65536 count=800000
800000+0 records in
800000+0 records out
52428800000 bytes transferred in 292.067 secs (179509496 bytes/sec)

FreeBSD-13.1/AMD64 with SATA3 disk
# dd if=/dev/da4 of=/dev/null bs=1m count=50000
50000+0 records in
50000+0 records out
52428800000 bytes transferred in 322.807433 secs (162415095 bytes/sec)
# dd if=/dev/da4 of=/dev/null bs=65536 count=800000
800000+0 records in
800000+0 records out
52428800000 bytes transferred in 322.433936 secs (162603232 bytes/sec)


Re: Maxphys on -current?

2023-08-03 Thread Michael van Elst
g...@lexort.com (Greg Troxel) writes:

>When you run dd with bs=64k and then bs=1m, how different are the
>results?  (I believe raw requests happen accordingly, vs MAXPHYS for fs
>etc. access.)

'raw requests' are split into MAXPHYS-sized chunks.  While using bs=1m
reduces the syscall overhead somewhat, the major effect is that the
system will issue requests for all 16 chunks (1M / MAXPHYS) concurrently.
16 chunks is also the maximum, so between bs=1m and bs=2m the difference
is only the reduced syscall overhead.
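
A toy illustration of that split, using the amd64 default of 64K for
MAXPHYS (plain userland C, arithmetic only):

#include <stdio.h>

#define MAXPHYS (64 * 1024)     /* amd64 default */

int
main(void)
{
        size_t resid = 1024 * 1024;     /* dd bs=1m: one 1MB request */
        int chunks = 0;

        /*
         * The request is carved into MAXPHYS-sized pieces, all of which
         * can be in flight at the same time.
         */
        while (resid > 0) {
                size_t len = resid < MAXPHYS ? resid : MAXPHYS;
                resid -= len;
                chunks++;
        }
        printf("%d concurrent chunks\n", chunks);       /* prints 16 */
        return 0;
}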

The filesystem can do something similar: asynchronous writes are also
issued in parallel, and for reading it may choose to read ahead blocks to
optimize I/O requests, also for up to 16 chunks.  In reality, large
contiguous I/O rarely happens, and the current UVM overhead (e.g. mapping
buffers) becomes more significant the faster your drive is.

A larger MAXPHYS also reduces SATA command overhead; that's up to 10%
that you might gain on SATA3 (6Gbps), assuming you manage to do large
contiguous I/O.

NVMe is a different thing.  While the hardware command overhead is
negligible, you can mitigate software overhead by using larger chunks
for I/O, and the gain can be much higher, at least for raw I/O.



Re: Maxphys on -current?

2023-08-03 Thread Greg Troxel
Brian Buhrow  writes:

> hello.  I know that this has been a very long term project, but I'm
> wondering about the status of this effort?  I note that FreeBSD-13 has a
> Maxphys value of 1048576 bytes.
> Have we found other ways, which I'm not aware of, to get more throughput
> from ATA disks that obviate the need for this setting?
> If not, is anyone working on this project?  The wiki page says the
> project is stalled.

I haven't heard that anyone is.

When you run dd with bs=64k and then bs=1m, how different are the
results?  (I believe raw requests are issued at the size you ask for,
whereas fs etc. access is limited by MAXPHYS.)


Maxphys on -current?

2023-08-03 Thread Brian Buhrow
hello.  I know that this has been a very long term project, but I'm
wondering about the status of this effort?  I note that FreeBSD-13 has a
Maxphys value of 1048576 bytes.
Have we found other ways, which I'm not aware of, to get more throughput
from ATA disks that obviate the need for this setting?
If not, is anyone working on this project?  The wiki page says the project
is stalled.

Any thoughts or news would be greatly appreciated.

-thanks
-Brian