Re: Maxphys on -current?
On Fri, Aug 04, 2023 at 09:48:18PM +0200, Jaromír Doleček wrote:
>
> For the branch, I particularly disliked that there were quite a few
> changes which looked either unrelated, or avoidable.

There should not have been any "unrelated" changes. I would not be surprised if there were changes that could have been avoided.

It has been a very, very long time, but I think there are a few things worth noting that I discovered in the course of the work I did on this years ago.

1) It really is important to propagate maximum-transfer-size information down the bus hierarchy, because we have many cases where the same device could be connected to a different bus.

2) RAIDframe and its ilk are tough to get right, because there are many ugly corner cases such as someone trying to replace a failed component of a RAID set with a spare that is attached via a bus that has a smaller transfer size limit. Ensuring both that this doesn't happen and that errors are propagated back correctly is pretty hard. I have a vague recollection that this might be one source of the "unrelated" changes you mention.

3) With MAXPHYS increased to a very large value, we have filesystem code that can behave very poorly because it uses naive readahead or write clustering strategies that were only previously held in check by the 64K MAXPHYS limit. I didn't even make a start at handling this, honestly, and the apparent difficulty of getting it right is one reason I eventually decided I didn't have time to finish the work I started on the tls-maxphys branch. Beware! Don't trust linear-read or linear-write benchmarks that say your work in this area is done. You may have massacred performance for real world use cases other than your own.

One thing we should probably do, if we have not already, is remove any ISA DMA devices and some old things like the wdc and pciide IDE attachments from the GENERIC kernels for ports like amd64, and then bump MAXPHYS to at least 128K, maybe 256K, for those kernels.

Beyond that, though, I think you will quickly see the filesystem and paging misbehaviors I mention in #3 above.

Thor
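To make the corner case in point 2 concrete, here is a minimal sketch of the check it implies. It is a user-space model, not RAIDframe code: the `Component` type, the function names, the device names and the sizes are all hypothetical, and rejecting the spare is only one possible policy (lowering the limit of the whole set would be another).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """Hypothetical model of one RAID component and its bus path."""
    name: str
    max_xfer: int  # largest single transfer the path to this disk accepts, in bytes


def set_transfer_limit(components):
    """The set can only issue transfers that every component can accept."""
    return min(c.max_xfer for c in components)


def check_spare(components, spare):
    """Return None if the spare is acceptable, else an error message
    that the caller is expected to propagate back to the requester."""
    limit = set_transfer_limit(components)
    if spare.max_xfer < limit:
        return (f"{spare.name}: max transfer {spare.max_xfer} bytes is below "
                f"the set's limit of {limit} bytes")
    return None


# Hypothetical set: two disks whose bus paths allow 128 KiB transfers.
raid = [Component("wd0", 128 * 1024), Component("wd1", 128 * 1024)]

print(check_spare(raid, Component("sd0", 256 * 1024)))  # acceptable spare
print(check_spare(raid, Component("wd2", 64 * 1024)))   # spare on a smaller bus
```

The hard part described above is not this comparison but making sure it runs on every path that can add a component, and that the error actually reaches the caller.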
Re: Maxphys on -current?
> On 4. Aug 2023, at 21:48, Jaromír Doleček wrote:
>
> In a broader view, I have doubts if there is any practical reason to
> even have support for bigger than 64kb block size support at all.

Having tapes with record size > 64k would be a real benefit though.

--
J. Hannken-Illjes - hann...@mailbox.org
Re: Maxphys on -current?
On Fri, Aug 4, 2023 at 17:27, Jason Thorpe wrote:
> If someone does pick this up, I think it would be a good idea to start from
> scratch, because MAXPHYS, as it stands, is used for multiple things.
> Thankfully, I think it would be relatively straightforward to do the work
> that I am suggesting incrementally.
>

I believe I've been the last one to look at the tls-maxphys branch and at least update it.

I agree that it's likely more useful to start it from scratch and do incremental updates. Certainly physio before block interface, and also by controller type, e.g. ATA first and then SCSI or USB.

For the branch, I particularly disliked that there were quite a few changes which looked either unrelated, or avoidable.

As one of the first steps I've planned to reduce diffs against HEAD. I've not gotten to it yet however.

On Fri, Aug 4, 2023 at 08:04, Brian Buhrow wrote:
> speed of the transfers on either system. Interestingly enough, however, the
> FreeBSD performance is markedly worse on this test.
> ...
> NetBSD-9.99.77/amd64 with SATA3 disk
> # dd if=/dev/rwd0a of=/dev/null bs=1m count=5
> 5242880 bytes transferred in 292.067 secs (179509496 bytes/sec)
>
> FreeBSD-13.1/AMD64 with SATA3 disk
> # dd if=/dev/da4 of=/dev/null bs=1m count=5
> 5242880 bytes transferred in 322.433936 secs (162603232 bytes/sec)

Interesting. FreeBSD da(4) is a character device since FreeBSD has no block devices anymore, so it's not a raw-vs-block device difference. Is the hardware really similar enough to be a fair comparison?

In a broader view, I have doubts if there is any practical reason to even have support for bigger than 64kb block size support at all.

For HDDs over SATA maybe - the bigger blocks mean potentially more sequential I/O and hence higher total throughput. Also you can queue more I/O and hence avoid the seeks - current NetBSD maximum on SATA is 32 x 64KiB = 2048KiB of queued I/O.

For SCSI, you can queue way more I/O than the usual disk cache can hold even with 64KiB blocks, so bigger block size is not very important.

Still, I doubt >64KiB blocks on HDD would achieve more than a couple of percent increase over 64KiB ones.

For SSDs over SATA, there is no seek to worry about, but the command latency is a concern. But even there, according to Linux benchmarks, the total transfer rate tops out with 64KiB blocks already.

For NVMe, the command latency is close to irrelevant - it has a very low latency command interface and very deep command queues. I don't see bigger blocks helping much, if at all.

Jaromir
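The queued-I/O figure quoted for SATA is plain arithmetic, sketched here so it can be redone for other limits. The 32-command queue depth and the 64KiB limit come from the message; the 1MiB case is only the FreeBSD-13 Maxphys value mentioned elsewhere in the thread, plugged in for comparison, and the function name is made up.

```python
KIB = 1024


def max_queued_bytes(queue_depth, max_xfer_bytes):
    """Upper bound on outstanding I/O: every queue slot holds a maximal transfer."""
    return queue_depth * max_xfer_bytes


# 32 queued commands of 64 KiB each, as stated for NetBSD on SATA.
current = max_queued_bytes(32, 64 * KIB)
# Same queue depth with a 1 MiB per-transfer limit, for comparison only.
larger = max_queued_bytes(32, 1024 * KIB)

print(current // KIB, "KiB")  # 2048 KiB
print(larger // KIB, "KiB")   # 32768 KiB
```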
Re: Maxphys on -current?
> On Aug 3, 2023, at 2:19 PM, Brian Buhrow wrote:
>
> hello. I know that this has been a very long term project, but I'm wondering
> about the status of this effort? I note that FreeBSD-13 has a Maxphys value
> of 1048576 bytes.
> Have we found other ways to get more throughput from ATA disks that obviate
> the need for this setting which I'm not aware of?
> If not, is anyone working on this project? The wiki page says the project is
> stalled.

If someone does pick this up, I think it would be a good idea to start from scratch, because MAXPHYS, as it stands, is used for multiple things. Thankfully, I think it would be relatively straightforward to do the work that I am suggesting incrementally.

Here goes...

MAXPHYS is really supposed to be “maximum transfer via physio”, which is the code path you use when you open /dev/rsd0e and read/write to it. The user-space pages are wired and mapped into the kernel for the purpose of doing I/O. MAXPHYS is a per-architecture constant because some systems have different constraints as to how much KVA space can be used for that at any given time.

Unfortunately, some of the adjacent physio machinery (e.g. minphys()) is also used for other purposes, specifically to clamp I/O sizes to constraints defined by the physical device and/or the controllers / busses they’re connected to.

Logically, these are two totally separate things, and IMO they should be cleanly separated.

What we *should* have is the notion of “I/O parameters” that are defined by the device… max I/O size, max queue depth, preferred I/O size, preferred I/O alignment, physical block size, logical block size, etc. The base values for these parameters should come from the leaf device (e.g. the disk), and then be clamped as needed by its connective tissue (the controller, the system bus the controller is connected to, and ultimately the platform-specific e.g. DMA constraints).

Then the interface layers (the page cache / UBC, the traditional block I/O buffer cache, and the physio interface for user-space) can further impose their own constraints, as necessary per their API contract.

There is zero reason that MAXPHYS should impact the maximum I/O that a file system can do via the UBC, for example.

(In a perfect world, we wouldn’t even have to consume virtual address space to bring data in and out of the page cache / UBC, because we already know the physical addresses of the pages that are being pulled in / cleaned.)

> Any thoughts or news would be greatly appreciated.

Anyway, those are mine :-)

-- thorpej
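The "I/O parameters" idea above can be modelled in a few lines. This is only an illustrative sketch, assuming a reduced set of parameters: the `IOParams` type, the `clamp` and `effective_params` helpers, and every size below are hypothetical and do not correspond to any existing kernel interface.

```python
from dataclasses import dataclass, replace

KIB = 1024


@dataclass(frozen=True)
class IOParams:
    """Hypothetical subset of the per-device I/O parameters."""
    max_io_size: int        # bytes
    preferred_io_size: int  # bytes
    max_queue_depth: int


def clamp(params, max_io_size=None, max_queue_depth=None):
    """Apply one layer's limits; a limit of None means 'no constraint here'."""
    io_size = params.max_io_size
    if max_io_size is not None:
        io_size = min(io_size, max_io_size)
    depth = params.max_queue_depth
    if max_queue_depth is not None:
        depth = min(depth, max_queue_depth)
    return replace(
        params,
        max_io_size=io_size,
        # The preferred size can never exceed the maximum size.
        preferred_io_size=min(params.preferred_io_size, io_size),
        max_queue_depth=depth,
    )


def effective_params(leaf, layers):
    """Start from the leaf device and clamp through each layer above it."""
    params = leaf
    for limits in layers:
        params = clamp(params, **limits)
    return params


# Base values come from the leaf device (made-up numbers).
disk = IOParams(max_io_size=1024 * KIB, preferred_io_size=128 * KIB,
                max_queue_depth=32)
path = [
    {"max_io_size": 512 * KIB, "max_queue_depth": 32},  # controller
    {"max_io_size": 256 * KIB},                         # system bus
    {},                                                 # platform DMA: no extra limit
]
dev = effective_params(disk, path)

# Interface layers add their own constraints on top of the device's.
MAXPHYS = 64 * KIB                            # physio-only limit (KVA for mapped user pages)
physio_limit = min(dev.max_io_size, MAXPHYS)  # raw-device path
ubc_limit = dev.max_io_size                   # page cache path, not tied to MAXPHYS

print(dev.max_io_size // KIB, physio_limit // KIB, ubc_limit // KIB)
```

In this model the physio limit and the device limit are computed separately, which is the separation argued for above: changing MAXPHYS alters only `physio_limit`.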
Re: Maxphys on -current?
On Thu, Aug 03, 2023 at 11:04:18PM -0700, Brian Buhrow wrote:
> speed of the transfers on either system. Interestingly enough, however, the
> FreeBSD performance is markedly worse on this test.

162MB/s or 179MB/s is just the speed of the disk, so I would guess the disks are different. There might also be some difference in command queuing parameters; some disks get slower for this kind of test.

--
Michael van Elst
Internet: mlel...@serpens.de
"A potential Snark may lurk in every tree."
Re: Maxphys on -current?
Hasn't there been a tls-maxphys branch?
Re: Maxphys on -current?
hello. Michael's e-mail explains the behavior I'm seeing when trying different block sizes with NetBSD and FreeBSD. The scripts below show transfers of the same number of bytes using 1m and 64k block sizes for NetBSD-9.99.77/amd64 and FreeBSD-13.1/amd64. NetBSD is using SATA3 disks with NCQ enabled and FreeBSD is using SATA3 disks with command queueing enabled. The block size doesn't change the speed of the transfers on either system. Interestingly enough, however, the FreeBSD performance is markedly worse on this test.

-thanks
-Brian

NetBSD-9.99.77/amd64 with SATA3 disk
# dd if=/dev/rwd0a of=/dev/null bs=1m count=5
5+0 records in
5+0 records out
5242880 bytes transferred in 292.078 secs (179502735 bytes/sec)
# dd if=/dev/rwd0a of=/dev/null bs=65536 count=80
80+0 records in
80+0 records out
5242880 bytes transferred in 292.067 secs (179509496 bytes/sec)

FreeBSD-13.1/AMD64 with SATA3 disk
# dd if=/dev/da4 of=/dev/null bs=1m count=5
5+0 records in
5+0 records out
5242880 bytes transferred in 322.807433 secs (162415095 bytes/sec)
# dd if=/dev/da4 of=/dev/null bs=65536 count=80
80+0 records in
80+0 records out
5242880 bytes transferred in 322.433936 secs (162603232 bytes/sec)
Re: Maxphys on -current?
g...@lexort.com (Greg Troxel) writes:

>When you run dd with bs=64k and then bs=1m, how different are the
>results? (I believe raw requests happen accordingly, vs MAXPHYS for fs
>etc. access.)

'raw requests' are split into MAXPHYS-sized chunks. While using bs=1m reduces the syscall overhead somewhat, the major effect is that the system will issue requests for all 16 chunks (1M / MAXPHYS) concurrently. 16 chunks is also the maximum, so between bs=1m and bs=2m the difference is only the reduced syscall overhead.

The filesystem can do something similar: asynchronous writes are also issued in parallel, and for reading it may choose to read ahead blocks to optimize I/O requests, also for up to 16 chunks.

In reality, large contiguous I/O rarely happens, and the current UVM overhead (e.g. mapping buffers) becomes more significant the faster your drive is.

A larger MAXPHYS also reduces SATA command overhead; that's up to 10% for SATA3 (6Gbps) that you might gain, assuming that you manage to do large contiguous I/O.

NVMe is a different thing. While the hardware command overhead is negligible, you can mitigate software overhead by using larger chunks for I/O and the gain can be much higher, at least for raw I/O.
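The chunk arithmetic above can be checked with a small user-space model. This is a sketch only, assuming MAXPHYS is 64KiB and that at most 16 chunks are in flight at once, as the message states; `split_request` and `batches` are made-up names and do not mirror the actual physio code.

```python
MAXPHYS = 64 * 1024   # assumed per-chunk limit, in bytes
MAX_CONCURRENT = 16   # maximum number of chunks issued at once, per the message


def split_request(offset, length, maxphys=MAXPHYS):
    """Split one raw request into (offset, size) chunks of at most maxphys bytes."""
    chunks = []
    while length > 0:
        size = min(length, maxphys)
        chunks.append((offset, size))
        offset += size
        length -= size
    return chunks


def batches(chunks, limit=MAX_CONCURRENT):
    """Group chunks into rounds of at most `limit` concurrent requests."""
    return [chunks[i:i + limit] for i in range(0, len(chunks), limit)]


for label, bs in (("64k", 64 * 1024), ("1m", 1024 * 1024), ("2m", 2 * 1024 * 1024)):
    chunks = split_request(0, bs)
    rounds = batches(chunks)
    print(f"bs={label}: {len(chunks)} chunks, "
          f"{len(rounds)} round(s) of up to {MAX_CONCURRENT}")
```

With these assumptions a bs=64k request is a single chunk, bs=1m fills all 16 concurrent slots, and bs=2m still has only 16 chunks in flight at a time, so the only remaining gain is the halved number of system calls.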
Re: Maxphys on -current?
Brian Buhrow writes:

> hello. I know that this has been a very long term project, but I'm
> wondering about the status of this effort? I note that FreeBSD-13 has a
> Maxphys value of 1048576 bytes.
> Have we found other ways to get more throughput from ATA disks that obviate
> the need for this setting which I'm not aware of?
> If not, is anyone working on this project? The wiki page says the project is
> stalled.

I haven't heard that anyone is.

When you run dd with bs=64k and then bs=1m, how different are the results? (I believe raw requests happen accordingly, vs MAXPHYS for fs etc. access.)
Maxphys on -current?
hello. I know that this has been a very long term project, but I'm wondering about the status of this effort? I note that FreeBSD-13 has a Maxphys value of 1048576 bytes. Have we found other ways to get more throughput from ATA disks that obviate the need for this setting which I'm not aware of? If not, is anyone working on this project? The wiki page says the project is stalled.

Any thoughts or news would be greatly appreciated.

-thanks
-Brian