Re: scsipi: physio split the request
On Tue, Jan 01, 2019 at 02:48:05PM -0500, Thor Lancelot Simon wrote:
>
> The work remaining to be done on the branch, as I see it, is: [...]

I missed one. I got fed up dealing with the way arguments are passed through the mount syscalls (especially for ufs) so the work of letting mount(8) pass a maxxfer value through (either initially or on an update mount) is not done.

That's definitely a thing that should be done before this hits the tree.

Thor
Re: scsipi: physio split the request
On Thu, Dec 27, 2018 at 09:07:41AM +, Emmanuel Dreyfus wrote:
> Hello
>
> A few years ago I made a failed attempt at running LTFS on a LTO 6 drive.
> I resumed the effort, and once I got the LTFS code ported, running
> a command like mkltfs fails with kernel console saying:
> st0(mpii0:0:2:0): physio split the request.. cannot proceed
>
> This is netbsd-current from yesterday.

You really need tls-maxphys. It won't be a ton of fun to rebase it on newer NetBSD-current but it can't be more than a day's work (IIRC where I left it we were pre device/softc cleanup, and that'll be some nuisance to address if so).

tls-maxphys propagates the maximum supported transfer size down the system's actual discovered bus topology at boot time; any node in the tree can enforce its own restrictions as it sees fit, and nodes like RAIDframe that effectively demux I/O can compute and declare their own supported maximum.

The work remaining to be done on the branch, as I see it, is:

1) *Some* backpressure mechanism *must* be implemented to prevent the filesystems from greedily attempting maximum size I/Os at all times, because with a new, much larger maximum in many cases, this will lead to much worse unfairness than we now see (and some threads doing I/O will much more obviously starve others). IIRC we've already got something effective for either read or write but not the other but it's been a while, so I could be wrong.

2) There's an ugly case with RAIDframe if a component is replaced with one that supports a smaller maxphys. The filesystems need to be notified so they can change their own internal max xfer size. I think I wrote the code to deal with this but it's untested. Wants a look.

3) A number of device drivers -- particularly things in the LSI family -- will need to learn about newer DMA descriptor formats supported by their hardware in order to support transfers of reasonable size for things like tape drives (mpt and possibly mfi*, for example, are currently limited to 192K because our driver only supports a very old descriptor format; this should be a relatively simple fix based on reading newer open-source code for these devices as a reference).

I believe that should be all that's needed. I would estimate it at 5 days of work, or perhaps a month of evenings/weekends. I don't have that time available now and won't in the foreseeable future, but, perhaps someone reading this does. And of course some of you are much quicker at this stuff than I am (thorpej, I'm looking at you ;-)).

Most of what the branch does is useful *even if* we remove the stupid VA/PA mapping business for I/O, I think. Because it's mostly config sugar to let the clients know how big an I/O they can ask for at runtime, and that will be needed regardless.

Thor
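The propagation Thor describes can be modelled in a few lines. This is only a userland sketch under stated assumptions: `struct xfer_node`, `own_limit` and `effective_maxxfer()` are hypothetical names invented for illustration and are not taken from the tls-maxphys branch. It assumes each node in the device tree may declare its own limit, and that the limit a leaf device sees is the smallest one on its path to the root.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical node in the discovered bus topology (not a NetBSD type). */
struct xfer_node {
    const struct xfer_node *parent; /* NULL at the root of the tree */
    uint32_t own_limit;             /* bytes; 0 means "no limit of my own" */
};

/*
 * Walk from a device up to the root and return the smallest limit
 * declared along the way. UINT32_MAX means nobody imposed a limit.
 */
static uint32_t
effective_maxxfer(const struct xfer_node *n)
{
    uint32_t limit = UINT32_MAX;

    assert(n != NULL);
    for (; n != NULL; n = n->parent) {
        if (n->own_limit != 0 && n->own_limit < limit)
            limit = n->own_limit;
    }
    return limit;
}
```

With a bus that declares no limit, a controller limited to 1 MB and a tape drive limited to 256 kB, the tape drive ends up with 256 kB and the controller with 1 MB. A node that fans I/O out to several children, as RAIDframe does, would need its own rule on top of this; that part is not modelled here.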
Re: scsipi: physio split the request
On Fri, Dec 28, 2018, 11:04 AM Warner Losh wrote:
> On Fri, Dec 28, 2018, 1:25 AM matthew green wrote:
>> > Of course larger transfers would also mitigate the overhead for each I/O
>> > operation, but we already do several Gigabyte/s with 64k transfers and
>> > filesystem I/O tends to be even smaller.
>>
>> yes - the benefits will be in the 0-10% range for most things. it
>> will help, but only a fairly small amount, most of us won't notice.
>>
>> i've seen peaks of 1.4GB/s with an nvme(4) device with ffs on top.
>
> I've seen 3.3GB/s of 128k-512k transfers on FreeBSD off of nvme, but
> that's mostly video. It seems to be limited there not so much by transfer
> size, but by the ability to queue transactions. We see <1% by raising
> MAXPHYS to 1MB over the default 128k there.

Also, we are limited by what the device itself can do which varies a lot by drive. From a low of 1GB/s to a high of just under 3.4GB/s.

Warner
Re: scsipi: physio split the request
On Fri, Dec 28, 2018, 1:25 AM matthew green wrote:
> > Of course larger transfers would also mitigate the overhead for each I/O
> > operation, but we already do several Gigabyte/s with 64k transfers and
> > filesystem I/O tends to be even smaller.
>
> yes - the benefits will be in the 0-10% range for most things. it
> will help, but only a fairly small amount, most of us won't notice.
>
> i've seen peaks of 1.4GB/s with an nvme(4) device with ffs on top.

I've seen 3.3GB/s of 128k-512k transfers on FreeBSD off of nvme, but that's mostly video. It seems to be limited there not so much by transfer size, but by the ability to queue transactions. We see <1% by raising MAXPHYS to 1MB over the default 128k there.

Warner
Re: scsipi: physio split the request
On Thu, Dec 27, 2018 at 22:07, Christos Zoulas wrote:
> On Dec 27, 12:29pm, buh...@nfbcal.org (Brian Buhrow) wrote:
> -- Subject: Re: scsipi: physio split the request
>
> | hello. Just out of curiosity, why did the tls-maxphys branch never
> | get merged with head once the work was done or mostly done?

Simply nobody finished it up yet.

> mostly done...

I did the last tls-maxphys sync. It will need at least one resync, since more stuff moved around (i.e. wd(4) was converted to dksubr since then). I want to eliminate some duplicated code in PCI-IDE drivers before the merge. Other than that, I don't see any other blockers for the merge. Of course, it requires a proper round of testing before it could be really considered merge-worthy.

As others noted, the performance benefit of tls-maxphys is likely to be small. It however removes artificial limits in the stack, unlocking new opportunities for further development, so it's both necessary and good to do nevertheless.

Jaromir
Re: scsipi: physio split the request
On Thu, Dec 27, 2018 at 15:41, Emmanuel Dreyfus wrote:
>
> On Thu, Dec 27, 2018 at 02:33:28PM +, Christos Zoulas wrote:
> > I think you need resurrect the tls-maxphys branch... It was close to working
> > IIRC.
>
> What happens if I just #define MAXPHYS (1024*1204*1024) ?

Several drivers use MAXPHYS to allocate DMA memory for them (e.g. nvme), they usually try to allocate MAXPHYS * (max number of scatter-gather vectors). These will either fail, or block huge amounts of RAM.

Also it just happens to be the hard I/O size limit for ISA, and for ATA/IDE drives not supporting LBA48. Even for hw which does support bigger I/O (stuff behind PCI, SCSI, ATA drivers with LBA48), the drivers might be buggy and not cope.

Nothing good comes of this; unfortunately the lower layers need to be fixed first.

Jaromir
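To put a number on the allocation Jaromir describes, here is a small arithmetic sketch. The function name and the figure of 32 scatter-gather vectors are assumptions picked for illustration; real drivers differ in how many vectors and how many such buffers they set up.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes a driver would reserve if it sizes its DMA memory as
 * MAXPHYS * (max number of scatter-gather vectors).
 * 64-bit arithmetic so that large MAXPHYS values do not overflow.
 */
static uint64_t
dma_prealloc_bytes(uint64_t maxphys, uint64_t nvectors)
{
    assert(maxphys > 0 && nvectors > 0);
    return maxphys * nvectors;
}
```

With the current 64 kB MAXPHYS and 32 vectors this comes to 2 MB. With `(1024*1204*1024)`, which is 1262485504 bytes, the same driver would ask for 40399536128 bytes, roughly 37.6 GB, for a single such allocation.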
re: scsipi: physio split the request
> Of course larger transfers would also mitigate the overhead for each I/O
> operation, but we already do several Gigabyte/s with 64k transfers and
> filesystem I/O tends to be even smaller.

yes - the benefits will be in the 0-10% range for most things. it will help, but only a fairly small amount, most of us won't notice.

i've seen peaks of 1.4GB/s with an nvme(4) device with ffs on top.

.mrg.
Re: scsipi: physio split the request
jnem...@cue.bc.ca (John Nemeth) writes:

>On Dec 27, 6:49pm, Michael van Elst wrote:
>} So far that's mostly a problem with software raid and modern tape I/O.

> Wouldn't hardware RAID also benefit from bigger buffers?
>Although, I suppose a battery backed cache could be used to work around
>small transfer sizes.

The transfer size currently limits I/O of stripes because it is split over all stripe units (drives). A hardware controller does this internally and isn't affected by MAXPHYS.

Of course larger transfers would also mitigate the overhead for each I/O operation, but we already do several Gigabyte/s with 64k transfers and filesystem I/O tends to be even smaller.

--
Michael van Elst
Internet: mlel...@serpens.de
"A potential Snark may lurk in every tree."
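The stripe limitation can be made concrete with a little arithmetic. This is a simplified sketch: the function names are made up, and it assumes a single transfer is divided evenly over the data components, ignoring parity and alignment.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes each data component receives when one transfer is spread over all of them. */
static uint32_t
per_component_bytes(uint32_t xfer, unsigned ndata)
{
    assert(ndata > 0);
    return xfer / ndata;
}

/* Nonzero if a full stripe (stripe_unit bytes on each data component) fits in one transfer. */
static int
full_stripe_fits(uint32_t maxphys, uint32_t stripe_unit, unsigned ndata)
{
    assert(ndata > 0);
    return (uint64_t)stripe_unit * ndata <= maxphys;
}
```

With a 64 kB limit and four data components, each drive sees at most 16 kB per request, and a full-stripe I/O with a 32 kB stripe unit (128 kB in total) cannot be issued in one piece.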
Re: scsipi: physio split the request
On Dec 27, 6:49pm, Michael van Elst wrote:
} m...@netbsd.org (Emmanuel Dreyfus) writes:
}
} >Is there a reason other than historical for NetBSD 64kB limit?
}
} It's a compromise. Some buffers are statically sized for MAXPHYS
} and some ancient hardware cannot exceed 64k (or even less) DMA transfers.
} The buffer size is mostly a problem because we don't support
} scatter-gather transfers, so the buffers need to be contiguous in
} physical RAM (and some hardware doesn't support s-g either).
}
} So far that's mostly a problem with software raid and modern tape I/O.

Wouldn't hardware RAID also benefit from bigger buffers? Although, I suppose a battery backed cache could be used to work around small transfer sizes.

}-- End of excerpt from Michael van Elst
Re: scsipi: physio split the request
thor...@me.com (Jason Thorpe) writes:

>> You need a really huge amount of RAM for that, and also a huge
>> KVA space.

>...but it doesn't have to be that way.

>The fundamental problem is that for physio, we currently have to map the
>buffer into kernel space at all.

Mapping into KVA is another problem.

> We really should have a more abstract way to describe memory that is passed
> down to device drivers that currently take struct buf *s, call it an I/O
> memory descriptor ("iomd"). This iomd would have, say, an array of vm_page
> *'s, or perhaps an array of paddr_t's, but would also have a pointer to the
> buffer as mapped into kernel address space.

The problem is that currently we and also some hardware cannot handle such a construct.

>Then a new bus_dmamap_load_iomd() call could take an iomd as an argument, and
>skip doing a bunch of work (calling into the pmap layer to get the physical
>address), and just build the bus_dma_segment_t's directly.

There is hardware that can only handle a single bus_dma_segment.

So that's:

- support some more abstract MAXPHYS (i.e. not a global constant).
- make buffers based on scatter-gather lists instead of a single linear piece of memory.
- make drivers use these scatter-gather buffers
- try to emulate this behaviour when hardware is too limited.
- make other users of buffers compatible with scatter-gather lists

That's a long way to go and still not related to mapping buffers into KVA.

--
Michael van Elst
Internet: mlel...@serpens.de
"A potential Snark may lurk in every tree."
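One common way to emulate scatter-gather for hardware that accepts only a single segment is to copy the pieces into one contiguous bounce buffer before starting the transfer. The sketch below shows that idea in plain userland C. `struct sg_ent` and `sg_gather()` are hypothetical names, and this is not how NetBSD's bus_dma bounce buffering is actually written.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One piece of a scatter-gather list (hypothetical type for illustration). */
struct sg_ent {
    const void *base;
    size_t len;
};

/*
 * Copy every piece, in order, into one contiguous bounce buffer.
 * Returns the number of bytes gathered, or -1 if the list does not fit.
 */
static long
sg_gather(const struct sg_ent *sg, size_t nents, void *bounce, size_t bounce_size)
{
    size_t used = 0;

    assert(bounce != NULL);
    for (size_t i = 0; i < nents; i++) {
        if (sg[i].len > bounce_size - used)
            return -1;
        memcpy((char *)bounce + used, sg[i].base, sg[i].len);
        used += sg[i].len;
    }
    return (long)used;
}
```

The price is an extra copy of the whole transfer plus a contiguous buffer as large as the biggest request, which is the cost larger transfer sizes would bring on such hardware.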
Re: scsipi: physio split the request
On Dec 27, 12:29pm, buh...@nfbcal.org (Brian Buhrow) wrote:
-- Subject: Re: scsipi: physio split the request

| hello. Just out of curiosity, why did the tls-maxphys branch never
| get merged with head once the work was done or mostly done?

mostly done...

christos
Re: scsipi: physio split the request
> On Dec 27, 2018, at 10:51 AM, Michael van Elst wrote:
>
> m...@netbsd.org (Emmanuel Dreyfus) writes:
>
>> What happens if I just #define MAXPHYS (1024*1204*1024) ?
>
> You need a really huge amount of RAM for that, and also a huge
> KVA space.

...but it doesn't have to be that way.

The fundamental problem is that for physio, we currently have to map the buffer into kernel space at all.

We really should have a more abstract way to describe memory that is passed down to device drivers that currently take struct buf *s, call it an I/O memory descriptor ("iomd"). This iomd would have, say, an array of vm_page *'s, or perhaps an array of paddr_t's, but would also have a pointer to the buffer as mapped into kernel address space. The necessary part is having the page array filled in, along with an offset, and a length. If not sufficient, then callers could map the buffer ONLY if needed, e.g. if you have to do PIO to your device.

Then a new bus_dmamap_load_iomd() call could take an iomd as an argument, and skip doing a bunch of work (calling into the pmap layer to get the physical address), and just build the bus_dma_segment_t's directly. If it needs to bounce-buffer, then the back-end takes care of calling iomd_map() or whatever.

This isn't a fully fleshed-out proposal, or anything, but I know it's been brought up off and on for years... we really ought to just get around to doing it. Unfortunately, it's going to mean modifying a lot of drivers before the upper layers can assume "I can pass iomds down everywhere for buf I/O".

-- thorpej
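A rough userland model of the iomd idea may help make it concrete. Everything here is an assumption made for illustration: the field names, the 4 kB page size, `struct dma_seg` and `iomd_build_segs()` are invented, and no such interface exists in NetBSD. The sketch only shows how segments could be built straight from a page array plus offset and length, merging pages that happen to be physically adjacent.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u      /* assumed page size */

/* Hypothetical I/O memory descriptor: page array, offset, length. */
struct iomd {
    const uint64_t *pages;  /* physical address of each page, page-aligned */
    size_t npages;
    uint32_t offset;        /* byte offset into the first page */
    uint64_t length;        /* total bytes to transfer */
};

/* Stand-in for a DMA segment: physical start address and length. */
struct dma_seg {
    uint64_t addr;
    uint64_t len;
};

/*
 * Build DMA segments directly from the page array, without mapping the
 * buffer into a kernel virtual address range. Returns the number of
 * segments used, or -1 if the transfer needs more than maxsegs segments
 * or the page array is too short for the requested length.
 */
static int
iomd_build_segs(const struct iomd *md, struct dma_seg *segs, int maxsegs)
{
    uint64_t remaining = md->length;
    uint32_t off = md->offset;
    int nsegs = 0;

    assert(off < SKETCH_PAGE_SIZE);
    for (size_t i = 0; i < md->npages && remaining > 0; i++) {
        uint64_t addr = md->pages[i] + off;
        uint64_t chunk = SKETCH_PAGE_SIZE - off;

        if (chunk > remaining)
            chunk = remaining;
        if (nsegs > 0 && segs[nsegs - 1].addr + segs[nsegs - 1].len == addr) {
            /* Physically adjacent to the previous piece: extend it. */
            segs[nsegs - 1].len += chunk;
        } else {
            if (nsegs == maxsegs)
                return -1;
            segs[nsegs].addr = addr;
            segs[nsegs].len = chunk;
            nsegs++;
        }
        remaining -= chunk;
        off = 0;    /* only the first page starts at an offset */
    }
    return remaining == 0 ? nsegs : -1;
}
```

Three pages, of which the first two are adjacent, come out as two segments. Hardware that accepts only one segment would get -1 for the same request, which is where the bounce-buffer fallback mentioned in the message would have to take over.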
Re: scsipi: physio split the request
hello. Just out of curiosity, why did the tls-maxphys branch never get merged with head once the work was done or mostly done?

-thanks
-Brian
Re: scsipi: physio split the request
In article <20181227153028.gr4...@homeworld.netbsd.org>,
Emmanuel Dreyfus wrote:
>On Thu, Dec 27, 2018 at 09:47:03AM -0500, Christos Zoulas wrote:
>> | What happens if I just #define MAXPHYS (1024*1204*1024) ?
>> I don't think that's a good idea. My guess is that things are going to blow up.
>
>At least if I try to be on par with Linux limit and build with
>-DMAXPHYS=1048576 the system goes to multiuser without a hitch.
>
>Running mkltfs raises a few errors on the console, though:
>mpii0: error 27 loading dmamap
>st0(mpii0:0:2:0): passthrough: adapter inconsistency
>mpii0: error 27 loading dmamap
>st0(mpii0:0:2:0): passthrough: adapter inconsistency

Told you: EFBIG :-) Why don't you try tls-maxphys?

christos
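For readers mapping the console's "error 27" to the name Christos gives it: 27 is the value of EFBIG in NetBSD's errno list, and the same value is used on most other Unix-like systems, although POSIX does not fix the number. The helper below is a hypothetical convenience written for illustration, not part of any driver.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/*
 * Translate the numeric error printed by "error N loading dmamap"
 * into a symbolic name where this sketch knows one, and fall back
 * to the C library's description otherwise.
 */
static const char *
dmamap_error_name(int error)
{
    if (error == EFBIG)
        return "EFBIG";
    return strerror(error);
}
```

EFBIG from a DMA map load generally means the request did not fit the limits the map was created with, which is consistent with a driver whose maps were sized for smaller transfers being handed a 1 MB request.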
Re: scsipi: physio split the request
m...@netbsd.org (Emmanuel Dreyfus) writes:

>What happens if I just #define MAXPHYS (1024*1204*1024) ?

You need a really huge amount of RAM for that, and also a huge KVA space.

Try MAXPHYS (1024*1024) for a start.

--
Michael van Elst
Internet: mlel...@serpens.de
"A potential Snark may lurk in every tree."
Re: scsipi: physio split the request
m...@netbsd.org (Emmanuel Dreyfus) writes:

>Is there a reason other than historical for NetBSD 64kB limit?

It's a compromise. Some buffers are statically sized for MAXPHYS and some ancient hardware cannot exceed 64k (or even less) DMA transfers. The buffer size is mostly a problem because we don't support scatter-gather transfers, so the buffers need to be contiguous in physical RAM (and some hardware doesn't support s-g either).

So far that's mostly a problem with software raid and modern tape I/O.

--
Michael van Elst
Internet: mlel...@serpens.de
"A potential Snark may lurk in every tree."
Re: scsipi: physio split the request
m...@netbsd.org (Emmanuel Dreyfus) writes:

>On Thu, Dec 27, 2018 at 10:44:46AM +0100, Manuel Bouyer wrote:
>> tape block size are usually larger than 512 (I use 64k here).
>> What block size did mkltfs use ? Actually we can't do larger than 64k.

>It seems to attempt transfers of 256kB

We are limited to MAXPHYS which is currently 64k.

--
Michael van Elst
Internet: mlel...@serpens.de
"A potential Snark may lurk in every tree."
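The mismatch between the 256 kB request and the 64 kB limit comes down to a ceiling division. The sketch below assumes, as the console message "physio split the request.. cannot proceed" suggests, that a tape request has to go out as a single transfer; the function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Number of MAXPHYS-sized pieces needed to carry a request of len bytes. */
static uint64_t
chunks_needed(uint64_t len, uint64_t maxphys)
{
    assert(maxphys > 0);
    return (len + maxphys - 1) / maxphys;
}

/* Nonzero if the request fits in one transfer and so never has to be split. */
static int
fits_single_transfer(uint64_t len, uint64_t maxphys)
{
    return chunks_needed(len, maxphys) <= 1;
}
```

For the 262144-byte request from mkltfs and a 65536-byte MAXPHYS this gives four pieces, so the request cannot be issued as one transfer.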
Re: scsipi: physio split the request
On Thu, Dec 27, 2018 at 09:47:03AM -0500, Christos Zoulas wrote:
> | What happens if I just #define MAXPHYS (1024*1204*1024) ?
> I don't think that's a good idea. My guess is that things are going to blow
> up.

At least if I try to be on par with Linux limit and build with -DMAXPHYS=1048576 the system goes to multiuser without a hitch.

Running mkltfs raises a few errors on the console, though:
mpii0: error 27 loading dmamap
st0(mpii0:0:2:0): passthrough: adapter inconsistency
mpii0: error 27 loading dmamap
st0(mpii0:0:2:0): passthrough: adapter inconsistency

--
Emmanuel Dreyfus
m...@netbsd.org
Re: scsipi: physio split the request
On Dec 27, 2:41pm, m...@netbsd.org (Emmanuel Dreyfus) wrote:
-- Subject: Re: scsipi: physio split the request

| On Thu, Dec 27, 2018 at 02:33:28PM +, Christos Zoulas wrote:
| > I think you need resurrect the tls-maxphys branch... It was close to working
| > IIRC.
|
| What happens if I just #define MAXPHYS (1024*1204*1024) ?

I don't think that's a good idea. My guess is that things are going to blow up.

christos
Re: scsipi: physio split the request
On Thu, Dec 27, 2018 at 02:33:28PM +, Christos Zoulas wrote:
> I think you need resurrect the tls-maxphys branch... It was close to working
> IIRC.

What happens if I just #define MAXPHYS (1024*1204*1024) ?

--
Emmanuel Dreyfus
m...@netbsd.org
Re: scsipi: physio split the request
In article <20181227123711.go4...@homeworld.netbsd.org>,
Emmanuel Dreyfus wrote:
>On Thu, Dec 27, 2018 at 10:44:46AM +0100, Manuel Bouyer wrote:
>> tape block size are usually larger than 512 (I use 64k here).
>> What block size did mkltfs use ? Actually we can't do larger than 64k.
>
>It seems to attempt transfers of 256kB
>
>LTFS20010D SCSI request: [ A3 1F 08 00 00 00 04 00 00 00 00 00 ]
>Requested length=262144
>LTFS20089D Driver detail:errno = 0x5
>LTFS20089D Driver detail: host_status = 0x0
>LTFS20089D Driver detail:driver_status = 0x0
>LTFS20089D Driver detail: status = 0x0
>LTFS20011D SCSI outcome: Driver status=0xFF SCSI status=0xFF Actual length=0

I think you need resurrect the tls-maxphys branch... It was close to working IIRC.

christos
Re: scsipi: physio split the request
On Thu, Dec 27, 2018 at 10:44:46AM +0100, Manuel Bouyer wrote:
> tape block size are usually larger than 512 (I use 64k here).

I patched ltfs so that all the max sizes (256kB and 512kB for Linux) are set to 64kB for NetBSD. I can now format and mount the LTFS filesystem, but I need to limit the block size to under 64kB.

This will work:
dump -0f - / | dd obs=63k of=/ltfs/dump20181227

This hangs the filesystem:
dump -0f - / | dd obs=64k of=/ltfs/dump20181227

I tested on glusterfs that our FUSE implementation does not limit writes to 64k chunks, hence I assume I introduced a bug in ltfs with the 64kB limit everywhere.

Is there a reason other than historical for NetBSD 64kB limit?

--
Emmanuel Dreyfus
m...@netbsd.org
Re: scsipi: physio split the request
On Thu, Dec 27, 2018 at 10:44:46AM +0100, Manuel Bouyer wrote:
> tape block size are usually larger than 512 (I use 64k here).
> What block size did mkltfs use ? Actually we can't do larger than 64k.

It seems to attempt transfers of 256kB

LTFS20010D SCSI request: [ A3 1F 08 00 00 00 04 00 00 00 00 00 ]
Requested length=262144
LTFS20089D Driver detail:errno = 0x5
LTFS20089D Driver detail: host_status = 0x0
LTFS20089D Driver detail:driver_status = 0x0
LTFS20089D Driver detail: status = 0x0
LTFS20011D SCSI outcome: Driver status=0xFF SCSI status=0xFF Actual length=0

--
Emmanuel Dreyfus
m...@netbsd.org
Re: scsipi: physio split the request
On Thu, Dec 27, 2018 at 09:07:41AM +, Emmanuel Dreyfus wrote:
> Hello
>
> A few years ago I made a failed attempt at running LTFS on a LTO 6 drive.
> I resumed the effort, and once I got the LTFS code ported, running
> a command like mkltfs fails with kernel console saying:
> st0(mpii0:0:2:0): physio split the request.. cannot proceed
>
> This is netbsd-current from yesterday.
>
> I understand this is about tape block size larger than usual 512.

tape block size are usually larger than 512 (I use 64k here). What block size did mkltfs use ? Actually we can't do larger than 64k.

--
Manuel Bouyer
NetBSD: 26 years of experience will always make the difference
--