Re: libata .sg_tablesize: why always dividing by 2 ?
On Tue, 2008-02-26 at 16:47 +1100, Benjamin Herrenschmidt wrote:
> On Tue, 2008-02-26 at 00:43 -0500, Mark Lord wrote:
>>> I suppose so. I don't remember all of the details, but iirc, it has to do with crossing 64K boundaries. Some controllers can't handle it. It's not only the _size_ of the segments, it's their alignment. The iommu will not keep alignment beyond the page size (and even then... on powerpc with a 64k base page size, you may still end up with a 4k aligned result, but let's not go there now).
>> ..
>> That's just not possible, unless the IOMMU *splits* segments. And the IOMMU experts here say that it never does that.
>
> It is totally possible, and I know, as I wrote part of the powerpc iommu code :-) The iommu code makes no guarantee vs. preserving the alignment of a segment, at least not below PAGE_SIZE.

It's supposed to, precisely to forestall this case.

The alignment guarantee of the parisc iommu code is sg-length alignment, up to a fixed maximum (128k on 32 bit and 256k on 64 bit, because of the way the allocator works). However, tomo's code is fixing this, so it shouldn't be a problem much longer.

James
Re: libata .sg_tablesize: why always dividing by 2 ?
Benjamin Herrenschmidt wrote:
> On Tue, 2008-02-26 at 00:43 -0500, Mark Lord wrote:
>>> I suppose so. I don't remember all of the details, but iirc, it has to do with crossing 64K boundaries. Some controllers can't handle it. It's not only the _size_ of the segments, it's their alignment. The iommu will not keep alignment beyond the page size (and even then... on powerpc with a 64k base page size, you may still end up with a 4k aligned result, but let's not go there now).
>> ..
>> That's just not possible, unless the IOMMU *splits* segments. And the IOMMU experts here say that it never does that.
>
> It is totally possible, and I know, as I wrote part of the powerpc iommu code :-) The iommu code makes no guarantee vs. preserving the alignment of a segment, at least not below PAGE_SIZE. Thus if you pass to dma_map_sg() a 64K aligned 64K segment, you may well get back a 4K aligned 64K segment. Enforcing natural alignment in the iommu code only happens for dma_alloc_coherent (it uses order-N allocations anyway), it doesn't happen for map_sg. If we were to do that, we would make it very likely for iommu allocations to fail on machines with small DMA windows.
>
> Ben.
..
That's interesting. Can you point us to the exact file::lines where this happens? It would be good to ensure that this gets fixed.

I'm copying Fujita Tomonori on this thread now -- he's the dude who's trying to sort out the IOMMU mess.

Thanks
Re: libata .sg_tablesize: why always dividing by 2 ?
Mark Lord wrote:
> Benjamin Herrenschmidt wrote:
>> On Tue, 2008-02-26 at 00:43 -0500, Mark Lord wrote:
>>>> I suppose so. I don't remember all of the details, but iirc, it has to do with crossing 64K boundaries. Some controllers can't handle it. It's not only the _size_ of the segments, it's their alignment. The iommu will not keep alignment beyond the page size (and even then... on powerpc with a 64k base page size, you may still end up with a 4k aligned result, but let's not go there now).
>>> ..
>>> That's just not possible, unless the IOMMU *splits* segments. And the IOMMU experts here say that it never does that.
>>
>> It is totally possible, and I know, as I wrote part of the powerpc iommu code :-) The iommu code makes no guarantee vs. preserving the alignment of a segment, at least not below PAGE_SIZE. Thus if you pass to dma_map_sg() a 64K aligned 64K segment, you may well get back a 4K aligned 64K segment. Enforcing natural alignment in the iommu code only happens for dma_alloc_coherent (it uses order-N allocations anyway), it doesn't happen for map_sg. If we were to do that, we would make it very likely for iommu allocations to fail on machines with small DMA windows.
>>
>> Ben.
> ..
> That's interesting. Can you point us to the exact file::lines where this happens? It would be good to ensure that this gets fixed.
>
> I'm copying Fujita Tomonori on this thread now -- he's the dude who's trying to sort out the IOMMU mess.
..
Mmm.. looks like ppc is already fixed in mainline, commit 740c3ce66700640a6e6136ff679b067e92125794

Ben?
libata .sg_tablesize: why always dividing by 2 ?
Jeff,

We had a discussion here today about IOMMUs, and they *never* split sg list entries -- they only ever *merge*. And this happens only after the block layer has already done merging while respecting q->seg_boundary_mask.

So worst case, the IOMMU may merge everything, and then in libata we unmerge them again. But the end result can never exceed the max_sg_entries limit enforced by the block layer.

So.. why are we still specifying .sg_tablesize as half of what the LLD can really handle? This can cost a lot of memory, as using NCQ effectively multiplies everything by 32..

Based on this information, I should be able to do this in sata_mv, for example:

-	.sg_tablesize		= MV_MAX_SG_CT / 2,
+	.sg_tablesize		= MV_MAX_SG_CT,

???
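For reference, a rough sketch of where these numbers live (assuming the 2008-era scsi_host_template fields; this is illustrative, not the actual sata_mv template): the LLD advertises sg_tablesize and a DMA boundary to the SCSI/block layers, and the block layer then never builds a segment that crosses that boundary, nor a request with more segments than sg_tablesize.

#include <scsi/scsi_host.h>
#include <linux/libata.h>

#define MV_MAX_SG_CT	256	/* hardware S/G entries per command (value from sata_mv) */

/* Illustrative sketch only -- not the actual sata_mv host template. */
static struct scsi_host_template example_sht = {
	.sg_tablesize	= MV_MAX_SG_CT / 2,	/* the conservative "/ 2" under discussion */
	.dma_boundary	= ATA_DMA_BOUNDARY,	/* 0xffff: no segment may cross a 64K boundary */
	/* ...remaining fields as in the real driver... */
};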
Re: libata .sg_tablesize: why always dividing by 2 ?
As an aside, ISTR tomo-san was working on eliminating the need for the /2 by tackling the details on the IOMMU side...

	Jeff
Re: libata .sg_tablesize: why always dividing by 2 ?
Jeff Garzik wrote:
> Mark Lord wrote:
>> Jeff,
>>
>> We had a discussion here today about IOMMUs, and they *never* split sg list entries -- they only ever *merge*. And this happens only after the block layer has already done merging while respecting q->seg_boundary_mask.
>>
>> So worst case, the IOMMU may merge everything, and then in libata we unmerge them again. But the end result can never exceed the max_sg_entries limit enforced by the block layer.
>
> shrug
>
> Early experience said otherwise. The split in foo_fill_sg() and resulting sg_tablesize reduction were both needed to successfully transfer data, when Ben H originally did the work.
>
> If Ben H and everyone on the arch side agrees with the above analysis, I would be quite happy to remove all those / 2.
>
>> This can cost a lot of memory, as using NCQ effectively multiplies everything by 32..
>
> I recommend dialing down the hyperbole a bit :) "a lot" in this case is... maybe another page or two per table, if that. Compared with everything else going on in the system, with 16-byte S/G entries, S/G table size is really the least of our worries.
..
Well, today each sg table is about a page in size, and sata_mv has 32 of them per port. So cutting them in half would save 16 pages per port, or 64 pages per host controller. That's a lot for a small system, but maybe not for my 4GB test boxes.

> If you were truly concerned about memory usage in sata_mv, a more effective route is simply reducing MV_MAX_SG_CT to a number closer to the average s/g table size -- which is far, far lower than 256 (currently MV_MAX_SG_CT), or even 128 (MV_MAX_SG_CT/2). Or moving to a scheme where you allocate (for example) S/G tables with 32 entries... then allocate on the fly for the rare case where the S/G table must be larger...
..
Oh, absolutely.. that's on my cleanup list once the rest of the driver becomes stable and mostly done. But for now, safety and correctness are far more important in sata_mv. :)

Cheers
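For reference, the arithmetic behind the figures above (the 16-byte entry size, 256-entry tables and 32 tags per port come from the discussion; the 4-port host is an assumption used only to reproduce the per-controller number):

/* Back-of-the-envelope S/G table memory, per the numbers in this thread. */
#define SG_ENTRY_BYTES	16
#define FULL_TABLE	(256 * SG_ENTRY_BYTES)	/* 4096 bytes: one 4K page per table   */
#define HALF_SAVING	(128 * SG_ENTRY_BYTES)	/* 2048 bytes saved per halved table   */
#define PER_PORT	(32 * HALF_SAVING)	/* 64 KB = 16 pages saved per port     */
#define PER_HOST	(4 * PER_PORT)		/* 256 KB = 64 pages per 4-port host   */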
Re: libata .sg_tablesize: why always dividing by 2 ?
Jeff Garzik wrote:
> As an aside, ISTR tomo-san was working on eliminating the need for the /2 by tackling the details on the IOMMU side...
..
Yes, tomo-san just led a nice detailed discussion of it here at LSF'08, and he agrees that even today it shouldn't affect us that way.

Cheers
Re: libata .sg_tablesize: why always dividing by 2 ?
On Mon, 2008-02-25 at 19:15 -0500, Jeff Garzik wrote:
> Mark Lord wrote:
>> Jeff,
>>
>> We had a discussion here today about IOMMUs, and they *never* split sg list entries -- they only ever *merge*. And this happens only after the block layer has already done merging while respecting q->seg_boundary_mask.
>>
>> So worst case, the IOMMU may merge everything, and then in libata we unmerge them again. But the end result can never exceed the max_sg_entries limit enforced by the block layer.
>
> shrug
>
> Early experience said otherwise. The split in foo_fill_sg() and resulting sg_tablesize reduction were both needed to successfully transfer data, when Ben H originally did the work.
>
> If Ben H and everyone on the arch side agrees with the above analysis, I would be quite happy to remove all those / 2.

The split wasn't done by the iommu. The split was done by the IDE code itself, to handle the stupid 64k crossing thingy. If it's done differently now, it might be possible to remove it, I haven't looked.

Cheers,
Ben.
Re: libata .sg_tablesize: why always dividing by 2 ?
Benjamin Herrenschmidt wrote:
> On Mon, 2008-02-25 at 19:15 -0500, Jeff Garzik wrote:
>> Mark Lord wrote:
>>> Jeff,
>>>
>>> We had a discussion here today about IOMMUs, and they *never* split sg list entries -- they only ever *merge*. And this happens only after the block layer has already done merging while respecting q->seg_boundary_mask.
>>>
>>> So worst case, the IOMMU may merge everything, and then in libata we unmerge them again. But the end result can never exceed the max_sg_entries limit enforced by the block layer.
>>
>> shrug
>>
>> Early experience said otherwise. The split in foo_fill_sg() and resulting sg_tablesize reduction were both needed to successfully transfer data, when Ben H originally did the work.
>>
>> If Ben H and everyone on the arch side agrees with the above analysis, I would be quite happy to remove all those / 2.
>
> The split wasn't done by the iommu. The split was done by the IDE code itself, to handle the stupid 64k crossing thingy. If it's done differently now, it might be possible to remove it, I haven't looked.
..
The block layer uses seg_boundary_mask to ensure that we never have to split them again in the LLD. A very long time ago, when I wrote the IDE DMA code, this was not the case.

Cheers
Re: libata .sg_tablesize: why always dividing by 2 ?
Mark Lord wrote:
> Benjamin Herrenschmidt wrote:
> ..
>> The split wasn't done by the iommu. The split was done by the IDE code itself, to handle the stupid 64k crossing thingy. If it's done differently now, it might be possible to remove it, I haven't looked.
> ..
> The block layer uses seg_boundary_mask to ensure that we never have to split them again in the LLD.
..
James B. suggests that we stick a WARN_ON() into libata to let us know if that precondition is violated. Sounds like an easy thing to do for a couple of -rc cycles someday.

Cheers
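A sketch of the check being suggested (generic names, not an actual patch from the thread): while walking the DMA-mapped scatterlist to build the hardware S/G table, warn if any segment already crosses a 64K boundary, i.e. if the precondition from the block layer did not survive mapping.

#include <linux/kernel.h>
#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

/* Illustrative only: warn once if a mapped segment crosses a 64K boundary.
 * A segment starting at 'addr' with length 'len' crosses one exactly when
 * its offset within the 64K window plus its length exceeds 64K. */
static void warn_on_64k_crossings(struct scatterlist *sgl, unsigned int n_elem)
{
	struct scatterlist *sg;
	unsigned int si;

	for_each_sg(sgl, sg, n_elem, si) {
		dma_addr_t addr = sg_dma_address(sg);
		u32 len = sg_dma_len(sg);

		WARN_ON_ONCE(((addr & 0xffff) + len) > 0x10000);
	}
}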
Re: libata .sg_tablesize: why always dividing by 2 ?
> The block layer uses seg_boundary_mask to ensure that we never have to split them again in the LLD. A very long time ago, when I wrote the IDE DMA code, this was not the case.

Not good enough, still, because the boundaries can change due to the iommu merging, thus the split must be re-done.

Ben.
Re: libata .sg_tablesize: why always dividing by 2 ?
> James B. suggests that we stick a WARN_ON() into libata to let us know if that precondition is violated. Sounds like an easy thing to do for a couple of -rc cycles someday.

If the block layer gives us a 32k block aligned on a 32k boundary, we have no guarantee that the iommu will not turn that into something unaligned, crossing a 32k (and thus possibly a 64k) boundary.

On powerpc, the iommu operates on 4k pages and only provides that level of alignment to dma_map_sg() (dma_alloc_coherent allocations are naturally aligned, but not map_sg; that would put way too much pressure on the allocator on machines that have pinhole-sized iommu space).

Cheers,
Ben.
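A small sketch of the behaviour Ben is describing (standard DMA API; dev and buf are placeholders): map a single 64K-aligned, 64K-long buffer and look at the bus address that comes back. Only page alignment is promised, so the mapped segment may straddle a 64K boundary even though the CPU-side buffer did not.

#include <linux/device.h>
#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

/* Sketch only: nothing here is driver-specific. */
static void show_mapped_alignment(struct device *dev, void *buf)
{
	struct scatterlist sg;

	sg_init_one(&sg, buf, 0x10000);		/* buf assumed 64K-aligned */

	if (dma_map_sg(dev, &sg, 1, DMA_TO_DEVICE)) {
		dma_addr_t addr = sg_dma_address(&sg);

		/* Only PAGE_SIZE alignment is guaranteed, so this offset may be
		 * non-zero and the segment may cross a 64K boundary. */
		dev_info(dev, "bus addr 0x%llx, offset into 64K window 0x%llx\n",
			 (unsigned long long)addr,
			 (unsigned long long)(addr & 0xffff));

		dma_unmap_sg(dev, &sg, 1, DMA_TO_DEVICE);
	}
}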
Re: libata .sg_tablesize: why always dividing by 2 ?
Benjamin Herrenschmidt wrote:
>> James B. suggests that we stick a WARN_ON() into libata to let us know if that precondition is violated. Sounds like an easy thing to do for a couple of -rc cycles someday.
>
> If the block layer gives us a 32k block aligned on a 32k boundary, we have no guarantee that the iommu will not turn that into something unaligned, crossing a 32k (and thus possibly a 64k) boundary.
..
Certainly, but never any worse than what the block layer gave originally.

The important note being: the IOMMU only ever *merges*, it never *splits*. Which means that, by the time we split up any mis-merges again for 64K crossings, we can never have more SG segments than what the block layer originally fed to the IOMMU stuff.

Or so the IOMMU and SCSI experts here at LSF'08 have assured me, even after my own skeptical questioning.

Cheers
Re: libata .sg_tablesize: why always dividing by 2 ?
On Mon, 2008-02-25 at 23:38 -0500, Mark Lord wrote:
> Benjamin Herrenschmidt wrote:
>>> James B. suggests that we stick a WARN_ON() into libata to let us know if that precondition is violated. Sounds like an easy thing to do for a couple of -rc cycles someday.
>>
>> If the block layer gives us a 32k block aligned on a 32k boundary, we have no guarantee that the iommu will not turn that into something unaligned, crossing a 32k (and thus possibly a 64k) boundary.
> ..
> Certainly, but never any worse than what the block layer gave originally.
>
> The important note being: the IOMMU only ever *merges*, it never *splits*.

Yes, but it will also change the address and doesn't guarantee the alignment.

> Which means that, by the time we split up any mis-merges again for 64K crossings, we can never have more SG segments than what the block layer originally fed to the IOMMU stuff.
>
> Or so the IOMMU and SCSI experts here at LSF'08 have assured me, even after my own skeptical questioning.

I suppose so. I don't remember all of the details, but iirc, it has to do with crossing 64K boundaries. Some controllers can't handle it.

It's not only the _size_ of the segments, it's their alignment. The iommu will not keep alignment beyond the page size (and even then... on powerpc with a 64k base page size, you may still end up with a 4k aligned result, but let's not go there now).

So that means that even if your block layer gives you nice aligned less-than-64k segments that don't cross 64k boundaries, and even if your iommu isn't doing any merging at all, it may still give you back things that do not respect that 64k alignment, might cross those boundaries, and thus might need to be split.

Now, it would make sense (if we don't have it already) to have a flag provided by the host controller that tells us whether it suffers from that limitation, and if not, we can avoid the whole thing.

Ben.
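For context, the split being discussed looks roughly like this (a simplified sketch in the spirit of the libata fill_sg helpers; the prd_entry layout is hypothetical): each mapped segment is chopped so that no hardware entry crosses a 64K boundary. Since the block layer hands over at most sg_tablesize segments, each of at most 64K, the split can at worst double the entry count -- which is exactly why sg_tablesize gets halved.

#include <linux/kernel.h>
#include <linux/scatterlist.h>
#include <linux/types.h>

struct prd_entry {		/* hypothetical 16-byte hardware S/G entry */
	__le32 addr;
	__le32 len;
	__le32 addr_hi;
	__le32 reserved;
};

/* Simplified sketch, not actual driver code: emit hardware entries from a
 * mapped scatterlist, cutting any segment at each 64K boundary it crosses. */
static unsigned int fill_sg_split64k(struct scatterlist *sgl, unsigned int n_elem,
				     struct prd_entry *prd)
{
	struct scatterlist *sg;
	unsigned int si, n_prd = 0;

	for_each_sg(sgl, sg, n_elem, si) {
		dma_addr_t addr = sg_dma_address(sg);
		u32 len = sg_dma_len(sg);

		while (len) {
			/* bytes remaining before the next 64K boundary */
			u32 chunk = min_t(u32, len, 0x10000 - (u32)(addr & 0xffff));

			prd[n_prd].addr    = cpu_to_le32(lower_32_bits(addr));
			prd[n_prd].addr_hi = cpu_to_le32(upper_32_bits(addr));
			prd[n_prd].len     = cpu_to_le32(chunk);
			n_prd++;

			addr += chunk;
			len  -= chunk;
		}
	}
	return n_prd;
}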
Re: libata .sg_tablesize: why always dividing by 2 ?
Benjamin Herrenschmidt wrote:
> On Mon, 2008-02-25 at 23:38 -0500, Mark Lord wrote:
>> Benjamin Herrenschmidt wrote:
>>>> James B. suggests that we stick a WARN_ON() into libata to let us know if that precondition is violated. Sounds like an easy thing to do for a couple of -rc cycles someday.
>>>
>>> If the block layer gives us a 32k block aligned on a 32k boundary, we have no guarantee that the iommu will not turn that into something unaligned, crossing a 32k (and thus possibly a 64k) boundary.
>> ..
>> Certainly, but never any worse than what the block layer gave originally.
>>
>> The important note being: the IOMMU only ever *merges*, it never *splits*.
>
> Yes, but it will also change the address and doesn't guarantee the alignment.
>
>> Which means that, by the time we split up any mis-merges again for 64K crossings, we can never have more SG segments than what the block layer originally fed to the IOMMU stuff.
>>
>> Or so the IOMMU and SCSI experts here at LSF'08 have assured me, even after my own skeptical questioning.
>
> I suppose so. I don't remember all of the details, but iirc, it has to do with crossing 64K boundaries. Some controllers can't handle it.
>
> It's not only the _size_ of the segments, it's their alignment. The iommu will not keep alignment beyond the page size (and even then... on powerpc with a 64k base page size, you may still end up with a 4k aligned result, but let's not go there now).
..
That's just not possible, unless the IOMMU *splits* segments. And the IOMMU experts here say that it never does that.

-ml
Re: libata .sg_tablesize: why always dividing by 2 ?
On Tue, 2008-02-26 at 00:43 -0500, Mark Lord wrote:
>> I suppose so. I don't remember all of the details, but iirc, it has to do with crossing 64K boundaries. Some controllers can't handle it.
>>
>> It's not only the _size_ of the segments, it's their alignment. The iommu will not keep alignment beyond the page size (and even then... on powerpc with a 64k base page size, you may still end up with a 4k aligned result, but let's not go there now).
> ..
> That's just not possible, unless the IOMMU *splits* segments. And the IOMMU experts here say that it never does that.

It is totally possible, and I know, as I wrote part of the powerpc iommu code :-)

The iommu code makes no guarantee vs. preserving the alignment of a segment, at least not below PAGE_SIZE. Thus if you pass to dma_map_sg() a 64K aligned 64K segment, you may well get back a 4K aligned 64K segment.

Enforcing natural alignment in the iommu code only happens for dma_alloc_coherent (it uses order-N allocations anyway), it doesn't happen for map_sg. If we were to do that, we would make it very likely for iommu allocations to fail on machines with small DMA windows.

Ben.