All
> Alex (or anyone else) can you point to where IOVA addresses are generated?
A case of RTFM perhaps (though a pointer to the code would still be
appreciated).
https://www.kernel.org/doc/Documentation/Intel-IOMMU.txt
Some exceptions to IOVA
---
Interrupt ranges are not
>I find this hard to believe. There's always the possibility that some
>part of the system doesn't support ACS so if the PCI bus addresses and
>IOVA overlap there's a good chance that P2P and ATS won't work at all on
>some hardware.
I tend to agree but this comes down to how
Hi Jerome
>Hopes this helps understanding the big picture. I over simplify thing and
>devils is in the details.
This was a great primer thanks for putting it together. An LWN.net article
perhaps ;-)??
Stephen
Hi Jerome
>Note on GPU we do would not rely on ATS for peer to peer. Some part
>of the GPU (DMA engines) do not necessarily support ATS. Yet those
>are the part likely to be use in peer to peer.
OK this is good to know. I agree the DMA engine is probably one of the GPU
components
> Not to me. In the p2pdma code we specifically program DMA engines with
> the PCI bus address.
Ah yes of course. Brain fart on my part. We are not programming the P2PDMA
initiator with an IOVA but with the PCI bus address...
> So regardless of whether we are using the IOMMU or
> not, the
Hi Jerome
> As it is tie to PASID this is done using IOMMU so looks for caller
> of amd_iommu_bind_pasid() or intel_svm_bind_mm() in GPU the existing
> user is the AMD GPU driver see:
Ah thanks. This cleared things up for me. A quick search shows there are still
no users of
Hi Christian
> Why would a switch not identify that as a peer address? We use the PASID
>together with ATS to identify the address space which a transaction
>should use.
I think you are conflating two types of TLPs here. If the device supports ATS
then it will issue a TR TLP to obtain
Hi Jerome
> Now inside that page table you can point GPU virtual address
> to use GPU memory or use system memory. Those system memory entry can
> also be mark as ATS against a given PASID.
Thanks. This all makes sense.
But do you have examples of this in a kernel driver (if so can you
Christian
>Interesting point, give me a moment to check that. That finally makes
>all the hardware I have standing around here valuable :)
Yes. At the very least it provides an initial standards based path for P2P DMAs
across RPs which is something we have discussed on this list in
Jerome and Christian
> I think there is confusion here, Alex properly explained the scheme
> PCIE-device do a ATS request to the IOMMU which returns a valid
> translation for a virtual address. Device can then use that address
> directly without going through IOMMU for translation.
So I went
Hi Don
>RDMA VFs lend themselves to NVMEoF w/device-assignment need a way to
>put NVME 'resources' into an assignable/manageable object for
> 'IOMMU-grouping',
>which is really a 'DMA security domain' and less an 'IOMMU grouping
> domain'.
Ha, I like your term "DMA Security
Hi Logan
>Yeah, I'm having a hard time coming up with an easy enough solution for
>the user. I agree with Dan though, the bus renumbering risk would be
>fairly low in the custom hardware seeing the switches are likely going
>to be directly soldered to the same board with the CPU.
Hi Alex and Don
>Correct, the VM has no concept of the host's IOMMU groups, only the
> hypervisor knows about the groups,
But as I understand it these groups are usually passed through to VMs on a
per-group basis by the hypervisor? So IOMMU group 1 might be passed to VM A and
IOMMU
>Yeah, so based on the discussion I'm leaning toward just having a
>command line option that takes a list of BDFs and disables ACS for them.
>(Essentially as Dan has suggested.) This avoids the shotgun.
I concur that this seems to be where the conversation is taking us.
@Alex -
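A rough sketch of what parsing such a list of BDFs could look like. The comma-separated `domain:bus:dev.fn` format, the `struct bdf` type and the `parse_bdf_list()` helper are all assumptions made for illustration; the thread does not specify the option's syntax, and this is userspace C rather than kernel code.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical container for one PCI address (domain:bus:dev.fn). */
struct bdf {
	unsigned int domain, bus, dev, fn;
};

/*
 * Parse a comma-separated list such as "0000:03:00.0,0000:04:01.1".
 * Returns the number of entries stored (at most max), or -1 if an
 * entry is malformed. The list format itself is an assumption.
 */
static int parse_bdf_list(const char *arg, struct bdf *out, int max)
{
	int count = 0;

	assert(arg != NULL && out != NULL);
	while (*arg && count < max) {
		struct bdf b;
		int used = 0;

		if (sscanf(arg, "%x:%x:%x.%x%n",
			   &b.domain, &b.bus, &b.dev, &b.fn, &used) != 4)
			return -1;
		/* PCI limits: 256 buses, 32 devices, 8 functions. */
		if (b.bus > 0xff || b.dev > 0x1f || b.fn > 0x7)
			return -1;
		out[count++] = b;
		arg += used;
		if (*arg == ',')
			arg++;		/* step over the separator */
		else if (*arg)
			return -1;	/* trailing junk after an entry */
	}
	return count;
}
```

Each parsed entry would then identify one port whose ACS settings are to be changed, which is what keeps the option from affecting every switch in the system.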
Hi Alex
>But it would be a much easier proposal to disable ACS when the IOMMU is
>not enabled, ACS has no real purpose in that case.
I guess one issue I have with this is that it disables IOMMU groups for all
Root Ports and not just the one(s) we wish to do p2pdma on.
>The
Hi Jerome
>I think there is confusion here, Alex properly explained the scheme
> PCIE-device do a ATS request to the IOMMU which returns a valid
>translation for a virtual address. Device can then use that address
>directly without going through IOMMU for translation.
This makes
Hi Don
>Well, p2p DMA is a function of a cooperating 'agent' somewhere above the two
>devices.
>That agent should 'request' to the kernel that ACS be removed/circumvented
> (p2p enabled) btwn two endpoints.
>I recommend doing so via a sysfs method.
Yes we looked at something like this
Hi Dan
>It seems unwieldy that this is a compile time option and not a runtime
>option. Can't we have a kernel command line option to opt-in to this
>behavior rather than require a wholly separate kernel image?
I think because of the security implications associated with p2pdma and
Hi Christian
> AMD APUs mandatory need the ACS flag set for the GPU integrated in the
> CPU when IOMMU is enabled or otherwise you will break SVM.
OK but in this case aren't you losing (many of) the benefits of P2P since all
DMAs will now get routed up to the IOMMU before being passed
> I'll see if I can get our PCI SIG people to follow this through
Hi Jonathan
Can you let me know if this moves forward within PCI-SIG? I would like to track
it. I can see this being doable between Root Ports that reside in the same Root
Complex but might become more challenging to
> That would be very nice but many devices do not support the internal
> route.
But Logan in the NVMe case we are discussing movement within a single function
(i.e. from a NVMe namespace to a NVMe CMB on the same function). Bjorn is
discussing movement between two functions (PFs or VFs) in the
> I've seen the response that peers directly below a Root Port could not
> DMA to each other through the Root Port because of the "route to self"
> issue, and I'm not disputing that.
Bjorn
You asked me for a reference to RTS in the PCIe specification. As luck would
have it I ended up in an
> P2P over PCI/PCI-X is quite common in devices like raid controllers.
Hi Dan
Do you mean between PCIe devices below the RAID controller? Isn't it pretty
novel to be able to support PCIe EPs below a RAID controller (as opposed to
SCSI based devices)?
> It would be useful if those
>I assume you want to exclude Root Ports because of multi-function
> devices and the "route to self" error. I was hoping for a reference
> to that so I could learn more about it.
Apologies Bjorn. This slipped through my net. I will try and get you a
reference for RTS in the next couple of
Hi Sinan
>If hardware doesn't support it, blacklisting should have been the right
>path and I still think that you should remove all switch business from the
> code.
>I did not hear enough justification for having a switch requirement
>for P2P.
We disagree. As does the
>> It sounds like you have very tight hardware expectations for this to work
>> at this moment. You also don't want to generalize this code for others and
>> address the shortcomings.
> No, that's the way the community has pushed this work
Hi Sinan
Thanks for all the input. As Logan has pointed
>Yes i need to document that some more in hmm.txt...
Hi Jerome, thanks for the explanation. Can I suggest you update hmm.txt with
what you sent out?
> I am about to send RFC for nouveau, i am still working out some bugs.
Great. I will keep an eye out for it. An example user of hmm will
> It seems people miss-understand HMM :(
Hi Jerome
Your unhappy face emoticon made me sad so I went off to (re)read up on HMM.
Along the way I came up with a couple of things.
While hmm.txt is really nice to read it makes no mention of DEVICE_PRIVATE and
DEVICE_PUBLIC. It also gives no
>http://nvmexpress.org/wp-content/uploads/NVM-Express-1.3-Ratified-TPs.zip
@Keith - my apologies.
@Christoph - thanks for the link
So my understanding of when the technical content surrounding new NVMe
Technical Proposals (TPs) could be discussed was wrong. I thought the TP
content could only be discussed
> We don't want to lump these all together without knowing which region you're
> allocating from, right?
In all seriousness I do agree with you on this, Keith, in the long term. We
would consider adding property flags for the memory as it is added to the p2p
core and then the allocator could
> There's a meaningful difference between writing to an NVMe CMB vs PMR
When the PMR spec becomes public we can discuss how best to integrate it into
the P2P framework (if at all) ;-).
Stephen
> No, locality matters. If you have a bunch of NICs and bunch of drives
> and the allocator chooses to put all P2P memory on a single drive your
> performance will suck horribly even if all the traffic is offloaded.
Sagi brought this up earlier in his comments about the _find_ function.
> I'm pretty sure the spec disallows routing-to-self so doing a P2P
> transaction in that sense isn't going to work unless the device
> specifically supports it and intercepts the traffic before it gets to
> the port.
This is correct. Unless the device intercepts the TLP before it hits the
>> We'd prefer to have a generic way to get p2pmem instead of restricting
>> ourselves to only using CMBs. We did work in the past where the P2P memory
>> was part of an IB adapter and not the NVMe card. So this won't work if it's
>> an NVMe only interface.
> It just seems like it it
> The intention of HMM is to be useful for all device memory that wish
> to have struct page for various reasons.
Hi Jerome and thanks for your input! Understood. We have looked at HMM in the
past and long term I definitely would like to consider how we can add P2P
functionality to HMM for
> your kernel provider needs to decide whether they favor device assignment or
> p2p
Thanks Alex! The hardware requirements for P2P (switch, high performance EPs)
are such that we really only expect CONFIG_P2P_DMA to be enabled in specific
instances and in those instances the users have made a
> I agree, I don't think this series should target anything other than
> using p2p memory located in one of the devices expected to participate
> in the p2p transaction for a first pass.
I disagree. There is definitely interest in using a NVMe CMB as a bounce buffer
and in deploying
Thanks for the detailed review Bjorn!
>>
>> + Enabling this option will also disable ACS on all ports behind
>> + any PCIe switch. This effectively puts all devices behind any
>> + switch into the same IOMMU group.
>
> Does this really mean "all devices behind the same Root
>> So Oliver (CC) was having issues getting any of that to work for us.
>>
>> The problem is that according to him (I didn't double check the latest
>> patches) you effectively hotplug the PCIe memory into the system when
>> creating struct pages.
>>
>> This cannot possibly work for us. First
> > Ideally, we'd want to use an NVME CMB buffer as p2p memory. This would
> > save an extra PCI transfer as the NVME card could just take the data
> > out of it's own memory. However, at this time, cards with CMB buffers
> > don't seem to be available.
> Can you describe what would be the plan
> Any plans adding the capability to nvme-rdma? Should be
> straight-forward... In theory, the use-case would be rdma backend
> fabric behind. Shouldn't be hard to test either...
Nice idea Sagi. Yes we have been starting to look at that. Though again we
would probably want to impose the
>> From: Stephen Bates <sba...@raithlin.com>
>>
>> Hybrid polling currently uses half the average completion time as an
>> estimate of how long to poll for. We can improve upon this by noting
>> that polling before the minimum completion time makes no sense. A
> As far as I can tell, Greg Kroah-Hartman queued the fix for 4.9 and
> 4.11, and the fix is in mainline, too:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=20223f0f39ea9d31ece08f04ac79f8c4e8d98246
Oh, I see that now. That’s superb! Thanks.
>
>> I've already sent this patch to Al twice (including a stable tag),
>> but it didn't seem to make it anywhere.
Does anyone have ideas how we move this along? This is the missing link in
allowing applications to request IO polling…
Stephen
> I am working on it btw [1]
Thanks for working on this Adhemerval!
> PS: resending with cc to all.
Can you cc linux-block when you submit the patchset?
Stephen
> So far, no one has submitted a patch.
OK, unless I hear that someone else is working on one I will take a look at
this.
> I hope the off_t parameter is passed exactly the same way as for pwritev
> and its 64-bit variant, for all architectures.
Duly noted.
For the kernel peeps I think this
hem...
Looks good.
Reviewed-By: Stephen Bates <sba...@raithlin.com>
> I recommend adding a pretty section to your gitconfig:
>
>[pretty]
>fixes = Fixes: %h (\"%s\")
Thanks Sagi. Duly added!
> Thanks, added. BTW, in the future, we have a format for specifying
> if a patch fixes another patch. This should have been:
Thanks!
> Fixes: 720b8ccc4500 ("blk-mq: Add a polling specific stats function")
Duly noted for next time. Apologies for the miss…
> You can, but it won't do much good since v3 is already applied. Any
> further changes must be incremental.
BTW I am getting a compile error from the Kyber code in for-4.12/block due to
the fact we now return a signed value from the bucket function…
batesste@ubuntu64-batesste:~/kernel/linux$ make -j 2
> You can, but it won't do much good since v3 is already applied. Any
> further changes must be incremental.
OK, I will send a small patch on top of what you just applied. Thanks!
> Great thanks, pushed. I'll get this added for 4.11. Thanks for
> the report!
I see you applied my v3 series to the for-4.12/block branch. There is one issue
there I need to fix. Can I send you a v4 a bit later today instead?
--
Jens Axboe
> Is the patch going to be different than the one I sent? Here it
> is, with a comment added. Can I add you tested-by?
Yes you can add a Tested-By from me….
Tested-By: Stephen Bates <sba...@raithlin.com>
> I agree, it's fine as-is. We should queue it up for 4.12.
Great. I will get something based on Omar’s latest comments asap.
> > However right now I am stuck as I am seeing the kernel oops I reported
> > before in testing of my latest patchset [1]. I will try and find some
>> time to bisect
> Nah, let's just leave it as-is then, even though I don't think it's the
> prettiest thing I've ever seen.
I did look at making the stats buckets in the request_queue struct based on dir
and size. Something like:
- struct blk_rq_stat poll_stat[2];
+ struct blk_rq_stat
Hi All
As part of my testing of IO polling [1] I am seeing a NULL pointer dereference
oops that seems to have been introduced in the preparation for 4.11. The kernel
oops output is below and this seems to be due to blk_mq_tag_to_rq returning
NULL in blk_mq_poll in blk-mq.c. I have not had a
Hi
Does anyone know the status of support for the pwritev2/preadv2 system calls in
glibc? I am doing some more IO polling testing and realized that the support is
not in 2.25 and does not seem to be staged for later release?
Right now I am using the FIO_HAVE_PWRITEV2 support that fio provides
On 2017-04-05, 7:14 PM, "Jens Axboe" wrote:
> Why not just have 8 buckets, and make it:
>
> bucket = ddir + ilog2(bytes) - 9;
>
> and cap it at MAX_BUCKET (8) and put all those above into the top
> bucket.
Thanks. However, that equation does not differentiate between
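To make the quoted formula concrete, here is a small userspace sketch of it. `ilog2_u()` is a stand-in for the kernel's `ilog2()`, `poll_bucket()` is a hypothetical name, and clamping negative results to 0 is an added assumption that the quoted suggestion does not mention.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BUCKET 8	/* cap taken from the quoted suggestion */

/* Integer log2, rounding down; userspace stand-in for ilog2(). */
static int ilog2_u(size_t v)
{
	int log = 0;

	assert(v > 0);
	while (v >>= 1)
		log++;
	return log;
}

/* ddir: 0 for a read, 1 for a write, as rq_data_dir() reports it. */
static int poll_bucket(int ddir, size_t bytes)
{
	int bucket = ddir + ilog2_u(bytes) - 9;

	if (bucket < 0)
		bucket = 0;		/* assumption: clamp sub-512-byte IO */
	if (bucket > MAX_BUCKET)
		bucket = MAX_BUCKET;	/* everything larger shares the top bucket */
	return bucket;
}
```

Working the formula through, a 512-byte write (1 + 9 - 9) and a 1024-byte read (0 + 10 - 9) both land in bucket 1. Whether that is the ambiguity the reply goes on to describe is not visible in the truncated text.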
[Retrying as my new setup secretly converted to html format without telling me.
Apologies for the resend.]
>>>
>>> Thanks for the review Sagi. I’d be OK going with <=0 as the exact
>>> match would normally be for minimal IO sizes (where <= and = are the
>>> same thing). I will see what other
>>
>> Thanks for the review Sagi. I’d be OK going with <=0 as the exact
>> match would normally be for minimal IO sizes (where <= and = are the
>> same thing). I will see what other feedback I get and aim to do a
>> respin soon…
>
> No tunables for this, please. There's absolutely no reason why
>>
>> In order to bucket IO for the polling algorithm we use a sysfs entry
>> to set the filter value. It is signed and we will use that as follows:
>>
>> 0 : No filtering. All IO are considered in stat generation
>> > 0 : Filtering based on IO of exactly this size only.
>> < 0 :
Thanks for the review Omar!
>>
>> -unsigned int blk_stat_rq_ddir(const struct request *rq)
>> +int blk_stat_rq_ddir(const struct request *rq)
>> {
>> -return rq_data_dir(rq);
>> +return (int)rq_data_dir(rq);
>
>The cast here here isn't necessary, let's leave it off.
>
OK, I will add
n times [2].
Stephen Bates
[1] http://marc.info/?l=linux-block&m=146307410101827&w=2
[2] http://marc.info/?l=linux-block&m=147803441801858&w=2
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at ht
> Minor nit below
>
>
>> +
>> +for (i = NVME_CMB_CAP_SQS; i <= NVME_CMB_CAP_WDS; i++)
>>
> I'd prefer seeing (i = 0; i < ARRAY_SIZE(..); i++) because it provides
> automatic bounds checking against future code.
>
Thanks Jon, I will take a look at doing this in a V1.
Stephen
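The review suggestion above can be sketched in userspace C as follows. The `cmb_cap_names[]` table and the `cmb_caps_to_str()` helper are hypothetical; only the capability names (SQS, CQS, LISTS, RDS, WDS, in CMBSZ bit order) and the `ARRAY_SIZE` idiom come from the NVMe register layout and the quoted comment.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical table; index i corresponds to capability bit i. */
static const char *const cmb_cap_names[] = {
	"SQS", "CQS", "LISTS", "RDS", "WDS",
};

/*
 * Write the names of the capability bits set in caps into buf,
 * separated by spaces. Returns the string length written.
 */
static size_t cmb_caps_to_str(unsigned int caps, char *buf, size_t len)
{
	size_t i, used = 0;

	assert(buf != NULL);
	if (len == 0)
		return 0;
	buf[0] = '\0';
	/* The bound follows the table, so adding a name cannot overrun it. */
	for (i = 0; i < ARRAY_SIZE(cmb_cap_names); i++) {
		size_t n;

		if (!(caps & (1u << i)))
			continue;
		n = strlen(cmb_cap_names[i]);
		if (used + n + 2 > len)	/* name + separator + NUL */
			break;
		if (used)
			buf[used++] = ' ';
		memcpy(buf + used, cmb_cap_names[i], n);
		used += n;
		buf[used] = '\0';
	}
	return used;
}
```

Because the loop bound is derived from the table rather than from the first and last enum values, a capability added later only needs a new table entry.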
>
> I have added 1/2, since that one is a no-brainer. For 2/2, not so sure.
> Generally we try to avoid having sysfs file that aren't single value
> output. That isn't a super hard rule, but it is preferable.
>
> --
> Jens Axboe
>
Thanks Jens and sorry for the delay (extended vacation). Thanks
Make sure we are using the correct scnprintf in the sysfs show
function for the CMB.
Signed-off-by: Stephen Bates <sba...@raithlin.com>
---
drivers/nvme/host/pci.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
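For context on why the choice of function matters in a sysfs show routine: `snprintf()` returns the length the output would have had, which can exceed the buffer size, whereas the kernel's `scnprintf()` returns the number of characters actually stored. The sketch below is a userspace stand-in named `scnprintf_like()` (a hypothetical name) that mirrors that return convention; it is not the kernel implementation.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Userspace stand-in following the scnprintf() return convention. */
static int scnprintf_like(char *buf, size_t size, const char *fmt, ...)
{
	va_list args;
	int n;

	assert(fmt != NULL);
	if (size == 0)
		return 0;
	va_start(args, fmt);
	n = vsnprintf(buf, size, fmt, args);
	va_end(args);
	if (n < 0)
		return 0;	/* assumption: treat encoding errors as empty */
	/* On truncation, report what fits, excluding the trailing NUL. */
	return (size_t)n >= size ? (int)(size - 1) : n;
}
```

With a 4-byte buffer and the string "abcdef", `snprintf()` reports 6 while the stand-in reports 3, which is the value a show function can safely return as its byte count.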
Hi
This series adds some more verbosity to the NVMe CMB sysfs entry.
Jens I based this off v4.9 because for some reason your for-4.10/block
is missing my original CMB commit (202021c1a63c6)?
Stephen
Stephen Bates (2):
nvme : Use correct scnprintf in cmb show
nvme: improve cmb sysfs
Add more information to the NVMe CMB sysfs entry. This includes
information about the CMB size, location and capabilities.
Signed-off-by: Stephen Bates <sba...@raithlin.com>
---
drivers/nvme/host/pci.c | 31 +--
include/linux/nvme.h| 8
2 files c
etion time. However I think that is something we can improve on over
time and I don't see it as a reason to not get this series upstream.
For the series:
Tested-By: Stephen Bates <sba...@raithlin.com>
Reviewed-By: Stephen Bates <sba...@raithlin.com>
Cheers
Stephen
[1] https://githu
On Tue, October 25, 2016 3:19 pm, Dave Chinner wrote:
> On Tue, Oct 25, 2016 at 05:50:43AM -0600, Stephen Bates wrote:
>>
>> Dave are you saying that even for local mappings of files on a DAX
>> capable system it is possible for the mappings to move on you unless the
On Wed, Oct 19, 2016 at 01:01:06PM -0700, Dan Williams wrote:
> >>
> >> In the cover letter, "[PATCH 0/3] iopmem : A block device for PCIe
> >> memory", it mentions that the lack of I/O coherency is a known issue
> >> and users of this functionality need to be cognizant of the pitfalls.
> >> If
On Wed, Oct 19, 2016 at 10:50:25AM -0700, Dan Williams wrote:
> On Tue, Oct 18, 2016 at 2:42 PM, Stephen Bates <sba...@raithlin.com> wrote:
> > From: Logan Gunthorpe <log...@deltatee.com>
> >
> > We build on recent work that adds memory regions owned by a d
IO memory with struct pages.
Stephen Bates (2):
iopmem : Add a block device driver for PCIe attached IO memory.
iopmem : Add documentation for iopmem driver
Documentation/blockdev/00-INDEX | 2 +
Documentation/blockdev/iopmem.txt | 62 +++
MAINTAINERS | 7
Add documentation for the iopmem PCIe device driver.
Signed-off-by: Stephen Bates <sba...@raithlin.com>
Signed-off-by: Logan Gunthorpe <log...@deltatee.com>
---
Documentation/blockdev/00-INDEX | 2 ++
Documentation/blockdev/iopmem.txt | 62
https://lists.01.org/pipermail/linux-nvdimm/2015-August/001810.html
[2] https://lists.01.org/pipermail/linux-nvdimm/2015-October/002387.html
Signed-off-by: Stephen Bates <sba...@raithlin.com>
Signed-off-by: Logan Gunthorpe <log...@deltatee.com>
---
drivers/dax/pmem.c| 4
Add a new block device driver that binds to PCIe devices and turns
PCIe BARs into DAX capable block devices.
Signed-off-by: Stephen Bates <sba...@raithlin.com>
Signed-off-by: Logan Gunthorpe <log...@deltatee.com>
---
MAINTAINERS| 7 ++
drivers/block/Kconfig | 27 +
Remove annoying white space damage in pci.c.
Signed-off-by: Stephen Bates <sba...@raithlin.com>
---
drivers/nvme/host/pci.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index a7c6e9d..e2b3243 100644
--- a/drivers/nvm
are based off commit 0eadf37afc in Jens'
for-4.9/block. I am not sure if that is the right repo and can rebase
if necessary.
In full disclosure the zeroing patch is based heavily on a code
snippet Jens sent me a few months ago.
Stephen Bates (2):
Add poll_considered statistic
Enable zeroing
-by: Stephen Bates <sba...@raithlin.com>
---
block/blk-core.c | 8 ++--
block/blk-mq-sysfs.c | 4 +++-
include/linux/blk-mq.h | 1 +
3 files changed, 10 insertions(+), 3 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index 34ff808..14d7c07 100644
--- a/block/blk-