Re: [PATCH net-next 03/21] ethtool, stats: introduce standard XDP statistics
On 8/4/21 12:27 PM, Saeed Mahameed wrote:
>> I just ran some quick tests with my setup and measured about 1.2% worst
>
> 1.2% is a lot! what was the test? what is the change?

I did say "quick test ... not exhaustive", and it was definitely eyeballing a pps change over a small time window.

If multiple counters are bumped 20-25 million times a second (e.g., the XDP drop case), how measurable is it? I was just trying to ballpark the overhead - 1%, 5%, more? If it is <~1%, then there is no performance argument, in which case let's do the right thing for users - export via existing APIs.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 00/11] Provide offset-adjusted framebuffer mappings
Hi Sam,

Am 03.08.21 um 18:21 schrieb Sam Ravnborg:
> Hi Thomas,
>
> On Tue, Aug 03, 2021 at 02:59:17PM +0200, Thomas Zimmermann wrote:
>> A framebuffer's offsets field might be non-zero to make the BO data
>> start at the specified offset within the BO's memory. Handle this case
>> in drm_gem_fb_vmap() and update all callers. So far, many drivers
>> ignore the offsets, which can lead to visual artifacts.
>>
>> Patch 1 adds an optional argument to drm_gem_fb_vmap() to return the
>> offset-adjusted data address for use with shadow-buffered planes.
>> Patches 3 and 11 convert gud and vkms, which are the other callers of
>> drm_gem_fb_vmap(). For gud, it's just a cleanup. Vkms will handle the
>> framebuffer offsets correctly for its input and output framebuffers.
>>
>> The other patches convert users of shadow-buffered planes to use the
>> data address. After conversion, each driver will use the correct data
>> for non-zero offsets.
>>
>>   drm/ast: Use offset-adjusted shadow-plane mappings
>>   drm/gud: Get offset-adjusted mapping from drm_gem_fb_vmap()
>>   drm/hyperv: Use offset-adjusted shadow-plane mappings
>>   drm/mgag200: Use offset-adjusted shadow-plane mappings
>>   drm/cirrus: Use offset-adjusted shadow-plane mappings
>>   drm/gm12u320: Use offset-adjusted shadow-plane mappings
>>   drm/simpledrm: Use offset-adjusted shadow-plane mapping
>>   drm/udl: Use offset-adjusted shadow-plane mapping
>>   drm/vbox: Use offset-adjusted shadow-plane mappings
>>   drm/vkms: Use offset-adjusted shadow-plane mappings and output
>
> Everything looked good while reading through the patches. I cannot say
> if everything was properly converted, but the patches looked good. So
> they are all:
> Acked-by: Sam Ravnborg

Thanks!

> There were a few TODO comments visible about using the mapping API
> properly. I assume this is coming in a later patch set.

There are indeed quite a few such comments. When we introduced the dma_buf_map type to solve the fbdev issue on sparc64, in many places I simply put the existing vaddr pointers into the map structure, and vice versa.

The code works as expected, but in the future we may want to use dma_buf_map for all framebuffer pointers. This would, for example, require format conversion helpers that operate on these structures. Adding that will require a number of changes throughout these helpers.

Best regards
Thomas

--
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5, 90409 Nürnberg, Germany
(HRB 36809, AG Nürnberg)
Geschäftsführer: Felix Imendörffer
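For readers unfamiliar with the bug the series fixes, the essence of the offset adjustment can be modeled in a few lines. This is an illustrative userspace sketch, not the real DRM code (which operates on struct drm_framebuffer and struct dma_buf_map): the BO mapping starts at some vaddr, but the plane's pixel data starts fb->offsets[plane] bytes into it, and drivers that ignore the offset read from the wrong place.

```c
#include <assert.h>

/* Toy stand-in for struct drm_framebuffer; only the offsets field is
 * modeled here. */
struct toy_fb {
	unsigned int offsets[4];
};

/* Model of the offset adjustment drm_gem_fb_vmap() gains in this
 * series: return the address of the plane's actual data rather than
 * the start of the BO mapping. */
void *fb_data_addr(const struct toy_fb *fb, void *vaddr, int plane)
{
	return (char *)vaddr + fb->offsets[plane];
}
```

A driver that uses the raw vaddr instead of fb_data_addr() would scan out whatever happens to live in the first offsets[plane] bytes of the BO, which is the source of the visual artifacts mentioned above.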
Re: [PATCH net-next 03/21] ethtool, stats: introduce standard XDP statistics
On 8/4/21 10:44 AM, Jakub Kicinski wrote:
> On Wed, 4 Aug 2021 10:17:56 -0600 David Ahern wrote:
>> On 8/4/21 6:36 AM, Jakub Kicinski wrote:
>>>> XDP is going to always be eBPF based! why not just report such stats
>>>> to a special BPF_MAP? The BPF stack can collect the stats from the
>>>> driver and report them to this special MAP upon user request.
>>>
>>> Do you mean replacing the ethtool-netlink / rtnetlink etc. with
>>> a new BPF_MAP? I don't think adding another category of uAPI thru
>>> which netdevice stats are exposed would do much good :( Plus it
>>> doesn't address the "yet another cacheline" concern.
>>>
>>> To my understanding the need for stats recognizes the fact that (in
>>> large organizations) fleet monitoring is done by different teams than
>>> XDP development. So the XDP team may have all the stats they need, but
>>> the team doing fleet monitoring has no idea how to get to them.
>>>
>>> To bridge the two worlds we need a way for the infra team to ask the
>>> XDP for well-defined stats. Maybe we should take a page from the BPF
>>> iterators book and create a program type for bridging the two worlds?
>>> Called by the networking core when dumping stats to extract from the
>>> existing BPF maps all the relevant stats and render them into a
>>> well-known struct? Users' XDP design can still use a single per-cpu
>>> map with all the stats if they so choose, but there's a way to
>>> implement more optimal designs and still expose well-defined stats.
>>>
>>> Maybe that's too complex, IDK.
>>
>> I was just explaining to someone internally how to get stats at all of
>> the different points in the stack to track down reasons for dropped
>> packets:
>>
>>   ethtool -S for h/w and driver
>>   tc -s for drops by the qdisc
>>   /proc/net/softnet_stat for drops at the backlog layer
>>   netstat -s for network and transport layer
>>
>> yet another command and API just adds to the nightmare of explaining
>> and understanding these stats.
>
> Are you referring to RTM_GETSTATS when you say "yet another command"?
> RTM_GETSTATS exists and is used by offloads today.
>
> I'd expect ip -s (-s) to be extended to run GETSTATS and display the
> xdp stats. (Not sure why ip -s was left out of your list :))

It's on my diagram, and yes, I forgot to add it here.

>> There is real value in continuing to use the ethtool API for XDP
>> stats. Not saying this reorg of the XDP stats is the right thing to
>> do, only that the existing API has real user benefits.
>
> RTM_GETSTATS is an existing API. New ethtool stats are intended to be
> HW stats. I don't want to go back to ethtool being a dumping ground
> for all stats because that's what the old interface encouraged.

Driver stats are important too - e.g., mlx5's cache stats and per-queue stats.

>> Does anyone have data that shows bumping a properly implemented
>> counter causes a noticeable performance degradation and if so by how
>> much? You mention 'yet another cacheline' but collecting stats on the
>> stack and incrementing the driver structs at the end of the napi loop
>> should not have a huge impact versus the value the stats provide.
>
> Not sure, maybe Jesper has some numbers. Maybe Intel folks do?

I just ran some quick tests with my setup and measured about 1.2% worst case. Certainly not exhaustive. Perhaps Intel or Mellanox can provide numbers for their high-speed NICs - e.g., a ConnectX-6 and a saturated host.

> I'm just allergic to situations when there is a decision made and
> then months later patches are posted disregarding the decision,
> without analysis on why that decision was wrong. And while the
> maintainer who made the decision is on vacation.

Stats is one of the many sensitive topics. I have been consistent in defending the need to use existing APIs and tooling and not relying on XDP program writers to add the relevant stats and then provide whatever tool is needed to extract and print them. Standardization for fundamental analysis tools.
Re: [PATCH net-next 03/21] ethtool, stats: introduce standard XDP statistics
On 8/4/21 6:36 AM, Jakub Kicinski wrote:
>> XDP is going to always be eBPF based! why not just report such stats
>> to a special BPF_MAP? The BPF stack can collect the stats from the
>> driver and report them to this special MAP upon user request.
>
> Do you mean replacing the ethtool-netlink / rtnetlink etc. with
> a new BPF_MAP? I don't think adding another category of uAPI thru
> which netdevice stats are exposed would do much good :( Plus it
> doesn't address the "yet another cacheline" concern.
>
> To my understanding the need for stats recognizes the fact that (in
> large organizations) fleet monitoring is done by different teams than
> XDP development. So the XDP team may have all the stats they need, but
> the team doing fleet monitoring has no idea how to get to them.
>
> To bridge the two worlds we need a way for the infra team to ask the
> XDP for well-defined stats. Maybe we should take a page from the BPF
> iterators book and create a program type for bridging the two worlds?
> Called by the networking core when dumping stats to extract from the
> existing BPF maps all the relevant stats and render them into a
> well-known struct? Users' XDP design can still use a single per-cpu
> map with all the stats if they so choose, but there's a way to
> implement more optimal designs and still expose well-defined stats.
>
> Maybe that's too complex, IDK.

I was just explaining to someone internally how to get stats at all of the different points in the stack to track down reasons for dropped packets:

  ethtool -S for h/w and driver
  tc -s for drops by the qdisc
  /proc/net/softnet_stat for drops at the backlog layer
  netstat -s for network and transport layer

yet another command and API just adds to the nightmare of explaining and understanding these stats.

There is real value in continuing to use the ethtool API for XDP stats. Not saying this reorg of the XDP stats is the right thing to do, only that the existing API has real user benefits.

Does anyone have data that shows bumping a properly implemented counter causes a noticeable performance degradation and if so by how much? You mention 'yet another cacheline' but collecting stats on the stack and incrementing the driver structs at the end of the napi loop should not have a huge impact versus the value the stats provide.
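The "collect on the stack, fold in at the end of the NAPI loop" pattern argued for here can be sketched in a few lines. This is a minimal userspace model, not code from any real driver; the struct and function names are invented for illustration:

```c
#include <assert.h>

/* Hypothetical per-queue stats struct; field names are illustrative,
 * not taken from any real driver. */
struct xdp_queue_stats {
	unsigned long drops;
	unsigned long redirects;
};

/* Bump counters in registers/on the stack inside the hot loop, then
 * write them into the (possibly cache-cold) driver struct once per
 * poll, so the per-packet cost is a register increment rather than a
 * shared-cacheline write for every packet. */
void napi_poll_sketch(struct xdp_queue_stats *qstats,
		      const int *verdicts, int budget)
{
	unsigned long drops = 0, redirects = 0;
	int i;

	for (i = 0; i < budget; i++) {
		if (verdicts[i] == 0)	/* stand-in for XDP_DROP */
			drops++;
		else			/* stand-in for XDP_REDIRECT */
			redirects++;
	}

	/* One memory write per counter per poll, not per packet. */
	qstats->drops += drops;
	qstats->redirects += redirects;
}
```

Whether the residual cost of those end-of-poll writes is measurable at tens of millions of packets per second is exactly the open question in this thread.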
Re: [PATCH v10 01/17] iova: Export alloc_iova_fast() and free_iova_fast()
On 2021-08-04 06:02, Yongji Xie wrote:

On Tue, Aug 3, 2021 at 6:54 PM Robin Murphy wrote:

On 2021-08-03 09:54, Yongji Xie wrote:

On Tue, Aug 3, 2021 at 3:41 PM Jason Wang wrote:

在 2021/7/29 下午3:34, Xie Yongji 写道:

Export alloc_iova_fast() and free_iova_fast() so that some modules can use it to improve iova allocation efficiency.

It's better to explain why alloc_iova() is not sufficient here.

Fine.

What I fail to understand from the later patches is what the IOVA domain actually represents. If the "device" is a userspace process then logically the "IOVA" would be the userspace address, so presumably somewhere you're having to translate between this arbitrary address space and actual usable addresses - if you're worried about efficiency surely it would be even better to not do that?

Yes, the userspace daemon needs to translate the "IOVA" in a DMA descriptor to the VA (from mmap(2)). But this actually doesn't affect performance since it's an identical mapping in most cases.

I'm not familiar with the vhost_iotlb stuff, but it looks suspiciously like you're walking yet another tree to make those translations. Even if the buffer can be mapped all at once with a fixed offset such that each DMA mapping call doesn't need a lookup for each individual "IOVA" - that might be what's happening already, but it's a bit hard to follow just reading the patches in my mail client - vhost_iotlb_add_range() doesn't look like it's super-cheap to call, and you're serialising on a lock for that.

My main point, though, is that if you've already got something else keeping track of the actual addresses, then the way you're using an iova_domain appears to be something you could do with a trivial bitmap allocator. That's why I don't buy the efficiency argument.

The main design points of the IOVA allocator are to manage large address spaces while trying to maximise spatial locality to minimise the underlying pagetable usage, and allocating with a flexible limit to support multiple devices with different addressing capabilities in the same address space. If none of those aspects are relevant to the use-case - which AFAICS appears to be true here - then as a general-purpose resource allocator it's rubbish and has an unreasonably massive memory overhead, and there are many, many better choices. FWIW I've recently started thinking about moving all the caching stuff out of iova_domain and into the iommu-dma layer since it's now a giant waste of space for all the other current IOVA users.

Presumably userspace doesn't have any concern about alignment and the things we have to worry about for the DMA API in general, so it's pretty much just allocating slots in a buffer, and there are far more effective ways to do that than a full-blown address space manager.

Considering iova allocation efficiency, I think the iova allocator is better here. In most cases, we don't even need to hold a spin lock during iova allocation.

If you're going to reuse any infrastructure I'd have expected it to be SWIOTLB rather than the IOVA allocator. Because, y'know, you're *literally implementing a software I/O TLB* ;)

But actually what we can reuse in SWIOTLB is the IOVA allocator.

Huh? Those are completely unrelated and orthogonal things - SWIOTLB does not use an external allocator (see find_slots()). By SWIOTLB I mean specifically the library itself, not dma-direct or any of the other users built around it. The functionality for managing slots in a buffer and bouncing data in and out can absolutely be reused - that's why users like the Xen and iommu-dma code *are* reusing it instead of open-coding their own versions.

And the IOVA management in SWIOTLB is not what we want. For example, SWIOTLB allocates and uses contiguous memory for bouncing, which is not necessary in the VDUSE case.

alloc_iova() allocates a contiguous (in IOVA address) region of space. In vduse_domain_map_page() you use it to allocate a contiguous region of space from your bounce buffer. Can you clarify how that is fundamentally different from allocating a contiguous region of space from a bounce buffer? Nobody's saying the underlying implementation details of where the buffer itself comes from can't be tweaked.

And VDUSE needs coherent mapping which is not supported by the SWIOTLB. Besides, the SWIOTLB works in singleton mode (designed for the platform IOMMU), but VDUSE is based on an on-chip IOMMU (supports multiple instances).

That's not entirely true - the IOMMU bounce buffering scheme introduced in intel-iommu and now moved into the iommu-dma layer was already a step towards something conceptually similar. It does still rely on stealing the underlying pages from the global SWIOTLB pool at the moment, but the bouncing is effectively done in a per-IOMMU-domain context.

The next step is currently queued in linux-next, wherein we can now have individual per-device SWIOTLB pools. In fact at that point I think you might actually
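The "trivial bitmap allocator" Robin alludes to, as an alternative to a full iova_domain for fixed-size slots in a bounce buffer, can be sketched in a few lines. This is an illustrative model only, with made-up names and a toy 64-slot capacity, not a proposal for the actual VDUSE code:

```c
#include <assert.h>
#include <stdint.h>

/* Toy slot allocator: one bit per fixed-size slot in a bounce buffer.
 * 64 slots keeps the map in a single word for the sketch; a real one
 * would use an array of words or the kernel's bitmap helpers. */
#define NSLOTS 64

static uint64_t slot_map;	/* bit i set => slot i in use */

/* Returns a free slot index, or -1 if the buffer is full. */
int slot_alloc(void)
{
	int i;

	for (i = 0; i < NSLOTS; i++) {
		if (!(slot_map & (1ULL << i))) {
			slot_map |= 1ULL << i;
			return i;
		}
	}
	return -1;
}

void slot_free(int i)
{
	slot_map &= ~(1ULL << i);
}
```

The point of the sketch is the contrast in footprint: a bitmap costs one bit of state per slot, whereas an iova_domain carries an rbtree, per-CPU caches, and rcache machinery that only pay off for the large, multi-device address spaces it was designed for.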
Re: [External] Re: [PATCH 0/4] Add multi-cid support for vsock driver
On Wed, Aug 04, 2021 at 03:09:41PM +0800, 傅关丞 wrote:
> Sorry, I cannot figure out a good use case. It is normal for a host to
> have multiple IP addresses used for communication, so I thought it
> might be nice to have both host and guest use multiple CIDs for
> communication. I know this is not a very strong argument.

Maybe there could be a use case for guests (which I don't see now), but for the host it seems pointless. The strength of vsock is that the guest knows that using CID=2 always reaches the host. Moreover, we have recently merged VMADDR_FLAG_TO_HOST, which when set allows you to forward any packet to the host regardless of the CID (not yet supported by vhost-vsock).

> The vsock driver does not work if one of the two peers doesn't support
> multiple CIDs.

This is absolutely to be avoided. I think virtio device feature negotiation can help here.

> I have a possible solution here, but there may be some problems with
> it that I haven't noticed.
>
> Hypervisors will use different ways to send the CID setup to the
> kernel based on their vsock setup.
>
> --- For the host ---
> If the host vsock driver supports multi-CID, the hypervisor will use
> the modified VHOST_VSOCK_SET_GUEST_CID call to set its CIDs.
> Otherwise, the original call is used.
>
> --- For the guest ---
> Now the virtio_vsock_config looks like this:
>
>   u64 guest_cid;
>   u32 num_guest_cid;
>   u32 num_host_cid;
>   u32 index;
>   u64 cid;
>
> If the guest vsock driver supports multi-CID, it will read
> num_guest_cid and num_host_cid from the device config space. Then it
> writes an index register, which is the CID it wants to read. After the
> hypervisor handles this, it can read the CID from the cid register.
> If it does not support multi-CID, it will just read guest_cid from
> the config space, which should work just fine.

Why not add a new device feature to enable or disable multi-CID?

> --- Communication ---
> For communication issues, we might need to use a new feature bit.
> Let's call it VHOST_VSOCK_SUPPORT_MULTI_CID.
>
> The basic idea is that this feature bit is set when both host and
> guest support using multiple CIDs. After negotiation, if the feature
> bit is set, the host can use all the CIDs specified to communicate
> with the guest. Otherwise, the first CID passed in will be used as
> the guest_cid to communicate with guests.

I think the same feature bit can be used for the virtio_vsock_config, no?

> Also, if the bit is set for guests, all the CIDs can be used to
> communicate with the host. Otherwise, the first CID with index 0 will
> be used as the guest_cid, while VMADDR_HOST_CID will be used for the
> host CID.

We already have VMADDR_FLAG_TO_HOST to forward all packets to the host; we only need to support it in some way in vhost-vsock.

Thanks,
Stefano
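The config-space layout proposed in this email can be written out as a C struct. To be clear, this is only a transcription of the proposal under discussion, not an accepted virtio layout; the struct name is invented, and the key backward-compatibility property is that a legacy guest keeps reading only the leading guest_cid field:

```c
#include <assert.h>
#include <stdint.h>

/* Transcription of the proposed extended virtio_vsock_config from the
 * email above. Hypothetical layout; not part of the virtio spec. */
struct virtio_vsock_config_proposed {
	uint64_t guest_cid;	/* legacy field, first so old drivers still work */
	uint32_t num_guest_cid;	/* how many CIDs the guest owns */
	uint32_t num_host_cid;	/* how many CIDs the host owns */
	uint32_t index;		/* written by the driver: which CID to fetch */
	uint64_t cid;		/* read back once the device handles index */
};

/* A guest without multi-CID support reads only guest_cid and ignores
 * the new fields entirely. */
uint64_t legacy_read_cid(const struct virtio_vsock_config_proposed *c)
{
	return c->guest_cid;
}
```

The index/cid register pair is exactly the kind of stateful config-space protocol that Stefano's "why not a feature bit?" question pushes back on: a negotiated feature bit lets both sides agree on the layout up front instead of multiplexing reads through an index register.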
Re: [RFC PATCH v1 0/7] virtio/vsock: introduce MSG_EOR flag for SEQPACKET
Hi Arseny,

On Mon, Jul 26, 2021 at 07:31:33PM +0300, Arseny Krasnov wrote:
> This patchset implements support of the MSG_EOR bit for SEQPACKET
> AF_VSOCK sockets over the virtio transport.
>
> The idea is to distinguish the concepts of 'messages' and 'records'.
> A message is the result of sending calls: 'write()', 'send()',
> 'sendmsg()', etc. It has a fixed maximum length, and its bounds are
> visible using the return from receive calls: 'read()', 'recv()',
> 'recvmsg()', etc. The current implementation is based on the message
> definition above.

Okay, so the implementation we merged is wrong, right? Should we disable the feature bit in stable kernels that contain it? Or maybe we can backport the fixes...

> A record has unlimited length, it consists of multiple messages, and
> the bounds of a record are visible via the MSG_EOR flag returned from
> 'recvmsg()'. The sender passes MSG_EOR to the sending system call and
> the receiver will see MSG_EOR when the corresponding message is
> processed.
>
> To support MSG_EOR, a new bit was added alongside the existing
> 'VIRTIO_VSOCK_SEQ_EOR': 'VIRTIO_VSOCK_SEQ_EOM' (end-of-message), which
> now works the way 'VIRTIO_VSOCK_SEQ_EOR' used to, while
> 'VIRTIO_VSOCK_SEQ_EOR' is used to mark the 'MSG_EOR' bit passed from
> userspace.

I understand that it makes sense to remap VIRTIO_VSOCK_SEQ_EOR to MSG_EOR to make the user understand the boundaries, but why do we need EOM as well? Why do we care about the boundaries of a message within a record?

I mean, if the sender makes 3 calls:

  send(A1, 0)
  send(A2, 0)
  send(A3, MSG_EOR);

IIUC it should be fine if the receiver, for example, receives all in one single recv() call with MSG_EOR set, so why do we need EOM?

Thanks,
Stefano
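Stefano's three-send example can be modeled with a toy flag scheme. The flag values below are purely illustrative (they are not the on-wire VIRTIO_VSOCK_SEQ_* bits); the point is that only the record-terminating flag is semantically load-bearing for the receiver:

```c
#include <assert.h>

/* Toy flags: EOM marks the end of one sendmsg()-sized message, EOR
 * marks the end of the whole record (what MSG_EOR maps to). Values
 * are made up for this sketch. */
#define F_EOM 0x1
#define F_EOR 0x2

/* Count how many recvmsg()-visible records a sequence of messages
 * contains: only EOR closes a record; EOM alone does not, which is
 * exactly why Stefano asks what EOM buys the receiver. */
int count_records(const int *flags, int n)
{
	int i, records = 0;

	for (i = 0; i < n; i++)
		if (flags[i] & F_EOR)
			records++;
	return records;
}
```

For the send(A1,0); send(A2,0); send(A3, MSG_EOR) sequence, the three messages collapse into a single record, so a receiver that coalesces them into one recv() with MSG_EOR set loses nothing record-wise.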
[PATCH 15/15] nvme: use bvec_virt
Use bvec_virt instead of open coding it.

Signed-off-by: Christoph Hellwig
---
 drivers/nvme/host/core.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index dfd9dec0c1f6..02ce94b2906b 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -968,12 +968,11 @@ void nvme_cleanup_cmd(struct request *req)
 {
 	if (req->rq_flags & RQF_SPECIAL_PAYLOAD) {
 		struct nvme_ctrl *ctrl = nvme_req(req)->ctrl;
-		struct page *page = req->special_vec.bv_page;
 
-		if (page == ctrl->discard_page)
+		if (req->special_vec.bv_page == ctrl->discard_page)
 			clear_bit_unlock(0, &ctrl->discard_page_busy);
 		else
-			kfree(page_address(page) + req->special_vec.bv_offset);
+			kfree(bvec_virt(&req->special_vec));
 	}
 }
 EXPORT_SYMBOL_GPL(nvme_cleanup_cmd);
-- 
2.30.2
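For readers following the series, the helper all of these patches switch to simply factors out the open-coded expression: in the kernel, bvec_virt(bv) is page_address(bv->bv_page) + bv->bv_offset. Since page_address() has no userspace equivalent, the sketch below models the page as a plain buffer so the arithmetic can be exercised anywhere; the struct name is a stand-in, not the kernel's struct bio_vec:

```c
#include <assert.h>

/* Userspace model of a bio_vec: bv_page is replaced by a plain buffer
 * pointer standing in for page_address(bv_page). */
struct fake_bio_vec {
	char *bv_page;		/* stand-in for page_address(bv_page) */
	unsigned int bv_len;
	unsigned int bv_offset;
};

/* Model of the kernel's bvec_virt(): the virtual address of the data
 * a bio_vec describes is the page's address plus the in-page offset. */
void *bvec_virt_model(const struct fake_bio_vec *bv)
{
	return bv->bv_page + bv->bv_offset;
}
```

The whole series is mechanical substitution of this one expression, which is why the per-driver patches are all one- or two-line hunks.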
Re: [PATCH 11/15] ubd: use bvec_virt
On 04/08/2021 10:56, Christoph Hellwig wrote:
> Use bvec_virt instead of open coding it.
>
> Signed-off-by: Christoph Hellwig
> ---
>  arch/um/drivers/ubd_kern.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/arch/um/drivers/ubd_kern.c b/arch/um/drivers/ubd_kern.c
> index e497185dd393..cd9dc0556e91 100644
> --- a/arch/um/drivers/ubd_kern.c
> +++ b/arch/um/drivers/ubd_kern.c
> @@ -1268,8 +1268,7 @@ static void ubd_map_req(struct ubd *dev, struct io_thread_req *io_req,
>  		rq_for_each_segment(bvec, req, iter) {
>  			BUG_ON(i >= io_req->desc_cnt);
> -			io_req->io_desc[i].buffer =
> -				page_address(bvec.bv_page) + bvec.bv_offset;
> +			io_req->io_desc[i].buffer = bvec_virt(&bvec);
>  			io_req->io_desc[i].length = bvec.bv_len;
>  			i++;
>  		}

Acked-By: Anton Ivanov

-- 
Anton R. Ivanov
Cambridgegreys Limited. Registered in England. Company Number 10273661
https://www.cambridgegreys.com/
[PATCH 14/15] dcssblk: use bvec_virt
Use bvec_virt instead of open coding it.

Signed-off-by: Christoph Hellwig
---
 drivers/s390/block/dcssblk.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 29180bdf0977..5be3d1c39a78 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -892,8 +892,7 @@ dcssblk_submit_bio(struct bio *bio)
 	index = (bio->bi_iter.bi_sector >> 3);
 	bio_for_each_segment(bvec, bio, iter) {
-		page_addr = (unsigned long)
-			page_address(bvec.bv_page) + bvec.bv_offset;
+		page_addr = (unsigned long)bvec_virt(&bvec);
 		source_addr = dev_info->start + (index<<12) + bytes_done;
 		if (unlikely((page_addr & 4095) != 0) || (bvec.bv_len & 4095) != 0)
 			// More paranoia.
-- 
2.30.2
[PATCH 13/15] dasd: use bvec_virt
Use bvec_virt instead of open coding it.

Signed-off-by: Christoph Hellwig
---
 drivers/s390/block/dasd_diag.c |  2 +-
 drivers/s390/block/dasd_eckd.c | 14 +++++++-------
 drivers/s390/block/dasd_fba.c  |  4 ++--
 3 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/drivers/s390/block/dasd_diag.c b/drivers/s390/block/dasd_diag.c
index 6bb775236c16..db5987281010 100644
--- a/drivers/s390/block/dasd_diag.c
+++ b/drivers/s390/block/dasd_diag.c
@@ -552,7 +552,7 @@ static struct dasd_ccw_req *dasd_diag_build_cp(struct dasd_device *memdev,
 	dbio = dreq->bio;
 	recid = first_rec;
 	rq_for_each_segment(bv, req, iter) {
-		dst = page_address(bv.bv_page) + bv.bv_offset;
+		dst = bvec_virt(&bv);
 		for (off = 0; off < bv.bv_len; off += blksize) {
 			memset(dbio, 0, sizeof (struct dasd_diag_bio));
 			dbio->type = rw_cmd;
diff --git a/drivers/s390/block/dasd_eckd.c b/drivers/s390/block/dasd_eckd.c
index 0de1a463c509..8610ea414322 100644
--- a/drivers/s390/block/dasd_eckd.c
+++ b/drivers/s390/block/dasd_eckd.c
@@ -3267,7 +3267,7 @@ static int dasd_eckd_ese_read(struct dasd_ccw_req *cqr, struct irb *irb)
 	end_blk = (curr_trk + 1) * recs_per_trk;
 
 	rq_for_each_segment(bv, req, iter) {
-		dst = page_address(bv.bv_page) + bv.bv_offset;
+		dst = bvec_virt(&bv);
 		for (off = 0; off < bv.bv_len; off += blksize) {
 			if (first_blk + blk_count >= end_blk) {
 				cqr->proc_bytes = blk_count * blksize;
@@ -3999,7 +3999,7 @@ static struct dasd_ccw_req *dasd_eckd_build_cp_cmd_single(
 			last_rec - recid + 1, cmd, basedev, blksize);
 	}
 	rq_for_each_segment(bv, req, iter) {
-		dst = page_address(bv.bv_page) + bv.bv_offset;
+		dst = bvec_virt(&bv);
 		if (dasd_page_cache) {
 			char *copy = kmem_cache_alloc(dasd_page_cache,
 						      GFP_DMA | __GFP_NOWARN);
@@ -4166,7 +4166,7 @@ static struct dasd_ccw_req *dasd_eckd_build_cp_cmd_track(
 	idaw_dst = NULL;
 	idaw_len = 0;
 	rq_for_each_segment(bv, req, iter) {
-		dst = page_address(bv.bv_page) + bv.bv_offset;
+		dst = bvec_virt(&bv);
 		seg_len = bv.bv_len;
 		while (seg_len) {
 			if (new_track) {
@@ -4509,7 +4509,7 @@ static struct dasd_ccw_req *dasd_eckd_build_cp_tpm_track(
 	new_track = 1;
 	recid = first_rec;
 	rq_for_each_segment(bv, req, iter) {
-		dst = page_address(bv.bv_page) + bv.bv_offset;
+		dst = bvec_virt(&bv);
 		seg_len = bv.bv_len;
 		while (seg_len) {
 			if (new_track) {
@@ -4542,7 +4542,7 @@ static struct dasd_ccw_req *dasd_eckd_build_cp_tpm_track(
 		}
 	} else {
 		rq_for_each_segment(bv, req, iter) {
-			dst = page_address(bv.bv_page) + bv.bv_offset;
+			dst = bvec_virt(&bv);
 			last_tidaw = itcw_add_tidaw(itcw, 0x00,
 						    dst, bv.bv_len);
 			if (IS_ERR(last_tidaw)) {
@@ -4778,7 +4778,7 @@ static struct dasd_ccw_req *dasd_eckd_build_cp_raw(struct dasd_device *startdev,
 		idaws = idal_create_words(idaws, rawpadpage, PAGE_SIZE);
 	}
 	rq_for_each_segment(bv, req, iter) {
-		dst = page_address(bv.bv_page) + bv.bv_offset;
+		dst = bvec_virt(&bv);
 		seg_len = bv.bv_len;
 		if (cmd == DASD_ECKD_CCW_READ_TRACK)
 			memset(dst, 0, seg_len);
@@ -4839,7 +4839,7 @@ dasd_eckd_free_cp(struct dasd_ccw_req *cqr, struct request *req)
 	if (private->uses_cdl == 0 || recid > 2*blk_per_trk)
 		ccw++;
 	rq_for_each_segment(bv, req, iter) {
-		dst = page_address(bv.bv_page) + bv.bv_offset;
+		dst = bvec_virt(&bv);
 		for (off = 0; off < bv.bv_len; off += blksize) {
 			/* Skip locate record. */
 			if (private->uses_cdl && recid <= 2*blk_per_trk)
diff --git a/drivers/s390/block/dasd_fba.c b/drivers/s390/block/dasd_fba.c
index 3ad319aee51e..e084f4de 100644
--- a/drivers/s390/block/dasd_fba.c
+++ b/drivers/s390/block/dasd_fba.c
@@ -501,7 +501,7 @@ static struct dasd_ccw_req *dasd_fba_build_cp_regular(
 	}
 	recid = first_rec;
 	rq_for_each_segment(bv, req, iter) {
-		dst = page_address(bv.bv_page) + bv.bv_offset;
+		dst = bvec_virt(&bv);
 		if (dasd_page_cache) {
 			char *copy = kmem_cache_alloc(dasd_page_cache,
[PATCH 12/15] ps3vram: use bvec_virt
Use bvec_virt instead of open coding it.

Signed-off-by: Christoph Hellwig
---
 drivers/block/ps3vram.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/block/ps3vram.c b/drivers/block/ps3vram.c
index 7fbf469651c4..c7b19e128b03 100644
--- a/drivers/block/ps3vram.c
+++ b/drivers/block/ps3vram.c
@@ -541,7 +541,7 @@ static struct bio *ps3vram_do_bio(struct ps3_system_bus_device *dev,
 	bio_for_each_segment(bvec, bio, iter) {
 		/* PS3 is ppc64, so we don't handle highmem */
-		char *ptr = page_address(bvec.bv_page) + bvec.bv_offset;
+		char *ptr = bvec_virt(&bvec);
 		size_t len = bvec.bv_len, retlen;
 
 		dev_dbg(&dev->core, "%s %zu bytes at offset %llu\n", op,
-- 
2.30.2
[PATCH 11/15] ubd: use bvec_virt
Use bvec_virt instead of open coding it.

Signed-off-by: Christoph Hellwig
---
 arch/um/drivers/ubd_kern.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/um/drivers/ubd_kern.c b/arch/um/drivers/ubd_kern.c
index e497185dd393..cd9dc0556e91 100644
--- a/arch/um/drivers/ubd_kern.c
+++ b/arch/um/drivers/ubd_kern.c
@@ -1268,8 +1268,7 @@ static void ubd_map_req(struct ubd *dev, struct io_thread_req *io_req,
 		rq_for_each_segment(bvec, req, iter) {
 			BUG_ON(i >= io_req->desc_cnt);
-			io_req->io_desc[i].buffer =
-				page_address(bvec.bv_page) + bvec.bv_offset;
+			io_req->io_desc[i].buffer = bvec_virt(&bvec);
 			io_req->io_desc[i].length = bvec.bv_len;
 			i++;
 		}
-- 
2.30.2
[PATCH 10/15] sd: use bvec_virt
Use bvec_virt instead of open coding it.

Signed-off-by: Christoph Hellwig
---
 drivers/scsi/sd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index b8d55af763f9..5b5b8266e142 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -886,7 +886,7 @@ static blk_status_t sd_setup_unmap_cmnd(struct scsi_cmnd *cmd)
 	cmd->cmnd[0] = UNMAP;
 	cmd->cmnd[8] = 24;
 
-	buf = page_address(rq->special_vec.bv_page);
+	buf = bvec_virt(&rq->special_vec);
 	put_unaligned_be16(6 + 16, &buf[0]);
 	put_unaligned_be16(16, &buf[2]);
 	put_unaligned_be64(lba, &buf[8]);
-- 
2.30.2
[PATCH 09/15] bcache: use bvec_virt
Use bvec_virt instead of open coding it. Note that the existing code is fine despite ignoring bv_offset as the bio is known to contain exactly one page from the page allocator per bio_vec.

Signed-off-by: Christoph Hellwig
---
 drivers/md/bcache/btree.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 183a58c89377..0595559de174 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -378,7 +378,7 @@ static void do_btree_node_write(struct btree *b)
 		struct bvec_iter_all iter_all;
 
 		bio_for_each_segment_all(bv, b->bio, iter_all) {
-			memcpy(page_address(bv->bv_page), addr, PAGE_SIZE);
+			memcpy(bvec_virt(bv), addr, PAGE_SIZE);
 			addr += PAGE_SIZE;
 		}
-- 
2.30.2
[PATCH 08/15] virtio_blk: use bvec_virt
Use bvec_virt instead of open coding it.

Signed-off-by: Christoph Hellwig
---
 drivers/block/virtio_blk.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 4b49df2dfd23..767b4f72a54d 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -166,11 +166,8 @@ static inline void virtblk_request_done(struct request *req)
 {
 	struct virtblk_req *vbr = blk_mq_rq_to_pdu(req);
 
-	if (req->rq_flags & RQF_SPECIAL_PAYLOAD) {
-		kfree(page_address(req->special_vec.bv_page) +
-		      req->special_vec.bv_offset);
-	}
-
+	if (req->rq_flags & RQF_SPECIAL_PAYLOAD)
+		kfree(bvec_virt(&req->special_vec));
 	blk_mq_end_request(req, virtblk_result(vbr));
 }
-- 
2.30.2
[PATCH 07/15] rbd: use bvec_virt
Use bvec_virt instead of open coding it.

Signed-off-by: Christoph Hellwig
---
 drivers/block/rbd.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 6d596c2c2cd6..e65c9d706f6f 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -2986,8 +2986,7 @@ static bool is_zero_bvecs(struct bio_vec *bvecs, u32 bytes)
 	};
 
 	ceph_bvec_iter_advance_step(&it, bytes, ({
-		if (memchr_inv(page_address(bv.bv_page) + bv.bv_offset, 0,
-			       bv.bv_len))
+		if (memchr_inv(bvec_virt(&bv), 0, bv.bv_len))
 			return false;
 	}));
 	return true;
-- 
2.30.2
[PATCH 06/15] squashfs: use bvec_virt
Use bvec_virt instead of open coding it. Signed-off-by: Christoph Hellwig --- fs/squashfs/block.c| 7 +++ fs/squashfs/lz4_wrapper.c | 2 +- fs/squashfs/lzo_wrapper.c | 2 +- fs/squashfs/xz_wrapper.c | 2 +- fs/squashfs/zlib_wrapper.c | 2 +- fs/squashfs/zstd_wrapper.c | 2 +- 6 files changed, 8 insertions(+), 9 deletions(-) diff --git a/fs/squashfs/block.c b/fs/squashfs/block.c index 855f0e87066d..2db8bcf7ff85 100644 --- a/fs/squashfs/block.c +++ b/fs/squashfs/block.c @@ -49,8 +49,7 @@ static int copy_bio_to_actor(struct bio *bio, bytes_to_copy = min_t(int, bytes_to_copy, req_length - copied_bytes); - memcpy(actor_addr + actor_offset, - page_address(bvec->bv_page) + bvec->bv_offset + offset, + memcpy(actor_addr + actor_offset, bvec_virt(bvec) + offset, bytes_to_copy); actor_offset += bytes_to_copy; @@ -177,7 +176,7 @@ int squashfs_read_data(struct super_block *sb, u64 index, int length, goto out_free_bio; } /* Extract the length of the metadata block */ - data = page_address(bvec->bv_page) + bvec->bv_offset; + data = bvec_virt(bvec); length = data[offset]; if (offset < bvec->bv_len - 1) { length |= data[offset + 1] << 8; @@ -186,7 +185,7 @@ int squashfs_read_data(struct super_block *sb, u64 index, int length, res = -EIO; goto out_free_bio; } - data = page_address(bvec->bv_page) + bvec->bv_offset; + data = bvec_virt(bvec); length |= data[0] << 8; } bio_free_pages(bio); diff --git a/fs/squashfs/lz4_wrapper.c b/fs/squashfs/lz4_wrapper.c index 233d5582fbee..b685b6238316 100644 --- a/fs/squashfs/lz4_wrapper.c +++ b/fs/squashfs/lz4_wrapper.c @@ -101,7 +101,7 @@ static int lz4_uncompress(struct squashfs_sb_info *msblk, void *strm, while (bio_next_segment(bio, _all)) { int avail = min(bytes, ((int)bvec->bv_len) - offset); - data = page_address(bvec->bv_page) + bvec->bv_offset; + data = bvec_virt(bvec); memcpy(buff, data + offset, avail); buff += avail; bytes -= avail; diff --git a/fs/squashfs/lzo_wrapper.c b/fs/squashfs/lzo_wrapper.c index 97bb7d92ddcd..cb510a631968 100644 --- 
a/fs/squashfs/lzo_wrapper.c +++ b/fs/squashfs/lzo_wrapper.c @@ -76,7 +76,7 @@ static int lzo_uncompress(struct squashfs_sb_info *msblk, void *strm, while (bio_next_segment(bio, &iter_all)) { int avail = min(bytes, ((int)bvec->bv_len) - offset); - data = page_address(bvec->bv_page) + bvec->bv_offset; + data = bvec_virt(bvec); memcpy(buff, data + offset, avail); buff += avail; bytes -= avail; diff --git a/fs/squashfs/xz_wrapper.c b/fs/squashfs/xz_wrapper.c index e80419aed862..68f6d09bb3a2 100644 --- a/fs/squashfs/xz_wrapper.c +++ b/fs/squashfs/xz_wrapper.c @@ -146,7 +146,7 @@ static int squashfs_xz_uncompress(struct squashfs_sb_info *msblk, void *strm, } avail = min(length, ((int)bvec->bv_len) - offset); - data = page_address(bvec->bv_page) + bvec->bv_offset; + data = bvec_virt(bvec); length -= avail; stream->buf.in = data + offset; stream->buf.in_size = avail; diff --git a/fs/squashfs/zlib_wrapper.c b/fs/squashfs/zlib_wrapper.c index bcb881ec47f2..a20e9042146b 100644 --- a/fs/squashfs/zlib_wrapper.c +++ b/fs/squashfs/zlib_wrapper.c @@ -76,7 +76,7 @@ static int zlib_uncompress(struct squashfs_sb_info *msblk, void *strm, } avail = min(length, ((int)bvec->bv_len) - offset); - data = page_address(bvec->bv_page) + bvec->bv_offset; + data = bvec_virt(bvec); length -= avail; stream->next_in = data + offset; stream->avail_in = avail; diff --git a/fs/squashfs/zstd_wrapper.c b/fs/squashfs/zstd_wrapper.c index b7cb1faa652d..0015cf8b5582 100644 --- a/fs/squashfs/zstd_wrapper.c +++ b/fs/squashfs/zstd_wrapper.c @@ -94,7 +94,7 @@ static int zstd_uncompress(struct squashfs_sb_info *msblk, void *strm, } avail = min(length, ((int)bvec->bv_len) - offset); - data = page_address(bvec->bv_page) + bvec->bv_offset; + data = bvec_virt(bvec); length -= avail; in_buf.src = data + offset;
[PATCH 05/15] dm-integrity: use bvec_virt
Use bvec_virt instead of open coding it. Signed-off-by: Christoph Hellwig --- drivers/md/dm-integrity.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/md/dm-integrity.c b/drivers/md/dm-integrity.c index 20f2510db1f6..a9ea361769a7 100644 --- a/drivers/md/dm-integrity.c +++ b/drivers/md/dm-integrity.c @@ -1819,7 +1819,7 @@ static void integrity_metadata(struct work_struct *w) unsigned this_len; BUG_ON(PageHighMem(biv.bv_page)); - tag = lowmem_page_address(biv.bv_page) + biv.bv_offset; + tag = bvec_virt(&biv); this_len = min(biv.bv_len, data_to_process); r = dm_integrity_rw_tag(ic, tag, &dio->metadata_block, &dio->metadata_offset, this_len, dio->op == REQ_OP_READ ? TAG_READ : TAG_WRITE); @@ -2006,7 +2006,7 @@ static bool __journal_read_write(struct dm_integrity_io *dio, struct bio *bio, unsigned tag_now = min(biv.bv_len, tag_todo); char *tag_addr; BUG_ON(PageHighMem(biv.bv_page)); - tag_addr = lowmem_page_address(biv.bv_page) + biv.bv_offset; + tag_addr = bvec_virt(&biv); if (likely(dio->op == REQ_OP_WRITE)) memcpy(tag_ptr, tag_addr, tag_now); else -- 2.30.2
[PATCH 04/15] dm-ebs: use bvec_virt
Use bvec_virt instead of open coding it. Signed-off-by: Christoph Hellwig --- drivers/md/dm-ebs-target.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/md/dm-ebs-target.c b/drivers/md/dm-ebs-target.c index 71475a2410be..0c509dae0ff8 100644 --- a/drivers/md/dm-ebs-target.c +++ b/drivers/md/dm-ebs-target.c @@ -74,7 +74,7 @@ static int __ebs_rw_bvec(struct ebs_c *ec, int rw, struct bio_vec *bv, struct bv if (unlikely(!bv->bv_page || !bv_len)) return -EIO; - pa = page_address(bv->bv_page) + bv->bv_offset; + pa = bvec_virt(bv); /* Handle overlapping page <-> blocks */ while (bv_len) { -- 2.30.2
[PATCH 03/15] dm: make EBS depend on !HIGHMEM
__ebs_rw_bvec uses page_address on the submitted bio's data, and thus can't deal with highmem. Disable the target on highmem configs. Signed-off-by: Christoph Hellwig --- drivers/md/Kconfig | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig index 0602e82a9516..ecc559c60d40 100644 --- a/drivers/md/Kconfig +++ b/drivers/md/Kconfig @@ -340,7 +340,7 @@ config DM_WRITECACHE config DM_EBS tristate "Emulated block size target (EXPERIMENTAL)" - depends on BLK_DEV_DM + depends on BLK_DEV_DM && !HIGHMEM select DM_BUFIO help dm-ebs emulates smaller logical block size on backing devices -- 2.30.2
[PATCH 02/15] block: use bvec_virt in bio_integrity_{process,free}
Use the bvec_virt helper to clean up the bio integrity processing a little bit. Signed-off-by: Christoph Hellwig --- block/bio-integrity.c | 7 ++----- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/block/bio-integrity.c b/block/bio-integrity.c index 8f54d49dc500..6b47cddbbca1 100644 --- a/block/bio-integrity.c +++ b/block/bio-integrity.c @@ -104,8 +104,7 @@ void bio_integrity_free(struct bio *bio) struct bio_set *bs = bio->bi_pool; if (bip->bip_flags & BIP_BLOCK_INTEGRITY) - kfree(page_address(bip->bip_vec->bv_page) + - bip->bip_vec->bv_offset); + kfree(bvec_virt(bip->bip_vec)); __bio_integrity_free(bs, bip); bio->bi_integrity = NULL; @@ -163,13 +162,11 @@ static blk_status_t bio_integrity_process(struct bio *bio, struct bio_vec bv; struct bio_integrity_payload *bip = bio_integrity(bio); blk_status_t ret = BLK_STS_OK; - void *prot_buf = page_address(bip->bip_vec->bv_page) + - bip->bip_vec->bv_offset; iter.disk_name = bio->bi_bdev->bd_disk->disk_name; iter.interval = 1 << bi->interval_exp; iter.seed = proc_iter->bi_sector; - iter.prot_buf = prot_buf; + iter.prot_buf = bvec_virt(bip->bip_vec); __bio_for_each_segment(bv, bio, bviter, *proc_iter) { void *kaddr = bvec_kmap_local(&bv); -- 2.30.2
[PATCH 01/15] bvec: add a bvec_virt helper
Add a helper to get the virtual address for a bvec. This avoids that all callers need to know about the page + offset representation. Signed-off-by: Christoph Hellwig --- include/linux/bvec.h | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/include/linux/bvec.h b/include/linux/bvec.h index f9fa43b940ff..0e9bdd42dafb 100644 --- a/include/linux/bvec.h +++ b/include/linux/bvec.h @@ -229,4 +229,16 @@ static inline void memzero_bvec(struct bio_vec *bvec) memzero_page(bvec->bv_page, bvec->bv_offset, bvec->bv_len); } +/** + * bvec_virt - return the virtual address for a bvec + * @bvec: bvec to return the virtual address for + * + * Note: the caller must ensure that @bvec->bv_page is not a highmem page. + */ +static inline void *bvec_virt(struct bio_vec *bvec) +{ + WARN_ON_ONCE(PageHighMem(bvec->bv_page)); + return page_address(bvec->bv_page) + bvec->bv_offset; +} + #endif /* __LINUX_BVEC_H */ -- 2.30.2
add a bvec_virt helper
Hi Jens, this series adds a bvec_virt helper to return the virtual address of the data in a bvec to replace the open coded calculation, and as a reminder that generally bio/bvec data can be in high memory unless it is caller controlled or in an architecture specific driver where highmem is impossible.
Re: [PATCH v10 10/17] virtio: Handle device reset failure in register_virtio_device()
On 2021/8/4 4:50 PM, Yongji Xie wrote: On Wed, Aug 4, 2021 at 4:32 PM Jason Wang wrote: On 2021/8/3 5:38 PM, Yongji Xie wrote: On Tue, Aug 3, 2021 at 4:09 PM Jason Wang wrote: On 2021/7/29 3:34 PM, Xie Yongji wrote: The device reset may fail in virtio-vdpa case now, so add checks to its return value and fail the register_virtio_device(). So the reset() would be called by the driver during remove as well, or is it sufficient to deal only with the reset during probe? Actually there is no way to handle failure during removal. And it should be safe with the protection of software IOTLB even if the reset() fails. Thanks, Yongji If this is true, does it mean we don't even need to care about reset failure? But we need to handle the failure in the vhost-vdpa case, don't we? Yes, but: - This patch is for virtio not for vhost; if we don't care about virtio, we can avoid the changes - For vhost, there could be two ways probably: 1) let set_status report the error 2) require userspace to re-read the status It looks to me you want to go with 1) and I'm not sure whether or not it's too late to go with 2). Thanks Thanks, Yongji
Re: [PATCH v10 05/17] vhost-vdpa: Fail the vhost_vdpa_set_status() on reset failure
On 2021/8/3 5:50 PM, Yongji Xie wrote: On Tue, Aug 3, 2021 at 4:10 PM Jason Wang wrote: On 2021/7/29 3:34 PM, Xie Yongji wrote: Re-read the device status to ensure it's set to zero during resetting. Otherwise, fail the vhost_vdpa_set_status() after timeout. Signed-off-by: Xie Yongji --- drivers/vhost/vdpa.c | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index b07aa161f7ad..dd05c1e1133c 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -157,7 +157,7 @@ static long vhost_vdpa_set_status(struct vhost_vdpa *v, u8 __user *statusp) struct vdpa_device *vdpa = v->vdpa; const struct vdpa_config_ops *ops = vdpa->config; u8 status, status_old; - int nvqs = v->nvqs; + int timeout = 0, nvqs = v->nvqs; u16 i; if (copy_from_user(&status, statusp, sizeof(status))) @@ -173,6 +173,15 @@ static long vhost_vdpa_set_status(struct vhost_vdpa *v, u8 __user *statusp) return -EINVAL; ops->set_status(vdpa, status); + if (status == 0) { + while (ops->get_status(vdpa)) { + timeout += 20; + if (timeout > VDPA_RESET_TIMEOUT_MS) + return -EIO; + + msleep(20); + } The spec has introduced reset as one of the basic facilities. And considering we defer reset here, it makes me think whether it's better to introduce a dedicated vdpa ops for reset? Do you mean replacing ops.set_status(vdev, 0) with ops.reset()? Then we can remove the timeout processing, which is device specific stuff. Exactly. Thanks Thanks, Yongji
Re: [PATCH v10 10/17] virtio: Handle device reset failure in register_virtio_device()
On 2021/8/3 5:38 PM, Yongji Xie wrote: On Tue, Aug 3, 2021 at 4:09 PM Jason Wang wrote: On 2021/7/29 3:34 PM, Xie Yongji wrote: The device reset may fail in virtio-vdpa case now, so add checks to its return value and fail the register_virtio_device(). So the reset() would be called by the driver during remove as well, or is it sufficient to deal only with the reset during probe? Actually there is no way to handle failure during removal. And it should be safe with the protection of software IOTLB even if the reset() fails. Thanks, Yongji If this is true, does it mean we don't even need to care about reset failure? Thanks
Re: [PATCH v10 04/17] vdpa: Fail the vdpa_reset() if fail to set device status to zero
On 2021/8/3 5:31 PM, Yongji Xie wrote: On Tue, Aug 3, 2021 at 3:58 PM Jason Wang wrote: On 2021/7/29 3:34 PM, Xie Yongji wrote: Re-read the device status to ensure it's set to zero during resetting. Otherwise, fail the vdpa_reset() after timeout. Signed-off-by: Xie Yongji --- include/linux/vdpa.h | 15 ++++++++++++++- 1 file changed, 14 insertions(+), 1 deletion(-) diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h index 406d53a606ac..d1a80ef05089 100644 --- a/include/linux/vdpa.h +++ b/include/linux/vdpa.h @@ -6,6 +6,7 @@ #include #include #include +#include /** * struct vdpa_calllback - vDPA callback definition. @@ -340,12 +341,24 @@ static inline struct device *vdpa_get_dma_dev(struct vdpa_device *vdev) return vdev->dma_dev; } -static inline void vdpa_reset(struct vdpa_device *vdev) +#define VDPA_RESET_TIMEOUT_MS 1000 + +static inline int vdpa_reset(struct vdpa_device *vdev) { const struct vdpa_config_ops *ops = vdev->config; + int timeout = 0; vdev->features_valid = false; ops->set_status(vdev, 0); + while (ops->get_status(vdev)) { + timeout += 20; + if (timeout > VDPA_RESET_TIMEOUT_MS) + return -EIO; + + msleep(20); + } I wonder if it's better to do this in the vDPA parent? Thanks Sorry, I didn't get you here. Do you mean the vDPA parent driver (e.g. VDUSE)? Yes, since how it's expected to behave depends on the specific hardware. Even for the spec, the behavior is transport specific: PCI: requires reread until 0 MMIO: doesn't require it, but it might not work for the hardware so we decided to change CCW: the success of the ccw command means the success of the reset Thanks Actually I didn't find any other place where I can do set_status() and get_status(). Thanks, Yongji
Re: [PATCH v10 02/17] file: Export receive_fd() to modules
On 2021/8/3 5:01 PM, Yongji Xie wrote: On Tue, Aug 3, 2021 at 3:46 PM Jason Wang wrote: On 2021/7/29 3:34 PM, Xie Yongji wrote: Export receive_fd() so that some modules can use it to pass file descriptors between processes without missing any security checks. Signed-off-by: Xie Yongji --- fs/file.c | 6 ++ include/linux/file.h | 7 +++ 2 files changed, 9 insertions(+), 4 deletions(-) diff --git a/fs/file.c b/fs/file.c index 86dc9956af32..210e540672aa 100644 --- a/fs/file.c +++ b/fs/file.c @@ -1134,6 +1134,12 @@ int receive_fd_replace(int new_fd, struct file *file, unsigned int o_flags) return new_fd; } +int receive_fd(struct file *file, unsigned int o_flags) +{ + return __receive_fd(file, NULL, o_flags); Any reason that receive_fd_user() can live in the file.h? Since no modules use it. Thanks, Yongji Ok. Acked-by: Jason Wang
[PATCH virtiofsd 1/3] virtiofsd: expand fuse protocol to support per-file DAX
Signed-off-by: Jeffle Xu --- include/standard-headers/linux/fuse.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/include/standard-headers/linux/fuse.h b/include/standard-headers/linux/fuse.h index 950d7edb7e..7bd006ffcb 100644 --- a/include/standard-headers/linux/fuse.h +++ b/include/standard-headers/linux/fuse.h @@ -356,6 +356,7 @@ struct fuse_file_lock { #define FUSE_MAP_ALIGNMENT (1 << 26) #define FUSE_SUBMOUNTS (1 << 27) #define FUSE_HANDLE_KILLPRIV_V2 (1 << 28) +#define FUSE_PERFILE_DAX (1 << 30) /** * CUSE INIT request/reply flags @@ -440,6 +441,7 @@ struct fuse_file_lock { * FUSE_ATTR_SUBMOUNT: Object is a submount root */ #define FUSE_ATTR_SUBMOUNT (1 << 0) +#define FUSE_ATTR_DAX (1 << 1) /** * Open flags -- 2.27.0
[PATCH virtiofsd 2/3] virtiofsd: support per-file DAX negotiation in FUSE_INIT
Signed-off-by: Jeffle Xu --- tools/virtiofsd/fuse_common.h| 5 + tools/virtiofsd/fuse_lowlevel.c | 6 ++ tools/virtiofsd/passthrough_ll.c | 3 +++ 3 files changed, 14 insertions(+) diff --git a/tools/virtiofsd/fuse_common.h b/tools/virtiofsd/fuse_common.h index 8a75729be9..ee6fc64c23 100644 --- a/tools/virtiofsd/fuse_common.h +++ b/tools/virtiofsd/fuse_common.h @@ -372,6 +372,11 @@ struct fuse_file_info { */ #define FUSE_CAP_HANDLE_KILLPRIV_V2 (1 << 28) +/** + * Indicates support for per-file DAX. + */ +#define FUSE_CAP_PERFILE_DAX (1 << 29) + /** * Ioctl flags * diff --git a/tools/virtiofsd/fuse_lowlevel.c b/tools/virtiofsd/fuse_lowlevel.c index 50fc5c8d5a..60fb82a92a 100644 --- a/tools/virtiofsd/fuse_lowlevel.c +++ b/tools/virtiofsd/fuse_lowlevel.c @@ -2065,6 +2065,9 @@ static void do_init(fuse_req_t req, fuse_ino_t nodeid, if (arg->flags & FUSE_HANDLE_KILLPRIV_V2) { se->conn.capable |= FUSE_CAP_HANDLE_KILLPRIV_V2; } +if (arg->flags & FUSE_PERFILE_DAX) { +se->conn.capable |= FUSE_CAP_PERFILE_DAX; +} #ifdef HAVE_SPLICE #ifdef HAVE_VMSPLICE se->conn.capable |= FUSE_CAP_SPLICE_WRITE | FUSE_CAP_SPLICE_MOVE; @@ -2180,6 +2183,9 @@ static void do_init(fuse_req_t req, fuse_ino_t nodeid, if (se->conn.want & FUSE_CAP_POSIX_ACL) { outarg.flags |= FUSE_POSIX_ACL; } +if (se->conn.want & FUSE_CAP_PERFILE_DAX) { +outarg.flags |= FUSE_PERFILE_DAX; +} outarg.max_readahead = se->conn.max_readahead; outarg.max_write = se->conn.max_write; if (se->conn.max_background >= (1 << 16)) { diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c index b76d878509..da88304253 100644 --- a/tools/virtiofsd/passthrough_ll.c +++ b/tools/virtiofsd/passthrough_ll.c @@ -702,6 +702,9 @@ static void lo_init(void *userdata, struct fuse_conn_info *conn) conn->want &= ~FUSE_CAP_HANDLE_KILLPRIV_V2; lo->killpriv_v2 = 0; } +if (conn->capable & FUSE_CAP_PERFILE_DAX) { +conn->want |= FUSE_CAP_PERFILE_DAX; +} } static void lo_getattr(fuse_req_t req, fuse_ino_t ino, -- 2.27.0 ___ 
[PATCH virtiofsd 3/3] virtiofsd: support per-file DAX in FUSE_LOOKUP
Signed-off-by: Jeffle Xu --- tools/virtiofsd/passthrough_ll.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c index da88304253..1a472ce7f0 100644 --- a/tools/virtiofsd/passthrough_ll.c +++ b/tools/virtiofsd/passthrough_ll.c @@ -1026,6 +1026,9 @@ static int lo_do_lookup(fuse_req_t req, fuse_ino_t parent, const char *name, e->attr_flags |= FUSE_ATTR_SUBMOUNT; } +if (e->attr.st_size >= 1048576) + e->attr_flags |= FUSE_ATTR_DAX; + inode = lo_find(lo, &e->attr, mnt_id); if (inode) { close(newfd); -- 2.27.0
[PATCH virtiofsd 0/3] virtiofsd: support per-file DAX
As discussed with Vivek Goyal, I tried to make virtiofsd support per-file DAX by checking if the file is marked with FS_DAX_FL attr. Thus I need to implement .ioctl() method for passthrough_ll.c (because FS_DAX_FL attr is get/set by FS_IOC_GETFLAGS/FS_IOC_SETFLAGS ioctl), something like ``` static struct fuse_lowlevel_ops lo_oper = { +.ioctl = lo_ioctl, +static void lo_ioctl(...) +{ + ret = ioctl(fd, FS_IOC_SETFLAGS, ...); } ``` But unfortunately once virtiofsd calls ioctl() syscall, it directly exits, and qemu also hangs with 'qemu-system-x86_64: Unexpected end-of-file before all data were read'. I'm not sure if it's because ioctl() is not permitted at all for virtiofsd or qemu. Many thanks if someone familiar with virtualization could help. The code repository of virtiofsd used is: gitlab.com/virtio-fs/qemu.git virtio-fs-dev Thus this patch set is still used for test only, marking files larger than 1MB shall enable per-file DAX. Jeffle Xu (3): virtiofsd: expand fuse protocol to support per-file DAX virtiofsd: support per-file DAX negotiation in FUSE_INIT virtiofsd: support per-file DAX in FUSE_LOOKUP include/standard-headers/linux/fuse.h | 2 ++ tools/virtiofsd/fuse_common.h | 5 + tools/virtiofsd/fuse_lowlevel.c | 6 ++ tools/virtiofsd/passthrough_ll.c | 6 ++ 4 files changed, 19 insertions(+) -- 2.27.0 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v3 0/8] fuse,virtiofs: support per-file DAX
changes since v2: - modify fuse_show_options() accordingly to make it compatible with new tri-state mount option (patch 2) - extract FUSE protocol changes into one separate patch (patch 3) - FUSE server/client need to negotiate if they support per-file DAX (patch 4) - extract DONT_CACHE logic into patch 6/7 This patchset adds support for per-file DAX for virtiofs, which is inspired by Ira Weiny's work on ext4[1] and xfs[2]. Any comment is welcome. [1] commit 9cb20f94afcd ("fs/ext4: Make DAX mount option a tri-state") [2] commit 02beb2686ff9 ("fs/xfs: Make DAX mount option a tri-state") v2: https://www.spinics.net/lists/linux-fsdevel/msg199584.html v1: https://www.spinics.net/lists/linux-virtualization/msg51008.html Jeffle Xu (8): fuse: add fuse_should_enable_dax() helper fuse: Make DAX mount option a tri-state fuse: support per-file DAX fuse: negotiate if server/client supports per-file DAX fuse: enable per-file DAX fuse: mark inode DONT_CACHE when per-file DAX indication changes fuse: support changing per-file DAX flag inside guest fuse: show '-o dax=inode' option only when FUSE server supports fs/fuse/dax.c | 32 ++-- fs/fuse/file.c| 4 ++-- fs/fuse/fuse_i.h | 22 ++ fs/fuse/inode.c | 27 ++- fs/fuse/ioctl.c | 15 +-- fs/fuse/virtio_fs.c | 16 ++-- include/uapi/linux/fuse.h | 9 - 7 files changed, 103 insertions(+), 22 deletions(-) -- 2.27.0
[PATCH v3 5/8] fuse: enable per-file DAX
Enable per-file DAX if fuse server advertises that the file supports that. Currently the state whether the file enables DAX or not is initialized only when inode is instantiated. Signed-off-by: Jeffle Xu --- fs/fuse/dax.c| 12 fs/fuse/file.c | 4 ++-- fs/fuse/fuse_i.h | 4 ++-- fs/fuse/inode.c | 2 +- 4 files changed, 13 insertions(+), 9 deletions(-) diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c index fe4e9593a590..30833f8d37dd 100644 --- a/fs/fuse/dax.c +++ b/fs/fuse/dax.c @@ -1339,7 +1339,7 @@ static const struct address_space_operations fuse_dax_file_aops = { .invalidatepage = noop_invalidatepage, }; -static bool fuse_should_enable_dax(struct inode *inode) +static bool fuse_should_enable_dax(struct inode *inode, unsigned int flags) { struct fuse_conn *fc = get_fuse_conn(inode); unsigned int dax_mode = fc->dax_mode; @@ -1348,12 +1348,16 @@ static bool fuse_should_enable_dax(struct inode *inode) if (dax_mode == FUSE_DAX_NEVER) return false; - return true; + if (dax_mode == FUSE_DAX_ALWAYS) + return true; + + WARN_ON_ONCE(dax_mode != FUSE_DAX_INODE); + return fc->perfile_dax && (flags & FUSE_ATTR_DAX); } -void fuse_dax_inode_init(struct inode *inode) +void fuse_dax_inode_init(struct inode *inode, unsigned int flags) { - if (!fuse_should_enable_dax(inode)) + if (!fuse_should_enable_dax(inode, flags)) return; inode->i_flags |= S_DAX; diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 97f860cfc195..b494ff08f08c 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -3142,7 +3142,7 @@ static const struct address_space_operations fuse_file_aops = { .write_end = fuse_write_end, }; -void fuse_init_file_inode(struct inode *inode) +void fuse_init_file_inode(struct inode *inode, unsigned int flags) { struct fuse_inode *fi = get_fuse_inode(inode); @@ -3156,5 +3156,5 @@ void fuse_init_file_inode(struct inode *inode) fi->writepages = RB_ROOT; if (IS_ENABLED(CONFIG_FUSE_DAX)) - fuse_dax_inode_init(inode); + fuse_dax_inode_init(inode, flags); } diff --git a/fs/fuse/fuse_i.h 
b/fs/fuse/fuse_i.h index 0b21e76a379a..7b7b4c208af2 100644 --- a/fs/fuse/fuse_i.h +++ b/fs/fuse/fuse_i.h @@ -1006,7 +1006,7 @@ int fuse_notify_poll_wakeup(struct fuse_conn *fc, /** * Initialize file operations on a regular file */ -void fuse_init_file_inode(struct inode *inode); +void fuse_init_file_inode(struct inode *inode, unsigned int flags); /** * Initialize inode operations on regular files and special files @@ -1258,7 +1258,7 @@ int fuse_dax_conn_alloc(struct fuse_conn *fc, enum fuse_dax_mode mode, struct dax_device *dax_dev); void fuse_dax_conn_free(struct fuse_conn *fc); bool fuse_dax_inode_alloc(struct super_block *sb, struct fuse_inode *fi); -void fuse_dax_inode_init(struct inode *inode); +void fuse_dax_inode_init(struct inode *inode, unsigned int flags); void fuse_dax_inode_cleanup(struct inode *inode); bool fuse_dax_check_alignment(struct fuse_conn *fc, unsigned int map_alignment); void fuse_dax_cancel_work(struct fuse_conn *fc); diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c index d59aea41d70d..882c376f188b 100644 --- a/fs/fuse/inode.c +++ b/fs/fuse/inode.c @@ -281,7 +281,7 @@ static void fuse_init_inode(struct inode *inode, struct fuse_attr *attr) inode->i_ctime.tv_nsec = attr->ctimensec; if (S_ISREG(inode->i_mode)) { fuse_init_common(inode); - fuse_init_file_inode(inode); + fuse_init_file_inode(inode, attr->flags); } else if (S_ISDIR(inode->i_mode)) fuse_init_dir(inode); else if (S_ISLNK(inode->i_mode)) -- 2.27.0 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v3 8/8] fuse: show '-o dax=inode' option only when FUSE server supports
Prior to this patch, the mount option would still show '-o dax=inode' even when the FUSE server advertises that it doesn't support per-file DAX. Signed-off-by: Jeffle Xu --- fs/fuse/inode.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c index 9108d8ab39bc..439f74a041fa 100644 --- a/fs/fuse/inode.c +++ b/fs/fuse/inode.c @@ -697,7 +697,7 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root) seq_puts(m, ",dax=always"); else if (fc->dax_mode == FUSE_DAX_NEVER) seq_puts(m, ",dax=never"); - else if (fc->dax_mode == FUSE_DAX_INODE) + else if ((fc->dax_mode == FUSE_DAX_INODE) && fc->perfile_dax) seq_puts(m, ",dax=inode"); #endif -- 2.27.0
[PATCH v3 4/8] fuse: negotiate if server/client supports per-file DAX
Among the FUSE_INIT phase, server/client shall negotiate if supporting per-file DAX. Requirements for server: - capable of handling SETFLAGS/FSSETXATTR ioctl and storing FS_DAX_FL/FS_XFLAG_DAX persistently. - set FUSE_ATTR_DAX if the file capable of per-file DAX when replying FUSE_LOOKUP request accordingly. Requirements for client: - capable of handling per-file DAX when receiving FUSE_ATTR_DAX. Signed-off-by: Jeffle Xu --- fs/fuse/fuse_i.h | 3 +++ fs/fuse/inode.c | 12 +++- 2 files changed, 10 insertions(+), 5 deletions(-) diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h index a23dd8d0c181..0b21e76a379a 100644 --- a/fs/fuse/fuse_i.h +++ b/fs/fuse/fuse_i.h @@ -770,6 +770,9 @@ struct fuse_conn { /* Propagate syncfs() to server */ unsigned int sync_fs:1; + /* Does the filesystem support per-file DAX? */ + unsigned int perfile_dax:1; + /** The number of requests waiting for completion */ atomic_t num_waiting; diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c index 0bc0d8af81e1..d59aea41d70d 100644 --- a/fs/fuse/inode.c +++ b/fs/fuse/inode.c @@ -1087,10 +1087,12 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args, min_t(unsigned int, fc->max_pages_limit, max_t(unsigned int, arg->max_pages, 1)); } - if (IS_ENABLED(CONFIG_FUSE_DAX) && - arg->flags & FUSE_MAP_ALIGNMENT && - !fuse_dax_check_alignment(fc, arg->map_alignment)) { - ok = false; + if (IS_ENABLED(CONFIG_FUSE_DAX)) { + if (arg->flags & FUSE_MAP_ALIGNMENT && + !fuse_dax_check_alignment(fc, arg->map_alignment)) + ok = false; + if (arg->flags & FUSE_PERFILE_DAX) + fc->perfile_dax = 1; } if (arg->flags & FUSE_HANDLE_KILLPRIV_V2) { fc->handle_killpriv_v2 = 1; @@ -1144,7 +1146,7 @@ void fuse_send_init(struct fuse_mount *fm) FUSE_HANDLE_KILLPRIV_V2 | FUSE_SETXATTR_EXT; #ifdef CONFIG_FUSE_DAX if (fm->fc->dax) - ia->in.flags |= FUSE_MAP_ALIGNMENT; + ia->in.flags |= FUSE_MAP_ALIGNMENT | FUSE_PERFILE_DAX; #endif if (fm->fc->auto_submounts) ia->in.flags |= FUSE_SUBMOUNTS; -- 2.27.0 ___ Virtualization 
mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v3 3/8] fuse: support per-file DAX
Expand the fuse protocol to support per-file DAX. A FUSE_PERFILE_DAX flag is added indicating whether the fuse server/client supports per-file DAX when sending or replying to the FUSE_INIT request. Besides, a FUSE_ATTR_DAX flag is added indicating whether DAX shall be enabled for the corresponding file when replying to the FUSE_LOOKUP request. Signed-off-by: Jeffle Xu --- include/uapi/linux/fuse.h | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h index 36ed092227fa..15a1f5fc0797 100644 --- a/include/uapi/linux/fuse.h +++ b/include/uapi/linux/fuse.h @@ -184,6 +184,9 @@ * * 7.34 * - add FUSE_SYNCFS + * + * 7.35 + * - add FUSE_PERFILE_DAX, FUSE_ATTR_DAX */ #ifndef _LINUX_FUSE_H @@ -219,7 +222,7 @@ #define FUSE_KERNEL_VERSION 7 /** Minor version number of this interface */ -#define FUSE_KERNEL_MINOR_VERSION 34 +#define FUSE_KERNEL_MINOR_VERSION 35 /** The node ID of the root inode */ #define FUSE_ROOT_ID 1 @@ -336,6 +339,7 @@ struct fuse_file_lock { * write/truncate sgid is killed only if file has group * execute permission. (Same as Linux VFS behavior). * FUSE_SETXATTR_EXT: Server supports extended struct fuse_setxattr_in + * FUSE_PERFILE_DAX: kernel supports per-file DAX */ #define FUSE_ASYNC_READ (1 << 0) #define FUSE_POSIX_LOCKS (1 << 1) @@ -367,6 +371,7 @@ struct fuse_file_lock { #define FUSE_SUBMOUNTS (1 << 27) #define FUSE_HANDLE_KILLPRIV_V2 (1 << 28) #define FUSE_SETXATTR_EXT (1 << 29) +#define FUSE_PERFILE_DAX (1 << 30) /** * CUSE INIT request/reply flags @@ -449,8 +454,10 @@ struct fuse_file_lock { * fuse_attr flags * * FUSE_ATTR_SUBMOUNT: Object is a submount root + * FUSE_ATTR_DAX: Enable DAX for this file in per-file DAX mode */ #define FUSE_ATTR_SUBMOUNT (1 << 0) +#define FUSE_ATTR_DAX (1 << 1) /** * Open flags -- 2.27.0
[PATCH v3 6/8] fuse: mark inode DONT_CACHE when per-file DAX indication changes
When the per-file DAX indication changes while the file is still *opened*, it is quite complicated and maybe fragile to dynamically change the DAX state. Hence mark the inode and corresponding dentries as DONT_CACHE once the per-file DAX indication changes, so that the inode instance will be evicted and freed as soon as possible once the file is closed and the last reference to the inode is put. And then when the file gets reopened next time, the inode will reflect the new DAX state. In summary, when the per-file DAX indication changes for an *opened* file, the state of the file won't be updated until this file is closed and reopened later. Signed-off-by: Jeffle Xu --- fs/fuse/dax.c | 9 +++++++++ fs/fuse/fuse_i.h | 1 + fs/fuse/inode.c | 3 +++ 3 files changed, 13 insertions(+) diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c index 30833f8d37dd..14bc5b9a0576 100644 --- a/fs/fuse/dax.c +++ b/fs/fuse/dax.c @@ -1364,6 +1364,15 @@ void fuse_dax_inode_init(struct inode *inode, unsigned int flags) inode->i_data.a_ops = &fuse_dax_file_aops; } +void fuse_dax_dontcache(struct inode *inode, bool newdax) +{ + struct fuse_conn *fc = get_fuse_conn(inode); + + if (fc->dax_mode == FUSE_DAX_INODE && + fc->perfile_dax && IS_DAX(inode) != newdax) + d_mark_dontcache(inode); +} + bool fuse_dax_check_alignment(struct fuse_conn *fc, unsigned int map_alignment) { if (fc->dax && (map_alignment > FUSE_DAX_SHIFT)) { diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h index 7b7b4c208af2..56fe1c4d2136 100644 --- a/fs/fuse/fuse_i.h +++ b/fs/fuse/fuse_i.h @@ -1260,6 +1260,7 @@ void fuse_dax_conn_free(struct fuse_conn *fc); bool fuse_dax_inode_alloc(struct super_block *sb, struct fuse_inode *fi); void fuse_dax_inode_init(struct inode *inode, unsigned int flags); void fuse_dax_inode_cleanup(struct inode *inode); +void fuse_dax_dontcache(struct inode *inode, bool newdax); bool fuse_dax_check_alignment(struct fuse_conn *fc, unsigned int map_alignment); void fuse_dax_cancel_work(struct fuse_conn *fc); diff --git 
a/fs/fuse/inode.c b/fs/fuse/inode.c index 882c376f188b..9108d8ab39bc 100644 --- a/fs/fuse/inode.c +++ b/fs/fuse/inode.c @@ -269,6 +269,9 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr, if (inval) invalidate_inode_pages2(inode->i_mapping); } + + if (IS_ENABLED(CONFIG_FUSE_DAX)) + fuse_dax_dontcache(inode, attr->flags & FUSE_ATTR_DAX); } static void fuse_init_inode(struct inode *inode, struct fuse_attr *attr) -- 2.27.0 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v3 2/8] fuse: Make DAX mount option a tri-state
We add 'always', 'never', and 'inode' (default). '-o dax' continues to operate the same, and is equivalent to 'always'. As of this patch, 'inode' mode actually behaves the same as 'always' mode; the per-file DAX flag is only introduced in the following patch.

Signed-off-by: Jeffle Xu
---
 fs/fuse/dax.c       |  9 +++--
 fs/fuse/fuse_i.h    | 14 --
 fs/fuse/inode.c     | 10 +++---
 fs/fuse/virtio_fs.c | 16 ++--
 4 files changed, 40 insertions(+), 9 deletions(-)

diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
index c6f4e82e65f3..fe4e9593a590 100644
--- a/fs/fuse/dax.c
+++ b/fs/fuse/dax.c
@@ -1288,11 +1288,14 @@ static int fuse_dax_mem_range_init(struct fuse_conn_dax *fcd)
 	return ret;
 }
 
-int fuse_dax_conn_alloc(struct fuse_conn *fc, struct dax_device *dax_dev)
+int fuse_dax_conn_alloc(struct fuse_conn *fc, enum fuse_dax_mode dax_mode,
+			struct dax_device *dax_dev)
 {
 	struct fuse_conn_dax *fcd;
 	int err;
 
+	fc->dax_mode = dax_mode;
+
 	if (!dax_dev)
 		return 0;
 
@@ -1339,8 +1342,10 @@ static const struct address_space_operations fuse_dax_file_aops = {
 static bool fuse_should_enable_dax(struct inode *inode)
 {
 	struct fuse_conn *fc = get_fuse_conn(inode);
+	unsigned int dax_mode = fc->dax_mode;
 
-	if (!fc->dax)
+	/* If 'dax=always/inode', fc->dax couldn't be NULL */
+	if (dax_mode == FUSE_DAX_NEVER)
 		return false;
 
 	return true;

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 07829ce78695..a23dd8d0c181 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -487,6 +487,12 @@ struct fuse_dev {
 	struct list_head entry;
 };
 
+enum fuse_dax_mode {
+	FUSE_DAX_INODE,
+	FUSE_DAX_ALWAYS,
+	FUSE_DAX_NEVER,
+};
+
 struct fuse_fs_context {
 	int fd;
 	unsigned int rootmode;
@@ -503,7 +509,7 @@ struct fuse_fs_context {
 	bool no_control:1;
 	bool no_force_umount:1;
 	bool legacy_opts_show:1;
-	bool dax:1;
+	enum fuse_dax_mode dax_mode;
 	unsigned int max_read;
 	unsigned int blksize;
 	const char *subtype;
@@ -801,6 +807,9 @@ struct fuse_conn {
 	struct list_head devices;
 
 #ifdef CONFIG_FUSE_DAX
+	/* dax mode: FUSE_DAX_* (always, never or per-file) */
+	enum fuse_dax_mode dax_mode;
+
 	/* Dax specific conn data, non-NULL if DAX is enabled */
 	struct fuse_conn_dax *dax;
 #endif
@@ -1242,7 +1251,8 @@ ssize_t fuse_dax_read_iter(struct kiocb *iocb, struct iov_iter *to);
 ssize_t fuse_dax_write_iter(struct kiocb *iocb, struct iov_iter *from);
 int fuse_dax_mmap(struct file *file, struct vm_area_struct *vma);
 int fuse_dax_break_layouts(struct inode *inode, u64 dmap_start, u64 dmap_end);
-int fuse_dax_conn_alloc(struct fuse_conn *fc, struct dax_device *dax_dev);
+int fuse_dax_conn_alloc(struct fuse_conn *fc, enum fuse_dax_mode mode,
+			struct dax_device *dax_dev);
 void fuse_dax_conn_free(struct fuse_conn *fc);
 bool fuse_dax_inode_alloc(struct super_block *sb, struct fuse_inode *fi);
 void fuse_dax_inode_init(struct inode *inode);

diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index b9beb39a4a18..0bc0d8af81e1 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -690,8 +690,12 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
 		seq_printf(m, ",blksize=%lu", sb->s_blocksize);
 	}
 #ifdef CONFIG_FUSE_DAX
-	if (fc->dax)
-		seq_puts(m, ",dax");
+	if (fc->dax_mode == FUSE_DAX_ALWAYS)
+		seq_puts(m, ",dax=always");
+	else if (fc->dax_mode == FUSE_DAX_NEVER)
+		seq_puts(m, ",dax=never");
+	else if (fc->dax_mode == FUSE_DAX_INODE)
+		seq_puts(m, ",dax=inode");
 #endif
 
 	return 0;
@@ -1434,7 +1438,7 @@ int fuse_fill_super_common(struct super_block *sb, struct fuse_fs_context *ctx)
 		sb->s_subtype = ctx->subtype;
 		ctx->subtype = NULL;
 	if (IS_ENABLED(CONFIG_FUSE_DAX)) {
-		err = fuse_dax_conn_alloc(fc, ctx->dax_dev);
+		err = fuse_dax_conn_alloc(fc, ctx->dax_mode, ctx->dax_dev);
 		if (err)
 			goto err;
 	}

diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 8f52cdaa8445..0050132e2787 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -88,12 +88,21 @@ struct virtio_fs_req_work {
 static int virtio_fs_enqueue_req(struct virtio_fs_vq *fsvq,
				 struct fuse_req *req, bool in_flight);
 
+static const struct constant_table dax_param_enums[] = {
+	{"inode",	FUSE_DAX_INODE },
+	{"always",	FUSE_DAX_ALWAYS },
+	{"never",	FUSE_DAX_NEVER },
+	{}
+};
+
 enum {
 	OPT_DAX,
+	OPT_DAX_ENUM,
 };
 
 static const struct fs_parameter_spec virtio_fs_parameters[] = {
 	fsparam_flag("dax", OPT_DAX),
+
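For context (not part of the patch), the resulting mount-time usage would look roughly like this; the tag name "myfs" and the mountpoint are made up:

```sh
mount -t virtiofs -o dax        myfs /mnt   # legacy flag, same as dax=always
mount -t virtiofs -o dax=always myfs /mnt   # DAX for every file
mount -t virtiofs -o dax=never  myfs /mnt   # DAX disabled entirely
mount -t virtiofs -o dax=inode  myfs /mnt   # per-file, FUSE server decides (default)
```

The chosen mode is then reported back through fuse_show_options(), so `/proc/mounts` shows the explicit `dax=...` value rather than the bare flag.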
[PATCH v3 7/8] fuse: support changing per-file DAX flag inside guest
The FUSE client can enable or disable per-file DAX inside the kernel/guest via chattr(1). Similarly, the new state won't be applied until the file is closed and reopened later.

Signed-off-by: Jeffle Xu
---
 fs/fuse/ioctl.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/fs/fuse/ioctl.c b/fs/fuse/ioctl.c
index 546ea3d58fb4..a9ed53c5dbd1 100644
--- a/fs/fuse/ioctl.c
+++ b/fs/fuse/ioctl.c
@@ -469,8 +469,6 @@ int fuse_fileattr_set(struct user_namespace *mnt_userns,
 	if (fa->flags_valid) {
 		err = fuse_priv_ioctl(inode, ff, FS_IOC_SETFLAGS,
				      &flags, sizeof(flags));
-		if (err)
-			goto cleanup;
 	} else {
 		memset(&xfa, 0, sizeof(xfa));
 		xfa.fsx_xflags = fa->fsx_xflags;
@@ -483,6 +481,19 @@ int fuse_fileattr_set(struct user_namespace *mnt_userns,
				      &xfa, sizeof(xfa));
 	}
 
+	if (err)
+		goto cleanup;
+
+	if (IS_ENABLED(CONFIG_FUSE_DAX)) {
+		bool newdax;
+
+		if (fa->flags_valid)
+			newdax = flags & FS_DAX_FL;
+		else
+			newdax = fa->fsx_xflags & FS_XFLAG_DAX;
+		fuse_dax_dontcache(inode, newdax);
+	}
+
 cleanup:
 	fuse_priv_ioctl_cleanup(inode, ff);
-- 
2.27.0
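For reference, the guest-side workflow this enables would look roughly like the following; the path is hypothetical and this assumes a chattr(1) recent enough to know the DAX ('x') attribute and a FUSE server that persists it:

```sh
chattr +x /mnt/file    # set FS_DAX_FL; takes effect on next open after last close
lsattr  /mnt/file      # the 'x' flag indicates DAX is requested
chattr -x /mnt/file    # clear it again
```

Per the commit message, a process that already has the file open keeps the old DAX state until the file is closed and reopened.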
[PATCH v3 1/8] fuse: add fuse_should_enable_dax() helper
This is in preparation for the per-file DAX checking that follows.

Signed-off-by: Jeffle Xu
---
 fs/fuse/dax.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
index 0e5407f48e6a..c6f4e82e65f3 100644
--- a/fs/fuse/dax.c
+++ b/fs/fuse/dax.c
@@ -1336,11 +1336,19 @@ static const struct address_space_operations fuse_dax_file_aops = {
 	.invalidatepage	= noop_invalidatepage,
 };
 
-void fuse_dax_inode_init(struct inode *inode)
+static bool fuse_should_enable_dax(struct inode *inode)
 {
 	struct fuse_conn *fc = get_fuse_conn(inode);
 
 	if (!fc->dax)
+		return false;
+
+	return true;
+}
+
+void fuse_dax_inode_init(struct inode *inode)
+{
+	if (!fuse_should_enable_dax(inode))
 		return;
 
 	inode->i_flags |= S_DAX;
-- 
2.27.0
Re: [PATCH V4 2/2] gpio: virtio: Add IRQ support
On 03-08-21, 17:01, Arnd Bergmann wrote:
> On Tue, Aug 3, 2021 at 1:36 PM Viresh Kumar wrote:
> > +struct vgpio_irq_line {
> > +	u8 type;
> > +	bool masked;
> > +	bool update_pending;
> > +	bool queued;
> > +
> > +	struct virtio_gpio_irq_request ireq;
> > +	struct virtio_gpio_irq_response ires;
> > +};
>
> I think the last two members should be marked as __cacheline_aligned,
> since they are transferred separately over DMA.

Right.

> > +static void virtio_gpio_irq_eoi(struct irq_data *d)
> > +{
> > +	/*
> > +	 * Queue buffers, by calling virtio_gpio_irq_prepare(), from
> > +	 * virtio_gpio_event_vq() itself, after taking into consideration the
> > +	 * masking status of the interrupt.
> > +	 */
> > +}
>
> Shouldn't this just requeue the interrupt? There is no reason to
> defer this, and I think we want the normal operation to not have
> to involve any scheduling.
>
> > +static void virtio_gpio_irq_unmask(struct irq_data *d)
> > +{
> > +	struct gpio_chip *gc = irq_data_get_irq_chip_data(d);
> > +	struct virtio_gpio *vgpio = gpiochip_get_data(gc);
> > +	struct vgpio_irq_line *irq_line = &vgpio->irq_lines[d->hwirq];
> > +
> > +	irq_line->masked = false;
> > +	irq_line->update_pending = true;
> > +}
>
> Same here. unmask is probably less important, but it's the
> same operation: if you want interrupts to become active
> again when they are not, just queue the request

We can't, because it's a slow bus? And unmask can be called from
irq-context. That's exactly why we had the irq_bus_lock/unlock
discussion earlier.

> > +static void virtio_gpio_irq_mask(struct irq_data *d)
> > +{
> > +	struct gpio_chip *gc = irq_data_get_irq_chip_data(d);
> > +	struct virtio_gpio *vgpio = gpiochip_get_data(gc);
> > +	struct vgpio_irq_line *irq_line = &vgpio->irq_lines[d->hwirq];
> > +
> > +	irq_line->masked = true;
> > +	irq_line->update_pending = true;
> > +}
>
> This is of course the tricky bit, I was hoping you had found a solution
> based on what I wrote above for eoi() and unmask().

> > +static void vgpio_work_handler(struct work_struct *work)
> > +{
> > +	struct virtio_gpio *vgpio = container_of(work, struct virtio_gpio,
> > +						 work);
> > +	struct device *dev = &vgpio->vdev->dev;
> > +	struct vgpio_irq_line *irq_line;
> > +	int irq, gpio, ret;
> > +	unsigned int len;
> > +
> > +	mutex_lock(&vgpio->irq_lock);
> > +
> > +	while (true) {
> > +		irq_line = virtqueue_get_buf(vgpio->event_vq, &len);
> > +		if (!irq_line)
> > +			break;
>
> Related to above, I think all the eventq handling should be moved into the
> virtio_gpio_event_vq() function directly.

You mean without scheduling a work?

> > +		/* The interrupt may have been disabled by now */
> > +		if (irq_line->update_pending && irq_line->masked)
> > +			update_irq_type(vgpio, gpio, VIRTIO_GPIO_IRQ_TYPE_NONE);
>
> This is a part I'm not sure about, and I suppose it's the same part that
> Marc was also confused by.
>
> As far as I can tell, the update_irq_type() message would lead to the
> interrupt getting delivered when it was armed and is now getting disabled,
> but I don't see why we would call update_irq_type() as a result of the
> eventq notification.

Let me try to answer all the other questions together here. The
irq-related functions get called in two scenarios:

- request_irq() or irq_set_irq_type(), enable/disable_irq(), etc.:

  The call sequence here is like this:

  ->irq_bus_lock()
  ->spin-lock-irqsave
  ->irq_mask()/irq_unmask()/irq_set_type()..
  ->spin-unlock-irqsave
  ->irq_bus_unlock()

  So the mask/unmask/set-type routines can't issue virtio requests and
  we need to do that from irq_bus_unlock(). This shall answer your
  question about mask/unmask, right? Or maybe I misunderstood them?

- Interrupt, i.e. a buffer sent back by the host over virtio:

  virtio_gpio_event_vq() schedules a work item, which processes the
  items from the eventq virtqueue and eventually calls
  generic_handle_irq(). The irq-core can issue calls to
  ->irq_mask/unmask() here without a prior call to
  irq_bus_lock/unlock(); normally they will balance out by the end,
  but I am not sure if that is guaranteed.

Moreover, the interrupt should be re-enabled only after unmask() is
called (for ONESHOT) and not at EOI, right?

I chose not to queue the buffers back from eoi() as it is possible
that we won't want to queue them at all, since the interrupt needs to
be disabled by the time generic_handle_irq() returns. And so I did
everything from the end of vgpio_work_handler()'s loop, i.e. either
disable the interrupts with VIRTIO_GPIO_IRQ_TYPE_NONE or enable the
interrupt again by re-queuing the buffer.

Regarding irq handling using work-item, I had to move to that
Re: [PATCH v2 0/4] virtiofs,fuse: support per-file DAX
On 7/21/21 10:42 PM, Vivek Goyal wrote:
> On Wed, Jul 21, 2021 at 08:48:57AM -0400, Vivek Goyal wrote:
> [..]
> > > > So is "dax=inode" enough for your needs? What's your requirement,
> > > > can you give a little bit more detail?
> > >
> > > In our use case, the backend fs is something like SquashFS on the
> > > host. The content of the file on the host is downloaded *as needed*.
> > > When the file is not completely ready (completely downloaded), the
> > > guest will follow the normal IO routine, i.e., by FUSE_READ/FUSE_WRITE
> > > request. Once the file is completely ready, per-file DAX is enabled
> > > for this file. IOW the FUSE server needs to dynamically decide if
> > > per-file DAX shall be enabled, depending on whether the file is
> > > completely downloaded.
> >
> > So you don't want to enable DAX yet because the guest might fault on
> > a section of the file which has not been downloaded yet?
> >
> > I am wondering if somehow user fault handling can help with this.
> > If we could handle faults for this file in user space, then you
> > should be able to download that particular page[s] and resolve
> > the fault?
>
> Stefan mentioned that can't we block when the fuse mmap request comes
> in and download the corresponding section of the file. Or do whatever
> you are doing in FUSE_READ.
>
> IOW, even if you enable dax in your use case on all files,
> the FUSE_SETUPMAPPING request will give you control to make sure the
> file section being mmaped has been downloaded.

Sorry for the late reply. I missed this mail as it was filed into the
mailing-list folder.

The idea you mentioned may work. Anyway, the implementation details of
the FUSE server are not strongly bound to the FUSE protocol changes in
the kernel. The protocol only requires that the FUSE client shall be
able to store the FS_DAX_FL attr persistently in *some way*. The kernel
changes shall be general, no matter whether the FUSE server bases its
decision on the FS_DAX_FL attr or something else.

--
Thanks,
Jeffle