Re: [PATCH V3 0/7] mdev based hardware virtio offloading support
On 2019/10/15 1:49 AM, Stefan Hajnoczi wrote:
> On Fri, Oct 11, 2019 at 04:15:50PM +0800, Jason Wang wrote:
> > There is hardware that can do virtio datapath offloading while having
> > its own control path. This patch tries to implement a mdev based
> > unified API to support using the kernel virtio driver to drive those
> > devices. This is done by introducing a new mdev transport for virtio
> > (virtio_mdev) and registering it as a new kind of mdev driver. Then
> > it provides a unified way for the kernel virtio driver to talk with
> > the mdev device implementation.
> >
> > Though the series only contains kernel driver support, the goal is to
> > make the transport generic enough to support userspace drivers. This
> > means vhost-mdev[1] could be built on top as well by reusing the
> > transport.
> >
> > A sample driver is also implemented which simulates a virtio-net
> > loopback ethernet device on top of vringh + workqueue. This could be
> > used as a reference implementation for a real hardware driver.
> >
> > Considering that the mdev framework only supports VFIO devices and
> > drivers right now, this series also extends it to support other types.
> > This is done through introducing a class id to the device and pairing
> > it with the id_table claimed by the driver. On top, this series also
> > decouples device specific parent ops out of the common ones.
>
> I was curious so I took a quick look and posted comments.
>
> I guess this driver runs inside the guest since it registers virtio
> devices?

It could run in either the guest or the host, but the main focus is to run in the host so that we can use virtio drivers in containers.

> If this is used with physical PCI devices that support datapath
> offloading then how are physical devices presented to the guest without
> SR-IOV?

We will do control path mediation through vhost-mdev[1] and vhost-vfio[2]. Then we will present a fully virtio compatible ethernet device to the guest.

SR-IOV is not a must; any mdev device that implements the API defined in patch 5 can be used by this framework.
Thanks [1] https://lkml.org/lkml/2019/9/26/15 [2] https://patchwork.ozlabs.org/cover/984763/ Stefan ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH V3 6/7] virtio: introduce a mdev based transport
On 2019/10/15 1:39 AM, Stefan Hajnoczi wrote:
> On Fri, Oct 11, 2019 at 04:15:56PM +0800, Jason Wang wrote:
> > +struct virtio_mdev_device {
> > +	struct virtio_device vdev;
> > +	struct mdev_device *mdev;
> > +	unsigned long version;
> > +
> > +	struct virtqueue **vqs;
> > +	/* The lock to protect virtqueue list */
> > +	spinlock_t lock;
> > +	struct list_head virtqueues;
>
> Is this a list of struct virtio_mdev_vq_info? Please document the
> actual type in a comment.

Ok.

> > +static int virtio_mdev_find_vqs(struct virtio_device *vdev, unsigned nvqs,
> > +				struct virtqueue *vqs[],
> > +				vq_callback_t *callbacks[],
> > +				const char * const names[],
> > +				const bool *ctx,
> > +				struct irq_affinity *desc)
> > +{
> > +	struct virtio_mdev_device *vm_dev = to_virtio_mdev_device(vdev);
> > +	struct mdev_device *mdev = vm_get_mdev(vdev);
> > +	const struct virtio_mdev_device_ops *ops = mdev_get_dev_ops(mdev);
> > +	struct virtio_mdev_callback cb;
> > +	int i, err, queue_idx = 0;
> > +
> > +	vm_dev->vqs = kmalloc_array(queue_idx, sizeof(*vm_dev->vqs),
> > +				    GFP_KERNEL);
>
> kmalloc_array(0, ...)? I would have expected nvqs instead of queue_idx
> (0). What is the purpose of vm_dev->vqs and does anything ever access
> it?

It's useless, will remove it.

Thanks
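As a rough illustration of the review comment above, here is a minimal userspace sketch of what an element-count-based array allocation is expected to do. It assumes nothing about the final patch (the field was slated for removal); alloc_array(), alloc_vq_table() and struct vq are hypothetical stand-ins, not kernel APIs. The point is only that the first argument is the number of elements (nvqs), and that the multiplication is overflow-checked.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical userspace stand-in for an overflow-checked array
 * allocator: room for n elements of the given size, or NULL if
 * n * size would overflow.
 */
static void *alloc_array(size_t n, size_t size)
{
	if (size != 0 && n > SIZE_MAX / size)
		return NULL;	/* multiplication would overflow */
	return malloc(n * size);
}

struct vq;	/* opaque stand-in for struct virtqueue */

/* Allocate one pointer slot per requested virtqueue. */
static struct vq **alloc_vq_table(unsigned int nvqs)
{
	/*
	 * A zero-length table is almost certainly a caller bug, which is
	 * what passing a counter initialised to 0 amounts to.
	 */
	assert(nvqs > 0);
	return alloc_array(nvqs, sizeof(struct vq *));
}
```

Sizing the table by nvqs rather than by a running index keeps the allocation in step with the number of queues the caller asked for.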
Re: [PATCH V3 5/7] mdev: introduce virtio device and its device ops
On 2019/10/15 1:23 AM, Stefan Hajnoczi wrote:
> On Fri, Oct 11, 2019 at 04:15:55PM +0800, Jason Wang wrote:
> > + * @set_vq_cb: Set the interrut calback function for
>
> s/interrut/interrupt/
> s/calback/callback/

Fixed.

Thanks
Re: [PATCH RFC v4 0/5] vhost: ring format independence
On 2019/10/13 7:41 PM, Michael S. Tsirkin wrote:
> This adds infrastructure required for supporting
> multiple ring formats.
>
> The idea is as follows: we convert descriptors to an
> independent format first, and process that converting to
> iov later.
>
> The point is that we have a tight loop that fetches
> descriptors, which is good for cache utilization.
> This will also allow all kind of batching tricks -
> e.g. it seems possible to keep SMAP disabled while
> we are fetching multiple descriptors.
>
> This seems to perform exactly the same as the original
> code already based on a microbenchmark.
> Lightly tested.
> More testing would be very much appreciated.
>
> To use new code:
> 	echo 1 > /sys/module/vhost_test/parameters/newcode
> or
> 	echo 1 > /sys/module/vhost_net/parameters/newcode
>
> Changes from v3:
> 	- fixed error handling in case of indirect descriptors
> 	- add BUG_ON to detect buffer overflow in case of bugs
> 		in response to comment by Jason Wang
> 	- minor code tweaks
>
> Changes from v2:
> 	- fixed indirect descriptor batching
> 		reported by Jason Wang
>
> Changes from v1:
> 	- typo fixes

I've just done a quick benchmark with testpmd + vhost_net txonly. With a queue size of 256 there is no difference, but with a queue size of 1024 a 1% PPS regression was found.

Thanks

> Michael S. Tsirkin (5):
>   vhost: option to fetch descriptors through an independent struct
>   vhost/test: add an option to test new code
>   vhost: batching fetches
>   vhost/net: add an option to test new code
>   vhost: last descriptor must have NEXT clear
>
>  drivers/vhost/net.c   |  32 -
>  drivers/vhost/test.c  |  19 ++-
>  drivers/vhost/vhost.c | 328 +-
>  drivers/vhost/vhost.h |  20 ++-
>  4 files changed, 385 insertions(+), 14 deletions(-)
Re: [PATCH V3 0/7] mdev based hardware virtio offloading support
On Fri, Oct 11, 2019 at 04:15:50PM +0800, Jason Wang wrote:
> There is hardware that can do virtio datapath offloading while having
> its own control path. This patch tries to implement a mdev based
> unified API to support using the kernel virtio driver to drive those
> devices. This is done by introducing a new mdev transport for virtio
> (virtio_mdev) and registering it as a new kind of mdev driver. Then
> it provides a unified way for the kernel virtio driver to talk with
> the mdev device implementation.
>
> Though the series only contains kernel driver support, the goal is to
> make the transport generic enough to support userspace drivers. This
> means vhost-mdev[1] could be built on top as well by reusing the
> transport.
>
> A sample driver is also implemented which simulates a virtio-net
> loopback ethernet device on top of vringh + workqueue. This could be
> used as a reference implementation for a real hardware driver.
>
> Considering that the mdev framework only supports VFIO devices and
> drivers right now, this series also extends it to support other types.
> This is done through introducing a class id to the device and pairing
> it with the id_table claimed by the driver. On top, this series also
> decouples device specific parent ops out of the common ones.

I was curious so I took a quick look and posted comments.

I guess this driver runs inside the guest since it registers virtio
devices?

If this is used with physical PCI devices that support datapath
offloading then how are physical devices presented to the guest without
SR-IOV?

Stefan
Re: [PATCH V3 6/7] virtio: introduce a mdev based transport
On Fri, Oct 11, 2019 at 04:15:56PM +0800, Jason Wang wrote:
> +struct virtio_mdev_device {
> +	struct virtio_device vdev;
> +	struct mdev_device *mdev;
> +	unsigned long version;
> +
> +	struct virtqueue **vqs;
> +	/* The lock to protect virtqueue list */
> +	spinlock_t lock;
> +	struct list_head virtqueues;

Is this a list of struct virtio_mdev_vq_info? Please document the
actual type in a comment.

> +static int virtio_mdev_find_vqs(struct virtio_device *vdev, unsigned nvqs,
> +				struct virtqueue *vqs[],
> +				vq_callback_t *callbacks[],
> +				const char * const names[],
> +				const bool *ctx,
> +				struct irq_affinity *desc)
> +{
> +	struct virtio_mdev_device *vm_dev = to_virtio_mdev_device(vdev);
> +	struct mdev_device *mdev = vm_get_mdev(vdev);
> +	const struct virtio_mdev_device_ops *ops = mdev_get_dev_ops(mdev);
> +	struct virtio_mdev_callback cb;
> +	int i, err, queue_idx = 0;
> +
> +	vm_dev->vqs = kmalloc_array(queue_idx, sizeof(*vm_dev->vqs),
> +				    GFP_KERNEL);

kmalloc_array(0, ...)? I would have expected nvqs instead of queue_idx
(0). What is the purpose of vm_dev->vqs and does anything ever access
it?
Re: [PATCH V3 5/7] mdev: introduce virtio device and its device ops
On Fri, Oct 11, 2019 at 04:15:55PM +0800, Jason Wang wrote:
> + * @set_vq_cb: Set the interrut calback function for

s/interrut/interrupt/
s/calback/callback/
[PATCH 01/25] crypto: virtio - implement missing support for output IVs
In order to allow for CBC to be chained, which is something that the
CTS template relies upon, implementations of CBC need to pass the IV
to be used for subsequent invocations via the IV buffer. This was not
implemented yet for virtio-crypto so implement it now.

Fixes: dbaf0624ffa5 ("crypto: add virtio-crypto driver")
Cc: "Michael S. Tsirkin"
Cc: Jason Wang
Cc: Gonglei
Cc: virtualization@lists.linux-foundation.org
Signed-off-by: Ard Biesheuvel
---
 drivers/crypto/virtio/virtio_crypto_algs.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/drivers/crypto/virtio/virtio_crypto_algs.c b/drivers/crypto/virtio/virtio_crypto_algs.c
index 42d19205166b..65ec10800137 100644
--- a/drivers/crypto/virtio/virtio_crypto_algs.c
+++ b/drivers/crypto/virtio/virtio_crypto_algs.c
@@ -437,6 +437,11 @@ __virtio_crypto_ablkcipher_do_req(struct virtio_crypto_sym_request *vc_sym_req,
 		goto free;
 	}
 	memcpy(iv, req->info, ivsize);
+	if (!vc_sym_req->encrypt)
+		scatterwalk_map_and_copy(req->info, req->src,
+					 req->nbytes - AES_BLOCK_SIZE,
+					 AES_BLOCK_SIZE, 0);
+
 	sg_init_one(&iv_sg, iv, ivsize);
 	sgs[num_out++] = &iv_sg;
 	vc_sym_req->iv = iv;
@@ -563,6 +568,10 @@ static void virtio_crypto_ablkcipher_finalize_req(
 	struct ablkcipher_request *req,
 	int err)
 {
+	if (vc_sym_req->encrypt)
+		scatterwalk_map_and_copy(req->info, req->dst,
+					 req->nbytes - AES_BLOCK_SIZE,
+					 AES_BLOCK_SIZE, 0);
 	crypto_finalize_ablkcipher_request(vc_sym_req->base.dataq->engine,
 					   req, err);
 	kzfree(vc_sym_req->iv);
-- 
2.20.1
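The output-IV rule this patch implements can be sketched outside the kernel as follows. This is only an illustration on flat buffers, not the driver code: cbc_save_output_iv() and BLOCK_SIZE are hypothetical names, and the real patch works on scatterlists via scatterwalk_map_and_copy(). In CBC the IV for a follow-up request is the last ciphertext block of the current one. For encryption that block is in the output, so it is copied after the operation; for decryption it is in the input, so it has to be copied before the operation in case the data is decrypted in place.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLOCK_SIZE 16	/* AES block size in bytes */

/*
 * Copy the last ciphertext block of a CBC request into the IV buffer so
 * that a follow-up request can continue the chain.
 *
 * Encryption: pass the output buffer, after the operation completed.
 * Decryption: pass the input buffer, before the operation starts.
 */
static void cbc_save_output_iv(unsigned char iv[BLOCK_SIZE],
			       const unsigned char *ciphertext, size_t nbytes)
{
	/* The offset below is only valid for whole, non-empty requests. */
	assert(nbytes >= BLOCK_SIZE && nbytes % BLOCK_SIZE == 0);
	memcpy(iv, ciphertext + nbytes - BLOCK_SIZE, BLOCK_SIZE);
}
```

The assertion makes the length precondition explicit: nbytes - BLOCK_SIZE is only meaningful for a non-empty request made of whole blocks.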
[PATCH 03/25] crypto: virtio - switch to skcipher API
Commit 7a7ffe65c8c5 ("crypto: skcipher - Add top-level skcipher interface") dated 20 august 2015 introduced the new skcipher API which is supposed to replace both blkcipher and ablkcipher. While all consumers of the API have been converted long ago, some producers of the ablkcipher remain, forcing us to keep the ablkcipher support routines alive, along with the matching code to expose [a]blkciphers via the skcipher API. So switch this driver to the skcipher API, allowing us to finally drop the blkcipher code in the near future. Cc: "Michael S. Tsirkin" Cc: Jason Wang Cc: Gonglei Cc: virtualization@lists.linux-foundation.org Signed-off-by: Ard Biesheuvel --- drivers/crypto/virtio/virtio_crypto_algs.c | 187 ++-- drivers/crypto/virtio/virtio_crypto_common.h | 2 +- 2 files changed, 92 insertions(+), 97 deletions(-) diff --git a/drivers/crypto/virtio/virtio_crypto_algs.c b/drivers/crypto/virtio/virtio_crypto_algs.c index 82b316b2f537..4b71e80951b7 100644 --- a/drivers/crypto/virtio/virtio_crypto_algs.c +++ b/drivers/crypto/virtio/virtio_crypto_algs.c @@ -8,6 +8,7 @@ #include #include +#include #include #include #include @@ -16,10 +17,10 @@ #include "virtio_crypto_common.h" -struct virtio_crypto_ablkcipher_ctx { +struct virtio_crypto_skcipher_ctx { struct crypto_engine_ctx enginectx; struct virtio_crypto *vcrypto; - struct crypto_tfm *tfm; + struct crypto_skcipher *tfm; struct virtio_crypto_sym_session_info enc_sess_info; struct virtio_crypto_sym_session_info dec_sess_info; @@ -30,8 +31,8 @@ struct virtio_crypto_sym_request { /* Cipher or aead */ uint32_t type; - struct virtio_crypto_ablkcipher_ctx *ablkcipher_ctx; - struct ablkcipher_request *ablkcipher_req; + struct virtio_crypto_skcipher_ctx *skcipher_ctx; + struct skcipher_request *skcipher_req; uint8_t *iv; /* Encryption? 
*/ bool encrypt; @@ -41,7 +42,7 @@ struct virtio_crypto_algo { uint32_t algonum; uint32_t service; unsigned int active_devs; - struct crypto_alg algo; + struct skcipher_alg algo; }; /* @@ -49,9 +50,9 @@ struct virtio_crypto_algo { * and crypto algorithms registion. */ static DEFINE_MUTEX(algs_lock); -static void virtio_crypto_ablkcipher_finalize_req( +static void virtio_crypto_skcipher_finalize_req( struct virtio_crypto_sym_request *vc_sym_req, - struct ablkcipher_request *req, + struct skcipher_request *req, int err); static void virtio_crypto_dataq_sym_callback @@ -59,7 +60,7 @@ static void virtio_crypto_dataq_sym_callback { struct virtio_crypto_sym_request *vc_sym_req = container_of(vc_req, struct virtio_crypto_sym_request, base); - struct ablkcipher_request *ablk_req; + struct skcipher_request *ablk_req; int error; /* Finish the encrypt or decrypt process */ @@ -79,8 +80,8 @@ static void virtio_crypto_dataq_sym_callback error = -EIO; break; } - ablk_req = vc_sym_req->ablkcipher_req; - virtio_crypto_ablkcipher_finalize_req(vc_sym_req, + ablk_req = vc_sym_req->skcipher_req; + virtio_crypto_skcipher_finalize_req(vc_sym_req, ablk_req, error); } } @@ -110,8 +111,8 @@ virtio_crypto_alg_validate_key(int key_len, uint32_t *alg) return 0; } -static int virtio_crypto_alg_ablkcipher_init_session( - struct virtio_crypto_ablkcipher_ctx *ctx, +static int virtio_crypto_alg_skcipher_init_session( + struct virtio_crypto_skcipher_ctx *ctx, uint32_t alg, const uint8_t *key, unsigned int keylen, int encrypt) @@ -200,8 +201,8 @@ static int virtio_crypto_alg_ablkcipher_init_session( return 0; } -static int virtio_crypto_alg_ablkcipher_close_session( - struct virtio_crypto_ablkcipher_ctx *ctx, +static int virtio_crypto_alg_skcipher_close_session( + struct virtio_crypto_skcipher_ctx *ctx, int encrypt) { struct scatterlist outhdr, status_sg, *sgs[2]; @@ -261,8 +262,8 @@ static int virtio_crypto_alg_ablkcipher_close_session( return 0; } -static int 
virtio_crypto_alg_ablkcipher_init_sessions( - struct virtio_crypto_ablkcipher_ctx *ctx, +static int virtio_crypto_alg_skcipher_init_sessions( + struct virtio_crypto_skcipher_ctx *ctx, const uint8_t *key, unsigned int keylen) { uint32_t alg; @@ -278,30 +279,30 @@ static int virtio_crypto_alg_ablkcipher_init_sessions( goto bad_key; /* Create encryption session */ - ret = virtio_crypto_alg_ablkcipher_init_session(ctx, + ret = virtio_crypto_alg_skcipher_init_session(ctx, alg, key,
[PATCH 02/25] crypto: virtio - deal with unsupported input sizes
Return -EINVAL for input sizes that are not a multiple of the AES
block size, since they are not supported by our CBC chaining mode.

While at it, remove the pr_err() that reports unsupported key sizes
being used: we shouldn't spam the kernel log with that.

Fixes: dbaf0624ffa5 ("crypto: add virtio-crypto driver")
Cc: "Michael S. Tsirkin"
Cc: Jason Wang
Cc: Gonglei
Cc: virtualization@lists.linux-foundation.org
Signed-off-by: Ard Biesheuvel
---
 drivers/crypto/virtio/virtio_crypto_algs.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/crypto/virtio/virtio_crypto_algs.c b/drivers/crypto/virtio/virtio_crypto_algs.c
index 65ec10800137..82b316b2f537 100644
--- a/drivers/crypto/virtio/virtio_crypto_algs.c
+++ b/drivers/crypto/virtio/virtio_crypto_algs.c
@@ -105,8 +105,6 @@ virtio_crypto_alg_validate_key(int key_len, uint32_t *alg)
 		*alg = VIRTIO_CRYPTO_CIPHER_AES_CBC;
 		break;
 	default:
-		pr_err("virtio_crypto: Unsupported key length: %d\n",
-			key_len);
 		return -EINVAL;
 	}
 	return 0;
@@ -489,6 +487,11 @@ static int virtio_crypto_ablkcipher_encrypt(struct ablkcipher_request *req)
 	/* Use the first data virtqueue as default */
 	struct data_queue *data_vq = &vcrypto->data_vq[0];
 
+	if (!req->nbytes)
+		return 0;
+	if (req->nbytes % AES_BLOCK_SIZE)
+		return -EINVAL;
+
 	vc_req->dataq = data_vq;
 	vc_req->alg_cb = virtio_crypto_dataq_sym_callback;
 	vc_sym_req->ablkcipher_ctx = ctx;
@@ -509,6 +512,11 @@ static int virtio_crypto_ablkcipher_decrypt(struct ablkcipher_request *req)
 	/* Use the first data virtqueue as default */
 	struct data_queue *data_vq = &vcrypto->data_vq[0];
 
+	if (!req->nbytes)
+		return 0;
+	if (req->nbytes % AES_BLOCK_SIZE)
+		return -EINVAL;
+
 	vc_req->dataq = data_vq;
 	vc_req->alg_cb = virtio_crypto_dataq_sym_callback;
 	vc_sym_req->ablkcipher_ctx = ctx;
-- 
2.20.1
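The length check added by this patch is small enough to show in isolation. The following is a userspace sketch under assumed names (cbc_check_len(), REQ_DONE and REQ_SUBMIT are made up for illustration); it mirrors the two early returns in the patch: an empty request succeeds without doing anything, and a request that is not a whole number of AES blocks is rejected with -EINVAL.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define AES_BLOCK_SIZE 16

/* Hypothetical result codes for the sketch; errors are negative errno. */
enum { REQ_DONE = 0, REQ_SUBMIT = 1 };

/*
 * Decide what to do with a CBC request of nbytes bytes:
 *   REQ_DONE   - nothing to process, report success immediately
 *   REQ_SUBMIT - whole number of blocks, hand it to the device
 *   -EINVAL    - partial trailing block, not supported by plain CBC
 */
static int cbc_check_len(size_t nbytes)
{
	if (nbytes == 0)
		return REQ_DONE;
	if (nbytes % AES_BLOCK_SIZE)
		return -EINVAL;
	return REQ_SUBMIT;
}

/* Small self-check showing the three outcomes. */
static void cbc_check_len_selftest(void)
{
	assert(cbc_check_len(0) == REQ_DONE);
	assert(cbc_check_len(17) == -EINVAL);
	assert(cbc_check_len(32) == REQ_SUBMIT);
}
```

Handling the empty case first also keeps an expression such as nbytes - AES_BLOCK_SIZE from being evaluated on a zero-length request.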
Re: [PATCH 1/2] dma-mapping: Add dma_addr_is_phys_addr()
On 14/10/2019 05:51, David Gibson wrote: On Fri, Oct 11, 2019 at 06:25:18PM -0700, Ram Pai wrote: From: Thiago Jung Bauermann In order to safely use the DMA API, virtio needs to know whether DMA addresses are in fact physical addresses and for that purpose, dma_addr_is_phys_addr() is introduced. cc: Benjamin Herrenschmidt cc: David Gibson cc: Michael Ellerman cc: Paul Mackerras cc: Michael Roth cc: Alexey Kardashevskiy cc: Paul Burton cc: Robin Murphy cc: Bartlomiej Zolnierkiewicz cc: Marek Szyprowski cc: Christoph Hellwig Suggested-by: Michael S. Tsirkin Signed-off-by: Ram Pai Signed-off-by: Thiago Jung Bauermann The change itself looks ok, so Reviewed-by: David Gibson However, I would like to see the commit message (and maybe the inline comments) expanded a bit on what the distinction here is about. Some of the text from the next patch would be suitable, about DMA addresses usually being in a different address space but not in the case of bounce buffering. Right, this needs a much tighter definition. "DMA address happens to be a valid physical address" is true of various IOMMU setups too, but I can't believe it's meaningful in such cases. If what you actually want is "DMA is direct or SWIOTLB" - i.e. "DMA address is physical address of DMA data (not necessarily the original buffer)" - wouldn't dma_is_direct() suffice? Robin. --- arch/powerpc/include/asm/dma-mapping.h | 21 + arch/powerpc/platforms/pseries/Kconfig | 1 + include/linux/dma-mapping.h| 20 kernel/dma/Kconfig | 3 +++ 4 files changed, 45 insertions(+) diff --git a/arch/powerpc/include/asm/dma-mapping.h b/arch/powerpc/include/asm/dma-mapping.h index 565d6f7..f92c0a4b 100644 --- a/arch/powerpc/include/asm/dma-mapping.h +++ b/arch/powerpc/include/asm/dma-mapping.h @@ -5,6 +5,8 @@ #ifndef _ASM_DMA_MAPPING_H #define _ASM_DMA_MAPPING_H +#include + static inline const struct dma_map_ops *get_arch_dma_ops(struct bus_type *bus) { /* We don't handle the NULL dev case for ISA for now. 
We could @@ -15,4 +17,23 @@ static inline const struct dma_map_ops *get_arch_dma_ops(struct bus_type *bus) return NULL; } +#ifdef CONFIG_ARCH_HAS_DMA_ADDR_IS_PHYS_ADDR +/** + * dma_addr_is_phys_addr - check whether a device DMA address is a physical + * address + * @dev: device to check + * + * Returns %true if any DMA address for this device happens to also be a valid + * physical address (not necessarily of the same page). + */ +static inline bool dma_addr_is_phys_addr(struct device *dev) +{ + /* +* Secure guests always use the SWIOTLB, therefore DMA addresses are +* actually the physical address of the bounce buffer. +*/ + return is_secure_guest(); +} +#endif + #endif/* _ASM_DMA_MAPPING_H */ diff --git a/arch/powerpc/platforms/pseries/Kconfig b/arch/powerpc/platforms/pseries/Kconfig index 9e35cdd..0108150 100644 --- a/arch/powerpc/platforms/pseries/Kconfig +++ b/arch/powerpc/platforms/pseries/Kconfig @@ -152,6 +152,7 @@ config PPC_SVM select SWIOTLB select ARCH_HAS_MEM_ENCRYPT select ARCH_HAS_FORCE_DMA_UNENCRYPTED + select ARCH_HAS_DMA_ADDR_IS_PHYS_ADDR help There are certain POWER platforms which support secure guests using the Protected Execution Facility, with the help of an Ultravisor diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h index f7d1eea..6df5664 100644 --- a/include/linux/dma-mapping.h +++ b/include/linux/dma-mapping.h @@ -693,6 +693,26 @@ static inline bool dma_addressing_limited(struct device *dev) dma_get_required_mask(dev); } +#ifndef CONFIG_ARCH_HAS_DMA_ADDR_IS_PHYS_ADDR +/** + * dma_addr_is_phys_addr - check whether a device DMA address is a physical + * address + * @dev: device to check + * + * Returns %true if any DMA address for this device happens to also be a valid + * physical address (not necessarily of the same page). 
+ */ +static inline bool dma_addr_is_phys_addr(struct device *dev) +{ + /* +* Except in very specific setups, DMA addresses exist in a different +* address space from CPU physical addresses and cannot be directly used +* to reference system memory. +*/ + return false; +} +#endif + #ifdef CONFIG_ARCH_HAS_SETUP_DMA_OPS void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size, const struct iommu_ops *iommu, bool coherent); diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig index 9decbba..6209b46 100644 --- a/kernel/dma/Kconfig +++ b/kernel/dma/Kconfig @@ -51,6 +51,9 @@ config ARCH_HAS_DMA_MMAP_PGPROT config ARCH_HAS_FORCE_DMA_UNENCRYPTED bool +config ARCH_HAS_DMA_ADDR_IS_PHYS_ADDR + bool + config DMA_NONCOHERENT_CACHE_SYNC bool ___
Re: [PATCH v4 1/5] vsock/virtio: limit the memory used per-socket
On Mon, Oct 14, 2019 at 04:21:35PM +0800, Jason Wang wrote:
> On 2019/10/14 4:17 PM, Stefan Hajnoczi wrote:
> > SO_VM_SOCKETS_BUFFER_SIZE might have been useful for VMCI-specific
> > applications, but we should use SO_RCVBUF and SO_SNDBUF for portable
> > applications in the future. Those socket options also work with other
> > address families.
> >

I think hyperv_transport started to use it in this patch:
ac383f58f3c9 hv_sock: perf: Allow the socket buffer size options to influence the actual socket buffers

> > I guess these sockopts are bypassed by AF_VSOCK because it doesn't use
> > the common skb queuing code in net/core/sock.c :(. But one day we might
> > migrate to it...
> >
> > Stefan
>
> +1, we should really consider to reuse the exist socket mechanism instead of
> re-inventing wheels.

+1, I totally agree. I'll go this way.

Guys, thank you all for your suggestions!
Re: [PATCH v4 1/5] vsock/virtio: limit the memory used per-socket
On 2019/10/14 4:17 PM, Stefan Hajnoczi wrote:
> SO_VM_SOCKETS_BUFFER_SIZE might have been useful for VMCI-specific
> applications, but we should use SO_RCVBUF and SO_SNDBUF for portable
> applications in the future. Those socket options also work with other
> address families.
>
> I guess these sockopts are bypassed by AF_VSOCK because it doesn't use
> the common skb queuing code in net/core/sock.c :(. But one day we might
> migrate to it...
>
> Stefan

+1, we should really consider reusing the existing socket mechanism instead of reinventing the wheel.

Thanks
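For reference, this is roughly what the portable interface mentioned above looks like from an application. It is a minimal sketch, not vsock-specific code: set_rcvbuf() is a hypothetical helper, and as discussed in this thread AF_VSOCK may not honour SO_RCVBUF at this point, so the sketch is meant to be tried on any ordinary socket. SO_SNDBUF works the same way for the send side.

```c
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Request a receive-buffer size through the generic SO_RCVBUF option and
 * read back what the kernel actually applied (it may round or clamp the
 * value). Returns the effective size in bytes, or -1 on error.
 */
static int set_rcvbuf(int fd, int bytes)
{
	int effective = 0;
	socklen_t len = sizeof(effective);

	assert(bytes > 0);
	if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes)) < 0)
		return -1;
	if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &effective, &len) < 0)
		return -1;
	return effective;
}
```

Reading the value back matters because the applied size is not guaranteed to equal the requested one.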
Re: [PATCH v4 1/5] vsock/virtio: limit the memory used per-socket
On Fri, Oct 11, 2019 at 03:40:48PM +0200, Stefano Garzarella wrote:
> On Sun, Sep 1, 2019 at 8:56 AM Michael S. Tsirkin wrote:
> > On Fri, Aug 30, 2019 at 11:40:59AM +0200, Stefano Garzarella wrote:
> > > On Mon, Jul 29, 2019 at 10:04:29AM -0400, Michael S. Tsirkin wrote:
> > > > On Wed, Jul 17, 2019 at 01:30:26PM +0200, Stefano Garzarella wrote:
> > > > > Since virtio-vsock was introduced, the buffers filled by the host
> > > > > and pushed to the guest using the vring, are directly queued in
> > > > > a per-socket list. These buffers are preallocated by the guest
> > > > > with a fixed size (4 KB).
> > > > >
> > > > > The maximum amount of memory used by each socket should be
> > > > > controlled by the credit mechanism.
> > > > > The default credit available per-socket is 256 KB, but if we use
> > > > > only 1 byte per packet, the guest can queue up to 262144 of 4 KB
> > > > > buffers, using up to 1 GB of memory per-socket. In addition, the
> > > > > guest will continue to fill the vring with new 4 KB free buffers
> > > > > to avoid starvation of other sockets.
> > > > >
> > > > > This patch mitigates this issue copying the payload of small
> > > > > packets (< 128 bytes) into the buffer of last packet queued, in
> > > > > order to avoid wasting memory.
> > > > >
> > > > > Reviewed-by: Stefan Hajnoczi
> > > > > Signed-off-by: Stefano Garzarella
> > > >
> > > > This is good enough for net-next, but for net I think we
> > > > should figure out how to address the issue completely.
> > > > Can we make the accounting precise? What happens to
> > > > performance if we do?
> > > >
> > >
> > > Since I'm back from holidays, I'm restarting this thread to figure out
> > > how to address the issue completely.
> > >
> > > I did a better analysis of the credit mechanism that we implemented in
> > > virtio-vsock to get a clearer view and I'd share it with you:
> > >
> > > This issue affects only the "host->guest" path. In this case, when the
> > > host wants to send a packet to the guest, it uses a "free" buffer
> > > allocated by the guest (4KB).
> > > The "free" buffers available for the host are shared between all
> > > sockets, instead, the credit mechanism is per-socket, I think to
> > > avoid the starvation of other sockets.
> > > The guest re-fills the "free" queue when the available buffers are
> > > less than half.
> > >
> > > Each peer has these variables in the per-socket state:
> > >    /* local vars */
> > >    buf_alloc        /* max bytes usable by this socket
> > >                        [exposed to the other peer] */
> > >    fwd_cnt          /* increased when RX packet is consumed by the
> > >                        user space [exposed to the other peer] */
> > >    tx_cnt           /* increased when TX packet is sent to the
> > >                        other peer */
> > >
> > >    /* remote vars */
> > >    peer_buf_alloc   /* peer's buf_alloc */
> > >    peer_fwd_cnt     /* peer's fwd_cnt */
> > >
> > > When a peer sends a packet, it increases the 'tx_cnt'; when the
> > > receiver consumes the packet (copy it to the user-space buffer), it
> > > increases the 'fwd_cnt'.
> > > Note: increments are made considering the payload length and not the
> > > buffer length.
> > >
> > > The value of 'buf_alloc' and 'fwd_cnt' are sent to the other peer in
> > > all packet headers or with an explicit CREDIT_UPDATE packet.
> > >
> > > The local 'buf_alloc' value can be modified by the user space using
> > > setsockopt() with optname=SO_VM_SOCKETS_BUFFER_SIZE.
> > >
> > > Before sending a packet, the peer checks the space available:
> > >    credit_available = peer_buf_alloc - (tx_cnt - peer_fwd_cnt)
> > > and it will send up to credit_available bytes to the other peer.
> > >
> > > Possible solutions considering Michael's advice:
> > > 1. Use the buffer length instead of the payload length when we increment
> > >    the counters:
> > >   - This approach will account precisely the memory used per socket.
> > >   - This requires changes in both guest and host.
> > >   - It is not compatible with old drivers, so a feature should be
> > >     negotiated.
> > > 2. Decrease the advertised 'buf_alloc' taking into account the bytes
> > >    queued in the socket queue but not used. (e.g. 256 bytes used on 4K
> > >    available in the buffer)
> > >   - pkt->hdr.buf_alloc = buf_alloc - bytes_not_used.
> > >   - This should be compatible also with old drivers.
> > >
> > > Maybe the second is less invasive, but will it be too tricky?
> > > Any other advice or suggestions?
> > >
> > > Thanks in advance,
> > > Stefano
> >
> > OK let me try to clarify. The idea is this:
> >
> > Let's say we queue a buffer of 4K, and we copy if len < 128 bytes. This
> > means that in the worst case (128 byte packets), each byte of credit in
> > the socket uses up 4K/128 = 32 bytes of kernel memory. In fact we need
> > to also account for the