Re: [PATCH 39/52] migration/rdma: Convert qemu_rdma_write_one() to Error

2023-09-25 Thread Zhijian Li (Fujitsu)


On 26/09/2023 13:50, Li Zhijian wrote:
> 
> 
> On 18/09/2023 22:41, Markus Armbruster wrote:
>> Functions that use an Error **errp parameter to return errors should
>> not also report them to the user, because reporting is the caller's
>> job.  When the caller does, the error is reported twice.  When it
>> doesn't (because it recovered from the error), there is no error to
>> report, i.e. the report is bogus.
>>
>> qemu_rdma_write_flush() violates this principle: it calls
>> error_report() via qemu_rdma_write_one().  I elected not to
>> investigate how callers handle the error, i.e. precise impact is not
>> known.
>>
>> Clean this up by converting qemu_rdma_write_one() to Error.
>>
>> Signed-off-by: Markus Armbruster
>> ---
>>   migration/rdma.c | 25 +++--
>>   1 file changed, 11 insertions(+), 14 deletions(-)
>>
>> diff --git a/migration/rdma.c b/migration/rdma.c
>> index c3c33fe242..9b8cbadfcd 100644
>> --- a/migration/rdma.c
>> +++ b/migration/rdma.c
>> @@ -2019,9 +2019,8 @@ static int qemu_rdma_exchange_recv(RDMAContext *rdma, 
>> RDMAControlHeader *head,
>>    */
>>   static int qemu_rdma_write_one(QEMUFile *f, RDMAContext *rdma,
>>  int current_index, uint64_t current_addr,
>> -   uint64_t length)
>> +   uint64_t length, Error **errp)
>>   {
>> -    Error *err = NULL;
>>   struct ibv_sge sge;
>>   struct ibv_send_wr send_wr = { 0 };
>>   struct ibv_send_wr *bad_wr;
> 
> [...]
> 
>>   }
>> @@ -2219,7 +2216,7 @@ retry:
>>   goto retry;
>>   } else if (ret > 0) {
>> -    perror("rdma migration: post rdma write failed");
>> +    error_setg(errp, "rdma migration: post rdma write failed");
> 
> This reminds me: did you mean to use error_setg_errno() here instead?
> 

Answering my own question — ibv_post_send(3) says:

RETURN VALUE
ibv_post_send() returns 0 on success, or the value of errno on failure
(which indicates the failure reason).

So the failure reason comes back in the return value; the global errno is
not set here, and plain error_setg() is fine.
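For illustration only (a sketch, not part of the patch): if one did want
errno-style detail in the message, it is ibv_post_send()'s return value,
not the global errno, that would have to be passed:

    ret = ibv_post_send(rdma->qp, &send_wr, &bad_wr);
    if (ret > 0) {
        /* ret holds the errno value; the global errno is untouched */
        error_setg_errno(errp, ret, "rdma migration: post rdma write failed");
        return -1;
    }

Since error_setg_errno() only needs an errno-style value, passing ret
directly would work; plain error_setg() just drops that detail.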



> 
> 
> 
> 
>>   return -1;
>>   }
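A minimal sketch of the convention this series enforces (illustrative
names, not code from the patch): the callee only fills *errp, and the
caller decides whether to report, so each error is reported exactly once:

    #include "qapi/error.h"

    /* Callee: set the error, never report it. */
    static int do_thing(Error **errp)
    {
        if (!try_thing()) {                /* try_thing() is hypothetical */
            error_setg(errp, "thing failed");
            return -1;
        }
        return 0;
    }

    /* Caller: the one place where the error is reported (or recovered). */
    static void use_thing(void)
    {
        Error *err = NULL;

        if (do_thing(&err) < 0) {
            error_report_err(err);         /* reports and frees err */
        }
    }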

Re: [PATCH] hw/i386: changes towards enabling -Wshadow=local

2023-09-25 Thread Ani Sinha



> On 25-Sep-2023, at 8:42 PM, Peter Xu  wrote:
> 
> On Sat, Sep 23, 2023 at 08:03:34AM +0530, Ani Sinha wrote:
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index c0ce896668..c1fb69170f 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -3770,9 +3770,9 @@ static void vtd_address_space_unmap(VTDAddressSpace 
>> *as, IOMMUNotifier *n)
>> while (remain >= VTD_PAGE_SIZE) {
>> IOMMUTLBEvent event;
>> uint64_t mask = dma_aligned_pow2_mask(start, end, s->aw_bits);
>> -uint64_t size = mask + 1;
>> +uint64_t sz = mask + 1;
>> 
>> -assert(size);
>> +assert(sz);
>> 
>> event.type = IOMMU_NOTIFIER_UNMAP;
>> event.entry.iova = start;
>> @@ -3784,8 +3784,8 @@ static void vtd_address_space_unmap(VTDAddressSpace 
>> *as, IOMMUNotifier *n)
>> 
>> memory_region_notify_iommu_one(n, &event);
>> 
>> -start += size;
>> -remain -= size;
>> +start += sz;
>> +remain -= sz;
>> }
>> 
>> assert(!remain);
> 
> Ani,
> 
> I've got a small patch for this hunk already:
> 
> https://lore.kernel.org/r/20230922160410.138786-1-pet...@redhat.com
> 
> Wouldn't hurt to merge both, though.. or just drop the other one.

I liked your change so kept it and removed mine. See v2.

> 
> Reviewed-by: Peter Xu 
> 
> Thanks,
> 
> -- 
> Peter Xu
> 




[PATCH v2] hw/i386: changes towards enabling -Wshadow=local

2023-09-25 Thread Ani Sinha
Code changes that address all compiler complaints arising from enabling
the -Wshadow flags. Enabling -Wshadow catches cases of local variables
shadowing other local variables or parameters. Such shadowing makes the
code confusing and/or introduces bugs that are difficult to catch.
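For illustration (hypothetical code, not from this patch), this is the
kind of construct -Wshadow=local flags:

    void fill(uint8_t *buf, size_t size)
    {
        for (size_t i = 0; i < 4 && i < size; i++) {
            size_t size = i * 512;    /* shadows the parameter 'size' */
            buf[i] = size & 0xff;     /* easy to misread as the parameter */
        }
    }

The inner 'size' silently hides the outer one for the rest of the block,
which is exactly the confusion the flag is meant to catch.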

CC: Markus Armbruster 
CC: Philippe Mathieu-Daudé 
CC: m...@redhat.com
Message-Id: <87r0mqlf9x@pond.sub.org>
Signed-off-by: Ani Sinha 
Reviewed-by: Daniel P. Berrangé 
Reviewed-by: Peter Xu 
---
 hw/i386/acpi-microvm.c | 12 ++--
 hw/i386/intel_iommu.c  |  8 
 hw/i386/pc.c   |  1 -
 hw/i386/x86.c  |  2 --
 4 files changed, 10 insertions(+), 13 deletions(-)

changelog:
v2: kept Peter's changes from 
https://lore.kernel.org/r/20230922160410.138786-1-pet...@redhat.com
and removed mine.

diff --git a/hw/i386/acpi-microvm.c b/hw/i386/acpi-microvm.c
index a075360d85..6e4f8061eb 100644
--- a/hw/i386/acpi-microvm.c
+++ b/hw/i386/acpi-microvm.c
@@ -78,18 +78,18 @@ static void acpi_dsdt_add_virtio(Aml *scope,
 hwaddr base = VIRTIO_MMIO_BASE + index * 512;
 hwaddr size = 512;
 
-Aml *dev = aml_device("VR%02u", (unsigned)index);
-aml_append(dev, aml_name_decl("_HID", aml_string("LNRO0005")));
-aml_append(dev, aml_name_decl("_UID", aml_int(index)));
-aml_append(dev, aml_name_decl("_CCA", aml_int(1)));
+Aml *adev = aml_device("VR%02u", (unsigned)index);
+aml_append(adev, aml_name_decl("_HID", aml_string("LNRO0005")));
+aml_append(adev, aml_name_decl("_UID", aml_int(index)));
+aml_append(adev, aml_name_decl("_CCA", aml_int(1)));
 
 Aml *crs = aml_resource_template();
 aml_append(crs, aml_memory32_fixed(base, size, AML_READ_WRITE));
 aml_append(crs,
aml_interrupt(AML_CONSUMER, AML_LEVEL, AML_ACTIVE_HIGH,
  AML_EXCLUSIVE, &irq, 1));
-aml_append(dev, aml_name_decl("_CRS", crs));
-aml_append(scope, dev);
+aml_append(adev, aml_name_decl("_CRS", crs));
+aml_append(scope, adev);
 }
 }
 }
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index c0ce896668..2c832ab68b 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -3744,7 +3744,7 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, 
PCIBus *bus,
 /* Unmap the whole range in the notifier's scope. */
 static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
 {
-hwaddr size, remain;
+hwaddr total, remain;
 hwaddr start = n->start;
 hwaddr end = n->end;
 IntelIOMMUState *s = as->iommu_state;
@@ -3765,7 +3765,7 @@ static void vtd_address_space_unmap(VTDAddressSpace *as, 
IOMMUNotifier *n)
 }
 
 assert(start <= end);
-size = remain = end - start + 1;
+total = remain = end - start + 1;
 
 while (remain >= VTD_PAGE_SIZE) {
 IOMMUTLBEvent event;
@@ -3793,10 +3793,10 @@ static void vtd_address_space_unmap(VTDAddressSpace 
*as, IOMMUNotifier *n)
 trace_vtd_as_unmap_whole(pci_bus_num(as->bus),
  VTD_PCI_SLOT(as->devfn),
  VTD_PCI_FUNC(as->devfn),
- n->start, size);
+ n->start, total);
 
 map.iova = n->start;
-map.size = size - 1; /* Inclusive */
+map.size = total - 1; /* Inclusive */
 iova_tree_remove(as->iova_tree, map);
 }
 
diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 3db0743f31..e7a233e886 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -1116,7 +1116,6 @@ void pc_memory_init(PCMachineState *pcms,
 
 if (machine->device_memory) {
 uint64_t *val = g_malloc(sizeof(*val));
-PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
 uint64_t res_mem_end = machine->device_memory->base;
 
 if (!pcmc->broken_reserved_end) {
diff --git a/hw/i386/x86.c b/hw/i386/x86.c
index f034df8bf6..b3d054889b 100644
--- a/hw/i386/x86.c
+++ b/hw/i386/x86.c
@@ -365,8 +365,6 @@ void x86_cpu_pre_plug(HotplugHandler *hotplug_dev,
 
 cpu_slot = x86_find_cpu_slot(MACHINE(x86ms), cpu->apic_id, &idx);
 if (!cpu_slot) {
-MachineState *ms = MACHINE(x86ms);
-
 x86_topo_ids_from_apicid(cpu->apic_id, &topo_info, &topo_ids);
 error_setg(errp,
 "Invalid CPU [socket: %u, die: %u, core: %u, thread: %u] with"
-- 
2.39.3




Re: [PATCH 39/52] migration/rdma: Convert qemu_rdma_write_one() to Error

2023-09-25 Thread Zhijian Li (Fujitsu)


On 18/09/2023 22:41, Markus Armbruster wrote:
> Functions that use an Error **errp parameter to return errors should
> not also report them to the user, because reporting is the caller's
> job.  When the caller does, the error is reported twice.  When it
> doesn't (because it recovered from the error), there is no error to
> report, i.e. the report is bogus.
> 
> qemu_rdma_write_flush() violates this principle: it calls
> error_report() via qemu_rdma_write_one().  I elected not to
> investigate how callers handle the error, i.e. precise impact is not
> known.
> 
> Clean this up by converting qemu_rdma_write_one() to Error.
> 
> Signed-off-by: Markus Armbruster
> ---
>   migration/rdma.c | 25 +++--
>   1 file changed, 11 insertions(+), 14 deletions(-)
> 
> diff --git a/migration/rdma.c b/migration/rdma.c
> index c3c33fe242..9b8cbadfcd 100644
> --- a/migration/rdma.c
> +++ b/migration/rdma.c
> @@ -2019,9 +2019,8 @@ static int qemu_rdma_exchange_recv(RDMAContext *rdma, 
> RDMAControlHeader *head,
>*/
>   static int qemu_rdma_write_one(QEMUFile *f, RDMAContext *rdma,
>  int current_index, uint64_t current_addr,
> -   uint64_t length)
> +   uint64_t length, Error **errp)
>   {
> -Error *err = NULL;
>   struct ibv_sge sge;
>   struct ibv_send_wr send_wr = { 0 };
>   struct ibv_send_wr *bad_wr;

[...]

>   }
> @@ -2219,7 +2216,7 @@ retry:
>   goto retry;
>   
>   } else if (ret > 0) {
> -perror("rdma migration: post rdma write failed");
> +error_setg(errp, "rdma migration: post rdma write failed");

This reminds me: did you mean to use error_setg_errno() here instead?





>   return -1;
>   }

Re: [PATCH 44/52] migration/rdma: Silence qemu_rdma_resolve_host()

2023-09-25 Thread Zhijian Li (Fujitsu)


On 18/09/2023 22:41, Markus Armbruster wrote:
> Functions that use an Error **errp parameter to return errors should
> not also report them to the user, because reporting is the caller's
> job.  When the caller does, the error is reported twice.  When it
> doesn't (because it recovered from the error), there is no error to
> report, i.e. the report is bogus.
> 
> qemu_rdma_resolve_host() violates this principle: it calls
> error_report().
> 
> Clean this up: drop error_report().
> 
> Signed-off-by: Markus Armbruster 

Reviewed-by: Li Zhijian 


> ---
>   migration/rdma.c | 1 -
>   1 file changed, 1 deletion(-)
> 
> diff --git a/migration/rdma.c b/migration/rdma.c
> index 41f0ae4ddb..0e365db06a 100644
> --- a/migration/rdma.c
> +++ b/migration/rdma.c
> @@ -1003,7 +1003,6 @@ route:
>   error_setg(errp,
>  "RDMA ERROR: result not equal to event_addr_resolved %s",
>  rdma_event_str(cm_event->event));
> -error_report("rdma_resolve_addr");
>   rdma_ack_cm_event(cm_event);
>   goto err_resolve_get_addr;
>   }

Re: [PATCH 43/52] migration/rdma: Convert qemu_rdma_alloc_pd_cq() to Error

2023-09-25 Thread Zhijian Li (Fujitsu)


On 18/09/2023 22:41, Markus Armbruster wrote:
> Functions that use an Error **errp parameter to return errors should
> not also report them to the user, because reporting is the caller's
> job.  When the caller does, the error is reported twice.  When it
> doesn't (because it recovered from the error), there is no error to
> report, i.e. the report is bogus.
> 
> qemu_rdma_source_init() violates this principle: it calls
> error_report() via qemu_rdma_alloc_pd_cq().  I elected not to
> investigate how callers handle the error, i.e. precise impact is not
> known.
> 
> Clean this up by converting qemu_rdma_alloc_pd_cq() to Error.
> 
> Signed-off-by: Markus Armbruster
> ---
>   migration/rdma.c | 27 ++-
>   1 file changed, 14 insertions(+), 13 deletions(-)

[...]


> @@ -2451,6 +2451,7 @@ static void qemu_rdma_cleanup(RDMAContext *rdma)
>   
>   static int qemu_rdma_source_init(RDMAContext *rdma, bool pin_all, Error 
> **errp)
>   {
> +ERRP_GUARD();
>   int ret, idx;
>   
>   /*
> @@ -2464,12 +2465,12 @@ static int qemu_rdma_source_init(RDMAContext *rdma, 
> bool pin_all, Error **errp)
>   goto err_rdma_source_init;
>   }
>   
> -ret = qemu_rdma_alloc_pd_cq(rdma);
> +ret = qemu_rdma_alloc_pd_cq(rdma, errp);
>   if (ret < 0) {
> -error_setg(errp, "RDMA ERROR: "
> -   "rdma migration: error allocating pd and cq! Your mlock()"
> -   " limits may be too low. Please check $ ulimit -a # and "
> -   "search for 'ulimit -l' in the output");
> +error_append_hint(errp,
> +  "Your mlock() limits may be too low. "
> +  "Please check $ ulimit -a # and "
> +  "search for 'ulimit -l' in the output\n");


I think we could freely remove this message as well: it is neither an
exact resolution nor something anyone is likely to act on. Just report the
error that qemu_rdma_alloc_pd_cq() tells us.
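FWIW, the hunk relies on the ERRP_GUARD() added earlier in the patch:
error_append_hint() attaches the hint to the Error object already stored
in *errp, and ERRP_GUARD() guarantees *errp points at a real object even
when the caller passed &error_fatal or NULL. Roughly (a sketch of the
pattern, not the patch itself):

    ERRP_GUARD();    /* makes *errp safe to append hints to */

    ret = qemu_rdma_alloc_pd_cq(rdma, errp);
    if (ret < 0) {
        /* the hint rides along with the error set by the callee */
        error_append_hint(errp, "Your mlock() limits may be too low.\n");
        goto err_rdma_source_init;
    }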

Anyway

Reviewed-by: Li Zhijian 


>   goto err_rdma_source_init;
>   }

Re: [PATCH 42/52] migration/rdma: Convert qemu_rdma_post_recv_control() to Error

2023-09-25 Thread Zhijian Li (Fujitsu)


On 18/09/2023 22:41, Markus Armbruster wrote:
> Just for symmetry with qemu_rdma_post_send_control().  Error messages
> lose detail I consider of no use to users.
> 
> Signed-off-by: Markus Armbruster

Reviewed-by: Li Zhijian 

Re: [PATCH 41/52] migration/rdma: Convert qemu_rdma_post_send_control() to Error

2023-09-25 Thread Zhijian Li (Fujitsu)


On 18/09/2023 22:41, Markus Armbruster wrote:
> Functions that use an Error **errp parameter to return errors should
> not also report them to the user, because reporting is the caller's
> job.  When the caller does, the error is reported twice.  When it
> doesn't (because it recovered from the error), there is no error to
> report, i.e. the report is bogus.
> 
> qemu_rdma_exchange_send() violates this principle: it calls
> error_report() via qemu_rdma_post_send_control().  I elected not to
> investigate how callers handle the error, i.e. precise impact is not
> known.
> 
> Clean this up by converting qemu_rdma_post_send_control() to Error.
> 
> Signed-off-by: Markus Armbruster 

Reviewed-by: Li Zhijian 

Re: [PATCH 40/52] migration/rdma: Convert qemu_rdma_write() to Error

2023-09-25 Thread Zhijian Li (Fujitsu)


On 18/09/2023 22:41, Markus Armbruster wrote:
> Just for consistency with qemu_rdma_write_one() and
> qemu_rdma_write_flush(), and for slightly simpler code.
> 
> Signed-off-by: Markus Armbruster

Reviewed-by: Li Zhijian 

Re: [PATCH 39/52] migration/rdma: Convert qemu_rdma_write_one() to Error

2023-09-25 Thread Zhijian Li (Fujitsu)


On 18/09/2023 22:41, Markus Armbruster wrote:
> Functions that use an Error **errp parameter to return errors should
> not also report them to the user, because reporting is the caller's
> job.  When the caller does, the error is reported twice.  When it
> doesn't (because it recovered from the error), there is no error to
> report, i.e. the report is bogus.
> 
> qemu_rdma_write_flush() violates this principle: it calls
> error_report() via qemu_rdma_write_one().  I elected not to
> investigate how callers handle the error, i.e. precise impact is not
> known.
> 
> Clean this up by converting qemu_rdma_write_one() to Error.
> 
> Signed-off-by: Markus Armbruster


Reviewed-by: Li Zhijian 

Re: [PATCH v4 01/21] i386: Fix comment style in topology.h

2023-09-25 Thread Zhao Liu
On Fri, Sep 22, 2023 at 11:05:59AM -0500, Moger, Babu wrote:
> Date: Fri, 22 Sep 2023 11:05:59 -0500
> From: "Moger, Babu" 
> Subject: Re: [PATCH v4 01/21] i386: Fix comment style in topology.h
> 
> 
> On 9/14/2023 2:21 AM, Zhao Liu wrote:
> > From: Zhao Liu 
> > 
> > For function comments in this file, keep the comment style consistent
> > with other files in the directory.
> > 
> > Signed-off-by: Zhao Liu 
> > Reviewed-by: Philippe Mathieu-Daudé 
> > Reviewed-by: Yanan Wang 
> > Reviewed-by: Xiaoyao Li 
> > Acked-by: Michael S. Tsirkin 
> 
> Reviewed-by: Babu Moger 

Thanks!

-Zhao

> 
> 
> > ---
> > Changes since v3:
> >   * Optimized the description in commit message: Change "with other
> > places" to "with other files in the directory". (Babu)
> > ---
> >   include/hw/i386/topology.h | 33 +
> >   1 file changed, 17 insertions(+), 16 deletions(-)
> > 
> > diff --git a/include/hw/i386/topology.h b/include/hw/i386/topology.h
> > index 81573f6cfde0..5a19679f618b 100644
> > --- a/include/hw/i386/topology.h
> > +++ b/include/hw/i386/topology.h
> > @@ -24,7 +24,8 @@
> >   #ifndef HW_I386_TOPOLOGY_H
> >   #define HW_I386_TOPOLOGY_H
> > -/* This file implements the APIC-ID-based CPU topology enumeration logic,
> > +/*
> > + * This file implements the APIC-ID-based CPU topology enumeration logic,
> >* documented at the following document:
> >*   Intel® 64 Architecture Processor Topology Enumeration
> >*   
> > http://software.intel.com/en-us/articles/intel-64-architecture-processor-topology-enumeration/
> > @@ -41,7 +42,8 @@
> >   #include "qemu/bitops.h"
> > -/* APIC IDs can be 32-bit, but beware: APIC IDs > 255 require x2APIC 
> > support
> > +/*
> > + * APIC IDs can be 32-bit, but beware: APIC IDs > 255 require x2APIC 
> > support
> >*/
> >   typedef uint32_t apic_id_t;
> > @@ -58,8 +60,7 @@ typedef struct X86CPUTopoInfo {
> >   unsigned threads_per_core;
> >   } X86CPUTopoInfo;
> > -/* Return the bit width needed for 'count' IDs
> > - */
> > +/* Return the bit width needed for 'count' IDs */
> >   static unsigned apicid_bitwidth_for_count(unsigned count)
> >   {
> >   g_assert(count >= 1);
> > @@ -67,15 +68,13 @@ static unsigned apicid_bitwidth_for_count(unsigned 
> > count)
> >   return count ? 32 - clz32(count) : 0;
> >   }
> > -/* Bit width of the SMT_ID (thread ID) field on the APIC ID
> > - */
> > +/* Bit width of the SMT_ID (thread ID) field on the APIC ID */
> >   static inline unsigned apicid_smt_width(X86CPUTopoInfo *topo_info)
> >   {
> >   return apicid_bitwidth_for_count(topo_info->threads_per_core);
> >   }
> > -/* Bit width of the Core_ID field
> > - */
> > +/* Bit width of the Core_ID field */
> >   static inline unsigned apicid_core_width(X86CPUTopoInfo *topo_info)
> >   {
> >   return apicid_bitwidth_for_count(topo_info->cores_per_die);
> > @@ -87,8 +86,7 @@ static inline unsigned apicid_die_width(X86CPUTopoInfo 
> > *topo_info)
> >   return apicid_bitwidth_for_count(topo_info->dies_per_pkg);
> >   }
> > -/* Bit offset of the Core_ID field
> > - */
> > +/* Bit offset of the Core_ID field */
> >   static inline unsigned apicid_core_offset(X86CPUTopoInfo *topo_info)
> >   {
> >   return apicid_smt_width(topo_info);
> > @@ -100,14 +98,14 @@ static inline unsigned 
> > apicid_die_offset(X86CPUTopoInfo *topo_info)
> >   return apicid_core_offset(topo_info) + apicid_core_width(topo_info);
> >   }
> > -/* Bit offset of the Pkg_ID (socket ID) field
> > - */
> > +/* Bit offset of the Pkg_ID (socket ID) field */
> >   static inline unsigned apicid_pkg_offset(X86CPUTopoInfo *topo_info)
> >   {
> >   return apicid_die_offset(topo_info) + apicid_die_width(topo_info);
> >   }
> > -/* Make APIC ID for the CPU based on Pkg_ID, Core_ID, SMT_ID
> > +/*
> > + * Make APIC ID for the CPU based on Pkg_ID, Core_ID, SMT_ID
> >*
> >* The caller must make sure core_id < nr_cores and smt_id < nr_threads.
> >*/
> > @@ -120,7 +118,8 @@ static inline apic_id_t 
> > x86_apicid_from_topo_ids(X86CPUTopoInfo *topo_info,
> >  topo_ids->smt_id;
> >   }
> > -/* Calculate thread/core/package IDs for a specific topology,
> > +/*
> > + * Calculate thread/core/package IDs for a specific topology,
> >* based on (contiguous) CPU index
> >*/
> >   static inline void x86_topo_ids_from_idx(X86CPUTopoInfo *topo_info,
> > @@ -137,7 +136,8 @@ static inline void x86_topo_ids_from_idx(X86CPUTopoInfo 
> > *topo_info,
> >   topo_ids->smt_id = cpu_index % nr_threads;
> >   }
> > -/* Calculate thread/core/package IDs for a specific topology,
> > +/*
> > + * Calculate thread/core/package IDs for a specific topology,
> >* based on APIC ID
> >*/
> >   static inline void x86_topo_ids_from_apicid(apic_id_t apicid,
> > @@ -155,7 +155,8 @@ static inline void x86_topo_ids_from_apicid(apic_id_t 
> > apicid,
> >   topo_ids->pkg_id = apicid >> apicid_pkg_offset(topo_info);
> >   }
> > -/* Make APIC ID for the CPU 

Re: [PATCH v4 00/21] Support smp.clusters for x86 in QEMU

2023-09-25 Thread Zhao Liu
On Fri, Sep 22, 2023 at 11:03:55AM -0500, Moger, Babu wrote:
> Date: Fri, 22 Sep 2023 11:03:55 -0500
> From: "Moger, Babu" 
> Subject: Re: [PATCH v4 00/21] Support smp.clusters for x86 in QEMU
> 
> Tested the series on AMD system. Created a VM and ran some basic commands.
> 
> Everything looks good.
> 
> Tested-by: Babu Moger 
> 

Many thanks!

-Zhao

> 
> On 9/14/2023 2:21 AM, Zhao Liu wrote:
> > From: Zhao Liu 
> > 
> > Hi list,
> > 
> > (CC k...@vger.kernel.org for better browsing.)
> > 
> > This is our v4 patch series, rebased on the master branch at the
> > commit 9ef497755afc2 ("Merge tag 'pull-vfio-20230911' of
> > https://github.com/legoater/qemu into staging").
> > 
> > Comparing with v3 [1], v4 mainly refactors the CPUID[0x1F] encoding and
> > exposes module level in CPUID[0x1F] with these new patches:
> > 
> > * [PATCH v4 08/21] i386: Split topology types of CPUID[0x1F] from the
> > definitions of CPUID[0xB]
> > * [PATCH v4 09/21] i386: Decouple CPUID[0x1F] subleaf with specific
> > topology level
> > * [PATCH v4 12/21] i386: Expose module level in CPUID[0x1F]
> > 
> > v4 also fixes compile warnings and fixes cache topology uninitialization
> > bugs for some AMD CPUs.
> > 
> > Welcome your comments!
> > 
> > 
> > # Introduction
> > 
> > This series adds cluster support for the x86 PC machine, which allows
> > x86 to use smp.clusters to configure the module-level CPU topology
> > of x86.
> > 
> > And due to the compatibility issue (see section: ## Why not share L2
> > cache in cluster directly), this series also introduces a new command
> > to adjust the topology of the x86 L2 cache.
> > 
> > Welcome your comments!
> > 
> > 
> > # Background
> > 
> > The "clusters" parameter in "smp" is introduced by ARM [2], but x86
> > hasn't supported it.
> > 
> > At present, x86 defaults L2 cache is shared in one core, but this is
> > not enough. There're some platforms that multiple cores share the
> > same L2 cache, e.g., Alder Lake-P shares L2 cache for one module of
> > Atom cores [3], that is, every four Atom cores shares one L2 cache.
> > Therefore, we need the new CPU topology level (cluster/module).
> > 
> > Another reason is for hybrid architecture. cluster support not only
> > provides another level of topology definition in x86, but would also
> > provide required code change for future our hybrid topology support.
> > 
> > 
> > # Overview
> > 
> > ## Introduction of module level for x86
> > 
> > "cluster" in smp is the CPU topology level which is between "core" and
> > die.
> > 
> > For x86, the "cluster" in smp is corresponding to the module level [4],
> > which is above the core level. So use the "module" other than "cluster"
> > in x86 code.
> > 
> > And please note that x86 already has a cpu topology level also named
> > "cluster" [4], this level is at the upper level of the package. Here,
> > the cluster in x86 cpu topology is completely different from the
> > "clusters" as the smp parameter. After the module level is introduced,
> > the cluster as the smp parameter will actually refer to the module level
> > of x86.
> > 
> > 
> > ## Why not share L2 cache in cluster directly
> > 
> > Though "clusters" was introduced to help define L2 cache topology
> > [2], using cluster to define x86's L2 cache topology will cause the
> > compatibility problem:
> > 
> > Currently, x86 defaults that the L2 cache is shared in one core, which
> > actually implies a default setting "cores per L2 cache is 1" and
> > therefore implicitly defaults to having as many L2 caches as cores.
> > 
> > For example (i386 PC machine):
> > -smp 16,sockets=2,dies=2,cores=2,threads=2,maxcpus=16 (*)
> > 
> > Considering the topology of the L2 cache, this (*) implicitly means "1
> > core per L2 cache" and "2 L2 caches per die".
> > 
> > If we use cluster to configure L2 cache topology with the new default
> > setting "clusters per L2 cache is 1", the above semantics will change
> > to "2 cores per cluster" and "1 cluster per L2 cache", that is, "2
> > cores per L2 cache".
> > 
> > So the same command (*) will cause changes in the L2 cache topology,
> > further affecting the performance of the virtual machine.
> > 
> > Therefore, x86 should only treat cluster as a cpu topology level and
> > avoid using it to change L2 cache by default for compatibility.
> > 
> > 
> > ## module level in CPUID
> > 
> > The Linux kernel (from v6.4, with commit edc0a2b595765 ("x86/topology:
> > Fix erroneous smp_num_siblings on Intel Hybrid platforms")) is able to
> > handle platforms with the module level enumerated via CPUID.1F.
> > 
> > Since v3, the module level is exposed in CPUID[0x1F] (for Intel CPUs)
> > if the machine has more than one module.
> > 
> > We can configure CPUID.04H.02H (L2 cache topology) with module level by
> > a new command:
> > 
> >  "-cpu,x-l2-cache-topo=cluster"
> > 
> > More information about this command, please see the section: "## New
> > property: x-l2-cache-topo".
> > 
> > 
> > ## New cache topology info in CPUCacheInfo

Re: [PATCH v4 19/21] i386: Use offsets get NumSharingCache for CPUID[0x8000001D].EAX[bits 25:14]

2023-09-25 Thread Zhao Liu
On Fri, Sep 22, 2023 at 02:27:18PM -0500, Moger, Babu wrote:
> Date: Fri, 22 Sep 2023 14:27:18 -0500
> From: "Moger, Babu" 
> Subject: Re: [PATCH v4 19/21] i386: Use offsets get NumSharingCache for
>  CPUID[0x8000001D].EAX[bits 25:14]
> 
> 
> On 9/14/2023 2:21 AM, Zhao Liu wrote:
> > From: Zhao Liu 
> > 
> > The commit 8f4202fb1080 ("i386: Populate AMD Processor Cache Information
> > for cpuid 0x8000001D") adds the cache topology for AMD CPUs by encoding
> > the number of sharing threads directly.
> > 
> >  From AMD's APM, NumSharingCache (CPUID[0x8000001D].EAX[bits 25:14])
> > means [1]:
> > 
> > The number of logical processors sharing this cache is the value of
> > this field incremented by 1. To determine which logical processors are
> > sharing a cache, determine a Share Id for each processor as follows:
> > 
> > ShareId = LocalApicId >> log2(NumSharingCache+1)
> > 
> > Logical processors with the same ShareId then share a cache. If
> > NumSharingCache+1 is not a power of two, round it up to the next power
> > of two.
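As a worked illustration of that rule (a sketch with hypothetical helper
names, not code from the patch):

    #include <stdint.h>

    /* Smallest n with (1u << n) >= v, i.e. log2 of v rounded up to a
     * power of two. NumSharingCache is a 12-bit field, so no overflow. */
    static unsigned log2_ceil(uint32_t v)
    {
        unsigned n = 0;
        while ((1u << n) < v) {
            n++;
        }
        return n;
    }

    /* ShareId = LocalApicId >> log2(NumSharingCache + 1) */
    static uint32_t share_id(uint32_t local_apic_id, uint32_t num_sharing_cache)
    {
        return local_apic_id >> log2_ceil(num_sharing_cache + 1);
    }

E.g. with NumSharingCache = 5, six logical processors share the cache;
6 rounds up to 8, so ShareId = LocalApicId >> 3.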
> > 
> >  From the description above, the calculation of this field should be
> > the same as CPUID[4].EAX[bits 25:14] for Intel CPUs. So also use the
> > offsets of the APIC ID to calculate this field.
> > 
> > [1]: APM, vol.3, appendix.E.4.15 Function 8000_001Dh--Cache Topology
> >   Information
> > 
> > Signed-off-by: Zhao Liu 
> Reviewed-by: Babu Moger 

Thanks Babu!

-Zhao

> > ---
> > Changes since v3:
> >   * Rewrite the subject. (Babu)
> >   * Delete the original "comment/help" expression, as this behavior is
> > confirmed for AMD CPUs. (Babu)
> >   * Rename "num_apic_ids" (v3) to "num_sharing_cache" to match spec
> > definition. (Babu)
> > 
> > Changes since v1:
> >   * Rename "l3_threads" to "num_apic_ids" in
> > encode_cache_cpuid8000001d(). (Yanan)
> >   * Add the description of the original commit and add Cc.
> > ---
> >   target/i386/cpu.c | 10 --
> >   1 file changed, 4 insertions(+), 6 deletions(-)
> > 
> > diff --git a/target/i386/cpu.c b/target/i386/cpu.c
> > index 5d066107d6ce..bc28c59df089 100644
> > --- a/target/i386/cpu.c
> > +++ b/target/i386/cpu.c
> > @@ -482,7 +482,7 @@ static void encode_cache_cpuid8000001d(CPUCacheInfo 
> > *cache,
> >  uint32_t *eax, uint32_t *ebx,
> >  uint32_t *ecx, uint32_t *edx)
> >   {
> > -uint32_t l3_threads;
> > +uint32_t num_sharing_cache;
> >   assert(cache->size == cache->line_size * cache->associativity *
> > cache->partitions * cache->sets);
> > @@ -491,13 +491,11 @@ static void encode_cache_cpuid8000001d(CPUCacheInfo 
> > *cache,
> >   /* L3 is shared among multiple cores */
> >   if (cache->level == 3) {
> > -l3_threads = topo_info->modules_per_die *
> > - topo_info->cores_per_module *
> > - topo_info->threads_per_core;
> > -*eax |= (l3_threads - 1) << 14;
> > +num_sharing_cache = 1 << apicid_die_offset(topo_info);
> >   } else {
> > -*eax |= ((topo_info->threads_per_core - 1) << 14);
> > +num_sharing_cache = 1 << apicid_core_offset(topo_info);
> >   }
> > +*eax |= (num_sharing_cache - 1) << 14;
> >   assert(cache->line_size > 0);
> >   assert(cache->partitions > 0);



Re: [PATCH v4 20/21] i386: Use CPUCacheInfo.share_level to encode CPUID[0x8000001D].EAX[bits 25:14]

2023-09-25 Thread Zhao Liu
On Fri, Sep 22, 2023 at 02:27:58PM -0500, Moger, Babu wrote:
> Date: Fri, 22 Sep 2023 14:27:58 -0500
> From: "Moger, Babu" 
> Subject: Re: [PATCH v4 20/21] i386: Use CPUCacheInfo.share_level to encode
>  CPUID[0x8000001D].EAX[bits 25:14]
> 
> 
> On 9/14/2023 2:21 AM, Zhao Liu wrote:
> > From: Zhao Liu 
> > 
> > CPUID[0x8000001D].EAX[bits 25:14] NumSharingCache: number of logical
> > processors sharing cache.
> > 
> > The number of logical processors sharing this cache is
> > NumSharingCache + 1.
> > 
> > After cache models have topology information, we can use
> > CPUCacheInfo.share_level to decide which topology level to be encoded
> > into CPUID[0x8000001D].EAX[bits 25:14].
> > 
> > Signed-off-by: Zhao Liu 
> 
> Reviewed-by: Babu Moger 

Thanks Babu!

-Zhao

> 
> 
> > ---
> > Changes since v3:
> >   * Explain what "CPUID[0x8000001D].EAX[bits 25:14]" means in the commit
> > message. (Babu)
> > 
> > Changes since v1:
> >   * Use cache->share_level as the parameter in
> > max_processor_ids_for_cache().
> > ---
> >   target/i386/cpu.c | 10 +-
> >   1 file changed, 1 insertion(+), 9 deletions(-)
> > 
> > diff --git a/target/i386/cpu.c b/target/i386/cpu.c
> > index bc28c59df089..3bed823dc3b7 100644
> > --- a/target/i386/cpu.c
> > +++ b/target/i386/cpu.c
> > @@ -482,20 +482,12 @@ static void encode_cache_cpuid8000001d(CPUCacheInfo 
> > *cache,
> >  uint32_t *eax, uint32_t *ebx,
> >  uint32_t *ecx, uint32_t *edx)
> >   {
> > -uint32_t num_sharing_cache;
> >   assert(cache->size == cache->line_size * cache->associativity *
> > cache->partitions * cache->sets);
> >   *eax = CACHE_TYPE(cache->type) | CACHE_LEVEL(cache->level) |
> >  (cache->self_init ? CACHE_SELF_INIT_LEVEL : 0);
> > -
> > -/* L3 is shared among multiple cores */
> > -if (cache->level == 3) {
> > -num_sharing_cache = 1 << apicid_die_offset(topo_info);
> > -} else {
> > -num_sharing_cache = 1 << apicid_core_offset(topo_info);
> > -}
> > -*eax |= (num_sharing_cache - 1) << 14;
> > +*eax |= max_processor_ids_for_cache(topo_info, cache->share_level) << 
> > 14;
> >   assert(cache->line_size > 0);
> >   assert(cache->partitions > 0);



[PATCH v3 1/4] vfio/pci: detect the support of dynamic MSI-X allocation

2023-09-25 Thread Jing Liu
The kernel advertises dynamic MSI-X allocation support for a passthrough
device by clearing the VFIO_IRQ_INFO_NORESIZE flag, which guides user
space.

Fetch the flags from the host to determine whether dynamic MSI-X
allocation is supported.

Originally-by: Reinette Chatre 
Signed-off-by: Jing Liu 
Reviewed-by: Cédric Le Goater 
Reviewed-by: Alex Williamson 
---
Changes since v2:
- Apply Alex's Reviewed-by.

Changes since v1:
- Free msix when failed to get MSI-X irq info. (Cédric)
- Apply Cédric's Reviewed-by.

Changes since RFC v1:
- Filter the dynamic MSI-X allocation flag and store as a bool type.
  (Alex)
- Move the detection to vfio_msix_early_setup(). (Alex)
- Report error of getting irq info and remove the trace of failure
  case. (Alex, Cédric)
---
 hw/vfio/pci.c| 16 ++--
 hw/vfio/pci.h|  1 +
 hw/vfio/trace-events |  2 +-
 3 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 3b2ca3c24ca2..a94eef50e41e 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -1493,7 +1493,9 @@ static void vfio_msix_early_setup(VFIOPCIDevice *vdev, 
Error **errp)
 uint8_t pos;
 uint16_t ctrl;
 uint32_t table, pba;
-int fd = vdev->vbasedev.fd;
+int ret, fd = vdev->vbasedev.fd;
+struct vfio_irq_info irq_info = { .argsz = sizeof(irq_info),
+  .index = VFIO_PCI_MSIX_IRQ_INDEX };
 VFIOMSIXInfo *msix;
 
 pos = pci_find_capability(&vdev->pdev, PCI_CAP_ID_MSIX);
@@ -1530,6 +1532,15 @@ static void vfio_msix_early_setup(VFIOPCIDevice *vdev, 
Error **errp)
 msix->pba_offset = pba & ~PCI_MSIX_FLAGS_BIRMASK;
 msix->entries = (ctrl & PCI_MSIX_FLAGS_QSIZE) + 1;
 
+ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_IRQ_INFO, &irq_info);
+if (ret < 0) {
+error_setg_errno(errp, -ret, "failed to get MSI-X irq info");
+g_free(msix);
+return;
+}
+
+msix->noresize = !!(irq_info.flags & VFIO_IRQ_INFO_NORESIZE);
+
 /*
  * Test the size of the pba_offset variable and catch if it extends outside
  * of the specified BAR. If it is the case, we need to apply a hardware
@@ -1562,7 +1573,8 @@ static void vfio_msix_early_setup(VFIOPCIDevice *vdev, 
Error **errp)
 }
 
 trace_vfio_msix_early_setup(vdev->vbasedev.name, pos, msix->table_bar,
-msix->table_offset, msix->entries);
+msix->table_offset, msix->entries,
+msix->noresize);
 vdev->msix = msix;
 
 vfio_pci_fixup_msix_region(vdev);
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index 2d836093a83d..0d89eb761ece 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -113,6 +113,7 @@ typedef struct VFIOMSIXInfo {
 uint32_t table_offset;
 uint32_t pba_offset;
 unsigned long *pending;
+bool noresize;
 } VFIOMSIXInfo;
 
 #define TYPE_VFIO_PCI "vfio-pci"
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index e64ca4a01961..0ba3c5a0e26b 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -27,7 +27,7 @@ vfio_vga_read(uint64_t addr, int size, uint64_t data) " 
(0x%"PRIx64", %d) = 0x%"
 vfio_pci_read_config(const char *name, int addr, int len, int val) " (%s, 
@0x%x, len=0x%x) 0x%x"
 vfio_pci_write_config(const char *name, int addr, int val, int len) " (%s, 
@0x%x, 0x%x, len=0x%x)"
 vfio_msi_setup(const char *name, int pos) "%s PCI MSI CAP @0x%x"
-vfio_msix_early_setup(const char *name, int pos, int table_bar, int offset, 
int entries) "%s PCI MSI-X CAP @0x%x, BAR %d, offset 0x%x, entries %d"
+vfio_msix_early_setup(const char *name, int pos, int table_bar, int offset, 
int entries, bool noresize) "%s PCI MSI-X CAP @0x%x, BAR %d, offset 0x%x, 
entries %d, noresize %d"
 vfio_check_pcie_flr(const char *name) "%s Supports FLR via PCIe cap"
 vfio_check_pm_reset(const char *name) "%s Supports PM reset"
 vfio_check_af_flr(const char *name) "%s Supports FLR via AF cap"
-- 
2.27.0




[PATCH v3 2/4] vfio/pci: enable vector on dynamic MSI-X allocation

2023-09-25 Thread Jing Liu
The vector_use callback is used to enable a vector that is unmasked in
the guest. The kernel used to support only static MSI-X allocation: when
allocating a new interrupt on such kernels, QEMU first disables all
previously allocated vectors and then re-allocates all of them, including
the new one. The nr_vectors field of VFIOPCIDevice indicates that all
vectors from 0 to nr_vectors are allocated (and may be enabled), and is
used to loop over all possibly used vectors when, e.g., disabling MSI-X
interrupts.

Extend the vector_use function to support dynamic MSI-X allocation when
the host supports the capability. QEMU can then individually allocate
and enable a new interrupt without affecting the others or causing
interrupts to be lost during runtime.

Utilize nr_vectors to calculate the upper bound of enabled vectors in
dynamic MSI-X allocation mode, since looping over all msix_entries_nr
vectors would be inefficient and unnecessary.

Signed-off-by: Jing Liu 
Tested-by: Reinette Chatre 
Reviewed-by: Alex Williamson 
---
Changes since v2:
- Use a bool type to test (vdev->nr_vectors < nr + 1). (Alex)
- Revise the comments. (Alex)
- Apply Alex's Reviewed-by.

Changes since v1:
- Revise Qemu to QEMU.

Changes since RFC v1:
- Test vdev->msix->noresize to identify the allocation mode. (Alex)
- Move defer_kvm_irq_routing test out and update nr_vectors in a
  common place before vfio_enable_vectors(). (Alex)
- Revise the comments. (Alex)
---
 hw/vfio/pci.c | 46 --
 1 file changed, 28 insertions(+), 18 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index a94eef50e41e..27a65302ea69 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -470,6 +470,7 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, 
unsigned int nr,
 VFIOPCIDevice *vdev = VFIO_PCI(pdev);
 VFIOMSIVector *vector;
 int ret;
+bool resizing = !!(vdev->nr_vectors < nr + 1);
 
 trace_vfio_msix_vector_do_use(vdev->vbasedev.name, nr);
 
@@ -512,33 +513,42 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, 
unsigned int nr,
 }
 
 /*
- * We don't want to have the host allocate all possible MSI vectors
- * for a device if they're not in use, so we shutdown and incrementally
- * increase them as needed.
+ * When dynamic allocation is not supported, we don't want to have the
+ * host allocate all possible MSI vectors for a device if they're not
+ * in use, so we shutdown and incrementally increase them as needed.
+ * nr_vectors represents the total number of vectors allocated.
+ *
+ * When dynamic allocation is supported, let the host only allocate
+ * and enable a vector when it is in use in guest. nr_vectors represents
+ * the upper bound of vectors being enabled (but not all of the ranges
+ * is allocated or enabled).
  */
-if (vdev->nr_vectors < nr + 1) {
+if (resizing) {
 vdev->nr_vectors = nr + 1;
-if (!vdev->defer_kvm_irq_routing) {
+}
+
+if (!vdev->defer_kvm_irq_routing) {
+if (vdev->msix->noresize && resizing) {
 vfio_disable_irqindex(&vdev->vbasedev, VFIO_PCI_MSIX_IRQ_INDEX);
 ret = vfio_enable_vectors(vdev, true);
 if (ret) {
 error_report("vfio: failed to enable vectors, %d", ret);
 }
-}
-} else {
-Error *err = NULL;
-int32_t fd;
-
-if (vector->virq >= 0) {
-fd = event_notifier_get_fd(&vector->kvm_interrupt);
 } else {
-fd = event_notifier_get_fd(&vector->interrupt);
-}
+Error *err = NULL;
+int32_t fd;
 
-if (vfio_set_irq_signaling(&vdev->vbasedev,
- VFIO_PCI_MSIX_IRQ_INDEX, nr,
- VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
-error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
+if (vector->virq >= 0) {
+fd = event_notifier_get_fd(&vector->kvm_interrupt);
+} else {
+fd = event_notifier_get_fd(&vector->interrupt);
+}
+
+if (vfio_set_irq_signaling(&vdev->vbasedev,
+   VFIO_PCI_MSIX_IRQ_INDEX, nr,
+   VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
+error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
+}
 }
 }
 
-- 
2.27.0




[PATCH v3 3/4] vfio/pci: use an invalid fd to enable MSI-X

2023-09-25 Thread Jing Liu
Guests typically enable MSI-X with all of the vectors masked in the MSI-X
vector table. To match the guest's view of the device, QEMU enables MSI-X
by enabling vector 0 with userspace triggering and then immediately
releasing it. However, the release function does not actually release the
vector, because it is already in userspace mode.

There is no need to enable triggering on the host; relying on the mask bit
is enough to avoid spurious interrupts. Using an invalid fd (i.e. fd = -1)
suffices to get MSI-X enabled.

Once dynamic MSI-X allocation is supported, interrupt restoring also needs
to enable MSI-X this way, so create a function for it.

Suggested-by: Alex Williamson 
Signed-off-by: Jing Liu 
Reviewed-by: Cédric Le Goater 
Reviewed-by: Alex Williamson 
---
Changes since v2:
- Apply Cédric's Reviewed-by.
- Apply Alex's Reviewed-by.

Changes since v1:
- Revise Qemu to QEMU. (Cédric)
- Use g_autofree to automatically release. (Cédric)
- Just return 'ret' and let the caller of vfio_enable_msix_no_vec()
  report the error. (Cédric)

Changes since RFC v1:
- A new patch. Use an invalid fd to get MSI-X enabled instead of using
  userspace triggering. (Alex)
---
 hw/vfio/pci.c | 44 
 1 file changed, 36 insertions(+), 8 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 27a65302ea69..bf676a49ae77 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -369,6 +369,33 @@ static void vfio_msi_interrupt(void *opaque)
 notify(&vdev->pdev, nr);
 }
 
+/*
+ * Get MSI-X enabled, but no vector enabled, by setting vector 0 with an 
invalid
+ * fd to kernel.
+ */
+static int vfio_enable_msix_no_vec(VFIOPCIDevice *vdev)
+{
+g_autofree struct vfio_irq_set *irq_set = NULL;
+int ret = 0, argsz;
+int32_t *fd;
+
+argsz = sizeof(*irq_set) + sizeof(*fd);
+
+irq_set = g_malloc0(argsz);
+irq_set->argsz = argsz;
+irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD |
+ VFIO_IRQ_SET_ACTION_TRIGGER;
+irq_set->index = VFIO_PCI_MSIX_IRQ_INDEX;
+irq_set->start = 0;
+irq_set->count = 1;
+fd = (int32_t *)&irq_set->data;
+*fd = -1;
+
+ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
+
+return ret;
+}
+
 static int vfio_enable_vectors(VFIOPCIDevice *vdev, bool msix)
 {
 struct vfio_irq_set *irq_set;
@@ -618,6 +645,8 @@ static void vfio_commit_kvm_msi_virq_batch(VFIOPCIDevice 
*vdev)
 
 static void vfio_msix_enable(VFIOPCIDevice *vdev)
 {
+int ret;
+
 vfio_disable_interrupts(vdev);
 
 vdev->msi_vectors = g_new0(VFIOMSIVector, vdev->msix->entries);
@@ -640,8 +669,6 @@ static void vfio_msix_enable(VFIOPCIDevice *vdev)
 vfio_commit_kvm_msi_virq_batch(vdev);
 
 if (vdev->nr_vectors) {
-int ret;
-
 ret = vfio_enable_vectors(vdev, true);
 if (ret) {
 error_report("vfio: failed to enable vectors, %d", ret);
@@ -655,13 +682,14 @@ static void vfio_msix_enable(VFIOPCIDevice *vdev)
  * MSI-X capability, but leaves the vector table masked.  We therefore
  * can't rely on a vector_use callback (from request_irq() in the 
guest)
  * to switch the physical device into MSI-X mode because that may come 
a
- * long time after pci_enable_msix().  This code enables vector 0 with
- * triggering to userspace, then immediately release the vector, 
leaving
- * the physical device with no vectors enabled, but MSI-X enabled, just
- * like the guest view.
+ * long time after pci_enable_msix().  This code sets vector 0 with an
+ * invalid fd to make the physical device MSI-X enabled, but with no
+ * vectors enabled, just like the guest view.
  */
-vfio_msix_vector_do_use(&vdev->pdev, 0, NULL, NULL);
-vfio_msix_vector_release(&vdev->pdev, 0);
+ret = vfio_enable_msix_no_vec(vdev);
+if (ret) {
+error_report("vfio: failed to enable MSI-X, %d", ret);
+}
 }
 
 trace_vfio_msix_enable(vdev->vbasedev.name);
-- 
2.27.0




[PATCH v3 0/4] Support dynamic MSI-X allocation

2023-09-25 Thread Jing Liu
Changes since v2:
- v2: https://www.mail-archive.com/qemu-devel@nongnu.org/msg989852.html
- Use a bool type to test (vdev->nr_vectors < nr + 1). (Alex)
- Revise comments. (Alex)
- Apply Cédric's and Alex's Reviewed-by.

Changes since v1:
- v1: https://www.mail-archive.com/qemu-devel@nongnu.org/msg982842.html
- Revise Qemu to QEMU. (Cédric)
- Add g_free when failure of getting MSI-X irq info. (Cédric)
- Apply Cédric's Reviewed-by. (Cédric)
- Use g_autofree to automatically release. (Cédric)
- Remove the failure message in vfio_enable_msix_no_vec(). (Cédric)

Changes since RFC v1:
- RFC v1: https://www.mail-archive.com/qemu-devel@nongnu.org/msg978637.html
- Revise the comments. (Alex)
- Report error of getting irq info and remove the trace of failure
  case. (Alex, Cédric)
- Only store dynamic allocation flag as a bool type and test
  accordingly. (Alex)
- Move dynamic allocation detection to vfio_msix_early_setup(). (Alex)
- Change the condition logic in vfio_msix_vector_do_use() that moving
  the defer_kvm_irq_routing test out and create a common place to update
  nr_vectors. (Alex)
- Consolidate the way of MSI-X enabling during device initialization and
  interrupt restoring that uses fd = -1 trick. Create a function doing
  that. (Alex)

Before kernel v6.5, dynamic allocation of MSI-X interrupts was not
supported. When allocating a new interrupt, QEMU therefore had to first
release all previously allocated interrupts (including disabling MSI-X)
and re-allocate all interrupts, including the new one.

The kernel series [1] adds the support of dynamic MSI-X allocation to
vfio-pci and uses the existing flag VFIO_IRQ_INFO_NORESIZE to guide user
space, that when dynamic MSI-X is supported the flag is cleared.

This series implements the new behavior for VFIO PCI devices when dynamic
MSI-X allocation is supported. When the guest unmasks an interrupt, QEMU
can directly allocate an interrupt on the host for it, without touching
the previously allocated ones. The host therefore only allocates
interrupts for the vectors that are unmasked (enabled) inside the guest
when the device supports dynamic MSI-X allocation.

When a guest enables MSI-X with all of the vectors masked, QEMU needs to
match that state by enabling MSI-X with no vector enabled. During
migration restore, QEMU likewise needs to enable MSI-X first in dynamic
allocation mode, to avoid vectors unused by the guest being allocated on
the host. To consolidate both cases, we use vector 0 with an invalid fd
to get MSI-X enabled and create a common function for this. This is
cleaner than setting up userspace triggering and immediately releasing it.

Any feedback is appreciated.

Jing

[1] https://lwn.net/Articles/931679/

Jing Liu (4):
  vfio/pci: detect the support of dynamic MSI-X allocation
  vfio/pci: enable vector on dynamic MSI-X allocation
  vfio/pci: use an invalid fd to enable MSI-X
  vfio/pci: enable MSI-X in interrupt restoring on dynamic allocation

 hw/vfio/pci.c| 123 +--
 hw/vfio/pci.h|   1 +
 hw/vfio/trace-events |   2 +-
 3 files changed, 97 insertions(+), 29 deletions(-)

-- 
2.27.0




[PATCH v3 4/4] vfio/pci: enable MSI-X in interrupt restoring on dynamic allocation

2023-09-25 Thread Jing Liu
During migration restore, vfio_enable_vectors() is called to re-enable
MSI-X interrupts for assigned devices. It passes the range from 0 to
nr_vectors to the kernel to enable MSI-X and the vectors unmasked in the
guest. During MSI-X enabling, all the vectors within the range are
allocated according to the VFIO_DEVICE_SET_IRQS ioctl.

When dynamic MSI-X allocation is supported, we only want the vectors
unmasked by the guest to be allocated and enabled. Use vector 0 with an
invalid fd to get MSI-X enabled; after that, vectors can be allocated as
needed.

Signed-off-by: Jing Liu 
Reviewed-by: Cédric Le Goater 
Reviewed-by: Alex Williamson 
---
Changes since v2:
- Apply Cédric's Reviewed-by.
- Apply Alex's Reviewed-by.

Changes since v1:
- No change.

Changes since RFC v1:
- Revise the comments. (Alex)
- Call the new helper function in previous patch to enable MSI-X. (Alex)
---
 hw/vfio/pci.c | 17 +
 1 file changed, 17 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index bf676a49ae77..8a082af39e77 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -402,6 +402,23 @@ static int vfio_enable_vectors(VFIOPCIDevice *vdev, bool 
msix)
 int ret = 0, i, argsz;
 int32_t *fds;
 
+/*
+ * If dynamic MSI-X allocation is supported, the vectors to be allocated
+ * and enabled can be scattered. Before kernel enabling MSI-X, setting
+ * nr_vectors causes all these vectors to be allocated on host.
+ *
+ * To keep allocation as needed, use vector 0 with an invalid fd to get
+ * MSI-X enabled first, then set vectors with a potentially sparse set of
+ * eventfds to enable interrupts only when enabled in guest.
+ */
+if (msix && !vdev->msix->noresize) {
+ret = vfio_enable_msix_no_vec(vdev);
+
+if (ret) {
+return ret;
+}
+}
+
 argsz = sizeof(*irq_set) + (vdev->nr_vectors * sizeof(*fds));
 
 irq_set = g_malloc0(argsz);
-- 
2.27.0




Re: [PATCH 38/52] migration/rdma: Convert qemu_rdma_write_flush() to Error

2023-09-25 Thread Zhijian Li (Fujitsu)


On 18/09/2023 22:41, Markus Armbruster wrote:
> Functions that use an Error **errp parameter to return errors should
> not also report them to the user, because reporting is the caller's
> job.  When the caller does, the error is reported twice.  When it
> doesn't (because it recovered from the error), there is no error to
> report, i.e. the report is bogus.
> 
> qio_channel_rdma_writev() violates this principle: it calls
> error_report() via qemu_rdma_write_flush().  I elected not to
> investigate how callers handle the error, i.e. precise impact is not
> known.
> 
> Clean this up by converting qemu_rdma_write_flush() to Error.
> 
> Necessitates setting an error when qemu_rdma_write_one() failed.
> Since this error will go away later in this series, simply use "FIXME
> temporary error message" there.
> 
> Signed-off-by: Markus Armbruster

Reviewed-by: Li Zhijian 

Re: [PATCH 37/52] migration/rdma: Convert qemu_rdma_reg_whole_ram_blocks() to Error

2023-09-25 Thread Zhijian Li (Fujitsu)


On 18/09/2023 22:41, Markus Armbruster wrote:
> Functions that use an Error **errp parameter to return errors should
> not also report them to the user, because reporting is the caller's
> job.  When the caller does, the error is reported twice.  When it
> doesn't (because it recovered from the error), there is no error to
> report, i.e. the report is bogus.
> 
> qemu_rdma_exchange_send() violates this principle: it calls
> error_report() via callback qemu_rdma_reg_whole_ram_blocks().  I
> elected not to investigate how callers handle the error, i.e. precise
> impact is not known.
> 
> Clean this up by converting the callback to Error.
> 
> Signed-off-by: Markus Armbruster

Reviewed-by: Li Zhijian 


Re: [PATCH 36/52] migration/rdma: Convert qemu_rdma_exchange_get_response() to Error

2023-09-25 Thread Zhijian Li (Fujitsu)


On 18/09/2023 22:41, Markus Armbruster wrote:
> Functions that use an Error **errp parameter to return errors should
> not also report them to the user, because reporting is the caller's
> job.  When the caller does, the error is reported twice.  When it
> doesn't (because it recovered from the error), there is no error to
> report, i.e. the report is bogus.
> 
> qemu_rdma_exchange_send() and qemu_rdma_exchange_recv() violate this
> principle: they call error_report() via
> qemu_rdma_exchange_get_response().  I elected not to investigate how
> callers handle the error, i.e. precise impact is not known.
> 
> Clean this up by converting qemu_rdma_exchange_get_response() to
> Error.
> 
> Signed-off-by: Markus Armbruster

Reviewed-by: Li Zhijian 

Re: [PATCH 35/52] migration/rdma: Convert qemu_rdma_exchange_send() to Error

2023-09-25 Thread Zhijian Li (Fujitsu)


On 18/09/2023 22:41, Markus Armbruster wrote:
> Functions that use an Error **errp parameter to return errors should
> not also report them to the user, because reporting is the caller's
> job.  When the caller does, the error is reported twice.  When it
> doesn't (because it recovered from the error), there is no error to
> report, i.e. the report is bogus.
> 
> qio_channel_rdma_writev() violates this principle: it calls
> error_report() via qemu_rdma_exchange_send().  I elected not to
> investigate how callers handle the error, i.e. precise impact is not
> known.
> 
> Clean this up by converting qemu_rdma_exchange_send() to Error.
> 
> Necessitates setting an error when qemu_rdma_post_recv_control(),
> callback(), or qemu_rdma_exchange_get_response() failed.  Since these
> errors will go away later in this series, simply use "FIXME temporary
> error message" there.
> 
> Signed-off-by: Markus Armbruster

Reviewed-by: Li Zhijian 

Re: [PATCH v4 00/19] riscv: split TCG/KVM accelerators from cpu.c

2023-09-25 Thread Alistair Francis
On Tue, Sep 26, 2023 at 3:58 AM Daniel Henrique Barboza
 wrote:
>
> Hi,
>
> This new version has a simple copyright change in the tcg-cpu.c file,
> patch 01, suggested by Alistair in v3. No other changes made.
>
> All patches acked/reviewed.
>
> Changes from v3:
> - patch 1:
>   - use cpu.c copyright notice in tcg-cpu.c
> - v3 link: 
> https://lore.kernel.org/qemu-riscv/20230920112020.651006-1-dbarb...@ventanamicro.com/
>
> Daniel Henrique Barboza (19):
>   target/riscv: introduce TCG AccelCPUClass
>   target/riscv: move riscv_cpu_realize_tcg() to TCG::cpu_realizefn()
>   target/riscv: move riscv_cpu_validate_set_extensions() to tcg-cpu.c
>   target/riscv: move riscv_tcg_ops to tcg-cpu.c
>   target/riscv/cpu.c: add .instance_post_init()
>   target/riscv: move 'host' CPU declaration to kvm.c
>   target/riscv/cpu.c: mark extensions arrays as 'const'
>   target/riscv: move riscv_cpu_add_kvm_properties() to kvm.c
>   target/riscv: make riscv_add_satp_mode_properties() public
>   target/riscv: remove kvm-stub.c
>   target/riscv: introduce KVM AccelCPUClass
>   target/riscv: move KVM only files to kvm subdir
>   target/riscv/kvm: do not use riscv_cpu_add_misa_properties()
>   target/riscv/cpu.c: export set_misa()
>   target/riscv/tcg: introduce tcg_cpu_instance_init()
>   target/riscv/cpu.c: make misa_ext_cfgs[] 'const'
>   target/riscv/tcg: move riscv_cpu_add_misa_properties() to tcg-cpu.c
>   target/riscv/cpu.c: export isa_edata_arr[]
>   target/riscv/cpu: move priv spec functions to tcg-cpu.c

Thanks!

Applied to riscv-to-apply.next

Alistair

>
>  hw/intc/riscv_aplic.c |   2 +-
>  hw/riscv/virt.c   |   2 +-
>  target/riscv/cpu.c| 988 ++
>  target/riscv/cpu.h|  30 +-
>  target/riscv/csr.c|   1 +
>  target/riscv/kvm-stub.c   |  30 -
>  target/riscv/{kvm.c => kvm/kvm-cpu.c} | 120 +++-
>  target/riscv/{ => kvm}/kvm_riscv.h|   4 -
>  target/riscv/kvm/meson.build  |   1 +
>  target/riscv/meson.build  |   4 +-
>  target/riscv/tcg/meson.build  |   2 +
>  target/riscv/tcg/tcg-cpu.c| 883 +++
>  target/riscv/tcg/tcg-cpu.h|  27 +
>  13 files changed, 1113 insertions(+), 981 deletions(-)
>  delete mode 100644 target/riscv/kvm-stub.c
>  rename target/riscv/{kvm.c => kvm/kvm-cpu.c} (91%)
>  rename target/riscv/{ => kvm}/kvm_riscv.h (89%)
>  create mode 100644 target/riscv/kvm/meson.build
>  create mode 100644 target/riscv/tcg/meson.build
>  create mode 100644 target/riscv/tcg/tcg-cpu.c
>  create mode 100644 target/riscv/tcg/tcg-cpu.h
>
> --
> 2.41.0
>
>



Re: [Qemu-devel] [PATCH 08/19] aspeed/timer: Fix behaviour running Linux

2023-09-25 Thread Andrew Jeffery



On Mon, 25 Sep 2023, at 18:50, Cédric Le Goater wrote:
> On 9/25/23 09:54, Andrew Jeffery wrote:
>> 
>> 
>> On Fri, 22 Sep 2023, at 22:51, Cédric Le Goater wrote:
>>> Joel, Andrew,
>>>
>>> On 5/25/19 17:12, Cédric Le Goater wrote:
 From: Joel Stanley 

 The Linux kernel driver was updated in commit 4451d3f59f2a
 ("clocksource/drivers/fttmr010: Fix set_next_event handler) to fix an
 issue observed on hardware:

> RELOAD register is loaded into COUNT register when the aspeed timer
> is enabled, which means the next event may be delayed because timer
> interrupt won't be generated until <0xFFFFFFFF - current_count +
> cycles>.

 When running under Qemu, the system appeared "laggy". The guest is now
 scheduling timer events too regularly, starving the host of CPU time.

 This patch modifies the timer model to attempt to schedule the timer
 expiry as the guest requests, but if we have missed the deadline we
 fire the interrupt and try again, which allows the guest to catch up.

 Provides expected behaviour with old and new guest code.

 Fixes: c04bd47db6b9 ("hw/timer: Add ASPEED timer device model")
 Signed-off-by: Joel Stanley 
 [clg: - merged a fix from Andrew Jeffery 
   "Fire interrupt on failure to meet deadline"
   
 https://lists.ozlabs.org/pipermail/openbmc/2019-January/014641.html
 - adapted commit log
 - checkpatch fixes ]
 Signed-off-by: Cédric Le Goater 
 ---
hw/timer/aspeed_timer.c | 59 ++---
1 file changed, 31 insertions(+), 28 deletions(-)

 diff --git a/hw/timer/aspeed_timer.c b/hw/timer/aspeed_timer.c
 index 5c786e512815..9ffd8e09f670 100644
 --- a/hw/timer/aspeed_timer.c
 +++ b/hw/timer/aspeed_timer.c
 @@ -109,37 +109,40 @@ static inline uint64_t calculate_time(struct 
 AspeedTimer *t, uint32_t ticks)

static uint64_t calculate_next(struct AspeedTimer *t)
{
 -uint64_t next = 0;
 -uint32_t rate = calculate_rate(t);
 +uint64_t now = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL);
 +uint64_t next;

 -while (!next) {
 -/* We don't know the relationship between the values in the match
 - * registers, so sort using MAX/MIN/zero. We sort in that order 
 as the
 - * timer counts down to zero. */
 -uint64_t seq[] = {
 -calculate_time(t, MAX(t->match[0], t->match[1])),
 -calculate_time(t, MIN(t->match[0], t->match[1])),
 -calculate_time(t, 0),
 -};
 -uint64_t reload_ns;
 -uint64_t now = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL);
 -
 -if (now < seq[0]) {
 -next = seq[0];
 -} else if (now < seq[1]) {
 -next = seq[1];
 -} else if (now < seq[2]) {
 -next = seq[2];
 -} else if (t->reload) {
 -reload_ns = muldiv64(t->reload, NANOSECONDS_PER_SECOND, rate);
 -t->start = now - ((now - t->start) % reload_ns);
 -} else {
 -/* no reload value, return 0 */
 -break;
 -}
 +/*
 + * We don't know the relationship between the values in the match
 + * registers, so sort using MAX/MIN/zero. We sort in that order as
 + * the timer counts down to zero.
 + */
 +
 +next = calculate_time(t, MAX(t->match[0], t->match[1]));
 +if (now < next) {
 +return next;
 +}
 +
 +next = calculate_time(t, MIN(t->match[0], t->match[1]));
 +if (now < next) {
 +return next;
 +}
 +
 +next = calculate_time(t, 0);
 +if (now < next) {
 +return next;
 +}
 +
 +/* We've missed all deadlines, fire interrupt and try again */
 +timer_del(&t->timer);
 +
 +if (timer_overflow_interrupt(t)) {
 +t->level = !t->level;
 +qemu_set_irq(t->irq, t->level);
}

 -return next;
 +t->start = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL);
 +return calculate_time(t, MAX(MAX(t->match[0], t->match[1]), 0));
>>>
>>> This MAX(MAX(x, y), 0) looks strange to me. Would you remember where it 
>>> comes
>>> from ? Thanks,
>> 
>> The inner MAX() deals with the lack of ordering constraints between the 
>> match values. I think the outer MAX() is redundant. We should probably 
>> remove it. The match member is type uint32_t so it can't be negative. You 
>> did steal that from an RFC patch :D
>
> I did ! Fixed there :
>
>
> https://patchwork.ozlabs.org/project/qemu-devel/patch/20230922155924.1172019-5-...@kaod.org/
>

Thanks. That one might be further down in my review queue 

Andrew
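
For anyone skimming the thread: the t->match[] entries are uint32_t, so the
outer MAX() against 0 can never change the result. Dropping it, as agreed
above, leaves the final line as (sketch of the fixed-up patch, not the
verbatim commit):

    /* match[] is unsigned, so comparing against 0 was a no-op; the
     * inner MAX() alone picks the later of the two match values. */
    return calculate_time(t, MAX(t->match[0], t->match[1]));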



Re: [PATCH v4 02/19] target/riscv: move riscv_cpu_realize_tcg() to TCG::cpu_realizefn()

2023-09-25 Thread Alistair Francis
On Tue, Sep 26, 2023 at 5:09 AM Daniel Henrique Barboza
 wrote:
>
> riscv_cpu_realize_tcg() was added to allow TCG cpus to have a different
> realize() path during the common riscv_cpu_realize(), making it a good
> choice to start moving TCG exclusive code to tcg-cpu.c.
>
> Rename it to tcg_cpu_realizefn() and assign it as a implementation of
> accel::cpu_realizefn(). tcg_cpu_realizefn() will then be called during
> riscv_cpu_realize() via cpu_exec_realizefn(). We'll use a similar
> approach with KVM in the near future.
>
> riscv_cpu_validate_set_extensions() is too big and with too many
> dependencies to be moved in this same patch. We'll do that next.
>
> Signed-off-by: Daniel Henrique Barboza 
> Reviewed-by: Andrew Jones 
> Reviewed-by: LIU Zhiwei 

Reviewed-by: Alistair Francis 

Alistair

> ---
>  target/riscv/cpu.c | 128 ---
>  target/riscv/tcg/tcg-cpu.c | 133 +
>  2 files changed, 133 insertions(+), 128 deletions(-)
>
> diff --git a/target/riscv/cpu.c b/target/riscv/cpu.c
> index e72c49c881..030629294f 100644
> --- a/target/riscv/cpu.c
> +++ b/target/riscv/cpu.c
> @@ -23,9 +23,7 @@
>  #include "qemu/log.h"
>  #include "cpu.h"
>  #include "cpu_vendorid.h"
> -#include "pmu.h"
>  #include "internals.h"
> -#include "time_helper.h"
>  #include "exec/exec-all.h"
>  #include "qapi/error.h"
>  #include "qapi/visitor.h"
> @@ -1064,29 +1062,6 @@ static void riscv_cpu_validate_v(CPURISCVState *env, 
> RISCVCPUConfig *cfg,
>  }
>  }
>
> -static void riscv_cpu_validate_priv_spec(RISCVCPU *cpu, Error **errp)
> -{
> -CPURISCVState *env = >env;
> -int priv_version = -1;
> -
> -if (cpu->cfg.priv_spec) {
> -if (!g_strcmp0(cpu->cfg.priv_spec, "v1.12.0")) {
> -priv_version = PRIV_VERSION_1_12_0;
> -} else if (!g_strcmp0(cpu->cfg.priv_spec, "v1.11.0")) {
> -priv_version = PRIV_VERSION_1_11_0;
> -} else if (!g_strcmp0(cpu->cfg.priv_spec, "v1.10.0")) {
> -priv_version = PRIV_VERSION_1_10_0;
> -} else {
> -error_setg(errp,
> -   "Unsupported privilege spec version '%s'",
> -   cpu->cfg.priv_spec);
> -return;
> -}
> -
> -env->priv_ver = priv_version;
> -}
> -}
> -
>  static void riscv_cpu_disable_priv_spec_isa_exts(RISCVCPU *cpu)
>  {
>  CPURISCVState *env = >env;
> @@ -1111,33 +1086,6 @@ static void 
> riscv_cpu_disable_priv_spec_isa_exts(RISCVCPU *cpu)
>  }
>  }
>
> -static void riscv_cpu_validate_misa_mxl(RISCVCPU *cpu, Error **errp)
> -{
> -RISCVCPUClass *mcc = RISCV_CPU_GET_CLASS(cpu);
> -CPUClass *cc = CPU_CLASS(mcc);
> -CPURISCVState *env = >env;
> -
> -/* Validate that MISA_MXL is set properly. */
> -switch (env->misa_mxl_max) {
> -#ifdef TARGET_RISCV64
> -case MXL_RV64:
> -case MXL_RV128:
> -cc->gdb_core_xml_file = "riscv-64bit-cpu.xml";
> -break;
> -#endif
> -case MXL_RV32:
> -cc->gdb_core_xml_file = "riscv-32bit-cpu.xml";
> -break;
> -default:
> -g_assert_not_reached();
> -}
> -
> -if (env->misa_mxl_max != env->misa_mxl) {
> -error_setg(errp, "misa_mxl_max must be equal to misa_mxl");
> -return;
> -}
> -}
> -
>  /*
>   * Check consistency between chosen extensions while setting
>   * cpu->cfg accordingly.
> @@ -1511,74 +1459,6 @@ static void riscv_cpu_finalize_features(RISCVCPU *cpu, 
> Error **errp)
>  #endif
>  }
>
> -static void riscv_cpu_validate_misa_priv(CPURISCVState *env, Error **errp)
> -{
> -if (riscv_has_ext(env, RVH) && env->priv_ver < PRIV_VERSION_1_12_0) {
> -error_setg(errp, "H extension requires priv spec 1.12.0");
> -return;
> -}
> -}
> -
> -static void riscv_cpu_realize_tcg(DeviceState *dev, Error **errp)
> -{
> -RISCVCPU *cpu = RISCV_CPU(dev);
> -CPURISCVState *env = >env;
> -Error *local_err = NULL;
> -
> -if (object_dynamic_cast(OBJECT(dev), TYPE_RISCV_CPU_HOST)) {
> -error_setg(errp, "'host' CPU is not compatible with TCG 
> acceleration");
> -return;
> -}
> -
> -riscv_cpu_validate_misa_mxl(cpu, _err);
> -if (local_err != NULL) {
> -error_propagate(errp, local_err);
> -return;
> -}
> -
> -riscv_cpu_validate_priv_spec(cpu, _err);
> -if (local_err != NULL) {
> -error_propagate(errp, local_err);
> -return;
> -}
> -
> -riscv_cpu_validate_misa_priv(env, _err);
> -if (local_err != NULL) {
> -error_propagate(errp, local_err);
> -return;
> -}
> -
> -if (cpu->cfg.epmp && !cpu->cfg.pmp) {
> -/*
> - * Enhanced PMP should only be available
> - * on harts with PMP support
> - */
> -error_setg(errp, "Invalid configuration: EPMP requires PMP support");
> -return;
> -}
> -
> -riscv_cpu_validate_set_extensions(cpu, _err);
> -if (local_err != NULL) {
> -  
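
For readers without the full diff (the archive truncates it here): per the
commit message above, the counterpart of these removals lands in
target/riscv/tcg/tcg-cpu.c, where the hook is attached to the accel CPU
class so that cpu_exec_realizefn() invokes it. A minimal sketch of that
wiring, assumed from the description rather than copied from the patch:

    /* target/riscv/tcg/tcg-cpu.c (sketch only) */
    static bool tcg_cpu_realizefn(CPUState *cs, Error **errp)
    {
        /* the validation moved out of riscv_cpu_realize() goes here */
        return true;
    }

    static void tcg_cpu_accel_class_init(ObjectClass *oc, void *data)
    {
        AccelCPUClass *acc = ACCEL_CPU_CLASS(oc);

        acc->cpu_realizefn = tcg_cpu_realizefn;
    }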

Re: [PATCH 34/52] migration/rdma: Convert qemu_rdma_exchange_recv() to Error

2023-09-25 Thread Zhijian Li (Fujitsu)


On 18/09/2023 22:41, Markus Armbruster wrote:
> Functions that use an Error **errp parameter to return errors should
> not also report them to the user, because reporting is the caller's
> job.  When the caller does, the error is reported twice.  When it
> doesn't (because it recovered from the error), there is no error to
> report, i.e. the report is bogus.
> 
> qio_channel_rdma_readv() violates this principle: it calls
> error_report() via qemu_rdma_exchange_recv().  I elected not to
> investigate how callers handle the error, i.e. precise impact is not
> known.
> 
> Clean this up by converting qemu_rdma_exchange_recv() to Error.
> 
> Necessitates setting an error when qemu_rdma_exchange_get_response()
> failed.  Since this error will go away later in this series, simply
> use "FIXME temporary error message" there.
> 
> Signed-off-by: Markus Armbruster 


Reviewed-by: Li Zhijian 


> ---
>   migration/rdma.c | 15 +--
>   1 file changed, 9 insertions(+), 6 deletions(-)
> 
> diff --git a/migration/rdma.c b/migration/rdma.c
> index c88cd1f468..50546b3a27 100644
> --- a/migration/rdma.c
> +++ b/migration/rdma.c
> @@ -1957,7 +1957,7 @@ static int qemu_rdma_exchange_send(RDMAContext *rdma, 
> RDMAControlHeader *head,
>* control-channel message.
>*/
>   static int qemu_rdma_exchange_recv(RDMAContext *rdma, RDMAControlHeader 
> *head,
> -   uint32_t expecting)
> +   uint32_t expecting, Error **errp)
>   {
>   RDMAControlHeader ready = {
>   .len = 0,
> @@ -1972,7 +1972,7 @@ static int qemu_rdma_exchange_recv(RDMAContext *rdma, 
> RDMAControlHeader *head,
>   ret = qemu_rdma_post_send_control(rdma, NULL, );
>   
>   if (ret < 0) {
> -error_report("Failed to send control buffer!");
> +error_setg(errp, "Failed to send control buffer!");
>   return -1;
>   }
>   
> @@ -1983,6 +1983,7 @@ static int qemu_rdma_exchange_recv(RDMAContext *rdma, 
> RDMAControlHeader *head,
> expecting, RDMA_WRID_READY);
>   
>   if (ret < 0) {
> +error_setg(errp, "FIXME temporary error message");
>   return -1;
>   }
>   
> @@ -1993,7 +1994,7 @@ static int qemu_rdma_exchange_recv(RDMAContext *rdma, 
> RDMAControlHeader *head,
>*/
>   ret = qemu_rdma_post_recv_control(rdma, RDMA_WRID_READY);
>   if (ret < 0) {
> -error_report("rdma migration: error posting second control recv!");
> +error_setg(errp, "rdma migration: error posting second control 
> recv!");
>   return -1;
>   }
>   
> @@ -2908,11 +2909,11 @@ static ssize_t qio_channel_rdma_readv(QIOChannel *ioc,
>   /* We've got nothing at all, so lets wait for
>* more to arrive
>*/
> -ret = qemu_rdma_exchange_recv(rdma, , RDMA_CONTROL_QEMU_FILE);
> +ret = qemu_rdma_exchange_recv(rdma, , RDMA_CONTROL_QEMU_FILE,
> +  errp);
>   
>   if (ret < 0) {
>   rdma->errored = true;
> -error_setg(errp, "qemu_rdma_exchange_recv failed");
>   return -1;
>   }
>   
> @@ -3536,6 +3537,7 @@ static int qemu_rdma_registration_handle(QEMUFile *f)
>   RDMAControlHeader blocks = { .type = RDMA_CONTROL_RAM_BLOCKS_RESULT,
>.repeat = 1 };
>   QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(qemu_file_get_ioc(f));
> +Error *err = NULL;
>   RDMAContext *rdma;
>   RDMALocalBlocks *local;
>   RDMAControlHeader head;
> @@ -3565,9 +3567,10 @@ static int qemu_rdma_registration_handle(QEMUFile *f)
>   do {
>   trace_qemu_rdma_registration_handle_wait();
>   
> -ret = qemu_rdma_exchange_recv(rdma, , RDMA_CONTROL_NONE);
> +ret = qemu_rdma_exchange_recv(rdma, , RDMA_CONTROL_NONE, );
>   
>   if (ret < 0) {
> +error_report_err(err);
>   break;
>   }
>   
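
For reference, the mechanical pattern this series applies everywhere is the
standard qapi/error.h convention — the callee sets the error, the caller
decides whether to report it. A generic sketch (not code from the patch):

    static int do_thing(Error **errp)
    {
        if (!try_thing()) {              /* placeholder condition */
            error_setg(errp, "thing failed");
            return -1;
        }
        return 0;
    }

    /* caller */
    Error *err = NULL;
    if (do_thing(&err) < 0) {
        error_report_err(err);           /* reports and frees err */
    }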

Re: [PATCH 05/12] device_tree: qmp_dumpdtb(): stronger assertion

2023-09-25 Thread Alistair Francis
On Tue, Sep 26, 2023 at 6:42 AM Vladimir Sementsov-Ogievskiy
 wrote:
>
> Coverity mark this size, got from the buffer as untrasted value, it's

s/untrasted/untrusted/g

> not good to use it as length when writing to file. Make the assertion
> more strict to also check upper bound.
>
> Signed-off-by: Vladimir Sementsov-Ogievskiy 

Reviewed-by: Alistair Francis 

Alistair

> ---
>  softmmu/device_tree.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/softmmu/device_tree.c b/softmmu/device_tree.c
> index 30aa3aea9f..adc4236e21 100644
> --- a/softmmu/device_tree.c
> +++ b/softmmu/device_tree.c
> @@ -660,7 +660,7 @@ void qmp_dumpdtb(const char *filename, Error **errp)
>
>  size = fdt_totalsize(current_machine->fdt);
>
> -g_assert(size > 0);
> +g_assert(size > 0 && size <= FDT_MAX_SIZE);
>
>  if (!g_file_set_contents(filename, current_machine->fdt, size, )) {
>  error_setg(errp, "Error saving FDT to file %s: %s",
> --
> 2.34.1
>
>



Re: [PATCH RFC V2 00/37] Support of Virtual CPU Hotplug for ARMv8 Arch

2023-09-25 Thread Gavin Shan

Hi Salil,

On 9/26/23 06:21, Salil Mehta wrote:

From: Russell King 
Sent: Monday, September 25, 2023 9:13 PM
To: Salil Mehta 
Cc: qemu-devel@nongnu.org; qemu-...@nongnu.org; m...@kernel.org;
james.mo...@arm.com; jean-phili...@linaro.org; Jonathan Cameron
; lorenzo.pieral...@linaro.com;
lpieral...@kernel.org; peter.mayd...@linaro.org;
richard.hender...@linaro.org; imamm...@redhat.com; andrew.jo...@linux.dev;
da...@redhat.com; phi...@linaro.org; eric.au...@redhat.com;
w...@kernel.org; catalin.mari...@arm.com; a...@kernel.org;
justin...@arm.com; oliver.up...@linux.dev; pbonz...@redhat.com;
m...@redhat.com; gs...@redhat.com; raf...@kernel.org;
borntrae...@linux.ibm.com; alex.ben...@linaro.org;
dar...@os.amperecomputing.com; il...@os.amperecomputing.com;
vis...@os.amperecomputing.com; karl.heub...@oracle.com;
miguel.l...@oracle.com; sudeep.ho...@arm.com; salil.me...@opnsrc.net;
zhukeqian ; wangxiongfeng (C)
; wangyanan (Y) ;
jiakern...@gmail.com; maob...@loongson.cn; lixiang...@loongson.cn
Subject: Re: [PATCH RFC V2 00/37] Support of Virtual CPU Hotplug for ARMv8
Arch

On Mon, Sep 25, 2023 at 08:03:56PM +0000, Salil Mehta wrote:

Looks like some problem with Huawei's mail server. No patches
except the cover letter are reaching the qemu-devel mailing-list.


I haven't seen any of the actual patches - just the cover letters.
Was that intentional?


No. All the patches are getting held up either by the server or
some other problem. This has happened for both instances of the
patch-set I had pushed to the mailing list within 2 hours.

I am not sure how to sort it out without the help of IT. China is
asleep now.

Any suggestions welcome to debug this. Or should I wait till tomorrow?



Thanks for your efforts to continue working on the feature. Hope the mail
server issue can be fixed early so that patches can be posted. I don't
see the attached patches either. However, the code is available for early
access in your private repository, as clarified in the cover letter :)

  https://github.com/salil-mehta/qemu.git virt-cpuhp-armv8/rfc-v2

Thanks,
Gavin




Re: [PATCH] virtio: use shadow_avail_idx while checking number of heads

2023-09-25 Thread Michael S. Tsirkin
On Tue, Sep 26, 2023 at 12:13:11AM +0200, Ilya Maximets wrote:
> On 9/25/23 23:24, Michael S. Tsirkin wrote:
> > On Mon, Sep 25, 2023 at 10:58:05PM +0200, Ilya Maximets wrote:
> >> On 9/25/23 17:38, Stefan Hajnoczi wrote:
> >>> On Mon, 25 Sept 2023 at 11:36, Ilya Maximets  wrote:
> 
>  On 9/25/23 17:12, Stefan Hajnoczi wrote:
> > On Mon, 25 Sept 2023 at 11:02, Ilya Maximets  wrote:
> >>
> >> On 9/25/23 16:23, Stefan Hajnoczi wrote:
> >>> On Fri, 25 Aug 2023 at 13:04, Ilya Maximets  
> >>> wrote:
> 
>  We do not need the most up to date number of heads, we only want to
>  know if there is at least one.
> 
>  Use shadow variable as long as it is not equal to the last available
>  index checked.  This avoids expensive qatomic dereference of the
>  RCU-protected memory region cache as well as the memory access itself
>  and the subsequent memory barrier.
> 
>  The change improves performance of the af-xdp network backend by 
>  2-3%.
> 
>  Signed-off-by: Ilya Maximets 
>  ---
>   hw/virtio/virtio.c | 10 +-
>   1 file changed, 9 insertions(+), 1 deletion(-)
> 
>  diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
>  index 309038fd46..04bf7cc977 100644
>  --- a/hw/virtio/virtio.c
>  +++ b/hw/virtio/virtio.c
>  @@ -999,7 +999,15 @@ void virtqueue_push(VirtQueue *vq, const 
>  VirtQueueElement *elem,
>   /* Called within rcu_read_lock().  */
>   static int virtqueue_num_heads(VirtQueue *vq, unsigned int idx)
>   {
>  -uint16_t num_heads = vring_avail_idx(vq) - idx;
>  +uint16_t num_heads;
>  +
>  +if (vq->shadow_avail_idx != idx) {
>  +num_heads = vq->shadow_avail_idx - idx;
>  +
>  +return num_heads;
> >>>
> >>> This still needs to check num_heads > vq->vring.num and return -EINVAL
> >>> as is done below.
> >>
> >> Hmm, yeas, you're right.  If the value was incorrect initially, the 
> >> shadow
> >> will be incorrect.  However, I think we should just not return here in 
> >> this
> >> case and let vring_avail_idx() to grab an actual new value below.  
> >> Otherwise
> >> we may never break out of this error.
> >>
> >> Does that make sense?
> >
> > No, because virtio_error() marks the device as broken. The device
> > requires a reset in order to function again. Fetching
> > vring_avail_idx() again won't help.
> 
>  OK, I see.  In this case we're talking about situation where
>  vring_avail_idx() was called in some other place and stored a bad value
>  in the shadow variable, then virtqueue_num_heads() got called.  Right?
> >>
> >> Hmm, I suppose we also need a read barrier after all even if we use
> >> a shadow index.  Assuming the index is correct, but the shadow variable
> >> was updated by a call outside of this function, then we may miss a
> >> barrier and read the descriptor out of order, in theory.  Read barrier
> >> is going to be a compiler barrier on x86, so the performance gain from
> >> this patch should still be mostly there.  I'll test that.
> > 
> > I can't say I understand generally. shadow is under qemu control,
> > I don't think it can be updated concurrently by multiple CPUs.
> 
> It can't, I agree.  Scenario I'm thinking about is the following:
> 
> 1. vring_avail_idx() is called from one of the places other than
>virtqueue_num_heads().  Shadow is updated with the current value.
>Some users of vring_avail_idx() do not use barriers after the call.
> 
> 2. virtqueue_split_get_avail_bytes() is called.
> 
> 3. virtqueue_split_get_avail_bytes() calls virtqueue_num_heads().
> 
> 4. virtqueue_num_heads() checks the shadow and returns early.
> 
> 5. virtqueue_split_get_avail_bytes() calls vring_split_desc_read() and
>reads the descriptor.
> 
> If between steps 1 and 5 we do not have a read barrier, we potentially
> risk reading descriptor data that is not yet fully written, because
> there is no guarantee that reading the last_avail_idx on step 1 wasn't
> reordered with the descriptor read.
> 
> In current code we always have smp_rmb() in virtqueue_num_heads().
> But if we return from this function without a barrier, we may have an
> issue, IIUC.
> 
> I agree that it's kind of a very unlikely scenario and we will probably
> have a control dependency between steps 1 and 5 that will prevent the
> issue, but it might be safer to just have an explicit barrier in
> virtqueue_num_heads().
> 
> Does that make sense?  Or am I missing something else here?

Aha, got it. Good point, yes. Pls document in a code comment.


> > 
> > 
> 
>  AFAIU, we can still just fall through here and let vring_avail_idx()
>  to read the index again and fail the existing check.  That would happen
> 

Re: [PATCH] virtio: use shadow_avail_idx while checking number of heads

2023-09-25 Thread Ilya Maximets
On 9/25/23 23:24, Michael S. Tsirkin wrote:
> On Mon, Sep 25, 2023 at 10:58:05PM +0200, Ilya Maximets wrote:
>> On 9/25/23 17:38, Stefan Hajnoczi wrote:
>>> On Mon, 25 Sept 2023 at 11:36, Ilya Maximets  wrote:

 On 9/25/23 17:12, Stefan Hajnoczi wrote:
> On Mon, 25 Sept 2023 at 11:02, Ilya Maximets  wrote:
>>
>> On 9/25/23 16:23, Stefan Hajnoczi wrote:
>>> On Fri, 25 Aug 2023 at 13:04, Ilya Maximets  wrote:

 We do not need the most up to date number of heads, we only want to
 know if there is at least one.

 Use shadow variable as long as it is not equal to the last available
 index checked.  This avoids expensive qatomic dereference of the
 RCU-protected memory region cache as well as the memory access itself
 and the subsequent memory barrier.

 The change improves performance of the af-xdp network backend by 2-3%.

 Signed-off-by: Ilya Maximets 
 ---
  hw/virtio/virtio.c | 10 +-
  1 file changed, 9 insertions(+), 1 deletion(-)

 diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
 index 309038fd46..04bf7cc977 100644
 --- a/hw/virtio/virtio.c
 +++ b/hw/virtio/virtio.c
 @@ -999,7 +999,15 @@ void virtqueue_push(VirtQueue *vq, const 
 VirtQueueElement *elem,
  /* Called within rcu_read_lock().  */
  static int virtqueue_num_heads(VirtQueue *vq, unsigned int idx)
  {
 -uint16_t num_heads = vring_avail_idx(vq) - idx;
 +uint16_t num_heads;
 +
 +if (vq->shadow_avail_idx != idx) {
 +num_heads = vq->shadow_avail_idx - idx;
 +
 +return num_heads;
>>>
>>> This still needs to check num_heads > vq->vring.num and return -EINVAL
>>> as is done below.
>>
>> Hmm, yeas, you're right.  If the value was incorrect initially, the 
>> shadow
>> will be incorrect.  However, I think we should just not return here in 
>> this
>> case and let vring_avail_idx() to grab an actual new value below.  
>> Otherwise
>> we may never break out of this error.
>>
>> Does that make sense?
>
> No, because virtio_error() marks the device as broken. The device
> requires a reset in order to function again. Fetching
> vring_avail_idx() again won't help.

 OK, I see.  In this case we're talking about situation where
 vring_avail_idx() was called in some other place and stored a bad value
 in the shadow variable, then virtqueue_num_heads() got called.  Right?
>>
>> Hmm, I suppose we also need a read barrier after all even if we use
>> a shadow index.  Assuming the index is correct, but the shadow variable
>> was updated by a call outside of this function, then we may miss a
>> barrier and read the descriptor out of order, in theory.  Read barrier
>> is going to be a compiler barrier on x86, so the performance gain from
>> this patch should still be mostly there.  I'll test that.
> 
> I can't say I understand generally. shadow is under qemu control,
> I don't think it can be updated concurrently by multiple CPUs.

It can't, I agree.  Scenario I'm thinking about is the following:

1. vring_avail_idx() is called from one of the places other than
   virtqueue_num_heads().  Shadow is updated with the current value.
   Some users of vring_avail_idx() do not use barriers after the call.

2. virtqueue_split_get_avail_bytes() is called.

3. virtqueue_split_get_avail_bytes() calls virtqueue_num_heads().

4. virtqueue_num_heads() checks the shadow and returns early.

5. virtqueue_split_get_avail_bytes() calls vring_split_desc_read() and
   reads the descriptor.

If between steps 1 and 5 we do not have a read barrier, we potentially
risk reading descriptor data that is not yet fully written, because
there is no guarantee that reading the last_avail_idx on step 1 wasn't
reordered with the descriptor read.

In current code we always have smp_rmb() in virtqueue_num_heads().
But if we return from this function without a barrier, we may have an
issue, IIUC.

I agree that it's kind of a very unlikely scenario and we will probably
have a control dependency between steps 1 and 5 that will prevent the
issue, but it might be safer to just have an explicit barrier in
virtqueue_num_heads().

Does that make sense?  Or am I missing something else here?

> 
> 

 AFAIU, we can still just fall through here and let vring_avail_idx()
 to read the index again and fail the existing check.  That would happen
 today without this patch applied.
>>>
>>> Yes, that is fine.
>>>

 I'm jut trying to avoid duplication of the virtio_error call, i.e.:

 if (vq->shadow_avail_idx != idx) {
 num_heads = vq->shadow_avail_idx - idx;

 /* Check it isn't doing very strange things with descriptor 
 numbers. */
 if (num_heads 

Re: [PATCH v3 0/6] hw/virtio: Build vhost-vdpa.o once for all targets

2023-09-25 Thread Michael S. Tsirkin
On Mon, Sep 18, 2023 at 12:42:02PM +0200, Philippe Mathieu-Daudé wrote:
> Hi Michael,
> 
> On 6/9/23 08:31, Philippe Mathieu-Daudé wrote:
> > On 30/8/23 15:35, Philippe Mathieu-Daudé wrote:
> > > 
> > > This series is now fully reviewed.
> > > 
> > > On 10/7/23 11:49, Philippe Mathieu-Daudé wrote:
> > > > Missing review: patch #4
> > > > 
> > > > Since v2:
> > > > - Added R-b tags
> > > > - Addressed Richard's review comment: page_mask = -page_size
> > > > 
> > > > Philippe Mathieu-Daudé (6):
> > > >    hw/virtio: Propagate page_mask to
> > > >  vhost_vdpa_listener_skipped_section()
> > > >    hw/virtio: Propagate page_mask to vhost_vdpa_section_end()
> > > >    hw/virtio/vhost-vdpa: Inline TARGET_PAGE_ALIGN() macro
> > > >    hw/virtio/vhost-vdpa: Use target-agnostic qemu_target_page_mask()
> > > >    hw/virtio: Build vhost-vdpa.o once
> > > >    hw/virtio/meson: Rename softmmu_virtio_ss[] -> system_virtio_ss[]
> > 
> > Michael, I have another series unifying virtio endianness blocked
> > by this one. I can merge it if you provide your Ack-by.
> 
> Unless you object, I'll merge this series. Since we are early enough
> in the development window, if something breaks, we'll catch it in
> time and fix or revert.
> 
> Regards,
> 
> Phil.

It's actually in my tree, and should go out in a couple of days finally.

-- 
MST




Re: [PATCH v1] rtc/mc146818rtc: improve rtc performance

2023-09-25 Thread Michael S. Tsirkin
On Mon, Jul 24, 2023 at 04:54:22PM +0800, Evanzhang wrote:
> under heavy workloads, irq_coalesced could go up to 30+;
> after modification, EXTERNAL_INTERRUPT exits are reduced by 40%
> 
> before:
> Analyze events for all VMs, all VCPUs:
>  VM-EXIT              Samples  Samples%  Time%
> 
>  EPT_VIOLATION        9833984    63.41%  15.96%
>  IO_INSTRUCTION       2160843    13.93%  50.07%
>  EXTERNAL_INTERRUPT   1949267    12.57%   0.89%
>  APIC_WRITE            767795     4.95%   0.55%
>  EOI_INDUCED           615308     3.97%   0.30%
>  HLT                   130821     0.84%  31.77%
> 
> after:
> Analyze events for all VMs, all VCPUs:
>  VM-EXIT              Samples  Samples%  Time%
> 
>  EPT_VIOLATION        5238031    50.91%   6.44%
>  IO_INSTRUCTION       2257658    21.94%  54.88%
>  EXTERNAL_INTERRUPT   1160086    11.28%   0.61%
>  APIC_WRITE            780454     7.59%   0.54%
>  EOI_INDUCED           615309     5.98%   0.28%
>  HLT                   179703     1.75%  36.87%
> 
> Signed-off-by: Evanzhang 



Paolo do you mind picking this old patch up?

Reviewed-by: Michael S. Tsirkin 

> ---
>  hw/rtc/mc146818rtc.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/rtc/mc146818rtc.c b/hw/rtc/mc146818rtc.c
> index c27c362..2995078 100644
> --- a/hw/rtc/mc146818rtc.c
> +++ b/hw/rtc/mc146818rtc.c
> @@ -96,8 +96,8 @@ static void rtc_coalesced_timer_update(MC146818RtcState *s)
>  if (s->irq_coalesced == 0) {
>  timer_del(s->coalesced_timer);
>  } else {
> -/* divide each RTC interval to 2 - 8 smaller intervals */
> -int c = MIN(s->irq_coalesced, 7) + 1;
> +/* divide each RTC interval to 2 - 32 smaller intervals */
> +int c = MIN(s->irq_coalesced, 31) + 1;
>  int64_t next_clock = qemu_clock_get_ns(rtc_clock) +
>  periodic_clock_to_ns(s->period / c);
>  timer_mod(s->coalesced_timer, next_clock);
> -- 
> 2.9.5
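
A quick worked example of what the larger divisor buys (numbers are
illustrative and treat the period as nanoseconds for simplicity; they are
not taken from the measurements above):

    /*
     * Suppose the guest programs a 1 kHz periodic RTC (1,000,000 ns period)
     * and 30 interrupts have been coalesced (irq_coalesced == 30):
     *
     *   before: c = MIN(30, 7) + 1  == 8   -> catch-up timer every 125,000 ns
     *   after:  c = MIN(30, 31) + 1 == 31  -> catch-up timer every ~32,258 ns
     *
     * Draining the backlog roughly 4x faster leaves fewer periods in which
     * coalesced interrupts keep piling up, which is where the
     * EXTERNAL_INTERRUPT exit reduction shows up.
     */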




Re: [PATCH] virtio: use shadow_avail_idx while checking number of heads

2023-09-25 Thread Michael S. Tsirkin
On Mon, Sep 25, 2023 at 10:58:05PM +0200, Ilya Maximets wrote:
> On 9/25/23 17:38, Stefan Hajnoczi wrote:
> > On Mon, 25 Sept 2023 at 11:36, Ilya Maximets  wrote:
> >>
> >> On 9/25/23 17:12, Stefan Hajnoczi wrote:
> >>> On Mon, 25 Sept 2023 at 11:02, Ilya Maximets  wrote:
> 
>  On 9/25/23 16:23, Stefan Hajnoczi wrote:
> > On Fri, 25 Aug 2023 at 13:04, Ilya Maximets  wrote:
> >>
> >> We do not need the most up to date number of heads, we only want to
> >> know if there is at least one.
> >>
> >> Use shadow variable as long as it is not equal to the last available
> >> index checked.  This avoids expensive qatomic dereference of the
> >> RCU-protected memory region cache as well as the memory access itself
> >> and the subsequent memory barrier.
> >>
> >> The change improves performance of the af-xdp network backend by 2-3%.
> >>
> >> Signed-off-by: Ilya Maximets 
> >> ---
> >>  hw/virtio/virtio.c | 10 +-
> >>  1 file changed, 9 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> >> index 309038fd46..04bf7cc977 100644
> >> --- a/hw/virtio/virtio.c
> >> +++ b/hw/virtio/virtio.c
> >> @@ -999,7 +999,15 @@ void virtqueue_push(VirtQueue *vq, const 
> >> VirtQueueElement *elem,
> >>  /* Called within rcu_read_lock().  */
> >>  static int virtqueue_num_heads(VirtQueue *vq, unsigned int idx)
> >>  {
> >> -uint16_t num_heads = vring_avail_idx(vq) - idx;
> >> +uint16_t num_heads;
> >> +
> >> +if (vq->shadow_avail_idx != idx) {
> >> +num_heads = vq->shadow_avail_idx - idx;
> >> +
> >> +return num_heads;
> >
> > This still needs to check num_heads > vq->vring.num and return -EINVAL
> > as is done below.
> 
>  Hmm, yeas, you're right.  If the value was incorrect initially, the 
>  shadow
>  will be incorrect.  However, I think we should just not return here in 
>  this
>  case and let vring_avail_idx() to grab an actual new value below.  
>  Otherwise
>  we may never break out of this error.
> 
>  Does that make sense?
> >>>
> >>> No, because virtio_error() marks the device as broken. The device
> >>> requires a reset in order to function again. Fetching
> >>> vring_avail_idx() again won't help.
> >>
> >> OK, I see.  In this case we're talking about situation where
> >> vring_avail_idx() was called in some other place and stored a bad value
> >> in the shadow variable, then virtqueue_num_heads() got called.  Right?
> 
> Hmm, I suppose we also need a read barrier after all even if we use
> a shadow index.  Assuming the index is correct, but the shadow variable
> was updated by a call outside of this function, then we may miss a
> barrier and read the descriptor out of order, in theory.  Read barrier
> is going to be a compiler barrier on x86, so the performance gain from
> this patch should still be mostly there.  I'll test that.

I can't say I understand generally. shadow is under qemu control,
I don't think it can be updated concurrently by multiple CPUs.


> >>
> >> AFAIU, we can still just fall through here and let vring_avail_idx()
> >> to read the index again and fail the existing check.  That would happen
> >> today without this patch applied.
> > 
> > Yes, that is fine.
> > 
> >>
> >> I'm jut trying to avoid duplication of the virtio_error call, i.e.:
> >>
> >> if (vq->shadow_avail_idx != idx) {
> >> num_heads = vq->shadow_avail_idx - idx;
> >>
> >> /* Check it isn't doing very strange things with descriptor 
> >> numbers. */
> >> if (num_heads > vq->vring.num) {
> >> virtio_error(vq->vdev, "Guest moved used index from %u to %u",
> >>  idx, vq->shadow_avail_idx);
> >> return -EINVAL;
> >> }
> >> return num_heads;
> >> }
> >>
> >> vs
> >>
> >> if (vq->shadow_avail_idx != idx) {
> >> num_heads = vq->shadow_avail_idx - idx;
> >>
> >> /* Only use the shadow value if it was good initially. */
> >> if (num_heads <= vq->vring.num) {
> >> return num_heads;
> >> }
> >> }
> >>
> >>
> >> What do you think?
> > 
> > Sounds good.
> > 
> >>
> >> Best regards, Ilya Maximets.




Re: [PATCH] virtio: use shadow_avail_idx while checking number of heads

2023-09-25 Thread Ilya Maximets
On 9/25/23 17:38, Stefan Hajnoczi wrote:
> On Mon, 25 Sept 2023 at 11:36, Ilya Maximets  wrote:
>>
>> On 9/25/23 17:12, Stefan Hajnoczi wrote:
>>> On Mon, 25 Sept 2023 at 11:02, Ilya Maximets  wrote:

 On 9/25/23 16:23, Stefan Hajnoczi wrote:
> On Fri, 25 Aug 2023 at 13:04, Ilya Maximets  wrote:
>>
>> We do not need the most up to date number of heads, we only want to
>> know if there is at least one.
>>
>> Use shadow variable as long as it is not equal to the last available
>> index checked.  This avoids expensive qatomic dereference of the
>> RCU-protected memory region cache as well as the memory access itself
>> and the subsequent memory barrier.
>>
>> The change improves performance of the af-xdp network backend by 2-3%.
>>
>> Signed-off-by: Ilya Maximets 
>> ---
>>  hw/virtio/virtio.c | 10 +-
>>  1 file changed, 9 insertions(+), 1 deletion(-)
>>
>> diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
>> index 309038fd46..04bf7cc977 100644
>> --- a/hw/virtio/virtio.c
>> +++ b/hw/virtio/virtio.c
>> @@ -999,7 +999,15 @@ void virtqueue_push(VirtQueue *vq, const 
>> VirtQueueElement *elem,
>>  /* Called within rcu_read_lock().  */
>>  static int virtqueue_num_heads(VirtQueue *vq, unsigned int idx)
>>  {
>> -uint16_t num_heads = vring_avail_idx(vq) - idx;
>> +uint16_t num_heads;
>> +
>> +if (vq->shadow_avail_idx != idx) {
>> +num_heads = vq->shadow_avail_idx - idx;
>> +
>> +return num_heads;
>
> This still needs to check num_heads > vq->vring.num and return -EINVAL
> as is done below.

 Hmm, yeas, you're right.  If the value was incorrect initially, the shadow
 will be incorrect.  However, I think we should just not return here in this
 case and let vring_avail_idx() to grab an actual new value below.  
 Otherwise
 we may never break out of this error.

 Does that make sense?
>>>
>>> No, because virtio_error() marks the device as broken. The device
>>> requires a reset in order to function again. Fetching
>>> vring_avail_idx() again won't help.
>>
>> OK, I see.  In this case we're talking about situation where
>> vring_avail_idx() was called in some other place and stored a bad value
>> in the shadow variable, then virtqueue_num_heads() got called.  Right?

Hmm, I suppose we also need a read barrier after all even if we use
a shadow index.  Assuming the index is correct, but the shadow variable
was updated by a call outside of this function, then we may miss a
barrier and read the descriptor out of order, in theory.  Read barrier
is going to be a compiler barrier on x86, so the performance gain from
this patch should still be mostly there.  I'll test that.

>>
>> AFAIU, we can still just fall through here and let vring_avail_idx()
>> to read the index again and fail the existing check.  That would happen
>> today without this patch applied.
> 
> Yes, that is fine.
> 
>>
>> I'm jut trying to avoid duplication of the virtio_error call, i.e.:
>>
>> if (vq->shadow_avail_idx != idx) {
>> num_heads = vq->shadow_avail_idx - idx;
>>
>> /* Check it isn't doing very strange things with descriptor numbers. 
>> */
>> if (num_heads > vq->vring.num) {
>> virtio_error(vq->vdev, "Guest moved used index from %u to %u",
>>  idx, vq->shadow_avail_idx);
>> return -EINVAL;
>> }
>> return num_heads;
>> }
>>
>> vs
>>
>> if (vq->shadow_avail_idx != idx) {
>> num_heads = vq->shadow_avail_idx - idx;
>>
>> /* Only use the shadow value if it was good initially. */
>> if (num_heads <= vq->vring.num) {
>> return num_heads;
>> }
>> }
>>
>>
>> What do you think?
> 
> Sounds good.
> 
>>
>> Best regards, Ilya Maximets.
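
Putting the thread together, the shape being converged on might look like
the following — a sketch under the assumptions discussed above (shadow
reuse only when the value was initially sane, plus an explicit read
barrier), not the merged patch:

    /* Called within rcu_read_lock().  */
    static int virtqueue_num_heads(VirtQueue *vq, unsigned int idx)
    {
        uint16_t num_heads;

        if (vq->shadow_avail_idx != idx) {
            num_heads = vq->shadow_avail_idx - idx;

            /* Only use the shadow value if it was good initially. */
            if (num_heads <= vq->vring.num) {
                /*
                 * Keep the barrier so a later descriptor read cannot be
                 * reordered before the avail-idx load that produced the
                 * shadow value (the scenario described above).
                 */
                smp_rmb();
                return num_heads;
            }
        }

        num_heads = vring_avail_idx(vq) - idx;

        /* Check it isn't doing very strange things with descriptor numbers. */
        if (num_heads > vq->vring.num) {
            virtio_error(vq->vdev, "Guest moved used index from %u to %u",
                         idx, vq->shadow_avail_idx);
            return -EINVAL;
        }
        /* On success, callers read a descriptor after the barrier. */
        if (num_heads) {
            smp_rmb();
        }
        return num_heads;
    }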




Re: [PATCH v3 0/5] vhost-user: Back-end state migration

2023-09-25 Thread Stefan Hajnoczi
On Fri, Sep 15, 2023 at 12:25:25PM +0200, Hanna Czenczek wrote:
> RFC:
> https://lists.nongnu.org/archive/html/qemu-devel/2023-03/msg04263.html
> 
> v1:
> https://lists.nongnu.org/archive/html/qemu-devel/2023-04/msg01575.html
> 
> v2:
> https://lists.nongnu.org/archive/html/qemu-devel/2023-07/msg02604.html
> 
> Hi,
> 
> I’ve decided not to work on vhost-user SUSPEND/RESUME for now – it is
> not technically required for virtio-fs migration, which is the actual
> priority for me now.  While we do want to have SUSPEND/RESUME at some
> point, the only practically existing reason for it is to be able to
> implement vhost-level resetting in virtiofsd, but that is not related to
> migration.

QEMU sends VHOST_USER_SET_STATUS 0 in vhost_dev_stop(). Are you assuming
that virtiofs back-ends do not reset the device upon receiving this
message?

> So one of the changes in v3 is that it no longer depends on the
> vhost-user SUSPEND/RESUME series, and describes the migration protocol
> without the device being suspended at any point, but merely that the
> vrings are stopped.
> 
> Other changes include:
> 
> - Patch 1:
>   - Rephrased a lot
>   - Added a description for the VHOST_USER_SET_DEVICE_STATE_FD
> parameters
>   - Renamed VHOST_USER_PROTOCOL_F_MIGRATORY_STATE to
> VHOST_USER_PROTOCOL_F_DEVICE_STATE
>   - enum variants changed in value due to dropping the SUSPEND/RESUME
> dependency
> 
> - Patch 2:
>   - Pulled in, was a stand-alone patch before
>   - Dropped a sentence about ring state before feature negotiations, as
> the rings are not to be used during that period anyway
>   - Bit of rephrasing
> 
> - Patch 3:
>   - Renamed “migratory state” to “device state”
>   - enum variants changed in value due to dropping the SUSPEND/RESUME
> dependency
> 
> - Patch 4:
>   - Changed `f` to @f (referencing parameter “f”) in comments
>   - Use g_autofree for the transfer buffer
>   - Note SUSPEND state as a future feature, not currently existing
>   - Wrap read() and write() in RETRY_ON_EINTR()
> 
> - Patch 5:
>   - Renamed “migratory state” to “device state”
>   - (kept R-b still)
> 
> 
> Hanna Czenczek (5):
>   vhost-user.rst: Migrating back-end-internal state
>   vhost-user.rst: Clarify enabling/disabling vrings
>   vhost-user: Interface for migration state transfer
>   vhost: Add high-level state save/load functions
>   vhost-user-fs: Implement internal migration
> 
>  docs/interop/vhost-user.rst   | 188 ++-
>  include/hw/virtio/vhost-backend.h |  24 +++
>  include/hw/virtio/vhost.h | 114 ++
>  hw/virtio/vhost-user-fs.c | 101 -
>  hw/virtio/vhost-user.c| 148 ++
>  hw/virtio/vhost.c | 241 ++
>  6 files changed, 810 insertions(+), 6 deletions(-)
> 
> -- 
> 2.41.0
> 




Re: [PATCH v3 5/5] vhost-user-fs: Implement internal migration

2023-09-25 Thread Stefan Hajnoczi
On Fri, Sep 15, 2023 at 12:25:30PM +0200, Hanna Czenczek wrote:
> A virtio-fs device's VM state consists of:
> - the virtio device (vring) state (VMSTATE_VIRTIO_DEVICE)
> - the back-end's (virtiofsd's) internal state
> 
> We get/set the latter via the new vhost operations to transfer migratory
> state.  It is its own dedicated subsection, so that for external
> migration, it can be disabled.
> 
> Reviewed-by: Stefan Hajnoczi 
> Signed-off-by: Hanna Czenczek 
> ---
>  hw/virtio/vhost-user-fs.c | 101 +-
>  1 file changed, 100 insertions(+), 1 deletion(-)

CCing Anton so he is aware that internal migration is being added.

> 
> diff --git a/hw/virtio/vhost-user-fs.c b/hw/virtio/vhost-user-fs.c
> index 49d699ffc2..eb91723855 100644
> --- a/hw/virtio/vhost-user-fs.c
> +++ b/hw/virtio/vhost-user-fs.c
> @@ -298,9 +298,108 @@ static struct vhost_dev *vuf_get_vhost(VirtIODevice 
> *vdev)
>  return >vhost_dev;
>  }
>  
> +/**
> + * Fetch the internal state from virtiofsd and save it to `f`.
> + */
> +static int vuf_save_state(QEMUFile *f, void *pv, size_t size,
> +  const VMStateField *field, JSONWriter *vmdesc)
> +{
> +VirtIODevice *vdev = pv;
> +VHostUserFS *fs = VHOST_USER_FS(vdev);
> +Error *local_error = NULL;
> +int ret;
> +
> +ret = vhost_save_backend_state(>vhost_dev, f, _error);
> +if (ret < 0) {
> +error_reportf_err(local_error,
> +  "Error saving back-end state of %s device %s "
> +  "(tag: \"%s\"): ",
> +  vdev->name, vdev->parent_obj.canonical_path,
> +  fs->conf.tag ?: "");
> +return ret;
> +}
> +
> +return 0;
> +}
> +
> +/**
> + * Load virtiofsd's internal state from `f` and send it over to virtiofsd.
> + */
> +static int vuf_load_state(QEMUFile *f, void *pv, size_t size,
> +  const VMStateField *field)
> +{
> +VirtIODevice *vdev = pv;
> +VHostUserFS *fs = VHOST_USER_FS(vdev);
> +Error *local_error = NULL;
> +int ret;
> +
> +ret = vhost_load_backend_state(>vhost_dev, f, _error);
> +if (ret < 0) {
> +error_reportf_err(local_error,
> +  "Error loading back-end state of %s device %s "
> +  "(tag: \"%s\"): ",
> +  vdev->name, vdev->parent_obj.canonical_path,
> +  fs->conf.tag ?: "");
> +return ret;
> +}
> +
> +return 0;
> +}
> +
> +static bool vuf_is_internal_migration(void *opaque)
> +{
> +/* TODO: Return false when an external migration is requested */
> +return true;
> +}
> +
> +static int vuf_check_migration_support(void *opaque)
> +{
> +VirtIODevice *vdev = opaque;
> +VHostUserFS *fs = VHOST_USER_FS(vdev);
> +
> +if (!vhost_supports_device_state(>vhost_dev)) {
> +error_report("Back-end of %s device %s (tag: \"%s\") does not 
> support "
> + "migration through qemu",
> + vdev->name, vdev->parent_obj.canonical_path,
> + fs->conf.tag ?: "");
> +return -ENOTSUP;
> +}
> +
> +return 0;
> +}
> +
> +static const VMStateDescription vuf_backend_vmstate;
> +
>  static const VMStateDescription vuf_vmstate = {
>  .name = "vhost-user-fs",
> -.unmigratable = 1,
> +.version_id = 0,
> +.fields = (VMStateField[]) {
> +VMSTATE_VIRTIO_DEVICE,
> +VMSTATE_END_OF_LIST()
> +},
> +.subsections = (const VMStateDescription * []) {
> +_backend_vmstate,
> +NULL,
> +}
> +};
> +
> +static const VMStateDescription vuf_backend_vmstate = {
> +.name = "vhost-user-fs-backend",
> +.version_id = 0,
> +.needed = vuf_is_internal_migration,
> +.pre_load = vuf_check_migration_support,
> +.pre_save = vuf_check_migration_support,
> +.fields = (VMStateField[]) {
> +{
> +.name = "back-end",
> +.info = &(const VMStateInfo) {
> +.name = "virtio-fs back-end state",
> +.get = vuf_load_state,
> +.put = vuf_save_state,
> +},
> +},
> +VMSTATE_END_OF_LIST()
> +},
>  };
>  
>  static Property vuf_properties[] = {
> -- 
> 2.41.0
> 




Re: [PATCH v3 4/5] vhost: Add high-level state save/load functions

2023-09-25 Thread Stefan Hajnoczi
On Fri, Sep 15, 2023 at 12:25:29PM +0200, Hanna Czenczek wrote:
> vhost_save_backend_state() and vhost_load_backend_state() can be used by
> vhost front-ends to easily save and load the back-end's state to/from
> the migration stream.
> 
> Because we do not know the full state size ahead of time,
> vhost_save_backend_state() simply reads the data in 1 MB chunks, and
> writes each chunk consecutively into the migration stream, prefixed by
> its length.  EOF is indicated by a 0-length chunk.
> 
> Signed-off-by: Hanna Czenczek 
> ---
>  include/hw/virtio/vhost.h |  35 +++
>  hw/virtio/vhost.c | 204 ++
>  2 files changed, 239 insertions(+)

Reviewed-by: Stefan Hajnoczi 
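
For reference, the chunked framing the commit message describes —
length-prefixed 1 MB chunks terminated by a zero length — could be sketched
like this, assuming the back-end's state arrives over a pipe-like fd (an
illustration using QEMUFile helpers, not the patch itself):

    #define VHOST_STATE_CHUNK_SIZE (1024 * 1024)

    /* Stream layout: { be32 len, len bytes } ... { be32 0 } marks EOF. */
    static int save_state_sketch(QEMUFile *f, int read_fd)
    {
        g_autofree uint8_t *buf = g_malloc(VHOST_STATE_CHUNK_SIZE);
        ssize_t len;

        while ((len = RETRY_ON_EINTR(read(read_fd, buf,
                                          VHOST_STATE_CHUNK_SIZE))) > 0) {
            qemu_put_be32(f, len);          /* chunk length prefix */
            qemu_put_buffer(f, buf, len);   /* chunk payload */
        }
        if (len < 0) {
            return -errno;
        }
        qemu_put_be32(f, 0);                /* 0-length chunk == EOF */
        return 0;
    }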




RE: [PATCH RFC V2 00/37] Support of Virtual CPU Hotplug for ARMv8 Arch

2023-09-25 Thread Salil Mehta via
Hi Russell,

> From: Russell King 
> Sent: Monday, September 25, 2023 9:13 PM
> To: Salil Mehta 
> Cc: qemu-devel@nongnu.org; qemu-...@nongnu.org; m...@kernel.org;
> james.mo...@arm.com; jean-phili...@linaro.org; Jonathan Cameron
> ; lorenzo.pieral...@linaro.com;
> lpieral...@kernel.org; peter.mayd...@linaro.org;
> richard.hender...@linaro.org; imamm...@redhat.com; andrew.jo...@linux.dev;
> da...@redhat.com; phi...@linaro.org; eric.au...@redhat.com;
> w...@kernel.org; catalin.mari...@arm.com; a...@kernel.org;
> justin...@arm.com; oliver.up...@linux.dev; pbonz...@redhat.com;
> m...@redhat.com; gs...@redhat.com; raf...@kernel.org;
> borntrae...@linux.ibm.com; alex.ben...@linaro.org;
> dar...@os.amperecomputing.com; il...@os.amperecomputing.com;
> vis...@os.amperecomputing.com; karl.heub...@oracle.com;
> miguel.l...@oracle.com; sudeep.ho...@arm.com; salil.me...@opnsrc.net;
> zhukeqian ; wangxiongfeng (C)
> ; wangyanan (Y) ;
> jiakern...@gmail.com; maob...@loongson.cn; lixiang...@loongson.cn
> Subject: Re: [PATCH RFC V2 00/37] Support of Virtual CPU Hotplug for ARMv8
> Arch
> 
> On Mon, Sep 25, 2023 at 08:03:56PM +0000, Salil Mehta wrote:
> > Looks like some problem with Huawei's mail server. No patches
> > except the cover letter are reaching the qemu-devel mailing-list.
> 
> I haven't seen any of the actual patches - just the cover letters.
> Was that intentional?

No. All the patches are getting held up either by the server or
some other problem. This has happened for both instances of the
patch-set I had pushed to the mailing list within 2 hours.

I am not sure how to sort it out without the help of IT. China is
asleep now.

Any suggestions welcome to debug this. Or should I wait till tomorrow?

Many thanks
Salil.




Re: [PATCH v3 3/5] vhost-user: Interface for migration state transfer

2023-09-25 Thread Stefan Hajnoczi
On Fri, Sep 15, 2023 at 12:25:28PM +0200, Hanna Czenczek wrote:
> Add the interface for transferring the back-end's state during migration
> as defined previously in vhost-user.rst.
> 
> Signed-off-by: Hanna Czenczek 
> ---
>  include/hw/virtio/vhost-backend.h |  24 +
>  include/hw/virtio/vhost.h |  79 
>  hw/virtio/vhost-user.c| 148 ++
>  hw/virtio/vhost.c |  37 
>  4 files changed, 288 insertions(+)

Reviewed-by: Stefan Hajnoczi 




Re: [PATCH RFC V2 00/37] Support of Virtual CPU Hotplug for ARMv8 Arch

2023-09-25 Thread Russell King (Oracle)
On Mon, Sep 25, 2023 at 08:03:56PM +0000, Salil Mehta wrote:
> Looks like some problem with Huawei's mail server. No patches
> except the cover letter are reaching the qemu-devel mailing-list.

I haven't seen any of the actual patches - just the cover letters.
Was that intentional?

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!



Re: [PATCH 11/12] hw/core/loader: read_targphys(): add upper bound

2023-09-25 Thread Michael Tokarev

25.09.2023 22:40, Vladimir Sementsov-Ogievskiy wrote:

Coverity doesn't like using "untrusted" values, coming from buffers and
fd-s as length to do IO and allocations. And that's make sense. The


"And that makes sense".  Just a nitpick in commit comment.


function is used three times with "untrusted" nbytes parameter. Let's
introduce at least empirical limit of 1G for it.

While being here make the function static, as it's used only here.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  hw/core/loader.c| 13 ++---
  include/hw/loader.h |  2 --
  2 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/hw/core/loader.c b/hw/core/loader.c
index aa02b27089..48cff6f59e 100644
--- a/hw/core/loader.c
+++ b/hw/core/loader.c
@@ -101,17 +101,24 @@ ssize_t load_image_size(const char *filename, void *addr, 
size_t size)
  return actsize < 0 ? -1 : l;
  }
  
+#define READ_TARGPHYS_MAX_BYTES (1024 * 1024 * 1024)

  /* read()-like version */
-ssize_t read_targphys(const char *name,
-  int fd, hwaddr dst_addr, size_t nbytes)
+static ssize_t read_targphys(const char *name,
+ int fd, hwaddr dst_addr, size_t nbytes)
  {
  uint8_t *buf;
  ssize_t did;
  
+if (nbytes > READ_TARGPHYS_MAX_BYTES) {

+return -1;


Right now this is not important, since the only user of this
function, load_aout(), ignores errno value and reports general
failure instead.  Original read_targphys() returned errno which
corresponds to failed read().

FWIW, at least load_aout() assumes we've read whole struct exec
from the file in question, which might not be the case.

/mjt
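
One way to keep the read()-like contract noted above — sketched as a
suggestion, assuming callers expect -1 with errno set (this is not part of
the posted patch):

    if (nbytes > READ_TARGPHYS_MAX_BYTES) {
        errno = EINVAL;  /* preserve the read()-style error convention */
        return -1;
    }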



Re: [PATCH 08/12] block/nvme: nvme_process_completion() fix bound for cid

2023-09-25 Thread Michael Tokarev

25.09.2023 22:40, Vladimir Sementsov-Ogievskiy wrote:

NVMeQueuePair::reqs as length NVME_NUM_REQS, which less than
NVME_QUEUE_SIZE by 1.



+if (cid == 0 || cid > NVME_NUM_REQS) {
+warn_report("NVMe: Unexpected CID in completion queue: %" PRIu32
+", should be within is: 1..%u", cid, NVME_NUM_REQS);


 - is: I guess :)

/mjt



RE: [PATCH RFC V2 00/37] Support of Virtual CPU Hotplug for ARMv8 Arch

2023-09-25 Thread Salil Mehta via
Looks like some problem with Huawei's mail server. No patches
except the cover letter are reaching the qemu-devel mailing-list.


> From: Salil Mehta 
> Sent: Monday, September 25, 2023 8:44 PM
> To: qemu-devel@nongnu.org; qemu-...@nongnu.org
> Cc: Salil Mehta ; m...@kernel.org;
> james.mo...@arm.com; jean-phili...@linaro.org; Jonathan Cameron
> ; lorenzo.pieral...@linaro.com;
> lpieral...@kernel.org; peter.mayd...@linaro.org;
> richard.hender...@linaro.org; imamm...@redhat.com; andrew.jo...@linux.dev;
> da...@redhat.com; phi...@linaro.org; eric.au...@redhat.com;
> w...@kernel.org; catalin.mari...@arm.com; a...@kernel.org;
> justin...@arm.com; oliver.up...@linux.dev; pbonz...@redhat.com;
> m...@redhat.com; gs...@redhat.com; raf...@kernel.org;
> borntrae...@linux.ibm.com; alex.ben...@linaro.org; li...@armlinux.org.uk;
> dar...@os.amperecomputing.com; il...@os.amperecomputing.com;
> vis...@os.amperecomputing.com; karl.heub...@oracle.com;
> miguel.l...@oracle.com; sudeep.ho...@arm.com; salil.me...@opnsrc.net;
> zhukeqian ; wangxiongfeng (C)
> ; wangyanan (Y) ;
> jiakern...@gmail.com; maob...@loongson.cn; lixiang...@loongson.cn
> Subject: [PATCH RFC V2 00/37] Support of Virtual CPU Hotplug for ARMv8 Arch
> 
> PROLOGUE
> 
> 
> To assist in review and set the right expectations from this RFC, please
> first
> read below sections *APPENDED AT THE END* of this cover letter,
> 
> 1. Important *DISCLAIMER* [Section (X)]
> 2. Work presented at KVMForum Conference (slides available) [Section (V)F]
> 3. Organization of patches [Section (XI)]
> 4. References [Section (XII)]
> 5. Detailed TODO list of the leftover work or work-in-progress [Section
> (IX)]
> 
> NOTE: There has been an interest shown by other organizations in adapting
> this series for their architecture. I am planning to split this RFC into
> architecture *agnostic* and *specific* patch-sets in subsequent releases.
> ARM
> specific patch-set will continue as RFC V3 and architecture agnostic patch-
> set
> will be floated without RFC tag and can be consumed in this Qemu cycle if
> MAINTAINERs ack it.
> 
> [Please check section (XI)B for details of architecture agnostic patches]
> 
> 
> SECTIONS [I - XIII] are as follows :
> 
> (I) Key Changes (RFC V1 -> RFC V2)
> ==
> 
> RFC V1: https://lore.kernel.org/qemu-devel/20200613213629.21984-1-
> salil.me...@huawei.com/
> 
> 1. ACPI MADT Table GIC CPU Interface can now be presented [6] as ACPI
>*online-capable* or *enabled* to the Guest OS at the boot time. This
> means
>associated CPUs can have ACPI _STA as *enabled* or *disabled* even after
> boot
>See, UEFI ACPI 6.5 Spec, Section 05, Table 5.37 GICC CPU Interface
> Flags[20]
> 2. SMCC/HVC Hypercall exit handling in userspace/Qemu for PSCI CPU_{ON,OFF}
>request. This is required to {dis}allow online'ing a vCPU.
> 3. Always presenting unplugged vCPUs in CPUs ACPI AML code as ACPI
> _STA.PRESENT
>to the Guest OS. Toggling ACPI _STA.Enabled to give an effect of the
>hot{un}plug.
> 4. Live Migration works (some issues are still there)
> 5. TCG/HVF/qtest does not support Hotplug and falls back to default.
> 6. Code for TCG support do exists in this release (it is a work-in-
> progress)
> 7. ACPI _OSC method can now be used by OSPM to negotiate Qemu VM platform
>hotplug capability (_OSC Query support still pending)
> 8. Misc. Bug fixes
> 
> (II) Summary
>  ===
> 
> This patch-set introduces the virtual CPU hotplug support for ARMv8
> architecture
> in QEMU. Idea is to be able to hotplug and hot-unplug the vCPUs while guest
> VM
> is running and no reboot is required. This does *not* make any assumption
> of
> the physical CPU hotplug availability within the host system but rather
> tries to
> solve the problem at virtualizer/QEMU layer. Introduces ACPI CPU hotplug
> hooks
> and event handling to interface with the guest kernel, code to initialize,
> plug
> and unplug CPUs. No changes are required within the host kernel/KVM except
> the
> support of hypercall exit handling in the user-space/Qemu which has
> recently
> been added to the kernel. Its corresponding Guest kernel changes have been
> posted on the mailing-list [3] [4] by James Morse.
> 
> (III) Motivation
>   ==
> 
> This allows scaling the guest VM compute capacity on-demand which would be
> useful for the following example scenarios,
> 
> 1. Vertical Pod Autoscaling [9][10] in the cloud: Part of the orchestration
>framework which could adjust resource requests (CPU and Mem requests)
> for
>the containers in a pod, based on usage.
> 2. Pay-as-you-grow Business Model: Infrastructure provider could allocate
> and
>restrict the total number of compute resources available to the guest VM
>according to the SLA (Service Level Agreement). VM owner could request
> for
>more compute to be hot-plugged for some cost.
> 
> For example, Kata Container VM starts with a minimum amount of resources
> 

Re: [PATCH 1/1] hw/ide/core: terminate in-flight DMA on IDE bus reset

2023-09-25 Thread John Snow
Niklas, I'm sorry to lean on you here a little bit - You've been
working on the SATA side of this a bit more often, can you let me know
if you think this patch is safe?

I'm not immediately sure what the impact of applying it is, but I have
some questions about it:

(1) When does ide_dma_cb get invoked when DRQ_STAT is set but the
return code we were passed is not < 0, and

(2) what's the impact of just *not* executing the end-of-transfer
block here; what happens to the state machine?

On Thu, Sep 21, 2023 at 12:07 PM Simon Rowe  wrote:
>
> When an IDE controller is reset, its internal state is being cleared
> before any outstanding I/O is cancelled. If a response to DMA is
> received in this window, the aio callback will incorrectly continue
> with the next part of the transfer (now using sector 0 from
> the cleared controller state).

Eugh, yikes. It feels like we should fix the cancellation ... I'm
worried that if we've reset the state machine and we need to bail on a
DMA callback that the heuristics we use for that will eventually be
wrong, unless I am mistaken and this is totally safe and reliable for
a reason I don't intuitively see right away.

Thoughts?

>
> For a write operation, this results in user data being written to the
> MBR, replacing the first stage bootloader and/or partition table. A
> malicious user could exploit this bug to first read the MBR and then
> rewrite it with user-controlled bootloader code.

Oh, good.

>
> This addresses the bug by checking if DRQ_STAT is still set in the DMA
> callback (as it is otherwise cleared at the start of the bus
> reset). If it is not, treat the transfer as ended.
>
> This only appears to affect SATA controllers, plain IDE does not use
> aio.
>
> Fixes: CVE-2023-5088
> Signed-off-by: Simon Rowe 
> Cc: Felipe Franciosi 
> ---
>  hw/ide/core.c | 8 ++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/hw/ide/core.c b/hw/ide/core.c
> index b5e0dcd29b..826b7eaeeb 100644
> --- a/hw/ide/core.c
> +++ b/hw/ide/core.c
> @@ -906,8 +906,12 @@ static void ide_dma_cb(void *opaque, int ret)
>  s->nsector -= n;
>  }
>
> -/* end of transfer ? */
> -if (s->nsector == 0) {
> +/*
> + * End of transfer ?
> + * If a bus reset occurs immediately before the callback is invoked the
> + * bus state will have been cleared. Terminate the transfer.
> + */
> +if (s->nsector == 0 || !(s->status & DRQ_STAT)) {
>  s->status = READY_STAT | SEEK_STAT;
>  ide_bus_set_irq(s->bus);
>  goto eot;
> --
> 2.22.3
>
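
To make the questions above concrete, the race the commit message describes
reads roughly like this timeline (an interpretation of the description, not
verified against every code path):

    /*
     *   vCPU thread (bus reset)         aio thread (DMA completion)
     *   -----------------------         ---------------------------
     *   ide_bus_reset()
     *     clears s->status (DRQ_STAT),
     *     s->nsector, sector fields
     *                                   ide_dma_cb(opaque, ret >= 0)
     *                                     continues the transfer using the
     *                                     zeroed state -> writes sector 0
     *                                     (the MBR)
     *     outstanding I/O cancelled
     *     (too late)
     *
     * Checking DRQ_STAT in the callback ends the transfer instead, because
     * the reset already cleared it before the callback could run.
     */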




[PATCH RFC V2 00/37] Support of Virtual CPU Hotplug for ARMv8 Arch

2023-09-25 Thread Salil Mehta via
PROLOGUE


To assist in review and set the right expectations from this RFC, please first
read below sections *APPENDED AT THE END* of this cover letter,

1. Important *DISCLAIMER* [Section (X)]
2. Work presented at KVMForum Conference (slides available) [Section (V)F]
3. Organization of patches [Section (XI)]
4. References [Section (XII)]
5. Detailed TODO list of the leftover work or work-in-progress [Section (IX)]

NOTE: There has been an interest shown by other organizations in adapting
this series for their architecture. I am planning to split this RFC into
architecture *agnostic* and *specific* patch-sets in subsequent releases. ARM
specific patch-set will continue as RFC V3 and architecture agnostic patch-set
will be floated without RFC tag and can be consumed in this Qemu cycle if
MAINTAINERs ack it.

[Please check section (XI)B for details of architecture agnostic patches]


SECTIONS [I - XIII] are as follows :

(I) Key Changes (RFC V1 -> RFC V2)
==

RFC V1: 
https://lore.kernel.org/qemu-devel/20200613213629.21984-1-salil.me...@huawei.com/

1. ACPI MADT Table GIC CPU Interface can now be presented [6] as ACPI
   *online-capable* or *enabled* to the Guest OS at the boot time. This means
   associated CPUs can have ACPI _STA as *enabled* or *disabled* even after boot
   See, UEFI ACPI 6.5 Spec, Section 05, Table 5.37 GICC CPU Interface Flags[20]
2. SMCC/HVC Hypercall exit handling in userspace/Qemu for PSCI CPU_{ON,OFF}
   request. This is required to {dis}allow online'ing a vCPU.
3. Always presenting unplugged vCPUs in CPUs ACPI AML code as ACPI _STA.PRESENT 
   to the Guest OS. Toggling ACPI _STA.Enabled to give an effect of the
   hot{un}plug.
4. Live Migration works (some issues are still there)
5. TCG/HVF/qtest does not support Hotplug and falls back to default.
6. Code for TCG support do exists in this release (it is a work-in-progress)
7. ACPI _OSC method can now be used by OSPM to negotiate Qemu VM platform
   hotplug capability (_OSC Query support still pending)
8. Misc. Bug fixes

(II) Summary
 ===

This patch-set introduces the virtual CPU hotplug support for ARMv8 architecture
in QEMU. Idea is to be able to hotplug and hot-unplug the vCPUs while guest VM
is running and no reboot is required. This does *not* make any assumption of
the physical CPU hotplug availability within the host system but rather tries to
solve the problem at virtualizer/QEMU layer. Introduces ACPI CPU hotplug hooks
and event handling to interface with the guest kernel, code to initialize, plug
and unplug CPUs. No changes are required within the host kernel/KVM except the
support of hypercall exit handling in the user-space/Qemu which has recently
been added to the kernel. Its corresponding Guest kernel changes have been
posted on the mailing-list [3] [4] by James Morse.

(III) Motivation
  ==

This allows scaling the guest VM compute capacity on-demand which would be
useful for the following example scenarios,

1. Vertical Pod Autoscaling [9][10] in the cloud: Part of the orchestration
   framework which could adjust resource requests (CPU and Mem requests) for
   the containers in a pod, based on usage.
2. Pay-as-you-grow Business Model: Infrastructure provider could allocate and
   restrict the total number of compute resources available to the guest VM
   according to the SLA (Service Level Agreement). VM owner could request for
   more compute to be hot-plugged for some cost.

For example, Kata Container VM starts with a minimum amount of resources (i.e.
hotplug everything approach). Why?

1. Allowing faster *boot time* and
2. Reduction in *memory footprint*

Kata Container VM can boot with just 1 vCPU and then later more vCPUs can be
hot-plugged as per requirement.

(IV) Terminology
 ===

(*) Possible CPUs:  Total vCPUs which could ever exist in VM. This includes
any cold booted CPUs plus any CPUs which could be later
hot-plugged.
- Qemu parameter(-smp maxcpus=N)
(*) Present CPUs:   Possible CPUs which are ACPI 'present'. These might or might
not be ACPI 'enabled'. 
- Present vCPUs = Possible vCPUs (Always on ARM Arch)
(*) Enabled CPUs:   Possible CPUs which are ACPI ‘present’ and 'enabled' and can
now be ‘onlined’ (PSCI) for use by Guest Kernel. All cold
booted vCPUs are ACPI 'enabled' at boot. Later, using
device_add more vCPUs can be hotplugged and be made ACPI
'enabled.
- Qemu parameter(-smp cpus=N). Can be used to specify some
  cold booted vCPUs during VM init. Some can be added using
  '-device' option.
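
Tying the terminology to concrete flags — an illustrative invocation (the
hotplug device name follows this series' examples and should be treated as
an assumption, not settled syntax):

    -smp cpus=2,maxcpus=6   # 2 vCPUs cold-booted and ACPI 'enabled';
                            # 4 more 'present' but awaiting hotplug

    (qemu) device_add host-arm-cpu,id=core2,core-id=2
                            # makes one more vCPU ACPI 'enabled' at runtime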

(V) Constraints Due To ARMv8 CPU Architecture [+] Other Impediments
===

A. Physical Limitation to Support CPU Hotplug: 

[PATCH 11/12] hw/core/loader: read_targphys(): add upper bound

2023-09-25 Thread Vladimir Sementsov-Ogievskiy
Coverity doesn't like using "untrusted" values, coming from buffers and
fd-s, as lengths for IO and allocations. And that makes sense. The
function is used three times with an "untrusted" nbytes parameter. Let's
introduce at least an empirical limit of 1G for it.

While here, also make the function static, as it's used only in this file.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 hw/core/loader.c| 13 ++---
 include/hw/loader.h |  2 --
 2 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/hw/core/loader.c b/hw/core/loader.c
index aa02b27089..48cff6f59e 100644
--- a/hw/core/loader.c
+++ b/hw/core/loader.c
@@ -101,17 +101,24 @@ ssize_t load_image_size(const char *filename, void *addr, 
size_t size)
 return actsize < 0 ? -1 : l;
 }
 
+#define READ_TARGPHYS_MAX_BYTES (1024 * 1024 * 1024)
 /* read()-like version */
-ssize_t read_targphys(const char *name,
-  int fd, hwaddr dst_addr, size_t nbytes)
+static ssize_t read_targphys(const char *name,
+ int fd, hwaddr dst_addr, size_t nbytes)
 {
 uint8_t *buf;
 ssize_t did;
 
+if (nbytes > READ_TARGPHYS_MAX_BYTES) {
+return -1;
+}
+
 buf = g_malloc(nbytes);
 did = read(fd, buf, nbytes);
-if (did > 0)
+if (did > 0) {
 rom_add_blob_fixed("read", buf, did, dst_addr);
+}
+
 g_free(buf);
 return did;
 }
diff --git a/include/hw/loader.h b/include/hw/loader.h
index c4c14170ea..e29af233d2 100644
--- a/include/hw/loader.h
+++ b/include/hw/loader.h
@@ -264,8 +264,6 @@ ssize_t load_ramdisk(const char *filename, hwaddr addr, 
uint64_t max_sz);
 
 ssize_t gunzip(void *dst, size_t dstlen, uint8_t *src, size_t srclen);
 
-ssize_t read_targphys(const char *name,
-  int fd, hwaddr dst_addr, size_t nbytes);
 void pstrcpy_targphys(const char *name,
   hwaddr dest, int buf_size,
   const char *source);
-- 
2.34.1




[PATCH 06/12] mc146818rtc: rtc_set_time(): initialize tm to zeroes

2023-09-25 Thread Vladimir Sementsov-Ogievskiy
The rtc_get_time() call in rtc_set_time() doesn't set all the fields of the
tm structure, so it's better to zero-initialize it. Coverity will be happier
about it too.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 hw/rtc/mc146818rtc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/rtc/mc146818rtc.c b/hw/rtc/mc146818rtc.c
index c27c362db9..b63e1aeaea 100644
--- a/hw/rtc/mc146818rtc.c
+++ b/hw/rtc/mc146818rtc.c
@@ -599,7 +599,7 @@ static void rtc_get_time(MC146818RtcState *s, struct tm *tm)
 
 static void rtc_set_time(MC146818RtcState *s)
 {
-struct tm tm;
+struct tm tm = {0};
 g_autofree const char *qom_path = object_get_canonical_path(OBJECT(s));
 
 rtc_get_time(s, &tm);
-- 
2.34.1




[PATCH 07/12] pcie_sriov: unregister_vfs(): fix error path

2023-09-25 Thread Vladimir Sementsov-Ogievskiy
local_err must be NULL before calling object_property_set_bool(), so we
must clear it on each iteration. Let's also use the more convenient
error_reportf_err().
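
To see why, here is a standalone model of the anti-pattern being removed.
This mocks the Error API contract (QEMU's error_setv() asserts
*errp == NULL); all names here are illustrative:

#include <assert.h>
#include <stddef.h>

typedef struct Error { const char *msg; } Error;

static void mock_error_setg(Error **errp, const char *msg)
{
    static Error e;
    assert(*errp == NULL);      /* mirrors error_setv()'s contract */
    e.msg = msg;
    *errp = &e;
}

int main(void)
{
    Error *local_err = NULL;

    for (int i = 0; i < 2; i++) {
        /* Before the fix: local_err still points at iteration 0's
         * error, so iteration 1 trips the assertion. Declaring a
         * fresh "Error *err = NULL" inside the loop, as the patch
         * does, avoids this. */
        mock_error_setg(&local_err, "Failed to unplug");
    }
    return 0;
}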

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 hw/pci/pcie_sriov.c | 9 +++--
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/hw/pci/pcie_sriov.c b/hw/pci/pcie_sriov.c
index 76a3b6917e..5ef8950940 100644
--- a/hw/pci/pcie_sriov.c
+++ b/hw/pci/pcie_sriov.c
@@ -196,19 +196,16 @@ static void register_vfs(PCIDevice *dev)
 
 static void unregister_vfs(PCIDevice *dev)
 {
-Error *local_err = NULL;
 uint16_t num_vfs = dev->exp.sriov_pf.num_vfs;
 uint16_t i;
 
 trace_sriov_unregister_vfs(dev->name, PCI_SLOT(dev->devfn),
PCI_FUNC(dev->devfn), num_vfs);
 for (i = 0; i < num_vfs; i++) {
+Error *err = NULL;
 PCIDevice *vf = dev->exp.sriov_pf.vf[i];
-object_property_set_bool(OBJECT(vf), "realized", false, &local_err);
-if (local_err) {
-fprintf(stderr, "Failed to unplug: %s\n",
-error_get_pretty(local_err));
-error_free(local_err);
+if (!object_property_set_bool(OBJECT(vf), "realized", false, &err)) {
+error_reportf_err(err, "Failed to unplug: ");
 }
 object_unparent(OBJECT(vf));
 object_unref(OBJECT(vf));
-- 
2.34.1




[PATCH 08/12] block/nvme: nvme_process_completion() fix bound for cid

2023-09-25 Thread Vladimir Sementsov-Ogievskiy
NVMeQueuePair::reqs has length NVME_NUM_REQS, which is one less than
NVME_QUEUE_SIZE.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 block/nvme.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/block/nvme.c b/block/nvme.c
index b6e95f0b7e..7f11ce1d46 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -416,9 +416,9 @@ static bool nvme_process_completion(NVMeQueuePair *q)
 q->cq_phase = !q->cq_phase;
 }
 cid = le16_to_cpu(c->cid);
-if (cid == 0 || cid > NVME_QUEUE_SIZE) {
-warn_report("NVMe: Unexpected CID in completion queue: %"PRIu32", "
-"queue size: %u", cid, NVME_QUEUE_SIZE);
+if (cid == 0 || cid > NVME_NUM_REQS) {
+warn_report("NVMe: Unexpected CID in completion queue: %" PRIu32
+", should be within 1..%u", cid, NVME_NUM_REQS);
 continue;
 }
 trace_nvme_complete_command(s, q->index, cid);
-- 
2.34.1




[PATCH 01/12] hw/core/loader: load_at(): check size

2023-09-25 Thread Vladimir Sementsov-Ogievskiy
This @size parameter often comes from an fd. We'd better check it before
doing a read and an allocation.

1G was chosen as a high enough empirical bound.
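
The pattern in isolation, as a standalone sketch — the limit matches the
patch, everything else is illustrative:

#include <stdio.h>
#include <stdlib.h>

#define LOAD_MAX (1024 * 1024 * 1024)   /* 1G, as below */

/* Reject an untrusted, file-derived size before it reaches
 * allocation and IO. */
static void *load_checked(FILE *f, size_t size)
{
    void *buf;

    if (size > LOAD_MAX) {
        return NULL;
    }
    buf = malloc(size);
    if (buf && fread(buf, 1, size, f) != size) {
        free(buf);
        return NULL;
    }
    return buf;
}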

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 hw/core/loader.c | 17 -
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/hw/core/loader.c b/hw/core/loader.c
index 4dd5a71fb7..4b67543046 100644
--- a/hw/core/loader.c
+++ b/hw/core/loader.c
@@ -281,11 +281,26 @@ ssize_t load_aout(const char *filename, hwaddr addr, int 
max_sz,
 
 /* ELF loader */
 
+#define ELF_LOAD_MAX (1024 * 1024 * 1024)
+
 static void *load_at(int fd, off_t offset, size_t size)
 {
 void *ptr;
-if (lseek(fd, offset, SEEK_SET) < 0)
+
+/*
+ * We often come here with @size, which was previously read from a file
+ * descriptor too. It's not good to read and allocate an unchecked
+ * number of bytes. Coverity also doesn't like it and generates reports.
+ * So, let's limit all load_at() calls to ELF_LOAD_MAX at least.
+ */
+if (size > ELF_LOAD_MAX) {
 return NULL;
+}
+
+if (lseek(fd, offset, SEEK_SET) < 0) {
+return NULL;
+}
+
 ptr = g_malloc(size);
 if (read(fd, ptr, size) != size) {
 g_free(ptr);
-- 
2.34.1




[PATCH 09/12] kvm-all: introduce limits for name_size and num_desc

2023-09-25 Thread Vladimir Sementsov-Ogievskiy
Coverity doesn't like it when a value with unchecked bounds that comes
from an fd is used as a length for IO or allocation. And really, that's
not good practice. Let's introduce at least empirical limits for these
values.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 accel/kvm/kvm-all.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index ff1578bb32..6d0ba7d900 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -3988,6 +3988,9 @@ typedef struct StatsDescriptors {
 static QTAILQ_HEAD(, StatsDescriptors) stats_descriptors =
 QTAILQ_HEAD_INITIALIZER(stats_descriptors);
 
+
+#define KVM_STATS_QEMU_MAX_NAME_SIZE (1024 * 1024)
+#define KVM_STATS_QEMU_MAX_NUM_DESC (1024)
 /*
  * Return the descriptors for 'target', that either have already been read
  * or are retrieved from 'stats_fd'.
@@ -4021,6 +4024,18 @@ static StatsDescriptors 
*find_stats_descriptors(StatsTarget target, int stats_fd
 g_free(descriptors);
 return NULL;
 }
+if (kvm_stats_header->name_size > KVM_STATS_QEMU_MAX_NAME_SIZE) {
+error_setg(errp, "KVM stats: too large name_size: %" PRIu32,
+   kvm_stats_header->name_size);
+g_free(descriptors);
+return NULL;
+}
+if (kvm_stats_header->num_desc > KVM_STATS_QEMU_MAX_NUM_DESC) {
+error_setg(errp, "KVM stats: too large num_desc: %" PRIu32,
+   kvm_stats_header->num_desc);
+g_free(descriptors);
+return NULL;
+}
 size_desc = sizeof(*kvm_stats_desc) + kvm_stats_header->name_size;
 
 /* Read stats descriptors */
-- 
2.34.1




[PATCH 02/12] hw/i386/intel_iommu: vtd_slpte_nonzero_rsvd(): reduce magic numbers

2023-09-25 Thread Vladimir Sementsov-Ogievskiy
Add a constant and a clear assertion. The assertion also tells Coverity
that we are not going to overflow the array.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 hw/i386/intel_iommu.c | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index c0ce896668..2233dbe13a 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -1028,12 +1028,17 @@ static dma_addr_t 
vtd_get_iova_pgtbl_base(IntelIOMMUState *s,
  * vtd_spte_rsvd 4k pages
  * vtd_spte_rsvd_large large pages
  */
-static uint64_t vtd_spte_rsvd[5];
-static uint64_t vtd_spte_rsvd_large[5];
+#define VTD_SPTE_RSVD_LEN 5
+static uint64_t vtd_spte_rsvd[VTD_SPTE_RSVD_LEN];
+static uint64_t vtd_spte_rsvd_large[VTD_SPTE_RSVD_LEN];
 
 static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
 {
-uint64_t rsvd_mask = vtd_spte_rsvd[level];
+uint64_t rsvd_mask;
+
+assert(level < VTD_SPTE_RSVD_LEN);
+
+rsvd_mask = vtd_spte_rsvd[level];
 
 if ((level == VTD_SL_PD_LEVEL || level == VTD_SL_PDP_LEVEL) &&
 (slpte & VTD_SL_PT_PAGE_SIZE_MASK)) {
-- 
2.34.1




[PATCH 12/12] io/channel-socket: qio_channel_socket_flush(): improve msg validation

2023-09-25 Thread Vladimir Sementsov-Ogievskiy
For SO_EE_ORIGIN_ZEROCOPY notifications the kernel reports the range of
completed sends as [ee_info, ee_data]; if ee_data were ever smaller than
ee_info, the zero_copy_sent accounting below would underflow. Validate the
bounds before using them.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 io/channel-socket.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/io/channel-socket.c b/io/channel-socket.c
index 02ffb51e99..3a899b0608 100644
--- a/io/channel-socket.c
+++ b/io/channel-socket.c
@@ -782,6 +782,11 @@ static int qio_channel_socket_flush(QIOChannel *ioc,
  "Error not from zero copy");
 return -1;
 }
+if (serr->ee_data < serr->ee_info) {
+error_setg_errno(errp, serr->ee_origin,
+ "Wrong notification bounds");
+return -1;
+}
 
 /* No errors, count successfully finished sendmsg()*/
 sioc->zero_copy_sent += serr->ee_data - serr->ee_info + 1;
-- 
2.34.1




[PATCH 04/12] libvhost-user.c: add assertion to vu_message_read_default

2023-09-25 Thread Vladimir Sementsov-Ogievskiy
Explain to Coverity that we are not going to overflow vmsg->fds.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 subprojects/libvhost-user/libvhost-user.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/subprojects/libvhost-user/libvhost-user.c 
b/subprojects/libvhost-user/libvhost-user.c
index 0469a50101..49b57c7ef4 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -322,6 +322,7 @@ vu_message_read_default(VuDev *dev, int conn_fd, 
VhostUserMsg *vmsg)
 if (cmsg->cmsg_level == SOL_SOCKET && cmsg->cmsg_type == SCM_RIGHTS) {
 fd_size = cmsg->cmsg_len - CMSG_LEN(0);
 vmsg->fd_num = fd_size / sizeof(int);
+assert(fd_size < VHOST_MEMORY_BASELINE_NREGIONS);
 memcpy(vmsg->fds, CMSG_DATA(cmsg), fd_size);
 break;
 }
-- 
2.34.1




[PATCH 05/12] device_tree: qmp_dumpdtb(): stronger assertion

2023-09-25 Thread Vladimir Sementsov-Ogievskiy
Coverity marks this size, read from the buffer, as an untrusted value;
it's not good to use it as a length when writing to a file. Make the
assertion stricter to also check the upper bound.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 softmmu/device_tree.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/softmmu/device_tree.c b/softmmu/device_tree.c
index 30aa3aea9f..adc4236e21 100644
--- a/softmmu/device_tree.c
+++ b/softmmu/device_tree.c
@@ -660,7 +660,7 @@ void qmp_dumpdtb(const char *filename, Error **errp)
 
 size = fdt_totalsize(current_machine->fdt);
 
-g_assert(size > 0);
+g_assert(size > 0 && size <= FDT_MAX_SIZE);
 
 if (!g_file_set_contents(filename, current_machine->fdt, size, &err)) {
 error_setg(errp, "Error saving FDT to file %s: %s",
-- 
2.34.1




[PATCH 10/12] hw/core/loader: gunzip(): initialize z_stream

2023-09-25 Thread Vladimir Sementsov-Ogievskiy
Coverity flags this variable as being used uninitialized. And really,
when working with external APIs it's better to zero out a structure
in which we only set some fields by hand.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 hw/core/loader.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/core/loader.c b/hw/core/loader.c
index 4b67543046..aa02b27089 100644
--- a/hw/core/loader.c
+++ b/hw/core/loader.c
@@ -573,7 +573,7 @@ static void zfree(void *x, void *addr)
 
 ssize_t gunzip(void *dst, size_t dstlen, uint8_t *src, size_t srclen)
 {
-z_stream s;
+z_stream s = {0};
 ssize_t dstbytes;
 int r, i, flags;
 
-- 
2.34.1




[PATCH 03/12] util/filemonitor-inotify: qemu_file_monitor_watch(): avoid overflow

2023-09-25 Thread Vladimir Sementsov-Ogievskiy
Prefer clear assertions to a possible array overflow.
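
The invariant the new assertions encode, in isolation (Linux-only sketch;
the real loop below also dispatches each event):

#include <assert.h>
#include <stddef.h>
#include <sys/inotify.h>

/* A read() from an inotify fd returns a packed sequence of
 * variable-length records; both the fixed header and the ev->len
 * name bytes must fit inside what was actually read. */
static void walk_inotify_buffer(const char *buf, size_t len)
{
    size_t used = 0;

    while (used < len) {
        const struct inotify_event *ev =
            (const struct inotify_event *)(buf + used);

        assert(len - used >= sizeof(*ev));
        assert(len - used - sizeof(*ev) >= ev->len);

        /* ... handle the event ... */
        used += sizeof(*ev) + ev->len;
    }
}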

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 util/filemonitor-inotify.c | 21 +
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/util/filemonitor-inotify.c b/util/filemonitor-inotify.c
index 2c45f7f176..09ef240174 100644
--- a/util/filemonitor-inotify.c
+++ b/util/filemonitor-inotify.c
@@ -81,16 +81,21 @@ static void qemu_file_monitor_watch(void *arg)
 
 /* Loop over all events in the buffer */
 while (used < len) {
-struct inotify_event *ev =
-(struct inotify_event *)(buf + used);
-const char *name = ev->len ? ev->name : "";
-QFileMonitorDir *dir = g_hash_table_lookup(mon->idmap,
-   GINT_TO_POINTER(ev->wd));
-uint32_t iev = ev->mask &
-(IN_CREATE | IN_MODIFY | IN_DELETE | IN_IGNORED |
- IN_MOVED_TO | IN_MOVED_FROM | IN_ATTRIB);
+const char *name;
+QFileMonitorDir *dir;
+uint32_t iev;
 int qev;
 gsize i;
+struct inotify_event *ev = (struct inotify_event *)(buf + used);
+
+assert(len - used >= sizeof(struct inotify_event));
+assert(len - used - sizeof(struct inotify_event) >= ev->len);
+
+name = ev->len ? ev->name : "";
+dir = g_hash_table_lookup(mon->idmap, GINT_TO_POINTER(ev->wd));
+iev = ev->mask &
+(IN_CREATE | IN_MODIFY | IN_DELETE | IN_IGNORED |
+ IN_MOVED_TO | IN_MOVED_FROM | IN_ATTRIB);
 
 used += sizeof(struct inotify_event) + ev->len;
 
-- 
2.34.1




[PATCH 00/12] coverity fixes

2023-09-25 Thread Vladimir Sementsov-Ogievskiy
Hi! Here are some improvements to handle issues found by Coverity (not
the public Coverity site, so there are no CIDs).

Vladimir Sementsov-Ogievskiy (12):
  hw/core/loader: load_at(): check size
  hw/i386/intel_iommu: vtd_slpte_nonzero_rsvd(): reduce magic numbers
  util/filemonitor-inotify: qemu_file_monitor_watch(): avoid overflow
  libvhost-user.c: add assertion to vu_message_read_default
  device_tree: qmp_dumpdtb(): stronger assertion
  mc146818rtc: rtc_set_time(): initialize tm to zeroes
  pcie_sriov: unregister_vfs(): fix error path
  block/nvme: nvme_process_completion() fix bound for cid
  kvm-all: introduce limits for name_size and num_desc
  hw/core/loader: gunzip(): initialize z_stream
  hw/core/loader: read_targphys(): add upper bound
  io/channel-socket: qio_channel_socket_flush(): improve msg validation

 accel/kvm/kvm-all.c   | 15 +++
 block/nvme.c  |  6 ++---
 hw/core/loader.c  | 32 +++
 hw/i386/intel_iommu.c | 11 +---
 hw/pci/pcie_sriov.c   |  9 +++
 hw/rtc/mc146818rtc.c  |  2 +-
 include/hw/loader.h   |  2 --
 io/channel-socket.c   |  5 
 softmmu/device_tree.c |  2 +-
 subprojects/libvhost-user/libvhost-user.c |  1 +
 util/filemonitor-inotify.c| 21 +--
 11 files changed, 77 insertions(+), 29 deletions(-)

-- 
2.34.1




[PATCH v7 01/12] nbd/server: Support a request payload

2023-09-25 Thread Eric Blake
Upcoming additions to support NBD 64-bit effect lengths allow for the
possibility to distinguish between payload length (capped at 32M) and
effect length (64 bits, although we generally assume 63 bits because
of off_t limitations).  Without that extension, only the NBD_CMD_WRITE
request has a payload; but with the extension, it makes sense to allow
at least NBD_CMD_BLOCK_STATUS to have both a payload and effect length
in a future patch (where the payload is a limited-size struct that in
turn gives the real effect length as well as a subset of known ids for
which status is requested).  Other future NBD commands may also have a
request payload, so the 64-bit extension introduces a new
NBD_CMD_FLAG_PAYLOAD_LEN that distinguishes between whether the header
length is a payload length or an effect length, rather than
hard-coding the decision based on the command.
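
As a sketch, the header-length interpretation this boils down to (the enum
and flag values below stand in for qemu's; treat them as assumptions):

#include <stdbool.h>
#include <stdint.h>

enum { NBD_MODE_OLDSTYLE, NBD_MODE_EXPORT_NAME, NBD_MODE_SIMPLE,
       NBD_MODE_STRUCTURED, NBD_MODE_EXTENDED };
#define NBD_CMD_FLAG_PAYLOAD_LEN (1 << 5)
#define NBD_CMD_WRITE 1

/* Does the header's length field describe payload bytes that follow
 * on the wire, rather than an effect length? */
static bool header_len_is_payload(int mode, uint16_t flags, uint16_t type)
{
    if (mode >= NBD_MODE_EXTENDED) {
        return flags & NBD_CMD_FLAG_PAYLOAD_LEN;   /* the flag decides */
    }
    return type == NBD_CMD_WRITE;                  /* hard-coded */
}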

According to the spec, a client should never send a command with a
payload unless the negotiation phase proved such an extension is
available.  So in the unlikely event the bit is set or cleared
incorrectly, the client is already at fault; if the client then
provides the payload, we can gracefully consume it off the wire and
fail the command with NBD_EINVAL (subsequent checks for magic numbers
ensure we are still in sync), while if the client fails to send
payload we block waiting for it (basically deadlocking our connection
to the bad client, but not negatively impacting our ability to service
other clients, so not a security risk).  Note that we do not support
the payload version of BLOCK_STATUS yet.

Signed-off-by: Eric Blake 
---

v7: another try at better logic [Vladimir]

v5: retitled from v4 13/24, rewrite on top of previous patch's switch
statement [Vladimir]

v4: less indentation on several 'if's [Vladimir]
---
 nbd/server.c | 37 +
 nbd/trace-events |  1 +
 2 files changed, 34 insertions(+), 4 deletions(-)

diff --git a/nbd/server.c b/nbd/server.c
index 7a6f95071f8..1eabcfc908d 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -2322,9 +2322,11 @@ static int coroutine_fn 
nbd_co_receive_request(NBDRequestData *req,
Error **errp)
 {
 NBDClient *client = req->client;
+bool extended_with_payload;
 bool check_length = false;
 bool check_rofs = false;
 bool allocate_buffer = false;
+bool payload_okay = false;
 unsigned payload_len = 0;
 int valid_flags = NBD_CMD_FLAG_FUA;
 int ret;
@@ -2338,6 +2340,13 @@ static int coroutine_fn 
nbd_co_receive_request(NBDRequestData *req,

 trace_nbd_co_receive_request_decode_type(request->cookie, request->type,
  nbd_cmd_lookup(request->type));
+extended_with_payload = client->mode >= NBD_MODE_EXTENDED &&
+request->flags & NBD_CMD_FLAG_PAYLOAD_LEN;
+if (extended_with_payload) {
+payload_len = request->len;
+check_length = true;
+}
+
 switch (request->type) {
 case NBD_CMD_DISC:
 /* Special case: we're going to disconnect without a reply,
@@ -2354,6 +2363,15 @@ static int coroutine_fn 
nbd_co_receive_request(NBDRequestData *req,
 break;

 case NBD_CMD_WRITE:
+if (client->mode >= NBD_MODE_EXTENDED) {
+if (!extended_with_payload) {
+/* The client is noncompliant. Trace it, but proceed. */
+trace_nbd_co_receive_ext_payload_compliance(request->from,
+request->len);
+}
+valid_flags |= NBD_CMD_FLAG_PAYLOAD_LEN;
+}
+payload_okay = true;
 payload_len = request->len;
 check_length = true;
 allocate_buffer = true;
@@ -2395,6 +2413,14 @@ static int coroutine_fn 
nbd_co_receive_request(NBDRequestData *req,
request->len, NBD_MAX_BUFFER_SIZE);
 return -EINVAL;
 }
+if (payload_len && !payload_okay) {
+/*
+ * For now, we don't support payloads on other commands; but
+ * we can keep the connection alive by ignoring the payload.
+ */
+assert(request->type != NBD_CMD_WRITE);
+request->len = 0;
+}
 if (allocate_buffer) {
 /* READ, WRITE */
 req->data = blk_try_blockalign(client->exp->common.blk,
@@ -2405,10 +2431,13 @@ static int coroutine_fn 
nbd_co_receive_request(NBDRequestData *req,
 }
 }
 if (payload_len) {
-/* WRITE */
-assert(req->data);
-ret = nbd_read(client->ioc, req->data, payload_len,
-   "CMD_WRITE data", errp);
+if (payload_okay) {
+assert(req->data);
+ret = nbd_read(client->ioc, req->data, payload_len,
+   "CMD_WRITE data", errp);
+} else {
+ret = nbd_drop(client->ioc, payload_len, errp);
+}
 if (ret < 0) {
 return -EIO;
 }
diff --git 

[PATCH v7 10/12] nbd/server: Refactor list of negotiated meta contexts

2023-09-25 Thread Eric Blake
Perform several minor refactorings of how the list of negotiated meta
contexts is managed, to make upcoming patches easier: Promote the
internal type NBDExportMetaContexts to the public opaque type
NBDMetaContexts, and mark exp const.  Use a shorter member name in
NBDClient.  Hoist calls to nbd_check_meta_context() earlier in their
callers, as the number of negotiated contexts may impact the flags
exposed in regards to an export, which in turn requires a new
parameter.  Drop a redundant parameter to nbd_negotiate_meta_queries.
No semantic change intended on the success path; on the failure path,
dropping context in nbd_check_meta_export even when reporting an error
is safer.

Signed-off-by: Eric Blake 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---

v5: rebase to master, tweak commit message [Vladimir], R-b added

v4: new patch split out from v3 13/14, with smaller impact (quit
trying to separate exp outside of NBDMeataContexts)
---
 include/block/nbd.h |  1 +
 nbd/server.c| 55 -
 2 files changed, 31 insertions(+), 25 deletions(-)

diff --git a/include/block/nbd.h b/include/block/nbd.h
index ba8724f5336..2006497f987 100644
--- a/include/block/nbd.h
+++ b/include/block/nbd.h
@@ -29,6 +29,7 @@
 typedef struct NBDExport NBDExport;
 typedef struct NBDClient NBDClient;
 typedef struct NBDClientConnection NBDClientConnection;
+typedef struct NBDMetaContexts NBDMetaContexts;

 extern const BlockExportDriver blk_exp_nbd;

diff --git a/nbd/server.c b/nbd/server.c
index b09ee44c159..44ebbd139b2 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -105,11 +105,13 @@ struct NBDExport {

 static QTAILQ_HEAD(, NBDExport) exports = QTAILQ_HEAD_INITIALIZER(exports);

-/* NBDExportMetaContexts represents a list of contexts to be exported,
+/*
+ * NBDMetaContexts represents a list of meta contexts in use,
  * as selected by NBD_OPT_SET_META_CONTEXT. Also used for
- * NBD_OPT_LIST_META_CONTEXT. */
-typedef struct NBDExportMetaContexts {
-NBDExport *exp;
+ * NBD_OPT_LIST_META_CONTEXT.
+ */
+struct NBDMetaContexts {
+const NBDExport *exp; /* associated export */
 size_t count; /* number of negotiated contexts */
 bool base_allocation; /* export base:allocation context (block status) */
 bool allocation_depth; /* export qemu:allocation-depth */
@@ -117,7 +119,7 @@ typedef struct NBDExportMetaContexts {
 * export qemu:dirty-bitmap:,
 * sized by exp->nr_export_bitmaps
 */
-} NBDExportMetaContexts;
+};

 struct NBDClient {
 int refcount;
@@ -144,7 +146,7 @@ struct NBDClient {
 uint32_t check_align; /* If non-zero, check for aligned client requests */

 NBDMode mode;
-NBDExportMetaContexts export_meta;
+NBDMetaContexts contexts; /* Negotiated meta contexts */

 uint32_t opt; /* Current option being negotiated */
 uint32_t optlen; /* remaining length of data in ioc for the option being
@@ -455,10 +457,10 @@ static int nbd_negotiate_handle_list(NBDClient *client, 
Error **errp)
 return nbd_negotiate_send_rep(client, NBD_REP_ACK, errp);
 }

-static void nbd_check_meta_export(NBDClient *client)
+static void nbd_check_meta_export(NBDClient *client, NBDExport *exp)
 {
-if (client->exp != client->export_meta.exp) {
-client->export_meta.count = 0;
+if (exp != client->contexts.exp) {
+client->contexts.count = 0;
 }
 }

@@ -504,6 +506,7 @@ static int nbd_negotiate_handle_export_name(NBDClient 
*client, bool no_zeroes,
 error_setg(errp, "export not found");
 return -EINVAL;
 }
+nbd_check_meta_export(client, client->exp);

 myflags = client->exp->nbdflags;
 if (client->mode >= NBD_MODE_STRUCTURED) {
@@ -521,7 +524,6 @@ static int nbd_negotiate_handle_export_name(NBDClient 
*client, bool no_zeroes,

 QTAILQ_INSERT_TAIL(&client->exp->clients, client, next);
 blk_exp_ref(&client->exp->common);
-nbd_check_meta_export(client);

 return 0;
 }
@@ -641,6 +643,9 @@ static int nbd_negotiate_handle_info(NBDClient *client, 
Error **errp)
   errp, "export '%s' not present",
   sane_name);
 }
+if (client->opt == NBD_OPT_GO) {
+nbd_check_meta_export(client, exp);
+}

 /* Don't bother sending NBD_INFO_NAME unless client requested it */
 if (sendname) {
@@ -729,7 +734,6 @@ static int nbd_negotiate_handle_info(NBDClient *client, 
Error **errp)
 client->check_align = check_align;
 QTAILQ_INSERT_TAIL(&client->exp->clients, client, next);
 blk_exp_ref(&client->exp->common);
-nbd_check_meta_export(client);
 rc = 1;
 }
 return rc;
@@ -852,7 +856,7 @@ static bool nbd_strshift(const char **str, const char 
*prefix)
  * Handle queries to 'base' namespace. For now, only the base:allocation
  * context is available.  Return true if @query has been handled.
  */
-static bool nbd_meta_base_query(NBDClient *client, 

[PATCH v7 12/12] nbd/server: Add FLAG_PAYLOAD support to CMD_BLOCK_STATUS

2023-09-25 Thread Eric Blake
Allow a client to request a subset of negotiated meta contexts.  For
example, a client may ask to use a single connection to learn about
both block status and dirty bitmaps, but where the dirty bitmap
queries only need to be performed on a subset of the disk; forcing the
server to compute that information on block status queries in the rest
of the disk is wasted effort (both at the server, and on the amount of
traffic sent over the wire to be parsed and ignored by the client).

Qemu as an NBD client never requests to use more than one meta
context, so it has no need to use block status payloads.  Testing this
instead requires support from libnbd, which CAN access multiple meta
contexts in parallel from a single NBD connection; an interop test
submitted to the libnbd project at the same time as this patch
demonstrates the feature working, as well as testing some corner cases
(for example, when the payload length is longer than the export
length), although other corner cases (like passing the same id
duplicated) requires a protocol fuzzer because libnbd is not wired up
to break the protocol that badly.

This also includes tweaks to 'qemu-nbd --list' to show when a server
is advertising the capability, and to the testsuite to reflect the
addition to that output.

Of note: qemu will always advertise the new feature bit during
NBD_OPT_INFO if extended headers have already been negotiated
(regardless of whether any NBD_OPT_SET_META_CONTEXT negotiation has
occurred); but for NBD_OPT_GO, qemu only advertises the feature if
block status is also enabled (that is, if the client does not
negotiate any contexts, then NBD_CMD_BLOCK_STATUS cannot be used, so
the feature is not advertised).
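
For reference, a sketch of the request payload being parsed here; the shape
is inferred from the parsing code below and the extension spec, and the
struct name is illustrative (fields are big-endian on the wire):

#include <stdint.h>

struct nbd_block_status_payload {
    uint64_t effect_length;   /* the real length of the status request */
    uint32_t ids[];           /* (payload_len - 8) / 4 context ids */
};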

Signed-off-by: Eric Blake 
---

v5: factor out 'id - NBD_MTA_ID_DIRTY_BITMAP' [Vladimir], rework logic
on zero-length requests to be clearer [Vladimir], rebase to earlier
changes
---
 docs/interop/nbd.txt  |   2 +-
 nbd/server.c  | 114 --
 qemu-nbd.c|   1 +
 nbd/trace-events  |   1 +
 tests/qemu-iotests/223.out|  12 +-
 tests/qemu-iotests/307.out|  10 +-
 .../tests/nbd-qemu-allocation.out |   2 +-
 7 files changed, 122 insertions(+), 20 deletions(-)

diff --git a/docs/interop/nbd.txt b/docs/interop/nbd.txt
index 9aae5e1f294..18efb251de9 100644
--- a/docs/interop/nbd.txt
+++ b/docs/interop/nbd.txt
@@ -69,4 +69,4 @@ NBD_CMD_BLOCK_STATUS for "qemu:dirty-bitmap:", NBD_CMD_CACHE
 NBD_CMD_FLAG_FAST_ZERO
 * 5.2: NBD_CMD_BLOCK_STATUS for "qemu:allocation-depth"
 * 7.1: NBD_FLAG_CAN_MULTI_CONN for shareable writable exports
-* 8.2: NBD_OPT_EXTENDED_HEADERS
+* 8.2: NBD_OPT_EXTENDED_HEADERS, NBD_FLAG_BLOCK_STATUS_PAYLOAD
diff --git a/nbd/server.c b/nbd/server.c
index 2d4cec75a49..898580a9b0b 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -512,6 +512,9 @@ static int nbd_negotiate_handle_export_name(NBDClient 
*client, bool no_zeroes,
 if (client->mode >= NBD_MODE_STRUCTURED) {
 myflags |= NBD_FLAG_SEND_DF;
 }
+if (client->mode >= NBD_MODE_EXTENDED && client->contexts.count) {
+myflags |= NBD_FLAG_BLOCK_STAT_PAYLOAD;
+}
 trace_nbd_negotiate_new_style_size_flags(client->exp->size, myflags);
 stq_be_p(buf, client->exp->size);
 stw_be_p(buf + 8, myflags);
@@ -699,6 +702,10 @@ static int nbd_negotiate_handle_info(NBDClient *client, 
Error **errp)
 if (client->mode >= NBD_MODE_STRUCTURED) {
 myflags |= NBD_FLAG_SEND_DF;
 }
+if (client->mode >= NBD_MODE_EXTENDED &&
+(client->contexts.count || client->opt == NBD_OPT_INFO)) {
+myflags |= NBD_FLAG_BLOCK_STAT_PAYLOAD;
+}
 trace_nbd_negotiate_new_style_size_flags(exp->size, myflags);
 stq_be_p(buf, exp->size);
 stw_be_p(buf + 8, myflags);
@@ -2420,6 +2427,87 @@ static int coroutine_fn nbd_co_send_bitmap(NBDClient 
*client,
 return nbd_co_send_extents(client, request, ea, last, context_id, errp);
 }

+/*
+ * nbd_co_block_status_payload_read
+ * Called when a client wants a subset of negotiated contexts via a
+ * BLOCK_STATUS payload.  Check the payload for valid length and
+ * contents.  On success, return 0 with request updated to effective
+ * length.  If request was invalid but all payload consumed, return 0
+ * with request->len and request->contexts->count set to 0 (which will
+ * trigger an appropriate NBD_EINVAL response later on).  Return
+ * negative errno if the payload was not fully consumed.
+ */
+static int
+nbd_co_block_status_payload_read(NBDClient *client, NBDRequest *request,
+ Error **errp)
+{
+int payload_len = request->len;
+g_autofree char *buf = NULL;
+size_t count, i, nr_bitmaps;
+uint32_t id;
+
+if (payload_len > NBD_MAX_BUFFER_SIZE) {
+error_setg(errp, "len (%" PRIu64 ") is larger than max len (%u)",
+   request->len, 

[PATCH v7 03/12] nbd/server: Prepare to send extended header replies

2023-09-25 Thread Eric Blake
Although extended mode is not yet enabled, once we do turn it on, we
need to reply with extended headers to all messages.  Update the low
level entry points necessary so that all other callers automatically
get the right header based on the current mode.
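
For reference, the two wire layouts this helper now chooses between. Field
order matches the st*_be_p() calls in the diff; the struct names are
illustrative, all fields are big-endian on the wire, and the real structs
are packed:

#include <stdint.h>

struct compact_chunk_hdr {      /* NBD_STRUCTURED_REPLY_MAGIC, 20 bytes */
    uint32_t magic;
    uint16_t flags;
    uint16_t type;
    uint64_t cookie;
    uint32_t length;            /* payload bytes that follow */
};

struct extended_chunk_hdr {     /* NBD_EXTENDED_REPLY_MAGIC, 32 bytes */
    uint32_t magic;
    uint16_t flags;
    uint16_t type;
    uint64_t cookie;
    uint64_t offset;            /* echoes the request's offset */
    uint64_t length;            /* payload bytes, widened to 64 bits */
};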

Signed-off-by: Eric Blake 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---

v5: s/iov->iov_len/iov[0].iov_len/ [Vladimir], add R-b

v4: new patch, split out from v3 9/14
---
 nbd/server.c | 30 ++
 1 file changed, 22 insertions(+), 8 deletions(-)

diff --git a/nbd/server.c b/nbd/server.c
index e227e470d41..4cd061f9da4 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -1938,8 +1938,6 @@ static inline void set_be_chunk(NBDClient *client, struct 
iovec *iov,
 size_t niov, uint16_t flags, uint16_t type,
 NBDRequest *request)
 {
-/* TODO - handle structured vs. extended replies */
-NBDStructuredReplyChunk *chunk = iov->iov_base;
 size_t i, length = 0;

 for (i = 1; i < niov; i++) {
@@ -1947,12 +1945,26 @@ static inline void set_be_chunk(NBDClient *client, 
struct iovec *iov,
 }
 assert(length <= NBD_MAX_BUFFER_SIZE + sizeof(NBDStructuredReadData));

-iov[0].iov_len = sizeof(*chunk);
-stl_be_p(&chunk->magic, NBD_STRUCTURED_REPLY_MAGIC);
-stw_be_p(&chunk->flags, flags);
-stw_be_p(&chunk->type, type);
-stq_be_p(&chunk->cookie, request->cookie);
-stl_be_p(&chunk->length, length);
+if (client->mode >= NBD_MODE_EXTENDED) {
+NBDExtendedReplyChunk *chunk = iov->iov_base;
+
+iov[0].iov_len = sizeof(*chunk);
+stl_be_p(&chunk->magic, NBD_EXTENDED_REPLY_MAGIC);
+stw_be_p(&chunk->flags, flags);
+stw_be_p(&chunk->type, type);
+stq_be_p(&chunk->cookie, request->cookie);
+stq_be_p(&chunk->offset, request->from);
+stq_be_p(&chunk->length, length);
+} else {
+NBDStructuredReplyChunk *chunk = iov->iov_base;
+
+iov[0].iov_len = sizeof(*chunk);
+stl_be_p(&chunk->magic, NBD_STRUCTURED_REPLY_MAGIC);
+stw_be_p(&chunk->flags, flags);
+stw_be_p(&chunk->type, type);
+stq_be_p(&chunk->cookie, request->cookie);
+stl_be_p(&chunk->length, length);
+}
 }

 static int coroutine_fn nbd_co_send_chunk_done(NBDClient *client,
@@ -2509,6 +2521,8 @@ static coroutine_fn int nbd_send_generic_reply(NBDClient 
*client,
 {
 if (client->mode >= NBD_MODE_STRUCTURED && ret < 0) {
 return nbd_co_send_chunk_error(client, request, -ret, error_msg, errp);
+} else if (client->mode >= NBD_MODE_EXTENDED) {
+return nbd_co_send_chunk_done(client, request, errp);
 } else {
 return nbd_co_send_simple_reply(client, request, ret < 0 ? -ret : 0,
 NULL, 0, errp);
-- 
2.41.0




[PATCH v7 06/12] nbd/client: Plumb errp through nbd_receive_replies

2023-09-25 Thread Eric Blake
Instead of ignoring the low-level error just to refabricate our own
message to pass to the caller, we can just plumb the caller's errp
down to the low level.

Signed-off-by: Eric Blake 
---

v5: set errp on more failure cases [Vladimir], typo fix

v4: new patch [Vladimir]
---
 block/nbd.c | 18 --
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/block/nbd.c b/block/nbd.c
index 4a7f37da1c6..22d3cb11ac8 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -416,7 +416,8 @@ static void coroutine_fn GRAPH_RDLOCK 
nbd_reconnect_attempt(BDRVNBDState *s)
 reconnect_delay_timer_del(s);
 }

-static coroutine_fn int nbd_receive_replies(BDRVNBDState *s, uint64_t cookie)
+static coroutine_fn int nbd_receive_replies(BDRVNBDState *s, uint64_t cookie,
+Error **errp)
 {
 int ret;
 uint64_t ind = COOKIE_TO_INDEX(cookie), ind2;
@@ -457,20 +458,25 @@ static coroutine_fn int nbd_receive_replies(BDRVNBDState 
*s, uint64_t cookie)

 /* We are under mutex and cookie is 0. We have to do the dirty work. */
 assert(s->reply.cookie == 0);
-ret = nbd_receive_reply(s->bs, s->ioc, &s->reply, NULL);
-if (ret <= 0) {
-ret = ret ? ret : -EIO;
+ret = nbd_receive_reply(s->bs, s->ioc, &s->reply, errp);
+if (ret == 0) {
+ret = -EIO;
+error_setg(errp, "server dropped connection");
+}
+if (ret < 0) {
 nbd_channel_error(s, ret);
 return ret;
 }
 if (nbd_reply_is_structured(&s->reply) &&
 s->info.mode < NBD_MODE_STRUCTURED) {
 nbd_channel_error(s, -EINVAL);
+error_setg(errp, "unexpected structured reply");
 return -EINVAL;
 }
 ind2 = COOKIE_TO_INDEX(s->reply.cookie);
 if (ind2 >= MAX_NBD_REQUESTS || !s->requests[ind2].coroutine) {
 nbd_channel_error(s, -EINVAL);
+error_setg(errp, "unexpected cookie value");
 return -EINVAL;
 }
 if (s->reply.cookie == cookie) {
@@ -842,9 +848,9 @@ static coroutine_fn int nbd_co_do_receive_one_chunk(
 }
 *request_ret = 0;

-ret = nbd_receive_replies(s, cookie);
+ret = nbd_receive_replies(s, cookie, errp);
 if (ret < 0) {
-error_setg(errp, "Connection closed");
+error_prepend(errp, "Connection closed: ");
 return -EIO;
 }
 assert(s->ioc);
-- 
2.41.0




[PATCH v7 05/12] nbd/server: Enable initial support for extended headers

2023-09-25 Thread Eric Blake
Time to start supporting clients that request extended headers.  Now
we can finally reach the code added across several previous patches.

Even though the NBD spec has been altered to allow us to accept
NBD_CMD_READ larger than the max payload size (provided our response
is a hole or broken up over more than one data chunk), we are not
planning to take advantage of that, and continue to cap NBD_CMD_READ
to 32M regardless of header size.

For NBD_CMD_WRITE_ZEROES and NBD_CMD_TRIM, the block layer already
supports 64-bit operations without any effort on our part.  For
NBD_CMD_BLOCK_STATUS, the client's length is a hint, and the previous
patch took care of implementing the required
NBD_REPLY_TYPE_BLOCK_STATUS_EXT.

We do not yet support clients that want to do request payload
filtering of NBD_CMD_BLOCK_STATUS; that will be added in later
patches, but is not essential for qemu as a client since qemu only
requests the single context base:allocation.

Signed-off-by: Eric Blake 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---

v5: add R-b, s/8.1/8.2/

v4: split out parts into earlier patches, rebase to earlier changes,
simplify handling of generic replies, retitle (compare to v3 9/14)
---
 docs/interop/nbd.txt |  1 +
 nbd/server.c | 21 +
 2 files changed, 22 insertions(+)

diff --git a/docs/interop/nbd.txt b/docs/interop/nbd.txt
index f5ca25174a6..9aae5e1f294 100644
--- a/docs/interop/nbd.txt
+++ b/docs/interop/nbd.txt
@@ -69,3 +69,4 @@ NBD_CMD_BLOCK_STATUS for "qemu:dirty-bitmap:", NBD_CMD_CACHE
 NBD_CMD_FLAG_FAST_ZERO
 * 5.2: NBD_CMD_BLOCK_STATUS for "qemu:allocation-depth"
 * 7.1: NBD_FLAG_CAN_MULTI_CONN for shareable writable exports
+* 8.2: NBD_OPT_EXTENDED_HEADERS
diff --git a/nbd/server.c b/nbd/server.c
index 8448167b12a..b09ee44c159 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -482,6 +482,10 @@ static int nbd_negotiate_handle_export_name(NBDClient 
*client, bool no_zeroes,
 [10 .. 133]   reserved (0) [unless no_zeroes]
  */
 trace_nbd_negotiate_handle_export_name();
+if (client->mode >= NBD_MODE_EXTENDED) {
+error_setg(errp, "Extended headers already negotiated");
+return -EINVAL;
+}
 if (client->optlen > NBD_MAX_STRING_SIZE) {
 error_setg(errp, "Bad length received");
 return -EINVAL;
@@ -1264,6 +1268,10 @@ static int nbd_negotiate_options(NBDClient *client, 
Error **errp)
 case NBD_OPT_STRUCTURED_REPLY:
 if (length) {
 ret = nbd_reject_length(client, false, errp);
+} else if (client->mode >= NBD_MODE_EXTENDED) {
+ret = nbd_negotiate_send_rep_err(
+client, NBD_REP_ERR_EXT_HEADER_REQD, errp,
+"extended headers already negotiated");
 } else if (client->mode >= NBD_MODE_STRUCTURED) {
 ret = nbd_negotiate_send_rep_err(
 client, NBD_REP_ERR_INVALID, errp,
@@ -1280,6 +1288,19 @@ static int nbd_negotiate_options(NBDClient *client, 
Error **errp)
  errp);
 break;

+case NBD_OPT_EXTENDED_HEADERS:
+if (length) {
+ret = nbd_reject_length(client, false, errp);
+} else if (client->mode >= NBD_MODE_EXTENDED) {
+ret = nbd_negotiate_send_rep_err(
+client, NBD_REP_ERR_INVALID, errp,
+"extended headers already negotiated");
+} else {
+ret = nbd_negotiate_send_rep(client, NBD_REP_ACK, errp);
+client->mode = NBD_MODE_EXTENDED;
+}
+break;
+
 default:
 ret = nbd_opt_drop(client, NBD_REP_ERR_UNSUP, errp,
"Unsupported option %" PRIu32 " (%s)",
-- 
2.41.0




[PATCH v7 09/12] nbd/client: Request extended headers during negotiation

2023-09-25 Thread Eric Blake
All the pieces are in place for a client to finally request extended
headers.  Note that we must not request extended headers when qemu-nbd
is used to connect to the kernel module (as nbd.ko does not expect
them, but expects us to do the negotiation in userspace before handing
the socket over to the kernel), but there is no harm in all other
clients requesting them.

Extended headers are not essential to the information collected during
'qemu-nbd --list', but probing for it gives us one more piece of
information in that output.  Update the iotests affected by the new
line of output.

Signed-off-by: Eric Blake 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---

v5: add R-b

v4: rebase to earlier changes, tweak commit message for why qemu-nbd
connection to /dev/nbd cannot use extended mode [Vladimir]
---
 nbd/client-connection.c   |  2 +-
 nbd/client.c  | 20 ++-
 qemu-nbd.c|  3 +++
 tests/qemu-iotests/223.out|  6 ++
 tests/qemu-iotests/233.out|  4 
 tests/qemu-iotests/241.out|  3 +++
 tests/qemu-iotests/307.out|  5 +
 .../tests/nbd-qemu-allocation.out |  1 +
 8 files changed, 38 insertions(+), 6 deletions(-)

diff --git a/nbd/client-connection.c b/nbd/client-connection.c
index aa0201b7107..f9da67c87e3 100644
--- a/nbd/client-connection.c
+++ b/nbd/client-connection.c
@@ -93,7 +93,7 @@ NBDClientConnection *nbd_client_connection_new(const 
SocketAddress *saddr,
 .do_negotiation = do_negotiation,

 .initial_info.request_sizes = true,
-.initial_info.mode = NBD_MODE_STRUCTURED,
+.initial_info.mode = NBD_MODE_EXTENDED,
 .initial_info.base_allocation = true,
 .initial_info.x_dirty_bitmap = g_strdup(x_dirty_bitmap),
 .initial_info.name = g_strdup(export_name ?: "")
diff --git a/nbd/client.c b/nbd/client.c
index a2f253062aa..29ffc609a4b 100644
--- a/nbd/client.c
+++ b/nbd/client.c
@@ -953,15 +953,23 @@ static int nbd_start_negotiate(QIOChannel *ioc, 
QCryptoTLSCreds *tlscreds,
 if (fixedNewStyle) {
 int result = 0;

+if (max_mode >= NBD_MODE_EXTENDED) {
+result = nbd_request_simple_option(ioc,
+   NBD_OPT_EXTENDED_HEADERS,
+   false, errp);
+if (result) {
+return result < 0 ? -EINVAL : NBD_MODE_EXTENDED;
+}
+}
 if (max_mode >= NBD_MODE_STRUCTURED) {
 result = nbd_request_simple_option(ioc,
NBD_OPT_STRUCTURED_REPLY,
false, errp);
-if (result < 0) {
-return -EINVAL;
+if (result) {
+return result < 0 ? -EINVAL : NBD_MODE_STRUCTURED;
 }
 }
-return result ? NBD_MODE_STRUCTURED : NBD_MODE_SIMPLE;
+return NBD_MODE_SIMPLE;
 } else {
 return NBD_MODE_EXPORT_NAME;
 }
@@ -1034,6 +1042,7 @@ int nbd_receive_negotiate(QIOChannel *ioc, 
QCryptoTLSCreds *tlscreds,
 }

 switch (info->mode) {
+case NBD_MODE_EXTENDED:
 case NBD_MODE_STRUCTURED:
 if (base_allocation) {
 result = nbd_negotiate_simple_meta_context(ioc, info, errp);
@@ -1144,7 +1153,7 @@ int nbd_receive_export_list(QIOChannel *ioc, 
QCryptoTLSCreds *tlscreds,

 *info = NULL;
 result = nbd_start_negotiate(ioc, tlscreds, hostname, &sioc,
- NBD_MODE_STRUCTURED, NULL, errp);
+ NBD_MODE_EXTENDED, NULL, errp);
 if (tlscreds && sioc) {
 ioc = sioc;
 }
@@ -1155,6 +1164,7 @@ int nbd_receive_export_list(QIOChannel *ioc, 
QCryptoTLSCreds *tlscreds,
 switch ((NBDMode)result) {
 case NBD_MODE_SIMPLE:
 case NBD_MODE_STRUCTURED:
+case NBD_MODE_EXTENDED:
 /* newstyle - use NBD_OPT_LIST to populate array, then try
  * NBD_OPT_INFO on each array member. If structured replies
  * are enabled, also try NBD_OPT_LIST_META_CONTEXT. */
@@ -1191,7 +1201,7 @@ int nbd_receive_export_list(QIOChannel *ioc, 
QCryptoTLSCreds *tlscreds,
 break;
 }

-if (result == NBD_MODE_STRUCTURED &&
+if (result >= NBD_MODE_STRUCTURED &&
 nbd_list_meta_contexts(ioc, &list[i], errp) < 0) {
 goto out;
 }
diff --git a/qemu-nbd.c b/qemu-nbd.c
index 70aa3c487a0..e6681450cfe 100644
--- a/qemu-nbd.c
+++ b/qemu-nbd.c
@@ -235,6 +235,9 @@ static int qemu_nbd_client_list(SocketAddress *saddr, 
QCryptoTLSCreds *tls,
 printf("  opt block: %u\n", list[i].opt_block);
 printf("  max block: %u\n", list[i].max_block);
 }

[PATCH v7 07/12] nbd/client: Initial support for extended headers

2023-09-25 Thread Eric Blake
Update the client code to be able to send an extended request, and
parse an extended header from the server.  Note that since we reject
any structured reply with a too-large payload, we can always normalize
a valid header back into the compact form, so that the caller need not
deal with two branches of a union.  Still, until a later patch lets
the client negotiate extended headers, the code added here should not
be reached.  Note that because of the different magic numbers, it is
just as easy to trace and then tolerate a non-compliant server sending
the wrong header reply as it would be to insist that the server is
compliant.

Signed-off-by: Eric Blake 
---

v5: fix logic bug on error reporting [Vladimir]

v4: split off errp handling to separate patch [Vladimir], better
function naming [Vladimir]
---
 include/block/nbd.h |   3 +-
 block/nbd.c |   2 +-
 nbd/client.c| 104 +---
 nbd/trace-events|   3 +-
 4 files changed, 74 insertions(+), 38 deletions(-)

diff --git a/include/block/nbd.h b/include/block/nbd.h
index 8a765e78dfb..ba8724f5336 100644
--- a/include/block/nbd.h
+++ b/include/block/nbd.h
@@ -389,7 +389,8 @@ int nbd_init(int fd, QIOChannelSocket *sioc, NBDExportInfo 
*info,
  Error **errp);
 int nbd_send_request(QIOChannel *ioc, NBDRequest *request);
 int coroutine_fn nbd_receive_reply(BlockDriverState *bs, QIOChannel *ioc,
-   NBDReply *reply, Error **errp);
+   NBDReply *reply, NBDMode mode,
+   Error **errp);
 int nbd_client(int fd);
 int nbd_disconnect(int fd);
 int nbd_errno_to_system_errno(int err);
diff --git a/block/nbd.c b/block/nbd.c
index 22d3cb11ac8..76461430411 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -458,7 +458,7 @@ static coroutine_fn int nbd_receive_replies(BDRVNBDState 
*s, uint64_t cookie,

 /* We are under mutex and cookie is 0. We have to do the dirty work. */
 assert(s->reply.cookie == 0);
-ret = nbd_receive_reply(s->bs, s->ioc, &s->reply, errp);
+ret = nbd_receive_reply(s->bs, s->ioc, &s->reply, s->info.mode, errp);
 if (ret == 0) {
 ret = -EIO;
 error_setg(errp, "server dropped connection");
diff --git a/nbd/client.c b/nbd/client.c
index cecb0f04377..a2f253062aa 100644
--- a/nbd/client.c
+++ b/nbd/client.c
@@ -1346,22 +1346,29 @@ int nbd_disconnect(int fd)

 int nbd_send_request(QIOChannel *ioc, NBDRequest *request)
 {
-uint8_t buf[NBD_REQUEST_SIZE];
+uint8_t buf[NBD_EXTENDED_REQUEST_SIZE];
+size_t len;

-assert(request->mode <= NBD_MODE_STRUCTURED); /* TODO handle extended */
-assert(request->len <= UINT32_MAX);
 trace_nbd_send_request(request->from, request->len, request->cookie,
request->flags, request->type,
nbd_cmd_lookup(request->type));

-stl_be_p(buf, NBD_REQUEST_MAGIC);
 stw_be_p(buf + 4, request->flags);
 stw_be_p(buf + 6, request->type);
 stq_be_p(buf + 8, request->cookie);
 stq_be_p(buf + 16, request->from);
-stl_be_p(buf + 24, request->len);
+if (request->mode >= NBD_MODE_EXTENDED) {
+stl_be_p(buf, NBD_EXTENDED_REQUEST_MAGIC);
+stq_be_p(buf + 24, request->len);
+len = NBD_EXTENDED_REQUEST_SIZE;
+} else {
+assert(request->len <= UINT32_MAX);
+stl_be_p(buf, NBD_REQUEST_MAGIC);
+stl_be_p(buf + 24, request->len);
+len = NBD_REQUEST_SIZE;
+}

-return nbd_write(ioc, buf, sizeof(buf), NULL);
+return nbd_write(ioc, buf, len, NULL);
 }

 /* nbd_receive_simple_reply
@@ -1388,30 +1395,36 @@ static int nbd_receive_simple_reply(QIOChannel *ioc, 
NBDSimpleReply *reply,
 return 0;
 }

-/* nbd_receive_structured_reply_chunk
+/* nbd_receive_reply_chunk_header
  * Read structured reply chunk except magic field (which should be already
- * read).
+ * read).  Normalize into the compact form.
  * Payload is not read.
  */
-static int nbd_receive_structured_reply_chunk(QIOChannel *ioc,
-  NBDStructuredReplyChunk *chunk,
-  Error **errp)
+static int nbd_receive_reply_chunk_header(QIOChannel *ioc, NBDReply *chunk,
+  Error **errp)
 {
 int ret;
+size_t len;
+uint64_t payload_len;

-assert(chunk->magic == NBD_STRUCTURED_REPLY_MAGIC);
+if (chunk->magic == NBD_STRUCTURED_REPLY_MAGIC) {
+len = sizeof(chunk->structured);
+} else {
+assert(chunk->magic == NBD_EXTENDED_REPLY_MAGIC);
+len = sizeof(chunk->extended);
+}

 ret = nbd_read(ioc, (uint8_t *)chunk + sizeof(chunk->magic),
-   sizeof(*chunk) - sizeof(chunk->magic), "structured chunk",
+   len - sizeof(chunk->magic), "structured chunk",
errp);
 if (ret < 0) {
 return ret;
 }

-

[PATCH v7 08/12] nbd/client: Accept 64-bit block status chunks

2023-09-25 Thread Eric Blake
Once extended mode is enabled, we need to accept 64-bit status replies
(even for replies that don't exceed a 32-bit length).  It is easier to
normalize narrow replies into wide format so that the rest of our code
only has to handle one width.  Although a server is non-compliant if
it sends a 64-bit reply in compact mode, or a 32-bit reply in extended
mode, it is still easy enough to tolerate these mismatches.

In normal execution, we are only requesting "base:allocation" which
never exceeds 32 bits for flag values. But during testing with
x-dirty-bitmap, we can force qemu to connect to some other context
that might have 64-bit status bits; however, we ignore those upper bits
(other than mapping qemu:allocation-depth into something that
'qemu-img map --output=json' can expose), and since that only affects
testing, we really don't bother with checking whether more than the
two least-significant bits are set.
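
The normalization itself is cheap, which is why it is done client-side.
A sketch with illustrative type names:

#include <stdint.h>

typedef struct { uint32_t length; uint32_t flags; } Extent32;
typedef struct { uint64_t length; uint64_t flags; } Extent64;

/* Zero-extending a narrow reply is lossless, so the rest of the
 * client can consume a single (wide) representation. */
static Extent64 widen_extent(Extent32 e)
{
    return (Extent64){ .length = e.length, .flags = e.flags };
}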

Signed-off-by: Eric Blake 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---

v5: factor out duplicate length calculation [Vladimir], add R-b

v4: tweak comments and error message about count mismatch, fix setting
of wide in loop [Vladimir]
---
 block/nbd.c| 49 --
 block/trace-events |  1 +
 2 files changed, 35 insertions(+), 15 deletions(-)

diff --git a/block/nbd.c b/block/nbd.c
index 76461430411..52ebc8b2f53 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -615,13 +615,17 @@ static int nbd_parse_offset_hole_payload(BDRVNBDState *s,
  */
 static int nbd_parse_blockstatus_payload(BDRVNBDState *s,
  NBDStructuredReplyChunk *chunk,
- uint8_t *payload, uint64_t 
orig_length,
- NBDExtent32 *extent, Error **errp)
+ uint8_t *payload, bool wide,
+ uint64_t orig_length,
+ NBDExtent64 *extent, Error **errp)
 {
 uint32_t context_id;
+uint32_t count;
+size_t ext_len = wide ? sizeof(*extent) : sizeof(NBDExtent32);
+size_t pay_len = sizeof(context_id) + wide * sizeof(count) + ext_len;

 /* The server succeeded, so it must have sent [at least] one extent */
-if (chunk->length < sizeof(context_id) + sizeof(*extent)) {
+if (chunk->length < pay_len) {
 error_setg(errp, "Protocol error: invalid payload for "
  "NBD_REPLY_TYPE_BLOCK_STATUS");
 return -EINVAL;
@@ -636,8 +640,15 @@ static int nbd_parse_blockstatus_payload(BDRVNBDState *s,
 return -EINVAL;
 }

-extent->length = payload_advance32(&payload);
-extent->flags = payload_advance32(&payload);
+if (wide) {
+count = payload_advance32(&payload);
+extent->length = payload_advance64(&payload);
+extent->flags = payload_advance64(&payload);
+} else {
+count = 0;
+extent->length = payload_advance32(&payload);
+extent->flags = payload_advance32(&payload);
+}

 if (extent->length == 0) {
 error_setg(errp, "Protocol error: server sent status chunk with "
@@ -658,7 +669,7 @@ static int nbd_parse_blockstatus_payload(BDRVNBDState *s,
  * (always a safe status, even if it loses information).
  */
 if (s->info.min_block && !QEMU_IS_ALIGNED(extent->length,
-   s->info.min_block)) {
+  s->info.min_block)) {
 trace_nbd_parse_blockstatus_compliance("extent length is unaligned");
 if (extent->length > s->info.min_block) {
 extent->length = QEMU_ALIGN_DOWN(extent->length,
@@ -672,13 +683,15 @@ static int nbd_parse_blockstatus_payload(BDRVNBDState *s,
 /*
  * We used NBD_CMD_FLAG_REQ_ONE, so the server should not have
  * sent us any more than one extent, nor should it have included
- * status beyond our request in that extent. However, it's easy
- * enough to ignore the server's noncompliance without killing the
+ * status beyond our request in that extent. Furthermore, a wide
+ * server should have replied with an accurate count (we left
+ * count at 0 for a narrow server).  However, it's easy enough to
+ * ignore the server's noncompliance without killing the
  * connection; just ignore trailing extents, and clamp things to
  * the length of our request.
  */
-if (chunk->length > sizeof(context_id) + sizeof(*extent)) {
-trace_nbd_parse_blockstatus_compliance("more than one extent");
+if (count != wide || chunk->length > pay_len) {
+trace_nbd_parse_blockstatus_compliance("unexpected extent count");
 }
 if (extent->length > orig_length) {
 extent->length = orig_length;
@@ -1124,7 +1137,7 @@ nbd_co_receive_cmdread_reply(BDRVNBDState *s, uint64_t 
cookie,

 static int coroutine_fn
 nbd_co_receive_blockstatus_reply(BDRVNBDState *s, uint64_t cookie,
- uint64_t 

[PATCH v7 11/12] nbd/server: Prepare for per-request filtering of BLOCK_STATUS

2023-09-25 Thread Eric Blake
The next commit will add support for the optional extension
NBD_CMD_FLAG_PAYLOAD_LEN during NBD_CMD_BLOCK_STATUS, where the client can
request that the server only return a subset of negotiated contexts,
rather than all contexts.  To make that task easier, this patch
populates the list of contexts to return on a per-command basis (for
now, identical to the full set of negotiated contexts).

Signed-off-by: Eric Blake 
---

v5: fix null dereference on early error [Vladimir], hoist in assertion
from v4 24/24

v4: split out NBDMetaContexts refactoring to its own patch, track
NBDRequests.contexts as a pointer rather than inline
---
 include/block/nbd.h |  1 +
 nbd/server.c| 22 +-
 2 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/include/block/nbd.h b/include/block/nbd.h
index 2006497f987..4e7bd6342f9 100644
--- a/include/block/nbd.h
+++ b/include/block/nbd.h
@@ -77,6 +77,7 @@ typedef struct NBDRequest {
 uint16_t flags; /* NBD_CMD_FLAG_* */
 uint16_t type;  /* NBD_CMD_* */
 NBDMode mode;   /* Determines which network representation to use */
+NBDMetaContexts *contexts; /* Used by NBD_CMD_BLOCK_STATUS */
 } NBDRequest;

 typedef struct NBDSimpleReply {
diff --git a/nbd/server.c b/nbd/server.c
index 44ebbd139b2..2d4cec75a49 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -2505,6 +2505,7 @@ static int coroutine_fn 
nbd_co_receive_request(NBDRequestData *req,
 break;

 case NBD_CMD_BLOCK_STATUS:
+request->contexts = &client->contexts;
 valid_flags |= NBD_CMD_FLAG_REQ_ONE;
 break;

@@ -2745,17 +2746,18 @@ static coroutine_fn int nbd_handle_request(NBDClient 
*client,
   "discard failed", errp);

 case NBD_CMD_BLOCK_STATUS:
+assert(request->contexts);
 if (!request->len) {
 return nbd_send_generic_reply(client, request, -EINVAL,
   "need non-zero length", errp);
 }
 assert(client->mode >= NBD_MODE_EXTENDED ||
request->len <= UINT32_MAX);
-if (client->contexts.count) {
+if (request->contexts->count) {
 bool dont_fragment = request->flags & NBD_CMD_FLAG_REQ_ONE;
-int contexts_remaining = client->contexts.count;
+int contexts_remaining = request->contexts->count;

-if (client->contexts.base_allocation) {
+if (request->contexts->base_allocation) {
 ret = nbd_co_send_block_status(client, request,
exp->common.blk,
request->from,
@@ -2768,7 +2770,7 @@ static coroutine_fn int nbd_handle_request(NBDClient 
*client,
 }
 }

-if (client->contexts.allocation_depth) {
+if (request->contexts->allocation_depth) {
 ret = nbd_co_send_block_status(client, request,
exp->common.blk,
request->from, request->len,
@@ -2781,8 +2783,9 @@ static coroutine_fn int nbd_handle_request(NBDClient 
*client,
 }
 }

+assert(request->contexts->exp == client->exp);
 for (i = 0; i < client->exp->nr_export_bitmaps; i++) {
-if (!client->contexts.bitmaps[i]) {
+if (!request->contexts->bitmaps[i]) {
 continue;
 }
 ret = nbd_co_send_bitmap(client, request,
@@ -2798,6 +2801,10 @@ static coroutine_fn int nbd_handle_request(NBDClient 
*client,
 assert(!contexts_remaining);

 return 0;
+} else if (client->contexts.count) {
+return nbd_send_generic_reply(client, request, -EINVAL,
+  "CMD_BLOCK_STATUS payload not valid",
+  errp);
 } else {
 return nbd_send_generic_reply(client, request, -EINVAL,
   "CMD_BLOCK_STATUS not negotiated",
@@ -2876,6 +2883,11 @@ static coroutine_fn void nbd_trip(void *opaque)
 } else {
 ret = nbd_handle_request(client, &request, req->data, &local_err);
 }
+if (request.contexts && request.contexts != &client->contexts) {
+assert(request.type == NBD_CMD_BLOCK_STATUS);
+g_free(request.contexts->bitmaps);
+g_free(request.contexts);
+}
 if (ret < 0) {
 error_prepend(&local_err, "Failed to send reply: ");
 goto disconnect;
-- 
2.41.0




[PATCH v7 04/12] nbd/server: Support 64-bit block status

2023-09-25 Thread Eric Blake
The NBD spec states that if the client negotiates extended headers,
the server must avoid NBD_REPLY_TYPE_BLOCK_STATUS and instead use
NBD_REPLY_TYPE_BLOCK_STATUS_EXT which supports 64-bit lengths, even if
the reply does not need more than 32 bits.  As of this patch,
client->mode is still never NBD_MODE_EXTENDED, so the code added here
does not take effect until the next patch enables negotiation.

For now, all metacontexts that we know how to export never populate
more than 32 bits of information, so we don't have to worry about
NBD_REP_ERR_EXT_HEADER_REQD or filtering during handshake, and we
always send all zeroes for the upper 32 bits of status during
NBD_CMD_BLOCK_STATUS.

Note that we previously had some interesting size-juggling on call
chains, such as:

nbd_co_send_block_status(uint32_t length)
-> blockstatus_to_extents(uint32_t bytes)
  -> bdrv_block_status_above(bytes, &int64_t num)
  -> nbd_extent_array_add(uint64_t num)
-> store num in 32-bit length

But we were lucky that it never overflowed: bdrv_block_status_above
never sets num larger than bytes, and we had previously been capping
'bytes' at 32 bits (since the protocol does not allow sending a larger
request without extended headers).  This patch adds some assertions
that ensure we continue to avoid overflowing 32 bits for a narrow
client, while fully utilizing 64-bits all the way through when the
client understands that.  Even in 64-bit math, overflow is not an
issue, because all lengths are coming from the block layer, and we
know that the block layer does not support images larger than off_t
(if lengths were coming from the network, the story would be
different).

Signed-off-by: Eric Blake 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---

v5: stronger justification on assertion [Vladimir], add R-b

v4: split conversion to big-endian across two helper functions rather
than in-place union [Vladimir]
---
 nbd/server.c | 108 ++-
 1 file changed, 82 insertions(+), 26 deletions(-)

diff --git a/nbd/server.c b/nbd/server.c
index 4cd061f9da4..8448167b12a 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -2103,20 +2103,24 @@ static int coroutine_fn 
nbd_co_send_sparse_read(NBDClient *client,
 }

 typedef struct NBDExtentArray {
-NBDExtent32 *extents;
+NBDExtent64 *extents;
 unsigned int nb_alloc;
 unsigned int count;
 uint64_t total_length;
+bool extended;
 bool can_add;
 bool converted_to_be;
 } NBDExtentArray;

-static NBDExtentArray *nbd_extent_array_new(unsigned int nb_alloc)
+static NBDExtentArray *nbd_extent_array_new(unsigned int nb_alloc,
+NBDMode mode)
 {
 NBDExtentArray *ea = g_new0(NBDExtentArray, 1);

+assert(mode >= NBD_MODE_STRUCTURED);
 ea->nb_alloc = nb_alloc;
-ea->extents = g_new(NBDExtent32, nb_alloc);
+ea->extents = g_new(NBDExtent64, nb_alloc);
+ea->extended = mode >= NBD_MODE_EXTENDED;
 ea->can_add = true;

 return ea;
@@ -2135,15 +2139,36 @@ static void 
nbd_extent_array_convert_to_be(NBDExtentArray *ea)
 int i;

 assert(!ea->converted_to_be);
+assert(ea->extended);
 ea->can_add = false;
 ea->converted_to_be = true;

 for (i = 0; i < ea->count; i++) {
-ea->extents[i].flags = cpu_to_be32(ea->extents[i].flags);
-ea->extents[i].length = cpu_to_be32(ea->extents[i].length);
+ea->extents[i].length = cpu_to_be64(ea->extents[i].length);
+ea->extents[i].flags = cpu_to_be64(ea->extents[i].flags);
 }
 }

+/* Further modifications of the array after conversion are abandoned */
+static NBDExtent32 *nbd_extent_array_convert_to_narrow(NBDExtentArray *ea)
+{
+int i;
+NBDExtent32 *extents = g_new(NBDExtent32, ea->count);
+
+assert(!ea->converted_to_be);
+assert(!ea->extended);
+ea->can_add = false;
+ea->converted_to_be = true;
+
+for (i = 0; i < ea->count; i++) {
+assert((ea->extents[i].length | ea->extents[i].flags) <= UINT32_MAX);
+extents[i].length = cpu_to_be32(ea->extents[i].length);
+extents[i].flags = cpu_to_be32(ea->extents[i].flags);
+}
+
+return extents;
+}
+
 /*
  * Add extent to NBDExtentArray. If extent can't be added (no available space),
  * return -1.
@@ -2154,19 +2179,27 @@ static void nbd_extent_array_convert_to_be(NBDExtentArray *ea)
  * would result in an incorrect range reported to the client)
  */
 static int nbd_extent_array_add(NBDExtentArray *ea,
-uint32_t length, uint32_t flags)
+uint64_t length, uint32_t flags)
 {
 assert(ea->can_add);

 if (!length) {
 return 0;
 }
+if (!ea->extended) {
+assert(length <= UINT32_MAX);
+}

 /* Extend previous extent if flags are the same */
 if (ea->count > 0 && flags == ea->extents[ea->count - 1].flags) {
-uint64_t sum = (uint64_t)length + ea->extents[ea->count - 1].length;
+uint64_t sum = length + ea->extents[ea->count - 1].length;

[PATCH v7 02/12] nbd/server: Prepare to receive extended header requests

2023-09-25 Thread Eric Blake
Although extended mode is not yet enabled, once we do turn it on, we
need to accept extended requests for all messages.  Previous patches
have already taken care of supporting 64-bit lengths, now we just need
to read it off the wire.

Note that this implementation will block indefinitely on a buggy
client that sends a non-extended payload (that is, we try to read a
full packet before we ever check the magic number, but a client that
mistakenly sends a simple request after negotiating extended headers
doesn't send us enough bytes), but it's no different from any other
client that stops talking to us partway through a packet, and thus is
not worth coding around.
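
For orientation, the size selection implemented below amounts to this
(illustrative sketch, not part of the patch):

/* The header size is fixed once the mode is negotiated: a compact
 * request is 28 bytes on the wire, an extended request 32 bytes. */
static size_t nbd_request_header_size(NBDMode mode)
{
    return mode >= NBD_MODE_EXTENDED ? NBD_EXTENDED_REQUEST_SIZE
                                     : NBD_REQUEST_SIZE;
}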

Signed-off-by: Eric Blake 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---

v6: fix sign extension bug

v5: no change

v4: new patch, split out from v3 9/14
---
 nbd/nbd-internal.h |  5 -
 nbd/server.c   | 43 ++-
 2 files changed, 34 insertions(+), 14 deletions(-)

diff --git a/nbd/nbd-internal.h b/nbd/nbd-internal.h
index 133b1d94b50..dfa02f77ee4 100644
--- a/nbd/nbd-internal.h
+++ b/nbd/nbd-internal.h
@@ -34,8 +34,11 @@
  * https://github.com/yoe/nbd/blob/master/doc/proto.md
  */

-/* Size of all NBD_OPT_*, without payload */
+/* Size of all compact NBD_CMD_*, without payload */
 #define NBD_REQUEST_SIZE(4 + 2 + 2 + 8 + 8 + 4)
+/* Size of all extended NBD_CMD_*, without payload */
+#define NBD_EXTENDED_REQUEST_SIZE   (4 + 2 + 2 + 8 + 8 + 8)
+
 /* Size of all NBD_REP_* sent in answer to most NBD_OPT_*, without payload */
 #define NBD_REPLY_SIZE  (4 + 4 + 8)
 /* Size of reply to NBD_OPT_EXPORT_NAME */
diff --git a/nbd/server.c b/nbd/server.c
index 1eabcfc908d..e227e470d41 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -1411,11 +1411,13 @@ nbd_read_eof(NBDClient *client, void *buffer, size_t size, Error **errp)
static int coroutine_fn nbd_receive_request(NBDClient *client, NBDRequest *request,
 Error **errp)
 {
-uint8_t buf[NBD_REQUEST_SIZE];
-uint32_t magic;
+uint8_t buf[NBD_EXTENDED_REQUEST_SIZE];
+uint32_t magic, expect;
 int ret;
+size_t size = client->mode >= NBD_MODE_EXTENDED ?
+NBD_EXTENDED_REQUEST_SIZE : NBD_REQUEST_SIZE;

-ret = nbd_read_eof(client, buf, sizeof(buf), errp);
+ret = nbd_read_eof(client, buf, size, errp);
 if (ret < 0) {
 return ret;
 }
@@ -1423,13 +1425,21 @@ static int coroutine_fn nbd_receive_request(NBDClient *client, NBDRequest *reque
 return -EIO;
 }

-/* Request
-   [ 0 ..  3]   magic   (NBD_REQUEST_MAGIC)
-   [ 4 ..  5]   flags   (NBD_CMD_FLAG_FUA, ...)
-   [ 6 ..  7]   type(NBD_CMD_READ, ...)
-   [ 8 .. 15]   cookie
-   [16 .. 23]   from
-   [24 .. 27]   len
+/*
+ * Compact request
+ *  [ 0 ..  3]   magic   (NBD_REQUEST_MAGIC)
+ *  [ 4 ..  5]   flags   (NBD_CMD_FLAG_FUA, ...)
+ *  [ 6 ..  7]   type(NBD_CMD_READ, ...)
+ *  [ 8 .. 15]   cookie
+ *  [16 .. 23]   from
+ *  [24 .. 27]   len
+ * Extended request
+ *  [ 0 ..  3]   magic   (NBD_EXTENDED_REQUEST_MAGIC)
+ *  [ 4 ..  5]   flags   (NBD_CMD_FLAG_FUA, NBD_CMD_FLAG_PAYLOAD_LEN, ...)
+ *  [ 6 ..  7]   type(NBD_CMD_READ, ...)
+ *  [ 8 .. 15]   cookie
+ *  [16 .. 23]   from
+ *  [24 .. 31]   len
  */

 magic = ldl_be_p(buf);
@@ -1437,13 +1447,20 @@ static int coroutine_fn nbd_receive_request(NBDClient *client, NBDRequest *reque
 request->type   = lduw_be_p(buf + 6);
 request->cookie = ldq_be_p(buf + 8);
 request->from   = ldq_be_p(buf + 16);
-request->len= (uint32_t)ldl_be_p(buf + 24); /* widen 32 to 64 bits */
+if (client->mode >= NBD_MODE_EXTENDED) {
+request->len = ldq_be_p(buf + 24);
+expect = NBD_EXTENDED_REQUEST_MAGIC;
+} else {
+request->len = (uint32_t)ldl_be_p(buf + 24); /* widen 32 to 64 bits */
+expect = NBD_REQUEST_MAGIC;
+}

 trace_nbd_receive_request(magic, request->flags, request->type,
   request->from, request->len);

-if (magic != NBD_REQUEST_MAGIC) {
-error_setg(errp, "invalid magic (got 0x%" PRIx32 ")", magic);
+if (magic != expect) {
+error_setg(errp, "invalid magic (got 0x%" PRIx32 ", expected 0x%"
+   PRIx32 ")", magic, expect);
 return -EINVAL;
 }
 return 0;
-- 
2.41.0




[PATCH v7 00/12] NBD 64-bit extensions for qemu

2023-09-25 Thread Eric Blake
v6 was here:
https://lists.gnu.org/archive/html/qemu-devel/2023-08/msg05231.html

Since then:
 - patches v6 1-5 included in pull request
 - patch v6 6 logic improved, now v7 patch 1
 - rebased to master

Still needing review:
 - patch 1,6,7,11,12

Eric Blake (12):
  nbd/server: Support a request payload
  nbd/server: Prepare to receive extended header requests
  nbd/server: Prepare to send extended header replies
  nbd/server: Support 64-bit block status
  nbd/server: Enable initial support for extended headers
  nbd/client: Plumb errp through nbd_receive_replies
  nbd/client: Initial support for extended headers
  nbd/client: Accept 64-bit block status chunks
  nbd/client: Request extended headers during negotiation
  nbd/server: Refactor list of negotiated meta contexts
  nbd/server: Prepare for per-request filtering of BLOCK_STATUS
  nbd/server: Add FLAG_PAYLOAD support to CMD_BLOCK_STATUS

 docs/interop/nbd.txt  |   1 +
 include/block/nbd.h   |   5 +-
 nbd/nbd-internal.h|   5 +-
 block/nbd.c   |  67 ++-
 nbd/client-connection.c   |   2 +-
 nbd/client.c  | 124 --
 nbd/server.c  | 418 ++
 qemu-nbd.c|   4 +
 block/trace-events|   1 +
 nbd/trace-events  |   5 +-
 tests/qemu-iotests/223.out|  18 +-
 tests/qemu-iotests/233.out|   4 +
 tests/qemu-iotests/241.out|   3 +
 tests/qemu-iotests/307.out|  15 +-
 .../tests/nbd-qemu-allocation.out |   3 +-
 15 files changed, 516 insertions(+), 159 deletions(-)

-- 
2.41.0




Re: [PATCH v3 2/5] vhost-user.rst: Clarify enabling/disabling vrings

2023-09-25 Thread Stefan Hajnoczi
On Fri, Sep 15, 2023 at 12:25:27PM +0200, Hanna Czenczek wrote:
> Currently, the vhost-user documentation says that rings are to be
> initialized in a disabled state when VHOST_USER_F_PROTOCOL_FEATURES is
> negotiated.  However, by the time of feature negotiation, all rings have
> already been initialized, so it is not entirely clear what this means.
> 
> At least the vhost-user-backend Rust crate's implementation interpreted
> it to mean that whenever this feature is negotiated, all rings are to
> put into a disabled state, which means that every SET_FEATURES call
> would disable all rings, effectively halting the device.  This is
> problematic because the VHOST_F_LOG_ALL feature is also set or cleared
> this way, which happens during migration.  Doing so should not halt the
> device.
> 
> Other implementations have interpreted this to mean that the device is
> to be initialized with all rings disabled, and a subsequent SET_FEATURES
> call that does not set VHOST_USER_F_PROTOCOL_FEATURES will enable all of
> them.  Here, SET_FEATURES will never disable any ring.
> 
> This interpretation does not suffer the problem of unintentionally
> halting the device whenever features are set or cleared, so it seems
> better and more reasonable.
> 
> We should clarify this in the documentation.
> 
> Signed-off-by: Hanna Czenczek 
> ---
>  docs/interop/vhost-user.rst | 20 ++--
>  1 file changed, 14 insertions(+), 6 deletions(-)
> 
> diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
> index bb4dd0fd60..9b9b802c60 100644
> --- a/docs/interop/vhost-user.rst
> +++ b/docs/interop/vhost-user.rst
> @@ -409,12 +409,20 @@ and stop ring upon receiving ``VHOST_USER_GET_VRING_BASE``.
>  
>  Rings can be enabled or disabled by ``VHOST_USER_SET_VRING_ENABLE``.
>  
> -If ``VHOST_USER_F_PROTOCOL_FEATURES`` has not been negotiated, the
> -ring starts directly in the enabled state.
> -
> -If ``VHOST_USER_F_PROTOCOL_FEATURES`` has been negotiated, the ring is
> -initialized in a disabled state and is enabled by
> -``VHOST_USER_SET_VRING_ENABLE`` with parameter 1.
> +If ``VHOST_USER_SET_FEATURES`` does not negotiate
> +``VHOST_USER_F_PROTOCOL_FEATURES``, rings are enabled immediately when
> +started.
> +
> +If ``VHOST_USER_SET_FEATURES`` does negotiate
> +``VHOST_USER_F_PROTOCOL_FEATURES``, each ring will remain in the disabled
> +state until ``VHOST_USER_SET_VRING_ENABLE`` enables it with parameter 1.
> +
> +Back-end implementations that support ``VHOST_USER_F_PROTOCOL_FEATURES``
> +should implement this by initializing each ring in a disabled state, and
> +enabling them when ``VHOST_USER_SET_FEATURES`` is used without
> +negotiating ``VHOST_USER_F_PROTOCOL_FEATURES``.  Other than that, rings
> +should only be enabled and disabled through
> +``VHOST_USER_SET_VRING_ENABLE``.

The "Ring states" section starts by saying there are three states:
"stopped", "started but disabled", and "started and enabled". But this
patch talks about a "disabled state". Can you rephrase this patch to use
the exact state names defined earlier in the spec?
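
For reference, the interpretation the patch codifies reduces to
something like this on the back-end side (illustrative sketch; the
type and field names are invented, not from any real implementation):

/* On VHOST_USER_SET_FEATURES: never disable rings here.  Only when
 * VHOST_USER_F_PROTOCOL_FEATURES is absent do started rings become
 * enabled; otherwise enabling is left to SET_VRING_ENABLE alone. */
void on_set_features(Device *dev, uint64_t features)
{
    dev->features = features;
    if (!(features & (1ULL << VHOST_USER_F_PROTOCOL_FEATURES))) {
        for (unsigned i = 0; i < dev->num_vrings; i++) {
            dev->vring[i].enabled = true;
        }
    }
}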

>  
>  While processing the rings (whether they are enabled or not), the back-end
>  must support changing some configuration aspects on the fly.
> -- 
> 2.41.0
> 




Re: [PATCH v3 1/5] vhost-user.rst: Migrating back-end-internal state

2023-09-25 Thread Stefan Hajnoczi
On Fri, Sep 15, 2023 at 12:25:26PM +0200, Hanna Czenczek wrote:
> For vhost-user devices, qemu can migrate the virtio state, but not the
> back-end's internal state.  To do so, we need to be able to transfer
> this internal state between front-end (qemu) and back-end.
> 
> At this point, this new feature is added for the purpose of virtio-fs
> migration.  Because virtiofsd's internal state will not be too large, we
> believe it is best to transfer it as a single binary blob after the
> streaming phase.
> 
> These are the additions to the protocol:
> - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_DEVICE_STATE
> - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a file
>   descriptor over which to transfer the state.
> - CHECK_DEVICE_STATE: After the state has been transferred through the
>   file descriptor, the front-end invokes this function to verify
>   success.  There is no in-band way (through the file descriptor) to
>   indicate failure, so we need to check explicitly.
> 
> Once the transfer FD has been established via SET_DEVICE_STATE_FD
> (which includes establishing the direction of transfer and migration
> phase), the sending side writes its data into it, and the reading side
> reads it until it sees an EOF.  Then, the front-end will check for
> success via CHECK_DEVICE_STATE, which on the destination side includes
> checking for integrity (i.e. errors during deserialization).
> 
> Signed-off-by: Hanna Czenczek 
> ---
>  docs/interop/vhost-user.rst | 170 
>  1 file changed, 170 insertions(+)

Great documentation!

Reviewed-by: Stefan Hajnoczi 
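
For concreteness, the save-side flow described in the quoted text
could look roughly like this in the front-end (illustrative sketch;
the vhost_* helper names and constants are assumptions, not the
actual API):

/* Negotiate a pipe via SET_DEVICE_STATE_FD, stream the back-end's
 * state until EOF, then confirm success via CHECK_DEVICE_STATE. */
int save_backend_state(Backend *be, QEMUFile *f)
{
    int pipefd[2];
    char buf[4096];
    ssize_t n;

    if (pipe(pipefd) < 0) {
        return -errno;
    }
    /* The back-end (sending side) writes into pipefd[1], then closes it. */
    vhost_set_device_state_fd(be, TRANSFER_DIRECTION_SAVE,
                              MIGRATION_PHASE_STOPPED, pipefd[1]);
    close(pipefd[1]);
    while ((n = read(pipefd[0], buf, sizeof(buf))) > 0) {
        qemu_put_buffer(f, (uint8_t *)buf, n);  /* stream until EOF */
    }
    close(pipefd[0]);
    return vhost_check_device_state(be);  /* no in-band error channel */
}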




[PULL 3/7] nbd: Replace bool structured_reply with mode enum

2023-09-25 Thread Eric Blake
The upcoming patches for 64-bit extensions require various points in
the protocol to make decisions based on what was negotiated.  While we
could easily add a 'bool extended_headers' alongside the existing
'bool structured_reply', this does not scale well if more modes are
added in the future.  Better is to expose the mode enum added in the
recent commit bfe04d0a7d out to a wider use in the code base.

Where the code previously checked for structured_reply being set or
clear, it now prefers checking for an inequality; this works because
the modes are in a continuum of increasing abilities, and allows us to
touch fewer places if we ever insert other modes in the middle of the
enum.  There should be no semantic change in this patch.
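
The ordering trick in miniature (illustrative):

/* Because the enum orders modes by increasing ability, a single
 * comparison covers the current mode and any stronger future mode: */
if (client->mode >= NBD_MODE_STRUCTURED) {
    /* structured replies (or better) were negotiated */
}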

Signed-off-by: Eric Blake 
Message-ID: <20230829175826.377251-20-ebl...@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 include/block/nbd.h |  2 +-
 block/nbd.c |  8 +---
 nbd/client-connection.c |  4 ++--
 nbd/client.c| 18 +-
 nbd/server.c| 31 ++-
 qemu-nbd.c  |  4 +++-
 6 files changed, 38 insertions(+), 29 deletions(-)

diff --git a/include/block/nbd.h b/include/block/nbd.h
index f672b76173b..53226764574 100644
--- a/include/block/nbd.h
+++ b/include/block/nbd.h
@@ -305,7 +305,7 @@ typedef struct NBDExportInfo {

 /* In-out fields, set by client before nbd_receive_negotiate() and
  * updated by server results during nbd_receive_negotiate() */
-bool structured_reply;
+NBDMode mode; /* input maximum mode tolerated; output actual mode chosen */
 bool base_allocation; /* base:allocation context for NBD_CMD_BLOCK_STATUS */

 /* Set by server results during nbd_receive_negotiate() and
diff --git a/block/nbd.c b/block/nbd.c
index cc48580df70..676b755d79f 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -463,7 +463,8 @@ static coroutine_fn int nbd_receive_replies(BDRVNBDState *s, uint64_t cookie)
 nbd_channel_error(s, ret);
 return ret;
 }
-if (nbd_reply_is_structured(>reply) && !s->info.structured_reply) {
+if (nbd_reply_is_structured(>reply) &&
+s->info.mode < NBD_MODE_STRUCTURED) {
 nbd_channel_error(s, -EINVAL);
 return -EINVAL;
 }
@@ -866,7 +867,7 @@ static coroutine_fn int nbd_co_do_receive_one_chunk(
 }

 /* handle structured reply chunk */
-assert(s->info.structured_reply);
+assert(s->info.mode >= NBD_MODE_STRUCTURED);
 chunk = >reply.structured;

 if (chunk->type == NBD_REPLY_TYPE_NONE) {
@@ -1070,7 +1071,8 @@ nbd_co_receive_cmdread_reply(BDRVNBDState *s, uint64_t cookie,
 void *payload = NULL;
 Error *local_err = NULL;

-NBD_FOREACH_REPLY_CHUNK(s, iter, cookie, s->info.structured_reply,
+NBD_FOREACH_REPLY_CHUNK(s, iter, cookie,
+s->info.mode >= NBD_MODE_STRUCTURED,
 qiov, , )
 {
 int ret;
diff --git a/nbd/client-connection.c b/nbd/client-connection.c
index 53a6549914c..aa0201b7107 100644
--- a/nbd/client-connection.c
+++ b/nbd/client-connection.c
@@ -1,5 +1,5 @@
 /*
- * QEMU Block driver for  NBD
+ * QEMU Block driver for NBD
  *
  * Copyright (c) 2021 Virtuozzo International GmbH.
  *
@@ -93,7 +93,7 @@ NBDClientConnection *nbd_client_connection_new(const SocketAddress *saddr,
 .do_negotiation = do_negotiation,

 .initial_info.request_sizes = true,
-.initial_info.structured_reply = true,
+.initial_info.mode = NBD_MODE_STRUCTURED,
 .initial_info.base_allocation = true,
 .initial_info.x_dirty_bitmap = g_strdup(x_dirty_bitmap),
 .initial_info.name = g_strdup(export_name ?: "")
diff --git a/nbd/client.c b/nbd/client.c
index bd7e2001366..844be42181a 100644
--- a/nbd/client.c
+++ b/nbd/client.c
@@ -879,7 +879,7 @@ static int nbd_list_meta_contexts(QIOChannel *ioc,
  */
 static int nbd_start_negotiate(QIOChannel *ioc, QCryptoTLSCreds *tlscreds,
const char *hostname, QIOChannel **outioc,
-   bool structured_reply, bool *zeroes,
+   NBDMode max_mode, bool *zeroes,
Error **errp)
 {
 ERRP_GUARD();
@@ -953,7 +953,7 @@ static int nbd_start_negotiate(QIOChannel *ioc, QCryptoTLSCreds *tlscreds,
 if (fixedNewStyle) {
 int result = 0;

-if (structured_reply) {
+if (max_mode >= NBD_MODE_STRUCTURED) {
 result = nbd_request_simple_option(ioc,
NBD_OPT_STRUCTURED_REPLY,
false, errp);
@@ -1022,20 +1022,19 @@ int nbd_receive_negotiate(QIOChannel *ioc, QCryptoTLSCreds *tlscreds,
 trace_nbd_receive_negotiate_name(info->name);

 result = nbd_start_negotiate(ioc, tlscreds, hostname, outioc,
-   

Re: [PATCH v6 05/17] nbd/server: Refactor handling of command sanity checks

2023-09-25 Thread Eric Blake
On Mon, Sep 04, 2023 at 07:53:10PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> On 29.08.23 20:58, Eric Blake wrote:
> > Upcoming additions to support NBD 64-bit effect lengths will add a new
> > command flag NBD_CMD_FLAG_PAYLOAD_LEN that needs to be considered in
> > our sanity checks of the client's messages (that is, more than just
> > CMD_WRITE have the potential to carry a client payload when extended
> > headers are in effect).  But before we can start to support that, it
> > is easier to first refactor the existing set of various if statements
> > over open-coded combinations of request->type to instead be a single
> > switch statement over all command types that sets witnesses, then
> > straight-line processing based on the witnesses.  No semantic change
> > is intended.
> > 
> > Signed-off-by: Eric Blake
> 
> Reviewed-by: Vladimir Sementsov-Ogievskiy 

Patches 1-5 queued through my latest NBD pull request; will respin
patches 6 and following on top of latest master and post a v7 for one
more round of reviews.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization:  qemu.org | libguestfs.org




[PULL 6/7] nbd: Prepare for 64-bit request effect lengths

2023-09-25 Thread Eric Blake
Widen the length field of NBDRequest to 64 bits, although we can
assert that all current uses are still under 32 bits: either because
of NBD_MAX_BUFFER_SIZE which is even smaller (and where size_t can
still be appropriate, even on 32-bit platforms), or because nothing
ever puts us into NBD_MODE_EXTENDED yet (and while future patches will
allow larger transactions, the lengths in play here are still capped
at 32-bit).  There are no semantic changes, other than a typo fix in a
couple of error messages.
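
The recurring guard in this patch, shown standalone (illustrative):

/* The 32-bit cap still applies unless extended headers were
 * negotiated; the max_pwrite_zeroes/max_pdiscard limits installed
 * in nbd_refresh_limits() keep the assertion safe. */
assert(bytes <= UINT32_MAX || s->info.mode >= NBD_MODE_EXTENDED);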

Signed-off-by: Eric Blake 
Message-ID: <20230829175826.377251-23-ebl...@redhat.com>
[eblake: fix assertion bug in nbd_co_send_simple_reply]
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 include/block/nbd.h |  4 ++--
 block/nbd.c | 25 +++--
 nbd/client.c|  1 +
 nbd/server.c| 23 +++
 block/trace-events  |  2 +-
 nbd/trace-events| 14 +++---
 6 files changed, 45 insertions(+), 24 deletions(-)

diff --git a/include/block/nbd.h b/include/block/nbd.h
index 22a9b5d10e0..8a765e78dfb 100644
--- a/include/block/nbd.h
+++ b/include/block/nbd.h
@@ -71,8 +71,8 @@ typedef enum NBDMode {
  */
 typedef struct NBDRequest {
 uint64_t cookie;
-uint64_t from;
-uint32_t len;
+uint64_t from;  /* Offset touched by the command */
+uint64_t len;   /* Effect length; 32 bit limit without extended headers */
 uint16_t flags; /* NBD_CMD_FLAG_* */
 uint16_t type;  /* NBD_CMD_* */
 NBDMode mode;   /* Determines which network representation to use */
diff --git a/block/nbd.c b/block/nbd.c
index 9e99a4ddb52..4a7f37da1c6 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -1305,10 +1305,11 @@ nbd_client_co_pwrite_zeroes(BlockDriverState *bs, int64_t offset, int64_t bytes,
 NBDRequest request = {
 .type = NBD_CMD_WRITE_ZEROES,
 .from = offset,
-.len = bytes,  /* .len is uint32_t actually */
+.len = bytes,
 };

-assert(bytes <= UINT32_MAX); /* rely on max_pwrite_zeroes */
+/* rely on max_pwrite_zeroes */
+assert(bytes <= UINT32_MAX || s->info.mode >= NBD_MODE_EXTENDED);

 assert(!(s->info.flags & NBD_FLAG_READ_ONLY));
 if (!(s->info.flags & NBD_FLAG_SEND_WRITE_ZEROES)) {
@@ -1355,10 +1356,11 @@ nbd_client_co_pdiscard(BlockDriverState *bs, int64_t offset, int64_t bytes)
 NBDRequest request = {
 .type = NBD_CMD_TRIM,
 .from = offset,
-.len = bytes, /* len is uint32_t */
+.len = bytes,
 };

-assert(bytes <= UINT32_MAX); /* rely on max_pdiscard */
+/* rely on max_pdiscard */
+assert(bytes <= UINT32_MAX || s->info.mode >= NBD_MODE_EXTENDED);

 assert(!(s->info.flags & NBD_FLAG_READ_ONLY));
 if (!(s->info.flags & NBD_FLAG_SEND_TRIM) || !bytes) {
@@ -1380,8 +1382,7 @@ static int coroutine_fn GRAPH_RDLOCK nbd_client_co_block_status(
 NBDRequest request = {
 .type = NBD_CMD_BLOCK_STATUS,
 .from = offset,
-.len = MIN(QEMU_ALIGN_DOWN(INT_MAX, bs->bl.request_alignment),
-   MIN(bytes, s->info.size - offset)),
+.len = MIN(bytes, s->info.size - offset),
 .flags = NBD_CMD_FLAG_REQ_ONE,
 };

@@ -1391,6 +1392,10 @@ static int coroutine_fn GRAPH_RDLOCK nbd_client_co_block_status(
 *file = bs;
 return BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
 }
+if (s->info.mode < NBD_MODE_EXTENDED) {
+request.len = MIN(QEMU_ALIGN_DOWN(INT_MAX, bs->bl.request_alignment),
+  request.len);
+}

 /*
  * Work around the fact that the block layer doesn't do
@@ -1955,6 +1960,14 @@ static void nbd_refresh_limits(BlockDriverState *bs, Error **errp)
 bs->bl.max_pwrite_zeroes = max;
 bs->bl.max_transfer = max;

+/*
+ * Assume that if the server supports extended headers, it also
+ * supports unlimited size zero and trim commands.
+ */
+if (s->info.mode >= NBD_MODE_EXTENDED) {
+bs->bl.max_pdiscard = bs->bl.max_pwrite_zeroes = 0;
+}
+
 if (s->info.opt_block &&
 s->info.opt_block > bs->bl.opt_transfer) {
 bs->bl.opt_transfer = s->info.opt_block;
diff --git a/nbd/client.c b/nbd/client.c
index 345f1c0f2d0..cecb0f04377 100644
--- a/nbd/client.c
+++ b/nbd/client.c
@@ -1349,6 +1349,7 @@ int nbd_send_request(QIOChannel *ioc, NBDRequest *request)
 uint8_t buf[NBD_REQUEST_SIZE];

 assert(request->mode <= NBD_MODE_STRUCTURED); /* TODO handle extended */
+assert(request->len <= UINT32_MAX);
 trace_nbd_send_request(request->from, request->len, request->cookie,
request->flags, request->type,
nbd_cmd_lookup(request->type));
diff --git a/nbd/server.c b/nbd/server.c
index 35fcd9fdbc9..0ca0a4c5c25 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -1165,7 +1165,7 @@ static int nbd_negotiate_options(NBDClient *client, Error **errp)
 client->optlen = length;

 if (length > NBD_MAX_BUFFER_SIZE) {
-

[PULL 0/7] NBD patches through 2023-09-25

2023-09-25 Thread Eric Blake
The following changes since commit b55e4b9c0525560577384adfc6d30eb0daa8d7be:

  Merge tag 'pull-trivial-patches' of https://gitlab.com/mjt0k/qemu into staging (2023-09-21 09:32:47 -0400)

are available in the Git repository at:

  https://repo.or.cz/qemu/ericb.git tags/pull-nbd-2023-09-25

for you to fetch changes up to 8db7e2d65733268f7e788a69f5c6125f14f14cb0:

  nbd/server: Refactor handling of command sanity checks (2023-09-25 08:43:22 -0500)


NBD patches through 2023-09-25

- Denis V. Lunev: iotest improvements
- Eric Blake: further work towards 64-bit NBD extensions


Denis V. Lunev (2):
  iotests: use TEST_IMG_FILE instead of TEST_IMG in _require_large_file
  iotests: improve 'not run' message for nbd-multiconn test

Eric Blake (5):
  nbd: Replace bool structured_reply with mode enum
  nbd/client: Pass mode through to nbd_send_request
  nbd: Add types for extended headers
  nbd: Prepare for 64-bit request effect lengths
  nbd/server: Refactor handling of command sanity checks

 include/block/nbd.h| 142 ++-
 nbd/nbd-internal.h |   3 +-
 block/nbd.c|  44 ++---
 nbd/client-connection.c|   4 +-
 nbd/client.c   |  22 +++--
 nbd/common.c   |  12 ++-
 nbd/server.c   | 172 -
 qemu-nbd.c |   4 +-
 tests/qemu-iotests/common.rc   |   9 +-
 block/trace-events |   2 +-
 nbd/trace-events   |  14 +--
 tests/qemu-iotests/tests/nbd-multiconn |   2 +-
 12 files changed, 280 insertions(+), 150 deletions(-)

-- 
2.41.0




[PULL 2/7] iotests: improve 'not run' message for nbd-multiconn test

2023-09-25 Thread Eric Blake
From: "Denis V. Lunev" 

The test actually requires Python bindings to libnbd rather than libnbd
itself. Clarify that inside the message.

Signed-off-by: Denis V. Lunev 
CC: Kevin Wolf 
CC: Hanna Reitz 
CC: Eric Blake 
CC: Vladimir Sementsov-Ogievskiy 
Message-ID: <20230906140917.559129-3-...@openvz.org>
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Eric Blake 
Signed-off-by: Eric Blake 
---
 tests/qemu-iotests/tests/nbd-multiconn | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tests/qemu-iotests/tests/nbd-multiconn b/tests/qemu-iotests/tests/nbd-multiconn
index b121f2e363d..478a1eaba27 100755
--- a/tests/qemu-iotests/tests/nbd-multiconn
+++ b/tests/qemu-iotests/tests/nbd-multiconn
@@ -142,4 +142,4 @@ if __name__ == '__main__':

 iotests.main(supported_fmts=['qcow2'])
 except ImportError:
-iotests.notrun('libnbd not installed')
+iotests.notrun('Python bindings to libnbd are not installed')
-- 
2.41.0




[PULL 5/7] nbd: Add types for extended headers

2023-09-25 Thread Eric Blake
Add the constants and structs necessary for later patches to start
implementing the NBD_OPT_EXTENDED_HEADERS extension in both the client
and server, matching recent upstream nbd.git (through commit
e6f3b94a934).  This patch does not change any existing behavior, but
merely sets the stage for upcoming patches.

This patch does not change the status quo that neither the client nor
server use a packed-struct representation for the request header.
While most of the patch adds new types, there is also some churn for
renaming the existing NBDExtent to NBDExtent32 to contrast it with
NBDExtent64, which I thought was a nicer name than NBDExtentExt.
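
One payoff of the shared layout asserted below (illustrative):

/* The cookie sits at the same offset in every reply form, so it can
 * be matched to a pending request before the reply type is parsed. */
static uint64_t nbd_reply_cookie(const NBDReply *reply)
{
    return reply->cookie;  /* valid for simple, structured, extended */
}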

Signed-off-by: Eric Blake 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Message-ID: <20230829175826.377251-22-ebl...@redhat.com>
---
 include/block/nbd.h | 124 +++-
 nbd/nbd-internal.h  |   3 +-
 block/nbd.c |   6 +--
 nbd/common.c|  12 -
 nbd/server.c|   6 +--
 5 files changed, 106 insertions(+), 45 deletions(-)

diff --git a/include/block/nbd.h b/include/block/nbd.h
index e07b9f9bff7..22a9b5d10e0 100644
--- a/include/block/nbd.h
+++ b/include/block/nbd.h
@@ -60,7 +60,7 @@ typedef enum NBDMode {
 NBD_MODE_EXPORT_NAME,  /* newstyle but only OPT_EXPORT_NAME safe */
 NBD_MODE_SIMPLE,   /* newstyle but only simple replies */
 NBD_MODE_STRUCTURED,   /* newstyle, structured replies enabled */
-/* TODO add NBD_MODE_EXTENDED */
+NBD_MODE_EXTENDED, /* newstyle, extended headers enabled */
 } NBDMode;

 /* Transmission phase structs */
@@ -93,20 +93,36 @@ typedef struct NBDStructuredReplyChunk {
 uint32_t length; /* length of payload */
 } QEMU_PACKED NBDStructuredReplyChunk;

+typedef struct NBDExtendedReplyChunk {
+uint32_t magic;  /* NBD_EXTENDED_REPLY_MAGIC */
+uint16_t flags;  /* combination of NBD_REPLY_FLAG_* */
+uint16_t type;   /* NBD_REPLY_TYPE_* */
+uint64_t cookie; /* request handle */
+uint64_t offset; /* request offset */
+uint64_t length; /* length of payload */
+} QEMU_PACKED NBDExtendedReplyChunk;
+
 typedef union NBDReply {
 NBDSimpleReply simple;
 NBDStructuredReplyChunk structured;
+NBDExtendedReplyChunk extended;
 struct {
 /*
- * @magic and @cookie fields have the same offset and size both in
- * simple reply and structured reply chunk, so let them be accessible
- * without ".simple." or ".structured." specification
+ * @magic and @cookie fields have the same offset and size in all
+ * forms of replies, so let them be accessible without ".simple.",
+ * ".structured.", or ".extended." specifications.
  */
 uint32_t magic;
 uint32_t _skip;
 uint64_t cookie;
-} QEMU_PACKED;
+};
 } NBDReply;
+QEMU_BUILD_BUG_ON(offsetof(NBDReply, simple.cookie) !=
+  offsetof(NBDReply, cookie));
+QEMU_BUILD_BUG_ON(offsetof(NBDReply, structured.cookie) !=
+  offsetof(NBDReply, cookie));
+QEMU_BUILD_BUG_ON(offsetof(NBDReply, extended.cookie) !=
+  offsetof(NBDReply, cookie));

 /* Header of chunk for NBD_REPLY_TYPE_OFFSET_DATA */
 typedef struct NBDStructuredReadData {
@@ -133,14 +149,34 @@ typedef struct NBDStructuredError {
 typedef struct NBDStructuredMeta {
 /* header's length >= 12 (at least one extent) */
 uint32_t context_id;
-/* extents follows */
+/* NBDExtent32 extents[] follows, array length implied by header */
 } QEMU_PACKED NBDStructuredMeta;

-/* Extent chunk for NBD_REPLY_TYPE_BLOCK_STATUS */
-typedef struct NBDExtent {
+/* Extent array element for NBD_REPLY_TYPE_BLOCK_STATUS */
+typedef struct NBDExtent32 {
 uint32_t length;
 uint32_t flags; /* NBD_STATE_* */
-} QEMU_PACKED NBDExtent;
+} QEMU_PACKED NBDExtent32;
+
+/* Header of NBD_REPLY_TYPE_BLOCK_STATUS_EXT */
+typedef struct NBDExtendedMeta {
+/* header's length >= 24 (at least one extent) */
+uint32_t context_id;
+uint32_t count; /* header length must be count * 16 + 8 */
+/* NBDExtent64 extents[count] follows */
+} QEMU_PACKED NBDExtendedMeta;
+
+/* Extent array element for NBD_REPLY_TYPE_BLOCK_STATUS_EXT */
+typedef struct NBDExtent64 {
+uint64_t length;
+uint64_t flags; /* NBD_STATE_* */
+} QEMU_PACKED NBDExtent64;
+
+/* Client payload for limiting NBD_CMD_BLOCK_STATUS reply */
+typedef struct NBDBlockStatusPayload {
+uint64_t effect_length;
+/* uint32_t ids[] follows, array length implied by header */
+} QEMU_PACKED NBDBlockStatusPayload;

 /* Transmission (export) flags: sent from server to client during handshake,
but describe what will happen during transmission */
@@ -158,20 +194,22 @@ enum {
 NBD_FLAG_SEND_RESIZE_BIT=  9, /* Send resize */
 NBD_FLAG_SEND_CACHE_BIT = 10, /* Send CACHE (prefetch) */
 NBD_FLAG_SEND_FAST_ZERO_BIT = 11, /* FAST_ZERO flag for WRITE_ZEROES */
+NBD_FLAG_BLOCK_STAT_PAYLOAD_BIT = 12, /* PAYLOAD flag for 

[PULL 7/7] nbd/server: Refactor handling of command sanity checks

2023-09-25 Thread Eric Blake
Upcoming additions to support NBD 64-bit effect lengths will add a new
command flag NBD_CMD_FLAG_PAYLOAD_LEN that needs to be considered in
our sanity checks of the client's messages (that is, more than just
CMD_WRITE have the potential to carry a client payload when extended
headers are in effect).  But before we can start to support that, it
is easier to first refactor the existing set of various if statements
over open-coded combinations of request->type to instead be a single
switch statement over all command types that sets witnesses, then
straight-line processing based on the witnesses.  No semantic change
is intended.
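
The witness pattern in miniature (illustrative sketch, not the full
refactoring below):

bool check_length = false;

switch (request->type) {
case NBD_CMD_READ:
case NBD_CMD_WRITE:
case NBD_CMD_CACHE:
    check_length = true;   /* classify once per command type... */
    break;
default:
    break;
}
if (check_length && request->len > NBD_MAX_BUFFER_SIZE) {
    return -EINVAL;        /* ...then apply each check exactly once */
}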

Signed-off-by: Eric Blake 
Message-ID: <20230829175826.377251-24-ebl...@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 nbd/server.c | 118 ---
 1 file changed, 74 insertions(+), 44 deletions(-)

diff --git a/nbd/server.c b/nbd/server.c
index 0ca0a4c5c25..7a6f95071f8 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -2317,11 +2317,16 @@ static int coroutine_fn nbd_co_send_bitmap(NBDClient *client,
  * to the client (although the caller may still need to disconnect after
  * reporting the error).
  */
-static int coroutine_fn nbd_co_receive_request(NBDRequestData *req, NBDRequest *request,
+static int coroutine_fn nbd_co_receive_request(NBDRequestData *req,
+   NBDRequest *request,
Error **errp)
 {
 NBDClient *client = req->client;
-int valid_flags;
+bool check_length = false;
+bool check_rofs = false;
+bool allocate_buffer = false;
+unsigned payload_len = 0;
+int valid_flags = NBD_CMD_FLAG_FUA;
 int ret;

 g_assert(qemu_in_coroutine());
@@ -2333,55 +2338,88 @@ static int coroutine_fn nbd_co_receive_request(NBDRequestData *req, NBDRequest *

 trace_nbd_co_receive_request_decode_type(request->cookie, request->type,
  nbd_cmd_lookup(request->type));
-
-if (request->type != NBD_CMD_WRITE) {
-/* No payload, we are ready to read the next request.  */
-req->complete = true;
-}
-
-if (request->type == NBD_CMD_DISC) {
+switch (request->type) {
+case NBD_CMD_DISC:
 /* Special case: we're going to disconnect without a reply,
  * whether or not flags, from, or len are bogus */
+req->complete = true;
 return -EIO;
-}

-if (request->type == NBD_CMD_READ || request->type == NBD_CMD_WRITE ||
-request->type == NBD_CMD_CACHE)
-{
-if (request->len > NBD_MAX_BUFFER_SIZE) {
-error_setg(errp, "len (%" PRIu64 ") is larger than max len (%u)",
-   request->len, NBD_MAX_BUFFER_SIZE);
-return -EINVAL;
+case NBD_CMD_READ:
+if (client->mode >= NBD_MODE_STRUCTURED) {
+valid_flags |= NBD_CMD_FLAG_DF;
 }
+check_length = true;
+allocate_buffer = true;
+break;

-if (request->type != NBD_CMD_CACHE) {
-req->data = blk_try_blockalign(client->exp->common.blk,
-   request->len);
-if (req->data == NULL) {
-error_setg(errp, "No memory");
-return -ENOMEM;
-}
-}
+case NBD_CMD_WRITE:
+payload_len = request->len;
+check_length = true;
+allocate_buffer = true;
+check_rofs = true;
+break;
+
+case NBD_CMD_FLUSH:
+break;
+
+case NBD_CMD_TRIM:
+check_rofs = true;
+break;
+
+case NBD_CMD_CACHE:
+check_length = true;
+break;
+
+case NBD_CMD_WRITE_ZEROES:
+valid_flags |= NBD_CMD_FLAG_NO_HOLE | NBD_CMD_FLAG_FAST_ZERO;
+check_rofs = true;
+break;
+
+case NBD_CMD_BLOCK_STATUS:
+valid_flags |= NBD_CMD_FLAG_REQ_ONE;
+break;
+
+default:
+/* Unrecognized, will fail later */
+;
 }

-if (request->type == NBD_CMD_WRITE) {
-assert(request->len <= NBD_MAX_BUFFER_SIZE);
-if (nbd_read(client->ioc, req->data, request->len, "CMD_WRITE data",
- errp) < 0)
-{
+/* Payload and buffer handling. */
+if (!payload_len) {
+req->complete = true;
+}
+if (check_length && request->len > NBD_MAX_BUFFER_SIZE) {
+/* READ, WRITE, CACHE */
+error_setg(errp, "len (%" PRIu64 ") is larger than max len (%u)",
+   request->len, NBD_MAX_BUFFER_SIZE);
+return -EINVAL;
+}
+if (allocate_buffer) {
+/* READ, WRITE */
+req->data = blk_try_blockalign(client->exp->common.blk,
+   request->len);
+if (req->data == NULL) {
+error_setg(errp, "No memory");
+return -ENOMEM;
+}
+}
+if (payload_len) {
+/* WRITE */
+assert(req->data);
+ret = 

[PULL 4/7] nbd/client: Pass mode through to nbd_send_request

2023-09-25 Thread Eric Blake
Once the 64-bit headers extension is enabled, the data layout we send
over the wire for a client request depends on the mode negotiated with
the server.  Rather than adding a parameter to nbd_send_request, we
can add a member to struct NBDRequest, since it already does not
reflect on-wire format.  Some callers initialize it directly; many
others rely on a common initialization point during
nbd_co_send_request().  At this point, there is no semantic change.

Signed-off-by: Eric Blake 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Message-ID: <20230829175826.377251-21-ebl...@redhat.com>
---
 include/block/nbd.h | 12 +++-
 block/nbd.c |  5 +++--
 nbd/client.c|  3 ++-
 3 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/include/block/nbd.h b/include/block/nbd.h
index 53226764574..e07b9f9bff7 100644
--- a/include/block/nbd.h
+++ b/include/block/nbd.h
@@ -63,17 +63,19 @@ typedef enum NBDMode {
 /* TODO add NBD_MODE_EXTENDED */
 } NBDMode;

-/* Transmission phase structs
- *
- * Note: these are _NOT_ the same as the network representation of an NBD
- * request and reply!
+/* Transmission phase structs */
+
+/*
+ * Note: NBDRequest is _NOT_ the same as the network representation of an NBD
+ * request!
  */
 typedef struct NBDRequest {
 uint64_t cookie;
 uint64_t from;
 uint32_t len;
 uint16_t flags; /* NBD_CMD_FLAG_* */
-uint16_t type; /* NBD_CMD_* */
+uint16_t type;  /* NBD_CMD_* */
+NBDMode mode;   /* Determines which network representation to use */
 } NBDRequest;

 typedef struct NBDSimpleReply {
diff --git a/block/nbd.c b/block/nbd.c
index 676b755d79f..24f50b79e47 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -339,7 +339,7 @@ int coroutine_fn nbd_co_do_establish_connection(BlockDriverState *bs,
  * We have connected, but must fail for other reasons.
  * Send NBD_CMD_DISC as a courtesy to the server.
  */
-NBDRequest request = { .type = NBD_CMD_DISC };
+NBDRequest request = { .type = NBD_CMD_DISC, .mode = s->info.mode };

 nbd_send_request(s->ioc, );

@@ -520,6 +520,7 @@ nbd_co_send_request(BlockDriverState *bs, NBDRequest *request,

 qemu_co_mutex_lock(>send_mutex);
 request->cookie = INDEX_TO_COOKIE(i);
+request->mode = s->info.mode;

 assert(s->ioc);

@@ -1465,7 +1466,7 @@ static void nbd_yank(void *opaque)
 static void nbd_client_close(BlockDriverState *bs)
 {
 BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
-NBDRequest request = { .type = NBD_CMD_DISC };
+NBDRequest request = { .type = NBD_CMD_DISC, .mode = s->info.mode };

 if (s->ioc) {
 nbd_send_request(s->ioc, );
diff --git a/nbd/client.c b/nbd/client.c
index 844be42181a..345f1c0f2d0 100644
--- a/nbd/client.c
+++ b/nbd/client.c
@@ -1218,7 +1218,7 @@ int nbd_receive_export_list(QIOChannel *ioc, QCryptoTLSCreds *tlscreds,
 /* Send NBD_CMD_DISC as a courtesy to the server, but ignore all
  * errors now that we have the information we wanted. */
 if (nbd_drop(ioc, 124, NULL) == 0) {
-NBDRequest request = { .type = NBD_CMD_DISC };
+NBDRequest request = { .type = NBD_CMD_DISC, .mode = result };

 nbd_send_request(ioc, );
 }
@@ -1348,6 +1348,7 @@ int nbd_send_request(QIOChannel *ioc, NBDRequest *request)
 {
 uint8_t buf[NBD_REQUEST_SIZE];

+assert(request->mode <= NBD_MODE_STRUCTURED); /* TODO handle extended */
 trace_nbd_send_request(request->from, request->len, request->cookie,
request->flags, request->type,
nbd_cmd_lookup(request->type));
-- 
2.41.0




[PULL 1/7] iotests: use TEST_IMG_FILE instead of TEST_IMG in _require_large_file

2023-09-25 Thread Eric Blake
From: "Denis V. Lunev" 

We need to check that we are able to create a large enough file, which
is used as an export base rather than the connection URL. Unfortunately,
there are cases when TEST_IMG_FILE is not defined. We should fall back
to TEST_IMG in that case.

This problem was detected when running:
./check -nbd 5
The test should be able to run, but it does not.

Signed-off-by: Denis V. Lunev 
CC: Kevin Wolf 
CC: Hanna Reitz 
CC: Eric Blake 
CC: Vladimir Sementsov-Ogievskiy 
Message-ID: <20230906140917.559129-2-...@openvz.org>
Tested-by: Eric Blake 
Reviewed-by: Eric Blake 
Signed-off-by: Eric Blake 
---
 tests/qemu-iotests/common.rc | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/tests/qemu-iotests/common.rc b/tests/qemu-iotests/common.rc
index d145f08201c..95c12577dd4 100644
--- a/tests/qemu-iotests/common.rc
+++ b/tests/qemu-iotests/common.rc
@@ -979,10 +979,15 @@ _require_drivers()
 #
 _require_large_file()
 {
-if ! truncate --size="$1" "$TEST_IMG"; then
+if [ -z "$TEST_IMG_FILE" ]; then
+FILENAME="$TEST_IMG"
+else
+FILENAME="$TEST_IMG_FILE"
+fi
+if ! truncate --size="$1" "$FILENAME"; then
 _notrun "file system on $TEST_DIR does not support large enough files"
 fi
-rm "$TEST_IMG"
+rm "$FILENAME"
 }

 # Check that a set of devices is available in the QEMU binary
-- 
2.41.0




[PATCH] m68k: Silence -Wshadow=local warnings in the m68k code

2023-09-25 Thread Thomas Huth
Rename the innermost variables to make the code compile
without warnings when using -Wshadow=local.
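
For context, this is the kind of construct -Wshadow=local flags
(illustrative example, not from the tree):

static int example(int val)
{
    int sum = 0;

    for (int i = 0; i < val; i++) {
        int val = i * i;   /* warning: 'val' shadows a previous local */
        sum += val;
    }
    return sum;
}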

Signed-off-by: Thomas Huth 
---
 hw/m68k/bootinfo.h  | 10 --
 disas/m68k.c|  8 
 target/m68k/translate.c |  8 
 3 files changed, 12 insertions(+), 14 deletions(-)

diff --git a/hw/m68k/bootinfo.h b/hw/m68k/bootinfo.h
index a3d37e3c80..d077d03559 100644
--- a/hw/m68k/bootinfo.h
+++ b/hw/m68k/bootinfo.h
@@ -44,15 +44,14 @@
 
 #define BOOTINFOSTR(base, id, string) \
 do { \
-int i; \
 stw_p(base, id); \
 base += 2; \
 stw_p(base, \
  (sizeof(struct bi_record) + strlen(string) + \
   1 /* null termination */ + 3 /* padding */) & ~3); \
 base += 2; \
-for (i = 0; string[i]; i++) { \
-stb_p(base++, string[i]); \
+for (int _i = 0; string[_i]; _i++) { \
+stb_p(base++, string[_i]); \
 } \
 stb_p(base++, 0); \
 base = QEMU_ALIGN_PTR_UP(base, 4); \
@@ -60,7 +59,6 @@
 
 #define BOOTINFODATA(base, id, data, len) \
 do { \
-int i; \
 stw_p(base, id); \
 base += 2; \
 stw_p(base, \
@@ -69,8 +67,8 @@
 base += 2; \
 stw_p(base, len); \
 base += 2; \
-for (i = 0; i < len; ++i) { \
-stb_p(base++, data[i]); \
+for (int _i = 0; _i < len; ++_i) { \
+stb_p(base++, data[_i]); \
 } \
 base = QEMU_ALIGN_PTR_UP(base, 4); \
 } while (0)
diff --git a/disas/m68k.c b/disas/m68k.c
index aefaecfbd6..a384b4cb64 100644
--- a/disas/m68k.c
+++ b/disas/m68k.c
@@ -1632,10 +1632,10 @@ print_insn_arg (const char *d,
 case '2':
 case '3':
   {
-   int val = fetch_arg (buffer, place, 5, info);
+   int val2 = fetch_arg (buffer, place, 5, info);
 const char *name = 0;
 
-   switch (val)
+   switch (val2)
  {
  case 2: name = "%tt0"; break;
  case 3: name = "%tt1"; break;
@@ -1655,12 +1655,12 @@ print_insn_arg (const char *d,
  int break_reg = ((buffer[3] >> 2) & 7);
 
  (*info->fprintf_func)
-   (info->stream, val == 0x1c ? "%%bad%d" : "%%bac%d",
+   (info->stream, val2 == 0x1c ? "%%bad%d" : "%%bac%d",
 break_reg);
}
break;
  default:
-   (*info->fprintf_func) (info->stream, "<mmu register %d>", val);
+   (*info->fprintf_func) (info->stream, "<mmu register %d>", val2);
  }
if (name)
  (*info->fprintf_func) (info->stream, "%s", name);
diff --git a/target/m68k/translate.c b/target/m68k/translate.c
index 9e224fe796..b28d7f7d4b 100644
--- a/target/m68k/translate.c
+++ b/target/m68k/translate.c
@@ -824,14 +824,14 @@ static TCGv gen_ea_mode(CPUM68KState *env, DisasContext *s, int mode, int reg0,
 reg = get_areg(s, reg0);
 result = gen_ldst(s, opsize, reg, val, what, index);
 if (what == EA_STORE || !addrp) {
-TCGv tmp = tcg_temp_new();
+TCGv tmp2 = tcg_temp_new();
 if (reg0 == 7 && opsize == OS_BYTE &&
 m68k_feature(s->env, M68K_FEATURE_M68K)) {
-tcg_gen_addi_i32(tmp, reg, 2);
+tcg_gen_addi_i32(tmp2, reg, 2);
 } else {
-tcg_gen_addi_i32(tmp, reg, opsize_bytes(opsize));
+tcg_gen_addi_i32(tmp2, reg, opsize_bytes(opsize));
 }
-delay_set_areg(s, reg0, tmp, true);
+delay_set_areg(s, reg0, tmp2, true);
 }
 return result;
 case 4: /* Indirect predecrememnt.  */
-- 
2.41.0




[PATCH v6 20/23] bsd-user: Implement shm_unlink(2) and shmget(2)

2023-09-25 Thread Karim Taha
From: Stacey Son 

Signed-off-by: Stacey Son 
Signed-off-by: Karim Taha 
Reviewed-by: Warner Losh 
Reviewed-by: Richard Henderson 
---
 bsd-user/bsd-mem.h| 23 +++
 bsd-user/freebsd/os-syscall.c |  8 
 2 files changed, 31 insertions(+)

diff --git a/bsd-user/bsd-mem.h b/bsd-user/bsd-mem.h
index f8dc943c23..c362cc07a3 100644
--- a/bsd-user/bsd-mem.h
+++ b/bsd-user/bsd-mem.h
@@ -282,4 +282,27 @@ static inline abi_long do_bsd_shm_open(abi_ulong arg1, abi_long arg2,
 return ret;
 }
 
+/* shm_unlink(2) */
+static inline abi_long do_bsd_shm_unlink(abi_ulong arg1)
+{
+int ret;
+void *p;
+
+p = lock_user_string(arg1);
+if (p == NULL) {
+return -TARGET_EFAULT;
+}
+ret = get_errno(shm_unlink(p)); /* XXX path(p)? */
+unlock_user(p, arg1, 0);
+
+return ret;
+}
+
+/* shmget(2) */
+static inline abi_long do_bsd_shmget(abi_long arg1, abi_ulong arg2,
+abi_long arg3)
+{
+return get_errno(shmget(arg1, arg2, arg3));
+}
+
 #endif /* BSD_USER_BSD_MEM_H */
diff --git a/bsd-user/freebsd/os-syscall.c b/bsd-user/freebsd/os-syscall.c
index effa6dac54..f0ccd787e5 100644
--- a/bsd-user/freebsd/os-syscall.c
+++ b/bsd-user/freebsd/os-syscall.c
@@ -655,6 +655,14 @@ static abi_long freebsd_syscall(void *cpu_env, int num, abi_long arg1,
 break;
 #endif
 
+case TARGET_FREEBSD_NR_shm_unlink: /* shm_unlink(2) */
+ret = do_bsd_shm_unlink(arg1);
+break;
+
+case TARGET_FREEBSD_NR_shmget: /* shmget(2) */
+ret = do_bsd_shmget(arg1, arg2, arg3);
+break;
+
 /*
  * Misc
  */
-- 
2.42.0




[PATCH v5 07/28] bsd-user: Implement target_to_host_rlim and host_to_target_rlim conversion.

2023-09-25 Thread Karim Taha
From: Stacey Son 

Signed-off-by: Stacey Son 
Signed-off-by: Karim Taha 
Reviewed-by: Richard Henderson 
Reviewed-by: Warner Losh 
---
 bsd-user/bsd-proc.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/bsd-user/bsd-proc.c b/bsd-user/bsd-proc.c
index 68410a0aa9..19e39a2f76 100644
--- a/bsd-user/bsd-proc.c
+++ b/bsd-user/bsd-proc.c
@@ -38,3 +38,13 @@ int target_to_host_resource(int code)
 return code;
 }
 
+rlim_t target_to_host_rlim(abi_llong target_rlim)
+{
+return tswap64(target_rlim);
+}
+
+abi_llong host_to_target_rlim(rlim_t rlim)
+{
+return tswap64(rlim);
+}
+
-- 
2.42.0




[PATCH v6 06/23] bsd-user: Implement shm_rename(2) system call

2023-09-25 Thread Karim Taha
From: Kyle Evans 

Signed-off-by: Kyle Evans 
Signed-off-by: Karim Taha 
Reviewed-by: Richard Henderson 
Reviewed-by: Warner Losh 
---
 bsd-user/freebsd/os-misc.h| 24 
 bsd-user/freebsd/os-syscall.c |  6 ++
 2 files changed, 30 insertions(+)

diff --git a/bsd-user/freebsd/os-misc.h b/bsd-user/freebsd/os-misc.h
index 6b424b7078..67e450fe7c 100644
--- a/bsd-user/freebsd/os-misc.h
+++ b/bsd-user/freebsd/os-misc.h
@@ -66,5 +66,29 @@ static inline abi_long do_freebsd_shm_open2(abi_ulong pathptr, abi_ulong flags,
 }
 #endif /* __FreeBSD_version >= 1300048 */
 
+#if defined(__FreeBSD_version) && __FreeBSD_version >= 1300049
+/* shm_rename(2) */
+static inline abi_long do_freebsd_shm_rename(abi_ulong fromptr, abi_ulong toptr,
+abi_ulong flags)
+{
+int ret;
+void *ufrom, *uto;
+
+ufrom = lock_user_string(fromptr);
+if (ufrom == NULL) {
+return -TARGET_EFAULT;
+}
+uto = lock_user_string(toptr);
+if (uto == NULL) {
+unlock_user(ufrom, fromptr, 0);
+return -TARGET_EFAULT;
+}
+ret = get_errno(shm_rename(ufrom, uto, flags));
+unlock_user(ufrom, fromptr, 0);
+unlock_user(uto, toptr, 0);
+
+return ret;
+}
+#endif /* __FreeBSD_version >= 1300049 */
 
 #endif /* OS_MISC_H */
diff --git a/bsd-user/freebsd/os-syscall.c b/bsd-user/freebsd/os-syscall.c
index 74146d8c72..ae92a2314c 100644
--- a/bsd-user/freebsd/os-syscall.c
+++ b/bsd-user/freebsd/os-syscall.c
@@ -603,6 +603,12 @@ static abi_long freebsd_syscall(void *cpu_env, int num, abi_long arg1,
 break;
 #endif
 
+#if defined(__FreeBSD_version) && __FreeBSD_version >= 1300049
+case TARGET_FREEBSD_NR_shm_rename: /* shm_rename(2) */
+ret = do_freebsd_shm_rename(arg1, arg2, arg3);
+break;
+#endif
+
 /*
  * sys{ctl, arch, call}
  */
-- 
2.42.0




[PATCH v6 01/23] bsd-user: Implement struct target_ipc_perm

2023-09-25 Thread Karim Taha
From: Stacey Son 

Signed-off-by: Stacey Son 
Signed-off-by: Karim Taha 
Reviewed-by: Richard Henderson 
Reviewed-by: Warner Losh 
---
 bsd-user/syscall_defs.h | 17 +
 1 file changed, 17 insertions(+)

diff --git a/bsd-user/syscall_defs.h b/bsd-user/syscall_defs.h
index 9c90616baa..4deb4fed35 100644
--- a/bsd-user/syscall_defs.h
+++ b/bsd-user/syscall_defs.h
@@ -55,6 +55,23 @@ struct target_iovec {
 abi_long iov_len;   /* Number of bytes */
 };
 
+/*
+ * sys/ipc.h
+ */
+struct target_ipc_perm {
+uint32_tcuid;   /* creator user id */
+uint32_tcgid;   /* creator group id */
+uint32_tuid;/* user id */
+uint32_tgid;/* group id */
+uint16_tmode;   /* r/w permission */
+uint16_tseq;/* sequence # */
+abi_longkey;/* user specified msg/sem/shm key */
+};
+
+#define TARGET_IPC_RMID 0   /* remove identifier */
+#define TARGET_IPC_SET  1   /* set options */
+#define TARGET_IPC_STAT 2   /* get options */
+
 /*
  *  sys/mman.h
  */
-- 
2.42.0




[PATCH v6 02/23] bsd-user: Implement struct target_shmid_ds

2023-09-25 Thread Karim Taha
From: Stacey Son 

Signed-off-by: Stacey Son 
Signed-off-by: Karim Taha 
Reviewed-by: Richard Henderson 
Reviewed-by: Warner Losh 
---
 bsd-user/syscall_defs.h | 20 
 1 file changed, 20 insertions(+)

diff --git a/bsd-user/syscall_defs.h b/bsd-user/syscall_defs.h
index 4deb4fed35..f4a5ae2a12 100644
--- a/bsd-user/syscall_defs.h
+++ b/bsd-user/syscall_defs.h
@@ -72,6 +72,26 @@ struct target_ipc_perm {
 #define TARGET_IPC_SET  1   /* set options */
 #define TARGET_IPC_STAT 2   /* get options */
 
+/*
+ * sys/shm.h
+ */
+struct target_shmid_ds {
+struct  target_ipc_perm shm_perm; /* operation permission structure */
+abi_ulong   shm_segsz;  /* size of segment in bytes */
+int32_t shm_lpid;   /* process ID of last shared memory op */
+int32_t shm_cpid;   /* process ID of creator */
+int32_t shm_nattch; /* number of current attaches */
+target_time_t shm_atime;  /* time of last shmat() */
+target_time_t shm_dtime;  /* time of last shmdt() */
+target_time_t shm_ctime;  /* time of last change by shmctl() */
+};
+
+#define N_BSD_SHM_REGIONS   32
+struct bsd_shm_regions {
+abi_long start;
+abi_long size;
+};
+
 /*
  *  sys/mman.h
  */
-- 
2.42.0




[PATCH v6 19/23] bsd-user: Implement shm_open(2)

2023-09-25 Thread Karim Taha
From: Stacey Son 

Co-authored-by: Kyle Evans 

Signed-off-by: Stacey Son 
Signed-off-by: Kyle Evans 
Signed-off-by: Karim Taha 
Reviewed-by: Richard Henderson 
---
 bsd-user/bsd-mem.h| 25 +
 bsd-user/freebsd/os-syscall.c |  4 
 2 files changed, 29 insertions(+)

diff --git a/bsd-user/bsd-mem.h b/bsd-user/bsd-mem.h
index b296c5c6f0..f8dc943c23 100644
--- a/bsd-user/bsd-mem.h
+++ b/bsd-user/bsd-mem.h
@@ -257,4 +257,29 @@ static inline abi_long do_obreak(abi_ulong brk_val)
 return target_brk;
 }
 
+/* shm_open(2) */
+static inline abi_long do_bsd_shm_open(abi_ulong arg1, abi_long arg2,
+abi_long arg3)
+{
+int ret;
+void *p;
+
+if (arg1 == (uintptr_t)SHM_ANON) {
+p = SHM_ANON;
+} else {
+p = lock_user_string(arg1);
+if (p == NULL) {
+return -TARGET_EFAULT;
+}
+}
+ret = get_errno(shm_open(p, target_to_host_bitmask(arg2, fcntl_flags_tbl),
+ arg3));
+
+if (p != SHM_ANON) {
+unlock_user(p, arg1, 0);
+}
+
+return ret;
+}
+
 #endif /* BSD_USER_BSD_MEM_H */
diff --git a/bsd-user/freebsd/os-syscall.c b/bsd-user/freebsd/os-syscall.c
index 5cd60fc272..effa6dac54 100644
--- a/bsd-user/freebsd/os-syscall.c
+++ b/bsd-user/freebsd/os-syscall.c
@@ -639,6 +639,10 @@ static abi_long freebsd_syscall(void *cpu_env, int num, abi_long arg1,
 ret = do_bsd_mincore(arg1, arg2, arg3);
 break;
 
+case TARGET_FREEBSD_NR_freebsd12_shm_open: /* shm_open(2) */
+ret = do_bsd_shm_open(arg1, arg2, arg3);
+break;
+
 #if defined(__FreeBSD_version) && __FreeBSD_version >= 1300048
 case TARGET_FREEBSD_NR_shm_open2: /* shm_open2(2) */
 ret = do_freebsd_shm_open2(arg1, arg2, arg3, arg4, arg5);
-- 
2.42.0




[PATCH v6 18/23] bsd-user: Implement do_obreak function

2023-09-25 Thread Karim Taha
From: Stacey Son 

Match linux-user by manually applying the following commits, in order:

d28b3c90cfad1a7e211ae2bce36ecb9071086129   linux-user: Make sure initial brk(0) is page-aligned
15ad98536ad9410fb32ddf1ff09389b677643faa   linux-user: Fix qemu brk() to not zero bytes on current page
dfe49864afb06e7e452a4366051697bc4fcfc1a5   linux-user: Prohibit brk() to to shrink below initial heap address
eac78a4b0b7da4de2c0a297f4d528ca9cc6256a3   linux-user: Fix signed math overflow in brk() syscall
c6cc059eca18d9f6e4e26bb8b6d1135ddb35d81a   linux-user: Do not call get_errno() in do_brk()
e69e032d1a8ee8d754ca119009a3c2c997f8bb30   linux-user: Use MAP_FIXED_NOREPLACE for do_brk()
cb9d5d1fda0bc2312fc0c779b4ea1d7bf826f31f   linux-user: Do nothing if too small brk is specified
2aea137a425a87b930a33590177b04368fd7cc12   linux-user: Do not align brk with host page size

Signed-off-by: Stacey Son 
Signed-off-by: Karim Taha 
Reviewed-by: Richard Henderson 
---
 bsd-user/bsd-mem.h| 45 +++
 bsd-user/freebsd/os-syscall.c |  7 ++
 2 files changed, 52 insertions(+)

diff --git a/bsd-user/bsd-mem.h b/bsd-user/bsd-mem.h
index 0c8d96d9a4..b296c5c6f0 100644
--- a/bsd-user/bsd-mem.h
+++ b/bsd-user/bsd-mem.h
@@ -212,4 +212,49 @@ static inline abi_long do_bsd_mincore(abi_ulong target_addr, abi_ulong len,
 return ret;
 }
 
+/* do_brk() must return target values and target errnos. */
+static inline abi_long do_obreak(abi_ulong brk_val)
+{
+abi_long mapped_addr;
+abi_ulong new_brk;
+abi_ulong old_brk;
+
+/* brk pointers are always untagged */
+
+/* do not allow to shrink below initial brk value */
+if (brk_val < initial_target_brk) {
+return target_brk;
+}
+
+new_brk = TARGET_PAGE_ALIGN(brk_val);
+old_brk = TARGET_PAGE_ALIGN(target_brk);
+
+/* new and old target_brk might be on the same page */
+if (new_brk == old_brk) {
+target_brk = brk_val;
+return target_brk;
+}
+
+/* Release heap if necessary */
+if (new_brk < old_brk) {
+target_munmap(new_brk, old_brk - new_brk);
+
+target_brk = brk_val;
+return target_brk;
+}
+
+mapped_addr = target_mmap(old_brk, new_brk - old_brk,
+  PROT_READ | PROT_WRITE,
+  MAP_FIXED | MAP_EXCL | MAP_ANON | MAP_PRIVATE,
+  -1, 0);
+
+if (mapped_addr == old_brk) {
+target_brk = brk_val;
+return target_brk;
+}
+
+/* For everything else, return the previous break. */
+return target_brk;
+}
+
 #endif /* BSD_USER_BSD_MEM_H */
diff --git a/bsd-user/freebsd/os-syscall.c b/bsd-user/freebsd/os-syscall.c
index 8ba5fcc6ca..5cd60fc272 100644
--- a/bsd-user/freebsd/os-syscall.c
+++ b/bsd-user/freebsd/os-syscall.c
@@ -651,6 +651,13 @@ static abi_long freebsd_syscall(void *cpu_env, int num, abi_long arg1,
 break;
 #endif
 
+/*
+ * Misc
+ */
+case TARGET_FREEBSD_NR_break:
+ret = do_obreak(arg1);
+break;
+
 /*
  * sys{ctl, arch, call}
  */
-- 
2.42.0




[PATCH v6 00/23] bsd-user: Implement mmap related system calls for FreeBSD.

2023-09-25 Thread Karim Taha
Upstream the implementation of the following mmap system calls, from the
qemu-bsd-user fork:
   mmap(2), munmap(2),
   mprotect(2),
   msync(2),
   mlock(2), munlock(2), mlockall(2), munlockall(2), mincore(2),
   madvise(2),
   minherit(2),
   shm_open(2), shm_open2(2), shm_rename(2), shm_unlink(2), shmget(2), shmctl(2), shmat(2),
   shmdt(2)
   brk(2)

Karim Taha (3):
  bsd-user: Implement shm_open2(2) system call
  bsd-user: Add bsd-mem.c to meson.build
  bsd-user: Implement madvise(2) to match the linux-user implementation.

Kyle Evans (1):
  bsd-user: Implement shm_rename(2) system call

Stacey Son (18):
  bsd-user: Implement struct target_ipc_perm
  bsd-user: Implement struct target_shmid_ds
  bsd-user: Declarations for ipc_perm and shmid_ds conversion functions
  bsd-user: Introduce freebsd/os-misc.h to the source tree
  bsd-user: Implement target_set_brk function in bsd-mem.c instead of
os-syscall.c
  bsd-user: Implement ipc_perm conversion between host and target.
  bsd-user: Implement shmid_ds conversion between host and target.
  bsd-user: Introduce bsd-mem.h to the source tree
  bsd-user: Implement mmap(2) and munmap(2)
  bsd-user: Implement mprotect(2)
  bsd-user: Implement msync(2)
  bsd-user: Implement mlock(2), munlock(2), mlockall(2), munlockall(2),
minherit(2)
  bsd-user: Implement mincore(2)
  bsd-user: Implement do_obreak function
  bsd-user: Implement shm_open(2)
  bsd-user: Implement shm_unlink(2) and shmget(2)
  bsd-user: Implement shmctl(2)
  bsd-user: Implement shmat(2) and shmdt(2)

Warner Losh (1):
  bsd-user: Add stubs for vadvise(), sbrk() and sstk()

 bsd-user/bsd-mem.c| 104 
 bsd-user/bsd-mem.h| 452 ++
 bsd-user/freebsd/os-misc.h|  94 +++
 bsd-user/freebsd/os-syscall.c | 109 +++-
 bsd-user/meson.build  |   1 +
 bsd-user/mmap.c   |   2 +-
 bsd-user/qemu-bsd.h   |  45 
 bsd-user/qemu.h   |   1 +
 bsd-user/syscall_defs.h   |  39 +++
 9 files changed, 842 insertions(+), 5 deletions(-)
 create mode 100644 bsd-user/bsd-mem.c
 create mode 100644 bsd-user/bsd-mem.h
 create mode 100644 bsd-user/freebsd/os-misc.h
 create mode 100644 bsd-user/qemu-bsd.h

-- 
2.42.0




[PATCH v6 22/23] bsd-user: Implement shmat(2) and shmdt(2)

2023-09-25 Thread Karim Taha
From: Stacey Son 

Use `WITH_MMAP_LOCK_GUARD` instead of mmap_lock() and mmap_unlock(),
to match linux-user implementation, according to the following commits:

69fa2708a216df715ba5102a0f98468b540a464e linux-user: Use WITH_MMAP_LOCK_GUARD in target_{shmat,shmdt}
ceda5688b650646248f269a992c06b11148c5759 linux-user: Fix shmdt
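
For reference, the guard's shape (illustrative; assuming the same
semantics as the linux-user macro, the lock is released on any scope
exit, including early return):

WITH_MMAP_LOCK_GUARD() {
    if (some_error) {
        return -TARGET_EINVAL;  /* mmap lock released automatically */
    }
    /* ... critical section over the guest memory map ... */
}   /* mmap lock released here as well */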

Signed-off-by: Stacey Son 
Signed-off-by: Karim Taha 
---
 bsd-user/bsd-mem.h| 87 +++
 bsd-user/freebsd/os-syscall.c |  8 
 bsd-user/mmap.c   |  2 +-
 bsd-user/qemu.h   |  1 +
 4 files changed, 97 insertions(+), 1 deletion(-)

diff --git a/bsd-user/bsd-mem.h b/bsd-user/bsd-mem.h
index b82f3eaa25..c512a4e375 100644
--- a/bsd-user/bsd-mem.h
+++ b/bsd-user/bsd-mem.h
@@ -344,4 +344,91 @@ static inline abi_long do_bsd_shmctl(abi_long shmid, abi_long cmd,
 return ret;
 }
 
+/* shmat(2) */
+static inline abi_long do_bsd_shmat(int shmid, abi_ulong shmaddr, int shmflg)
+{
+abi_ulong raddr;
+abi_long ret;
+struct shmid_ds shm_info;
+
+/* Find out the length of the shared memory segment. */
+ret = get_errno(shmctl(shmid, IPC_STAT, _info));
+if (is_error(ret)) {
+/* Can't get the length */
+return ret;
+}
+
+if (!guest_range_valid_untagged(shmaddr, shm_info.shm_segsz)) {
+return -TARGET_EINVAL;
+}
+
+WITH_MMAP_LOCK_GUARD() {
+void *host_raddr;
+
+if (shmaddr) {
+host_raddr = shmat(shmid, (void *)g2h_untagged(shmaddr), shmflg);
+} else {
+abi_ulong mmap_start;
+
+mmap_start = mmap_find_vma(0, shm_info.shm_segsz);
+
+if (mmap_start == -1) {
+return -TARGET_ENOMEM;
+}
+host_raddr = shmat(shmid, g2h_untagged(mmap_start),
+   shmflg | SHM_REMAP);
+}
+
+if (host_raddr == (void *)-1) {
+return get_errno(-1);
+}
+raddr = h2g(host_raddr);
+
+page_set_flags(raddr, raddr + shm_info.shm_segsz - 1,
+   PAGE_VALID | PAGE_RESET | PAGE_READ |
+   (shmflg & SHM_RDONLY ? 0 : PAGE_WRITE));
+
+for (int i = 0; i < N_BSD_SHM_REGIONS; i++) {
+if (bsd_shm_regions[i].start == 0) {
+bsd_shm_regions[i].start = raddr;
+bsd_shm_regions[i].size = shm_info.shm_segsz;
+break;
+}
+}
+}
+
+return raddr;
+}
+
+/* shmdt(2) */
+static inline abi_long do_bsd_shmdt(abi_ulong shmaddr)
+{
+abi_long ret;
+
+WITH_MMAP_LOCK_GUARD() {
+int i;
+
+for (i = 0; i < N_BSD_SHM_REGIONS; ++i) {
+if (bsd_shm_regions[i].start == shmaddr) {
+break;
+}
+}
+
+if (i == N_BSD_SHM_REGIONS) {
+return -TARGET_EINVAL;
+}
+
+ret = get_errno(shmdt(g2h_untagged(shmaddr)));
+if (ret == 0) {
+abi_ulong size = bsd_shm_regions[i].size;
+
+bsd_shm_regions[i].start = 0;
+page_set_flags(shmaddr, shmaddr + size - 1, 0);
+mmap_reserve(shmaddr, size);
+}
+}
+
+return ret;
+}
+
 #endif /* BSD_USER_BSD_MEM_H */
diff --git a/bsd-user/freebsd/os-syscall.c b/bsd-user/freebsd/os-syscall.c
index 664b8de104..6b32d4df68 100644
--- a/bsd-user/freebsd/os-syscall.c
+++ b/bsd-user/freebsd/os-syscall.c
@@ -667,6 +667,14 @@ static abi_long freebsd_syscall(void *cpu_env, int num, abi_long arg1,
 ret = do_bsd_shmctl(arg1, arg2, arg3);
 break;
 
+case TARGET_FREEBSD_NR_shmat: /* shmat(2) */
+ret = do_bsd_shmat(arg1, arg2, arg3);
+break;
+
+case TARGET_FREEBSD_NR_shmdt: /* shmdt(2) */
+ret = do_bsd_shmdt(arg1);
+break;
+
 /*
  * Misc
  */
diff --git a/bsd-user/mmap.c b/bsd-user/mmap.c
index 8e148a2ea3..3ef11b2807 100644
--- a/bsd-user/mmap.c
+++ b/bsd-user/mmap.c
@@ -636,7 +636,7 @@ fail:
 return -1;
 }
 
-static void mmap_reserve(abi_ulong start, abi_ulong size)
+void mmap_reserve(abi_ulong start, abi_ulong size)
 {
 abi_ulong real_start;
 abi_ulong real_end;
diff --git a/bsd-user/qemu.h b/bsd-user/qemu.h
index d9507137cc..d67dd76827 100644
--- a/bsd-user/qemu.h
+++ b/bsd-user/qemu.h
@@ -232,6 +232,7 @@ abi_long target_mremap(abi_ulong old_addr, abi_ulong 
old_size,
 int target_msync(abi_ulong start, abi_ulong len, int flags);
 extern abi_ulong mmap_next_start;
 abi_ulong mmap_find_vma(abi_ulong start, abi_ulong size);
+void mmap_reserve(abi_ulong start, abi_ulong size);
 void TSA_NO_TSA mmap_fork_start(void);
 void TSA_NO_TSA mmap_fork_end(int child);
 
-- 
2.42.0




[PATCH v6 21/23] bsd-user: Implement shmctl(2)

2023-09-25 Thread Karim Taha
From: Stacey Son 

Signed-off-by: Stacey Son 
Signed-off-by: Karim Taha 
Reviewed-by: Richard Henderson 
---
 bsd-user/bsd-mem.h| 39 +++
 bsd-user/freebsd/os-syscall.c |  4 
 2 files changed, 43 insertions(+)

diff --git a/bsd-user/bsd-mem.h b/bsd-user/bsd-mem.h
index c362cc07a3..b82f3eaa25 100644
--- a/bsd-user/bsd-mem.h
+++ b/bsd-user/bsd-mem.h
@@ -305,4 +305,43 @@ static inline abi_long do_bsd_shmget(abi_long arg1, abi_ulong arg2,
 return get_errno(shmget(arg1, arg2, arg3));
 }
 
+/* shmctl(2) */
+static inline abi_long do_bsd_shmctl(abi_long shmid, abi_long cmd,
+abi_ulong buff)
+{
+struct shmid_ds dsarg;
+abi_long ret = -TARGET_EINVAL;
+
+cmd &= 0xff;
+
+switch (cmd) {
+case IPC_STAT:
+if (target_to_host_shmid_ds(&dsarg, buff)) {
+return -TARGET_EFAULT;
+}
+ret = get_errno(shmctl(shmid, cmd, &dsarg));
+if (host_to_target_shmid_ds(buff, &dsarg)) {
+return -TARGET_EFAULT;
+}
+break;
+
+case IPC_SET:
+if (target_to_host_shmid_ds(&dsarg, buff)) {
+return -TARGET_EFAULT;
+}
+ret = get_errno(shmctl(shmid, cmd, &dsarg));
+break;
+
+case IPC_RMID:
+ret = get_errno(shmctl(shmid, cmd, NULL));
+break;
+
+default:
+ret = -TARGET_EINVAL;
+break;
+}
+
+return ret;
+}
+
 #endif /* BSD_USER_BSD_MEM_H */
diff --git a/bsd-user/freebsd/os-syscall.c b/bsd-user/freebsd/os-syscall.c
index f0ccd787e5..664b8de104 100644
--- a/bsd-user/freebsd/os-syscall.c
+++ b/bsd-user/freebsd/os-syscall.c
@@ -663,6 +663,10 @@ static abi_long freebsd_syscall(void *cpu_env, int num, abi_long arg1,
 ret = do_bsd_shmget(arg1, arg2, arg3);
 break;
 
+case TARGET_FREEBSD_NR_shmctl: /* shmctl(2) */
+ret = do_bsd_shmctl(arg1, arg2, arg3);
+break;
+
 /*
  * Misc
  */
-- 
2.42.0




[PATCH v6 23/23] bsd-user: Add stubs for vadvise(), sbrk() and sstk()

2023-09-25 Thread Karim Taha
From: Warner Losh 

The above system calls are not supported by QEMU, so stub them out;
the stubs fail with the same errors the native FreeBSD kernel returns.
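
As a rough illustration (hypothetical guest code, not part of the
patch), a guest issuing one of these raw syscalls under bsd-user now
gets an ordinary errno back:

    #include <sys/syscall.h>
    #include <errno.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* Invoke sbrk(2) directly; the do_bsd_sbrk() stub answers
         * -TARGET_EOPNOTSUPP, surfaced to the guest as errno. */
        if (syscall(SYS_sbrk, 16) == -1 && errno == EOPNOTSUPP) {
            perror("sbrk");   /* "Operation not supported" */
        }
        return 0;
    }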

Signed-off-by: Warner Losh 
Signed-off-by: Karim Taha 
Reviewed-by: Richard Henderson 
---
 bsd-user/bsd-mem.h| 18 ++
 bsd-user/freebsd/os-syscall.c | 12 
 2 files changed, 30 insertions(+)

diff --git a/bsd-user/bsd-mem.h b/bsd-user/bsd-mem.h
index c512a4e375..c3e72e3b86 100644
--- a/bsd-user/bsd-mem.h
+++ b/bsd-user/bsd-mem.h
@@ -431,4 +431,22 @@ static inline abi_long do_bsd_shmdt(abi_ulong shmaddr)
 return ret;
 }
 
+static inline abi_long do_bsd_vadvise(void)
+{
+/* See sys_ovadvise() in vm_unix.c */
+return -TARGET_EINVAL;
+}
+
+static inline abi_long do_bsd_sbrk(void)
+{
+/* see sys_sbrk() in vm_mmap.c */
+return -TARGET_EOPNOTSUPP;
+}
+
+static inline abi_long do_bsd_sstk(void)
+{
+/* see sys_sstk() in vm_mmap.c */
+return -TARGET_EOPNOTSUPP;
+}
+
 #endif /* BSD_USER_BSD_MEM_H */
diff --git a/bsd-user/freebsd/os-syscall.c b/bsd-user/freebsd/os-syscall.c
index 6b32d4df68..ce2a6bc29e 100644
--- a/bsd-user/freebsd/os-syscall.c
+++ b/bsd-user/freebsd/os-syscall.c
@@ -675,6 +675,18 @@ static abi_long freebsd_syscall(void *cpu_env, int num, abi_long arg1,
 ret = do_bsd_shmdt(arg1);
 break;
 
+case TARGET_FREEBSD_NR_freebsd11_vadvise:
+ret = do_bsd_vadvise();
+break;
+
+case TARGET_FREEBSD_NR_sbrk:
+ret = do_bsd_sbrk();
+break;
+
+case TARGET_FREEBSD_NR_sstk:
+ret = do_bsd_sstk();
+break;
+
 /*
  * Misc
  */
-- 
2.42.0




[PATCH v6 11/23] bsd-user: Introduce bsd-mem.h to the source tree

2023-09-25 Thread Karim Taha
From: Stacey Son 

Preserve the copyright notice and keep the 'Author' info accurate for
subsequent changes to the file.

Signed-off-by: Stacey Son 
Signed-off-by: Karim Taha 
Reviewed-by: Warner Losh 
Reviewed-by: Richard Henderson 
---
 bsd-user/bsd-mem.h| 64 +++
 bsd-user/freebsd/os-syscall.c |  1 +
 2 files changed, 65 insertions(+)
 create mode 100644 bsd-user/bsd-mem.h

diff --git a/bsd-user/bsd-mem.h b/bsd-user/bsd-mem.h
new file mode 100644
index 00..d865e0807d
--- /dev/null
+++ b/bsd-user/bsd-mem.h
@@ -0,0 +1,64 @@
+/*
+ *  memory management system call shims and definitions
+ *
+ *  Copyright (c) 2013-15 Stacey D. Son
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+/*
+ * Copyright (c) 1982, 1986, 1993
+ *  The Regents of the University of California.  All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *notice, this list of conditions and the following disclaimer in the
+ *documentation and/or other materials provided with the distribution.
+ * 4. Neither the name of the University nor the names of its contributors
+ *may be used to endorse or promote products derived from this software
+ *without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+#ifndef BSD_USER_BSD_MEM_H
+#define BSD_USER_BSD_MEM_H
+
+#include <sys/types.h>
+#include <sys/ipc.h>
+#include <sys/mman.h>
+#include <sys/shm.h>
+#include <fcntl.h>
+
+#include "qemu-bsd.h"
+
+extern struct bsd_shm_regions bsd_shm_regions[];
+extern abi_ulong target_brk;
+extern abi_ulong initial_target_brk;
+
+#endif /* BSD_USER_BSD_MEM_H */
diff --git a/bsd-user/freebsd/os-syscall.c b/bsd-user/freebsd/os-syscall.c
index 4c99760a21..42cd52a406 100644
--- a/bsd-user/freebsd/os-syscall.c
+++ b/bsd-user/freebsd/os-syscall.c
@@ -35,6 +35,7 @@
 
 /* BSD independent syscall shims */
 #include "bsd-file.h"
+#include "bsd-mem.h"
 #include "bsd-proc.h"
 
 /* *BSD dependent syscall shims */
-- 
2.42.0




[PATCH v6 03/23] bsd-user: Declarations for ipc_perm and shmid_ds conversion functions

2023-09-25 Thread Karim Taha
From: Stacey Son 

Signed-off-by: Stacey Son 
Signed-off-by: Karim Taha 
Reviewed-by: Richard Henderson 
Reviewed-by: Warner Losh 
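
The converters are meant to bracket the host call, guest-to-host
before and host-to-target after. A hypothetical caller, mirroring the
shmctl(2) shim added later in this series, would look roughly like:

    struct shmid_ds host_ds;
    abi_long ret;

    if (target_to_host_shmid_ds(&host_ds, target_addr)) {   /* guest -> host */
        return -TARGET_EFAULT;
    }
    ret = get_errno(shmctl(shmid, IPC_STAT, &host_ds));     /* native call */
    if (host_to_target_shmid_ds(target_addr, &host_ds)) {   /* host -> guest */
        return -TARGET_EFAULT;
    }
    return ret;
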
---
 bsd-user/qemu-bsd.h | 45 +
 1 file changed, 45 insertions(+)
 create mode 100644 bsd-user/qemu-bsd.h

diff --git a/bsd-user/qemu-bsd.h b/bsd-user/qemu-bsd.h
new file mode 100644
index 00..46572ece7d
--- /dev/null
+++ b/bsd-user/qemu-bsd.h
@@ -0,0 +1,45 @@
+/*
+ *  BSD conversion extern declarations
+ *
+ *  Copyright (c) 2013 Stacey D. Son
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef QEMU_BSD_H
+#define QEMU_BSD_H
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/* bsd-mem.c */
+void target_to_host_ipc_perm__locked(struct ipc_perm *host_ip,
+struct target_ipc_perm *target_ip);
+void host_to_target_ipc_perm__locked(struct target_ipc_perm *target_ip,
+struct ipc_perm *host_ip);
+abi_long target_to_host_shmid_ds(struct shmid_ds *host_sd,
+abi_ulong target_addr);
+abi_long host_to_target_shmid_ds(abi_ulong target_addr,
+struct shmid_ds *host_sd);
+
+#endif /* QEMU_BSD_H */
-- 
2.42.0




[PATCH v5 18/28] bsd-user: Implement getpriority(2) and setpriority(2).

2023-09-25 Thread Karim Taha
From: Stacey Son 

Signed-off-by: Stacey Son 
Signed-off-by: Karim Taha 
Reviewed-by: Warner Losh 
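
As context for the shim below: getpriority(2) returns a nice value
that can legitimately be -1, so success and failure can only be told
apart via errno, which has to be cleared first. The same idiom in a
standalone native program, for reference:

    #include <sys/resource.h>
    #include <errno.h>
    #include <stdio.h>

    int main(void)
    {
        errno = 0;
        int prio = getpriority(PRIO_PROCESS, 0); /* 0: calling process */
        if (prio == -1 && errno != 0) {
            perror("getpriority");            /* genuine failure */
        } else {
            printf("nice value: %d\n", prio); /* -1 is a valid value */
        }
        return 0;
    }
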
---
 bsd-user/bsd-proc.h   | 24 
 bsd-user/freebsd/os-syscall.c |  8 
 2 files changed, 32 insertions(+)

diff --git a/bsd-user/bsd-proc.h b/bsd-user/bsd-proc.h
index 2c1a9ae22f..9a8912361f 100644
--- a/bsd-user/bsd-proc.h
+++ b/bsd-user/bsd-proc.h
@@ -390,4 +390,28 @@ static inline abi_long do_bsd_ptrace(abi_long arg1, abi_long arg2,
 return -TARGET_ENOSYS;
 }
 
+/* getpriority(2) */
+static inline abi_long do_bsd_getpriority(abi_long which, abi_long who)
+{
+abi_long ret;
+/*
+ * Note that negative values are valid for getpriority, so we must
+ * differentiate based on errno settings.
+ */
+errno = 0;
+ret = getpriority(which, who);
+if (ret == -1 && errno != 0) {
+return -host_to_target_errno(errno);
+}
+
+return ret;
+}
+
+/* setpriority(2) */
+static inline abi_long do_bsd_setpriority(abi_long which, abi_long who,
+  abi_long prio)
+{
+return get_errno(setpriority(which, who, prio));
+}
+
 #endif /* !BSD_PROC_H_ */
diff --git a/bsd-user/freebsd/os-syscall.c b/bsd-user/freebsd/os-syscall.c
index 1a760b1380..71a2657dd0 100644
--- a/bsd-user/freebsd/os-syscall.c
+++ b/bsd-user/freebsd/os-syscall.c
@@ -359,6 +359,14 @@ static abi_long freebsd_syscall(void *cpu_env, int num, abi_long arg1,
 ret = do_bsd_ptrace(arg1, arg2, arg3, arg4);
 break;
 
+case TARGET_FREEBSD_NR_getpriority: /* getpriority(2) */
+ret = do_bsd_getpriority(arg1, arg2);
+break;
+
+case TARGET_FREEBSD_NR_setpriority: /* setpriority(2) */
+ret = do_bsd_setpriority(arg1, arg2, arg3);
+break;
+
 
 /*
  * File system calls.
-- 
2.42.0




[PATCH v5 11/28] bsd-user: Implement getgroups(2) and setgroups(2) system calls.

2023-09-25 Thread Karim Taha
From: Stacey Son 

Signed-off-by: Stacey Son 
Signed-off-by: Karim Taha 
Reviewed-by: Warner Losh 
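
As context for the shims below: the gid array is copied through
lock_user()/unlock_user() and each 32-bit element is byte-swapped with
tswap32(), which is a no-op when guest and host byte order agree. A
rough standalone sketch of that swap (GUEST_ORDER_DIFFERS is a
hypothetical stand-in; QEMU selects the real behaviour at build time):

    #include <stdint.h>

    /* Unconditional 32-bit byte reversal, as in qemu/bswap.h. */
    static inline uint32_t bswap32(uint32_t x)
    {
        return ((x & 0x000000ffu) << 24) | ((x & 0x0000ff00u) <<  8) |
               ((x & 0x00ff0000u) >>  8) | ((x & 0xff000000u) >> 24);
    }

    /* tswap32 sketch: swap only when byte orders differ. */
    static inline uint32_t tswap32_sketch(uint32_t x)
    {
    #ifdef GUEST_ORDER_DIFFERS
        return bswap32(x);
    #else
        return x;
    #endif
    }
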
---
 bsd-user/bsd-proc.h   | 44 +++
 bsd-user/freebsd/os-syscall.c |  9 +++
 2 files changed, 53 insertions(+)

diff --git a/bsd-user/bsd-proc.h b/bsd-user/bsd-proc.h
index b6225e520e..7b25aa1982 100644
--- a/bsd-user/bsd-proc.h
+++ b/bsd-user/bsd-proc.h
@@ -41,4 +41,48 @@ static inline abi_long do_bsd_exit(void *cpu_env, abi_long arg1)
 return 0;
 }
 
+/* getgroups(2) */
+static inline abi_long do_bsd_getgroups(abi_long gidsetsize, abi_long arg2)
+{
+abi_long ret;
+uint32_t *target_grouplist;
+g_autofree gid_t *grouplist;
+int i;
+
+grouplist = g_try_new(gid_t, gidsetsize);
+ret = get_errno(getgroups(gidsetsize, grouplist));
+if (gidsetsize != 0) {
+if (!is_error(ret)) {
+target_grouplist = lock_user(VERIFY_WRITE, arg2, gidsetsize * 2, 0);
+if (!target_grouplist) {
+return -TARGET_EFAULT;
+}
+for (i = 0; i < ret; i++) {
+target_grouplist[i] = tswap32(grouplist[i]);
+}
+unlock_user(target_grouplist, arg2, gidsetsize * 2);
+}
+}
+return ret;
+}
+
+/* setgroups(2) */
+static inline abi_long do_bsd_setgroups(abi_long gidsetsize, abi_long arg2)
+{
+uint32_t *target_grouplist;
+g_autofree gid_t *grouplist;
+int i;
+
+grouplist = g_try_new(gid_t, gidsetsize);
+target_grouplist = lock_user(VERIFY_READ, arg2, gidsetsize * 2, 1);
+if (!target_grouplist) {
+return -TARGET_EFAULT;
+}
+for (i = 0; i < gidsetsize; i++) {
+grouplist[i] = tswap32(target_grouplist[i]);
+}
+unlock_user(target_grouplist, arg2, 0);
+return get_errno(setgroups(gidsetsize, grouplist));
+}
+
 #endif /* !BSD_PROC_H_ */
diff --git a/bsd-user/freebsd/os-syscall.c b/bsd-user/freebsd/os-syscall.c
index fa60df529e..535e6287bd 100644
--- a/bsd-user/freebsd/os-syscall.c
+++ b/bsd-user/freebsd/os-syscall.c
@@ -223,6 +223,15 @@ static abi_long freebsd_syscall(void *cpu_env, int num, abi_long arg1,
 ret = do_bsd_exit(cpu_env, arg1);
 break;
 
+case TARGET_FREEBSD_NR_getgroups: /* getgroups(2) */
+ret = do_bsd_getgroups(arg1, arg2);
+break;
+
+case TARGET_FREEBSD_NR_setgroups: /* setgroups(2) */
+ret = do_bsd_setgroups(arg1, arg2);
+break;
+
+
 /*
  * File system calls.
  */
-- 
2.42.0




[PATCH v6 04/23] bsd-user: Introduce freebsd/os-misc.h to the source tree

2023-09-25 Thread Karim Taha
From: Stacey Son 

Introduce the file to preserve the copyright notice and to keep the
'Author' info accurate for subsequent changes.

Signed-off-by: Stacey Son 
Signed-off-by: Karim Taha 
Reviewed-by: Richard Henderson 
Reviewed-by: Warner Losh 
---
 bsd-user/freebsd/os-misc.h | 28 
 1 file changed, 28 insertions(+)
 create mode 100644 bsd-user/freebsd/os-misc.h

diff --git a/bsd-user/freebsd/os-misc.h b/bsd-user/freebsd/os-misc.h
new file mode 100644
index 00..8436ccb2f7
--- /dev/null
+++ b/bsd-user/freebsd/os-misc.h
@@ -0,0 +1,28 @@
+/*
+ *  miscellaneous FreeBSD system call shims
+ *
+ *  Copyright (c) 2013-14 Stacey D. Son
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef OS_MISC_H
+#define OS_MISC_H
+
+#include 
+#include 
+#include 
+
+
+#endif /* OS_MISC_H */
-- 
2.42.0



