Re: ublk-qcow2: ublk-qcow2 is available

2022-10-04 Thread Ming Lei
On Tue, Oct 04, 2022 at 09:53:32AM -0400, Stefan Hajnoczi wrote:
> On Tue, 4 Oct 2022 at 05:44, Ming Lei  wrote:
> >
> > On Mon, Oct 03, 2022 at 03:53:41PM -0400, Stefan Hajnoczi wrote:
> > > On Fri, Sep 30, 2022 at 05:24:11PM +0800, Ming Lei wrote:
> > > > ublk-qcow2 is available now.
> > >
> > > Cool, thanks for sharing!
> > >
> > > >
> > > > So far it provides basic read/write functionality; compression and
> > > > snapshot aren't supported yet. The target/backend implementation is
> > > > completely based on io_uring, and shares the same io_uring with the
> > > > ublk IO command handler, just like what ublk-loop does.
> > > >
> > > > The main motivations of ublk-qcow2 are as follows:
> > > >
> > > > - building one complicated target from scratch helps the libublksrv
> > > >   APIs/functions become mature/stable more quickly, since qcow2 is
> > > >   complicated and places more requirements on libublksrv than the
> > > >   simple targets (loop, null)
> > > >
> > > > - there have been several attempts to implement a qcow2 driver in the
> > > >   kernel, such as ``qloop`` [2], ``dm-qcow2`` [3] and ``in kernel
> > > >   qcow2(ro)`` [4], so ublk-qcow2 might be useful for covering the
> > > >   requirements in this field
> > > >
> > > > - performance comparison with qemu-nbd; evaluating the performance of
> > > >   the ublk/io_uring backend by writing a ublk-qcow2 target has been my
> > > >   first thought since ublksrv was started
> > > >
> > > > - helping to abstract common building blocks or design patterns for
> > > >   writing new ublk targets/backends
> > > >
> > > > So far it basically passes the xfstests (XFS) suite using a ublk-qcow2
> > > > block device as TEST_DEV, and a kernel-building workload is verified
> > > > too. Also, a soft-update approach is applied in meta flushing, so
> > > > metadata integrity is guaranteed; 'make test T=qcow2/040' covers this
> > > > kind of test, and only a cluster leak is reported during this test.
> > > >
> > > > The performance data looks much better compared with qemu-nbd; see
> > > > details in the commit log[1], README[5] and STATUS[6]. The test covers
> > > > both empty and pre-allocated images; for example, for a pre-allocated
> > > > qcow2 image (8GB):
> > > >
> > > > - qemu-nbd (make test T=qcow2/002)
> > >
> > > Single queue?
> >
> > Yeah.
> >
> > >
> > > > randwrite(4k): jobs 1, iops 24605
> > > > randread(4k): jobs 1, iops 30938
> > > > randrw(4k): jobs 1, iops read 13981 write 14001
> > > > rw(512k): jobs 1, iops read 724 write 728
> > >
> > > Please try qemu-storage-daemon's VDUSE export type as well. The
> > > command-line should be similar to this:
> > >
> > >   # modprobe virtio_vdpa # attaches vDPA devices to host kernel
> >
> > The virtio_vdpa module isn't found, even though I enabled all of the
> > following options:
> >
> > --- vDPA drivers
> >  vDPA device simulator core
> >vDPA simulator for networking device
> >vDPA simulator for block device
> >  VDUSE (vDPA Device in Userspace) support
> >  Intel IFC VF vDPA driver
> >  Virtio PCI bridge vDPA driver
> >  vDPA driver for Alibaba ENI
> >
> > BTW, my test environment is a VM and the shared data was collected in a
> > VM too; can virtio_vdpa be used inside a VM?
> 
> I hope Xie Yongji can help explain how to benchmark VDUSE.
> 
> virtio_vdpa is available inside guests too. Please check that
> VIRTIO_VDPA ("vDPA driver for virtio devices") is enabled in "Virtio
> drivers" menu.
> 
> >
> > >   # modprobe vduse
> > >   # qemu-storage-daemon \
> > >   --blockdev 
> > > file,filename=test.qcow2,cache.direct=on|off,aio=native,node-name=file \
> > >   --blockdev qcow2,file=file,node-name=qcow2 \
> > >   --object iothread,id=iothread0 \
> > >   --export 
> > > vduse-blk,id=vduse0,name=vduse0,num-queues=$(nproc),node-name=qcow2,writable=on,iothread=iothread0
> > >   # vdpa dev add name vduse0 mgmtdev vduse
> > >
> > > A virtio-blk device should appear and xfstests can be run on it
> > > (typically /dev/vda unless you already have other virtio-blk devices).
> > >
> > > Afterwards you can destroy the device using:
> > >
> > >   # vdpa dev del vduse0
> > >
> > > >
> > > > - ublk-qcow2 (make test T=qcow2/022)
> > >
> > > There are a lot of other factors not directly related to NBD vs ublk. In
> > > order to get an apples-to-apples comparison with qemu-* a ublk export
> > > type is needed in qemu-storage-daemon. That way only the difference is
> > > the ublk interface and the rest of the code path is identical, making it
> > > possible to compare NBD, VDUSE, ublk, etc more precisely.
> >
> > Maybe not true.
> >
> > ublk-qcow2 uses io_uring to handle all backend IO (including meta IO)
> > completely, and so far a single io_uring/pthread handles all qcow2 IOs
> > and IO commands.
> 
> qemu-nbd doesn't use io_uring to handle the backend IO, so we don't

I tried to use it via 

Re: [PATCH v2 2/2] block: introduce zone append write for zoned devices

2022-10-04 Thread Damien Le Moal
On 9/29/22 18:31, Sam Li wrote:
> A zone append command is a write operation that specifies the first
> logical block of a zone as the write position. When writing to a zoned
> block device using zone append, the byte offset of the write points to
> the write pointer of that zone. Upon completion, the device responds
> with the position at which the data has been written in the zone.
> 
> Signed-off-by: Sam Li 
> ---
>  block/block-backend.c  | 65 ++
>  block/file-posix.c | 51 +++
>  block/io.c | 21 ++
>  block/raw-format.c |  7 
>  include/block/block-io.h   |  3 ++
>  include/block/block_int-common.h   |  3 ++
>  include/sysemu/block-backend-io.h  |  9 +
>  qemu-io-cmds.c | 62 
>  tests/qemu-iotests/tests/zoned.out |  7 
>  tests/qemu-iotests/tests/zoned.sh  |  9 +
>  10 files changed, 237 insertions(+)
> 
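As a rough sketch of the contract described above (not part of the patch;
the caller function and its name are hypothetical, coroutine context and
error handling simplified), the new blk_co_zone_append() would be used
like this:

    /* Append one buffer to the zone starting at byte offset 'zone_start'. */
    static int coroutine_fn append_to_zone(BlockBackend *blk,
                                           int64_t zone_start,
                                           QEMUIOVector *qiov)
    {
        int64_t offset = zone_start;  /* in: first byte of the target zone */
        int ret = blk_co_zone_append(blk, &offset, qiov, 0);
        if (ret < 0) {
            return ret;
        }
        /* out: 'offset' now holds the position where the data landed */
        return 0;
    }
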
> diff --git a/block/block-backend.c b/block/block-backend.c
> index f7f7acd6f4..07a8632af1 100644
> --- a/block/block-backend.c
> +++ b/block/block-backend.c
> @@ -1439,6 +1439,9 @@ typedef struct BlkRwCo {
>  struct {
>  BlockZoneOp op;
>  } zone_mgmt;
> +struct {
> +int64_t *append_sector;
> +} zone_append;
>  };
>  } BlkRwCo;
>  
> @@ -1869,6 +1872,47 @@ BlockAIOCB *blk_aio_zone_mgmt(BlockBackend *blk, 
> BlockZoneOp op,
>  return &acb->common;
>  }
>  
> +static void blk_aio_zone_append_entry(void *opaque) {
> +BlkAioEmAIOCB *acb = opaque;
> +BlkRwCo *rwco = &acb->rwco;
> +
> +rwco->ret = blk_co_zone_append(rwco->blk, 
> rwco->zone_append.append_sector,
> +   rwco->iobuf, rwco->flags);
> +blk_aio_complete(acb);
> +}
> +
> +BlockAIOCB *blk_aio_zone_append(BlockBackend *blk, int64_t *offset,
> +QEMUIOVector *qiov, BdrvRequestFlags flags,
> +BlockCompletionFunc *cb, void *opaque) {
> +BlkAioEmAIOCB *acb;
> +Coroutine *co;
> +IO_CODE();
> +
> +blk_inc_in_flight(blk);
> +acb = blk_aio_get(&blk_aio_em_aiocb_info, blk, cb, opaque);
> +acb->rwco = (BlkRwCo) {
> +.blk= blk,
> +.ret= NOT_DONE,
> +.flags  = flags,
> +.iobuf  = qiov,
> +.zone_append = {
> +.append_sector = offset,
> +},
> +};
> +acb->has_returned = false;
> +
> +co = qemu_coroutine_create(blk_aio_zone_append_entry, acb);
> +bdrv_coroutine_enter(blk_bs(blk), co);
> +
> +acb->has_returned = true;
> +if (acb->rwco.ret != NOT_DONE) {
> +replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
> + blk_aio_complete_bh, acb);
> +}
> +
> +return &acb->common;
> +}
> +
>  /*
>   * Send a zone_report command.
>   * offset is a byte offset from the start of the device. No alignment
> @@ -1921,6 +1965,27 @@ int coroutine_fn blk_co_zone_mgmt(BlockBackend *blk, 
> BlockZoneOp op,
>  return ret;
>  }
>  
> +/*
> + * Send a zone_append command.
> + */
> +int coroutine_fn blk_co_zone_append(BlockBackend *blk, int64_t *offset,
> +QEMUIOVector *qiov, BdrvRequestFlags flags)
> +{
> +int ret;
> +IO_CODE();
> +
> +blk_inc_in_flight(blk);
> +blk_wait_while_drained(blk);
> +if (!blk_is_available(blk)) {
> +blk_dec_in_flight(blk);
> +return -ENOMEDIUM;
> +}
> +
> +ret = bdrv_co_zone_append(blk_bs(blk), offset, qiov, flags);
> +blk_dec_in_flight(blk);
> +return ret;
> +}
> +
>  void blk_drain(BlockBackend *blk)
>  {
>  BlockDriverState *bs = blk_bs(blk);
> diff --git a/block/file-posix.c b/block/file-posix.c
> index 33e81ac112..24b70f1afe 100755
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -3454,6 +3454,56 @@ static int coroutine_fn 
> raw_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op,
>  #endif
>  }
>  
> +

whiteline change.

> +static int coroutine_fn raw_co_zone_append(BlockDriverState *bs,
> +   int64_t *offset,
> +   QEMUIOVector *qiov,
> +   BdrvRequestFlags flags) {
> +#if defined(CONFIG_BLKZONED)
> +BDRVRawState *s = bs->opaque;
> +int64_t zone_size_mask = bs->bl.zone_size - 1;
> +int64_t iov_len = 0;
> +int64_t len = 0;
> +RawPosixAIOData acb;
> +
> +if (*offset & zone_size_mask) {
> +error_report("sector offset %" PRId64 " is not aligned to zone size "
> + "%" PRId32 "", *offset / 512, bs->bl.zone_size / 512);
> +return -EINVAL;
> +}
> +
> +int64_t wg = bs->bl.write_granularity;
> +int64_t wg_mask = wg - 1;
> +for (int i = 0; i < qiov->niov; i++) {
> +   iov_len = qiov->iov[i].iov_len;
> +   if (iov_len & wg_mask) {
> +   error_report("len of 

Re: [PATCH v2 1/2] file-posix: add the tracking of the zones wp

2022-10-04 Thread Damien Le Moal
On 9/29/22 18:31, Sam Li wrote:
> Since Linux doesn't have a user API to issue zone append operations to
> zoned devices from user space, the file-posix driver is modified to add
> zone append emulation using regular writes. To do this, the file-posix
> driver tracks the wp location of all zones of the device. It uses an
> array of uint64_t. The most significant bit of each wp location indicates
> if the zone type is sequential write required.
> 
> A zone's wp can be changed by the following operations:
> - zone reset: change the wp to the start offset of that zone
> - zone finish: change the wp to the end location of that zone
> - write to a zone
> - zone append
> 
> Signed-off-by: Sam Li 
> ---
>  block/file-posix.c   | 138 ++-
>  include/block/block-common.h |  16 
>  include/block/block_int-common.h |   5 ++
>  include/block/raw-aio.h  |   4 +-
>  4 files changed, 159 insertions(+), 4 deletions(-)
> 
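For reference, the MSB encoding described in the commit message amounts to
something like the following standalone sketch (names are illustrative and
not part of the patch):

    #include <stdbool.h>
    #include <stdint.h>

    #define WP_NON_SWR_FLAG (1ULL << 63) /* set: not seq-write-required */

    /* Pack a zone's write pointer (in bytes) together with its type. */
    static uint64_t wp_pack(uint64_t wp_bytes, bool is_swr)
    {
        return is_swr ? wp_bytes : (wp_bytes | WP_NON_SWR_FLAG);
    }

    /* Recover the write pointer location and the zone type. */
    static uint64_t wp_location(uint64_t wp) { return wp & ~WP_NON_SWR_FLAG; }
    static bool wp_is_swr(uint64_t wp) { return !(wp & WP_NON_SWR_FLAG); }
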
> diff --git a/block/file-posix.c b/block/file-posix.c
> index 73656d87f2..33e81ac112 100755
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -206,6 +206,8 @@ typedef struct RawPosixAIOData {
>  struct {
>  struct iovec *iov;
>  int niov;
> +int64_t *append_sector;
> +BlockZoneWps *wps;
>  } io;
>  struct {
>  uint64_t cmd;
> @@ -1332,6 +1334,59 @@ static int hdev_get_max_segments(int fd, struct stat 
> *st) {
>  #endif
>  }
>  
> +#if defined(CONFIG_BLKZONED)
> +static int report_zone_wp(int64_t offset, int fd, BlockZoneWps *wps,
> +  unsigned int nrz) {

Maybe rename this to get_zones_wp() ?

> +struct blk_zone *blkz;
> +int64_t rep_size;
> +int64_t sector = offset >> BDRV_SECTOR_BITS;
> +int ret, n = 0, i = 0;
> +
> +rep_size = sizeof(struct blk_zone_report) + nrz * sizeof(struct 
> blk_zone);
> +g_autofree struct blk_zone_report *rep = NULL;

To be cleaner, move this declaration above with the others ?

> +rep = g_malloc(rep_size);
> +
> +blkz = (struct blk_zone *)(rep + 1);
> +while (n < nrz) {
> +memset(rep, 0, rep_size);
> +rep->sector = sector;
> +rep->nr_zones = nrz - n;
> +
> +do {
> +ret = ioctl(fd, BLKREPORTZONE, rep);
> +} while (ret != 0 && errno == EINTR);
> +if (ret != 0) {
> +error_report("%d: ioctl BLKREPORTZONE at %" PRId64 " failed %d",
> +fd, offset, errno);
> +return -errno;
> +}
> +
> +if (!rep->nr_zones) {
> +break;
> +}
> +
> +for (i = 0; i < rep->nr_zones; i++, n++) {
> +wps->wp[i] = blkz[i].wp << BDRV_SECTOR_BITS;
> +sector = blkz[i].start + blkz[i].len;
> +
> +/*
> + * In the wp tracking, it only cares if the zone type is sequential
> + * write required so that the wp can advance to the right location.

Or sequential write preferred (host aware case)

> + * Instead of the type of zone_type which is an 8-bit unsigned
> + * integer, use the most significant bit of the wp location
> + * to indicate the zone type: 0 for SWR zones and 1 for the
> + * others.
> + */
> +if (!(blkz[i].type & BLK_ZONE_TYPE_SEQWRITE_REQ)) {

This should be:

if (blkz[i].type != BLK_ZONE_TYPE_CONVENTIONAL) {

Note that the type field is not a bit-field. So you must compare values
instead of doing bit operations.

> +wps->wp[i] += (uint64_t)1 << 63;

You can simplify this:

    wps->wp[i] |= 1ULL << 63;

Overall, I would rewrite this like this:

for (i = 0; i < rep->nr_zones; i++, n++) {
    /*
     * The wp tracking cares only about sequential write required
     * and sequential write preferred zones so that the wp can
     * advance to the right location.
     * Use the most significant bit of the wp location
     * to indicate the zone type: 0 for SWR zones and 1 for
     * conventional zones.
     */
    if (blkz[i].type == BLK_ZONE_TYPE_CONVENTIONAL) {
        wps->wp[i] = 1ULL << 63;
    } else {
        wps->wp[i] = blkz[i].wp << BDRV_SECTOR_BITS;
    }
}
sector = blkz[i - 1].start + blkz[i - 1].len;

Which I think is a lot simpler.

> +}
> +}
> +}
> +
> +return 0;
> +}
> +#endif
> +
>  static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
>  {
>  BDRVRawState *s = bs->opaque;
> @@ -1415,6 +1470,20 @@ static void raw_refresh_limits(BlockDriverState *bs, 
> Error **errp)
>  error_report("Invalid device capacity %" PRId64 " bytes ", 
> bs->bl.capacity);
>  return;
>  }
> +
> +ret = get_sysfs_long_val(&st, "physical_block_size");
> +if (ret >= 0) {
> +bs->bl.write_granularity = ret;
> +}

This change seems unrelated to the wp tracking. Should this be 

Re: [PATCH v4 6/6] hw/arm/virt: Add 'compact-highmem' property

2022-10-04 Thread Gavin Shan

Hi Marc,

On 10/5/22 1:39 AM, Marc Zyngier wrote:

On Tue, 04 Oct 2022 01:26:27 +0100,
Gavin Shan  wrote:


After the improvement to high memory region address assignment is
applied, the memory layout can change, introducing possible migration
breakage. For example, the VIRT_HIGH_PCIE_MMIO memory region is
enabled or disabled depending on whether the optimization is applied,
with the following configuration:

   pa_bits  = 40;
   vms->highmem_redists = false;
   vms->highmem_ecam= false;
   vms->highmem_mmio= true;


The question is how are these parameters specified by a user? Short of
hacking the code, this isn't really possible.



Yeah, it's impossible to have false for vms->highmem_redists unless
the code is hacked.



   # qemu-system-aarch64 -accel kvm -cpu host\
 -machine virt-7.2,compact-highmem={on, off} \
 -m 4G,maxmem=511G -monitor stdio

   Region            compact-highmem=off        compact-highmem=on
   ----------------------------------------------------------------
   RAM               [1GB         512GB]        [1GB         512GB]
   HIGH_GIC_REDISTS  [512GB       512GB+64MB]   [disabled]
   HIGH_PCIE_ECAM    [512GB+256MB 512GB+512MB]  [disabled]
   HIGH_PCIE_MMIO    [disabled]                 [512GB       1TB]

In order to keep backwards compatibility, we need to disable the
optimization on machines that are virt-7.1 or earlier. It means
the optimization is enabled by default from virt-7.2. Besides, the
'compact-highmem' property is added so that the optimization can be
explicitly enabled or disabled on all machine types by users.


Not directly related to this series, but it seems to me that we should
be aiming at reproducible results across HW implementations (at least
with KVM). Depending on how many PA bits the HW implements, we end up
with one set of devices or another, which is likely to be confusing for
a user.

I think we should consider an additional set of changes to allow a
user to specify the PA bits as well as the devices they want to see
enabled.



I think the idea to selectively enable devices (high memory regions)
is sensible. For example, users may not need HIGH_PCIE_MMIO at all
on some systems, where they have few PCI devices.

I'm not sure about PA bits, because they are discovered from hardware
and the automatically optimized value/bits are configured back to KVM.
The optimized value/bits are calculated based on the enabled high
memory regions.

Thanks,
Gavin




Re: [PATCH v4 5/6] hw/arm/virt: Improve high memory region address

2022-10-04 Thread Gavin Shan

Hi Connie,

On 10/4/22 6:53 PM, Cornelia Huck wrote:

On Tue, Oct 04 2022, Gavin Shan  wrote:


There are three high memory regions, which are VIRT_HIGH_REDIST2,
VIRT_HIGH_PCIE_ECAM and VIRT_HIGH_PCIE_MMIO. Their base addresses
are floating on highest RAM address. However, they can be disabled
in several cases.

(1) One specific high memory region is disabled by the developer by
 toggling vms->highmem_{redists, ecam, mmio}.

(2) VIRT_HIGH_PCIE_ECAM region is disabled on machines that are
 'virt-2.12' or earlier.

(3) VIRT_HIGH_PCIE_ECAM region is disabled when firmware is loaded
 on 32-bit systems.

(4) One specific high memory region is disabled when it breaks the
 PA space limit.

The current implementation of virt_set_memmap() isn't comprehensive
because the space for one specific high memory region is always
reserved from the PA space in cases (1), (2) and (3). In the code,
'base' and 'vms->highest_gpa' are always increased for those three
cases. It's unnecessary since the assigned space of a disabled
high memory region won't be used afterwards.

This improves the address assignment for those three high memory
regions by skipping the address assignment for any specific high
memory region that has been disabled in cases (1), (2) and (3).
'vms->high_compact' is false for now, meaning that we don't have
any behavior changes until it becomes configurable through the
'compact-highmem' property in the next patch.

Signed-off-by: Gavin Shan 
---
  hw/arm/virt.c | 19 ---
  include/hw/arm/virt.h |  1 +
  2 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 59de7b78b5..4164da49e9 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -1715,9 +1715,6 @@ static void virt_set_high_memmap(VirtMachineState *vms,
  region_base = ROUND_UP(base, extended_memmap[i].size);
  region_size = extended_memmap[i].size;
  
-vms->memmap[i].base = region_base;

-vms->memmap[i].size = region_size;
-
  /*
   * Check each device to see if they fit in the PA space,
   * moving highest_gpa as we go.


Maybe tweak this comment?

"Check each enabled device to see if they fit in the PA space,
moving highest_gpa as we go. For compatibility, move highest_gpa
for disabled fitting devices as well, if the compact layout has
been disabled."

(Or would that be overkill?)



It looks like overkill to me since the code is simple and clear.
However, the comment won't be harmful. I will integrate the proposed
comment in the next respin.

Thanks,
Gavin






[PATCH 2/3] i386: kvm: Add support for MSR filtering

2022-10-04 Thread Alexander Graf
KVM has grown support to deflect arbitrary MSRs to user space since
Linux 5.10. For now we don't expect to make a lot of use of this
feature, so let's expose it in the easiest way possible: with up to 16
individually maskable MSRs.

This patch adds a kvm_filter_msr() function that other code can call
to install a hook on KVM MSR reads or writes.

Signed-off-by: Alexander Graf 
---
 target/i386/kvm/kvm.c  | 124 +
 target/i386/kvm/kvm_i386.h |  11 
 2 files changed, 135 insertions(+)

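To illustrate the intended use of the new hook, here is a minimal sketch
(the handler name and MSR_EXAMPLE are hypothetical; the handler signature
follows the ones in this series, e.g. kvm_rdmsr_core_thread_count in
patch 3/3):

    /* Hypothetical read handler: expose a constant value for one MSR. */
    static bool my_rdmsr(X86CPU *cpu, uint32_t msr, uint64_t *val)
    {
        *val = 0;     /* value the guest will read */
        return true;  /* success; false lets KVM flag the access as failed */
    }

    /* e.g. in kvm_arch_init(), after enabling KVM_CAP_X86_USER_SPACE_MSR: */
    if (!kvm_filter_msr(s, MSR_EXAMPLE, my_rdmsr, NULL)) {
        error_report("Could not install MSR_EXAMPLE handler");
        exit(1);
    }
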
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index a1fd1f5379..ea53092dd0 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -139,6 +139,8 @@ static struct kvm_cpuid2 *cpuid_cache;
 static struct kvm_cpuid2 *hv_cpuid_cache;
 static struct kvm_msr_list *kvm_feature_msrs;
 
+static KVMMSRHandlers msr_handlers[KVM_MSR_FILTER_MAX_RANGES];
+
 #define BUS_LOCK_SLICE_TIME 10ULL /* ns */
 static RateLimit bus_lock_ratelimit_ctrl;
 static int kvm_get_one_msr(X86CPU *cpu, int index, uint64_t *value);
@@ -2588,6 +2590,16 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
 }
 }
 
+if (kvm_vm_check_extension(s, KVM_CAP_X86_USER_SPACE_MSR)) {
+ret = kvm_vm_enable_cap(s, KVM_CAP_X86_USER_SPACE_MSR, 0,
+KVM_MSR_EXIT_REASON_FILTER);
+if (ret) {
+error_report("Could not enable user space MSRs: %s",
+ strerror(-ret));
+exit(1);
+}
+}
+
 return 0;
 }
 
@@ -5077,6 +5089,108 @@ void kvm_arch_update_guest_debug(CPUState *cpu, struct 
kvm_guest_debug *dbg)
 }
 }
 
+static bool kvm_install_msr_filters(KVMState *s)
+{
+uint64_t zero = 0;
+struct kvm_msr_filter filter = {
+.flags = KVM_MSR_FILTER_DEFAULT_ALLOW,
+};
+int r, i, j = 0;
+
+for (i = 0; i < KVM_MSR_FILTER_MAX_RANGES; i++) {
+KVMMSRHandlers *handler = &msr_handlers[i];
+if (handler->msr) {
+struct kvm_msr_filter_range *range = &filter.ranges[j++];
+
+*range = (struct kvm_msr_filter_range) {
+.flags = 0,
+.nmsrs = 1,
+.base = handler->msr,
+.bitmap = (__u8 *)&zero,
+};
+
+if (handler->rdmsr) {
+range->flags |= KVM_MSR_FILTER_READ;
+}
+
+if (handler->wrmsr) {
+range->flags |= KVM_MSR_FILTER_WRITE;
+}
+}
+}
+
+r = kvm_vm_ioctl(s, KVM_X86_SET_MSR_FILTER, &filter);
+if (r) {
+return false;
+}
+
+return true;
+}
+
+bool kvm_filter_msr(KVMState *s, uint32_t msr, QEMURDMSRHandler *rdmsr,
+QEMUWRMSRHandler *wrmsr)
+{
+int i;
+
+for (i = 0; i < ARRAY_SIZE(msr_handlers); i++) {
+if (!msr_handlers[i].msr) {
+msr_handlers[i] = (KVMMSRHandlers) {
+.msr = msr,
+.rdmsr = rdmsr,
+.wrmsr = wrmsr,
+};
+
+if (!kvm_install_msr_filters(s)) {
+msr_handlers[i] = (KVMMSRHandlers) { };
+return false;
+}
+
+return true;
+}
+}
+
+return false;
+}
+
+static int kvm_handle_rdmsr(X86CPU *cpu, struct kvm_run *run)
+{
+int i;
+bool r;
+
+for (i = 0; i < ARRAY_SIZE(msr_handlers); i++) {
+KVMMSRHandlers *handler = &msr_handlers[i];
+if (run->msr.index == handler->msr) {
+if (handler->rdmsr) {
+r = handler->rdmsr(cpu, handler->msr,
+   (uint64_t *)&run->msr.data);
+run->msr.error = r ? 0 : 1;
+return 0;
+}
+}
+}
+
+assert(false);
+}
+
+static int kvm_handle_wrmsr(X86CPU *cpu, struct kvm_run *run)
+{
+int i;
+bool r;
+
+for (i = 0; i < ARRAY_SIZE(msr_handlers); i++) {
+KVMMSRHandlers *handler = &msr_handlers[i];
+if (run->msr.index == handler->msr) {
+if (handler->wrmsr) {
+r = handler->wrmsr(cpu, handler->msr, run->msr.data);
+run->msr.error = r ? 0 : 1;
+return 0;
+}
+}
+}
+
+assert(false);
+}
+
 static bool has_sgx_provisioning;
 
 static bool __kvm_enable_sgx_provisioning(KVMState *s)
@@ -5176,6 +5290,16 @@ int kvm_arch_handle_exit(CPUState *cs, struct kvm_run 
*run)
 /* already handled in kvm_arch_post_run */
 ret = 0;
 break;
+case KVM_EXIT_X86_RDMSR:
+/* We only enable MSR filtering, any other exit is bogus */
+assert(run->msr.reason == KVM_MSR_EXIT_REASON_FILTER);
+ret = kvm_handle_rdmsr(cpu, run);
+break;
+case KVM_EXIT_X86_WRMSR:
+/* We only enable MSR filtering, any other exit is bogus */
+assert(run->msr.reason == KVM_MSR_EXIT_REASON_FILTER);
+ret = kvm_handle_wrmsr(cpu, run);
+break;
 default:
 fprintf(stderr, "KVM: unknown exit reason %d\n", 

[PATCH 3/3] KVM: x86: Implement MSR_CORE_THREAD_COUNT MSR

2022-10-04 Thread Alexander Graf
The MSR_CORE_THREAD_COUNT MSR describes the CPU package topology, such as the
number of threads and cores for a given package. This is information QEMU has
readily available and can provide through the new user space MSR deflection
interface.

This patch propagates the existing hvf logic from patch 027ac0cb516
("target/i386/hvf: add rdmsr 35H MSR_CORE_THREAD_COUNT") to KVM.

Signed-off-by: Alexander Graf 
---
 target/i386/kvm/kvm.c | 21 +
 1 file changed, 21 insertions(+)

diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index ea53092dd0..791e995389 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -2403,6 +2403,17 @@ static int kvm_get_supported_msrs(KVMState *s)
 return ret;
 }
 
+static bool kvm_rdmsr_core_thread_count(X86CPU *cpu, uint32_t msr,
+uint64_t *val)
+{
+CPUState *cs = CPU(cpu);
+
+*val = cs->nr_threads * cs->nr_cores; /* thread count, bits 15..0 */
+*val |= ((uint32_t)cs->nr_cores << 16); /* core count, bits 31..16 */
+
+return true;
+}
+
 static Notifier smram_machine_done;
 static KVMMemoryListener smram_listener;
 static AddressSpace smram_address_space;
@@ -2591,6 +2602,8 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
 }
 
 if (kvm_vm_check_extension(s, KVM_CAP_X86_USER_SPACE_MSR)) {
+bool r;
+
 ret = kvm_vm_enable_cap(s, KVM_CAP_X86_USER_SPACE_MSR, 0,
 KVM_MSR_EXIT_REASON_FILTER);
 if (ret) {
@@ -2598,6 +2611,14 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
  strerror(-ret));
 exit(1);
 }
+
+r = kvm_filter_msr(s, MSR_CORE_THREAD_COUNT,
+   kvm_rdmsr_core_thread_count, NULL);
+if (!r) {
+error_report("Could not install MSR_CORE_THREAD_COUNT handler: %s",
+ strerror(-ret));
+exit(1);
+}
 }
 
 return 0;
-- 
2.37.0 (Apple Git-136)




[PATCH 0/3] Add TCG & KVM support for MSR_CORE_THREAD_COUNT

2022-10-04 Thread Alexander Graf
Commit 027ac0cb516 ("target/i386/hvf: add rdmsr 35H
MSR_CORE_THREAD_COUNT") added support for the MSR_CORE_THREAD_COUNT MSR
to HVF. This MSR is mandatory to execute macOS when run with -cpu
host,+hypervisor.

This patch set adds support for the very same MSR to TCG as well as
KVM - as long as host KVM is recent enough to support MSR trapping.

With this support added, I can successfully execute macOS guests in
KVM with an APFS-enabled OVMF build, a valid applesmc plus OSK, and

  -cpu Skylake-Client,+invtsc,+hypervisor


Alex

Alexander Graf (3):
  x86: Implement MSR_CORE_THREAD_COUNT MSR
  i386: kvm: Add support for MSR filtering
  KVM: x86: Implement MSR_CORE_THREAD_COUNT MSR

 target/i386/kvm/kvm.c| 145 +++
 target/i386/kvm/kvm_i386.h   |  11 ++
 target/i386/tcg/sysemu/misc_helper.c |   5 +
 3 files changed, 161 insertions(+)

-- 
2.37.0 (Apple Git-136)




[PATCH 1/3] x86: Implement MSR_CORE_THREAD_COUNT MSR

2022-10-04 Thread Alexander Graf
Intel CPUs starting with Haswell-E implement a new MSR called
MSR_CORE_THREAD_COUNT which exposes the number of threads and cores
inside of a package.

This MSR is used by XNU to populate internal data structures and not
implementing it prevents virtual machines with more than 1 vCPU from
booting if the emulated CPU generation is at least Haswell-E.

This patch propagates the existing hvf logic from patch 027ac0cb516
("target/i386/hvf: add rdmsr 35H MSR_CORE_THREAD_COUNT") to TCG.

Signed-off-by: Alexander Graf 
---
 target/i386/tcg/sysemu/misc_helper.c | 5 +
 1 file changed, 5 insertions(+)

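For concreteness, the layout implemented below packs the logical-processor
count into bits 15:0 and the core count into bits 31:16; e.g. for a
hypothetical package with 4 cores and 2 threads per core:

    uint64_t val = (4 * 2) | (4 << 16); /* threads in 15:0, cores in 31:16 */
    /* val == 0x00040008 */
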
diff --git a/target/i386/tcg/sysemu/misc_helper.c 
b/target/i386/tcg/sysemu/misc_helper.c
index 1328aa656f..e1528b7f80 100644
--- a/target/i386/tcg/sysemu/misc_helper.c
+++ b/target/i386/tcg/sysemu/misc_helper.c
@@ -450,6 +450,11 @@ void helper_rdmsr(CPUX86State *env)
  case MSR_IA32_UCODE_REV:
 val = x86_cpu->ucode_rev;
 break;
+case MSR_CORE_THREAD_COUNT: {
+CPUState *cs = CPU(x86_cpu);
+val = (cs->nr_threads * cs->nr_cores) | (cs->nr_cores << 16);
+break;
+}
 default:
 if ((uint32_t)env->regs[R_ECX] >= MSR_MC0_CTL
 && (uint32_t)env->regs[R_ECX] < MSR_MC0_CTL +
-- 
2.37.0 (Apple Git-136)




Re: [PATCH v2 13/13] hw/ppc/e500: Add Freescale eSDHC to e500 boards

2022-10-04 Thread Bernhard Beschow
On 3 October 2022 21:06:57 UTC, "Philippe Mathieu-Daudé" wrote:
>On 3/10/22 22:31, Bernhard Beschow wrote:
>> Adds missing functionality to emulated e500 SoCs, which increases the
>> chance that given "real" firmware images can access SD cards.
>> 
>> Signed-off-by: Bernhard Beschow 
>> ---
>>   docs/system/ppc/ppce500.rst | 13 +
>>   hw/ppc/Kconfig  |  1 +
>>   hw/ppc/e500.c   | 31 ++-
>>   3 files changed, 44 insertions(+), 1 deletion(-)
>
>> +static void dt_sdhc_create(void *fdt, const char *parent, const char *mpic)
>> +{
>> +hwaddr mmio = MPC85XX_ESDHC_REGS_OFFSET;
>> +hwaddr size = MPC85XX_ESDHC_REGS_SIZE;
>> +int irq = MPC85XX_ESDHC_IRQ;
>
>Why not pass these 3 variable as argument?

In anticipation of data-driven board creation, I'd ideally infer those from the 
device's QOM properties. This seems similar to what Mark suggested in the BoF 
at KVM Forum [1], where -- IIUC -- he stated that QOM properties could be the 
foundation of all wiring representations. And device tree seems just like one 
specialized representation to me. (Note that I'm slightly hijacking the review 
here because I don't know where and how to express these thoughts elsewhere).

Does it make sense to add the missing properties here?

Best regards,
Bernhard

[1] https://etherpad.opendev.org/p/qemu-emulation-bof%40kvmforum2022

>
>> +g_autofree char *name = NULL;
>> +
>> +name = g_strdup_printf("%s/sdhc@%" PRIx64, parent, mmio);
>> +qemu_fdt_add_subnode(fdt, name);
>> +qemu_fdt_setprop(fdt, name, "sdhci,auto-cmd12", NULL, 0);
>> +qemu_fdt_setprop_phandle(fdt, name, "interrupt-parent", mpic);
>> +qemu_fdt_setprop_cells(fdt, name, "bus-width", 4);
>> +qemu_fdt_setprop_cells(fdt, name, "interrupts", irq, 0x2);
>> +qemu_fdt_setprop_cells(fdt, name, "reg", mmio, size);
>> +qemu_fdt_setprop_string(fdt, name, "compatible", "fsl,esdhc");
>> +}
>> typedef struct PlatformDevtreeData {
>>   void *fdt;
>> @@ -553,6 +573,8 @@ static int ppce500_load_device_tree(PPCE500MachineState 
>> *pms,
>> dt_rtc_create(fdt, "i2c", "rtc");
>>   +/* sdhc */
>> +dt_sdhc_create(fdt, soc, mpic);
>>   



Re: [PATCH v4 4/6] hw/arm/virt: Introduce virt_get_high_memmap_enabled() helper

2022-10-04 Thread Gavin Shan

Hi Connie,

On 10/4/22 6:41 PM, Cornelia Huck wrote:

On Tue, Oct 04 2022, Gavin Shan  wrote:


This introduces the virt_get_high_memmap_enabled() helper, which returns
a pointer to vms->highmem_{redists, ecam, mmio}. The pointer will
be used in subsequent patches.

No functional change intended.

Signed-off-by: Gavin Shan 
---
  hw/arm/virt.c | 30 +-
  1 file changed, 17 insertions(+), 13 deletions(-)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index b0b679d1f4..59de7b78b5 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -1689,14 +1689,29 @@ static uint64_t virt_cpu_mp_affinity(VirtMachineState 
*vms, int idx)
  return arm_cpu_mp_affinity(idx, clustersz);
  }
  
+static inline bool *virt_get_high_memmap_enabled(VirtMachineState *vms,
+ int index)
+{
+bool *enabled_array[] = {
+&vms->highmem_redists,
+&vms->highmem_ecam,
+&vms->highmem_mmio,
+};
+
+assert(index - VIRT_LOWMEMMAP_LAST < ARRAY_SIZE(enabled_array));


I wonder whether we want an assert(ARRAY_SIZE(extended_memmap) ==
ARRAY_SIZE(enabled_array))? IIUC, we never want those two to get out of
sync?



Yeah, it makes sense to ensure both arrays stay synchronized. I will
add the extra check in the next respin.


+
+return enabled_array[index - VIRT_LOWMEMMAP_LAST];
+}
+


Thanks,
Gavin




Re: [PATCH 2/3] vdpa: load vlan configuration at NIC startup

2022-10-04 Thread Si-Wei Liu



On 9/29/2022 12:13 AM, Michael S. Tsirkin wrote:

On Wed, Sep 21, 2022 at 04:00:58PM -0700, Si-Wei Liu wrote:

The spec doesn't explicitly say anything about that
as far as I see.

Here the spec is totally ruled by the (software artifact of the)
implementation rather than by how a real device is expected to work
with VLAN rx filters. Are we sure we'd stick to this flawed device
implementation? The guest driver seems to be agnostic to this broken
spec behavior so far, and I am afraid it's overkill to add another
feature bit or ctrl command to fix the VLAN filter in a clean way.


I agree with all of the above. So, double checking, should all VLANs be
allowed by default at device start?

That is true only when VIRTIO_NET_F_CTRL_VLAN is not negotiated. If the
guest already negotiated VIRTIO_NET_F_CTRL_VLAN before being migrated,
the device should resume with all VLANs filtered/disallowed.


   Maybe the spec needs to be more
clear in that regard?

Yes, I think this is crucial. Otherwise we can't get consistent behavior,
either from software to vDPA, or across various vDPA vendors.

OK. Can you open a github issue for the spec? We'll try to address.

Thanks, ticket filed at:
https://github.com/oasis-tcs/virtio-spec/issues/147

Also, is it ok if we make it a SHOULD, i.e. best effort filtering?


Yes, that's fine.

-Siwei

Re: [PATCH v2 09/13] hw/ppc/e500: Implement pflash handling

2022-10-04 Thread Bernhard Beschow
On 3 October 2022 21:21:15 UTC, "Philippe Mathieu-Daudé" wrote:
>On 3/10/22 22:31, Bernhard Beschow wrote:
>> Allows e500 boards to have their root file system reside on flash using
>> only builtin devices located in the eLBC memory region.
>> 
>> Note that the flash memory area is only created when a -pflash argument is
>> given, and that the size is determined by the given file. The idea is to
>> put users in control.
>> 
>> Signed-off-by: Bernhard Beschow 
>> ---
>>   docs/system/ppc/ppce500.rst | 12 ++
>>   hw/ppc/Kconfig  |  1 +
>>   hw/ppc/e500.c   | 76 +
>>   3 files changed, 89 insertions(+)
>
>> @@ -856,6 +892,7 @@ void ppce500_init(MachineState *machine)
>>   unsigned int pci_irq_nrs[PCI_NUM_PINS] = {1, 2, 3, 4};
>>   IrqLines *irqs;
>>   DeviceState *dev, *mpicdev;
>> +DriveInfo *dinfo;
>>   CPUPPCState *firstenv = NULL;
>>   MemoryRegion *ccsr_addr_space;
>>   SysBusDevice *s;
>> @@ -1024,6 +1061,45 @@ void ppce500_init(MachineState *machine)
>>   pmc->platform_bus_base,
>>   &pms->pbus_dev->mmio);
>>   +dinfo = drive_get(IF_PFLASH, 0, 0);
>> +if (dinfo) {
>> +BlockBackend *blk = blk_by_legacy_dinfo(dinfo);
>> +BlockDriverState *bs = blk_bs(blk);
>> +uint64_t size = bdrv_getlength(bs);
>> +uint64_t mmio_size = pms->pbus_dev->mmio.size;
>> +uint32_t sector_len = 64 * KiB;
>> +
>> +if (ctpop64(size) != 1) {
>> +error_report("Size of pflash file must be a power of two.");
>
>This is a PFLASH restriction (which you already fixed in the previous
>patch), not a board one.

I agree that this check seems redundant to the one in cfi01. I added this one 
for clearer error messages since cfi01 only complains about the "device size" 
not being a power of two while this message at least gives a hint towards the 
source of the problem (the file given in the pflash option).

Usually the size of the pflash area is hardcoded in the board while I choose to 
derive it from the size of the backing file in order to avoid hardcoding it. My 
idea is to put users in control by offering more flexibility.

>
>> +exit(1);
>> +}
>> +
>> +if (size > mmio_size) {
>> +error_report("Size of pflash file must not be bigger than %" 
>> PRIu64
>> + " bytes.", mmio_size);
>
>There is no hardware limitation here, you can wire flash bigger than the
>memory aperture. What is above the aperture will simply be ignored.
>
>Should we display a warning here instead of a fatal error?

While this is technically possible, is that what users would expect? Couldn't 
we just require users to truncate their files if they really want the 
"aperture" behavior?

>
>> +exit(1);
>> +}
>> +
>> +assert(QEMU_IS_ALIGNED(size, sector_len));
>
>Similarly, this doesn't seem a problem the board code should worry
>about: better to defer it to PFLASH realize().

The reason for the assert() here is that size isn't stored directly in the 
cfi01 device. Instead, it must be calculated by the properties "num-blocks" 
times "sector-length". For this to work, size must be divisible by sector_len 
without remainder, which is checked by the assertion.

We could theoretically add a "size" property which would violate the single 
point of truth principle, though. Do you see a different solution?

Best regards,
Bernhard

>
>> +dev = qdev_new(TYPE_PFLASH_CFI01);
>> +qdev_prop_set_drive(dev, "drive", blk);
>> +qdev_prop_set_uint32(dev, "num-blocks", size / sector_len);
>> +qdev_prop_set_uint64(dev, "sector-length", sector_len);
>> +qdev_prop_set_uint8(dev, "width", 2);
>> +qdev_prop_set_bit(dev, "big-endian", true);
>> +qdev_prop_set_uint16(dev, "id0", 0x89);
>> +qdev_prop_set_uint16(dev, "id1", 0x18);
>> +qdev_prop_set_uint16(dev, "id2", 0x);
>> +qdev_prop_set_uint16(dev, "id3", 0x0);
>> +qdev_prop_set_string(dev, "name", "e500.flash");
>> +s = SYS_BUS_DEVICE(dev);
>> +sysbus_realize_and_unref(s, &error_fatal);
>> +
>> +memory_region_add_subregion(&pms->pbus_dev->mmio, 0,
>> +sysbus_mmio_get_region(s, 0));
>> +}
>> +
>>   /*
>>* Smart firmware defaults ahead!
>>*
>




[PATCH 19/20] tests/9p: merge v9fs_tunlinkat() and do_unlinkat()

2022-10-04 Thread Christian Schoenebeck
As with previous patches, unify those 2 functions into a single function
v9fs_tunlinkat() by using a declarative function arguments approach.

Signed-off-by: Christian Schoenebeck 
---
 tests/qtest/libqos/virtio-9p-client.c | 37 +--
 tests/qtest/libqos/virtio-9p-client.h | 29 +++--
 tests/qtest/virtio-9p-test.c  | 26 ++-
 3 files changed, 64 insertions(+), 28 deletions(-)

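The declarative style means call sites pass C99 designated initializers
through the tunlinkat() wrapper macro added to virtio-9p-test.c below; the
paths, names and error code in this sketch are hypothetical:

    /* Unlink "victim" inside directory "dir": */
    tunlinkat({ .client = v9p, .atPath = "dir", .name = "victim" });

    /* Same, but expect the server to reply with a specific Rlerror code: */
    tunlinkat({ .client = v9p, .atPath = "dir", .name = "victim",
                .expectErr = ENOENT });
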
diff --git a/tests/qtest/libqos/virtio-9p-client.c 
b/tests/qtest/libqos/virtio-9p-client.c
index a2770719b9..e017e030ec 100644
--- a/tests/qtest/libqos/virtio-9p-client.c
+++ b/tests/qtest/libqos/virtio-9p-client.c
@@ -1004,23 +1004,44 @@ void v9fs_rlink(P9Req *req)
 }
 
 /* size[4] Tunlinkat tag[2] dirfd[4] name[s] flags[4] */
-P9Req *v9fs_tunlinkat(QVirtio9P *v9p, uint32_t dirfd, const char *name,
-  uint32_t flags, uint16_t tag)
+TunlinkatRes v9fs_tunlinkat(TunlinkatOpt opt)
 {
 P9Req *req;
+uint32_t err;
+
+g_assert(opt.client);
+/* expecting either hi-level atPath or low-level dirfd, but not both */
+g_assert(!opt.atPath || !opt.dirfd);
+
+if (opt.atPath) {
+opt.dirfd = v9fs_twalk((TWalkOpt) { .client = opt.client,
+.path = opt.atPath }).newfid;
+}
 
 uint32_t body_size = 4 + 4;
-uint16_t string_size = v9fs_string_size(name);
+uint16_t string_size = v9fs_string_size(opt.name);
 
 g_assert_cmpint(body_size, <=, UINT32_MAX - string_size);
 body_size += string_size;
 
-req = v9fs_req_init(v9p, body_size, P9_TUNLINKAT, tag);
-v9fs_uint32_write(req, dirfd);
-v9fs_string_write(req, name);
-v9fs_uint32_write(req, flags);
+req = v9fs_req_init(opt.client, body_size, P9_TUNLINKAT, opt.tag);
+v9fs_uint32_write(req, opt.dirfd);
+v9fs_string_write(req, opt.name);
+v9fs_uint32_write(req, opt.flags);
 v9fs_req_send(req);
-return req;
+
+if (!opt.requestOnly) {
+v9fs_req_wait_for_reply(req, NULL);
+if (opt.expectErr) {
+v9fs_rlerror(req, &err);
+g_assert_cmpint(err, ==, opt.expectErr);
+} else {
+v9fs_runlinkat(req);
+}
+req = NULL; /* request was freed */
+}
+
+return (TunlinkatRes) { .req = req };
 }
 
 /* size[4] Runlinkat tag[2] */
diff --git a/tests/qtest/libqos/virtio-9p-client.h 
b/tests/qtest/libqos/virtio-9p-client.h
index 49ffd0fc51..78228eb97d 100644
--- a/tests/qtest/libqos/virtio-9p-client.h
+++ b/tests/qtest/libqos/virtio-9p-client.h
@@ -415,6 +415,32 @@ typedef struct TlinkRes {
 P9Req *req;
 } TlinkRes;
 
+/* options for 'Tunlinkat' 9p request */
+typedef struct TunlinkatOpt {
+/* 9P client being used (mandatory) */
+QVirtio9P *client;
+/* user supplied tag number being returned with response (optional) */
+uint16_t tag;
+/* low-level variant of directory where name shall be unlinked */
+uint32_t dirfd;
+/* high-level variant of directory where name shall be unlinked */
+const char *atPath;
+/* name of directory entry to be unlinked (required) */
+const char *name;
+/* Linux unlinkat(2) flags */
+uint32_t flags;
+/* only send Tunlinkat request but not wait for a reply? (optional) */
+bool requestOnly;
+/* do we expect an Rlerror response, if yes which error code? (optional) */
+uint32_t expectErr;
+} TunlinkatOpt;
+
+/* result of 'Tunlinkat' 9p request */
+typedef struct TunlinkatRes {
+/* if requestOnly was set: request object for further processing */
+P9Req *req;
+} TunlinkatRes;
+
 void v9fs_set_allocator(QGuestAllocator *t_alloc);
 void v9fs_memwrite(P9Req *req, const void *addr, size_t len);
 void v9fs_memskip(P9Req *req, size_t len);
@@ -462,8 +488,7 @@ TsymlinkRes v9fs_tsymlink(TsymlinkOpt);
 void v9fs_rsymlink(P9Req *req, v9fs_qid *qid);
 TlinkRes v9fs_tlink(TlinkOpt);
 void v9fs_rlink(P9Req *req);
-P9Req *v9fs_tunlinkat(QVirtio9P *v9p, uint32_t dirfd, const char *name,
-  uint32_t flags, uint16_t tag);
+TunlinkatRes v9fs_tunlinkat(TunlinkatOpt);
 void v9fs_runlinkat(P9Req *req);
 
 #endif
diff --git a/tests/qtest/virtio-9p-test.c b/tests/qtest/virtio-9p-test.c
index 185eaf8b1e..65e69491e5 100644
--- a/tests/qtest/virtio-9p-test.c
+++ b/tests/qtest/virtio-9p-test.c
@@ -28,6 +28,7 @@
 #define tlcreate(...) v9fs_tlcreate((TlcreateOpt) __VA_ARGS__)
 #define tsymlink(...) v9fs_tsymlink((TsymlinkOpt) __VA_ARGS__)
 #define tlink(...) v9fs_tlink((TlinkOpt) __VA_ARGS__)
+#define tunlinkat(...) v9fs_tunlinkat((TunlinkatOpt) __VA_ARGS__)
 
 static void pci_config(void *obj, void *data, QGuestAllocator *t_alloc)
 {
@@ -481,20 +482,6 @@ static void fs_flush_ignored(void *obj, void *data, 
QGuestAllocator *t_alloc)
 g_free(wnames[0]);
 }
 
-static void do_unlinkat(QVirtio9P *v9p, const char *atpath, const char *rpath,
-uint32_t flags)
-{
-g_autofree char *name = g_strdup(rpath);
-uint32_t fid;
-

[PATCH 16/20] tests/9p: merge v9fs_tlcreate() and do_lcreate()

2022-10-04 Thread Christian Schoenebeck
As with previous patches, unify those 2 functions into a single function
v9fs_tlcreate() by using a declarative function arguments approach.

Signed-off-by: Christian Schoenebeck 
---
 tests/qtest/libqos/virtio-9p-client.c | 45 +--
 tests/qtest/libqos/virtio-9p-client.h | 39 +--
 tests/qtest/virtio-9p-test.c  | 30 +-
 3 files changed, 79 insertions(+), 35 deletions(-)

diff --git a/tests/qtest/libqos/virtio-9p-client.c 
b/tests/qtest/libqos/virtio-9p-client.c
index c374ba2048..5c805a133c 100644
--- a/tests/qtest/libqos/virtio-9p-client.c
+++ b/tests/qtest/libqos/virtio-9p-client.c
@@ -827,11 +827,26 @@ void v9fs_rmkdir(P9Req *req, v9fs_qid *qid)
 }
 
 /* size[4] Tlcreate tag[2] fid[4] name[s] flags[4] mode[4] gid[4] */
-P9Req *v9fs_tlcreate(QVirtio9P *v9p, uint32_t fid, const char *name,
- uint32_t flags, uint32_t mode, uint32_t gid,
- uint16_t tag)
+TlcreateRes v9fs_tlcreate(TlcreateOpt opt)
 {
 P9Req *req;
+uint32_t err;
+g_autofree char *name = g_strdup(opt.name);
+
+g_assert(opt.client);
+/* expecting either hi-level atPath or low-level fid, but not both */
+g_assert(!opt.atPath || !opt.fid);
+/* expecting either Rlcreate or Rlerror, but obviously not both */
+g_assert(!opt.expectErr || !(opt.rlcreate.qid || opt.rlcreate.iounit));
+
+if (opt.atPath) {
+opt.fid = v9fs_twalk((TWalkOpt) { .client = opt.client,
+  .path = opt.atPath }).newfid;
+}
+
+if (!opt.mode) {
+opt.mode = 0750;
+}
 
 uint32_t body_size = 4 + 4 + 4 + 4;
 uint16_t string_size = v9fs_string_size(name);
@@ -839,14 +854,26 @@ P9Req *v9fs_tlcreate(QVirtio9P *v9p, uint32_t fid, const 
char *name,
 g_assert_cmpint(body_size, <=, UINT32_MAX - string_size);
 body_size += string_size;
 
-req = v9fs_req_init(v9p, body_size, P9_TLCREATE, tag);
-v9fs_uint32_write(req, fid);
+req = v9fs_req_init(opt.client, body_size, P9_TLCREATE, opt.tag);
+v9fs_uint32_write(req, opt.fid);
 v9fs_string_write(req, name);
-v9fs_uint32_write(req, flags);
-v9fs_uint32_write(req, mode);
-v9fs_uint32_write(req, gid);
+v9fs_uint32_write(req, opt.flags);
+v9fs_uint32_write(req, opt.mode);
+v9fs_uint32_write(req, opt.gid);
 v9fs_req_send(req);
-return req;
+
+if (!opt.requestOnly) {
+v9fs_req_wait_for_reply(req, NULL);
+if (opt.expectErr) {
+v9fs_rlerror(req, &err);
+g_assert_cmpint(err, ==, opt.expectErr);
+} else {
+v9fs_rlcreate(req, opt.rlcreate.qid, opt.rlcreate.iounit);
+}
+req = NULL; /* request was freed */
+}
+
+return (TlcreateRes) { .req = req };
 }
 
 /* size[4] Rlcreate tag[2] qid[13] iounit[4] */
diff --git a/tests/qtest/libqos/virtio-9p-client.h 
b/tests/qtest/libqos/virtio-9p-client.h
index ae44f95a4d..8916b1c7aa 100644
--- a/tests/qtest/libqos/virtio-9p-client.h
+++ b/tests/qtest/libqos/virtio-9p-client.h
@@ -320,6 +320,41 @@ typedef struct TMkdirRes {
 P9Req *req;
 } TMkdirRes;
 
+/* options for 'Tlcreate' 9p request */
+typedef struct TlcreateOpt {
+/* 9P client being used (mandatory) */
+QVirtio9P *client;
+/* user supplied tag number being returned with response (optional) */
+uint16_t tag;
+/* low-level variant of directory where new file shall be created */
+uint32_t fid;
+/* high-level variant of directory where new file shall be created */
+const char *atPath;
+/* name of new file (required) */
+const char *name;
+/* Linux kernel intent bits */
+uint32_t flags;
+/* Linux create(2) mode bits */
+uint32_t mode;
+/* effective group ID of caller */
+uint32_t gid;
+/* data being received from 9p server as 'Rlcreate' response (optional) */
+struct {
+v9fs_qid *qid;
+uint32_t *iounit;
+} rlcreate;
+/* only send Tlcreate request but not wait for a reply? (optional) */
+bool requestOnly;
+/* do we expect an Rlerror response, if yes which error code? (optional) */
+uint32_t expectErr;
+} TlcreateOpt;
+
+/* result of 'Tlcreate' 9p request */
+typedef struct TlcreateRes {
+/* if requestOnly was set: request object for further processing */
+P9Req *req;
+} TlcreateRes;
+
 void v9fs_set_allocator(QGuestAllocator *t_alloc);
 void v9fs_memwrite(P9Req *req, const void *addr, size_t len);
 void v9fs_memskip(P9Req *req, size_t len);
@@ -361,9 +396,7 @@ TFlushRes v9fs_tflush(TFlushOpt);
 void v9fs_rflush(P9Req *req);
 TMkdirRes v9fs_tmkdir(TMkdirOpt);
 void v9fs_rmkdir(P9Req *req, v9fs_qid *qid);
-P9Req *v9fs_tlcreate(QVirtio9P *v9p, uint32_t fid, const char *name,
- uint32_t flags, uint32_t mode, uint32_t gid,
- uint16_t tag);
+TlcreateRes v9fs_tlcreate(TlcreateOpt);
 void v9fs_rlcreate(P9Req *req, v9fs_qid *qid, uint32_t *iounit);
 P9Req 

[PATCH 18/20] tests/9p: merge v9fs_tlink() and do_hardlink()

2022-10-04 Thread Christian Schoenebeck
As with previous patches, unify those 2 functions into a single function
v9fs_tlink() by using a declarative function arguments approach.

Signed-off-by: Christian Schoenebeck 
---
 tests/qtest/libqos/virtio-9p-client.c | 43 ++-
 tests/qtest/libqos/virtio-9p-client.h | 31 +--
 tests/qtest/virtio-9p-test.c  | 26 ++--
 3 files changed, 73 insertions(+), 27 deletions(-)

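Call sites can then identify both the directory and the link target either
by fid or by path; an illustrative hard-link creation (paths hypothetical):

    /* Create hard link "dir/hardlink" pointing at existing "dir/file": */
    tlink({ .client = v9p, .atPath = "dir", .name = "hardlink",
            .toPath = "dir/file" });
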
diff --git a/tests/qtest/libqos/virtio-9p-client.c 
b/tests/qtest/libqos/virtio-9p-client.c
index 89eaf50355..a2770719b9 100644
--- a/tests/qtest/libqos/virtio-9p-client.c
+++ b/tests/qtest/libqos/virtio-9p-client.c
@@ -950,23 +950,50 @@ void v9fs_rsymlink(P9Req *req, v9fs_qid *qid)
 }
 
 /* size[4] Tlink tag[2] dfid[4] fid[4] name[s] */
-P9Req *v9fs_tlink(QVirtio9P *v9p, uint32_t dfid, uint32_t fid,
-  const char *name, uint16_t tag)
+TlinkRes v9fs_tlink(TlinkOpt opt)
 {
 P9Req *req;
+uint32_t err;
+
+g_assert(opt.client);
+/* expecting either hi-level atPath or low-level dfid, but not both */
+g_assert(!opt.atPath || !opt.dfid);
+/* expecting either hi-level toPath or low-level fid, but not both */
+g_assert(!opt.toPath || !opt.fid);
+
+if (opt.atPath) {
+opt.dfid = v9fs_twalk((TWalkOpt) { .client = opt.client,
+   .path = opt.atPath }).newfid;
+}
+if (opt.toPath) {
+opt.fid = v9fs_twalk((TWalkOpt) { .client = opt.client,
+  .path = opt.toPath }).newfid;
+}
 
 uint32_t body_size = 4 + 4;
-uint16_t string_size = v9fs_string_size(name);
+uint16_t string_size = v9fs_string_size(opt.name);
 
 g_assert_cmpint(body_size, <=, UINT32_MAX - string_size);
 body_size += string_size;
 
-req = v9fs_req_init(v9p, body_size, P9_TLINK, tag);
-v9fs_uint32_write(req, dfid);
-v9fs_uint32_write(req, fid);
-v9fs_string_write(req, name);
+req = v9fs_req_init(opt.client, body_size, P9_TLINK, opt.tag);
+v9fs_uint32_write(req, opt.dfid);
+v9fs_uint32_write(req, opt.fid);
+v9fs_string_write(req, opt.name);
 v9fs_req_send(req);
-return req;
+
+if (!opt.requestOnly) {
+v9fs_req_wait_for_reply(req, NULL);
+if (opt.expectErr) {
+v9fs_rlerror(req, &err);
+g_assert_cmpint(err, ==, opt.expectErr);
+} else {
+v9fs_rlink(req);
+}
+req = NULL; /* request was freed */
+}
+
+return (TlinkRes) { .req = req };
 }
 
 /* size[4] Rlink tag[2] */
diff --git a/tests/qtest/libqos/virtio-9p-client.h 
b/tests/qtest/libqos/virtio-9p-client.h
index b905a54966..49ffd0fc51 100644
--- a/tests/qtest/libqos/virtio-9p-client.h
+++ b/tests/qtest/libqos/virtio-9p-client.h
@@ -387,6 +387,34 @@ typedef struct TsymlinkRes {
 P9Req *req;
 } TsymlinkRes;
 
+/* options for 'Tlink' 9p request */
+typedef struct TlinkOpt {
+/* 9P client being used (mandatory) */
+QVirtio9P *client;
+/* user supplied tag number being returned with response (optional) */
+uint16_t tag;
+/* low-level variant of directory where hard link shall be created */
+uint32_t dfid;
+/* high-level variant of directory where hard link shall be created */
+const char *atPath;
+/* low-level variant of target referenced by new hard link */
+uint32_t fid;
+/* high-level variant of target referenced by new hard link */
+const char *toPath;
+/* name of hard link (required) */
+const char *name;
+/* only send Tlink request but not wait for a reply? (optional) */
+bool requestOnly;
+/* do we expect an Rlerror response, if yes which error code? (optional) */
+uint32_t expectErr;
+} TlinkOpt;
+
+/* result of 'Tlink' 9p request */
+typedef struct TlinkRes {
+/* if requestOnly was set: request object for further processing */
+P9Req *req;
+} TlinkRes;
+
 void v9fs_set_allocator(QGuestAllocator *t_alloc);
 void v9fs_memwrite(P9Req *req, const void *addr, size_t len);
 void v9fs_memskip(P9Req *req, size_t len);
@@ -432,8 +460,7 @@ TlcreateRes v9fs_tlcreate(TlcreateOpt);
 void v9fs_rlcreate(P9Req *req, v9fs_qid *qid, uint32_t *iounit);
 TsymlinkRes v9fs_tsymlink(TsymlinkOpt);
 void v9fs_rsymlink(P9Req *req, v9fs_qid *qid);
-P9Req *v9fs_tlink(QVirtio9P *v9p, uint32_t dfid, uint32_t fid,
-  const char *name, uint16_t tag);
+TlinkRes v9fs_tlink(TlinkOpt);
 void v9fs_rlink(P9Req *req);
 P9Req *v9fs_tunlinkat(QVirtio9P *v9p, uint32_t dirfd, const char *name,
   uint32_t flags, uint16_t tag);
diff --git a/tests/qtest/virtio-9p-test.c b/tests/qtest/virtio-9p-test.c
index c7213d6caf..185eaf8b1e 100644
--- a/tests/qtest/virtio-9p-test.c
+++ b/tests/qtest/virtio-9p-test.c
@@ -27,6 +27,7 @@
 #define tmkdir(...) v9fs_tmkdir((TMkdirOpt) __VA_ARGS__)
 #define tlcreate(...) v9fs_tlcreate((TlcreateOpt) __VA_ARGS__)
 #define tsymlink(...) v9fs_tsymlink((TsymlinkOpt) __VA_ARGS__)
+#define 

Re: [PATCH] spec: Add NBD_OPT_EXTENDED_HEADERS

2022-10-04 Thread Eric Blake
On Fri, Dec 03, 2021 at 05:14:34PM -0600, Eric Blake wrote:
> Add a new negotiation feature where the client and server agree to use
> larger packet headers on every packet sent during transmission phase.
> This has two purposes: first, it makes it possible to perform
> operations like trim, write zeroes, and block status on more than 2^32
> bytes in a single command; this in turn requires that some structured
> replies from the server also be extended to match.  The wording chosen
> here is careful to permit a server to use either flavor in its reply
> (that is, a request less than 32-bits can trigger an extended reply,
> and conversely a request larger than 32-bits can trigger a compact
> reply).

Following up on this original proposal with something that came out of
KVM Forum this year.

> +* `NBD_REPLY_TYPE_BLOCK_STATUS_EXT` (6)
> +
> +  This chunk type is in the status chunk category.  *length* MUST be
> +  4 + (a positive multiple of 16).  The semantics of this chunk mirror
> +  those of `NBD_REPLY_TYPE_BLOCK_STATUS`, other than the use of a
> +  larger *extent length* field, as well as added padding to ease
> +  alignment.  This chunk type MUST NOT be used unless extended headers
> +  were negotiated with `NBD_OPT_EXTENDED_HEADERS`.
> +
> +  The payload starts with:
> +
> +  32 bits, metadata context ID  
> +
> +  and is followed by a list of one or more descriptors, each with this
> +  layout:
> +
> +  64 bits, length of the extent to which the status below
> + applies (unsigned, MUST be nonzero)  
> +  32 bits, status flags  
> +  32 bits, padding (MUST be zero)
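
Expressed as a C sketch of the on-the-wire layout (field names are
illustrative, not from the spec; all integers are big-endian on the wire):

    #include <stdint.h>

    /* NBD_REPLY_TYPE_BLOCK_STATUS_EXT payload, per the proposal above. */
    struct nbd_bs_ext_header {
        uint32_t context_id;    /* metadata context ID */
    };

    struct nbd_bs_ext_descriptor { /* one or more follow the header */
        uint64_t length;        /* extent length, MUST be nonzero */
        uint32_t status_flags;
        uint32_t padding;       /* MUST be zero */
    };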

During KVM Forum, I had several conversations about Zoned Block
Devices (https://zonedstorage.io/docs/linux/zbd-api), and what it
would take to expose ZBD information over NBD.  In particular,
NBD_CMD_BLOCK_STATUS sounds like a great way for advertising
information about zones (by adding several metadata contexts that can
be negotiated during NBD_OPT_SET_META_CONTEXT), except for the fact
that a zone might be larger than 32 bits in size.  So Rich Jones asked
me the question of whether my work on 64-bit extensions to the NBD
protocol could also allow for a server to advertise a metadata context
only to clients that support 64-bit extensions, at which point it can
report 64-bit offsets or lengths as needed, rather than being limited
to 32-bit status flags.

The idea definitely has merit, so I'm working on incorporating that
into my next revision for 64-bit extensions in NBD.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




Re: [PATCH] vhost-vdpa: fix assert !virtio_net_get_subqueue(nc)->async_tx.elem in virtio_net_reset

2022-10-04 Thread Si-Wei Liu
Apologies, please disregard this email. It was sent to the wrong
target audience, although the content of the patch is correct. For
those who want to review the patch, please reply to this thread:


Message-Id: <1664913563-3351-1-git-send-email-si-wei@oracle.com>

Thanks,
-Siwei

On 10/4/2022 12:58 PM, Si-Wei Liu wrote:

The cited commit has incorrect code in vhost_vdpa_receive() that returns
zero instead of the full packet size to the caller. This prevents pending
packets from being freed, so they get stuck in the tx queue forever. When
the device is reset later on, the assertion failure below ensues:

#0  0x7f86d53bb387 in raise () from /lib64/libc.so.6
#1  0x7f86d53bca78 in abort () from /lib64/libc.so.6
#2  0x7f86d53b41a6 in __assert_fail_base () from /lib64/libc.so.6
#3  0x7f86d53b4252 in __assert_fail () from /lib64/libc.so.6
#4  0x55b8f6ff6fcc in virtio_net_reset (vdev=<optimized out>)
    at /usr/src/debug/qemu/hw/net/virtio-net.c:563
#5  0x55b8f7012fcf in virtio_reset (opaque=0x55b8faf881f0)
    at /usr/src/debug/qemu/hw/virtio/virtio.c:1993
#6  0x55b8f71f0086 in virtio_bus_reset (bus=bus@entry=0x55b8faf88178)
    at /usr/src/debug/qemu/hw/virtio/virtio-bus.c:102
#7  0x55b8f71f1620 in virtio_pci_reset (qdev=<optimized out>)
    at /usr/src/debug/qemu/hw/virtio/virtio-pci.c:1845
#8  0x55b8f6fafc6c in memory_region_write_accessor (mr=<optimized out>,
    addr=<optimized out>, value=<optimized out>, size=<optimized out>,
    shift=<optimized out>, mask=<optimized out>, attrs=...)
    at /usr/src/debug/qemu/memory.c:483
#9  0x55b8f6fadce9 in access_with_adjusted_size (addr=addr@entry=20,
    value=value@entry=0x7f867e7fb7e8, size=size@entry=1,
    access_size_min=<optimized out>, access_size_max=<optimized out>,
    access_fn=0x55b8f6fafc20 <memory_region_write_accessor>,
    mr=0x55b8faf80a50, attrs=...) at /usr/src/debug/qemu/memory.c:544
#10 0x55b8f6fb1d0b in memory_region_dispatch_write (mr=mr@entry=0x55b8faf80a50,
    addr=addr@entry=20, data=0, op=<optimized out>, attrs=attrs@entry=...)
    at /usr/src/debug/qemu/memory.c:1470
#11 0x55b8f6f62ada in flatview_write_continue (fv=fv@entry=0x7f86ac04cd20,
    addr=addr@entry=549755813908, attrs=..., attrs@entry=...,
    buf=buf@entry=0x7f86d0223028, len=len@entry=1, addr1=20, l=1,
    mr=0x55b8faf80a50) at /usr/src/debug/qemu/exec.c:3266
#12 0x55b8f6f62c8f in flatview_write (fv=0x7f86ac04cd20, addr=549755813908,
    attrs=..., buf=0x7f86d0223028, len=1) at /usr/src/debug/qemu/exec.c:3306
#13 0x55b8f6f674cb in address_space_write (as=<optimized out>,
    addr=<optimized out>, attrs=..., buf=<optimized out>,
    len=<optimized out>) at /usr/src/debug/qemu/exec.c:3396
#14 0x55b8f6f67575 in address_space_rw (as=<optimized out>,
    addr=<optimized out>, attrs=..., attrs@entry=...,
    buf=buf@entry=0x7f86d0223028, len=<optimized out>,
    is_write=<optimized out>) at /usr/src/debug/qemu/exec.c:3406
#15 0x55b8f6fc1cc8 in kvm_cpu_exec (cpu=cpu@entry=0x55b8f9aa0e10)
    at /usr/src/debug/qemu/accel/kvm/kvm-all.c:2410
#16 0x55b8f6fa5f5e in qemu_kvm_cpu_thread_fn (arg=0x55b8f9aa0e10)
    at /usr/src/debug/qemu/cpus.c:1318
#17 0x55b8f7336e16 in qemu_thread_start (args=0x55b8f9ac8480)
    at /usr/src/debug/qemu/util/qemu-thread-posix.c:519
#18 0x7f86d575aea5 in start_thread () from /lib64/libpthread.so.0
#19 0x7f86d5483b2d in clone () from /lib64/libc.so.6

Make vhost_vdpa_receive() return the size passed in as is, so that the
caller, qemu_deliver_packet_iov(), eventually propagates it back to
virtio_net_flush_tx(), which then releases pending packets from the
async_tx queue. This corresponds to the drop path where
qemu_sendv_packet_async() returns non-zero in virtio_net_flush_tx().
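
For reference, a minimal sketch (not the actual QEMU code paths) of why
the return value matters: the net layer treats a 0 return from a peer's
receive callback as "cannot receive right now" and keeps the packet
queued, while a positive return counts as delivered (or dropped) and
lets the sender free it:

    #include <stddef.h>
    #include <stdint.h>
    #include <sys/types.h>

    /* hypothetical helper mirroring the delivery logic described above */
    static ssize_t deliver_packet(ssize_t (*receive)(const uint8_t *, size_t),
                                  const uint8_t *buf, size_t size)
    {
        ssize_t ret = receive(buf, size);
        if (ret == 0) {
            /* peer busy: the packet stays queued and the sender's
             * async_tx element remains pending, which is the clog
             * seen in the backtrace above */
            return 0;
        }
        /* consumed (or deliberately dropped): sender may free it */
        return ret;
    }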

Fixes: 846a1e85da64 ("vdpa: Add dummy receive callback")
Cc: Eugenio Perez Martin
Signed-off-by: Si-Wei Liu
---
  net/vhost-vdpa.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 4bc3fd0..182b3a1 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -211,7 +211,7 @@ static bool vhost_vdpa_check_peer_type(NetClientState *nc, 
ObjectClass *oc,
  static ssize_t vhost_vdpa_receive(NetClientState *nc, const uint8_t *buf,
size_t size)
  {
-return 0;
+return size;
  }
  
  static NetClientInfo net_vhost_vdpa_info = {


[PATCH 17/20] tests/9p: merge v9fs_tsymlink() and do_symlink()

2022-10-04 Thread Christian Schoenebeck
As with previous patches, unify those two functions into a single function
v9fs_tsymlink() by using a declarative function-arguments approach.
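
A call site then reads declaratively, e.g. (sketch only; the walk path,
link name and target are made up, and `qid' is a local v9fs_qid):

    tsymlink({
        .client = v9p, .atPath = "04", .name = "symlink_1",
        .symtgt = "real_file", .rsymlink.qid = &qid
    });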

Signed-off-by: Christian Schoenebeck 
---
 tests/qtest/libqos/virtio-9p-client.c | 37 ++-
 tests/qtest/libqos/virtio-9p-client.h | 35 +++--
 tests/qtest/virtio-9p-test.c  | 27 +++
 3 files changed, 73 insertions(+), 26 deletions(-)

diff --git a/tests/qtest/libqos/virtio-9p-client.c 
b/tests/qtest/libqos/virtio-9p-client.c
index 5c805a133c..89eaf50355 100644
--- a/tests/qtest/libqos/virtio-9p-client.c
+++ b/tests/qtest/libqos/virtio-9p-client.c
@@ -892,10 +892,23 @@ void v9fs_rlcreate(P9Req *req, v9fs_qid *qid, uint32_t 
*iounit)
 }
 
 /* size[4] Tsymlink tag[2] fid[4] name[s] symtgt[s] gid[4] */
-P9Req *v9fs_tsymlink(QVirtio9P *v9p, uint32_t fid, const char *name,
- const char *symtgt, uint32_t gid, uint16_t tag)
+TsymlinkRes v9fs_tsymlink(TsymlinkOpt opt)
 {
 P9Req *req;
+uint32_t err;
+g_autofree char *name = g_strdup(opt.name);
+g_autofree char *symtgt = g_strdup(opt.symtgt);
+
+g_assert(opt.client);
+/* expecting either hi-level atPath or low-level fid, but not both */
+g_assert(!opt.atPath || !opt.fid);
+/* expecting either Rsymlink or Rlerror, but obviously not both */
+g_assert(!opt.expectErr || !opt.rsymlink.qid);
+
+if (opt.atPath) {
+opt.fid = v9fs_twalk((TWalkOpt) { .client = opt.client,
+  .path = opt.atPath }).newfid;
+}
 
 uint32_t body_size = 4 + 4;
 uint16_t string_size = v9fs_string_size(name) + v9fs_string_size(symtgt);
@@ -903,13 +916,25 @@ P9Req *v9fs_tsymlink(QVirtio9P *v9p, uint32_t fid, const 
char *name,
 g_assert_cmpint(body_size, <=, UINT32_MAX - string_size);
 body_size += string_size;
 
-req = v9fs_req_init(v9p, body_size, P9_TSYMLINK, tag);
-v9fs_uint32_write(req, fid);
+req = v9fs_req_init(opt.client, body_size, P9_TSYMLINK, opt.tag);
+v9fs_uint32_write(req, opt.fid);
 v9fs_string_write(req, name);
 v9fs_string_write(req, symtgt);
-v9fs_uint32_write(req, gid);
+v9fs_uint32_write(req, opt.gid);
 v9fs_req_send(req);
-return req;
+
+if (!opt.requestOnly) {
+v9fs_req_wait_for_reply(req, NULL);
+if (opt.expectErr) {
+v9fs_rlerror(req, &err);
+g_assert_cmpint(err, ==, opt.expectErr);
+} else {
+v9fs_rsymlink(req, opt.rsymlink.qid);
+}
+req = NULL; /* request was freed */
+}
+
+return (TsymlinkRes) { .req = req };
 }
 
 /* size[4] Rsymlink tag[2] qid[13] */
diff --git a/tests/qtest/libqos/virtio-9p-client.h 
b/tests/qtest/libqos/virtio-9p-client.h
index 8916b1c7aa..b905a54966 100644
--- a/tests/qtest/libqos/virtio-9p-client.h
+++ b/tests/qtest/libqos/virtio-9p-client.h
@@ -355,6 +355,38 @@ typedef struct TlcreateRes {
 P9Req *req;
 } TlcreateRes;
 
+/* options for 'Tsymlink' 9p request */
+typedef struct TsymlinkOpt {
+/* 9P client being used (mandatory) */
+QVirtio9P *client;
+/* user supplied tag number being returned with response (optional) */
+uint16_t tag;
+/* low-level variant of directory where symlink shall be created */
+uint32_t fid;
+/* high-level variant of directory where symlink shall be created */
+const char *atPath;
+/* name of symlink (required) */
+const char *name;
+/* where symlink will point to (required) */
+const char *symtgt;
+/* effective group ID of caller */
+uint32_t gid;
+/* data being received from 9p server as 'Rsymlink' response (optional) */
+struct {
+v9fs_qid *qid;
+} rsymlink;
+/* only send Tsymlink request but not wait for a reply? (optional) */
+bool requestOnly;
+/* do we expect an Rlerror response, if yes which error code? (optional) */
+uint32_t expectErr;
+} TsymlinkOpt;
+
+/* result of 'Tsymlink' 9p request */
+typedef struct TsymlinkRes {
+/* if requestOnly was set: request object for further processing */
+P9Req *req;
+} TsymlinkRes;
+
 void v9fs_set_allocator(QGuestAllocator *t_alloc);
 void v9fs_memwrite(P9Req *req, const void *addr, size_t len);
 void v9fs_memskip(P9Req *req, size_t len);
@@ -398,8 +430,7 @@ TMkdirRes v9fs_tmkdir(TMkdirOpt);
 void v9fs_rmkdir(P9Req *req, v9fs_qid *qid);
 TlcreateRes v9fs_tlcreate(TlcreateOpt);
 void v9fs_rlcreate(P9Req *req, v9fs_qid *qid, uint32_t *iounit);
-P9Req *v9fs_tsymlink(QVirtio9P *v9p, uint32_t fid, const char *name,
- const char *symtgt, uint32_t gid, uint16_t tag);
+TsymlinkRes v9fs_tsymlink(TsymlinkOpt);
 void v9fs_rsymlink(P9Req *req, v9fs_qid *qid);
 P9Req *v9fs_tlink(QVirtio9P *v9p, uint32_t dfid, uint32_t fid,
   const char *name, uint16_t tag);
diff --git a/tests/qtest/virtio-9p-test.c b/tests/qtest/virtio-9p-test.c
index d13b27bd2e..c7213d6caf 100644
--- a/tests/qtest/virtio-9p-test.c
+++ b/tests/qtest/virtio-9p-test.c

[PATCH 13/20] tests/9p: simplify callers of twrite()

2022-10-04 Thread Christian Schoenebeck
Now that twrite() uses a declarative approach, simplify the
code of its callers.

Signed-off-by: Christian Schoenebeck 
---
 tests/qtest/virtio-9p-test.c | 9 +++--
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/tests/qtest/virtio-9p-test.c b/tests/qtest/virtio-9p-test.c
index a5b9284acb..5ad7bebec7 100644
--- a/tests/qtest/virtio-9p-test.c
+++ b/tests/qtest/virtio-9p-test.c
@@ -377,7 +377,6 @@ static void fs_write(void *obj, void *data, QGuestAllocator 
*t_alloc)
 char *wnames[] = { g_strdup(QTEST_V9FS_SYNTH_WRITE_FILE) };
 g_autofree char *buf = g_malloc0(write_count);
 uint32_t count;
-P9Req *req;
 
 tattach({ .client = v9p });
 twalk({
@@ -386,12 +385,10 @@ static void fs_write(void *obj, void *data, 
QGuestAllocator *t_alloc)
 
 tlopen({ .client = v9p, .fid = 1, .flags = O_WRONLY });
 
-req = twrite({
+count = twrite({
 .client = v9p, .fid = 1, .offset = 0, .count = write_count,
-.data = buf, .requestOnly = true
-}).req;
-v9fs_req_wait_for_reply(req, NULL);
-v9fs_rwrite(req, &count);
+.data = buf
+}).count;
 g_assert_cmpint(count, ==, write_count);
 
 g_free(wnames[0]);
-- 
2.30.2




[PATCH 08/20] tests/9p: convert v9fs_treaddir() to declarative arguments

2022-10-04 Thread Christian Schoenebeck
Use declarative function arguments for function v9fs_treaddir().

Signed-off-by: Christian Schoenebeck 
---
 tests/qtest/libqos/virtio-9p-client.c | 32 --
 tests/qtest/libqos/virtio-9p-client.h | 33 +--
 tests/qtest/virtio-9p-test.c  | 11 +++--
 3 files changed, 65 insertions(+), 11 deletions(-)

diff --git a/tests/qtest/libqos/virtio-9p-client.c 
b/tests/qtest/libqos/virtio-9p-client.c
index 29916a23b5..047c8993b6 100644
--- a/tests/qtest/libqos/virtio-9p-client.c
+++ b/tests/qtest/libqos/virtio-9p-client.c
@@ -557,17 +557,35 @@ void v9fs_rgetattr(P9Req *req, v9fs_attr *attr)
 }
 
 /* size[4] Treaddir tag[2] fid[4] offset[8] count[4] */
-P9Req *v9fs_treaddir(QVirtio9P *v9p, uint32_t fid, uint64_t offset,
- uint32_t count, uint16_t tag)
+TReadDirRes v9fs_treaddir(TReadDirOpt opt)
 {
 P9Req *req;
+uint32_t err;
 
-req = v9fs_req_init(v9p, 4 + 8 + 4, P9_TREADDIR, tag);
-v9fs_uint32_write(req, fid);
-v9fs_uint64_write(req, offset);
-v9fs_uint32_write(req, count);
+g_assert(opt.client);
+/* expecting either Rreaddir or Rlerror, but obviously not both */
+g_assert(!opt.expectErr || !(opt.rreaddir.count ||
+ opt.rreaddir.nentries || opt.rreaddir.entries));
+
+req = v9fs_req_init(opt.client, 4 + 8 + 4, P9_TREADDIR, opt.tag);
+v9fs_uint32_write(req, opt.fid);
+v9fs_uint64_write(req, opt.offset);
+v9fs_uint32_write(req, opt.count);
 v9fs_req_send(req);
-return req;
+
+if (!opt.requestOnly) {
+v9fs_req_wait_for_reply(req, NULL);
+if (opt.expectErr) {
+v9fs_rlerror(req, &err);
+g_assert_cmpint(err, ==, opt.expectErr);
+} else {
+v9fs_rreaddir(req, opt.rreaddir.count, opt.rreaddir.nentries,
+  opt.rreaddir.entries);
+}
+req = NULL; /* request was freed */
+}
+
+return (TReadDirRes) { .req = req };
 }
 
 /* size[4] Rreaddir tag[2] count[4] data[count] */
diff --git a/tests/qtest/libqos/virtio-9p-client.h 
b/tests/qtest/libqos/virtio-9p-client.h
index f7b1bfc79a..2bf649085f 100644
--- a/tests/qtest/libqos/virtio-9p-client.h
+++ b/tests/qtest/libqos/virtio-9p-client.h
@@ -182,6 +182,36 @@ typedef struct TGetAttrRes {
 P9Req *req;
 } TGetAttrRes;
 
+/* options for 'Treaddir' 9p request */
+typedef struct TReadDirOpt {
+/* 9P client being used (mandatory) */
+QVirtio9P *client;
+/* user supplied tag number being returned with response (optional) */
+uint16_t tag;
+/* file ID of directory whose entries shall be retrieved (required) */
+uint32_t fid;
+/* offset in entries stream, i.e. for multiple requests (optional) */
+uint64_t offset;
+/* maximum bytes to be returned by server (required) */
+uint32_t count;
+/* data being received from 9p server as 'Rreaddir' response (optional) */
+struct {
+uint32_t *count;
+uint32_t *nentries;
+struct V9fsDirent **entries;
+} rreaddir;
+/* only send Treaddir request but not wait for a reply? (optional) */
+bool requestOnly;
+/* do we expect an Rlerror response, if yes which error code? (optional) */
+uint32_t expectErr;
+} TReadDirOpt;
+
+/* result of 'Treaddir' 9p request */
+typedef struct TReadDirRes {
+/* if requestOnly was set: request object for further processing */
+P9Req *req;
+} TReadDirRes;
+
 void v9fs_set_allocator(QGuestAllocator *t_alloc);
 void v9fs_memwrite(P9Req *req, const void *addr, size_t len);
 void v9fs_memskip(P9Req *req, size_t len);
@@ -211,8 +241,7 @@ TWalkRes v9fs_twalk(TWalkOpt opt);
 void v9fs_rwalk(P9Req *req, uint16_t *nwqid, v9fs_qid **wqid);
 TGetAttrRes v9fs_tgetattr(TGetAttrOpt);
 void v9fs_rgetattr(P9Req *req, v9fs_attr *attr);
-P9Req *v9fs_treaddir(QVirtio9P *v9p, uint32_t fid, uint64_t offset,
- uint32_t count, uint16_t tag);
+TReadDirRes v9fs_treaddir(TReadDirOpt);
 void v9fs_rreaddir(P9Req *req, uint32_t *count, uint32_t *nentries,
struct V9fsDirent **entries);
 void v9fs_free_dirents(struct V9fsDirent *e);
diff --git a/tests/qtest/virtio-9p-test.c b/tests/qtest/virtio-9p-test.c
index ae1220d0cb..e5c174c218 100644
--- a/tests/qtest/virtio-9p-test.c
+++ b/tests/qtest/virtio-9p-test.c
@@ -20,6 +20,7 @@
 #define tversion(...) v9fs_tversion((TVersionOpt) __VA_ARGS__)
 #define tattach(...) v9fs_tattach((TAttachOpt) __VA_ARGS__)
 #define tgetattr(...) v9fs_tgetattr((TGetAttrOpt) __VA_ARGS__)
+#define treaddir(...) v9fs_treaddir((TReadDirOpt) __VA_ARGS__)
 
 static void pci_config(void *obj, void *data, QGuestAllocator *t_alloc)
 {
@@ -119,7 +120,10 @@ static void fs_readdir(void *obj, void *data, 
QGuestAllocator *t_alloc)
 /*
  * submit count = msize - 11, because 11 is the header size of Rreaddir
  */
-req = v9fs_treaddir(v9p, 1, 0, P9_MAX_SIZE - 11, 0);
+req = treaddir({
+.client = v9p, .fid = 1, .offset = 0, .count = P9_MAX_SIZE - 11,
+.requestOnly = true
+}).req;

[PATCH 15/20] tests/9p: merge v9fs_tmkdir() and do_mkdir()

2022-10-04 Thread Christian Schoenebeck
As with previous patches, unify those two functions into a single function
v9fs_tmkdir() by using a declarative function-arguments approach.
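
A call site then reads declaratively, e.g. (sketch only; the walk path
and directory name are made up):

    tmkdir({ .client = v9p, .atPath = "02", .name = "mydir" });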

Signed-off-by: Christian Schoenebeck 
---
 tests/qtest/libqos/virtio-9p-client.c | 42 ++-
 tests/qtest/libqos/virtio-9p-client.h | 36 +--
 tests/qtest/virtio-9p-test.c  | 30 ++-
 3 files changed, 78 insertions(+), 30 deletions(-)

diff --git a/tests/qtest/libqos/virtio-9p-client.c 
b/tests/qtest/libqos/virtio-9p-client.c
index 3be0ffc7da..c374ba2048 100644
--- a/tests/qtest/libqos/virtio-9p-client.c
+++ b/tests/qtest/libqos/virtio-9p-client.c
@@ -766,10 +766,26 @@ void v9fs_rflush(P9Req *req)
 }
 
 /* size[4] Tmkdir tag[2] dfid[4] name[s] mode[4] gid[4] */
-P9Req *v9fs_tmkdir(QVirtio9P *v9p, uint32_t dfid, const char *name,
-   uint32_t mode, uint32_t gid, uint16_t tag)
+TMkdirRes v9fs_tmkdir(TMkdirOpt opt)
 {
 P9Req *req;
+uint32_t err;
+g_autofree char *name = g_strdup(opt.name);
+
+g_assert(opt.client);
+/* expecting either hi-level atPath or low-level dfid, but not both */
+g_assert(!opt.atPath || !opt.dfid);
+/* expecting either Rmkdir or Rlerror, but obviously not both */
+g_assert(!opt.expectErr || !opt.rmkdir.qid);
+
+if (opt.atPath) {
+opt.dfid = v9fs_twalk((TWalkOpt) { .client = opt.client,
+   .path = opt.atPath }).newfid;
+}
+
+if (!opt.mode) {
+opt.mode = 0750;
+}
 
 uint32_t body_size = 4 + 4 + 4;
 uint16_t string_size = v9fs_string_size(name);
@@ -777,13 +793,25 @@ P9Req *v9fs_tmkdir(QVirtio9P *v9p, uint32_t dfid, const 
char *name,
 g_assert_cmpint(body_size, <=, UINT32_MAX - string_size);
 body_size += string_size;
 
-req = v9fs_req_init(v9p, body_size, P9_TMKDIR, tag);
-v9fs_uint32_write(req, dfid);
+req = v9fs_req_init(opt.client, body_size, P9_TMKDIR, opt.tag);
+v9fs_uint32_write(req, opt.dfid);
 v9fs_string_write(req, name);
-v9fs_uint32_write(req, mode);
-v9fs_uint32_write(req, gid);
+v9fs_uint32_write(req, opt.mode);
+v9fs_uint32_write(req, opt.gid);
 v9fs_req_send(req);
-return req;
+
+if (!opt.requestOnly) {
+v9fs_req_wait_for_reply(req, NULL);
+if (opt.expectErr) {
+v9fs_rlerror(req, &err);
+g_assert_cmpint(err, ==, opt.expectErr);
+} else {
+v9fs_rmkdir(req, opt.rmkdir.qid);
+}
+req = NULL; /* request was freed */
+}
+
+return (TMkdirRes) { .req = req };
 }
 
 /* size[4] Rmkdir tag[2] qid[13] */
diff --git a/tests/qtest/libqos/virtio-9p-client.h 
b/tests/qtest/libqos/virtio-9p-client.h
index b22b54c720..ae44f95a4d 100644
--- a/tests/qtest/libqos/virtio-9p-client.h
+++ b/tests/qtest/libqos/virtio-9p-client.h
@@ -287,6 +287,39 @@ typedef struct TFlushRes {
 P9Req *req;
 } TFlushRes;
 
+/* options for 'Tmkdir' 9p request */
+typedef struct TMkdirOpt {
+/* 9P client being used (mandatory) */
+QVirtio9P *client;
+/* user supplied tag number being returned with response (optional) */
+uint16_t tag;
+/* low level variant of directory where new one shall be created */
+uint32_t dfid;
+/* high-level variant of directory where new one shall be created */
+const char *atPath;
+/* New directory's name (required) */
+const char *name;
+/* Linux mkdir(2) mode bits (optional) */
+uint32_t mode;
+/* effective group ID of caller */
+uint32_t gid;
+/* data being received from 9p server as 'Rmkdir' response (optional) */
+struct {
+/* QID of newly created directory */
+v9fs_qid *qid;
+} rmkdir;
+/* only send Tmkdir request but not wait for a reply? (optional) */
+bool requestOnly;
+/* do we expect an Rlerror response, if yes which error code? (optional) */
+uint32_t expectErr;
+} TMkdirOpt;
+
+/* result of 'TMkdir' 9p request */
+typedef struct TMkdirRes {
+/* if requestOnly was set: request object for further processing */
+P9Req *req;
+} TMkdirRes;
+
 void v9fs_set_allocator(QGuestAllocator *t_alloc);
 void v9fs_memwrite(P9Req *req, const void *addr, size_t len);
 void v9fs_memskip(P9Req *req, size_t len);
@@ -326,8 +359,7 @@ TWriteRes v9fs_twrite(TWriteOpt);
 void v9fs_rwrite(P9Req *req, uint32_t *count);
 TFlushRes v9fs_tflush(TFlushOpt);
 void v9fs_rflush(P9Req *req);
-P9Req *v9fs_tmkdir(QVirtio9P *v9p, uint32_t dfid, const char *name,
-   uint32_t mode, uint32_t gid, uint16_t tag);
+TMkdirRes v9fs_tmkdir(TMkdirOpt);
 void v9fs_rmkdir(P9Req *req, v9fs_qid *qid);
 P9Req *v9fs_tlcreate(QVirtio9P *v9p, uint32_t fid, const char *name,
  uint32_t flags, uint32_t mode, uint32_t gid,
diff --git a/tests/qtest/virtio-9p-test.c b/tests/qtest/virtio-9p-test.c
index 5544998bac..6d75afee87 100644
--- a/tests/qtest/virtio-9p-test.c
+++ b/tests/qtest/virtio-9p-test.c
@@ -24,6 +24,7 @@
#define tlopen(...) v9fs_tlopen((TLOpenOpt) __VA_ARGS__)

[PATCH 20/20] tests/9p: remove unnecessary g_strdup() calls

2022-10-04 Thread Christian Schoenebeck
This is a leftover from before the recent function merge and
refactoring patches:

As these functions do not return control to the caller in
between, it is not necessary to duplicate strings passed to them.

Signed-off-by: Christian Schoenebeck 
---
 tests/qtest/libqos/virtio-9p-client.c | 19 ---
 1 file changed, 8 insertions(+), 11 deletions(-)

diff --git a/tests/qtest/libqos/virtio-9p-client.c 
b/tests/qtest/libqos/virtio-9p-client.c
index e017e030ec..e4a368e036 100644
--- a/tests/qtest/libqos/virtio-9p-client.c
+++ b/tests/qtest/libqos/virtio-9p-client.c
@@ -770,7 +770,6 @@ TMkdirRes v9fs_tmkdir(TMkdirOpt opt)
 {
 P9Req *req;
 uint32_t err;
-g_autofree char *name = g_strdup(opt.name);
 
 g_assert(opt.client);
 /* expecting either hi-level atPath or low-level dfid, but not both */
@@ -788,14 +787,14 @@ TMkdirRes v9fs_tmkdir(TMkdirOpt opt)
 }
 
 uint32_t body_size = 4 + 4 + 4;
-uint16_t string_size = v9fs_string_size(name);
+uint16_t string_size = v9fs_string_size(opt.name);
 
 g_assert_cmpint(body_size, <=, UINT32_MAX - string_size);
 body_size += string_size;
 
 req = v9fs_req_init(opt.client, body_size, P9_TMKDIR, opt.tag);
 v9fs_uint32_write(req, opt.dfid);
-v9fs_string_write(req, name);
+v9fs_string_write(req, opt.name);
 v9fs_uint32_write(req, opt.mode);
 v9fs_uint32_write(req, opt.gid);
 v9fs_req_send(req);
@@ -831,7 +830,6 @@ TlcreateRes v9fs_tlcreate(TlcreateOpt opt)
 {
 P9Req *req;
 uint32_t err;
-g_autofree char *name = g_strdup(opt.name);
 
 g_assert(opt.client);
 /* expecting either hi-level atPath or low-level fid, but not both */
@@ -849,14 +847,14 @@ TlcreateRes v9fs_tlcreate(TlcreateOpt opt)
 }
 
 uint32_t body_size = 4 + 4 + 4 + 4;
-uint16_t string_size = v9fs_string_size(name);
+uint16_t string_size = v9fs_string_size(opt.name);
 
 g_assert_cmpint(body_size, <=, UINT32_MAX - string_size);
 body_size += string_size;
 
 req = v9fs_req_init(opt.client, body_size, P9_TLCREATE, opt.tag);
 v9fs_uint32_write(req, opt.fid);
-v9fs_string_write(req, name);
+v9fs_string_write(req, opt.name);
 v9fs_uint32_write(req, opt.flags);
 v9fs_uint32_write(req, opt.mode);
 v9fs_uint32_write(req, opt.gid);
@@ -896,8 +894,6 @@ TsymlinkRes v9fs_tsymlink(TsymlinkOpt opt)
 {
 P9Req *req;
 uint32_t err;
-g_autofree char *name = g_strdup(opt.name);
-g_autofree char *symtgt = g_strdup(opt.symtgt);
 
 g_assert(opt.client);
 /* expecting either hi-level atPath or low-level fid, but not both */
@@ -911,15 +907,16 @@ TsymlinkRes v9fs_tsymlink(TsymlinkOpt opt)
 }
 
 uint32_t body_size = 4 + 4;
-uint16_t string_size = v9fs_string_size(name) + v9fs_string_size(symtgt);
+uint16_t string_size = v9fs_string_size(opt.name) +
+   v9fs_string_size(opt.symtgt);
 
 g_assert_cmpint(body_size, <=, UINT32_MAX - string_size);
 body_size += string_size;
 
 req = v9fs_req_init(opt.client, body_size, P9_TSYMLINK, opt.tag);
 v9fs_uint32_write(req, opt.fid);
-v9fs_string_write(req, name);
-v9fs_string_write(req, symtgt);
+v9fs_string_write(req, opt.name);
+v9fs_string_write(req, opt.symtgt);
 v9fs_uint32_write(req, opt.gid);
 v9fs_req_send(req);
 
-- 
2.30.2




[PATCH 14/20] tests/9p: convert v9fs_tflush() to declarative arguments

2022-10-04 Thread Christian Schoenebeck
Use declarative function arguments for function v9fs_tflush().

Signed-off-by: Christian Schoenebeck 
---
 tests/qtest/libqos/virtio-9p-client.c | 23 +++
 tests/qtest/libqos/virtio-9p-client.h | 22 +-
 tests/qtest/virtio-9p-test.c  |  9 +++--
 3 files changed, 47 insertions(+), 7 deletions(-)

diff --git a/tests/qtest/libqos/virtio-9p-client.c 
b/tests/qtest/libqos/virtio-9p-client.c
index 9ae347fad5..3be0ffc7da 100644
--- a/tests/qtest/libqos/virtio-9p-client.c
+++ b/tests/qtest/libqos/virtio-9p-client.c
@@ -733,14 +733,29 @@ void v9fs_rwrite(P9Req *req, uint32_t *count)
 }
 
 /* size[4] Tflush tag[2] oldtag[2] */
-P9Req *v9fs_tflush(QVirtio9P *v9p, uint16_t oldtag, uint16_t tag)
+TFlushRes v9fs_tflush(TFlushOpt opt)
 {
 P9Req *req;
+uint32_t err;
 
-req = v9fs_req_init(v9p,  2, P9_TFLUSH, tag);
-v9fs_uint32_write(req, oldtag);
+g_assert(opt.client);
+
+req = v9fs_req_init(opt.client, 2, P9_TFLUSH, opt.tag);
+v9fs_uint32_write(req, opt.oldtag);
 v9fs_req_send(req);
-return req;
+
+if (!opt.requestOnly) {
+v9fs_req_wait_for_reply(req, NULL);
+if (opt.expectErr) {
+v9fs_rlerror(req, &err);
+g_assert_cmpint(err, ==, opt.expectErr);
+} else {
+v9fs_rflush(req);
+}
+req = NULL; /* request was freed */
+}
+
+return (TFlushRes) { .req = req };
 }
 
 /* size[4] Rflush tag[2] */
diff --git a/tests/qtest/libqos/virtio-9p-client.h 
b/tests/qtest/libqos/virtio-9p-client.h
index dda371c054..b22b54c720 100644
--- a/tests/qtest/libqos/virtio-9p-client.h
+++ b/tests/qtest/libqos/virtio-9p-client.h
@@ -267,6 +267,26 @@ typedef struct TWriteRes {
 uint32_t count;
 } TWriteRes;
 
+/* options for 'Tflush' 9p request */
+typedef struct TFlushOpt {
+/* 9P client being used (mandatory) */
+QVirtio9P *client;
+/* user supplied tag number being returned with response (optional) */
+uint16_t tag;
+/* message to flush (required) */
+uint16_t oldtag;
+/* only send Tflush request but not wait for a reply? (optional) */
+bool requestOnly;
+/* do we expect an Rlerror response, if yes which error code? (optional) */
+uint32_t expectErr;
+} TFlushOpt;
+
+/* result of 'Tflush' 9p request */
+typedef struct TFlushRes {
+/* if requestOnly was set: request object for further processing */
+P9Req *req;
+} TFlushRes;
+
 void v9fs_set_allocator(QGuestAllocator *t_alloc);
 void v9fs_memwrite(P9Req *req, const void *addr, size_t len);
 void v9fs_memskip(P9Req *req, size_t len);
@@ -304,7 +324,7 @@ TLOpenRes v9fs_tlopen(TLOpenOpt);
 void v9fs_rlopen(P9Req *req, v9fs_qid *qid, uint32_t *iounit);
 TWriteRes v9fs_twrite(TWriteOpt);
 void v9fs_rwrite(P9Req *req, uint32_t *count);
-P9Req *v9fs_tflush(QVirtio9P *v9p, uint16_t oldtag, uint16_t tag);
+TFlushRes v9fs_tflush(TFlushOpt);
 void v9fs_rflush(P9Req *req);
 P9Req *v9fs_tmkdir(QVirtio9P *v9p, uint32_t dfid, const char *name,
uint32_t mode, uint32_t gid, uint16_t tag);
diff --git a/tests/qtest/virtio-9p-test.c b/tests/qtest/virtio-9p-test.c
index 5ad7bebec7..5544998bac 100644
--- a/tests/qtest/virtio-9p-test.c
+++ b/tests/qtest/virtio-9p-test.c
@@ -23,6 +23,7 @@
 #define treaddir(...) v9fs_treaddir((TReadDirOpt) __VA_ARGS__)
 #define tlopen(...) v9fs_tlopen((TLOpenOpt) __VA_ARGS__)
 #define twrite(...) v9fs_twrite((TWriteOpt) __VA_ARGS__)
+#define tflush(...) v9fs_tflush((TFlushOpt) __VA_ARGS__)
 
 static void pci_config(void *obj, void *data, QGuestAllocator *t_alloc)
 {
@@ -420,7 +421,9 @@ static void fs_flush_success(void *obj, void *data, 
QGuestAllocator *t_alloc)
 .requestOnly = true
 }).req;
 
-flush_req = v9fs_tflush(v9p, req->tag, 1);
+flush_req = tflush({
+.client = v9p, .oldtag = req->tag, .tag = 1, .requestOnly = true
+}).req;
 
 /* The write request is supposed to be flushed: the server should just
  * mark the write request as used and reply to the flush request.
@@ -459,7 +462,9 @@ static void fs_flush_ignored(void *obj, void *data, 
QGuestAllocator *t_alloc)
 .requestOnly = true
 }).req;
 
-flush_req = v9fs_tflush(v9p, req->tag, 1);
+flush_req = tflush({
+.client = v9p, .oldtag = req->tag, .tag = 1, .requestOnly = true
+}).req;
 
 /* The write request is supposed to complete. The server should
  * reply to the write request and the flush request.
-- 
2.30.2




[PATCH 11/20] tests/9p: simplify callers of tlopen()

2022-10-04 Thread Christian Schoenebeck
Now that tlopen() uses a declarative approach, simplify the
code of its callers.

Signed-off-by: Christian Schoenebeck 
---
 tests/qtest/virtio-9p-test.c | 43 +---
 1 file changed, 10 insertions(+), 33 deletions(-)

diff --git a/tests/qtest/virtio-9p-test.c b/tests/qtest/virtio-9p-test.c
index 0455c3a094..60a030b877 100644
--- a/tests/qtest/virtio-9p-test.c
+++ b/tests/qtest/virtio-9p-test.c
@@ -105,7 +105,6 @@ static void fs_readdir(void *obj, void *data, 
QGuestAllocator *t_alloc)
 v9fs_qid qid;
 uint32_t count, nentries;
 struct V9fsDirent *entries = NULL;
-P9Req *req;
 
 tattach({ .client = v9p });
 twalk({
@@ -114,11 +113,9 @@ static void fs_readdir(void *obj, void *data, 
QGuestAllocator *t_alloc)
 });
 g_assert_cmpint(nqid, ==, 1);
 
-req = tlopen({
-.client = v9p, .fid = 1, .flags = O_DIRECTORY, .requestOnly = true
-}).req;
-v9fs_req_wait_for_reply(req, NULL);
-v9fs_rlopen(req, , NULL);
+tlopen({
+.client = v9p, .fid = 1, .flags = O_DIRECTORY, .rlopen.qid = &qid
+});
 
 /*
  * submit count = msize - 11, because 11 is the header size of Rreaddir
@@ -163,7 +160,6 @@ static void do_readdir_split(QVirtio9P *v9p, uint32_t count)
 v9fs_qid qid;
 uint32_t nentries, npartialentries;
 struct V9fsDirent *entries, *tail, *partialentries;
-P9Req *req;
 int fid;
 uint64_t offset;
 
@@ -181,11 +177,9 @@ static void do_readdir_split(QVirtio9P *v9p, uint32_t 
count)
 });
 g_assert_cmpint(nqid, ==, 1);
 
-req = tlopen({
-.client = v9p, .fid = fid, .flags = O_DIRECTORY, .requestOnly = true
-}).req;
-v9fs_req_wait_for_reply(req, NULL);
-v9fs_rlopen(req, , NULL);
+tlopen({
+.client = v9p, .fid = fid, .flags = O_DIRECTORY, .rlopen.qid = &qid
+});
 
 /*
  * send as many Treaddir requests as required to get all directory
@@ -363,18 +357,13 @@ static void fs_lopen(void *obj, void *data, 
QGuestAllocator *t_alloc)
 QVirtio9P *v9p = obj;
 v9fs_set_allocator(t_alloc);
 char *wnames[] = { g_strdup(QTEST_V9FS_SYNTH_LOPEN_FILE) };
-P9Req *req;
 
 tattach({ .client = v9p });
 twalk({
 .client = v9p, .fid = 0, .newfid = 1, .nwname = 1, .wnames = wnames
 });
 
-req = tlopen({
-.client = v9p, .fid = 1, .flags = O_WRONLY, .requestOnly = true
-}).req;
-v9fs_req_wait_for_reply(req, NULL);
-v9fs_rlopen(req, NULL, NULL);
+tlopen({ .client = v9p, .fid = 1, .flags = O_WRONLY });
 
 g_free(wnames[0]);
 }
@@ -394,11 +383,7 @@ static void fs_write(void *obj, void *data, 
QGuestAllocator *t_alloc)
 .client = v9p, .fid = 0, .newfid = 1, .nwname = 1, .wnames = wnames
 });
 
-req = tlopen({
-.client = v9p, .fid = 1, .flags = O_WRONLY, .requestOnly = true
-}).req;
-v9fs_req_wait_for_reply(req, NULL);
-v9fs_rlopen(req, NULL, NULL);
+tlopen({ .client = v9p, .fid = 1, .flags = O_WRONLY });
 
 req = v9fs_twrite(v9p, 1, 0, write_count, buf, 0);
 v9fs_req_wait_for_reply(req, NULL);
@@ -422,11 +407,7 @@ static void fs_flush_success(void *obj, void *data, 
QGuestAllocator *t_alloc)
 .client = v9p, .fid = 0, .newfid = 1, .nwname = 1, .wnames = wnames
 });
 
-req = tlopen({
-.client = v9p, .fid = 1, .flags = O_WRONLY, .requestOnly = true
-}).req;
-v9fs_req_wait_for_reply(req, NULL);
-v9fs_rlopen(req, NULL, NULL);
+tlopen({ .client = v9p, .fid = 1, .flags = O_WRONLY });
 
 /* This will cause the 9p server to try to write data to the backend,
  * until the write request gets cancelled.
@@ -461,11 +442,7 @@ static void fs_flush_ignored(void *obj, void *data, 
QGuestAllocator *t_alloc)
 .client = v9p, .fid = 0, .newfid = 1, .nwname = 1, .wnames = wnames
 });
 
-req = tlopen({
-.client = v9p, .fid = 1, .flags = O_WRONLY, .requestOnly = true
-}).req;
-v9fs_req_wait_for_reply(req, NULL);
-v9fs_rlopen(req, NULL, NULL);
+tlopen({ .client = v9p, .fid = 1, .flags = O_WRONLY });
 
 /* This will cause the write request to complete right away, before it
  * could be actually cancelled.
-- 
2.30.2




[PATCH 10/20] tests/9p: convert v9fs_tlopen() to declarative arguments

2022-10-04 Thread Christian Schoenebeck
Use declarative function arguments for function v9fs_tlopen().

Signed-off-by: Christian Schoenebeck 
---
 tests/qtest/libqos/virtio-9p-client.c | 28 +++--
 tests/qtest/libqos/virtio-9p-client.h | 30 +--
 tests/qtest/virtio-9p-test.c  | 25 --
 3 files changed, 69 insertions(+), 14 deletions(-)

diff --git a/tests/qtest/libqos/virtio-9p-client.c 
b/tests/qtest/libqos/virtio-9p-client.c
index 047c8993b6..15fde54d63 100644
--- a/tests/qtest/libqos/virtio-9p-client.c
+++ b/tests/qtest/libqos/virtio-9p-client.c
@@ -643,16 +643,32 @@ void v9fs_free_dirents(struct V9fsDirent *e)
 }
 
 /* size[4] Tlopen tag[2] fid[4] flags[4] */
-P9Req *v9fs_tlopen(QVirtio9P *v9p, uint32_t fid, uint32_t flags,
-   uint16_t tag)
+TLOpenRes v9fs_tlopen(TLOpenOpt opt)
 {
 P9Req *req;
+uint32_t err;
 
-req = v9fs_req_init(v9p,  4 + 4, P9_TLOPEN, tag);
-v9fs_uint32_write(req, fid);
-v9fs_uint32_write(req, flags);
+g_assert(opt.client);
+/* expecting either Rlopen or Rlerror, but obviously not both */
+g_assert(!opt.expectErr || !(opt.rlopen.qid || opt.rlopen.iounit));
+
+req = v9fs_req_init(opt.client,  4 + 4, P9_TLOPEN, opt.tag);
+v9fs_uint32_write(req, opt.fid);
+v9fs_uint32_write(req, opt.flags);
 v9fs_req_send(req);
-return req;
+
+if (!opt.requestOnly) {
+v9fs_req_wait_for_reply(req, NULL);
+if (opt.expectErr) {
+v9fs_rlerror(req, &err);
+g_assert_cmpint(err, ==, opt.expectErr);
+} else {
+v9fs_rlopen(req, opt.rlopen.qid, opt.rlopen.iounit);
+}
+req = NULL; /* request was freed */
+}
+
+return (TLOpenRes) { .req = req };
 }
 
 /* size[4] Rlopen tag[2] qid[13] iounit[4] */
diff --git a/tests/qtest/libqos/virtio-9p-client.h 
b/tests/qtest/libqos/virtio-9p-client.h
index 2bf649085f..3b70aef51e 100644
--- a/tests/qtest/libqos/virtio-9p-client.h
+++ b/tests/qtest/libqos/virtio-9p-client.h
@@ -212,6 +212,33 @@ typedef struct TReadDirRes {
 P9Req *req;
 } TReadDirRes;
 
+/* options for 'Tlopen' 9p request */
+typedef struct TLOpenOpt {
+/* 9P client being used (mandatory) */
+QVirtio9P *client;
+/* user supplied tag number being returned with response (optional) */
+uint16_t tag;
+/* file ID of file / directory to be opened (required) */
+uint32_t fid;
+/* Linux open(2) flags such as O_RDONLY, O_RDWR, O_WRONLY (optional) */
+uint32_t flags;
+/* data being received from 9p server as 'Rlopen' response (optional) */
+struct {
+v9fs_qid *qid;
+uint32_t *iounit;
+} rlopen;
+/* only send Tlopen request but not wait for a reply? (optional) */
+bool requestOnly;
+/* do we expect an Rlerror response, if yes which error code? (optional) */
+uint32_t expectErr;
+} TLOpenOpt;
+
+/* result of 'Tlopen' 9p request */
+typedef struct TLOpenRes {
+/* if requestOnly was set: request object for further processing */
+P9Req *req;
+} TLOpenRes;
+
 void v9fs_set_allocator(QGuestAllocator *t_alloc);
 void v9fs_memwrite(P9Req *req, const void *addr, size_t len);
 void v9fs_memskip(P9Req *req, size_t len);
@@ -245,8 +272,7 @@ TReadDirRes v9fs_treaddir(TReadDirOpt);
 void v9fs_rreaddir(P9Req *req, uint32_t *count, uint32_t *nentries,
struct V9fsDirent **entries);
 void v9fs_free_dirents(struct V9fsDirent *e);
-P9Req *v9fs_tlopen(QVirtio9P *v9p, uint32_t fid, uint32_t flags,
-   uint16_t tag);
+TLOpenRes v9fs_tlopen(TLOpenOpt);
 void v9fs_rlopen(P9Req *req, v9fs_qid *qid, uint32_t *iounit);
 P9Req *v9fs_twrite(QVirtio9P *v9p, uint32_t fid, uint64_t offset,
uint32_t count, const void *data, uint16_t tag);
diff --git a/tests/qtest/virtio-9p-test.c b/tests/qtest/virtio-9p-test.c
index 99e24fce0b..0455c3a094 100644
--- a/tests/qtest/virtio-9p-test.c
+++ b/tests/qtest/virtio-9p-test.c
@@ -21,6 +21,7 @@
 #define tattach(...) v9fs_tattach((TAttachOpt) __VA_ARGS__)
 #define tgetattr(...) v9fs_tgetattr((TGetAttrOpt) __VA_ARGS__)
 #define treaddir(...) v9fs_treaddir((TReadDirOpt) __VA_ARGS__)
+#define tlopen(...) v9fs_tlopen((TLOpenOpt) __VA_ARGS__)
 
 static void pci_config(void *obj, void *data, QGuestAllocator *t_alloc)
 {
@@ -113,7 +114,9 @@ static void fs_readdir(void *obj, void *data, 
QGuestAllocator *t_alloc)
 });
 g_assert_cmpint(nqid, ==, 1);
 
-req = v9fs_tlopen(v9p, 1, O_DIRECTORY, 0);
+req = tlopen({
+.client = v9p, .fid = 1, .flags = O_DIRECTORY, .requestOnly = true
+}).req;
 v9fs_req_wait_for_reply(req, NULL);
v9fs_rlopen(req, &qid, NULL);
 
@@ -178,7 +181,9 @@ static void do_readdir_split(QVirtio9P *v9p, uint32_t count)
 });
 g_assert_cmpint(nqid, ==, 1);
 
-req = v9fs_tlopen(v9p, fid, O_DIRECTORY, 0);
+req = tlopen({
+.client = v9p, .fid = fid, .flags = O_DIRECTORY, .requestOnly = true
+}).req;
v9fs_req_wait_for_reply(req, NULL);
v9fs_rlopen(req, &qid, NULL);
 

[PATCH 12/20] tests/9p: convert v9fs_twrite() to declarative arguments

2022-10-04 Thread Christian Schoenebeck
Use declarative function arguments for function v9fs_twrite().

Signed-off-by: Christian Schoenebeck 
---
 tests/qtest/libqos/virtio-9p-client.c | 38 ---
 tests/qtest/libqos/virtio-9p-client.h | 31 --
 tests/qtest/virtio-9p-test.c  | 18 ++---
 3 files changed, 72 insertions(+), 15 deletions(-)

diff --git a/tests/qtest/libqos/virtio-9p-client.c 
b/tests/qtest/libqos/virtio-9p-client.c
index 15fde54d63..9ae347fad5 100644
--- a/tests/qtest/libqos/virtio-9p-client.c
+++ b/tests/qtest/libqos/virtio-9p-client.c
@@ -687,21 +687,39 @@ void v9fs_rlopen(P9Req *req, v9fs_qid *qid, uint32_t 
*iounit)
 }
 
 /* size[4] Twrite tag[2] fid[4] offset[8] count[4] data[count] */
-P9Req *v9fs_twrite(QVirtio9P *v9p, uint32_t fid, uint64_t offset,
-   uint32_t count, const void *data, uint16_t tag)
+TWriteRes v9fs_twrite(TWriteOpt opt)
 {
 P9Req *req;
+uint32_t err;
 uint32_t body_size = 4 + 8 + 4;
+uint32_t written = 0;
 
-g_assert_cmpint(body_size, <=, UINT32_MAX - count);
-body_size += count;
-req = v9fs_req_init(v9p,  body_size, P9_TWRITE, tag);
-v9fs_uint32_write(req, fid);
-v9fs_uint64_write(req, offset);
-v9fs_uint32_write(req, count);
-v9fs_memwrite(req, data, count);
+g_assert(opt.client);
+
+g_assert_cmpint(body_size, <=, UINT32_MAX - opt.count);
+body_size += opt.count;
+req = v9fs_req_init(opt.client, body_size, P9_TWRITE, opt.tag);
+v9fs_uint32_write(req, opt.fid);
+v9fs_uint64_write(req, opt.offset);
+v9fs_uint32_write(req, opt.count);
+v9fs_memwrite(req, opt.data, opt.count);
 v9fs_req_send(req);
-return req;
+
+if (!opt.requestOnly) {
+v9fs_req_wait_for_reply(req, NULL);
+if (opt.expectErr) {
+v9fs_rlerror(req, &err);
+g_assert_cmpint(err, ==, opt.expectErr);
+} else {
+v9fs_rwrite(req, &written);
+}
+req = NULL; /* request was freed */
+}
+
+return (TWriteRes) {
+.req = req,
+.count = written
+};
 }
 
 /* size[4] Rwrite tag[2] count[4] */
diff --git a/tests/qtest/libqos/virtio-9p-client.h 
b/tests/qtest/libqos/virtio-9p-client.h
index 3b70aef51e..dda371c054 100644
--- a/tests/qtest/libqos/virtio-9p-client.h
+++ b/tests/qtest/libqos/virtio-9p-client.h
@@ -239,6 +239,34 @@ typedef struct TLOpenRes {
 P9Req *req;
 } TLOpenRes;
 
+/* options for 'Twrite' 9p request */
+typedef struct TWriteOpt {
+/* 9P client being used (mandatory) */
+QVirtio9P *client;
+/* user supplied tag number being returned with response (optional) */
+uint16_t tag;
+/* file ID of file to write to (required) */
+uint32_t fid;
+/* start position of write from beginning of file (optional) */
+uint64_t offset;
+/* how many bytes to write */
+uint32_t count;
+/* data to be written */
+const void *data;
+/* only send Twrite request but not wait for a reply? (optional) */
+bool requestOnly;
+/* do we expect an Rlerror response, if yes which error code? (optional) */
+uint32_t expectErr;
+} TWriteOpt;
+
+/* result of 'Twrite' 9p request */
+typedef struct TWriteRes {
+/* if requestOnly was set: request object for further processing */
+P9Req *req;
+/* amount of bytes written */
+uint32_t count;
+} TWriteRes;
+
 void v9fs_set_allocator(QGuestAllocator *t_alloc);
 void v9fs_memwrite(P9Req *req, const void *addr, size_t len);
 void v9fs_memskip(P9Req *req, size_t len);
@@ -274,8 +302,7 @@ void v9fs_rreaddir(P9Req *req, uint32_t *count, uint32_t 
*nentries,
 void v9fs_free_dirents(struct V9fsDirent *e);
 TLOpenRes v9fs_tlopen(TLOpenOpt);
 void v9fs_rlopen(P9Req *req, v9fs_qid *qid, uint32_t *iounit);
-P9Req *v9fs_twrite(QVirtio9P *v9p, uint32_t fid, uint64_t offset,
-   uint32_t count, const void *data, uint16_t tag);
+TWriteRes v9fs_twrite(TWriteOpt);
 void v9fs_rwrite(P9Req *req, uint32_t *count);
 P9Req *v9fs_tflush(QVirtio9P *v9p, uint16_t oldtag, uint16_t tag);
 void v9fs_rflush(P9Req *req);
diff --git a/tests/qtest/virtio-9p-test.c b/tests/qtest/virtio-9p-test.c
index 60a030b877..a5b9284acb 100644
--- a/tests/qtest/virtio-9p-test.c
+++ b/tests/qtest/virtio-9p-test.c
@@ -22,6 +22,7 @@
 #define tgetattr(...) v9fs_tgetattr((TGetAttrOpt) __VA_ARGS__)
 #define treaddir(...) v9fs_treaddir((TReadDirOpt) __VA_ARGS__)
 #define tlopen(...) v9fs_tlopen((TLOpenOpt) __VA_ARGS__)
+#define twrite(...) v9fs_twrite((TWriteOpt) __VA_ARGS__)
 
 static void pci_config(void *obj, void *data, QGuestAllocator *t_alloc)
 {
@@ -385,7 +386,10 @@ static void fs_write(void *obj, void *data, 
QGuestAllocator *t_alloc)
 
 tlopen({ .client = v9p, .fid = 1, .flags = O_WRONLY });
 
-req = v9fs_twrite(v9p, 1, 0, write_count, buf, 0);
+req = twrite({
+.client = v9p, .fid = 1, .offset = 0, .count = write_count,
+.data = buf, .requestOnly = true
+}).req;
v9fs_req_wait_for_reply(req, NULL);
v9fs_rwrite(req, &count);
 

[PATCH 07/20] tests/9p: simplify callers of tgetattr()

2022-10-04 Thread Christian Schoenebeck
Now that tgetattr() uses a declarative approach, simplify the
code of its callers.

Signed-off-by: Christian Schoenebeck 
---
 tests/qtest/virtio-9p-test.c | 22 +++---
 1 file changed, 7 insertions(+), 15 deletions(-)

diff --git a/tests/qtest/virtio-9p-test.c b/tests/qtest/virtio-9p-test.c
index 9c1219db33..ae1220d0cb 100644
--- a/tests/qtest/virtio-9p-test.c
+++ b/tests/qtest/virtio-9p-test.c
@@ -264,8 +264,7 @@ static void fs_walk_2nd_nonexistent(void *obj, void *data,
 v9fs_set_allocator(t_alloc);
 v9fs_qid root_qid;
 uint16_t nwqid;
-uint32_t fid, err;
-P9Req *req;
+uint32_t fid;
 g_autofree v9fs_qid *wqid = NULL;
 g_autofree char *path = g_strdup_printf(
 QTEST_V9FS_SYNTH_WALK_FILE "/non-existent", 0
@@ -286,14 +285,10 @@ static void fs_walk_2nd_nonexistent(void *obj, void *data,
 g_assert(wqid && wqid[0] && !is_same_qid(root_qid, wqid[0]));
 
 /* expect fid being unaffected by walk above */
-req = tgetattr({
+tgetattr({
 .client = v9p, .fid = fid, .request_mask = P9_GETATTR_BASIC,
-.requestOnly = true
-}).req;
-v9fs_req_wait_for_reply(req, NULL);
-v9fs_rlerror(req, &err);
-
-g_assert_cmpint(err, ==, ENOENT);
+.expectErr = ENOENT
+});
 }
 
 static void fs_walk_none(void *obj, void *data, QGuestAllocator *t_alloc)
@@ -302,7 +297,6 @@ static void fs_walk_none(void *obj, void *data, 
QGuestAllocator *t_alloc)
 v9fs_set_allocator(t_alloc);
 v9fs_qid root_qid;
 g_autofree v9fs_qid *wqid = NULL;
-P9Req *req;
 struct v9fs_attr attr;
 
 tversion({ .client = v9p });
@@ -319,12 +313,10 @@ static void fs_walk_none(void *obj, void *data, 
QGuestAllocator *t_alloc)
 /* special case: no QID is returned if nwname=0 was sent */
 g_assert(wqid == NULL);
 
-req = tgetattr({
+tgetattr({
 .client = v9p, .fid = 1, .request_mask = P9_GETATTR_BASIC,
-.requestOnly = true
-}).req;
-v9fs_req_wait_for_reply(req, NULL);
-v9fs_rgetattr(req, &attr);
+.rgetattr.attr = &attr
+});
 
 g_assert(is_same_qid(root_qid, attr.qid));
 }
-- 
2.30.2




[PATCH 03/20] tests/9p: merge v9fs_tversion() and do_version()

2022-10-04 Thread Christian Schoenebeck
As with previous patches, unify functions v9fs_tversion() and do_version()
into a single function v9fs_tversion() by using a declarative function
arguments approach.

Signed-off-by: Christian Schoenebeck 
---
 tests/qtest/libqos/virtio-9p-client.c | 47 +++
 tests/qtest/libqos/virtio-9p-client.h | 25 --
 tests/qtest/virtio-9p-test.c  | 23 +++--
 3 files changed, 68 insertions(+), 27 deletions(-)

diff --git a/tests/qtest/libqos/virtio-9p-client.c 
b/tests/qtest/libqos/virtio-9p-client.c
index a95bbad9c8..e8364f8d64 100644
--- a/tests/qtest/libqos/virtio-9p-client.c
+++ b/tests/qtest/libqos/virtio-9p-client.c
@@ -291,21 +291,54 @@ void v9fs_rlerror(P9Req *req, uint32_t *err)
 }
 
 /* size[4] Tversion tag[2] msize[4] version[s] */
-P9Req *v9fs_tversion(QVirtio9P *v9p, uint32_t msize, const char *version,
- uint16_t tag)
+TVersionRes v9fs_tversion(TVersionOpt opt)
 {
 P9Req *req;
+uint32_t err;
 uint32_t body_size = 4;
-uint16_t string_size = v9fs_string_size(version);
+uint16_t string_size;
+uint16_t server_len;
+g_autofree char *server_version = NULL;
 
+g_assert(opt.client);
+
+if (!opt.msize) {
+opt.msize = P9_MAX_SIZE;
+}
+
+if (!opt.tag) {
+opt.tag = P9_NOTAG;
+}
+
+if (!opt.version) {
+opt.version = "9P2000.L";
+}
+
+string_size = v9fs_string_size(opt.version);
 g_assert_cmpint(body_size, <=, UINT32_MAX - string_size);
 body_size += string_size;
-req = v9fs_req_init(v9p, body_size, P9_TVERSION, tag);
+req = v9fs_req_init(opt.client, body_size, P9_TVERSION, opt.tag);
 
-v9fs_uint32_write(req, msize);
-v9fs_string_write(req, version);
+v9fs_uint32_write(req, opt.msize);
+v9fs_string_write(req, opt.version);
 v9fs_req_send(req);
-return req;
+
+if (!opt.requestOnly) {
+v9fs_req_wait_for_reply(req, NULL);
+if (opt.expectErr) {
+v9fs_rlerror(req, &err);
+g_assert_cmpint(err, ==, opt.expectErr);
+} else {
+v9fs_rversion(req, &server_len, &server_version);
+g_assert_cmpmem(server_version, server_len,
+opt.version, strlen(opt.version));
+}
+req = NULL; /* request was freed */
+}
+
+return (TVersionRes) {
+.req = req,
+};
 }
 
 /* size[4] Rversion tag[2] msize[4] version[s] */
diff --git a/tests/qtest/libqos/virtio-9p-client.h 
b/tests/qtest/libqos/virtio-9p-client.h
index 8c6abbb173..fcde849b5d 100644
--- a/tests/qtest/libqos/virtio-9p-client.h
+++ b/tests/qtest/libqos/virtio-9p-client.h
@@ -106,6 +106,28 @@ typedef struct TWalkRes {
 P9Req *req;
 } TWalkRes;
 
+/* options for 'Tversion' 9p request */
+typedef struct TVersionOpt {
+/* 9P client being used (mandatory) */
+QVirtio9P *client;
+/* user supplied tag number being returned with response (optional) */
+uint16_t tag;
+/* maximum message size that can be handled by client (optional) */
+uint32_t msize;
+/* protocol version (optional) */
+const char *version;
+/* only send Tversion request but not wait for a reply? (optional) */
+bool requestOnly;
+/* do we expect an Rlerror response, if yes which error code? (optional) */
+uint32_t expectErr;
+} TVersionOpt;
+
+/* result of 'Tversion' 9p request */
+typedef struct TVersionRes {
+/* if requestOnly was set: request object for further processing */
+P9Req *req;
+} TVersionRes;
+
 void v9fs_set_allocator(QGuestAllocator *t_alloc);
 void v9fs_memwrite(P9Req *req, const void *addr, size_t len);
 void v9fs_memskip(P9Req *req, size_t len);
@@ -127,8 +149,7 @@ void v9fs_req_wait_for_reply(P9Req *req, uint32_t *len);
 void v9fs_req_recv(P9Req *req, uint8_t id);
 void v9fs_req_free(P9Req *req);
 void v9fs_rlerror(P9Req *req, uint32_t *err);
-P9Req *v9fs_tversion(QVirtio9P *v9p, uint32_t msize, const char *version,
- uint16_t tag);
+TVersionRes v9fs_tversion(TVersionOpt);
 void v9fs_rversion(P9Req *req, uint16_t *len, char **version);
 P9Req *v9fs_tattach(QVirtio9P *v9p, uint32_t fid, uint32_t n_uname,
 uint16_t tag);
diff --git a/tests/qtest/virtio-9p-test.c b/tests/qtest/virtio-9p-test.c
index 3c326451b1..f2907c8026 100644
--- a/tests/qtest/virtio-9p-test.c
+++ b/tests/qtest/virtio-9p-test.c
@@ -17,6 +17,7 @@
 #include "libqos/virtio-9p-client.h"
 
 #define twalk(...) v9fs_twalk((TWalkOpt) __VA_ARGS__)
+#define tversion(...) v9fs_tversion((TVersionOpt) __VA_ARGS__)
 
 static void pci_config(void *obj, void *data, QGuestAllocator *t_alloc)
 {
@@ -41,31 +42,17 @@ static inline bool is_same_qid(v9fs_qid a, v9fs_qid b)
return a[0] == b[0] && memcmp(&a[5], &b[5], 8) == 0;
 }
 
-static void do_version(QVirtio9P *v9p)
-{
-const char *version = "9P2000.L";
-uint16_t server_len;
-g_autofree char *server_version = NULL;
-P9Req *req;
-
-req = v9fs_tversion(v9p, P9_MAX_SIZE, version, P9_NOTAG);

[PATCH 06/20] tests/9p: convert v9fs_tgetattr() to declarative arguments

2022-10-04 Thread Christian Schoenebeck
Use declarative function arguments for function v9fs_tgetattr().

Signed-off-by: Christian Schoenebeck 
---
 tests/qtest/libqos/virtio-9p-client.c | 32 ++-
 tests/qtest/libqos/virtio-9p-client.h | 30 +++--
 tests/qtest/virtio-9p-test.c  | 11 +++--
 3 files changed, 63 insertions(+), 10 deletions(-)

diff --git a/tests/qtest/libqos/virtio-9p-client.c 
b/tests/qtest/libqos/virtio-9p-client.c
index 5e6bd6120c..29916a23b5 100644
--- a/tests/qtest/libqos/virtio-9p-client.c
+++ b/tests/qtest/libqos/virtio-9p-client.c
@@ -489,16 +489,36 @@ void v9fs_rwalk(P9Req *req, uint16_t *nwqid, v9fs_qid 
**wqid)
 }
 
 /* size[4] Tgetattr tag[2] fid[4] request_mask[8] */
-P9Req *v9fs_tgetattr(QVirtio9P *v9p, uint32_t fid, uint64_t request_mask,
- uint16_t tag)
+TGetAttrRes v9fs_tgetattr(TGetAttrOpt opt)
 {
 P9Req *req;
+uint32_t err;
 
-req = v9fs_req_init(v9p, 4 + 8, P9_TGETATTR, tag);
-v9fs_uint32_write(req, fid);
-v9fs_uint64_write(req, request_mask);
+g_assert(opt.client);
+/* expecting either Rgetattr or Rlerror, but obviously not both */
+g_assert(!opt.expectErr || !opt.rgetattr.attr);
+
+if (!opt.request_mask) {
+opt.request_mask = P9_GETATTR_ALL;
+}
+
+req = v9fs_req_init(opt.client, 4 + 8, P9_TGETATTR, opt.tag);
+v9fs_uint32_write(req, opt.fid);
+v9fs_uint64_write(req, opt.request_mask);
 v9fs_req_send(req);
-return req;
+
+if (!opt.requestOnly) {
+v9fs_req_wait_for_reply(req, NULL);
+if (opt.expectErr) {
+v9fs_rlerror(req, &err);
+g_assert_cmpint(err, ==, opt.expectErr);
+} else {
+v9fs_rgetattr(req, opt.rgetattr.attr);
+}
+req = NULL; /* request was freed */
+}
+
+return (TGetAttrRes) { .req = req };
 }
 
 /*
diff --git a/tests/qtest/libqos/virtio-9p-client.h 
b/tests/qtest/libqos/virtio-9p-client.h
index 64b97b229b..f7b1bfc79a 100644
--- a/tests/qtest/libqos/virtio-9p-client.h
+++ b/tests/qtest/libqos/virtio-9p-client.h
@@ -63,6 +63,7 @@ typedef struct v9fs_attr {
 } v9fs_attr;
 
 #define P9_GETATTR_BASIC0x07ffULL /* Mask for fields up to BLOCKS */
+#define P9_GETATTR_ALL  0x3fffULL /* Mask for ALL fields */
 
 struct V9fsDirent {
 v9fs_qid qid;
@@ -155,6 +156,32 @@ typedef struct TAttachRes {
 P9Req *req;
 } TAttachRes;
 
+/* options for 'Tgetattr' 9p request */
+typedef struct TGetAttrOpt {
+/* 9P client being used (mandatory) */
+QVirtio9P *client;
+/* user supplied tag number being returned with response (optional) */
+uint16_t tag;
+/* file ID of file/dir whose attributes shall be retrieved (required) */
+uint32_t fid;
+/* bitmask indicating attribute fields to be retrieved (optional) */
+uint64_t request_mask;
+/* data being received from 9p server as 'Rgetattr' response (optional) */
+struct {
+v9fs_attr *attr;
+} rgetattr;
+/* only send Tgetattr request but not wait for a reply? (optional) */
+bool requestOnly;
+/* do we expect an Rlerror response, if yes which error code? (optional) */
+uint32_t expectErr;
+} TGetAttrOpt;
+
+/* result of 'Tgetattr' 9p request */
+typedef struct TGetAttrRes {
+/* if requestOnly was set: request object for further processing */
+P9Req *req;
+} TGetAttrRes;
+
 void v9fs_set_allocator(QGuestAllocator *t_alloc);
 void v9fs_memwrite(P9Req *req, const void *addr, size_t len);
 void v9fs_memskip(P9Req *req, size_t len);
@@ -182,8 +209,7 @@ TAttachRes v9fs_tattach(TAttachOpt);
 void v9fs_rattach(P9Req *req, v9fs_qid *qid);
 TWalkRes v9fs_twalk(TWalkOpt opt);
 void v9fs_rwalk(P9Req *req, uint16_t *nwqid, v9fs_qid **wqid);
-P9Req *v9fs_tgetattr(QVirtio9P *v9p, uint32_t fid, uint64_t request_mask,
- uint16_t tag);
+TGetAttrRes v9fs_tgetattr(TGetAttrOpt);
 void v9fs_rgetattr(P9Req *req, v9fs_attr *attr);
 P9Req *v9fs_treaddir(QVirtio9P *v9p, uint32_t fid, uint64_t offset,
  uint32_t count, uint16_t tag);
diff --git a/tests/qtest/virtio-9p-test.c b/tests/qtest/virtio-9p-test.c
index 46bb189b81..9c1219db33 100644
--- a/tests/qtest/virtio-9p-test.c
+++ b/tests/qtest/virtio-9p-test.c
@@ -19,6 +19,7 @@
 #define twalk(...) v9fs_twalk((TWalkOpt) __VA_ARGS__)
 #define tversion(...) v9fs_tversion((TVersionOpt) __VA_ARGS__)
 #define tattach(...) v9fs_tattach((TAttachOpt) __VA_ARGS__)
+#define tgetattr(...) v9fs_tgetattr((TGetAttrOpt) __VA_ARGS__)
 
 static void pci_config(void *obj, void *data, QGuestAllocator *t_alloc)
 {
@@ -285,7 +286,10 @@ static void fs_walk_2nd_nonexistent(void *obj, void *data,
 g_assert(wqid && wqid[0] && !is_same_qid(root_qid, wqid[0]));
 
 /* expect fid being unaffected by walk above */
-req = v9fs_tgetattr(v9p, fid, P9_GETATTR_BASIC, 0);
+req = tgetattr({
+.client = v9p, .fid = fid, .request_mask = P9_GETATTR_BASIC,
+.requestOnly = true
+}).req;
 

[PATCH 05/20] tests/9p: simplify callers of tattach()

2022-10-04 Thread Christian Schoenebeck
Now that tattach() uses a declarative approach, simplify the
code of its callers.

Signed-off-by: Christian Schoenebeck 
---
 tests/qtest/virtio-9p-test.c | 19 ---
 1 file changed, 8 insertions(+), 11 deletions(-)

diff --git a/tests/qtest/virtio-9p-test.c b/tests/qtest/virtio-9p-test.c
index 271c42f6f9..46bb189b81 100644
--- a/tests/qtest/virtio-9p-test.c
+++ b/tests/qtest/virtio-9p-test.c
@@ -302,11 +302,10 @@ static void fs_walk_none(void *obj, void *data, 
QGuestAllocator *t_alloc)
 struct v9fs_attr attr;
 
 tversion({ .client = v9p });
-req = tattach({
-.client = v9p, .fid = 0, .n_uname = getuid(), .requestOnly = true
-}).req;
-v9fs_req_wait_for_reply(req, NULL);
-v9fs_rattach(req, &root_qid);
+tattach({
+.client = v9p, .fid = 0, .n_uname = getuid(),
+.rattach.qid = &root_qid
+});
 
 twalk({
 .client = v9p, .fid = 0, .newfid = 1, .nwname = 0, .wnames = NULL,
@@ -330,14 +329,12 @@ static void fs_walk_dotdot(void *obj, void *data, 
QGuestAllocator *t_alloc)
 char *wnames[] = { g_strdup("..") };
 v9fs_qid root_qid;
 g_autofree v9fs_qid *wqid = NULL;
-P9Req *req;
 
 tversion({ .client = v9p });
-req = tattach((TAttachOpt) {
-.client = v9p, .fid = 0, .n_uname = getuid(), .requestOnly = true
-}).req;
-v9fs_req_wait_for_reply(req, NULL);
-v9fs_rattach(req, &root_qid);
+tattach({
+.client = v9p, .fid = 0, .n_uname = getuid(),
+.rattach.qid = &root_qid
+});
 
 twalk({
 .client = v9p, .fid = 0, .newfid = 1, .nwname = 1, .wnames = wnames,
-- 
2.30.2




[PATCH 09/20] tests/9p: simplify callers of treaddir()

2022-10-04 Thread Christian Schoenebeck
Now that treaddir() uses a declarative approach, simplify the
code of its callers.

Signed-off-by: Christian Schoenebeck 
---
 tests/qtest/virtio-9p-test.c | 21 +++--
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/tests/qtest/virtio-9p-test.c b/tests/qtest/virtio-9p-test.c
index e5c174c218..99e24fce0b 100644
--- a/tests/qtest/virtio-9p-test.c
+++ b/tests/qtest/virtio-9p-test.c
@@ -120,12 +120,12 @@ static void fs_readdir(void *obj, void *data, 
QGuestAllocator *t_alloc)
 /*
  * submit count = msize - 11, because 11 is the header size of Rreaddir
  */
-req = treaddir({
+treaddir({
 .client = v9p, .fid = 1, .offset = 0, .count = P9_MAX_SIZE - 11,
-.requestOnly = true
-}).req;
-v9fs_req_wait_for_reply(req, NULL);
-v9fs_rreaddir(req, &count, &nentries, &entries);
+.rreaddir = {
+.count = &count, .nentries = &nentries, .entries = &entries
+}
+});
 
 /*
  * Assuming msize (P9_MAX_SIZE) is large enough so we can retrieve all
@@ -190,12 +190,13 @@ static void do_readdir_split(QVirtio9P *v9p, uint32_t 
count)
 npartialentries = 0;
 partialentries = NULL;
 
-req = treaddir({
+treaddir({
 .client = v9p, .fid = fid, .offset = offset, .count = count,
-.requestOnly = true
-}).req;
-v9fs_req_wait_for_reply(req, NULL);
-v9fs_rreaddir(req, &count, &npartialentries, &partialentries);
+.rreaddir = {
+.count = &count, .nentries = &npartialentries,
+.entries = &partialentries
+}
+});
 if (npartialentries > 0 && partialentries) {
 if (!entries) {
 entries = partialentries;
-- 
2.30.2




[PATCH 02/20] tests/9p: simplify callers of twalk()

2022-10-04 Thread Christian Schoenebeck
Now that twalk() uses a declarative approach, simplify the
code of its callers.

Signed-off-by: Christian Schoenebeck 
---
 tests/qtest/virtio-9p-test.c | 92 +---
 1 file changed, 32 insertions(+), 60 deletions(-)

diff --git a/tests/qtest/virtio-9p-test.c b/tests/qtest/virtio-9p-test.c
index cf5d6146ad..3c326451b1 100644
--- a/tests/qtest/virtio-9p-test.c
+++ b/tests/qtest/virtio-9p-test.c
@@ -90,19 +90,17 @@ static void fs_walk(void *obj, void *data, QGuestAllocator 
*t_alloc)
 uint16_t nwqid;
 g_autofree v9fs_qid *wqid = NULL;
 int i;
-P9Req *req;
 
 for (i = 0; i < P9_MAXWELEM; i++) {
 wnames[i] = g_strdup_printf(QTEST_V9FS_SYNTH_WALK_FILE, i);
 }
 
 do_attach(v9p);
-req = twalk({
+twalk({
 .client = v9p, .fid = 0, .newfid = 1,
-.nwname = P9_MAXWELEM, .wnames = wnames, .requestOnly = true
-}).req;
-v9fs_req_wait_for_reply(req, NULL);
-v9fs_rwalk(req, &nwqid, &wqid);
+.nwname = P9_MAXWELEM, .wnames = wnames,
+.rwalk = { .nwqid = &nwqid, .wqid = &wqid }
+});
 
 g_assert_cmpint(nwqid, ==, P9_MAXWELEM);
 
@@ -134,12 +132,10 @@ static void fs_readdir(void *obj, void *data, 
QGuestAllocator *t_alloc)
 P9Req *req;
 
 do_attach(v9p);
-req = twalk({
+twalk({
 .client = v9p, .fid = 0, .newfid = 1,
-.nwname = 1, .wnames = wnames, .requestOnly = true
-}).req;
-v9fs_req_wait_for_reply(req, NULL);
-v9fs_rwalk(req, &nqid, NULL);
+.nwname = 1, .wnames = wnames, .rwalk.nwqid = &nqid
+});
 g_assert_cmpint(nqid, ==, 1);
 
 req = v9fs_tlopen(v9p, 1, O_DIRECTORY, 0);
@@ -198,12 +194,10 @@ static void do_readdir_split(QVirtio9P *v9p, uint32_t 
count)
 nentries = 0;
 tail = NULL;
 
-req = twalk({
+twalk({
 .client = v9p, .fid = 0, .newfid = fid,
-.nwname = 1, .wnames = wnames, .requestOnly = true
-}).req;
-v9fs_req_wait_for_reply(req, NULL);
-v9fs_rwalk(req, &nqid, NULL);
+.nwname = 1, .wnames = wnames, .rwalk.nwqid = &nqid
+});
 g_assert_cmpint(nqid, ==, 1);
 
 req = v9fs_tlopen(v9p, fid, O_DIRECTORY, 0);
@@ -266,18 +260,12 @@ static void fs_walk_no_slash(void *obj, void *data, 
QGuestAllocator *t_alloc)
 QVirtio9P *v9p = obj;
 v9fs_set_allocator(t_alloc);
 char *wnames[] = { g_strdup(" /") };
-P9Req *req;
-uint32_t err;
 
 do_attach(v9p);
-req = twalk({
+twalk({
 .client = v9p, .fid = 0, .newfid = 1, .nwname = 1, .wnames = wnames,
-.requestOnly = true
-}).req;
-v9fs_req_wait_for_reply(req, NULL);
-v9fs_rlerror(req, &err);
-
-g_assert_cmpint(err, ==, ENOENT);
+.expectErr = ENOENT
+});
 
 g_free(wnames[0]);
 }
@@ -312,7 +300,7 @@ static void fs_walk_2nd_nonexistent(void *obj, void *data,
 do_attach_rqid(v9p, _qid);
 fid = twalk({
 .client = v9p, .path = path,
-.rwalk.nwqid = &nwqid, .rwalk.wqid = &wqid
+.rwalk = { .nwqid = &nwqid, .wqid = &wqid }
 }).newfid;
 /*
  * The 9p2000 protocol spec says: "nwqid is therefore either nwname or the
@@ -345,12 +333,10 @@ static void fs_walk_none(void *obj, void *data, 
QGuestAllocator *t_alloc)
 v9fs_req_wait_for_reply(req, NULL);
v9fs_rattach(req, &root_qid);
 
-req = twalk({
+twalk({
 .client = v9p, .fid = 0, .newfid = 1, .nwname = 0, .wnames = NULL,
-.requestOnly = true
-}).req;
-v9fs_req_wait_for_reply(req, NULL);
-v9fs_rwalk(req, NULL, &wqid);
+.rwalk.wqid = &wqid
+});
 
 /* special case: no QID is returned if nwname=0 was sent */
 g_assert(wqid == NULL);
@@ -376,12 +362,10 @@ static void fs_walk_dotdot(void *obj, void *data, 
QGuestAllocator *t_alloc)
 v9fs_req_wait_for_reply(req, NULL);
v9fs_rattach(req, &root_qid);
 
-req = twalk({
+twalk({
 .client = v9p, .fid = 0, .newfid = 1, .nwname = 1, .wnames = wnames,
-.requestOnly = true
-}).req;
-v9fs_req_wait_for_reply(req, NULL);
-v9fs_rwalk(req, NULL, &wqid); /* We know we'll get one qid */
+.rwalk.wqid = &wqid /* We know we'll get one qid */
+});
 
g_assert_cmpmem(&root_qid, 13, wqid[0], 13);
 
@@ -396,12 +380,9 @@ static void fs_lopen(void *obj, void *data, 
QGuestAllocator *t_alloc)
 P9Req *req;
 
 do_attach(v9p);
-req = twalk({
-.client = v9p, .fid = 0, .newfid = 1, .nwname = 1, .wnames = wnames,
-.requestOnly = true
-}).req;
-v9fs_req_wait_for_reply(req, NULL);
-v9fs_rwalk(req, NULL, NULL);
+twalk({
+.client = v9p, .fid = 0, .newfid = 1, .nwname = 1, .wnames = wnames
+});
 
 req = v9fs_tlopen(v9p, 1, O_WRONLY, 0);
 v9fs_req_wait_for_reply(req, NULL);
@@ -421,12 +402,9 @@ static void fs_write(void *obj, void *data, 
QGuestAllocator *t_alloc)
 P9Req *req;
 
 do_attach(v9p);
-req = twalk({
-.client = v9p, .fid = 0, .newfid = 1, .nwname = 1, .wnames = wnames,
-.requestOnly = true
-}).req;
-v9fs_req_wait_for_reply(req, NULL);

[PATCH 01/20] tests/9p: merge *walk*() functions

2022-10-04 Thread Christian Schoenebeck
Introduce declarative function calls.

There are currently 4 different functions for sending a 9p 'Twalk'
request: v9fs_twalk(), do_walk(), do_walk_rqids() and
do_walk_expect_error(). They are all doing the same thing, just in a
slightly different way and with slightly different function arguments.

Merge those 4 functions into a single function by using a struct for
the function call arguments, and use designated initializers when
calling this function to turn usage into a declarative approach, which
is more readable and easier to maintain.

Also move private functions genfid(), split() and split_free() from
virtio-9p-test.c to virtio-9p-client.c.

Based-on: 
Signed-off-by: Christian Schoenebeck 
---
 tests/qtest/libqos/virtio-9p-client.c | 114 ++--
 tests/qtest/libqos/virtio-9p-client.h |  37 -
 tests/qtest/virtio-9p-test.c  | 187 +-
 3 files changed, 198 insertions(+), 140 deletions(-)

diff --git a/tests/qtest/libqos/virtio-9p-client.c 
b/tests/qtest/libqos/virtio-9p-client.c
index f5c35fd722..a95bbad9c8 100644
--- a/tests/qtest/libqos/virtio-9p-client.c
+++ b/tests/qtest/libqos/virtio-9p-client.c
@@ -23,6 +23,65 @@ void v9fs_set_allocator(QGuestAllocator *t_alloc)
 alloc = t_alloc;
 }
 
+/*
+ * Used to auto generate new fids. Start with arbitrary high value to avoid
+ * collision with hard coded fids in basic test code.
+ */
+static uint32_t fid_generator = 1000;
+
+static uint32_t genfid(void)
+{
+return fid_generator++;
+}
+
+/**
+ * Splits the @a in string by @a delim into individual (non empty) strings
+ * and outputs them to @a out. The output array @a out is NULL terminated.
+ *
+ * Output array @a out must be freed by calling split_free().
+ *
+ * @returns number of individual elements in output array @a out (without the
+ *  final NULL terminating element)
+ */
+static int split(const char *in, const char *delim, char ***out)
+{
+int n = 0, i = 0;
+char *tmp, *p;
+
+tmp = g_strdup(in);
+for (p = strtok(tmp, delim); p != NULL; p = strtok(NULL, delim)) {
+if (strlen(p) > 0) {
+++n;
+}
+}
+g_free(tmp);
+
+*out = g_new0(char *, n + 1); /* last element NULL delimiter */
+
+tmp = g_strdup(in);
+for (p = strtok(tmp, delim); p != NULL; p = strtok(NULL, delim)) {
+if (strlen(p) > 0) {
+(*out)[i++] = g_strdup(p);
+}
+}
+g_free(tmp);
+
+return n;
+}
+
+static void split_free(char ***out)
+{
+int i;
+if (!*out) {
+return;
+}
+for (i = 0; (*out)[i]; ++i) {
+g_free((*out)[i]);
+}
+g_free(*out);
+*out = NULL;
+}
+
 void v9fs_memwrite(P9Req *req, const void *addr, size_t len)
 {
 qtest_memwrite(req->qts, req->t_msg + req->t_off, addr, len);
@@ -294,28 +353,61 @@ void v9fs_rattach(P9Req *req, v9fs_qid *qid)
 }
 
 /* size[4] Twalk tag[2] fid[4] newfid[4] nwname[2] nwname*(wname[s]) */
-P9Req *v9fs_twalk(QVirtio9P *v9p, uint32_t fid, uint32_t newfid,
-  uint16_t nwname, char *const wnames[], uint16_t tag)
+TWalkRes v9fs_twalk(TWalkOpt opt)
 {
 P9Req *req;
 int i;
 uint32_t body_size = 4 + 4 + 2;
+uint32_t err;
+char **wnames = NULL;
 
-for (i = 0; i < nwname; i++) {
-uint16_t wname_size = v9fs_string_size(wnames[i]);
+g_assert(opt.client);
+/* expecting either high- or low-level path, both not both */
+g_assert(!opt.path || !(opt.nwname || opt.wnames));
+/* expecting either Rwalk or Rlerror, but obviously not both */
+g_assert(!opt.expectErr || !(opt.rwalk.nwqid || opt.rwalk.wqid));
+
+if (!opt.newfid) {
+opt.newfid = genfid();
+}
+
+if (opt.path) {
+opt.nwname = split(opt.path, "/", &wnames);
+opt.wnames = wnames;
+}
+
+for (i = 0; i < opt.nwname; i++) {
+uint16_t wname_size = v9fs_string_size(opt.wnames[i]);
 
 g_assert_cmpint(body_size, <=, UINT32_MAX - wname_size);
 body_size += wname_size;
 }
-req = v9fs_req_init(v9p, body_size, P9_TWALK, tag);
-v9fs_uint32_write(req, fid);
-v9fs_uint32_write(req, newfid);
-v9fs_uint16_write(req, nwname);
-for (i = 0; i < nwname; i++) {
-v9fs_string_write(req, wnames[i]);
+req = v9fs_req_init(opt.client, body_size, P9_TWALK, opt.tag);
+v9fs_uint32_write(req, opt.fid);
+v9fs_uint32_write(req, opt.newfid);
+v9fs_uint16_write(req, opt.nwname);
+for (i = 0; i < opt.nwname; i++) {
+v9fs_string_write(req, opt.wnames[i]);
 }
 v9fs_req_send(req);
-return req;
+
+if (!opt.requestOnly) {
+v9fs_req_wait_for_reply(req, NULL);
+if (opt.expectErr) {
+v9fs_rlerror(req, &err);
+g_assert_cmpint(err, ==, opt.expectErr);
+} else {
+v9fs_rwalk(req, opt.rwalk.nwqid, opt.rwalk.wqid);
+}
+req = NULL; /* request was freed */
+}
+
+split_free(&wnames);
+
+return (TWalkRes) {
+.newfid = opt.newfid,
+ 

[PATCH 04/20] tests/9p: merge v9fs_tattach(), do_attach(), do_attach_rqid()

2022-10-04 Thread Christian Schoenebeck
As with previous patches, unify those 3 functions into a single function
v9fs_tattach() by using a declarative function arguments approach.
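
A hypothetical caller then looks like this (a sketch in the style of the
series, not a line taken from the diff below):

    v9fs_qid qid;
    /* send Tattach, wait for Rattach, and store the returned qid */
    tattach({ .client = v9p, .fid = 0, .rattach.qid = &qid });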

Signed-off-by: Christian Schoenebeck 
---
 tests/qtest/libqos/virtio-9p-client.c | 40 ++---
 tests/qtest/libqos/virtio-9p-client.h | 30 -
 tests/qtest/virtio-9p-test.c  | 62 +++
 3 files changed, 88 insertions(+), 44 deletions(-)

diff --git a/tests/qtest/libqos/virtio-9p-client.c 
b/tests/qtest/libqos/virtio-9p-client.c
index e8364f8d64..5e6bd6120c 100644
--- a/tests/qtest/libqos/virtio-9p-client.c
+++ b/tests/qtest/libqos/virtio-9p-client.c
@@ -359,20 +359,48 @@ void v9fs_rversion(P9Req *req, uint16_t *len, char 
**version)
 }
 
 /* size[4] Tattach tag[2] fid[4] afid[4] uname[s] aname[s] n_uname[4] */
-P9Req *v9fs_tattach(QVirtio9P *v9p, uint32_t fid, uint32_t n_uname,
-uint16_t tag)
+TAttachRes v9fs_tattach(TAttachOpt opt)
 {
+uint32_t err;
 const char *uname = ""; /* ignored by QEMU */
 const char *aname = ""; /* ignored by QEMU */
-P9Req *req = v9fs_req_init(v9p, 4 + 4 + 2 + 2 + 4, P9_TATTACH, tag);
 
-v9fs_uint32_write(req, fid);
+g_assert(opt.client);
+/* expecting either Rattach or Rlerror, but obviously not both */
+g_assert(!opt.expectErr || !opt.rattach.qid);
+
+if (!opt.requestOnly) {
+v9fs_tversion((TVersionOpt) { .client = opt.client });
+}
+
+if (!opt.n_uname) {
+opt.n_uname = getuid();
+}
+
+P9Req *req = v9fs_req_init(opt.client, 4 + 4 + 2 + 2 + 4, P9_TATTACH,
+   opt.tag);
+
+v9fs_uint32_write(req, opt.fid);
 v9fs_uint32_write(req, P9_NOFID);
 v9fs_string_write(req, uname);
 v9fs_string_write(req, aname);
-v9fs_uint32_write(req, n_uname);
+v9fs_uint32_write(req, opt.n_uname);
 v9fs_req_send(req);
-return req;
+
+if (!opt.requestOnly) {
+v9fs_req_wait_for_reply(req, NULL);
+if (opt.expectErr) {
+v9fs_rlerror(req, &err);
+g_assert_cmpint(err, ==, opt.expectErr);
+} else {
+v9fs_rattach(req, opt.rattach.qid);
+}
+req = NULL; /* request was freed */
+}
+
+return (TAttachRes) {
+.req = req,
+};
 }
 
 /* size[4] Rattach tag[2] qid[13] */
diff --git a/tests/qtest/libqos/virtio-9p-client.h 
b/tests/qtest/libqos/virtio-9p-client.h
index fcde849b5d..64b97b229b 100644
--- a/tests/qtest/libqos/virtio-9p-client.h
+++ b/tests/qtest/libqos/virtio-9p-client.h
@@ -128,6 +128,33 @@ typedef struct TVersionRes {
 P9Req *req;
 } TVersionRes;
 
+/* options for 'Tattach' 9p request */
+typedef struct TAttachOpt {
+/* 9P client being used (mandatory) */
+QVirtio9P *client;
+/* user supplied tag number being returned with response (optional) */
+uint16_t tag;
+/* file ID to be associated with root of file tree (optional) */
+uint32_t fid;
+/* numerical uid of user being introduced to server (optional) */
+uint32_t n_uname;
+/* data being received from 9p server as 'Rattach' response (optional) */
+struct {
+/* server's idea of the root of the file tree */
+v9fs_qid *qid;
+} rattach;
+/* only send Tattach request but not wait for a reply? (optional) */
+bool requestOnly;
+/* do we expect an Rlerror response, if yes which error code? (optional) */
+uint32_t expectErr;
+} TAttachOpt;
+
+/* result of 'Tattach' 9p request */
+typedef struct TAttachRes {
+/* if requestOnly was set: request object for further processing */
+P9Req *req;
+} TAttachRes;
+
 void v9fs_set_allocator(QGuestAllocator *t_alloc);
 void v9fs_memwrite(P9Req *req, const void *addr, size_t len);
 void v9fs_memskip(P9Req *req, size_t len);
@@ -151,8 +178,7 @@ void v9fs_req_free(P9Req *req);
 void v9fs_rlerror(P9Req *req, uint32_t *err);
 TVersionRes v9fs_tversion(TVersionOpt);
 void v9fs_rversion(P9Req *req, uint16_t *len, char **version);
-P9Req *v9fs_tattach(QVirtio9P *v9p, uint32_t fid, uint32_t n_uname,
-uint16_t tag);
+TAttachRes v9fs_tattach(TAttachOpt);
 void v9fs_rattach(P9Req *req, v9fs_qid *qid);
 TWalkRes v9fs_twalk(TWalkOpt opt);
 void v9fs_rwalk(P9Req *req, uint16_t *nwqid, v9fs_qid **wqid);
diff --git a/tests/qtest/virtio-9p-test.c b/tests/qtest/virtio-9p-test.c
index f2907c8026..271c42f6f9 100644
--- a/tests/qtest/virtio-9p-test.c
+++ b/tests/qtest/virtio-9p-test.c
@@ -18,6 +18,7 @@
 
 #define twalk(...) v9fs_twalk((TWalkOpt) __VA_ARGS__)
 #define tversion(...) v9fs_tversion((TVersionOpt) __VA_ARGS__)
+#define tattach(...) v9fs_tattach((TAttachOpt) __VA_ARGS__)
 
 static void pci_config(void *obj, void *data, QGuestAllocator *t_alloc)
 {
@@ -48,25 +49,10 @@ static void fs_version(void *obj, void *data, 
QGuestAllocator *t_alloc)
 tversion({ .client = obj });
 }
 
-static void do_attach_rqid(QVirtio9P *v9p, v9fs_qid *qid)
-{
-P9Req *req;
-
-tversion({ .client = v9p });
-req = 

[PATCH 00/20] tests/9p: introduce declarative function calls

2022-10-04 Thread Christian Schoenebeck
This series converts relevant 9p (test) client functions to use named
function arguments. For instance

do_walk_expect_error(v9p, "non-existent", ENOENT);

becomes

twalk({
.client = v9p, .path = "non-existent", .expectErr = ENOENT
});

The intention is to make the actual 9p test code more readable and easier
to maintain in the long term.

Not only does this make clear what a literal passed to a function is supposed
to do, it also makes the order and selection of arguments very flexible, and
allows merging multiple similar functions into one single function.

This is basically just refactoring; it does not change behaviour.
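
For illustration, the C pattern that makes this work, reduced to a minimal
standalone sketch (hypothetical names, not the actual 9p client API): a
struct gathers the arguments, omitted designated initializers are
zero-initialized, and a variadic macro wraps the brace list into a compound
literal:

    #include <errno.h>
    #include <stdint.h>

    typedef struct WalkOpt {
        const char *path;   /* NULL if omitted */
        uint32_t fid;       /* 0 if omitted -> callee picks a default */
        uint32_t expectErr; /* 0 means "expect success" */
    } WalkOpt;

    static void do_walk(WalkOpt opt)
    {
        if (!opt.fid) {
            opt.fid = 1; /* default applied inside the callee */
        }
        /* ... build and send the request, then check opt.expectErr ... */
    }

    #define walk(...) do_walk((WalkOpt) __VA_ARGS__)

    /* caller: argument order is free, unspecified members read as zero */
    /* walk({ .path = "non-existent", .expectErr = ENOENT }); */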

PREREQUISITES
=

This series requires the following additional patch to work correctly:

https://lore.kernel.org/all/e1odrya-0004fv...@lizzy.crudebyte.com/
https://github.com/cschoenebeck/qemu/commit/23d01367fc7a4f27be323ed6d195c527bec9ede1

Christian Schoenebeck (20):
  tests/9p: merge *walk*() functions
  tests/9p: simplify callers of twalk()
  tests/9p: merge v9fs_tversion() and do_version()
  tests/9p: merge v9fs_tattach(), do_attach(), do_attach_rqid()
  tests/9p: simplify callers of tattach()
  tests/9p: convert v9fs_tgetattr() to declarative arguments
  tests/9p: simplify callers of tgetattr()
  tests/9p: convert v9fs_treaddir() to declarative arguments
  tests/9p: simplify callers of treaddir()
  tests/9p: convert v9fs_tlopen() to declarative arguments
  tests/9p: simplify callers of tlopen()
  tests/9p: convert v9fs_twrite() to declarative arguments
  tests/9p: simplify callers of twrite()
  tests/9p: convert v9fs_tflush() to declarative arguments
  tests/9p: merge v9fs_tmkdir() and do_mkdir()
  tests/9p: merge v9fs_tlcreate() and do_lcreate()
  tests/9p: merge v9fs_tsymlink() and do_symlink()
  tests/9p: merge v9fs_tlink() and do_hardlink()
  tests/9p: merge v9fs_tunlinkat() and do_unlinkat()
  tests/9p: remove unnecessary g_strdup() calls

 tests/qtest/libqos/virtio-9p-client.c | 569 +-
 tests/qtest/libqos/virtio-9p-client.h | 408 --
 tests/qtest/virtio-9p-test.c  | 529 
 3 files changed, 1031 insertions(+), 475 deletions(-)

-- 
2.30.2




[PATCH] vhost-vdpa: fix assert !virtio_net_get_subqueue(nc)->async_tx.elem in virtio_net_reset

2022-10-04 Thread Si-Wei Liu
The cited commit has incorrect code in vhost_vdpa_receive() that returns
zero instead of the full packet size to the caller. This renders pending
packets unable to be freed, so they get clogged in the tx queue forever.
When the device is reset later on, the assertion failure below ensues:

0  0x7f86d53bb387 in raise () from /lib64/libc.so.6
1  0x7f86d53bca78 in abort () from /lib64/libc.so.6
2  0x7f86d53b41a6 in __assert_fail_base () from /lib64/libc.so.6
3  0x7f86d53b4252 in __assert_fail () from /lib64/libc.so.6
4  0x55b8f6ff6fcc in virtio_net_reset (vdev=) at 
/usr/src/debug/qemu/hw/net/virtio-net.c:563
5  0x55b8f7012fcf in virtio_reset (opaque=0x55b8faf881f0) at 
/usr/src/debug/qemu/hw/virtio/virtio.c:1993
6  0x55b8f71f0086 in virtio_bus_reset (bus=bus@entry=0x55b8faf88178) at 
/usr/src/debug/qemu/hw/virtio/virtio-bus.c:102
7  0x55b8f71f1620 in virtio_pci_reset (qdev=) at 
/usr/src/debug/qemu/hw/virtio/virtio-pci.c:1845
8  0x55b8f6fafc6c in memory_region_write_accessor (mr=, 
addr=, value=,
   size=, shift=, mask=, 
attrs=...) at /usr/src/debug/qemu/memory.c:483
9  0x55b8f6fadce9 in access_with_adjusted_size (addr=addr@entry=20, 
value=value@entry=0x7f867e7fb7e8, size=size@entry=1,
   access_size_min=, access_size_max=, 
access_fn=0x55b8f6fafc20 ,
   mr=0x55b8faf80a50, attrs=...) at /usr/src/debug/qemu/memory.c:544
10 0x55b8f6fb1d0b in memory_region_dispatch_write 
(mr=mr@entry=0x55b8faf80a50, addr=addr@entry=20, data=0, op=,
   attrs=attrs@entry=...) at /usr/src/debug/qemu/memory.c:1470
11 0x55b8f6f62ada in flatview_write_continue (fv=fv@entry=0x7f86ac04cd20, 
addr=addr@entry=549755813908, attrs=...,
   attrs@entry=..., buf=buf@entry=0x7f86d0223028 , len=len@entry=1, addr1=20, l=1,
   mr=0x55b8faf80a50) at /usr/src/debug/qemu/exec.c:3266
12 0x55b8f6f62c8f in flatview_write (fv=0x7f86ac04cd20, addr=549755813908, 
attrs=...,
   buf=0x7f86d0223028 , len=1) at 
/usr/src/debug/qemu/exec.c:3306
13 0x55b8f6f674cb in address_space_write (as=, 
addr=, attrs=..., buf=,
   len=) at /usr/src/debug/qemu/exec.c:3396
14 0x55b8f6f67575 in address_space_rw (as=, addr=, attrs=..., attrs@entry=...,
   buf=buf@entry=0x7f86d0223028 , 
len=, is_write=)
   at /usr/src/debug/qemu/exec.c:3406
15 0x55b8f6fc1cc8 in kvm_cpu_exec (cpu=cpu@entry=0x55b8f9aa0e10) at 
/usr/src/debug/qemu/accel/kvm/kvm-all.c:2410
16 0x55b8f6fa5f5e in qemu_kvm_cpu_thread_fn (arg=0x55b8f9aa0e10) at 
/usr/src/debug/qemu/cpus.c:1318
17 0x55b8f7336e16 in qemu_thread_start (args=0x55b8f9ac8480) at 
/usr/src/debug/qemu/util/qemu-thread-posix.c:519
18 0x7f86d575aea5 in start_thread () from /lib64/libpthread.so.0
19 0x7f86d5483b2d in clone () from /lib64/libc.so.6

Make vhost_vdpa_receive() return the size passed in as is, so that the
caller qemu_deliver_packet_iov() would eventually propagate it back to
virtio_net_flush_tx() to release pending packets from the async_tx queue.
This corresponds to the drop path taken when qemu_sendv_packet_async() returns
non-zero in virtio_net_flush_tx().
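
(For context: in QEMU's net layer, a .receive callback returning 0 means
"queue full, retry later", so the sender keeps the packet pending; returning
the full size counts as delivered or dropped. A dummy receiver therefore
must return size.)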

Fixes: 846a1e85da64 ("vdpa: Add dummy receive callback")
Cc: Eugenio Perez Martin 
Signed-off-by: Si-Wei Liu 
---
 net/vhost-vdpa.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 4bc3fd0..182b3a1 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -211,7 +211,7 @@ static bool vhost_vdpa_check_peer_type(NetClientState *nc, 
ObjectClass *oc,
 static ssize_t vhost_vdpa_receive(NetClientState *nc, const uint8_t *buf,
   size_t size)
 {
-return 0;
+return size;
 }
 
 static NetClientInfo net_vhost_vdpa_info = {
-- 
1.8.3.1




Re: [PATCH v5 9/9] target/arm: Enable TARGET_TB_PCREL

2022-10-04 Thread Richard Henderson

On 10/4/22 12:27, Richard Henderson wrote:

On 10/4/22 09:23, Peter Maydell wrote:

  void arm_cpu_synchronize_from_tb(CPUState *cs,
   const TranslationBlock *tb)
  {
-    ARMCPU *cpu = ARM_CPU(cs);
-    CPUARMState *env = &cpu->env;
-
-    /*
- * It's OK to look at env for the current mode here, because it's
- * never possible for an AArch64 TB to chain to an AArch32 TB.
- */
-    if (is_a64(env)) {
-    env->pc = tb_pc(tb);
-    } else {
-    env->regs[15] = tb_pc(tb);
+    /* The program counter is always up to date with TARGET_TB_PCREL. */


I was confused for a bit about this, but it works because
although the synchronize_from_tb hook has a name that implies
it's comparatively general purpose, in fact we use it only
in the special case of "we abandoned execution at the start of
this TB without executing any of it".


Correct.


@@ -347,16 +354,22 @@ static void gen_exception_internal(int excp)

  static void gen_exception_internal_insn(DisasContext *s, int excp)
  {
+    target_ulong pc_save = s->pc_save;
+
  gen_a64_update_pc(s, 0);
  gen_exception_internal(excp);
  s->base.is_jmp = DISAS_NORETURN;
+    s->pc_save = pc_save;


What is trashing s->pc_save that we have to work around like this,
here and in the other similar changes?


gen_a64_update_pc trashes pc_save.

Off of the top of my head, I can't remember what conditionally uses exceptions (single 
step?).


Oh, duh, any conditional a32 instruction.

To some extent this instance duplicates s->pc_cond_save, but the usage pattern 
there is

brcond(..., s->condlabel);
s->pc_cond_save = s->pc_save;

gen_update_pc(s, 0);  /* pc_save = pc_curr */
raise_exception;

if (s->pc_cond_save != s->pc_save) {
gen_update_pc(s->pc_save - s->pc_cond_save);
}
/* s->pc_save now matches the state at brcond */

condlabel:


So: we have exited the TB via exception, and the second gen_update_pc would be deleted as 
dead code; it's just as easy to keep s->pc_save unchanged so that the second gen_update_pc 
is not emitted.  We certainly *must* update s->pc_save around indirect branches, so that 
we don't wind up with an assert on s->pc_save != -1.



r~



[PATCH] vhost-vdpa: allow passing opened vhostfd to vhost-vdpa

2022-10-04 Thread Si-Wei Liu
Similar to other vhost backends, vhostfd can be passed to vhost-vdpa
backend as another parameter to instantiate vhost-vdpa net client.
This would benefit the use case where only open fds, as opposed to
raw vhost-vdpa device paths, are accessible from the QEMU process.

(qemu) netdev_add type=vhost-vdpa,vhostfd=61,id=vhost-vdpa1
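
Side note: monitor_fd_param() used below accepts, if I read the monitor code
right, either a raw fd number or the name of an fd previously added with
'getfd', so the fd can also be passed over the monitor socket:

(qemu) getfd vdpa0
(qemu) netdev_add type=vhost-vdpa,vhostfd=vdpa0,id=vhost-vdpa1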

Signed-off-by: Si-Wei Liu 
---
 net/vhost-vdpa.c | 25 -
 qapi/net.json|  3 +++
 qemu-options.hx  |  6 --
 3 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 182b3a1..366b070 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -683,14 +683,29 @@ int net_init_vhost_vdpa(const Netdev *netdev, const char 
*name,
 
 assert(netdev->type == NET_CLIENT_DRIVER_VHOST_VDPA);
opts = &netdev->u.vhost_vdpa;
-if (!opts->vhostdev) {
-error_setg(errp, "vdpa character device not specified with vhostdev");
+if (!opts->has_vhostdev && !opts->has_vhostfd) {
+error_setg(errp,
+   "vhost-vdpa: neither vhostdev= nor vhostfd= was specified");
 return -1;
 }
 
-vdpa_device_fd = qemu_open(opts->vhostdev, O_RDWR, errp);
-if (vdpa_device_fd == -1) {
-return -errno;
+if (opts->has_vhostdev && opts->has_vhostfd) {
+error_setg(errp,
+   "vhost-vdpa: vhostdev= and vhostfd= are mutually 
exclusive");
+return -1;
+}
+
+if (opts->has_vhostdev) {
+vdpa_device_fd = qemu_open(opts->vhostdev, O_RDWR, errp);
+if (vdpa_device_fd == -1) {
+return -errno;
+}
+} else if (opts->has_vhostfd) {
+vdpa_device_fd = monitor_fd_param(monitor_cur(), opts->vhostfd, errp);
+if (vdpa_device_fd == -1) {
+error_prepend(errp, "vhost-vdpa: unable to parse vhostfd: ");
+return -1;
+}
 }
 
r = vhost_vdpa_get_features(vdpa_device_fd, &features, errp);
diff --git a/qapi/net.json b/qapi/net.json
index dd088c0..926ecc8 100644
--- a/qapi/net.json
+++ b/qapi/net.json
@@ -442,6 +442,8 @@
 # @vhostdev: path of vhost-vdpa device
 #(default:'/dev/vhost-vdpa-0')
 #
+# @vhostfd: file descriptor of an already opened vhost vdpa device
+#
 # @queues: number of queues to be created for multiqueue vhost-vdpa
 #  (default: 1)
 #
@@ -456,6 +458,7 @@
 { 'struct': 'NetdevVhostVDPAOptions',
   'data': {
 '*vhostdev': 'str',
+'*vhostfd':  'str',
 '*queues':   'int',
 '*x-svq':{'type': 'bool', 'features' : [ 'unstable'] } } }
 
diff --git a/qemu-options.hx b/qemu-options.hx
index 913c71e..c040f74 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -2774,8 +2774,10 @@ DEF("netdev", HAS_ARG, QEMU_OPTION_netdev,
 "configure a vhost-user network, backed by a chardev 
'dev'\n"
 #endif
 #ifdef __linux__
-"-netdev vhost-vdpa,id=str,vhostdev=/path/to/dev\n"
+"-netdev vhost-vdpa,id=str[,vhostdev=/path/to/dev][,vhostfd=h]\n"
 "configure a vhost-vdpa network,Establish a vhost-vdpa 
netdev\n"
+"use 'vhostdev=/path/to/dev' to open a vhost vdpa device\n"
+"use 'vhostfd=h' to connect to an already opened vhost 
vdpa device\n"
 #endif
 #ifdef CONFIG_VMNET
 "-netdev vmnet-host,id=str[,isolated=on|off][,net-uuid=uuid]\n"
@@ -3280,7 +3282,7 @@ SRST
  -netdev type=vhost-user,id=net0,chardev=chr0 \
  -device virtio-net-pci,netdev=net0
 
-``-netdev vhost-vdpa,vhostdev=/path/to/dev``
+``-netdev vhost-vdpa[,vhostdev=/path/to/dev][,vhostfd=h]``
 Establish a vhost-vdpa netdev.
 
 vDPA device is a device that uses a datapath which complies with
-- 
1.8.3.1




Re: [PATCH v5 6/9] target/arm: Change gen_jmp* to work on displacements

2022-10-04 Thread Richard Henderson

On 10/4/22 08:58, Peter Maydell wrote:

On Fri, 30 Sept 2022 at 23:10, Richard Henderson
 wrote:


In preparation for TARGET_TB_PCREL, reduce reliance on absolute values.

Signed-off-by: Richard Henderson 
---
  target/arm/translate.c | 37 +
  1 file changed, 21 insertions(+), 16 deletions(-)



@@ -8368,7 +8372,8 @@ static bool trans_BLX_i(DisasContext *s, arg_BLX_i *a)
  }
  tcg_gen_movi_i32(cpu_R[14], s->base.pc_next | s->thumb);
  store_cpu_field_constant(!s->thumb, thumb);
-gen_jmp(s, (read_pc(s) & ~3) + a->imm);
+/* This difference computes a page offset so ok for TARGET_TB_PCREL. */
+gen_jmp(s, (read_pc(s) & ~3) - s->pc_curr + a->imm);


Could we just calculate the offset of the jump target instead?
read_pc() returns s->pc_curr + a constant, so the s->pc_curr cancels
out anyway:

   (read_pc(s) & ~3) - s->pc_curr + a->imm
== ((pc_curr + (s->thumb ? 4 : 8)) & ~3) - pc_curr + imm
== pc_curr - low_bits_of_pc + 4-or-8 - pc_curr + imm
== imm + 4-or-8 - low_bits_of_pc

That's then more obviously not dependent on the absolute value
of the PC.


Yes, this works:

-gen_jmp(s, (read_pc(s) & ~3) + a->imm);
+/* This jump is computed from an aligned PC: subtract off the low bits. */
+gen_jmp(s, jmp_diff(s, a->imm - (s->pc_curr & 3)));



r~



Re: [RFC PATCH 1/1] block: add vhost-blk backend

2022-10-04 Thread Stefan Hajnoczi
On Mon, Jul 25, 2022 at 11:55:27PM +0300, Andrey Zhadchenko wrote:
> Although QEMU virtio is quite fast, there is still some room for
> improvements. Disk latency can be reduced if we handle virtio-blk requests
> in host kernel instead of passing them to QEMU. The patch adds vhost-blk
> backend which sets up vhost-blk kernel module to process requests.
> 
> test setup and results:
> fio --direct=1 --rw=randread  --bs=4k  --ioengine=libaio --iodepth=128
> QEMU drive options: cache=none
> filesystem: xfs
> 
> SSD:
>| randread, IOPS  | randwrite, IOPS |
> Host   |  95.8k|  85.3k  |
> QEMU virtio|  57.5k|  79.4k  |
> QEMU vhost-blk |  95.6k|  84.3k  |
> 
> RAMDISK (vq == vcpu):
>  | randread, IOPS | randwrite, IOPS |
> virtio, 1vcpu|123k  |  129k   |
> virtio, 2vcpu|253k (??) |  250k (??)  |
> virtio, 4vcpu|158k  |  154k   |
> vhost-blk, 1vcpu |110k  |  113k   |
> vhost-blk, 2vcpu |247k  |  252k   |
> vhost-blk, 4vcpu |576k  |  567k   |
> 
> Signed-off-by: Andrey Zhadchenko 
> ---
>  hw/block/Kconfig  |   5 +
>  hw/block/meson.build  |   4 +
>  hw/block/vhost-blk.c  | 394 ++
>  hw/virtio/meson.build |   3 +
>  hw/virtio/vhost-blk-pci.c | 102 +
>  include/hw/virtio/vhost-blk.h |  50 +
>  meson.build   |   5 +
>  7 files changed, 563 insertions(+)
>  create mode 100644 hw/block/vhost-blk.c
>  create mode 100644 hw/virtio/vhost-blk-pci.c
>  create mode 100644 include/hw/virtio/vhost-blk.h
> 
> diff --git a/hw/block/Kconfig b/hw/block/Kconfig
> index 9e8f28f982..b4286ad10e 100644
> --- a/hw/block/Kconfig
> +++ b/hw/block/Kconfig
> @@ -36,6 +36,11 @@ config VIRTIO_BLK
>  default y
>  depends on VIRTIO
>  
> +config VHOST_BLK
> +bool
> +default n

Feel free to enable it by default. That way it gets more CI/build
coverage.

> +depends on VIRTIO && LINUX
> +
>  config VHOST_USER_BLK
>  bool
>  # Only PCI devices are provided for now
> diff --git a/hw/block/meson.build b/hw/block/meson.build
> index 2389326112..caf9bedff3 100644
> --- a/hw/block/meson.build
> +++ b/hw/block/meson.build
> @@ -19,4 +19,8 @@ softmmu_ss.add(when: 'CONFIG_TC58128', if_true: 
> files('tc58128.c'))
>  specific_ss.add(when: 'CONFIG_VIRTIO_BLK', if_true: files('virtio-blk.c'))
>  specific_ss.add(when: 'CONFIG_VHOST_USER_BLK', if_true: 
> files('vhost-user-blk.c'))
>  
> +if have_vhost_blk
> +  specific_ss.add(files('vhost-blk.c'))
> +endif

Can this use the same add(when: 'CONFIG_VHOST_BLK', ...) syntax as the
other conditional builds above?
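
i.e. presumably something like:

  specific_ss.add(when: 'CONFIG_VHOST_BLK', if_true: files('vhost-blk.c'))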

> +
>  subdir('dataplane')
> diff --git a/hw/block/vhost-blk.c b/hw/block/vhost-blk.c
> new file mode 100644
> index 00..33d90af270
> --- /dev/null
> +++ b/hw/block/vhost-blk.c
> @@ -0,0 +1,394 @@
> +/*
> + * Copyright (c) 2022 Virtuozzo International GmbH.
> + * Author: Andrey Zhadchenko 
> + *
> + * vhost-blk is host kernel accelerator for virtio-blk.
> + *
> + * This work is licensed under the terms of the GNU LGPL, version 2 or later.
> + * See the COPYING.LIB file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qapi/error.h"
> +#include "qemu/error-report.h"
> +#include "qom/object.h"
> +#include "hw/qdev-core.h"
> +#include "hw/boards.h"
> +#include "hw/virtio/vhost.h"
> +#include "hw/virtio/vhost-blk.h"
> +#include "hw/virtio/virtio.h"
> +#include "hw/virtio/virtio-blk.h"
> +#include "hw/virtio/virtio-bus.h"
> +#include "hw/virtio/virtio-access.h"
> +#include "hw/virtio/virtio-pci.h"
> +#include "sysemu/sysemu.h"
> +#include "linux-headers/linux/vhost.h"
> +#include 
> +#include 
> +
> +static int vhost_blk_start(VirtIODevice *vdev)
> +{
> +VHostBlk *s = VHOST_BLK(vdev);
> +struct vhost_vring_file backend;
> +int ret, i;
> +int *fd = blk_bs(s->conf.conf.blk)->file->bs->opaque;

This needs a clean API so vhost-blk.c doesn't make assumptions about the
file-posix BlockDriver's internal state memory layout.

> +BusState *qbus = BUS(qdev_get_parent_bus(DEVICE(vdev)));
> +VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(qbus);
> +
> +if (!k->set_guest_notifiers) {
> +error_report("vhost-blk: binding does not support guest notifiers");
> +return -ENOSYS;
> +}
> +
> +if (s->vhost_started) {
> +return 0;
> +}
> +
> +if (ioctl(s->vhostfd, VHOST_SET_OWNER, NULL)) {
> +error_report("vhost-blk: unable to set owner");
> +return -ENOSYS;
> +}
> +
> +ret = vhost_dev_enable_notifiers(&s->dev, vdev);
> +if (ret < 0) {
> +error_report("vhost-blk: unable to enable dev notifiers", errno);
> +return ret;
> +}
> +
> +s->dev.acked_features = vdev->guest_features & s->dev.backend_features;
> +
> +ret = vhost_dev_start(&s->dev, vdev);
> +if (ret < 0) {
> +  

[PULL 17/20] accel/tcg: Introduce tb_pc and log_pc

2022-10-04 Thread Richard Henderson
The availability of tb->pc will shortly be conditional.
Introduce accessor functions to minimize ifdefs.

Pass around a known pc to places like tcg_gen_code,
where the caller must already have the value.

Reviewed-by: Alex Bennée 
Signed-off-by: Richard Henderson 
---
 accel/tcg/internal.h|  6 
 include/exec/exec-all.h |  6 
 include/tcg/tcg.h   |  2 +-
 accel/tcg/cpu-exec.c| 46 ++---
 accel/tcg/translate-all.c   | 37 +++-
 target/arm/cpu.c|  4 +--
 target/avr/cpu.c|  2 +-
 target/hexagon/cpu.c|  2 +-
 target/hppa/cpu.c   |  4 +--
 target/i386/tcg/tcg-cpu.c   |  2 +-
 target/loongarch/cpu.c  |  2 +-
 target/microblaze/cpu.c |  2 +-
 target/mips/tcg/exception.c |  2 +-
 target/mips/tcg/sysemu/special_helper.c |  2 +-
 target/openrisc/cpu.c   |  2 +-
 target/riscv/cpu.c  |  4 +--
 target/rx/cpu.c |  2 +-
 target/sh4/cpu.c|  4 +--
 target/sparc/cpu.c  |  2 +-
 target/tricore/cpu.c|  2 +-
 tcg/tcg.c   |  8 ++---
 21 files changed, 82 insertions(+), 61 deletions(-)

diff --git a/accel/tcg/internal.h b/accel/tcg/internal.h
index 3092bfa964..a3875a3b5a 100644
--- a/accel/tcg/internal.h
+++ b/accel/tcg/internal.h
@@ -18,4 +18,10 @@ G_NORETURN void cpu_io_recompile(CPUState *cpu, uintptr_t 
retaddr);
 void page_init(void);
 void tb_htable_init(void);
 
+/* Return the current PC from CPU, which may be cached in TB. */
+static inline target_ulong log_pc(CPUState *cpu, const TranslationBlock *tb)
+{
+return tb_pc(tb);
+}
+
 #endif /* ACCEL_TCG_INTERNAL_H */
diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index b1b920a713..7ea6026ba9 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -570,6 +570,12 @@ struct TranslationBlock {
 uintptr_t jmp_dest[2];
 };
 
+/* Hide the read to avoid ifdefs for TARGET_TB_PCREL. */
+static inline target_ulong tb_pc(const TranslationBlock *tb)
+{
+return tb->pc;
+}
+
 /* Hide the qatomic_read to make code a little easier on the eyes */
 static inline uint32_t tb_cflags(const TranslationBlock *tb)
 {
diff --git a/include/tcg/tcg.h b/include/tcg/tcg.h
index 26a70526f1..d84bae6e3f 100644
--- a/include/tcg/tcg.h
+++ b/include/tcg/tcg.h
@@ -840,7 +840,7 @@ void tcg_register_thread(void);
 void tcg_prologue_init(TCGContext *s);
 void tcg_func_start(TCGContext *s);
 
-int tcg_gen_code(TCGContext *s, TranslationBlock *tb);
+int tcg_gen_code(TCGContext *s, TranslationBlock *tb, target_ulong pc_start);
 
 void tcg_set_frame(TCGContext *s, TCGReg reg, intptr_t start, intptr_t size);
 
diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
index 2d7e610ee2..8b3f8435fb 100644
--- a/accel/tcg/cpu-exec.c
+++ b/accel/tcg/cpu-exec.c
@@ -186,7 +186,7 @@ static bool tb_lookup_cmp(const void *p, const void *d)
 const TranslationBlock *tb = p;
 const struct tb_desc *desc = d;
 
-if (tb->pc == desc->pc &&
+if (tb_pc(tb) == desc->pc &&
 tb->page_addr[0] == desc->page_addr0 &&
 tb->cs_base == desc->cs_base &&
 tb->flags == desc->flags &&
@@ -271,12 +271,10 @@ static inline TranslationBlock *tb_lookup(CPUState *cpu, 
target_ulong pc,
 return tb;
 }
 
-static inline void log_cpu_exec(target_ulong pc, CPUState *cpu,
-const TranslationBlock *tb)
+static void log_cpu_exec(target_ulong pc, CPUState *cpu,
+ const TranslationBlock *tb)
 {
-if (unlikely(qemu_loglevel_mask(CPU_LOG_TB_CPU | CPU_LOG_EXEC))
-&& qemu_log_in_addr_range(pc)) {
-
+if (qemu_log_in_addr_range(pc)) {
 qemu_log_mask(CPU_LOG_EXEC,
   "Trace %d: %p [" TARGET_FMT_lx
   "/" TARGET_FMT_lx "/%08x/%08x] %s\n",
@@ -400,7 +398,9 @@ const void *HELPER(lookup_tb_ptr)(CPUArchState *env)
 return tcg_code_gen_epilogue;
 }
 
-log_cpu_exec(pc, cpu, tb);
+if (qemu_loglevel_mask(CPU_LOG_TB_CPU | CPU_LOG_EXEC)) {
+log_cpu_exec(pc, cpu, tb);
+}
 
 return tb->tc.ptr;
 }
@@ -423,7 +423,9 @@ cpu_tb_exec(CPUState *cpu, TranslationBlock *itb, int 
*tb_exit)
 TranslationBlock *last_tb;
 const void *tb_ptr = itb->tc.ptr;
 
-log_cpu_exec(itb->pc, cpu, itb);
+if (qemu_loglevel_mask(CPU_LOG_TB_CPU | CPU_LOG_EXEC)) {
+log_cpu_exec(log_pc(cpu, itb), cpu, itb);
+}
 
 qemu_thread_jit_execute();
 ret = tcg_qemu_tb_exec(env, tb_ptr);
@@ -447,16 +449,20 @@ cpu_tb_exec(CPUState *cpu, TranslationBlock *itb, int 
*tb_exit)
  * of the start of the TB.
  */
 CPUClass *cc = CPU_GET_CLASS(cpu);
-qemu_log_mask_and_addr(CPU_LOG_EXEC, last_tb->pc,
- 

[PULL 20/20] target/sh4: Fix TB_FLAG_UNALIGN

2022-10-04 Thread Richard Henderson
The value previously chosen overlaps GUSA_MASK.

Rename all DELAY_SLOT_* and GUSA_* defines to emphasize
that they are included in TB_FLAGs.  Add aliases for the
FPSCR and SR bits that are included in TB_FLAGS, so that
we don't accidentally reassign those bits.
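
For reference, the clash in the old layout boils down to this (values copied
from the pre-patch header visible in the diff below; a sketch, not part of
the patch):

    #define OLD_TB_FLAG_UNALIGN  (1 << 4)
    #define OLD_GUSA_SHIFT       4
    #define OLD_GUSA_MASK        ((0xff << OLD_GUSA_SHIFT) | (1 << 12))

    /* bit 4 is claimed twice, so the two states were indistinguishable */
    _Static_assert(OLD_TB_FLAG_UNALIGN & OLD_GUSA_MASK, "flags overlap");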

Fixes: 4da06fb3062 ("target/sh4: Implement prctl_unalign_sigbus")
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/856
Reviewed-by: Yoshinori Sato 
Signed-off-by: Richard Henderson 
---
 target/sh4/cpu.h| 56 +
 linux-user/sh4/signal.c |  6 +--
 target/sh4/cpu.c|  6 +--
 target/sh4/helper.c |  6 +--
 target/sh4/translate.c  | 90 ++---
 5 files changed, 88 insertions(+), 76 deletions(-)

diff --git a/target/sh4/cpu.h b/target/sh4/cpu.h
index 9f15ef913c..727b829598 100644
--- a/target/sh4/cpu.h
+++ b/target/sh4/cpu.h
@@ -78,26 +78,33 @@
 #define FPSCR_RM_NEAREST   (0 << 0)
 #define FPSCR_RM_ZERO  (1 << 0)
 
-#define DELAY_SLOT_MASK0x7
-#define DELAY_SLOT (1 << 0)
-#define DELAY_SLOT_CONDITIONAL (1 << 1)
-#define DELAY_SLOT_RTE (1 << 2)
+#define TB_FLAG_DELAY_SLOT   (1 << 0)
+#define TB_FLAG_DELAY_SLOT_COND  (1 << 1)
+#define TB_FLAG_DELAY_SLOT_RTE   (1 << 2)
+#define TB_FLAG_PENDING_MOVCA(1 << 3)
+#define TB_FLAG_GUSA_SHIFT   4  /* [11:4] */
+#define TB_FLAG_GUSA_EXCLUSIVE   (1 << 12)
+#define TB_FLAG_UNALIGN  (1 << 13)
+#define TB_FLAG_SR_FD(1 << SR_FD)   /* 15 */
+#define TB_FLAG_FPSCR_PR FPSCR_PR   /* 19 */
+#define TB_FLAG_FPSCR_SZ FPSCR_SZ   /* 20 */
+#define TB_FLAG_FPSCR_FR FPSCR_FR   /* 21 */
+#define TB_FLAG_SR_RB(1 << SR_RB)   /* 29 */
+#define TB_FLAG_SR_MD(1 << SR_MD)   /* 30 */
 
-#define TB_FLAG_PENDING_MOVCA  (1 << 3)
-#define TB_FLAG_UNALIGN(1 << 4)
-
-#define GUSA_SHIFT 4
-#ifdef CONFIG_USER_ONLY
-#define GUSA_EXCLUSIVE (1 << 12)
-#define GUSA_MASK  ((0xff << GUSA_SHIFT) | GUSA_EXCLUSIVE)
-#else
-/* Provide dummy versions of the above to allow tests against tbflags
-   to be elided while avoiding ifdefs.  */
-#define GUSA_EXCLUSIVE 0
-#define GUSA_MASK  0
-#endif
-
-#define TB_FLAG_ENVFLAGS_MASK  (DELAY_SLOT_MASK | GUSA_MASK)
+#define TB_FLAG_DELAY_SLOT_MASK  (TB_FLAG_DELAY_SLOT |   \
+  TB_FLAG_DELAY_SLOT_COND |  \
+  TB_FLAG_DELAY_SLOT_RTE)
+#define TB_FLAG_GUSA_MASK((0xff << TB_FLAG_GUSA_SHIFT) | \
+  TB_FLAG_GUSA_EXCLUSIVE)
+#define TB_FLAG_FPSCR_MASK   (TB_FLAG_FPSCR_PR | \
+  TB_FLAG_FPSCR_SZ | \
+  TB_FLAG_FPSCR_FR)
+#define TB_FLAG_SR_MASK  (TB_FLAG_SR_FD | \
+  TB_FLAG_SR_RB | \
+  TB_FLAG_SR_MD)
+#define TB_FLAG_ENVFLAGS_MASK(TB_FLAG_DELAY_SLOT_MASK | \
+  TB_FLAG_GUSA_MASK)
 
 typedef struct tlb_t {
 uint32_t vpn;  /* virtual page number */
@@ -258,7 +265,7 @@ static inline int cpu_mmu_index (CPUSH4State *env, bool 
ifetch)
 {
 /* The instruction in a RTE delay slot is fetched in privileged
mode, but executed in user mode.  */
-if (ifetch && (env->flags & DELAY_SLOT_RTE)) {
+if (ifetch && (env->flags & TB_FLAG_DELAY_SLOT_RTE)) {
 return 0;
 } else {
 return (env->sr & (1u << SR_MD)) == 0 ? 1 : 0;
@@ -366,11 +373,10 @@ static inline void cpu_get_tb_cpu_state(CPUSH4State *env, 
target_ulong *pc,
 {
 *pc = env->pc;
 /* For a gUSA region, notice the end of the region.  */
-*cs_base = env->flags & GUSA_MASK ? env->gregs[0] : 0;
-*flags = env->flags /* TB_FLAG_ENVFLAGS_MASK: bits 0-2, 4-12 */
-| (env->fpscr & (FPSCR_FR | FPSCR_SZ | FPSCR_PR))  /* Bits 19-21 */
-| (env->sr & ((1u << SR_MD) | (1u << SR_RB)))  /* Bits 29-30 */
-| (env->sr & (1u << SR_FD))/* Bit 15 */
+*cs_base = env->flags & TB_FLAG_GUSA_MASK ? env->gregs[0] : 0;
+*flags = env->flags
+| (env->fpscr & TB_FLAG_FPSCR_MASK)
+| (env->sr & TB_FLAG_SR_MASK)
 | (env->movcal_backup ? TB_FLAG_PENDING_MOVCA : 0); /* Bit 3 */
 #ifdef CONFIG_USER_ONLY
 *flags |= TB_FLAG_UNALIGN * !env_cpu(env)->prctl_unalign_sigbus;
diff --git a/linux-user/sh4/signal.c b/linux-user/sh4/signal.c
index f6a18bc6b5..c4ba962708 100644
--- a/linux-user/sh4/signal.c
+++ b/linux-user/sh4/signal.c
@@ -161,7 +161,7 @@ static void restore_sigcontext(CPUSH4State *regs, struct 
target_sigcontext *sc)
 __get_user(regs->fpul, >sc_fpul);
 
 regs->tra = -1; /* disable syscall checks */
-regs->flags &= ~(DELAY_SLOT_MASK | GUSA_MASK);
+regs->flags = 0;
 }
 
 void setup_frame(int sig, struct 

Re: [RFC patch 0/1] block: vhost-blk backend

2022-10-04 Thread Stefan Hajnoczi
On Mon, Jul 25, 2022 at 11:55:26PM +0300, Andrey Zhadchenko wrote:
> Although QEMU virtio-blk is quite fast, there is still some room for
> improvements. Disk latency can be reduced if we handle virtio-blk requests
> in host kernel so we avoid a lot of syscalls and context switches.
> 
> The biggest disadvantage of this vhost-blk flavor is raw format.
> Luckily Kirill Thai proposed device mapper driver for QCOW2 format to attach
> files as block devices: https://www.spinics.net/lists/kernel/msg4292965.html
> 
> Also by using kernel modules we can bypass iothread limitation and finally 
> scale
> block requests with cpus for high-performance devices. This is planned to be
> implemented in next version.
> 
> Linux kernel module part:
> https://lore.kernel.org/kvm/20220725202753.298725-1-andrey.zhadche...@virtuozzo.com/
> 
> test setups and results:
> fio --direct=1 --rw=randread  --bs=4k  --ioengine=libaio --iodepth=128

> QEMU drive options: cache=none
> filesystem: xfs

Please post the full QEMU command-line so it's clear exactly what this
is benchmarking.

A preallocated raw image file is a good baseline with:

  --object iothread,id=iothread0 \
  --blockdev file,filename=test.img,cache.direct=on,aio=native,node-name=drive0 
\
  --device virtio-blk-pci,drive=drive0,iothread=iothread0

(BTW QEMU's default vq size is 256 descriptors and the number of vqs is
the number of vCPUs.)

> 
> SSD:
>| randread, IOPS  | randwrite, IOPS |
> Host   |  95.8k|  85.3k  |
> QEMU virtio|  57.5k|  79.4k  |
> QEMU vhost-blk |  95.6k|  84.3k  |
> 
> RAMDISK (vq == vcpu):

With fio numjobs=vcpu here?

>  | randread, IOPS | randwrite, IOPS |
> virtio, 1vcpu|123k  |  129k   |
> virtio, 2vcpu|253k (??) |  250k (??)  |

QEMU's aio=threads (default) gets around the single IOThread. It beats
aio=native for this reason in some cases. Were you using aio=native or
aio=threads?

> virtio, 4vcpu|158k  |  154k   |
> vhost-blk, 1vcpu |110k  |  113k   |
> vhost-blk, 2vcpu |247k  |  252k   |


signature.asc
Description: PGP signature


[PULL 13/20] accel/tcg: Do not align tb->page_addr[0]

2022-10-04 Thread Richard Henderson
Let tb->page_addr[0] contain the address of the first byte of the
translated block, rather than the address of the page containing the
start of the translated block.  We need to recover this value anyway
at various points, and it is easier to discard a page offset when it
is not needed, which happens naturally via the existing find_page shift.

Reviewed-by: Alex Bennée 
Signed-off-by: Richard Henderson 
---
 accel/tcg/cpu-exec.c  | 16 
 accel/tcg/cputlb.c|  3 ++-
 accel/tcg/translate-all.c |  9 +
 3 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
index 5f43b9769a..dd58a144a8 100644
--- a/accel/tcg/cpu-exec.c
+++ b/accel/tcg/cpu-exec.c
@@ -174,7 +174,7 @@ struct tb_desc {
 target_ulong pc;
 target_ulong cs_base;
 CPUArchState *env;
-tb_page_addr_t phys_page1;
+tb_page_addr_t page_addr0;
 uint32_t flags;
 uint32_t cflags;
 uint32_t trace_vcpu_dstate;
@@ -186,7 +186,7 @@ static bool tb_lookup_cmp(const void *p, const void *d)
 const struct tb_desc *desc = d;
 
 if (tb->pc == desc->pc &&
-tb->page_addr[0] == desc->phys_page1 &&
+tb->page_addr[0] == desc->page_addr0 &&
 tb->cs_base == desc->cs_base &&
 tb->flags == desc->flags &&
 tb->trace_vcpu_dstate == desc->trace_vcpu_dstate &&
@@ -195,8 +195,8 @@ static bool tb_lookup_cmp(const void *p, const void *d)
 if (tb->page_addr[1] == -1) {
 return true;
 } else {
-tb_page_addr_t phys_page2;
-target_ulong virt_page2;
+tb_page_addr_t phys_page1;
+target_ulong virt_page1;
 
 /*
  * We know that the first page matched, and an otherwise valid TB
@@ -207,9 +207,9 @@ static bool tb_lookup_cmp(const void *p, const void *d)
  * is different for the new TB.  Therefore any exception raised
  * here by the faulting lookup is not premature.
  */
-virt_page2 = TARGET_PAGE_ALIGN(desc->pc);
-phys_page2 = get_page_addr_code(desc->env, virt_page2);
-if (tb->page_addr[1] == phys_page2) {
+virt_page1 = TARGET_PAGE_ALIGN(desc->pc);
+phys_page1 = get_page_addr_code(desc->env, virt_page1);
+if (tb->page_addr[1] == phys_page1) {
 return true;
 }
 }
@@ -235,7 +235,7 @@ static TranslationBlock *tb_htable_lookup(CPUState *cpu, 
target_ulong pc,
 if (phys_pc == -1) {
 return NULL;
 }
-desc.phys_page1 = phys_pc & TARGET_PAGE_MASK;
+desc.page_addr0 = phys_pc;
 h = tb_hash_func(phys_pc, pc, flags, cflags, *cpu->trace_dstate);
return qht_lookup_custom(&tb_ctx.htable, &desc, h, tb_lookup_cmp);
 }
diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index 361078471b..a0db2d32a8 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -951,7 +951,8 @@ void tlb_flush_page_bits_by_mmuidx_all_cpus_synced(CPUState 
*src_cpu,
can be detected */
 void tlb_protect_code(ram_addr_t ram_addr)
 {
-cpu_physical_memory_test_and_clear_dirty(ram_addr, TARGET_PAGE_SIZE,
+cpu_physical_memory_test_and_clear_dirty(ram_addr & TARGET_PAGE_MASK,
+ TARGET_PAGE_SIZE,
  DIRTY_MEMORY_CODE);
 }
 
diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index ca685f6ede..3a63113c41 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -1167,7 +1167,7 @@ static void do_tb_phys_invalidate(TranslationBlock *tb, 
bool rm_from_page_list)
qemu_spin_unlock(&tb->jmp_lock);
 
 /* remove the TB from the hash list */
-phys_pc = tb->page_addr[0] + (tb->pc & ~TARGET_PAGE_MASK);
+phys_pc = tb->page_addr[0];
 h = tb_hash_func(phys_pc, tb->pc, tb->flags, orig_cflags,
  tb->trace_vcpu_dstate);
if (!qht_remove(&tb_ctx.htable, tb, h)) {
@@ -1291,7 +1291,7 @@ tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
  * we can only insert TBs that are fully initialized.
  */
page_lock_pair(&p, phys_pc, &p2, phys_page2, true);
-tb_page_add(p, tb, 0, phys_pc & TARGET_PAGE_MASK);
+tb_page_add(p, tb, 0, phys_pc);
 if (p2) {
 tb_page_add(p2, tb, 1, phys_page2);
 } else {
@@ -1644,11 +1644,12 @@ tb_invalidate_phys_page_range__locked(struct 
page_collection *pages,
 if (n == 0) {
 /* NOTE: tb_end may be after the end of the page, but
it is not a problem */
-tb_start = tb->page_addr[0] + (tb->pc & ~TARGET_PAGE_MASK);
+tb_start = tb->page_addr[0];
 tb_end = tb_start + tb->size;
 } else {
 tb_start = tb->page_addr[1];
-tb_end = tb_start + ((tb->pc + tb->size) & ~TARGET_PAGE_MASK);
+tb_end = tb_start + ((tb->page_addr[0] + tb->size)
+ & ~TARGET_PAGE_MASK);
 }
   

[PULL 19/20] tcg/ppc: Optimize 26-bit jumps

2022-10-04 Thread Richard Henderson
From: Leandro Lupori 

PowerPC64 processors handle direct branches better than indirect
ones, resulting in less stalled cycles and branch misses.

However, PPC's tb_target_set_jmp_target() was only using direct
branches for 16-bit jumps, while PowerPC64's unconditional branch
instructions are able to handle displacements of up to 26 bits.
To take advantage of this, jumps whose displacements fit in 17 to 26
bits are now also converted to direct branches.
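
For reference, the 26-bit limit comes from the unconditional branch's 24-bit
LI field being shifted left by 2 bits; the range test is essentially a signed
26-bit truncation check, which is what in_range_b() used below expresses.  A
sketch using sextract64() from qemu/bitops.h:

    static inline bool fits_direct_branch(intptr_t disp)
    {
        /* true iff disp survives sign-extension from its low 26 bits */
        return disp == sextract64(disp, 0, 26);
    }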

Reviewed-by: Richard Henderson 
Signed-off-by: Leandro Lupori 
[rth: Expanded some commentary.]
Signed-off-by: Richard Henderson 
---
 tcg/ppc/tcg-target.c.inc | 119 +--
 1 file changed, 88 insertions(+), 31 deletions(-)

diff --git a/tcg/ppc/tcg-target.c.inc b/tcg/ppc/tcg-target.c.inc
index 1cbd047ab3..e3dba47697 100644
--- a/tcg/ppc/tcg-target.c.inc
+++ b/tcg/ppc/tcg-target.c.inc
@@ -1847,44 +1847,101 @@ static void tcg_out_mb(TCGContext *s, TCGArg a0)
 tcg_out32(s, insn);
 }
 
+static inline uint64_t make_pair(tcg_insn_unit i1, tcg_insn_unit i2)
+{
+if (HOST_BIG_ENDIAN) {
+return (uint64_t)i1 << 32 | i2;
+}
+return (uint64_t)i2 << 32 | i1;
+}
+
+static inline void ppc64_replace2(uintptr_t rx, uintptr_t rw,
+  tcg_insn_unit i0, tcg_insn_unit i1)
+{
+#if TCG_TARGET_REG_BITS == 64
+qatomic_set((uint64_t *)rw, make_pair(i0, i1));
+flush_idcache_range(rx, rw, 8);
+#else
+qemu_build_not_reached();
+#endif
+}
+
+static inline void ppc64_replace4(uintptr_t rx, uintptr_t rw,
+  tcg_insn_unit i0, tcg_insn_unit i1,
+  tcg_insn_unit i2, tcg_insn_unit i3)
+{
+uint64_t p[2];
+
+p[!HOST_BIG_ENDIAN] = make_pair(i0, i1);
+p[HOST_BIG_ENDIAN] = make_pair(i2, i3);
+
+/*
+ * There's no convenient way to get the compiler to allocate a pair
+ * of registers at an even index, so copy into r6/r7 and clobber.
+ */
+asm("mr  %%r6, %1\n\t"
+"mr  %%r7, %2\n\t"
+"stq %%r6, %0"
+: "=Q"(*(__int128 *)rw) : "r"(p[0]), "r"(p[1]) : "r6", "r7");
+flush_idcache_range(rx, rw, 16);
+}
+
 void tb_target_set_jmp_target(uintptr_t tc_ptr, uintptr_t jmp_rx,
   uintptr_t jmp_rw, uintptr_t addr)
 {
-if (TCG_TARGET_REG_BITS == 64) {
-tcg_insn_unit i1, i2;
-intptr_t tb_diff = addr - tc_ptr;
-intptr_t br_diff = addr - (jmp_rx + 4);
-uint64_t pair;
+tcg_insn_unit i0, i1, i2, i3;
+intptr_t tb_diff = addr - tc_ptr;
+intptr_t br_diff = addr - (jmp_rx + 4);
+intptr_t lo, hi;
 
-/* This does not exercise the range of the branch, but we do
-   still need to be able to load the new value of TCG_REG_TB.
-   But this does still happen quite often.  */
-if (tb_diff == (int16_t)tb_diff) {
-i1 = ADDI | TAI(TCG_REG_TB, TCG_REG_TB, tb_diff);
-i2 = B | (br_diff & 0x3fc);
-} else {
-intptr_t lo = (int16_t)tb_diff;
-intptr_t hi = (int32_t)(tb_diff - lo);
-assert(tb_diff == hi + lo);
-i1 = ADDIS | TAI(TCG_REG_TB, TCG_REG_TB, hi >> 16);
-i2 = ADDI | TAI(TCG_REG_TB, TCG_REG_TB, lo);
-}
-#if HOST_BIG_ENDIAN
-pair = (uint64_t)i1 << 32 | i2;
-#else
-pair = (uint64_t)i2 << 32 | i1;
-#endif
-
-/* As per the enclosing if, this is ppc64.  Avoid the _Static_assert
-   within qatomic_set that would fail to build a ppc32 host.  */
-qatomic_set__nocheck((uint64_t *)jmp_rw, pair);
-flush_idcache_range(jmp_rx, jmp_rw, 8);
-} else {
+if (TCG_TARGET_REG_BITS == 32) {
 intptr_t diff = addr - jmp_rx;
 tcg_debug_assert(in_range_b(diff));
 qatomic_set((uint32_t *)jmp_rw, B | (diff & 0x3fc));
 flush_idcache_range(jmp_rx, jmp_rw, 4);
+return;
 }
+
+/*
+ * For 16-bit displacements, we can use a single add + branch.
+ * This happens quite often.
+ */
+if (tb_diff == (int16_t)tb_diff) {
+i0 = ADDI | TAI(TCG_REG_TB, TCG_REG_TB, tb_diff);
+i1 = B | (br_diff & 0x3fc);
+ppc64_replace2(jmp_rx, jmp_rw, i0, i1);
+return;
+}
+
+lo = (int16_t)tb_diff;
+hi = (int32_t)(tb_diff - lo);
+assert(tb_diff == hi + lo);
+i0 = ADDIS | TAI(TCG_REG_TB, TCG_REG_TB, hi >> 16);
+i1 = ADDI | TAI(TCG_REG_TB, TCG_REG_TB, lo);
+
+/*
+ * Without stq from 2.07, we can only update two insns,
+ * and those must be the ones that load the target address.
+ */
+if (!have_isa_2_07) {
+ppc64_replace2(jmp_rx, jmp_rw, i0, i1);
+return;
+}
+
+/*
+ * For 26-bit displacements, we can use a direct branch.
+ * Otherwise we still need the indirect branch, which we
+ * must restore after a potential direct branch write.
+ */
+br_diff -= 4;
+if (in_range_b(br_diff)) {
+i2 = B | (br_diff & 

[PULL 18/20] accel/tcg: Introduce TARGET_TB_PCREL

2022-10-04 Thread Richard Henderson
Prepare for targets to be able to produce TBs that can
run in more than one virtual context.

Reviewed-by: Alex Bennée 
Signed-off-by: Richard Henderson 
---
 accel/tcg/internal.h  |  4 +++
 accel/tcg/tb-jmp-cache.h  | 41 +
 include/exec/cpu-defs.h   |  3 ++
 include/exec/exec-all.h   | 32 ++--
 accel/tcg/cpu-exec.c  | 16 ++
 accel/tcg/translate-all.c | 64 ++-
 6 files changed, 131 insertions(+), 29 deletions(-)

diff --git a/accel/tcg/internal.h b/accel/tcg/internal.h
index a3875a3b5a..dc800fd485 100644
--- a/accel/tcg/internal.h
+++ b/accel/tcg/internal.h
@@ -21,7 +21,11 @@ void tb_htable_init(void);
 /* Return the current PC from CPU, which may be cached in TB. */
 static inline target_ulong log_pc(CPUState *cpu, const TranslationBlock *tb)
 {
+#if TARGET_TB_PCREL
+return cpu->cc->get_pc(cpu);
+#else
 return tb_pc(tb);
+#endif
 }
 
 #endif /* ACCEL_TCG_INTERNAL_H */
diff --git a/accel/tcg/tb-jmp-cache.h b/accel/tcg/tb-jmp-cache.h
index 2d8fbb1bfe..ff5ffc8fc2 100644
--- a/accel/tcg/tb-jmp-cache.h
+++ b/accel/tcg/tb-jmp-cache.h
@@ -14,11 +14,52 @@
 
 /*
  * Accessed in parallel; all accesses to 'tb' must be atomic.
+ * For TARGET_TB_PCREL, accesses to 'pc' must be protected by
+ * a load_acquire/store_release to 'tb'.
  */
 struct CPUJumpCache {
 struct {
 TranslationBlock *tb;
+#if TARGET_TB_PCREL
+target_ulong pc;
+#endif
 } array[TB_JMP_CACHE_SIZE];
 };
 
+static inline TranslationBlock *
+tb_jmp_cache_get_tb(CPUJumpCache *jc, uint32_t hash)
+{
+#if TARGET_TB_PCREL
+/* Use acquire to ensure current load of pc from jc. */
+return qatomic_load_acquire(&jc->array[hash].tb);
+#else
+/* Use rcu_read to ensure current load of pc from *tb. */
+return qatomic_rcu_read(&jc->array[hash].tb);
+#endif
+}
+
+static inline target_ulong
+tb_jmp_cache_get_pc(CPUJumpCache *jc, uint32_t hash, TranslationBlock *tb)
+{
+#if TARGET_TB_PCREL
+return jc->array[hash].pc;
+#else
+return tb_pc(tb);
+#endif
+}
+
+static inline void
+tb_jmp_cache_set(CPUJumpCache *jc, uint32_t hash,
+ TranslationBlock *tb, target_ulong pc)
+{
+#if TARGET_TB_PCREL
+jc->array[hash].pc = pc;
+/* Use store_release on tb to ensure pc is written first. */
+qatomic_store_release(&jc->array[hash].tb, tb);
+#else
+/* Use the pc value already stored in tb->pc. */
+qatomic_set(&jc->array[hash].tb, tb);
+#endif
+}
+
 #endif /* ACCEL_TCG_TB_JMP_CACHE_H */
diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
index 67239b4e5e..21309cf567 100644
--- a/include/exec/cpu-defs.h
+++ b/include/exec/cpu-defs.h
@@ -54,6 +54,9 @@
 #  error TARGET_PAGE_BITS must be defined in cpu-param.h
 # endif
 #endif
+#ifndef TARGET_TB_PCREL
+# define TARGET_TB_PCREL 0
+#endif
 
 #define TARGET_LONG_SIZE (TARGET_LONG_BITS / 8)
 
diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index 7ea6026ba9..e5f8b224a5 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -496,8 +496,32 @@ struct tb_tc {
 };
 
 struct TranslationBlock {
-target_ulong pc;   /* simulated PC corresponding to this block (EIP + CS 
base) */
-target_ulong cs_base; /* CS base for this block */
+#if !TARGET_TB_PCREL
+/*
+ * Guest PC corresponding to this block.  This must be the true
+ * virtual address.  Therefore e.g. x86 stores EIP + CS_BASE, and
+ * targets like Arm, MIPS, HP-PA, which reuse low bits for ISA or
+ * privilege, must store those bits elsewhere.
+ *
+ * If TARGET_TB_PCREL, the opcodes for the TranslationBlock are
+ * written such that the TB is associated only with the physical
+ * page and may be run in any virtual address context.  In this case,
+ * PC must always be taken from ENV in a target-specific manner.
+ * Unwind information is taken as offsets from the page, to be
+ * deposited into the "current" PC.
+ */
+target_ulong pc;
+#endif
+
+/*
+ * Target-specific data associated with the TranslationBlock, e.g.:
+ * x86: the original user, the Code Segment virtual base,
+ * arm: an extension of tb->flags,
+ * s390x: instruction data for EXECUTE,
+ * sparc: the next pc of the instruction queue (for delay slots).
+ */
+target_ulong cs_base;
+
 uint32_t flags; /* flags defining in which context the code was generated 
*/
 uint32_t cflags;/* compile flags */
 
@@ -573,7 +597,11 @@ struct TranslationBlock {
 /* Hide the read to avoid ifdefs for TARGET_TB_PCREL. */
 static inline target_ulong tb_pc(const TranslationBlock *tb)
 {
+#if TARGET_TB_PCREL
+qemu_build_not_reached();
+#else
 return tb->pc;
+#endif
 }
 
 /* Hide the qatomic_read to make code a little easier on the eyes */
diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
index 8b3f8435fb..f9e5cc9ba0 100644
--- a/accel/tcg/cpu-exec.c
+++ b/accel/tcg/cpu-exec.c
@@ -186,7 +186,7 @@ static bool tb_lookup_cmp(const void *p, const 

Re: [RFC patch 0/1] block: vhost-blk backend

2022-10-04 Thread Stefan Hajnoczi
On Mon, Jul 25, 2022 at 11:55:26PM +0300, Andrey Zhadchenko wrote:
> Although QEMU virtio-blk is quite fast, there is still some room for
> improvements. Disk latency can be reduced if we handle virtio-blk requests
> in host kernel so we avoid a lot of syscalls and context switches.
> 
> The biggest disadvantage of this vhost-blk flavor is raw format.
> Luckily Kirill Thai proposed device mapper driver for QCOW2 format to attach
> files as block devices: https://www.spinics.net/lists/kernel/msg4292965.html
> 
> Also by using kernel modules we can bypass the iothread limitation and finally scale
> block requests with cpus for high-performance devices. This is planned to be
> implemented in next version.
> 
> Linux kernel module part:
> https://lore.kernel.org/kvm/20220725202753.298725-1-andrey.zhadche...@virtuozzo.com/
> 
> test setups and results:
> fio --direct=1 --rw=randread  --bs=4k  --ioengine=libaio --iodepth=128
> QEMU drive options: cache=none
> filesystem: xfs
> 
> SSD:
>                | randread, IOPS | randwrite, IOPS |
> Host           | 95.8k          | 85.3k           |
> QEMU virtio    | 57.5k          | 79.4k           |
> QEMU vhost-blk | 95.6k          | 84.3k           |
> 
> RAMDISK (vq == vcpu):
>                  | randread, IOPS | randwrite, IOPS |
> virtio, 1vcpu    | 123k           | 129k            |
> virtio, 2vcpu    | 253k (??)      | 250k (??)       |
> virtio, 4vcpu    | 158k           | 154k            |
> vhost-blk, 1vcpu | 110k           | 113k            |
> vhost-blk, 2vcpu | 247k           | 252k            |
> vhost-blk, 4vcpu | 576k           | 567k            |
> 
> Andrey Zhadchenko (1):
>   block: add vhost-blk backend
> 
>  configure |  13 ++
>  hw/block/Kconfig  |   5 +
>  hw/block/meson.build  |   1 +
>  hw/block/vhost-blk.c  | 395 ++
>  hw/virtio/meson.build |   1 +
>  hw/virtio/vhost-blk-pci.c | 102 +
>  include/hw/virtio/vhost-blk.h |  44 
>  linux-headers/linux/vhost.h   |   3 +
>  8 files changed, 564 insertions(+)
>  create mode 100644 hw/block/vhost-blk.c
>  create mode 100644 hw/virtio/vhost-blk-pci.c
>  create mode 100644 include/hw/virtio/vhost-blk.h

vhost-blk has been tried several times in the past. That doesn't mean it
cannot be merged this time, but past arguments should be addressed:

- What makes it necessary to move the code into the kernel? In the past
  the performance results were not very convincing. The fastest
  implementations actually tend to be userspace NVMe PCI drivers that
  bypass the kernel! Bypassing the VFS and submitting block requests
  directly was not a huge boost. The syscall/context switch argument
  sounds okay but the numbers didn't really show that kernel block I/O
  is much faster than userspace block I/O.

  I've asked for more details on the QEMU command-line to understand
  what your numbers show. Maybe something has changed since previous
  times when vhost-blk has been tried.

  The only argument I see is QEMU's current 1 IOThread per virtio-blk
  device limitation, which is currently being worked on. If that's the
  only reason for vhost-blk then is it worth doing all the work of
  getting vhost-blk shipped (kernel, QEMU, and libvirt changes)? It
  seems like a short-term solution.

- The security impact of bugs in kernel vhost-blk code is more serious
  than bugs in a QEMU userspace process.

- The management stack needs to be changed to use vhost-blk whereas
  QEMU can be optimized without affecting other layers.

Stefan




[PULL 14/20] accel/tcg: Inline tb_flush_jmp_cache

2022-10-04 Thread Richard Henderson
This function has two users, who use it incompatibly.
In tlb_flush_page_by_mmuidx_async_0, when flushing a
single page, we need to flush exactly two pages.
In tlb_flush_range_by_mmuidx_async_0, when flushing a
range of pages, we need to flush N+1 pages.

This avoids double-flushing of jmp cache pages in a range.

Reviewed-by: Alex Bennée 
Signed-off-by: Richard Henderson 
---
 accel/tcg/cputlb.c | 25 ++---
 1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index a0db2d32a8..c7909fb619 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -107,14 +107,6 @@ static void tb_jmp_cache_clear_page(CPUState *cpu, target_ulong page_addr)
 }
 }
 
-static void tb_flush_jmp_cache(CPUState *cpu, target_ulong addr)
-{
-/* Discard jump cache entries for any tb which might potentially
-   overlap the flushed page.  */
-tb_jmp_cache_clear_page(cpu, addr - TARGET_PAGE_SIZE);
-tb_jmp_cache_clear_page(cpu, addr);
-}
-
 /**
  * tlb_mmu_resize_locked() - perform TLB resize bookkeeping; resize if 
necessary
  * @desc: The CPUTLBDesc portion of the TLB
@@ -541,7 +533,12 @@ static void tlb_flush_page_by_mmuidx_async_0(CPUState *cpu,
 }
 qemu_spin_unlock(&env_tlb(env)->c.lock);
 
-tb_flush_jmp_cache(cpu, addr);
+/*
+ * Discard jump cache entries for any tb which might potentially
+ * overlap the flushed page, which includes the previous.
+ */
+tb_jmp_cache_clear_page(cpu, addr - TARGET_PAGE_SIZE);
+tb_jmp_cache_clear_page(cpu, addr);
 }
 
 /**
@@ -792,8 +789,14 @@ static void tlb_flush_range_by_mmuidx_async_0(CPUState *cpu,
 return;
 }
 
-for (target_ulong i = 0; i < d.len; i += TARGET_PAGE_SIZE) {
-tb_flush_jmp_cache(cpu, d.addr + i);
+/*
+ * Discard jump cache entries for any tb which might potentially
+ * overlap the flushed pages, which includes the previous.
+ */
+d.addr -= TARGET_PAGE_SIZE;
+for (target_ulong i = 0, n = d.len / TARGET_PAGE_SIZE + 1; i < n; i++) {
+tb_jmp_cache_clear_page(cpu, d.addr);
+d.addr += TARGET_PAGE_SIZE;
 }
 }
 
-- 
2.34.1




[PULL 08/20] accel/tcg: Introduce tlb_set_page_full

2022-10-04 Thread Richard Henderson
Now that we have collected all of the page data into
CPUTLBEntryFull, provide an interface to record that
all in one go, instead of using 4 arguments.  This interface
allows CPUTLBEntryFull to be extended without having to
change the number of arguments.

Reviewed-by: Alex Bennée 
Reviewed-by: Peter Maydell 
Reviewed-by: Philippe Mathieu-Daudé 
Signed-off-by: Richard Henderson 
---
 include/exec/cpu-defs.h | 14 +++
 include/exec/exec-all.h | 22 ++
 accel/tcg/cputlb.c  | 51 ++---
 3 files changed, 69 insertions(+), 18 deletions(-)

diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
index f70f54d850..5e12cc1854 100644
--- a/include/exec/cpu-defs.h
+++ b/include/exec/cpu-defs.h
@@ -148,7 +148,21 @@ typedef struct CPUTLBEntryFull {
  * + the offset within the target MemoryRegion (otherwise)
  */
 hwaddr xlat_section;
+
+/*
+ * @phys_addr contains the physical address in the address space
+ * given by cpu_asidx_from_attrs(cpu, @attrs).
+ */
+hwaddr phys_addr;
+
+/* @attrs contains the memory transaction attributes for the page. */
 MemTxAttrs attrs;
+
+/* @prot contains the complete protections for the page. */
+uint8_t prot;
+
+/* @lg_page_size contains the log2 of the page size. */
+uint8_t lg_page_size;
 } CPUTLBEntryFull;
 
 /*
diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index d255d69bc1..b1b920a713 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -257,6 +257,28 @@ void tlb_flush_range_by_mmuidx_all_cpus_synced(CPUState 
*cpu,
uint16_t idxmap,
unsigned bits);
 
+/**
+ * tlb_set_page_full:
+ * @cpu: CPU context
+ * @mmu_idx: mmu index of the tlb to modify
+ * @vaddr: virtual address of the entry to add
+ * @full: the details of the tlb entry
+ *
+ * Add an entry to @cpu tlb index @mmu_idx.  All of the fields of
+ * @full must be filled, except for xlat_section, and constitute
+ * the complete description of the translated page.
+ *
+ * This is generally called by the target tlb_fill function after
+ * having performed a successful page table walk to find the physical
+ * address and attributes for the translation.
+ *
+ * At most one entry for a given virtual address is permitted. Only a
+ * single TARGET_PAGE_SIZE region is mapped; @full->lg_page_size is only
+ * used by tlb_flush_page.
+ */
+void tlb_set_page_full(CPUState *cpu, int mmu_idx, target_ulong vaddr,
+   CPUTLBEntryFull *full);
+
 /**
  * tlb_set_page_with_attrs:
  * @cpu: CPU to add this TLB entry for
diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index e3ee4260bd..361078471b 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -1095,16 +1095,16 @@ static void tlb_add_large_page(CPUArchState *env, int mmu_idx,
 env_tlb(env)->d[mmu_idx].large_page_mask = lp_mask;
 }
 
-/* Add a new TLB entry. At most one entry for a given virtual address
+/*
+ * Add a new TLB entry. At most one entry for a given virtual address
  * is permitted. Only a single TARGET_PAGE_SIZE region is mapped, the
  * supplied size is only used by tlb_flush_page.
  *
  * Called from TCG-generated code, which is under an RCU read-side
  * critical section.
  */
-void tlb_set_page_with_attrs(CPUState *cpu, target_ulong vaddr,
- hwaddr paddr, MemTxAttrs attrs, int prot,
- int mmu_idx, target_ulong size)
+void tlb_set_page_full(CPUState *cpu, int mmu_idx,
+   target_ulong vaddr, CPUTLBEntryFull *full)
 {
 CPUArchState *env = cpu->env_ptr;
 CPUTLB *tlb = env_tlb(env);
@@ -1117,35 +1117,36 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong vaddr,
 CPUTLBEntry *te, tn;
 hwaddr iotlb, xlat, sz, paddr_page;
 target_ulong vaddr_page;
-int asidx = cpu_asidx_from_attrs(cpu, attrs);
-int wp_flags;
+int asidx, wp_flags, prot;
 bool is_ram, is_romd;
 
 assert_cpu_is_self(cpu);
 
-if (size <= TARGET_PAGE_SIZE) {
+if (full->lg_page_size <= TARGET_PAGE_BITS) {
 sz = TARGET_PAGE_SIZE;
 } else {
-tlb_add_large_page(env, mmu_idx, vaddr, size);
-sz = size;
+sz = (hwaddr)1 << full->lg_page_size;
+tlb_add_large_page(env, mmu_idx, vaddr, sz);
 }
 vaddr_page = vaddr & TARGET_PAGE_MASK;
-paddr_page = paddr & TARGET_PAGE_MASK;
+paddr_page = full->phys_addr & TARGET_PAGE_MASK;
 
+prot = full->prot;
+asidx = cpu_asidx_from_attrs(cpu, full->attrs);
 section = address_space_translate_for_iotlb(cpu, asidx, paddr_page,
-&xlat, &sz, attrs, &prot);
+&xlat, &sz, full->attrs, &prot);
 assert(sz >= TARGET_PAGE_SIZE);
 
 tlb_debug("vaddr=" TARGET_FMT_lx " paddr=0x" TARGET_FMT_plx
   " prot=%x idx=%d\n",
-  
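
A hedged sketch of the intended call pattern from a target's tlb_fill hook
(the walker helper and its result struct are hypothetical; only the
CPUTLBEntryFull fields and tlb_set_page_full() come from this series):

    /* Sketch: record a successful page table walk in one go. */
    static bool mycpu_tlb_fill(CPUState *cs, vaddr addr, int size,
                               MMUAccessType type, int mmu_idx,
                               bool probe, uintptr_t ra)
    {
        MyWalkResult w;                      /* hypothetical walker output */

        if (!mycpu_page_table_walk(cs, addr, type, mmu_idx, &w)) {
            return false;                    /* or raise the fault if !probe */
        }

        CPUTLBEntryFull full = {
            .phys_addr    = w.paddr,
            .attrs        = w.attrs,
            .prot         = w.prot,          /* complete protections */
            .lg_page_size = w.lg_page_size,  /* log2 of the page size */
        };                                   /* xlat_section is computed by
                                              * tlb_set_page_full itself */
        tlb_set_page_full(cs, mmu_idx, addr, &full);
        return true;
    }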

[PULL 15/20] include/hw/core: Create struct CPUJumpCache

2022-10-04 Thread Richard Henderson
Wrap the bare TranslationBlock pointer into a structure.

Reviewed-by: Alex Bennée 
Reviewed-by: Philippe Mathieu-Daudé 
Signed-off-by: Richard Henderson 
---
 accel/tcg/tb-hash.h   |  1 +
 accel/tcg/tb-jmp-cache.h  | 24 
 include/exec/cpu-common.h |  1 +
 include/hw/core/cpu.h | 15 +--
 include/qemu/typedefs.h   |  1 +
 accel/stubs/tcg-stub.c|  4 
 accel/tcg/cpu-exec.c  | 10 +++---
 accel/tcg/cputlb.c|  9 +
 accel/tcg/translate-all.c | 28 +---
 hw/core/cpu-common.c  |  3 +--
 plugins/core.c|  2 +-
 trace/control-target.c|  2 +-
 12 files changed, 72 insertions(+), 28 deletions(-)
 create mode 100644 accel/tcg/tb-jmp-cache.h

diff --git a/accel/tcg/tb-hash.h b/accel/tcg/tb-hash.h
index 0a273d9605..83dc610e4c 100644
--- a/accel/tcg/tb-hash.h
+++ b/accel/tcg/tb-hash.h
@@ -23,6 +23,7 @@
 #include "exec/cpu-defs.h"
 #include "exec/exec-all.h"
 #include "qemu/xxhash.h"
+#include "tb-jmp-cache.h"
 
 #ifdef CONFIG_SOFTMMU
 
diff --git a/accel/tcg/tb-jmp-cache.h b/accel/tcg/tb-jmp-cache.h
new file mode 100644
index 00..2d8fbb1bfe
--- /dev/null
+++ b/accel/tcg/tb-jmp-cache.h
@@ -0,0 +1,24 @@
+/*
+ * The per-CPU TranslationBlock jump cache.
+ *
+ *  Copyright (c) 2003 Fabrice Bellard
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#ifndef ACCEL_TCG_TB_JMP_CACHE_H
+#define ACCEL_TCG_TB_JMP_CACHE_H
+
+#define TB_JMP_CACHE_BITS 12
+#define TB_JMP_CACHE_SIZE (1 << TB_JMP_CACHE_BITS)
+
+/*
+ * Accessed in parallel; all accesses to 'tb' must be atomic.
+ */
+struct CPUJumpCache {
+struct {
+TranslationBlock *tb;
+} array[TB_JMP_CACHE_SIZE];
+};
+
+#endif /* ACCEL_TCG_TB_JMP_CACHE_H */
diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index d909429427..c493510ee9 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -38,6 +38,7 @@ void cpu_list_unlock(void);
 unsigned int cpu_list_generation_id_get(void);
 
 void tcg_flush_softmmu_tlb(CPUState *cs);
+void tcg_flush_jmp_cache(CPUState *cs);
 
 void tcg_iommu_init_notifier_list(CPUState *cpu);
 void tcg_iommu_free_notifier_list(CPUState *cpu);
diff --git a/include/hw/core/cpu.h b/include/hw/core/cpu.h
index 009dc0d336..18ca701b44 100644
--- a/include/hw/core/cpu.h
+++ b/include/hw/core/cpu.h
@@ -236,9 +236,6 @@ struct kvm_run;
 struct hax_vcpu_state;
 struct hvf_vcpu_state;
 
-#define TB_JMP_CACHE_BITS 12
-#define TB_JMP_CACHE_SIZE (1 << TB_JMP_CACHE_BITS)
-
 /* work queue */
 
 /* The union type allows passing of 64 bit target pointers on 32 bit
@@ -369,8 +366,7 @@ struct CPUState {
 CPUArchState *env_ptr;
 IcountDecr *icount_decr_ptr;
 
-/* Accessed in parallel; all accesses must be atomic */
-TranslationBlock *tb_jmp_cache[TB_JMP_CACHE_SIZE];
+CPUJumpCache *tb_jmp_cache;
 
 struct GDBRegisterState *gdb_regs;
 int gdb_num_regs;
@@ -456,15 +452,6 @@ extern CPUTailQ cpus;
 
 extern __thread CPUState *current_cpu;
 
-static inline void cpu_tb_jmp_cache_clear(CPUState *cpu)
-{
-unsigned int i;
-
-for (i = 0; i < TB_JMP_CACHE_SIZE; i++) {
-qatomic_set(&cpu->tb_jmp_cache[i], NULL);
-}
-}
-
 /**
  * qemu_tcg_mttcg_enabled:
  * Check whether we are running MultiThread TCG or not.
diff --git a/include/qemu/typedefs.h b/include/qemu/typedefs.h
index a4aee238c7..5f95169827 100644
--- a/include/qemu/typedefs.h
+++ b/include/qemu/typedefs.h
@@ -41,6 +41,7 @@ typedef struct CoMutex CoMutex;
 typedef struct ConfidentialGuestSupport ConfidentialGuestSupport;
 typedef struct CPUAddressSpace CPUAddressSpace;
 typedef struct CPUArchState CPUArchState;
+typedef struct CPUJumpCache CPUJumpCache;
 typedef struct CPUState CPUState;
 typedef struct CPUTLBEntryFull CPUTLBEntryFull;
 typedef struct DeviceListener DeviceListener;
diff --git a/accel/stubs/tcg-stub.c b/accel/stubs/tcg-stub.c
index 6ce8a34228..c1b05767c0 100644
--- a/accel/stubs/tcg-stub.c
+++ b/accel/stubs/tcg-stub.c
@@ -21,6 +21,10 @@ void tlb_set_dirty(CPUState *cpu, target_ulong vaddr)
 {
 }
 
+void tcg_flush_jmp_cache(CPUState *cpu)
+{
+}
+
 int probe_access_flags(CPUArchState *env, target_ulong addr,
MMUAccessType access_type, int mmu_idx,
bool nonfault, void **phost, uintptr_t retaddr)
diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
index dd58a144a8..2d7e610ee2 100644
--- a/accel/tcg/cpu-exec.c
+++ b/accel/tcg/cpu-exec.c
@@ -42,6 +42,7 @@
 #include "sysemu/replay.h"
 #include "sysemu/tcg.h"
 #include "exec/helper-proto.h"
+#include "tb-jmp-cache.h"
 #include "tb-hash.h"
 #include "tb-context.h"
 #include "internal.h"
@@ -252,7 +253,7 @@ static inline TranslationBlock *tb_lookup(CPUState *cpu, target_ulong pc,
 tcg_debug_assert(!(cflags & CF_INVALID));
 
 hash = tb_jmp_cache_hash_func(pc);
-tb = qatomic_rcu_read(&cpu->tb_jmp_cache[hash]);
+tb = qatomic_rcu_read(&cpu->tb_jmp_cache->array[hash].tb);
 
 if (likely(tb &&

Re: [RFC patch 0/1] block: vhost-blk backend

2022-10-04 Thread Stefan Hajnoczi
On Mon, Jul 25, 2022 at 11:55:26PM +0300, Andrey Zhadchenko wrote:
> Although QEMU virtio-blk is quite fast, there is still some room for
> improvements. Disk latency can be reduced if we handle virtio-blk requests
> in the host kernel so we avoid a lot of syscalls and context switches.
> 
> The biggest disadvantage of this vhost-blk flavor is raw format.
> Luckily Kirill Thai proposed device mapper driver for QCOW2 format to attach
> files as block devices: https://www.spinics.net/lists/kernel/msg4292965.html
> 
> Also by using kernel modules we can bypass the iothread limitation and finally scale
> block requests with cpus for high-performance devices. This is planned to be
> implemented in next version.

Hi Andrey,
Do you have a new version of this patch series that uses multiple
threads?

I have been playing with vq-IOThread mapping in QEMU and would like to
benchmark vhost-blk vs QEMU virtio-blk mq IOThreads:
https://gitlab.com/stefanha/qemu/-/tree/virtio-blk-mq-iothread-prototype

Thanks,
Stefan




[PULL 12/20] accel/tcg: Use DisasContextBase in plugin_gen_tb_start

2022-10-04 Thread Richard Henderson
Use the pc coming from db->pc_first rather than the TB.

Use the cached host_addr rather than re-computing for the
first page.  We still need a separate lookup for the second
page because it won't be computed for DisasContextBase until
the translator actually performs a read from the page.

Reviewed-by: Alex Bennée 
Signed-off-by: Richard Henderson 
---
 include/exec/plugin-gen.h |  7 ---
 accel/tcg/plugin-gen.c| 22 +++---
 accel/tcg/translator.c|  2 +-
 3 files changed, 16 insertions(+), 15 deletions(-)

diff --git a/include/exec/plugin-gen.h b/include/exec/plugin-gen.h
index f92f169739..5004728c61 100644
--- a/include/exec/plugin-gen.h
+++ b/include/exec/plugin-gen.h
@@ -19,7 +19,8 @@ struct DisasContextBase;
 
 #ifdef CONFIG_PLUGIN
 
-bool plugin_gen_tb_start(CPUState *cpu, const TranslationBlock *tb, bool supress);
+bool plugin_gen_tb_start(CPUState *cpu, const struct DisasContextBase *db,
+ bool supress);
 void plugin_gen_tb_end(CPUState *cpu);
 void plugin_gen_insn_start(CPUState *cpu, const struct DisasContextBase *db);
 void plugin_gen_insn_end(void);
@@ -48,8 +49,8 @@ static inline void plugin_insn_append(abi_ptr pc, const void *from, size_t size)
 
 #else /* !CONFIG_PLUGIN */
 
-static inline
-bool plugin_gen_tb_start(CPUState *cpu, const TranslationBlock *tb, bool supress)
+static inline bool
+plugin_gen_tb_start(CPUState *cpu, const struct DisasContextBase *db, bool sup)
 {
 return false;
 }
diff --git a/accel/tcg/plugin-gen.c b/accel/tcg/plugin-gen.c
index 3d0b101e34..80dff68934 100644
--- a/accel/tcg/plugin-gen.c
+++ b/accel/tcg/plugin-gen.c
@@ -852,7 +852,8 @@ static void plugin_gen_inject(const struct qemu_plugin_tb *plugin_tb)
 pr_ops();
 }
 
-bool plugin_gen_tb_start(CPUState *cpu, const TranslationBlock *tb, bool mem_only)
+bool plugin_gen_tb_start(CPUState *cpu, const DisasContextBase *db,
+ bool mem_only)
 {
 bool ret = false;
 
@@ -870,9 +871,9 @@ bool plugin_gen_tb_start(CPUState *cpu, const TranslationBlock *tb, bool mem_onl
 
 ret = true;
 
-ptb->vaddr = tb->pc;
+ptb->vaddr = db->pc_first;
 ptb->vaddr2 = -1;
-get_page_addr_code_hostp(cpu->env_ptr, tb->pc, &ptb->haddr1);
+ptb->haddr1 = db->host_addr[0];
 ptb->haddr2 = NULL;
 ptb->mem_only = mem_only;
 
@@ -898,16 +899,15 @@ void plugin_gen_insn_start(CPUState *cpu, const DisasContextBase *db)
  * Note that we skip this when haddr1 == NULL, e.g. when we're
  * fetching instructions from a region not backed by RAM.
  */
-if (likely(ptb->haddr1 != NULL && ptb->vaddr2 == -1) &&
-unlikely((db->pc_next & TARGET_PAGE_MASK) !=
- (db->pc_first & TARGET_PAGE_MASK))) {
-get_page_addr_code_hostp(cpu->env_ptr, db->pc_next,
- &ptb->haddr2);
-ptb->vaddr2 = db->pc_next;
-}
-if (likely(ptb->vaddr2 == -1)) {
+if (ptb->haddr1 == NULL) {
+pinsn->haddr = NULL;
+} else if (is_same_page(db, db->pc_next)) {
 pinsn->haddr = ptb->haddr1 + pinsn->vaddr - ptb->vaddr;
 } else {
+if (ptb->vaddr2 == -1) {
+ptb->vaddr2 = TARGET_PAGE_ALIGN(db->pc_first);
+get_page_addr_code_hostp(cpu->env_ptr, ptb->vaddr2, &ptb->haddr2);
+}
 pinsn->haddr = ptb->haddr2 + pinsn->vaddr - ptb->vaddr2;
 }
 }
diff --git a/accel/tcg/translator.c b/accel/tcg/translator.c
index ca8a5f2d83..8e78fd7a9c 100644
--- a/accel/tcg/translator.c
+++ b/accel/tcg/translator.c
@@ -75,7 +75,7 @@ void translator_loop(CPUState *cpu, TranslationBlock *tb, int 
max_insns,
 ops->tb_start(db, cpu);
 tcg_debug_assert(db->is_jmp == DISAS_NEXT);  /* no early exit */
 
-plugin_enabled = plugin_gen_tb_start(cpu, tb, cflags & CF_MEMI_ONLY);
+plugin_enabled = plugin_gen_tb_start(cpu, db, cflags & CF_MEMI_ONLY);
 
 while (true) {
 db->num_insns++;
-- 
2.34.1




[PULL 16/20] hw/core: Add CPUClass.get_pc

2022-10-04 Thread Richard Henderson
Populate this new method for all targets.  Always match
the result that would be given by cpu_get_tb_cpu_state,
as we will want these values to correspond in the logs.

Reviewed-by: Taylor Simpson 
Reviewed-by: Alex Bennée 
Reviewed-by: Mark Cave-Ayland  (target/sparc)
Signed-off-by: Richard Henderson 
---
Cc: Eduardo Habkost  (supporter:Machine core)
Cc: Marcel Apfelbaum  (supporter:Machine core)
Cc: "Philippe Mathieu-Daudé"  (reviewer:Machine core)
Cc: Yanan Wang  (reviewer:Machine core)
Cc: Michael Rolnik  (maintainer:AVR TCG CPUs)
Cc: "Edgar E. Iglesias"  (maintainer:CRIS TCG CPUs)
Cc: Taylor Simpson  (supporter:Hexagon TCG CPUs)
Cc: Song Gao  (maintainer:LoongArch TCG CPUs)
Cc: Xiaojuan Yang  (maintainer:LoongArch TCG CPUs)
Cc: Laurent Vivier  (maintainer:M68K TCG CPUs)
Cc: Jiaxun Yang  (reviewer:MIPS TCG CPUs)
Cc: Aleksandar Rikalo  (reviewer:MIPS TCG CPUs)
Cc: Chris Wulff  (maintainer:NiosII TCG CPUs)
Cc: Marek Vasut  (maintainer:NiosII TCG CPUs)
Cc: Stafford Horne  (odd fixer:OpenRISC TCG CPUs)
Cc: Yoshinori Sato  (reviewer:RENESAS RX CPUs)
Cc: Mark Cave-Ayland  (maintainer:SPARC TCG CPUs)
Cc: Bastian Koppelmann  (maintainer:TriCore TCG CPUs)
Cc: Max Filippov  (maintainer:Xtensa TCG CPUs)
Cc: qemu-...@nongnu.org (open list:ARM TCG CPUs)
Cc: qemu-...@nongnu.org (open list:PowerPC TCG CPUs)
Cc: qemu-ri...@nongnu.org (open list:RISC-V TCG CPUs)
Cc: qemu-s3...@nongnu.org (open list:S390 TCG CPUs)
---
 include/hw/core/cpu.h   |  3 +++
 target/alpha/cpu.c  |  9 +
 target/arm/cpu.c| 13 +
 target/avr/cpu.c|  8 
 target/cris/cpu.c   |  8 
 target/hexagon/cpu.c|  8 
 target/hppa/cpu.c   |  8 
 target/i386/cpu.c   |  9 +
 target/loongarch/cpu.c  |  9 +
 target/m68k/cpu.c   |  8 
 target/microblaze/cpu.c |  8 
 target/mips/cpu.c   |  8 
 target/nios2/cpu.c  |  9 +
 target/openrisc/cpu.c   |  8 
 target/ppc/cpu_init.c   |  8 
 target/riscv/cpu.c  | 13 +
 target/rx/cpu.c |  8 
 target/s390x/cpu.c  |  8 
 target/sh4/cpu.c|  8 
 target/sparc/cpu.c  |  8 
 target/tricore/cpu.c|  9 +
 target/xtensa/cpu.c |  8 
 22 files changed, 186 insertions(+)

diff --git a/include/hw/core/cpu.h b/include/hw/core/cpu.h
index 18ca701b44..f9b58773f7 100644
--- a/include/hw/core/cpu.h
+++ b/include/hw/core/cpu.h
@@ -115,6 +115,8 @@ struct SysemuCPUOps;
  *   If the target behaviour here is anything other than "set
  *   the PC register to the value passed in" then the target must
  *   also implement the synchronize_from_tb hook.
+ * @get_pc: Callback for getting the Program Counter register.
+ *   As above, with the semantics of the target architecture.
  * @gdb_read_register: Callback for letting GDB read a register.
  * @gdb_write_register: Callback for letting GDB write a register.
  * @gdb_adjust_breakpoint: Callback for adjusting the address of a
@@ -151,6 +153,7 @@ struct CPUClass {
 void (*dump_state)(CPUState *cpu, FILE *, int flags);
 int64_t (*get_arch_id)(CPUState *cpu);
 void (*set_pc)(CPUState *cpu, vaddr value);
+vaddr (*get_pc)(CPUState *cpu);
 int (*gdb_read_register)(CPUState *cpu, GByteArray *buf, int reg);
 int (*gdb_write_register)(CPUState *cpu, uint8_t *buf, int reg);
 vaddr (*gdb_adjust_breakpoint)(CPUState *cpu, vaddr addr);
diff --git a/target/alpha/cpu.c b/target/alpha/cpu.c
index a8990d401b..979a629d59 100644
--- a/target/alpha/cpu.c
+++ b/target/alpha/cpu.c
@@ -33,6 +33,14 @@ static void alpha_cpu_set_pc(CPUState *cs, vaddr value)
 cpu->env.pc = value;
 }
 
+static vaddr alpha_cpu_get_pc(CPUState *cs)
+{
+AlphaCPU *cpu = ALPHA_CPU(cs);
+
+return cpu->env.pc;
+}
+
+
 static bool alpha_cpu_has_work(CPUState *cs)
 {
 /* Here we are checking to see if the CPU should wake up from HALT.
@@ -244,6 +252,7 @@ static void alpha_cpu_class_init(ObjectClass *oc, void *data)
 cc->has_work = alpha_cpu_has_work;
 cc->dump_state = alpha_cpu_dump_state;
 cc->set_pc = alpha_cpu_set_pc;
+cc->get_pc = alpha_cpu_get_pc;
 cc->gdb_read_register = alpha_cpu_gdb_read_register;
 cc->gdb_write_register = alpha_cpu_gdb_write_register;
 #ifndef CONFIG_USER_ONLY
diff --git a/target/arm/cpu.c b/target/arm/cpu.c
index 7ec3281da9..fa67ba6647 100644
--- a/target/arm/cpu.c
+++ b/target/arm/cpu.c
@@ -60,6 +60,18 @@ static void arm_cpu_set_pc(CPUState *cs, vaddr value)
 }
 }
 
+static vaddr arm_cpu_get_pc(CPUState *cs)
+{
+ARMCPU *cpu = ARM_CPU(cs);
+CPUARMState *env = &cpu->env;
+
+if (is_a64(env)) {
+return env->pc;
+} else {
+return env->regs[15];
+}
+}
+
 #ifdef CONFIG_TCG
 void arm_cpu_synchronize_from_tb(CPUState *cs,
  const TranslationBlock *tb)
@@ -2172,6 +2184,7 @@ static void arm_cpu_class_init(ObjectClass *oc, void *data)
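
The hook is the same short pattern everywhere: read back exactly what set_pc
would have written.  A sketch for a hypothetical single-pc target (the real
per-target implementations are listed in the diffstat above):

    static vaddr mycpu_get_pc(CPUState *cs)
    {
        MyCPU *cpu = MY_CPU(cs);             /* hypothetical QOM cast */

        /* Must match what cpu_get_tb_cpu_state() reports as the pc. */
        return cpu->env.pc;
    }

    /* ... and in the class init: */
    cc->get_pc = mycpu_get_pc;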
 

[PULL 11/20] accel/tcg: Use bool for page_find_alloc

2022-10-04 Thread Richard Henderson
Bool is more appropriate type for the alloc parameter.

Reviewed-by: Alex Bennée 
Reviewed-by: Philippe Mathieu-Daudé 
Signed-off-by: Richard Henderson 
---
 accel/tcg/translate-all.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index 59432dc558..ca685f6ede 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -465,7 +465,7 @@ void page_init(void)
 #endif
 }
 
-static PageDesc *page_find_alloc(tb_page_addr_t index, int alloc)
+static PageDesc *page_find_alloc(tb_page_addr_t index, bool alloc)
 {
 PageDesc *pd;
 void **lp;
@@ -533,11 +533,11 @@ static PageDesc *page_find_alloc(tb_page_addr_t index, int alloc)
 
 static inline PageDesc *page_find(tb_page_addr_t index)
 {
-return page_find_alloc(index, 0);
+return page_find_alloc(index, false);
 }
 
 static void page_lock_pair(PageDesc **ret_p1, tb_page_addr_t phys1,
-   PageDesc **ret_p2, tb_page_addr_t phys2, int alloc);
+   PageDesc **ret_p2, tb_page_addr_t phys2, bool alloc);
 
 /* In user-mode page locks aren't used; mmap_lock is enough */
 #ifdef CONFIG_USER_ONLY
@@ -651,7 +651,7 @@ static inline void page_unlock(PageDesc *pd)
 /* lock the page(s) of a TB in the correct acquisition order */
 static inline void page_lock_tb(const TranslationBlock *tb)
 {
-page_lock_pair(NULL, tb->page_addr[0], NULL, tb->page_addr[1], 0);
+page_lock_pair(NULL, tb->page_addr[0], NULL, tb->page_addr[1], false);
 }
 
 static inline void page_unlock_tb(const TranslationBlock *tb)
@@ -840,7 +840,7 @@ void page_collection_unlock(struct page_collection *set)
 #endif /* !CONFIG_USER_ONLY */
 
 static void page_lock_pair(PageDesc **ret_p1, tb_page_addr_t phys1,
-   PageDesc **ret_p2, tb_page_addr_t phys2, int alloc)
+   PageDesc **ret_p2, tb_page_addr_t phys2, bool alloc)
 {
 PageDesc *p1, *p2;
 tb_page_addr_t page1;
@@ -1290,7 +1290,7 @@ tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
  * Note that inserting into the hash table first isn't an option, since
  * we can only insert TBs that are fully initialized.
  */
-page_lock_pair(, phys_pc, , phys_page2, 1);
+page_lock_pair(, phys_pc, , phys_page2, true);
 tb_page_add(p, tb, 0, phys_pc & TARGET_PAGE_MASK);
 if (p2) {
 tb_page_add(p2, tb, 1, phys_page2);
@@ -2219,7 +2219,7 @@ void page_set_flags(target_ulong start, target_ulong end, int flags)
 for (addr = start, len = end - start;
  len != 0;
  len -= TARGET_PAGE_SIZE, addr += TARGET_PAGE_SIZE) {
-PageDesc *p = page_find_alloc(addr >> TARGET_PAGE_BITS, 1);
+PageDesc *p = page_find_alloc(addr >> TARGET_PAGE_BITS, true);
 
 /* If the write protection bit is set, then we invalidate
the code inside.  */
-- 
2.34.1




[PULL 07/20] accel/tcg: Introduce probe_access_full

2022-10-04 Thread Richard Henderson
Add an interface to return the CPUTLBEntryFull struct
that goes with the lookup.  The result is not intended
to be valid across multiple lookups, so the user must
use the results immediately.

Reviewed-by: Alex Bennée 
Reviewed-by: Peter Maydell 
Reviewed-by: Philippe Mathieu-Daudé 
Signed-off-by: Richard Henderson 
---
 include/exec/exec-all.h | 15 +
 include/qemu/typedefs.h |  1 +
 accel/tcg/cputlb.c  | 47 +
 3 files changed, 45 insertions(+), 18 deletions(-)

diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index bcad607c4e..d255d69bc1 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -434,6 +434,21 @@ int probe_access_flags(CPUArchState *env, target_ulong addr,
MMUAccessType access_type, int mmu_idx,
bool nonfault, void **phost, uintptr_t retaddr);
 
+#ifndef CONFIG_USER_ONLY
+/**
+ * probe_access_full:
+ * Like probe_access_flags, except also return into @pfull.
+ *
+ * The CPUTLBEntryFull structure returned via @pfull is transient
+ * and must be consumed or copied immediately, before any further
+ * access or changes to TLB @mmu_idx.
+ */
+int probe_access_full(CPUArchState *env, target_ulong addr,
+  MMUAccessType access_type, int mmu_idx,
+  bool nonfault, void **phost,
+  CPUTLBEntryFull **pfull, uintptr_t retaddr);
+#endif
+
 #define CODE_GEN_ALIGN   16 /* must be >= of the size of a icache line */
 
 /* Estimated block size for TB allocation.  */
diff --git a/include/qemu/typedefs.h b/include/qemu/typedefs.h
index 42f4ceb701..a4aee238c7 100644
--- a/include/qemu/typedefs.h
+++ b/include/qemu/typedefs.h
@@ -42,6 +42,7 @@ typedef struct ConfidentialGuestSupport ConfidentialGuestSupport;
 typedef struct CPUAddressSpace CPUAddressSpace;
 typedef struct CPUArchState CPUArchState;
 typedef struct CPUState CPUState;
+typedef struct CPUTLBEntryFull CPUTLBEntryFull;
 typedef struct DeviceListener DeviceListener;
 typedef struct DeviceState DeviceState;
 typedef struct DirtyBitmapSnapshot DirtyBitmapSnapshot;
diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index 264f84a248..e3ee4260bd 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -1510,7 +1510,8 @@ static void notdirty_write(CPUState *cpu, vaddr mem_vaddr, unsigned size,
 static int probe_access_internal(CPUArchState *env, target_ulong addr,
  int fault_size, MMUAccessType access_type,
  int mmu_idx, bool nonfault,
- void **phost, uintptr_t retaddr)
+ void **phost, CPUTLBEntryFull **pfull,
+ uintptr_t retaddr)
 {
 uintptr_t index = tlb_index(env, mmu_idx, addr);
 CPUTLBEntry *entry = tlb_entry(env, mmu_idx, addr);
@@ -1543,10 +1544,12 @@ static int probe_access_internal(CPUArchState *env, target_ulong addr,
mmu_idx, nonfault, retaddr)) {
 /* Non-faulting page table read failed.  */
 *phost = NULL;
+*pfull = NULL;
 return TLB_INVALID_MASK;
 }
 
 /* TLB resize via tlb_fill may have moved the entry.  */
+index = tlb_index(env, mmu_idx, addr);
 entry = tlb_entry(env, mmu_idx, addr);
 
 /*
@@ -1560,6 +1563,8 @@ static int probe_access_internal(CPUArchState *env, target_ulong addr,
 }
 flags &= tlb_addr;
 
+*pfull = &env_tlb(env)->d[mmu_idx].fulltlb[index];
+
 /* Fold all "mmio-like" bits into TLB_MMIO.  This is not RAM.  */
 if (unlikely(flags & ~(TLB_WATCHPOINT | TLB_NOTDIRTY))) {
 *phost = NULL;
@@ -1571,37 +1576,44 @@ static int probe_access_internal(CPUArchState *env, target_ulong addr,
 return flags;
 }
 
-int probe_access_flags(CPUArchState *env, target_ulong addr,
-   MMUAccessType access_type, int mmu_idx,
-   bool nonfault, void **phost, uintptr_t retaddr)
+int probe_access_full(CPUArchState *env, target_ulong addr,
+  MMUAccessType access_type, int mmu_idx,
+  bool nonfault, void **phost, CPUTLBEntryFull **pfull,
+  uintptr_t retaddr)
 {
-int flags;
-
-flags = probe_access_internal(env, addr, 0, access_type, mmu_idx,
-  nonfault, phost, retaddr);
+int flags = probe_access_internal(env, addr, 0, access_type, mmu_idx,
+  nonfault, phost, pfull, retaddr);
 
 /* Handle clean RAM pages.  */
 if (unlikely(flags & TLB_NOTDIRTY)) {
-uintptr_t index = tlb_index(env, mmu_idx, addr);
-CPUTLBEntryFull *full = &env_tlb(env)->d[mmu_idx].fulltlb[index];
-
-notdirty_write(env_cpu(env), addr, 1, full, retaddr);
+notdirty_write(env_cpu(env), addr, 1, *pfull, retaddr);
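
A hedged usage sketch (condensed; the attribute read here is only an example
of consuming the transient entry before any further TLB activity):

    /* Sketch: non-faulting probe that copies what it needs immediately. */
    static bool addr_is_secure(CPUArchState *env, target_ulong addr,
                               int mmu_idx, uintptr_t ra)
    {
        void *host;
        CPUTLBEntryFull *full;
        int flags = probe_access_full(env, addr, MMU_DATA_LOAD, mmu_idx,
                                      true /* nonfault */, &host, &full, ra);

        if (flags & TLB_INVALID_MASK) {
            return false;
        }
        /* *full is transient: a later tlb_fill may resize or move the
         * array, so read the attributes out before doing anything else. */
        return full->attrs.secure;
    }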
  

[PULL 05/20] accel/tcg: Drop addr member from SavedIOTLB

2022-10-04 Thread Richard Henderson
This field is only written, not read; remove it.

Reviewed-by: Alex Bennée 
Reviewed-by: Peter Maydell 
Reviewed-by: Philippe Mathieu-Daudé 
Signed-off-by: Richard Henderson 
---
 include/hw/core/cpu.h | 1 -
 accel/tcg/cputlb.c| 7 +++
 2 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/include/hw/core/cpu.h b/include/hw/core/cpu.h
index 1a7e1a9380..009dc0d336 100644
--- a/include/hw/core/cpu.h
+++ b/include/hw/core/cpu.h
@@ -225,7 +225,6 @@ struct CPUWatchpoint {
  * the memory regions get moved around  by io_writex.
  */
 typedef struct SavedIOTLB {
-hwaddr addr;
 MemoryRegionSection *section;
 hwaddr mr_offset;
 } SavedIOTLB;
diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index aa22f578cb..d06ff44ce9 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -1372,12 +1372,11 @@ static uint64_t io_readx(CPUArchState *env, CPUTLBEntryFull *full,
  * This is read by tlb_plugin_lookup if the fulltlb entry doesn't match
  * because of the side effect of io_writex changing memory layout.
  */
-static void save_iotlb_data(CPUState *cs, hwaddr addr,
-MemoryRegionSection *section, hwaddr mr_offset)
+static void save_iotlb_data(CPUState *cs, MemoryRegionSection *section,
+hwaddr mr_offset)
 {
 #ifdef CONFIG_PLUGIN
 SavedIOTLB *saved = &cs->saved_iotlb;
-saved->addr = addr;
 saved->section = section;
 saved->mr_offset = mr_offset;
 #endif
@@ -1406,7 +1405,7 @@ static void io_writex(CPUArchState *env, CPUTLBEntryFull *full,
  * The memory_region_dispatch may trigger a flush/resize
  * so for plugins we save the iotlb_data just in case.
  */
-save_iotlb_data(cpu, full->xlat_section, section, mr_offset);
+save_iotlb_data(cpu, section, mr_offset);
 
 if (!qemu_mutex_iothread_locked()) {
 qemu_mutex_lock_iothread();
-- 
2.34.1




[PULL 10/20] accel/tcg: Remove PageDesc code_bitmap

2022-10-04 Thread Richard Henderson
This bitmap is created and discarded immediately.
We gain nothing by its existence.

Reviewed-by: Alex Bennée 
Signed-off-by: Richard Henderson 
Message-Id: <20220822232338.1727934-2-richard.hender...@linaro.org>
---
 accel/tcg/translate-all.c | 78 ++-
 1 file changed, 4 insertions(+), 74 deletions(-)

diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index d71d04d338..59432dc558 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -102,21 +102,14 @@
 #define assert_memory_lock() tcg_debug_assert(have_mmap_lock())
 #endif
 
-#define SMC_BITMAP_USE_THRESHOLD 10
-
 typedef struct PageDesc {
 /* list of TBs intersecting this ram page */
 uintptr_t first_tb;
-#ifdef CONFIG_SOFTMMU
-/* in order to optimize self modifying code, we count the number
-   of lookups we do to a given page to use a bitmap */
-unsigned long *code_bitmap;
-unsigned int code_write_count;
-#else
+#ifdef CONFIG_USER_ONLY
 unsigned long flags;
 void *target_data;
 #endif
-#ifndef CONFIG_USER_ONLY
+#ifdef CONFIG_SOFTMMU
 QemuSpin lock;
 #endif
 } PageDesc;
@@ -907,17 +900,6 @@ void tb_htable_init(void)
 qht_init(&tb_ctx.htable, tb_cmp, CODE_GEN_HTABLE_SIZE, mode);
 }
 
-/* call with @p->lock held */
-static inline void invalidate_page_bitmap(PageDesc *p)
-{
-assert_page_locked(p);
-#ifdef CONFIG_SOFTMMU
-g_free(p->code_bitmap);
-p->code_bitmap = NULL;
-p->code_write_count = 0;
-#endif
-}
-
 /* Set to NULL all the 'first_tb' fields in all PageDescs. */
 static void page_flush_tb_1(int level, void **lp)
 {
@@ -932,7 +914,6 @@ static void page_flush_tb_1(int level, void **lp)
 for (i = 0; i < V_L2_SIZE; ++i) {
 page_lock(&pd[i]);
 pd[i].first_tb = (uintptr_t)NULL;
-invalidate_page_bitmap(pd + i);
 page_unlock(&pd[i]);
 }
 } else {
@@ -1197,11 +1178,9 @@ static void do_tb_phys_invalidate(TranslationBlock *tb, 
bool rm_from_page_list)
 if (rm_from_page_list) {
 p = page_find(tb->page_addr[0] >> TARGET_PAGE_BITS);
 tb_page_remove(p, tb);
-invalidate_page_bitmap(p);
 if (tb->page_addr[1] != -1) {
 p = page_find(tb->page_addr[1] >> TARGET_PAGE_BITS);
 tb_page_remove(p, tb);
-invalidate_page_bitmap(p);
 }
 }
 
@@ -1246,35 +1225,6 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
 }
 }
 
-#ifdef CONFIG_SOFTMMU
-/* call with @p->lock held */
-static void build_page_bitmap(PageDesc *p)
-{
-int n, tb_start, tb_end;
-TranslationBlock *tb;
-
-assert_page_locked(p);
-p->code_bitmap = bitmap_new(TARGET_PAGE_SIZE);
-
-PAGE_FOR_EACH_TB(p, tb, n) {
-/* NOTE: this is subtle as a TB may span two physical pages */
-if (n == 0) {
-/* NOTE: tb_end may be after the end of the page, but
-   it is not a problem */
-tb_start = tb->pc & ~TARGET_PAGE_MASK;
-tb_end = tb_start + tb->size;
-if (tb_end > TARGET_PAGE_SIZE) {
-tb_end = TARGET_PAGE_SIZE;
- }
-} else {
-tb_start = 0;
-tb_end = ((tb->pc + tb->size) & ~TARGET_PAGE_MASK);
-}
-bitmap_set(p->code_bitmap, tb_start, tb_end - tb_start);
-}
-}
-#endif
-
 /* add the tb in the target page and protect it if necessary
  *
  * Called with mmap_lock held for user-mode emulation.
@@ -1295,7 +1245,6 @@ static inline void tb_page_add(PageDesc *p, TranslationBlock *tb,
 page_already_protected = p->first_tb != (uintptr_t)NULL;
 #endif
 p->first_tb = (uintptr_t)tb | n;
-invalidate_page_bitmap(p);
 
 #if defined(CONFIG_USER_ONLY)
 /* translator_loop() must have made all TB pages non-writable */
@@ -1357,10 +1306,8 @@ tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
 /* remove TB from the page(s) if we couldn't insert it */
 if (unlikely(existing_tb)) {
 tb_page_remove(p, tb);
-invalidate_page_bitmap(p);
 if (p2) {
 tb_page_remove(p2, tb);
-invalidate_page_bitmap(p2);
 }
 tb = existing_tb;
 }
@@ -1731,7 +1678,6 @@ tb_invalidate_phys_page_range__locked(struct page_collection *pages,
 #if !defined(CONFIG_USER_ONLY)
 /* if no code remaining, no need to continue to use slow writes */
 if (!p->first_tb) {
-invalidate_page_bitmap(p);
 tlb_unprotect_code(start);
 }
 #endif
@@ -1827,24 +1773,8 @@ void tb_invalidate_phys_page_fast(struct page_collection *pages,
 }
 
 assert_page_locked(p);
-if (!p->code_bitmap &&
-++p->code_write_count >= SMC_BITMAP_USE_THRESHOLD) {
-build_page_bitmap(p);
-}
-if (p->code_bitmap) {
-unsigned int nr;
-unsigned long b;
-
-nr = start & ~TARGET_PAGE_MASK;
-b = p->code_bitmap[BIT_WORD(nr)] >> (nr & (BITS_PER_LONG - 1));
-if (b & ((1 << len) 

[PULL 04/20] accel/tcg: Rename CPUIOTLBEntry to CPUTLBEntryFull

2022-10-04 Thread Richard Henderson
This structure will shortly contain more than just
data for accessing MMIO.  Rename the 'addr' member
to 'xlat_section' to more clearly indicate its purpose.

Reviewed-by: Alex Bennée 
Reviewed-by: Peter Maydell 
Reviewed-by: Philippe Mathieu-Daudé 
Signed-off-by: Richard Henderson 
---
 include/exec/cpu-defs.h|  22 
 accel/tcg/cputlb.c | 102 +++--
 target/arm/mte_helper.c|  14 ++---
 target/arm/sve_helper.c|   4 +-
 target/arm/translate-a64.c |   2 +-
 5 files changed, 73 insertions(+), 71 deletions(-)

diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
index ba3cd32a1e..f70f54d850 100644
--- a/include/exec/cpu-defs.h
+++ b/include/exec/cpu-defs.h
@@ -108,6 +108,7 @@ typedef uint64_t target_ulong;
 #  endif
 # endif
 
+/* Minimalized TLB entry for use by TCG fast path. */
 typedef struct CPUTLBEntry {
 /* bit TARGET_LONG_BITS to TARGET_PAGE_BITS : virtual address
bit TARGET_PAGE_BITS-1..4  : Nonzero for accesses that should not
@@ -131,14 +132,14 @@ typedef struct CPUTLBEntry {
 
 QEMU_BUILD_BUG_ON(sizeof(CPUTLBEntry) != (1 << CPU_TLB_ENTRY_BITS));
 
-/* The IOTLB is not accessed directly inline by generated TCG code,
- * so the CPUIOTLBEntry layout is not as critical as that of the
- * CPUTLBEntry. (This is also why we don't want to combine the two
- * structs into one.)
+/*
+ * The full TLB entry, which is not accessed by generated TCG code,
+ * so the layout is not as critical as that of CPUTLBEntry. This is
+ * also why we don't want to combine the two structs.
  */
-typedef struct CPUIOTLBEntry {
+typedef struct CPUTLBEntryFull {
 /*
- * @addr contains:
+ * @xlat_section contains:
  *  - in the lower TARGET_PAGE_BITS, a physical section number
  *  - with the lower TARGET_PAGE_BITS masked off, an offset which
  *must be added to the virtual address to obtain:
@@ -146,9 +147,9 @@ typedef struct CPUIOTLBEntry {
  *   number is PHYS_SECTION_NOTDIRTY or PHYS_SECTION_ROM)
  * + the offset within the target MemoryRegion (otherwise)
  */
-hwaddr addr;
+hwaddr xlat_section;
 MemTxAttrs attrs;
-} CPUIOTLBEntry;
+} CPUTLBEntryFull;
 
 /*
  * Data elements that are per MMU mode, minus the bits accessed by
@@ -172,9 +173,8 @@ typedef struct CPUTLBDesc {
 size_t vindex;
 /* The tlb victim table, in two parts.  */
 CPUTLBEntry vtable[CPU_VTLB_SIZE];
-CPUIOTLBEntry viotlb[CPU_VTLB_SIZE];
-/* The iotlb.  */
-CPUIOTLBEntry *iotlb;
+CPUTLBEntryFull vfulltlb[CPU_VTLB_SIZE];
+CPUTLBEntryFull *fulltlb;
 } CPUTLBDesc;
 
 /*
diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index 193bfc1cfc..aa22f578cb 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -200,13 +200,13 @@ static void tlb_mmu_resize_locked(CPUTLBDesc *desc, CPUTLBDescFast *fast,
 }
 
 g_free(fast->table);
-g_free(desc->iotlb);
+g_free(desc->fulltlb);
 
 tlb_window_reset(desc, now, 0);
 /* desc->n_used_entries is cleared by the caller */
 fast->mask = (new_size - 1) << CPU_TLB_ENTRY_BITS;
 fast->table = g_try_new(CPUTLBEntry, new_size);
-desc->iotlb = g_try_new(CPUIOTLBEntry, new_size);
+desc->fulltlb = g_try_new(CPUTLBEntryFull, new_size);
 
 /*
  * If the allocations fail, try smaller sizes. We just freed some
@@ -215,7 +215,7 @@ static void tlb_mmu_resize_locked(CPUTLBDesc *desc, CPUTLBDescFast *fast,
  * allocations to fail though, so we progressively reduce the allocation
  * size, aborting if we cannot even allocate the smallest TLB we support.
  */
-while (fast->table == NULL || desc->iotlb == NULL) {
+while (fast->table == NULL || desc->fulltlb == NULL) {
 if (new_size == (1 << CPU_TLB_DYN_MIN_BITS)) {
 error_report("%s: %s", __func__, strerror(errno));
 abort();
@@ -224,9 +224,9 @@ static void tlb_mmu_resize_locked(CPUTLBDesc *desc, CPUTLBDescFast *fast,
 fast->mask = (new_size - 1) << CPU_TLB_ENTRY_BITS;
 
 g_free(fast->table);
-g_free(desc->iotlb);
+g_free(desc->fulltlb);
 fast->table = g_try_new(CPUTLBEntry, new_size);
-desc->iotlb = g_try_new(CPUIOTLBEntry, new_size);
+desc->fulltlb = g_try_new(CPUTLBEntryFull, new_size);
 }
 }
 
@@ -258,7 +258,7 @@ static void tlb_mmu_init(CPUTLBDesc *desc, CPUTLBDescFast *fast, int64_t now)
 desc->n_used_entries = 0;
 fast->mask = (n_entries - 1) << CPU_TLB_ENTRY_BITS;
 fast->table = g_new(CPUTLBEntry, n_entries);
-desc->iotlb = g_new(CPUIOTLBEntry, n_entries);
+desc->fulltlb = g_new(CPUTLBEntryFull, n_entries);
 tlb_mmu_flush_locked(desc, fast);
 }
 
@@ -299,7 +299,7 @@ void tlb_destroy(CPUState *cpu)
 CPUTLBDescFast *fast = &env_tlb(env)->f[i];
 
 g_free(fast->table);
-g_free(desc->iotlb);
+g_free(desc->fulltlb);
 }
 }
 
@@ -1219,7 +1219,7 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong vaddr,
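
For readers following the rename: only the struct and member names change,
the encoding of xlat_section is as before.  A small decode sketch matching
the comment in the struct (illustrative; the real consumers are io_readx(),
io_writex() and iotlb_to_section()):

    /* Sketch: split xlat_section into its two components. */
    static void decode_xlat(hwaddr xlat_section, target_ulong vaddr,
                            unsigned *section_index, hwaddr *mr_offset)
    {
        /* The lower TARGET_PAGE_BITS hold the physical section number. */
        *section_index = xlat_section & ~TARGET_PAGE_MASK;
        /* The rest, plus the virtual address: a ram_addr_t for the
         * NOTDIRTY/ROM sections, a MemoryRegion offset otherwise. */
        *mr_offset = (xlat_section & TARGET_PAGE_MASK) + vaddr;
    }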
 

[PULL 02/20] hw/core/cpu-sysemu: used cached class in cpu_asidx_from_attrs

2022-10-04 Thread Richard Henderson
From: Alex Bennée 

This is a heavily used function so lets avoid the cost of
CPU_GET_CLASS. On the romulus-bmc run it has a modest effect:

  Before: 36.812 s ±  0.506 s
  After:  35.912 s ±  0.168 s

Signed-off-by: Alex Bennée 
Reviewed-by: Richard Henderson 
Message-Id: <20220811151413.3350684-4-alex.ben...@linaro.org>
Signed-off-by: Cédric Le Goater 
Message-Id: <20220923084803.498337-4-...@kaod.org>
Signed-off-by: Richard Henderson 
---
 hw/core/cpu-sysemu.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/hw/core/cpu-sysemu.c b/hw/core/cpu-sysemu.c
index 00253f8929..5eaf2e79e6 100644
--- a/hw/core/cpu-sysemu.c
+++ b/hw/core/cpu-sysemu.c
@@ -69,11 +69,10 @@ hwaddr cpu_get_phys_page_debug(CPUState *cpu, vaddr addr)
 
 int cpu_asidx_from_attrs(CPUState *cpu, MemTxAttrs attrs)
 {
-CPUClass *cc = CPU_GET_CLASS(cpu);
 int ret = 0;
 
-if (cc->sysemu_ops->asidx_from_attrs) {
-ret = cc->sysemu_ops->asidx_from_attrs(cpu, attrs);
+if (cpu->cc->sysemu_ops->asidx_from_attrs) {
+ret = cpu->cc->sysemu_ops->asidx_from_attrs(cpu, attrs);
 assert(ret < cpu->num_ases && ret >= 0);
 }
 return ret;
-- 
2.34.1




[PULL 09/20] include/exec: Introduce TARGET_PAGE_ENTRY_EXTRA

2022-10-04 Thread Richard Henderson
Allow the target to cache items from the guest page tables.

Reviewed-by: Alex Bennée 
Reviewed-by: Peter Maydell 
Reviewed-by: Philippe Mathieu-Daudé 
Signed-off-by: Richard Henderson 
---
 include/exec/cpu-defs.h | 9 +
 1 file changed, 9 insertions(+)

diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
index 5e12cc1854..67239b4e5e 100644
--- a/include/exec/cpu-defs.h
+++ b/include/exec/cpu-defs.h
@@ -163,6 +163,15 @@ typedef struct CPUTLBEntryFull {
 
 /* @lg_page_size contains the log2 of the page size. */
 uint8_t lg_page_size;
+
+/*
+ * Allow target-specific additions to this structure.
+ * This may be used to cache items from the guest cpu
+ * page tables for later use by the implementation.
+ */
+#ifdef TARGET_PAGE_ENTRY_EXTRA
+TARGET_PAGE_ENTRY_EXTRA
+#endif
 } CPUTLBEntryFull;
 
 /*
-- 
2.34.1
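
A hedged example of what a target might put in its cpu-param.h (target/arm
adopts this mechanism for the MTE guarded-page bit; the exact field names
below are illustrative):

    /* Extra per-page data filled in by tlb_fill during the page table
     * walk and read back later via probe_access_full, avoiding a
     * second walk. */
    #define TARGET_PAGE_ENTRY_EXTRA  \
        bool guarded;                \
        uint8_t pte_attrs;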




[PULL 03/20] cputlb: used cached CPUClass in our hot-paths

2022-10-04 Thread Richard Henderson
From: Alex Bennée 

Before: 35.912 s ±  0.168 s
  After: 35.565 s ±  0.087 s

Signed-off-by: Alex Bennée 
Reviewed-by: Richard Henderson 
Message-Id: <20220811151413.3350684-5-alex.ben...@linaro.org>
Signed-off-by: Cédric Le Goater 
Message-Id: <20220923084803.498337-5-...@kaod.org>
Signed-off-by: Richard Henderson 
---
 accel/tcg/cputlb.c | 15 ++-
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index 8fad2d9b83..193bfc1cfc 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -1291,15 +1291,14 @@ void tlb_set_page(CPUState *cpu, target_ulong vaddr,
 static void tlb_fill(CPUState *cpu, target_ulong addr, int size,
  MMUAccessType access_type, int mmu_idx, uintptr_t retaddr)
 {
-CPUClass *cc = CPU_GET_CLASS(cpu);
 bool ok;
 
 /*
  * This is not a probe, so only valid return is success; failure
  * should result in exception + longjmp to the cpu loop.
  */
-ok = cc->tcg_ops->tlb_fill(cpu, addr, size,
-   access_type, mmu_idx, false, retaddr);
+ok = cpu->cc->tcg_ops->tlb_fill(cpu, addr, size,
+access_type, mmu_idx, false, retaddr);
 assert(ok);
 }
 
@@ -1307,9 +1306,8 @@ static inline void cpu_unaligned_access(CPUState *cpu, vaddr addr,
 MMUAccessType access_type,
 int mmu_idx, uintptr_t retaddr)
 {
-CPUClass *cc = CPU_GET_CLASS(cpu);
-
-cc->tcg_ops->do_unaligned_access(cpu, addr, access_type, mmu_idx, retaddr);
+cpu->cc->tcg_ops->do_unaligned_access(cpu, addr, access_type,
+  mmu_idx, retaddr);
 }
 
 static inline void cpu_transaction_failed(CPUState *cpu, hwaddr physaddr,
@@ -1539,10 +1537,9 @@ static int probe_access_internal(CPUArchState *env, target_ulong addr,
 if (!tlb_hit_page(tlb_addr, page_addr)) {
 if (!victim_tlb_hit(env, mmu_idx, index, elt_ofs, page_addr)) {
 CPUState *cs = env_cpu(env);
-CPUClass *cc = CPU_GET_CLASS(cs);
 
-if (!cc->tcg_ops->tlb_fill(cs, addr, fault_size, access_type,
-   mmu_idx, nonfault, retaddr)) {
+if (!cs->cc->tcg_ops->tlb_fill(cs, addr, fault_size, access_type,
+   mmu_idx, nonfault, retaddr)) {
 /* Non-faulting page table read failed.  */
 *phost = NULL;
 return TLB_INVALID_MASK;
-- 
2.34.1




[PULL 06/20] accel/tcg: Suppress auto-invalidate in probe_access_internal

2022-10-04 Thread Richard Henderson
When PAGE_WRITE_INV is set when calling tlb_set_page,
we immediately set TLB_INVALID_MASK in order to force
tlb_fill to be called on the next lookup.  Here in
probe_access_internal, we have just called tlb_fill
and eliminated true misses, thus the lookup must be valid.

This allows us to remove a warning comment from s390x.
There doesn't seem to be a reason to change the code though.

Reviewed-by: Alex Bennée 
Reviewed-by: David Hildenbrand 
Reviewed-by: Peter Maydell 
Signed-off-by: Richard Henderson 
---
 accel/tcg/cputlb.c| 10 +-
 target/s390x/tcg/mem_helper.c |  4 
 2 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index d06ff44ce9..264f84a248 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -1533,6 +1533,7 @@ static int probe_access_internal(CPUArchState *env, target_ulong addr,
 }
 tlb_addr = tlb_read_ofs(entry, elt_ofs);
 
+flags = TLB_FLAGS_MASK;
 page_addr = addr & TARGET_PAGE_MASK;
 if (!tlb_hit_page(tlb_addr, page_addr)) {
 if (!victim_tlb_hit(env, mmu_idx, index, elt_ofs, page_addr)) {
@@ -1547,10 +1548,17 @@ static int probe_access_internal(CPUArchState *env, target_ulong addr,
 
 /* TLB resize via tlb_fill may have moved the entry.  */
 entry = tlb_entry(env, mmu_idx, addr);
+
+/*
+ * With PAGE_WRITE_INV, we set TLB_INVALID_MASK immediately,
+ * to force the next access through tlb_fill.  We've just
+ * called tlb_fill, so we know that this entry *is* valid.
+ */
+flags &= ~TLB_INVALID_MASK;
 }
 tlb_addr = tlb_read_ofs(entry, elt_ofs);
 }
-flags = tlb_addr & TLB_FLAGS_MASK;
+flags &= tlb_addr;
 
 /* Fold all "mmio-like" bits into TLB_MMIO.  This is not RAM.  */
 if (unlikely(flags & ~(TLB_WATCHPOINT | TLB_NOTDIRTY))) {
diff --git a/target/s390x/tcg/mem_helper.c b/target/s390x/tcg/mem_helper.c
index fc52aa128b..3758b9e688 100644
--- a/target/s390x/tcg/mem_helper.c
+++ b/target/s390x/tcg/mem_helper.c
@@ -148,10 +148,6 @@ static int s390_probe_access(CPUArchState *env, target_ulong addr, int size,
 #else
 int flags;
 
-/*
- * For !CONFIG_USER_ONLY, we cannot rely on TLB_INVALID_MASK or haddr==NULL
- * to detect if there was an exception during tlb_fill().
- */
 env->tlb_fill_exc = 0;
 flags = probe_access_flags(env, addr, access_type, mmu_idx, nonfault, phost, ra);
-- 
2.34.1




[PULL 00/20] tcg patch queue

2022-10-04 Thread Richard Henderson
TCG patch queue, plus one target/sh4 patch that
Yoshinori Sato asked me to process.


r~


The following changes since commit efbf38d73e5dcc4d5f8b98c6e7a12be1f3b91745:

  Merge tag 'for-upstream' of git://repo.or.cz/qemu/kevin into staging 
(2022-10-03 15:06:07 -0400)

are available in the Git repository at:

  https://gitlab.com/rth7680/qemu.git tags/pull-tcg-20221004

for you to fetch changes up to ab419fd8a035a65942de4e63effcd55ccbf1a9fe:

  target/sh4: Fix TB_FLAG_UNALIGN (2022-10-04 12:33:05 -0700)


Cache CPUClass for use in hot code paths.
Add CPUTLBEntryFull, probe_access_full, tlb_set_page_full.
Add generic support for TARGET_TB_PCREL.
tcg/ppc: Optimize 26-bit jumps using STQ for POWER 2.07
target/sh4: Fix TB_FLAG_UNALIGN


Alex Bennée (3):
  cpu: cache CPUClass in CPUState for hot code paths
  hw/core/cpu-sysemu: used cached class in cpu_asidx_from_attrs
  cputlb: used cached CPUClass in our hot-paths

Leandro Lupori (1):
  tcg/ppc: Optimize 26-bit jumps

Richard Henderson (16):
  accel/tcg: Rename CPUIOTLBEntry to CPUTLBEntryFull
  accel/tcg: Drop addr member from SavedIOTLB
  accel/tcg: Suppress auto-invalidate in probe_access_internal
  accel/tcg: Introduce probe_access_full
  accel/tcg: Introduce tlb_set_page_full
  include/exec: Introduce TARGET_PAGE_ENTRY_EXTRA
  accel/tcg: Remove PageDesc code_bitmap
  accel/tcg: Use bool for page_find_alloc
  accel/tcg: Use DisasContextBase in plugin_gen_tb_start
  accel/tcg: Do not align tb->page_addr[0]
  accel/tcg: Inline tb_flush_jmp_cache
  include/hw/core: Create struct CPUJumpCache
  hw/core: Add CPUClass.get_pc
  accel/tcg: Introduce tb_pc and log_pc
  accel/tcg: Introduce TARGET_TB_PCREL
  target/sh4: Fix TB_FLAG_UNALIGN

 accel/tcg/internal.h|  10 ++
 accel/tcg/tb-hash.h |   1 +
 accel/tcg/tb-jmp-cache.h|  65 
 include/exec/cpu-common.h   |   1 +
 include/exec/cpu-defs.h |  48 --
 include/exec/exec-all.h |  75 -
 include/exec/plugin-gen.h   |   7 +-
 include/hw/core/cpu.h   |  28 ++--
 include/qemu/typedefs.h |   2 +
 include/tcg/tcg.h   |   2 +-
 target/sh4/cpu.h|  56 ---
 accel/stubs/tcg-stub.c  |   4 +
 accel/tcg/cpu-exec.c|  80 +-
 accel/tcg/cputlb.c  | 259 ++--
 accel/tcg/plugin-gen.c  |  22 +--
 accel/tcg/translate-all.c   | 214 --
 accel/tcg/translator.c  |   2 +-
 cpu.c   |   9 +-
 hw/core/cpu-common.c|   3 +-
 hw/core/cpu-sysemu.c|   5 +-
 linux-user/sh4/signal.c |   6 +-
 plugins/core.c  |   2 +-
 target/alpha/cpu.c  |   9 ++
 target/arm/cpu.c|  17 ++-
 target/arm/mte_helper.c |  14 +-
 target/arm/sve_helper.c |   4 +-
 target/arm/translate-a64.c  |   2 +-
 target/avr/cpu.c|  10 +-
 target/cris/cpu.c   |   8 +
 target/hexagon/cpu.c|  10 +-
 target/hppa/cpu.c   |  12 +-
 target/i386/cpu.c   |   9 ++
 target/i386/tcg/tcg-cpu.c   |   2 +-
 target/loongarch/cpu.c  |  11 +-
 target/m68k/cpu.c   |   8 +
 target/microblaze/cpu.c |  10 +-
 target/mips/cpu.c   |   8 +
 target/mips/tcg/exception.c |   2 +-
 target/mips/tcg/sysemu/special_helper.c |   2 +-
 target/nios2/cpu.c  |   9 ++
 target/openrisc/cpu.c   |  10 +-
 target/ppc/cpu_init.c   |   8 +
 target/riscv/cpu.c  |  17 ++-
 target/rx/cpu.c |  10 +-
 target/s390x/cpu.c  |   8 +
 target/s390x/tcg/mem_helper.c   |   4 -
 target/sh4/cpu.c|  18 ++-
 target/sh4/helper.c |   6 +-
 target/sh4/translate.c  |  90 +--
 target/sparc/cpu.c  |  10 +-
 target/tricore/cpu.c|  11 +-
 target/xtensa/cpu.c |   8 +
 tcg/tcg.c   |   8 +-
 trace/control-target.c  |   2 +-
 tcg/ppc/tcg-target.c.inc| 119 +++
 55 files changed, 915 insertions(+), 462 deletions(-)
 create mode 100644 accel/tcg/tb-jmp-cache.h



[PULL 01/20] cpu: cache CPUClass in CPUState for hot code paths

2022-10-04 Thread Richard Henderson
From: Alex Bennée 

The class cast checkers are quite expensive and always on (unlike the
dynamic case whose checks are gated by CONFIG_QOM_CAST_DEBUG). To
avoid the overhead of repeatedly checking something which should never
change we cache the CPUClass reference for use in the hot code paths.

Signed-off-by: Alex Bennée 
Reviewed-by: Richard Henderson 
Message-Id: <20220811151413.3350684-3-alex.ben...@linaro.org>
Signed-off-by: Cédric Le Goater 
Message-Id: <20220923084803.498337-3-...@kaod.org>
Signed-off-by: Richard Henderson 
---
 include/hw/core/cpu.h | 9 +
 cpu.c | 9 -
 2 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/include/hw/core/cpu.h b/include/hw/core/cpu.h
index 500503da13..1a7e1a9380 100644
--- a/include/hw/core/cpu.h
+++ b/include/hw/core/cpu.h
@@ -51,6 +51,13 @@ typedef int (*WriteCoreDumpFunction)(const void *buf, size_t size,
  */
 #define CPU(obj) ((CPUState *)(obj))
 
+/*
+ * The class checkers bring in CPU_GET_CLASS() which is potentially
+ * expensive given the eventual call to
+ * object_class_dynamic_cast_assert(). Because of this the CPUState
+ * has a cached value for the class in cs->cc which is set up in
+ * cpu_exec_realizefn() for use in hot code paths.
+ */
 typedef struct CPUClass CPUClass;
 DECLARE_CLASS_CHECKERS(CPUClass, CPU,
TYPE_CPU)
@@ -317,6 +324,8 @@ struct qemu_work_item;
 struct CPUState {
 /*< private >*/
 DeviceState parent_obj;
+/* cache to avoid expensive CPU_GET_CLASS */
+CPUClass *cc;
 /*< public >*/
 
 int nr_cores;
diff --git a/cpu.c b/cpu.c
index 584ac78baf..14365e36f3 100644
--- a/cpu.c
+++ b/cpu.c
@@ -131,9 +131,8 @@ const VMStateDescription vmstate_cpu_common = {
 
 void cpu_exec_realizefn(CPUState *cpu, Error **errp)
 {
-#ifndef CONFIG_USER_ONLY
-CPUClass *cc = CPU_GET_CLASS(cpu);
-#endif
+/* cache the cpu class for the hotpath */
+cpu->cc = CPU_GET_CLASS(cpu);
 
 cpu_list_add(cpu);
 if (!accel_cpu_realizefn(cpu, errp)) {
@@ -151,8 +150,8 @@ void cpu_exec_realizefn(CPUState *cpu, Error **errp)
 if (qdev_get_vmsd(DEVICE(cpu)) == NULL) {
 vmstate_register(NULL, cpu->cpu_index, &vmstate_cpu_common, cpu);
 }
-if (cc->sysemu_ops->legacy_vmsd != NULL) {
-vmstate_register(NULL, cpu->cpu_index, cc->sysemu_ops->legacy_vmsd, cpu);
+if (cpu->cc->sysemu_ops->legacy_vmsd != NULL) {
+vmstate_register(NULL, cpu->cpu_index,
+ cpu->cc->sysemu_ops->legacy_vmsd, cpu);
 }
 #endif /* CONFIG_USER_ONLY */
 }
-- 
2.34.1
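
The resulting pattern for any hot path looks like this (illustrative
helper, not part of the patch):

    /* Sketch: use the class pointer cached at realize time instead of
     * CPU_GET_CLASS(), whose cast check costs on every invocation. */
    static inline void cpu_dump_state_fast(CPUState *cpu, FILE *f, int flags)
    {
        if (cpu->cc->dump_state) {     /* cc set in cpu_exec_realizefn() */
            cpu->cc->dump_state(cpu, f, flags);
        }
    }

The next two patches in the series convert cpu_asidx_from_attrs() and the
cputlb hot paths to exactly this shape.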




Re: [PATCH v5 9/9] target/arm: Enable TARGET_TB_PCREL

2022-10-04 Thread Richard Henderson

On 10/4/22 09:23, Peter Maydell wrote:

  void arm_cpu_synchronize_from_tb(CPUState *cs,
   const TranslationBlock *tb)
  {
-ARMCPU *cpu = ARM_CPU(cs);
-CPUARMState *env = &cpu->env;
-
-/*
- * It's OK to look at env for the current mode here, because it's
- * never possible for an AArch64 TB to chain to an AArch32 TB.
- */
-if (is_a64(env)) {
-env->pc = tb_pc(tb);
-} else {
-env->regs[15] = tb_pc(tb);
+/* The program counter is always up to date with TARGET_TB_PCREL. */


I was confused for a bit about this, but it works because
although the synchronize_from_tb hook has a name that implies
it's comparatively general purpose, in fact we use it only
in the special case of "we abandoned execution at the start of
this TB without executing any of it".


Correct.


@@ -347,16 +354,22 @@ static void gen_exception_internal(int excp)

  static void gen_exception_internal_insn(DisasContext *s, int excp)
  {
+target_ulong pc_save = s->pc_save;
+
  gen_a64_update_pc(s, 0);
  gen_exception_internal(excp);
  s->base.is_jmp = DISAS_NORETURN;
+s->pc_save = pc_save;


What is trashing s->pc_save that we have to work around like this,
here and in the other similar changes ?


gen_a64_update_pc trashes pc_save.

Off the top of my head, I can't remember what conditionally uses exceptions
(single step?).  But the usage pattern that is interesting is


brcond(x, y, L1)
update_pc(disp1);
exit-or-exception.
L1:
update_pc(disp2);
exit-or-exception.

where at L1 we should have the same pc_save value as we did at the brcond.  Saving and 
restoring around (at least some of) the DISAS_NORETURN points achieves that.



r~



Re: [PATCH 06/14] migration: Use atomic ops properly for page accountings

2022-10-04 Thread Peter Xu
On Tue, Oct 04, 2022 at 05:59:36PM +0100, Dr. David Alan Gilbert wrote:
> * Peter Xu (pet...@redhat.com) wrote:
> > To prepare for thread-safety on page accountings, at least the counters
> > below need to be accessed only atomically; they are:
> > 
> > ram_counters.transferred
> > ram_counters.duplicate
> > ram_counters.normal
> > ram_counters.postcopy_bytes
> > 
> > There are a lot of other counters but they won't be accessed outside the
> > migration thread, so they're still safe to be accessed without atomic
> > ops.
> > 
> > Signed-off-by: Peter Xu 
> 
> I think this is OK; I'm not sure whether the memset 0's of ram_counters
> technically need changing.

IMHO they're fine - what we need there is something like WRITE_ONCE(), just
to make sure the value isn't cached in a register (actually atomic_write() is
normally implemented with WRITE_ONCE afaik).  But I think that's already
guaranteed by memset() being a function call, so we should be 100% safe.
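As a sketch of the pattern being discussed (using QEMU's qatomic helpers; the
real call sites are in the patch itself):

    qatomic_add(&ram_counters.transferred, bytes);         /* concurrent writer */
    uint64_t v = qatomic_read(&ram_counters.transferred);  /* concurrent reader */
    /* single-writer reset: memset() is a real function call, so the
     * compiler can't keep the counter cached in a register across it */
    memset(&ram_counters, 0, sizeof(ram_counters));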

> I'd love to put a comment somewhere saying these fields need to be
> atomically read, but they're qapi-defined so I don't think we can.

How about I add a comment above ram_counters declarations in ram.c?

> 
> Finally, we probably need to check these are happy on 32 bit builds,
> sometimes it's a bit funny with atomic adds.

Yeah.. I hope using qatomic_*() APIs can help me avoid any issues.  Or
anything concerning?  I'd be happy to test on specific things if there are.
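For what it's worth, the qatomic_*() macros already reject oversized types at
build time, which is where a 32-bit problem should surface first - roughly
like this (a paraphrase of the idea in include/qemu/atomic.h, not the exact
source):

    #define qatomic_read(ptr)                                     \
        ({                                                        \
            qemu_build_assert(sizeof(*ptr) <= ATOMIC_REG_SIZE);   \
            __atomic_load_n(ptr, __ATOMIC_RELAXED);               \
        })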

> 
> 
> Reviewed-by: Dr. David Alan Gilbert 

Thanks!

-- 
Peter Xu




Re: [PATCH v2] target/sh4: Fix TB_FLAG_UNALIGN

2022-10-04 Thread Richard Henderson

On 10/3/22 22:56, Yoshinori Sato wrote:

On Mon, 03 Oct 2022 02:23:51 +0900,
Richard Henderson wrote:


Ping, or should I create a PR myself?

r~


Sorry.
I can't work this week, so please submit a PR.


Ok, I will fold this into the tcg-next PR that I am preparing now.


r~



Re: [PATCH 05/14] migration: Yield bitmap_mutex properly when sending/sleeping

2022-10-04 Thread Peter Xu
On Tue, Oct 04, 2022 at 02:55:10PM +0100, Dr. David Alan Gilbert wrote:
> * Peter Xu (pet...@redhat.com) wrote:
> > Don't take the bitmap mutex when sending pages, or when being throttled by
> > migration_rate_limit() (which is a bit tricky to call it here in ram code,
> > but seems still helpful).
> > 
> > It prepares for the possibility of concurrently sending pages in >1 threads
> > using the function ram_save_host_page() because all threads may need the
> > bitmap_mutex to operate on bitmaps, so that either sendmsg() or any kind of
> > qemu_sem_wait() blocking for one thread will not block the other from
> > progressing.
> > 
> > Signed-off-by: Peter Xu 
> 
> I generally don't like taking locks conditionally; but this kind of looks
> OK; I think it needs a big comment on the start of the function saying
> that it's called and left with the lock held but that it might drop it
> temporarily.

Right, the code is slightly hard to read; I just haven't seen a good and
easy solution for it yet.  It's just that we may still want to keep the
lock held as long as possible for precopy in one shot.

> 
> > ---
> >  migration/ram.c | 42 +++---
> >  1 file changed, 31 insertions(+), 11 deletions(-)
> > 
> > diff --git a/migration/ram.c b/migration/ram.c
> > index 8303252b6d..6e7de6087a 100644
> > --- a/migration/ram.c
> > +++ b/migration/ram.c
> > @@ -2463,6 +2463,7 @@ static void postcopy_preempt_reset_channel(RAMState *rs)
> >   */
> >  static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss)
> >  {
> > +bool page_dirty, release_lock = postcopy_preempt_active();
> 
> Could you rename that to something like 'drop_lock' - you are taking the
> lock at the end even when you have 'release_lock' set - which is a bit
> strange naming.

Is there any difference between "drop" and "release"?  I'll change the name
anyway since I definitely trust you on any English comments, but please
still let me know - I love to learn more about these! :)

> 
> >  int tmppages, pages = 0;
> >  size_t pagesize_bits =
> >  qemu_ram_pagesize(pss->block) >> TARGET_PAGE_BITS;
> > @@ -2486,22 +2487,41 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss)
> >  break;
> >  }
> >  
> > +page_dirty = migration_bitmap_clear_dirty(rs, pss->block, pss->page);
> > +/*
> > + * Properly yield the lock only in postcopy preempt mode because
> > + * both migration thread and rp-return thread can operate on the
> > + * bitmaps.
> > + */
> > +if (release_lock) {
> > +qemu_mutex_unlock(&rs->bitmap_mutex);
> > +}
> 
> Shouldn't the unlock/lock move inside the 'if (page_dirty) {' ?

I think we can move it inside, but it may not be as optimal as keeping it
as-is.

Consider a case where we've got the bitmap with a long run of continuous zero bits.
During postcopy, the migration thread could be spinning here with the lock
held even if it doesn't send a thing.  It could still block the other
return path thread on sending urgent pages which may be outside the zero
zones.
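A stripped-down sketch of the loop shape (not the exact ram.c code; the loop
condition is simplified) shows why the unlock stays outside the page_dirty
check:

    qemu_mutex_lock(&rs->bitmap_mutex);
    while (pages_left) {
        page_dirty = migration_bitmap_clear_dirty(rs, pss->block, pss->page);
        if (release_lock) {
            /* dropped even for clean pages, so scanning a long run of
             * zero bits can't starve the return-path thread */
            qemu_mutex_unlock(&rs->bitmap_mutex);
        }
        if (page_dirty) {
            ram_save_target_page(rs, pss);  /* may block in sendmsg() */
        }
        if (release_lock) {
            qemu_mutex_lock(&rs->bitmap_mutex);
        }
    }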

> 
> 
> >  /* Check the pages is dirty and if it is send it */
> > -if (migration_bitmap_clear_dirty(rs, pss->block, pss->page)) {
> > +if (page_dirty) {
> >  tmppages = ram_save_target_page(rs, pss);
> > -if (tmppages < 0) {
> > -return tmppages;
> > +if (tmppages >= 0) {
> > +pages += tmppages;
> > +/*
> > + * Allow rate limiting to happen in the middle of huge 
> > pages if
> > + * something is sent in the current iteration.
> > + */
> > +if (pagesize_bits > 1 && tmppages > 0) {
> > +migration_rate_limit();
> 
> This feels interesting, I know it's no change from before, and it's
> difficult to do here, but it seems odd to hold the lock around the
> sleeping in the rate limit.

Good point..  I think I'll leave it as-is for this patch because it's
orthogonal to this change, but it seems proper to do the unlocking for
normal precopy too in the future.

Maybe I'll just attach a patch at the end of this series when I repost.
That'll be easier before things got forgotten again.

-- 
Peter Xu




[PATCH v2 0/5] migration: Bug fixes (prepare for preempt-full)

2022-10-04 Thread Peter Xu
v2:
- Drop patch "migration: Disallow xbzrle with postcopy" [Dave]
- Added patch "migration: Disable multifd explicitly with compression"
  (according to the comment in the other series) [Dave]
- s/deadloop/infinite loop/ in patch 1 subject [Dave]

v1: https://lore.kernel.org/qemu-devel/20220920223800.47467-1-pet...@redhat.com

This patchset contains bug fixes that I found when testing preempt-full.
Please refer to each of the patches for its purpose.  Thanks,

Peter Xu (5):
  migration: Fix possible infinite loop of ram save process
  migration: Fix race on qemu_file_shutdown()
  migration: Disallow postcopy preempt to be used with compress
  migration: Use non-atomic ops for clear log bitmap
  migration: Disable multifd explicitly with compression

 include/exec/ram_addr.h | 11 +-
 include/exec/ramblock.h |  3 +++
 include/qemu/bitmap.h   |  1 +
 migration/migration.c   | 18 +
 migration/qemu-file.c   | 27 ++---
 migration/ram.c | 27 -
 util/bitmap.c   | 45 +
 7 files changed, 114 insertions(+), 18 deletions(-)

-- 
2.37.3




[PATCH v2 2/5] migration: Fix race on qemu_file_shutdown()

2022-10-04 Thread Peter Xu
In qemu_file_shutdown(), there's a possible race with the current order of
operations.  There are two major things to do:

  (1) Do real shutdown() (e.g. shutdown() syscall on socket)
  (2) Update qemufile's last_error

We must do (2) before (1) otherwise there can be a race condition like:

  page receiver                     other thread
  -------------                     ------------
  qemu_get_buffer()
                                    do shutdown()
    returns 0 (buffer all zero)
    (meanwhile we didn't check this retcode)
  try to detect IO error
    last_error==NULL, IO okay
  install ALL-ZERO page
                                    set last_error
  --> guest crash!

To fix this, we can also check the retval of qemu_get_buffer(), but not all
APIs can be properly checked and ultimately we still need to go back to
qemu_file_get_error().  E.g. qemu_get_byte() doesn't return an error.

Maybe some day a rework of the qemufile API is really needed, but for now keep
using qemu_file_get_error() and fix it by not allowing that race condition
to happen.  Here shutdown() is indeed special because its last_error is
emulated.  For real -EIO errors it'll always be set when e.g. a sendmsg()
error triggers, so we won't miss those ones; only shutdown() is a bit tricky
here.

Cc: Daniel P. Berrange 
Reviewed-by: Dr. David Alan Gilbert 
Signed-off-by: Peter Xu 
---
 migration/qemu-file.c | 27 ---
 1 file changed, 24 insertions(+), 3 deletions(-)

diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index 4f400c2e52..2d5f74ffc2 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -79,6 +79,30 @@ int qemu_file_shutdown(QEMUFile *f)
 int ret = 0;
 
 f->shutdown = true;
+
+/*
+ * We must set qemufile error before the real shutdown(), otherwise
+ * there can be a race window where we thought IO all went through
+ * (because last_error==NULL) but actually IO has already stopped.
+ *
+ * If without correct ordering, the race can happen like this:
+ *
+ *  page receiver                     other thread
+ *  -------------                     ------------
+ *  qemu_get_buffer()
+ *                                      do shutdown()
+ *    returns 0 (buffer all zero)
+ *    (we didn't check this retcode)
+ *  try to detect IO error
+ *    last_error==NULL, IO okay
+ *  install ALL-ZERO page
+ *                                      set last_error
+ *  --> guest crash!
+ */
+if (!f->last_error) {
+qemu_file_set_error(f, -EIO);
+}
+
 if (!qio_channel_has_feature(f->ioc,
  QIO_CHANNEL_FEATURE_SHUTDOWN)) {
 return -ENOSYS;
@@ -88,9 +112,6 @@ int qemu_file_shutdown(QEMUFile *f)
 ret = -EIO;
 }
 
-if (!f->last_error) {
-qemu_file_set_error(f, -EIO);
-}
 return ret;
 }
 
-- 
2.37.3




[PATCH v2 4/5] migration: Use non-atomic ops for clear log bitmap

2022-10-04 Thread Peter Xu
Since we already have bitmap_mutex to protect either the dirty bitmap or
the clear log bitmap, we don't need atomic operations to set/clear/test on
the clear log bitmap.  Switching all ops from atomic to non-atomic
versions, meanwhile touch up the comments to show which lock is in charge.

Introduce a non-atomic version of bitmap_test_and_clear_atomic(); it is mostly
the same as the atomic version but simplified in a few places, e.g. it drops
the "old_bits" variable and also the explicit memory barriers.

Reviewed-by: Dr. David Alan Gilbert 
Signed-off-by: Peter Xu 
---
 include/exec/ram_addr.h | 11 +-
 include/exec/ramblock.h |  3 +++
 include/qemu/bitmap.h   |  1 +
 util/bitmap.c   | 45 +
 4 files changed, 55 insertions(+), 5 deletions(-)

diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index f3e0c78161..5092a2e0ff 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -42,7 +42,8 @@ static inline long clear_bmap_size(uint64_t pages, uint8_t shift)
 }
 
 /**
- * clear_bmap_set: set clear bitmap for the page range
+ * clear_bmap_set: set clear bitmap for the page range.  Must be with
+ * bitmap_mutex held.
  *
  * @rb: the ramblock to operate on
  * @start: the start page number
@@ -55,12 +56,12 @@ static inline void clear_bmap_set(RAMBlock *rb, uint64_t start,
 {
 uint8_t shift = rb->clear_bmap_shift;
 
-bitmap_set_atomic(rb->clear_bmap, start >> shift,
-  clear_bmap_size(npages, shift));
+bitmap_set(rb->clear_bmap, start >> shift, clear_bmap_size(npages, shift));
 }
 
 /**
- * clear_bmap_test_and_clear: test clear bitmap for the page, clear if set
+ * clear_bmap_test_and_clear: test clear bitmap for the page, clear if set.
+ * Must be with bitmap_mutex held.
  *
  * @rb: the ramblock to operate on
  * @page: the page number to check
@@ -71,7 +72,7 @@ static inline bool clear_bmap_test_and_clear(RAMBlock *rb, uint64_t page)
 {
 uint8_t shift = rb->clear_bmap_shift;
 
-return bitmap_test_and_clear_atomic(rb->clear_bmap, page >> shift, 1);
+return bitmap_test_and_clear(rb->clear_bmap, page >> shift, 1);
 }
 
 static inline bool offset_in_ramblock(RAMBlock *b, ram_addr_t offset)
diff --git a/include/exec/ramblock.h b/include/exec/ramblock.h
index 6cbedf9e0c..adc03df59c 100644
--- a/include/exec/ramblock.h
+++ b/include/exec/ramblock.h
@@ -53,6 +53,9 @@ struct RAMBlock {
  * and split clearing of dirty bitmap on the remote node (e.g.,
  * KVM).  The bitmap will be set only when doing global sync.
  *
+ * It is only used during src side of ram migration, and it is
+ * protected by the global ram_state.bitmap_mutex.
+ *
  * NOTE: this bitmap is different comparing to the other bitmaps
  * in that one bit can represent multiple guest pages (which is
  * decided by the `clear_bmap_shift' variable below).  On
diff --git a/include/qemu/bitmap.h b/include/qemu/bitmap.h
index 82a1d2f41f..3ccb00865f 100644
--- a/include/qemu/bitmap.h
+++ b/include/qemu/bitmap.h
@@ -253,6 +253,7 @@ void bitmap_set(unsigned long *map, long i, long len);
 void bitmap_set_atomic(unsigned long *map, long i, long len);
 void bitmap_clear(unsigned long *map, long start, long nr);
 bool bitmap_test_and_clear_atomic(unsigned long *map, long start, long nr);
+bool bitmap_test_and_clear(unsigned long *map, long start, long nr);
 void bitmap_copy_and_clear_atomic(unsigned long *dst, unsigned long *src,
   long nr);
 unsigned long bitmap_find_next_zero_area(unsigned long *map,
diff --git a/util/bitmap.c b/util/bitmap.c
index f81d8057a7..8d12e90a5a 100644
--- a/util/bitmap.c
+++ b/util/bitmap.c
@@ -240,6 +240,51 @@ void bitmap_clear(unsigned long *map, long start, long nr)
 }
 }
 
+bool bitmap_test_and_clear(unsigned long *map, long start, long nr)
+{
+unsigned long *p = map + BIT_WORD(start);
+const long size = start + nr;
+int bits_to_clear = BITS_PER_LONG - (start % BITS_PER_LONG);
+unsigned long mask_to_clear = BITMAP_FIRST_WORD_MASK(start);
+bool dirty = false;
+
+assert(start >= 0 && nr >= 0);
+
+/* First word */
+if (nr - bits_to_clear > 0) {
+if ((*p) & mask_to_clear) {
+dirty = true;
+}
+*p &= ~mask_to_clear;
+nr -= bits_to_clear;
+bits_to_clear = BITS_PER_LONG;
+p++;
+}
+
+/* Full words */
+if (bits_to_clear == BITS_PER_LONG) {
+while (nr >= BITS_PER_LONG) {
+if (*p) {
+dirty = true;
+*p = 0;
+}
+nr -= BITS_PER_LONG;
+p++;
+}
+}
+
+/* Last word */
+if (nr) {
+mask_to_clear &= BITMAP_LAST_WORD_MASK(size);
+if ((*p) & mask_to_clear) {
+dirty = true;
+}
+*p &= ~mask_to_clear;
+}
+
+return dirty;
+}
+
 bool bitmap_test_and_clear_atomic(unsigned long *map, long start, long nr)
 {
 

[PATCH v2 5/5] migration: Disable multifd explicitly with compression

2022-10-04 Thread Peter Xu
The multifd thread model does not work for compression, so explicitly disable it.

Note that previously even if we enabled both of them nothing would go wrong,
because the compression code has higher priority so the multifd feature would
just be ignored.  Now we'll fail even earlier at config time so the user is
better aware of the consequence.

Note that there is a slight chance of breaking existing users, but let's
assume they're not the majority and not serious users, or they should have
found already that multifd is not working.

With that, we can safely drop the check in ram_save_target_page() for using
multifd, because when multifd=on then compression=off, so the removed check
on save_page_use_compression() would always return false anyway.

Signed-off-by: Peter Xu 
---
 migration/migration.c |  7 +++
 migration/ram.c   | 11 +--
 2 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 844bca1ff6..ef00bff0b3 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1349,6 +1349,13 @@ static bool migrate_caps_check(bool *cap_list,
 }
 }
 
+if (cap_list[MIGRATION_CAPABILITY_MULTIFD]) {
+if (cap_list[MIGRATION_CAPABILITY_COMPRESS]) {
+error_setg(errp, "Multifd is not compatible with compress");
+return false;
+}
+}
+
 return true;
 }
 
diff --git a/migration/ram.c b/migration/ram.c
index 1d42414ecc..1338e47665 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -2305,13 +2305,12 @@ static int ram_save_target_page(RAMState *rs, PageSearchStatus *pss)
 }
 
 /*
- * Do not use multifd for:
- * 1. Compression as the first page in the new block should be posted out
- *before sending the compressed page
- * 2. In postcopy as one whole host page should be placed
+ * Do not use multifd in postcopy as one whole host page should be
+ * placed.  Meanwhile postcopy requires atomic update of pages, so even
+ * if host page size == guest page size the dest guest during run may
+ * still see partially copied pages which is data corruption.
  */
-if (!save_page_use_compression(rs) && migrate_use_multifd()
-&& !migration_in_postcopy()) {
+if (migrate_use_multifd() && !migration_in_postcopy()) {
 return ram_save_multifd_page(rs, block, offset);
 }
 
-- 
2.37.3




[PATCH v2 1/5] migration: Fix possible infinite loop of ram save process

2022-10-04 Thread Peter Xu
When starting the ram saving procedure (especially at the completion phase),
always set last_seen_block to non-NULL to make sure we can always correctly
detect the case where "we've migrated all the dirty pages".

Then we'll guarantee both last_seen_block and pss.block will be valid
always before the loop starts.

See the comment in the code for some details.

Reviewed-by: Dr. David Alan Gilbert 
Signed-off-by: Peter Xu 
---
 migration/ram.c | 16 
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index dc1de9ddbc..1d42414ecc 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -2546,14 +2546,22 @@ static int ram_find_and_save_block(RAMState *rs)
 return pages;
 }
 
+/*
+ * Always keep last_seen_block/last_page valid during this procedure,
+ * because find_dirty_block() relies on these values (e.g., we compare
+ * last_seen_block with pss.block to see whether we searched all the
+ * ramblocks) to detect the completion of migration.  Having NULL value
+ * of last_seen_block can conditionally cause below loop to run forever.
+ */
+if (!rs->last_seen_block) {
+rs->last_seen_block = QLIST_FIRST_RCU(&ram_list.blocks);
+rs->last_page = 0;
+}
+
 pss.block = rs->last_seen_block;
 pss.page = rs->last_page;
 pss.complete_round = false;
 
-if (!pss.block) {
-pss.block = QLIST_FIRST_RCU(&ram_list.blocks);
-}
-
 do {
 again = true;
 found = get_queued_page(rs, &pss);
-- 
2.37.3




[PATCH v2 3/5] migration: Disallow postcopy preempt to be used with compress

2022-10-04 Thread Peter Xu
The preempt mode requires the capability to assign a channel for each of the
pages, while the compression logic will currently assign pages to different
compress threads/local-channels, so potentially they're incompatible.

Reviewed-by: Dr. David Alan Gilbert 
Signed-off-by: Peter Xu 
---
 migration/migration.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/migration/migration.c b/migration/migration.c
index bb8bbddfe4..844bca1ff6 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1336,6 +1336,17 @@ static bool migrate_caps_check(bool *cap_list,
 error_setg(errp, "Postcopy preempt requires postcopy-ram");
 return false;
 }
+
+/*
+ * Preempt mode requires urgent pages to be sent in separate
+ * channel, OTOH compression logic will disorder all pages into
+ * different compression channels, which is not compatible with the
+ * preempt assumptions on channel assignments.
+ */
+if (cap_list[MIGRATION_CAPABILITY_COMPRESS]) {
+error_setg(errp, "Postcopy preempt not compatible with compress");
+return false;
+}
 }
 
 return true;
-- 
2.37.3




Re: [PATCH v4 6/6] hw/arm/virt: Add 'compact-highmem' property

2022-10-04 Thread Marc Zyngier
On Tue, 04 Oct 2022 01:26:27 +0100,
Gavin Shan  wrote:
> 
> After the improvement to high memory region address assignment is
> applied, the memory layout can be changed, introducing possible
> migration breakage. For example, VIRT_HIGH_PCIE_MMIO memory region
> is disabled or enabled when the optimization is applied or not, with
> the following configuration.
> 
>   pa_bits              = 40;
>   vms->highmem_redists = false;
>   vms->highmem_ecam    = false;
>   vms->highmem_mmio    = true;

The question is how are these parameters specified by a user? Short of
hacking the code, this isn't really possible.

> 
>   # qemu-system-aarch64 -accel kvm -cpu host \
> -machine virt-7.2,compact-highmem={on, off} \
> -m 4G,maxmem=511G -monitor stdio
> 
>   Region            compact-highmem=off        compact-highmem=on
>   -----------------------------------------------------------------
>   RAM               [1GB          512GB]       [1GB          512GB]
>   HIGH_GIC_REDISTS  [512GB   512GB+64MB]       [disabled]
>   HIGH_PCIE_ECAM    [512GB+256MB 512GB+512MB]  [disabled]
>   HIGH_PCIE_MMIO    [disabled]                 [512GB          1TB]
> 
> In order to keep backwards compatibility, we need to disable the
> optimization on machines that are virt-7.1 or earlier. It
> means the optimization is enabled by default from virt-7.2. Besides,
> 'compact-highmem' property is added so that the optimization can be
> explicitly enabled or disabled on all machine types by users.

Not directly related to this series, but it seems to me that we should
be aiming at reproducible results across HW implementations (at least
with KVM). Depending on how many PA bits the HW implements, we end up
with one set of devices or another, which is likely to be confusing for
a user.

I think we should consider an additional set of changes to allow a
user to specify the PA bits as well as the devices they want to see
enabled.
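Something along these lines, say (hypothetical syntax, none of these options
exist today):

  qemu-system-aarch64 -accel kvm -cpu host \
      -machine virt,pa-bits=40,highmem-redists=off,highmem-ecam=off,highmem-mmio=on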

Thanks,

M.

-- 
Without deviation from the norm, progress is not possible.



Re: [PATCH v3 5/5] pci-ids: document modern virtio-pci ids in pci.h too

2022-10-04 Thread Eric Auger
Hi Gerd,

On 10/4/22 13:21, Gerd Hoffmann wrote:
> While being at it add a #define for the magic 0x1040 number.
>
> Signed-off-by: Gerd Hoffmann 
Reviewed-by: Eric Auger 

Thanks

Eric
> ---
>  include/hw/pci/pci.h   | 10 ++
>  hw/virtio/virtio-pci.c |  2 +-
>  2 files changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index 42c83cb5ed00..d1ac308574f1 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -76,6 +76,7 @@ extern bool pci_available;
>  #define PCI_SUBVENDOR_ID_REDHAT_QUMRANET 0x1af4
>  #define PCI_SUBDEVICE_ID_QEMU0x1100
>  
> +/* legacy virtio-pci devices */
>  #define PCI_DEVICE_ID_VIRTIO_NET 0x1000
>  #define PCI_DEVICE_ID_VIRTIO_BLOCK   0x1001
>  #define PCI_DEVICE_ID_VIRTIO_BALLOON 0x1002
> @@ -85,6 +86,15 @@ extern bool pci_available;
>  #define PCI_DEVICE_ID_VIRTIO_9P  0x1009
>  #define PCI_DEVICE_ID_VIRTIO_VSOCK   0x1012
>  
> +/*
> + * modern virtio-pci devices get their id assigned automatically,
> + * there is no need to add #defines here.  It gets calculated as
> + *
> + * PCI_DEVICE_ID = PCI_DEVICE_ID_VIRTIO_10_BASE +
> + * virtio_bus_get_vdev_id(bus)
> + */
> +#define PCI_DEVICE_ID_VIRTIO_10_BASE 0x1040
> +
>  #define PCI_VENDOR_ID_REDHAT 0x1b36
>  #define PCI_DEVICE_ID_REDHAT_BRIDGE  0x0001
>  #define PCI_DEVICE_ID_REDHAT_SERIAL  0x0002
> diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c
> index a50c5a57d7e5..e7d80242b73f 100644
> --- a/hw/virtio/virtio-pci.c
> +++ b/hw/virtio/virtio-pci.c
> @@ -1688,7 +1688,7 @@ static void virtio_pci_device_plugged(DeviceState *d, Error **errp)
>  pci_set_word(config + PCI_VENDOR_ID,
>   PCI_VENDOR_ID_REDHAT_QUMRANET);
>  pci_set_word(config + PCI_DEVICE_ID,
> - 0x1040 + virtio_bus_get_vdev_id(bus));
> + PCI_DEVICE_ID_VIRTIO_10_BASE + virtio_bus_get_vdev_id(bus));
>  pci_config_set_revision(config, 1);
>  }
>  config[PCI_INTERRUPT_PIN] = 1;
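As a worked example of the calculation: virtio-net has virtio device ID 1, so
a modern virtio-net-pci device ends up with PCI device ID
PCI_DEVICE_ID_VIRTIO_10_BASE + 1 = 0x1040 + 0x1 = 0x1041.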




Re: [PATCH 06/14] migration: Use atomic ops properly for page accountings

2022-10-04 Thread Dr. David Alan Gilbert
* Peter Xu (pet...@redhat.com) wrote:
> To prepare for thread-safety on page accountings, at least the counters
> below need to be accessed only atomically; they are:
> 
> ram_counters.transferred
> ram_counters.duplicate
> ram_counters.normal
> ram_counters.postcopy_bytes
> 
> There are a lot of other counters but they won't be accessed outside the
> migration thread, so they're still safe to be accessed without atomic
> ops.
> 
> Signed-off-by: Peter Xu 

I think this is OK; I'm not sure whether the memset 0's of ram_counters
technically need changing.
I'd love to put a comment somewhere saying these fields need to be
atomically read, but they're qapi-defined so I don't think we can.

Finally, we probably need to check these are happy on 32-bit builds;
sometimes it's a bit funny with atomic adds.


Reviewed-by: Dr. David Alan Gilbert 

> ---
>  migration/migration.c | 10 +-
>  migration/multifd.c   |  2 +-
>  migration/ram.c   | 29 +++--
>  3 files changed, 21 insertions(+), 20 deletions(-)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index 07c74a79a2..0eacc0c99b 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -1048,13 +1048,13 @@ static void populate_ram_info(MigrationInfo *info, MigrationState *s)
>  
>  info->has_ram = true;
>  info->ram = g_malloc0(sizeof(*info->ram));
> -info->ram->transferred = ram_counters.transferred;
> +info->ram->transferred = qatomic_read(&ram_counters.transferred);
>  info->ram->total = ram_bytes_total();
> -info->ram->duplicate = ram_counters.duplicate;
> +info->ram->duplicate = qatomic_read(&ram_counters.duplicate);
>  /* legacy value.  It is not used anymore */
>  info->ram->skipped = 0;
> -info->ram->normal = ram_counters.normal;
> -info->ram->normal_bytes = ram_counters.normal * page_size;
> +info->ram->normal = qatomic_read(&ram_counters.normal);
> +info->ram->normal_bytes = info->ram->normal * page_size;
>  info->ram->mbps = s->mbps;
>  info->ram->dirty_sync_count = ram_counters.dirty_sync_count;
>  info->ram->dirty_sync_missed_zero_copy =
> @@ -1065,7 +1065,7 @@ static void populate_ram_info(MigrationInfo *info, MigrationState *s)
>  info->ram->pages_per_second = s->pages_per_second;
>  info->ram->precopy_bytes = ram_counters.precopy_bytes;
>  info->ram->downtime_bytes = ram_counters.downtime_bytes;
> -info->ram->postcopy_bytes = ram_counters.postcopy_bytes;
> +info->ram->postcopy_bytes = qatomic_read(&ram_counters.postcopy_bytes);
>  
>  if (migrate_use_xbzrle()) {
>  info->has_xbzrle_cache = true;
> diff --git a/migration/multifd.c b/migration/multifd.c
> index 586ddc9d65..460326acd4 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -437,7 +437,7 @@ static int multifd_send_pages(QEMUFile *f)
>  + p->packet_len;
>  qemu_file_acct_rate_limit(f, transferred);
>  ram_counters.multifd_bytes += transferred;
> -ram_counters.transferred += transferred;
> +qatomic_add(&ram_counters.transferred, transferred);
>  qemu_mutex_unlock(&p->mutex);
>  qemu_sem_post(&p->sem);
>  
> diff --git a/migration/ram.c b/migration/ram.c
> index 6e7de6087a..5bd3d76bf0 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -432,11 +432,11 @@ static void ram_transferred_add(uint64_t bytes)
>  if (runstate_is_running()) {
>  ram_counters.precopy_bytes += bytes;
>  } else if (migration_in_postcopy()) {
> -ram_counters.postcopy_bytes += bytes;
> +qatomic_add(&ram_counters.postcopy_bytes, bytes);
>  } else {
>  ram_counters.downtime_bytes += bytes;
>  }
> -ram_counters.transferred += bytes;
> +qatomic_add(&ram_counters.transferred, bytes);
>  }
>  
>  void dirty_sync_missed_zero_copy(void)
> @@ -725,7 +725,7 @@ void mig_throttle_counter_reset(void)
>  
>  rs->time_last_bitmap_sync = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
>  rs->num_dirty_pages_period = 0;
> -rs->bytes_xfer_prev = ram_counters.transferred;
> +rs->bytes_xfer_prev = qatomic_read(&ram_counters.transferred);
>  }
>  
>  /**
> @@ -1085,8 +1085,9 @@ uint64_t ram_pagesize_summary(void)
>  
>  uint64_t ram_get_total_transferred_pages(void)
>  {
> -return  ram_counters.normal + ram_counters.duplicate +
> -compression_counters.pages + xbzrle_counters.pages;
> +return  qatomic_read(&ram_counters.normal) +
> +qatomic_read(&ram_counters.duplicate) +
> +compression_counters.pages + xbzrle_counters.pages;
>  }
>  
>  static void migration_update_rates(RAMState *rs, int64_t end_time)
> @@ -1145,8 +1146,8 @@ static void migration_trigger_throttle(RAMState *rs)
>  {
>  MigrationState *s = migrate_get_current();
>  uint64_t threshold = s->parameters.throttle_trigger_threshold;
> -
> -uint64_t bytes_xfer_period = ram_counters.transferred - rs->bytes_xfer_prev;
> +uint64_t bytes_xfer_period =
> +qatomic_read(&ram_counters.transferred) - rs->bytes_xfer_prev;

Re: [PATCH v5 7/9] target/arm: Introduce gen_pc_plus_diff for aarch64

2022-10-04 Thread Peter Maydell
On Fri, 30 Sept 2022 at 23:15, Richard Henderson
 wrote:
>
> In preparation for TARGET_TB_PCREL, reduce reliance on absolute values.
>
> Signed-off-by: Richard Henderson 
> ---
>  target/arm/translate-a64.c | 41 +++---
>  1 file changed, 29 insertions(+), 12 deletions(-)

Reviewed-by: Peter Maydell 

thanks
-- PMM



Re: [PATCH v5 9/9] target/arm: Enable TARGET_TB_PCREL

2022-10-04 Thread Peter Maydell
On Fri, 30 Sept 2022 at 23:10, Richard Henderson
 wrote:
>
> Signed-off-by: Richard Henderson 
> ---
>  target/arm/cpu-param.h |  1 +
>  target/arm/translate.h | 19 
>  target/arm/cpu.c   | 23 +++---
>  target/arm/translate-a64.c | 37 ++-
>  target/arm/translate.c | 62 ++
>  5 files changed, 112 insertions(+), 30 deletions(-)
>
> diff --git a/target/arm/cpu-param.h b/target/arm/cpu-param.h
> index 68ffb12427..29c5fc4241 100644
> --- a/target/arm/cpu-param.h
> +++ b/target/arm/cpu-param.h
> @@ -30,6 +30,7 @@
>   */
>  # define TARGET_PAGE_BITS_VARY
>  # define TARGET_PAGE_BITS_MIN  10
> +# define TARGET_TB_PCREL 1
>  #endif
>
>  #define NB_MMU_MODES 15
> diff --git a/target/arm/translate.h b/target/arm/translate.h
> index 4aa239e23c..41d14cc067 100644
> --- a/target/arm/translate.h
> +++ b/target/arm/translate.h
> @@ -12,6 +12,25 @@ typedef struct DisasContext {
>
>  /* The address of the current instruction being translated. */
>  target_ulong pc_curr;
> +/*
> + * For TARGET_TB_PCREL, the full value of cpu_pc is not known
> + * (although the page offset is known).  For convenience, the
> + * translation loop uses the full virtual address that triggered
> + * the translation is used, from base.pc_start through pc_curr.

s/ is used//

> + * For efficiency, we do not update cpu_pc for every instruction.
> + * Instead, pc_save has the value of pc_curr at the time of the
> + * last update to cpu_pc, which allows us to compute the addend
> + * needed to bring cpu_pc current: pc_curr - pc_save.
> + * If cpu_pc now contains the destiation of an indirect branch,

"destination"

> + * pc_save contains -1 to indicate that relative updates are no
> + * longer possible.
> + */
> +target_ulong pc_save;
> +/*
> + * Similarly, pc_cond_save contains the value of pc_save at the
> + * beginning of an AArch32 conditional instruction.
> + */
> +target_ulong pc_cond_save;
>  target_ulong page_start;
>  uint32_t insn;
>  /* Nonzero if this instruction has been conditionally skipped.  */
> diff --git a/target/arm/cpu.c b/target/arm/cpu.c
> index 94ca6f163f..0bc5e9b125 100644
> --- a/target/arm/cpu.c
> +++ b/target/arm/cpu.c
> @@ -76,17 +76,18 @@ static vaddr arm_cpu_get_pc(CPUState *cs)
>  void arm_cpu_synchronize_from_tb(CPUState *cs,
>   const TranslationBlock *tb)
>  {
> -ARMCPU *cpu = ARM_CPU(cs);
> -CPUARMState *env = &cpu->env;
> -
> -/*
> - * It's OK to look at env for the current mode here, because it's
> - * never possible for an AArch64 TB to chain to an AArch32 TB.
> - */
> -if (is_a64(env)) {
> -env->pc = tb_pc(tb);
> -} else {
> -env->regs[15] = tb_pc(tb);
> +/* The program counter is always up to date with TARGET_TB_PCREL. */

I was confused for a bit about this, but it works because
although the synchronize_from_tb hook has a name that implies
it's comparatively general purpose, in fact we use it only
in the special case of "we abandoned execution at the start of
this TB without executing any of it".

> +if (!TARGET_TB_PCREL) {
> +CPUARMState *env = cs->env_ptr;
> +/*
> + * It's OK to look at env for the current mode here, because it's
> + * never possible for an AArch64 TB to chain to an AArch32 TB.
> + */
> +if (is_a64(env)) {
> +env->pc = tb_pc(tb);
> +} else {
> +env->regs[15] = tb_pc(tb);
> +}
>  }
>  }
>  #endif /* CONFIG_TCG */

> @@ -347,16 +354,22 @@ static void gen_exception_internal(int excp)
>
>  static void gen_exception_internal_insn(DisasContext *s, int excp)
>  {
> +target_ulong pc_save = s->pc_save;
> +
>  gen_a64_update_pc(s, 0);
>  gen_exception_internal(excp);
>  s->base.is_jmp = DISAS_NORETURN;
> +s->pc_save = pc_save;

What is trashing s->pc_save that we have to work around like this,
here and in the other similar changes ?

thanks
-- PMM



Re: [PATCH v7 18/18] accel/tcg: Introduce TARGET_TB_PCREL

2022-10-04 Thread Alex Bennée


Richard Henderson  writes:

> Prepare for targets to be able to produce TBs that can
> run in more than one virtual context.
>
> Signed-off-by: Richard Henderson 

Reviewed-by: Alex Bennée 

-- 
Alex Bennée



Re: [PATCH v5 6/9] target/arm: Change gen_jmp* to work on displacements

2022-10-04 Thread Peter Maydell
On Fri, 30 Sept 2022 at 23:10, Richard Henderson
 wrote:
>
> In preparation for TARGET_TB_PCREL, reduce reliance on absolute values.
>
> Signed-off-by: Richard Henderson 
> ---
>  target/arm/translate.c | 37 +
>  1 file changed, 21 insertions(+), 16 deletions(-)

> @@ -8368,7 +8372,8 @@ static bool trans_BLX_i(DisasContext *s, arg_BLX_i *a)
>  }
>  tcg_gen_movi_i32(cpu_R[14], s->base.pc_next | s->thumb);
>  store_cpu_field_constant(!s->thumb, thumb);
> -gen_jmp(s, (read_pc(s) & ~3) + a->imm);
> +/* This difference computes a page offset so ok for TARGET_TB_PCREL. */
> +gen_jmp(s, (read_pc(s) & ~3) - s->pc_curr + a->imm);

Could we just calculate the offset of the jump target instead?
read_pc() returns s->pc_curr + a constant, so the s->pc_curr cancels
out anyway:

   (read_pc(s) & ~3) - s->pc_curr + a->imm
== (pc_curr + (s->thumb ? 4 : 8) & ~3) - pc_curr + imm
== pc_curr - pc_curr_low_bits - pc_curr + 4-or-8 + imm
== imm + 4-or-8 - low_bits_of_pc

That's then more obviously not dependent on the absolute value
of the PC.

-- PMM



Re: [PATCH v7 13/18] accel/tcg: Do not align tb->page_addr[0]

2022-10-04 Thread Alex Bennée


Richard Henderson  writes:

> Let tb->page_addr[0] contain the address of the first byte of the
> translated block, rather than the address of the page containing the
> start of the translated block.  We need to recover this value anyway
> at various points, and it is easier to discard a page offset when it
> is not needed, which happens naturally via the existing find_page shift.
>
> Signed-off-by: Richard Henderson 

Reviewed-by: Alex Bennée 

-- 
Alex Bennée



Re: [PATCH] docs/nuvoton: Update URL for images

2022-10-04 Thread Hao Wu
On Mon, Oct 3, 2022 at 10:01 PM Joel Stanley  wrote:

> openpower.xyz was retired some time ago. The OpenBMC Jenkins is where
> images can be found these days.
>
> Signed-off-by: Joel Stanley 
>
Reviewed-by: Hao Wu 

> ---
>  docs/system/arm/nuvoton.rst | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/docs/system/arm/nuvoton.rst b/docs/system/arm/nuvoton.rst
> index ef2792076aa8..c38df32bde07 100644
> --- a/docs/system/arm/nuvoton.rst
> +++ b/docs/system/arm/nuvoton.rst
> @@ -82,9 +82,9 @@ Boot options
>
>  The Nuvoton machines can boot from an OpenBMC firmware image, or directly
> into
>  a kernel using the ``-kernel`` option. OpenBMC images for ``quanta-gsj``
> and
> -possibly others can be downloaded from the OpenPOWER jenkins :
> +possibly others can be downloaded from the OpenBMC jenkins :
>
> -   https://openpower.xyz/
> +   https://jenkins.openbmc.org/
>
>  The firmware image should be attached as an MTD drive. Example :
>
> --
> 2.35.1
>
>
>


Re: [PULL 52/54] contrib/gitdm: add Simon to individual contributors

2022-10-04 Thread Simon Safar
On Tue, Oct 4, 2022, at 6:01 AM, Alex Bennée wrote:
> Please confirm this is the correct mapping for you.

Looks good to me, thank you!!!

> Signed-off-by: Alex Bennée 
> Reviewed-by: Simon Safar 
> Message-Id: <20220926134609.3301945-2-alex.ben...@linaro.org>
> 
> diff --git a/contrib/gitdm/group-map-individuals b/contrib/gitdm/group-map-individuals
> index e19d79626c..53883cc526 100644
> --- a/contrib/gitdm/group-map-individuals
> +++ b/contrib/gitdm/group-map-individuals
> @@ -36,3 +36,4 @@ chetan4wind...@gmail.com
>  akihiko.od...@gmail.com
>  p...@nowt.org
>  g...@xen0n.name
> +si...@simonsafar.com
> -- 
> 2.34.1
> 
> 


Re: [PATCH] ui/cocoa: Support hardware cursor interface

2022-10-04 Thread Peter Maydell
Ccing Akihiko to see if he wants to review this cocoa ui frontend
patch.

also available at:
https://lore.kernel.org/qemu-devel/54930451-d85f-4ce0-9a45-b3478c5a6...@www.fastmail.com/

I can confirm that the patch does build, but I don't have any
interesting graphics-using test images to hand to test with.

thanks
-- PMM

On Thu, 4 Aug 2022 at 07:28, Elliot Nunn  wrote:
>
> Implement dpy_cursor_define() and dpy_mouse_set() on macOS.
>
> The main benefit is from dpy_cursor_define: in absolute pointing mode, the
> host can redraw the cursor on the guest's behalf much faster than the guest
> can itself.
>
> To provide the programmatic movement expected from a hardware cursor,
> dpy_mouse_set is also implemented.
>
> Tricky cases are handled:
> - dpy_mouse_set() avoids rounded window corners.
> - The sometimes-delay between warping the cursor and an affected mouse-move
>   event is accounted for.
> - Cursor bitmaps are nearest-neighbor scaled to Retina size.
>
> Signed-off-by: Elliot Nunn 
> ---
>  ui/cocoa.m | 263 -
>  1 file changed, 240 insertions(+), 23 deletions(-)
>
> diff --git a/ui/cocoa.m b/ui/cocoa.m
> index 5a8bd5dd84..f9d54448e4 100644
> --- a/ui/cocoa.m
> +++ b/ui/cocoa.m
> @@ -85,12 +85,20 @@ static void cocoa_switch(DisplayChangeListener *dcl,
>
>  static void cocoa_refresh(DisplayChangeListener *dcl);
>
> +static void cocoa_mouse_set(DisplayChangeListener *dcl,
> +int x, int y, int on);
> +
> +static void cocoa_cursor_define(DisplayChangeListener *dcl,
> +QEMUCursor *c);
> +
>  static NSWindow *normalWindow;
>  static const DisplayChangeListenerOps dcl_ops = {
>  .dpy_name  = "cocoa",
>  .dpy_gfx_update = cocoa_update,
>  .dpy_gfx_switch = cocoa_switch,
>  .dpy_refresh = cocoa_refresh,
> +.dpy_mouse_set = cocoa_mouse_set,
> +.dpy_cursor_define = cocoa_cursor_define,
>  };
>  static DisplayChangeListener dcl = {
>  .ops = &dcl_ops,
> @@ -313,6 +321,13 @@ @interface QemuCocoaView : NSView
>  BOOL isFullscreen;
>  BOOL isAbsoluteEnabled;
>  CFMachPortRef eventsTap;
> +NSCursor *guestCursor;
> +BOOL cursorHiddenByMe;
> +BOOL guestCursorVis;
> +int guestCursorX, guestCursorY;
> +int lastWarpX, lastWarpY;
> +int warpDeltaX, warpDeltaY;
> +BOOL ignoreNextMouseMove;
>  }
>  - (void) switchSurface:(pixman_image_t *)image;
>  - (void) grabMouse;
> @@ -323,6 +338,10 @@ - (void) handleMonitorInput:(NSEvent *)event;
>  - (bool) handleEvent:(NSEvent *)event;
>  - (bool) handleEventLocked:(NSEvent *)event;
>  - (void) setAbsoluteEnabled:(BOOL)tIsAbsoluteEnabled;
> +- (void) cursorDefine:(NSCursor *)cursor;
> +- (void) mouseSetX:(int)x Y:(int)y on:(int)on;
> +- (void) setCursorAppearance;
> +- (void) setCursorPosition;
>  /* The state surrounding mouse grabbing is potentially confusing.
>   * isAbsoluteEnabled tracks qemu_input_is_absolute() [ie "is the emulated
>   *   pointing device an absolute-position one?"], but is only updated on
> @@ -432,22 +451,6 @@ - (CGPoint) screenLocationOfEvent:(NSEvent *)ev
>  }
>  }
>
> -- (void) hideCursor
> -{
> -if (!cursor_hide) {
> -return;
> -}
> -[NSCursor hide];
> -}
> -
> -- (void) unhideCursor
> -{
> -if (!cursor_hide) {
> -return;
> -}
> -[NSCursor unhide];
> -}
> -
>  - (void) drawRect:(NSRect) rect
>  {
>  COCOA_DEBUG("QemuCocoaView: drawRect\n");
> @@ -635,6 +638,8 @@ - (void) switchSurface:(pixman_image_t *)image
>  screen.height = h;
>  [self setContentDimensions];
>  [self setFrame:NSMakeRect(cx, cy, cw, ch)];
> +[self setCursorAppearance];
> +[self setCursorPosition];
>  }
>
>  // update screenBuffer
> @@ -681,6 +686,7 @@ - (void) toggleFullScreen:(id)sender
>  styleMask:NSWindowStyleMaskBorderless
>  backing:NSBackingStoreBuffered
>  defer:NO];
> +[fullScreenWindow disableCursorRects];
>  [fullScreenWindow setAcceptsMouseMovedEvents: YES];
>  [fullScreenWindow setHasShadow:NO];
>  [fullScreenWindow setBackgroundColor: [NSColor blackColor]];
> @@ -812,6 +818,7 @@ - (bool) handleEventLocked:(NSEvent *)event
>  int buttons = 0;
>  int keycode = 0;
>  bool mouse_event = false;
> +bool mousemoved_event = false;
>  // Location of event in virtual screen coordinates
>  NSPoint p = [self screenLocationOfEvent:event];
>  NSUInteger modifiers = [event modifierFlags];
> @@ -1023,6 +1030,7 @@ - (bool) handleEventLocked:(NSEvent *)event
>  }
>  }
>  mouse_event = true;
> +mousemoved_event = true;
>  break;
>  case NSEventTypeLeftMouseDown:
>  buttons |= MOUSE_EVENT_LBUTTON;
> @@ -1039,14 +1047,17 @@ - (bool) handleEventLocked:(NSEvent *)event
>  case NSEventTypeLeftMouseDragged:
>  buttons |= 

Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-10-04 Thread Fuad Tabba
Hi,

On Mon, Oct 3, 2022 at 12:01 PM Kirill A. Shutemov  wrote:
>
> On Mon, Oct 03, 2022 at 08:33:13AM +0100, Fuad Tabba wrote:
> > > I think it is "don't do that" category. inaccessible_register_notifier()
> > > caller has to know what file it operates on, no?
> >
> > The thing is, you could oops the kernel from userspace. For that, all
> > you have to do is a memfd_create without the MFD_INACCESSIBLE,
> > followed by a KVM_SET_USER_MEMORY_REGION using that as the private_fd.
> > I ran into this using my port of this patch series to arm64.
>
> My point is that it has to be handled on a different level. KVM has to
> reject private_fd if it is not inaccessible. It should be trivial by
> checking file->f_inode->i_sb->s_magic.

Yes, that makes sense.
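Presumably something like this on the KVM side (just a sketch; the magic
constant name is an assumption, not an existing symbol):

    static bool file_is_inaccessible(struct file *file)
    {
        return file->f_inode->i_sb->s_magic == INACCESSIBLE_MAGIC;
    }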

Thanks,
/fuad

> --
>   Kiryl Shutsemau / Kirill A. Shutemov



Re: [PATCH v7 17/18] accel/tcg: Introduce tb_pc and log_pc

2022-10-04 Thread Alex Bennée


Richard Henderson  writes:

> The availability of tb->pc will shortly be conditional.
> Introduce accessor functions to minimize ifdefs.
>
> Pass around a known pc to places like tcg_gen_code,
> where the caller must already have the value.
>
> Signed-off-by: Richard Henderson 

Reviewed-by: Alex Bennée 

-- 
Alex Bennée



Re: [PATCH 1/5] migration: Fix possible deadloop of ram save process

2022-10-04 Thread Dr. David Alan Gilbert
* Peter Xu (pet...@redhat.com) wrote:
> On Thu, Sep 22, 2022 at 05:41:30PM +0100, Dr. David Alan Gilbert wrote:
> > * Peter Xu (pet...@redhat.com) wrote:
> > > On Thu, Sep 22, 2022 at 03:49:38PM +0100, Dr. David Alan Gilbert wrote:
> > > > * Peter Xu (pet...@redhat.com) wrote:
> > > > > When starting ram saving procedure (especially at the completion 
> > > > > phase),
> > > > > always set last_seen_block to non-NULL to make sure we can always 
> > > > > correctly
> > > > > detect the case where "we've migrated all the dirty pages".
> > > > > 
> > > > > Then we'll guarantee both last_seen_block and pss.block will be valid
> > > > > always before the loop starts.
> > > > > 
> > > > > See the comment in the code for some details.
> > > > > 
> > > > > Signed-off-by: Peter Xu 
> > > > 
> > > > Yeh I guess it can currently only happen during restart?
> > > 
> > > There're only two places to clear last_seen_block:
> > > 
> > > ram_state_reset[2683]  rs->last_seen_block = NULL;
> > > ram_postcopy_send_discard_bitmap[2876] rs->last_seen_block = NULL;
> > > 
> > > Where for the reset case:
> > > 
> > > ram_state_init[2994]   ram_state_reset(*rsp);
> > > ram_state_resume_prepare[3110] ram_state_reset(rs);
> > > ram_save_iterate[3271] ram_state_reset(rs);
> > > 
> > > So I think it can at least happen in two places, either (1) postcopy just
> > > started (assume when postcopy starts accidentally when all dirty pages 
> > > were
> > > migrated?), or (2) postcopy recover from failure.
> > 
> > Oh, (1) is a more general problem then; yeh.
> > 
> > > In my case I triggered this deadloop when I was debugging the other bug
> > > fixed by the next patch where it was postcopy recovery (on tls), but only
> > > once..  So currently I'm still not 100% sure whether this is the same
> > > problem, but logically it could trigger.
> > > 
> > > I also remember I used to hit very rare deadloops before too, maybe 
> > > they're
> > > the same thing because I did test recovery a lot.
> > 
> > Note; 'deadlock' not 'deadloop'.
> 
> (Oops I somehow forgot there's still this series pending..)
> 
> Here it's not about a lock, or maybe I should add a space ("dead loop")?

So the normal phrases I'm used to are:
  'deadlock' - two threads waiting for each other
  'livelock' - two threads spinning for each other

Dave

> -- 
> Peter Xu
> 
-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK




Re: [PATCH] target/arm: allow setting SCR_EL3.EnTP2 when FEAT_SME is implemented

2022-10-04 Thread Peter Maydell
On Tue, 4 Oct 2022 at 08:24, Jerome Forissier
 wrote:
>
> Updates write_scr() to allow setting SCR_EL3.EnTP2 when FEAT_SME is
> implemented. SCR_EL3 being a 64-bit register, valid_mask is changed
> to uint64_t and the SCR_* constants in target/arm/cpu.h are extended
> to 64-bit so that masking and bitwise not (~) behave as expected.
>
> This enables booting Linux with Trusted Firmware-A at EL3 with
> "-M virt,secure=on -cpu max".
>
> Cc: qemu-sta...@nongnu.org
> Fixes: 78cb9776662a ("target/arm: Enable SME for -cpu max")
> Signed-off-by: Jerome Forissier 



Applied to target-arm.next, thanks.

-- PMM



Re: [PATCH] linux-user,bsd-user: re-exec with G_SLICE=always-malloc

2022-10-04 Thread Richard Henderson

On 10/4/22 05:00, Daniel P. Berrangé wrote:

g_slice uses a one-time initializer to check the G_SLICE env variable
making it hard for QEMU to set the env before any GLib API call has
triggered the initializer. Even attribute((constructor)) is not
sufficient as QEMU has many constructors and there is no ordering
guarantee between them.


There are orderings for constructors, see 
__attribute__((constructor(priority))).
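A minimal sketch (priorities 101 and up are available to applications; among
constructors, lower numbers run first):

    __attribute__((constructor(101))) static void set_env_early(void)
    {
        setenv("G_SLICE", "always-malloc", 1);  /* runs before... */
    }

    __attribute__((constructor(200))) static void glib_user_init(void)
    {
        /* ...any constructor with a higher priority number */
    }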


r~



Re: [PATCH] target/arm: allow setting SCR_EL3.EnTP2 when FEAT_SME is implemented

2022-10-04 Thread Richard Henderson

On 10/4/22 00:23, Jerome Forissier wrote:

Updates write_scr() to allow setting SCR_EL3.EnTP2 when FEAT_SME is
implemented. SCR_EL3 being a 64-bit register, valid_mask is changed
to uint64_t and the SCR_* constants in target/arm/cpu.h are extended
to 64-bit so that masking and bitwise not (~) behave as expected.

This enables booting Linux with Trusted Firmware-A at EL3 with
"-M virt,secure=on -cpu max".

Cc:qemu-sta...@nongnu.org
Fixes: 78cb9776662a ("target/arm: Enable SME for -cpu max")
Signed-off-by: Jerome Forissier
---
  target/arm/cpu.h| 54 ++---
  target/arm/helper.c |  5 -
  2 files changed, 31 insertions(+), 28 deletions(-)


Whoops, sorry about that.

Reviewed-by: Richard Henderson 


r~



Re: [PATCH v8 8/8] KVM: Enable and expose KVM_MEM_PRIVATE

2022-10-04 Thread Jarkko Sakkinen
On Thu, Sep 15, 2022 at 10:29:13PM +0800, Chao Peng wrote:
> Expose KVM_MEM_PRIVATE and memslot fields private_fd/offset to
> userspace. KVM will register/unregister private memslot to fd-based
> memory backing store and response to invalidation event from
> inaccessible_notifier to zap the existing memory mappings in the
> secondary page table.
> 
> Whether KVM_MEM_PRIVATE is actually exposed to userspace is determined
> by architecture code which can turn on it by overriding the default
> kvm_arch_has_private_mem().
> 
> A 'kvm' reference is added to the memslot structure since in the
> inaccessible_notifier callback we can only obtain a memslot reference,
> but 'kvm' is needed to do the zapping.
> 
> Co-developed-by: Yu Zhang 
> Signed-off-by: Yu Zhang 
> Signed-off-by: Chao Peng 

ld: arch/x86/../../virt/kvm/kvm_main.o: in function `kvm_free_memslot':
kvm_main.c:(.text+0x1385): undefined reference to 
`inaccessible_unregister_notifier'
ld: arch/x86/../../virt/kvm/kvm_main.o: in function `kvm_set_memslot':
kvm_main.c:(.text+0x1b86): undefined reference to 
`inaccessible_register_notifier'
ld: kvm_main.c:(.text+0x1c85): undefined reference to 
`inaccessible_unregister_notifier'
ld: arch/x86/kvm/mmu/mmu.o: in function `kvm_faultin_pfn':
mmu.c:(.text+0x1e38): undefined reference to `inaccessible_get_pfn'
ld: arch/x86/kvm/mmu/mmu.o: in function `direct_page_fault':
mmu.c:(.text+0x67ca): undefined reference to `inaccessible_put_pfn'
make: *** [Makefile:1169: vmlinux] Error 1

I attached kernel config for reproduction.

The problem is that CONFIG_MEMFD_CREATE does not get enabled:

mm/Makefile:obj-$(CONFIG_MEMFD_CREATE) += memfd.o memfd_inaccessible.o
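One way out, presumably, is to have the KVM private-mem support select that
option (an untested sketch):

    config KVM
            ...
            select MEMFD_CREATE  # ensures memfd_inaccessible.o is built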

BR, Jarkko
#
# Automatically generated file; DO NOT EDIT.
# Linux/x86 6.0.0 Kernel Configuration
#
CONFIG_CC_VERSION_TEXT="gcc (GCC) 12.2.1 20220819 (Red Hat 12.2.1-2)"
CONFIG_CC_IS_GCC=y
CONFIG_GCC_VERSION=120201
CONFIG_CLANG_VERSION=0
CONFIG_AS_IS_GNU=y
CONFIG_AS_VERSION=23700
CONFIG_LD_IS_BFD=y
CONFIG_LD_VERSION=23700
CONFIG_LLD_VERSION=0
CONFIG_CC_CAN_LINK=y
CONFIG_CC_CAN_LINK_STATIC=y
CONFIG_CC_HAS_ASM_GOTO_OUTPUT=y
CONFIG_CC_HAS_ASM_GOTO_TIED_OUTPUT=y
CONFIG_CC_HAS_ASM_INLINE=y
CONFIG_CC_HAS_NO_PROFILE_FN_ATTR=y
CONFIG_PAHOLE_VERSION=123
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_TABLE_SORT=y
CONFIG_THREAD_INFO_IN_TASK=y

#
# General setup
#
CONFIG_BROKEN_ON_SMP=y
CONFIG_INIT_ENV_ARG_LIMIT=32
# CONFIG_COMPILE_TEST is not set
# CONFIG_WERROR is not set
CONFIG_LOCALVERSION=""
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_BUILD_SALT=""
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_HAVE_KERNEL_LZO=y
CONFIG_HAVE_KERNEL_LZ4=y
CONFIG_HAVE_KERNEL_ZSTD=y
# CONFIG_KERNEL_GZIP is not set
# CONFIG_KERNEL_BZIP2 is not set
# CONFIG_KERNEL_LZMA is not set
CONFIG_KERNEL_XZ=y
# CONFIG_KERNEL_LZO is not set
# CONFIG_KERNEL_LZ4 is not set
# CONFIG_KERNEL_ZSTD is not set
CONFIG_DEFAULT_INIT=""
CONFIG_DEFAULT_HOSTNAME="(none)"
# CONFIG_SYSVIPC is not set
# CONFIG_WATCH_QUEUE is not set
# CONFIG_CROSS_MEMORY_ATTACH is not set
# CONFIG_USELIB is not set
CONFIG_HAVE_ARCH_AUDITSYSCALL=y

#
# IRQ subsystem
#
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_HARDIRQS_SW_RESEND=y
CONFIG_IRQ_DOMAIN=y
CONFIG_IRQ_DOMAIN_HIERARCHY=y
CONFIG_GENERIC_IRQ_MATRIX_ALLOCATOR=y
CONFIG_GENERIC_IRQ_RESERVATION_MODE=y
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_SPARSE_IRQ=y
# end of IRQ subsystem

CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_ARCH_CLOCKSOURCE_INIT=y
CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_HAVE_POSIX_CPU_TIMERS_TASK_WORK=y

#
# Timers subsystem
#
CONFIG_TICK_ONESHOT=y
CONFIG_HZ_PERIODIC=y
# CONFIG_NO_HZ_IDLE is not set
# CONFIG_NO_HZ is not set
CONFIG_HIGH_RES_TIMERS=y
CONFIG_CLOCKSOURCE_WATCHDOG_MAX_SKEW_US=100
# end of Timers subsystem

CONFIG_HAVE_EBPF_JIT=y
CONFIG_ARCH_WANT_DEFAULT_BPF_JIT=y

#
# BPF subsystem
#
# CONFIG_BPF_SYSCALL is not set
# end of BPF subsystem

CONFIG_PREEMPT_NONE_BUILD=y
CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
# CONFIG_PREEMPT_DYNAMIC is not set

#
# CPU/Task time and stats accounting
#
CONFIG_TICK_CPU_ACCOUNTING=y
# CONFIG_VIRT_CPU_ACCOUNTING_GEN is not set
# CONFIG_IRQ_TIME_ACCOUNTING is not set
# CONFIG_PSI is not set
# end of CPU/Task time and stats accounting

#
# RCU Subsystem
#
CONFIG_TINY_RCU=y
# CONFIG_RCU_EXPERT is not set
CONFIG_SRCU=y
CONFIG_TINY_SRCU=y
# end of RCU Subsystem

# CONFIG_IKCONFIG is not set
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y

#
# Scheduler features
#
# end of Scheduler features

CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH=y
CONFIG_CC_HAS_INT128=y
CONFIG_CC_IMPLICIT_FALLTHROUGH="-Wimplicit-fallthrough=5"
CONFIG_GCC12_NO_ARRAY_BOUNDS=y
CONFIG_CC_NO_ARRAY_BOUNDS=y
CONFIG_ARCH_SUPPORTS_INT128=y
# CONFIG_CGROUPS is not set
# 

Re: [PATCH v3 01/15] hw: encode accessing CPU index in MemTxAttrs

2022-10-04 Thread Peter Maydell
On Tue, 4 Oct 2022 at 14:33, Alex Bennée  wrote:
>
>
> Peter Maydell  writes:
> > The MSC is in the address map like most other stuff, and thus there is
> > no restriction on whether it can be accessed by other things than CPUs
> > (DMAing to it would be silly but is perfectly possible).
> >
> > The intent of the code is "pass this transaction through, but force
> > it to be Secure/NonSecure regardless of what it was before". That
> > should not involve a change of the requester type.
>
> Should we assert (or warn) when the requester_type is unspecified?

Not in the design of MemTxAttrs that's currently in git, no:
in that design it's perfectly fine for something generating
memory transactions to use MEMTXATTRS_UNSPECIFIED (which defaults
to meaning a bunch of things including "not secure").
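For instance, a DMA-capable device model may legitimately issue (a sketch):

    MemTxResult res = address_space_write(&address_space_memory, addr,
                                          MEMTXATTRS_UNSPECIFIED, buf, len);

and the MSC still has to handle that transaction.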

thanks
-- PMM



Re: [PATCH 1/5] migration: Fix possible deadloop of ram save process

2022-10-04 Thread Peter Xu
On Thu, Sep 22, 2022 at 05:41:30PM +0100, Dr. David Alan Gilbert wrote:
> * Peter Xu (pet...@redhat.com) wrote:
> > On Thu, Sep 22, 2022 at 03:49:38PM +0100, Dr. David Alan Gilbert wrote:
> > > * Peter Xu (pet...@redhat.com) wrote:
> > > > When starting ram saving procedure (especially at the completion phase),
> > > > always set last_seen_block to non-NULL to make sure we can always 
> > > > correctly
> > > > detect the case where "we've migrated all the dirty pages".
> > > > 
> > > > Then we'll guarantee both last_seen_block and pss.block will be valid
> > > > always before the loop starts.
> > > > 
> > > > See the comment in the code for some details.
> > > > 
> > > > Signed-off-by: Peter Xu 
> > > 
> > > Yeh I guess it can currently only happen during restart?
> > 
> > There're only two places to clear last_seen_block:
> > 
> > ram_state_reset[2683]  rs->last_seen_block = NULL;
> > ram_postcopy_send_discard_bitmap[2876] rs->last_seen_block = NULL;
> > 
> > Where for the reset case:
> > 
> > ram_state_init[2994]   ram_state_reset(*rsp);
> > ram_state_resume_prepare[3110] ram_state_reset(rs);
> > ram_save_iterate[3271] ram_state_reset(rs);
> > 
> > So I think it can at least happen in two places, either (1) postcopy just
> > started (assume when postcopy starts accidentally when all dirty pages were
> > migrated?), or (2) postcopy recover from failure.
> 
> Oh, (1) is a more general problem then; yeh.
> 
> > In my case I triggered this deadloop when I was debugging the other bug
> > fixed by the next patch where it was postcopy recovery (on tls), but only
> > once..  So currently I'm still not 100% sure whether this is the same
> > problem, but logically it could trigger.
> > 
> > I also remember I used to hit very rare deadloops before too, maybe they're
> > the same thing because I did test recovery a lot.
> 
> Note; 'deadlock' not 'deadloop'.

(Oops I somehow forgot there's still this series pending..)

Here it's not about a lock, or maybe I should add a space ("dead loop")?

-- 
Peter Xu




Re: [PATCH 04/14] migration: Remove RAMState.f references in compression code

2022-10-04 Thread Peter Xu
On Tue, Oct 04, 2022 at 11:54:02AM +0100, Dr. David Alan Gilbert wrote:
> * Peter Xu (pet...@redhat.com) wrote:
> > Remove references to RAMState.f in compress_page_with_multi_thread() and
> > flush_compressed_data().
> > 
> > Compression code by default isn't compatible with having >1 channels (or it
> > won't currently know which channel to flush the compressed data), so to
> > make it simple we always flush on the default to_dst_file port until
> > someone wants to add >1 ports support, as rs->f right now can really
> > change (after postcopy preempt is introduced).
> > 
> > There should be no functional change at all after patch applied, since as
> > long as rs->f referenced in compression code, it must be to_dst_file.
> > 
> > Signed-off-by: Peter Xu 
> 
> Yes, I think that's true - although I think we need to add checks to
> stop someone trying to enable compression+multifd?

We could.  I think they can be enabled and migration will just ignore the
multifd feature as ram_save_target_page() handles compress earlier.  Even
if save_compress_page() fails with -EBUSY we'll still skip multifd:

if (!save_page_use_compression(rs) && migrate_use_multifd()
&& !migration_in_postcopy()) {
return ram_save_multifd_page(rs, block, offset);
}

Explicitly disabling compression would still be cleaner; then we can even
drop the above check on save_page_use_compression().  There's a slight risk
of breaking someone's config, but I don't think that's the majority.

If that looks good, I can add one patch for it (probably in the other
patchset, though, which I'll repost today).

> 
> Reviewed-by: Dr. David Alan Gilbert 

Thanks!

-- 
Peter Xu




Re: [PATCH] target/arm: Implement FEAT_E0PD

2022-10-04 Thread Richard Henderson

On 10/4/22 04:05, Peter Maydell wrote:

FEAT_E0PD adds new bits E0PD0 and E0PD1 to TCR_EL1, which allow the
OS to forbid EL0 access to half of the address space.  Since this is
an EL0-specific variation on the existing TCR_ELx.{EPD0,EPD1}, we can
implement it entirely in aa64_va_parameters().

This requires moving the existing regime_is_user() to internals.h
so that the code in helper.c can get at it.

Signed-off-by: Peter Maydell
---
Based-on:20221003162315.2833797-1-peter.mayd...@linaro.org
("[PATCH v2 0/3] target/arm: Enforce implemented granule size limits")
but only to avoid textual conflicts.


Reviewed-by: Richard Henderson 


r~



Re: [PATCH v3] tcg/ppc: Optimize 26-bit jumps

2022-10-04 Thread Richard Henderson

On 10/3/22 11:32, Leandro Lupori wrote:

On 9/19/22 14:56, Leandro Lupori wrote:

PowerPC64 processors handle direct branches better than indirect
ones, resulting in less stalled cycles and branch misses.

However, PPC's tb_target_set_jmp_target() was only using direct
branches for 16-bit jumps, while PowerPC64's unconditional branch
instructions are able to handle displacements of up to 26 bits.
To take advantage of this, now jumps whose displacements fit in
between 17 and 26 bits are also converted to direct branches.

Signed-off-by: Leandro Lupori 
---
v3:
   - make goto tb code 16-byte aligned
   - code cleanup
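For reference, the fits-in-26-bits test boils down to a sign-extension check
on the branch displacement, roughly (a sketch of the idea, not the exact
tcg/ppc code):

    ptrdiff_t disp = addr - jmp_rx;
    if (disp == sextract64(disp, 0, 26)) {
        /* patch in a direct I-form branch ("b disp") */
    } else {
        /* fall back to an indirect branch via mtctr/bctr */
    }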



Ping?

Does v3 look good now?


It does, thanks.  I will queue it for tcg-next.


r~


