Re: [RFC v3 1/5] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls.

2022-06-28 Thread Sam Li
Damien Le Moal  于2022年6月29日周三 10:32写道:
>
> On 6/29/22 10:50, Sam Li wrote:
> >>> +rep_size = sizeof(struct blk_zone_report) + nrz * sizeof(struct 
> >>> blk_zone);
> >>> +g_autofree struct blk_zone_report *rep = g_new(struct 
> >>> blk_zone_report, nrz);
> >>
> >> g_new() looks incorrect. There should be 1 struct blk_zone_report
> >> followed by nrz struct blk_zone structs. Please use g_malloc(rep_size)
> >> instead.
> >
> > Yes! However, it still has a memory leak error when using g_autofree
> > && g_malloc.
> 
>  That may be because you are changing the value of the rep pointer while
>  parsing the report ?
> >>>
> >>> I am not sure it is the case. Can you show me some way to find the 
> >>> problem?
> >>
> >> Not sure. I never used this g_malloc()/g_autofree() before so not sure how
> >> it works. It may be that g_autofree() work only with g_new() ?
> >> Could you try separating the declaration and allocation ? e.g.
> >>
> >> Declare at the beginning of the function:
> >> g_autofree struct blk_zone_report *rep = NULL;
> >>
> >> And then when needed do:
> >>
> >> rep_size = sizeof(struct blk_zone_report) + nrz * sizeof(struct blk_zone);
> >> rep = g_malloc(rep_size);
> >
> > Actually, the memory leak occurs in that way. When using zone_mgmt,
> > memory leak still occurs. Asan gives the error information not much so
> > I haven't tracked down the problem yet.
>
> See this:
>
> https://blog.fishsoup.net/2015/11/05/attributecleanup-mixed-declarations-and-code-and-goto/
>
> Maybe you can find some hints.

Thanks!

>
> --
> Damien Le Moal
> Western Digital Research



Re: [RFC v3 1/5] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls.

2022-06-28 Thread Damien Le Moal
On 6/29/22 10:50, Sam Li wrote:
>>> +rep_size = sizeof(struct blk_zone_report) + nrz * sizeof(struct 
>>> blk_zone);
>>> +g_autofree struct blk_zone_report *rep = g_new(struct 
>>> blk_zone_report, nrz);
>>
>> g_new() looks incorrect. There should be 1 struct blk_zone_report
>> followed by nrz struct blk_zone structs. Please use g_malloc(rep_size)
>> instead.
>
> Yes! However, it still has a memory leak error when using g_autofree
> && g_malloc.

 That may be because you are changing the value of the rep pointer while
 parsing the report ?
>>>
>>> I am not sure it is the case. Can you show me some way to find the problem?
>>
>> Not sure. I never used this g_malloc()/g_autofree() before so not sure how
>> it works. It may be that g_autofree() work only with g_new() ?
>> Could you try separating the declaration and allocation ? e.g.
>>
>> Declare at the beginning of the function:
>> g_autofree struct blk_zone_report *rep = NULL;
>>
>> And then when needed do:
>>
>> rep_size = sizeof(struct blk_zone_report) + nrz * sizeof(struct blk_zone);
>> rep = g_malloc(rep_size);
> 
> Actually, the memory leak occurs in that way. When using zone_mgmt,
> memory leak still occurs. Asan gives the error information not much so
> I haven't tracked down the problem yet.

See this:

https://blog.fishsoup.net/2015/11/05/attributecleanup-mixed-declarations-and-code-and-goto/

Maybe you can find some hints.

-- 
Damien Le Moal
Western Digital Research



Re: [RFC v3 1/5] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls.

2022-06-28 Thread Sam Li
Damien Le Moal  于2022年6月29日周三 09:43写道:
>
> On 6/28/22 19:23, Sam Li wrote:
> > Damien Le Moal  于2022年6月28日周二 17:05写道:
> >>
> >> On 6/28/22 17:05, Sam Li wrote:
> >>> Stefan Hajnoczi  于2022年6月28日周二 14:52写道:
> 
>  On Mon, Jun 27, 2022 at 08:19:13AM +0800, Sam Li wrote:
> > diff --git a/block/block-backend.c b/block/block-backend.c
> > index e0e1aff4b1..786f964d02 100644
> > --- a/block/block-backend.c
> > +++ b/block/block-backend.c
> > @@ -1810,6 +1810,62 @@ int blk_flush(BlockBackend *blk)
> >  return ret;
> >  }
> >
> > +/*
> > + * Return zone_report from BlockDriver. Offset can be any number within
> > + * the zone size. No alignment for offset and len.
> 
>  What is the purpose of len? Is it the maximum number of zones to return
>  in nr_zones[]?
> >>>
> >>> len is actually not used in bdrv_co_zone_report. It is needed by
> >>> blk_check_byte_request.
> >>
> >> There is no IO buffer being passed. Only an array of zone descriptors and
> >> an in-out number of zones. len is definitely not needed. Not sure what
> >> blk_check_byte_request() is intended to check though.
> >
> > Can I just remove len argument and blk_check_byte_request() if it's not 
> > used?
>
> If it is unused, yes, you must remove it.
>
> >
> >>
> >>>
>  How does the caller know how many zones were returned?
> >>>
> >>> nr_zones represents IN maximum and OUT actual. The caller will know by
> >>> nr_zones which is changed in bdrv_co_zone_report. I'll add it in the
> >>> comments.
> >>>
> 
> > + */
> > +int coroutine_fn blk_co_zone_report(BlockBackend *blk, int64_t offset,
> > +   int64_t len, int64_t *nr_zones,
> > +   BlockZoneDescriptor *zones)
> > +{
> > +int ret;
> > +BlockDriverState *bs;
> > +IO_CODE();
> > +
> > +blk_inc_in_flight(blk); /* increase before waiting */
> > +blk_wait_while_drained(blk);
> > +bs = blk_bs(blk);
> > +
> > +ret = blk_check_byte_request(blk, offset, len);
> > +if (ret < 0) {
> > +return ret;
> > +}
> > +
> > +bdrv_inc_in_flight(bs);
> 
>  The bdrv_inc/dec_in_flight() call should be inside
>  bdrv_co_zone_report(). See bdrv_co_ioctl() for an example.
> 
> > +ret = bdrv_co_zone_report(blk->root->bs, offset, len,
> > +  nr_zones, zones);
> > +bdrv_dec_in_flight(bs);
> > +blk_dec_in_flight(blk);
> > +return ret;
> > +}
> > +
> > +/*
> > + * Return zone_mgmt from BlockDriver.
> 
>  Maybe this should be:
> 
>    Send a zone management command.
> >>>
> >>> Yes, it's more accurate.
> >>>
> 
> > @@ -216,6 +217,11 @@ typedef struct RawPosixAIOData {
> >  PreallocMode prealloc;
> >  Error **errp;
> >  } truncate;
> > +struct {
> > +int64_t *nr_zones;
> > +BlockZoneDescriptor *zones;
> > +} zone_report;
> > +zone_op op;
> 
>  It's cleaner to put op inside a struct zone_mgmt so its purpose is
>  self-explanatory:
> 
>    struct {
>    zone_op op;
>    } zone_mgmt;
> 
> > +static int handle_aiocb_zone_report(void *opaque) {
> > +RawPosixAIOData *aiocb = opaque;
> > +int fd = aiocb->aio_fildes;
> > +int64_t *nr_zones = aiocb->zone_report.nr_zones;
> > +BlockZoneDescriptor *zones = aiocb->zone_report.zones;
> > +int64_t offset = aiocb->aio_offset;
> > +int64_t len = aiocb->aio_nbytes;
> >>
> >> Since you have the zone array and number of zones (size of that array) I
> >> really do not see why you need len.
> >>
> > +
> > +struct blk_zone *blkz;
> > +int64_t rep_size, nrz;
> > +int ret, n = 0, i = 0;
> > +
> > +nrz = *nr_zones;
> > +if (len == -1) {
> > +return -errno;
> 
>  Where is errno set? Should this be an errno constant instead like
>  -EINVAL?
> >>>
> >>> That's right! Noted.
> >>>
> 
> > +}
> > +rep_size = sizeof(struct blk_zone_report) + nrz * sizeof(struct 
> > blk_zone);
> > +g_autofree struct blk_zone_report *rep = g_new(struct 
> > blk_zone_report, nrz);
> 
>  g_new() looks incorrect. There should be 1 struct blk_zone_report
>  followed by nrz struct blk_zone structs. Please use g_malloc(rep_size)
>  instead.
> >>>
> >>> Yes! However, it still has a memory leak error when using g_autofree
> >>> && g_malloc.
> >>
> >> That may be because you are changing the value of the rep pointer while
> >> parsing the report ?
> >
> > I am not sure it is the case. Can you show me some way to find the problem?
>
> Not sure. I never used this g_malloc()/g_autofree() before so not sure how
> it works. It may be that g_autofree() work only with g_new() ?
> 

Re: [RFC v3 1/5] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls.

2022-06-28 Thread Damien Le Moal
On 6/28/22 19:23, Sam Li wrote:
> Damien Le Moal  于2022年6月28日周二 17:05写道:
>>
>> On 6/28/22 17:05, Sam Li wrote:
>>> Stefan Hajnoczi  于2022年6月28日周二 14:52写道:

 On Mon, Jun 27, 2022 at 08:19:13AM +0800, Sam Li wrote:
> diff --git a/block/block-backend.c b/block/block-backend.c
> index e0e1aff4b1..786f964d02 100644
> --- a/block/block-backend.c
> +++ b/block/block-backend.c
> @@ -1810,6 +1810,62 @@ int blk_flush(BlockBackend *blk)
>  return ret;
>  }
>
> +/*
> + * Return zone_report from BlockDriver. Offset can be any number within
> + * the zone size. No alignment for offset and len.

 What is the purpose of len? Is it the maximum number of zones to return
 in nr_zones[]?
>>>
>>> len is actually not used in bdrv_co_zone_report. It is needed by
>>> blk_check_byte_request.
>>
>> There is no IO buffer being passed. Only an array of zone descriptors and
>> an in-out number of zones. len is definitely not needed. Not sure what
>> blk_check_byte_request() is intended to check though.
> 
> Can I just remove len argument and blk_check_byte_request() if it's not used?

If it is unused, yes, you must remove it.

> 
>>
>>>
 How does the caller know how many zones were returned?
>>>
>>> nr_zones represents IN maximum and OUT actual. The caller will know by
>>> nr_zones which is changed in bdrv_co_zone_report. I'll add it in the
>>> comments.
>>>

> + */
> +int coroutine_fn blk_co_zone_report(BlockBackend *blk, int64_t offset,
> +   int64_t len, int64_t *nr_zones,
> +   BlockZoneDescriptor *zones)
> +{
> +int ret;
> +BlockDriverState *bs;
> +IO_CODE();
> +
> +blk_inc_in_flight(blk); /* increase before waiting */
> +blk_wait_while_drained(blk);
> +bs = blk_bs(blk);
> +
> +ret = blk_check_byte_request(blk, offset, len);
> +if (ret < 0) {
> +return ret;
> +}
> +
> +bdrv_inc_in_flight(bs);

 The bdrv_inc/dec_in_flight() call should be inside
 bdrv_co_zone_report(). See bdrv_co_ioctl() for an example.

> +ret = bdrv_co_zone_report(blk->root->bs, offset, len,
> +  nr_zones, zones);
> +bdrv_dec_in_flight(bs);
> +blk_dec_in_flight(blk);
> +return ret;
> +}
> +
> +/*
> + * Return zone_mgmt from BlockDriver.

 Maybe this should be:

   Send a zone management command.
>>>
>>> Yes, it's more accurate.
>>>

> @@ -216,6 +217,11 @@ typedef struct RawPosixAIOData {
>  PreallocMode prealloc;
>  Error **errp;
>  } truncate;
> +struct {
> +int64_t *nr_zones;
> +BlockZoneDescriptor *zones;
> +} zone_report;
> +zone_op op;

 It's cleaner to put op inside a struct zone_mgmt so its purpose is
 self-explanatory:

   struct {
   zone_op op;
   } zone_mgmt;

> +static int handle_aiocb_zone_report(void *opaque) {
> +RawPosixAIOData *aiocb = opaque;
> +int fd = aiocb->aio_fildes;
> +int64_t *nr_zones = aiocb->zone_report.nr_zones;
> +BlockZoneDescriptor *zones = aiocb->zone_report.zones;
> +int64_t offset = aiocb->aio_offset;
> +int64_t len = aiocb->aio_nbytes;
>>
>> Since you have the zone array and number of zones (size of that array) I
>> really do not see why you need len.
>>
> +
> +struct blk_zone *blkz;
> +int64_t rep_size, nrz;
> +int ret, n = 0, i = 0;
> +
> +nrz = *nr_zones;
> +if (len == -1) {
> +return -errno;

 Where is errno set? Should this be an errno constant instead like
 -EINVAL?
>>>
>>> That's right! Noted.
>>>

> +}
> +rep_size = sizeof(struct blk_zone_report) + nrz * sizeof(struct 
> blk_zone);
> +g_autofree struct blk_zone_report *rep = g_new(struct 
> blk_zone_report, nrz);

 g_new() looks incorrect. There should be 1 struct blk_zone_report
 followed by nrz struct blk_zone structs. Please use g_malloc(rep_size)
 instead.
>>>
>>> Yes! However, it still has a memory leak error when using g_autofree
>>> && g_malloc.
>>
>> That may be because you are changing the value of the rep pointer while
>> parsing the report ?
> 
> I am not sure it is the case. Can you show me some way to find the problem?

Not sure. I never used this g_malloc()/g_autofree() before so not sure how
it works. It may be that g_autofree() work only with g_new() ?
Could you try separating the declaration and allocation ? e.g.

Declare at the beginning of the function:
g_autofree struct blk_zone_report *rep = NULL;

And then when needed do:

rep_size = sizeof(struct blk_zone_report) + nrz * sizeof(struct blk_zone);
rep = g_malloc(rep_size);

> 
> Thanks for reviewing!
> 
> 
>>
>>>

> 

Re: [PATCH v7 10/18] jobs: rename static functions called with job_mutex held

2022-06-28 Thread Vladimir Sementsov-Ogievskiy

On 6/28/22 20:28, Emanuele Giuseppe Esposito wrote:



Am 28/06/2022 um 17:26 schrieb Vladimir Sementsov-Ogievskiy:

On 6/28/22 18:22, Vladimir Sementsov-Ogievskiy wrote:

On 6/28/22 16:04, Emanuele Giuseppe Esposito wrote:

Ok so far I did the following:

- duplicated each public function as static {function}_locked()

They shouldn't be duplicates: function without _locked suffix should
take the mutex.

By "duplicate" I mean same function name, with just _locked suffix.
Maybe a better definition?

Almost done preparing the patches!


Why not just add _locked version and rework the version without suffix
to call _locked under mutex one in one patch, to just keep it all
meaningful?



I mean, instead of:

patch 1: add a _locked() duplicate

   At this point we have a duplicated function that's just bad practice.

patch 2: remake version without prefix to call _locked() under mutex
  
   Now everything is correct. But we have to track the moment when

something strange becomes something correct.


do just

patch 1: rename function to _locked() and add a wrapper without suffix,
that calls _locked() under mutex




That's what I always intended to do. As I said, I just used the wrong word.



Ah, OK then, I misunderstood.


--
Best regards,
Vladimir



Re: [PATCH v7 10/18] jobs: rename static functions called with job_mutex held

2022-06-28 Thread Emanuele Giuseppe Esposito



Am 28/06/2022 um 17:26 schrieb Vladimir Sementsov-Ogievskiy:
> On 6/28/22 18:22, Vladimir Sementsov-Ogievskiy wrote:
>> On 6/28/22 16:04, Emanuele Giuseppe Esposito wrote:
> Ok so far I did the following:
>
> - duplicated each public function as static {function}_locked()
 They shouldn't be duplicates: function without _locked suffix should
 take the mutex.
>>> By "duplicate" I mean same function name, with just _locked suffix.
>>> Maybe a better definition?
>>>
>>> Almost done preparing the patches!
>>
>> Why not just add _locked version and rework the version without suffix
>> to call _locked under mutex one in one patch, to just keep it all
>> meaningful?
>>
> 
> I mean, instead of:
> 
> patch 1: add a _locked() duplicate
> 
>   At this point we have a duplicated function that's just bad practice.
> 
> patch 2: remake version without prefix to call _locked() under mutex
>  
>   Now everything is correct. But we have to track the moment when
> something strange becomes something correct.
> 
> 
> do just
> 
> patch 1: rename function to _locked() and add a wrapper without suffix,
> that calls _locked() under mutex
> 
> 

That's what I always intended to do. As I said, I just used the wrong word.

Emanuele




Re: [PATCH v4] hw: m25p80: add WP# pin and SRWD bit for write protection

2022-06-28 Thread Francisco Iglesias
On [2022 Jun 28] Tue 17:52:50, Cédric Le Goater wrote:
> Alistair, Francisco,
> 
> On 6/22/22 11:45, Francisco Iglesias wrote:
> > On [2022 Jun 21] Tue 13:24:27, Iris Chen wrote:
> > > From: Iris Chen 
> > > 
> > > Signed-off-by: Iris Chen 
> > 
> > Reviewed-by: Francisco Iglesias 
> 
> I am planning to include this patch in the next aspeed PR if that's
> OK with you.

Sounds good to me Cédric!

Best regards,
Francisco

> 
> Thanks,
> 
> C.
> 
> > 
> > > ---
> > > Fixed .needed for subsection and suggestions from Francisco
> > > 
> > >   hw/block/m25p80.c | 82 ++-
> > >   1 file changed, 67 insertions(+), 15 deletions(-)
> > > 
> > > diff --git a/hw/block/m25p80.c b/hw/block/m25p80.c
> > > index 81ba3da4df..3045dda53b 100644
> > > --- a/hw/block/m25p80.c
> > > +++ b/hw/block/m25p80.c
> > > @@ -472,11 +472,13 @@ struct Flash {
> > >   uint8_t spansion_cr2v;
> > >   uint8_t spansion_cr3v;
> > >   uint8_t spansion_cr4v;
> > > +bool wp_level;
> > >   bool write_enable;
> > >   bool four_bytes_address_mode;
> > >   bool reset_enable;
> > >   bool quad_enable;
> > >   bool aai_enable;
> > > +bool status_register_write_disabled;
> > >   uint8_t ear;
> > >   int64_t dirty_page;
> > > @@ -723,6 +725,8 @@ static void complete_collecting_data(Flash *s)
> > >   flash_erase(s, s->cur_addr, s->cmd_in_progress);
> > >   break;
> > >   case WRSR:
> > > +s->status_register_write_disabled = extract32(s->data[0], 7, 1);
> > > +
> > >   switch (get_man(s)) {
> > >   case MAN_SPANSION:
> > >   s->quad_enable = !!(s->data[1] & 0x02);
> > > @@ -1165,22 +1169,34 @@ static void decode_new_cmd(Flash *s, uint32_t 
> > > value)
> > >   break;
> > >   case WRSR:
> > > -if (s->write_enable) {
> > > -switch (get_man(s)) {
> > > -case MAN_SPANSION:
> > > -s->needed_bytes = 2;
> > > -s->state = STATE_COLLECTING_DATA;
> > > -break;
> > > -case MAN_MACRONIX:
> > > -s->needed_bytes = 2;
> > > -s->state = STATE_COLLECTING_VAR_LEN_DATA;
> > > -break;
> > > -default:
> > > -s->needed_bytes = 1;
> > > -s->state = STATE_COLLECTING_DATA;
> > > -}
> > > -s->pos = 0;
> > > +/*
> > > + * If WP# is low and status_register_write_disabled is high,
> > > + * status register writes are disabled.
> > > + * This is also called "hardware protected mode" (HPM). All other
> > > + * combinations of the two states are called "software protected 
> > > mode"
> > > + * (SPM), and status register writes are permitted.
> > > + */
> > > +if ((s->wp_level == 0 && s->status_register_write_disabled)
> > > +|| !s->write_enable) {
> > > +qemu_log_mask(LOG_GUEST_ERROR,
> > > +  "M25P80: Status register write is 
> > > disabled!\n");
> > > +break;
> > > +}
> > > +
> > > +switch (get_man(s)) {
> > > +case MAN_SPANSION:
> > > +s->needed_bytes = 2;
> > > +s->state = STATE_COLLECTING_DATA;
> > > +break;
> > > +case MAN_MACRONIX:
> > > +s->needed_bytes = 2;
> > > +s->state = STATE_COLLECTING_VAR_LEN_DATA;
> > > +break;
> > > +default:
> > > +s->needed_bytes = 1;
> > > +s->state = STATE_COLLECTING_DATA;
> > >   }
> > > +s->pos = 0;
> > >   break;
> > >   case WRDI:
> > > @@ -1195,6 +1211,8 @@ static void decode_new_cmd(Flash *s, uint32_t value)
> > >   case RDSR:
> > >   s->data[0] = (!!s->write_enable) << 1;
> > > +s->data[0] |= (!!s->status_register_write_disabled) << 7;
> > > +
> > >   if (get_man(s) == MAN_MACRONIX || get_man(s) == MAN_ISSI) {
> > >   s->data[0] |= (!!s->quad_enable) << 6;
> > >   }
> > > @@ -1484,6 +1502,14 @@ static uint32_t m25p80_transfer8(SSIPeripheral 
> > > *ss, uint32_t tx)
> > >   return r;
> > >   }
> > > +static void m25p80_write_protect_pin_irq_handler(void *opaque, int n, 
> > > int level)
> > > +{
> > > +Flash *s = M25P80(opaque);
> > > +/* WP# is just a single pin. */
> > > +assert(n == 0);
> > > +s->wp_level = !!level;
> > > +}
> > > +
> > >   static void m25p80_realize(SSIPeripheral *ss, Error **errp)
> > >   {
> > >   Flash *s = M25P80(ss);
> > > @@ -1515,12 +1541,18 @@ static void m25p80_realize(SSIPeripheral *ss, 
> > > Error **errp)
> > >   s->storage = blk_blockalign(NULL, s->size);
> > >   memset(s->storage, 0xFF, s->size);
> > >   }
> > > +
> > > +qdev_init_gpio_in_named(DEVICE(s),
> > > +m25p80_write_protect_pin_irq_handler, "WP#", 
> > > 1);
> > >   }
> > >   static 

Re: [PATCH v4] hw: m25p80: add WP# pin and SRWD bit for write protection

2022-06-28 Thread Cédric Le Goater

Alistair, Francisco,

On 6/22/22 11:45, Francisco Iglesias wrote:

On [2022 Jun 21] Tue 13:24:27, Iris Chen wrote:

From: Iris Chen 

Signed-off-by: Iris Chen 


Reviewed-by: Francisco Iglesias 


I am planning to include this patch in the next aspeed PR if that's
OK with you.

Thanks,

C.




---
Fixed .needed for subsection and suggestions from Francisco

  hw/block/m25p80.c | 82 ++-
  1 file changed, 67 insertions(+), 15 deletions(-)

diff --git a/hw/block/m25p80.c b/hw/block/m25p80.c
index 81ba3da4df..3045dda53b 100644
--- a/hw/block/m25p80.c
+++ b/hw/block/m25p80.c
@@ -472,11 +472,13 @@ struct Flash {
  uint8_t spansion_cr2v;
  uint8_t spansion_cr3v;
  uint8_t spansion_cr4v;
+bool wp_level;
  bool write_enable;
  bool four_bytes_address_mode;
  bool reset_enable;
  bool quad_enable;
  bool aai_enable;
+bool status_register_write_disabled;
  uint8_t ear;
  
  int64_t dirty_page;

@@ -723,6 +725,8 @@ static void complete_collecting_data(Flash *s)
  flash_erase(s, s->cur_addr, s->cmd_in_progress);
  break;
  case WRSR:
+s->status_register_write_disabled = extract32(s->data[0], 7, 1);
+
  switch (get_man(s)) {
  case MAN_SPANSION:
  s->quad_enable = !!(s->data[1] & 0x02);
@@ -1165,22 +1169,34 @@ static void decode_new_cmd(Flash *s, uint32_t value)
  break;
  
  case WRSR:

-if (s->write_enable) {
-switch (get_man(s)) {
-case MAN_SPANSION:
-s->needed_bytes = 2;
-s->state = STATE_COLLECTING_DATA;
-break;
-case MAN_MACRONIX:
-s->needed_bytes = 2;
-s->state = STATE_COLLECTING_VAR_LEN_DATA;
-break;
-default:
-s->needed_bytes = 1;
-s->state = STATE_COLLECTING_DATA;
-}
-s->pos = 0;
+/*
+ * If WP# is low and status_register_write_disabled is high,
+ * status register writes are disabled.
+ * This is also called "hardware protected mode" (HPM). All other
+ * combinations of the two states are called "software protected mode"
+ * (SPM), and status register writes are permitted.
+ */
+if ((s->wp_level == 0 && s->status_register_write_disabled)
+|| !s->write_enable) {
+qemu_log_mask(LOG_GUEST_ERROR,
+  "M25P80: Status register write is disabled!\n");
+break;
+}
+
+switch (get_man(s)) {
+case MAN_SPANSION:
+s->needed_bytes = 2;
+s->state = STATE_COLLECTING_DATA;
+break;
+case MAN_MACRONIX:
+s->needed_bytes = 2;
+s->state = STATE_COLLECTING_VAR_LEN_DATA;
+break;
+default:
+s->needed_bytes = 1;
+s->state = STATE_COLLECTING_DATA;
  }
+s->pos = 0;
  break;
  
  case WRDI:

@@ -1195,6 +1211,8 @@ static void decode_new_cmd(Flash *s, uint32_t value)
  
  case RDSR:

  s->data[0] = (!!s->write_enable) << 1;
+s->data[0] |= (!!s->status_register_write_disabled) << 7;
+
  if (get_man(s) == MAN_MACRONIX || get_man(s) == MAN_ISSI) {
  s->data[0] |= (!!s->quad_enable) << 6;
  }
@@ -1484,6 +1502,14 @@ static uint32_t m25p80_transfer8(SSIPeripheral *ss, 
uint32_t tx)
  return r;
  }
  
+static void m25p80_write_protect_pin_irq_handler(void *opaque, int n, int level)

+{
+Flash *s = M25P80(opaque);
+/* WP# is just a single pin. */
+assert(n == 0);
+s->wp_level = !!level;
+}
+
  static void m25p80_realize(SSIPeripheral *ss, Error **errp)
  {
  Flash *s = M25P80(ss);
@@ -1515,12 +1541,18 @@ static void m25p80_realize(SSIPeripheral *ss, Error 
**errp)
  s->storage = blk_blockalign(NULL, s->size);
  memset(s->storage, 0xFF, s->size);
  }
+
+qdev_init_gpio_in_named(DEVICE(s),
+m25p80_write_protect_pin_irq_handler, "WP#", 1);
  }
  
  static void m25p80_reset(DeviceState *d)

  {
  Flash *s = M25P80(d);
  
+s->wp_level = true;

+s->status_register_write_disabled = false;
+
  reset_memory(s);
  }
  
@@ -1587,6 +1619,25 @@ static const VMStateDescription vmstate_m25p80_aai_enable = {

  }
  };
  
+static bool m25p80_wp_level_srwd_needed(void *opaque)

+{
+Flash *s = (Flash *)opaque;
+
+return !s->wp_level || s->status_register_write_disabled;
+}
+
+static const VMStateDescription vmstate_m25p80_write_protect = {
+.name = "m25p80/write_protect",
+.version_id = 1,
+.minimum_version_id = 1,
+.needed = m25p80_wp_level_srwd_needed,
+.fields = (VMStateField[]) {
+VMSTATE_BOOL(wp_level, Flash),
+VMSTATE_BOOL(status_register_write_disabled, Flash),
+VMSTATE_END_OF_LIST()
+}
+};
+
  static const 

Re: [PATCH v7 10/18] jobs: rename static functions called with job_mutex held

2022-06-28 Thread Vladimir Sementsov-Ogievskiy

On 6/28/22 18:22, Vladimir Sementsov-Ogievskiy wrote:

On 6/28/22 16:04, Emanuele Giuseppe Esposito wrote:

Ok so far I did the following:

- duplicated each public function as static {function}_locked()

They shouldn't be duplicates: function without _locked suffix should
take the mutex.

By "duplicate" I mean same function name, with just _locked suffix.
Maybe a better definition?

Almost done preparing the patches!


Why not just add _locked version and rework the version without suffix to call 
_locked under mutex one in one patch, to just keep it all meaningful?



I mean, instead of:

patch 1: add a _locked() duplicate

  At this point we have a duplicated function that's just bad practice.

patch 2: remake version without prefix to call _locked() under mutex
 
  Now everything is correct. But we have to track the moment when something strange becomes something correct.



do just

patch 1: rename function to _locked() and add a wrapper without suffix, that 
calls _locked() under mutex



--
Best regards,
Vladimir



Re: [PATCH v7 10/18] jobs: rename static functions called with job_mutex held

2022-06-28 Thread Vladimir Sementsov-Ogievskiy

On 6/28/22 16:04, Emanuele Giuseppe Esposito wrote:

Ok so far I did the following:

- duplicated each public function as static {function}_locked()

They shouldn't be duplicates: function without _locked suffix should
take the mutex.

By "duplicate" I mean same function name, with just _locked suffix.
Maybe a better definition?

Almost done preparing the patches!


Why not just add _locked version and rework the version without suffix to call 
_locked under mutex one in one patch, to just keep it all meaningful?

--
Best regards,
Vladimir



Re: [PATCH v7 01/18] job.c: make job_mutex and job_lock/unlock() public

2022-06-28 Thread Vladimir Sementsov-Ogievskiy

On 6/28/22 16:08, Emanuele Giuseppe Esposito wrote:



Am 24/06/2022 um 20:22 schrieb Vladimir Sementsov-Ogievskiy:

I've already acked this (honestly, because Stefan do), but still, want
to clarify:

On 6/16/22 16:18, Emanuele Giuseppe Esposito wrote:

job mutex will be used to protect the job struct elements and list,
replacing AioContext locks.

Right now use a shared lock for all jobs, in order to keep things
simple. Once the AioContext lock is gone, we can introduce per-job
locks.

To simplify the switch from aiocontext to job lock, introduce
*nop*  lock/unlock functions and macros.
We want to always call job_lock/unlock outside the AioContext locks,
and not vice-versa, otherwise we might get a deadlock.


Could you describe here, why we get a deadlock?

As I understand, we'll deadlock if two code paths exist simultaneously:

1. we take job mutex under aiocontext lock
2. we take aiocontex lock under job mutex

If these paths exists, it's possible that one thread goes through [1]
and another through [2]. If thread [1] holds job-mutex and want to take
aiocontext-lock, and in the same time thread [2] holds aiocontext-lock
and want to take job-mutext, that's a dead-lock.

If you say, that we must avoid [1], do you have in mind that we have [2]
somewhere? If so, this should be mentioned here

If not, could we just make a normal mutex, not a noop?


Of course we have [2] somewhere, otherwise I wouldn't even think about
creating a noop function. This idea came up in v1-v2.

Regarding the specific case, I don't remember. But there are tons of
functions that are acquiring the AioContext lock and then calling job_*
API, such as job_cancel_sync in blockdev.c.

I might use job_cancel_sync as example and write it in the commit
message though.



Yes, that's obvious that we have tons of [1]. That's why an example of [2] 
would be lot more valuable.


--
Best regards,
Vladimir



Re: [PATCH v7 01/18] job.c: make job_mutex and job_lock/unlock() public

2022-06-28 Thread Emanuele Giuseppe Esposito



Am 24/06/2022 um 20:22 schrieb Vladimir Sementsov-Ogievskiy:
> I've already acked this (honestly, because Stefan do), but still, want
> to clarify:
> 
> On 6/16/22 16:18, Emanuele Giuseppe Esposito wrote:
>> job mutex will be used to protect the job struct elements and list,
>> replacing AioContext locks.
>>
>> Right now use a shared lock for all jobs, in order to keep things
>> simple. Once the AioContext lock is gone, we can introduce per-job
>> locks.
>>
>> To simplify the switch from aiocontext to job lock, introduce
>> *nop*  lock/unlock functions and macros.
>> We want to always call job_lock/unlock outside the AioContext locks,
>> and not vice-versa, otherwise we might get a deadlock.
> 
> Could you describe here, why we get a deadlock?
> 
> As I understand, we'll deadlock if two code paths exist simultaneously:
> 
> 1. we take job mutex under aiocontext lock
> 2. we take aiocontex lock under job mutex
> 
> If these paths exists, it's possible that one thread goes through [1]
> and another through [2]. If thread [1] holds job-mutex and want to take
> aiocontext-lock, and in the same time thread [2] holds aiocontext-lock
> and want to take job-mutext, that's a dead-lock.
> 
> If you say, that we must avoid [1], do you have in mind that we have [2]
> somewhere? If so, this should be mentioned here
> 
> If not, could we just make a normal mutex, not a noop?

Of course we have [2] somewhere, otherwise I wouldn't even think about
creating a noop function. This idea came up in v1-v2.

Regarding the specific case, I don't remember. But there are tons of
functions that are acquiring the AioContext lock and then calling job_*
API, such as job_cancel_sync in blockdev.c.

I might use job_cancel_sync as example and write it in the commit
message though.

Thank you,
Emanuele
>> This is not
>> straightforward to do, and that's why we start with nop functions.
>> Once everything is protected by job_lock/unlock, we can change the nop
>> into
>> an actual mutex and remove the aiocontext lock.
>>
>> Since job_mutex is already being used, add static
>> real_job_{lock/unlock} for the existing usage.
>>
>> Signed-off-by: Emanuele Giuseppe Esposito
>> Reviewed-by: Stefan Hajnoczi
> 
> 




Re: [PATCH v7 10/18] jobs: rename static functions called with job_mutex held

2022-06-28 Thread Emanuele Giuseppe Esposito



Am 28/06/2022 um 12:47 schrieb Vladimir Sementsov-Ogievskiy:
> On 6/28/22 10:40, Emanuele Giuseppe Esposito wrote:
>>
>>
>> Am 22/06/2022 um 20:38 schrieb Vladimir Sementsov-Ogievskiy:
>>> On 6/22/22 17:26, Emanuele Giuseppe Esposito wrote:


 Am 21/06/2022 um 19:26 schrieb Vladimir Sementsov-Ogievskiy:
> On 6/16/22 16:18, Emanuele Giuseppe Esposito wrote:
>> With the*nop*  job_lock/unlock placed, rename the static
>> functions that are always under job_mutex, adding "_locked" suffix.
>>
>> List of functions that get this suffix:
>> job_txn_ref   job_txn_del_job
>> job_txn_apply   job_state_transition
>> job_should_pause   job_event_cancelled
>> job_event_completed   job_event_pending
>> job_event_ready   job_event_idle
>> job_do_yield   job_timer_not_pending
>> job_do_dismiss   job_conclude
>> job_update_rc   job_commit
>> job_abort   job_clean
>> job_finalize_single   job_cancel_async
>> job_completed_txn_abort   job_prepare
>> job_needs_finalize   job_do_finalize
>> job_transition_to_pending  job_completed_txn_success
>> job_completed   job_cancel_err
>> job_force_cancel_err
>>
>> Note that "locked" refers to the*nop*  job_lock/unlock, and not
>> real_job_lock/unlock.
>>
>> No functional change intended.
>>
>> Signed-off-by: Emanuele Giuseppe Esposito
>
>
> Hmm. Maybe it was already discussed.. But for me it seems, that it
> would
> be simpler to review previous patches, that fix job_ API users to use
> locking properly, if this renaming go earlier.
>
> Anyway, in this series, we can't update everything at once. So
> patch to
> patch, we make the code more and more correct. (yes I remember that
> lock() is a noop, but I should review thinking that it real,
> otherwise,
> how to review?)
>
> So, I'm saying about formal correctness of using lock() unlock()
> function in connection with introduced _locked prifixes and in
> connection with how it should finally work.
>
> You do:
>
> 05. introduce some _locked functions, that just duplicates, and
> job_pause_point_locked() is formally inconsistent, as I said.
>
> 06. Update a lot of places, to give them their final form (but not
> final, as some functions will be renamed to _locked, some not, hard to
> imagine)
>
> 07,08,09. Update some more, and even more places. very hard to track
> formal correctness of using locks
>
> 10-...: rename APIs.
>
>
> What do you think about the following:
>
> 1. Introduce noop lock, and some internal _locked() versions, and keep
> formal consistency inside job.c, considering all public interfaces as
> unlocked:
>
>    at this point:
>     - everything correct inside job.c
>     - no public interfaces with _locked prefix
>     - all public interfaces take mutex internally
>     - no external user take mutex by hand
>
> We can rename all internal static functions at this step too.
>
> 2. Introduce some public _locked APIs, that we'll use in next patches
>
> 3. Now start fixing external users in several patches:
>       - protect by mutex direct use of job fields
>     - make wider locks and move to _locked APIs inside them where
> needed
>
>
> In this scenario, every updated unit becomes formally correct after
> update, and after all steps everything is formally correct, and we can
> move to turning-on the mutex.
>

 I don't understand your logic also here, sorry :(

 I assume you want to keep patch 1-4, then the problem is assing
 job_lock
 and renaming functions in _locked.
 So I would say the problem is in patch 5-6-10-11-12-13. All the others
 should be self contained.

 I understand patch 5 is a little hard to follow.

 Now, I am not sure what you propose here but it seems that the end goal
 is to just have the same result, but with additional intermediate steps
 that are just "do this just because in the next patch will be useful".
 I think the problem is that we are going to miss the "why we need the
 lock" logic in the patches if we do so.

 The logic I tried to convey in this order is the following:
 - job.h: add _locked duplicates for job API functions called with and
 without job_mutex
  Just create duplicates of functions

 - jobs: protect jobs with job_lock/unlock
  QMP and monitor functions call APIs that assume lock is taken,
  drivers must take explicitly the lock

 - jobs: rename static functions called with job_mutex held
 - job.h: rename job API functions called with job_mutex held
 - block_job: rename block_job functions called with job_mutex held

Re: [PATCH] hw/nvme: force nvme-ns param 'shared' to false if no nvme-subsys node

2022-06-28 Thread Niklas Cassel
On Tue, Jun 28, 2022 at 02:22:09PM +0200, Niklas Cassel via wrote:

Hello Peter,

It seems that mailman configuration on qemu-devel is rewriting the
"From:" field to "From: Niklas Cassel via "

If found this old thread about the same issue:
https://qemu-devel.nongnu.narkive.com/6hm8Fbvz/mailing-list-vs-dmarc-and-microsoft-com-s-p-reject-policy

Which says that this can happen when using p=reject policy.

However, doing a bcc to another of my personal addresses,
it looks like the SKIM/SPF/DMARC is all passing,
and non of them seem to have p=reject:

ARC-Authentication-Results: i=1; mx.google.com;
   dkim=pass header.i=@wdc.com header.s=dkim.wdc.com header.b=TTwPBUcS;
   spf=pass (google.com: domain of prvs=171ad38db=niklas.cas...@wdc.com 
designates 216.71.153.141 as permitted sender) 
smtp.mailfrom="prvs=171ad38db=niklas.cas...@wdc.com";
   dmarc=pass (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=wdc.com
Return-Path: 
Received: from esa3.hgst.iphmx.com (esa3.hgst.iphmx.com. [216.71.153.141])
by mx.google.com with ESMTPS id 
sd33-20020a1709076e2100b00711f20051f2si13306645ejc.697.2022.06.28.05.22.25
for 
(version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128);
Tue, 28 Jun 2022 05:22:26 -0700 (PDT)
Received-SPF: pass (google.com: domain of prvs=171ad38db=niklas.cas...@wdc.com 
designates 216.71.153.141 as permitted sender) client-ip=216.71.153.141;
Authentication-Results: mx.google.com;
   dkim=pass header.i=@wdc.com header.s=dkim.wdc.com header.b=TTwPBUcS;
   spf=pass (google.com: domain of prvs=171ad38db=niklas.cas...@wdc.com 
designates 216.71.153.141 as permitted sender) 
smtp.mailfrom="prvs=171ad38db=niklas.cas...@wdc.com";
   dmarc=pass (p=QUARANTINE sp=QUARANTINE dis=NONE)


Any idea why mailmain is rewriting the "From:" field for my messages?


Kind regards,
Niklas


Re: [PATCH] hw/nvme: fix example serial in documentation

2022-06-28 Thread Klaus Jensen
On Jun 28 13:26, Daniel P. Berrangé wrote:
> On Mon, Jun 27, 2022 at 02:39:57PM +0200, Niklas Cassel via wrote:
> > The serial prop on the controller is actually describing the nvme
> > subsystem serial, which has to be identical for all controllers within
> > the same nvme subsystem.
> 
> Given this description...
> 
> >  
> > -device nvme-subsys,id=nvme-subsys-0,nqn=subsys0
> 
> ...I'm wondering why 'serial' isn't a property of this device..
> 
> > -   -device nvme,serial=a,subsys=nvme-subsys-0
> > -   -device nvme,serial=b,subsys=nvme-subsys-0
> > +   -device nvme,serial=deadbeef,subsys=nvme-subsys-0
> > +   -device nvme,serial=deadbeef,subsys=nvme-subsys-0
> 
> ..rather than requiring it to be redundantly set to the same value here ?
> 

-device nvme can be used without a subsystem device, and in that case
the serial must be set. However, you are right that we could not require
it if set on the subsystem device.


signature.asc
Description: PGP signature


Re: [PATCH] hw/nvme: fix example serial in documentation

2022-06-28 Thread Daniel P . Berrangé
On Mon, Jun 27, 2022 at 02:39:57PM +0200, Niklas Cassel via wrote:
> The serial prop on the controller is actually describing the nvme
> subsystem serial, which has to be identical for all controllers within
> the same nvme subsystem.

Given this description...

>  
> -device nvme-subsys,id=nvme-subsys-0,nqn=subsys0

...I'm wondering why 'serial' isn't a property of this device..

> -   -device nvme,serial=a,subsys=nvme-subsys-0
> -   -device nvme,serial=b,subsys=nvme-subsys-0
> +   -device nvme,serial=deadbeef,subsys=nvme-subsys-0
> +   -device nvme,serial=deadbeef,subsys=nvme-subsys-0

..rather than requiring it to be redundantly set to the same value here ?

With regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




[PATCH] hw/nvme: force nvme-ns param 'shared' to false if no nvme-subsys node

2022-06-28 Thread Niklas Cassel via
Since commit 916b0f0b5264 ("hw/nvme: change nvme-ns 'shared' default")
the default value of nvme-ns param 'shared' is set to true, regardless
if there is a nvme-subsys node or not.

On a system without a nvme-subsys node, a namespace will never be able
to be attached to more than one controller, so for this configuration,
it is counterintuitive for this parameter to be set by default.

Force the nvme-ns param 'shared' to false for configurations where
there is no nvme-subsys node, as the namespace will never be able to
attach to more than one controller anyway.

Signed-off-by: Niklas Cassel 
---
 hw/nvme/ns.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/hw/nvme/ns.c b/hw/nvme/ns.c
index 870c3ca1a2..62a1f97be0 100644
--- a/hw/nvme/ns.c
+++ b/hw/nvme/ns.c
@@ -546,6 +546,8 @@ static void nvme_ns_realize(DeviceState *dev, Error **errp)
 int i;
 
 if (!n->subsys) {
+/* If no subsys, the ns cannot be attached to more than one ctrl. */
+ns->params.shared = false;
 if (ns->params.detached) {
 error_setg(errp, "detached requires that the nvme device is "
"linked to an nvme-subsys device");
-- 
2.36.1




Re: [PATCH v7 10/18] jobs: rename static functions called with job_mutex held

2022-06-28 Thread Vladimir Sementsov-Ogievskiy

On 6/28/22 10:40, Emanuele Giuseppe Esposito wrote:



Am 22/06/2022 um 20:38 schrieb Vladimir Sementsov-Ogievskiy:

On 6/22/22 17:26, Emanuele Giuseppe Esposito wrote:



Am 21/06/2022 um 19:26 schrieb Vladimir Sementsov-Ogievskiy:

On 6/16/22 16:18, Emanuele Giuseppe Esposito wrote:

With the*nop*  job_lock/unlock placed, rename the static
functions that are always under job_mutex, adding "_locked" suffix.

List of functions that get this suffix:
job_txn_ref   job_txn_del_job
job_txn_apply   job_state_transition
job_should_pause   job_event_cancelled
job_event_completed   job_event_pending
job_event_ready   job_event_idle
job_do_yield   job_timer_not_pending
job_do_dismiss   job_conclude
job_update_rc   job_commit
job_abort   job_clean
job_finalize_single   job_cancel_async
job_completed_txn_abort   job_prepare
job_needs_finalize   job_do_finalize
job_transition_to_pending  job_completed_txn_success
job_completed   job_cancel_err
job_force_cancel_err

Note that "locked" refers to the*nop*  job_lock/unlock, and not
real_job_lock/unlock.

No functional change intended.

Signed-off-by: Emanuele Giuseppe Esposito



Hmm. Maybe it was already discussed.. But for me it seems, that it would
be simpler to review previous patches, that fix job_ API users to use
locking properly, if this renaming go earlier.

Anyway, in this series, we can't update everything at once. So patch to
patch, we make the code more and more correct. (yes I remember that
lock() is a noop, but I should review thinking that it real, otherwise,
how to review?)

So, I'm saying about formal correctness of using lock() unlock()
function in connection with introduced _locked prifixes and in
connection with how it should finally work.

You do:

05. introduce some _locked functions, that just duplicates, and
job_pause_point_locked() is formally inconsistent, as I said.

06. Update a lot of places, to give them their final form (but not
final, as some functions will be renamed to _locked, some not, hard to
imagine)

07,08,09. Update some more, and even more places. very hard to track
formal correctness of using locks

10-...: rename APIs.


What do you think about the following:

1. Introduce noop lock, and some internal _locked() versions, and keep
formal consistency inside job.c, considering all public interfaces as
unlocked:

   at this point:
    - everything correct inside job.c
    - no public interfaces with _locked prefix
    - all public interfaces take mutex internally
    - no external user take mutex by hand

We can rename all internal static functions at this step too.

2. Introduce some public _locked APIs, that we'll use in next patches

3. Now start fixing external users in several patches:
      - protect by mutex direct use of job fields
    - make wider locks and move to _locked APIs inside them where needed


In this scenario, every updated unit becomes formally correct after
update, and after all steps everything is formally correct, and we can
move to turning-on the mutex.



I don't understand your logic also here, sorry :(

I assume you want to keep patch 1-4, then the problem is assing job_lock
and renaming functions in _locked.
So I would say the problem is in patch 5-6-10-11-12-13. All the others
should be self contained.

I understand patch 5 is a little hard to follow.

Now, I am not sure what you propose here but it seems that the end goal
is to just have the same result, but with additional intermediate steps
that are just "do this just because in the next patch will be useful".
I think the problem is that we are going to miss the "why we need the
lock" logic in the patches if we do so.

The logic I tried to convey in this order is the following:
- job.h: add _locked duplicates for job API functions called with and
without job_mutex
 Just create duplicates of functions

- jobs: protect jobs with job_lock/unlock
 QMP and monitor functions call APIs that assume lock is taken,
 drivers must take explicitly the lock

- jobs: rename static functions called with job_mutex held
- job.h: rename job API functions called with job_mutex held
- block_job: rename block_job functions called with job_mutex held
 *given* that some functions are always under lock, transform
 them in _locked. Requires the job_lock/unlock patch

- job.h: define unlocked functions
 Comments on the public functions that are not _locked


@Kevin, since you also had some feedbacks on the patch ordering, do you
agree with this ordering or you have some other ideas?

Following your suggestion, we could move patches 10-11-12-13 before
patch 6 "jobs: protect jobs with job_lock/unlock".

(Apologies for changing my mind, but being the second complain I am
starting to reconsider reordering the patches).



In two words, what I mean: let's keep the following invariant from patch
to patch:

1. Function that has _locked() prefix is always called with lock 

Re: [RFC v3 2/5] qemu-io: add zoned block device operations.

2022-06-28 Thread Sam Li
Damien Le Moal  于2022年6月28日周二 17:11写道:
>
> On 6/28/22 16:56, Stefan Hajnoczi wrote:
> > On Mon, Jun 27, 2022 at 08:19:14AM +0800, Sam Li wrote:
> >> diff --git a/qemu-io-cmds.c b/qemu-io-cmds.c
> >> index 2f0d8ac25a..3f2592b9f5 100644
> >> --- a/qemu-io-cmds.c
> >> +++ b/qemu-io-cmds.c
> >> @@ -1706,6 +1706,122 @@ static const cmdinfo_t flush_cmd = {
> >>  .oneline= "flush all in-core file state to disk",
> >>  };
> >>
> >> +static int zone_report_f(BlockBackend *blk, int argc, char **argv)
> >> +{
> >> +int ret;
> >> +int64_t offset, len, nr_zones;
> >> +int i = 0;
> >> +
> >> +++optind;
> >> +offset = cvtnum(argv[optind]);
> >> +++optind;
> >> +len = cvtnum(argv[optind]);
> >> +++optind;
> >> +nr_zones = cvtnum(argv[optind]);
> >> +
> >> +g_autofree BlockZoneDescriptor *zones = g_new(BlockZoneDescriptor, 
> >> nr_zones);
> >> +ret = blk_zone_report(blk, offset, len, _zones, zones);
> >> +while (i < nr_zones) {
> >
> > Does blk_zone_report() set nr_zones to 0 on failure or do we need to
> > check if (ret < 0) here?
>
> ret = 0 means "no zone reported" which happen only if nr_zones is 0 or the
> start offset is past the end of the disk capacity. ret < 0 would mean that
> a report zone operation was actually attempted and failed (EIO, ENOMEM etc).
>
> >
> >> +fprintf(stdout, "start: 0x%lx, len 0x%lx, cap 0x%lx, wptr 0x%lx, "
> >
> > The rest of the source file uses printf() instead of fprintf(stdout,
> > ...). That's usually preferred because it's shorter.
> >
> >> +"zcond:%u, [type: %u]\n",
> >
> > Please use PRIx64 instead of lx format specifiers for portability. On
> > 32-bit hosts lx is 32-bit, not 64-bit. You can grep QEMU's code for
> > examples of PRIx64.
> >
> >> +zones[i].start, zones[i].length, zones[i].cap, 
> >> zones[i].wp,
> >> +zones[i].cond, zones[i].type);
> >> +++i;
> >> +}
> >
> > A for loop is more idiomatic:
> >
> >   for (int i = 0; i < nr_zones; i++) {
> >   ...
> >   }
> >
> >> +return ret;
> >> +}
> >> +
> >> +static const cmdinfo_t zone_report_cmd = {
> >> +.name = "zone_report",
> >> +.altname = "f",
> >> +.cfunc = zone_report_f,
> >> +.argmin = 3,
> >> +.argmax = 3,
> >> +.args = "offset [offset..] len [len..] number [num..]",
> >
> > The arguments are "offset len number". This command does not accept
> > optional offset/len/num arguments.
>
> The arguments should be offset + len OR offset + number of zones. Having
> the 3 of them does not make sense to me. The interface would then be:
>
> (1) offset + len -> report all zones in the block range [offset .. offset
> + len - 1]
>
> (2) offset + number of zones -> report at most "number of zones" from the
> zone containing the block at "offset".
>
> (2) matches the semantic used at the device command level. So I prefer to
> approach (1).
Yes, I'll remove the len argument then.

>
> >
> >> +.oneline = "report a number of zones",
> >
> > Maybe "report zone information".
> >
> >> +};
> >> +
> >> +static int zone_open_f(BlockBackend *blk, int argc, char **argv)
> >> +{
> >> +int64_t offset, len;
> >> +++optind;
> >> +offset = cvtnum(argv[optind]);
> >> +++optind;
> >> +len = cvtnum(argv[optind]);
> >> +return blk_zone_mgmt(blk, zone_open, offset, len);
> >
> > Where is the error reported? When I look at read_f() I see:
> >
> > if (ret < 0) {
> > printf("read failed: %s\n", strerror(-ret));
> >
> > I think something similar is needed because qemu-io.c does not print an
> > error message for us. The same is true for the other commands defined in
> > this patch.
> >
> >> +}
> >> +
> >> +static const cmdinfo_t zone_open_cmd = {
> >> +.name = "zone_open",
> >> +.altname = "f",
> >> +.cfunc = zone_open_f,
> >> +.argmin = 2,
> >> +.argmax = 2,
> >> +.args = "offset [offset..] len [len..]",
> >
> > There are no optional offset/len args. The same is true for the other
> > commands below.
>
>
> --
> Damien Le Moal
> Western Digital Research



Re: [RFC v3 1/5] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls.

2022-06-28 Thread Sam Li
Damien Le Moal  于2022年6月28日周二 17:05写道:
>
> On 6/28/22 17:05, Sam Li wrote:
> > Stefan Hajnoczi  于2022年6月28日周二 14:52写道:
> >>
> >> On Mon, Jun 27, 2022 at 08:19:13AM +0800, Sam Li wrote:
> >>> diff --git a/block/block-backend.c b/block/block-backend.c
> >>> index e0e1aff4b1..786f964d02 100644
> >>> --- a/block/block-backend.c
> >>> +++ b/block/block-backend.c
> >>> @@ -1810,6 +1810,62 @@ int blk_flush(BlockBackend *blk)
> >>>  return ret;
> >>>  }
> >>>
> >>> +/*
> >>> + * Return zone_report from BlockDriver. Offset can be any number within
> >>> + * the zone size. No alignment for offset and len.
> >>
> >> What is the purpose of len? Is it the maximum number of zones to return
> >> in nr_zones[]?
> >
> > len is actually not used in bdrv_co_zone_report. It is needed by
> > blk_check_byte_request.
>
> There is no IO buffer being passed. Only an array of zone descriptors and
> an in-out number of zones. len is definitely not needed. Not sure what
> blk_check_byte_request() is intended to check though.

Can I just remove len argument and blk_check_byte_request() if it's not used?

>
> >
> >> How does the caller know how many zones were returned?
> >
> > nr_zones represents IN maximum and OUT actual. The caller will know by
> > nr_zones which is changed in bdrv_co_zone_report. I'll add it in the
> > comments.
> >
> >>
> >>> + */
> >>> +int coroutine_fn blk_co_zone_report(BlockBackend *blk, int64_t offset,
> >>> +   int64_t len, int64_t *nr_zones,
> >>> +   BlockZoneDescriptor *zones)
> >>> +{
> >>> +int ret;
> >>> +BlockDriverState *bs;
> >>> +IO_CODE();
> >>> +
> >>> +blk_inc_in_flight(blk); /* increase before waiting */
> >>> +blk_wait_while_drained(blk);
> >>> +bs = blk_bs(blk);
> >>> +
> >>> +ret = blk_check_byte_request(blk, offset, len);
> >>> +if (ret < 0) {
> >>> +return ret;
> >>> +}
> >>> +
> >>> +bdrv_inc_in_flight(bs);
> >>
> >> The bdrv_inc/dec_in_flight() call should be inside
> >> bdrv_co_zone_report(). See bdrv_co_ioctl() for an example.
> >>
> >>> +ret = bdrv_co_zone_report(blk->root->bs, offset, len,
> >>> +  nr_zones, zones);
> >>> +bdrv_dec_in_flight(bs);
> >>> +blk_dec_in_flight(blk);
> >>> +return ret;
> >>> +}
> >>> +
> >>> +/*
> >>> + * Return zone_mgmt from BlockDriver.
> >>
> >> Maybe this should be:
> >>
> >>   Send a zone management command.
> >
> > Yes, it's more accurate.
> >
> >>
> >>> @@ -216,6 +217,11 @@ typedef struct RawPosixAIOData {
> >>>  PreallocMode prealloc;
> >>>  Error **errp;
> >>>  } truncate;
> >>> +struct {
> >>> +int64_t *nr_zones;
> >>> +BlockZoneDescriptor *zones;
> >>> +} zone_report;
> >>> +zone_op op;
> >>
> >> It's cleaner to put op inside a struct zone_mgmt so its purpose is
> >> self-explanatory:
> >>
> >>   struct {
> >>   zone_op op;
> >>   } zone_mgmt;
> >>
> >>> +static int handle_aiocb_zone_report(void *opaque) {
> >>> +RawPosixAIOData *aiocb = opaque;
> >>> +int fd = aiocb->aio_fildes;
> >>> +int64_t *nr_zones = aiocb->zone_report.nr_zones;
> >>> +BlockZoneDescriptor *zones = aiocb->zone_report.zones;
> >>> +int64_t offset = aiocb->aio_offset;
> >>> +int64_t len = aiocb->aio_nbytes;
>
> Since you have the zone array and number of zones (size of that array) I
> really do not see why you need len.
>
> >>> +
> >>> +struct blk_zone *blkz;
> >>> +int64_t rep_size, nrz;
> >>> +int ret, n = 0, i = 0;
> >>> +
> >>> +nrz = *nr_zones;
> >>> +if (len == -1) {
> >>> +return -errno;
> >>
> >> Where is errno set? Should this be an errno constant instead like
> >> -EINVAL?
> >
> > That's right! Noted.
> >
> >>
> >>> +}
> >>> +rep_size = sizeof(struct blk_zone_report) + nrz * sizeof(struct 
> >>> blk_zone);
> >>> +g_autofree struct blk_zone_report *rep = g_new(struct 
> >>> blk_zone_report, nrz);
> >>
> >> g_new() looks incorrect. There should be 1 struct blk_zone_report
> >> followed by nrz struct blk_zone structs. Please use g_malloc(rep_size)
> >> instead.
> >
> > Yes! However, it still has a memory leak error when using g_autofree
> > && g_malloc.
>
> That may be because you are changing the value of the rep pointer while
> parsing the report ?

I am not sure it is the case. Can you show me some way to find the problem?

Thanks for reviewing!


>
> >
> >>
> >>> +offset = offset / 512; /* get the unit of the start sector: sector 
> >>> size is 512 bytes. */
> >>> +printf("start to report zone with offset: 0x%lx\n", offset);
> >>> +
> >>> +blkz = (struct blk_zone *)(rep + 1);
> >>> +while (n < nrz) {
> >>> +memset(rep, 0, rep_size);
> >>> +rep->sector = offset;
> >>> +rep->nr_zones = nrz;
> >>
> >> What prevents zones[] overflowing? nrz isn't being reduced in the loop,
> >> so maybe the rep->nr_zones return value will eventually exceed the
> >> 

Re: [PATCH 4/4] hw/ide/piix: Ignore writes of hardwired PCI command register bits

2022-06-28 Thread Laurent Vivier

CC:

John Snow  (supporter:IDE)
qemu-block@nongnu.org (open list:IDE)

Le 28/05/2022 à 23:02, Lev Kujawski a écrit :

One method to enable PCI bus mastering for IDE controllers, often used
by x86 firmware, is to write 0x7 to the PCI command register.  Neither
the PIIX3 specification nor actual hardware (a Tyan S1686D system)
permit modification of the Memory Space Enable (MSE) bit, 1, and thus
the command register would be left in an unspecified state without
this patch.

Signed-off-by: Lev Kujawski 
---
  hw/ide/piix.c | 25 +
  1 file changed, 25 insertions(+)

diff --git a/hw/ide/piix.c b/hw/ide/piix.c
index 76ea8fd9f6..f1d1168ecd 100644
--- a/hw/ide/piix.c
+++ b/hw/ide/piix.c
@@ -25,6 +25,8 @@
   * References:
   *  [1] 82371FB (PIIX) AND 82371SB (PIIX3) PCI ISA IDE XCELERATOR,
   *  290550-002, Intel Corporation, April 1997.
+ *  [2] 82371AB PCI-TO-ISA / IDE XCELERATOR (PIIX4), 290562-001,
+ *  Intel Corporation, April 1997.
   */
  
  #include "qemu/osdep.h"

@@ -32,6 +34,7 @@
  #include "migration/vmstate.h"
  #include "qapi/error.h"
  #include "qemu/module.h"
+#include "qemu/range.h"
  #include "sysemu/block-backend.h"
  #include "sysemu/blockdev.h"
  #include "sysemu/dma.h"
@@ -220,6 +223,26 @@ static void pci_piix_ide_exitfn(PCIDevice *dev)
  }
  }
  
+static void piix_pci_config_write(PCIDevice *d, uint32_t addr,

+  uint32_t val, int l)
+{
+/*
+ * Mask all IDE PCI command register bits except for Bus Master
+ * Function Enable (bit 2) and I/O Space Enable (bit 1), as the
+ * remainder are hardwired to 0 [1, p.48] [2, p.89-90].
+ *
+ * NOTE: According to the PIIX3 datasheet [1], the Memory Space
+ * Enable (MSE bit) is hardwired to 1, but this is contradicted by
+ * actual PIIX3 hardware, the datasheet itself (viz., Default
+ * Value: h), and the PIIX4 datasheet [2].
+ */
+if (range_covers_byte(addr, l, PCI_COMMAND)) {
+val &= ~(0xfffa << ((PCI_COMMAND - addr) << 3));
+}
+
+pci_default_write_config(d, addr, val, l);
+}
+
  /* NOTE: for the PIIX3, the IRQs and IOports are hardcoded */
  static void piix3_ide_class_init(ObjectClass *klass, void *data)
  {
@@ -232,6 +255,7 @@ static void piix3_ide_class_init(ObjectClass *klass, void 
*data)
  k->vendor_id = PCI_VENDOR_ID_INTEL;
  k->device_id = PCI_DEVICE_ID_INTEL_82371SB_1;
  k->class_id = PCI_CLASS_STORAGE_IDE;
+k->config_write = piix_pci_config_write;
  set_bit(DEVICE_CATEGORY_STORAGE, dc->categories);
  dc->hotpluggable = false;
  }
@@ -260,6 +284,7 @@ static void piix4_ide_class_init(ObjectClass *klass, void 
*data)
  k->vendor_id = PCI_VENDOR_ID_INTEL;
  k->device_id = PCI_DEVICE_ID_INTEL_82371AB;
  k->class_id = PCI_CLASS_STORAGE_IDE;
+k->config_write = piix_pci_config_write;
  set_bit(DEVICE_CATEGORY_STORAGE, dc->categories);
  dc->hotpluggable = false;
  }





Re: [PATCH 2/4] hw/ide/core: Clear LBA and drive bits for EXECUTE DEVICE DIAGNOSTIC

2022-06-28 Thread Laurent Vivier

CC:

John Snow  (supporter:IDE)
qemu-block@nongnu.org (open list:IDE)

Le 28/05/2022 à 23:02, Lev Kujawski a écrit :

Prior to this patch, cmd_exec_dev_diagnostic relied upon
ide_set_signature to clear the device register.  While the
preservation of the drive bit by ide_set_signature is necessary for
the DEVICE RESET, IDENTIFY DEVICE, and READ SECTOR commands,
ATA/ATAPI-6 specifies that "DEV shall be cleared to zero" for EXECUTE
DEVICE DIAGNOSTIC.

This deviation was uncovered by the ATACT Device Testing Program
written by Hale Landis.

Signed-off-by: Lev Kujawski 
---
  hw/ide/core.c | 6 ++
  1 file changed, 6 insertions(+)

diff --git a/hw/ide/core.c b/hw/ide/core.c
index c2caa54285..5a24547e49 100644
--- a/hw/ide/core.c
+++ b/hw/ide/core.c
@@ -1704,8 +1704,14 @@ static bool cmd_identify_packet(IDEState *s, uint8_t cmd)
  return false;
  }
  
+/* EXECUTE DEVICE DIAGNOSTIC */

  static bool cmd_exec_dev_diagnostic(IDEState *s, uint8_t cmd)
  {
+/*
+ * Clear the device register per the ATA (v6) specification,
+ * because ide_set_signature does not clear LBA or drive bits.
+ */
+s->select = (ATA_DEV_ALWAYS_ON);
  ide_set_signature(s);
  
  if (s->drive_kind == IDE_CD) {





Re: [PATCH 3/4] piix_ide_reset: Use pci_set_* functions instead of direct access

2022-06-28 Thread Laurent Vivier

CC:

John Snow  (supporter:IDE)
qemu-block@nongnu.org (open list:IDE)

Le 28/05/2022 à 23:02, Lev Kujawski a écrit :

Eliminates the remaining TODOs in hw/ide/piix.c by:
- Using pci_set_{size} functions to write the PIIX PCI configuration
   space instead of manipulating it directly as an array; and
- Documenting default register values by reference to the controlling
   specification.

Signed-off-by: Lev Kujawski 
---
  hw/ide/piix.c | 17 +
  1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/hw/ide/piix.c b/hw/ide/piix.c
index ce89fd0aa3..76ea8fd9f6 100644
--- a/hw/ide/piix.c
+++ b/hw/ide/piix.c
@@ -21,6 +21,10 @@
   * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING 
FROM,
   * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
   * THE SOFTWARE.
+ *
+ * References:
+ *  [1] 82371FB (PIIX) AND 82371SB (PIIX3) PCI ISA IDE XCELERATOR,
+ *  290550-002, Intel Corporation, April 1997.
   */
  
  #include "qemu/osdep.h"

@@ -114,14 +118,11 @@ static void piix_ide_reset(DeviceState *dev)
  ide_bus_reset(>bus[i]);
  }
  
-/* TODO: this is the default. do not override. */

-pci_conf[PCI_COMMAND] = 0x00;
-/* TODO: this is the default. do not override. */
-pci_conf[PCI_COMMAND + 1] = 0x00;
-/* TODO: use pci_set_word */
-pci_conf[PCI_STATUS] = PCI_STATUS_FAST_BACK;
-pci_conf[PCI_STATUS + 1] = PCI_STATUS_DEVSEL_MEDIUM >> 8;
-pci_conf[0x20] = 0x01; /* BMIBA: 20-23h */
+/* PCI command register default value (h) per [1, p.48].  */
+pci_set_word(pci_conf + PCI_COMMAND, 0x);
+pci_set_word(pci_conf + PCI_STATUS,
+ PCI_STATUS_DEVSEL_MEDIUM | PCI_STATUS_FAST_BACK);
+pci_set_byte(pci_conf + 0x20, 0x01);  /* BMIBA: 20-23h */
  }
  
  static int pci_piix_init_ports(PCIIDEState *d)





Re: [RFC v3 2/5] qemu-io: add zoned block device operations.

2022-06-28 Thread Damien Le Moal
On 6/28/22 16:56, Stefan Hajnoczi wrote:
> On Mon, Jun 27, 2022 at 08:19:14AM +0800, Sam Li wrote:
>> diff --git a/qemu-io-cmds.c b/qemu-io-cmds.c
>> index 2f0d8ac25a..3f2592b9f5 100644
>> --- a/qemu-io-cmds.c
>> +++ b/qemu-io-cmds.c
>> @@ -1706,6 +1706,122 @@ static const cmdinfo_t flush_cmd = {
>>  .oneline= "flush all in-core file state to disk",
>>  };
>>  
>> +static int zone_report_f(BlockBackend *blk, int argc, char **argv)
>> +{
>> +int ret;
>> +int64_t offset, len, nr_zones;
>> +int i = 0;
>> +
>> +++optind;
>> +offset = cvtnum(argv[optind]);
>> +++optind;
>> +len = cvtnum(argv[optind]);
>> +++optind;
>> +nr_zones = cvtnum(argv[optind]);
>> +
>> +g_autofree BlockZoneDescriptor *zones = g_new(BlockZoneDescriptor, 
>> nr_zones);
>> +ret = blk_zone_report(blk, offset, len, _zones, zones);
>> +while (i < nr_zones) {
> 
> Does blk_zone_report() set nr_zones to 0 on failure or do we need to
> check if (ret < 0) here?

ret = 0 means "no zone reported" which happen only if nr_zones is 0 or the
start offset is past the end of the disk capacity. ret < 0 would mean that
a report zone operation was actually attempted and failed (EIO, ENOMEM etc).

> 
>> +fprintf(stdout, "start: 0x%lx, len 0x%lx, cap 0x%lx, wptr 0x%lx, "
> 
> The rest of the source file uses printf() instead of fprintf(stdout,
> ...). That's usually preferred because it's shorter.
> 
>> +"zcond:%u, [type: %u]\n",
> 
> Please use PRIx64 instead of lx format specifiers for portability. On
> 32-bit hosts lx is 32-bit, not 64-bit. You can grep QEMU's code for
> examples of PRIx64.
> 
>> +zones[i].start, zones[i].length, zones[i].cap, zones[i].wp,
>> +zones[i].cond, zones[i].type);
>> +++i;
>> +}
> 
> A for loop is more idiomatic:
> 
>   for (int i = 0; i < nr_zones; i++) {
>   ...
>   }
> 
>> +return ret;
>> +}
>> +
>> +static const cmdinfo_t zone_report_cmd = {
>> +.name = "zone_report",
>> +.altname = "f",
>> +.cfunc = zone_report_f,
>> +.argmin = 3,
>> +.argmax = 3,
>> +.args = "offset [offset..] len [len..] number [num..]",
> 
> The arguments are "offset len number". This command does not accept
> optional offset/len/num arguments.

The arguments should be offset + len OR offset + number of zones. Having
the 3 of them does not make sense to me. The interface would then be:

(1) offset + len -> report all zones in the block range [offset .. offset
+ len - 1]

(2) offset + number of zones -> report at most "number of zones" from the
zone containing the block at "offset".

(2) matches the semantic used at the device command level. So I prefer to
approach (1).

> 
>> +.oneline = "report a number of zones",
> 
> Maybe "report zone information".
> 
>> +};
>> +
>> +static int zone_open_f(BlockBackend *blk, int argc, char **argv)
>> +{
>> +int64_t offset, len;
>> +++optind;
>> +offset = cvtnum(argv[optind]);
>> +++optind;
>> +len = cvtnum(argv[optind]);
>> +return blk_zone_mgmt(blk, zone_open, offset, len);
> 
> Where is the error reported? When I look at read_f() I see:
> 
> if (ret < 0) {
> printf("read failed: %s\n", strerror(-ret));
> 
> I think something similar is needed because qemu-io.c does not print an
> error message for us. The same is true for the other commands defined in
> this patch.
> 
>> +}
>> +
>> +static const cmdinfo_t zone_open_cmd = {
>> +.name = "zone_open",
>> +.altname = "f",
>> +.cfunc = zone_open_f,
>> +.argmin = 2,
>> +.argmax = 2,
>> +.args = "offset [offset..] len [len..]",
> 
> There are no optional offset/len args. The same is true for the other
> commands below.


-- 
Damien Le Moal
Western Digital Research



Re: [PATCH 1/2] Trivial: 3 char repeat typos

2022-06-28 Thread Laurent Vivier

Le 14/06/2022 à 12:40, Dr. David Alan Gilbert (git) a écrit :

From: "Dr. David Alan Gilbert" 

Inspired by Julia Lawall's fixing of Linux
kernel comments, I looked at qemu, although I did it manually.

Signed-off-by: Dr. David Alan Gilbert 
---
  hw/intc/openpic.c| 2 +-
  hw/net/imx_fec.c | 2 +-
  hw/pci/pcie_aer.c| 2 +-
  hw/pci/shpc.c| 3 ++-
  hw/ppc/spapr_caps.c  | 2 +-
  hw/scsi/spapr_vscsi.c| 2 +-
  qapi/net.json| 2 +-
  tools/virtiofsd/passthrough_ll.c | 2 +-
  ui/input.c   | 2 +-
  9 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/hw/intc/openpic.c b/hw/intc/openpic.c
index 49504e740f..b0787e8ee7 100644
--- a/hw/intc/openpic.c
+++ b/hw/intc/openpic.c
@@ -729,7 +729,7 @@ static void openpic_tmr_set_tmr(OpenPICTimer *tmr, uint32_t 
val, bool enabled)
  }
  
  /*

- * Returns the currrent tccr value, i.e., timer value (in clocks) with
+ * Returns the current tccr value, i.e., timer value (in clocks) with
   * appropriate TOG.
   */
  static uint64_t openpic_tmr_get_timer(OpenPICTimer *tmr)
diff --git a/hw/net/imx_fec.c b/hw/net/imx_fec.c
index 0db9aaf76a..8c11b237de 100644
--- a/hw/net/imx_fec.c
+++ b/hw/net/imx_fec.c
@@ -438,7 +438,7 @@ static void imx_eth_update(IMXFECState *s)
   *   assignment fail.
   *
   * To ensure that all versions of Linux work, generate ENET_INT_MAC
- * interrrupts on both interrupt lines. This should be changed if and when
+ * interrupts on both interrupt lines. This should be changed if and when
   * qemu supports IOMUX.
   */
  if (s->regs[ENET_EIR] & s->regs[ENET_EIMR] &
diff --git a/hw/pci/pcie_aer.c b/hw/pci/pcie_aer.c
index 92bd0530dd..eff62f3945 100644
--- a/hw/pci/pcie_aer.c
+++ b/hw/pci/pcie_aer.c
@@ -323,7 +323,7 @@ static void pcie_aer_msg_root_port(PCIDevice *dev, const 
PCIEAERMsg *msg)
   */
  }
  
-/* Errro Message Received: Root Error Status register */

+/* Error Message Received: Root Error Status register */
  switch (msg->severity) {
  case PCI_ERR_ROOT_CMD_COR_EN:
  if (root_status & PCI_ERR_ROOT_COR_RCV) {
diff --git a/hw/pci/shpc.c b/hw/pci/shpc.c
index f822f18b98..e71f3a7483 100644
--- a/hw/pci/shpc.c
+++ b/hw/pci/shpc.c
@@ -480,7 +480,8 @@ static const MemoryRegionOps shpc_mmio_ops = {
  .endianness = DEVICE_LITTLE_ENDIAN,
  .valid = {
  /* SHPC ECN requires dword accesses, but the original 1.0 spec 
doesn't.
- * It's easier to suppport all sizes than worry about it. */
+ * It's easier to support all sizes than worry about it.
+ */
  .min_access_size = 1,
  .max_access_size = 4,
  },
diff --git a/hw/ppc/spapr_caps.c b/hw/ppc/spapr_caps.c
index 655ab856a0..b4283055c1 100644
--- a/hw/ppc/spapr_caps.c
+++ b/hw/ppc/spapr_caps.c
@@ -553,7 +553,7 @@ static void cap_ccf_assist_apply(SpaprMachineState *spapr, 
uint8_t val,
   * instruction is a harmless no-op.  It won't correctly
   * implement the cache count flush *but* if we have
   * count-cache-disabled in the host, that flush is
- * unnnecessary.  So, specifically allow this case.  This
+ * unnecessary.  So, specifically allow this case.  This
   * allows us to have better performance on POWER9 DD2.3,
   * while still working on POWER9 DD2.2 and POWER8 host
   * cpus.
diff --git a/hw/scsi/spapr_vscsi.c b/hw/scsi/spapr_vscsi.c
index a07a8e1523..e320ccaa23 100644
--- a/hw/scsi/spapr_vscsi.c
+++ b/hw/scsi/spapr_vscsi.c
@@ -1013,7 +1013,7 @@ static int vscsi_send_capabilities(VSCSIState *s, 
vscsi_req *req)
  }
  
  /*

- * Current implementation does not suppport any migration or
+ * Current implementation does not support any migration or
   * reservation capabilities. Construct the response telling the
   * guest not to use them.
   */
diff --git a/qapi/net.json b/qapi/net.json
index d6f7cfd4d6..9af11e9a3b 100644
--- a/qapi/net.json
+++ b/qapi/net.json
@@ -298,7 +298,7 @@
  #
  # @udp: use the udp version of l2tpv3 encapsulation
  #
-# @cookie64: use 64 bit coookies
+# @cookie64: use 64 bit cookies
  #
  # @counter: have sequence counter
  #
diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index b15c631ca5..7a73dfcce9 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -2319,7 +2319,7 @@ static int do_lo_create(fuse_req_t req, struct lo_inode 
*parent_inode,
   * If security.selinux has not been remapped and selinux is enabled,
   * use fscreate to set context before file creation. If not, use
   * tmpfile method for regular files. Otherwise fallback to
- * non-atomic method of file creation and xattr settting.
+ * non-atomic method of file creation and xattr setting.
   */
  if 

Re: [RFC v3 2/5] qemu-io: add zoned block device operations.

2022-06-28 Thread Sam Li
Stefan Hajnoczi  于2022年6月28日周二 16:20写道:
>
> On Mon, Jun 27, 2022 at 08:19:14AM +0800, Sam Li wrote:
> > diff --git a/qemu-io-cmds.c b/qemu-io-cmds.c
> > index 2f0d8ac25a..3f2592b9f5 100644
> > --- a/qemu-io-cmds.c
> > +++ b/qemu-io-cmds.c
> > @@ -1706,6 +1706,122 @@ static const cmdinfo_t flush_cmd = {
> >  .oneline= "flush all in-core file state to disk",
> >  };
> >
> > +static int zone_report_f(BlockBackend *blk, int argc, char **argv)
> > +{
> > +int ret;
> > +int64_t offset, len, nr_zones;
> > +int i = 0;
> > +
> > +++optind;
> > +offset = cvtnum(argv[optind]);
> > +++optind;
> > +len = cvtnum(argv[optind]);
> > +++optind;
> > +nr_zones = cvtnum(argv[optind]);
> > +
> > +g_autofree BlockZoneDescriptor *zones = g_new(BlockZoneDescriptor, 
> > nr_zones);
> > +ret = blk_zone_report(blk, offset, len, _zones, zones);
> > +while (i < nr_zones) {
>
> Does blk_zone_report() set nr_zones to 0 on failure or do we need to
> check if (ret < 0) here?

I'll check if (ret<0) in zone_report and other commands in this patch as well.

>
> > +fprintf(stdout, "start: 0x%lx, len 0x%lx, cap 0x%lx, wptr 0x%lx, "
>
> The rest of the source file uses printf() instead of fprintf(stdout,
> ...). That's usually preferred because it's shorter.
>
> > +"zcond:%u, [type: %u]\n",
>
> Please use PRIx64 instead of lx format specifiers for portability. On
> 32-bit hosts lx is 32-bit, not 64-bit. You can grep QEMU's code for
> examples of PRIx64.

Noted. It is necessary.

>
> > +zones[i].start, zones[i].length, zones[i].cap, zones[i].wp,
> > +zones[i].cond, zones[i].type);
> > +++i;
> > +}
>
> A for loop is more idiomatic:
>
>   for (int i = 0; i < nr_zones; i++) {
>   ...
>   }
>
> > +return ret;
> > +}
> > +
> > +static const cmdinfo_t zone_report_cmd = {
> > +.name = "zone_report",
> > +.altname = "f",
> > +.cfunc = zone_report_f,
> > +.argmin = 3,
> > +.argmax = 3,
> > +.args = "offset [offset..] len [len..] number [num..]",
>
> The arguments are "offset len number". This command does not accept
> optional offset/len/num arguments.
>
> > +.oneline = "report a number of zones",
>
> Maybe "report zone information".
>
> > +};
> > +
> > +static int zone_open_f(BlockBackend *blk, int argc, char **argv)
> > +{
> > +int64_t offset, len;
> > +++optind;
> > +offset = cvtnum(argv[optind]);
> > +++optind;
> > +len = cvtnum(argv[optind]);
> > +return blk_zone_mgmt(blk, zone_open, offset, len);
>
> Where is the error reported? When I look at read_f() I see:
>
> if (ret < 0) {
> printf("read failed: %s\n", strerror(-ret));
>
> I think something similar is needed because qemu-io.c does not print an
> error message for us. The same is true for the other commands defined in
> this patch.
>
> > +}
> > +
> > +static const cmdinfo_t zone_open_cmd = {
> > +.name = "zone_open",
> > +.altname = "f",
> > +.cfunc = zone_open_f,
> > +.argmin = 2,
> > +.argmax = 2,
> > +.args = "offset [offset..] len [len..]",
>
> There are no optional offset/len args. The same is true for the other
> commands below.

Thanks for reviewing!

Sam



Re: [RFC v3 5/5] qemu-iotests: add zone operation tests.

2022-06-28 Thread Sam Li
Stefan Hajnoczi  于2022年6月28日周二 16:20写道:
>
> On Mon, Jun 27, 2022 at 08:19:17AM +0800, Sam Li wrote:
> > diff --git a/tests/qemu-iotests/tests/zoned.sh 
> > b/tests/qemu-iotests/tests/zoned.sh
> > new file mode 100755
> > index 00..262c0b5427
> > --- /dev/null
> > +++ b/tests/qemu-iotests/tests/zoned.sh
> > @@ -0,0 +1,49 @@
> > +#!/usr/bin/env bash
> > +#
> > +# Test zone management operations.
> > +#
> > +
> > +QEMU_IO="build/qemu-io"
> > +IMG="--image-opts driver=zoned_host_device,filename=/dev/nullb0"
> > +QEMU_IO_OPTIONS=$QEMU_IO_OPTIONS_NO_FMT
> > +
> > +echo "Testing a null_blk device"
> > +echo "Simple cases: if the operations work"
> > +sudo modprobe null_blk nr_devices=1 zoned=1
>
> Please use bash's "trap" command to remove null_blk on exit. That way
> cleanup happens whether the script exits successfully or not. See
> tests/qemu-iotests/108 for an example.

Noted. Should I just include "rmmod null_blk" in _cleanup()? I'm a
little confused about the normal way to write qemu-iotests.

>
> > +# success, all done
> > +sudo rmmod null_blk
> > +echo "*** done"
> > +#rm -f $seq.full
>
> Why is this commented out?
I should just remove it. seq is not used.



Re: [PATCH 2/2] trivial typos: namesapce

2022-06-28 Thread Laurent Vivier

Le 14/06/2022 à 12:40, Dr. David Alan Gilbert (git) a écrit :

From: "Dr. David Alan Gilbert" 

'namespace' is misspelled in a bunch of places.

Signed-off-by: Dr. David Alan Gilbert 
---
  hw/9pfs/9p-xattr-user.c | 8 
  hw/acpi/nvdimm.c| 2 +-
  hw/nvme/ctrl.c  | 2 +-
  3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/hw/9pfs/9p-xattr-user.c b/hw/9pfs/9p-xattr-user.c
index f2ae9582e6..535677ed60 100644
--- a/hw/9pfs/9p-xattr-user.c
+++ b/hw/9pfs/9p-xattr-user.c
@@ -27,7 +27,7 @@ static ssize_t mp_user_getxattr(FsContext *ctx, const char 
*path,
  {
  if (strncmp(name, "user.virtfs.", 12) == 0) {
  /*
- * Don't allow fetch of user.virtfs namesapce
+ * Don't allow fetch of user.virtfs namespace
   * in case of mapped security
   */
  errno = ENOATTR;
@@ -49,7 +49,7 @@ static ssize_t mp_user_listxattr(FsContext *ctx, const char 
*path,
  name_size -= 12;
  } else {
  /*
- * Don't allow fetch of user.virtfs namesapce
+ * Don't allow fetch of user.virtfs namespace
   * in case of mapped security
   */
  return 0;
@@ -74,7 +74,7 @@ static int mp_user_setxattr(FsContext *ctx, const char *path, 
const char *name,
  {
  if (strncmp(name, "user.virtfs.", 12) == 0) {
  /*
- * Don't allow fetch of user.virtfs namesapce
+ * Don't allow fetch of user.virtfs namespace
   * in case of mapped security
   */
  errno = EACCES;
@@ -88,7 +88,7 @@ static int mp_user_removexattr(FsContext *ctx,
  {
  if (strncmp(name, "user.virtfs.", 12) == 0) {
  /*
- * Don't allow fetch of user.virtfs namesapce
+ * Don't allow fetch of user.virtfs namespace
   * in case of mapped security
   */
  errno = EACCES;
diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
index 0d43da19ea..5f85b16327 100644
--- a/hw/acpi/nvdimm.c
+++ b/hw/acpi/nvdimm.c
@@ -476,7 +476,7 @@ struct NvdimmFuncGetLabelDataOut {
  /* the size of buffer filled by QEMU. */
  uint32_t len;
  uint32_t func_ret_status; /* return status code. */
-uint8_t out_buf[]; /* the data got via Get Namesapce Label function. */
+uint8_t out_buf[]; /* the data got via Get Namespace Label function. */
  } QEMU_PACKED;
  typedef struct NvdimmFuncGetLabelDataOut NvdimmFuncGetLabelDataOut;
  QEMU_BUILD_BUG_ON(sizeof(NvdimmFuncGetLabelDataOut) > NVDIMM_DSM_MEMORY_SIZE);
diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 1e6e0fcad9..770a38381a 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -71,7 +71,7 @@
   *   the SUBNQN field in the controller will report the NQN of the subsystem
   *   device. This also enables multi controller capability represented in
   *   Identify Controller data structure in CMIC (Controller Multi-path I/O and
- *   Namesapce Sharing Capabilities).
+ *   Namespace Sharing Capabilities).
   *
   * - `aerl`
   *   The Asynchronous Event Request Limit (AERL). Indicates the maximum number


Applied to my trivial-patches branch.

Thanks,
Laurent




Re: [RFC v3 3/5] file-posix: introduce get_sysfs_long_val for zoned device information.

2022-06-28 Thread Sam Li
Stefan Hajnoczi  于2022年6月28日周二 16:20写道:
>
> On Mon, Jun 27, 2022 at 08:19:15AM +0800, Sam Li wrote:
> > Use sysfs attribute files to get the zoned device information in case
> > that ioctl() commands of zone management interface won't work. It can
> > return long type of value like chunk_sectors, zoned_append_max_bytes,
> > max_open_zones, max_active_zones.
> > ---
> >  block/file-posix.c | 37 +
> >  1 file changed, 25 insertions(+), 12 deletions(-)
> >
> > diff --git a/block/file-posix.c b/block/file-posix.c
> > index 1b8b0d351f..73c2cdfbca 100644
> > --- a/block/file-posix.c
> > +++ b/block/file-posix.c
> > @@ -1216,15 +1216,19 @@ static int hdev_get_max_hw_transfer(int fd, struct 
> > stat *st)
> >  #endif
> >  }
> >
> > -static int hdev_get_max_segments(int fd, struct stat *st)
> > -{
> > +/*
> > + * Get zoned device information (chunk_sectors, zoned_append_max_bytes,
> > + * max_open_zones, max_active_zones) through sysfs attribute files.
>
> This function is also used to get max_segments, which is not related to
> zoned devices. How about:
>
>   Get a block queue sysfs attribute value.
>
> > + */
> > +static long get_sysfs_long_val(int fd, struct stat *st,
> > +   const char *attribute) {
> >  #ifdef CONFIG_LINUX
> >  char buf[32];
> >  const char *end;
> >  char *sysfspath = NULL;
> >  int ret;
> >  int sysfd = -1;
> > -long max_segments;
> > +long val;
> >
> >  if (S_ISCHR(st->st_mode)) {
> >  if (ioctl(fd, SG_GET_SG_TABLESIZE, ) == 0) {
> > @@ -1237,8 +1241,9 @@ static int hdev_get_max_segments(int fd, struct stat 
> > *st)
> >  return -ENOTSUP;
> >  }
> >
> > -sysfspath = g_strdup_printf("/sys/dev/block/%u:%u/queue/max_segments",
> > -major(st->st_rdev), minor(st->st_rdev));
> > +sysfspath = g_strdup_printf("/sys/dev/block/%u:%u/queue/%s",
> > +major(st->st_rdev), minor(st->st_rdev),
> > +attribute);
> >  sysfd = open(sysfspath, O_RDONLY);
> >  if (sysfd == -1) {
> >  ret = -errno;
> > @@ -1256,9 +1261,9 @@ static int hdev_get_max_segments(int fd, struct stat 
> > *st)
> >  }
> >  buf[ret] = 0;
> >  /* The file is ended with '\n', pass 'end' to accept that. */
> > -ret = qemu_strtol(buf, , 10, _segments);
> > +ret = qemu_strtol(buf, , 10, );
> >  if (ret == 0 && end && *end == '\n') {
> > -ret = max_segments;
> > +ret = val;
> >  }
> >
> >  out:
> > @@ -1272,6 +1277,15 @@ out:
> >  #endif
> >  }
> >
> > +static int hdev_get_max_segments(int fd, struct stat *st) {
> > +int ret;
> > +ret = get_sysfs_long_val(fd, st, "max_segments");
> > +if (ret < 0) {
> > +return -1;
>
> This hides the actual error number. The if statement can be dropped and
> the function can be simplified to:
>
>   static int hdev_get_max_segments(int fd, struct stat *st) {
>   return get_sysfs_long_val(fd, st, "max_segments");
>   }
>
> Whether you want to keep the tiny helper function or inline
> get_sysfs_long_val(fd, st, "max_segments") into the caller is up to you.
>
> > +}
> > +return ret;
> > +}
> > +
> >  static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
> >  {
> >  BDRVRawState *s = bs->opaque;
> > @@ -1872,6 +1886,7 @@ static int handle_aiocb_zone_report(void *opaque) {
> >
> >  static int handle_aiocb_zone_mgmt(void *opaque) {
> >  RawPosixAIOData *aiocb = opaque;
> > +BlockDriverState *s = aiocb->bs;
> >  int fd = aiocb->aio_fildes;
> >  int64_t offset = aiocb->aio_offset;
> >  int64_t len = aiocb->aio_nbytes;
> > @@ -1884,11 +1899,9 @@ static int handle_aiocb_zone_mgmt(void *opaque) {
> >  int64_t zone_size_mask;
> >  int ret;
> >
> > -ret = ioctl(fd, BLKGETZONESZ, _size);
> > -if (ret) {
> > -return -1;
> > -}
> > -
> > +g_autofree struct stat *file = g_new(struct stat, 1);
> > +stat(s->filename, file);
>
> There is no need to allocate struct stat using g_new(). It's a small
> struct that can be on the stack.
>
> Also, fstat(fd, ) is preferred when the file descriptor is already
> open because it avoids race conditions due to file renaming/path
> traversal.
>
> Here is how this could be written:
>
>   struct stat st;
>   if (fstat(fd, ) < 0) {
>   return -errno;
>   }

Thanks for the suggestions!

>
> > +zone_size = get_sysfs_long_val(fd, file, "chunk_sectors");
> >  zone_size_mask = zone_size - 1;
> >  if (offset & zone_size_mask) {
> >  error_report("offset is not the start of a zone");
> > --
> > 2.35.3
> >



Re: [RFC v3 1/5] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls.

2022-06-28 Thread Damien Le Moal
On 6/28/22 17:05, Sam Li wrote:
> Stefan Hajnoczi  于2022年6月28日周二 14:52写道:
>>
>> On Mon, Jun 27, 2022 at 08:19:13AM +0800, Sam Li wrote:
>>> diff --git a/block/block-backend.c b/block/block-backend.c
>>> index e0e1aff4b1..786f964d02 100644
>>> --- a/block/block-backend.c
>>> +++ b/block/block-backend.c
>>> @@ -1810,6 +1810,62 @@ int blk_flush(BlockBackend *blk)
>>>  return ret;
>>>  }
>>>
>>> +/*
>>> + * Return zone_report from BlockDriver. Offset can be any number within
>>> + * the zone size. No alignment for offset and len.
>>
>> What is the purpose of len? Is it the maximum number of zones to return
>> in nr_zones[]?
> 
> len is actually not used in bdrv_co_zone_report. It is needed by
> blk_check_byte_request.

There is no IO buffer being passed. Only an array of zone descriptors and
an in-out number of zones. len is definitely not needed. Not sure what
blk_check_byte_request() is intended to check though.

> 
>> How does the caller know how many zones were returned?
> 
> nr_zones represents IN maximum and OUT actual. The caller will know by
> nr_zones which is changed in bdrv_co_zone_report. I'll add it in the
> comments.
> 
>>
>>> + */
>>> +int coroutine_fn blk_co_zone_report(BlockBackend *blk, int64_t offset,
>>> +   int64_t len, int64_t *nr_zones,
>>> +   BlockZoneDescriptor *zones)
>>> +{
>>> +int ret;
>>> +BlockDriverState *bs;
>>> +IO_CODE();
>>> +
>>> +blk_inc_in_flight(blk); /* increase before waiting */
>>> +blk_wait_while_drained(blk);
>>> +bs = blk_bs(blk);
>>> +
>>> +ret = blk_check_byte_request(blk, offset, len);
>>> +if (ret < 0) {
>>> +return ret;
>>> +}
>>> +
>>> +bdrv_inc_in_flight(bs);
>>
>> The bdrv_inc/dec_in_flight() call should be inside
>> bdrv_co_zone_report(). See bdrv_co_ioctl() for an example.
>>
>>> +ret = bdrv_co_zone_report(blk->root->bs, offset, len,
>>> +  nr_zones, zones);
>>> +bdrv_dec_in_flight(bs);
>>> +blk_dec_in_flight(blk);
>>> +return ret;
>>> +}
>>> +
>>> +/*
>>> + * Return zone_mgmt from BlockDriver.
>>
>> Maybe this should be:
>>
>>   Send a zone management command.
> 
> Yes, it's more accurate.
> 
>>
>>> @@ -216,6 +217,11 @@ typedef struct RawPosixAIOData {
>>>  PreallocMode prealloc;
>>>  Error **errp;
>>>  } truncate;
>>> +struct {
>>> +int64_t *nr_zones;
>>> +BlockZoneDescriptor *zones;
>>> +} zone_report;
>>> +zone_op op;
>>
>> It's cleaner to put op inside a struct zone_mgmt so its purpose is
>> self-explanatory:
>>
>>   struct {
>>   zone_op op;
>>   } zone_mgmt;
>>
>>> +static int handle_aiocb_zone_report(void *opaque) {
>>> +RawPosixAIOData *aiocb = opaque;
>>> +int fd = aiocb->aio_fildes;
>>> +int64_t *nr_zones = aiocb->zone_report.nr_zones;
>>> +BlockZoneDescriptor *zones = aiocb->zone_report.zones;
>>> +int64_t offset = aiocb->aio_offset;
>>> +int64_t len = aiocb->aio_nbytes;

Since you have the zone array and number of zones (size of that array) I
really do not see why you need len.

>>> +
>>> +struct blk_zone *blkz;
>>> +int64_t rep_size, nrz;
>>> +int ret, n = 0, i = 0;
>>> +
>>> +nrz = *nr_zones;
>>> +if (len == -1) {
>>> +return -errno;
>>
>> Where is errno set? Should this be an errno constant instead like
>> -EINVAL?
> 
> That's right! Noted.
> 
>>
>>> +}
>>> +rep_size = sizeof(struct blk_zone_report) + nrz * sizeof(struct 
>>> blk_zone);
>>> +g_autofree struct blk_zone_report *rep = g_new(struct blk_zone_report, 
>>> nrz);
>>
>> g_new() looks incorrect. There should be 1 struct blk_zone_report
>> followed by nrz struct blk_zone structs. Please use g_malloc(rep_size)
>> instead.
> 
> Yes! However, it still has a memory leak error when using g_autofree
> && g_malloc.

That may be because you are changing the value of the rep pointer while
parsing the report ?

> 
>>
>>> +offset = offset / 512; /* get the unit of the start sector: sector 
>>> size is 512 bytes. */
>>> +printf("start to report zone with offset: 0x%lx\n", offset);
>>> +
>>> +blkz = (struct blk_zone *)(rep + 1);
>>> +while (n < nrz) {
>>> +memset(rep, 0, rep_size);
>>> +rep->sector = offset;
>>> +rep->nr_zones = nrz;
>>
>> What prevents zones[] overflowing? nrz isn't being reduced in the loop,
>> so maybe the rep->nr_zones return value will eventually exceed the
>> number of elements still available in zones[n..]?
> 
> I suppose the number of zones[] is restricted in the subsequent for
> loop where zones[] copy one zone at a time as n increases. Even if
> rep->zones exceeds the available room in zones[], the extra zone will
> not be copied.
> 
>>
>>> +static int handle_aiocb_zone_mgmt(void *opaque) {
>>> +RawPosixAIOData *aiocb = opaque;
>>> +int fd = aiocb->aio_fildes;
>>> +int64_t offset = aiocb->aio_offset;
>>> +int64_t len = 

Re: [RFC v3 1/5] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls.

2022-06-28 Thread Stefan Hajnoczi
On Mon, Jun 27, 2022 at 08:19:13AM +0800, Sam Li wrote:
> diff --git a/block/block-backend.c b/block/block-backend.c
> index e0e1aff4b1..786f964d02 100644
> --- a/block/block-backend.c
> +++ b/block/block-backend.c
> @@ -1810,6 +1810,62 @@ int blk_flush(BlockBackend *blk)
>  return ret;
>  }
>  
> +/*
> + * Return zone_report from BlockDriver. Offset can be any number within
> + * the zone size. No alignment for offset and len.

What is the purpose of len? Is it the maximum number of zones to return
in nr_zones[]?

How does the caller know how many zones were returned?

> + */
> +int coroutine_fn blk_co_zone_report(BlockBackend *blk, int64_t offset,
> +   int64_t len, int64_t *nr_zones,
> +   BlockZoneDescriptor *zones)
> +{
> +int ret;
> +BlockDriverState *bs;
> +IO_CODE();
> +
> +blk_inc_in_flight(blk); /* increase before waiting */
> +blk_wait_while_drained(blk);
> +bs = blk_bs(blk);
> +
> +ret = blk_check_byte_request(blk, offset, len);
> +if (ret < 0) {
> +return ret;
> +}
> +
> +bdrv_inc_in_flight(bs);

The bdrv_inc/dec_in_flight() call should be inside
bdrv_co_zone_report(). See bdrv_co_ioctl() for an example.

> +ret = bdrv_co_zone_report(blk->root->bs, offset, len,
> +  nr_zones, zones);
> +bdrv_dec_in_flight(bs);
> +blk_dec_in_flight(blk);
> +return ret;
> +}
> +
> +/*
> + * Return zone_mgmt from BlockDriver.

Maybe this should be:

  Send a zone management command.

> @@ -216,6 +217,11 @@ typedef struct RawPosixAIOData {
>  PreallocMode prealloc;
>  Error **errp;
>  } truncate;
> +struct {
> +int64_t *nr_zones;
> +BlockZoneDescriptor *zones;
> +} zone_report;
> +zone_op op;

It's cleaner to put op inside a struct zone_mgmt so its purpose is
self-explanatory:

  struct {
  zone_op op;
  } zone_mgmt;

> +static int handle_aiocb_zone_report(void *opaque) {
> +RawPosixAIOData *aiocb = opaque;
> +int fd = aiocb->aio_fildes;
> +int64_t *nr_zones = aiocb->zone_report.nr_zones;
> +BlockZoneDescriptor *zones = aiocb->zone_report.zones;
> +int64_t offset = aiocb->aio_offset;
> +int64_t len = aiocb->aio_nbytes;
> +
> +struct blk_zone *blkz;
> +int64_t rep_size, nrz;
> +int ret, n = 0, i = 0;
> +
> +nrz = *nr_zones;
> +if (len == -1) {
> +return -errno;

Where is errno set? Should this be an errno constant instead like
-EINVAL?

> +}
> +rep_size = sizeof(struct blk_zone_report) + nrz * sizeof(struct 
> blk_zone);
> +g_autofree struct blk_zone_report *rep = g_new(struct blk_zone_report, 
> nrz);

g_new() looks incorrect. There should be 1 struct blk_zone_report
followed by nrz struct blk_zone structs. Please use g_malloc(rep_size)
instead.

> +offset = offset / 512; /* get the unit of the start sector: sector size 
> is 512 bytes. */
> +printf("start to report zone with offset: 0x%lx\n", offset);
> +
> +blkz = (struct blk_zone *)(rep + 1);
> +while (n < nrz) {
> +memset(rep, 0, rep_size);
> +rep->sector = offset;
> +rep->nr_zones = nrz;

What prevents zones[] overflowing? nrz isn't being reduced in the loop,
so maybe the rep->nr_zones return value will eventually exceed the
number of elements still available in zones[n..]?

> +static int handle_aiocb_zone_mgmt(void *opaque) {
> +RawPosixAIOData *aiocb = opaque;
> +int fd = aiocb->aio_fildes;
> +int64_t offset = aiocb->aio_offset;
> +int64_t len = aiocb->aio_nbytes;
> +zone_op op = aiocb->op;
> +
> +struct blk_zone_range range;
> +const char *ioctl_name;
> +unsigned long ioctl_op;
> +int64_t zone_size;
> +int64_t zone_size_mask;
> +int ret;
> +
> +ret = ioctl(fd, BLKGETZONESZ, _size);

Can this value be stored in bs (maybe in BlockLimits) to avoid calling
ioctl(BLKGETZONESZ) each time?

> +if (ret) {
> +return -1;

The return value should be a negative errno. -1 is -EPERM but it's
probably not that error code you wanted. This should be:

  return -errno;

> +}
> +
> +zone_size_mask = zone_size - 1;
> +if (offset & zone_size_mask) {
> +error_report("offset is not the start of a zone");
> +return -1;

return -EINVAL;

> +}
> +
> +if (len & zone_size_mask) {
> +error_report("len is not aligned to zones");
> +return -1;

return -EINVAL;

> +static int coroutine_fn raw_co_zone_report(BlockDriverState *bs, int64_t 
> offset,
> +int64_t len, int64_t *nr_zones,
> +BlockZoneDescriptor *zones) {
> +BDRVRawState *s = bs->opaque;
> +RawPosixAIOData acb;
> +
> +acb = (RawPosixAIOData) {
> +.bs = bs,
> +.aio_fildes = s->fd,
> +.aio_type   = QEMU_AIO_IOCTL,
> +.aio_offset = offset,
> +.aio_nbytes = len,
> +.zone_report= {
> +.nr_zones   = 

Re: [RFC v3 3/5] file-posix: introduce get_sysfs_long_val for zoned device information.

2022-06-28 Thread Stefan Hajnoczi
On Mon, Jun 27, 2022 at 08:19:15AM +0800, Sam Li wrote:
> Use sysfs attribute files to get the zoned device information in case
> that ioctl() commands of zone management interface won't work. It can
> return long type of value like chunk_sectors, zoned_append_max_bytes,
> max_open_zones, max_active_zones.
> ---
>  block/file-posix.c | 37 +
>  1 file changed, 25 insertions(+), 12 deletions(-)
> 
> diff --git a/block/file-posix.c b/block/file-posix.c
> index 1b8b0d351f..73c2cdfbca 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -1216,15 +1216,19 @@ static int hdev_get_max_hw_transfer(int fd, struct 
> stat *st)
>  #endif
>  }
>  
> -static int hdev_get_max_segments(int fd, struct stat *st)
> -{
> +/*
> + * Get zoned device information (chunk_sectors, zoned_append_max_bytes,
> + * max_open_zones, max_active_zones) through sysfs attribute files.

This function is also used to get max_segments, which is not related to
zoned devices. How about:

  Get a block queue sysfs attribute value.

> + */
> +static long get_sysfs_long_val(int fd, struct stat *st,
> +   const char *attribute) {
>  #ifdef CONFIG_LINUX
>  char buf[32];
>  const char *end;
>  char *sysfspath = NULL;
>  int ret;
>  int sysfd = -1;
> -long max_segments;
> +long val;
>  
>  if (S_ISCHR(st->st_mode)) {
>  if (ioctl(fd, SG_GET_SG_TABLESIZE, ) == 0) {
> @@ -1237,8 +1241,9 @@ static int hdev_get_max_segments(int fd, struct stat 
> *st)
>  return -ENOTSUP;
>  }
>  
> -sysfspath = g_strdup_printf("/sys/dev/block/%u:%u/queue/max_segments",
> -major(st->st_rdev), minor(st->st_rdev));
> +sysfspath = g_strdup_printf("/sys/dev/block/%u:%u/queue/%s",
> +major(st->st_rdev), minor(st->st_rdev),
> +attribute);
>  sysfd = open(sysfspath, O_RDONLY);
>  if (sysfd == -1) {
>  ret = -errno;
> @@ -1256,9 +1261,9 @@ static int hdev_get_max_segments(int fd, struct stat 
> *st)
>  }
>  buf[ret] = 0;
>  /* The file is ended with '\n', pass 'end' to accept that. */
> -ret = qemu_strtol(buf, , 10, _segments);
> +ret = qemu_strtol(buf, , 10, );
>  if (ret == 0 && end && *end == '\n') {
> -ret = max_segments;
> +ret = val;
>  }
>  
>  out:
> @@ -1272,6 +1277,15 @@ out:
>  #endif
>  }
>  
> +static int hdev_get_max_segments(int fd, struct stat *st) {
> +int ret;
> +ret = get_sysfs_long_val(fd, st, "max_segments");
> +if (ret < 0) {
> +return -1;

This hides the actual error number. The if statement can be dropped and
the function can be simplified to:

  static int hdev_get_max_segments(int fd, struct stat *st) {
  return get_sysfs_long_val(fd, st, "max_segments");
  }

Whether you want to keep the tiny helper function or inline
get_sysfs_long_val(fd, st, "max_segments") into the caller is up to you.

> +}
> +return ret;
> +}
> +
>  static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
>  {
>  BDRVRawState *s = bs->opaque;
> @@ -1872,6 +1886,7 @@ static int handle_aiocb_zone_report(void *opaque) {
>  
>  static int handle_aiocb_zone_mgmt(void *opaque) {
>  RawPosixAIOData *aiocb = opaque;
> +BlockDriverState *s = aiocb->bs;
>  int fd = aiocb->aio_fildes;
>  int64_t offset = aiocb->aio_offset;
>  int64_t len = aiocb->aio_nbytes;
> @@ -1884,11 +1899,9 @@ static int handle_aiocb_zone_mgmt(void *opaque) {
>  int64_t zone_size_mask;
>  int ret;
>  
> -ret = ioctl(fd, BLKGETZONESZ, _size);
> -if (ret) {
> -return -1;
> -}
> -
> +g_autofree struct stat *file = g_new(struct stat, 1);
> +stat(s->filename, file);

There is no need to allocate struct stat using g_new(). It's a small
struct that can be on the stack.

Also, fstat(fd, ) is preferred when the file descriptor is already
open because it avoids race conditions due to file renaming/path
traversal.

Here is how this could be written:

  struct stat st;
  if (fstat(fd, ) < 0) {
  return -errno;
  }

> +zone_size = get_sysfs_long_val(fd, file, "chunk_sectors");
>  zone_size_mask = zone_size - 1;
>  if (offset & zone_size_mask) {
>  error_report("offset is not the start of a zone");
> -- 
> 2.35.3
> 


signature.asc
Description: PGP signature


Re: [RFC v3 5/5] qemu-iotests: add zone operation tests.

2022-06-28 Thread Stefan Hajnoczi
On Mon, Jun 27, 2022 at 08:19:17AM +0800, Sam Li wrote:
> diff --git a/tests/qemu-iotests/tests/zoned.sh 
> b/tests/qemu-iotests/tests/zoned.sh
> new file mode 100755
> index 00..262c0b5427
> --- /dev/null
> +++ b/tests/qemu-iotests/tests/zoned.sh
> @@ -0,0 +1,49 @@
> +#!/usr/bin/env bash
> +#
> +# Test zone management operations.
> +#
> +
> +QEMU_IO="build/qemu-io"
> +IMG="--image-opts driver=zoned_host_device,filename=/dev/nullb0"
> +QEMU_IO_OPTIONS=$QEMU_IO_OPTIONS_NO_FMT
> +
> +echo "Testing a null_blk device"
> +echo "Simple cases: if the operations work"
> +sudo modprobe null_blk nr_devices=1 zoned=1

Please use bash's "trap" command to remove null_blk on exit. That way
cleanup happens whether the script exits successfully or not. See
tests/qemu-iotests/108 for an example.

> +# success, all done
> +sudo rmmod null_blk
> +echo "*** done"
> +#rm -f $seq.full

Why is this commented out?


signature.asc
Description: PGP signature


Re: [PATCH v2 3/4] hw/nvme: add support for ratified TP4084

2022-06-28 Thread Klaus Jensen
On Jun 27 13:47, Niklas Cassel wrote:
> TP4084 adds a new mode, CC.CRIME, that can be used to mark a namespace
> as ready independently from the controller.
> 
> When CC.CRIME is 0 (default), things behave as before, all namespaces
> are ready when CSTS.RDY gets set to 1.
> 
> When CC.CRIME is 1, the controller will become ready when CSTS.RDY gets
> set to 1, but commands accessing a namespace are allowed to return
> "Namespace Not Ready" or "Admin Command Media Not Ready".
> After CRTO.CRWMT amount of time, if the namespace has not yet been
> marked ready, the status codes also need to have the DNR bit set.
> 
> Add a new "ready_delay" namespace device parameter, in order to emulate
> different ready latencies for namespaces.
> 
> Once a namespace is ready, it will set the NRDY bit in the I/O Command
> Set Independent Identify Namespace Data Structure, and then send out a
> Namespace Attribute Changed event.
> 
> This new "ready_delay" is supported on controllers not part of a NVMe
> subsystem. The reasons are many. One problem is that multiple controllers
> can have different CC.CRIME modes running. Another problem is the extra
> locking needed. The third problem is when to actually clear NRDY. If we
> assume that a namespace clears NRDY when it no longer has any controller
> online for that namespace. The problem then is that Linux will reset the
> controllers one by one during probe time. The reset goes so fast so that
> there is no time when all controllers are in reset at the same time, so
> NRDY will never get cleared. (The controllers are enabled by SeaBIOS by
> default.) We could introduce a reset_time param, but this would only
> increase the chances that all controllers are in reset at the same time.
> 
> Signed-off-by: Niklas Cassel 
> ---
>  hw/nvme/ctrl.c   | 123 +--
>  hw/nvme/ns.c |  18 +++
>  hw/nvme/nvme.h   |   6 +++
>  hw/nvme/trace-events |   1 +
>  include/block/nvme.h |  60 -
>  5 files changed, 204 insertions(+), 4 deletions(-)
> 
> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> index 8ca824ea14..5404f67480 100644
> --- a/hw/nvme/ctrl.c
> +++ b/hw/nvme/ctrl.c
> @@ -88,6 +88,12 @@
>   *   completion when there are no outstanding AERs. When the maximum number 
> of
>   *   enqueued events are reached, subsequent events will be dropped.
>   *
> + * - `crwmt`
> + *   This is the Controller Ready With Media Timeout (CRWMT) field that is
> + *   defined in the CRTO register. This specifies the worst-case time that 
> host
> + *   software should wait for the controller and all attached namespaces to
> + *   become ready. The value is in units of 500 milliseconds.
> + *
>   * - `mdts`
>   *   Indicates the maximum data transfer size for a command that transfers 
> data
>   *   between host-accessible memory and the controller. The value is 
> specified
> @@ -157,6 +163,11 @@
>   *   namespace will be available in the subsystem but not attached to any
>   *   controllers.
>   *
> + * - `ready_delay`
> + *   This parameter specifies the amount of time that the namespace should 
> wait
> + *   before being marked ready. Only applicable if CC.CRIME is set by the 
> user.
> + *   The value is in units of 500 milliseconds (to be consistent with 
> `crwmt`).
> + *
>   * Setting `zoned` to true selects Zoned Command Set at the namespace.
>   * In this case, the following namespace properties are available to 
> configure
>   * zoned operation:
> @@ -216,6 +227,8 @@
>  #define NVME_VF_RES_GRANULARITY 1
>  #define NVME_VF_OFFSET 0x1
>  #define NVME_VF_STRIDE 1
> +#define NVME_DEFAULT_CRIMT 0xa
> +#define NVME_DEFAULT_CRWMT 0xf
>  
>  #define NVME_GUEST_ERR(trace, fmt, ...) \
>  do { \
> @@ -4188,6 +4201,10 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeRequest 
> *req)
>  return NVME_INVALID_OPCODE | NVME_DNR;
>  }
>  
> +if (!(ns->id_indep_ns.nstat & NVME_NSTAT_NRDY)) {
> +return NVME_NS_NOT_READY;
> +}
> +

I'd add a ns->ready bool instead. See below for my previously posted
patch.

>  if (ns->status) {
>  return ns->status;
>  }
> @@ -4791,6 +4808,27 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, 
> NvmeRequest *req, bool active)
>  return NVME_INVALID_CMD_SET | NVME_DNR;
>  }
>  
> +static uint16_t nvme_identify_cs_indep_ns(NvmeCtrl *n, NvmeRequest *req)
> +{
> +NvmeNamespace *ns;
> +NvmeIdentify *c = (NvmeIdentify *)>cmd;
> +uint32_t nsid = le32_to_cpu(c->nsid);
> +
> +trace_pci_nvme_identify_cs_indep_ns(nsid);
> +
> +if (!nvme_nsid_valid(n, nsid) || nsid == NVME_NSID_BROADCAST) {
> +return NVME_INVALID_NSID | NVME_DNR;
> +}
> +
> +ns = nvme_ns(n, nsid);
> +if (unlikely(!ns)) {
> +return nvme_rpt_empty_id_struct(n, req);
> +}
> +
> +return nvme_c2h(n, (uint8_t *)>id_indep_ns, sizeof(NvmeIdNsCsIndep),
> +req);
> +}
> +

I posted a patch for this some time ago


Re: [RFC v3 2/5] qemu-io: add zoned block device operations.

2022-06-28 Thread Stefan Hajnoczi
On Mon, Jun 27, 2022 at 08:19:14AM +0800, Sam Li wrote:
> diff --git a/qemu-io-cmds.c b/qemu-io-cmds.c
> index 2f0d8ac25a..3f2592b9f5 100644
> --- a/qemu-io-cmds.c
> +++ b/qemu-io-cmds.c
> @@ -1706,6 +1706,122 @@ static const cmdinfo_t flush_cmd = {
>  .oneline= "flush all in-core file state to disk",
>  };
>  
> +static int zone_report_f(BlockBackend *blk, int argc, char **argv)
> +{
> +int ret;
> +int64_t offset, len, nr_zones;
> +int i = 0;
> +
> +++optind;
> +offset = cvtnum(argv[optind]);
> +++optind;
> +len = cvtnum(argv[optind]);
> +++optind;
> +nr_zones = cvtnum(argv[optind]);
> +
> +g_autofree BlockZoneDescriptor *zones = g_new(BlockZoneDescriptor, 
> nr_zones);
> +ret = blk_zone_report(blk, offset, len, _zones, zones);
> +while (i < nr_zones) {

Does blk_zone_report() set nr_zones to 0 on failure or do we need to
check if (ret < 0) here?

> +fprintf(stdout, "start: 0x%lx, len 0x%lx, cap 0x%lx, wptr 0x%lx, "

The rest of the source file uses printf() instead of fprintf(stdout,
...). That's usually preferred because it's shorter.

> +"zcond:%u, [type: %u]\n",

Please use PRIx64 instead of lx format specifiers for portability. On
32-bit hosts lx is 32-bit, not 64-bit. You can grep QEMU's code for
examples of PRIx64.

> +zones[i].start, zones[i].length, zones[i].cap, zones[i].wp,
> +zones[i].cond, zones[i].type);
> +++i;
> +}

A for loop is more idiomatic:

  for (int i = 0; i < nr_zones; i++) {
  ...
  }

> +return ret;
> +}
> +
> +static const cmdinfo_t zone_report_cmd = {
> +.name = "zone_report",
> +.altname = "f",
> +.cfunc = zone_report_f,
> +.argmin = 3,
> +.argmax = 3,
> +.args = "offset [offset..] len [len..] number [num..]",

The arguments are "offset len number". This command does not accept
optional offset/len/num arguments.

> +.oneline = "report a number of zones",

Maybe "report zone information".

> +};
> +
> +static int zone_open_f(BlockBackend *blk, int argc, char **argv)
> +{
> +int64_t offset, len;
> +++optind;
> +offset = cvtnum(argv[optind]);
> +++optind;
> +len = cvtnum(argv[optind]);
> +return blk_zone_mgmt(blk, zone_open, offset, len);

Where is the error reported? When I look at read_f() I see:

if (ret < 0) {
printf("read failed: %s\n", strerror(-ret));

I think something similar is needed because qemu-io.c does not print an
error message for us. The same is true for the other commands defined in
this patch.

> +}
> +
> +static const cmdinfo_t zone_open_cmd = {
> +.name = "zone_open",
> +.altname = "f",
> +.cfunc = zone_open_f,
> +.argmin = 2,
> +.argmax = 2,
> +.args = "offset [offset..] len [len..]",

There are no optional offset/len args. The same is true for the other
commands below.


signature.asc
Description: PGP signature


Re: [PATCH v4 6/7] block/copy-before-write: implement cbw-timeout option

2022-06-28 Thread Vladimir Sementsov-Ogievskiy

While debugging my "[PULL 00/10] Block jobs & NBD patches", I found that we 
have bdrv_dec_in_flight() and bdrv_inc_in_flight().

So, this should be fixed:

On 4/7/22 16:27, Vladimir Sementsov-Ogievskiy wrote:

In some scenarios, when copy-before-write operations lasts too long
time, it's better to cancel it.

Most useful would be to use the new option together with
on-cbw-error=break-snapshot: this way if cbw operation takes too long
time we'll just cancel backup process but do not disturb the guest too
much.



[..]

  
+static void block_copy_cb(void *opaque)

+{
+BlockDriverState *bs = opaque;
+
+bs->in_flight--;
+aio_wait_kick();


Just bdrv_dec_in_flight(bs), which includes aio_wait_kick().


+}
+
  /*
   * Do copy-before-write operation.
   *
@@ -111,7 +120,16 @@ static coroutine_fn int 
cbw_do_copy_before_write(BlockDriverState *bs,
  off = QEMU_ALIGN_DOWN(offset, cluster_size);
  end = QEMU_ALIGN_UP(offset + bytes, cluster_size);
  
-ret = block_copy(s->bcs, off, end - off, true, 0, NULL, NULL);

+/*
+ * Increase in_flight, so that in case of timed-out block-copy, the
+ * remaining background block_copy() request (which can't be immediately
+ * cancelled by timeout) is presented in bs->in_flight. This way we are
+ * sure that on bs close() we'll previously wait for all timed-out but yet
+ * running block_copy calls.
+ */
+bs->in_flight++;


bdrv_inc_in_flight(bs)


+ret = block_copy(s->bcs, off, end - off, true, s->cbw_timeout_ns,
+ block_copy_cb, bs);
  if (ret < 0 && s->on_cbw_error == ON_CBW_ERROR_BREAK_GUEST_WRITE) {
  return ret;
  }
@@ -377,6 +395,7 @@ static BlockdevOptions *cbw_parse_options(QDict *options, 
Error **errp)
   */
  qdict_extract_subqdict(options, NULL, "bitmap");
  qdict_del(options, "on-cbw-error");
+qdict_del(options, "cbw-timeout");
  


I'm going to resend "[PULL 00/10] Block jobs & NBD patches" with this fix and 
with fix in 03, if no objections.

--
Best regards,
Vladimir



Re: [PATCH v7 10/18] jobs: rename static functions called with job_mutex held

2022-06-28 Thread Emanuele Giuseppe Esposito



Am 22/06/2022 um 20:38 schrieb Vladimir Sementsov-Ogievskiy:
> On 6/22/22 17:26, Emanuele Giuseppe Esposito wrote:
>>
>>
>> Am 21/06/2022 um 19:26 schrieb Vladimir Sementsov-Ogievskiy:
>>> On 6/16/22 16:18, Emanuele Giuseppe Esposito wrote:
 With the*nop*  job_lock/unlock placed, rename the static
 functions that are always under job_mutex, adding "_locked" suffix.

 List of functions that get this suffix:
 job_txn_ref   job_txn_del_job
 job_txn_apply   job_state_transition
 job_should_pause   job_event_cancelled
 job_event_completed   job_event_pending
 job_event_ready   job_event_idle
 job_do_yield   job_timer_not_pending
 job_do_dismiss   job_conclude
 job_update_rc   job_commit
 job_abort   job_clean
 job_finalize_single   job_cancel_async
 job_completed_txn_abort   job_prepare
 job_needs_finalize   job_do_finalize
 job_transition_to_pending  job_completed_txn_success
 job_completed   job_cancel_err
 job_force_cancel_err

 Note that "locked" refers to the*nop*  job_lock/unlock, and not
 real_job_lock/unlock.

 No functional change intended.

 Signed-off-by: Emanuele Giuseppe Esposito
>>>
>>>
>>> Hmm. Maybe it was already discussed.. But for me it seems, that it would
>>> be simpler to review previous patches, that fix job_ API users to use
>>> locking properly, if this renaming go earlier.
>>>
>>> Anyway, in this series, we can't update everything at once. So patch to
>>> patch, we make the code more and more correct. (yes I remember that
>>> lock() is a noop, but I should review thinking that it real, otherwise,
>>> how to review?)
>>>
>>> So, I'm saying about formal correctness of using lock() unlock()
>>> function in connection with introduced _locked prifixes and in
>>> connection with how it should finally work.
>>>
>>> You do:
>>>
>>> 05. introduce some _locked functions, that just duplicates, and
>>> job_pause_point_locked() is formally inconsistent, as I said.
>>>
>>> 06. Update a lot of places, to give them their final form (but not
>>> final, as some functions will be renamed to _locked, some not, hard to
>>> imagine)
>>>
>>> 07,08,09. Update some more, and even more places. very hard to track
>>> formal correctness of using locks
>>>
>>> 10-...: rename APIs.
>>>
>>>
>>> What do you think about the following:
>>>
>>> 1. Introduce noop lock, and some internal _locked() versions, and keep
>>> formal consistency inside job.c, considering all public interfaces as
>>> unlocked:
>>>
>>>   at this point:
>>>    - everything correct inside job.c
>>>    - no public interfaces with _locked prefix
>>>    - all public interfaces take mutex internally
>>>    - no external user take mutex by hand
>>>
>>> We can rename all internal static functions at this step too.
>>>
>>> 2. Introduce some public _locked APIs, that we'll use in next patches
>>>
>>> 3. Now start fixing external users in several patches:
>>>      - protect by mutex direct use of job fields
>>>    - make wider locks and move to _locked APIs inside them where needed
>>>
>>>
>>> In this scenario, every updated unit becomes formally correct after
>>> update, and after all steps everything is formally correct, and we can
>>> move to turning-on the mutex.
>>>
>>
>> I don't understand your logic also here, sorry :(
>>
>> I assume you want to keep patch 1-4, then the problem is assing job_lock
>> and renaming functions in _locked.
>> So I would say the problem is in patch 5-6-10-11-12-13. All the others
>> should be self contained.
>>
>> I understand patch 5 is a little hard to follow.
>>
>> Now, I am not sure what you propose here but it seems that the end goal
>> is to just have the same result, but with additional intermediate steps
>> that are just "do this just because in the next patch will be useful".
>> I think the problem is that we are going to miss the "why we need the
>> lock" logic in the patches if we do so.
>>
>> The logic I tried to convey in this order is the following:
>> - job.h: add _locked duplicates for job API functions called with and
>> without job_mutex
>> Just create duplicates of functions
>>
>> - jobs: protect jobs with job_lock/unlock
>> QMP and monitor functions call APIs that assume lock is taken,
>> drivers must take explicitly the lock
>>
>> - jobs: rename static functions called with job_mutex held
>> - job.h: rename job API functions called with job_mutex held
>> - block_job: rename block_job functions called with job_mutex held
>> *given* that some functions are always under lock, transform
>> them in _locked. Requires the job_lock/unlock patch
>>
>> - job.h: define unlocked functions
>> Comments on the public functions that are not _locked
>>
>>
>> @Kevin, since you also had some feedbacks on the patch ordering, do you
>> agree with this ordering or you have some other ideas?
>>
>> Following your 

Re: [PATCH v4 3/7] iotests: add copy-before-write: on-cbw-error tests

2022-06-28 Thread Vladimir Sementsov-Ogievskiy

I had a long and not fun debugging session through gitlab pipelines with this:)

The problem is that pure QEMUMachine doesn't work on arm in gitlab. And we have 
to specify at least machine. And we don't want qtest, as described in commit 
message.

So, the following fix helps:


On 4/7/22 16:27, Vladimir Sementsov-Ogievskiy wrote:

Add tests for new option of copy-before-write filter: on-cbw-error.

Note that we use QEMUMachine instead of VM class, because in further
commit we'll want to use throttling which doesn't work with -accel
qtest used by VM.

We also touch pylintrc to not break iotest 297.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  tests/qemu-iotests/pylintrc   |   5 +
  tests/qemu-iotests/tests/copy-before-write| 132 ++
  .../qemu-iotests/tests/copy-before-write.out  |   5 +
  3 files changed, 142 insertions(+)
  create mode 100755 tests/qemu-iotests/tests/copy-before-write
  create mode 100644 tests/qemu-iotests/tests/copy-before-write.out

diff --git a/tests/qemu-iotests/pylintrc b/tests/qemu-iotests/pylintrc
index 32ab77b8bb..f4f823a991 100644
--- a/tests/qemu-iotests/pylintrc
+++ b/tests/qemu-iotests/pylintrc
@@ -51,3 +51,8 @@ notes=FIXME,
  
  # Maximum number of characters on a single line.

  max-line-length=79
+
+
+[SIMILARITIES]
+
+min-similarity-lines=6
diff --git a/tests/qemu-iotests/tests/copy-before-write 
b/tests/qemu-iotests/tests/copy-before-write
new file mode 100755
index 00..6c7638965e
--- /dev/null
+++ b/tests/qemu-iotests/tests/copy-before-write
@@ -0,0 +1,132 @@
+#!/usr/bin/env python3
+# group: auto backup


[..]


+
+def setUp(self):
+qemu_img_create('-f', iotests.imgfmt, source_img, size)
+qemu_img_create('-f', iotests.imgfmt, temp_img, size)
+qemu_io('-c', 'write 0 1M', source_img)
+
+self.vm = QEMUMachine(iotests.qemu_prog)


Will fix to be:

  +opts = ['-nodefaults', '-display', 'none', '-machine', 'none']
  +self.vm = QEMUMachine(iotests.qemu_prog, opts,
  +  base_temp_dir=iotests.test_dir,
  +  sock_dir=iotests.sock_dir)



+self.vm.launch()
+



So, if no objections, I'm going to resend a PULL request (v1 was "[PULL 00/10] Block 
jobs & NBD patches") with this fix and small improvement in 06.


--
Best regards,
Vladimir



Re: [RFC v3 1/5] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls.

2022-06-28 Thread Sam Li
Stefan Hajnoczi  于2022年6月28日周二 14:52写道:
>
> On Mon, Jun 27, 2022 at 08:19:13AM +0800, Sam Li wrote:
> > diff --git a/block/block-backend.c b/block/block-backend.c
> > index e0e1aff4b1..786f964d02 100644
> > --- a/block/block-backend.c
> > +++ b/block/block-backend.c
> > @@ -1810,6 +1810,62 @@ int blk_flush(BlockBackend *blk)
> >  return ret;
> >  }
> >
> > +/*
> > + * Return zone_report from BlockDriver. Offset can be any number within
> > + * the zone size. No alignment for offset and len.
>
> What is the purpose of len? Is it the maximum number of zones to return
> in nr_zones[]?

len is actually not used in bdrv_co_zone_report. It is needed by
blk_check_byte_request.

> How does the caller know how many zones were returned?

nr_zones represents IN maximum and OUT actual. The caller will know by
nr_zones which is changed in bdrv_co_zone_report. I'll add it in the
comments.

>
> > + */
> > +int coroutine_fn blk_co_zone_report(BlockBackend *blk, int64_t offset,
> > +   int64_t len, int64_t *nr_zones,
> > +   BlockZoneDescriptor *zones)
> > +{
> > +int ret;
> > +BlockDriverState *bs;
> > +IO_CODE();
> > +
> > +blk_inc_in_flight(blk); /* increase before waiting */
> > +blk_wait_while_drained(blk);
> > +bs = blk_bs(blk);
> > +
> > +ret = blk_check_byte_request(blk, offset, len);
> > +if (ret < 0) {
> > +return ret;
> > +}
> > +
> > +bdrv_inc_in_flight(bs);
>
> The bdrv_inc/dec_in_flight() call should be inside
> bdrv_co_zone_report(). See bdrv_co_ioctl() for an example.
>
> > +ret = bdrv_co_zone_report(blk->root->bs, offset, len,
> > +  nr_zones, zones);
> > +bdrv_dec_in_flight(bs);
> > +blk_dec_in_flight(blk);
> > +return ret;
> > +}
> > +
> > +/*
> > + * Return zone_mgmt from BlockDriver.
>
> Maybe this should be:
>
>   Send a zone management command.

Yes, it's more accurate.

>
> > @@ -216,6 +217,11 @@ typedef struct RawPosixAIOData {
> >  PreallocMode prealloc;
> >  Error **errp;
> >  } truncate;
> > +struct {
> > +int64_t *nr_zones;
> > +BlockZoneDescriptor *zones;
> > +} zone_report;
> > +zone_op op;
>
> It's cleaner to put op inside a struct zone_mgmt so its purpose is
> self-explanatory:
>
>   struct {
>   zone_op op;
>   } zone_mgmt;
>
> > +static int handle_aiocb_zone_report(void *opaque) {
> > +RawPosixAIOData *aiocb = opaque;
> > +int fd = aiocb->aio_fildes;
> > +int64_t *nr_zones = aiocb->zone_report.nr_zones;
> > +BlockZoneDescriptor *zones = aiocb->zone_report.zones;
> > +int64_t offset = aiocb->aio_offset;
> > +int64_t len = aiocb->aio_nbytes;
> > +
> > +struct blk_zone *blkz;
> > +int64_t rep_size, nrz;
> > +int ret, n = 0, i = 0;
> > +
> > +nrz = *nr_zones;
> > +if (len == -1) {
> > +return -errno;
>
> Where is errno set? Should this be an errno constant instead like
> -EINVAL?

That's right! Noted.

>
> > +}
> > +rep_size = sizeof(struct blk_zone_report) + nrz * sizeof(struct 
> > blk_zone);
> > +g_autofree struct blk_zone_report *rep = g_new(struct blk_zone_report, 
> > nrz);
>
> g_new() looks incorrect. There should be 1 struct blk_zone_report
> followed by nrz struct blk_zone structs. Please use g_malloc(rep_size)
> instead.

Yes! However, it still has a memory leak error when using g_autofree
&& g_malloc.

>
> > +offset = offset / 512; /* get the unit of the start sector: sector 
> > size is 512 bytes. */
> > +printf("start to report zone with offset: 0x%lx\n", offset);
> > +
> > +blkz = (struct blk_zone *)(rep + 1);
> > +while (n < nrz) {
> > +memset(rep, 0, rep_size);
> > +rep->sector = offset;
> > +rep->nr_zones = nrz;
>
> What prevents zones[] overflowing? nrz isn't being reduced in the loop,
> so maybe the rep->nr_zones return value will eventually exceed the
> number of elements still available in zones[n..]?

I suppose the number of zones[] is restricted in the subsequent for
loop where zones[] copy one zone at a time as n increases. Even if
rep->zones exceeds the available room in zones[], the extra zone will
not be copied.

>
> > +static int handle_aiocb_zone_mgmt(void *opaque) {
> > +RawPosixAIOData *aiocb = opaque;
> > +int fd = aiocb->aio_fildes;
> > +int64_t offset = aiocb->aio_offset;
> > +int64_t len = aiocb->aio_nbytes;
> > +zone_op op = aiocb->op;
> > +
> > +struct blk_zone_range range;
> > +const char *ioctl_name;
> > +unsigned long ioctl_op;
> > +int64_t zone_size;
> > +int64_t zone_size_mask;
> > +int ret;
> > +
> > +ret = ioctl(fd, BLKGETZONESZ, _size);
>
> Can this value be stored in bs (maybe in BlockLimits) to avoid calling
> ioctl(BLKGETZONESZ) each time?

Yes, zone_size is in the zbd_dev field. I'll update BlockLimits after
I think through the configurations. In addition, it's a temporary

Re: [PATCH v2 2/4] hw/nvme: store a pointer to the NvmeSubsystem in the NvmeNamespace

2022-06-28 Thread Klaus Jensen
On Jun 27 13:47, Niklas Cassel wrote:
> Each NvmeNamespace can be used by serveral controllers,
> but a NvmeNamespace can at most belong to a single NvmeSubsystem.
> Store a pointer to the NvmeSubsystem, if the namespace was realized
> with a NvmeSubsystem.
> 
> This will be used by a follow up patch.
> 
> Signed-off-by: Niklas Cassel 
> ---
>  hw/nvme/ns.c   | 1 +
>  hw/nvme/nvme.h | 1 +
>  2 files changed, 2 insertions(+)
> 
> diff --git a/hw/nvme/ns.c b/hw/nvme/ns.c
> index 870c3ca1a2..8bee3e8b3b 100644
> --- a/hw/nvme/ns.c
> +++ b/hw/nvme/ns.c
> @@ -559,6 +559,7 @@ static void nvme_ns_realize(DeviceState *dev, Error 
> **errp)
>  if (!qdev_set_parent_bus(dev, >bus.parent_bus, errp)) {
>  return;
>  }
> +ns->subsys = subsys;
>  }
>  
>  if (nvme_ns_setup(ns, errp)) {
> diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
> index 0711b9748c..5487e2db40 100644
> --- a/hw/nvme/nvme.h
> +++ b/hw/nvme/nvme.h
> @@ -167,6 +167,7 @@ typedef struct NvmeNamespace {
>  int32_t nr_active_zones;
>  
>  NvmeNamespaceParams params;
> +NvmeSubsystem *subsys;
>  
>  struct {
>  uint32_t err_rec;
> -- 
> 2.36.1
> 

No problemo.

Reviewed-by: Klaus Jensen 


signature.asc
Description: PGP signature


Re: nvme shared namespaces without subsys

2022-06-28 Thread Klaus Jensen
On Jun 27 18:34, Niklas Cassel wrote:
> Hello qemu-nvme maintainers,
> 
> I noticed that since commit
> 916b0f0b5264 ("hw/nvme: change nvme-ns 'shared' default")
> 
> The default value of ns param 'shared' is true, regardless
> if there is a nvme-subsys node or not.
> 
> It probably doesn't matter that the NMIC bit gets set when
> there is no nvme-subsys node, as there is no requirement
> that such a namespace is attached to more than one controller,
> even when it can be.
> 
> However, it does seem a bit awkward to have the bit set by
> default when it is impossible for such a namespace to be
> attached to more than one controller without a nvme-subsys
> node.
> 
> What do you think about something like:
> 
> diff --git a/hw/nvme/ns.c b/hw/nvme/ns.c
> index 1b9c9d1156..4bf20ec836 100644
> --- a/hw/nvme/ns.c
> +++ b/hw/nvme/ns.c
> @@ -546,6 +546,8 @@ static void nvme_ns_realize(DeviceState *dev, Error 
> **errp)
>  int i;
>  
>  if (!n->subsys) {
> +/* If no subsys, the ns cannot be attached to more than one ctrl. */
> +ns->params.shared = false;
>  if (ns->params.detached) {
>  error_setg(errp, "detached requires that the nvme device is "
> "linked to an nvme-subsys device");
> 
> 
> 
> The downside with this proposal is that even though we
> might not want NMIC to be set by default when we don't
> have a nvme-subsys node, we might still want to be able
> to explictly set it for testing purposes.
> 
> 
> Kind regards,
> Niklas

I don't have a problem with this. Wanna send the patch?


signature.asc
Description: PGP signature


Re: [PATCH v2 1/4] hw/nvme: claim NVMe 2.0 compliance

2022-06-28 Thread Klaus Jensen
On Jun 27 10:59, Christoph Hellwig wrote:
> On Mon, Jun 27, 2022 at 01:47:28PM +0200, Niklas Cassel via wrote:
> > CRMS.CRWMS bit shall be set to 1 on controllers compliant with versions
> > later than NVMe 1.4.
> > 
> > The first version later than NVMe 1.4 is NVMe 2.0
> > 
> > Let's claim compliance with NVMe 2.0 such that a follow up patch can
> > set the CRMS.CRWMS bit.
> > 
> > This is needed since CC.CRIME is only writable when both CRMS.CRIMS
> > and CRMS.CRWMS is set.
> 
> You can also always support newer features without claiming
> compliance for the new version.  I'd suggest to go through the
> mandatory changes list first before upgrading the compliance.

Agreed.

> (And one day it would be neat if someone tried to run the official
> but commercial compliance tests on qemu a well..)

Ouch!


signature.asc
Description: PGP signature


Re: [PATCH v2 4/4] hw/nvme: add new never_ready parameter to test the DNR bit

2022-06-28 Thread Klaus Jensen
On Jun 27 13:47, Niklas Cassel wrote:
> Since we verify that "ready_delay" parameter has to be smaller than CRWMT,
> we know that the namespace will always become ready.
> Therefore the "Namespace Not Ready" status code will never have the DNR
> bit set.
> 
> Add a new parameter "never_ready" that can be used to emulate a namespace
> that never gets ready, such that the DNR bit gets set after CRWMT amount
> of time.
> 
> Signed-off-by: Niklas Cassel 
> ---
>  hw/nvme/ctrl.c | 28 +++-
>  hw/nvme/ns.c   |  1 +
>  hw/nvme/nvme.h |  2 ++
>  3 files changed, 30 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> index 5404f67480..5f98d4778d 100644
> --- a/hw/nvme/ctrl.c
> +++ b/hw/nvme/ctrl.c
> @@ -6389,6 +6412,8 @@ static void nvme_ctrl_shutdown(NvmeCtrl *n)
>  
>  nvme_ns_shutdown(ns);
>  }
> +
> +n->cc_enable_timestamp = 0;

A shutdown does not disable the controller (in the case of QEMU), so
isn't this wrong?


signature.asc
Description: PGP signature