Re: css_clear_io_interrupt() error handling

2023-05-15 Thread Markus Armbruster
Halil Pasic  writes:

> On Thu, 11 May 2023 14:20:51 +0200
> Markus Armbruster  wrote:
> [..]
>> >
>> > In my opinion the best way to deal with such situations would be to
>> > abort() in test/development and log a warning in production. Of course  
>> 
>> Understand, but...
>> 
>> > assert() wouldn't give me that, and it wouldn't be locally consistent at
>> > all.  
>> 
>> ... nothing behaves like that so far.
>> 
>
> I understand. And I agree with all statements from your previous mail. 
>
>> Let's try to come to a conclusion.  We can either keep the current
>> behavior, i.e. abort().  Or we change it to just print something.
>> 
>> If we want the latter: fprintf() to stderr, warn_report(), or trace
>> point?
>> 
>> You are the maintainer, so the decision is yours.
>> 
>> I could stick a patch into a series of error-related cleanup patches I'm
>> working on.
>
> I would gladly take that offer. Given that we didn't see any crashes and
> thus violations of assumptions up till now, and that both the kvm and the
> qemu implementations are from my perspective stable, I think not forcing
> a crash is a good option. From the options you offered, warn_report()
> looks the most compelling to me, but I would trust your expertise to pick
> the actually best one.
>
> Thank you very much.

You're welcome!

>> [*] I'm rather fond of the trick to have oopsie() fork & crash.
>
> I never thought of this, but I do actually find it very compelling
> to get a dump while keeping the workload alive. Especially if
> it was oopsie_once() so one does not get buried in dumps. But we don't
> do things like this in QEMU, or do we?

No, we don't.
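
For readers who have not seen the fork & crash trick, here is a minimal
sketch in plain POSIX C.  The names oopsie() and oopsie_once() come from
this thread, but the bodies are an assumption about how such helpers
could look; they are not taken from QEMU or any other code base.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork a child that abort()s, so a core dump can be
 * collected (if the system is configured to write one) while the parent
 * keeps running.  Returns true if the child was killed by SIGABRT.
 */
static bool oopsie(const char *msg)
{
    pid_t pid;
    int status;

    assert(msg != NULL);
    fprintf(stderr, "oopsie: %s\n", msg);
    fflush(NULL);           /* do not duplicate buffered output in the child */

    pid = fork();
    if (pid < 0) {
        return false;       /* could not fork; just carry on */
    }
    if (pid == 0) {
        abort();            /* child: crash right here */
    }
    /* parent: reap the child so it does not linger as a zombie */
    if (waitpid(pid, &status, 0) != pid) {
        return false;
    }
    return WIFSIGNALED(status) && WTERMSIG(status) == SIGABRT;
}

/* Crash a child on the first call only, so dumps do not pile up. */
static bool oopsie_once(const char *msg)
{
    static bool done;

    if (done) {
        return false;
    }
    done = true;
    return oopsie(msg);
}
```

Note that in a multi-threaded process fork() duplicates only the calling
thread, so such a dump would show the full memory image but just one
thread; whether that is good enough is a separate question.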




Re: css_clear_io_interrupt() error handling

2023-05-11 Thread Halil Pasic
On Thu, 11 May 2023 14:20:51 +0200
Markus Armbruster  wrote:
[..]
> >
> > In my opinion the best way to deal with such situations would be to
> > abort() in test/development and log a warning in production. Of course  
> 
> Understand, but...
> 
> > assert() wouldn't give me that, and it wouldn't be locally consistent at
> > all.  
> 
> ... nothing behaves like that so far.
> 

I understand. And I agree with all statements from your previous mail. 

> Let's try to come to a conclusion.  We can either keep the current
> behavior, i.e. abort().  Or we change it to just print something.
> 
> If we want the latter: fprintf() to stderr, warn_report(), or trace
> point?
> 
> You are the maintainer, so the decision is yours.
> 
> I could stick a patch into a series of error-related cleanup patches I'm
> working on.

I would gladly take that offer. Given that we didn't see any crashes and
thus violations of assumptions up till now, and that both the kvm and the
qemu implementations are from my perspective stable, I think not forcing
a crash is a good option. From the options you offered, warn_report()
looks the most compelling to me, but I would trust your expertise to pick
the actually best one.

Thank you very much.

> 
> 
> [*] I'm rather fond of the trick to have oopsie() fork & crash.

I never thought of this, but I do actually find it very compelling
to get a dump while keeping the workload alive. Especially if
it was oopsie_once() so one does not get buried in dumps. But we don't
do things like this in QEMU, or do we?

Regards,
Halil
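
To make the warn_report() option concrete, here is a sketch of how the
default case could look once it no longer aborts.  It is an illustration
under stated assumptions, not the eventual patch: warn_report() below is
a local stand-in with the same name as QEMU's function so that the
example builds outside the QEMU tree, clear_io_irq_fn is a hypothetical
replacement for fsc->clear_io_irq(), and the message text is made up.

```c
#include <assert.h>
#include <errno.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for QEMU's warn_report(); it counts and prints to stderr. */
static int warnings_reported;

static void warn_report(const char *fmt, ...)
{
    va_list ap;

    warnings_reported++;
    fputs("warning: ", stderr);
    va_start(ap, fmt);
    vfprintf(stderr, fmt, ap);
    va_end(ap);
    fputc('\n', stderr);
}

/* Hypothetical stand-in for fsc->clear_io_irq(); returns 0 or -errno. */
typedef int (*clear_io_irq_fn)(uint16_t subchannel_id, uint16_t subchannel_nr);

static void css_clear_io_interrupt_sketch(clear_io_irq_fn clear_io_irq,
                                          uint16_t subchannel_id,
                                          uint16_t subchannel_nr)
{
    static bool no_clear_irq;
    int r;

    assert(clear_io_irq != NULL);
    if (no_clear_irq) {
        return;
    }
    r = clear_io_irq(subchannel_id, subchannel_nr);
    switch (r) {
    case 0:
        break;
    case -ENOSYS:
        /* Unavailable: remember that and stay silent, as before. */
        no_clear_irq = true;
        break;
    default:
        /* Unexpected: warn and continue instead of aborting. */
        warn_report("unexpected error clearing I/O interrupt: %s",
                    strerror(-r));
        break;
    }
}

/* Test doubles for the three interesting outcomes. */
static int clear_ok(uint16_t id, uint16_t nr) { (void)id; (void)nr; return 0; }
static int clear_einval(uint16_t id, uint16_t nr) { (void)id; (void)nr; return -EINVAL; }
static int clear_enosys(uint16_t id, uint16_t nr) { (void)id; (void)nr; return -ENOSYS; }
```

With this shape, an -EINVAL from the callback produces one warning per
call, while -ENOSYS still disables further attempts silently.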




Re: css_clear_io_interrupt() error handling

2023-05-11 Thread Markus Armbruster
Halil Pasic  writes:

> On Wed, 10 May 2023 08:32:12 +0200
> Markus Armbruster  wrote:
>
>> Halil Pasic  writes:
>> 
>> > On Mon, 08 May 2023 11:01:55 +0200
>> > Cornelia Huck  wrote:
>> >  
>> >> On Mon, May 08 2023, Markus Armbruster  wrote:
> [..]
>> > and we do check for availability and cover that via -ENOSYS.  
>> 
>> Yes, kvm_s390_flic_realize() checks and sets ->clear_io_supported
>> accordingly, and kvm_s390_clear_io_flic() returns -ENOSYS when it's
>> false.
>> 
>> Doc on the actual set:
>
> Right. Sorry for the misinformation.

No problem!  With the clue you provided, the exact match was easy to
find :)

>>   4.80 KVM_SET_DEVICE_ATTR/KVM_GET_DEVICE_ATTR
>>   --------------------------------------------
>> 
>>   :Capability: KVM_CAP_DEVICE_CTRL, KVM_CAP_VM_ATTRIBUTES for vm device,
>>                KVM_CAP_VCPU_ATTRIBUTES for vcpu device
>>                KVM_CAP_SYS_ATTRIBUTES for system (/dev/kvm) device (no set)
>>   :Type: device ioctl, vm ioctl, vcpu ioctl
>>   :Parameters: struct kvm_device_attr
>>   :Returns: 0 on success, -1 on error
>> 
>>   Errors:
>> 
>>     =====   =============================================================
>>     ENXIO   The group or attribute is unknown/unsupported for this device
>>             or hardware support is missing.
>>     EPERM   The attribute cannot (currently) be accessed this way
>>             (e.g. read-only attribute, or attribute that only makes
>>             sense when the device is in a different state)
>>     =====   =============================================================
>> 
>> Other error conditions may be defined by individual device types.
>> 
>>   Gets/sets a specified piece of device configuration and/or state.  The
>>   semantics are device-specific.  See individual device documentation in
>>   the "devices" directory.  As with ONE_REG, the size of the data
>>   transferred is defined by the particular attribute.
>> 
>>   ::
>> 
>>     struct kvm_device_attr {
>>       __u32  flags;  /* no flags currently defined */
>>       __u32  group;  /* device-defined */
>>       __u64  attr;   /* group-defined */
>>       __u64  addr;   /* userspace address of attr data */
>>     };
>> 
>> 
>> kvm_s390_flic_realize() sets ->fd to refer to the KVM_DEV_TYPE_FLIC
>> it creates.  I guess that means ENXIO and EPERM should never happen.
>
> I agree.
>
>> 
>> > For KVM_DEV_FLIC_CLEAR_IO_IRQ, only the following error code is
>> > documented in linux/Documentation/virt/kvm/devices/s390_flic.rst,
>> > which is to my knowledge the most authoritative source.
>> > """
>> > .. note:: The KVM_DEV_FLIC_CLEAR_IO_IRQ ioctl will return EINVAL in case a
>> >   zero schid is specified
>> > """
>> > but a look in the code will tell us that -EFAULT is also possible if the
>> > supplied address is broken.  
>> 
>> Common behavior.
>>
>
> Makes sense, just that I did not find it in the interface
> description/documentation.
>  
>> > To sum it up, there is nothing to go wrong with the given operation, and
>> > to my best knowledge seeing an error code on the ioctl would either
>> > indicate a programming error on the client side (QEMU messed it up) or
>> > there is something wrong with the kernel.  
>> 
>> Abort on "QEMU messed up" is proper.  Abort on "something wrong with the
>> kernel" less so.  More on that below.
>> 
>
> I think I understand where you are coming from. IMHO it boils down
> to how broken the kernel is.
>
>> >> > Is the error condition fatal, i.e. continuing would be unsafe?  
>> >
>> > If the kernel is broken, probably. It is certainly unexpected.
>> >  
>> >> >
>> >> > If it's a fatal programming error, then abort() is appropriate.
>> >> >
>> >> > If it's fatal, but not a programming error, we should exit(1) instead.  
>> >
>> > It might not be a QEMU programming error. I really see no reason why
>> > a combination of a sane QEMU and a sane kernel would give us an error
>> > code other than -ENOSYS.
>> >  
>> >> >
>> >> > If it's a survivable programming error, use of abort() is a matter of
>> >> > taste.
>> >
>> > The fact that we might have failed to clear up some interrupts which we
>> > are obligated to clean up by the s390 architecture is not expected to
>> > have grave consequences.   
>> 
>> Good to know.
>> 
>> >> From what I remember, this was introduced to clean up a potentially
>> >> queued interrupt that is not supposed to be delivered, so the worst
>> >> thing that could happen on failure is a spurious interrupt (same as what
>> >> could happen if the kernel flic doesn't provide this function in the
>> >> first place.) My main worry would be changes/breakages on the kernel
>> >> side (while the QEMU side remains unchanged).  
>> >
>> > Agreed. And I hope anybody changing the kernel would test the new error
>> > code and notice the QEMU crashes. This was my intention in the first
>> > place.
>> >  
>> >> So, I think we should continue to log the error 

Re: css_clear_io_interrupt() error handling

2023-05-10 Thread Halil Pasic
On Wed, 10 May 2023 08:32:12 +0200
Markus Armbruster  wrote:

> Halil Pasic  writes:
> 
> > On Mon, 08 May 2023 11:01:55 +0200
> > Cornelia Huck  wrote:
> >  
> >> On Mon, May 08 2023, Markus Armbruster  wrote:
[..]
> > and we do check for availability and cover that via -ENOSYS.  
> 
> Yes, kvm_s390_flic_realize() checks and sets ->clear_io_supported
> accordingly, and kvm_s390_clear_io_flic() returns -ENOSYS when it's
> false.
> 
> Doc on the actual set:

Right. Sorry for the misinformation.

> 
>   4.80 KVM_SET_DEVICE_ATTR/KVM_GET_DEVICE_ATTR
>   --------------------------------------------
> 
>   :Capability: KVM_CAP_DEVICE_CTRL, KVM_CAP_VM_ATTRIBUTES for vm device,
>                KVM_CAP_VCPU_ATTRIBUTES for vcpu device
>                KVM_CAP_SYS_ATTRIBUTES for system (/dev/kvm) device (no set)
>   :Type: device ioctl, vm ioctl, vcpu ioctl
>   :Parameters: struct kvm_device_attr
>   :Returns: 0 on success, -1 on error
> 
>   Errors:
> 
>     =====   =============================================================
>     ENXIO   The group or attribute is unknown/unsupported for this device
>             or hardware support is missing.
>     EPERM   The attribute cannot (currently) be accessed this way
>             (e.g. read-only attribute, or attribute that only makes
>             sense when the device is in a different state)
>     =====   =============================================================
> 
> Other error conditions may be defined by individual device types.
> 
>   Gets/sets a specified piece of device configuration and/or state.  The
>   semantics are device-specific.  See individual device documentation in
>   the "devices" directory.  As with ONE_REG, the size of the data
>   transferred is defined by the particular attribute.
> 
>   ::
> 
>     struct kvm_device_attr {
>       __u32   flags;  /* no flags currently defined */
>       __u32   group;  /* device-defined */
>       __u64   attr;   /* group-defined */
>       __u64   addr;   /* userspace address of attr data */
>     };
> 
> 
> kvm_s390_flic_realize() sets ->fd to refer to the KVM_DEV_TYPE_FLIC
> it creates.  I guess that means ENXIO and EPERM should never happen.

I agree.

> 
> > For KVM_DEV_FLIC_CLEAR_IO_IRQ, only the following error code is
> > documented in linux/Documentation/virt/kvm/devices/s390_flic.rst,
> > which is to my knowledge the most authoritative source.
> > """
> > .. note:: The KVM_DEV_FLIC_CLEAR_IO_IRQ ioctl will return EINVAL in case a
> >   zero schid is specified
> > """
> > but a look in the code will tell us that -EFAULT is also possible if the
> > supplied address is broken.  
> 
> Common behavior.
>

Makes sense, just that I did not find it in the interface
description/documentation.
 
> > To sum it up, there is nothing to go wrong with the given operation, and
> > to my best knowledge seeing an error code on the ioctl would either
> > indicate a programming error on the client side (QEMU messed it up) or
> > there is something wrong with the kernel.  
> 
> Abort on "QEMU messed up" is proper.  Abort on "something wrong with the
> kernel" less so.  More on that below.
> 

I think I understand where you are coming from. IMHO it boils down
to how broken the kernel is.

> >> > Is the error condition fatal, i.e. continuing would be unsafe?  
> >
> > If the kernel is broken, probably. It is certainly unexpected.
> >  
> >> >
> >> > If it's a fatal programming error, then abort() is appropriate.
> >> >
> >> > If it's fatal, but not a programming error, we should exit(1) instead.  
> >
> > It might not be a QEMU programming error. I really see no reason why
> > a combination of a sane QEMU and a sane kernel would give us an error
> > code other than -ENOSYS.
> >  
> >> >
> >> > If it's a survivable programming error, use of abort() is a matter of
> >> > taste.
> >
> > The fact that we might have failed to clear up some interrupts which we
> > are obligated to clean up by the s390 architecture is not expected to
> > have grave consequences.   
> 
> Good to know.
> 
> >> From what I remember, this was introduced to clean up a potentially
> >> queued interrupt that is not supposed to be delivered, so the worst
> >> thing that could happen on failure is a spurious interrupt (same as what
> >> could happen if the kernel flic doesn't provide this function in the
> >> first place.) My main worry would be changes/breakages on the kernel
> >> side (while the QEMU side remains unchanged).  
> >
> > Agreed. And I hope anybody changing the kernel would test the new error
> > code and notice the QEMU crashes. This was my intention in the first
> > place.
> >  
> >> So, I think we should continue to log the error in any case; but I don't
> >> have a strong opinion as to whether we should use exit(1) (as I wouldn't
> >> consider it a programming error) or just continue. Halil, your choice :)  
> >
> > Neither do I have a strong opinion. 

Re: css_clear_io_interrupt() error handling

2023-05-10 Thread Markus Armbruster
Halil Pasic  writes:

> On Mon, 08 May 2023 11:01:55 +0200
> Cornelia Huck  wrote:
>
>> On Mon, May 08 2023, Markus Armbruster  wrote:
>> 
>> > css_clear_io_interrupt() aborts on unexpected ioctl() errors, and I
>> > wonder whether that's appropriate.  Let's have a closer look:
>
> Just for my understanding, was there a field problem with this code,
> or is it more of a theoretical one (i.e. no known crashes)?

Inspection.  I stumbled over it while cleaning up use of &error_abort.

>> >
>> >     static void css_clear_io_interrupt(uint16_t subchannel_id,
>> >                                        uint16_t subchannel_nr)
>> >     {
>> >         Error *err = NULL;
>> >         static bool no_clear_irq;
>> >         S390FLICState *fs = s390_get_flic();
>> >         S390FLICStateClass *fsc = s390_get_flic_class(fs);
>> >         int r;
>> >
>> >         if (unlikely(no_clear_irq)) {
>> >             return;
>> >         }
>> >         r = fsc->clear_io_irq(fs, subchannel_id, subchannel_nr);
>> >         switch (r) {
>> >         case 0:
>> >             break;
>> >         case -ENOSYS:
>> >             no_clear_irq = true;
>> >             /*
>> >              * Ignore unavailability, as the user can't do anything
>> >              * about it anyway.
>> >              */
>> >             break;
>> >         default:
>> >             error_setg_errno(&err, -r, "unexpected error condition");
>> >             error_propagate(&error_abort, err);
>> >         }
>> >     }
>> >
>> > The default case is abort() with a liberal amount of lipstick applied.
>> > Let's ignore the lipstick and focus on the abort().
>
> Nod.
>
>> >
>> > fsc->clear_io_irq is either qemu_s390_clear_io_flic() or
>> > kvm_s390_clear_io_flic().
>
> Right.
>
>> >
>> > Only kvm_s390_clear_io_flic() can return non-zero: -errno when ioctl()
>> > fails.
>
> Agreed, this is the case right now. It was not the case when the code
> was written: qemu_s390_clear_io_flic() used to lack this functionality
> and always returned -ENOSYS.

I see.

>> > The ioctl() is KVM_SET_DEVICE_ATTR for KVM_DEV_FLIC_CLEAR_IO_IRQ with
>> > subchannel_id and subchannel_nr.  I.e. we assume that this can only fail
>> > with ENOSYS, and crash hard when the assumption turns out to be wrong.
>
> Yes this is the assumption and the current behavior.
>
>> >
>> > Is this error condition a programming error?  I figure it can be one
>> > only if the ioctl()'s contract promises us it cannot fail in any other
>> > way unless we violate preconditions.
>
> AFAIK and AFAIR it is indeed only possible in case of a programming error
> somewhere, and this was almost certainly my intention with this code. 
>
> For example if the future implementer of a meaningful
> qemu_s390_clear_io_flic() was to decide to use a multitude of error
> codes, the implementer would also have to touch this and handle those
> accordingly to avoid crashes.
>
>
> As for the KVM_SET_DEVICE_ATTR ioctl() for KVM_DEV_FLIC_CLEAR_IO_IRQ, I'm
> afraid there is no really authoritative contract; the current
> implementation, the documentation under Documentation/virt/kvm in
> the Linux source tree, and this code in QEMU are the de facto contract.
>
> linux/Documentation/virt/kvm/api.rst says
> """
> 4.81 KVM_HAS_DEVICE_ATTR
> ------------------------
>
> :Capability: KVM_CAP_DEVICE_CTRL, KVM_CAP_VM_ATTRIBUTES for vm device,
>  KVM_CAP_VCPU_ATTRIBUTES for vcpu device
>  KVM_CAP_SYS_ATTRIBUTES for system (/dev/kvm) device
> :Type: device ioctl, vm ioctl, vcpu ioctl
> :Parameters: struct kvm_device_attr
> :Returns: 0 on success, -1 on error
>
> Errors:
>
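>   =====   =============================================================
>   ENXIO   The group or attribute is unknown/unsupported for this device
>           or hardware support is missing.
>   =====   =============================================================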
>
> Tests whether a device supports a particular attribute.  A successful
> return indicates the attribute is implemented.  It does not necessarily
> indicate that the attribute can be read or written in the device's
> current state.  "addr" is ignored.
> """
>
> and we do check for availability and cover that via -ENOSYS.

Yes, kvm_s390_flic_realize() checks and sets ->clear_io_supported
accordingly, and kvm_s390_clear_io_flic() returns -ENOSYS when it's
false.

Doc on the actual set:

  4.80 KVM_SET_DEVICE_ATTR/KVM_GET_DEVICE_ATTR
  --------------------------------------------

  :Capability: KVM_CAP_DEVICE_CTRL, KVM_CAP_VM_ATTRIBUTES for vm device,
               KVM_CAP_VCPU_ATTRIBUTES for vcpu device
               KVM_CAP_SYS_ATTRIBUTES for system (/dev/kvm) device (no set)
  :Type: device ioctl, vm ioctl, vcpu ioctl
  :Parameters: struct kvm_device_attr
  :Returns: 0 on success, -1 on error

  Errors:

    =====   =============================================================
    ENXIO   The group or attribute is unknown/unsupported for this device
            or hardware support is missing.
    EPERM   The 

Re: css_clear_io_interrupt() error handling

2023-05-09 Thread Halil Pasic
On Mon, 08 May 2023 11:01:55 +0200
Cornelia Huck  wrote:

> On Mon, May 08 2023, Markus Armbruster  wrote:
> 
> > css_clear_io_interrupt() aborts on unexpected ioctl() errors, and I
> > wonder whether that's appropriate.  Let's have a closer look:

Just for my understanding, was there a field problem with this code,
or is it more of a theoretical one (i.e. no known crashes)?

> >
> >     static void css_clear_io_interrupt(uint16_t subchannel_id,
> >                                        uint16_t subchannel_nr)
> >     {
> >         Error *err = NULL;
> >         static bool no_clear_irq;
> >         S390FLICState *fs = s390_get_flic();
> >         S390FLICStateClass *fsc = s390_get_flic_class(fs);
> >         int r;
> >
> >         if (unlikely(no_clear_irq)) {
> >             return;
> >         }
> >         r = fsc->clear_io_irq(fs, subchannel_id, subchannel_nr);
> >         switch (r) {
> >         case 0:
> >             break;
> >         case -ENOSYS:
> >             no_clear_irq = true;
> >             /*
> >              * Ignore unavailability, as the user can't do anything
> >              * about it anyway.
> >              */
> >             break;
> >         default:
> >             error_setg_errno(&err, -r, "unexpected error condition");
> >             error_propagate(&error_abort, err);
> >         }
> >     }
> >
> > The default case is abort() with a liberal amount of lipstick applied.
> > Let's ignore the lipstick and focus on the abort().

Nod.

> >
> > fsc->clear_io_irq is either qemu_s390_clear_io_flic() or
> > kvm_s390_clear_io_flic().

Right.

> >
> > Only kvm_s390_clear_io_flic() can return non-zero: -errno when ioctl()
> > fails.

Agreed, this is the case right now. It was not the case when the code
was written: qemu_s390_clear_io_flic() used to lack this functionality
and always returned -ENOSYS.

> >
> > The ioctl() is KVM_SET_DEVICE_ATTR for KVM_DEV_FLIC_CLEAR_IO_IRQ with
> > subchannel_id and subchannel_nr.  I.e. we assume that this can only fail
> > with ENOSYS, and crash hard when the assumption turns out to be wrong.

Yes this is the assumption and the current behavior.

> >
> > Is this error condition a programming error?  I figure it can be one
> > only if the ioctl()'s contract promises us it cannot fail in any other
> > way unless we violate preconditions.

AFAIK and AFAIR it is indeed only possible in case of a programming error
somewhere, and this was almost certainly my intention with this code. 

For example if the future implementer of a meaningful
qemu_s390_clear_io_flic() was to decide to use a multitude of error
codes, the implementer would also have to touch this and handle those
accordingly to avoid crashes.


As for the KVM_SET_DEVICE_ATTR ioctl() for KVM_DEV_FLIC_CLEAR_IO_IRQ, I'm
afraid there is no really authoritative contract; the current
implementation, the documentation under Documentation/virt/kvm in
the Linux source tree, and this code in QEMU are the de facto contract.

linux/Documentation/virt/kvm/api.rst says
"""
4.81 KVM_HAS_DEVICE_ATTR
------------------------

:Capability: KVM_CAP_DEVICE_CTRL, KVM_CAP_VM_ATTRIBUTES for vm device,
 KVM_CAP_VCPU_ATTRIBUTES for vcpu device
 KVM_CAP_SYS_ATTRIBUTES for system (/dev/kvm) device
:Type: device ioctl, vm ioctl, vcpu ioctl
:Parameters: struct kvm_device_attr
:Returns: 0 on success, -1 on error

Errors:

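  =====   =============================================================
  ENXIO   The group or attribute is unknown/unsupported for this device
          or hardware support is missing.
  =====   =============================================================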

Tests whether a device supports a particular attribute.  A successful
return indicates the attribute is implemented.  It does not necessarily
indicate that the attribute can be read or written in the device's
current state.  "addr" is ignored.
"""

and we do check for availability and cover that via -ENOSYS.

For KVM_DEV_FLIC_CLEAR_IO_IRQ, only the following error code is
documented in linux/Documentation/virt/kvm/devices/s390_flic.rst,
which is to my knowledge the most authoritative source.
"""
.. note:: The KVM_DEV_FLIC_CLEAR_IO_IRQ ioctl will return EINVAL in case a
  zero schid is specified
"""
but a look in the code will tell us that -EFAULT is also possible if the
supplied address is broken.

To sum it up, there is nothing to go wrong with the given operation, and
to my best knowledge seeing an error code on the ioctl would either
indicate a programming error on the client side (QEMU messed it up) or
there is something wrong with the kernel.

> >
> > Is the error condition fatal, i.e. continuing would be unsafe?

If the kernel is broken, probably. It is certainly unexpected.

> >
> > If it's a fatal programming error, then abort() is appropriate.
> >
> > If it's fatal, but not a programming error, we should exit(1) instead.

It might not be a QEMU programming error. I really see no reason why
would a combination of a sane 

Re: css_clear_io_interrupt() error handling

2023-05-08 Thread Cornelia Huck
On Mon, May 08 2023, Markus Armbruster  wrote:

> css_clear_io_interrupt() aborts on unexpected ioctl() errors, and I
> wonder whether that's appropriate.  Let's have a closer look:
>
>     static void css_clear_io_interrupt(uint16_t subchannel_id,
>                                        uint16_t subchannel_nr)
>     {
>         Error *err = NULL;
>         static bool no_clear_irq;
>         S390FLICState *fs = s390_get_flic();
>         S390FLICStateClass *fsc = s390_get_flic_class(fs);
>         int r;
>
>         if (unlikely(no_clear_irq)) {
>             return;
>         }
>         r = fsc->clear_io_irq(fs, subchannel_id, subchannel_nr);
>         switch (r) {
>         case 0:
>             break;
>         case -ENOSYS:
>             no_clear_irq = true;
>             /*
>              * Ignore unavailability, as the user can't do anything
>              * about it anyway.
>              */
>             break;
>         default:
>             error_setg_errno(&err, -r, "unexpected error condition");
>             error_propagate(&error_abort, err);
>         }
>     }
>
> The default case is abort() with a liberal amount of lipstick applied.
> Let's ignore the lipstick and focus on the abort().
>
> fsc->clear_io_irq is either qemu_s390_clear_io_flic() or
> kvm_s390_clear_io_flic().
>
> Only kvm_s390_clear_io_flic() can return non-zero: -errno when ioctl()
> fails.
>
> The ioctl() is KVM_SET_DEVICE_ATTR for KVM_DEV_FLIC_CLEAR_IO_IRQ with
> subchannel_id and subchannel_nr.  I.e. we assume that this can only fail
> with ENOSYS, and crash hard when the assumption turns out to be wrong.
>
> Is this error condition a programming error?  I figure it can be one
> only if the ioctl()'s contract promises us it cannot fail in any other
> way unless we violate preconditions.
>
> Is the error condition fatal, i.e. continuing would be unsafe?
>
> If it's a fatal programming error, then abort() is appropriate.
>
> If it's fatal, but not a programming error, we should exit(1) instead.
>
> If it's a survivable programming error, use of abort() is a matter of
> taste.

From what I remember, this was introduced to clean up a potentially
queued interrupt that is not supposed to be delivered, so the worst
thing that could happen on failure is a spurious interrupt (same as what
could happen if the kernel flic doesn't provide this function in the
first place.) My main worry would be changes/breakages on the kernel
side (while the QEMU side remains unchanged).

So, I think we should continue to log the error in any case; but I don't
have a strong opinion as to whether we should use exit(1) (as I wouldn't
consider it a programming error) or just continue. Halil, your choice :)




css_clear_io_interrupt() error handling

2023-05-08 Thread Markus Armbruster
css_clear_io_interrupt() aborts on unexpected ioctl() errors, and I
wonder whether that's appropriate.  Let's have a closer look:

    static void css_clear_io_interrupt(uint16_t subchannel_id,
                                       uint16_t subchannel_nr)
    {
        Error *err = NULL;
        static bool no_clear_irq;
        S390FLICState *fs = s390_get_flic();
        S390FLICStateClass *fsc = s390_get_flic_class(fs);
        int r;

        if (unlikely(no_clear_irq)) {
            return;
        }
        r = fsc->clear_io_irq(fs, subchannel_id, subchannel_nr);
        switch (r) {
        case 0:
            break;
        case -ENOSYS:
            no_clear_irq = true;
            /*
             * Ignore unavailability, as the user can't do anything
             * about it anyway.
             */
            break;
        default:
            error_setg_errno(&err, -r, "unexpected error condition");
            error_propagate(&error_abort, err);
        }
    }

The default case is abort() with a liberal amount of lipstick applied.
Let's ignore the lipstick and focus on the abort().

fsc->clear_io_irq is either qemu_s390_clear_io_flic() or
kvm_s390_clear_io_flic().

Only kvm_s390_clear_io_flic() can return non-zero: -errno when ioctl()
fails.

The ioctl() is KVM_SET_DEVICE_ATTR for KVM_DEV_FLIC_CLEAR_IO_IRQ with
subchannel_id and subchannel_nr.  I.e. we assume that this can only fail
with ENOSYS, and crash hard when the assumption turns out to be wrong.

Is this error condition a programming error?  I figure it can be one
only if the ioctl()'s contract promises us it cannot fail in any other
way unless we violate preconditions.

Is the error condition fatal, i.e. continuing would be unsafe?

If it's a fatal programming error, then abort() is appropriate.

If it's fatal, but not a programming error, we should exit(1) instead.

If it's a survivable programming error, use of abort() is a matter of
taste.
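
The three rules above can be written down as a small decision function.
This is only an illustration of the reasoning, not QEMU code: the enum,
classify_error() and action_name() are hypothetical names, and the
fourth case (survivable and not a programming error) is not covered by
the rules above, so "report and continue" is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

enum error_action {
    ACTION_ABORT,           /* abort(): fatal programming error */
    ACTION_EXIT,            /* exit(1): fatal, but not a programming error */
    ACTION_ABORT_OR_REPORT, /* survivable programming error: matter of taste */
    ACTION_REPORT,          /* assumption: survivable, not a bug: report, go on */
};

/* Map the two questions asked above to the suggested reaction. */
static enum error_action classify_error(bool fatal, bool programming_error)
{
    if (fatal) {
        return programming_error ? ACTION_ABORT : ACTION_EXIT;
    }
    return programming_error ? ACTION_ABORT_OR_REPORT : ACTION_REPORT;
}

/* Human-readable name of a reaction, for logging. */
static const char *action_name(enum error_action action)
{
    switch (action) {
    case ACTION_ABORT:
        return "abort()";
    case ACTION_EXIT:
        return "exit(1)";
    case ACTION_ABORT_OR_REPORT:
        return "abort() or report, matter of taste";
    case ACTION_REPORT:
        return "report and continue";
    }
    assert(!"unknown error_action");
    return "?";
}
```

For css_clear_io_interrupt(), the open question is which of the two
arguments applies to an unexpected ioctl() error.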