Impact of CONFIG_PARAVIRT=y / mmap benchmark
Hi,

I'm trying to assess the performance impact of enabling PARAVIRT (and
XEN) in a custom kernel configuration. I came across a very old thread
(https://lkml.org/lkml/2009/5/13/449) on this topic; the conclusion
back then was that the performance impact was (arguably) around 1%.

Does anyone still have a copy of Ingo Molnar's mmap-perf.c program (the
old link is broken)? Would it still be relevant for measuring the
performance impact of PARAVIRT?

Last but not least, has anyone looked into PARAVIRT performance more
recently?

Thank you!

Best regards,
Radu Rendec
Re: Impact of CONFIG_PARAVIRT=y / mmap benchmark
On Wed, 2017-03-29 at 20:05 +0200, Ingo Molnar wrote:
> * Randy Dunlap <rdun...@infradead.org> wrote:
> 
> > On 03/28/17 10:17, Radu Rendec wrote:
> > > Does anyone still have a copy of Ingo Molnar's mmap-perf.c
> > > program (the old link is broken)? Would it still be relevant to
> > > use it for measuring performance in case of PARAVIRT?
> > 
> > I have mmap-perf.c that says:
> > 
> > /* Copyright Ingo Molnar (c) 2006 */
> > 
> > and no license info...
> > Ingo?
> 
> It's GPL v2, like the kernel.

Randy, now that Ingo has clarified the license of that code, can you
please share it?

Thank you both for the feedback!
Radu
Re: Lowering the log level in watchdog_dev_register when err==-EBUSY
On Thu, 2017-10-12 at 12:54 -0700, Guenter Roeck wrote:
> On Thu, Oct 12, 2017 at 07:15:59PM +0100, Radu Rendec wrote:
> > Thanks for the suggestion! That makes sense. I will start working on
> > converting i6300esb and submit a patch in a few days.
> > 
> > By the way, I don't have the hardware. I'm using it with KVM (Qemu),
> > but I guess that's good enough since I'm not going to touch any of
> > the code parts that deal with the hardware.
> 
> Ah, interesting. Can you send me the qemu command line?

Sure, this is as easy as adding "-device i6300esb" to the qemu command
line. For what it's worth, my full command line is:

qemu-system-i386 \
    -enable-kvm \
    -hda fed26.qcow2 \
    -cdrom ~/Downloads/Fedora-Everything-netinst-i386-26-1.5.iso \
    -m 512 \
    -smp 4 \
    -netdev user,id=net0 \
    -netdev tap,id=net1,br=virbr0,helper=/usr/libexec/qemu-bridge-helper \
    -device e1000,netdev=net0 \
    -device e1000,netdev=net1 \
    -device i6300esb

Thanks,
Radu
Re: Lowering the log level in watchdog_dev_register when err==-EBUSY
On Wed, 2017-10-11 at 11:46 -0700, Guenter Roeck wrote:
> On Wed, Oct 11, 2017 at 06:46:31PM +0100, Radu Rendec wrote:
> > In a project I'm working on we have a valid use case where we activate
> > both the i6300esb and softdog watchdogs. We always activate i6300esb
> > first (which uses the "legacy" watchdog API) and then softdog. This
> > gets us two "error" level messages (coming from watchdog_cdev_register)
> > although softdog falls back to the "new" API and registers its char
> > device just fine.
> > 
> > Since watchdog_cdev_register/watchdog_dev_register seem to be used only
> > by watchdog_register_device and the latter always falls back to the
> > "new" API, I'm thinking about lowering the log level of these messages
> > when err is -EBUSY.
> 
> I would suggest to convert the offending driver to use the watchdog
> subsystem (and along the line remove the restriction of only supporting
> a single instance). You have the hardware, so that should be a
> straightforward change.

Thanks for the suggestion! That makes sense. I will start working on
converting i6300esb and submit a patch in a few days.

By the way, I don't have the hardware. I'm using it with KVM (Qemu),
but I guess that's good enough since I'm not going to touch any of the
code parts that deal with the hardware.

Radu
Re: virtio_net: ethtool supported link modes
On Fri, 2017-09-01 at 11:36 +0800, Jason Wang wrote:
> On 2017年09月01日 01:04, Radu Rendec wrote:
> > Hello,
> > 
> > Looking at the code in virtnet_set_link_ksettings, it seems the speed
> > and duplex can be set to any valid value. The driver will "remember"
> > them and report them back in virtnet_get_link_ksettings.
> > 
> > However, the supported link modes (link_modes.supported in struct
> > ethtool_link_ksettings) is always 0, indicating that no speed/duplex
> > setting is supported.
> > 
> > Does it make more sense to set (at least a few of) the supported link
> > modes, such as 10baseT_Half ... 1000baseT_Full?
> > 
> > I would expect to see consistency between what is reported in
> > link_modes.supported and what can actually be set. Could you please
> > share your opinion on this?
> 
> I think this may make sense only if there's a hardware implementation
> for virtio. And we probably need to extend virtio spec for adding new
> commands.

So you're saying that the "hardware" should provide the supported link
modes (e.g. by using feature bits at the virtio layer) and the
virtio_net driver should just translate them and expose them as
link_modes.supported?

Then for consistency, I assume setting speed/duplex via ethtool should
also go into the virtio layer (currently virtio_net seems to just store
them for future retrieval via ethtool).

Thanks,
Radu
Re: virtio_net: ethtool supported link modes
On Fri, 2017-09-01 at 18:43 +0300, Michael S. Tsirkin wrote:
> On Thu, Aug 31, 2017 at 06:04:04PM +0100, Radu Rendec wrote:
> > Looking at the code in virtnet_set_link_ksettings, it seems the speed
> > and duplex can be set to any valid value. The driver will "remember"
> > them and report them back in virtnet_get_link_ksettings.
> > 
> > However, the supported link modes (link_modes.supported in struct
> > ethtool_link_ksettings) is always 0, indicating that no speed/duplex
> > setting is supported.
> > 
> > Does it make more sense to set (at least a few of) the supported link
> > modes, such as 10baseT_Half ... 1000baseT_Full?
> > 
> > I would expect to see consistency between what is reported in
> > link_modes.supported and what can actually be set. Could you please
> > share your opinion on this?
> 
> I would like to know more about why this is desirable.
> 
> We used not to support the modes at all, but it turned out
> some tools are confused by this: e.g. people would try to
> bond virtio with a hardware device, tools would see
> a mismatch in speed and features between bonded devices
> and get confused.
> 
> See
> 
> commit 16032be56c1f66770da15cb94f0eb366c37aff6e
> Author: Nikolay Aleksandrov <niko...@cumulusnetworks.com>
> Date:   Wed Feb 3 04:04:37 2016 +0100
> 
>     virtio_net: add ethtool support for set and get of settings
> 
> as well as the discussion around it:
> https://www.spinics.net/lists/netdev/msg362111.html

Thanks for pointing these out. It is much clearer now why modes support
is implemented the way it is and what the expectations are.

> If you think we need to add more hacks like this, a stronger
> motivation than "to see consistency" would be needed.

The use case behind my original question is very simple:

* Net device is queried via ethtool for supported modes.
* Supported modes are presented to user.
* User can configure any of the supported modes.

This is done transparently to the net device type (driver), so it
actually makes sense for physical NICs.

This alone of course is not a good enough motivation to modify the
driver. And it can be easily addressed in user space at the application
level by testing for the driver. I was merely trying to avoid
driver-specific workarounds (i.e. keep the application driver agnostic)
and wondered whether "advertising" supported modes through ethtool made
any sense and/or would be a desirable change from the driver
perspective.

I believe I have my answers now.

Thanks,
Radu
Re: virtio_net: ethtool supported link modes
On Fri, 2017-09-01 at 20:45 +0300, Michael S. Tsirkin wrote:
> On Fri, Sep 01, 2017 at 05:19:53PM +0100, Radu Rendec wrote:
> > On Fri, 2017-09-01 at 18:43 +0300, Michael S. Tsirkin wrote:
> > > On Thu, Aug 31, 2017 at 06:04:04PM +0100, Radu Rendec wrote:
> > > > Looking at the code in virtnet_set_link_ksettings, it seems the
> > > > speed and duplex can be set to any valid value. The driver will
> > > > "remember" them and report them back in
> > > > virtnet_get_link_ksettings.
> > > > 
> > > > However, the supported link modes (link_modes.supported in struct
> > > > ethtool_link_ksettings) is always 0, indicating that no
> > > > speed/duplex setting is supported.
> > > > 
> > > > Does it make more sense to set (at least a few of) the supported
> > > > link modes, such as 10baseT_Half ... 1000baseT_Full?
> > > > 
> > > > I would expect to see consistency between what is reported in
> > > > link_modes.supported and what can actually be set. Could you
> > > > please share your opinion on this?
> > 
> > The use case behind my original question is very simple:
> > * Net device is queried via ethtool for supported modes.
> > * Supported modes are presented to user.
> > * User can configure any of the supported modes.
> 
> Since this has no effect on virtio, isn't presenting
> "no supported modes" to user the right thing to do?

Yes, that makes sense.

> > This is done transparently to the net device type (driver), so it
> > actually makes sense for physical NICs.
> > 
> > This alone of course is not a good enough motivation to modify the
> > driver. And it can be easily addressed in user-space at the
> > application level by testing for the driver.
> 
> I think you might want to special-case no supported modes.
> Special-casing virtio is probably best avoided.
> 
> > I was merely trying to avoid driver-specific workarounds (i.e. keep
> > the application driver agnostic)
> 
> I think that's the right approach. So if driver does not present
> any supported modes this probably means it is not necessary
> to display or program any.

Yes, apparently it boils down to special-casing no supported modes.
This avoids both modifying virtio and special-casing virtio, and keeps
the application driver-agnostic at the same time.

Thanks for all the feedback. It was very helpful in figuring out the
right approach. I really appreciate it.

Radu
virtio_net: ethtool supported link modes
Hello,

Looking at the code in virtnet_set_link_ksettings, it seems the speed
and duplex can be set to any valid value. The driver will "remember"
them and report them back in virtnet_get_link_ksettings.

However, the supported link modes (link_modes.supported in struct
ethtool_link_ksettings) is always 0, indicating that no speed/duplex
setting is supported.

Does it make more sense to set (at least a few of) the supported link
modes, such as 10baseT_Half ... 1000baseT_Full?

I would expect to see consistency between what is reported in
link_modes.supported and what can actually be set. Could you please
share your opinion on this?

Thank you,
Radu Rendec
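If the virtio spec were ever extended as suggested later in the thread, the driver side might look roughly like the sketch below. This is hypothetical kernel code, not buildable on its own: VIRTIO_NET_F_LINK_MODES is an imaginary feature bit, and the real virtnet_get_link_ksettings() does not fill in any link modes.

```c
/* Hypothetical sketch: translate an imaginary virtio feature bit
 * (VIRTIO_NET_F_LINK_MODES does not exist) into ethtool link modes. */
static int virtnet_get_link_ksettings(struct net_device *dev,
				      struct ethtool_link_ksettings *cmd)
{
	struct virtnet_info *vi = netdev_priv(dev);

	ethtool_link_ksettings_zero_link_mode(cmd, supported);
	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_LINK_MODES)) {
		ethtool_link_ksettings_add_link_mode(cmd, supported,
						     10baseT_Half);
		ethtool_link_ksettings_add_link_mode(cmd, supported,
						     1000baseT_Full);
	}
	cmd->base.speed = vi->speed;
	cmd->base.duplex = vi->duplex;
	return 0;
}
```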
Lowering the log level in watchdog_dev_register when err==-EBUSY
Hello,

In a project I'm working on we have a valid use case where we activate
both the i6300esb and softdog watchdogs. We always activate i6300esb
first (which uses the "legacy" watchdog API) and then softdog. This
gets us two "error" level messages (coming from watchdog_cdev_register)
although softdog falls back to the "new" API and registers its char
device just fine.

Since watchdog_cdev_register/watchdog_dev_register seem to be used only
by watchdog_register_device and the latter always falls back to the
"new" API, I'm thinking about lowering the log level of these messages
when err is -EBUSY. Something along the lines of:

--- a/drivers/watchdog/watchdog_dev.c
+++ b/drivers/watchdog/watchdog_dev.c
@@ -928,11 +928,14 @@ static int watchdog_cdev_register(struct watchdog_device *wdd, dev_t devno)
 		watchdog_miscdev.parent = wdd->parent;
 		err = misc_register(&watchdog_miscdev);
 		if (err != 0) {
-			pr_err("%s: cannot register miscdev on minor=%d (err=%d).\n",
-				wdd->info->identity, WATCHDOG_MINOR, err);
-			if (err == -EBUSY)
-				pr_err("%s: a legacy watchdog module is probably present.\n",
-					wdd->info->identity);
+			if (err == -EBUSY) {
+				pr_info("%s: cannot register miscdev on minor=%d (err=%d).\n",
+					wdd->info->identity, WATCHDOG_MINOR, err);
+				pr_info("%s: a legacy watchdog module is probably present.\n",
+					wdd->info->identity);
+			} else
+				pr_err("%s: cannot register miscdev on minor=%d (err=%d).\n",
+					wdd->info->identity, WATCHDOG_MINOR, err);
 			old_wd_data = NULL;
 			kfree(wd_data);
 			return err;

Does this look like a good approach? If not, what would you recommend?
In any case, I want to upstream the change, so better ask first :)

Thanks,
Radu Rendec
[PATCH 1/1] Improve kernfs_notify() poll notification latency
kernfs_notify() does two notifications: poll and fsnotify. Originally,
both notifications were done from scheduled work context and all that
kernfs_notify() did was schedule the work.

This patch simply moves the poll notification from the scheduled work
handler to kernfs_notify(). The fsnotify notification still needs to be
done from scheduled work context because it can sleep (it needs to lock
a mutex).

If the poll notification is time critical (the notified thread needs to
wake as quickly as possible), it's better to do it from kernfs_notify()
directly. One example is calling sysfs_notify_dirent() from a hardware
interrupt handler to wake up a thread and handle the interrupt in user
space.

Signed-off-by: Radu Rendec
---
 fs/kernfs/file.c | 23 +++++++++++------------
 1 file changed, 11 insertions(+), 12 deletions(-)

diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index dbf5bc250bfd..f8d5021a652e 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -857,7 +857,6 @@ static __poll_t kernfs_fop_poll(struct file *filp, poll_table *wait)
 static void kernfs_notify_workfn(struct work_struct *work)
 {
 	struct kernfs_node *kn;
-	struct kernfs_open_node *on;
 	struct kernfs_super_info *info;
 repeat:
 	/* pop one off the notify_list */
@@ -871,17 +870,6 @@ static void kernfs_notify_workfn(struct work_struct *work)
 	kn->attr.notify_next = NULL;
 	spin_unlock_irq(&kernfs_notify_lock);
 
-	/* kick poll */
-	spin_lock_irq(&kernfs_open_node_lock);
-
-	on = kn->attr.open;
-	if (on) {
-		atomic_inc(&on->event);
-		wake_up_interruptible(&on->poll);
-	}
-
-	spin_unlock_irq(&kernfs_open_node_lock);
-
 	/* kick fsnotify */
 	mutex_lock(&kernfs_mutex);
@@ -934,10 +922,21 @@ void kernfs_notify(struct kernfs_node *kn)
 {
 	static DECLARE_WORK(kernfs_notify_work, kernfs_notify_workfn);
 	unsigned long flags;
+	struct kernfs_open_node *on;
 
 	if (WARN_ON(kernfs_type(kn) != KERNFS_FILE))
 		return;
 
+	/* kick poll immediately */
+	spin_lock_irqsave(&kernfs_open_node_lock, flags);
+	on = kn->attr.open;
+	if (on) {
+		atomic_inc(&on->event);
+		wake_up_interruptible(&on->poll);
+	}
+	spin_unlock_irqrestore(&kernfs_open_node_lock, flags);
+
+	/* schedule work to kick fsnotify */
 	spin_lock_irqsave(&kernfs_notify_lock, flags);
 	if (!kn->attr.notify_next) {
 		kernfs_get(kn);
-- 
2.17.2
[PATCH 0/1] kernfs_notify() poll latency
Hi everyone,

I believe kernfs_notify() poll latency can be improved if the poll
notification is done from kernfs_notify() directly rather than from
scheduled work context.

I am sure there are good reasons why the fsnotify notification must be
done from scheduled work context (an obvious one is that it needs to be
able to sleep). But I don't see any reason why the poll notification
could not be done from kernfs_notify(). If there is any, please point
it out - I would highly appreciate it.

I came across this while working on a project that uses the sysfs GPIO
interface to wake a (real time) user space process on GPIO interrupts.
I know that interface is deprecated, but I still believe it's a valid
scenario and may occur with other drivers as well (current or future).

The sysfs GPIO interface (drivers/gpio/gpiolib-sysfs.c) interrupt
handler relies on kernfs_notify() (actually sysfs_notify_dirent(), but
that's just an alias) to wake any thread that may be poll()ing on the
interrupt. It is important to wake the thread as quickly as possible,
and going through the kernel worker to handle the scheduled work is
much slower.

Since the kernel worker runs with normal priority, this can even become
a case of priority inversion: if a higher priority thread hogs the CPU,
it may delay the kernel worker and in turn the thread that needs to be
notified (which could be a real time thread).

Best regards,
Radu Rendec

Radu Rendec (1):
  Improve kernfs_notify() poll notification latency

 fs/kernfs/file.c | 23 +++++++++++------------
 1 file changed, 11 insertions(+), 12 deletions(-)

-- 
2.17.2
Re: pick_next_task() picking the wrong task [v4.9.163]
On Sat, 2019-03-23 at 11:15 +0100, Peter Zijlstra wrote:
> On Fri, Mar 22, 2019 at 05:57:59PM -0400, Radu Rendec wrote:
> > Hi Everyone,
> > 
> > I believe I'm seeing a weird behavior of pick_next_task() where it
> > chooses a lower priority task over a higher priority one. The
> > scheduling class of the two tasks is also different ('fair' vs.
> > 'rt'). The culprit seems to be the optimization at the beginning of
> > the function, where fair_sched_class.pick_next_task() is called
> > directly. I'm running v4.9.163, but that piece of code is very
> > similar in recent kernels.
> > 
> > [...]
> > 
> > I instrumented pick_next_task() with trace_printk() and I am sure
> > that every time the wrong task is picked, flow goes through the
> > optimization
> 
> That's weird, because when you wake a RT task, the:
> 
>     rq->nr_running == rq->cfs.h_nr_running
> 
> condition should not be true. Maybe try adding trace_printk() to all
> rq->nr_running manipulation to see what goes wobbly?

The answer is in enqueue_top_rt_rq(): it returns before touching the
run queue counters because rt_rq_throttled(rt_rq) is true. So basically
this is RT throttling kicking in. I confirmed by disabling RT
throttling and testing again.

So there's nothing wrong with the scheduler. The "sched_wakeup: comm=.."
trace was a bit misleading. What happens when RT throttling kicks in is
that the task is woken (and probably changes state to TASK_RUNNING) but
not actually added to the run queue.

Thanks again for looking into this and sorry about the noise!

Best regards,
Radu Rendec
pick_next_task() picking the wrong task [v4.9.163]
Hi Everyone,

I believe I'm seeing a weird behavior of pick_next_task() where it
chooses a lower priority task over a higher priority one. The
scheduling class of the two tasks is also different ('fair' vs. 'rt').
The culprit seems to be the optimization at the beginning of the
function, where fair_sched_class.pick_next_task() is called directly.
I'm running v4.9.163, but that piece of code is very similar in recent
kernels.

My use case is quite simple: I have a real-time thread that is woken up
by a GPIO hardware interrupt. The thread sleeps most of the time in
poll(), waiting for gpio_sysfs_irq() to wake it. The latency between
the interrupt and the thread being woken up/scheduled is very important
for the application. Note that I backported my own commit 03c0a9208bb1,
so the thread is always woken up synchronously from HW interrupt
context.

Most of the time things work as expected, but sometimes the scheduler
picks kworker and even the idle task before my real-time thread. I used
the trace infrastructure to figure out what happens and I'm including a
snippet below (I apologize for the wide lines).

  <idle>-0        [000] d.h2   161.202970: gpio_sysfs_irq <-__handle_irq_event_percpu
  <idle>-0        [000] d.h2   161.202981: kernfs_notify <-gpio_sysfs_irq
  <idle>-0        [000] d.h4   161.202998: sched_waking: comm=irqWorker pid=1141 prio=9 target_cpu=000
  <idle>-0        [000] d.h5   161.203025: sched_wakeup: comm=irqWorker pid=1141 prio=9 target_cpu=000
  <idle>-0        [000] d.h3   161.203047: workqueue_queue_work: work struct=806506b8 function=kernfs_notify_workfn workqueue=8f5dae60 req_cpu=1 cpu=0
  <idle>-0        [000] d.h3   161.203049: workqueue_activate_work: work struct 806506b8
  <idle>-0        [000] d.h4   161.203061: sched_waking: comm=kworker/0:1 pid=134 prio=120 target_cpu=000
  <idle>-0        [000] d.h5   161.203083: sched_wakeup: comm=kworker/0:1 pid=134 prio=120 target_cpu=000
  <idle>-0        [000] d..2   161.203201: sched_switch: prev_comm=swapper prev_pid=0 prev_prio=120 prev_state=R+ ==> next_comm=kworker/0:1 next_pid=134 next_prio=120
  kworker/0:1-134 [000]        161.203222: workqueue_execute_start: work struct 806506b8: function kernfs_notify_workfn
  kworker/0:1-134 [000] ...1   161.203286: schedule <-worker_thread
  kworker/0:1-134 [000] d..2   161.203329: sched_switch: prev_comm=kworker/0:1 prev_pid=134 prev_prio=120 prev_state=S ==> next_comm=swapper next_pid=0 next_prio=120
  <idle>-0        [000] .n.1   161.230287: schedule <-schedule_preempt_disabled
  <idle>-0        [000] d..2   161.230310: sched_switch: prev_comm=swapper prev_pid=0 prev_prio=120 prev_state=R+ ==> next_comm=irqWorker next_pid=1141 next_prio=9
  irqWorker-1141  [000] d..3   161.230316: finish_task_switch <-schedule

The system is Freescale MPC8378 (PowerPC, single processor).

I instrumented pick_next_task() with trace_printk() and I am sure that
every time the wrong task is picked, flow goes through the optimization
path and idle_sched_class.pick_next_task() is called directly. When the
right task is eventually picked, flow goes through the bottom block
that iterates over all scheduling classes.

This probably makes sense: when the scheduler runs in the context of
the idle task, prev->sched_class is no longer fair_sched_class, so the
bottom block with the full iteration is used. Note that in v4.9.163 the
optimization path is taken only when prev->sched_class is
fair_sched_class, whereas in recent kernels it is taken for both
fair_sched_class and idle_sched_class.

Any help or feedback would be much appreciated. In the meantime, I will
experiment with commenting out the optimization (at the expense of a
slower scheduler, of course).

Best regards,
Radu Rendec
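For reference, the optimization in question looks roughly like this in v4.9 (abridged from kernel/sched/core.c and quoted from memory, so details such as the pin_cookie argument may differ slightly):

```c
/* kernel/sched/core.c, v4.9 (abridged) */
static inline struct task_struct *
pick_next_task(struct rq *rq, struct task_struct *prev, struct pin_cookie cookie)
{
	const struct sched_class *class = &fair_sched_class;
	struct task_struct *p;

	/*
	 * Optimization: we know that if all tasks are in
	 * the fair class we can call that function directly:
	 */
	if (likely(prev->sched_class == class &&
		   rq->nr_running == rq->cfs.h_nr_running)) {
		p = fair_sched_class.pick_next_task(rq, prev, cookie);
		if (unlikely(p == RETRY_TASK))
			goto again;

		/* assumes fair_sched_class->next == idle_sched_class */
		if (unlikely(!p))
			p = idle_sched_class.pick_next_task(rq, prev, cookie);

		return p;
	}

again:
	for_each_class(class) {
		p = class->pick_next_task(rq, prev, cookie);
		if (p) {
			if (unlikely(p == RETRY_TASK))
				goto again;
			return p;
		}
	}

	BUG();	/* the idle class should always have a runnable task */
}
```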
Re: pick_next_task() picking the wrong task [v4.9.163]
On Sat, 2019-03-23 at 11:15 +0100, Peter Zijlstra wrote:
> On Fri, Mar 22, 2019 at 05:57:59PM -0400, Radu Rendec wrote:
> > Hi Everyone,
> >
> > I believe I'm seeing a weird behavior of pick_next_task() where it
> > chooses a lower priority task over a higher priority one. The
> > scheduling class of the two tasks is also different ('fair' vs.
> > 'rt'). The culprit seems to be the optimization at the beginning of
> > the function, where fair_sched_class.pick_next_task() is called
> > directly. I'm running v4.9.163, but that piece of code is very
> > similar in recent kernels.
> >
> > My use case is quite simple: I have a real-time thread that is woken
> > up by a GPIO hardware interrupt. The thread sleeps most of the time
> > in poll(), waiting for gpio_sysfs_irq() to wake it. The latency
> > between the interrupt and the thread being woken up/scheduled is
> > very important for the application. Note that I backported my own
> > commit 03c0a9208bb1, so the thread is always woken up synchronously
> > from HW interrupt context.
> >
> > Most of the time things work as expected, but sometimes the
> > scheduler picks kworker and even the idle task before my real-time
> > thread. I used the trace infrastructure to figure out what happens
> > and I'm including a snippet below (I apologize for the wide lines).
>
> If only they were wide :/ I had to unwrap them myself..

Sorry about that! Wonders of using the gmail web interface. I'll pay
more attention in the future.
> >     <idle>-0  [000] d.h2   161.202970: gpio_sysfs_irq <-__handle_irq_event_percpu
> >     <idle>-0  [000] d.h2   161.202981: kernfs_notify <-gpio_sysfs_irq
> >     <idle>-0  [000] d.h4   161.202998: sched_waking: comm=irqWorker pid=1141 prio=9 target_cpu=000
> >     <idle>-0  [000] d.h5   161.203025: sched_wakeup: comm=irqWorker pid=1141 prio=9 target_cpu=000
>
> weird how the next line doesn't have 'n/N' set:
>
> >     <idle>-0  [000] d.h3   161.203047: workqueue_queue_work: work struct=806506b8 function=kernfs_notify_workfn workqueue=8f5dae60 req_cpu=1 cpu=0
> >     <idle>-0  [000] d.h3   161.203049: workqueue_activate_work: work struct 806506b8
> >     <idle>-0  [000] d.h4   161.203061: sched_waking: comm=kworker/0:1 pid=134 prio=120 target_cpu=000
> >     <idle>-0  [000] d.h5   161.203083: sched_wakeup: comm=kworker/0:1 pid=134 prio=120 target_cpu=000
>
> There's that kworker wakeup.
>
> >     <idle>-0  [000] d..2   161.203201: sched_switch: prev_comm=swapper prev_pid=0 prev_prio=120 prev_state=R+ ==> next_comm=kworker/0:1 next_pid=134 next_prio=120
>
> And I agree that that is weird.
>
> >   kworker/0:1-134 [000]        161.203222: workqueue_execute_start: work struct 806506b8: function kernfs_notify_workfn
> >   kworker/0:1-134 [000] ...1   161.203286: schedule <-worker_thread
> >   kworker/0:1-134 [000] d..2   161.203329: sched_switch: prev_comm=kworker/0:1 prev_pid=134 prev_prio=120 prev_state=S ==> next_comm=swapper next_pid=0 next_prio=120
> >     <idle>-0  [000] .n.1   161.230287: schedule <-schedule_preempt_disabled
>
> Only here do I see 'n'.

Looking at other captures I can see 'n' starting at sched_wakeup for
irqWorker. Perhaps there was something wrong with this one or I copied
the wrong line.

> >     <idle>-0  [000] d..2   161.230310: sched_switch: prev_comm=swapper prev_pid=0 prev_prio=120 prev_state=R+ ==> next_comm=irqWorker next_pid=1141 next_prio=9
> >   irqWorker-1141 [000] d..3   161.230316: finish_task_switch <-schedule
> >
> > The system is Freescale MPC8378 (PowerPC, single processor).
> >
> > I instrumented pick_next_task() with trace_printk() and I am sure
> > that every time the wrong task is picked, flow goes through the
> > optimization
>
> That's weird, because when you wake a RT task, the:
>
>	rq->nr_running == rq->cfs.h_nr_running
>
> condition should not be true. Maybe try adding trace_printk() to all
> rq->nr_running manipulation to see what goes wobbly?

Sure, I will try that and come back with the results.

FWIW, I tried to comment out the entire optimization and force the
"slow" path that goes through every scheduling class. Surprisingly,
rt_sched_class.pick_next_task() returns NULL.

> > path and idle_sched_class.pick_next_task() is called directly. When
> > the right task is eventually picked, flow goes through the bottom
> > block that iterates over all scheduling classes. This probably makes
> > sense: when the scheduler runs in the context of the idle task,
> > prev->sched_class is no longer fair_sched_class, so the bottom block
> > with the full iteration is used.
[PATCH 0/1] kernfs_notify() poll latency
Hi everyone,

I believe kernfs_notify() poll latency can be improved if the poll
notification is done from kernfs_notify() directly rather than scheduled
work context.

I am sure there are good reasons why the fsnotify notification must be
done from scheduled work context (an obvious one is that it needs to be
able to sleep). But I don't see any reason why the poll notification
could not be done from kernfs_notify(). If there is any, please point it
out - I would highly appreciate it.

I came across this while working on a project that uses the sysfs GPIO
interface to wake a (real time) user space process on GPIO interrupts. I
know that interface is deprecated, but I still believe it's a valid
scenario and may occur with other drivers as well (current or future).

The sysfs GPIO interface (drivers/gpio/gpiolib-sysfs.c) interrupt
handler relies on kernfs_notify() (actually sysfs_notify_dirent(), but
that's just an alias) to wake any thread that may be poll()ing on the
interrupt. It is important to wake the thread as quickly as possible,
and going through the kernel worker to handle the scheduled work is much
slower. Since the kernel worker runs with normal priority, this can even
become a case of priority inversion: if a higher priority thread hogs
the CPU, it may delay the kernel worker and in turn the thread that
needs to be notified (which could be a real time thread).

Best regards,
Radu Rendec

Radu Rendec (1):
  Improve kernfs_notify() poll notification latency

 fs/kernfs/file.c | 23 +++++++++++------------
 1 file changed, 11 insertions(+), 12 deletions(-)

-- 
2.17.2
[PATCH 1/1] Improve kernfs_notify() poll notification latency
kernfs_notify() does two notifications: poll and fsnotify. Originally,
both notifications were done from scheduled work context and all that
kernfs_notify() did was schedule the work.

This patch simply moves the poll notification from the scheduled work
handler to kernfs_notify(). The fsnotify notification still needs to be
done from scheduled work context because it can sleep (it needs to lock
a mutex).

If the poll notification is time critical (the notified thread needs to
wake as quickly as possible), it's better to do it from kernfs_notify()
directly. One example is calling sysfs_notify_dirent() from a hardware
interrupt handler to wake up a thread and handle the interrupt in user
space.

Signed-off-by: Radu Rendec
---
 fs/kernfs/file.c | 23 +++++++++++------------
 1 file changed, 11 insertions(+), 12 deletions(-)

diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index dbf5bc250bfd..f8d5021a652e 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -857,7 +857,6 @@ static __poll_t kernfs_fop_poll(struct file *filp, poll_table *wait)
 static void kernfs_notify_workfn(struct work_struct *work)
 {
 	struct kernfs_node *kn;
-	struct kernfs_open_node *on;
 	struct kernfs_super_info *info;
 repeat:
 	/* pop one off the notify_list */
@@ -871,17 +870,6 @@ static void kernfs_notify_workfn(struct work_struct *work)
 	kn->attr.notify_next = NULL;
 	spin_unlock_irq(&kernfs_notify_lock);
 
-	/* kick poll */
-	spin_lock_irq(&kernfs_open_node_lock);
-
-	on = kn->attr.open;
-	if (on) {
-		atomic_inc(&on->event);
-		wake_up_interruptible(&on->poll);
-	}
-
-	spin_unlock_irq(&kernfs_open_node_lock);
-
 	/* kick fsnotify */
 	mutex_lock(&kernfs_mutex);
 
@@ -934,10 +922,21 @@ void kernfs_notify(struct kernfs_node *kn)
 {
 	static DECLARE_WORK(kernfs_notify_work, kernfs_notify_workfn);
 	unsigned long flags;
+	struct kernfs_open_node *on;
 
 	if (WARN_ON(kernfs_type(kn) != KERNFS_FILE))
 		return;
 
+	/* kick poll immediately */
+	spin_lock_irqsave(&kernfs_open_node_lock, flags);
+	on = kn->attr.open;
+	if (on) {
+		atomic_inc(&on->event);
+		wake_up_interruptible(&on->poll);
+	}
+	spin_unlock_irqrestore(&kernfs_open_node_lock, flags);
+
+	/* schedule work to kick fsnotify */
 	spin_lock_irqsave(&kernfs_notify_lock, flags);
 	if (!kn->attr.notify_next) {
 		kernfs_get(kn);
-- 
2.17.2