Impact of CONFIG_PARAVIRT=y / mmap benchmark

2017-03-28 Thread Radu Rendec
Hi,

I'm trying to assess the performance impact of enabling PARAVIRT (and
XEN) in a custom kernel configuration. I came across a very old thread
(https://lkml.org/lkml/2009/5/13/449) on this topic and the conclusion
back then was that the performance impact was (arguably) around 1%.

Does anyone still have a copy of Ingo Molnar's mmap-perf.c program (the
old link is broken)? Would it still be relevant to use it for measuring
performance in case of PARAVIRT?

Last but not least, has anyone looked into PARAVIRT performance more
recently?

Thank you!

Best regards,
Radu Rendec



Re: Impact of CONFIG_PARAVIRT=y / mmap benchmark

2017-03-31 Thread Radu Rendec
On Wed, 2017-03-29 at 20:05 +0200, Ingo Molnar wrote:
> * Randy Dunlap <rdun...@infradead.org> wrote:
> 
> > On 03/28/17 10:17, Radu Rendec wrote:
> > > Does anyone still have a copy of Ingo Molnar's mmap-perf.c
> > > program (the
> > > old link is broken)? Would it still be relevant to use it for
> > > measuring
> > > performance in case of PARAVIRT?
> > > 
> > I have mmap-perf.c that says:
> > /* Copyright Ingo Molnar (c) 2006 */
> > 
> > and no license info...
> > Ingo?
> 
> It's GPL v2, like the kernel.
> 

Randy, now that Ingo has clarified the license of that code, can you
please share it?

Thank you both for the feedback!

Radu



Re: Lowering the log level in watchdog_dev_register when err==-EBUSY

2017-10-13 Thread Radu Rendec
On Thu, 2017-10-12 at 12:54 -0700, Guenter Roeck wrote:
> On Thu, Oct 12, 2017 at 07:15:59PM +0100, Radu Rendec wrote:
> > Thanks for the suggestion! That makes sense. I will start working on
> > converting i6300esb and submit a patch in a few days.
> > 
> > By the way, I don't have the hardware. I'm using it with KVM (Qemu),
> > but I guess that's good enough since I'm not going to touch any of the
> > code parts that deal with the hardware.
> > 
> 
> Ah, interesting. Can you send me the qemu command line ?

Sure, this is as easy as adding "-device i6300esb" to the qemu command
line.

For what it's worth, my full command line is:

qemu-system-i386 \
-enable-kvm \
-hda fed26.qcow2 \
-cdrom ~/Downloads/Fedora-Everything-netinst-i386-26-1.5.iso \
-m 512 \
-smp 4 \
-netdev user,id=net0 \
-netdev tap,id=net1,br=virbr0,helper=/usr/libexec/qemu-bridge-helper \
-device e1000,netdev=net0 \
-device e1000,netdev=net1 \
-device i6300esb

Thanks,
Radu



Re: Lowering the log level in watchdog_dev_register when err==-EBUSY

2017-10-12 Thread Radu Rendec
On Wed, 2017-10-11 at 11:46 -0700, Guenter Roeck wrote:
> On Wed, Oct 11, 2017 at 06:46:31PM +0100, Radu Rendec wrote:
> > In a project I'm working on we have a valid use case where we activate
> > both the i6300esb and softdog watchdogs. We always activate i6300esb
> > first (which uses the "legacy" watchdog API) and then softdog. This
> > gets us two "error" level messages (coming from watchdog_cdev_register)
> > although softdog falls back to the "new" API and registers its char
> > device just fine.
> > 
> > Since watchdog_cdev_register/watchdog_dev_register seem to be used only
> > by watchdog_register_device and the latter always falls back to the
> > "new" API, I'm thinking about lowering the log level of these messages
> > when err is -EBUSY.
>
> I would suggest to convert the offending driver to use the watchdog subsystem
> (and along the line remove the restriction of only supporting a single
> instance). You have the hardware, so that should be a straightforward change.

Thanks for the suggestion! That makes sense. I will start working on
converting i6300esb and submit a patch in a few days.

By the way, I don't have the hardware. I'm using it with KVM (Qemu),
but I guess that's good enough since I'm not going to touch any of the
code parts that deal with the hardware.

Radu



Re: virtio_net: ethtool supported link modes

2017-09-01 Thread Radu Rendec
On Fri, 2017-09-01 at 11:36 +0800, Jason Wang wrote:
> 
> On 2017年09月01日 01:04, Radu Rendec wrote:
> > Hello,
> > 
> > Looking at the code in virtnet_set_link_ksettings, it seems the speed
> > and duplex can be set to any valid value. The driver will "remember"
> > them and report them back in virtnet_get_link_ksettings.
> > 
> > However, the supported link modes (link_modes.supported in struct
> > ethtool_link_ksettings) is always 0, indicating that no speed/duplex
> > setting is supported.
> > 
> > Does it make more sense to set (at least a few of) the supported link
modes, such as 10baseT_Half ... 1000baseT_Full?
> > 
> > I would expect to see consistency between what is reported in
> > link_modes.supported and what can actually be set. Could you please
> > share your opinion on this?
> 
> I think this may make sense only if there's a hardware implementation for 
> virtio. And we probably need to extend virtio spec for adding new commands.

So you're saying that the "hardware" should provide the supported link
modes (e.g. by using feature bits at the virtio layer) and the
virtio_net driver should just translate them and expose them as
link_modes.supported?

Then for consistency, I assume setting speed/duplex via ethtool should
also go into the virtio layer (currently virtio_net seems to just store
them for future retrieval via ethtool).

Thanks,
Radu


Re: virtio_net: ethtool supported link modes

2017-09-01 Thread Radu Rendec
On Fri, 2017-09-01 at 18:43 +0300, Michael S. Tsirkin wrote:
> On Thu, Aug 31, 2017 at 06:04:04PM +0100, Radu Rendec wrote:
> > Looking at the code in virtnet_set_link_ksettings, it seems the speed
> > and duplex can be set to any valid value. The driver will "remember"
> > them and report them back in virtnet_get_link_ksettings.
> > 
> > However, the supported link modes (link_modes.supported in struct
> > ethtool_link_ksettings) is always 0, indicating that no speed/duplex
> > setting is supported.
> > 
> > Does it make more sense to set (at least a few of) the supported link
modes, such as 10baseT_Half ... 1000baseT_Full?
> > 
> > I would expect to see consistency between what is reported in
> > link_modes.supported and what can actually be set. Could you please
> > share your opinion on this?
> 
> I would like to know more about why this is desirable.
> 
> We used not to support the modes at all, but it turned out
> some tools are confused by this: e.g. people would try to
> bond virtio with a hardware device, tools would see
> a mismatch in speed and features between bonded devices
> and get confused.
> 
> See
> 
>   commit 16032be56c1f66770da15cb94f0eb366c37aff6e
>   Author: Nikolay Aleksandrov <niko...@cumulusnetworks.com>
>   Date:   Wed Feb 3 04:04:37 2016 +0100
> 
>   virtio_net: add ethtool support for set and get of settings
> 
> 
> as well as the discussion around it
>   https://www.spinics.net/lists/netdev/msg362111.html

Thanks for pointing these out. It is much clearer now why mode
support is implemented the way it is and what the expectations are.

> If you think we need to add more hacks like this, a stronger
> motivation than "to see consistency" would be needed.

The use case behind my original question is very simple:
 * Net device is queried via ethtool for supported modes.
 * Supported modes are presented to user.
 * User can configure any of the supported modes.

This is done transparently to the net device type (driver), so it
actually makes sense for physical NICs.

This alone of course is not a good enough motivation to modify the
driver. And it can be easily addressed in user-space at the application
level by testing for the driver.

I was merely trying to avoid driver-specific workarounds (i.e. keep the
application driver agnostic) and wondered if "advertising" supported
modes through ethtool made any sense and/or would be a desirable change
from the driver perspective. I believe I have my answers now.

Thanks,
Radu



Re: virtio_net: ethtool supported link modes

2017-09-04 Thread Radu Rendec
On Fri, 2017-09-01 at 20:45 +0300, Michael S. Tsirkin wrote:
> On Fri, Sep 01, 2017 at 05:19:53PM +0100, Radu Rendec wrote:
> > On Fri, 2017-09-01 at 18:43 +0300, Michael S. Tsirkin wrote:
> > > On Thu, Aug 31, 2017 at 06:04:04PM +0100, Radu Rendec wrote:
> > > > Looking at the code in virtnet_set_link_ksettings, it seems the speed
> > > > and duplex can be set to any valid value. The driver will "remember"
> > > > them and report them back in virtnet_get_link_ksettings.
> > > > 
> > > > However, the supported link modes (link_modes.supported in struct
> > > > ethtool_link_ksettings) is always 0, indicating that no speed/duplex
> > > > setting is supported.
> > > > 
> > > > Does it make more sense to set (at least a few of) the supported link
modes, such as 10baseT_Half ... 1000baseT_Full?
> > > > 
> > > > I would expect to see consistency between what is reported in
> > > > link_modes.supported and what can actually be set. Could you please
> > > > share your opinion on this?
> > 
> > The use case behind my original question is very simple:
> >  * Net device is queried via ethtool for supported modes.
> >  * Supported modes are presented to user.
> >  * User can configure any of the supported modes.
> 
> Since this has no effect on virtio, isn't presenting
> "no supported modes" to user the right thing to do?

Yes, that makes sense.

> > This is done transparently to the net device type (driver), so it
> > actually makes sense for physical NICs.
> > 
> > This alone of course is not a good enough motivation to modify the
> > driver. And it can be easily addressed in user-space at the application
> > level by testing for the driver.
> 
> I think you might want to special-case no supported modes.
> Special-casing virtio is probably best avoided.
> 
> > I was merely trying to avoid driver-specific workarounds (i.e. keep the
> > application driver agnostic)
> 
> I think that's the right approach. So if driver does not present
> any supported modes this probably means it is not necessary
> to display or program any.

Yes, apparently it boils down to special-casing no supported modes.
This avoids both modifying virtio and special-casing virtio, and keeps
the application driver-agnostic at the same time.

Thanks for all the feedback. It was very helpful in figuring out the
right approach. I really appreciate it.

Radu



virtio_net: ethtool supported link modes

2017-08-31 Thread Radu Rendec
Hello,

Looking at the code in virtnet_set_link_ksettings, it seems the speed
and duplex can be set to any valid value. The driver will "remember"
them and report them back in virtnet_get_link_ksettings.

However, the supported link modes (link_modes.supported in struct
ethtool_link_ksettings) is always 0, indicating that no speed/duplex
setting is supported.

Does it make more sense to set (at least a few of) the supported link
modes, such as 10baseT_Half ... 1000baseT_Full?

I would expect to see consistency between what is reported in
link_modes.supported and what can actually be set. Could you please
share your opinion on this?

Thank you,
Radu Rendec



Lowering the log level in watchdog_dev_register when err==-EBUSY

2017-10-11 Thread Radu Rendec
Hello,

In a project I'm working on we have a valid use case where we activate
both the i6300esb and softdog watchdogs. We always activate i6300esb
first (which uses the "legacy" watchdog API) and then softdog. This
gets us two "error" level messages (coming from watchdog_cdev_register)
although softdog falls back to the "new" API and registers its char
device just fine.

Since watchdog_cdev_register/watchdog_dev_register seem to be used only
by watchdog_register_device and the latter always falls back to the
"new" API, I'm thinking about lowering the log level of these messages
when err is -EBUSY. Something along the lines of:

--- a/drivers/watchdog/watchdog_dev.c
+++ b/drivers/watchdog/watchdog_dev.c
@@ -928,11 +928,14 @@ static int watchdog_cdev_register(struct watchdog_device *wdd, dev_t devno)
 		watchdog_miscdev.parent = wdd->parent;
 		err = misc_register(&watchdog_miscdev);
 		if (err != 0) {
-			pr_err("%s: cannot register miscdev on minor=%d (err=%d).\n",
-				wdd->info->identity, WATCHDOG_MINOR, err);
-			if (err == -EBUSY)
-				pr_err("%s: a legacy watchdog module is probably present.\n",
-					wdd->info->identity);
+			if (err == -EBUSY) {
+				pr_info("%s: cannot register miscdev on minor=%d (err=%d).\n",
+					wdd->info->identity, WATCHDOG_MINOR, err);
+				pr_info("%s: a legacy watchdog module is probably present.\n",
+					wdd->info->identity);
+			} else
+				pr_err("%s: cannot register miscdev on minor=%d (err=%d).\n",
+					wdd->info->identity, WATCHDOG_MINOR, err);
 			old_wd_data = NULL;
 			kfree(wd_data);
 			return err;

Does this look like a good approach? If not, what would you recommend?
In any case, I want to upstream the change, so better ask first :)

Thanks,
Radu Rendec



[PATCH 1/1] Improve kernfs_notify() poll notification latency

2018-11-15 Thread Radu Rendec
kernfs_notify() does two notifications: poll and fsnotify. Originally,
both notifications were done from scheduled work context and all that
kernfs_notify() did was schedule the work.

This patch simply moves the poll notification from the scheduled work
handler to kernfs_notify(). The fsnotify notification still needs to be
done from scheduled work context because it can sleep (it needs to lock
a mutex).

If the poll notification is time critical (the notified thread needs to
wake as quickly as possible), it's better to do it from kernfs_notify()
directly. One example is calling sysfs_notify_dirent() from a hardware
interrupt handler to wake up a thread and handle the interrupt in user
space.

Signed-off-by: Radu Rendec 
---
 fs/kernfs/file.c | 23 +++
 1 file changed, 11 insertions(+), 12 deletions(-)

diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index dbf5bc250bfd..f8d5021a652e 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -857,7 +857,6 @@ static __poll_t kernfs_fop_poll(struct file *filp, poll_table *wait)
 static void kernfs_notify_workfn(struct work_struct *work)
 {
 	struct kernfs_node *kn;
-	struct kernfs_open_node *on;
 	struct kernfs_super_info *info;
 repeat:
 	/* pop one off the notify_list */
@@ -871,17 +870,6 @@ static void kernfs_notify_workfn(struct work_struct *work)
 	kn->attr.notify_next = NULL;
 	spin_unlock_irq(&kernfs_notify_lock);
 
-	/* kick poll */
-	spin_lock_irq(&kernfs_open_node_lock);
-
-	on = kn->attr.open;
-	if (on) {
-		atomic_inc(&on->event);
-		wake_up_interruptible(&on->poll);
-	}
-
-	spin_unlock_irq(&kernfs_open_node_lock);
-
 	/* kick fsnotify */
 	mutex_lock(&kernfs_mutex);
 
@@ -934,10 +922,21 @@ void kernfs_notify(struct kernfs_node *kn)
 {
 	static DECLARE_WORK(kernfs_notify_work, kernfs_notify_workfn);
 	unsigned long flags;
+	struct kernfs_open_node *on;
 
 	if (WARN_ON(kernfs_type(kn) != KERNFS_FILE))
 		return;
 
+	/* kick poll immediately */
+	spin_lock_irqsave(&kernfs_open_node_lock, flags);
+	on = kn->attr.open;
+	if (on) {
+		atomic_inc(&on->event);
+		wake_up_interruptible(&on->poll);
+	}
+	spin_unlock_irqrestore(&kernfs_open_node_lock, flags);
+
+	/* schedule work to kick fsnotify */
	spin_lock_irqsave(&kernfs_notify_lock, flags);
 	if (!kn->attr.notify_next) {
 		kernfs_get(kn);
-- 
2.17.2



[PATCH 0/1] kernfs_notify() poll latency

2018-11-15 Thread Radu Rendec
Hi everyone,

I believe kernfs_notify() poll latency can be improved if the poll
notification is done from kernfs_notify() directly rather than scheduled
work context.

I am sure there are good reasons why the fsnotify notification must be
done from scheduled work context (an obvious one is that it needs to be
able to sleep). But I don't see any reason why the poll notification
could not be done from kernfs_notify(). If there is any, please point it
out - I would highly appreciate it.

I came across this while working on a project that uses the sysfs GPIO
interface to wake a (real time) user space thread on GPIO interrupts. I
know that interface is deprecated, but I still believe it's a valid
scenario and may occur with other drivers as well (current or future).

The sysfs GPIO interface (drivers/gpio/gpiolib-sysfs.c) interrupt
handler relies on kernfs_notify() (actually sysfs_notify_dirent(), but
that's just an alias) to wake any thread that may be poll()ing on the
interrupt. It is important to wake the thread as quickly as possible and
going through the kernel worker to handle the scheduled work is much
slower. Since the kernel worker runs with normal priority, this can even
become a case of priority inversion. If a higher priority thread hogs
the CPU, it may delay the kernel worker and in turn the thread that
needs to be notified (which could be a real time thread).

Best regards,
Radu Rendec


Radu Rendec (1):
  Improve kernfs_notify() poll notification latency

 fs/kernfs/file.c | 23 +++
 1 file changed, 11 insertions(+), 12 deletions(-)

-- 
2.17.2





Re: pick_next_task() picking the wrong task [v4.9.163]

2019-03-27 Thread Radu Rendec
On Sat, 2019-03-23 at 11:15 +0100, Peter Zijlstra wrote:
> On Fri, Mar 22, 2019 at 05:57:59PM -0400, Radu Rendec wrote:
> > Hi Everyone,
> > 
> > I believe I'm seeing a weird behavior of pick_next_task() where it
> > chooses a lower priority task over a higher priority one. The scheduling
> > class of the two tasks is also different ('fair' vs. 'rt'). The culprit
> > seems to be the optimization at the beginning of the function, where
> > fair_sched_class.pick_next_task() is called directly.  I'm running
> > v4.9.163, but that piece of code is very similar in recent kernels.
> > 
> > [...]
> > 
> > I instrumented pick_next_task() with trace_printk() and I am sure that
> > every time the wrong task is picked, flow goes through the optimization
> 
> That's weird, because when you wake a RT task, the:
> 
>   rq->nr_running == rq->cfs.h_nr_running
> 
> condition should not be true. Maybe try adding trace_printk() to all
> rq->nr_running manipulation to see what goes wobbly?

The answer is in enqueue_top_rt_rq(): it returns before touching the run
queue counters because rt_rq_throttled(rt_rq) is true. So basically this
is RT throttling kicking in. I confirmed by disabling RT throttling and
testing again.
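For reference, this is the knob involved (my own note, not from the thread). By default RT tasks may consume 950000 us of every 1000000 us period; setting the runtime to -1 disables throttling:

```shell
# Inspect the RT throttling window (defaults: period 1000000, runtime 950000)
sysctl kernel.sched_rt_period_us kernel.sched_rt_runtime_us

# Disable RT throttling entirely (use with care: a runaway RT task
# can then monopolize the CPU indefinitely)
sysctl -w kernel.sched_rt_runtime_us=-1
```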

So there's nothing wrong with the scheduler. The "sched_wakeup: comm=.."
trace was a bit misleading. What happens when RT throttling kicks in is
that the task is woken (and probably changes state to TASK_RUNNING) but
not actually added to the run queue.

Thanks again for looking into this and sorry about the noise!

Best regards,
Radu Rendec




pick_next_task() picking the wrong task [v4.9.163]

2019-03-22 Thread Radu Rendec
Hi Everyone,

I believe I'm seeing a weird behavior of pick_next_task() where it
chooses a lower priority task over a higher priority one. The scheduling
class of the two tasks is also different ('fair' vs. 'rt'). The culprit
seems to be the optimization at the beginning of the function, where
fair_sched_class.pick_next_task() is called directly.  I'm running
v4.9.163, but that piece of code is very similar in recent kernels.

My use case is quite simple: I have a real-time thread that is woken up
by a GPIO hardware interrupt. The thread sleeps most of the time in
poll(), waiting for gpio_sysfs_irq() to wake it. The latency between the
interrupt and the thread being woken up/scheduled is very important for
the application. Note that I backported my own commit 03c0a9208bb1, so
the thread is always woken up synchronously from HW interrupt context.

Most of the time things work as expected, but sometimes the scheduler
picks kworker and even the idle task before my real-time thread. I used
the trace infrastructure to figure out what happens and I'm including a
snippet below (I apologize for the wide lines).

 <idle>-0        [000] d.h2   161.202970: gpio_sysfs_irq <-__handle_irq_event_percpu
 <idle>-0        [000] d.h2   161.202981: kernfs_notify <-gpio_sysfs_irq
 <idle>-0        [000] d.h4   161.202998: sched_waking: comm=irqWorker pid=1141 prio=9 target_cpu=000
 <idle>-0        [000] d.h5   161.203025: sched_wakeup: comm=irqWorker pid=1141 prio=9 target_cpu=000
 <idle>-0        [000] d.h3   161.203047: workqueue_queue_work: work struct=806506b8 function=kernfs_notify_workfn workqueue=8f5dae60 req_cpu=1 cpu=0
 <idle>-0        [000] d.h3   161.203049: workqueue_activate_work: work struct 806506b8
 <idle>-0        [000] d.h4   161.203061: sched_waking: comm=kworker/0:1 pid=134 prio=120 target_cpu=000
 <idle>-0        [000] d.h5   161.203083: sched_wakeup: comm=kworker/0:1 pid=134 prio=120 target_cpu=000
 <idle>-0        [000] d..2   161.203201: sched_switch: prev_comm=swapper prev_pid=0 prev_prio=120 prev_state=R+ ==> next_comm=kworker/0:1 next_pid=134 next_prio=120
 kworker/0:1-134 [000]        161.203222: workqueue_execute_start: work struct 806506b8: function kernfs_notify_workfn
 kworker/0:1-134 [000] ...1   161.203286: schedule <-worker_thread
 kworker/0:1-134 [000] d..2   161.203329: sched_switch: prev_comm=kworker/0:1 prev_pid=134 prev_prio=120 prev_state=S ==> next_comm=swapper next_pid=0 next_prio=120
 <idle>-0        [000] .n.1   161.230287: schedule <-schedule_preempt_disabled
 <idle>-0        [000] d..2   161.230310: sched_switch: prev_comm=swapper prev_pid=0 prev_prio=120 prev_state=R+ ==> next_comm=irqWorker next_pid=1141 next_prio=9
 irqWorker-1141  [000] d..3   161.230316: finish_task_switch <-schedule

The system is Freescale MPC8378 (PowerPC, single processor).

I instrumented pick_next_task() with trace_printk() and I am sure that
every time the wrong task is picked, flow goes through the optimization
path and idle_sched_class.pick_next_task() is called directly. When the
right task is eventually picked, flow goes through the bottom block that
iterates over all scheduling classes. This probably makes sense: when
the scheduler runs in the context of the idle task, prev->sched_class is
no longer fair_sched_class, so the bottom block with the full iteration
is used. Note that in v4.9.163 the optimization path is taken only when
prev->sched_class is fair_sched_class, whereas in recent kernels it is
taken for both fair_sched_class and idle_sched_class.

Any help or feedback would be much appreciated. In the meantime, I will
experiment with commenting out the optimization (at the expense of a
slower scheduler, of course).

Best regards,
Radu Rendec


Re: pick_next_task() picking the wrong task [v4.9.163]

2019-03-24 Thread Radu Rendec
On Sat, 2019-03-23 at 11:15 +0100, Peter Zijlstra wrote:
> On Fri, Mar 22, 2019 at 05:57:59PM -0400, Radu Rendec wrote:
> > Hi Everyone,
> > 
> > I believe I'm seeing a weird behavior of pick_next_task() where it
> > chooses a lower priority task over a higher priority one. The scheduling
> > class of the two tasks is also different ('fair' vs. 'rt'). The culprit
> > seems to be the optimization at the beginning of the function, where
> > fair_sched_class.pick_next_task() is called directly.  I'm running
> > v4.9.163, but that piece of code is very similar in recent kernels.
> > 
> > My use case is quite simple: I have a real-time thread that is woken up
> > by a GPIO hardware interrupt. The thread sleeps most of the time in
> > poll(), waiting for gpio_sysfs_irq() to wake it. The latency between the
> > interrupt and the thread being woken up/scheduled is very important for
> > the application. Note that I backported my own commit 03c0a9208bb1, so
> > the thread is always woken up synchronously from HW interrupt context.
> > 
> > Most of the time things work as expected, but sometimes the scheduler
> > picks kworker and even the idle task before my real-time thread. I used
> > the trace infrastructure to figure out what happens and I'm including a
> > snippet below (I apologize for the wide lines).
> 
> If only they were wide :/ I had to unwrap them myself..

Sorry about that! Wonders of using the gmail web interface. I'll pay
more attention in the future.

> >           <idle>-0 [000] d.h2   161.202970: gpio_sysfs_irq <-__handle_irq_event_percpu
> >           <idle>-0 [000] d.h2   161.202981: kernfs_notify <-gpio_sysfs_irq
> >           <idle>-0 [000] d.h4   161.202998: sched_waking: comm=irqWorker pid=1141 prio=9 target_cpu=000
> >           <idle>-0 [000] d.h5   161.203025: sched_wakeup: comm=irqWorker pid=1141 prio=9 target_cpu=000
> 
> weird how the next line doesn't have 'n/N' set:
> 
> >           <idle>-0 [000] d.h3   161.203047: workqueue_queue_work: work struct=806506b8 function=kernfs_notify_workfn workqueue=8f5dae60 req_cpu=1 cpu=0
> >           <idle>-0 [000] d.h3   161.203049: workqueue_activate_work: work struct 806506b8
> >           <idle>-0 [000] d.h4   161.203061: sched_waking: comm=kworker/0:1 pid=134 prio=120 target_cpu=000
> >           <idle>-0 [000] d.h5   161.203083: sched_wakeup: comm=kworker/0:1 pid=134 prio=120 target_cpu=000
> 
> There's that kworker wakeup.
> 
> >           <idle>-0 [000] d..2   161.203201: sched_switch: prev_comm=swapper prev_pid=0 prev_prio=120 prev_state=R+ ==> next_comm=kworker/0:1 next_pid=134 next_prio=120
> 
> And I agree that that is weird.
> 
> >      kworker/0:1-134   [000]    161.203222: workqueue_execute_start: work struct 806506b8: function kernfs_notify_workfn
> >      kworker/0:1-134   [000] ...1   161.203286: schedule <-worker_thread
> >      kworker/0:1-134   [000] d..2   161.203329: sched_switch: prev_comm=kworker/0:1 prev_pid=134 prev_prio=120 prev_state=S ==> next_comm=swapper next_pid=0 next_prio=120
> >           <idle>-0 [000] .n.1   161.230287: schedule <-schedule_preempt_disabled
> 
> Only here do I see 'n'.

Looking at other captures I can see 'n' starting at sched_wakeup for
irqWorker. Perhaps there was something wrong with this one or I copied
the wrong line.

> >           <idle>-0 [000] d..2   161.230310: sched_switch: prev_comm=swapper prev_pid=0 prev_prio=120 prev_state=R+ ==> next_comm=irqWorker next_pid=1141 next_prio=9
> >        irqWorker-1141  [000] d..3   161.230316: finish_task_switch <-schedule
> > 
> > The system is Freescale MPC8378 (PowerPC, single processor).
> > 
> > I instrumented pick_next_task() with trace_printk() and I am sure that
> > every time the wrong task is picked, flow goes through the optimization
> 
> That's weird, because when you wake a RT task, the:
> 
>   rq->nr_running == rq->cfs.h_nr_running
> 
> condition should not be true. Maybe try adding trace_printk() to all
> rq->nr_running manipulation to see what goes wobbly?

Sure, I will try that and come back with the results.

FWIW, I tried to comment out the entire optimization and force the
"slow" path that goes through every scheduling class. Surprisingly,
rt_sched_class.pick_next_task() returns NULL.

> > path and idle_sched_class.pick_next_task() is called directly. When the
> > right task is eventually picked, flow goes through the bottom block that
> > iterates over all scheduling classes. This probably makes sense: when
> > the scheduler runs in the context of the idle task, prev->sched

[PATCH 0/1] kernfs_notify() poll latency

2018-11-15 Thread Radu Rendec
Hi everyone,

I believe kernfs_notify() poll latency can be improved if the poll
notification is done from kernfs_notify() directly rather than from
scheduled work context.

I am sure there are good reasons why the fsnotify notification must be
done from scheduled work context (an obvious one is that it needs to be
able to sleep). But I don't see any reason why the poll notification
could not be done from kernfs_notify(). If there is any, please point it
out - I would highly appreciate it.

I came across this while working on a project that uses the sysfs GPIO
interface to wake a (real-time) user space thread on GPIO interrupts. I
know that interface is deprecated, but I still believe this is a valid
scenario that may occur with other drivers as well (current or future).

The sysfs GPIO interface (drivers/gpio/gpiolib-sysfs.c) interrupt
handler relies on kernfs_notify() (actually sysfs_notify_dirent(), but
that's just an alias) to wake any thread that may be poll()ing on the
interrupt. It is important to wake the thread as quickly as possible and
going through the kernel worker to handle the scheduled work is much
slower. Since the kernel worker runs with normal priority, this can even
become a case of priority inversion. If a higher priority thread hogs
the CPU, it may delay the kernel worker and in turn the thread that
needs to be notified (which could be a real time thread).

Best regards,
Radu Rendec


Radu Rendec (1):
  Improve kernfs_notify() poll notification latency

 fs/kernfs/file.c | 23 +++
 1 file changed, 11 insertions(+), 12 deletions(-)

-- 
2.17.2



[PATCH 1/1] Improve kernfs_notify() poll notification latency

2018-11-15 Thread Radu Rendec
kernfs_notify() does two notifications: poll and fsnotify. Originally,
both notifications were done from scheduled work context and all that
kernfs_notify() did was schedule the work.

This patch simply moves the poll notification from the scheduled work
handler to kernfs_notify(). The fsnotify notification still needs to be
done from scheduled work context because it can sleep (it needs to lock
a mutex).

If the poll notification is time critical (the notified thread needs to
wake as quickly as possible), it's better to do it from kernfs_notify()
directly. One example is calling sysfs_notify_dirent() from a hardware
interrupt handler to wake up a thread and handle the interrupt in user
space.

Signed-off-by: Radu Rendec 
---
 fs/kernfs/file.c | 23 +++
 1 file changed, 11 insertions(+), 12 deletions(-)

diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index dbf5bc250bfd..f8d5021a652e 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -857,7 +857,6 @@ static __poll_t kernfs_fop_poll(struct file *filp, poll_table *wait)
 static void kernfs_notify_workfn(struct work_struct *work)
 {
struct kernfs_node *kn;
-   struct kernfs_open_node *on;
struct kernfs_super_info *info;
 repeat:
/* pop one off the notify_list */
@@ -871,17 +870,6 @@ static void kernfs_notify_workfn(struct work_struct *work)
kn->attr.notify_next = NULL;
	spin_unlock_irq(&kernfs_notify_lock);
 
-	/* kick poll */
-	spin_lock_irq(&kernfs_open_node_lock);
-
-	on = kn->attr.open;
-	if (on) {
-		atomic_inc(&on->event);
-		wake_up_interruptible(&on->poll);
-	}
-
-	spin_unlock_irq(&kernfs_open_node_lock);
-
	/* kick fsnotify */
	mutex_lock(&kernfs_mutex);
 
@@ -934,10 +922,21 @@ void kernfs_notify(struct kernfs_node *kn)
 {
static DECLARE_WORK(kernfs_notify_work, kernfs_notify_workfn);
unsigned long flags;
+   struct kernfs_open_node *on;
 
if (WARN_ON(kernfs_type(kn) != KERNFS_FILE))
return;
 
+	/* kick poll immediately */
+	spin_lock_irqsave(&kernfs_open_node_lock, flags);
+	on = kn->attr.open;
+	if (on) {
+		atomic_inc(&on->event);
+		wake_up_interruptible(&on->poll);
+	}
+	spin_unlock_irqrestore(&kernfs_open_node_lock, flags);
+
+	/* schedule work to kick fsnotify */
	spin_lock_irqsave(&kernfs_notify_lock, flags);
if (!kn->attr.notify_next) {
kernfs_get(kn);
-- 
2.17.2



Impact of CONFIG_PARAVIRT=y / mmap benchmark

2017-03-28 Thread Radu Rendec
Hi,

I'm trying to assess the performance impact of enabling PARAVIRT (and
XEN) in a custom kernel configuration. I came across a very old thread
(https://lkml.org/lkml/2009/5/13/449) on this topic and the conclusion
back then was that the performance impact was (arguably) around 1%.

Does anyone still have a copy of Ingo Molnar's mmap-perf.c program (the
old link is broken)? Would it still be relevant to use it for measuring
performance in case of PARAVIRT?

Last but not least, has anyone looked into PARAVIRT performance more
recently?

Thank you!

Best regards,
Radu Rendec



Re: Impact of CONFIG_PARAVIRT=y / mmap benchmark

2017-03-31 Thread Radu Rendec
On Wed, 2017-03-29 at 20:05 +0200, Ingo Molnar wrote:
> * Randy Dunlap <rdun...@infradead.org> wrote:
> 
> > On 03/28/17 10:17, Radu Rendec wrote:
> > > Does anyone still have a copy of Ingo Molnar's mmap-perf.c
> > > program (the
> > > old link is broken)? Would it still be relevant to use it for
> > > measuring
> > > performance in case of PARAVIRT?
> > > 
> > I have mmap-perf.c that says:
> > /* Copyright Ingo Molnar (c) 2006 */
> > 
> > and no license info...
> > Ingo?
> 
> It's GPL v2, like the kernel.
> 

Randy, now that Ingo has clarified the license of that code, can you
please share it?

Thank you both for the feedback!

Radu