[PATCH 1/3] USB: use percpu counter to count submitted URBs per device

2013-08-19 Thread Ming Lei
Because usb_hcd_submit_urb is in the hotest path of usb core,
so use percpu counter to count URB instead of using atomic variable
because atomic operations are much slower than percpu operations.

Cc: Oliver Neukum oli...@neukum.org
Cc: Alan Stern st...@rowland.harvard.edu
Signed-off-by: Ming Lei ming@canonical.com
---
 drivers/usb/core/hcd.c   |4 ++--
 drivers/usb/core/sysfs.c |7 ++-
 drivers/usb/core/usb.c   |9 -
 drivers/usb/core/usb.h   |1 +
 include/linux/usb.h  |2 +-
 5 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/drivers/usb/core/hcd.c b/drivers/usb/core/hcd.c
index 19ad3d2..0b4d1ae 100644
--- a/drivers/usb/core/hcd.c
+++ b/drivers/usb/core/hcd.c
@@ -1556,7 +1556,7 @@ int usb_hcd_submit_urb (struct urb *urb, gfp_t mem_flags)
 */
usb_get_urb(urb);
atomic_inc(urb-use_count);
-   atomic_inc(urb-dev-urbnum);
+   this_cpu_inc(*urb-dev-urbnum);
usbmon_urb_submit(hcd-self, urb);
 
/* NOTE requirements on root-hub callers (usbfs and the hub
@@ -1583,7 +1583,7 @@ int usb_hcd_submit_urb (struct urb *urb, gfp_t mem_flags)
urb-hcpriv = NULL;
INIT_LIST_HEAD(urb-urb_list);
atomic_dec(urb-use_count);
-   atomic_dec(urb-dev-urbnum);
+   this_cpu_dec(*urb-dev-urbnum);
if (atomic_read(urb-reject))
wake_up(usb_kill_urb_queue);
usb_put_urb(urb);
diff --git a/drivers/usb/core/sysfs.c b/drivers/usb/core/sysfs.c
index d9284b9..707f2ca 100644
--- a/drivers/usb/core/sysfs.c
+++ b/drivers/usb/core/sysfs.c
@@ -237,9 +237,14 @@ static ssize_t
 show_urbnum(struct device *dev, struct device_attribute *attr, char *buf)
 {
struct usb_device *udev;
+   unsigned int cnt = 0;
+   int i;
 
udev = to_usb_device(dev);
-   return sprintf(buf, %d\n, atomic_read(udev-urbnum));
+   for_each_possible_cpu(i)
+   cnt += *per_cpu_ptr(udev-urbnum, i);
+
+   return sprintf(buf, %d\n, cnt);
 }
 static DEVICE_ATTR(urbnum, S_IRUGO, show_urbnum, NULL);
 
diff --git a/drivers/usb/core/usb.c b/drivers/usb/core/usb.c
index 0a6ee2e..5111edb 100644
--- a/drivers/usb/core/usb.c
+++ b/drivers/usb/core/usb.c
@@ -271,6 +271,7 @@ static void usb_release_dev(struct device *dev)
kfree(udev-product);
kfree(udev-manufacturer);
kfree(udev-serial);
+   free_percpu(udev-urbnum);
kfree(udev);
 }
 
@@ -433,7 +434,13 @@ struct usb_device *usb_alloc_dev(struct usb_device *parent,
set_dev_node(dev-dev, dev_to_node(bus-controller));
dev-state = USB_STATE_ATTACHED;
dev-lpm_disable_count = 1;
-   atomic_set(dev-urbnum, 0);
+
+   dev-urbnum = alloc_percpu(typeof(*dev-urbnum));
+   if (!dev-urbnum) {
+   usb_put_hcd(bus_to_hcd(bus));
+   kfree(dev);
+   return NULL;
+   }
 
INIT_LIST_HEAD(dev-ep0.urb_list);
dev-ep0.desc.bLength = USB_DT_ENDPOINT_SIZE;
diff --git a/drivers/usb/core/usb.h b/drivers/usb/core/usb.h
index 8238577..12a0181 100644
--- a/drivers/usb/core/usb.h
+++ b/drivers/usb/core/usb.h
@@ -1,5 +1,6 @@
 #include linux/pm.h
 #include linux/acpi.h
+#include linux/percpu.h
 
 struct usb_hub_descriptor;
 struct dev_state;
diff --git a/include/linux/usb.h b/include/linux/usb.h
index 001629c..75332dc 100644
--- a/include/linux/usb.h
+++ b/include/linux/usb.h
@@ -561,7 +561,7 @@ struct usb_device {
int maxchild;
 
u32 quirks;
-   atomic_t urbnum;
+   unsigned int __percpu *urbnum;
 
unsigned long active_duration;
 
-- 
1.7.9.5

--
To unsubscribe from this list: send the line unsubscribe linux-usb in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] USB: use percpu counter to count submitted URBs per device

2013-08-19 Thread Greg Kroah-Hartman
On Mon, Aug 19, 2013 at 07:04:18PM +0800, Ming Lei wrote:
 Because usb_hcd_submit_urb is in the hotest path of usb core,
 so use percpu counter to count URB instead of using atomic variable
 because atomic operations are much slower than percpu operations.
 
 Cc: Oliver Neukum oli...@neukum.org
 Cc: Alan Stern st...@rowland.harvard.edu
 Signed-off-by: Ming Lei ming@canonical.com
 ---
  drivers/usb/core/hcd.c   |4 ++--
  drivers/usb/core/sysfs.c |7 ++-
  drivers/usb/core/usb.c   |9 -
  drivers/usb/core/usb.h   |1 +
  include/linux/usb.h  |2 +-
  5 files changed, 18 insertions(+), 5 deletions(-)

And this really speeds things up?  Exactly what does it?

And it's not that atomic operations are slower, it's just that the
barriers involved can be slower, depending on what else is happening.
If you look, you are already hitting atomic variables in the same path,
so how can this change speed anything up?

thanks,

greg k-h
--
To unsubscribe from this list: send the line unsubscribe linux-usb in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] USB: use percpu counter to count submitted URBs per device

2013-08-19 Thread Alan Stern
On Mon, 19 Aug 2013, Ming Lei wrote:

 Because usb_hcd_submit_urb is in the hotest path of usb core,
 so use percpu counter to count URB instead of using atomic variable
 because atomic operations are much slower than percpu operations.

This seems like a ridiculous amount of additional overhead for a simple
counter.  The kernel doesn't even use this value for anything; it's 
only purpose is to allow userspace to see how many URBs have been 
transferred for a device.  (I don't know what programs use this 
information.  Powertop maybe?)

Do you have any reason to believe this will really improve performance 
at all?

Alan Stern



--
To unsubscribe from this list: send the line unsubscribe linux-usb in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] USB: use percpu counter to count submitted URBs per device

2013-08-19 Thread Ming Lei
On Mon, Aug 19, 2013 at 10:00 PM, Greg Kroah-Hartman
gre...@linuxfoundation.org wrote:
 On Mon, Aug 19, 2013 at 07:04:18PM +0800, Ming Lei wrote:
 Because usb_hcd_submit_urb is in the hotest path of usb core,
 so use percpu counter to count URB instead of using atomic variable
 because atomic operations are much slower than percpu operations.

 Cc: Oliver Neukum oli...@neukum.org
 Cc: Alan Stern st...@rowland.harvard.edu
 Signed-off-by: Ming Lei ming@canonical.com
 ---
  drivers/usb/core/hcd.c   |4 ++--
  drivers/usb/core/sysfs.c |7 ++-
  drivers/usb/core/usb.c   |9 -
  drivers/usb/core/usb.h   |1 +
  include/linux/usb.h  |2 +-
  5 files changed, 18 insertions(+), 5 deletions(-)

 And this really speeds things up?  Exactly what does it?

 And it's not that atomic operations are slower, it's just that the

For SMP, atomic_inc/atomic_dec are much slower than percpu
variable inc/dec, see 4.1(Why Isn’t Concurrent Count-ing Trivial?)
of [1].

 However, it is slower: on a Intel Core Duo laptop, it is about six
 times slower than non-atomic increment when a single thread
 is incrementing, and more than ten times slower if two threads
 are incrementing.

Considered that most of desktop  laptop are SMP now, and with
USB3.0, the submitted URBs per second may reach tens of thousand
or more, and we can remove the atomic inc/dec operations in the hot
path, so why don't do it?

 barriers involved can be slower, depending on what else is happening.
 If you look, you are already hitting atomic variables in the same path,
 so how can this change speed anything up?

No, no barriers are involved in atomic_inc/atomic_dec at all.

[1], Is Parallel Programming Hard, And, If So, What Can You
Do About It?

git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git

Thanks,
--
Ming Lei
--
To unsubscribe from this list: send the line unsubscribe linux-usb in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] USB: use percpu counter to count submitted URBs per device

2013-08-19 Thread Ming Lei
On Mon, Aug 19, 2013 at 10:42 PM, Alan Stern st...@rowland.harvard.edu wrote:
 On Mon, 19 Aug 2013, Ming Lei wrote:

 Because usb_hcd_submit_urb is in the hotest path of usb core,
 so use percpu counter to count URB instead of using atomic variable
 because atomic operations are much slower than percpu operations.

 This seems like a ridiculous amount of additional overhead for a simple
 counter.  The kernel doesn't even use this value for anything; it's
 only purpose is to allow userspace to see how many URBs have been
 transferred for a device.  (I don't know what programs use this
 information.  Powertop maybe?)

That is why I want to remove the expensive atomic inc/dec, or can we
remove the counter?


 Do you have any reason to believe this will really improve performance
 at all?

Please see my reply on Greg's comments.

Thanks,
--
Ming Lei
--
To unsubscribe from this list: send the line unsubscribe linux-usb in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] USB: use percpu counter to count submitted URBs per device

2013-08-19 Thread Greg Kroah-Hartman
On Mon, Aug 19, 2013 at 11:06:31PM +0800, Ming Lei wrote:
 On Mon, Aug 19, 2013 at 10:00 PM, Greg Kroah-Hartman
 gre...@linuxfoundation.org wrote:
  On Mon, Aug 19, 2013 at 07:04:18PM +0800, Ming Lei wrote:
  Because usb_hcd_submit_urb is in the hotest path of usb core,
  so use percpu counter to count URB instead of using atomic variable
  because atomic operations are much slower than percpu operations.
 
  Cc: Oliver Neukum oli...@neukum.org
  Cc: Alan Stern st...@rowland.harvard.edu
  Signed-off-by: Ming Lei ming@canonical.com
  ---
   drivers/usb/core/hcd.c   |4 ++--
   drivers/usb/core/sysfs.c |7 ++-
   drivers/usb/core/usb.c   |9 -
   drivers/usb/core/usb.h   |1 +
   include/linux/usb.h  |2 +-
   5 files changed, 18 insertions(+), 5 deletions(-)
 
  And this really speeds things up?  Exactly what does it?
 
  And it's not that atomic operations are slower, it's just that the
 
 For SMP, atomic_inc/atomic_dec are much slower than percpu
 variable inc/dec, see 4.1(Why Isn’t Concurrent Count-ing Trivial?)
 of [1].
 
  However, it is slower: on a Intel Core Duo laptop, it is about six
  times slower than non-atomic increment when a single thread
  is incrementing, and more than ten times slower if two threads
  are incrementing.
 
 Considered that most of desktop  laptop are SMP now, and with
 USB3.0, the submitted URBs per second may reach tens of thousand
 or more, and we can remove the atomic inc/dec operations in the hot
 path, so why don't do it?

Because you really didn't do it, there are lots of other atomic
operations on that same path.

And, thens of thousands of urbs should be trivial, did you measure this
to see if it changed anything?  I'm not taking patches like this that
are not quantifiable, sorry.

The gating problem in USB right now is the hardware, it's the slowest
thing, not the kernel, from everything I have ever tested, or seen.

Well, bad host controller silicon is also a problem (i.e. raspberry pi),
but there's not much we can do about braindead problems like that...

  barriers involved can be slower, depending on what else is happening.
  If you look, you are already hitting atomic variables in the same path,
  so how can this change speed anything up?
 
 No, no barriers are involved in atomic_inc/atomic_dec at all.

None?  Hm, you might want to rethink that statement :)

greg k-h
--
To unsubscribe from this list: send the line unsubscribe linux-usb in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] USB: use percpu counter to count submitted URBs per device

2013-08-19 Thread Alan Stern
On Mon, 19 Aug 2013, Ming Lei wrote:

 On Mon, Aug 19, 2013 at 10:42 PM, Alan Stern st...@rowland.harvard.edu 
 wrote:
  On Mon, 19 Aug 2013, Ming Lei wrote:
 
  Because usb_hcd_submit_urb is in the hotest path of usb core,
  so use percpu counter to count URB instead of using atomic variable
  because atomic operations are much slower than percpu operations.
 
  This seems like a ridiculous amount of additional overhead for a simple
  counter.  The kernel doesn't even use this value for anything; it's
  only purpose is to allow userspace to see how many URBs have been
  transferred for a device.  (I don't know what programs use this
  information.  Powertop maybe?)
 
 That is why I want to remove the expensive atomic inc/dec, or can we
 remove the counter?

No doubt somebody would complain if the counter was removed.  Who added 
it in the first place, and for what reason?

  Do you have any reason to believe this will really improve performance
  at all?
 
 Please see my reply on Greg's comments.

As far as I can see, this counter does not need to be exact.  Why not 
simply make it a non-atomic unsigned int?

Alan Stern

--
To unsubscribe from this list: send the line unsubscribe linux-usb in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] USB: use percpu counter to count submitted URBs per device

2013-08-19 Thread Ming Lei
On Mon, Aug 19, 2013 at 11:17 PM, Alan Stern st...@rowland.harvard.edu wrote:
 On Mon, 19 Aug 2013, Ming Lei wrote:

 On Mon, Aug 19, 2013 at 10:42 PM, Alan Stern st...@rowland.harvard.edu 
 wrote:
  On Mon, 19 Aug 2013, Ming Lei wrote:
 
  Because usb_hcd_submit_urb is in the hotest path of usb core,
  so use percpu counter to count URB instead of using atomic variable
  because atomic operations are much slower than percpu operations.
 
  This seems like a ridiculous amount of additional overhead for a simple
  counter.  The kernel doesn't even use this value for anything; it's
  only purpose is to allow userspace to see how many URBs have been
  transferred for a device.  (I don't know what programs use this
  information.  Powertop maybe?)

 That is why I want to remove the expensive atomic inc/dec, or can we
 remove the counter?

 No doubt somebody would complain if the counter was removed.  Who added
 it in the first place, and for what reason?

  Do you have any reason to believe this will really improve performance
  at all?

 Please see my reply on Greg's comments.

 As far as I can see, this counter does not need to be exact.  Why not
 simply make it a non-atomic unsigned int?

It may becomes quite inaccurate, and 4.1 of the perfbook mentioned
that half of counts might be lost with simple non-atomic unsigned int,
so I think percpu variable is good choice.

Thanks,
--
Ming Lei
--
To unsubscribe from this list: send the line unsubscribe linux-usb in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] USB: use percpu counter to count submitted URBs per device

2013-08-19 Thread Alan Stern
On Mon, 19 Aug 2013, Ming Lei wrote:

  As far as I can see, this counter does not need to be exact.  Why not
  simply make it a non-atomic unsigned int?
 
 It may becomes quite inaccurate, and 4.1 of the perfbook mentioned
 that half of counts might be lost with simple non-atomic unsigned int,
 so I think percpu variable is good choice.

In practice I think that is very unlikely to happen.  There would have 
to be separate threads running on different CPUs, simultaneously 
submitting URBs for the same device and very closely synchronized.

Also, we don't know how this number gets used.  Quite possibly, losing 
half of the counts won't matter very much -- maybe the user cares only 
about the order of magnitude.

Alan Stern

--
To unsubscribe from this list: send the line unsubscribe linux-usb in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] USB: use percpu counter to count submitted URBs per device

2013-08-19 Thread Ming Lei
On Mon, Aug 19, 2013 at 11:14 PM, Greg Kroah-Hartman
gre...@linuxfoundation.org wrote:
 On Mon, Aug 19, 2013 at 11:06:31PM +0800, Ming Lei wrote:
 On Mon, Aug 19, 2013 at 10:00 PM, Greg Kroah-Hartman
 gre...@linuxfoundation.org wrote:
  On Mon, Aug 19, 2013 at 07:04:18PM +0800, Ming Lei wrote:
  Because usb_hcd_submit_urb is in the hotest path of usb core,
  so use percpu counter to count URB instead of using atomic variable
  because atomic operations are much slower than percpu operations.
 
  Cc: Oliver Neukum oli...@neukum.org
  Cc: Alan Stern st...@rowland.harvard.edu
  Signed-off-by: Ming Lei ming@canonical.com
  ---
   drivers/usb/core/hcd.c   |4 ++--
   drivers/usb/core/sysfs.c |7 ++-
   drivers/usb/core/usb.c   |9 -
   drivers/usb/core/usb.h   |1 +
   include/linux/usb.h  |2 +-
   5 files changed, 18 insertions(+), 5 deletions(-)
 
  And this really speeds things up?  Exactly what does it?
 
  And it's not that atomic operations are slower, it's just that the

 For SMP, atomic_inc/atomic_dec are much slower than percpu
 variable inc/dec, see 4.1(Why Isn’t Concurrent Count-ing Trivial?)
 of [1].

  However, it is slower: on a Intel Core Duo laptop, it is about six
  times slower than non-atomic increment when a single thread
  is incrementing, and more than ten times slower if two threads
  are incrementing.

 Considered that most of desktop  laptop are SMP now, and with
 USB3.0, the submitted URBs per second may reach tens of thousand
 or more, and we can remove the atomic inc/dec operations in the hot
 path, so why don't do it?

 Because you really didn't do it, there are lots of other atomic
 operations on that same path.

Not lots in the path of usbcore.


 And, thens of thousands of urbs should be trivial, did you measure this
 to see if it changed anything?  I'm not taking patches like this that
 are not quantifiable, sorry.

The number may be too trivial to measure, but I will try to test
with perf.


 The gating problem in USB right now is the hardware, it's the slowest
 thing, not the kernel, from everything I have ever tested, or seen.

The problem may not speed up usb performance, but might decrease
CPU utilization a bit, or cache miss.


 Well, bad host controller silicon is also a problem (i.e. raspberry pi),
 but there's not much we can do about braindead problems like that...

  barriers involved can be slower, depending on what else is happening.
  If you look, you are already hitting atomic variables in the same path,
  so how can this change speed anything up?

 No, no barriers are involved in atomic_inc/atomic_dec at all.

 None?  Hm, you might want to rethink that statement :)

Please see Documentation/memory-barriers.txt:

The following also do _not_ imply memory barriers, and so may require explicit
memory barriers under some circumstances (smp_mb__before_atomic_dec() for
instance):

atomic_add();
atomic_sub();
atomic_inc();
atomic_dec();


Thanks,
--
Ming Lei
--
To unsubscribe from this list: send the line unsubscribe linux-usb in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] USB: use percpu counter to count submitted URBs per device

2013-08-19 Thread Paul E. McKenney
On Mon, Aug 19, 2013 at 11:40:50PM +0800, Ming Lei wrote:
 On Mon, Aug 19, 2013 at 11:14 PM, Greg Kroah-Hartman
 gre...@linuxfoundation.org wrote:
  On Mon, Aug 19, 2013 at 11:06:31PM +0800, Ming Lei wrote:
  On Mon, Aug 19, 2013 at 10:00 PM, Greg Kroah-Hartman
  gre...@linuxfoundation.org wrote:
   On Mon, Aug 19, 2013 at 07:04:18PM +0800, Ming Lei wrote:
   Because usb_hcd_submit_urb is in the hotest path of usb core,
   so use percpu counter to count URB instead of using atomic variable
   because atomic operations are much slower than percpu operations.
  
   Cc: Oliver Neukum oli...@neukum.org
   Cc: Alan Stern st...@rowland.harvard.edu
   Signed-off-by: Ming Lei ming@canonical.com
   ---
drivers/usb/core/hcd.c   |4 ++--
drivers/usb/core/sysfs.c |7 ++-
drivers/usb/core/usb.c   |9 -
drivers/usb/core/usb.h   |1 +
include/linux/usb.h  |2 +-
5 files changed, 18 insertions(+), 5 deletions(-)
  
   And this really speeds things up?  Exactly what does it?
  
   And it's not that atomic operations are slower, it's just that the
 
  For SMP, atomic_inc/atomic_dec are much slower than percpu
  variable inc/dec, see 4.1(Why Isn’t Concurrent Count-ing Trivial?)
  of [1].
 
   However, it is slower: on a Intel Core Duo laptop, it is about six
   times slower than non-atomic increment when a single thread
   is incrementing, and more than ten times slower if two threads
   are incrementing.
 
  Considered that most of desktop  laptop are SMP now, and with
  USB3.0, the submitted URBs per second may reach tens of thousand
  or more, and we can remove the atomic inc/dec operations in the hot
  path, so why don't do it?
 
  Because you really didn't do it, there are lots of other atomic
  operations on that same path.
 
 Not lots in the path of usbcore.
 
 
  And, thens of thousands of urbs should be trivial, did you measure this
  to see if it changed anything?  I'm not taking patches like this that
  are not quantifiable, sorry.
 
 The number may be too trivial to measure, but I will try to test
 with perf.
 
 
  The gating problem in USB right now is the hardware, it's the slowest
  thing, not the kernel, from everything I have ever tested, or seen.
 
 The problem may not speed up usb performance, but might decrease
 CPU utilization a bit, or cache miss.
 
 
  Well, bad host controller silicon is also a problem (i.e. raspberry pi),
  but there's not much we can do about braindead problems like that...
 
   barriers involved can be slower, depending on what else is happening.
   If you look, you are already hitting atomic variables in the same path,
   so how can this change speed anything up?
 
  No, no barriers are involved in atomic_inc/atomic_dec at all.
 
  None?  Hm, you might want to rethink that statement :)
 
 Please see Documentation/memory-barriers.txt:
 
 The following also do _not_ imply memory barriers, and so may require explicit
 memory barriers under some circumstances (smp_mb__before_atomic_dec() for
 instance):
 
 atomic_add();
 atomic_sub();
 atomic_inc();
 atomic_dec();

You are both right, each in your own way.  Greg is correct that on x86
these operations do include memory barriers, even though Linux does not
require them to do so.  Ming is correct that Linux does not require it,
and that there are in fact non-x86 architectures in which these operations
do not include memory barriers.

Thanx, Paul

--
To unsubscribe from this list: send the line unsubscribe linux-usb in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] USB: use percpu counter to count submitted URBs per device

2013-08-19 Thread Greg Kroah-Hartman
On Mon, Aug 19, 2013 at 11:40:50PM +0800, Ming Lei wrote:
 On Mon, Aug 19, 2013 at 11:14 PM, Greg Kroah-Hartman
 gre...@linuxfoundation.org wrote:
  On Mon, Aug 19, 2013 at 11:06:31PM +0800, Ming Lei wrote:
  On Mon, Aug 19, 2013 at 10:00 PM, Greg Kroah-Hartman
  gre...@linuxfoundation.org wrote:
   On Mon, Aug 19, 2013 at 07:04:18PM +0800, Ming Lei wrote:
   Because usb_hcd_submit_urb is in the hotest path of usb core,
   so use percpu counter to count URB instead of using atomic variable
   because atomic operations are much slower than percpu operations.
  
   Cc: Oliver Neukum oli...@neukum.org
   Cc: Alan Stern st...@rowland.harvard.edu
   Signed-off-by: Ming Lei ming@canonical.com
   ---
drivers/usb/core/hcd.c   |4 ++--
drivers/usb/core/sysfs.c |7 ++-
drivers/usb/core/usb.c   |9 -
drivers/usb/core/usb.h   |1 +
include/linux/usb.h  |2 +-
5 files changed, 18 insertions(+), 5 deletions(-)
  
   And this really speeds things up?  Exactly what does it?
  
   And it's not that atomic operations are slower, it's just that the
 
  For SMP, atomic_inc/atomic_dec are much slower than percpu
  variable inc/dec, see 4.1(Why Isn’t Concurrent Count-ing Trivial?)
  of [1].
 
   However, it is slower: on a Intel Core Duo laptop, it is about six
   times slower than non-atomic increment when a single thread
   is incrementing, and more than ten times slower if two threads
   are incrementing.
 
  Considered that most of desktop  laptop are SMP now, and with
  USB3.0, the submitted URBs per second may reach tens of thousand
  or more, and we can remove the atomic inc/dec operations in the hot
  path, so why don't do it?
 
  Because you really didn't do it, there are lots of other atomic
  operations on that same path.
 
 Not lots in the path of usbcore.

Did you look close?  I see 2 more right there in the context of your
patch alone.  One you try to take care of later (but just do the same
thing, no real change), the other you don't address at all.

  And, thens of thousands of urbs should be trivial, did you measure this
  to see if it changed anything?  I'm not taking patches like this that
  are not quantifiable, sorry.
 
 The number may be too trivial to measure, but I will try to test
 with perf.

If it's too trivial to measure, then I can't accept the patch, nor
should you expect it to be accepted, right?

  The gating problem in USB right now is the hardware, it's the slowest
  thing, not the kernel, from everything I have ever tested, or seen.
 
 The problem may not speed up usb performance, but might decrease
 CPU utilization a bit, or cache miss.

Do you have proof of this?  Without that, why would you even want
someone to accept such a patch?

thanks,

greg k-h
--
To unsubscribe from this list: send the line unsubscribe linux-usb in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] USB: use percpu counter to count submitted URBs per device

2013-08-19 Thread Alan Stern
On Mon, 19 Aug 2013, Bjørn Mork wrote:

 Alan Stern st...@rowland.harvard.edu writes:
 
  No doubt somebody would complain if the counter was removed.  Who added 
  it in the first place, and for what reason?
 
 Who and why is pretty well documented:
 http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=4d59d8a11383ebf0e0260ee481a4e766959fd7d9
 
 Thanks to Sarah and git.

Thanks for locating that (I'm not using at my regular computer today so 
I don't have access to my git repository).

The commit message says clearly that urbnum was added specifically for 
use by powertop.  It seems reasonable to assume that powertop won't 
mind very much if the number is off by some small factor.

Alan Stern

--
To unsubscribe from this list: send the line unsubscribe linux-usb in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] USB: use percpu counter to count submitted URBs per device

2013-08-19 Thread Alan Stern
On Mon, 19 Aug 2013, Ming Lei wrote:

 Another disadvantage is that accessing the shared variable is still
 slower than accessing one percpu variable in theory.

By that argument, _everything_ in the kernel should be percpu.  There 
is a reason why atomic variables exist.

Alan Stern

--
To unsubscribe from this list: send the line unsubscribe linux-usb in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] USB: use percpu counter to count submitted URBs per device

2013-08-19 Thread Greg Kroah-Hartman
On Mon, Aug 19, 2013 at 10:14:57AM -0700, Greg Kroah-Hartman wrote:
 On Mon, Aug 19, 2013 at 11:40:50PM +0800, Ming Lei wrote:
  On Mon, Aug 19, 2013 at 11:14 PM, Greg Kroah-Hartman
  gre...@linuxfoundation.org wrote:
   On Mon, Aug 19, 2013 at 11:06:31PM +0800, Ming Lei wrote:
   On Mon, Aug 19, 2013 at 10:00 PM, Greg Kroah-Hartman
   gre...@linuxfoundation.org wrote:
On Mon, Aug 19, 2013 at 07:04:18PM +0800, Ming Lei wrote:
Because usb_hcd_submit_urb is in the hotest path of usb core,
so use percpu counter to count URB instead of using atomic variable
because atomic operations are much slower than percpu operations.
   
Cc: Oliver Neukum oli...@neukum.org
Cc: Alan Stern st...@rowland.harvard.edu
Signed-off-by: Ming Lei ming@canonical.com
---
 drivers/usb/core/hcd.c   |4 ++--
 drivers/usb/core/sysfs.c |7 ++-
 drivers/usb/core/usb.c   |9 -
 drivers/usb/core/usb.h   |1 +
 include/linux/usb.h  |2 +-
 5 files changed, 18 insertions(+), 5 deletions(-)
   
And this really speeds things up?  Exactly what does it?
   
And it's not that atomic operations are slower, it's just that the
  
   For SMP, atomic_inc/atomic_dec are much slower than percpu
   variable inc/dec, see 4.1(Why Isn’t Concurrent Count-ing Trivial?)
   of [1].
  
However, it is slower: on a Intel Core Duo laptop, it is about 
   six
times slower than non-atomic increment when a single thread
is incrementing, and more than ten times slower if two threads
are incrementing.
  
   Considered that most of desktop  laptop are SMP now, and with
   USB3.0, the submitted URBs per second may reach tens of thousand
   or more, and we can remove the atomic inc/dec operations in the hot
   path, so why don't do it?
  
   Because you really didn't do it, there are lots of other atomic
   operations on that same path.
  
  Not lots in the path of usbcore.
 
 Did you look close?  I see 2 more right there in the context of your
 patch alone.  One you try to take care of later (but just do the same
 thing, no real change), the other you don't address at all.

Ok, sorry, atomic_set() is, on some arches, faster than atomic_inc(),
so you have sped up things there, my apologies.

But given the other locks taken in this path, and other atomic
operations, removing 1 seems like premature optimization.

Please, use perf, and other things, to find out where real problems are
in the USB stack.  I'm sure there are locations that we can improve on,
but until you get some real numbers, it's going to be hard to accept
stuff like this.

greg k-h
--
To unsubscribe from this list: send the line unsubscribe linux-usb in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html