from:"Steinar H. Gunderson"

On Thu, Dec 22, 2016 at 06:56:28AM +0800, Jin, Yao wrote:
> Could you see the inline if you use the addr2line command? For example,
> addr2line -e <binary> -i <address>

I'm sorry, I don't have this profile anymore. I'll try again once we sort out
the problems of the DWARF error messages everywhere.

/* Steinar */
-- 
Homepage: https://www.sesse.net/

Re: Inlined functions in perf report

On Wed, Dec 21, 2016 at 11:09:42AM +0100, Milian Wolff wrote:
> Just to check - did you really compile your code with frame pointers? By 
> default, that is not the case, and the above will try to do frame pointer 
> unwinding which will then fail. Put differently - do you get any stack frames at
> all? Can you try `perf record --call-graph dwarf` instead? Of course, make 
> sure you compile your code with `-g -O2` or similar.

I don't specifically use -fno-omit-frame-pointer, no. But the normal stack
unwinding works just fine with mainline perf nevertheless; is this expected?

/* Steinar */
-- 
Homepage: https://www.sesse.net/

Re: Inlined functions in perf report

On Wed, Dec 21, 2016 at 08:53:33AM +0800, Jin, Yao wrote:
> I just pull my repo with the latest perf/core branch, and apply the patch
> one by one (git am 0001/0002/.../0005), they can be applied. Maybe you have
> to do like that because the mails are probably coming out of order.

OK. I applied everything on top of the branch you suggested, and now it's
compiling. But seemingly I don't have too much success; on a quick check
(perf record -p <pid> -g, perf report --inline) I don't get anything marked
as (inline), but I get these warnings:

BFD: Dwarf Error: found dwarf version '6931', this reader only handles version 
2, 3 and 4 information.
BFD: Dwarf Error: found dwarf version '18896', this reader only handles version 
2, 3 and 4 information.

and so on for many seemingly random version numbers.

It may have been that the stack traces I happened to check don't actually
have any inlined functions in them (the load was a bit different from what
I've looked at earlier), but the BFD errors are new from what I can see.

/* Steinar */
-- 
Homepage: https://www.sesse.net/
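
For context on the warning above: in 32-bit DWARF versions 2 to 4, each
compilation unit in .debug_info starts with a 4-byte unit_length followed by
a 2-byte version, and a reader rejects anything outside the versions it
knows. The following is only a sketch of that kind of check, not BFD's actual
code; the function name cu_version_check and the little-endian assumption are
made up for illustration. A reader that starts parsing at the wrong offset
would see arbitrary bytes in that field, which might be one explanation for
the random-looking numbers, though nothing in this thread confirms it.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Read the version field of a 32-bit DWARF compilation unit header.
 * Assumed layout (DWARF 2-4, .debug_info): 4-byte unit_length, then a
 * 2-byte version. Little-endian byte order is assumed in this sketch.
 *
 * Returns 0 if the version is 2, 3 or 4, and -1 otherwise. The version
 * that was read is stored in *version whenever the buffer is long enough.
 * cu_version_check is a hypothetical name, not a libbfd function.
 */
int cu_version_check(const unsigned char *cu, size_t len, uint16_t *version)
{
	if (len < 6)
		return -1;	/* too short to hold length + version */

	uint16_t v = (uint16_t)(cu[4] | (cu[5] << 8));

	*version = v;
	return (v >= 2 && v <= 4) ? 0 : -1;
}
```

Feeding it a header whose bytes 4 and 5 are 0x13 0x1b yields version 6931,
the first value from the warnings quoted above.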

Re: Inlined functions in perf report

On Tue, Dec 20, 2016 at 11:37:46AM -0300, Arnaldo Carvalho de Melo wrote:
>> Woot. Is this available in git somewhere? (Or if not, what do I apply it on
>> top of?)
> Normally you get it from tip, i.e. from:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git perf/core

I suppose perf/core here means a branch named perf/core in that git
repository, but it doesn't seem to contain the patches in question.

I tried applying them on top of that branch by wget-ing down the right
messages from marc.info, but somehow, I must have misapplied them
(it was rather painful, especially since they seemingly come out-of-order
in the archives), because the resulting tree didn't compile.

/* Steinar */
-- 
Homepage: https://www.sesse.net/

Re: Inlined functions in perf report

On Tue, Dec 20, 2016 at 10:54:50AM -0300, Arnaldo Carvalho de Melo wrote:
> Have you guys looked at this:
> 
> http://lkml.kernel.org/r/1481121822-2537-1-git-send-email-yao@linux.intel.com
> 
> I have to review it and maybe you will help me with that ;-)

Woot. Is this available in git somewhere? (Or if not, what do I apply it on
top of?)

/* Steinar */
-- 
Homepage: https://www.sesse.net/

Re: Inlined functions in perf report

On Tue, Dec 20, 2016 at 02:27:10PM +0100, Milian Wolff wrote:
> It is not even possible with that, perf report is lacking the steps required 
> to add inline frames - it will only add "real" frames it gets from either of 
> the unwind libraries.
> 
> I have a WIP patch available for this functionality though, it can be found 
> here (depends on libbfd, i.e. bfd_find_inliner_info):
> 
> https://github.com/milianw/linux/commit/71d031c9d679bfb4a4044226e8903dd80ea601b3

Thanks, I'll be sure to try it out. I assume this works only with -g dwarf?
I.e., for non-graph runs, I will still get the bottom function only, not the
inlined one.

/* Steinar */
-- 
Homepage: https://www.sesse.net/

Inlined functions in perf report

Hi Peter,

I can't find a good point of contact for perf, so I'm contacting you based on
the MAINTAINERS file; feel free to redirect somewhere if you're not the right
person.

I'm trying to figure out how to deal with perf report when there are inlined
functions; they don't generally seem to show up in the call stack, which
sometimes can make it very hard to figure out what is going on, especially in
a code base one doesn't know too well. As an example, I threw together a
minimal test program:

  #include <stdlib.h>

  inline int foo()
  {
      int k = rand();
      int sum = 1;
      for (int i = 0; i < 100; ++i)
      {
          sum ^= k;
          sum += k;
      }
      return sum;
  }

  int main(void)
  {
      return foo();
  }

Compiling with -O2 -g, and running perf record -g yields:

  # Samples: 6K of event 'cycles:ppp'
  # Event count (approx.): 5876825543
  #
  # Children      Self  Command  Shared Object  Symbol
  # ........  ........  .......  .............  ......................
  #
      99.98%    99.98%  inline   inline         [.] main
              |
              ---0x706258d4c544155
                 main

      99.98%     0.00%  inline   [unknown]      [.] 0x0706258d4c544155
              |
              ---0x706258d4c544155
                 main

Is there a way I can get it to show “foo” in the call graph? (I suppose also
ideally, “foo” and not “main” should show up in a non-graph run.) Of course,
this gets even more confusing if foo calls bar, since it now looks like the
call chain is main -> bar directly.

I have debug information that should be sufficient in the binary, because if
I break in gdb, I definitely get the call stack:

  Program received signal SIGINT, Interrupt.
  0x4589 in foo () at inline.c:5
  5   int k = rand();
  (gdb) bt
  #0  0x4589 in foo () at inline.c:5
  #1  main () at inline.c:17
  (gdb) 

FWIW, this is with perf from 4.10 (git as of a few days ago) and GCC 6.2.1.

/* Steinar */
-- 
Homepage: https://www.sesse.net/
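
As a point of comparison (an aid for experiments, not an answer to the
question above), a variant of foo() can be forced out of line so that it
keeps its own symbol and stack frame. A minimal sketch, assuming GCC or Clang
for the noinline attribute; foo_noinline is a made-up name, k is passed in
rather than taken from rand(), and unsigned arithmetic replaces int so that
wrap-around is well defined:

```c
/*
 * Same loop as foo() in the test program, but marked noinline so the
 * compiler must emit a real call to it. foo_noinline is a hypothetical
 * name used only for this illustration.
 */
__attribute__((noinline)) unsigned int foo_noinline(unsigned int k)
{
	unsigned int sum = 1;

	for (int i = 0; i < 100; ++i) {
		sum ^= k;
		sum += k;
	}
	return sum;
}
```

Called as foo_noinline(rand()) from main(), samples taken in the loop should
then be attributed to foo_noinline rather than to main, at the cost of no
longer measuring the inlined code.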

Re: [PATCH] Add support for usbfs zerocopy.

2016-02-24 Thread Steinar H. Gunderson

On Wed, Feb 24, 2016 at 02:30:08PM -0500, Sasha Levin wrote:
> I'm seeing the following warning while fuzzing:
> [ 1595.188189] WARNING: CPU: 3 PID: 26063 at mm/page_alloc.c:3207 
> __alloc_pages_nodemask+0x960/0x29e0()
> [ 1595.188287] Modules linked in:
> [ 1595.188316] CPU: 3 PID: 26063 Comm: syz-executor Not tainted 
> 4.5.0-rc5-next-20160223-sasha-00022-g03b30f1-dirty #2982

I think it was already established that one could cause kernel warnings if
trying to allocate large amounts of memory, but that the usbfs memory limits
could curb truly dangerous amounts. Someone please correct me if I'm
misunderstanding?

/* Steinar */
-- 
Software Engineer, Google Switzerland

Re: [PATCH] net: fix bridge multicast packet checksum validation

2016-02-18 Thread Steinar H. Gunderson

On Mon, Feb 15, 2016 at 03:07:06AM +0100, Linus Lüssing wrote:
> Steinar, can you check whether this fixes the bridge issues you reported on
> bugzilla #99081? Not quite sure whether it is the same as yours as you
> do not seem to have any such call traces.

It doesn't immediately sound like the same problem; why would promisc change
anything if the problem is the checksumming?

I don't have any reboots scheduled for this machine right now, but I'll see
what I can do wrt. testing.

/* Steinar */
-- 
Homepage: https://www.sesse.net/

Re: [PATCH v2] Add support for usbfs zerocopy.

2016-02-12 Thread Steinar H. Gunderson

On Wed, Feb 03, 2016 at 11:09:16PM +0100, Steinar H. Gunderson wrote:
> Trying again; sending v4 as a reply to your email.

Did the v4 sending work for you?

/* Steinar */
-- 
Software Engineer, Google Switzerland

Re: [PATCH v2] Add support for usbfs zerocopy.

2016-02-04 Thread Steinar H. Gunderson

On Thu, Feb 04, 2016 at 11:17:26AM +0100, Bjørn Mork wrote:
> Then use Mutt to reply, but include the patch inline instead of
> attaching it.  I believe this is discussed in the Mutt section of
> Documentation/email-clients.txt

Thanks; if that works (even though it changes the “From” line and such)
that's a good solution. It doesn't work for series of multiple patches, but I
can certainly live with that.

/* Steinar */
-- 
Software Engineer, Google Switzerland

Re: [PATCH v2] Add support for usbfs zerocopy.

On Thu, Feb 04, 2016 at 12:15:50AM +0200, Felipe Balbi wrote:
>> Since I've now been bitten by this several times: Is there any sort of best
>> practice for integrating git with MUAs? What I'm doing right now is
>> cut-and-paste from mutt to get the to/cc/in-reply-to headers right, and
>> that's suboptimal. It feels like it should be a FAQ, but I can't really find
>> anything obvious.
> have you tried git send-email ? If you wanna send as a reply to another
> message, use:
> 
>   $ git send-email --to f...@bar.com --cc b...@foo.com 
> --in-reply-to=abcd-message...@foo.bar.com

Unfortunately that doesn't change anything; I still need to cut-and-paste the
to, cc and in-reply-to manually. What I'd want is something like
“git send-email as a reply to the last email in my outbox”.

Attaching the patch gives me exactly this, by the way, but seemingly ends up
with a format that's more cumbersome to receive (which is a bad tradeoff).

/* Steinar */
-- 
Software Engineer, Google Switzerland

[PATCH v4] Add support for usbfs zerocopy.

Add a new interface for userspace to preallocate memory that can be
used with usbfs. This gives two primary benefits:

 - Zerocopy; data no longer needs to be copied between the userspace
   and the kernel, but can instead be read directly by the driver from
   userspace's buffers. This works for all kinds of transfers (even if
   nonsensical for control and interrupt transfers); isochronous also
   no longer need to memset() the buffer to zero to avoid leaking kernel data.

 - Once the buffers are allocated, USB transfers can no longer fail due to
   memory fragmentation; previously, long-running programs could run into
   problems finding a large enough contiguous memory chunk, especially on
   embedded systems or at high rates.

Memory is allocated by using mmap() against the usbfs file descriptor,
and similarly deallocated by munmap(). Once memory has been allocated,
using it as pointers to a bulk or isochronous operation means you will
automatically get zerocopy behavior. Note that this also means you cannot
modify outgoing data until the transfer is complete. The same holds for
data on the same cache lines as incoming data; DMA modifying them at the
same time could lead to your changes being overwritten.

There's a new capability USBDEVFS_CAP_MMAP that userspace can query to see
if the running kernel supports this functionality, if just trying mmap() is
not acceptable.
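
From userspace, the interface described above might be used roughly as
follows. This is a minimal sketch, not taken from the patch: the helper names
usbfs_zerocopy_alloc and usbfs_zerocopy_free are hypothetical, fd is assumed
to be an open usbfs device node, and the fallback value 0x20 for
USBDEVFS_CAP_MMAP is an assumption for headers that predate the capability.

```c
#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/usbdevice_fs.h>

/* Assumption: headers older than this patch lack the capability bit. */
#ifndef USBDEVFS_CAP_MMAP
#define USBDEVFS_CAP_MMAP 0x20
#endif

/*
 * Try to get a zerocopy buffer of len bytes from the usbfs descriptor fd.
 * Returns NULL if the kernel does not advertise USBDEVFS_CAP_MMAP or if
 * mmap() fails. Hypothetical helper, not part of any library.
 */
void *usbfs_zerocopy_alloc(int fd, size_t len)
{
	unsigned int caps = 0;

	if (ioctl(fd, USBDEVFS_GET_CAPABILITIES, &caps) < 0)
		return NULL;
	if (!(caps & USBDEVFS_CAP_MMAP))
		return NULL;

	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	return buf == MAP_FAILED ? NULL : buf;
}

/* Release a buffer obtained from usbfs_zerocopy_alloc(). */
void usbfs_zerocopy_free(void *buf, size_t len)
{
	if (buf)
		munmap(buf, len);
}
```

The returned pointer would then be passed as the buffer of a bulk or
isochronous URB; on NULL the caller can fall back to an ordinary malloc()ed
buffer and the copying path.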

Largely based on a patch by Markus Rechberger with some updates. The original
patch can be found at:

  http://sundtek.de/support/devio_mmap_v0.4.diff

Signed-off-by: Steinar H. Gunderson 
Signed-off-by: Markus Rechberger 
Acked-by: Alan Stern 
---
 drivers/usb/core/devio.c  | 227 +-
 include/uapi/linux/usbdevice_fs.h |   1 +
 2 files changed, 203 insertions(+), 25 deletions(-)

diff --git a/drivers/usb/core/devio.c b/drivers/usb/core/devio.c
index 59e7a33..a9fc6ff 100644
--- a/drivers/usb/core/devio.c
+++ b/drivers/usb/core/devio.c
@@ -50,6 +50,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -69,6 +70,7 @@ struct usb_dev_state {
 	spinlock_t lock;		/* protects the async urb lists */
 	struct list_head async_pending;
 	struct list_head async_completed;
+	struct list_head memory_list;
 	wait_queue_head_t wait;		/* wake up if a request completed */
 	unsigned int discsignr;
 	struct pid *disc_pid;
@@ -79,6 +81,17 @@ struct usb_dev_state {
 	u32 disabled_bulk_eps;
 };
 
+struct usb_memory {
+	struct list_head memlist;
+	int vma_use_count;
+	int urb_use_count;
+	u32 size;
+	void *mem;
+	dma_addr_t dma_handle;
+	unsigned long vm_start;
+	struct usb_dev_state *ps;
+};
+
 struct async {
 	struct list_head asynclist;
 	struct usb_dev_state *ps;
@@ -89,6 +102,7 @@ struct async {
 	void __user *userbuffer;
 	void __user *userurb;
 	struct urb *urb;
+	struct usb_memory *usbm;
 	unsigned int mem_usage;
 	int status;
 	u32 secid;
@@ -162,6 +176,111 @@ static int connected(struct usb_dev_state *ps)
 			ps->dev->state != USB_STATE_NOTATTACHED);
 }
 
+static void dec_usb_memory_use_count(struct usb_memory *usbm, int *count)
+{
+	struct usb_dev_state *ps = usbm->ps;
+	unsigned long flags;
+
+	spin_lock_irqsave(&ps->lock, flags);
+	--*count;
+	if (usbm->urb_use_count == 0 && usbm->vma_use_count == 0) {
+		list_del(&usbm->memlist);
+		spin_unlock_irqrestore(&ps->lock, flags);
+
+		usb_free_coherent(ps->dev, usbm->size, usbm->mem,
+				usbm->dma_handle);
+		usbfs_decrease_memory_usage(
+			usbm->size + sizeof(struct usb_memory));
+		kfree(usbm);
+	} else {
+		spin_unlock_irqrestore(&ps->lock, flags);
+	}
+}
+
+static void usbdev_vm_open(struct vm_area_struct *vma)
+{
+	struct usb_memory *usbm = vma->vm_private_data;
+	unsigned long flags;
+
+	spin_lock_irqsave(&usbm->ps->lock, flags);
+	++usbm->vma_use_count;
+	spin_unlock_irqrestore(&usbm->ps->lock, flags);
+}
+
+static void usbdev_vm_close(struct vm_area_struct *vma)
+{
+	struct usb_memory *usbm = vma->vm_private_data;
+
+	dec_usb_memory_use_count(usbm, &usbm->vma_use_count);
+}
+
+struct vm_operations_struct usbdev_vm_ops = {
+	.open = usbdev_vm_open,
+	.close = usbdev_vm_close
+};
+
+static int usbdev_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct usb_memory *usbm = NULL;
+	struct usb_dev_state *ps = file->private_data;
+	size_t size = vma->vm_end - vma->vm_start;
+	void *mem;
+	unsigned long flags;
+	dma_addr_t dma_handle;
+	int ret;
+
+	ret = usbfs_increase_memory_usage(size + sizeof(struct usb_memory));
+   if
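
The lifetime rule in dec_usb_memory_use_count() above (a buffer is released
only once both the mapping count and the URB count have dropped to zero) can
be modelled in plain userspace C. This is a simplified sketch under stated
assumptions: struct shared_buf and its helpers are hypothetical names,
malloc() stands in for usb_alloc_coherent(), and the locking the patch does
under ps->lock is left out, so the model is single-threaded only.

```c
#include <stdlib.h>

/* Userspace stand-in for struct usb_memory; names are hypothetical. */
struct shared_buf {
	int vma_use_count;	/* live mappings of the buffer */
	int urb_use_count;	/* transfers currently using it */
	void *mem;
};

/* Allocate a buffer that starts out with one mapping and no transfers. */
struct shared_buf *shared_buf_new(size_t size)
{
	struct shared_buf *b = calloc(1, sizeof(*b));

	if (!b)
		return NULL;
	b->mem = malloc(size);
	if (!b->mem) {
		free(b);
		return NULL;
	}
	b->vma_use_count = 1;
	return b;
}

/*
 * Drop one reference through count, which must point at one of the two
 * counters inside b. Frees the buffer when both counters are zero.
 * Returns 1 if this call freed the buffer, 0 if it is still in use.
 */
int shared_buf_put(struct shared_buf *b, int *count)
{
	--*count;
	if (b->urb_use_count != 0 || b->vma_use_count != 0)
		return 0;

	free(b->mem);
	free(b);
	return 1;
}
```

With one transfer in flight, dropping the mapping first leaves the buffer
alive; only the later drop of the transfer's reference frees it, which is the
ordering an munmap() during a pending URB would produce.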

Re: [PATCH v2] Add support for usbfs zerocopy.

On Wed, Feb 03, 2016 at 01:23:17PM -0800, Greg Kroah-Hartman wrote:
> Attachments don't work, you know better than that :(

Since I've now been bitten by this several times: Is there any sort of best
practice for integrating git with MUAs? What I'm doing right now is
cut-and-paste from mutt to get the to/cc/in-reply-to headers right, and
that's suboptimal. It feels like it should be a FAQ, but I can't really find
anything obvious.

> I need a _real_ email that I don't have to hand-edit to get it to apply.

Trying again; sending v4 as a reply to your email. This is on top of git
master, since the patch no longer applied there. If you want it against any
other tree, please let me know which.

/* Steinar */
-- 
Software Engineer, Google Switzerland

Re: [PATCH v2] Add support for usbfs zerocopy.

2016-02-02 Thread Steinar H. Gunderson

On Mon, Jan 25, 2016 at 09:03:57AM +0100, Steinar H. Gunderson wrote:
> I did git rebase --ignore-date HEAD^ just to reset the date. Sending it as an
> attachment just to be sure.

Hi Greg,

Did this work for you? Is there anything else I should do to this patch?

/* Steinar */
-- 
Software Engineer, Google Switzerland

Re: [PATCH v2] Add support for usbfs zerocopy.

2016-01-25 Thread Steinar H. Gunderson

On Sun, Jan 24, 2016 at 01:12:08PM -0800, Greg Kroah-Hartman wrote:
> Something is really wrong with your email client, it is saying this is
> sent on Nov 26, the same exact time as your previous patch, yet you sent
> this in January.  Which implies that this is an old patch and not an
> updated one :(

It comes directly from git format-patch. I guess git commit --amend doesn't
properly reset the date?

> Can you send me the latest version of this, with the date set properly
> on your machine for it?

I did git rebase --ignore-date HEAD^ just to reset the date. Sending it as an
attachment just to be sure.

/* Steinar */
-- 
Software Engineer, Google Switzerland
From e56d9235b343c5e70061e977639cc7dddeae8164 Mon Sep 17 00:00:00 2001
From: "Steinar H. Gunderson" 
Date: Mon, 25 Jan 2016 09:02:34 +0100
Subject: [PATCH v3] Add support for usbfs zerocopy.

Add a new interface for userspace to preallocate memory that can be
used with usbfs. This gives two primary benefits:

 - Zerocopy; data no longer needs to be copied between the userspace
   and the kernel, but can instead be read directly by the driver from
   userspace's buffers. This works for all kinds of transfers (even if
   nonsensical for control and interrupt transfers); isochronous also
   no longer need to memset() the buffer to zero to avoid leaking kernel data.

 - Once the buffers are allocated, USB transfers can no longer fail due to
   memory fragmentation; previously, long-running programs could run into
   problems finding a large enough contiguous memory chunk, especially on
   embedded systems or at high rates.

Memory is allocated by using mmap() against the usbfs file descriptor,
and similarly deallocated by munmap(). Once memory has been allocated,
using it as pointers to a bulk or isochronous operation means you will
automatically get zerocopy behavior. Note that this also means you cannot
modify outgoing data until the transfer is complete. The same holds for
data on the same cache lines as incoming data; DMA modifying them at the
same time could lead to your changes being overwritten.

There's a new capability USBDEVFS_CAP_MMAP that userspace can query to see
if the running kernel supports this functionality, if just trying mmap() is
not acceptable.

Largely based on a patch by Markus Rechberger with some updates. The original
patch can be found at:

  http://sundtek.de/support/devio_mmap_v0.4.diff

Signed-off-by: Steinar H. Gunderson 
Signed-off-by: Markus Rechberger 
Acked-by: Alan Stern 
---
 drivers/usb/core/devio.c  | 227 +-
 include/uapi/linux/usbdevice_fs.h |   1 +
 2 files changed, 203 insertions(+), 25 deletions(-)

diff --git a/drivers/usb/core/devio.c b/drivers/usb/core/devio.c
index 38ae877c..0238c78 100644
--- a/drivers/usb/core/devio.c
+++ b/drivers/usb/core/devio.c
@@ -50,6 +50,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -69,6 +70,7 @@ struct usb_dev_state {
 	spinlock_t lock;/* protects the async urb lists */
 	struct list_head async_pending;
 	struct list_head async_completed;
+	struct list_head memory_list;
 	wait_queue_head_t wait; /* wake up if a request completed */
 	unsigned int discsignr;
 	struct pid *disc_pid;
@@ -79,6 +81,17 @@ struct usb_dev_state {
 	u32 disabled_bulk_eps;
 };
 
+struct usb_memory {
+	struct list_head memlist;
+	int vma_use_count;
+	int urb_use_count;
+	u32 size;
+	void *mem;
+	dma_addr_t dma_handle;
+	unsigned long vm_start;
+	struct usb_dev_state *ps;
+};
+
 struct async {
 	struct list_head asynclist;
 	struct usb_dev_state *ps;
@@ -89,6 +102,7 @@ struct async {
 	void __user *userbuffer;
 	void __user *userurb;
 	struct urb *urb;
+	struct usb_memory *usbm;
 	unsigned int mem_usage;
 	int status;
 	u32 secid;
@@ -157,6 +171,111 @@ static int connected(struct usb_dev_state *ps)
 			ps->dev->state != USB_STATE_NOTATTACHED);
 }
 
+static void dec_usb_memory_use_count(struct usb_memory *usbm, int *count)
+{
+	struct usb_dev_state *ps = usbm->ps;
+	unsigned long flags;
+
+	spin_lock_irqsave(&ps->lock, flags);
+	--*count;
+	if (usbm->urb_use_count == 0 && usbm->vma_use_count == 0) {
+		list_del(&usbm->memlist);
+		spin_unlock_irqrestore(&ps->lock, flags);
+
+		usb_free_coherent(ps->dev, usbm->size, usbm->mem,
+				usbm->dma_handle);
+		usbfs_decrease_memory_usage(
+			usbm->size + sizeof(struct usb_memory));
+		kfree(usbm);
+	} else {
+		spin_unlock_irqrestore(&ps->lock, flags);
+	}
+}
+
+static void usbdev_vm_open(struct vm_area_struct *vma)
+{
+	struct usb_memory *usbm = vma->vm_private_data;
+	unsigned long flags;
+
+	spin_lock_irqsave(&usbm->ps->lock, flags);
+	++usbm->vma_use_count;
+	spin_unlock_irqrestore(&usbm->ps->lock, flags);
+}
+
+static void usbdev_vm_close(struct vm_area_struct *vma)
+{
+	struct usb_memory *usbm = vma->vm_private_data;
+
+	dec_usb_memory_use_count(usbm, &usbm->vma_use_count);
+}
+
+struct vm_o

Re: [PATCH] Add support for usbfs zerocopy.

On Wed, Jan 06, 2016 at 04:22:12PM +0100, Peter Stuge wrote:
>>> Our interface for zero copy reads/writes is O_DIRECT, and that requires
>>> not special memory allocation, just proper alignment.
>> But that assumes you are using I/O using read()/write(). There's no way you
>> can shoehorn USB isochronous reads into the read() interface, O_DIRECT or 
>> not.
> How about aio?

I don't really see how; a USB device does not look much like a file. (Where
would you stick the endpoint, for one? And how would you ever submit an URB
with multiple packets in it, which is essential?) It feels a bit like
trying to use UDP sockets with only read() and write().

In any case, the usbfs interface already exists and is stable. This is about
extending it; replacing it with something new from scratch to get zerocopy
would seem overkill.

/* Steinar */
-- 
Software Engineer, Google Switzerland
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] Add support for usbfs zerocopy.

On Tue, Jan 05, 2016 at 10:49:49PM -0800, Christoph Hellwig wrote:
> This is a completely broken usage of the mmap interface.  if you use
> mmap on a device file you must use the actual mmap for the data
> transfer.

Really? V4L does exactly the same thing, from what I can see. It's just a way
of allocating memory with specific properties, roughly similar to hugetlbfs.

> Our interface for zero copy reads/writes is O_DIRECT, and that requires
> not special memory allocation, just proper alignment.

But that assumes you are using I/O using read()/write(). There's no way you
can shoehorn USB isochronous reads into the read() interface, O_DIRECT or not.

/* Steinar */
-- 
Software Engineer, Google Switzerland

[PATCH v2] Add support for usbfs zerocopy.

Add a new interface for userspace to preallocate memory that can be
used with usbfs. This gives two primary benefits:

 - Zerocopy; data no longer needs to be copied between the userspace
   and the kernel, but can instead be read directly by the driver from
   userspace's buffers. This works for all kinds of transfers (even if
   nonsensical for control and interrupt transfers); isochronous also
   no longer need to memset() the buffer to zero to avoid leaking kernel data.

 - Once the buffers are allocated, USB transfers can no longer fail due to
   memory fragmentation; previously, long-running programs could run into
   problems finding a large enough contiguous memory chunk, especially on
   embedded systems or at high rates.

Memory is allocated by using mmap() against the usbfs file descriptor,
and similarly deallocated by munmap(). Once memory has been allocated,
using it as pointers to a bulk or isochronous operation means you will
automatically get zerocopy behavior. Note that this also means you cannot
modify outgoing data until the transfer is complete. The same holds for
data on the same cache lines as incoming data; DMA modifying them at the
same time could lead to your changes being overwritten.

There's a new capability USBDEVFS_CAP_MMAP that userspace can query to see
if the running kernel supports this functionality, if just trying mmap() is
not acceptable.

Largely based on a patch by Markus Rechberger with some updates. The original
patch can be found at:

  http://sundtek.de/support/devio_mmap_v0.4.diff

Signed-off-by: Steinar H. Gunderson 
Signed-off-by: Markus Rechberger 
Acked-by: Alan Stern 
---
 drivers/usb/core/devio.c  | 227 +-
 include/uapi/linux/usbdevice_fs.h |   1 +
 2 files changed, 203 insertions(+), 25 deletions(-)

diff --git a/drivers/usb/core/devio.c b/drivers/usb/core/devio.c
index 38ae877c..0238c78 100644
--- a/drivers/usb/core/devio.c
+++ b/drivers/usb/core/devio.c
@@ -50,6 +50,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -69,6 +70,7 @@ struct usb_dev_state {
 	spinlock_t lock;/* protects the async urb lists */
 	struct list_head async_pending;
 	struct list_head async_completed;
+	struct list_head memory_list;
 	wait_queue_head_t wait; /* wake up if a request completed */
 	unsigned int discsignr;
 	struct pid *disc_pid;
@@ -79,6 +81,17 @@ struct usb_dev_state {
 	u32 disabled_bulk_eps;
 };
 
+struct usb_memory {
+	struct list_head memlist;
+	int vma_use_count;
+	int urb_use_count;
+	u32 size;
+	void *mem;
+	dma_addr_t dma_handle;
+	unsigned long vm_start;
+	struct usb_dev_state *ps;
+};
+
 struct async {
 	struct list_head asynclist;
 	struct usb_dev_state *ps;
@@ -89,6 +102,7 @@ struct async {
 	void __user *userbuffer;
 	void __user *userurb;
 	struct urb *urb;
+	struct usb_memory *usbm;
 	unsigned int mem_usage;
 	int status;
 	u32 secid;
@@ -157,6 +171,111 @@ static int connected(struct usb_dev_state *ps)
 			ps->dev->state != USB_STATE_NOTATTACHED);
 }
 
+static void dec_usb_memory_use_count(struct usb_memory *usbm, int *count)
+{
+	struct usb_dev_state *ps = usbm->ps;
+	unsigned long flags;
+
+	spin_lock_irqsave(&ps->lock, flags);
+	--*count;
+	if (usbm->urb_use_count == 0 && usbm->vma_use_count == 0) {
+		list_del(&usbm->memlist);
+		spin_unlock_irqrestore(&ps->lock, flags);
+
+		usb_free_coherent(ps->dev, usbm->size, usbm->mem,
+				usbm->dma_handle);
+		usbfs_decrease_memory_usage(
+			usbm->size + sizeof(struct usb_memory));
+		kfree(usbm);
+	} else {
+		spin_unlock_irqrestore(&ps->lock, flags);
+	}
+}
+
+static void usbdev_vm_open(struct vm_area_struct *vma)
+{
+	struct usb_memory *usbm = vma->vm_private_data;
+	unsigned long flags;
+
+	spin_lock_irqsave(&usbm->ps->lock, flags);
+	++usbm->vma_use_count;
+	spin_unlock_irqrestore(&usbm->ps->lock, flags);
+}
+
+static void usbdev_vm_close(struct vm_area_struct *vma)
+{
+	struct usb_memory *usbm = vma->vm_private_data;
+
+	dec_usb_memory_use_count(usbm, &usbm->vma_use_count);
+}
+
+struct vm_operations_struct usbdev_vm_ops = {
+	.open = usbdev_vm_open,
+	.close = usbdev_vm_close
+};
+
+static int usbdev_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct usb_memory *usbm = NULL;
+	struct usb_dev_state *ps = file->private_data;
+	size_t size = vma->vm_end - vma->vm_start;
+	void *mem;
+	unsigned long flags;
+	dma_addr_t dma_handle;
+	int ret;
+
+	ret = usbfs_increase_memory_usage(size + sizeof(struct usb_memory));
+   if

Re: [PATCH] Add support for usbfs zerocopy.

On Tue, Jan 05, 2016 at 04:11:43PM -0800, Greg Kroah-Hartman wrote:
>> Add a new interface for userspace to preallocate memory that can be
>> used with usbfs. This gives two primary benefits:
> Please 'version' your patches, so that I have a chance to figure out
> what patch is what, and what changed between patches.
> 
> otherwise the odds of me picking the "wrong" one is _very_ high...

OK. I won't make any attempt at reconstructing the history, but I'll resend
the one I just sent you as v2, i.e. --reroll-count=2.

Somehow it feels like there should be a way to integrate this better into my
MUA, but hopefully this is soon all done anyway. :-)

/* Steinar */
-- 
Software Engineer, Google Switzerland

Re: Does vm_operations_struct require a .owner field?

On Tue, Jan 05, 2016 at 04:31:09PM -0500, Alan Stern wrote:
> Thank you.  So it looks like I was worried about nothing.
> 
> Steinar, you can remove the try_module_get/module_put lines from your
> patch.  Also, the list_del() and comment in usbdev_release() aren't 
> needed -- at that point we know the memory_list has to be empty since 
> there can't be any outstanding URBs or VMA references.  If you take 
> those things out then the patch should be ready for merging.

Good, thanks. Did so, compiled, testing it still works, sending :-)

/* Steinar */
-- 
Software Engineer, Google Switzerland

[PATCH] Add support for usbfs zerocopy.

Add a new interface for userspace to preallocate memory that can be
used with usbfs. This gives two primary benefits:

 - Zerocopy; data no longer needs to be copied between the userspace
   and the kernel, but can instead be read directly by the driver from
   userspace's buffers. This works for all kinds of transfers (even if
   nonsensical for control and interrupt transfers); isochronous also
   no longer need to memset() the buffer to zero to avoid leaking kernel data.

 - Once the buffers are allocated, USB transfers can no longer fail due to
   memory fragmentation; previously, long-running programs could run into
   problems finding a large enough contiguous memory chunk, especially on
   embedded systems or at high rates.

Memory is allocated by using mmap() against the usbfs file descriptor,
and similarly deallocated by munmap(). Once memory has been allocated,
using it as pointers to a bulk or isochronous operation means you will
automatically get zerocopy behavior. Note that this also means you cannot
modify outgoing data until the transfer is complete. The same holds for
data on the same cache lines as incoming data; DMA modifying them at the
same time could lead to your changes being overwritten.

There's a new capability USBDEVFS_CAP_MMAP that userspace can query to see
if the running kernel supports this functionality, if just trying mmap() is
not acceptable.

Largely based on a patch by Markus Rechberger with some updates. The original
patch can be found at:

  http://sundtek.de/support/devio_mmap_v0.4.diff

Signed-off-by: Steinar H. Gunderson 
Signed-off-by: Markus Rechberger 
Acked-by: Alan Stern 
---
 drivers/usb/core/devio.c  | 227 +-
 include/uapi/linux/usbdevice_fs.h |   1 +
 2 files changed, 203 insertions(+), 25 deletions(-)

diff --git a/drivers/usb/core/devio.c b/drivers/usb/core/devio.c
index 38ae877c..0238c78 100644
--- a/drivers/usb/core/devio.c
+++ b/drivers/usb/core/devio.c
@@ -50,6 +50,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -69,6 +70,7 @@ struct usb_dev_state {
 	spinlock_t lock;/* protects the async urb lists */
 	struct list_head async_pending;
 	struct list_head async_completed;
+	struct list_head memory_list;
 	wait_queue_head_t wait; /* wake up if a request completed */
 	unsigned int discsignr;
 	struct pid *disc_pid;
@@ -79,6 +81,17 @@ struct usb_dev_state {
 	u32 disabled_bulk_eps;
 };
 
+struct usb_memory {
+	struct list_head memlist;
+	int vma_use_count;
+	int urb_use_count;
+	u32 size;
+	void *mem;
+	dma_addr_t dma_handle;
+	unsigned long vm_start;
+	struct usb_dev_state *ps;
+};
+
 struct async {
 	struct list_head asynclist;
 	struct usb_dev_state *ps;
@@ -89,6 +102,7 @@ struct async {
 	void __user *userbuffer;
 	void __user *userurb;
 	struct urb *urb;
+	struct usb_memory *usbm;
 	unsigned int mem_usage;
 	int status;
 	u32 secid;
@@ -157,6 +171,111 @@ static int connected(struct usb_dev_state *ps)
 			ps->dev->state != USB_STATE_NOTATTACHED);
 }
 
+static void dec_usb_memory_use_count(struct usb_memory *usbm, int *count)
+{
+	struct usb_dev_state *ps = usbm->ps;
+	unsigned long flags;
+
+	spin_lock_irqsave(&ps->lock, flags);
+	--*count;
+	if (usbm->urb_use_count == 0 && usbm->vma_use_count == 0) {
+		list_del(&usbm->memlist);
+		spin_unlock_irqrestore(&ps->lock, flags);
+
+		usb_free_coherent(ps->dev, usbm->size, usbm->mem,
+				usbm->dma_handle);
+		usbfs_decrease_memory_usage(
+			usbm->size + sizeof(struct usb_memory));
+		kfree(usbm);
+	} else {
+		spin_unlock_irqrestore(&ps->lock, flags);
+	}
+}
+
+static void usbdev_vm_open(struct vm_area_struct *vma)
+{
+	struct usb_memory *usbm = vma->vm_private_data;
+	unsigned long flags;
+
+	spin_lock_irqsave(&usbm->ps->lock, flags);
+	++usbm->vma_use_count;
+	spin_unlock_irqrestore(&usbm->ps->lock, flags);
+}
+
+static void usbdev_vm_close(struct vm_area_struct *vma)
+{
+	struct usb_memory *usbm = vma->vm_private_data;
+
+	dec_usb_memory_use_count(usbm, &usbm->vma_use_count);
+}
+
+struct vm_operations_struct usbdev_vm_ops = {
+	.open = usbdev_vm_open,
+	.close = usbdev_vm_close
+};
+
+static int usbdev_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct usb_memory *usbm = NULL;
+	struct usb_dev_state *ps = file->private_data;
+	size_t size = vma->vm_end - vma->vm_start;
+	void *mem;
+	unsigned long flags;
+	dma_addr_t dma_handle;
+	int ret;
+
+	ret = usbfs_increase_memory_usage(size + sizeof(struct usb_memory));
+   if

[PATCH v2] Add support for usbfs zerocopy.

Add a new interface for userspace to preallocate memory that can be
used with usbfs. This gives two primary benefits:

 - Zerocopy; data no longer needs to be copied between the userspace
   and the kernel, but can instead be read directly by the driver from
   userspace's buffers. This works for all kinds of transfers (even if
   nonsensical for control and interrupt transfers); isochronous also
   no longer need to memset() the buffer to zero to avoid leaking kernel data.

 - Once the buffers are allocated, USB transfers can no longer fail due to
   memory fragmentation; previously, long-running programs could run into
   problems finding a large enough contiguous memory chunk, especially on
   embedded systems or at high rates.

Memory is allocated by using mmap() against the usbfs file descriptor,
and similarly deallocated by munmap(). Once memory has been allocated,
using it as pointers to a bulk or isochronous operation means you will
automatically get zerocopy behavior. Note that this also means you cannot
modify outgoing data until the transfer is complete. The same holds for
data on the same cache lines as incoming data; DMA modifying them at the
same time could lead to your changes being overwritten.

There's a new capability USBDEVFS_CAP_MMAP that userspace can query to see
if the running kernel supports this functionality, if just trying mmap() is
not acceptable.

Largely based on a patch by Markus Rechberger with some updates. The original
patch can be found at:

  http://sundtek.de/support/devio_mmap_v0.4.diff

Signed-off-by: Steinar H. Gunderson <se...@google.com>
Signed-off-by: Markus Rechberger <mrechber...@gmail.com>
Acked-by: Alan Stern <st...@rowland.harvard.edu>
---
 drivers/usb/core/devio.c  | 227 +-
 include/uapi/linux/usbdevice_fs.h |   1 +
 2 files changed, 203 insertions(+), 25 deletions(-)

diff --git a/drivers/usb/core/devio.c b/drivers/usb/core/devio.c
index 38ae877c..0238c78 100644
--- a/drivers/usb/core/devio.c
+++ b/drivers/usb/core/devio.c
@@ -50,6 +50,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -69,6 +70,7 @@ struct usb_dev_state {
spinlock_t lock;/* protects the async urb lists */
struct list_head async_pending;
struct list_head async_completed;
+   struct list_head memory_list;
wait_queue_head_t wait; /* wake up if a request completed */
unsigned int discsignr;
struct pid *disc_pid;
@@ -79,6 +81,17 @@ struct usb_dev_state {
u32 disabled_bulk_eps;
 };
 
+struct usb_memory {
+   struct list_head memlist;
+   int vma_use_count;
+   int urb_use_count;
+   u32 size;
+   void *mem;
+   dma_addr_t dma_handle;
+   unsigned long vm_start;
+   struct usb_dev_state *ps;
+};
+
 struct async {
struct list_head asynclist;
struct usb_dev_state *ps;
@@ -89,6 +102,7 @@ struct async {
void __user *userbuffer;
void __user *userurb;
struct urb *urb;
+   struct usb_memory *usbm;
unsigned int mem_usage;
int status;
u32 secid;
@@ -157,6 +171,111 @@ static int connected(struct usb_dev_state *ps)
ps->dev->state != USB_STATE_NOTATTACHED);
 }
 
+static void dec_usb_memory_use_count(struct usb_memory *usbm, int *count)
+{
+   struct usb_dev_state *ps = usbm->ps;
+   unsigned long flags;
+
+   spin_lock_irqsave(>lock, flags);
+   --*count;
+   if (usbm->urb_use_count == 0 && usbm->vma_use_count == 0) {
+   list_del(>memlist);
+   spin_unlock_irqrestore(>lock, flags);
+
+   usb_free_coherent(ps->dev, usbm->size, usbm->mem,
+   usbm->dma_handle);
+   usbfs_decrease_memory_usage(
+   usbm->size + sizeof(struct usb_memory));
+   kfree(usbm);
+   } else {
+   spin_unlock_irqrestore(>lock, flags);
+   }
+}
+
+static void usbdev_vm_open(struct vm_area_struct *vma)
+{
+   struct usb_memory *usbm = vma->vm_private_data;
+   unsigned long flags;
+
+   spin_lock_irqsave(>ps->lock, flags);
+   ++usbm->vma_use_count;
+   spin_unlock_irqrestore(>ps->lock, flags);
+}
+
+static void usbdev_vm_close(struct vm_area_struct *vma)
+{
+   struct usb_memory *usbm = vma->vm_private_data;
+
+   dec_usb_memory_use_count(usbm, >vma_use_count);
+}
+
+struct vm_operations_struct usbdev_vm_ops = {
+   .open = usbdev_vm_open,
+   .close = usbdev_vm_close
+};
+
+static int usbdev_mmap(struct file *file, struct vm_area_struct *vma)
+{
+   struct usb_memory *usbm = NULL;
+   struct usb_dev_state *ps = file->private_data;
+   size_t size = vma->vm_end - vma->vm_start;
+   void *mem;
+   unsigned long flags;
+   dma_addr_t dma_handle;
+   in

Re: [PATCH] Add support for usbfs zerocopy.

On Tue, Jan 05, 2016 at 04:11:43PM -0800, Greg Kroah-Hartman wrote:
>> Add a new interface for userspace to preallocate memory that can be
>> used with usbfs. This gives two primary benefits:
> Please 'version' your patches, so that I have a chance to figure out
> what patch is what, and what changed between patches.
> 
> otherwise the odds of me picking the "wrong" one is _very_ high...

OK. I won't make any attempt at reconstructing the history, but I'll resend
the one I just sent you as v2, ie. --reroll-count=2.

Somehow it feels like there should be a way to integrate this better into my
MUA, but hopefully this is soon all done anyway. :-)

/* Steinar */
-- 
Software Engineer, Google Switzerland
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] Add support for usbfs zerocopy.

Add a new interface for userspace to preallocate memory that can be
used with usbfs. This gives two primary benefits:

 - Zerocopy; data no longer needs to be copied between the userspace
   and the kernel, but can instead be read directly by the driver from
   userspace's buffers. This works for all kinds of transfers (even if
   nonsensical for control and interrupt transfers); isochronous also
   no longer need to memset() the buffer to zero to avoid leaking kernel data.

 - Once the buffers are allocated, USB transfers can no longer fail due to
   memory fragmentation; previously, long-running programs could run into
   problems finding a large enough contiguous memory chunk, especially on
   embedded systems or at high rates.

Memory is allocated by using mmap() against the usbfs file descriptor,
and similarly deallocated by munmap(). Once memory has been allocated,
using it as pointers to a bulk or isochronous operation means you will
automatically get zerocopy behavior. Note that this also means you cannot
modify outgoing data until the transfer is complete. The same holds for
data on the same cache lines as incoming data; DMA modifying them at the
same time could lead to your changes being overwritten.

There's a new capability USBDEVFS_CAP_MMAP that userspace can query to see
if the running kernel supports this functionality, if just trying mmap() is
not acceptable.

Largely based on a patch by Markus Rechberger with some updates. The original
patch can be found at:

  http://sundtek.de/support/devio_mmap_v0.4.diff

Signed-off-by: Steinar H. Gunderson <se...@google.com>
Signed-off-by: Markus Rechberger <mrechber...@gmail.com>
Acked-by: Alan Stern <st...@rowland.harvard.edu>
---
 drivers/usb/core/devio.c  | 227 +-
 include/uapi/linux/usbdevice_fs.h |   1 +
 2 files changed, 203 insertions(+), 25 deletions(-)

diff --git a/drivers/usb/core/devio.c b/drivers/usb/core/devio.c
index 38ae877c..0238c78 100644
--- a/drivers/usb/core/devio.c
+++ b/drivers/usb/core/devio.c
@@ -50,6 +50,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -69,6 +70,7 @@ struct usb_dev_state {
spinlock_t lock;/* protects the async urb lists */
struct list_head async_pending;
struct list_head async_completed;
+   struct list_head memory_list;
wait_queue_head_t wait; /* wake up if a request completed */
unsigned int discsignr;
struct pid *disc_pid;
@@ -79,6 +81,17 @@ struct usb_dev_state {
u32 disabled_bulk_eps;
 };
 
+struct usb_memory {
+   struct list_head memlist;
+   int vma_use_count;
+   int urb_use_count;
+   u32 size;
+   void *mem;
+   dma_addr_t dma_handle;
+   unsigned long vm_start;
+   struct usb_dev_state *ps;
+};
+
 struct async {
struct list_head asynclist;
struct usb_dev_state *ps;
@@ -89,6 +102,7 @@ struct async {
void __user *userbuffer;
void __user *userurb;
struct urb *urb;
+   struct usb_memory *usbm;
unsigned int mem_usage;
int status;
u32 secid;
@@ -157,6 +171,111 @@ static int connected(struct usb_dev_state *ps)
ps->dev->state != USB_STATE_NOTATTACHED);
 }
 
+static void dec_usb_memory_use_count(struct usb_memory *usbm, int *count)
+{
+   struct usb_dev_state *ps = usbm->ps;
+   unsigned long flags;
+
+   spin_lock_irqsave(>lock, flags);
+   --*count;
+   if (usbm->urb_use_count == 0 && usbm->vma_use_count == 0) {
+   list_del(>memlist);
+   spin_unlock_irqrestore(>lock, flags);
+
+   usb_free_coherent(ps->dev, usbm->size, usbm->mem,
+   usbm->dma_handle);
+   usbfs_decrease_memory_usage(
+   usbm->size + sizeof(struct usb_memory));
+   kfree(usbm);
+   } else {
+   spin_unlock_irqrestore(>lock, flags);
+   }
+}
+
+static void usbdev_vm_open(struct vm_area_struct *vma)
+{
+   struct usb_memory *usbm = vma->vm_private_data;
+   unsigned long flags;
+
+   spin_lock_irqsave(&usbm->ps->lock, flags);
+   ++usbm->vma_use_count;
+   spin_unlock_irqrestore(&usbm->ps->lock, flags);
+}
+
+static void usbdev_vm_close(struct vm_area_struct *vma)
+{
+   struct usb_memory *usbm = vma->vm_private_data;
+
+   dec_usb_memory_use_count(usbm, &usbm->vma_use_count);
+}
+
+struct vm_operations_struct usbdev_vm_ops = {
+   .open = usbdev_vm_open,
+   .close = usbdev_vm_close
+};
+
+static int usbdev_mmap(struct file *file, struct vm_area_struct *vma)
+{
+   struct usb_memory *usbm = NULL;
+   struct usb_dev_state *ps = file->private_data;
+   size_t size = vma->vm_end - vma->vm_start;
+   void *mem;
+   unsigned long flags;
+   dma_addr_t dma_handle;
+   in
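
The lifetime rule that dec_usb_memory_use_count() implements in the patch above can be modelled outside the kernel: a buffer is released only once both the mmap count and the URB count have dropped to zero, and the release itself runs after the spinlock has been dropped. A minimal Python sketch, not kernel code; the class and method names are hypothetical, and a threading.Lock stands in for ps->lock:

```python
import threading


class UsbMemoryModel:
    """Toy model of the two-counter lifetime rule (not kernel code)."""

    def __init__(self, size, on_free):
        self.lock = threading.Lock()  # stands in for ps->lock
        self.vma_use_count = 0        # live mmap() mappings of the buffer
        self.urb_use_count = 0        # URBs currently using the buffer
        self.size = size
        self.freed = False
        self._on_free = on_free       # stands in for usb_free_coherent() + kfree()

    def get(self, counter):
        """Take a reference on 'vma_use_count' or 'urb_use_count'."""
        with self.lock:
            setattr(self, counter, getattr(self, counter) + 1)

    def put(self, counter):
        """Drop a reference; release the buffer once both counters are zero."""
        with self.lock:
            setattr(self, counter, getattr(self, counter) - 1)
            last = self.vma_use_count == 0 and self.urb_use_count == 0
            if last:
                self.freed = True
        if last:
            # As in the patch, the release itself runs after the lock is dropped.
            self._on_free(self.size)
```

Unmapping the buffer while a URB still uses it therefore leaves the memory alive until that URB completes.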

Re: Does vm_operations_struct require a .owner field?

On Tue, Jan 05, 2016 at 04:31:09PM -0500, Alan Stern wrote:
> Thank you.  So it looks like I was worried about nothing.
> 
> Steinar, you can remove the try_module_get/module_put lines from your
> patch.  Also, the list_del() and comment in usbdev_release() aren't 
> needed -- at that point we know the memory_list has to be empty since 
> there can't be any outstanding URBs or VMA references.  If you take 
> those things out then the patch should be ready for merging.

Good, thanks. Did so, compiled, testing it still works, sending :-)

/* Steinar */
-- 
Software Engineer, Google Switzerland
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: getaddrinfo slowdown in 3.17.1, due to getifaddrs

2014-11-02 Thread Steinar H. Gunderson

On Fri, Oct 24, 2014 at 01:37:51AM +0200, Steinar H. Gunderson wrote:
>> I'm currently testing the patch below and will submit with proper
>> tested by attributions later today.
> We applied this patch in a reboot today (on top of 3.17.1), and so far,
> things seem to be going much better.

I see 3.17.2 is out. Am I right in that this patch didn't make it?

/* Steinar */
-- 
Homepage: http://www.sesse.net/

Re: bisected: futex regression >= 3.14 - was - Slowdown due to threads bouncing between HT cores

On Sun, Oct 26, 2014 at 02:58:36PM +0100, Mike Galbraith wrote:
> Can you try the below?

I can bake it into the kernel the next time I boot, but that is unlikely to
be anytime soon, I'm afraid. I'm maybe a bit surprised nobody else can
reproduce my issue; I assume it won't help if I give out shell access
somehow.

/* Steinar */
-- 
Homepage: http://www.sesse.net/

Re: bisected: futex regression >= 3.14 - was - Slowdown due to threads bouncing between HT cores

On Fri, Oct 24, 2014 at 06:38:41PM +0200, Mike Galbraith wrote:
>>> Whew, good, futex.c is hard.  Heads up chess guys <punt>.
>> I wonder whether the barrier fix which got into 3.17 late fixes that
>> issue as well.
> Yes, it did.

This is only about the lockup, right, not that the threads bounce around a
lot and make things slower?

/* Steinar */
-- 
Homepage: http://www.sesse.net/

Re: getaddrinfo slowdown in 3.17.1, due to getifaddrs

2014-10-23 Thread Steinar H. Gunderson

On Tue, Oct 21, 2014 at 02:31:44PM +0100, Thomas Graf wrote:
> I'm currently testing the patch below and will submit with proper
> tested by attributions later today.

We applied this patch in a reboot today (on top of 3.17.1), and so far,
things seem to be going much better.

/* Steinar */
-- 
Homepage: http://www.sesse.net/

Re: getaddrinfo slowdown in 3.17.1, due to getifaddrs

2014-10-21 Thread Steinar H. Gunderson

On Fri, Oct 17, 2014 at 07:25:17AM +0100, Thomas Graf wrote:
> I think the only option at this point is to re-add the nltable lock to
> netlink_lookup() so we can drop the synchronize_net() until we find a
> way to RCUify socket destruction. I will cook up a patch today unless
> somebody can come up with a smarter way to work around needing the
> synchronize_net().

Did you end up with a patch? We'd like to reboot the server in question
soon-ish, but it's an open question whether we should go to 3.16.x, revert
exactly the patches in question or something else.

/* Steinar */
-- 
Homepage: http://www.sesse.net/

Re: getaddrinfo slowdown in 3.17.1, due to getifaddrs

2014-10-18 Thread Steinar H. Gunderson

On Fri, Oct 17, 2014 at 02:34:30AM +0200, Steinar H. Gunderson wrote:
> e341694e3eb57fcda9f1adc7bfea42fe080d8d7a looks like it might cause something
> like this (it certainly added the synchronize_net() call). Cc-ing people on 
> that commit; quoting the entire rest of the message for reference.

I see there's discussion on what to do with this; thanks. :-)

FWIW, I've verified that reverting these four patches (in that order) fixes
the problem:

  https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/patch/net/netlink/af_netlink.c?id=9ce12eb16ffb143f3a509da86283ddd0b10bcdb3
  https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/patch/net/netlink/af_netlink.c?id=6c8f7e70837468da4e658080d4448930fb597e1b
  https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/patch/net/netlink/af_netlink.c?id=67a24ac18b0262178ba9f05501b2c6e6731d449a
  https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/patch/net/netlink/af_netlink.c?id=e341694e3eb57fcda9f1adc7bfea42fe080d8d7a

My Perl script is seemingly not the only affected program on the system;
witness the average page load times for our PHP-based home page:

  http://home.samfundet.no/~sesse/web_load_time_samfundet_no-week.png

/* Steinar */
-- 
Homepage: http://www.sesse.net/

getaddrinfo slowdown in 3.17.1, due to getifaddrs

Hi,

We recently upgraded a machine from 3.14.5 to 3.17.1, and a Perl script we're
running to poll SNMP suddenly needed ten times as much time to complete.
ps shows that it stays in state D (i.e., uninterruptible sleep, usually I/O), and strace with -ttT
shows this curious pattern:

02:11:33.106973 socket(PF_NETLINK, SOCK_RAW, 0) = 42 <0.000013>
02:11:33.107013 bind(42, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 0 <0.000010>
02:11:33.107051 getsockname(42, {sa_family=AF_NETLINK, pid=1128, groups=00000000}, [12]) = 0 <0.000008>
02:11:33.107094 sendto(42, "\24\0\0\0\26\0\1\3\265^@T\0\0\0\0\0\0\0\0", 20, 0, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 20 <0.000015>
02:11:33.107146 recvmsg(42, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"L\0\0\0\24\0\2\0\265^@Th\4\0\0\2\10\200\376\1\0\0\0\10\0\1\0\177\0\0\1"..., 4096}], msg_controllen=0, msg_flags=0}, 0) = 332 <0.000016>
02:11:33.107208 recvmsg(42, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"H\0\0\0\24\0\2\0\265^@Th\4\0\0\n\200\200\376\1\0\0\0\24\0\1\0\0\0\0\0"..., 4096}], msg_controllen=0, msg_flags=0}, 0) = 936 <0.000010>
02:11:33.107262 recvmsg(42, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"\24\0\0\0\3\0\2\0\265^@Th\4\0\0\0\0\0\0\1\0\0\0\24\0\1\0\0\0\0\0"..., 4096}], msg_controllen=0, msg_flags=0}, 0) = 20 <0.000009>
02:11:33.107313 close(42) = 0 <0.057092>
02:11:33.164529 open("/etc/hosts", O_RDONLY|O_CLOEXEC) = 42 <0.000080>
<more stuff...>

Debugging with gdb indicates that this is from getaddrinfo calls, which the
program (for, well, Perl reasons) uses as part of DNS reverse lookups.
getaddrinfo wants to look at the list of interfaces on the system
(__check_pf in glibc), which calls out to netlink via getifaddrs.
Note specifically the call to close(), which takes 57 ms to complete.

This doesn't happen on every single getaddrinfo call, but more like 50% of
them. I've tried on another machine, running 3.16.3, and we don't see
anything like it.
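
To measure how often a call is slow, each getaddrinfo call can be timed on its own. A minimal sketch in Python, assuming that Python's socket.getaddrinfo ends up in the same glibc code path as the Perl module; the function names and the 10 ms threshold are illustrative and not part of the original report:

```python
import socket
import time


def time_getaddrinfo(n=1000, host="127.0.0.1"):
    """Return the wall-clock duration, in seconds, of each of n numeric-host lookups."""
    durations = []
    for _ in range(n):
        start = time.perf_counter()
        # AI_NUMERICHOST keeps the resolver from doing any DNS lookup.
        socket.getaddrinfo(host, None, flags=socket.AI_NUMERICHOST)
        durations.append(time.perf_counter() - start)
    return durations


def slow_fraction(durations, threshold=0.010):
    """Fraction of calls that took longer than threshold seconds."""
    if not durations:
        return 0.0
    return sum(d > threshold for d in durations) / len(durations)
```

Calling slow_fraction(time_getaddrinfo()) on both kernels would give a number to compare against the "more like 50%" estimate.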

I've distilled it down to this Perl script:

  #! /usr/bin/perl
  use strict;
  use warnings;
  use Socket::GetAddrInfo;
  
  for my $i (1..1000) {
    my ($err, @res) = Socket::GetAddrInfo::getaddrinfo("127.0.0.1", undef,
      { flags => Socket::GetAddrInfo::AI_NUMERICHOST });
  }

On my 3.16.3 machine, this completes in 26 ms. On the 3.17.1 machine:
65 _seconds_! According to the stack, this is what it's doing:

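[<ffffffff810766b7>] wait_rcu_gp+0x48/0x4f
[<ffffffff81078be5>] synchronize_sched+0x29/0x2b
[<ffffffff813aacdb>] synchronize_net+0x19/0x1b
[<ffffffff813d313e>] netlink_release+0x25b/0x2b7
[<ffffffff8139af07>] sock_release+0x1a/0x79
[<ffffffff8139b1f4>] sock_close+0xd/0x11
[<ffffffff8111feca>] __fput+0xdf/0x184
[<ffffffff8111ff9f>] ____fput+0x9/0xb
[<ffffffff81051610>] task_work_run+0x7c/0x94
[<ffffffff810026b2>] do_notify_resume+0x55/0x66
[<ffffffff8146feda>] int_signal+0x12/0x17
[<ffffffffffffffff>] 0xffffffffffffffff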

strace indicates it starts off nicely, then goes completely off:

cirkus:~> time strace -e close -ttT perl test.pl
02:20:39.292060 close(3) = 0 <0.000041>
02:20:39.292407 close(3) = 0 <0.000037>
02:20:39.292660 close(3) = 0 <0.000010>
02:20:39.292883 close(3) = 0 <0.000009>
02:20:39.293100 close(3) = 0 <0.000009>
[some more fast ones...]
02:20:39.311421 close(4) = 0 <0.000009>
02:20:39.311927 close(3) = 0 <0.000011>
02:20:39.312188 close(3) = 0 <0.072224>
02:20:39.384979 close(3) = 0 <0.059658>
02:20:39.445378 close(3) = 0 <0.048205>
02:20:39.494213 close(3) = 0 <0.060195>
^C
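
Slow calls in strace -ttT output like the trace above can be picked out mechanically. A minimal Python sketch; the regular expression only handles entries that fit on one line, and the function name and 10 ms threshold are illustrative:

```python
import re

# Matches single-line entries produced by `strace -ttT`, e.g.
#   02:20:39.312188 close(3) = 0 <0.072224>
LINE_RE = re.compile(
    r"^(?P<ts>\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<name>\w+)\(.*\)\s*=\s*(?P<ret>\S+).*<(?P<dur>\d+\.\d+)>\s*$"
)


def slow_calls(lines, threshold=0.010):
    """Yield (timestamp, syscall, seconds) for calls slower than threshold seconds."""
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue  # shell prompt, wrapped entries, "[some more fast ones...]"
        seconds = float(m.group("dur"))
        if seconds > threshold:
            yield m.group("ts"), m.group("name"), seconds
```

The four slow close() calls in the trace took 72.2, 59.7, 48.2 and 60.2 ms, a mean of about 60 ms. If nearly every one of the 1000 iterations hits one such close(), the run takes roughly 60 seconds, which is in line with the 65 seconds measured.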

Is there a way to fix this? Somehow I doubt we're the only ones calling
getaddrinfo in this way. :-)

/* Steinar */
-- 
Homepage: http://www.sesse.net/

Re: getaddrinfo slowdown in 3.17.1, due to getifaddrs

On Fri, Oct 17, 2014 at 02:21:32AM +0200, Steinar H. Gunderson wrote:
> Hi,
> 
> We recently upgraded a machine from 3.14.5 to 3.17.1, and a Perl script we're
> running to poll SNMP suddenly needed ten times as much time to complete.

e341694e3eb57fcda9f1adc7bfea42fe080d8d7a looks like it might cause something
like this (it certainly added the synchronize_net() call). Cc-ing people on 
that commit; quoting the entire rest of the message for reference.

> ps shows that it keeps being in state D (ie., I/O), and strace with -ttT
> shows this curious pattern:
> 
> 02:11:33.106973 socket(PF_NETLINK, SOCK_RAW, 0) = 42 <0.000013>
> 02:11:33.107013 bind(42, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 0 <0.000010>
> 02:11:33.107051 getsockname(42, {sa_family=AF_NETLINK, pid=1128, groups=00000000}, [12]) = 0 <0.000008>
> 02:11:33.107094 sendto(42, "\24\0\0\0\26\0\1\3\265^@T\0\0\0\0\0\0\0\0", 20, 0, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 20 <0.000015>
> 02:11:33.107146 recvmsg(42, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"L\0\0\0\24\0\2\0\265^@Th\4\0\0\2\10\200\376\1\0\0\0\10\0\1\0\177\0\0\1"..., 4096}], msg_controllen=0, msg_flags=0}, 0) = 332 <0.000016>
> 02:11:33.107208 recvmsg(42, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"H\0\0\0\24\0\2\0\265^@Th\4\0\0\n\200\200\376\1\0\0\0\24\0\1\0\0\0\0\0"..., 4096}], msg_controllen=0, msg_flags=0}, 0) = 936 <0.000010>
> 02:11:33.107262 recvmsg(42, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"\24\0\0\0\3\0\2\0\265^@Th\4\0\0\0\0\0\0\1\0\0\0\24\0\1\0\0\0\0\0"..., 4096}], msg_controllen=0, msg_flags=0}, 0) = 20 <0.000009>
> 02:11:33.107313 close(42) = 0 <0.057092>
> 02:11:33.164529 open("/etc/hosts", O_RDONLY|O_CLOEXEC) = 42 <0.000080>
> <more stuff...>
> 
> Debugging with gdb indicates that this is from getaddrinfo calls, which the
> program (for, well, Perl reasons) uses as part of DNS reverse lookups.
> getaddrinfo wants to look at the list of interfaces on the system
> (__check_pf in glibc), which calls out to netlink via getifaddrs.
> Note specifically the call to close(), which takes 57 ms to complete.
> 
> This doesn't happen on every single getaddrinfo call, but more like 50% of
> them. I've tried on another machine, running 3.16.3, and we don't see
> anything like it.
> 
> I've distilled it down to this Perl script:
> 
>   #! /usr/bin/perl
>   use strict;
>   use warnings;
>   use Socket::GetAddrInfo;
>   
>   for my $i (1..1000) {
>     my ($err, @res) = Socket::GetAddrInfo::getaddrinfo("127.0.0.1", undef,
>       { flags => Socket::GetAddrInfo::AI_NUMERICHOST });
>   }
> 
> On my 3.16.3 machine, this completes in 26 ms. On the 3.17.1 machine:
> 65 _seconds_! According to the stack, this is what it's doing:
> 
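> [<ffffffff810766b7>] wait_rcu_gp+0x48/0x4f
> [<ffffffff81078be5>] synchronize_sched+0x29/0x2b
> [<ffffffff813aacdb>] synchronize_net+0x19/0x1b
> [<ffffffff813d313e>] netlink_release+0x25b/0x2b7
> [<ffffffff8139af07>] sock_release+0x1a/0x79
> [<ffffffff8139b1f4>] sock_close+0xd/0x11
> [<ffffffff8111feca>] __fput+0xdf/0x184
> [<ffffffff8111ff9f>] ____fput+0x9/0xb
> [<ffffffff81051610>] task_work_run+0x7c/0x94
> [<ffffffff810026b2>] do_notify_resume+0x55/0x66
> [<ffffffff8146feda>] int_signal+0x12/0x17
> [<ffffffffffffffff>] 0xffffffffffffffff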
> 
> strace indicates it starts off nicely, then goes completely off:
> 
> cirkus:~> time strace -e close -ttT perl test.pl
> 02:20:39.292060 close(3) = 0 <0.000041>
> 02:20:39.292407 close(3) = 0 <0.000037>
> 02:20:39.292660 close(3) = 0 <0.000010>
> 02:20:39.292883 close(3) = 0 <0.000009>
> 02:20:39.293100 close(3) = 0 <0.000009>
> [some more fast ones...]
> 02:20:39.311421 close(4) = 0 <0.000009>
> 02:20:39.311927 close(3) = 0 <0.000011>
> 02:20:39.312188 close(3) = 0 <0.072224>
> 02:20:39.384979 close(3) = 0 <0.059658>
> 02:20:39.445378 close(3) = 0 <0.048205>
> 02:20:39.494213 close(3) = 0 <0.060195>
> ^C
> 
> Is there a way to fix this? Somehow I doubt we're the only ones calling
> getaddrinfo in this way. :-)
> 
> /* Steinar */
-- 
Homepage: http://www.sesse.net/

Re: bisected: futex regression >= 3.14 - was - Slowdown due to threads bouncing between HT cores

On Wed, Oct 08, 2014 at 01:04:01PM -0400, Linus Torvalds wrote:
> So like Thomas, I would suspect a race condition in the futex use, and
> then the exact futex implementation details are just exposing it
> incidentally.

FWIW, Stockfish does not use futex directly; it uses pthreads (or Win32
threads if you are on that OS :-) ). That doesn't preclude a race condition
somewhere, of course.

/* Steinar */
-- 
Homepage: http://www.sesse.net/

Re: bisected: futex regression >= 3.14 - was - Slowdown due to threads bouncing between HT cores

On Wed, Oct 08, 2014 at 06:14:18PM +0200, Thomas Gleixner wrote:
> It looks far more like an issue with the stocking fish code, but hell
> with futexes one can never be sure.

OK, maybe we should move to a more recent Stockfish version first of all;
the specific benchmark was about that specific binary, but for tracking down
futex issues we can see if more recent code fixes it (the SMP in this thing
keeps getting developed).

I'm moving to 2ac206e847a04a7de07690dd575c6949e5625115 (current head) of
https://github.com/mcostalba/Stockfish.git, and building with
“make -j ARCH=x86-64-bmi2”.

I still don't see any hangs, but I do see the same behavior of moving around
between CPUs as the older version exhibited. In a test run (using the given
test script, just with 28 replaced by 20), I get 18273 kN/sec with default,
and 21875 kN/sec when using taskset.
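
The pinning that taskset does from the shell can also be applied from inside a process. A minimal Python sketch for Linux, assuming os.sched_setaffinity is available; how taskset was invoked for the runs above is not shown here, and the helper names are illustrative:

```python
import os


def pin_to_cpus(cpus, pid=0):
    """Restrict pid (0 means the calling process) to the given CPU numbers."""
    os.sched_setaffinity(pid, set(cpus))
    # Read the mask back so the caller can see what the kernel accepted.
    return os.sched_getaffinity(pid)


def relative_gain(baseline, pinned):
    """Relative throughput gain of the pinned run over the baseline run."""
    return (pinned - baseline) / baseline
```

For the two figures above, relative_gain(18273, 21875) is about 0.197, so the pinned run searched roughly 20% more nodes per second.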

/* Steinar */
-- 
Homepage: http://www.sesse.net/

Re: bisected: futex regression >= 3.14 - was - Slowdown due to threads bouncing between HT cores

On Wed, Oct 08, 2014 at 05:37:44PM +0200, Mike Galbraith wrote:
> Seems you opened a can of futex worms...

Awesome.

> I don't see that on the 2 x E5-2697 box I borrowed to take a peek.  Once
> I got stockfish to actually run to completion by hunting down and brute
> force reverting the below, I see ~32 million nodes/sec throughput with
> 3.17 whether I use taskset or just let it do its thing.

Interesting. If you open up top, do you see (like me) the load being spread
out across multiple CPUs, or do they actually stay in place like they're
supposed to?

I suppose you run with Threads set to 28, right? (And not, say, 56.)

/* Steinar */
-- 
Homepage: http://www.sesse.net/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: bisected: futex regression >= 3.14 - was - Slowdown due to threads bouncing between HT cores

On Wed, Oct 08, 2014 at 06:14:18PM +0200, Thomas Gleixner wrote:
> It looks far more like an issue with the stocking fish code, but hell
> with futexes one can never be sure.

OK, maybe we should move to a more recent Stockfish version first of all;
the specific benchmark was about that specific binary, but for tracking down
futex issues we can see if more recent code fixes it (the SMP in this thing
keeps getting developed).

I'm moving to 2ac206e847a04a7de07690dd575c6949e5625115 (current head) of
https://github.com/mcostalba/Stockfish.git, and building with
“make -j ARCH=x86-64-bmi2”.

I still don't see any hangs, but I do see the same behavior of moving around
between CPUs as the older version exhibited. In a test run (using the given
test script, just with 28 replaced by 20), I get 18273 kN/sec with default,
and 21875 kN/sec when using taskset.

/* Steinar */
-- 
Homepage: http://www.sesse.net/

Re: bisected: futex regression >= 3.14 - was - Slowdown due to threads bouncing between HT cores

On Wed, Oct 08, 2014 at 01:04:01PM -0400, Linus Torvalds wrote:
> So like Thomas, I would suspect a race condition in the futex use, and
> then the exact futex implementation details are just exposing it
> incidentally.

FWIW, Stockfish does not use futex directly; it uses pthreads (or Win32
threads if you are on that OS :-) ). That doesn't preclude a race condition
somewhere, of course.

/* Steinar */
-- 
Homepage: http://www.sesse.net/

Re: Slowdown due to threads bouncing between HT cores

2014-10-05 Thread Steinar H. Gunderson

On Sat, Oct 04, 2014 at 09:50:04AM -0500, Chuck Ebbert wrote:
> Try playing with /proc/sys/kernel/sched_migration_cost_ns. This sets
> the number of nanoseconds the kernel will wait before considering
> moving a thread to another CPU. I have mine set to 5000.

I can't get any good effect out of this. I tried both 50 ms (your value) and
500 ms, and while it seems (by eyeballing the per-cpu display in top) to
_sometimes_ lock the processes to cores, more often than not, they still
bounce around between all 40. Worse still, when it _does_, it seems to often
lock them to hyperthread pairs (e.g., I've seen it put threads only on
virtual cores 0-9 and 20-29, which means it has all of them on one socket!)

Notwithstanding top, the benchmarks don't improve; setting cores manually
with taskset still is much better.
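
Since the sysctl takes nanoseconds while the values tried above are given in
milliseconds, a small conversion helps avoid off-by-a-few-zeros mistakes. This
is only a sketch; the helper names are made up for illustration, and nothing
is written to /proc here:

```python
# Hypothetical helpers: convert a delay in milliseconds into the nanosecond
# value expected by /proc/sys/kernel/sched_migration_cost_ns.
NS_PER_MS = 1_000_000


def ms_to_ns(ms: int) -> int:
    """Convert milliseconds to nanoseconds."""
    return ms * NS_PER_MS


def sysctl_line(ms: int) -> str:
    """Format the matching sysctl assignment (it is only printed, not applied)."""
    return f"kernel.sched_migration_cost_ns = {ms_to_ns(ms)}"


if __name__ == "__main__":
    # The two delays mentioned in this message.
    for delay_ms in (50, 500):
        print(sysctl_line(delay_ms))
```

So 50 ms and 500 ms correspond to 50,000,000 ns and 500,000,000 ns.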

/* Steinar */
-- 
Homepage: http://www.sesse.net/

Re: Slowdown due to threads bouncing between HT cores

2014-10-04 Thread Steinar H. Gunderson

On Sat, Oct 04, 2014 at 06:41:15AM -0700, Andi Kleen wrote:
> - something else gets scheduled on these logical CPUs, so
> the scheduler tries to balance to run queue lengths
> 
> You could check that with perf timechart or perf sched record/map
> or kernelshark.

I've never read any of these maps before, but perf sched map really doesn't
indicate to me that there's a lot of other stuff going on. It seems to mainly
show a lot of Stockfish processes bouncing around seemingly randomly with not
much understanding of hyperthread pairs. Of course, there's the odd other
job, including ksoftirqd or an RCU process.

I can send you a copy of the map if you want to, but it is of course rather
large.

> - there is some IO or communication which causes wakeup affinity.

There's a fair amount of communication between the threads; I don't know the
architecture very deeply (multithreading in chess is rather nontrivial),
but as far as I know, the worker threads access shared data through shm,
sometimes using pthread mutexes to lock some of it.

This also means, by the way, that occasionally they will sleep. They're not
by default going to hog the CPU 100% of the time, more like 90%.

> You could try disabling WAKEUP_PREEMPTION or NEXT_BUDDY in 
> /sys/kernel/debug/sched_features

NO_NEXT_BUDDY was already set. (Changing it to NEXT_BUDDY didn't seem to help
anything.) I tried setting NO_WAKEUP_PREEMPTION, and it didn't make a
difference that I could see; they still bounce around a lot.
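
For reference, the sched_features file lists every feature by name, with a
NO_ prefix on the ones that are switched off, and writing a name (with or
without the prefix) back to the file toggles it. A minimal sketch of reading
that format, assuming a whitespace-separated list; the sample string below is
illustrative and not taken from the machine in question:

```python
def parse_sched_features(text: str) -> dict:
    """Map each feature name to True (enabled) or False (listed as NO_<name>)."""
    features = {}
    for token in text.split():
        if token.startswith("NO_"):
            features[token[3:]] = False
        else:
            features[token] = True
    return features


def token_to_disable(name: str) -> str:
    """Return the string one would write to sched_features to turn a feature off."""
    return "NO_" + name


if __name__ == "__main__":
    # Illustrative content only; the real file has many more entries.
    sample = "GENTLE_FAIR_SLEEPERS NO_NEXT_BUDDY LAST_BUDDY WAKEUP_PREEMPTION"
    state = parse_sched_features(sample)
    print(state["NEXT_BUDDY"], state["WAKEUP_PREEMPTION"])
    print(token_to_disable("WAKEUP_PREEMPTION"))
```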

/* Steinar */
-- 
Homepage: http://www.sesse.net/

Re: Slowdown due to threads bouncing between HT cores

On Fri, Oct 03, 2014 at 11:11:52PM +0200, Marc Burkhardt wrote:
> As I understand your mail, you problem is quite similar, isn't it?

I guess it depends on how often your process migrates. If it happens, like,
every second, it's not a big problem (and probably is expected).
If it happens all the time, it might be; it depends a bit on a number of
factors.

/* Steinar */
-- 
Homepage: http://www.sesse.net/

Slowdown due to threads bouncing between HT cores

Hi,

I did a chess benchmark of my new machine (2x E5-2650v3, so 20x2.3GHz
Haswell-EP), and it performed a bit worse than comparable Windows setups.
It looks like the scheduler somehow doesn't perform as well with
hyperthreading; HT is on in the BIOS, but I'm only using 20 threads
(chess scales sublinearly, so using all 40 usually isn't a good idea),
so really, the threads should just get one core each and that's it.
It looks like they are bouncing between cores, reducing overall performance
by ~20% for some reason. (The machine is otherwise generally idle.)
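
One way to see which logical CPUs share a physical core is to read the
kernel's sysfs topology files. The following is a minimal sketch, assuming the
usual /sys/devices/system/cpu/cpuN/topology/thread_siblings_list layout; the
function names are made up for illustration:

```python
import glob
import re


def parse_cpu_list(text: str) -> list:
    """Expand a sysfs CPU list such as "0,20" or "0-3,8" into integers."""
    cpus = []
    for part in text.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        elif part:
            cpus.append(int(part))
    return cpus


def one_cpu_per_core(siblings_by_cpu: dict) -> list:
    """Pick the lowest-numbered logical CPU of every physical core."""
    return sorted({min(parse_cpu_list(s)) for s in siblings_by_cpu.values()})


def read_siblings() -> dict:
    """Read the sibling list of every logical CPU that exposes topology info."""
    result = {}
    pattern = "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"
    for path in glob.glob(pattern):
        cpu = int(re.search(r"cpu(\d+)/topology", path).group(1))
        with open(path) as f:
            result[cpu] = f.read()
    return result


if __name__ == "__main__":
    print(one_cpu_per_core(read_siblings()))
```

The resulting list could be handed to taskset -c, or to os.sched_setaffinity()
from within Python, to keep one thread per physical core.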

First some details to reproduce more easily. Kernel version is 3.16.3, 64-bit
x86, Debian stable (so gcc 4.7.2). The benchmark binary is a chess engine
known as Stockfish; this is the compile I used (because that's what everyone
else is benchmarking with):

http://abrok.eu/stockfish/builds/dbd6156fceaf9bec8e9ff14f99c325c36b284079/linux64modernsse/stockfish_13111907_x64_modern_sse42

Stockfish is GPL, so the source is readily available if you should need it.

The benchmark is run by just running the binary, then giving it these
commands one by one:

uci
setoption name Threads value 20
setoption name Hash value 1024
position fen rnbq1rk1/pppnbppp/4p3/3pP1B1/3P3P/2N5/PPP2PP1/R2QKBNR w KQ - 0 7
go wtime 720 winc 3 btime 720 binc 3

After ~3 minutes, it will output “bestmove d1g4 ponder f8e8”. A few lines
above that, you'll see a line with something similar to “nps 13266463”.
That's nodes per second, and you want it to be higher.

So, benchmark:

- Default: 13266 kN/sec
- Change from ondemand to performance on all cores: 14600 kN/sec
- taskset -c 0-19 (locking affinity to only one set of hyperthreads):
17512 kN/sec

There is some local variation, but it's typically within a few percent.
Does anyone know what's going on? I have CONFIG_SCHED_SMT=y and
CONFIG_SCHED_MC=y.
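
To put the three figures in relation to each other, the differences can be
worked out directly. A small sketch using only the numbers listed above (the
function names are made up for illustration):

```python
# Throughput in kN/sec, as listed in the benchmark results.
DEFAULT = 13266
PERFORMANCE_GOVERNOR = 14600
TASKSET = 17512


def relative_gain(new: float, base: float) -> float:
    """How much faster `new` is than `base`, as a fraction of `base`."""
    return new / base - 1


def relative_loss(slow: float, fast: float) -> float:
    """How much throughput `slow` gives up, as a fraction of `fast`."""
    return 1 - slow / fast


if __name__ == "__main__":
    print(f"governor vs default: {relative_gain(PERFORMANCE_GOVERNOR, DEFAULT):+.1%}")
    print(f"taskset vs default:  {relative_gain(TASKSET, DEFAULT):+.1%}")
    print(f"default vs taskset:  {-relative_loss(DEFAULT, TASKSET):+.1%}")
```

Measured against the taskset run, the default configuration loses roughly 24%;
measured against the default, taskset gains roughly 32%.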

/* Steinar */
--
Homepage: http://www.sesse.net/

Re: NULL pointer dereference when loading the gre module (3.10.0-rc4)

On Fri, Jun 07, 2013 at 02:44:19PM -0400, Steven Rostedt wrote:
> Do know if you have CONFIG_NET_NS set in your .config?

Sorry, I forgot to answer this: No, it is not set.

/* Steinar */
-- 
Homepage: http://www.sesse.net/

Re: NULL pointer dereference when loading the gre module (3.10.0-rc4)

On Fri, Jun 07, 2013 at 02:26:08PM -0400, Steven Rostedt wrote:
> On Fri, 2013-06-07 at 19:52 +0200, Steinar H. Gunderson wrote:
> Ah, that's because of this: module_init(ipgre_init);  Where it makes it
> into:
> 
> <init_module>:
>    0:   55                      push   %ebp
>    1:   89 e5                   mov    %esp,%ebp
>    3:   53                      push   %ebx
>    4:   83 ec 08                sub    $0x8,%esp
>    7:   c7 04 24 00 00 00 00    movl   $0x0,(%esp)
>                         a: R_386_32     .rodata.str1.4
> 
> We can use ipgre_tap_init_net, and the offset of 0xb032 (45106) as that
> was 0xa0e5d034 - 0xa0e52002. Do you have CONFIG_NET_NS
> set?

ipgre_tap_init_net is 001a, but there's no way I can subtract
0xb053 from that? Sorry, I'm confused. :-)
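
The offsets quoted in this thread can be checked with plain integer
arithmetic. A minimal sketch, assuming the hex digits shown in the messages
are sufficient for the subtraction (the variable names are made up for
illustration):

```python
def distance(higher: int, lower: int) -> int:
    """Byte distance between two addresses."""
    return higher - lower


# ipgre_init+0x21 versus the faulting address from the second oops.
first = distance(0xa0e81055, 0xa0e76002)

# ipgre_tap_init_net+0x1a versus the faulting address from the first oops.
second = distance(0xa0e5d034, 0xa0e52002)

# Offset of ipgre_tap_init_net within the object file, as read off above.
object_offset = 0x1a

if __name__ == "__main__":
    print(hex(first), first)
    print(hex(second), second)
    # A negative result means the target lies before the start of the section.
    print(object_offset - first)
```

Both differences match the figures in the quoted text (0xb053 and 0xb032).
Subtracting either from 0x1a gives a negative number, so the faulting address
presumably lies outside the section that holds ipgre_tap_init_net.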

> You can also cat /proc/modules. It gives you where the modules are
> located.

I've booted back to 3.9.x already; I couldn't live with a crashing kernel like
that. Unfortunately it's not that easy for me to reboot this machine all the
time either. :-/

/* Steinar */
-- 
Homepage: http://www.sesse.net/

Re: NULL pointer dereference when loading the gre module (3.10.0-rc4)

On Fri, Jun 07, 2013 at 12:12:23PM -0400, Steven Rostedt wrote:
>> Ffffa0e76000 u ip_tunnel_init_net   [ip_gre]
> What do you get if you do an objdump -Dr ip_gre.ko
> 
> And then look for ipgre_init, and then subtract 0xb053 (45139) from its
> address. As that is: a0e81055 - a0e76002, then see if
> that object file has anything in that location.

pannekake:~> objdump -Dr /lib/modules/3.10.0-rc4/kernel/net/ipv4/ip_gre.ko | grep ipgre_init
 <ipgre_init_net>:
   0:   8b 35 00 00 00 00       mov    0x0(%rip),%esi        # 6 <ipgre_init_net+0x6>
  13:   e8 00 00 00 00          callq  18 <ipgre_init_net+0x18>

I.e., the symbol doesn't show up in the disassembly (for whatever reason).

/* Steinar */
-- 
Homepage: http://www.sesse.net/

Re: NULL pointer dereference when loading the gre module (3.10.0-rc4)

On Fri, Jun 07, 2013 at 11:15:00AM -0400, Steven Rostedt wrote:
> net: Remove __net_init/exit from exported functions
> 
> If CONFIG_NET_NS is not set then __net_init is the same as __init and
> __net_exit is the same as __exit. These functions will be removed from
> memory after the module loads or is removed. Functions that are exported
> for use by other functions should never be labeled for removal.

That didn't help much, I'm afraid:

[   18.005451] BUG: unable to handle kernel NULL pointer dereference at 
0003
[   18.013853] IP: [a0e76002] 0xa0e76001
[   18.019380] PGD 0 
[   18.021695] Oops:  [#1] SMP 
[   18.025285] Modules linked in: ip_gre(+) gre ip_tunnel psmouse ide_generic 
ide_gd_mod ide_cd_mod cdrom acpi_cpufreq mperf coretemp kvm_intel kvm iTCO_wdt 
iTCO_vendor_support lpc_ich microcode mfd_core i2c_i801 pcspkr i2c_core 
ehci_pci evbug evdev ext4 crc16 jbd2 mbcache dm_mod raid456 async_raid6_recov 
async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 md_mod sg sd_mod 
usbhid ide_pci_generic ide_core crc32c_intel e1000e ata_piix ptp pps_core 
uhci_hcd ehci_hcd mpt2sas raid_class unix
[   18.073543] CPU: 0 PID: 3263 Comm: modprobe Not tainted 3.10.0-rc4 #2
[   18.080237] Hardware name: Supermicro X8DTL/X8DTL, BIOS 2.1a   12/30/2011
[   18.087634] task: 88061ecfad60 ti: 8806212f task.ti: 
8806212f
[   18.095571] RIP: 0010:[a0e76002]  [a0e76002] 
0xa0e76001
[   18.103745] RSP: 0018:8806212f1ca8  EFLAGS: 00010246
[   18.109301] RAX: a0e81000 RBX: 880623ebe280 RCX: 
[   18.116682] RDX: a0e7ea40 RSI: 0003 RDI: a0e81018
[   18.124063] RBP: 8806212f1ca8 R08: 0cf8 R09: 812bae96
[   18.131441] R10: ea0018852c00 R11:  R12: 880621678290
[   18.138829] R13: a0e7e9c0 R14: 8806212f1ef8 R15: 0002
[   18.146210] FS:  7f2e37fd1700() GS:88062720() 
knlGS:
[   18.154747] CS:  0010 DS:  ES:  CR0: 8005003b
[   18.160742] CR2: 0003 CR3: 000622a5e000 CR4: 07f0
[   18.168131] DR0:  DR1:  DR2: 
[   18.175510] DR3:  DR6: 0ff0 DR7: 0400
[   18.182890] Stack:
[   18.185143]  8806212f1cf8 812baf26  

[   18.193235]   a0e7e9c0  

[   18.201313]  8806212f1ef8 a0e7eb60 8806212f1d28 
812bafb6
[   18.209389] Call Trace:
[   18.212084]  [812baf26] ops_init.constprop.7+0xc6/0xf5
[   18.218339]  [812bafb6] register_pernet_operations.isra.4+0x61/0x91
[   18.225720]  [8138486f] ? mutex_lock+0xf/0x20
[   18.231189]  [812bb006] register_pernet_device+0x20/0x51
[   18.237621]  [a0e81034] ? ipgre_tap_init_net+0x1a/0x1a [ip_gre]
[   18.244661]  [a0e81055] ipgre_init+0x21/0xc9 [ip_gre]
[   18.250831]  [a0e81034] ? ipgre_tap_init_net+0x1a/0x1a [ip_gre]
[   18.257866]  [81000263] do_one_initcall+0x7b/0x10c
[   18.263780]  [8107e5db] load_module+0x1b1f/0x1e19
[   18.269594]  [8107a4f8] ? sys_getegid16+0x44/0x44
[   18.275416]  [81386cf2] ? page_fault+0x22/0x30
[   18.280972]  [8107e969] SyS_init_module+0x94/0xa1
[   18.286795]  [8138cf12] system_call_fastpath+0x16/0x1b
[   18.293051] Code: <6e> 65 77 6c 69 6e 6b 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 
[   18.302807] RIP  [a0e76002] 0xa0e76001
[   18.308429]  RSP 8806212f1ca8
[   18.312163] CR2: 0003
[   18.316021] ---[ end trace 839c6b43b00f02f5 ]---

and still:

Ffffa0e76000 u ip_tunnel_init_net   [ip_gre]

I've checked that ip_tunnel.ko and ip_gre.ko were indeed rebuilt (new
timestamps), and that my patching (I had to resolve manually due to fuzz)
really removed __net_init.

/* Steinar */
-- 
Homepage: http://www.sesse.net/

Re: NULL pointer dereference when loading the gre module (3.10.0-rc4)

On Thu, Jun 06, 2013 at 11:06:48PM -0400, Steven Rostedt wrote:
> Note the faulting address is 0xa0e52001, which is around the
> above address, be interesting to know what was at that location.

Doh, I looked at the wrong place in kallsyms:

a0e52000 u ip_tunnel_init_net   [ip_gre]
a0e55000 t gre_err  [gre]
a0e5503d t gre_gso_send_check   [gre]
a0e55053 t gre_rcv  [gre]

So it's really ip_tunnel_init_net+1.
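
The lookup done by hand above (find the last symbol at or below the faulting
address, then take the difference) can be sketched as follows. The sample
data is copied from the kallsyms excerpt in this message; the function names
are made up for illustration:

```python
import bisect

KALLSYMS_SAMPLE = """\
a0e52000 u ip_tunnel_init_net   [ip_gre]
a0e55000 t gre_err  [gre]
a0e5503d t gre_gso_send_check   [gre]
a0e55053 t gre_rcv  [gre]
"""


def parse_kallsyms(text: str) -> list:
    """Return (address, name) pairs sorted by address."""
    symbols = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        symbols.append((int(fields[0], 16), fields[2]))
    symbols.sort()
    return symbols


def resolve(symbols: list, addr: int):
    """Return "symbol+0xoffset" for the nearest symbol at or below addr."""
    addresses = [a for a, _ in symbols]
    index = bisect.bisect_right(addresses, addr) - 1
    if index < 0:
        return None  # address lies below every known symbol
    base, name = symbols[index]
    return f"{name}+{addr - base:#x}"


if __name__ == "__main__":
    table = parse_kallsyms(KALLSYMS_SAMPLE)
    print(resolve(table, 0xa0e52001))
```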

/* Steinar */
-- 
Homepage: http://www.sesse.net/

Re: NULL pointer dereference when loading the gre module (3.10.0-rc4)

On Thu, Jun 06, 2013 at 11:06:48PM -0400, Steven Rostedt wrote:
> Note the faulting address is 0xa0e52001, which is around the
> above address, be interesting to know what was at that location.

Aha, the plot thickens:

root      6095  0.0  0.0   6632   596 ?        D    Jun06   0:00 /sbin/modprobe -q -- net-pf-17

pannekake:/usr/src/linux-3.10-rc4> sudo cat /proc/6095/stack
[812bb04f] register_pernet_subsys+0x18/0x39
[a0ffd089] packet_init+0x32/0x44 [af_packet]
[81000263] do_one_initcall+0x7b/0x10c
[8107e5db] load_module+0x1b1f/0x1e19
[8107e969] SyS_init_module+0x94/0xa1
[8138cf12] system_call_fastpath+0x16/0x1b
[] 0x

I have a tcpdump running almost all the time (from boot), for a variety of
reasons. And I think I have the BPF JIT on; possibly related.

/* Steinar */
-- 
Homepage: http://www.sesse.net/

Re: NULL pointer dereference when loading the gre module (3.10.0-rc4)

On Thu, Jun 06, 2013 at 08:59:48PM -0700, Eric Dumazet wrote:
> Steinar, please make sure you recompiled your modules, because this
> looks like you loaded old modules.

I compiled the kernel using make-kpkg, so I don't see how that would happen.
Also, the timestamps indicate everything is fine:

-rw-r--r-- 1 root root 2574976 Jun  6 00:39 /boot/vmlinuz-3.10.0-rc4
-rw-r--r-- 1 root root   22856 Jun  6 00:39 /lib/modules/3.10.0-rc4/kernel/net/ipv4/ip_gre.ko

Or from the source tree:

-rw-r--r-- 1 root root 2574976 Jun  6 00:36 arch/x86/boot/bzImage
-rw-r--r-- 1 root root   22800 Jun  6 00:39 ./net/ipv4/ip_gre.ko

/* Steinar */
-- 
Homepage: http://www.sesse.net/

Re: NULL pointer dereference when loading the gre module (3.10.0-rc4)

On Thu, Jun 06, 2013 at 11:06:48PM -0400, Steven Rostedt wrote:
> Note the faulting address is 0xa0e52001, which is around the
> above address, be interesting to know what was at that location.

Is there any way I can figure this out? The machine in question is still
running. kallsyms doesn't show anything near it, though.

/* Steinar */
-- 
Homepage: http://www.sesse.net/

Re: NULL pointer dereference when loading the gre module (3.10.0-rc4)

On Fri, Jun 07, 2013 at 11:15:00AM -0400, Steven Rostedt wrote:
 net: Remove __net_init/exit from exported functions
 
 If CONFIG_NET_NS is not set then __net_init is the same as __init and
 __net_exit is the same as __exit. These functions will be removed from
 memory after the module loads or is removed. Functions that are exported
 for use by other functions should never be labeled for removal.

That didn't help much, I'm afraid:

[   18.005451] BUG: unable to handle kernel NULL pointer dereference at 
0003
[   18.013853] IP: [a0e76002] 0xa0e76001
[   18.019380] PGD 0 
[   18.021695] Oops:  [#1] SMP 
[   18.025285] Modules linked in: ip_gre(+) gre ip_tunnel psmouse ide_generic 
ide_gd_mod ide_cd_mod cdrom acpi_cpufreq mperf coretemp kvm_intel kvm iTCO_wdt 
iTCO_vendor_support lpc_ich microcode mfd_core i2c_i801 pcspkr i2c_core 
ehci_pci evbug evdev ext4 crc16 jbd2 mbcache dm_mod raid456 async_raid6_recov 
async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 md_mod sg sd_mod 
usbhid ide_pci_generic ide_core crc32c_intel e1000e ata_piix ptp pps_core 
uhci_hcd ehci_hcd mpt2sas raid_class unix
[   18.073543] CPU: 0 PID: 3263 Comm: modprobe Not tainted 3.10.0-rc4 #2
[   18.080237] Hardware name: Supermicro X8DTL/X8DTL, BIOS 2.1a   12/30/2011
[   18.087634] task: 88061ecfad60 ti: 8806212f task.ti: 
8806212f
[   18.095571] RIP: 0010:[a0e76002]  [a0e76002] 
0xa0e76001
[   18.103745] RSP: 0018:8806212f1ca8  EFLAGS: 00010246
[   18.109301] RAX: a0e81000 RBX: 880623ebe280 RCX: 
[   18.116682] RDX: a0e7ea40 RSI: 0003 RDI: a0e81018
[   18.124063] RBP: 8806212f1ca8 R08: 0cf8 R09: 812bae96
[   18.131441] R10: ea0018852c00 R11:  R12: 880621678290
[   18.138829] R13: a0e7e9c0 R14: 8806212f1ef8 R15: 0002
[   18.146210] FS:  7f2e37fd1700() GS:88062720() 
knlGS:
[   18.154747] CS:  0010 DS:  ES:  CR0: 8005003b
[   18.160742] CR2: 0003 CR3: 000622a5e000 CR4: 07f0
[   18.168131] DR0:  DR1:  DR2: 
[   18.175510] DR3:  DR6: 0ff0 DR7: 0400
[   18.182890] Stack:
[   18.185143]  8806212f1cf8 812baf26  

[   18.193235]   a0e7e9c0  

[   18.201313]  8806212f1ef8 a0e7eb60 8806212f1d28 
812bafb6
[   18.209389] Call Trace:
[   18.212084]  [812baf26] ops_init.constprop.7+0xc6/0xf5
[   18.218339]  [812bafb6] register_pernet_operations.isra.4+0x61/0x91
[   18.225720]  [8138486f] ? mutex_lock+0xf/0x20
[   18.231189]  [812bb006] register_pernet_device+0x20/0x51
[   18.237621]  [a0e81034] ? ipgre_tap_init_net+0x1a/0x1a [ip_gre]
[   18.244661]  [a0e81055] ipgre_init+0x21/0xc9 [ip_gre]
[   18.250831]  [a0e81034] ? ipgre_tap_init_net+0x1a/0x1a [ip_gre]
[   18.257866]  [81000263] do_one_initcall+0x7b/0x10c
[   18.263780]  [8107e5db] load_module+0x1b1f/0x1e19
[   18.269594]  [8107a4f8] ? sys_getegid16+0x44/0x44
[   18.275416]  [81386cf2] ? page_fault+0x22/0x30
[   18.280972]  [8107e969] SyS_init_module+0x94/0xa1
[   18.286795]  [8138cf12] system_call_fastpath+0x16/0x1b
[   18.293051] Code: 6e 65 77 6c 69 6e 6b 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 
[   18.302807] RIP  [a0e76002] 0xa0e76001
[   18.308429]  RSP 8806212f1ca8
[   18.312163] CR2: 0003
[   18.316021] ---[ end trace 839c6b43b00f02f5 ]---

and still:

Ffffa0e76000 u ip_tunnel_init_net   [ip_gre]

I've checked that ip_tunnel.ko and ip_gre.ko was indeed rebuilt (new 
timestamps),
and that my patching (I had to resolve manually due to fuzz) really removed 
__net_init.

/* Steinar */
-- 
Homepage: http://www.sesse.net/
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: NULL pointer dereference when loading the gre module (3.10.0-rc4)

On Fri, Jun 07, 2013 at 12:12:23PM -0400, Steven Rostedt wrote:
 Ffffa0e76000 u ip_tunnel_init_net   [ip_gre]
 What do you get if you do an objdump -Dr ip_gre.ko
 
 And then look for ipgre_init, and then subtract 0xb053 (45139) from its
 address. As that is: a0e81055 - a0e76002, then see if
 that object file has anything in that location.

pannekake:~ objdump -Dr /lib/modules/3.10.0-rc4/kernel/net/ipv4/ip_gre.ko | 
grep ipgre_init
 ipgre_init_net:  

   0:   8b 35 00 00 00 00   mov0x0(%rip),%esi# 6 
ipgre_init_net+0x6
  13:   e8 00 00 00 00  callq  18 ipgre_init_net+0x18

Ie., the symbol doesn't show up in the disassembly (for whatever reason).

/* Steinar */
-- 
Homepage: http://www.sesse.net/
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: NULL pointer dereference when loading the gre module (3.10.0-rc4)

On Fri, Jun 07, 2013 at 02:26:08PM -0400, Steven Rostedt wrote:
 On Fri, 2013-06-07 at 19:52 +0200, Steinar H. Gunderson wrote:
 Ah, that's because of this: module_init(ipgre_init);  Where it makes it
 into:
 
  init_module:
0:   55  push   %ebp
1:   89 e5   mov%esp,%ebp
3:   53  push   %ebx
4:   83 ec 08sub$0x8,%esp
7:   c7 04 24 00 00 00 00movl   $0x0,(%esp)
 a: R_386_32 .rodata.str1.4
 
 We can use ipgre_tap_init_net, and the offset of 0xb032 (45106) as that
 was 0xa0e5d034 - 0xa0e52002. Do you have CONFIG_NET_NS
 set?

ipgre_tap_init_net is 001a, but there's no way I can subtract
0xb053 from that? Sorry, I'm confused. :-)

 You can also cat /proc/modules. It gives you where the modules are
 located.

I've booted back to 3.9.x already; I couldn't live with a crashing kernel like
that. Unfortunately it's not that easy for me to reboot this machine all the
time either. :-/

/* Steinar */
-- 
Homepage: http://www.sesse.net/
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: NULL pointer dereference when loading the gre module (3.10.0-rc4)

On Fri, Jun 07, 2013 at 02:44:19PM -0400, Steven Rostedt wrote:
 Do know if you have CONFIG_NET_NS set in your .config?

Sorry, I forgot to answer this: No, it is not set.

/* Steinar */
-- 
Homepage: http://www.sesse.net/
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: NULL pointer dereference when loading the gre module (3.10.0-rc4)

On Thu, Jun 06, 2013 at 11:06:48PM -0400, Steven Rostedt wrote:
 Note the faulting address is 0xa0e52001, which is around the
 above address, be interesting to know what was at that location.

Is there any way I can figure this out? The machine in question is still
running. kallsyms doesn't show anything near it, though.

/* Steinar */
-- 
Homepage: http://www.sesse.net/
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: NULL pointer dereference when loading the gre module (3.10.0-rc4)

On Thu, Jun 06, 2013 at 08:59:48PM -0700, Eric Dumazet wrote:
 Steinar, please make sure you recompiled your modules, because this
 looks like you loaded old modules.

I compiled the kernel using make-kpkg, so I don't see how that would happen.
Also, the timestamps indicate everything is fine:

-rw-r--r-- 1 root root 2574976 Jun  6 00:39 /boot/vmlinuz-3.10.0-rc4
-rw-r--r-- 1 root root   22856 Jun  6 00:39 
/lib/modules/3.10.0-rc4/kernel/net/ipv4/ip_gre.ko

Or from the source tree:

-rw-r--r-- 1 root root 2574976 Jun  6 00:36 arch/x86/boot/bzImage
-rw-r--r-- 1 root root   22800 Jun  6 00:39 ./net/ipv4/ip_gre.ko

/* Steinar */
-- 
Homepage: http://www.sesse.net/
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: NULL pointer dereference when loading the gre module (3.10.0-rc4)

On Thu, Jun 06, 2013 at 11:06:48PM -0400, Steven Rostedt wrote:
 Note the faulting address is 0xa0e52001, which is around the
 above address, be interesting to know what was at that location.

Aha, the plot thickens:

root  6095  0.0  0.0   6632   596 ?DJun06   0:00 /sbin/modprobe 
-q -- net-pf-17

pannekake:/usr/src/linux-3.10-rc4 sudo cat /proc/6095/stack
[812bb04f] register_pernet_subsys+0x18/0x39
[a0ffd089] packet_init+0x32/0x44 [af_packet]
[81000263] do_one_initcall+0x7b/0x10c
[8107e5db] load_module+0x1b1f/0x1e19
[8107e969] SyS_init_module+0x94/0xa1
[8138cf12] system_call_fastpath+0x16/0x1b
[] 0x

I have a tcpdump running almost all the time (from boot), for a variety of
reasons. And I think I have the BPF JIT on; possibly related.

/* Steinar */
-- 
Homepage: http://www.sesse.net/

Re: NULL pointer dereference when loading the gre module (3.10.0-rc4)

On Thu, Jun 06, 2013 at 11:06:48PM -0400, Steven Rostedt wrote:
> Note the faulting address is 0xa0e52001, which is around the
> above address, be interesting to know what was at that location.

Doh, I looked at the wrong place in kallsyms:

a0e52000 u ip_tunnel_init_net   [ip_gre]
a0e55000 t gre_err  [gre]
a0e5503d t gre_gso_send_check   [gre]
a0e55053 t gre_rcv  [gre]

So it's really ip_tunnel_init_net+1.
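
The lookup itself is just "nearest symbol at or below the address". A small
sketch of that, using only the four kallsyms lines quoted above (with the
addresses exactly as printed there); load_symbols and resolve are illustrative
names, not part of any existing tool:

```python
import bisect

# Symbol table copied from the kallsyms excerpt above.
KALLSYMS = """\
a0e52000 u ip_tunnel_init_net   [ip_gre]
a0e55000 t gre_err  [gre]
a0e5503d t gre_gso_send_check   [gre]
a0e55053 t gre_rcv  [gre]
"""

def load_symbols(text):
    """Parse 'addr type name [module]' lines into a sorted (addr, name) list."""
    symbols = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            symbols.append((int(fields[0], 16), fields[2]))
    return sorted(symbols)

def resolve(symbols, addr):
    """Return 'name+0xoff' for the nearest symbol at or below addr, or None."""
    starts = [a for a, _ in symbols]
    i = bisect.bisect_right(starts, addr) - 1
    if i < 0:
        return None
    base, name = symbols[i]
    return "%s+0x%x" % (name, addr - base)

if __name__ == "__main__":
    print(resolve(load_symbols(KALLSYMS), 0xa0e52001))
```

For the faulting address this gives ip_tunnel_init_net+0x1, matching the
conclusion above.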

/* Steinar */
-- 
Homepage: http://www.sesse.net/

NULL pointer dereference when loading the gre module (3.10.0-rc4)

2013-06-06 Thread Steinar H. Gunderson

Hi,

In 3.10.0-rc4, I get this on boot:

[   16.871043] BUG: unable to handle kernel NULL pointer dereference at 0003
[   16.879453] IP: [a0e52002] 0xa0e52001
[   16.884995] PGD 0 
[   16.887313] Oops:  [#1] SMP 
[   16.890904] Modules linked in: ip_gre(+) gre ip_tunnel psmouse ide_generic 
ide_gd_mod ide_cd_mod cdrom acpi_cpufreq mperf coretemp kvm_intel kvm iTCO_wdt 
iTCO_vendor_support i2c_i801 microcode lpc_ich pcspkr i2c_core mfd_core 
ehci_pci evbug evdev ext4 crc16 jbd2 mbcache dm_mod raid456 async_raid6_recov 
async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 md_mod sg sd_mod 
usbhid ide_pci_generic ide_core crc32c_intel e1000e ata_piix ptp pps_core 
uhci_hcd ehci_hcd mpt2sas raid_class unix
[   16.939181] CPU: 0 PID: 3261 Comm: modprobe Not tainted 3.10.0-rc4 #1
[   16.945873] Hardware name: Supermicro X8DTL/X8DTL, BIOS 2.1a   12/30/2011
[   16.953252] task: 880621662d60 ti: 8806227de000 task.ti: 8806227de000
[   16.961184] RIP: 0010:[a0e52002]  [a0e52002] 0xa0e52001
[   16.969346] RSP: 0018:8806227dfca8  EFLAGS: 00010246
[   16.974903] RAX: a0e5d000 RBX: 880623ebe280 RCX: 
[   16.982285] RDX: a0e5aa40 RSI: 0003 RDI: a0e5d018
[   16.989674] RBP: 8806227dfca8 R08: 072f R09: 812bae96
[   16.997051] R10: ea00188d1200 R11:  R12: 88061f874900
[   17.004440] R13: a0e5a9c0 R14: 8806227dfef8 R15: 0002
[   17.011818] FS:  7f7da1d97700() GS:88062720() knlGS:
[   17.020357] CS:  0010 DS:  ES:  CR0: 8005003b
[   17.026349] CR2: 0003 CR3: 000621b84000 CR4: 07f0
[   17.033734] DR0:  DR1:  DR2: 
[   17.041110] DR3:  DR6: 0ff0 DR7: 0400
[   17.048494] Stack:
[   17.050757]  8806227dfcf8 812baf26
[   17.058857]   a0e5a9c0
[   17.066933]  8806227dfef8 a0e5ab60 8806227dfd28 812bafb6
[   17.075008] Call Trace:
[   17.077703]  [812baf26] ops_init.constprop.7+0xc6/0xf5
[   17.083956]  [812bafb6] register_pernet_operations.isra.4+0x61/0x91
[   17.091340]  [8138486f] ? mutex_lock+0xf/0x20
[   17.096822]  [812bb006] register_pernet_device+0x20/0x51
[   17.103254]  [a0e5d034] ? ipgre_tap_init_net+0x1a/0x1a [ip_gre]
[   17.110298]  [a0e5d055] ipgre_init+0x21/0xc9 [ip_gre]
[   17.116470]  [a0e5d034] ? ipgre_tap_init_net+0x1a/0x1a [ip_gre]
[   17.123515]  [81000263] do_one_initcall+0x7b/0x10c
[   17.129422]  [8107e5db] load_module+0x1b1f/0x1e19
[   17.135241]  [8107a4f8] ? sys_getegid16+0x44/0x44
[   17.141058]  [81386cf2] ? page_fault+0x22/0x30
[   17.146618]  [8107e969] SyS_init_module+0x94/0xa1
[   17.152440]  [8138cf12] system_call_fastpath+0x16/0x1b
[   17.158695] Code: <6e> 65 77 6c 69 6e 6b 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[   17.168440] RIP  [a0e52002] 0xa0e52001
[   17.174058]  RSP 8806227dfca8
[   17.177798] CR2: 0003
[   17.181730] ---[ end trace 531fea804a54bcad ]---

I assume this is from loading ip_gre, given that it's somewhere in the call
stack; amazingly enough, GRE tunnels seem to actually still work, though,
although I cannot load other modules such as ip_tables (modprobe hangs).
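
As a side note on the Code: line: the bytes there are printable ASCII followed
by zeros. A minimal sketch for decoding them, with the byte list copied from
the oops above and a helper name chosen just for illustration:

```python
def decode_code_bytes(code_line):
    """Turn the hex bytes of an oops 'Code:' line into printable ASCII.

    Non-printable bytes are shown as '.'; the <..> marker around the
    faulting byte is stripped before parsing.
    """
    out = []
    for token in code_line.split():
        byte = int(token.strip("<>"), 16)
        out.append(chr(byte) if 0x20 <= byte < 0x7f else ".")
    return "".join(out)

if __name__ == "__main__":
    print(decode_code_bytes("<6e> 65 77 6c 69 6e 6b 00 00 00"))
```

The bytes decode to the string "newlink" followed by NULs, which suggests the
CPU was pointed at data rather than at instructions.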

/* Steinar */
-- 
Homepage: http://www.sesse.net/

Re: 3.10.0-rc4 oops in scsi_lib.c:1196

On Wed, Jun 05, 2013 at 03:16:05PM -0400, Ilia Mirkin wrote:
>> It dies every time. 3.9.0 is fine.
> You need the patch at https://lkml.org/lkml/2013/5/19/75

Ah, I'm already down at 3.9.4 again; I mainly reported this in case it was
unknown. (But a patch from May 19 not being in 3.10-rc4 is maybe a bit
strange.)

/* Steinar */
-- 
Homepage: http://www.sesse.net/


3.10.0-rc4 oops in scsi_lib.c:1196

Hi,

I have a setup with an LSI2008 controller running md (RAID-1 and RAID-6) with
LVM on top, and I'm seeing this on boot, pretty much under fsck:

[   15.235697] ------------[ cut here ]------------
[   15.235738] md: resync of RAID array md1
[   15.235740] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
[   15.235742] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for resync.
[   15.235752] md: using 128k window, over a total of 2930111488k.
[   15.266826] kernel BUG at drivers/scsi/scsi_lib.c:1196!
[   15.272305] invalid opcode:  [#1] SMP 
[   15.276771] Modules linked in: acpi_cpufreq mperf coretemp kvm_intel kvm
iTCO_wdt iTCO_vendor_support microcode pcspkr i2c_i801 ehci_pci i2c_core
lpc_ich mfd_core evbug evdev ext4 crc16 jbd2 mbcache dm_mod raid456
async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1
md_mod sg sd_mod usbhid crc32c_intel ide_pci_generic ide_core e1000e ata_piix
ptp pps_core uhci_hcd ehci_hcd mpt2sas raid_class unix
[   15.318120] CPU: 6 PID: 1614 Comm: md1_raid6 Not tainted 3.10.0-rc4 #1
[   15.324905] Hardware name: Supermicro X8DTL/X8DTL, BIOS 2.1a   12/30/2011
[   15.332288] task: 88061eadc410 ti: 88061d1a2000 task.ti: 88061d1a2000
[   15.340217] RIP: 0010:[812440b6]  [812440b6] scsi_setup_fs_cmnd+0x48/0x90
[   15.349271] RSP: 0018:88061d1a3b48  EFLAGS: 00010046
[   15.354828] RAX:  RBX: 880621a3e000 RCX: 0002
[   15.362217] RDX: 0001 RSI: 88061fd17530 RDI: 880621a3e000
[   15.369598] RBP: 88061d1a3b58 R08: 880621b76b00 R09: 880621678c58
[   15.376985] R10: 880621678c58 R11:  R12: 88061fd17530
[   15.384402] R13: 0008 R14: 88061fd9c800 R15: 1000
[   15.391838] FS:  () GS:8806272c() knlGS:
[   15.400382] CS:  0010 DS:  ES:  CR0: 8005003b
[   15.406375] CR2: 7fda91b26d26 CR3: 01595000 CR4: 07e0
[   15.413761] DR0:  DR1:  DR2: 
[   15.421144] DR3:  DR6: 0ff0 DR7: 0400
[   15.428530] Stack:
[   15.430788]  88061fd17530 2efbf980 88061d1a3bd8 a0139967
[   15.438866]   880621b76b00[   15.442789] usb 4-1: new low-speed USB device number 4 using uhci_hcd
 88061d1a3ba8 8119e9a1
[   15.453782]  88061d1a3ba8 88061f2d 88061f2d 880621a3e000
[   15.461854] Call Trace:
[   15.464553]  [a0139967] sd_prep_fn+0x3bb/0xc76 [sd_mod]
[   15.470898]  [8119e9a1] ? deadline_remove_request.isra.4+0x7e/0x86
[   15.478197]  [8118de29] blk_peek_request+0xdd/0x1d9
[   15.484193]  [8124396d] scsi_request_fn+0x4a/0x51b
[   15.490102]  [81189997] __blk_run_queue+0x2e/0x38
[   15.495929]  [8118c4a4] queue_unplugged+0x55/0x7d
[   15.501780]  [8118e204] blk_flush_plug_list+0x137/0x1b9
[   15.508120]  [8118e297] blk_finish_plug+0x11/0x32
[   15.513940]  [a01d233a] raid5d+0x370/0x3f3 [raid456]
[   15.520032]  [81382c48] ? schedule_timeout+0x24/0x19c
[   15.526200]  [a016c568] md_thread+0x11e/0x13c [md_mod]
[   15.532454]  [810534ca] ? abort_exclusive_wait+0x8a/0x8a
[   15.538891]  [a016c44a] ? md_register_thread+0xd0/0xd0 [md_mod]
[   15.545930]  [81052b9a] kthread+0xb5/0xbd
[   15.551059]  [81052ae5] ? kthread_freezable_should_stop+0x43/0x43
[   15.558313]  [8138c06c] ret_

The backtrace is cut off at this point (the serial console is not always
reliable).

It dies every time. 3.9.0 is fine.
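
For anyone wanting to process a trace like the one above mechanically, here is
a rough sketch that splits a call trace entry into symbol, offset, function
size and module. It assumes the bracket style as it appears in this archive
(plain [addr], without angle brackets); TRACE_RE and parse_trace_line are
illustrative names only.

```python
import re

# Matches entries such as "[a0139967] sd_prep_fn+0x3bb/0xc76 [sd_mod]".
# A "? " before the symbol marks an address found on the stack that may
# not be part of the real call chain.
TRACE_RE = re.compile(
    r"\[(?P<addr>[0-9a-f]+)\]\s+(?P<unsure>\?\s+)?"
    r"(?P<sym>[\w.]+)\+0x(?P<off>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:\s+\[(?P<mod>\w+)\])?"
)

def parse_trace_line(line):
    """Return a dict describing one call trace entry, or None if no match."""
    m = TRACE_RE.search(line)
    if m is None:
        return None
    return {
        "symbol": m.group("sym"),
        "offset": int(m.group("off"), 16),
        "size": int(m.group("size"), 16),
        "module": m.group("mod"),  # None for built-in kernel code
        "reliable": m.group("unsure") is None,
    }

if __name__ == "__main__":
    line = "[   15.464553]  [a0139967] sd_prep_fn+0x3bb/0xc76 [sd_mod]"
    print(parse_trace_line(line))
```

A truncated entry such as the final "ret_" line simply yields None.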

/* Steinar */
-- 
Homepage: http://www.sesse.net/


Re: 3.10.0-rc4 oops in scsi_lib.c:1196