from:"Andrew Morton"

Re: [PATCH] bpfilter: fix user mode helper cross compilation

2018-06-28 Thread Andrew Morton

On Wed, 20 Jun 2018 16:04:34 +0200 Matteo Croce  wrote:

> Use $(OBJDUMP) instead of literal 'objdump' to avoid
> using host toolchain when cross compiling.
> 

I'm still having issues here, with ld.

x86_64 machine, ARCH=i386:

y:/usr/src/25> make V=1 M=net/bpfilter
test -e include/generated/autoconf.h -a -e include/config/auto.conf || (   \
echo >&2;   \
echo >&2 "  ERROR: Kernel configuration is invalid.";   \
echo >&2 " include/generated/autoconf.h or include/config/auto.conf are 
missing.";\
echo >&2 " Run 'make oldconfig && make prepare' on kernel src to fix 
it.";  \
echo >&2 ;  \
/bin/false)
mkdir -p net/bpfilter/.tmp_versions ; rm -f net/bpfilter/.tmp_versions/*
make -f ./scripts/Makefile.build obj=net/bpfilter
(cat /dev/null;   echo kernel/net/bpfilter/bpfilter.ko;) > 
net/bpfilter/modules.order
  ld -m elf_i386   -r -o net/bpfilter/bpfilter.o net/bpfilter/bpfilter_kern.o 
net/bpfilter/bpfilter_umh.o ; scripts/mod/modpost net/bpfilter/bpfilter.o
ld: i386:x86-64 architecture of input file `net/bpfilter/bpfilter_umh.o' is 
incompatible with i386 output
scripts/Makefile.build:530: recipe for target 'net/bpfilter/bpfilter.o' failed
make[1]: *** [net/bpfilter/bpfilter.o] Error 1
Makefile:1518: recipe for target '_module_net/bpfilter' failed
make: *** [_module_net/bpfilter] Error 2

y:/usr/src/25> ld --version
GNU ld (GNU Binutils for Ubuntu) 2.29.1

Re: [lkp-robot] 486ad79630 [ 15.532543] BUG: unable to handle kernel NULL pointer dereference at 0000000000000004

2018-05-02 Thread Andrew Morton

On Wed, 2 May 2018 21:58:25 -0700 Cong Wang <xiyou.wangc...@gmail.com> wrote:

> On Wed, May 2, 2018 at 9:27 PM, Andrew Morton <a...@linux-foundation.org> 
> wrote:
> >
> > So it's saying that something which got committed into Linus's tree
> > after 4.17-rc3 has caused a NULL deref in
> > sock_release->llc_ui_release+0x3a/0xd0
> 
> Do you mean it contains commit 3a04ce7130a7
> ("llc: fix NULL pointer deref for SOCK_ZAPPED")?

That was in 4.17-rc3 so if this report's bisection is correct, that
patch is innocent.

origin.patch (http://ozlabs.org/~akpm/mmots/broken-out/origin.patch)
contains no changes to net/llc/af_llc.c so perhaps this crash is also
occurring in 4.17-rc3 base.

Re: [lkp-robot] 486ad79630 [ 15.532543] BUG: unable to handle kernel NULL pointer dereference at 0000000000000004

2018-05-02 Thread Andrew Morton


(networking cc's added)

On Thu, 3 May 2018 12:14:50 +0800 kernel test robot <shun@intel.com> wrote:

> Greetings,
> 
> 0day kernel testing robot got the below dmesg and the first bad commit is
> 
> git://git.cmpxchg.org/linux-mmotm.git master
> 
> commit 486ad79630d0ba0b7205a8db9fe15ba392f5ee32
> Author: Andrew Morton <a...@linux-foundation.org>
> AuthorDate: Fri Apr 20 22:00:53 2018 +
> Commit: Johannes Weiner <han...@cmpxchg.org>
> CommitDate: Fri Apr 20 22:00:53 2018 +
> 
> origin


OK, this got confusing.  origin.patch is the diff between 4.17-rc3 and
current mainline.

>
> [many lines deleted]
>
> [main] Setsockopt(101 c 1b24000 a) on fd 177 [3:5:240]
> [main] Setsockopt(1 2c 1b24000 4) on fd 178 [5:2:0]
> [main] Setsockopt(29 8 1b24000 4) on fd 180 [10:1:0]
> [main] Setsockopt(1 20 1b24000 4) on fd 181 [26:2:125]
> [main] Setsockopt(11 1 1b24000 4) on fd 183 [2:2:17]
> [   15.532543] BUG: unable to handle kernel NULL pointer dereference at 
> 0004
> [   15.534143] PGD 80001734b067 P4D 80001734b067 PUD 17350067 PMD 0 
> [   15.535516] Oops: 0002 [#1] PTI
> [   15.536165] Modules linked in:
> [   15.536798] CPU: 0 PID: 363 Comm: trinity-main Not tainted 
> 4.17.0-rc1-1-g486ad79 #2
> [   15.538396] RIP: 0010:llc_ui_release+0x3a/0xd0
> [   15.539293] RSP: 0018:c915bd70 EFLAGS: 00010202
> [   15.540345] RAX: 0001 RBX: 88001fa60008 RCX: 
> 0006
> [   15.541802] RDX: 0006 RSI: 88001fdda660 RDI: 
> 88001fa60008
> [   15.543139] RBP: c915bd80 R08:  R09: 
> 
> [   15.544725] R10:  R11:  R12: 
> 
> [   15.546287] R13: 88001fa61730 R14: 88001e130a60 R15: 
> 880019bdb3f0
> [   15.547962] FS:  7f2221bb1700() GS:82034000() 
> knlGS:
> [   15.549848] CS:  0010 DS:  ES:  CR0: 80050033
> [   15.551186] CR2: 0004 CR3: 1734e000 CR4: 
> 06b0
> [   15.552671] DR0: 02232000 DR1:  DR2: 
> 
> [   15.554105] DR3:  DR6: 0ff0 DR7: 
> 0600
> [   15.34] Call Trace:
> [   15.556049]  sock_release+0x14/0x60
> [   15.556767]  sock_close+0xd/0x20
> [   15.557427]  __fput+0xba/0x1f0
> [   15.558058]  fput+0x9/0x10
> [   15.558682]  task_work_run+0x73/0xa0
> [   15.559416]  do_exit+0x231/0xab0
> [   15.560079]  do_group_exit+0x3f/0xc0
> [   15.560810]  __x64_sys_exit_group+0x13/0x20
> [   15.561656]  do_syscall_64+0x58/0x2f0
> [   15.562407]  ? trace_hardirqs_off_thunk+0x1a/0x1c
> [   15.563360]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [   15.564471] RIP: 0033:0x7f2221696408
> [   15.565264] RSP: 002b:7ffe5c544c48 EFLAGS: 0206 ORIG_RAX: 
> 00e7
> [   15.566924] RAX: ffda RBX:  RCX: 
> 7f2221696408
> [   15.568485] RDX:  RSI: 003c RDI: 
> 
> [   15.570046] RBP:  R08: 00e7 R09: 
> ffa0
> [   15.571603] R10: 7ffe5c5449e0 R11: 0206 R12: 
> 0004
> [   15.573160] R13: 7ffe5c544e30 R14:  R15: 
> 
> [   15.574720] Code: 7b ff 43 78 0f 88 a5 6f 14 00 31 f6 48 89 df e8 ad 33 fb 
> ff 48 89 df e8 55 94 ff ff 85 c0 0f 84 84 00 00 00 4c 8b a3 d8 04 00 00 <41> 
> ff 44 24 04 0f 88 7f 6f 14 00 48 8b 43 58 f6 c4 01 74 58 48 
> [   15.578679] RIP: llc_ui_release+0x3a/0xd0 RSP: c915bd70
> [   15.579874] CR2: 0004
> [   15.580553] ---[ end trace 0dd8fdc6b7182234 ]---
>

So it's saying that something which got committed into Linus's tree
after 4.17-rc3 has caused a NULL deref in
sock_release->llc_ui_release+0x3a/0xd0

Re: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM

2018-05-01 Thread Andrew Morton

On Tue, 24 Apr 2018 12:33:01 -0400 (EDT) Mikulas Patocka  
wrote:

> 
> 
> On Tue, 24 Apr 2018, Michal Hocko wrote:
> 
> > On Tue 24-04-18 11:30:40, Mikulas Patocka wrote:
> > > 
> > > 
> > > On Tue, 24 Apr 2018, Michal Hocko wrote:
> > > 
> > > > On Mon 23-04-18 20:25:15, Mikulas Patocka wrote:
> > > > 
> > > > > Fixing __vmalloc code 
> > > > > is easy and it doesn't require cooperation with maintainers.
> > > > 
> > > > But it is a hack against the intention of the scope api.
> > > 
> > > It is not!
> > 
> > This discussion simply doesn't make much sense it seems. The scope API
> > is to document the scope of the reclaim recursion critical section. That
> > certainly is not a utility function like vmalloc.
> 
> That 15-line __vmalloc bugfix doesn't prevent you (or any other kernel 
> developer) from converting the code to the scope API. You make nonsensical 
> excuses.
> 

Fun thread!

Winding back to the original problem, I'd state it as

- Caller uses kvmalloc() but passes the address into vmalloc-naive
  DMA API and

- Caller uses kvmalloc() but passes the address into kfree()

Yes?

If so, then...

Is there a way in which, in the kvmalloc-called-kmalloc path, we can
tag the slab-allocated memory with a "this memory was allocated with
kvmalloc()" flag?  I *think* there's extra per-object storage available
with suitable slab/slub debugging options?  Perhaps we could steal one
bit from the redzone, dunno.

If so then we can

a) set that flag in kvmalloc() if the kmalloc() call succeeded

b) check for that flag in the DMA code, WARN if it is set.

c) in kvfree(), clear that flag before calling kfree()

d) in kfree(), check for that flag and go WARN() if set.

So both potential bugs are detected all the time, dependent upon
CONFIG_SLUB_DEBUG (and perhaps other slub config options).

Re: simplify procfs code for seq_file instances

2018-04-24 Thread Andrew Morton

On Tue, 24 Apr 2018 16:23:04 +0200 Christoph Hellwig  wrote:

> On Thu, Apr 19, 2018 at 09:57:50PM +0300, Alexey Dobriyan wrote:
> > > git://git.infradead.org/users/hch/misc.git proc_create
> > 
> > 
> > I want to ask if it is time to start using poorman function overloading
> > with _b_c_e(). There are millions of allocation functions for example,
> > all slightly difference, and people will add more. Seeing /proc interfaces
> > doubled like this is painful.
> 
> Function overloading is totally unacceptable.
> 
> And I very much disagree with a tradeoff that keeps 5000 lines of 
> code vs a few new helpers.

OK, the curiosity and suspense are killing me.  What the heck is
"function overloading with _b_c_e()"?

Re: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM

2018-04-19 Thread Andrew Morton

On Thu, 19 Apr 2018 17:19:20 -0400 (EDT) Mikulas Patocka  
wrote:

> > > In order to detect these bugs reliably I submit this patch that changes
> > > kvmalloc to always use vmalloc if CONFIG_DEBUG_VM is turned on.
> > > 
> > > ...
> > >
> > > --- linux-2.6.orig/mm/util.c  2018-04-18 15:46:23.0 +0200
> > > +++ linux-2.6/mm/util.c   2018-04-18 16:00:43.0 +0200
> > > @@ -395,6 +395,7 @@ EXPORT_SYMBOL(vm_mmap);
> > >   */
> > >  void *kvmalloc_node(size_t size, gfp_t flags, int node)
> > >  {
> > > +#ifndef CONFIG_DEBUG_VM
> > >   gfp_t kmalloc_flags = flags;
> > >   void *ret;
> > >  
> > > @@ -426,6 +427,7 @@ void *kvmalloc_node(size_t size, gfp_t f
> > >*/
> > >   if (ret || size <= PAGE_SIZE)
> > >   return ret;
> > > +#endif
> > >  
> > >   return __vmalloc_node_flags_caller(size, node, flags,
> > >   __builtin_return_address(0));
> > 
> > Well, it doesn't have to be done at compile-time, does it?  We could
> > add a knob (in debugfs, presumably) which enables this at runtime. 
> > That's far more user-friendly.
> 
> But who will turn it on in debugfs?

But who will turn it on in Kconfig?  Just a handful of developers.  We
could add SONFIG_DEBUG_SG to the list in
Documentation/process/submit-checklist.rst, but nobody reads that.

Also, a whole bunch of defconfigs set CONFIG_DEBUG_SG=y and some
googling indicates that they aren't the only ones...

> It should be default for debugging 
> kernels, so that users using them would report the error.

Well.  This isn't the first time we've wanted to enable expensive (or
noisy) debugging things in debug kernels, by any means.

So how could we define a debug kernel in which it's OK to enable such
things?

- Could be "it's an -rc kernel".  But then we'd be enabling a bunch of
  untested code when Linus cuts a release.

- Could be "it's an -rc kernel with SUBLEVEL <= 5".  But then we risk
  unexpected things happening when Linux cuts -rc6, which still isn't
  good.

- How about "it's an -rc kernel with odd-numbered SUBLEVEL and
  SUBLEVEL <= 5".  That way everybody who runs -rc1, -rc3 and -rc5 will
  have kvmalloc debugging enabled.  That's potentially nasty because
  vmalloc is much slower than kmalloc.  But kvmalloc() is only used for
  large and probably infrequent allocations, so it's probably OK.

I wonder how we get at SUBLEVEL from within .c.

Re: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM

2018-04-19 Thread Andrew Morton

On Thu, 19 Apr 2018 12:12:38 -0400 (EDT) Mikulas Patocka  
wrote:

> The kvmalloc function tries to use kmalloc and falls back to vmalloc if
> kmalloc fails.
> 
> Unfortunatelly, some kernel code has bugs - it uses kvmalloc and then
> uses DMA-API on the returned memory or frees it with kfree. Such bugs were
> found in the virtio-net driver, dm-integrity or RHEL7 powerpc-specific
> code.
> 
> These bugs are hard to reproduce because vmalloc falls back to kmalloc
> only if memory is fragmented.

Yes, that's nasty.

> In order to detect these bugs reliably I submit this patch that changes
> kvmalloc to always use vmalloc if CONFIG_DEBUG_VM is turned on.
> 
> ...
>
> --- linux-2.6.orig/mm/util.c  2018-04-18 15:46:23.0 +0200
> +++ linux-2.6/mm/util.c   2018-04-18 16:00:43.0 +0200
> @@ -395,6 +395,7 @@ EXPORT_SYMBOL(vm_mmap);
>   */
>  void *kvmalloc_node(size_t size, gfp_t flags, int node)
>  {
> +#ifndef CONFIG_DEBUG_VM
>   gfp_t kmalloc_flags = flags;
>   void *ret;
>  
> @@ -426,6 +427,7 @@ void *kvmalloc_node(size_t size, gfp_t f
>*/
>   if (ret || size <= PAGE_SIZE)
>   return ret;
> +#endif
>  
>   return __vmalloc_node_flags_caller(size, node, flags,
>   __builtin_return_address(0));

Well, it doesn't have to be done at compile-time, does it?  We could
add a knob (in debugfs, presumably) which enables this at runtime. 
That's far more user-friendly.

Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-13 Thread Andrew Morton

On Mon, 12 Mar 2018 21:28:57 -0700 Kees Cook <keesc...@chromium.org> wrote:

> On Mon, Mar 12, 2018 at 4:57 PM, Linus Torvalds
> <torva...@linux-foundation.org> wrote:
> > On Mon, Mar 12, 2018 at 3:55 PM, Andrew Morton
> > <a...@linux-foundation.org> wrote:
> >>
> >> Replacing the __builtin_choose_expr() with ?: works of course.
> >
> > Hmm. That sounds like the right thing to do. We were so myopically
> > staring at the __builtin_choose_expr() problem that we overlooked the
> > obvious solution.
> >
> > Using __builtin_constant_p() together with a ?: is in fact our common
> > pattern, so that should be fine. The only real reason to use
> > __builtin_choose_expr() is if you want to get the *type* to vary
> > depending on which side you choose, but that's not an issue for
> > min/max.
> 
> This doesn't solve it for -Wvla, unfortunately. That was the point of
> Josh's original suggestion of __builtin_choose_expr().
> 
> Try building with KCFLAGS=-Wval and checking net/ipv6/proc.c:
> 
> net/ipv6/proc.c: In function ‘snmp6_seq_show_item’:
> net/ipv6/proc.c:198:2: warning: ISO C90 forbids array ‘buff’ whose
> size can’t be evaluated [-Wvla]
>   unsigned long buff[SNMP_MIB_MAX];
>   ^~~~

PITA.  Didn't we once have a different way of detecting VLAs?  Some
post-compilation asm parser, iirc.

I suppose the world wouldn't end if we had a gcc version ifdef in
kernel.h.  We'll get to remove it in, oh, ten years.

Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-12 Thread Andrew Morton

On Fri, 9 Mar 2018 17:30:15 -0800 Kees Cook  wrote:

> > It's one reason why I wondered if simplifying the expression to have
> > just that single __builtin_constant_p() might not end up working..
> 
> Yeah, it seems like it doesn't bail out as "false" for complex
> expressions given to __builtin_constant_p().
> 
> If no magic solution, then which of these?
> 
> - go back to my original "multi-eval max only for constants" macro (meh)
> - add gcc version checks around this and similarly for -Wvla in the future 
> (eww)
> - raise gcc version (yikes)

Replacing the __builtin_choose_expr() with ?: works of course.  What
will be the runtime effects?

I tried replacing

__builtin_choose_expr(__builtin_constant_p(x) &&
  __builtin_constant_p(y),

with

__builtin_choose_expr(__builtin_constant_p(x + y),

but no success.

I'm not sure what else to try to trick gcc into working.

--- 
a/include/linux/kernel.h~kernelh-skip-single-eval-logic-on-literals-in-min-max-v3-fix
+++ a/include/linux/kernel.h
@@ -804,13 +804,10 @@ static inline void ftrace_dump(enum ftra
  * accidental VLA.
  */
 #define __min(t1, t2, x, y)\
-   __builtin_choose_expr(__builtin_constant_p(x) &&\
- __builtin_constant_p(y),  \
- (t1)(x) < (t2)(y) ? (t1)(x) : (t2)(y),\
- __single_eval_min(t1, t2, \
-   __UNIQUE_ID(min1_), \
-   __UNIQUE_ID(min2_), \
-   x, y))
+   ((__builtin_constant_p(x) && __builtin_constant_p(y)) ? \
+   ((t1)(x) < (t2)(y) ? (t1)(x) : (t2)(y)) :   \
+   (__single_eval_min(t1, t2, __UNIQUE_ID(min1_),  \
+ __UNIQUE_ID(min2_), x, y)))
 
 /**
  * min - return minimum of two values of the same or compatible types
@@ -826,13 +823,11 @@ static inline void ftrace_dump(enum ftra
max1 > max2 ? max1 : max2; })
 
 #define __max(t1, t2, x, y)\
-   __builtin_choose_expr(__builtin_constant_p(x) &&\
- __builtin_constant_p(y),  \
- (t1)(x) > (t2)(y) ? (t1)(x) : (t2)(y),\
- __single_eval_max(t1, t2, \
-   __UNIQUE_ID(max1_), \
-   __UNIQUE_ID(max2_), \
-   x, y))
+   ((__builtin_constant_p(x) && __builtin_constant_p(y)) ? \
+   ((t1)(x) > (t2)(y) ? (t1)(x) : (t2)(y)) :   \
+   (__single_eval_max(t1, t2, __UNIQUE_ID(max1_),  \
+ __UNIQUE_ID(max2_), x, y)))
+
 /**
  * max - return maximum of two values of the same or compatible types
  * @x: first value
_

Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-09 Thread Andrew Morton

On Fri, 9 Mar 2018 16:28:51 -0800 Linus Torvalds 
<torva...@linux-foundation.org> wrote:

> On Fri, Mar 9, 2018 at 4:07 PM, Andrew Morton <a...@linux-foundation.org> 
> wrote:
> >
> > A brief poke failed to reveal a workaround - gcc-4.4.4 doesn't appear
> > to know that __builtin_constant_p(x) is a constant.  Or something.
> 
> LOL.
> 
> I suspect it might be that it wants to evaluate
> __builtin_choose_expr() at an earlier stage than it evaluates
> __builtin_constant_p(), so it's not that it doesn't know that
> __builtin_constant_p() is a constant, it just might not know it *yet*.
> 
> Maybe.
> 
> Side note, if it's not that, but just the "complex" expression that
> has the logical 'and' etc, maybe the code could just use
> 
>   __builtin_constant_p((x)+(y))
> 
> or something.

I'll do a bit more poking at it.

> But yeah:
> 
> > Sigh.  Wasn't there some talk about modernizing our toolchain
> > requirements?
> 
> Maybe it's just time to give up on 4.4.  We wanted 4.5 for "asm goto",
> and once we upgrade to 4.5 I think Arnd said that no distro actually
> ships it, so we might as well go to 4.6.
> 
> So maybe this is just the excuse to finally make that official, if
> there is no clever workaround any more.

I wonder which gcc versions actually accept Kees's addition.

Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-09 Thread Andrew Morton

On Fri, 9 Mar 2018 12:05:36 -0800 Kees Cook  wrote:

> When max() is used in stack array size calculations from literal values
> (e.g. "char foo[max(sizeof(struct1), sizeof(struct2))]", the compiler
> thinks this is a dynamic calculation due to the single-eval logic, which
> is not needed in the literal case. This change removes several accidental
> stack VLAs from an x86 allmodconfig build:
> 
> $ diff -u before.txt after.txt | grep ^-
> -drivers/input/touchscreen/cyttsp4_core.c:871:2: warning: ISO C90 forbids 
> variable length array ‘ids’ [-Wvla]
> -fs/btrfs/tree-checker.c:344:4: warning: ISO C90 forbids variable length 
> array ‘namebuf’ [-Wvla]
> -lib/vsprintf.c:747:2: warning: ISO C90 forbids variable length array ‘sym’ 
> [-Wvla]
> -net/ipv4/proc.c:403:2: warning: ISO C90 forbids variable length array ‘buff’ 
> [-Wvla]
> -net/ipv6/proc.c:198:2: warning: ISO C90 forbids variable length array ‘buff’ 
> [-Wvla]
> -net/ipv6/proc.c:218:2: warning: ISO C90 forbids variable length array 
> ‘buff64’ [-Wvla]
> 
> Based on an earlier patch from Josh Poimboeuf.

v1, v2 and v3 of this patch all fail with gcc-4.4.4:

./include/linux/jiffies.h: In function 'jiffies_delta_to_clock_t':
./include/linux/jiffies.h:444: error: first argument to '__builtin_choose_expr' 
not a constant

That's with

#define __max(t1, t2, x, y) \
__builtin_choose_expr(__builtin_constant_p(x) &&\
  __builtin_constant_p(y) &&\
  __builtin_types_compatible_p(t1, t2), \
  (t1)(x) > (t2)(y) ? (t1)(x) : (t2)(y),\
  __single_eval_max(t1, t2, \
__UNIQUE_ID(max1_), \
__UNIQUE_ID(max2_), \
x, y))
/**
 * max - return maximum of two values of the same or compatible types
 * @x: first value
 * @y: second value
 */
#define max(x, y)   __max(typeof(x), typeof(y), x, y)


A brief poke failed to reveal a workaround - gcc-4.4.4 doesn't appear
to know that __builtin_constant_p(x) is a constant.  Or something.

Sigh.  Wasn't there some talk about modernizing our toolchain
requirements?

Re: [PATCH] net/9p: avoid -ERESTARTSYS leak to userspace

2018-03-09 Thread Andrew Morton

On Fri, 09 Mar 2018 21:41:38 +0100 Greg Kurz  wrote:

> If it was interrupted by a signal, the 9p client may need to send some
> more requests to the server for cleanup before returning to userspace.
> 
> To avoid such a last minute request to be interrupted right away, the
> client memorizes if a signal is pending, clears TIF_SIGPENDING, handles
> the request and calls recalc_sigpending() before returning.
> 
> Unfortunately, if the transmission of this cleanup request fails for any
> reason, the transport returns an error and the client propagates it right
> away, without calling recalc_sigpending().
> 
> This ends up with -ERESTARTSYS from the initially interrupted request
> crawling up to syscall exit, with TIF_SIGPENDING cleared by the cleanup
> request. The specific signal handling code, which is responsible for
> converting -ERESTARTSYS to -EINTR is not called, and userspace receives
> the confusing errno value:
> 
> open: Unknown error 512 (512)
> 
> This is really hard to hit in real life. I discovered the issue while
> working on hot-unplug of a virtio-9p-pci device with an instrumented
> QEMU allowing to control request completion.
> 
> Both p9_client_zc_rpc() and p9_client_rpc() functions have this buggy
> error path actually. Their code flow is a bit obscure and the best
> thing to do would probably be a full rewrite: to really ensure this
> situation of clearing TIF_SIGPENDING and returning -ERESTARTSYS can
> never happen.
> 
> But given the general lack of interest for the 9p code, I won't risk
> breaking more things. So this patch simply fixes the buggy paths in
> both functions with a trivial label+goto.
> 
> Thanks to Laurent Dufour for his help and suggestions on how to find
> the root cause and how to fix it.

That's a fairly straightforward-looking bug.  However the code still
looks a bit racy:


:   if (signal_pending(current)) {
:   sigpending = 1;
:   clear_thread_flag(TIF_SIGPENDING);
:   } else
:   sigpending = 0;
: 

what happens if signal_pending(current) becomes true right here?

:   err = c->trans_mod->request(c, req);


I'm surprised that the 9p client is mucking with signals at all. 
Signals are a userspace IPC scheme and kernel code should instead be
using the more powerful messaging mechanisms which we've developed. 
Ones which don't disrupt userspace state.

Why is this happening?  Is there some userspace/kernel interoperation
involved?

Re: [PATCH] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-08 Thread Andrew Morton

On Thu, 8 Mar 2018 13:40:45 -0800 Kees Cook  wrote:

> When max() is used in stack array size calculations from literal values
> (e.g. "char foo[max(sizeof(struct1), sizeof(struct2))]", the compiler
> thinks this is a dynamic calculation due to the single-eval logic, which
> is not needed in the literal case. This change removes several accidental
> stack VLAs from an x86 allmodconfig build:
> 
> $ diff -u before.txt after.txt | grep ^-
> -drivers/input/touchscreen/cyttsp4_core.c:871:2: warning: ISO C90 forbids 
> variable length array ‘ids’ [-Wvla]
> -fs/btrfs/tree-checker.c:344:4: warning: ISO C90 forbids variable length 
> array ‘namebuf’ [-Wvla]
> -lib/vsprintf.c:747:2: warning: ISO C90 forbids variable length array ‘sym’ 
> [-Wvla]
> -net/ipv4/proc.c:403:2: warning: ISO C90 forbids variable length array ‘buff’ 
> [-Wvla]
> -net/ipv6/proc.c:198:2: warning: ISO C90 forbids variable length array ‘buff’ 
> [-Wvla]
> -net/ipv6/proc.c:218:2: warning: ISO C90 forbids variable length array 
> ‘buff64’ [-Wvla]
> 
> Based on an earlier patch from Josh Poimboeuf.
> 
> ...
>
> --- a/include/linux/kernel.h
> +++ b/include/linux/kernel.h
> @@ -787,37 +787,57 @@ static inline void ftrace_dump(enum ftrace_dump_mode 
> oops_dump_mode) { }
>   * strict type-checking.. See the
>   * "unnecessary" pointer comparison.
>   */
> -#define __min(t1, t2, min1, min2, x, y) ({   \
> +#define __single_eval_min(t1, t2, min1, min2, x, y) ({   \
>   t1 min1 = (x);  \
>   t2 min2 = (y);  \
>   (void) ( == );\
>   min1 < min2 ? min1 : min2; })
>  
> +/*
> + * In the case of builtin constant values, there is no need to do the
> + * double-evaluation protection, so the raw comparison can be made.
> + * This allows min()/max() to be used in stack array allocations and
> + * avoid the compiler thinking it is a dynamic value leading to an
> + * accidental VLA.
> + */
> +#define __min(t1, t2, x, y)  \
> + __builtin_choose_expr(__builtin_constant_p(x) &&\
> +   __builtin_constant_p(y) &&\
> +   __builtin_types_compatible_p(t1, t2), \
> +   (t1)(x) < (t2)(y) ? (t1)(x) : (t2)(y),\
> +   __single_eval_min(t1, t2, \
> + __UNIQUE_ID(max1_), \
> + __UNIQUE_ID(max2_), \
> + x, y))
> +

Holy crap.

I suppose gcc will one day be fixed and we won't need this.

Is there a good reason to convert min()?  Surely nobody will be using
min to dimension an array - always max?  Just for symmetry, I guess.

Re: [PATCH 0/3] Remove accidental VLA usage

2018-03-08 Thread Andrew Morton

On Thu, 8 Mar 2018 09:02:36 -0600 Josh Poimboeuf  wrote:

> On Wed, Mar 07, 2018 at 07:30:44PM -0800, Kees Cook wrote:
> > This series adds SIMPLE_MAX() to be used in places where a stack array
> > is actually fixed, but the compiler still warns about VLA usage due to
> > confusion caused by the safety checks in the max() macro.
> > 
> > I'm sending these via -mm since that's where I've introduced SIMPLE_MAX(),
> > and they should all have no operational differences.
> 
> What if we instead simplify the max() macro's type checking so that GCC
> can more easily fold the array size constants?  The below patch seems to
> work:
> 
> -/*
> - * min()/max()/clamp() macros that also do
> - * strict type-checking.. See the
> - * "unnecessary" pointer comparison.
> - */
> -#define __min(t1, t2, min1, min2, x, y) ({   \
> - t1 min1 = (x);  \
> - t2 min2 = (y);  \
> - (void) ( == );\
> - min1 < min2 ? min1 : min2; })
> +extern long __error_incompatible_types_in_min_macro;
> +extern long __error_incompatible_types_in_max_macro;
> +
> +#define __min(t1, t2, x, y)  \
> + __builtin_choose_expr(__builtin_types_compatible_p(t1, t2), \
> +   (t1)(x) < (t2)(y) ? (t1)(x) : (t2)(y),\
> +   (t1)__error_incompatible_types_in_min_macro)

This will move the error detection from compile-time to link-time. 
That's tolerable I guess, but a bit sad and should be flagged in the
changelog at least.

Re: [netfilter-core] kernel panic: Out of memory and no killable processes... (2)

2018-02-07 Thread Andrew Morton

On Wed, 7 Feb 2018 18:44:39 +0100 Pablo Neira Ayuso <pa...@netfilter.org> wrote:

> Hi,
> 
> On Wed, Jan 31, 2018 at 09:19:16AM +0100, Michal Hocko wrote:
> [...]
> > Yeah, we do not BUG but rather fail instead. See __vmalloc_node_range.
> > My excavation tools pointed me to "VM: Rework vmalloc code to support 
> > mapping of arbitray pages"
> > by Christoph back in 2002. So yes, we can safely remove it finally. Se
> > below.
> > 
> > 
> > From 8d52e1d939d101b0dafed6ae5c3c1376183e65bb Mon Sep 17 00:00:00 2001
> > From: Michal Hocko <mho...@suse.com>
> > Date: Wed, 31 Jan 2018 09:16:56 +0100
> > Subject: [PATCH] net/netfilter/x_tables.c: remove size check
> > 
> > Back in 2002 vmalloc used to BUG on too large sizes. We are much better
> > behaved these days and vmalloc simply returns NULL for those. Remove
> > the check as it simply not needed and the comment even misleading.
> > 
> > Suggested-by: Andrew Morton <a...@linux-foundation.org>
> > Signed-off-by: Michal Hocko <mho...@suse.com>
> > ---
> >  net/netfilter/x_tables.c | 4 
> >  1 file changed, 4 deletions(-)
> > 
> > diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
> > index b55ec5aa51a6..48a6ff620493 100644
> > --- a/net/netfilter/x_tables.c
> > +++ b/net/netfilter/x_tables.c
> > @@ -999,10 +999,6 @@ struct xt_table_info *xt_alloc_table_info(unsigned int 
> > size)
> > if (sz < sizeof(*info))
> > return NULL;
> >  
> > -   /* Pedantry: prevent them from hitting BUG() in vmalloc.c --RR */
> > -   if ((SMP_ALIGN(size) >> PAGE_SHIFT) + 2 > totalram_pages)
> > -   return NULL;
> > -
> > /* __GFP_NORETRY is not fully supported by kvmalloc but it should
> >  * work reasonably well if sz is too large and bail out rather
> >  * than shoot all processes down before realizing there is nothing
> 
> Patchwork didn't catch this patch for some reason, would you mind to
> resend?

From: Michal Hocko <mho...@suse.com>
Subject: net/netfilter/x_tables.c: remove size check

Back in 2002 vmalloc used to BUG on too large sizes.  We are much better
behaved these days and vmalloc simply returns NULL for those.  Remove the
check as it simply not needed and the comment is even misleading.

Link: http://lkml.kernel.org/r/20180131081916.go21...@dhcp22.suse.cz
Suggested-by: Andrew Morton <a...@linux-foundation.org>
Signed-off-by: Michal Hocko <mho...@suse.com>
Reviewed-by: Andrew Morton <a...@linux-foundation.org>
Cc: Florian Westphal <f...@strlen.de>
Cc: David S. Miller <da...@davemloft.net>
Signed-off-by: Andrew Morton <a...@linux-foundation.org>
---

 net/netfilter/x_tables.c |4 
 1 file changed, 4 deletions(-)

diff -puN net/netfilter/x_tables.c~net-netfilter-x_tablesc-remove-size-check 
net/netfilter/x_tables.c
--- a/net/netfilter/x_tables.c~net-netfilter-x_tablesc-remove-size-check
+++ a/net/netfilter/x_tables.c
@@ -1004,10 +1004,6 @@ struct xt_table_info *xt_alloc_table_inf
if (sz < sizeof(*info))
return NULL;
 
-   /* Pedantry: prevent them from hitting BUG() in vmalloc.c --RR */
-   if ((size >> PAGE_SHIFT) + 2 > totalram_pages)
-   return NULL;
-
/* __GFP_NORETRY is not fully supported by kvmalloc but it should
 * work reasonably well if sz is too large and bail out rather
 * than shoot all processes down before realizing there is nothing
_

Re: [netfilter-core] kernel panic: Out of memory and no killable processes... (2)

2018-01-30 Thread Andrew Morton

On Tue, 30 Jan 2018 15:01:04 +0100 Michal Hocko  wrote:

> > Well, this is not about syzkaller, it merely pointed out a potential
> > DoS... And that has to be addressed somehow.
> 
> So how about this?
> ---

argh ;)

> >From d48e950f1b04f234b57b9e34c363bdcfec10aeee Mon Sep 17 00:00:00 2001
> From: Michal Hocko 
> Date: Tue, 30 Jan 2018 14:51:07 +0100
> Subject: [PATCH] net/netfilter/x_tables.c: make allocation less aggressive
> 
> syzbot has noticed that xt_alloc_table_info can allocate a lot of
> memory. This is an admin only interface but an admin in a namespace
> is sufficient as well. eacd86ca3b03 ("net/netfilter/x_tables.c: use
> kvmalloc() in xt_alloc_table_info()") has changed the opencoded
> kmalloc->vmalloc fallback into kvmalloc. It has dropped __GFP_NORETRY on
> the way because vmalloc has simply never fully supported __GFP_NORETRY
> semantic. This is still the case because e.g. page tables backing the
> vmalloc area are hardcoded GFP_KERNEL.
> 
> Revert back to __GFP_NORETRY as a poors man defence against excessively
> large allocation request here. We will not rule out the OOM killer
> completely but __GFP_NORETRY should at least stop the large request
> in most cases.
> 
> Fixes: eacd86ca3b03 ("net/netfilter/x_tables.c: use kvmalloc() in 
> xt_alloc_table_info()")
> Signed-off-by: Michal Hocko 
> ---
>  net/netfilter/x_tables.c | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
> index d8571f414208..a5f5c29bcbdc 100644
> --- a/net/netfilter/x_tables.c
> +++ b/net/netfilter/x_tables.c
> @@ -1003,7 +1003,13 @@ struct xt_table_info *xt_alloc_table_info(unsigned int 
> size)
>   if ((SMP_ALIGN(size) >> PAGE_SHIFT) + 2 > totalram_pages)
>   return NULL;

offtopic: preceding comment here is "prevent them from hitting BUG() in
vmalloc.c".  I suspect this is ancient code and vmalloc sure as heck
shouldn't go BUG with this input.  And it should be using `sz' ;)

So I suspect and hope that this code can be removed.  If not, let's fix
vmalloc!

> - info = kvmalloc(sz, GFP_KERNEL);
> + /*
> +  * __GFP_NORETRY is not fully supported by kvmalloc but it should
> +  * work reasonably well if sz is too large and bail out rather
> +  * than shoot all processes down before realizing there is nothing
> +  * more to reclaim.
> +  */
> + info = kvmalloc(sz, GFP_KERNEL | __GFP_NORETRY);
>   if (!info)
>   return NULL;

checkpatch sayeth

networking block comments don't use an empty /* line, use /* Comment...

So I'll do that and shall scoot the patch Davewards.

Re: [PATCH V11 4/5] vsprintf: add printk specifier %px

2017-11-29 Thread Andrew Morton

On Wed, 29 Nov 2017 13:05:04 +1100 "Tobin C. Harding"  wrote:

> printk specifier %p now hashes all addresses before printing. Sometimes
> we need to see the actual unmodified address. This can be achieved using
> %lx but then we face the risk that if in future we want to change the
> way the Kernel handles printing of pointers we will have to grep through
> the already existent 50 000 %lx call sites. Let's add specifier %px as a
> clear, opt-in, way to print a pointer and maintain some level of
> isolation from all the other hex integer output within the Kernel.
> 
> Add printk specifier %px to print the actual unmodified address.
> 
> ...
>
> +Unmodified Addresses
> +
> +
> +::
> +
> + %px 01234567 or 0123456789abcdef
> +
> +For printing pointers when you _really_ want to print the address. Please
> +consider whether or not you are leaking sensitive information about the
> +Kernel layout in memory before printing pointers with %px. %px is
> +functionally equivalent to %lx. %px is preferred to %lx because it is more
> +uniquely grep'able. If, in the future, we need to modify the way the Kernel
> +handles printing pointers it will be nice to be able to find the call
> +sites.
> +

You might want to add a checkpatch rule which emits a stern
do-you-really-want-to-do-this warning when someone uses %px.

Re: [PATCH V11 3/5] printk: hash addresses printed with %p

2017-11-29 Thread Andrew Morton

On Wed, 29 Nov 2017 13:05:03 +1100 "Tobin C. Harding"  wrote:

> Currently there exist approximately 14 000 places in the kernel where
> addresses are being printed using an unadorned %p. This potentially
> leaks sensitive information regarding the Kernel layout in memory. Many
> of these calls are stale, instead of fixing every call lets hash the
> address by default before printing. This will of course break some
> users, forcing code printing needed addresses to be updated.
> 
> Code that _really_ needs the address will soon be able to use the new
> printk specifier %px to print the address.
> 
> For what it's worth, usage of unadorned %p can be broken down as
> follows (thanks to Joe Perches).
> 
> $ git grep -E '%p[^A-Za-z0-9]' | cut -f1 -d"/" | sort | uniq -c
>1084 arch
>  20 block
>  10 crypto
>  32 Documentation
>8121 drivers
>1221 fs
> 143 include
> 101 kernel
>  69 lib
> 100 mm
>1510 net
>  40 samples
>   7 scripts
>  11 security
> 166 sound
> 152 tools
>   2 virt
> 
> Add function ptr_to_id() to map an address to a 32 bit unique
> identifier. Hash any unadorned usage of specifier %p and any malformed
> specifiers.
> 
> ...
>
> @@ -1644,6 +1646,73 @@ char *device_node_string(char *buf, char *end, struct 
> device_node *dn,
>   return widen_string(buf, buf - buf_start, end, spec);
>  }
>  
> +static bool have_filled_random_ptr_key __read_mostly;
> +static siphash_key_t ptr_key __read_mostly;
> +
> +static void fill_random_ptr_key(struct random_ready_callback *unused)
> +{
> + get_random_bytes(_key, sizeof(ptr_key));
> + /*
> +  * have_filled_random_ptr_key==true is dependent on get_random_bytes().
> +  * ptr_to_id() needs to see have_filled_random_ptr_key==true
> +  * after get_random_bytes() returns.
> +  */
> + smp_mb();
> + WRITE_ONCE(have_filled_random_ptr_key, true);
> +}

I don't think I'm seeing anything which prevents two CPUs from
initializing ptr_key at the same time.  Probably doesn't matter much...

Re: [PATCH V11 0/5] hash addresses printed with %p

2017-11-29 Thread Andrew Morton

On Wed, 29 Nov 2017 13:05:00 +1100 "Tobin C. Harding"  wrote:

> Currently there exist approximately 14 000 places in the Kernel where
> addresses are being printed using an unadorned %p. This potentially
> leaks sensitive information regarding the Kernel layout in memory. Many
> of these calls are stale, instead of fixing every call lets hash the
> address by default before printing. This will of course break some
> users, forcing code printing needed addresses to be updated. We can add
> a printk specifier for this purpose (%px) to give developers a clear
> upgrade path for breakages caused by applying this patch set.
> 
> The added advantage of hashing %p is that security is now opt-out, if
> you _really_ want the address you have to work a little harder and use
> %px.
> 
> The idea for creating the printk specifier %px to print the actual
> address was suggested by Kees Cook (see below for email threads by
> subject).

Maybe I'm being thick, but...  if we're rendering these addresses
unusable by hashing them, why not just print something like
"" in their place?  That loses the uniqueness thing but I
wonder how valuable that is in practice?

Re: [PATCH] mm, page_alloc: re-enable softirq use of per-cpu page allocator

2017-04-10 Thread Andrew Morton

On Mon, 10 Apr 2017 16:08:21 +0100 Mel Gorman  
wrote:

> IRQ context were excluded from using the Per-Cpu-Pages (PCP) lists caching
> of order-0 pages in commit 374ad05ab64d ("mm, page_alloc: only use per-cpu
> allocator for irq-safe requests").
> 
> This unfortunately also included excluded SoftIRQ.  This hurt the performance
> for the use-case of refilling DMA RX rings in softirq context.

Out of curiosity: by how much did it "hurt"?

Tariq found:

: I disabled the page-cache (recycle) mechanism to stress the page
: allocator, and see a drastic degradation in BW, from 47.5 G in v4.10 to
: 31.4 G in v4.11-rc1 (34% drop).

then with this patch he found

: It looks very good!  I get line-rate (94Gbits/sec) with 8 streams, in
: comparison to less than 55Gbits/sec before.

Can I take this to mean that the page allocator's per-cpu-pages feature
ended up doubling the performance of this driver?  Better than the
driver's private page recycling?  I'd like to believe that, but am
having trouble doing so ;)

> This patch re-allow softirq context, which should be safe by disabling
> BH/softirq, while accessing the list.  PCP-lists access from both hard-IRQ
> and NMI context must not be allowed.  Peter Zijlstra says in_nmi() code
> never access the page allocator, thus it should be sufficient to only test
> for !in_irq().
> 
> One concern with this change is adding a BH (enable) scheduling point at
> both PCP alloc and free. If further concerns are highlighted by this patch,
> the result wiill be to revert 374ad05ab64d and try again at a later date
> to offset the irq enable/disable overhead.

Re: [mm PATCH 0/3] Page fragment updates

2016-12-05 Thread Andrew Morton

On Mon, 5 Dec 2016 09:01:12 -0800 Alexander Duyck  
wrote:

> On Tue, Nov 29, 2016 at 10:23 AM, Alexander Duyck
>  wrote:
> > This patch series takes care of a few cleanups for the page fragments API.
> >
> > ...
> 
> It's been about a week since I submitted this series.  Just wanted to
> check in and see if anyone had any feedback or if this is good to be
> accepted for 4.10-rc1 with the rest of the set?

Looks good to me.  I have it all queued for post-4.9 processing.

Re: [mm PATCH v3 21/23] mm: Add support for releasing multiple instances of a page

2016-11-21 Thread Andrew Morton

On Mon, 21 Nov 2016 08:21:39 -0800 Alexander Duyck  
wrote:

> >> + __free_pages_ok(page, order);
> >> + }
> >> +}
> >> +EXPORT_SYMBOL(__page_frag_drain);
> >
> > It's an exported-to-modules library function.  It should be documented,
> > please?  The page-frag API is only partially documented, but that's no
> > excuse.
> 
> Okay.  I assume you want the documentation as a follow-up patch since
> I received a notice that the patch was added to -mm?

Yes please.  Or a replacement patch which I'll temporarily turn into a
delta, either is fine.

> If you would like I could look at doing a couple of renaming patches
> so that we make the API a bit more consistent.  I could move the
> __alloc and __free to what you have suggested, and then take a look at
> trying to rename the refill/drain to be a bit more consistent in terms
> of what they are supposed to work on and how they are supposed to be
> used.

I think that would be better - it's hardly high-priority but a bit of
attention to the documentation and naming conventions would help tidy
things up.  When you can't find anything else to do ;)

Re: [mm PATCH v3 21/23] mm: Add support for releasing multiple instances of a page

2016-11-18 Thread Andrew Morton

On Thu, 10 Nov 2016 06:36:06 -0500 Alexander Duyck 
 wrote:

> This patch adds a function that allows us to batch free a page that has
> multiple references outstanding.  Specifically this function can be used to
> drop a page being used in the page frag alloc cache.  With this drivers can
> make use of functionality similar to the page frag alloc cache without
> having to do any workarounds for the fact that there is no function that
> frees multiple references.
> 
> ...
>
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -506,6 +506,8 @@ extern void free_hot_cold_page(struct page *page, bool 
> cold);
>  extern void free_hot_cold_page_list(struct list_head *list, bool cold);
>  
>  struct page_frag_cache;
> +extern void __page_frag_drain(struct page *page, unsigned int order,
> +   unsigned int count);
>  extern void *__alloc_page_frag(struct page_frag_cache *nc,
>  unsigned int fragsz, gfp_t gfp_mask);
>  extern void __free_page_frag(void *addr);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 0fbfead..54fea40 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3912,6 +3912,20 @@ static struct page *__page_frag_refill(struct 
> page_frag_cache *nc,
>   return page;
>  }
>  
> +void __page_frag_drain(struct page *page, unsigned int order,
> +unsigned int count)
> +{
> + VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
> +
> + if (page_ref_sub_and_test(page, count)) {
> + if (order == 0)
> + free_hot_cold_page(page, false);
> + else
> + __free_pages_ok(page, order);
> + }
> +}
> +EXPORT_SYMBOL(__page_frag_drain);

It's an exported-to-modules library function.  It should be documented,
please?  The page-frag API is only partially documented, but that's no
excuse.

And perhaps documentation will help explain the naming choice.  Why
"drain"?  I'd have expected "put"?

And why the leading underscores.  The page-frag API is pretty weird :(

And inconsistent.  __alloc_page_frag -> page_frag_alloc,
__free_page_frag -> page_frag_free(), etc.  I must have been asleep
when I let that lot through.

Re: [PATCH v2] fs/select: add vmalloc fallback for select(2)

2016-09-26 Thread Andrew Morton

On Thu, 22 Sep 2016 18:43:59 +0200 Vlastimil Babka  wrote:

> The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size grows
> with the number of fds passed. We had a customer report page allocation
> failures of order-4 for this allocation. This is a costly order, so it might
> easily fail, as the VM expects such allocation to have a lower-order fallback.
> 
> Such trivial fallback is vmalloc(), as the memory doesn't have to be
> physically contiguous. Also the allocation is temporary for the duration of 
> the
> syscall, so it's unlikely to stress vmalloc too much.
> 
> Note that the poll(2) syscall seems to use a linked list of order-0 pages, so
> it doesn't need this kind of fallback.
> 
> ...
>
> --- a/fs/select.c
> +++ b/fs/select.c
> @@ -29,6 +29,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  
> @@ -558,6 +559,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set 
> __user *outp,
>   struct fdtable *fdt;
>   /* Allocate small arguments on the stack to save memory and be faster */
>   long stack_fds[SELECT_STACK_ALLOC/sizeof(long)];
> + unsigned long alloc_size;
>  
>   ret = -EINVAL;
>   if (n < 0)
> @@ -580,8 +582,12 @@ int core_sys_select(int n, fd_set __user *inp, fd_set 
> __user *outp,
>   bits = stack_fds;
>   if (size > sizeof(stack_fds) / 6) {
>   /* Not enough space in on-stack array; must use kmalloc */
> + alloc_size = 6 * size;

Well.  `size' is `unsigned'.  The multiplication will be done as 32-bit
so there was no point in making `alloc_size' unsigned long.

So can we tighten up the types in this function?  size_t might make
sense, but vmalloc() takes a ulong.

>   ret = -ENOMEM;
> - bits = kmalloc(6 * size, GFP_KERNEL);
> + bits = kmalloc(alloc_size, GFP_KERNEL|__GFP_NOWARN);
> + if (!bits && alloc_size > PAGE_SIZE)
> + bits = vmalloc(alloc_size);

I don't share Eric's concerns about performance here.  If the vmalloc()
is called, we're about to write to that quite large amount of memory
which we just allocated, and the vmalloc() overhead will be relatively
low.

>   if (!bits)
>   goto out_nofds;
>   }
> @@ -618,7 +624,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set 
> __user *outp,
>  
>  out:
>   if (bits != stack_fds)
> - kfree(bits);
> + kvfree(bits);
>  out_nofds:
>   return ret;

It otherwise looks OK to me.

Re: [PATCH 3/3] mm: memcontrol: consolidate cgroup socket tracking

2016-09-14 Thread Andrew Morton

On Thu, 15 Sep 2016 13:34:24 +0800 kbuild test robot  wrote:

> Hi Johannes,
> 
> [auto build test ERROR on net/master]
> [also build test ERROR on v4.8-rc6 next-20160914]
> [if your patch is applied to the wrong git tree, please drop us a note to 
> help improve the system]
> [Suggest to use git(>=2.9.0) format-patch --base= (or --base=auto for 
> convenience) to record what (public, well-known) commit your patch series was 
> built on]
> [Check https://git-scm.com/docs/git-format-patch for more information]
> 
> url:
> https://github.com/0day-ci/linux/commits/Johannes-Weiner/mm-memcontrol-make-per-cpu-charge-cache-IRQ-safe-for-socket-accounting/20160915-035634
> config: m68k-sun3_defconfig (attached as .config)
> compiler: m68k-linux-gcc (GCC) 4.9.0
> reproduce:
> wget 
> https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross
>  -O ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # save the attached .config to linux build tree
> make.cross ARCH=m68k 
> 
> All errors (new ones prefixed by >>):
> 
>net/built-in.o: In function `sk_alloc':
> >> (.text+0x4076): undefined reference to `mem_cgroup_sk_alloc'
>net/built-in.o: In function `__sk_destruct':
> >> sock.c:(.text+0x457e): undefined reference to `mem_cgroup_sk_free'
>net/built-in.o: In function `sk_clone_lock':
>(.text+0x4f1c): undefined reference to `mem_cgroup_sk_alloc'

This?

--- a/mm/memcontrol.c~mm-memcontrol-consolidate-cgroup-socket-tracking-fix
+++ a/mm/memcontrol.c
@@ -5655,9 +5655,6 @@ void mem_cgroup_sk_alloc(struct sock *sk
 {
struct mem_cgroup *memcg;
 
-   if (!mem_cgroup_sockets_enabled)
-   return;
-
/*
 * Socket cloning can throw us here with sk_memcg already
 * filled. It won't however, necessarily happen from
--- a/net/core/sock.c~mm-memcontrol-consolidate-cgroup-socket-tracking-fix
+++ a/net/core/sock.c
@@ -1385,7 +1385,8 @@ static void sk_prot_free(struct proto *p
slab = prot->slab;
 
cgroup_sk_free(>sk_cgrp_data);
-   mem_cgroup_sk_free(sk);
+   if (mem_cgroup_sockets_enabled)
+   mem_cgroup_sk_free(sk);
security_sk_free(sk);
if (slab != NULL)
kmem_cache_free(slab, sk);
@@ -1422,7 +1423,8 @@ struct sock *sk_alloc(struct net *net, i
sock_net_set(sk, net);
atomic_set(>sk_wmem_alloc, 1);
 
-   mem_cgroup_sk_alloc(sk);
+   if (mem_cgroup_sockets_enabled)
+   mem_cgroup_sk_alloc(sk);
cgroup_sk_alloc(>sk_cgrp_data);
sock_update_classid(>sk_cgrp_data);
sock_update_netprioidx(>sk_cgrp_data);
@@ -1569,7 +1571,8 @@ struct sock *sk_clone_lock(const struct
newsk->sk_incoming_cpu = raw_smp_processor_id();
atomic64_set(>sk_cookie, 0);
 
-   mem_cgroup_sk_alloc(newsk);
+   if (mem_cgroup_sockets_enabled)
+   mem_cgroup_sk_alloc(newsk);
cgroup_sk_alloc(>sk_cgrp_data);
 
/*
_

Re: [PATCH] remove lots of IS_ERR_VALUE abuses

2016-05-27 Thread Andrew Morton

On Fri, 27 May 2016 23:23:25 +0200 Arnd Bergmann  wrote:

> Most users of IS_ERR_VALUE() in the kernel are wrong, as they
> pass an 'int' into a function that takes an 'unsigned long'
> argument. This happens to work because the type is sign-extended
> on 64-bit architectures before it gets converted into an
> unsigned type.
> 
> However, anything that passes an 'unsigned short' or 'unsigned int'
> argument into IS_ERR_VALUE() is guaranteed to be broken, as are
> 8-bit integers and types that are wider than 'unsigned long'.
> 
> Andrzej Hajda has already fixed a lot of the worst abusers that
> were causing actual bugs, but it would be nice to prevent any
> users that are not passing 'unsigned long' arguments.
> 
> This patch changes all users of IS_ERR_VALUE() that I could find
> on 32-bit ARM randconfig builds and x86 allmodconfig. For the
> moment, this doesn't change the definition of IS_ERR_VALUE()
> because there are probably still architecture specific users
> elsewhere.

So you do plan to add some sort of typechecking into IS_ERR_VALUE()?

> Almost all the warnings I got are for files that are better off
> using 'if (err)' or 'if (err < 0)'.
> The only legitimate user I could find that we get a warning for
> is the (32-bit only) freescale fman driver, so I did not remove
> the IS_ERR_VALUE() there but changed the type to 'unsigned long'.
> For 9pfs, I just worked around one user whose calling conventions
> are so obscure that I did not dare change the behavior.
> 
> I was using this definition for testing:
> 
>  #define IS_ERR_VALUE(x) ((unsigned long*)NULL == (typeof (x)*)NULL && \
>unlikely((unsigned long long)(x) >= (unsigned long 
> long)(typeof(x))-MAX_ERRNO))
> 
> which ends up making all 16-bit or wider types work correctly with
> the most plausible interpretation of what IS_ERR_VALUE() was supposed
> to return according to its users, but also causes a compile-time
> warning for any users that do not pass an 'unsigned long' argument.
> 
> I suggested this approach earlier this year, but back then we ended
> up deciding to just fix the users that are obviously broken. After
> the initial warning that caused me to get involved in the discussion
> (fs/gfs2/dir.c) showed up again in the mainline kernel, Linus
> asked me to send the whole thing again.

Re: [Y2038] [RESEND PATCH 2/3] fs: poll/select/recvmmsg: use timespec64 for timeout events

2016-05-04 Thread Andrew Morton

On Wed, 04 May 2016 23:08:11 +0200 Arnd Bergmann  wrote:

> > But I'm less comfortable making the call on this one. It looks
> > relatively straight forward, but it would be good to have maintainer
> > acks before I add it to my tree.
> 
> Agreed. Feel free to add my
> 
> Reviewed-by: Arnd Bergmann 
> 
> at least (whoever picks it up).

In reply to [1/3] John said

: Looks ok at the first glance. I've queued these up for testing,
: however I only got #1 and #3 of the set. Are you hoping these two
: patches will go through tip/timers/core or are you looking for acks so
: they can go via another tree?

However none of the patches are in linux-next.

John had qualms about [2/3], but it looks like a straightforward
substitution in areas which will get plenty of testing

I'll grab the patches for now to get them some external testing.

Re: [Bug 117521] New: BUG: unable to handle kernel paging request at 000001a400015ff4

2016-05-02 Thread Andrew Morton


(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

Thanks.  It's probably a TIPC issue.

On Mon, 02 May 2016 16:44:18 + bugzilla-dae...@bugzilla.kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=117521
> 
> Bug ID: 117521
>Summary: BUG: unable to handle kernel paging request at
> 01a400015ff4
>Product: Memory Management
>Version: 2.5
> Kernel Version: 4.4.0
>   Hardware: All
> OS: Linux
>   Tree: Mainline
> Status: NEW
>   Severity: high
>   Priority: P1
>  Component: Other
>   Assignee: a...@linux-foundation.org
>   Reporter: gbala...@gmail.com
> Regression: No
> 
> [   65.954959] sm-msp-queue[1279]: unable to qualify my own domain name
> (dcsx5testslot3) -- using short name
> [  632.098785] perf interrupt took too long (2505 > 2500), lowering
> kernel.perf_event_max_sample_rate to 5
> [ 5880.428123] perf interrupt took too long (5585 > 5000), lowering
> kernel.perf_event_max_sample_rate to 25000
> [17934.014969] CE: hpet increased min_delta_ns to 20115 nsec
> [38956.721789] CE: hpet4 increased min_delta_ns to 20115 nsec
> [46927.872827] hrtimer: interrupt took 63361 ns
> [101662.241093] CE: hpet2 increased min_delta_ns to 20115 nsec
> [245973.044600] CE: hpet6 increased min_delta_ns to 20115 nsec
> [368639.565040] show_signal_msg: 6 callbacks suppressed
> [375832.498126] BUG: unable to handle kernel paging request at 
> 01a400015ff4
> [375832.505300] IP: [] queued_spin_lock_slowpath+0xe6/0x160
> [375832.512394] PGD 0 
> [375832.514657] Oops: 0002 [#1] SMP
> [375832.518306] Modules linked in: nf_log_ipv6 nf_log_ipv4 nf_log_common 
> xt_LOG
> sctp libcrc32c e1000e tipc udp_tunnel ip6_udp_tunnel 8021q garp iTCO_wdt
> xt_physdev br_netfilter bridge stp llc nf_conntrack_ipv4 nf_defrag_ipv4
> ipmiq_drv(O) sio_mmc(O) ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6
> nf_defrag_ipv6 xt_state nf_conntrack lockd ip6table_filter event_drv(O)
> ip6_tables grace pt_timer_info(O) ddi(O) usb_storage ixgbe igb i2c_i801
> iTCO_vendor_support i2c_algo_bit ioatdma intel_ips i2c_core pcspkr sunrpc ptp
> mdio dca pps_core lpc_ich tpm_tis mfd_core tpm [last unloaded: iTCO_wdt]
> [375832.573693] CPU: 4 PID: 0 Comm: swapper/4 Tainted: G   O4.4.0
> #14
> [375832.581385] Hardware name: PT AMC124/Base Board Product Name, BIOS
> LGNAJFIP.PTI.0012.P15 01/15/2014
> [375832.591028] task: 880351a89b40 ti: 880351a9 task.ti:
> 880351a9
> [375832.599026] RIP: 0010:[]  []
> queued_spin_lock_slowpath+0xe6/0x160
> [375832.608964] RSP: 0018:88035fc83d58  EFLAGS: 00010002
> [375832.614825] RAX: 1447 RBX: 0292 RCX:
> 88035fc95fc0
> [375832.622743] RDX: 01a400015ff4 RSI: 0014 RDI:
> 880351232f80
> [375832.630567] RBP: 88035fc83d58 R08: 0101 R09:
> 0004
> [375832.638348] R10:  R11:  R12:
> 01001002
> [375832.645919] R13: 0001 R14:  R15:
> 
> [375832.653610] FS:  () GS:88035fc8()
> knlGS:
> [375832.662317] CS:  0010 DS:  ES:  CR0: 8005003b
> [375832.668483] CR2: 01a400015ff4 CR3: 01c0a000 CR4:
> 06e0
> [375832.676133] Stack:
> [375832.678344]  88035fc83d78 816de2c1 88034a8bba60
> 880351232f80
> [375832.686163]  88035fc83db8 810bc592 88035fc83dc8
> 880351758000
> [375832.694139]  01001002  b802f4bd
> a024e6f0
> [375832.702154] Call Trace:
> [375832.704844]   
> [375832.707018]  [] _raw_spin_lock_irqsave+0x31/0x40
> [375832.713970]  [] __wake_up+0x32/0x70
> [375832.719444]  [] ? tipc_recv_stream+0x370/0x370 [tipc]
> [375832.726589]  [] sock_def_wakeup+0x30/0x40
> [375832.732566]  [] tipc_sk_timeout+0x148/0x180 [tipc]
> [375832.739388]  [] ? tipc_recv_stream+0x370/0x370 [tipc]
> [375832.746507]  [] call_timer_fn+0x44/0x110
> [375832.752378]  [] ? cascade+0x4a/0x80
> [375832.757848]  [] ? tipc_recv_stream+0x370/0x370 [tipc]
> [375832.764871]  [] run_timer_softirq+0x22c/0x280
> [375832.771175]  [] __do_softirq+0xc8/0x260
> [375832.776958]  [] irq_exit+0x83/0xb0
> [375832.782369]  [] do_IRQ+0x65/0xf0
> [375832.787607]  [] common_interrupt+0x7f/0x7f
> [375832.793709]   
> [375832.795803]  [] ? cpuidle_enter_state+0xad/0x200
> [375832.802765]  [] ? cpuidle_enter_state+0x91/0x200
> [375832.809338]  [] cpuidle_enter+0x17/0x20
> [375832.815155]  [] call_cpuidle+0x37/0x60
> [375832.821184]  [] ? cpuidle_select+0x13/0x20
> [375832.827249]  [] cpu_startup_entry+0x211/0x2d0
> [375832.833535]  [] start_secondary+0x103/0x130
> [375832.839759] Code: 87 47 02 c1 e0 10 85 c0 74 38 48 89 c2 c1 e8 12 48 c1 ea
> 0c 83 e8 01 83 e2 30 48 98 48 81 c2 c0 5f 01 00 48 03 14 c5

Re: [PATCH] mm: memcontrol: only manage socket pressure for CONFIG_INET

2015-12-09 Thread Andrew Morton

On Wed, 9 Dec 2015 13:58:58 -0500 Johannes Weiner  wrote:

> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 6faea81e66d7..73cd572167bb 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -4220,13 +4220,13 @@ mem_cgroup_css_online(struct cgroup_subsys_state 
> > *css)
> > if (ret)
> > return ret;
> >  
> > +#ifdef CONFIG_INET
> >  #ifdef CONFIG_MEMCG_LEGACY_KMEM
> > ret = tcp_init_cgroup(memcg);
> > if (ret)
> > return ret;
> >  #endif
> 
> The calls to tcp_init_cgroup() appear earlier in the series than "mm:
> memcontrol: hook up vmpressure to socket pressure". However, they get
> moved around a few times so fixing it earlier means respinning the
> series. Andrew, it's up to you whether we take the bisectability hit
> for !CONFIG_INET && CONFIG_MEMCG (how common is this?) or whether you
> want me to resend the series.

hm, drat, I was suspecting dependency issues here, but a test build
said it was OK.

Actually, I was expecting this patch series to depend on the linux-next
cgroup2 changes, but that doesn't appear to be the case.  *should* this
series be staged after the cgroup2 code?

Regarding this particular series: yes, I think we can live with a
bisection hole for !CONFIG_INET && CONFIG_MEMCG users.  But I'm not
sure why we're discussing bisection issues, because Arnd's build
failure occurs with everything applied?

> Sorry about the trouble. I don't have a git tree on kernel.org because
> we don't really use git in -mm, but the downside is that we don't get
> the benefits of the automatic build testing for all kinds of configs.
> I'll try to set up a git tree to expose series to full build coverage
> before they hit -mm and -next.

This sort of thing happens quite rarely.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] mm: memcontrol: only manage socket pressure for CONFIG_INET

2015-12-09 Thread Andrew Morton

On Wed, 9 Dec 2015 18:05:05 -0500 Johannes Weiner <han...@cmpxchg.org> wrote:

> On Wed, Dec 09, 2015 at 02:28:36PM -0800, Andrew Morton wrote:
> > On Wed, 9 Dec 2015 13:58:58 -0500 Johannes Weiner <han...@cmpxchg.org> 
> > wrote:
> > > The calls to tcp_init_cgroup() appear earlier in the series than "mm:
> > > memcontrol: hook up vmpressure to socket pressure". However, they get
> > > moved around a few times so fixing it earlier means respinning the
> > > series. Andrew, it's up to you whether we take the bisectability hit
> > > for !CONFIG_INET && CONFIG_MEMCG (how common is this?) or whether you
> > > want me to resend the series.
> > 
> > hm, drat, I was suspecting dependency issues here, but a test build
> > said it was OK.
> > 
> > Actually, I was expecting this patch series to depend on the linux-next
> > cgroup2 changes, but that doesn't appear to be the case.  *should* this
> > series be staged after the cgroup2 code?
> 
> Code-wise they are independent. My stuff is finishing up the new memcg
> control knobs, the cgroup2 stuff is changing how and when those knobs
> are exposed from within the cgroup core. I'm not relying on any recent
> changes in the cgroup core AFAICS, so the order shouldn't matter here.

OK, thanks.

> > Regarding this particular series: yes, I think we can live with a
> > bisection hole for !CONFIG_INET && CONFIG_MEMCG users.  But I'm not
> > sure why we're discussing bisection issues, because Arnd's build
> > failure occurs with everything applied?
> 
> Arnd's patches apply to the top of the stack, but they address issues
> introduced early in the series and the problematic code gets touched a
> lot in subsequent patches. E.g. the first build breakage is in ("net:
> tcp_memcontrol: simplify linkage between socket and page counter")
> when the tcp_init_cgroup() and tcp_destroy_cgroup() function calls get
> moved around and lose the CONFIG_INET protection.

Yeah, this is a pain.  I think I'll fold Arnd's fix into
mm-memcontrol-introduce-config_memcg_legacy_kmem.patch (which is staged
after all the other MM patches and after linux-next) and will pretend I
didn't know about the issue ;)

> Anyway, if we can live with the bisection caveat then Arnd's fixes on
> top of the kmem series look good to me. Depending on what Vladimir
> thinks we might want to replace the CONFIG_SLOB fix with something
> else later on, but that shouldn't be a problem, either.

I don't have a fix for the CONFIG_SLOB&_MEMCG issue yet.  I
agree that it would be best to make the combination work correctly
rather than banning it, but that does require a bit of runtime testing.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Intel-wired-lan] [Patch V3 5/9] i40e: Use numa_mem_id() to better support memoryless node

2015-10-08 Thread Andrew Morton

On Wed, 19 Aug 2015 17:18:15 -0700 (PDT) David Rientjes  
wrote:

> On Wed, 19 Aug 2015, Patil, Kiran wrote:
> 
> > Acked-by: Kiran Patil 
> 
> Where's the call to preempt_disable() to prevent kernels with preemption 
> from making numa_node_id() invalid during this iteration?

David asked this question twice, received no answer and now the patch
is in the maintainer tree, destined for mainline.

If I was asked this question I would respond

  The use of numa_mem_id() is racy and best-effort.  If the unlikely
  race occurs, the memory allocation will occur on the wrong node, the
  overall result being very slightly suboptimal performance.  The
  existing use of numa_node_id() suffers from the same issue.

But I'm not the person proposing the patch.  Please don't just ignore
reviewer comments!

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: simplify configfs attributes V2

2015-10-05 Thread Andrew Morton

On Sat,  3 Oct 2015 15:32:36 +0200 Christoph Hellwig  wrote:

> This series consolidates the code to implement configfs attributes
> by providing the ->show and ->store method in common code and using
> container_of in the methods to access the containing structure.
> 
> This reduces source and binary size of configfs consumers a lot.

There's a version of this patch series (I assume V1) in linux-next via
Nicholas's tree so my hands are tied.  I trust that Nicholas will
update things.

Or maybe it was a slipup - modifying usb, ocfs2 etc from the iscsi tree
is innovative ;)

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [MM PATCH V4.1 5/6] slub: support for bulk free with SLUB freelists

2015-10-02 Thread Andrew Morton

On Fri, 2 Oct 2015 15:40:39 +0200 Jesper Dangaard Brouer  
wrote:

> > Thus, I need introducing new code like this patch and at the same time
> > have to reduce the number of instruction-cache misses/usage.  In this
> > case we solve the problem by kmem_cache_free_bulk() not getting called
> > too often. Thus, +17 bytes will hopefully not matter too much... but on
> > the other hand we sort-of know that calling kmem_cache_free_bulk() will
> > cause icache misses.
> 
> I just tested this change on top of my net-use-case patchset... and for
> some strange reason the code with this WARN_ON is faster and have much
> less icache-misses (1,278,276 vs 2,719,158 L1-icache-load-misses).
> 
> Thus, I think we should keep your fix.
> 
> I cannot explain why using WARN_ON() is better and cause less icache
> misses.  And I hate when I don't understand every detail.
> 
>  My theory is, after reading the assembler code, that the UD2
> instruction (from BUG_ON) cause some kind of icache decoder stall
> (Intel experts???).  Now that should not be a problem, as UD2 is
> obviously placed as an unlikely branch and left at the end of the asm
> function call.  But the call to __slab_free() is also placed at the end
> of the asm function (gets inlined from slab_free() as unlikely).  And
> it is actually fairly likely that bulking is calling __slab_free (slub
> slowpath call).

Yes, I was looking at the asm code and the difference is pretty small:
a not-taken ud2 versus a not-taken "call warn_slowpath_null", mainly.

But I wouldn't assume that the microbenchmarking is meaningful.  I've
seen shockingly large (and quite repeatable) microbenchmarking
differences from small changes in code which isn't even executed (and
this is one such case, actually).  You add or remove just one byte of
text and half the kernel (or half the .o file?) gets a different
alignment and this seems to change everything.

Deleting the BUG altogether sounds the best solution.  As long as the
kernel crashes in some manner, we'll be able to work out what happened.
And it's cant-happen anyway, isn't it?

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [MM PATCH V4.1 5/6] slub: support for bulk free with SLUB freelists

2015-10-01 Thread Andrew Morton

On Wed, 30 Sep 2015 13:44:19 +0200 Jesper Dangaard Brouer  
wrote:

> Make it possible to free a freelist with several objects by adjusting
> API of slab_free() and __slab_free() to have head, tail and an objects
> counter (cnt).
> 
> Tail being NULL indicate single object free of head object.  This
> allow compiler inline constant propagation in slab_free() and
> slab_free_freelist_hook() to avoid adding any overhead in case of
> single object free.
> 
> This allows a freelist with several objects (all within the same
> slab-page) to be free'ed using a single locked cmpxchg_double in
> __slab_free() and with an unlocked cmpxchg_double in slab_free().
> 
> Object debugging on the free path is also extended to handle these
> freelists.  When CONFIG_SLUB_DEBUG is enabled it will also detect if
> objects don't belong to the same slab-page.
> 
> These changes are needed for the next patch to bulk free the detached
> freelists it introduces and constructs.
> 
> Micro benchmarking showed no performance reduction due to this change,
> when debugging is turned off (compiled with CONFIG_SLUB_DEBUG).
> 

checkpatch says

WARNING: Avoid crashing the kernel - try using WARN_ON & recovery code rather 
than BUG() or BUG_ON()
#205: FILE: mm/slub.c:2888:
+   BUG_ON(!size);


Linus will get mad at you if he finds out, and we wouldn't want that.

--- a/mm/slub.c~slub-optimize-bulk-slowpath-free-by-detached-freelist-fix
+++ a/mm/slub.c
@@ -2885,7 +2885,8 @@ static int build_detached_freelist(struc
 /* Note that interrupts must be enabled when calling this function. */
 void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
 {
-   BUG_ON(!size);
+   if (WARN_ON(!size))
+   return;
 
do {
struct detached_freelist df;
_


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [linux-next] oops in ip_route_input_noref

2015-09-18 Thread Andrew Morton

On Thu, 17 Sep 2015 10:58:52 +0200 Thierry Reding  
wrote:

> On Wed, Sep 16, 2015 at 09:04:15AM -0600, David Ahern wrote:
> > On 9/16/15 9:00 AM, Fabio Estevam wrote:
> > >On Wed, Sep 16, 2015 at 6:24 AM, Sergey Senozhatsky
> > > wrote:
> > >
> > >>added by b7503e0cdb5dbec5d201aa69dc14679b5ae8
> > >>
> > >> net: Add FIB table id to rtable
> > >>
> > >> Add the FIB table id to rtable to make the information available for
> > >> IPv4 as it is for IPv6.
> > >
> > >I see the same issue here when booting a mx25 ARM processor via NFS.
> > >
> > >defconfig is arch/arm/configs/imx_v4_v5_defconfig.
> > >
> > 
> > I am still not able to reproduce. While I work on a full Cumulus image for
> > other test cases here's a patch to try; eagle eye Nikolay noted a potential
> > use without init in the maze of goto's.
> > 
> > Thanks,
> > David
> 
> > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> > index da427a4a33fe..80f7c5b7b832 100644
> > --- a/net/ipv4/route.c
> > +++ b/net/ipv4/route.c
> > @@ -1712,6 +1712,7 @@ static int ip_route_input_slow(struct sk_buff *skb, 
> > __be32 daddr, __be32 saddr,
> > goto martian_source;
> >  
> > res.fi = NULL;
> > +   res.table = NULL;
> > if (ipv4_is_lbcast(daddr) || (saddr == 0 && daddr == 0))
> > goto brd_input;
> >  
> > @@ -1834,6 +1835,7 @@ out:  return err;
> > RT_CACHE_STAT_INC(in_no_route);
> > res.type = RTN_UNREACHABLE;
> > res.fi = NULL;
> > +   res.table = NULL;
> > goto local_input;
> >  
> > /*
> 
> I was seeing the same oops as Fabio (except that the faulting address
> was 0xb instead of 0x7) and after applying this patch I no longer see
> it:
> 
> Tested-by: Thierry Reding 

I've been hitting this as well.  An oops on boot in
ip_route_input_slow(), here:

#ifdef CONFIG_IP_ROUTE_CLASSID
rth->dst.tclassid = itag;
#endif
rth->rt_is_input = 1;
if (res.table)
-->>rth->rt_table_id = res.table->tb_id;

RT_CACHE_STAT_INC(in_slow_tot);


I did this, which made it go away:

--- a/net/ipv4/route.c~a
+++ a/net/ipv4/route.c
@@ -1692,6 +1692,8 @@ static int ip_route_input_slow(struct sk
struct net*net = dev_net(dev);
bool do_cache;
 
+   res.table = 0;
+
/* IP on this device is disabled. */
 
if (!in_dev)
_

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 6/7] slub: improve bulk alloc strategy

2015-06-16 Thread Andrew Morton

On Mon, 15 Jun 2015 17:52:46 +0200 Jesper Dangaard Brouer bro...@redhat.com 
wrote:

 Call slowpath __slab_alloc() from within the bulk loop, as the
 side-effect of this call likely repopulates c-freelist.
 
 Choose to reenable local IRQs while calling slowpath.
 
 Saving some optimizations for later.  E.g. it is possible to
 extract parts of __slab_alloc() and avoid the unnecessary and
 expensive (37 cycles) local_irq_{save,restore}.  For now, be
 happy calling __slab_alloc() this lower icache impact of this
 func and I don't have to worry about correctness.
 
 ...

 --- a/mm/slub.c
 +++ b/mm/slub.c
 @@ -2776,8 +2776,23 @@ bool kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t 
 flags, size_t size,
   for (i = 0; i  size; i++) {
   void *object = c-freelist;
  
 - if (!object)
 - break;
 + if (unlikely(!object)) {
 + c-tid = next_tid(c-tid);
 + local_irq_enable();
 +
 + /* Invoke slow path one time, then retry fastpath
 +  * as side-effect have updated c-freelist
 +  */

That isn't very grammatical.

Block comments are formatted

/*
 * like this
 */

please.


 + p[i] = __slab_alloc(s, flags, NUMA_NO_NODE,
 + _RET_IP_, c);
 + if (unlikely(!p[i])) {
 + __kmem_cache_free_bulk(s, i, p);
 + return false;
 + }
 + local_irq_disable();
 + c = this_cpu_ptr(s-cpu_slab);
 + continue; /* goto for-loop */
 + }
  

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/7] slab: infrastructure for bulk object allocation and freeing

2015-06-16 Thread Andrew Morton

On Mon, 15 Jun 2015 17:51:56 +0200 Jesper Dangaard Brouer bro...@redhat.com 
wrote:

 +bool kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 + void **p)
 +{
 + return kmem_cache_alloc_bulk(s, flags, size, p);
 +}

hm, any call to this function is going to be nasty, brutal and short.

--- 
a/mm/slab.c~slab-infrastructure-for-bulk-object-allocation-and-freeing-v3-fix
+++ a/mm/slab.c
@@ -3425,7 +3425,7 @@ EXPORT_SYMBOL(kmem_cache_free_bulk);
 bool kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
void **p)
 {
-   return kmem_cache_alloc_bulk(s, flags, size, p);
+   return __kmem_cache_alloc_bulk(s, flags, size, p);
 }
 EXPORT_SYMBOL(kmem_cache_alloc_bulk);
 
_

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/7] slub bulk alloc: extract objects from the per cpu slab

2015-06-16 Thread Andrew Morton

On Mon, 15 Jun 2015 17:52:07 +0200 Jesper Dangaard Brouer bro...@redhat.com 
wrote:

 From: Christoph Lameter c...@linux.com
 
 [NOTICE: Already in AKPM's quilt-queue]
 
 First piece: acceleration of retrieval of per cpu objects
 
 If we are allocating lots of objects then it is advantageous to disable
 interrupts and avoid the this_cpu_cmpxchg() operation to get these objects
 faster.
 
 Note that we cannot do the fast operation if debugging is enabled, because
 we would have to add extra code to do all the debugging checks.  And it
 would not be fast anyway.
 
 Note also that the requirement of having interrupts disabled
 avoids having to do processor flag operations.
 
 Allocate as many objects as possible in the fast way and then fall back to
 the generic implementation for the rest of the objects.
 
 ...

 --- a/mm/slub.c
 +++ b/mm/slub.c
 @@ -2759,7 +2759,32 @@ EXPORT_SYMBOL(kmem_cache_free_bulk);
  bool kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
   void **p)
  {
 - return kmem_cache_alloc_bulk(s, flags, size, p);
 + if (!kmem_cache_debug(s)) {
 + struct kmem_cache_cpu *c;
 +
 + /* Drain objects in the per cpu slab */
 + local_irq_disable();
 + c = this_cpu_ptr(s-cpu_slab);
 +
 + while (size) {
 + void *object = c-freelist;
 +
 + if (!object)
 + break;
 +
 + c-freelist = get_freepointer(s, object);
 + *p++ = object;
 + size--;
 +
 + if (unlikely(flags  __GFP_ZERO))
 + memset(object, 0, s-object_size);
 + }
 + c-tid = next_tid(c-tid);
 +
 + local_irq_enable();

It might be worth adding

if (!size)
return true;

here.  To avoid the pointless call to __kmem_cache_alloc_bulk().

It depends on the typical success rate of this allocation loop.  Do you
know what this is?

 + }
 +
 + return __kmem_cache_alloc_bulk(s, flags, size, p);
  }
  EXPORT_SYMBOL(kmem_cache_alloc_bulk);

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/7] slub: fix error path bug in kmem_cache_alloc_bulk

2015-06-16 Thread Andrew Morton

On Mon, 15 Jun 2015 17:52:26 +0200 Jesper Dangaard Brouer bro...@redhat.com 
wrote:

 The current kmem_cache/SLAB bulking API need to release all objects
 in case the layer cannot satisfy the full request.
 
 If __kmem_cache_alloc_bulk() fails, all allocated objects in array
 should be freed, but, __kmem_cache_alloc_bulk() can't know
 about objects allocated by this slub specific kmem_cache_alloc_bulk()
 function.

Can we fold patches 2, 3 and 4 into a single patch?

And maybe patch 5 as well.  I don't think we need all these
development-time increments in the permanent record.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/4] netconsole: implement extended console support

2015-05-12 Thread Andrew Morton

On Mon, 11 May 2015 12:41:34 -0400 Tejun Heo t...@kernel.org wrote:

 printk logbuf keeps various metadata and optional key=value dictionary
 for structured messages, both of which are stripped when messages are
 handed to regular console drivers.
 
 It can be useful to have this metadata and dictionary available to
 netconsole consumers.  This obviously makes logging via netconsole
 more complete and the sequence number in particular is useful in
 environments where messages may be lost or reordered in transit -
 e.g. when netconsole is used to collect messages in a large cluster
 where packets may have to travel congested hops to reach the
 aggregator.  The lost and reordered messages can easily be identified
 and handled accordingly using the sequence numbers.
 
 printk recently added extended console support which can be selected
 by setting CON_EXTENDED flag.

There's no such thing as CON_EXTENDED.  Not sure what this is trying to
say.

  From console driver side, not much
 changes.  The only difference is that the text passed to the write
 callback is formatted the same way as /dev/kmsg.
 
 This patch implements extended console support for netconsole which
 can be enabled by either prepending + to a netconsole boot param
 entry or echoing 1 to extended file in configfs.  When enabled,
 netconsole transmits extended log messages with headers identical to
 /dev/kmsg output.
 
 There's one complication due to message fragments.  netconsole limits
 the maximum message size to 1k and messages longer than that are split
 into multiple fragments.  As all extended console messages should
 carry matching headers and be uniquely identifiable, each extended
 message fragment carries full copy of the metadata and an extra header
 field to identify the specific fragment.  The optional header is of
 the form ncfrag=OFF/LEN where OFF is the byte offset into the
 message body and LEN is the total length.
 
 To avoid unnecessarily making printk format extended messages,
 Extended netconsole is registered with printk when the first extended
 netconsole is configured.

 ...

 +static ssize_t store_extended(struct netconsole_target *nt,
 +   const char *buf,
 +   size_t count)
 +{
 + int extended;
 + int err;
 +
 + if (nt-enabled) {
 + pr_err(target (%s) is enabled, disable to update parameters\n,
 +config_item_name(nt-item));
 + return -EINVAL;
 + }

What's the reason for the above?

It's unclear (to me, at least ;)) what disable means?  Specifically
what steps must the operator take to successfully perform this
operation?  A sentence detailing those steps in netconsole.txt would be
nice.

 + err = kstrtoint(buf, 10, extended);
 + if (err  0)
 + return err;
 + if (extended  0 || extended  1)
 + return -EINVAL;
 +
 + nt-extended = extended;
 +
 + return strnlen(buf, count);
 +}
 +

 ...

 +static void send_ext_msg_udp(struct netconsole_target *nt, const char *msg,
 +  int msg_len)
 +{
 + static char buf[MAX_PRINT_CHUNK];
 + const char *header, *body;
 + int offset = 0;
 + int header_len, body_len;
 +
 + if (msg_len = MAX_PRINT_CHUNK) {
 + netpoll_send_udp(nt-np, msg, msg_len);
 + return;
 + }
 +
 + /* need to insert extra header fields, detect header and body */
 + header = msg;
 + body = memchr(msg, ';', msg_len);
 + if (WARN_ON_ONCE(!body))
 + return;
 +
 + header_len = body - header;
 + body_len = msg_len - header_len - 1;
 + body++;
 +
 + /*
 +  * Transfer multiple chunks with the following extra header.
 +  * ncfrag=byte-offset/total-bytes
 +  */
 + memcpy(buf, header, header_len);
 +
 + while (offset  body_len) {
 + int this_header = header_len;
 + int this_chunk;
 +
 + this_header += scnprintf(buf + this_header,
 +  sizeof(buf) - this_header,
 +  ,ncfrag=%d/%d;, offset, body_len);
 +
 + this_chunk = min(body_len - offset,
 +  MAX_PRINT_CHUNK - this_header);
 + if (WARN_ON_ONCE(this_chunk = 0))
 + return;
 +
 + memcpy(buf + this_header, body + offset, this_chunk);
 +
 + netpoll_send_udp(nt-np, buf, this_header + this_chunk);
 +
 + offset += this_chunk;
 + }

What protects `buf'?  console_sem, I assume?

-   static char buf[MAX_PRINT_CHUNK];
+   static char buf[MAX_PRINT_CHUNK];   /* Protected by console_sem */

wouldn't hurt.

 +}
 +
 +static void write_ext_msg(struct console *con, const char *msg,


I've forgotten what's happening with this patchset.  There were a few
design-level issues raised against an earlier version.  What were those
and how have they been addressed?

--
To unsubscribe from this

Re: [PATCH 00/28] Swap over NFS -v16

2008-02-26 Thread Andrew Morton

On Tue, 26 Feb 2008 11:50:42 +0100 Peter Zijlstra [EMAIL PROTECTED] wrote:

 On Tue, 2008-02-26 at 17:03 +1100, Neil Brown wrote:
  On Saturday February 23, [EMAIL PROTECTED] wrote:
  
   What is the NFS and net people's take on all of this?
  
  Well I'm only vaguely an NFS person, barely a net person, sporadically
  an mm person, but I've had a look and it seems to mostly make sense.
 
 Thanks for taking a look, and giving such elaborate feedback. I'll try
 and address these issues asap, but first let me reply to a few points
 here.

Neil's overview of what-all-this-is and how-it-all-works is really good. 
I'd suggest that you take it over, flesh it out and attach it firmly to the
patchset.  It really helps.

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Bugme-new] [Bug 10097] New: SMP BUG in __nf_conntrack_find

2008-02-25 Thread Andrew Morton

On Mon, 25 Feb 2008 10:44:08 -0800 (PST) [EMAIL PROTECTED] wrote:

 http://bugzilla.kernel.org/show_bug.cgi?id=10097
 
Summary: SMP BUG in __nf_conntrack_find
Product: Networking
Version: 2.5
  KernelVersion: 2.6.25-rc3
   Platform: All
 OS/Version: Linux
   Tree: Mainline
 Status: NEW
   Severity: normal
   Priority: P1
  Component: Netfilter/Iptables
 AssignedTo: [EMAIL PROTECTED]
 ReportedBy: [EMAIL PROTECTED]
 
 
 Latest working kernel version: 2.6.24.2
 Earliest failing kernel version: 2.6.24-rc3 (not checked before)
 Distribution: Bluewhite64
 Hardware Environment: Athlon X2 4200
 
 Software Environment:
 samba 3.0, 2.6.25-rc3 kernel + HR + tickless + kernel SMP debugging
 
 Problem Description:
 The Samba smbd daemon triggers regularly the following BUG with 2.6.25-rc3:
 
 BUG: using smp_processor_id() in preemptible [] code: nmbd/3167
 caller is __nf_conntrack_find+0x119/0x150
 Pid: 3167, comm: nmbd Not tainted 2.6.25-rc3 #1
 
 Call Trace:
  [8038f3f4] debug_smp_processor_id+0xc4/0xd0
  [80555d79] __nf_conntrack_find+0x119/0x150
  [80555dc9] nf_conntrack_find_get+0x19/0x80
  [80556914] nf_conntrack_in+0x1a4/0x5a0
  [8020bd33] ? restore_args+0x0/0x30
  [8059d596] ipv4_conntrack_local+0x66/0x70
  [80554362] nf_iterate+0x62/0xa0
  [80567050] ? dst_output+0x0/0x10
  [80554406] nf_hook_slow+0x66/0xe0
  [80567050] ? dst_output+0x0/0x10
  [80568825] __ip_local_out+0xa5/0xb0
  [80568841] ip_local_out+0x11/0x30
  [80568ac1] ip_push_pending_frames+0x261/0x3e0
  [80587153] udp_push_pending_frames+0x233/0x3d0
  [8058860f] udp_sendmsg+0x30f/0x710
  [802328b0] ? default_wake_function+0x0/0x10
  [8058f895] inet_sendmsg+0x45/0x80
  [80531fcf] sock_sendmsg+0xdf/0x110
  [80251270] ? autoremove_wake_function+0x0/0x40
  [802374c7] ? hrtick_resched+0x77/0x90
  [8025e2b5] ? trace_hardirqs_on+0xd5/0x160
  [80531735] ? sockfd_lookup_light+0x45/0x80
  [805323da] sys_sendto+0xea/0x120
  [80626bcb] ? _spin_unlock_irq+0x2b/0x60
  [8025e2b5] ? trace_hardirqs_on+0xd5/0x160
  [80626bd6] ? _spin_unlock_irq+0x36/0x60
  [8020b6db] system_call_after_swapgs+0x7b/0x80
 
 Steps to reproduce:
 Start smbd with the forementionned kernel instrumented for SMP and kernel
 debugging and hr + tickless enabled.
 

Presumably this is in NF_CT_STAT_INC().  I wonder what caused it to start
happening.

Guys, this probably means that the developers who tested this change aren't
enabling the debug options which all kernel developers _should_ be enabling
when testing their code!  Documentation/SubmitChecklist has a handy list.

Should NF_CT_STAT_INC() be using local_inc()?
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Bugme-new] [Bug 10071] New: kernel hang in inet_init

2008-02-25 Thread Andrew Morton

On Sat, 23 Feb 2008 00:34:06 -0800 (PST) [EMAIL PROTECTED] wrote:

 http://bugzilla.kernel.org/show_bug.cgi?id=10071
 
Summary: kernel hang in inet_init
Product: Networking
Version: 2.5
  KernelVersion: 2.6.25 rc2 latest git
   Platform: All
 OS/Version: Linux
   Tree: Mainline
 Status: NEW
   Severity: normal
   Priority: P1
  Component: IPV4
 AssignedTo: [EMAIL PROTECTED]
 ReportedBy: [EMAIL PROTECTED]
 
 
 Distribution:ubuntu hardy
 Hardware Environment:dell dimension 5150
 Problem Description:
 kernel hang mostly but boot slowly sometimes
 
 Steps to reproduce:
 boot
 
 when the computer manage to boot, inet_init last some time:
 [1.725306] NET: Registered protocol family 2
 [6.825384] IP route cache hash table entries: 32768 (order: 5, 131072
 bytes)
 [6.825803] TCP established hash table entries: 131072 (order: 8, 1048576
 bytes)
 [6.826691] TCP bind hash table entries: 65536 (order: 9, 2359296 bytes)
 [6.836160] TCP: Hash tables configured (established 131072 bind 65536)
 [6.836255] TCP reno registered
 [7.081569] initcall 0xc03b8920: inet_init+0x0/0x3aa() returned 0.
 [7.081569] initcall 0xc03b8920 ran for 5106 msecs: inet_init+0x0/0x3aa()
 
 When booting with acpi=ht it works:
 [0.124668] Calling initcall 0xc03b8920: inet_init+0x0/0x3aa()
 [0.124867] NET: Registered protocol family 2
 [0.288067] IP route cache hash table entries: 32768 (order: 5, 131072
 bytes)
 [0.288856] TCP established hash table entries: 131072 (order: 8, 1048576
 bytes)
 [0.289739] TCP bind hash table entries: 65536 (order: 9, 2359296 bytes)
 [0.299156] TCP: Hash tables configured (established 131072 bind 65536)
 [0.299250] TCP reno registered
 Feb 23 09:10:28 tiyann kernel: [0.162267] initcall 0xc03b8920:
 inet_init+0x0/0x3aa() returned 0.
 [0.162383] initcall 0xc03b8920 ran for 33 msecs: inet_init+0x0/0x3aa()
 
 there seem to be a weird interaction between acpi and inet_init
 
 any hint on how to provide better information
 thanks
 
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: printk_ratelimit and net_ratelimit conflict and tunable behavior

2008-02-25 Thread Andrew Morton

On Mon, 25 Feb 2008 14:36:40 -0600 Steven Hawkes [EMAIL PROTECTED] wrote:

 From: Steve Hawkes [EMAIL PROTECTED]
 
 The printk_ratelimit() and net_ratelimit() functions each have their own
 tunable parameters to control their respective rate limiting feature, but
 they share common state variables, preventing independent tuning of the
 parameters from working correctly. Also, changes to rate limiting tunable
 parameters do not always take effect properly since state is not recomputed
 when changes occur. For example, if ratelimit_burst is increased while rate
 limiting is occurring, the change won't take full effect until at least
 enough time between messages occurs so that the toks value reaches
 ratelimit_burst * ratelimit_jiffies. This can result in messages being
 suppressed when they should be allowed.
 
 Implement independent state for printk_ratelimit() and net_ratelimit(), and
 update state when tunables are changed.
 

This patch causes a large and nasty reject.

 ---
 --- linux-2.6.24/include/linux/kernel.h   2008-01-24 16:58:37.0 
 -0600
 +++ linux-2.6.24-printk_ratelimit/include/linux/kernel.h  2008-02-21 
 11:20:41.751197312 -0600

Probably because you patched 2.6.24.  We're developing 2.6.25 now, and the
difference between the two is very large inded.  Please raise patches
against Linus's latest tree?

There are other patches pending against printk.c (in -mm and in git-sched)
but afacit they won't collide.

 @@ -196,8 +196,19 @@ static inline int log_buf_copy(char *des
  
  unsigned long int_sqrt(unsigned long);
  
 +struct printk_ratelimit_state
 +{

Please do

struct printk_ratelimit_state {

 + unsigned long toks;
 + unsigned long last_jiffies;
 + int missed;
 + int limit_jiffies;
 + int limit_burst;
 + char const *facility;
 +};

I find that the best-value comments one can add to kernel code are to the
members of structures.  If the reader understands what all the fields do, the
code becomes simple to follow.

 --- linux-2.6.24/net/core/utils.c 2008-01-24 16:58:37.0 -0600
 +++ linux-2.6.24-printk_ratelimit/net/core/utils.c2008-02-21 
 11:03:44.644337698 -0600
 @@ -41,7 +41,16 @@ EXPORT_SYMBOL(net_msg_warn);
   */
  int net_ratelimit(void)
  {
 - return __printk_ratelimit(net_msg_cost, net_msg_burst);
 + static struct printk_ratelimit_state limit_state = {
 + .toks  = 10 * 5 * HZ,
 + .last_jiffies  = 0,
 + .missed= 0,
 + .limit_jiffies = 5 * HZ,
 + .limit_burst   = 10,
 + .facility  = net
 + };
 +
 + return __printk_ratelimit(net_msg_cost, net_msg_burst, limit_state);

I don't get it.  There's one instance of limit_state, kernel-wide, and
__printk_ratelimit() modifies it.  What prevents one CPU's activities from
interfering with a second CPU's activities?

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Bugme-new] [Bug 10073] New: Just-small-enough packets in tunnels are silently eaten

2008-02-25 Thread Andrew Morton

On Sat, 23 Feb 2008 09:17:14 -0800 (PST) [EMAIL PROTECTED] wrote:

 http://bugzilla.kernel.org/show_bug.cgi?id=10073
 
Summary: Just-small-enough packets in tunnels are silently eaten
Product: Networking
Version: 2.5
  KernelVersion: 2.6.23 (mainline), 2.6.25-rc2 (mainline), 2.6.18-6-amd64
 (Debian
   Platform: All
 OS/Version: Linux
   Tree: Mainline
 Status: NEW
   Severity: normal
   Priority: P1
  Component: IPV6
 AssignedTo: [EMAIL PROTECTED]
 ReportedBy: [EMAIL PROTECTED]
 
 
 Hi,
 
 This has been broken for quite a while, but I haven't gotten around to debug 
 it
 until now.
 
 I have an IPv6-in-IPv4 tunnel between two points, both with MTU 1500 on the
 regular v4 network. (I've verified that I can indeed send 1500-byte packets 
 and
 fragments over IPv4 from the two points.) By default, Linux assigns MTU 1480 
 to
 this tunnel.
 
 However, if I try to ping -s 5000 from one side of the tunnel to the other, I
 get
 first Packet too big, mtu=1480 and then on the next packet (when the machine
 tries to send 1480-byte fragments) Packet too big, mtu=1472. After that, the
 packet goes through.
 
 However, in some cases it seems I do not seem to get the Packet too big ICMP
 at all. In particular, if I change to a GRE tunnel (where the default MTU is
 1476), and send in 1476-byte packets, they are just eaten. They clearly go 
 into
 the GRE tunnel (according to tcpdump), but no IPv4 packets ever go out on the
 other side, and no ICMPs are sent back. (There's no iptables rules on the
 router in question, nor any relevant sysctl settings except that IPv6
 forwarding is of course turned on.) If I lower MTU on the interfaces to 1468,
 everything seems to work just fine. (I cannot change the MTU of a regular ipip
 tunnel, so it's impossible for me to check whether a lower MTU would have 
 fixed
 the issues for those as well, but it seems reasonable.)
 
 Any idea where the extra eight bytes go? Is there some inner layer of
 encapsulation in the kernel (adding eight bytes internally) that I've missed?
 
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Nework Emulation with UML, bug 8895

2008-02-24 Thread Andrew Morton

On Sat, 23 Feb 2008 14:04:18 +0100 clowncoder [EMAIL PROTECTED] wrote:

 Hello,
 You can have a configured and running network inside a single linux machine,
 only one script command is enough. After the start of all the machine, a
 graphical representation of your topology helps your interactions with
 the network. This virtual network can be downloaded at http://clownix.net,
 I must warn you, it is a 350mega bz2 file-system that needs 4 giga on 
 the host.
 Very usefull for networking development in kernel as proved by bug 8895
 that has been found with the above tool:
 
 The uncorrected bug 8895:
 
 file:  /usr/src/linux-2.6.24.2/net/ipv6/ip6_fib.c  line  796:   
 dst_free(rt-u.dst);
 
 --- /Comment From qmiao mailto:[EMAIL PROTECTED] 2007-08-29 
 23:33:07 /---
 
 fib6_add
 ...
 
 if (fn-leaf == NULL) {
 fn-leaf = rt;--**-- rt is assigned to fn-leaf
 atomic_inc(rt-rt6i_ref);
 }
 ...
 err = fib6_add_rt2node(fn, rt, info); -**- return -EEXIST
 ...
 (Here err was not null)
 ...
 if (err) {
 ...
 dst_free(rt-u.dst); --**-- Actually rt is still in
 tree (fn-leaf = rt /* see above */)
 ...
 }
 

Please cc netdev on net-related bug reports.


--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Bugme-new] [Bug 10063] New: Network problems with 2.6.23+

2008-02-24 Thread Andrew Morton


(plese respond via emailed reply-to-all, not via the bugzilla web interface)

On Thu, 21 Feb 2008 18:04:45 -0800 (PST) [EMAIL PROTECTED] wrote:

 http://bugzilla.kernel.org/show_bug.cgi?id=10063

There's more info (including a tcpdump) in bugzilla.

Summary: Network problems with 2.6.23+
Product: Networking
Version: 2.5
  KernelVersion: 2.6.24.2
   Platform: All
 OS/Version: Linux
   Tree: Mainline
 Status: NEW
   Severity: high
   Priority: P1
  Component: IPV4
 AssignedTo: [EMAIL PROTECTED]
 ReportedBy: [EMAIL PROTECTED]
 
 
 Latest working kernel version: 2.6.23
 Earliest failing kernel version: 2.6.24
 Distribution: Gentoo
 Hardware Environment: 
 
 model name  : Dual Core AMD Opteron(tm) Processor 170
 stepping: 2
 cpu MHz : 2009.144
 
 02:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8053 PCI-E
 Gigabit Ethernet Controller (rev 15) sky2
 
 Working on 2.6.24 (as far as I can see) in :
 
 model name  : Intel(R) Pentium(R) 4 CPU 2.80GHz
 stepping: 9
 cpu MHz : 2793.222
 01:01.0 Ethernet controller: Intel Corporation 82547GI Gigabit Ethernet
 Controller
 
 Software Environment: TCP-transfers (FTP,HTTP)
 Problem Description:
 
 First I want to apologize that I'm not able to do alot of troubleshooting on
 this issue cause I don't know where to start exactly.
 
 The problem is that I have, after upgrading to 2.6.24 from 2.6.23, experienced
 alot of weird network problems. For example, FTP-transfers to certain hosts
 on the Internet stalls after a while. I have one dst-host where this problem
 appears frequently. On another dst-host it doesn't happen as often. Sometimes
 the transfers doesn't even start. 
 
 First we did alot of faultracing on the transmission but after downgrading the
 kernel it was apparent that the transmission wasn't the problem. 
 
 I did not think about filing this report until i saw,
 http://kerneltrap.org/node/15550. So I guess I'm not alone.
 
 I must also add that to some dst-hosts there is no problem at all. So I would
 guess that there might be something with the mtu,wscale or something between
 hosts. 
 
 All the other dst-hosts I've tried is running 2.6.23
 
 I will try to test alot of kernel-version to try to ease the troubleshooting. 
 
 
 Steps to reproduce:
 
 Start an FTP-transfer from the box mention above.

I guess it might help if you can tell us the IP address of some of the
problematic servers.

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH 8/8] Jhash in too big for inlining, move under lib/

2008-02-23 Thread Andrew Morton

On Wed, 20 Feb 2008 15:47:18 +0200 Ilpo Järvinen [EMAIL PROTECTED] wrote:

 vmlinux.o:
  62 functions changed, 66 bytes added, 10935 bytes removed, diff: -10869
 
 ...+ these to lib/jhash.o:
  jhash_3words: 112
  jhash2: 276
  jhash: 475
 
 select for networking code might need a more fine-grained approach.

It should be possible to use a modular jhash.ko.  The things which you
have identified as clients of the jhash library are usually loaded as modules.

But in the case where someone does (say) NFSD=y we do need jhash.o linked into
vmlinux also.  This is doable in Kconfig but I always forget how.  Adrian, Sam
and Randy are the repositories of knowledge here ;)
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH 0/8]: uninline uninline

2008-02-23 Thread Andrew Morton

On Wed, 20 Feb 2008 15:47:10 +0200 Ilpo J__rvinen [EMAIL PROTECTED] wrote:

 Ok, here's the top of the list (1+ bytes):

This is good stuff - thanks.

 -41525  2066 f, 3370 +, 44895 -, diff: -41525  IS_ERR 

This is a surprise.  I expect that the -mm-only
profile-likely-unlikely-macros.patch is the cause of this and mainline
doesn't have this problem.

If true, then this likely/unlikely bloat has probably spread into a lot of
your other results and it all should be redone against mainline, sorry :(

(I'm not aware of anyone having used profile-likely-unlikely-macros.patch
in quite some time.  That's unfortunate because it has turned up some
fairly flagrant code deoptimisations)
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: WARNING: at net/ipv4/tcp_input.c:2054 tcp_mark_head_lost()

2008-02-23 Thread Andrew Morton


(cc netdev)

On Wed, 20 Feb 2008 20:04:39 -0800 (PST) Giangiacomo Mariotti [EMAIL 
PROTECTED] wrote:

 This is what I got with dmesg :
 
 [  266.978695] WARNING: at net/ipv4/tcp_input.c:2054 tcp_mark_head_lost()
 [  266.978701] Pid: 0, comm: swapper Not tainted 2.6.24.2-my001 #1
 [  266.978703] 
 [  266.978704] Call Trace:
 [  266.978706]  IRQ  [80426981] tcp_ack+0x16d8/0x197f
 [  266.978721]  [8022e72f] __wake_up+0x38/0x4e
 [  266.978727]  [804295ef] tcp_rcv_established+0xe2/0x8cb
 [  266.978732]  [8042f56f] tcp_v4_do_rcv+0x30/0x39c
 [  266.978738]  [80431d29] tcp_v4_rcv+0x99b/0xa06
 [  266.978743]  [803f2c95] __netdev_alloc_skb+0x29/0x43
 [  266.978749]  [80416d21] ip_local_deliver_finish+0x152/0x212
 [  266.978753]  [80416bac] ip_rcv_finish+0x2f8/0x31b
 [  266.978758]  [803f6c42] netif_receive_skb+0x3ae/0x3d1
 [  266.978763]  [8037398f] rtl8169_rx_interrupt+0x45f/0x53e
 [  266.978768]  [8037405b] rtl8169_poll+0x36/0x16a
 [  266.978773]  [803f8ca7] net_rx_action+0xb7/0x1f3
 [  266.978778]  [8023a3a5] __do_softirq+0x65/0xce
 [  266.978782]  [8020b0d2] default_idle+0x0/0x3d
 [  266.978786]  [8020d09c] call_softirq+0x1c/0x28
 [  266.978789]  [8020e4f0] do_softirq+0x2c/0x7d
 [  266.978792]  [8023a2fb] irq_exit+0x3f/0x84
 [  266.978794]  [8020e729] do_IRQ+0xb6/0xd5
 [  266.978797]  [8020b0d2] default_idle+0x0/0x3d
 [  266.978800]  [8020c421] ret_from_intr+0x0/0xa
 [  266.978801]  EOI  [8020b0fb] default_idle+0x29/0x3d
 [  266.978809]  [8020b1a2] cpu_idle+0x93/0xbb
 [  266.978813]  [805cfa4b] start_kernel+0x2bb/0x2c7
 [  266.978818]  [805cf123] _sinittext+0x123/0x12a
 [  266.978821] 
 
 This though didn't cause any user-visible problem.
 
 .config file :
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 09/28] mm: __GFP_MEMALLOC

2008-02-23 Thread Andrew Morton

On Wed, 20 Feb 2008 15:46:19 +0100 Peter Zijlstra [EMAIL PROTECTED] wrote:

 __GFP_MEMALLOC will allow the allocation to disregard the watermarks, 
 much like PF_MEMALLOC.
 

'twould be nice if the changelog had some explanation of the reason
for this change.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 07/28] mm: emergency pool

2008-02-23 Thread Andrew Morton

On Wed, 20 Feb 2008 15:46:17 +0100 Peter Zijlstra [EMAIL PROTECTED] wrote:

 @@ -213,7 +213,7 @@ enum zone_type {
  
  struct zone {
   /* Fields commonly accessed by the page allocator */
 - unsigned long   pages_min, pages_low, pages_high;
 + unsigned long   pages_emerg, pages_min, pages_low, pages_high;

It would be nice to make these one-per-line, then document them.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 05/28] mm: allow PF_MEMALLOC from softirq context

2008-02-23 Thread Andrew Morton

On Wed, 20 Feb 2008 15:46:15 +0100 Peter Zijlstra [EMAIL PROTECTED] wrote:

 Allow PF_MEMALLOC to be set in softirq context. When running softirqs from
 a borrowed context save current-flags, ksoftirqd will have its own 
 task_struct.

The second sentence doesn't make sense.

 This is needed to allow network softirq packet processing to make use of
 PF_MEMALLOC.

 ...

 +#define tsk_restore_flags(p, pflags, mask) \
 + do {(p)-flags = ~(mask); \
 + (p)-flags |= ((pflags)  (mask)); } while (0)
 +

Does it need to be a macro?

If so, it really should cook up a temporary to avoid referencing p twice -
the children might be watching.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 00/28] Swap over NFS -v16

2008-02-23 Thread Andrew Morton

On Wed, 20 Feb 2008 15:46:10 +0100 Peter Zijlstra [EMAIL PROTECTED] wrote:

 Another posting of the full swap over NFS series. 

Well I looked.  There's rather a lot of it and I wouldn't pretend to
understand it.

What is the NFS and net people's take on all of this?
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 04/28] mm: kmem_estimate_pages()

2008-02-23 Thread Andrew Morton

On Wed, 20 Feb 2008 15:46:14 +0100 Peter Zijlstra [EMAIL PROTECTED] wrote:

 Provide a method to get the upper bound on the pages needed to allocate
 a given number of objects from a given kmem_cache.
 
 This lays the foundation for a generic reserve framework as presented in
 a later patch in this series. This framework needs to convert object demand
 (kmalloc() bytes, kmem_cache_alloc() objects) to pages.
 
 ...

  /*
 + * return the max number of pages required to allocated count
 + * objects from the given cache
 + */
 +unsigned kmem_estimate_pages(struct kmem_cache *s, gfp_t flags, int objects)

You might want to have another go at that comment.

 +/*
 + * return the max number of pages required to allocate @bytes from kmalloc
 + * in an unspecified number of allocation of heterogeneous size.
 + */
 +unsigned kestimate(gfp_t flags, size_t bytes)

And its pal.


--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 08/28] mm: system wide ALLOC_NO_WATERMARK

2008-02-23 Thread Andrew Morton

On Wed, 20 Feb 2008 15:46:18 +0100 Peter Zijlstra [EMAIL PROTECTED] wrote:

 Change ALLOC_NO_WATERMARK page allocation such that the reserves are system
 wide - which they are per setup_per_zone_pages_min(), when we scrape the
 barrel, do it properly.
 

The changelog is fairly incomprehensible.

  mm/page_alloc.c |6 ++
  1 file changed, 6 insertions(+)
 
 Index: linux-2.6/mm/page_alloc.c
 ===
 --- linux-2.6.orig/mm/page_alloc.c
 +++ linux-2.6/mm/page_alloc.c
 @@ -1552,6 +1552,12 @@ restart:
  rebalance:
   if (alloc_flags  ALLOC_NO_WATERMARKS) {
  nofail_alloc:
 + /*
 +  * break out of mempolicy boundaries
 +  */
 + zonelist = NODE_DATA(numa_node_id())-node_zonelists +
 + gfp_zone(gfp_mask);
 +
   /* go through the zonelist yet again, ignoring mins */
   page = get_page_from_freelist(gfp_mask, order, zonelist,
   ALLOC_NO_WATERMARKS);

As is the patch.  People who care about mempolicies will want a better
explanation, please, so they can check that we're not busting their stuff.

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 15/28] netvm: network reserve infrastructure

2008-02-23 Thread Andrew Morton

On Wed, 20 Feb 2008 15:46:25 +0100 Peter Zijlstra [EMAIL PROTECTED] wrote:

 Provide the basic infrastructure to reserve and charge/account network memory.
 
 We provide the following reserve tree:
 
 1)  total network reserve
 2)network TX reserve
 3)  protocol TX pages
 4)network RX reserve
 5)  SKB data reserve
 
 [1] is used to make all the network reserves a single subtree, for easy
 manipulation.
 
 [2] and [4] are merely for eastetic reasons.
 
 The TX pages reserve [3] is assumed bounded by it being the upper bound of
 memory that can be used for sending pages (not quite true, but good enough)
 
 The SKB reserve [5] is an aggregate reserve, which is used to charge SKB data
 against in the fallback path.
 
 The consumers for these reserves are sockets marked with:
   SOCK_MEMALLOC
 
 Such sockets are to be used to service the VM (iow. to swap over). They
 must be handled kernel side, exposing such a socket to user-space is a BUG.
 
 +/**
 + *   sk_adjust_memalloc - adjust the global memalloc reserve for critical RX
 + *   @socks: number of new %SOCK_MEMALLOC sockets
 + *   @tx_resserve_pages: number of pages to (un)reserve for TX
 + *
 + *   This function adjusts the memalloc reserve based on system demand.
 + *   The RX reserve is a limit, and only added once, not for each socket.
 + *
 + *   NOTE:
 + *  @tx_reserve_pages is an upper-bound of memory used for TX hence
 + *  we need not account the pages like we do for RX pages.
 + */
 +int sk_adjust_memalloc(int socks, long tx_reserve_pages)
 +{
 + int nr_socks;
 + int err;
 +
 + err = mem_reserve_pages_add(net_tx_pages, tx_reserve_pages);
 + if (err)
 + return err;
 +
 + nr_socks = atomic_read(memalloc_socks);
 + if (!nr_socks  socks  0)
 + err = mem_reserve_connect(net_reserve, mem_reserve_root);

This looks like it should have some locking?

 + nr_socks = atomic_add_return(socks, memalloc_socks);
 + if (!nr_socks  socks)
 + err = mem_reserve_disconnect(net_reserve);

Or does that try to make up for it?  Still looks fishy.

 + if (err)
 + mem_reserve_pages_add(net_tx_pages, -tx_reserve_pages);
 +
 + return err;
 +}
 +
 +/**
 + *   sk_set_memalloc - sets %SOCK_MEMALLOC
 + *   @sk: socket to set it on
 + *
 + *   Set %SOCK_MEMALLOC on a socket and increase the memalloc reserve
 + *   accordingly.
 + */
 +int sk_set_memalloc(struct sock *sk)
 +{
 + int set = sock_flag(sk, SOCK_MEMALLOC);
 +#ifndef CONFIG_NETVM
 + BUG();
 +#endif

??  #error, maybe?

 + if (!set) {
 + int err = sk_adjust_memalloc(1, 0);
 + if (err)
 + return err;
 +
 + sock_set_flag(sk, SOCK_MEMALLOC);
 + sk-sk_allocation |= __GFP_MEMALLOC;
 + }
 + return !set;
 +}
 +EXPORT_SYMBOL_GPL(sk_set_memalloc);

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 17/28] netvm: hook skb allocation to reserves

2008-02-23 Thread Andrew Morton

On Wed, 20 Feb 2008 15:46:27 +0100 Peter Zijlstra [EMAIL PROTECTED] wrote:

 Change the skb allocation api to indicate RX usage and use this to fall back 
 to
 the reserve when needed. SKBs allocated from the reserve are tagged in
 skb-emergency.
 
 Teach all other skb ops about emergency skbs and the reserve accounting.
 
 Use the (new) packet split API to allocate and track fragment pages from the
 emergency reserve. Do this using an atomic counter in page-index. This is
 needed because the fragments have a different sharing semantic than that
 indicated by skb_shinfo()-dataref. 
 
 Note that the decision to distinguish between regular and emergency SKBs 
 allows
 the accounting overhead to be limited to the later kind.
 
 ...

 +static inline void skb_get_page(struct sk_buff *skb, struct page *page)
 +{
 + get_page(page);
 + if (skb_emergency(skb))
 + atomic_inc(page-frag_count);
 +}
 +
 +static inline void skb_put_page(struct sk_buff *skb, struct page *page)
 +{
 + if (skb_emergency(skb)  atomic_dec_and_test(page-frag_count))
 + rx_emergency_put(PAGE_SIZE);
 + put_page(page);
 +}

I'm thinking we should do `#define slowcall inline' then use that in the future.

  static void skb_release_data(struct sk_buff *skb)
  {
   if (!skb-cloned ||
   !atomic_sub_return(skb-nohdr ? (1  SKB_DATAREF_SHIFT) + 1 : 1,
  skb_shinfo(skb)-dataref)) {
 + int size;
 +
 +#ifdef NET_SKBUFF_DATA_USES_OFFSET
 + size = skb-end;
 +#else
 + size = skb-end - skb-head;
 +#endif

The patch adds rather a lot of ifdefs.


--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 10/28] mm: memory reserve management

2008-02-23 Thread Andrew Morton

On Wed, 20 Feb 2008 15:46:20 +0100 Peter Zijlstra [EMAIL PROTECTED] wrote:

 Generic reserve management code. 
 
 It provides methods to reserve and charge. Upon this, generic alloc/free style
 reserve pools could be build, which could fully replace mempool_t
 functionality.
 
 It should also allow for a Banker's algorithm replacement of __GFP_NOFAIL.

Generally: the comments in this code are a bit straggly and hard to follow.
They'd be worth a revisit.

 +/*
 + * Simple output of the reserve tree in: /proc/reserve_info
 + * Example:
 + *
 + * localhost ~ # cat /proc/reserve_info
 + * total reserve  8156K (0/544817)
 + *   total network reserve  8156K (0/544817)
 + * network TX reserve 196K (0/49)
 + *   protocol TX pages  196K (0/49)
 + * network RX reserve 7960K (0/544768)
 + *   IPv6 route cache   1372K (0/4096)
 + *   IPv4 route cache   5468K (0/16384)
 + *   SKB data reserve   1120K (0/524288)
 + * IPv6 fragment cache560K (0/262144)
 + * IPv4 fragment cache560K (0/262144)
 + */

Well, Simple was a freudian typo.  Not designed for programmatic parsing,
I see.

 +static __init int mem_reserve_proc_init(void)
 +{
 + struct proc_dir_entry *entry;
 +
 + entry = create_proc_entry(reserve_info, S_IRUSR, NULL);

I think we're supposed to use proc_create().  Blame Alexey.

 + if (entry)
 + entry-proc_fops = mem_reserve_opterations;
 +
 + return 0;
 +}
 +
 +__initcall(mem_reserve_proc_init);

module_init() is more trendy.


--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] alloc_percpu() fails to allocate percpu data

2008-02-23 Thread Andrew Morton

On Thu, 21 Feb 2008 19:00:03 +0100 Eric Dumazet [EMAIL PROTECTED] wrote:

 +#ifndef cache_line_size
 +#define cache_line_size()L1_CACHE_BYTES
 +#endif

argh, you made me look.

Really cache_line_size() should be implemented in include/linux/cache.h. 
Then we tromp the stupid private implementations in slob.c and slub.c.

Then we wonder why x86 uses a custom cache_line_size(), but still uses
L1_CACHE_BYTES for its L1_CACHE_ALIGN().

Once we've answered that, we look at your

+   /*
+* We should make sure each CPU gets private memory.
+*/
+   size = roundup(size, cache_line_size());

and wonder whether it should have used L1_CACHE_ALIGN().

I think I'd better stop looking.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH 8/8] Jhash in too big for inlining, move under lib/

2008-02-23 Thread Andrew Morton

On Sat, 23 Feb 2008 12:05:36 +0200 (EET) Ilpo Järvinen [EMAIL PROTECTED] 
wrote:

 On Sat, 23 Feb 2008, Andrew Morton wrote:
 
  On Wed, 20 Feb 2008 15:47:18 +0200 Ilpo Järvinen [EMAIL PROTECTED] 
  wrote:
  
   vmlinux.o:
62 functions changed, 66 bytes added, 10935 bytes removed, diff: -10869
   
   ...+ these to lib/jhash.o:
jhash_3words: 112
jhash2: 276
jhash: 475
   
   select for networking code might need a more fine-grained approach.
  
  It should be possible to use a modular jhash.ko.  The things which you
  have identified as clients of the jhash library are usually loaded as 
  modules.
  But in the case where someone does (say) NFSD=y we do need jhash.o linked 
  into
  vmlinux also. This is doable in Kconfig but I always forget how.
 
 Ok, even though its not that likely that one lives without e.g.
 net/ipv4/inet_connection_sock.c or net/netlink/af_netlink.c?

Sure, the number of people who will want CONFIG_JHASH=n is very small.

 But maybe 
 some guys really know what they are doing and can come up with config 
 that would be able to build it as module (for other than proof-of-concept 
 uses I mean)... :-/
 
  Adrian, Sam and Randy are the repositories of knowledge here ;)
 
 Thanks, I'll consult them in this. I've never needed to do any Kconfig 
 stuff so far so it's no surprise I have very little clue... :-)

Thanks.  If it gets messy I'd just put lib/jhash.o in obj-y.  I assume it's
pretty small.

 I've one question for you Andrew, how would you like this kind of 
 cross-subsys toucher to be merged, through you directly I suppose?

Sure, I can scrounge around for appropriate reviews and acks and can make
sure that nobody's pending work gets disrupted.

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH 0/8]: uninline uninline

2008-02-23 Thread Andrew Morton

On Sat, 23 Feb 2008 14:15:06 +0100 Andi Kleen [EMAIL PROTECTED] wrote:

 Andrew Morton [EMAIL PROTECTED] writes:
 
 
  -41525  2066 f, 3370 +, 44895 -, diff: -41525  IS_ERR 
 
  This is a surprise.  I expect that the -mm-only
  profile-likely-unlikely-macros.patch is the cause of this and mainline
  doesn't have this problem.
 
 Shouldn't they only have overhead when the respective CONFIG is enabled?

yup.

  If true, then this likely/unlikely bloat has probably spread into a lot of
  your other results and it all should be redone against mainline, sorry :(
 
  (I'm not aware of anyone having used profile-likely-unlikely-macros.patch
  in quite some time.  That's unfortunate because it has turned up some
  fairly flagrant code deoptimisations)
 
 Is there any reason they couldn't just be merged to mainline? 
 
 I think it's a useful facility.

ummm, now why did we made that decision...  I think we decided that it's
the sort of thing which one person can run once per few months and that
will deliver its full value.  I can maintain it in -mm and we're happy - no
need to add it to mainline.  No strong feelings either way really.

It does have the downside that the kernel explodes if someone adds unlikely
or likely to the vdso code and I need to occasionally hunt down new
additions and revert them in that patch.  That makes it a bit of a
maintenance burden.

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Bugme-new] [Bug 10061] New: Hang in md5_resync

2008-02-21 Thread Andrew Morton


(please respond via emailed reply-to-all, not via the bugzilla web
interface, thanks)

On Thu, 21 Feb 2008 13:13:13 -0800 (PST) [EMAIL PROTECTED] wrote:

 http://bugzilla.kernel.org/show_bug.cgi?id=10061
 
Summary: Hang in md5_resync
Product: IO/Storage
Version: 2.5
  KernelVersion: 2.6.25-rc2
   Platform: All
 OS/Version: Linux
   Tree: Mainline
 Status: NEW
   Severity: normal
   Priority: P1
  Component: MD
 AssignedTo: [EMAIL PROTECTED]
 ReportedBy: [EMAIL PROTECTED]
 
 
 Latest working kernel version: 2.6.23
 Earliest failing kernel version: 2.6.25-rc2
 Distribution: Debian unstable
 Hardware Environment: sata_nv, ext3, Seagate SATA disks 
 Software Environment: N/A
 Problem Description: After upgrading to 2.6.25, I did some light I/O 
 (basically
 a tab-complete in my home directory), and the whole machine promptly hung.
 After a while, the serial console spat out:
 
 [  548.684064] BUG: soft lockup - CPU#1 stuck for]
 [  548.684065] CPU 1:
 [  548.684065] Modules linked in: tun eeprom ide_generic ide_disk ide_cd_mod
 cdx
 [  548.684065] Pid: 3599, comm: md4_resync Not tainted 2.6.25-rc2 #1
 [  548.684065] RIP: 0010:[802fee07]  [802fee07]
 __write_loc0
 [  548.684065] RSP: 0018:81011fcffc18  EFLAGS: 0287
 [  548.684065] RAX: 81011c14dfd8 RBX: 8100d42d30c0 RCX:
 8100d42d30c0
 [  548.684065] RDX:  RSI: 8100d4cd0d90 RDI:
 8100d4cd0d00
 [  548.684065] RBP: 81011fcffb90 R08: fe538a93 R09:
 
 [  548.684065] R10: 00c0 R11: 803ad3e1 R12:
 8020bcb6
 [  548.684065] R13: 81011fcffb90 R14: 8100d4cd0d00 R15:
 
 [  548.684065] FS:  7f8748e0c6e0() GS:81011fcb5dc0()
 knlGS:0
 [  548.684065] CS:  0010 DS: 0018 ES: 0018 CR0: 8005003b
 [  548.684066] CR2: 7f8748e076f3 CR3: d296b000 CR4:
 06e0
 [  548.684066] DR0:  DR1:  DR2:
 
 [  548.684066] DR3:  DR6: 0ff0 DR7:
 0400
 [  548.684066]
 [  548.684066] Call Trace:
 [  548.684066]  IRQ  [80446fe4] ? _write_lock_bh+0x1a/0x1c
 [  548.684066]  [803ad49e] ? neigh_resolve_output+0xbd/0x281
 [  548.684066]  [80407ab5] ? ip6_output+0xc5c/0xc71
 [  548.684066]  [8039d414] ? sock_alloc_send_skb+0x93/0x1e1
 [  548.684066]  [8041685b] ? __ndisc_send+0x3c6/0x4fb
 [  548.684066]  [880a431a] ? :sd_mod:sd_prep_fn+0x74/0x6fb
 [  548.684066]  [8041725b] ? ndisc_send_ns+0x71/0x7e
 [  548.684066]  [80363c3c] ? scsi_request_fn+0x339/0x3a5
 [  548.684066]  [804181df] ? ndisc_solicit+0x159/0x166
 [  548.684066]  [80235673] ? __mod_timer+0xb0/0xbf
 [  548.684066]  [803adc72] ? neigh_timer_handler+0x292/0x2d0
 [  548.684066]  [803ad9e0] ? neigh_timer_handler+0x0/0x2d0
 [  548.684066]  [8023521b] ? run_timer_softirq+0x151/0x1bf
 [  548.684066]  [80231f40] ? __do_softirq+0x65/0xcf
 [  548.684066]  [8020c20c] ? call_softirq+0x1c/0x28
 [  548.684066]  [8020d8b4] ? do_softirq+0x2c/0x68
 [  548.684066]  [8021919d] ? smp_apic_timer_interrupt+0x8a/0xa5
 [  548.684066]  [8020bcb6] ? apic_timer_interrupt+0x66/0x70
 [  548.684066]  EOI  [880d549c] ?
 :raid456:raid5_compute_sector+0x19
 [  548.684066]  [880d5511] ? :raid456:stripe_to_pdidx+0x48/0x51
 [  548.684066]  [880db728] ? :raid456:sync_request+0x7d3/0x8a1
 [  548.684066]  [802edf17] ? generic_unplug_device+0x18/0x24
 [  548.684066]  [802ecdef] ? blk_unplug+0x5a/0x60
 [  548.684066]  [880d51af] ? :raid456:unplug_slaves+0x55/0x91
 [  548.684066]  [880b0884] ? :md_mod:md_do_sync+0x54e/0x96c
 [  548.684066]  [802429a7] ? getnstimeofday+0x2f/0x83
 [  548.684066]  [802263e0] ? try_to_wake_up+0xfd/0x10e
 [  548.684066]  [880b2ff6] ? :md_mod:md_thread+0xde/0xfa
 [  548.684066]  [880b2f18] ? :md_mod:md_thread+0x0/0xfa
 [  548.684066]  [8023e741] ? kthread+0x47/0x76
 [  548.684066]  [80228ebd] ? schedule_tail+0x28/0x5c
 [  548.684066]  [8020be98] ? child_rip+0xa/0x12
 [  548.684066]  [8023e6fa] ? kthread+0x0/0x76
 [  548.684066]  [8020be8e] ? child_rip+0x0/0x12
 [  548.684066]
 
 and the machine continued hanging.
 
 Downgraded to 2.6.23 again, seems to work fine. Will try 2.6.24.2 shortly.

This looks like a networking regression, not a RAID regression.

We did have a big locking bug in netlink in 2.6.25-rc2, although your trace
doesn't appear to point at that particular bug.

It would be good if you could test the latest Linux snapshot please, see if
we've already fixed this, thanks.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]

Re: Linux 2.6.24.1 - kernel does not boot; IRQ trouble?

2008-02-18 Thread Andrew Morton

On Sun, 17 Feb 2008 00:54:08 + (GMT) Chris Rankin [EMAIL PROTECTED] wrote:

 [Try this again, except this time I'll force the attachment as inline text!]
 
 Hi,
 
 I have managed to boot 2.6.24.1 on this machine, with the NMI watchdog 
 enabled, by using the
 acpi=noirq option. (There does seem to be some unhappiness with bridge 
 symlinks in sysfs,
 though.)
 
 ...

 sysfs: duplicate filename 'bridge' can not be created
 WARNING: at fs/sysfs/dir.c:424 sysfs_add_one()
 Pid: 1, comm: swapper Not tainted 2.6.24.1 #1
  [c0105020] show_trace_log_lvl+0x1a/0x2f
  [c0105990] show_trace+0x12/0x14
  [c010613d] dump_stack+0x6c/0x72
  [c01991bf] sysfs_add_one+0x57/0xbc
  [c0199e41] sysfs_create_link+0xc2/0x10d
  [c01bae9a] pci_bus_add_devices+0xbd/0x103
  [c034016c] pci_legacy_init+0x56/0xe3
  [c03274e1] kernel_init+0x157/0x2c3
  [c0104c83] kernel_thread_helper+0x7/0x10
  ===
 pci :00:01.0: Error creating sysfs bridge symlink, continuing...
 sysfs: duplicate filename 'bridge' can not be created
 WARNING: at fs/sysfs/dir.c:424 sysfs_add_one()
 Pid: 1, comm: swapper Not tainted 2.6.24.1 #1
  [c0105020] show_trace_log_lvl+0x1a/0x2f
  [c0105990] show_trace+0x12/0x14
  [c010613d] dump_stack+0x6c/0x72
  [c01991bf] sysfs_add_one+0x57/0xbc
  [c0199e41] sysfs_create_link+0xc2/0x10d
  [c01bae9a] pci_bus_add_devices+0xbd/0x103
  [c01bae82] pci_bus_add_devices+0xa5/0x103
  [c034016c] pci_legacy_init+0x56/0xe3
  [c03274e1] kernel_init+0x157/0x2c3
  [c0104c83] kernel_thread_helper+0x7/0x10
  ===

I have a vague feeling that this was fixed, perhaps in 2.6.24.x?
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [2.6.25-rc2] e100: Trying to free already-free IRQ 11 during suspend ...

2008-02-18 Thread Andrew Morton

On Sun, 17 Feb 2008 15:36:50 +0300 Andrey Borzenkov [EMAIL PROTECTED] wrote:

 ... and possibly reboot/poweroff (it flows by too fast to be legible).
 
 [ 8803.850634] ACPI: Preparing to enter system sleep state S3
 [ 8803.853141] Suspending console(s)
 [ 8805.287505] serial 00:09: disabled
 [ 8805.291564] Trying to free already-free IRQ 11
 [ 8805.291579] Pid: 6920, comm: pm-suspend Not tainted 2.6.25-rc2-1avb #2
 [ 8805.291628]  [c0152127] free_irq+0xb7/0x130
 [ 8805.291675]  [c024bd80] e100_suspend+0xc0/0x100
 [ 8805.291724]  [c01eaa36] pci_device_suspend+0x26/0x70
 [ 8805.291747]  [c0243674] suspend_device+0x94/0xd0
 [ 8805.291763]  [c02439a3] device_suspend+0x153/0x240
 [ 8805.291784]  [c014314f] suspend_devices_and_enter+0x4f/0xf0
 [ 8805.291808]  [c0143a5f] ? freeze_processes+0x3f/0x80
 [ 8805.291825]  [c01432fa] enter_state+0xaa/0x140
 [ 8805.291840]  [c014341f] state_store+0x8f/0xd0
 [ 8805.291852]  [c0143390] ? state_store+0x0/0xd0
 [ 8805.291866]  [c01d3404] kobj_attr_store+0x24/0x30
 [ 8805.291901]  [c01b547b] sysfs_write_file+0xbb/0x110
 [ 8805.291936]  [c0177d79] vfs_write+0x99/0x130
 [ 8805.291963]  [c01b53c0] ? sysfs_write_file+0x0/0x110
 [ 8805.291979]  [c01782fd] sys_write+0x3d/0x70
 [ 8805.291998]  [c010409a] sysenter_past_esp+0x5f/0xa5
 [ 8805.292038]  ===
 [ 8805.347640] ACPI: PCI interrupt for device :00:06.0 disabled
 [ 8805.361128] ACPI: PCI interrupt for device :00:02.0 disabled
 [ 8805.376670]  hwsleep-0322 [00] enter_sleep_state : Entering sleep 
 state [S3]
 [ 8805.376670] Back to C!
 
 Interface is unused normally (only for netconsole sometimes). dmesg and config
 attached.

Does reverting this:

commit 8543da6672b0994921f014f2250e27ae81645580
Author: Auke Kok [EMAIL PROTECTED]
Date:   Wed Dec 12 16:30:42 2007 -0800

e100: free IRQ to remove warningwhenrebooting

with this patch:

--- a/drivers/net/e100.c~revert-1
+++ a/drivers/net/e100.c
@@ -2804,9 +2804,8 @@ static int e100_suspend(struct pci_dev *
pci_enable_wake(pdev, PCI_D3cold, 0);
}
 
-   free_irq(pdev-irq, netdev);
-
pci_disable_device(pdev);
+   free_irq(pdev-irq, netdev);
pci_set_power_state(pdev, PCI_D3hot);
 
return 0;
@@ -2848,8 +2847,6 @@ static void e100_shutdown(struct pci_dev
pci_enable_wake(pdev, PCI_D3cold, 0);
}
 
-   free_irq(pdev-irq, netdev);
-
pci_disable_device(pdev);
pci_set_power_state(pdev, PCI_D3hot);
 }
_

fix it?

 Hmm ... after resume device has disappeared at all ...
 
 {pts/1}% cat /proc/interrupts
CPU0
   0:1290492XT-PIC-XTtimer
   1:   6675XT-PIC-XTi8042
   2:  0XT-PIC-XTcascade
   3:  2XT-PIC-XT
   4:  2XT-PIC-XT
   5:  3XT-PIC-XT
   7:  4XT-PIC-XTirda0
   8:  0XT-PIC-XTrtc0
   9:583XT-PIC-XTacpi
  10:  2XT-PIC-XT
  11:  31483XT-PIC-XTyenta, yenta, yenta, ohci_hcd:usb1, ALI 
 5451, pcmcia0.0
  12:  28070XT-PIC-XTi8042
  14:  21705XT-PIC-XTide0
  15:  82123XT-PIC-XTide1
 NMI:  0   Non-maskable interrupts
 TRM:  0   Thermal event interrupts
 SPU:  0   Spurious interrupts
 ERR:  0

I hope that's not a separate bug...
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [2.6.25-rc2, 2.6.24-rc8] page allocation failure...

2008-02-18 Thread Andrew Morton

On Sun, 17 Feb 2008 13:20:59 + Daniel J Blueman [EMAIL PROTECTED] wrote:

 I'm still hitting this with e1000e on 2.6.25-rc2, 10 times again.
 
 It's clearly non-fatal, but then do we expect it to occur?
 
 Daniel
 
 --- [dmesg]
 
 [ 1250.822786] swapper: page allocation failure. order:3, mode:0x4020
 [ 1250.822786] Pid: 0, comm: swapper Not tainted 2.6.25-rc2-119 #2
 [ 1250.822786]
 [ 1250.822786] Call Trace:
 [ 1250.822786]  IRQ  [8025fe9e] __alloc_pages+0x34e/0x3a0
 [ 1250.822786]  [8048c6df] ? __netdev_alloc_skb+0x1f/0x40
 [ 1250.822786]  [8027acc2] __slab_alloc+0x102/0x3d0
 [ 1250.822786]  [8048c6df] ? __netdev_alloc_skb+0x1f/0x40
 [ 1250.822786]  [8027b8cb] __kmalloc_track_caller+0x7b/0xc0
 [ 1250.822786]  [8048b74f] __alloc_skb+0x6f/0x160
 [ 1250.822786]  [8048c6df] __netdev_alloc_skb+0x1f/0x40
 [ 1250.822786]  [8042652d] e1000_alloc_rx_buffers+0x1ed/0x260
 [ 1250.822786]  [80426b5a] e1000_clean_rx_irq+0x22a/0x330
 [ 1250.822786]  [80422981] e1000_clean+0x1e1/0x540
 [ 1250.822786]  [8024b7a5] ? tick_program_event+0x45/0x70
 [ 1250.822786]  [804930ba] net_rx_action+0x9a/0x150
 [ 1250.822786]  [802336b4] __do_softirq+0x74/0xf0
 [ 1250.822786]  [8020c5fc] call_softirq+0x1c/0x30
 [ 1250.822786]  [8020eaad] do_softirq+0x3d/0x80
 [ 1250.822786]  [80233635] irq_exit+0x85/0x90
 [ 1250.822786]  [8020eba5] do_IRQ+0x85/0x100
 [ 1250.822786]  [8020a5b0] ? mwait_idle+0x0/0x50
 [ 1250.822786]  [8020b981] ret_from_intr+0x0/0xa
 [ 1250.822786]  EOI  [8020a5f5] ? mwait_idle+0x45/0x50
 [ 1250.822786]  [80209a92] ? enter_idle+0x22/0x30
 [ 1250.822786]  [8020a534] ? cpu_idle+0x74/0xa0
 [ 1250.822786]  [80527825] ? rest_init+0x55/0x60

They're regularly reported with e1000 too - I don't think aything really
changed.

e1000 has this crazy problem where because of a cascade of follies (mainly
borked hardware) it has to do a 32kb allocation for a 9kb(?) packet.  It
would be sad if that was carried over into e1000e?

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: include/linux/pcounter.h

2008-02-16 Thread Andrew Morton

On Sat, 16 Feb 2008 11:07:29 +0100 Eric Dumazet [EMAIL PROTECTED] wrote:

 
 Andrew, pcounter is a temporary abstraction.

It's buggy!  Main problems are a) possible return of negative numbers b)
some of the API can't be from preemptible code c) excessive interrupt-off
time on some machines if used from irq-disabled sections.

 It is temporaty because it will vanish as soon as Christoph Clameter (or 
 somebody else) provides real cheap per cpu counter implementation.

numbers?

most of percpu_counter_add() is only executed once per FBC_BATCH calls.

 At time I introduced it in network tree (locally, not meant to invade kernel 
 land and makes you unhappy :) ), the goals were :

Well maybe as a temporary networking-only thing OK, based upon
performance-tested results.  But I don't think the present code is suitable
as part of the kernel-wide toolkit.

 Some counters (total sockets count) were a single integer, that were doing 
 ping-pong between cpus (SMP/NUMA). As they are basically lazy values (as we 
 dont really need to read their value), using plain atomic_t was overkill.
 
 Using a plain percpu_counters was expensive (NR_CPUS*(32+sizeof(void *)) 
 instead of num_possible_cpus()*4).

No, percpu_counters use alloc_percpu(), which is O(num_possible_cpus), not
O(NR_CPUS).

 Using 'online' instead of 'possible' stuff is not really needed for a 
 temporary thing.

This was put in ./lib/!

 - We dont care of read sides.

Well the present single caller in networking might not care.  But this was
put in ./lib/ and was exported to modules.  That is an invitation to all
kernel developers to use it in new code.  Which may result in truly awful
performance on high-cpu-count machines.

  We want really fast write side. Real fast.

eh?  It's called on a per-connection basis, not on a per-packet basis?

 Read side is when you do a cat /proc/net/sockstat.
 That is ... once in a while...

For the current single caller.  But it's in ./lib/.

And there's always someone out there who does whatever we don't expect them
to do.

 Now when we allocate a new socket, code to increment the socket count is :
 
 c03a74a8 tcp_pcounter_add:
 c03a74a8:   b8 90 26 5f c0  mov$0xc05f2690,%eax
 c03a74ad:   64 8b 0d 10 f1 5e c0mov%fs:0xc05ef110,%ecx
 c03a74b4:   01 14 01add%edx,(%ecx,%eax,1)
 c03a74b7:   c3  ret

I can't find that code.  I suspect that's the DEFINE_PER_CPU flavour, which
isn't used anywhere afaict.  Plus this omits the local_irq_save/restore (or
preempt_disable/enable) and the indirect function call, which can be
expensive.

 That is 4 instructions. I could be two in the future, thanks to current work 
 on fs/gs based percpu variables.
 
 Current percpu_counters implementation is more expensive :
 
 c021467b __percpu_counter_add:
 c021467b:   55  push   %ebp
 c021467c:   57  push   %edi
 c021467d:   89 c7   mov%eax,%edi
 c021467f:   56  push   %esi
 c0214680:   53  push   %ebx
 c0214681:   83 ec 04sub$0x4,%esp
 c0214684:   8b 40 14mov0x14(%eax),%eax
 c0214687:   64 8b 1d 08 f0 5e c0mov%fs:0xc05ef008,%ebx
 c021468e:   8b 6c 24 18 mov0x18(%esp),%ebp
 c0214692:   f7 d0   not%eax
 c0214694:   8b 1c 98mov(%eax,%ebx,4),%ebx
 c0214697:   89 1c 24mov%ebx,(%esp)
 c021469a:   8b 03   mov(%ebx),%eax
 c021469c:   89 c3   mov%eax,%ebx
 c021469e:   89 c6   mov%eax,%esi
 c02146a0:   c1 fe 1fsar$0x1f,%esi
 c02146a3:   89 e8   mov%ebp,%eax
 c02146a5:   01 d3   add%edx,%ebx
 c02146a7:   11 ce   adc%ecx,%esi
 c02146a9:   99  cltd
 c02146aa:   39 d6   cmp%edx,%esi
 c02146ac:   7f 15   jg c02146c3 
 __percpu_counter_add+0x48
 c02146ae:   7c 04   jl c02146b4 

One of the above two branches is taken ((FBC_BATCH-1)/FBC_BATCH)ths of the
time.


 __percpu_counter_add+0x39
 c02146b0:   39 eb   cmp%ebp,%ebx
 c02146b2:   73 0f   jaec02146c3 
 __percpu_counter_add+0x48
 c02146b4:   f7 dd   neg%ebp
 c02146b6:   89 e8   mov%ebp,%eax
 c02146b8:   99  cltd
 c02146b9:   39 d6   cmp%edx,%esi
 c02146bb:   7f 20   jg c02146dd 
 __percpu_counter_add+0x62
 c02146bd:   7c 04   jl c02146c3 
 __percpu_counter_add+0x48
 c02146bf:   39 eb   cmp%ebp,%ebx
 c02146c1:   77 1a   ja c02146dd 
 __percpu_counter_add+0x62
 c02146c3:   89 f8

Re: [PATCH 1/2] remove rcu_assign_pointer(NULL) penalty with type/macro safety

2008-02-15 Thread Andrew Morton

On Wed, 13 Feb 2008 14:00:24 -0800
Paul E. McKenney [EMAIL PROTECTED] wrote:

 Hello!
 
 This is an updated version of the patch posted last November:
 
   http://archives.free.net.ph/message/20071201.003721.cd6ff17c.en.html
 
 This new version permits arguments with side effects, for example:
 
   rcu_assign_pointer(global_p, p++);
 
 and also verifies that the arguments are pointers, while still avoiding
 the unnecessary memory barrier when assigning NULL to a pointer.
 This memory-barrier avoidance means that rcu_assign_pointer() is now only
 permitted for pointers (not array indexes), and so this version emits a
 compiler warning if the first argument is not a pointer.  I built a make
 allyesconfig version on an x86 system, and received no such warnings.
 If RCU is ever applied to array indexes, then the second patch in this
 series should be applied, and the resulting rcu_assign_index() be used.
 
 Given the rather surprising history of subtlely broken implementations of
 rcu_assign_pointer(), I took the precaution of generating a full set of
 test cases and verified that memory barriers and compiler warnings were
 emitted when required.  I guess it is the simple things that get you...
 
 Signed-off-by: Paul E. McKenney [EMAIL PROTECTED]
 ---
 
  rcupdate.h |   16 
  1 file changed, 12 insertions(+), 4 deletions(-)
 
 diff -urpNa -X dontdiff linux-2.6.24/include/linux/rcupdate.h 
 linux-2.6.24-rap/include/linux/rcupdate.h
 --- linux-2.6.24/include/linux/rcupdate.h 2008-01-24 14:58:37.0 
 -0800
 +++ linux-2.6.24-rap/include/linux/rcupdate.h 2008-02-13 13:36:47.0 
 -0800
 @@ -270,12 +270,20 @@ extern struct lockdep_map rcu_lock_map;
   * structure after the pointer assignment.  More importantly, this
   * call documents which pointers will be dereferenced by RCU read-side
   * code.
 + *
 + * Throws a compiler warning for non-pointer arguments.
 + *
 + * Does not insert a memory barrier for a NULL pointer.
   */
  
 -#define rcu_assign_pointer(p, v) ({ \
 - smp_wmb(); \
 - (p) = (v); \
 - })
 +#define rcu_assign_pointer(p, v) \
 + ({ \
 + typeof(*p) *_p1 = (v); \
 + \
 + if (!__builtin_constant_p(v) || (_p1 != NULL)) \
 + smp_wmb(); \
 + (p) = _p1; \
 + })
  

umm...

net/netfilter/core.c: In function 'nf_register_afinfo':
net/netfilter/core.c:39: warning: initialization discards qualifiers from 
pointer target type
net/netfilter/nf_log.c: In function 'nf_log_register':
net/netfilter/nf_log.c:37: warning: initialization discards qualifiers from 
pointer target type
net/netfilter/nf_queue.c: In function 'nf_register_queue_handler':
net/netfilter/nf_queue.c:38: warning: initialization discards qualifiers from 
pointer target type


--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: include/linux/pcounter.h

2008-02-15 Thread Andrew Morton


- First up, why was this added at all?  We have percpu_counter.h which
  has several years development invested in it.  afaict it would suit the
  present applications of pcounters.

  If some deficiency in percpu_counters has been identified, is it
  possible to correct that deficiency rather than implementing a whole new
  counter type?  That would be much better.

- The comments in pcounter.h appear to indicate that there is a
  performance advantage (and we infer that the advantage is when the
  statically-allocated flavour of pcounters is used).  When compared with
  percpu_counters the number of data-reference indirections is the same as
  with percpu_counters, so no advantage there.

  And, bizarrely, because of a quite inappropriate abstraction toy, both
  flavours of pcounters add an indirect function call which I believe is
  significantly more expensive than a plain old pointer indirection.

  So it's quite possible that DEFINE_PCOUNTER-style counters consume more
  memory and are slower than percpu_counters.  They will surely be much
  slower on the read side.  More below.

  If we really want to put some helper wrappers around
  DEFINE_PER_CPU(s32) then I'd suggest that we should do that as a
  standalone thing and not attempt to graft the same interface onto two
  quite different types of storage (DEFINE_PER_CPU and alloc_per_cpu)

- The comment 2) in pcounter.h (which overflows 80 cols and probably
  wasn't passed through checkpatch) indicates that some other
  implementation (presumably plain old DEFINE_PER_CPU) will use
  NR_CPUS*(32+sizeof(void *)) bytes of storage.  But DEFINE_PCOUNTER will
  use as much memory as DEFINE_PER_CPU(s32) and both pcounter_alloc()-style
  pcounters and percpu_counters use
  num_possible_cpus()*sizeof(s32)+epsilon.

- The CONFIG_SMP=n stubs in pcounter.h are cheesy and are vulnerable to
  several well-known compilation risks which I always forget.  Should be
  converted to regular static inlines.

- the comment in lib/pcounter.c needlessly exceeds 80 cols.

- pcounter_dyn_add() will spew a
  use-of-smp_processor_id()-in-preemptible-code warning if used in places
  where one could reasonably use it.  The interface could do with a bit of
  a rethink.  Or at least some justification and documentation.

- pcounter_getval() will be disastrously inefficient if
  num_possible_cpus() is much greater than num_online_cpus().  It should
  use for_each_online_cpu() (as does percpu_counter), and implement a CPU
  hotplug notifier (as does percpu_counter).

  It will remain grossly inefficient at high CPU counts, unlike
  percpu_counters, which solved this problem by doing a batched spill into
  a central counter at add/sub time.

  The danger here is that someone will actually use this interface in new
  code.  Six months later (when it's too late to fix it) the big-NUMA guys
  come up and say whaa, when our user does this it disabled interrupts
  for N milliseconds.

- pcounter_getval() can return incorrect negative numbers.  This can
  cause caller malfunctions in very rare situations because callers just
  don't expect the things which they're counting to go negative.

  We experienced this during the evolution of percpu_counter.  See
  percpu_counter_sum_positive() and friends.

- pcounter_alloc() should return -ENOMEM on allocation error, not 1.

- pcounter_free() perhaps shouldn't test for (self-per_cpu_values !=
  NULL), because callers shouldn't be calling it if pcounter_alloc() failed
  (arguable).



afaict the whole implementation can and should be removed and replaced with
percpu_counters.  I don't think there's much point in its ability to manage
DEFINE_PER_CPU counters: pcounter_getval() remains grossly inefficient (and
can return negative values) and quite a bit of new code will need to be put
in place to address that.

But perhaps there are plans to evolve it into something further in the
future, I don't know.  But I would suggest that the people who have worked
upon percpu_counters (principally Gautham, Peter Z, clameter and me) be
involved in that work.

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Bugme-new] [Bug 9990] New: tg3: eth0: The system may be re-ordering memory-mapped I/O cycles

2008-02-14 Thread Andrew Morton

On Thu, 14 Feb 2008 01:59:12 -0800 (PST) [EMAIL PROTECTED] wrote:

 http://bugzilla.kernel.org/show_bug.cgi?id=9990
 
Summary: tg3: eth0: The system may be re-ordering memory-mapped
 I/O cycles
Product: Drivers
Version: 2.5
  KernelVersion: 2.6.24-git18
   Platform: All
 OS/Version: Linux
   Tree: Mainline
 Status: NEW
   Severity: normal
   Priority: P1
  Component: Network
 AssignedTo: [EMAIL PROTECTED]
 ReportedBy: [EMAIL PROTECTED]
 
 
 Latest working kernel version: 2.6.24
 Earliest failing kernel version: 2.6.24-git18
 Distribution: Debian/testing
 Hardware Environment:
 Software Environment:
 Problem Description:
 
 Feb 11 13:11:52 www kernel: [   12.015569] tg3: eth0: Link is up at 100 Mbps,
 full duplex.
 Feb 11 13:11:52 www kernel: [   12.015633] tg3: eth0: Flow control is on for 
 TX
 and on for RX.
 Feb 11 13:33:44 www kernel: [ 1328.538204] tg3: eth0: The system may be
 re-ordering memory-mapped I/O cycles to the network
 device, attempting to recover. Please report the problem to the driver
 maintainer and include system chipset information.
 Feb 11 13:33:44 www kernel: [ 1328.667255] tg3: eth0: Link is down.
 Feb 11 13:33:46 www kernel: [ 1330.560734] tg3: eth0: Link is up at 100 Mbps,
 full duplex.
 Feb 11 13:33:46 www kernel: [ 1330.560734] tg3: eth0: Flow control is on for 
 TX
 and on for RX.
 
 After that, the machine rebooted (panic?)
 
 Feb 11 13:35:14 www kernel: klogd 1.5.0#1.1, log source = /proc/kmsg started.
 
 lspci -vvv info:
 02:02.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit
 Ethernet (rev 10)
 Subsystem: Compaq Computer Corporation NC7782 Gigabit Server Adapter
 (PCI-X, 10,100,1000-T)
 Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+
 Stepping- SERR+ FastB2B- DisINTx-
 Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium TAbort-
 TAbort- MAbort- SERR- PERR- INTx-
 Latency: 64 (16000ns min), Cache Line Size: 64 bytes
 Interrupt: pin A routed to IRQ 19
 Region 0: Memory at fdf7 (64-bit, non-prefetchable) [size=64K]
 [virtual] Expansion ROM at 8814 [disabled] [size=64K]
 Capabilities: [40] PCI-X non-bridge device
 Command: DPERE- ERO- RBC=2048 OST=1
 Status: Dev=02:02.0 64bit+ 133MHz+ SCD- USC- DC=simple
 DMMRBC=2048 DMOST=1 DMCRS=16 RSCEM- 266MHz- 533MHz-
 Capabilities: [48] Power Management version 2
 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
 PME(D0-,D1-,D2-,D3hot+,D3cold+)
 Status: D0 PME-Enable+ DSel=0 DScale=1 PME-
 Capabilities: [50] Vital Product Data ?
 Capabilities: [58] Message Signalled Interrupts: Mask- 64bit+ 
 Queue=0/3
 Enable-
 Address: fd7ffd6fdf7deeb8  Data: bdfd
 Kernel driver in use: tg3
 Kernel modules: tg3
 
 02:02.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit
 Ethernet (rev 10)
 Subsystem: Compaq Computer Corporation NC7782 Gigabit Server Adapter
 (PCI-X, 10,100,1000-T)
 Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+
 Stepping- SERR+ FastB2B- DisINTx-
 Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium TAbort-
 TAbort- MAbort- SERR- PERR- INTx-
 Latency: 64 (16000ns min), Cache Line Size: 64 bytes
 Interrupt: pin B routed to IRQ 20
 Region 0: Memory at fdf6 (64-bit, non-prefetchable) [size=64K]
 Capabilities: [40] PCI-X non-bridge device
 Command: DPERE- ERO+ RBC=512 OST=1
 Status: Dev=02:02.1 64bit+ 133MHz+ SCD- USC- DC=simple
 DMMRBC=2048 DMOST=1 DMCRS=16 RSCEM- 266MHz- 533MHz-
 Capabilities: [48] Power Management version 2
 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
 PME(D0-,D1-,D2-,D3hot+,D3cold+)
 Status: D0 PME-Enable- DSel=0 DScale=1 PME-
 Capabilities: [50] Vital Product Data ?
 Capabilities: [58] Message Signalled Interrupts: Mask- 64bit+ 
 Queue=0/3
 Enable-
 Address: f73feeefe7f8  Data: 9bcd
 Kernel driver in use: tg3
 Kernel modules: tg3
 
 
 Steps to reproduce:
 
 
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: BUG: 2.6.25-rc1: iptables postrouting setup causes oops

2008-02-13 Thread Andrew Morton

On Tue, 12 Feb 2008 22:46:01 +1100 Ben Nizette [EMAIL PROTECTED] wrote:

 
 On an AVR32, root over NFS, config attached, running (from a startup
 script):
 
 iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
 
 Results in (dmesg extract including a bit of context for good measure):
 -8
 VFS: Mounted root (nfs filesystem).
 Freeing init memory: 72K (9000 - 90012000)
 eth0: no IPv6 routers present
 warning: `dnsmasq' uses 32-bit capabilities (legacy support in use)
 ip_tables: (C) 2000-2006 Netfilter Core Team
 nf_conntrack version 0.5.0 (1024 buckets, 4096 max)
 Unable to handle kernel paging request at virtual address d76a7138
 ptbr = 91d3b000 pgd = e5f3 pte = 00014370
 Oops: Kernel access of bad area, sig: 11 [#1]
 FRAME_POINTER chip: 0x01f:0x1e82 rev 2
 Modules linked in: nf_conntrack_ipv4(+) nf_conntrack ip_tables
 PC is at kmem_cache_alloc+0x2c/0x54
 LR is at nf_conntrack_l4proto_register+0x34/0x9c [nf_conntrack]

I take it that the above means that the crash is in kmem_cache_alloc()?

 pc : [9004fa78]lr : [c08537d8]Not tainted
 sp : 91eb5e9c  r12: 901cc588  r11: 00d0
 r10: 901e03a0  r9 : 0002  r8 : 91d33800
 r7 : 91eb5e9c  r6 : 901d9138  r5 : 901cc588  r4 : 0007a008
 r3 : 00d0  r2 : 0045  r1 : c085f9c0  r0 : c08c1424
 Flags: qvNzc
 Mode bits: hjmdeG
 CPU Mode: Supervisor
 Process: insmod [250] (task: 91d789c0 thread: 91eb4000)
 Stack: (0x91eb5e9c to 0x91eb6000)
 5e80:c08537d8
 5ea0: 91eb5ec0 c085a6b4  0007a008 fff0 0027 c085f9c0 c08c1424
 5ec0: c0861024 91eb5ee4  c085f654 0007a008 901d31b0 0027 c085f9c0
 5ee0: c08c1424 900358d4 91eb5f94 91d6ddf0 901d31b8 0007a008 91eb5f08 0007a008
 5f00:        
 5f20: 000c    0007 c08df440 91d1c660 c08c19ec
 5f40: c08c16a4 c0881000  00b2 00b2 c085e17c 90022414 0027
 5f60: c08c12c3 c08c1424 c08c1a3c c08c1a14 0027 0025 0006 2ab17000
 5f80: 90047080 91eb5f94 91eb4000  c087ef80 90012132  00072e14
 5fa0: 535e 0007a008 00072858 7f89ff73 8000 91eb4000 0001 2aae23ec
 5fc0: 7f89fd98 7f89fd88 2ab17008 0005edf5 0007a008 0005edf5 0073 7f89fedc
 5fe0: 00072e14 535e 0007a008 00072858 7f89ff73 0002 00059538 2ab17008
 Call trace:
  [c08537d8] nf_conntrack_l4proto_register+0x34/0x9c [nf_conntrack]
  [c0861024] nf_conntrack_l3proto_ipv4_init+0x24/0x108 [nf_conntrack_ipv4]
  [900358d4] sys_init_module+0xf78/0x1028
  [90012132] syscall_return+0x0/0x12
 
 iptable_nat: gave up waiting for init of module nf_conntrack_ipv4.
 iptable_nat: Unknown symbol need_ipv4_conntrack
 8-
 
 iptables version 1.3.8

If so, the bug could be almost anywhere - in slab, or in some random piece
of code which scribbles on slab's data structures.

 Perfectly repeatable.

If my theory is correct, changing pretty much anything in the kernel config
might just make it go away.  But still, it would be most valuable if you
could try running a bisection search, as described in
http://www.kernel.org/doc/local/git-quick.html, thanks.

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: BUG: 2.6.25-rc1: iptables postrouting setup causes oops

2008-02-13 Thread Andrew Morton

On Wed, 13 Feb 2008 10:10:24 +0100 Haavard Skinnemoen [EMAIL PROTECTED] wrote:

 On Wed, 13 Feb 2008 00:48:29 -0800
 Andrew Morton [EMAIL PROTECTED] wrote:
 
  On Tue, 12 Feb 2008 22:46:01 +1100 Ben Nizette [EMAIL PROTECTED] wrote:
  
   
   On an AVR32, root over NFS, config attached, running (from a startup
   script):
   
   iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
   
   Results in (dmesg extract including a bit of context for good measure):
   -8
   VFS: Mounted root (nfs filesystem).
   Freeing init memory: 72K (9000 - 90012000)
   eth0: no IPv6 routers present
   warning: `dnsmasq' uses 32-bit capabilities (legacy support in use)
 
 Hmm. What does that mean? What size do capabilities normally have?

My near-namesake put than in, but I immediately forgot what it means?
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/2] remove rcu_assign_pointer(NULL) penalty with type/macro safety

2008-02-13 Thread Andrew Morton

On Wed, 13 Feb 2008 14:00:24 -0800
Paul E. McKenney [EMAIL PROTECTED] wrote:

 Hello!
 
 This is an updated version of the patch posted last November:
 
   http://archives.free.net.ph/message/20071201.003721.cd6ff17c.en.html
 
 This new version permits arguments with side effects, for example:
 
   rcu_assign_pointer(global_p, p++);
 
 and also verifies that the arguments are pointers, while still avoiding
 the unnecessary memory barrier when assigning NULL to a pointer.
 This memory-barrier avoidance means that rcu_assign_pointer() is now only
 permitted for pointers (not array indexes), and so this version emits a
 compiler warning if the first argument is not a pointer.  I built a make
 allyesconfig version on an x86 system, and received no such warnings.
 If RCU is ever applied to array indexes, then the second patch in this
 series should be applied, and the resulting rcu_assign_index() be used.
 
 Given the rather surprising history of subtlely broken implementations of
 rcu_assign_pointer(), I took the precaution of generating a full set of
 test cases and verified that memory barriers and compiler warnings were
 emitted when required.  I guess it is the simple things that get you...
 
 Signed-off-by: Paul E. McKenney [EMAIL PROTECTED]
 ---
 
  rcupdate.h |   16 
  1 file changed, 12 insertions(+), 4 deletions(-)
 
 diff -urpNa -X dontdiff linux-2.6.24/include/linux/rcupdate.h 
 linux-2.6.24-rap/include/linux/rcupdate.h
 --- linux-2.6.24/include/linux/rcupdate.h 2008-01-24 14:58:37.0 
 -0800
 +++ linux-2.6.24-rap/include/linux/rcupdate.h 2008-02-13 13:36:47.0 
 -0800

whoop, ancient kernel alert.

 @@ -270,12 +270,20 @@ extern struct lockdep_map rcu_lock_map;
   * structure after the pointer assignment.  More importantly, this
   * call documents which pointers will be dereferenced by RCU read-side
   * code.
 + *
 + * Throws a compiler warning for non-pointer arguments.
 + *
 + * Does not insert a memory barrier for a NULL pointer.
   */
  
 -#define rcu_assign_pointer(p, v) ({ \
 - smp_wmb(); \
 - (p) = (v); \
 - })
 +#define rcu_assign_pointer(p, v) \
 + ({ \
 + typeof(*p) *_p1 = (v); \
 + \
 + if (!__builtin_constant_p(v) || (_p1 != NULL)) \
 + smp_wmb(); \
 + (p) = _p1; \
 + })

Someone already merged the dont-do-it-for-NULL patch so I reworked this
appropriately.  Was too lazy to update the changelog though.

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/2] remove rcu_assign_pointer(NULL) penalty with type/macro safety

2008-02-13 Thread Andrew Morton

On Wed, 13 Feb 2008 15:37:44 -0800
Paul E. McKenney [EMAIL PROTECTED] wrote:

 Ah.  It does take a bit to get fib_trie into one's build -- allyesconfig
 doesn't cut it.

This is not good.  The sole purpose of allmodconfig and allyesconfig is for
compilation and linkage coverage testing.  Hence we should aim to get as
much code as possible included in those cases.

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/2] remove rcu_assign_pointer(NULL) penalty with type/macro safety

2008-02-13 Thread Andrew Morton

On Wed, 13 Feb 2008 15:57:38 -0800 (PST)
David Miller [EMAIL PROTECTED] wrote:

 From: Andrew Morton [EMAIL PROTECTED]
 Date: Wed, 13 Feb 2008 15:52:45 -0800

  On Wed, 13 Feb 2008 15:37:44 -0800
  Paul E. McKenney [EMAIL PROTECTED] wrote:

   Ah.  It does take a bit to get fib_trie into one's build -- allyesconfig
   doesn't cut it.

  This is not good.  The sole purpose of allmodconfig and allyesconfig is for
  compilation and linkage coverage testing.  Hence we should aim to get as
  much code as possible included in those cases.

 Well, in this case there is a choice, either you use one routing
 lookup datastructure or the other.  It's not purposefully being hidden
 from the everything builds :-)

oic.  yes, that is a bit of a problem.  Oh well.

`make randconfig' seems to be able to enable CONFIG_IP_FIB_TRIE about one
time in eight ;)

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/2] add rcu_assign_index() if ever needed

2008-02-13 Thread Andrew Morton

On Thu, 14 Feb 2008 09:02:09 +0530 Gautham R Shenoy [EMAIL PROTECTED] wrote:

   /**
  + * rcu_assign_index - assign (publicize) a index of a newly
  + * initialized array elementg that will be dereferenced by RCU
    
 
 I hope Andrew got that one while porting against the latest -mm :)

I don't actually read the comments - I just like to make sure they're
there ;)
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Bugme-new] [Bug 9940] New: e1000e driver with 82566DM-2 controller doesn't strip crc from frames

2008-02-12 Thread Andrew Morton

On Tue, 12 Feb 2008 01:50:49 -0800 (PST) [EMAIL PROTECTED] wrote:

 http://bugzilla.kernel.org/show_bug.cgi?id=9940
 
Summary: e1000e driver with 82566DM-2 controller doesn't strip
 crc from frames
Product: Drivers
Version: 2.5
  KernelVersion: 2.6.24
   Platform: All
 OS/Version: Linux
   Tree: Mainline
 Status: NEW
   Severity: normal
   Priority: P1
  Component: Network
 AssignedTo: [EMAIL PROTECTED]
 ReportedBy: [EMAIL PROTECTED]
 
 
 Latest working kernel version: n/a
 Earliest failing kernel version: 2.6.24
 Distribution: Gentoo 2007.0
 Hardware Environment: HP Compaq dc7800p
 Software Environment: n/a
 Problem Description:
 The e1000e driver doesn't strip the ethernet frame crc is used with the
 82566DM-2 chip.
 This will result, among other things, that bridging doesn't work together with
 this chip.
 Reverting commit 140a74802894e9db57e5cd77ccff77e590ece5f3 fixes this issue but
 has performance implications.
 
 Steps to reproduce:
 Send a ping and monitor the responce with tcpdump.
 

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/4] fib_trie: improve output format for /proc/net/fib_trie

2008-02-12 Thread Andrew Morton

On Tue, 12 Feb 2008 16:50:44 -0800 Stephen Hemminger [EMAIL PROTECTED] wrote:

 Make output format prettier (more tree like).
 
local:
--- 0.0.0.0/0
  |--- 10.111.111.0/24
  |  +-- 10.111.111.0/32 link broadcast
  |  |--- 10.111.111.254/31
  |  |  +-- 10.111.111.254/32 host local
  |  |  +-- 10.111.111.255/32 link broadcast
  |--- 127.0.0.0/8
  |  |--- 127.0.0.0/31
  |  |  +-- 127.0.0.0/32 link broadcast
  |  |  +-- 127.0.0.0/8 host local
  |  |  +-- 127.0.0.1/32 host local
  |  +-- 127.255.255.255/32 link broadcast
  |--- 192.168.1.0/24
  |  |--- 192.168.1.0/28
  |  |  +-- 192.168.1.0/32 link broadcast
  |  |  +-- 192.168.1.9/32 host local
  |  +-- 192.168.1.255/32 link broadcast
main:
--- 0.0.0.0/0
  |--- 0.0.0.0/4
  |  +-- 0.0.0.0/0 universe unicast
  |  +-- 10.111.111.0/24 link unicast
  +-- 169.254.0.0/16 link unicast
  +-- 192.168.1.0/24 link unicast

isn't that a non-back-compatible kernel ABI change?  It might
break pre-existing parsers?

aside: how lame are we to put pretty-printers in the kernel?
English-only ones, at that?  Root cause: kernel developers still
don't have a sufficiently easy way of shipping userspace tools.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] Execute tasklets in the same order they were queued

2008-02-11 Thread Andrew Morton

On Mon, 11 Feb 2008 16:28:13 -0600
Olof Johansson [EMAIL PROTECTED] wrote:

 I noticed this when looking at an openswan issue. Openswan (ab?)uses
 the tasklet API to defer processing of packets in some situations,
 with one packet per tasklet_action(). I started noticing sequences of
 reverse-ordered sequence numbers coming over the wire, since new tasklets
 are always queued at the head of the list but processed sequentially.
 
 Convert it to instead append new entries to the tail of the list. As an
 extra bonus, the splicing code in takeover_tasklets() no longer has to
 iterate over the list.
 

hm, I'd have thought that this would already have caused problems in
networking.  And perhaps this change might have effects on networking too?

Probably it won't have _much_ effect on networking because networking
probably isn't queueing one tasklet per packet(!) but perhaps with bonded
channels or something like that?

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Bugme-new] [Bug 9937] New: Bug in bonding driver - Kernel oops whenever driver is loaded with max_bonds parameter

2008-02-11 Thread Andrew Morton

On Mon, 11 Feb 2008 15:04:03 -0800 (PST)
[EMAIL PROTECTED] wrote:

 http://bugzilla.kernel.org/show_bug.cgi?id=9937
 
Summary: Bug in bonding driver - Kernel oops whenever driver is
 loaded with max_bonds parameter
Product: Networking
Version: 2.5
  KernelVersion: 2.6.24.2
   Platform: All
 OS/Version: Linux
   Tree: Mainline
 Status: NEW
   Severity: high
   Priority: P1
  Component: IPV4
 AssignedTo: [EMAIL PROTECTED]
 ReportedBy: [EMAIL PROTECTED]
 
 
 Latest working kernel version:
 Earliest failing kernel version: 2.6.24.2
 Distribution: Slackware / Debian GNU/Linux
 Hardware Environment: HP ProLiant DL380 G5 (Debian), Slackware Acer TravelMate
 4001 Laptop
 Software Environment: 
 Problem Description: Kernel oops whenever bonding driver with max_bonds=2 (or 
 
 2) is loaded ...
 
 Steps to reproduce:
 
 modprobe bonding mode=0 miimon=100 max_bonds=2 
 or
 modprobe bonding max_bonds=2 
 
 
 dmesg output (from slackware laptop / x86):
 
 BUG: unable to handle kernel NULL pointer dereference at virtual address
 
 printing eip: c028eeaf *pde = 
 Oops:  [#1] SMP
 Modules linked in: bonding snd_seq_dummy snd_seq_oss snd_seq_midi_event 
 snd_seq
 snd_seq_device snd_pcm_oss snd_mixer_oss ntfs pcmcia yenta_socket
 rsrc_nonstatic tifm_7xx1 tifm_core pcmcia_core snd_intel8x0 snd_ac97_codec
 ac97_bus snd_pcm i2c_i801 snd_timer snd i2c_core shpchp snd_page_alloc 
 ehci_hcd
 uhci_hcd pci_hotplug
 
 Pid: 2729, comm: modprobe Not tainted (2.6.24.2 #2)
 EIP: 0060:[c028eeaf] EFLAGS: 00010282 CPU: 0
 EIP is at strnicmp+0x17/0x61
 EAX: d8162800 EBX:  ECX: 0010 EDX: 0062
 ESI: 0010 EDI:  EBP: d8162801 ESP: d82c9f60
  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
 Process modprobe (pid: 2729, ti=d82c8000 task=df926550 task.ti=d82c8000)
 Stack: d8162c80  e0c76814  e0c67170 0001 df80b700 e0c77180
0001  000c d82c8000 e0afe05e e0c6ed14 e0c6ce70 e0c76c00
0805c098 000c c014e355 b7e7a008 0805c098 c0106f12 b7e7a008 00019477
 Call Trace:
  [e0c67170] bond_create+0x4a/0x162 [bonding]
  [e0afe05e] bonding_init+0x5e/0xf0 [bonding]
  [c014e355] sys_init_module+0x91/0x11b
  [c0106f12] syscall_call+0x7/0xb
  [c047] sctp_setsockopt_bindx+0xe8/0x127
  ===
 Code: 08 fe dc ba 98 c7 40 0c 76 54 32 10 c7 40 10 f0 e1 d2 c3 c3 55 89 c5 57
 89 d7 31 d2 56 89 ce 53 31 db 85 c9 74 42 0f b6 55 00 45 0f b6 1f 47 84 d2 
 74
 35 84 db 74 31 38 da 74 2a 0f b6 c2 88 d1
 EIP: [c028eeaf] strnicmp+0x17/0x61 SS:ESP 0068:d82c9f60
 ---[ end trace 75761717808bf4ee ]---
 
 dmesg output (from Debian x86_64 - HP ProLiant DL380):
 
 Unable to handle kernel NULL pointer dereference at  RIP:
  [8030271e] strnicmp+0x12/0x5f
 PGD 223005067 PUD 223b22067 PMD 0
 Oops:  [1] SMP
 CPU 7
 Modules linked in: bonding mptctl mptbase fan ac battery ipv6 dm_snapshot
 dm_mirror dm_mod loop usbhid ide_cd cdrom bnx2 generic thermal ipmi_si piix
 serio_raw evdev shpchp
 psmouse pci_hotplug container pcspkr ide_core ipmi_msghandler uhci_hcd button
 processor ehci_hcd e1000 ext3 jbd mbcache reiserfs cciss
 Pid: 12469, comm: modprobe Not tainted 2.6.24.2 #1
 RIP: 0010:[8030271e]  [8030271e] strnicmp+0x12/0x5f
 RSP: 0018:81022339fe00  EFLAGS: 00010202
 RAX: 81022307e6c0 RBX: 88233918 RCX: 20e7
 RDX: 0010 RSI:  RDI: 81022307e000
 RBP:  R08: 810223b90362 R09: 0010
 R10: 8822d60b R11: 0001 R12: 
 R13: 88234b00 R14: 81022307e7c8 R15: 
 FS:  2b07aa3166e0() GS:81022743bd00() knlGS:
 CS:  0010 DS:  ES:  CR0: 8005003b
 CR2:  CR3: 00022339c000 CR4: 06e0
 DR0:  DR1:  DR2: 
 DR3:  DR6: 0ff0 DR7: 0400
 Process modprobe (pid: 12469, threadinfo 81022339e000, task
 8102239aa000)
 Stack:  882200ce 8102239ad000 0001 8102274273c0
   0001 c20011bef960 810225c88540
  8809f7bf 882340c0 882340c0 8102263f7f00
 Call Trace:
  [882200ce] :bonding:bond_create+0x4e/0x30e
  [8809f7bf] :bonding:bonding_init+0x7bf/0x85d
  [8024f752] sys_init_module+0x176d/0x183f
  [8020be8e] system_call+0x7e/0x83
 
 
 Code: 8a 0e 48 ff c7 48 ff c6 45 84 c0 74 36 84 c9 74 32 41 38 c8
 RIP  [8030271e] strnicmp+0x12/0x5f
  RSP 81022339fe00
 CR2: 
 ---[ end trace ba3d7089e7da64fa ]---
 

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at

Re: [Bugme-new] [Bug 9933] New: kernel BUG at include/linux/skbuff.h:912

2008-02-11 Thread Andrew Morton

On Mon, 11 Feb 2008 03:46:45 -0800 (PST) [EMAIL PROTECTED] wrote:

 http://bugzilla.kernel.org/show_bug.cgi?id=9933
 
Summary: kernel BUG at include/linux/skbuff.h:912
Product: Networking
Version: 2.5
  KernelVersion: 2.6.24.2
   Platform: All
 OS/Version: Linux
   Tree: Mainline
 Status: NEW
   Severity: normal
   Priority: P1
  Component: Netfilter/Iptables
 AssignedTo: [EMAIL PROTECTED]
 ReportedBy: [EMAIL PROTECTED]
 
 
 Latest working kernel version: 2.6.22.3
 Earliest failing kernel version: 2.6.24.1
 Distribution: Debian etch
 Hardware Environment: x86_64, SMP
 
 If libnetfilter-queue (v. 0.0.12-1) application calls nfq_set_verdict
 and:
 - protocol is IPv4 (works fine with IPv6)
 - new packet length has been changed
 - packet contains data payload (not affected if tcp header is extended with
 options, but data payload=0)
 
 SKB_LINEAR_ASSERT is catched.
 
 
 [ cut here ]
 kernel BUG at include/linux/skbuff.h:912!
 invalid opcode:  [1] SMP
 CPU 4
 Modules linked in: nfnetlink_queue nfnetlink ip6table_mangle xt_NFQUEUE
 iptable_mangle xt_tcpudp nf_conntrack_ipv6 nf_conntrack_ipv4 xt_state
 nf_conntrack iptable_filter ip_tables ip6table_filter ip6_tables x_tables esp4
 ah4 xfrm4_mode_transport deflate zlib_deflate twofish twofish_common camellia
 serpent blowfish des_generic cbc ecb blkcipher aes_x86_64 aes_generic xcbc
 sha256_generic sha1_generic crypto_null af_key dm_crypt dm_snapshot dm_mirror
 dm_mod ipv6 ipmi_si iTCO_wdt container ipmi_msghandler button serio_raw evdev
 pcspkr ide_generic ide_cd cdrom pata_acpi ata_generic ata_piix libata scsi_mod
 usbhid piix generic ide_core ehci_hcd bnx2 uhci_hcd zlib_inflate cciss thermal
 processor fan
 Pid: 3390, comm: tcpmd5 Not tainted 2.6.24.2 #1
 RIP: 0010:[88258b2c]  [88258b2c]
 :nfnetlink_queue:nfqnl_recv_verdict+0x179/0x227
 RSP: 0018:81012d219a08  EFLAGS: 00010206
 RAX: 0100 RBX:  RCX: 00010001
 RDX: 81012e539500 RSI: 81012e539638 RDI: 81012df7ce18
 RBP: 0075 R08: 88250079 R09: 81012df7ce18
 R10: 7fff576df198 R11: 81012d9daac0 R12: 0014
 R13: 81012e691e40 R14: 0001 R15: 81012eae3c20
 FS:  2aab53c7a6d0() GS:81012f8fdb40() knlGS:
 CS:  0010 DS:  ES:  CR0: 8005003b
 CR2: 2aab535e6090 CR3: 00012e06c000 CR4: 06e0
 DR0:  DR1:  DR2: 
 DR3:  DR6: 0ff0 DR7: 0400
 Process tcpmd5 (pid: 3390, threadinfo 81012d218000, task 81012dafc080)
 Stack:  8825000a 81012d219a60 81012e524740 81012d219a60
  81012d219af8 88258ff8 81012eae3c00 81012d219ac8
  81012f9d7a80 88255233 804e42e8 
 Call Trace:
  [88255233] :nfnetlink:nfnetlink_rcv_msg+0x129/0x172
  [8825512b] :nfnetlink:nfnetlink_rcv_msg+0x21/0x172
  [8825510a] :nfnetlink:nfnetlink_rcv_msg+0x0/0x172
  [803d23b6] netlink_rcv_skb+0x34/0x8b
  [8825501f] :nfnetlink:nfnetlink_rcv+0x1f/0x2c
  [803d2156] netlink_unicast+0x1e0/0x240
  [803d29eb] netlink_sendmsg+0x2a2/0x2b5
  [803ba345] memcpy_fromiovec+0x36/0x66
  [803b3790] sock_sendmsg+0xe2/0xff
  [80243e10] autoremove_wake_function+0x0/0x2e
  [803b3790] sock_sendmsg+0xe2/0xff
  [80243e10] autoremove_wake_function+0x0/0x2e
  [80312f8d] xfs_vn_getattr+0x3d/0xfd
  [803b29c0] move_addr_to_kernel+0x25/0x36
  [803b39c1] sys_sendmsg+0x214/0x287
  [803b3b5c] sys_sendto+0x128/0x151
  [8027bbf5] do_readv_writev+0x18f/0x1a4
  [8020be2e] system_call+0x7e/0x83
 
 
 Code: 0f 0b eb fe 44 01 e0 44 01 67 68 3b 87 b8 00 00 00 89 87 b4
 RIP  [88258b2c] :nfnetlink_queue:nfqnl_recv_verdict+0x179/0x227
  RSP 81012d219a08
 ---[ end trace 303d8add98149551 ]---
 
 I cannot reproduce problem on kernel 2.6.22.3 (both i386 and x86-64) and
 2.6.24.2 if arch is i386.
 
 tcpmd5 application http://tcpmd5.googlecode.com/files/tcpmd5_0.0.3.tar.gz
 
 

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Bugme-new] [Bug 9923] New: Unable conect to internet over ISDN AVM USB BlueFritz

2008-02-09 Thread Andrew Morton

On Sat,  9 Feb 2008 03:36:26 -0800 (PST) [EMAIL PROTECTED] wrote:

 http://bugzilla.kernel.org/show_bug.cgi?id=9923
 
Summary: Unable conect to internet over ISDN AVM USB BlueFritz
Product: Networking
Version: 2.5
  KernelVersion: 2.6.24
   Platform: All
 OS/Version: Linux
   Tree: Mainline
 Status: NEW
   Severity: normal
   Priority: P1
  Component: Other
 AssignedTo: [EMAIL PROTECTED]
 ReportedBy: [EMAIL PROTECTED]
 
 
 Latest working kernel version:2.6.22
 Earliest failing kernel version:2.24

A regression.

 Distribution:opensuse 10.3
 Hardware Environment:AVM BlueFritz USB ISDN
 Software Environment:
 Problem Description:
 ifup ippp0
 ippp0
 SIOCSIFFLAGS: Invalid argument
 Cannot enable interface ippp0
 
 Steps to reproduce:
 Feb  9 11:20:54 linux-zhr0 klogd: ippp0: dialing 1 019193383... 
 Feb  9 11:20:56 linux-zhr0 klogd: isdn_net: ippp0 connected
 Feb  9 11:20:56 linux-zhr0 klogd: capidrv-1: chan 0 up with ncci 0x10101
 Feb  9 11:20:56 linux-zhr0 ipppd[5624]: Local number: 906138, Remote number:
 019193383, Type: outgoing
 Feb  9 11:20:56 linux-zhr0 ipppd[5624]: PHASE_WAIT - PHASE_ESTABLISHED,
 ifunit: 0, linkunit: 0, fd: 7
 Feb  9 11:20:56 linux-zhr0 ipppd[5624]: ioctl(SIOCSIFMTU): Invalid argument, 6
 ippp0 1524.
 Feb  9 11:20:56 linux-zhr0 ipppd[5624]: MPPP negotiation, He: Yes We: Yes
 Feb  9 11:20:56 linux-zhr0 ipppd[5624]: CCP enabled! Trying CCP.
 Feb  9 11:20:56 linux-zhr0 ipppd[5624]: CCP: got ccp-unit 0 for link 0
 (Compression Control Protocol)
 Feb  9 11:20:56 linux-zhr0 ipppd[5624]: ccp_resetci!
 Feb  9 11:20:56 linux-zhr0 ipppd[5624]: ccp_resetci!
 Feb  9 11:20:58 linux-zhr0 klogd: Received CCP frame from peer slot(0)
 Feb  9 11:20:58 linux-zhr0 klogd: [0/0].ccp-rcv[0]: 01 01 00 09 11 05 00 01 
 04 
 Feb  9 11:20:58 linux-zhr0 klogd: Received CCP frame from daemon:
 Feb  9 11:20:58 linux-zhr0 klogd: [0/0].ccp-xmit[0]: ff 03 80 fd 01 01 00 04 
 Feb  9 11:20:58 linux-zhr0 klogd: Received CCP frame from daemon:
 Feb  9 11:20:58 linux-zhr0 klogd: [0/0].ccp-xmit[0]: ff 03 80 fd 04 01 00 09 
 11
 05 00 01 04 
 Feb  9 11:20:58 linux-zhr0 klogd: Received CCP frame from peer slot(0)
 Feb  9 11:20:58 linux-zhr0 klogd: [0/0].ccp-rcv[0]: 04 01 00 04 
 Feb  9 11:20:58 linux-zhr0 ipppd[5624]: ccp_resetci!
 Feb  9 11:20:58 linux-zhr0 klogd: Received CCP frame from peer slot(0)
 Feb  9 11:20:58 linux-zhr0 klogd: [0/0].ccp-rcv[0]: 01 02 00 0a 11 06 00 01 01
 03 
 Feb  9 11:20:58 linux-zhr0 klogd: Received CCP frame from daemon:
 Feb  9 11:20:58 linux-zhr0 klogd: [0/0].ccp-xmit[0]: ff 03 80 fd 01 02 00 04 
 Feb  9 11:20:58 linux-zhr0 klogd: Received CCP frame from daemon:
 Feb  9 11:20:58 linux-zhr0 klogd: [0/0].ccp-xmit[0]: ff 03 80 fd 04 02 00 0a 
 11
 06 00 01 01 03 
 Feb  9 11:20:58 linux-zhr0 klogd: Received CCP frame from peer slot(0)
 Feb  9 11:20:58 linux-zhr0 klogd: [0/0].ccp-rcv[0]: 04 02 00 04 
 Feb  9 11:20:59 linux-zhr0 ipppd[5624]: local  IP address 149.225.68.4
 Feb  9 11:20:59 linux-zhr0 ipppd[5624]: remote IP address 139.4.250.4
 Feb  9 11:20:59 linux-zhr0 ipppd[5624]: Ifup: ioctl(SIOCSIFFLAGS): Invalid
 argument
 Feb  9 11:20:59 linux-zhr0 ip-down: SIOCSIFFLAGS: Invalid argument
 

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch 1/7] typhoon section fix

2008-02-08 Thread Andrew Morton

On Fri, 08 Feb 2008 08:59:10 -0500 David Dillow [EMAIL PROTECTED] wrote:

 
 On Fri, 2008-02-08 at 03:11 -0800, [EMAIL PROTECTED] wrote:
  From: Andrew Morton [EMAIL PROTECTED]
  
  gcc-3.4.4 on powerpc:
  
  drivers/net/typhoon.c:137: error: version causes a section type conflict
  
  Cc: Jeff Garzik [EMAIL PROTECTED]
  Cc: Sam Ravnborg [EMAIL PROTECTED]
  Signed-off-by: Andrew Morton [EMAIL PROTECTED]
 
  -static const char version[] __devinitdata =
  +static char version[] __devinitdata =
   typhoon.c: version  DRV_MODULE_VERSION  ( DRV_MODULE_RELDATE )\n;
 
 Wouldn't going to __devinitconst be better? I'll try to whip up a patch
 and test-compile it.

Sam told me that doesn't work right, that this approach is the one to use
and, iirc, that __devinitcont and friends will be removed.

I'm not sure why, actually.  I think I missed that dicussion.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Bugme-new] [Bug 9920] New: kernel panic when using ebtables redirect target

2008-02-08 Thread Andrew Morton

On Fri,  8 Feb 2008 17:40:20 -0800 (PST) [EMAIL PROTECTED] wrote:

 http://bugzilla.kernel.org/show_bug.cgi?id=9920
 
Summary: kernel panic when using ebtables redirect target
Product: Networking
Version: 2.5
  KernelVersion: 2.6.24 and 2.6.24-git
   Platform: All
 OS/Version: Linux
   Tree: Mainline
 Status: NEW
   Severity: normal
   Priority: P1
  Component: Other
 AssignedTo: [EMAIL PROTECTED]
 ReportedBy: [EMAIL PROTECTED]
 
 
 Latest working kernel version: 2.6.22 ( did not test 2.6.23 )
 Earliest failing kernel version: 2.6.24 
 Distribution:
 Hardware Environment: 
 Software Environment: bridge working as a router
 Problem Description: when using ebtables to set up target-redirect, there will
 be kernel panic
 
 Steps to reproduce:
 1. set up a basic bridge br0 with slaves eth0, eth1
 2. on the bridge setup a default router to route traffic
 3. use ebtables to setup target redirect, 
 
 ebtables -t broute -A BROUTING --logical-in br0 \
 -p ipv4  --ip-protocol tcp --ip-destination-port 80 \
 -j redirect --redirect-target ACCEPT
 
 4. from a client which is connect to the bridge, 
 send some traffic to allow the BROUTE chain to be 
 traversed :-
 
 lynx http://www.google.com
 
 5. Kernel panic :-
 
 Pid: 0, comm: swapper Not tainted (2.6.24-tmc #1)
 EIP: 0060:[c69f61aa] EFLAGS: 0217 CPU: 0
 EIP is at ebt_do_table+0x4ea/0x5d0 [ebtables]
 EAX:  EBX:  ECX:  EDX: 0001
 ESI: c69f1178 EDI: c69f1108 EBP: c69f1000 ESP: c0315e20
 DS: 007b ES: 007b FS: 00d8 GS:  SS: 0068
 Process swapper (pid: 0, ti=c0314000 task=c02f1300 task.ti=c0314000)
 Stack:  c69f11dc 0004  c28c7800 c2b79c20 0005 c69de350
   0001 0002 c69ed040 c69ed040   c69f1000 00b0
   00b0 c29b0812  c69f1122   a0c3 c29b0812
 Call Trace:
 [c69de032] ebt_broute+0x22/0x30 [ebtable_broute]
 [c69fef48] br_handle_frame+0xb8/0x220 [bridge]
 [c02274ac] netif_receive_skb+0x19c/0x440
 [c0229ffb] process_backlog+0x6b/0xd0
 [c0229a45] net_rx_action+0x105/0x1b0
 [c011f835] __do_softirq+0x75/0xf0
 [c011f8e7] do_softirq+0x37/0x40
 [c011fb25] irq_exit+0x75/0x80
 [c010d877] smp_apic_timer_interrupt+0x57/0x90
 [c0105b34] apic_timer_interrupt+0x28/0x30
 [c0103cd0] default_idle+0x0/0x40
 [c0103cff] default_idle+0x2f/0x40
 [c0103443] cpu_idle+0x73/0xa0
 [c0319cd5] start_kernel+0x2c5/0x340
 [c0319420] unknown_bootoption+0x0/0x1e0
 ===
 Code: 00 00 83 f9 fe 74 64 83 f9 fc 0f 84 d7 fb ff ff 83 f9 fd 0f 84 bb fc ff
 ff 8b 5c 24 30 8b 54 24 34 8d 04 5b 8d 04 82 8b 54 24 20 89 28 42 89 50 08 
 8b
 5f 6c 01 df 89 78 04 8b 6c 24 38 8b 54 24
 EIP: [c69f61aa] ebt_do_table+0x4ea/0x5d0 [ebtables] SS:ESP 0068:c0315e20
 
 
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCHv2] net: sh_eth: Add support for Renesas SuperH Ethernet

2008-02-07 Thread Andrew Morton

On Thu, 07 Feb 2008 17:39:23 +0900 Yoshihiro Shimoda [EMAIL PROTECTED] wrote:

 Add support for Renesas SuperH Ethernet controller.
 This driver supported SH7710 and SH7712.
 

Nice looking driver.

Quick comments:

 ...

 +/*
 + * Program the hardware MAC address from dev-dev_addr.
 + */
 +static void __init update_mac_address(struct net_device *ndev)
 +{
 + u32 ioaddr = ndev-base_addr;
 +
 + ctrl_outl((ndev-dev_addr[0]  24) | (ndev-dev_addr[1]  16) |
 +   (ndev-dev_addr[2]  8) | (ndev-dev_addr[3]),
 +   ioaddr + MAHR);
 + ctrl_outl((ndev-dev_addr[4]  8) | (ndev-dev_addr[5]),
 +   ioaddr + MALR);
 +}
 +
 +/*
 + * Get MAC address from SuperH MAC address register
 + *
 + * SuperH's Ethernet device doesn't have 'ROM' to MAC address.
 + * This driver get MAC address that use by bootloader(U-boot or sh-ipl+g).
 + * When you want use this device, you must set MAC address in bootloader.
 + *
 + */
 +static void __init read_mac_address(struct net_device *ndev)
 +{
 + u32 ioaddr = ndev-base_addr;
 +
 + ndev-dev_addr[0] = (ctrl_inl(ioaddr + MAHR)  24);
 + ndev-dev_addr[1] = (ctrl_inl(ioaddr + MAHR)  16)  0xFF;
 + ndev-dev_addr[2] = (ctrl_inl(ioaddr + MAHR)  8)  0xFF;
 + ndev-dev_addr[3] = (ctrl_inl(ioaddr + MAHR)  0xFF);
 + ndev-dev_addr[4] = (ctrl_inl(ioaddr + MALR)  8)  0xFF;
 + ndev-dev_addr[5] = (ctrl_inl(ioaddr + MALR)  0xFF);
 +}

Both the above functions are called from non-__init code and hence cannot
be __init.  sh_eth_tsu_init() is wrong too.  Please check all section
annotations in the driver.

 +struct bb_info {
 + struct mdiobb_ctrl ctrl;
 + u32 addr;
 + u32 mmd_msk;/* MMD */
 + u32 mdo_msk;
 + u32 mdi_msk;
 + u32 mdc_msk;
 +};

Please cc David Brownell on updates to this driver - perhaps he will find
time to review the bit-banging interface usage.

 +/* PHY bit set */
 +static void bb_set(u32 addr, u32 msk)
 +{
 + ctrl_outl(ctrl_inl(addr) | msk, addr);
 +}
 +
 +/* PHY bit clear */
 +static void bb_clr(u32 addr, u32 msk)
 +{
 + ctrl_outl((ctrl_inl(addr)  ~msk), addr);
 +}
 +
 +/* PHY bit read */
 +static int bb_read(u32 addr, u32 msk)
 +{
 + return (ctrl_inl(addr)  msk) != 0;
 +}
 +
 +/* Data I/O pin control */
 +static inline void sh__mmd_ctrl(struct mdiobb_ctrl *ctrl, int bit)
 +{
 + struct bb_info *bitbang = container_of(ctrl, struct bb_info, ctrl);
 + if (bit)
 + bb_set(bitbang-addr, bitbang-mmd_msk);
 + else
 + bb_clr(bitbang-addr, bitbang-mmd_msk);
 +}
 +
 +/* Set bit data*/
 +static inline void sh__set_mdio(struct mdiobb_ctrl *ctrl, int bit)
 +{
 + struct bb_info *bitbang = container_of(ctrl, struct bb_info, ctrl);
 +
 + if (bit)
 + bb_set(bitbang-addr, bitbang-mdo_msk);
 + else
 + bb_clr(bitbang-addr, bitbang-mdo_msk);
 +}
 +
 +/* Get bit data*/
 +static inline int sh__get_mdio(struct mdiobb_ctrl *ctrl)
 +{
 + struct bb_info *bitbang = container_of(ctrl, struct bb_info, ctrl);
 + return bb_read(bitbang-addr, bitbang-mdi_msk);
 +}

There seems to be a fairly random mixture of inline and non-inline here. 
I'd suggest that you just remove all the `inline's.  The compiler does a
pretty good job of working doing this for you.

 +/* MDC pin control */
 +static inline void sh__mdc_ctrl(struct mdiobb_ctrl *ctrl, int bit)
 +{
 + struct bb_info *bitbang = container_of(ctrl, struct bb_info, ctrl);
 +
 + if (bit)
 + bb_set(bitbang-addr, bitbang-mdc_msk);
 + else
 + bb_clr(bitbang-addr, bitbang-mdc_msk);
 +}
 +
 +/* mdio bus control struct */
 +static struct mdiobb_ops bb_ops = {
 + .owner = THIS_MODULE,
 + .set_mdc = sh__mdc_ctrl,
 + .set_mdio_dir = sh__mmd_ctrl,
 + .set_mdio_data = sh__set_mdio,
 + .get_mdio_data = sh__get_mdio,
 +};

It's particularly inappropriate that sh__mdc_ctrl() was inlined - it is
only ever called via a function pointer and hence will never be inlined!

 ...

 +static void sh_eth_timer(unsigned long data)
 +{
 + struct net_device *ndev = (struct net_device *)data;
 + struct sh_eth_private *mdp = netdev_priv(ndev);
 + int next_tick = 10 * HZ;
 +
 + /* We could do something here... nah. */
 + mdp-timer.expires = jiffies + next_tick;
 + add_timer(mdp-timer);

mod_timer() would be neater here.

 +}

 ...

 +
 +/* network device open function */
 +static int sh_eth_open(struct net_device *ndev)
 +{
 + int ret = 0;
 + struct sh_eth_private *mdp = netdev_priv(ndev);
 +
 + ret = request_irq(ndev-irq, sh_eth_interrupt, 0, ndev-name, ndev);
 + if (ret) {
 + printk(KERN_ERR Can not assign IRQ number to %s\n, CARDNAME);
 + return ret;
 + }
 +
 + /* Descriptor set */
 + ret = sh_eth_ring_init(ndev);
 + if (ret)
 + goto out_free_irq;
 +
 + /* device init */
 + ret = sh_eth_dev_init(ndev);
 + if (ret)
 + goto out_free_irq;
 +
 + /*

Re: [Bugme-new] [Bug 9914] New: bnx2 driver of latest kernel 2.6.24 not working with Cisco catalyst 650x Switch

2008-02-07 Thread Andrew Morton

On Thu,  7 Feb 2008 23:06:55 -0800 (PST) [EMAIL PROTECTED] wrote:

 http://bugzilla.kernel.org/show_bug.cgi?id=9914
 
Summary: bnx2 driver of latest kernel 2.6.24 not working with
 Cisco catalyst 650x Switch
Product: Drivers
Version: 2.5
  KernelVersion: 2.6.24
   Platform: All
 OS/Version: Linux
   Tree: Mainline
 Status: NEW
   Severity: normal
   Priority: P1
  Component: Network
 AssignedTo: [EMAIL PROTECTED]
 ReportedBy: [EMAIL PROTECTED]
 
 
 Latest working kernel version: (Works with Redhat 4 kernel)
 Earliest failing kernel version: 2.6.24
 Distribution: Self, kernel.org's kernel version 2.6.24
 Hardware Environment: Dell 2950
 Software Environment: Linux from Scratch, kernel version 2.6.24
 
 Problem Description:
 I am using the latest kernel 2.6.24. The hardware unit is a 2950 Dell box.
 The firmware version of bnx2 is 2.9.1. The bnx2 driver does work with other
 flavors of Cisco Switches like 290x. It is having problems at a customer site
 who has a Cisco 650x.
 
 The customer tried the RHEL 4 and does seem to work fine. The RHEL4 comes with
 the version of bnx2: 1.4.38
 
 The version on 2.6.24 is 1.6.9.
 
 Can anyone give me pointers as to why this is broken? Where can I get a 
 patch? 
 
 Any help is appreciated.
 
 
 
 
 Steps to reproduce:
 The driver does not auto-negotiate, neither does it work with a fixed link
 speed.
 
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: locking api self-test hanging

2008-02-06 Thread Andrew Morton

On Mon, 4 Feb 2008 04:43:04 -0800 Andrew Morton [EMAIL PROTECTED] wrote:

 On Sun, 3 Feb 2008 15:07:44 -0800 Andrew Morton [EMAIL PROTECTED] wrote:
 
  On Sun, 3 Feb 2008 15:02:46 -0800 Andrew Morton [EMAIL PROTECTED] wrote:
  
   
   With current mainline I'm getting intermittent hangs here:
   
 http://userweb.kernel.org/~akpm/p2033590.jpg
   
   with this config:
   
 http://userweb.kernel.org/~akpm/config-sony.txt
   
   on the Vaio.  Sometimes it boots (then hits another different hang),
   sometimes it gets stuck there.
 
 CONFIG_DEBUG_LOCKING_API_SELFTESTS=n fixed that up.
 
  
  The second hang is in kobject_uevent_init().  All that function does is call
  netlink_kernel_create().
 
 And I've fully bisected this hang twice and both times came up with
 
 commit 33f807ba0d9259e7c75c7a2ce8bd2787e5b540c7
 Author: Stephen Hemminger [EMAIL PROTECTED]
 Date:   Mon Nov 19 19:24:52 2007 -0800
 
 [NETPOLL]: Kill NETPOLL_RX_DROP, set but never tested.
 
 which is stupid because that patch doesn't do anything.
 
 However I am using netconsole-over-e100 and the hang does go away when I
 disable netconsole on the kernel boot command line.
 
 I'd say it's some timing thing in netpoll/netconsole/napi/etc.
 

I'm seeing this netconsole hang on a second machine now.  Current
mainline.  It also has e100.  This one is SMP ancient PIII.

Config: http://userweb.kernel.org/~akpm/config-vmm.txt

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: locking api self-test hanging

2008-02-06 Thread Andrew Morton

On Wed, 6 Feb 2008 00:34:12 -0800 Andrew Morton [EMAIL PROTECTED] wrote:

 On Mon, 4 Feb 2008 04:43:04 -0800 Andrew Morton [EMAIL PROTECTED] wrote:
 
  On Sun, 3 Feb 2008 15:07:44 -0800 Andrew Morton [EMAIL PROTECTED] wrote:
  
   On Sun, 3 Feb 2008 15:02:46 -0800 Andrew Morton [EMAIL PROTECTED] wrote:
   

With current mainline I'm getting intermittent hangs here:

http://userweb.kernel.org/~akpm/p2033590.jpg

with this config:

http://userweb.kernel.org/~akpm/config-sony.txt

on the Vaio.  Sometimes it boots (then hits another different hang),
sometimes it gets stuck there.
  
  CONFIG_DEBUG_LOCKING_API_SELFTESTS=n fixed that up.
  
   
   The second hang is in kobject_uevent_init().  All that function does is 
   call
   netlink_kernel_create().
  
  And I've fully bisected this hang twice and both times came up with
  
  commit 33f807ba0d9259e7c75c7a2ce8bd2787e5b540c7
  Author: Stephen Hemminger [EMAIL PROTECTED]
  Date:   Mon Nov 19 19:24:52 2007 -0800
  
  [NETPOLL]: Kill NETPOLL_RX_DROP, set but never tested.
  
  which is stupid because that patch doesn't do anything.
  
  However I am using netconsole-over-e100 and the hang does go away when I
  disable netconsole on the kernel boot command line.
  
  I'd say it's some timing thing in netpoll/netconsole/napi/etc.
  
 
 I'm seeing this netconsole hang on a second machine now.  Current
 mainline.  It also has e100.  This one is SMP ancient PIII.
 
 Config: http://userweb.kernel.org/~akpm/config-vmm.txt
 

I can reproduce this on a third machine: the t61p laptop: dual x86_64 with
e1000.

It seems to need quite a lot of printk activity to make it happen.  Turning
on initcall_debug is a suitable way of triggering it.

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch 2/4] forcedeth: fix MAC address detection on network card (regression in 2.6.23)

2008-02-05 Thread Andrew Morton

On Tue, 05 Feb 2008 13:20:59 -0500 Jeff Garzik [EMAIL PROTECTED] wrote:

  Signed-off-by: Andrew Morton [EMAIL PROTECTED]
 
 NAK - this fixes one set of users, and breaks a working set of users.
 
 Need to add DMI check for the specific motherboard (dmi_check_system), 
 and flip flag according to success/failure of that check.

OK :)  I added the above to the changelog for next time.

You guys can hide, but this patch isn't going away!
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

include/linux/pcounter.h

2008-02-04 Thread Andrew Morton


e7d0362dd41e760f340c1b500646cc92522bd9d5 should have been folded into
de4d1db369785c29d68915edfee0cb70e8199f4c prior to merging.  We now and for
ever have a window of breakage which screws up git bisection.  Which I
just hit.  Which is the only reason I discovered the file's existence.


Please do not merge pieces of generic kernel infrastructure while keeping
it all secret on the netdev list.  Ever.  That code has a number of deficiencies
which I and probably others would have noted had it been offered to us
for review.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: locking api self-test hanging

2008-02-04 Thread Andrew Morton

On Mon, 4 Feb 2008 04:43:04 -0800 Andrew Morton [EMAIL PROTECTED] wrote:

 On Sun, 3 Feb 2008 15:07:44 -0800 Andrew Morton [EMAIL PROTECTED] wrote:
 
  On Sun, 3 Feb 2008 15:02:46 -0800 Andrew Morton [EMAIL PROTECTED] wrote:
  
   
   With current mainline I'm getting intermittent hangs here:
   
 http://userweb.kernel.org/~akpm/p2033590.jpg
   
   with this config:
   
 http://userweb.kernel.org/~akpm/config-sony.txt
   
   on the Vaio.  Sometimes it boots (then hits another different hang),
   sometimes it gets stuck there.
 
 CONFIG_DEBUG_LOCKING_API_SELFTESTS=n fixed that up.
 
  
  The second hang is in kobject_uevent_init().  All that function does is call
  netlink_kernel_create().
 
 And I've fully bisected this hang twice and both times came up with
 
 commit 33f807ba0d9259e7c75c7a2ce8bd2787e5b540c7
 Author: Stephen Hemminger [EMAIL PROTECTED]
 Date:   Mon Nov 19 19:24:52 2007 -0800
 
 [NETPOLL]: Kill NETPOLL_RX_DROP, set but never tested.
 
 which is stupid because that patch doesn't do anything.
 
 However I am using netconsole-over-e100 and the hang does go away when I
 disable netconsole on the kernel boot command line.
 
 I'd say it's some timing thing in netpoll/netconsole/napi/etc.
 

I expect the locking API self-tests are innocent - the netconsole lockup
happened to strike during the locking API tests and fooled me.  When I
disabled those tests, the netconsole lockup struck later on.  

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: locking api self-test hanging

2008-02-04 Thread Andrew Morton

On Sun, 3 Feb 2008 15:07:44 -0800 Andrew Morton [EMAIL PROTECTED] wrote:

 On Sun, 3 Feb 2008 15:02:46 -0800 Andrew Morton [EMAIL PROTECTED] wrote:
 
  
  With current mainline I'm getting intermittent hangs here:
  
  http://userweb.kernel.org/~akpm/p2033590.jpg
  
  with this config:
  
  http://userweb.kernel.org/~akpm/config-sony.txt
  
  on the Vaio.  Sometimes it boots (then hits another different hang),
  sometimes it gets stuck there.

CONFIG_DEBUG_LOCKING_API_SELFTESTS=n fixed that up.

 
 The second hang is in kobject_uevent_init().  All that function does is call
 netlink_kernel_create().

And I've fully bisected this hang twice and both times came up with

commit 33f807ba0d9259e7c75c7a2ce8bd2787e5b540c7
Author: Stephen Hemminger [EMAIL PROTECTED]
Date:   Mon Nov 19 19:24:52 2007 -0800

[NETPOLL]: Kill NETPOLL_RX_DROP, set but never tested.

which is stupid because that patch doesn't do anything.

However I am using netconsole-over-e100 and the hang does go away when I
disable netconsole on the kernel boot command line.

I'd say it's some timing thing in netpoll/netconsole/napi/etc.

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2.6.24-mm1] error compiling net driver NE2000/NE1000

2008-02-04 Thread Andrew Morton

On Mon, 4 Feb 2008 16:29:21 +0100
Pierre Peiffer [EMAIL PROTECTED] wrote:

 Hi,
 
   When I compile the kernel 2.6.24-mm1 with:
 CONFIG_NET_ISA=y
 CONFIG_NE2000=y
 
 I have the following compile error:
 ...
   GEN .version
   CHK include/linux/compile.h
   UPD include/linux/compile.h
   CC  init/version.o
   LD  init/built-in.o
   LD  .tmp_vmlinux1
 drivers/built-in.o: In function `ne_block_output':
 linux-2.6.24-mm1/drivers/net/ne.c:797: undefined reference to `NS8390_init'
 drivers/built-in.o: In function `ne_drv_resume':
 linux-2.6.24-mm1/drivers/net/ne.c:858: undefined reference to `NS8390_init'
 drivers/built-in.o: In function `ne_probe1':
 linux-2.6.24-mm1/drivers/net/ne.c:539: undefined reference to `NS8390_init'
 make[1]: *** [.tmp_vmlinux1] Error 1
 make: *** [sub-make] Error 2

Thanks for reporting this.

 As I saw that the file 8390p.c is compiled for this driver, but not the file 
 8390.c which contains this function NS8390_init(), I fixed this error with
 the following patch.

Alan's
8390-split-8390-support-into-a-pausing-and-a-non-pausing-driver-core.patch
would be a prime suspect.  I assume this bug isn't present ing mainline or
in 2.6.24?

 As NS8390p_init() does the same thing than NS8390_init(), I suppose that this 
 is the right fix ?
 
 Signed-off-by: Pierre Peiffer [EMAIL PROTECTED]
 ---
  drivers/net/ne.c |6 +++---
  1 file changed, 3 insertions(+), 3 deletions(-)
 
 Index: b/drivers/net/ne.c
 ===
 --- a/drivers/net/ne.c
 +++ b/drivers/net/ne.c
 @@ -536,7 +536,7 @@ static int __init ne_probe1(struct net_d
  #ifdef CONFIG_NET_POLL_CONTROLLER
   dev-poll_controller = eip_poll;
  #endif
 - NS8390_init(dev, 0);
 + NS8390p_init(dev, 0);
  
   ret = register_netdev(dev);
   if (ret)
 @@ -794,7 +794,7 @@ retry:
   if (time_after(jiffies, dma_start + 2*HZ/100)) {
 /* 20ms */
   printk(KERN_WARNING %s: timeout waiting for Tx 
 RDC.\n, dev-name);
   ne_reset_8390(dev);
 - NS8390_init(dev,1);
 + NS8390p_init(dev,1);
   break;
   }
  
 @@ -855,7 +855,7 @@ static int ne_drv_resume(struct platform
  
   if (netif_running(dev)) {
   ne_reset_8390(dev);
 - NS8390_init(dev, 1);
 + NS8390p_init(dev, 1);
   netif_device_attach(dev);
   }
   return 0;


--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 00/24 for-2.6.25] DM9000 updates for 2.6.25

2008-02-04 Thread Andrew Morton

On Tue, 05 Feb 2008 00:01:59 +
Ben Dooks [EMAIL PROTECTED] wrote:

 Subject: [PATCH 00/24 for-2.6.25] DM9000 updates for 2.6.25

Holy cow.

 This patch set is a series of updates for the DM9000
 driver, to tidy-up some of the source, stop the accesses
 to the PHY and EEPROM sitting and spinning with locks
 held, and to add ethtool support.

Jeff, the immediate issue is that the driver doesn't compile on mips.  I
have the below lameo fix for it, but it appears to be wrong.  Or at least
suboptimal.

So if you're unprepared to chew on this lot (and 24 patches two weeks into the
merge window is one hell of a chew) then we do need to get that
regression fixed, at least.



From: Andrew Morton [EMAIL PROTECTED]

mips:

drivers/net/dm9000.c: In function `dm9000_open':
drivers/net/dm9000.c:627: error: `IRQT_RISING' undeclared (first use in this 
function)
drivers/net/dm9000.c:627: error: (Each undeclared identifier is reported only 
once
drivers/net/dm9000.c:627: error: for each function it appears in.)

Cc: Daniel Mack [EMAIL PROTECTED]
Cc: Russell King [EMAIL PROTECTED]
Signed-off-by: Andrew Morton [EMAIL PROTECTED]
---

 drivers/net/dm9000.c |4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff -puN 
drivers/net/dm9000.c~drivers-net-dm9000c-vague-probably-wrong-build-fix 
drivers/net/dm9000.c
--- a/drivers/net/dm9000.c~drivers-net-dm9000c-vague-probably-wrong-build-fix
+++ a/drivers/net/dm9000.c
@@ -113,8 +113,10 @@
 #define writeswoutsw
 #define writesloutsl
 #define DM9000_IRQ_FLAGS   (IRQF_SHARED | IRQF_TRIGGER_HIGH)
-#else
+#elif defined(ARM)
 #define DM9000_IRQ_FLAGS   (IRQF_SHARED | IRQT_RISING)
+#else
+#define DM9000_IRQ_FLAGS   (IRQF_SHARED)
 #endif
 
 /*
_

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Bugme-new] [Bug 9888] New: tun device without protocol info header fails under IPv6

2008-02-04 Thread Andrew Morton

On Mon,  4 Feb 2008 13:46:13 -0800 (PST)
[EMAIL PROTECTED] wrote:

 http://bugzilla.kernel.org/show_bug.cgi?id=9888
 
Summary: tun device without protocol info header fails under IPv6
Product: Networking
Version: 2.5
  KernelVersion: =2.6.23
   Platform: All
 OS/Version: Linux
   Tree: Mainline
 Status: NEW
   Severity: low
   Priority: P1
  Component: IPV6
 AssignedTo: [EMAIL PROTECTED]
 ReportedBy: [EMAIL PROTECTED]
 
 
 Latest working kernel version: None known -- appears to be a historic bug
 Earliest failing kernel version: All
 Distribution:
 Hardware Environment:
 Software Environment:
 Problem Description:
 
 Steps to reproduce: 
 
 Open a tun device as type TUN, set the TUN_NO_PI flag, and try sending an IPv6
 packet. The packet appears at the interface under tcpdumps, but propagates no
 further. This is because the default protocol info used for tun devices where
 the TUN_NO_PI flag is set assumes IPv4 as can be seen by the initialization at
 the top of the tun_get_user function in drivers/net/tun.c file given by
 
 struct tun_pi pi = { 0, __constant_htons(ETH_P_IP) };
 
 This can easily be fixed by adding a quick check at the top of tun_get_user.
 Basically the code that used to read
 
 if (!(tun-flags  TUN_NO_PI)) {
 if ((len -= sizeof(pi))  count)
 return -EINVAL;
 
 if(memcpy_fromiovec((void *)pi, iv, sizeof(pi)))
 return -EFAULT;
 }
 
 when changed to read
 
 if (!(tun-flags  TUN_NO_PI)) {
 if ((len -= sizeof(pi))  count)
 return -EINVAL;
 
 if(memcpy_fromiovec((void *)pi, iv, sizeof(pi)))
 return -EFAULT;
 }
 else {
   /* Fixup default pi if IPv6 rather than IPv4 */
   if (((tun-flags  TUN_TYPE_MASK) == TUN_TUN_DEV) 
   (*(char *)(iv-iov_base)  == 0x60)) {
 pi.proto = __constant_htons(ETH_P_IPV6);
   }
 }
 
 fixes the problem. 
 
 How do we get this in as part of the maintained codebase??
 

Please email a tested patch prepared as described in

Documentation/SubmittingPatches
Documentation/SubmitChecklist
http://www.zip.com.au/~akpm/linux/patches/stuff/tpp.txt

to

Maxim Krasnyansky [EMAIL PROTECTED]
David S. Miller [EMAIL PROTECTED]
Andrew Morton [EMAIL PROTECTED]
netdev@vger.kernel.org

thanks.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: 2.6.24-mm1 - Build failure at net/sched/cls_flow.c:598

2008-02-04 Thread Andrew Morton

On Mon, 04 Feb 2008 23:32:49 +0100
Tilman Schmidt [EMAIL PROTECTED] wrote:

 My attempt to build this failed with:
 
CC [M]  net/sched/cls_flow.o
 net/sched/cls_flow.c: In function ___flow_dump___:
 net/sched/cls_flow.c:598: error: ___struct tcf_ematch_tree___ has no member 
 named ___hdr___
 
 Config attached.

Thanks.  hm.

#else /* CONFIG_NET_EMATCH */

struct tcf_ematch_tree
{
};

methinks Patrick has a CONFIG_NET_EMATCH=n problem?
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: include/linux/pcounter.h

2008-02-04 Thread Andrew Morton

On Mon, 04 Feb 2008 16:20:35 -0800 (PST)
David Miller [EMAIL PROTECTED] wrote:

 From: Andrew Morton [EMAIL PROTECTED]
 Date: Mon, 4 Feb 2008 01:44:02 -0800

  Please do not merge pieces of generic kernel infrastructure while
  keeping it all secret on the netdev list.  Ever.

 It was so damn secret that it sat in your -mm tree for months.

I never noticed it and I doubt if anyone else did.  How was I (or anyone
else) to have known?

 Don't be rediculious Andrew.

There is nothing ridiculous about requiring that new generic kernel
infrastructure patches be appropriately submitted and reviewed.

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [2.6.24-mm1] TCP/IPv6 connect() oopses at twothirdsMD4Transform()

2008-02-04 Thread Andrew Morton

On Tue, 05 Feb 2008 10:28:43 +0900 Tetsuo Handa [EMAIL PROTECTED] wrote:

 Hello.
 
 Kernel config is at http://I-love.SAKURA.ne.jp/tmp/config-2.6.24-mm1
 
 2.6.24 works fine.

Thanks for testing and reporting.  It really helps.

 Regards.
 --
 BUG: unable to handle kernel paging request at 25476bec
 IP: [c0211c28] twothirdsMD4Transform+0x78/0x37c
 *pde =  
 Oops:  [#1] SMP DEBUG_PAGEALLOC
 last sysfs file: 
 /sys/devices/pci:00/:00:10.0/host0/target0:0:1/0:0:1:0/type
 Modules linked in: nfsd lockd sunrpc exportfs pcnet32
 
 Pid: 2148, comm: a.out Not tainted (2.6.24-mm1 #1)
 EIP: 0060:[c0211c28] EFLAGS: 00010286 CPU: 0
 EIP is at twothirdsMD4Transform+0x78/0x37c
 EAX: 00084000 EBX: 0800 ECX: 8000 EDX: db45ddec
 ESI:  EDI: 52806380 EBP: db45dddc ESP: db45ddc8
  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
 Process a.out (pid: 2148, ti=db45d000 task=deaf9250 task.ti=db45d000)
 Stack: 8000 def6ef9c 6380 c0759d60 db45de1c db45de28 c0211fd2 
 0040 
c0759d40    0100 52806380 1f2e00ba 
 fffa249f 
5a696b37 8dbe1970 cf7579d0 3b0cc350 a54b10a8 def6e9a0  
 def6ef8c 
 Call Trace:
  [c0211fd2] ? secure_tcpv6_sequence_number+0x58/0x7a
  [c0322247] ? tcp_v6_connect+0x46d/0x4e3
  [c02ade07] ? lock_sock_nested+0x56/0x5e
  [c02ecbae] ? inet_stream_connect+0x1c/0x163
  [c02ecc24] ? inet_stream_connect+0x92/0x163
  [c02ab14b] ? sys_connect+0x72/0x98
  [c013b945] ? lock_release_holdtime+0x4e/0x54
  [c0115250] ? do_page_fault+0x1c5/0x3fc
  [c013e858] ? __lock_release+0x4b/0x51
  [c0115250] ? do_page_fault+0x1c5/0x3fc
  [c02ab9f5] ? sys_socketcall+0x6f/0x15e
  [c0103ae7] ? restore_nocheck+0x12/0x15
  [c0103a86] ? syscall_call+0x7/0xb
  ===
 Code: 31 c1 03 0c ba 8b 7a 0c 01 ce 8b 4d ec c1 c6 0b 31 d9 21 f1 31 d9 03 0c 
 ba 8b 7a 10 01 c8 8b 4d ec c1 c0 13 31 f1 21 c1 33 4d ec 03 0c ba 8b 7a 14 
 01 cb 89 c1 c1 c3 03 31 f1 21 d9 31 f1 03 0c 
 EIP: [c0211c28] twothirdsMD4Transform+0x78/0x37c SS:ESP 0068:db45ddc8
 ---[ end trace 160518059a282c77 ]---

err, Matt?
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: locking api self-test hanging

2008-02-03 Thread Andrew Morton

On Sun, 3 Feb 2008 15:02:46 -0800 Andrew Morton [EMAIL PROTECTED] wrote:

 
 With current mainline I'm getting intermittent hangs here:
 
   http://userweb.kernel.org/~akpm/p2033590.jpg
 
 with this config:
 
   http://userweb.kernel.org/~akpm/config-sony.txt
 
 on the Vaio.  Sometimes it boots (then hits another different hang),
 sometimes it gets stuck there.
 

The second hang is in kobject_uevent_init().  All that function does is call
netlink_kernel_create().

These things make bisecting all the other bugs rather hard.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] request_irq() always returns -EINVAL with a NULL handler.

2008-02-02 Thread Andrew Morton

On Thu, 17 Jan 2008 17:57:58 +1100 Rusty Russell [EMAIL PROTECTED] wrote:

 I assume that these ancient network drivers were trying to find out if
 an irq is available.  eepro.c expecting +EBUSY was doubly wrong.
 
 I'm not sure that can_request_irq() is the right thing, but these drivers
 are definitely wrong.
 
 request_irq should BUG() on bad input, and these would have been found
 earlier.

This breaks non-CONFIG_GENERIC_HARDIRQS architectures.

alpha:

drivers/net/3c503.c: In function 'el2_open':
drivers/net/3c503.c:382: error: implicit declaration of function 
'can_request_irq'

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

1 2 3 4 5 6 7 8 9 10 >

1 - 100 of 946 matches

Mail list logo