Re: [PATCH 07/18] x86 vDSO: vdso32 build

2007-11-20 Thread Roland McGrath
> I assumed that if an error happened in a pipe, set -e would catch it.
> But I did not check that - I normally just add set -e; without much thought.

No, you need set -o pipefail for that (which is a bashism).
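For the archives, the difference is easy to demonstrate (bash assumed):

```shell
# Under plain `set -e`, a pipeline's exit status is that of its LAST
# command, so a failure earlier in the pipe goes unnoticed:
bash -c 'set -e; false | cat; echo survived'               # prints "survived"

# With the bash-only `set -o pipefail`, the pipeline fails if ANY
# component fails, and `set -e` then aborts the script:
bash -c 'set -e -o pipefail; false | cat; echo survived' || echo 'aborted'
```

POSIX sh only guarantees the last-command semantics, which is why pipefail counts as a bashism.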


Thanks,
Roland
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 02/13] dio: ARRAY_SIZE() cleanup

2007-11-20 Thread Geert Uytterhoeven
On Tue, 20 Nov 2007, Richard Knutsson wrote:
> Geert Uytterhoeven wrote:
> 
> > -#define NUMNAMES (sizeof(names) / sizeof(struct dioname))
> > +#define NUMNAMES ARRAY_SIZE(names)
> 
> Why not replace NUMNAMES?

Good idea! Updated patch below.

---

Subject: dio: ARRAY_SIZE() cleanup

From: Alejandro Martinez Ruiz <[EMAIL PROTECTED]>

dio: ARRAY_SIZE() cleanup

[Geert: eliminate NUMNAMES, as suggested by Richard Knutsson]

Signed-off-by: Alejandro Martinez Ruiz <[EMAIL PROTECTED]>
Signed-off-by: Geert Uytterhoeven <[EMAIL PROTECTED]>
---
 drivers/dio/dio.c |4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

--- a/drivers/dio/dio.c
+++ b/drivers/dio/dio.c
@@ -88,8 +88,6 @@ static struct dioname names[] = 
 #undef DIONAME
 #undef DIOFBNAME
 
-#define NUMNAMES (sizeof(names) / sizeof(struct dioname))
-
 static const char *unknowndioname 
 = "unknown DIO board -- please email <[EMAIL PROTECTED]>!";
 
@@ -97,7 +95,7 @@ static const char *dio_getname(int id)
 {
 /* return pointer to a constant string describing the board with given 
ID */
unsigned int i;
-for (i = 0; i < NUMNAMES; i++)
+for (i = 0; i < ARRAY_SIZE(names); i++)
 if (names[i].id == id) 
 return names[i].name;
 


Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [EMAIL PROTECTED]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


Re: gitweb: kernel versions in the history (feature request, probably)

2007-11-20 Thread Jarek Poplawski
On Tue, Nov 20, 2007 at 10:20:09PM -0500, J. Bruce Fields wrote:
> On Wed, Nov 21, 2007 at 12:30:23AM +0100, Jarek Poplawski wrote:
> > I don't know git, but it seems, at least if done for web only, this
> > shouldn't be so 'heavy'. It could be a 'simple' translation of commit
> > date by querying a small database with kernel versions & dates.
> 
> If I create a commit in my linux working repo today, but Linus doesn't
> merge it into his repository until after he releases 2.6.24, then my
> commit will be created with an earlier date than 2.6.24, even though it
> isn't included until 2.6.25.
> 
> So you have to actually examine the history graph to figure out
> this sort of thing.

Of course, you are right, and I am probably missing something, but to
be sure we are thinking about the same thing, let's look at an
example: I open a page with the current Linus tree and go to something
titled:
/pub/scm / linux/kernel/git/torvalds/linux-2.6.git / history

and see:
2007-10-10 Stephen Hemminger [NET]: Make NAPI polling independent ...
and just below it, something with a 2007-08-14 date.

As it happens, I remember this patch introduced many changes, and
this big gap between the dates suggests some waiting. Then I look at
the commit, and there are two dates visible, so the patch really was
created earlier. Then I go back to:
/pub/scm / linux/kernel/git/torvalds/linux-2.6.git / summary

and at the bottom I can see this:

...
tags
4 days ago  v2.6.24-rc3 Linux 2.6.24-rc3
2 weeks ago v2.6.24-rc2 Linux 2.6.24-rc2
4 weeks ago v2.6.24-rc1 Linux 2.6.24-rc1
6 weeks ago v2.6.23 Linux 2.6.23

which drives me crazy, because without a calendar and a calculator I
don't really know which month "6 weeks ago" was, or "4 days ago"
either!

So, I go to http://www.eu.kernel.org/pub/linux/kernel/v2.6/,
do some scrolling, and look at this:
ChangeLog-2.6.23 09-Oct-2007 20:38  3.8M

and only now can I guess that this NAPI patch didn't make it into
2.6.23. Of course, I usually have to do a few more clicks and some
reading to make sure where it really first appeared.

So, this could suggest that this 2007-10-10 date (probably stored
with a time too) could be useful here... but it seems I'm wrong.

Of course, this problem doesn't look so hard if we forget about
git internals: I can imagine keeping a simple database that
retrieves commit IDs from these ChangeLogs, and connecting it
with gitweb's commit page as well... For performance reasons it
could be done only for stable and testing releases, so even -rc
'precision' would be very helpful.
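
For what it's worth, git can already answer "which tag first contains this commit" from the history graph alone, which is exactly the mapping being asked for; a throwaway repository is built here just to illustrate (all names are made up):

```shell
# Build a toy repo: one "feature" commit, more work on top, then a tag.
tmp=$(mktemp -d) && cd "$tmp"
git init -q
git config user.email you@example.com && git config user.name you
echo 1 > f && git add f && git commit -qm "feature"
feature=$(git rev-parse HEAD)
echo 2 > f && git commit -qam "more work"
git tag v2.6.24-rc1

# describe --contains walks the graph, not the commit dates:
git describe --contains "$feature"    # e.g. v2.6.24-rc1~1
```

gitweb could show exactly this next to each commit instead of relative dates.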

Regards,
Jarek P.


Re: Prospects of DRM TTM making it into 2.6.24?

2007-11-20 Thread Andrew Morton
On Wed, 21 Nov 2007 17:36:33 +1000 "Dave Airlie" <[EMAIL PROTECTED]> wrote:

> > >
> > > I also have a few AGP changes I need to line up to support the chipset
> > > flushing work I've done to support TTM properly..
> > >
> >
> > Did anything happen with that null-pointer deref I was hitting?
> >
> 
> I've rebased the patches in my tree along with a chunk I missed which
> should make your patch unnecessary.

ok..

> Do you pull my agp-mm tree into -mm?
> 
> ssh://master.kernel.org/pub/scm/linux/kernel/git/airlied/agp-2.6 agp-mm
> if not..

I pull
git+ssh://master.kernel.org/pub/scm/linux/kernel/git/airlied/agp-2.6.git#agp-mm
and
git+ssh://master.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6.git#drm-mm

the agp tree was empty when I last pulled it, ~12 hours ago.



There's stuff there now:

GIT d9b38a24bdbeeb38c03de7172a30769333452d10 
git+ssh://master.kernel.org/pub/scm/linux/kernel/git/airlied/agp-2.6.git#agp-mm

commit d9b38a24bdbeeb38c03de7172a30769333452d10
Author: Dave Airlie <[EMAIL PROTECTED]>
Date:   Wed Nov 21 16:36:31 2007 +1000

agp/intel: Add chipset flushing support for i8xx chipsets.

This is a bit of a large hammer but it makes sure the chipset is flushed
by writing out 1k of data to an uncached page. We may be able to get better
information in the future on how to do this better.

Signed-off-by: Dave Airlie <[EMAIL PROTECTED]>

commit f5b8bc5801bc9676935cb7d2417a636a08e2e06d
Author: Dave Airlie <[EMAIL PROTECTED]>
Date:   Mon Oct 29 18:06:10 2007 +1000

intel-agp: add chipset flushing support

This adds support for flushing the chipsets on the 915, 945, 965 and G33
families of Intel chips.

The BIOS doesn't seem to always allocate the BAR on the 965 chipsets,
so I have to use the PCI resource code to create a resource.

It adds an export for pcibios_align_resource.

commit 5beaa1edf58ef2a477049f21842a8f12f28b8366
Author: Dave Airlie <[EMAIL PROTECTED]>
Date:   Mon Oct 29 15:14:03 2007 +1000

agp: add chipset flushing support to AGP interface

This bumps the AGP interface to 0.103.

Certain Intel chipsets contain a global write buffer, and this can require
flushing from the drm or X.org to make sure all data has hit RAM before
initiating a GPU transfer, due to a lack of coherency with the integrated
graphics device and this buffer.

This just adds generic support to the AGP interfaces, a follow-on patch
will add support to the Intel driver to use this interface.

Signed-off-by: Dave Airlie <[EMAIL PROTECTED]>
 arch/x86/pci/i386.c |2 +-
 drivers/char/agp/agp.h  |3 +-
 drivers/char/agp/backend.c  |2 +-
 drivers/char/agp/compat_ioctl.c |4 +
 drivers/char/agp/compat_ioctl.h |2 +
 drivers/char/agp/frontend.c |   11 +++
 drivers/char/agp/generic.c  |7 ++
 drivers/char/agp/intel-agp.c|  150 +++
 include/linux/agp_backend.h |1 +
 include/linux/agpgart.h |1 +
 10 files changed, 180 insertions(+), 3 deletions(-)




Re: Prospects of DRM TTM making it into 2.6.24?

2007-11-20 Thread Dave Airlie
> >
> > I also have a few AGP changes I need to line up to support the chipset
> > flushing work I've done to support TTM properly..
> >
>
> Did anything happen with that null-pointer deref I was hitting?
>

I've rebased the patches in my tree along with a chunk I missed which
should make your patch unnecessary.

Do you pull my agp-mm tree into -mm?

ssh://master.kernel.org/pub/scm/linux/kernel/git/airlied/agp-2.6 agp-mm
if not..

Dave.


Re: [PATCH 07/18] x86 vDSO: vdso32 build

2007-11-20 Thread Sam Ravnborg
On Tue, Nov 20, 2007 at 11:10:15PM -0800, Roland McGrath wrote:
> > > +# This makes sure the $(obj) subdirectory exists even though vdso32/
> > > +# is not a kbuild sub-make subdirectory.
> > > +override obj-dirs = $(dir $(obj)) $(obj)/vdso32/
> > 
> > Should we teach kbuild to create dirs specified in targets?
> > Or we could 'fix' it so you do not need the override.
> 
> Something cleaner would be nice, yes.  I'll leave it to you to decide.

OK - if I come up with something smart I will convert the vdso stuff.

> 
> > use "set -e; in front of this shell script to bail out early
> > in case of errors.
> 
> Back when I knew something about make, all commands ran with sh -ec.
> Ah, progress.  Anyway, the one you cited does not have any commands that
> aren't tested with && or if already.  set -e would have no effect.
I assumed that if an error happened in a pipe, set -e would catch it.
But I did not check that - I normally just add set -e; without much thought.

Sam


[PATCHv5 1/5] actual sys_indirect code

2007-11-20 Thread Ulrich Drepper
This is the actual architecture-independent part of the system call
implementation.

 include/linux/indirect.h |   17 +
 include/linux/sched.h|4 
 include/linux/syscalls.h |4 
 kernel/Makefile  |3 +++
 kernel/indirect.c|   40 
 5 files changed, 68 insertions(+)


diff -u linux/include/linux/indirect.h linux/include/linux/indirect.h
--- linux/include/linux/indirect.h
+++ linux/include/linux/indirect.h
@@ -0,0 +1,17 @@
+#ifndef _LINUX_INDIRECT_H
+#define _LINUX_INDIRECT_H
+
+#include 
+
+
+/* IMPORTANT:
+   All the elements of this union must be neutral to the word size
+   and must not require reworking when used in compat syscalls.  Use
+   fixed-size types or types which are known not to vary in size across
+   architectures.  */
+union indirect_params {
+};
+
+#define INDIRECT_PARAM(set, name) current->indirect_params.set.name
+
+#endif
diff -u linux/kernel/Makefile linux/kernel/Makefile
--- linux/kernel/Makefile
+++ linux/kernel/Makefile
@@ -57,6 +57,7 @@
 obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
 obj-$(CONFIG_TASKSTATS) += taskstats.o tsacct.o
 obj-$(CONFIG_MARKERS) += marker.o
+obj-$(CONFIG_ARCH_HAS_INDIRECT_SYSCALLS) += indirect.o
 
 ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y)
 # According to Alan Modra <[EMAIL PROTECTED]>, the -fno-omit-frame-pointer is
@@ -67,6 +68,8 @@
 CFLAGS_sched.o := $(PROFILING) -fno-omit-frame-pointer
 endif
 
+CFLAGS_indirect.o = -Wno-undef
+
 $(obj)/configs.o: $(obj)/config_data.h
 
 # config_data.h contains the same information as ikconfig.h but gzipped.
diff -u linux/kernel/indirect.c linux/kernel/indirect.c
--- linux/kernel/indirect.c
+++ linux/kernel/indirect.c
@@ -0,0 +1,40 @@
+#include 
+#include 
+#include 
+#include 
+
+
+asmlinkage long sys_indirect(struct indirect_registers __user *userregs,
+void __user *userparams, size_t paramslen,
+int flags)
+{
+   struct indirect_registers regs;
+   long result;
+
+   if (unlikely(flags != 0))
+   return -EINVAL;
+
+   if (copy_from_user(&regs, userregs, sizeof(regs)))
+   return -EFAULT;
+
+   switch (INDIRECT_SYSCALL(&regs))
+   {
+#define INDSYSCALL(name) __NR_##name
+#include 
+   break;
+
+   default:
+   return -EINVAL;
+   }
+
+   if (paramslen > sizeof(union indirect_params))
+   return -EINVAL;
+
+   result = -EFAULT;
+   if (!copy_from_user(&current->indirect_params, userparams, paramslen))
+   result = call_indirect(&regs);
+
+   memset(&current->indirect_params, '\0', paramslen);
+
+   return result;
+}
diff -u linux/include/linux/syscalls.h linux/include/linux/syscalls.h
--- linux/include/linux/syscalls.h
+++ linux/include/linux/syscalls.h
@@ -54,6 +54,7 @@
 struct compat_timeval;
 struct robust_list_head;
 struct getcpu_cache;
+struct indirect_registers;
 
 #include 
 #include 
@@ -611,6 +612,9 @@
const struct itimerspec __user *utmr);
 asmlinkage long sys_eventfd(unsigned int count);
 asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
+asmlinkage long sys_indirect(struct indirect_registers __user *userregs,
+void __user *userparams, size_t paramslen,
+int flags);
 
 int kernel_execve(const char *filename, char *const argv[], char *const 
envp[]);
 
--- linux/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -80,6 +80,7 @@ struct sched_param {
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -1174,6 +1175,9 @@ struct task_struct {
int make_it_fail;
 #endif
struct prop_local_single dirties;
+
+   /* Additional system call parameters.  */
+   union indirect_params indirect_params;
 };
 
 /*


[PATCHv5 0/5] sys_indirect system call

2007-11-20 Thread Ulrich Drepper
The following patches provide an alternative implementation of the
sys_indirect system call which has been discussed a few times.
This is a system call that allows us to extend existing system call
interfaces by adding more system call parameters.

Davide's previous implementation is IMO far more complex than
warranted.  This code here is trivial, as you can see.  I've
discussed this approach with Linus recently and for a brief moment
we actually agreed on something.

We pass an additional block of data to the kernel, it is copied into
the task_struct, and then it is up to the function implementing the system
call to interpret the data.  Each system call that is meant to be
extended this way has to be white-listed in sys_indirect.  The
alternative is to filter out those system calls which absolutely cannot
be handled using sys_indirect (like clone and execve), since they require
the stack layout of an ordinary system call.  This is more dangerous,
since it is too easy to miss a call.

Note that the sys_indirect system call takes an additional parameter which
is for now forced to be zero.  This parameter is meant to enable the use
of sys_indirect to create syslets, asynchronously executed system calls.
This syslet approach is also the main reason for the interface in the form
proposed here.

The code for x86 and x86-64 gets by without a single line of assembly
code.  This is likely to be true for many other archs as well.
There is architecture-dependent code, though.

The last three patches show the first application of the functionality.
They also show a complication: we need the test for valid sub-syscalls in the
main implementation and in the compatibility code.  And more: the actual
sources and generated binary for the test are very different (the numbers
differ).  Duplicating the information is a big problem, though.  I've used
some macro tricks to avoid this.  All the information about the flags and
the system calls using them is concentrated in one header.  This should
keep maintenance bearable.

This patch to use sys_indirect is just the beginning.  More will follow,
but I want to see how these patches are received before I spend more time
on it.  This code is enough to test the implementation with the following
test program.  Adjust it for architectures other than x86 and x86-64.

What is not addressed are differences in opinion about the whole approach.
Maybe Linus can chime in and defend what is basically his design.


#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

typedef uint32_t __u32;
typedef uint64_t __u64;

union indirect_params {
  struct {
int flags;
  } file_flags;
};

#ifdef __x86_64__
# define __NR_indirect 286
struct indirect_registers {
  __u64 rax;
  __u64 rdi;
  __u64 rsi;
  __u64 rdx;
  __u64 r10;
  __u64 r8;
  __u64 r9;
};
#elif defined __i386__
# define __NR_indirect 325
struct indirect_registers {
  __u32 eax;
  __u32 ebx;
  __u32 ecx;
  __u32 edx;
  __u32 esi;
  __u32 edi;
  __u32 ebp;
};
#else
# error "need to define __NR_indirect and struct indirect_params"
#endif

#define FILL_IN(var, values...) \
  var = (struct indirect_registers) { values }

int
main (void)
{
  int fd = socket (AF_INET, SOCK_DGRAM, IPPROTO_IP);
  int s1 = fcntl (fd, F_GETFD);
  int t1 = fcntl (fd, F_GETFL);
  printf ("old: FD_CLOEXEC %s set, NONBLOCK %s set\n",
  s1 == 0 ? "not" : "is", (t1 & O_NONBLOCK) ? "is" : "not");
  close (fd);

  union indirect_params i;
  memset(&i, '\0', sizeof(i));
  i.file_flags.flags = O_CLOEXEC|O_NONBLOCK;

  struct indirect_registers r;
#ifdef __NR_socketcall
# define SOCKOP_socket   1
  long args[3] = { AF_INET, SOCK_DGRAM, IPPROTO_IP };
  FILL_IN (r, __NR_socketcall, SOCKOP_socket, (long) args);
#else
  FILL_IN (r, __NR_socket, AF_INET, SOCK_DGRAM, IPPROTO_IP);
#endif

  fd = syscall (__NR_indirect, &r, &i, sizeof (i), 0);
  int s2 = fcntl (fd, F_GETFD);
  int t2 = fcntl (fd, F_GETFL);
  printf ("new: FD_CLOEXEC %s set, NONBLOCK %s set\n",
  s2 == 0 ? "not" : "is", (t2 & O_NONBLOCK) ? "is" : "not");
  close (fd);

  i.file_flags.flags = O_CLOEXEC;
  sigset_t ss;
  sigemptyset(&ss);
  FILL_IN(r, __NR_signalfd, -1, (long) &ss, 8);
  fd = syscall (__NR_indirect, &r, &i, sizeof (i), 0);
  int s3 = fcntl (fd, F_GETFD);
  printf ("signalfd: FD_CLOEXEC %s set\n", s3 == 0 ? "not" : "is");
  close (fd);

  FILL_IN(r, __NR_eventfd, 8);
  fd = syscall (__NR_indirect, &r, &i, sizeof (i), 0);
  int s4 = fcntl (fd, F_GETFD);
  printf ("eventfd: FD_CLOEXEC %s set\n", s4 == 0 ? "not" : "is");
  close (fd);

  return s1 != 0 || s2 == 0 || t1 != 0 || t2 == 0 || s3 == 0 || s4 == 0;
}


Signed-off-by: Ulrich Drepper <[EMAIL PROTECTED]>


 arch/x86/Kconfig   |3 ++
 arch/x86/ia32/Makefile |1 
 arch/x86/ia32/ia32entry.S  |2 +
 arch/x86/ia32/sys_ia32.c   |   

[PATCHv5 2/5] x86 support for sys_indirect

2007-11-20 Thread Ulrich Drepper
This part adds support for sys_indirect on x86 and x86-64.

 arch/x86/Kconfig   |3 ++
 arch/x86/ia32/Makefile |1 
 arch/x86/ia32/ia32entry.S  |2 +
 arch/x86/ia32/sys_ia32.c   |   38 +
 arch/x86/kernel/syscall_table_32.S |1 
 include/asm-x86/indirect.h |5 
 include/asm-x86/indirect_32.h  |   25 
 include/asm-x86/indirect_64.h  |   36 +++
 include/asm-x86/unistd_32.h|3 +-
 include/asm-x86/unistd_64.h|2 +
 10 files changed, 115 insertions(+), 1 deletion(-)


--- linux/arch/x86/Kconfig
+++ linux/arch/x86/Kconfig
@@ -112,6 +112,9 @@ config GENERIC_TIME_VSYSCALL
bool
default X86_64
 
+config ARCH_HAS_INDIRECT_SYSCALLS
+   def_bool y
+
 
 
 
diff -u linux/include/asm-x86/indirect_32.h linux/include/asm-x86/indirect_32.h
--- linux/include/asm-x86/indirect_32.h
+++ linux/include/asm-x86/indirect_32.h
@@ -0,0 +1,25 @@
+#ifndef _ASM_X86_INDIRECT_32_H
+#define _ASM_X86_INDIRECT_32_H
+
+struct indirect_registers {
+   __u32 eax;
+   __u32 ebx;
+   __u32 ecx;
+   __u32 edx;
+   __u32 esi;
+   __u32 edi;
+   __u32 ebp;
+};
+
+#define INDIRECT_SYSCALL(regs) (regs)->eax
+
+static inline long call_indirect(struct indirect_registers *regs)
+{
+  extern long (*sys_call_table[]) (__u32, __u32, __u32, __u32, __u32, __u32);
+
+  return sys_call_table[INDIRECT_SYSCALL(regs)](regs->ebx, regs->ecx,
+   regs->edx, regs->esi,
+   regs->edi, regs->ebp);
+}
+
+#endif
diff -u linux/include/asm-x86/indirect_64.h linux/include/asm-x86/indirect_64.h
--- linux/include/asm-x86/indirect_64.h
+++ linux/include/asm-x86/indirect_64.h
@@ -0,0 +1,36 @@
+#ifndef _ASM_X86_INDIRECT_64_H
+#define _ASM_X86_INDIRECT_64_H
+
+struct indirect_registers {
+   __u64 rax;
+   __u64 rdi;
+   __u64 rsi;
+   __u64 rdx;
+   __u64 r10;
+   __u64 r8;
+   __u64 r9;
+};
+
+struct indirect_registers32 {
+   __u32 eax;
+   __u32 ebx;
+   __u32 ecx;
+   __u32 edx;
+   __u32 esi;
+   __u32 edi;
+   __u32 ebp;
+};
+
+#define INDIRECT_SYSCALL(regs) (regs)->rax
+#define INDIRECT_SYSCALL32(regs) (regs)->eax
+
+static inline long call_indirect(struct indirect_registers *regs)
+{
+  extern long (*sys_call_table[]) (__u64, __u64, __u64, __u64, __u64, __u64);
+
+  return sys_call_table[INDIRECT_SYSCALL(regs)](regs->rdi, regs->rsi,
+   regs->rdx, regs->r10,
+   regs->r8, regs->r9);
+}
+
+#endif
diff -u linux/arch/x86/ia32/sys_ia32.c linux/arch/x86/ia32/sys_ia32.c
--- linux/arch/x86/ia32/sys_ia32.c
+++ linux/arch/x86/ia32/sys_ia32.c
@@ -889,0 +890,38 @@
+
+asmlinkage long sys32_indirect(struct indirect_registers32 __user *userregs,
+  void __user *userparams, size_t paramslen,
+  int flags)
+{
+   extern long (*ia32_sys_call_table[])(u32, u32, u32, u32, u32, u32);
+
+   struct indirect_registers32 regs;
+   long result;
+
+   if (flags != 0)
+   return -EINVAL;
+
+   if (copy_from_user(&regs, userregs, sizeof(regs)))
+   return -EFAULT;
+
+   switch (INDIRECT_SYSCALL32(&regs))
+   {
+#define INDSYSCALL(name) __NR_ia32_##name
+#include 
+   break;
+
+   default:
+   return -EINVAL;
+   }
+
+   if (paramslen > sizeof(union indirect_params))
+   return -EINVAL;
+   result = -EFAULT;
+   if (!copy_from_user(&current->indirect_params, userparams, paramslen))
+   result = ia32_sys_call_table[regs.eax](regs.ebx, regs.ecx,
+  regs.edx, regs.esi,
+  regs.edi, regs.ebp);
+
+   memset(&current->indirect_params, '\0', paramslen);
+
+   return result;
+}
--- linux/arch/x86/ia32/Makefile
+++ linux/arch/x86/ia32/Makefile
@@ -36,6 +36,7 @@ $(obj)/vsyscall-sysenter.so.dbg 
$(obj)/vsyscall-syscall.so.dbg: \
 $(obj)/vsyscall-%.so.dbg: $(src)/vsyscall.lds $(obj)/vsyscall-%.o FORCE
$(call if_changed,syscall)
 
+CFLAGS_sys_ia32.o = -Wno-undef
 AFLAGS_vsyscall-sysenter.o = -m32 -Wa,-32
 AFLAGS_vsyscall-syscall.o = -m32 -Wa,-32
 
--- linux/arch/x86/ia32/ia32entry.S
+++ linux/arch/x86/ia32/ia32entry.S
@@ -400,6 +400,7 @@ END(ia32_ptregs_common)
 
.section .rodata,"a"
.align 8
+   .globl ia32_sys_call_table
 ia32_sys_call_table:
.quad sys_restart_syscall
.quad sys_exit
@@ -726,4 +727,5 @@ ia32_sys_call_table:
.quad compat_sys_timerfd
.quad sys_eventfd
.quad sys32_fallocate
+   .quad sys32_indirect/* 325  */
 ia32_syscall_end:
--- linux/arch/x86/kernel/syscall_table_32.S
+++ 

[PATCHv5 3/5] Allow setting FD_CLOEXEC flag for new sockets

2007-11-20 Thread Ulrich Drepper
This is a first user of sys_indirect.  Several of the socket-related system
calls which produce a file handle now can be passed an additional parameter
to set the FD_CLOEXEC flag.

 include/asm-x86/ia32_unistd.h |1 +
 include/linux/indirect.h  |   27 +++
 net/socket.c  |   21 +
 3 files changed, 41 insertions(+), 8 deletions(-)


diff -u linux/include/linux/indirect.h linux/include/linux/indirect.h
--- linux/include/linux/indirect.h
+++ linux/include/linux/indirect.h
@@ -1,3 +1,4 @@
+#ifndef INDSYSCALL
 #ifndef _LINUX_INDIRECT_H
 #define _LINUX_INDIRECT_H
 
@@ -13,5 +14,31 @@
+  struct {
+int flags;
+  } file_flags;
 };
 
 #define INDIRECT_PARAM(set, name) current->indirect_params.set.name
 
 #endif
+#else
+
+/* Here comes the list of system calls which can be called through
+   sys_indirect.  When the list of supported system calls is needed, the
+   file including this header is supposed to define a macro "INDSYSCALL"
+   which adds a prefix fitting the use.  If the resulting macro is
+   defined, we generate a line
+   case MACRO:
+   */
+#if INDSYSCALL(accept)
+  case INDSYSCALL(accept):
+#endif
+#if INDSYSCALL(socket)
+  case INDSYSCALL(socket):
+#endif
+#if INDSYSCALL(socketcall)
+  case INDSYSCALL(socketcall):
+#endif
+#if INDSYSCALL(socketpair)
+  case INDSYSCALL(socketpair):
+#endif
+
+#endif
--- linux/include/asm-x86/ia32_unistd.h
+++ linux/include/asm-x86/ia32_unistd.h
@@ -12,6 +12,7 @@
 #define __NR_ia32_exit   1
 #define __NR_ia32_read   3
 #define __NR_ia32_write  4
+#define __NR_ia32_socketcall   102
 #define __NR_ia32_sigreturn119
 #define __NR_ia32_rt_sigreturn 173
 
diff -u linux/net/socket.c linux/net/socket.c
--- linux/net/socket.c
+++ linux/net/socket.c
@@ -344,11 +344,11 @@
  * but we take care of internal coherence yet.
  */
 
-static int sock_alloc_fd(struct file **filep)
+static int sock_alloc_fd(struct file **filep, int flags)
 {
int fd;
 
-   fd = get_unused_fd();
+   fd = get_unused_fd_flags(flags);
if (likely(fd >= 0)) {
struct file *file = get_empty_filp();
 
@@ -391,10 +391,10 @@
return 0;
 }
 
-int sock_map_fd(struct socket *sock)
+static int sock_map_fd_flags(struct socket *sock, int flags)
 {
struct file *newfile;
-   int fd = sock_alloc_fd(&newfile);
+   int fd = sock_alloc_fd(&newfile, flags);
 
if (likely(fd >= 0)) {
int err = sock_attach_fd(sock, newfile);
@@ -409,6 +409,11 @@
return fd;
 }
 
+int sock_map_fd(struct socket *sock)
+{
+   return sock_map_fd_flags(sock, 0);
+}
+
 static struct socket *sock_from_file(struct file *file, int *err)
 {
if (file->f_op == &socket_file_ops)
@@ -1208,7 +1213,7 @@
if (retval < 0)
goto out;
 
-   retval = sock_map_fd(sock);
+   retval = sock_map_fd_flags(sock, INDIRECT_PARAM(file_flags, flags));
if (retval < 0)
goto out_release;
 
@@ -1249,13 +1254,13 @@
if (err < 0)
goto out_release_both;
 
-   fd1 = sock_alloc_fd(&newfile1);
+   fd1 = sock_alloc_fd(&newfile1, INDIRECT_PARAM(file_flags, flags));
if (unlikely(fd1 < 0)) {
err = fd1;
goto out_release_both;
}
 
-   fd2 = sock_alloc_fd(&newfile2);
+   fd2 = sock_alloc_fd(&newfile2, INDIRECT_PARAM(file_flags, flags));
if (unlikely(fd2 < 0)) {
err = fd2;
put_filp(newfile1);
@@ -1411,7 +1416,7 @@
 */
__module_get(newsock->ops->owner);
 
-   newfd = sock_alloc_fd(&newfile);
+   newfd = sock_alloc_fd(&newfile, INDIRECT_PARAM(file_flags, flags));
if (unlikely(newfd < 0)) {
err = newfd;
sock_release(newsock);


[PATCHv5 5/5] FD_CLOEXEC support for eventfd, signalfd, timerfd

2007-11-20 Thread Ulrich Drepper
This patch adds support to set the FD_CLOEXEC flag for the file descriptors
returned by eventfd, signalfd, timerfd.

 fs/anon_inodes.c  |   15 +++
 fs/eventfd.c  |5 +++--
 fs/signalfd.c |6 --
 fs/timerfd.c  |6 --
 include/asm-x86/ia32_unistd.h |3 +++
 include/linux/anon_inodes.h   |3 +++
 include/linux/indirect.h  |3 +++
 7 files changed, 31 insertions(+), 10 deletions(-)


--- linux/include/linux/indirect.h
+++ linux/include/linux/indirect.h
@@ -40,5 +40,8 @@ union indirect_params {
 #if INDSYSCALL(socketpair)
   case INDSYSCALL(socketpair):
 #endif
+  case INDSYSCALL(eventfd):
+  case INDSYSCALL(signalfd):
+  case INDSYSCALL(timerfd):
 
 #endif
--- linux/fs/anon_inodes.c
+++ linux/fs/anon_inodes.c
@@ -70,9 +70,9 @@ static struct dentry_operations 
anon_inodefs_dentry_operations = {
  * hence saving memory and avoiding code duplication for the file/inode/dentry
  * setup.
  */
-int anon_inode_getfd(int *pfd, struct inode **pinode, struct file **pfile,
-const char *name, const struct file_operations *fops,
-void *priv)
+int anon_inode_getfd_flags(int *pfd, struct inode **pinode, struct file 
**pfile,
+  const char *name, const struct file_operations *fops,
+  void *priv, int flags)
 {
struct qstr this;
struct dentry *dentry;
@@ -85,7 +85,7 @@ int anon_inode_getfd(int *pfd, struct inode **pinode, struct 
file **pfile,
if (!file)
return -ENFILE;
 
-   error = get_unused_fd();
+   error = get_unused_fd_flags(flags);
if (error < 0)
goto err_put_filp;
fd = error;
@@ -138,6 +138,13 @@ err_put_filp:
put_filp(file);
return error;
 }
+
+int anon_inode_getfd(int *pfd, struct inode **pinode, struct file **pfile,
+const char *name, const struct file_operations *fops,
+void *priv)
+{
+   return anon_inode_getfd_flags(pfd, pinode, pfile, name, fops, priv, 0);
+}
 EXPORT_SYMBOL_GPL(anon_inode_getfd);
 
 /*
--- linux/include/linux/anon_inodes.h
+++ linux/include/linux/anon_inodes.h
@@ -8,6 +8,9 @@
 #ifndef _LINUX_ANON_INODES_H
 #define _LINUX_ANON_INODES_H
 
+int anon_inode_getfd_flags(int *pfd, struct inode **pinode, struct file 
**pfile,
+  const char *name, const struct file_operations *fops,
+  void *priv, int flags);
 int anon_inode_getfd(int *pfd, struct inode **pinode, struct file **pfile,
 const char *name, const struct file_operations *fops,
 void *priv);
--- linux/fs/eventfd.c
+++ linux/fs/eventfd.c
@@ -215,8 +215,9 @@ asmlinkage long sys_eventfd(unsigned int count)
 * When we call this, the initialization must be complete, since
 * anon_inode_getfd() will install the fd.
 */
-   error = anon_inode_getfd(&fd, &inode, &file, "[eventfd]",
-&eventfd_fops, ctx);
+   error = anon_inode_getfd_flags(&fd, &inode, &file, "[eventfd]",
+  &eventfd_fops, ctx,
+  INDIRECT_PARAM(file_flags, flags));
if (!error)
return fd;
 
--- linux/fs/signalfd.c
+++ linux/fs/signalfd.c
@@ -224,8 +224,10 @@ asmlinkage long sys_signalfd(int ufd, sigset_t __user 
*user_mask, size_t sizemas
 * When we call this, the initialization must be complete, since
 * anon_inode_getfd() will install the fd.
 */
-   error = anon_inode_getfd(&fd, &inode, &file, "[signalfd]",
-&signalfd_fops, ctx);
+   error = anon_inode_getfd_flags(&fd, &inode, &file,
+  "[signalfd]", &signalfd_fops,
+  ctx, INDIRECT_PARAM(file_flags,
+  flags));
if (error)
goto err_fdalloc;
} else {
--- linux/fs/timerfd.c
+++ linux/fs/timerfd.c
@@ -182,8 +182,10 @@ asmlinkage long sys_timerfd(int ufd, int clockid, int 
flags,
 * When we call this, the initialization must be complete, since
 * anon_inode_getfd() will install the fd.
 */
-   error = anon_inode_getfd(&fd, &inode, &file, "[timerfd]",
-&timerfd_fops, ctx);
+   error = anon_inode_getfd_flags(&fd, &inode, &file, "[timerfd]",
+  &timerfd_fops, ctx,
+  INDIRECT_PARAM(file_flags,
+ flags));
if (error)
goto err_tmrcancel;
} else {
--- linux/include/asm-x86/ia32_unistd.h
+++ linux/include/asm-x86/ia32_unistd.h
@@ -15,5 +15,8 @@
 #define __NR_ia32_socketcall   102
 #define 

[PATCHv5 4/5] Allow setting O_NONBLOCK flag for new sockets

2007-11-20 Thread Ulrich Drepper
This patch adds support for setting the O_NONBLOCK flag of the file
descriptors returned by socket, socketpair, and accept.

 socket.c |   15 +--
 1 file changed, 9 insertions(+), 6 deletions(-)


--- linux/net/socket.c
+++ linux/net/socket.c
@@ -362,7 +362,7 @@ static int sock_alloc_fd(struct file **filep, int flags)
return fd;
 }
 
-static int sock_attach_fd(struct socket *sock, struct file *file)
+static int sock_attach_fd(struct socket *sock, struct file *file, int flags)
 {
struct dentry *dentry;
struct qstr name = { .name = "" };
@@ -384,7 +384,7 @@ static int sock_attach_fd(struct socket *sock, struct file 
*file)
init_file(file, sock_mnt, dentry, FMODE_READ | FMODE_WRITE,
  _file_ops);
SOCK_INODE(sock)->i_fop = _file_ops;
-   file->f_flags = O_RDWR;
+   file->f_flags = O_RDWR | (flags & O_NONBLOCK);
file->f_pos = 0;
file->private_data = sock;
 
@@ -397,7 +397,7 @@ static int sock_map_fd_flags(struct socket *sock, int flags)
int fd = sock_alloc_fd(&newfile, flags);
 
if (likely(fd >= 0)) {
-   int err = sock_attach_fd(sock, newfile);
+   int err = sock_attach_fd(sock, newfile, flags);
 
if (unlikely(err < 0)) {
put_filp(newfile);
@@ -1268,12 +1268,14 @@ asmlinkage long sys_socketpair(int family, int type, int protocol,
goto out_release_both;
}
 
-   err = sock_attach_fd(sock1, newfile1);
+   err = sock_attach_fd(sock1, newfile1,
+INDIRECT_PARAM(file_flags, flags));
if (unlikely(err < 0)) {
goto out_fd2;
}
 
-   err = sock_attach_fd(sock2, newfile2);
+   err = sock_attach_fd(sock2, newfile2,
+INDIRECT_PARAM(file_flags, flags));
if (unlikely(err < 0)) {
fput(newfile1);
goto out_fd1;
@@ -1423,7 +1425,8 @@ asmlinkage long sys_accept(int fd, struct sockaddr __user *upeer_sockaddr,
goto out_put;
}
 
-   err = sock_attach_fd(newsock, newfile);
+   err = sock_attach_fd(newsock, newfile,
+INDIRECT_PARAM(file_flags, flags));
if (err < 0)
goto out_fd_simple;
 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [kvm-devel] [PATCH 3/3] virtio PCI device

2007-11-20 Thread Avi Kivity

Anthony Liguori wrote:

Avi Kivity wrote:
  

Anthony Liguori wrote:


Avi Kivity wrote:
 
  

Anthony Liguori wrote:
   

This is a PCI device that implements a transport for virtio.  It 
allows virtio

devices to be used by QEMU based VMMs like KVM or Xen.

+
+/* the notify function used when creating a virt queue */
+static void vp_notify(struct virtqueue *vq)
+{
+struct virtio_pci_device *vp_dev = to_vp_device(vq->vdev);
+struct virtio_pci_vq_info *info = vq->priv;
+
+/* we write the queue's selector into the notification 
register to

+ * signal the other end */
+iowrite16(info->queue_index, vp_dev->ioaddr + 
VIRTIO_PCI_QUEUE_NOTIFY);

+}

  

This means we can't kick multiple queues with one exit.


There is no interface in virtio currently to batch multiple queue 
notifications so the only way one could do this AFAICT is to use a 
timer to delay the notifications.  Were you thinking of something else?


  
  

No.  We can change virtio though, so let's have a flexible ABI.



Well please propose the virtio API first and then I'll adjust the PCI 
ABI.  I don't want to build things into the ABI that we never actually 
end up using in virtio :-)


  


Move ->kick() to virtio_driver.

I believe Xen networking uses the same event channel for both rx and tx, 
so in effect they're using this model.  It's a long time since I looked, though.


I'd also like to see a hypercall-capable version of this (but that 
can wait).



That can be a different device.
  
  
That means the user has to select which device to expose.  With 
feature bits, the hypervisor advertises both pio and hypercalls, the 
guest picks whatever it wants.



I was thinking more along the lines that a hypercall-based device would 
certainly be implemented in-kernel whereas the current device is 
naturally implemented in userspace.  We can simply use a different 
device for in-kernel drivers than for userspace drivers.  


Where the device is implemented is an implementation detail that should 
be hidden from the guest, isn't that one of the strengths of 
virtualization?  Two examples: a file-based block device implemented in 
qemu gives you fancy file formats with encryption and compression, while 
the same device implemented in the kernel gives you a low-overhead path 
directly to a zillion-disk SAN volume.  Or a user-level network device 
capable of running with the slirp stack and no permissions vs. the 
kernel device running copyless most of the time and using a dma engine 
for the rest but requiring you to be good friends with the admin.


The user should expect zero reconfigurations moving a VM from one model 
to the other.


There's no 
point at all in doing a hypercall based userspace device IMHO.
  


We abstract this away by having a "channel signalled" API (both at the 
kernel for kernel devices and as a kvm.h exit reason / libkvm callback).


Again, somewhat like Xen's event channels, though asymmetric.

I don't think so.  A vmexit is required to lower the IRQ line.  It 
may be possible to do something clever like set a shared memory value 
that's checked on every vmexit.  I think it's very unlikely that it's 
worth it though.
  
  

Why so unlikely?  Not all workloads will have good batching.



It's pretty invasive.  I think a more paravirt device that expected an 
edge triggered interrupt would be a better solution for those types of 
devices.
  


I was thinking it could be useful mostly in the context of a paravirt 
irqchip, where we can lower the cost of level-triggered interrupts.



+
+/* Select the queue we're interested in */
+iowrite16(index, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_SEL);

  
I would really like to see this implemented as pci config space, 
with no tricks like multiplexing several virtqueues on one register. 
Something like the PCI BARs where you have all the register numbers 
allocated statically to queues.


My first implementation did that.  I switched to using a selector 
because it reduces the amount of PCI config space used and does not 
limit the number of queues defined by the ABI as much.
  
  
But... it's tricky, and it's nonstandard.  With pci config, you can do 
live migration by shipping the pci config space to the other side.  
With the special iospace, you need to encode/decode it.



None of the PCI devices currently work like that in QEMU.  It would be 
very hard to make a device that worked this way because the order 
in which values are written matters a whole lot.  For instance, if you 
wrote the status register before the queue information, the driver could 
get into a funky state.
  


I assume you're talking about restore?  Isn't that atomic?

We'll still need save/restore routines for virtio devices.  I don't 
really see this as a problem since we do this for every other device.


  


Yeah.


Not much of an argument, I know.


wrt. number of 

Re: [PATCH 07/18] x86 vDSO: vdso32 build

2007-11-20 Thread Roland McGrath
> > +# This makes sure the $(obj) subdirectory exists even though vdso32/
> > +# is not a kbuild sub-make subdirectory.
> > +override obj-dirs = $(dir $(obj)) $(obj)/vdso32/
> 
> Should we teach kbuild to create dirs specified in targets?
> Or we could 'fix' it so you do not need the override.

Something cleaner would be nice, yes.  I'll leave it to you to decide.

> use "set -e; in front of this shell script to bail out early
> in case of errors.

Back when I knew something about make, all commands ran with sh -ec.
Ah, progress.  Anyway, the one you cited does not have any commands that
aren't tested with && or if already.  set -e would have no effect.

> > +VDSO_LDFLAGS = -fPIC -shared $(call ld-option, 
> > -Wl$(comma)--hash-style=sysv)
> 
> Do you need to specify soname for 64-bit - seems missing?

Using this rule for the 64-bit vDSO is not in this patch.
Patch 18/18 defines VDSO_LDFLAGS_vdso.lds for this.

> > +$(vdso-install-y): %.so: $(obj)/%.so.dbg FORCE
> > @mkdir -p $(MODLIB)/vdso
> > $(call cmd,vdso_install)
> Please use $(Q) in preference for @
> Then it is easier to debug using make V=1

This line is not being changed in this patch, so that is really a separate
question.  Other places in other Makefiles use @mkdir too, so if you are
concerned you could do a patch covering all of those.


Thanks,
Roland


Re: Futexes and network filesystems.

2007-11-20 Thread Eric W. Biederman
Kyle Moffett <[EMAIL PROTECTED]> writes:

> On Nov 20, 2007, at 17:53:52, Eric W. Biederman wrote:
>> I had a chance to think about this a bit more, and realized that the problem
>> is that futexes don't appear to work on network  filesystems, even if the
>> network filesystems provide coherent  shared memory.
>>
>> It seems to me that we need to have a call that gets a unique token for a
>> process for each filesystem per filesystem for use in futexes  (especially
>> robust futexes).  Say get_fs_task_id(const char *path);
>>
>> On local filesystems this could just be the pid as we use today, but for
>> filesystems that can be accessed from contexts with  potentially overlapping
>> pid values this could be something else.   It is an extra syscall in the
>> preparation path, but it should be hardly more expensive than the current
>> getpid().
>>
>> Once we have fixed the futex infrastructure to be able to handle futexes on
>> network filesystems, the pid namespace case will be  trivial to implement.
>
> Actually, I would think that get_vm_task_id(void *addr) would be a more useful
> interface.  The call would still be a relatively simple  lookup to find the
> struct file associated with the particular virtual  mapping, but it would be
> race-free from the perspective of userspace  and would not require that we
> somehow figure out the file descriptor  associated with a particular mmap()
> (which may be closed by this  point in time).  Useful extension would be the
> get_fd_task_id(int fd) and get_fs_task_id(const char *path), but those are 
> less
> important.

You are probably right.  The important thing is that we get an interface where
user space can cache the result.  Or else you totally lose the benefit of
avoiding trapping into the kernel.

> The other important thing is to ensure that somehow the numbers are considered
> unique only within the particular domain of a container,  such that you can
> migrate a container from one system to another even  using a simple local ext3
> filesystem (on a networked block device)  and still be able to have things 
> work
> properly even after the  migration.  Naturally this would only work with an
> upgraded libc but  I think that's a reasonable requirement to enforce for
> migration of  futexes and cross-network futexes.

Well. The numbers are unique per filesystem.  Which is what should save you.
The numbers can't simply be unique per container.

> Even for network filesystems which don't implement coherent shared memory, you
> might add a memexcl() system call which (when used by  multiple cooperating
> processes) ensures that a given page is only  ever mapped by at most one
> computer accessing a given network filesystem.  The page-outs and page-ins 
> when
> shuttling that page  across the network would be expensive, but I believe the
> cost would  be reasonable for many applications and it would allow traditional
> atomic ops on the mapped pages to take and release futexes in the  uncontended
> case.

Sounds reasonable to me. 

Eric


[PREEMPT_RT] LOCK_STAT and kexec: general protection fault

2007-11-20 Thread Sripathi Kodi
Hi,

On the RT kernel, when CONFIG_LOCK_STAT is enabled, trying to load the kdump 
kernel (kexec -p) causes a kernel panic. I have seen slightly varying 
backtraces, but I see general protection fault every time.

Hardware: x86_64, 4CPUs, LS20 blade.
Kernel: 2.6.23.1-rt11.

I don't see this problem on non-rt kernels.

Kernstopped custom tracer.
general protection fault:  [1] PREEMPT SMP
CPU 3
Modules linked in: ipmi_devintf ipmi_si ipmi_msghandler autofs4 hidp rfcomm 
l2cap bluetooth sunrpc nf_conntrack_netbios_ns ipt_REJECT nf_conntrack_ipv4 
xt_state iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter 
ip6_tables x_tables ipv6 dm_mirror dm_multipath dm_mod video output sbs dock 
battery ac parport_pc lp parport joydev sg amd_rng rtc_cmos i2c_amd756 k8temp 
shpchp i2c_core rtc_core tg3 hwmon serio_raw rtc_lib button mptspi mptscsih 
scsi_transport_spi mptbase sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd 
uhci_hcd
Pid: 3540, comm: kexec Not tainted 2.6.23.1-rt11 #2
RIP: 0010:[]  [] 
__lock_acquire+0x6b2/0xca0
RSP: 0018:8102118d7eb8  EFLAGS: 00010092
RAX: 0bbda08e8f9f3119 RBX: 8101fb4defd0 RCX: 80e31110
RDX: 0bbda08e8f9f3119 RSI: 000a RDI: 0001
RBP: 8102118d7f18 R08: 0002 R09: 0001
R10: 80275659 R11: f000 R12: 00e6
R13: 8101fb4de7c0 R14:  R15: 001225dc
FS:  2b55f6778b00() GS:810211fbb940() knlGS:
CS:  0010 DS:  ES:  CR0: 8005003b
CR2: 0032fc494ff0 CR3: 0001fbfa CR4: 06e0
DR0:  DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0400
Process kexec (pid: 3540, threadinfo 8101fb518000, task 8101fb4de7c0)
Stack:  0001b434a540 0002  8068dc90
 80234dd8 8102118d7f00 000a 0046
 8068dc78 0003 8068dc78 001225dc
Call Trace:
   [] wake_up_process+0x12/0x14
 [] lock_acquire+0x5e/0x77
 [] handle_edge_irq+0x29/0x158
 [] __spin_lock+0x35/0x62
 [] handle_edge_irq+0x29/0x158
 [] do_IRQ+0x80/0xea
 [] ret_from_intr+0x0/0xf
   [] copy_user_generic_string+0x17/0x40
 [] sys_kexec_load+0x501/0x5dd
 [] syscall_trace_enter+0x95/0x99
 [] tracesys+0xdc/0xe1

INFO: lockdep is turned off.

Code: 48 8b 10 0f 18 0a 48 39 c8 75 e4 f0 ff 0d 1b 5a 39 00 79 0d 
RIP  [] __lock_acquire+0x6b2/0xca0
 RSP 
Kernel panic - not syncing: Fatal exception

Call Trace:
   [] panic+0xaf/0x16e
 [] task_rq_lock+0x42/0x74
 [] rt_spin_unlock+0x1e/0x50
 [] rt_spin_unlock+0x1e/0x50
 [] oops_end+0x69/0x72
 [] die+0x4a/0x54
 [] do_general_protection+0x116/0x11f
 [] put_lock_stats+0x2b/0x2d
 [] error_exit+0x0/0x96
 [] handle_edge_irq+0x29/0x158
 [] __lock_acquire+0x6b2/0xca0
 [] wake_up_process+0x12/0x14
 [] lock_acquire+0x5e/0x77
 [] handle_edge_irq+0x29/0x158
 [] __spin_lock+0x35/0x62
 [] handle_edge_irq+0x29/0x158
 [] do_IRQ+0x80/0xea
 [] ret_from_intr+0x0/0xf
   [] copy_user_generic_string+0x17/0x40
 [] sys_kexec_load+0x501/0x5dd
 [] syscall_trace_enter+0x95/0x99
 [] tracesys+0xdc/0xe1

INFO: lockdep is turned off.

Thanks,
Sripathi.


Re: 2.6.24-rc3: find complains about /proc/net

2007-11-20 Thread Eric W. Biederman

Below is a preliminary patch.  It solves the directory issue but it doesn't
play well with proc_mnt and proc_flush_task.  It works by simply caching the
network namespace when we mount proc so we don't have to be fancy and dynamic.

Something for the discussion anyway.

I will start sorting out what makes sense tomorrow.

Eric


From f359fde2469ba8be2123a465e788a83c7e6831e9 Mon Sep 17 00:00:00 2001
From: Eric W. Biederman <[EMAIL PROTECTED]>
Date: Tue, 20 Nov 2007 19:36:05 -0700
Subject: [PATCH] proc: Fix /proc/net directory listings.

Having proc dynamically display the contents of /proc/net is
hard.  So make life simpler by capturing the network namespace
when we mount proc and only displaying that network namespace.

---
 fs/proc/base.c  |8 ++--
 fs/proc/generic.c   |4 ++-
 fs/proc/internal.h  |   13 +++
 fs/proc/proc_net.c  |   89 ---
 fs/proc/root.c  |   50 ++
 include/linux/proc_fs.h |4 ++
 6 files changed, 66 insertions(+), 102 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index aeaf0d0..9d4f06a 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2395,7 +2395,7 @@ struct dentry *proc_pid_lookup(struct inode *dir, struct dentry * dentry, struct
if (tgid == ~0U)
goto out;
 
-   ns = dentry->d_sb->s_fs_info;
+   ns = proc_sbi(dentry->d_sb)->pid_ns;
rcu_read_lock();
task = find_task_by_pid_ns(tgid, ns);
if (task)
@@ -2476,7 +2476,7 @@ int proc_pid_readdir(struct file * filp, void * dirent, filldir_t filldir)
goto out;
}
 
-   ns = filp->f_dentry->d_sb->s_fs_info;
+   ns = proc_sbi(filp->f_dentry->d_sb)->pid_ns;
tgid = filp->f_pos - TGID_OFFSET;
for (task = next_tgid(tgid, ns);
 task;
@@ -2615,7 +2615,7 @@ static struct dentry *proc_task_lookup(struct inode *dir, struct dentry * dentry
if (tid == ~0U)
goto out;
 
-   ns = dentry->d_sb->s_fs_info;
+   ns = proc_sbi(dentry->d_sb)->pid_ns;
rcu_read_lock();
task = find_task_by_pid_ns(tid, ns);
if (task)
@@ -2758,7 +2758,7 @@ static int proc_task_readdir(struct file * filp, void * dirent, filldir_t filldi
/* f_version caches the tgid value that the last readdir call couldn't
 * return. lseek aka telldir automagically resets f_version to 0.
 */
-   ns = filp->f_dentry->d_sb->s_fs_info;
+   ns = proc_sbi(filp->f_dentry->d_sb)->pid_ns;
tid = (int)filp->f_version;
filp->f_version = 0;
for (task = first_tid(leader, tid, pos - 2, ns);
diff --git a/fs/proc/generic.c b/fs/proc/generic.c
index 1bdb624..b58f0ec 100644
--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -398,7 +398,9 @@ struct dentry *proc_lookup(struct inode * dir, struct dentry *dentry, struct nam
continue;
		if (!memcmp(dentry->d_name.name, de->name, de->namelen)) {
unsigned int ino = de->low_ino;
-
+   
+   if (de->shadow_proc)
+   de = de->shadow_proc(dentry->d_sb, de);
de_get(de);
spin_unlock(_subdir_lock);
error = -EINVAL;
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 1820eb2..a26f115 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -11,6 +11,18 @@
 
 #include 
 
+struct pid_namespace;
+struct net;
+struct proc_sb_info {
+   struct pid_namespace *pid_ns;
+   struct net   *net_ns;
+};
+
+static inline struct proc_sb_info *proc_sbi(struct super_block *sb)
+{
+   return sb->s_fs_info;
+}
+
 #ifdef CONFIG_PROC_SYSCTL
 extern int proc_sys_init(void);
 #else
@@ -78,3 +90,4 @@ static inline int proc_fd(struct inode *inode)
 {
return PROC_I(inode)->fd;
 }
+
diff --git a/fs/proc/proc_net.c b/fs/proc/proc_net.c
index 131f9c6..8a82e29 100644
--- a/fs/proc/proc_net.c
+++ b/fs/proc/proc_net.c
@@ -50,89 +50,15 @@ struct net *get_proc_net(const struct inode *inode)
 }
 EXPORT_SYMBOL_GPL(get_proc_net);
 
-static struct proc_dir_entry *proc_net_shadow;
+static struct proc_dir_entry *shadow_pde;
 
-static struct dentry *proc_net_shadow_dentry(struct dentry *parent,
-   struct proc_dir_entry *de)
+static struct proc_dir_entry *proc_net_shadow(struct super_block *sb,
+ struct proc_dir_entry *de)
 {
-   struct dentry *shadow = NULL;
-   struct inode *inode;
-   if (!de)
-   goto out;
-   de_get(de);
-   inode = proc_get_inode(parent->d_inode->i_sb, de->low_ino, de);
-   if (!inode)
-   goto out_de_put;
-   shadow = d_alloc_name(parent, de->name);
-   if (!shadow)
-   goto out_iput;

Re: 2.6.24-rc3-mm1

2007-11-20 Thread Dave Young
On Nov 21, 2007 2:15 PM, Andrew Morton <[EMAIL PROTECTED]> wrote:
>
> On Wed, 21 Nov 2007 14:03:34 +0800 "Dave Young" <[EMAIL PROTECTED]> wrote:
>
> > On Nov 21, 2007 2:00 PM, Andrew Morton <[EMAIL PROTECTED]> wrote:
> > >
> > > On Wed, 21 Nov 2007 13:51:47 +0800 "Dave Young" <[EMAIL PROTECTED]> wrote:
> > >
> > > > Hi, andrew
> > > >
> > > > modpost failed for me:
> > > >   MODPOST 360 modules
> > > > ERROR: "empty_zero_page" [drivers/kvm/kvm.ko] undefined!
> > > > make[1]: *** [__modpost] Error 1
> > > > make: *** [modules] Error 2
> > > >
> > >
> > > You're a victim of the hasty unexporting fad.  Which architecture?
> > > x86_64 I guess?
> > >
> > Hi,
> > ia32 instead.
> >
>
> oic.  Like this, I guess.
>
> --- a/arch/x86/kernel/i386_ksyms_32.c~git-x86-i386-export-empty_zero_page
> +++ a/arch/x86/kernel/i386_ksyms_32.c
> @@ -2,6 +2,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  EXPORT_SYMBOL(__down_failed);
>  EXPORT_SYMBOL(__down_failed_interruptible);
> @@ -22,3 +23,4 @@ EXPORT_SYMBOL(__put_user_8);
>  EXPORT_SYMBOL(strstr);
>
>  EXPORT_SYMBOL(csum_partial);
> +EXPORT_SYMBOL(empty_zero_page);
> _
>

Yes, passed :)


Re: Futexes and network filesystems.

2007-11-20 Thread Kyle Moffett

On Nov 20, 2007, at 17:53:52, Eric W. Biederman wrote:
I had a chance to think about this a bit more, and realized that  
the problem is that futexes don't appear to work on network  
filesystems, even if the network filesystems provide coherent  
shared memory.


It seems to me that we need to have a call that gets a unique token  
for a process for each filesystem per filesystem for use in futexes  
(especially robust futexes).  Say get_fs_task_id(const char *path);


On local filesystems this could just be the pid as we use today,  
but for filesystems that can be accessed from contexts with  
potentially overlapping pid values this could be something else.   
It is an extra syscall in the preparation path, but it should be  
hardly more expensive than the current getpid().


Once we have fixed the futex infrastructure to be able to handle  
futexes on network filesystems, the pid namespace case will be  
trivial to implement.


Actually, I would think that get_vm_task_id(void *addr) would be a  
more useful interface.  The call would still be a relatively simple  
lookup to find the struct file associated with the particular virtual  
mapping, but it would be race-free from the perspective of userspace  
and would not require that we somehow figure out the file descriptor  
associated with a particular mmap() (which may be closed by this  
point in time).  Useful extension would be the get_fd_task_id(int fd)  
and get_fs_task_id(const char *path), but those are less important.


The other important thing is to ensure that somehow the numbers are  
considered unique only within the particular domain of a container,  
such that you can migrate a container from one system to another even  
using a simple local ext3 filesystem (on a networked block device)  
and still be able to have things work properly even after the  
migration.  Naturally this would only work with an upgraded libc but  
I think that's a reasonable requirement to enforce for migration of  
futexes and cross-network futexes.


Even for network filesystems which don't implement coherent shared  
memory, you might add a memexcl() system call which (when used by  
multiple cooperating processes) ensures that a given page is only  
ever mapped by at most one computer accessing a given network  
filesystem.  The page-outs and page-ins when shuttling that page  
across the network would be expensive, but I believe the cost would  
be reasonable for many applications and it would allow traditional  
atomic ops on the mapped pages to take and release futexes in the  
uncontended case.


Cheers,
Kyle Moffett



Re: 2.6.24-rc3-mm1 - Kernel Panic on IO-APIC

2007-11-20 Thread Andrew Morton
On Wed, 21 Nov 2007 11:41:23 +0530 Kamalesh Babulal <[EMAIL PROTECTED]> wrote:

> Hi Andrew,
> 
> Kernel panics across different architectures like powerpc, x86_64, 

powerpc complains about IO-APICs??

> Dentry cache hash table entries: 8388608 (order: 14, 67108864 bytes)
> Inode-cache hash table entries: 4194304 (order: 13, 33554432 bytes)
> Mount-cache hash table entries: 256
> SMP alternatives: switching to UP code
> ACPI: Core revision 20070126
> ..MP-BIOS bug: 8254 timer not connected to IO-APIC
> Kernel panic - not syncing: IO-APIC + timer doesn't work! Try using the 
> 'noapic' kernel parameter

ACPI or x86 breakage, I guess.

Did 'noapic' work?


Re: 2.6.24-rc3-mm1

2007-11-20 Thread Andrew Morton
On Wed, 21 Nov 2007 14:03:34 +0800 "Dave Young" <[EMAIL PROTECTED]> wrote:

> On Nov 21, 2007 2:00 PM, Andrew Morton <[EMAIL PROTECTED]> wrote:
> >
> > On Wed, 21 Nov 2007 13:51:47 +0800 "Dave Young" <[EMAIL PROTECTED]> wrote:
> >
> > > Hi, andrew
> > >
> > > modpost failed for me:
> > >   MODPOST 360 modules
> > > ERROR: "empty_zero_page" [drivers/kvm/kvm.ko] undefined!
> > > make[1]: *** [__modpost] Error 1
> > > make: *** [modules] Error 2
> > >
> >
> > You're a victim of the hasty unexporting fad.  Which architecture?
> > x86_64 I guess?
> >
> Hi,
> ia32 instead.
> 

oic.  Like this, I guess.

--- a/arch/x86/kernel/i386_ksyms_32.c~git-x86-i386-export-empty_zero_page
+++ a/arch/x86/kernel/i386_ksyms_32.c
@@ -2,6 +2,7 @@
 #include 
 #include 
 #include 
+#include 
 
 EXPORT_SYMBOL(__down_failed);
 EXPORT_SYMBOL(__down_failed_interruptible);
@@ -22,3 +23,4 @@ EXPORT_SYMBOL(__put_user_8);
 EXPORT_SYMBOL(strstr);
 
 EXPORT_SYMBOL(csum_partial);
+EXPORT_SYMBOL(empty_zero_page);
_



Re: 2.6.24-rc3-mm1 - Kernel Panic on IO-APIC

2007-11-20 Thread Kamalesh Babulal
Hi Andrew,

Kernel panics across different architectures like powerpc, x86_64, 

Dentry cache hash table entries: 8388608 (order: 14, 67108864 bytes)
Inode-cache hash table entries: 4194304 (order: 13, 33554432 bytes)
Mount-cache hash table entries: 256
SMP alternatives: switching to UP code
ACPI: Core revision 20070126
..MP-BIOS bug: 8254 timer not connected to IO-APIC
Kernel panic - not syncing: IO-APIC + timer doesn't work! Try using the 
'noapic' kernel parameter

-- 
Thanks & Regards,
Kamalesh Babulal,
Linux Technology Center,
IBM, ISTL.


Re: 2.6.24-rc3-mm1

2007-11-20 Thread Andrew Morton
On Wed, 21 Nov 2007 14:58:38 +0900 KAMEZAWA Hiroyuki <[EMAIL PROTECTED]> wrote:

> I met.
> 
>   CHK include/linux/version.h
>   CHK include/linux/utsrelease.h
>   CALLscripts/checksyscalls.sh
> :1389:2: warning: #warning syscall revokeat not implemented
> :1393:2: warning: #warning syscall frevoke not implemented
>   CHK include/linux/compile.h
> make[1]: *** No rule to make target `arch/ia64/lib/copy_page-export.o', 
> needed by `arch/ia64/lib/built-in.o'.  Stop.
> make: *** [arch/ia64/lib] Error 2
> 
> fix (for my config ?) is attached.
> 
> =
> This was necessary to build.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <[EMAIL PROTECTED]>
> 
>  arch/ia64/lib/Makefile |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> Index: linux-2.6.24-rc3-mm1/arch/ia64/lib/Makefile
> ===
> --- linux-2.6.24-rc3-mm1.orig/arch/ia64/lib/Makefile
> +++ linux-2.6.24-rc3-mm1/arch/ia64/lib/Makefile
> @@ -2,7 +2,7 @@
>  # Makefile for ia64-specific library routines..
>  #
>  
> -obj-y := io.o copy_page-export.o
> +obj-y := io.o
>  
>  lib-y := __divsi3.o __udivsi3.o __modsi3.o __umodsi3.o   
> \
>   __divdi3.o __udivdi3.o __moddi3.o __umoddi3.o   \

erp.  Actually, it should be this:

--- a/arch/ia64/lib/Makefile~ia64-export-copy_page-to-modules-fix-fix
+++ a/arch/ia64/lib/Makefile
@@ -2,7 +2,7 @@
 # Makefile for ia64-specific library routines..
 #
 
-obj-y := io.o copy_page-export.o
+obj-y := io.o
 
 lib-y := __divsi3.o __udivsi3.o __modsi3.o __umodsi3.o \
__divdi3.o __udivdi3.o __moddi3.o __umoddi3.o   \
_



Re: 2.6.24-rc3-mm1 - Build Failure on S390x

2007-11-20 Thread Andrew Morton
On Wed, 21 Nov 2007 11:26:51 +0530 Kamalesh Babulal <[EMAIL PROTECTED]> wrote:

> The kernel build fails on S390x, with

Yes, sorry, I forgot to mention that.  I got a large patch reject
between Greg's driver tree and the s390 tree and I couldn't be bothered
fixing it.  s390 is busted in 2.6.24-rc3-mm1.


Re: 2.6.24-rc3-mm1

2007-11-20 Thread Dave Young
On Nov 21, 2007 2:00 PM, Andrew Morton <[EMAIL PROTECTED]> wrote:
>
> On Wed, 21 Nov 2007 13:51:47 +0800 "Dave Young" <[EMAIL PROTECTED]> wrote:
>
> > Hi, andrew
> >
> > modpost failed for me:
> >   MODPOST 360 modules
> > ERROR: "empty_zero_page" [drivers/kvm/kvm.ko] undefined!
> > make[1]: *** [__modpost] Error 1
> > make: *** [modules] Error 2
> >
>
> You're a victim of the hasty unexporting fad.  Which architecture?
> x86_64 I guess?
>
Hi,
ia32 instead.

Regards
dave


Re: 2.6.24-rc3-mm1

2007-11-20 Thread Andrew Morton
On Wed, 21 Nov 2007 13:51:47 +0800 "Dave Young" <[EMAIL PROTECTED]> wrote:

> Hi, andrew
> 
> modpost failed for me:
>   MODPOST 360 modules
> ERROR: "empty_zero_page" [drivers/kvm/kvm.ko] undefined!
> make[1]: *** [__modpost] Error 1
> make: *** [modules] Error 2
> 

You're a victim of the hasty unexporting fad.  Which architecture?
x86_64 I guess?


Re: 2.6.24-rc3-mm1 - Build Failure on S390x

2007-11-20 Thread Kamalesh Babulal
Hi Andrew,

The kernel build fails on S390x, with

arch/s390/kernel/ipl.c: In function `ipl_register_fcp_files':
arch/s390/kernel/ipl.c:415: error: `ipl_subsys' undeclared (first use in this 
function)
arch/s390/kernel/ipl.c:415: error: (Each undeclared identifier is reported only 
once
arch/s390/kernel/ipl.c:415: error: for each function it appears in.)
arch/s390/kernel/ipl.c: In function `ipl_init':
arch/s390/kernel/ipl.c:449: error: implicit declaration of function 
`firmware_register'
arch/s390/kernel/ipl.c:449: error: `ipl_subsys' undeclared (first use in this 
function)
arch/s390/kernel/ipl.c: In function `on_panic_show':
arch/s390/kernel/ipl.c:764: error: implicit declaration of function 
`shutdown_action_str'
arch/s390/kernel/ipl.c:764: error: `on_panic_action' undeclared (first use in 
this function)
arch/s390/kernel/ipl.c:764: warning: format argument is not a pointer (arg 3)
arch/s390/kernel/ipl.c:764: warning: format argument is not a pointer (arg 3)
arch/s390/kernel/ipl.c: In function `on_panic_store':
arch/s390/kernel/ipl.c:771: error: `SHUTDOWN_REIPL_STR' undeclared (first use 
in this function)
arch/s390/kernel/ipl.c:772: error: `on_panic_action' undeclared (first use in 
this function)
arch/s390/kernel/ipl.c:772: error: `SHUTDOWN_REIPL' undeclared (first use in 
this function)
arch/s390/kernel/ipl.c:773: error: `SHUTDOWN_DUMP_STR' undeclared (first use in 
this function)
arch/s390/kernel/ipl.c:775: error: `SHUTDOWN_DUMP' undeclared (first use in 
this function)
arch/s390/kernel/ipl.c:776: error: `SHUTDOWN_STOP_STR' undeclared (first use in 
this function)
arch/s390/kernel/ipl.c:778: error: `SHUTDOWN_STOP' undeclared (first use in 
this function)
arch/s390/kernel/ipl.c: At top level:
arch/s390/kernel/ipl.c:879: error: redefinition of 'ipl_register_fcp_files'
arch/s390/kernel/ipl.c:412: error: previous definition of 
'ipl_register_fcp_files' was here
arch/s390/kernel/ipl.c:904: error: redefinition of 'ipl_init'
arch/s390/kernel/ipl.c:446: error: previous definition of 'ipl_init' was here
arch/s390/kernel/ipl.c:1050: error: `reipl_run' undeclared here (not in a 
function)
arch/s390/kernel/ipl.c:1050: error: initializer element is not constant
arch/s390/kernel/ipl.c:1050: error: (near initialization for `reipl_action.fn')
arch/s390/kernel/ipl.c:1058: error: redefinition of 'sys_dump_fcp_wwpn_show'
arch/s390/kernel/ipl.c:662: error: previous definition of 
'sys_dump_fcp_wwpn_show' was here
arch/s390/kernel/ipl.c:1058: error: redefinition of 'sys_dump_fcp_wwpn_store'
arch/s390/kernel/ipl.c:662: error: previous definition of 
'sys_dump_fcp_wwpn_store' was here
arch/s390/kernel/ipl.c:1058: error: redefinition of 'sys_dump_fcp_wwpn_attr'
arch/s390/kernel/ipl.c:662: error: previous definition of 
'sys_dump_fcp_wwpn_attr' was here
arch/s390/kernel/ipl.c:1060: error: redefinition of 'sys_dump_fcp_lun_show'
arch/s390/kernel/ipl.c:664: error: previous definition of 
'sys_dump_fcp_lun_show' was here
arch/s390/kernel/ipl.c:1060: error: redefinition of 'sys_dump_fcp_lun_store'
arch/s390/kernel/ipl.c:664: error: previous definition of 
'sys_dump_fcp_lun_store' was here
arch/s390/kernel/ipl.c:1060: error: redefinition of 'sys_dump_fcp_lun_attr'
arch/s390/kernel/ipl.c:664: error: previous definition of 
'sys_dump_fcp_lun_attr' was here
arch/s390/kernel/ipl.c:1062: error: redefinition of 'sys_dump_fcp_bootprog_show'
arch/s390/kernel/ipl.c:666: error: previous definition of 
'sys_dump_fcp_bootprog_show' was here
arch/s390/kernel/ipl.c:1062: error: redefinition of 
'sys_dump_fcp_bootprog_store'
arch/s390/kernel/ipl.c:666: error: previous definition of 
'sys_dump_fcp_bootprog_store' was here
arch/s390/kernel/ipl.c:1062: error: redefinition of 'sys_dump_fcp_bootprog_attr'
arch/s390/kernel/ipl.c:666: error: previous definition of 
'sys_dump_fcp_bootprog_attr' was here
arch/s390/kernel/ipl.c:1064: error: redefinition of 'sys_dump_fcp_br_lba_show'
arch/s390/kernel/ipl.c:668: error: previous definition of 
'sys_dump_fcp_br_lba_show' was here
arch/s390/kernel/ipl.c:1064: error: redefinition of 'sys_dump_fcp_br_lba_store'
arch/s390/kernel/ipl.c:668: error: previous definition of 
'sys_dump_fcp_br_lba_store' was here
arch/s390/kernel/ipl.c:1064: error: redefinition of 'sys_dump_fcp_br_lba_attr'
arch/s390/kernel/ipl.c:668: error: previous definition of 
'sys_dump_fcp_br_lba_attr' was here
arch/s390/kernel/ipl.c:1066: error: redefinition of 'sys_dump_fcp_device_show'
arch/s390/kernel/ipl.c:670: error: previous definition of 
'sys_dump_fcp_device_show' was here
arch/s390/kernel/ipl.c:1066: error: redefinition of 'sys_dump_fcp_device_store'
arch/s390/kernel/ipl.c:670: error: previous definition of 
'sys_dump_fcp_device_store' was here
arch/s390/kernel/ipl.c:1066: error: redefinition of 'sys_dump_fcp_device_attr'
arch/s390/kernel/ipl.c:670: error: previous definition of 
'sys_dump_fcp_device_attr' was here
arch/s390/kernel/ipl.c:1069: error: redefinition of 'dump_fcp_attrs'
arch/s390/kernel/ipl.c:673: error: previous 

Re: [PATCH 07/18] x86 vDSO: vdso32 build

2007-11-20 Thread Sam Ravnborg
Hi Roland.

Some minor things below.
In general I like the simplification of this
area and having it moved out of kernel/Makefile is the-right-thing.

Sam

On Mon, Nov 19, 2007 at 02:05:32PM -0800, Roland McGrath wrote:
> 
> This builds the 32-bit vDSO images in the arch/x86/vdso subdirectory.
> Nothing uses the images yet, but this paves the way for consolidating
> the vDSO build logic all in one place.  The new images use a linker
> script sharing the layout parts from vdso-layout.lds.S with the 64-bit
> vDSO.  A new vdso32-syms.lds is generated in the style of vdso-syms.lds.
> 
> Signed-off-by: Roland McGrath <[EMAIL PROTECTED]>
> ---
>  arch/x86/vdso/Makefile|   76 +++-
>  arch/x86/vdso/vdso32/vdso32.lds.S |   37 ++
>  2 files changed, 110 insertions(+), 3 deletions(-)
>  create mode 100644 arch/x86/vdso/vdso32/vdso32.lds.S
> 
> diff --git a/arch/x86/vdso/Makefile b/arch/x86/vdso/Makefile
> index 6a665dd..a1e0418 100644
> --- a/arch/x86/vdso/Makefile
> +++ b/arch/x86/vdso/Makefile
> @@ -1,7 +1,15 @@
>  #
> -# x86-64 vDSO.
> +# Building vDSO images for x86.
>  #
>  
> +VDSO64-$(CONFIG_X86_64)  := y
> +VDSO32-$(CONFIG_X86_32)  := y
> +VDSO32-$(CONFIG_COMPAT)  := y
> +
> +vdso-install-$(VDSO64-y) += vdso.so
> +vdso-install-$(VDSO32-y) += $(vdso32-y:=.so)
> +
> +
>  # files to link into the vdso
>  vobjs-y := vdso-note.o vclock_gettime.o vgetcpu.o vvar.o
>  
> @@ -57,10 +65,72 @@ quiet_cmd_vdsosym = VDSOSYM $@
>  $(obj)/%-syms.lds: $(obj)/%.so.dbg FORCE
>   $(call if_changed,vdsosym)
>  
> +#
> +# Build multiple 32-bit vDSO images to choose from at boot time.
> +#
> +vdso32.so-$(CONFIG_X86_32)   += int80
> +vdso32.so-$(VDSO32-y)+= sysenter
> +
> +CPPFLAGS_vdso32.lds = $(CPPFLAGS_vdso.lds)
> +VDSO_LDFLAGS_vdso32.lds = -m elf_i386 -Wl,-soname=linux-gate.so.1
> +
> +# This makes sure the $(obj) subdirectory exists even though vdso32/
> +# is not a kbuild sub-make subdirectory.
> +override obj-dirs = $(dir $(obj)) $(obj)/vdso32/

Should we teach kbuild to create dirs specified in targets?
Or we could 'fix' it so you do not need the override.

> +
> +targets += vdso32/vdso32.lds
> +targets += $(vdso32.so-y:%=vdso32-%.so.dbg) $(vdso32.so-y:%=vdso32-%.so)
> +targets += vdso32/note.o $(vdso32.so-y:%=vdso32/%.o)
> +
> +extra-y  += $(vdso32.so-y:%=vdso32-%.so)
> +
> +$(vdso32.so-y:%=$(obj)/vdso32-%.so.dbg): asflags-$(CONFIG_X86_64) += -m32
> +
> +$(vdso32.so-y:%=$(obj)/vdso32-%.so.dbg): $(obj)/vdso32-%.so.dbg: FORCE \
> +  $(obj)/vdso32/vdso32.lds \
> +  $(obj)/vdso32/note.o \
> +  $(obj)/vdso32/%.o
> + $(call if_changed,vdso)
> +
> +# Make vdso32-*-syms.lds from each image, and then make sure they match.
> +# The only difference should be that some do not define 
> VDSO32_SYSENTER_RETURN.
> +
> +targets += vdso32-syms.lds $(vdso32.so-y:%=vdso32-%-syms.lds)
> +
> +quiet_cmd_vdso32sym = VDSOSYM $@
> +define cmd_vdso32sym
> + if LC_ALL=C sort -u $(filter-out FORCE,$^) > $(@D)/.tmp_$(@F) && \
> +$(foreach H,$(filter-out FORCE,$^),\
> +  if grep -q VDSO32_SYSENTER_RETURN $H; \
> +  then diff -u $(@D)/.tmp_$(@F) $H; \
> +  else sed /VDSO32_SYSENTER_RETURN/d $(@D)/.tmp_$(@F) | \
> +   diff -u - $H; fi &&) : ;\
> + then mv -f $(@D)/.tmp_$(@F) $@; \
> + else rm -f $(@D)/.tmp_$(@F); exit 1; \
> + fi
> +endef

Use "set -e;" at the front of this shell script to bail out early
in case of errors.
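Outside the Makefile context, the difference `set -e` makes can be shown with plain `sh` (a minimal illustration; the commands are stand-ins, not kbuild code):

```shell
# Without set -e, the script continues past the failing command.
sh -c 'false; echo reached'        # prints "reached", exits 0

# With set -e, the shell exits at the first failing command, so a
# broken step cannot silently fall through and produce a stale target.
sh -c 'set -e; false; echo reached' || echo "aborted with status $?"
```

Note that `set -e` does not catch a failure in the middle of a pipeline; for that you need `set -o pipefail`, which is a bashism.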


> +
> +$(obj)/vdso32-syms.lds: $(vdso32.so-y:%=$(obj)/vdso32-%-syms.lds) FORCE
> + $(call if_changed,vdso32sym)
> +
> +#
> +# The DSO images are built using a special linker script.
> +#
> +quiet_cmd_vdso = VDSO$@
> +  cmd_vdso = $(CC) -nostdlib -o $@ \
> +$(VDSO_LDFLAGS) $(VDSO_LDFLAGS_$(filter %.lds,$(^F))) \
> +-Wl,-T,$(filter %.lds,$^) $(filter %.o,$^)
> +
> +VDSO_LDFLAGS = -fPIC -shared $(call ld-option, -Wl$(comma)--hash-style=sysv)

Do you need to specify soname for 64-bit - seems missing?

> +
> +#
> +# Install the unstripped copy of vdso*.so listed in $(vdso-install-y).
> +#
>  quiet_cmd_vdso_install = INSTALL $@
>cmd_vdso_install = cp $(obj)/$@.dbg $(MODLIB)/vdso/$@
> -vdso.so:
> +$(vdso-install-y): %.so: $(obj)/%.so.dbg FORCE
>   @mkdir -p $(MODLIB)/vdso
>   $(call cmd,vdso_install)
Please use $(Q) in preference to @.
Then it is easier to debug using make V=1.
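For reference, the convention Sam refers to looks roughly like this (a simplified sketch of the kbuild pattern, not the exact definitions in the kernel's top-level Makefile):

```make
# V=1 on the command line makes Q empty, so every command is echoed;
# by default Q=@ and the commands run silently.
ifeq ($(V),1)
  Q =
else
  Q = @
endif

vdso_install:
	$(Q)mkdir -p $(MODLIB)/vdso
```

With `@` hard-coded the command is always hidden; with `$(Q)` the user can reveal it on demand.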

> -vdso_install: vdso.so
> +PHONY += vdso_install $(vdso-install-y)
> +vdso_install: $(vdso-install-y)

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Modules: Handle symbols that have a zero value

2007-11-20 Thread Christoph Lameter
Another issue that I encountered with the cpu_alloc stuff.



The module subsystem cannot handle symbols that are zero. If symbols are
present that have a zero value then the module resolver prints out
a message that these symbols are unresolved.

Use ERR_PTR to return an error code instead of 0. This is a bit awkward
since the addresses are handled as unsigned longs. So we need to convert
them everywhere.

The idea to use ERR_PTR is from Mathieu.
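A user-space sketch of the idiom (the macros mirror include/linux/err.h in simplified form; find_symbol() here is a toy stand-in for __find_symbol()):

```c
#include <errno.h>

/* Sketch of the kernel's ERR_PTR/IS_ERR idiom.  The symbol's value is
 * an unsigned long address, and zero must remain a legal result, so
 * errors travel as negative errno codes packed into the pointer range
 * just below the top of the address space. */
#define MAX_ERRNO 4095
#define ERR_PTR(err) ((void *)(long)(err))
#define PTR_ERR(ptr) ((long)(ptr))
#define IS_ERR(ptr)  ((unsigned long)(ptr) >= (unsigned long)-MAX_ERRNO)

static unsigned long find_symbol(const char *name)
{
	if (name[0] == '\0')			/* lookup failed */
		return (unsigned long)ERR_PTR(-ENOENT);
	return 0;	/* found -- and the symbol's value happens to be 0 */
}
```

A caller then distinguishes the two cases with `IS_ERR((void *)addr)` rather than `addr == 0`, which is exactly the change the patch makes in simplify_symbols().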

Cc: Mathieu Desnoyers <[EMAIL PROTECTED]>
Cc: Kay Sievers <[EMAIL PROTECTED]>
Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 kernel/module.c |   17 ++---
 1 file changed, 10 insertions(+), 7 deletions(-)

Index: linux-2.6/kernel/module.c
===
--- linux-2.6.orig/kernel/module.c  2007-11-20 21:06:29.856965949 -0800
+++ linux-2.6/kernel/module.c   2007-11-20 21:16:53.001715887 -0800
@@ -285,7 +285,7 @@ static unsigned long __find_symbol(const
}
}
DEBUGP("Failed to find symbol %s\n", name);
-   return 0;
+   return (unsigned long)ERR_PTR(-ENOENT);
 }
 
 /* Search for module by name: must hold module_mutex. */
@@ -648,7 +648,7 @@ void __symbol_put(const char *symbol)
const unsigned long *crc;
 
preempt_disable();
-   if (!__find_symbol(symbol, &owner, &crc, 1))
+   if (IS_ERR((void *)__find_symbol(symbol, &owner, &crc, 1)))
BUG();
module_put(owner);
preempt_enable();
@@ -792,7 +792,8 @@ static inline int check_modstruct_versio
const unsigned long *crc;
struct module *owner;
 
-   if (!__find_symbol("struct_module", &owner, &crc, 1))
+   if (IS_ERR((void *)__find_symbol("struct_module",
+   &owner, &crc, 1)))
BUG();
return check_version(sechdrs, versindex, "struct_module", mod,
 crc);
@@ -845,7 +846,7 @@ static unsigned long resolve_symbol(Elf_
/* use_module can fail due to OOM, or module unloading */
if (!check_version(sechdrs, versindex, name, mod, crc) ||
!use_module(mod, owner))
-   ret = 0;
+   ret = (unsigned long)ERR_PTR(-EINVAL);
}
return ret;
 }
@@ -1238,14 +1239,16 @@ static int verify_export_symbols(struct 
const unsigned long *crc;
 
for (i = 0; i < mod->num_syms; i++)
-   if (__find_symbol(mod->syms[i].name, &owner, &crc, 1)) {
+   if (!IS_ERR((void *)__find_symbol(mod->syms[i].name,
+   &owner, &crc, 1))) {
name = mod->syms[i].name;
ret = -ENOEXEC;
goto dup;
}
 
for (i = 0; i < mod->num_gpl_syms; i++)
-   if (__find_symbol(mod->gpl_syms[i].name, &owner, &crc, 1)) {
+   if (!IS_ERR((void *)__find_symbol(mod->gpl_syms[i].name,
+   &owner, &crc, 1))) {
name = mod->gpl_syms[i].name;
ret = -ENOEXEC;
goto dup;
@@ -1295,7 +1298,7 @@ static int simplify_symbols(Elf_Shdr *se
   strtab + sym[i].st_name, mod);
 
/* Ok if resolved.  */
-   if (sym[i].st_value != 0)
+   if (!IS_ERR((void *)sym[i].st_value))
break;
/* Ok if weak.  */
if (ELF_ST_BIND(sym[i].st_info) == STB_WEAK)


Re: [PATCH][RFC] kprobes: Add user entry-handler in kretprobes

2007-11-20 Thread Jim Keniston
On Mon, 2007-11-19 at 17:56 +0530, Abhishek Sagar wrote:
> On Nov 17, 2007 6:24 AM, Jim Keniston <[EMAIL PROTECTED]> wrote:
> > > True, some kind of data pointer/pouch is essential.
> >
> > Yes.  If the pouch idea is too weird, then the data pointer is a good
> > compromise.
> >
> > With the above reservations, your enclosed patch looks OK.
> >
> > You should provide a patch #2 that updates Documentation/kprobes.txt.
> > Maybe that will yield a little more review from other folks.
> 
> The inlined patch provides support for an optional user entry-handler
> in kretprobes. It also adds provision for private data to be held in
> each return instance based on Kevin Stafford's "data pouch" approach.
> Kretprobe usage example in Documentation/kprobes.txt has also been
> updated to demonstrate the usage of entry-handlers.
> 
> Signed-off-by: Abhishek Sagar <[EMAIL PROTECTED]>

Thanks for doing this.

I have one minor suggestion on the code -- see below -- but I'm willing
to ack that with or without the suggested change.  Please also see
suggestions on kprobes.txt and the demo program.

Jim Keniston

> ---
> diff -upNr linux-2.6.24-rc2/Documentation/kprobes.txt
> linux-2.6.24-rc2_kp/Documentation/kprobes.txt
> --- linux-2.6.24-rc2/Documentation/kprobes.txt2007-11-07
> 03:27:46.0 +0530
> +++ linux-2.6.24-rc2_kp/Documentation/kprobes.txt 2007-11-19
> 17:41:27.0 +0530
> @@ -100,16 +100,21 @@ prototype matches that of the probed fun
> 
>  When you call register_kretprobe(), Kprobes establishes a kprobe at
>  the entry to the function.  When the probed function is called and this
> -probe is hit, Kprobes saves a copy of the return address, and replaces
> -the return address with the address of a "trampoline."  The trampoline
> -is an arbitrary piece of code -- typically just a nop instruction.
> -At boot time, Kprobes registers a kprobe at the trampoline.
> -
> -When the probed function executes its return instruction, control
> -passes to the trampoline and that probe is hit.  Kprobes' trampoline
> -handler calls the user-specified handler associated with the kretprobe,
> -then sets the saved instruction pointer to the saved return address,
> -and that's where execution resumes upon return from the trap.
> +probe is hit, the user defined entry_handler is invoked (optional). If

probe is hit, the user-defined entry_handler, if any, is invoked.  If

> +the entry_handler returns 0 (success) or is not present, then Kprobes
> +saves a copy of the return address, and replaces the return address
> +with the address of a "trampoline."  If the entry_handler returns a
> +non-zero error, the function executes as normal, as if no probe was
> +installed on it.

non-zero value, Kprobes leaves the return address as is, and the
kretprobe has no further effect for that particular function instance.

> The trampoline is an arbitrary piece of code --
> +typically just a nop instruction. At boot time, Kprobes registers a
> +kprobe at the trampoline.
> +
> +After the trampoline return address is planted, when the probed function
> +executes its return instruction, control passes to the trampoline and
> +that probe is hit.  Kprobes' trampoline handler calls the user-specified
> +return handler associated with the kretprobe, then sets the saved
> +instruction pointer to the saved return address, and that's where
> +execution resumes upon return from the trap.
> 
>  While the probed function is executing, its return address is
>  stored in an object of type kretprobe_instance.  Before calling
> @@ -117,6 +122,9 @@ register_kretprobe(), the user sets the
>  kretprobe struct to specify how many instances of the specified
>  function can be probed simultaneously.  register_kretprobe()
>  pre-allocates the indicated number of kretprobe_instance objects.
> +Additionally, a user may also specify per-instance private data to
> +be part of each return instance.  This is useful when using kretprobes
> +with a user entry_handler (see "register_kretprobe" for details).
> 
>  For example, if the function is non-recursive and is called with a
>  spinlock held, maxactive = 1 should be enough.  If the function is
> @@ -129,7 +137,8 @@ It's not a disaster if you set maxactive
>  some probes.  In the kretprobe struct, the nmissed field is set to
>  zero when the return probe is registered, and is incremented every
>  time the probed function is entered but there is no kretprobe_instance
> -object available for establishing the return probe.
> +object available for establishing the return probe. A miss also prevents
> +user entry_handler from being invoked.
> 
>  2. Architectures Supported
> 
> @@ -258,6 +267,16 @@ Establishes a return probe for the funct
>  rp->kp.addr.  When that function returns, Kprobes calls rp->handler.
>  You must set rp->maxactive appropriately before you call
>  register_kretprobe(); see "How Does a Return Probe Work?" for details.

It would be more consistent with the existing text in kprobes.txt to add

Re: 2.6.24-rc3-mm1

2007-11-20 Thread KAMEZAWA Hiroyuki
I hit this:

  CHK include/linux/version.h
  CHK include/linux/utsrelease.h
  CALLscripts/checksyscalls.sh
<stdin>:1389:2: warning: #warning syscall revokeat not implemented
<stdin>:1393:2: warning: #warning syscall frevoke not implemented
  CHK include/linux/compile.h
make[1]: *** No rule to make target `arch/ia64/lib/copy_page-export.o', needed 
by `arch/ia64/lib/built-in.o'.  Stop.
make: *** [arch/ia64/lib] Error 2

fix (for my config ?) is attached.

=
This was necessary to build.

Signed-off-by: KAMEZAWA Hiroyuki <[EMAIL PROTECTED]>

 arch/ia64/lib/Makefile |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6.24-rc3-mm1/arch/ia64/lib/Makefile
===
--- linux-2.6.24-rc3-mm1.orig/arch/ia64/lib/Makefile
+++ linux-2.6.24-rc3-mm1/arch/ia64/lib/Makefile
@@ -2,7 +2,7 @@
 # Makefile for ia64-specific library routines..
 #
 
-obj-y := io.o copy_page-export.o
+obj-y := io.o
 
 lib-y := __divsi3.o __udivsi3.o __modsi3.o __umodsi3.o \
__divdi3.o __udivdi3.o __moddi3.o __umoddi3.o   \



Re: 2.6.24-rc3-mm1

2007-11-20 Thread Dave Young
Hi, andrew

modpost failed for me:
  MODPOST 360 modules
ERROR: "empty_zero_page" [drivers/kvm/kvm.ko] undefined!
make[1]: *** [__modpost] Error 1
make: *** [modules] Error 2

Regards
dave


Re: USB deadlock after resume

2007-11-20 Thread Markus Rechberger
On 11/21/07, Mark Lord <[EMAIL PROTECTED]> wrote:
> Markus Rechberger wrote:
> > Hi,
> >
> > I'm looking at the linux uvc driver, and noticed after resuming my
> ..
>
> Pardon me.. what is the "uvc" driver?  Which module/source file is that?
>

http://linux-uvc.berlios.de/ it's not yet included in the kernel
sources, although many distributions already ship it.
A "dry" run putting the device into sleep mode works fine (I added a
proc interface for calling those suspend/resume functions).

Markus


Re: [PATCH] Fix optimized search

2007-11-20 Thread Gregory Haskins
>>> On Tue, Nov 20, 2007 at 11:26 PM, in message
<[EMAIL PROTECTED]>, Steven Rostedt <[EMAIL PROTECTED]>
wrote: 
> On Tue, Nov 20, 2007 at 11:15:48PM -0500, Steven Rostedt wrote:
>> Gregory Haskins wrote:
>>> I spied a few more issues from http://lkml.org/lkml/2007/11/20/590.
>>> Patch is below..
>>
>> Thanks, but I have one update...
>>
> 
> Here's the updated patch.
> 
> Oh, and Gregory, please email me at my [EMAIL PROTECTED] account. It
> has better filters ;-)
> 
> This series is at:
> 
>   http://rostedt.homelinux.com/rt/rt-balance-patches-v6.tar.bz2

Ah..mails crossed. ;)  Ignore my patch #1 from the 0/4 series I just sent out.

Regards,
-Greg



Re: [HIFN 00/03]: RNG support v2

2007-11-20 Thread Herbert Xu
On Sun, Nov 18, 2007 at 10:32:52PM +0100, Patrick McHardy wrote:
> These patches add support for using the HIFN rng.

All applied.  Thanks a lot Patrick!
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


[PATCH 4/4] RT: Use a 2-d bitmap for searching lowest-pri CPU

2007-11-20 Thread Gregory Haskins
The current code uses a linear algorithm, which causes scaling issues
on larger SMP machines.  This patch replaces that algorithm with a
2-dimensional bitmap to reduce latencies in the wake-up path.

In the past, this patch maintained the priorities as a global
property.  It now has per-root-domain scope to shield unrelated
domains from unnecessary cache collisions.

You may ask yourself why we add logic to find_lowest_cpus() earlier
in the series and then rip it out in this patch.  The answer is that
this current patch has been controversial, and is likely to not be
accepted (at least in the short term).  Therefore, it is in our best
interest to optimize as much of the code prior to this patch as possible
even if we ultimately rip it out here.  That way, the system is still tuned
even if this patch goes to /dev/null.

You may now be asking yourself why bother including this patch?  The
answer is that our own independent testing shows
(http://article.gmane.org/gmane.linux.rt.user/1889) this patch makes a
significant performance improvement (at least for rt latencies) and
doesn't appear to cause any regressions in other dimensions.  However,
I also understand the reservations raised by Steve Rostedt et al.
Therefore, I include this patch in the hopes that it is useful to
someone, but with the understanding that it is not likely to be accepted
without further demonstration of its benefits. 
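The two-step lookup described above can be sketched in miniature (a toy user-space model; the real code in kernel/sched_cpupri.c uses find_first_bit/find_next_bit over per-priority cpumask_t's, and the names below are illustrative):

```c
/* Toy model of the 2-d bitmap: first scan the priority-class bitmap
 * for the lowest active class, then pick a cpu from that class's mask
 * (a plain unsigned long here, so at most 64 cpus). */
#define NR_PRI 102	/* (INVALID), IDLE, NORMAL, RT1 .. RT99 */

struct cpupri {
	unsigned long pri_active[2];		/* one bit per class */
	unsigned long pri_to_cpumask[NR_PRI];	/* cpus in each class */
};

static int find_lowest_cpu(struct cpupri *cp, unsigned long allowed)
{
	for (int idx = 0; idx < NR_PRI; idx++) {
		if (!(cp->pri_active[idx / 64] & (1UL << (idx % 64))))
			continue;		/* no cpu at this priority */
		unsigned long mask = cp->pri_to_cpumask[idx] & allowed;
		if (mask)
			return __builtin_ctzl(mask); /* lowest eligible cpu */
	}
	return -1;				/* no eligible cpu */
}
```

Without affinity restrictions the first active class wins immediately, which is the O(1) "two bit searches" behavior the patch header describes; the `allowed` intersection is what can push the search toward the O(min(102, nr_domcpus)) worst case.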

Signed-off-by: Gregory Haskins <[EMAIL PROTECTED]>
CC: Christoph Lameter <[EMAIL PROTECTED]>
---

 kernel/Makefile   |1 
 kernel/sched.c|4 +
 kernel/sched_cpupri.c |  166 +
 kernel/sched_cpupri.h |   35 ++
 kernel/sched_rt.c |  104 +--
 5 files changed, 223 insertions(+), 87 deletions(-)

diff --git a/kernel/Makefile b/kernel/Makefile
index dfa9695..c013a6c 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -57,6 +57,7 @@ obj-$(CONFIG_SYSCTL) += utsname_sysctl.o
 obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
 obj-$(CONFIG_TASKSTATS) += taskstats.o tsacct.o
 obj-$(CONFIG_MARKERS) += marker.o
+obj-$(CONFIG_SMP) += sched_cpupri.o
 
 ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y)
 # According to Alan Modra <[EMAIL PROTECTED]>, the -fno-omit-frame-pointer is
diff --git a/kernel/sched.c b/kernel/sched.c
index 578c186..d6be7e6 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -70,6 +70,8 @@
 #include 
 #include 
 
+#include "sched_cpupri.h"
+
 /*
  * Scheduler clock - returns current time in nanosec units.
  * This is default implementation.
@@ -283,6 +285,7 @@ struct root_domain {
cpumask_t span;
cpumask_t rto_mask;
atomic_t  rto_count;
+   struct cpupri cpupri;
 };
 
 struct root_domain def_root_domain;
@@ -5765,6 +5768,7 @@ static void init_rootdomain(struct root_domain *rd, 
cpumask_t *map)
memset(rd, 0, sizeof(*rd));
 
rd->span = *map;
+   cpupri_init(&rd->cpupri);
 }
 
 static void init_defrootdomain(void)
diff --git a/kernel/sched_cpupri.c b/kernel/sched_cpupri.c
new file mode 100644
index 000..e30d33f
--- /dev/null
+++ b/kernel/sched_cpupri.c
@@ -0,0 +1,166 @@
+/*
+ *  kernel/sched_cpupri.c
+ *
+ *  CPU priority management
+ *
+ *  Copyright (C) 2007 Novell
+ *
+ *  Author: Gregory Haskins <[EMAIL PROTECTED]>
+ *
+ *  This code tracks the priority of each CPU so that global migration
+ *  decisions are easy to calculate.  Each CPU can be in a state as follows:
+ *
+ * (INVALID), IDLE, NORMAL, RT1, ... RT99
+ *
+ *  going from the lowest priority to the highest.  CPUs in the INVALID state
+ *  are not eligible for routing.  The system maintains this state with
+ *  a 2 dimensional bitmap (the first for priority class, the second for cpus
+ *  in that class).  Therefore a typical application without affinity
+ *  restrictions can find a suitable CPU with O(1) complexity (e.g. two bit
+ *  searches).  For tasks with affinity restrictions, the algorithm has a
+ *  worst case complexity of O(min(102, nr_domcpus)), though the scenario that
+ *  yields the worst case search is fairly contrived.
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License
+ *  as published by the Free Software Foundation; version 2
+ *  of the License.
+ */
+
+#include "sched_cpupri.h"
+
+/* Convert between a 140 based task->prio, and our 102 based cpupri */
+static int convert_prio(int prio)
+{
+   int cpupri;
+
+   if (prio == MAX_PRIO)
+   cpupri = CPUPRI_IDLE;
+   else if (prio >= MAX_RT_PRIO)
+   cpupri = CPUPRI_NORMAL;
+   else
+   cpupri = MAX_RT_PRIO - prio + 1;
+
+   return cpupri;
+}
+
+#define for_each_cpupri_active(array, idx)\
+  for (idx = find_first_bit(array, CPUPRI_NR_PRIORITIES); \
+   idx < CPUPRI_NR_PRIORITIES;\
+   idx = find_next_bit(array, 

[PATCH 2/4] RT: Add sched-domain roots

2007-11-20 Thread Gregory Haskins
We add the notion of a root-domain which will be used later to rescope
global variables to per-domain variables.  Each exclusive cpuset
essentially defines an island domain by fully partitioning the member cpus
from any other cpuset.  However, we currently still maintain some
policy/state as global variables which transcend all cpusets.  Consider,
for instance, rt-overload state.

Whenever a new exclusive cpuset is created, we also create a new
root-domain object and move each cpu member to the root-domain's span.
By default the system creates a single root-domain with all cpus as
members (mimicking the global state we have today).

We add some plumbing for storing class specific data in our root-domain.
Whenever a RQ is switching root-domains (because of repartitioning) we
give each sched_class the opportunity to remove any state from its old
domain and add state to the new one.  This logic doesn't have any clients
yet but it will later in the series.
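The refcounting this implies (each rq holds a reference on its root-domain; the last one to leave frees it) can be sketched in user space; the names mirror the patch, but this is an illustration with stand-in types, not the kernel code:

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Minimal stand-ins for the kernel structures. */
struct root_domain {
	atomic_int refcount;
};

struct rq {
	struct root_domain *rd;
};

static void rq_attach_root(struct rq *rq, struct root_domain *rd)
{
	if (rq->rd) {
		struct root_domain *old_rd = rq->rd;

		/* class->leave_domain() callbacks would run here */
		if (atomic_fetch_sub(&old_rd->refcount, 1) == 1)
			free(old_rd);	/* last rq out frees the domain */
	}
	atomic_fetch_add(&rd->refcount, 1);
	rq->rd = rd;
	/* class->join_domain() callbacks would run here */
}
```

Repartitioning then reduces to calling rq_attach_root() on each member rq with the new domain; the old domain's storage disappears once its last member has moved.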

Signed-off-by: Gregory Haskins <[EMAIL PROTECTED]>
CC: Christoph Lameter <[EMAIL PROTECTED]>
CC: Paul Jackson <[EMAIL PROTECTED]>
CC: Simon Derr <[EMAIL PROTECTED]>
---

 include/linux/sched.h |3 ++
 kernel/sched.c|   89 +++--
 2 files changed, 89 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 19d9f3f..0ba221e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -845,6 +845,9 @@ struct sched_class {
void (*task_tick) (struct rq *rq, struct task_struct *p);
void (*task_new) (struct rq *rq, struct task_struct *p);
void (*set_cpus_allowed)(struct task_struct *p, cpumask_t *newmask);
+
+   void (*join_domain)(struct rq *rq);
+   void (*leave_domain)(struct rq *rq);
 };
 
 struct load_weight {
diff --git a/kernel/sched.c b/kernel/sched.c
index e685402..fb619fb 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -276,6 +276,17 @@ struct rt_rq {
int overloaded;
 };
 
+#ifdef CONFIG_SMP
+
+struct root_domain {
+   atomic_t refcount;
+   cpumask_t span;
+};
+
+struct root_domain def_root_domain;
+
+#endif
+
 /*
  * This is the main, per-CPU runqueue data structure.
  *
@@ -333,6 +344,7 @@ struct rq {
atomic_t nr_iowait;
 
 #ifdef CONFIG_SMP
+   struct root_domain  *rd;
struct sched_domain *sd;
 
/* For active balancing */
@@ -5718,11 +5730,68 @@ sd_parent_degenerate(struct sched_domain *sd, struct 
sched_domain *parent)
return 1;
 }
 
+static void rq_attach_root(struct rq *rq, struct root_domain *rd)
+{
+   unsigned long flags;
+   const struct sched_class *class;
+
+   spin_lock_irqsave(&rq->lock, flags);
+
+   if (rq->rd) {
+   struct root_domain *old_rd = rq->rd;
+
+   for (class = sched_class_highest; class; class = class->next)
+   if (class->leave_domain)
+   class->leave_domain(rq);
+
+   if (atomic_dec_and_test(&old_rd->refcount))
+   kfree(old_rd);
+   }
+
+   atomic_inc(&rd->refcount);
+   rq->rd = rd;
+
+   for (class = sched_class_highest; class; class = class->next)
+   if (class->join_domain)
+   class->join_domain(rq);
+
+   spin_unlock_irqrestore(&rq->lock, flags);
+}
+
+static void init_rootdomain(struct root_domain *rd, cpumask_t *map)
+{
+   memset(rd, 0, sizeof(*rd));
+
+   rd->span = *map;
+}
+
+static void init_defrootdomain(void)
+{
+   cpumask_t cpus = CPU_MASK_ALL;
+
+   init_rootdomain(&def_root_domain, &cpus);
+   atomic_set(&def_root_domain.refcount, 1);
+}
+
+static struct root_domain *alloc_rootdomain(cpumask_t *map)
+{
+   struct root_domain *rd;
+
+   rd = kmalloc(sizeof(*rd), GFP_KERNEL);
+   if (!rd)
+   return NULL;
+
+   init_rootdomain(rd, map);
+
+   return rd;
+}
+
 /*
  * Attach the domain 'sd' to 'cpu' as its base domain.  Callers must
  * hold the hotplug lock.
  */
-static void cpu_attach_domain(struct sched_domain *sd, int cpu)
+static void cpu_attach_domain(struct sched_domain *sd,
+ struct root_domain *rd, int cpu)
 {
struct rq *rq = cpu_rq(cpu);
struct sched_domain *tmp;
@@ -5747,6 +5816,7 @@ static void cpu_attach_domain(struct sched_domain *sd, 
int cpu)
 
sched_domain_debug(sd, cpu);
 
+   rq_attach_root(rq, rd);
rcu_assign_pointer(rq->sd, sd);
 }
 
@@ -6115,6 +6185,7 @@ static void init_sched_groups_power(int cpu, struct 
sched_domain *sd)
 static int build_sched_domains(const cpumask_t *cpu_map)
 {
int i;
+   struct root_domain *rd;
 #ifdef CONFIG_NUMA
struct sched_group **sched_group_nodes = NULL;
int sd_allnodes = 0;
@@ -6131,6 +6202,12 @@ static int build_sched_domains(const cpumask_t *cpu_map)
sched_group_nodes_bycpu[first_cpu(*cpu_map)] = sched_group_nodes;
 #endif
 
+   rd = alloc_rootdomain(cpu_map);
+   if 

[PATCH 3/4] RT: Only balance our RT tasks within our root-domain

2007-11-20 Thread Gregory Haskins
We move the rt-overload data to per-domain scope, as the first global to
undergo this reclassification.  This limits the scope of overload-related
cache-line bouncing to stay within a specified partition instead of
affecting all cpus in the system.

Finally, we limit the scope of find_lowest_cpu searches to the domain
instead of the entire system.  Note that we would always respect domain
boundaries even without this patch, but we first would scan potentially
all cpus before whittling the list down.  Now we can avoid looking at
RQs that are out of scope, again reducing cache-line hits.

Note: In some cases, task->cpus_allowed will effectively reduce our search
to within our domain.  However, I believe there are cases where the
cpus_allowed mask may be all ones and therefore we err on the side of
caution.  If it can be optimized later, so be it.

Signed-off-by: Gregory Haskins <[EMAIL PROTECTED]>
CC: Christoph Lameter <[EMAIL PROTECTED]>
---

 kernel/sched.c|2 ++
 kernel/sched_rt.c |   65 +++--
 2 files changed, 45 insertions(+), 22 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index fb619fb..578c186 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -281,6 +281,8 @@ struct rt_rq {
 struct root_domain {
atomic_t refcount;
cpumask_t span;
+   cpumask_t rto_mask;
+   atomic_t  rto_count;
 };
 
 struct root_domain def_root_domain;
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index fbf4fb1..3495762 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -4,20 +4,18 @@
  */
 
 #ifdef CONFIG_SMP
-static cpumask_t rt_overload_mask;
-static atomic_t rto_count;
-static inline int rt_overloaded(void)
+
+static inline int rt_overloaded(struct rq *rq)
 {
-   return atomic_read(&rto_count);
+   return atomic_read(&rq->rd->rto_count);
 }
-static inline cpumask_t *rt_overload(void)
+static inline cpumask_t *rt_overload(struct rq *rq)
 {
-   return &rt_overload_mask;
+   return &rq->rd->rto_mask;
 }
 static inline void rt_set_overload(struct rq *rq)
 {
-   rq->rt.overloaded = 1;
-   cpu_set(rq->cpu, rt_overload_mask);
+   cpu_set(rq->cpu, rq->rd->rto_mask);
/*
 * Make sure the mask is visible before we set
 * the overload count. That is checked to determine
@@ -26,22 +24,25 @@ static inline void rt_set_overload(struct rq *rq)
 * updated yet.
 */
wmb();
-   atomic_inc(&rto_count);
+   atomic_inc(&rq->rd->rto_count);
 }
 static inline void rt_clear_overload(struct rq *rq)
 {
/* the order here really doesn't matter */
-   atomic_dec(&rto_count);
-   cpu_clear(rq->cpu, rt_overload_mask);
-   rq->rt.overloaded = 0;
+   atomic_dec(&rq->rd->rto_count);
+   cpu_clear(rq->cpu, rq->rd->rto_mask);
 }
 
 static void update_rt_migration(struct rq *rq)
 {
-   if (rq->rt.rt_nr_migratory && (rq->rt.rt_nr_running > 1))
+   if (rq->rt.rt_nr_migratory && (rq->rt.rt_nr_running > 1)) {
rt_set_overload(rq);
-   else
+   rq->rt.overloaded = 1;
+   } else {
rt_clear_overload(rq);
+   rq->rt.overloaded = 0;
+   }
+   
 }
 #endif /* CONFIG_SMP */
 
@@ -304,6 +305,15 @@ static int find_lowest_cpus(struct task_struct *task, cpumask_t *lowest_mask)
cpus_and(*lowest_mask, cpu_online_map, task->cpus_allowed);
 
/*
+* Yes, I know doing two cpus_and() kind of sucks, especially on
+* those very large SMP systems.  We are going for correctness now,
+* optimization later.  A later patch will add rd->online to cache
+* the subset of rd->span that are online and then we can collapse
+* these two mask operations into one
+*/
+   cpus_and(*lowest_mask, *lowest_mask, task_rq(task)->rd->span);
+
+   /*
 * Scan each rq for the lowest prio.
 */
for_each_cpu_mask(cpu, *lowest_mask) {
@@ -584,18 +594,12 @@ static int pull_rt_task(struct rq *this_rq)
 
	assert_spin_locked(&this_rq->lock);
 
-   /*
-* If cpusets are used, and we have overlapping
-* run queue cpusets, then this algorithm may not catch all.
-* This is just the price you pay on trying to keep
-* dirtying caches down on large SMP machines.
-*/
-   if (likely(!rt_overloaded()))
+   if (likely(!rt_overloaded(this_rq)))
return 0;
 
next = pick_next_task_rt(this_rq);
 
-   rto_cpumask = rt_overload();
+   rto_cpumask = rt_overload(this_rq);
 
for_each_cpu_mask(cpu, *rto_cpumask) {
if (this_cpu == cpu)
@@ -814,6 +818,20 @@ static void task_tick_rt(struct rq *rq, struct task_struct *p)
}
 }
 
+/* Assumes rq->lock is held */
+static void join_domain_rt(struct rq *rq)
+{
+   if (rq->rt.overloaded)
+   rt_set_overload(rq);
+}
+
+/* Assumes rq->lock is held */
+static void leave_domain_rt(struct rq *rq)
+{
+   if (rq->rt.overloaded)
+   

[PATCH 1/4] Fix optimized search

2007-11-20 Thread Gregory Haskins
Include cpu 0 in the search, and eliminate the redundant cpu_set since
the bit should already be set in the mask.

Signed-off-by: Gregory Haskins <[EMAIL PROTECTED]>
---

 kernel/sched_rt.c |7 +++
 1 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 28feeff..fbf4fb1 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -297,7 +297,7 @@ static DEFINE_PER_CPU(cpumask_t, local_cpu_mask);
 static int find_lowest_cpus(struct task_struct *task, cpumask_t *lowest_mask)
 {
int   lowest_prio = -1;
-   int   lowest_cpu  = 0;
+   int   lowest_cpu  = -1;
int   count   = 0;
int   cpu;
 
@@ -319,7 +319,7 @@ static int find_lowest_cpus(struct task_struct *task, cpumask_t *lowest_mask)
 * and the count==1 will cause the algorithm
 * to use the first bit found.
 */
-   if (lowest_cpu) {
+   if (lowest_cpu != -1) {
cpus_clear(*lowest_mask);
cpu_set(rq->cpu, *lowest_mask);
}
@@ -335,7 +335,6 @@ static int find_lowest_cpus(struct task_struct *task, cpumask_t *lowest_mask)
lowest_cpu = cpu;
count = 0;
}
-   cpu_set(rq->cpu, *lowest_mask);
count++;
} else
cpu_clear(cpu, *lowest_mask);
@@ -346,7 +345,7 @@ static int find_lowest_cpus(struct task_struct *task, cpumask_t *lowest_mask)
 * runqueues that were of higher prio than
 * the lowest_prio.
 */
-   if (lowest_cpu) {
+   if (lowest_cpu != -1) {
/*
 * Perhaps we could add another cpumask op to
 * zero out bits. Like cpu_zero_bits(cpumask, nrbits);

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 0/4] more RT balancing enhancements

2007-11-20 Thread Gregory Haskins
These are patches that apply to the end of the v5 series announced here:

http://lkml.org/lkml/2007/11/20/558

Steven,
These are patches that I could not finish in time to get in with the v4
release. 

Ingo,
If you accept the prior work submitted by Steven and myself, please also
consider this series.

Comments/feedback welcome!

Regards,
-Greg 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [rfc 18/45] cpu alloc: XFS counters

2007-11-20 Thread Christoph Lameter
On Wed, 21 Nov 2007, David Chinner wrote:

> Seeing as I didn't notice this patchest changed XFS (where's the cc?)
> until I saw hch's question I'd appreciate a pointer to that discussion
> as it's long been deleted from my mailbox.

http://marc.info/?t=11943826359=2=2

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [rfc 18/45] cpu alloc: XFS counters

2007-11-20 Thread David Chinner
On Tue, Nov 20, 2007 at 12:38:29PM -0800, Christoph Lameter wrote:
> On Tue, 20 Nov 2007, Christoph Hellwig wrote:
> 
> > On Mon, Nov 19, 2007 at 05:11:50PM -0800, [EMAIL PROTECTED] wrote:
> > > Also remove the useless zeroing after allocation. Allocpercpu already
> > > zeroed the objects.
> > 
> > You still haven't answered my comment to the last iteration.
> 
> And you have not read the discussion on that subject in the prior 
> iteration between Peter Zilkstra and me.

Seeing as I didn't notice this patchest changed XFS (where's the cc?)
until I saw hch's question I'd appreciate a pointer to that discussion
as it's long been deleted from my mailbox.

FWIW, I happen to agree with Christoph (hch) that the shouting
macros are an ugly step backwards, esp. given that is replacing:

#define per_cpu_ptr(ptr, cpu)   percpu_ptr((ptr), (cpu))

a set of lowercase macros

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


2.6.24-rc3-mm1

2007-11-20 Thread Andrew Morton

ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc3/2.6.24-rc3-mm1/

- Added new git tree git-lblnet.patch: net labelling work (Paul Moore
  <[EMAIL PROTECTED]>)

- Re-added Smack

- Dropped git-kbuild.patch - it broke the scsi build

- I'm offline until Sunday (as are certain other patch targets..)


Boilerplate:

- See the `hot-fixes' directory for any important updates to this patchset.

- To fetch an -mm tree using git, use (for example)

  git-fetch git://git.kernel.org/pub/scm/linux/kernel/git/smurf/linux-trees.git 
tag v2.6.16-rc2-mm1
  git-checkout -b local-v2.6.16-rc2-mm1 v2.6.16-rc2-mm1

- -mm kernel commit activity can be reviewed by subscribing to the
  mm-commits mailing list.

echo "subscribe mm-commits" | mail [EMAIL PROTECTED]

- If you hit a bug in -mm and it is not obvious which patch caused it, it is
  most valuable if you can perform a bisection search to identify which patch
  introduced the bug.  Instructions for this process are at

http://www.zip.com.au/~akpm/linux/patches/stuff/bisecting-mm-trees.txt

  But beware that this process takes some time (around ten rebuilds and
  reboots), so consider reporting the bug first and if we cannot immediately
  identify the faulty patch, then perform the bisection search.

- When reporting bugs, please try to Cc: the relevant maintainer and mailing
  list on any email.

- When reporting bugs in this kernel via email, please also rewrite the
  email Subject: in some manner to reflect the nature of the bug.  Some
  developers filter by Subject: when looking for messages to read.

- Occasional snapshots of the -mm lineup are uploaded to
  ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/mm/ and are announced on
  the mm-commits list.  These probably are at least compilable.

- More-than-daily -mm snapshots may be found at
  http://userweb.kernel.org/~akpm/mmotm/.  These are almost certainly not
  compileable.



Changes since 2.6.24-rc2-mm1:


 origin.patch
 git-acpi.patch
 git-alsa.patch
 git-arm-master.patch
 git-arm.patch
 git-avr32.patch
 git-cifs.patch
 git-cpufreq.patch
 git-powerpc.patch
 git-drm.patch
 git-dvb.patch
 git-hwmon.patch
 git-gfs2-nmw.patch
 git-hid.patch
 git-hrt.patch
 git-ieee1394.patch
 git-infiniband.patch
 git-input.patch
 git-jfs.patch
 git-kvm.patch
 git-lblnet.patch
 git-leds.patch
 git-libata-all.patch
 git-m32r.patch
 git-md-accel.patch
 git-mips.patch
 git-mmc.patch
 git-mtd.patch
 git-ubi.patch
 git-net.patch
 git-netdev-all.patch
 git-nfsd.patch
 git-ocfs2.patch
 git-parisc.patch
 git-selinux.patch
 git-s390.patch
 git-sh.patch
 git-scsi-misc.patch
 git-scsi-rc-fixes.patch
 git-sparc64.patch
 git-unionfs.patch
 git-v9fs.patch
 git-watchdog.patch
 git-wireless.patch
 git-ipwireless_cs.patch
 git-x86.patch
 git-newsetup.patch
 git-xfs.patch
 git-cryptodev.patch
 git-xtensa.patch

 git trees

-ecryptfs-cast-page-index-to-loff_t-instead-of-off_t.patch
-fix-oops-in-toshiba_acpi-error-return-path.patch
-rtc_hctosys-expects-rtcs-in-utc-doc.patch
-rtcs-handle-nvram-better.patch
-rtc-ds1307-exports-nvram.patch
-drivers-video-ps3fb-fix-memset-size-error.patch
-w1-fix-memset-size-error.patch
-slab-fix-typo-in-allocation-failure-handling.patch
-serial-add-pnp-id-for-davicom-isa-336k-modem.patch
-sysctl-check-length-at-deprecated_sysctl_warning.patch
-cm40x0_csc-fix-debug-macros.patch
-lib-move-bitmapo-from-lib-y-to-obj-y.patch
-uml-fix-symlink-loops.patch
-rtc-tweak-driver-documentation-for-rtc-periodic.patch
-chipsfb-uses-depends-on-pci.patch
-uvesafb-fix-warnings-about-unused-variables-on-non-x86.patch
-oprofile-oops-when-profile_pc-return-0lu.patch
-uml-fix-recvmsg-return-value-checking.patch
-uml-update-address-space-affected-by-pud_clear.patch
-uml-update-address-space-affected-by-pud_clear-checkpatch-fixes.patch
-improve-cgroup-printks.patch
-improve-cgroup-printks-fix.patch
-drivers-video-s1d13xxxfbc-as-module-with-dbg.patch
-forbid-user-to-change-file-flags-on-quota-files.patch
-forbid-user-to-change-file-flags-on-quota-files-fix.patch
-lxfb-use-the-correct-msr-number-for-panel-support.patch
-lguest_userc-fix-memory-leak.patch
-video-sis-fix-negative-array-index.patch
-8250_pnp-add-support-for-lg-c1-express-dual-machines.patch
-proc-fix-proc_kill_inodes-to-kill-dentries-on-all-proc-superblocks.patch
-proc-fix-proc_kill_inodes-to-kill-dentries-on-all-proc-superblocks-checkpatch-fixes.patch
-migration-find-correct-vma-in-new_vma_page.patch
-memory-hotremove-unset-migrate-type-isolate-after-removal.patch
-make-getdelays-cgroupstats-aware.patch
-mm-speed-up-writeback-ramp-up-on-clean-systems.patch
-add-ioresouce_busy-flag-for-system-ram.patch
-acpi-make-acpi_procfs-default-to-y.patch
-spi-fix-double-free-on-spi_unregister_master.patch
-spi-fix-error-paths-on-txx9spi_probe.patch
-paride-pf-driver-fixes.patch
-drivers-misc-move-misplaced-pci_dev_puts.patch
-dmaengine-fix-broken-device-refcounting.patch
-atmel_serial-build-warnings-begone.patch

Re: [PATCH] Fix optimized search

2007-11-20 Thread Steven Rostedt
On Tue, Nov 20, 2007 at 11:15:48PM -0500, Steven Rostedt wrote:
> Gregory Haskins wrote:
>> I spied a few more issues from http://lkml.org/lkml/2007/11/20/590.
>> Patch is below..
>
> Thanks, but I have one update...
>

Here's the updated patch.

Oh, and Gregory, please email me at my [EMAIL PROTECTED] account. It
has better filters ;-)

This series is at:

  http://rostedt.homelinux.com/rt/rt-balance-patches-v6.tar.bz2

===

This patch removes several cpumask operations by keeping track
of the first of the CPUS that is of the lowest priority. When
the search for the lowest priority runqueue is completed, all
the bits up to the first CPU with the lowest priority runqueue
is cleared.

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>

---
 kernel/sched_rt.c |   49 -
 1 file changed, 36 insertions(+), 13 deletions(-)

Index: linux-compile.git/kernel/sched_rt.c
===
--- linux-compile.git.orig/kernel/sched_rt.c	2007-11-20 23:17:43.0 -0500
+++ linux-compile.git/kernel/sched_rt.c 2007-11-20 23:18:21.0 -0500
@@ -293,29 +293,36 @@ static struct task_struct *pick_next_hig
 }
 
 static DEFINE_PER_CPU(cpumask_t, local_cpu_mask);
-static DEFINE_PER_CPU(cpumask_t, valid_cpu_mask);
 
 static int find_lowest_cpus(struct task_struct *task, cpumask_t *lowest_mask)
 {
-   int   cpu;
-   cpumask_t *valid_mask = &__get_cpu_var(valid_cpu_mask);
int   lowest_prio = -1;
+   int   lowest_cpu  = -1;
int   count   = 0;
+   int   cpu;
 
-   cpus_clear(*lowest_mask);
-   cpus_and(*valid_mask, cpu_online_map, task->cpus_allowed);
+   cpus_and(*lowest_mask, cpu_online_map, task->cpus_allowed);
 
/*
 * Scan each rq for the lowest prio.
 */
-   for_each_cpu_mask(cpu, *valid_mask) {
+   for_each_cpu_mask(cpu, *lowest_mask) {
struct rq *rq = cpu_rq(cpu);
 
/* We look for lowest RT prio or non-rt CPU */
if (rq->rt.highest_prio >= MAX_RT_PRIO) {
-   if (count)
+   /*
+* if we already found a low RT queue
+* and now we found this non-rt queue
+* clear the mask and set our bit.
+* Otherwise just return the queue as is
+* and the count==1 will cause the algorithm
+* to use the first bit found.
+*/
+   if (lowest_cpu != -1) {
cpus_clear(*lowest_mask);
-   cpu_set(rq->cpu, *lowest_mask);
+   cpu_set(rq->cpu, *lowest_mask);
+   }
return 1;
}
 
@@ -325,13 +332,29 @@ static int find_lowest_cpus(struct task_
if (rq->rt.highest_prio > lowest_prio) {
/* new low - clear old data */
lowest_prio = rq->rt.highest_prio;
-   if (count) {
-   cpus_clear(*lowest_mask);
-   count = 0;
-   }
+   lowest_cpu = cpu;
+   count = 0;
}
-   cpu_set(rq->cpu, *lowest_mask);
count++;
+   } else
+   cpu_clear(cpu, *lowest_mask);
+   }
+
+   /*
+* Clear out all the set bits that represent
+* runqueues that were of higher prio than
+* the lowest_prio.
+*/
+   if (lowest_cpu > 0) {
+   /*
+* Perhaps we could add another cpumask op to
+* zero out bits. Like cpu_zero_bits(cpumask, nrbits);
+* Then that could be optimized to use memset and such.
+*/
+   for_each_cpu_mask(cpu, *lowest_mask) {
+   if (cpu >= lowest_cpu)
+   break;
+   cpu_clear(cpu, *lowest_mask);
}
}
 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: USB deadlock after resume

2007-11-20 Thread Mark Lord

Markus Rechberger wrote:
> Hi,
>
> I'm looking at the linux uvc driver, and noticed after resuming my
..

Pardon me.. what is the "uvc" driver?  Which module/source file is that?

Thanks

> notebook it deadlocks at usb_set_interface.
> The linux kernel version on that notebook is 2.6.21.4, I searched
> around and haven't found any such bugreports.
> I wonder if anyone has ever heard about such a problem?
>
> I'm digging closer into that issue now..
>
> thanks,
> Markus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/




Re: [PATCH] Fix optimized search

2007-11-20 Thread Steven Rostedt

Gregory Haskins wrote:
> I spied a few more issues from http://lkml.org/lkml/2007/11/20/590.
>
> Patch is below..

Thanks, but I have one update...

> Regards,
> -Greg
>
> -
>
> Include cpu 0 in the search, and eliminate the redundant cpu_set since
> the bit should already be set in the mask.
>
> Signed-off-by: Gregory Haskins <[EMAIL PROTECTED]>
> ---
>
>  kernel/sched_rt.c |7 +++
>  1 files changed, 3 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
> index 28feeff..fbf4fb1 100644
> --- a/kernel/sched_rt.c
> +++ b/kernel/sched_rt.c
> @@ -297,7 +297,7 @@ static DEFINE_PER_CPU(cpumask_t, local_cpu_mask);
>  static int find_lowest_cpus(struct task_struct *task, cpumask_t *lowest_mask)
>  {
> int   lowest_prio = -1;
> -   int   lowest_cpu  = 0;
> +   int   lowest_cpu  = -1;
> int   count   = 0;
> int   cpu;
>
> @@ -319,7 +319,7 @@ static int find_lowest_cpus(struct task_struct *task, cpumask_t *lowest_mask)
>  * and the count==1 will cause the algorithm
>  * to use the first bit found.
>  */
> -   if (lowest_cpu) {
> +   if (lowest_cpu != -1) {
> cpus_clear(*lowest_mask);
> cpu_set(rq->cpu, *lowest_mask);
> }
> @@ -335,7 +335,6 @@ static int find_lowest_cpus(struct task_struct *task, cpumask_t *lowest_mask)
> lowest_cpu = cpu;
> count = 0;
> }
> -   cpu_set(rq->cpu, *lowest_mask);
> count++;
> } else
> cpu_clear(cpu, *lowest_mask);
> @@ -346,7 +345,7 @@ static int find_lowest_cpus(struct task_struct *task, cpumask_t *lowest_mask)
>  * runqueues that were of higher prio than
>  * the lowest_prio.
>  */
> -   if (lowest_cpu) {
> +   if (lowest_cpu != -1) {

We can change this to

  if (lowest_cpu > 0) {

because if lowest_cpu == 0, we don't need to bother with clearing any bits.

I'll apply this next.

Thanks.

-- Steve

> /*
>  * Perhaps we could add another cpumask op to
>  * zero out bits. Like cpu_zero_bits(cpumask, nrbits);



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/4] proc: simplify remove_proc_entry() wrt locking

2007-11-20 Thread Andrew Morton
On Fri, 16 Nov 2007 18:10:15 +0300 Alexey Dobriyan <[EMAIL PROTECTED]> wrote:

> We can take proc_subdir_lock for duration of list searching and removing
> from lists only. It can't hurt -- we can gather any amount of looked up
> PDEs right after proc_subdir_lock droppage in proc_lookup() anyway.
> Current code should already deal with this correctly.
> 
> Also this should make code more undestandable:
> * original looks like a loop, however, it's a loop with unconditional
>   trailing "break;" -- not loop at all.
> * more explicit statement that proc_subdir_lock protects only ->subdir lists.

oopses the Vaio.


[   12.595145] BUG: unable to handle kernel NULL pointer dereference at virtual 
address 0030
[   12.598487] printing eip: c01a607f *pde =  
[   12.601795] Oops:  [#1] PREEMPT 
[   12.605101] last sysfs file: 
[   12.608432] Modules linked in:
[   12.611727] 
[   12.615000] Pid: 1, comm: swapper Not tainted (2.6.24-rc3-mm1 #4)
[   12.618345] EIP: 0060:[] EFLAGS: 00010206 CPU: 0
[   12.621713] EIP is at remove_proc_entry+0x69/0x16c
[   12.625071] EAX:  EBX: f726d940 ECX: f726d9bd EDX: 
[   12.628445] ESI: 0030 EDI: f726d940 EBP: f7841e3c ESP: f7841dcc
[   12.631747]  DS: 007b ES: 007b FS:  GS:  SS: 0068
[   12.635052] Process swapper (pid: 1, ti=F784 task=F783ED30 
task.ti=F784)
[   12.635181] Stack: 0005 f726d9c0 c042747c f7841df0 c0131992 0282 
f7841e1c 0005 
[   12.638669] 0046 0174  c0138012  
  
[   12.642173]f783f2e0 f783ed30  f783ed30 c0320d27 0010 
f7841e34 c013a1e4 
[   12.645623] Call Trace:
[   12.652432]  [] show_trace_log_lvl+0x12/0x25
[   12.655938]  [] show_stack_log_lvl+0x8c/0x9e
[   12.659333]  [] show_registers+0x8a/0x1c0
[   12.662755]  [] die+0xee/0x1c4
[   12.666101]  [] do_page_fault+0x405/0x4e1
[   12.669427]  [] error_code+0x6a/0x70
[   12.672700]  [] unregister_handler_proc+0x1b/0x1d
[   12.675974]  [] free_irq+0xb3/0xdc
[   12.679227]  [] yenta_probe_cb_irq+0xc9/0xd6
[   12.682482]  [] ti12xx_override+0x12b/0x4c5
[   12.685782]  [] yenta_probe+0x2b1/0x55d
[   12.689042]  [] pci_device_probe+0x39/0x5b
[   12.692276]  [] driver_probe_device+0xd1/0x147
[   12.695492]  [] __driver_attach+0x6a/0xa1
[   12.698666]  [] bus_for_each_dev+0x37/0x5c
[   12.701783]  [] driver_attach+0x14/0x16
[   12.704891]  [] bus_add_driver+0x7a/0x191
[   12.708015]  [] driver_register+0x57/0x5c
[   12.711095]  [] __pci_register_driver+0x56/0x83
[   12.714170]  [] yenta_socket_init+0x14/0x16
[   12.717195]  [] kernel_init+0xc5/0x20f
[   12.720120]  [] kernel_thread_helper+0x7/0x10
[   12.723041]  ===
[   12.725918] INFO: lockdep is turned off.
[   12.728813] Code: 75 94 83 c6 38 eb 24 8b 55 f0 89 d9 8b 45 90 e8 ab fe ff 
ff 85 c0 74 0e 8b 43 30 89 df 89 06 c7 43 30 00 00 00 00 8b 36 83 c6 30 <8b> 1e 
85 db 75 d6 b8 e0 2b 43 c0 e8 ab ab 17 00 85 ff 0f 84 e3 
[   12.735473] EIP: [] remove_proc_entry+0x69/0x16c SS:ESP 
0068:f7841dcc

(gdb) l *0xc01a4fdf
0xc01a4fdf is in remove_proc_entry (fs/proc/generic.c:698).
warning: Source file is more recent than executable.

693 if (!parent && xlate_proc_name(name, &parent, &fn) != 0)
694 return;
695 len = strlen(fn);
696 
697 spin_lock(&proc_subdir_lock);
698 for (p = &parent->subdir; *p; p=&(*p)->next ) {
699 if (!proc_match(len, fn, *p))
700 continue;
701 de = *p;
702 *p = de->next;

iirc this is what Andy was hitting.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] 0/4 Support for Toshiba TMIO multifunction devices

2007-11-20 Thread eric miao
On Nov 21, 2007 11:54 AM, ian <[EMAIL PROTECTED]> wrote:
> On Wed, 2007-11-21 at 10:23 +0800, eric miao wrote:
> > Roughly went through the patch, looks good, here comes the remind, though 
> > :-)
> >
> > 1. is it possible to use some name other than "soc_core", maybe
> > "tmio_core" so that other multifunction chips sharing a core base
> > will live easier.
>
> It's (soc-core) not tmio MFD specific - it's already used by other MFD
> chips (although obviously not ones in mainline (yet!))
>
> it might be better named 'mfd-core' though, as that's its intended use...
>
> > 2. those C++ style comments "//" are not so pleasant...
>
> Should I clean them up and resubmit?
>

That would be nice. Anyway, could you inline them so others can comment?

> More to the point, who should I be submitting them to? the files under
> arm/ are obviously for RMK to peruse, but I couldnt find an entry for
> drivers/mfd in MAINTAINERS...
>

Well, I briefly went through the git history; it looks like Russell is the proper
one you could send them to (probably not) :-)

>
>



-- 
Cheers
- eric
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] 0/4 Support for Toshiba TMIO multifunction devices

2007-11-20 Thread ian
On Wed, 2007-11-21 at 10:23 +0800, eric miao wrote:
> Roughly went through the patch, looks good, here comes the remind, though :-)
> 
> 1. is it possible to use some name other than "soc_core", maybe
> "tmio_core" so that other multifunction chips sharing a core base
> will live easier.

It's (soc-core) not tmio MFD specific - it's already used by other MFD
chips (although obviously not ones in mainline (yet!))

it might be better named 'mfd-core' though, as that's its intended use...

> 2. those C++ style comments "//" are not so pleasant...

Should I clean them up and resubmit?

More to the point, who should I be submitting them to? the files under
arm/ are obviously for RMK to peruse, but I couldn't find an entry for
drivers/mfd in MAINTAINERS...


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


null pointer dereference during restart autofs (was: Linux 2.6.22.12)

2007-11-20 Thread Tomasz Kłoczko


BUG: unable to handle kernel NULL pointer dereference at virtual address 
0014
 printing eip:
c047102c
*pdpt = 24e2c001
*pde = 
Oops: 0002 [#1]
SMP
Modules linked in: nfsd exportfs iptable_mangle iptable_nat nf_nat 
nf_conntrack_ipv4 ipt_LOG ipt_connlimit nf_conntrack nfnetlink xt_tcpudp 
iptable_filter ip_tables x_tables nfs lockd nfs_acl ipv6 autofs4 sunrpc 
binfmt_misc quota_v2 dm_mirror dm_mod video sbs button dock battery ac 
parport_pc lp parport floppy nvram sr_mod cdrom joydev ata_generic sg 
e752x_edac edac_mc ata_piix pata_sil680 ehci_hcd libata iTCO_wdt 
iTCO_vendor_support e1000 uhci_hcd rtc_cmos rtc_core serio_raw rtc_lib 
scsi_wait_scan megaraid_mbox megaraid_mm sd_mod scsi_mod ext3 jbd mbcache
CPU:1
EIP:0060:[]Not tainted VLI
EFLAGS: 00010246   (2.6.22.12-1 #1)
EIP is at fput+0x2/0x15
eax:    ebx:    ecx: 9362   edx: 
esi:    edi: cc6c0dc0   ebp: e342b480   esp: cf231f30
ds: 007b   es: 007b   fs: 00d8  gs: 0033  ss: 0068
Process automount (pid: 15112, ti=cf23 task=dff84870 task.ti=cf23)
Stack: f8b60885 80049370   cc6c0dc0 f8b5f99f f625f588 f8b5f896
   e342b480 f8b5f896  c047a48f    9362
    e342b480  0004 c047a6de 0002 c047a014 
Call Trace:
 [] autofs4_catatonic_mode+0x5a/0x66 [autofs4]
 [] autofs4_root_ioctl+0x109/0x226 [autofs4]
 [] autofs4_root_ioctl+0x0/0x226 [autofs4]
 [] autofs4_root_ioctl+0x0/0x226 [autofs4]
 [] do_ioctl+0x87/0x9f
 [] vfs_ioctl+0x237/0x249
 [] do_fcntl+0xd2/0x249
 [] sys_ioctl+0x4c/0x64
 [] sysenter_past_esp+0x5f/0x85
 ===
Code: 7c 24 08 00 74 1b 8b 44 24 08 c7 40 64 00 00 00 00 8b 44 24 08 83 c4 
0c 5b 5e 5f 5d e9 7c 16 01 00 83 c4 0c 5b 5e 5f 5d c3 89 c2  ff 48 14 
0f 94 c0 84 c0 74 07 89 d0 e9 9c fe ff ff c3 56 85

EIP: [] fput+0x2/0x15 SS:ESP 0068:cf231f30

kloczek
--
---
*People don't have problems, they only create them for themselves*
---
Tomasz Kłoczko | *e-mail: [EMAIL PROTECTED]

Re: __rcu_process_callbacks() in Linux 2.6

2007-11-20 Thread James Huang
Please disregard the previous email.


In the latest Linux 2.6 RCU implementation, __rcu_process_callbacks() is coded 
as follows: 


422 static void __rcu_process_callbacks(struct rcu_ctrlblk *rcp,
423                                     struct rcu_data *rdp)
424 {
425         if (rdp->curlist && !rcu_batch_before(rcp->completed, rdp->batch)) {
426                 *rdp->donetail = rdp->curlist;
427                 rdp->donetail = rdp->curtail;
428                 rdp->curlist = NULL;
429                 rdp->curtail = &rdp->curlist;
430         }
431 
432         if (rdp->nxtlist && !rdp->curlist) {
433                 local_irq_disable();
434                 rdp->curlist = rdp->nxtlist;
435                 rdp->curtail = rdp->nxttail;
436                 rdp->nxtlist = NULL;
437                 rdp->nxttail = &rdp->nxtlist;
438                 local_irq_enable();
439 
440                 /*
441                  * start the next batch of callbacks
442                  */
443 
444                 /* determine batch number */
445                 rdp->batch = rcp->cur + 1;
446                 /* see the comment and corresponding wmb() in
447                  * the rcu_start_batch()
448                  */
449                 smp_rmb();
450 
451                 if (!rcp->next_pending) {
452                         /* and start it/schedule start if it's a new batch */
453                         spin_lock(&rcp->lock);
454                         rcp->next_pending = 1;
455                         rcu_start_batch(rcp);
456                         spin_unlock(&rcp->lock);
457                 }
458         }
459 
460         rcu_check_quiescent_state(rcp, rdp);
461         if (rdp->donelist)
462                 rcu_do_batch(rdp);
463 }


The question is how does the update of rdp->batch at line 445 guarantee to 
observe the most updated value of rcp->cur??
The issue is that there is no memory barrier/locking before line 445.
So I think the following sequence of events in chronological order is possible:

Assume initially rcp->cur = 100, this current batch value is visible to every 
CPU, and batch 100 has completed 
(i.e. rcp->completed = 100, rcp->next_pending = 0,  rcp->cpumask = 0, and for 
each CPU, rdp->quiescbatch = 100, rdp->qs_pending = 0, rdp->passed_quiesc = 1)
 
 
CPU 0: 
- 
 call_rcu(): a callback inserted into rdp->nxtlist; 

 
 timer interrupt 
call rcu_pending(), return true  ( ! rdp->curlist && 
rdp->nxtlist)
call rcu_check_callbacks() 
 schedule per CPU rcu_tasklet
 
__rcu_process_callbacks()
   move callbacks from nxtlist to curlist;
   rdp->batch = 101
   lock rcp->lock
   rcp->next_pending = 1
   call rcu_start_batch()
 find the current batch has completed and next 
batch pending;
 rcp->next_pending = 0
 update rcp->cur to 101 and initialize 
rcp->cpumask;  <- time t1
  ~~
   unlock rcp->lock
  
CPU 1:
-  
 timer interrupt 
  call rcu_pending(), return true (asume observing rcp->cur = 
101 != rdp->quiescbatch) 
  
  call rcu_check_callbacks() 
   schedule per CPU rcu_tasklet
 
 __rcu_process_callbacks()
   call rcu_check_quisecent_state()
 find rdp->quiescbatch != rcp->cur
 set rdp->qs_pending = 1
 set rdp->passed_quiesc = 0
 set rdp->quiescbatch = 101 (rcp->cur)
 
 Another timer interrupt
 call rcu_pending(), return true (rdp->qs_pending == 1)  
 call rcu_check_callbacks() 
   (assume in user mode) <-- time 
t2  pass quiescent state
   ~~ 
~~
   rdp->passed_quiesc = 1
   schedule per CPU rcu_tasklet

 __rcu_process_callbacks()
   call rcu_check_quisecent_state()
 find rdp->qs_pending == 1 && rdp-> passed_quiesc 
== 1
 set rdp->qs_pending = 0
 lock rcp->lock
 call cpu_quite()
clear bit in the rcp->cpumask set up by CPU 
0 at time t1
~~~
 unlock rcp->lock

CPU 2:
-  
 call_rcu(): a callback inserted into rdp->nxtlist;

[PATCH] CPUFREQ: powernow-k8 print pstate instead of fid/did for family 10h

2007-11-20 Thread Yinghai Lu
[PATCH] CPUFREQ: powernow-k8 print pstate instead of fid/did for family 10h

powernow-k8: Found 1 Quad-Core AMD Opteron(tm) Processor 8354 processors (4 cpu 
cores) (version 2.20.00)
powernow-k8:0 : fid 0x0 did 0x0 (2200 MHz)
powernow-k8:1 : fid 0x0 did 0x0 (2000 MHz)
powernow-k8:2 : fid 0x0 did 0x0 (1700 MHz)
powernow-k8:3 : fid 0x0 did 0x0 (1400 MHz)
powernow-k8:4 : fid 0x0 did 0x0 (1100 MHz)

Actually, for CPU_HW_PSTATE the table index holds the pstate number, not a
packed fid/did pair, so print it out as a pstate.

powernow-k8: Found 1 Quad-Core AMD Opteron(tm) Processor 8354 processors (4 cpu 
cores) (version 2.20.00)
powernow-k8:0 : pstate 0 (2200 MHz)
powernow-k8:1 : pstate 1 (2000 MHz)
powernow-k8:2 : pstate 2 (1700 MHz)
powernow-k8:3 : pstate 3 (1400 MHz)
powernow-k8:4 : pstate 4 (1100 MHz)

Signed-off-by: Yinghai Lu <[EMAIL PROTECTED]>

Index: linux-2.6/arch/x86/kernel/cpu/cpufreq/powernow-k8.c
===
--- linux-2.6.orig/arch/x86/kernel/cpu/cpufreq/powernow-k8.c
+++ linux-2.6/arch/x86/kernel/cpu/cpufreq/powernow-k8.c
@@ -578,10 +578,9 @@ static void print_basics(struct powernow
for (j = 0; j < data->numps; j++) {
 		if (data->powernow_table[j].frequency != CPUFREQ_ENTRY_INVALID) {
 			if (cpu_family == CPU_HW_PSTATE) {
-				printk(KERN_INFO PFX "   %d : fid 0x%x did 0x%x (%d MHz)\n",
+				printk(KERN_INFO PFX "   %d : pstate %d (%d MHz)\n",
 					j,
-					(data->powernow_table[j].index & 0xff00) >> 8,
-					(data->powernow_table[j].index & 0xff0000) >> 16,
+					data->powernow_table[j].index,
 					data->powernow_table[j].frequency/1000);
 			} else {
 				printk(KERN_INFO PFX "   %d : fid 0x%x (%d MHz), vid 0x%x\n",
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] Fix optimized search

2007-11-20 Thread Gregory Haskins
I spied a few more issues from http://lkml.org/lkml/2007/11/20/590.

Patch is below..

Regards,
-Greg

-

Include cpu 0 in the search, and eliminate the redundant cpu_set since
the bit should already be set in the mask.

Signed-off-by: Gregory Haskins <[EMAIL PROTECTED]>
---

 kernel/sched_rt.c |7 +++
 1 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 28feeff..fbf4fb1 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -297,7 +297,7 @@ static DEFINE_PER_CPU(cpumask_t, local_cpu_mask);
 static int find_lowest_cpus(struct task_struct *task, cpumask_t *lowest_mask)
 {
int   lowest_prio = -1;
-   int   lowest_cpu  = 0;
+   int   lowest_cpu  = -1;
int   count   = 0;
int   cpu;
 
@@ -319,7 +319,7 @@ static int find_lowest_cpus(struct task_struct *task, cpumask_t *lowest_mask)
 * and the count==1 will cause the algorithm
 * to use the first bit found.
 */
-   if (lowest_cpu) {
+   if (lowest_cpu != -1) {
cpus_clear(*lowest_mask);
cpu_set(rq->cpu, *lowest_mask);
}
@@ -335,7 +335,6 @@ static int find_lowest_cpus(struct task_struct *task, cpumask_t *lowest_mask)
lowest_cpu = cpu;
count = 0;
}
-   cpu_set(rq->cpu, *lowest_mask);
count++;
} else
cpu_clear(cpu, *lowest_mask);
@@ -346,7 +345,7 @@ static int find_lowest_cpus(struct task_struct *task, cpumask_t *lowest_mask)
 * runqueues that were of higher prio than
 * the lowest_prio.
 */
-   if (lowest_cpu) {
+   if (lowest_cpu != -1) {
/*
 * Perhaps we could add another cpumask op to
 * zero out bits. Like cpu_zero_bits(cpumask, nrbits);



__rcu_process_callbacks() in Linux 2.6

2007-11-20 Thread James Huang
In the latest Linux 2.6 RCU implementation, __rcu_process_callbacks() is coded 
as follows: 


422 static void __rcu_process_callbacks(struct rcu_ctrlblk *rcp,
423                                     struct rcu_data *rdp)
424 {
425     if (rdp->curlist && !rcu_batch_before(rcp->completed, rdp->batch)) {
426         *rdp->donetail = rdp->curlist;
427         rdp->donetail = rdp->curtail;
428         rdp->curlist = NULL;
429         rdp->curtail = &rdp->curlist;
430     }
431
432     if (rdp->nxtlist && !rdp->curlist) {
433         local_irq_disable();
434         rdp->curlist = rdp->nxtlist;
435         rdp->curtail = rdp->nxttail;
436         rdp->nxtlist = NULL;
437         rdp->nxttail = &rdp->nxtlist;
438         local_irq_enable();
439
440         /*
441          * start the next batch of callbacks
442          */
443
444         /* determine batch number */
445         rdp->batch = rcp->cur + 1;
446         /* see the comment and corresponding wmb() in
447          * the rcu_start_batch()
448          */
449         smp_rmb();
450
451         if (!rcp->next_pending) {
452             /* and start it/schedule start if it's a new batch */
453             spin_lock(&rcp->lock);
454             rcp->next_pending = 1;
455             rcu_start_batch(rcp);
456             spin_unlock(&rcp->lock);
457         }
458     }
459
460     rcu_check_quiescent_state(rcp, rdp);
461     if (rdp->donelist)
462         rcu_do_batch(rdp);
463 }


The question is: how does the update of rdp->batch at line 445 guarantee
that it observes the most up-to-date value of rcp->cur?
The issue is that there is no memory barrier/locking before line 445.
So I think the following sequence of events in chronological order is possible:

Assume initially rcp->cur = 100; this current batch value is visible to
every CPU, and batch 100 has completed.
 
 
CPU 0:
------
 call_rcu(): a callback inserted into rdp->nxtlist;

 timer interrupt
     call rcu_pending(), return true  (!rdp->curlist && rdp->nxtlist)
     call rcu_check_callbacks()
         schedule per CPU rcu_tasklet

 __rcu_process_callbacks()
     move callbacks from nxtlist to curlist;
     rdp->batch = 101
     lock rcp->lock
     rcp->next_pending = 1
     call rcu_start_batch()
         find the current batch has completed and the next batch pending;
         rcp->next_pending = 0
         update rcp->cur to 101 and initialize rcp->cpumask;   <-- time t1
     unlock rcp->lock

CPU 1:
------
 timer interrupt
     call rcu_pending(), return true (assume observing rcp->cur = 101 != rdp->quiescbatch)
     call rcu_check_callbacks()
         schedule per CPU rcu_tasklet

 __rcu_process_callbacks()
     call rcu_check_quiescent_state()
         find rdp->quiescbatch != rcp->cur
         set rdp->qs_pending = 1
         set rdp->passed_quiesc = 0
         set rdp->quiescbatch = 101 (rcp->cur)

 Another timer interrupt
     call rcu_pending(), return true (rdp->qs_pending == 1)
     call rcu_check_callbacks()
         (assume in user mode: pass quiescent state)   <-- time t2
         rdp->passed_quiesc = 1
         schedule per CPU rcu_tasklet

 __rcu_process_callbacks()
     call rcu_check_quiescent_state()
         find rdp->qs_pending == 1 && rdp->passed_quiesc == 1
         set rdp->qs_pending = 0
         lock rcp->lock
         call cpu_quiet()
             clear bit in the rcp->cpumask set up by CPU 0 at time t1
         unlock rcp->lock

CPU 2:
------
 call_rcu(): a callback inserted into rdp->nxtlist;   <-- time t3

 timer interrupt
     call rcu_pending(), return true (!rdp->curlist && rdp->nxtlist)

Re: gitweb: kernel versions in the history (feature request, probably)

2007-11-20 Thread J. Bruce Fields
On Wed, Nov 21, 2007 at 12:30:23AM +0100, Jarek Poplawski wrote:
> I don't know git, but it seems, at least if done for web only, this
> shouldn't be so 'heavy'. It could be a 'simple' translation of commit
> date by querying a small database with kernel versions & dates.

If I create a commit in my linux working repo today, but Linus doesn't
merge it into his repository until after he releases 2.6.24, then my
commit will be created with an earlier date than 2.6.24, even though it
isn't included until 2.6.25.

So you have to actually examine the history graph to figure out this
sort of thing.

--b.


Inotify fails to send IN_ATTRIB events

2007-11-20 Thread Morten Welinder
I am seeing missing inotify IN_ATTRIB events in the following situation:

1. "touch foo"

2. Make inotify watch "foo"

3. "ln foo bar"
   --> Link count changed so I should have gotten an IN_ATTRIB.

4. "rm foo"
   --> Link count changed so I should have gotten an IN_ATTRIB.  (Or
IN_DELETE_SELF;
   I don't care which.)

5. "ln bar foo && rm bar"
   --> Still no events.

6. "mv foo bar"
   --> I get IN_MOVED_SELF.  Good!

7. "mv bar foo"
   --> I get IN_MOVED_SELF.  Good!


3+4 is pretty much the same as 6, so I really ought to be told that my
file has changed
name.  I don't really care much about getting notified about 3, but
for completeness
it ought to be handled.

As far as I can see, the only way to be told about 4 is to put a watch
on the directory in
which foo resides.  That is inelegant and has an inherent race condition.

This is with "Linux version 2.6.22.12-0.1-default" (SuSE 10.3)

Looking at current source, fs/namei.c, I notice that vfs_rename has a
fsnotify_move call (which notifies the directory as well as the files),
whereas sys_link only has a fsnotify_create call (which notifies the
directory only).

Morten


Re: CONFIG_IRQBALANCE for 64-bit x86 ?

2007-11-20 Thread H. Peter Anvin

Jeff Garzik wrote:


Take a look at usr/Makefile for how initramfs is automatically included 
in the image, right now.


The intention at the time was to quickly follow up this stub (generated 
by gen_init_cpio) with a full inclusion of klibc + some basics like 
nfsroot.  It should be a very straightforward step to go from what we 
have today to including klibc initramfs into the kernel image.




http://git.kernel.org/?p=linux/kernel/git/hpa/linux-2.6-klibc.git;a=summary

-hpa


[PATCH] remove unused tsk_thread from asm-offsets_64.c

2007-11-20 Thread Steven Rostedt
I was looking for where tsk_thread is used in the x86_64 code, and
couldn't find it anywhere. I took it out and compiled the kernel, and it
compiled fine.

So this patch simply removes the "thread" from asm-offsets.c since I
can't find an owner for it.

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>

diff --git a/arch/x86/kernel/asm-offsets_64.c b/arch/x86/kernel/asm-offsets_64.c
index d1b6ed9..40f4175 100644
--- a/arch/x86/kernel/asm-offsets_64.c
+++ b/arch/x86/kernel/asm-offsets_64.c
@@ -38,7 +38,6 @@ int main(void)
 #define ENTRY(entry) DEFINE(tsk_ ## entry, offsetof(struct task_struct, entry))
ENTRY(state);
ENTRY(flags); 
-   ENTRY(thread); 
ENTRY(pid);
BLANK();
 #undef ENTRY




Re: [PATCH 4/5] Fix the configuration dependencies

2007-11-20 Thread Ken'ichi Ohmichi

Hi Simon,

Thank you for reviewing and your "Acked-by" signs.

Simon Horman wrote:
> On Fri, Nov 16, 2007 at 11:33:20AM +0900, Ken'ichi Ohmichi wrote:
>> This patch fixes the configuration dependencies in the vmcoreinfo data.
>>
>> i386's "node_data" is defined in arch/x86/mm/discontig_32.c,
>> and x86_64's one is defined in arch/x86/mm/numa_64.c.
>> They depend on CONFIG_NUMA:
>>   arch/x86/mm/Makefile_32:7
>> obj-$(CONFIG_NUMA) += discontig_32.o
>>   arch/x86/mm/Makefile_64:7
>> obj-$(CONFIG_NUMA) += numa_64.o
>>
>> ia64's "pgdat_list" is defined in arch/ia64/mm/discontig.c,
>> and it depends on CONFIG_DISCONTIGMEM and CONFIG_SPARSEMEM:
>>   arch/ia64/mm/Makefile:9-10
>> obj-$(CONFIG_DISCONTIGMEM) += discontig.o
>> obj-$(CONFIG_SPARSEMEM)+= discontig.o
>>
>> ia64's "node_memblk" is defined in arch/ia64/mm/numa.c,
>> and it depends on CONFIG_NUMA:
>>   arch/ia64/mm/Makefile:8
>> obj-$(CONFIG_NUMA) += numa.o
>>
>> Signed-off-by: Ken'ichi Ohmichi <[EMAIL PROTECTED]>
> 
> This appears correct to me, checking through the symbols and the
> location of their definitions, though I have not had a chance to run
> many build checks.
> 
> I also note that CONFIG_ARCH_DISCONTIGMEM_ENABLE does not even
> appear to exist on i386, so it looks that without this change
> the code in question would never be enabled.

If you enable "Numa Memory Allocation and Scheduler Support" in i386's
menuconfig, CONFIG_ARCH_DISCONTIGMEM_ENABLE is enabled.


Thanks
Ken'ichi Ohmichi 



Re: CONFIG_IRQBALANCE for 64-bit x86 ?

2007-11-20 Thread Jeff Garzik

Ingo Molnar wrote:
single-bzImage initrd 
was and is possible,


Correct (though s/initrd/initramfs/).

Take a look at usr/Makefile for how initramfs is automatically included 
in the image, right now.


The intention at the time was to quickly follow up this stub (generated 
by gen_init_cpio) with a full inclusion of klibc + some basics like 
nfsroot.  It should be a very straightforward step to go from what we 
have today to including klibc initramfs into the kernel image.
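As a side note, the usr/Makefile stub mentioned above feeds gen_init_cpio a simple line-oriented description file. A minimal example of that input format (the klibc init path below is hypothetical):

```
# gen_init_cpio input: entry type, name, [location,] mode, uid, gid, [dev info]
dir  /dev                          0755 0 0
nod  /dev/console                  0600 0 0 c 5 1
file /init  usr/klibc-init         0755 0 0
```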



 so we could in fact move chunks of system-related 
userland (such as irqbalanced) into the kernel proper?


s/kernel/kernel tree/ I presume you mean...

With regards to irqbalanced, if you are thinking about including it in 
initramfs, you would need to work out the details of how 
userland/distros modify the default policy configurations.


Jeff





Re: mm snapshot broken-out-2007-11-20-01-45.tar.gz uploaded

2007-11-20 Thread Andrew Morton
On Wed, 21 Nov 2007 01:32:48 + David Howells <[EMAIL PROTECTED]> wrote:

> Andrew Morton <[EMAIL PROTECTED]> wrote:
> 
> > > The patch
> > > aout-suppress-aout-library-support-if-config_arch_supports_aout.patch,
> > > creates a struct exec in linux/a.out.h and asm/a.out.h already has it, for
> > > the struct related warnings.
> 
> Nothing should be including {asm,linux}/a.out.h unless it absolutely needs it.
> I removed all the places it did so extraneously, after moving out STACK_TOP.

So... what went wrong with broken-out-2007-11-20-01-45.tar.gz?

> > OK, I've had it with trying to get that patch to vaguely work.  I'll drop
> > it and will then fix up the extensive dependency trail which it drags along
> > behind it.
> > 
> > David, please do not bring it back until it has had a *lot* of testing.
> 
> It compiles for all the archs for which I have a compiler, and the x86_64 and
> i386 kernels all build and boot for the following combinations of AOUT
> configs:
> 
>   x86_64  CONFIG_IA32_AOUT=n
>   CONFIG_IA32_AOUT=y
>   CONFIG_IA32_AOUT=m
>   i386CONFIG_BINFMT_AOUT=n
>   CONFIG_BINFMT_AOUT=y
>   CONFIG_BINFMT_AOUT=m
> 
> It seems I had forgetten to include:
> 
>   config ARCH_SUPPORTS_AOUT
>   def_bool y
> 
> in arch/x86/Kconfig, but it builds without that too for both subarchs.
> 
> The kernel also builds and boots for MN10300 and FRV.
> 
> 
> The problem is that your -mm patchset doesn't match Linus's as a base.  I'm
> still not sure what the right procedure is for that.  I can give you some
> altered patches, but there's no guarantee you'll be able to pass them on to
> Linus without breaking his tree.  What do *you* want?

Often when people base a patch on -mm it is worse than basing it on
mainline - I usually prefer patches against mainline; partly because that's
less work for originators too.

But sometimes it doesn't work out very well.  There's lot of stuff
outstanding again.  Immediate problems are from an x86 exec randomisation
thingy in git-x86 and pie-executable-randomization.patch in -mm, which both
hit on binfmt_elf.c



Re: [PATCH 4/6] Use two zonelist that are filtered by GFP mask

2007-11-20 Thread 小崎資広
Hi

> +static inline enum zone_type gfp_zonelist(gfp_t flags)
> +{
> + if (NUMA_BUILD && unlikely(flags & __GFP_THISNODE))
> + return 1;
> +
> + return 0;
> +}
> +

static inline int gfp_zonelist(gfp_t flags) ?

If not, why not use the ZONE_XXX macros?



kosaki




Re: CONFIG_IRQBALANCE for 64-bit x86 ?

2007-11-20 Thread Walt H


On Tue, 20 Nov 2007 15:17:15 +1100
Nick Piggin <[EMAIL PROTECTED] > wrote:

> On Tuesday 20 November 2007 15:12, Mark Lord wrote:
> > On 32-bit x86, we have CONFIG_IRQBALANCE available,
> > but not on 64-bit x86.  Why not?

because the in-kernel one is actually quite bad.


> > My QuadCore box works very well in 32-bit mode with IRQBALANCE,
> > but responsiveness sucks bigtime when run in 64-bit mode (no
> > IRQBALANCE) during periods of multiple heavy I/O streams (USB flash
> > drives).

please run the userspace irq balancer, see http://www.irqbalance.org
afaik most distros ship that by default anyway.


I've been running the daemon for quite some time, however, have noticed 
something on my newest computer.  It's a core2 duo and the IRQ balance 
daemon always exits after some time.  After looking at the source, I see 
it's because dual core/hyperthreaded boxes (single domain caches) always 
get treated as though the --oneshot option were passed and exit after 
the first pass (I assume same thing happens on quad cores?).


Does this not adversely affect IRQ balancing on those CPUs?  If the IRQ 
load of a mostly idle device changes from when the daemon was run, 
wouldn't the balancer's inability to adjust for it adversely affect 
performance at that later time? I'm used to my old SMP 
box with 2 physical cores, so this is just something I've wondered about 
on the new box.  Thanks,


-Walt




Re: [git patches] libata fixes

2007-11-20 Thread Jeff Garzik

Tejun Heo wrote:

These are upstream patches I collected while Jeff is away.  Thanks.

* workaround for ATAPI tape drives
* detection/suspend workarounds for several laptops
* ICH8/9 port_enable fix

ata_piix controller ID reorganization is included to ease the fixes.

Please pull from 'upstream-linus' branch of
master.kernel.org:/pub/scm/linux/kernel/git/tj/libata-dev.git upstream-linus

to receive the following updates:

 drivers/ata/ata_piix.c|   87 
 drivers/ata/libata-core.c |  100 +++---
 drivers/ata/libata-eh.c   |   95 ---
 drivers/ata/libata-scsi.c |3 -
 drivers/ata/pata_sis.c|1 
 include/linux/libata.h|5 --

 6 files changed, 81 insertions(+), 210 deletions(-)

Adrian Bunk (1):
  libata: remove unused functions

Albert Lee (2):
  libata: workaround DRQ=1 ERR=1 for ATAPI tape drives
  libata: use ATA_HORKAGE_STUCK_ERR for ATAPI tape drives

Gabriel C (1):
  pata_sis.c: Add Packard Bell EasyNote K5305 to laptops

Mark Lord (1):
  libata-scsi: be tolerant of 12-byte ATAPI commands in 16-byte CDBs

Tejun Heo (3):
  ata_piix: add SATELLITE U205 to broken suspend list
  ata_piix: reorganize controller IDs
  ata_piix: port enable for the first SATA controller of ICH8 is 0xf not 0x3

Thomas Rohwer (1):
  ata_piix: only enable the first port on apple macbook pro


Just to make sure, I pulled this into #upstream-fixes.  If Linus already 
picked it up, great.  Otherwise I'll make sure it goes upstream.


Jeff





Re: [patch] 0/4 Support for Toshiba TMIO multifunction devices

2007-11-20 Thread eric miao
I roughly went through the patch; it looks good. Here come the reminders, though :-)

1. is it possible to use some name other than "soc_core", maybe
"tmio_core" so that
other multifunction chips sharing a core base will live easier.

2. those C++ style comments "//" are not so pleasant...

- eric

On Nov 21, 2007 6:20 AM, ian <[EMAIL PROTECTED]> wrote:
> On Wed, 2007-11-21 at 01:04 +0300, Dmitry Baryshkov wrote:
> > Just to note, that there is an alternative implementation for at least
> > the tc6393 chip devices. Most current version of those patches can be
> > found in the OpenEmbedded monotone.
>
> Yes, this is true. The core code in both cases originates from my and
> Dirks work.
>
> I'd just like to get something pushed up to mainline so that work can
> begin on expanding the number of chips supported.
>
> This would include the ASICs found in many iPAQ PDAs (their MMC module
> seems to be an IRQ-less TMIO chip).
>
> I haven't submitted the subdevice drivers yet, only the base/core
> drivers.
>
> The OE versions probably have better USB and video subdevice support
> (which I shall evaluate as I go on), whereas the NAND flash and MMC
> drivers in hh.org are probably more up to date (I run root on SD using
> the hh.org tmio_mmc driver on all three of the MFD chips in this patch
> series on a daily basis).
>
> merging the best of these patch series is my goal and it shouldn't be too
> hard.
>
>
>
>
>



-- 
Cheers
- eric


Re: [BUG] 2.6.24-rc2-mm1 Warning at arch/x86/kernel/smp_32.c:561

2007-11-20 Thread Dave Young
On Nov 20, 2007 5:59 PM, Dave Young <[EMAIL PROTECTED]> wrote:
>
> On Nov 20, 2007 5:56 PM, Andrew Morton <[EMAIL PROTECTED]> wrote:
> >
> > On Tue, 20 Nov 2007 17:47:28 +0800 Dave Young <[EMAIL PROTECTED]> wrote:
> >
> > > Hi,
> > > I encountered kernel warnings. I had just executed xawtv without any
> > > video device being found.
> > >
> > > like this:
> > >
> > > WARNING: at arch/x86/kernel/smp_32.c:561 native_smp_call_function_mask()
> > >  [] native_smp_call_function_mask+0x149/0x150
> > >  [] alloc_debug_processing+0xa9/0x130
> > >  [] smp_callback+0x0/0x10
> > >  [] smp_call_function+0x1c/0x20
> > >  [] cpuidle_latency_notify+0x18/0x20
> > >  [] notifier_call_chain+0x3e/0x70
> > >  [] __blocking_notifier_call_chain+0x44/0x70
> > >  [] blocking_notifier_call_chain+0x17/0x20
> > >  [] pm_qos_add_requirement+0x8d/0xd0
> > >  [] snd_pcm_hw_params+0x20c/0x2a0 [snd_pcm]
> > >  [] snd_pcm_hw_params_user+0x4e/0x90 [snd_pcm]
> > >  [] snd_pcm_capture_ioctl1+0x3d/0x230 [snd_pcm]
> > >  [] snd_pcm_hw_param_near+0x198/0x230 [snd_pcm_oss]
> > >  [] snd_pcm_kernel_ioctl+0x7e/0x90 [snd_pcm]
> > >  [] snd_pcm_oss_change_params+0x2fc/0x750 [snd_pcm_oss]
> > >  [] snd_pcm_oss_make_ready+0x47/0x60 [snd_pcm_oss]
> > >  [] snd_pcm_oss_sync+0x10e/0x290 [snd_pcm_oss]
> > >  [] snd_pcm_oss_release+0x9a/0xb0 [snd_pcm_oss]
> > >  [] __fput+0x16e/0x200
> > >  [] filp_close+0x3c/0x80
> > >  [] sys_close+0x69/0xd0
> > >  [] syscall_call+0x7/0xb
> > >  [] xfrm_notify_sa+0x110/0x290
> > >  ===
> > >
> >
> > That was hopefully fixed.  You might care to test
> > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/mm/broken-out-2007-11-20-01-45.tar.gz
> > to confirm that, if feeling sufficiently brave..
> >
>

Hi,
I can confirm that I can't reproduce this after applying the
broken-out-2007-11-20-01-45 patch set.

Regards
dave


Re: [PATCH v4 19/20] Optimize out cpu_clears

2007-11-20 Thread Steven Rostedt
On Tue, Nov 20, 2007 at 08:01:13PM -0500, Steven Rostedt wrote:
>  
>  static DEFINE_PER_CPU(cpumask_t, local_cpu_mask);
> -static DEFINE_PER_CPU(cpumask_t, valid_cpu_mask);
>  
>  static int find_lowest_cpus(struct task_struct *task, cpumask_t *lowest_mask)
>  {
> - int   cpu;
> - cpumask_t *valid_mask = &__get_cpu_var(valid_cpu_mask);
>   int   lowest_prio = -1;
> + int   lowest_cpu  = 0;
>   int   count   = 0;
> + int   cpu;
>  
> - cpus_clear(*lowest_mask);
> - cpus_and(*valid_mask, cpu_online_map, task->cpus_allowed);
> + cpus_and(*lowest_mask, cpu_online_map, task->cpus_allowed);
>  
>   /*
>* Scan each rq for the lowest prio.
>*/
> - for_each_cpu_mask(cpu, *valid_mask) {
> + for_each_cpu_mask(cpu, *lowest_mask) {
>   struct rq *rq = cpu_rq(cpu);
>  
>   /* We look for lowest RT prio or non-rt CPU */
> @@ -325,13 +323,30 @@ static int find_lowest_cpus(struct task_
>   if (rq->rt.highest_prio > lowest_prio) {
>   /* new low - clear old data */
>   lowest_prio = rq->rt.highest_prio;
> - if (count) {
> - cpus_clear(*lowest_mask);
> - count = 0;
> - }

Gregory Haskins pointed out to me that this logic is slightly wrong. I
originally wrote this patch before adding his "count" patch optimization.
I did not take into account that on finding a non-RT queue, we may leave
some extra bits set because cpus_clear() is not performed if count is
zero. And count gets set to zero here, which means that we don't clean
up.

The fix is to check for lowest_cpu > 0 instead of count on finding a
non-RT runqueue. This lets us know that we need to clear the mask.
Otherwise, if lowest_cpu == 0, then we can return the mask untouched.
The proper bit would already be set, and the return of 1 will have
the rest of the algorithm use the first bit.

Below is the updated patch. The full series is at:

  http://rostedt.homelinux.com/rt/rt-balance-patches-v5.tar.bz2


> + lowest_cpu = cpu;
> + count = 0;
>   }
>   cpu_set(rq->cpu, *lowest_mask);
>   count++;



From: Steven Rostedt <[EMAIL PROTECTED]>

This patch removes several cpumask operations by keeping track
of the first of the CPUs that is of the lowest priority. When
the search for the lowest-priority runqueue is completed, all
the bits up to the first CPU with the lowest-priority runqueue
are cleared.

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>

---
 kernel/sched_rt.c |   48 
 1 file changed, 36 insertions(+), 12 deletions(-)

Index: linux-compile.git/kernel/sched_rt.c
===
--- linux-compile.git.orig/kernel/sched_rt.c	2007-11-20 19:53:15.000000000 -0500
+++ linux-compile.git/kernel/sched_rt.c	2007-11-20 20:35:04.000000000 -0500
@@ -293,29 +293,36 @@ static struct task_struct *pick_next_hig
 }
 
 static DEFINE_PER_CPU(cpumask_t, local_cpu_mask);
-static DEFINE_PER_CPU(cpumask_t, valid_cpu_mask);
 
 static int find_lowest_cpus(struct task_struct *task, cpumask_t *lowest_mask)
 {
-   int   cpu;
-   cpumask_t *valid_mask = &__get_cpu_var(valid_cpu_mask);
int   lowest_prio = -1;
+   int   lowest_cpu  = 0;
int   count   = 0;
+   int   cpu;
 
-   cpus_clear(*lowest_mask);
-   cpus_and(*valid_mask, cpu_online_map, task->cpus_allowed);
+   cpus_and(*lowest_mask, cpu_online_map, task->cpus_allowed);
 
/*
 * Scan each rq for the lowest prio.
 */
-   for_each_cpu_mask(cpu, *valid_mask) {
+   for_each_cpu_mask(cpu, *lowest_mask) {
struct rq *rq = cpu_rq(cpu);
 
/* We look for lowest RT prio or non-rt CPU */
if (rq->rt.highest_prio >= MAX_RT_PRIO) {
-   if (count)
+   /*
+* if we already found a low RT queue
+* and now we found this non-rt queue
+* clear the mask and set our bit.
+* Otherwise just return the queue as is
+* and the count==1 will cause the algorithm
+* to use the first bit found.
+*/
+   if (lowest_cpu) {
cpus_clear(*lowest_mask);
-   cpu_set(rq->cpu, *lowest_mask);
+   cpu_set(rq->cpu, *lowest_mask);
+   }
return 1;
}
 
@@ -325,13 +332,30 @@ static int find_lowest_cpus(struct task_
if 

Re: [PATCH] ext4: dir inode reservation V3

2007-11-20 Thread Andreas Dilger
On Nov 20, 2007  12:22 -0800, Mingming Cao wrote:
> On Tue, 2007-11-20 at 12:14 +0800, Coly Li wrote:
> > Mingming Cao wrote:
> > > On Tue, 2007-11-13 at 22:12 +0800, Coly Li wrote:
> > The hole is (s_dir_ireserve_nr - 1), not N * s_dir_ireserve_nr. Because
> > the directory inode will also use an inode slot from the reserved area,
> > the remaining number of slots for files is (s_dir_ireserve_nr - 1).
> > Except for the reserved inode count, your understanding exactly matches
> > my idea.
> 
> The performance gain on a large number of directories looks interesting,
> but I am afraid this makes the uninitialized block group feature less
> useful (to speed up fsck) than before:( The reserved inodes will push
> the unused-inode watermark higher than before, and spread the
> directories over to other block groups earlier than before. Maybe it is
> worth it...not sure.

My original thoughts on the design for this were slightly different:
- that the per-directory reserved window would scale with the size of
  the directory, so that even (or especially) with htree directories the 
  inodes would be kept in hash-ordered sets to speed up stat/unlink
- it would be possible/desirable to re-use the existing block bitmap
  reservation code to handle inode bitmap reservation for directories
  while those directories are in-core.  We already have the mechanisms
  for this, "all" that would need to change is have the reservation code
  point at the inode bitmaps but I don't know how easy that is
- after an unmount/remount it would be desirable to re-use the same blocks
  for getting good hash->inode mappings, wherein lies the problem of
  compatibility

One possible solutions for the restart problem is searching the directory
leaf block in which an inode is being added for the inode numbers and try
to use those as a goal for the inode allocation...  Has a minor problem
with ordering, because usually the inode is allocated before the dirent
is created, but isn't impossible to fix (e.g. find dirent slot first,
keep a pointer to it, check for inode goals, and then fill in dirent
inum after allocating inode)

> > >> 5, Performance number
> > >> On a Core-Duo, 2MB DDM memory, 7200 RPM SATA PC, I built a 50GB ext4
> > >> partition, and tried to create 5 directories, and create 15 (1KB)
> > >> files in each directory alternatively. After a remount, I tried to
> > >> remove all the directories and files recursively by a 'rm -rf'.
> > >> Below is the benchmark result,
> > >>                  normal ext4             ext4 with dir inode reservation
> > >>  mount options:  -o data=writeback       -o data=writeback,dir_ireserve=low
> > >>  Create dirs:    real 0m49.101s          real 2m59.703s
> > >>  Create files:   real 24m17.962s         real 21m8.161s
> > >>  Unlink all:     real 24m43.788s         real 17m29.862s

One likely reason that the create dirs step is slower is that this is
doing a lot more IO than in the past.  Only a single inode in each
inode table block is being used, so that means that a lot of empty
bytes are being read and written (maybe 16x as much data in this case).

Also, in what order are you creating files in the directories?  If you
are creating them in directory order like:

for (f = 0; f < 15; f++)
        for (i = 0; i < 5; i++)
                touch dir$i/f$f

then it is completely unsurprising that directory reservation is faster
at file create/unlink because those inodes are now contiguous at the
expense of having gaps in the inode sequence.  Creating 15 files per
directory is of course the optimum test case also.

How does this patch behave with benchmarks like dbench, mongo, postmark?

> It would be nice to remember the last allocated bit for each block
> group, so we don't have to start from bit 0 (in the worse case scan the
> entire block group) for every single directory. Probably this can be
> saved in the in-core block group descriptor.  Right now the in core
> block group descriptor structure is the same on-disk block-group
> structure though, it might worth to separate them and provide load/store
> helper functions.

Note that mballoc already creates an in-memory struct for each group.
I think the initialization of this should be moved outside of mballoc
so that it can be used for other purposes as you propose.

Eric had a benchmark where creating many files/subdirs would cause
a huge slowdown because of bitmap searching, and having a per-group
pointer with the first free inode (or last allocated inode might be
less work to track) would speed this up a lot. 

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



Re: [rfc 08/45] cpu alloc: x86 support

2007-11-20 Thread Christoph Lameter
On Wed, 21 Nov 2007, Andi Kleen wrote:

> On Wednesday 21 November 2007 02:16:11 Christoph Lameter wrote:
> > But one can subtract too... 
> 
> The linker cannot subtract (unless you add a new relocation types) 

The compiler knows and emits assembly to compensate.

> All you need is a 2MB area (16MB is too large if you really
> want 16k CPUs someday) somewhere in the -2GB or probably better
> in +2GB. Then the linker puts stuff in there and you use
> the offsets for referencing relative to %gs.

2MB * 16k = 32GB. Even with 4k CPUs we would have 2MB * 4k = 8GB; neither
fits in the 2GB area.

The offset relative to %gs cannot be used if you have a loop and are 
calculating the addresses for all instances. That is what we are talking 
about. The CPU_xxx operations that are using the %gs register are fine and 
are not affected by the changes we are discussing.

> Then for all CPUs (including CPU #0) you put the real mapping
> somewhere else, copy the reference data there (which also doesn't need
> to be on the offset the linker assigned, just on a constant offset
> from it somewhere in the normal kernel data) and off you go.

Real mapping? We have constant offsets after this patchset. I do not get 
what you are planning here.

> Then the reference data would be initdata and eventually freed.
> That is similar to how the current per cpu data works.

Yes that is also how the current patchset works. I just do not understand 
what you want changed.



Re: 2.6.24-rc3: find complains about /proc/net

2007-11-20 Thread Eric W. Biederman
Robert Hancock <[EMAIL PROTECTED]> writes:

> Eric W. Biederman wrote:
>> Could you elaborate a bit on how the semantics of returning the
>> wrong information are more useful?
>>
>> In particular if a thread does the logical equivalent of:
>> grep Pid: /proc/self/status.  It always gets the tgid despite
>> having a different process id.
>
> The POSIX-defined userspace concept of a PID requires that all threads
> appear to have the same PID. This is something that Linux didn't comply
> with under the old LinuxThreads implementation and was finally fixed with
> NPTL. This isn't a POSIX-defined interface, but I assume it's trying to
> be consistent with getpid(), etc.


Linux exports two fields in /proc/self/status:
Tgid:   32698
Pid:32698

The tgid maps to the POSIX concept.  The pid in this context is the
thread id.

So it seems broken to me to return the same thread id for different threads.

Eric


Re: mm snapshot broken-out-2007-11-20-01-45.tar.gz uploaded

2007-11-20 Thread David Howells
Andrew Morton <[EMAIL PROTECTED]> wrote:

> > The patch
> > aout-suppress-aout-library-support-if-config_arch_supports_aout.patch,
> > creates a struct exec in linux/a.out.h and asm/a.out.h already has it, for
> > the struct related warnings.

Nothing should be including {asm,linux}/a.out.h unless it absolutely needs it.
I removed all the places it did so extraneously, after moving out STACK_TOP.

> OK, I've had it with trying to get that patch to vaguely work.  I'll drop
> it and will then fix up the extensive dependency trail which it drags along
> behind it.
> 
> David, please do not bring it back until it has had a *lot* of testing.

It compiles for all the archs for which I have a compiler, and the x86_64 and
i386 kernels all build and boot for the following combinations of AOUT
configs:

x86_64  CONFIG_IA32_AOUT=n
CONFIG_IA32_AOUT=y
CONFIG_IA32_AOUT=m
i386CONFIG_BINFMT_AOUT=n
CONFIG_BINFMT_AOUT=y
CONFIG_BINFMT_AOUT=m

It seems I had forgotten to include:

config ARCH_SUPPORTS_AOUT
def_bool y

in arch/x86/Kconfig, but it builds without that too for both subarchs.

The kernel also builds and boots for MN10300 and FRV.


The problem is that your -mm patchset doesn't match Linus's as a base.  I'm
still not sure what the right procedure is for that.  I can give you some
altered patches, but there's no guarantee you'll be able to pass them on to
Linus without breaking his tree.  What do *you* want?

David


Re: [PATCH][v2] net/ipv4/arp.c: Fix arp reply when sender ip 0

2007-11-20 Thread David Miller
From: "Jonas Danielsson" <[EMAIL PROTECTED]>
Date: Tue, 20 Nov 2007 17:28:22 +0100

> Fix arp reply when received arp probe with sender ip 0.
> 
> Send arp reply with target ip address 0.0.0.0 and target hardware address
> set to hardware address of requester. Previously sent reply with target
> ip address and target hardware address set to same as source fields.
> 
> 
> Signed-off-by: Jonas Danielsson <[EMAIL PROTECTED]>
> Acked-by: Alexey Kuznetov <[EMAIL PROTECTED]>

I'm applying this but you gmail folks really have to get your
act together when submitting patches.  The patches from gmail
accounts come out corrupted 9 times out of 10.

This time line breaks were added and all tab characters were
turned into spaces.

Please correct this before any future patch submissions.


Re: [rfc 08/45] cpu alloc: x86 support

2007-11-20 Thread Andi Kleen
On Wednesday 21 November 2007 02:16:11 Christoph Lameter wrote:
> But one can subtract too... 

The linker cannot subtract (unless you add a new relocation type) 

> Hmmm... So the cpu area 0 could be put at 
> the beginning of the 2GB kernel area and then grow downwards from 
> 0xffffffff80000000. The cost in terms of code is one subtract
> instruction for each per_cpu() or CPU_PTR()
> 
> The next thing downward from 0xffffffff80000000 is the vmemmap at 
> 0xffffe20000000000, so ~32TB. If we leave 16TB for the vmemmap
> (a 16TB vmemmap would be able to map 2^(44 - 6 + 12) = 2^50 bytes 
> more than currently supported by the processors)
> 
> then the remaining 16TB could be used to map 1GB per cpu for a 16k config. 
> That is wildly overdoing it. Guess we could just do it with 1M anyways. 
> Just to be safe we could do 128M. 128M x 16k = 2TB?
> 
> Would such a configuration be okay?

I'm not sure I really understand your problem.

All you need is a 2MB area (16MB is too large if you really
want 16k CPUs someday) somewhere in the -2GB or probably better
in +2GB. Then the linker puts stuff in there and you use
the offsets for referencing relative to %gs.

But %gs can be located wherever you want in the end,
at a completely different address than you told the linker.
All you're interested in were offsets anyways.

Then for all CPUs (including CPU #0) you put the real mapping
somewhere else, copy the reference data there (which also doesn't need
to be on the offset the linker assigned, just on a constant offset
from it somewhere in the normal kernel data) and off you go.

Then the reference data would be initdata and eventually freed.
That is similar to how the current per cpu data works.

-Andi


Re: [PATCH] SPARC64: check for possible NULL pointer dereference

2007-11-20 Thread David Miller
From: Cyrill Gorcunov <[EMAIL PROTECTED]>
Date: Tue, 20 Nov 2007 20:28:33 +0300

> From: Cyrill Gorcunov <[EMAIL PROTECTED]>
> 
> This patch adds checking for possible NULL pointer dereference
> if of_find_property() failed.
> 
> Signed-off-by: Cyrill Gorcunov <[EMAIL PROTECTED]>

Applied, thanks.


Re: 2.6.24-rc3: find complains about /proc/net

2007-11-20 Thread Robert Hancock

Eric W. Biederman wrote:

Could you elaborate a bit on how the semantics of returning the
wrong information are more useful?

In particular if a thread does the logical equivalent of:
grep Pid: /proc/self/status.  It always get the tgid despite
having a different process id.


The POSIX-defined userspace concept of a PID requires that all threads 
appear to have the same PID. This is something that Linux didn't comply 
with under the old LinuxThreads implementation and was finally fixed 
with NPTL. This isn't a POSIX-defined interface, but I assume it's 
trying to be consistent with getpid(), etc.



How can that possibly be useful or correct?

From the kernel side I really think the current semantics of /proc/self
in the context of threads is a bug and confusing.  All of the kernel
developers first reaction when this was pointed out was that this
is a regression.

If it is truly useful to user space we can preserve this API design
bug forever.  I just want to make certain we are not being bug
compatible without a good reason.

Currently we have several kernel side bugs with threaded
programs because /proc/self does not do the intuitive thing.  Unless
something has changed recently selinux will cause accesses by a
non-leader thread to fail when accessing files through /proc/self.

So far the more I look at the current /proc/self behavior the
more I am convinced it is broken, and useless.  Please help me see
where it is useful, so we can justify keeping it.


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/



Re: apm emulation driver broken ?

2007-11-20 Thread Rafael J. Wysocki
On Monday, 19 of November 2007, Franck Bui-Huu wrote:
> Rafael J. Wysocki wrote:
> > On Sunday, 18 of November 2007, Franck Bui-Huu wrote:
> >> Rafael J. Wysocki wrote:
> >> See the call to wait_event() made by apm_ioctl(). If any processes
> >> run this, it will prevent the system from suspending...
> > 
> > True, but does it actually happen in practice?
> > 
> 
> when several processes are waiting for a suspend event.
> 
> > 
> > At this point the second branch of the "if (as->suspend_state == 
> > SUSPEND_READ)"
> > can be fixed by replacing wait_event_interruptible() with
> > wait_event_freezable(), 
> 
> yes
> 
> > but the fix for the first branch depends on whether or
> > not the wait_event() is really necessary.
> 
> As I said I don't know. It's probably time to put some people
> on CC, but I don't know who.

OK, never mind.  I think the patch below is the right fix.


---
From: Rafael J. Wysocki <[EMAIL PROTECTED]>

The APM emulation is currently broken as a result of commit
831441862956fffa17b9801db37e6ea1650b0f69
"Freezer: make kernel threads nonfreezable by default"
that removed the PF_NOFREEZE annotations from apm_ioctl() without adding
the appropriate freezer hooks.  Fix it and remove the unnecessary variable flags
from apm_ioctl().

Signed-off-by: Rafael J. Wysocki <[EMAIL PROTECTED]>
---
 drivers/char/apm-emulation.c |   15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

Index: linux-2.6/drivers/char/apm-emulation.c
===================================================================
--- linux-2.6.orig/drivers/char/apm-emulation.c
+++ linux-2.6/drivers/char/apm-emulation.c
@@ -295,7 +295,6 @@ static int
 apm_ioctl(struct inode * inode, struct file *filp, u_int cmd, u_long arg)
 {
struct apm_user *as = filp->private_data;
-   unsigned long flags;
int err = -EINVAL;
 
if (!as->suser || !as->writer)
@@ -331,10 +330,16 @@ apm_ioctl(struct inode * inode, struct f
 * Wait for the suspend/resume to complete.  If there
 * are pending acknowledges, we wait here for them.
 */
-   flags = current->flags;
+   freezer_do_not_count();
 
wait_event(apm_suspend_waitqueue,
   as->suspend_state == SUSPEND_DONE);
+
+   /*
+* Since we are waiting until the suspend is done, the
+* try_to_freeze() in freezer_count() will not trigger
+*/
+   freezer_count();
} else {
as->suspend_state = SUSPEND_WAIT;
mutex_unlock(&state_lock);
@@ -362,14 +367,10 @@ apm_ioctl(struct inode * inode, struct f
 * Wait for the suspend/resume to complete.  If there
 * are pending acknowledges, we wait here for them.
 */
-   flags = current->flags;
-
-   wait_event_interruptible(apm_suspend_waitqueue,
+   wait_event_freezable(apm_suspend_waitqueue,
 as->suspend_state == SUSPEND_DONE);
}
 
-   current->flags = flags;
-
mutex_lock(&state_lock);
err = as->suspend_result;
as->suspend_state = SUSPEND_NONE;


Re: [rfc 08/45] cpu alloc: x86 support

2007-11-20 Thread Christoph Lameter
But one can subtract too... Hmmm... So the cpu area 0 could be put at
the beginning of the 2GB kernel area and then grow downwards from 
0xffffffff80000000. The cost in terms of code is one subtract
instruction for each per_cpu() or CPU_PTR()

The next thing downward from 0xffffffff80000000 is the vmemmap at 
0xffffe20000000000, so ~32TB. If we leave 16TB for the vmemmap
(a 16TB vmemmap would be able to map 2^(44 - 6 + 12) = 2^50 bytes 
more than currently supported by the processors)

then the remaining 16TB could be used to map 1GB per cpu for a 16k config. 
That is wildly overdoing it. Guess we could just do it with 1M anyways. 
Just to be safe we could do 128M. 128M x 16k = 2TB?

Would such a configuration be okay?

 



Re: 2.6.24-rc3: find complains about /proc/net

2007-11-20 Thread Eric W. Biederman
Pavel Emelyanov <[EMAIL PROTECTED]> writes:

> Rafael J. Wysocki wrote:
>> On Monday, 19 of November 2007, Pavel Machek wrote:
>>> Hi!
>>>
>>> I think that this worked before:
>>>
>>> [EMAIL PROTECTED]:/proc# find . -name "timer_info"
>>> find: WARNING: Hard link count is wrong for ./net: this may be a bug
>>> in your filesystem driver.  Automatically turning on find's -noleaf
>>> option.  Earlier results may have failed to include directories that
>>> should have been searched.
>>> [EMAIL PROTECTED]:/proc#
>> 
>> I'm seeing that too.
>
> I have better things with 2.6.24-rc3 ;)
>
> # cd /proc/net
> # ls ..
> ls: reading directory ..: Not a directory
>
> and this
>
> # cd /proc
> # find
> ...
> ./net
> find: . changed during execution of find
> # find net
> find: net changed during execution of find
> # find net/
> 
>
> Moreover. Program that opens /proc/net and dumps the /proc/self/fd
> files produces the following:
>
> # cd /
> # a.out /proc/net
> ...
> lr-x------  1 root root 64 Nov 20 18:02 3 -> /proc/net/net (deleted)
> ...
> # cd /proc/net
> # a.out .
> ...
> lr-x------  1 root root 64 Nov 20 18:03 3 -> /proc/net/net (deleted)
> ...
> # a.out ..
> ...
> lr-x------  1 root root 64 Nov 20 18:03 3 -> /proc/net
> ...
>
> This all is somehow related to the shadow proc files.
> E.g. the first problem (with -ENOTDIR) is due to the shadow /proc/net
> dentry not implementing the .readdir method:
>
> static const struct file_operations proc_net_dir_operations = {
> .read   = generic_read_dir,
> };
>
> And I haven't managed to find out why the rest of the problems
> occur...
>
> Eric, do you have fixes for it?

Duh.  There is one other possible solution I forgot to mention and
at least as a first pass it should be relatively simple.  Have the
mount of proc capture the network namespace.  I'm not certain
if it is what we want long term but it should be simple and relatively
easy to implement.

I don't like capturing the network namespace when we mount proc but
it is easier than implementing /proc/self/net, which is the other
real alternative.

Eric


[PATCH v4 07/20] disable CFS RT load balancing.

2007-11-20 Thread Steven Rostedt
Since we now take an active approach to load balancing, we don't need to
balance RT tasks via CFS. In fact, this code was found to pull RT tasks
away from the CPUs that the active movement had placed them on, resulting
in large latencies.

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---

 kernel/sched_rt.c |   95 ++
 1 file changed, 4 insertions(+), 91 deletions(-)

Index: linux-compile.git/kernel/sched_rt.c
===================================================================
--- linux-compile.git.orig/kernel/sched_rt.c	2007-11-20 19:53:00.000000000 -0500
+++ linux-compile.git/kernel/sched_rt.c	2007-11-20 19:53:01.000000000 -0500
@@ -564,109 +564,22 @@ static void wakeup_balance_rt(struct rq 
push_rt_tasks(rq);
 }
 
-/*
- * Load-balancing iterator. Note: while the runqueue stays locked
- * during the whole iteration, the current task might be
- * dequeued so the iterator has to be dequeue-safe. Here we
- * achieve that by always pre-iterating before returning
- * the current task:
- */
-static struct task_struct *load_balance_start_rt(void *arg)
-{
-   struct rq *rq = arg;
-   struct rt_prio_array *array = &rq->rt.active;
-   struct list_head *head, *curr;
-   struct task_struct *p;
-   int idx;
-
-   idx = sched_find_first_bit(array->bitmap);
-   if (idx >= MAX_RT_PRIO)
-   return NULL;
-
-   head = array->queue + idx;
-   curr = head->prev;
-
-   p = list_entry(curr, struct task_struct, run_list);
-
-   curr = curr->prev;
-
-   rq->rt.rt_load_balance_idx = idx;
-   rq->rt.rt_load_balance_head = head;
-   rq->rt.rt_load_balance_curr = curr;
-
-   return p;
-}
-
-static struct task_struct *load_balance_next_rt(void *arg)
-{
-   struct rq *rq = arg;
-   struct rt_prio_array *array = &rq->rt.active;
-   struct list_head *head, *curr;
-   struct task_struct *p;
-   int idx;
-
-   idx = rq->rt.rt_load_balance_idx;
-   head = rq->rt.rt_load_balance_head;
-   curr = rq->rt.rt_load_balance_curr;
-
-   /*
-* If we arrived back to the head again then
-* iterate to the next queue (if any):
-*/
-   if (unlikely(head == curr)) {
-   int next_idx = find_next_bit(array->bitmap, MAX_RT_PRIO, idx+1);
-
-   if (next_idx >= MAX_RT_PRIO)
-   return NULL;
-
-   idx = next_idx;
-   head = array->queue + idx;
-   curr = head->prev;
-
-   rq->rt.rt_load_balance_idx = idx;
-   rq->rt.rt_load_balance_head = head;
-   }
-
-   p = list_entry(curr, struct task_struct, run_list);
-
-   curr = curr->prev;
-
-   rq->rt.rt_load_balance_curr = curr;
-
-   return p;
-}
-
 static unsigned long
 load_balance_rt(struct rq *this_rq, int this_cpu, struct rq *busiest,
unsigned long max_load_move,
struct sched_domain *sd, enum cpu_idle_type idle,
int *all_pinned, int *this_best_prio)
 {
-   struct rq_iterator rt_rq_iterator;
-
-   rt_rq_iterator.start = load_balance_start_rt;
-   rt_rq_iterator.next = load_balance_next_rt;
-   /* pass 'busiest' rq argument into
-* load_balance_[start|next]_rt iterators
-*/
-   rt_rq_iterator.arg = busiest;
-
-   return balance_tasks(this_rq, this_cpu, busiest, max_load_move, sd,
-idle, all_pinned, this_best_prio, &rt_rq_iterator);
+   /* don't touch RT tasks */
+   return 0;
 }
 
 static int
 move_one_task_rt(struct rq *this_rq, int this_cpu, struct rq *busiest,
 struct sched_domain *sd, enum cpu_idle_type idle)
 {
-   struct rq_iterator rt_rq_iterator;
-
-   rt_rq_iterator.start = load_balance_start_rt;
-   rt_rq_iterator.next = load_balance_next_rt;
-   rt_rq_iterator.arg = busiest;
-
-   return iter_move_one_task(this_rq, this_cpu, busiest, sd, idle,
- &rt_rq_iterator);
+   /* don't touch RT tasks */
+   return 0;
 }
 #else /* CONFIG_SMP */
 # define schedule_tail_balance_rt(rq)  do { } while (0)



[PATCH v4 05/20] pull RT tasks

2007-11-20 Thread Steven Rostedt
This patch adds the algorithm to pull tasks from RT overloaded runqueues.

When a pull RT is initiated, all overloaded runqueues are examined for
an RT task that is higher in prio than the highest prio task queued on the
target runqueue. If another runqueue holds an RT task of higher prio than
the highest prio task on the target runqueue, it is pulled to the target
runqueue.

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---

 kernel/sched.c|2 
 kernel/sched_rt.c |  187 ++
 2 files changed, 178 insertions(+), 11 deletions(-)

Index: linux-compile.git/kernel/sched.c
===================================================================
--- linux-compile.git.orig/kernel/sched.c	2007-11-20 19:52:56.000000000 -0500
+++ linux-compile.git/kernel/sched.c	2007-11-20 19:52:59.000000000 -0500
@@ -3646,6 +3646,8 @@ need_resched_nonpreemptible:
switch_count = &prev->nvcsw;
}
 
+   schedule_balance_rt(rq, prev);
+
if (unlikely(!rq->nr_running))
idle_balance(cpu, rq);
 
Index: linux-compile.git/kernel/sched_rt.c
===================================================================
--- linux-compile.git.orig/kernel/sched_rt.c	2007-11-20 19:52:57.000000000 -0500
+++ linux-compile.git/kernel/sched_rt.c	2007-11-20 19:52:59.000000000 -0500
@@ -176,8 +176,17 @@ static void put_prev_task_rt(struct rq *
 static int double_lock_balance(struct rq *this_rq, struct rq *busiest);
 static void deactivate_task(struct rq *rq, struct task_struct *p, int sleep);
 
+static int pick_rt_task(struct rq *rq, struct task_struct *p, int cpu)
+{
+   if (!task_running(rq, p) &&
+   (cpu < 0 || cpu_isset(cpu, p->cpus_allowed)))
+   return 1;
+   return 0;
+}
+
 /* Return the second highest RT task, NULL otherwise */
-static struct task_struct *pick_next_highest_task_rt(struct rq *rq)
+static struct task_struct *pick_next_highest_task_rt(struct rq *rq,
+int cpu)
 {
struct rt_prio_array *array = &rq->rt.active;
struct task_struct *next;
@@ -196,26 +205,36 @@ static struct task_struct *pick_next_hig
}
 
queue = array->queue + idx;
+   BUG_ON(list_empty(queue));
+
next = list_entry(queue->next, struct task_struct, run_list);
-   if (unlikely(next != rq->curr))
-   return next;
+   if (unlikely(pick_rt_task(rq, next, cpu)))
+   goto out;
 
if (queue->next->next != queue) {
/* same prio task */
next = list_entry(queue->next->next, struct task_struct, run_list);
-   return next;
+   if (pick_rt_task(rq, next, cpu))
+   goto out;
}
 
+ retry:
/* slower, but more flexible */
idx = find_next_bit(array->bitmap, MAX_RT_PRIO, idx+1);
-   if (unlikely(idx >= MAX_RT_PRIO)) {
-   WARN_ON(1); /* rt_nr_running was 2 and above! */
+   if (unlikely(idx >= MAX_RT_PRIO))
return NULL;
-   }
 
queue = array->queue + idx;
-   next = list_entry(queue->next, struct task_struct, run_list);
+   BUG_ON(list_empty(queue));
+
+   list_for_each_entry(next, queue, run_list) {
+   if (pick_rt_task(rq, next, cpu))
+   goto out;
+   }
+
+   goto retry;
 
+ out:
return next;
 }
 
@@ -302,13 +321,15 @@ static int push_rt_task(struct rq *this_
 
assert_spin_locked(&this_rq->lock);
 
-   next_task = pick_next_highest_task_rt(this_rq);
+   next_task = pick_next_highest_task_rt(this_rq, -1);
if (!next_task)
return 0;
 
  retry:
-   if (unlikely(next_task == this_rq->curr))
+   if (unlikely(next_task == this_rq->curr)) {
+   WARN_ON(1);
return 0;
+   }
 
/*
 * It's possible that the next_task slipped in of
@@ -332,7 +353,7 @@ static int push_rt_task(struct rq *this_
 * so it is possible that next_task has changed.
 * If it has, then try again.
 */
-   task = pick_next_highest_task_rt(this_rq);
+   task = pick_next_highest_task_rt(this_rq, -1);
if (unlikely(task != next_task) && task && paranoid--) {
put_task_struct(next_task);
next_task = task;
@@ -375,6 +396,149 @@ static void push_rt_tasks(struct rq *rq)
;
 }
 
+static int pull_rt_task(struct rq *this_rq)
+{
+   struct task_struct *next;
+   struct task_struct *p;
+   struct rq *src_rq;
+   cpumask_t *rto_cpumask;
+   int this_cpu = this_rq->cpu;
+   int cpu;
+   int ret = 0;
+
+   assert_spin_locked(&this_rq->lock);
+
+   /*
+* If cpusets are used, and we have overlapping
+* run queue cpusets, then this algorithm may not catch all.
+* This is just the 

[PATCH v4 19/20] Optimize out cpu_clears

2007-11-20 Thread Steven Rostedt
This patch removes several cpumask operations by keeping track
of the first of the CPUs that is of the lowest priority. When
the search for the lowest priority runqueue is completed, all
the bits up to the first CPU with the lowest priority runqueue
are cleared.

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>

---
 kernel/sched_rt.c |   35 +++++++++++++++++++++++++----------
 1 file changed, 25 insertions(+), 10 deletions(-)

Index: linux-compile.git/kernel/sched_rt.c
===================================================================
--- linux-compile.git.orig/kernel/sched_rt.c	2007-11-20 19:53:15.000000000 -0500
+++ linux-compile.git/kernel/sched_rt.c	2007-11-20 19:53:17.000000000 -0500
@@ -293,22 +293,20 @@ static struct task_struct *pick_next_hig
 }
 
 static DEFINE_PER_CPU(cpumask_t, local_cpu_mask);
-static DEFINE_PER_CPU(cpumask_t, valid_cpu_mask);
 
 static int find_lowest_cpus(struct task_struct *task, cpumask_t *lowest_mask)
 {
-   int   cpu;
-   cpumask_t *valid_mask = &__get_cpu_var(valid_cpu_mask);
int   lowest_prio = -1;
+   int   lowest_cpu  = 0;
int   count   = 0;
+   int   cpu;
 
-   cpus_clear(*lowest_mask);
-   cpus_and(*valid_mask, cpu_online_map, task->cpus_allowed);
+   cpus_and(*lowest_mask, cpu_online_map, task->cpus_allowed);
 
/*
 * Scan each rq for the lowest prio.
 */
-   for_each_cpu_mask(cpu, *valid_mask) {
+   for_each_cpu_mask(cpu, *lowest_mask) {
struct rq *rq = cpu_rq(cpu);
 
/* We look for lowest RT prio or non-rt CPU */
@@ -325,13 +323,30 @@ static int find_lowest_cpus(struct task_
if (rq->rt.highest_prio > lowest_prio) {
/* new low - clear old data */
lowest_prio = rq->rt.highest_prio;
-   if (count) {
-   cpus_clear(*lowest_mask);
-   count = 0;
-   }
+   lowest_cpu = cpu;
+   count = 0;
}
cpu_set(rq->cpu, *lowest_mask);
count++;
+   } else
+   cpu_clear(cpu, *lowest_mask);
+   }
+
+   /*
+* Clear out all the set bits that represent
+* runqueues that were of higher prio than
+* the lowest_prio.
+*/
+   if (lowest_cpu) {
+   /*
+* Perhaps we could add another cpumask op to
+* zero out bits. Like cpu_zero_bits(cpumask, nrbits);
+* Then that could be optimized to use memset and such.
+*/
+   for_each_cpu_mask(cpu, *lowest_mask) {
+   if (cpu >= lowest_cpu)
+   break;
+   cpu_clear(cpu, *lowest_mask);
}
}
 



[PATCH v4 01/20] Add rt_nr_running accounting

2007-11-20 Thread Steven Rostedt
This patch adds accounting to keep track of the number of RT tasks running
on a runqueue. This information will be used in later patches.

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---

 kernel/sched.c|1 +
 kernel/sched_rt.c |   17 +++++++++++++++++
 2 files changed, 18 insertions(+)

Index: linux-compile.git/kernel/sched.c
===================================================================
--- linux-compile.git.orig/kernel/sched.c	2007-11-20 19:52:44.000000000 -0500
+++ linux-compile.git/kernel/sched.c	2007-11-20 19:52:50.000000000 -0500
@@ -266,6 +266,7 @@ struct rt_rq {
struct rt_prio_array active;
int rt_load_balance_idx;
struct list_head *rt_load_balance_head, *rt_load_balance_curr;
+   unsigned long rt_nr_running;
 };
 
 /*
Index: linux-compile.git/kernel/sched_rt.c
===================================================================
--- linux-compile.git.orig/kernel/sched_rt.c	2007-11-20 19:52:44.000000000 -0500
+++ linux-compile.git/kernel/sched_rt.c	2007-11-20 19:52:50.000000000 -0500
@@ -25,12 +25,27 @@ static void update_curr_rt(struct rq *rq
curr->se.exec_start = rq->clock;
 }
 
+static inline void inc_rt_tasks(struct task_struct *p, struct rq *rq)
+{
+   WARN_ON(!rt_task(p));
+   rq->rt.rt_nr_running++;
+}
+
+static inline void dec_rt_tasks(struct task_struct *p, struct rq *rq)
+{
+   WARN_ON(!rt_task(p));
+   WARN_ON(!rq->rt.rt_nr_running);
+   rq->rt.rt_nr_running--;
+}
+
 static void enqueue_task_rt(struct rq *rq, struct task_struct *p, int wakeup)
 {
struct rt_prio_array *array = &rq->rt.active;
 
list_add_tail(>run_list, array->queue + p->prio);
__set_bit(p->prio, array->bitmap);
+
+   inc_rt_tasks(p, rq);
 }
 
 /*
@@ -45,6 +60,8 @@ static void dequeue_task_rt(struct rq *r
list_del(>run_list);
if (list_empty(array->queue + p->prio))
__clear_bit(p->prio, array->bitmap);
+
+   dec_rt_tasks(p, rq);
 }
 
 /*



[PATCH v4 17/20] RT: restore the migratable conditional

2007-11-20 Thread Steven Rostedt
From: Gregory Haskins <[EMAIL PROTECTED]>

We don't need to bother searching if the task cannot be migrated

Signed-off-by: Gregory Haskins <[EMAIL PROTECTED]>
Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---

 kernel/sched_rt.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Index: linux-compile.git/kernel/sched_rt.c
===================================================================
--- linux-compile.git.orig/kernel/sched_rt.c	2007-11-20 19:53:11.000000000 -0500
+++ linux-compile.git/kernel/sched_rt.c	2007-11-20 19:53:13.000000000 -0500
@@ -173,7 +173,8 @@ static int select_task_rq_rt(struct task
 * that is just being woken and probably will have
 * cold cache anyway.
 */
-   if (unlikely(rt_task(rq->curr))) {
+   if (unlikely(rt_task(rq->curr)) &&
+   (p->nr_cpus_allowed > 1)) {
int cpu = find_lowest_rq(p);
 
return (cpu == -1) ? task_cpu(p) : cpu;



[PATCH v4 12/20] RT: Allow current_cpu to be included in search

2007-11-20 Thread Steven Rostedt
From: Gregory Haskins <[EMAIL PROTECTED]>

It doesn't hurt if we allow the current CPU to be included in the
search.  We will simply skip it later if the current CPU turns out
to be the lowest.

We will use this later in the series

Signed-off-by: Gregory Haskins <[EMAIL PROTECTED]>
Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---

 kernel/sched_rt.c |    5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

Index: linux-compile.git/kernel/sched_rt.c
===================================================================
--- linux-compile.git.orig/kernel/sched_rt.c	2007-11-20 19:53:05.000000000 -0500
+++ linux-compile.git/kernel/sched_rt.c	2007-11-20 19:53:07.000000000 -0500
@@ -274,9 +274,6 @@ static int find_lowest_rq(struct task_st
for_each_cpu_mask(cpu, *cpu_mask) {
struct rq *rq = cpu_rq(cpu);
 
-   if (cpu == rq->cpu)
-   continue;
-
/* We look for lowest RT prio or non-rt CPU */
if (rq->rt.highest_prio >= MAX_RT_PRIO) {
lowest_rq = rq;
@@ -304,7 +301,7 @@ static struct rq *find_lock_lowest_rq(st
for (tries = 0; tries < RT_MAX_TRIES; tries++) {
cpu = find_lowest_rq(task);
 
-   if (cpu == -1)
+   if ((cpu == -1) || (cpu == rq->cpu))
break;
 
lowest_rq = cpu_rq(cpu);



[PATCH v4 14/20] RT: Optimize our cpu selection based on topology

2007-11-20 Thread Steven Rostedt
From: Gregory Haskins <[EMAIL PROTECTED]>

The current code base assumes a relatively flat CPU/core topology and will
route RT tasks to any CPU fairly equally.  In the real world, there are
various topologies and affinities that govern where a task is best suited to
run with the smallest amount of overhead.  NUMA and multi-core CPUs are
prime examples of topologies that can impact cache performance.

Fortunately, Linux is already structured to represent these topologies via
the sched_domains interface.  So we change our RT router to consult a
combination of topology and affinity policy to best place tasks during
migration.

Signed-off-by: Gregory Haskins <[EMAIL PROTECTED]>
Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---

 kernel/sched.c|1 
 kernel/sched_rt.c |  100 +++---
 2 files changed, 89 insertions(+), 12 deletions(-)

Index: linux-compile.git/kernel/sched.c
===================================================================
--- linux-compile.git.orig/kernel/sched.c	2007-11-20 19:53:04.000000000 -0500
+++ linux-compile.git/kernel/sched.c	2007-11-20 19:53:09.000000000 -0500
@@ -24,6 +24,7 @@
  *  2007-07-01  Group scheduling enhancements by Srivatsa Vaddagiri
  *  2007-10-22  RT overload balancing by Steven Rostedt
  * (with thanks to Gregory Haskins)
+ *  2007-11-05  RT migration/wakeup tuning by Gregory Haskins
  */
 
 #include 
Index: linux-compile.git/kernel/sched_rt.c
===================================================================
--- linux-compile.git.orig/kernel/sched_rt.c	2007-11-20 19:53:08.000000000 -0500
+++ linux-compile.git/kernel/sched_rt.c	2007-11-20 19:53:09.000000000 -0500
@@ -278,35 +278,111 @@ static struct task_struct *pick_next_hig
 }
 
 static DEFINE_PER_CPU(cpumask_t, local_cpu_mask);
+static DEFINE_PER_CPU(cpumask_t, valid_cpu_mask);
 
-static int find_lowest_rq(struct task_struct *task)
+static int find_lowest_cpus(struct task_struct *task, cpumask_t *lowest_mask)
 {
-   int cpu;
-   cpumask_t *cpu_mask = &__get_cpu_var(local_cpu_mask);
-   struct rq *lowest_rq = NULL;
+   int   cpu;
+   cpumask_t *valid_mask = &__get_cpu_var(valid_cpu_mask);
+   int   lowest_prio = -1;
+   int   ret = 0;
 
-   cpus_and(*cpu_mask, cpu_online_map, task->cpus_allowed);
+   cpus_clear(*lowest_mask);
+   cpus_and(*valid_mask, cpu_online_map, task->cpus_allowed);
 
/*
 * Scan each rq for the lowest prio.
 */
-   for_each_cpu_mask(cpu, *cpu_mask) {
+   for_each_cpu_mask(cpu, *valid_mask) {
struct rq *rq = cpu_rq(cpu);
 
/* We look for lowest RT prio or non-rt CPU */
if (rq->rt.highest_prio >= MAX_RT_PRIO) {
-   lowest_rq = rq;
-   break;
+   if (ret)
+   cpus_clear(*lowest_mask);
+   cpu_set(rq->cpu, *lowest_mask);
+   return 1;
}
 
/* no locking for now */
-   if (rq->rt.highest_prio > task->prio &&
-   (!lowest_rq || rq->rt.highest_prio > 
lowest_rq->rt.highest_prio)) {
-   lowest_rq = rq;
+   if ((rq->rt.highest_prio > task->prio)
+   && (rq->rt.highest_prio >= lowest_prio)) {
+   if (rq->rt.highest_prio > lowest_prio) {
+   /* new low - clear old data */
+   lowest_prio = rq->rt.highest_prio;
+   cpus_clear(*lowest_mask);
+   }
+   cpu_set(rq->cpu, *lowest_mask);
+   ret = 1;
+   }
+   }
+
+   return ret;
+}
+
+static inline int pick_optimal_cpu(int this_cpu, cpumask_t *mask)
+{
+   int first;
+
+   /* "this_cpu" is cheaper to preempt than a remote processor */
+   if ((this_cpu != -1) && cpu_isset(this_cpu, *mask))
+   return this_cpu;
+
+   first = first_cpu(*mask);
+   if (first != NR_CPUS)
+   return first;
+
+   return -1;
+}
+
+static int find_lowest_rq(struct task_struct *task)
+{
+   struct sched_domain *sd;
+   cpumask_t *lowest_mask = &__get_cpu_var(local_cpu_mask);
+   int this_cpu = smp_processor_id();
+   int cpu  = task_cpu(task);
+
+   if (!find_lowest_cpus(task, lowest_mask))
+   return -1;
+
+   /*
+* At this point we have built a mask of cpus representing the
+* lowest priority tasks in the system.  Now we want to elect
+* the best one based on our affinity and topology.
+*
+* We prioritize the last cpu that the task executed on since
+* it is most likely cache-hot in that location.
+*/
+   if (cpu_isset(cpu, *lowest_mask))
+   return cpu;
+
+   /*
+* Otherwise, we 

[PATCH v4 10/20] RT: Remove some CFS specific code from the wakeup path of RT tasks

2007-11-20 Thread Steven Rostedt
From: Gregory Haskins <[EMAIL PROTECTED]>

The current wake-up code path tries to determine if it can optimize the
wake-up to "this_cpu" by computing load calculations.  The problem is that
these calculations are only relevant to CFS tasks where load is king.  For RT
tasks, priority is king.  So the load calculation is completely wasted
bandwidth.

Therefore, we create a new sched_class interface to help with
pre-wakeup routing decisions and move the load calculation into the
CFS class, where it belongs.

Signed-off-by: Gregory Haskins <[EMAIL PROTECTED]>
Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---

 include/linux/sched.h   |1 
 kernel/sched.c  |  167 +++-
 kernel/sched_fair.c |  148 ++
 kernel/sched_idletask.c |9 ++
 kernel/sched_rt.c   |   10 ++
 5 files changed, 195 insertions(+), 140 deletions(-)

Index: linux-compile.git/include/linux/sched.h
===
--- linux-compile.git.orig/include/linux/sched.h2007-11-20 
19:53:02.0 -0500
+++ linux-compile.git/include/linux/sched.h 2007-11-20 19:53:04.0 
-0500
@@ -823,6 +823,7 @@ struct sched_class {
void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep);
void (*yield_task) (struct rq *rq);
+   int  (*select_task_rq)(struct task_struct *p, int sync);
 
void (*check_preempt_curr) (struct rq *rq, struct task_struct *p);
 
Index: linux-compile.git/kernel/sched.c
===
--- linux-compile.git.orig/kernel/sched.c   2007-11-20 19:53:02.0 
-0500
+++ linux-compile.git/kernel/sched.c2007-11-20 19:53:04.0 -0500
@@ -860,6 +860,13 @@ iter_move_one_task(struct rq *this_rq, i
   struct rq_iterator *iterator);
 #endif
 
+#ifdef CONFIG_SMP
+static unsigned long source_load(int cpu, int type);
+static unsigned long target_load(int cpu, int type);
+static unsigned long cpu_avg_load_per_task(int cpu);
+static int task_hot(struct task_struct *p, u64 now, struct sched_domain *sd);
+#endif /* CONFIG_SMP */
+
 #include "sched_stats.h"
 #include "sched_idletask.c"
 #include "sched_fair.c"
@@ -1045,7 +1052,7 @@ static inline void __set_task_cpu(struct
 /*
  * Is this task likely cache-hot:
  */
-static inline int
+static int
 task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
 {
s64 delta;
@@ -1270,7 +1277,7 @@ static unsigned long target_load(int cpu
 /*
  * Return the average load per task on the cpu's run queue
  */
-static inline unsigned long cpu_avg_load_per_task(int cpu)
+static unsigned long cpu_avg_load_per_task(int cpu)
 {
struct rq *rq = cpu_rq(cpu);
unsigned long total = weighted_cpuload(cpu);
@@ -1427,58 +1434,6 @@ static int sched_balance_self(int cpu, i
 
 #endif /* CONFIG_SMP */
 
-/*
- * wake_idle() will wake a task on an idle cpu if task->cpu is
- * not idle and an idle cpu is available.  The span of cpus to
- * search starts with cpus closest then further out as needed,
- * so we always favor a closer, idle cpu.
- *
- * Returns the CPU we should wake onto.
- */
-#if defined(ARCH_HAS_SCHED_WAKE_IDLE)
-static int wake_idle(int cpu, struct task_struct *p)
-{
-   cpumask_t tmp;
-   struct sched_domain *sd;
-   int i;
-
-   /*
-* If it is idle, then it is the best cpu to run this task.
-*
-* This cpu is also the best, if it has more than one task already.
-* Siblings must be also busy(in most cases) as they didn't already
-* pickup the extra load from this cpu and hence we need not check
-* sibling runqueue info. This will avoid the checks and cache miss
-* penalities associated with that.
-*/
-   if (idle_cpu(cpu) || cpu_rq(cpu)->nr_running > 1)
-   return cpu;
-
-   for_each_domain(cpu, sd) {
-   if (sd->flags & SD_WAKE_IDLE) {
-   cpus_and(tmp, sd->span, p->cpus_allowed);
-   for_each_cpu_mask(i, tmp) {
-   if (idle_cpu(i)) {
-   if (i != task_cpu(p)) {
-   schedstat_inc(p,
-   se.nr_wakeups_idle);
-   }
-   return i;
-   }
-   }
-   } else {
-   break;
-   }
-   }
-   return cpu;
-}
-#else
-static inline int wake_idle(int cpu, struct task_struct *p)
-{
-   return cpu;
-}
-#endif
-
 /***
  * try_to_wake_up - wake up a thread
  * @p: the to-be-woken-up thread
@@ -1500,8 +1455,6 @@ static int try_to_wake_up(struct task_st
long 

[PATCH v4 08/20] Cache cpus_allowed weight for optimizing migration

2007-11-20 Thread Steven Rostedt
From: Gregory Haskins <[EMAIL PROTECTED]>

Some RT tasks (particularly kthreads) are bound to one specific CPU.
It is fairly common for two or more bound tasks to get queued up at the
same time.  Consider, for instance, softirq_timer and softirq_sched.  A
timer goes off in an ISR which schedules softirq_thread to run at RT50.
Then the timer handler determines that it's time to smp-rebalance the
system so it schedules softirq_sched to run.  So we are in a situation
where we have two RT50 tasks queued, and the system will go into
rt-overload condition to request other CPUs for help.

This causes two problems in the current code:

1) If a high-priority bound task and a low-priority unbounded task queue
   up behind the running task, we will fail to ever relocate the unbounded
   task because we terminate the search on the first unmovable task.

2) We spend precious futile cycles in the fast-path trying to pull
   overloaded tasks over.  It is therefore optimal to avoid the
   overhead altogether if we can cheaply detect the condition before
   overload even occurs.

This patch tries to achieve this optimization by utilizing the hamming
weight of the task->cpus_allowed mask.  A weight of 1 indicates that
the task cannot be migrated.  We will then utilize this information to
skip non-migratable tasks and to eliminate unnecessary rebalance attempts.

We introduce a per-rq variable to count the number of migratable tasks
that are currently running.  We only go into overload if we have more
than one rt task, AND at least one of them is migratable.

In addition, we introduce a per-task variable to cache the cpus_allowed
weight, since the hamming calculation is probably relatively expensive.
We only update the cached value when the mask is updated which should be
relatively infrequent, especially compared to scheduling frequency
in the fast path.

Signed-off-by: Gregory Haskins <[EMAIL PROTECTED]>
Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---

 include/linux/init_task.h |1 
 include/linux/sched.h |2 +
 kernel/fork.c |1 
 kernel/sched.c|9 +++-
 kernel/sched_rt.c |   50 +-
 5 files changed, 57 insertions(+), 6 deletions(-)

Index: linux-compile.git/include/linux/init_task.h
===
--- linux-compile.git.orig/include/linux/init_task.h2007-11-20 
19:52:44.0 -0500
+++ linux-compile.git/include/linux/init_task.h 2007-11-20 19:53:02.0 
-0500
@@ -130,6 +130,7 @@ extern struct group_info init_groups;
.normal_prio= MAX_PRIO-20,  \
.policy = SCHED_NORMAL, \
.cpus_allowed   = CPU_MASK_ALL, \
+   .nr_cpus_allowed = NR_CPUS, \
.mm = NULL, \
.active_mm  = &init_mm, \
.run_list   = LIST_HEAD_INIT(tsk.run_list), \
Index: linux-compile.git/include/linux/sched.h
===
--- linux-compile.git.orig/include/linux/sched.h2007-11-20 
19:52:44.0 -0500
+++ linux-compile.git/include/linux/sched.h 2007-11-20 19:53:02.0 
-0500
@@ -843,6 +843,7 @@ struct sched_class {
void (*set_curr_task) (struct rq *rq);
void (*task_tick) (struct rq *rq, struct task_struct *p);
void (*task_new) (struct rq *rq, struct task_struct *p);
+   void (*set_cpus_allowed)(struct task_struct *p, cpumask_t *newmask);
 };
 
 struct load_weight {
@@ -952,6 +953,7 @@ struct task_struct {
 
unsigned int policy;
cpumask_t cpus_allowed;
+   int nr_cpus_allowed;
unsigned int time_slice;
 
 #if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
Index: linux-compile.git/kernel/fork.c
===
--- linux-compile.git.orig/kernel/fork.c2007-11-20 19:52:44.0 
-0500
+++ linux-compile.git/kernel/fork.c 2007-11-20 19:53:02.0 -0500
@@ -1237,6 +1237,7 @@ static struct task_struct *copy_process(
 * parent's CPU). This avoids alot of nasty races.
 */
p->cpus_allowed = current->cpus_allowed;
+   p->nr_cpus_allowed = current->nr_cpus_allowed;
if (unlikely(!cpu_isset(task_cpu(p), p->cpus_allowed) ||
!cpu_online(task_cpu(p
set_task_cpu(p, smp_processor_id());
Index: linux-compile.git/kernel/sched.c
===
--- linux-compile.git.orig/kernel/sched.c   2007-11-20 19:53:00.0 
-0500
+++ linux-compile.git/kernel/sched.c2007-11-20 19:53:02.0 -0500
@@ -269,6 +269,7 @@ struct rt_rq {
int 

[PATCH v4 03/20] push RT tasks

2007-11-20 Thread Steven Rostedt
This patch adds an algorithm to push extra RT tasks off a run queue to
other CPU runqueues.

When more than one RT task is added to a run queue, this algorithm takes
an assertive approach to push the RT tasks that are not running onto other
run queues that have lower priority.  The way this works is that the highest
RT task that is not running is looked at and we examine the runqueues on
the CPUS for that tasks affinity mask. We find the runqueue with the lowest
prio in the CPU affinity of the picked task, and if it is lower in prio than
the picked task, we push the task onto that CPU runqueue.

We continue pushing RT tasks off the current runqueue until we don't push any
more.  The algorithm stops when the next highest RT task can't preempt any
other processes on other CPUs.

TODO: The algorithm may stop when there are still RT tasks that could be
 migrated. Specifically, if the highest non-running RT task's CPU affinity
 is restricted to CPUs that are running higher priority tasks, there may
 be a lower priority task queued that has an affinity with a CPU that is
 running a lower priority task that it could be migrated to.  This
 patch set does not address this issue.

Note: checkpatch reveals two over 80 character instances. I'm not sure
 that breaking them up will help visually, so I left them as is.

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---

 kernel/sched.c|8 +
 kernel/sched_rt.c |  225 +-
 2 files changed, 231 insertions(+), 2 deletions(-)

Index: linux-compile.git/kernel/sched.c
===
--- linux-compile.git.orig/kernel/sched.c   2007-11-20 19:52:55.0 
-0500
+++ linux-compile.git/kernel/sched.c2007-11-20 19:52:56.0 -0500
@@ -1877,6 +1877,8 @@ static void finish_task_switch(struct rq
prev_state = prev->state;
finish_arch_switch(prev);
finish_lock_switch(rq, prev);
+   schedule_tail_balance_rt(rq);
+
fire_sched_in_preempt_notifiers(current);
if (mm)
mmdrop(mm);
@@ -2110,11 +2112,13 @@ static void double_rq_unlock(struct rq *
 /*
  * double_lock_balance - lock the busiest runqueue, this_rq is locked already.
  */
-static void double_lock_balance(struct rq *this_rq, struct rq *busiest)
+static int double_lock_balance(struct rq *this_rq, struct rq *busiest)
__releases(this_rq->lock)
__acquires(busiest->lock)
__acquires(this_rq->lock)
 {
+   int ret = 0;
+
if (unlikely(!irqs_disabled())) {
/* printk() doesn't work good under rq->lock */
spin_unlock(&this_rq->lock);
@@ -2125,9 +2129,11 @@ static void double_lock_balance(struct r
spin_unlock(&this_rq->lock);
spin_lock(&busiest->lock);
spin_lock(&this_rq->lock);
+   ret = 1;
} else
spin_lock(&busiest->lock);
}
+   return ret;
 }
 
 /*
Index: linux-compile.git/kernel/sched_rt.c
===
--- linux-compile.git.orig/kernel/sched_rt.c2007-11-20 19:52:55.0 
-0500
+++ linux-compile.git/kernel/sched_rt.c 2007-11-20 19:52:56.0 -0500
@@ -134,6 +134,227 @@ static void put_prev_task_rt(struct rq *
 }
 
 #ifdef CONFIG_SMP
+/* Only try algorithms three times */
+#define RT_MAX_TRIES 3
+
+static int double_lock_balance(struct rq *this_rq, struct rq *busiest);
+static void deactivate_task(struct rq *rq, struct task_struct *p, int sleep);
+
+/* Return the second highest RT task, NULL otherwise */
+static struct task_struct *pick_next_highest_task_rt(struct rq *rq)
+{
+   struct rt_prio_array *array = &rq->rt.active;
+   struct task_struct *next;
+   struct list_head *queue;
+   int idx;
+
+   assert_spin_locked(>lock);
+
+   if (likely(rq->rt.rt_nr_running < 2))
+   return NULL;
+
+   idx = sched_find_first_bit(array->bitmap);
+   if (unlikely(idx >= MAX_RT_PRIO)) {
+   WARN_ON(1); /* rt_nr_running is bad */
+   return NULL;
+   }
+
+   queue = array->queue + idx;
+   next = list_entry(queue->next, struct task_struct, run_list);
+   if (unlikely(next != rq->curr))
+   return next;
+
+   if (queue->next->next != queue) {
+   /* same prio task */
+   next = list_entry(queue->next->next, struct task_struct, 
run_list);
+   return next;
+   }
+
+   /* slower, but more flexible */
+   idx = find_next_bit(array->bitmap, MAX_RT_PRIO, idx+1);
+   if (unlikely(idx >= MAX_RT_PRIO)) {
+   WARN_ON(1); /* rt_nr_running was 2 and above! */
+   return NULL;
+   }
+
+   queue = array->queue + idx;
+   next = list_entry(queue->next, struct task_struct, run_list);
+
+   return next;
+}
+
+static DEFINE_PER_CPU(cpumask_t, local_cpu_mask);
+
+/* 

[PATCH v4 15/20] RT: Optimize rebalancing

2007-11-20 Thread Steven Rostedt
From: Gregory Haskins <[EMAIL PROTECTED]>

We have logic to detect whether the system has migratable tasks, but we are
not using it when deciding whether to push tasks away.  So we add support
for considering this new information.

Signed-off-by: Gregory Haskins <[EMAIL PROTECTED]>
Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---

 kernel/sched.c|2 ++
 kernel/sched_rt.c |   10 --
 2 files changed, 10 insertions(+), 2 deletions(-)

Index: linux-compile.git/kernel/sched.c
===
--- linux-compile.git.orig/kernel/sched.c   2007-11-20 19:53:09.0 
-0500
+++ linux-compile.git/kernel/sched.c2007-11-20 19:53:10.0 -0500
@@ -273,6 +273,7 @@ struct rt_rq {
unsigned long rt_nr_migratory;
/* highest queued rt task prio */
int highest_prio;
+   int overloaded;
 };
 
 /*
@@ -6685,6 +6686,7 @@ void __init sched_init(void)
rq->migration_thread = NULL;
INIT_LIST_HEAD(&rq->migration_queue);
rq->rt.highest_prio = MAX_RT_PRIO;
+   rq->rt.overloaded = 0;
 #endif
atomic_set(&rq->nr_iowait, 0);
 
Index: linux-compile.git/kernel/sched_rt.c
===
--- linux-compile.git.orig/kernel/sched_rt.c2007-11-20 19:53:09.0 
-0500
+++ linux-compile.git/kernel/sched_rt.c 2007-11-20 19:53:10.0 -0500
@@ -16,6 +16,7 @@ static inline cpumask_t *rt_overload(voi
 }
 static inline void rt_set_overload(struct rq *rq)
 {
+   rq->rt.overloaded = 1;
cpu_set(rq->cpu, rt_overload_mask);
/*
 * Make sure the mask is visible before we set
@@ -32,6 +33,7 @@ static inline void rt_clear_overload(str
/* the order here really doesn't matter */
atomic_dec(&rto_count);
cpu_clear(rq->cpu, rt_overload_mask);
+   rq->rt.overloaded = 0;
 }
 
 static void update_rt_migration(struct rq *rq)
@@ -445,6 +447,9 @@ static int push_rt_task(struct rq *rq)
 
assert_spin_locked(&rq->lock);
 
+   if (!rq->rt.overloaded)
+   return 0;
+
next_task = pick_next_highest_task_rt(rq, -1);
if (!next_task)
return 0;
@@ -672,7 +677,7 @@ static void schedule_tail_balance_rt(str
 * the lock was owned by prev, we need to release it
 * first via finish_lock_switch and then reaquire it here.
 */
-   if (unlikely(rq->rt.rt_nr_running > 1)) {
+   if (unlikely(rq->rt.overloaded)) {
spin_lock_irq(&rq->lock);
push_rt_tasks(rq);
spin_unlock_irq(&rq->lock);
@@ -684,7 +689,8 @@ static void wakeup_balance_rt(struct rq 
 {
if (unlikely(rt_task(p)) &&
!task_running(rq, p) &&
-   (p->prio >= rq->curr->prio))
+   (p->prio >= rq->rt.highest_prio) &&
+   rq->rt.overloaded)
push_rt_tasks(rq);
 }
 



[PATCH v4 20/20] balance RT tasks no new wake up

2007-11-20 Thread Steven Rostedt
Run the RT balancing code on wake up to an RT task.

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>

---
 kernel/sched.c |1 +
 1 file changed, 1 insertion(+)

Index: linux-compile.git/kernel/sched.c
===
--- linux-compile.git.orig/kernel/sched.c   2007-11-20 19:53:10.0 
-0500
+++ linux-compile.git/kernel/sched.c2007-11-20 19:53:18.0 -0500
@@ -1650,6 +1650,7 @@ void fastcall wake_up_new_task(struct ta
inc_nr_running(p, rq);
}
check_preempt_curr(rq, p);
+   wakeup_balance_rt(rq, p);
task_rq_unlock(rq, &flags);
 }
 



[PATCH v4 11/20] RT: Break out the search function

2007-11-20 Thread Steven Rostedt
From: Gregory Haskins <[EMAIL PROTECTED]>

Isolate the search logic into a function so that it can be used later
in places other than find_locked_lowest_rq().

Signed-off-by: Gregory Haskins <[EMAIL PROTECTED]>
Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---

 kernel/sched_rt.c |   66 +++---
 1 file changed, 39 insertions(+), 27 deletions(-)

Index: linux-compile.git/kernel/sched_rt.c
===
--- linux-compile.git.orig/kernel/sched_rt.c2007-11-20 19:53:04.0 
-0500
+++ linux-compile.git/kernel/sched_rt.c 2007-11-20 19:53:05.0 -0500
@@ -260,54 +260,66 @@ static struct task_struct *pick_next_hig
 
 static DEFINE_PER_CPU(cpumask_t, local_cpu_mask);
 
-/* Will lock the rq it finds */
-static struct rq *find_lock_lowest_rq(struct task_struct *task,
- struct rq *this_rq)
+static int find_lowest_rq(struct task_struct *task)
 {
-   struct rq *lowest_rq = NULL;
int cpu;
-   int tries;
cpumask_t *cpu_mask = &__get_cpu_var(local_cpu_mask);
+   struct rq *lowest_rq = NULL;
 
cpus_and(*cpu_mask, cpu_online_map, task->cpus_allowed);
 
-   for (tries = 0; tries < RT_MAX_TRIES; tries++) {
-   /*
-* Scan each rq for the lowest prio.
-*/
-   for_each_cpu_mask(cpu, *cpu_mask) {
-   struct rq *rq = &per_cpu(runqueues, cpu);
+   /*
+* Scan each rq for the lowest prio.
+*/
+   for_each_cpu_mask(cpu, *cpu_mask) {
+   struct rq *rq = cpu_rq(cpu);
 
-   if (cpu == this_rq->cpu)
-   continue;
+   if (cpu == rq->cpu)
+   continue;
 
-   /* We look for lowest RT prio or non-rt CPU */
-   if (rq->rt.highest_prio >= MAX_RT_PRIO) {
-   lowest_rq = rq;
-   break;
-   }
+   /* We look for lowest RT prio or non-rt CPU */
+   if (rq->rt.highest_prio >= MAX_RT_PRIO) {
+   lowest_rq = rq;
+   break;
+   }
 
-   /* no locking for now */
-   if (rq->rt.highest_prio > task->prio &&
-   (!lowest_rq || rq->rt.highest_prio > 
lowest_rq->rt.highest_prio)) {
-   lowest_rq = rq;
-   }
+   /* no locking for now */
+   if (rq->rt.highest_prio > task->prio &&
+   (!lowest_rq || rq->rt.highest_prio > 
lowest_rq->rt.highest_prio)) {
+   lowest_rq = rq;
}
+   }
+
+   return lowest_rq ? lowest_rq->cpu : -1;
+}
+
+/* Will lock the rq it finds */
+static struct rq *find_lock_lowest_rq(struct task_struct *task,
+ struct rq *rq)
+{
+   struct rq *lowest_rq = NULL;
+   int cpu;
+   int tries;
 
-   if (!lowest_rq)
+   for (tries = 0; tries < RT_MAX_TRIES; tries++) {
+   cpu = find_lowest_rq(task);
+
+   if (cpu == -1)
break;
 
+   lowest_rq = cpu_rq(cpu);
+
/* if the prio of this runqueue changed, try again */
-   if (double_lock_balance(this_rq, lowest_rq)) {
+   if (double_lock_balance(rq, lowest_rq)) {
/*
 * We had to unlock the run queue. In
 * the mean time, task could have
 * migrated already or had its affinity changed.
 * Also make sure that it wasn't scheduled on its rq.
 */
-   if (unlikely(task_rq(task) != this_rq ||
+   if (unlikely(task_rq(task) != rq ||
 !cpu_isset(lowest_rq->cpu, 
task->cpus_allowed) ||
-task_running(this_rq, task) ||
+task_running(rq, task) ||
 !task->se.on_rq)) {
spin_unlock(&lowest_rq->lock);
lowest_rq = NULL;



[PATCH v4 09/20] RT: Consistency cleanup for this_rq usage

2007-11-20 Thread Steven Rostedt
From: Gregory Haskins <[EMAIL PROTECTED]>

"this_rq" is normally used to denote the RQ on the current cpu
(i.e. "cpu_rq(this_cpu)").  So clean up the usage of this_rq to be
more consistent with the rest of the code.

Signed-off-by: Gregory Haskins <[EMAIL PROTECTED]>
Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---

 kernel/sched_rt.c |   22 +++---
 1 file changed, 11 insertions(+), 11 deletions(-)

Index: linux-compile.git/kernel/sched_rt.c
===
--- linux-compile.git.orig/kernel/sched_rt.c2007-11-20 19:53:02.0 
-0500
+++ linux-compile.git/kernel/sched_rt.c 2007-11-20 19:53:03.0 -0500
@@ -325,21 +325,21 @@ static struct rq *find_lock_lowest_rq(st
  * running task can migrate over to a CPU that is running a task
  * of lesser priority.
  */
-static int push_rt_task(struct rq *this_rq)
+static int push_rt_task(struct rq *rq)
 {
struct task_struct *next_task;
struct rq *lowest_rq;
int ret = 0;
int paranoid = RT_MAX_TRIES;
 
-   assert_spin_locked(&this_rq->lock);
+   assert_spin_locked(>lock);
 
-   next_task = pick_next_highest_task_rt(this_rq, -1);
+   next_task = pick_next_highest_task_rt(rq, -1);
if (!next_task)
return 0;
 
  retry:
-   if (unlikely(next_task == this_rq->curr)) {
+   if (unlikely(next_task == rq->curr)) {
WARN_ON(1);
return 0;
}
@@ -349,24 +349,24 @@ static int push_rt_task(struct rq *this_
 * higher priority than current. If that's the case
 * just reschedule current.
 */
-   if (unlikely(next_task->prio < this_rq->curr->prio)) {
-   resched_task(this_rq->curr);
+   if (unlikely(next_task->prio < rq->curr->prio)) {
+   resched_task(rq->curr);
return 0;
}
 
-   /* We might release this_rq lock */
+   /* We might release rq lock */
get_task_struct(next_task);
 
/* find_lock_lowest_rq locks the rq if found */
-   lowest_rq = find_lock_lowest_rq(next_task, this_rq);
+   lowest_rq = find_lock_lowest_rq(next_task, rq);
if (!lowest_rq) {
struct task_struct *task;
/*
-* find lock_lowest_rq releases this_rq->lock
+* find lock_lowest_rq releases rq->lock
 * so it is possible that next_task has changed.
 * If it has, then try again.
 */
-   task = pick_next_highest_task_rt(this_rq, -1);
+   task = pick_next_highest_task_rt(rq, -1);
if (unlikely(task != next_task) && task && paranoid--) {
put_task_struct(next_task);
next_task = task;
@@ -377,7 +377,7 @@ static int push_rt_task(struct rq *this_
 
assert_spin_locked(&lowest_rq->lock);
 
-   deactivate_task(this_rq, next_task, 0);
+   deactivate_task(rq, next_task, 0);
set_task_cpu(next_task, lowest_rq->cpu);
activate_task(lowest_rq, next_task, 0);
 



[PATCH v4 06/20] wake up balance RT

2007-11-20 Thread Steven Rostedt
This patch adds pushing of overloaded RT tasks from a runqueue that is
having tasks (most likely RT tasks) added to the run queue.

TODO: We don't cover the case of waking of new RT tasks (yet).

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---

 kernel/sched.c|3 +++
 kernel/sched_rt.c |   10 ++
 2 files changed, 13 insertions(+)

Index: linux-compile.git/kernel/sched.c
===
--- linux-compile.git.orig/kernel/sched.c   2007-11-20 19:52:59.0 
-0500
+++ linux-compile.git/kernel/sched.c2007-11-20 19:53:00.0 -0500
@@ -22,6 +22,8 @@
  *  by Peter Williams
  *  2007-05-06  Interactivity improvements to CFS by Mike Galbraith
  *  2007-07-01  Group scheduling enhancements by Srivatsa Vaddagiri
+ *  2007-10-22  RT overload balancing by Steven Rostedt
+ * (with thanks to Gregory Haskins)
  */
 
 #include 
@@ -1635,6 +1637,7 @@ out_activate:
 
 out_running:
p->state = TASK_RUNNING;
+   wakeup_balance_rt(rq, p);
 out:
task_rq_unlock(rq, &flags);
 
Index: linux-compile.git/kernel/sched_rt.c
===
--- linux-compile.git.orig/kernel/sched_rt.c2007-11-20 19:52:59.0 
-0500
+++ linux-compile.git/kernel/sched_rt.c 2007-11-20 19:53:00.0 -0500
@@ -555,6 +555,15 @@ static void schedule_tail_balance_rt(str
}
 }
 
+
+static void wakeup_balance_rt(struct rq *rq, struct task_struct *p)
+{
+   if (unlikely(rt_task(p)) &&
+   !task_running(rq, p) &&
+   (p->prio >= rq->curr->prio))
+   push_rt_tasks(rq);
+}
+
 /*
  * Load-balancing iterator. Note: while the runqueue stays locked
  * during the whole iteration, the current task might be
@@ -662,6 +671,7 @@ move_one_task_rt(struct rq *this_rq, int
 #else /* CONFIG_SMP */
 # define schedule_tail_balance_rt(rq)  do { } while (0)
 # define schedule_balance_rt(rq, prev) do { } while (0)
+# define wakeup_balance_rt(rq, p)  do { } while (0)
 #endif /* CONFIG_SMP */
 
 static void task_tick_rt(struct rq *rq, struct task_struct *p)



[PATCH v4 13/20] RT: Pre-route RT tasks on wakeup

2007-11-20 Thread Steven Rostedt
From: Gregory Haskins <[EMAIL PROTECTED]>

In the original patch series that Steven Rostedt and I worked on together,
we both took different approaches to low-priority wakeup path.  I utilized
"pre-routing" (push the task away to a less important RQ before activating)
approach, while Steve utilized a "post-routing" approach.  The advantage of
my approach is that you avoid the overhead of a wasted activate/deactivate
cycle and peripherally related burdens.  The advantage of Steve's method is
that it neatly solves an issue preventing a "pull" optimization from being
deployed.

In the end, we ended up deploying Steve's idea.  But it later dawned on me
that we could get the best of both worlds by deploying both ideas together,
albeit slightly modified.

The idea is simple:  Use a "light-weight" lookup for pre-routing, since we
only need to approximate a good home for the task.  And we also retain the
post-routing push logic to clean up any inaccuracies caused by a condition
of "priority mistargeting" caused by the lightweight lookup.  Most of the
time, the pre-routing should work and yield lower overhead.  In the cases
where it doesn't, the post-router will bat cleanup.

Signed-off-by: Gregory Haskins <[EMAIL PROTECTED]>
Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---

 kernel/sched_rt.c |   19 +++
 1 file changed, 19 insertions(+)

Index: linux-compile.git/kernel/sched_rt.c
===
--- linux-compile.git.orig/kernel/sched_rt.c2007-11-20 19:53:07.0 
-0500
+++ linux-compile.git/kernel/sched_rt.c 2007-11-20 19:53:08.0 -0500
@@ -148,8 +148,27 @@ yield_task_rt(struct rq *rq)
 }
 
 #ifdef CONFIG_SMP
+static int find_lowest_rq(struct task_struct *task);
+
 static int select_task_rq_rt(struct task_struct *p, int sync)
 {
+   struct rq *rq = task_rq(p);
+
+   /*
+* If the task will not preempt the RQ, try to find a better RQ
+* before we even activate the task
+*/
+   if ((p->prio >= rq->rt.highest_prio)
+   && (p->nr_cpus_allowed > 1)) {
+   int cpu = find_lowest_rq(p);
+
+   return (cpu == -1) ? task_cpu(p) : cpu;
+   }
+
+   /*
+* Otherwise, just let it ride on the affined RQ and the
+* post-schedule router will push the preempted task away
+*/
return task_cpu(p);
 }
 #endif /* CONFIG_SMP */



[PATCH v4 16/20] Avoid overload

2007-11-20 Thread Steven Rostedt
This patch changes the searching for a run queue by a waking RT task
to try to pick another runqueue if the currently running task
is an RT task.

The reason is that RT tasks behave differently from normal
tasks. Preempting a normal task to run an RT task to keep
its cache hot is fine, because the preempted non-RT task
may wait on that same runqueue to run again, unless the
migration thread comes along and pulls it off.

RT tasks behave differently. If one is preempted, it makes
an active effort to continue to run. So by having a high
priority task preempt a lower priority RT task, that lower
RT task will then quickly try to run on another runqueue.
This will cause that lower RT task to replace its nice
hot cache (and TLB) with a completely cold one. This is
done in the hope that the new high priority RT task will keep
its cache hot.

Remember that this high priority RT task was just woken up.
It has likely been sleeping for several milliseconds,
and will end up with a cold cache anyway. RT tasks run till
they voluntarily stop, or are preempted by a higher priority
task. This means that it is unlikely that the woken RT task
will have a hot cache to wake up to. So pushing off a lower
RT task is just killing its cache for no good reason.

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---

 kernel/sched_rt.c |   20 
 1 file changed, 16 insertions(+), 4 deletions(-)

Index: linux-compile.git/kernel/sched_rt.c
===
--- linux-compile.git.orig/kernel/sched_rt.c2007-11-20 19:53:10.0 
-0500
+++ linux-compile.git/kernel/sched_rt.c 2007-11-20 19:53:11.0 -0500
@@ -157,11 +157,23 @@ static int select_task_rq_rt(struct task
struct rq *rq = task_rq(p);
 
/*
-* If the task will not preempt the RQ, try to find a better RQ
-* before we even activate the task
+* If the current task is an RT task, then
+* try to see if we can wake this RT task up on another
+* runqueue. Otherwise simply start this RT task
+* on its current runqueue.
+*
+* We want to avoid overloading runqueues. Even if
+* the RT task is of higher priority than the current RT task.
+* RT tasks behave differently than other tasks. If
+* one gets preempted, we try to push it off to another queue.
+* So trying to keep a preempting RT task on the same
+* cache hot CPU will force the running RT task to
+* a cold CPU. So we waste all the cache for the lower
+* RT task in hopes of saving some of a RT task
+* that is just being woken and probably will have
+* cold cache anyway.
 */
-   if ((p->prio >= rq->rt.highest_prio)
-   && (p->nr_cpus_allowed > 1)) {
+   if (unlikely(rt_task(rq->curr))) {
int cpu = find_lowest_rq(p);
 
return (cpu == -1) ? task_cpu(p) : cpu;

-- 
-


[PATCH v4 02/20] track highest prio queued on runqueue

2007-11-20 Thread Steven Rostedt
This patch adds accounting to each runqueue to keep track of the
highest prio task queued on the run queue. We only care about
RT tasks, so if the run queue does not contain any active RT tasks
its priority will be considered MAX_RT_PRIO.

This information will be used for later patches.

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---

 kernel/sched.c    |    3 +++
 kernel/sched_rt.c |   18 ++++++++++++++++++
 2 files changed, 21 insertions(+)

Index: linux-compile.git/kernel/sched.c
===================================================================
--- linux-compile.git.orig/kernel/sched.c	2007-11-20 19:52:50.000000000 -0500
+++ linux-compile.git/kernel/sched.c	2007-11-20 19:52:55.000000000 -0500
@@ -267,6 +267,8 @@ struct rt_rq {
int rt_load_balance_idx;
struct list_head *rt_load_balance_head, *rt_load_balance_curr;
unsigned long rt_nr_running;
+   /* highest queued rt task prio */
+   int highest_prio;
 };
 
 /*
@@ -6776,6 +6778,7 @@ void __init sched_init(void)
rq->cpu = i;
rq->migration_thread = NULL;
	INIT_LIST_HEAD(&rq->migration_queue);
+   rq->rt.highest_prio = MAX_RT_PRIO;
 #endif
	atomic_set(&rq->nr_iowait, 0);
 
Index: linux-compile.git/kernel/sched_rt.c
===================================================================
--- linux-compile.git.orig/kernel/sched_rt.c	2007-11-20 19:52:50.000000000 -0500
+++ linux-compile.git/kernel/sched_rt.c	2007-11-20 19:52:55.000000000 -0500
@@ -29,6 +29,10 @@ static inline void inc_rt_tasks(struct t
 {
WARN_ON(!rt_task(p));
rq->rt.rt_nr_running++;
+#ifdef CONFIG_SMP
+   if (p->prio < rq->rt.highest_prio)
+   rq->rt.highest_prio = p->prio;
+#endif /* CONFIG_SMP */
 }
 
 static inline void dec_rt_tasks(struct task_struct *p, struct rq *rq)
@@ -36,6 +40,20 @@ static inline void dec_rt_tasks(struct t
WARN_ON(!rt_task(p));
WARN_ON(!rq->rt.rt_nr_running);
rq->rt.rt_nr_running--;
+#ifdef CONFIG_SMP
+   if (rq->rt.rt_nr_running) {
+   struct rt_prio_array *array;
+
+   WARN_ON(p->prio < rq->rt.highest_prio);
+   if (p->prio == rq->rt.highest_prio) {
+   /* recalculate */
+   array = &rq->rt.active;
+   rq->rt.highest_prio =
+   sched_find_first_bit(array->bitmap);
+   } /* otherwise leave rq->highest prio alone */
+   } else
+   rq->rt.highest_prio = MAX_RT_PRIO;
+#endif /* CONFIG_SMP */
 }
 
 static void enqueue_task_rt(struct rq *rq, struct task_struct *p, int wakeup)

-- 


[PATCH v4 04/20] RT overloaded runqueues accounting

2007-11-20 Thread Steven Rostedt
This patch adds an RT overload accounting system. When a runqueue has
more than one RT task queued, it is marked as overloaded; that is, it
becomes a candidate to have RT tasks pulled from it.

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---

 kernel/sched_rt.c |   36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

Index: linux-compile.git/kernel/sched_rt.c
===================================================================
--- linux-compile.git.orig/kernel/sched_rt.c	2007-11-20 19:52:56.000000000 -0500
+++ linux-compile.git/kernel/sched_rt.c	2007-11-20 19:52:57.000000000 -0500
@@ -3,6 +3,38 @@
  * policies)
  */
 
+#ifdef CONFIG_SMP
+static cpumask_t rt_overload_mask;
+static atomic_t rto_count;
+static inline int rt_overloaded(void)
+{
+   return atomic_read(&rto_count);
+}
+static inline cpumask_t *rt_overload(void)
+{
+   return &rt_overload_mask;
+}
+static inline void rt_set_overload(struct rq *rq)
+{
+   cpu_set(rq->cpu, rt_overload_mask);
+   /*
+* Make sure the mask is visible before we set
+* the overload count. That is checked to determine
+* if we should look at the mask. It would be a shame
+* if we looked at the mask, but the mask was not
+* updated yet.
+*/
+   wmb();
+   atomic_inc(&rto_count);
+}
+static inline void rt_clear_overload(struct rq *rq)
+{
+   /* the order here really doesn't matter */
+   atomic_dec(&rto_count);
+   cpu_clear(rq->cpu, rt_overload_mask);
+}
+#endif /* CONFIG_SMP */
+
 /*
  * Update the current task's runtime statistics. Skip current tasks that
  * are not in our scheduling class.
@@ -32,6 +64,8 @@ static inline void inc_rt_tasks(struct t
 #ifdef CONFIG_SMP
if (p->prio < rq->rt.highest_prio)
rq->rt.highest_prio = p->prio;
+   if (rq->rt.rt_nr_running > 1)
+   rt_set_overload(rq);
 #endif /* CONFIG_SMP */
 }
 
@@ -53,6 +87,8 @@ static inline void dec_rt_tasks(struct t
} /* otherwise leave rq->highest prio alone */
} else
rq->rt.highest_prio = MAX_RT_PRIO;
+   if (rq->rt.rt_nr_running < 2)
+   rt_clear_overload(rq);
 #endif /* CONFIG_SMP */
 }
 

-- 


[PATCH v4 00/20] New RT Balancing version 4

2007-11-20 Thread Steven Rostedt

[
  Changes since V3:

Updated to git tree 2ffbb8377c7a0713baf6644e285adc27a5654582

Removed cpumask_t from stacks (using per_cpu masks).

Optimized the searching for overloaded queues a bit.
  (a lot of work in this area)

Run RT balance logic on waking of new tasks.

The tarball of these patches is also available at
  http://rostedt.homelinux.com/rt/rt-balance-patches-v4.tar.bz2
]

Currently in mainline the balancing of multiple RT threads is quite broken:
a high priority thread scheduled on a CPU that is already running an even
higher priority thread may wait unnecessarily, even though it could easily
run on another CPU that's running a lower priority thread.

Balancing (or migrating) tasks in general is an art. Lots of considerations
must be taken into account: cache lines, NUMA and more. This is true
for general processes, which expect high throughput and whose migration can
be done in batch.  But when it comes to RT tasks, we really need to
move them to a CPU they can run on as soon as possible. Even
if it means a bit of cache line flushing.

Right now an RT task can wait several milliseconds before it gets scheduled
to run. And perhaps even longer. The migration thread is not fast enough
to take care of RT tasks.

To demonstrate this, I wrote a simple test.
 
  http://rostedt.homelinux.com/rt/rt-migrate-test.c

  (gcc -o rt-migrate-test rt-migrate-test.c -lpthread)

This test expects a parameter to pass in the number of threads to create.
If you add the '-c' option (check) it will terminate if the test fails
one of the iterations. If you add this, pass in +1 threads.

For example, on a 4 way box, I used

  rt-migrate-test -c 5

What this test does is to create the number of threads specified (in this
case 5). Each thread is set as an RT FIFO task starting at a specified
prio (default 2) and each thread being one priority higher. So with this
example the 5 threads created are at priorities 2, 3, 4, 5, and 6.

The parent thread sets its priority to one higher than the highest of
the children (this example 7). It uses pthread_barrier_wait to synchronize
the threads.  Then it takes a time stamp and starts all the threads.
The threads when woken up take a time stamp and compares it to the parent
thread to see how long it took to be awoken. It then runs for an
interval (20ms default) in a busy loop. The busy loop ends when it reaches
the interval delta from the start time stamp. So if it is preempted, it
may not actually run for the full interval. This is expected behavior
of the test.

The numbers recorded are the delta from the thread's time stamp from the
parent time stamp. The number of iterations it ran the busy loop for, and
the delta from a thread time stamp taken at the end of the loop to the
parent time stamp.

Sometimes a lower priority task might wake up before a higher priority
one, but this is OK, as long as the higher priority process gets the CPU
when it is awoken.

At the end of the test, the iteration data is printed to stdout. If a
higher priority task had to wait for a lower one to finish running, then
this is considered a failure. Here's an example of the output from
a run against git commit 4fa4d23fa20de67df919030c1216295664866ad7.

   1:   36  33   20041  39  33
 len:   20036   20033   40041   20039   20033
 loops: 167789  167693  227167  167829  167814

On iteration 1 (starts at 0) the third task started 20ms after the parent
woke it up. We can see here that the first two tasks ran to completion
before the higher priority task was even able to start. That is a
20ms latency for the higher priority task!!!

So people who think that upping the priority of their audio threads would
avoid most latencies may be in for a surprise, since some kernel threads
(like the migration thread itself) can cause this latency.

To solve this issue, I've changed the RT task balancing from a passive
method (migration thread) to an active method.  This new method is
to actively push or pull RT tasks when they are woken up or scheduled.

On wake up of a task, if it is an RT task and there's already an RT task
of higher priority running on its runqueue, we initiate the push_rt_tasks
algorithm. This algorithm looks at the highest priority non-running RT
task and tries to find a CPU it can run on. It only migrates the RT
task if it finds a CPU (of lowest priority) where the RT task
can run and can preempt the currently running task on that CPU.
We continue pushing RT tasks until we can't push any more.

If an RT task fails to be migrated, we stop pushing. This is possible
because we always look at the highest priority RT task on the
run queue; if it can't migrate, then most likely the lower priority RT
tasks can not either.

There is one case that is not covered by this patch set. That is that
when the highest priority non running RT task has its CPU affinity
in such a way that it can not preempt any tasks on the CPUs running
on CPUs of its affinity. But a lower 

[PATCH v4 18/20] Optimize cpu search with hamming weight

2007-11-20 Thread Steven Rostedt
From: Gregory Haskins <[EMAIL PROTECTED]>

We can cheaply track the number of bits set in the cpumask for the lowest
priority CPUs.  Therefore, compute the mask's weight and use it to skip
the optimal domain search logic when there is only one CPU available.

Signed-off-by: Gregory Haskins <[EMAIL PROTECTED]>
---

 kernel/sched_rt.c |   25 ++++++++++++++++++-------
 1 file changed, 18 insertions(+), 7 deletions(-)

Index: linux-compile.git/kernel/sched_rt.c
===================================================================
--- linux-compile.git.orig/kernel/sched_rt.c	2007-11-20 19:53:13.000000000 -0500
+++ linux-compile.git/kernel/sched_rt.c	2007-11-20 19:53:15.000000000 -0500
@@ -300,7 +300,7 @@ static int find_lowest_cpus(struct task_
int   cpu;
cpumask_t *valid_mask = &__get_cpu_var(valid_cpu_mask);
int   lowest_prio = -1;
-   int   ret = 0;
+   int   count   = 0;
 
cpus_clear(*lowest_mask);
cpus_and(*valid_mask, cpu_online_map, task->cpus_allowed);
@@ -313,7 +313,7 @@ static int find_lowest_cpus(struct task_
 
/* We look for lowest RT prio or non-rt CPU */
if (rq->rt.highest_prio >= MAX_RT_PRIO) {
-   if (ret)
+   if (count)
cpus_clear(*lowest_mask);
cpu_set(rq->cpu, *lowest_mask);
return 1;
@@ -325,14 +325,17 @@ static int find_lowest_cpus(struct task_
if (rq->rt.highest_prio > lowest_prio) {
/* new low - clear old data */
lowest_prio = rq->rt.highest_prio;
-   cpus_clear(*lowest_mask);
+   if (count) {
+   cpus_clear(*lowest_mask);
+   count = 0;
+   }
}
cpu_set(rq->cpu, *lowest_mask);
-   ret = 1;
+   count++;
}
}
 
-   return ret;
+   return count;
 }
 
 static inline int pick_optimal_cpu(int this_cpu, cpumask_t *mask)
@@ -356,9 +359,17 @@ static int find_lowest_rq(struct task_st
cpumask_t *lowest_mask = &__get_cpu_var(local_cpu_mask);
int this_cpu = smp_processor_id();
int cpu  = task_cpu(task);
+   int count= find_lowest_cpus(task, lowest_mask);
 
-   if (!find_lowest_cpus(task, lowest_mask))
-   return -1;
+   if (!count)
+   return -1; /* No targets found */
+
+   /*
+* There is no sense in performing an optimal search if only one
+* target is found.
+*/
+   if (count == 1)
+   return first_cpu(*lowest_mask);
 
/*
 * At this point we have built a mask of cpus representing the

-- 


Re: [PATCH 1/2] sata_nv: don't use legacy DMA in ADMA mode

2007-11-20 Thread Robert Hancock

Tejun Heo wrote:

Tejun Heo wrote:

If so, can you please add that switching into register mode is okay as
long as there's no other ADMA commands in flight and add
WARN_ON((qc->flags & ATA_QCFLAG_RESULT_TF) && link->sactive)?


More accurately, link->sactive test can be substituted with
(ap->qc_allocated & ~(1 << qc->tag)).


Unfortunately we only get the ata_port and ata_taskfile in the tf_read 
callback, so I'm not sure if we can do the equivalent of the qc->flags & 
ATA_QCFLAG_RESULT_TF test (i.e. distinguishing between the 
error-handling case where we care if we abort outstanding commands and 
the normal case with a RESULT_TF command where we do)..


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/



  1   2   3   4   5   6   7   8   9   10   >