Re: [patch 00/12] Syslets, Threadlets, generic AIO support, v5

2007-03-13 Thread Milton Miller

Anton Blanchard wrote:

Hi Ingo,


this is the v5 release of the syslet/threadlet subsystem:

   http://redhat.com/~mingo/syslet-patches/


Nice!



I too went and downloaded patches-v5 for review.

First off, one problem I noticed in sys_async_wait:

+   ah->events_left = min_wait_events - (kernel_ring_idx - user_ring_idx);

This completely misses the wraparound case of kernel_ring_idx <
user_ring_idx. I wonder if this is causing some of the benchmark
problems?

(add max_ring_index if kernel < user).
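Something along these lines (just a sketch, not tested; max_ring_idx is
standing in for the ring size):

	/*
	 * Sketch: how many more events must complete before the waiter
	 * may run, tolerating kernel_ring_idx wrapping past user_ring_idx.
	 */
	static unsigned long events_still_needed(unsigned long min_wait_events,
						 unsigned long kernel_ring_idx,
						 unsigned long user_ring_idx,
						 unsigned long max_ring_idx)
	{
		unsigned long pending = kernel_ring_idx - user_ring_idx;

		if (kernel_ring_idx < user_ring_idx)	/* wrapped */
			pending += max_ring_idx;

		return min_wait_events > pending ? min_wait_events - pending : 0;
	}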


I tried to port this to ppc64 but found a few problems:

The 64bit powerpc ABI has the concept of a TOC (r2) which is used for
per-function data. This means this won't work:

[deleted]
I think we would want to change restore_ip to restore_function, and
then create a per-arch helper, perhaps:

void set_user_context(struct task_struct *task, unsigned long stack,
  unsigned long function, unsigned long retval);

ppc64 could then grab the ip and r2 values from the function 
descriptor.
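Something like the sketch below, assuming the usual ELFv1 function
descriptor layout (entry point, TOC, environment); the names and the
error handling are purely illustrative, not taken from the patches:

	struct func_desc {
		unsigned long entry;	/* instruction pointer */
		unsigned long toc;	/* r2 value for the callee */
		unsigned long env;
	};

	void set_user_context(struct task_struct *task, unsigned long stack,
			      unsigned long function, unsigned long retval)
	{
		struct func_desc fd;
		struct pt_regs *regs = task_pt_regs(task);

		/* 'function' points at a descriptor, not at code */
		if (copy_from_user(&fd, (void __user *)function, sizeof(fd)))
			return;

		regs->nip = fd.entry;	/* resume address */
		regs->gpr[2] = fd.toc;	/* per-function TOC */
		regs->gpr[1] = stack;	/* stack pointer */
		regs->gpr[3] = retval;	/* return value */
	}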


The other issue involves the syscall table:

asmlinkage struct syslet_uatom __user *
sys_async_exec(struct syslet_uatom __user *uatom,
   struct async_head_user __user *ahu)
{
return __sys_async_exec(uatom, ahu, sys_call_table, 
NR_syscalls);

}

This exposes the layout of the syscall table. Unfortunately it won't
work on ppc64. In arch/powerpc/kernel/systbl.S:

#define COMPAT_SYS(func)	.llong	.sys_##func,.compat_sys_##func

Both syscall tables are overlaid.

Anton


In addition, the entries in the table are not function pointers, they
are the actual code targets.  So we need an arch helper to invoke the
system call.

Here is another problem with your compat code.  Just telling user space
that everything is u64 and having the kernel retrieve pointers and ulongs
doesn't work; you have to actually copy in u64 values and truncate them
down.  Your current code is broken on all 32-bit big-endian kernels.
Actually, the check needs to be that the upper 32 bits are 0, or return
-EINVAL.

In addition, the compat syscall entry points assume that the arguments
have been truncated to compat_ulong values by the syscall entry path,
and that they only need to do sign extension (and/or pointer conversion
on s390 with its 31 bit pointers).  So all compat kernels are broken.

The two of these things together make me think we want two copy
functions.  At that point we may as well define the struct uatom in
terms of ulong, and in terms of compat_ulong for the compat_uatom.  That
would lead to two copies of exec_uatom, but eliminate passing the
syscall table down as an argument.  The need_resched and signal check
could become part of the common next_uatom routine, although it would
need to be told uatom+1 instead of doing the addition itself.
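To make that concrete, roughly (field names modeled on the v5 uatom, so
treat them as a guess rather than the real layout):

	/* native: argument slots are pointers to longs */
	struct syslet_uatom {
		unsigned long			flags;
		unsigned long			nr;
		long __user			*ret_ptr;
		struct syslet_uatom __user	*next;
		unsigned long __user		*arg_ptr[6];
	};

	/* compat: every slot is explicitly 32 bits wide */
	struct compat_syslet_uatom {
		compat_ulong_t			flags;
		compat_ulong_t			nr;
		compat_uptr_t			ret_ptr;
		compat_uptr_t			next;
		compat_uptr_t			arg_ptr[6];
	};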

Other observations:

All the logic setting at and async_ready is a bit hard to follow.
After some analysis, t->at is only ever set to t->__at and
async_ready is only set to the same at or NULL.  Both of these
should become flags, and at->task should be converted to
container_of.  Also, the name at is hard to grep / search for.

The stop flags are decoded with a case statement but are not densely
encoded; rather, they are one-hot.  We either need to error on multiple
stop bits being set, stop on each possible condition, or encode them
densely.
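If we go the "error on multiple stop bits" route, the check is cheap
(SYSLET_STOP_MASK is an assumed name covering all the stop flag bits):

	/* sketch: reject atoms that set more than one stop condition */
	static int check_stop_flags(unsigned long flags)
	{
		unsigned long stop = flags & SYSLET_STOP_MASK;

		/* x & (x - 1) is nonzero iff more than one bit is set */
		return (stop & (stop - 1)) ? -EINVAL : 0;
	}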

There is no check for flags being set that are not recognized.
If we ever add a flag for another stop condition this would
lead to incorrect execution by the kernel.

There are some syscalls that can return -EFAULT but later have
force_syscall_noerror.  We should create a stop on ERROR and
clear the force_noerror flag between syscalls.  The umem_add
syscall should add force_noerror if the put_user succeeds.

In copy_uatom, you call verify_read on the entire uatom.  This means
that the full struct size has to be within the process address limit,
which violates your assertion that userspace doesn't need the whole
structure.  If we add the requirement that the space that would be
occupied by the complete atom has to exist, then we can copy the whole
struct uatom with copy_from_user and then copy the args with get_user.
User space can still pack them more densely, and we can still stop
copying on a null arg pointer.  Actually, calling access_ok and then
__get_user can be more expensive on some architectures, because they
have to verify both start and length for access_ok but only the start
for get_user, since they have unmapped areas between user space and
kernel space.  This would also mean that we don't check arg_ptr for
NULL without verifying that get_user actually worked.
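Roughly like this (a sketch only; a kernel-side struct syslet_atom with
flags, nr and an args[6] array is assumed, and the uatom layout is the
one guessed at above):

	/*
	 * Sketch: copy the fixed part with one copy_from_user, then
	 * fetch the argument words with get_user, stopping at the
	 * first NULL arg pointer.
	 */
	static int copy_uatom(struct syslet_atom *atom,
			      struct syslet_uatom __user *uatom)
	{
		struct syslet_uatom tmp;
		int i;

		if (copy_from_user(&tmp, uatom, sizeof(tmp)))
			return -EFAULT;

		atom->flags = tmp.flags;
		atom->nr = tmp.nr;

		for (i = 0; i < 6; i++) {
			atom->args[i] = 0;
			if (!tmp.arg_ptr[i])
				break;
			if (get_user(atom->args[i], tmp.arg_ptr[i]))
				return -EFAULT;
		}
		return 0;
	}
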
The gotos in exec_uatom are just a while loop with a break.

sys_umem_add should be in lib/ under lib-y in the Makefile.
In fact declaring the function weak does not make it a weak
syscall implementation on some architectures.

Weak syscalls aliases to sys_ni_syscall are needed for when
async support is not selected in Kconfig.

The Documentation patch 

Re: Make sure we populate the initroot filesystem late enough

2007-03-13 Thread Benjamin Herrenschmidt

 Hmm. The crash came back after I booted into Mac OS X and back. It was however
 a different crash, I believe it was coming from the USB modules (as it would
 keep going when it happened, and get another crash, which tended to scroll away
 too fast for me to capture) but I believe it was still getting down into the
 slab code and actually dying there.

Have you tried, instead, to apply
38f3323037de22bb0089d08be27be01196e7148b ? (That is revert
39d61db0edb34d60b83c5e0d62d0e906578cc707).

I suspect this is the proper fix...

Ben.

 However, reverting the reversion of
 8d610dd52dd1da696e199e4b4545f33a2a5de5c6 and instead applying
 the following patch:
 
 diff -ru linux-source-2.6.20.orig/arch/powerpc/mm/init_32.c 
 linux-source-2.6.20/arch/powerpc/mm/init_32.c
 --- linux-source-2.6.20.orig/arch/powerpc/mm/init_32.c  2007-02-05 
 05:44:54.0 +1100
 +++ linux-source-2.6.20/arch/powerpc/mm/init_32.c   2007-03-10 
 11:03:56.0 +1100
 @@ -244,7 +244,8 @@
  void free_initrd_mem(unsigned long start, unsigned long end)
  {
 	if (start < end)
 -		printk ("Freeing initrd memory: %ldk freed\n", (end - start) >> 10);
 +		printk ("NOT Freeing initrd memory: %ldk freed\n", (end - start) >> 10);
 +	return;
 	for (; start < end; start += PAGE_SIZE) {
 ClearPageReserved(virt_to_page(start));
 init_page_count(virt_to_page(start));
 
 which if I recall correctly David Woodhouse posted to this thread,
 seems to have fixed it.
 
 I dunno if it's relevant, but my initrd.img is 13193315 bytes long,
 (ie 99 bytes over 12884k) and the above logs:
 NOT Freeing initrd memory: 12888k freed
 which makes sense...
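 (Assuming 4 KiB pages, the 12888k figure is just page-granular rounding:
 	13193315 bytes = 3221 * 4096 + 99, so 3222 pages are reserved,
 	and 3222 * 4 KiB = 12888 KiB -- exactly what gets printed.)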
 
 I of course completely failed to think to check this with the crashing
 kernel, if it seems relevant I can roll back to it and get the numbers.
 



Re: [RFC][PATCH 3/7] Data structures changes for RSS accounting

2007-03-13 Thread Pavel Emelianov
Dave Hansen wrote:
 On Mon, 2007-03-12 at 20:19 +0300, Pavel Emelianov wrote:
 Dave Hansen wrote:
 On Mon, 2007-03-12 at 19:16 +0300, Kirill Korotaev wrote:
 now VE2 maps the same page. You can't determine whether this page is mapped
 to this container or another one w/o page->container pointer. 
 Hi Kirill,

 I thought we can always get from the page to the VMA.  rmap provides
 this to us via page->mapping and the 'struct address_space' or anon_vma.
 Do we agree on that?
 Not completely. When page is unmapped from the *very last*
 user its *first* toucher may already be dead. So we'll never
 find out who it was.
 
 OK, but  this is assuming that we didn't *un*account for the page when
 the last user of the owning container stopped using the page.

That's exactly what we agreed on during our discussions:
When a page gets touched it is charged to that container.
When the page gets touched again by a new container it is NOT
charged to the new container, but keeps holding the old one
till it (the page) is completely freed. Nobody worried about the
fact that a single page can hold a container for good.

OpenVZ beancounters work the other way (and we proposed this
solution when we first sent the patches). We keep track of
*all* the containers (i.e. beancounters) holding this page.

 We can also get from the vma to the mm very easily, via vma->vm_mm,
 right?

 We can also get from a task to the container quite easily.  

 So, the only question becomes whether there is a 1:1 relationship
 between mm_structs and containers.  Does each mm_struct belong to one
 No. The question is how to get a container that touched the
 page first, which is the same as how to find the mm_struct which
 touched the page first. Obviously there's no answer to this
 question unless we hold some direct page->container reference.
 This may be a hash, a direct on-page pointer, or a mirrored
 array of pointers.
 
 Or, you keep track of when the last user from the container goes away,
 and you effectively account it to another one.

We can migrate page to another user but we decided
to implement it later after accepting simple accounting.

 Are there problems with shifting ownership around like this?
 
 -- Dave
 
 



[QUICKLIST 0/4] Arch independent quicklists V2

2007-03-13 Thread Christoph Lameter
V1->V2
- Add sparch64 patch
- Single i386 and x86_64 patch
- Update attribution
- Update justification
- Update approvals
- Earlier discussion of V1 was at
  http://marc.info/?l=linux-kernel&m=117357922219342&w=2

This patchset introduces an arch independent framework to handle lists
of recently used page table pages. It is necessary for x86_64 and
i386 to avoid the special casing of SLUB because these two platforms
use fields in the page_struct (page->index and page->private)
that SLUB needs (and in fact SLAB also needs page->private if
performing debugging!). There is also the tendency of arches to use
page flags to mark page table pages. The slab also uses page flags.
Separating page table page allocation into quicklists avoids the danger
of conflicts and frees up page flags for SLUB and for the arch code.

Page table pages have the characteristics that they are typically zero
or in a known state when they are freed. This is usually the exactly
same state as needed after allocation. So it makes sense to build a list
of freed page table pages and then consume the pages already in use
first. Those pages have already been initialized correctly (thus no
need to zero them) and are likely already cached in such a way that
the MMU can use them most effectively. Page table pages are used in
a sparse way so zeroing them on allocation is not too useful.

Such an implementation already exists for ia64. However, that implementation
did not support constructors and destructors as needed by i386 / x86_64.
It also only supported a single quicklist. The implementation here has
constructor and destructor support as well as the ability for an arch to
specify how many quicklists are needed.

Quicklists are defined by an arch defining the necessary number
of quicklists in arch/<arch>/Kconfig. F.e. i386 needs two and thus
has

config NR_QUICK
int
default 2

If an arch has requested quicklist support then pages can be allocated
from the quicklist (or from the page allocator if the quicklist is
empty) via:

quicklist_alloc(<quicklist nr>, <gfp flags>, <constructor>)

Page table pages can be freed using:

quicklist_free(<quicklist nr>, <destructor>, <page>)

Pages must have a definite state after allocation and before
they are freed. If no constructor is specified then pages
will be zeroed on allocation and must be zeroed before they are
freed.

If a constructor is used then the constructor will establish
a definite page state. F.e. the i386 and x86_64 pgd constructors
establish certain mappings.

Constructors and destructors can also be used to track the pages.
i386 and x86_64 use a list of pgds in order to be able to dynamically
update standard mappings.
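A minimal sketch of what an arch-side user then looks like (the
quicklist number and the NULL constructor/destructor are just an
example, mirroring the sparc64 conversion below):

#include <linux/quicklist.h>

#define QUICK_PT 0	/* arch-chosen quicklist number */

static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
					  unsigned long address)
{
	/* NULL constructor: the page comes back zeroed */
	return (pte_t *)quicklist_alloc(QUICK_PT, GFP_KERNEL, NULL);
}

static inline void pte_free_kernel(pte_t *pte)
{
	/* NULL destructor: the caller must hand back a zeroed page */
	quicklist_free(QUICK_PT, NULL, pte);
}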




[QUICKLIST 1/4] Generic quicklist implementation

2007-03-13 Thread Christoph Lameter
Abstract quicklist from the IA64 implementation

Extract the quicklist implementation for IA64, clean it up
and generalize it to allow multiple quicklists and support
for constructors and destructors.

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

---
 arch/ia64/Kconfig  |4 ++
 arch/ia64/mm/contig.c  |2 -
 arch/ia64/mm/discontig.c   |2 -
 arch/ia64/mm/init.c|   51 ---
 include/asm-ia64/pgalloc.h |   82 -
 include/linux/quicklist.h  |   81 
 mm/Kconfig |5 ++
 mm/Makefile|2 +
 mm/quicklist.c |   81 
 9 files changed, 191 insertions(+), 119 deletions(-)

Index: linux-2.6.21-rc3-mm2/arch/ia64/mm/init.c
===
--- linux-2.6.21-rc3-mm2.orig/arch/ia64/mm/init.c   2007-03-12 
22:49:21.0 -0700
+++ linux-2.6.21-rc3-mm2/arch/ia64/mm/init.c2007-03-12 22:49:23.0 
-0700
@@ -39,9 +39,6 @@
 
 DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
 
-DEFINE_PER_CPU(unsigned long *, __pgtable_quicklist);
-DEFINE_PER_CPU(long, __pgtable_quicklist_size);
-
 extern void ia64_tlb_init (void);
 
 unsigned long MAX_DMA_ADDRESS = PAGE_OFFSET + 0x1UL;
@@ -56,54 +53,6 @@ EXPORT_SYMBOL(vmem_map);
 struct page *zero_page_memmap_ptr; /* map entry for zero page */
 EXPORT_SYMBOL(zero_page_memmap_ptr);
 
-#define MIN_PGT_PAGES  25UL
-#define MAX_PGT_FREES_PER_PASS 16L
-#define PGT_FRACTION_OF_NODE_MEM   16
-
-static inline long
-max_pgt_pages(void)
-{
-   u64 node_free_pages, max_pgt_pages;
-
-#ifndef CONFIG_NUMA
-   node_free_pages = nr_free_pages();
-#else
-   node_free_pages = node_page_state(numa_node_id(), NR_FREE_PAGES);
-#endif
-   max_pgt_pages = node_free_pages / PGT_FRACTION_OF_NODE_MEM;
-   max_pgt_pages = max(max_pgt_pages, MIN_PGT_PAGES);
-   return max_pgt_pages;
-}
-
-static inline long
-min_pages_to_free(void)
-{
-   long pages_to_free;
-
-   pages_to_free = pgtable_quicklist_size - max_pgt_pages();
-   pages_to_free = min(pages_to_free, MAX_PGT_FREES_PER_PASS);
-   return pages_to_free;
-}
-
-void
-check_pgt_cache(void)
-{
-   long pages_to_free;
-
-   if (unlikely(pgtable_quicklist_size <= MIN_PGT_PAGES))
-   return;
-
-   preempt_disable();
-   while (unlikely((pages_to_free = min_pages_to_free()) > 0)) {
-   while (pages_to_free--) {
-   free_page((unsigned long)pgtable_quicklist_alloc());
-   }
-   preempt_enable();
-   preempt_disable();
-   }
-   preempt_enable();
-}
-
 void
 lazy_mmu_prot_update (pte_t pte)
 {
Index: linux-2.6.21-rc3-mm2/include/asm-ia64/pgalloc.h
===
--- linux-2.6.21-rc3-mm2.orig/include/asm-ia64/pgalloc.h2007-03-12 
22:49:21.0 -0700
+++ linux-2.6.21-rc3-mm2/include/asm-ia64/pgalloc.h 2007-03-12 
22:49:23.0 -0700
@@ -18,71 +18,18 @@
 #include <linux/mm.h>
 #include <linux/page-flags.h>
 #include <linux/threads.h>
+#include <linux/quicklist.h>
 
 #include <asm/mmu_context.h>
 
-DECLARE_PER_CPU(unsigned long *, __pgtable_quicklist);
-#define pgtable_quicklist __ia64_per_cpu_var(__pgtable_quicklist)
-DECLARE_PER_CPU(long, __pgtable_quicklist_size);
-#define pgtable_quicklist_size __ia64_per_cpu_var(__pgtable_quicklist_size)
-
-static inline long pgtable_quicklist_total_size(void)
-{
-   long ql_size = 0;
-   int cpuid;
-
-   for_each_online_cpu(cpuid) {
-   ql_size += per_cpu(__pgtable_quicklist_size, cpuid);
-   }
-   return ql_size;
-}
-
-static inline void *pgtable_quicklist_alloc(void)
-{
-   unsigned long *ret = NULL;
-
-   preempt_disable();
-
-   ret = pgtable_quicklist;
-   if (likely(ret != NULL)) {
-   pgtable_quicklist = (unsigned long *)(*ret);
-   ret[0] = 0;
-   --pgtable_quicklist_size;
-   preempt_enable();
-   } else {
-   preempt_enable();
-   ret = (unsigned long *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
-   }
-
-   return ret;
-}
-
-static inline void pgtable_quicklist_free(void *pgtable_entry)
-{
-#ifdef CONFIG_NUMA
-   int nid = page_to_nid(virt_to_page(pgtable_entry));
-
-   if (unlikely(nid != numa_node_id())) {
-   free_page((unsigned long)pgtable_entry);
-   return;
-   }
-#endif
-
-   preempt_disable();
-   *(unsigned long *)pgtable_entry = (unsigned long)pgtable_quicklist;
-   pgtable_quicklist = (unsigned long *)pgtable_entry;
-   ++pgtable_quicklist_size;
-   preempt_enable();
-}
-
 static inline pgd_t *pgd_alloc(struct mm_struct *mm)
 {
-   return pgtable_quicklist_alloc();
+   return 

[QUICKLIST 4/4] Quicklist support for sparc64

2007-03-13 Thread Christoph Lameter
From: David Miller [EMAIL PROTECTED]

[QUICKLIST]: Add sparc64 quicklist support.

I ported this to sparc64 as per the patch below, tested on
UP SunBlade1500 and 24 cpu Niagara T1000.

Signed-off-by: David S. Miller [EMAIL PROTECTED]

---
 arch/sparc64/Kconfig  |4 
 arch/sparc64/mm/init.c|   24 
 arch/sparc64/mm/tsb.c |2 +-
 include/asm-sparc64/pgalloc.h |   26 ++
 4 files changed, 19 insertions(+), 37 deletions(-)

Index: linux-2.6.21-rc3-mm2/arch/sparc64/Kconfig
===
--- linux-2.6.21-rc3-mm2.orig/arch/sparc64/Kconfig  2007-03-12 
22:49:19.0 -0700
+++ linux-2.6.21-rc3-mm2/arch/sparc64/Kconfig   2007-03-12 22:53:30.0 
-0700
@@ -26,6 +26,10 @@ config MMU
bool
default y
 
+config NR_QUICK
+   int
+   default 1
+
 config STACKTRACE_SUPPORT
bool
default y
Index: linux-2.6.21-rc3-mm2/arch/sparc64/mm/init.c
===
--- linux-2.6.21-rc3-mm2.orig/arch/sparc64/mm/init.c2007-03-12 
22:49:19.0 -0700
+++ linux-2.6.21-rc3-mm2/arch/sparc64/mm/init.c 2007-03-12 22:53:30.0 
-0700
@@ -176,30 +176,6 @@ unsigned long sparc64_kern_sec_context _
 
 int bigkernel = 0;
 
-struct kmem_cache *pgtable_cache __read_mostly;
-
-static void zero_ctor(void *addr, struct kmem_cache *cache, unsigned long 
flags)
-{
-   clear_page(addr);
-}
-
-extern void tsb_cache_init(void);
-
-void pgtable_cache_init(void)
-{
-   pgtable_cache = kmem_cache_create("pgtable_cache",
- PAGE_SIZE, PAGE_SIZE,
- SLAB_HWCACHE_ALIGN |
- SLAB_MUST_HWCACHE_ALIGN,
- zero_ctor,
- NULL);
-   if (!pgtable_cache) {
-   prom_printf("Could not create pgtable_cache\n");
-   prom_halt();
-   }
-   tsb_cache_init();
-}
-
 #ifdef CONFIG_DEBUG_DCFLUSH
 atomic_t dcpage_flushes = ATOMIC_INIT(0);
 #ifdef CONFIG_SMP
Index: linux-2.6.21-rc3-mm2/arch/sparc64/mm/tsb.c
===
--- linux-2.6.21-rc3-mm2.orig/arch/sparc64/mm/tsb.c 2007-03-12 
22:49:19.0 -0700
+++ linux-2.6.21-rc3-mm2/arch/sparc64/mm/tsb.c  2007-03-12 22:53:30.0 
-0700
@@ -252,7 +252,7 @@ static const char *tsb_cache_names[8] = 
 	"tsb_1MB",
 };
 
-void __init tsb_cache_init(void)
+void __init pgtable_cache_init(void)
 {
unsigned long i;
 
Index: linux-2.6.21-rc3-mm2/include/asm-sparc64/pgalloc.h
===
--- linux-2.6.21-rc3-mm2.orig/include/asm-sparc64/pgalloc.h 2007-03-12 
22:49:19.0 -0700
+++ linux-2.6.21-rc3-mm2/include/asm-sparc64/pgalloc.h  2007-03-12 
22:53:30.0 -0700
@@ -6,6 +6,7 @@
 #include <linux/sched.h>
 #include <linux/mm.h>
 #include <linux/slab.h>
+#include <linux/quicklist.h>
 
 #include <asm/spitfire.h>
 #include <asm/cpudata.h>
@@ -13,52 +14,50 @@
 #include <asm/page.h>
 
 /* Page table allocation/freeing. */
-extern struct kmem_cache *pgtable_cache;
 
 static inline pgd_t *pgd_alloc(struct mm_struct *mm)
 {
-   return kmem_cache_alloc(pgtable_cache, GFP_KERNEL);
+   return quicklist_alloc(0, GFP_KERNEL, NULL);
 }
 
 static inline void pgd_free(pgd_t *pgd)
 {
-   kmem_cache_free(pgtable_cache, pgd);
+   quicklist_free(0, NULL, pgd);
 }
 
 #define pud_populate(MM, PUD, PMD) pud_set(PUD, PMD)
 
 static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
-   return kmem_cache_alloc(pgtable_cache,
-   GFP_KERNEL|__GFP_REPEAT);
+   return quicklist_alloc(0, GFP_KERNEL, NULL);
 }
 
 static inline void pmd_free(pmd_t *pmd)
 {
-   kmem_cache_free(pgtable_cache, pmd);
+   quicklist_free(0, NULL, pmd);
 }
 
 static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
  unsigned long address)
 {
-   return kmem_cache_alloc(pgtable_cache,
-   GFP_KERNEL|__GFP_REPEAT);
+   return quicklist_alloc(0, GFP_KERNEL, NULL);
 }
 
 static inline struct page *pte_alloc_one(struct mm_struct *mm,
 unsigned long address)
 {
-   return virt_to_page(pte_alloc_one_kernel(mm, address));
+   void *pg = quicklist_alloc(0, GFP_KERNEL, NULL);
+   return pg ? virt_to_page(pg) : NULL;
 }

 static inline void pte_free_kernel(pte_t *pte)
 {
-   kmem_cache_free(pgtable_cache, pte);
+   quicklist_free(0, NULL, pte);
 }
 
 static inline void pte_free(struct page *ptepage)
 {
-   pte_free_kernel(page_address(ptepage));
+   quicklist_free(0, NULL, page_address(ptepage));
 }
 
 
@@ -66,6 +65,9 @@ static inline void 

[QUICKLIST 2/4] Quicklist support for i386

2007-03-13 Thread Christoph Lameter
i386: Convert to quicklists

Implement the i386 management of pgd and pmds using quicklists.

The i386 management of page table pages currently uses page sized slabs.
The page state is therefore mainly determined by the slab code. However,
i386 also uses its own fields in the page struct to mark special pages
and to build a list of pgds using the ->private and ->index field (yuck!).
This has been finely tuned to work right with SLAB but SLUB needs more
control over the page struct. Currently the only way for SLUB to support
these slabs is through special casing PAGE_SIZE slabs.

If we use quicklists instead then we can avoid the mess, and also the
overhead of manipulating page sized objects through slab.

It also allows us to use standard list manipulation macros for the
pgd list using page->lru thereby simplifying the code.

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

---
 arch/i386/Kconfig  |4 ++
 arch/i386/kernel/process.c |1 
 arch/i386/kernel/smp.c |2 -
 arch/i386/mm/fault.c   |5 +--
 arch/i386/mm/init.c|   25 -
 arch/i386/mm/pageattr.c|2 -
 arch/i386/mm/pgtable.c |   63 +
 include/asm-i386/pgalloc.h |2 -
 include/asm-i386/pgtable.h |   13 +++--
 9 files changed, 39 insertions(+), 78 deletions(-)

Index: linux-2.6.21-rc3-mm2/arch/i386/mm/init.c
===
--- linux-2.6.21-rc3-mm2.orig/arch/i386/mm/init.c   2007-03-12 
22:49:20.0 -0700
+++ linux-2.6.21-rc3-mm2/arch/i386/mm/init.c2007-03-12 22:53:27.0 
-0700
@@ -695,31 +695,6 @@ int remove_memory(u64 start, u64 size)
 EXPORT_SYMBOL_GPL(remove_memory);
 #endif
 
-struct kmem_cache *pgd_cache;
-struct kmem_cache *pmd_cache;
-
-void __init pgtable_cache_init(void)
-{
-   if (PTRS_PER_PMD > 1) {
-   pmd_cache = kmem_cache_create("pmd",
-   PTRS_PER_PMD*sizeof(pmd_t),
-   PTRS_PER_PMD*sizeof(pmd_t),
-   0,
-   pmd_ctor,
-   NULL);
-   if (!pmd_cache)
-   panic("pgtable_cache_init(): cannot create pmd cache");
-   }
-   pgd_cache = kmem_cache_create("pgd",
-   PTRS_PER_PGD*sizeof(pgd_t),
-   PTRS_PER_PGD*sizeof(pgd_t),
-   0,
-   pgd_ctor,
-   PTRS_PER_PMD == 1 ? pgd_dtor : NULL);
-   if (!pgd_cache)
-   panic("pgtable_cache_init(): Cannot create pgd cache");
-}
-
 /*
  * This function cannot be __init, since exceptions don't work in that
  * section.  Put this after the callers, so that it cannot be inlined.
Index: linux-2.6.21-rc3-mm2/arch/i386/mm/pgtable.c
===
--- linux-2.6.21-rc3-mm2.orig/arch/i386/mm/pgtable.c2007-03-12 
22:49:20.0 -0700
+++ linux-2.6.21-rc3-mm2/arch/i386/mm/pgtable.c 2007-03-12 22:53:27.0 
-0700
@@ -13,6 +13,7 @@
 #include <linux/pagemap.h>
 #include <linux/spinlock.h>
 #include <linux/module.h>
+#include <linux/quicklist.h>
 
 #include <asm/system.h>
 #include <asm/pgtable.h>
@@ -181,9 +182,12 @@ void reserve_top_address(unsigned long r
 #endif
 }
 
+#define QUICK_PGD 0
+#define QUICK_PT 1
+
 pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address)
 {
-   return (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
+   return (pte_t *)quicklist_alloc(QUICK_PT, GFP_KERNEL, NULL);
 }
 
 struct page *pte_alloc_one(struct mm_struct *mm, unsigned long address)
@@ -198,11 +202,6 @@ struct page *pte_alloc_one(struct mm_str
return pte;
 }
 
-void pmd_ctor(void *pmd, struct kmem_cache *cache, unsigned long flags)
-{
-   memset(pmd, 0, PTRS_PER_PMD*sizeof(pmd_t));
-}
-
 /*
  * List of all pgd's needed for non-PAE so it can invalidate entries
  * in both cached and uncached pgd's; not needed for PAE since the
@@ -211,36 +210,15 @@ void pmd_ctor(void *pmd, struct kmem_cac
  * against pageattr.c; it is the unique case in which a valid change
  * of kernel pagetables can't be lazily synchronized by vmalloc faults.
  * vmalloc faults work because attached pagetables are never freed.
- * The locking scheme was chosen on the basis of manfred's
- * recommendations and having no core impact whatsoever.
  * -- wli
  */
 DEFINE_SPINLOCK(pgd_lock);
-struct page *pgd_list;
-
-static inline void pgd_list_add(pgd_t *pgd)
-{
-   struct page *page = virt_to_page(pgd);
-   page->index = (unsigned long)pgd_list;
-   if (pgd_list)
-   set_page_private(pgd_list, (unsigned long)&page->index);
-   pgd_list = page;
-   set_page_private(page, (unsigned long)&pgd_list);
-}
+LIST_HEAD(pgd_list);
 
-static inline void pgd_list_del(pgd_t *pgd)
-{
-   

[QUICKLIST 3/4] Quicklist support for x86_64

2007-03-13 Thread Christoph Lameter
Convert x86_64 to using quicklists

This adds caching of pgds, puds, pmds and ptes. That way we can
avoid costly zeroing and initialization of special mappings in the
pgd.

A second quicklist is useful to separate out PGD handling. We can carry
the initialized pgds over to the next process needing them.

Also clean up the pgd_list handling to use regular list macros.
There is no need anymore to avoid the lru field.

Move the add/removal of the pgds to the pgdlist into the
constructor / destructor. That way the implementation is
congruent with i386.

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

---
 arch/x86_64/Kconfig  |4 ++
 arch/x86_64/kernel/process.c |1 
 arch/x86_64/kernel/smp.c |2 -
 arch/x86_64/mm/fault.c   |5 +-
 include/asm-x86_64/pgalloc.h |   76 +--
 include/asm-x86_64/pgtable.h |3 -
 mm/Kconfig   |5 ++
 7 files changed, 52 insertions(+), 44 deletions(-)

Index: linux-2.6.21-rc3-mm2/arch/x86_64/Kconfig
===
--- linux-2.6.21-rc3-mm2.orig/arch/x86_64/Kconfig   2007-03-12 
22:49:20.0 -0700
+++ linux-2.6.21-rc3-mm2/arch/x86_64/Kconfig2007-03-12 22:53:28.0 
-0700
@@ -56,6 +56,10 @@ config ZONE_DMA
bool
default y
 
+config NR_QUICK
+   int
+   default 2
+
 config ISA
bool
 
Index: linux-2.6.21-rc3-mm2/include/asm-x86_64/pgalloc.h
===
--- linux-2.6.21-rc3-mm2.orig/include/asm-x86_64/pgalloc.h  2007-03-12 
22:49:20.0 -0700
+++ linux-2.6.21-rc3-mm2/include/asm-x86_64/pgalloc.h   2007-03-12 
22:53:28.0 -0700
@@ -4,6 +4,10 @@
 #include <asm/pda.h>
 #include <linux/threads.h>
 #include <linux/mm.h>
+#include <linux/quicklist.h>
+
+#define QUICK_PGD 0	/* We preserve special mappings over free */
+#define QUICK_PT 1 /* Other page table pages that are zero on free */
 
 #define pmd_populate_kernel(mm, pmd, pte) \
set_pmd(pmd, __pmd(_PAGE_TABLE | __pa(pte)))
@@ -20,86 +24,77 @@ static inline void pmd_populate(struct m
 static inline void pmd_free(pmd_t *pmd)
 {
	BUG_ON((unsigned long)pmd & (PAGE_SIZE-1));
-   free_page((unsigned long)pmd);
+   quicklist_free(QUICK_PT, NULL, pmd);
 }
 
 static inline pmd_t *pmd_alloc_one (struct mm_struct *mm, unsigned long addr)
 {
-   return (pmd_t *)get_zeroed_page(GFP_KERNEL|__GFP_REPEAT);
+   return (pmd_t *)quicklist_alloc(QUICK_PT, GFP_KERNEL|__GFP_REPEAT, 
NULL);
 }
 
 static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
-   return (pud_t *)get_zeroed_page(GFP_KERNEL|__GFP_REPEAT);
+   return (pud_t *)quicklist_alloc(QUICK_PT, GFP_KERNEL|__GFP_REPEAT, 
NULL);
 }
 
 static inline void pud_free (pud_t *pud)
 {
	BUG_ON((unsigned long)pud & (PAGE_SIZE-1));
-   free_page((unsigned long)pud);
+   quicklist_free(QUICK_PT, NULL, pud);
 }
 
-static inline void pgd_list_add(pgd_t *pgd)
+static inline void pgd_ctor(void *x)
 {
+   unsigned boundary;
+   pgd_t *pgd = x;
struct page *page = virt_to_page(pgd);
 
+   /*
+* Copy kernel pointers in from init.
+*/
+   boundary = pgd_index(__PAGE_OFFSET);
+   memcpy(pgd + boundary,
+   init_level4_pgt + boundary,
+   (PTRS_PER_PGD - boundary) * sizeof(pgd_t));
+
	spin_lock(&pgd_lock);
-   page->index = (pgoff_t)pgd_list;
-   if (pgd_list)
-   pgd_list->private = (unsigned long)&page->index;
-   pgd_list = page;
-   page->private = (unsigned long)&pgd_list;
+   list_add(&page->lru, &pgd_list);
	spin_unlock(&pgd_lock);
 }
 
-static inline void pgd_list_del(pgd_t *pgd)
+static inline void pgd_dtor(void *x)
 {
-   struct page *next, **pprev, *page = virt_to_page(pgd);
+   pgd_t *pgd = x;
+   struct page *page = virt_to_page(pgd);
 
	spin_lock(&pgd_lock);
-   next = (struct page *)page->index;
-   pprev = (struct page **)page->private;
-   *pprev = next;
-   if (next)
-   next->private = (unsigned long)pprev;
+   list_del(&page->lru);
	spin_unlock(&pgd_lock);
 }
 
+
 static inline pgd_t *pgd_alloc(struct mm_struct *mm)
 {
-   unsigned boundary;
-   pgd_t *pgd = (pgd_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
-   if (!pgd)
-   return NULL;
-   pgd_list_add(pgd);
-   /*
-* Copy kernel pointers in from init.
-* Could keep a freelist or slab cache of those because the kernel
-* part never changes.
-*/
-   boundary = pgd_index(__PAGE_OFFSET);
-   memset(pgd, 0, boundary * sizeof(pgd_t));
-   memcpy(pgd + boundary,
-  init_level4_pgt + boundary,
-  (PTRS_PER_PGD - boundary) * sizeof(pgd_t));
+   pgd_t *pgd = (pgd_t *)quicklist_alloc(QUICK_PGD,
+GFP_KERNEL|__GFP_REPEAT, pgd_ctor);
+
  

Re: [RFC][PATCH 2/7] RSS controller core

2007-03-13 Thread Pavel Emelianov
Herbert Poetzl wrote:
 On Mon, Mar 12, 2007 at 12:02:01PM +0300, Pavel Emelianov wrote:
 Maybe you have some ideas how we can decide on this?
 We need to work out what the requirements are before we can 
 settle on an implementation.
 Linux-VServer (and probably OpenVZ):

  - shared mappings of 'shared' files (binaries 
and libraries) to allow for reduced memory
footprint when N identical guests are running
 This is done in current patches.
 
 nice, but the question was about _requirements_
 (so your requirements are?)
 
  - virtual 'physical' limit should not cause
swap out when there are still pages left on
the host system (but pages of over limit guests
can be preferred for swapping)
 So what to do when virtual physical limit is hit?
 OOM-kill current task?
 
 when the RSS limit is hit, but there _are_ enough
 pages left on the physical system, there is no
 good reason to swap out the page at all
 
  - there is no benefit in doing so (performance
wise, that is)
 
  - it actually hurts performance, and could
become a separate source for DoS
 
 what should happen instead (in an ideal world :)
 is that the page is considered swapped out for
 the guest (add guest penality for swapout), and 

Does the page stay mapped for the container or not?
If yes, then what's the use of limits? The container has
mapped more pages than the limit allows, but all the pages
are still in memory. Sounds weird.

 when the page would be swapped in again, the guest
 takes a penalty (for the 'virtual' page in) and
 the page is returned to the guest, possibly kicking
 out (again virtually) a different page
 
  - accounting and limits have to be consistent
and should roughly represent the actual used
memory/swap (modulo optimizations, I can go
into detail here, if necessary)
 This is true for the current implementation for
 both - this patchset and OpenVZ beancounters.

 If you sum up the physpages values for all containers
 you'll get the exact number of RAM pages used.
 
 hmm, including or excluding the host pages?

Depends on whether you account host pages or not.

  - OOM handling on a per guest basis, i.e. some
out of memory condition in guest A must not
affect guest B
 This is done in current patches.
 
 Herbert, did you look at the patches before
 sending this mail or do you just want to
 'take part' in conversation w/o understanding
 of hat is going on?
 
 again, the question was about requirements, not
 your patches, and yes, I had a look at them _and_
 the OpenVZ implementations ...
 
 best,
 Herbert
 
 PS: hat is going on? :)
 
 HTC,
 Herbert

 Sigh.  Who is running this show?   Anyone?

 You can actually do a form of overcommittment by allowing multiple
 containers to share one or more of the zones. Whether that is
 sufficient or suitable I don't know. That depends on the requirements,
 and we haven't even discussed those, let alone agreed to them.

 ___
 Containers mailing list
 [EMAIL PROTECTED]
 https://lists.osdl.org/mailman/listinfo/containers
 



Re: RSDL-mm 0.28

2007-03-13 Thread Nick Piggin

David Schwartz wrote:

There's a substantial performance hit for not yield, so we probably
want to investigate alternate semantics for it. It seems reasonable
for apps to say "let me not hog the CPU" without completely expiring
them. Imagine you're in the front of the line (aka queue) and you
spend a moment fumbling for your wallet. The polite thing to do is to
let the next guy in front. But with the current sched_yield, you go
all the way to the back of the line.




Well... are you advocating we change sched_yield semantics to a
gentler form? This is a cinch to implement but I know how Ingo feels
about this. It will only encourage more lax coding using sched_yield
instead of proper blocking (see huge arguments with the ldap people on
this one who insist it's impossible not to use yield).



The basic point of sched_yield is to allow every other process at the same
static priority level a chance to use the CPU before you get it back. It is
generally an error to use sched_yield to be nice. It's nice to get your work
done when the scheduler gives you the CPU, that's why it gave it to you.

It is proper to use sched_yield as an optimization when it more efficient to
allow another process/thread to run than you, for example, when you
encounter a task you cannot do efficiently at that time because another
thread holds a lock.

It's also useful prior to doing something that can most efficiently be done
without interruption. So a thread that returns from 'sched_yield' should
ideally be given a full timeslice if possible. This may not be sensible if
the 'sched_yield' didn't actually yield, but then again, if nothing else
wants to run, why not give the only task that does a full slice?

In no case is much of anything guaranteed, of course. (What can you do if
there's no other process to yield to?)

Note that processes that call sched_yield should be rewarded for doing so
just as process that block on I/O are, assuming they do in fact wind up
giving up the CPU when they would otherwise have had it.

DS






--
SUSE Labs, Novell Inc.




Re: sys_write() racy for multi-threaded append?

2007-03-13 Thread Michael K. Edwards

On 3/12/07, Alan Cox [EMAIL PROTECTED] wrote:

 Writing to a file from multiple processes is not usually the problem.
 Writing to a common struct file from multiple threads is.

Not normally because POSIX sensibly invented pread/pwrite. Forgot
preadv/pwritev but they did the basics and end of problem


pread/pwrite address a miniscule fraction of lseek+read(v)/write(v)
use cases -- a fraction that someone cared about strongly enough to
get into X/Open CAE Spec Issue 5 Version 2 (1997), from which it
propagated into UNIX98 and thence into POSIX.2 2001.  The fact that no
one has bothered to implement preadv/pwritev in the decade since
pread/pwrite entered the Single UNIX standard reflects the rarity with
which they appear in general code.  Life is too short to spend it
rewriting application code that uses readv/writev systematically,
especially when that code is going to ship inside a widget whose
kernel you control.


 So what?  My products are shipping _now_.

That doesn't inspire confidence.


Oh, please.  Like _your_ employer is the poster child for code
quality.  The cheap shot is also irrelevant to the point that I was
making, which is that sometimes portability simply doesn't matter and
the right thing to do is to firm up the semantics of the filesystem
primitives from underneath.


 even funny.  If POSIX mandates stupid shit, and application
 programmers don't read that part of the manual anyway (and don't code
 on that assumption in practice), to hell with POSIX.  On many file

Thats funny, you were talking about quality a moment ago.


Quality means the devices you ship now keep working in the field, and
the probable cost of later rework if the requirements change does not
exceed the opportunity cost of over-engineering up front.  Economy
gets a look-in too, and says that it's pointless to delay shipment and
bloat the application coding for cases that can't happen.  If POSIX
says that any and all writes (except small pipe/FIFO writes, whatever)
can return a short byte count -- but you know damn well you're writing
to a device driver that never, ever writes short, and if it did you'd
miss a timing budget recovering from it anyway -- to hell with POSIX.
And if you want to build a test jig for this code that uses pipes or
dummy files in place of the device driver, that test jig should never,
ever write short either.


 descriptors, short writes simply can't happen -- and code that

There is almost no descriptor this is true for. Any file I/O can and will
end up short on disk full or resource limit exceeded or quota exceeded or
NFS server exploded or ...


Not on a properly engineered widget, it won't.  And if it does, and
the application isn't coded to cope in some way totally different from
an infinite retry loop, then you might as well signal the exception
condition using whatever mechanism is appropriate to the API
(-EWHATEVER, SIGCRISIS, or block until some other process makes room).
And in any case files on disk are the least interesting kind of file
descriptor in an embedded scenario -- devices and pipes and pollfds
and netlink sockets are far more frequent read/write targets.


And on the device side about the only thing with the vaguest guarantees
is pipe().


Guaranteed by the standard, sure.  Guaranteed by the implementation,
as long as you write in the size blocks that the device is expecting?
Lots of devices -- ALSA's OSS PCM emulation, most AF_LOCAL and
AF_NETLINK sockets, almost any character device with a
record-structured format.  A short write to any of these almost
certainly means the framing is screwed and you need to close and
reopen the device.  Not all of these are exclusively O_APPEND
situations, and there's no reason on earth not to thread-safe the
f_pos handling so that an application and filesystem/driver can agree
on useful lseek() semantics.


 purports to handle short writes but has never been exercised is
 arguably worse than code that simply bombs on short write.  So if I
 can't shim in an induce-short-writes-randomly-on-purpose mechanism
 during development, I don't want short writes in production, period.

Easy enough to do and gcov plus dejagnu or similar tools will let you
coverage analyse the resulting test set and replay it.


Here we agree.  Except that I've rarely seen embedded application code
that wouldn't explode in my face if I tried it.  Databases yes, and
the better class of mail and web servers, and relatively mature
scripting languages and bytecode interpreters; but the vast majority
of working programmers in these latter days do not exercise this level
of discipline.


 Sure -- until the one code path in a hundred that handles the short
 write case incorrectly gets traversed in production, after having
 gone untested in a development environment that used a different
 filesystem that never happened to trigger it.

Competent QA and testing people test all the returns in the manual as
well as all the returns they can find in the code. See ptrace(2) if 

Re: sys_write() racy for multi-threaded append?

2007-03-13 Thread David Miller
From: Michael K. Edwards [EMAIL PROTECTED]
Date: Mon, 12 Mar 2007 23:25:48 -0800

 Quality means the devices you ship now keep working in the field, and
 the probable cost of later rework if the requirements change does not
 exceed the opportunity cost of over-engineering up front.  Economy
 gets a look-in too, and says that it's pointless to delay shipment and
 bloat the application coding for cases that can't happen.  If POSIX
 says that any and all writes (except small pipe/FIFO writes, whatever)
 can return a short byte count -- but you know damn well you're writing
 to a device driver that never, ever writes short, and if it did you'd
 miss a timing budget recovering from it anyway -- to hell with POSIX.

You're not even safe over standard output, simply run the program over
ssh and you suddenly have socket semantics to deal with.

In the early days the fun game to play was to run programs over rsh to
see in what amusing way they would explode.  ssh has replaced rsh in
this game, but the bugs have largely stayed the same.

Even early versions of tar used to explode on TCP half-closes and
whatnot.

In short, if you don't handle short writes, you're writing a program
for something other than unix.
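The standard coping pattern is tiny anyway (a sketch; error handling
beyond EINTR left to taste):

#include <errno.h>
#include <unistd.h>

/* write the whole buffer, retrying on short writes and EINTR */
static ssize_t write_all(int fd, const void *buf, size_t count)
{
	const char *p = buf;
	size_t done = 0;

	while (done < count) {
		ssize_t n = write(fd, p + done, count - done);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;	/* real error, errno is set */
		}
		done += n;
	}
	return done;
}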

We're not changing write() to interlock with other parallel callers or
messing with the f_pos semantics in such cases, that's stupid, please
cope, kthx.


LSM Stacking

2007-03-13 Thread JanuGerman
Hi All,

Within the security folder in the kernel tree, the
2.6.20 linux kernel distribution is shipped with a
file root_plug.c (written by Greg Kroah-Hartman),
which is a classic introduction to Linux Security
Modules (LSM). The folder also contains the folder of
SELinux.

My question is that whether root_plug.c security
module is stacked with the SELinux security module or
not. If root_plug.c is stacked, where i can find the
code which handles the stacking of SELinux and
root_plug.c within the kernel.

Further, any pointer to stacking mechansims in Linux
2.6.* kernel will be highly appreciated.

Thanking you in advance,
MA








Re: [PATCH][RSDL-mm 0/7] RSDL cpu scheduler for 2.6.21-rc3-mm2

2007-03-13 Thread Mike Galbraith
On Tue, 2007-03-13 at 16:53 +1100, Con Kolivas wrote:
 On Tuesday 13 March 2007 16:10, Mike Galbraith wrote:

  It's not offensive to me, it is a behavioral regression.  The
  situation as we speak is that you can run cpu intensive tasks while
  watching eye-candy.  With RSDL, you can't, you feel the non-interactive
  load instantly.  Doesn't the fact that you're asking me to lower my
  expectations tell you that I just might have a point?
 
 Yet looking at the mainline scheduler code, nice 5 tasks are also supposed to 
 get 75% cpu compared to nice 0 tasks, however I cannot seem to get 75% cpu 
 with a fully cpu bound task in the presence of an interactive task.

(One more comment before I go.  You can then have the last word this
time, promise :)

Because the interactivity logic, which was put there to do precisely
this, is doing it's job?

  To me 
 that means mainline is not living up to my expectations. What you're saying 
 is your expectations are based on a false cpu expectation from nice 5. You 
 can spin it both ways.

Talk about spin, you turn an example of the current scheduler working
properly into a negative attribute, and attempt to discredit me with it.

The floor is yours.  No reply will be forthcoming.

-Mike



Re: [QUICKLIST 0/4] Arch independent quicklists V2

2007-03-13 Thread Andrew Morton
 On Tue, 13 Mar 2007 00:13:25 -0700 (PDT) Christoph Lameter [EMAIL PROTECTED] wrote:
 Page table pages have the characteristics that they are typically zero
 or in a known state when they are freed.

Well if they're zero then perhaps they should be released to the page allocator
to satisfy the next __GFP_ZERO request.  If that request is for a pagetable
page, we break even (except we get to remove special-case code).  If that
__GFP_ZERO allocation was for some application other than a pagetable, we
win.

iow, can we just nuke 'em?

(Will require some work in the page allocator)
(That work will open the path to using the idle thread to prezero pages)


Re: [RFC] hwbkpt: Hardware breakpoints (was Kwatch)

2007-03-13 Thread Roland McGrath
 Well, I can add in the test for 0, but finding the set of always-on bits
 in DR6 will have to be done separately.  Isn't it possible that different
 CPUs could have different bits?

I don't know, but it seems unlikely.  AFAIK all CPUs are presumed to have
the same CPUID results, for example.

 At that moment D, running on CPU 1, decides to unregister a breakpoint in
 T.  Clearing TIF_DEBUG now doesn't do any good -- it's too late; CPU 0 has
 already tested it.  CPU 1 goes in and alters the user breakpoint data,
 maybe even deallocating a breakpoint structure that CPU 0 is about to
 read.  Not a good situation.

You make it sound like a really good case for RCU. ;-)

 No way to tell when a task being debugged is started up by anything other 
 than its debugger?  Hmmm, in that case maybe it would be better to use 
 RCU.  It won't add much overhead to anything but the code for registering 
 and unregistering user breakpoints.

Indeed, it is for this sort of thing.  Still, it feels like a bit too much
is going on in switch_to_thread_hw_breakpoint for the common case.
It seems to me it ought to be something as simple as this:

	if (unlikely((thbi->want_dr7 & ~chbi->kdr7) != thbi->active_tdr7)) {
		/* Need to make some installed or uninstalled callbacks.  */
		if (thbi->active_tdr7 & chbi->kdr7)
			uninstalled callbacks;
		else
			installed callbacks;
		recompute active_dr7, want_dr7;
	}

	switch (thbi->active_bkpts) {
	case 4:
		set_debugreg(0, thbi->tdr[0]);
	case 3:
		set_debugreg(1, thbi->tdr[1]);
	case 2:
		set_debugreg(2, thbi->tdr[2]);
	case 1:
		set_debugreg(3, thbi->tdr[3]);
	}
	set_debugreg(7, chbi->kdr7 | thbi->active_tdr7);

Only in the unlikely case do you need to worry about synchronization,
whether it's RCU or spin locks or whatever.  The idea is that breakpoint
installation when doing the fancy stuff would reduce it to the four
breakpoints I would like, in order (tdr[3] containing the highest-priority
one), and dr7 masks describing what dr7 you were using last time you were
running (active_tdr7), and that plus the enable bits you would like to have
set (want_dr7).  The unlikely case is when the number of kernel debugregs
consumed changed since the last time you switched in, so you go recompute
active_tdr7.  (Put the body of that if in another function.)

For the masks to work as I described, you need to use the same enable bit
(or both) for kernel and user allocations.  It really doesn't matter which
one you use, since all of Linux is local for the sense of the dr7 enable
bits (i.e. you should just use DR_GLOBAL_ENABLE).

It's perfectly safe to access all this stuff while it might be getting
overwritten, and worst case you switch in some user breakpoints you didn't
want.  That only happens in the SIGKILL case, when you never hit user mode
again and don't care.

 +void switch_to_thread_hw_breakpoint(struct task_struct *tsk)
[...]
 + /* Keep the DR7 bits that refer to kernel breakpoints */
 + get_debugreg(dr7, 7);
 + dr7 &= kdr7_masks[chbi->num_kbps];

I don't understand what this part is for.  Why fetch dr7 from the CPU?  You
already know what's there.  All you need is the current dr7 bits belonging
to kernel allocations, i.e. a chbi-kdr7 mask.

 + if (tsk && test_tsk_thread_flag(tsk, TIF_DEBUG)) {

switch_to_thread_hw_breakpoint is on the context switch path.  On that
path, this test can never be false.  The context switch path should not
have any unnecessary conditionals.  If you want to share code with some
other places that now call switch_to_thread_hw_breakpoint, they can share a
common inline for part of the guts.

 + set_debugreg(dr7, 7);   /* Disable user bps while switching */

What is this for?  The kernel's dr7 bits are already set.  Why does it
matter if bits enabling user breakpoints are set too?  No user breakpoint
can be hit on this CPU before this function returns.

 + /* Clear any remaining stale bp pointers */
 + while (--i >= chbi->num_kbps)
 + chbi->bps[i] = NULL;

Why is this done here?  This can be done when the kernel allocations are
installed/uninstalled.

 @@ -15,6 +15,7 @@ struct die_args {
   long err;
   int trapnr;
   int signr;
 + int ret;
  };

I don't understand why you added this at all.

  fastcall void __kprobes do_debug(struct pt_regs * regs, long error_code)
[...]
 + if ((args.err & (DR_STEP|DR_TRAP0|DR_TRAP1|DR_TRAP2|DR_TRAP3)) ||
 + args.ret)
 + send_sigtrap(tsk, regs, error_code);

The args.err test is fine.  A notifier that wants the SIGTRAP sent should
just leave the appropriate DR_* bit set rather than clear it.

In hw_breakpoint_handler, you could just change:

	if (i >= chbi->num_kbps)
		data->ret = 1;
to:

if (i  

Re: [QUICKLIST 0/4] Arch independent quicklists V2

2007-03-13 Thread Nick Piggin

Andrew Morton wrote:

On Tue, 13 Mar 2007 00:13:25 -0700 (PDT) Christoph Lameter [EMAIL PROTECTED] 
wrote:
Page table pages have the characteristics that they are typically zero
or in a known state when they are freed.



Well if they're zero then perhaps they should be released to the page allocator
to satisfy the next __GFP_ZERO request.  If that request is for a pagetable
page, we break even (except we get to remove special-case code).  If that
__GFP_ZERO allocation was or some application other than for a pagetable, we
win.

iow, can we just nuke 'em?


Page allocator still requires interrupts to be disabled, which this doesn't.

Considering there isn't much else that frees known zeroed pages, I wonder if
it is worthwhile.

Last time the zeroidle discussion came up there was IIRC no real performance
gain, just cooking the 1024 CPU threaded pagefault numbers ;)

--
SUSE Labs, Novell Inc.




Re: [PATCH][RSDL-mm 0/7] RSDL cpu scheduler for 2.6.21-rc3-mm2

2007-03-13 Thread Ingo Molnar

* Linus Torvalds [EMAIL PROTECTED] wrote:

  It has been said that perfection is the enemy of good.  The two 
  interactive tasks receiving 40% cpu while two niced background jobs 
  receive 60% may well be perfect, but it's damn sure not good.
 
 Well, the real problem is really server that works on behalf of 
 somebody else.

i think Mike's testcase was even simpler than that: two plain CPU hogs 
on nice +5 stole much more CPU time with Con's new interactivity code 
than they did with the current interactivity code. I'd agree with Mike 
that a phenomenon like that needs to be fixed.

/less/ interactivity we can do easily in the current scheduler: just 
remove various bits here and there. The RSDL promise is that it gives us 
/more/ interactivity (with 'interactivity designed in', etc.), which in 
Mike's testcase does not seem to be the case.

 And the problem is that a lot of clients actually end up doing *more* 
 in the X server than they do themselves directly.

yeah. It's a hard case because X is not always a _clear_ interactive 
task - still the current interactivity code handles it quite well.

but Mike's scenario wasn't even that complex. It wasn't even a hard case 
of X being starved by _other_ interactive tasks running on the same nice 
level. Mike's test-scenario was about two plain nice +5 CPU hogs 
starving nice +0 interactive tasks more than the current scheduler does, 
and this is really not an area where we want to see any regression. Con, 
could you work on this area a bit more?

Ingo


Re: 2.6.21rc suspend to ram regression on Lenovo X60

2007-03-13 Thread Eric W. Biederman
Dave Jones [EMAIL PROTECTED] writes:

 I spent considerable time over the last day or so bisecting to
 find out why an X60 stopped resuming somewhen between 2.6.20 and current -git.
 (Total lockup, black screen of death).

 The bisect log looked like this.

 git-bisect start
 # bad: [c8f71b01a50597e298dc3214a2f2be7b8d31170c] Linux 2.6.21-rc1
 git-bisect bad c8f71b01a50597e298dc3214a2f2be7b8d31170c
 # good: [fa285a3d7924a0e3782926e51f16865c5129a2f7] Linux 2.6.20
 git-bisect good fa285a3d7924a0e3782926e51f16865c5129a2f7
 # bad: [574009c1a895aeeb85eaab29c235d75852b09eb8] Merge branch 'upstream' of
 git://ftp.linux-mips.org/pub/scm/upstream-linus
 git-bisect bad 574009c1a895aeeb85eaab29c235d75852b09eb8
 # bad: [43187902cbfafe73ede0144166b741fb0f7d04e1] Merge
 master.kernel.org:/pub/scm/linux/kernel/git/gregkh/driver-2.6
 git-bisect bad 43187902cbfafe73ede0144166b741fb0f7d04e1
 # good: [1545085a28f226b59c243f88b82ea25393b0d63f] drm: Allow for 44 bit
 user-tokens (or drm_file offsets)
 git-bisect good 1545085a28f226b59c243f88b82ea25393b0d63f
 # good: [c96e2c92072d3e78954c961f53d8c7352f7abbd7] Merge
 master.kernel.org:/pub/scm/linux/kernel/git/gregkh/usb-2.6
 git-bisect good c96e2c92072d3e78954c961f53d8c7352f7abbd7
 # good: [31c56d820e03a2fd47f81d6c826f92caf511f9ee] [POWERPC] pasemi: iommu
 support
 git-bisect good 31c56d820e03a2fd47f81d6c826f92caf511f9ee
 # bad: [78149df6d565c36675463352d0bfeb02b7a7] Merge
 master.kernel.org:/pub/scm/linux/kernel/git/gregkh/pci-2.6
 git-bisect bad 78149df6d565c36675463352d0bfeb02b7a7
 # good: [3d9c18872fa1db5c43ab97d8cbca43775998e49c] shpchp: remove
 CONFIG_HOTPLUG_PCI_SHPC_POLL_EVENT_MODE
 git-bisect good 3d9c18872fa1db5c43ab97d8cbca43775998e49c
 # good: [88187dfa4d8bb565df762f272511d2c91e427e0d] MSI: Replace pci_msi_quirk
 with calls to pci_no_msi()
 git-bisect good 88187dfa4d8bb565df762f272511d2c91e427e0d
 # good: [866a8c87c4e51046602387953bbef76992107bcb] msi: Fix
 msi_remove_pci_irq_vectors.
 git-bisect good 866a8c87c4e51046602387953bbef76992107bcb
 # good: [f7feaca77d6ad6bcfcc88ac54e3188970448d6fe] msi: Make MSI useable more
 architectures
 git-bisect good f7feaca77d6ad6bcfcc88ac54e3188970448d6fe
 # good: [14719f325e1cd4ff757587e9a221ebaf394563ee] Revert PCI: remove 
 duplicate
 device id from ata_piix
 git-bisect good 14719f325e1cd4ff757587e9a221ebaf394563ee

 which led me to a final 'bad' commit of 
 78149df6d565c36675463352d0bfeb02b7a7
 which is a merge changeset of lots of PCI bits.

Ok.  This is weird.  It looks like you marked the merge bad but
its individual commits as good.

Which would indicate a problem on one of the branches it was merged
with, or a problem that only shows up when both groups of changes
are present.

 Seeing a couple of MSI changes in there, on a hunch I booted latest tree with
 pci=nomsi, and it resumed again.

 Any ideas how to further debug this?
 I'll try backing out individual changes from that merge tomorrow.

Thanks.  

Of those msi patches you have identified I don't see anything really
obvious.  And you actually marked them as good in your bisect so
I don't expect it is a core problem.

We do have a known e1000 regression, with msi and suspend/resume.
So it is possible the nomsi avoided a driver problem.  Especially
as we have a number of driver changes on Linus's side of
that merge.

I also know we have some known issues with pci_save_state and
pci_restore_state that require them to be paired for correct
operation.  For suspend and resume that is not generally a problem.

I have fixes for the pci_save_state and pci_restore_state in the -mm
and gregkh trees.  Since they also happen to fix the e1000 driver as
a side effect they are worth looking at, at least if you have an
e1000.

I don't have a clue which hardware the x60 has so I don't know which
drivers it would be using.

Eric


Re: [PATCH][RSDL-mm 0/7] RSDL cpu scheduler for 2.6.21-rc3-mm2

2007-03-13 Thread Ingo Molnar

* Mike Galbraith [EMAIL PROTECTED] wrote:

 [...] The situation as we speak is that you can run cpu intensive 
 tasks while watching eye-candy.  With RSDL, you can't, you feel the 
 non-interactive load instantly. [...]

i have to agree with Mike that this is a material regression that cannot 
be talked around.

Con, we want RSDL to /improve/ interactivity. Having new scheduler 
interactivity logic that behaves /worse/ in the presence of CPU hogs, 
which CPU hogs are even reniced to +5, than the current interactivity 
code, is i think a non-starter. Could you try to fix this, please? Good 
interactivity in the presence of CPU hogs (be them default nice level or 
nice +5) is _the_ most important scheduler interactivity metric. 
Anything else is really secondary.

Ingo

ps. please be nice to each other - both of you are long-time
scheduler contributors who did lots of cool stuff :-)


Re: [PATCH][RSDL-mm 0/7] RSDL cpu scheduler for 2.6.21-rc3-mm2

2007-03-13 Thread Mike Galbraith
On Tue, 2007-03-13 at 09:18 +0100, Ingo Molnar wrote:

 ps. please be nice to each other - both of you are long-time
 scheduler contributors who did lots of cool stuff :-)

It's no big deal, Con and I just seem to be oil and water.  He'll have
to be oil, because water is already taken.  *evaporate* :)



Re: [PATCH][RSDL-mm 0/7] RSDL cpu scheduler for 2.6.21-rc3-mm2

2007-03-13 Thread Ingo Molnar

* Con Kolivas [EMAIL PROTECTED] wrote:

  It's not offensive to me, it is a behavioral regression.  The 
  situation as we speak is that you can run cpu intensive tasks while 
  watching eye-candy.  With RSDL, you can't, you feel the 
  non-interactive load instantly.  Doesn't the fact that you're asking 
  me to lower my expectations tell you that I just might have a point?
 
 Yet looking at the mainline scheduler code, nice 5 tasks are also 
 supposed to get 75% cpu compared to nice 0 tasks, however I cannot 
 seem to get 75% cpu with a fully cpu bound task in the presence of an 
 interactive task. [...]

i'm sorry, but your argument seems to be negated. We of course have no 
problem with interactive tasks stealing CPU time from CPU hogs. The 
situation Mike found is _the other direction_: that /CPU hogs/ stole 
from interactive tasks. That's bad and needs to be fixed. Please?

Ingo


New thread RDSL, post-2.6.20 kernels and amanda (tar) miss-fires

2007-03-13 Thread Gene Heskett
Greetings;
Someone suggested a fresh thread for this.

I now have my scripts more or less under control, and I can report that 
kernel-2.6.20.1 with no other patches does not exhibit the undesirable 
behaviour where tar thinks it's all new, even when told to do a level 2 on 
a directory tree that hasn't been touched in months to update anything.

Next up, 2.6.20.2, plain and with the latest RDSL-0.30 patch.

-- 
Cheers, Gene
There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order.
-Ed Howdershelt (Author)
 Alan Cox wrote:
[..]

No I didn't.  Someone else wrote that.  Please keep attributions
straight.
-- From linux-kernel


[PATCH 1/2] avoid OPEN_MAX in SCM_MAX_FD

2007-03-13 Thread Roland McGrath
The OPEN_MAX constant is an arbitrary number with no useful relation to
anything.  Nothing should be using it.  This patch changes SCM_MAX_FD to
use NR_OPEN instead of OPEN_MAX.  This increases the size of the struct
scm_fp_list type fourfold, to make it big enough to contain as many file
descriptors as could be asked of it.  This size increase may not be very
worthwhile, but at any rate if an arbitrary limit unrelated to anything
else is being defined it should be done explicitly here with:

#define SCM_MAX_FD  255

Using the OPEN_MAX constant here is just confusing and misleading.

Signed-off-by: Roland McGrath [EMAIL PROTECTED]
---
 include/net/scm.h |5 ++---
 1 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/include/net/scm.h b/include/net/scm.h
index 5637d5e..4d37c5e 100644  
--- a/include/net/scm.h
+++ b/include/net/scm.h
@@ -8,7 +8,7 @@
 /* Well, we should have at least one descriptor open
  * to accept passed FDs 8)
  */
-#define SCM_MAX_FD (OPEN_MAX-1)
+#define SCM_MAX_FD (NR_OPEN-1)
 
 struct scm_fp_list
 {


[PATCH 2/2] Remove OPEN_MAX

2007-03-13 Thread Roland McGrath
The OPEN_MAX macro in limits.h should not be there.  It claims to be the
limit on file descriptors in a process, but its value is wrong for that.
There is no constant value, but a variable resource limit (RLIMIT_NOFILE).
Nothing in the kernel uses OPEN_MAX except things that are wrong to do so.
I've submitted other patches to remove those uses.

The proper thing to do according to POSIX is not to define OPEN_MAX at all.
The sysconf (_SC_OPEN_MAX) implementation works by calling getrlimit.
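
For illustration only (userspace, not part of the patch): both values below
come from the RLIMIT_NOFILE resource limit, so a compile-time OPEN_MAX adds
nothing:

#include <stdio.h>
#include <unistd.h>
#include <sys/resource.h>

int main(void)
{
	struct rlimit rl;

	getrlimit(RLIMIT_NOFILE, &rl);
	printf("RLIMIT_NOFILE (soft):  %lu\n", (unsigned long)rl.rlim_cur);
	printf("sysconf(_SC_OPEN_MAX): %ld\n", sysconf(_SC_OPEN_MAX));
	return 0;
}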

Signed-off-by: Roland McGrath [EMAIL PROTECTED]
---
 include/linux/limits.h |1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/include/linux/limits.h b/include/linux/limits.h
index eaf2e09..c4b4e57 100644  
--- a/include/linux/limits.h
+++ b/include/linux/limits.h
@@ -6,7 +6,6 @@
 #define NGROUPS_MAX    65536   /* supplemental group IDs are available */
 #define ARG_MAX       131072   /* # bytes of args + environ for exec() */
 #define CHILD_MAX        999   /* no limit :-) */
-#define OPEN_MAX         256   /* # open files a process may have */
 #define LINK_MAX         127   /* # links a file may have */
 #define MAX_CANON        255   /* size of the canonical input queue */
 #define MAX_INPUT        255   /* size of the type-ahead buffer */


[PATCH] Remove CHILD_MAX

2007-03-13 Thread Roland McGrath
The CHILD_MAX macro in limits.h should not be there.  It claims to be the
limit on processes a user can own, but its value is wrong for that.
There is no constant value, but a variable resource limit (RLIMIT_NPROC).
Nothing in the kernel uses CHILD_MAX.

The proper thing to do according to POSIX is not to define CHILD_MAX at all.
The sysconf (_SC_CHILD_MAX) implementation works by calling getrlimit.

Signed-off-by: Roland McGrath [EMAIL PROTECTED]
---
 include/linux/limits.h |1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/include/linux/limits.h b/include/linux/limits.h
index c4b4e57..2d0f941 100644  
--- a/include/linux/limits.h
+++ b/include/linux/limits.h
@@ -5,7 +5,6 @@
 
 #define NGROUPS_MAX    65536   /* supplemental group IDs are available */
 #define ARG_MAX       131072   /* # bytes of args + environ for exec() */
-#define CHILD_MAX        999   /* no limit :-) */
 #define LINK_MAX         127   /* # links a file may have */
 #define MAX_CANON        255   /* size of the canonical input queue */
 #define MAX_INPUT        255   /* size of the type-ahead buffer */


Re: [PATCH] drivers: PMC MSP71xx LED driver

2007-03-13 Thread Florian Fainelli
Hi Marc,

Your patch does not seem to use the Linux LED API (include/linux/leds.h), 
which is sometimes pretty unknown, but would dramatically ease your work. Maybe 
it is a good idea to convert it to this API if you find it relevant.
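
For reference, a minimal sketch of what using the LED class API looks like
(device and function names here are hypothetical, not taken from your patch):

#include <linux/leds.h>

/* hypothetical: push the requested brightness out over TWI */
static void msp_led_brightness_set(struct led_classdev *cdev,
				   enum led_brightness value)
{
	/* pmctwiled_write(...) or similar would go here */
}

static struct led_classdev msp_power_led = {
	.name		= "msp71xx:power",
	.brightness_set	= msp_led_brightness_set,
};

/* in probe: led_classdev_register(&client->dev, &msp_power_led);   */
/* on removal: led_classdev_unregister(&msp_power_led);             */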

Also consider the ongoing LED-GPIO API which is being written by ARM people: 
http://marc.theaimsgroup.com/?l=linux-kernel&m=110873454720555&w=2

My 2 cents

On Monday 12 March 2007, Marc St-Jean wrote:
 [PATCH] drivers: PMC MSP71xx LED driver

 Patch to add LED driver for the PMC-Sierra MSP71xx devices.

 This patch references some platform support files previously
 submitted to the [EMAIL PROTECTED] list.

 Thanks,
 Marc

 Signed-off-by: Marc St-Jean [EMAIL PROTECTED]
 ---
 Re-posting patch with recommended changes:
 -Cleanup on style and formatting for comments, macros, etc.
 -Removed unnecessary memset.
 -Removed unnecessary inlines.
 -Moved some driver private data structures from .h to .c.
 -Made use of schedule_timeout_interruptible() call instead
 of multiple calls.
 -Added calls to kthread_should_stop and try_to_freeze().

  drivers/i2c/chips/Kconfig                             |    9
  drivers/i2c/chips/Makefile                            |    1
  drivers/i2c/chips/pmctwiled.c                         |  524 +++
  include/asm-mips/pmc-sierra/msp71xx/msp_led_macros.h  |  273 +
  4 files changed, 807 insertions(+)

 diff --git a/drivers/i2c/chips/Kconfig b/drivers/i2c/chips/Kconfig
 index 87ee3ce..3bef46b 100644
 --- a/drivers/i2c/chips/Kconfig
 +++ b/drivers/i2c/chips/Kconfig
 @@ -50,6 +50,15 @@ config SENSORS_PCF8574
 These devices are hard to detect and rarely found on mainstream
 hardware.  If unsure, say N.

 +config SENSORS_PMCTWILED
 + tristate "PMC Led-over-TWI driver"
 + depends on I2C && PMC_MSP
 + help
 +   The new VPE-safe backend driver for all the LEDs on the 7120 platform.
 +
 +   While you may build this as a module, it is recommended you build it
 +   into the kernel monolithic so all drivers may access it at all times.
 +
  config SENSORS_PCA9539
   tristate "Philips PCA9539 16-bit I/O port"
   depends on I2C && EXPERIMENTAL
 diff --git a/drivers/i2c/chips/Makefile b/drivers/i2c/chips/Makefile
 index 779868e..4e79e27 100644
 --- a/drivers/i2c/chips/Makefile
 +++ b/drivers/i2c/chips/Makefile
 @@ -10,6 +10,7 @@ obj-$(CONFIG_SENSORS_M41T00)+= m41t00.o
  obj-$(CONFIG_SENSORS_PCA9539)+= pca9539.o
  obj-$(CONFIG_SENSORS_PCF8574)+= pcf8574.o
  obj-$(CONFIG_SENSORS_PCF8591)+= pcf8591.o
 +obj-$(CONFIG_SENSORS_PMCTWILED) += pmctwiled.o
  obj-$(CONFIG_ISP1301_OMAP)   += isp1301_omap.o
  obj-$(CONFIG_TPS65010)   += tps65010.o

 diff --git a/drivers/i2c/chips/pmctwiled.c b/drivers/i2c/chips/pmctwiled.c
 new file mode 100644
 index 000..69845a5
 --- /dev/null
 +++ b/drivers/i2c/chips/pmctwiled.c
 @@ -0,0 +1,524 @@
 +/*
 + * Special LED-over-TWI-PCA9554 driver for the PMC Sierra
 + * Residential Gateway demo board (and potentially others).
 + *
 + * Based on pca9539.c Copyright (C) 2005 Ben Gardner [EMAIL PROTECTED]
 + * Modified by Copyright 2006-2007 PMC-Sierra, Inc.
 + *
 + * This program is free software; you can redistribute it and/or modify
 + * it under the terms of the GNU General Public License as published by
 + * the Free Software Foundation; version 2 of the License.
 + */
 +
 +#include <linux/module.h>
 +#include <linux/init.h>
 +#include <linux/kthread.h>
 +#include <linux/i2c.h>
 +#include <linux/freezer.h>
 +
 +#include <msp_led_macros.h>
 +
 +/*
 + * The externally available registers
 + *
 + * TODO: We must ensure these are in shared memory for the other VPE
 + *   to access
 + */
 +u32 msp_led_register[MSP_LED_COUNT];
 +
 +#ifdef CONFIG_PMC_MSP7120_GW
 +/*
 + * For each device, which pins will be in input mode:
 + * One byte per device:
 + *  0 = Output
 + *  1 = Input
 + */
 +static const u8 msp_led_initial_input_state[] = {
 + /* No outputs on device 0 or 1, these are inputs only */
 + MSP_LED_INPUT_MODE, MSP_LED_INPUT_MODE,
 + /* All outputs on device 2 through 4 */
 + MSP_LED_OUTPUT_MODE, MSP_LED_OUTPUT_MODE, MSP_LED_OUTPUT_MODE,
 +};
 +
 +/*
 + * For each device, which output pins should start on and off:
 + * One byte per device:
 + *  0 = OFF = HI
 + *  1 = ON  = Lo
 + */
 +static const u8 msp_led_initial_pin_state[] = {
 + 0, 0,   /* No initial state, these are input only */
 + 1 << 1, /* PWR_GREEN LED on, all others off */
 + 0,  /* All off */
 + 0,  /* All off */
 +};
 +#endif /* CONFIG_PMC_MSP7120_GW */
 +
 +/* Internal polling data */
 +#define POLL_PERIOD msecs_to_jiffies(125) /* Poll at 125ms */
 +
 +static struct i2c_client *pmctwiled_device[MSP_LED_NUM_DEVICES];
 +static struct task_struct *pmctwiled_pollthread;
 +static u32 private_msp_led_register[MSP_LED_COUNT];
 +static u16 current_period;
 +
 +/* Addresses to scan */
 +#define PMCTWILED_BASEADDRESS 0x38
 +
 +static unsigned short 

Re: _proxy_pda still makes linking modules fail

2007-03-13 Thread Paul Mackerras
Rusty Russell writes:

 The ideal solution has always been to use __thread, but no architecture
 has yet managed it (I tried for i386, and it quickly caused unbearable
 pain).  On x86-64 that uses %fs on x86-64, not %gs as the kernel
 does, but I might try that if I feel particularly masochistic soon...

There is a fundamental problem with using __thread, which is that gcc
assumes that the addresses of __thread variables are constant within
one thread, and that therefore it can cache the result of address
calculations.  However, with preempt, threads in the kernel can't rely
on staying on one cpu, and therefore the addresses of per-cpu
variables can change.  There appears to be no way to tell gcc to drop
all cached __thread variable address calculations at a given point
(e.g. when enabling or disabling preemption).  That is basically why I
gave up on using __thread for per-cpu variables on powerpc.
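
A minimal sketch of the hazard (the variable name is hypothetical; the kernel
does not actually support __thread, which is the point):

__thread int nr_events;			/* imagine a per-cpu counter */

void example(void)
{
	preempt_disable();
	nr_events++;		/* gcc computes &nr_events here...           */
	preempt_enable();	/* ...we may now migrate to another cpu...   */
	preempt_disable();
	nr_events++;		/* ...yet gcc may legally reuse the cached
				   address and update the old cpu's copy    */
	preempt_enable();
}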

We could use __thread for per-task variables quite naturally.  There
doesn't seem to be much point, though, since we already have a way to
do per-task variables - it's called struct task_struct. :-/

Paul.


Re: Removal of multipath cached (was Re: [PATCH] [REVISED] net/ipv4/multipath_wrandom.c: check kmalloc() return value.)

2007-03-13 Thread Jarek Poplawski
On Mon, Mar 12, 2007 at 10:22:36PM -0800, Andrew Morton wrote:
  On Mon, 12 Mar 2007 13:53:11 -0700 (PDT) David Miller [EMAIL PROTECTED] 
  wrote:
...
  And there is absolutely no negotiations about this, I've held back on
  this for nearly 2 years, and nothing has happened, this code is not
  maintained, nobody cares enough to fix the bugs, and even no
  distributions enable it because it causes crashes.
 
 Good stuff.
 
 I suggest you put a big printk explaining the above into 2.6.21.
 

Plus the official way: Documentation/feature-removal-schedule.txt
in the next rc-git.

Jarek P.


Re: [patch 3/8] per backing_dev dirty and writeback page accounting

2007-03-13 Thread Miklos Szeredi
   IIUC, your problem is that there's another bdi that holds all the
   dirty pages, and this throttle loop never flushes pages from that
   other bdi and we sleep instead. It seems to me that the fundamental
   problem is that to clean the pages we need to flush both bdi's, not
   just the bdi we are directly dirtying.
  
  This is what happens:
  
  write fault on upper filesystem
balance_dirty_pages
  submit write requests
loop ...
 
 Isn't this loop transferring the dirty state from the upper
 filesystem to the lower filesystem?

What this loop is doing is putting write requests in the request
queue, and in so doing transforming page state from dirty to
writeback.

 What I don't see here is how the pages on this filesystem are not
 getting cleaned if the lower filesystem is being flushed properly.

Because the lower filesystem writes back one request, but then gets
stuck in balance_dirty_pages before returning.  So the write request
is never completed.

The problem is that balance_dirty_pages is waiting for the condition
that the global number of dirty+writeback pages goes below the
threshold.  But this condition can only be satisfied if
balance_dirty_pages() returns.

 I'm probably missing something big and obvious, but I'm not
 familiar with the exact workings of FUSE so please excuse my
 ignorance
 
  --- fuse IPC ---
  [fuse loopback fs thread 1]
 
 This is the lower filesystem? Or a callback thread for
 doing the write requests to the lower filesystem?

This is the fuse daemon.  It's a normal process that reads requests
from /dev/fuse, serves these requests then writes the reply back onto
/dev/fuse.  It is usually multithreaded, so it can serve many requests
in parallel.

The loopback filesystem serves the requests by issuing the relevant
filesystem syscalls on the underlying fs.

  read request
  sys_write
mutex_lock(i_mutex)
...
   balance_dirty_pages
  submit write requests
  loop ... write requests completed ... dirty still over limit ... 
  ... loop forever
 
 Hmmm - the situation in balance_dirty_pages() after an attempt
 to writeback_inodes(wbc) that has written nothing because there
 is nothing to write would be:
 
   wbc->nr_write == write_chunk &&
   wbc->pages_skipped == 0 &&
   wbc->encountered_congestion == 0 &&
   !bdi_congested(wbc->bdi)
 
 What happens if you make that an exit condition to the loop?

That's almost right.  The only problem is that even if there's no
congestion, the device queue can be holding a great amount of yet
unwritten pages.  So exiting on this condition would mean that
dirty+writeback could go way over the threshold.

How much this would be a problem?  I don't know, I guess it depends on
many things: how many queues, how many requests per queue, how many
bytes per request.
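
To make the suggested exit condition concrete, it would sit roughly here (a
sketch only, using the writeback_control fields as I read them; whether it is
safe is exactly the queued-but-unwritten question above):

	/* inside balance_dirty_pages(), after writeback_inodes(&wbc): */
	if (wbc.nr_to_write == write_chunk &&	/* wrote nothing...          */
	    wbc.pages_skipped == 0 &&		/* ...skipped nothing...     */
	    !wbc.encountered_congestion &&	/* ...saw no congestion...   */
	    !bdi_write_congested(wbc.bdi))	/* ...queue not backed up    */
		break;		/* nothing left to flush for this bdi */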

 Or alternatively, adding another bit to the wbc structure to
 say there was nothing to do and setting that if we find
 list_empty(sb->s_dirty) when trying to flush dirty inodes.
 
 [ FWIW, this may also solve another problem of fast block devices
 being throttled incorrectly when a slow block dev is consuming
 all the dirty pages... ]

There may be a patch floating around, which I think basically does
this, but only as long as the dirty+writeback are over a soft limit,
but under the hard limit.

When over the hard limit, balance_dirty_pages still loops until
dirty+writeback go below the threshold.

Thanks,
Miklos


Re: /sys/devices/system/cpu/cpuX/online are missing

2007-03-13 Thread Heiko Carstens
On Tue, Mar 13, 2007 at 01:39:25AM +0100, Andreas Schwab wrote:
 Giuliano Pochini [EMAIL PROTECTED] writes:
 
  I had a look at arch/powerpc/kernel/smp.c but I'm not familiar at all with 
  those parts of the kernel.
 
 See arch/powerpc/kernel/sysfs.c:topology_init.  I don't think there is
 anything to do here.  You probably don't have CONFIG_HOTPLUG_CPU enabled.

I was referring to arch/ppc not arch/powerpc. But it seems that arch/ppc
doesn't support cpu hotplug anyway. So I guess it's indeed just a missing
config option.

Grepping a bit further shows that arm suffered by the change that inverted
the logic if the 'online' attribute for cpus should appear. Since arm
supports cpu hotplug but the patch left arm out, it doesn't work there
anymore (cc'ing arm people: changeset 72486f1f8f0a2bc828b9d30cf4690cf2dd6807fc
is most probably disabling cpu hotplug support on arm like it did on s390).


Re: [QUICKLIST 1/4] Generic quicklist implementation

2007-03-13 Thread Paul Mundt
On Tue, Mar 13, 2007 at 12:13:30AM -0700, Christoph Lameter wrote:
 --- linux-2.6.21-rc3-mm2.orig/mm/Kconfig  2007-03-12 22:49:21.0 
 -0700
 +++ linux-2.6.21-rc3-mm2/mm/Kconfig   2007-03-13 00:09:50.0 -0700
 @@ -220,3 +220,8 @@ config DEBUG_READAHEAD
  
 Say N for production servers.
  
 +config QUICKLIST
 + bool
 + default y if NR_QUICK != 0
 +
 +

This doesn't work, and so CONFIG_QUICKLIST is always set. The NR_QUICK
thing seems a bit backwards anyways, perhaps it would make more sense to
have architectures set CONFIG_GENERIC_QUICKLIST in the same way that the
other GENERIC_xxx bits are defined, and then set NR_QUICK based off of
that. It's obviously going to be 2 or 1 for most people, and x86 seems to
be the only one that needs 2.

How about this?

--

diff --git a/mm/Kconfig b/mm/Kconfig
index 7942b33..2f20860 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -163,3 +163,8 @@ config ZONE_DMA_FLAG
default 0 if !ZONE_DMA
default 1
 
+config NR_QUICK
+   int
+   depends on GENERIC_QUICKLIST
+   default 2 if X86
+   default 1


Re: [RFC][PATCH 1/7] Resource counters

2007-03-13 Thread Eric W. Biederman
Herbert Poetzl [EMAIL PROTECTED] writes:

 On Sun, Mar 11, 2007 at 01:00:15PM -0600, Eric W. Biederman wrote:
 Herbert Poetzl [EMAIL PROTECTED] writes:
 
 
  Linux-VServer does the accounting with atomic counters,
  so that works quite fine, just do the checks at the
  beginning of whatever resource allocation and the
  accounting once the resource is acquired ...
 
 Atomic operations versus locks is only a granularity thing.
 You still need the cache line which is the cost on SMP.
 
 Are you using atomic_add_return or atomic_add_unless or 
 are you performing you actions in two separate steps 
 which is racy? What I have seen indicates you are using 
 a racy two separate operation form.

 yes, this is the current implementation which
 is more than sufficient, but I'm aware of the
 potential issues here, and I have an experimental
 patch sitting here which removes this race with
 the following change:

  - doesn't store the accounted value but
limit - accounted (i.e. the free resource)
  - uses atomic_add_return() 
  - when negative, an error is returned and
the resource amount is added back

 changes to the limit have to adjust the 'current'
 value too, but that is again simple and atomic

 best,
 Herbert

 PS: atomic_add_unless() didn't exist back then
 (at least I think so) but that might be an option
 too ...
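
(For reference, a minimal sketch of the race-free scheme described above; the
helper names and error value are made up, this is not Linux-VServer code:)

	/* counter holds the *free* amount, i.e. limit - accounted */
	static inline int res_charge(atomic_t *free_left, int amount)
	{
		if (atomic_add_return(-amount, free_left) < 0) {
			atomic_add(amount, free_left);	/* undo: over limit */
			return -ENOMEM;
		}
		return 0;
	}

	static inline void res_uncharge(atomic_t *free_left, int amount)
	{
		atomic_add(amount, free_left);
	}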

I think as far as having this discussion if you can remove that race
people will be more willing to talk about what vserver does.

That said anything that uses locks or atomic operations (finer grained locks)
because of the cache line ping pong is going to have scaling issues on large
boxes.

So in that sense anything short of per cpu variables sucks at scale.  That said
I would much rather get a simple correct version without the complexity of
per cpu counters, before we optimize the counters that much.

Eric


Re: /sys/devices/system/cpu/cpuX/online are missing

2007-03-13 Thread Heiko Carstens
On Tue, Mar 13, 2007 at 10:03:50AM +0100, Heiko Carstens wrote:
 On Tue, Mar 13, 2007 at 01:39:25AM +0100, Andreas Schwab wrote:
  Giuliano Pochini [EMAIL PROTECTED] writes:
  
   I had a look at arch/powerpc/kernel/smp.c but I'm not familiar at all 
   with 
   those parts of the kernel.
  
  See arch/powerpc/kernel/sysfs.c:topology_init.  I don't think there is
  anything to do here.  You probably don't have CONFIG_HOTPLUG_CPU enabled.
 
 I was referring to arch/ppc not arch/powerpc. But it seems that arch/ppc
 doesn't support cpu hotplug anyway. So I guess it's indeed just a missing
 config option.
 
 Grepping a bit further shows that arm suffered by the change that inverted
 the logic if the 'online' attribute for cpus should appear. Since arm
 supports cpu hotplug but the patch left arm out, it doesn't work there
 anymore (cc'ing arm people: changeset 72486f1f8f0a2bc828b9d30cf4690cf2dd6807fc
 is most probably disabling cpu hotplug support on arm like it did on s390).

Should have cc'ed Suresh Siddha who caused the breakage ;)


Re: [PATCH] kthread_should_stop_check_freeze (was: Re: [PATCH -mm 3/7] Freezer: Remove PF_NOFREEZE from rcutorture thread)

2007-03-13 Thread Christoph Hellwig
On Tue, Mar 13, 2007 at 08:44:11AM +0530, Srivatsa Vaddagiri wrote:
 On Mon, Mar 12, 2007 at 05:45:24PM -0500, Anton Blanchard wrote:
  Then please document it _clearly_ with the kthread code somewhere. 
 
 Document as well in the kernel_thread() API, as I notice people still
 use kernel_thread() some places (ex: rtasd.c in powerpc arch)?

They shouldn't use kernel_thread.



Re: 2.6.20*: PATA DMA timeout, hangs (2)

2007-03-13 Thread Frank van Maarseveen
On Mon, Mar 12, 2007 at 09:40:25PM +0100, Bartlomiej Zolnierkiewicz wrote:
 
 Hi,
 
 On Monday 12 March 2007, Frank van Maarseveen wrote:
  On Mon, Mar 12, 2007 at 01:21:18PM +0100, Bartlomiej Zolnierkiewicz wrote:
   
   Hi,
   
   Could you check if this is the same problem as this one:
   
   http://bugzilla.kernel.org/show_bug.cgi?id=8169
  
  Looks like it except that I don't see lost interrupt messages here. So,
  it might be something different (I don't know).
 
 From the first mail:
 
 hda: max request size: 128KiB
 hda: 40021632 sectors (20491 MB) w/2048KiB Cache, CHS=39704/16/63
 hda: cache flushes not supported
  hda: hda1 hda2 hda4
 
 It seems that DMA is not used by default (CONFIG_IDEDMA_PCI_AUTO=n),
 so this is probably exactly the same issue.
 
 Please try the patch attached to the bugzilla bug entry.

2.6.20.2 rejects this patch and I don't see a way to apply it by hand:
ide_set_dma() isn't there, nothing seems to match.

-- 
Frank


Re: 2.6.21rc suspend to ram regression on Lenovo X60

2007-03-13 Thread Rafael J. Wysocki
On Tuesday, 13 March 2007 05:08, Dave Jones wrote:
 I spent considerable time over the last day or so bisecting to
 find out why an X60 stopped resuming somewhen between 2.6.20 and current -git.
 (Total lockup, black screen of death).

Do you have CONFIG_TICK_ONESHOT or CONFIG_NO_HZ set?  If you do, could you
please unset them and retest?

Thanks,
Rafael


Re: [PATCH][RSDL-mm 0/7] RSDL cpu scheduler for 2.6.21-rc3-mm2

2007-03-13 Thread Con Kolivas
On Tuesday 13 March 2007 19:18, Ingo Molnar wrote:
 * Mike Galbraith [EMAIL PROTECTED] wrote:
  [...] The situation as we speak is that you can run cpu intensive
  tasks while watching eye-candy.  With RSDL, you can't, you feel the
  non-interactive load instantly. [...]

 i have to agree with Mike that this is a material regression that cannot
 be talked around.

 Con, we want RSDL to /improve/ interactivity. Having new scheduler
 interactivity logic that behaves /worse/ in the presence of CPU hogs,
 which CPU hogs are even reniced to +5, than the current interactivity
 code, is i think a non-starter. Could you try to fix this, please? Good
 interactivity in the presence of CPU hogs (be them default nice level or
 nice +5) is _the_ most important scheduler interactivity metric.
 Anything else is really secondary.

Well I guess you must have missed where I asked him if he would be happy if I 
changed +5 metrics to do whatever he wanted and he refused to answer me. That 
would easily fit within that scheme. Any percentage of nice value he chose. I 
suggest 50% of nice 0. Heck I can even increase it if he likes. All I asked 
for was an answer as to whether that would satisfy his criterion.

   Ingo

 ps. please be nice to each other - both of you are long-time
 scheduler contributors who did lots of cool stuff :-)

I have been civil. Only one email crossed the line on my part and I apologise.

-- 
-ck


Re: [RFC][PATCH 1/7] Resource counters

2007-03-13 Thread Pavel Emelianov
Eric W. Biederman wrote:
 Herbert Poetzl [EMAIL PROTECTED] writes:
 
 On Sun, Mar 11, 2007 at 01:00:15PM -0600, Eric W. Biederman wrote:
 Herbert Poetzl [EMAIL PROTECTED] writes:

 Linux-VServer does the accounting with atomic counters,
 so that works quite fine, just do the checks at the
 beginning of whatever resource allocation and the
 accounting once the resource is acquired ...
 Atomic operations versus locks is only a granularity thing.
 You still need the cache line which is the cost on SMP.

 Are you using atomic_add_return or atomic_add_unless or 
 are you performing you actions in two separate steps 
 which is racy? What I have seen indicates you are using 
 a racy two separate operation form.
 yes, this is the current implementation which
 is more than sufficient, but I'm aware of the
 potential issues here, and I have an experimental
 patch sitting here which removes this race with
 the following change:

  - doesn't store the accounted value but
limit - accounted (i.e. the free resource)
  - uses atomic_add_return() 
  - when negative, an error is returned and
the resource amount is added back

 changes to the limit have to adjust the 'current'
 value too, but that is again simple and atomic

 best,
 Herbert

 PS: atomic_add_unless() didn't exist back then
 (at least I think so) but that might be an option
 too ...
 
 I think as far as having this discussion if you can remove that race
 people will be more willing to talk about what vserver does.
 
 That said anything that uses locks or atomic operations (finer grained locks)
 because of the cache line ping pong is going to have scaling issues on large
 boxes.

BTW atomic_add_unless() is essentially a loop!!! Just
like spin_lock() is, so why is one better than another?

spin_lock() can go to schedule() on preemptive kernels
thus increasing interactivity, while atomic can't.
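
For context, atomic_add_unless() boils down to a cmpxchg loop on most
architectures, roughly:

	static inline int atomic_add_unless_sketch(atomic_t *v, int a, int u)
	{
		int c, old;

		c = atomic_read(v);
		for (;;) {
			if (unlikely(c == u))
				break;		/* already at the forbidden value */
			old = atomic_cmpxchg(v, c, c + a);
			if (likely(old == c))
				break;		/* our update won */
			c = old;		/* lost the race, retry */
		}
		return c != u;
	}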

 So in that sense anything short of per cpu variables sucks at scale.  That 
 said
 I would much rather get a simple correct version without the complexity of
 per cpu counters, before we optimize the counters that much.
 
 Eric
 



Re: [PATCH][RSDL-mm 0/7] RSDL cpu scheduler for 2.6.21-rc3-mm2

2007-03-13 Thread Ingo Molnar

* Con Kolivas [EMAIL PROTECTED] wrote:

 Well I guess you must have missed where I asked him if he would be 
 happy if I changed +5 metrics to do whatever he wanted and he refused 
 to answer me. [...]

I'd say let's keep nice levels out of this completely for now - while 
they should work _too_, it's easy because the scheduler has the 'nice' 
information. It's the basic behavior of CPU hogs that matters most.

So the question is: if all tasks are on the same nice level, how does, 
in Mike's test scenario, RSDL behave relative to the current 
interactivity code?

Ingo


Re: [ck] Re: [PATCH][RSDL-mm 0/7] RSDL cpu scheduler for 2.6.21-rc3-mm2

2007-03-13 Thread Con Kolivas
On Tuesday 13 March 2007 20:21, Con Kolivas wrote:
 On Tuesday 13 March 2007 19:18, Ingo Molnar wrote:
  * Mike Galbraith [EMAIL PROTECTED] wrote:
   [...] The situation as we speak is that you can run cpu intensive
   tasks while watching eye-candy.  With RSDL, you can't, you feel the
   non-interactive load instantly. [...]
 
  i have to agree with Mike that this is a material regression that cannot
  be talked around.
 
  Con, we want RSDL to /improve/ interactivity. Having new scheduler
  interactivity logic that behaves /worse/ in the presence of CPU hogs,
  which CPU hogs are even reniced to +5, than the current interactivity
  code, is i think a non-starter. Could you try to fix this, please? Good
  interactivity in the presence of CPU hogs (be them default nice level or
  nice +5) is _the_ most important scheduler interactivity metric.
  Anything else is really secondary.

 Well I guess you must have missed where I asked him if he would be happy if
 I changed +5 metrics to do whatever he wanted and he refused to answer me.
 That would easily fit within that scheme. Any percentage of nice value he
 chose. I suggest 50% of nice 0. Heck I can even increase it if he likes.
 All I asked for was an answer as to whether that would satisfy his
 criterion.

It seems Mike has chosen to go silent, so I'll guess on his part. 

nice on my debian etch seems to choose nice +10 without arguments contrary to 
a previous discussion that said 4 was the default. However 4 is a good value 
to use as a base of sorts.

What I propose is as a proportion of nice 0:
nice 4 1/2
nice 8 1/4
nice 12 1/8
nice 16 1/16
nice 20 1/32 (of course nice 20 doesn't exist)

and we can do the opposite in the other direction
nice -4 2
nice -8 4
nice -12 8
nice -16 16
nice -20 32
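
In closed form the proposal above is just weight(nice) = 2^(-nice/4) relative
to nice 0 (illustrative only, not code from the patch):

	#include <math.h>

	static double nice_weight(int nice)
	{
		return pow(2.0, -nice / 4.0);	/* +4 -> 1/2, +8 -> 1/4, -4 -> 2 */
	}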

Assuming no further discussion is forthcoming I'll implement that along with 
Al's suggestion for staggering the latencies better with nice differences 
since the two are changing the same thing.

-- 
-ck


Re: [PATCH][RSDL-mm 0/7] RSDL cpu scheduler for 2.6.21-rc3-mm2

2007-03-13 Thread Mike Galbraith
On Tue, 2007-03-13 at 09:18 +0100, Ingo Molnar wrote:

 Con, we want RSDL to /improve/ interactivity. Having new scheduler 
 interactivity logic that behaves /worse/ in the presence of CPU hogs, 
 which CPU hogs are even reniced to +5, than the current interactivity 
 code, is i think a non-starter. Could you try to fix this, please? Good 
 interactivity in the presence of CPU hogs (be them default nice level or 
 nice +5) is _the_ most important scheduler interactivity metric. 
 Anything else is really secondary.

I just retested with the encoders at nice 0, and the x/gforce combo is
terrible.  Funny thing though, x/gforce isn't as badly affected with a
kernel build.  Any build is quite noticeable, but even at -j8, the effect
doesn't seem to be (very brief test warning applies) as bad as with only
the two encoders running.  That seems quite odd.

-Mike



Re: [PATCH][RSDL-mm 0/7] RSDL cpu scheduler for 2.6.21-rc3-mm2

2007-03-13 Thread Ingo Molnar

* Mike Galbraith [EMAIL PROTECTED] wrote:

 I just retested with the encoders at nice 0, and the x/gforce combo is 
 terrible. [...]

ok. So nice levels had nothing to do with it - it's some other 
regression somewhere. How does the vanilla scheduler cope with the 
exactly same workload? I.e. could you describe the 'delta' difference in 
behavior - because the delta is what we are interested in mostly, the 
'absolute' behavior alone is not sufficient. Something like:

 - on scheduler foo, under this workload, the CPU hogs steal 70% CPU 
   time and the resulting desktop experience is 'choppy': mouse pointer 
   is laggy and audio skips.

 - on scheduler bar, under this workload, the CPU hogs are at 40% 
   CPU time and the desktop experience is smooth.

things like that - we really need to be able to see the delta.

 [...]  Funny thing though, x/gforce isn't as badly affected with a 
 kernel build.  Any build is quite noticable, but even at -j8, the 
 effect doen't seem to be (very brief test warning applies) as bad as 
 with only the two encoders running.  That seems quite odd.

likewise, how does the RSDL kernel build behavior compare to the vanilla 
scheduler's behavior? (what happens in one that doesn't happen in the 
other, etc.)

Ingo


Re: /sys/devices/system/cpu/cpuX/online are missing

2007-03-13 Thread Russell King
On Tue, Mar 13, 2007 at 10:11:59AM +0100, Heiko Carstens wrote:
 On Tue, Mar 13, 2007 at 10:03:50AM +0100, Heiko Carstens wrote:
  On Tue, Mar 13, 2007 at 01:39:25AM +0100, Andreas Schwab wrote:
   Giuliano Pochini [EMAIL PROTECTED] writes:
   
I had a look at arch/powerpc/kernel/smp.c but I'm not familiar at all 
with 
those parts of the kernel.
   
   See arch/powerpc/kernel/sysfs.c:topology_init.  I don't think there is
   anything to do here.  You probably don't have CONFIG_HOTPLUG_CPU enabled.
  
  I was referring to arch/ppc not arch/powerpc. But it seems that arch/ppc
  doesn't support cpu hotplug anyway. So I guess it's indeed just a missing
  config option.
  
  Grepping a bit further shows that arm suffered by the change that inverted
  the logic if the 'online' attribute for cpus should appear. Since arm
  supports cpu hotplug but the patch left arm out, it doesn't work there
  anymore (cc'ing arm people: changeset 
  72486f1f8f0a2bc828b9d30cf4690cf2dd6807fc
  is most probably disabling cpu hotplug support on arm like it did on s390).
 
 Should have cc'ed Suresh Siddha who caused the breakage ;)

Welcome to why cleanups are bad news. ;(  Yes, ARM also needs to be fixed
and I'd ask that in future people doing cleanups in core code take a little
more time to review the code before submitting patches *AND* give heads-up
to *EVERYONE* who might be affected by the change.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:


Re: [PATCH][RSDL-mm 0/7] RSDL cpu scheduler for 2.6.21-rc3-mm2

2007-03-13 Thread Con Kolivas
On Tuesday 13 March 2007 20:29, Ingo Molnar wrote:
 * Con Kolivas [EMAIL PROTECTED] wrote:
  Well I guess you must have missed where I asked him if he would be
  happy if I changed +5 metrics to do whatever he wanted and he refused
  to answer me. [...]

 I'd say lets keep nice levels out of this completely for now - while
 they should work _too_, it's easy because the scheduler has the 'nice'
 information. The basic behavior of CPU hogs that matters most.

 So the question is: if all tasks are on the same nice level, how does,
 in Mike's test scenario, RSDL behave relative to the current
 interactivity code?

If everything is run at nice 0? (this was not the test case but that's what 
you've asked for).

We have:
X + GForce contribute a load of 1
lame x 2 threads contribute a load of 2

In RSDL
X + GForce will get 33% cpu
lame will get 66% cpu

In mainline
X + GForce gets a fluctuating percentage somewhere between 35~45% as far as I 
can see on UP.
lame gets the rest
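
(The RSDL numbers above are just an even split across the three runnable
threads; illustrative arithmetic only:)

	/* share of one group = 100 * its threads / total runnable threads */
	static double share(int group_threads, int total_threads)
	{
		return 100.0 * group_threads / total_threads;
	}
	/* share(1, 3) ~= 33%  -> X + GForce       */
	/* share(2, 3) ~= 66%  -> two lame threads */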

The only way to get the same behaviour on RSDL without hacking an 
interactivity estimator, priority boost cpu misproportionator onto it is to 
either -nice X or +nice lame.

   Ingo

-- 
-ck


Re: [RFC][PATCH 4/7] RSS accounting hooks over the code

2007-03-13 Thread Eric W. Biederman
Dave Hansen [EMAIL PROTECTED] writes:

 On Mon, 2007-03-12 at 20:07 +0300, Kirill Korotaev wrote:
  On Mon, 2007-03-12 at 19:23 +0300, Kirill Korotaev wrote:
 For these you essentially need per-container page-_mapcount counter,
 otherwise you can't detect whether rss group still has the page in question
 being mapped
 in its processes' address spaces or not. 
  
  What do you mean by this?  You can always tell whether a process has a
  particular page mapped.  Could you explain the issue a bit more.  I'm
  not sure I get it.
 When we do charge/uncharge we have to answer on another question:
 whether *any* task from the *container* has this page mapped, not the
 whether *this* task has this page mapped.

 That's a bit more clear. ;)

 OK, just so I make sure I'm getting your argument here.  It would be too
 expensive to go looking through all of the rmap data for _any_ other
 task that might be sharing the charge (in the same container) with the
 current task that is doing the unmapping.  

Which is a questionable assumption.  Worst case we are talking a list
several thousand entries long, and generally if you are used by the same
container you will hit one of your processes long before you traverse
the whole list.

So at least the average case performance should be good.

It is only in the case when a page is shared between multiple containers
that this matters.

Eric


Re: /sys/devices/system/cpu/cpuX/online are missing

2007-03-13 Thread Andreas Schwab
Heiko Carstens [EMAIL PROTECTED] writes:

 On Tue, Mar 13, 2007 at 01:39:25AM +0100, Andreas Schwab wrote:
 Giuliano Pochini [EMAIL PROTECTED] writes:
 
  I had a look at arch/powerpc/kernel/smp.c but I'm not familiar at all with 
  those parts of the kernel.
 
 See arch/powerpc/kernel/sysfs.c:topology_init.  I don't think there is
 anything to do here.  You probably don't have CONFIG_HOTPLUG_CPU enabled.

 I was referring to arch/ppc not arch/powerpc.

Sorry, I missed that part.

 But it seems that arch/ppc doesn't support cpu hotplug anyway.

I think if there is no hotplug support then the file should not be created
in the first place.

Andreas.

-- 
Andreas Schwab, SuSE Labs, [EMAIL PROTECTED]
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
And now for something completely different.


__HAVE_ARCH_PTEP_TEST_AND_CLEAR_{DIRTY,YOUNG} on i386

2007-03-13 Thread Jan Beulich
Isn't defining these on i386 of at most historical value? The only consumer is
include/asm-generic/pgtable.h (ptep_clear_flush_{dirty,young}), and those are
already suppressed by __HAVE_ARCH_PTEP_CLEAR_{DIRTY,YOUNG}_FLUSH
being defined in include/asm-i386/pgtable.h. Or is there a particular need to
detect if other uses appear (in which case the comment accompanying their
definitions is pretty misleading)?

Thanks, Jan


Re: SMP performance degradation with sysbench

2007-03-13 Thread Andrea Arcangeli
On Tue, Mar 13, 2007 at 04:11:02PM +1100, Nick Piggin wrote:
 Hi Anton,
 
 Very cool. Yeah I had come to the conclusion that it wasn't a kernel
 issue, and basically was afraid to look into userspace ;)

btw, regardless of what glibc is doing, still the cpu shouldn't go
idle IMHO. Even if we're overscheduling and thrashing over the mmap_sem
with threads (no idea if other OS schedules the task away when they
find the other cpu in the mmap critical section), or if we're
overscheduling with futex locking, the cpu usage should remain 100%
system time in the worst case. The only explanation for going idle
legitimately could be on HT cpus where HT may hurt more than help but
on real multicore it shouldn't happen.


Re: [PATCH] kthread_should_stop_check_freeze (was: Re: [PATCH -mm 3/7] Freezer: Remove PF_NOFREEZE from rcutorture thread)

2007-03-13 Thread Srivatsa Vaddagiri
On Tue, Mar 13, 2007 at 09:16:29AM +, Christoph Hellwig wrote:
  Document as well in the kernel_thread() API, as I notice people still
  use kernel_thread() some places (ex: rtasd.c in powerpc arch)?
 
 They shouldn't use kernel_thread.

Hmm ..that needs to be documented as well then! I can easily count more
than dozen places where kernel_thread() is being used.

I agree though that kthread is a much cleaner interface to create/destroy 
threads.

-- 
Regards,
vatsa


Re: s2ram still broken with CONFIG_NO_HZ / HPET (macbook pro)

2007-03-13 Thread Soeren Sonnenburg
On Mon, 2007-03-12 at 19:59 +0900, Tejun Heo wrote:
 Soeren Sonnenburg wrote:
  Elsewise I still see the
  
  ATA: abnormal status 0x80 on port 0x000140df
  ATA: abnormal status 0x80 on port 0x000140df
  ata1.00: configured for UDMA/33
  ata3.01: revalidation failed (errno=-2)
  ata3: failed to recover some devices, retrying in 5 secs
  ATA: abnormal status 0x7F on port 0x000140df
  ATA: abnormal status 0x7F on port 0x000140df
  ata3.01: configured for UDMA/133
 
 I can't tell much about HPET timer, only the ATA messages.
 
 Abnormal messages can be ignored.  Hmmm... revalidation failed without
 explaining why.  Can you apply the attached patch and see whether the
 added message gets printed?

Well it is there:

ATA: abnormal status 0x80 on port 0x000140df
ata1.00: configured for UDMA/33
ata3.01: NODEV after polling detection
ata3.01: revalidation failed (errno=-2)
ata3: failed to recover some devices, retrying in 5 secs
ATA: abnormal status 0x7F on port 0x000140df
ATA: abnormal status 0x7F on port 0x000140df
ata3.01: configured for UDMA/133
SCSI device sda: 234441648 512-byte hdwr sectors (120034 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: write cache: enabled, read cache: enabled, doesn't
support DPO or FUA

just FYI a macbook pro has a sata disc + a dvd drive registered
as /dev/sda and /dev/scd0 driven by ata_piix:

dmesg | grep ata_piix 
ata_piix :00:1f.1: version 2.10ac1
scsi0 : ata_piix
scsi1 : ata_piix
ata_piix :00:1f.2: MAP [ P0 P2 XX XX ]
ata_piix :00:1f.2: invalid MAP value 0
scsi2 : ata_piix
scsi3 : ata_piix

Soeren
-- 
Sometimes, there's a moment as you're waking, when you become aware of
the real world around you, but you're still dreaming.


Re: Question re hiddev

2007-03-13 Thread Jiri Kosina
On Tue, 13 Mar 2007, Robert Marquardt wrote:

  as there are many Bluetooth devices with conform to HID specification
 Namely the Wii controller which is already happily used by many programmers
 through Windows HID API.

Hi Robert,

not only this piece of hardware, but many others - for example almost all 
Bluetooth keyboards and mice are capable of working both in HID and HCI 
modes, etc. The layer introduced in 2.6.20 gives them the possibility of 
using full reporting facilities of the HID layer.

  There are still pending issues though, one of them being converting 
  the HID layer to bus, so that individual device that don't wish the 
  certain device to be handled by generic HID code, register on this HID 
  bus and handle the HID events from given device in a unified way (for 
  example current Wacom driver would use this facility, among other). As 
  a bonus, this is going to shrink the hid_blacklist, which currently 
  has to contain all devices for which there are specialized drivers.
 Ah, this means our iowarrior driver will be shortlived.

I don't think so. Firstly, it will take some time until the HID layer is 
converted to a bus, as I have other things pending. Secondly, the 
iowarrior driver will still be needed to handle the HW-specific reports 
that won't be handled by the generic HID layer; it would only have to be 
changed to register through the HID bus.

 Do you know the Windows HID API? It would be nice to take it into 
 account so that a compatible API can be written. 

No, I don't know Windows HID api at all. Is it worth looking at?

 Mainly this needs a way to add a ReportID of 0x00 to reports from 
 devices without ReportIDs and access to the HID descriptor so the usages 
 can be extracted from a report.

Just a sidenote - this is in fact what I am currently implementing also 
for the new hidraw userspace interface.

Thanks,

-- 
Jiri Kosina


[PATCH 2.6.21-rc3-mm2 0/4] Futexes functionalities and improvements

2007-03-13 Thread Pierre . Peiffer
Hi Andrew,

This is a re-send of a series of patches concerning futexes (hereafter 
is a short description)

Could you consider them for inclusion in -mm tree ?

All of them have already been discussed in January and have already 
been included in -rt for a while. I think that we agreed to potentially include 
them in the -mm tree.

Ulrich is especially interested in sys_futex64.

They are:
* futex uses prio list : allows RT-threads to be woken in priority order
instead of FIFO order.
* futex_wait uses hrtimer : allows the use of finer timer resolution.
* futex_requeue_pi functionality : allows use of requeue optimization for
PI-mutexes/PI-futexes.
* futex64 syscall : allows use of 64-bit futexes instead of 32-bit. 


Note: it does not include the PI state locking fix sent yesterday by Ingo.

Thanks,

-- 
Pierre


[PATCH 2.6.21-rc3-mm2 1/4] futex priority based wakeup

2007-03-13 Thread Pierre . Peiffer
Today, all threads waiting for a given futex are woken in FIFO order (first
waiter woken first) instead of priority order.

This patch makes use of plist (priority-ordered lists) instead of a simple list
in futex_hash_bucket.

All non-RT threads are stored with priority MAX_RT_PRIO, causing them to be
woken last, in FIFO order (RT-threads are woken first, in priority order).
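
The enqueue side is not visible in the hunks below; the idea is roughly the
following (a sketch of how queue_me() would pick the plist priority, not a
verbatim excerpt):

	/* RT tasks sort by their real priority; everything else collapses
	 * onto MAX_RT_PRIO and therefore stays FIFO among itself */
	int prio = min(current->normal_prio, MAX_RT_PRIO);

	plist_node_init(&q->list, prio);
	plist_add(&q->list, &hb->chain);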

Signed-off-by: Sébastien Dugué [EMAIL PROTECTED]
Signed-off-by: Pierre Peiffer [EMAIL PROTECTED]

---
 kernel/futex.c |   78 +++--
 1 file changed, 49 insertions(+), 29 deletions(-)

Index: b/kernel/futex.c
===
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -81,12 +81,12 @@ struct futex_pi_state {
  * we can wake only the relevant ones (hashed queues may be shared).
  *
  * A futex_q has a woken state, just like tasks have TASK_RUNNING.
- * It is considered woken when list_empty(&q->list) || q->lock_ptr == 0.
+ * It is considered woken when plist_node_empty(&q->list) || q->lock_ptr == 0.
  * The order of wakup is always to make the first condition true, then
  * wake up q->waiters, then make the second condition true.
  */
 struct futex_q {
-   struct list_head list;
+   struct plist_node list;
wait_queue_head_t waiters;
 
/* Which hash list lock to use: */
@@ -108,8 +108,8 @@ struct futex_q {
  * Split the global futex_lock into every hash list lock.
  */
 struct futex_hash_bucket {
-   spinlock_t  lock;
-   struct list_head   chain;
+   spinlock_t lock;
+   struct plist_head chain;
 };
 
 static struct futex_hash_bucket futex_queues[1<<FUTEX_HASHBITS];
@@ -443,13 +443,13 @@ lookup_pi_state(u32 uval, struct futex_h
 {
struct futex_pi_state *pi_state = NULL;
struct futex_q *this, *next;
-   struct list_head *head;
+   struct plist_head *head;
struct task_struct *p;
pid_t pid;
 
 head = &hb->chain;
 
-   list_for_each_entry_safe(this, next, head, list) {
+   plist_for_each_entry_safe(this, next, head, list) {
 if (match_futex(&this->key, &me->key)) {
/*
 * Another waiter already exists - bump up
@@ -513,12 +513,12 @@ lookup_pi_state(u32 uval, struct futex_h
  */
 static void wake_futex(struct futex_q *q)
 {
-   list_del_init(&q->list);
+   plist_del(&q->list, &q->list.plist);
 if (q->filp)
 send_sigio(&q->filp->f_owner, q->fd, POLL_IN);
 /*
  * The lock in wake_up_all() is a crucial memory barrier after the
-* list_del_init() and also before assigning to q->lock_ptr.
+* plist_del() and also before assigning to q->lock_ptr.
  */
 wake_up_all(&q->waiters);
/*
@@ -631,7 +631,7 @@ static int futex_wake(u32 __user *uaddr,
 {
struct futex_hash_bucket *hb;
struct futex_q *this, *next;
-   struct list_head *head;
+   struct plist_head *head;
union futex_key key;
int ret;
 
@@ -645,7 +645,7 @@ static int futex_wake(u32 __user *uaddr,
 spin_lock(&hb->lock);
 head = &hb->chain;
 
-   list_for_each_entry_safe(this, next, head, list) {
+   plist_for_each_entry_safe(this, next, head, list) {
 if (match_futex (&this->key, &key)) {
 if (this->pi_state) {
ret = -EINVAL;
@@ -673,7 +673,7 @@ futex_wake_op(u32 __user *uaddr1, u32 __
 {
union futex_key key1, key2;
struct futex_hash_bucket *hb1, *hb2;
-   struct list_head *head;
+   struct plist_head *head;
struct futex_q *this, *next;
int ret, op_ret, attempt = 0;
 
@@ -746,7 +746,7 @@ retry:
 
	head = &hb1->chain;
 
-	list_for_each_entry_safe(this, next, head, list) {
+	plist_for_each_entry_safe(this, next, head, list) {
		if (match_futex (&this->key, &key1)) {
			wake_futex(this);
			if (++ret >= nr_wake)
@@ -758,7 +758,7 @@ retry:
		head = &hb2->chain;
 
		op_ret = 0;
-		list_for_each_entry_safe(this, next, head, list) {
+		plist_for_each_entry_safe(this, next, head, list) {
			if (match_futex (&this->key, &key2)) {
				wake_futex(this);
				if (++op_ret >= nr_wake2)
@@ -785,7 +785,7 @@ static int futex_requeue(u32 __user *uad
 {
union futex_key key1, key2;
struct futex_hash_bucket *hb1, *hb2;
-   struct list_head *head1;
+   struct plist_head *head1;
struct futex_q *this, *next;
int ret, drop_count = 0;
 
@@ -834,7 +834,7 @@ static int futex_requeue(u32 __user *uad
}
 
	head1 = &hb1->chain;
-	list_for_each_entry_safe(this, next, head1, list) {
+	plist_for_each_entry_safe(this, next, head1, list) {
		if (!match_futex (&this->key, &key1))
  

[PATCH 2.6.21-rc3-mm2 2/4] Make futex_wait() use an hrtimer for timeout

2007-03-13 Thread Pierre . Peiffer
This patch modifies futex_wait() to use an hrtimer + schedule() in place of
schedule_timeout().

  schedule_timeout() is tick based, therefore the timeout granularity is
the tick (1 ms, 4 ms or 10 ms depending on HZ). By using a high resolution
timer for timeout wakeup, we can attain a much finer timeout granularity
(in the microsecond range). This parallels what is already done for
futex_lock_pi().

  The timeout passed to the syscall is no longer converted to jiffies;
it is passed down to do_futex() and futex_wait() as a timespec,
thus keeping nanosecond resolution.

  Also this removes the need to pass the nanoseconds timeout part to
futex_lock_pi() in val2.

  In futex_wait(), if there is no timeout then a regular schedule() is
performed. Otherwise, an hrtimer is fired before schedule() is called.
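
To see what the finer granularity means in practice, here is a small userspace
check (illustrative only, not part of the patch; it just uses the standard
futex(2) calling convention with a relative timeout):

	#include <stdio.h>
	#include <time.h>
	#include <linux/futex.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	int main(void)
	{
		int futex_word = 0;			/* nobody will wake us */
		struct timespec timeout = { 0, 200000 };	/* 200 us */
		struct timespec t0, t1;

		clock_gettime(CLOCK_MONOTONIC, &t0);
		/* blocks until the timeout fires, since the value still matches */
		syscall(SYS_futex, &futex_word, FUTEX_WAIT, 0, &timeout, NULL, 0);
		clock_gettime(CLOCK_MONOTONIC, &t1);

		/* tick-based timeouts round this up to a whole tick;
		 * hrtimer-based timeouts come back in roughly 200 us */
		printf("blocked for %ld ns\n",
		       (t1.tv_sec - t0.tv_sec) * 1000000000L +
		       (t1.tv_nsec - t0.tv_nsec));
		return 0;
	}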

Signed-off-by: Sébastien Dugué [EMAIL PROTECTED]
Signed-off-by: Pierre Peiffer [EMAIL PROTECTED]

---
 include/linux/futex.h |2 -
 kernel/futex.c|   59 +-
 kernel/futex_compat.c |   12 ++
 3 files changed, 43 insertions(+), 30 deletions(-)

Index: b/kernel/futex.c
===
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -998,7 +998,7 @@ static void unqueue_me_pi(struct futex_q
drop_futex_key_refs(q-key);
 }
 
-static int futex_wait(u32 __user *uaddr, u32 val, unsigned long time)
+static int futex_wait(u32 __user *uaddr, u32 val, struct timespec *time)
 {
struct task_struct *curr = current;
DECLARE_WAITQUEUE(wait, curr);
@@ -1006,6 +1006,8 @@ static int futex_wait(u32 __user *uaddr,
struct futex_q q;
u32 uval;
int ret;
+   struct hrtimer_sleeper t;
+   int rem = 0;
 
q.pi_state = NULL;
  retry:
@@ -1083,8 +1085,31 @@ static int futex_wait(u32 __user *uaddr,
 * !plist_node_empty() is safe here without any lock.
 * q.lock_ptr != 0 is not safe, because of ordering against wakeup.
 */
-	if (likely(!plist_node_empty(&q.list)))
-		time = schedule_timeout(time);
+	if (likely(!plist_node_empty(&q.list))) {
+   if (!time)
+   schedule();
+   else {
+			hrtimer_init(&t.timer, CLOCK_MONOTONIC,
+				     HRTIMER_MODE_REL);
+			hrtimer_init_sleeper(&t, current);
+			t.timer.expires = timespec_to_ktime(*time);
+
+			hrtimer_start(&t.timer, t.timer.expires,
+				      HRTIMER_MODE_REL);
+
+   /*
+* the timer could have already expired, in which
+* case current would be flagged for rescheduling.
+* Don't bother calling schedule.
+*/
+   if (likely(t.task))
+   schedule();
+
+   hrtimer_cancel(t.timer);
+
+			/* Flag if a timeout occurred */
+   rem = (t.task == NULL);
+   }
+   }
+
__set_current_state(TASK_RUNNING);
 
/*
@@ -1095,7 +1120,7 @@ static int futex_wait(u32 __user *uaddr,
/* If we were woken (and unqueued), we succeeded, whatever. */
if (!unqueue_me(q))
return 0;
-   if (time == 0)
+   if (rem)
return -ETIMEDOUT;
/*
 * We expect signal_pending(current), but another thread may
@@ -1117,8 +1142,8 @@ static int futex_wait(u32 __user *uaddr,
  * if there are waiters then it will block, it does PI, etc. (Due to
  * races the kernel might see a 0 value of the futex too.)
  */
-static int futex_lock_pi(u32 __user *uaddr, int detect, unsigned long sec,
-long nsec, int trylock)
+static int futex_lock_pi(u32 __user *uaddr, int detect, struct timespec *time,
+int trylock)
 {
struct hrtimer_sleeper timeout, *to = NULL;
struct task_struct *curr = current;
@@ -1130,11 +1155,11 @@ static int futex_lock_pi(u32 __user *uad
if (refill_pi_state_cache())
return -ENOMEM;
 
-   if (sec != MAX_SCHEDULE_TIMEOUT) {
+   if (time) {
		to = &timeout;
		hrtimer_init(&to->timer, CLOCK_REALTIME, HRTIMER_MODE_ABS);
		hrtimer_init_sleeper(to, current);
-		to->timer.expires = ktime_set(sec, nsec);
+		to->timer.expires = timespec_to_ktime(*time);
}
 
q.pi_state = NULL;
@@ -1770,7 +1795,7 @@ void exit_robust_list(struct task_struct
}
 }
 
-long do_futex(u32 __user *uaddr, int op, u32 val, unsigned long timeout,
+long do_futex(u32 __user *uaddr, int op, u32 val, struct timespec *timeout,
u32 __user *uaddr2, u32 val2, u32 val3)
 {
int ret;
@@ -1796,13 +1821,13 @@ long do_futex(u32 __user *uaddr, int op,
ret = futex_wake_op(uaddr, uaddr2, val, val2, val3);
   

Re: /sys/devices/system/cpu/cpuX/online are missing

2007-03-13 Thread Russell King
On Tue, Mar 13, 2007 at 09:40:39AM +, Russell King wrote:
 On Tue, Mar 13, 2007 at 10:11:59AM +0100, Heiko Carstens wrote:
  On Tue, Mar 13, 2007 at 10:03:50AM +0100, Heiko Carstens wrote:
   I was referring to arch/ppc not arch/powerpc. But it seems that arch/ppc
   doesn't support cpu hotplug anyway. So I guess it's indeed just a missing
   config option.
   
   Grepping a bit further shows that arm suffered by the change that inverted
   the logic if the 'online' attribute for cpus should appear. Since arm
   supports cpu hotplug but the patch left arm out, it doesn't work there
   anymore (cc'ing arm people: changeset 
   72486f1f8f0a2bc828b9d30cf4690cf2dd6807fc
   is most probably disabling cpu hotplug support on arm like it did on 
   s390).
  
  Should have cc'ed Suresh Siddha who caused the breakage ;)
 
 Welcome to why cleanups are bad news. ;(  Yes, ARM also needs to be fixed
 and I'd ask that in future people doing cleanups in core code take a little
 more time to review the code before submitting patches *AND* give heads-up
 to *EVERYONE* who might be affected by the change.

Right, here's the ARM fix which is now in the ARM tree:

# Base git commit: 8b9909ded6922c33c221b105b26917780cfa497d
#   (Merge branch 'merge' of 
master.kernel.org:/pub/scm/linux/kernel/git/paulus/powerpc)
#
# Author:Russell King (Tue Mar 13 09:54:21 GMT 2007)
# Committer: Russell King (Tue Mar 13 09:54:21 GMT 2007)
#   
#   [ARM] Fix breakage caused by 72486f1f8f0a2bc828b9d30cf4690cf2dd6807fc
#   
#   72486f1f8f0a2bc828b9d30cf4690cf2dd6807fc inverted the sense for
#   enabling hotplug CPU controls without reference to any other
#   architecture other than i386, ia64 and PowerPC.  This left
#   everyone else without hotplug CPU control.
#   
#   Fix ARM for this brain damage.
#   
#   Signed-off-by: Russell King
#
#arch/arm/kernel/setup.c |7 +--
#1 files changed, 5 insertions(+), 2 deletions(-)
#
diff --git a/arch/arm/kernel/setup.c b/arch/arm/kernel/setup.c
index 03e37af..0453dcc 100644
--- a/arch/arm/kernel/setup.c
+++ b/arch/arm/kernel/setup.c
@@ -839,8 +839,11 @@ static int __init topology_init(void)
 {
int cpu;
 
-   for_each_possible_cpu(cpu)
-		register_cpu(&per_cpu(cpu_data, cpu).cpu, cpu);
+	for_each_possible_cpu(cpu) {
+		struct cpuinfo_arm *cpuinfo = &per_cpu(cpu_data, cpu);
+		cpuinfo->cpu.hotpluggable = 1;
+		register_cpu(&cpuinfo->cpu, cpu);
+   }
 
return 0;
 }


-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:


[PATCH 2.6.21-rc3-mm2 4/4] sys_futex64 : allows 64bit futexes

2007-03-13 Thread Pierre . Peiffer
This last patch is an adaptation of the sys_futex64 syscall provided in the -rt
patch (originally written by Ingo). It allows the use of 64-bit futexes.

I have reworked most of the code to avoid code duplication.

It does not provide the functionality for all architectures (only for x86_64
for now).
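
A rough idea of how userspace would call it (purely illustrative; the wrapper
and the fallback syscall number below are assumptions, the real value comes
from the patch's unistd.h hunk):

	#include <stdint.h>
	#include <time.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	#ifndef __NR_futex64
	#define __NR_futex64 -1		/* placeholder: take the real number from unistd.h */
	#endif

	/* hypothetical wrapper: same argument layout as futex(2), but on a 64-bit word */
	static long futex64(uint64_t *uaddr, int op, uint64_t val,
			    const struct timespec *timeout)
	{
		return syscall(__NR_futex64, uaddr, op, val, timeout, NULL, 0);
	}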

Signed-off-by: Pierre Peiffer [EMAIL PROTECTED]

---
 include/asm-x86_64/futex.h  |  113 
 include/asm-x86_64/unistd.h |4 
 include/linux/futex.h   |7 -
 include/linux/syscalls.h|3 
 kernel/futex.c  |  248 +++-
 kernel/futex_compat.c   |3 
 kernel/sys_ni.c |1 
 7 files changed, 301 insertions(+), 78 deletions(-)

Index: b/include/asm-x86_64/futex.h
===
--- a/include/asm-x86_64/futex.h
+++ b/include/asm-x86_64/futex.h
@@ -41,6 +41,39 @@
	  "=&r" (tem)						\
	: "r" (oparg), "i" (-EFAULT), "m" (*uaddr), "1" (0))
 
+#define __futex_atomic_op1_64(insn, ret, oldval, uaddr, oparg) \
+  __asm__ __volatile (						\
+"1:	" insn "\n"						\
+"2:	.section .fixup,\"ax\"\n\
+3:	movq	%3, %1\n\
+	jmp	2b\n\
+	.previous\n\
+	.section __ex_table,\"a\"\n\
+	.align	8\n\
+	.quad	1b,3b\n\
+	.previous"						\
+	: "=r" (oldval), "=r" (ret), "=m" (*uaddr)		\
+	: "i" (-EFAULT), "m" (*uaddr), "0" (oparg), "1" (0))
+
+#define __futex_atomic_op2_64(insn, ret, oldval, uaddr, oparg) \
+  __asm__ __volatile (						\
+"1:	movq	%2, %0\n\
+	movq	%0, %3\n"					\
+	insn "\n"						\
+"2:	" LOCK_PREFIX "cmpxchgq %3, %2\n\
+	jnz	1b\n\
+3:	.section .fixup,\"ax\"\n\
+4:	movq	%5, %1\n\
+	jmp	3b\n\
+	.previous\n\
+	.section __ex_table,\"a\"\n\
+	.align	8\n\
+	.quad	1b,4b,2b,4b\n\
+	.previous"						\
+	: "=&a" (oldval), "=&r" (ret), "=m" (*uaddr),		\
+	  "=&r" (tem)						\
+	: "r" (oparg), "i" (-EFAULT), "m" (*uaddr), "1" (0))
+
 static inline int
 futex_atomic_op_inuser (int encoded_op, int __user *uaddr)
 {
@@ -95,6 +128,60 @@ futex_atomic_op_inuser (int encoded_op, 
 }
 
 static inline int
+futex_atomic_op_inuser64 (int encoded_op, u64 __user *uaddr)
+{
+	int op = (encoded_op >> 28) & 7;
+	int cmp = (encoded_op >> 24) & 15;
+	u64 oparg = (encoded_op << 8) >> 20;
+	u64 cmparg = (encoded_op << 20) >> 20;
+	u64 oldval = 0, ret, tem;
+
+	if (encoded_op & (FUTEX_OP_OPARG_SHIFT << 28))
+		oparg = 1 << oparg;
+
+   if (! access_ok (VERIFY_WRITE, uaddr, sizeof(u64)))
+   return -EFAULT;
+
+   inc_preempt_count();
+
+   switch (op) {
+   case FUTEX_OP_SET:
+		__futex_atomic_op1_64("xchgq %0, %2", ret, oldval, uaddr,
+				      oparg);
+		break;
+	case FUTEX_OP_ADD:
+		__futex_atomic_op1_64(LOCK_PREFIX "xaddq %0, %2", ret, oldval,
+				      uaddr, oparg);
+		break;
+	case FUTEX_OP_OR:
+		__futex_atomic_op2_64("orq %4, %3", ret, oldval, uaddr, oparg);
+		break;
+	case FUTEX_OP_ANDN:
+		__futex_atomic_op2_64("andq %4, %3", ret, oldval, uaddr,
+				      ~oparg);
+		break;
+	case FUTEX_OP_XOR:
+		__futex_atomic_op2_64("xorq %4, %3", ret, oldval, uaddr, oparg);
+   break;
+   default:
+   ret = -ENOSYS;
+   }
+
+   dec_preempt_count();
+
+   if (!ret) {
+   switch (cmp) {
+		case FUTEX_OP_CMP_EQ: ret = (oldval == cmparg); break;
+		case FUTEX_OP_CMP_NE: ret = (oldval != cmparg); break;
+		case FUTEX_OP_CMP_LT: ret = (oldval < cmparg); break;
+		case FUTEX_OP_CMP_GE: ret = (oldval >= cmparg); break;
+		case FUTEX_OP_CMP_LE: ret = (oldval <= cmparg); break;
+		case FUTEX_OP_CMP_GT: ret = (oldval > cmparg); break;
+   default: ret = -ENOSYS;
+   }
+   }
+   return ret;
+}
+
+static inline int
 futex_atomic_cmpxchg_inatomic(int __user *uaddr, int oldval, int newval)
 {
if (!access_ok(VERIFY_WRITE, uaddr, sizeof(int)))
@@ -121,5 +208,31 @@ futex_atomic_cmpxchg_inatomic(int __user
return oldval;
 }
 
+static inline u64
+futex_atomic_cmpxchg_inatomic64(u64 __user *uaddr, u64 oldval, u64 newval)
+{
+   if (!access_ok(VERIFY_WRITE, uaddr, sizeof(u64)))
+   return -EFAULT;
+
+   __asm__ __volatile__(
+	"1:	" LOCK_PREFIX "cmpxchgq %3, %1	\n"
+
+	"2:	.section .fixup, \"ax\"	\n"
+ 

Re: Gigabyte GA-M57SLI-S4 (the linuxbios compatible version) problems

2007-03-13 Thread ST
Hi Ingo
 That seem to be no hardware failure.
 It's the same here. But with the recent bigiron-kernel
 from Ubuntu edgy eft (Ubuntu 2.6.17-11.35-server-bigiron).
Thanks for your answer. This board has LinuxBIOS support and I plan to put
LinuxBIOS on it; there has been a post which gives the right boot
parameters:

apic=debug acpi_dbg_level=0x pci=noacpi,routeirq 
snd-hda-intel.enable_msi=1

I guess the "pci=noacpi,routeirq" part makes the difference, but I haven't tried
it myself yet.

ST


[PATCH 2.6.21-rc3-mm2 3/4] futex_requeue_pi optimization

2007-03-13 Thread Pierre . Peiffer
This patch provides the futex_requeue_pi functionality.

This provides an optimization, already used for (normal) futexes, to be used for
PI-futexes.

This optimization is currently used by glibc in pthread_cond_broadcast(), when
using normal mutexes. With futex_requeue_pi, it can be used with PRIO_INHERIT
mutexes too.
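
To make the optimization concrete, here is a hedged userspace sketch of the
non-PI variant that glibc already relies on (FUTEX_CMP_REQUEUE); the futex()
wrapper and the two futex words are illustrative, only the flag names come from
linux/futex.h:

	#include <linux/futex.h>
	#include <limits.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	static long futex(int *uaddr, int op, int val, unsigned long val2,
			  int *uaddr2, int val3)
	{
		return syscall(SYS_futex, uaddr, op, val, val2, uaddr2, val3);
	}

	/*
	 * Broadcast: wake one waiter and move every other waiter from the
	 * condvar futex onto the mutex futex, instead of waking them all
	 * and letting them pile up on the mutex (thundering herd).
	 * FUTEX_CMP_REQUEUE_PI does the same when the target is a PI mutex.
	 */
	static void cond_broadcast(int *cond_futex, int *mutex_futex)
	{
		int expected = *cond_futex;

		futex(cond_futex, FUTEX_CMP_REQUEUE, 1, INT_MAX,
		      mutex_futex, expected);
	}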

Signed-off-by: Pierre Peiffer [EMAIL PROTECTED]

---
 include/linux/futex.h   |8 
 kernel/futex.c  |  557 +++-
 kernel/futex_compat.c   |3 
 kernel/rtmutex.c|   41 ---
 kernel/rtmutex_common.h |   34 ++
 5 files changed, 555 insertions(+), 88 deletions(-)

Index: b/include/linux/futex.h
===
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -15,6 +15,7 @@
 #define FUTEX_LOCK_PI  6
 #define FUTEX_UNLOCK_PI7
 #define FUTEX_TRYLOCK_PI   8
+#define FUTEX_CMP_REQUEUE_PI   9
 
 /*
  * Support for robust futexes: the kernel cleans up held futexes at
@@ -83,9 +84,14 @@ struct robust_list_head {
 #define FUTEX_OWNER_DIED	0x40000000
 
 /*
+ * Some processes have been requeued on this PI-futex
+ */
+#define FUTEX_WAITER_REQUEUED	0x20000000
+
+/*
  * The rest of the robust-futex field is for the TID:
  */
-#define FUTEX_TID_MASK		0x3fffffff
+#define FUTEX_TID_MASK		0x0fffffff
 
 /*
  * This limit protects against a deliberately circular list.
Index: b/kernel/futex.c
===
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -53,6 +53,12 @@
 
 #include "rtmutex_common.h"
 
+#ifdef CONFIG_DEBUG_RT_MUTEXES
+# include "rtmutex-debug.h"
+#else
+# include "rtmutex.h"
+#endif
+
 #define FUTEX_HASHBITS (CONFIG_BASE_SMALL ? 4 : 8)
 
 /*
@@ -102,6 +108,12 @@ struct futex_q {
/* Optional priority inheritance state: */
struct futex_pi_state *pi_state;
struct task_struct *task;
+
+   /*
+* This waiter is used in case of requeue from a
+* normal futex to a PI-futex
+*/
+   struct rt_mutex_waiter waiter;
 };
 
 /*
@@ -224,6 +236,25 @@ int get_futex_key(u32 __user *uaddr, uni
 EXPORT_SYMBOL_GPL(get_futex_key);
 
 /*
+ * Retrieve the original address used to compute this key
+ */
+static void *get_futex_address(union futex_key *key)
+{
+   void *uaddr;
+
+	if (key->both.offset & 1) {
+		/* shared mapping */
+		uaddr = (void*)((key->shared.pgoff << PAGE_SHIFT)
+				+ key->shared.offset - 1);
+	} else {
+		/* private mapping */
+		uaddr = (void*)(key->private.address + key->private.offset);
+   }
+
+   return uaddr;
+}
+
+/*
  * Take a reference to the resource addressed by a key.
  * Can be called while holding spinlocks.
  *
@@ -439,7 +470,8 @@ void exit_pi_state_list(struct task_stru
 }
 
 static int
-lookup_pi_state(u32 uval, struct futex_hash_bucket *hb, struct futex_q *me)
+lookup_pi_state(u32 uval, struct futex_hash_bucket *hb,
+   union futex_key *key, struct futex_pi_state **ps)
 {
struct futex_pi_state *pi_state = NULL;
struct futex_q *this, *next;
@@ -450,7 +482,7 @@ lookup_pi_state(u32 uval, struct futex_h
	head = &hb->chain;
 
	plist_for_each_entry_safe(this, next, head, list) {
-		if (match_futex(&this->key, &me->key)) {
+		if (match_futex(&this->key, key)) {
			/*
			 * Another waiter already exists - bump up
			 * the refcount and return its pi_state:
@@ -465,7 +497,7 @@ lookup_pi_state(u32 uval, struct futex_h
			WARN_ON(!atomic_read(&pi_state->refcount));
 
			atomic_inc(&pi_state->refcount);
-			me->pi_state = pi_state;
+   *ps = pi_state;
 
return 0;
}
@@ -492,7 +524,7 @@ lookup_pi_state(u32 uval, struct futex_h
	rt_mutex_init_proxy_locked(&pi_state->pi_mutex, p);
 
	/* Store the key for possible exit cleanups: */
-	pi_state->key = me->key;
+	pi_state->key = *key;
 
	spin_lock_irq(&p->pi_lock);
	WARN_ON(!list_empty(&pi_state->list));
@@ -502,7 +534,7 @@ lookup_pi_state(u32 uval, struct futex_h
 
	put_task_struct(p);
 
-	me->pi_state = pi_state;
+   *ps = pi_state;
 
return 0;
 }
@@ -561,6 +593,8 @@ static int wake_futex_pi(u32 __user *uad
 */
	if (!(uval & FUTEX_OWNER_DIED)) {
		newval = FUTEX_WAITERS | new_owner->pid;
+		/* Keep the FUTEX_WAITER_REQUEUED flag if it was set */
+		newval |= (uval & FUTEX_WAITER_REQUEUED);
 
pagefault_disable();
curval = futex_atomic_cmpxchg_inatomic(uaddr, uval, newval);
@@ -664,6 +698,254 @@ out:
 }
 
 /*
+ * Called from futex_requeue_pi.
+ * Set FUTEX_WAITERS and FUTEX_WAITER_REQUEUED flags on the
+ * PI-futex value; 

Re: [RFC][PATCH 4/7] RSS accounting hooks over the code

2007-03-13 Thread Eric W. Biederman
Herbert Poetzl [EMAIL PROTECTED] writes:

 On Mon, Mar 12, 2007 at 09:50:08AM -0700, Dave Hansen wrote:
 On Mon, 2007-03-12 at 19:23 +0300, Kirill Korotaev wrote:
  
  For these you essentially need per-container page-_mapcount counter,
  otherwise you can't detect whether rss group still has the page 
  in question being mapped in its processes' address spaces or not. 

 What do you mean by this?  You can always tell whether a process has a
 particular page mapped.  Could you explain the issue a bit more.  I'm
 not sure I get it.

 OpenVZ wants to account _shared_ pages in a guest
 different than separate pages, so that the RSS
 accounted values reflect the actual used RAM instead
 of the sum of all processes RSS' pages, which for
 sure is more relevant to the administrator, but IMHO
 not so terribly important to justify memory consuming
 structures and sacrifice performance to get it right

 YMMV, but maybe we can find a smart solution to the
 issue too :)

I will tell you what I want.

I want a shared page cache that has nothing to do with RSS limits.

I want an RSS limit such that, once I know I can run a deterministic
application with a fixed set of inputs inside it, I know it will
always run.

First-touch page ownership does not give me anything useful
for knowing whether I can run my application or not.  Because of page
sharing my application might run inside the rss limit only because
I got lucky and happened to share a lot of pages with another running
application.  If the next time I run it that other application isn't
running, my application will fail.  That is ridiculous.

I don't want sharing between vservers/VE/containers to affect how many
pages I can have mapped into my processes at once.

Now sharing is sufficiently rare that I'm pretty certain these problems
come up rarely, so maybe they have not shown up in testing
yet.  But until I see proof that actually doing the accounting for
sharing properly has intolerable overhead, I want proper accounting,
not this hand waving that is only accurate on the third Tuesday of the
month.

Ideally all of this will be followed by smarter rss based swapping.
There are some very cool things that can be done to eliminate machine
overload once you have the ability to track real rss values.  

Eric


Porting V4L2 drivers to 2.6.20

2007-03-13 Thread Maximus

Hey,
I am porting V4L2 drivers from 2.6.13 to 2.6.20.

The driver is using a structure 'video_device' which exists in
include/linux/videodev.h.

However, the Linux kernel in 2.6.20 does not have that structure.

Has the architecture changed between 2.6.13 and 2.6.20 for V4L2?



 Regards,
  Jo


[PATCH] x86_64: fix cpu MHz reporting on constant_tsc cpus

2007-03-13 Thread Joerg Roedel
From: Mark Langsdorf [EMAIL PROTECTED]
From: Joerg Roedel [EMAIL PROTECTED]

This patch fixes the reporting of cpu_mhz in /proc/cpuinfo on CPUs with
a constant TSC rate and a kernel with disabled cpufreq.

Signed-off-by: Mark Langsdorf [EMAIL PROTECTED]
Signed-off-by: Joerg Roedel [EMAIL PROTECTED]

-- 
Joerg Roedel
Operating System Research Center
AMD Saxony LLC  Co. KG
diff --git a/arch/x86_64/kernel/apic.c b/arch/x86_64/kernel/apic.c
index 723417d..5f291b2 100644
--- a/arch/x86_64/kernel/apic.c
+++ b/arch/x86_64/kernel/apic.c
@@ -839,7 +839,7 @@ static int __init calibrate_APIC_clock(void)
	} while ((tsc - tsc_start) < TICK_COUNT &&
		(apic - apic_start) < TICK_COUNT);
 
-   result = (apic_start - apic) * 1000L * cpu_khz /
+   result = (apic_start - apic) * 1000L * tsc_khz /
(tsc - tsc_start);
}
	printk("result %d\n", result);
diff --git a/arch/x86_64/kernel/nmi.c b/arch/x86_64/kernel/nmi.c
index 486f4c6..0bbaeda 100644
--- a/arch/x86_64/kernel/nmi.c
+++ b/arch/x86_64/kernel/nmi.c
@@ -168,7 +168,7 @@ void release_evntsel_nmi(unsigned int msr)
clear_bit(counter, __get_cpu_var(evntsel_nmi_owner));
 }
 
-static __cpuinit inline int nmi_known_cpu(void)
+__cpuinit inline int nmi_known_cpu(void)
 {
switch (boot_cpu_data.x86_vendor) {
case X86_VENDOR_AMD:
@@ -225,8 +225,8 @@ static unsigned int adjust_for_32bit_ctr(unsigned int hz)
 * 32nd bit should be 1, for 33.. to be 1.
 * Find the appropriate nmi_hz
 */
-	if ((((u64)cpu_khz * 1000) / retval) > 0x7fffffffULL) {
-		retval = ((u64)cpu_khz * 1000) / 0x7fffffffUL + 1;
+	if ((((u64)tsc_khz * 1000) / retval) > 0x7fffffffULL) {
+		retval = ((u64)tsc_khz * 1000) / 0x7fffffffUL + 1;
}
return retval;
 }
@@ -493,7 +493,7 @@ static int setup_k7_watchdog(void)
 
/* setup the timer */
wrmsr(evntsel_msr, evntsel, 0);
-   wrmsrl(perfctr_msr, -((u64)cpu_khz * 1000 / nmi_hz));
+   wrmsrl(perfctr_msr, -((u64)tsc_khz * 1000 / nmi_hz));
apic_write(APIC_LVTPC, APIC_DM_NMI);
evntsel |= K7_EVNTSEL_ENABLE;
wrmsr(evntsel_msr, evntsel, 0);
@@ -601,7 +601,7 @@ static int setup_p4_watchdog(void)
 
wrmsr(evntsel_msr, evntsel, 0);
wrmsr(cccr_msr, cccr_val, 0);
-   wrmsrl(perfctr_msr, -((u64)cpu_khz * 1000 / nmi_hz));
+   wrmsrl(perfctr_msr, -((u64)tsc_khz * 1000 / nmi_hz));
apic_write(APIC_LVTPC, APIC_DM_NMI);
cccr_val |= P4_CCCR_ENABLE;
wrmsr(cccr_msr, cccr_val, 0);
@@ -671,7 +671,7 @@ static int setup_intel_arch_watchdog(void)
wrmsr(evntsel_msr, evntsel, 0);
 
nmi_hz = adjust_for_32bit_ctr(nmi_hz);
-   wrmsr(perfctr_msr, (u32)(-((u64)cpu_khz * 1000 / nmi_hz)), 0);
+   wrmsr(perfctr_msr, (u32)(-((u64)tsc_khz * 1000 / nmi_hz)), 0);
 
apic_write(APIC_LVTPC, APIC_DM_NMI);
evntsel |= ARCH_PERFMON_EVENTSEL0_ENABLE;
@@ -894,7 +894,7 @@ int __kprobes nmi_watchdog_tick(struct pt_regs * regs, 
unsigned reason)
apic_write(APIC_LVTPC, APIC_DM_NMI);
/* start the cycle over again */
wrmsrl(wd-perfctr_msr,
-  -((u64)cpu_khz * 1000 / nmi_hz));
+  -((u64)tsc_khz * 1000 / nmi_hz));
} else if (wd-perfctr_msr == 
MSR_ARCH_PERFMON_PERFCTR0) {
/*
 * ArchPerfom/Core Duo needs to re-unmask
@@ -903,11 +903,11 @@ int __kprobes nmi_watchdog_tick(struct pt_regs * regs, 
unsigned reason)
apic_write(APIC_LVTPC, APIC_DM_NMI);
/* ARCH_PERFMON has 32 bit counter writes */
wrmsr(wd-perfctr_msr,
-(u32)(-((u64)cpu_khz * 1000 / nmi_hz)), 0);
+(u32)(-((u64)tsc_khz * 1000 / nmi_hz)), 0);
} else {
/* start the cycle over again */
wrmsrl(wd-perfctr_msr,
-  -((u64)cpu_khz * 1000 / nmi_hz));
+  -((u64)tsc_khz * 1000 / nmi_hz));
}
rc = 1;
} else  if (nmi_watchdog == NMI_IO_APIC) {
@@ -1003,6 +1003,7 @@ void __trigger_all_cpu_backtrace(void)
 }
 
 EXPORT_SYMBOL(nmi_active);
+EXPORT_SYMBOL(nmi_known_cpu);
 EXPORT_SYMBOL(nmi_watchdog);
 EXPORT_SYMBOL(avail_to_resrv_perfctr_nmi);
 EXPORT_SYMBOL(avail_to_resrv_perfctr_nmi_bit);
diff --git a/arch/x86_64/kernel/setup.c b/arch/x86_64/kernel/setup.c
diff --git a/arch/x86_64/kernel/time.c b/arch/x86_64/kernel/time.c
index 75d73a9..52b5dc1 100644
--- a/arch/x86_64/kernel/time.c
+++ 

Re: SMP performance degradation with sysbench

2007-03-13 Thread Nick Piggin

Andrea Arcangeli wrote:

On Tue, Mar 13, 2007 at 04:11:02PM +1100, Nick Piggin wrote:


Hi Anton,

Very cool. Yeah I had come to the conclusion that it wasn't a kernel
issue, and basically was afraid to look into userspace ;)



btw, regardless of what glibc is doing, still the cpu shouldn't go
idle IMHO. Even if we're overscheduling and thrashing over the mmap_sem
with threads (no idea if other OS schedules the task away when they
find the other cpu in the mmap critical section), or if we've
overscheduling with futex locking, the cpu usage should remain 100%
system time in the worst case. The only explanation for going idle
legitimately could be on HT cpus where HT may hurt more than help but
on real multicore it shouldn't happen.



Well ignoring the HT issue, I was seeing lots of idle time simply
because userspace could not keep up enough load to the scheduler.
There simply were fewer runnable tasks than CPU cores.

But it wasn't a case of all CPUs going idle, just most of them ;)

--
SUSE Labs, Novell Inc.


Re: [PATCH][RSDL-mm 0/7] RSDL cpu scheduler for 2.6.21-rc3-mm2

2007-03-13 Thread Con Kolivas
On Tuesday 13 March 2007 20:39, Ingo Molnar wrote:
 * Mike Galbraith [EMAIL PROTECTED] wrote:
  I just retested with the encoders at nice 0, and the x/gforce combo is
  terrible. [...]

 ok. So nice levels had nothing to do with it - it's some other
 regression somewhere. How does the vanilla scheduler cope with the
 exactly same workload? I.e. could you describe the 'delta' difference in
 behavior - because the delta is what we are interested in mostly, the
 'absolute' behavior alone is not sufficient. Something like:

  - on scheduler foo, under this workload, the CPU hogs steal 70% CPU
time and the resulting desktop experience is 'choppy': mouse pointer
is laggy and audio skips.

  - on scheduler bar, under this workload, the CPU hogs are at 40%
CPU time and the desktop experience is smooth.

 things like that - we really need to be able to see the delta.

I only find a slowdown, no choppiness, no audio stutter (it would be extremely 
hard to make audio stutter in this design without i/o starvation or something 
along those lines). The number difference in cpu percentage I've already 
given on the previous email. The graphics driver does feature in this test 
case as well so others' mileage may vary. Mike said it was terrible.

  [...]  Funny thing though, x/gforce isn't as badly affected with a
  kernel build.  Any build is quite noticeable, but even at -j8, the
  effect doesn't seem to be (very brief test warning applies) as bad as
  with only the two encoders running.  That seems quite odd.

 likewise, how does the RSDL kernel build behavior compare to the vanilla
 scheduler's behavior? (what happens in one that doesnt happen in the
 other, etc.)

Kernel compiles seem similar till the jobs get above about 3 where rsdl gets 
slower but still smooth. Audio is basically unaffected either way.

Don't forget all the rest of the cases people have posted.

-- 
-ck


Re: Porting V4L2 drivers to 2.6.20

2007-03-13 Thread Laurent Pinchart
 Hey,
  am porting V4L2 drivers from 2.6.13 to 2.6.20.

  The driver is using a  structure 'video_device' which exists in
 include/linux/videodev.h.

  However, The linux kernel in 2.6.20 doesnot have that structure?.

   Has the architecture changed between 2.6.13 to 2.6.20 for V4L2?.

The structure has been moved to include/media/v4l2-dev.h

Best regards,

Laurent Pinchart


Re: Move to unshared VMAs in NOMMU mode?

2007-03-13 Thread David Howells
Robin Getz [EMAIL PROTECTED] wrote:

 We (noMMU) folks need to have special code anyway - so why not put it there, 
 and try not to increase memory footprint?

I'd like to have the drivers and filesystems need to know as little as possible
about whether they're working in MMU-mode or NOMMU-mode - for the most part
such knowledge should be unnecessary.  Additionally, I'd rather not put special
case code in the generic parts of the NOMMU code.

David


Re: [PATCH] kthread_should_stop_check_freeze (was: Re: [PATCH -mm 3/7] Freezer: Remove PF_NOFREEZE from rcutorture thread)

2007-03-13 Thread Christoph Hellwig
On Tue, Mar 13, 2007 at 03:28:08PM +0530, Srivatsa Vaddagiri wrote:
 On Tue, Mar 13, 2007 at 09:16:29AM +, Christoph Hellwig wrote:
   Document as well in the kernel_thread() API, as I notice people still
   use kernel_thread() some places (ex: rtasd.c in powerpc arch)?
  
  They shouldn't use kernel_thread.
 
 Hmm ..that needs to be documented as well then! I can easily count more
 than dozen places where kernel_thread() is being used.
 
 I agree though that kthread is a much cleaner interface to create/destroy 
 threads.

Well, it takes a lot of time to convert all the existing users.  But
I try to make sure to flame^H^H^H^H^Hcorrect everyone who tries to sneak
a new user of kernel_thread in.


Re: [ck] Re: [PATCH][RSDL-mm 0/7] RSDL cpu scheduler for 2.6.21-rc3-mm2

2007-03-13 Thread Xavier Bestel
On Tue, 2007-03-13 at 20:31 +1100, Con Kolivas wrote:
 nice on my debian etch seems to choose nice +10 without arguments contrary to 
 a previous discussion that said 4 was the default. However 4 is a good value 
 to use as a base of sorts.

I don't see why. nice uses +10 by default on all Linux distros, and even
on Solaris and HP/UX. So I suspect that if Mike had just used "nice lame"
instead of "nice +5 lame", he would have got what he wanted.

Xav




Re: [RFC][PATCH 4/7] RSS accounting hooks over the code

2007-03-13 Thread Nick Piggin

Eric W. Biederman wrote:

Herbert Poetzl [EMAIL PROTECTED] writes:



On Mon, Mar 12, 2007 at 09:50:08AM -0700, Dave Hansen wrote:


On Mon, 2007-03-12 at 19:23 +0300, Kirill Korotaev wrote:


For these you essentially need per-container page-_mapcount counter,
otherwise you can't detect whether rss group still has the page 
in question being mapped in its processes' address spaces or not. 



What do you mean by this?  You can always tell whether a process has a
particular page mapped.  Could you explain the issue a bit more.  I'm
not sure I get it.


OpenVZ wants to account _shared_ pages in a guest
different than separate pages, so that the RSS
accounted values reflect the actual used RAM instead
of the sum of all processes RSS' pages, which for
sure is more relevant to the administrator, but IMHO
not so terribly important to justify memory consuming
structures and sacrifice performance to get it right

YMMV, but maybe we can find a smart solution to the
issue too :)



I will tell you what I want.

I want a shared page cache that has nothing to do with RSS limits.

I want an RSS limit such that, once I know I can run a deterministic
application with a fixed set of inputs inside it, I know it will
always run.

First-touch page ownership does not give me anything useful
for knowing whether I can run my application or not.  Because of page
sharing my application might run inside the rss limit only because
I got lucky and happened to share a lot of pages with another running
application.  If the next time I run it that other application isn't
running, my application will fail.  That is ridiculous.


Let's be practical here, what you're asking is basically impossible.

(Unless by deterministic you mean that it never enters a non-trivial
syscall, in which case you just want to know about the maximum
RSS of the process, which we already account.)


I don't want sharing between vservers/VE/containers to affect how many
pages I can have mapped into my processes at once.


You seem to want total isolation. You could use virtualization?


Now sharing is sufficiently rare that I'm pretty certain these problems
come up rarely, so maybe they have not shown up in testing
yet.  But until I see proof that actually doing the accounting for
sharing properly has intolerable overhead, I want proper accounting,
not this hand waving that is only accurate on the third Tuesday of the
month.


It is basically handwaving anyway. The only approach I've seen with
a sane (not perfect, but good) way of accounting memory use is this
one. If you care to define proper, then we could discuss that.

--
SUSE Labs, Novell Inc.


Re: SMP performance degradation with sysbench

2007-03-13 Thread Andrea Arcangeli
On Tue, Mar 13, 2007 at 09:06:14PM +1100, Nick Piggin wrote:
 Well ignoring the HT issue, I was seeing lots of idle time simply
 because userspace could not keep up enough load to the scheduler.
 There simply were fewer runnable tasks than CPU cores.

When you said idle I thought idle and not waiting for I/O. Waiting for
I/O would be hardly a kernel issue ;). If they're not waiting for I/O
and they're not scheduling in userland with nanosleep/pause, the cpu
shouldn't go idle. Even if they're calling sched_yield in a loop the
cpu should account for zero idle time as far as I can tell.


Re: SMP performance degradation with sysbench

2007-03-13 Thread Nick Piggin

Andrea Arcangeli wrote:

On Tue, Mar 13, 2007 at 09:06:14PM +1100, Nick Piggin wrote:


Well ignoring the HT issue, I was seeing lots of idle time simply
because userspace could not keep up enough load to the scheduler.
There simply were fewer runnable tasks than CPU cores.



When you said idle I thought idle and not waiting for I/O. Waiting for
I/O would be hardly a kernel issue ;). If they're not waiting for I/O
and they're not scheduling in userland with nanosleep/pause, the cpu
shouldn't go idle. Even if they're calling sched_yield in a loop the
cpu should account for zero idle time as far as I can tell.


Well it wasn't iowait time. From Anton's analysis, I would probably
say it was time waiting for either the glibc malloc mutex or MySQL
heap mutex.

--
SUSE Labs, Novell Inc.


[PATCH] Fix corruption of memmap on IA64 SPARSEMEM when mem_section is not a power of 2

2007-03-13 Thread Mel Gorman

There are problems in the use of SPARSEMEM and pageblock flags that causes
problems on ia64.

The first part of the problem is that the units are incorrect in the
SECTION_BLOCKFLAGS_BITS computation. This results in a mem_section's
section_mem_map being treated as part of a bitmap, which isn't good. This
was evident with an invalid virtual address when mem_init attempted to free
bootmem pages while relinquishing control from the bootmem allocator.

The second part of the problem occurs because the pageblock flags bitmap is
located within the mem_section. The SECTIONS_PER_ROOT computation using
sizeof(mem_section) may then not be a power of 2, depending on the size of the
bitmap. This renders masks and other such things not power-of-2 based. This
issue was seen with SPARSEMEM_EXTREME on ia64. This patch moves the bitmap
outside of mem_section and uses a pointer in mem_section instead. The
bitmaps are allocated when the section is being initialised.

Note that sparse_early_usemap_alloc() does not use alloc_remap() like
sparse_early_mem_map_alloc(). The allocation required for the bitmap on x86,
the only architecture that uses alloc_remap, is typically smaller than a cache
line. alloc_remap() pads out allocations to the cache size which would be
a needless waste.

Credit to Bob Picco for identifying the original problem and effecting a
fix for the SECTION_BLOCKFLAGS_BITS calculation. Credit to Andy Whitcroft
for devising the best way of allocating the bitmaps only when required for
the section.
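
As a back-of-the-envelope illustration of the second problem (a standalone
sketch, assuming 8-byte longs and a 16K ia64 page; the struct names and the
two-word bitmap are made up for the example):

	#include <stdio.h>

	/* bitmap embedded in the section: 24 bytes, not a power of two */
	struct section_with_bitmap {
		unsigned long section_mem_map;
		unsigned long pageblock_flags[2];
	};

	/* bitmap behind a pointer: back to 16 bytes, a power of two again */
	struct section_with_pointer {
		unsigned long section_mem_map;
		unsigned long *pageblock_flags;
	};

	int main(void)
	{
		unsigned long page_size = 16384;

		/* 682 sections per root: mask/shift arithmetic silently breaks */
		printf("embedded: %lu\n", page_size / sizeof(struct section_with_bitmap));
		/* 1024 sections per root: masks and shifts work again */
		printf("pointer:  %lu\n", page_size / sizeof(struct section_with_pointer));
		return 0;
	}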

Signed-off-by: Bob Picco [EMAIL PROTECTED]
Signed-off-by: Andy Whitcroft [EMAIL PROTECTED]
Signed-off-by: Mel Gorman [EMAIL PROTECTED]

 include/linux/mmzone.h |6 --
 mm/sparse.c|   49 ++---
 2 files changed, 50 insertions(+), 5 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff 
linux-2.6.21-rc2-mm2-clean/include/linux/mmzone.h 
linux-2.6.21-rc2-mm2-sparsemem_rootsize_fix/include/linux/mmzone.h
--- linux-2.6.21-rc2-mm2-clean/include/linux/mmzone.h   2007-03-09 
10:22:08.0 +
+++ linux-2.6.21-rc2-mm2-sparsemem_rootsize_fix/include/linux/mmzone.h  
2007-03-09 19:42:00.0 +
@@ -705,7 +705,7 @@ extern struct zone *next_zone(struct zon
 #define PAGE_SECTION_MASK  (~(PAGES_PER_SECTION-1))
 
 #define SECTION_BLOCKFLAGS_BITS \
-   ((SECTION_SIZE_BITS - (MAX_ORDER-1)) * NR_PAGEBLOCK_BITS)
+	((1 << (PFN_SECTION_SHIFT - (MAX_ORDER-1))) * NR_PAGEBLOCK_BITS)
 
 #if (MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS
 #error Allocator MAX_ORDER exceeds SECTION_SIZE
@@ -726,7 +726,9 @@ struct mem_section {
 * before using it wrong.
 */
unsigned long section_mem_map;
-   DECLARE_BITMAP(pageblock_flags, SECTION_BLOCKFLAGS_BITS);
+
+   /* See declaration of similar field in struct zone */
+   unsigned long *pageblock_flags;
 };
 
 #ifdef CONFIG_SPARSEMEM_EXTREME
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff 
linux-2.6.21-rc2-mm2-clean/mm/sparse.c 
linux-2.6.21-rc2-mm2-sparsemem_rootsize_fix/mm/sparse.c
--- linux-2.6.21-rc2-mm2-clean/mm/sparse.c  2007-02-28 04:59:12.0 
+
+++ linux-2.6.21-rc2-mm2-sparsemem_rootsize_fix/mm/sparse.c 2007-03-09 
19:45:15.0 +
@@ -198,13 +198,15 @@ struct page *sparse_decode_mem_map(unsig
 }
 
 static int sparse_init_one_section(struct mem_section *ms,
-   unsigned long pnum, struct page *mem_map)
+   unsigned long pnum, struct page *mem_map,
+   unsigned long *pageblock_bitmap)
 {
if (!valid_section(ms))
return -EINVAL;
 
	ms->section_mem_map &= ~SECTION_MAP_MASK;
	ms->section_mem_map |= sparse_encode_mem_map(mem_map, pnum);
+	ms->pageblock_flags = pageblock_bitmap;
 
return 1;
 }
@@ -268,6 +270,33 @@ static void __kfree_section_memmap(struc
   get_order(sizeof(struct page) * nr_pages));
 }
 
+static unsigned long usemap_size(void)
+{
+   unsigned long size_bytes;
+   size_bytes = roundup(SECTION_BLOCKFLAGS_BITS, 8) / 8;
+   size_bytes = roundup(size_bytes, sizeof(unsigned long));
+   return size_bytes;
+}
+
+static unsigned long *__kmalloc_section_usemap(void)
+{
+   return kmalloc(usemap_size(), GFP_KERNEL);
+}
+
+static unsigned long *sparse_early_usemap_alloc(unsigned long pnum)
+{
+   unsigned long *usemap;
+   struct mem_section *ms = __nr_to_section(pnum);
+   int nid = sparse_early_nid(ms);
+
+   usemap = alloc_bootmem_node(NODE_DATA(nid), usemap_size());
+   if (usemap)
+   return usemap;
+
+	printk(KERN_WARNING "%s: allocation failed\n", __FUNCTION__);
+   return NULL;
+}
+
 /*
  * Allocate the accumulated non-linear sections, allocate a mem_map
  * for each and record the physical to section mapping.
@@ -276,6 +305,7 @@ void sparse_init(void)
 {
unsigned long pnum;
struct page *map;
+   unsigned long *usemap;
 
for (pnum = 

Re: [PATCH][RSDL-mm 0/7] RSDL cpu scheduler for 2.6.21-rc3-mm2

2007-03-13 Thread hui
On Tue, Mar 13, 2007 at 08:41:05PM +1100, Con Kolivas wrote:
 On Tuesday 13 March 2007 20:29, Ingo Molnar wrote:
  So the question is: if all tasks are on the same nice level, how does,
  in Mike's test scenario, RSDL behave relative to the current
  interactivity code?
... 
 The only way to get the same behaviour on RSDL without hacking an 
 interactivity estimator, priority boost cpu misproportionator onto it is to 
 either -nice X or +nice lame.

Hello Ingo,

After talking to Con over IRC (and if I can summarize it), he's wondering whether
properly nicing those tasks, as previously mentioned in user emails, would solve
this potential user-reported regression or whether something additional is needed. It
seems like folks are happy with the results once the nice tweaking is done.
This is, after all, a huge behavior change to the scheduler (just thinking out loud).

bill

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [QUICKLIST 0/4] Arch independent quicklists V2

2007-03-13 Thread Andrew Morton
 On Tue, 13 Mar 2007 19:03:38 +1100 Nick Piggin [EMAIL PROTECTED] wrote:
 Andrew Morton wrote:
 On Tue, 13 Mar 2007 00:13:25 -0700 (PDT) Christoph Lameter [EMAIL 
 PROTECTED] wrote:
 Page table pages have the characteristics that they are typically zero
 or in a known state when they are freed.
  
  
  Well if they're zero then perhaps they should be released to the page 
  allocator
  to satisfy the next __GFP_ZERO request.  If that request is for a pagetable
  page, we break even (except we get to remove special-case code).  If that
  __GFP_ZERO allocation was for some application other than a pagetable, we
  win.
  
  iow, can we just nuke 'em?
 
 Page allocator still requires interrupts to be disabled, which this doesn't.

Bah.  How many cli/sti statements fit into a single cachemiss?

 Considering there isn't much else that frees known zeroed pages, I wonder if
 it is worthwhile.

If you want a zeroed page for pagecache and someone has just stuffed a
known-zero, cache-hot page into the pagetable quicklists, you have good
reason to be upset.

In fact, if you want a _non_-zeroed page and someone has just stuffed a
known-zero, cache-hot page into the pagetable quicklists, you still have
reason to be upset.  You *want* that cache-hot page.

Generally, all these little private lists of pages (such as the ones which
slab had/has) are a bad deal.  Cache effects preponderate and I do think
we're generally better off tossing the things into a central pool.

Plus, we can get in a situation where we take a cache-cold, known-zero page
from the pte quicklist when there is a cache-hot, non-zero page sitting in
the page allocator.  I suspect that zeroing the cache-hot page would take a
similar amount of time to a single miss against the cache-cold page.

I'm not saying that I _know_ that the quicklists are pointless, but I don't
think it's established that they are pointful.

ISTR that experiments with removing the i386 quicklists made zero
difference, but that was an awfully long time ago.  Significantly, it
predated per-cpu-pages..


 Last time the zeroidle discussion came up was IIRC not actually real 
 performance
 gain, just cooking the 1024 CPU threaded pagefault numbers ;)

Maybe, dunno.  It was apparently a win on powerpc many years ago.  I had a
fiddle with it 5-6 years ago on x86 using a cache-disabled mapping of the
page.  But it needed too much support in core VM to bother.  Since then
we've grown per-cpu page magazines and __GFP_ZERO.  Plus I'm not aware of
anyone having tried doing it on x86 with non-temporal stores.



Re: 2.6.20*: PATA DMA timeout, hangs (2)

2007-03-13 Thread Bartlomiej Zolnierkiewicz

On Tuesday 13 March 2007, Frank van Maarseveen wrote:
 On Mon, Mar 12, 2007 at 09:40:25PM +0100, Bartlomiej Zolnierkiewicz wrote:
  
  Hi,
  
  On Monday 12 March 2007, Frank van Maarseveen wrote:
   On Mon, Mar 12, 2007 at 01:21:18PM +0100, Bartlomiej Zolnierkiewicz wrote:

Hi,

Could you check if this is the same problem as this one:

http://bugzilla.kernel.org/show_bug.cgi?id=8169
   
   Looks like it except that I don't see lost interrupt messages here. So,
   it might be something different (I don't know).
  
  From the first mail:
  
  hda: max request size: 128KiB
  hda: 40021632 sectors (20491 MB) w/2048KiB Cache, CHS=39704/16/63
  hda: cache flushes not supported
   hda: hda1 hda2 hda4
  
  It seems that DMA is not used by default (CONFIG_IDEDMA_PCI_AUTO=n),
  so this is probably exactly the same issue.
  
  Please try the patch attached to the bugzilla bug entry.
 
 2.6.20.2 rejects this patch and I don't see a way to apply it by hand:
 ide_set_dma() isn't there, nothing seems to match.

The patch is for 2.6.21-rc3, sorry for not making it clear.

Bart


Re: SMP performance degradation with sysbench

2007-03-13 Thread Andrea Arcangeli
On Tue, Mar 13, 2007 at 09:37:54PM +1100, Nick Piggin wrote:
 Well it wasn't iowait time. From Anton's analysis, I would probably
 say it was time waiting for either the glibc malloc mutex or MySQL
 heap mutex.

So it again makes little sense to me that this is idle time, unless
some userland mutex has a usleep in the slow path which would be very
wrong, in the worst case they should yield() (yield can still waste
lots of cpu if two tasks in the slow paths calls it while the holder
is not scheduled, but at least it wouldn't be idle time).

Idle time is suspicious for a kernel issue in the scheduler or some
userland inefficiency (the latter sounds more likely).


Re: [QUICKLIST 0/4] Arch independent quicklists V2

2007-03-13 Thread Nick Piggin

Andrew Morton wrote:

On Tue, 13 Mar 2007 19:03:38 +1100 Nick Piggin [EMAIL PROTECTED] wrote:



Page allocator still requires interrupts to be disabled, which this doesn't.



Bah.  How many cli/sti statements fit into a single cachemiss?


On a Pentium 4? ;)

Sure, that is a minor detail, considering that you'll usually be allocating
an order of magnitude or three more anon/pagecache pages than page tables.


Considering there isn't much else that frees known zeroed pages, I wonder if
it is worthwhile.



If you want a zeroed page for pagecache and someone has just stuffed a
known-zero, cache-hot page into the pagetable quicklists, you have good
reason to be upset.


The thing is, pagetable pages are the one really good exception to the
rule that we should keep cache hot and initialise-on-demand. They
typically are fairly sparsely populated and sparsely accessed. Even
for last level page tables, I think it is reasonable to assume they will
usually be pretty cold.

And you want to allocate cache cold pages as well, for the same reasons
(you want to keep your cache hot pages for when they actually will be
used - eg. for the anon/pagecache itself).


In fact, if you want a _non_-zeroed page and someone has just stuffed a
known-zero, cache-hot page into the pagetable quicklists, you still have
reason to be upset.  You *want* that cache-hot page.

Generally, all these little private lists of pages (such as the ones which
slab had/has) are a bad deal.  Cache effects preponderate and I do think
we're generally better off tossing the things into a central pool.


For slab I understand. And a lot of users of slab constructors were also
silly, precisely because we should initialise on demand to keep the cache
hits up.

But cold(ish?) pagetable quicklists make sense, IMO (that is, if you *must*
avoid using slab).


Last time the zeroidle discussion came up was IIRC not actually real performance
gain, just cooking the 1024 CPU threaded pagefault numbers ;)



Maybe, dunno.  It was apparently a win on powerpc many years ago.  I had a
fiddle with it 5-6 years ago on x86 using a cache-disabled mapping of the
page.  But it needed too much support in core VM to bother.  Since then
we've grown per-cpu page magazines and __GFP_ZERO.  Plus I'm not aware of
anyone having tried doing it on x86 with non-temporal stores.


You can win on specifically constructed benchmarks, easily.

But considering all the other problems you're going to introduce, we'd need
a significant win on a significant something, IMO.

You waste memory bandwidth. You also use more CPU and memory cycles
speculatively, ergo you waste more power.

--
SUSE Labs, Novell Inc.


Re: [Bugme-new] [Bug 8187] New: 2.6.20 PCI: Quirks patch breaks X11 on I82801

2007-03-13 Thread Bartlomiej Zolnierkiewicz

On Tuesday 13 March 2007, Andrew Morton wrote:
  On Mon, 12 Mar 2007 13:30:05 -0700 [EMAIL PROTECTED] wrote:
  http://bugzilla.kernel.org/show_bug.cgi?id=8187
  
 Summary: 2.6.20 PCI: Quirks patch breaks X11 on I82801
  Kernel Version: 2.6.20
  Status: NEW
Severity: normal
   Owner: [EMAIL PROTECTED]
   Submitter: [EMAIL PROTECTED]
  
  
  Most recent kernel where this bug did *NOT* occur:
  Any 2.6.20-pre prior to commit 368c73d4f689dae0807d0a2aa74c61fd2b9b075f
  
  Distribution:  Slackware 11.0
  Hardware Environment:  HP/Compaq dc5000S (P4, 82801, 82865)
  Software Environment:  Xorg 6.9.0
  Problem Description:
  
  Alan Cox introduced a PCI: Quirks patch (git commit
  368c73d4f689dae0807d0a2aa74c61fd2b9b075f) in 2.6.20 that breaks X11 on this
  I82801 platform.  Specifically, it causes the PCI initialisation to become
  buggered; Xorg 6.9.0 dumps the following to the console:
  (EE) end of block range 0x177 < begin 0x3f0
  (EE) end of block range 0x177 < begin 0x3f0
  (WW) INVALID IO ALLOCATION b: 0x14d0 e: 0x14d7 correcting
  [...]
  Backtrace:
  0: X(xf86SigHandler+0x8a) [0x8088b2a]
  1: [0xb7f2b420]
  2: /usr/X11R6/lib/modules/drivers/i810_drv.so [0xb797f592]
  3: X(InitOutput+0xb83) [0x8072713]
  4: X(main+0x226) [0x80d4496]
  5: /lib/tls/libc.so.6(__libc_start_main+0xd4) [0xb7da7e14]
  6: X [0x806ff61]
  
  Fatal server error:
  Caught signal 11.  Server aborting
  
  Steps to reproduce:
  
  Reverting the git commit mentioned above fixes the issue.  Apparently, this 
  may
  be limited to certain combinations of on-motherboard chipsets, as I haven't 
  seen
  many bug reports.  Googling shows some people having X11 segfault issues 
  with
  2.6.20 (e.g. freedesktop.org bug #9956) but in most of those cases it's due 
  to
  the evdev driver and not PCI initialisation.
  
  I wrote to Alan (cc'ed Greg as he signed off on the patch) nearly two weeks 
  ago
  but have heard nothing, so I'm leaving a bug here instead.
  
 
 argh.
 
 Would we break more machines than we fix if we just revert that?

this should be fixed in 2.6.21-rc3,
commit ed8ccee0918ad063a4741c0656fda783e02df627

Bart


Re: SMP performance degradation with sysbench

2007-03-13 Thread Nick Piggin

Andrea Arcangeli wrote:

On Tue, Mar 13, 2007 at 09:37:54PM +1100, Nick Piggin wrote:


Well it wasn't iowait time. From Anton's analysis, I would probably
say it was time waiting for either the glibc malloc mutex or MySQL
heap mutex.



So it again makes little sense to me that this is idle time, unless
some userland mutex has a usleep in the slow path which would be very
wrong, in the worst case they should yield() (yield can still waste
lots of cpu if two tasks in the slow paths calls it while the holder
is not scheduled, but at least it wouldn't be idle time).


They'll be sleeping in futex_wait in the kernel, I think. One thread
will hold the critical mutex, some will be off doing their own thing,
but importantly there will be many sleeping for the mutex to become
available.


Idle time is suspicious for a kernel issue in the scheduler or some
userland inefficiency (the latter sounds more likely).


That is what I first suspected, because the dropoff appeared to happen
exactly after we saturated the CPU count: it seems like a scheduler
artifact.

However, I tested with a bigger system and actually the idle time
comes before we saturate all CPUs. Also, increasing the aggressiveness
of the load balancer did not drop idle time at all, so it is not a case
of some runqueues idle while others have many threads on them.


I guess googlemalloc (tcmalloc?) isn't suitable for a general purpose
glibc allocator. But I wonder if there are other improvements that glibc
can do here?

--
SUSE Labs, Novell Inc.


Re: [QUICKLIST 0/4] Arch independent quicklists V2

2007-03-13 Thread Andrew Morton
 On Tue, 13 Mar 2007 22:06:46 +1100 Nick Piggin [EMAIL PROTECTED] wrote:
 Andrew Morton wrote:
 On Tue, 13 Mar 2007 19:03:38 +1100 Nick Piggin [EMAIL PROTECTED] wrote:
 
 ...

 Page allocator still requires interrupts to be disabled, which this doesn't.

 it is worthwhile.
  
  
  If you want a zeroed page for pagecache and someone has just stuffed a
  known-zero, cache-hot page into the pagetable quicklists, you have good
  reason to be upset.
 
 The thing is, pagetable pages are the one really good exception to the
 rule that we should keep cache hot and initialise-on-demand. They
 typically are fairly sparsely populated and sparsely accessed. Even
 for last level page tables, I think it is reasonable to assume they will
 usually be pretty cold.

eh?  I'd have thought that a pte page which has just gone through
zap_pte_range() will very often have a _lot_ of hot cachelines, and
that's a common case.

Still.   It's pretty easy to test.

  
  Maybe, dunno.  It was apparently a win on powerpc many years ago.  I had a
  fiddle with it 5-6 years ago on x86 using a cache-disabled mapping of the
  page.  But it needed too much support in core VM to bother.  Since then
  we've grown per-cpu page magazines and __GFP_ZERO.  Plus I'm not aware of
  anyone having tried doing it on x86 with non-temporal stores.
 
 You can win on specifically constructed benchmarks, easily.
 
 But considering all the other problems you're going to introduce, we'd need
 a significant win on a significant something, IMO.
 
 You waste memory bandwidth. You also use more CPU and memory cycles
 speculatively, ergo you waste more power.

Yeah, prezeroing in idle is probably pointless.  But I'm not aware of
anyone having tried it properly...


Re: [QUICKLIST 0/4] Arch independent quicklists V2

2007-03-13 Thread Christoph Lameter
On Tue, 13 Mar 2007, Andrew Morton wrote:

  On Tue, 13 Mar 2007 00:13:25 -0700 (PDT) Christoph Lameter [EMAIL 
  PROTECTED] wrote:
  Page table pages have the characteristics that they are typically zero
  or in a known state when they are freed.
 
 Well if they're zero then perhaps they should be released to the page 
 allocator to satisfy the next __GFP_ZERO request.  If that request is 
 for a pagetable page, we break even (except we get to remove 
 special-case code).  If that __GFP_ZERO allocation was for some
 application other than a pagetable, we win.

Nope that wont work.

1. We need to support other states of pages other than zeroed.

2. Prezeroing does not make much sense if a large portion of the
   page is being used. Performance is better if the whole page 
   is zeroed directly before use. Prezeroing only makes sense for sparse
   allocations like the page table pages.

 (Will require some work in the page allocator)
 (That work will open the path to using the idle thread to prezero pages)

I already tried that 3 years ago and there was *no* benefit for usual
users of the page allocator. The advantage exists only if a small
portion of the page is used. F.e. for one cacheline there was a 4x
improvement. See lkml archives for prezeroing.




Re: [QUICKLIST 0/4] Arch independent quicklists V2

2007-03-13 Thread Christoph Lameter
On Tue, 13 Mar 2007, Andrew Morton wrote:

 Yeah, prezeroing in idle is probably pointless.  But I'm not aware of
 anyone having tried it properly...

Ok, then what did I do wrong 3 years ago with the prezeroing patchsets?




Re: [PATCH][RSDL-mm 0/7] RSDL cpu scheduler for 2.6.21-rc3-mm2

2007-03-13 Thread Mike Galbraith
On Tue, 2007-03-13 at 21:06 +1100, Con Kolivas wrote:
 On Tuesday 13 March 2007 20:39, Ingo Molnar wrote:
  * Mike Galbraith [EMAIL PROTECTED] wrote:
   I just retested with the encoders at nice 0, and the x/gforce combo is
   terrible. [...]
 
  ok. So nice levels had nothing to do with it - it's some other
  regression somewhere. How does the vanilla scheduler cope with the
  exactly same workload? I.e. could you describe the 'delta' difference in
  behavior - because the delta is what we are interested in mostly, the
  'absolute' behavior alone is not sufficient. Something like:
 
   - on scheduler foo, under this workload, the CPU hogs steal 70% CPU
 time and the resulting desktop experience is 'choppy': mouse pointer
 is laggy and audio skips.
 
   - on scheduler bar, under this workload, the CPU hogs are at 40%
 CPU time and the desktop experience is smooth.
 
  things like that - we really need to be able to see the delta.
 
 I only find a slowdown, no choppiness, no audio stutter (it would be 
 extremely 
 hard to make audio stutter in this design without i/o starvation or something 
 along those lines). The number difference in cpu percentage I've already 
 given on the previous email. The graphics driver does feature in this test 
 case as well so others' mileage may vary. Mike said it was terrible.

My first test run with lame at nice 0 was truly horrid, but has _not_
repeated, so disregard that as an anomaly.  For the most part, it is as
you say, things just get slower with load, any load.  I definitely am
seeing lurchiness which is not present in mainline.  No audio problems
with either kernel.

   [...]  Funny thing though, x/gforce isn't as badly affected with a
   kernel build.  Any build is quite noticeable, but even at -j8, the
   effect doesn't seem to be (very brief test warning applies) as bad as
   with only the two encoders running.  That seems quite odd.
 
  likewise, how does the RSDL kernel build behavior compare to the vanilla
  scheduler's behavior? (what happens in one that doesnt happen in the
  other, etc.)
 
 Kernel compiles seem similar till the jobs get above about 3 where rsdl gets 
 slower but still smooth. Audio is basically unaffected either way.

It seems to be a plain linear slowdown.  The lurchiness I'm experiencing
varies in intensity, and is impossible to quantify.  I see neither
lurchiness nor slowdown in mainline through -j8.

 Don't forget all the rest of the cases people have posted.

Absolutely, all test results count.

-Mike



Re: Need help on mach-ep93xx

2007-03-13 Thread Ben Dooks
On Tue, Mar 13, 2007 at 10:54:08AM +0530, Maxin John wrote:
 Hi,
 
 I have one question about mach-ep93xx.
 
 In the EP93xx IRQ handling part in core.c, the 2.6.19.2 and newer
 kernels configure the 16 interrupts of ports A and B together. The
 code is not using the interrupt capability of port F, which can
 provide 3 interrupts.
 
 Why is port F not configured for interrupts?
 
 Thanks in advance,

subscribe to the linux-arm-kernel list and ask
the question there, you'll find more ARM people there.

-- 
Ben ([EMAIL PROTECTED], http://www.fluff.org/)

  'a smiley only costs 4 bytes'


Re: [QUICKLIST 0/4] Arch independent quicklists V2

2007-03-13 Thread Andrew Morton
 On Tue, 13 Mar 2007 04:17:26 -0700 (PDT) Christoph Lameter [EMAIL 
 PROTECTED] wrote:
 On Tue, 13 Mar 2007, Andrew Morton wrote:
 
   On Tue, 13 Mar 2007 00:13:25 -0700 (PDT) Christoph Lameter [EMAIL 
   PROTECTED] wrote:
   Page table pages have the characteristics that they are typically zero
   or in a known state when they are freed.
  
  Well if they're zero then perhaps they should be released to the page 
  allocator to satisfy the next __GFP_ZERO request.  If that request is 
  for a pagetable page, we break even (except we get to remove 
  special-case code).  If that __GFP_ZERO allocation was for some
  application other than a pagetable, we win.
 
 Nope that wont work.
 
 1. We need to support other states of pages other than zeroed.

What does this mean?

 2. Prezeroing does not make much sense if a large portion of the
page is being used. Performance is better if the whole page 
    is zeroed directly before use. Prezeroing only makes sense for sparse
allocations like the page table pages.

This is not related to the above discussion.

  (Will require some work in the page allocator)
  (That work will open the path to using the idle thread to prezero pages)
 
 I already tried that 3 years ago and there was *no* benefit for usual
  users of the page allocator. The advantage exists only if a small
  portion of the page is used. F.e. for one cacheline there was a 4x
 improvement. See lkml archives for prezeroing.

Unsurprised.  Were non-temporal stores tried?


Re: [QUICKLIST 0/4] Arch independent quicklists V2

2007-03-13 Thread Nick Piggin

Andrew Morton wrote:

On Tue, 13 Mar 2007 22:06:46 +1100 Nick Piggin [EMAIL PROTECTED] wrote:
Andrew Morton wrote:


On Tue, 13 Mar 2007 19:03:38 +1100 Nick Piggin [EMAIL PROTECTED] wrote:


...



Page allocator still requires interrupts to be disabled, which this doesn't.




it is worthwhile.



If you want a zeroed page for pagecache and someone has just stuffed a
known-zero, cache-hot page into the pagetable quicklists, you have good
reason to be upset.


The thing is, pagetable pages are the one really good exception to the
rule that we should keep cache hot and initialise-on-demand. They
typically are fairly sparsely populated and sparsely accessed. Even
for last level page tables, I think it is reasonable to assume they will
usually be pretty cold.



eh?  I'd have thought that a pte page which has just gone through
zap_pte_range() will very often have a _lot_ of hot cachelines, and
that's a common case.

Still.   It's pretty easy to test.


Well I guess that would be the case if you had just unmapped a 4MB
chunk that was pretty dense with pages.

My malloc seems to allocate and free in blocks of 128K, so that's
only going to give us 3% of the last level pte being cache hot when
it gets freed. Not sure what common mmap(file) access patterns
look like.
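
Back-of-the-envelope for that 3% (my arithmetic, assuming i386 with 4K
pages, 4-byte PTEs and 1024 PTEs per last-level page table):

        128 KB freed / 4 KB per page   = 32 PTEs cleared
        32 / 1024 PTEs per pte page    = ~3.1% of the pte page
        32 PTEs * 4 bytes              = 128 bytes, i.e. two 64-byte cachelines

so only a couple of cachelines of the pte page are plausibly still warm
when it gets freed.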

The majority of programs I run have a smattering of llpt pages
pretty sparsely populated, covering text, libraries, heap, stack,
vdso.

We don't actually have to zap_pte_range the entire page table in
order to free it (IIRC we used to have to, before the 4lpt patches).

But yeah let's see some tests. I would definitely want to avoid this
extra layer of complexity if it is just as good to return the pages
to the pcp lists.


Maybe, dunno.  It was apparently a win on powerpc many years ago.  I had a
fiddle with it 5-6 years ago on x86 using a cache-disabled mapping of the
page.  But it needed too much support in core VM to bother.  Since then
we've grown per-cpu page magazines and __GFP_ZERO.  Plus I'm not aware of
anyone having tried doing it on x86 with non-temporal stores.


You can win on specifically constructed benchmarks, easily.

But considering all the other problems you're going to introduce, we'd need
a significant win on a significant something, IMO.

You waste memory bandwidth. You also use more CPU and memory cycles
speculatively, ergo you waste more power.



Yeah, prezeroing in idle is probably pointless.  But I'm not aware of
anyone having tried it properly...


--
SUSE Labs, Novell Inc.


Re: [QUICKLIST 0/4] Arch independent quicklists V2

2007-03-13 Thread Andrew Morton
 On Tue, 13 Mar 2007 04:20:48 -0700 (PDT) Christoph Lameter [EMAIL 
 PROTECTED] wrote:
 On Tue, 13 Mar 2007, Andrew Morton wrote:
 
  Yeah, prezeroing in idle is probably pointless.  But I'm not aware of
  anyone having tried it properly...
 
 Ok, then what did I do wrong 3 years ago with the prezeroing patchsets?

Failed to provide us a link to it?


Re: SMP performance degradation with sysbench

2007-03-13 Thread Eric Dumazet
On Tuesday 13 March 2007 12:12, Nick Piggin wrote:

 I guess googlemalloc (tcmalloc?) isn't suitable for a general purpose
 glibc allocator. But I wonder if there are other improvements that glibc
 can do here?

I cooked a patch some time ago to speedup threaded apps and got no feedback.

http://lkml.org/lkml/2006/8/9/26

Maybe we have to wait for 32-core CPUs before thinking of cache line
bouncing...



Re: [PATCH][RSDL-mm 0/7] RSDL cpu scheduler for 2.6.21-rc3-mm2

2007-03-13 Thread Serge Belyshev
Mike Galbraith [EMAIL PROTECTED] writes:

[snip]
 It seems to be a plain linear slowdown.  The lurchiness I'm experiencing
 varies in intensity, and is impossible to quantify.  I see neither
 lurchiness nor slowdown in mainline through -j8.

Whaa? make -j8 on mainline makes my desktop box completely useless.

Please reconsider your statement.


Re: SMP performance degradation with sysbench

2007-03-13 Thread Andrea Arcangeli
On Tue, Mar 13, 2007 at 10:12:19PM +1100, Nick Piggin wrote:
 They'll be sleeping in futex_wait in the kernel, I think. One thread
 will hold the critical mutex, some will be off doing their own thing,
 but importantly there will be many sleeping for the mutex to become
 available.

The initial assumption was that there was zero idle time with threads
<= cpus and that the idle time showed up only when the number of threads
increased to double the number of cpus. If the idle time didn't
increase with the number of threads, nothing would be suspect.

 However, I tested with a bigger system and actually the idle time
 comes before we saturate all CPUs. Also, increasing the aggressiveness
 of the load balancer did not drop idle time at all, so it is not a case
 of some runqueues idle while others have many threads on them.

It'd be interesting to see the sysrq+t after the idle time
increased.

 I guess googlemalloc (tcmalloc?) isn't suitable for a general purpose
 glibc allocator. But I wonder if there are other improvements that glibc
 can do here?

My wild guess is that they're allocating memory after taking
futexes. If they do, something like this will happen:

 taskA                   taskB             taskC
 user lock
                         mmap_sem lock
 mmap_sem -> schedule
                                            user lock -> schedule

If taskB weren't there triggering more random thrashing over the
mmap_sem, the lock holder wouldn't wait and task C wouldn't wait either.

I suspect the real fix is not to allocate memory or to run other
expensive syscalls that can block inside the futex critical sections...
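
A minimal sketch of that pattern and the obvious restructuring (my
illustration, not anybody's actual code):

#include <pthread.h>
#include <stdlib.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

/* Anti-pattern: the allocation happens inside the user-level critical
 * section.  If malloc() has to mmap()/brk() it can sleep on mmap_sem
 * while holding 'm', and every other thread wanting 'm' piles up in
 * futex_wait -- the taskA/taskB/taskC picture above. */
void *insert_bad(size_t sz)
{
        void *p;

        pthread_mutex_lock(&m);
        p = malloc(sz);
        /* ... link p into the shared structure ... */
        pthread_mutex_unlock(&m);
        return p;
}

/* Restructured: do the potentially blocking allocation first and keep
 * the critical section short. */
void *insert_better(size_t sz)
{
        void *p = malloc(sz);

        pthread_mutex_lock(&m);
        /* ... link p into the shared structure ... */
        pthread_mutex_unlock(&m);
        return p;
}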


Re: [PATCH][RSDL-mm 0/7] RSDL cpu scheduler for 2.6.21-rc3-mm2

2007-03-13 Thread Mike Galbraith
On Tue, 2007-03-13 at 14:41 +0300, Serge Belyshev wrote:
 Mike Galbraith [EMAIL PROTECTED] writes:
 
 [snip]
  It seems to be a plain linear slowdown.  The lurchiness I'm experiencing
  varies in intensity, and is impossible to quantify.  I see neither
  lurchiness nor slowdown in mainline through -j8.
 
 Whaa? make -j8 on mainline makes my desktop box completely useless.
 
 Please reconsider your statement.

I'll do no such thing, and don't appreciate the insinuation.

-Mike



Re: [QUICKLIST 0/4] Arch independent quicklists V2

2007-03-13 Thread Andrew Morton
 On Tue, 13 Mar 2007 22:30:19 +1100 Nick Piggin [EMAIL PROTECTED] wrote:
 We don't actually have to zap_pte_range the entire page table in
 order to free it (IIRC we used to have to, before the 4lpt patches).

I'm trying to remember why we ever would have needed to zero out the pagetable
pages if we're taking down the whole mm?  Maybe it's because "oh, the
arch wants to put this page into a quicklist to recycle it", which is
all rather circular.

It would be interesting to look at a) leave the page full of random garbage
if we're releasing the whole mm and b) return it straight to the page allocator.


Re: SMP performance degradation with sysbench

2007-03-13 Thread Nick Piggin

Eric Dumazet wrote:

On Tuesday 13 March 2007 12:12, Nick Piggin wrote:


I guess googlemalloc (tcmalloc?) isn't suitable for a general purpose
glibc allocator. But I wonder if there are other improvements that glibc
can do here?



I cooked a patch some time ago to speedup threaded apps and got no feedback.


Well that doesn't help in this case. I tested and the mmap_sem contention
is not an issue.


http://lkml.org/lkml/2006/8/9/26

Maybe we have to wait for 32 core cpu before thinking of cache line 
bouncings...


The idea is a good one, and I was half way through implementing similar
myself at one point (some java apps hit this badly).

It is just horribly sad that futexes are supposed to implement a
_scalable_ thread synchronisation mechanism, whilst fundamentally
relying on an mm-wide lock to operate.

I don't like your interface, but then again, the futex interface isn't
exactly pretty anyway.

You should resubmit the patch, and get the glibc guys to use it.

--
SUSE Labs, Novell Inc.


Re: [QUICKLIST 0/4] Arch independent quicklists V2

2007-03-13 Thread Nick Piggin

Andrew Morton wrote:

On Tue, 13 Mar 2007 22:30:19 +1100 Nick Piggin [EMAIL PROTECTED] wrote:
We don't actually have to zap_pte_range the entire page table in
order to free it (IIRC we used to have to, before the 4lpt patches).



I'm trying to remember why we ever would have needed to zero out the pagetable
pages if we're taking down the whole mm?  Maybe it's because "oh, the
arch wants to put this page into a quicklist to recycle it", which is
all rather circular.

It would be interesting to look at a) leave the page full of random garbage
if we're releasing the whole mm and b) return it straight to the page allocator.


Well we have the 'fullmm' case, which avoids all the locked pte operations
(for those architectures where hardware pt walking requires atomicity).

However we still have to visit those to-be-unmapped parts of the page table,
to find the pages and free them. So we still at least need to bring it into
cache for the read... at which point, the store probably isn't a big burden.

--
SUSE Labs, Novell Inc.


Re: SMP performance degradation with sysbench

2007-03-13 Thread Eric Dumazet
On Tuesday 13 March 2007 12:42, Andrea Arcangeli wrote:

 My wild guess is that they're allocating memory after taking
 futexes. If they do, something like this will happen:

  taskA                   taskB             taskC
  user lock
                          mmap_sem lock
  mmap_sem -> schedule
                                             user lock -> schedule

  If taskB weren't there triggering more random thrashing over the
  mmap_sem, the lock holder wouldn't wait and task C wouldn't wait either.

 I suspect the real fix is not to allocate memory or to run other
 expensive syscalls that can block inside the futex critical sections...

glibc malloc uses arenas, and trylock() only. It should not block, because if
an arena is already locked the thread automatically chooses another arena, and
might create a new one if necessary.
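
Roughly what that looks like, as I understand it (names and structure are
illustrative, not glibc's actual internals; note that creating a new arena
is where mmap() -- and hence mmap_sem -- comes in):

#include <pthread.h>
#include <stdlib.h>

struct arena {
        pthread_mutex_t lock;
        struct arena *next;
};

static struct arena *arena_list;
static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;

/* Create a new arena and return it locked.  A real allocator would
 * mmap() a fresh heap here, which is where mmap_sem gets taken. */
static struct arena *new_arena(void)
{
        struct arena *a = calloc(1, sizeof(*a));

        pthread_mutex_init(&a->lock, NULL);
        pthread_mutex_lock(&a->lock);
        pthread_mutex_lock(&list_lock);
        a->next = arena_list;
        arena_list = a;
        pthread_mutex_unlock(&list_lock);
        return a;
}

/* Arena selection: try the preferred arena, then any arena that isn't
 * locked, and only create a new one if everything is busy -- so a
 * contended arena is skipped rather than waited for.  (List traversal
 * is left unlocked here for brevity.) */
static struct arena *grab_arena(struct arena *preferred)
{
        struct arena *a;

        if (preferred && pthread_mutex_trylock(&preferred->lock) == 0)
                return preferred;
        for (a = arena_list; a; a = a->next)
                if (pthread_mutex_trylock(&a->lock) == 0)
                        return a;
        return new_arena();
}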

But yes, mmap_sem contention is a big problem, because it's also taken by 
futex code (unfortunately)



Re: [RFC] hwbkpt: Hardware breakpoints (was Kwatch)

2007-03-13 Thread Alan Cox
On Tue, 13 Mar 2007 01:00:50 -0700 (PDT)
Roland McGrath [EMAIL PROTECTED] wrote:

  Well, I can add in the test for 0, but finding the set of always-on bits
  in DR6 will have to be done separately.  Isn't it possible that different
  CPUs could have different bits?
 
 I don't know, but it seems unlikely.  AFAIK all CPUs are presumed to have
 the same CPUID results, for example.

No. We merge the CPUID information to get a shared set of capability bits.

Generic PC systems with a mix of PII and PIII are possible. The voyager
architecture can have even more peculiar combinations of processor
modules installed.



Re: SMP performance degradation with sysbench

2007-03-13 Thread Nick Piggin

Andrea Arcangeli wrote:

On Tue, Mar 13, 2007 at 10:12:19PM +1100, Nick Piggin wrote:


They'll be sleeping in futex_wait in the kernel, I think. One thread
will hold the critical mutex, some will be off doing their own thing,
but importantly there will be many sleeping for the mutex to become
available.



The initial assumption was that there was zero idle time with threads
<= cpus and that the idle time showed up only when the number of threads
increased to double the number of cpus. If the idle time didn't
increase with the number of threads, nothing would be suspect.


Well I think more threads ~= more probability that this guy is going to
be preempted while holding the mutex?

This might be why FreeBSD works much better, because it looks like MySQL
actually will set RT scheduling for those processes that take critical
resources.


However, I tested with a bigger system and actually the idle time
comes before we saturate all CPUs. Also, increasing the aggressiveness
of the load balancer did not drop idle time at all, so it is not a case
of some runqueues idle while others have many threads on them.



It'd be interesting to see the sysrq+t after the idle time
increased.



I guess googlemalloc (tcmalloc?) isn't suitable for a general purpose
glibc allocator. But I wonder if there are other improvements that glibc
can do here?



My wild guess is that they're allocating memory after taking
futexes. If they do, something like this will happen:

 taskA                   taskB             taskC
 user lock
                         mmap_sem lock
 mmap_sem -> schedule
                                            user lock -> schedule

If taskB weren't there triggering more random thrashing over the
mmap_sem, the lock holder wouldn't wait and task C wouldn't wait either.

I suspect the real fix is not to allocate memory or to run other
expensive syscalls that can block inside the futex critical sections...



I would agree that it points to MySQL scalability issues; however, the
fact that such large gains come from tcmalloc is still interesting.

--
SUSE Labs, Novell Inc.


Re: [QUICKLIST 0/4] Arch independent quicklists V2

2007-03-13 Thread Andrew Morton
 On Tue, 13 Mar 2007 23:01:11 +1100 Nick Piggin [EMAIL PROTECTED] wrote:
 Andrew Morton wrote:
 On Tue, 13 Mar 2007 22:30:19 +1100 Nick Piggin [EMAIL PROTECTED] wrote:
 We don't actually have to zap_pte_range the entire page table in
 order to free it (IIRC we used to have to, before the 4lpt patches).
  
  
  I'm trying to remember why we ever would have needed to zero out the 
  pagetable
   pages if we're taking down the whole mm?  Maybe it's because "oh, the
   arch wants to put this page into a quicklist to recycle it", which is
   all rather circular.
  
  It would be interesting to look at a) leave the page full of random garbage
  if we're releasing the whole mm and b) return it straight to the page 
  allocator.
 
 Well we have the 'fullmm' case, which avoids all the locked pte operations
 (for those architectures where hardware pt walking requires atomicity).

I suspect there are some tlb operations which could be skipped in that case
too.

 However we still have to visit those to-be-unmapped parts of the page table
 to find the pages and free them. So we still at least need to bring it into
 cache for the read... at which point, the store probably isn't a big burden.

It means all that data has to be written back.  Yes, I expect it'll prove
to be less costly than the initial load.


Re: sys_write() racy for multi-threaded append?

2007-03-13 Thread Alan Cox
 But on that note -- do you have any idea how one might get ltrace to
 work on a multi-threaded program, or how one might enhance it to
 instrument function calls from one shared library to another?  Or

I don't know a vast amount about ARM ELF user space so no.

 better yet, can you advise me on how to induce gdbserver to stream
 traces of library/syscall entry/exits for all the threads in a
 process?  And then how to cram it down into the kernel so I don't take

One way to do this is to use kprobes which will do exactly what you want
as it builds a kernel module. Doesn't currently run on ARM afaik.
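
For reference, the skeleton of such a module is small -- roughly the shape
of the kprobes example in the kernel documentation; the probed symbol below
is just illustrative, and as said this doesn't help on ARM yet:

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/kprobes.h>
#include <linux/sched.h>

/* Pick whatever entry point you actually care about. */
static struct kprobe kp = {
        .symbol_name = "sys_open",
};

/* Runs just before the probed instruction executes. */
static int handler_pre(struct kprobe *p, struct pt_regs *regs)
{
        printk(KERN_INFO "hit %s, pid %d\n", p->symbol_name, current->pid);
        return 0;
}

static int __init probe_init(void)
{
        kp.pre_handler = handler_pre;
        return register_kprobe(&kp);
}

static void __exit probe_exit(void)
{
        unregister_kprobe(&kp);
}

module_init(probe_init);
module_exit(probe_exit);
MODULE_LICENSE("GPL");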

 the hit for an MMU context switch every time I hit a syscall or

Not easily with gdbstubs as you've got to talk to something to decide how
to log the data and proceed. If you stick it kernel side it's a lot of
ugly new code and it's easier to port kprobes over; if you do it remotely as
gdbstubs intends, it adds latencies and screws all your timings.

gdbstubs is also not terribly SMP aware, and for low level work it's
sometimes easier to have one gdb per processor if you can get your brain
around it.

Alan


IDE disk runs just in DMA/33 with 2.6.20.2 on nVidia CK804 controller

2007-03-13 Thread l . genoni

Hi,
I reported this also for the 2.6.20 kernel.
The new libata with the nVidia CK804 controller initializes the disk in DMA/33,
while with 2.6.19.5 and earlier the disk is correctly initialized in
DMA/100.

The cable is OK, and with older kernels the disk runs without trouble.

The system has two SATA disks on the nVidia CK804 controllers, and then a disk
as primary master, and a DVD writer (DMA/33) as secondary master.


Here is lspci -vxxx:
00:06.0 IDE interface: nVidia Corporation CK804 IDE (rev f2) (prog-if 8a 
[Master SecP PriP])

Subsystem: Unknown device f043:815a
Flags: bus master, 66MHz, fast devsel, latency 0
I/O ports at f000 [size=16]
Capabilities: [44] Power Management version 2
00: de 10 53 00 05 00 b0 00 f2 8a 01 01 00 00 00 00
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 01 f0 00 00 00 00 00 00 00 00 00 00 43 f0 5a 81
30: 00 00 00 00 44 00 00 00 00 00 00 00 00 00 03 01
40: 43 f0 5a 81 01 00 02 00 00 00 00 00 00 00 00 00
50: 03 f0 01 00 00 00 00 00 a8 20 a8 20 22 00 20 20
60: 00 c0 00 c6 00 00 00 00 00 00 00 00 00 00 00 00
70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
80: 00 00 00 00 00 50 96 29 00 00 04 20 00 9e 4f 00
90: 00 00 02 30 00 00 00 00 00 00 00 00 00 00 00 00
a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 10 ff ff ff 0a 11 30 07

00:07.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev 
f3) (prog-if 85 [Master SecO PriO])

Subsystem: ASUSTeK Computer Inc. Unknown device 815a
Flags: bus master, 66MHz, fast devsel, latency 0, IRQ 23
I/O ports at 09f0 [size=8]
I/O ports at 0bf0 [size=4]
I/O ports at 0970 [size=8]
I/O ports at 0b70 [size=4]
I/O ports at d800 [size=16]
Memory at d5002000 (32-bit, non-prefetchable) [size=4K]
Capabilities: [44] Power Management version 2
00: de 10 54 00 07 00 b0 00 f3 85 01 01 00 00 00 00
10: f1 09 00 00 f1 0b 00 00 71 09 00 00 71 0b 00 00
20: 01 d8 00 00 00 20 00 d5 00 00 00 00 43 10 5a 81
30: 00 00 00 00 44 00 00 00 00 00 00 00 0b 01 03 01
40: 43 10 5a 81 01 00 02 00 00 00 00 00 00 00 00 00
50: 17 00 00 15 00 00 00 00 a8 20 a8 20 66 00 20 20
60: 00 c0 00 c6 11 0c 00 00 08 0f 06 42 00 00 00 00
70: 2c 78 c4 40 01 10 00 00 01 10 00 00 20 00 20 00
80: 00 00 00 40 00 50 4a 7f 00 00 02 00 00 00 00 00
90: 00 00 00 00 00 00 00 00 06 00 06 10 00 00 01 01
a0: 50 01 00 7c 00 00 00 00 00 00 00 00 33 bb aa 02
b0: 05 cc 84 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 0a 00 0a 00 08 00 02 a8
d0: 01 00 02 0d 42 00 00 00 00 00 00 00 0f 00 d0 87
e0: 01 00 02 0d 42 00 00 00 00 00 00 00 f7 e0 e2 01
f0: 00 00 00 00 00 00 00 00 00 ff ff ff 0f 36 32 07

00:08.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev 
f3) (prog-if 85 [Master SecO PriO])

Subsystem: ASUSTeK Computer Inc. K8N4-E Mainboard
Flags: bus master, 66MHz, fast devsel, latency 0, IRQ 22
I/O ports at 09e0 [size=8]
I/O ports at 0be0 [size=4]
I/O ports at 0960 [size=8]
I/O ports at 0b60 [size=4]
I/O ports at c400 [size=16]
Memory at d5001000 (32-bit, non-prefetchable) [size=4K]
Capabilities: [44] Power Management version 2
00: de 10 55 00 07 00 b0 00 f3 85 01 01 00 00 00 00
10: e1 09 00 00 e1 0b 00 00 61 09 00 00 61 0b 00 00
20: 01 c4 00 00 00 10 00 d5 00 00 00 00 43 10 5a 81
30: 00 00 00 00 44 00 00 00 00 00 00 00 0a 01 03 01
40: 43 10 5a 81 01 00 02 00 00 00 00 00 00 00 00 00
50: 17 00 00 15 00 00 00 00 a8 20 a8 20 66 00 20 20
60: 00 c0 00 c6 11 0c 00 00 08 0f 06 42 00 00 00 00
70: 2c 78 c4 40 01 10 00 00 01 10 00 00 20 00 20 00
80: 00 00 00 40 00 a0 4a 7f 00 00 02 2c 00 00 00 00
90: 00 00 00 00 00 00 00 00 06 00 06 10 00 00 01 01
a0: 50 01 00 7c 00 00 00 00 00 00 00 00 33 bb aa 02
b0: 05 cc 84 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 0a 00 0a 00 08 00 02 a8
d0: 01 00 02 0d 42 00 00 00 00 00 00 00 ea 9f f6 80
e0: 01 00 02 0d 42 00 00 00 00 00 00 00 50 80 00 00
f0: 00 00 00 00 00 00 00 00 00 ff ff ff 11 3f 32 07

00:09.0 PCI bridge: nVidia Corporation CK804 PCI Bridge (rev a2) (prog-if 
01 [Subtractive decode])

Flags: bus master, 66MHz, fast devsel, latency 0
Bus: primary=00, secondary=05, subordinate=05, sec-latency=128
I/O behind bridge: a000-afff
Memory behind bridge: d300-d4ff
Prefetchable memory behind bridge: 8800-880f
00: de 10 5c 00 07 01 a0 00 a2 01 04 06 00 00 01 00
10: 00 00 00 00 00 00 00 00 00 05 05 80 a0 a0 80 a2
20: 00 d3 f0 d4 00 88 00 88 00 00 00 00 00 00 00 00
30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 02 06
40: 00 00 07 00 01 00 02 00 07 00 00 00 00 00 44 01
50: 00 00 fe 7f 00 00 00 00 ff 1f ff 1f 00 00 00 00
60: 00 00 00 00 00 
