Re: [PATCH] rfc: threaded epoll_wait thundering herd

2007-05-04 Thread Eric Dumazet

Linus Torvalds wrote:


On Sat, 5 May 2007, Eric Dumazet wrote:

But... what happens if the thread that was chosen exits from the loop in
ep_poll() with res = -EINTR (because of signal_pending(current))


Not a problem.

What happens is that an exclusive wake-up stops on the first entry in the 
wait-queue that it actually *wakes*up*, but if some task has just marked 
itself as being TASK_UNINTERRUPTIBLE, but is still on the run-queue, it 
will just be marked TASK_RUNNING and that in itself isn't enough to cause 
the "exclusive" test to trigger.


The code in sched.c is subtle, but worth understanding if you care about 
these things. You should look at:


 - try_to_wake_up() - this is the default wakeup function (and the one 
   that should work correctly - I'm not going to guarantee that any of the 
   other specialty-wakeup-functions do so)


   The return value is the important thing. Returning non-zero is 
   "success", and implies that we actually activated it.


   See the "goto out_running" case for the case where the process was 
   still actually on the run-queues, and we just ended up setting 
   "p->state = TASK_RUNNING" - we still return 0, and the "exclusive" 
   logic will not trigger.


 - __wake_up_common: this is the thing that _calls_ the above, and which 
   cares about the return value above. It does


if (curr->func(curr, mode, sync, key) &&
(flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)


   ie it only decrements (and triggers) the nr_exclusive thing when the 
   wakeup-function returned non-zero (and when the waitqueue entry was 
   marked exclusive, of course).
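
For reference, the surrounding loop looks roughly like this (a condensed
sketch of __wake_up_common() from 2.6.21-era kernel/sched.c, not a verbatim
copy):

static void __wake_up_common(wait_queue_head_t *q, unsigned int mode,
			     int nr_exclusive, int sync, void *key)
{
	struct list_head *tmp, *next;

	list_for_each_safe(tmp, next, &q->task_list) {
		wait_queue_t *curr = list_entry(tmp, wait_queue_t, task_list);
		unsigned flags = curr->flags;

		/*
		 * curr->func (normally try_to_wake_up() via
		 * default_wake_function) returns non-zero only if it really
		 * activated the task.  A waiter that is still on the
		 * run-queue, e.g. busy checking signal_pending(), yields 0
		 * here, so it does not consume the exclusive slot and the
		 * loop goes on to wake one more real exclusive waiter.
		 */
		if (curr->func(curr, mode, sync, key) &&
		    (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
			break;
	}
}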


So what does all this subtlety *mean*?

Walk through it. It means that it is safe to do the

if (signal_pending())
return -EINTR;

kind of thing, because *when* you do this, you obviously are always on the 
run-queue (otherwise the process wouldn't be running, and couldn't be 
doing the test). So if there is somebody else waking you up right then and 
there, they'll never count your wakeup as an exclusive one, and they will 
wake up at least one other real exclusive waiter.


(IOW, you get a very very small probability of a very very small 
"thundering herd" - obviously it won't be "thundering" any more, it will 
be more of a "whispering herdlet").


The Linux kernel sleep/wakeup thing is really quite nifty and smart. And 
very few people realize just *how* nifty and elegant (and efficient) it 
is. Hopefully a few more people appreciate its beauty and subtlety now ;)




Thank you Linus for these detailed explanations.

I think I was frightened not by the wakeup logic, but by the possibility in 
SMP that a signal could be delivered to the thread just after it has been 
selected.


Looking again at ep_poll(), I see:

	set_current_state(TASK_INTERRUPTIBLE);
[*]	if (!list_empty(&ep->rdllist) || !jtimeout)
		break;
	if (signal_pending(current)) {
		res = -EINTR;
		break;
	}

So the test against signal_pending() is not done if an event is present in the
ready list: it should be delivered even if a signal is pending. I missed this
bit earlier...






Re: [RFC 0/3] Slab Defrag / Slab Targeted Reclaim and general Slab API changes

2007-05-04 Thread Eric Dumazet

Christoph Lameter wrote:

On Sat, 5 May 2007, Eric Dumazet wrote:


C. Introduces a slab_ops structure that allows a slab user to provide
   operations on slabs.

Could you please make it const ?


Sure. Done.


thanks :)




All of this is really not necessary since the compiler knows how to align
structures and we should use this information instead of having the user
specify an alignment. I would like to get rid of SLAB_HWCACHE_ALIGN
and kmem_cache_create. Instead one would use the following macros (that
then result in a call to __kmem_cache_create).

Hum, the problem is the compiler sometimes doesn't know the target processor
alignment.

Adding cacheline_aligned to 'struct ...' definitions might be overkill if
you compile a generic kernel and happen to boot a Pentium III with it.


Then add ___cacheline_aligned_in_smp or specify the alignment in the 
various other ways that exist. Practice is that most slabs specify 
SLAB_HWCACHE_ALIGN. So most slabs are cache aligned today.


Yes, but this alignment is dynamic, not known at compile time:

include/asm-i386/processor.h:739:
#define cache_line_size() (boot_cpu_data.x86_cache_alignment)


So adding cacheline_aligned to 'struct file', for example, would be a
regression for people with a PII or PIII.





G. Being able to track the number of pages in a kmem_cache


If you look at fs/buffer.c, you'll notice the bh_accounting / recalc_bh_state()
machinery, which might be overkill for large SMP configurations, when the real
concern is to limit the bh's so that they do not exceed 10% of LOWMEM.

Adding a callback in slab_ops to track total number of pages in use by a given
kmem_cache would be good.


Such functionality exists internal to SLUB and in the reporting tool. 
I can export that function if you need it.



Same thing for fs/file_table.c: the nr_files logic
(percpu_counter_dec()/percpu_counter_inc() for each file open/close) could be
simplified if we could just count the pages in use by the filp_cachep
kmem_cache. The get_nr_files() thing is not worth the pain.


Sure. What exactly do you want? The absolute number of pages of memory 
that the slab is using?


kmem_cache_pages_in_use(struct kmem_cache *) ?

The call will not be too lightweight since we will have to loop over all
nodes and add the counters in each per-node struct for allocated slabs.
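
As a rough illustration, such a call could look like this hedged sketch
(the s->node / nr_slabs / s->order names are assumptions based on current
SLUB internals, not a committed API):

unsigned long kmem_cache_pages_in_use(struct kmem_cache *s)
{
	int node;
	unsigned long pages = 0;

	/* Sum the slab counts kept in each per-node structure. */
	for_each_online_node(node) {
		struct kmem_cache_node *n = s->node[node];

		if (n)
			pages += atomic_long_read(&n->nr_slabs) << s->order;
	}
	return pages;
}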





On a typical system, the number of pages for the 'filp' kmem_cache tends to be stable:

-bash-2.05b# grep filp /proc/slabinfo
filp          234727 374100    256   15    1 : tunables  120   60    8 : slabdata  24940  24940    135

-bash-2.05b# grep filp /proc/slabinfo
filp          234776 374100    256   15    1 : tunables  120   60    8 : slabdata  24940  24940    168

-bash-2.05b# grep filp /proc/slabinfo
filp          234728 374100    256   15    1 : tunables  120   60    8 : slabdata  24940  24940    180

-bash-2.05b# grep filp /proc/slabinfo
filp          234724 374100    256   15    1 : tunables  120   60    8 : slabdata  24940  24940    174


So reverting the nr_files logic to a single integer would be enough, even on NUMA:

int nr_pages_used_by_filp;
int nr_pages_filp_limit;
int filp_in_danger __read_mostly;

/* slab_ops callback: called with the number of pages gained or lost */
static void callback_pages_in_use_by_filp(int inc)
{
	int in_danger;

	nr_pages_used_by_filp += inc;

	in_danger = nr_pages_used_by_filp >= nr_pages_filp_limit;
	if (in_danger != filp_in_danger)
		filp_in_danger = in_danger;
}

struct file *get_empty_filp(void)
{
	...
	if (filp_in_danger && !capable(CAP_SYS_ADMIN))
		goto over;
	...
}


void __init files_init(unsigned long mempages)
{
	...
	nr_pages_filp_limit = (mempages * 10) / 100; /* 10% for filp use */
	...
}


Re: cpufreq longhaul locks up

2007-05-04 Thread Rafał Bilski
Jan,

Can you send the output of the x86info program and the output of the
lspci command? Longhaul hasn't worked for you since 2.6.18, right?
I'm going to work now, but I will be available after 14:00 UTC.

If you have a problem with longhaul+powersave there may be one related
thing. When I started to change Longhaul it was causing lockups
on the Epia 800. I added transition protection. It helped, but not for
long. After one or two hours the machine locked up anyway. I found a
datasheet via Google and changed "disable BMDMA bit on PCI device" to
northbridge support. Problem fixed. Somehow the CLE133 chipset didn't
like touching the "BMDMA master" bits.
Second: I didn't get an answer from VIA about why they are blocking ACPI C3
on CPUs faster than 1 GHz. I don't know if it is standard practice and if
Intel and AMD are doing it too.

Things worth checking: disable PREEMPT, or change it to "Voluntary preemption".
Check if using the conservative governor makes any difference. I know that
this may sound strange, but transition latency is directly proportional to the
difference between the current and destination frequency. Maybe for faster
processors it isn't allowed to change frequency directly from min to max?

Rafał








Re: [RFC 2/3] SLUB: Implement targeted reclaim and partial list defragmentation

2007-05-04 Thread William Lee Irwin III
On Fri, May 04, 2007 at 03:15:57PM -0700, [EMAIL PROTECTED] wrote:
> 2. kick_object(void *)
> After SLUB has established references to the remaining objects in a slab it
> will drop all locks and then use kick_object on each of the objects for which
> we obtained a reference. The existence of the objects is guaranteed by
> virtue of the earlier obtained reference. The callback may perform any
> slab operation since no locks are held at the time of call.
> The callback should remove the object from the slab in some way. This may
> be accomplished by reclaiming the object and then running kmem_cache_free()
> or reallocating it and then running kmem_cache_free(). Reallocation
> is advantageous at this point because it will then allocate from the partial
> slabs with the most objects because we have just finished slab shrinking.
> NOTE: This patch is for conceptual review. I'd appreciate any feedback
> especially on the locking approach taken here. It will be critical to
> resolve the locking issue for this approach to become feasible.
> Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

kick_object() doesn't return an indicator of success, which might be
helpful for determining whether an object was successfully removed. The
later-added kick_dentry_object(), for instance, can't remove dentries
where reference counts are still held.

I suppose one could check to see if the ->inuse counter decreased, too.

In either event, it would probably be helpful to abort the operation if
there was a reclamation failure for an object within the slab.
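
Something along the lines of this hedged sketch (kick_slab_objects() and the
return convention are the suggestion above, not the posted patch):

static int kick_slab_objects(struct kmem_cache *s, void **objects, int nr)
{
	int i;

	for (i = 0; i < nr; i++)
		if (!s->slab_ops->kick_object(objects[i]))
			return -EBUSY;	/* object still pinned: abort slab */
	return 0;
}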

This is a relatively minor optimization concern. I think this patch
series is great and a significant foray into the problem of slab
reclaim vs. fragmentation.


-- wli


[patch 02/29] xen: Allocate and free vmalloc areas

2007-05-04 Thread Jeremy Fitzhardinge
Allocate/destroy a 'vmalloc' VM area: alloc_vm_area and free_vm_area
The alloc function ensures that page tables are constructed for the
region of kernel virtual address space and mapped into init_mm.

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Signed-off-by: Ian Pratt <[EMAIL PROTECTED]>
Signed-off-by: Christian Limpach <[EMAIL PROTECTED]>
Signed-off-by: Chris Wright <[EMAIL PROTECTED]>
Cc: "Jan Beulich" <[EMAIL PROTECTED]>
Cc: "Andi Kleen" <[EMAIL PROTECTED]>

---
 include/linux/vmalloc.h |    4 ++++
 mm/vmalloc.c            |   51 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 55 insertions(+)

===
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -68,6 +68,10 @@ extern int map_vm_area(struct vm_struct 
struct page ***pages);
 extern void unmap_vm_area(struct vm_struct *area);
 
+/* Allocate/destroy a 'vmalloc' VM area. */
+extern struct vm_struct *alloc_vm_area(unsigned long size);
+extern void free_vm_area(struct vm_struct *area);
+
 /*
  * Internals.  Don't use..
  */
===
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -757,3 +757,54 @@ out_einval_locked:
 }
 EXPORT_SYMBOL(remap_vmalloc_range);
 
+static int f(pte_t *pte, struct page *pmd_page, unsigned long addr, void *data)
+{
+   /* apply_to_page_range() does all the hard work. */
+   return 0;
+}
+
+/**
+ * alloc_vm_area - allocate a range of kernel address space
+ * @size:  size of the area
+ * @returns:   NULL on failure, vm_struct on success
+ *
+ * This function reserves a range of kernel address space, and
+ * allocates pagetables to map that range.  No actual mappings
+ * are created.  If the kernel address space is not shared
+ * between processes, it syncs the pagetable across all
+ * processes.
+ */
+struct vm_struct *alloc_vm_area(unsigned long size)
+{
+   struct vm_struct *area;
+
+   area = get_vm_area(size, VM_IOREMAP);
+   if (area == NULL)
+   return NULL;
+
+   /*
+* This ensures that page tables are constructed for this region
+* of kernel virtual address space and mapped into init_mm.
+*/
+   if (apply_to_page_range(&init_mm, (unsigned long)area->addr,
+			   area->size, f, NULL)) {
+   free_vm_area(area);
+   return NULL;
+   }
+
+   /* Make sure the pagetables are constructed in process kernel
+  mappings */
+   vmalloc_sync_all();
+
+   return area;
+}
+EXPORT_SYMBOL_GPL(alloc_vm_area);
+
+void free_vm_area(struct vm_struct *area)
+{
+   struct vm_struct *ret;
+   ret = remove_vm_area(area->addr);
+   BUG_ON(ret != area);
+   kfree(area);
+}
+EXPORT_SYMBOL_GPL(free_vm_area);
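
A hedged usage sketch (the caller and the mapping step are hypothetical; the
point is that the returned area already has live page tables that e.g. a
hypervisor can populate):

	/* Reserve four pages of kernel address space. */
	struct vm_struct *area = alloc_vm_area(4 * PAGE_SIZE);

	if (!area)
		return -ENOMEM;
	/*
	 * area->addr now has page tables present in init_mm (and synced
	 * into all process page tables), so mappings such as grant-table
	 * frames can be installed directly into them without faulting.
	 */
	...
	free_vm_area(area);	/* unmap and release the range */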

-- 



Re: [RFC 0/3] Slab Defrag / Slab Targeted Reclaim and general Slab API changes

2007-05-04 Thread Christoph Lameter
On Sat, 5 May 2007, Eric Dumazet wrote:

> > C. Introduces a slab_ops structure that allows a slab user to provide
> >operations on slabs.
> 
> Could you please make it const ?

Sure. Done.

> > All of this is really not necessary since the compiler knows how to align
> > structures and we should use this information instead of having the user
> > specify an alignment. I would like to get rid of SLAB_HWCACHE_ALIGN
> > and kmem_cache_create. Instead one would use the following macros (that
> > then result in a call to __kmem_cache_create).
> 
> Hum, the problem is the compiler sometimes doesn't know the target processor
> alignment.
> 
> Adding cacheline_aligned to 'struct ...' definitions might be overkill if
> you compile a generic kernel and happen to boot a Pentium III with it.

Then add ___cacheline_aligned_in_smp or specify the alignment in the 
various other ways that exist. Practice is that most slabs specify 
SLAB_HWCACHE_ALIGN. So most slabs are cache aligned today.

> G. Being able to track the number of pages in a kmem_cache
> 
> 
> If you look at fs/buffer.c, you'll notice the bh_accounting / recalc_bh_state()
> machinery, which might be overkill for large SMP configurations, when the real
> concern is to limit the bh's so that they do not exceed 10% of LOWMEM.
> 
> Adding a callback in slab_ops to track total number of pages in use by a given
> kmem_cache would be good.

Such functionality exists internal to SLUB and in the reporting tool. 
I can export that function if you need it.

> Same thing for fs/file_table.c: the nr_files logic
> (percpu_counter_dec()/percpu_counter_inc() for each file open/close) could be
> simplified if we could just count the pages in use by the filp_cachep kmem_cache.
> The get_nr_files() thing is not worth the pain.

Sure. What exactly do you want? The absolute number of pages of memory 
that the slab is using?

kmem_cache_pages_in_use(struct kmem_cache *) ?

The call will not be too lightweight since we will have to loop over all 
nodes and add the counters in each per-node struct for allocated slabs.



Re: [RFC 0/3] Slab Defrag / Slab Targeted Reclaim and general Slab API changes

2007-05-04 Thread Eric Dumazet

[EMAIL PROTECTED] wrote:

I originally intended this for the 2.6.23 development cycle but since there
is an aggressive push for SLUB I thought that we may want to introduce this 
earlier.
Note that this covers new locking approaches that we may need to talk
over before going any further.

This is an RFC for patches that do major changes to the way that slab
allocations are handled in order to introduce some more advanced features
and in order to get rid of some things that are no longer used or awkward.

A. Add slab defragmentation

On kmem_cache_shrink SLUB will not only sort the partial slabs by object
number but attempt to free objects out of partial slabs that have a low
number of objects. Doing so increases the object density in the remaining
partial slabs and frees up memory. Ideally kmem_cache_shrink would be
able to completely defrag the partial list so that only one partial
slab is left over. But it is advantageous to have slabs with a few free
objects since that speeds up kfree. Also going to the extreme on this one
would mean that the reclaimable slabs would have to be able to move objects
in a reliable way. So we just free objects in slabs with a low population ratio
and tolerate it if an attempt to move an object fails.


nice idea



B. Targeted Reclaim

Mainly to support antifragmentation / defragmentation methods. The slab adds
a new function kmem_cache_vacate(struct page *) which can be used to request
that a page be cleared of all objects. This makes it possible to reduce the
size of the RECLAIMABLE fragmentation area and move slabs into the MOVABLE
area enhancing the capabilities of antifragmentation significantly.

C. Introduces a slab_ops structure that allows a slab user to provide
   operations on slabs.


Could you please make it const ?



This replaces the current constructor / destructor scheme. It is necessary
in order to support additional methods needed to support targeted reclaim
and slab defragmentation. A slab supporting targeted reclaim and
slab defragmentation must support the following additional methods:

1. get_reference(void *)
Get a reference on a particular slab object.

2. kick_object(void *)
Kick an object off a slab. The object is either reclaimed
(easiest) or a new object is alloced using kmem_cache_alloc()
and then the object is moved to the new location.
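
To make the contract concrete, a hedged sketch of a user-supplied slab_ops
(struct thing and all thing_* helpers are invented for illustration):

static int thing_get_reference(void *object)
{
	struct thing *t = object;

	/* Fails if the object is concurrently being freed. */
	return atomic_inc_not_zero(&t->refcount);
}

static void thing_kick_object(void *object)
{
	struct thing *t = object;

	thing_unhash(t);	/* make the object unreachable... */
	thing_put(t);		/* ...and drop the reference taken above;
				   the final put does kmem_cache_free() */
}

static const struct slab_ops thing_slab_ops = {
	.get_reference	= thing_get_reference,
	.kick_object	= thing_kick_object,
};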

D. Slab creation is no longer done using kmem_cache_create

kmem_cache_create is not a clean API since it has only 2 callbacks, for
constructor and destructor, and does not allow the specification of a
slab_ops structure. Its parameters are confusing.

F.e. it is possible to specify alignment information in the alignment
field and in addition in the flags field (SLAB_HWCACHE_ALIGN). The semantics
of SLAB_HWCACHE_ALIGN are fuzzy because it only aligns objects if they are
larger than 1/2 cache line.

All of this is really not necessary since the compiler knows how to align
structures and we should use this information instead of having the user
specify an alignment. I would like to get rid of SLAB_HWCACHE_ALIGN
and kmem_cache_create. Instead one would use the following macros (that
then result in a call to __kmem_cache_create).


Hum, the problem is the compiler sometimes doesn't know the target processor
alignment.


Adding cacheline_aligned to 'struct ...' definitions might be overkill if
you compile a generic kernel and happen to boot a Pentium III with it.





KMEM_CACHE(<struct>, flags)

The macro will determine the slab name from the struct name and use that for
/sys/slab, will use the size of the struct for slab size and the alignment
of the structure for alignment. This means one will be able to set slab
object alignment by specifying the usual alignment options for static
allocations when defining the structure.

Since the name is derived from the struct name it will be much easier to
find the source code for slabs listed in /sys/slab.

An additional macro is provided if the slab also supports slab operations.

KMEM_CACHE_OPS(<struct>, flags, slab_ops)

It is likely that this macro will be rarely used.
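
A hedged example of a conversion under this proposal (the struct is
invented; the macro spelling follows the description above):

struct my_object {
	spinlock_t lock;
	struct list_head list;
} ____cacheline_aligned;	/* alignment now comes from the type itself */

static struct kmem_cache *my_cache;

void __init my_subsys_init(void)
{
	/* Name ("my_object"), size and alignment all derive from the type. */
	my_cache = KMEM_CACHE(my_object, SLAB_PANIC);
}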

E. kmem_cache_create() SLAB_HWCACHE_ALIGN legacy interface

In order to avoid having to modify all slab creation calls throughout
the kernel we will provide a kmem_cache_create emulation. That function
is the only call that will still understand SLAB_HWCACHE_ALIGN. If that
parameter is specified then it will set up the proper alignment (the slab
allocators never see that flag).

If constructor or destructor are specified then we will allocate a slab_ops
structure and populate it with the values specified. Note that this will
cause a memory leak if the slab is disposed of later. If you need disposable
slabs then the new API must be used.
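
For contrast, a typical legacy call that the emulation has to keep working
looks roughly like the current fs/file_table.c usage:

	/* Legacy interface: alignment via flags, ctor/dtor slots unused. */
	filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0,
					SLAB_HWCACHE_ALIGN | SLAB_PANIC,
					NULL, NULL);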

F. Remove destructor support from all slab allocators?

I am only aware of two call sites left after all the changes that are
scheduled to go into 2.6.22-rc1 have been merged. These are in FRV and sh
arch 

Re: [PATCH] rfc: threaded epoll_wait thundering herd

2007-05-04 Thread Linus Torvalds


On Sat, 5 May 2007, Eric Dumazet wrote:
> 
> But... what happens if the thread that was chosen exits from the loop in
> ep_poll() with res = -EINTR (because of signal_pending(current))

Not a problem.

What happens is that an exclusive wake-up stops on the first entry in the 
wait-queue that it actually *wakes*up*, but if some task has just marked 
itself as being TASK_UNINTERRUPTIBLE, but is still on the run-queue, it 
will just be marked TASK_RUNNING and that in itself isn't enough to cause 
the "exclusive" test to trigger.

The code in sched.c is subtle, but worth understanding if you care about 
these things. You should look at:

 - try_to_wake_up() - this is the default wakeup function (and the one 
   that should work correctly - I'm not going to guarantee that any of the 
   other specialty-wakeup-functions do so)

   The return value is the important thing. Returning non-zero is 
   "success", and implies that we actually activated it.

   See the "goto out_running" case for the case where the process was 
   still actually on the run-queues, and we just ended up setting 
   "p->state = TASK_RUNNING" - we still return 0, and the "exclusive" 
   logic will not trigger.

 - __wake_up_common: this is the thing that _calls_ the above, and which 
   cares about the return value above. It does

if (curr->func(curr, mode, sync, key) &&
(flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)


   ie it only decrements (and triggers) the nr_exclusive thing when the 
   wakeup-function returned non-zero (and when the waitqueue entry was 
   marked exclusive, of course).

So what does all this subtlety *mean*?

Walk through it. It means that it is safe to do the

if (signal_pending())
return -EINTR;

kind of thing, because *when* you do this, you obviously are always on the 
run-queue (otherwise the process wouldn't be running, and couldn't be 
doing the test). So if there is somebody else waking you up right then and 
there, they'll never count your wakeup as an exclusive one, and they will 
wake up at least one other real exclusive waiter.

(IOW, you get a very very small probability of a very very small 
"thundering herd" - obviously it won't be "thundering" any more, it will 
be more of a "whispering herdlet").

The Linux kernel sleep/wakeup thing is really quite nifty and smart. And 
very few people realize just *how* nifty and elegant (and efficient) it 
is. Hopefully a few more people appreciate its beauty and subtlety now ;)

Linus


Re: [GIT PULL] MMC updates

2007-05-04 Thread Linus Torvalds


On Sat, 5 May 2007, Pierre Ossman wrote:

> Pierre Ossman wrote:
> > Linus, please pull from
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/drzeus/mmc.git 
> > for-linus
> > 
> 
> *ping*

*pong*.

Thanks for reminding me. I was away for a couple of days, missed some 
emails, just pulled and pushed out.

Linus


Re: patch: VFS: fix passing of AT_PHDR value in auxv to ELF interpreter

2007-05-04 Thread Andrew Morton
On Fri, 4 May 2007 23:34:08 -0400 Quentin Godfroy <[EMAIL PROTECTED]> wrote:

> By the way, is init 32 bits or 64 bits? It could break the ia32
> emulation thing, but not the 64bit native mode.

akpm2:/home/akpm> file /sbin/init  
/sbin/init: ELF 64-bit LSB executable, AMD x86-64, version 1 (SYSV), for 
GNU/Linux 2.6.9, dynamically linked (uses shared libs), for GNU/Linux 2.6.9, 
stripped

akpm2:/home/akpm> cat /etc/issue
Fedora Core release 6 (Zod)
Kernel \r on an \m


Re: [GIT PULL] MMC updates

2007-05-04 Thread Pierre Ossman
Pierre Ossman wrote:
> Linus, please pull from
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/drzeus/mmc.git for-linus
> 

*ping*

-- 
 -- Pierre Ossman

  Linux kernel, MMC maintainer    http://www.kernel.org
  PulseAudio, core developer      http://pulseaudio.org
  rdesktop, core developer        http://www.rdesktop.org


Re: patch: VFS: fix passing of AT_PHDR value in auxv to ELF interpreter

2007-05-04 Thread Jeremy Fitzhardinge
Quentin Godfroy wrote:
>> Won't this break with ET_DYN executables?  And besides, isn't this the
>> same thing?   
>> 
>
> Indeed, I hadn't seen that. For ET_DYN executables, it could be done with
> something like load_addr + elf_ppnt->p_vaddr (in the function that creates the
> auxv, as it has access to the ELF header), and for ET_EXEC do what I
> propose. I think this is trivial to do. I'll do it as soon as I get back
> in front of my machine.
>   

I don't think you need to special-case it.  You can compute the offset
between the linked address and the load address (first
PT_LOAD[0]->p_vaddr - load_addr) and use that to offset all the other
addresses.
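
In other words, a hedged sketch of the computation (variable names invented;
for ET_EXEC the bias comes out as 0):

	unsigned long bias = load_addr - first_pt_load_vaddr;
	unsigned long phdr_addr = pt_phdr_vaddr + bias;	/* AT_PHDR value */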


> I don't understand. Yes it is what it is supposed to be, and the kernel
> is supposed to give the vaddr of the phdr table to the interpreter and
> not load addr + offset of phdr in file, which is sometimes wrong.
>   

How can it be wrong?  Does the PT_PHDR point to a different array of
Phdr entries?

J


Re: [PATCH] rfc: threaded epoll_wait thundering herd

2007-05-04 Thread Eric Dumazet

Davi Arnaut wrote:

Hi,

If multiple threads are parked on epoll_wait (on a single epoll fd) and
events become available, epoll performs a wake up of all threads of the
poll wait list, causing a thundering herd of processes trying to grab
the eventpoll lock.

This patch addresses this by using exclusive waiters (wake one). Once
the exclusive thread finishes transferring its events, a new thread
is woken if there are more events available.

Makes sense?



Yes it makes sense.

But... what happens if the thread that was chosen exits from the loop in 
ep_poll() with res = -EINTR (because of signal_pending(current))



Methinks in this case some ready events can wait forever (until the next
ep_poll_callback() or until another thread enters ep_poll()).





Re: cpufreq longhaul locks up

2007-05-04 Thread Rafał Bilski
>>> Switching from acpi_pm+performance to acpi_pm+ondemand also
>>> locks up after a few minutes.
>> Yep. Sounds like an ondemand issue. Thanks for verifying this for me.
> 
> Nah, it also happens with cpufreq_powersave. I just need to check 
> through some archives and try booting with governor=powersave so that it 
> always stays low. 
Do you get a lockup when switching from another governor to powersave? Or
after you have been using it for some time?





Re: [PATCH] UBI: dereference after kfree in create_vtbl

2007-05-04 Thread Satyam Sharma

On 5/5/07, Florin Malita <[EMAIL PROTECTED]> wrote:

Hi Satyam,

Satyam Sharma wrote:
> Eeks ... no, wait. You found a (two, actually) bug alright, but fixed
> it wrong. When we fail a write, we *must* add it to the corrupted list
> and _then_ attempt to retry. So, the "if (++tries <= 5)" applies to
> "if (!err) goto retry;" and not to the ubi_scan_add_to_list(). The
> difference is quite subtle here ...

Not being familiar with the code, I was specifically trying to preserve
the old semantics and only address the use-after-free issue. So if there
was another bug... well, I guess I succeeded at preserving it ;)

> The correct fix should actually be as follows: (Artem, this is diffed
> on the original vtbl.c)
[snip]
> +		err = ubi_scan_add_to_list(si, new_seb->pnum, new_seb->ec,
> +					   &si->corr);
> +		kfree(new_seb);
> +		if (++tries <= 5)
> 			if (!err)
> 				goto retry;

There's a side effect to this change: by unconditionally overwriting err
we lose the original error code. Then if we're exceeding the number of
retries we'll end up returning 0 which is probably not what you want.


You're absolutely right. We must preserve and return the original
(write error) return code and not the spurious ENOMEM / 0 that would
come from ubi_scan_add_to_list. I had noticed this too, but this time
_I_ preserved the (flawed) old semantics which leads to this problem.
Wow ... so you actually found 3 bugs here in ~3 lines of code :-)

I tried going deeper into this, but the lifetime semantics of a struct
ubi_scan_leb in this driver are quite ... weird indeed.


Return code aside, it seems the only thing ubi_scan_add_to_list() is
doing is allocate a new struct ubi_scan_leb, initialize some fields with
values passed from new_seb and then add it to the desired list. But
copying new_seb to a newly allocated structure and then immediately
freeing the old one seems redundant - why not just add new_seb to the
corrupted list and be done? Then we don't have to deal with allocation
failures in an error path anymore - something like this (diff against
the original code):


Again, I saw that too, but would still prefer using the higher level
function ubi_scan_add_to_list() to add to the corrupted list, but with
a different identifier for the return value to avoid overwriting err.
list_add_tail seems best left as an implementation detail below
ubi_scan_add_to_list(), IMO. So if it fails in the error path, we'd
have to return with the original (write error) return value and the
ENOMEM sort of goes ... unreturned. Alas!

Artem would have to step in here to verify if there really is a good
reason why we kmalloc a fresh ubi_scan_leb every time we want to add
one to a list. If possible, the best solution would be to change
ubi_scan_add_to_list() to take in a valid struct ubi_scan_leb and just
add that to the specified list (using list_add_tail or whatever) --
and leave allocation up to callers, though this likely requires a
major cleanup of this driver w.r.t. ubi_scan_leb lifetime semantics.
ubi_scan_add_used() was such a horror!

If adding an existing ubi_scan_leb is fine, but we can't really change
ubi_scan_add_to_list right now, then I'd much rather see a
__ubi_scan_add_to_list() that does the list_add_tail (and those
associated debug printk messages: btw are those printk's the only
reason why we're carrying that ubi_scan_info into
ubi_scan_add_to_list?). So ubi_scan_add_to_list() just wraps a kmalloc
over it -- that way, callers who want a new eraseblock alloced before
adding it to a list can use ubi_scan_add_to_list() and those that
already hold a valid one (like us) could use __ubi_scan_add_to_list()
directly.
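
A hedged sketch of that split (the seb->u.list field and the signatures are
assumptions about the UBI scan structures, not verified against the driver):

/* Add an already-allocated scan eraseblock to a list; cannot fail. */
static void __ubi_scan_add_to_list(struct ubi_scan_info *si,
				   struct ubi_scan_leb *seb,
				   struct list_head *list)
{
	/* si is kept only for the debugging printks mentioned above */
	list_add_tail(&seb->u.list, list);
}

/* The existing entry point becomes a kmalloc wrapper over it. */
int ubi_scan_add_to_list(struct ubi_scan_info *si, int pnum, int ec,
			 struct list_head *list)
{
	struct ubi_scan_leb *seb = kmalloc(sizeof(*seb), GFP_KERNEL);

	if (!seb)
		return -ENOMEM;
	seb->pnum = pnum;
	seb->ec = ec;
	__ubi_scan_add_to_list(si, seb, list);
	return 0;
}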

I see Andrew merged the previous fix, so this is on top of that,
though I'm not sure if this half-solution should actually go in --
hopefully that verbose comment / changelog will irritate Artem enough
to fix all this for good :-)

---

drivers/mtd/ubi/vtbl.c:create_vtbl() uses ubi_scan_add_to_list() in
the write error path. That can itself return with an ENOMEM error, in
which case we give up but return from create_vtbl() with the original
write error.

A robust solution would be to fix ubi_scan_add_to_list() to just take
in a valid eraseblock and simply add it to the specified list using
list_add_tail, and thus never fail. This would likely require a major
cleanup of this driver.

Alternatively, we could introduce and use a void
__ubi_scan_add_to_list() here that does this and make
ubi_scan_add_to_list() simply wrap a kmalloc over it, to avoid
changing existing users.

Signed-off-by: Satyam Sharma <[EMAIL PROTECTED]>

---

 drivers/mtd/ubi/vtbl.c |   17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

---

diff -ruNp a/drivers/mtd/ubi/vtbl.c b/drivers/mtd/ubi/vtbl.c
--- a/drivers/mtd/ubi/vtbl.c	2007-05-05 06:17:04.000000000 +0530
+++ b/drivers/mtd/ubi/vtbl.c	2007-05-05 09:23:43.000000000 +0530
@@ -260,7 +260,7 @@ bad:
static int create_vtbl(const struct ubi_device *ubi, struct 

Re: [PATCH 3/8] Universal power supply class (was: battery class)

2007-05-04 Thread Henrique de Moraes Holschuh
On Fri, 04 May 2007, Shem Multinymous wrote:
> >+enum power_supply_type {
> >+   POWER_SUPPLY_TYPE_BATTERY = 0,
> >+   POWER_SUPPLY_TYPE_UPS,
> >+   POWER_SUPPLY_TYPE_AC,
> >+   POWER_SUPPLY_TYPE_USB,
> >+};
> 
> How about dumb (non-USB) DC power? Any reason to distinguish it from AC?

Hmm, if it should not be distinguished, it is better to rename AC to
something that means continuous power.  But I'd rather have it AC and DC, as
something might have both supplies separate, and you might want to
differentiate them for some (human interface) reason.  After all, USB and DC
are not really different anyway...

Anyway, what IS the difference between UPS and battery, or UPS and AC/DC for
that matter?  When should UPS be used?  If you have UPS there, should not
MGG (motor-generator group) also be provided?

Given that USB-power *is* usually also "dumb" (i.e. it doesn't do any
control signaling over the USB bus for power-control purposes), IMHO it
might be better to have just battery, AC and DC as types.  And a primary and
secondary notion too, as that is common.  It would be generic.

Or maybe I just didn't get the idea behind the "type" attribute :-)

I'd appreciate it if these were documented in the text file.

-- 
  "One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie." -- The Silicon Valley Tarot
  Henrique Holschuh


Re: [PATCH] Revert "[PATCH] paravirt: Add startup infrastructure for paravirtualization"

2007-05-04 Thread David Miller
From: [EMAIL PROTECTED] (Eric W. Biederman)
Date: Fri, 04 May 2007 21:14:42 -0600

> David Miller <[EMAIL PROTECTED]> writes:
> 
> > From: Rusty Russell <[EMAIL PROTECTED]>
> > Date: Sat, 05 May 2007 11:22:48 +1000
> >
> >> Hi Eric,
> >
> > To all of those who don't speak "rusty", "Hi Soandso" means
> > "NAK".
> 
> The question between Rusty and myself is not if we should remove
> that code but when. 

I should have added a smiley face, sorry about that.

I was merely making a joke wrt. a recent blog entry of Rusty's:

http://ozlabs.org/~rusty/index.cgi/tech/2007-05-04.html

:-)


Re: patch: VFS: fix passing of AT_PHDR value in auxv to ELF interpreter

2007-05-04 Thread Quentin Godfroy
On Fri, May 04, 2007 at 04:22:08PM -0700, Andrew Morton wrote:
> This patch kills my FC6 machine (using a config which was derived from RH's
> original):
> 
> Freeing unused kernel memory: 368k freed
> Write protecting the kernel read-only data: 959k
> request_module: runaway loop modprobe binfmt-464c
> request_module: runaway loop modprobe binfmt-464c
> request_module: runaway loop modprobe binfmt-464c
> request_module: runaway loop modprobe binfmt-464c
> request_module: runaway loop modprobe binfmt-464c
> 
> 
> .config: http://userweb.kernel.org/~akpm/config-akpm2.txt

I didn't try it on a 64-bit kernel. I'll do so as soon as I can reach my
machine. Probably the loop does not find PT_PHDR and then returns noexec.

I had such a problem, but it was because I forgot elf_ppnt = elf_phdata
before the loop.

By the way, is init 32 bits or 64 bits? It could break the ia32
emulation thing, but not the 64bit native mode.

Anyway the problem could be addressed by falling back to the old
behaviour if the loop fails, but it's not clean at all.


Re: patch: VFS: fix passing of AT_PHDR value in auxv to ELF interpreter

2007-05-04 Thread Quentin Godfroy
On Fri, May 04, 2007 at 04:31:49PM -0700, Jeremy Fitzhardinge wrote:
> Quentin Godfroy wrote:
> > +   elf_ppnt = elf_phdata;
> > +   for (i = 0; i < loc->elf_ex.e_phnum; i++, elf_ppnt++)
> > +   if (elf_ppnt->p_type == PT_PHDR) {
> > +   phdr_addr = elf_ppnt->p_vaddr;
> >   
> 
> Won't this break with ET_DYN executables?  And besides, isn't this the
> same thing?   

Indeed, I hadn't seen that. For ET_DYN executables, it could be done with
something like load_addr + elf_ppnt->p_vaddr (in the function that creates the
auxv, as it has access to the ELF header), and for ET_EXEC do what I
propose. I think this is trivial to do. I'll do it as soon as I get back
in front of my machine.


>  Shouldn't PT_PHDR->p_vaddr point to the vaddr of the Phdr
> table itself?

I don't understand. Yes it is what it is supposed to be, and the kernel
is supposed to give the vaddr of the phdr table to the interpreter and
not load addr + offset of phdr in file, which is sometimes wrong.


Re: [PATCH] Revert "[PATCH] paravirt: Add startup infrastructure for paravirtualization"

2007-05-04 Thread Rusty Russell
On Fri, 2007-05-04 at 20:53 -0600, Eric W. Biederman wrote:
> Rusty Russell <[EMAIL PROTECTED]> writes:
> The delicate part right now is that lguest is attempting to use the
> standard kernel entry point which does have a fixed ABI.
> 
> If lguest uses that entry point in a hard to maintain way it provides
> a bad example, and it potentially leads to other problems.  So I
> really don't want to see the bad example happen, especially if the
> code in the bad example is as general as it is today.

I completely agree, a bad example is worse than no example.  Plus, an
opportunity to have you and hpa hacking on lguest is not to be missed.

> Frankly I think the least risk of problems comes from just doing a
> separate entry point for lguest for now.  It means we don't even have
> to touch the common code path and later dropping will be trivially 
> lguest specific, and certain to not break anything else.

Hmm, I railed for so long against Xen doing that, it feels... wrong...
to do that now 8)

I think I'll need to hack in a magic signature before the lguest start:
it's the only way it'll work with unpacking bzImages as well.  And it'll
be trivial to rip out later when we have the Right Way.

I'll spin a patch this afternoon (got to go to puppy training now).

Thanks!
Rusty.




Re: [PATCH] Revert "[PATCH] paravirt: Add startup infrastructure for paravirtualization"

2007-05-04 Thread Eric W. Biederman
David Miller <[EMAIL PROTECTED]> writes:

> From: Rusty Russell <[EMAIL PROTECTED]>
> Date: Sat, 05 May 2007 11:22:48 +1000
>
>> Hi Eric,
>
> To all of those who don't speak "rusty", "Hi Soandso" means
> "NAK".

The question between Rusty and myself is not if we should remove
that code but when. 

If we are going to break an ABI I figure the sooner the better,
and the conversation between Rusty and myself has not been totally
unproductive.

Eric


Re: Ext3 vs NTFS performance

2007-05-04 Thread Xu CanHao

On Tue, 1 May 2007 13:43:18 -0700
"Cabot, Mason B" <[EMAIL PROTECTED]> wrote:


Hello all,



I've been testing the NAS performance of ext3/Openfiler 2.2 against
NTFS/WinXP and have found that NTFS significantly outperforms ext3 for
video workloads. The Windows CIFS client will attempt a poor-man's
pre-allocation of the file on the server by sending 1-byte writes at
128K-byte strides, breaking block allocation on ext3 and leading to
fragmentation and poor performance. This will happen for many
applications (including iTunes) as the CIFS client issues these
pre-allocates under the application layer.


On 5 May, 10:20, Theodore Tso <[EMAIL PROTECTED]> wrote:


This is being worked on already.  XFS has a per-filesystem ioctl, but
we want to create a filesystem-independent system call,
sys_fallocate(), that would be wired into the already existing
posix_fallocate() function exported by glibc.


The story tells us: an application must pay attention to the file-system. Ext3
is good at aaa but not at bbb; XFS is good at ccc but not at ddd; reiserfs is
good at eee but not at fff...

For this scenario, XFS is good at dealing with fragmentation while ext3 is not.


Re: [PATCH] Revert "[PATCH] paravirt: Add startup infrastructure for paravirtualization"

2007-05-04 Thread Eric W. Biederman
Rusty Russell <[EMAIL PROTECTED]> writes:

> Hi Eric,
>
>   Well, I certainly don't recall that (that's not to say that someone
> didn't say it).  Trying to meet the requirements of Xen, VMI and other
> future hypervisors lead to an awkward result; this is the main reason I
> started on lguest, so we'd have a simple example in front of us to say
> "do it this way".
>
>   (It's not certain that anyone else will ever use this code, but we
> should *try* IMHO).

Ok.  So the purpose of lguest is to be an experimental platform and
as much as possible be a good example.

>> Before lguest.  Thank you very much.  This code should never ever
>> have been in a stable kernel.  It is a very ill conceived interface.
>
> I disagree.  It was *not* obvious how paravirt kernels should boot.
> Lguest, for example, copied Xen's "set up kernel pagetables already"
> design decision, which now seems wrong.  But it was the example we had.

Makes some sense.  There was a bit of experience with booting kernels
without doing 16bit BIOS calls in nearly this way outside the paravirt
world though. 

>> And frankly I don't think lguest should be merged until we are as
>> close to certain as human beings can get that have the ABI reviewed
>> and sorted out.  ABIs unfortunately are very very hard to change.
>
> I think you misunderstand lguest.  I agree with this sentiment
> completely: this is *why* lguest doesn't have an ABI.  It's all in-tree,
> so it can simply be changed.  There's no guarantee that running
> different kernels as guest and host will work.

Reasonable.  Just so long as it is hidden under CONFIG_EXPERIMENTAL,
and people know that mixing and matching is a problem.

Someday I might have to tell the story of the challenge of getting
the kexec-on-panic code path to allow mixing and matching of kernels.

> Maybe later the ABI will nail down, but the last year of hacking on
> various hypervisors has shown it's folly to try to get it right now.  We
> need to play a lot first.
>
> Hope that clarifies!

For lguest at least.

The delicate part right now is that lguest is attempting to use the
standard kernel entry point which does have a fixed ABI.

If lguest uses that entry point in a hard to maintain way it provides
a bad example, and it potentially leads to other problems.  So I
really don't want to see the bad example happen, especially if the
code in the bad example is as general as it is today.

That said we have two practical short term solutions.
-  A separate entry point advertised with an ELF note, like
   Xen is currently doing.

-  We reserve a type field in the first 2K of the boot param area
   and if that type is lguest we immediately branch to your code.

Reserving the type field is a subset of what we are looking at for
the next rev of the boot protocol that will handle this clearly.

We have enough in flight work for cleaning up the booting of arch/i386
that I don't believe we can have a complete solution before the merge
window closes.  At best we can get a minor rev and reserve the fields
it looks like we will need in head.S

Frankly I think the least risk of problems comes from just doing a
separate entry point for lguest for now.  It means we don't even have
to touch the common code path and later dropping will be trivially 
lguest specific, and certain to not break anything else.

Eric


Re: [PATCH] Revert "[PATCH] paravirt: Add startup infrastructure for paravirtualization"

2007-05-04 Thread David Miller
From: Rusty Russell <[EMAIL PROTECTED]>
Date: Sat, 05 May 2007 11:22:48 +1000

> Hi Eric,

To all of those who don't speak "rusty", "Hi Soandso" means
"NAK".


Re: Ext3 vs NTFS performance

2007-05-04 Thread Theodore Tso
On Fri, May 04, 2007 at 07:49:13PM +0400, Michael Tokarev wrote:
> 
> How about providing a way to stop kernel (or filesystem) to make gaps
> in files instead?  Like some ioctl(fd, FS_NOGAPS, 1) -- pretty much
> like 'doze has, just the opposite (on windows, this flag is "on" by
> default).

This is being worked on already.  XFS has a per-filesystem ioctl, but
we want to create a filesystem-independent system call,
sys_fallocate(), that would wired into the already existing
posix_fallocate() function exported by glibc.
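
For reference, a minimal userspace sketch of the interface that would gain a
real syscall underneath (file name and sizes invented):

#include <fcntl.h>
#include <unistd.h>

int preallocate_video_file(const char *path)
{
	int fd = open(path, O_CREAT | O_WRONLY, 0644);
	int err;

	if (fd < 0)
		return -1;
	/* Reserve 700MB up front.  glibc currently emulates this in
	   userspace; with sys_fallocate() the filesystem could allocate
	   contiguous blocks without writing anything. */
	err = posix_fallocate(fd, 0, 700LL * 1024 * 1024);
	close(fd);
	return err;
}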

> It's even worse: imagine samba transforms this into write(zeros) (as
> preallocate isn't available yet), and at the same time, another process
> is writing there... Which will be perfectly valid in current case, but
> will go wrong way (overwriting just-written data with zeros) in this
> new scenario.

Samba can just use the posix_fallocate() library call.  Note that if
you have two processes are writing to the same file without proper
locking, you're probably going to run into potential problems anyway.
What if one process is writing whole blockfuls of data, while some
brain-damaged Windows client is writing a byte of zero every 128k, and
thus subtly corrupting the data written by the first process?  We
can't fix brain-damaged applications that aren't doing proper
application-level locking.

(Aside, of course, from convincing people to switch away from Vista to
Linux. :-)

- Ted


Re: [PATCH] i386: always clear bss

2007-05-04 Thread Eric W. Biederman
"H. Peter Anvin" <[EMAIL PROTECTED]> writes:

> Eric W. Biederman wrote:
>> 
>> I have this vague memory of liking 0x3c because if we do happen to use
>> more room then we intended the consequences are pretty benign.
>> 
>> But that is a pretty minor consequence.
>> 

I meant to say it is a pretty minor consideration.

> That's a dangerous assumption (besides, it's likely wrong, since there
> are only two unused bytes below it.)

Worst case on a modern machine is that we get a messed-up screen.
The odds of it happening are sufficiently remote that it isn't worth
worrying about.

Eric


RE: Regression with SLUB on Netperf and Volanomark

2007-05-04 Thread Christoph Lameter
Got something. If I remove the atomics from both alloc and free then I
get a performance jump. But maybe I am also seeing runtime variation...

Avoid the use of atomics in slab_alloc

About 5-7% performance gain. Or am I also seeing runtime variations?

What we do is use the last free field in the page struct to set up
a separate per-cpu freelist. From that one we can allocate without
taking the slab lock because we checkout the complete list of free
objects when we first touch the slab. If we have an active list
then we can also free to that list if we run on that processor
without taking the slab lock.

This allows even concurrent allocations and frees from the same slab using
two mutually exclusive freelists. If the allocator is running out of
its per cpu freelist then it will consult the per slab freelist and reload
if objects were freed in it.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 include/linux/mm_types.h |    5 +++-
 mm/slub.c                |   54 +++++++++++++++++++++++++++++++++++---------
 2 files changed, 49 insertions(+), 10 deletions(-)

Index: slub/include/linux/mm_types.h
===
--- slub.orig/include/linux/mm_types.h	2007-05-04 18:58:06.000000000 -0700
+++ slub/include/linux/mm_types.h	2007-05-04 18:59:42.000000000 -0700
@@ -50,9 +50,12 @@ struct page {
spinlock_t ptl;
 #endif
struct {/* SLUB uses */
-   struct page *first_page;/* Compound pages */
+   void **cpu_freelist;/* Per cpu freelist */
struct kmem_cache *slab;/* Pointer to slab */
};
+   struct {
+   struct page *first_page;/* Compound pages */
+   };
};
union {
pgoff_t index;  /* Our offset within mapping. */
Index: slub/mm/slub.c
===
--- slub.orig/mm/slub.c	2007-05-04 18:58:06.000000000 -0700
+++ slub/mm/slub.c	2007-05-04 19:02:33.000000000 -0700
@@ -845,6 +845,7 @@ static struct page *new_slab(struct kmem
page->offset = s->offset / sizeof(void *);
page->slab = s;
page->inuse = 0;
+   page->cpu_freelist = NULL;
start = page_address(page);
end = start + s->objects * s->size;
 
@@ -1137,6 +1138,23 @@ static void putback_slab(struct kmem_cac
  */
 static void deactivate_slab(struct kmem_cache *s, struct page *page, int cpu)
 {
+   /*
+* Merge cpu freelist into freelist. Typically we get here
+* because both freelists are empty. So this is unlikely
+* to occur.
+*/
+   while (unlikely(page->cpu_freelist)) {
+   void **object;
+
+   /* Retrieve object from cpu_freelist */
+   object = page->cpu_freelist;
+   page->cpu_freelist = page->cpu_freelist[page->offset];
+
+   /* And put onto the regular freelist */
+   object[page->offset] = page->freelist;
+   page->freelist = object;
+   page->inuse--;
+   }
s->cpu_slab[cpu] = NULL;
ClearPageActive(page);
 
@@ -1206,25 +1224,32 @@ static void *slab_alloc(struct kmem_cach
local_irq_save(flags);
cpu = smp_processor_id();
page = s->cpu_slab[cpu];
-   if (!page)
+   if (unlikely(!page))
goto new_slab;
 
+   if (likely(page->cpu_freelist)) {
+fast_object:
+   object = page->cpu_freelist;
+   page->cpu_freelist = object[page->offset];
+   local_irq_restore(flags);
+   return object;
+   }
+
slab_lock(page);
if (unlikely(node != -1 && page_to_nid(page) != node))
goto another_slab;
 redo:
-   object = page->freelist;
-   if (unlikely(!object))
+   if (unlikely(!page->freelist))
goto another_slab;
if (unlikely(PageError(page)))
goto debug;
 
-have_object:
-   page->inuse++;
-   page->freelist = object[page->offset];
+   /* Reload the cpu freelist */
+   page->cpu_freelist = page->freelist;
+   page->freelist = NULL;
+   page->inuse = s->objects;
slab_unlock(page);
-   local_irq_restore(flags);
-   return object;
+   goto fast_object;
 
 another_slab:
deactivate_slab(s, page, cpu);
@@ -1267,6 +1292,7 @@ have_slab:
local_irq_restore(flags);
return NULL;
 debug:
+   object = page->freelist;
if (!alloc_object_checks(s, page, object))
goto another_slab;
if (s->flags & SLAB_STORE_USER)
@@ -1278,7 +1304,11 @@ debug:
dump_stack();
}
init_object(s, object, 1);
-   goto have_object;
+   page->freelist = object[page->offset];
+   page->inuse++;
+   slab_unlock(page);
+   local_irq_restore(flags);
+   return object;
 }
 
 

Re: [PATCH] Revert "[PATCH] paravirt: Add startup infrastructure for paravirtualization"

2007-05-04 Thread Rusty Russell
On Fri, 2007-05-04 at 09:07 -0600, Eric W. Biederman wrote:
> Rusty Russell <[EMAIL PROTECTED]> writes:
> 
> > On Fri, 2007-05-04 at 08:13 -0600, Eric W. Biederman wrote:
> >> We don't have any working code, there are no in tree users.
> >
> > Hi Eric,
> >
> > Lack of in-tree code is definitely not due to me.  The code which uses
> > it has been sitting in -mm for three months.  Suddenly ripping this out
> > and breaking all that work without replacing it is rude.
> 
> My memory is very fuzzy now, but I know it at least came up early on
> that everyone should be using %esi to point to real mode data and
> that didn't happen.

Hi Eric,

Well, I certainly don't recall that (that's not to say that someone
didn't say it).  Trying to meet the requirements of Xen, VMI and other
future hypervisors lead to an awkward result; this is the main reason I
started on lguest, so we'd have a simple example in front of us to say
"do it this way".

(It's not certain that anyone else will ever use this code, but we
should *try* IMHO).

> Before lguest.  Thank you very much.  This code should never ever
> have been in a stable kernel.  It is a very ill conceived interface.

I disagree.  It was *not* obvious how paravirt kernels should boot.
Lguest, for example, copied Xen's "set up kernel pagetables already"
design decision, which now seems wrong.  But it was the example we had.

> And frankly I don't think lguest should be merged until we are as
> close to certain as human beings can get that have the ABI reviewed
> and sorted out.  ABIs unfortunately are very very hard to change.

I think you misunderstand lguest.  I agree with this sentiment
completely: this is *why* lguest doesn't have an ABI.  It's all in-tree,
so it can simply be changed.  There's no guarantee that running
different kernels as guest and host will work.

Maybe later the ABI will nail down, but the last year of hacking on
various hypervisors has shown it's folly to try to get it right now.  We
need to play a lot first.

Hope that clarifies!
Rusty.




Re: [PATCH] Revert "[PATCH] paravirt: Add startup infrastructure for paravirtualization"

2007-05-04 Thread Rusty Russell
On Fri, 2007-05-04 at 08:58 -0700, Andrew Morton wrote:
> On Sat, 05 May 2007 00:37:30 +1000 Rusty Russell <[EMAIL PROTECTED]> wrote:
> 
> > Of course, I expect that Andrew is about to push my patches to Linus
> > any day now... right Andrew?  Then we don't need this argument.
> 
> It would be about a week off.  I'll resend the patches out for rereview.

Unfortunately, it's becoming clear that we need to know what's going to
happen with the boot code first.  The details are both nasty and
important.

Hopefully this will all occur in the next few days so we can put
something small in w/o blocking lguest.  But if not, so be it: at least
we're making progress.

Sorry,
Rusty.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ExpressCard hotswap support?

2007-05-04 Thread Chris Adams
Once upon a time, Daniel J Blueman <[EMAIL PROTECTED]> said:
> On 4 May, 01:20, Chris Adams <[EMAIL PROTECTED]> wrote:
> >I've got a Thinkpad Z60m with an ExpressCard slot, and I got a Belkin
> >F5U250 GigE ExpressCard (Marvell 88E8053 chip using sky2 driver).  It
> >appears that Linux only recognizes it if I insert the card with the
> >system powered off.  If I hot-insert the card, nothing happens (no
> >messages logged, no PCI device shows up, nothing).
> 
> The BIOS initialises and powers up the downstream PCI express port
> when it detects a card is present.
> 
> When Linux boots, it enumerates the bus and sees it, but does not do
> prior configuration to enable, configure and cause link negotiation on
> all PCI express ports I believe; this requires chipset and (sometimes
> revision-) specific code, which wouldn't be so robust as the BIOS
> doing the footwork.

Actually, for me, loading pciehp with pciehp_force=1 set works.
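(Presumably something along the lines of "modprobe pciehp pciehp_force=1",
or pciehp.pciehp_force=1 on the kernel command line if the driver is
built in.)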
-- 
Chris Adams <[EMAIL PROTECTED]>
Systems and Network Administrator - HiWAAY Internet Services
I don't speak for anybody but myself - that's enough trouble.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] i386: always clear bss

2007-05-04 Thread H. Peter Anvin
Eric W. Biederman wrote:
> 
> I have this vague memory of liking 0x3c because if we do happen to use
> more room then we intended the consequences are pretty benign.
> 
> But that is a pretty minor consequence.
> 

That's a dangerous assumption (besides, it's likely wrong, since there
are only two unused bytes below it.)

-hpa
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] i386: always clear bss

2007-05-04 Thread Eric W. Biederman
"H. Peter Anvin" <[EMAIL PROTECTED]> writes:

> Eric W. Biederman wrote:
>> 
>> My notes show 0x5c reserved for additional apm_bios_info, although
>> off the top of my head I don't know how realistic that is.
>> 
>> 0x1e4 does look available.
>> 
>> It has been a long time since I made that choice, and I do see that
>> looking at struct screen_info I did remember to document that I was
>> using 0x3c, even in your structure.
>> 
>> It is all internal to our boot process and external code isn't going
>> to use it so we can change it if we feel like.
>> 
>
> I don't see the actual instruction that does that anywhere in my tree,
> which was branched from Andi's "for-linus" git tree, but I have reserved
> 0x1e4 for that purpose as "scratch".

I have this vague memory of liking 0x3c because if we do happen to use
more room than we intended the consequences are pretty benign.

But that is a pretty minor consequence.

Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: Regression with SLUB on Netperf and Volanomark

2007-05-04 Thread Christoph Lameter
If you want to test some more: Here is a patch that removes the atomic ops 
from the allocation path. But I only see minor improvements on my amd64 
box here.



Avoid the use of atomics in slab_alloc

This only increases netperf performance by 1%. Wonder why?

What we do is add the last free field in the page struct to set up
a separate per cpu freelist. From that one we can allocate without
taking the slab lock because we check out the complete list of free
objects when we first touch the slab.

This allows concurrent allocations and frees from the same slab using
two mutually exclusive freelists. If the allocator is running out of
its per cpu freelist then it will consult the per slab freelist and reload
if objects were freed in it.
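
In other words, the fast path becomes roughly this (a simplified sketch
using the names from the patch below, ignoring the irq handling and the
slow-path fallback):

	void **object = page->active_freelist;
	if (object) {
		/* pop the head; the next-free link is at object[page->offset] */
		page->active_freelist = object[page->offset];
		return object;
	}
	/* otherwise take slab_lock and reload from page->freelist */

Only the reload (and frees onto page->freelist) need the slab lock.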

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 include/linux/mm_types.h |5 -
 mm/slub.c|   44 +++-
 2 files changed, 39 insertions(+), 10 deletions(-)

Index: slub/include/linux/mm_types.h
===================================================================
--- slub.orig/include/linux/mm_types.h  2007-05-04 17:39:33.0 -0700
+++ slub/include/linux/mm_types.h   2007-05-04 17:58:38.0 -0700
@@ -50,9 +50,12 @@ struct page {
spinlock_t ptl;
 #endif
struct {/* SLUB uses */
-   struct page *first_page;/* Compound pages */
+   void **active_freelist; /* Allocation freelist */
struct kmem_cache *slab;/* Pointer to slab */
};
+   struct {
+   struct page *first_page;/* Compound pages */
+   };
};
union {
pgoff_t index;  /* Our offset within mapping. */
Index: slub/mm/slub.c
===================================================================
--- slub.orig/mm/slub.c 2007-05-04 17:40:50.0 -0700
+++ slub/mm/slub.c  2007-05-04 18:14:23.0 -0700
@@ -845,6 +845,7 @@ static struct page *new_slab(struct kmem
page->offset = s->offset / sizeof(void *);
page->slab = s;
page->inuse = 0;
+   page->active_freelist = NULL;
start = page_address(page);
end = start + s->objects * s->size;
 
@@ -1137,6 +1138,19 @@ static void putback_slab(struct kmem_cac
  */
 static void deactivate_slab(struct kmem_cache *s, struct page *page, int cpu)
 {
+   /* Two freelists that are now to be consolidated */
+   while (unlikely(page->active_freelist)) {
+   void **object;
+
+   /* Retrieve object from active_freelist */
+   object = page->active_freelist;
+   page->active_freelist = page->active_freelist[page->offset];
+
+   /* And put onto the regular freelist */
+   object[page->offset] = page->freelist;
+   page->freelist = object;
+   page->inuse--;
+   }
s->cpu_slab[cpu] = NULL;
ClearPageActive(page);
 
@@ -1206,25 +1220,32 @@ static void *slab_alloc(struct kmem_cach
local_irq_save(flags);
cpu = smp_processor_id();
page = s->cpu_slab[cpu];
-   if (!page)
+   if (unlikely(!page))
goto new_slab;
 
+   if (likely(page->active_freelist)) {
+fast_object:
+   object = page->active_freelist;
+   page->active_freelist = object[page->offset];
+   local_irq_restore(flags);
+   return object;
+   }
+
slab_lock(page);
if (unlikely(node != -1 && page_to_nid(page) != node))
goto another_slab;
 redo:
-   object = page->freelist;
-   if (unlikely(!object))
+   if (unlikely(!page->freelist))
goto another_slab;
if (unlikely(PageError(page)))
goto debug;
 
-have_object:
-   page->inuse++;
-   page->freelist = object[page->offset];
+   /* Reload the active freelist */
+   page->active_freelist = page->freelist;
+   page->freelist = NULL;
+   page->inuse = s->objects;
slab_unlock(page);
-   local_irq_restore(flags);
-   return object;
+   goto fast_object;
 
 another_slab:
deactivate_slab(s, page, cpu);
@@ -1267,6 +1288,7 @@ have_slab:
local_irq_restore(flags);
return NULL;
 debug:
+   object = page->freelist;
if (!alloc_object_checks(s, page, object))
goto another_slab;
if (s->flags & SLAB_STORE_USER)
@@ -1278,7 +1300,11 @@ debug:
dump_stack();
}
init_object(s, object, 1);
-   goto have_object;
+   page->freelist = object[page->offset];
+   page->inuse++;
+   slab_unlock(page);
+   local_irq_restore(flags);
+   return object;
 }
 
 void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: mkfs.ext2 triggerd RAM corruption

2007-05-04 Thread Bernd Schubert
Jan-Benedict Glaw wrote:

> On Fri, 2007-05-04 16:59:51 +0200, Bernd Schubert <[EMAIL PROTECTED]>
> wrote:
>> To see whats going on, I copied the entire / (so the initrd) into a
>> tmpfs
>> root, chrooted into it, also bind mounted the main / into this chroot
>> and
>> compared several times /bin of chroot/bin and the bind-mounted /bin
>> while
>> the mkfs.ext2 command was running.
>> 
>> beo-05:/# diff -r /bin /oldroot/bin/
>> beo-05:/# diff -r /bin /oldroot/bin/
>> beo-05:/# diff -r /bin /oldroot/bin/
>> Binary files /bin/sleep and /oldroot/bin/sleep differ
>> beo-05:/# diff -r /bin /oldroot/bin/
>> Binary files /bin/bsd-csh and /oldroot/bin/bsd-csh differ
>> Binary files /bin/cat and /oldroot/bin/cat differ
>> ...
>> 
>> Also tested different schedulers, at least happens with deadline and
>> anticipatory.
>> 
>> The corruption does NOT happen on running the mkfs command on
>> /dev/sda1,
>> but happens with sda2, sda3 and sda3. Also doesn't happen with
>> extended
>> partitions of sda1.
> 
> Is sda2 the largest filesystem out of sda2, sda3 (and the logical
> partitions within the extended sda1, if these get mkfs'ed, too)?

I tested it that way:

- test on sda1, no further partitions
- test on sda2, sda1: ~2MB, everything else for sda2
- test on sda3, sda1: ~2MB, sda2: ~2MB, everything else for sda3
...
- test on sda5, sda1: partition that has the extended partition, everything
  else in sda5

> 
> I'm not too sure that this is a kernel bug, but probably a bad RAM
> chip. Did you run memtest86 for a while? ...and can you reproduce this
> problem on different machines?

Reproducible on 4 test-systems (2 with identical hardware, but then the
2 + 1 + 1 with entirely different hardware combinations) with ECC memory,
which is monitored by EDAC. Memory, CPU, etc. are already real life stress
tested with several applications, e.g. linpack. 
Though I don't entirely agree, my colleagues in this group are always
telling me that their real life stress tests show more memory
corruptions than memtest. As soon as I have physical access again, I can also 
do a memtest86 run (would like to do it over the weekend, but don't know how
to convince stupid rembo to boot memtest).
Anyway, a memory corruption is more than unlikely on these systems for
several reasons.


Thanks,
Bernd
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: mkfs.ext2 triggerd RAM corruption

2007-05-04 Thread Bernd Schubert
Theodore Tso wrote:

> On Fri, May 04, 2007 at 04:59:51PM +0200, Bernd Schubert wrote:
>> 
>> I'm presently rather puzzled; if this is really a kernel bug, it's a
>> big
>> bug.
>> 
>> Summary: The system ramdisk (initrd) gets corrupted while running
>> mkfs.ext2 on a local sata disk partition.
> 
> What distribution are you using?  What's the hardware configuration,

distribution: modified debian sarge; in what way is the distribution
important for this problem? mkfs.ext2 is supposed to write to /dev/sdaX
and not /dev/rd/0. Stracing it and grepping for open calls shows that
only /dev/sdaX is opened in read-write mode.

hardware:
beo-05 and beo-06: cpu: xeon, acpi shows S3000PTH board, memory 2GB
(board too new for EDAC), piix sata controller

beo-106: Dual Core AMD Opteron, no idea what kind of board, 4GB memory
(k8_edac monitored), nforce sata controller

beo-01: Presently can't connect to it, afaik another intel system

(all system are running in x86_64 mode)


> including amount of memory?  What does the partition table look
> like for /dev/sda?  What filesystems are mounted?  If you have any

I already tested several partition types, e.g. something like this for a
test on sda3

beo-05:~# sfdisk -d /dev/sda
# partition table of /dev/sda
unit: sectors

/dev/sda1 : start=   63, size=  4208967, Id=83
/dev/sda2 : start=  4209030, size=  4209030, Id=83
/dev/sda3 : start=  8418060, size=313251435, Id=83
/dev/sda4 : start=0, size=0, Id= 0


For the tests nothing was mounted. 


> soft RAID partitions, are any of them using part of /dev/sda?  What

No raid during the tests on sda, of course. 
When sdaX was part of a raid and the test ran on the raid device, the
corruption did NOT happen.

> swap partitions are you using?  And do any of the swap partitions

Swap already entirely disabled.

> overlap with /dev/sda?  :-)

Suspected this first too, but the tested partition was never used as
swap partition (first always tested on sda4 and sda2 was used for swap),
later I entirely disabled the swap.

Thanks,
Bernd


PS: It took me about 10 hours of testing before I wrote the first mail.
It took that long to believe that it's really a kernel bug.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RELEASE] linux-2.6.21 backport: 269 version

2007-05-04 Thread Glauber de Oliveira Costa

> The URL?  Well, I'm glad you asked:
>
>   http://lguest.ozlabs.org/lguest-2.6.21-269.patch.gz
>
>

No, I have not. Thanks anyway, will try it. Finally. 8-)


But he was talking about me! ;-)
(kidding)
--
Glauber de Oliveira Costa.
"Free as in Freedom"

"The less confident you are, the more serious you have to act."
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [git pull] Input patches for 2.6.20

2007-05-04 Thread Linus Torvalds


On Thu, 3 May 2007, Dmitry Torokhov wrote:
> 
> If you have not pulled yet please pull from:
>
>         master.kernel.org:/pub/scm/linux/kernel/git/dtor/input.git 
> for-linus
> 
> because master branch will have extra stuff in the next minute or so.

Hmm. That thing had a conflict with the driver core changes I just pulled 
from Greg due to Greg removing "struct subsystem".

The conflict looked really trivial, and I fixed up the obvious places, 
probably correctly. Please verify.

Martin, in the process I noticed that the new file

arch/s390/kernel/ipl.c

seems to be broken by the same thing after the driver core merge. The fix 
_looks_ equally trivial (change &subsys.kset.kobj into just &subsys.kobj), 
but I didn't do that trivial thing, because I cannot even test-compile the 
end result. So can you give it a quick look, please?
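
Schematically, per affected call site in ipl.c that would be something
like the following hypothetical line (the real code may differ):

	-	if (sysfs_create_file(&subsys.kset.kobj, &attr->attr))
	+	if (sysfs_create_file(&subsys.kobj, &attr->attr))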

Greg - maybe more of a heads-up to people next time you do something like 
this? Did this removal hit -mm? Anyway, it looks like something trivial to 
fix up after, but ..

Linus

Re: [PATCH] synclink_gt add compat_ioctl

2007-05-04 Thread Paul Fulghum

Andrew Morton wrote:
> On Thu, 03 May 2007 13:01:17 -0500
> Paul Fulghum <[EMAIL PROTECTED]> wrote:
>
>> Add compat_ioctl handler to synclink_gt driver.
>
> i386 allmodconfig:
>
> make[3]: *** No rule to make target 
> `/usr/src/devel/usr/include/linux/.check.synclink.h', needed by 
> `__headerscheck'.  Stop.
>
> I got tired of this patch - I think I'll drop it.


This all seems to be related to the use of compat_ulong_t.

Since my original patch worked fine using unsigned int,
how about I go back to that?


--
Paul
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 2/3] SLUB: Implement targeted reclaim and partial list defragmentation

2007-05-04 Thread Christoph Lameter
On Fri, 4 May 2007, Randy Dunlap wrote:

> > /* Perform the KICK callbacks to remove the objects */
> > for(p = addr; p < addr + s->objects * s->size; p += s->size)
> 
> missed a space after "for".

Thanks but I was more hoping for a higher level of review. Locking

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: Regression with SLUB on Netperf and Volanomark

2007-05-04 Thread Christoph Lameter
On Fri, 4 May 2007, Tim Chen wrote:

> On Thu, 2007-05-03 at 18:45 -0700, Christoph Lameter wrote:
> > Hmmm... One potential issue is the complicated way the slab is 
> > handled. Could you try this patch and see what impact it has?
> > 
> The patch boosts the throughput of the TCP_STREAM test by 5%, for both slab
> and slub.  But slab is still 5% better in my tests.

Really? buffer head handling improves TCP performance? I think you have 
run-to-run variances. I need to look at this myself.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 2/3] SLUB: Implement targeted reclaim and partial list defragmentation

2007-05-04 Thread Randy Dunlap
On Fri, 4 May 2007 16:03:43 -0700 (PDT) Christoph Lameter wrote:

> Fixes suggested by Andrew
> 
> ---
>  include/linux/slab.h |   12 
>  mm/slub.c|   32 +---
>  2 files changed, 33 insertions(+), 11 deletions(-)
> 
>   /* Perform the KICK callbacks to remove the objects */
>   for(p = addr; p < addr + s->objects * s->size; p += s->size)

missed a space after "for".

> - if (!test_bit((p - addr) / s->size, map))
> + if (test_bit((p - addr) / s->size, map))
>   s->slab_ops->kick_object(p);

---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: mkfs.ext2 triggerd RAM corruption

2007-05-04 Thread Theodore Tso
On Fri, May 04, 2007 at 04:59:51PM +0200, Bernd Schubert wrote:
> 
> I'm presently rather puzzled; if this is really a kernel bug, it's a big bug. 
> 
> Summary: The system ramdisk (initrd) gets corrupted while running
> mkfs.ext2 on a local sata disk partition.

What distribution are you using?  What's the hardware configuration,
including amount of memory?  What does the partition table look
like for /dev/sda?  What filesystems are mounted?  If you have any
soft RAID partitions, are any of them using part of /dev/sda?  What
swap partitions are you using?  And do any of the swap partitions
overlap with /dev/sda?  :-)

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: Regression with SLUB on Netperf and Volanomark

2007-05-04 Thread Tim Chen
On Fri, 2007-05-04 at 16:59 -0700, Christoph Lameter wrote:

> > 
> > to run the tests.  The results are about the same as the non-NUMA case,
> > with slab about 5% better than slub.  
> 
> Hmmm... both tests were run in the same context? NUMA has additional 
> overhead in other areas.

Both slab and slub tests are tested with the same NUMA options and
config.

Tim
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: Regression with SLUB on Netperf and Volanomark

2007-05-04 Thread Tim Chen
On Thu, 2007-05-03 at 18:45 -0700, Christoph Lameter wrote:
> H.. One potential issues are the complicated way the slab is 
> handled. Could you try this patch and see what impact it has?
> 
The patch boosts the throughput of the TCP_STREAM test by 5%, for both slab
and slub.  But slab is still 5% better in my tests.

> If it has any then remove the cachline alignment and see how that 
> influences things.

Removing the cacheline alignment didn't change the throughput.

Tim
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH] PCI MMCONFIG: add validation against ACPI motherboard resources

2007-05-04 Thread Robert Hancock

Jesse Barnes wrote:
> On Wednesday, May 2, 2007 4:54 pm Jesse Barnes wrote:
>> What happens if you take out the chipset register detection, does
>> the MCFG table give you the same result? Wonder if they're doing
>> something funny with start/end bus values or something in their
>> table. There's some code in my patch that prints out the important
>> data from the MCFG table, can you tell me what that shows with the
>> chipset detection taken out?
>
> Yeah, I'll look a little more closely.  It could also be that another
> register needs tweaking somewhere to actually get the bridge to
> decode the space.
>
>> If that doesn't provide any useful information, I think we may need
>> some assistance from Intel chipset/motherboard people to figure out
>> what is going on here..
>
> I'm talking with them now, hopefully they'll shed some light on it.
>
> I did a little more debugging this morning, and found that I can
> actually do reads from the space described by ACPI and the device
> register, but later when ACPI actually scans the root bridges, it
> hangs.  Specifically the call to pci_acpi_scan_root in
> pci_root.c:acpi_pci_root_add() never seems to return.
>
> I'll walk through that logic when I get back to my test box, but it's
> also worth noting that Vista's MCFG on this machine apparently works ok
> too.


I would try sticking some debug in arch/x86_64/pci/mmconfig.c at the 
beginning and end of pci_mmcfg_read and pci_mmcfg_write to print the 
seg, bus, devfn and reg for each read and write. Hopefully that will 
track down the one that is causing the lockup, if it is an actual 
MMCONFIG access that's doing it..
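
For example, something like this at the top of pci_mmcfg_read() (sketch
only; the parameter names follow the description above, double-check
them against the actual signatures in that tree):

	printk(KERN_DEBUG "mmcfg read %04x:%02x:%02x.%d reg 0x%x\n",
	       seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn), reg);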


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 2/2] Call percpu smp cacheline align interface

2007-05-04 Thread Fenghua Yu

Call percpu smp cacheline align interface.

Signed-off-by: Fenghua Yu <[EMAIL PROTECTED]>
Acked-by: Suresh Siddha <[EMAIL PROTECTED]>

diff -Nurp linux-2.6.21-rc7.0/arch/i386/kernel/init_task.c 
linux-2.6.21-rc7.1/arch/i386/kernel/init_task.c
--- linux-2.6.21-rc7.0/arch/i386/kernel/init_task.c 2007-04-15 
16:50:57.0 -0700
+++ linux-2.6.21-rc7.1/arch/i386/kernel/init_task.c 2007-05-02 
17:00:34.0 -0700
@@ -42,5 +42,5 @@ EXPORT_SYMBOL(init_task);
  * per-CPU TSS segments. Threads are completely 'soft' on Linux,
  * no more per-task TSS's.
  */ 
-DEFINE_PER_CPU(struct tss_struct, init_tss) 
cacheline_internodealigned_in_smp = INIT_TSS;
+DEFINE_PER_CPU_CACHELINE_ALIGNED_IN_SMP(struct tss_struct, init_tss) = 
INIT_TSS;
 
diff -Nurp linux-2.6.21-rc7.0/arch/i386/kernel/irq.c 
linux-2.6.21-rc7.1/arch/i386/kernel/irq.c
--- linux-2.6.21-rc7.0/arch/i386/kernel/irq.c   2007-05-01 07:32:59.0 
-0700
+++ linux-2.6.21-rc7.1/arch/i386/kernel/irq.c   2007-05-02 17:00:34.0 
-0700
@@ -21,7 +21,7 @@
 #include 
 #include 
 
-DEFINE_PER_CPU(irq_cpustat_t, irq_stat) cacheline_internodealigned_in_smp;
+DEFINE_PER_CPU_CACHELINE_ALIGNED_IN_SMP(irq_cpustat_t, irq_stat);
 EXPORT_PER_CPU_SYMBOL(irq_stat);
 
 DEFINE_PER_CPU(struct pt_regs *, irq_regs);
diff -Nurp linux-2.6.21-rc7.0/arch/ia64/kernel/smp.c 
linux-2.6.21-rc7.1/arch/ia64/kernel/smp.c
--- linux-2.6.21-rc7.0/arch/ia64/kernel/smp.c   2007-04-15 16:50:57.0 
-0700
+++ linux-2.6.21-rc7.1/arch/ia64/kernel/smp.c   2007-05-02 17:00:34.0 
-0700
@@ -70,7 +70,7 @@ static volatile struct call_data_struct 
 #define IPI_KDUMP_CPU_STOP 3
 
 /* This needs to be cacheline aligned because it is written to by *other* 
CPUs.  */
-static DEFINE_PER_CPU(u64, ipi_operation) cacheline_aligned;
+static DEFINE_PER_CPU_CACHELINE_ALIGNED_IN_SMP(u64, ipi_operation);
 
 extern void cpu_halt (void);
 
diff -Nurp linux-2.6.21-rc7.0/arch/x86_64/kernel/init_task.c 
linux-2.6.21-rc7.1/arch/x86_64/kernel/init_task.c
--- linux-2.6.21-rc7.0/arch/x86_64/kernel/init_task.c   2007-04-15 
16:50:57.0 -0700
+++ linux-2.6.21-rc7.1/arch/x86_64/kernel/init_task.c   2007-05-02 
17:00:34.0 -0700
@@ -44,7 +44,7 @@ EXPORT_SYMBOL(init_task);
  * section. Since TSS's are completely CPU-local, we want them
  * on exact cacheline boundaries, to eliminate cacheline ping-pong.
  */ 
-DEFINE_PER_CPU(struct tss_struct, init_tss) 
cacheline_internodealigned_in_smp = INIT_TSS;
+DEFINE_PER_CPU_CACHELINE_ALIGNED_IN_SMP(struct tss_struct, init_tss) = 
INIT_TSS;
 
 /* Copies of the original ist values from the tss are only accessed during
  * debugging, no special alignment required.
diff -Nurp linux-2.6.21-rc7.0/kernel/sched.c linux-2.6.21-rc7.1/kernel/sched.c
--- linux-2.6.21-rc7.0/kernel/sched.c   2007-05-01 07:33:07.0 -0700
+++ linux-2.6.21-rc7.1/kernel/sched.c   2007-05-02 17:00:34.0 -0700
@@ -263,7 +263,7 @@ struct rq {
struct lock_class_key rq_lock_key;
 };
 
-static DEFINE_PER_CPU(struct rq, runqueues) cacheline_aligned_in_smp;
+static DEFINE_PER_CPU_CACHELINE_ALIGNED_IN_SMP(struct rq, runqueues);
 static DEFINE_MUTEX(sched_hotcpu_mutex);
 
 static inline int cpu_of(struct rq *rq)
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/2] Define percpu smp cacheline align interface

2007-05-04 Thread Fenghua Yu

Define percpu smp cacheline align interface

Signed-off-by: Fenghua Yu <[EMAIL PROTECTED]>
Acked-by: Suresh Siddha <[EMAIL PROTECTED]>

diff -Nurp linux-2.6.21-rc7.0/arch/alpha/kernel/vmlinux.lds.S 
linux-2.6.21-rc7.1/arch/alpha/kernel/vmlinux.lds.S
--- linux-2.6.21-rc7.0/arch/alpha/kernel/vmlinux.lds.S  2007-05-01 
07:32:58.0 -0700
+++ linux-2.6.21-rc7.1/arch/alpha/kernel/vmlinux.lds.S  2007-05-02 
17:00:29.0 -0700
@@ -69,10 +69,7 @@ SECTIONS
   . = ALIGN(8);
   SECURITY_INIT
 
-  . = ALIGN(8192);
-  __per_cpu_start = .;
-  .data.percpu : { *(.data.percpu) }
-  __per_cpu_end = .;
+  PERCPU(8192)
 
   . = ALIGN(2*8192);
   __init_end = .;
diff -Nurp linux-2.6.21-rc7.0/arch/arm/kernel/vmlinux.lds.S 
linux-2.6.21-rc7.1/arch/arm/kernel/vmlinux.lds.S
--- linux-2.6.21-rc7.0/arch/arm/kernel/vmlinux.lds.S2007-05-01 
07:32:58.0 -0700
+++ linux-2.6.21-rc7.1/arch/arm/kernel/vmlinux.lds.S2007-05-02 
17:00:29.0 -0700
@@ -59,10 +59,7 @@ SECTIONS
usr/built-in.o(.init.ramfs)
__initramfs_end = .;
 #endif
-   . = ALIGN(4096);
-   __per_cpu_start = .;
-   *(.data.percpu)
-   __per_cpu_end = .;
+   PERCPU(4096)
 #ifndef CONFIG_XIP_KERNEL
__init_begin = _stext;
*(.init.data)
diff -Nurp linux-2.6.21-rc7.0/arch/cris/arch-v32/vmlinux.lds.S 
linux-2.6.21-rc7.1/arch/cris/arch-v32/vmlinux.lds.S
--- linux-2.6.21-rc7.0/arch/cris/arch-v32/vmlinux.lds.S 2007-05-01 
07:32:59.0 -0700
+++ linux-2.6.21-rc7.1/arch/cris/arch-v32/vmlinux.lds.S 2007-05-02 
17:00:29.0 -0700
@@ -92,10 +92,7 @@ SECTIONS
}
SECURITY_INIT
 
-   . =  ALIGN (8192);
-   __per_cpu_start = .;
-   .data.percpu  : { *(.data.percpu) }
-   __per_cpu_end = .;
+   PERCPU(8192)
 
 #ifdef CONFIG_BLK_DEV_INITRD
.init.ramfs : {
diff -Nurp linux-2.6.21-rc7.0/arch/frv/kernel/vmlinux.lds.S 
linux-2.6.21-rc7.1/arch/frv/kernel/vmlinux.lds.S
--- linux-2.6.21-rc7.0/arch/frv/kernel/vmlinux.lds.S2007-05-01 
07:32:59.0 -0700
+++ linux-2.6.21-rc7.1/arch/frv/kernel/vmlinux.lds.S2007-05-02 
17:00:29.0 -0700
@@ -57,10 +57,7 @@ SECTIONS
   __alt_instructions_end = .;
  .altinstr_replacement : { *(.altinstr_replacement) }
 
-  . = ALIGN(4096);
-  __per_cpu_start = .;
-  .data.percpu  : { *(.data.percpu) }
-  __per_cpu_end = .;
+  PERCPU(4096)
 
 #ifdef CONFIG_BLK_DEV_INITRD
   . = ALIGN(4096);
diff -Nurp linux-2.6.21-rc7.0/arch/i386/kernel/vmlinux.lds.S 
linux-2.6.21-rc7.1/arch/i386/kernel/vmlinux.lds.S
--- linux-2.6.21-rc7.0/arch/i386/kernel/vmlinux.lds.S   2007-05-01 
07:32:59.0 -0700
+++ linux-2.6.21-rc7.1/arch/i386/kernel/vmlinux.lds.S   2007-05-02 
17:00:29.0 -0700
@@ -185,12 +185,7 @@ SECTIONS
__initramfs_end = .;
   }
 #endif
-  . = ALIGN(4096);
-  .data.percpu  : AT(ADDR(.data.percpu) - LOAD_OFFSET) {
-   __per_cpu_start = .;
-   *(.data.percpu)
-   __per_cpu_end = .;
-  }
+  PERCPU(4096)
   . = ALIGN(4096);
   /* freed after init ends here */

diff -Nurp linux-2.6.21-rc7.0/arch/ia64/kernel/vmlinux.lds.S 
linux-2.6.21-rc7.1/arch/ia64/kernel/vmlinux.lds.S
--- linux-2.6.21-rc7.0/arch/ia64/kernel/vmlinux.lds.S   2007-05-01 
07:32:59.0 -0700
+++ linux-2.6.21-rc7.1/arch/ia64/kernel/vmlinux.lds.S   2007-05-02 
17:00:29.0 -0700
@@ -206,6 +206,7 @@ SECTIONS
{
__per_cpu_start = .;
*(.data.percpu)
+   *(.data.percpu.cacheline_aligned_in_smp)
__per_cpu_end = .;
}
   . = __phys_per_cpu_start + PERCPU_PAGE_SIZE; /* ensure percpu data fits
diff -Nurp linux-2.6.21-rc7.0/arch/m32r/kernel/vmlinux.lds.S 
linux-2.6.21-rc7.1/arch/m32r/kernel/vmlinux.lds.S
--- linux-2.6.21-rc7.0/arch/m32r/kernel/vmlinux.lds.S   2007-05-01 
07:32:59.0 -0700
+++ linux-2.6.21-rc7.1/arch/m32r/kernel/vmlinux.lds.S   2007-05-02 
17:00:29.0 -0700
@@ -111,10 +111,7 @@ SECTIONS
   __initramfs_end = .;
 #endif
 
-  . = ALIGN(4096);
-  __per_cpu_start = .;
-  .data.percpu  : { *(.data.percpu) }
-  __per_cpu_end = .;
+  PERCPU(4096)
   . = ALIGN(4096);
   __init_end = .;
   /* freed after init ends here */
diff -Nurp linux-2.6.21-rc7.0/arch/mips/kernel/vmlinux.lds.S 
linux-2.6.21-rc7.1/arch/mips/kernel/vmlinux.lds.S
--- linux-2.6.21-rc7.0/arch/mips/kernel/vmlinux.lds.S   2007-05-01 
07:32:59.0 -0700
+++ linux-2.6.21-rc7.1/arch/mips/kernel/vmlinux.lds.S   2007-05-02 
17:00:29.0 -0700
@@ -121,10 +121,7 @@ SECTIONS
   .init.ramfs : { *(.init.ramfs) }
   __initramfs_end = .;
 #endif
-  . = ALIGN(_PAGE_SIZE);
-  __per_cpu_start = .;
-  .data.percpu  : { *(.data.percpu) }
-  __per_cpu_end = .;
+  PERCPU(_PAGE_SIZE)
   . = ALIGN(_PAGE_SIZE);
   __init_end = .;
   /* freed after init ends here */
diff -Nurp linux-2.6.21-rc7.0/arch/parisc/kernel/vmlinux.lds.S 
linux-2.6.21-rc7.1/arch/parisc/kernel/vmlinux.lds.S
--- 

[PATCH 0/2] Add percpu smp cacheline align section

2007-05-04 Thread Fenghua Yu


This is a follow-up to Suresh's runqueue align in smp patch at:
http://www.uwsg.iu.edu/hypermail/linux/kernel/0704.1/0340.html

The patches place all of the smp cacheline aligned percpu data into 
.data.percpu.cacheline_aligned_in_smp. Other percpu data is still in the 
.data.percpu section. The patches can reduce cache line false sharing in SMP 
and reduce alignment gap waste. The patches also define a PERCPU macro for 
vmlinux.lds.S as a code cleanup.
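
For reference, the new define presumably looks something like this
sketch (the real definition is in patch 1/2 and must also degrade to a
plain DEFINE_PER_CPU in the !SMP case):

#define DEFINE_PER_CPU_CACHELINE_ALIGNED_IN_SMP(type, name)		\
	__attribute__((__section__(".data.percpu.cacheline_aligned_in_smp"))) \
	__typeof__(type) per_cpu__##name ____cacheline_aligned_in_smp

while PERCPU(align) wraps up the ". = ALIGN(align); __per_cpu_start = .;
.data.percpu : { *(.data.percpu) *(.data.percpu.cacheline_aligned_in_smp) }
__per_cpu_end = .;" linker boilerplate that each arch used to open-code.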

PATCH 1/2: Define percpu smp cacheline align interface
PATCH 2/2: Call percpu smp cacheline align interface

Thanks.

-Fenghua 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] MM: implement MADV_FREE lazy freeing of anonymous memory

2007-05-04 Thread Ulrich Drepper
Nick Piggin wrote:
> I literally have about 4 or 5 new page flags I'd like to add today :) I
> can't of course, because we have very few spare ones left.

I remember Rik saying that if need be he can (try to?) think of a method
to implement it without a page flag.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖





[patch 05/29] xen: Core Xen implementation

2007-05-04 Thread Jeremy Fitzhardinge
This patch is a rollup of all the core pieces of the Xen
implementation, including:
 - booting and setup
 - pagetable setup
 - privileged instructions
 - segmentation
 - multicall batching

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Signed-off-by: Chris Wright <[EMAIL PROTECTED]>
Cc: Ian Pratt <[EMAIL PROTECTED]>
Cc: Christian Limpach <[EMAIL PROTECTED]>
Cc: Adrian Bunk <[EMAIL PROTECTED]>

---
 arch/i386/Makefile   |3 
 arch/i386/kernel/entry.S |   71 +++
 arch/i386/kernel/head.S  |5 
 arch/i386/kernel/vmlinux.lds.S   |1 
 arch/i386/xen/Makefile   |1 
 arch/i386/xen/enlighten.c|  733 ++
 arch/i386/xen/features.c |   29 +
 arch/i386/xen/multicalls.c   |   82 
 arch/i386/xen/multicalls.h   |   26 +
 arch/i386/xen/setup.c|   96 
 arch/i386/xen/xen-head.S |   34 +
 arch/i386/xen/xen-ops.h  |   34 +
 include/asm-i386/irq.h   |1 
 include/asm-i386/xen/hypercall.h |   18 
 include/xen/features.h   |   23 +
 include/xen/page.h   |  178 +
 16 files changed, 1334 insertions(+), 1 deletion(-)

===================================================================
--- a/arch/i386/Makefile
+++ b/arch/i386/Makefile
@@ -93,6 +93,9 @@ mcore-$(CONFIG_X86_ES7000):= mach-defau
 mcore-$(CONFIG_X86_ES7000) := mach-default
 core-$(CONFIG_X86_ES7000)  := arch/i386/mach-es7000/
 
+# Xen paravirtualization support
+core-$(CONFIG_XEN) += arch/i386/xen/
+
 # default subarch .h files
 mflags-y += -Iinclude/asm-i386/mach-default
 
===================================================================
--- a/arch/i386/kernel/entry.S
+++ b/arch/i386/kernel/entry.S
@@ -1023,6 +1023,77 @@ ENTRY(kernel_thread_helper)
CFI_ENDPROC
 ENDPROC(kernel_thread_helper)
 
+#ifdef CONFIG_XEN
+ENTRY(xen_hypervisor_callback)
+   CFI_STARTPROC
+   pushl $0
+   CFI_ADJUST_CFA_OFFSET 4
+   SAVE_ALL
+   TRACE_IRQS_OFF
+   mov %esp, %eax
+   call xen_evtchn_do_upcall
+   jmp  ret_from_intr
+   CFI_ENDPROC
+ENDPROC(xen_hypervisor_callback)
+
+# Hypervisor uses this for application faults while it executes.
+# We get here for two reasons:
+#  1. Fault while reloading DS, ES, FS or GS
+#  2. Fault while executing IRET
+# Category 1 we fix up by reattempting the load, and zeroing the segment
+# register if the load fails.
+# Category 2 we fix up by jumping to do_iret_error. We cannot use the
+# normal Linux return path in this case because if we use the IRET hypercall
+# to pop the stack frame we end up in an infinite loop of failsafe callbacks.
+# We distinguish between categories by maintaining a status value in EAX.
+ENTRY(xen_failsafe_callback)
+   CFI_STARTPROC
+   pushl %eax
+   CFI_ADJUST_CFA_OFFSET 4
+   movl $1,%eax
+1: mov 4(%esp),%ds
+2: mov 8(%esp),%es
+3: mov 12(%esp),%fs
+4: mov 16(%esp),%gs
+   testl %eax,%eax
+   popl %eax
+   CFI_ADJUST_CFA_OFFSET -4
+   lea 16(%esp),%esp
+   CFI_ADJUST_CFA_OFFSET -16
+   jz 5f
+   addl $16,%esp
+   jmp iret_exc# EAX != 0 => Category 2 (Bad IRET)
+5: pushl $0# EAX == 0 => Category 1 (Bad segment)
+   CFI_ADJUST_CFA_OFFSET 4
+   SAVE_ALL
+   jmp ret_from_exception
+   CFI_ENDPROC
+
+.section .fixup,"ax"
+6: xorl %eax,%eax
+   movl %eax,4(%esp)
+   jmp 1b
+7: xorl %eax,%eax
+   movl %eax,8(%esp)
+   jmp 2b
+8: xorl %eax,%eax
+   movl %eax,12(%esp)
+   jmp 3b
+9: xorl %eax,%eax
+   movl %eax,16(%esp)
+   jmp 4b
+.previous
+.section __ex_table,"a"
+   .align 4
+   .long 1b,6b
+   .long 2b,7b
+   .long 3b,8b
+   .long 4b,9b
+.previous
+ENDPROC(xen_failsafe_callback)
+
+#endif /* CONFIG_XEN */
+
 .section .rodata,"a"
 #include "syscall_table.S"
 
===================================================================
--- a/arch/i386/kernel/head.S
+++ b/arch/i386/kernel/head.S
@@ -545,6 +545,8 @@ unhandled_paravirt:
ud2
 #endif
 
+#include "../xen/xen-head.S"
+
 /*
  * Real beginning of normal "text" segment
  */
@@ -554,7 +556,8 @@ ENTRY(_stext)
 /*
  * BSS section
  */
-.section ".bss.page_aligned","w"
+.section ".bss.page_aligned","wa"
+   .align PAGE_SIZE_asm
 ENTRY(swapper_pg_dir)
.fill 1024,4,0
 ENTRY(empty_zero_page)
===================================================================
--- a/arch/i386/kernel/vmlinux.lds.S
+++ b/arch/i386/kernel/vmlinux.lds.S
@@ -94,6 +94,7 @@ SECTIONS
 
   . = ALIGN(4096);
   .data.page_aligned : AT(ADDR(.data.page_aligned) - LOAD_OFFSET) {
+   *(.data.page_aligned)
*(.data.idt)
   }
 
===================================================================
--- /dev/null
+++ b/arch/i386/xen/Makefile
@@ -0,0 +1,1 @@
+obj-y  := enlighten.o setup.o features.o multicalls.o

RE: Regression with SLUB on Netperf and Volanomark

2007-05-04 Thread Christoph Lameter
On Fri, 4 May 2007, Tim Chen wrote:

> On Fri, 2007-05-04 at 11:27 -0700, Christoph Lameter wrote:
> 
> > 
> > Not sure where to go here. Increasing the per cpu slab size may hold off 
> > the issue up to a certain cpu cache size. For that we would need to 
> > identify which slabs create the performance issue.
> > 
> > One easy way to check that this is indeed the case: Enable fake NUMA. You 
> > will then have separate queues for each processor since they are on 
> > different "nodes". Create two fake nodes. Run one thread in each node and 
> > see if this fixes it.
> 
> I tried with fake NUMA (boot with numa=fake=2) and use
> 
> numactl --physcpubind=1 --membind=0 ./netserver
> numactl --physcpubind=2 --membind=1 ./netperf -t TCP_STREAM -l 60 -H
> 127.0.0.1 -i 5,5 -I 99,5 -- -s 57344 -S 57344 -m 4096
> 
> to run the tests.  The results are about the same as the non-NUMA case,
> with slab about 5% better than slub.  

H... both tests were run in the same context? NUMA has additional 
overhead in other areas.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Patch] Fix section mismatch of memory hotplug related code.

2007-05-04 Thread Andrew Morton
On Thu, 05 Apr 2007 17:01:02 +0900
Yasunori Goto <[EMAIL PROTECTED]> wrote:

> Hello.
> 
> This is to fix many section mismatches of code related to memory hotplug.
> I checked compile with memory hotplug on/off on ia64 and x86-64 box.
> 
> ..
>
> ===================================================================
> --- meminit.orig/drivers/acpi/numa.c  2007-04-04 20:15:58.0 +0900
> +++ meminit/drivers/acpi/numa.c   2007-04-04 20:56:34.0 +0900
> @@ -228,7 +228,7 @@ int __init acpi_numa_init(void)
>   return 0;
>  }
>  
> -int acpi_get_pxm(acpi_handle h)
> +int __meminit acpi_get_pxm(acpi_handle h)
>  {
>   unsigned long pxm;
>   acpi_status status;
> @@ -246,7 +246,7 @@ int acpi_get_pxm(acpi_handle h)
>  }
>  EXPORT_SYMBOL(acpi_get_pxm);
>  
> -int acpi_get_node(acpi_handle *handle)
> +int __meminit acpi_get_node(acpi_handle *handle)
>  {
>   int pxm, node = -1;

It doesn't make a lot of sense to export an __init symbol to modules.  I
guess it's OK in this case, but we get warnings:

WARNING: drivers/built-in.o - Section mismatch: reference to 
.init.text:acpi_get_node from __ksymtab between '__ksymtab_acpi_get_node' (at 
offset 0x1040) and '__ksymtab_acpi_get_pxm'
WARNING: drivers/built-in.o - Section mismatch: reference to 
.init.text:acpi_get_pxm from __ksymtab between '__ksymtab_acpi_get_pxm' (at 
offset 0x1050) and '__ksymtab_acpi_unlock_battery_dir'

One fix would be to statically link them again.  Another fix would be to
wrap the exports in #ifdef CONFIG_ACPI_HOTPLUG_MEMORY_MODULE.  Neither is
very attractive.
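
The latter would be something like:

#ifdef CONFIG_ACPI_HOTPLUG_MEMORY_MODULE
EXPORT_SYMBOL(acpi_get_pxm);
EXPORT_SYMBOL(acpi_get_node);
#endif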

Could you have a think about it please?  The config is at
http://userweb.kernel.org/~akpm/config-akpm2.txt

Thanks.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch 13/29] xen: Account for time stolen by Xen

2007-05-04 Thread Jeremy Fitzhardinge
This accounts for the time Xen steals from our VCPUs.  This accounting
gets run on each timer interrupt, just as a way to get it run
relatively often, and when interesting things are going on.

Stolen time is not really used by much in the kernel; it is reported
in /proc/stat, and that's about it.
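
The core of it is a ns -> ticks conversion that carries the sub-tick
remainder over to the next interrupt; condensed (a restatement of
do_stolen_accounting() below, not additional code in the patch):

	stolen = runnable + offline + __get_cpu_var(residual_stolen);
	for (ticks = 0; stolen >= NS_PER_TICK; stolen -= NS_PER_TICK)
		ticks++;
	__get_cpu_var(residual_stolen) = stolen;	/* carry the remainder */
	account_steal_time(NULL, ticks);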

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Acked-by: Chris Wright <[EMAIL PROTECTED]>
Cc: john stultz <[EMAIL PROTECTED]>
Cc: Rik van Riel <[EMAIL PROTECTED]>
---
 arch/i386/xen/time.c |  106 +-
 1 file changed, 105 insertions(+), 1 deletion(-)

===
--- a/arch/i386/xen/time.c
+++ b/arch/i386/xen/time.c
@@ -2,6 +2,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -14,6 +15,7 @@
 
 #define XEN_SHIFT 22
#define TIMER_SLOP 100000	/* Xen may fire a timer up to this many ns early */
#define NS_PER_TICK	(1000000000LL / HZ)
 
/* These are periodically updated in shared_info, and then copied here. */
 struct shadow_time_info {
@@ -26,6 +28,104 @@ struct shadow_time_info {
 
 static DEFINE_PER_CPU(struct shadow_time_info, shadow_time);
 
+/* runstate info updated by Xen */
+static DEFINE_PER_CPU(struct vcpu_runstate_info, runstate);
+
+/* snapshots of runstate info */
+static DEFINE_PER_CPU(struct vcpu_runstate_info, runstate_snapshot);
+
+/* unused ns of stolen and blocked time */
+static DEFINE_PER_CPU(u64, residual_stolen);
+static DEFINE_PER_CPU(u64, residual_blocked);
+
+/*
+ * Runstate accounting
+ */
+static void get_runstate_snapshot(struct vcpu_runstate_info *res)
+{
+   u64 state_time;
+   struct vcpu_runstate_info *state;
+
+   preempt_disable();
+
+   state = &__get_cpu_var(runstate);
+
+   /*
+* The runstate info is always updated by the hypervisor on
+* the current CPU, so there's no need to use anything
+* stronger than a compiler barrier when fetching it.
+*/
+   do {
+   state_time = state->state_entry_time;
+   barrier();
+   *res = *state;
+   barrier();
+   } while(state->state_entry_time != state_time);
+
+   preempt_enable();
+}
+
+static void setup_runstate_info(void)
+{
+   struct vcpu_register_runstate_memory_area area;
+
+   area.addr.v = &__get_cpu_var(runstate);
+
+   if (HYPERVISOR_vcpu_op(VCPUOP_register_runstate_memory_area,
+  smp_processor_id(), &area))
+   BUG();
+
+   get_runstate_snapshot(&__get_cpu_var(runstate_snapshot));
+}
+
+static void do_stolen_accounting(void)
+{
+   struct vcpu_runstate_info state;
+   struct vcpu_runstate_info *snap;
+   u64 blocked, runnable, offline, stolen;
+   cputime_t ticks;
+
+   get_runstate_snapshot(&state);
+
+   WARN_ON(state.state != RUNSTATE_running);
+
+   snap = &__get_cpu_var(runstate_snapshot);
+
+   /* work out how much time the VCPU has not been runn*ing*  */
+   blocked = state.time[RUNSTATE_blocked] - snap->time[RUNSTATE_blocked];
+   runnable = state.time[RUNSTATE_runnable] - 
snap->time[RUNSTATE_runnable];
+   offline = state.time[RUNSTATE_offline] - snap->time[RUNSTATE_offline];
+
+   *snap = state;
+
+   /* Add the appropriate number of ticks of stolen time,
+  including any left-overs from last time.  Passing NULL to
+  account_steal_time accounts the time as stolen. */
+   stolen = runnable + offline + __get_cpu_var(residual_stolen);
+   ticks = 0;
+   while(stolen >= NS_PER_TICK) {
+   ticks++;
+   stolen -= NS_PER_TICK;
+   }
+   __get_cpu_var(residual_stolen) = stolen;
+   account_steal_time(NULL, ticks);
+
+   /* Add the appropriate number of ticks of blocked time,
+  including any left-overs from last time.  Passing idle to
+  account_steal_time accounts the time as idle/wait. */
+   blocked += __get_cpu_var(residual_blocked);
+   ticks = 0;
+   while(blocked >= NS_PER_TICK) {
+   ticks++;
+   blocked -= NS_PER_TICK;
+   }
+   __get_cpu_var(residual_blocked) = blocked;
+   account_steal_time(idle_task(smp_processor_id()), ticks);
+}
+
+
+
+/* Get the CPU speed from Xen */
 unsigned long xen_cpu_khz(void)
 {
u64 cpu_khz = 1000000ULL << 32;
@@ -338,6 +438,8 @@ static irqreturn_t xen_timer_interrupt(i
ret = IRQ_HANDLED;
}
 
+   do_stolen_accounting();
+
return ret;
 }
 
@@ -364,6 +466,8 @@ static void xen_setup_timer(int cpu)
evt->irq = irq;
clockevents_register_device(evt);
 
+   setup_runstate_info();
+
put_cpu_var(xen_clock_events);
 }
 
@@ -376,7 +480,7 @@ __init void xen_time_init(void)
clocksource_register(&xen_clocksource);
 
if (HYPERVISOR_vcpu_op(VCPUOP_stop_periodic_timer, cpu, NULL) == 0) {
-   /* Successfully turned off 100hz tick, 

[patch 12/29] xen: fix multicall batching

2007-05-04 Thread Jeremy Fitzhardinge
Disable interrupts between allocating a multicall entry and actually
issuing it, to prevent an interrupt from coming in, allocating and
initializing further multicall entries, and then issuing them all,
including the partially completed one.
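
The resulting pattern at a typical call site (taken from the
xen_load_esp0() hunk below; the assumption, per the description above,
is that the irq disable/enable lives inside the helpers):

	struct multicall_space mcs = xen_mc_entry(0);	/* irqs off */
	MULTI_stack_switch(mcs.mc, __KERNEL_DS, thread->esp0);
	xen_mc_issue(PARAVIRT_LAZY_CPU);		/* issue/defer, irqs on */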

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Acked-by: Chris Wright <[EMAIL PROTECTED]>

---
 arch/i386/xen/enlighten.c  |   44 +++-
 arch/i386/xen/mmu.c|   18 --
 arch/i386/xen/multicalls.c |9 -
 arch/i386/xen/multicalls.h |   27 +++
 arch/i386/xen/xen-ops.h|5 +
 5 files changed, 71 insertions(+), 32 deletions(-)

===================================================================
--- a/arch/i386/xen/enlighten.c
+++ b/arch/i386/xen/enlighten.c
@@ -187,13 +187,25 @@ static void xen_halt(void)
 
 static void xen_set_lazy_mode(enum paravirt_lazy_mode mode)
 {
-   enum paravirt_lazy_mode *lazy = &get_cpu_var(xen_lazy_mode);
+   switch(mode) {
+   case PARAVIRT_LAZY_NONE:
+   BUG_ON(x86_read_percpu(xen_lazy_mode) == PARAVIRT_LAZY_NONE);
+   break;
+
+   case PARAVIRT_LAZY_MMU:
+   case PARAVIRT_LAZY_CPU:
+   BUG_ON(x86_read_percpu(xen_lazy_mode) != PARAVIRT_LAZY_NONE);
+   break;
+
+   case PARAVIRT_LAZY_FLUSH:
+   /* flush if necessary, but don't change state */
+   if (x86_read_percpu(xen_lazy_mode) != PARAVIRT_LAZY_NONE)
+   xen_mc_flush();
+   return;
+   }
 
xen_mc_flush();
-
-   *lazy = mode;
-
-   put_cpu_var(xen_lazy_mode);
+   x86_write_percpu(xen_lazy_mode, mode);
 }
 
 static unsigned long xen_store_tr(void)
@@ -220,7 +232,7 @@ static void xen_set_ldt(const void *addr
 
MULTI_mmuext_op(mcs.mc, op, 1, NULL, DOMID_SELF);
 
-   xen_mc_issue();
+   xen_mc_issue(PARAVIRT_LAZY_CPU);
 }
 
 static void xen_load_gdt(const struct Xgt_desc_struct *dtr)
@@ -248,7 +260,7 @@ static void xen_load_gdt(const struct Xg
 
MULTI_set_gdt(mcs.mc, frames, size / sizeof(struct desc_struct));
 
-   xen_mc_issue();
+   xen_mc_issue(PARAVIRT_LAZY_CPU);
 }
 
 static void load_TLS_descriptor(struct thread_struct *t,
@@ -256,18 +268,20 @@ static void load_TLS_descriptor(struct t
 {
struct desc_struct *gdt = get_cpu_gdt_table(cpu);
xmaddr_t maddr = virt_to_machine(&gdt[GDT_ENTRY_TLS_MIN+i]);
-   struct multicall_space mc = xen_mc_entry(0);
+   struct multicall_space mc = __xen_mc_entry(0);
 
MULTI_update_descriptor(mc.mc, maddr.maddr, t->tls_array[i]);
 }
 
 static void xen_load_tls(struct thread_struct *t, unsigned int cpu)
 {
+   xen_mc_batch();
+
load_TLS_descriptor(t, cpu, 0);
load_TLS_descriptor(t, cpu, 1);
load_TLS_descriptor(t, cpu, 2);
 
-   xen_mc_issue();
+   xen_mc_issue(PARAVIRT_LAZY_CPU);
 }
 
 static void xen_write_ldt_entry(struct desc_struct *dt, int entrynum, u32 low, 
u32 high)
@@ -387,13 +401,9 @@ static void xen_load_esp0(struct tss_str
 static void xen_load_esp0(struct tss_struct *tss,
   struct thread_struct *thread)
 {
-   if (xen_get_lazy_mode() != PARAVIRT_LAZY_CPU) {
-   if (HYPERVISOR_stack_switch(__KERNEL_DS, thread->esp0))
-   BUG();
-   } else {
-   struct multicall_space mcs = xen_mc_entry(0);
-   MULTI_stack_switch(mcs.mc, __KERNEL_DS, thread->esp0);
-   }
+   struct multicall_space mcs = xen_mc_entry(0);
+   MULTI_stack_switch(mcs.mc, __KERNEL_DS, thread->esp0);
+   xen_mc_issue(PARAVIRT_LAZY_CPU);
 }
 
 static void xen_set_iopl_mask(unsigned mask)
@@ -485,7 +495,7 @@ static void xen_write_cr3(unsigned long 
 
MULTI_mmuext_op(mcs.mc, op, 1, NULL, DOMID_SELF);
 
-   xen_mc_issue();
+   xen_mc_issue(PARAVIRT_LAZY_CPU);
}
 }
 
===================================================================
--- a/arch/i386/xen/mmu.c
+++ b/arch/i386/xen/mmu.c
@@ -366,7 +366,7 @@ static int pin_page(struct page *page, u
else {
void *pt = lowmem_page_address(page);
unsigned long pfn = page_to_pfn(page);
-   struct multicall_space mcs = xen_mc_entry(0);
+   struct multicall_space mcs = __xen_mc_entry(0);
 
flush = 0;
 
@@ -386,10 +386,12 @@ void xen_pgd_pin(pgd_t *pgd)
struct multicall_space mcs;
struct mmuext_op *op;
 
+   xen_mc_batch();
+
if (pgd_walk(pgd, pin_page, TASK_SIZE))
kmap_flush_unused();
 
-   mcs = xen_mc_entry(sizeof(*op));
+   mcs = __xen_mc_entry(sizeof(*op));
op = mcs.args;
 
 #ifdef CONFIG_X86_PAE
@@ -400,7 +402,7 @@ void xen_pgd_pin(pgd_t *pgd)
op->arg1.mfn = pfn_to_mfn(PFN_DOWN(__pa(pgd)));
MULTI_mmuext_op(mcs.mc, op, 1, NULL, DOMID_SELF);
 
-   xen_mc_flush();
+   

[patch 01/29] xen: Add apply_to_page_range() which applies a function to a pte range.

2007-05-04 Thread Jeremy Fitzhardinge
Add a new mm function apply_to_page_range() which applies a given
function to every pte in a given virtual address range in a given mm
structure. This is a generic alternative to cut-and-pasting the Linux
idiomatic pagetable walking code in every place that a sequence of
PTEs must be accessed.

Although this interface is intended to be useful in a wide range of
situations, it is currently used specifically by several Xen
subsystems, for example: to ensure that pagetables have been allocated
for a virtual address range, and to construct batched special
pagetable update requests to map I/O memory (in ioremap()).
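
A trivial, purely illustrative user (not part of this patch) might look
like:

	static int count_present(pte_t *pte, struct page *pmd_page,
				 unsigned long addr, void *data)
	{
		unsigned long *count = data;

		if (pte_present(*pte))
			(*count)++;
		return 0;		/* non-zero aborts the walk */
	}

	...
	unsigned long count = 0;
	apply_to_page_range(mm, start, npages * PAGE_SIZE,
			    count_present, &count);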

Signed-off-by: Ian Pratt <[EMAIL PROTECTED]>
Signed-off-by: Christian Limpach <[EMAIL PROTECTED]>
Signed-off-by: Chris Wright <[EMAIL PROTECTED]>
Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Cc: Christoph Lameter <[EMAIL PROTECTED]>
Cc: Matt Mackall <[EMAIL PROTECTED]>
Acked-by: Ingo Molnar <[EMAIL PROTECTED]> 

---
 include/linux/mm.h |5 ++
 mm/memory.c|   94 
 2 files changed, 99 insertions(+)

===================================================================
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1135,6 +1135,11 @@ struct page *follow_page(struct vm_area_
 #define FOLL_GET   0x04/* do get_page on page */
 #define FOLL_ANON  0x08/* give ZERO_PAGE if no pgtable */
 
+typedef int (*pte_fn_t)(pte_t *pte, struct page *pmd_page, unsigned long addr,
+   void *data);
+extern int apply_to_page_range(struct mm_struct *mm, unsigned long address,
+  unsigned long size, pte_fn_t fn, void *data);
+
 #ifdef CONFIG_PROC_FS
 void vm_stat_account(struct mm_struct *, unsigned long, struct file *, long);
 #else
===================================================================
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1448,6 +1448,100 @@ int remap_pfn_range(struct vm_area_struc
 }
 EXPORT_SYMBOL(remap_pfn_range);
 
+static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
+unsigned long addr, unsigned long end,
+pte_fn_t fn, void *data)
+{
+   pte_t *pte;
+   int err;
+   struct page *pmd_page;
+   spinlock_t *ptl;
+
+   pte = (mm == &init_mm) ?
+   pte_alloc_kernel(pmd, addr) :
+   pte_alloc_map_lock(mm, pmd, addr, &ptl);
+   if (!pte)
+   return -ENOMEM;
+
+   BUG_ON(pmd_huge(*pmd));
+
+   pmd_page = pmd_page(*pmd);
+
+   do {
+   err = fn(pte, pmd_page, addr, data);
+   if (err)
+   break;
+   } while (pte++, addr += PAGE_SIZE, addr != end);
+
+   if (mm != &init_mm)
+   pte_unmap_unlock(pte-1, ptl);
+   return err;
+}
+
+static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
+unsigned long addr, unsigned long end,
+pte_fn_t fn, void *data)
+{
+   pmd_t *pmd;
+   unsigned long next;
+   int err;
+
+   pmd = pmd_alloc(mm, pud, addr);
+   if (!pmd)
+   return -ENOMEM;
+   do {
+   next = pmd_addr_end(addr, end);
+   err = apply_to_pte_range(mm, pmd, addr, next, fn, data);
+   if (err)
+   break;
+   } while (pmd++, addr = next, addr != end);
+   return err;
+}
+
+static int apply_to_pud_range(struct mm_struct *mm, pgd_t *pgd,
+unsigned long addr, unsigned long end,
+pte_fn_t fn, void *data)
+{
+   pud_t *pud;
+   unsigned long next;
+   int err;
+
+   pud = pud_alloc(mm, pgd, addr);
+   if (!pud)
+   return -ENOMEM;
+   do {
+   next = pud_addr_end(addr, end);
+   err = apply_to_pmd_range(mm, pud, addr, next, fn, data);
+   if (err)
+   break;
+   } while (pud++, addr = next, addr != end);
+   return err;
+}
+
+/*
+ * Scan a region of virtual memory, filling in page tables as necessary
+ * and calling a provided function on each leaf page table.
+ */
+int apply_to_page_range(struct mm_struct *mm, unsigned long addr,
+   unsigned long size, pte_fn_t fn, void *data)
+{
+   pgd_t *pgd;
+   unsigned long next;
+   unsigned long end = addr + size;
+   int err;
+
+   BUG_ON(addr >= end);
+   pgd = pgd_offset(mm, addr);
+   do {
+   next = pgd_addr_end(addr, end);
+   err = apply_to_pud_range(mm, pgd, addr, next, fn, data);
+   if (err)
+   break;
+   } while (pgd++, addr = next, addr != end);
+   return err;
+}
+EXPORT_SYMBOL_GPL(apply_to_page_range);
+
 /*
  * handle_pte_fault chooses page fault handler according to an entry
  * which was read non-atomically.  Before making any commitment, on

-- 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[patch 18/29] xen: deal with negative stolen time

2007-05-04 Thread Jeremy Fitzhardinge
Stolen time should never be negative; if it ever is, it probably
indicates some other bug.  However, if it does happen, then it's better
to just clamp it at zero, rather than trying to account for it as a
huge positive number.

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Acked-by: Chris Wright <[EMAIL PROTECTED]>

---
 arch/i386/xen/smp.c |4 +
 arch/i386/xen/time.c|  112 ---
 arch/i386/xen/xen-ops.h |3 -
 3 files changed, 83 insertions(+), 36 deletions(-)

===================================================================
--- a/arch/i386/xen/smp.c
+++ b/arch/i386/xen/smp.c
@@ -72,10 +72,11 @@ static __cpuinit void cpu_bringup_and_id
int cpu = smp_processor_id();
 
cpu_init();
-   xen_setup_timer();
 
preempt_disable();
per_cpu(cpu_state, cpu) = CPU_ONLINE;
+
+   xen_setup_cpu_clockevents();
 
/* We can take interrupts now: we're officially "up". */
local_irq_enable();
@@ -263,6 +264,7 @@ int __cpuinit xen_cpu_up(unsigned int cp
per_cpu(current_task, cpu) = idle;
xen_vcpu_setup(cpu);
irq_ctx_init(cpu);
+   xen_setup_timer(cpu);
 
/* make sure interrupts start blocked */
per_cpu(xen_vcpu, cpu)->evtchn_upcall_mask = 1;
===================================================================
--- a/arch/i386/xen/time.c
+++ b/arch/i386/xen/time.c
@@ -49,6 +49,35 @@ static DEFINE_PER_CPU(u64, residual_stol
 static DEFINE_PER_CPU(u64, residual_stolen);
 static DEFINE_PER_CPU(u64, residual_blocked);
 
+/* return a consistent snapshot of a 64-bit time/counter value */
+static u64 get64(const u64 *p)
+{
+   u64 ret;
+
+   if (BITS_PER_LONG < 64) {
+   u32 *p32 = (u32 *)p;
+   u32 h, l;
+
+   /*
+* Read high then low, and then make sure high is
+* still the same; this will only loop if low wraps
+* and carries into high.
+* XXX some clean way to make this endian-proof?
+*/
+   do {
+   h = p32[1];
+   barrier();
+   l = p32[0];
+   barrier();
+   } while (p32[1] != h);
+
+   ret = (((u64)h) << 32) | l;
+   } else
+   ret = *p;
+
+   return ret;
+}
+
 /*
  * Runstate accounting
  */
@@ -67,31 +96,29 @@ static void get_runstate_snapshot(struct
 * stronger than a compiler barrier when fetching it.
 */
do {
-   state_time = state->state_entry_time;
+   state_time = get64(&state->state_entry_time);
barrier();
*res = *state;
barrier();
-   } while(state->state_entry_time != state_time);
-}
-
-static void setup_runstate_info(void)
+   } while(get64(&state->state_entry_time) != state_time);
+}
+
+static void setup_runstate_info(int cpu)
 {
struct vcpu_register_runstate_memory_area area;
 
-   area.addr.v = &__get_cpu_var(runstate);
+   area.addr.v = &per_cpu(runstate, cpu);
 
if (HYPERVISOR_vcpu_op(VCPUOP_register_runstate_memory_area,
-  smp_processor_id(), ))
+  cpu, &area))
BUG();
-
-   get_runstate_snapshot(&__get_cpu_var(runstate_snapshot));
 }
 
 static void do_stolen_accounting(void)
 {
struct vcpu_runstate_info state;
struct vcpu_runstate_info *snap;
-   u64 blocked, runnable, offline, stolen;
+   s64 blocked, runnable, offline, stolen;
cputime_t ticks;
 
get_runstate_snapshot(&state);
@@ -111,6 +138,10 @@ static void do_stolen_accounting(void)
   including any left-overs from last time.  Passing NULL to
   account_steal_time accounts the time as stolen. */
stolen = runnable + offline + __get_cpu_var(residual_stolen);
+
+   if (stolen < 0)
+   stolen = 0;
+
ticks = 0;
while(stolen >= NS_PER_TICK) {
ticks++;
@@ -123,6 +154,10 @@ static void do_stolen_accounting(void)
   including any left-overs from last time.  Passing idle to
   account_steal_time accounts the time as idle/wait. */
blocked += __get_cpu_var(residual_blocked);
+
+   if (blocked < 0)
+   blocked = 0;
+
ticks = 0;
while(blocked >= NS_PER_TICK) {
ticks++;
@@ -141,7 +176,8 @@ unsigned long long xen_sched_clock(void)
 {
struct vcpu_runstate_info state;
cycle_t now;
-   unsigned long long ret;
+   u64 ret;
+   s64 offset;
 
/*
 * Ideally sched_clock should be called on a per-cpu basis
@@ -156,9 +192,13 @@ unsigned long long xen_sched_clock(void)
 
WARN_ON(state.state != RUNSTATE_running);
 
+   offset = now - state.state_entry_time;
+   if (offset < 0)
+   offset = 0;
+
ret = state.time[RUNSTATE_blocked] +
 

[patch 23/29] xen: Add Xen virtual block device driver.

2007-05-04 Thread Jeremy Fitzhardinge
The block device frontend driver allows the kernel to access block
devices exported by a virtual machine containing a physical
block device driver.

Signed-off-by: Ian Pratt <[EMAIL PROTECTED]>
Signed-off-by: Christian Limpach <[EMAIL PROTECTED]>
Signed-off-by: Chris Wright <[EMAIL PROTECTED]>
Cc: Arjan van de Ven <[EMAIL PROTECTED]>
Cc: Greg KH <[EMAIL PROTECTED]>
Cc: Jens Axboe <[EMAIL PROTECTED]>
---
 drivers/block/Kconfig|8 
 drivers/block/Makefile   |1 
 drivers/block/xen-blkfront.c |  985 ++
 include/linux/major.h|2 
 include/xen/interface/io/blkif.h |6 
 5 files changed, 998 insertions(+), 4 deletions(-)

===
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -453,6 +453,14 @@ config ATA_OVER_ETH
This driver provides Support for ATA over Ethernet block
devices like the Coraid EtherDrive (R) Storage Blade.
 
+config XEN_BLKDEV_FRONTEND
+   tristate "Xen virtual block device support"
+   depends on XEN
+   help
+ This driver implements the front-end of the Xen virtual
+ block device driver.  It communicates with a back-end driver
+ in another domain which drives the actual block device.
+
 endmenu
 
 endif
===
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -30,3 +30,4 @@ obj-$(CONFIG_BLK_DEV_UB)  += ub.o
 obj-$(CONFIG_BLK_DEV_UB)   += ub.o
 obj-$(CONFIG_LGUEST_GUEST) += lguest_blk.o
 
+obj-$(CONFIG_XEN_BLKDEV_FRONTEND)  := xen-blkfront.o
===
--- /dev/null
+++ b/drivers/block/xen-blkfront.c
@@ -0,0 +1,985 @@
+/*
+ * blkfront.c
+ *
+ * XenLinux virtual block device driver.
+ *
+ * Copyright (c) 2003-2004, Keir Fraser & Steve Hand
+ * Modifications by Mark A. Williamson are (c) Intel Research Cambridge
+ * Copyright (c) 2004, Christian Limpach
+ * Copyright (c) 2004, Andrew Warfield
+ * Copyright (c) 2005, Christopher Clark
+ * Copyright (c) 2005, XenSource Ltd
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation; or, when distributed
+ * separately from the Linux kernel or incorporated into other
+ * software packages, subject to the following license:
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this source file (the "Software"), to deal in the Software without
+ * restriction, including without limitation the rights to use, copy, modify,
+ * merge, publish, distribute, sublicense, and/or sell copies of the Software,
+ * and to permit persons to whom the Software is furnished to do so, subject to
+ * the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ */
+
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+
+#include 
+
+enum blkif_state {
+   BLKIF_STATE_DISCONNECTED,
+   BLKIF_STATE_CONNECTED,
+   BLKIF_STATE_SUSPENDED,
+};
+
+struct blk_shadow {
+   struct blkif_request req;
+   unsigned long request;
+   unsigned long frame[BLKIF_MAX_SEGMENTS_PER_REQUEST];
+};
+
+static struct block_device_operations xlvbd_block_fops;
+
+#define BLK_RING_SIZE __RING_SIZE((struct blkif_sring *)0, PAGE_SIZE)
+
+/*
+ * We have one of these per vbd, whether ide, scsi or 'other'.  They
+ * hang in private_data off the gendisk structure. We may end up
+ * putting all kinds of interesting stuff here :-)
+ */
+struct blkfront_info
+{
+   struct xenbus_device *xbdev;
+   dev_t dev;
+   struct gendisk *gd;
+   int vdevice;
+   blkif_vdev_t handle;
+   enum blkif_state connected;
+   int ring_ref;
+   struct blkif_front_ring ring;
+   unsigned int evtchn, irq;
+   struct request_queue *rq;
+   struct work_struct work;
+   struct gnttab_free_callback callback;
+   struct blk_shadow shadow[BLK_RING_SIZE];
+   unsigned long shadow_free;
+   int feature_barrier;
+
+   /**
+* The number of people holding this device open.  We won't allow a
+* hot-unplug unless this is 0.
+*/
+   int users;
+};
+
+static 

[patch 09/29] xen: xen configuration

2007-05-04 Thread Jeremy Fitzhardinge
Put config options for Xen after the core pieces are in place.

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Signed-off-by: Chris Wright <[EMAIL PROTECTED]>

---
 arch/i386/Kconfig |2 ++
 arch/i386/xen/Kconfig |   11 +++
 2 files changed, 13 insertions(+)

===
--- a/arch/i386/Kconfig
+++ b/arch/i386/Kconfig
@@ -217,6 +217,8 @@ config PARAVIRT
  under a hypervisor, improving performance significantly.
  However, when run without a hypervisor the kernel is
  theoretically slower.  If in doubt, say N.
+
+source "arch/i386/xen/Kconfig"
 
 config VMI
bool "VMI Paravirt-ops support"
===
--- /dev/null
+++ b/arch/i386/xen/Kconfig
@@ -0,0 +1,11 @@
+#
+# This Kconfig describes xen options
+#
+
+config XEN
+   bool "Enable support for Xen hypervisor"
+   depends on PARAVIRT && !PREEMPT && !SMP
+   help
+ This is the Linux Xen port.  Enabling this will allow the
+ kernel to boot in a paravirtualized environment under the
+ Xen hypervisor.

-- 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch 10/29] xen: Complete pagetable pinning for Xen

2007-05-04 Thread Jeremy Fitzhardinge
Xen has a notion of pinned pagetables, which are pagetables that
remain read-only to the guest and are validated by the hypervisor.
This makes context switches much cheaper, because the hypervisor
doesn't need to revalidate the pagetable each time.

This patch adds a PG_pinned flag for pagetable pages so we can tell
whether a page has been pinned or not.  This allows various pagetable update
optimisations.

This also adds a mm parameter to the alloc_pt pv_op, so that Xen can
see if we're adding a page to a pinned pagetable.  This is not
necessary for alloc_pd or release_p[dt], which is fortunate because it
isn't available at all callsites.

This also adds a new paravirt hook which is called during setup once
the zones and memory allocator have been initialized.  When the
init_mm pagetable is first built, the struct page array does not yet
exist, and so there's nowhere to put the init_mm pagetable's PG_pinned
flags.  Once the zones are initialized and the struct page array
exists, we can set the PG_pinned flags for those pages.

This patch also adds the Xen support for pte pages allocated out of
highmem (highpte), principally by implementing xen_kmap_atomic_pte.
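
As a rough illustration of how the flag gets used, here is a minimal
sketch (not code from the patch; PagePinned() is the accessor the patch
adds to page-flags.h, while pin_pte_page() is a hypothetical stand-in
for the patch's make-readonly-and-pin sequence):

	static void alloc_pt_sketch(struct mm_struct *mm, u32 pfn)
	{
		/* the mm argument is exactly what the new alloc_pt
		   pv_op hook provides */
		struct page *pgd_page = virt_to_page(mm->pgd);

		if (PagePinned(pgd_page))
			/* pagetable is pinned: the new pte page must be
			   mapped read-only before Xen will accept it */
			pin_pte_page(pfn);
		/* otherwise leave it read-write; the whole pagetable
		   gets pinned later, at activate_mm time */
	}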

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Signed-off-by: Chris Wright <[EMAIL PROTECTED]>
Cc: Zach Amsden <[EMAIL PROTECTED]>

---
 arch/i386/kernel/setup.c|3 
 arch/i386/kernel/vmi.c  |2 
 arch/i386/mm/init.c |2 
 arch/i386/mm/pageattr.c |2 
 arch/i386/xen/enlighten.c   |  105 +++
 arch/i386/xen/mmu.c |  283 +++
 arch/i386/xen/mmu.h |2 
 arch/i386/xen/xen-ops.h |2 
 include/asm-i386/paravirt.h |   16 +-
 include/asm-i386/pgalloc.h  |6 
 include/asm-i386/setup.h|4 
 include/linux/page-flags.h  |5 
 12 files changed, 289 insertions(+), 143 deletions(-)

===
--- a/arch/i386/kernel/setup.c
+++ b/arch/i386/kernel/setup.c
@@ -607,9 +607,12 @@ void __init setup_arch(char **cmdline_p)
sparse_init();
zone_sizes_init();
 
+
/*
 * NOTE: at this point the bootmem allocator is fully available.
 */
+
+   paravirt_post_allocator_init();
 
dmi_scan_machine();
 
===
--- a/arch/i386/kernel/vmi.c
+++ b/arch/i386/kernel/vmi.c
@@ -361,7 +361,7 @@ static void *vmi_kmap_atomic_pte(struct 
 }
 #endif
 
-static void vmi_allocate_pt(u32 pfn)
+static void vmi_allocate_pt(struct mm_struct *mm, u32 pfn)
 {
vmi_set_page_type(pfn, VMI_PAGE_L1);
vmi_ops.allocate_page(pfn, VMI_PAGE_L1, 0, 0, 0);
===
--- a/arch/i386/mm/init.c
+++ b/arch/i386/mm/init.c
@@ -87,7 +87,7 @@ static pte_t * __init one_page_table_ini
if (!(pmd_val(*pmd) & _PAGE_PRESENT)) {
pte_t *page_table = (pte_t *) alloc_bootmem_low_pages(PAGE_SIZE);
 
-   paravirt_alloc_pt(__pa(page_table) >> PAGE_SHIFT);
+   paravirt_alloc_pt(&init_mm, __pa(page_table) >> PAGE_SHIFT);
set_pmd(pmd, __pmd(__pa(page_table) | _PAGE_TABLE));
BUG_ON(page_table != pte_offset_kernel(pmd, 0));
}
===
--- a/arch/i386/mm/pageattr.c
+++ b/arch/i386/mm/pageattr.c
@@ -60,7 +60,7 @@ static struct page *split_large_page(uns
address = __pa(address);
addr = address & LARGE_PAGE_MASK; 
pbase = (pte_t *)page_address(base);
-   paravirt_alloc_pt(page_to_pfn(base));
+   paravirt_alloc_pt(&init_mm, page_to_pfn(base));
for (i = 0; i < PTRS_PER_PTE; i++, addr += PAGE_SIZE) {
set_pte([i], pfn_pte(addr >> PAGE_SHIFT,
   addr == address ? prot : ref_prot));
===
--- a/arch/i386/xen/enlighten.c
+++ b/arch/i386/xen/enlighten.c
@@ -21,6 +21,9 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
 
 #include 
 #include 
@@ -486,32 +489,59 @@ static void xen_write_cr3(unsigned long 
}
 }
 
-static void xen_alloc_pt(u32 pfn)
-{
-   /* XXX pfn isn't necessarily a lowmem page */
+/* Early in boot, while setting up the initial pagetable, assume
+   everything is pinned. */
+static void xen_alloc_pt_init(struct mm_struct *mm, u32 pfn)
+{
+   BUG_ON(mem_map);/* should only be used early */
make_lowmem_page_readonly(__va(PFN_PHYS(pfn)));
 }
 
-static void xen_alloc_pd(u32 pfn)
-{
-   make_lowmem_page_readonly(__va(PFN_PHYS(pfn)));
-}
-
-static void xen_release_pd(u32 pfn)
-{
-   make_lowmem_page_readwrite(__va(PFN_PHYS(pfn)));
-}
-
+/* This needs to make sure the new pte page is pinned iff its being
+   attached to a pinned pagetable. */
+static void xen_alloc_pt(struct mm_struct *mm, u32 pfn)
+{
+   struct page *page = 

[Fwd: [PATCH -mm] working 3D/DRI intel-agp.ko resume for i815 chip; Intel chipset testers wanted! (was: Re: intel-agp PM experiences ...)]

2007-05-04 Thread Sergio Monteiro Basto
Hi, I am forwarding this message to the dri-devel mailing list, where you
could find more testers of the i815 DRI drivers.
I hope I haven't created a mail loop :)

 Forwarded Message 
From: Andreas Mohr <[EMAIL PROTECTED]>
To: Pavel Machek <[EMAIL PROTECTED]>
Cc: Andrew Morton <[EMAIL PROTECTED]>, [EMAIL PROTECTED],
[EMAIL PROTECTED], Matthew Garrett <[EMAIL PROTECTED]>,
kernel list , [EMAIL PROTECTED]
Subject: [PATCH -mm] working 3D/DRI intel-agp.ko resume for i815 chip;
Intel chipset testers wanted! (was: Re: intel-agp PM experiences ...)
Date:   Tue, 1 May 2007 16:59:47 +0200

Hi,

On Thu, Jan 18, 2007 at 11:16:51PM +, Pavel Machek wrote:
> Hi!
> 
> > > > Especially the PCI video_state trick finally got me a working resume on
> > > > 2.6.19-ck2 r128 Rage Mobility M4 AGP *WITH*(!) fully enabled and working
> > > > (and keeping working!) DRI (3D).
> > > 
> > > Can we get whitelist entry for suspend.sf.net? s2ram from there can do
> > > all the tricks you described, one letter per trick :-). We even got
> > > PCI saving lately.
> > 
> > Whitelist? Let me blacklist it all the way to Timbuktu instead!
> 
> > I've been doing more testing, and X never managed to come back to working
> > state without some of my couple intel-agp changes:
> > - a proper suspend method, doing a proper pci_save_state()
> >   or improved equivalent
> > - a missing resume check for my i815 chipset
> > - global cache flush in _configure
> > - restoring AGP bridge PCI config space
> 
> Can you post a patch?

Took way longer than I'd have wanted it to (nice daughter and stuff ;),
but here it is.

- add .suspend handler and pci_set_power_state() calls
- add i815-specific function agp_i815_remember_state() to remember important
  i815 register values
- add generic DEBUG_AGP_PM framework which will allow people to resume properly
  and help identify which registers need attention
- add obvious and detailed log message to make people sit up and take notice
  about long-standing AGP resume issues
- spelling fixes

Patch against 2.6.21-rc7-mm2; my Inspiron 8000 (i815 with Radeon AGP card,
internal Intel VGA unit NOT active) resumes fine with either the
i815-specific register saving or the generic DEBUG_AGP_PM mechanism enabled.
(of course my notebook needs vbetool post and manual saving of video card
PCI space, too, but even when doing all this I still had X.org lockups before
whenever DRI/3D was enabled)

After resume I'm now still able to run both glxgears and glxinfo without
anomalies.

Right now all I care about is that this gets into mainline relatively soon,
since I'm rather certain that many other machines suffer from similar
AGP resume lockup issues that could be debugged this way (e.g. some Thinkpads,
as witnessed accidentally via IRC chats, and from the well-known "don't enable
DRI, that will lock up on resume!" chants).
Yes, this code is a kludge and somewhat far from a nicely generic
extended PCI space resume framework, but we've spent almost 10 (TEN!) years
without anything even remotely resembling a working kludge for
AGP suspend/resume in combination with DRI, so...

Feel free to offer thoughts on how this missing generic extended PCI space
restore functionality could be implemented, to be used by intel-agp and
various other drivers. No promise that it will be me who implements that,
though ;)

> Whitelist entry would still be welcome.

OK, I'll work on this next.


Thanks!

Signed-off-by: Andreas Mohr <[EMAIL PROTECTED]>


--- linux-2.6.21-rc7-mm2.orig/drivers/char/agp/intel-agp.c  2007-05-10 14:52:26.0 +0200
+++ linux-2.6.21-rc7-mm2/drivers/char/agp/intel-agp.c   2007-05-10 14:31:48.0 +0200
@@ -31,9 +31,16 @@
 extern int agp_memory_reserved;
 

-/* Intel 815 register */
-#define INTEL_815_APCONT   0x51
-#define INTEL_815_ATTBASE_MASK ~0x1FFF
+/* Intel i815 registers, see Intel spec #29068801 */
+#define I815_GMCHCFG       0x50
+#define I815_APCONT        0x51
+#define I815_UNKNOWN_80    0x80
+#define I815_ATTBASE_MASK  ~0x1FFF
+#define I815_SM_RCOMP      0x98 /* Sys Mem R Compensation Ctrl */
+#define I815_SM            0x9c /* System Memory Control Reg */
+#define I815_AGPCMD        0xa8 /* AGP Command Register */
+#define I815_ERRSTS        0xc8 /* undocumented in i815 spec; since this one is modified on resume and many other related chipsets have it, I assume it *is* ERRSTS */
+#define I815_UNKNOWN_E8    0xe8
 
 /* Intel i820 registers */
#define INTEL_I820_RDCR    0x51
@@ -664,7 +671,7 @@
if ((pg_start + mem->page_count) > num_entries)
goto out_err;
 
-   /* The i830 can't check the GTT for entries since its read only,
+   /* The i830 can't check the GTT for entries since it's read-only,
 * depend on the caller to make the correct offset decisions.
 */
 
@@ -787,7 +794,7 @@
if ((pg_start + mem->page_count) > num_entries)
goto out_err;
 
-   /* 

[patch 29/29] xen: Attempt to patch inline versions of common operations

2007-05-04 Thread Jeremy Fitzhardinge
This patch adds the mechanism to allow us to patch inline versions of
common operations.

The implementations of the direct-access versions save_fl, restore_fl,
irq_enable and irq_disable are now in assembler, and the same code is
used for both out of line and inline uses.
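
As a rough illustration of what such a patch callback does (a hedged
sketch, not the patch's code): it copies one of the pre-built assembler
templates over the indirect-call site when it fits, and otherwise leaves
the site to the generic patcher:

	static unsigned inline_patch_sketch(void *site, unsigned site_len,
					    const char *tmpl_start,
					    const char *tmpl_end)
	{
		unsigned tmpl_len = tmpl_end - tmpl_start;

		if (tmpl_len > site_len)
			return 0;	/* doesn't fit: the caller falls
					   back to an indirect call */
		memcpy(site, tmpl_start, tmpl_len);
		return tmpl_len;	/* bytes of code now inlined */
	}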

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Cc: Chris Wright <[EMAIL PROTECTED]>
Cc: Keir Fraser <[EMAIL PROTECTED]>
---
 arch/i386/kernel/asm-offsets.c |7 ++
 arch/i386/xen/Makefile |2 
 arch/i386/xen/enlighten.c  |  108 +++--
 arch/i386/xen/xen-asm.S|  114 
 arch/i386/xen/xen-ops.h|   13 
 5 files changed, 190 insertions(+), 54 deletions(-)

===
--- a/arch/i386/kernel/asm-offsets.c
+++ b/arch/i386/kernel/asm-offsets.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define DEFINE(sym, val) \
 asm volatile("\n->" #sym " %0 " #val : : "i" (val))
@@ -115,4 +116,10 @@ void foo(void)
OFFSET(PARAVIRT_iret, paravirt_ops, iret);
OFFSET(PARAVIRT_read_cr0, paravirt_ops, read_cr0);
 #endif
+
+#ifdef CONFIG_XEN
+   BLANK();
+   OFFSET(XEN_vcpu_info_mask, vcpu_info, evtchn_upcall_mask);
+   OFFSET(XEN_vcpu_info_pending, vcpu_info, evtchn_upcall_pending);
+#endif
 }
===
--- a/arch/i386/xen/Makefile
+++ b/arch/i386/xen/Makefile
@@ -1,4 +1,4 @@ obj-y   := enlighten.o setup.o features.o
 obj-y  := enlighten.o setup.o features.o multicalls.o mmu.o \
-   events.o time.o
+   events.o time.o xen-asm.o
 
 obj-$(CONFIG_SMP)  += smp.o
===
--- a/arch/i386/xen/enlighten.c
+++ b/arch/i386/xen/enlighten.c
@@ -107,9 +107,11 @@ void xen_vcpu_setup(int cpu)
if (err == 0) {
have_vcpu_info_placement = 1;
per_cpu(xen_vcpu, cpu) = vcpup;
+
printk("cpu %d using vcpu_info at %p\n",
   cpu, vcpup);
}
+
 }
 
 static void __init xen_banner(void)
@@ -169,20 +171,6 @@ static unsigned long xen_save_fl(void)
return (-flags) & X86_EFLAGS_IF;
 }
 
-static unsigned long xen_save_fl_direct(void)
-{
-   unsigned long flags;
-
-   /* flag has opposite sense of mask */
-   flags = !x86_read_percpu(xen_vcpu_info.evtchn_upcall_mask);
-
-   /* convert to IF type flag
-  -0 -> 0x00000000
-  -1 -> 0xffffffff
-   */
-   return (-flags) & X86_EFLAGS_IF;
-}
-
 static void xen_restore_fl(unsigned long flags)
 {
struct vcpu_info *vcpu;
@@ -209,25 +197,6 @@ static void xen_restore_fl(unsigned long
}
 }
 
-static void xen_restore_fl_direct(unsigned long flags)
-{
-   /* convert from IF type flag */
-   flags = !(flags & X86_EFLAGS_IF);
-
-   /* This is an atomic update, so no need to worry about
-  preemption. */
-   x86_write_percpu(xen_vcpu_info.evtchn_upcall_mask, flags);
-
-   /* If we get preempted here, then any pending event will be
-  handled anyway. */
-
-   if (flags == 0) {
-   barrier(); /* unmask then check (avoid races) */
-   if (unlikely(x86_read_percpu(xen_vcpu_info.evtchn_upcall_pending)))
-   force_evtchn_callback();
-   }
-}
-
 static void xen_irq_disable(void)
 {
/* There's a one instruction preempt window here.  We need to
@@ -236,12 +205,6 @@ static void xen_irq_disable(void)
preempt_disable();
x86_read_percpu(xen_vcpu)->evtchn_upcall_mask = 1;
preempt_enable_no_resched();
-}
-
-static void xen_irq_disable_direct(void)
-{
-   /* Atomic update, so preemption not a concern. */
-   x86_write_percpu(xen_vcpu_info.evtchn_upcall_mask, 1);
 }
 
 static void xen_irq_enable(void)
@@ -261,19 +224,6 @@ static void xen_irq_enable(void)
 
barrier(); /* unmask then check (avoid races) */
if (unlikely(vcpu->evtchn_upcall_pending))
-   force_evtchn_callback();
-}
-
-static void xen_irq_enable_direct(void)
-{
-   /* Atomic update, so preemption not a concern. */
-   x86_write_percpu(xen_vcpu_info.evtchn_upcall_mask, 0);
-
-   /* Doesn't matter if we get preempted here, because any
-  pending event will get dealt with anyway. */
-
-   barrier(); /* unmask then check (avoid races) */
-   if (unlikely(x86_read_percpu(xen_vcpu_info.evtchn_upcall_pending)))
force_evtchn_callback();
 }
 
@@ -852,6 +802,57 @@ static __init void xen_pagetable_setup_d
xen_vcpu_setup(smp_processor_id());
 }
 
+static unsigned xen_patch(u8 type, u16 clobbers, void *insns, unsigned len)
+{
+   char *start, *end, *reloc;
+   unsigned ret;
+
+   start = end = reloc = NULL;
+
+#define SITE(x)

[patch 24/29] xen: rename xen netif_ structures to xen_netif_

2007-05-04 Thread Jeremy Fitzhardinge
The "netif_" prefix is used in the Linux network stack, so rename the
Xen structures to xen_netif_ to avoid confusion (and potential
collision).

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Signed-off-by: Chris Wright <[EMAIL PROTECTED]>

---
 include/xen/interface/io/netif.h |   18 +++---
 1 file changed, 11 insertions(+), 7 deletions(-)

===
--- a/include/xen/interface/io/netif.h
+++ b/include/xen/interface/io/netif.h
@@ -47,7 +47,7 @@
 #define _NETTXF_extra_info (3)
 #define  NETTXF_extra_info (1U<<_NETTXF_extra_info)
 
-struct netif_tx_request {
+struct xen_netif_tx_request {
 grant_ref_t gref;  /* Reference to buffer page */
 uint16_t offset;   /* Offset within buffer page */
 uint16_t flags;/* NETTXF_* */
@@ -71,7 +71,7 @@ struct netif_tx_request {
  * This structure needs to fit within both netif_tx_request and
  * netif_rx_response for compatibility.
  */
-struct netif_extra_info {
+struct xen_netif_extra_info {
 uint8_t type;  /* XEN_NETIF_EXTRA_TYPE_* */
 uint8_t flags; /* XEN_NETIF_EXTRA_FLAG_* */
 
@@ -103,12 +103,12 @@ struct netif_extra_info {
 } u;
 };
 
-struct netif_tx_response {
+struct xen_netif_tx_response {
 uint16_t id;
 int16_t  status;   /* NETIF_RSP_* */
 };
 
-struct netif_rx_request {
+struct xen_netif_rx_request {
uint16_t    id;    /* Echoed in response message. */
 grant_ref_t gref;  /* Reference to incoming granted frame */
 };
@@ -129,7 +129,7 @@ struct netif_rx_request {
 #define _NETRXF_extra_info (3)
 #define  NETRXF_extra_info (1U<<_NETRXF_extra_info)
 
-struct netif_rx_response {
+struct xen_netif_rx_response {
 uint16_t id;
 uint16_t offset;   /* Offset in page of start of received packet  */
 uint16_t flags;/* NETRXF_* */
@@ -140,8 +140,12 @@ struct netif_rx_response {
  * Generate netif ring structures and types.
  */
 
-DEFINE_RING_TYPES(netif_tx, struct netif_tx_request, struct netif_tx_response);
-DEFINE_RING_TYPES(netif_rx, struct netif_rx_request, struct netif_rx_response);
+DEFINE_RING_TYPES(xen_netif_tx,
+ struct xen_netif_tx_request,
+ struct xen_netif_tx_response);
+DEFINE_RING_TYPES(xen_netif_rx,
+ struct xen_netif_rx_request,
+ struct xen_netif_rx_response);
 
 #define NETIF_RSP_DROPPED -2
 #define NETIF_RSP_ERROR   -1

-- 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] MM: implement MADV_FREE lazy freeing of anonymous memory

2007-05-04 Thread Nick Piggin

Rik van Riel wrote:

Nick Piggin wrote:


Rik van Riel wrote:


With lazy freeing of anonymous pages through MADV_FREE, performance of
the MySQL sysbench workload more than doubles on my quad-core system.



OK, I've run some tests on a 16 core Opteron system, both sysbench with
MySQL 5.33 (set up as described in the freebsd vs linux page), and with
ebizzy.

What I found is that, on this system, MADV_FREE performance improvement
was in the noise when you look at it on top of the MADV_DONTNEED glibc
and down_read(mmap_sem) patch in sysbench.



Interesting, very different results from my system.

First, did you run with the properly TLB batched version of
the MADV_FREE patch?  And did you make sure that MADV_FREE
takes the mmap_sem for reading?   Without that, I did see
a similar thing to what you saw...


Yes and yes (I initially forgot to add MADV_FREE to the down_read
case and saw horrible performance!)



Secondly, I'll have to try some test runs one of the larger
systems in the lab.

Maybe the results from my quad core Intel system are not
typical; maybe the results from your 16 core Opteron are
not typical.  Either way, I want to find out :)


Yep. We might have something like that here, and I'll try with
some other architectures as well next week, if I can get glibc
built.

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch 28/29] xen: Place vcpu_info structure into per-cpu memory, if possible

2007-05-04 Thread Jeremy Fitzhardinge
An experimental patch for Xen allows guests to place their vcpu_info
structs anywhere.  We try to use this to place the vcpu_info into the
PDA, which allows direct access.

If this works, then switch to using direct access operations for
irq_enable, disable, save_fl and restore_fl.

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Cc: Chris Wright <[EMAIL PROTECTED]>
Cc: Keir Fraser <[EMAIL PROTECTED]>
---
 arch/i386/xen/enlighten.c|  121 +-
 arch/i386/xen/setup.c|8 --
 include/xen/interface/vcpu.h |   13 
 3 files changed, 133 insertions(+), 9 deletions(-)

===
--- a/arch/i386/xen/enlighten.c
+++ b/arch/i386/xen/enlighten.c
@@ -61,9 +61,55 @@ struct start_info *xen_start_info;
 struct start_info *xen_start_info;
 EXPORT_SYMBOL_GPL(xen_start_info);
 
+static /* __initdata */ struct shared_info dummy_shared_info;
+
+/*
+ * Point at some empty memory to start with. We map the real shared_info
+ * page as soon as fixmap is up and running.
+ */
+struct shared_info *HYPERVISOR_shared_info = (void *)&dummy_shared_info;
+
+/* tristate - -1: not tested, 0: not available, 1: available */
+static int have_vcpu_info_placement = -1;
+
 void xen_vcpu_setup(int cpu)
 {
+   struct vcpu_register_vcpu_info info;
+   int err;
+   struct vcpu_info *vcpup;
+
per_cpu(xen_vcpu, cpu) = &HYPERVISOR_shared_info->vcpu_info[cpu];
+
+   if (!have_vcpu_info_placement)
+   return; /* already tested, not available */
+
+   vcpup = &per_cpu(xen_vcpu_info, cpu);
+
+   printk("setting up vcpu for %d: xen_vcpu_info=%p\n",
+  cpu, vcpup);
+
+   info.mfn = virt_to_mfn(vcpup);
+   info.offset = offset_in_page(vcpup);
+
+   printk("trying to map vcpu_info %d at %p, mfn %x, offset %d\n",
+  cpu, vcpup, info.mfn, info.offset);
+
+   /* Check to see if the hypervisor will put the vcpu_info
+  structure where we want it, which allows direct access via
+  a percpu-variable. */
+   err = HYPERVISOR_vcpu_op(VCPUOP_register_vcpu_info, cpu, &info);
+
+   BUG_ON(err != 0 && err != -ENOSYS);
+   BUG_ON(err && have_vcpu_info_placement==1);  /* all or nothing */
+
+   have_vcpu_info_placement = 0;
+
+   if (err == 0) {
+   have_vcpu_info_placement = 1;
+   per_cpu(xen_vcpu, cpu) = vcpup;
+   printk("cpu %d using vcpu_info at %p\n",
+  cpu, vcpup);
+   }
 }
 
 static void __init xen_banner(void)
@@ -123,6 +169,20 @@ static unsigned long xen_save_fl(void)
return (-flags) & X86_EFLAGS_IF;
 }
 
+static unsigned long xen_save_fl_direct(void)
+{
+   unsigned long flags;
+
+   /* flag has opposite sense of mask */
+   flags = !x86_read_percpu(xen_vcpu_info.evtchn_upcall_mask);
+
+   /* convert to IF type flag
+  -0 -> 0x00000000
+  -1 -> 0xffffffff
+   */
+   return (-flags) & X86_EFLAGS_IF;
+}
+
 static void xen_restore_fl(unsigned long flags)
 {
struct vcpu_info *vcpu;
@@ -149,6 +209,25 @@ static void xen_restore_fl(unsigned long
}
 }
 
+static void xen_restore_fl_direct(unsigned long flags)
+{
+   /* convert from IF type flag */
+   flags = !(flags & X86_EFLAGS_IF);
+
+   /* This is an atomic update, so no need to worry about
+  preemption. */
+   x86_write_percpu(xen_vcpu_info.evtchn_upcall_mask, flags);
+
+   /* If we get preempted here, then any pending event will be
+  handled anyway. */
+
+   if (flags == 0) {
+   barrier(); /* unmask then check (avoid races) */
+   if (unlikely(x86_read_percpu(xen_vcpu_info.evtchn_upcall_pending)))
+   force_evtchn_callback();
+   }
+}
+
 static void xen_irq_disable(void)
 {
/* There's a one instruction preempt window here.  We need to
@@ -157,6 +236,12 @@ static void xen_irq_disable(void)
preempt_disable();
x86_read_percpu(xen_vcpu)->evtchn_upcall_mask = 1;
preempt_enable_no_resched();
+}
+
+static void xen_irq_disable_direct(void)
+{
+   /* Atomic update, so preemption not a concern. */
+   x86_write_percpu(xen_vcpu_info.evtchn_upcall_mask, 1);
 }
 
 static void xen_irq_enable(void)
@@ -176,6 +261,19 @@ static void xen_irq_enable(void)
 
barrier(); /* unmask then check (avoid races) */
if (unlikely(vcpu->evtchn_upcall_pending))
+   force_evtchn_callback();
+}
+
+static void xen_irq_enable_direct(void)
+{
+   /* Atomic update, so preemption not a concern. */
+   x86_write_percpu(xen_vcpu_info.evtchn_upcall_mask, 0);
+
+   /* Doesn't matter if we get preempted here, because any
+  pending event will get dealt with anyway. */
+
+   barrier(); /* unmask then check (avoid races) */
+   if (unlikely(x86_read_percpu(xen_vcpu_info.evtchn_upcall_pending)))
force_evtchn_callback();
 

Re: [PATCH] MM: implement MADV_FREE lazy freeing of anonymous memory

2007-05-04 Thread Nick Piggin

Ulrich Drepper wrote:

Nick Piggin wrote:


What I found is that, on this system, MADV_FREE performance improvement
was in the noise when you look at it on top of the MADV_DONTNEED glibc
and down_read(mmap_sem) patch in sysbench.



I don't want to judge the numbers since I cannot, but I want to make an
observation: even if in the SMP case MADV_FREE turns out not to be a
bigger boost then there is still the UP case to keep in mind where Rik
measured a significant speed-up.  As long as the SMP case isn't hurt
this is reason enough to use the patch.  With more and more cores on one
processor, SMP systems are pushed ever more to the high-end side.  You'll
find many installations which today use SMP will be happy enough with
many-core UP machines.


OK, sure. I think we need more numbers though.

And even if this was a patch with _no_ possibility for regressions and it
was a completely trivial one that improves performance in some cases...
one big problem is that it uses another page flag.

I literally have about 4 or 5 new page flags I'd like to add today :) I
can't of course, because we have very few spare ones left.

From the MySQL numbers on this system, it seems like performance is in the
noise, and MADV_DONTNEED makes the _vast_ majority of the improvement.
This is also the case with Rik's benchmarks, and while he did see some
improvement, I found the runs to be quite variable, so it would be ideal
to get a larger sample.

And the fact that the poor behaviour of the old style malloc/free went
unnoticed for so long indicates that it won't be the end of the world if
we don't merge MADV_FREE right now.
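
For reference, the allocator-side contrast between the two hints, as a
hedged userspace sketch (MADV_FREE is the constant this patch proposes,
not something in mainline; MADV_DONTNEED is the existing hint):

	#include <sys/mman.h>

	static void arena_release(void *addr, size_t len)
	{
		/* lazy: pages stay mapped and reusable; the kernel may
		   reclaim them under memory pressure instead of swapping */
		if (madvise(addr, len, MADV_FREE) != 0)
			/* eager fallback: drop the pages immediately; the
			   next touch faults in fresh zeroed pages */
			madvise(addr, len, MADV_DONTNEED);
	}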

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch 17/29] xen: lazy-mmu operations

2007-05-04 Thread Jeremy Fitzhardinge
This patch uses the lazy-mmu hooks to batch mmu operations where
possible.  This is primarily useful for batching operations applied to
active pagetables, which happens during mprotect, munmap, mremap and
the like (mmap does not do bulk pagetable operations, so it isn't
helped).
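
The generic pattern the hooks enable looks roughly like this hedged
sketch (modelled on mprotect's pte loop; under Xen each set_pte_at()
inside the bracket queues a multicall instead of trapping individually,
and the whole batch is flushed at the leave):

	arch_enter_lazy_mmu_mode();
	for (; addr < end; pte++, addr += PAGE_SIZE)
		set_pte_at(mm, addr, pte, pte_modify(*pte, newprot));
	arch_leave_lazy_mmu_mode();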

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Acked-by: Chris Wright <[EMAIL PROTECTED]>

---
 arch/i386/xen/enlighten.c  |   56 +++-
 arch/i386/xen/mmu.c|   56 
 arch/i386/xen/multicalls.c |4 +--
 3 files changed, 78 insertions(+), 38 deletions(-)

===
--- a/arch/i386/xen/enlighten.c
+++ b/arch/i386/xen/enlighten.c
@@ -451,28 +451,38 @@ static void xen_apic_write(unsigned long
 
 static void xen_flush_tlb(void)
 {
-   struct mmuext_op op;
-
-   op.cmd = MMUEXT_TLB_FLUSH_LOCAL;
-   if (HYPERVISOR_mmuext_op(&op, 1, NULL, DOMID_SELF))
-   BUG();
+   struct mmuext_op *op;
+   struct multicall_space mcs = xen_mc_entry(sizeof(*op));
+
+   op = mcs.args;
+   op->cmd = MMUEXT_TLB_FLUSH_LOCAL;
+   MULTI_mmuext_op(mcs.mc, op, 1, NULL, DOMID_SELF);
+
+   xen_mc_issue(PARAVIRT_LAZY_MMU);
 }
 
 static void xen_flush_tlb_single(unsigned long addr)
 {
-   struct mmuext_op op;
-
-   op.cmd = MMUEXT_INVLPG_LOCAL;
-   op.arg1.linear_addr = addr & PAGE_MASK;
-   if (HYPERVISOR_mmuext_op(&op, 1, NULL, DOMID_SELF))
-   BUG();
+   struct mmuext_op *op;
+   struct multicall_space mcs = xen_mc_entry(sizeof(*op));
+
+   op = mcs.args;
+   op->cmd = MMUEXT_INVLPG_LOCAL;
+   op->arg1.linear_addr = addr & PAGE_MASK;
+   MULTI_mmuext_op(mcs.mc, op, 1, NULL, DOMID_SELF);
+
+   xen_mc_issue(PARAVIRT_LAZY_MMU);
 }
 
 static void xen_flush_tlb_others(const cpumask_t *cpus, struct mm_struct *mm,
 unsigned long va)
 {
-   struct mmuext_op op;
+   struct {
+   struct mmuext_op op;
+   cpumask_t mask;
+   } *args;
cpumask_t cpumask = *cpus;
+   struct multicall_space mcs;
 
/*
 * A couple of (to be removed) sanity checks:
@@ -489,17 +499,21 @@ static void xen_flush_tlb_others(const c
if (cpus_empty(cpumask))
return;
 
+   mcs = xen_mc_entry(sizeof(*args));
+   args = mcs.args;
+   args->mask = cpumask;
+   args->op.arg2.vcpumask = &args->mask;
+
if (va == TLB_FLUSH_ALL) {
-   op.cmd = MMUEXT_TLB_FLUSH_MULTI;
-   op.arg2.vcpumask = (void *)cpus;
+   args->op.cmd = MMUEXT_TLB_FLUSH_MULTI;
} else {
-   op.cmd = MMUEXT_INVLPG_MULTI;
-   op.arg1.linear_addr = va;
-   op.arg2.vcpumask = (void *)cpus;
-   }
-
-   if (HYPERVISOR_mmuext_op(&op, 1, NULL, DOMID_SELF))
-   BUG();
+   args->op.cmd = MMUEXT_INVLPG_MULTI;
+   args->op.arg1.linear_addr = va;
+   }
+
+   MULTI_mmuext_op(mcs.mc, &args->op, 1, NULL, DOMID_SELF);
+
+   xen_mc_issue(PARAVIRT_LAZY_MMU);
 }
 
 static unsigned long xen_read_cr2(void)
===
--- a/arch/i386/xen/mmu.c
+++ b/arch/i386/xen/mmu.c
@@ -56,12 +56,20 @@ void make_lowmem_page_readwrite(void *va
 
 void xen_set_pmd(pmd_t *ptr, pmd_t val)
 {
-   struct mmu_update u;
-
-   u.ptr = virt_to_machine(ptr).maddr;
-   u.val = pmd_val_ma(val);
-   if (HYPERVISOR_mmu_update(&u, 1, NULL, DOMID_SELF) < 0)
-   BUG();
+   struct multicall_space mcs;
+   struct mmu_update *u;
+
+   preempt_disable();
+
+   mcs = xen_mc_entry(sizeof(*u));
+   u = mcs.args;
+   u->ptr = virt_to_machine(ptr).maddr;
+   u->val = pmd_val_ma(val);
+   MULTI_mmu_update(mcs.mc, u, 1, NULL, DOMID_SELF);
+
+   xen_mc_issue(PARAVIRT_LAZY_MMU);
+
+   preempt_enable();
 }
 
 /*
@@ -104,20 +112,38 @@ void xen_set_pte_at(struct mm_struct *mm
 void xen_set_pte_at(struct mm_struct *mm, unsigned long addr,
pte_t *ptep, pte_t pteval)
 {
-   if ((mm != current->mm && mm != &init_mm) ||
-   HYPERVISOR_update_va_mapping(addr, pteval, 0) != 0)
-   xen_set_pte(ptep, pteval);
+   if (mm == current->mm || mm == &init_mm) {
+   if (xen_get_lazy_mode() == PARAVIRT_LAZY_MMU) {
+   struct multicall_space mcs;
+   mcs = xen_mc_entry(0);
+
+   MULTI_update_va_mapping(mcs.mc, addr, pteval, 0);
+   xen_mc_issue(PARAVIRT_LAZY_MMU);
+   return;
+   } else
+   if (HYPERVISOR_update_va_mapping(addr, pteval, 0) == 0)
+   return;
+   }
+   xen_set_pte(ptep, pteval);
 }
 
 #ifdef CONFIG_X86_PAE
 void xen_set_pud(pud_t *ptr, pud_t val)
 {
-   struct 

[patch 26/29] xen: fix netfront checksums

2007-05-04 Thread Jeremy Fitzhardinge
If we receive a partially checksummed packet, we need to work out how
much of it was checksummed based on its protocol.  The ideal would be
that Xen could tell us how much is checksummed, but in the meantime
we'll make do with this.

[ This is a separate patch for review; will be folded into
  xen-netfront later. ]

From: Herbert Xu <[EMAIL PROTECTED]>
Signed-off-by: Herbert Xu <[EMAIL PROTECTED]>
Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Signed-off-by: Chris Wright <[EMAIL PROTECTED]>
---
 drivers/net/xen-netfront.c |   76 +++-
 1 file changed, 68 insertions(+), 8 deletions(-)

===
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -35,10 +35,12 @@
 #include 
 #include 
 #include 
-#include 
 #include 
+#include 
+#include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -564,8 +566,12 @@ static int xennet_start_xmit(struct sk_b
extra = NULL;
 
tx->flags = 0;
-   if (skb->ip_summed == CHECKSUM_PARTIAL) /* local packet? */
+   if (skb->ip_summed == CHECKSUM_PARTIAL)
+   /* local packet? */
tx->flags |= NETTXF_csum_blank | NETTXF_data_validated;
+   else if (skb->ip_summed == CHECKSUM_UNNECESSARY)
+   /* remote but checksummed. */
+   tx->flags |= NETTXF_data_validated;
 
if (skb_shinfo(skb)->gso_size) {
struct xen_netif_extra_info *gso;
@@ -867,9 +873,50 @@ static RING_IDX xennet_fill_frags(struct
return cons;
 }
 
-static void handle_incoming_queue(struct net_device *dev,
+static int skb_checksum_setup(struct sk_buff *skb)
+{
+   struct iphdr *iph;
+   unsigned char *th;
+   int err = -EPROTO;
+
+   if (skb->protocol != htons(ETH_P_IP))
+   goto out;
+
+   iph = (void *)skb->data;
+   th = skb->data + 4 * iph->ihl;
+   if (th >= skb_tail_pointer(skb))
+   goto out;
+
+   skb->csum_start = th - skb->head;
+   switch (iph->protocol) {
+   case IPPROTO_TCP:
+   skb->csum_offset = offsetof(struct tcphdr, check);
+   break;
+   case IPPROTO_UDP:
+   skb->csum_offset = offsetof(struct udphdr, check);
+   break;
+   default:
+   if (net_ratelimit())
+   printk(KERN_ERR "Attempting to checksum a non-"
+  "TCP/UDP packet, dropping a protocol"
+  " %d packet", iph->protocol);
+   goto out;
+   }
+
+   if ((th + skb->csum_offset + 2) > skb_tail_pointer(skb))
+   goto out;
+
+   err = 0;
+
+out:
+   return err;
+}
+
+static int handle_incoming_queue(struct net_device *dev,
  struct sk_buff_head *rxq)
 {
+   struct netfront_info *np = netdev_priv(dev);
+   int packets_dropped = 0;
struct sk_buff *skb;
 
while ((skb = __skb_dequeue(rxq)) != NULL) {
@@ -886,10 +933,24 @@ static void handle_incoming_queue(struct
/* Ethernet work: Delayed to here as it peeks the header. */
skb->protocol = eth_type_trans(skb, dev);
 
+   if (skb->ip_summed == CHECKSUM_PARTIAL) {
+   if (skb_checksum_setup(skb)) {
+   kfree_skb(skb);
+   packets_dropped++;
+   np->stats.rx_errors++;
+   continue;
+   }
+   }
+
+   np->stats.rx_packets++;
+   np->stats.rx_bytes += skb->len;
+
/* Pass it up. */
netif_receive_skb(skb);
dev->last_rx = jiffies;
}
+
+   return packets_dropped;
 }
 
 static int xennet_poll(struct net_device *dev, int *pbudget)
@@ -1003,11 +1064,10 @@ err:
skb->truesize += skb->data_len - (RX_COPY_THRESHOLD - len);
skb->len += skb->data_len;
 
-   if (rx->flags & NETRXF_data_validated)
+   if (rx->flags & NETRXF_csum_blank)
+   skb->ip_summed = CHECKSUM_PARTIAL;
+   else if (rx->flags & NETRXF_data_validated)
skb->ip_summed = CHECKSUM_UNNECESSARY;
-
-   np->stats.rx_packets++;
-   np->stats.rx_bytes += skb->len;
 
__skb_queue_tail(&rxq, skb);
 
@@ -1029,7 +1089,7 @@ err:
while ((skb = __skb_dequeue(&errq)))
kfree_skb(skb);
 
-   handle_incoming_queue(dev, &rxq);
+   work_done -= handle_incoming_queue(dev, &rxq);
 
/* If we get a callback with very few responses, reduce fill target. */
/* NB. Note exponential increase, linear decrease. */

-- 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  

[patch 21/29] xen: Add Xen grant table support

2007-05-04 Thread Jeremy Fitzhardinge
Add Xen 'grant table' driver which allows granting of access to
selected local memory pages by other virtual machines and,
symmetrically, the mapping of remote memory pages which other virtual
machines have granted access to.

This driver is a prerequisite for many of the Xen virtual device
drivers, which grant the 'device driver domain' restricted and
temporary access to only those memory pages that are currently
involved in I/O operations.
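
A hedged sketch of a frontend's round trip with this API (calls as
declared by this patch in include/xen/grant_table.h; error handling and
the ring handshake that communicates the reference are elided):

	static int share_frame_sketch(domid_t backend, unsigned long mfn)
	{
		int ref = gnttab_grant_foreign_access(backend, mfn,
						      0 /* writable */);
		if (ref < 0)
			return ref;

		/* ... advertise 'ref' to the backend, perform the I/O ... */

		/* revoke access once the backend is done with the frame */
		gnttab_end_foreign_access(ref, 0 /* writable */, 0);
		return 0;
	}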

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Signed-off-by: Ian Pratt <[EMAIL PROTECTED]>
Signed-off-by: Christian Limpach <[EMAIL PROTECTED]>
Signed-off-by: Chris Wright <[EMAIL PROTECTED]>
---
 drivers/xen/Makefile|1 
 drivers/xen/grant-table.c   |  576 +++
 include/xen/grant_table.h   |  107 ++
 include/xen/interface/grant_table.h |  112 +-
 4 files changed, 777 insertions(+), 19 deletions(-)

===
--- a/drivers/xen/Makefile
+++ b/drivers/xen/Makefile
@@ -1,1 +1,2 @@ obj-y   += hvc-console.o
+obj-y  += grant-table.o
 obj-y  += hvc-console.o
===
--- /dev/null
+++ b/drivers/xen/grant-table.c
@@ -0,0 +1,576 @@
+/**
+ * grant_table.c
+ *
+ * Granting foreign access to our memory reservation.
+ *
+ * Copyright (c) 2005-2006, Christopher Clark
+ * Copyright (c) 2004-2005, K A Fraser
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation; or, when distributed
+ * separately from the Linux kernel or incorporated into other
+ * software packages, subject to the following license:
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this source file (the "Software"), to deal in the Software without
+ * restriction, including without limitation the rights to use, copy, modify,
+ * merge, publish, distribute, sublicense, and/or sell copies of the Software,
+ * and to permit persons to whom the Software is furnished to do so, subject to
+ * the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ */
+
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+
+
+/* External tools reserve first few grant table entries. */
+#define NR_RESERVED_ENTRIES 8
+#define GNTTAB_LIST_END 0xffffffff
+#define GREFS_PER_GRANT_FRAME (PAGE_SIZE / sizeof(struct grant_entry))
+
+static grant_ref_t **gnttab_list;
+static unsigned int nr_grant_frames;
+static unsigned int boot_max_nr_grant_frames;
+static int gnttab_free_count;
+static grant_ref_t gnttab_free_head;
+static DEFINE_SPINLOCK(gnttab_list_lock);
+
+static struct grant_entry *shared;
+
+static struct gnttab_free_callback *gnttab_free_callback_list;
+
+static int gnttab_expand(unsigned int req_entries);
+
+#define RPP (PAGE_SIZE / sizeof(grant_ref_t))
+#define gnttab_entry(entry) (gnttab_list[(entry) / RPP][(entry) % RPP])
+
+static int get_free_entries(int count)
+{
+   unsigned long flags;
+   int ref, rc;
+   grant_ref_t head;
+
+   spin_lock_irqsave(&gnttab_list_lock, flags);
+
+   if ((gnttab_free_count < count) &&
+   ((rc = gnttab_expand(count - gnttab_free_count)) < 0)) {
+   spin_unlock_irqrestore(&gnttab_list_lock, flags);
+   return rc;
+   }
+
+   ref = head = gnttab_free_head;
+   gnttab_free_count -= count;
+   while (count-- > 1)
+   head = gnttab_entry(head);
+   gnttab_free_head = gnttab_entry(head);
+   gnttab_entry(head) = GNTTAB_LIST_END;
+
+   spin_unlock_irqrestore(&gnttab_list_lock, flags);
+
+   return ref;
+}
+
+#define get_free_entry() get_free_entries(1)
+
+static void do_free_callbacks(void)
+{
+   struct gnttab_free_callback *callback, *next;
+
+   callback = gnttab_free_callback_list;
+   gnttab_free_callback_list = NULL;
+
+   while (callback != NULL) {
+   next = callback->next;
+   if (gnttab_free_count >= callback->count) {
+   callback->next = NULL;
+   callback->fn(callback->arg);
+   } else {
+   callback->next = 

[patch 27/29] xen: Xen machine operations

2007-05-04 Thread Jeremy Fitzhardinge
Make the appropriate hypercalls to halt and reboot the virtual machine.

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Acked-by: Chris Wright <[EMAIL PROTECTED]>

---
 arch/i386/xen/enlighten.c |   43 +++
 arch/i386/xen/smp.c   |4 +---
 2 files changed, 44 insertions(+), 3 deletions(-)

===
--- a/arch/i386/xen/enlighten.c
+++ b/arch/i386/xen/enlighten.c
@@ -14,6 +14,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -28,6 +29,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "xen-ops.h"
 #include "mmu.h"
@@ -787,6 +789,45 @@ static const struct smp_ops xen_smp_ops 
 };
 #endif /* CONFIG_SMP */
 
+static void xen_reboot(int reason)
+{
+#ifdef CONFIG_SMP
+   smp_send_stop();
+#endif
+
+   if (HYPERVISOR_sched_op(SCHEDOP_shutdown, reason))
+   BUG();
+}
+
+static void xen_restart(char *msg)
+{
+   xen_reboot(SHUTDOWN_reboot);
+}
+
+static void xen_emergency_restart(void)
+{
+   xen_reboot(SHUTDOWN_reboot);
+}
+
+static void xen_machine_halt(void)
+{
+   xen_reboot(SHUTDOWN_poweroff);
+}
+
+static void xen_crash_shutdown(struct pt_regs *regs)
+{
+   xen_reboot(SHUTDOWN_crash);
+}
+
+static const struct machine_ops __initdata xen_machine_ops = {
+   .restart = xen_restart,
+   .halt = xen_machine_halt,
+   .power_off = xen_machine_halt,
+   .shutdown = xen_machine_halt,
+   .crash_shutdown = xen_crash_shutdown,
+   .emergency_restart = xen_emergency_restart,
+};
+
 /* First C function to be called on Xen boot */
 static asmlinkage void __init xen_start_kernel(void)
 {
@@ -800,6 +841,8 @@ static asmlinkage void __init xen_start_
 
/* Install Xen paravirt ops */
paravirt_ops = xen_paravirt_ops;
+   machine_ops = xen_machine_ops;
+
 #ifdef CONFIG_SMP
smp_ops = xen_smp_ops;
 #endif
===
--- a/arch/i386/xen/smp.c
+++ b/arch/i386/xen/smp.c
@@ -303,9 +303,7 @@ static void stop_self(void *v)
 
 void xen_smp_send_stop(void)
 {
-   cpumask_t mask = cpu_online_map;
-   cpu_clear(smp_processor_id(), mask);
-   xen_smp_call_function_mask(mask, stop_self, NULL, 0);
+   smp_call_function(stop_self, NULL, 0, 0);
 }
 
 void xen_smp_send_reschedule(int cpu)

-- 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch 11/29] xen: ignore RW mapping of RO pages in pagetable_init

2007-05-04 Thread Jeremy Fitzhardinge
When setting up the initial pagetable, which includes mappings of all
low physical memory, ignore a mapping which tries to set the RW bit on
an RO pte.  An RO pte indicates a page which is part of the current
pagetable, and so it cannot be allowed to become RW.

Once xen_pagetable_setup_done is called, set_pte reverts to its normal
behaviour.

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Acked-by: Chris Wright <[EMAIL PROTECTED]>
Cc: [EMAIL PROTECTED] (Eric W. Biederman)

---
 arch/i386/xen/enlighten.c |   27 +--
 1 file changed, 25 insertions(+), 2 deletions(-)

===
--- a/arch/i386/xen/enlighten.c
+++ b/arch/i386/xen/enlighten.c
@@ -635,7 +635,7 @@ static void xen_write_cr3(unsigned long 
 
 /* Early in boot, while setting up the initial pagetable, assume
everything is pinned. */
-static void xen_alloc_pt_init(struct mm_struct *mm, u32 pfn)
+static __init void xen_alloc_pt_init(struct mm_struct *mm, u32 pfn)
 {
BUG_ON(mem_map);/* should only be used early */
make_lowmem_page_readonly(__va(PFN_PHYS(pfn)));
@@ -687,9 +687,31 @@ static void *xen_kmap_atomic_pte(struct 
 }
 #endif
 
+static __init pte_t mask_rw_pte(pte_t *ptep, pte_t pte)
+{
+   /* If there's an existing pte, then don't allow _PAGE_RW to be set */
+   if (pte_val_ma(*ptep) & _PAGE_PRESENT)
+   pte = __pte_ma(((pte_val_ma(*ptep) & _PAGE_RW) | ~_PAGE_RW) &
+  pte_val_ma(pte));
+
+   return pte;
+}
+
+/* Init-time set_pte while constructing initial pagetables, which
+   doesn't allow RO pagetable pages to be remapped RW */
+static __init void xen_set_pte_init(pte_t *ptep, pte_t pte)
+{
+   pte = mask_rw_pte(ptep, pte);
+
+   xen_set_pte(ptep, pte);
+}
+
 static __init void xen_pagetable_setup_start(pgd_t *base)
 {
pgd_t *xen_pgd = (pgd_t *)xen_start_info->pt_base;
+
+   /* special set_pte for pagetable initialization */
+   paravirt_ops.set_pte = xen_set_pte_init;
 
init_mm.pgd = base;
/*
@@ -737,6 +759,7 @@ static __init void xen_pagetable_setup_d
/* This will work as long as patching hasn't happened yet
   (which it hasn't) */
paravirt_ops.alloc_pt = xen_alloc_pt;
+   paravirt_ops.set_pte = xen_set_pte;
 
if (!xen_feature(XENFEAT_auto_translated_physmap)) {
/*
@@ -919,7 +942,7 @@ static const struct paravirt_ops xen_par
.dup_mmap = xen_dup_mmap,
.exit_mmap = xen_exit_mmap,
 
-   .set_pte = xen_set_pte,
+   .set_pte = NULL,/* see xen_pagetable_setup_* */
.set_pte_at = xen_set_pte_at,
.set_pmd = xen_set_pmd,
 

-- 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch 07/29] xen: xen event channels

2007-05-04 Thread Jeremy Fitzhardinge
Xen implements interrupts in terms of event channels.  This patch maps
an event through an event channel to an irq, and then feeds it through
the normal interrupt path, via a Xen irq_chip implementation.
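
From a driver's point of view the event-channel/irq mapping is hidden
behind helpers; a hedged usage sketch against the interfaces this patch
declares in include/xen/events.h (treat the exact signature as my
recollection, not gospel):

	static irqreturn_t xen_timer_interrupt(int irq, void *dev_id);

	static int setup_timer_virq_sketch(int cpu)
	{
		/* allocates an irq, binds the per-cpu timer VIRQ's event
		   channel to it, and requests the irq via the xen chip */
		int irq = bind_virq_to_irqhandler(VIRQ_TIMER, cpu,
						  xen_timer_interrupt,
						  IRQF_DISABLED,
						  "xen-timer", NULL);
		return irq < 0 ? irq : 0;
	}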

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Signed-off-by: Chris Wright <[EMAIL PROTECTED]>
Cc: Ingo Molnar <[EMAIL PROTECTED]>
Cc: Eric W. Biederman <[EMAIL PROTECTED]>

---
 arch/i386/xen/Makefile|3 
 arch/i386/xen/enlighten.c |1 
 arch/i386/xen/events.c|  511 +
 include/xen/events.h  |   28 ++
 4 files changed, 542 insertions(+), 1 deletion(-)

===
--- a/arch/i386/xen/Makefile
+++ b/arch/i386/xen/Makefile
@@ -1,1 +1,2 @@ obj-y   := enlighten.o setup.o features.o
-obj-y  := enlighten.o setup.o features.o multicalls.o mmu.o
+obj-y  := enlighten.o setup.o features.o multicalls.o mmu.o \
+   events.o
===
--- a/arch/i386/xen/enlighten.c
+++ b/arch/i386/xen/enlighten.c
@@ -593,6 +593,7 @@ static const struct paravirt_ops xen_par
 
.memory_setup = xen_memory_setup,
.arch_setup = xen_arch_setup,
+   .init_IRQ = xen_init_IRQ,
 
.cpuid = xen_cpuid,
 
===
--- /dev/null
+++ b/arch/i386/xen/events.c
@@ -0,0 +1,511 @@
+/*
+ * Xen event channels
+ *
+ * Xen models interrupts with abstract event channels.  Because each
+ * domain gets 1024 event channels, but NR_IRQ is not that large, we
+ * must dynamically map irqs<->event channels.  The event channels
+ * interface with the rest of the kernel by defining a xen interrupt
+ * chip.  When an event is received, it is mapped to an irq and sent
+ * through the normal interrupt processing path.
+ *
+ * There are four kinds of events which can be mapped to an event
+ * channel:
+ *
+ * 1. Inter-domain notifications.  This includes all the virtual
+ *device events, since they're driven by front-ends in another domain
+ *(typically dom0).
+ * 2. VIRQs, typically used for timers.  These are per-cpu events.
+ * 3. IPIs.
+ * 4. Hardware interrupts. Not supported at present.
+ *
+ * Jeremy Fitzhardinge <[EMAIL PROTECTED]>, XenSource Inc, 2007
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+
+#include "xen-ops.h"
+
+/*
+ * This lock protects updates to the following mapping and reference-count
+ * arrays. The lock does not need to be acquired to read the mapping tables.
+ */
+static DEFINE_SPINLOCK(irq_mapping_update_lock);
+
+/* IRQ <-> VIRQ mapping. */
+static DEFINE_PER_CPU(int, virq_to_irq[NR_VIRQS]) = {[0 ... NR_VIRQS-1] = -1};
+
+/* Packed IRQ information: binding type, sub-type index, and event channel. */
+struct packed_irq
+{
+   unsigned short evtchn;
+   unsigned char index;
+   unsigned char type;
+};
+
+static struct packed_irq irq_info[NR_IRQS];
+
+/* Binding types. */
+enum { IRQT_UNBOUND, IRQT_PIRQ, IRQT_VIRQ, IRQT_IPI, IRQT_EVTCHN };
+
+/* Convenient shorthand for packed representation of an unbound IRQ. */
+#define IRQ_UNBOUND    mk_irq_info(IRQT_UNBOUND, 0, 0)
+
+static int evtchn_to_irq[NR_EVENT_CHANNELS] = {
+   [0 ... NR_EVENT_CHANNELS-1] = -1
+};
+static unsigned long cpu_evtchn_mask[NR_CPUS][NR_EVENT_CHANNELS/BITS_PER_LONG];
+static u8 cpu_evtchn[NR_EVENT_CHANNELS];
+
+/* Reference counts for bindings to IRQs. */
+static int irq_bindcount[NR_IRQS];
+
+/* Xen will never allocate port zero for any purpose. */
+#define VALID_EVTCHN(chn)  ((chn) != 0)
+
+/*
+ * Force a proper event-channel callback from Xen after clearing the
+ * callback mask. We do this in a very simple manner, by making a call
+ * down into Xen. The pending flag will be checked by Xen on return.
+ */
+void force_evtchn_callback(void)
+{
+   (void)HYPERVISOR_xen_version(0, NULL);
+}
+EXPORT_SYMBOL_GPL(force_evtchn_callback);
+
+static struct irq_chip xen_dynamic_chip;
+
+/* Constructor for packed IRQ information. */
+static inline struct packed_irq mk_irq_info(u32 type, u32 index, u32 evtchn)
+{
+   return (struct packed_irq) { evtchn, index, type };
+}
+
+/*
+ * Accessors for packed IRQ information.
+ */
+static inline unsigned int evtchn_from_irq(int irq)
+{
+   return irq_info[irq].evtchn;
+}
+
+static inline unsigned int index_from_irq(int irq)
+{
+   return irq_info[irq].index;
+}
+
+static inline unsigned int type_from_irq(int irq)
+{
+   return irq_info[irq].type;
+}
+
+static inline unsigned long active_evtchns(unsigned int cpu,
+  struct shared_info *sh,
+  unsigned int idx)
+{
+   return (sh->evtchn_pending[idx] &
+   cpu_evtchn_mask[cpu][idx] &
+   ~sh->evtchn_mask[idx]);
+}
+
+static void 

Re: Remove constructor from buffer_head

2007-05-04 Thread Andrew Morton
On Sat, 5 May 2007 01:22:05 +0200
Andi Kleen <[EMAIL PROTECTED]> wrote:

> > 2.6.21:
> > 
> > akpm2:/home/akpm# opreport -l /boot/vmlinux-$(uname -r) | head -50
> > opreport error: No sample file found: try running opcontrol --dump
> > or specify a session containing sample files
> 
> For me it works on a slightly post 2.6.21 kernel with suse oprofile-0.9.2-21
> 
> Did you try opcontrol --dump? 

Yes, tried various things.  There's just nothing turning up in 
/var/lib/oprofile.

Chuck appears to be claiming that 2.6.21 oprofile is known to be broken,
but I never heard anything about that.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch 06/29] xen: Xen virtual mmu

2007-05-04 Thread Jeremy Fitzhardinge
Xen pagetable handling, including the machinery to implement direct
pagetables.  This means that pte update operations are intercepted so
that pseudo-physical addresses can be converted to machine addresses.
It also implements late pinning/early unpinning so that pagetables are
initialized while they're normal read-write memory as much as
possible, yet pinned to make cr3 reloads as efficient as possible.

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Signed-off-by: Chris Wright <[EMAIL PROTECTED]>

---
 arch/i386/xen/Makefile|2 
 arch/i386/xen/enlighten.c |   30 ++-
 arch/i386/xen/mmu.c   |  419 +
 arch/i386/xen/mmu.h   |   47 +
 4 files changed, 493 insertions(+), 5 deletions(-)

===
--- a/arch/i386/xen/Makefile
+++ b/arch/i386/xen/Makefile
@@ -1,1 +1,1 @@ obj-y   := enlighten.o setup.o features.o
-obj-y  := enlighten.o setup.o features.o multicalls.o
+obj-y  := enlighten.o setup.o features.o multicalls.o mmu.o
===
--- a/arch/i386/xen/enlighten.c
+++ b/arch/i386/xen/enlighten.c
@@ -39,6 +39,7 @@
 #include 
 
 #include "xen-ops.h"
+#include "mmu.h"
 #include "multicalls.h"
 
 EXPORT_SYMBOL_GPL(hypercall_page);
@@ -565,11 +566,9 @@ static __init void xen_pagetable_setup_d
 * Should be set_fixmap(), but shared_info is a machine
 * address with no corresponding pseudo-phys address.
 */
-#if 0
set_pte_mfn(fix_to_virt(FIX_PARAVIRT_BOOTMAP),
PFN_DOWN(xen_start_info->shared_info),
PAGE_KERNEL);
-#endif
 
HYPERVISOR_shared_info =
(struct shared_info *)fix_to_virt(FIX_PARAVIRT_BOOTMAP);
@@ -578,9 +577,7 @@ static __init void xen_pagetable_setup_d
HYPERVISOR_shared_info =
(struct shared_info *)__va(xen_start_info->shared_info);
 
-#if 0
xen_pgd_pin(base);
-#endif
 
xen_vcpu_setup(smp_processor_id());
 }
@@ -682,6 +679,31 @@ static const struct paravirt_ops xen_par
.release_pd = xen_release_pd,
.release_pt = xen_release_pt,
 
+   .set_pte = xen_set_pte,
+   .set_pte_at = xen_set_pte_at,
+   .set_pmd = xen_set_pmd,
+
+   .pte_val = xen_pte_val,
+   .pgd_val = xen_pgd_val,
+
+   .make_pte = xen_make_pte,
+   .make_pgd = xen_make_pgd,
+
+#ifdef CONFIG_X86_PAE
+   .set_pte_atomic = xen_set_pte_atomic,
+   .set_pte_present = xen_set_pte_at,
+   .set_pud = xen_set_pud,
+   .pte_clear = xen_pte_clear,
+   .pmd_clear = xen_pmd_clear,
+
+   .make_pmd = xen_make_pmd,
+   .pmd_val = xen_pmd_val,
+#endif /* PAE */
+
+   .activate_mm = xen_activate_mm,
+   .dup_mmap = xen_dup_mmap,
+   .exit_mmap = xen_exit_mmap,
+
.set_lazy_mode = xen_set_lazy_mode,
 };
 
===
--- /dev/null
+++ b/arch/i386/xen/mmu.c
@@ -0,0 +1,419 @@
+/*
+ * Xen mmu operations
+ *
+ * This file contains the various mmu fetch and update operations.
+ * The most important job they must perform is the mapping between the
+ * domain's pfn and the overall machine mfns.
+ *
+ * Xen allows guests to directly update the pagetable, in a controlled
+ * fashion.  In other words, the guest modifies the same pagetable
+ * that the CPU actually uses, which eliminates the overhead of having
+ * a separate shadow pagetable.
+ *
+ * In order to allow this, it falls on the guest domain to map its
+ * notion of a "physical" pfn - which is just a domain-local linear
+ * address - into a real "machine address" which the CPU's MMU can
+ * use.
+ *
+ * A pgd_t/pmd_t/pte_t will typically contain an mfn, and so can be
+ * inserted directly into the pagetable.  When creating a new
+ * pte/pmd/pgd, it converts the passed pfn into an mfn.  Conversely,
+ * when reading the content back with __(pgd|pmd|pte)_val, it converts
+ * the mfn back into a pfn.
+ *
+ * The other constraint is that all pages which make up a pagetable
+ * must be mapped read-only in the guest.  This prevents uncontrolled
+ * guest updates to the pagetable.  Xen strictly enforces this, and
+ * will disallow any pagetable update which will end up mapping a
+ * pagetable page RW, and will disallow using any writable page as a
+ * pagetable.
+ *
+ * Naively, when loading %cr3 with the base of a new pagetable, Xen
+ * would need to validate the whole pagetable before going on.
+ * Naturally, this is quite slow.  The solution is to "pin" a
+ * pagetable, which enforces all the constraints on the pagetable even
+ * when it is not actively in use.  This means that Xen can be assured
+ * that it is still valid when you do load it into %cr3, and doesn't
+ * need to revalidate it.
+ *
+ * Jeremy Fitzhardinge <[EMAIL PROTECTED]>, XenSource Inc, 

[patch 00/29] xen: Xen implementation for paravirt_ops

2007-05-04 Thread Jeremy Fitzhardinge
Hi Andi,

This series of patches implements the Xen paravirt-ops interface.
It applies to 2.6.21-git3 + ff patches-2.6.21-git3-070501-1.tar.gz.

Changes since the last posting:
 - reviews of xenbus (me), netfront (hch, rusty, herbert xu) and
   blockfront (hch), with most comments addressed.  Netfront review
   revealed a couple of real bugs, and the code for all three is looking
   cleaner overall.
 - Folded bugfix patches into their main patch
 - Lots of little style and other cleanups

There may be a trivial conflict in xen-hvc-console because of the
lguest console.

These patches are now moderately well tested, with several successful
runs through XenSource's regression test suite, and some amount of
non-me testing.  While I wouldn't go into production with a
xen/paravirt_ops kernel right now, it does seem pretty functional.

This series generally restricts itself to Xen-specific parts of the
tree, though it does make a few small changes elsewhere.

It includes:
 - some helper routines for allocating address space and walking pagetables
 - Xen interface header files
 - Core Xen implementation
 - Efficient late-pinning/early-unpinning pagetable handling
 - Virtualized time, including stolen time
 - SMP support
 - Preemption support
 - Batched pagetable updates
 - Xen console, based on hvc console
 - Xenbus
 - Netfront, the paravirtualized network device
 - Blockfront, the paravirtualized block device

Thanks,
J
-- 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch 16/29] xen: Add support for preemption

2007-05-04 Thread Jeremy Fitzhardinge
Add Xen support for preemption.  This is mostly a cleanup of existing
preempt_enable/disable calls, or just comments to explain the current
usage.

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Acked-by: Chris Wright <[EMAIL PROTECTED]>

---
 arch/i386/xen/Kconfig  |2 
 arch/i386/xen/enlighten.c  |  105 +---
 arch/i386/xen/mmu.c|4 +
 arch/i386/xen/multicalls.c |   11 ++--
 arch/i386/xen/time.c   |   22 +++--
 5 files changed, 89 insertions(+), 55 deletions(-)

===
--- a/arch/i386/xen/Kconfig
+++ b/arch/i386/xen/Kconfig
@@ -4,6 +4,6 @@
 
 config XEN
bool "Enable support for Xen hypervisor"
-   depends on PARAVIRT && !PREEMPT
+   depends on PARAVIRT
help
  This is the Linux Xen port.
===
--- a/arch/i386/xen/enlighten.c
+++ b/arch/i386/xen/enlighten.c
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -108,11 +109,10 @@ static unsigned long xen_save_fl(void)
struct vcpu_info *vcpu;
unsigned long flags;
 
-   preempt_disable();
vcpu = x86_read_percpu(xen_vcpu);
+
/* flag has opposite sense of mask */
flags = !vcpu->evtchn_upcall_mask;
-   preempt_enable();
 
/* convert to IF type flag
   -0 -> 0x
@@ -125,51 +125,56 @@ static void xen_restore_fl(unsigned long
 {
struct vcpu_info *vcpu;
 
-   preempt_disable();
-
/* convert from IF type flag */
flags = !(flags & X86_EFLAGS_IF);
+
+   /* There's a one instruction preempt window here.  We need to
+  make sure we don't switch CPUs between getting the vcpu
+  pointer and updating the mask. */
+   preempt_disable();
vcpu = x86_read_percpu(xen_vcpu);
vcpu->evtchn_upcall_mask = flags;
+   preempt_enable_no_resched();
+
+   /* Doesn't matter if we get preempted here, because any
+  pending event will get dealt with anyway. */
 
if (flags == 0) {
-   /* Unmask then check (avoid races).  We're only protecting
-  against updates by this CPU, so there's no need for
-  anything stronger. */
-   barrier();
-
+   preempt_check_resched();
+   barrier(); /* unmask then check (avoid races) */
if (unlikely(vcpu->evtchn_upcall_pending))
force_evtchn_callback();
-   preempt_enable();
-   } else
-   preempt_enable_no_resched();
+   }
 }
 
 static void xen_irq_disable(void)
 {
+   /* There's a one instruction preempt window here.  We need to
+  make sure we don't switch CPUs between getting the vcpu
+  pointer and updating the mask. */
+   preempt_disable();
+   x86_read_percpu(xen_vcpu)->evtchn_upcall_mask = 1;
+   preempt_enable_no_resched();
+}
+
+static void xen_irq_enable(void)
+{
struct vcpu_info *vcpu;
-   preempt_disable();
-   vcpu = x86_read_percpu(xen_vcpu);
-   vcpu->evtchn_upcall_mask = 1;
-   preempt_enable_no_resched();
-}
-
-static void xen_irq_enable(void)
-{
-   struct vcpu_info *vcpu;
-
+
+   /* There's a one instruction preempt window here.  We need to
   make sure we don't switch CPUs between getting the vcpu
+  pointer and updating the mask. */
preempt_disable();
vcpu = x86_read_percpu(xen_vcpu);
vcpu->evtchn_upcall_mask = 0;
-
-   /* Unmask then check (avoid races).  We're only protecting
-  against updates by this CPU, so there's no need for
-  anything stronger. */
-   barrier();
-
+   preempt_enable_no_resched();
+
+   /* Doesn't matter if we get preempted here, because any
+  pending event will get dealt with anyway. */
+
+   barrier(); /* unmask then check (avoid races) */
if (unlikely(vcpu->evtchn_upcall_pending))
force_evtchn_callback();
-   preempt_enable();
 }
 
 static void xen_safe_halt(void)
@@ -189,6 +194,8 @@ static void xen_halt(void)
 
 static void xen_set_lazy_mode(enum paravirt_lazy_mode mode)
 {
+   BUG_ON(preemptible());
+
switch(mode) {
case PARAVIRT_LAZY_NONE:
BUG_ON(x86_read_percpu(xen_lazy_mode) == PARAVIRT_LAZY_NONE);
@@ -292,12 +299,17 @@ static void xen_write_ldt_entry(struct d
xmaddr_t mach_lp = virt_to_machine(lp);
u64 entry = (u64)high << 32 | low;
 
+   preempt_disable();
+
xen_mc_flush();
if (HYPERVISOR_update_descriptor(mach_lp.maddr, entry))
BUG();
-}
-
-static int cvt_gate_to_trap(int vector, u32 low, u32 high, struct trap_info 
*info)
+
+   preempt_enable();
+}
+
+static int cvt_gate_to_trap(int vector, u32 low, u32 high,
+   struct trap_info 

[patch 25/29] xen: Add the Xen virtual network device driver.

2007-05-04 Thread Jeremy Fitzhardinge
The network device frontend driver allows the kernel to access network
devices exported by a virtual machine containing a physical
network device driver.

* * *
use skb.cb for storing private data

Netfront's use of nh.raw and h.raw for storing page+offset is a bit
hinky, and it breaks with upcoming network stack updates which reduce
these fields to sub-pointer sizes.  Fortunately, skb offers the "cb"
field specifically for stashing this kind of info, so use it.

* * *
Lockdep fixes for xen-netfront

netfront contains two locking problems found by lockdep:

1. rx_lock is a normal spinlock, and tx_lock is an irq spinlock.  This
   means that in normal use, tx_lock may be taken by an interrupt routine
   while rx_lock is held.  However, netif_disconnect_backend takes them
   in the order tx_lock->rx_lock, which could lead to a deadlock.  Reverse
   them
2. rx_lock can also be taken in softirq context, so it should be taken/released
   with spin_(un)lock_bh.

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Signed-off-by: Chris Wright <[EMAIL PROTECTED]>
Cc: Ian Pratt <[EMAIL PROTECTED]>
Cc: Christian Limpach <[EMAIL PROTECTED]>
Cc: [EMAIL PROTECTED]
Cc: Jeff Garzik <[EMAIL PROTECTED]>
Cc: Stephen Hemminger <[EMAIL PROTECTED]>
Cc: Christoph Hellwig <[EMAIL PROTECTED]>
Cc: Rusty Russell <[EMAIL PROTECTED]>
Cc: Herbert Xu <[EMAIL PROTECTED]>

---
 drivers/net/Kconfig|   12 
 drivers/net/Makefile   |2 
 drivers/net/xen-netfront.c | 1916 
 3 files changed, 1930 insertions(+)

===
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -2506,6 +2506,18 @@ source "drivers/atm/Kconfig"
 
 source "drivers/s390/net/Kconfig"
 
+config XEN_NETDEV_FRONTEND
+   tristate "Xen network device frontend driver"
+   depends on XEN
+   default y
+   help
+ The network device frontend driver allows the kernel to
+ access network devices exported by a virtual
+ machine containing a physical network device driver. The
+ frontend driver is intended for unprivileged guest domains;
+ if you are compiling a kernel for a Xen guest, you almost
+ certainly want to enable this.
+
 config ISERIES_VETH
tristate "iSeries Virtual Ethernet driver support"
depends on PPC_ISERIES
===
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -218,3 +218,5 @@ obj-$(CONFIG_FS_ENET) += fs_enet/
 obj-$(CONFIG_FS_ENET) += fs_enet/
 
 obj-$(CONFIG_NETXEN_NIC) += netxen/
+
+obj-$(CONFIG_XEN_NETDEV_FRONTEND) += xen-netfront.o
===
--- /dev/null
+++ b/drivers/net/xen-netfront.c
@@ -0,0 +1,1916 @@
+/*
+ * Virtual network driver for conversing with remote driver backends.
+ *
+ * Copyright (c) 2002-2005, K A Fraser
+ * Copyright (c) 2005, XenSource Ltd
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation; or, when distributed
+ * separately from the Linux kernel or incorporated into other
+ * software packages, subject to the following license:
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this source file (the "Software"), to deal in the Software without
+ * restriction, including without limitation the rights to use, copy, modify,
+ * merge, publish, distribute, sublicense, and/or sell copies of the Software,
+ * and to permit persons to whom the Software is furnished to do so, subject to
+ * the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+
+static struct ethtool_ops xennet_ethtool_ops;
+
+struct netfront_cb {
+   struct page *page;
+   unsigned offset;
+};
+
+#define NETFRONT_SKB_CB(skb)   ((struct netfront_cb *)((skb)->cb))
+
+/*
+ * Mutually-exclusive module options to select receive data path:
+ *  rx_copy : Packets are copied by network backend into local memory
+ *  rx_flip : Page containing packet data is transferred 

[patch 14/29] xen: Implement xen_sched_clock

2007-05-04 Thread Jeremy Fitzhardinge
Implement xen_sched_clock, which returns the number of ns the current
vcpu has been actually in an unstolen state (ie, running or blocked,
vs runnable-but-not-running, or offline) since boot.
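
As a worked example (numbers invented): a vcpu that has spent 7 ms
running and 2 ms blocked, had 1 ms stolen while runnable, and
re-entered the running state 0.5 ms before the call would see
xen_sched_clock() return 7 + 2 + 0.5 = 9.5 ms; the stolen millisecond
is excluded.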

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Acked-by: Chris Wright <[EMAIL PROTECTED]>
Cc: john stultz <[EMAIL PROTECTED]>

---
 arch/i386/xen/enlighten.c |2 +-
 arch/i386/xen/time.c  |   22 +-
 arch/i386/xen/xen-ops.h   |3 +--
 3 files changed, 23 insertions(+), 4 deletions(-)

===
--- a/arch/i386/xen/enlighten.c
+++ b/arch/i386/xen/enlighten.c
@@ -676,7 +676,7 @@ static const struct paravirt_ops xen_par
.set_wallclock = xen_set_wallclock,
.get_wallclock = xen_get_wallclock,
.get_cpu_khz = xen_cpu_khz,
-   .sched_clock = xen_clocksource_read,
+   .sched_clock = xen_sched_clock,
 
 #ifdef CONFIG_X86_LOCAL_APIC
.apic_write = paravirt_nop,
===
--- a/arch/i386/xen/time.c
+++ b/arch/i386/xen/time.c
@@ -16,6 +16,8 @@
 #define XEN_SHIFT 22
#define TIMER_SLOP 100000 /* Xen may fire a timer up to this many ns early */
#define NS_PER_TICK (1000000000LL / HZ)
+
+static cycle_t xen_clocksource_read(void);
 
/* These are periodically updated in shared_info, and then copied here. */
 struct shadow_time_info {
@@ -118,6 +120,24 @@ static void do_stolen_accounting(void)
account_steal_time(idle_task(smp_processor_id()), ticks);
 }
 
+/*
+ * Xen sched_clock implementation.  Returns the number of unstolen
+ * nanoseconds, which is nanoseconds the VCPU spent in RUNNING+BLOCKED
+ * states.
+ */
+unsigned long long xen_sched_clock(void)
+{
+   struct vcpu_runstate_info state;
+   cycle_t now = xen_clocksource_read();
+
+   get_runstate_snapshot();
+
+   WARN_ON(state.state != RUNSTATE_running);
+
+   return state.time[RUNSTATE_blocked] +
+   state.time[RUNSTATE_running] +
+   (now - state.state_entry_time);
+}
 
 
 /* Get the CPU speed from Xen */
@@ -209,7 +229,7 @@ static u64 get_nsec_offset(struct shadow
return scale_delta(delta, shadow->tsc_to_nsec_mul, shadow->tsc_shift);
 }
 
-cycle_t xen_clocksource_read(void)
+static cycle_t xen_clocksource_read(void)
 {
struct shadow_time_info *shadow = &__get_cpu_var(shadow_time);
cycle_t ret;
===
--- a/arch/i386/xen/xen-ops.h
+++ b/arch/i386/xen/xen-ops.h
@@ -2,7 +2,6 @@
 #define XEN_OPS_H
 
 #include 
-#include 
 
 DECLARE_PER_CPU(struct vcpu_info *, xen_vcpu);
 DECLARE_PER_CPU(unsigned long, xen_cr3);
@@ -18,7 +17,7 @@ void __init xen_time_init(void);
 void __init xen_time_init(void);
 unsigned long xen_get_wallclock(void);
 int xen_set_wallclock(unsigned long time);
-cycle_t xen_clocksource_read(void);
+unsigned long long xen_sched_clock(void);
 
 void xen_mark_init_mm_pinned(void);
 

-- 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch 15/29] xen: Xen SMP guest support

2007-05-04 Thread Jeremy Fitzhardinge
This is a fairly straightforward Xen implementation of smp_ops.

Xen has its own IPI mechanisms, and has no dependency on any
APIC-based IPI.  The smp_ops hooks and the flush_tlb_others pv_op
allow a Xen guest to avoid all APIC code in arch/i386 (the only apic
operation is a single apic_read for the apic version number).

One subtle point which needs to be addressed is unpinning pagetables
when another cpu may have a lazy tlb reference to the pagetable. Xen
will not allow an in-use pagetable to be unpinned, so we must find any
other cpus with a reference to the pagetable and get them to shoot
down their references.
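
A toy model of that requirement (invented names, plain C11 atomics in
place of kernel primitives; the kernel version sends an IPI to each
cpu instead of calling the handler directly):

#include <stdatomic.h>
#include <stdio.h>

#define NCPUS 4

/* which "cpus" still hold a lazy reference to the pagetable's mm */
static atomic_int lazy_ref[NCPUS];

static void drop_mm_ref(int cpu)
{
	/* models the IPI handler that makes a remote cpu leave the mm */
	atomic_store(&lazy_ref[cpu], 0);
}

static void unpin_pagetable(void)
{
	int cpu;

	/* every remote reference must be gone before Xen allows unpinning */
	for (cpu = 0; cpu < NCPUS; cpu++)
		if (atomic_load(&lazy_ref[cpu]))
			drop_mm_ref(cpu);

	printf("no references left, unpin can proceed\n");
}

int main(void)
{
	atomic_store(&lazy_ref[2], 1);	/* cpu 2 kept a lazy tlb reference */
	unpin_pagetable();
	return 0;
}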

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Signed-off-by: Chris Wright <[EMAIL PROTECTED]>
Cc: Benjamin LaHaise <[EMAIL PROTECTED]>
Cc: Ingo Molnar <[EMAIL PROTECTED]>
Cc: Andi Kleen <[EMAIL PROTECTED]>

---
 arch/i386/kernel/smp.c |5 
 arch/i386/kernel/smpboot.c |8 
 arch/i386/xen/Kconfig  |2 
 arch/i386/xen/Makefile |2 
 arch/i386/xen/enlighten.c  |  119 -
 arch/i386/xen/events.c |   78 +++
 arch/i386/xen/mmu.c|   75 ++-
 arch/i386/xen/mmu.h|   13 
 arch/i386/xen/setup.c  |5 
 arch/i386/xen/smp.c|  405 
 arch/i386/xen/time.c   |9 
 arch/i386/xen/xen-ops.h|   25 +
 include/asm-i386/mach-default/irq_vectors_limits.h |2 
 include/asm-i386/mmu_context.h |2 
 include/asm-i386/smp.h |5 
 include/xen/events.h   |   27 +
 16 files changed, 717 insertions(+), 65 deletions(-)

===
--- a/arch/i386/kernel/smp.c
+++ b/arch/i386/kernel/smp.c
@@ -23,6 +23,7 @@
 
 #include 
 #include 
+#include 
 #include 
 
 /*
@@ -250,13 +251,13 @@ static DEFINE_SPINLOCK(tlbstate_lock);
 static DEFINE_SPINLOCK(tlbstate_lock);
 
 /*
- * We cannot call mmdrop() because we are in interrupt context, 
+ * We cannot call mmdrop() because we are in interrupt context,
  * instead update mm->cpu_vm_mask.
  *
  * We need to reload %cr3 since the page tables may be going
  * away from under us..
  */
-static inline void leave_mm (unsigned long cpu)
+void leave_mm (unsigned long cpu)
 {
if (per_cpu(cpu_tlbstate, cpu).state == TLBSTATE_OK)
BUG();
===
--- a/arch/i386/kernel/smpboot.c
+++ b/arch/i386/kernel/smpboot.c
@@ -149,7 +149,7 @@ void __init smp_alloc_memory(void)
  * a given CPU
  */
 
-static void __cpuinit smp_store_cpu_info(int id)
+void __cpuinit smp_store_cpu_info(int id)
 {
struct cpuinfo_x86 *c = cpu_data + id;
 
@@ -309,8 +309,7 @@ cpumask_t cpu_coregroup_map(int cpu)
 /* representing cpus for which sibling maps can be computed */
 static cpumask_t cpu_sibling_setup_map;
 
-static inline void
-set_cpu_sibling_map(int cpu)
+void set_cpu_sibling_map(int cpu)
 {
int i;
struct cpuinfo_x86 *c = cpu_data;
@@ -1145,8 +1144,7 @@ void __init native_smp_prepare_boot_cpu(
 }
 
 #ifdef CONFIG_HOTPLUG_CPU
-static void
-remove_siblinginfo(int cpu)
+void remove_siblinginfo(int cpu)
 {
int sibling;
struct cpuinfo_x86 *c = cpu_data;
===
--- a/arch/i386/xen/Kconfig
+++ b/arch/i386/xen/Kconfig
@@ -4,6 +4,6 @@
 
 config XEN
bool "Enable support for Xen hypervisor"
-   depends on PARAVIRT && !PREEMPT && !SMP
+   depends on PARAVIRT && !PREEMPT
help
  This is the Linux Xen port.  Enabling this will allow the
===
--- a/arch/i386/xen/Makefile
+++ b/arch/i386/xen/Makefile
@@ -1,2 +1,4 @@ obj-y   := enlighten.o setup.o features.o
 obj-y  := enlighten.o setup.o features.o multicalls.o mmu.o \
events.o time.o
+
+obj-$(CONFIG_SMP)  += smp.o
===
--- a/arch/i386/xen/enlighten.c
+++ b/arch/i386/xen/enlighten.c
@@ -40,6 +40,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include "xen-ops.h"
 #include "mmu.h"
@@ -56,7 +58,7 @@ struct start_info *xen_start_info;
 struct start_info *xen_start_info;
 EXPORT_SYMBOL_GPL(xen_start_info);
 
-static void xen_vcpu_setup(int cpu)
+void xen_vcpu_setup(int cpu)
 {
per_cpu(xen_vcpu, cpu) = &HYPERVISOR_shared_info->vcpu_info[cpu];
 }
@@ -344,6 +346,32 @@ static void xen_write_idt_entry(struct d
}
 }
 
+static void xen_convert_trap_info(const struct Xgt_desc_struct *desc,
+ struct trap_info *traps)
+{
+   unsigned in, out, count;
+
+ 

[patch 19/29] xen: Use the hvc console infrastructure for Xen console

2007-05-04 Thread Jeremy Fitzhardinge
Implement a Xen back-end for hvc console.

From: Gerd Hoffmann <[EMAIL PROTECTED]>
Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Signed-off-by: Chris Wright <[EMAIL PROTECTED]>
Cc: Olof Johansson <[EMAIL PROTECTED]>
---
 arch/i386/xen/events.c |3 -
 drivers/Makefile   |3 +
 drivers/char/Kconfig   |8 ++
 drivers/char/Makefile  |1 
 drivers/char/hvc_xen.c |  134 
 include/xen/events.h   |1 
 6 files changed, 149 insertions(+), 1 deletion(-)

===
--- a/arch/i386/xen/events.c
+++ b/arch/i386/xen/events.c
@@ -244,7 +244,7 @@ static int find_unbound_irq(void)
return irq;
 }
 
-static int bind_evtchn_to_irq(unsigned int evtchn)
+int bind_evtchn_to_irq(unsigned int evtchn)
 {
int irq;
 
@@ -269,6 +269,7 @@ static int bind_evtchn_to_irq(unsigned i
 
return irq;
 }
+EXPORT_SYMBOL_GPL(bind_evtchn_to_irq);
 
 static int bind_ipi_to_irq(unsigned int ipi, unsigned int cpu)
 {
===
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -14,6 +14,9 @@ obj-$(CONFIG_ACPI)+= acpi/
 # was used and do nothing if so
 obj-$(CONFIG_PNP)  += pnp/
 obj-$(CONFIG_ARM_AMBA) += amba/
+
+# Xen is the default console when running as a guest
+obj-$(CONFIG_XEN)  += xen/
 
 # char/ comes before serial/ etc so that the VT console is the boot-time
 # default.
===
--- a/drivers/char/Kconfig
+++ b/drivers/char/Kconfig
@@ -650,6 +650,14 @@ config HVC_BEAT
help
  Toshiba's Cell Reference Set Beat Console device driver
 
+config HVC_XEN
+   bool "Xen Hypervisor Console support"
+   depends on XEN
+   select HVC_DRIVER
+   default y
+   help
+ Xen virtual console device driver
+
 config HVCS
tristate "IBM Hypervisor Virtual Console Server support"
depends on PPC_PSERIES
===
--- a/drivers/char/Makefile
+++ b/drivers/char/Makefile
@@ -47,6 +47,7 @@ obj-$(CONFIG_HVC_RTAS)+= hvc_rtas.o
 obj-$(CONFIG_HVC_RTAS) += hvc_rtas.o
 obj-$(CONFIG_HVC_BEAT) += hvc_beat.o
 obj-$(CONFIG_LGUEST_GUEST) += hvc_lguest.o
+obj-$(CONFIG_HVC_XEN)  += hvc_xen.o
 obj-$(CONFIG_HVC_DRIVER)   += hvc_console.o
 obj-$(CONFIG_RAW_DRIVER)   += raw.o
 obj-$(CONFIG_SGI_SNSC) += snsc.o snsc_event.o
===
--- /dev/null
+++ b/drivers/char/hvc_xen.c
@@ -0,0 +1,134 @@
+/*
+ * xen console driver interface to hvc_console.c
+ *
+ * (c) 2007 Gerd Hoffmann <[EMAIL PROTECTED]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307 USA
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+
+#include "hvc_console.h"
+
+#define HVC_COOKIE   0x58656e /* "Xen" in hex */
+
+static struct hvc_struct *hvc;
+static int xencons_irq;
+
+/* -- */
+
+static inline struct xencons_interface *xencons_interface(void)
+{
+   return mfn_to_virt(xen_start_info->console.domU.mfn);
+}
+
+static inline void notify_daemon(void)
+{
+   /* Use evtchn: this is called early, before irq is set up. */
+   notify_remote_via_evtchn(xen_start_info->console.domU.evtchn);
+}
+
+static int write_console(uint32_t vtermno, const char *data, int len)
+{
+   struct xencons_interface *intf = xencons_interface();
+   XENCONS_RING_IDX cons, prod;
+   int sent = 0;
+
+   cons = intf->out_cons;
+   prod = intf->out_prod;
+   mb();
+   BUG_ON((prod - cons) > sizeof(intf->out));
+
+   while ((sent < len) && ((prod - cons) < sizeof(intf->out)))
+   intf->out[MASK_XENCONS_IDX(prod++, intf->out)] = data[sent++];
+
+   wmb();
+   intf->out_prod = prod;
+
+   notify_daemon();
+   return sent;
+}
+
+static int read_console(uint32_t vtermno, char *buf, int len)
+{
+   struct xencons_interface *intf = xencons_interface();
+   XENCONS_RING_IDX cons, prod;
+   int recv = 0;
+
+   cons = intf->in_cons;
+   prod = 
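
A standalone model of the free-running cons/prod ring that
write_console() above manipulates (size and names invented):

#include <stdio.h>

#define RING_SIZE 16			/* power of two, like intf->out */
#define RING_MASK(i) ((i) & (RING_SIZE - 1))

static char ring[RING_SIZE];
static unsigned int cons, prod;		/* free-running, never wrapped */

static int ring_write(const char *data, int len)
{
	int sent = 0;

	/* prod - cons is the fill level even after the indices overflow */
	while (sent < len && prod - cons < RING_SIZE)
		ring[RING_MASK(prod++)] = data[sent++];

	/* the real driver does wmb(), publishes prod, then notifies xencons */
	return sent;
}

int main(void)
{
	printf("queued %d bytes\n", ring_write("hello", 5));
	return 0;
}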

[patch 08/29] xen: xen time implementation

2007-05-04 Thread Jeremy Fitzhardinge
Time is implemented by using a clocksource which is driven from the
hypervisor's nanosecond timebase.  Xen implements time by
extrapolating from known timestamps using the tsc; the hypervisor is
responsible for making sure that the tsc is constant rate and
synchronized between vcpus.
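
In portable C the extrapolation amounts to the following (illustrative;
the patch uses hand-written mul/shrd asm, and this sketch assumes a
compiler with unsigned __int128):

#include <stdint.h>
#include <stdio.h>

/* stand-in for the patch's scale_delta(): (delta << shift) * mul / 2^32 */
static uint64_t scale(uint64_t delta, uint32_t mul_frac, int shift)
{
	if (shift < 0)
		delta >>= -shift;
	else
		delta <<= shift;
	return (uint64_t)(((unsigned __int128)delta * mul_frac) >> 32);
}

int main(void)
{
	uint64_t sys_ns = 5000000000ull;   /* hypervisor time at last snapshot */
	uint64_t tsc_then = 1000000, tsc_now = 4000000;
	uint32_t mul = 0x80000000u;	   /* 32.32 fraction: 0.5 ns per cycle */

	/* now = snapshot time + scaled tsc delta = 5000000000 + 1500000 ns */
	printf("now = %llu ns\n", (unsigned long long)
	       (sys_ns + scale(tsc_now - tsc_then, mul, 0)));
	return 0;
}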

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Signed-off-by: Chris Wright <[EMAIL PROTECTED]>
Cc: Ingo Molnar <[EMAIL PROTECTED]>
Cc: Thomas Gleixner <[EMAIL PROTECTED]>

---
 arch/i386/xen/Makefile|2 
 arch/i386/xen/enlighten.c |6 
 arch/i386/xen/time.c  |  402 +
 3 files changed, 409 insertions(+), 1 deletion(-)

===
--- a/arch/i386/xen/Makefile
+++ b/arch/i386/xen/Makefile
@@ -1,2 +1,2 @@ obj-y   := enlighten.o setup.o features.o
 obj-y  := enlighten.o setup.o features.o multicalls.o mmu.o \
-   events.o
+   events.o time.o
===
--- a/arch/i386/xen/enlighten.c
+++ b/arch/i386/xen/enlighten.c
@@ -595,6 +595,12 @@ static const struct paravirt_ops xen_par
.arch_setup = xen_arch_setup,
.init_IRQ = xen_init_IRQ,
 
+   .time_init = xen_time_init,
+   .set_wallclock = xen_set_wallclock,
+   .get_wallclock = xen_get_wallclock,
+   .get_cpu_khz = xen_cpu_khz,
+   .sched_clock = xen_clocksource_read,
+
.cpuid = xen_cpuid,
 
.set_debugreg = xen_set_debugreg,
===
--- /dev/null
+++ b/arch/i386/xen/time.c
@@ -0,0 +1,402 @@
+/*
+ * Xen time implementation.
+ *
+ * This is implemented in terms of a clocksource driver which uses
+ * the hypervisor clock as a nanosecond timebase, and a clockevent
+ * driver which uses the hypervisor's timer mechanism.
+ *
+ * Jeremy Fitzhardinge <[EMAIL PROTECTED]>, XenSource Inc, 2007
+ */
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+
+#include 
+#include 
+#include 
+
+#include "xen-ops.h"
+
+#define XEN_SHIFT 22
+#define TIMER_SLOP 100000 /* Xen may fire a timer up to this many ns early */
+
+/* These are periodically updated in shared_info, and then copied here. */
+struct shadow_time_info {
+   u64 tsc_timestamp; /* TSC at last update of time vals.  */
+   u64 system_timestamp;  /* Time, in nanosecs, since boot.*/
+   u32 tsc_to_nsec_mul;
+   int tsc_shift;
+   u32 version;
+};
+
+static DEFINE_PER_CPU(struct shadow_time_info, shadow_time);
+
+unsigned long xen_cpu_khz(void)
+{
+   u64 cpu_khz = 1000000ULL << 32;
+   const struct vcpu_time_info *info =
+   &HYPERVISOR_shared_info->vcpu_info[0].time;
+
+   do_div(cpu_khz, info->tsc_to_system_mul);
+   if (info->tsc_shift < 0)
+   cpu_khz <<= -info->tsc_shift;
+   else
+   cpu_khz >>= info->tsc_shift;
+
+   return cpu_khz;
+}
+
+/*
+ * Reads a consistent set of time-base values from Xen, into a shadow data
+ * area.
+ */
+static void get_time_values_from_xen(void)
+{
+   struct vcpu_time_info   *src;
+   struct shadow_time_info *dst;
+
+   preempt_disable();
+
+   src = &__get_cpu_var(xen_vcpu)->time;
+   dst = &__get_cpu_var(shadow_time);
+
+   do {
+   dst->version = src->version;
+   rmb();
+   dst->tsc_timestamp = src->tsc_timestamp;
+   dst->system_timestamp  = src->system_time;
+   dst->tsc_to_nsec_mul   = src->tsc_to_system_mul;
+   dst->tsc_shift = src->tsc_shift;
+   rmb();
+   } while ((src->version & 1) | (dst->version ^ src->version));
+
+   preempt_enable();
+}
+
+/*
+ * Scale a 64-bit delta by scaling and multiplying by a 32-bit fraction,
+ * yielding a 64-bit result.
+ */
+static inline u64 scale_delta(u64 delta, u32 mul_frac, int shift)
+{
+   u64 product;
+#ifdef __i386__
+   u32 tmp1, tmp2;
+#endif
+
+   if (shift < 0)
+   delta >>= -shift;
+   else
+   delta <<= shift;
+
+#ifdef __i386__
+   __asm__ (
+   "mul  %5   ; "
+   "mov  %4,%%eax ; "
+   "mov  %%edx,%4 ; "
+   "mul  %5   ; "
+   "xor  %5,%5; "
+   "add  %4,%%eax ; "
+   "adc  %5,%%edx ; "
+   : "=A" (product), "=r" (tmp1), "=r" (tmp2)
+   : "a" ((u32)delta), "1" ((u32)(delta >> 32)), "2" (mul_frac) );
+#elif __x86_64__
+   __asm__ (
+   "mul %%rdx ; shrd $32,%%rdx,%%rax"
+   : "=a" (product) : "0" (delta), "d" ((u64)mul_frac) );
+#else
+#error implement me!
+#endif
+
+   return product;
+}
+
+static u64 get_nsec_offset(struct shadow_time_info *shadow)
+{
+   u64 now, delta;
+   rdtscll(now);
+   delta = now - shadow->tsc_timestamp;
+   return scale_delta(delta, 

[patch 20/29] xen: Add early printk support via hvc console

2007-05-04 Thread Jeremy Fitzhardinge
Add early printk support via hvc console, enable using
"earlyprintk=xen" on the kernel command line.

From: Gerd Hoffmann <[EMAIL PROTECTED]>
Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Signed-off-by: Chris Wright <[EMAIL PROTECTED]>
Acked-by: Ingo Molnar <[EMAIL PROTECTED]>

---
 arch/x86_64/kernel/early_printk.c |5 +
 drivers/char/hvc_xen.c|   25 +
 drivers/xen/Makefile  |1 +
 include/xen/hvc-console.h |6 ++
 4 files changed, 37 insertions(+)

===
--- a/arch/x86_64/kernel/early_printk.c
+++ b/arch/x86_64/kernel/early_printk.c
@@ -6,6 +6,7 @@
 #include 
 #include 
 #include 
+#include <xen/hvc-console.h>
 
 /* Simple VGA output */
 
@@ -242,6 +243,10 @@ static int __init setup_early_printk(cha
simnow_init(buf + 6);
early_console = &simnow_console;
keep_early = 1;
+#ifdef CONFIG_XEN
+   } else if (!strncmp(buf, "xen", 3)) {
+   early_console = &xenboot_console;
+#endif
}
register_console(early_console);
return 0;
===
--- a/drivers/char/hvc_xen.c
+++ b/drivers/char/hvc_xen.c
@@ -28,6 +28,7 @@
 #include 
 #include 
 #include 
+#include <xen/hvc-console.h>
 
 #include "hvc_console.h"
 
@@ -132,3 +133,27 @@ module_init(xen_init);
 module_init(xen_init);
 module_exit(xen_fini);
 console_initcall(xen_cons_init);
+
+static void xenboot_write_console(struct console *console, const char *string,
+ unsigned len)
+{
+   unsigned int linelen, off = 0;
+   const char *pos;
+
+   while (off < len && NULL != (pos = strchr(string+off, '\n'))) {
+   linelen = pos-string+off;
+   if (off + linelen > len)
+   break;
+   write_console(0, string+off, linelen);
+   write_console(0, "\r\n", 2);
+   off += linelen + 1;
+   }
+   if (off < len)
+   write_console(0, string+off, len-off);
+}
+
+struct console xenboot_console = {
+   .name   = "xenboot",
+   .write  = xenboot_write_console,
+   .flags  = CON_PRINTBUFFER | CON_BOOT,
+};
===
--- /dev/null
+++ b/drivers/xen/Makefile
@@ -0,0 +1,1 @@
+obj-y  += grant-table.o
===
--- /dev/null
+++ b/include/xen/hvc-console.h
@@ -0,0 +1,6 @@
+#ifndef XEN_HVC_CONSOLE_H
+#define XEN_HVC_CONSOLE_H
+
+extern struct console xenboot_console;
+
+#endif /* XEN_HVC_CONSOLE_H */

-- 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch 03/29] xen: Add nosegneg capability to the vsyscall page notes

2007-05-04 Thread Jeremy Fitzhardinge
Add the "nosegneg" fake capabilty to the vsyscall page notes. This is
used by the runtime linker to select a glibc version which then
disables negative-offset accesses to the thread-local segment via
%gs. These accesses require emulation in Xen (because segments are
truncated to protect the hypervisor address space) and avoiding them
provides a measurable performance boost.
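
For the concrete mechanics (paths illustrative): the distribution ships
an alternate glibc built to avoid such accesses under the hwcap
directory, e.g. /lib/i686/tls/nosegneg/libc.so.6, and installs a
fragment like the following so that ldconfig searches it:

# /etc/ld.so.conf.d/xen.conf -- example only
hwcap 0 nosegneg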

Signed-off-by: Ian Pratt <[EMAIL PROTECTED]>
Signed-off-by: Christian Limpach <[EMAIL PROTECTED]>
Signed-off-by: Chris Wright <[EMAIL PROTECTED]>
Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Acked-by: Zachary Amsden <[EMAIL PROTECTED]>
Cc: Roland McGrath <[EMAIL PROTECTED]>
Cc: Ulrich Drepper <[EMAIL PROTECTED]>

---
 arch/i386/kernel/vsyscall-note.S |   23 +++
 include/asm-i386/elf.h   |   14 ++
 2 files changed, 33 insertions(+), 4 deletions(-)

===
--- a/arch/i386/kernel/vsyscall-note.S
+++ b/arch/i386/kernel/vsyscall-note.S
@@ -12,3 +12,26 @@ ELFNOTE_START(Linux, 0, "a")
 ELFNOTE_START(Linux, 0, "a")
.long LINUX_VERSION_CODE
 ELFNOTE_END
+
+#ifdef CONFIG_XEN
+#include 
+
+/*
+ * Add a special note telling glibc's dynamic linker a fake hardware
+ * flavor that it will use to choose the search path for libraries in the
+ * same way it uses real hardware capabilities like "mmx".
+ * We supply "nosegneg" as the fake capability, to indicate that we
+ * do not like negative offsets in instructions using segment overrides,
+ * since we implement those inefficiently.  This makes it possible to
+ * install libraries optimized to avoid those access patterns in someplace
+ * like /lib/i686/tls/nosegneg.  Note that an /etc/ld.so.conf.d/ file
+ * corresponding to the bits here is needed to make ldconfig work right.
+ * It should contain:
+ * hwcap 0 nosegneg
+ * to match the mapping of bit to name that we give here.
+ */
+ELFNOTE_START(GNU, 2, "a")
+   .long 1, 1<<0
+   .asciz "nosegneg"
+ELFNOTE_END
+#endif
===
--- a/include/asm-i386/elf.h
+++ b/include/asm-i386/elf.h
 #include 
 #include 
@@ -24,6 +26,9 @@
 #define R_386_GOTPC10
 #define R_386_NUM  11
 
+/*
+ * ELF register definitions..
+ */
 typedef unsigned long elf_greg_t;
 
 #define ELF_NGREG (sizeof (struct user_regs_struct) / sizeof(elf_greg_t))
@@ -160,6 +165,7 @@ do if (vdso_enabled) {  
\
NEW_AUX_ENT(AT_SYSINFO_EHDR, VDSO_CURRENT_BASE);\
 } while (0)
 
-#endif
+#endif /* ARCH_DLINFO */
+#endif /* __ASSEMBLY__ */
 
 #endif

-- 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: cpufreq longhaul locks up

2007-05-04 Thread Jan Engelhardt

On May 4 2007 23:20, David Johnson wrote:
>
>longhaul: VIA C3 'Nehemiah C' [C5P] CPU detected.  Powersaver supported.
>longhaul: Using ACPI support.
>
>It seems that longhaul on my system is 'using ACPI support' whereas on yours 
>it is 'using northbridge support'. I'm getting lockups after approx. 2-3 
>hours using the ondemand governor. It has no problem changing the clock 
>speed, and runs at the minimum speed most of the time.

I had tried this:

-  if (enable_arbiter_disable()) {
+  if (0 && enable_arbiter_disable()) {

to skip enabling the northbridge. Unfortunately, I do not seem to have
southbridge or ACPI support.

>I seem to recall that I get an oops when my system locks-up (the system runs 
>headless normally, so it isn't easy to check). I'll investigate.

I think I did not see any oops, though I (1) did not redirect the
kernel output back to tty0 [the distro moves it away to tty11]
so I might have missed something, but (2) netconsole did not send
anything. IIRC, the kernel still catches sysrq if it panicked, i.e.
as a result of not finding a proper root device during startup;
but no sysrq, so it seems a harder lockup. Maybe I should try
without all the modules loaded and/or disable some hw in the bios.


Jan
-- 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: Regression with SLUB on Netperf and Volanomark

2007-05-04 Thread Tim Chen
On Fri, 2007-05-04 at 11:27 -0700, Christoph Lameter wrote:

> 
> Not sure where to go here. Increasing the per cpu slab size may hold off 
> the issue up to a certain cpu cache size. For that we would need to 
> identify which slabs create the performance issue.
> 
> One easy way to check that this is indeed the case: Enable fake NUMA. You 
> will then have separate queues for each processor since they are on 
> different "nodes". Create two fake nodes. Run one thread in each node and 
> see if this fixes it.

I tried with fake NUMA (boot with numa=fake=2) and use

numactl --physcpubind=1 --membind=0 ./netserver
numactl --physcpubind=2 --membind=1 ./netperf -t TCP_STREAM -l 60 -H
127.0.0.1 -i 5,5 -I 99,5 -- -s 57344 -S 57344 -m 4096

to run the tests.  The results are about the same as the non-NUMA case,
with slab about 5% better than slub.  

So probably the difference is due to some other reasons than partial
slab.  The kernel config file is attached.

Tim




#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.21-rc7-mm2b-numasl
# Fri May  4 15:17:07 2007
#
CONFIG_X86_64=y
CONFIG_64BIT=y
CONFIG_X86=y
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_ZONE_DMA32=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_SEMAPHORE_SLEEPERS=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_QUICKLIST=y
CONFIG_NR_QUICK=2
CONFIG_RWSEM_GENERIC_SPINLOCK=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_X86_CMPXCHG=y
CONFIG_EARLY_PRINTK=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_ARCH_POPULATES_NODE_MAP=y
CONFIG_DMI=y
CONFIG_AUDIT_ARCH=y
CONFIG_GENERIC_BUG=y
# CONFIG_ARCH_HAS_ILOG2_U32 is not set
# CONFIG_ARCH_HAS_ILOG2_U64 is not set
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"

#
# Code maturity level options
#
CONFIG_EXPERIMENTAL=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32

#
# General setup
#
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
CONFIG_SWAP=y
CONFIG_SWAP_PREFETCH=y
CONFIG_SYSVIPC=y
# CONFIG_IPC_NS is not set
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
# CONFIG_BSD_PROCESS_ACCT is not set
# CONFIG_TASKSTATS is not set
# CONFIG_UTS_NS is not set
# CONFIG_AUDIT is not set
CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
CONFIG_CPUSETS=y
CONFIG_SYSFS_DEPRECATED=y
# CONFIG_RELAY is not set
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
CONFIG_CC_OPTIMIZE_FOR_SIZE=y
CONFIG_SYSCTL=y
# CONFIG_EMBEDDED is not set
CONFIG_UID16=y
CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS=y
# CONFIG_KALLSYMS_EXTRA_PASS is not set
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_ANON_INODES=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_PROC_SMAPS=y
CONFIG_PROC_CLEAR_REFS=y
CONFIG_PROC_PAGEMAP=y
CONFIG_PROC_KPAGEMAP=y
# CONFIG_SLAB is not set
CONFIG_SLUB=y
# CONFIG_SLOB is not set
CONFIG_RT_MUTEXES=y
# CONFIG_TINY_SHMEM is not set
CONFIG_BASE_SMALL=0

#
# Loadable module support
#
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
CONFIG_MODULE_FORCE_UNLOAD=y
CONFIG_MODVERSIONS=y
# CONFIG_MODULE_SRCVERSION_ALL is not set
CONFIG_KMOD=y
CONFIG_STOP_MACHINE=y

#
# Block layer
#
CONFIG_BLOCK=y
# CONFIG_BLK_DEV_IO_TRACE is not set

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_AS=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
CONFIG_DEFAULT_AS=y
# CONFIG_DEFAULT_DEADLINE is not set
# CONFIG_DEFAULT_CFQ is not set
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED="anticipatory"

#
# Processor type and features
#
CONFIG_X86_PC=y
# CONFIG_X86_VSMP is not set
# CONFIG_MK8 is not set
CONFIG_MPSC=y
# CONFIG_MCORE2 is not set
# CONFIG_GENERIC_CPU is not set
CONFIG_X86_L1_CACHE_BYTES=128
CONFIG_X86_L1_CACHE_SHIFT=7
CONFIG_X86_INTERNODE_CACHE_BYTES=128
CONFIG_X86_TSC=y
CONFIG_X86_GOOD_APIC=y
CONFIG_MICROCODE=m
CONFIG_MICROCODE_OLD_INTERFACE=y
CONFIG_X86_MSR=y
CONFIG_X86_CPUID=y
CONFIG_X86_HT=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_LOCAL_APIC=y
CONFIG_MTRR=y
CONFIG_SMP=y
# CONFIG_SCHED_SMT is not set
CONFIG_SCHED_MC=y
CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
CONFIG_PREEMPT_BKL=y
CONFIG_NUMA=y
CONFIG_K8_NUMA=y
CONFIG_NODES_SHIFT=6
CONFIG_X86_64_ACPI_NUMA=y
CONFIG_NUMA_EMU=y
CONFIG_ARCH_DISCONTIGMEM_ENABLE=y
CONFIG_ARCH_DISCONTIGMEM_DEFAULT=y
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_SELECT_MEMORY_MODEL=y
# CONFIG_FLATMEM_MANUAL is not set
CONFIG_DISCONTIGMEM_MANUAL=y
# CONFIG_SPARSEMEM_MANUAL is not set
CONFIG_DISCONTIGMEM=y
CONFIG_FLAT_NODE_MEM_MAP=y
CONFIG_NEED_MULTIPLE_NODES=y
# CONFIG_SPARSEMEM_STATIC is not set
# CONFIG_MEMORY_HOTPLUG is not set
CONFIG_SPLIT_PTLOCK_CPUS=4
CONFIG_MIGRATION=y
CONFIG_RESOURCES_64BIT=y
CONFIG_ZONE_DMA_FLAG=1
CONFIG_ADAPTIVE_READAHEAD=y
CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID=y
CONFIG_OUT_OF_LINE_PFN_TO_PAGE=y
CONFIG_NR_CPUS=8
CONFIG_HOTPLUG_CPU=y
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y

[PATCH] rfc: threaded epoll_wait thundering herd

2007-05-04 Thread Davi Arnaut
Hi,

If multiple threads are parked on epoll_wait (on a single epoll fd) and
events become available, epoll wakes up all threads on the
poll wait list, causing a thundering herd of processes trying to grab
the eventpoll lock.

This patch addresses this by using exclusive waiters (wake one). Once
the exclusive thread finishes transferring its events, a new thread
is woken if there are more events available.

Makes sense?
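
A minimal way to observe the behaviour (illustrative userspace
reproducer, compile with -lpthread; names invented): park a few
threads in epoll_wait() on one epoll fd, deliver a single event, and
count how many of them wake.

#include <sys/epoll.h>
#include <pthread.h>
#include <unistd.h>
#include <stdio.h>

#define NTHREADS 4
static int epfd;

static void *waiter(void *arg)
{
	struct epoll_event ev;
	int n = epoll_wait(epfd, &ev, 1, 2000);

	printf("thread %ld woke up, n=%d\n", (long)arg, n);
	return NULL;
}

int main(void)
{
	pthread_t th[NTHREADS];
	struct epoll_event ev = { .events = EPOLLIN };
	int pipefd[2];
	long i;

	pipe(pipefd);
	epfd = epoll_create(NTHREADS);
	ev.data.fd = pipefd[0];
	epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev);

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&th[i], NULL, waiter, (void *)i);

	sleep(1);			/* let everyone park in epoll_wait */
	write(pipefd[1], "x", 1);	/* one event arrives */

	for (i = 0; i < NTHREADS; i++)
		pthread_join(th[i], NULL);
	return 0;
}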

Signed-off-by: Davi E. M. Arnaut <[EMAIL PROTECTED]>

---
 fs/eventpoll.c |7 +++
 1 file changed, 7 insertions(+)

Index: linux-2.6/fs/eventpoll.c
===
--- linux-2.6.orig/fs/eventpoll.c
+++ linux-2.6/fs/eventpoll.c
@@ -1491,6 +1491,12 @@ static void ep_reinject_items(struct eve
}
}

+   /*
+* If there are events available, wake up the next waiter, if any.
+*/
+   if (!ricnt)
+   ricnt = !list_empty(&ep->rdllist);
+
if (ricnt) {
/*
 * Wake up ( if active ) both the eventpoll wait list and the 
->poll()
@@ -1570,6 +1576,7 @@ retry:
 * ep_poll_callback() when events will become available.
 */
init_waitqueue_entry(&wait, current);
+   wait.flags |= WQ_FLAG_EXCLUSIVE;
__add_wait_queue(&ep->wq, &wait);

for (;;) {

--
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 14/22] pollfs: pollable futex

2007-05-04 Thread Ulrich Drepper

On 5/4/07, Davide Libenzi <[EMAIL PROTECTED]> wrote:

This is a pretty specific case, that is not very typical to find in the
usual common event loop dispatch application design.


This is where you are very wrong.  Yes, it's rare in the Unix world
because non-trivial programs cannot implement this in most cases with
the available infrastructure.  But it is very common in other places
and what is more, it makes a lot of sense.  It gives you scalability
with the size of the machines at no cost associated to reorganizing
the program.



And if you *really* want your truly generic WaitForMultipleObjects
implementation, your only way is to base it on files. Files are our almost
perfect match to HANDLEs in our world. We have the basic infrastructure
already there.


"basic", but not complete.  And I never said that the implementation
thye have is perfect, far from it.  The concept is good and if we now
can implement it, with all the event sources available, using an
efficient event delivery mechanism we are far ahead of their design.

The proposal now  on the table doesn't bring us there all the way and
it has the potential to make future work in the area of event delivery
harder just because there is more legacy code to be kept  happy.  This
is why I propose to not consider these changes and instead go for the
gold, i.e., the full solution.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: cpufreq longhaul locks up

2007-05-04 Thread Jan Engelhardt

On May 4 2007 15:49, john stultz wrote:
>
>> Switching from acpi_pm+performance to acpi_pm+ondemand also
>> locks up after a few minutes.
>
>Yep. Sounds like an ondemand issue. Thanks for verifying this for me.

Nah, it also happens with cpufreq_powersave. I just need to check 
through some archives and try booting with governor=powersave so that it 
always stays low. Though, lowering the frequency does not really buy any 
temperature improvement in 60 seconds, so I don't think I will need 
cpufreq anyway (other processors have a noticeable jump in core 
temperature between 100%idle and a busy loop).


Jan
-- 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: patch: VFS: fix passing of AT_PHDR value in auxv to ELF interpreter

2007-05-04 Thread Jeremy Fitzhardinge
Quentin Godfroy wrote:
> + elf_ppnt = elf_phdata;
> + for (i = 0; i< loc->elf_ex.e_phnum; i++, elf_ppnt++)
> + if (elf_ppnt->p_type == PT_PHDR) {
> + phdr_addr = elf_ppnt->p_vaddr;
>   

Won't this break with ET_DYN executables?  And besides, isn't this the
same thing?  Shouldn't PT_PHDR->p_vaddr point to the vaddr of the Phdr
table itself?

J
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Blacklist Dell Optiplex 320 from using the HPET

2007-05-04 Thread john stultz
On Sat, 2007-05-05 at 01:18 +0200, Andi Kleen wrote:
> On Friday 04 May 2007 23:29:04 john stultz wrote:
> > One of the 2.6.21 regressions was Guilherme's problem seeing his box
> > lock up when the system detected an unstable TSC and dropped back to
> > using the HPET.
> > 
> > In digging deeper, we found the HPET is not actually incrementing on
> > this system. And in fact, the reason why this issue just cropped up was
> > because of Thomas's clocksource watchdog code was comparing the TSC to
> > the HPET (which wasn't moving) and thought the TSC was broken.
> > 
> > Anyway, Guliherme checked for a BIOS update and did not find one, so
> > I've added a DMI blacklist against his system so the HPET is not used.
> > 
> > Many thanks to Guilherme for the slow and laborious testing that finally
> > narrowed down this issue.
> 
> Before going to hard-to-maintain DMI blacklists we should first check
> whether it's a more general problem that can be solved better. Most likely
> that system isn't the only one with this issue and I don't want to apply
> DMI patches forever.

We can give it a whirl, I just didn't want to add yet another "compare
with some other counter that may or may not work" check. In this case,
probably reading three times in a row and getting the same result would
be a clearly broken box. 
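
A sketch of that check, modelled with a fake counter (illustrative
only; a real probe would read the HPET main counter register instead):

#include <stdio.h>

/* stand-in for reading the HPET main counter; a stuck one never moves */
static unsigned long read_counter(void)
{
	return 0xdeadbeef;
}

static int counter_looks_stuck(void)
{
	unsigned long a = read_counter();
	unsigned long b = read_counter();
	unsigned long c = read_counter();

	return a == b && b == c;	/* back-to-back reads never advanced */
}

int main(void)
{
	printf("stuck: %d\n", counter_looks_stuck());
	return 0;
}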


> In particular: what lspci chipset does it have?  If it's Intel it might be
> worth checking the datasheet if there is some "HPET stop" bit -- perhaps it 
> could be fixed up.

Guilherme: Could you provide lspci output? 


> We seem to have a couple of Intel systems recently with HPET trouble.

Ok, I wasn't aware it was a common issue.

thanks
-john

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: patch: VFS: fix passing of AT_PHDR value in auxv to ELF interpreter

2007-05-04 Thread Andrew Morton
On Fri, 4 May 2007 10:09:21 -0400
Quentin Godfroy <[EMAIL PROTECTED]> wrote:

> On a dynamic ELF executable, the current kernel loader gives to the
> interpreter (in the AUXV vector) the AT_PHDR argument as:
> offset_of_phdr_in_file + first address.
> 
> It can be wrong for an executable where the program headers are not located 
> in the first loaded segment.
> 
> This patch corrects the behaviour.
> 
> Signed-off-by: Quentin Godfroy <[EMAIL PROTECTED]>
> ---
>  Here is an example of such an ELF executable which the current code
>  fails on :
>  ftp://quatramaran.ens.fr/pub/godfroy/addrpath/broken-sample
> 
> --- linux-2.6.21.1/fs/binfmt_elf.c2007-05-04 03:20:00.0 -0400
> +++ linux-2.6.21.1-patch/fs/binfmt_elf.c  2007-05-04 08:02:18.0 
> -0400
> @@ -134,6 +134,7 @@ static int padzero(unsigned long elf_bss
>  static int
>  create_elf_tables(struct linux_binprm *bprm, struct elfhdr *exec,
>   int interp_aout, unsigned long load_addr,
> + unsigned long phdr_addr,
>   unsigned long interp_load_addr)
>  {
>   unsigned long p = bprm->p;
> @@ -190,7 +191,7 @@ create_elf_tables(struct linux_binprm *b
>   NEW_AUX_ENT(AT_HWCAP, ELF_HWCAP);
>   NEW_AUX_ENT(AT_PAGESZ, ELF_EXEC_PAGESIZE);
>   NEW_AUX_ENT(AT_CLKTCK, CLOCKS_PER_SEC);
> - NEW_AUX_ENT(AT_PHDR, load_addr + exec->e_phoff);
> + NEW_AUX_ENT(AT_PHDR, phdr_addr);
>   NEW_AUX_ENT(AT_PHENT, sizeof(struct elf_phdr));
>   NEW_AUX_ENT(AT_PHNUM, exec->e_phnum);
>   NEW_AUX_ENT(AT_BASE, interp_load_addr);
> @@ -529,7 +530,7 @@ static unsigned long randomize_stack_top
>  static int load_elf_binary(struct linux_binprm *bprm, struct pt_regs *regs)
>  {
>   struct file *interpreter = NULL; /* to shut gcc up */
> - unsigned long load_addr = 0, load_bias = 0;
> + unsigned long load_addr = 0, load_bias = 0, phdr_addr = 0;
>   int load_addr_set = 0;
>   char * elf_interpreter = NULL;
>   unsigned int interpreter_type = INTERPRETER_NONE;
> @@ -718,6 +719,16 @@ static int load_elf_binary(struct linux_
>   break;
>   }
>  
> + elf_ppnt = elf_phdata;
> + for (i = 0; i< loc->elf_ex.e_phnum; i++, elf_ppnt++)
> + if (elf_ppnt->p_type == PT_PHDR) {
> + phdr_addr = elf_ppnt->p_vaddr;
> + break;
> + }
> + retval = -ENOEXEC;
> + if (!phdr_addr)
> + goto out_free_dentry;
> +
>   /* Some simple consistency checks for the interpreter */
>   if (elf_interpreter) {
>   interpreter_type = INTERPRETER_ELF | INTERPRETER_AOUT;
> @@ -987,7 +998,7 @@ static int load_elf_binary(struct linux_
>   current->flags &= ~PF_FORKNOEXEC;
>   create_elf_tables(bprm, >elf_ex,
> (interpreter_type == INTERPRETER_AOUT),
> -   load_addr, interp_load_addr);
> +   load_addr, phdr_addr, interp_load_addr);
>   /* N.B. passed_fileno might not be initialized? */
>   if (interpreter_type == INTERPRETER_AOUT)
>   current->mm->arg_start += strlen(passed_fileno) + 1;

This patch kills my FC6 machine (using a config which was derived from RH's
original):

Freeing unused kernel memory: 368k freed
Write protecting the kernel read-only data: 959k
request_module: runaway loop modprobe binfmt-464c
request_module: runaway loop modprobe binfmt-464c
request_module: runaway loop modprobe binfmt-464c
request_module: runaway loop modprobe binfmt-464c
request_module: runaway loop modprobe binfmt-464c


.config: http://userweb.kernel.org/~akpm/config-akpm2.txt
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] UBI: dereference after kfree in create_vtbl

2007-05-04 Thread Florin Malita

Hi Satyam,

Satyam Sharma wrote:

Eeks ... no, wait. You found a (two, actually) bug alright, but fixed
it wrong. When we fail a write, we *must* add it to the corrupted list
and _then_ attempt to retry. So, the "if (++tries <= 5)" applies to
"if (!err) goto retry;" and not to the ubi_scan_add_to_list(). The
difference is quite subtle here ...


Not being familiar with the code, I was specifically trying to preserve 
the old semantics and only address the use-after-free issue. So if there 
was another bug... well, I guess I succeeded at preserving it ;)



The correct fix should actually be as follows: (Artem, this is diffed
on the original vtbl.c)

[snip]
+err = ubi_scan_add_to_list(si, new_seb->pnum, new_seb->ec,
+   &si->corr);

+kfree(new_seb);
+if (++tries <= 5)
if (!err)
goto retry;


There's a side effect to this change: by unconditionally overwriting err 
we lose the original error code. Then if we exceed the number of
retries we'll end up returning 0, which is probably not what you want.
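
A toy model of that failure mode (helpers invented): the failed
write's error code is clobbered by the bookkeeping call, and once the
retries run out the function reports success:

#include <stdio.h>

static int write_block(void) { return -5; }	/* the write always fails */
static int add_to_corr_list(void) { return 0; }	/* bookkeeping succeeds */

static int create_vtbl_model(void)
{
	int err, tries = 0;
retry:
	err = write_block();
	if (!err)
		return 0;
	err = add_to_corr_list();	/* clobbers the write error */
	if (++tries <= 5 && !err)
		goto retry;
	return err;	/* 0 after the retries are exhausted, not -5 */
}

int main(void)
{
	printf("create_vtbl_model() = %d\n", create_vtbl_model());
	return 0;
}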


Return code aside, it seems the only thing ubi_scan_add_to_list() does
is allocate a new struct ubi_scan_leb, initialize some fields with
values passed from new_seb and then add it to the desired list. But 
copying new_seb to a newly allocated structure and then immediately 
freeing the old one seems redundant - why not just add new_seb to the 
corrupted list and be done? Then we don't have to deal with allocation 
failures in an error path anymore - something like this (diff against 
the original code):


Signed-off-by: Florin Malita <[EMAIL PROTECTED]>
---

diff --git a/drivers/mtd/ubi/vtbl.c b/drivers/mtd/ubi/vtbl.c
index b6fd6bb..2ad2d59 100644
--- a/drivers/mtd/ubi/vtbl.c
+++ b/drivers/mtd/ubi/vtbl.c
@@ -317,14 +317,10 @@ retry:
return err;

write_error:
-   kfree(new_seb);
-   /* May be this physical eraseblock went bad, try to pick another one */
-   if (++tries <= 5) {
-   err = ubi_scan_add_to_list(si, new_seb->pnum, new_seb->ec,
-  &si->corr);
-   if (!err)
-   goto retry;
-   }
+   /* Maybe this physical eraseblock went bad, try to pick another one */
+   list_add_tail(&new_seb->u.list, &si->corr);
+   if (++tries <= 5)
+   goto retry;
out_free:
ubi_free_vid_hdr(ubi, vid_hdr);
return err;

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Blacklist Dell Optiplex 320 from using the HPET

2007-05-04 Thread Andi Kleen
On Friday 04 May 2007 23:44:08 Andrew Morton wrote:
> On Fri, 04 May 2007 14:29:04 -0700
> john stultz <[EMAIL PROTECTED]> wrote:
> 
> > One of the 2.6.21 regressions was Guilherme's problem seeing his box
> > lock up when the system detected an unstable TSC and dropped back to
> > using the HPET.
> > 
> > In digging deeper, we found the HPET is not actually incrementing on
> > this system. And in fact, the reason why this issue just cropped up was
> > because of Thomas's clocksource watchdog code was comparing the TSC to
> > the HPET (which wasn't moving) and thought the TSC was broken.
> > 
> > Anyway, Guliherme checked for a BIOS update and did not find one, so
> > I've added a DMI blacklist against his system so the HPET is not used.
> > 
> > Many thanks to Guilherme for the slow and laborious testing that finally
> > narrowed down this issue.
> > 
> 
> OK, I tagged that for -stable too.

Don't please. It is completely the wrong approach. DMI should be only last 
resort,
not first.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Remove constructor from buffer_head

2007-05-04 Thread Andi Kleen
On Friday 04 May 2007 23:33:47 Andrew Morton wrote:
> On Fri, 4 May 2007 13:42:12 -0700

> 
> 2.6.20:
> 
> akpm2:/home/akpm> opcontrol --start-daemon
> /usr/bin/opcontrol: line 1098: /dev/oprofile/0/enabled: No such file or 
> directory
> /usr/bin/opcontrol: line 1098: /dev/oprofile/0/event: No such file or 
> directory
> /usr/bin/opcontrol: line 1098: /dev/oprofile/0/count: No such file or 
> directory
> /usr/bin/opcontrol: line 1098: /dev/oprofile/0/kernel: No such file or 
> directory
> /usr/bin/opcontrol: line 1098: /dev/oprofile/0/user: No such file or directory
> /usr/bin/opcontrol: line 1098: /dev/oprofile/0/unit_mask: No such file or 
> directory

This isn't a problem anymore since the nmi watchdog is off by default now.

> 2.6.21:
> 
> akpm2:/home/akpm# opreport -l /boot/vmlinux-$(uname -r) | head -50
> opreport error: No sample file found: try running opcontrol --dump
> or specify a session containing sample files

For me it works on a slightly post 2.6.21 kernel with suse oprofile-0.9.2-21

Did you try opcontrol --dump? 

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Blacklist Dell Optiplex 320 from using the HPET

2007-05-04 Thread Andi Kleen
On Friday 04 May 2007 23:29:04 john stultz wrote:
> One of the 2.6.21 regressions was Guilherme's problem seeing his box
> lock up when the system detected an unstable TSC and dropped back to
> using the HPET.
> 
> In digging deeper, we found the HPET is not actually incrementing on
> this system. And in fact, the reason why this issue just cropped up was
> because of Thomas's clocksource watchdog code was comparing the TSC to
> the HPET (which wasn't moving) and thought the TSC was broken.
> 
> Anyway, Guliherme checked for a BIOS update and did not find one, so
> I've added a DMI blacklist against his system so the HPET is not used.
> 
> Many thanks to Guilherme for the slow and laborious testing that finally
> narrowed down this issue.

Before going to hard-to-maintain DMI blacklists we should first check
whether it's a more general problem that can be solved better. Most likely
that system isn't the only one with this issue and I don't want to apply
DMI patches forever.

In particular: what lspci chipset does it have?  If it's Intel it might be
worth checking the datasheet if there is some "HPET stop" bit -- perhaps it 
could be fixed up.

We seem to have a couple of Intel systems recently with HPET trouble.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] i386: always clear bss

2007-05-04 Thread H. Peter Anvin
Eric W. Biederman wrote:
> 
> My notes show 0x5c reserved for additional apm_bios_info, although
> of the top of my head I don't know how realistic that is.
> 
> 0x1e4 does look available.
> 
> It has been a long time since I made that choice, and I do see that
> looking at struct screen_info I did remember to document that I was
> using 0x3c, even in your structure.
> 
> It is all internal to our boot process and external code isn't going
> to use it so we can change it if we feel like.
> 

I don't see the actual instruction that does that anywhere in my tree,
which was branched from Andi's "for-linus" git tree, but I have reserved
0x1e4 for that purpose as "scratch".

-hpa


Re: [RFT][PATCH] swsusp: Change code ordering related to ACPI

2007-05-04 Thread Ray Lee

On 5/4/07, Rafael J. Wysocki <[EMAIL PROTECTED]> wrote:

The change of the hibernation/suspend code ordering made before 2.6.21 has
caused some systems to have problems related to ACPI.  In particular, the
'platform' hibernation mode doesn't work any more on some systems.


It seems that somewhere between 2.6.21-rc4 and 2.6.21 final my laptop
stopped being able to come out of suspend to RAM. Before I start
bisecting (again, sigh), is this ringing any bells for anyone? In
particular, would your patch (snipped) that deals with hibernation
also affect suspend to RAM?

Ray


Re: [RFC 2/3] SLUB: Implement targeted reclaim and partial list defragmentation

2007-05-04 Thread Christoph Lameter
Fixes suggested by Andrew

---
 include/linux/slab.h |   12 ++++++++++++
 mm/slub.c            |   32 +++++++++++++++++++++-----------
 2 files changed, 33 insertions(+), 11 deletions(-)

Index: slub/mm/slub.c
===================================================================
--- slub.orig/mm/slub.c	2007-05-04 15:52:54.000000000 -0700
+++ slub/mm/slub.c	2007-05-04 15:53:11.000000000 -0700
@@ -2142,42 +2142,46 @@ EXPORT_SYMBOL(kfree);
  *
  * Return error code or number of remaining objects
  */
-static int __kmem_cache_vacate(struct kmem_cache *s, struct page *page)
+static int __kmem_cache_vacate(struct kmem_cache *s,
+   struct page *page, unsigned long flags)
 {
void *p;
void *addr = page_address(page);
-   unsigned long map[BITS_TO_LONGS(s->objects)];
+   DECLARE_BITMAP(map, s->objects);
int leftover;
 
if (!page->inuse)
return 0;
 
/* Determine free objects */
-   bitmap_zero(map, s->objects);
-   for(p = page->freelist; p; p = get_freepointer(s, p))
-   set_bit((p - addr) / s->size, map);
+   bitmap_fill(map, s->objects);
+   for (p = page->freelist; p; p = get_freepointer(s, p))
+   __clear_bit((p - addr) / s->size, map);
 
/*
 * Get a refcount for all used objects. If that fails then
 * no KICK callback can be performed.
 */
-   for(p = addr; p < addr + s->objects * s->size; p += s->size)
-   if (!test_bit((p - addr) / s->size, map))
+   for (p = addr; p < addr + s->objects * s->size; p += s->size)
+   if (test_bit((p - addr) / s->size, map))
if (!s->slab_ops->get_reference(p))
-   set_bit((p - addr) / s->size, map);
+   __clear_bit((p - addr) / s->size, map);
 
/* Got all the references we need. Now we can drop the slab lock */
slab_unlock(page);
+   local_irq_restore(flags);
 
/* Perform the KICK callbacks to remove the objects */
for(p = addr; p < addr + s->objects * s->size; p += s->size)
-   if (!test_bit((p - addr) / s->size, map))
+   if (test_bit((p - addr) / s->size, map))
s->slab_ops->kick_object(p);
 
+   local_irq_save(flags);
slab_lock(page);
leftover = page->inuse;
ClearPageActive(page);
putback_slab(s, page);
+   local_irq_restore(flags);
return leftover;
 }
 
@@ -2197,6 +2201,7 @@ static void remove_from_lists(struct kme
  */
 int kmem_cache_vacate(struct page *page)
 {
+   unsigned long flags;
struct kmem_cache *s;
int rc = 0;
 
@@ -2208,6 +2213,7 @@ int kmem_cache_vacate(struct page *page)
if (!PageSlab(page))
goto out;
 
+   local_irq_save(flags);
slab_lock(page);
 
/*
@@ -2221,6 +2227,7 @@ int kmem_cache_vacate(struct page *page)
 */
if (!PageSlab(page) || PageActive(page) || !page->inuse) {
slab_unlock(page);
+   local_irq_restore(flags);
goto out;
}
 
@@ -2231,7 +2238,7 @@ int kmem_cache_vacate(struct page *page)
s = page->slab;
remove_from_lists(s, page);
SetPageActive(page);
-   rc = __kmem_cache_vacate(s, page) == 0;
+   rc = __kmem_cache_vacate(s, page, flags) == 0;
 out:
put_page(page);
return rc;
@@ -2336,8 +2343,11 @@ int kmem_cache_shrink(struct kmem_cache 
 
/* Now we can free objects in the slabs on the zaplist */
list_for_each_entry_safe(page, page2, &zaplist, lru) {
+   unsigned long flags;
+
+   local_irq_save(flags);
slab_lock(page);
-   __kmem_cache_vacate(s, page);
+   __kmem_cache_vacate(s, page, flags);
}
}
 
Index: slub/include/linux/slab.h
===================================================================
--- slub.orig/include/linux/slab.h	2007-05-04 15:53:06.000000000 -0700
+++ slub/include/linux/slab.h	2007-05-04 15:53:17.000000000 -0700
@@ -42,7 +42,19 @@ struct slab_ops {
void (*ctor)(void *, struct kmem_cache *, unsigned long);
/* FIXME: Remove all destructors ? */
void (*dtor)(void *, struct kmem_cache *, unsigned long);
+   /*
+* Called with slab lock held and interrupts disabled.
+* No slab operations may be performed in get_reference.
+*
+* Must return 1 if a reference was obtained;
+* 0 if we failed to obtain the reference (e.g.
+* the object was concurrently freed).
+*/
int (*get_reference)(void *);
+   /*
+* Called with no locks held and interrupts enabled.
+* Any operation may be performed in kick_object.
+*/
void (*kick_object)(void *);
 };
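
To make the calling contract above concrete, a hypothetical user of the two
hooks could look like this (every name below is invented for illustration;
none of this is in the patch):

/* Hypothetical refcounted object type */
struct my_obj {
	atomic_t refcount;
	/* ... payload ... */
};

extern void my_obj_relocate(struct my_obj *o);	/* move contents elsewhere */
extern void my_obj_put(struct my_obj *o);	/* drop a reference */

/*
 * Runs under the slab lock with interrupts disabled, so it must stay
 * atomic: fail if the object is already on its way to being freed.
 */
static int my_obj_get_reference(void *obj)
{
	struct my_obj *o = obj;

	return atomic_inc_not_zero(&o->refcount);
}

/*
 * Runs with no locks held and interrupts enabled, so it may sleep,
 * move the object elsewhere and drop the reference taken above.
 */
static void my_obj_kick_object(void *obj)
{
	struct my_obj *o = obj;

	my_obj_relocate(o);
	my_obj_put(o);
}

static struct slab_ops my_obj_slab_ops = {
	.get_reference	= my_obj_get_reference,
	.kick_object	= my_obj_kick_object,
};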
 

Re: [SOLVED] Serial buffer corruption [was Re: FTDI usb-serial possible bug]

2007-05-04 Thread Paul Fulghum
Antonino:

Can you try two tests (with my patch applied):

1. comment out the tty_flush_buffer() call in tty_ldisc_flush() and test

2. uncomment (reenable) the above call and comment out the
tty_flush_buffer() call in tty_ioctl() and test
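
In other words, something like this -- a sketch only: the body of
tty_ldisc_flush() is from memory, and the exact spot where the patch adds
the tty_flush_buffer() call is an assumption:

/* Test 1: disable the buffer flush in tty_ldisc_flush() */
void tty_ldisc_flush(struct tty_struct *tty)
{
	struct tty_ldisc *ld = tty_ldisc_ref(tty);

	if (ld) {
		if (ld->flush_buffer)
			ld->flush_buffer(tty);
		tty_ldisc_deref(ld);
	}
	/* tty_flush_buffer(tty); */	/* call added by the patch, commented out */
}

/*
 * Test 2: re-enable the call above and instead comment out the
 * corresponding tty_flush_buffer() call in tty_ioctl().
 */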

Thanks,
Paul



RFC airo : wpa support

2007-05-04 Thread matthieu castet

Hi,

I attach a diff against 2.6.21 adding WPA support to the airo driver.
At the end of 2005 I managed to get WPA working, but the code was really
ugly. I have now found some time to clean it up.



To support WPA, a new firmware interface must be used. This interface is
incompatible with the old one, and using both interfaces at the same
time makes the firmware hang.


Porting OPEN and WEP modes to the new interface needs some driver
rewriting, and the old interface should be kept for older cards (or
cards that don't have newer firmware).
That's why I didn't do it; instead I added a module parameter to choose
between the old driver behavior (with no WPA support) and the driver
with WPA-only support.



The wireless extension handlers are a bit ugly; I would be very happy if
somebody could help me clean them up.


Any comments are appreciated.


Matthieu


PS : the latest version of the driver can be found at
http://svn.gna.org/viewcvs/airo-wpa/branches/kernel/


PS2 : there are some remaining debug traces in the driver; they will be
removed in the final version.
Index: airo.c
===================================================================
--- airo.c	(revision 16)
+++ airo.c	(working copy)
@@ -16,6 +16,7 @@
 Code was also integrated from the Cisco Aironet driver for Linux.
 Support for MPI350 cards was added by Fabrice Bellet
 <[EMAIL PROTECTED]>.
+(C) 2005-2007 Matthieu CASTET <[EMAIL PROTECTED]> for WPA support
 
 ==*/
 
@@ -91,6 +92,12 @@
 #include 
 #endif
 
+/* Enable rx MIC checking.
+ * Disabled because it takes some time in the ISR.
+ * XXX crypto doesn't work anymore in ISR on 2.6.21
+ */
+//#define WPA_CHECK_RX_MIC
+
 /* Hack to do some power saving */
 #define POWER_ON_DOWN
 
@@ -223,6 +230,7 @@
 int maxencrypt /* = 0 */; /* The highest rate that the card can encrypt at.
 		   0 means no limit.  For old cards this was 4 */
 
+static int wpa_enabled; /* If set the card is in WPA mode. This is incompatible with WEP or open mode */
 static int auto_wep /* = 0 */; /* If set, it tries to figure out the wep mode */
 static int aux_bap /* = 0 */; /* Checks to see if the aux ports are needed to read
 		the bap, needed on some older cards and buses. */
@@ -250,6 +258,9 @@
 module_param_array(rates, int, NULL, 0);
 module_param_array(ssids, charp, NULL, 0);
 module_param(auto_wep, int, 0);
+module_param(wpa_enabled, int, 0);
+MODULE_PARM_DESC(wpa_enabled, "If non-zero, the driver can use WPA \
+but Open and WEP mode won't be possible");
 MODULE_PARM_DESC(auto_wep, "If non-zero, the driver will keep looping through \
 the authentication options until an association is made.  The value of \
 auto_wep is number of the wep keys to check.  A value of 2 will try using \
@@ -452,6 +463,7 @@
 #define RID_UNKNOWN22  0xFF22
 #define RID_LEAPUSERNAME 0xFF23
 #define RID_LEAPPASSWORD 0xFF24
+#define RID_WPA        0xFF25
 #define RID_STATUS 0xFF50
 #define RID_BEACON_HST 0xFF51
 #define RID_BUSY_HST   0xFF52
@@ -506,6 +518,14 @@
 	u8 key[16];
 } WepKeyRid;
 
+typedef struct {
+	u16 len;
+	u16 kindex;
+	u8 mac[ETH_ALEN];
+	u16 klen;
+	u8 key[48];
+} WpaKeyRid;
+
 /* These structures are from the Aironet's PC4500 Developers Manual */
 typedef struct {
 	u16 len;
@@ -525,7 +545,20 @@
 #define MOD_MOK 2
 } ModulationRid;
 
+/* Only present on firmware >= 5.30.17 */
 typedef struct {
+	u16 _reserved5[4];
+	u16 auth_cipher;
+#define AUTH_CIPHER_NONE 1
+#define AUTH_CIPHER_WEP 0xc
+#define AUTH_CIPHER_TKIP 0x210
+	u16 auth_key;
+#define AUTH_KEY_MGMT_NONE 1
+#define AUTH_KEY_MGMT_802_1X 4
+#define AUTH_KEY_MGMT_PSK 8
+} ConfigRidExtra;
+
+typedef struct {
 	u16 len; /* sizeof(ConfigRid) */
 	u16 opmode; /* operating mode */
 #define MODE_STA_IBSS 0
@@ -580,6 +613,7 @@
 #define AUTH_ENCRYPT 0x101
 #define AUTH_SHAREDKEY 0x102
 #define AUTH_ALLOW_UNENCRYPTED 0x200
+#define AUTH_ENCRYPT_WPA 0xc101
 	u16 associationTimeout;
 	u16 specifiedApTimeout;
 	u16 offlineScanInterval;
@@ -643,6 +677,8 @@
 #define MAGIC_STAY_IN_CAM (1<<10)
 	u8 magicControl;
 	u16 autoWake;
+	/* Only present on firmware >= 5.30.17 */
+	ConfigRidExtra extra;
 } ConfigRid;
 
 typedef struct {
@@ -1227,6 +1263,15 @@
 	unsigned int bssListFirst;
 	unsigned int bssListNext;
 	unsigned int bssListRidLen;
+	unsigned int ConfigRidLen;
+	unsigned char wpa_tx_key [8];
+	unsigned char wpa_rx_key [8];
+	unsigned char wpa_rx_key_m [8];
+	unsigned char wpa_rx_key_m_old [8];
+	u8 LLC [10];
+	struct crypto_hash *tfm_michael;
+	int wpa_enabled;
+	int wpa_key_enabled;
 
 	struct list_head network_list;
 	struct list_head network_free_list;
@@ -1723,6 +1768,55 @@
 	digest[3] = val & 0xFF;
 }
 
+static void wpa_compute_mic(struct airo_info *ai, char *pPacket, u8 *mic, int len, char *key)
+{
+	struct scatterlist sg[3];
+	struct hash_desc desc;
+
+	sg[0].page = virt_to_page(pPacket);
+	sg[0].offset = offset_in_page(pPacket);
+	sg[0].length = 

Re: cpufreq longhaul locks up

2007-05-04 Thread john stultz
On Fri, 2007-05-04 at 23:02 +0200, Jan Engelhardt wrote:
> On May 4 2007 13:37, john stultz wrote:
> >> 
> >> I found that setting the cpufreq governor to ondemand makes the box
> >> lock up solid in 2.6.20.2 and 2.6.21 after a few seconds. Sysrq 
> >> does not work anymore, and the last messages are:
> >> 
> >> May  3 19:16:58 cn kernel: longhaul: VIA C3 'Nehemiah C' [C5P] CPU 
> >> detected.  Powersaver supported.
> >> May  3 19:16:58 cn kernel: longhaul: Using northbridge support.
> >> May  3 19:17:22 cn kernel: Time: acpi_pm clocksource has been installed.
> >> May  3 19:17:22 cn kernel: Clocksource tsc unstable (delta = -136422685 
> >> ns)
> >
> >What happens if you boot without the ondemand governor but w/
> >clocksource=acpi_pm ?
> 
> I always let it boot with the default gov (performance), then
> use cpufreq-set to change it.
> 
> acpi_pm+performance behaves like tsc+performance, which works.
> 
> When switching from tsc+performance to (tsc+)ondemand, acpi_pm gets
> used because of the unstable tsc (of course, since we changed
> frequency and the cpu does NOT have constant_tsc), so it naturally
> becomes acpi_pm+ondemand.

Ok. I just wanted to make sure it wasn't the ACPI PM clocksource that was
broken, causing the hang when the system switched to it.

> Switching from acpi_pm+performance to acpi_pm+ondemand also
> locks up after a few minutes.

Yep. Sounds like an ondemand issue. Thanks for verifying this for me.

-john


