[PATCH] kbuild: be more explicit on missing .config file

2007-04-04 Thread Randy Dunlap
From: Randy Dunlap <[EMAIL PROTECTED]>

Somewhat in response to kernel bugzilla #8197, be more explicit about
why 'make all' fails when there is no .config file.

Signed-off-by: Randy Dunlap <[EMAIL PROTECTED]>
---
 scripts/kconfig/conf.c |1 +
 1 file changed, 1 insertion(+)

--- linux-2621-rc5g6.orig/scripts/kconfig/conf.c
+++ linux-2621-rc5g6/scripts/kconfig/conf.c
@@ -558,6 +558,7 @@ int main(int ac, char **av)
if (stat(".config", )) {
printf(_("***\n"
"*** You have not yet configured your kernel!\n"
+   "*** (missing kernel .config file)\n"
"***\n"
"*** Please run some configurator (e.g. \"make oldconfig\" or\n"
"*** \"make menuconfig\" or \"make xconfig\").\n"
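The patched check itself is simple: stat() fails when no .config exists, which is what triggers the message above. A minimal userspace sketch of the same logic (not the kernel's exact code; the path and message here are illustrative):

```c
#include <stdio.h>
#include <sys/stat.h>

/* Return 1 (and print the hint) if the given config file is missing,
 * mirroring the stat() check extended by the patch above. */
static int warn_if_missing(const char *path)
{
	struct stat st;

	if (stat(path, &st)) {
		printf("*** You have not yet configured your kernel!\n"
		       "*** (missing kernel %s file)\n", path);
		return 1;
	}
	return 0;
}
```

Calling warn_if_missing(".config") in a configured tree returns 0 silently; on a missing file it prints the hint and returns 1.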
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 20/20] Add apply_to_page_range() which applies a function to a pte range.

2007-04-04 Thread Matt Mackall
On Wed, Apr 04, 2007 at 12:12:11PM -0700, Jeremy Fitzhardinge wrote:
> Add a new mm function apply_to_page_range() which applies a given
> function to every pte in a given virtual address range in a given mm
> structure. This is a generic alternative to cut-and-pasting the Linux
> idiomatic pagetable walking code in every place that a sequence of
> PTEs must be accessed.

As we discussed before, this obviously has a lot in common with my
walk_page_range code.

The major difference, and one your above description seems to be
missing, is the important detail of why it's doing this:

> + pte_alloc_kernel(pmd, addr) :
> + pmd = pmd_alloc(mm, pud, addr);
> + pud = pud_alloc(mm, pgd, addr);

..which is mentioned here:

> +/*
> + * Scan a region of virtual memory, filling in page tables as necessary
> + * and calling a provided function on each leaf page table.
> + */

But I'm not sure what the use case is that wants filling in the page
table..? If both modes really make sense, perhaps a flag could unify
these differences.

> +typedef int (*pte_fn_t)(pte_t *pte, struct page *pmd_page, unsigned long addr,
> + void *data);

I'd gotten the impression that these sorts of typedefs were out of
fashion.

> +static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
> +  unsigned long addr, unsigned long end,
> +  pte_fn_t fn, void *data)
> +{
> + pte_t *pte;
> + int err;
> + struct page *pmd_page;
> + spinlock_t *ptl;
> +
> + pte = (mm == _mm) ?
> + pte_alloc_kernel(pmd, addr) :
> + pte_alloc_map_lock(mm, pmd, addr, );
> + if (!pte)
> + return -ENOMEM;

Seems a bit awkward to pass mm all the way down the tree just for this
quirk. It also means that whether or not a lock is held in the callback
is context dependent.

smaps, clear_ref, and my pagemap code all use the callback at the
pmd_range level, which a) localizes the pte-level locking concerns
with the user b) amortizes the indirection overhead and c)
(unfortunately) makes the user a bit more complex.

We should try to measure whether (b) actually makes a difference.

> + do {
> + err = fn(pte, pmd_page, addr, data);
> + if (err)
> + break;
> + } while (pte++, addr += PAGE_SIZE, addr != end);

I was about to say this do/while format seems a bit non-idiomatic for
page table walkers, but then I looked at the code in mm/memory.c and
realized the stuff I've been hacking on is the odd one out.
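For reference, the do/while walker structure under discussion can be reproduced in plain userspace C over an array of slots, with the callback playing the role of pte_fn_t (a sketch; locking and real page tables are elided, and all names here are invented):

```c
#define TOY_PAGE_SIZE 4096UL

/* Callback applied to each slot, in the style of pte_fn_t. */
typedef int (*slot_fn_t)(unsigned long *slot, unsigned long addr, void *data);

/* Walk [addr, end) one TOY_PAGE_SIZE step at a time, advancing the
 * slot pointer and the address together in the while clause, exactly
 * the shape used by apply_to_pte_range in mm/memory.c. */
static int apply_to_range(unsigned long *slot, unsigned long addr,
			  unsigned long end, slot_fn_t fn, void *data)
{
	int err;

	do {
		err = fn(slot, addr, data);
		if (err)
			break;
	} while (slot++, addr += TOY_PAGE_SIZE, addr != end);
	return err;
}

/* Example callback: record the address and count invocations. */
static int record_slot(unsigned long *slot, unsigned long addr, void *data)
{
	*slot = addr;
	(*(int *)data)++;
	return 0;
}
```

Walking four "pages" invokes the callback exactly four times, since the do/while body always runs before the first termination test.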

-- 
Mathematics is the supreme nostalgia of our time.


Re: [rfc] no ZERO_PAGE?

2007-04-04 Thread Nick Piggin
On Wed, Apr 04, 2007 at 08:35:30AM -0700, Linus Torvalds wrote:
> 
> 
> On Wed, 4 Apr 2007, Nick Piggin wrote:
> > 
> > Shall I do a more complete patchset and ask Andrew to give it a
> > run in -mm?
> 
> Do this trivial one first. See how it fares.

OK.

> Although I don't know how much -mm will do for it. There is certainly not 
> going to be any correctness problems, afaik, just *performance* problems. 
> Does anybody do any performance testing on -mm?
> 
> That said, talking about correctness/performance problems:
> 
> > +   page_table = pte_offset_map_lock(mm, pmd, address, );
> > +   if (likely(!pte_none(*page_table))) {
> > inc_mm_counter(mm, anon_rss);
> > lru_cache_add_active(page);
> > page_add_new_anon_rmap(page, vma, address);
> 
> Isn't that test the wrong way around?
> 
> Shouldn't it be
> 
>   if (likely(pte_none(*page_table))) {
> 
> without any logical negation? Was this patch tested?

Yeah, untested of course. I'm having problems booting my normal test box,
so the main point of the patch was to generate some discussion (which
worked! ;)).

Thanks,
Nick



Re: [-mm3 PATCH] (Retry) Check the return value of kobject_add and etc.

2007-04-04 Thread WANG Cong
On Mon, Apr 02, 2007 at 01:01:28PM +0200, Cornelia Huck wrote:
>On Sun, 1 Apr 2007 15:32:34 +0800,
>"Cong WANG" <[EMAIL PROTECTED]> wrote:
>
>> --- linux-2.6.21-rc5-mm3/fs/partitions/check.c.orig  2007-03-30 21:35:45.0 +0800
>> +++ linux-2.6.21-rc5-mm3/fs/partitions/check.c   2007-03-30 21:49:53.0 +0800
>> @@ -385,10 +385,16 @@ void add_partition(struct gendisk *disk,
>>  p->kobj.parent = >kobj;
>>  p->kobj.ktype = _part;
>>  kobject_init(>kobj);
>> -kobject_add(>kobj);
>> +if (kobject_add(>kobj)) {
>> +kfree(p);
>> +return;
>> +}
>>  if (!disk->part_uevent_suppress)
>>  kobject_uevent(>kobj, KOBJ_ADD);
>> -sysfs_create_link(>kobj, _subsys.kset.kobj, "subsystem");
>> +if (sysfs_create_link(>kobj, _subsys.kset.kobj, "subsystem")) {
>> +kfree(p);
>> +return;
>> +}
>
>You need to properly undo whatever you did before. You're missing a
>KOBJ_DEL uevent and a kobject_del here.
>
>>  if (flags & ADDPART_FLAG_WHOLEDISK) {
>>  static struct attribute addpartattr = {
>>  .name = "whole_disk",
>> @@ -396,7 +402,10 @@ void add_partition(struct gendisk *disk,
>>  .owner = THIS_MODULE,
>>  };
>> 
>> -sysfs_create_file(>kobj, );
>> +if (sysfs_create_file(>kobj, )) {
>> +kfree(p);
>> +return;
>> +}
>
>Also here.

Sorry for my delay. My mutt has been acting up these days. ;-(

OK. Thanks for pointing that out. I have remade the patch; is it OK now?
I find that -mm4 still produces these warnings when compiling this file. It is
not fixed yet, and the following patch can also be applied to -mm4.



--- linux-2.6.21-rc5-mm3/fs/partitions/check.c.orig 2007-03-30 21:35:45.0 +0800
+++ linux-2.6.21-rc5-mm3/fs/partitions/check.c  2007-04-02 21:29:02.0 +0800
@@ -385,10 +385,18 @@ void add_partition(struct gendisk *disk,
p->kobj.parent = >kobj;
p->kobj.ktype = _part;
kobject_init(>kobj);
-   kobject_add(>kobj);
+   if (kobject_add(>kobj)) {
+   kfree(p);
+   return;
+   }
if (!disk->part_uevent_suppress)
kobject_uevent(>kobj, KOBJ_ADD);
-   sysfs_create_link(>kobj, _subsys.kset.kobj, "subsystem");
+   if (sysfs_create_link(>kobj, _subsys.kset.kobj, "subsystem")) {
+   kobject_uevent(>kobj, KOBJ_REMOVE);
+   kobject_del(>kobj);
+   kfree(p);
+   return;
+   }
if (flags & ADDPART_FLAG_WHOLEDISK) {
static struct attribute addpartattr = {
.name = "whole_disk",
@@ -396,7 +404,13 @@ void add_partition(struct gendisk *disk,
.owner = THIS_MODULE,
};
 
-   sysfs_create_file(>kobj, );
+   if (sysfs_create_file(>kobj, )) {
+   sysfs_remove_link(>kobj, "subsystem");
+   kobject_uevent(>kobj, KOBJ_REMOVE);
+   kobject_del(>kobj);
+   kfree(p);
+   return;
+   }
}
partition_sysfs_add_subdir(p);
disk->part[part-1] = p;
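For comparison, the repeated unwind sequences in the patch above are what kernel code commonly collapses into a goto-based error path, so each failure point releases exactly what was set up before it. A userspace sketch of the pattern (the step functions are hypothetical stand-ins for kobject_add / sysfs_create_link / sysfs_create_file):

```c
#include <stdlib.h>
#include <string.h>

/* Stand-ins for the setup calls that can fail. */
static int step_ok(void)   { return 0; }
static int step_fail(void) { return -1; }

/* Perform three setup steps; on failure, unwind in reverse order via
 * cascading goto labels, logging each undo so the order can be seen.
 * This is the usual kernel idiom for code like add_partition. */
static int add_partition_like(int (*s1)(void), int (*s2)(void),
			      int (*s3)(void), char *log)
{
	char *p = malloc(16);	/* plays the role of the partition struct */

	if (!p)
		return -1;
	if (s1())
		goto out_free;
	if (s2())
		goto out_undo1;
	if (s3())
		goto out_undo2;
	free(p);
	return 0;

out_undo2:
	strcat(log, "undo2 ");	/* e.g. sysfs_remove_link */
out_undo1:
	strcat(log, "undo1 ");	/* e.g. kobject_del */
out_free:
	free(p);
	return -1;
}
```

Failing at the third step runs both undo steps in reverse order; succeeding runs none, without duplicating the cleanup at every return.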





[PATCH] various drivers PCI must_checks

2007-04-04 Thread Randy Dunlap
From: Randy Dunlap <[EMAIL PROTECTED]>

Check PCI interface function results in parport, serial, & video drivers.

drivers/parport/parport_serial.c:402: warning: ignoring return value of 'pci_enable_device', declared with attribute warn_unused_result
drivers/serial/8250_pci.c:1826: warning: ignoring return value of 'pci_enable_device', declared with attribute warn_unused_result
drivers/video/s3fb.c:1078: warning: ignoring return value of 'pci_enable_device', declared with attribute warn_unused_result

Signed-off-by: Randy Dunlap <[EMAIL PROTECTED]>
---
 drivers/parport/parport_serial.c |8 +++-
 drivers/serial/8250_pci.c|   10 +-
 drivers/video/s3fb.c |9 -
 3 files changed, 24 insertions(+), 3 deletions(-)

--- linux-2.6.21-rc5-mm4.orig/drivers/parport/parport_serial.c
+++ linux-2.6.21-rc5-mm4/drivers/parport/parport_serial.c
@@ -392,6 +392,7 @@ static int parport_serial_pci_suspend(st
 static int parport_serial_pci_resume(struct pci_dev *dev)
 {
struct parport_serial_private *priv = pci_get_drvdata(dev);
+   int err;
 
pci_set_power_state(dev, PCI_D0);
pci_restore_state(dev);
@@ -399,7 +400,12 @@ static int parport_serial_pci_resume(str
/*
 * The device may have been disabled.  Re-enable it.
 */
-   pci_enable_device(dev);
+   err = pci_enable_device(dev);
+   if (err) {
+   printk(KERN_ERR "parport_serial: %s: error enabling "
+   "device for resume (%d)\n", pci_name(dev), err);
+   return err;
+   }
 
if (priv->serial)
pciserial_resume_ports(priv->serial);
--- linux-2.6.21-rc5-mm4.orig/drivers/serial/8250_pci.c
+++ linux-2.6.21-rc5-mm4/drivers/serial/8250_pci.c
@@ -1820,10 +1820,18 @@ static int pciserial_resume_one(struct p
pci_restore_state(dev);
 
if (priv) {
+   int err;
+
/*
 * The device may have been disabled.  Re-enable it.
 */
-   pci_enable_device(dev);
+   err = pci_enable_device(dev);
+   if (err) {
+   printk(KERN_ERR "8250_pci: %s: error %d "
+   "enabling device for resume\n",
+   pci_name(dev), err);
+   return err;
+   }
 
pciserial_resume_ports(priv);
}
--- linux-2.6.21-rc5-mm4.orig/drivers/video/s3fb.c
+++ linux-2.6.21-rc5-mm4/drivers/video/s3fb.c
@@ -1061,6 +1061,7 @@ static int s3_pci_resume(struct pci_dev*
 {
struct fb_info *info = pci_get_drvdata(dev);
struct s3fb_info *par = info->par;
+   int err;
 
dev_info(&(dev->dev), "resume\n");
 
@@ -1075,7 +1076,13 @@ static int s3_pci_resume(struct pci_dev*
 
pci_set_power_state(dev, PCI_D0);
pci_restore_state(dev);
-   pci_enable_device(dev);
+   err = pci_enable_device(dev);
+   if (err) {
+   mutex_unlock(&(par->open_lock));
+   release_console_sem();
+   dev_err(&(dev->dev), "error %d enabling device for resume\n", err);
+   return err;
+   }
pci_set_master(dev);
 
s3fb_set_par(info);
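The warnings quoted at the top of this patch come from pci_enable_device() being tagged __must_check, which expands to GCC's warn_unused_result attribute. The same mechanism can be tried in userspace (a sketch assuming GCC or Clang; the function names are invented):

```c
#include <stdio.h>

/* Like the kernel's __must_check: the compiler warns whenever a
 * caller discards this function's return value. */
__attribute__((warn_unused_result))
static int enable_device(int fail)
{
	return fail ? -1 : 0;
}

/* A resume-style caller that checks the result, as the patch does.
 * Writing a bare `enable_device(0);` here would trigger the warning. */
static int resume_device(int fail)
{
	int err = enable_device(fail);

	if (err) {
		fprintf(stderr, "error %d enabling device for resume\n", err);
		return err;
	}
	return 0;
}
```

Assigning the result and testing it, as each hunk in the patch does, is the minimal change that silences the warning while actually propagating the error.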



Re: 2.6.21-rc5-mm4 (SLUB)

2007-04-04 Thread Christoph Lameter

Here is a patch that adds validation (only for cpu slabs and partial
slabs, but that's where the action is). Apply this patch
and then do

echo 1 >/sys/slab//validate

I suggest booting with full debugging and then running this on the ACPI slabs.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

Index: linux-2.6.21-rc5-mm4/mm/slub.c
===
--- linux-2.6.21-rc5-mm4.orig/mm/slub.c 2007-04-04 20:26:03.0 -0700
+++ linux-2.6.21-rc5-mm4/mm/slub.c  2007-04-04 21:26:15.0 -0700
@@ -2280,6 +2280,67 @@ void *__kmalloc_node_track_caller(size_t
 
 #ifdef CONFIG_SYSFS
 
+static int validate_slab(struct kmem_cache *s, struct page *page)
+{
+   void *p;
+   void *addr = page_address(page);
+   unsigned long map[BITS_TO_LONGS(s->objects)];
+
+   if (!check_slab(s, page) ||
+   !on_freelist(s, page, NULL))
+   return 0;
+
+   /* Now we know that a valid freelist exists */
+   bitmap_zero(map, s->objects);
+
+   for(p = page->freelist; p; p = get_freepointer(s, p)) {
+   set_bit((p - addr) / s->size, map);
+   if (!check_object(s, page, p, 0))
+   return 0;
+   }
+
+   for(p = addr; p < addr + s->objects * s->size; p += s->size)
+   if (!test_bit((p - addr) / s->size, map))
+   if (!check_object(s, page, p, 1))
+   return 0;
+   return 1;
+}
+
+static int validate_slab_node(struct kmem_cache *s, struct kmem_cache_node *n)
+{
+   int count = 0;
+   struct page *page;
+   unsigned long flags;
+
+   spin_lock_irqsave(>list_lock, flags);
+   list_for_each_entry(page, >partial, lru) {
+   if (slab_trylock(page)) {
+   validate_slab(s, page);
+   slab_unlock(page);
+   } else
+   printk(KERN_INFO "Skipped busy slab %p\n", page);
+   count++;
+   }
+   spin_unlock_irqrestore(>list_lock, flags);
+   return count;
+}
+
+static void validate_slab_cache(struct kmem_cache *s)
+{
+   int node;
+   int count = 0;
+
+   printk(KERN_INFO "--- Validating slabcache '%s'\n", s->name);
+   flush_all(s);
+   for_each_online_node(node) {
+   struct kmem_cache_node *n = get_node(s, node);
+
+   count += validate_slab_node(s, n);
+   }
+   printk(KERN_INFO "--- Checked %d slabs in '%s'\n",
+   count, s->name);
+}
+
 static unsigned long count_partial(struct kmem_cache_node *n)
 {
unsigned long flags;
@@ -2402,7 +2463,6 @@ struct slab_attribute {
static struct slab_attribute _name##_attr =  \
__ATTR(_name, 0644, _name##_show, _name##_store)
 
-
 static ssize_t slab_size_show(struct kmem_cache *s, char *buf)
 {
return sprintf(buf, "%d\n", s->size);
@@ -2609,6 +2669,22 @@ static ssize_t store_user_store(struct k
 }
 SLAB_ATTR(store_user);
 
+static ssize_t validate_show(struct kmem_cache *s, char *buf)
+{
+   return 0;
+}
+
+static ssize_t validate_store(struct kmem_cache *s,
+   const char *buf, size_t length)
+{
+   if (buf[0] == '1')
+   validate_slab_cache(s);
+   else
+   return -EINVAL;
+   return length;
+}
+SLAB_ATTR(validate);
+
 #ifdef CONFIG_NUMA
 static ssize_t defrag_ratio_show(struct kmem_cache *s, char *buf)
 {
@@ -2648,6 +2724,7 @@ static struct attribute * slab_attrs[] =
_zone_attr.attr,
_attr.attr,
_user_attr.attr,
+   _attr.attr,
 #ifdef CONFIG_ZONE_DMA
_dma_attr.attr,
 #endif
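The core of validate_slab() above is a two-pass check: mark every freelist object in an on-stack bitmap, then verify the remaining (allocated) objects separately, bailing out on any bad pointer. The same idea in a userspace sketch (a toy index-linked freelist; all names invented):

```c
#include <string.h>

#define NOBJ 8

struct toy_slab {
	int freelist[NOBJ];	/* freelist[i] = next free index, -1 ends it */
	int free_head;		/* first free object, -1 if none */
};

/* Return the number of free objects, or -1 if the freelist is corrupt
 * (out-of-range index or cycle). The bitmap doubles as the "already
 * seen" check, mirroring the map[] pass in validate_slab(). */
static int validate_toy_slab(const struct toy_slab *s)
{
	unsigned char map[NOBJ];
	int i, nfree = 0;

	memset(map, 0, sizeof(map));
	for (i = s->free_head; i != -1; i = s->freelist[i]) {
		if (i < 0 || i >= NOBJ || map[i])
			return -1;	/* bad pointer or cycle */
		map[i] = 1;
		nfree++;
	}
	return nfree;
}
```

A corrupted next pointer that loops back to an already-seen object is caught by the bitmap rather than hanging the walk, which is exactly why the kernel version builds the map before touching the allocated objects.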


Re: 2.6.20.3 AMD64 oops in CFQ code

2007-04-04 Thread Tejun Heo
Lee Revell wrote:
> On 4/4/07, Bill Davidsen <[EMAIL PROTECTED]> wrote:
>> I won't say that's voodoo, but if I ever did it I'd wipe down my
>> keyboard with holy water afterward. ;-)
>>
>> Well, I did save the message in my tricks file, but it sounds like a
>> last ditch effort after something went very wrong.

Which actually is true.  ATA ports failing to reset indicate something
is very wrong.  Either the attached device or the controller is broken
and libata shuts down the port to protect the rest of the system from
it.  The manual scan requests tell libata to give it one more shot and
polling hotplug can do that automatically.  Anyways, this shouldn't
happen unless you have a broken piece of hardware.

> Would it really be an impediment to development if the kernel
> maintainers simply refuse to merge patches that add new sysfs entries
> without corresponding documentation?

SCSI host scan nodes have been there for a long time.  I think it's
documented somewhere.

-- 
tejun


Re: missing madvise functionality

2007-04-04 Thread William Lee Irwin III
On Wed, 4 Apr 2007 06:09:18 -0700 William Lee Irwin III <[EMAIL PROTECTED]> wrote:
>> Oh dear.

On Wed, Apr 04, 2007 at 11:51:05AM -0700, Andrew Morton wrote:
> what's all this about?

I rewrote Jakub's testcase and included it as a MIME attachment.
Current working version inline below. Also at

http://holomorphy.com/~wli/jakub.c

The basic idea was that I wanted a few more niceties, such as specifying
the number of iterations and other things of that nature on the cmdline.
I threw in a little code reorganization and error checking, too.


-- wli


#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

enum thread_return {
tr_success  =  0,
tr_mmap_init= -1,
tr_mmap_free= -2,
tr_mprotect = -3,
tr_madvise  = -4,
tr_unknown  = -5,
tr_munmap   = -6,
};

enum release_method {
release_by_mmap = 0,
release_by_madvise  = 1,
release_by_max  = 2,
};

struct thread_argument {
size_t page_size;
int iterations, pages_per_thread, nr_threads;
enum release_method method;
};

static enum thread_return mmap_release(void *p, size_t n)
{
void *q;

q = mmap(p, n, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, -1, 0);
if (p != q) {
perror("thread_function: mmap release failed");
return tr_mmap_free;
}
if (mprotect(p, n, PROT_READ | PROT_WRITE)) {
perror("thread_function: mprotect failed");
return tr_mprotect;
}
return tr_success;
}

static enum thread_return madvise_release(void *p, size_t n)
{
if (madvise(p, n, MADV_DONTNEED)) {
perror("thread_function: madvise failed");
return tr_madvise;
}
return tr_success;
}

static enum thread_return (*release_methods[])(void *, size_t) = {
mmap_release,
madvise_release,
};

static void *thread_function(void *__arg)
{
char *p;
int i;
struct thread_argument *arg = __arg;
size_t arena_size = arg->pages_per_thread * arg->page_size;

p = (char *)mmap(NULL, arena_size,
PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
if (p == MAP_FAILED) {
perror("thread_function: arena allocation failed");
return (void *)tr_mmap_init;
}
for (i = 0; i < arg->iterations; i++) {
size_t s;
char *q, *r;
enum thread_return ret;

/* Pretend to use the buffer.  */
r = p + arena_size;
for (q = p; q < r; q += arg->page_size)
*q = 1;
for (s = 0, q = p; q < r; q += arg->page_size)
s += *q;
if (arg->method >= release_by_max) {
perror("thread_function: "
"unknown freeing method specified");
return (void *)tr_unknown;
}
ret = (*release_methods[arg->method])(p, arena_size);
if (ret != tr_success)
return (void *)ret;
}
if (munmap(p, arena_size)) {
perror("thread_function: munmap() failed");
return (void *)tr_munmap;
}
return (void *)tr_success;
}

static int configure(struct thread_argument *arg, int argc, char *argv[])
{
char optstring[] = "t:m:i:p:";
int c, tmp, ret = 0;
long n;

n = sysconf(_SC_PAGE_SIZE);
if (n < 0) {
perror("configure: sysconf(_SC_PAGE_SIZE) failed");
ret = -1;
}
arg->nr_threads = 32;
arg->page_size = (size_t)n;
arg->method = release_by_mmap;
arg->iterations = 10;
arg->pages_per_thread = 128;

while ((c = getopt(argc, argv, optstring)) != -1) {
switch (c) {
case 't':
if (sscanf(optarg, "%d", ) == 1)
arg->nr_threads = tmp;
else {
perror("configure: non-numeric thread count");
ret = -1;
}
break;
case 'm':
if (!strcmp(optarg, "mmap"))
arg->method = release_by_mmap;
else if (!strcmp(optarg, "madvise"))
arg->method = release_by_madvise;
else {
perror("configure: unrecognised release method");

Re: [PATCH 25/90] ARM: OMAP: h4 must have blinky leds!!

2007-04-04 Thread Randy Dunlap

Jan Engelhardt wrote:

On Apr 4 2007 14:26, Randy Dunlap (rd) wrote:
David Brownell (db) wrote:
Jan Engelhardt (je) wrote:
je>>> 
je>>> My stance, || goes at EOL, and final ) not standalone:
db>> 
db>> You are still violating the "only TABs used for indent" rule.

rd>
rd>Yes, but CodingStyle is a just set of guidelines and common practices.
rd>It doesn't cover multi-line if () statements, which is the subject of
rd>the current controvers^W discussion.

s/Yes/No/. The example block I posted does use /^\t+/.
Mind MUAs/browsers which transform tabs into single-spaces in some cases.


if (foo ||
  bar ||
  baz ||
  etc)
do_something;


I don't think it's a MUA thing.  I think David is talking about the
spaces after the ^\t that are used for indenting immediately under
the "if".

--
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***


Re: missing madvise functionality

2007-04-04 Thread Nick Piggin

Nick Piggin wrote:

Jakub Jelinek wrote:


On Wed, Apr 04, 2007 at 05:46:12PM +1000, Nick Piggin wrote:


Does mmap(PROT_NONE) actually free the memory?




Yes.
/* Clear old maps */
error = -ENOMEM;
munmap_back:
vma = find_vma_prepare(mm, addr, , _link, _parent);
if (vma && vma->vm_start < addr + len) {
if (do_munmap(mm, addr, len))
return -ENOMEM;
goto munmap_back;
}



Thanks, I overlooked the mmap vs mprotect detail. So how are the subsequent
access faults avoided?


AFAIKS, the faults are not avoided. Not for single page allocations, not
for multi-page allocations.

So what glibc currently does to allocate, use, then deallocate a page is
this:
  mprotect(PROT_READ|PROT_WRITE) -> down_write(mmap_sem)
  touch page -> page fault -> down_read(mmap_sem)
  mmap(PROT_NONE) -> down_write(mmap_sem)

What it could be doing is:
  touch page -> page fault -> down_read(mmap_sem)
  madvise(MADV_DONTNEED) -> down_read(mmap_sem)

So after my previously posted patch (attached again) to only take down_read
in madvise where possible...

With 2 threads, the attached test.c ends up doing about 140,000 context
switches per second with just 2 threads/2CPUs, takes a little over 2
million faults, and about 80 seconds to complete, when running the
old_test() function (ie. mprotect,touch,mmap).

When running new_test() (ie. touch,madvise), context switches stay well
under 100, it takes slightly fewer faults, and it completes in about 8
seconds.

With 1 thread, new_test() actually completes in under half the time as
well (4.55 vs 9.88 seconds). This result won't have been altered by my
madvise patch, because the down_write fastpath is no slower than down_read.

Any comments?

--
SUSE Labs, Novell Inc.
Index: linux-2.6/mm/madvise.c
===
--- linux-2.6.orig/mm/madvise.c
+++ linux-2.6/mm/madvise.c
@@ -12,6 +12,25 @@
 #include 
 
 /*
+ * Any behaviour which results in changes to the vma->vm_flags needs to
+ * take mmap_sem for writing. Others, which simply traverse vmas, need
+ * to only take it for reading.
+ */
+static int madvise_need_mmap_write(int behavior)
+{
+   switch (behavior) {
+   case MADV_DOFORK:
+   case MADV_DONTFORK:
+   case MADV_NORMAL:
+   case MADV_SEQUENTIAL:
+   case MADV_RANDOM:
+   return 1;
+   default:
+   return 0;
+   }
+}
+
+/*
  * We can potentially split a vm area into separate
  * areas, each area with its own behavior.
  */
@@ -264,7 +283,10 @@ asmlinkage long sys_madvise(unsigned lon
int error = -EINVAL;
size_t len;
 
-   down_write(>mm->mmap_sem);
+   if (madvise_need_mmap_write(behavior))
+   down_write(>mm->mmap_sem);
+   else
+   down_read(>mm->mmap_sem);
 
if (start & ~PAGE_MASK)
goto out;
@@ -323,6 +345,10 @@ asmlinkage long sys_madvise(unsigned lon
vma = prev->vm_next;
}
 out:
-   up_write(>mm->mmap_sem);
+   if (madvise_need_mmap_write(behavior))
+   up_write(>mm->mmap_sem);
+   else
+   up_read(>mm->mmap_sem);
+
return error;
 }
#include 
#include 
#include 
#include 

#define NR_THREADS	1
#define ITERS	100
#define HEAPSIZE	(4*1024)

static void *old_thread(void *heap)
{
	int i;

	for (i = 0; i < ITERS; i++) {
		char *mem = heap;
		if (mprotect(heap, HEAPSIZE, PROT_READ|PROT_WRITE) == -1)
			perror("mprotect"), exit(1);
		*mem = i;
		if (mmap(heap, HEAPSIZE, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, -1, 0) == MAP_FAILED)
			perror("mmap"), exit(1);
	}

	return NULL;
}

static void old_test(void)
{
	void *heap;
	pthread_t pt[NR_THREADS];
	int i;

	heap = mmap(NULL, NR_THREADS*HEAPSIZE, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
	if (heap == MAP_FAILED)
		perror("mmap"), exit(1);

	for (i = 0; i < NR_THREADS; i++) {
		if (pthread_create([i], NULL, old_thread, heap + i*HEAPSIZE) == -1)
			perror("pthread_create"), exit(1);
	}
	for (i = 0; i < NR_THREADS; i++) {
		if (pthread_join(pt[i], NULL) == -1)
			perror("pthread_join"), exit(1);
	}

	if (munmap(heap, NR_THREADS*HEAPSIZE) == -1)
		perror("munmap"), exit(1);
}

static void *new_thread(void *heap)
{
	int i;

	for (i = 0; i < ITERS; i++) {
		char *mem = heap;
		*mem = i;
		if (madvise(heap, HEAPSIZE, MADV_DONTNEED) == -1)
			perror("madvise"), exit(1);
	}

	return NULL;
}

static void new_test(void)
{
	void *heap;
	pthread_t pt[NR_THREADS];
	int i;

	heap = mmap(NULL, HEAPSIZE, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
	if (heap == MAP_FAILED)
		perror("mmap"), exit(1);

	for (i = 0; i < NR_THREADS; i++) {
		if (pthread_create([i], NULL, new_thread, heap + i*HEAPSIZE) == -1)
			perror("pthread_create"), exit(1);
	}
	for (i = 0; i < NR_THREADS; i++) {
		if (pthread_join(pt[i], NULL) == -1)
			perror("pthread_join"), exit(1);
	}

	if (munmap(heap, HEAPSIZE) == 

Re: 2.6.20.3 AMD64 oops in CFQ code

2007-04-04 Thread Lee Revell

On 4/4/07, Bill Davidsen <[EMAIL PROTECTED]> wrote:

I won't say that's voodoo, but if I ever did it I'd wipe down my
keyboard with holy water afterward. ;-)

Well, I did save the message in my tricks file, but it sounds like a
last ditch effort after something went very wrong.


Would it really be an impediment to development if the kernel
maintainers simply refuse to merge patches that add new sysfs entries
without corresponding documentation?

Lee


Re: per-thread rusage

2007-04-04 Thread William Lee Irwin III
On Wed, 04 Apr 2007 10:29:31 PDT, William Lee Irwin III said:
>> Index: anon/include/linux/resource.h
>> ===
>> --- anon.orig/include/linux/resource.h   2007-04-04 09:57:41.239118534 -0700
>> +++ anon/include/linux/resource.h   2007-04-04 09:57:59.840178548 -0700
>> @@ -18,7 +18,8 @@
>>   */
>>  #define RUSAGE_SELF 0
>>  #define RUSAGE_CHILDREN (-1)
>> -#define RUSAGE_BOTH (-2)    /* sys_wait4() uses this */
>> +#define RUSAGE_THREAD   (-2)
>> +#define RUSAGE_BOTH (-3)    /* sys_wait4() uses this */

On Wed, Apr 04, 2007 at 06:36:47PM -0400, [EMAIL PROTECTED] wrote:
> Umm.. I'm having a high-idiot-quotient day today, but don't you want to
> leave _BOTH at -2 and put _THREAD at -3, to avoid an ABI breakage?

It's a rather natural question. The answer is that RUSAGE_BOTH is only
ever used internally to the kernel, so there is no userspace ABI change.
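That split is visible from userspace: <sys/resource.h> defines RUSAGE_SELF and RUSAGE_CHILDREN (with their conventional values on Linux), while RUSAGE_BOTH never appears in installed headers. A quick sketch:

```c
#include <sys/resource.h>

/* Fetch resource usage for the calling process; returns 0 on success,
 * like the underlying syscall. Only RUSAGE_SELF and RUSAGE_CHILDREN
 * are part of the exported ABI here; RUSAGE_BOTH is kernel-internal. */
static int query_self(struct rusage *ru)
{
	return getrusage(RUSAGE_SELF, ru);
}
```

Since no userspace program can have compiled against RUSAGE_BOTH, renumbering it inside the kernel breaks nothing.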


-- wli


Re: [PATCH] Blackfin arch: sync with uClibc no functional changes

2007-04-04 Thread Wu, Bryan
>  
> +.text
> +
> +.align 2
> +
>  ENTRY(_memchr)
> - P0 = R0 ; /* P0 = address */
> - P2 = R2 ; /* P2 = count */
> + P0 = R0; // P0 = address
> + P2 = R2; // P2 = count

Sorry for introducing source with the wrong coding style.
An updated one will be posted later.

Thanks Robin and Mike.
-Bryan


[PATCH] Blackfin arch: sync with uClibc no functional changes

2007-04-04 Thread Wu, Bryan
Signed-off-by: Bryan Wu <[EMAIL PROTECTED]>
---
 arch/blackfin/lib/memchr.S  |   31 +---
 arch/blackfin/lib/memcmp.S  |   46 ++
 arch/blackfin/lib/memcpy.S  |   24 +++---
 arch/blackfin/lib/memmove.S |   12 ++
 arch/blackfin/lib/memset.S  |2 +
 5 files changed, 68 insertions(+), 47 deletions(-)

diff --git a/arch/blackfin/lib/memchr.S b/arch/blackfin/lib/memchr.S
index c4f1aab..4981222 100644
--- a/arch/blackfin/lib/memchr.S
+++ b/arch/blackfin/lib/memchr.S
@@ -29,24 +29,27 @@
 
 #include 
 
-.align 2
-
-/*
- * C Library function MEMCHR
- * R0 = address
- * R1 = sought byte
- * R2 = count
+/* void *memchr(const void *s, int c, size_t n);
+ * R0 = address (s)
+ * R1 = sought byte (c)
+ * R2 = count (n)
+ *
  * Returns pointer to located character.
  */
 
+.text
+
+.align 2
+
 ENTRY(_memchr)
-   P0 = R0 ; /* P0 = address */
-   P2 = R2 ; /* P2 = count */
+   P0 = R0;/* P0 = address */
+   P2 = R2;/* P2 = count */
R1 = R1.B(Z);
CC = R2 == 0;
IF CC JUMP .Lfailed;
 
-.Lbytes: LSETUP (.Lbyte_loop_s , .Lbyte_loop_e) LC0=P2;
+.Lbytes:
+   LSETUP (.Lbyte_loop_s, .Lbyte_loop_e) LC0=P2;
 
 .Lbyte_loop_s:
R3 = B[P0++](Z);
@@ -55,9 +58,13 @@ ENTRY(_memchr)
 .Lbyte_loop_e:
NOP;
 
-.Lfailed: R0=0;
+.Lfailed:
+   R0=0;
RTS;
 
-.Lfound: R0 = P0;
+.Lfound:
+   R0 = P0;
R0 += -1;
RTS;
+
+.size _memchr,.-_memchr
diff --git a/arch/blackfin/lib/memcmp.S b/arch/blackfin/lib/memcmp.S
index e36fe8c..5b95023 100644
--- a/arch/blackfin/lib/memcmp.S
+++ b/arch/blackfin/lib/memcmp.S
@@ -29,39 +29,39 @@
 
 #include 
 
-.align 2
-
-/*
- * C Library function MEMCMP
- * R0 = First Address
- * R1 = Second Address
- * R2 = count
+/* int memcmp(const void *s1, const void *s2, size_t n);
+ * R0 = First Address (s1)
+ * R1 = Second Address (s2)
+ * R2 = count (n)
+ *
  * Favours word aligned data.
  */
 
+.text
+
+.align 2
+
 ENTRY(_memcmp)
I1 = P3;
P0 = R0;/* P0 = s1 address */
P3 = R1;/* P3 = s2 Address  */
P2 = R2 ;   /* P2 = count */
CC = R2 <= 7(IU);
-   IF CC JUMP  .Ltoo_small;
+   IF CC JUMP .Ltoo_small;
I0 = R1;/* s2 */
R1 = R1 | R0;   /* OR addresses together */
R1 <<= 30;  /* check bottom two bits */
CC =  AZ;   /* AZ set if zero. */
-   IF !CC JUMP  .Lbytes ;  /* Jump if addrs not aligned. */
+   IF !CC JUMP .Lbytes ;   /* Jump if addrs not aligned. */
 
P1 = P2 >> 2;   /* count = n/4 */
R3 =  3;
R2 = R2 & R3;   /* remainder */
P2 = R2;/* set remainder */
 
-   LSETUP (.Lquad_loop_s , .Lquad_loop_e) LC0=P1;
+   LSETUP (.Lquad_loop_s, .Lquad_loop_e) LC0=P1;
 .Lquad_loop_s:
-   NOP;
-   R0 = [P0++];
-   R1 = [I0++];
+   MNOP || R0 = [P0++] || R1 = [I0++];
CC = R0 == R1;
IF !CC JUMP .Lquad_different;
 .Lquad_loop_e:
@@ -73,7 +73,7 @@ ENTRY(_memcmp)
IF CC JUMP .Lfinished;  /* very unlikely*/
 
 .Lbytes:
-   LSETUP (.Lbyte_loop_s , .Lbyte_loop_e) LC0=P2;
+   LSETUP (.Lbyte_loop_s, .Lbyte_loop_e) LC0=P2;
 .Lbyte_loop_s:
R1 = B[P3++](Z);/* *s2 */
R0 = B[P0++](Z);/* *s1 */
@@ -88,14 +88,14 @@ ENTRY(_memcmp)
RTS;
 
 .Lquad_different:
-/* We've read two quads which don't match.
- * Can't just compare them, because we're
- * a little-endian machine, so the MSBs of
- * the regs occur at later addresses in the
- * string.
- * Arrange to re-read those two quads again,
- * byte-by-byte.
- */
+   /* We've read two quads which don't match.
+* Can't just compare them, because we're
+* a little-endian machine, so the MSBs of
+* the regs occur at later addresses in the
+* string.
+* Arrange to re-read those two quads again,
+* byte-by-byte.
+*/
P0 += -4;   /* back up to the start of the */
P3 = I0;/* quads, and increase the*/
P2 += 4;/* remainder count*/
@@ -106,3 +106,5 @@ ENTRY(_memcmp)
R0 = 0;
P3 = I1;
RTS;
+
+.size _memcmp,.-_memcmp
diff --git a/arch/blackfin/lib/memcpy.S b/arch/blackfin/lib/memcpy.S
index f757e1d..c1e00ef 100644
--- a/arch/blackfin/lib/memcpy.S
+++ b/arch/blackfin/lib/memcpy.S
@@ -35,6 +35,14 @@
 
 #include 
 
+/* void *memcpy(void *dest, const void *src, size_t n);
+ * R0 = To Address (dest) (leave unchanged to form result)
+ * R1 = From Address (src)
+ * R2 = count
+ *
+ * Note: Favours word alignment
+ */
+
 #ifdef CONFIG_MEMCPY_L1
 .section .l1.text
 #else
@@ -44,8 +52,8 @@
 .align 2
 
 ENTRY(_memcpy)
-   CC = R2 <=  0;  /* length not positive?*/
-   IF 

[PATCH] blackfin arch: use boot_command_line instead of saved_command_line in setup c file

2007-04-04 Thread Wu, Bryan
[PATCH] blackfin arch
As boot_command_line has been added in init/main.c for the arch-specific boot
command line, replace the old saved_command_line with boot_command_line so the
boot command line is passed correctly in the -mm tree.

Signed-off-by: Bryan Wu <[EMAIL PROTECTED]>
---
 arch/blackfin/kernel/setup.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/blackfin/kernel/setup.c b/arch/blackfin/kernel/setup.c
index ce51882..7d24229 100644
--- a/arch/blackfin/kernel/setup.c
+++ b/arch/blackfin/kernel/setup.c
@@ -221,8 +221,8 @@ void __init setup_arch(char **cmdline_p)
 
/* Keep a copy of command line */
	*cmdline_p = &command_line[0];
-   memcpy(saved_command_line, command_line, COMMAND_LINE_SIZE);
-   saved_command_line[COMMAND_LINE_SIZE - 1] = 0;
+   memcpy(boot_command_line, command_line, COMMAND_LINE_SIZE);
+   boot_command_line[COMMAND_LINE_SIZE - 1] = 0;
 
/* setup memory defaults from the user config */
physical_mem_end = 0;
-- 
1.5.0.5



Re: [PATCH] Lguest32 print hex on bad reads and writes

2007-04-04 Thread Rusty Russell
On Wed, 2007-04-04 at 23:14 -0400, Steven Rostedt wrote:
> On Wed, 2007-04-04 at 23:06 -0400, Kyle Moffett wrote:
> 
> > > (Erk, I wonder what I was thinking when I wrote that?) Can I ask  
> > > for %#x (or 0x%x)?  I'm easily confused.
> > 
> > How about "%p" for pointers?
> 
> But that would require casting the numbers to pointers.

And the kernel's printk doesn't put 0x on pointers anyway, last I
checked 8(

Rusty.




Re: [PATCH] Lguest32, use guest page tables to find paddr for emulated instructions

2007-04-04 Thread Steven Rostedt
On Thu, 2007-04-05 at 12:59 +1000, Rusty Russell wrote:
> On Wed, 2007-04-04 at 15:07 -0400, Steven Rostedt wrote:

> Yeah, I haven't tried loading random modules but I can imagine this does
> happen (what module was it, BTW?)

I have no idea which module it crashed on. I didn't investigate that
too much.  I could simply send a trap to guest when 
__pa(addr) != lguest_find_guest_paddr(addr) and see which module it
crashed on.

My block device I used was basically a copy of a RHEL5 system. I only
modified the inittab and fstab to get it working.  So on startup and
doing the udev init was when it crashed.

> 
> I used to have a function just like this, but managed to get rid of
> it.  
> 
> Hmm, perhaps we should have an "int lgread_virt_byte(u8 *)" which does
> the pgtable walk and read all in one?  It won't be efficient, but it'll
> be more correct and maybe even fewer lines 8)

I forgot that you have a goal to keep lguest small :)

Perhaps we can fork, and have lguest and lguest-lite.

-- Steve



Re: usb hid: reset NumLock

2007-04-04 Thread Dmitry Torokhov
On Tuesday 03 April 2007 04:52, Jiri Kosina wrote:
> On Mon, 2 Apr 2007, Pete Zaitcev wrote:
> 
> > How about this?
> 
> Looks quite fine to me.
> 
> But in case that Dmitry's patch "Input: add generic suspend and resume for 
> uinput devices" fixes your issue too, I wouldn't merge it as it won't be 
> needed. Could you please let me know?

Unfortunately my patch is crap. We should not be sending events down
dev->event() until dev->open() has been called because many drivers
start hardware from there and are not ready until then.

So it is HID driver responsibility to properly reset leds after all.

-- 
Dmitry


Re: [PATCH] blackfin arch: use boot_command_line instead of save_command_line in setup c file

2007-04-04 Thread Wu, Bryan
On Wed, 2007-04-04 at 08:07 -0700, Randy Dunlap wrote:
> On Wed, 04 Apr 2007 14:28:23 +0800 Wu, Bryan wrote:
> 
> > 
> > Signed-off-by: Bryan Wu <[EMAIL PROTECTED]>
> > ---
> >  arch/blackfin/kernel/setup.c |2 +-
> >  1 files changed, 1 insertions(+), 1 deletions(-)
> > 
> > diff --git a/arch/blackfin/kernel/setup.c b/arch/blackfin/kernel/setup.c
> > index ce51882..9870c60 100644
> > --- a/arch/blackfin/kernel/setup.c
> > +++ b/arch/blackfin/kernel/setup.c
> > @@ -221,7 +221,7 @@ void __init setup_arch(char **cmdline_p)
> >  
> > /* Keep a copy of command line */
> > *cmdline_p = &command_line[0];
> > -   memcpy(saved_command_line, command_line, COMMAND_LINE_SIZE);
> > +   memcpy(boot_command_line, command_line, COMMAND_LINE_SIZE);
> > saved_command_line[COMMAND_LINE_SIZE - 1] = 0;
> >  
> > /* setup memory defaults from the user config */
> > -- 
> 
> Hi Bryan,
> 
> Patch descriptions should include _why_ a change is being made,
> not just what the change is.
> 

Thanks Randy. I am just confused by the following comments in -mm tree
init/main.c:
---
/* Untouched command line saved by arch-specific code. */
char __initdata boot_command_line[COMMAND_LINE_SIZE];
/* Untouched saved command line (eg. for /proc) */
char *saved_command_line;
---

And you know, in the 2.6.20.x stable kernel init/main.c:
---
/* Untouched command line (eg. for /proc) saved by arch-specific code. */
char saved_command_line[COMMAND_LINE_SIZE];
---

So the patch is to move saved_command_line to boot_command_line in
blackfin arch code. I will resend a new patch about this, because I
forgot to change :-<
---
-   saved_command_line[COMMAND_LINE_SIZE - 1] = 0;
+   boot_command_line[COMMAND_LINE_SIZE - 1] = 0;
---

Did I misunderstand about this? So some other arch (ARM/AVR32 ...)
should be updated, too. 

> ---
> ~Randy
> *** Remember to use Documentation/SubmitChecklist when testing your code ***

Thanks again Randy.
I will follow the rule.

-Bryan


Re: [PATCH] lguest32 kallsyms backtrace of guest.

2007-04-04 Thread Steven Rostedt
On Thu, 2007-04-05 at 12:54 +1000, Rusty Russell wrote:

> 
>   This is a cool idea, but there are two issues with this patch.  The
> first is that it's 500 lines of code: that's around +10% on lguest's
> total code size!  The second is that it conflicts with the medium-term
> plan to allow any user to run up lguests: this is why lg.ko never
> printk()s about problems with the guest.

Not much I can do about the size, but it's in the debug section so
hopefully it's not considered too bad :)

> 
> While it is useful for cases where a guest dies mysteriously before it
> brings up the console, three alternatives come to mind:
> 
> 1) Modify early_printk so Guests can use it.
> 2) Have a separate tool(-set?) for this kind of post-mortem.  Then you
> just have to implement guest suspend! 8)
> 3) Put this in a CONFIG_LGUEST_DEBUG.
> 
> Note that options 1 or 2 make you do more work, but are probably better
> in the long term.  I'm happy for #3 to sit as a patch in the tree for
> the duration, tho!

OK, I'll make a #3 patch to send, but the #1 looks best. Not to mention
that I still need to make it so that the console can read it.

-- Steve




Re: 2.6.21-rc5-mm4 (SLUB)

2007-04-04 Thread Christoph Lameter
On Wed, 4 Apr 2007, Badari Pulavarty wrote:

> > Were the slabs merged? Look at /sys/slab and see if there are any symlinks 
> > there.
> > 

Ok. symlinks there. It's a sporadic thing. I think I am going to add a slab
validator to SLUB that goes through all slabs and checks all objects for 
validity. Then we can trigger a scan through the acpi caches which should 
locate the problem.



Re: [PATCH] Lguest32 print hex on bad reads and writes

2007-04-04 Thread Steven Rostedt
On Wed, 2007-04-04 at 23:06 -0400, Kyle Moffett wrote:

> > (Erk, I wonder what I was thinking when I wrote that?) Can I ask  
> > for %#x (or 0x%x)?  I'm easily confused.
> 
> How about "%p" for pointers?

But that would require casting the numbers to pointers.

-- Steve




Re: [linux-usb-devel] [RFC] HID bus design overview.

2007-04-04 Thread Dmitry Torokhov
On Wednesday 04 April 2007 21:25, Li Yu wrote:
> Jiri Kosina wrote:
> > BTW as soon as you have some presentable code, could you please send
> > it so
> > that we could see what aproach you have taken? Debating over code is 
> > usualy more efficient than just ranting random ideas :)
> >
> >   
> There is a "presentable patch" in the attachment ;)

Some random notes without reading it all carefully...

> +static int hid_bus_match(struct device *dev, struct device_driver *drv)
> +{
> + struct hid_driver *hid_drv;
> + struct hid_device *hid_dev;
> +
> + hid_drv = to_hid_driver(drv);
> + hid_dev = to_hid_device(dev);
> +
> + if (is_hid_driver_sticky(hid_drv))
> + /* the sticky driver match device do not pass here. */
> + return 0;
> + if (hid_dev->bus != hid_drv->bus)
> + return 0;

How can this happen?

> + if (!hid_drv->match || hid_drv->match(hid_drv, hid_dev)) {
> + hid_dev->driver = hid_drv;

This is usually done in the bus->probe() function, when we know for sure
that the driver binds to the device.

> +static void hid_bus_release(struct device *dev)
> +{
> +}
> +
> +struct device hid_bus = {
> + .bus_id   = "hidbus0",
> + .release  = hid_bus_release
> +};
> +
> +static void hid_dev_release(struct device *dev)
> +{
> +}
> +

That will for sure raise Greg KH's blood pressure ;)

> + for (i=0; hid_dev->attrs && hid_dev->attrs[i]; ++i) {
> + ret = device_create_file(&hid_dev->device, hid_dev->attrs[i]);
> + if (ret)
> + break;
> +

That should be handled via bus's device attributes and not open coded...

> - *  Copyright (c) 2000-2005 Vojtech Pavlik <[EMAIL PROTECTED]>
> - *  Copyright (c) 2005 Michael Haboustak <[EMAIL PROTECTED]> for Concept2, 
> Inc
> + *  Copyright (c) 2000-2005 Vojtech Pavlik 
> + *  Copyright (c) 2005 Michael Haboustak  for Concept2, 
> Inc
>   *  Copyright (c) 2006 Jiri  Kosina

Any particular reason for mangling addresses?

> + if (interrupt)
> + local_irq_save(flags);
> + spin_lock(_lock);
> + list_for_each_entry(driver, _sticky_drivers, sticky_link) {
> + hook = driver->hook;
> + if (hook && hook->raw_event) {
> + ret = hook->raw_event(hid, type, data, size, interrupt);
> + if (!ret)
> + break;
> + }
> + }
> + spin_unlock(_lock);
> + if (interrupt)
> + local_irq_restore(flags);
> +

This is scary. spin_lock_irqsave() and be done with it.

> +int hid_open(struct hid_device *hid)
> +{
> + struct hid_transport *tl;
> + int ret;
> +
> + if (hid->driver->open)
> + return hid->driver->open(hid);
> + ret = 0;
> + spin_lock(_lock);
> + tl =  hid_transports[hid->bus];
> + if (tl->open)
> + ret = tl->open(hid);
> + spin_unlock(_lock);
> + return ret;
> +}

Spinlock is not the best choice here, I'd expect most ->open()
implementations to wait on some IO.

-- 
Dmitry


Re: [PATCH] Lguest32 print hex on bad reads and writes

2007-04-04 Thread Kyle Moffett

On Apr 04, 2007, at 23:01:30, Rusty Russell wrote:

> On Wed, 2007-04-04 at 15:14 -0400, Steven Rostedt wrote:
>> Currently the lguest32 error messages from bad reads and writes
>> prints a decimal integer for addresses. This is pretty annoying.
>> So this patch changes those to be hex outputs.
>
> (Erk, I wonder what I was thinking when I wrote that?) Can I ask
> for %#x (or 0x%x)?  I'm easily confused.


How about "%p" for pointers?

Cheers,
Kyle Moffett




Re: [PATCH] Lguest32 print hex on bad reads and writes

2007-04-04 Thread Rusty Russell
On Wed, 2007-04-04 at 15:14 -0400, Steven Rostedt wrote:
> Currently the lguest32 error messages from bad reads and writes prints a
> decimal integer for addresses. This is pretty annoying. So this patch
> changes those to be hex outputs.

(Erk, I wonder what I was thinking when I wrote that?)

Can I ask for %#x (or 0x%x)?  I'm easily confused.

Thanks!
Rusty.




Re: [PATCH] Lguest32, use guest page tables to find paddr for emulated instructions

2007-04-04 Thread Rusty Russell
On Wed, 2007-04-04 at 15:07 -0400, Steven Rostedt wrote:
> [Bug that was found by my previous patch]
> 
> This patch allows things like modules, which don't have a direct
> __pa(EIP) mapping to do emulated instructions.
> 
> Sure, the emulated instruction probably should be a paravirt_op, but
> this patch lets you at least boot a kernel that has modules needing
> emulated instructions.

Yeah, I haven't tried loading random modules but I can imagine this does
happen (what module was it, BTW?)

I used to have a function just like this, but managed to get rid of
it.  

Hmm, perhaps we should have an "int lgread_virt_byte(u8 *)" which does
the pgtable walk and read all in one?  It won't be efficient, but it'll
be more correct and maybe even fewer lines 8)

Thanks for the patch!
Rusty.




Re: [ckrm-tech] [PATCH 7/7] containers (V7): Container interface to nsproxy subsystem

2007-04-04 Thread Paul Menage

On 4/4/07, Srivatsa Vaddagiri <[EMAIL PROTECTED]> wrote:

> > - how do you handle additional reference counts on subsystems? E.g.
> > beancounters wants to be able to associate each file with the
> > container that owns it. You need to be able to lock out subsystems
> > from taking new reference counts on an unreferenced container that
> > you're deleting, without making the refcount operation too
> > heavyweight.
>
> Firstly, this is not a unique problem introduced by using ->nsproxy.
> Secondly we have discussed this to some extent before
> (http://lkml.org/lkml/2007/2/13/122). Essentially if we see zero tasks
> sharing a resource object pointed to by ->nsproxy, then we can't be
> racing with a function like bc_file_charge(), which simplifies the
> problem quite a bit. In other words, seeing zero tasks in xxx_rmdir()
> after taking manage_mutex is permission to kill nsproxy and associated
> objects. Correct me if I am wrong here.



OK, I've managed to reconstruct my reasoning and remembered why it's
important to have the refcounts associated with the subsystems, and
why the simple use of the nsproxy count doesn't work. Essentially,
there's no way to figure out which underlying subsystem the refcount
refers to:

1) Assume the system has a single task T, and two subsystems, A and B

2) Mount hierarchy H1, with subsystem A and root subsystem state A0,
and hierarchy H2 with subsystem B and root subsystem state B0. Both
H1/ and H2/ share a single nsproxy N0, with refcount 3 (including the
reference from T), pointing at A0 and B0.

3) Create directory H1/foo, which creates subsystem state A1 (nsproxy
N1, refcount 1, pointing at A1 and B0)

4) Create directory H2/bar, which creates subsystem state B1 (nsproxy
N2, refcount 1, pointing at A0 and B1)

5) Move T into H1/foo/tasks and then H2/bar/tasks. It ends up with
nsproxy N3, refcount 1, pointing at A1 and B1.

6) T creates an object that is charged to A1 and hence needs to take a
reference on A1 in order to uncharge it later when it's released. So
N3 now has a refcount of 2

7) Move T back to H1/tasks and H2/tasks; assume it picks up nsproxy N0
again; N3 has a refcount of 1 now. (Assume that the object created in
step 6 isn't one that's practical/desirable to relocate when the task
that created it moves to a different container)

In this particular case the extra refcount on N3 is intended to keep
A1 alive (which prevents H1/foo being deleted), but there's no way to
tell from the structures in use whether it was taken on A1 or on B1.
Neither H1/foo nor H2/bar can be deleted, even though nothing is
intending to have a reference count on H2/bar.

Putting the extra refcount explicitly either in A1, or else in a
container object associated with H1/foo makes this more obvious.

Paul


Re: [PATCH] lguest32 kallsyms backtrace of guest.

2007-04-04 Thread Rusty Russell
On Wed, 2007-04-04 at 14:23 -0400, Steven Rostedt wrote:
> This is taken from the work I did on lguest64.
> 
> When killing a guest, we read the guest stack to do a nice back trace of
> the guest and send it via printk to the host.
> 
> So instead of just getting an error message from the lguest launcher of:
> 
> lguest: bad read address 537012178 len 1
> 
> I also get in my dmesg:
> 
> called from  [] show_trace_log_lvl+0x1a/0x2f

Hi Steven,

This is a cool idea, but there are two issues with this patch.  The
first is that it's 500 lines of code: that's around +10% on lguest's
total code size!  The second is that it conflicts with the medium-term
plan to allow any user to run up lguests: this is why lg.ko never
printk()s about problems with the guest.

While it is useful for cases where a guest dies mysteriously before it
brings up the console, three alternatives come to mind:

1) Modify early_printk so Guests can use it.
2) Have a separate tool(-set?) for this kind of post-mortem.  Then you
just have to implement guest suspend! 8)
3) Put this in a CONFIG_LGUEST_DEBUG.

Note that options 1 or 2 make you do more work, but are probably better
in the long term.  I'm happy for #3 to sit as a patch in the tree for
the duration, tho!

Cheers,
Rusty.




Re: [PATCH] Unified lguest launcher

2007-04-04 Thread Rusty Russell
On Wed, 2007-04-04 at 13:03 -0300, Glauber de Oliveira Costa wrote:
> This is a new version of the unified lguest launcher that applies to
> the current tree. According to rusty's suggestion, I'm bothering less
> to be able to load 32 bit kernels on 64-bit machines: changing the
> launcher for such case would be the easy part! In the absence of
> further objections, I'll commit it.
> 
> Signed-off-by: Glauber de Oliveira Costa <[EMAIL PROTECTED]>

Hi Glauber!

The patch looks more than reasonable, but I think we can go further with
the abstraction.  If you could spin it again, I'll apply it.  There may
be more cleanups after that, but I don't want to hold up your progress!

> --- /dev/null
> +++ linux-2.6.20/Documentation/lguest/i386/defines
> @@ -0,0 +1,4 @@
> +# We rely on CONFIG_PAGE_OFFSET to know where to put lguest binary.
> +# Some shells (dash - ubunu) can't handle numbers that big so we cheat.
> +include ../../.config
> +LGUEST_GUEST_TOP := ($(CONFIG_PAGE_OFFSET) - 0x0800)

The include needs another ../ and seems redundant (the .config is
included from the Makefile anyway).

The shells comment is obsolete and should be deleted too, my bad.

> +++ linux-2.6.20/Documentation/lguest/i386/lguest_defs.h
> @@ -0,0 +1,9 @@
> +#ifndef _LGUEST_DEFS_H_
> +#define _LGUEST_DEFS_H_
> +
> +/* LGUEST_TOP_ADDRESS comes from the Makefile */
> +#define RESERVE_TOP_ADDRESS LGUEST_GUEST_TOP - 1024*1024

Why -1M?  And RESERVE_TOP_ADDRESS isn't used in this patch?

> +static unsigned long map_elf(int elf_fd, const void *hdr, 
>  unsigned long *page_offset)
>  {
> -   void *addr;
> +#ifndef __x86_64__
> +   const Elf32_Ehdr *ehdr = hdr;
> Elf32_Phdr phdr[ehdr->e_phnum];
> +#else
> +   const Elf64_Ehdr *ehdr = hdr;
> +   Elf64_Phdr phdr[ehdr->e_phnum];
> +#endif

The way we did this in the module code was to define Elf_Ehdr etc in the
arch-specific headers to avoid ifdefs.  I think it would help this code,
too. 

> +   || ((ehdr->e_machine != EM_386) &&
> +   (ehdr->e_machine != EM_X86_64))

Similarly define ELF_MACHINE?

>else if (*page_offset != phdr[i].p_vaddr - phdr[i].p_paddr)
> +   else if ((*page_offset != phdr[i].p_vaddr - phdr[i].p_paddr)
> +#ifdef __x86_64__
> +&& (phdr[i].p_vaddr != VSYSCALL_START)
> +#endif
> +   )

Hmm, static inline bool is_vsyscall_segment(const Elf_Phdr *) maybe?

> +/* LGUEST_TOP_ADDRESS comes from the Makefile */
> +typedef uint64_t u64;
> +#include "../../../include/asm/lguest_user.h"
> +
> +#define RESERVE_TOP_ADDRESS LGUEST_GUEST_TOP
> +
> +
> +#define BOOT_PGTABLE "boot_level4_pgt"

The comment should refer to LGUEST_GUEST_TOP?

I think the typedef should be in the main code with the others: it
doesn't hurt i386 and it's neater.

I'm not sure the BOOT_PGTABLE define helps us here, either; it might be
clearer just to put it directly into the code.

Cheers!
Rusty.





Re: [PATCH 11/90] ARM: OMAP: Add support for Amstrad Delta keypad

2007-04-04 Thread Dmitry Torokhov
On Wednesday 04 April 2007 18:05, Jonathan McDowell wrote:
> On Wed, Apr 04, 2007 at 04:57:57PM -0400, Dmitry Torokhov wrote:
> > On 4/4/07, Tony Lindgren <[EMAIL PROTECTED]> wrote:
> > 
> > >+   KEY(0, 7, KEY_LEFTSHIFT),   /* Vol up   */
> > 
> > KEY_VOLUMEUP?
> > 
> > >+   KEY(3, 7, KEY_LEFTCTRL),/* Vol down  */
> > 
> > KEY_VOLUMEDOWN?
> 
> In terms of making the keypad on top of the E3 usable left shift and
> left ctrl make more sense (there are no other keys with this function
> and they are to the left of the qwerty layout), but the keys are in fact
> labelled vol up/down, hence the comments.

Ah, I see, OK.

-- 
Dmitry


Re: 2.6.21-rc5-mm4 (SLUB)

2007-04-04 Thread Badari Pulavarty
On Wed, 2007-04-04 at 17:31 -0700, Christoph Lameter wrote:
> On Wed, 4 Apr 2007, Badari Pulavarty wrote:
> 
> > On Wed, 2007-04-04 at 15:59 -0700, Christoph Lameter wrote:
> > > On Wed, 4 Apr 2007, Badari Pulavarty wrote:
> > > 
> > > > Here is the slub_debug=FU output with the above patch.
> > > 
> > > Hmmm... Looks like the object is actually free. Someone writes beyond the 
> > > end of the earlier object. Setting Z should check overwrites but it 
> > > switched off merging. So set
> > > 
> > > slub_debug = FZ
> > > 
> > > Analogous to the last patch you would need to take out redzoning from 
> > > the flags that stop merging. Then rerun. Maybe we can track it down this 
> > > way.
> > 
> > Hmm.. I did that and machine boots fine, with absolutely no
> > debug messages :(
> 
> Were the slabs merged? Look at /sys/slab and see if there are any symlinks 
> there.
> 

elm3b29:/sys/slab # ls -ltr
total 0
drwxr-xr-x 2 root root 0 Apr  4 17:40 sock_inode_cache
drwxr-xr-x 2 root root 0 Apr  4 17:40 skbuff_fclone_cache
drwxr-xr-x 2 root root 0 Apr  4 17:40 sigqueue
drwxr-xr-x 2 root root 0 Apr  4 17:40 shmem_inode_cache
drwxr-xr-x 2 root root 0 Apr  4 17:40 radix_tree_node
drwxr-xr-x 2 root root 0 Apr  4 17:40 proc_inode_cache
drwxr-xr-x 2 root root 0 Apr  4 17:40 ip_dst_cache
drwxr-xr-x 2 root root 0 Apr  4 17:40 file_lock_cache
drwxr-xr-x 2 root root 0 Apr  4 17:40 blkdev_requests
drwxr-xr-x 2 root root 0 Apr  4 17:40 blkdev_queue
drwxr-xr-x 2 root root 0 Apr  4 17:40 blkdev_ioc
drwxr-xr-x 2 root root 0 Apr  4 17:40 biovec-64
drwxr-xr-x 2 root root 0 Apr  4 17:40 biovec-256
drwxr-xr-x 2 root root 0 Apr  4 17:40 biovec-128
drwxr-xr-x 2 root root 0 Apr  4 17:40 bdev_cache
drwxr-xr-x 2 root root 0 Apr  4 17:40 TCP
drwxr-xr-x 2 root root 0 Apr  4 17:40 Acpi-State
drwxr-xr-x 2 root root 0 Apr  4 17:40 Acpi-ParseExt
drwxr-xr-x 2 root root 0 Apr  4 17:40 Acpi-Operand
drwxr-xr-x 2 root root 0 Apr  4 17:40 Acpi-Namespace
drwxr-xr-x 2 root root 0 Apr  4 17:40 vm_area_struct
drwxr-xr-x 2 root root 0 Apr  4 17:40 task_struct
drwxr-xr-x 2 root root 0 Apr  4 17:40 sysfs_dir_cache
drwxr-xr-x 2 root root 0 Apr  4 17:40 signal_cache
drwxr-xr-x 2 root root 0 Apr  4 17:40 sighand_cache
drwxr-xr-x 2 root root 0 Apr  4 17:40 pid
drwxr-xr-x 2 root root 0 Apr  4 17:40 names_cache
drwxr-xr-x 2 root root 0 Apr  4 17:40 mm_struct
drwxr-xr-x 2 root root 0 Apr  4 17:40 kmem_cache_node
drwxr-xr-x 2 root root 0 Apr  4 17:40 kmalloc-96
drwxr-xr-x 2 root root 0 Apr  4 17:40 kmalloc-8192
drwxr-xr-x 2 root root 0 Apr  4 17:40 kmalloc-8
drwxr-xr-x 2 root root 0 Apr  4 17:40 kmalloc-65536
drwxr-xr-x 2 root root 0 Apr  4 17:40 kmalloc-64
drwxr-xr-x 2 root root 0 Apr  4 17:40 kmalloc-512
drwxr-xr-x 2 root root 0 Apr  4 17:40 kmalloc-4096
drwxr-xr-x 2 root root 0 Apr  4 17:40 kmalloc-32768
drwxr-xr-x 2 root root 0 Apr  4 17:40 kmalloc-32
drwxr-xr-x 2 root root 0 Apr  4 17:40 kmalloc-262144
drwxr-xr-x 2 root root 0 Apr  4 17:40 kmalloc-256
drwxr-xr-x 2 root root 0 Apr  4 17:40 kmalloc-2048
drwxr-xr-x 2 root root 0 Apr  4 17:40 kmalloc-192
drwxr-xr-x 2 root root 0 Apr  4 17:40 kmalloc-16384
drwxr-xr-x 2 root root 0 Apr  4 17:40 kmalloc-16
drwxr-xr-x 2 root root 0 Apr  4 17:40 kmalloc-131072
drwxr-xr-x 2 root root 0 Apr  4 17:40 kmalloc-128
drwxr-xr-x 2 root root 0 Apr  4 17:40 kmalloc-1024
drwxr-xr-x 2 root root 0 Apr  4 17:40 inode_cache
drwxr-xr-x 2 root root 0 Apr  4 17:40 idr_layer_cache
drwxr-xr-x 2 root root 0 Apr  4 17:40 fs_cache
drwxr-xr-x 2 root root 0 Apr  4 17:40 filp
drwxr-xr-x 2 root root 0 Apr  4 17:40 dentry_cache
drwxr-xr-x 2 root root 0 Apr  4 17:40 buffer_head
drwxr-xr-x 2 root root 0 Apr  4 17:40 anon_vma
drwxr-xr-x 2 root root 0 Apr  4 17:40 dquot
drwxr-xr-x 2 root root 0 Apr  4 17:40 reiser_inode_cache
drwxr-xr-x 2 root root 0 Apr  4 17:40 nfs_inode_cache
drwxr-xr-x 2 root root 0 Apr  4 17:40 nfs_direct_cache
drwxr-xr-x 2 root root 0 Apr  4 17:40 mqueue_inode_cache
drwxr-xr-x 2 root root 0 Apr  4 17:40 minix_inode_cache
drwxr-xr-x 2 root root 0 Apr  4 17:40 journal_head
drwxr-xr-x 2 root root 0 Apr  4 17:40 isofs_inode_cache
drwxr-xr-x 2 root root 0 Apr  4 17:40 hugetlbfs_inode_cache
drwxr-xr-x 2 root root 0 Apr  4 17:40 ext3_xattr
drwxr-xr-x 2 root root 0 Apr  4 17:40 ext3_inode_cache
drwxr-xr-x 2 root root 0 Apr  4 17:40 ext2_xattr
drwxr-xr-x 2 root root 0 Apr  4 17:40 ext2_inode_cache
drwxr-xr-x 2 root root 0 Apr  4 17:40 cfq_pool
drwxr-xr-x 2 root root 0 Apr  4 17:40 cfq_ioc_pool
drwxr-xr-x 2 root root 0 Apr  4 17:40 UNIX
drwxr-xr-x 2 root root 0 Apr  4 17:40 rpc_inode_cache
drwxr-xr-x 2 root root 0 Apr  4 17:40 rpc_buffers
drwxr-xr-x 2 root root 0 Apr  4 17:40 revokefs_inode_cache
drwxr-xr-x 2 root root 0 Apr  4 17:40 TCPv6
drwxr-xr-x 2 root root 0 Apr  4 17:54 fat_inode_cache
drwxr-xr-x 2 root root 0 Apr  4 17:54 fat_cache
drwxr-xr-x 2 root root 0 Apr  4 17:54 sgpool-64
drwxr-xr-x 2 root root 0 Apr  4 17:54 sgpool-32
drwxr-xr-x 2 root root 0 Apr  4 17:54 sgpool-128
drwxr-xr-x 2 root root 0 Apr  4 17:54 scsi_io_context

Re: [rfc] no ZERO_PAGE?

2007-04-04 Thread Nick Piggin
On Wed, Apr 04, 2007 at 05:27:31PM -0700, Linus Torvalds wrote:
> 
> 
> On Wed, 4 Apr 2007, [EMAIL PROTECTED] wrote:
> > 
> > I'd not be surprised if there's sparse-matrix code out there that wants to
> > malloc a *huge* array (like a 1025x1025 array of numbers) that then only
> > actually *writes* to several hundred locations, and relies on the fact that
> > all the untouched pages read back all-zeros.
> 
> Good point. In fact, it doesn't need to be a malloc() - I remember people 
> doing this with Fortran programs and just having an absolutely incredibly 
> big BSS (with traditional Fortran, dynamic memory allocations are just not 
> done).

Sparse matrices are one thing I worry about. I don't know enough about
HPC code to know whether they will be a problem. I know there exist
data structures to optimise sparse matrix storage...

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 25/90] ARM: OMAP: h4 must have blinky leds!!

2007-04-04 Thread Jan Engelhardt

On Apr 4 2007 14:26, Randy Dunlap (rd) wrote:
David Brownell (db) wrote:
Jan Engelhardt (je) wrote:
je>>> 
je>>> My stance, || goes at EOL, and final ) not standalone:
db>> 
db>> You are still violating the "only TABs used for indent" rule.
rd>
rd>Yes, but CodingStyle is just a set of guidelines and common practices.
rd>It doesn't cover multi-line if () statements, which is the subject of
rd>the current controvers^W discussion.

s/Yes/No/. The example block I posted does use /^\t+/.
Mind MUAs/browsers which transform tabs into single-spaces in some cases.

>>> if (foo ||
>>>   bar ||
>>>   baz ||
>>>   etc)
>>> do_something;


Jan
-- 


Re: [PATCH] allow vmsplice to work in 32-bit mode on ppc64

2007-04-04 Thread Stephen Rothwell
On Wed, 4 Apr 2007 13:25:19 -0400 Don Zickus <[EMAIL PROTECTED]> wrote:
>
> Trivial change to pass vmsplice arguments through the compat layer on
> ppc64.
>
> Signed-off-by: Don Zickus <[EMAIL PROTECTED]>
Acked-by: Stephen Rothwell <[EMAIL PROTECTED]>

Obviously no one uses vmsplice from 32bit processes on ppc64 :-)
--
Cheers,
Stephen Rothwell[EMAIL PROTECTED]
http://www.canb.auug.org.au/~sfr/




Re: [rfc] no ZERO_PAGE?

2007-04-04 Thread Nick Piggin
On Wed, Apr 04, 2007 at 01:11:11PM -0700, David Miller wrote:
> From: Linus Torvalds <[EMAIL PROTECTED]>
> Date: Wed, 4 Apr 2007 08:35:30 -0700 (PDT)
> 
> > Anyway, I'm not against this, but I can see somebody actually *wanting* 
> > the ZERO page in some cases. I've used the fact for TLB testing, for 
> > example, by just doing a big malloc(), and knowing that the kernel will 
> > re-use the ZERO_PAGE so that I don't get any cache effects (well, at least 
> > not any *physical* cache effects. Virtually indexed cached will still show 
> > effects of it, of course, but I haven't cared).
> > 
> > That's an example of an app that actually cares about the page allocation 
> > (or, in this case, the lack there-of). Not an important one, but maybe 
> > there are important ones that care?
> 
> If we're going to consider this seriously, there is a case I know of.
> Look at flush_dcache_page()'s test for ZERO_PAGE() on sparc64, there
> is an instructive comment:
> 
>   /* Do not bother with the expensive D-cache flush if it
>* is merely the zero page.  The 'bigcore' testcase in GDB
>* causes this case to run millions of times.
>*/
>   if (page == ZERO_PAGE(0))
>   return;
> 
> basically what the GDB test case does it mmap() an enormous anonymous
> area, not touch it, then dump core.
> 
> As I understand the patch being considered to remove ZERO_PAGE(), this
> kind of core dump will cause a lot of pages to be allocated, probably
> eating up a lot of system time as well as memory.

Yeah. Well it is trivial to leave ZERO_PAGE in get_user_pages, however
in the longer run it would be nice to get rid of ZERO_PAGE completely
so we need an alternative.

I've been working on a patch for core dumping that can detect unfaulted
anonymous memory and skip it without doing the ZERO_PAGE comparison.


Re: missing madvise functionality

2007-04-04 Thread Nick Piggin

Eric Dumazet wrote:

On Wed, 04 Apr 2007 20:05:54 +1000
Nick Piggin <[EMAIL PROTECTED]> wrote:


@@ -1638,7 +1652,7 @@ find_extend_vma(struct mm_struct * mm, u
unsigned long start;

addr &= PAGE_MASK;
-   vma = find_vma(mm,addr);
+   vma = find_vma(mm,addr,>vmacache);
if (!vma)
return NULL;
if (vma->vm_start <= addr)


So now you can have current calling find_extend_vma on someone else's mm
but using their cache. So you're going to return current's vma, or current
is going to get one of mm's vmas in its cache :P



This was not a working patch, just a way of throwing out the idea, since the 
answers I got showed I was not understood.

In this case, find_extend_vma() should of course take a struct vm_area_cache 
* argument, like find_vma().

A single cache on one mm is not scalable; oprofile hits it badly on a dual 
cpu config.


Oh, what sort of workload are you using to show this? The only reason that I
didn't submit my thread cache patches was that I didn't show a big enough
improvement.

--
SUSE Labs, Novell Inc.


Re: [PATCH 6/13] maps: Move the page walker code to lib/

2007-04-04 Thread Nick Piggin

Nick Piggin wrote:

Matt Mackall wrote:


On Wed, Apr 04, 2007 at 03:50:56PM +1000, Nick Piggin wrote:



Just put it in its own file in mm/ rather than its own file in lib.
lib should be for almost-standalone stuff, IMO (ie. only using basic
kernel functionality).




Arguably that's what lib/ should be for, but it's currently largely



I disagree. There is code everywhere that exists to provide some
functionality via an API to other parts of the kernel. You don't
think mm/page_alloc.c should go in lib/?


Oh, I think I misunderstood you.

Anyway, we have lots of conditionally compiled code in mm as well.
It doesn't seem to be much hardship.

--
SUSE Labs, Novell Inc.


Re: missing madvise functionality

2007-04-04 Thread Nick Piggin

Hugh Dickins wrote:

On Wed, 4 Apr 2007, Rik van Riel wrote:


Hugh Dickins wrote:



(I didn't understand how Rik would achieve his point 5, _no_ lock
contention while repeatedly re-marking these pages, but never mind.)


The CPU marks them accessed when they are reused.

The VM only moves the reused pages back to the active list
on memory pressure.  This means that when the system is
not under memory pressure, the same page can simply stay
PG_lazyfree for multiple malloc/free rounds.



Sure, there's no need for repetitious locking at the LRU end of it;
but you said "if the system has lots of free memory, pages can go
through multiple free/malloc cycles while sitting on the dontneed
list, very lazily with no lock contention".  I took that to mean,
with userspace repeatedly madvising on the ranges they fall in,
which will involve mmap_sem and ptl each time - just in order
to check that no LRU movement is required each time.

(Of course, there's also the problem that we don't leave our
systems with lots of free memory: some LRU balancing decisions.)


I don't agree this approach is the best one anyway. I'd rather just
have the simple MADV_DONTNEED/MADV_DONEED.

Once you go through the trouble of protecting the memory and
flushing TLBs, unprotecting them afterwards and taking a trap
(even if it is a pure hardware trap), I doubt you've saved much.

You may have saved the cost of zeroing out the page, but that
has to be weighed against the fact that you have left a possibly
cache hot page sitting there to get cold, and your accesses to
initialise the malloced memory might have more cache misses.

If you just free the page, it goes onto a nice LIFO cache hot
list, and when you want to allocate another one, you'll probably
get a cache hot one.

The problem is down_write(mmap_sem) isn't it? We can and should
easily fix that problem now. If we subsequently want to look at
micro optimisations to avoid zeroing using MMU tricks, then we
have a good base to compare with.

--
SUSE Labs, Novell Inc.


Re: [ckrm-tech] [PATCH 7/7] containers (V7): Container interface to nsproxy subsystem

2007-04-04 Thread Paul Menage

On 4/4/07, Paul Menage <[EMAIL PROTECTED]> wrote:

The current code creates such arrays when it needs an atomic snapshot
of the set of tasks in the container (e.g. for reporting them to
userspace or updating the mempolicies of all the tasks in the case of
cpusets). It may be possible to do it by traversing tasklist and
dropping the lock to operate on each task where necessary - I'll take
a look at that.


Just to clarify this - the cases that currently need an array of task
pointers *do* already traverse tasklist in order to locate those tasks
as needed - it's when they want to be able to operate on those tasks
outside of the tasklist lock that the array is needed - lock
tasklist_lock, fill the array with tasks (with added refcounts), drop
tasklist_lock, do stuff.

Paul


Re: [ckrm-tech] [PATCH 7/7] containers (V7): Container interface to nsproxy subsystem

2007-04-04 Thread Paul Menage

On 4/4/07, Eric W. Biederman <[EMAIL PROTECTED]> wrote:


In addition there appear to be some weird assumptions (an array with
one member per task_struct) in the group.  The pid limit allows
us millions of task_structs if the user wants it.   A several megabyte
array sounds like a completely unsuitable data structure.   What
is wrong with just traversing task_list at least initially?


The current code creates such arrays when it needs an atomic snapshot
of the set of tasks in the container (e.g. for reporting them to
userspace or updating the mempolicies of all the tasks in the case of
cpusets). It may be possible to do it by traversing tasklist and
dropping the lock to operate on each task where necessary - I'll take
a look at that.



- What functionality do we want to provide.
- How do we want to export that functionality to user space.


I'm working on a paper for OLS that sets out a lot of these issues -
I'll send you a draft copy hopefully tonight.



You can share code by having an embedded structure instead of a magic
subsystem things have to register with, and I expect that would be
preferable.  Libraries almost always are easier to work with than
a subsystem with strict rules that doesn't give you choices.

Why do we need a subsystem id?  Do we expect any controllers to
be provided as modules?  I think the code is so in the core that
modules of any form are a questionable assumption.


One of the things that I'd really like to be able to do is allow
userspace to mount one filesystem instance with several
controllers/subsystems attached to it, so that they can define a
single task->container mapping to be used for multiple different
subsystems.

I think that would be hard to do without a central registration of subsystems.

I'm not particularly expecting modules to register as subsystems,
since as you say they'd probably need to be somewhat embedded in the
kernel in order to operate. And there are definite performance
advantages to be had from allowing in-kernel subsystems to have a
constant subsystem id.

It would be nice if subsystems could optionally have a dynamic
subsystem id so that new subsystem patches didn't have to clash with
one another in some global enum, but maybe that's not worth worrying
about at this point.



I'm inclined to the rcfs variant and using nsproxy (from what I have
seen in passing) because it is more of an exercise in minimalism, and I
am strongly inclined to be minimalistic.


There's not a whole lot of difference between the two patch sets,
other than the names, and whether they piggyback on nsproxy or on a
separate object (container_group) that could be merged with nsproxy
later if it seemed appropriate. Plus mine have the ability to support
subsystem callbacks on fork/exit, with no significant overhead in the
event that no configured subsystems actually want those callbacks.

Paul


Re: [PATCH 6/13] maps: Move the page walker code to lib/

2007-04-04 Thread Nick Piggin

Matt Mackall wrote:

On Wed, Apr 04, 2007 at 03:50:56PM +1000, Nick Piggin wrote:


Matt Mackall wrote:


On Wed, Apr 04, 2007 at 01:51:37PM +1000, Nick Piggin wrote:



Matt Mackall wrote:



Move the page walker code to lib/

This lets it get shared outside of proc/ and linked in only when
needed.


I think it would be better in mm/.



I originally was looking at putting it in mm/memory.c and possibly


Just put it in its own file in mm/ rather than its own file in lib.
lib should be for almost-standalone stuff, IMO (ie. only using basic
kernel functionality).



Arguably that's what lib/ should be for, but it's currently largely


I disagree. There is code everywhere that exists to provide some
functionality via an API to other parts of the kernel. You don't
think mm/page_alloc.c should go in lib/?


used to avoid linking in unused code without adding more hair to
Kconfig. Which is what I'm trying to do here.


Well if you're doing a nice big reorganisation and jumble, then
why would you worry about a few lines in Kconfig?


Apart from these users outside mm/, I don't see much point in converting
things over. The page table walking API we have now is neat and simple.
It takes a few lines of code, but is it a big problem?



I don't think it really qualifies as either neat or simple. It may be
about as neat as walking a heterogenous tree with an inconsistent
naming scheme can be, but it's still a headache. A maze of twisty
little passages, all slightly different.


All page table range walking code should be the same.
It is a simple template to use to walk a range of ptes. It is
a headache outside mm/, sure, but not because of its incredible
complexity.

--
SUSE Labs, Novell Inc.


Re: [patch 07/20] Allow paravirt backend to choose kernel PMD sharing

2007-04-04 Thread Chris Wright
* Christoph Lameter ([EMAIL PROTECTED]) wrote:
> Acked-by: Christoph Lameter <[EMAIL PROTECTED]>
> 
> for all that's worth since I am not an i386 specialist.
> 
> How many of the issues with page struct sharing between slab and arch code 
> does this address?

I think the answer is 'none yet.'  It uses page sized slab and still
needs pgd_list, for example.  But the mm_list chaining should work too,
so it shouldn't make things any worse.

thanks,
-chris


Re: [linux-usb-devel] [RFC] HID bus design overview.

2007-04-04 Thread Li Yu
Jiri Kosina wrote:
> BTW as soon as you have some presentable code, could you please send
> it so
> that we could see what approach you have taken? Debating over code is 
> usually more efficient than just ranting random ideas :)
>
>   
There is a "presentable patch" in the attachment ;)

So far, Bluetooth still can not work :( TODO, TODO

The TODO list also includes moving the usbhid blacklist match logic into
the HID core.

But there is still a very strange problem on my PC. I have two USB
input devices on that machine: one is a wireless mouse, the other a
joystick. When I unplug the joystick, the mouse's input_dev is also
unregistered; I found that unregister_hid_device() is called from
hid_disconnect().

I am sure the hardware is fine; both devices work under 2.6.17.9
without any kernel changes. I suspect it is caused by an illegal
memory access, or that I have failed to understand the USB core ?! FIXING,
the fun is here too.

hiddump is a good idea, I like it; however, I think hidraw already is it:
hiddump is one application of hidraw. Is that right?

Good luck.

- Li Yu


hidbus.prototype.070404.patch.gz
Description: GNU Zip compressed data


Re: [rfc] no ZERO_PAGE?

2007-04-04 Thread Valdis . Kletnieks
On Wed, 04 Apr 2007 17:27:31 PDT, Linus Torvalds said:

> Sure you do. If glibc used mmap() or brk(), it *knows* the new data is 
> zero. So if you use calloc(), for example, it's entirely possible that 
> a good libc wouldn't waste time zeroing it.

Right.  However, the *user* code usually has no idea about the previous
history - so if it uses malloc(), it should be doing something like:

ptr = malloc(my_size*sizeof(whatever));
memset(ptr, 0, my_size*sizeof(whatever));

So malloc does something clever to guarantee that it's zero, and then userspace
undoes the cleverness because it has no easy way to *know* that cleverness
happened.

Admittedly, calloc() *can* get away with being clever.  I know we have some
glibc experts lurking here - any of them want to comment on how smart calloc()
actually is, or how smart it can become without needing major changes to the
rest of the malloc() and friends?





Invalid operand: kernel BUG at mm/rmap.c:434! and arch/i386/mm/highmem.c:42!)

2007-04-04 Thread Pat
I'm running kernel 2.6.9-22.ELsmp on dual Xeon
servers. I've received kernel panics occasionally in
the past, but they are more frequent now as the load
on the system has increased. Below is a capture of the
kernel panic. 

If anything below screams it's coming from a certain
source (defective RAM? Bad application or device
driver?), please let me know as I'm pulling what
little hair I have left out on this one. Suggestions
on what to try would be greatly appreciated. Thanks! 


[ cut here ]
kernel BUG at mm/rmap.c:434!
invalid operand:  [#1]
SMP 
Modules linked in: fusedriver(U) md5 ipv6 parport_pc
lp parport autofs4 i2c_dev i2c_core sunrpc
dm_multipath button battery ac uhci_hcd ehci_hcd e1000
floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod
3w_9xxx(U) sd_mod scsi_mod
CPU:5
EIP:0060:[]Not tainted VLI
EFLAGS: 00010202   (2.6.9-22.ELsmp) 
EIP is at page_add_anon_rmap+0xe/0x66
eax: 4964   ebx: c1800040   ecx: 090a100c   edx:
e5b5a544
esi: d1837e8c   edi: fff25508   ebp: c1800040   esp:
cf12ae40
ds: 007b   es: 007b   ss: 0068
Process check-enqueued- (pid: 17602,
threadinfo=cf12a000 task=f57577b0)
Stack: 40002067 8000 c014d034 fff29000 40002025
8000 0163 e5b5a544 
   f35b9080  fff25508 fff25508 090a100c
c014d0e2 cc5a2240 0001 
   090a100c    
8000   
Call Trace:
 [] do_anonymous_page+0x19c/0x1db
 [] do_no_page+0x6f/0x2f9
 [] handle_mm_fault+0xbd/0x175
 [] do_page_fault+0x1ae/0x5c6
 [] vma_adjust+0x286/0x2d6
 [] vma_merge+0xe1/0x165
 [] finish_task_switch+0x30/0x66
 [] schedule+0x844/0x87a
 [] do_page_fault+0x0/0x5c6
 [] error_code+0x2f/0x38
Code: 83 7b 10 00 74 0b 89 ca 89 d8 e8 fb fe ff ff 01
c6 89 d8 e8 ac df fe ff 5b 89 f0 5e c3 56 53 89 c3 8b
00 8b 72 44 f6 c4 08 74 08 <0f> 0b b2 01 ce 52 2e c0
85 f6 75 08 0f 0b b3 01 ce 52 2e c0 8b 
 <0>Fatal exception: panic in 5 seconds
bad: scheduling while atomic!
 [] schedule+0x2d/0x87a
 [] poke_blanked_console+0x3d/0x9a
 [] vt_console_print+0x294/0x2a5
 [] __mod_timer+0x101/0x10b
 [] schedule_timeout+0xd3/0xee
 [] process_timeout+0x0/0x5
 [] printk+0xe/0x11
 [] die+0x15a/0x16b
 [] do_invalid_op+0xcf/0xf2
 [] ext3_get_inode_loc+0x4f/0x226 [ext3]
 [] page_add_anon_rmap+0xe/0x66
 [] __wake_up+0x29/0x3c
 [] __rmqueue+0xc1/0x10c
 [] rmqueue_bulk+0x5b/0x65
 [] do_invalid_op+0x0/0xf2
 [] error_code+0x2f/0x38
 [] page_add_anon_rmap+0xe/0x66
 [] do_anonymous_page+0x19c/0x1db
 [] do_no_page+0x6f/0x2f9
 [] handle_mm_fault+0xbd/0x175
 [] do_page_fault+0x1ae/0x5c6
 [] vma_adjust+0x286/0x2d6
 [] vma_merge+0xe1/0x165
 [] finish_task_switch+0x30/0x66
 [] schedule+0x844/0x87a
 [] do_page_fault+0x0/0x5c6
 [] error_code+0x2f/0x38
[ cut here ]
kernel BUG at arch/i386/mm/highmem.c:42!
invalid operand:  [#2]
SMP 
Modules linked in: fusedriver(U) md5 ipv6 parport_pc
lp parport autofs4 i2c_dev i2c_core sunrpc
dm_multipath button battery ac uhci_hcd ehci_hcd e1000
floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod
3w_9xxx(U) sd_mod scsi_mod
CPU:5
EIP:0060:[]Not tainted VLI
EFLAGS: 00010286   (2.6.9-22.ELsmp) 
EIP is at kmap_atomic+0x73/0x178
eax: c000a928   ebx: 8000   ecx: 90b52163   edx:
00c4
esi: f467a380   edi: d898bbe8   ebp: c000af48   esp:
eee66e7c
ds: 007b   es: 007b   ss: 0068
Process search_pe (pid: 17511, threadinfo=eee66000
task=f511e030)
Stack: 0001  f5646900 0001 
   
    fff25000 c182d280 e19cb630 f467a380
d898bbe8 0c00 c014b006 
   e19cb630 e19cb630 e19cb630 f467a3b0 d898bbe8
afb80004 c014d4d1 f3aae22c 
Call Trace:
 [] pte_alloc_map+0xd9/0xe2
 [] handle_mm_fault+0x8b/0x175
 [] do_page_fault+0x1ae/0x5c6
 [] vma_merge+0x156/0x165
 [] do_mmap_pgoff+0x3cd/0x666
 [] do_mmap_pgoff+0x568/0x666
 [] sys_mmap2+0x7e/0xaf
 [] do_page_fault+0x0/0x5c6
 [] error_code+0x2f/0x38
 [] __lock_text_end+0x11a/0x100f
Code: c8 40 c0 01 c2 8d 42 16 c1 e0 0c 29 c1 89 4c 24
24 8d 04 d5 00 00 00 00 89 e9 29 c1 89 c8 8b 09 8b 58
04 85 c9 75 04 85 db 74 08 <0f> 0b 2a 00 04 27 2e c0
8b 5c 24 28 8b 0d 78 29 32 c0 8b 03 89 
 <0>Fatal exception: panic in 5 seconds
bad: scheduling while atomic!
 [] schedule+0x2d/0x87a
 [] poke_blanked_console+0x3d/0x9a
 [] vt_console_print+0x294/0x2a5
 [] __mod_timer+0x101/0x10b
 [] schedule_timeout+0xd3/0xee
 [] process_timeout+0x0/0x5
 [] printk+0xe/0x11
 [] die+0x15a/0x16b
 [] do_invalid_op+0xcf/0xf2
 [] autoremove_wake_function+0xd/0x2d
 [] kmap_atomic+0x73/0x178
 [] __wake_up+0x29/0x3c
 [] __kfree_skb+0xf4/0xf7
 [] unix_stream_recvmsg+0x2cc/0x39d
 [] do_invalid_op+0x0/0xf2
 [] error_code+0x2f/0x38
 [] kmap_atomic+0x73/0x178
 [] pte_alloc_map+0xd9/0xe2
 [] handle_mm_fault+0x8b/0x175
 [] do_page_fault+0x1ae/0x5c6
 [] vma_merge+0x156/0x165
 [] do_mmap_pgoff+0x3cd/0x666
 [] do_mmap_pgoff+0x568/0x666
 [] sys_mmap2+0x7e/0xaf
 [] do_page_fault+0x0/0x5c6
 [] error_code+0x2f/0x38
 [] __lock_text_end+0x11a/0x100f
[ cut 

Re: 2.6.21-rc5-mm4

2007-04-04 Thread Antonino A. Daplas
On Mon, 2007-04-02 at 22:47 -0700, Andrew Morton wrote:
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.21-rc5/2.6.21-rc5-mm4/
> 
> - The oops in git-net.patch has been fixed, so that tree has been restored. 
>   It is huge.
> 
> - Added the device-mapper development tree to the -mm lineup (Alasdair
>   Kergon).  It is a quilt tree, living at
>   ftp://ftp.kernel.org/pub/linux/kernel/people/agk/patches/2.6/editing/.
> 
> - Added davidel's signalfd stuff.
> 
> 
> 

I see this tracing (from the lock-dependency validator?) for several -mm
versions.  This is from a Silan ethernet card (CONFIG_SC92031).

00:0b.0 Ethernet controller: Hangzhou Silan Microelectronics Co., Ltd.
Unknown device 2031 (rev 01)

Other than the tracing, I'm not having any problems.

Tony

==
[ INFO: soft-safe -> soft-unsafe lock order detected ]
2.6.21-rc5-mm4-default #44
--
ip/3036 [HC0[0]:SC0[2]:HE1:SE0] is trying to acquire:
 (>lock){--..}, at: [] sc92031_set_multicast_list
+0x14/0x2d [sc92031]

and this task is already holding:
 (>_xmit_lock){-...}, at: [] dev_mc_upload+0x14/0x3a
which would create a new lock dependency:
 (>_xmit_lock){-...} -> (>lock){--..}

but this new dependency connects a soft-irq-safe lock:
 (>mca_lock){-+..}
... which became soft-irq-safe at:
  [] __lock_acquire+0x3d7/0xb93
  [] lock_acquire+0x68/0x82
  [] _spin_lock_bh+0x30/0x3d
  [] mld_ifc_timer_expire+0x15b/0x21d [ipv6]
  [] run_timer_softirq+0xf1/0x14e
  [] __do_softirq+0x46/0x9c
  [] do_softirq+0x2d/0x46
  [] irq_exit+0x3b/0x6b
  [] do_IRQ+0x5e/0x76
  [] common_interrupt+0x2e/0x34
  [] error_code+0x71/0x78
  [] 0x

to a soft-irq-unsafe lock:
 (>lock){--..}
... which became soft-irq-unsafe at:
...  [] __lock_acquire+0x46b/0xb93
  [] lock_acquire+0x68/0x82
  [] _spin_lock+0x2b/0x38
  [] sc92031_open+0xcc/0x16f [sc92031]
  [] dev_open+0x33/0x6e
  [] dev_change_flags+0x57/0x10b
  [] devinet_ioctl+0x235/0x546
  [] inet_ioctl+0x89/0xaa
  [] sock_ioctl+0x1ac/0x1ca
  [] do_ioctl+0x1c/0x53
  [] vfs_ioctl+0x1ec/0x203
  [] sys_ioctl+0x49/0x62
  [] sysenter_past_esp+0x5d/0x99
  [] 0x

other info that might help us debug this:

2 locks held by ip/3036:
 #0:  (rtnl_mutex){--..}, at: [] mutex_lock+0x24/0x28
 #1:  (>_xmit_lock){-...}, at: [] dev_mc_upload+0x14/0x3a

the soft-irq-safe lock's dependencies:
-> (>mca_lock){-+..} ops: 9 {
   initial-use  at:
[] __lock_acquire+0x486/0xb93
[] lock_acquire+0x68/0x82
[] _spin_lock_bh+0x30/0x3d
[] igmp6_group_added+0x1b/0x120 [ipv6]
[] ipv6_dev_mc_inc+0x2f9/0x346 [ipv6]
[] ipv6_add_dev+0x232/0x240 [ipv6]
[] versions+0x1e8b/0xf9c8
[x_tables]
[] versions+0x1d54/0xf9c8
[x_tables]
[] sys_init_module+0x1252/0x138f
[] sysenter_past_esp+0x5d/0x99
[] 0x
   in-softirq-W at:
[] __lock_acquire+0x3d7/0xb93
[] lock_acquire+0x68/0x82
[] _spin_lock_bh+0x30/0x3d
[] mld_ifc_timer_expire+0x15b/0x21d
[ipv6]
[] run_timer_softirq+0xf1/0x14e
[] __do_softirq+0x46/0x9c
[] do_softirq+0x2d/0x46
[] irq_exit+0x3b/0x6b
[] do_IRQ+0x5e/0x76
[] common_interrupt+0x2e/0x34
[] error_code+0x71/0x78
[] 0x
   hardirq-on-W at:
[] __lock_acquire+0x441/0xb93
[] lock_acquire+0x68/0x82
[] _spin_lock_bh+0x30/0x3d
[] igmp6_group_added+0x1b/0x120 [ipv6]
[] ipv6_dev_mc_inc+0x2f9/0x346 [ipv6]
[] ipv6_add_dev+0x232/0x240 [ipv6]
[] versions+0x1e8b/0xf9c8
[x_tables]
[] versions+0x1d54/0xf9c8
[x_tables]
[] sys_init_module+0x1252/0x138f
[] sysenter_past_esp+0x5d/0x99
[] 0x
 }
 ... key  at: [] __key.29988+0x0/0xfffe9535 [ipv6]
 -> (>_xmit_lock){-...} ops: 18 {
initial-use  at:
  [] __lock_acquire+0x486/0xb93
  [] lock_acquire+0x68/0x82
  [] _spin_lock_bh+0x30/0x3d
  [] dev_mc_upload+0x14/0x3a
  [] dev_change_flags+0x31/0x10b
  [] devinet_ioctl+0x235/0x546
  [] inet_ioctl+0x89/0xaa
  [] sock_ioctl+0x1ac/0x1ca
  [] do_ioctl+0x1c/0x53
  [] 

Re: plain 2.6.21-rc5 (1) vs amanda (0)

2007-04-04 Thread Gene Heskett
On Wednesday 04 April 2007, Dave Dillow wrote:
>On Wed, 2007-04-04 at 13:29 -0700, Andrew Morton wrote:
>> On Wed, 04 Apr 2007 14:17:13 -0400
>>
>> Dave Dillow <[EMAIL PROTECTED]> wrote:
>> > The thing is, it's been broken for a long time -- this change just
>> > highlighted it. This isn't the first time that device-mapper has
>> > moved -- the introduction of mdp (before git, so haven't tracked
>> > down timeframe) also moved it around. The dynamic major is not
>> > stable, so should we be concerned if it moves for 2.6.21?
>> >
>> > I don't like the effect it has on the backups, but I don't think we
>> > should hand out LOCAL/EXP majors to dynamic devices, either. There
>> > is a module option to make the device-mapper and mdp majors stable,
>> > so perhaps a compromise is possible? Revert for 2.6.21, and schedule
>> > the patch for later addition, which gives distros time to use the DM
>> > major option?
>>
>> hm, good points.
>>
>> Overall, the patch helps kernel developers and hurts the userbase.  I
>> tend to prefer to hurt kernel developers than our users ;)
>>
>> I don't think the protect-lanana-numbers thing is very important,
>> really. If some kernel developer or someone who is maintaining an
>> unofficial out-of-tree driver hits the problem, they are presumably
>> able to handle it.  Preferably by switching to a dynamically-assigned
>> major.
>
>Works for me; I don't really have a dog in the fight.

I do, but now that I know what the problem is, and why the fix was done, 
I'm ambivalent as to how it should be fixed.  In this case the dog 
getting bit is tar (I have NDI if dump is also affected, as I don't use 
it), and I would rather see tar made immune to the effects of a patch 
that was originally made to bring practice a little closer to what 
the preacher said.

I see the facts as:
 1. the name of the directory didn't change,
 2. the inode contents aren't (ANAICT) modified in any way, and
 3. the file is still the file it was back in July of last year when I 
wrote it.

Perhaps someone can explain to me why tar thinks it's all new just because 
the mapper was spanked and made to be a good puppy?

In my mind at least, it's tar that has egg on its face, and it's tar 
that needs to be immunized.  Can someone tell me where I am wrong?

I'd really like to know why there is an apparent connection between this 
and tar's response, which is to think the whole disk is new and has to 
have a level 0 backup all over again.

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
If God had a beard, he'd be a UNIX programmer.


Re: [patch 07/20] Allow paravirt backend to choose kernel PMD sharing

2007-04-04 Thread Jeremy Fitzhardinge
Christoph Lameter wrote:
> Acked-by: Christoph Lameter <[EMAIL PROTECTED]>
>
> for all that's worth since I am not an i386 specialist.
>
> How many of the issues with page struct sharing between slab and arch code 
> does this address?
>   

I haven't been following that thread as closely as I should be, so I
don't have an answer.  I guess the interesting thing in this patch is
that it only uses the pmd cache for usermode pmds (which are
pre-zeroed), and normal page allocations for kernel pmds.  Also, if the
kernel pmds are unshared, the pgds are page-sized, so it's not really
making good use of the pgd cache.

J


[ PATCH 2/4] UML - Kernel segfaults should dump proper registers

2007-04-04 Thread Jeff Dike
If there's a segfault inside the kernel, we want a dump of the registers
at the point of the segfault, not the registers at the point of
calling panic or the last userspace registers.

sig_handler_common_skas now uses a static register set in the case
of a SIGSEGV to avoid messing up the process registers if the
segfault turns out to be non-fatal.

The architecture sigcontext-to-pt_regs copying code was repurposed
to copy data out of the SEGV stack frame.

Signed-off-by: Jeff Dike <[EMAIL PROTECTED]>
--
 arch/um/include/common-offsets.h |2 +
 arch/um/include/kern_util.h  |2 +
 arch/um/kernel/trap.c|   10 ++---
 arch/um/os-Linux/skas/trap.c |   15 --
 arch/um/sys-i386/signal.c|   41 ++-
 arch/um/sys-x86_64/signal.c  |   30 
 6 files changed, 78 insertions(+), 22 deletions(-)

Index: linux-2.6.21-mm/arch/um/kernel/trap.c
===
--- linux-2.6.21-mm.orig/arch/um/kernel/trap.c  2007-04-02 12:19:04.0 
-0400
+++ linux-2.6.21-mm/arch/um/kernel/trap.c   2007-04-02 12:19:07.0 
-0400
@@ -170,8 +170,10 @@ unsigned long segv(struct faultinfo fi, 
flush_tlb_kernel_vm();
return 0;
}
-   else if(current->mm == NULL)
-   panic("Segfault with no mm");
+   else if(current->mm == NULL) {
+   show_regs(container_of(regs, struct pt_regs, regs));
+   panic("Segfault with no mm");
+   }
 
	if (SEGV_IS_FIXABLE(&fi) || SEGV_MAYBE_FIXABLE(&fi))
err = handle_page_fault(address, ip, is_write, is_user, 
_code);
@@ -194,9 +196,11 @@ unsigned long segv(struct faultinfo fi, 
else if(!is_user && arch_fixup(ip, regs))
return 0;
 
-   if(!is_user)
+   if(!is_user) {
+   show_regs(container_of(regs, struct pt_regs, regs));
panic("Kernel mode fault at addr 0x%lx, ip 0x%lx",
  address, ip);
+   }
 
if (err == -EACCES) {
si.si_signo = SIGBUS;
Index: linux-2.6.21-mm/arch/um/include/kern_util.h
===
--- linux-2.6.21-mm.orig/arch/um/include/kern_util.h2007-04-02 
12:19:04.0 -0400
+++ linux-2.6.21-mm/arch/um/include/kern_util.h 2007-04-02 12:19:07.0 
-0400
@@ -116,4 +116,6 @@ extern void log_info(char *fmt, ...) __a
 extern int __cant_sleep(void);
 extern void sigio_handler(int sig, union uml_pt_regs *regs);
 
+extern void copy_sc(union uml_pt_regs *regs, void *from);
+
 #endif
Index: linux-2.6.21-mm/arch/um/include/common-offsets.h
===
--- linux-2.6.21-mm.orig/arch/um/include/common-offsets.h   2007-04-02 
12:18:49.0 -0400
+++ linux-2.6.21-mm/arch/um/include/common-offsets.h2007-04-02 
12:19:07.0 -0400
@@ -24,5 +24,7 @@ DEFINE(UM_ELF_CLASS, ELF_CLASS);
 DEFINE(UM_ELFCLASS32, ELFCLASS32);
 DEFINE(UM_ELFCLASS64, ELFCLASS64);
 
+DEFINE(UM_NR_CPUS, NR_CPUS);
+
 /* For crypto assembler code. */
 DEFINE(crypto_tfm_ctx_offset, offsetof(struct crypto_tfm, __crt_ctx));
Index: linux-2.6.21-mm/arch/um/os-Linux/skas/trap.c
===
--- linux-2.6.21-mm.orig/arch/um/os-Linux/skas/trap.c   2007-04-02 
12:19:04.0 -0400
+++ linux-2.6.21-mm/arch/um/os-Linux/skas/trap.c2007-04-02 
12:19:07.0 -0400
@@ -15,6 +15,8 @@
 #include "sysdep/ptrace_user.h"
 #include "os.h"
 
+static union uml_pt_regs ksig_regs[UM_NR_CPUS];
+
 void sig_handler_common_skas(int sig, void *sc_ptr)
 {
struct sigcontext *sc = sc_ptr;
@@ -27,10 +29,19 @@ void sig_handler_common_skas(int sig, vo
 * the process will die.
 * XXX Figure out why this is better than SA_NODEFER
 */
-   if(sig == SIGSEGV)
+   if(sig == SIGSEGV) {
change_sig(SIGSEGV, 1);
+   /* For segfaults, we want the data from the
+* sigcontext.  In this case, we don't want to mangle
+* the process registers, so use a static set of
+* registers.  For other signals, the process
+* registers are OK.
+*/
+   r = &ksig_regs[cpu()];
+   copy_sc(r, sc_ptr);
+   }
+   else r = TASK_REGS(get_current());
 
-   r = TASK_REGS(get_current());
save_user = r->skas.is_user;
r->skas.is_user = 0;
if ( sig == SIGFPE || sig == SIGSEGV ||
Index: linux-2.6.21-mm/arch/um/sys-i386/signal.c
===
--- linux-2.6.21-mm.orig/arch/um/sys-i386/signal.c  2007-04-02 
12:19:04.0 -0400
+++ linux-2.6.21-mm/arch/um/sys-i386/signal.c   2007-04-02 12:19:07.0 
-0400
@@ -18,6 +18,28 @@
 
 #include "skas.h"
 
+void copy_sc(union uml_pt_regs 

[i386] Use page allocator to allocate threadinfo structure

2007-04-04 Thread Christoph Lameter
i386 uses kmalloc to allocate the threadinfo structure, assuming that the 
allocation results in a page-sized, page-aligned object. That has worked so 
far because SLAB exempts page-sized slabs from debugging and aligns them 
in special ways that go beyond the restrictions imposed by 
KMALLOC_ARCH_MINALIGN, which apply to other slabs in the kmalloc array.

SLUB also works fine without debugging, since page-sized allocations neatly 
align at page boundaries. However, if debugging is switched on, then SLUB 
will extend the slab with debug information. The resulting slab is no 
longer of page size; it will only be aligned following the requirements 
imposed by KMALLOC_ARCH_MINALIGN. As a result, the threadinfo 
structure may not be page aligned, which makes i386 fail to boot with
SLUB debug on.

Replace the calls to kmalloc with calls into the page allocator.

An alternate solution may be to create a custom slab cache where the 
alignment is set to PAGE_SIZE. That would allow SLUB debugging to be 
applied to the threadinfo structure.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

Index: linux-2.6.21-rc5-mm4/include/asm-i386/thread_info.h
===
--- linux-2.6.21-rc5-mm4.orig/include/asm-i386/thread_info.h2007-04-03 
23:48:34.0 -0700
+++ linux-2.6.21-rc5-mm4/include/asm-i386/thread_info.h 2007-04-04 
17:33:41.0 -0700
@@ -95,12 +95,14 @@ static inline struct thread_info *curren
 
 /* thread information allocation */
 #ifdef CONFIG_DEBUG_STACK_USAGE
-#define alloc_thread_info(tsk) kzalloc(THREAD_SIZE, GFP_KERNEL)
+#define alloc_thread_info(tsk) ((struct thread_info *) \
+   __get_free_pages(GFP_KERNEL| __GFP_ZERO, get_order(THREAD_SIZE)))
 #else
-#define alloc_thread_info(tsk) kmalloc(THREAD_SIZE, GFP_KERNEL)
+#define alloc_thread_info(tsk) ((struct thread_info *) \
+   __get_free_pages(GFP_KERNEL, get_order(THREAD_SIZE)))
 #endif
 
-#define free_thread_info(info) kfree(info)
+#define free_thread_info(info) free_pages((unsigned long)(info), 
get_order(THREAD_SIZE))
 
 #else /* !__ASSEMBLY__ */
 


[ PATCH 3/4] UML - Comment early boot locking

2007-04-04 Thread Jeff Dike
Commentary about missing locking.

Also got rid of uml_start because it was pointless.

Signed-off-by: Jeff Dike <[EMAIL PROTECTED]>
--
 arch/um/kernel/um_arch.c |   22 +-
 1 file changed, 13 insertions(+), 9 deletions(-)

Index: linux-2.6.21-mm/arch/um/kernel/um_arch.c
===
--- linux-2.6.21-mm.orig/arch/um/kernel/um_arch.c   2007-04-02 
13:04:27.0 -0400
+++ linux-2.6.21-mm/arch/um/kernel/um_arch.c2007-04-02 15:10:06.0 
-0400
@@ -44,7 +44,7 @@
 
 #define DEFAULT_COMMAND_LINE "root=98:0"
 
-/* Changed in linux_main and setup_arch, which run before SMP is started */
+/* Changed in add_arg and setup_arch, which run before SMP is started */
 static char __initdata command_line[COMMAND_LINE_SIZE] = { 0 };
 
 static void __init add_arg(char *arg)
@@ -58,7 +58,12 @@ static void __init add_arg(char *arg)
strcat(command_line, arg);
 }
 
-struct cpuinfo_um boot_cpu_data = { 
+/*
+ * These fields are initialized at boot time and not changed.
+ * XXX This structure is used only in the non-SMP case.  Maybe this
+ * should be moved to smp.c.
+ */
+struct cpuinfo_um boot_cpu_data = {
.loops_per_jiffy= 0,
.ipi_pipe   = { -1, -1 }
 };
@@ -119,14 +124,12 @@ const struct seq_operations cpuinfo_op =
 /* Set in linux_main */
 unsigned long host_task_size;
 unsigned long task_size;
-
-unsigned long uml_start;
-
-/* Set in early boot */
 unsigned long uml_physmem;
-unsigned long uml_reserved;
+unsigned long uml_reserved; /* Also modified in mem_init */
 unsigned long start_vm;
 unsigned long end_vm;
+
+/* Set in uml_ncpus_setup */
 int ncpus = 1;
 
 #ifdef CONFIG_CMDLINE_ON_HOST
@@ -140,6 +143,8 @@ static char *argv1_end = NULL;
 
 /* Set in early boot */
 static int have_root __initdata = 0;
+
+/* Set in uml_mem_setup and modified in linux_main */
 long long physmem_size = 32 * 1024 * 1024;
 
 void set_cmdline(char *cmd)
@@ -378,7 +383,6 @@ int __init linux_main(int argc, char **a
 
printf("UML running in %s mode\n", mode);
 
-   uml_start = (unsigned long) &__binary_start;
host_task_size = CHOOSE_MODE_PROC(set_task_sizes_tt,
  set_task_sizes_skas, &task_size);
 
@@ -400,7 +404,7 @@ int __init linux_main(int argc, char **a
physmem_size += UML_ROUND_UP(brk_start) - UML_ROUND_UP(&_end);
}
 
-   uml_physmem = uml_start & PAGE_MASK;
+   uml_physmem = (unsigned long) &__binary_start & PAGE_MASK;
 
/* Reserve up to 4M after the current brk */
uml_reserved = ROUND_4M(brk_start) + (1 << 22);


[ PATCH 4/4] UML - irq locking commentary

2007-04-04 Thread Jeff Dike
Locking commentary.

Signed-off-by: Jeff Dike <[EMAIL PROTECTED]>
--
 arch/um/kernel/irq.c |   15 +++
 1 file changed, 15 insertions(+)

Index: linux-2.6.21-mm/arch/um/kernel/irq.c
===
--- linux-2.6.21-mm.orig/arch/um/kernel/irq.c   2007-04-02 13:04:27.0 
-0400
+++ linux-2.6.21-mm/arch/um/kernel/irq.c2007-04-02 14:48:48.0 
-0400
@@ -78,6 +78,14 @@ skip:
return 0;
 }
 
+/*
+ * This list is accessed under irq_lock, except in sigio_handler,
+ * where it is safe from being modified.  IRQ handlers won't change it -
+ * if an IRQ source has vanished, it will be freed by free_irqs just
+ * before returning from sigio_handler.  That will process a separate
+ * list of irqs to free, with its own locking, coming back here to
+ * remove list elements, taking the irq_lock to do so.
+ */
 static struct irq_fd *active_fds = NULL;
static struct irq_fd **last_irq_ptr = &active_fds;
 
@@ -243,6 +251,7 @@ void free_irq_by_fd(int fd)
	free_irq_by_cb(same_fd, &fd);
 }
 
+/* Must be called with irq_lock held */
 static struct irq_fd *find_irq_by_fd(int fd, int irqnum, int *index_out)
 {
struct irq_fd *irq;
@@ -308,6 +317,12 @@ void deactivate_fd(int fd, int irqnum)
ignore_sigio_fd(fd);
 }
 
+/*
+ * Called just before shutdown in order to provide a clean exec
+ * environment in case the system is rebooting.  No locking because
+ * that would cause a pointless shutdown hang if something hadn't
+ * released the lock.
+ */
 int deactivate_all_fds(void)
 {
struct irq_fd *irq;


[ PATCH 1/4] UML - Tidy fault code

2007-04-04 Thread Jeff Dike
Tidying in preparation for the segfault register dumping patch which
follows.

void * pointers are changed to union uml_pt_regs *.  This makes
the types match reality, except in arch_fixup, which is changed to
operate on a union uml_pt_regs.  This fixes a bug in the call from
segv_handler, which passes a union uml_pt_regs, to segv, which expects
to pass a struct sigcontext to arch_fixup.

Whitespace and other style fixes.

There's also an errno printk fix.

Signed-off-by: Jeff Dike <[EMAIL PROTECTED]>
--
 arch/um/include/arch.h   |2 +-
 arch/um/include/kern_util.h  |2 +-
 arch/um/kernel/trap.c|   35 ++-
 arch/um/os-Linux/skas/trap.c |   17 -
 arch/um/sys-i386/fault.c |   18 ++
 arch/um/sys-i386/signal.c|   41 +++--
 arch/um/sys-x86_64/fault.c   |   30 +-
 arch/um/sys-x86_64/signal.c  |2 +-
 8 files changed, 63 insertions(+), 84 deletions(-)

Index: linux-2.6.21-mm/arch/um/include/arch.h
===
--- linux-2.6.21-mm.orig/arch/um/include/arch.h 2007-04-02 12:18:49.0 
-0400
+++ linux-2.6.21-mm/arch/um/include/arch.h  2007-04-02 12:19:04.0 
-0400
@@ -9,7 +9,7 @@
 #include "sysdep/ptrace.h"
 
 extern void arch_check_bugs(void);
-extern int arch_fixup(unsigned long address, void *sc_ptr);
+extern int arch_fixup(unsigned long address, union uml_pt_regs *regs);
 extern int arch_handle_signal(int sig, union uml_pt_regs *regs);
 
 #endif
Index: linux-2.6.21-mm/arch/um/include/kern_util.h
===
--- linux-2.6.21-mm.orig/arch/um/include/kern_util.h2007-04-02 
12:19:01.0 -0400
+++ linux-2.6.21-mm/arch/um/include/kern_util.h 2007-04-02 12:19:04.0 
-0400
@@ -43,7 +43,7 @@ extern unsigned long alloc_stack(int ord
 extern int do_signal(void);
 extern int is_stack_fault(unsigned long sp);
 extern unsigned long segv(struct faultinfo fi, unsigned long ip,
- int is_user, void *sc);
+ int is_user, union uml_pt_regs *regs);
 extern int handle_page_fault(unsigned long address, unsigned long ip,
 int is_write, int is_user, int *code_out);
 extern void syscall_ready(void);
Index: linux-2.6.21-mm/arch/um/kernel/trap.c
===
--- linux-2.6.21-mm.orig/arch/um/kernel/trap.c  2007-04-02 12:18:49.0 
-0400
+++ linux-2.6.21-mm/arch/um/kernel/trap.c   2007-04-02 12:19:04.0 
-0400
@@ -72,8 +72,8 @@ good_area:
goto out;
 
/* Don't require VM_READ|VM_EXEC for write faults! */
-if(!is_write && !(vma->vm_flags & (VM_READ | VM_EXEC)))
-goto out;
+   if(!is_write && !(vma->vm_flags & (VM_READ | VM_EXEC)))
+   goto out;
 
do {
 survive:
@@ -157,18 +157,19 @@ static void segv_handler(int sig, union 
  * the info in the regs. A pointer to the info then would
  * give us bad data!
  */
-unsigned long segv(struct faultinfo fi, unsigned long ip, int is_user, void 
*sc)
+unsigned long segv(struct faultinfo fi, unsigned long ip, int is_user,
+  union uml_pt_regs *regs)
 {
struct siginfo si;
void *catcher;
int err;
-int is_write = FAULT_WRITE(fi);
-unsigned long address = FAULT_ADDRESS(fi);
+   int is_write = FAULT_WRITE(fi);
+   unsigned long address = FAULT_ADDRESS(fi);
 
-if(!is_user && (address >= start_vm) && (address < end_vm)){
-flush_tlb_kernel_vm();
-return(0);
-}
+   if(!is_user && (address >= start_vm) && (address < end_vm)){
+   flush_tlb_kernel_vm();
+   return 0;
+   }
else if(current->mm == NULL)
panic("Segfault with no mm");
 
@@ -183,17 +184,17 @@ unsigned long segv(struct faultinfo fi, 
 
catcher = current->thread.fault_catcher;
if(!err)
-   return(0);
+   return 0;
else if(catcher != NULL){
current->thread.fault_addr = (void *) address;
do_longjmp(catcher, 1);
}
else if(current->thread.fault_addr != NULL)
panic("fault_addr set but no fault catcher");
-else if(!is_user && arch_fixup(ip, sc))
-   return(0);
+   else if(!is_user && arch_fixup(ip, regs))
+   return 0;
 
-   if(!is_user)
+   if(!is_user)
panic("Kernel mode fault at addr 0x%lx, ip 0x%lx",
  address, ip);
 
@@ -202,7 +203,7 @@ unsigned long segv(struct faultinfo fi, 
si.si_errno = 0;
si.si_code = BUS_ADRERR;
si.si_addr = (void __user *)address;
-current->thread.arch.faultinfo = fi;
+   

[ PATCH 0/4] UML - Four for 2.6.22

2007-04-04 Thread Jeff Dike
These four patches tidy code, improve debug output, and add comments on
pre-existing (or previously missing) locking.

They should wait for 2.6.22.

Jeff


Re: + clocksource-driver-initialize-list-value.patch added to -mm tree

2007-04-04 Thread Daniel Walker
On Wed, 2007-04-04 at 17:20 -0700, Jeremy Fitzhardinge wrote:
> Daniel Walker wrote:
> > Setting CLOCK_SOURCE_IS_CONTINUOUS is largely administration , do you
> > know what that flag means?
> >   
> 
> Sure, but at least it has something to do with clocks and time.
> 
> > list values and list initialization are hardly internal details , they
> > are commonly used all over the kernel.
> >   
> 
> The fact that a list is used to string together clocksource structures
> is an internal detail of the clock subsystem.  It's annoying that there
> has to be a list head in the clocksource structure, but at least as a
> clocksource implementer I can ignore it, and if it ever gets changed to
> something else I can keep ignoring it.  Your change makes it something
> that gets pointlessly replicated all over the kernel.

It's already been discussed in the past, but the clocksource structure
is already very simple.. Adding a list initializer doesn't change
anything..

Daniel



Re: [patch 07/20] Allow paravirt backend to choose kernel PMD sharing

2007-04-04 Thread Christoph Lameter
Acked-by: Christoph Lameter <[EMAIL PROTECTED]>

for all that's worth since I am not an i386 specialist.

How much of the issues with page struct sharing between slab and arch code 
does this address?



Re: 2.6.21-rc5-mm4 (SLUB)

2007-04-04 Thread Christoph Lameter
On Wed, 4 Apr 2007, Badari Pulavarty wrote:

> On Wed, 2007-04-04 at 15:59 -0700, Christoph Lameter wrote:
> > On Wed, 4 Apr 2007, Badari Pulavarty wrote:
> > 
> > > Here is the slub_debug=FU output with the above patch.
> > 
> > Hmmm... Looks like the object is actually free. Someone writes beyond the 
> > end of the earlier object. Setting Z should check overwrites but it 
> > switched off merging. So set
> > 
> > slub_debug=FZ
> > 
> > Analogous to the last patch you would need to take out redzoning from 
> > the flags that stop merging. Then rerun. Maybe we can track it down this 
> > way.
> 
> Hmm.. I did that and machine boots fine, with absolutely no
> debug messages :(

Were the slabs merged? Look at /sys/slab and see if there are any symlinks 
there.



[patch 14/20] add common patching machinery

2007-04-04 Thread Jeremy Fitzhardinge
Implement the actual patching machinery.  paravirt_patch_default()
contains the logic to automatically patch a callsite based on a few
simple rules:

 - if the paravirt_op function is paravirt_nop, then patch nops
 - if the paravirt_op function is a jmp target, then jmp to it
 - if the paravirt_op function is callable and doesn't clobber too much
for the callsite, call it directly

paravirt_patch_default is suitable as a default implementation of
paravirt_ops.patch; it will remove most of the expensive indirect calls
in favour of either a direct call or a pile of nops.

Backends may implement their own patcher, however.  There are several
helper functions to help with this:

paravirt_patch_nop  nop out a callsite
paravirt_patch_ignore   leave the callsite as-is
paravirt_patch_call patch a call if the caller and callee
have compatible clobbers
paravirt_patch_jmp  patch in a jmp
paravirt_patch_insnspatch some literal instructions over
the callsite, if they fit

This patch also implements more direct patches for the native case, so
that when running on native hardware many common operations are
implemented inline.

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Cc: Rusty Russell <[EMAIL PROTECTED]>
Cc: Zachary Amsden <[EMAIL PROTECTED]>
Cc: Anthony Liguori <[EMAIL PROTECTED]>
Acked-by: Ingo Molnar <[EMAIL PROTECTED]>
---
 arch/i386/kernel/alternative.c |5 -
 arch/i386/kernel/paravirt.c|  164 
 include/asm-i386/paravirt.h|   12 ++
 3 files changed, 149 insertions(+), 32 deletions(-)

===
--- a/arch/i386/kernel/alternative.c
+++ b/arch/i386/kernel/alternative.c
@@ -349,11 +349,14 @@ void apply_paravirt(struct paravirt_patc
used = paravirt_ops.patch(p->instrtype, p->clobbers, p->instr,
  p->len);
 
+   BUG_ON(used > p->len);
+
/* Pad the rest with nops */
nop_out(p->instr + used, p->len - used);
}
 
-   /* Sync to be conservative, in case we patched following instructions */
+   /* Sync to be conservative, in case we patched following
+  instructions */
sync_core();
 }
 #endif /* CONFIG_PARAVIRT */
===
--- a/arch/i386/kernel/paravirt.c
+++ b/arch/i386/kernel/paravirt.c
@@ -54,40 +54,142 @@ char *memory_setup(void)
 #define DEF_NATIVE(name, code) \
extern const char start_##name[], end_##name[]; \
asm("start_" #name ": " code "; end_" #name ":")
-DEF_NATIVE(cli, "cli");
-DEF_NATIVE(sti, "sti");
-DEF_NATIVE(popf, "push %eax; popf");
-DEF_NATIVE(pushf, "pushf; pop %eax");
+
+DEF_NATIVE(irq_disable, "cli");
+DEF_NATIVE(irq_enable, "sti");
+DEF_NATIVE(restore_fl, "push %eax; popf");
+DEF_NATIVE(save_fl, "pushf; pop %eax");
 DEF_NATIVE(iret, "iret");
-DEF_NATIVE(sti_sysexit, "sti; sysexit");
-
-static const struct native_insns
-{
-   const char *start, *end;
-} native_insns[] = {
-   [PARAVIRT_PATCH(irq_disable)] = { start_cli, end_cli },
-   [PARAVIRT_PATCH(irq_enable)] = { start_sti, end_sti },
-   [PARAVIRT_PATCH(restore_fl)] = { start_popf, end_popf },
-   [PARAVIRT_PATCH(save_fl)] = { start_pushf, end_pushf },
-   [PARAVIRT_PATCH(iret)] = { start_iret, end_iret },
-   [PARAVIRT_PATCH(irq_enable_sysexit)] = { start_sti_sysexit, 
end_sti_sysexit },
-};
+DEF_NATIVE(irq_enable_sysexit, "sti; sysexit");
+DEF_NATIVE(read_cr2, "mov %cr2, %eax");
+DEF_NATIVE(write_cr3, "mov %eax, %cr3");
+DEF_NATIVE(read_cr3, "mov %cr3, %eax");
+DEF_NATIVE(clts, "clts");
+DEF_NATIVE(read_tsc, "rdtsc");
+
+DEF_NATIVE(ud2a, "ud2a");
 
 static unsigned native_patch(u8 type, u16 clobbers, void *insns, unsigned len)
 {
-   unsigned int insn_len;
-
-   /* Don't touch it if we don't have a replacement */
-   if (type >= ARRAY_SIZE(native_insns) || !native_insns[type].start)
-   return len;
-
-   insn_len = native_insns[type].end - native_insns[type].start;
-
-   /* Similarly if we can't fit replacement. */
-   if (len < insn_len)
-   return len;
-
-   memcpy(insns, native_insns[type].start, insn_len);
+   const unsigned char *start, *end;
+   unsigned ret;
+
+   switch(type) {
+#define SITE(x)case PARAVIRT_PATCH(x): start = start_##x; end = 
end_##x; goto patch_site
+   SITE(irq_disable);
+   SITE(irq_enable);
+   SITE(restore_fl);
+   SITE(save_fl);
+   SITE(iret);
+   SITE(irq_enable_sysexit);
+   SITE(read_cr2);
+   SITE(read_cr3);
+   SITE(write_cr3);
+   SITE(clts);
+   SITE(read_tsc);
+#undef SITE
+
+   patch_site:
+   ret = paravirt_patch_insns(insns, len, 

Re: [rfc] no ZERO_PAGE?

2007-04-04 Thread Linus Torvalds


On Wed, 4 Apr 2007, [EMAIL PROTECTED] wrote:
> 
> I'd not be surprised if there's sparse-matrix code out there that wants to
> malloc a *huge* array (like a 1025x1025 array of numbers) that then only
> actually *writes* to several hundred locations, and relies on the fact that
> all the untouched pages read back all-zeros.

Good point. In fact, it doesn't need to be a malloc() - I remember people 
doing this with Fortran programs and just having an absolutely incredibly 
big BSS (with traditional Fortran, dynamic memory allocations are just not 
done).

> Of course, said code is probably buggy because it doesn't zero the whole 
> thing because you don't usually know if some other function already 
> scribbled on that heap page.

Sure you do. If glibc used mmap() or brk(), it *knows* the new data is 
zero. So if you use calloc(), for example, it's entirely possible that 
a good libc wouldn't waste time zeroing it.

The same is true of BSS. You never clear the BSS with a memset, you just 
know it starts out zeroed.

Linus


Re: + clocksource-driver-initialize-list-value.patch added to -mm tree

2007-04-04 Thread Jeremy Fitzhardinge
Daniel Walker wrote:
> Setting CLOCK_SOURCE_IS_CONTINUOUS is largely administration , do you
> know what that flag means?
>   

Sure, but at least it has something to do with clocks and time.

> list values and list initialization are hardly internal details , they
> are commonly used all over the kernel.
>   

The fact that a list is used to string together clocksource structures
is an internal detail of the clock subsystem.  It's annoying that there
has to be a list head in the clocksource structure, but at least as a
clocksource implementer I can ignore it, and if it ever gets changed to
something else I can keep ignoring it.  Your change makes it something
that gets pointlessly replicated all over the kernel.

The real point is that it seems this change just adds work and cruft,
but for no benefit.  Your comment is that it "simplifies the
registration process".  Why does it need to be simpler?  How is it
simpler?  From a clocksource implementer's perspective, the registration
is already pretty simple; how does it get simpler?  It looks less simple
to me, because now there's another failure-mode (a BUG if I forget to
initialize the list).

J


[patch 05/20] Hooks to set up initial pagetable

2007-04-04 Thread Jeremy Fitzhardinge
This patch introduces paravirt_ops hooks to control how the kernel's
initial pagetable is set up.

In the case of a native boot, the very early bootstrap code creates a
simple non-PAE pagetable to map the kernel and physical memory.  When
the VM subsystem is initialized, it creates a proper pagetable which
respects the PAE mode, large pages, etc.

When booting under a hypervisor, there are many possibilities for what
paging environment the hypervisor establishes for the guest kernel, so
the construction of the kernel's pagetable depends on the hypervisor.

In the case of Xen, the hypervisor boots the kernel with a fully
constructed pagetable, which is already using PAE if necessary.  Also,
Xen requires particular care when constructing pagetables to make sure
all pagetables are always mapped read-only.

In order to make this easier, the kernel's initial pagetable construction
has been changed to only allocate and initialize a pagetable page if
there's no page already present in the pagetable.  This allows the Xen
paravirt backend to make a copy of the hypervisor-provided pagetable,
allowing the kernel to establish any more mappings it needs while
keeping the existing ones.

A slightly subtle point which is worth highlighting here is that Xen
requires all kernel mappings to share the same pte_t pages between all
pagetables, so that updating a kernel page's mapping in one pagetable
is reflected in all other pagetables.  This makes it possible to
allocate a page and attach it to a pagetable without having to
explicitly enumerate that page's mapping in all pagetables.

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Acked-by: William Irwin <[EMAIL PROTECTED]>
Cc: Ingo Molnar <[EMAIL PROTECTED]>

---
 arch/i386/kernel/paravirt.c |3 
 arch/i386/mm/init.c |  158 +--
 include/asm-i386/paravirt.h |   17 
 include/asm-i386/pgtable.h  |   16 
 4 files changed, 142 insertions(+), 52 deletions(-)

===
--- a/arch/i386/kernel/paravirt.c
+++ b/arch/i386/kernel/paravirt.c
@@ -476,6 +476,9 @@ struct paravirt_ops paravirt_ops = {
 #endif
.set_lazy_mode = paravirt_nop,
 
+   .pagetable_setup_start = native_pagetable_setup_start,
+   .pagetable_setup_done = native_pagetable_setup_done,
+
.flush_tlb_user = native_flush_tlb,
.flush_tlb_kernel = native_flush_tlb_global,
.flush_tlb_single = native_flush_tlb_single,
===
--- a/arch/i386/mm/init.c
+++ b/arch/i386/mm/init.c
@@ -42,6 +42,7 @@
 #include 
 #include 
 #include 
+#include 
 
 unsigned int __VMALLOC_RESERVE = 128 << 20;
 
@@ -62,6 +63,7 @@ static pmd_t * __init one_md_table_init(

 #ifdef CONFIG_X86_PAE
pmd_table = (pmd_t *) alloc_bootmem_low_pages(PAGE_SIZE);
+
paravirt_alloc_pd(__pa(pmd_table) >> PAGE_SHIFT);
set_pgd(pgd, __pgd(__pa(pmd_table) | _PAGE_PRESENT));
pud = pud_offset(pgd, 0);
@@ -83,12 +85,10 @@ static pte_t * __init one_page_table_ini
 {
if (pmd_none(*pmd)) {
pte_t *page_table = (pte_t *) 
alloc_bootmem_low_pages(PAGE_SIZE);
+
paravirt_alloc_pt(__pa(page_table) >> PAGE_SHIFT);
set_pmd(pmd, __pmd(__pa(page_table) | _PAGE_TABLE));
-   if (page_table != pte_offset_kernel(pmd, 0))
-   BUG();  
-
-   return page_table;
+   BUG_ON(page_table != pte_offset_kernel(pmd, 0));
}

return pte_offset_kernel(pmd, 0);
@@ -119,7 +119,7 @@ static void __init page_table_range_init
pgd = pgd_base + pgd_idx;
 
for ( ; (pgd_idx < PTRS_PER_PGD) && (vaddr != end); pgd++, pgd_idx++) {
-   if (pgd_none(*pgd)) 
+   if (!(pgd_val(*pgd) & _PAGE_PRESENT))
one_md_table_init(pgd);
pud = pud_offset(pgd, vaddr);
pmd = pmd_offset(pud, vaddr);
@@ -158,7 +158,11 @@ static void __init kernel_physical_mappi
pfn = 0;
 
for (; pgd_idx < PTRS_PER_PGD; pgd++, pgd_idx++) {
-   pmd = one_md_table_init(pgd);
+   if (!(pgd_val(*pgd) & _PAGE_PRESENT))
+   pmd = one_md_table_init(pgd);
+   else
+   pmd = pmd_offset(pud_offset(pgd, PAGE_OFFSET), 
PAGE_OFFSET);
+
if (pfn >= max_low_pfn)
continue;
for (pmd_idx = 0; pmd_idx < PTRS_PER_PMD && pfn < max_low_pfn; 
pmd++, pmd_idx++) {
@@ -167,20 +171,26 @@ static void __init kernel_physical_mappi
/* Map with big pages if possible, otherwise create 
normal page tables. */
if (cpu_has_pse) {
unsigned int address2 = (pfn + PTRS_PER_PTE - 
1) * PAGE_SIZE + PAGE_OFFSET + PAGE_SIZE-1;
-
-   if (is_kernel_text(address) 

[patch 02/20] Remove CONFIG_DEBUG_PARAVIRT

2007-04-04 Thread Jeremy Fitzhardinge
Remove CONFIG_DEBUG_PARAVIRT.  When inlining code, this option
attempts to trash registers in the patch-site's "clobber" field, on
the grounds that this should find bugs with incorrect clobbers.
Unfortunately, the clobber field really means "registers modified by
this patch site", which includes return values.

Because of this, this option has outlived its usefulness, so remove
it.

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Cc: Rusty Russell <[EMAIL PROTECTED]>

---
 arch/i386/Kconfig.debug|   10 --
 arch/i386/kernel/alternative.c |   14 +-
 2 files changed, 1 insertion(+), 23 deletions(-)

===
--- a/arch/i386/Kconfig.debug
+++ b/arch/i386/Kconfig.debug
@@ -85,14 +85,4 @@ config DOUBLEFAULT
   option saves about 4k and might cause you much additional grey
   hair.
 
-config DEBUG_PARAVIRT
-   bool "Enable some paravirtualization debugging"
-   default n
-   depends on PARAVIRT && DEBUG_KERNEL
-   help
- Currently deliberately clobbers regs which are allowed to be
- clobbered in inlined paravirt hooks, even in native mode.
- If turning this off solves a problem, then DISABLE_INTERRUPTS() or
- ENABLE_INTERRUPTS() is lying about what registers can be clobbered.
-
 endmenu
===
--- a/arch/i386/kernel/alternative.c
+++ b/arch/i386/kernel/alternative.c
@@ -359,19 +359,7 @@ void apply_paravirt(struct paravirt_patc
 
used = paravirt_ops.patch(p->instrtype, p->clobbers, p->instr,
  p->len);
-#ifdef CONFIG_DEBUG_PARAVIRT
-   {
-   int i;
-   /* Deliberately clobber regs using "not %reg" to find bugs. */
-   for (i = 0; i < 3; i++) {
-   if (p->len - used >= 2 && (p->clobbers & (1 << i))) {
-   memcpy(p->instr + used, "\xf7\xd0", 2);
-   p->instr[used+1] |= i;
-   used += 2;
-   }
-   }
-   }
-#endif
+
/* Pad the rest with nops */
nop_out(p->instr + used, p->len - used);
}

-- 



[patch 01/20] update MAINTAINERS

2007-04-04 Thread Jeremy Fitzhardinge
Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Cc: Chris Wright <[EMAIL PROTECTED]>
Cc: Zachary Amsden <[EMAIL PROTECTED]>
Cc: Rusty Russell <[EMAIL PROTECTED]>
---
 MAINTAINERS |   22 ++
 1 file changed, 22 insertions(+)

===
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2583,6 +2583,19 @@ T:   cvs cvs.parisc-linux.org:/var/cvs/lin
 T: cvs cvs.parisc-linux.org:/var/cvs/linux-2.6
 S: Maintained
 
+PARAVIRT_OPS INTERFACE
+P: Jeremy Fitzhardinge
+M: [EMAIL PROTECTED]
+P: Chris Wright
+M: [EMAIL PROTECTED]
+P: Zachary Amsden
+M: [EMAIL PROTECTED]
+P: Rusty Russell
+M: [EMAIL PROTECTED]
+L: virtualization@lists.osdl.org
+L: linux-kernel@vger.kernel.org
+S: Supported
+
 PC87360 HARDWARE MONITORING DRIVER
 P: Jim Cromie
 M: [EMAIL PROTECTED]
@@ -3780,6 +3793,15 @@ L:   linux-x25@vger.kernel.org
 L: linux-x25@vger.kernel.org
 S: Maintained
 
+XEN HYPERVISOR INTERFACE
+P: Jeremy Fitzhardinge
+M: [EMAIL PROTECTED]
+P: Chris Wright
+M: [EMAIL PROTECTED]
+L: virtualization@lists.osdl.org
+L: [EMAIL PROTECTED]
+S: Supported
+
 XFS FILESYSTEM
 P: Silicon Graphics Inc
 P: Tim Shimmin, David Chatterton

-- 



[patch 12/20] Consistently wrap paravirt ops callsites to make them patchable

2007-04-04 Thread Jeremy Fitzhardinge
Wrap a set of interesting paravirt_ops calls in a wrapper which makes
the callsites available for patching.  Unfortunately this is pretty
ugly, because there's no way to get gcc to generate a function call
while also wrapping just the callsite itself with the necessary labels.

This patch supports functions with 0-4 arguments, and either void or
returning a value.  64-bit arguments must be split into a pair of
32-bit arguments (lower word first).  Small structures are returned in
registers.

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Cc: Rusty Russell <[EMAIL PROTECTED]>
Cc: Zachary Amsden <[EMAIL PROTECTED]>
Cc: Anthony Liguori <[EMAIL PROTECTED]>

---
 include/asm-i386/paravirt.h |  715 ++-
 1 file changed, 569 insertions(+), 146 deletions(-)

===
--- a/include/asm-i386/paravirt.h
+++ b/include/asm-i386/paravirt.h
@@ -124,7 +124,7 @@ struct paravirt_ops
 
void (*flush_tlb_user)(void);
void (*flush_tlb_kernel)(void);
-   void (*flush_tlb_single)(u32 addr);
+   void (*flush_tlb_single)(unsigned long addr);
 
void (*map_pt_hook)(int type, pte_t *va, u32 pfn);
 
@@ -188,7 +188,7 @@ extern struct paravirt_ops paravirt_ops;
 #define paravirt_clobber(clobber)  \
[paravirt_clobber] "i" (clobber)
 
-#define PARAVIRT_CALL  "call *paravirt_ops+%c[paravirt_typenum]*4;"
+#define PARAVIRT_CALL  "call *(paravirt_ops+%c[paravirt_typenum]*4);"
 
 #define _paravirt_alt(insn_string, type, clobber)  \
"771:\n\t" insn_string "\n" "772:\n"\
@@ -199,26 +199,234 @@ extern struct paravirt_ops paravirt_ops;
"  .short " clobber "\n"\
".popsection\n"
 
-#define paravirt_alt(insn_string)  \
+#define paravirt_alt(insn_string)  \
_paravirt_alt(insn_string, "%c[paravirt_typenum]", 
"%c[paravirt_clobber]")
 
-#define paravirt_enabled() (paravirt_ops.paravirt_enabled)
+#define PVOP_CALL0(__rettype, __op)\
+   ({  \
+   __rettype __ret;\
+   if (sizeof(__rettype) > sizeof(unsigned long)) {\
+   unsigned long long __tmp;   \
+   unsigned long __ecx;\
+   asm volatile(paravirt_alt(PARAVIRT_CALL)\
+: "=A" (__tmp), "=c" (__ecx)   \
+: paravirt_type(__op), \
+  paravirt_clobber(CLBR_ANY)   \
+: "memory", "cc"); \
+   __ret = (__rettype)__tmp;   \
+   } else {\
+   unsigned long __tmp, __edx, __ecx;  \
+   asm volatile(paravirt_alt(PARAVIRT_CALL)\
+: "=a" (__tmp), "=d" (__edx),  \
+  "=c" (__ecx) \
+: paravirt_type(__op), \
+  paravirt_clobber(CLBR_ANY)   \
+: "memory", "cc"); \
+   __ret = (__rettype)__tmp;   \
+   }   \
+   __ret;  \
+   })
+#define PVOP_VCALL0(__op)  \
+   ({  \
+   unsigned long __eax, __edx, __ecx;  \
+   asm volatile(paravirt_alt(PARAVIRT_CALL)\
+: "=a" (__eax), "=d" (__edx), "=c" (__ecx) \
+: paravirt_type(__op), \
+  paravirt_clobber(CLBR_ANY)   \
+: "memory", "cc"); \
+   })
+
+#define PVOP_CALL1(__rettype, __op, arg1)  \
+   ({  \
+   __rettype __ret;\
+   if (sizeof(__rettype) > sizeof(unsigned long)) {\
+   unsigned long long __tmp;   \
+   unsigned long __ecx;\
+   asm volatile(paravirt_alt(PARAVIRT_CALL)\
+: "=A" (__tmp), "=c" (__ecx)   \
+   

[patch 17/20] add kmap_atomic_pte for mapping highpte pages

2007-04-04 Thread Jeremy Fitzhardinge
Xen and VMI both have special requirements when mapping a highmem pte
page into the kernel address space.  These can be dealt with by adding
a new kmap_atomic_pte() function for mapping highptes, and hooking it
into the paravirt_ops infrastructure.

Xen specifically wants to map the pte page RO, so this patch exposes a
helper function, kmap_atomic_prot, which maps the page with the
specified page protections.

This also adds a kmap_flush_unused() function to clear out the cached
kmap mappings.  Xen needs this to clear out any potential stray RW
mappings of pages which will become part of a pagetable.

[ Zach - vmi.c will need some attention after this patch.  It wasn't
  immediately obvious to me what needs to be done. ]

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Cc: Zachary Amsden <[EMAIL PROTECTED]>

---
 arch/i386/kernel/paravirt.c |7 +++
 arch/i386/mm/highmem.c  |9 +++--
 include/asm-i386/highmem.h  |   11 +++
 include/asm-i386/paravirt.h |   13 -
 include/asm-i386/pgtable.h  |4 ++--
 include/linux/highmem.h |6 ++
 mm/highmem.c|9 +
 7 files changed, 54 insertions(+), 5 deletions(-)

===
--- a/arch/i386/kernel/paravirt.c
+++ b/arch/i386/kernel/paravirt.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -318,6 +319,12 @@ struct paravirt_ops paravirt_ops = {
 
.ptep_get_and_clear = native_ptep_get_and_clear,
 
+#ifdef CONFIG_HIGHPTE
+   .kmap_atomic_pte = native_kmap_atomic_pte,
+#else
+   .kmap_atomic_pte = paravirt_nop,
+#endif
+
 #ifdef CONFIG_X86_PAE
.set_pte_atomic = native_set_pte_atomic,
.set_pte_present = native_set_pte_present,
===
--- a/arch/i386/mm/highmem.c
+++ b/arch/i386/mm/highmem.c
@@ -26,7 +26,7 @@ void kunmap(struct page *page)
  * However when holding an atomic kmap is is not legal to sleep, so atomic
  * kmaps are appropriate for short, tight code paths only.
  */
-void *kmap_atomic(struct page *page, enum km_type type)
+void *kmap_atomic_prot(struct page *page, enum km_type type, pgprot_t prot)
 {
enum fixed_addresses idx;
unsigned long vaddr;
@@ -41,9 +41,14 @@ void *kmap_atomic(struct page *page, enu
return page_address(page);
 
vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
-   set_pte(kmap_pte-idx, mk_pte(page, kmap_prot));
+   set_pte(kmap_pte-idx, mk_pte(page, prot));
 
return (void*) vaddr;
+}
+
+void *kmap_atomic(struct page *page, enum km_type type)
+{
+   return kmap_atomic_prot(page, type, kmap_prot);
 }
 
 void kunmap_atomic(void *kvaddr, enum km_type type)
===
--- a/include/asm-i386/highmem.h
+++ b/include/asm-i386/highmem.h
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /* declarations for highmem.c */
 extern unsigned long highstart_pfn, highend_pfn;
@@ -67,10 +68,20 @@ extern void FASTCALL(kunmap_high(struct 
 
 void *kmap(struct page *page);
 void kunmap(struct page *page);
+void *kmap_atomic_prot(struct page *page, enum km_type type, pgprot_t prot);
 void *kmap_atomic(struct page *page, enum km_type type);
 void kunmap_atomic(void *kvaddr, enum km_type type);
 void *kmap_atomic_pfn(unsigned long pfn, enum km_type type);
 struct page *kmap_atomic_to_page(void *ptr);
+
+static inline void *native_kmap_atomic_pte(struct page *page, enum km_type 
type)
+{
+   return kmap_atomic(page, type);
+}
+
+#ifndef CONFIG_PARAVIRT
+#define kmap_atomic_pte(page, type)kmap_atomic(page, type)
+#endif
 
 #define flush_cache_kmaps()do { } while (0)
 
===
--- a/include/asm-i386/paravirt.h
+++ b/include/asm-i386/paravirt.h
@@ -16,7 +16,9 @@
 #ifndef __ASSEMBLY__
 #include 
 #include 
-
+#include 
+
+struct page;
 struct thread_struct;
 struct Xgt_desc_struct;
 struct tss_struct;
@@ -143,6 +145,8 @@ struct paravirt_ops
void (*pte_update_defer)(struct mm_struct *mm, unsigned long addr, 
pte_t *ptep);
 
pte_t (*ptep_get_and_clear)(pte_t *ptep);
+
+   void *(*kmap_atomic_pte)(struct page *page, enum km_type type);
 
 #ifdef CONFIG_X86_PAE
void (*set_pte_atomic)(pte_t *ptep, pte_t pteval);
@@ -768,6 +772,13 @@ static inline void paravirt_release_pd(u
PVOP_VCALL1(release_pd, pfn);
 }
 
+static inline void *kmap_atomic_pte(struct page *page, enum km_type type)
+{
+   unsigned long ret;
+   ret = PVOP_CALL2(unsigned long, kmap_atomic_pte, page, type);
+   return (void *)ret;
+}
+
 static inline void pte_update(struct mm_struct *mm, unsigned long addr,
  pte_t *ptep)
 {
===
--- a/include/asm-i386/pgtable.h
+++ b/include/asm-i386/pgtable.h
@@ -481,9 +481,9 

[patch 11/20] Fix patch site clobbers to include return register

2007-04-04 Thread Jeremy Fitzhardinge
Fix a few clobbers to include the return register.  The clobbers set
is the set of all registers modified (or may be modified) by the code
snippet, regardless of whether it was deliberate or accidental.

Also, make sure that callsites which are used in contexts which don't
allow clobbers actually save and restore all clobberable registers.

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Cc: Rusty Russell <[EMAIL PROTECTED]>
Cc: Zachary Amsden <[EMAIL PROTECTED]>

---
 arch/i386/kernel/entry.S|2 +-
 include/asm-i386/paravirt.h |   18 ++
 2 files changed, 11 insertions(+), 9 deletions(-)

===
--- a/arch/i386/kernel/entry.S
+++ b/arch/i386/kernel/entry.S
@@ -342,7 +342,7 @@ 1:  movl (%ebp),%ebp
jae syscall_badsys
call *sys_call_table(,%eax,4)
movl %eax,PT_EAX(%esp)
-   DISABLE_INTERRUPTS(CLBR_ECX|CLBR_EDX)
+   DISABLE_INTERRUPTS(CLBR_ANY)
TRACE_IRQS_OFF
movl TI_flags(%ebp), %ecx
testw $_TIF_ALLWORK_MASK, %cx
===
--- a/include/asm-i386/paravirt.h
+++ b/include/asm-i386/paravirt.h
@@ -532,7 +532,7 @@ static inline unsigned long __raw_local_
  "popl %%edx; popl %%ecx")
 : "=a"(f)
 : paravirt_type(save_fl),
-  paravirt_clobber(CLBR_NONE)
+  paravirt_clobber(CLBR_EAX)
 : "memory", "cc");
return f;
 }
@@ -617,27 +617,29 @@ 772:; \
.popsection
 
 #define INTERRUPT_RETURN   \
-   PARA_SITE(PARA_PATCH(PARAVIRT_iret), CLBR_ANY,  \
+   PARA_SITE(PARA_PATCH(PARAVIRT_iret), CLBR_NONE, \
  jmp *%cs:paravirt_ops+PARAVIRT_iret)
 
 #define DISABLE_INTERRUPTS(clobbers)   \
PARA_SITE(PARA_PATCH(PARAVIRT_irq_disable), clobbers,   \
- pushl %ecx; pushl %edx;   \
+ pushl %eax; pushl %ecx; pushl %edx;   \
  call *%cs:paravirt_ops+PARAVIRT_irq_disable;  \
- popl %edx; popl %ecx) \
+ popl %edx; popl %ecx; popl %eax)  \
 
 #define ENABLE_INTERRUPTS(clobbers)\
PARA_SITE(PARA_PATCH(PARAVIRT_irq_enable), clobbers,\
- pushl %ecx; pushl %edx;   \
+ pushl %eax; pushl %ecx; pushl %edx;   \
  call *%cs:paravirt_ops+PARAVIRT_irq_enable;   \
- popl %edx; popl %ecx)
+ popl %edx; popl %ecx; popl %eax)
 
 #define ENABLE_INTERRUPTS_SYSEXIT  \
-   PARA_SITE(PARA_PATCH(PARAVIRT_irq_enable_sysexit), CLBR_ANY,\
+   PARA_SITE(PARA_PATCH(PARAVIRT_irq_enable_sysexit), CLBR_NONE,   \
  jmp *%cs:paravirt_ops+PARAVIRT_irq_enable_sysexit)
 
 #define GET_CR0_INTO_EAX   \
-   call *paravirt_ops+PARAVIRT_read_cr0
+   push %ecx; push %edx;   \
+   call *paravirt_ops+PARAVIRT_read_cr0;   \
+   pop %edx; pop %ecx
 
 #endif /* __ASSEMBLY__ */
 #endif /* CONFIG_PARAVIRT */

-- 



[patch 15/20] add flush_tlb_others paravirt_op

2007-04-04 Thread Jeremy Fitzhardinge
This patch adds a pv_op for flush_tlb_others.  Linux running on native
hardware uses cross-CPU IPIs to flush the TLB on any CPU which may
have a particular mm's pagetable entries cached in its TLB.  This is
inefficient in a paravirtualized environment, since the hypervisor
knows which real CPUs actually contain cached mappings, which may be a
small subset of a guest's VCPUs.

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>

---
 arch/i386/kernel/paravirt.c |1 +
 arch/i386/kernel/smp.c  |   15 ---
 include/asm-i386/paravirt.h |9 +
 include/asm-i386/tlbflush.h |   19 +--
 4 files changed, 35 insertions(+), 9 deletions(-)

===
--- a/arch/i386/kernel/paravirt.c
+++ b/arch/i386/kernel/paravirt.c
@@ -301,6 +301,7 @@ struct paravirt_ops paravirt_ops = {
.flush_tlb_user = native_flush_tlb,
.flush_tlb_kernel = native_flush_tlb_global,
.flush_tlb_single = native_flush_tlb_single,
+   .flush_tlb_others = native_flush_tlb_others,
 
.map_pt_hook = paravirt_nop,
 
===
--- a/arch/i386/kernel/smp.c
+++ b/arch/i386/kernel/smp.c
@@ -256,7 +256,6 @@ static struct mm_struct * flush_mm;
 static struct mm_struct * flush_mm;
 static unsigned long flush_va;
 static DEFINE_SPINLOCK(tlbstate_lock);
-#define FLUSH_ALL  0xffffffff
 
 /*
  * We cannot call mmdrop() because we are in interrupt context, 
@@ -338,7 +337,7 @@ fastcall void smp_invalidate_interrupt(s
 
if (flush_mm == per_cpu(cpu_tlbstate, cpu).active_mm) {
if (per_cpu(cpu_tlbstate, cpu).state == TLBSTATE_OK) {
-   if (flush_va == FLUSH_ALL)
+   if (flush_va == TLB_FLUSH_ALL)
local_flush_tlb();
else
__flush_tlb_one(flush_va);
@@ -353,9 +352,11 @@ out:
put_cpu_no_resched();
 }
 
-static void flush_tlb_others(cpumask_t cpumask, struct mm_struct *mm,
-   unsigned long va)
-{
+void native_flush_tlb_others(const cpumask_t *cpumaskp, struct mm_struct *mm,
+unsigned long va)
+{
+   cpumask_t cpumask = *cpumaskp;
+
/*
 * A couple of (to be removed) sanity checks:
 *
@@ -417,7 +418,7 @@ void flush_tlb_current_task(void)
 
local_flush_tlb();
if (!cpus_empty(cpu_mask))
-   flush_tlb_others(cpu_mask, mm, FLUSH_ALL);
+   flush_tlb_others(cpu_mask, mm, TLB_FLUSH_ALL);
preempt_enable();
 }
 
@@ -436,7 +437,7 @@ void flush_tlb_mm (struct mm_struct * mm
leave_mm(smp_processor_id());
}
if (!cpus_empty(cpu_mask))
-   flush_tlb_others(cpu_mask, mm, FLUSH_ALL);
+   flush_tlb_others(cpu_mask, mm, TLB_FLUSH_ALL);
 
preempt_enable();
 }
===
--- a/include/asm-i386/paravirt.h
+++ b/include/asm-i386/paravirt.h
@@ -15,6 +15,7 @@
 
 #ifndef __ASSEMBLY__
 #include 
+#include 
 
 struct thread_struct;
 struct Xgt_desc_struct;
@@ -125,6 +126,8 @@ struct paravirt_ops
void (*flush_tlb_user)(void);
void (*flush_tlb_kernel)(void);
void (*flush_tlb_single)(unsigned long addr);
+   void (*flush_tlb_others)(const cpumask_t *cpus, struct mm_struct *mm,
+unsigned long va);
 
void (*map_pt_hook)(int type, pte_t *va, u32 pfn);
 
@@ -731,6 +734,12 @@ static inline void __flush_tlb_single(un
PVOP_VCALL1(flush_tlb_single, addr);
 }
 
+static inline void flush_tlb_others(cpumask_t cpumask, struct mm_struct *mm,
+   unsigned long va)
+{
+   PVOP_VCALL3(flush_tlb_others, &cpumask, mm, va);
+}
+
 static inline void paravirt_map_pt_hook(int type, pte_t *va, u32 pfn)
 {
PVOP_VCALL3(map_pt_hook, type, va, pfn);
===
--- a/include/asm-i386/tlbflush.h
+++ b/include/asm-i386/tlbflush.h
@@ -79,10 +79,14 @@
  *  - flush_tlb_range(vma, start, end) flushes a range of pages
  *  - flush_tlb_kernel_range(start, end) flushes a range of kernel pages
  *  - flush_tlb_pgtables(mm, start, end) flushes a range of page tables
+ *  - flush_tlb_others(cpumask, mm, va) flushes a TLBs on other cpus
  *
  * ..but the i386 has somewhat limited tlb flushing capabilities,
  * and page-granular flushes are available only on i486 and up.
  */
+
+#define TLB_FLUSH_ALL  0xffffffff
+
 
 #ifndef CONFIG_SMP
 
@@ -110,7 +114,12 @@ static inline void flush_tlb_range(struc
__flush_tlb();
 }
 
-#else
+static inline void native_flush_tlb_others(const cpumask_t *cpumask,
+  struct mm_struct *mm, unsigned long 
va)
+{
+}
+
+#else  /* SMP */
 
 #include 
 
@@ -129,6 

[patch 04/20] Add pagetable accessors to pack and unpack pagetable entries

2007-04-04 Thread Jeremy Fitzhardinge
Add a set of accessors to pack, unpack and modify page table entries
(at all levels).  This allows a paravirt implementation to control the
contents of pgd/pmd/pte entries.  For example, Xen uses this to
convert the (pseudo-)physical address into a machine address when
populating a pagetable entry, and to convert back to a pseudo-physical
address when an entry is read.

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Acked-by: Ingo Molnar <[EMAIL PROTECTED]>

---
 arch/i386/kernel/paravirt.c   |   84 +
 arch/i386/kernel/vmi.c|6 +-
 include/asm-i386/page.h   |   79 +-
 include/asm-i386/paravirt.h   |   52 +-
 include/asm-i386/pgtable-2level.h |   28 +---
 include/asm-i386/pgtable-3level.h |   65 +---
 include/asm-i386/pgtable.h|2 
 7 files changed, 186 insertions(+), 130 deletions(-)

===
--- a/arch/i386/kernel/paravirt.c
+++ b/arch/i386/kernel/paravirt.c
@@ -399,78 +399,6 @@ static void native_flush_tlb_single(u32 
 {
__native_flush_tlb_single(addr);
 }
-
-#ifndef CONFIG_X86_PAE
-static void native_set_pte(pte_t *ptep, pte_t pteval)
-{
-   *ptep = pteval;
-}
-
-static void native_set_pte_at(struct mm_struct *mm, u32 addr, pte_t *ptep, 
pte_t pteval)
-{
-   *ptep = pteval;
-}
-
-static void native_set_pmd(pmd_t *pmdp, pmd_t pmdval)
-{
-   *pmdp = pmdval;
-}
-
-#else /* CONFIG_X86_PAE */
-
-static void native_set_pte(pte_t *ptep, pte_t pte)
-{
-   ptep->pte_high = pte.pte_high;
-   smp_wmb();
-   ptep->pte_low = pte.pte_low;
-}
-
-static void native_set_pte_at(struct mm_struct *mm, u32 addr, pte_t *ptep, 
pte_t pte)
-{
-   ptep->pte_high = pte.pte_high;
-   smp_wmb();
-   ptep->pte_low = pte.pte_low;
-}
-
-static void native_set_pte_present(struct mm_struct *mm, unsigned long addr, 
pte_t *ptep, pte_t pte)
-{
-   ptep->pte_low = 0;
-   smp_wmb();
-   ptep->pte_high = pte.pte_high;
-   smp_wmb();
-   ptep->pte_low = pte.pte_low;
-}
-
-static void native_set_pte_atomic(pte_t *ptep, pte_t pteval)
-{
-   set_64bit((unsigned long long *)ptep,pte_val(pteval));
-}
-
-static void native_set_pmd(pmd_t *pmdp, pmd_t pmdval)
-{
-   set_64bit((unsigned long long *)pmdp,pmd_val(pmdval));
-}
-
-static void native_set_pud(pud_t *pudp, pud_t pudval)
-{
-   *pudp = pudval;
-}
-
-static void native_pte_clear(struct mm_struct *mm, unsigned long addr, pte_t 
*ptep)
-{
-   ptep->pte_low = 0;
-   smp_wmb();
-   ptep->pte_high = 0;
-}
-
-static void native_pmd_clear(pmd_t *pmd)
-{
-   u32 *tmp = (u32 *)pmd;
-   *tmp = 0;
-   smp_wmb();
-   *(tmp + 1) = 0;
-}
-#endif /* CONFIG_X86_PAE */
 
 /* These are in entry.S */
 extern void native_iret(void);
@@ -565,13 +493,25 @@ struct paravirt_ops paravirt_ops = {
.set_pmd = native_set_pmd,
.pte_update = paravirt_nop,
.pte_update_defer = paravirt_nop,
+
+   .ptep_get_and_clear = native_ptep_get_and_clear,
+
 #ifdef CONFIG_X86_PAE
.set_pte_atomic = native_set_pte_atomic,
.set_pte_present = native_set_pte_present,
.set_pud = native_set_pud,
.pte_clear = native_pte_clear,
.pmd_clear = native_pmd_clear,
+
+   .pmd_val = native_pmd_val,
+   .make_pmd = native_make_pmd,
 #endif
+
+   .pte_val = native_pte_val,
+   .pgd_val = native_pgd_val,
+
+   .make_pte = native_make_pte,
+   .make_pgd = native_make_pgd,
 
.irq_enable_sysexit = native_irq_enable_sysexit,
.iret = native_iret,
===
--- a/arch/i386/kernel/vmi.c
+++ b/arch/i386/kernel/vmi.c
@@ -444,13 +444,13 @@ static void vmi_release_pd(u32 pfn)
 ((level) | (is_current_as(mm, user) ?   \
 (VMI_PAGE_DEFER | VMI_PAGE_CURRENT_AS | ((addr) & 
VMI_PAGE_VA_MASK)) : 0))
 
-static void vmi_update_pte(struct mm_struct *mm, u32 addr, pte_t *ptep)
+static void vmi_update_pte(struct mm_struct *mm, unsigned long addr, pte_t 
*ptep)
 {
vmi_check_page_type(__pa(ptep) >> PAGE_SHIFT, VMI_PAGE_PTE);
vmi_ops.update_pte(ptep, vmi_flags_addr(mm, addr, VMI_PAGE_PT, 0));
 }
 
-static void vmi_update_pte_defer(struct mm_struct *mm, u32 addr, pte_t *ptep)
+static void vmi_update_pte_defer(struct mm_struct *mm, unsigned long addr, 
pte_t *ptep)
 {
vmi_check_page_type(__pa(ptep) >> PAGE_SHIFT, VMI_PAGE_PTE);
vmi_ops.update_pte(ptep, vmi_flags_addr_defer(mm, addr, VMI_PAGE_PT, 
0));
@@ -463,7 +463,7 @@ static void vmi_set_pte(pte_t *ptep, pte
vmi_ops.set_pte(pte, ptep, VMI_PAGE_PT);
 }
 
-static void vmi_set_pte_at(struct mm_struct *mm, u32 addr, pte_t *ptep, pte_t 
pte)
+static void vmi_set_pte_at(struct mm_struct *mm, unsigned long addr, pte_t 
*ptep, pte_t pte)
 {

[patch 10/20] Use patch site IDs computed from offset in paravirt_ops structure

2007-04-04 Thread Jeremy Fitzhardinge
Use patch type identifiers derived from the offset of the operation in
the paravirt_ops structure.  This avoids having to maintain a separate
enum for patch site types.

Also, since the identifier is derived from the offset into
paravirt_ops, the offset can be derived from the identifier.  This is
used to remove replicated information in the various callsite macros,
which has been a source of bugs in the past.

This patch also drops the fused save_fl+cli operation, which doesn't
really add much and makes things more complex - specifically because
it breaks the 1:1 relationship between identifiers and offsets.  If
this operation turns out to be particularly beneficial, then the right
answer is to define a new entrypoint for it.

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Cc: Rusty Russell <[EMAIL PROTECTED]>
Cc: Zachary Amsden <[EMAIL PROTECTED]>

---
 arch/i386/kernel/paravirt.c |   14 +--
 arch/i386/kernel/vmi.c  |   39 +
 include/asm-i386/paravirt.h |  179 ++-
 3 files changed, 105 insertions(+), 127 deletions(-)

===
--- a/arch/i386/kernel/paravirt.c
+++ b/arch/i386/kernel/paravirt.c
@@ -58,7 +58,6 @@ DEF_NATIVE(sti, "sti");
 DEF_NATIVE(sti, "sti");
 DEF_NATIVE(popf, "push %eax; popf");
 DEF_NATIVE(pushf, "pushf; pop %eax");
-DEF_NATIVE(pushf_cli, "pushf; pop %eax; cli");
 DEF_NATIVE(iret, "iret");
 DEF_NATIVE(sti_sysexit, "sti; sysexit");
 
@@ -66,13 +65,12 @@ static const struct native_insns
 {
const char *start, *end;
 } native_insns[] = {
-   [PARAVIRT_IRQ_DISABLE] = { start_cli, end_cli },
-   [PARAVIRT_IRQ_ENABLE] = { start_sti, end_sti },
-   [PARAVIRT_RESTORE_FLAGS] = { start_popf, end_popf },
-   [PARAVIRT_SAVE_FLAGS] = { start_pushf, end_pushf },
-   [PARAVIRT_SAVE_FLAGS_IRQ_DISABLE] = { start_pushf_cli, end_pushf_cli },
-   [PARAVIRT_INTERRUPT_RETURN] = { start_iret, end_iret },
-   [PARAVIRT_STI_SYSEXIT] = { start_sti_sysexit, end_sti_sysexit },
+   [PARAVIRT_PATCH(irq_disable)] = { start_cli, end_cli },
+   [PARAVIRT_PATCH(irq_enable)] = { start_sti, end_sti },
+   [PARAVIRT_PATCH(restore_fl)] = { start_popf, end_popf },
+   [PARAVIRT_PATCH(save_fl)] = { start_pushf, end_pushf },
+   [PARAVIRT_PATCH(iret)] = { start_iret, end_iret },
+   [PARAVIRT_PATCH(irq_enable_sysexit)] = { start_sti_sysexit, 
end_sti_sysexit },
 };
 
 static unsigned native_patch(u8 type, u16 clobbers, void *insns, unsigned len)
===
--- a/arch/i386/kernel/vmi.c
+++ b/arch/i386/kernel/vmi.c
@@ -78,11 +78,6 @@ static struct {
 #define MNEM_JMP  0xe9
 #define MNEM_RET  0xc3
 
-static char irq_save_disable_callout[] = {
-   MNEM_CALL, 0, 0, 0, 0,
-   MNEM_CALL, 0, 0, 0, 0,
-   MNEM_RET
-};
 #define IRQ_PATCH_INT_MASK 0
 #define IRQ_PATCH_DISABLE  5
 
@@ -130,33 +125,17 @@ static unsigned vmi_patch(u8 type, u16 c
 static unsigned vmi_patch(u8 type, u16 clobbers, void *insns, unsigned len)
 {
switch (type) {
-   case PARAVIRT_IRQ_DISABLE:
+   case PARAVIRT_PATCH(irq_disable):
return patch_internal(VMI_CALL_DisableInterrupts, len, 
insns);
-   case PARAVIRT_IRQ_ENABLE:
+   case PARAVIRT_PATCH(irq_enable):
return patch_internal(VMI_CALL_EnableInterrupts, len, 
insns);
-   case PARAVIRT_RESTORE_FLAGS:
+   case PARAVIRT_PATCH(restore_fl):
return patch_internal(VMI_CALL_SetInterruptMask, len, 
insns);
-   case PARAVIRT_SAVE_FLAGS:
+   case PARAVIRT_PATCH(save_fl):
return patch_internal(VMI_CALL_GetInterruptMask, len, 
insns);
-   case PARAVIRT_SAVE_FLAGS_IRQ_DISABLE:
-   if (len >= 10) {
-   patch_internal(VMI_CALL_GetInterruptMask, len, 
insns);
-   patch_internal(VMI_CALL_DisableInterrupts, 
len-5, insns+5);
-   return 10;
-   } else {
-   /*
-* You bastards didn't leave enough room to
-* patch save_flags_irq_disable inline.  Patch
-* to a helper
-*/
-   BUG_ON(len < 5);
-   *(char *)insns = MNEM_CALL;
-   patch_offset(insns, irq_save_disable_callout);
-   return 5;
-   }
-   case PARAVIRT_INTERRUPT_RETURN:
+   case PARAVIRT_PATCH(iret):
return patch_internal(VMI_CALL_IRET, len, insns);
-   case PARAVIRT_STI_SYSEXIT:
+   case PARAVIRT_PATCH(irq_enable_sysexit):

[patch 16/20] revert map_pt_hook.

2007-04-04 Thread Jeremy Fitzhardinge
Back out the map_pt_hook to clear the way for kmap_atomic_pte.

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Cc: Zachary Amsden <[EMAIL PROTECTED]>

---
 arch/i386/kernel/paravirt.c |2 --
 arch/i386/kernel/vmi.c  |2 ++
 include/asm-i386/paravirt.h |7 ---
 include/asm-i386/pgtable.h  |   23 ---
 4 files changed, 6 insertions(+), 28 deletions(-)

===
--- a/arch/i386/kernel/paravirt.c
+++ b/arch/i386/kernel/paravirt.c
@@ -303,8 +303,6 @@ struct paravirt_ops paravirt_ops = {
.flush_tlb_single = native_flush_tlb_single,
.flush_tlb_others = native_flush_tlb_others,
 
-   .map_pt_hook = paravirt_nop,
-
.alloc_pt = paravirt_nop,
.alloc_pd = paravirt_nop,
.alloc_pd_clone = paravirt_nop,
===
--- a/arch/i386/kernel/vmi.c
+++ b/arch/i386/kernel/vmi.c
@@ -819,8 +819,10 @@ static inline int __init activate_vmi(vo
paravirt_ops.release_pt = vmi_release_pt;
paravirt_ops.release_pd = vmi_release_pd;
}
+#if 0
para_wrap(map_pt_hook, vmi_map_pt_hook, set_linear_mapping,
  SetLinearMapping);
+#endif
 
/*
 * These MUST always be patched.  Don't support indirect jumps
===
--- a/include/asm-i386/paravirt.h
+++ b/include/asm-i386/paravirt.h
@@ -128,8 +128,6 @@ struct paravirt_ops
void (*flush_tlb_single)(unsigned long addr);
void (*flush_tlb_others)(const cpumask_t *cpus, struct mm_struct *mm,
 unsigned long va);
-
-   void (*map_pt_hook)(int type, pte_t *va, u32 pfn);
 
void (*alloc_pt)(u32 pfn);
void (*alloc_pd)(u32 pfn);
@@ -740,11 +738,6 @@ static inline void flush_tlb_others(cpum
PVOP_VCALL3(flush_tlb_others, &cpumask, mm, va);
 }
 
-static inline void paravirt_map_pt_hook(int type, pte_t *va, u32 pfn)
-{
-   PVOP_VCALL3(map_pt_hook, type, va, pfn);
-}
-
 static inline void paravirt_alloc_pt(unsigned pfn)
 {
PVOP_VCALL1(alloc_pt, pfn);
===
--- a/include/asm-i386/pgtable.h
+++ b/include/asm-i386/pgtable.h
@@ -272,7 +272,6 @@ static inline void vmalloc_sync_all(void
  */
 #define pte_update(mm, addr, ptep) do { } while (0)
 #define pte_update_defer(mm, addr, ptep)   do { } while (0)
-#define paravirt_map_pt_hook(slot, va, pfn)do { } while (0)
 
 #define raw_ptep_get_and_clear(xp) native_ptep_get_and_clear(xp)
 #endif
@@ -481,24 +480,10 @@ extern pte_t *lookup_address(unsigned lo
 #endif
 
 #if defined(CONFIG_HIGHPTE)
-#define pte_offset_map(dir, address)   \
-({ \
-   pte_t *__ptep;  \
-   unsigned pfn = pmd_val(*(dir)) >> PAGE_SHIFT;   \
-   __ptep = (pte_t *)kmap_atomic(pfn_to_page(pfn),KM_PTE0);\
-   paravirt_map_pt_hook(KM_PTE0,__ptep, pfn);  \
-   __ptep = __ptep + pte_index(address);   \
-   __ptep; \
-})
-#define pte_offset_map_nested(dir, address)\
-({ \
-   pte_t *__ptep;  \
-   unsigned pfn = pmd_val(*(dir)) >> PAGE_SHIFT;   \
-   __ptep = (pte_t *)kmap_atomic(pfn_to_page(pfn),KM_PTE1);\
-   paravirt_map_pt_hook(KM_PTE1,__ptep, pfn);  \
-   __ptep = __ptep + pte_index(address);   \
-   __ptep; \
-})
+#define pte_offset_map(dir, address) \
+   ((pte_t *)kmap_atomic(pmd_page(*(dir)),KM_PTE0) + pte_index(address))
+#define pte_offset_map_nested(dir, address) \
+   ((pte_t *)kmap_atomic(pmd_page(*(dir)),KM_PTE1) + pte_index(address))
 #define pte_unmap(pte) kunmap_atomic(pte, KM_PTE0)
 #define pte_unmap_nested(pte) kunmap_atomic(pte, KM_PTE1)
 #else

-- 



[patch 03/20] use paravirt_nop to consistently mark no-op operations

2007-04-04 Thread Jeremy Fitzhardinge
Add a _paravirt_nop function for use as a stub for no-op operations,
and a paravirt_nop #define that casts it to void * to make it easier
to use (since all its uses are as a void *).

This is useful to allow the patcher to automatically identify noop
operations so it can simply nop out the callsite.


Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Acked-by: Ingo Molnar <[EMAIL PROTECTED]>
[mingo] but only as a cleanup of the current open-coded (void *) casts.
My problem with this is that it loses the types. Not that there is much
to check for, but still, this adds some assumptions about how function
calls look like

---
 arch/i386/kernel/paravirt.c |   26 +-
 include/asm-i386/paravirt.h |3 +++
 2 files changed, 16 insertions(+), 13 deletions(-)

===
--- a/arch/i386/kernel/paravirt.c
+++ b/arch/i386/kernel/paravirt.c
@@ -35,7 +35,7 @@
 #include 
 
 /* nop stub */
-static void native_nop(void)
+void _paravirt_nop(void)
 {
 }
 
@@ -490,7 +490,7 @@ struct paravirt_ops paravirt_ops = {
 
.patch = native_patch,
.banner = default_banner,
-   .arch_setup = native_nop,
+   .arch_setup = paravirt_nop,
.memory_setup = machine_specific_memory_setup,
.get_wallclock = native_get_wallclock,
.set_wallclock = native_set_wallclock,
@@ -546,25 +546,25 @@ struct paravirt_ops paravirt_ops = {
.setup_boot_clock = setup_boot_APIC_clock,
.setup_secondary_clock = setup_secondary_APIC_clock,
 #endif
-   .set_lazy_mode = (void *)native_nop,
+   .set_lazy_mode = paravirt_nop,
 
.flush_tlb_user = native_flush_tlb,
.flush_tlb_kernel = native_flush_tlb_global,
.flush_tlb_single = native_flush_tlb_single,
 
-   .map_pt_hook = (void *)native_nop,
-
-   .alloc_pt = (void *)native_nop,
-   .alloc_pd = (void *)native_nop,
-   .alloc_pd_clone = (void *)native_nop,
-   .release_pt = (void *)native_nop,
-   .release_pd = (void *)native_nop,
+   .map_pt_hook = paravirt_nop,
+
+   .alloc_pt = paravirt_nop,
+   .alloc_pd = paravirt_nop,
+   .alloc_pd_clone = paravirt_nop,
+   .release_pt = paravirt_nop,
+   .release_pd = paravirt_nop,
 
.set_pte = native_set_pte,
.set_pte_at = native_set_pte_at,
.set_pmd = native_set_pmd,
-   .pte_update = (void *)native_nop,
-   .pte_update_defer = (void *)native_nop,
+   .pte_update = paravirt_nop,
+   .pte_update_defer = paravirt_nop,
 #ifdef CONFIG_X86_PAE
.set_pte_atomic = native_set_pte_atomic,
.set_pte_present = native_set_pte_present,
@@ -576,7 +576,7 @@ struct paravirt_ops paravirt_ops = {
.irq_enable_sysexit = native_irq_enable_sysexit,
.iret = native_iret,
 
-   .startup_ipi_hook = (void *)native_nop,
+   .startup_ipi_hook = paravirt_nop,
 };
 
 /*
===
--- a/include/asm-i386/paravirt.h
+++ b/include/asm-i386/paravirt.h
@@ -430,6 +430,9 @@ static inline void pmd_clear(pmd_t *pmdp
 #define arch_enter_lazy_mmu_mode() 
paravirt_ops.set_lazy_mode(PARAVIRT_LAZY_MMU)
 #define arch_leave_lazy_mmu_mode() 
paravirt_ops.set_lazy_mode(PARAVIRT_LAZY_NONE)
 
+void _paravirt_nop(void);
+#define paravirt_nop   ((void *)_paravirt_nop)
+
 /* These all sit in the .parainstructions section to tell us what to patch. */
 struct paravirt_patch {
u8 *instr;  /* original instructions */

-- 



[patch 18/20] clean up tsc-based sched_clock

2007-04-04 Thread Jeremy Fitzhardinge
Three cleanups:
 - change "instable" -> "unstable"
 - it's better to use get_cpu_var for getting this cpu's variables
 - change cycles_2_ns to do the full computation rather than just the
   tsc->ns scaling.  It's a simpler interface, and it makes the function
   more generally useful.

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>

---
 arch/i386/kernel/sched-clock.c |   35 +--
 1 file changed, 21 insertions(+), 14 deletions(-)

===
--- a/arch/i386/kernel/sched-clock.c
+++ b/arch/i386/kernel/sched-clock.c
@@ -39,17 +39,23 @@
 
 struct sc_data {
unsigned int cyc2ns_scale;
-   unsigned char instable;
+   unsigned char unstable;
unsigned long long last_tsc;
unsigned long long ns_base;
 };
 
 static DEFINE_PER_CPU(struct sc_data, sc_data);
 
-static inline unsigned long long cycles_2_ns(int cpu, unsigned long long cyc)
+static inline unsigned long long cycles_2_ns(unsigned long long cyc)
 {
-   struct sc_data *sc = &per_cpu(sc_data, cpu);
-   return (cyc * sc->cyc2ns_scale) >> CYC2NS_SCALE_FACTOR;
+   const struct sc_data *sc = &__get_cpu_var(sc_data);
+   unsigned long long ns;
+
+   cyc -= sc->last_tsc;
+   ns = (cyc * sc->cyc2ns_scale) >> CYC2NS_SCALE_FACTOR;
+   ns += sc->ns_base;
+
+   return ns;
 }
 
 /*
@@ -62,18 +68,19 @@ static inline unsigned long long cycles_
  */
 unsigned long long sched_clock(void)
 {
-   int cpu = get_cpu();
-   struct sc_data *sc = &per_cpu(sc_data, cpu);
unsigned long long r;
+   const struct sc_data *sc = &get_cpu_var(sc_data);
 
-   if (sc->instable) {
+   if (sc->unstable) {
/* TBD find a cheaper fallback timer than this */
r = ktime_to_ns(ktime_get());
} else {
get_scheduled_cycles(r);
-   r = ((u64)sc->ns_base) + cycles_2_ns(cpu, r - sc->last_tsc);
+   r = cycles_2_ns(r);
}
-   put_cpu();
+
+   put_cpu_var(sc_data);
+
return r;
 }
 
@@ -81,7 +88,7 @@ static void resync_sc_freq(struct sc_dat
 static void resync_sc_freq(struct sc_data *sc, unsigned int newfreq)
 {
if (!cpu_has_tsc) {
-   sc->instable = 1;
+   sc->unstable = 1;
return;
}
/* RED-PEN protect with seqlock? I hope that's not needed
@@ -90,7 +97,7 @@ static void resync_sc_freq(struct sc_dat
sc->ns_base = ktime_to_ns(ktime_get());
get_scheduled_cycles(sc->last_tsc);
	sc->cyc2ns_scale = (1000000 << CYC2NS_SCALE_FACTOR) / newfreq;
-   sc->instable = 0;
+   sc->unstable = 0;
 }
 
 static void call_r_s_f(void *arg)
@@ -119,9 +126,9 @@ static int sc_freq_event(struct notifier
switch (event) {
case CPUFREQ_RESUMECHANGE:  /* needed? */
case CPUFREQ_PRECHANGE:
-   /* Mark TSC as instable until cpu frequency change is done
+   /* Mark TSC as unstable until cpu frequency change is done
   because we don't know when exactly it will change */
-   sc->instable = 1;
+   sc->unstable = 1;
break;
case CPUFREQ_SUSPENDCHANGE:
case CPUFREQ_POSTCHANGE:
@@ -163,7 +170,7 @@ static __init int init_sched_clock(void)
int i;
struct cpufreq_freqs f = { .cpu = get_cpu(), .new = 0 };
for_each_possible_cpu (i)
-   per_cpu(sc_data, i).instable = 1;
+   per_cpu(sc_data, i).unstable = 1;
WARN_ON(num_online_cpus() > 1);
call_r_s_f();
put_cpu();

-- 



[patch 13/20] Document asm-i386/paravirt.h

2007-04-04 Thread Jeremy Fitzhardinge
Clean things up, and broadly document:
 - the paravirt_ops functions themselves
 - the patching mechanism

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Cc: Rusty Russell <[EMAIL PROTECTED]>
 
---
 include/asm-i386/paravirt.h |  140 +--
 1 file changed, 123 insertions(+), 17 deletions(-)

===
--- a/include/asm-i386/paravirt.h
+++ b/include/asm-i386/paravirt.h
@@ -21,6 +21,14 @@ struct tss_struct;
 struct tss_struct;
 struct mm_struct;
 struct desc_struct;
+
+/* Lazy mode for batching updates / context switch */
+enum paravirt_lazy_mode {
+   PARAVIRT_LAZY_NONE = 0,
+   PARAVIRT_LAZY_MMU = 1,
+   PARAVIRT_LAZY_CPU = 2,
+};
+
 struct paravirt_ops
 {
unsigned int kernel_rpl;
@@ -37,22 +45,33 @@ struct paravirt_ops
 */
unsigned (*patch)(u8 type, u16 clobber, void *firstinsn, unsigned len);
 
+   /* Basic arch-specific setup */
void (*arch_setup)(void);
char *(*memory_setup)(void);
void (*init_IRQ)(void);
-
+   void (*time_init)(void);
+
+   /*
+* Called before/after init_mm pagetable setup. setup_start
+* may reset %cr3, and may pre-install parts of the pagetable;
+* pagetable setup is expected to preserve any existing
+* mapping.
+*/
void (*pagetable_setup_start)(pgd_t *pgd_base);
void (*pagetable_setup_done)(pgd_t *pgd_base);
 
+   /* Print a banner to identify the environment */
void (*banner)(void);
 
+   /* Get and set time of day */
unsigned long (*get_wallclock)(void);
int (*set_wallclock)(unsigned long);
-   void (*time_init)(void);
-
+
+   /* cpuid emulation, mostly so that caps bits can be disabled */
void (*cpuid)(unsigned int *eax, unsigned int *ebx,
  unsigned int *ecx, unsigned int *edx);
 
+   /* hooks for various privileged instructions */
unsigned long (*get_debugreg)(int regno);
void (*set_debugreg)(int regno, unsigned long value);
 
@@ -71,15 +90,23 @@ struct paravirt_ops
unsigned long (*read_cr4)(void);
void (*write_cr4)(unsigned long);
 
+   /*
+* Get/set interrupt state.  save_fl and restore_fl are only
+* expected to use X86_EFLAGS_IF; all other bits
+* returned from save_fl are undefined, and may be ignored by
+* restore_fl.
+*/
unsigned long (*save_fl)(void);
void (*restore_fl)(unsigned long);
void (*irq_disable)(void);
void (*irq_enable)(void);
void (*safe_halt)(void);
void (*halt)(void);
+
void (*wbinvd)(void);
 
-   /* err = 0/-EFAULT.  wrmsr returns 0/-EFAULT. */
+   /* MSR, PMC and TSC operations.
+  err = 0/-EFAULT.  wrmsr returns 0/-EFAULT. */
u64 (*read_msr)(unsigned int msr, int *err);
int (*write_msr)(unsigned int msr, u64 val);
 
@@ -88,6 +115,7 @@ struct paravirt_ops
u64 (*get_scheduled_cycles)(void);
unsigned long (*get_cpu_khz)(void);
 
+   /* Segment descriptor handling */
void (*load_tr_desc)(void);
void (*load_gdt)(const struct Xgt_desc_struct *);
void (*load_idt)(const struct Xgt_desc_struct *);
@@ -105,9 +133,12 @@ struct paravirt_ops
void (*load_esp0)(struct tss_struct *tss, struct thread_struct *t);
 
void (*set_iopl_mask)(unsigned mask);
-
void (*io_delay)(void);
 
+   /*
+* Hooks for intercepting the creation/use/destruction of an
+* mm_struct.
+*/
void (*activate_mm)(struct mm_struct *prev,
struct mm_struct *next);
void (*dup_mmap)(struct mm_struct *oldmm,
@@ -115,30 +146,43 @@ struct paravirt_ops
void (*exit_mmap)(struct mm_struct *mm);
 
 #ifdef CONFIG_X86_LOCAL_APIC
+   /*
+* Direct APIC operations, principally for VMI.  Ideally
+* these shouldn't be in this interface.
+*/
void (*apic_write)(unsigned long reg, unsigned long v);
void (*apic_write_atomic)(unsigned long reg, unsigned long v);
unsigned long (*apic_read)(unsigned long reg);
void (*setup_boot_clock)(void);
void (*setup_secondary_clock)(void);
+
+   void (*startup_ipi_hook)(int phys_apicid,
+unsigned long start_eip,
+unsigned long start_esp);
 #endif
 
+   /* TLB operations */
void (*flush_tlb_user)(void);
void (*flush_tlb_kernel)(void);
void (*flush_tlb_single)(unsigned long addr);
 
void (*map_pt_hook)(int type, pte_t *va, u32 pfn);
 
+   /* Hooks for allocating/releasing pagetable pages */
void (*alloc_pt)(u32 pfn);
void (*alloc_pd)(u32 pfn);
void (*alloc_pd_clone)(u32 pfn, u32 clonepfn, u32 start, u32 count);
void (*release_pt)(u32 pfn);
void (*release_pd)(u32 pfn);
 
+ 

[patch 19/20] Add a sched_clock paravirt_op

2007-04-04 Thread Jeremy Fitzhardinge
The tsc-based get_scheduled_cycles interface is not a good match for
Xen's runstate accounting, which reports everything in nanoseconds.

This patch replaces this interface with a sched_clock interface, which
matches both Xen and VMI's requirements.

In order to do this, we:
   1. replace get_scheduled_cycles with sched_clock
   2. hoist cycles_2_ns into a common header
   3. update vmi accordingly

One thing to note: because sched_clock is implemented as a weak
function in kernel/sched.c, we must define a real function in order to
override this weak binding.  This means the usual paravirt_ops
technique of using an inline function won't work in this case.


Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Cc: Zachary Amsden <[EMAIL PROTECTED]>
Cc: Dan Hecht <[EMAIL PROTECTED]>
Cc: john stultz <[EMAIL PROTECTED]>

---
 arch/i386/kernel/paravirt.c|2 -
 arch/i386/kernel/sched-clock.c |   39 ++
 arch/i386/kernel/vmi.c |2 -
 arch/i386/kernel/vmitime.c |6 ++---
 include/asm-i386/paravirt.h|7 --
 include/asm-i386/timer.h   |   45 +++-
 include/asm-i386/vmi_time.h|2 -
 7 files changed, 71 insertions(+), 32 deletions(-)

===
--- a/arch/i386/kernel/paravirt.c
+++ b/arch/i386/kernel/paravirt.c
@@ -269,7 +269,7 @@ struct paravirt_ops paravirt_ops = {
.write_msr = native_write_msr_safe,
.read_tsc = native_read_tsc,
.read_pmc = native_read_pmc,
-   .get_scheduled_cycles = native_read_tsc,
+   .sched_clock = native_sched_clock,
.get_cpu_khz = native_calculate_cpu_khz,
.load_tr_desc = native_load_tr_desc,
.set_ldt = native_set_ldt,
===
--- a/arch/i386/kernel/sched-clock.c
+++ b/arch/i386/kernel/sched-clock.c
@@ -37,26 +37,7 @@
 
 #define CYC2NS_SCALE_FACTOR 10 /* 2^10, carefully chosen */
 
-struct sc_data {
-   unsigned int cyc2ns_scale;
-   unsigned char unstable;
-   unsigned long long last_tsc;
-   unsigned long long ns_base;
-};
-
-static DEFINE_PER_CPU(struct sc_data, sc_data);
-
-static inline unsigned long long cycles_2_ns(unsigned long long cyc)
-{
-   const struct sc_data *sc = &__get_cpu_var(sc_data);
-   unsigned long long ns;
-
-   cyc -= sc->last_tsc;
-   ns = (cyc * sc->cyc2ns_scale) >> CYC2NS_SCALE_FACTOR;
-   ns += sc->ns_base;
-
-   return ns;
-}
+DEFINE_PER_CPU(struct sc_data, sc_data);
 
 /*
  * Scheduler clock - returns current time in nanosec units.
@@ -66,7 +47,7 @@ static inline unsigned long long cycles_
  * [1] no attempt to stop CPU instruction reordering, which can hit
  * in a 100 instruction window or so.
  */
-unsigned long long sched_clock(void)
+unsigned long long native_sched_clock(void)
 {
unsigned long long r;
	const struct sc_data *sc = &get_cpu_var(sc_data);
@@ -75,7 +56,7 @@ unsigned long long sched_clock(void)
/* TBD find a cheaper fallback timer than this */
r = ktime_to_ns(ktime_get());
} else {
-   get_scheduled_cycles(r);
+   rdtscll(r);
r = cycles_2_ns(r);
}
 
@@ -83,6 +64,18 @@ unsigned long long sched_clock(void)
 
return r;
 }
+
+/* We need to define a real function for sched_clock, to override the
+   weak default version */
+#ifdef CONFIG_PARAVIRT
+unsigned long long sched_clock(void)
+{
+   return paravirt_sched_clock();
+}
+#else
+unsigned long long sched_clock(void)
+   __attribute__((alias("native_sched_clock")));
+#endif
 
 /* Resync with new CPU frequency */
 static void resync_sc_freq(struct sc_data *sc, unsigned int newfreq)
@@ -95,7 +88,7 @@ static void resync_sc_freq(struct sc_dat
   because sched_clock callers should be able to tolerate small
   errors. */
sc->ns_base = ktime_to_ns(ktime_get());
-   get_scheduled_cycles(sc->last_tsc);
+   rdtscll(sc->last_tsc);
	sc->cyc2ns_scale = (1000000 << CYC2NS_SCALE_FACTOR) / newfreq;
sc->unstable = 0;
 }
===
--- a/arch/i386/kernel/vmi.c
+++ b/arch/i386/kernel/vmi.c
@@ -866,7 +866,7 @@ static inline int __init activate_vmi(vo
paravirt_ops.setup_boot_clock = vmi_timer_setup_boot_alarm;
	paravirt_ops.setup_secondary_clock = vmi_timer_setup_secondary_alarm;
 #endif
-   paravirt_ops.get_scheduled_cycles = vmi_get_sched_cycles;
+   paravirt_ops.sched_clock = vmi_sched_clock;
paravirt_ops.get_cpu_khz = vmi_cpu_khz;
 
/* We have true wallclock functions; disable CMOS clock sync */
===
--- a/arch/i386/kernel/vmitime.c
+++ b/arch/i386/kernel/vmitime.c
@@ -163,9 +163,9 @@ int vmi_set_wallclock(unsigned long now)
  

[patch 08/20] add hooks to intercept mm creation and destruction

2007-04-04 Thread Jeremy Fitzhardinge
Add hooks to allow a paravirt implementation to track the lifetime of
an mm.  Paravirtualization requires three hooks, but only two are
needed in common code.  They are:

arch_dup_mmap, which is called when a new mmap is created at fork

arch_exit_mmap, which is called when the last process reference to an
  mm is dropped, which typically happens on exit and exec.

The third hook is activate_mm, which is called from the arch-specific
activate_mm() macro/function, and so doesn't need stub versions for
other architectures.  It's called when an mm is first used.

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Cc: linux-arch@vger.kernel.org
Cc: James Bottomley <[EMAIL PROTECTED]>
Acked-by: Ingo Molnar <[EMAIL PROTECTED]>

---
 arch/i386/kernel/paravirt.c |4 
 include/asm-alpha/mmu_context.h |1 +
 include/asm-arm/mmu_context.h   |1 +
 include/asm-arm26/mmu_context.h |2 ++
 include/asm-avr32/mmu_context.h |1 +
 include/asm-cris/mmu_context.h  |2 ++
 include/asm-frv/mmu_context.h   |1 +
 include/asm-generic/mm_hooks.h  |   18 ++
 include/asm-h8300/mmu_context.h |1 +
 include/asm-i386/mmu_context.h  |   17 +++--
 include/asm-i386/paravirt.h |   23 +++
 include/asm-ia64/mmu_context.h  |1 +
 include/asm-m32r/mmu_context.h  |1 +
 include/asm-m68k/mmu_context.h  |1 +
 include/asm-m68knommu/mmu_context.h |1 +
 include/asm-mips/mmu_context.h  |1 +
 include/asm-parisc/mmu_context.h|1 +
 include/asm-powerpc/mmu_context.h   |1 +
 include/asm-ppc/mmu_context.h   |1 +
 include/asm-s390/mmu_context.h  |2 ++
 include/asm-sh/mmu_context.h|1 +
 include/asm-sh64/mmu_context.h  |2 +-
 include/asm-sparc/mmu_context.h |2 ++
 include/asm-sparc64/mmu_context.h   |1 +
 include/asm-um/mmu_context.h|2 ++
 include/asm-v850/mmu_context.h  |2 ++
 include/asm-x86_64/mmu_context.h|1 +
 include/asm-xtensa/mmu_context.h|1 +
 kernel/fork.c   |2 ++
 mm/mmap.c   |4 
 30 files changed, 96 insertions(+), 3 deletions(-)

===
--- a/arch/i386/kernel/paravirt.c
+++ b/arch/i386/kernel/paravirt.c
@@ -520,6 +520,10 @@ struct paravirt_ops paravirt_ops = {
.irq_enable_sysexit = native_irq_enable_sysexit,
.iret = native_iret,
 
+   .dup_mmap = paravirt_nop,
+   .exit_mmap = paravirt_nop,
+   .activate_mm = paravirt_nop,
+
.startup_ipi_hook = paravirt_nop,
 };
 
===
--- a/include/asm-alpha/mmu_context.h
+++ b/include/asm-alpha/mmu_context.h
@@ -10,6 +10,7 @@
 #include 
 #include 
 #include 
+#include <asm-generic/mm_hooks.h>
 
 /*
  * Force a context reload. This is needed when we change the page
===
--- a/include/asm-arm/mmu_context.h
+++ b/include/asm-arm/mmu_context.h
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include <asm-generic/mm_hooks.h>
 
 void __check_kvm_seq(struct mm_struct *mm);
 
===
--- a/include/asm-arm26/mmu_context.h
+++ b/include/asm-arm26/mmu_context.h
@@ -12,6 +12,8 @@
  */
 #ifndef __ASM_ARM_MMU_CONTEXT_H
 #define __ASM_ARM_MMU_CONTEXT_H
+
+#include <asm-generic/mm_hooks.h>
 
 #define init_new_context(tsk,mm)   0
 #define destroy_context(mm)do { } while(0)
===
--- a/include/asm-avr32/mmu_context.h
+++ b/include/asm-avr32/mmu_context.h
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include <asm-generic/mm_hooks.h>
 
 /*
  * The MMU "context" consists of two things:
===
--- a/include/asm-cris/mmu_context.h
+++ b/include/asm-cris/mmu_context.h
@@ -1,5 +1,7 @@
 #ifndef __CRIS_MMU_CONTEXT_H
 #define __CRIS_MMU_CONTEXT_H
+
+#include <asm-generic/mm_hooks.h>
 
 extern int init_new_context(struct task_struct *tsk, struct mm_struct *mm);
 extern void get_mmu_context(struct mm_struct *mm);
===
--- a/include/asm-frv/mmu_context.h
+++ b/include/asm-frv/mmu_context.h
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include <asm-generic/mm_hooks.h>
 
 static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
 {
===
--- /dev/null
+++ b/include/asm-generic/mm_hooks.h
@@ -0,0 +1,18 @@
+/*
+ * Define generic no-op hooks for arch_dup_mmap and arch_exit_mmap, to
+ * be included in asm-FOO/mmu_context.h for any arch FOO which doesn't
+ * need to hook these.
+ */
+#ifndef _ASM_GENERIC_MM_HOOKS_H
+#define _ASM_GENERIC_MM_HOOKS_H
+
+static inline void arch_dup_mmap(struct mm_struct *oldmm,
+struct mm_struct *mm)
+{
+}
+
+static inline void arch_exit_mmap(struct mm_struct *mm)
+{
+}
+
+#endif /* _ASM_GENERIC_MM_HOOKS_H */

[patch 07/20] Allow paravirt backend to choose kernel PMD sharing

2007-04-04 Thread Jeremy Fitzhardinge
Normally when running in PAE mode, the 4th PMD maps the kernel address
space, which can be shared among all processes (since they all need
the same kernel mappings).

Xen, however, does not allow guests to have the kernel pmd shared
between page tables, so parameterize pgtable.c to allow both modes of
operation.

There are several side-effects of this.  One is that vmalloc will
update the kernel address space mappings, and those updates need to be
propagated into all processes if the kernel mappings are not
intrinsically shared.  In the non-PAE case, this is done by
maintaining a pgd_list of all processes; this list is used when all
process pagetables must be updated.  pgd_list is threaded via
otherwise unused entries in the page structure for the pgd, which
means that the pgd must be page-sized for this to work.

Normally the PAE pgd is only 4x64 byte entries large, but Xen requires
the PAE pgd to be page aligned anyway, so this patch forces the pgd to be
page aligned+sized when the kernel pmd is unshared, to accommodate both
these requirements.

Also, since there may be several distinct kernel pmds (if the
user/kernel split is below 3G), there's no point in allocating them
from a slab cache; they're just allocated with get_free_page and
initialized appropriately.  (Of course they could be cached if there is
just a single kernel pmd - which is the default with a 3G user/kernel
split - but it doesn't seem worthwhile to add yet another case into
this code).

[ Many thanks to wli for review comments. ]

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Signed-off-by: William Lee Irwin III <[EMAIL PROTECTED]>
Cc: Zachary Amsden <[EMAIL PROTECTED]>
Cc: Christoph Lameter <[EMAIL PROTECTED]>
Acked-by: Ingo Molnar <[EMAIL PROTECTED]>
---
 arch/i386/kernel/paravirt.c|1 
 arch/i386/mm/fault.c   |6 +-
 arch/i386/mm/init.c|   18 +-
 arch/i386/mm/pageattr.c|2 
 arch/i386/mm/pgtable.c |   84 ++--
 include/asm-i386/paravirt.h|1 
 include/asm-i386/pgtable-2level-defs.h |2 
 include/asm-i386/pgtable-2level.h  |2 
 include/asm-i386/pgtable-3level-defs.h |6 ++
 include/asm-i386/pgtable-3level.h  |2 
 include/asm-i386/pgtable.h |7 ++
 11 files changed, 105 insertions(+), 26 deletions(-)

===
--- a/arch/i386/kernel/paravirt.c
+++ b/arch/i386/kernel/paravirt.c
@@ -604,6 +604,7 @@ struct paravirt_ops paravirt_ops = {
.name = "bare hardware",
.paravirt_enabled = 0,
.kernel_rpl = 0,
+   .shared_kernel_pmd = 1, /* Only used when CONFIG_X86_PAE is set */
 
.patch = native_patch,
.banner = default_banner,
===
--- a/arch/i386/mm/fault.c
+++ b/arch/i386/mm/fault.c
@@ -588,8 +588,7 @@ do_sigbus:
force_sig_info_fault(SIGBUS, BUS_ADRERR, address, tsk);
 }
 
-#ifndef CONFIG_X86_PAE
-void vmalloc_sync_all(void)
+void _vmalloc_sync_all(void)
 {
/*
 * Note that races in the updates of insync and start aren't
@@ -600,6 +599,8 @@ void vmalloc_sync_all(void)
static DECLARE_BITMAP(insync, PTRS_PER_PGD);
static unsigned long start = TASK_SIZE;
unsigned long address;
+
+   BUG_ON(SHARED_KERNEL_PMD);
 
BUILD_BUG_ON(TASK_SIZE & ~PGDIR_MASK);
for (address = start; address >= TASK_SIZE; address += PGDIR_SIZE) {
@@ -623,4 +624,3 @@ void vmalloc_sync_all(void)
start = address + PGDIR_SIZE;
}
 }
-#endif
===
--- a/arch/i386/mm/init.c
+++ b/arch/i386/mm/init.c
@@ -715,6 +715,8 @@ struct kmem_cache *pmd_cache;
 
 void __init pgtable_cache_init(void)
 {
+   size_t pgd_size = PTRS_PER_PGD*sizeof(pgd_t);
+
if (PTRS_PER_PMD > 1) {
pmd_cache = kmem_cache_create("pmd",
PTRS_PER_PMD*sizeof(pmd_t),
@@ -724,13 +726,23 @@ void __init pgtable_cache_init(void)
NULL);
if (!pmd_cache)
panic("pgtable_cache_init(): cannot create pmd cache");
+
+   if (!SHARED_KERNEL_PMD) {
+   /* If we're in PAE mode and have a non-shared
+  kernel pmd, then the pgd size must be a
+  page size.  This is because the pgd_list
+  links through the page structure, so there
+  can only be one pgd per page for this to
+  work. */
+   pgd_size = PAGE_SIZE;
+   }
}
pgd_cache = kmem_cache_create("pgd",
-   PTRS_PER_PGD*sizeof(pgd_t),
-   PTRS_PER_PGD*sizeof(pgd_t),
+  

[patch 20/20] Add apply_to_page_range() which applies a function to a pte range.

2007-04-04 Thread Jeremy Fitzhardinge
Add a new mm function apply_to_page_range() which applies a given
function to every pte in a given virtual address range in a given mm
structure. This is a generic alternative to cut-and-pasting the Linux
idiomatic pagetable walking code in every place that a sequence of
PTEs must be accessed.

Although this interface is intended to be useful in a wide range of
situations, it is currently used specifically by several Xen
subsystems, for example: to ensure that pagetables have been allocated
for a virtual address range, and to construct batched special
pagetable update requests to map I/O memory (in ioremap()).

Signed-off-by: Ian Pratt <[EMAIL PROTECTED]>
Signed-off-by: Christian Limpach <[EMAIL PROTECTED]>
Signed-off-by: Chris Wright <[EMAIL PROTECTED]>
Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Cc: Christoph Lameter <[EMAIL PROTECTED]>
Cc: Matt Mackall <[EMAIL PROTECTED]>
Acked-by: Ingo Molnar <[EMAIL PROTECTED]> 

---
 include/linux/mm.h |5 ++
 mm/memory.c|   94 
 2 files changed, 99 insertions(+)

===
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1135,6 +1135,11 @@ struct page *follow_page(struct vm_area_
 #define FOLL_GET   0x04/* do get_page on page */
 #define FOLL_ANON  0x08/* give ZERO_PAGE if no pgtable */
 
+typedef int (*pte_fn_t)(pte_t *pte, struct page *pmd_page, unsigned long addr,
+   void *data);
+extern int apply_to_page_range(struct mm_struct *mm, unsigned long address,
+  unsigned long size, pte_fn_t fn, void *data);
+
 #ifdef CONFIG_PROC_FS
 void vm_stat_account(struct mm_struct *, unsigned long, struct file *, long);
 #else
===
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1448,6 +1448,100 @@ int remap_pfn_range(struct vm_area_struc
 }
 EXPORT_SYMBOL(remap_pfn_range);
 
+static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
+unsigned long addr, unsigned long end,
+pte_fn_t fn, void *data)
+{
+   pte_t *pte;
+   int err;
+   struct page *pmd_page;
+   spinlock_t *ptl;
+
+   pte = (mm == &init_mm) ?
+   pte_alloc_kernel(pmd, addr) :
+   pte_alloc_map_lock(mm, pmd, addr, &ptl);
+   if (!pte)
+   return -ENOMEM;
+
+   BUG_ON(pmd_huge(*pmd));
+
+   pmd_page = pmd_page(*pmd);
+
+   do {
+   err = fn(pte, pmd_page, addr, data);
+   if (err)
+   break;
+   } while (pte++, addr += PAGE_SIZE, addr != end);
+
+   if (mm != &init_mm)
+   pte_unmap_unlock(pte-1, ptl);
+   return err;
+}
+
+static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
+unsigned long addr, unsigned long end,
+pte_fn_t fn, void *data)
+{
+   pmd_t *pmd;
+   unsigned long next;
+   int err;
+
+   pmd = pmd_alloc(mm, pud, addr);
+   if (!pmd)
+   return -ENOMEM;
+   do {
+   next = pmd_addr_end(addr, end);
+   err = apply_to_pte_range(mm, pmd, addr, next, fn, data);
+   if (err)
+   break;
+   } while (pmd++, addr = next, addr != end);
+   return err;
+}
+
+static int apply_to_pud_range(struct mm_struct *mm, pgd_t *pgd,
+unsigned long addr, unsigned long end,
+pte_fn_t fn, void *data)
+{
+   pud_t *pud;
+   unsigned long next;
+   int err;
+
+   pud = pud_alloc(mm, pgd, addr);
+   if (!pud)
+   return -ENOMEM;
+   do {
+   next = pud_addr_end(addr, end);
+   err = apply_to_pmd_range(mm, pud, addr, next, fn, data);
+   if (err)
+   break;
+   } while (pud++, addr = next, addr != end);
+   return err;
+}
+
+/*
+ * Scan a region of virtual memory, filling in page tables as necessary
+ * and calling a provided function on each leaf page table.
+ */
+int apply_to_page_range(struct mm_struct *mm, unsigned long addr,
+   unsigned long size, pte_fn_t fn, void *data)
+{
+   pgd_t *pgd;
+   unsigned long next;
+   unsigned long end = addr + size;
+   int err;
+
+   BUG_ON(addr >= end);
+   pgd = pgd_offset(mm, addr);
+   do {
+   next = pgd_addr_end(addr, end);
+   err = apply_to_pud_range(mm, pgd, addr, next, fn, data);
+   if (err)
+   break;
+   } while (pgd++, addr = next, addr != end);
+   return err;
+}
+EXPORT_SYMBOL_GPL(apply_to_page_range);
+
 /*
  * handle_pte_fault chooses page fault handler according to an entry
  * which was read non-atomically.  Before making any commitment, on

-- 


[patch 06/20] Allocate a fixmap slot

2007-04-04 Thread Jeremy Fitzhardinge
Allocate a fixmap slot for use by a paravirt_ops implementation.  This
is intended for early-boot bootstrap mappings.  Once the zones and
allocator have been set up, it would be better to use get_vm_area() to
allocate some virtual space.

Xen uses this to map the hypervisor's shared info page, which doesn't
have a pseudo-physical page number, and therefore can't be mapped
ordinarily.  It is needed early because it contains the vcpu state,
including the interrupt mask.

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Acked-by: Ingo Molnar <[EMAIL PROTECTED]>

---
 include/asm-i386/fixmap.h |3 +++
 1 file changed, 3 insertions(+)

===
--- a/include/asm-i386/fixmap.h
+++ b/include/asm-i386/fixmap.h
@@ -86,6 +86,9 @@ enum fixed_addresses {
 #ifdef CONFIG_PCI_MMCONFIG
FIX_PCIE_MCFG,
 #endif
+#ifdef CONFIG_PARAVIRT
+   FIX_PARAVIRT_BOOTMAP,
+#endif
__end_of_permanent_fixed_addresses,
/* temporary boot-time mappings, used before ioremap() is functional */
 #define NR_FIX_BTMAPS  16

-- 



[patch 09/20] rename struct paravirt_patch to paravirt_patch_site for clarity

2007-04-04 Thread Jeremy Fitzhardinge
Rename struct paravirt_patch to paravirt_patch_site, so that it
clearly refers to a callsite, and not the patch which may be applied
to that callsite.

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Cc: Rusty Russell <[EMAIL PROTECTED]>
Cc: Zachary Amsden <[EMAIL PROTECTED]>

---
 arch/i386/kernel/alternative.c |9 -
 arch/i386/kernel/vmi.c |4 
 include/asm-i386/alternative.h |8 +---
 include/asm-i386/paravirt.h|5 -
 4 files changed, 13 insertions(+), 13 deletions(-)

===
--- a/arch/i386/kernel/alternative.c
+++ b/arch/i386/kernel/alternative.c
@@ -335,9 +335,10 @@ void alternatives_smp_switch(int smp)
 #endif
 
 #ifdef CONFIG_PARAVIRT
-void apply_paravirt(struct paravirt_patch *start, struct paravirt_patch *end)
-{
-   struct paravirt_patch *p;
+void apply_paravirt(struct paravirt_patch_site *start,
+   struct paravirt_patch_site *end)
+{
+   struct paravirt_patch_site *p;
 
if (noreplace_paravirt)
return;
@@ -355,8 +356,6 @@ void apply_paravirt(struct paravirt_patc
/* Sync to be conservative, in case we patched following instructions */
sync_core();
 }
-extern struct paravirt_patch __parainstructions[],
-   __parainstructions_end[];
 #endif /* CONFIG_PARAVIRT */
 
 void __init alternative_instructions(void)
===
--- a/arch/i386/kernel/vmi.c
+++ b/arch/i386/kernel/vmi.c
@@ -70,10 +70,6 @@ static struct {
void (*set_initial_ap_state)(int, int);
void (*halt)(void);
 } vmi_ops;
-
-/* XXX move this to alternative.h */
-extern struct paravirt_patch __parainstructions[],
-   __parainstructions_end[];
 
 /*
  * VMI patching routines.
===
--- a/include/asm-i386/alternative.h
+++ b/include/asm-i386/alternative.h
@@ -114,12 +114,14 @@ static inline void alternatives_smp_swit
 #define LOCK_PREFIX ""
 #endif
 
-struct paravirt_patch;
+struct paravirt_patch_site;
 #ifdef CONFIG_PARAVIRT
-void apply_paravirt(struct paravirt_patch *start, struct paravirt_patch *end);
+void apply_paravirt(struct paravirt_patch_site *start,
+   struct paravirt_patch_site *end);
 #else
 static inline void
-apply_paravirt(struct paravirt_patch *start, struct paravirt_patch *end)
+apply_paravirt(struct paravirt_patch_site *start,
+  struct paravirt_patch_site *end)
 {}
 #define __parainstructions NULL
 #define __parainstructions_end NULL
===
--- a/include/asm-i386/paravirt.h
+++ b/include/asm-i386/paravirt.h
@@ -502,12 +502,15 @@ void _paravirt_nop(void);
 #define paravirt_nop   ((void *)_paravirt_nop)
 
 /* These all sit in the .parainstructions section to tell us what to patch. */
-struct paravirt_patch {
+struct paravirt_patch_site {
u8 *instr;  /* original instructions */
u8 instrtype;   /* type of this instruction */
u8 len; /* length of original instruction */
u16 clobbers;   /* what registers you may clobber */
 };
+
+extern struct paravirt_patch_site __parainstructions[],
+   __parainstructions_end[];
 
 #define paravirt_alt(insn_string, typenum, clobber)\
"771:\n\t" insn_string "\n" "772:\n"\

-- 



[patch 00/20] paravirt_ops updates

2007-04-04 Thread Jeremy Fitzhardinge
Hi Andi,

Here's a repost of the paravirt_ops update series I posted the other day.
Since then, I found a few potential bugs with patching clobbering,
cleaned up and documented paravirt.h and the patching machinery.

Overview:

add-MAINTAINERS.patch
obvious

remove-CONFIG_DEBUG_PARAVIRT.patch
No longer meaningful or needed.

paravirt-nop.patch
Clean up nop paravirt_ops functions, mainly to allow the patching
machinery to easily identify them.

paravirt-pte-accessors.patch
Accessors to allow pv_ops to control the content of pagetable entries.

paravirt-memory-init.patch
Hook into initial pagetable creation.

paravirt-fixmap.patch
Create a fixmap for early paravirt_ops mappings.

shared-kernel-pmd.patch
Make the choice of whether the kernel pmd is shared between
processes or not a runtime selectable flag.

mm-lifetime-hooks.patch
Hooks to allow the creation, use and destruction of an mm_struct
to be followed.

paravirt-patch-rename-paravirt_patch.patch
Rename a structure to make its use a bit more clear.

paravirt-use-offset-site-ids.patch
Use the offsetof each function pointer in paravirt_ops as the
basis of its patching identifier.

paravirt-fix-clobbers.patch
Fix up various register/use clobber problems.  This may be 2.6.21
material, but I don't think it will materially affect VMI.

paravirt-patchable-call-wrappers.patch
Wrap each paravirt_ops call to allow the callsites to be runtime
patched.

paravirt-document-paravirt_ops.patch
Document the paravirt_ops structure itself, the patching
mechanism, and other cleanups.

paravirt-patch-machinery.patch
General patch machinery for use by pv_ops backends to implement
patching.

paravirt-flush_tlb_others.patch
Add a hook for cross-cpu tlb flushing.

revert-map_pt_hook.patch
Back out the map_pt_hook change.

paravirt-kmap_atomic_pte.patch
Replace map_pt_hook with kmap_atomic_pte.

cleanup-tsc-sched-clock.patch
Clean up the tsc-based sched_clock.  (I think you already
have this.)

paravirt-sched-clock.patch
Add a hook for sched_clock, so that paravirt_ops backends can
report unstolen time for use as the scheduler clock.

apply-to-page-range.patch
Apply a function to a range of pagetable entries.

Thanks,
J

-- 



Re: MODULE_MAINTAINER

2007-04-04 Thread Stefan Richter
Adrian Bunk wrote:
> Realistically, users should report problems with vendor kernels to the 
> vendor and problems with ftp.kernel.org kernels to either linux-kernel 
> or the kernel Bugzilla, and forwarding issues to the responsible people 
> (if any) should be done there [1].
> 
>> Rene.
> 
> cu
> Adrian
> 
> [1] Andrew is doing this

At bugzilla.kernel.org, driver maintainers can also watch the respective
subsystem maintainer's address or subsystem's meta address to get
notified when a bug is being filed under a subsystem.
-- 
Stefan Richter
-=-=-=== -=-- --=-=
http://arcgraph.de/sr/


Re: + clocksource-driver-initialize-list-value.patch added to -mm tree

2007-04-04 Thread Daniel Walker
On Wed, 2007-04-04 at 16:48 -0700, Jeremy Fitzhardinge wrote:
> Daniel Walker wrote:
> > I vaguely remember, but I don't think this creates a maintenance
> > issue .. It's not related to maintenance , it's an issue of creating a
> > new clocksource .. My perspective is that it has even less an effect
> > than the CLOCK_SOURCE_IS_CONTINUOUS field .. People actually have to
> > research that field, but list initialization is fairly clear.
> >   
> 
> Yes, but its just make-work.  It has no bearing on what the clocksource
> implementer needs to care about.  CLOCK_SOURCE_IS_CONTINUOUS is a
> property of the clocksource they're implementing; it matters to them. 
> But "list"?  It's just administration.

Setting CLOCK_SOURCE_IS_CONTINUOUS is largely administration; do you
know what that flag means?

> > The majority method for creating these clocksources is copy, so
> > I'm not sure nano argument on this subject particularly relevant ..
> 
> Great, so we end up with some random piece of clocksource internal
> implementation detail cut-n-paste replicated all over the kernel.

list values and list initialization are hardly internal details; they
are commonly used all over the kernel.

Daniel



Re: 2.6.21-rc5-mm4

2007-04-04 Thread Antonino A. Daplas
On Thu, 2007-04-05 at 08:38 +1000, Con Kolivas wrote:
> On Thursday 05 April 2007 08:10, Andrew Morton wrote:
> > Thanks - that'll be the CPU scheduler changes.
> >
> > Con has produced a patch or two which might address this but afaik we don't
> > yet have a definitive fix?
> >
> > I believe that reverting
> > sched-implement-staircase-deadline-cpu-scheduler-staircase-improvements.patch
> > will prevent it.
> 
> I posted a definitive fix which Michal tested for me offlist. Subject was:
>  [PATCH] sched: implement staircase deadline cpu scheduler improvements fix
> 
> Sorry about relative noise prior to that. Akpm please pick it up.
> 
> Here again just in case.
> 

Rebooted a few times, I can confirm that this patch fixes this.

Thanks

Tony




Re: + clocksource-driver-initialize-list-value.patch added to -mm tree

2007-04-04 Thread Jeremy Fitzhardinge
Daniel Walker wrote:
> I vaguely remember, but I don't think this creates a maintenance
> issue .. It's not related to maintenance , it's an issue of creating a
> new clocksource .. My perspective is that it has even less an effect
> than the CLOCK_SOURCE_IS_CONTINUOUS field .. People actually have to
> research that field, but list initialization is fairly clear.
>   

Yes, but it's just make-work.  It has no bearing on what the clocksource
implementer needs to care about.  CLOCK_SOURCE_IS_CONTINUOUS is a
property of the clocksource they're implementing; it matters to them. 
But "list"?  It's just administration.

> The majority method for creating these clocksources is copy, so
> I'm not sure nano argument on this subject particularly relevant ..

Great, so we end up with some random piece of clocksource internal
implementation detail cut-n-paste replicated all over the kernel.

J


Re: 2.6.21-rc5-mm4 (SLUB)

2007-04-04 Thread Badari Pulavarty
On Wed, 2007-04-04 at 15:59 -0700, Christoph Lameter wrote:
> On Wed, 4 Apr 2007, Badari Pulavarty wrote:
> 
> > Here is the slub_debug=FU output with the above patch.
> 
> Hmmm... Looks like the object is actually free. Someone writes beyond the 
> end of the earlier object. Setting Z should check overwrites but it 
> switched off merging. So set
> 
> slub_debug = FZ
> 
> Analogous to the last patch you would need to take out redzoning from 
> the flags that stop merging. Then rerun. Maybe we can track it down this 
> way.

Hmm.. I did that and the machine boots fine, with absolutely no
debug messages :(

Thanks,
Badari





Re: [patch 04/17] Add pagetable accessors to pack and unpack pagetable entries

2007-04-04 Thread Jeremy Fitzhardinge
Rusty Russell wrote:
> You'll still have the damage inflicted on gcc's optimizer, though.

Well, I could remove the clobbers for PVOP_CALL[0-2] and add the
appropriate push/pops, and put similar push/pop wrappers around all the
called functions.  But it doesn't make it any prettier.

J


Re: set up new kernel with grub

2007-04-04 Thread Robert Hancock

Michael wrote:

Hi,

I compiled a new kernel: 2.6.20.3, and hope to test it without removing 
my old kernel.


Here is what I did by following 
http://searchenterpriselinux.techtarget.com/tip/0,289483,sid39_gci1204148,00.html 


Those instructions are way out of date. All you should need to do is 
"make modules_install install".


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/



Re: [linux-usb-devel] [RFC] HID bus design overview.

2007-04-04 Thread Adam Kropelin

Jiri Kosina wrote:

On Wed, 4 Apr 2007, Adam Kropelin wrote:


I apologize for picking up this thread late and asking what may be a
question with an obvious answer... Will hiddev still exist after
hidraw and the HID bus redesign work is done? I have a
widely-deployed userspace app that relies on hiddev, and I'm looking
for reassurance that it will still work as it always has...


Hi Adam,

hiddev will have to stay for quite some time, exactly because of
backward compatibility with userspace applications/drivers that use
it (I am not aware of many of them though, but apparently there are
some).


Apcupsd is the one on my mind, but I believe there are others.


I won't allow it to vanish, don't worry.


Thanks!


We just have to make sure that new users will use hidraw instead, as
it provides more flexibility for the user, is not dependent on the
underlying transport protocol, etc.


On Apcupsd we've recently introduced a libusb-based driver that does all 
HID parsing in userspace. Not only does that free us from hiddev, it 
also frees us from the umpteen other proprietary HID interfaces across 
various platforms. Although the hiddev-based driver is still the default 
for Linux platforms, I plan to change that in the next major release and 
thus begin migrating folks off of hiddev.


I appreciate your pledge to keep hiddev functioning in the mean time :)

--Adam



Re: [patch 4/5] smsc-ircc2: add PNP support

2007-04-04 Thread Randy Dunlap
On Wed, 04 Apr 2007 16:45:40 -0600 Bjorn Helgaas wrote:

> Index: w/drivers/net/irda/smsc-ircc2.c
> ===
> --- w.orig/drivers/net/irda/smsc-ircc2.c  2007-04-04 13:45:18.0 
> -0600
> +++ w/drivers/net/irda/smsc-ircc2.c   2007-04-04 13:47:00.0 -0600
> @@ -79,6 +79,10 @@
>  MODULE_DESCRIPTION("SMC IrCC SIR/FIR controller driver");
>  MODULE_LICENSE("GPL");
>  
> +static int smsc_nopnp;
> +module_param_named(nopnp, smsc_nopnp, bool, 0);
> +MODULE_PARM_DESC(nopnp, "Do not use PNP to detect controller settings");

Document this parameter (like you did "legacy_serial.force" in the
other patch -- thanks).

>  #define DMA_INVAL 255
>  static int ircc_dma = DMA_INVAL;
>  module_param(ircc_dma, int, 0);


---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***


Re: [PATCH 0/14] Pass MAP_FIXED down to get_unmapped_area

2007-04-04 Thread Benjamin Herrenschmidt
On Wed, 2007-04-04 at 11:31 +0100, David Howells wrote:
> Benjamin Herrenschmidt <[EMAIL PROTECTED]> wrote:
> 
> > This serie of patches moves the logic to handle MAP_FIXED down to the
> > various arch/driver get_unmapped_area() implementations, and then changes
> > the generic code to always call them. The hugetlbfs hacks then disappear
> > from the generic code.
> 
> This sounds like get_unmapped_area() is now doing more than it says on the
> tin.  As I understand it, it's to be called to locate an unmapped area when
> one wasn't specified by MAP_FIXED, and so shouldn't be called if MAP_FIXED is
> set.

Well... that was the initial implementation. But that doesn't quite deal
well with various issues, like page size constraints (such as the segment
constraints on powerpc) or other hugetlbfs-related issues, and the
aliasing problems on architectures with virtual caches...

Just look at how many architectures already have a special case for
MAP_FIXED in their arch_get_unmapped_area! It was never called so far,
though; my patch makes it get called.

I agree it's probably not the best interface, but I'm still trying to
figure out something that would be nicer as a "second step", as I don't
want to do too much in one set of patches. This series allows me to hook
in my SPE 64K page thingy, to clean up and improve a bit my hugetlb
handling, and possibly fixes some of those aliasing issues on
architectures with virtual caches...

> Admittedly, on NOMMU, it's also used to find the location of quasi-memory
> devices such as framebuffers and ramfs files, but that's not a great deviation
> from the original intent.
> 
> Perhaps a change of name is in order for the function?

I'm not sure. "get" can mean "obtain" :-) The way it's currently
implemented for me on powerpc works fine that way; I don't need an
"unget".

> > Since I need to do some special 64K pages mappings for SPEs on cell, I need
> > to work around the first problem at least. I have further patches thus
> > implementing a "slices" layer that handles multiple page sizes through
> > slices of the address space for use by hugetlbfs, the SPE code, and possibly
> > others, but it requires that serie of patches first/
> 
> That makes it sound like there should be an "unget" too for when an error
> occurs between ->get_unmapped_area() being called and ->mmap() returning
> successfully.

I don't need it because I can flip the page size of the segment back if
it has no VMA in it on the next get_unmapped_area(). Again, I'd like to
come up with a better interface, and I might post something in that
direction next week, but I beleive those patches (+/- bug fixes) are a
good first step in the right direction. I also need to find a proper way
to solve the mremap problem as it's bogus as it is already with things
like hugetlbfs on powerpc at least. 

Ben.




Re: 2.6.20.3 AMD64 oops in CFQ code

2007-04-04 Thread Bill Davidsen

Tejun Heo wrote:

[resending.  my mail service was down for more than a week and this
message didn't get delivered.]

[EMAIL PROTECTED] wrote:
  

Anyway, what's annoying is that I can't figure out how to bring the
drive back on line without resetting the box.  It's in a hot-swap
enclosure, but power cycling the drive doesn't seem to help.  I thought
libata hotplug was working?  (SiI3132 card, using the sil24 driver.)


Yeah, it's working but failing resets are considered highly dangerous
(in that the controller status is unknown and may cause something
dangerous like screaming interrupts) and port is muted after that.  The
plan is to handle this with polling hotplug such that libata tries to
revive the port if PHY status change is detected by polling.  Patches
are available but they need other things to resolved to get integrated.
 I think it'll happen before the summer.

Anyways, you can tell libata to retry the port by manually telling it to
rescan the port (echo - - - > /sys/class/scsi_host/hostX/scan).
  
I won't say that's voodoo, but if I ever did it I'd wipe down my 
keyboard with holy water afterward. ;-)


Well, I did save the message in my tricks file, but it sounds like a
last-ditch effort after something has gone very wrong.


--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



Re: [PATCH 12/14] get_unmapped_area handles MAP_FIXED in /dev/mem (nommu)

2007-04-04 Thread Benjamin Herrenschmidt
On Wed, 2007-04-04 at 11:31 +0100, David Howells wrote:
> Benjamin Herrenschmidt <[EMAIL PROTECTED]> wrote:
> 
> > +   if (flags & MAP_FIXED)
> > +   if ((addr >> PAGE_SHIFT) != pgoff)
> > +   return (unsigned long) -EINVAL;
> 
> Again... in NOMMU-mode there is no MAP_FIXED - it's rejected before we get
> this far.
> 
> > -   return pgoff;
> > +   return pgoff << PAGE_SHIFT;
> 
> That, however, does appear to be a genuine bugfix.

I'll separate it from the rest of the patches

Ben.




Re: [PATCH 11/14] get_unmapped_area handles MAP_FIXED on ramfs (nommu)

2007-04-04 Thread Benjamin Herrenschmidt
On Wed, 2007-04-04 at 11:16 +0100, David Howells wrote:
> Benjamin Herrenschmidt <[EMAIL PROTECTED]> wrote:
> 
> > -   if (!(flags & MAP_SHARED))
> > +   /* Deal with MAP_FIXED differently ? Forbid it ? Need help from some 
> > nommu
> > +* folks there... --BenH.
> > +*/
> > +   if ((flags & MAP_FIXED) || !(flags & MAP_SHARED))
> 
> MAP_FIXED on NOMMU?  Surely you jest...

Heh, see the comment, I was actually wondering about it :-)

> See the first if-statement in validate_mmap_request().
> 
> If anything, you should be adding BUG_ON(flags & MAP_FIXED).

Yeah, I missed that bit. That will simplify the problem.

Thanks,
Ben.




Re: [ckrm-tech] [PATCH 7/7] containers (V7): Container interface to nsproxy subsystem

2007-04-04 Thread Eric W. Biederman

Next time I have a moment I will try and take a closer look.  However,
currently these approaches feel like there is some unholy coupling going
on between different things.

In addition there appear to be some weird assumptions (an array with
one member per task_struct) in the group.  The pid limit allows
us millions of task_structs if the user wants it.   A several megabyte
array sounds like a completely unsuitable data structure.   What
is wrong with just traversing task_list at least initially?

What happened to the unix philosophy of starting with simple and
correct designs? 

Further we have several different questions that are all mixed up 
in this thread.

- What functionality do we want to provide?
- How do we want to export that functionality to user space?

You can share code by having an embedded structure instead of a magic
subsystem things have to register with, and I expect that would be
preferable.  Libraries almost always are easier to work with than
a subsystem with strict rules that doesn't give you choices.

Why do we need a subsystem id?  Do we expect any controllers to
be provided as modules?  I think the code is so deep in the core that
modules of any form are a questionable assumption.

There is a real issue to be solved here that we can't add
controls/limits for a group of processes if we don't have a user space
interface for it.

Are the issues of building a user space interface so very hard,
and the locking so very nasty, or is this a case of over-engineering?

I'm inclined to the rcfs variant and using nsproxy (from what I have
seen in passing) because it is more of an exercise in minimalism, and I
am strongly inclined to be minimalistic.  The straight cpuset
derivative seems to start with everything but the kitchen sink and
then add on to it.  Which at first glance seems unhealthy.

Eric


Re: [stable] [patch 00/37] 2.6.20-stable review

2007-04-04 Thread Greg KH
On Wed, Apr 04, 2007 at 10:28:21AM -0400, Chuck Ebbert wrote:
> Greg KH wrote:
> > This is the start of the stable review cycle for the 2.6.20.5 release.
> > There are 37 patches in this series, all will be posted as a response to
> > this one.  If anyone has any issues with these being applied, please let
> > us know.  If anyone is a maintainer of the proper subsystem, and wants
> > to add a Signed-off-by: line to the patch, please respond with it.
> 
> Will this be released anytime soon?

Yes, sorry; I ended up having to travel over to Germany on kind of
short notice.  I'll push the release out tomorrow (.de timezone)
when I get back to a better network connection...

thanks,

greg k-h


[PATCH 3/4] IA64: SPARSE_VIRTUAL 16K page size support

2007-04-04 Thread Christoph Lameter
[IA64] Sparse virtual implementation

Equip IA64 sparsemem with a virtual memmap. This is similar to the existing
CONFIG_VMEMMAP functionality for discontig. It uses a page size mapping.

This is provided as a minimally intrusive solution. We split the
128TB VMALLOC area into two 64TB areas and use one for the virtual memmap.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

Index: linux-2.6.21-rc5-mm2/arch/ia64/Kconfig
===
--- linux-2.6.21-rc5-mm2.orig/arch/ia64/Kconfig 2007-04-02 16:15:29.0 
-0700
+++ linux-2.6.21-rc5-mm2/arch/ia64/Kconfig  2007-04-02 16:15:50.0 
-0700
@@ -350,6 +350,10 @@ config ARCH_SPARSEMEM_ENABLE
def_bool y
depends on ARCH_DISCONTIGMEM_ENABLE
 
+config SPARSE_VIRTUAL
+   def_bool y
+   depends on ARCH_SPARSEMEM_ENABLE
+
 config ARCH_DISCONTIGMEM_DEFAULT
def_bool y if (IA64_SGI_SN2 || IA64_GENERIC || IA64_HP_ZX1 || 
IA64_HP_ZX1_SWIOTLB)
depends on ARCH_DISCONTIGMEM_ENABLE
Index: linux-2.6.21-rc5-mm2/include/asm-ia64/page.h
===
--- linux-2.6.21-rc5-mm2.orig/include/asm-ia64/page.h   2007-04-02 
16:15:29.0 -0700
+++ linux-2.6.21-rc5-mm2/include/asm-ia64/page.h2007-04-02 
16:15:50.0 -0700
@@ -106,6 +106,9 @@ extern int ia64_pfn_valid (unsigned long
 # define ia64_pfn_valid(pfn) 1
 #endif
 
+#define vmemmap ((struct page *)(RGN_BASE(RGN_GATE) + \
+   (1UL << (4*PAGE_SHIFT - 10))))
+
 #ifdef CONFIG_VIRTUAL_MEM_MAP
 extern struct page *vmem_map;
 #ifdef CONFIG_DISCONTIGMEM
Index: linux-2.6.21-rc5-mm2/include/asm-ia64/pgtable.h
===
--- linux-2.6.21-rc5-mm2.orig/include/asm-ia64/pgtable.h2007-04-02 
16:15:29.0 -0700
+++ linux-2.6.21-rc5-mm2/include/asm-ia64/pgtable.h 2007-04-02 
16:15:50.0 -0700
@@ -236,8 +236,13 @@ ia64_phys_addr_valid (unsigned long addr
 # define VMALLOC_END   vmalloc_end
   extern unsigned long vmalloc_end;
 #else
+#if defined(CONFIG_SPARSEMEM) && defined(CONFIG_SPARSE_VIRTUAL)
+/* SPARSE_VIRTUAL uses half of vmalloc... */
+# define VMALLOC_END   (RGN_BASE(RGN_GATE) + (1UL << (4*PAGE_SHIFT - 10)))
+#else
# define VMALLOC_END   (RGN_BASE(RGN_GATE) + (1UL << (4*PAGE_SHIFT - 9)))
 #endif
+#endif
 
 /* fs/proc/kcore.c */
 #definekc_vaddr_to_offset(v) ((v) - RGN_BASE(RGN_GATE))


Re: [linux-usb-devel] [RFC] HID bus design overview.

2007-04-04 Thread Jiri Kosina
On Wed, 4 Apr 2007, Adam Kropelin wrote:

> I apologize for picking up this thread late and asking what may be a 
> question with an obvious answer... Will hiddev still exist after hidraw 
> and the HID bus redesign work is done? I have a widely-deployed 
> userspace app that relies on hiddev, and I'm looking for reassurance 
> that it will still work as it always has...

Hi Adam,

hiddev will have to stay for quite some time, exactly because of backward 
compatibility with userspace applications/drivers that use it (I am not 
aware of many of them though, but apparently there are some). I won't 
allow it to vanish, don't worry.

We just have to make sure that new users will use hidraw instead, as it 
provides more flexibility for the user, is not dependent on the underlying 
transport protocol, etc.

-- 
Jiri Kosina


[PATCH 4/4] IA64: SPARSE_VIRTUAL 16M page size support

2007-04-04 Thread Christoph Lameter
[IA64] Large vmemmap support

This implements granule page sized vmemmap support for IA64. This is
important because the traditional vmemmap on IA64 uses page-size TLB
mappings. For a typical 8GB node on IA64 we need about
2^(33 - 14 + 6) = 2^25 bytes = 32 MB of page structs.

Using page size mappings we will end up with 2^(25 - 14) = 2^11 = 2048
page table entries.

This patch will reduce this to two 16MB TLB entries. So it's a factor
of about 1000 fewer TLB entries for the virtual memory map.

We modify the alt_dtlb_miss handler to branch to a vmemmap TLB lookup
function if bit 60 is set. The vmemmap will start with 0xF000xxx so its
going be very distinctive in dumps and can be distinguished easily from
0xE000xxx (kernel 1-1 area) and 0xA000xxx (kernel text, data and vmalloc).

We use a 1 level page table to do lookups for the vmemmap TLBs. Since
we need to cover 1 Petabyte we need to reserve 1 megabyte just for
the table but we can statically allocate it in the data segment. This
simplifies lookups and handling. The fault handler only has to do
a single lookup in contrast to 4 for the current vmalloc/vmemmap
implementation.

Problems with this patchset are:

1. Large 1M array required to cover all of possible memory (1 Petabyte).
   Maybe reduce this to actually supported HW sizes? 16TB or 64TB?

2. For systems with small nodes there is a significant chance of
   large overlaps. We could dynamically determine the TLB size
   but that would make the code more complex.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

Index: linux-2.6.21-rc5-mm4/arch/ia64/kernel/ivt.S
===
--- linux-2.6.21-rc5-mm4.orig/arch/ia64/kernel/ivt.S2007-04-04 
15:45:47.0 -0700
+++ linux-2.6.21-rc5-mm4/arch/ia64/kernel/ivt.S 2007-04-04 15:49:24.0 
-0700
@@ -391,9 +391,11 @@ ENTRY(alt_dtlb_miss)
tbit.z p12,p0=r16,61// access to region 6?
mov r25=PERCPU_PAGE_SHIFT << 2
mov r26=PERCPU_PAGE_SIZE
-   nop.m 0
-   nop.b 0
+   tbit.nz p6,p0=r16,60// Access to VMEMMAP?
+(p6)   br.cond.dptk vmemmap
;;
+dtlb_continue:
+   .pred.rel "mutex", p11, p10
 (p10)  mov r19=IA64_KR(PER_CPU_DATA)
 (p11)  and r19=r19,r16 // clear non-ppn fields
extr.u r23=r21,IA64_PSR_CPL0_BIT,2  // extract psr.cpl
@@ -416,6 +418,37 @@ ENTRY(alt_dtlb_miss)
 (p7)   itc.d r19   // insert the TLB entry
mov pr=r31,-1
rfi
+
+vmemmap:
+   //
+   // Granule lookup via vmemmap_table for
+   // the virtual memory map.
+   //
+   tbit.nz p6,p0=r16,59// more top bits set?
+(p6)   br.cond.spnt dtlb_continue  // then its mmu bootstrap
+   ;;
+   rsm psr.dt  // switch to using physical 
data addressing
+   extr.u r25=r16, IA64_GRANULE_SHIFT, 32
+   ;;
+   srlz.d
+   LOAD_PHYSICAL(p0, r26, vmemmap_table)
+   shl r25=r25,2
+   ;;
+   add r26=r26,r25 // Index into vmemmap table
+   ;;
+   ld4 r25=[r26]   // Get 32 bit descriptor */
+   ;;
+   dep.z r19=r25, 0, 31// Isolate ppn
+   tbit.z p6,p0=r25, 31// Present bit set?
+(p6)   br.cond.spnt page_fault // Page not present
+   ;;
+   shl r19=r19, IA64_GRANULE_SHIFT // Shift ppn in place
+   ;;
+   or r19=r19,r17  // insert PTE control bits into r19
+   ;;
+   itc.d r19   // insert the TLB entry
+   mov pr=r31,-1
+   rfi
 END(alt_dtlb_miss)
 
.org ia64_ivt+0x1400
Index: linux-2.6.21-rc5-mm4/arch/ia64/mm/discontig.c
===
--- linux-2.6.21-rc5-mm4.orig/arch/ia64/mm/discontig.c  2007-04-04 
15:45:47.0 -0700
+++ linux-2.6.21-rc5-mm4/arch/ia64/mm/discontig.c   2007-04-04 
15:53:02.0 -0700
@@ -8,6 +8,8 @@
  * Russ Anderson <[EMAIL PROTECTED]>
  * Jesse Barnes <[EMAIL PROTECTED]>
  * Jack Steiner <[EMAIL PROTECTED]>
+ * Copyright (C) 2007 sgi
+ * Christoph Lameter <[EMAIL PROTECTED]>
  */
 
 /*
@@ -44,6 +46,79 @@ struct early_node_data {
unsigned long max_pfn;
 };
 
+#ifdef CONFIG_ARCH_POPULATES_VIRTUAL_MEMMAP
+
+/*
+ * The vmemmap_table contains the number of the granule used to map
+ * that section of the virtual memmap.
+ *
+ * We support 50 address bits, 14 bits are used for the page size. This
+ * leaves 36 bits (64G) for the pfn. Using page structs the memmap is going
+ * to take up a bit less than 4TB of virtual space.
+ *
+ * We are mapping these 4TB using 16M granule size which makes us end up
+ * with a bit less than 256k entries.
+ *
+ * Thus the common size of the needed vmemmap_table will be less than 1M.
+ */
+
+#define VMEMMAP_SIZE GRANULEROUNDUP((1UL << (MAX_PHYSMEM_BITS - PAGE_SHIFT)) \
+   * sizeof(struct 

[PATCH 1/4] Generic Virtual Memmap support for SPARSEMEM V3

2007-04-04 Thread Christoph Lameter
Sparse Virtual: Virtual Memmap support for SPARSEMEM V4

V1->V3
 - Add IA64 16M vmemmap size support (reduces TLB pressure)
 - Add function to test for eventual node/node vmemmap overlaps
 - Upper / Lower boundary fix.

V1->V2
 - Support for PAGE_SIZE vmemmap which allows the general use of
   virtual memmap on any MMU capable platform (enabled IA64
   support).
 - Fix various issues as suggested by Dave Hansen.
 - Add comments and error handling.

SPARSEMEM is a pretty nice framework that unifies quite a bit of
code over all the arches. It would be great if it could be the default
so that we can get rid of various forms of DISCONTIG and other variations
on memory maps. So far what has hindered this are the additional lookups
that SPARSEMEM introduces for virt_to_page and page_address. This goes
so far that the code to do this has to be kept in a separate function
and cannot be used inline.

This patch introduces virtual memmap support for sparsemem. virt_to_page
page_address and consorts become simple shift/add operations. No page flag
fields, no table lookups, nothing involving memory is required.

The two key operations pfn_to_page and page_to_page become:

#define pfn_to_page(pfn)   (vmemmap + (pfn))
#define page_to_pfn(page)  ((page) - vmemmap)

In order for this to work we will have to use a virtual mapping.
These usually come for free since kernel memory is already mapped
via a 1-1 mapping requiring a page table. The virtual mapping must
be big enough to span all of the memory that an arch can support, which
may make a virtual memmap difficult to use on 32 bit platforms
that support 36 address bits.

However, if there is enough virtual space available and the arch
already maps its 1-1 kernel space using TLBs (f.e. true of IA64
and x86_64) then this technique makes sparsemem lookups even more
efficient than CONFIG_FLATMEM. FLATMEM still needs to read the
contents of mem_map. mem_map is constant for a virtual memory map.

Maybe this patch will allow us to make SPARSEMEM the default
configuration that will work on UP, SMP and NUMA on most platforms?
Then we may hopefully be able to remove the various forms of support
for FLATMEM, DISCONTIG etc etc.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

Index: linux-2.6.21-rc5-mm4/include/asm-generic/memory_model.h
===
--- linux-2.6.21-rc5-mm4.orig/include/asm-generic/memory_model.h
2007-04-04 15:45:48.0 -0700
+++ linux-2.6.21-rc5-mm4/include/asm-generic/memory_model.h 2007-04-04 
15:45:52.0 -0700
@@ -46,6 +46,14 @@
 __pgdat->node_start_pfn;   \
 })
 
+#elif defined(CONFIG_SPARSE_VIRTUAL)
+
+/*
+ * We have a virtual memmap that makes lookups very simple
+ */
+#define __pfn_to_page(pfn) (vmemmap + (pfn))
+#define __page_to_pfn(page)((page) - vmemmap)
+
 #elif defined(CONFIG_SPARSEMEM)
 /*
  * Note: section's mem_map is encorded to reflect its start_pfn.
Index: linux-2.6.21-rc5-mm4/mm/sparse.c
===
--- linux-2.6.21-rc5-mm4.orig/mm/sparse.c   2007-04-04 15:45:48.0 
-0700
+++ linux-2.6.21-rc5-mm4/mm/sparse.c2007-04-04 15:48:11.0 -0700
@@ -9,6 +9,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 /*
  * Permanent SPARSEMEM data:
@@ -101,7 +103,7 @@ static inline int sparse_index_init(unsi
 
 /*
  * Although written for the SPARSEMEM_EXTREME case, this happens
- * to also work for the flat array case becase
+ * to also work for the flat array case because
  * NR_SECTION_ROOTS==NR_MEM_SECTIONS.
  */
 int __section_nr(struct mem_section* ms)
@@ -211,6 +213,253 @@ static int sparse_init_one_section(struc
return 1;
 }
 
+#ifdef CONFIG_SPARSE_VIRTUAL
+/*
+ * Virtual Memory Map support
+ *
+ * (C) 2007 sgi. Christoph Lameter <[EMAIL PROTECTED]>.
+ *
+ * Virtual memory maps allow VM primitives pfn_to_page, page_to_pfn,
+ * virt_to_page, page_address() etc that involve no memory accesses at all.
+ *
+ * However, virtual mappings need a page table and TLBs. Many Linux
+ * architectures already map their physical space using 1-1 mappings
+ * via TLBs. For those arches the virtual memory map is essentially
+ * for free if we use the same page size as the 1-1 mappings. In that
+ * case the overhead consists of a few additional pages that are
+ * allocated to create a view of memory for vmemmap.
+ *
+ * Special Kconfig settings:
+ *
+ * CONFIG_ARCH_POPULATES_VIRTUAL_MEMMAP
+ *
+ * The architecture has its own functions to populate the memory
+ * map and provides a vmemmap_populate function.
+ *
+ * CONFIG_ARCH_SUPPORTS_PMD_MAPPING
+ *
+ * If not set then PAGE_SIZE mappings are generated which
+ * require one PTE/TLB per PAGE_SIZE chunk of the virtual memory map.
+ *
+ * If set then PMD_SIZE mappings are generated which are much
+ * lighter on the TLB. On some platforms these generate
+ * the 
