Re: [PATCH] x86: fix PSE pagetable construction

2007-04-27 Thread Eric W. Biederman
Jeremy Fitzhardinge <[EMAIL PROTECTED]> writes:

> When constructing the initial pagetable in pagetable_init, make sure
> that non-PSE pmds are updated to PSE ones.  This fixes a bug in the
> paravirt pagetable init code, which otherwise tries to avoid overwrite
> existing mappings.
>
> This moves the definition of pmd_huge() out of the hugetlbfs files
> into pgtable.h.
>
> [ I know Eric would like to make larger changes to the way
>   pagetable init works, but this patch is the minimal fix to an
>   existing bug. ]

My preference would be for whoever had:
paravirt_ops-hooks-to-set-up-initial-pagetable.patch

queued to drop it until we can get a version that doesn't break early
page table setup.

Your partial fix still leaves the real page tables in a partially
incorrect state, and even if we removed your changes from the PSE
path we still can wind up not failing to set _PAGE_NX in the
appropriate places on 4K pages.

I have tried to be constructive and suggest how we can fix this
cleanly.

Short of that this is what I see needing to happen to fix the
above patches changes to arch/i386/mm/init.c.

Eric


...

Subject: [PATCH] i386: During page table initialization always set the leaf 
page table entries.

If we don't set the leaf page table entries it is quite possible that
we will inherit and incorrect page table entry from the initial boot
page table setup in head.S.  So we need to redo the effort here.

I don't know what to do about hypervisors like Xen that require
their page tables to be read only, as our identity mapped page
table entries currently violate that requirement.  All I know
if the kernel doesn't work properly on native hardware it is a bug.

Signed-off-by: Eric W. Biederman <[EMAIL PROTECTED]>
---
 arch/i386/mm/init.c |   52 +++---
 1 files changed, 20 insertions(+), 32 deletions(-)

diff --git a/arch/i386/mm/init.c b/arch/i386/mm/init.c
index b77a43c..dbe16f6 100644
--- a/arch/i386/mm/init.c
+++ b/arch/i386/mm/init.c
@@ -63,18 +63,18 @@ static pmd_t * __init one_md_table_init(pgd_t *pgd)
pmd_t *pmd_table;

 #ifdef CONFIG_X86_PAE
-   pmd_table = (pmd_t *) alloc_bootmem_low_pages(PAGE_SIZE);
-
-   paravirt_alloc_pd(__pa(pmd_table) >> PAGE_SHIFT);
-   set_pgd(pgd, __pgd(__pa(pmd_table) | _PAGE_PRESENT));
-   pud = pud_offset(pgd, 0);
-   if (pmd_table != pmd_offset(pud, 0)) 
-   BUG();
-#else
+   if (!(pgd_val(*pgd) & _PAGE_PRESENT)) {
+   pmd_table = (pmd_t *) alloc_bootmem_low_pages(PAGE_SIZE);
+
+   paravirt_alloc_pd(__pa(pmd_table) >> PAGE_SHIFT);
+   set_pgd(pgd, __pgd(__pa(pmd_table) | _PAGE_PRESENT));
+   pud = pud_offset(pgd, 0);
+   if (pmd_table != pmd_offset(pud, 0))
+   BUG();
+   }
+#endif
pud = pud_offset(pgd, 0);
pmd_table = pmd_offset(pud, 0);
-#endif
-
return pmd_table;
 }
 
@@ -84,7 +84,7 @@ static pmd_t * __init one_md_table_init(pgd_t *pgd)
  */
 static pte_t * __init one_page_table_init(pmd_t *pmd)
 {
-   if (pmd_none(*pmd)) {
+   if (!(pmd_val(*pmd) & _PAGE_PRESENT)) {
pte_t *page_table = (pte_t *) 
alloc_bootmem_low_pages(PAGE_SIZE);
 
paravirt_alloc_pt(__pa(page_table) >> PAGE_SHIFT);
@@ -109,7 +109,6 @@ static pte_t * __init one_page_table_init(pmd_t *pmd)
 static void __init page_table_range_init (unsigned long start, unsigned long 
end, pgd_t *pgd_base)
 {
pgd_t *pgd;
-   pud_t *pud;
pmd_t *pmd;
int pgd_idx, pmd_idx;
unsigned long vaddr;
@@ -120,13 +119,10 @@ static void __init page_table_range_init (unsigned long 
start, unsigned long end
pgd = pgd_base + pgd_idx;
 
for ( ; (pgd_idx < PTRS_PER_PGD) && (vaddr != end); pgd++, pgd_idx++) {
-   if (!(pgd_val(*pgd) & _PAGE_PRESENT))
-   one_md_table_init(pgd);
-   pud = pud_offset(pgd, vaddr);
-   pmd = pmd_offset(pud, vaddr);
+   pmd = one_md_table_init(pgd);
+   pmd = pmd + pmd_index(vaddr);
for (; (pmd_idx < PTRS_PER_PMD) && (vaddr != end); pmd++, 
pmd_idx++) {
-   if (pmd_none(*pmd)) 
-   one_page_table_init(pmd);
+   one_page_table_init(pmd);
 
vaddr += PMD_SIZE;
}
@@ -159,11 +155,7 @@ static void __init kernel_physical_mapping_init(pgd_t 
*pgd_base)
pfn = 0;
 
for (; pgd_idx < PTRS_PER_PGD; pgd++, pgd_idx++) {
-   if (!(pgd_val(*pgd) & _PAGE_PRESENT))
-   pmd = one_md_table_init(pgd);
-   else
-   pmd = pmd_offset(pud_offset(pgd, PAGE_OFFSET), 
PAGE_OFFSET);
-
+   pmd = one_md_table_init(pgd);
if (pfn >= max_low_pfn)
continue;
for (pmd_idx = 0; pmd_idx < PTRS_PER_PMD 

Re: checkpatch, a patch checking script.

2007-04-27 Thread Roland Dreier
 > Use WARN_ON & Recovery code rather than BUG() and BUG_ON()
 > 23286:+ BUILD_BUG_ON(BCM43xx_SEC_KEYSIZE < ETH_ALEN);

BTW, I missed this before -- BUILD_BUG_ON() is actually far better
than WARN_ON(), I think.

Maybe something like this?  (Although someone who knows perl probably
has a better way)

---
Don't tell people to change BUILD_BUG_ON() to WARN_ON().

Signed-off-by: Roland Dreier <[EMAIL PROTECTED]>

--- checkpatch.pl.orig  2007-04-27 20:30:34.0 -0700
+++ checkpatch.pl   2007-04-27 22:54:42.0 -0700
@@ -123,7 +123,7 @@
$warnings += search(qr/kernel_thread\(/, "Use kthread abstraction 
instead of kernel_thread()\n");
$warnings += search(qr/typedef/, "Do not add new typedefs.\n");
$warnings += search(qr/uint32_t/, "Incorrect type usage for kernel 
code. Use __u32 etc.\n");
-   $warnings += search(qr/BUG(_ON)\(/, "Use WARN_ON & Recovery code rather 
than BUG() and BUG_ON()\n");
+   $warnings += search(qr/(?http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: checkpatch, a patch checking script.

2007-04-27 Thread Roland Dreier
 > box:/usr/src/25> ~/checkpatch.pl patches/git-infiniband.patch 

Yup, I ran this too.

 > Checking patches/git-infiniband.patch:  signoffs = 113
 > Use WARN_ON & Recovery code rather than BUG() and BUG_ON()
 > 8143:+  BUG_ON(mlx4_ib_alloc_db_from_pgdir(pgdir, db, order));
 > 12629:+ BUG_ON(cmd->free_head < 0);
 > 16580:+ BUG_ON(index < dev->caps.num_mgms);
 > 16665:+ BUG_ON(amgm_index_to_free < dev->caps.num_mgms);
 > 16681:+ BUG_ON(index < dev->caps.num_mgms);

I agree -- killing the kernel for a driver bug is dump.  I'll remove
all these BUGs before merging.

 > Don't init statics to 0/NULL:
 > 10333:+ path->static_rate = 0;

This is a false positive/opportunity for script improvement, obviously.

 > 15461:+static int msi_x = 0;
 > 16113:+ static int mlx4_version_printed = 0;

Already zapped these.

 - R.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: - maps2-add-proc-pid-pagemap-interface-fix.patch removed from -mm tree

2007-04-27 Thread Andrew Morton
On Sat, 28 Apr 2007 06:13:39 +0100 (BST) Hugh Dickins <[EMAIL PROTECTED]> wrote:

> On Fri, 27 Apr 2007, Andrew Morton wrote:
> > 
> > hm, could do.  might_sleep() is intertwined with preempt in complex ways,
> > but we did decouple that at the config level.  no_mmap_sem() will dtrt for
> > all preempt settings.
> > 
> > But I'll be keeping this as a -mm-only debug patch (which brings us up to
> > about thirty of 'em), so I think it's best to make it unconfigurable so we
> > get maximum coverage.
> > 
> > That's if it actually works.  I haven't tried running it yet, and I have a
> > feeling that running it might cause a big "doh" moment.  We'll see.
> 
> Yes, I'm expecting the crucial
> 
> > +   WARN_ON(rwsem_is_locked(>mmap_sem))
> 
> to give a bogus warning every time another thread (or /proc,
> or swapoff, or whatever) happens to have this mmap_sem locked.
> might_sleep() is quite different, works on our thread's info.
> 

Yes.  lockdep has a way of working out if this task already has a
particular lock for reading or writing, but it isn't immediately obvious
how to extract that.

I guess a simple hack would be do do a down_read() on it.  If it's already
held for reading, lockdep should warn.  If it's already held for writing
someone will notice.

Oh well, it's not my top priority.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 00/33] 2.6.20-stable review

2007-04-27 Thread Greg KH
On Sat, Apr 28, 2007 at 12:21:24PM +0800, Bryan WU wrote:
> On Fri, 2007-04-27 at 08:13 -0700, Greg KH wrote:
> > On Fri, Apr 27, 2007 at 06:15:54PM +0800, Wu, Bryan wrote:
> > > 
> > > You know for some customer's product, they want to use the stable and
> > > long term support kernel instead to use the latest one. 
> > 
> > Then they should get that support from a vendor, not from the kernel.org
> > releases :)
> > 
> 
> Yeah, but we are the vendor as you mentioned. -:))

Ah, then you already know what to do :)

> If we wanna to release a kernel to customer product development, how to
> choose the stable version?

That's up to you.

> Currently, we always followed the kernel release cycle/rules and give
> customer the latest stable version.

Ok, then what has really changed here?  We've been doing this .y release
thing (also called -stable) for about 2 years now, nothing is different
this week from last.

Confused,

greg k-h
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Allow __vmalloc with GFP_ATOMIC

2007-04-27 Thread Giridhar Pemmasani
--- Nick Piggin <[EMAIL PROTECTED]> wrote:

>> The patch below uses bh disabled lock for vmlist_lock, so
>> that __vmalloc can be used in interrupt context.

> Hi Giri,
> 
> I'm sure I've read the reason for this one before, but when you do patches
> like these, can you include that reason in the changelog please?
> 
> Thanks,
> Nick

Sorry about that. There were too many mails on this subjet and thought it
might not be good to quote them. I am quoting here the ones that matter to
this discussion. If you need more (all), let me know:

http://www.ussg.iu.edu/hypermail/linux/kernel/0605.2/1608.html
http://www.ussg.iu.edu/hypermail/linux/kernel/0605.2/1611.html
http://www.ussg.iu.edu/hypermail/linux/kernel/0605.2/1656.html
http://www.ussg.iu.edu/hypermail/linux/kernel/0605.2/1779.html
http://www.ussg.iu.edu/hypermail/linux/kernel/0605.2/1669.html

Thanks,
Giri

__
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-27 Thread Mikulas Patocka



On Sat, 28 Apr 2007, Mikulas Patocka wrote:


On Fri, 27 Apr 2007, Bill Huey wrote:
Hi

SpadFS doesn't write to unallocated parts like log filesystems (LFS) or 
phase tree filesystems (TUX2);


--- BTW, I don't think that writing to unallocated parts of disk is good 
idea. These filesystems have cool write benchmarks, but one subtle (and 
unbenchmarkable) problem:
They group files according to time when they were created and not 
according to directory hierarchy.
When the user has directory with project files and he edited different 
files at different times, normal filesystems will place the files near 
each other (so that "grep blabla *" is fast) and log-structured 
filesystems will scatter the files over the whole disk.


Mikulas
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Problem 2.6.21

2007-04-27 Thread Len Brown
On Friday 27 April 2007 14:39, Riccardo Ricci wrote:
> 
> Hi to everyone,
> i've compiled kernel 2.6.21 on my debian PIII 650 / 256MB / Dell Latitude
> J650GT. With 2.6.20.8 all works very good, with 2.6.21 don't boot... While
> booting it stops after ACPI: PCI Interrupt Link [LNKB] (IRQs 3 4 *5 6 7 9
> 10 11 12 14 15).

Does 2.6.20.8 boot with acpi=off, does 2.6.21?

Any chance you can get the serial console log of the failure when booted with 
"debug"?
Also, the 2.6.20.8 dmesg is missing the beginning, try dmesg -s64000 --
though it will probably not be very interesting until 2.6.21 output is 
available to compare to it.

thanks,
-Len
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/17] Large Blocksize Support V3

2007-04-27 Thread Andrew Morton
On Fri, 27 Apr 2007 22:08:17 -0700 (PDT) Christoph Lameter <[EMAIL PROTECTED]> 
wrote:

> On Fri, 27 Apr 2007, Andrew Morton wrote:
> 
> > My (repeated) point is that if we populate pagecache with 
> > physically-contiguous 4k
> > pages in this manner then bio+block will be able to create much larger SG 
> > lists.
> 
> True but the "if" becomes exceedingly rare the longer the system was in 
> operation. 64k implies 16 pages in sequence. This is going to be a bit 
> difficult to get.

Nonsense.  We need higher-order allocations whichever scheme is used.

And lumpy reclaim in the moveable zone should be extremely reliable.  It
_should_ be the case that it can only be defeated by excessive use of
mlock.  But we've seen no testing to either confirm or refute that.

> Then there is the overhead of handling these pages. 
> Which may be not significant given growing processor capabilities in some 
> usage cases. In others like a synchronized application running on a large 
> number of nodes this is likely introduce random delays between processor 
> to processor communication that will significantly impair performance.

Well, who knows.

> And then there is the long list of features that cannot be accomplished 
> with such an approach like mounting a volume with large block size, 
> handling CD/DVDs, getting rid of various shim layers etc.

There are disadvantages against which this must be traded off.

And if the volume which is mounted with the large page option also has a
lot of small files on it, we've gone and dramatically deoptimised the
user's machine.  It would have been better to make the 4k-page
implementation faster, rather than working around existing inefficiencies.

> I'd also like to have much higher orders of allocations for scientific 
> applications that require an extremely large I/O rate. For those we 
> could f.e. dedicate memory nodes that will only use a very high page 
> order to prevent fragmentation. E.g. 1G pages is certainly something that 
> lots of our customers would find beneficial (and they are actually 
> already using those types of pages in the form of huge pages but with 
> limited capabilities).
> 
> But then we are sadly again trying to find another workaround that 
> will not get us there and will not allow the flexibility in the 
> VM that would make things much easier for lots of usage scenarios.

Your patch *is* a workaround.  It's a workaround for small CPU pagesize. 
It's a workaround for suboptimal VFS anf filesystem implementations.  It's
a workaround for a disk adapter which has suboptimal readahead and
writeback caching implementations.

See?  I can spin too.

Fact is, this change has *costs*.  And you're completely ignoring them,
trying to spin them away.  It ain't working and it never will.  I'm seeing
no serious attempt to think about how we can reduce those costs while
retaining most of the benefits.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Fw: [PATCH] ia64: race flushing icache in do_no_page path

2007-04-27 Thread Hugh Dickins
On Fri, 27 Apr 2007, Rohit Seth wrote:
> On Fri, 2007-04-27 at 15:18 +0100, Hugh Dickins wrote:
> 
> Right.  Extra flush_icache_page routines will add cost to archs that
> have non-null definition of this routine.  BTW, isn't flush_icache_page
> marked for deprecation?

Yes, flush_icache_page is marked for deprecation: but that's hardly
a reason to add another under a different name!  (Not quite what you
did, but...)

> lazy_mmu_prot_update was added specifically for notifying change in
> protection.  So, in a way it is closer to update_mmu_cache (Which is for
> change in mappings itself).  Though for ia64 implementation, this ends
> up flushing the icaches when needed.

The ia64 implementation is the only one which has any use for it, and
it's only interested when it's executable i.e. "lazy_mmu_prot_update"
is a name concealing some overdesign.

> Hopefully my reply is useful.

Yes, thanks Rohit, and I'll want to read through it again later.
In particular, I've now a better idea what's "lazy" about it.

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[-mm Patch]nbd: check the return value of sysfs_create_file

2007-04-27 Thread WANG Cong

Since 'sysfs_create_file' is declared with attribute warn_unused_result, we 
must always check its return value carefully.

Signed-off-by: WANG Cong <[EMAIL PROTECTED]>

---

--- linux-2.6.21-rc7-mm2/drivers/block/nbd.c.orig   2007-04-27 
17:27:47.0 +0800
+++ linux-2.6.21-rc7-mm2/drivers/block/nbd.c2007-04-27 17:47:32.0 
+0800
@@ -373,7 +373,10 @@ static void nbd_do_it(struct nbd_device 
BUG_ON(lo->magic != LO_MAGIC);
 
lo->pid = current->pid;
-   sysfs_create_file(>disk->kobj, _attr.attr);
+   if (sysfs_create_file(>disk->kobj, _attr.attr)) {
+   printk(KERN_ERR "nbd: sysfs_create_file failed!");
+   return;
+   }
 
while ((req = nbd_read_stat(lo)) != NULL)
nbd_end_request(req);
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: - maps2-add-proc-pid-pagemap-interface-fix.patch removed from -mm tree

2007-04-27 Thread Hugh Dickins
On Fri, 27 Apr 2007, Andrew Morton wrote:
> 
> hm, could do.  might_sleep() is intertwined with preempt in complex ways,
> but we did decouple that at the config level.  no_mmap_sem() will dtrt for
> all preempt settings.
> 
> But I'll be keeping this as a -mm-only debug patch (which brings us up to
> about thirty of 'em), so I think it's best to make it unconfigurable so we
> get maximum coverage.
> 
> That's if it actually works.  I haven't tried running it yet, and I have a
> feeling that running it might cause a big "doh" moment.  We'll see.

Yes, I'm expecting the crucial

> + WARN_ON(rwsem_is_locked(>mmap_sem))

to give a bogus warning every time another thread (or /proc,
or swapoff, or whatever) happens to have this mmap_sem locked.
might_sleep() is quite different, works on our thread's info.

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Fw: [PATCH] ia64: race flushing icache in do_no_page path

2007-04-27 Thread Hugh Dickins
On Sat, 28 Apr 2007, Nick Piggin wrote:
> 
> OIC, you need a virtual address to evict the icache, so you can't
> flush at flush_dcache time? Or does ia64 have an instruction to
> flush the whole icache? (it would be worth testing, to see how much
> performance suffers).

I'm puzzled by that remark: the ia64 flush_icache_range always has
a virtual address, it uses the kernel virtual address; it takes no
interest in whether there's a user virtual address.

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: checkpatch, a patch checking script.

2007-04-27 Thread Andrew Morton
On Fri, 27 Apr 2007 23:08:05 -0400 Dave Jones <[EMAIL PROTECTED]> wrote:

> You can find the script at http://www.codemonkey.org.uk/projects/checkpatch/

hm.

box:/usr/src/25> ~/checkpatch.pl patches/slub-core.patch
Checking patches/slub-core.patch:  signoffs = 30
Use WARN_ON & Recovery code rather than BUG() and BUG_ON()
1588:+  VM_BUG_ON(!irqs_disabled());
1834:+  BUG_ON(flags & ~(GFP_DMA | GFP_LEVEL_MASK));
2538:+  BUG_ON(kmalloc_caches->size < sizeof(struct kmem_cache_node));
2544:+  BUG_ON(!page);
2546:+  BUG_ON(!n);
2736:+  BUG_ON(err);
2762:+  BUG_ON(flags & SLUB_UNIMPLEMENTED);
2777:+  BUG_ON(flags & (SLAB_RED_ZONE | SLAB_POISON |
2779:+  BUG_ON(ctor || dtor);
3054:+  BUG_ON(index < 0);
3118:+  BUG_ON(!page);
3120:+  BUG_ON(!s);
4062:+  BUG_ON(!name);
4083:+  BUG_ON(p > name + ID_STR_LENGTH - 1);
4188:+  BUG_ON(err);

15 warnings


surely we can do better than that ;)


box:/usr/src/25> ~/checkpatch.pl patches/git-ieee1394.patch 
Checking patches/git-ieee1394.patch:  signoffs = 291
Do not add new typedefs.
5239:+typedef int (*descriptor_callback_t)(struct context *ctx,
7254:+typedef void (*scsi_done_fn_t) (struct scsi_cmnd *);
8668:+typedef void (*fw_node_callback_t) (struct fw_card * card,
10077:+typedef void (*fw_packet_callback_t) (struct fw_packet *packet,
10080:+typedef void (*fw_transaction_callback_t)(struct fw_card *card, int 
rcode,
10085:+typedef void (*fw_address_callback_t)(struct fw_card *card,
10093:+typedef void (*fw_bus_reset_callback_t)(struct fw_card *handle,
10245:+typedef void (*fw_iso_callback_t) (struct fw_iso_context *context,

Use WARN_ON & Recovery code rather than BUG() and BUG_ON()
4342:+  BUG_ON(j >= ARRAY_SIZE(group->attrs));
9868:+  BUG_ON(retval < 0);
9872:+  BUG_ON(retval < 0);
9876:+  BUG_ON(retval < 0);
9878:+  BUG_ON(retval < 0);
10952:+ BUG_ON(!kv || !associate || kv->key.id == CSR1212_KV_ID_DESCRIPTOR ||
10983:+ BUG_ON(!kv || !dir || dir->key.type != CSR1212_KV_TYPE_DIRECTORY);
11396:+ BUG_ON(!csr || !csr->ops || !csr->ops->allocate_addr_range ||
11750:+ BUG_ON(!csr);
11802:+ BUG_ON(csr->max_rom < 1);
12106:+ BUG_ON(!csr || !kv || csr->max_rom < 1);
12248:+ BUG_ON(!csr || !csr->ops || !csr->ops->bus_read);
14541:+ BUG_ON(max_payload < 512 - ETHER1394_GASP_OVERHEAD);
14567:+ BUG_ON(max_payload < 512 - ETHER1394_GASP_OVERHEAD);
15213:+ BUG_ON(!list_empty(>driver_list) ||

23 warnings

ok.

box:/usr/src/25> ~/checkpatch.pl patches/git-net.patch 
Checking patches/git-net.patch:  signoffs = 831
Do not add new typedefs.
18871:+typedef unsigned int sk_buff_data_t;
18873:+typedef unsigned char *sk_buff_data_t;
20686:+typedef int (*rtnl_doit_func)(struct sk_buff *, struct nlmsghdr *, void 
*);
20687:+typedef int (*rtnl_dumpit_func)(struct sk_buff *, struct 
netlink_callback *);

Incorrect type usage for kernel code. Use __u32 etc.
21854:+uint32_t __attribute__((weak)) __div64_32(uint64_t *n, uint32_t base)
21865:+ uint32_t high, d;

Use WARN_ON & Recovery code rather than BUG() and BUG_ON()
11084:+ BUG_ON(ip_hdr(skb)->protocol != IPPROTO_TCP);
21600:+ BUG_ON(!wiphy);
21633:+ BUG_ON(!wdev);
25577:+ BUG_ON(r->ctarget != NULL);
26832:+ BUG_ON(msgindex < 0 || msgindex >= RTM_NR_MSGTYPES);
26882:+ BUG_ON(protocol < 0 || protocol >= NPROTO);
26936:+ BUG_ON(protocol < 0 || protocol >= NPROTO);
26959:+ BUG_ON(protocol < 0 || protocol >= NPROTO);
27772:+ BUG_ON(len);
30411:+ BUG_ON(hctx->ccid3hctx_p && !hctx->ccid3hctx_x_calc);
30626:+ BUG_ON(hctx == NULL);
32199:+ BUG_ON(ptr == NULL);
32217:+ BUG_ON(ptr == NULL);
58250:+ BUILD_BUG_ON(sizeof(struct illinois) > ICSK_CA_PRIV_SIZE);
61747:+ BUG_ON(sizeof(struct yeah) > ICSK_CA_PRIV_SIZE);
63079:+ BUG_ON(pad < 0);
69953:+ BUG_ON(sk == NULL);
69962:+ BUG_ON(self == NULL);
70883:+ BUG_ON(destroy == NULL);
80348:+ BUG_ON(!wiphy);

Don't init statics to 0/NULL:
61061:+static int port __read_mostly = 0;
70417:+static int hashbin_lock_depth = 0;

28 warnings

Bad David.


git-ocfs2.patch: couple fo new typedefs, zillions of BUG_ONs

box:/usr/src/25> ~/checkpatch.pl patches/git-libata-all.patch 
Checking patches/git-libata-all.patch:  signoffs = 167
Do not add new typedefs.
14867:+typedef int (*ata_prereset_fn_t)(struct ata_port *ap, unsigned long 
deadline);
14868:+typedef int (*ata_reset_fn_t)(struct ata_port *ap, unsigned int *classes,

Use WARN_ON & Recovery code rather than BUG() and BUG_ON()
5426:+  BUG_ON(!legacy_dr);

Don't init statics to 0/NULL:
2861:+static int ata_ignore_hpa = 0;



box:/usr/src/25> ~/checkpatch.pl patches/git-ia64.patch  
Checking patches/git-ia64.patch:  signoffs = 38
Do not add new typedefs.
875:+typedef unsigned long u64;
876:+typedef unsigned int  u32;
878:+typedef union err_type_info_u {
890:+typedef union err_struct_info_u {
930:+typedef union err_data_buffer_u {
954:+typedef union capabilities_u {
1009:+typedef struct resources_s {
1443:+typedef struct {

box:/usr/src/25> 

Re: [00/17] Large Blocksize Support V3

2007-04-27 Thread Christoph Lameter
On Fri, 27 Apr 2007, Andrew Morton wrote:

> My (repeated) point is that if we populate pagecache with 
> physically-contiguous 4k
> pages in this manner then bio+block will be able to create much larger SG 
> lists.

True but the "if" becomes exceedingly rare the longer the system was in 
operation. 64k implies 16 pages in sequence. This is going to be a bit 
difficult to get. Then there is the overhead of handling these pages. 
Which may be not significant given growing processor capabilities in some 
usage cases. In others like a synchronized application running on a large 
number of nodes this is likely introduce random delays between processor 
to processor communication that will significantly impair performance.

And then there is the long list of features that cannot be accomplished 
with such an approach like mounting a volume with large block size, 
handling CD/DVDs, getting rid of various shim layers etc.

I'd also like to have much higher orders of allocations for scientific 
applications that require an extremely large I/O rate. For those we 
could f.e. dedicate memory nodes that will only use a very high page 
order to prevent fragmentation. E.g. 1G pages is certainly something that 
lots of our customers would find beneficial (and they are actually 
already using those types of pages in the form of huge pages but with 
limited capabilities).

But then we are sadly again trying to find another workaround that 
will not get us there and will not allow the flexibility in the 
VM that would make things much easier for lots of usage scenarios.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 20/38] Minor fault path optimization.

2007-04-27 Thread Paul Mackerras
Martin Schwidefsky writes:

> The minor fault path has grown a lot in terms of cycles. In particular
> the kprobes hook is very costly. Optimize the path to save a couple of
> cycles. If kprobes is enabled more than 300 cycles can be avoided if 
> kprobes_running() is false.

There's no good reason to use a notifier for page faults, since
there's only one external piece of code that wants to know about
them...

Regards,
Paul.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/17] Large Blocksize Support V3

2007-04-27 Thread Andrew Morton
On Sat, 28 Apr 2007 13:17:40 +1000 David Chinner <[EMAIL PROTECTED]> wrote:

> > Fix up your lameo HBA for reads.
> 
> Where did that come from? You spend 20 lines described the inefficiencies
> of the readahead in the page cache and it should be fixed but then you
> turn around and say fix the HBA? 

My (repeated) point is that if we populate pagecache with physically-contiguous 
4k
pages in this manner then bio+block will be able to create much larger SG lists.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: commit 45cd8d8e -- why?

2007-04-27 Thread Tejun Heo
Andrew Morton wrote:
> On Fri, 27 Apr 2007 19:50:19 -0700 Roland Dreier <[EMAIL PROTECTED]> wrote:
> 
>> The changelog says:
>>
>> fs/sysfs/bin.c: In function 'read':
>> fs/sysfs/bin.c:77: warning: format '%zd' expects type 'signed size_t', 
>> but argument 4 has type 'int'
>>
>> but the signature of the function read() is
>>
>> read(struct file * file, char __user * userbuf, size_t count, loff_t * 
>> off)
>>
>> and git blame seems to show it was always thus -- ie count was always size_t.
>>
>> And now on x86-64 and ia64 with gcc 4.1 at least, I get:
>>
>> fs/sysfs/bin.c: In function 'read':
>> fs/sysfs/bin.c:62: warning: format '%d' expects type 'int', but argument 
>> 4 has type 'size_t'
> 
> Some patches landed out of order.  In Greg's tree (with Tejun's patches)
> `count' is a local variable (not an incoming arg) of type `int'.
> 
> So this patch was against Tejun's stuff, not against mainline.
> 
> I'd have picked that up, but I went and assumed that it was a victim of the
> new dev_dbg() printk arg checking stuff.  Ho hum.

Ah.. I already have this fix merged in my patch series.  I'm currently
testing things, so please be patient a little bit more.  Thanks.

-- 
tejun
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] lguest simplification: don't pin guest trap handlers

2007-04-27 Thread Rusty Russell
We don't actually need the Guest handlers mapped to avoid double
fault, just the stack pages.  Thanks to Zach for confirming.

Signed-off-by: Rusty Russell <[EMAIL PROTECTED]>
---
 drivers/lguest/interrupts_and_traps.c |   26 +-
 drivers/lguest/lg.h   |2 +-
 drivers/lguest/page_tables.c  |6 +++---
 3 files changed, 5 insertions(+), 29 deletions(-)

===
--- a/drivers/lguest/interrupts_and_traps.c
+++ b/drivers/lguest/interrupts_and_traps.c
@@ -138,31 +138,12 @@ static int direct_trap(const struct lgue
return idt_type(trap->a, trap->b) == 0xF;
 }
 
-static void pin_stack_pages(struct lguest *lg)
+void pin_stack_pages(struct lguest *lg)
 {
unsigned int i;
 
for (i = 0; i < lg->stack_pages; i++)
pin_page(lg, lg->esp1 - i * PAGE_SIZE);
-}
-
-/* We need to ensure all the direct trap pages are mapped after we
- * clear shadow mappings. */
-void pin_trap_pages(struct lguest *lg)
-{
-   unsigned int i;
-   struct desc_struct *trap;
-
-   for (i = 0; i < FIRST_EXTERNAL_VECTOR; i++) {
-   trap = >idt[i];
-   if (direct_trap(lg, trap, i))
-   pin_page(lg, idt_address(trap->a, trap->b));
-   }
-
-   trap = >syscall_idt;
-   if (direct_trap(lg, trap, SYSCALL_VECTOR))
-   pin_page(lg, idt_address(trap->a, trap->b));
-   pin_stack_pages(lg);
 }
 
 void guest_set_stack(struct lguest *lg, u32 seg, u32 esp, unsigned int pages)
@@ -194,11 +175,6 @@ static void set_trap(struct lguest *lg, 
 
trap->a = ((__KERNEL_CS|GUEST_PL)<<16) | (lo&0x);
trap->b = (hi&0xEF00);
-
-   /* Make sure trap address is available so we don't fault.  In
-* theory, it could overlap two pages, in practice it's aligned. */
-   if (direct_trap(lg, trap, num))
-   pin_page(lg, idt_address(lo, hi));
 }
 
 void load_guest_idt_entry(struct lguest *lg, unsigned int num, u32 lo, u32 hi)
===
--- a/drivers/lguest/lg.h
+++ b/drivers/lguest/lg.h
@@ -190,7 +190,7 @@ int deliver_trap(struct lguest *lg, unsi
 int deliver_trap(struct lguest *lg, unsigned int num);
 void load_guest_idt_entry(struct lguest *lg, unsigned int i, u32 low, u32 hi);
 void guest_set_stack(struct lguest *lg, u32 seg, u32 esp, unsigned int pages);
-void pin_trap_pages(struct lguest *lg);
+void pin_stack_pages(struct lguest *lg);
 void setup_default_idt_entries(struct lguest_ro_state *state,
   const unsigned long *def);
 void copy_traps(const struct lguest *lg, struct desc_struct *idt,
===
--- a/drivers/lguest/page_tables.c
+++ b/drivers/lguest/page_tables.c
@@ -186,7 +186,7 @@ void pin_page(struct lguest *lg, unsigne
 void pin_page(struct lguest *lg, unsigned long vaddr)
 {
if (!page_writable(lg, vaddr) && !demand_page(lg, vaddr, 0))
-   kill_guest(lg, "bad trap page %#lx", vaddr);
+   kill_guest(lg, "bad stack page %#lx", vaddr);
 }
 
 static void release_pgd(struct lguest *lg, spgd_t *spgd)
@@ -253,7 +253,7 @@ void guest_new_pagetable(struct lguest *
newpgdir = new_pgdir(lg, pgtable, );
lg->pgdidx = newpgdir;
if (repin)
-   pin_trap_pages(lg);
+   pin_stack_pages(lg);
 }
 
 static void release_all_pagetables(struct lguest *lg)
@@ -269,7 +269,7 @@ void guest_pagetable_clear_all(struct lg
 void guest_pagetable_clear_all(struct lguest *lg)
 {
release_all_pagetables(lg);
-   pin_trap_pages(lg);
+   pin_stack_pages(lg);
 }
 
 static void do_set_pte(struct lguest *lg, int idx,


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: commit 45cd8d8e -- why?

2007-04-27 Thread Andrew Morton
On Fri, 27 Apr 2007 19:50:19 -0700 Roland Dreier <[EMAIL PROTECTED]> wrote:

> The changelog says:
> 
> fs/sysfs/bin.c: In function 'read':
> fs/sysfs/bin.c:77: warning: format '%zd' expects type 'signed size_t', 
> but argument 4 has type 'int'
> 
> but the signature of the function read() is
> 
> read(struct file * file, char __user * userbuf, size_t count, loff_t * 
> off)
> 
> and git blame seems to show it was always thus -- ie count was always size_t.
> 
> And now on x86-64 and ia64 with gcc 4.1 at least, I get:
> 
> fs/sysfs/bin.c: In function 'read':
> fs/sysfs/bin.c:62: warning: format '%d' expects type 'int', but argument 
> 4 has type 'size_t'

Some patches landed out of order.  In Greg's tree (with Tejun's patches)
`count' is a local variable (not an incoming arg) of type `int'.

So this patch was against Tejun's stuff, not against mainline.

I'd have picked that up, but I went and assumed that it was a victim of the
new dev_dbg() printk arg checking stuff.  Ho hum.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] MM: implement MADV_FREE lazy freeing of anonymous memory

2007-04-27 Thread Rik van Riel

With lazy freeing of anonymous pages through MADV_FREE, performance of
the MySQL sysbench workload more than doubles on my quad-core system.

Madvise with MADV_FREE is used by applications to tell the kernel that
memory no longer contains useful data and can be reclaimed by the
kernel if it is needed elsewhere.  However, if the application puts
new data in the page (dirty bit gets set by hardware), the kernel
will not throw away the data.

This makes applications that free() and then later on malloc() the
same data again run a lot faster, since page faults are avoided.
In low memory situations, the kernel still knows which pages to
reclaim.

"Doing it all in userspace" is not a good solution for this problem,
because if the system needs the memory it is way cheaper to just throw
away these freed pages than to do the disk IO of swapping them out and
back in.

Signed-off-by: Rik van Riel <[EMAIL PROTECTED]>
--- linux-2.6.21.noarch/mm/rmap.c.madv_free	2007-04-25 23:08:32.0 -0400
+++ linux-2.6.21.noarch/mm/rmap.c	2007-04-27 16:03:22.0 -0400
@@ -656,7 +656,17 @@ static int try_to_unmap_one(struct page 
 	/* Update high watermark before we lower rss */
 	update_hiwater_rss(mm);
 
-	if (PageAnon(page)) {
+	/* MADV_FREE is used to lazily free memory from userspace. */
+	if (PageLazyFree(page) && !migration) {
+		if (unlikely(pte_dirty(pteval))) {
+			/* There is new data in the page.  Reinstate it. */
+			set_pte_at(mm, address, pte, pteval);
+			ret = SWAP_FAIL;
+			goto out_unmap;
+		}
+		/* Throw the page away. */
+		dec_mm_counter(mm, anon_rss);
+	} else if (PageAnon(page)) {
 		swp_entry_t entry = { .val = page_private(page) };
 
 		if (PageSwapCache(page)) {
--- linux-2.6.21.noarch/mm/page_alloc.c.madv_free	2007-04-27 16:03:22.0 -0400
+++ linux-2.6.21.noarch/mm/page_alloc.c	2007-04-27 16:03:22.0 -0400
@@ -203,6 +203,7 @@ static void bad_page(struct page *page)
 			1 << PG_slab|
 			1 << PG_swapcache |
 			1 << PG_writeback |
+			1 << PG_lazyfree |
 			1 << PG_buddy );
 	set_page_count(page, 0);
 	reset_page_mapcount(page);
@@ -442,6 +443,8 @@ static inline int free_pages_check(struc
 		bad_page(page);
 	if (PageDirty(page))
 		__ClearPageDirty(page);
+	if (PageLazyFree(page))
+		__ClearPageLazyFree(page);
 	/*
 	 * For now, we report if PG_reserved was found set, but do not
 	 * clear it, and do not free the page.  But we shall soon need
@@ -588,6 +591,7 @@ static int prep_new_page(struct page *pa
 			1 << PG_swapcache |
 			1 << PG_writeback |
 			1 << PG_reserved |
+			1 << PG_lazyfree |
 			1 << PG_buddy 
 		bad_page(page);
 
--- linux-2.6.21.noarch/mm/memory.c.madv_free	2007-04-25 23:08:32.0 -0400
+++ linux-2.6.21.noarch/mm/memory.c	2007-04-27 21:12:57.0 -0400
@@ -432,6 +432,7 @@ copy_one_pte(struct mm_struct *dst_mm, s
 	unsigned long vm_flags = vma->vm_flags;
 	pte_t pte = *src_pte;
 	struct page *page;
+	int dirty = 0;
 
 	/* pte contains position in swap or file, so copy. */
 	if (unlikely(!pte_present(pte))) {
@@ -466,6 +467,7 @@ copy_one_pte(struct mm_struct *dst_mm, s
 	 * in the parent and the child
 	 */
 	if (is_cow_mapping(vm_flags)) {
+		dirty = pte_dirty(pte);
 		ptep_set_wrprotect(src_mm, addr, src_pte);
 		pte = pte_wrprotect(pte);
 	}
@@ -483,6 +485,8 @@ copy_one_pte(struct mm_struct *dst_mm, s
 		get_page(page);
 		page_dup_rmap(page);
 		rss[!!PageAnon(page)]++;
+		if (dirty && PageLazyFree(page))
+			ClearPageLazyFree(page);
 	}
 
 out_set_pte:
@@ -661,6 +665,28 @@ static unsigned long zap_pte_range(struc
 (page->index < details->first_index ||
  page->index > details->last_index))
 	continue;
+
+/*
+ * MADV_FREE is used to lazily recycle
+ * anon memory.  The process no longer
+ * needs the data and wants to avoid IO.
+ */
+if (details->madv_free && PageAnon(page)) {
+	if (unlikely(PageSwapCache(page)) &&
+	!TestSetPageLocked(page)) {
+		remove_exclusive_swap_page(page);
+		unlock_page(page);
+	}
+	ptep_test_and_clear_dirty(vma, addr, pte);
+	ptep_test_and_clear_young(vma, addr, pte);
+	SetPageLazyFree(page);
+	if (PageActive(page))
+		deactivate_tail_page(page);
+	/* tlb_remove_page frees it again */
+	get_page(page);
+	tlb_remove_page(tlb, page);
+	continue;
+}
 			}
 			ptent = ptep_get_and_clear_full(mm, addr, pte,
 			tlb->fullmm);
@@ -689,7 +715,8 @@ static unsigned long zap_pte_range(struc
 		 * If details->check_mapping, we leave swap entries;
 		 * if details->nonlinear_vma, we leave file entries.
 		 */
-		if (unlikely(details))
+		if (unlikely(details && (details->check_mapping ||
+details->nonlinear_vma)))
 			continue;
 		if (!pte_file(ptent))
 			free_swap_and_cache(pte_to_swp_entry(ptent));
@@ -755,7 +782,8 @@ static unsigned long unmap_page_range(st
 	pgd_t *pgd;
 	unsigned long next;
 
-	if (details && !details->check_mapping && !details->nonlinear_vma)
+	if (details && !details->check_mapping && 

Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-27 Thread Mike Galbraith
On Fri, 2007-04-27 at 08:18 -0700, Linus Torvalds wrote:

> Actually, you don't need to apply the patch - just do
> 
>   echo 5 > /proc/sys/vm/dirty_background_ratio
>   echo 10 > /proc/sys/vm/dirty_ratio

That seems to have done the trick.  Amarok and GUI aren't exactly speed
demons while writeout is happening, but they are not hanging for
eternities.

-Mike

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 00/33] 2.6.20-stable review

2007-04-27 Thread Bryan WU
On Fri, 2007-04-27 at 08:13 -0700, Greg KH wrote:
> On Fri, Apr 27, 2007 at 06:15:54PM +0800, Wu, Bryan wrote:
> > 
> > You know for some customer's product, they want to use the stable and
> > long term support kernel instead to use the latest one. 
> 
> Then they should get that support from a vendor, not from the kernel.org
> releases :)
> 

Yeah, but we are the vendor as you mentioned. -:))

If we wanna to release a kernel to customer product development, how to
choose the stable version? Currently, we always followed the kernel
release cycle/rules and give customer the latest stable version.

Thank you Greg
-Bryan

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Allow __vmalloc with GFP_ATOMIC

2007-04-27 Thread Nick Piggin

Giridhar Pemmasani wrote:

Until 2.6.19, __vmalloc with GFP_ATOMIC was possible, but __get_vm_area_node
would allocate the node itself with GFP_KERNEL, causing a warning. In 2.6.19,
this was "fixed" by using the same flags that were passed to __vmalloc also
in __get_vm_area_node. However, __get_vm_area_node does
BUG_ON(in_interrupt()) now, since vmlist_lock is obtained without disabling
bottom-half's. The patch below uses bh disabled lock for vmlist_lock, so that
__vmalloc can be used in interrupt context.

In 2.6.21, __vmalloc with GFP_ATOMIC is used by arch/um/kernel/process.c;
__vmalloc is also used in ntfs, xfs, but it is not clear to me if they use it
with GFP_ATOMIC or GFP_KERNEL.

Thanks,
Giri


Hi Giri,

I'm sure I've read the reason for this one before, but when you do patches
like these, can you include that reason in the changelog please?

Thanks,
Nick

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Fw: [PATCH] ia64: race flushing icache in do_no_page path

2007-04-27 Thread Nick Piggin

Nick Piggin wrote:

Rohit Seth wrote:



You mean by user space? If so, then it is user space responsibility to
do the appropriate operations (like flush icache in this case).



No, I mean places that set PG_arch_1. flush_dcache_page. This can
happen for mapped pages in write, splice, install_arg_page looks
questionable, direct IO...


Oh, and also ptrace! I think I was almost fooled by that attempt to flush
the cache in copy_to_user_page.

But that also fails if you map the underlying page with multiple virtual
addresses (or processes, if the icache is not flushed on ctxsw), because
those others won't have their caches flushed, right?

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] Allow __vmalloc with GFP_ATOMIC

2007-04-27 Thread Giridhar Pemmasani
Until 2.6.19, __vmalloc with GFP_ATOMIC was possible, but __get_vm_area_node
would allocate the node itself with GFP_KERNEL, causing a warning. In 2.6.19,
this was "fixed" by using the same flags that were passed to __vmalloc also
in __get_vm_area_node. However, __get_vm_area_node does
BUG_ON(in_interrupt()) now, since vmlist_lock is obtained without disabling
bottom-half's. The patch below uses bh disabled lock for vmlist_lock, so that
__vmalloc can be used in interrupt context.

In 2.6.21, __vmalloc with GFP_ATOMIC is used by arch/um/kernel/process.c;
__vmalloc is also used in ntfs, xfs, but it is not clear to me if they use it
with GFP_ATOMIC or GFP_KERNEL.

Thanks,
Giri

Signed-off-by: Giridhar Pemmasani <[EMAIL PROTECTED]>
---
--- linux-2.6.21.orig/./arch/arm/mm/ioremap.c   2007-04-25 23:08:32.0
-0400
+++ linux-2.6.21.new/./arch/arm/mm/ioremap.c2007-04-27 23:29:27.0
-0400
@@ -363,7 +363,7 @@
 * all the mappings before the area can be reclaimed
 * by someone else.
 */
-   write_lock(_lock);
+   write_lock_bh(_lock);
for (p =  ; (tmp = *p) ; p = >next) {
if((tmp->flags & VM_IOREMAP) && (tmp->addr == addr)) {
if (tmp->flags & VM_ARM_SECTION_MAPPING) {
@@ -376,7 +376,7 @@
break;
}
}
-   write_unlock(_lock);
+   write_unlock_bh(_lock);
 #endif
 
if (!section_mapping)
--- linux-2.6.21.orig/./arch/i386/mm/ioremap.c  2007-04-25 23:08:32.0
-0400
+++ linux-2.6.21.new/./arch/i386/mm/ioremap.c   2007-04-27 23:29:27.0
-0400
@@ -180,12 +180,12 @@
   in parallel. Reuse of the virtual address is prevented by
   leaving it in the global lists until we're done with it.
   cpa takes care of the direct mappings. */
-   read_lock(_lock);
+   read_lock_bh(_lock);
for (p = vmlist; p; p = p->next) {
if (p->addr == addr)
break;
}
-   read_unlock(_lock);
+   read_unlock_bh(_lock);
 
if (!p) {
printk("iounmap: bad address %p\n", addr);
--- linux-2.6.21.orig/./arch/x86_64/mm/ioremap.c2007-04-25
23:08:32.0 -0400
+++ linux-2.6.21.new/./arch/x86_64/mm/ioremap.c 2007-04-27 23:29:27.0
-0400
@@ -175,12 +175,12 @@
   in parallel. Reuse of the virtual address is prevented by
   leaving it in the global lists until we're done with it.
   cpa takes care of the direct mappings. */
-   read_lock(_lock);
+   read_lock_bh(_lock);
for (p = vmlist; p; p = p->next) {
if (p->addr == addr)
break;
}
-   read_unlock(_lock);
+   read_unlock_bh(_lock);
 
if (!p) {
printk("iounmap: bad address %p\n", addr);
--- linux-2.6.21.orig/./fs/proc/kcore.c 2007-04-25 23:08:32.0 -0400
+++ linux-2.6.21.new/./fs/proc/kcore.c  2007-04-27 23:29:27.0 -0400
@@ -335,7 +335,7 @@
if (!elf_buf)
return -ENOMEM;
 
-   read_lock(_lock);
+   read_lock_bh(_lock);
for (m=vmlist; m && cursize; m=m->next) {
unsigned long vmstart;
unsigned long vmsize;
@@ -363,7 +363,7 @@
memcpy(elf_buf + (vmstart - start),
(char *)vmstart, vmsize);
}
-   read_unlock(_lock);
+   read_unlock_bh(_lock);
if (copy_to_user(buffer, elf_buf, tsz)) {
kfree(elf_buf);
return -EFAULT;
--- linux-2.6.21.orig/./fs/proc/mmu.c   2007-04-25 23:08:32.0 -0400
+++ linux-2.6.21.new/./fs/proc/mmu.c2007-04-27 23:29:41.0 -0400
@@ -47,7 +47,7 @@
 
prev_end = VMALLOC_START;
 
-   read_lock(_lock);
+   read_lock_bh(_lock);
 
for (vma = vmlist; vma; vma = vma->next) {
unsigned long addr = (unsigned long) vma->addr;
@@ -72,6 +72,6 @@
if (VMALLOC_END - prev_end > vmi->largest_chunk)
vmi->largest_chunk = VMALLOC_END - prev_end;
 
-   read_unlock(_lock);
+   read_unlock_bh(_lock);
}
 }
--- linux-2.6.21.orig/./mm/vmalloc.c2007-04-25 23:08:32.0 -0400
+++ linux-2.6.21.new/./mm/vmalloc.c 2007-04-27 23:33:17.0 -0400
@@ -168,7 +168,7 @@
unsigned long align = 1;
unsigned long addr;
 
-   BUG_ON(in_interrupt());
+   BUG_ON(in_irq());
if (flags & VM_IOREMAP) {
int bit = fls(size);
 
@@ -193,7 +193,7 @@
 */
size += PAGE_SIZE;
 
-   write_lock(_lock);
+   write_lock_bh(_lock);
for (p =  (tmp = *p) != NULL ;p = >next) 

Re: What's in infiniband.git for 2.6.22

2007-04-27 Thread Roland Dreier
 > What about the mthca patch to use separate HW queues for kernel 
 > RC/UD/userspace RC?

right, I'll queue that up too.
BTW is there something analogous we could do for mlx4, or is FW not
quite ready?

 - R.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[git pull] DRM patches for 2.6.22-rc1

2007-04-27 Thread Dave Airlie


Hi Linus,

Please pull the 'drm-patches' branch of
git://master.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6.git drm-patches

This contains the drm patch for 2.6.22-rc1, and contains a number of fixes
in the mmap code and the locking for AIGLX systems along with new hw support
for i965GM.

Dave.

 drivers/char/drm/README.drm|   16 +++--
 drivers/char/drm/drm.h |4 +-
 drivers/char/drm/drmP.h|   23 +--
 drivers/char/drm/drm_bufs.c|   75 +++
 drivers/char/drm/drm_drv.c |9 +--
 drivers/char/drm/drm_fops.c|   96 ++--
 drivers/char/drm/drm_hashtab.c |   17 +-
 drivers/char/drm/drm_hashtab.h |1 -
 drivers/char/drm/drm_irq.c |4 +-
 drivers/char/drm/drm_lock.c|  134 ---
 drivers/char/drm/drm_mm.c  |2 +
 drivers/char/drm/drm_pciids.h  |3 +-
 drivers/char/drm/drm_proc.c|2 +-
 drivers/char/drm/drm_stub.c|1 -
 drivers/char/drm/drm_vm.c  |  102 ---
 drivers/char/drm/i915_dma.c|3 +-
 drivers/char/drm/radeon_cp.c   |8 +-
 drivers/char/drm/sis_drv.c |2 +-
 drivers/char/drm/via_drv.c |3 +-
 drivers/char/drm/via_mm.h  |   40 
 20 files changed, 196 insertions(+), 349 deletions(-)

commit ce7dd06372058f9e3e57ee4c0aeba694a43a80ad
Author: Wang Zhenyu <[EMAIL PROTECTED]>
Date:   Thu Apr 26 07:42:56 2007 +1000

drm/i915: Add 965GM pci id update

Signed-off-by: Dave Airlie <[EMAIL PROTECTED]>

commit 9e9c1326a592c677c94d730fcf4446d0e275aef4
Author: Dave Airlie <[EMAIL PROTECTED]>
Date:   Sat Mar 24 17:57:54 2007 +1100

drm: just use io_remap_pfn_range on all archs..

Move the sparc64 ifdef around to clean this up.

Signed-off-by: Dave Airlie <[EMAIL PROTECTED]>

commit 38315878a560eede1a2db52e511ad3a2cfbb4206
Author: Hugh Dickins <[EMAIL PROTECTED]>
Date:   Sat Mar 24 17:55:16 2007 +1100

drm: fix DRM_CONSISTENT mapping

This patch got lost in the DRM git tree for ages, bring it back to life.

Signed-off-by: Dave Airlie <[EMAIL PROTECTED]>

commit d7d8aac79dc38cbdef83b774e49bafdae9918137
Author: Thomas Hellstrom 
Date:   Sat Mar 24 17:52:49 2007 +1100

drm: fix up mmap locking in preparation for ttm changes

This change is needed to protect againt disappearing maps which aren't 
common.
The map lists are protected using sturct_mutex but drm_mmap never locked it.

Signed-off-by: Dave Airlie <[EMAIL PROTECTED]>

commit 040ac32048d5efabd557c1e0a6ab8aec2c710c56
Author: Thomas Hellstrom 
Date:   Fri Mar 23 13:28:33 2007 +1100

drm: fix driver deadlock with AIGLX and reclaim_buffers_locked

Bugzilla Bug #9457

Add refcounting of user waiters to the DRM hardware lock, so that we can use
DRM_LOCK_CONT flag more conservatively.

Also add a kernel waiter refcount that if nonzero transfers the lock for the
kernel context when it is released. This is useful when waiting for idle 
and can be used for very simple fence object driver implementations for the new 
memory manager

Signed-off-by: Dave Airlie <[EMAIL PROTECTED]>

commit 4b560fde06aeb342f3ff0bce924627ab722d251a
Author: Andrew Morton <[EMAIL PROTECTED]>
Date:   Mon Mar 19 09:08:21 2007 +1100

drm: fix warning in drm_fops.c

drivers/char/drm/drm_fops.c: In function 'drm_setup':
drivers/char/drm/drm_fops.c:60: warning: comparison of distinct pointer 
types lacks a cast

Unfortunately PAGE_SIZE has different types on different architectures.

Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
Signed-off-by: Dave Airlie <[EMAIL PROTECTED]>

commit 99da6d861c659bb1a961b70f50fad268b9ed5a5f
Author: Thomas Hellstrom 
Date:   Mon Mar 19 08:52:17 2007 +1100

drm: allow for more generic drm ioctls

Signed-off-by: Dave Airlie <[EMAIL PROTECTED]>

commit 6244270ef62203e057191bf85489e2ff91cc0e60
Author: Jay Estabrook <[EMAIL PROTECTED]>
Date:   Sun Mar 11 11:46:27 2007 +1100

drm: fix alpha domain handling

Signed-off-by: Dave Airlie <[EMAIL PROTECTED]>

commit 74be8e3b3707956f8f232313de9fad896d5489ac
Author: Thomas Hellstrom 
Date:   Sun Mar 11 11:45:24 2007 +1100

via: fix CX700 pci id

Signed-off-by: Dave Airlie <[EMAIL PROTECTED]>

commit 0bead7cdc94b4897f3d92db6170737a2da527134
Author: Adrian Bunk <[EMAIL PROTECTED]>
Date:   Sun Mar 11 11:41:16 2007 +1100

drm: make drm_io_prot static.

This patch makes the needlessly global drm_io_prot() static.

Signed-off-by: Adrian Bunk <[EMAIL PROTECTED]>
Signed-off-by: Dave Airlie <[EMAIL PROTECTED]>

commit 5379397182a7b5fa1c68ceaefe311ce4c1d04b2a
Author: Robert P. J. Day <[EMAIL PROTECTED]>
Date:   Sun Mar 11 11:39:31 2007 +1100

drm: remove via_mm.h

Delete apparently unused header file drivers/char/drm/via_mm.h.

Signed-off-by: Robert P. J. Day <[EMAIL PROTECTED]>
Signed-off-by: Dave Airlie <[EMAIL PROTECTED]>

commit 

Re: [00/17] Large Blocksize Support V3

2007-04-27 Thread Christoph Lameter
On Sat, 28 Apr 2007, David Chinner wrote:

> > 1-disk and 2-disk read throughput fell by an improbable amount, which makes
> > me cautious about the other numbers.
> 
> For read, yes, and it's because something is going wrong with the
> I/O size - it looks like readahead thrashing of some kind even
> with 4k pages tests.

Yup. I seem to have a problem in that area with my patches. Somehow the
nr of page is shifted by page order. I do not completely understand what 
is going on there yet.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: checkpatch, a patch checking script.

2007-04-27 Thread Adrian Bunk
On Fri, Apr 27, 2007 at 08:36:17PM -0700, Roland Dreier wrote:
>...
> Also, it would be nice to be able to do something like
> 
> git diff v2.6.20..|perl ~/checkpatch.pl -
>...

  perl ~/checkpatch.pl <(git diff v2.6.20..)

>  - R.

cu
Adrian

-- 

   "Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
   "Only a promise," Lao Er said.
   Pearl S. Buck - Dragon Seed

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Kernel oops with 2.6.21 while using cdda2wav & cooked_ioctl (x64-64)

2007-04-27 Thread Alexander E. Patrakov

Ross Alexander wrote:

Modules linked in: nvidia(P)
Tainted:P


With this, nobody will even look at your report. Please retry without 
proprietary modules.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: checkpatch, a patch checking script.

2007-04-27 Thread Roland Dreier
 > http://www.codemonkey.org.uk/projects/checkpatch/example.log shows
 > what fell out of running it on my mbox of lkml from the past month.
 > Some of them are kinda noisy, and perhaps should be moved under --pedantic
 > 
 > I'm all ears for additional regexps, bug reports or other suggestions.

Looks great... however I notice a few obvious false positives in the
example log:

 > Don't init statics to 0/NULL:
 > 94312:+static const struct in6_addr in6addr_v4mapped = { { { [10] = 0xff, 
 > [11] = 0xff } } };

ummm?

 > 137054:+static uint32_t drvr_ver  = 0x02200207;

that ain't zero...

 > 230079:+path->static_rate = 0;

and that ain't a static variable.

I guess trying to parse C in a regexp is a little tricky.

Also, it would be nice to be able to do something like

git diff v2.6.20..|perl ~/checkpatch.pl -

rather than having to create a temp file -- as it stands that command
produces

unknown option: -
usage: findbugs.pl [-options] file(s)
  -allsource : check entire source file, not just '+' patch lines
  -pedantic : TBD
  -style : TBD
  -v, --verbose : verbose
  -h, --help : this help text
Version: 002

And even worse

git diff v2.6.20..|perl ~/checkpatch.pl

just silently does nothing (maybe a "no input files" warning would be
a good clue for people).

 - R.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Bugme-new] [Bug 8378] New: Averatec 3156X laptop doesn't reboot with kernels > 2.6.13.5 (responsible commit found)

2007-04-27 Thread Truxton Fulton
Andrew Morton wrote (at Fri, 27 Apr 2007 14:44:34 -0700) :
> 
> 
> On Fri, 27 Apr 2007 10:42:25 -0700
> [EMAIL PROTECTED] wrote:
> 
>> http://bugzilla.kernel.org/show_bug.cgi?id=8378
>> 
>>Summary: Averatec 3156X laptop doesn't reboot with kernels >
>> 2.6.13.5 (responsible commit found)
>> Kernel Version: 2.6.14 till 2.6.21
>> Status: NEW
>>   Severity: normal
>>  Owner: [EMAIL PROTECTED]
>>  Submitter: [EMAIL PROTECTED]
>> 
>> 
>> Most recent kernel where this bug did *NOT* occur: 2.6.13.5
>> 
>> Distribution: Debian
>> Hardware Environment: Averatec 3156X (seemingly identical to the american 
>> model
>> 3150P)
>> Software Environment:?
>> Problem Description:
>> I noticed that with recent kernels my laptop would reboot when I do an 'init 
>> 6',
>> but hang at the end of the init run. The last working vanilla kernel is
>> 2.6.13.5. With some trying and a bit of guessing I found a change to
>> include/asm-i386/mach-default/mach_reboot.h in 2.6.14 to be the culprit. It 
>> can
>> be found at:
>> http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.14.y.git;a=commitdiff;h=59f4e7d572980a521b7bdba74ab71b21f5995538
>> 
>> On a 2.6.21 source tree I can revert this patch, and then rebooting works.
>> 
>> Steps to reproduce:
>> 1) On a Averatec 3156X (or 3150p?) boot to your default runlevel.
>> 2) as root, type "init 6".
>> 3) instead of rebooting, the system will hang at the end with a blank screen.
>> 
> 
> Oh dear.  We have an ugly i386 snafu here.  Thanks for doing the bisection
> - it helps enormously.
> 
> Could some brave person please pick it up and see if we can get both
> Truxton and Lee's machines working?

Hi,

I verified on my IDEQ210M that performing the old reboot sequence
followed by the new reboot sequence works for me, and I suspect that
it will work for Lee also.  Like this :

/* old method, works on most machines */
for (i = 0; i < 100; i++) {
kb_wait();
udelay(50);
outb(0xfe, 0x64); /* pulse reset low */
udelay(50);
}

/* new method, sets the "System flag" which when set,
   indicates successful completion of the keyboard controller
   self-test (Basic Assurance Test, BAT).  This is needed
   for some machines with no keyboard plugged in */
for (i = 0; i < 100; i++) {
kb_wait();
udelay(50);
outb(0x60, 0x64); /* write Controller Command Byte */
udelay(50);
kb_wait();
udelay(50);
outb(0x14, 0x60); /* set "System flag" */
udelay(50);
kb_wait();
udelay(50);
outb(0xfe, 0x64); /* pulse reset low */
udelay(50);
}

Thanks,

-Truxton
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/17] Large Blocksize Support V3

2007-04-27 Thread David Chinner
On Fri, Apr 27, 2007 at 12:11:08PM -0700, Andrew Morton wrote:
> On Sat, 28 Apr 2007 03:34:32 +1000 David Chinner <[EMAIL PROTECTED]> wrote:
> 
> > Some more information - stripe unit on the dm raid0 is 512k.
> > I have not attempted to increase I/O sizes at all yet - these test are
> > just demonstrating efficiency improvements in the filesystem.
> > 
> > These numbers for 32GB files.
> > 
> > READWRITE
> > disks  blksz tput   sys   tputsys
> > -  --     -  
> >   1 4k89 18s   57 44s
> >   116k46 13s   67 18s
> >   164k75 12s   68 12s
> >   2 4k   179 20s  114 43s
> >   216k55 13s  132 18s
> >   264k   126 12s  126 12s
> >   4 4k   350 20s  214 43s
> >   416k   350 14s  264 19s
> >   464k   176 11s  266 12s
> >   8 4k   415 21s  446 41s
> >   816k   655 13s  518 19s
> >   864k   664 12s  552 12s
> >  12 4k   413 20s  633 33s
> >  1216k   736 14s  741 19s
> >  1264k   836 12s  743 12s
> > 
> > Throughput in MB/s.
> > 
> > 
> > Consistent improvement across the write results, first time
> > I've hit the limits of the PCI-X bus with a single buffered
> > I/O thread doing either reads or writes.
> 
> 1-disk and 2-disk read throughput fell by an improbable amount, which makes
> me cautious about the other numbers.

For read, yes, and it's because something is going wrong with the
I/O size - it looks like readahead thrashing of some kind even
with 4k pages tests.

When when I bumped the block device readahead from 256 -> 2048,
the single disk read numbers went 60, 75, 75MB/s for 4->64k block size
and were repeatable, so we definitely have some interaction with readahead.

> Your annotation says "blocksize".  Are you really varying the fs blocksize
> here, or did you mean "pagesize"?

Filesystem blocksize, as specified by mkfs.xfs. Which, in turn,
changes the page cache order.

> What worries me here is that we have inefficient code, and increasing the
> pagesize amortises that inefficiency without curing it.

Increasing the filesystem block size also reduces the overhead of
the filesystem, not just he page cache. A lot of the overhead (write
especially) reductions are going to be filesystem block size
related, so I wouldn't start assuming that it's just he page cache
changes that have brought about these system time reductions.

> If so, it would be better to fix the inefficiencies, so that 4k pagesize
> will also benefit.
> 
> For example, see __do_page_cache_readahead().  It does a read_lock() and a
> page allocation and a radix-tree lookup for each page.  We can vastly
> improve that.



Sure but that's a different problem to what we are trying to solve
now.  Even with this in place, I think we'd still realise
improvements with the compound pages

> Fix up your lameo HBA for reads.

Where did that come from? You spend 20 lines described the inefficiencies
of the readahead in the page cache and it should be fixed but then you
turn around and say fix the HBA? 

This test was constructed to keep the I/o sizes within the current
bounds, so the HBA sees no difference in I/O sizes as the filesystem
block size changes.  i.e. the HBA is constant factor during the
tests. IOWs, the changes in numbers above are purely a result of the
page cache and filesystem changes

And besides, the "lameo HBA" I'm using is cleared limited by the
PCI-X bus it's on, not the size and type of pages being thrown at it
by the I/O layers. The hardware is pretty much irrelevant in these
tests

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/17] Large Blocksize Support V3

2007-04-27 Thread William Lee Irwin III
William Lee Irwin III wrote:
>> What sort of strategy do you intend to use to speculatively populate
>> the pagecache with contiguous pages?

On Sat, Apr 28, 2007 at 12:50:26PM +1000, Nick Piggin wrote:
> Andrew outlined it.

I'd like to suggest a few straightforward additions to the proposal:

(1) the interface to the page allocator tries to allocate N pages where
(a) N is a power of 2
(b) some effort is made to get contiguity
(c) some effort is made to fall back to lesser contiguity
(d) some effort is made to get N pages even with no contiguity
(2) a corresponding group freeing interface to the page allocator
(3) Pass the pages around in a list or similar so that O(1) instead of
O(pages) splice operations under the lock suffice for passing
them around. Dissecting compound pages outside locks helps.


-- wli
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH -rt] yet another irq storm

2007-04-27 Thread Steven Rostedt
Must be global warming, I'm getting a lot more irq storms than usual.

Now that I switched over to x86_64, I booted up and got another irq
storm. So I added my previous patch and it didn't fix it.  Looking
further, I found that the mask and unmask is done directly in the
x86_64/io_apic.c file.

This patch does basically the same thing as my previous patch, but to
the x86_64/io_apic.c file.

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>

Index: linux-2.6.21-rt1/arch/x86_64/kernel/io_apic.c
===
--- linux-2.6.21-rt1.orig/arch/x86_64/kernel/io_apic.c
+++ linux-2.6.21-rt1/arch/x86_64/kernel/io_apic.c
@@ -1437,7 +1437,8 @@ static void ack_apic_level(unsigned int 
irq_complete_move(irq);
 #if defined(CONFIG_GENERIC_PENDING_IRQ) || defined(CONFIG_IRQBALANCE)
/* If we are moving the irq we need to mask it */
-   if (unlikely(irq_desc[irq].status & IRQ_MOVE_PENDING)) {
+   if (unlikely(irq_desc[irq].status & IRQ_MOVE_PENDING) &&
+   !(irq_desc[irq].status & IRQ_INPROGRESS)) {
do_unmask_irq = 1;
mask_IO_APIC_irq(irq);
}


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


checkpatch, a patch checking script.

2007-04-27 Thread Dave Jones
On Wed, Apr 25, 2007 at 08:02:07PM -0700, Andrew Morton wrote:

 > > Yep, I was going to mention your scripts but you beat me to it.
 > > 
 > > I'll be glad to help maintain such animals if wanted.
 > > 
 > wanted ;)
 > 
 > At least, it would be interesting to investigate the usefulness.  I suspect
 > it will prove to be very useful for the little things.

Randy and I got together and hashed out a first cut at this.
(Randy actually gutted quite a lot of what I originally wrote, so deserves
 much kudos for improving this beyond my initial crappy version).
You can find the script at http://www.codemonkey.org.uk/projects/checkpatch/
There's also a git clonable tree there (only http right now).

http://www.codemonkey.org.uk/projects/checkpatch/example.log shows
what fell out of running it on my mbox of lkml from the past month.
Some of them are kinda noisy, and perhaps should be moved under --pedantic

I'm all ears for additional regexps, bug reports or other suggestions.

Before wiring this up to a procmail rule to scan every patch, I think it's
probably a better idea to flesh it out a bit more.

Dave

-- 
http://www.codemonkey.org.uk
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Fw: [PATCH] ia64: race flushing icache in do_no_page path

2007-04-27 Thread Nick Piggin

Nick Piggin wrote:


What if you were to say remove all the PG_arch_1 code, and do something
really simple like flush icache in flush_dcache_page? Would performance
suffer horribly?


OIC, you need a virtual address to evict the icache, so you can't
flush at flush_dcache time? Or does ia64 have an instruction to
flush the whole icache? (it would be worth testing, to see how much
performance suffers).

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Back to the future.

2007-04-27 Thread Daniel Hazelton
On Friday 27 April 2007 21:44:48 Rafael J. Wysocki wrote:
> On Saturday, 28 April 2007 03:12, Linus Torvalds wrote:
> > On Sat, 28 Apr 2007, Rafael J. Wysocki wrote:
> > > > It's doubly bad, because that idiocy has also infected s2ram. Again,
> > > > another thing that really makes no sense at all - and we do it not
> > > > just for snapshotting, but for s2ram too. Can you tell me *why*?
> > >
> > > Why we freeze tasks at all or why we freeze kernel threads?
> >
> > In many ways, "at all".
> >
> > I _do_ realize the IO request queue issues, and that we cannot actually
> > do s2ram with some devices in the middle of a DMA. So we want to be able
> > to avoid *that*, there's no question about that. And I suspect that
> > stopping user threads and then waiting for a sync is practically one of
> > the easier ways to do so.
> >


Apparently I *CANNOT* wrap my head around this - if just because my laptop, 
running a vendor 2.6.17 kernel does s2ram perfectly, at least, it does when 
using the "Upstart" init system rather than the classical SysV init system. I 
have tried it with the classical init and the suspend isn't triggered by the 
buttons that used to do it. I didn't try 'echo ram > /sys/power/state', but I 
have a feeling that would have worked as well. I have problems with s2disk, 
but thats because I keep my swap partition small - I try to keep it at or 
around 256M when I have more than half a gig of Ram in a system. Perhaps one 
of these days I'll grab a multi-gig flash disk, set it up as a swap partition 
and try it again. (every time I've tried s2disk I wind up running out of disk 
space - and this is with nothing but X running. Any kind of progress meter 
for when the system is doing s2disk would be nice - every time I've tried it 
all I see for the nearly 2 minutes before the s2disk attempt ends is a black 
screen. I say 2 minutes because thats how long it takes for it to learn that 
there isn't enough space on the swap-partition to save the image)

DRH
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


commit 45cd8d8e -- why?

2007-04-27 Thread Roland Dreier
The changelog says:

fs/sysfs/bin.c: In function 'read':
fs/sysfs/bin.c:77: warning: format '%zd' expects type 'signed size_t', but 
argument 4 has type 'int'

but the signature of the function read() is

read(struct file * file, char __user * userbuf, size_t count, loff_t * off)

and git blame seems to show it was always thus -- ie count was always size_t.

And now on x86-64 and ia64 with gcc 4.1 at least, I get:

fs/sysfs/bin.c: In function 'read':
fs/sysfs/bin.c:62: warning: format '%d' expects type 'int', but argument 4 
has type 'size_t'

Andrew, what compiler were you using to get that warning?  Should we
revert commit 45cd8d8e?

 - R.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/17] Large Blocksize Support V3

2007-04-27 Thread Nick Piggin

William Lee Irwin III wrote:

On Sat, Apr 28, 2007 at 12:27:45PM +1000, Nick Piggin wrote:


I guess 10% isn't a small amount. Though it would be nice to have
before/after numbers for Linux. And, like Andrew was saying, we could
just _attempt_ to put contiguous pages in pagecache rather than
_require_ it. Which is still robust under fragmentation, and benefits
everyone, not just files with a large pagecache size.



What sort of strategy do you intend to use to speculatively populate
the pagecache with contiguous pages?


Andrew outlined it.

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/17] Large Blocksize Support V3

2007-04-27 Thread William Lee Irwin III
On Sat, Apr 28, 2007 at 12:27:45PM +1000, Nick Piggin wrote:
> I guess 10% isn't a small amount. Though it would be nice to have
> before/after numbers for Linux. And, like Andrew was saying, we could
> just _attempt_ to put contiguous pages in pagecache rather than
> _require_ it. Which is still robust under fragmentation, and benefits
> everyone, not just files with a large pagecache size.

What sort of strategy do you intend to use to speculatively populate
the pagecache with contiguous pages?


-- wli
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/17] Large Blocksize Support V3

2007-04-27 Thread Nick Piggin

Christoph Hellwig wrote:

On Fri, Apr 27, 2007 at 10:25:44PM +1000, Nick Piggin wrote:


Linus's favourite jokes about powerpc mmu being crippled forever, aside ;)



Different mmu.  The desktop 32bit mmu Linus refered to has almost nothing
in common with the mmu on 64bit systems.



Well I wasn't trying to make a point there so it isn't a big deal... but
he has known to say the 64-bit hash table is insane or broken. If he's
since recanted, I'd be interested to read the post :)



Right this could help but it is not addressing the basic requirement for
devices that need large contiguuos chunks of memory for I/O.


Did you read the last paragraph? Or anything Andrew's been writing?

"After that, I'd find it amusing if HBAs worth thousands of $ have
 trouble looking up sglists at the relatively glacial pace that IO
 requires, and/or can't spare a few more K for reasonable sglist
 sizes, but if that is really the case, then we could use iommus
 and/or just attempt to put physically contiguous pages in pagecache,
 rather than require it."



Real highend HBAs don't have that problem.  But for example aacraid
which is very common on mid-end servers is a _lot_ faster when it
gets continous memory.  Some benchmark was 10 or more percent faster
on windows due to this.


And that wasn't due to the 128 sg limit?

I guess 10% isn't a small amount. Though it would be nice to have
before/after numbers for Linux. And, like Andrew was saying, we could
just _attempt_ to put contiguous pages in pagecache rather than
_require_ it. Which is still robust under fragmentation, and benefits
everyone, not just files with a large pagecache size.

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] Add kvasprintf()

2007-04-27 Thread Jeremy Fitzhardinge
Add a kvasprintf() function to compliment kasprintf().

[ No in-tree users yet, but I have some coming up. ]

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Cc: Andrew Morton <[EMAIL PROTECTED]>
Cc: Keir Fraser <[EMAIL PROTECTED]>

---
 include/linux/kernel.h |1 +
 lib/vsprintf.c |   28 
 2 files changed, 21 insertions(+), 8 deletions(-)

===
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -121,6 +121,7 @@ extern int vscnprintf(char *buf, size_t 
__attribute__ ((format (printf, 3, 0)));
 extern char *kasprintf(gfp_t gfp, const char *fmt, ...)
__attribute__ ((format (printf, 2, 3)));
+extern char *kvasprintf(gfp_t gfp, const char *fmt, va_list args);
 
 extern int sscanf(const char *, const char *, ...)
__attribute__ ((format (scanf, 2, 3)));
===
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -851,22 +851,34 @@ EXPORT_SYMBOL(sscanf);
 
 
 /* Simplified asprintf. */
-char *kasprintf(gfp_t gfp, const char *fmt, ...)
-{
-   va_list ap;
+char *kvasprintf(gfp_t gfp, const char *fmt, va_list ap)
+{
unsigned int len;
char *p;
-
-   va_start(ap, fmt);
-   len = vsnprintf(NULL, 0, fmt, ap);
-   va_end(ap);
+   va_list aq;
+
+   va_copy(aq, ap);
+   len = vsnprintf(NULL, 0, fmt, aq);
+   va_end(aq);
 
p = kmalloc(len+1, gfp);
if (!p)
return NULL;
+
+   vsnprintf(p, len+1, fmt, ap);
+
+   return p;
+}
+
+char *kasprintf(gfp_t gfp, const char *fmt, ...)
+{
+   va_list ap;
+   char *p;
+
va_start(ap, fmt);
-   vsnprintf(p, len+1, fmt, ap);
+   p = kvasprintf(gfp, fmt, ap);
va_end(ap);
+
return p;
 }
 


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: X display shift with disabled console blanking

2007-04-27 Thread Antonino A. Daplas
On Fri, 2007-04-27 at 18:08 +0100, James Pearson wrote:
> I have a problem whereby the X display 'shifts' to left when anything 
> writes to /dev/console - where console screen blanking has been disabled 
> i.e. doing something like:
> 
> boot to run level 3
> 
> If not root, then make sure /dev/console is writeable
> 
> login and type:
> 
> setterm -blank 0
> 
> start X
> 
> type into an xterm:
> 
> echo "some random text" > /dev/console
> (may have to repeat the echo above a few times)
> 
> ... and the whole X display jumps (and wraps) to the left
> 
> I'm using a RHEL4 based distro with a vanilla 2.6.21 x86_64 kernel 
> (although I've seen the problem with various x86_64 and i686 2.6.X kernels).
> 
> I've seen this problem on a number of different nVidia cards - using 
> the vesa driver (same problem occurs with nVidia's binary driver). I 
> haven't tried using other makes of graphics cards.
> 
> 
> OK, this may be a strange combination of disabling the text console 
> blanking and running X, but something isn't right somewhere ...

Yep, it's strange because I can't reproduce this. And the console write
should not succeed if the current console is in KD_GRAPHICS mode, which
is done by X (unless your version is different).

> 
> Any ideas?

I don't.  But, what is your current console?  Is it VGA, or framebuffer?
Can you try doing this again in both VGA and vesafb?

And this does not happen if there is no previous setterm -blank 0
command?

Tony


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] use elfnote.h to generate vsyscall notes

2007-04-27 Thread Jeremy Fitzhardinge
Use existing elfnote.h to generate vsyscall notes, rather than doing
it locally.  Changes elfnote.h a bit to suite, since this is the first
asm user, and it wasn't quite right.

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Cc: "Eric W. Biederman" <[EMAIL PROTECTED]>
Cc: Roland McGrath <[EMAIL PROTECTED]>

---
 arch/i386/kernel/vsyscall-note.S |   23 ++-
 include/linux/elfnote.h  |   18 +-
 2 files changed, 19 insertions(+), 22 deletions(-)

===
--- a/arch/i386/kernel/vsyscall-note.S
+++ b/arch/i386/kernel/vsyscall-note.S
@@ -3,23 +3,12 @@
  * Here we can supply some information useful to userland.
  */
 
-#include 
 #include 
+#include 
 
-#define ASM_ELF_NOTE_BEGIN(name, flags, vendor, type)\
-   .section name, flags; \
-   .balign 4;\
-   .long 1f - 0f;  /* name length */ \
-   .long 3f - 2f;  /* data length */ \
-   .long type; /* note type */   \
-0: .asciz vendor;  /* vendor name */ \
-1: .balign 4;\
-2:
-
-#define ASM_ELF_NOTE_END \
-3: .balign 4;  /* pad out section */ \
-   .previous
-
-   ASM_ELF_NOTE_BEGIN(".note.kernel-version", "a", UTS_SYSNAME, 0)
+/* Ideally this would use UTS_NAME, but using a quoted string here
+   doesn't work. Remember to change this when changing the
+   kernel's name. */
+ELFNOTE_START(Linux, 0, "a")
.long LINUX_VERSION_CODE
-   ASM_ELF_NOTE_END
+ELFNOTE_END
===
--- a/include/linux/elfnote.h
+++ b/include/linux/elfnote.h
@@ -38,17 +38,25 @@
  * e.g. ELFNOTE(XYZCo, 42, .asciz, "forty-two")
  *  ELFNOTE(XYZCo, 12, .long, 0xdeadbeef)
  */
-#define ELFNOTE(name, type, desctype, descdata)\
-.pushsection .note.name, "",@note  ;   \
+#define ELFNOTE_START(name, type, flags)   \
+.pushsection .note.name, flags,@note   ;   \
   .align 4 ;   \
   .long 2f - 1f/* namesz */;   \
-  .long 4f - 3f/* descsz */;   \
+  .long 4484f - 3f /* descsz */;   \
   .long type   ;   \
 1:.asciz #name ;   \
 2:.align 4 ;   \
-3:desctype descdata;   \
-4:.align 4 ;   \
+3:
+
+#define ELFNOTE_END\
+4484:.align 4  ;   \
 .popsection;
+
+#define ELFNOTE(name, type, desc)  \
+   ELFNOTE_START(name, type, "")   \
+   desc;   \
+   ELFNOTE_END
+
 #else  /* !__ASSEMBLER__ */
 #include 
 /*


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Fabric7 VIOC driver going away

2007-04-27 Thread Jeff Garzik
It looks like Fabric7 has gone out of business, and the maintainer works 
elsewhere, so I'm no longer inclined to merge it into the upstream kernel.


Yell now, if there is a contigent of Fabric7 users that still want this.

Jeff


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/17] Large Blocksize Support V3

2007-04-27 Thread William Lee Irwin III
On Thu, Apr 26, 2007 at 11:55:42PM -0700, Andrew Morton wrote:
>>> Please address my point: if in five years time x86 has larger or varible
>>> pagesize, this code will be a permanent millstone around our necks which we
>>> *should not have merged*.
>>> And if in five years time x86 does not have larger pagesize support then
>>> the manufacturers would have decided that 4k pages are not a performance
>>> problem, so we again should not have merged this code.

On Fri, 27 Apr 2007 06:44:51 -0700 William Lee Irwin III <[EMAIL PROTECTED]> 
wrote:
>> So the verdict is wait 5 years, see if x86 did anything, and so on.

On Fri, Apr 27, 2007 at 12:15:57PM -0700, Andrew Morton wrote:
> You missed the bit about "evaluate alternatives".

No worries. I'm used to being on the wrong side of things. I'll have
no trouble picking out the alternative least likely to be accepted. ;)


-- wli
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] deflate inflate_dynamic too

2007-04-27 Thread Jeremy Fitzhardinge
inflate_dynamic() has piggy stack usage too, so heap allocate it too.
I'm not sure it actually gets used, but it shows up large in "make
checkstack".

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>

---
 lib/inflate.c |   63 ++---
 1 file changed, 42 insertions(+), 21 deletions(-)

===
--- a/lib/inflate.c
+++ b/lib/inflate.c
@@ -798,15 +798,18 @@ STATIC int noinline INIT inflate_dynamic
   unsigned nb;  /* number of bit length codes */
   unsigned nl;  /* number of literal/length codes */
   unsigned nd;  /* number of distance codes */
-#ifdef PKZIP_BUG_WORKAROUND
-  unsigned ll[288+32];  /* literal/length and distance code lengths */
-#else
-  unsigned ll[286+30];  /* literal/length and distance code lengths */
-#endif
+  unsigned *ll; /* literal/length and distance code lengths */
   register ulg b;   /* bit buffer */
   register unsigned k;  /* number of bits in bit buffer */
+  int ret;
 
 DEBG(" 286 || nd > 30)
 #endif
-return 1;   /* bad lengths */
+  {
+ret = 1; /* bad lengths */
+goto out;
+  }
 
 DEBG("dyn1 ");
 
@@ -850,7 +856,8 @@ DEBG("dyn2 ");
   {
 if (i == 1)
   huft_free(tl);
-return i;   /* incomplete code set */
+ret = i;   /* incomplete code set */
+goto out;
   }
 
 DEBG("dyn3 ");
@@ -872,8 +879,10 @@ DEBG("dyn3 ");
   NEEDBITS(2)
   j = 3 + ((unsigned)b & 3);
   DUMPBITS(2)
-  if ((unsigned)i + j > n)
-return 1;
+  if ((unsigned)i + j > n) {
+ret = 1;
+   goto out;
+  }
   while (j--)
 ll[i++] = l;
 }
@@ -882,8 +891,10 @@ DEBG("dyn3 ");
   NEEDBITS(3)
   j = 3 + ((unsigned)b & 7);
   DUMPBITS(3)
-  if ((unsigned)i + j > n)
-return 1;
+  if ((unsigned)i + j > n) {
+ret = 1;
+   goto out;
+  }
   while (j--)
 ll[i++] = 0;
   l = 0;
@@ -893,8 +904,10 @@ DEBG("dyn3 ");
   NEEDBITS(7)
   j = 11 + ((unsigned)b & 0x7f);
   DUMPBITS(7)
-  if ((unsigned)i + j > n)
-return 1;
+  if ((unsigned)i + j > n) {
+ret = 1;
+   goto out;
+  }
   while (j--)
 ll[i++] = 0;
   l = 0;
@@ -923,7 +936,8 @@ DEBG("dyn5b ");
   error("incomplete literal tree");
   huft_free(tl);
 }
-return i;   /* incomplete code set */
+ret = i;   /* incomplete code set */
+goto out;
   }
 DEBG("dyn5c ");
   bd = dbits;
@@ -939,15 +953,18 @@ DEBG("dyn5d ");
   huft_free(td);
 }
 huft_free(tl);
-return i;   /* incomplete code set */
+ret = i;   /* incomplete code set */
+goto out;
 #endif
   }
 
 DEBG("dyn6 ");
 
   /* decompress until an end-of-block code */
-  if (inflate_codes(tl, td, bl, bd))
-return 1;
+  if (inflate_codes(tl, td, bl, bd)) {
+ret = 1;
+goto out;
+  }
 
 DEBG("dyn7 ");
 
@@ -956,10 +973,14 @@ DEBG("dyn7 ");
   huft_free(td);
 
   DEBG(">");
-  return 0;
-
- underrun:
-  return 4;/* Input underrun */
+  ret = 0;
+out:
+  free(ll);
+  return ret;
+
+underrun:
+  ret = 4; /* Input underrun */
+  goto out;
 }
 
 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] x86: Add a sched_clock paravirt_op

2007-04-27 Thread Jeremy Fitzhardinge
The tsc-based get_scheduled_cycles interface is not a good match for
Xen's runstate accounting, which reports everything in nanoseconds.

This patch replaces this interface with a sched_clock interface, which
matches both Xen and VMI's requirements.

In order to do this, we:
   1. replace get_scheduled_cycles with sched_clock
   2. hoist cycles_2_ns into a common header
   3. update vmi accordingly

One thing to note: because sched_clock is implemented as a weak
function in kernel/sched.c, we must define a real function in order to
override this weak binding.  This means the usual paravirt_ops
technique of using an inline function won't work in this case.

[ This is against Andi's patch queue.  It fixes the x86-64 build problem. ]

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Cc: Zachary Amsden <[EMAIL PROTECTED]>
Cc: Dan Hecht <[EMAIL PROTECTED]>
Cc: john stultz <[EMAIL PROTECTED]>

---
 arch/i386/kernel/paravirt.c|2 -
 arch/i386/kernel/sched-clock.c |   43 +--
 arch/i386/kernel/vmi.c |2 -
 arch/i386/kernel/vmiclock.c|6 ++--
 include/asm-i386/paravirt.h|7 -
 include/asm-i386/sched-clock.h |   49 
 include/asm-i386/timer.h   |2 -
 include/asm-i386/vmi_time.h|2 -
 include/asm-x86_64/timer.h |2 -
 9 files changed, 79 insertions(+), 36 deletions(-)

===
--- a/arch/i386/kernel/paravirt.c
+++ b/arch/i386/kernel/paravirt.c
@@ -268,7 +268,7 @@ struct paravirt_ops paravirt_ops = {
.write_msr = native_write_msr_safe,
.read_tsc = native_read_tsc,
.read_pmc = native_read_pmc,
-   .get_scheduled_cycles = native_read_tsc,
+   .sched_clock = native_sched_clock,
.get_cpu_khz = native_calculate_cpu_khz,
.load_tr_desc = native_load_tr_desc,
.set_ldt = native_set_ldt,
===
--- a/arch/i386/kernel/sched-clock.c
+++ b/arch/i386/kernel/sched-clock.c
@@ -35,29 +35,8 @@
  * [EMAIL PROTECTED] "math is hard, lets go shopping!"
  */
 
-#define CYC2NS_SCALE_FACTOR 10 /* 2^10, carefully chosen */
-
-struct sc_data {
-   unsigned cyc2ns_scale;
-   unsigned unstable;
-   unsigned long long sync_base;   /* TSC or jiffies at syncpoint*/
-   unsigned long long ns_base; /* nanoseconds at sync point */
-   unsigned long long last_val;/* Last returned value */
-};
-
-static DEFINE_PER_CPU(struct sc_data, sc_data) =
+DEFINE_PER_CPU(struct sc_data, sc_data) =
{ .unstable = 1, .sync_base = INITIAL_JIFFIES };
-
-static inline u64 cycles_2_ns(struct sc_data *sc, u64 cyc)
-{
-   u64 ns;
-
-   cyc -= sc->sync_base;
-   ns = (cyc * sc->cyc2ns_scale) >> CYC2NS_SCALE_FACTOR;
-   ns += sc->ns_base;
-
-   return ns;
-}
 
 /*
  * Scheduler clock - returns current time in nanosec units.
@@ -79,7 +58,7 @@ static inline u64 cycles_2_ns(struct sc_
  * per CPU. This state is protected against parallel state changes
  * with interrupts off.
  */
-unsigned long long sched_clock(void)
+unsigned long long native_sched_clock(void)
 {
unsigned long long r;
struct sc_data *sc = _cpu_var(sc_data);
@@ -98,8 +77,8 @@ unsigned long long sched_clock(void)
sc->last_val = r;
local_irq_restore(flags);
} else {
-   get_scheduled_cycles(r);
-   r = cycles_2_ns(sc, r);
+   rdtscll(r);
+   r = cycles_2_ns(r);
sc->last_val = r;
}
 
@@ -107,6 +86,18 @@ unsigned long long sched_clock(void)
 
return r;
 }
+
+/* We need to define a real function for sched_clock, to override the
+   weak default version */
+#ifdef CONFIG_PARAVIRT
+unsigned long long sched_clock(void)
+{
+   return paravirt_sched_clock();
+}
+#else
+unsigned long long sched_clock(void)
+   __attribute__((alias("native_sched_clock")));
+#endif
 
 /* Resync with new CPU frequency */
 static void resync_sc_freq(struct sc_data *sc, unsigned int newfreq)
@@ -124,7 +115,7 @@ static void resync_sc_freq(struct sc_dat
   because sched_clock callers should be able to tolerate small
   errors. */
sc->ns_base = ktime_to_ns(ktime_get());
-   get_scheduled_cycles(sc->sync_base);
+   rdtscll(sc->sync_base);
sc->cyc2ns_scale = (100 << CYC2NS_SCALE_FACTOR) / newfreq;
 }
 
===
--- a/arch/i386/kernel/vmi.c
+++ b/arch/i386/kernel/vmi.c
@@ -890,7 +890,7 @@ static inline int __init activate_vmi(vo
paravirt_ops.setup_boot_clock = vmi_time_bsp_init;
paravirt_ops.setup_secondary_clock = vmi_time_ap_init;
 #endif
-   paravirt_ops.get_scheduled_cycles = vmi_get_sched_cycles;
+   paravirt_ops.sched_clock = vmi_sched_clock;

Re: Fw: [PATCH] ia64: race flushing icache in do_no_page path

2007-04-27 Thread Nick Piggin

Hugh Dickins wrote:

On Fri, 27 Apr 2007, Nick Piggin wrote:


But that's because of ia64's cache coherency implementation. I don't really
follow the documentation to know whether it should be one way or the other,
but surely it should be done either before or after the set_pte_at, not both.

Anyway, how about fremap or mprotect, for example?
... 


OK, I'm still not sure that I understand why lazy_mmu_prot_update should be
used rather than flush_icache_page (in concept, not ia64 implementation).
Sure, flush_icache_page isn't given the pte, but let's assume we can change
that.



You're asking lots of good questions.  I wish the ia64 people would
know the answers, but from the length of time the "lazy_mmu_prot_update"
stuff took to get into the tree, and the length of time it's taken to be
found defective, I suspect they don't, and we'll have to guess for them.

Some guesses I'm working with...

I presume Mike and Anil are correct, that it needs to be done before
putting pte into page table, not left until after: but as you've
guessed, that needs to be done everywhere, not just in the two
places so far identified.

When it was discussed last year (in connection with Peter's page
cleaning patches) it was thought to be a variant of update_mmu_cache()
(after setting pte), and we added the fremap one to accompany it;
but now it looks to be a variant of flush_icache_page() (before
setting pte).


Right. I think.



I believe lazy_mmu_prot_update(pteval) came into existence primarily
for mprotect's change_pte_range() case.  If ia64 filled in its
flush_icache_page(vma, page), that could have been used there
(checking 'vm_flags & VM_EXEC' instead of pte_exec): but that would
involve a relatively expensive(?) pte_page() in a place which doesn't
need to know the struct page for other cases.


Well, I think we could always add a pte argument to flush_icache_page...
Then, there might be logic to have a flush_lazy_icache_page when
changing protections, but that operation (currently called
lazy_mmu_prot_update) really doesn't seem like it should be called in
all the other places that it is, flush_icache_page should be used for
that.

But AFAIKS, if we really want correctness, flush_icache_page should go
away and be implemented in flush_dcache_page.



Well, not pte_page(), it needs to be vm_normal_page() doesn't it?
and ia64's current lazy_mmu_prot_update is unsafe when !pfn_valid.

Some flush_icache_pages are already in place, others are not: do
we need to add some?  But those architectures which have a non-empty
flush_icache_page seem to have survived without the additional calls
- so they might be unnecessarily slowed down by additional calls.


Well flush_icache seems to be intended solely to bring icache in sync
with dcache modifications, but they try to skimp out on most of the
flushes required to handle dcache aliases... but really, I don't think
that is possible to do 100% correctly.

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] x86: fix PSE pagetable construction

2007-04-27 Thread Jeremy Fitzhardinge
When constructing the initial pagetable in pagetable_init, make sure
that non-PSE pmds are updated to PSE ones.  This fixes a bug in the
paravirt pagetable init code, which otherwise tries to avoid overwrite
existing mappings.

This moves the definition of pmd_huge() out of the hugetlbfs files
into pgtable.h.

[ I know Eric would like to make larger changes to the way
  pagetable init works, but this patch is the minimal fix to an
  existing bug. ]

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Cc: "H. Peter Anvin" <[EMAIL PROTECTED]>
Cc: Eric W. Biederman <[EMAIL PROTECTED]>

---
 arch/i386/mm/hugetlbpage.c   |6 +-
 arch/i386/mm/init.c  |2 +-
 include/asm-i386/pgtable.h   |2 +-
 include/asm-x86_64/pgtable.h |1 +
 include/linux/hugetlb.h  |2 --
 5 files changed, 4 insertions(+), 9 deletions(-)

===
--- a/arch/i386/mm/hugetlbpage.c
+++ b/arch/i386/mm/hugetlbpage.c
@@ -183,6 +183,7 @@ follow_huge_addr(struct mm_struct *mm, u
return page;
 }
 
+#undef pmd_huge
 int pmd_huge(pmd_t pmd)
 {
return 0;
@@ -201,11 +202,6 @@ follow_huge_addr(struct mm_struct *mm, u
 follow_huge_addr(struct mm_struct *mm, unsigned long address, int write)
 {
return ERR_PTR(-EINVAL);
-}
-
-int pmd_huge(pmd_t pmd)
-{
-   return !!(pmd_val(pmd) & _PAGE_PSE);
 }
 
 struct page *
===
--- a/arch/i386/mm/init.c
+++ b/arch/i386/mm/init.c
@@ -172,7 +172,7 @@ static void __init kernel_physical_mappi
/* Map with big pages if possible, otherwise create 
normal page tables. */
if (cpu_has_pse) {
unsigned int address2 = (pfn + PTRS_PER_PTE - 
1) * PAGE_SIZE + PAGE_OFFSET + PAGE_SIZE-1;
-   if (!pmd_present(*pmd)) {
+   if (!pmd_present(*pmd) || !pmd_huge(*pmd)) {
if (is_kernel_text(address) || 
is_kernel_text(address2))
set_pmd(pmd, pfn_pmd(pfn, 
PAGE_KERNEL_LARGE_EXEC));
else
===
--- a/include/asm-i386/pgtable.h
+++ b/include/asm-i386/pgtable.h
@@ -211,7 +211,7 @@ extern unsigned long pg0[];
 #define pmd_none(x)(!(unsigned long)pmd_val(x))
 #define pmd_present(x) (pmd_val(x) & _PAGE_PRESENT)
 #definepmd_bad(x)  ((pmd_val(x) & (~PAGE_MASK & ~_PAGE_USER)) != 
_KERNPG_TABLE)
-
+#define pmd_huge(x)((pmd_val(x) & _PAGE_PSE) != 0)
 
 #define pages_to_mb(x) ((x) >> (20-PAGE_SHIFT))
 
===
--- a/include/asm-x86_64/pgtable.h
+++ b/include/asm-x86_64/pgtable.h
@@ -352,6 +352,7 @@ static inline int pmd_large(pmd_t pte) {
pmd_index(address))
 #define pmd_none(x)(!pmd_val(x))
 #define pmd_present(x) (pmd_val(x) & _PAGE_PRESENT)
+#define pmd_huge(x)((pmd_val(x) & _PAGE_PSE) != 0)
 #define pmd_clear(xp)  do { set_pmd(xp, __pmd(0)); } while (0)
 #define pfn_pmd(nr,prot) (__pmd(((nr) << PAGE_SHIFT) | pgprot_val(prot)))
 #define pmd_pfn(x)  ((pmd_val(x) & __PHYSICAL_MASK) >> PAGE_SHIFT)
===
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -41,7 +41,6 @@ struct page *follow_huge_addr(struct mm_
  int write);
 struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
pmd_t *pmd, int write);
-int pmd_huge(pmd_t pmd);
 void hugetlb_change_protection(struct vm_area_struct *vma,
unsigned long address, unsigned long end, pgprot_t newprot);
 
@@ -114,7 +113,6 @@ static inline unsigned long hugetlb_tota
 #define hugetlb_report_node_meminfo(n, buf)0
 #define follow_huge_pmd(mm, addr, pmd, write)  NULL
 #define prepare_hugepage_range(addr,len,pgoff) (-EINVAL)
-#define pmd_huge(x)0
 #define is_hugepage_only_range(mm, addr, len)  0
 #define hugetlb_free_pgd_range(tlb, addr, end, floor, ceiling) ({BUG(); 0; })
 #define hugetlb_fault(mm, vma, addr, write)({ BUG(); 0; })

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] ide-cs: recognize 2GB CompactFlash from Transcend

2007-04-27 Thread Peter Stuge
On Fri, Apr 27, 2007 at 07:01:43PM -0700, Andrew Morton wrote:
> This one-liner is turning into a fiasco.
> diff -puN 
> drivers/ide/legacy/ide-cs.c~ide-cs-recognize-2gb-compactflash-from-transcend 
> drivers/ide/legacy/ide-cs.c
> --- 
> a/drivers/ide/legacy/ide-cs.c~ide-cs-recognize-2gb-compactflash-from-transcend
> +++ a/drivers/ide/legacy/ide-cs.c
> @@ -401,6 +401,8 @@ static struct pcmcia_device_id ide_ids[]
>   PCMCIA_DEVICE_PROD_ID12("TOSHIBA", "MK2001MPL", 0xb4585a1a, 0x3489e003),
>   PCMCIA_DEVICE_PROD_ID1("TRANSCEND512M   ", 0xd0909443),
>   PCMCIA_DEVICE_PROD_ID12("TRANSCEND", "TS1GCF80", 0x709b1bf1, 
> 0x2a54d4b1),
> + PCMCIA_DEVICE_PROD_ID12("TRANSCEND", "TS2GCF120", 0x709b1bf1, 
> 0xf54a91c8),
> + PCMCIA_DEVICE_PROD_ID12("TRANSCEND", "TS2GCF120", 0x709b1bf1, 
> 0x969aa4f2),
>   PCMCIA_DEVICE_PROD_ID12("TRANSCEND", "TS4GCF120", 0x709b1bf1, 
> 0xf54a91c8),
>   PCMCIA_DEVICE_PROD_ID12("WIT", "IDE16", 0x244e5994, 0x3e232852),
>   PCMCIA_DEVICE_PROD_ID12("WEIDA", "TWTTI", 0xcc7cf69c, 0x212bb918),
> _
> 
> 
> Is this really supposed to add a TS2GCF120 entry with the same IDs
> as TS4GCF120?

That's probably a copy and paste error. 0x969aa4f2 is the correct ID.


> And pata_pcmcia-recognize-2gb-compactflash-from-transcend.patch:

This one is all right so for what it's worth, it gets:

Acked-by: Peter Stuge <[EMAIL PROTECTED]>


//Peter
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] h8300 generic irq

2007-04-27 Thread Andrew Morton
On Thu, 26 Apr 2007 17:34:37 +0900
Yoshinori Sato <[EMAIL PROTECTED]> wrote:

> h8300 using generic irq handler patch.
> 
> Signed-off-by: Yoshinori Sato <[EMAIL PROTECTED]>
> 

Minor things:

>
> --- /dev/null
> +++ b/arch/h8300/kernel/irq.c
> @@ -0,0 +1,211 @@
> +/*
> + * linux/arch/h8300/kernel/irq.c
> + *
> + * Copyright 2007 Yoshinori Sato <[EMAIL PROTECTED]>
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +/*#define DEBUG*/
> +
> +extern unsigned long *interrupt_redirect_table;
> +extern const int h8300_saved_vectors[];
> +extern const unsigned long h8300_trap_table[];
> +int h8300_enable_irq_pin(unsigned int irq);
> +void h8300_disable_irq_pin(unsigned int irq);

Please always avoid putting extern declarations into C files.  Please them
in a header file which is visible tot he definition site asw well as all
callers/users.

For something which is defined in assembly code (like
interrupt_redirect_table) it isn't so clear, because we cannot do
typechecking.  But I think it's still best to include the declaration in a
header file so that we only have to declare it once.  Plus it _is_ a global
symbol.

> +
> +/*
> + * h8300 interrupt controler implementation
> + */
> +struct irq_chip h8300irq_chip = {
> + .name   = "H8300-INTC",
> + .startup= h8300_startup_irq,
> + .shutdown   = h8300_shutdown_irq,
> + .enable = h8300_enable_irq,
> + .disable= h8300_disable_irq,
> + .ack= NULL,
> + .end= h8300_end_irq,
> +};

I think this could have static scope.

> +void ack_bad_irq(unsigned int irq)
> +{
> + printk("unexpected IRQ trap at vector %02x\n", irq);
> +}

printks should generally have facility levels (KERN_*)

> + panic("interrupt vector serup failed.");

typo

> + for ( i = 0; i < NR_IRQS; i++) {

for (i = 0

> + if (i == *saved_vector) {
> + ramvec_p++;
> + saved_vector++;
> + } else {
> + if ( i < NR_TRAPS ) {

if (i < NR_TRAPS)


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] ide-cs: recognize 2GB CompactFlash from Transcend

2007-04-27 Thread Andrew Morton
On Thu, 26 Apr 2007 11:21:01 +0200
"Aeschbacher, Fabrice" <[EMAIL PROTECTED]> wrote:

> As pointed to by Peter, and also as indicated by a judicious output in
> dmesg, the 4th parameter should be 0x969aa4f2. Please find below the
> corrected patch:
> 
> Signed-off-by: Fabrice Aeschbacher <[EMAIL PROTECTED]>
> 
> ===
> --- linux-2.6.20.7-orig/drivers/ide/legacy/ide-cs.c   2007-04-15
> 21:08:02.0 +0200
> +++ linux-2.6.20.7/drivers/ide/legacy/ide-cs.c2007-04-26
> 11:13:13.0 +0200
> @@ -399,6 +399,7 @@
>   PCMCIA_DEVICE_PROD_ID12("TOSHIBA", "MK2001MPL", 0xb4585a1a,
> 0x3489e003),
>   PCMCIA_DEVICE_PROD_ID1("TRANSCEND512M   ", 0xd0909443),
>   PCMCIA_DEVICE_PROD_ID12("TRANSCEND", "TS1GCF80", 0x709b1bf1,
> 0x2a54d4b1),
> + PCMCIA_DEVICE_PROD_ID12("TRANSCEND", "TS2GCF120", 0x709b1bf1,
> 0x969aa4f2),
>   PCMCIA_DEVICE_PROD_ID12("TRANSCEND", "TS4GCF120", 0x709b1bf1,
> 0xf54a91c8),
>   PCMCIA_DEVICE_PROD_ID12("WIT", "IDE16", 0x244e5994, 0x3e232852),
>   PCMCIA_DEVICE_PROD_ID12("WEIDA", "TWTTI", 0xcc7cf69c,
> 0x212bb918),
> ===

This one-liner is turning into a fiasco.  All the top-posting and
word-wrapped patches aren't helping :(

I presently have two patches.  Please check them.


ide-cs-recognize-2gb-compactflash-from-transcend.patch:


From: "Aeschbacher, Fabrice" <[EMAIL PROTECTED]>

Without the following patch, the kernel does not automatically detect
2GB CompactFlash cards from Transcend.

Signed-off-by: Fabrice Aeschbacher <[EMAIL PROTECTED]>
Cc: Dominik Brodowski <[EMAIL PROTECTED]>
Cc: Bartlomiej Zolnierkiewicz <[EMAIL PROTECTED]>
Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
---

 drivers/ide/legacy/ide-cs.c |2 ++
 1 files changed, 2 insertions(+)

diff -puN 
drivers/ide/legacy/ide-cs.c~ide-cs-recognize-2gb-compactflash-from-transcend 
drivers/ide/legacy/ide-cs.c
--- 
a/drivers/ide/legacy/ide-cs.c~ide-cs-recognize-2gb-compactflash-from-transcend
+++ a/drivers/ide/legacy/ide-cs.c
@@ -401,6 +401,8 @@ static struct pcmcia_device_id ide_ids[]
PCMCIA_DEVICE_PROD_ID12("TOSHIBA", "MK2001MPL", 0xb4585a1a, 0x3489e003),
PCMCIA_DEVICE_PROD_ID1("TRANSCEND512M   ", 0xd0909443),
PCMCIA_DEVICE_PROD_ID12("TRANSCEND", "TS1GCF80", 0x709b1bf1, 
0x2a54d4b1),
+   PCMCIA_DEVICE_PROD_ID12("TRANSCEND", "TS2GCF120", 0x709b1bf1, 
0xf54a91c8),
+   PCMCIA_DEVICE_PROD_ID12("TRANSCEND", "TS2GCF120", 0x709b1bf1, 
0x969aa4f2),
PCMCIA_DEVICE_PROD_ID12("TRANSCEND", "TS4GCF120", 0x709b1bf1, 
0xf54a91c8),
PCMCIA_DEVICE_PROD_ID12("WIT", "IDE16", 0x244e5994, 0x3e232852),
PCMCIA_DEVICE_PROD_ID12("WEIDA", "TWTTI", 0xcc7cf69c, 0x212bb918),
_


Is this really supposed to add a TS2GCF120 entry with the same IDs as
TS4GCF120?





And pata_pcmcia-recognize-2gb-compactflash-from-transcend.patch:

From: "Aeschbacher, Fabrice" <[EMAIL PROTECTED]>

Allow the pata_pcmcia driver to automatically detect 2GB CompactFlash cards
from Transcend.

Signed-off-by: Fabrice Aeschbacher <[EMAIL PROTECTED]>
Cc: "Peter Stuge" <[EMAIL PROTECTED]>
Acked-by: Alan Cox <[EMAIL PROTECTED]>
Cc: Bartlomiej Zolnierkiewicz <[EMAIL PROTECTED]>
Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
---

 drivers/ata/pata_pcmcia.c |1 +
 1 files changed, 1 insertion(+)

diff -puN 
drivers/ata/pata_pcmcia.c~pata_pcmcia-recognize-2gb-compactflash-from-transcend 
drivers/ata/pata_pcmcia.c
--- 
a/drivers/ata/pata_pcmcia.c~pata_pcmcia-recognize-2gb-compactflash-from-transcend
+++ a/drivers/ata/pata_pcmcia.c
@@ -396,6 +396,7 @@ static struct pcmcia_device_id pcmcia_de
PCMCIA_DEVICE_PROD_ID12("TOSHIBA", "MK2001MPL", 0xb4585a1a, 0x3489e003),
PCMCIA_DEVICE_PROD_ID1("TRANSCEND512M   ", 0xd0909443),
PCMCIA_DEVICE_PROD_ID12("TRANSCEND", "TS1GCF80", 0x709b1bf1, 
0x2a54d4b1),
+   PCMCIA_DEVICE_PROD_ID12("TRANSCEND", "TS2GCF120", 0x709b1bf1, 
0x969aa4f2),
PCMCIA_DEVICE_PROD_ID12("TRANSCEND", "TS4GCF120", 0x709b1bf1, 
0xf54a91c8),
PCMCIA_DEVICE_PROD_ID12("WIT", "IDE16", 0x244e5994, 0x3e232852),
PCMCIA_DEVICE_PROD_ID12("WEIDA", "TWTTI", 0xcc7cf69c, 0x212bb918),
_

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Fw: [PATCH] ia64: race flushing icache in do_no_page path

2007-04-27 Thread Nick Piggin

Rohit Seth wrote:

On Fri, 2007-04-27 at 21:55 +1000, Nick Piggin wrote:



That's the theory. However, I'd still like to know how the arch code can
make the assertion that icache is known to be at all times other than at
the time of a fault?




Kernel needs to only worry about the updates that it does.  So, if
kernel is writing into a page that is getting marked with execute
permission then it will need to make sure that caches are coherent.
ia64 Kernel keeps track of whether it has done any write operation on a
page or not using PG_arch_1.  And accordingly flushes icaches.


It flushes icache at fault time, I know. What I don't know is why we
leave them to drift out of sync afterwards.



Ie. what if an operation which causes incoherency is carried out _after_
an executable mapping is installed for that page.




You mean by user space? If so, then it is user space responsibility to
do the appropriate operations (like flush icache in this case).


No, I mean places that set PG_arch_1. flush_dcache_page. This can
happen for mapped pages in write, splice, install_arg_page looks
questionable, direct IO...

Actually there are various windows where mapped pages can be !uptodate,
so there is technically most of the filesystem code as well, but I'm
trying to stamp those out, so let's ignore that for now.

What if you were to say remove all the PG_arch_1 code, and do something
really simple like flush icache in flush_dcache_page? Would performance
suffer horribly?

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: sky2 regression in 2.6.21: Asus P5B-E Plus ethernet adapter no more supported

2007-04-27 Thread Daniel Drake

Stephen Hemminger wrote:

But the same hardware dies horribly on Gigabyte GA-965P motherboards.
Could you send me full lspci -vvx output. I'll re-enable it for Asus and add a 
block
for the Gigabyte boards. (sigh)


To add to the mix, Robert Tate on the same Gentoo bug reports that the 
Yukon2 hardware on the Gigabyte DQ6 works fine with sky2:


https://bugs.gentoo.org/show_bug.cgi?id=176219

03:00.0 0200: 11ab:4364 (rev 12)
Subsystem: 1458:e000
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- 

SERR- Device: Supported: MaxPayload 128 bytes, PhantFunc 0, 
ExtTag-

Device: Latency L0s unlimited, L1 unlimited
Device: AtnBtn- AtnInd- PwrInd-
Device: Errors: Correctable- Non-Fatal- Fatal- Unsupported-
Device: RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
Device: MaxPayload 128 bytes, MaxReadReq 512 bytes
Link: Supported Speed 2.5Gb/s, Width x1, ASPM L0s L1, 
Port 0

Link: Latency L0s <256ns, L1 unlimited
Link: ASPM Disabled RCB 128 bytes CommClk- ExtSynch-
Link: Speed 2.5Gb/s, Width x1
Capabilities: [100] Advanced Error Reporting
00: ab 11 64 43 07 04 10 00 12 00 00 02 08 00 00 00
10: 04 00 00 f7 00 00 00 00 01 70 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 58 14 00 e0
30: 00 00 00 00 48 00 00 00 00 00 00 00 0a 01 00 00

Daniel

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


2.6.21 - BUG: at arch/i386/kernel/smp.c:177 send_IPI_mask_bitmask()

2007-04-27 Thread Jeff Chua

Got this error just before suspend to disk. Suspend/resume without
problem, but only saw this after upgrading to 2.6.21 (no problem with
2.6.21-rc7, I think).

CONFIG_NO_HZ, CONFIG_HIGH_RES_TIMERS unset
CONFIG_HPET_TIMER=y


ACPI: PCI interrupt for device :00:1b.0 disabled
Disabling non-boot CPUs ...
swsusp: critical section:
swsusp: Need to copy 46874 pages
BUG: at arch/i386/kernel/smp.c:177 send_IPI_mask_bitmask()
[] send_IPI_mask_bitmask+0x52/0xa4
[] tick_do_broadcast_on_off+0x0/0xd3
[] __smp_call_function_single+0x44/0x64
[] tick_do_broadcast_on_off+0x0/0xd3
[] smp_call_function_single+0xc2/0xeb
[] tick_broadcast_on_off+0x48/0x63
[] tick_notify+0x20/0x54
[] notifier_call_chain+0x1b/0x2d
[] clockevents_notify+0x19/0x54
[] acpi_processor_power_verify+0x7d/0x86
[] acpi_processor_get_power_info+0x35/0x6c
[] acpi_processor_cst_has_changed+0x37/0x55
[] acpi_processor_notify+0x4c/0x5f
[] acpi_ev_notify_dispatch+0x52/0x5b
[] acpi_ev_queue_notify_request+0x9e/0xb0
[] acpi_ex_opcode_2A_0T_0R+0x68/0x96
[] acpi_ds_exec_end_op+0xc1/0x386
[] acpi_os_release_object+0x5/0x8
[] acpi_ps_complete_op+0x1cc/0x1db
[] acpi_ps_parse_loop+0x271/0x2a7
[] acpi_os_release_object+0x5/0x8
[] acpi_ps_parse_aml+0x69/0x219
[] acpi_ds_init_aml_walk+0xb3/0x106
[] acpi_ps_execute_method+0xaf/0xe5
[] acpi_ns_evaluate+0x9b/0xf4
[] acpi_evaluate_object+0x14c/0x1f3
[] acpi_leave_sleep_state+0x190/0x26d
[] acpi_pm_finish+0x11/0x40
[] pm_suspend_disk+0x170/0x185
[] enter_state+0x44/0x76
[] state_store+0x87/0x9e
[] state_store+0x0/0x9e
[] subsys_attr_store+0x1c/0x24
[] flush_write_buffer+0x23/0x28
[] sysfs_write_file+0x45/0x67
[] vfs_write+0x8b/0x106
[] sys_write+0x41/0x67
[] syscall_call+0x7/0xb
[] irttp_open_tsap+0x149/0x1d1
===


Thanks,
Jeff.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/17] Large Blocksize Support V3

2007-04-27 Thread Nick Piggin

Andrew Morton wrote:

On Sat, 28 Apr 2007 03:34:32 +1000 David Chinner <[EMAIL PROTECTED]> wrote:



Some more information - stripe unit on the dm raid0 is 512k.
I have not attempted to increase I/O sizes at all yet - these test are
just demonstrating efficiency improvements in the filesystem.

These numbers for 32GB files.

   READWRITE
disks  blksz tput   sys   tputsys
-  --     -  
 1 4k89 18s   57 44s
 116k46 13s   67 18s
 164k75 12s   68 12s
 2 4k   179 20s  114 43s
 216k55 13s  132 18s
 264k   126 12s  126 12s
 4 4k   350 20s  214 43s
 416k   350 14s  264 19s
 464k   176 11s  266 12s
 8 4k   415 21s  446 41s
 816k   655 13s  518 19s
 864k   664 12s  552 12s
12 4k   413 20s  633 33s
1216k   736 14s  741 19s
1264k   836 12s  743 12s

Throughput in MB/s.


Consistent improvement across the write results, first time
I've hit the limits of the PCI-X bus with a single buffered
I/O thread doing either reads or writes.



1-disk and 2-disk read throughput fell by an improbable amount, which makes
me cautious about the other numbers.

Your annotation says "blocksize".  Are you really varying the fs blocksize
here, or did you mean "pagesize"?

What worries me here is that we have inefficient code, and increasing the
pagesize amortises that inefficiency without curing it.

If so, it would be better to fix the inefficiencies, so that 4k pagesize
will also benefit.

For example, see __do_page_cache_readahead().  It does a read_lock() and a
page allocation and a radix-tree lookup for each page.  We can vastly
improve that.

Step 1:

- do a read-lock

- do a radix-tree walk to work out how many pages are missing

- read-unlock

- allocate that many pages

- read_lock()

- populate all the pages.

- read_unlock

- if any pages are left over, free them

- if we ended up not having enough pages, redo the whole thing.

that will reduce the number of read_lock()s, read_unlock()s and radix-tree
descents by a factor of 32 or so in this testcase.  That's a lot, and it's
something we (Nick ;)) should have done ages ago.


We can do pretty well with the lockless radix tree (that is already upstream)
there. I split that stuff out of my most recent lockless pagecache patchset,
because it doesn't require the "scary" speculative refcount stuff of the
lockless pagecache proper. Subject: [patch 5/9] mm: lockless probe.

So that is something we could merge pretty soon.

The other thing is that we can batch up pagecache page insertions for bulk
writes as well (that is. write(2) with buffer size > page size). I should
have a patch somewhere for that as well if anyone interested.

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.21-rc7-mm1: BUG_ON in kthread_bind during _cpu_down

2007-04-27 Thread Andrew Morton
On Thu, 26 Apr 2007 18:28:38 +0530
Gautham R Shenoy <[EMAIL PROTECTED]> wrote:

> I just checked with Vatsa if there was any subtle reason why they
> had put in the kthread_bind() in cpu.c. Vatsa cannot seem to recollect
> any and I can't see any. So let us just remove the kthread_bind.
> 
> Signed-off-by: Gautham R Shenoy <[EMAIL PROTECTED]>
> ---
>  kernel/cpu.c |4 
>  1 files changed, 4 deletions(-)
> 
> Index: linux-2.6.21-rc7/kernel/cpu.c
> ===
> --- linux-2.6.21-rc7.orig/kernel/cpu.c
> +++ linux-2.6.21-rc7/kernel/cpu.c
> @@ -176,10 +176,6 @@ static int _cpu_down(unsigned int cpu, i
>   /* This actually kills the CPU. */
>   __cpu_die(cpu);
>  
> - /* Move it here so it can run. */
> - kthread_bind(p, get_cpu());
> - put_cpu();
> -
>   /* CPU is completely dead: tell everyone.  Too late to complain. */
>   if (raw_notifier_call_chain(_chain, CPU_DEAD | mod,
>   hcpu) == NOTIFY_BAD)

So I cooked up a changelog and queued up the diff.  But I have an uneasy
feeling that things are getting a bit close to guesswork here.

We have a huge amount of change pending in the kthread/workqueue/freezer
area, partly because I decided not to merge most of the workqueue changes
into 2.6.21.

It'd be good if people could take some time to sit down and re-review the
code which we presently have.  I plan on sending it all off for 2.6.22 and
there might be some glitches but it seems to have a good track record so
far.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Question about Reiser4 (how to boot it?)

2007-04-27 Thread Jeff Chua

On 4/28/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:

Thanks, that is certainly helpful, but that only mounts one directory
(partition) as Reiser4.

This I have already done.

I was more interested in how to have a whole partition dedicated to
Reiser4 and being able to boot into it.


Not able to boot a whole partition with grub2. I've seen patch for
grub ... 
ftp://ftp.namesys.com/pub/reiser4progs/grub-0.97-reiser4-20050808.tar.gz

But I since I'm using grub2, it's not possible to boot directly into
reiser4. I'm only use the whole 250GB partition on my 2nd hard disk
for testing.

I'm as interested as you in looking for grub2 to boot directly.
Currently, I've to create a small ext2 partition for grub2.

Jeff.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Back to the future.

2007-04-27 Thread Rafael J. Wysocki
On Saturday, 28 April 2007 03:12, Linus Torvalds wrote:
> 
> On Sat, 28 Apr 2007, Rafael J. Wysocki wrote:
> > 
> > > It's doubly bad, because that idiocy has also infected s2ram. Again, 
> > > another thing that really makes no sense at all - and we do it not just 
> > > for snapshotting, but for s2ram too. Can you tell me *why*?
> > 
> > Why we freeze tasks at all or why we freeze kernel threads?
> 
> In many ways, "at all".
> 
> I _do_ realize the IO request queue issues, and that we cannot actually do 
> s2ram with some devices in the middle of a DMA. So we want to be able to 
> avoid *that*, there's no question about that. And I suspect that stopping 
> user threads and then waiting for a sync is practically one of the easier 
> ways to do so.
> 
> So in practice, the "at all" may become a "why freeze kernel threads?" and 
> freezing user threads I don't find really objectionable.
> 
> But as Paul pointed out, Linux on the old powerpc Mac hardware was 
> actually rather famous for having working (and reliable) suspend long 
> before it worked even remotely reliably on PC's. And they didn't do even
> that.
> 
> (They didn't have ACPI, and they had a much more limited set of devices, 
> but the whole process freezer is really about neither of those issues. The 
> wild and wacky PC hardware has its problems, but that's _one_ thing we 
> can't blame PC hardware for ;)

We freeze user space processes for the reasons that you have quoted above.

Why we freeze kernel threads in there too is a good question, but not for me to
answer.  I don't know.  Pavel should know, I think.

> > >   git grep create_freezeable_workthread
> > 
> > s/workthread/workqueue/
> 
> Yes.
> 
> > > and ponder the end results of that grep. If you don't see something 
> > > wrong, 
> > > you're blind.
> > 
> > This was a mistake, quite unrelated to the point you're making.
> 
> Did you actually _do_ the "grep" (with the fixed argument)?
> 
> I had two totally independent points. #1 was that you yourself have been 
> fixing bugs in this area. #2 was the result of that grep. It's absolutely 
> _empty_ except for the define to add that interface.
> 
> NOBODY USES IT!

The reason is pretty simple.

We wanted to drop that interface altogether, because it was broken (my fault),
but Oleg suggested that we keep it so that we could fix and use it in the
future (for purposes other than the hibernation, though).

> Now, grep for the same interface that creates _non_freezeable workqueues.
> 
> Put another way:
> 
>   [EMAIL PROTECTED] linux]$ git grep create_workqueue | wc -l
>   35
> 
>   [EMAIL PROTECTED] linux]$ git grep create_freezeable_workqueue | wc -l
>   1
> 
> and that _one_ hit you get for the "freezeable" case is not actually a 
> user, it's the definition!
> 
> Ie my point is, nobody wants freezeable kernel threads. Absolutely nobody.

That's freezable workqueues only. :-)

> Yet we have all this support for freezing them (or rather, we freeze them 
> by default, and then we have all this support for _not_ doing that wrong 
> default thing!)
> 
> So yes, I think it would be interesting to just stop freezing kernel 
> threads. Totally.

Okay, I'll do that.

Greetings,
Rafael
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH -mm] Allow selective freezing of the system for different events

2007-04-27 Thread Gautham R Shenoy
This patch
* Provides an interface to selectively freeze the system for different events.
* Allows tasks to exempt themselves or other tasks from specific freeze
  events.
* Allow nesting of freezer calls. For eg:

freeze_processes(EVENT_A);
/* Do something with respect to event A */
.
.
.
freeze_processes(EVENT_B);
/* Do something with respect to event B */
.
.
.
thaw_processes(EVENT_B);
.
.
.
thaw_processes(EVENT_B);

This type of behaviour would be required when cpu hotplug would start
using the process freezer, where EVENT_A would be SUSPEND and EVENT_B
would be HOTPLUG_CPU.

This patch applies on the top of 2.6.21-rc7-mm2 + Rafael's freezer
changes from http://lkml.org/lkml/2007/4/27/302.

Signed-off-by: Gautham R Shenoy <[EMAIL PROTECTED]>
---
 arch/i386/kernel/apm.c  |2 -
 drivers/block/loop.c|2 -
 drivers/char/apm-emulation.c|6 +--
 drivers/ieee1394/ieee1394_core.c|2 -
 drivers/md/md.c |2 -
 drivers/mmc/card/queue.c|2 -
 drivers/mtd/mtd_blkdevs.c   |2 -
 drivers/scsi/libsas/sas_scsi_host.c |2 -
 drivers/scsi/scsi_error.c   |2 -
 drivers/usb/storage/usb.c   |2 -
 include/linux/freezer.h |   44 +++-
 kernel/freezer.c|   64 ++--
 kernel/kprobes.c|4 +-
 kernel/kthread.c|2 -
 kernel/power/disk.c |4 +-
 kernel/power/main.c |8 ++--
 kernel/power/user.c |6 +--
 kernel/rcutorture.c |4 +-
 kernel/sched.c  |2 -
 kernel/softirq.c|2 -
 kernel/softlockup.c |2 -
 kernel/workqueue.c  |2 -
 22 files changed, 119 insertions(+), 49 deletions(-)

Index: linux-2.6.21-rc7/include/linux/freezer.h
===
--- linux-2.6.21-rc7.orig/include/linux/freezer.h
+++ linux-2.6.21-rc7/include/linux/freezer.h
@@ -4,17 +4,27 @@
 
 #ifdef CONFIG_FREEZER
 
+
 /*
  * Per task flags used by the freezer
  *
  * They should not be referred to directly outside of this file.
  */
-#define TFF_NOFREEZE   0   /* task should not be frozen */
+#define TFF_FE_SUSPEND 0   /* Do not freeze task for software suspend */
+#define TFF_FE_KPROBES 1   /* Do not freeze task for kprobes */
 #define TFF_FREEZE 8   /* task should go to the refrigerator ASAP */
 #define TFF_SKIP   9   /* do not count this task as freezable */
 #define TFF_FROZEN 10  /* task is frozen */
 
 /*
+ * Codes of different events which use the freezer
+ * These are the only flags that can be referred outside this file
+ */
+#define FE_SUSPEND (1 << TFF_FE_SUSPEND) /* Software Suspend */
+#define FE_KPROBES (1 << TFF_FE_KPROBES)   /* Kprobes */
+#define FE_ALL (FE_SUSPEND | FE_KPROBES) /* All events using freezer */
+
+/*
  * Check if a process has been frozen
  */
 static inline int frozen(struct task_struct *p)
@@ -57,19 +67,29 @@ static inline void clear_freeze_flag(str
 }
 
 /*
- * Check if the task wants to be exempted from freezing
+ * Check if the task wants to be exempted from freezing for
+ * freeze_event.
  */
-static inline int freezer_should_exempt(struct task_struct *p)
+static inline int freezer_should_exempt(struct task_struct *p,
+   unsigned long freeze_event)
 {
-   return test_bit(TFF_NOFREEZE, >freezer_flags);
+   return p->freezer_flags & freeze_event;
 }
 
 /*
  * Tell the freezer to exempt this task from freezing
+ * for events in freeze_event_mask.
  */
-static inline void freezer_exempt(struct task_struct *p)
+static inline void freezer_exempt(struct task_struct *p,
+ unsigned long freeze_event_mask)
+{
+   atomic_set_mask(freeze_event_mask, >freezer_flags);
+}
+
+/* Returns the mask of the events for which this process is freezeable */
+static inline unsigned long freezeable_event_mask(struct task_struct *p)
 {
-   set_bit(TFF_NOFREEZE, >freezer_flags);
+   return ~p->freezer_flags & FE_ALL;
 }
 
 /*
@@ -96,8 +116,8 @@ static inline int thaw_process(struct ta
 }
 
 extern void refrigerator(void);
-extern int freeze_processes(void);
-extern void thaw_processes(void);
+extern int freeze_processes(unsigned long freeze_event);
+extern void thaw_processes(unsigned long freeze_event);
 
 static inline int try_to_freeze(void)
 {
@@ -160,11 +180,15 @@ static inline int freezing(struct task_s
 static inline void freeze(struct task_struct *p) { BUG(); }
 static inline int freezer_should_exempt(struct task_struct *p) { return 0; }
 static inline void freezer_exempt(struct task_struct *p) {}
+static inline unsigned long 

Re: Fw: [PATCH] ia64: race flushing icache in do_no_page path

2007-04-27 Thread Rohit Seth
On Fri, 2007-04-27 at 15:18 +0100, Hugh Dickins wrote:
> I presume Mike and Anil are correct, that it needs to be done before
> putting pte into page table, not left until after: but as you've
> guessed, that needs to be done everywhere, not just in the two
> places so far identified.
> 

That sounds about right.  Before installing new mapping, kernel should
ensure there are no stale contents in caches or TLB.
lazy_mmu_prot_update needs to be called whenever the permissions on pte
(about to) change.  So if remapping is causing change in protection then
lazy_mmu_prot_update needs to be called.  

> When it was discussed last year (in connection with Peter's page
> cleaning patches) it was thought to be a variant of update_mmu_cache()
> (after setting pte), and we added the fremap one to accompany it;
> but now it looks to be a variant of flush_icache_page() (before
> setting pte).
> 
> I believe lazy_mmu_prot_update(pteval) came into existence primarily
> for mprotect's change_pte_range() case.

Yup.

>   If ia64 filled in its
> flush_icache_page(vma, page), that could have been used there
> (checking 'vm_flags & VM_EXEC' instead of pte_exec): but that would
> involve a relatively expensive(?) pte_page() in a place which doesn't
> need to know the struct page for other cases.
> 
> Well, not pte_page(), it needs to be vm_normal_page() doesn't it?
> and ia64's current lazy_mmu_prot_update is unsafe when !pfn_valid.
> 
> Some flush_icache_pages are already in place, others are not: do
> we need to add some?  But those architectures which have a non-empty
> flush_icache_page seem to have survived without the additional calls
> - so they might be unnecessarily slowed down by additional calls.
> 

Right.  Extra flush_icache_page routines will add cost to archs that
have non-null definition of this routine.  BTW, isn't flush_icache_page
marked for deprecation?

> I believe that was the secondary reason for lazy_mmu_prot_update(),
> perhaps better called ia64_flush_icache_page(): to allow calls to
> be added where ia64 was (mistakenly) thought to want them, without
> needing a protracted audit of how other architectures might be
> impacted.
> 

lazy_mmu_prot_update was added specifically for notifying change in
protection.  So, in a way it is closer to update_mmu_cache (Which is for
change in mappings itself).  Though for ia64 implementation, this ends
up flushing the icaches when needed.

Hopefully my reply is useful.

-rohit

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Back to the future.

2007-04-27 Thread David Lang

On Fri, 27 Apr 2007, Linus Torvalds wrote:


On Sat, 28 Apr 2007, Rafael J. Wysocki wrote:



It's doubly bad, because that idiocy has also infected s2ram. Again,
another thing that really makes no sense at all - and we do it not just
for snapshotting, but for s2ram too. Can you tell me *why*?


Why we freeze tasks at all or why we freeze kernel threads?


In many ways, "at all".

I _do_ realize the IO request queue issues, and that we cannot actually do
s2ram with some devices in the middle of a DMA. So we want to be able to
avoid *that*, there's no question about that. And I suspect that stopping
user threads and then waiting for a sync is practically one of the easier
ways to do so.

So in practice, the "at all" may become a "why freeze kernel threads?" and
freezing user threads I don't find really objectionable.


there was a thread last week (or so) about splitting up the process list, one 
list for normal user processes, one for kernel threads, and one for dead 
processes waiting to be reaped.


it almost sounds like what you want to do is to act as if the normal user 
threads weren't there for a short time (while you make the snapshot) and then 
recover them to continue and save the snapshot.


David Lang
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Back to the future.

2007-04-27 Thread David Lang

On Sat, 28 Apr 2007, Rafael J. Wysocki wrote:


On Saturday, 28 April 2007 03:03, Kyle Moffett wrote:

On Apr 27, 2007, at 18:07:46, Nigel Cunningham wrote:

Hi.

On Fri, 2007-04-27 at 14:44 -0700, Linus Torvalds wrote:

It makes it harder to debug (wouldn't it be *nice* to just ssh in,
and do
gdb -p 


Make the machine being suspended a VM and you can already do that.



when something goes wrong?) but we also *depend* on user space for
various things (the same way we depend on kernel threads, and why
it has been such a total disaster to try to freeze the kernel
threads too!). For example, if you want to do graphical stuff,
just using X would be quite nice,  wouldn't it?


But in doing so you make the contents of the disk inconsistent with
the state you've just snapshotted, leading to filesystem
corruption. Even if you modify filesystems to do checkpointing
(which is what we're really talking about), you still also have the
problem that your snapshot has to be stored somewhere before you
write it to disk, so you also have to either [snip]


Actually, it's a lot simpler than that.  We can just combine the
device-mapper snapshot with a VM+kernel snapshot system call and be
almost done:

   sys_snapshot(dev_t snapblockdev, int __user *snapshotfd);

When sys_snapshot is run, the kernel does:

1)  Sequentially freeze mounted filesystems using blockdev freezing.
If it's an fs that doesn't support freezing then either fail or force-
remount-ro that fs and downgrade all its filedescriptors to RO.
Doesn't need extra locking since process which try to do IO either
succeed before the freeze call returns for that blockdev or sleep on
the unfreeze of that blockdev.  Filesystems are synchronized and made
clean.
2)  Iterate over the userspace process list, freezing each process
and remapping all of its pages copy-on-write.  Any device-specific
pages need to have state saved by that device.


Why do you want to do 2) after 1) and not vice versa?


it doesn't really need to matter. if you care, just arrange to not schedule user 
processes while you are doing both steps.



3)  All processes (except kernel threads) are now frozen.
4)  Kernel should save internal state corresponding to current
userspace state.  The kernel also swaps out excess pages to free up
enough RAM and prepares the snapshot file-descriptor with copies of
kernel memory and the original (pre-COW) mapped userspace pages.
5)  Kernel substitutes filesystems for either a device-mapper
snapshot with snapblockdev as backing storage or union with tmpfs and
remounts the underlying filesystems as read-only.
6)  Kernel unfreezes all userspace processes and returns the snapshot
FD to userspace (where it can be read from).


Okay, but how do we do the error recovery if, for example, the image cannot
be saved?


give the user an error message telling him this, wait for confirmation, and then 
jump directly to the restore step. revert everything to the snapshot image(s), 
restart it.


David Lang
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Fw: [PATCH] ia64: race flushing icache in do_no_page path

2007-04-27 Thread Rohit Seth
On Fri, 2007-04-27 at 21:55 +1000, Nick Piggin wrote:

> That's the theory. However, I'd still like to know how the arch code can
> make the assertion that icache is known to be at all times other than at
> the time of a fault?
> 

Kernel needs to only worry about the updates that it does.  So, if
kernel is writing into a page that is getting marked with execute
permission then it will need to make sure that caches are coherent.
ia64 Kernel keeps track of whether it has done any write operation on a
page or not using PG_arch_1.  And accordingly flushes icaches.

> Ie. what if an operation which causes incoherency is carried out _after_
> an executable mapping is installed for that page.
> 

You mean by user space? If so, then it is user space responsibility to
do the appropriate operations (like flush icache in this case).

-rohit

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Back to the future.

2007-04-27 Thread Kyle Moffett

On Apr 27, 2007, at 21:15:28, Rafael J. Wysocki wrote:

On Saturday, 28 April 2007 03:03, Kyle Moffett wrote:

On Apr 27, 2007, at 18:07:46, Nigel Cunningham wrote:
But in doing so you make the contents of the disk inconsistent  
with the state you've just snapshotted, leading to filesystem  
corruption. Even if you modify filesystems to do checkpointing  
(which is what we're really talking about), you still also have  
the problem that your snapshot has to be stored somewhere before  
you write it to disk, so you also have to either [snip]


When sys_snapshot is run, the kernel does:

1)  Sequentially freeze mounted filesystems using blockdev  
freezing.  If it's an fs that doesn't support freezing then either  
fail or force-remount-ro that fs and downgrade all its  
filedescriptors to RO. Doesn't need extra locking since process  
which try to do IO either succeed before the freeze call returns  
for that blockdev or sleep on the unfreeze of that blockdev.   
Filesystems are synchronized and made clean.
2)  Iterate over the userspace process list, freezing each process  
and remapping all of its pages copy-on-write.  Any device-specific  
pages need to have state saved by that device.


Why do you want to do 2) after 1) and not vice versa?


(1) can be done without extra locking.  Device-mapper already has  
code to freeze filesystems and that makes a natural process-stopping  
point.  Any threads doing IO will very quickly put themselves to  
sleep at (1) and save us some effort during step 2.


6)  Kernel unfreezes all userspace processes and returns the  
snapshot FD to userspace (where it can be read from).


Okay, but how do we do the error recovery if, for example, the  
image cannot be saved?


If the image can't be saved then there are 2 options:
  (1)  Call sys_restore() with the image
  (2)  Pass your snapshot file-descriptor to sys_unsnapshot()

In the former case, the system will be restored to the state it was  
at a few seconds earlier, right as it took the snapshot.  In the  
latter case the modified-in-memory snapshot pages will be synced back  
to the disk filesystems, the copy-on-write data-structures torn down  
(think of merging an LVM snapshot back into its base device), and the  
memory allocated for the snapshot will be freed.  Either way the  
system is properly in sync with disk again, the only difference is  
whether you want to preserve the userspace state from during the  
attempted snapshot (IE: any error status).  You could also save the  
error state in case (1) by just auto-posting a bug-report on http:// 
bugs.$VENDOR.com/ of course :-D.


Cheers,
Kyle Moffett

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


NR_UNSTABLE_FS vs. NR_FILE_DIRTY: double counting pages?

2007-04-27 Thread Ethan Solomita
There are several places where we add together NR_UNSTABLE_FS and
NF_FILE_DIRTY:

sync_inodes_sb()
balance_dirty_pages()
wakeup_pdflush()
wb_kupdate()
prefetch_suitable()

I can trace a standard codepath where it seems both of these are set
on the same page:

nfs_file_aops.commit_write ->
nfs_commit_write
nfs_updatepages
nfs_writepage_setup
nfs_wb_page
nfs_wb_page_priority
nfs_writepage_locked
nfs_flush_mapping
nfs_flush_list
nfs_flush_multi
nfs_write_partial_ops.rpc_call_done
nfs_writeback_done_partial
nfs_writepage_release
nfs_reschedule_unstable_write
nfs_mark_request_commit
incr NR_UNSTABLE_NFS

nfs_file_aops.commit_write ->
nfs_commit_write
nfs_updatepage
__set_page_dirty_nobuffers
incr NF_FILE_DIRTY


This is the standard code path that derives from sys_write(). Can
someone either show how this code sequence can't happen, or confirm for
me that there's a bug?
-- Ethan

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Back to the future.

2007-04-27 Thread Bojan Smojver
Nigel Cunningham  nigel.suspend2.net> writes:

> 4) uswsusp and swsusp get dropped and Suspend2 goes into mainline.

After reading most of this thread, it seems that Linus is of the view that all
three of these suck in one way or another. Suspend2 has the most features and is
the fastest of the lot. It can behave like swsusp from the user's point of view
(i.e. echo disk > /sys/power/state), so the migration should be seamless for
most distros. It isn't complicated to set up. It's been proven in the field. It
looks pretty.

So, while we're waiting for the next STD technology, why not have the best and
develop from there?

--
Bojan

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Back to the future.

2007-04-27 Thread Linus Torvalds


On Sat, 28 Apr 2007, Rafael J. Wysocki wrote:
> 
> > It's doubly bad, because that idiocy has also infected s2ram. Again, 
> > another thing that really makes no sense at all - and we do it not just 
> > for snapshotting, but for s2ram too. Can you tell me *why*?
> 
> Why we freeze tasks at all or why we freeze kernel threads?

In many ways, "at all".

I _do_ realize the IO request queue issues, and that we cannot actually do 
s2ram with some devices in the middle of a DMA. So we want to be able to 
avoid *that*, there's no question about that. And I suspect that stopping 
user threads and then waiting for a sync is practically one of the easier 
ways to do so.

So in practice, the "at all" may become a "why freeze kernel threads?" and 
freezing user threads I don't find really objectionable.

But as Paul pointed out, Linux on the old powerpc Mac hardware was 
actually rather famous for having working (and reliable) suspend long 
before it worked even remotely reliably on PC's. And they didn't do even
that.

(They didn't have ACPI, and they had a much more limited set of devices, 
but the whole process freezer is really about neither of those issues. The 
wild and wacky PC hardware has its problems, but that's _one_ thing we 
can't blame PC hardware for ;)

> > git grep create_freezeable_workthread
> 
> s/workthread/workqueue/

Yes.

> > and ponder the end results of that grep. If you don't see something wrong, 
> > you're blind.
> 
> This was a mistake, quite unrelated to the point you're making.

Did you actually _do_ the "grep" (with the fixed argument)?

I had two totally independent points. #1 was that you yourself have been 
fixing bugs in this area. #2 was the result of that grep. It's absolutely 
_empty_ except for the define to add that interface.

NOBODY USES IT!

Now, grep for the same interface that creates _non_freezeable workqueues.

Put another way:

[EMAIL PROTECTED] linux]$ git grep create_workqueue | wc -l
35

[EMAIL PROTECTED] linux]$ git grep create_freezeable_workqueue | wc -l
1

and that _one_ hit you get for the "freezeable" case is not actually a 
user, it's the definition!

Ie my point is, nobody wants freezeable kernel threads. Absolutely nobody.

Yet we have all this support for freezing them (or rather, we freeze them 
by default, and then we have all this support for _not_ doing that wrong 
default thing!)

So yes, I think it would be interesting to just stop freezing kernel 
threads. Totally.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Back to the future.

2007-04-27 Thread Rafael J. Wysocki
On Saturday, 28 April 2007 03:03, Kyle Moffett wrote:
> On Apr 27, 2007, at 18:07:46, Nigel Cunningham wrote:
> > Hi.
> >
> > On Fri, 2007-04-27 at 14:44 -0700, Linus Torvalds wrote:
> >> It makes it harder to debug (wouldn't it be *nice* to just ssh in,  
> >> and do
> >>gdb -p 
> >
> > Make the machine being suspended a VM and you can already do that.
> 
> >> when something goes wrong?) but we also *depend* on user space for  
> >> various things (the same way we depend on kernel threads, and why  
> >> it has been such a total disaster to try to freeze the kernel  
> >> threads too!). For example, if you want to do graphical stuff,  
> >> just using X would be quite nice,  wouldn't it?
> >
> > But in doing so you make the contents of the disk inconsistent with  
> > the state you've just snapshotted, leading to filesystem  
> > corruption. Even if you modify filesystems to do checkpointing  
> > (which is what we're really talking about), you still also have the  
> > problem that your snapshot has to be stored somewhere before you  
> > write it to disk, so you also have to either [snip]
> 
> Actually, it's a lot simpler than that.  We can just combine the  
> device-mapper snapshot with a VM+kernel snapshot system call and be  
> almost done:
> 
>sys_snapshot(dev_t snapblockdev, int __user *snapshotfd);
> 
> When sys_snapshot is run, the kernel does:
> 
> 1)  Sequentially freeze mounted filesystems using blockdev freezing.   
> If it's an fs that doesn't support freezing then either fail or force- 
> remount-ro that fs and downgrade all its filedescriptors to RO.   
> Doesn't need extra locking since process which try to do IO either  
> succeed before the freeze call returns for that blockdev or sleep on  
> the unfreeze of that blockdev.  Filesystems are synchronized and made  
> clean.
> 2)  Iterate over the userspace process list, freezing each process  
> and remapping all of its pages copy-on-write.  Any device-specific  
> pages need to have state saved by that device.

Why do you want to do 2) after 1) and not vice versa?

> 3)  All processes (except kernel threads) are now frozen.
> 4)  Kernel should save internal state corresponding to current  
> userspace state.  The kernel also swaps out excess pages to free up  
> enough RAM and prepares the snapshot file-descriptor with copies of  
> kernel memory and the original (pre-COW) mapped userspace pages.
> 5)  Kernel substitutes filesystems for either a device-mapper  
> snapshot with snapblockdev as backing storage or union with tmpfs and  
> remounts the underlying filesystems as read-only.
> 6)  Kernel unfreezes all userspace processes and returns the snapshot  
> FD to userspace (where it can be read from).

Okay, but how do we do the error recovery if, for example, the image cannot
be saved?

> Then userspace can do whatever it wants.  Any changes to filesystems  
> mounted at the time of snapshot will be discarded at shutdown.   
> Freshly mounted filesystems won't have the union or COW thing done,  
> and so you can write your snapshot to a compressed encrypted file on  
> a USB key if you want to, you just have to unmount it before the  
> snapshot() syscall and remount it right afterwards.

This seems to be a good idea.

Greetings,
Rafael
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: "REPORT: sd-0.46 vs cfs-v6 vs mainline 2.6.21-rc7 Beryl + Video + Audio"

2007-04-27 Thread Con Kolivas

On 27/04/07, hechacker1 <[EMAIL PROTECTED]> wrote:

"REPORT: sd-0.46 vs cfs-v6 vs mainline 2.6.21-rc7 Beryl + Video + Audio"

Hardware:
Dell Inspiron 700m laptop
1.7GHz Pentium M (Dothan 2M cache)
2GB RAM
1000Hz
Gentoo Linux
dyn-tick
700m # cat /sys/devices/system/cpu/cpu0/cpufreq/ondemand/sampling_rate
1 (microseconds, 10ms)
855gm integrated video/chipset
xf86-video-i810 (intel 1.7.4) DRI enabled
xorg-server-1.2.0-r3
beryl-core 0.3.0-svn
MPlayer dev-SVN-rUNKNOWN-4.1.2 - x11
Gnome totem 2.16.5 - x11-gstreamer
reiser4 w/cryptcompress

Screenshot:
http://ordorica.org/misc/beryl.png

muine playing mp3's off mounted windows share

Tests run under 16 bit color which provides a constant 75 fps
on one cube side (fps forced limited). Drops to ~45-50 fps during
animation/rotate/scale (depending on complexity of rendering)
Vsync off. 75Hz refresh 1280x800.

totem running fullscreen playing 700MB divx "An Inconvenient Truth.avi" on
one side of cube/desktop
gmplayer running fullscreen on another cube side (same file).

The given observations/numbers are when I move the cube with my mouse
and view two faces at one time (see screenshot). One face is playing the
totem video, the other containing my terminals.


Some numbers I've seen other people throw around:
I don't know their relevance.

cfs-v6:
700m kernel # cat sched_granularity_ns
500
procs ---memory-- ---swap-- -io -system-- cpu
 r  b   swpd   free   buff  cache   si   sobibo   in   cs us sy id wa
 5  0  0 221480300 139461200   181 0 6068 5317 69  6 25  0
 4  0  0 220880300 139526800   176 0 6147 5579 68  6 27  0
 1  0  0 220340300 139576800   167 0 6052 5393 70  6 24  0
 6  0  0 219920300 139620400   103 0 5830 5211 73  6 21  0

top - 18:31:17 up  7:45,  5 users,  load average: 5.18, 4.73, 4.28
Tasks:  98 total,   4 running,  94 sleeping,   0 stopped,   0 zombie
Cpu(s): 91.6%us,  6.4%sy,  0.0%ni,  0.3%id,  0.0%wa,  1.3%hi,  0.3%si,  0.0%st
Mem:   2057700k total,  1845952k used,   211748k free,  300k buffers
Swap:   987988k total,0k used,   987988k free,  1404040k cached

  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
18046 hechacke  20   0  189m  83m  20m S 38.7  4.2  12:04.64 totem
18059 hechacke  20   0 51280  30m  18m R 25.8  1.5   9:47.36 gmplayer
12117 root  20   0  275m  54m  18m R 20.2  2.7  15:18.38 Xorg
22730 hechacke  20   0  119m  35m  18m R  5.3  1.7   0:12.68 mono
12350 hechacke  20   0 63820 6776 4328 S  3.6  0.3   2:20.36 beryl
16465 hechacke  20   0 43960  15m  10m S  2.3  0.8   0:07.14 gnome-terminal
12200 hechacke  20   0  5308 4016 1740 S  0.3  0.2   0:05.45 gconfd-2
12215 hechacke  20   0 38704 8956 7588 S  0.3  0.4   0:08.90 xfce4-clipman-p

Observation:
Music plays perfectly.
Audio of video's play perfectly.
New processes take forever to start. Firefox (already cached in ram) takes
about 5 seconds to start; even right after closing it.
Browsing the web is slow.
Already open applications are responsive.
Behavior of video:
video's both moving forward. totem is updating about every half second.
mplayer updates about every 3 seconds.

-

cfs-v6:
700m kernel # cat sched_granularity_ns
200
procs ---memory-- ---swap-- -io -system-- cpu
 r  b   swpd   free   buff  cache   si   sobibo   in   cs us sy id wa
 5  0  0  99604 44 151936400 0 0 3903 5575 91  5  5  0
 3  0  0  99512 44 151936400 0 0 5990 6783 72  5 23  0
 3  0  0 100412 44 151936400 0 0 6858 7261 67  5 28  0
 1  0  0 100412 44 151936400 0 0 7426 7634 62  4 34  0
 4  0  0 100288 44 151936400 0 0 7039 7442 60  6 34  0

top - 19:05:09 up  8:18,  5 users,  load average: 3.62, 4.16, 4.28
Tasks:  98 total,   4 running,  94 sleeping,   0 stopped,   0 zombie
Cpu(s): 69.8%us,  5.0%sy,  0.0%ni, 24.5%id,  0.0%wa,  0.7%hi,  0.0%si,  0.0%st
Mem:   2057700k total,  2009396k used,48304k free,  300k buffers
Swap:   987988k total,0k used,   987988k free,  1555428k cached

  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
18059 hechacke  20   0 51332  30m  18m R 30.8  1.5  18:48.17 gmplayer
18046 hechacke  20   0  189m  83m  20m S 20.9  4.2  23:25.49 totem
12117 root  20   0  276m  57m  18m S  9.6  2.8  20:59.01 Xorg
22730 hechacke  20   0  129m  36m  18m R  8.6  1.8   1:28.59 mono
22930 hechacke  20   0 65480 8392 4320 S  4.0  0.4   0:53.38 beryl
12213 hechacke  20   0 34472 7680 6484 S  0.7  0.4   1:16.41 xfce4-battery-p

Observation:
Music plays perfectly.
Audio of video's play perfectly.
New processes take forever to start.
Browsing the web is slow.
Already open applications are responsive.
Behavior of video:
video's both moving forward. totem is updating about 

Re: Back to the future.

2007-04-27 Thread Rafael J. Wysocki
On Saturday, 28 April 2007 03:00, Matthew Garrett wrote:
> On Fri, Apr 27, 2007 at 05:18:16PM -0700, Jeremy Fitzhardinge wrote:
> 
> > Then you could use kexec for resume...
> 
> While that would certainly be nifty, I think we're arguably starting 
> from the wrong point here. Why are we booting a kernel, trying to poke 
> the hardware back into some sort of mock-quiescent state, freeing memory 
> and then (finally) overwriting the entire contents of RAM rather than 
> just doing all of this from the bootloader? Given the time spent in 
> kernel setup and unpacking initramfs nowadays, I'm willing to bet it'd 
> still be faster even if you're stuck using int 13 on x86.

Yes, that would be faster.

> http://apcmag.com/5873/page14 suggests that Intel is looking into this, 
> but I haven't heard anything more yet. To the best of my knowledge, this 
> is also how Windows manages things.

I think you're right.

Greetings,
Rafael
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Back to the future.

2007-04-27 Thread Kyle Moffett

On Apr 27, 2007, at 18:07:46, Nigel Cunningham wrote:

Hi.

On Fri, 2007-04-27 at 14:44 -0700, Linus Torvalds wrote:
It makes it harder to debug (wouldn't it be *nice* to just ssh in,  
and do

gdb -p 


Make the machine being suspended a VM and you can already do that.


when something goes wrong?) but we also *depend* on user space for  
various things (the same way we depend on kernel threads, and why  
it has been such a total disaster to try to freeze the kernel  
threads too!). For example, if you want to do graphical stuff,  
just using X would be quite nice,  wouldn't it?


But in doing so you make the contents of the disk inconsistent with  
the state you've just snapshotted, leading to filesystem  
corruption. Even if you modify filesystems to do checkpointing  
(which is what we're really talking about), you still also have the  
problem that your snapshot has to be stored somewhere before you  
write it to disk, so you also have to either [snip]


Actually, it's a lot simpler than that.  We can just combine the  
device-mapper snapshot with a VM+kernel snapshot system call and be  
almost done:


  sys_snapshot(dev_t snapblockdev, int __user *snapshotfd);

When sys_snapshot is run, the kernel does:

1)  Sequentially freeze mounted filesystems using blockdev freezing.   
If it's an fs that doesn't support freezing then either fail or force- 
remount-ro that fs and downgrade all its filedescriptors to RO.   
Doesn't need extra locking since process which try to do IO either  
succeed before the freeze call returns for that blockdev or sleep on  
the unfreeze of that blockdev.  Filesystems are synchronized and made  
clean.
2)  Iterate over the userspace process list, freezing each process  
and remapping all of its pages copy-on-write.  Any device-specific  
pages need to have state saved by that device.

3)  All processes (except kernel threads) are now frozen.
4)  Kernel should save internal state corresponding to current  
userspace state.  The kernel also swaps out excess pages to free up  
enough RAM and prepares the snapshot file-descriptor with copies of  
kernel memory and the original (pre-COW) mapped userspace pages.
5)  Kernel substitutes filesystems for either a device-mapper  
snapshot with snapblockdev as backing storage or union with tmpfs and  
remounts the underlying filesystems as read-only.
6)  Kernel unfreezes all userspace processes and returns the snapshot  
FD to userspace (where it can be read from).


Then userspace can do whatever it wants.  Any changes to filesystems  
mounted at the time of snapshot will be discarded at shutdown.   
Freshly mounted filesystems won't have the union or COW thing done,  
and so you can write your snapshot to a compressed encrypted file on  
a USB key if you want to, you just have to unmount it before the  
snapshot() syscall and remount it right afterwards.


Cheers,
Kyle Moffett

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Back to the future.

2007-04-27 Thread Jeremy Fitzhardinge
Matthew Garrett wrote:
> While that would certainly be nifty, I think we're arguably starting 
> from the wrong point here. Why are we booting a kernel, trying to poke 
> the hardware back into some sort of mock-quiescent state, freeing memory 
> and then (finally) overwriting the entire contents of RAM rather than 
> just doing all of this from the bootloader?

Sure, you could make suspend generate a complete bootable kernel image
containing all RAM.  Doesn't sound too hard to me.  You know, from over
here on the sidelines.

J
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Back to the future.

2007-04-27 Thread Matthew Garrett
On Fri, Apr 27, 2007 at 05:18:16PM -0700, Jeremy Fitzhardinge wrote:

> Then you could use kexec for resume...

While that would certainly be nifty, I think we're arguably starting 
from the wrong point here. Why are we booting a kernel, trying to poke 
the hardware back into some sort of mock-quiescent state, freeing memory 
and then (finally) overwriting the entire contents of RAM rather than 
just doing all of this from the bootloader? Given the time spent in 
kernel setup and unpacking initramfs nowadays, I'm willing to bet it'd 
still be faster even if you're stuck using int 13 on x86.

http://apcmag.com/5873/page14 suggests that Intel is looking into this, 
but I haven't heard anything more yet. To the best of my knowledge, this 
is also how Windows manages things.
-- 
Matthew Garrett | [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Back to the future.

2007-04-27 Thread Rafael J. Wysocki
On Saturday, 28 April 2007 01:59, Linus Torvalds wrote:
> 
> On Sat, 28 Apr 2007, Rafael J. Wysocki wrote:
> >
> > Actually, the less things happen while we're creating and saving the image,
> > the less sources of potential problems there are and by freezing the kernel
> > threads (not all of them), we cause less things to happen at that time.
> 
> That makes no sense.
> 
> You have to create the snapshot image with interrupts disabled *anyway*.
> 
> I really don't see how you can say that stopping threads etc can make any 
> difference what-so-ever. If you don't create the snapshot with interrupts 
> disabled (and just with a single CPU running) you have so many other 
> problems that it's not even remotely funny.
> 
> So there's *by*definition* nothing at all that can happen while you 
> snapshot the system. Claiming otherwise is just silly.

For creating the snapshot alone, it doesn't matter.  Except that the restore is
cleaner a bit (we know exactly what all of these threads will be doing when
we restore the image and enable the IRQs after that).

Still, I think that kernel threads can potentailly hold locks accross the
freezing of devices and image creation and that is fishy.  Also I believe,
although I'm not 100% sure, that some of them may cause problems to
appear after we've created the image and while we are saving it.

> > To make you happy, we could stop doing that, but what actual _advantage_
> > that would bring?
> 
> Like getting rid of all the magic "I don't want you to freeze me" crud?

And what exactly is wrong with it?

> Or getting rid of this horribly idiotic "three times widdershins" kind of 
> black magic mentality! It looks like the main reason for the process 
> freezing has nothing to do with technology, but some irrational fear of 
> other things happening at the same time, even though they CANNOT happen if 
> you do things even half-way sanely.
> 
> The "let's stop all kernel threads" is superstition. It's the same kind of 
> superstition that made people write "sync" three times before turning off 
> the power in the olden times. It's the kind of superstition that comes 
> from "we don't do things right, so let's be vewy vewy quiet and _pray_ 
> that it works when we are beign quiet".
>
> That's bad.

Okay.  Accidentally, I'm working on a freezer patch, so I'll probably drop
the freezing of kernel threads from swsusp in it and we'll see what happens.

Let's do the experiment, shall we?

> It's doubly bad, because that idiocy has also infected s2ram. Again, 
> another thing that really makes no sense at all - and we do it not just 
> for snapshotting, but for s2ram too. Can you tell me *why*?

Why we freeze tasks at all or why we freeze kernel threads?

> > > Trying to freeze kernel threads has _caused_ problems. It has _added_ 
> > > these interdependencies. It hasn't removed a single dependency at any 
> > > time, it has just added new problems!
> > 
> > What problems are you talking about?
> 
> Like you wouldn't know. Look at commit b43376927a that you yourself are 
> credited with, just a month ago. 
> 
> Then, do something as simple as
> 
>   git grep create_freezeable_workthread

s/workthread/workqueue/

> and ponder the end results of that grep. If you don't see something wrong, 
> you're blind.

This was a mistake, quite unrelated to the point you're making.  And actually,
I was trying to fix a problem with two kernel threads that we thought might
submit I/O to disk after the image had been created.  Otherwise I wouldn't
have thought of doing that change.

> > > NONE of these are valid explanations at all. You're listing totally 
> > > theoretical problems, and ignoring all the _real_ problems that trying to 
> > > freeze kernel threads has _caused_.
> > 
> > Example, please?
> 
> Who do you think you are kidding? See above.

Well, if someone does something in a wrong way, that need not mean the
thing he was trying to do was wrong.

Somehow, I knew you would point at this ...

> And if you think that's an isolated example, look again. And start 
> grepping for PF_NOFREEZE, and other examples.

May I say I'm not convinced?

> The fact is, there is not a *single* reason to freeze kernel threads. But 
> some rocket scientist decided to, and then screwed everybody else over.

At least _that_ wasn't me. :-)

Greetings,
Rafael
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Back to the future.

2007-04-27 Thread Paul Mackerras
Linus Torvalds writes:

> I really don't see how you can say that stopping threads etc can make any 
> difference what-so-ever. If you don't create the snapshot with interrupts 
> disabled (and just with a single CPU running) you have so many other 
> problems that it's not even remotely funny.

I agree.  I don't like the freezer.  We have had working
kernel-controlled suspend to RAM on powerbooks for almost 10 years
now, and we never needed to freeze processes.

That said, I can see two attractions in freezing processes:

1. It provides a way to stop new I/O requests coming in, and thus
   somewhat makes up for the lack of a way to freeze device request
   queues (at least, we didn't have one last time I looked).

2. Systems do sometimes die while suspended (e.g. run out of battery,
   or the resume process fails), and to make the next boot painless,
   you want the filesystems on disk to be as clean as possible.
   Freezing processes and then doing a sync provides one way to
   achieve that.  Of course, you have to make sure you don't freeze
   any kernel threads that are needed for doing the sync...  And if
   one of your filesystems is using FUSE, it's not going to get very
   far.

Paul.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[git patches] net driver fixes

2007-04-27 Thread Jeff Garzik

As mentioned previously, the big batch queued for 2.6.22 is coming
after the dust settles.


[EMAIL PROTECTED] folks:  the sis900 patch should be in 2.6.21.x


Please pull from 'upstream-linus' branch of
master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/netdev-2.6.git 
upstream-linus

to receive the following updates:

 drivers/net/sis900.c  |9 +
 drivers/usb/net/pegasus.c |   10 --
 drivers/usb/net/pegasus.h |3 +--
 3 files changed, 6 insertions(+), 16 deletions(-)

Dan Williams (1):
  usb-net/pegasus: simplify carrier detection

Neil Horman (1):
  sis900: Allocate rx replacement buffer before rx operation

diff --git a/drivers/net/sis900.c b/drivers/net/sis900.c
index dea0126..2cb2e15 100644
--- a/drivers/net/sis900.c
+++ b/drivers/net/sis900.c
@@ -1753,6 +1753,7 @@ static int sis900_rx(struct net_device *net_dev)
sis_priv->rx_ring[entry].cmdsts = RX_BUF_SIZE;
} else {
struct sk_buff * skb;
+   struct sk_buff * rx_skb;
 
pci_unmap_single(sis_priv->pci_dev,
sis_priv->rx_ring[entry].bufptr, RX_BUF_SIZE,
@@ -1786,10 +1787,10 @@ static int sis900_rx(struct net_device *net_dev)
}
 
/* give the socket buffer to upper layers */
-   skb = sis_priv->rx_skbuff[entry];
-   skb_put(skb, rx_size);
-   skb->protocol = eth_type_trans(skb, net_dev);
-   netif_rx(skb);
+   rx_skb = sis_priv->rx_skbuff[entry];
+   skb_put(rx_skb, rx_size);
+   rx_skb->protocol = eth_type_trans(rx_skb, net_dev);
+   netif_rx(rx_skb);
 
/* some network statistics */
if ((rx_status & BCAST) == MCAST)
diff --git a/drivers/usb/net/pegasus.c b/drivers/usb/net/pegasus.c
index 1ad4ee5..a05fd97 100644
--- a/drivers/usb/net/pegasus.c
+++ b/drivers/usb/net/pegasus.c
@@ -847,16 +847,6 @@ static void intr_callback(struct urb *urb)
 * d[0].NO_CARRIER kicks in only with failed TX.
 * ... so monitoring with MII may be safest.
 */
-   if (pegasus->features & TRUST_LINK_STATUS) {
-   if (d[5] & LINK_STATUS)
-   netif_carrier_on(net);
-   else
-   netif_carrier_off(net);
-   } else {
-   /* Never set carrier _on_ based on ! NO_CARRIER */
-   if (d[0] & NO_CARRIER)
-   netif_carrier_off(net); 
-   }
 
/* bytes 3-4 == rx_lostpkt, reg 2E/2F */
pegasus->stats.rx_missed_errors += ((d[3] & 0x7f) << 8) | d[4];
diff --git a/drivers/usb/net/pegasus.h b/drivers/usb/net/pegasus.h
index c7aadb4..c746782 100644
--- a/drivers/usb/net/pegasus.h
+++ b/drivers/usb/net/pegasus.h
@@ -11,7 +11,6 @@
 
 #definePEGASUS_II  0x8000
 #defineHAS_HOME_PNA0x4000
-#defineTRUST_LINK_STATUS   0x2000
 
 #definePEGASUS_MTU 1536
 #defineRX_SKBS 4
@@ -204,7 +203,7 @@ PEGASUS_DEV( "AEI USB Fast Ethernet Adapter", 
VENDOR_AEILAB, 0x1701,
 PEGASUS_DEV( "Allied Telesyn Int. AT-USB100", VENDOR_ALLIEDTEL, 0xb100,
DEFAULT_GPIO_RESET | PEGASUS_II )
 PEGASUS_DEV( "Belkin F5D5050 USB Ethernet", VENDOR_BELKIN, 0x0121,
-   DEFAULT_GPIO_RESET | PEGASUS_II | TRUST_LINK_STATUS )
+   DEFAULT_GPIO_RESET | PEGASUS_II )
 PEGASUS_DEV( "Billionton USB-100", VENDOR_BILLIONTON, 0x0986,
DEFAULT_GPIO_RESET )
 PEGASUS_DEV( "Billionton USBLP-100", VENDOR_BILLIONTON, 0x0987,
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Back to the future.

2007-04-27 Thread Linus Torvalds


On Fri, 27 Apr 2007, David Lang wrote:
> 
> all that's needed for the snapshot is to prevent userspace from scheduling,

Strictly speaking, all you *really* want to make sure is not so much that 
user-space isn't scheduling, as the fact that all device IO buffers must 
be empty.

We can trivially snapshot an active user-space, and in fact it would 
probably be hard to do a snapshot in a way that it could even *know* or 
care about whether there are user-space processes running at the time of 
the snapshot.

So that's not the real problem.

What we obviously *cannot* snapshot is if some particular device is in the 
middle of being written to or read from, and has outstanding commands on 
the device itself (as opposed to just queued to the driver). So what we do 
want to make sure happens is that there are no IO queues that are active.

And the best way to make sure that there are no IO queues active is to 
make sure that there are no new read or write-requests. And *that* you can 
do two ways:

 - actually intercepting the read/write requests. Probably not too hard, 
   we could literally do it in the IO scheduler (and probably much more 
   easily than doing it in the process scheduler), but the easy cases will 
   only cover the block device layer, and character devices don't have the 
   same kind of scheduler you can trap IO in.

 - we also don't want to generate new data that needs to be snapshotted, 
   so we want to trap people who write even just to the page cache and 
   turn pages dirty. Again, we could probably do it at *that* point (ie 
   trapping them when they try to dirty a page), and it would be more 
   logical, but again, there are other cases of people who generate more 
   data (just any memory allocation obviously is a special case of 
   generating more data to be snapshotted),

so I do agree that we want to stop producing new data to be snapshotted, 
and we want to stop producing new read-requests. But kernel threads really 
do neither: in an idle system, kernel threads are idle too. A kernel 
thread is not like a user program that actually generates data - they only 
tend to act on behalf of other processes' needs.

So I think that what snapshotting really *wants* to stop is not schedulign 
per se, but IO. And stopping user processes (as opposed to kernel threads) 
is probably a good way to get there.

In fact, I'd argue that you want to stop user space and then encourage 
some kernel threads to *start* running, notably things like bdflush should 
probably be kicked to clean up some dirty stuff as part of the "shrink 
data to be snapshotted" part. Trying to free memory will do that on its 
own, of course.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm/memory.c: remove warning from an uninitialized spinlock. was: Re: 2.6.21-rc7-mm2

2007-04-27 Thread Jeremy Fitzhardinge
Andrew Morton wrote:
> --- 
> a/mm/memory.c~add-apply_to_page_range-which-applies-a-function-to-a-pte-range-fix
> +++ a/mm/memory.c
> @@ -1455,7 +1455,7 @@ static int apply_to_pte_range(struct mm_
>   pte_t *pte;
>   int err;
>   struct page *pmd_page;
> - spinlock_t *ptl;
> + spinlock_t *ptl = ptl;  /* Suppress gcc warning */
>  

Sigh.  I guess so.

J
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [GIT PATCH] UIO patches for 2.6.21

2007-04-27 Thread Hans-Jürgen Koch
Am Samstag 28 April 2007 01:04 schrieb Andrew Morton:
> On Fri, 27 Apr 2007 15:49:57 -0700
>
> Greg KH <[EMAIL PROTECTED]> wrote:
> > Here are the updated UIO (Userspace I/O driver framework) patches for
> > 2.6.21.
>
> I'm a bit uncertain about the whole UIO idea, really.  I have this vague
> feeling that we'd prefer to encourage people to move device drivers into
> GPL'ed kernel rather than encouraging them to do closed-source userspace
> implementations which will probably end up being slower, less reliable and
> unavailable on various architectures, distros, etc.
>
> But I don't think I have the capacity to actually think about this further
> - just tossing it out there ;)

Thanks for tossing it out ;-) I understand your uncertainty and I share your
opinion about encouraging industry developers to GPL their drivers. It really
took me some time until I understood that sometimes there are _good_ reasons
for a closed driver. UIO is not intended for mass products like graphic cards.
We're talking about companies who developed special hardware for use in
special applications like machine control. They sometimes need to keep a 
part of their driver closed, at least for some time. Sometimes it's because
they want to protect themselves, sometimes because their customer demands it.
Usually, they know about the disadvantages you mentioned (if they're our
customers, be sure we tell them!).

Anyway, UIO is not just a system to allow closed drivers. There are enough
other reasons why these industry developers want userspace drivers. The most
important one is that they're usually no experienced kernel developers. They
can let somebody write the kernel part for them, and then write their driver
using the tools and libraries they know, with floating point and all that 
stuff. It's just convenient. If I had to write a driver for a fieldbus card
today, I'd use UIO. And I'd make it free software. UIO doesn't force anybody 
to close his drivers.

>
> > They have been revamped from the last time you have seen them, and they
> > include a real driver, the Hilscher CIF DeviceNet and Profibus card
> > controller, which is being used in production systems with this driver
> > framework right now.  The kernel driver they replaced was a total mess,
> > with over 60+ ioctls to try to control the different aspects of the
> > device.  See the last patch in this series for more details on this
> > driver.
> >
> > These patches include full documentation, are self-contained from the
> > rest of the kernel, and have been in the -mm tree for the past few
> > months with no complaints.
> >
> > Please pull from:
> > master.kernel.org:/pub/scm/linux/kernel/git/gregkh/uio-2.6.git/
> >
> > Patches will be sent as a follow-on to this message to lkml for people
> > to see.
> >
> >  drivers/uio/uio_cif.c |  156 
>
> eh?  How come a particular device requires 156 lines of kernel code to
> support a userspace driver?  Doesn't that kind of defeat the point?

This is quite a large kernel module for an UIO device due to quite 
stupid hardware design. It needs two memory mappings, and the interrupt
handler is not the simplest thing possible. BTW, I don't think that
156 lines is so much. It allows to handle quite a complex PCI card. And
it's so simple that it can be even explained to industry programmers
who are no kernel gurus.

Thanks,
Hans

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Back to the future.

2007-04-27 Thread David Lang

On Sat, 28 Apr 2007, Nigel Cunningham wrote:


Hi.

On Sat, 2007-04-28 at 01:45 +0200, Rafael J. Wysocki wrote:

On Saturday, 28 April 2007 01:17, Linus Torvalds wrote:


On Sat, 28 Apr 2007, Rafael J. Wysocki wrote:



And can you name a _single_ advantage of doing so?


Yes.  We have a lot less interdependencies to worry about during the whole
operation.


That's not an advantage. That's why it has *sucked*.


Actually, the less things happen while we're creating and saving the image,
the less sources of potential problems there are and by freezing the kernel
threads (not all of them), we cause less things to happen at that time.

To make you happy, we could stop doing that, but what actual _advantage_
that would bring?


A couple of other advantages to freezing other processes:

1) It makes predicting how much memory is available for making and
saving snapshot a tractable problem. It therefore makes hibernation
_much_ more reliable.
2) Racing against other processes would also make hibernation slower,
increasing the chances of your battery running out before the save is
complete.
3) It makes finding potential memory leaks in the code possible. It was
ages ago now, but at one stage I could display a table saying exactly
how many pages had been allocated and freed by different sections of the
process and compare the number of free pages at the start and end of the
cycle to ensure there were no memory leaks at all.


nobody is suggesting that you leave peocesses running while you do the snapshot, 
what is being proposed is


1. pause userspace (prevent scheduling)
2. make snapshot image of memory
3. make mounted filesystems read-only (possibly with snapshot/checkpoint)
4. unpause
5. save image (with full userspace available, including network)
6. shutdown system (throw away all userspace memory, no need to do graceful
   shutdown or nice kill signals, revert filesystem to snapshot/checkpoint if
   needed)


NONE of these are valid explanations at all. You're listing totally
theoretical problems, and ignoring all the _real_ problems that trying to
freeze kernel threads has _caused_.


Example, please?


I agree with Rafael. Freezing processes greatly helps in ensuring we
have a consistent image. He's right, too, in asserting that it's even
more important for Suspend2. Freezing processes is essential to being
able to know that those LRU pages won't change and therefore being able
to save them separately and then reuse them for the atomic copy.


all that's needed for the snapshot is to prevent userspace from scheduling, and 
prevent media from being written to in a permanent way (writing to a LVM volume 
after invoking a snapshot doesn't count, just revert to the snapshot)


David Lang
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm/memory.c: remove warning from an uninitialized spinlock. was: Re: 2.6.21-rc7-mm2

2007-04-27 Thread Andrew Morton
On Thu, 26 Apr 2007 20:25:19 +0200
Borislav Petkov <[EMAIL PROTECTED]> wrote:

> 
> Remove build warning mm/memory.c:1491: warning: 'ptl' may be used 
> uninitialized in this function.
> The spinlock pointer is assigned to null since it gets overwritten right away 
> in
> pte_alloc_map_lock().
> 
> Signed-off-by: Borislav Petkov <[EMAIL PROTECTED]>
> ---
> 
> Index: linux-mm/mm/memory.c
> ===
> --- linux-mm.orig/mm/memory.c2007-04-26 19:57:14.0 +0200
> +++ linux-mm/mm/memory.c 2007-04-26 20:00:30.0 +0200
> @@ -1488,7 +1488,7 @@
> pte_t *pte;
> int err;
> struct page *pmd_page;
> -   spinlock_t *ptl;
> +   spinlock_t *ptl = NULL;
> 
> pte = (mm == _mm) ?
> pte_alloc_kernel(pmd, addr) :
> 

yes, I've been staring unhappily at this for some time.

Your change adds seven bytes of text to this function for no runtime
benefit, just to fix a build-time warning.  It's a general problem.


Often we just leave the warning in place and curse gcc each time it flies
past.  Sometimes the code can be restructured in a sensible fashion to
avoid the warning; often it cannot.

But I don't think I want to put up with a warning coming out of core MM all
the time so let's go with the following silliness which adds no additional
runtime cost.

--- 
a/mm/memory.c~add-apply_to_page_range-which-applies-a-function-to-a-pte-range-fix
+++ a/mm/memory.c
@@ -1455,7 +1455,7 @@ static int apply_to_pte_range(struct mm_
pte_t *pte;
int err;
struct page *pmd_page;
-   spinlock_t *ptl;
+   spinlock_t *ptl = ptl;  /* Suppress gcc warning */
 
pte = (mm == _mm) ?
pte_alloc_kernel(pmd, addr) :
_

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Back to the future.

2007-04-27 Thread Linus Torvalds


On Fri, 27 Apr 2007, Linus Torvalds wrote:
> 
> The "let's stop all kernel threads" is superstition. It's the same kind of 
> superstition that made people write "sync" three times before turning off 
> the power in the olden times. It's the kind of superstition that comes 
> from "we don't do things right, so let's be vewy vewy quiet and _pray_ 
> that it works when we are beign quiet".

Side note: while I think things should probably *work* even with user 
processes going full bore while a snapshot it taken, I'll freely admit 
that I'll follow that superstition far enough that I think it's probably a 
good idea to try to quiesce the system to _some_ degree, and that stopping 
user programs is a good idea. Partly because the whole memory shrinking 
thing, and partly just because we should do the snapshot with hw IO queues 
empty.

But I don't think it would necessarily be wrong (and in many ways it would 
probably be *right*) to do that IO queue stopping at the queue level 
rather than at a process level. Why stop processes just becasue you want 
to clean out IO queues? They are two totally different things!

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Back to the future.

2007-04-27 Thread Jeremy Fitzhardinge
Linus Torvalds wrote:
> On Fri, 27 Apr 2007, Rafael J. Wysocki wrote:
>   
>> Why do you think that keeping the user space frozen after 'snapshot' is a bad
>> idea?  I think that solves many of the problems you're discussing.
>> 
>
> It makes it harder to debug (wouldn't it be *nice* to just ssh in, and do
>
>   gdb -p 
>
> when something goes wrong?)

Yeah, or gdb vmlinux snapshot

Then you could use kexec for resume...

J
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/2] ehea: fix for sysfs entries

2007-04-27 Thread Jeff Garzik

Thomas Klein wrote:

Create symbolic link from each logical port to ehea driver

Signed-off-by: Thomas Klein <[EMAIL PROTECTED]>
---


This patch applies on top of the netdev upstream branch for 2.6.22


applied 1-2


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.21-rc7-mm2 -- x86_64 blade hard hangs

2007-04-27 Thread Mel Gorman

On Fri, 27 Apr 2007, Siddha, Suresh B wrote:


On Fri, Apr 27, 2007 at 12:07:10PM +0100, Mel Gorman wrote:

On (26/04/07 16:40), Siddha, Suresh B didst pronounce:

oops. Appended patch should fix this. Can you please check this and Ack it?


This patch does not apply cleanly to 2.6.21-rc7-mm2.


Mel, Please backout the existing  x86_64-set-node_possible_map-at-runtime.patch
in rc7-mm2  and apply the appended patch instead.



I backed out

broken-out/x86_64-set-node_possible_map-at-runtime.patch
broken-out/x86_64-set-node_possible_map-at-runtime-fix.patch
broken-out/x86_64-set-node_possible_map-at-runtime-fix-2.patch

and dropped in your new patch. It passed boot tests on the machine in 
question, so just from a testing perspective


Acked-by: Mel Gorman <[EMAIL PROTECTED]>


Andrew, as you already backedout x86_64-set-node_possible_map-at-runtime.patch
from your -mm series, please include the appended patch (as try 2), after
Mel confirms that it works fine on his setup.

Thanks!



Thank you.


---
From: Suresh Siddha <[EMAIL PROTECTED]>
[patch] x86_64: set node_possible_map at runtime - try 2

Set the node_possible_map at runtime on x86_64.  On a non NUMA system,
num_possible_nodes() will now say '1'.

Signed-off-by: Suresh Siddha <[EMAIL PROTECTED]>
Cc: Andi Kleen <[EMAIL PROTECTED]>
Cc: Eric Dumazet <[EMAIL PROTECTED]>
Cc: David Rientjes <[EMAIL PROTECTED]>
Cc: Christoph Lameter <[EMAIL PROTECTED]>
---

diff -pNru linux/arch/x86_64/mm/k8topology.c linux~/arch/x86_64/mm/k8topology.c
--- linux/arch/x86_64/mm/k8topology.c   2007-04-27 10:37:19.0 -0700
+++ linux~/arch/x86_64/mm/k8topology.c  2007-04-27 10:34:10.0 -0700
@@ -49,11 +49,8 @@ int __init k8_scan_nodes(unsigned long s
int found = 0;
u32 reg;
unsigned numnodes;
-   nodemask_t nodes_parsed;
unsigned dualcore = 0;

-   nodes_clear(nodes_parsed);
-
if (!early_pci_allowed())
return -1;

@@ -102,7 +99,7 @@ int __init k8_scan_nodes(unsigned long s
   nodeid, (base>>8)&3, (limit>>8) & 3);
return -1;
}
-   if (node_isset(nodeid, nodes_parsed)) {
+   if (node_isset(nodeid, node_possible_map)) {
printk(KERN_INFO "Node %d already present. Skipping\n",
   nodeid);
continue;
@@ -155,7 +152,7 @@ int __init k8_scan_nodes(unsigned long s

prevbase = base;

-   node_set(nodeid, nodes_parsed);
+   node_set(nodeid, node_possible_map);
}

if (!found)
diff -pNru linux/arch/x86_64/mm/numa.c linux~/arch/x86_64/mm/numa.c
--- linux/arch/x86_64/mm/numa.c 2007-04-27 10:37:19.0 -0700
+++ linux~/arch/x86_64/mm/numa.c2007-04-27 10:34:10.0 -0700
@@ -298,7 +298,7 @@ static int __init setup_node_range(int n
ret = -1;
}
nodes[nid].end = *addr;
-   node_set_online(nid);
+   node_set(nid, node_possible_map);
printk(KERN_INFO "Faking node %d at %016Lx-%016Lx (%LuMB)\n", nid,
   nodes[nid].start, nodes[nid].end,
   (nodes[nid].end - nodes[nid].start) >> 20);
@@ -482,7 +482,7 @@ out:
 * SRAT.
 */
remove_all_active_ranges();
-   for_each_online_node(i) {
+   for_each_node_mask(i, node_possible_map) {
e820_register_active_regions(i, nodes[i].start >> PAGE_SHIFT,
nodes[i].end >> PAGE_SHIFT);
setup_node_bootmem(i, nodes[i].start, nodes[i].end);
@@ -497,20 +497,25 @@ void __init numa_initmem_init(unsigned l
{
int i;

+   nodes_clear(node_possible_map);
+
#ifdef CONFIG_NUMA_EMU
if (cmdline && !numa_emulation(start_pfn, end_pfn))
return;
+   nodes_clear(node_possible_map);
#endif

#ifdef CONFIG_ACPI_NUMA
if (!numa_off && !acpi_scan_nodes(start_pfn << PAGE_SHIFT,
  end_pfn << PAGE_SHIFT))
return;
+   nodes_clear(node_possible_map);
#endif

#ifdef CONFIG_K8_NUMA
if (!numa_off && !k8_scan_nodes(start_pfn<

--
Mel Gorman
Part-time Phd Student  Linux Technology Center
University of Limerick IBM Dublin Software Lab
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Back to the future.

2007-04-27 Thread Linus Torvalds


On Sat, 28 Apr 2007, Rafael J. Wysocki wrote:
>
> Actually, the less things happen while we're creating and saving the image,
> the less sources of potential problems there are and by freezing the kernel
> threads (not all of them), we cause less things to happen at that time.

That makes no sense.

You have to create the snapshot image with interrupts disabled *anyway*.

I really don't see how you can say that stopping threads etc can make any 
difference what-so-ever. If you don't create the snapshot with interrupts 
disabled (and just with a single CPU running) you have so many other 
problems that it's not even remotely funny.

So there's *by*definition* nothing at all that can happen while you 
snapshot the system. Claiming otherwise is just silly.

> To make you happy, we could stop doing that, but what actual _advantage_
> that would bring?

Like getting rid of all the magic "I don't want you to freeze me" crud? 

Or getting rid of this horribly idiotic "three times widdershins" kind of 
black magic mentality! It looks like the main reason for the process 
freezing has nothing to do with technology, but some irrational fear of 
other things happening at the same time, even though they CANNOT happen if 
you do things even half-way sanely.

The "let's stop all kernel threads" is superstition. It's the same kind of 
superstition that made people write "sync" three times before turning off 
the power in the olden times. It's the kind of superstition that comes 
from "we don't do things right, so let's be vewy vewy quiet and _pray_ 
that it works when we are beign quiet".

That's bad.

It's doubly bad, because that idiocy has also infected s2ram. Again, 
another thing that really makes no sense at all - and we do it not just 
for snapshotting, but for s2ram too. Can you tell me *why*?

> > Trying to freeze kernel threads has _caused_ problems. It has _added_ 
> > these interdependencies. It hasn't removed a single dependency at any 
> > time, it has just added new problems!
> 
> What problems are you talking about?

Like you wouldn't know. Look at commit b43376927a that you yourself are 
credited with, just a month ago. 

Then, do something as simple as

git grep create_freezeable_workthread

and ponder the end results of that grep. If you don't see something wrong, 
you're blind.

> > NONE of these are valid explanations at all. You're listing totally 
> > theoretical problems, and ignoring all the _real_ problems that trying to 
> > freeze kernel threads has _caused_.
> 
> Example, please?

Who do you think you are kidding? See above.

And if you think that's an isolated example, look again. And start 
grepping for PF_NOFREEZE, and other examples.

The fact is, there is not a *single* reason to freeze kernel threads. But 
some rocket scientist decided to, and then screwed everybody else over.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Back to the future.

2007-04-27 Thread Rafael J. Wysocki
On Saturday, 28 April 2007 01:01, David Lang wrote:
> On Sat, 28 Apr 2007, Rafael J. Wysocki wrote:
> 
> > On Saturday, 28 April 2007 00:26, David Lang wrote:
> >> On Sat, 28 Apr 2007, Rafael J. Wysocki wrote:
> >>
> > We're freezing many of them just fine. ;-)
> 
>  And can you name a _single_ advantage of doing so?
> >>>
> >>> Yes.  We have a lot less interdependencies to worry about during the whole
> >>> operation.
> >>>
>  It so happens, that most people wouldn't notice or care that kmirrord got
>  frozen (kernel thread picked at random - it might be one of the threads
>  that has gotten special-cased to not do that), but I have yet to hear a
>  single coherent explanation for why it's actually a good idea in the 
>  first
>  place.
> >>>
> >>> Well, I don't know if that's a 'coherent' explanation from your point of 
> >>> view
> >>> (probably not), but I'll try nevertheless:
> >>> 1) if the kernel threads are frozen, we know that they don't hold any 
> >>> locks
> >>> that could interfere with the freezing of device drivers,
> >>
> >> does teh process of freezing really wait until all locks have been 
> >> released?
> >
> > Yes, it does.
> >
> >>> 2) if they are frozen, we know, for example, that they won't call user 
> >>> mode
> >>> helpers or do similar things,
> >>
> >> this won't matter unless the user mode helpers are going to do I/O or other
> >> permanent changes
> >
> > Please note that even accessing a file may be a permanent change.
> 
> if accessing a file on a read-only filesystem changes that filesystem it's a 
> bug
> 
> see the recent thread about ext3 journal replays when mounting read-only as 
> an 
> example.

Oh well.  Is this really wrong to protect users from such bugs, if we can do
that?

> >>> 3) if they are frozen, we know that they won't submit I/O to disks and
> >>> potentially damage filesystems (suspend2 has much more problems with that
> >>> than swsusp, but still.  And yes, there have been bug reports related to 
> >>> it,
> >>> so it's not just my fantasy).
> >>
> >> if you have the filesystems checkpointed then I/O after the freeze won't 
> >> matter
> >> as you just revert to the checkpoint (and since this is going to be thrown 
> >> away
> >> it can stay in ram)
> >
> > In that case, I would agree.  Currently, however, we're not even close to 
> > this
> > point.
> >
> > The checkpointing of filesystems would be a very welcome feature, but 
> > there's
> > no anyone working on it right now, AFAICT.
> >
> >> if we are willing to make a break with the past to implement the new 
> >> snapshot
> >> capability, we should be able to use the LVM snapshot code to handle the
> >> filesystem
> >
> > Yes, we can do that, in principle, and screw all of the current users in the
> > process.  And finally we'd end up with something similar to what is done 
> > now,
> > IMHO.
> 
> however, the result may be a lot less 'special case pwoer management' code 
> and a 

Are you referring to some specific code?

> lot more re-use of code that's in place for other uses.

This already is happening.

> if work on the current versions was stopped (other then trying to avoid 
> regressions) and a new version (with new userspace tools) was built in a way 
> that satisfies everyone the old version could be phased out in a year or two 
> (per the normal feture removal process)

May I say it's not realistic?

> > And no, the things are not just totally broken, as it may follow from these
> > discussions.  The problem is that the people who are discussing them so
> > viciously have never tried to write anything like the hibernation code.
> >
> > This is as though as I were discussing the design of the CPU schedulers,
> > although I only know how they work on a general level.
> >
> > Actually, the really problematic thing with the hibernation _right_ _now_ is
> > what Linus is so concerned about (and rightfully so) - that we use the
> > same device drivers' callbacks for the hibernation and suspend (aka s2ram).
> > The other things work quite well and are really robust.
> 
> if simply splitting the functions cleans everything up enough to satisfy 
> everyone then we're almost done right? ;-)

Practically, yes.  Theoretically, there's no software you can't improve
(except, probably, TeX), but that might not be worth the effort.

> however I think that there are other fundamental disagreements here, and 
> neither 
> the 'do absolutly everything in the kernel' or the 'do almost nothing in the 
> kernel' approaches are going to fly in the long run.

I think we'll have an agreement, though.

> I think the userspace<->kernel interface is going to be different then
> either apprach is doing now,

You're probably right

> and as such it's an oppurtunity to make more drastic changes if they are
> appropriate.

Well, maybe.

> for example, why should we have LVM snapshot code and hibernate 
> snapshot/filesystem checkpoint code instead of just useing the LVM code 
> 

Re: BAD_SG_DMA panic in aha1542

2007-04-27 Thread Bob Tracy
Alan Cox wrote:
> > As before, no problems using the sda hard disk (which is the boot drive):
> > everything works reliably until I touch the cdrom drive.
> 
> A little quiet contemplation and gnome number 387 suggests trying the 
> following
> (and providing more detailed information such as the last message printed 
> before
> the DMA message). Stuff a BUG() before the panic in BAD_DMA (aha1542.c) if 
> needed
> to get a good trace.
> 
> Please report success/failure/change.

Can do.  I don't have access to the machine on weekends, so it will be
at least Monday before I can give this a whirl.  Thanks!

-- 
---
Bob Tracy   WTO + WIPO = DMCA? http://www.anti-dmca.org
[EMAIL PROTECTED]
---
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Back to the future.

2007-04-27 Thread Nigel Cunningham
Hi.

On Sat, 2007-04-28 at 01:45 +0200, Rafael J. Wysocki wrote:
> On Saturday, 28 April 2007 01:17, Linus Torvalds wrote:
> > 
> > On Sat, 28 Apr 2007, Rafael J. Wysocki wrote:
> > >
> > > > And can you name a _single_ advantage of doing so?
> > > 
> > > Yes.  We have a lot less interdependencies to worry about during the whole
> > > operation.
> > 
> > That's not an advantage. That's why it has *sucked*.
> 
> Actually, the less things happen while we're creating and saving the image,
> the less sources of potential problems there are and by freezing the kernel
> threads (not all of them), we cause less things to happen at that time.
> 
> To make you happy, we could stop doing that, but what actual _advantage_
> that would bring?

A couple of other advantages to freezing other processes:

1) It makes predicting how much memory is available for making and
saving snapshot a tractable problem. It therefore makes hibernation
_much_ more reliable.
2) Racing against other processes would also make hibernation slower,
increasing the chances of your battery running out before the save is
complete.
3) It makes finding potential memory leaks in the code possible. It was
ages ago now, but at one stage I could display a table saying exactly
how many pages had been allocated and freed by different sections of the
process and compare the number of free pages at the start and end of the
cycle to ensure there were no memory leaks at all.

> > Trying to freeze kernel threads has _caused_ problems. It has _added_ 
> > these interdependencies. It hasn't removed a single dependency at any 
> > time, it has just added new problems!
> 
> What problems are you talking about?
> 
> > > 1) if the kernel threads are frozen, we know that they don't hold any 
> > > locks
> > > that could interfere with the freezing of device drivers,
> > > 2) if they are frozen, we know, for example, that they won't call user 
> > > mode
> > > helpers or do similar things,
> > > 3) if they are frozen, we know that they won't submit I/O to disks and
> > > potentially damage filesystems (suspend2 has much more problems with that
> > > than swsusp, but still.  And yes, there have been bug reports related to 
> > > it,
> > > so it's not just my fantasy).
> > 
> > NONE of these are valid explanations at all. You're listing totally 
> > theoretical problems, and ignoring all the _real_ problems that trying to 
> > freeze kernel threads has _caused_.
> 
> Example, please?

I agree with Rafael. Freezing processes greatly helps in ensuring we
have a consistent image. He's right, too, in asserting that it's even
more important for Suspend2. Freezing processes is essential to being
able to know that those LRU pages won't change and therefore being able
to save them separately and then reuse them for the atomic copy.

> > If you want to control user-mode helpers, you do that - you do not freeze 
> > kernel threads!
> > 
> > And no, kernel threads do not submit IO to disks on their own. You just 
> > made that up.
> 
> No, I didn't.  Nigel can confirm, I think.

I have had problems with MD threads generating I/O that I couldn't
account for - after userspace had been frozen, filesystems had been
nicely synced and so on. I have to speak with reservations though,
because I haven't yet gotten to the bottom of where the I/O is coming
from... too many things, too small time slices.

> > Yes, they can be involved in that whole disk submission thing, but in a good
> > way - they can be required in order to make disk writing work!
> 
> Some of them can be, some other's need not be.  We don't need any fs-related
> kernel threads for saving the image, for example.

Yeah, so long as we bmap the storage we want to use beforehand (thinking
of swap files and ordinary files).

> > The problem that suspend has had is that it's done everything totally the 
> > wrong way around. Do kernel threads do disk IO? Sure, if asked to do so. 
> 
> They can be asked before we do the snapshot and complete the operation
> afterwards, no?
> 
> > For example, kernel threads can be involved in md etc, but that's a *good* 
> > thing.
> 
> We don't freeze these threads.
> 
> > The way to shut them up is not to freeze the threads, but to freeze the 
> > *disk*.
> 
> In principle, you're right.  In practice, go and try it.

I have to disagree here. Freezing the disk instead of the threads is
dealing with the symptoms instead of the cause.

Regards,

Nigel


signature.asc
Description: This is a digitally signed message part


Re: 2.6.21-rc7: known regressions

2007-04-27 Thread Domenico Andreoli
On Sat, Apr 28, 2007 at 12:38:07AM +0200, Michal Piotrowski wrote:
> Hi all,
> 
> Here is a list of known regressions reported after 2.6.21 release.

if this was also on a wiki page...

1) contributors (also casual ones) may update it or add new entries
2) adding a "Forwarded-To:" field and a "renew" button, regression
   reports could be fired semi-automatically to the right recipients.
   also the casual reader might bug proper maintainer simply clicking
   on the button. grave regressions would get more clicks...
3) when the new release is cut, such page is converted and saved as known
   regression list
4) a mail filter on lkml could perform some bookkeeping so people hating
   web could simply drop a message and the wiki page could update itself
   (no, not abuse itself)
5) web lovers could simply click on the links to lkml to dig into
   the regression

it looks like a simple and rudimentary bug tracker, but it is only a
regression reminder with links in the lkml flow. a distilled human-driven
regression-oriented semi-automatic lkml archive.

yes, some smart hybrid wiki/php|python thing with a db which could be
used to automatically send regression reminders..

i'm not a web developer, i'm not able to suggest the right wiki-tool. i
could offer some space on a tiny server with bandwidth but without any
email capability.

> Feel free to add new regressions/remove fixed etc.
> http://lkr.wikidot.com/list
 
what?!? it is already on a wiki page? doh! then read only the other
non-wiki ideas...

'night
domenico

-[ Domenico Andreoli, aka cavok
 --[ http://www.dandreoli.com/gpgkey.asc
   ---[ 3A0F 2F80 F79C 678A 8936  4FEE 0677 9033 A20E BC50
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: BAD_SG_DMA panic in aha1542

2007-04-27 Thread Bob Tracy
James Bottomley wrote:
> On Fri, 2007-04-27 at 16:47 -0500, Bob Tracy wrote:
> > I previously reported an ISA DMA issue for the 2.6.12 kernel.  The issue
> > persists through at least 2.6.18.  SCSI controller is an Adaptec
> > AHA-1542B (ISA).
> > 
> > The action "mount -t iso9660 /dev/scd0 /mnt/cdrom -r"
> > 
> > produces
> > 
> > (cdrom detection messages as various modules autoload, then...)
> 
> Knowing what these messages are is would be helpful; it tells me what
> point in the initialisation it got to. 

Sorry about that...  I'm running the DSL-N distribution (based on
Knoppix), and having to transcribe the log messages by hand from the
console, i.e., there's no logfile to cut-and-paste from :-(.  I don't
have access to the machine except on weekdays, but I'll repeat the
crash first thing Monday morning and copy everything that's there...

> I'm interested.
> 
> This is clearly a use_sg==1 path that has failed to bounce the buffer
> for some reason ... and I was contemplating eliminating the GFP_DMA from
> our sr driver because I thought the block bouncing had it covered.
> 
> It might also be helpful to apply this patch.  It should give a stack
> trace of the problem command and not immediately panic the box.

I'll throw together a 2.6.21 kernel with this patch and give it a try.
Again, it will be at least Monday before you hear back from me on this.

Thanks!

-- 
---
Bob Tracy   WTO + WIPO = DMCA? http://www.anti-dmca.org
[EMAIL PROTECTED]
---
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Question about Reiser4 (how to boot it?)

2007-04-27 Thread lkml777
Thanks, that is certainly helpful, but that only mounts one directory
(partition) as Reiser4.

This I have already done.

I was more interested in how to have a whole partition dedicated to
Reiser4 and being able to boot into it.

By any chance did you do that?


On Sat, 28 Apr 2007 00:37:05 +0800, "Jeff Chua"
<[EMAIL PROTECTED]> said:
> On 4/27/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> 
> > Hi Jeff, could you outline the procedure that YOU used to get Reiser4
> > installed and running.
> 
> Pretty much the same as the steps from   ...
>  http://linuxhelp.150m.com/installs/compile-kernel.htm
> 
> cd /usr/src
> tar --use=bzip2 -xpf linux-2.6.21.tar.bz2
> ln -nsf linux-2.6.21 linux
> cd /usr/src/linux
> bzip2 -d -c /tmp/reiser4-for-2.6.20.patch.bz2 | patch -p1
> # copy your old .config here
> make menuconfig
> File systems  --->
>   <*> Reiser4 (EXPERIMENTAL)
> make
> make modules_install
> # copy ./i386/boot/bzImage to the boot directory
> # reboot
> 
> 
> # download, compile and install ...
>   libaal-1.0.5.tar.gz
>   reiser4progs-1.0.6.tar.gz
> 
> I got them from  ftp://ftp.namesys.com/pub/reiser4progs/
> 
> Take an unused partition, and create reiser4fs on it...
> 
> mkfs.reiser4 /dev/sda8
> mount /dev/sda8 /mnt
> 
> Or you may want to try it on a loop device ...
> 
> dd if=/dev/zero  of=disk1  bs=1024k count=100
> mkfs.reiser4 -yf disk1
> mount -o loop disk1 /u0
> 
> Here's an entry in /etc/fstab
> /dev/sda8/u3reiser4noatime  0 0
> 
> 
> I hope this is good enough to get you started.
> 
> Thanks,
> Jeff.
-- 
  
  [EMAIL PROTECTED]

-- 
http://www.fastmail.fm - And now for something completely different…

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Linux-2.6.21 hangs during post boot initialization phase

2007-04-27 Thread Peter Williams

Neil Horman wrote:

On Sat, Apr 28, 2007 at 12:28:28AM +1000, Peter Williams wrote:

Neil Horman wrote:

On Fri, Apr 27, 2007 at 04:05:11PM +1000, Peter Williams wrote:


Damn, This is what happens when I try to do things too quickly.  I missed one
spot in my last patch where I replaced skb with rx_skb.  Its not critical, but
it should improve sis900 performance by quite a bit.  This applies on top of the
last two patches.  Sorry about that.

Thanks & Regards
Neil

Signed-off-by: Neil Horman <[EMAIL PROTECTED]>


 sis900.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

 
diff --git a/drivers/net/sis900.c b/drivers/net/sis900.c

index 7e44939..db59dce 100644
--- a/drivers/net/sis900.c
+++ b/drivers/net/sis900.c
@@ -1790,7 +1790,7 @@ static int sis900_rx(struct net_device *net_dev)
/* give the socket buffer to upper layers */
rx_skb = sis_priv->rx_skbuff[entry];
skb_put(rx_skb, rx_size);
-   skb->protocol = eth_type_trans(rx_skb, net_dev);
+   rx_skb->protocol = eth_type_trans(rx_skb, net_dev);
netif_rx(rx_skb);
 
 			/* some network statistics */


My system also boots OK after I add this patch.  Can't tell whether it's 
improved the performance or not.


Peter
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ANNOUNCE] battery2 git repository

2007-04-27 Thread Andrew Morton
On Fri, 27 Apr 2007 03:29:02 +0400
Anton Vorontsov <[EMAIL PROTECTED]> wrote:

> You can get it using "git clone --reference linux-2.6 \
> git://git.infradead.org/users/cbou/battery2-2.6.git" command.

I added this to the -mm lineup.

Welcome to git.  This means that nobody looks at your code any more and you
get free rein to experiment with interesting innovations in VFS, MM and
security in the mainline kernel (well, not really - Linus does squint at
the diffstat).

But we do have a general problem that code which travels the
developer->git->mainline route is not getting sufficient review.  Please be
aware of this, and be as pushy as you like in sending your changes out to
mailing lists (including linux-kernel) to get them reviewed.

If you don't think they have received adequate review then send them again,
and shout at people - we'd all admire that.

Thanks.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Back to the future.

2007-04-27 Thread Rafael J. Wysocki
On Saturday, 28 April 2007 01:17, Linus Torvalds wrote:
> 
> On Sat, 28 Apr 2007, Rafael J. Wysocki wrote:
> >
> > > And can you name a _single_ advantage of doing so?
> > 
> > Yes.  We have a lot less interdependencies to worry about during the whole
> > operation.
> 
> That's not an advantage. That's why it has *sucked*.

Actually, the less things happen while we're creating and saving the image,
the less sources of potential problems there are and by freezing the kernel
threads (not all of them), we cause less things to happen at that time.

To make you happy, we could stop doing that, but what actual _advantage_
that would bring?

> Trying to freeze kernel threads has _caused_ problems. It has _added_ 
> these interdependencies. It hasn't removed a single dependency at any 
> time, it has just added new problems!

What problems are you talking about?

> > 1) if the kernel threads are frozen, we know that they don't hold any locks
> > that could interfere with the freezing of device drivers,
> > 2) if they are frozen, we know, for example, that they won't call user mode
> > helpers or do similar things,
> > 3) if they are frozen, we know that they won't submit I/O to disks and
> > potentially damage filesystems (suspend2 has much more problems with that
> > than swsusp, but still.  And yes, there have been bug reports related to it,
> > so it's not just my fantasy).
> 
> NONE of these are valid explanations at all. You're listing totally 
> theoretical problems, and ignoring all the _real_ problems that trying to 
> freeze kernel threads has _caused_.

Example, please?

> If you want to control user-mode helpers, you do that - you do not freeze 
> kernel threads!
> 
> And no, kernel threads do not submit IO to disks on their own. You just 
> made that up.

No, I didn't.  Nigel can confirm, I think.

> Yes, they can be involved in that whole disk submission thing, but in a good
> way - they can be required in order to make disk writing work!

Some of them can be, some other's need not be.  We don't need any fs-related
kernel threads for saving the image, for example.

> The problem that suspend has had is that it's done everything totally the 
> wrong way around. Do kernel threads do disk IO? Sure, if asked to do so. 

They can be asked before we do the snapshot and complete the operation
afterwards, no?

> For example, kernel threads can be involved in md etc, but that's a *good* 
> thing.

We don't freeze these threads.

> The way to shut them up is not to freeze the threads, but to freeze the 
> *disk*.

In principle, you're right.  In practice, go and try it.

Anyway, why is it so important that _all_ of the kernel threads be running
while the snapshot is created and saved?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] utimensat implementation

2007-04-27 Thread David Lang

On Fri, 27 Apr 2007, H. Peter Anvin wrote:


The main use of atime seems to be to figure out when something can be
automatically deleted.  Anyone else have other usage scenarios?


as a varient of this, I use it to help determine what files are actually needed 
when building a chroot sandbox.


David Lang
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


  1   2   3   4   5   6   7   8   9   10   >