date:20050120

Re: [PATCH] relayfs redux for 2.6.10: lean and mean

2005-01-20 Thread Greg KH

On Fri, Jan 21, 2005 at 06:27:43PM +1100, Peter Williams wrote:
> Greg KH wrote:
> >On Fri, Jan 21, 2005 at 01:15:28PM +1100, Peter Williams wrote:
> >
> >>Perhaps the logical solution is to implement debugfs in terms of relayfs?
> >
> >
> >What do you mean by this statement?
> 
> I mean that if, as you say, debugfs is very similar to relayfs only more 
> restricted (i.e. a debugging option) then it should be implementable as 
> an instance or specialization of the more general relayfs and that this 
> should be a better solution than two independent implementations of 
> similar functionality.

Ah.

No.

The implementations are not of the same functionality, or so Karim says.

thanks,

greg k-h
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: patch to fix set_itimer() behaviour in boundary cases

2005-01-20 Thread Arjan van de Ven


> > 
> > This one I meant to fix in the kernel fwiw; we can put that loop inside
> > the kernel easily I'm sure
> 
> Yes, but it will increase the data size of the timer...
> 

eh how?
the way I think it can be done is to just have multiple timers fire
until the total time is up. It's not a performance issue (a timer firing
every 24 days.. who cares, esp since such long delays are rare anyway)
after all...
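
The "multiple timers fire until the total time is up" idea can be sketched in userspace as below. This is only an illustration, not the kernel's timer code: MAX_CHUNK_TICKS is a hypothetical per-timer limit, and next_chunk() and count_firings() are made-up helper names.

```c
#include <stdint.h>

/* Hypothetical limit on the length of a single timer, in ticks (assumption). */
#define MAX_CHUNK_TICKS 1000u

/*
 * Return the length of the next timer to arm and subtract it from
 * *remaining. Returns 0 once the whole delay has been consumed.
 */
static uint64_t next_chunk(uint64_t *remaining)
{
    uint64_t chunk = *remaining < MAX_CHUNK_TICKS ? *remaining : MAX_CHUNK_TICKS;

    *remaining -= chunk;
    return chunk;
}

/* Count how many timer expirations a delay of 'total' ticks needs. */
static unsigned int count_firings(uint64_t total)
{
    unsigned int firings = 0;

    while (next_chunk(&total) != 0)
        firings++;
    return firings;
}
```

The timer itself only ever stores one short expiry plus the remaining total, which is why the extra firings are cheap when long delays are rare.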



Re: [PATCH]sched: Isochronous class v2 for unprivileged soft rt scheduling

2005-01-20 Thread Ingo Molnar

* Con Kolivas <[EMAIL PROTECTED]> wrote:

> In terms of recommendation, the latency of non-preemptible codepaths
> will be fastest in ext3 in 2.6 due to the nature of it constantly
> being examined, addressed and updated. That does not mean it has the
> fastest performance by any stretch of the imagination. [...]

i agree with the latency observation. But ext3 got two significant
performance boosts recently, at two ends of the performance spectrum:

- in the (lots-of-)small-files area: the addition of the htree feature

- in the large-files-throughput case: with the addition of the
  reservation feature.

ext3 installed by a recent distro should have both features enabled. (i
know for sure that Fedora Core 3 with the update/erratum kernel
installed will create ext3 filesystems that utilize both of these
features by default.) 

I encourage everyone to try the famous 'create and read 1 million small
files' test on both recent ext3 and on other filesystems.

Ingo

Re: 2.6.11-rc1 vs. PowerMac 8500/G3 (and VAIO laptop) [usb-storage oops]

2005-01-20 Thread David Woodhouse

On Thu, 2005-01-20 at 16:08 -0800, Greg KH wrote:
> Doh, sorry for missing this one.  I've applied your patch to my trees,
> and will show up in the next -mm release.

Actually I think John's problem was that the usb core code has now
_stopped_ doing this byteswapping, and he has a lsusb which is hacked to
expect it. So if you apply my patch you're preserving the userspace ABI
by reverting to the extremely stupid behaviour of byteswapping _some_ of
the fields in the descriptor we pass to userspace.
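
For comparison, a userspace tool that receives the descriptor untouched, in little-endian wire order, could decode the 16-bit fields itself. The sketch below assumes that layout; get_le16(), desc_vendor() and desc_product() are illustrative names, not lsusb or kernel functions, and the offsets are those of idVendor and idProduct in a standard USB device descriptor.

```c
#include <stdint.h>

/* Read a little-endian 16-bit value regardless of host byte order. */
static uint16_t get_le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Byte offsets of two 16-bit fields in a standard USB device descriptor. */
enum { OFF_ID_VENDOR = 8, OFF_ID_PRODUCT = 10 };

/* Vendor ID decoded from the raw descriptor bytes. */
static uint16_t desc_vendor(const uint8_t *desc)
{
    return get_le16(desc + OFF_ID_VENDOR);
}

/* Product ID decoded from the raw descriptor bytes. */
static uint16_t desc_product(const uint8_t *desc)
{
    return get_le16(desc + OFF_ID_PRODUCT);
}
```

Decoding every multi-byte field this way gives the same result on big-endian and little-endian hosts, so nothing depends on the kernel swapping only some of the fields.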

-- 
dwmw2


Re: [Linux-ATM-General] Kernel 2.6.10 and 2.4.29 Oops fore200e (fwd)

2005-01-20 Thread Lukasz Trabinski

On Tue, 18 Jan 2005, chas williams - CONTRACTOR wrote:
> the system keeps running right?  the error is a 'warning' that the
> fore200e driver is sleeping when it should not (probably while holding
> interrupts).  the schedule() around line 1782 is not a good idea since
> the fore200e_send() might not be running in a sleepable context.  just
> try commenting that line for now.

Sorry, but I don't understand which line; I am not a kernel guru. :/

oceanic:/usr/src/linux-2.4.29$ grep  fore200e_send * -r
drivers/atm/fore200e.c:fore200e_send(struct atm_vcc *vcc, struct sk_buff *skb)
drivers/atm/fore200e.c: send: fore200e_send,

It happened on 2.4.29, too. Is it an interrupt problem?
Below is the Oops from 2.4.29:
ksymoops 2.4.11 on i686 2.4.29.  Options used
 -V (default)
 -k /proc/ksyms (default)
 -l /proc/modules (default)
 -o /lib/modules/2.4.29/ (default)
 -m /lib/modules/2.4.29/System.map (specified)
kernel BUG at sched.c:564!
invalid operand: 
CPU:0
EIP:0010:[]Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010286
eax: 0018   ebx: f76d2088   ecx: c02b2000   edx: f7651f7c
esi:    edi:    ebp: c02b3cdc   esp: c02b3cac
ds: 0018   es: 0018   ss: 0018
Process swapper (pid: 0, stackpage=c02b3000)
Stack: c026b646 376e8c01 f470 0054 c02b2000 f7c95494 c02b2000 
   0054 f76d2088 0246 f76d3084 f76d00e8 f8843d42 f76d f950
   0038 0001 f67d7c10 0038  0038  001f
Call Trace:[] [] [] [] []
  [] [] [] [] [] []
  [] [] [] [] [] []
  [] [] [] [] [] []
  [] [] [] [] [] []
  []
Code: 0f 0b 34 02 3e b6 26 c0 e9 17 fb ff ff 0f 0b 2d 02 3e b6 26

EIP; c0114f57<=

ebx; f76d2088 <_end+3738b1bc/384fb194>
ecx; c02b2000 
edx; f7651f7c <_end+3730b0b0/384fb194>
ebp; c02b3cdc 
esp; c02b3cac 
Trace; f8843d42 <[fore_200e]fore200e_send+172/6d0>
Trace; c02599d6 
Trace; c01fe4a9 
Trace; c01f36df 
Trace; c020fa03 
Trace; c01fda4f 
Trace; c020f920 
Trace; c020e3c2 
Trace; c020f920 
Trace; c020d060 
Trace; c01fda4f 
Trace; c020d010 
Trace; c020cf4a 
Trace; c020d010 
Trace; c020bd09 
Trace; c01fda4f 
Trace; c020bb00 
Trace; c020b920 
Trace; c020bb00 
Trace; c01f3cb4 
Trace; c01f3e0d 
Trace; c01f3f55 
Trace; c011d0a6 
Trace; c0109296 
Trace; c0105330 
Trace; c010b938 
Trace; c0105330 
Trace; c0105359 
Trace; c01053f2 
Trace; c0105000 <_stext+0/0>
Code;  c0114f57 
 <_EIP>:
Code;  c0114f57<=
   0:   0f 0b ud2a  <=
Code;  c0114f59 
   2:   34 02 xor$0x2,%al
Code;  c0114f5b 
   4:   3eds
Code;  c0114f5c 
   5:   b6 26 mov$0x26,%dh
Code;  c0114f5e 
   7:   c0 e9 17  shr$0x17,%cl
Code;  c0114f61 
   a:   fbsti 
Code;  c0114f62 
   b:   ff(bad) 
Code;  c0114f63 
   c:   ff 0f decl   (%edi)
Code;  c0114f65 
   e:   0b 2d 02 3e b6 26 or 0x26b63e02,%ebp

 <0>Kernel panic: Aiee, killing interrupt handler!
--
*[ Łukasz Trąbiński ]*
SysAdmin @wsisiz.edu.pl

Re: oom killer gone nuts

2005-01-20 Thread Jens Axboe

On Thu, Jan 20 2005, Andrea Arcangeli wrote:
> On Thu, Jan 20, 2005 at 02:15:56PM +0100, Andries Brouwer wrote:
> > On Thu, Jan 20, 2005 at 01:34:06PM +0100, Jens Axboe wrote:
> > 
> > > Using current BK on my x86-64 workstation, it went completely nuts today
> > > killing tasks left and right with oodles of free memory available.
> > 
> > Yes, the fact that the oom-killer exists is a serious problem.
> > People work on trying to tune it, instead of just removing it.
> 
> I'm working on fixing it, not just tuning it. The bugs in mainline
> aren't about the selection algorithm (which is normally what people
> call the oom killer). The bugs in mainline are about being able to kill a
> task reliably, regardless of which task we pick, and every linux kernel
> out there has always killed some task when it was oom. So the bugs are
> just obvious regressions of 2.6 if compared to 2.4.
> 
> But this is all fixed now, I'm starting to send the first patches to
> Andrew very shortly (last week there was still the oracle stuff going
> on). Now I can fix the rejects.
> 
> I will guarantee nothing about which task will be picked (that's the old
> code at work, I changed not a bit in what people normally call "the oom
> killer", plus the recent improvement from Thomas), but I guarantee the
> VM won't kill tasks right and left like it does now (i.e. by invoking the
> oom killer multiple times).

And especially not with 500MB of zone normal free, thanks :)

2.6.11-rc1-xx vm behaviour is looking a _lot_ worse than 2.6.10 btw. I
haven't looked closer at what has changed yet, it's just a subjective
feeling. I regularly have to run a fillmem.c hog to prune caches or it
runs like an old dog.

-- 
Jens Axboe


Bug report : drivers/net/hamradio/Kconfig

2005-01-20 Thread 2df

Hello, I'm translating some Kconfig files to French for the kernelFR
project (http://kernelfr.traduc.org), and I found an error while reading
drivers/net/hamradio/Kconfig.
The kernel is 2.6.10.

In the 9th section, "Baycom ser12 halfduplex driver for AX.25", the
3rd line reads:
"The driver supports the ser12 design in full-duplex mode." instead of
"half-duplex mode."

Please send all your answers to: [EMAIL PROTECTED]
I'm not a member of the mailing list.
+
Simon



Re: [PATCH] relayfs redux for 2.6.10: lean and mean

2005-01-20 Thread Peter Williams

Greg KH wrote:
> On Fri, Jan 21, 2005 at 01:15:28PM +1100, Peter Williams wrote:
>
>> Perhaps the logical solution is to implement debugfs in terms of relayfs?
>
> What do you mean by this statement?

I mean that if, as you say, debugfs is very similar to relayfs only more
restricted (i.e. a debugging option) then it should be implementable as
an instance or specialization of the more general relayfs and that this
should be a better solution than two independent implementations of
similar functionality.

Peter
--
Peter Williams   [EMAIL PROTECTED]
"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce

Re: OOM fixes 2/5

2005-01-20 Thread Andrea Arcangeli

On Fri, Jan 21, 2005 at 08:08:21AM +0100, Andi Kleen wrote:
> So at least for GFP_DMA it seems to be definitely needed.

Indeed. Plus if you add a pci32 zone, it'll be needed for it too on
x86-64, like for the normal zone on x86, since ptes will go in highmem
while pci32 allocations will not. So while floppy might be fixed, the
same issue would hit a brand new pci32 zone needed by some devices (e.g.
nvidia, so not such an unlikely corner case).

Re: OOM fixes 2/5

2005-01-20 Thread Andrea Arcangeli

On Fri, Jan 21, 2005 at 06:04:25PM +1100, Nick Piggin wrote:
> OK this is a fairly lame example... but the current code is more or
> less just lucky that ZONE_DMA doesn't usually fill up with pinned mem
> on machines that need explicit ZONE_DMA allocations.

Yep. For the DMA zone all slab cache will be a memory pin (like ptes for
highmem, but not that many people run with 3G of ram in ptes, and I
guess the ones doing it aren't normally using a mainline kernel in the
first place so they're likely not running into it either). Slab cache
pinning the normal zone, on the other hand, is more likely to be
reproduced on l-k in random usages.

Re: OOM fixes 2/5

2005-01-20 Thread Andrea Arcangeli

On Thu, Jan 20, 2005 at 11:00:16PM -0800, Andrew Morton wrote:
> Last time we discussed this you pointed out that reserving more lowmem from
> highmem-capable allocations may actually *help* things.  (Tries to remember
> why) By reducing inode/dentry eviction rates?  I asked Martin Bligh if he
> could test that on a big NUMA box but iirc the results were inconclusive.

This is correct: by guaranteeing more memory to be freeable in lowmem (ptes
aren't freeable without a sigkill for example) the icache/dcache will at
least have a margin where it can grow independently from highmem
allocations.

> Maybe it just won't make much difference.  Hard to say.

I don't know myself if it makes a performance difference, all old
benchmarks have been run with this applied. This was applied for
correctness (i.e.  to avoid sigkills or lockups), it wasn't applied for
performance. But I don't see how it could hurt performance (especially
given current code already does the check at runtime, which is
practically the only fast-path cost ;).

> >  The sysctl name had to change to lowmem_reserve_ratio because its
> >  semantics are completely different now.
> 
> That reminds me.  Documentation/filesystems/proc.txt ;)

Woops, forgotten about it ;)

> I'll cook something up for that.

Thanks. If you prefer I can write it too to relieve you from this load,
it's up to you. If you want to fix it yourself go ahead of course ;)

Re: COMMAND_LINE_SIZE increasing in 2.6.11-rc1-bk6

2005-01-20 Thread Andi Kleen

> I really suggest pushing this limit to 4k. My reason is that under UML I
> need to put a lot of stuff on the command line and UML crashes if I do not
> extend this limit. Can we make it depend on the architecture?

It's dependent on the architecture already. I would like to enable
it on i386/x86-64 because the kernel command line is often used
to pass parameters to installers, and having a small limit there
can be awkward.
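
As a rough illustration of why a small compile-time limit is awkward, the sketch below copies a command line into a fixed buffer and reports how much was cut off. CMDLINE_SIZE_SKETCH and save_cmdline() are hypothetical names and the value 256 is only an example; the real limit is set per architecture.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical per-architecture limit; real values differ by architecture. */
#ifndef CMDLINE_SIZE_SKETCH
#define CMDLINE_SIZE_SKETCH 256u
#endif

/*
 * Copy a boot command line into a fixed buffer of CMDLINE_SIZE_SKETCH bytes.
 * Returns the number of characters dropped because they did not fit.
 */
static size_t save_cmdline(char dst[CMDLINE_SIZE_SKETCH], const char *src)
{
    size_t len = strlen(src);
    size_t keep = len < CMDLINE_SIZE_SKETCH - 1 ? len : CMDLINE_SIZE_SKETCH - 1;

    memcpy(dst, src, keep);
    dst[keep] = '\0';
    return len - keep;
}
```

An installer that passes many parameters would see everything past the limit silently dropped, which is the case a larger limit is meant to avoid.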

But first need to figure out what went wrong with EDD. 

Matt D., do you have thoughts on this?

-Andi

Re: OOM fixes 2/5

2005-01-20 Thread Nick Piggin

On Thu, 2005-01-20 at 22:46 -0800, Andrew Morton wrote:
> Nick Piggin <[EMAIL PROTECTED]> wrote:

> > It does turn on lowmem protection by default. We never reached
> > an agreement about doing this though, but Andrea has shown that
> > it fixes trivial OOM cases.
> > 
> > I think it should be turned on by default. I can't recall what
> > your reservations were...?
> > 
> 
> Just that it throws away a bunch of potentially usable memory.  In three
> years I've seen zero reports of any problems which would have been solved
> by increasing the protection ratio.
> 
> Thus empirically, it appears that the number of machines which need a
> non-zero protection ratio is exceedingly small.  Why change the setting on
> all machines for the benefit of the tiny few?  Seems weird.  Especially
> when this problem could be solved with a few-line initscript.  Ho hum.

That is true, but it should not reserve a great deal of memory on
small memory machines. ZONE_NORMAL reservation may not even be too
noticeable as you'll usually have ZONE_NORMAL allocations during
the course of normal running.

Although it is true that there haven't been many problems attributed
to this, one example I can remember is when we fixed the __alloc_pages
watermark code, we fixed a bug that was reserving much more ZONE_DMA
than it was supposed to. This caused all those page allocation failure
problems. So we raised the atomic reserve, but that didn't bring
ZONE_DMA reservation back to its previous levels.

"So the buffer between GFP_KERNEL and GFP_ATOMIC allocations is:

2.6.8      | 465 dma, 117 norm, 582 tot = 2328K
2.6.10-rc  |   2 dma, 146 norm, 148 tot =  592K
patch      |  12 dma, 500 norm, 512 tot = 2048K"
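
The totals in the table are consistent with page counts multiplied by a 4K page size. A small sketch of that arithmetic follows, assuming 4 KiB pages; struct reserve and the helper names are illustrative only.

```c
/* Assumed page size in KiB (4 KiB pages, as on i386). */
#define PAGE_KB 4u

/* Pages held back for atomic allocations in each zone. */
struct reserve {
    unsigned int dma_pages;
    unsigned int normal_pages;
};

/* Total reserved pages across both zones. */
static unsigned int reserve_pages(struct reserve r)
{
    return r.dma_pages + r.normal_pages;
}

/* Total reserve expressed in KiB. */
static unsigned int reserve_kb(struct reserve r)
{
    return reserve_pages(r) * PAGE_KB;
}
```

For the 2.6.8 row this gives (465 + 117) * 4 = 2328K, and for the 2.6.10-rc row (2 + 146) * 4 = 592K, with only 2 of those pages in ZONE_DMA.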

So we were still seeing GFP_DMA allocation failures in the sound code.
You recently had to make that NOWARN to shut it up.

OK this is a fairly lame example... but the current code is more or
less just lucky that ZONE_DMA doesn't usually fill up with pinned mem
on machines that need explicit ZONE_DMA allocations.


Re: OOM fixes 2/5

2005-01-20 Thread Andi Kleen

Andrew Morton <[EMAIL PROTECTED]> writes:

> Just that it throws away a bunch of potentially usable memory.  In three
> years I've seen zero reports of any problems which would have been solved
> by increasing the protection ratio.

We ran into a big problem with this on x86-64. The SUSE installer
would load the floppy driver during installation. Floppy driver would
try to allocate some pages with GFP_DMA and on a small memory x86-64
system (256-512MB) the OOM killer would always start to kill things
trying to free some DMA pages. This was quite a show stopper
because you effectively couldn't install.

So at least for GFP_DMA it seems to be definitely needed.

-Andi

Re: OOM fixes 2/5

2005-01-20 Thread Andrea Arcangeli

On Thu, Jan 20, 2005 at 10:46:45PM -0800, Andrew Morton wrote:
> Thus empirically, it appears that the number of machines which need a
> non-zero protection ratio is exceedingly small.  Why change the setting on
> all machines for the benefit of the tiny few?  Seems weird.  Especially
> when this problem could be solved with a few-line initscript.  Ho hum.

It's up to you. IMHO you're making a mistake, but I don't mind as long as our
customers aren't at risk of early oom kills (or worse kernel crashes)
with some db load (especially without swap the risk is huge for all
users, since all anonymous memory will be pinned like ptes, but with ~3G
of pagetables they're at risk even with swap).  At least you *must*
admit that without my patch applied as I posted, there's a >0 probability
of running out of normal zone which will lead to an oom-kill or a
deadlock even though 10G of highmem might still be freeable (like with
clean cache). And my patch obviously cannot make it impossible to run
out of normal zone, since there's only 800m of normal zone and one can
open more files than what fits in normal zone, but at least it gives the
user the security that a certain workload can run reliably. Without this
patch there's no guarantee at all that any workload will run when >1G of
ptes is allocated.

The fix below is needed as well, even though you won't find reports of
people reproducing this race condition. Please apply. CC'ed Hugh. Sorry Hugh, I
know you were working on it (you said not in the weekend IIRC), but I've
upgraded to latest bk so I had to fix it up quickly or I would have had to
run the racy code on my smp systems to test new kernels.

From: Andrea Arcangeli <[EMAIL PROTECTED]>
Subject: fixup smp race introduced in 2.6.11-rc1

Signed-off-by: Andrea Arcangeli <[EMAIL PROTECTED]>

--- x/mm/memory.c.~1~   2005-01-21 06:58:14.747335048 +0100
+++ x/mm/memory.c   2005-01-21 07:16:15.318063328 +0100
@@ -1555,8 +1555,17 @@ void unmap_mapping_range(struct address_

spin_lock(&mapping->i_mmap_lock);

+   /* serialize i_size write against truncate_count write */
+   smp_wmb(); 
/* Protect against page faults, and endless unmapping loops */
mapping->truncate_count++;
+   /*
+* For archs where spin_lock has inclusive semantics like ia64
+* this smp_mb() will prevent to read pagetable contents
+* before the truncate_count increment is visible to
+* other cpus.
+*/
+   smp_mb();
if (unlikely(is_restart_addr(mapping->truncate_count))) {
if (mapping->truncate_count == 0)
reset_vma_truncate_counts(mapping);
@@ -1864,10 +1873,18 @@ do_no_page(struct mm_struct *mm, struct 
if (vma->vm_file) {
mapping = vma->vm_file->f_mapping;
sequence = mapping->truncate_count;
+   smp_rmb(); /* serializes i_size against truncate_count */
}
 retry:
cond_resched();
new_page = vma->vm_ops->nopage(vma, address & PAGE_MASK, &ret);
+   /*
+* No smp_rmb is needed here as long as there's a full
+* spin_lock/unlock sequence inside the ->nopage callback
+* (for the pagecache lookup) that acts as an implicit
+* smp_mb() and prevents the i_size read to happen
+* after the next truncate_count read.
+*/

/* no page was available -- either SIGBUS or OOM */
if (new_page == NOPAGE_SIGBUS)
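
The ordering the patch relies on can be sketched in userspace with C11 atomics. This is an approximation under stated assumptions, not the kernel code: struct mapping_sketch and the three functions are made-up names, and C11 release/acquire fences only stand in for smp_wmb() and smp_rmb().

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for the fields of struct address_space used by the patch. */
struct mapping_sketch {
    atomic_ulong size;           /* plays the role of i_size */
    atomic_uint truncate_count;  /* bumped by every truncate */
};

/* Writer side: publish the new size, then bump the counter. */
static void truncate_sketch(struct mapping_sketch *m, unsigned long new_size)
{
    atomic_store_explicit(&m->size, new_size, memory_order_relaxed);
    /* Like smp_wmb(): order the size store before the counter store. */
    atomic_thread_fence(memory_order_release);
    atomic_fetch_add_explicit(&m->truncate_count, 1, memory_order_relaxed);
}

/* Reader side, step 1: sample the counter before looking at the size. */
static unsigned int fault_begin(struct mapping_sketch *m)
{
    unsigned int seq = atomic_load_explicit(&m->truncate_count,
                                            memory_order_relaxed);

    /* Like smp_rmb(): order the counter load before later loads. */
    atomic_thread_fence(memory_order_acquire);
    return seq;
}

/* Reader side, step 2: true means a truncate ran meanwhile, so retry. */
static bool fault_must_retry(struct mapping_sketch *m, unsigned int seq)
{
    /* Order the earlier size reads before re-reading the counter. */
    atomic_thread_fence(memory_order_acquire);
    return atomic_load_explicit(&m->truncate_count,
                                memory_order_relaxed) != seq;
}
```

A reader that observes the bumped counter is also guaranteed to observe the new size, and a reader that sampled the old counter detects the change on the re-check and retries.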


[PATCH] Reserving backup region for kexec based crashdumps.

2005-01-20 Thread Vivek Goyal

Hi Andrew,

The following patch is against 2.6.11-rc1-mm2.

As mentioned in the following note from Eric, the crashdump code is
currently broken.
> 
> The crashdump code is currently slightly broken.  I have attempted to
> minimize the breakage so things can quickly be made to work again.

We have started making changes to get crashdump up and running again.
The following items have been identified so far:

1. Reserve the backup region (640k) during kernel bootup.
2. Copy the data to the backup region during a crash (moved to kexec user
space code, patch posted in a separate mail).
3. Prepare ELF headers while loading the kexec panic kernel and store them
in the reserved memory area.
4. Pass the required information to the crashdump kernel, which parses it
and exports it through /proc/vmcore (maybe a user space utility, open to
discussion).

The following patch implements item 1 in the list. We shall soon be rolling
out patches for the rest.

Thanks
Vivek



This patch adds support for reserving 640k memory as backup region as required
by crashdump kernel for x86. 
---


Signed-off-by: Vivek Goyal <[EMAIL PROTECTED]>
---

 linux-2.6.11-rc1-mm2-kexec-eric-root/arch/i386/kernel/setup.c |8 
 linux-2.6.11-rc1-mm2-kexec-eric-root/include/linux/kexec.h|6 +-
 linux-2.6.11-rc1-mm2-kexec-eric-root/kernel/kexec.c   |8 
 3 files changed, 21 insertions(+), 1 deletion(-)

diff -puN arch/i386/kernel/setup.c~crashdump-x86-reserve-640k-memory 
arch/i386/kernel/setup.c
--- 
linux-2.6.11-rc1-mm2-kexec-eric/arch/i386/kernel/setup.c~crashdump-x86-reserve-640k-memory
  2005-01-20 13:55:33.0 +0530
+++ linux-2.6.11-rc1-mm2-kexec-eric-root/arch/i386/kernel/setup.c   
2005-01-20 13:55:33.0 +0530
@@ -1159,6 +1159,13 @@ static unsigned long __init setup_memory
 #ifdef CONFIG_KEXEC
if (crashk_res.start != crashk_res.end) {
reserve_bootmem(crashk_res.start, crashk_res.end - 
crashk_res.start + 1);
+
+#define CRASH_DUMP_BACKUP 0xa0000
+   /* Reserve another 640Kb for crashdump backup. */
+   crashdumpk_res.start = crashk_res.end + 1;
+   crashdumpk_res.end = crashdumpk_res.start +
+   CRASH_DUMP_BACKUP -1;
+   reserve_bootmem(crashdumpk_res.start, CRASH_DUMP_BACKUP);
}
 #endif
return max_low_pfn;
@@ -1202,6 +1209,7 @@ legacy_init_iomem_resources(struct resou
request_resource(res, data_resource);
 #ifdef CONFIG_KEXEC
request_resource(res, &crashk_res);
+   request_resource(res, &crashdumpk_res);
 #endif
}
}
diff -puN include/linux/kexec.h~crashdump-x86-reserve-640k-memory 
include/linux/kexec.h
--- 
linux-2.6.11-rc1-mm2-kexec-eric/include/linux/kexec.h~crashdump-x86-reserve-640k-memory
 2005-01-20 13:55:33.0 +0530
+++ linux-2.6.11-rc1-mm2-kexec-eric-root/include/linux/kexec.h  2005-01-20 
13:55:33.0 +0530
@@ -79,7 +79,7 @@ struct kimage {
unsigned long control_page;
 
/* Flags to indicate special processing */
-   int type : 1;
+   unsigned int type : 1;
 #define KEXEC_TYPE_DEFAULT 0
 #define KEXEC_TYPE_CRASH   1
 };
@@ -122,6 +122,10 @@ extern struct kimage *kexec_crash_image;
  */
 extern struct resource crashk_res;
 
+/* Location of backup region to hold the crashdump kernel data.
+ */
+extern struct resource crashdumpk_res;
+
 #else /* !CONFIG_KEXEC */
 static inline void crash_kexec(void) { }
 #endif /* CONFIG_KEXEC */
diff -puN kernel/kexec.c~crashdump-x86-reserve-640k-memory kernel/kexec.c
--- 
linux-2.6.11-rc1-mm2-kexec-eric/kernel/kexec.c~crashdump-x86-reserve-640k-memory
2005-01-20 13:55:33.0 +0530
+++ linux-2.6.11-rc1-mm2-kexec-eric-root/kernel/kexec.c 2005-01-20 
13:55:33.0 +0530
@@ -32,6 +32,14 @@ struct resource crashk_res = {
.flags = IORESOURCE_BUSY | IORESOURCE_MEM
 };
 
+/* Location of the backup area for the crash dump kernel */
+struct resource crashdumpk_res = {
+   .name  = "Crash Dump Backup",
+   .start = 0,
+   .end   = 0,
+   .flags = IORESOURCE_BUSY | IORESOURCE_MEM
+};
+
 /*
  * When kexec transitions to the new kernel there is a one-to-one
  * mapping between physical and virtual addresses.  On processors
_
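
The address arithmetic in the setup.c hunk can be checked in isolation. The sketch below assumes inclusive start/end ranges, as struct resource is used in the patch, and a 640 KiB backup size taken from the patch description; struct region, BACKUP_SIZE and the helpers are illustrative names only.

```c
#include <stdint.h>

/* Inclusive [start, end] range, mirroring how the patch uses struct resource. */
struct region {
    uint64_t start;
    uint64_t end;
};

/* Assumed backup size: 640 KiB, as stated in the patch description. */
#define BACKUP_SIZE (640u * 1024u)

/* Size in bytes of an inclusive range. */
static uint64_t region_size(struct region r)
{
    return r.end - r.start + 1;
}

/* Place a backup region of 'size' bytes directly after 'crash'. */
static struct region backup_after(struct region crash, uint64_t size)
{
    struct region b;

    b.start = crash.end + 1;
    b.end = b.start + size - 1;
    return b;
}
```

Because both ends are inclusive, the "+ 1" and "- 1" terms matter: dropping either one would make the two regions overlap by a byte or leave the backup one byte short.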

Re: COMMAND_LINE_SIZE increasing in 2.6.11-rc1-bk6

2005-01-20 Thread Catalin(ux aka Dino) BOIE

On Thu, 20 Jan 2005, Andi Kleen wrote:
> AOL:
> - lilo 22.6.1
> - CONFIG_EDD=y
> - 2.6.10-mm1 and 2.6.11-rc1 did boot
> - 2.6.11-rc1-mm1 and 2.6.11-rc1-mm2 didn't boot
> - 2.6.11-rc1-mm2 with this ChangeSet reverted boots.
> What I gather so far the problem seems to only happen with lilo
> and EDID together.  grub appears to work.  Or did anyone
> see problems with grub too?
> I'll dig a bit, but reverting for now is probably best.
> Thanks Linus.

I really suggest pushing this limit to 4k. My reason is that under UML I
need to put a lot of stuff on the command line and UML crashes if I do not
extend this limit. Can we make it depend on the architecture?

Thanks.
---
Catalin(ux aka Dino) BOIE
catab at deuroconsult.ro
http://kernel.umbrella.ro/

Re: OOM fixes 2/5

2005-01-20 Thread Andrew Morton

Andrea Arcangeli <[EMAIL PROTECTED]> wrote:
>
> Anyway if you leave it off by default I don't mind, with my new code
>  forward ported straight from 2.4 mainline, it's possible for the first
>  time to set it from userspace without having to embed knowledge on the
>  kernel min_kbytes settings at boot time.

Last time we discussed this you pointed out that reserving more lowmem from
highmem-capable allocations may actually *help* things.  (Tries to remember
why) By reducing inode/dentry eviction rates?  I asked Martin Bligh if he
could test that on a big NUMA box but iirc the results were inconclusive.

Maybe it just won't make much difference.  Hard to say.

>  The sysctl name had to change to lowmem_reserve_ratio because its
>  semantics are completely different now.

That reminds me.  Documentation/filesystems/proc.txt ;)

I'll cook something up for that.

Re: [PATCH] relayfs redux for 2.6.10: lean and mean

2005-01-20 Thread Greg KH

On Fri, Jan 21, 2005 at 01:15:28PM +1100, Peter Williams wrote:
> 
> Perhaps the logical solution is to implement debugfs in terms of relayfs?

What do you mean by this statement?

greg k-h

Re: [PATCH] relayfs redux for 2.6.10: lean and mean

2005-01-20 Thread Greg KH

On Thu, Jan 20, 2005 at 08:38:25PM -0500, Karim Yaghmour wrote:
> 
> Greg KH wrote:
> > Hm, how about this idea for cutting about 500 more lines from the code:
> > 
> > Why not drop the "fs" part of relayfs and just make the code a set of
> > struct file_operations.  That way you could have "relayfs-like" files in
> > any ram based file system that is being used.  Then, a user could use
> > these fops and assorted interface to create debugfs or even procfs files
> > using this type of interface.
> > 
> > As relayfs really is almost the same (conceptually wise) as debugfs as
> > far as concept of what kinds of files will be in there (nothing anyone
> > would ever rely on for normal operations, but for debugging only) this
> > keeps users and developers from having to spread their debugging and
> > instrumenting files across two different file systems.
> 
> However this assumes that the users of relayfs are not going to want
> it during normal system operation.

That is true.

> This is an assumption that fails with at least LTT as it is targeted
> at sysadmins, application developers and power users who need to be
> able to trace their systems at any time.

Are they willing to trade off the performance of LTT to get this?  I
thought this was being touted as a "when you need to test" type of
thing, not a "run it all the time" type of feature.

> I don't mind piggy-backing off another fs, if it makes sense, but
> unlike debugfs, relayfs is meant for general use, and all files in there
> are of the same type: relay channels for dumping huge amounts of data
> to user-space.

And a driver will never want to have both a relay channel, and a simple
debug output at the same time?  You are now requiring them to look for
that data in two different points in the fs.

> It seems to me the target audience and basic idea (relay
> channels only in the fs) are different, but let me know if there's a
> compelling argument for doing this in another way without making it too
> confusing for users of those special "files" (IOW, when this starts
> being used in distros, it'll be more straightforward for users to
> understand if all files in a mounted fs behave a certain way than if
> they have certain "odd" files in certain directories, even if it's
> /proc.)

So, since you are proposing that relayfs be mounted all the time, where
do you want to mount it at?  I had to provide a "standard" location for
debugfs for people to be happy with it, and the same issue comes up
here.

Also, why not export your relayfs ops so that someone using debugfs can
create a relay channel in it, or in any other type of fs they might
create?

thanks,

greg k-h

Re: OOM fixes 2/5

2005-01-20 Thread Andrea Arcangeli

On Fri, Jan 21, 2005 at 05:36:14PM +1100, Nick Piggin wrote:
> I think it should be turned on by default. I can't recall what

I think so too, since the number of people who can be bitten by this is
certainly higher than the number of people who know the VM internals
and know for which workloads they need to enable this by hand to avoid
risking lockups (notably boxes without swap, or with heavy pagetable
allocations all the time, which is not uncommon with db usage).

This is needed on x86-64 too, to keep pagetables from locking up the dma
zone. It's also needed on x86 for the dma zone on <1G boxes.

Anyway, if you leave it off by default I don't mind. With my new code,
forward ported straight from 2.4 mainline, it's possible for the first
time to set it from userspace without having to embed knowledge of the
kernel min_kbytes settings at boot time. So if you want it off by
default, it simply means we'll guarantee it on our distro from userland.
Setting a sysctl at boot time is no big deal for us (of course, leaving
it enabled by default in kernel space would help older distros whose
userland isn't yet aware of it). So it's pretty much up to you: as long
as we can easily fix it up in userland it's fine with me. I have already
tried a dozen times to push mainline in what I believe to be the right
direction (as I did in 2.4 mainline, where that same code is enabled by
default).

The sysctl name had to change to lowmem_reserve_ratio because its
semantics are completely different now.

Re: OOM fixes 2/5

2005-01-20 Thread Andrew Morton

Nick Piggin <[EMAIL PROTECTED]> wrote:
>
> On Thu, 2005-01-20 at 22:20 -0800, Andrew Morton wrote:
> > Andrea Arcangeli <[EMAIL PROTECTED]> wrote:
> > >
> > >  This is the forward port to 2.6 of the lowmem_reserved algorithm I
> > >  invented in 2.4.1*, merged in 2.4.2x already and needed to fix workloads
> > >  like google (especially without swap) on x86 with >1G of ram, but it's
> > >  needed in all sort of workloads with lots of ram on x86, it's also
> > >  needed on x86-64 for dma allocations. This brings 2.6 in sync with
> > >  latest 2.4.2x.
> > 
> > But this patch doesn't change anything at all in the page allocation path
> > apart from renaming lots of things, does it?
> > 
> > AFAICT all it does is to change the default values in the protection map. 
> > It does it via a simplification, which is nice, but I can't see how it
> > fixes anything.
> > 
> > Confused.
> 
> 
> It does turn on lowmem protection by default. We never reached
> an agreement about doing this though, but Andrea has shown that
> it fixes trivial OOM cases.
> 
> I think it should be turned on by default. I can't recall what
> your reservations were...?
> 

Just that it throws away a bunch of potentially usable memory.  In three
years I've seen zero reports of any problems which would have been solved
by increasing the protection ratio.

Thus empirically, it appears that the number of machines which need a
non-zero protection ratio is exceedingly small.  Why change the setting on
all machines for the benefit of the tiny few?  Seems weird.  Especially
when this problem could be solved with a few-line initscript.  Ho hum.

Re: 2.6.10-mm2: it87 sensor driver stops CPU fan

2005-01-20 Thread Jean Delvare

Hi Nicolas,

> I confirm that 0x7f is full speed.

So at least the polarity bit is correct, and Gigabyte isn't to blame.

> > Once you know if the polarity is correct, you can try different
> > values of PWM between 0x00 and 0x7F and see how exactly your fan
> > reacts to them.
> 
> That's where things get really really interesting.  As mentioned
> above 0x7f drives the fan full speed (2596 RPM).  Now lowering that
> value slows the CPU fan gradually down to a certain point.  With a
> value of 0x3f the fan turns at 1041 RPM.  But below 0x3f the fan
> starts speeding up again to reach a peak of 2280 RPM with a value
> of 0x31, then it slows  down again toward 0 RPM as the register
> value is decreased down to 0.
> 
> Bit 3 of register 0x14, when set, only modifies the curve so the
> first minimum is instead reached at 0x30 then the peak occurs at 0x1d
> before dropping to 0.
> 
> Changing the PWM base clock select has no effect.

Wow! Unexpected, to say the least. It's the first time I've seen such
behavior.

Could it be that your CPU fan isn't a simple passive device but one of
those high-tech models with an embedded thermal sensor and automatic
speed adjustment? This could interact with the motherboard PWM
capability and would explain the strange speed curve you obtained.

I would also like you to try a similar test with your case fan. Enable
"smart guardian" mode for this one (by writing 0x73 to register 0x13),
then scan the 0x7f-0x00 range (register 0x16) like you did for your CPU
fan. I wonder if you will obtain the same kind of result or a standard
linear curve.

(Note that PWM2 might not be wired at all on your motherboard, so don't
be surprised if the case fan speed doesn't change at all.)
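
A scan like this is easier to compare between fans if the readings are
recorded as (PWM, RPM) pairs and checked for changes of direction. The
following is only a sketch: the struct and function names are made up for
illustration, and the sample points are just the four CPU fan readings
quoted above, not a full scan.

```c
#include <assert.h>
#include <stddef.h>

/* One reading from a PWM scan (hypothetical helper type). */
struct fan_sample {
	unsigned int pwm;	/* register value, 0x00..0x7f */
	unsigned int rpm;	/* measured fan speed */
};

/*
 * Count how often the RPM curve changes direction when the samples are
 * walked in the order given. A fan that follows PWM linearly yields 0.
 */
static size_t count_direction_changes(const struct fan_sample *s, size_t n)
{
	size_t changes = 0;
	int prev_dir = 0;	/* -1 falling, +1 rising, 0 not known yet */
	size_t i;

	assert(s != NULL || n == 0);
	for (i = 1; i < n; i++) {
		int dir = 0;

		if (s[i].rpm > s[i - 1].rpm)
			dir = 1;
		else if (s[i].rpm < s[i - 1].rpm)
			dir = -1;
		if (dir == 0)
			continue;	/* flat step, keep previous direction */
		if (prev_dir != 0 && dir != prev_dir)
			changes++;
		prev_dir = dir;
	}
	return changes;
}

/* The CPU fan readings quoted above, from 0x7f down to 0x00. */
static const struct fan_sample cpu_fan_scan[] = {
	{ 0x7f, 2596 }, { 0x3f, 1041 }, { 0x31, 2280 }, { 0x00, 0 },
};
```

For the CPU fan numbers above the curve falls, rises and falls again, so
the count is non-zero; a case fan with a standard linear response would
give zero.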

Thanks,
-- 
Jean Delvare
http://khali.linux-fr.org/

Re: writeback-highmem

2005-01-20 Thread Andrea Arcangeli

On Thu, Jan 20, 2005 at 10:26:30PM -0800, Andrew Morton wrote:
> Andrea Arcangeli <[EMAIL PROTECTED]> wrote:
> >
> > This needed highmem fix from Rik is still missing too, so please apply
> >  along the other 5 (it's orthogonal so you can apply this one in any
> >  order you want).
> > 
> >  From: Rik van Riel <[EMAIL PROTECTED]>
> >  Subject: [PATCH][1/2] adjust dirty threshold for lowmem-only mappings
> 
> I've held off on this one because the recent throttling fix should have
> helped this problem.  Has anyone confirmed that this patch still actually
> fixes something?  If so, what was the scenario?

Without this fix write throttling is completely broken for a blkdev and
it won't start _at_all_ and it'll just keep hanging in the allocation
routines. I agree it won't explain oom (with the other fixes the VM
should writeback synchronously instead of running oom) but it may make
the box completely unusable under a cp /dev/zero /dev/somedevice.

There is a reason why we start write throttling before 100% of ram is
being locked by dirty pages in the pagecache path.

The beauty of this fix is that Rik allowed the pagecache not to have the
limit (in 2.4 pagecache had the limit too). Probably async writeback
won't start but at least the write throttling will and that's all we
need to keep the box running other apps at the same time of the write.

If the system goes unresponsive for 10 minutes and swaps during backups
or other workloads working on the blkdev, users will file bug reports,
and they'd be right.

In short I agree this shouldn't be applied for oom, but it's still
definitely a correct and needed fix (and I rate it a bit more than just
an optimization).
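
The effect of counting the dirty limit against lowmem only can be
sketched in user space. This is a simplified model, not the kernel code:
the struct, the function name and the dirty ratio of 40 used in the
example are assumptions for illustration, and the background threshold
is left out.

```c
#include <assert.h>

/* Hypothetical snapshot of the memory state, in pages. */
struct mem_state {
	unsigned long total_pages;
	unsigned long totalhigh_pages;
	unsigned long nr_mapped;
};

/*
 * Return the dirty threshold in pages. If the mapping cannot be cached
 * in highmem (a block device, for instance), highmem is left out of the
 * base the percentage is taken from.
 */
static unsigned long dirty_limit(const struct mem_state *m, int dirty_ratio,
				 int mapping_uses_highmem)
{
	unsigned long available = m->total_pages;
	long unmapped_ratio;

	assert(m->totalhigh_pages < m->total_pages);
	if (!mapping_uses_highmem)
		available -= m->totalhigh_pages;

	unmapped_ratio = 100 - (long)((m->nr_mapped * 100) / available);
	if (dirty_ratio > unmapped_ratio / 2)
		dirty_ratio = (int)(unmapped_ratio / 2);
	if (dirty_ratio < 5)
		dirty_ratio = 5;

	return (unsigned long)dirty_ratio * available / 100;
}
```

With 1000000 pages in total, 800000 of them highmem, and 50000 mapped, a
ratio of 40 allows 400000 dirty pages for a highmem-capable mapping but
only 74000 for a lowmem-only one, so throttling starts long before
lowmem is full of dirty pagecache.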

[PATCH] compat ioctl security hook fixup (take2)

2005-01-20 Thread Chris Wright

* Andi Kleen ([EMAIL PROTECTED]) wrote:
> On Thu, Jan 20, 2005 at 09:51:03PM -0800, Chris Wright wrote:
> > > If you add it make at least sure it's not EXPORT_SYMBOL()ed.
> > 
> > It's certainly not, nor intended to be.  Would a comment to that
> > effect alleviate your concern?
> 
> Yes please.

Patch respun, with comment added.

thanks,
-chris
--

Introduce a simple helper, vfs_ioctl(), so that both sys_ioctl() and
compat_sys_ioctl() call the security hook in all cases and without
duplication.

Signed-off-by: Chris Wright <[EMAIL PROTECTED]>

= fs/ioctl.c 1.15 vs edited =
--- 1.15/fs/ioctl.c 2005-01-15 14:31:01 -08:00
+++ edited/fs/ioctl.c   2005-01-20 22:27:43 -08:00
@@ -77,21 +77,13 @@ static int file_ioctl(struct file *filp,
return do_ioctl(filp, cmd, arg);
 }
 
-
-asmlinkage long sys_ioctl(unsigned int fd, unsigned int cmd, unsigned long arg)
+/* Simple helper for sys_ioctl and compat_sys_ioctl.  Not for drivers'
+ * use, and not intended to be EXPORT_SYMBOL()'d
+ */
+int vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd, unsigned long arg)
 {
-   struct file * filp;
unsigned int flag;
-   int on, error = -EBADF;
-   int fput_needed;
-
-   filp = fget_light(fd, &fput_needed);
-   if (!filp)
-   goto out;
-
-   error = security_file_ioctl(filp, cmd, arg);
-   if (error)
-   goto out_fput;
+   int on, error = 0;
 
switch (cmd) {
case FIOCLEX:
@@ -157,6 +149,24 @@ asmlinkage long sys_ioctl(unsigned int f
error = do_ioctl(filp, cmd, arg);
break;
}
+   return error;
+}
+
+asmlinkage long sys_ioctl(unsigned int fd, unsigned int cmd, unsigned long arg)
+{
+   struct file * filp;
+   int error = -EBADF;
+   int fput_needed;
+
+   filp = fget_light(fd, &fput_needed);
+   if (!filp)
+   goto out;
+
+   error = security_file_ioctl(filp, cmd, arg);
+   if (error)
+   goto out_fput;
+
+   error = vfs_ioctl(filp, fd, cmd, arg);
  out_fput:
fput_light(filp, fput_needed);
  out:
= fs/compat.c 1.48 vs edited =
--- 1.48/fs/compat.c2005-01-15 14:31:01 -08:00
+++ edited/fs/compat.c  2005-01-20 22:25:33 -08:00
@@ -437,6 +437,11 @@ asmlinkage long compat_sys_ioctl(unsigne
if (!filp)
goto out;
 
+   /* RED-PEN how should LSM module know it's handling 32bit? */
+   error = security_file_ioctl(filp, cmd, arg);
+   if (error)
+   goto out_fput;
+
if (filp->f_op && filp->f_op->compat_ioctl) {
error = filp->f_op->compat_ioctl(filp, cmd, arg);
if (error != -ENOIOCTLCMD)
@@ -477,7 +482,7 @@ asmlinkage long compat_sys_ioctl(unsigne
 
up_read(&ioctl32_sem);
  do_ioctl:
-   error = sys_ioctl(fd, cmd, arg);
+   error = vfs_ioctl(filp, fd, cmd, arg);
  out_fput:
fput_light(filp, fput_needed);
  out:
= include/linux/fs.h 1.373 vs edited =
--- 1.373/include/linux/fs.h2005-01-15 14:31:01 -08:00
+++ edited/include/linux/fs.h   2005-01-20 22:25:33 -08:00
@@ -1564,6 +1564,8 @@ extern int vfs_stat(char __user *, struc
 extern int vfs_lstat(char __user *, struct kstat *);
 extern int vfs_fstat(unsigned int, struct kstat *);
 
+extern int vfs_ioctl(struct file *, unsigned int, unsigned int, unsigned long);
+
 extern struct file_system_type *get_fs_type(const char *name);
 extern struct super_block *get_super(struct block_device *);
 extern struct super_block *user_get_super(dev_t);
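
The shape of the change, a shared helper that does the dispatch while
each entry point calls the security hook exactly once, can be modelled
in user space. All names below are mock stand-ins invented for this
sketch; none of them are kernel functions.

```c
#include <assert.h>

static int hook_calls;	/* counts security hook invocations */

/* Stand-in for the security hook; always allows the call. */
static int mock_security_hook(unsigned int cmd)
{
	(void)cmd;
	hook_calls++;
	return 0;
}

/* Shared helper: does the work, never calls the hook itself. */
static int mock_vfs_ioctl(unsigned int cmd)
{
	assert(hook_calls > 0);	/* callers must have run the hook first */
	return (int)cmd + 1;	/* placeholder for the real dispatch */
}

/* Native entry point: hook, then helper. */
static int mock_sys_ioctl(unsigned int cmd)
{
	int error = mock_security_hook(cmd);

	if (error)
		return error;
	return mock_vfs_ioctl(cmd);
}

/* Compat entry point: hook, then its own handler or the helper. */
static int mock_compat_sys_ioctl(unsigned int cmd, int has_compat_handler)
{
	int error = mock_security_hook(cmd);

	if (error)
		return error;
	if (has_compat_handler)
		return 0;	/* handled by a compat-specific path */
	/* Falling back to the helper, not the native entry point,
	 * avoids running the hook a second time. */
	return mock_vfs_ioctl(cmd);
}
```

In this model every path through either entry point bumps hook_calls by
exactly one, which is the property the patch is after.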

Re: OOM fixes 2/5

2005-01-20 Thread Nick Piggin

On Thu, 2005-01-20 at 22:20 -0800, Andrew Morton wrote:
> Andrea Arcangeli <[EMAIL PROTECTED]> wrote:
> >
> >  This is the forward port to 2.6 of the lowmem_reserved algorithm I
> >  invented in 2.4.1*, merged in 2.4.2x already and needed to fix workloads
> >  like google (especially without swap) on x86 with >1G of ram, but it's
> >  needed in all sort of workloads with lots of ram on x86, it's also
> >  needed on x86-64 for dma allocations. This brings 2.6 in sync with
> >  latest 2.4.2x.
> 
> But this patch doesn't change anything at all in the page allocation path
> apart from renaming lots of things, does it?
> 
> AFAICT all it does is to change the default values in the protection map. 
> It does it via a simplification, which is nice, but I can't see how it
> fixes anything.
> 
> Confused.


It does turn on lowmem protection by default. We never reached
an agreement about doing this though, but Andrea has shown that
it fixes trivial OOM cases.

I think it should be turned on by default. I can't recall what
your reservations were...?





Re: [PATCH] to fix xtime lock for in the RT kernel patch

2005-01-20 Thread Ingo Molnar


* George Anzinger  wrote:

> It seems to me that we need to either do the attached or to rewrite
> the timer front end code to just gather the offset info and defer to
> the timer irq thread to update jiffies and the offset stuff.  In
> either case we really can not split the two and we do need the
> xtime_lock protection.

how about the patch below? One of the important benefits of the threaded
timer IRQ is the ability to make xtime_lock a mutex.

Ingo

--- linux/arch/i386/kernel/time.c.orig2 
+++ linux/arch/i386/kernel/time.c   
@@ -313,6 +313,7 @@ irqreturn_t timer_interrupt(int irq, voi
write_seqlock(&xtime_lock);
 
cur_timer->mark_offset();
+   do_timer(regs);
  
do_timer_interrupt(irq, NULL, regs);
 
--- linux/include/asm-i386/mach-default/do_timer.h.orig2
+++ linux/include/asm-i386/mach-default/do_timer.h  
@@ -16,7 +16,6 @@
 
 static inline void do_timer_interrupt_hook(struct pt_regs *regs)
 {
-   do_timer(regs);
 #ifndef CONFIG_SMP
update_process_times(user_mode(regs));
 #endif

Re: OOM fixes 2/5

2005-01-20 Thread Andrea Arcangeli

On Thu, Jan 20, 2005 at 10:20:56PM -0800, Andrew Morton wrote:
> Andrea Arcangeli <[EMAIL PROTECTED]> wrote:
> >
> >  This is the forward port to 2.6 of the lowmem_reserved algorithm I
> >  invented in 2.4.1*, merged in 2.4.2x already and needed to fix workloads
> >  like google (especially without swap) on x86 with >1G of ram, but it's
> >  needed in all sort of workloads with lots of ram on x86, it's also
> >  needed on x86-64 for dma allocations. This brings 2.6 in sync with
> >  latest 2.4.2x.
> 
> But this patch doesn't change anything at all in the page allocation path
> apart from renaming lots of things, does it?

Not in the allocation path, but it rewrites the setting algorithm, so
for somebody looking at it from userspace it's a completely different
thing, usable for the first time ever in 2.6. Otherwise userspace would
be required to have knowledge of the kernel internals to be able to
set it to a sane value. Plus the new init code is much cleaner too.

> AFAICT all it does is to change the default values in the protection map. 
> It does it via a simplification, which is nice, but I can't see how it
> fixes anything.

Having this patch applied is a major fix. See again the google fix
thread in 2.4.1x.  2.6 is vulnerable to it again. This patch makes the
feature usable and enables the feature as well, which is definitely a
fix as far as an end user is concerned (google was the user in this case).
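
A ratio-based setting is easier to use from userspace because the
reserve then scales with the machine. The sketch below shows one way
such a scheme can work: the pages a lower zone holds back from
allocations that could have been served from higher zones are the size
of those higher zones divided by a per-zone ratio. The zone sizes and
the ratios 256 and 32 are example values, and the function is a
simplified illustration, not the code from the patch.

```c
#include <assert.h>

#define NR_ZONES 3	/* e.g. DMA, NORMAL, HIGHMEM */

/*
 * reserve[i][j]: pages zone i keeps free when the allocation could have
 * been satisfied from zone j (j > i). Entries with j <= i are zero.
 */
static void compute_lowmem_reserve(const unsigned long present[NR_ZONES],
				   const unsigned long ratio[NR_ZONES - 1],
				   unsigned long reserve[NR_ZONES][NR_ZONES])
{
	int i, j, k;

	for (i = 0; i < NR_ZONES; i++) {
		for (j = 0; j < NR_ZONES; j++) {
			unsigned long higher = 0;

			reserve[i][j] = 0;
			if (j <= i)
				continue;
			assert(ratio[i] != 0);
			/* Sum the zones above i, up to and including j. */
			for (k = i + 1; k <= j; k++)
				higher += present[k];
			reserve[i][j] = higher / ratio[i];
		}
	}
}
```

With 4096 DMA pages, 225280 normal pages and 819200 highmem pages, the
normal zone would keep 25600 pages away from highmem-capable
allocations, without userspace needing to know any absolute watermark.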

Re: usbmon, usb core, ARM

2005-01-20 Thread David Brownell

On Thursday 20 January 2005 11:35 am, Pete Zaitcev wrote:
> On Wed, 19 Jan 2005 09:08:34 -0800, David Brownell <[EMAIL PROTECTED]> wrote:
> I do not like to refer to a dev because I do not quite understand where
> the necessary usb_dev_get/_put are now. But if you guarantee that the
> urb->dev is refcounted properly while urb is processed by
> usb_hcd_giveback_urb, I do not mind an extra indirection.

We have no reason to suspect bugs there; if there were any,
lots of things would have been breaking for a long time now.

> What would be the right test in usb_hcd_giveback_urb, then?
> It looks to me that you want me to use this:
> 
> urb_is_for_root_hub(urb) {

Actually it'd be more like dev_is_root_hub(dev, bus), since
both values are readily at hand -- you're basically just
wanting to wrap "dev == hcd->self.root_hub" in most cases.
Though I'm still not clear why you'd want to change that
working code; nothing's broken now, after all.

By the way ... on the topic of usbmon rather than changing
usbcore, is there a brief writeup of what you want this
new version to be doing -- and how?  Like, why put the
spy hooks in that location, rather than any of the other
choices.  (Many of them would be less surprising to me!)

- Dave

>  return urb->dev == urb->dev->bus->hcpriv->self.root_hub;
> }
> 
> This is just ... ew. Can we use pipe for now or do you have
> a better idea?
> 
> -- Pete
> 

Re: writeback-highmem

2005-01-20 Thread Andrew Morton

Andrea Arcangeli <[EMAIL PROTECTED]> wrote:
>
> This needed highmem fix from Rik is still missing too, so please apply
>  along the other 5 (it's orthogonal so you can apply this one in any
>  order you want).
> 
>  From: Rik van Riel <[EMAIL PROTECTED]>
>  Subject: [PATCH][1/2] adjust dirty threshold for lowmem-only mappings

I've held off on this one because the recent throttling fix should have
helped this problem.  Has anyone confirmed that this patch still actually
fixes something?  If so, what was the scenario?

Thanks.

Re: OOM fixes 2/5

2005-01-20 Thread Andrew Morton

Andrea Arcangeli <[EMAIL PROTECTED]> wrote:
>
>  This is the forward port to 2.6 of the lowmem_reserved algorithm I
>  invented in 2.4.1*, merged in 2.4.2x already and needed to fix workloads
>  like google (especially without swap) on x86 with >1G of ram, but it's
>  needed in all sort of workloads with lots of ram on x86, it's also
>  needed on x86-64 for dma allocations. This brings 2.6 in sync with
>  latest 2.4.2x.

But this patch doesn't change anything at all in the page allocation path
apart from renaming lots of things, does it?

AFAICT all it does is to change the default values in the protection map. 
It does it via a simplification, which is nice, but I can't see how it
fixes anything.

Confused.

Re: 2.6.11-rc1-mm1

2005-01-20 Thread Karim Yaghmour

OK, I finally got around to answering this ...

Roman Zippel wrote:
> Sorry, you misunderstood me. At the moment I'm only secondarily 
> interested in the API details, primarily I want to work out the details of 
> what exactly relayfs/ltt are supposed to do. One main question here I 
> can't answer yet, why you insist on multiple relayfs modes.

Earlier I should have avoided conflating the use of a certain type of
relayfs channel with a given purpose (i.e. LTT should not necessarily
depend on the managed mode). I believe that there is a need for
more than one mode in relayfs independently of LTT. There are users
who want to be able to manage the data in a buffer (by manage I mean:
receive notification of important buffer events, be able to insert
important data at boundaries, etc.), and there are users who just
want to dump as much information as possible in as fast a way as
possible without having to deal with non-essential codepaths.

> This is what I basically have in mind for the relay_write function:
> 
>   cpu = get_cpu();
>   buffer = relay_get_buffer(chan, cpu);
>   while(1) {
>   offset = local_add_return(buffer->offset, length);
>   if (likely(offset + length <= buffer->size))
>   break;
>   buffer = relay_switch_buffer(chan, buffer, offset);
>   }
>   memcpy(buffer->data + offset, data, length);
>   put_cpu();

looking at this code:

1) get_cpu() and put_cpu() won't do. You need to outright disable
interrupts because you may be called from an interrupt handler.

2) You assume that relayfs creates one buffer per cpu for each
channel. We think this is wrong. Relayfs should not need to care
about the number of CPUs, it's the clients' responsibility to
create as many channels as they see fit, whether it be one channel
per CPU or 10 channels per CPU or 1 channel per interrupt, etc.

3) I'm unclear about the need for local_add_return(), why not
just:
	if (likely(buffer->offset + length <= buffer->size))
In any case, here's what we do in relay_write():
	write_pos = relay_reserve(rchan, count, &reserve_code, &interrupting);
If there's any buffer switching required, that will be done in
relay_reserve. This has the added advantage that clients that
want to write directly to the buffer without using relay_write()
can do so by calling relay_reserve() and not care about required
buffer switching.

4) After securing the area, you simply go ahead and do a memcpy()
and leave. We think that this is insufficient. Here's what we
do:
	if (likely(write_pos != NULL)) {
		relay_write_direct(write_pos, data_ptr, count);
		relay_commit(rchan, write_pos, count, reserve_code, interrupting);
		*wrote_pos = write_pos;
The relay_write_direct() is basically a memcpy(). We also do
a relay_commit(). This actually effects the delivery of the
event. If, for example, there had been a buffer switch at the
previous relay_reserve(), then this call to relay_commit() will
generate a call to the client's deliver() callback function.
In the case of LTT, for example, this is how it knows that it's
got to notify the user-space daemon that there are buffers to
consume (i.e. write to disk.)
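
The reserve / write / commit sequence from points 3 and 4 can be modelled
in a few lines of user-space C. This is a toy sketch only: every name
below is hypothetical, there are just two fixed-size sub-buffers, and
the locking or interrupt disabling discussed in point 1 is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SUBBUF_SIZE 16
#define N_SUBBUFS   2

struct toy_chan {
	char data[N_SUBBUFS][SUBBUF_SIZE];
	size_t cur;		/* index of the sub-buffer being filled */
	size_t offset;		/* write offset inside it */
	int delivered;		/* how many times deliver() ran */
};

/* Client callback: a full sub-buffer is ready for the consumer. */
static void toy_deliver(struct toy_chan *c)
{
	c->delivered++;
}

/* Reserve len bytes; switch sub-buffers if needed and report it. */
static char *toy_reserve(struct toy_chan *c, size_t len, int *switched)
{
	char *pos;

	assert(len <= SUBBUF_SIZE);
	*switched = 0;
	if (c->offset + len > SUBBUF_SIZE) {
		c->cur = (c->cur + 1) % N_SUBBUFS;
		c->offset = 0;
		*switched = 1;
	}
	pos = &c->data[c->cur][c->offset];
	c->offset += len;
	return pos;
}

/* Commit a finished write; delivery is triggered here, not in reserve. */
static void toy_commit(struct toy_chan *c, int switched)
{
	if (switched)
		toy_deliver(c);
}

/* reserve + copy + commit, the shape described for relay_write(). */
static void toy_write(struct toy_chan *c, const void *buf, size_t len)
{
	int switched;
	char *pos = toy_reserve(c, len, &switched);

	memcpy(pos, buf, len);
	toy_commit(c, switched);
}
```

A client that wants to write in place can call toy_reserve() and
toy_commit() itself and skip toy_write(), which mirrors the point made
about relay_reserve() above.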

> ltt_log_event should only be a few lines more (for writing header and 
> event data).

Actually no, you don't want ltt_log_event using relay_write(),
for one thing because it can generate variable-size events.
Instead, ltt_log_event does (basically):
	data_size = sizeof(event_id) + sizeof(time_delta) + sizeof(data_size);

	relay_lock_channel();
	relay_reserve();

	relay_write_direct(&event_id, sizeof(event_id));
	relay_write_direct(&time_delta, sizeof(time_delta));
	if (var_data) {
		relay_write_direct(var_data, var_data_len);
		data_size += var_data_len;
	}
	relay_write_direct(&data_size, sizeof(data_size));

	relay_commit();
	relay_unlock_channel();

> What I'd like to know now are the reasons why you need more than this.

I hope the above explanation clarifies things.

> It's not the amount of data and any timing requirements have to be done by 
> the caller. During processing you either take the events in the order they 
> were recorded (often that's good enough) or you sort them which is not 
> that difficult.

Ordering is a non-issue to be honest. Unless you've got some hardware
scope in there, it's almost impossible to pinpoint exactly when an
event occurred. There is no single line of code where an event occurs,
so it's all an educated guess anyway. You want things to resemble what
really happened as much as possible though.

> I know you don't want to touch the topic of kernel debugging, but its 
> requirements greatly overlap with what you want to do with ltt, e.g. one 
> needs very often information about scheduling events as many kernel 
> processes rely more and more on kernel threads. The only real requirement 
> for kernel debugging

kernel panic with 2.4.26

2005-01-20 Thread Klaus Muth

Hi.
Every now and then (maybe twice a week) my server panics. This
is a dual Xeon system with 5Gb memory. I did my best to get the
full oops from the screen and doublechecked. Sorry, but I don't
understand anything from the ksymoops output.
Any help will be appreciated.

ksymoops 2.4.5 on i686 2.4.26-msi1.  Options used
 -V (default)
 -k /proc/ksyms (default)
 -l /proc/modules (default)
 -o /lib/modules/2.4.26-msi1/ (default)
 -m System.map-2.4.26-msi1.nogood (specified)

f893281d
*pde = 
Oops: 0002
CPU:0
EIP:0010:[]Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010256
eax: fffc43fc   ebx: 0002   ecx: f703b000   edx: 000d
esi: f187d000   edi:    ebp: f7005c1c   esp: c0353ed4
ds: 0018   es: 0018   ss: 0018
Process swapper (pid: 0, stackpage=c0353000)
Stack:  f7040d00  f778a480 0040 f7005c00  f703b000
   00040d00 f7007e80 f8921982 f7040d00 f7030b08 f778a480 0002 
     f703b200  f8921a9c f778a480 f7040d08 f77dc680
Call Trace:[] [] [] [] []
  [] [] [] [] [] []
  []
Code: 88 08 8b 86 58 01 00 00 ff 86 5c 01 00 00 88 10 ff 86 58 01


>>EIP; f893281d <_end+3851dc61/385fa444>   <=

>>eax; fffc43fc 
>>ecx; f703b000 <_end+36c26444/385fa444>
>>esi; f187d000 <_end+31468444/385fa444>
>>ebp; f7005c1c <_end+36bf1060/385fa444>
>>esp; c0353ed4 

Trace; f8921982 <_end+3850cdc6/385fa444>
Trace; f8921a9c <_end+3850cee0/385fa444>
Trace; c010a041 
Trace; c010a236 
Trace; c0106d60 
Trace; c0106d60 
Trace; c0106d60 
Trace; c0106d60 
Trace; c0106d89 
Trace; c0106df2 
Trace; c0105000 <_stext+0/0>
Trace; c010504f 

Code;  f893281d <_end+3851dc61/385fa444>
 <_EIP>:
Code;  f893281d <_end+3851dc61/385fa444>   <=
   0:   88 08 mov%cl,(%eax)   <=
Code;  f893281f <_end+3851dc63/385fa444>
   2:   8b 86 58 01 00 00 mov0x158(%esi),%eax
Code;  f8932825 <_end+3851dc69/385fa444>
   8:   ff 86 5c 01 00 00 incl   0x15c(%esi)
Code;  f893282b <_end+3851dc6f/385fa444>
   e:   88 10 mov%dl,(%eax)
Code;  f893282d <_end+3851dc71/385fa444>
  10:   ff 86 58 01 00 00 incl   0x158(%esi)

 <0>Kernel panic: Aiee, killing interrupt handler!

Could you please help me out?

klaus


Re: [PATCH] compat ioctl security hook fixup

2005-01-20 Thread Andi Kleen

On Thu, Jan 20, 2005 at 09:51:03PM -0800, Chris Wright wrote:
> > If you add it make at least sure it's not EXPORT_SYMBOL()ed.
> 
> It's certainly not, nor intended to be.  Would a comment to that
> effect alleviate your concern?

Yes please.

-Andi

Re: Radeon framebuffer weirdness in -mm2

2005-01-20 Thread Matt Mackall

On Thu, Jan 20, 2005 at 08:07:11PM -0800, Andrew Morton wrote:
> Andrew Morton <[EMAIL PROTECTED]> wrote:
> >
> > Next suspects would be:
> > 
> >  +cleanup-vc-array-access.patch
> >  +remove-console_macrosh.patch
> >  +merge-vt_struct-into-vc_data.patch
> > 
> > 
> 
> Make that:
> 
> +cleanup-vc-array-access.patch
> +remove-console_macrosh.patch
> +merge-vt_struct-into-vc_data.patch
> +vgacon-fixes-to-help-font-restauration-in-x11.patch

It's something in this batch. Which is good, as I'd be a bit
disappointed if the "vt leakage" were somehow attributable to the fb
layer. More bisection after dinner.

-- 
Mathematics is the supreme nostalgia of our time.

writeback-highmem

2005-01-20 Thread Andrea Arcangeli

This needed highmem fix from Rik is still missing too, so please apply
along the other 5 (it's orthogonal so you can apply this one in any
order you want).

From: Rik van Riel <[EMAIL PROTECTED]>
Subject: [PATCH][1/2] adjust dirty threshold for lowmem-only mappings

Simply running "dd if=/dev/zero of=/dev/hd" will
result in OOM kills, with the dirty pagecache completely filling up
lowmem.  This patch is part 1 to fixing that problem.

This patch effectively lowers the dirty limit for mappings which cannot
be cached in highmem, counting the dirty limit as a percentage of lowmem
instead.  This should prevent heavy block device writers from pushing
the VM over the edge and triggering OOM kills.

Signed-off-by: Rik van Riel <[EMAIL PROTECTED]>
Acked-by: Andrea Arcangeli <[EMAIL PROTECTED]>

--- x/mm/page-writeback.c.orig  2005-01-04 01:13:30.0 +0100
+++ x/mm/page-writeback.c   2005-01-04 02:41:29.573177184 +0100
@@ -133,7 +133,8 @@ static void get_writeback_state(struct w
  * clamping level.
  */
 static void
-get_dirty_limits(struct writeback_state *wbs, long *pbackground, long *pdirty)
+get_dirty_limits(struct writeback_state *wbs, long *pbackground, long *pdirty,
+struct address_space *mapping)
 {
int background_ratio;   /* Percentages */
int dirty_ratio;
@@ -141,10 +142,20 @@ get_dirty_limits(struct writeback_state 
long background;
long dirty;
struct task_struct *tsk;
+   unsigned long available_memory = total_pages;
 
get_writeback_state(wbs);
 
-   unmapped_ratio = 100 - (wbs->nr_mapped * 100) / total_pages;
+#ifdef CONFIG_HIGHMEM
+   /*
+* In some cases we can only allocate from low memory,
+* so we exclude high memory from our count.
+*/
+   if (mapping && !(mapping_gfp_mask(mapping) & __GFP_HIGHMEM))
+   available_memory -= totalhigh_pages;
+#endif
+
+   unmapped_ratio = 100 - (wbs->nr_mapped * 100) / available_memory;
 
dirty_ratio = vm_dirty_ratio;
if (dirty_ratio > unmapped_ratio / 2)
@@ -194,7 +205,7 @@ static void balance_dirty_pages(struct a
.nr_to_write= write_chunk,
};
 
-   get_dirty_limits(&wbs, &background_thresh, &dirty_thresh);
+   get_dirty_limits(&wbs, &background_thresh, &dirty_thresh, mapping);
nr_reclaimable = wbs.nr_dirty + wbs.nr_unstable;
if (nr_reclaimable + wbs.nr_writeback <= dirty_thresh)
break;
@@ -210,7 +221,7 @@ static void balance_dirty_pages(struct a
if (nr_reclaimable) {
writeback_inodes(&wbc);
get_dirty_limits(&wbs, &background_thresh,
-   &dirty_thresh);
+   &dirty_thresh, mapping);
nr_reclaimable = wbs.nr_dirty + wbs.nr_unstable;
if (nr_reclaimable + wbs.nr_writeback <= dirty_thresh)
break;
@@ -296,7 +307,7 @@ static void background_writeout(unsigned
long background_thresh;
long dirty_thresh;
 
-   get_dirty_limits(&wbs, &background_thresh, &dirty_thresh);
+   get_dirty_limits(&wbs, &background_thresh, &dirty_thresh, NULL);
if (wbs.nr_dirty + wbs.nr_unstable < background_thresh
&& min_pages <= 0)
break;

Re: [PATCH] compat ioctl security hook fixup

2005-01-20 Thread Chris Wright

* Andi Kleen ([EMAIL PROTECTED]) wrote:
> I'm not sure really adding vfs_ioctl is a good idea politically.
> I predict we'll see drivers starting to use it, which will cause quite
> broken design.

Yes, that'd be quite broken.  I didn't have the same expectation.

> If you add it make at least sure it's not EXPORT_SYMBOL()ed.

It's certainly not, nor intended to be.  Would a comment to that
effect alleviate your concern?

thanks,
-chris

OOM fixes 5/5

2005-01-20 Thread Andrea Arcangeli

From: Andrea Arcangeli <[EMAIL PROTECTED]>
Subject: Convert the unsafe signed (16bit) used_math to a safe and optimal PF_USED_MATH

On Sat, Dec 25, 2004 at 04:24:30AM +0100, Andrea Arcangeli wrote:
> Here it is the first part. This makes memdie a TIF_MEMDIE. It's

And here is the final incremental part converting ->used_math to
PF_USED_MATH.

I might have broken arm, see the very first change in the patch to
asm-offsets.c, rest looks ok at first glance.

If you want used_math to return 0 or 1 (instead of 0 or PF_USED_MATH),
just s/!!// in the patch below and place the !! in sched.h::*used_math()
accordingly after applying the patch; it should work just fine. Using !!
only where necessary, as the patch below does, is optimal.
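
To illustrate the "0 or PF_USED_MATH" versus "0 or 1" distinction, here is a
minimal userspace sketch. The flag value, the struct and the helper names
are assumptions made for this example; they are not copied from the patch.

```c
/* Hypothetical flag value, chosen only for this sketch. */
#define PF_USED_MATH 0x00002000UL

/* Stand-in for task_struct; only the flags word matters here. */
struct task_sketch {
    unsigned long flags;
};

/* Raw flag test: yields 0 or PF_USED_MATH, which is enough for if (). */
static unsigned long sketch_used_math(const struct task_sketch *t)
{
    return t->flags & PF_USED_MATH;
}

/* Normalized test: yields 0 or 1, for callers that store the result. */
static int sketch_used_math_bool(const struct task_sketch *t)
{
    return !!sketch_used_math(t);
}
```

A caller that only branches on the result can use the raw test; a caller
that assigns it to an int such as fpvalid wants the normalized form.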

Signed-off-by: Andrea Arcangeli <[EMAIL PROTECTED]>

--- mainline-5/arch/arm26/kernel/asm-offsets.c.orig 2003-07-17 
01:52:38.0 +0200
+++ mainline-5/arch/arm26/kernel/asm-offsets.c  2005-01-21 06:20:01.999885640 
+0100
@@ -42,7 +42,6 @@
 
 int main(void)
 {
-  DEFINE(TSK_USED_MATH,offsetof(struct task_struct, 
used_math));
   DEFINE(TSK_ACTIVE_MM,offsetof(struct task_struct, 
active_mm));
   BLANK();
   DEFINE(VMA_VM_MM,offsetof(struct vm_area_struct, vm_mm));
--- mainline-5/arch/arm26/kernel/process.c.orig 2005-01-15 20:44:48.0 
+0100
+++ mainline-5/arch/arm26/kernel/process.c  2005-01-21 06:20:02.013883512 
+0100
@@ -271,7 +271,7 @@ void flush_thread(void)
memset(>thread.debug, 0, sizeof(struct debug_info));
memset(>fpstate, 0, sizeof(union fp_state));
 
-   current->used_math = 0;
+   clear_used_math();
 }
 
 void release_thread(struct task_struct *dead_task)
@@ -305,7 +305,7 @@ copy_thread(int nr, unsigned long clone_
 int dump_fpu (struct pt_regs *regs, struct user_fp *fp)
 {
struct thread_info *thread = current_thread_info();
-   int used_math = current->used_math;
+   int used_math = !!used_math();
 
if (used_math)
memcpy(fp, >fpstate.soft, sizeof (*fp));
--- mainline-5/arch/arm26/kernel/ptrace.c.orig  2005-01-04 01:13:09.0 
+0100
+++ mainline-5/arch/arm26/kernel/ptrace.c   2005-01-21 06:20:02.018882752 
+0100
@@ -540,7 +540,7 @@ static int ptrace_getfpregs(struct task_
  */
 static int ptrace_setfpregs(struct task_struct *tsk, void *ufp)
 {
-   tsk->used_math = 1;
+   set_stopped_child_used_math(tsk);
return copy_from_user(>thread_info->fpstate, ufp,
  sizeof(struct user_fp)) ? -EFAULT : 0;
 }
--- mainline-5/arch/i386/kernel/cpu/common.c.orig   2005-01-15 
20:44:49.0 +0100
+++ mainline-5/arch/i386/kernel/cpu/common.c2005-01-21 06:20:02.027881384 
+0100
@@ -629,6 +629,6 @@ void __init cpu_init (void)
 * Force FPU initialization:
 */
current_thread_info()->status = 0;
-   current->used_math = 0;
+   clear_used_math();
mxcsr_feature_mask_init();
 }
--- mainline-5/arch/i386/kernel/i387.c.orig 2005-01-20 18:20:09.0 
+0100
+++ mainline-5/arch/i386/kernel/i387.c  2005-01-21 06:20:02.040879408 +0100
@@ -60,7 +60,8 @@ void init_fpu(struct task_struct *tsk)
tsk->thread.i387.fsave.twd = 0xu;
tsk->thread.i387.fsave.fos = 0xu;
}
-   tsk->used_math = 1;
+   /* only the device not available exception or ptrace can call init_fpu 
*/
+   set_stopped_child_used_math(tsk);
 }
 
 /*
@@ -331,13 +332,13 @@ static int save_i387_fxsave( struct _fps
 
 int save_i387( struct _fpstate __user *buf )
 {
-   if ( !current->used_math )
+   if ( !used_math() )
return 0;
 
/* This will cause a "finit" to be triggered by the next
 * attempted FPU operation by the 'current' process.
 */
-   current->used_math = 0;
+   clear_used_math();
 
if ( HAVE_HWFP ) {
if ( cpu_has_fxsr ) {
@@ -383,7 +384,7 @@ int restore_i387( struct _fpstate __user
} else {
err = restore_i387_soft( >thread.i387.soft, buf );
}
-   current->used_math = 1;
+   set_used_math();
return err;
 }
 
@@ -507,7 +508,7 @@ int dump_fpu( struct pt_regs *regs, stru
int fpvalid;
struct task_struct *tsk = current;
 
-   fpvalid = tsk->used_math;
+   fpvalid = !!used_math();
if ( fpvalid ) {
unlazy_fpu( tsk );
if ( cpu_has_fxsr ) {
@@ -522,7 +523,7 @@ int dump_fpu( struct pt_regs *regs, stru
 
 int dump_task_fpu(struct task_struct *tsk, struct user_i387_struct *fpu)
 {
-   int fpvalid = tsk->used_math;
+   int fpvalid = !!tsk_used_math(tsk);
 
if (fpvalid) {
if (tsk == current)
@@ -537,7 +538,7 @@ int dump_task_fpu(struct task_struct *ts
 
 int dump_task_extended_fpu(struct task_struct *tsk, struct user_fxsr_struct 
*fpu)
 {
-   int fpvalid = tsk->used_math && cpu_has_fxsr;
+   int fpvalid = tsk_used_math(tsk) &&

OOM fixes 4/5

2005-01-20 Thread Andrea Arcangeli

From: Andrea Arcangeli <[EMAIL PROTECTED]>
Subject: convert memdie to an atomic thread bitflag

On Sat, Dec 25, 2004 at 03:27:21AM +0100, Andrea Arcangeli wrote:
> So my current plan is to make used_math a PF_USED_MATH, and memdie a
> TIF_MEMDIE. And of course oomtaskadj an int (that one requires more than

This makes memdie a TIF_MEMDIE.

memdie is set by a task other than the one it belongs to, so it cannot
be a PF_MEMDIE (PF_ flags are only modified by current, non-atomically);
it must be an atomic thread flag, TIF_MEMDIE.
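
The reason an atomic flag is needed can be sketched in userspace with C11
atomics. This is only an illustration under assumptions: the struct and
helper names are hypothetical, and it is not how the kernel implements
set_tsk_thread_flag().

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Bit number used only in this sketch (the alpha hunk picks 9). */
#define TIF_MEMDIE 9

/* Stand-in for thread_info; one word of flags shared between CPUs. */
struct thread_info_sketch {
    atomic_ulong flags;
};

/*
 * Atomic read-modify-write: safe even if another CPU is updating a
 * different bit of the same word at the same time.
 */
static void sketch_set_flag(struct thread_info_sketch *ti, int bit)
{
    atomic_fetch_or(&ti->flags, 1UL << bit);
}

static bool sketch_test_flag(struct thread_info_sketch *ti, int bit)
{
    return (atomic_load(&ti->flags) >> bit) & 1UL;
}
```

With a plain "flags |= bit" done by a foreign task, a concurrent update by
the owner could be lost; the atomic OR avoids that.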

Signed-off-by: Andrea Arcangeli <[EMAIL PROTECTED]>

--- mainline-4/include/asm-alpha/thread_info.h.orig 2004-12-04 
08:55:03.0 +0100
+++ mainline-4/include/asm-alpha/thread_info.h  2005-01-21 06:17:24.780786576 
+0100
@@ -77,6 +77,7 @@ register struct thread_info *__current_t
 #define TIF_UAC_NOPRINT6   /* see sysinfo.h */
 #define TIF_UAC_NOFIX  7
 #define TIF_UAC_SIGBUS 8
+#define TIF_MEMDIE 9
 
 #define _TIF_SYSCALL_TRACE (1flags & PF_EXITING)) && 
!(p->flags & PF_DEAD))
+   if ((unlikely(test_tsk_thread_flag(p, TIF_MEMDIE)) || 
(p->flags & PF_EXITING)) &&
+   !(p->flags & PF_DEAD))
return ERR_PTR(-1UL);
if (p->flags & PF_SWAPOFF)
return p;
@@ -196,7 +197,7 @@ static void __oom_kill_task(task_t *p)
 * exit() and clear out its resources quickly...
 */
p->time_slice = HZ;
-   p->memdie = 1;
+   set_tsk_thread_flag(p, TIF_MEMDIE);
 
/* This process has hardware access, be more careful. */
if (cap_t(p->cap_effective) & CAP_TO_MASK(CAP_SYS_RAWIO)) {
--- mainline-4/mm/page_alloc.c.orig 2005-01-21 06:09:43.068977440 +0100
+++ mainline-4/mm/page_alloc.c  2005-01-21 06:17:24.996753744 +0100
@@ -756,7 +756,7 @@ __alloc_pages(unsigned int gfp_mask, uns
}
 
/* This allocation should allow future memory freeing. */
-   if (((p->flags & PF_MEMALLOC) || p->memdie) && !in_interrupt()) {
+   if (((p->flags & PF_MEMALLOC) || 
unlikely(test_thread_flag(TIF_MEMDIE))) && !in_interrupt()) {
/* go through the zonelist yet again, ignoring mins */
for (i = 0; (z = zones[i]) != NULL; i++) {
page = buffered_rmqueue(z, order, gfp_mask);

OOM fixes 3/5

2005-01-20 Thread Andrea Arcangeli

From: Andrea Arcangeli <[EMAIL PROTECTED]>
Subject: fix several oom killer bugs, most important avoid spurious oom kills
 badness algorithm tweaked by Thomas Gleixner to deal with fork bombs

This is the core of the oom-killer fixes. I developed it partly by taking
the idea from Thomas's patches of getting feedback from the exit path, and
I moved the oom killer into page_alloc.c, where it belongs, so that it can
check the watermarks before killing more stuff. This also tweaks the
badness to take thread bombs more into account (that change to badness
is from Thomas; for my part I'd rather rewrite badness from scratch
instead, but that's an orthogonal issue ;). With this applied the oom
killer is very sane: no more 5 sec waits and spurious oom kills.
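
The badness tweak for forking servers can be modelled in userspace as
follows. The types and the function name are simplified stand-ins made up
for this sketch; only the rule "add the vm size of children that have
their own mm" is taken from the patch.

```c
#include <stddef.h>

/* Stand-in for mm_struct. */
struct mm_sketch {
    unsigned long total_vm;
};

/* Stand-in for task_struct with an explicit array of children. */
struct task_sketch {
    struct mm_sketch *mm;           /* NULL for kernel threads */
    struct task_sketch **children;  /* child task pointers */
    size_t nr_children;
};

/* Memory part of the score: own vm size plus that of separate-mm children. */
static unsigned long sketch_base_points(const struct task_sketch *p)
{
    unsigned long points;
    size_t i;

    if (!p->mm)
        return 0;

    points = p->mm->total_vm;
    for (i = 0; i < p->nr_children; i++) {
        const struct task_sketch *c = p->children[i];

        /* Threads share the parent's mm and must not be counted twice. */
        if (c->mm && c->mm != p->mm)
            points += c->mm->total_vm;
    }
    return points;
}
```

A parent of size 100 with one forked child of size 50 and one thread
sharing its mm scores 150 here, so a fork bomb's parent stands out.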

Signed-off-by: Andrea Arcangeli <[EMAIL PROTECTED]>

--- mainline-2/include/linux/sched.h2005-01-20 18:27:45.0 +0100
+++ mainline-3/include/linux/sched.h2005-01-21 06:01:08.585190864 +0100
@@ -615,6 +615,11 @@ struct task_struct {
struct key *thread_keyring; /* keyring private to this thread */
 #endif
 /*
+ * All archs should support atomic ops with
+ * 1 byte granularity.
+ */
+   unsigned char memdie;
+/*
  * Must be changed atomically so it shouldn't be
  * be a shareable bitflag.
  */
@@ -736,8 +741,7 @@ do { if (atomic_dec_and_test(&(tsk)->usa
 #define PF_DUMPCORE0x0200  /* dumped core */
 #define PF_SIGNALED0x0400  /* killed by a signal */
 #define PF_MEMALLOC0x0800  /* Allocating memory */
-#define PF_MEMDIE  0x1000  /* Killed for out-of-memory */
-#define PF_FLUSHER 0x2000  /* responsible for disk writeback */
+#define PF_FLUSHER 0x1000  /* responsible for disk writeback */
 
 #define PF_FREEZE  0x4000  /* this task is being frozen for 
suspend now */
 #define PF_NOFREEZE0x8000  /* this thread should not be frozen */
--- mainline-2/mm/oom_kill.c2005-01-20 18:26:30.0 +0100
+++ mainline-3/mm/oom_kill.c2005-01-21 06:14:00.290873768 +0100
@@ -45,18 +45,30 @@
 unsigned long badness(struct task_struct *p, unsigned long uptime)
 {
unsigned long points, cpu_time, run_time, s;
+   struct list_head *tsk;
 
if (!p->mm)
return 0;
 
-   if (p->flags & PF_MEMDIE)
-   return 0;
/*
 * The memory size of the process is the basis for the badness.
 */
points = p->mm->total_vm;
 
/*
+* Processes which fork a lot of child processes are likely 
+* a good choice. We add the vmsize of the childs if they
+* have an own mm. This prevents forking servers to flood the
+* machine with an endless amount of childs
+*/
+   list_for_each(tsk, &p->children) {
+   struct task_struct *chld;
+   chld = list_entry(tsk, struct task_struct, sibling);
+   if (chld->mm != p->mm && chld->mm)
+   points += chld->mm->total_vm;
+   }
+
+   /*
 * CPU time is in tens of seconds and run time is in thousands
  * of seconds. There is no particular reason for this other than
  * that it turned out to work very well in practice.
@@ -132,14 +144,24 @@ static struct task_struct * select_bad_p
 
do_posix_clock_monotonic_gettime(&uptime);
do_each_thread(g, p)
-   if (p->pid) {
-   unsigned long points = badness(p, uptime.tv_sec);
-   if (points > maxpoints) {
+   /* skip the init task with pid == 1 */
+   if (p->pid > 1) {
+   unsigned long points;
+
+   /*
+* This is in the process of releasing memory so wait it
+* to finish before killing some other task by mistake.
+*/
+   if ((p->memdie || (p->flags & PF_EXITING)) && 
!(p->flags & PF_DEAD))
+   return ERR_PTR(-1UL);
+   if (p->flags & PF_SWAPOFF)
+   return p;
+
+   points = badness(p, uptime.tv_sec);
+   if (points > maxpoints || !chosen) {
chosen = p;
maxpoints = points;
}
-   if (p->flags & PF_SWAPOFF)
-   return p;
}
while_each_thread(g, p);
return chosen;
@@ -152,6 +174,12 @@ static struct task_struct * select_bad_p
  */
 static void __oom_kill_task(task_t *p)
 {
+   if (p->pid == 1) {
+   WARN_ON(1);
+   printk(KERN_WARNING "tried to kill init!\n");
+   return;
+   }
+
task_lock(p);
if (!p->mm || p->mm == &init_mm) {
WARN_ON(1);
@@ -168,7 +196,7 @@ static void __oom_kill_task(task_t *p)
 * exit() and clear out its resources

OOM fixes 2/5

2005-01-20 Thread Andrea Arcangeli

From: Andrea Arcangeli <[EMAIL PROTECTED]>
Subject: keep balance between different classzones

This is the forward port to 2.6 of the lowmem_reserved algorithm I
invented in 2.4.1*, which was already merged in 2.4.2x. It is needed to
fix workloads like google (especially without swap) on x86 with >1G of
ram, but it's also needed in all sorts of workloads with lots of ram on
x86, and on x86-64 for dma allocations. This brings 2.6 in sync with the
latest 2.4.2x.
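
The arithmetic given in the page_alloc.c comment of this patch (ratios 256
and 32 on a 1G machine) can be reproduced with a small sketch. It assumes
4 KiB pages and assumes that the reserve in a zone is the combined size of
the higher zones up to the classzone, divided by that zone's ratio, which
is what the three examples in the comment suggest. The function name and
array layout are made up for this example.

```c
#define NR_ZONES 3  /* DMA, NORMAL, HIGHMEM in this sketch */

/*
 * reserve[i][j]: pages kept free in zone i against allocations whose
 * highest usable zone (classzone) is j. Entries with i >= j are zero.
 */
static void sketch_lowmem_reserve(const unsigned long present[NR_ZONES],
                                  const unsigned long ratio[NR_ZONES - 1],
                                  unsigned long reserve[NR_ZONES][NR_ZONES])
{
    int i, j;

    for (i = 0; i < NR_ZONES; i++)
        for (j = 0; j < NR_ZONES; j++)
            reserve[i][j] = 0;

    for (j = 1; j < NR_ZONES; j++) {
        unsigned long higher = 0;

        /* Walk downwards, accumulating the size of the zones above i. */
        for (i = j - 1; i >= 0; i--) {
            higher += present[i + 1];
            reserve[i][j] = higher / ratio[i];
        }
    }
}
```

With present = { 4096, 200704, 57344 } pages (16M, 784M, 224M) and
ratio = { 256, 32 }, this gives 784 pages in ZONE_DMA for NORMAL
allocations, 1792 pages in ZONE_NORMAL and 1008 pages in ZONE_DMA for
HIGHMEM allocations, i.e. 784M/256, 224M/32 and (224M+784M)/256.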

Signed-off-by: Andrea Arcangeli <[EMAIL PROTECTED]>

--- mainline-2/include/linux/mmzone.h.orig  2005-01-15 20:45:00.0 
+0100
+++ mainline-2/include/linux/mmzone.h   2005-01-21 05:55:28.644869648 +0100
@@ -112,18 +112,14 @@ struct zone {
unsigned long   free_pages;
unsigned long   pages_min, pages_low, pages_high;
/*
-* protection[] is a pre-calculated number of extra pages that must be
-* available in a zone in order for __alloc_pages() to allocate memory
-* from the zone. i.e., for a GFP_KERNEL alloc of "order" there must
-* be "(1<
--- mainline-2/include/linux/sysctl.h.orig  2005-01-15 20:45:00.0 
+0100
+++ mainline-2/include/linux/sysctl.h   2005-01-21 05:55:28.646869344 +0100
@@ -160,7 +160,7 @@ enum
VM_PAGEBUF=17,  /* struct: Control pagebuf parameters */
VM_HUGETLB_PAGES=18,/* int: Number of available Huge Pages */
VM_SWAPPINESS=19,   /* Tendency to steal mapped memory */
-   VM_LOWER_ZONE_PROTECTION=20,/* Amount of protection of lower zones */
+   VM_LOWMEM_RESERVE_RATIO=20,/* reservation ratio for lower memory zones */
VM_MIN_FREE_KBYTES=21,  /* Minimum free kilobytes to maintain */
VM_MAX_MAP_COUNT=22,/* int: Maximum number of mmaps/address-space */
VM_LAPTOP_MODE=23,  /* vm laptop mode */
--- mainline-2/kernel/sysctl.c.orig 2005-01-15 20:45:00.0 +0100
+++ mainline-2/kernel/sysctl.c  2005-01-21 05:55:28.648869040 +0100
@@ -61,7 +61,6 @@ extern int core_uses_pid;
 extern char core_pattern[];
 extern int cad_pid;
 extern int pid_max;
-extern int sysctl_lower_zone_protection;
 extern int min_free_kbytes;
 extern int printk_ratelimit_jiffies;
 extern int printk_ratelimit_burst;
@@ -745,14 +744,13 @@ static ctl_table vm_table[] = {
 },
 #endif
{
-   .ctl_name   = VM_LOWER_ZONE_PROTECTION,
-   .procname   = "lower_zone_protection",
-   .data   = &sysctl_lower_zone_protection,
-   .maxlen = sizeof(sysctl_lower_zone_protection),
+   .ctl_name   = VM_LOWMEM_RESERVE_RATIO,
+   .procname   = "lowmem_reserve_ratio",
+   .data   = &sysctl_lowmem_reserve_ratio,
+   .maxlen = sizeof(sysctl_lowmem_reserve_ratio),
.mode   = 0644,
-   .proc_handler   = &lower_zone_protection_sysctl_handler,
+   .proc_handler   = &lowmem_reserve_ratio_sysctl_handler,
.strategy   = &sysctl_intvec,
-   .extra1 = &zero,
},
{
.ctl_name   = VM_MIN_FREE_KBYTES,
--- mainline-2/mm/page_alloc.c.orig 2005-01-15 20:45:00.0 +0100
+++ mainline-2/mm/page_alloc.c  2005-01-21 05:58:53.338751448 +0100
@@ -44,7 +44,15 @@ struct pglist_data *pgdat_list;
 unsigned long totalram_pages;
 unsigned long totalhigh_pages;
 long nr_swap_pages;
-int sysctl_lower_zone_protection = 0;
+/*
+ * results with 256, 32 in the lowmem_reserve sysctl:
+ * 1G machine -> (16M dma, 800M-16M normal, 1G-800M high)
+ * 1G machine -> (16M dma, 784M normal, 224M high)
+ * NORMAL allocation will leave 784M/256 of ram reserved in the ZONE_DMA
+ * HIGHMEM allocation will leave 224M/32 of ram reserved in ZONE_NORMAL
+ * HIGHMEM allocation will leave (224M+784M)/256 of ram reserved in ZONE_DMA
+ */
+int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES-1] = { 256, 32 };
 
 EXPORT_SYMBOL(totalram_pages);
 EXPORT_SYMBOL(nr_swap_pages);
@@ -654,7 +662,7 @@ buffered_rmqueue(struct zone *zone, int 
  * of the allocation.
  */
 int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
-   int alloc_type, int can_try_harder, int gfp_high)
+ int classzone_idx, int can_try_harder, int gfp_high)
 {
/* free_pages my go negative - that's OK */
long min = mark, free_pages = z->free_pages - (1 << order) + 1;
@@ -665,7 +673,7 @@ int zone_watermark_ok(struct zone *z, in
if (can_try_harder)
min -= min / 4;
 
-   if (free_pages <= min + z->protection[alloc_type])
+   if (free_pages <= min + z->lowmem_reserve[classzone_idx])
return 0;
for (o = 0; o < order; o++) {
/* At the next order, this order's pages become unavailable */
@@ -682,19 +690,6 @@ int zone_watermark_ok(struct zone *z, in
 
 /*
  * This is the 'heart' of the zoned buddy allocator.
- *
- * Herein lies the mysterious

OOM fixes 1/5

2005-01-20 Thread Andrea Arcangeli

I'm sending 5 patches incremental with each other updated to the latest
bk snapshot I could find on kernel.org [kernel cvs is still unusable for
me, is it my mistake?]

From: [EMAIL PROTECTED]
Subject: protect-pids

This is protect-pids, a patch to allow the admin to tune the oom killer.
The tweak is inherited between parent and child so it's easy to write a
wrapper for complex apps.
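
A wrapper of the kind mentioned above could look roughly like this. It is
a sketch under assumptions: the helper names are hypothetical, the range
-16..15 and the file name oom_adj are taken from the patch below, and the
file only exists on kernels that carry this interface.

```c
#include <stdio.h>

#define OOM_ADJ_MIN (-16)
#define OOM_ADJ_MAX 15

/* Format an adjustment as the text to write; returns length or -1. */
static int sketch_format_oom_adj(int value, char *buf, size_t len)
{
    int n;

    if (value < OOM_ADJ_MIN || value > OOM_ADJ_MAX)
        return -1;
    n = snprintf(buf, len, "%d\n", value);
    return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/* Write the adjustment to path, e.g. "/proc/self/oom_adj"; 0 on success. */
static int sketch_write_oom_adj(const char *path, int value)
{
    char buf[8];
    FILE *f;
    int ok;

    if (sketch_format_oom_adj(value, buf, sizeof(buf)) < 0)
        return -1;
    f = fopen(path, "w");
    if (!f)
        return -1;
    ok = (fputs(buf, f) >= 0);
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

The wrapper would call sketch_write_oom_adj("/proc/self/oom_adj", value)
and then exec the real application; since the value is inherited, every
child of the application gets it too. The write handler in the patch
requires CAP_SYS_RESOURCE.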

I made used_math a char in light of later patches. The current patch
breaks alpha, but future patches will fix it.

Signed-off-by: Andrea Arcangeli <[EMAIL PROTECTED]>

--- mainline/fs/proc/base.c 2005-01-15 20:44:58.0 +0100
+++ mainline-1/fs/proc/base.c   2005-01-20 18:26:29.0 +0100
@@ -72,6 +72,8 @@ enum pid_directory_inos {
PROC_TGID_ATTR_FSCREATE,
 #endif
PROC_TGID_FD_DIR,
+   PROC_TGID_OOM_SCORE,
+   PROC_TGID_OOM_ADJUST,
PROC_TID_INO,
PROC_TID_STATUS,
PROC_TID_MEM,
@@ -98,6 +100,8 @@ enum pid_directory_inos {
PROC_TID_ATTR_FSCREATE,
 #endif
PROC_TID_FD_DIR = 0x8000,   /* 0x8000-0x */
+   PROC_TID_OOM_SCORE,
+   PROC_TID_OOM_ADJUST,
 };
 
 struct pid_entry {
@@ -133,6 +137,8 @@ static struct pid_entry tgid_base_stuff[
 #ifdef CONFIG_SCHEDSTATS
E(PROC_TGID_SCHEDSTAT, "schedstat", S_IFREG|S_IRUGO),
 #endif
+   E(PROC_TGID_OOM_SCORE, "oom_score",S_IFREG|S_IRUGO),
+   E(PROC_TGID_OOM_ADJUST,"oom_adj", S_IFREG|S_IRUGO|S_IWUSR),
{0,0,NULL,0}
 };
 static struct pid_entry tid_base_stuff[] = {
@@ -158,6 +164,8 @@ static struct pid_entry tid_base_stuff[]
 #ifdef CONFIG_SCHEDSTATS
E(PROC_TID_SCHEDSTAT, "schedstat",S_IFREG|S_IRUGO),
 #endif
+   E(PROC_TID_OOM_SCORE,  "oom_score",S_IFREG|S_IRUGO),
+   E(PROC_TID_OOM_ADJUST, "oom_adj", S_IFREG|S_IRUGO|S_IWUSR),
{0,0,NULL,0}
 };
 
@@ -384,6 +392,18 @@ static int proc_pid_schedstat(struct tas
 }
 #endif
 
+/* The badness from the OOM killer */
+unsigned long badness(struct task_struct *p, unsigned long uptime);
+static int proc_oom_score(struct task_struct *task, char *buffer)
+{
+   unsigned long points;
+   struct timespec uptime;
+
+   do_posix_clock_monotonic_gettime(&uptime);
+   points = badness(task, uptime.tv_sec);
+   return sprintf(buffer, "%lu\n", points);
+}
+
 //
 /*   Here the fs part begins*/
 //
@@ -657,6 +677,55 @@ static struct file_operations proc_mem_o
.open   = mem_open,
 };
 
+static ssize_t oom_adjust_read(struct file * file, char * buf,
+   size_t count, loff_t *ppos)
+{
+   struct task_struct *task = proc_task(file->f_dentry->d_inode);
+   char buffer[8];
+   size_t len;
+   int oom_adjust = task->oomkilladj;
+
+   len = sprintf(buffer, "%i\n", oom_adjust) + 1;
+   if (*ppos >= len)
+   return 0;
+   if (count > len-*ppos)
+   count = len-*ppos;
+   if (copy_to_user(buf, buffer + *ppos, count)) 
+   return -EFAULT;
+   *ppos += count;
+   return count;
+}
+
+static ssize_t oom_adjust_write(struct file * file, const char * buf,
+   size_t count, loff_t *ppos)
+{
+   struct task_struct *task = proc_task(file->f_dentry->d_inode);
+   char buffer[8], *end;
+   int oom_adjust;
+
+   if (!capable(CAP_SYS_RESOURCE))
+   return -EPERM;
+   memset(buffer, 0, 8);   
+   if (count > 6)
+   count = 6;
+   if (copy_from_user(buffer, buf, count)) 
+   return -EFAULT;
+   oom_adjust = simple_strtol(buffer, &end, 0);
+   if (oom_adjust < -16 || oom_adjust > 15)
+   return -EINVAL;
+   if (*end == '\n')
+   end++;
+   task->oomkilladj = oom_adjust;
+   if (end - buffer == 0) 
+   return -EIO;
+   return end - buffer;
+}
+
+static struct file_operations proc_oom_adjust_operations = {
+   read:   oom_adjust_read,
+   write:  oom_adjust_write,
+};
+
 static struct inode_operations proc_mem_inode_operations = {
.permission = proc_permission,
 };
@@ -1336,6 +1405,15 @@ static struct dentry *proc_pident_lookup
ei->op.proc_read = proc_pid_schedstat;
break;
 #endif
+   case PROC_TID_OOM_SCORE:
+   case PROC_TGID_OOM_SCORE:
+   inode->i_fop = &proc_info_file_operations;
+   ei->op.proc_read = proc_oom_score;
+   break;
+   case PROC_TID_OOM_ADJUST:
+   case PROC_TGID_OOM_ADJUST:
+   inode->i_fop = &proc_oom_adjust_operations;
+   break;
default:
printk("procfs: impossible type (%d)",p->type);

System calls effect after booting phase ??

2005-01-20 Thread selvakumar nagendran

--- [EMAIL PROTECTED] wrote:

> Possibility 1:
> Load them from an initrd image while booting.  If
> you're already
> using an initrd, and this is "early enough", you
> just need to put the
> module into the initrd, and make sure the /linuxrc
> or whatever script
> does an insmod for it.  This has the advantage of
> working for out-of-tree
> modules.

 Now I am using an initrd image. How can I load my
module from it? In which file should I insert the
corresponding line? Can you tell me more about how to
do this? I am using kernel 2.4.28. Do I have to
recompile the whole kernel again?

Thanks,
selva


Re: [PATCH] compat ioctl security hook fixup

2005-01-20 Thread Andi Kleen

On Thu, Jan 20, 2005 at 05:26:56PM -0800, Chris Wright wrote:
> * Michael S. Tsirkin ([EMAIL PROTECTED]) wrote:
> > Security hook seems to be missing before compat_ioctl in mm2.
> > And, it would be nice to avoid calling it twice on some paths.
> > 
> > Chris Wright's patch addressed this in the most elegant way I think,
> > by adding vfs_ioctl.
> 
> The patch below is against Linus' tree as per Andrew's request.  It will
> conflict with some of the changes in -mm2 (including the some-fixes bit
> from Andi, and LTT).  I also have a patch directly against -mm2 if anyone
> would like to see that instead.

I'm not sure really adding vfs_ioctl is a good idea politically.
I predict we'll see drivers starting to use it, which will cause quite
broken design.

If you add it make at least sure it's not EXPORT_SYMBOL()ed.

-Andi

Re: Radeon framebuffer weirdness in -mm2

2005-01-20 Thread Andrew Morton

Andrew Morton <[EMAIL PROTECTED]> wrote:
>
> Next suspects would be:
> 
>  +cleanup-vc-array-access.patch
>  +remove-console_macrosh.patch
>  +merge-vt_struct-into-vc_data.patch
> 
> 

Make that:

+cleanup-vc-array-access.patch
+remove-console_macrosh.patch
+merge-vt_struct-into-vc_data.patch
+vgacon-fixes-to-help-font-restauration-in-x11.patch

and the fbdev updates, maybe:

+radeonfb-set-accelerator-id.patch
+vesafb-change-return-error-id.patch
+intelfb-workaround-for-830m.patch
+fbcon-save-blank-state-last.patch
+backlight-fix-compile-error-if-config_fb-is-unset.patch
+matroxfb-fb_matrox_g-kconfig-changes.patch



Re: Radeon framebuffer weirdness in -mm2

2005-01-20 Thread Andrew Morton

Matt Mackall <[EMAIL PROTECTED]> wrote:
>
> Here are the symptoms:
> 
>  mm2: corruption of Tux logo at boot, corruption of display at
>  powerdown, lockup and LCD blooming on next warm boot when radeonfb
>  starts. Ben suggested I try some radeonfb options, but none seemed to
>  have any effect.
> 
>  mm1: no observed problems
> 
>  mm2 - above patches: corruption still occurs but no lockup on next
>  warm boot.

So we have multiple bugs?

Next suspects would be:

+cleanup-vc-array-access.patch
+remove-console_macrosh.patch
+merge-vt_struct-into-vc_data.patch



Re: Radeon framebuffer weirdness in -mm2

2005-01-20 Thread Matt Mackall

On Thu, Jan 20, 2005 at 04:01:23PM -0800, Andrew Morton wrote:
> Matt Mackall <[EMAIL PROTECTED]> wrote:
> >
> > > Which radeon driver? CONFIG_FB_RADEON_OLD or CONFIG_FB_RADEON?
> > 
> > FB_RADEON.
> 
> Ah, OK.  Likely culprits are
> 
> radeonfb-massive-update-of-pm-code.patch
> radeonfb-build-fix.patch

Ok, learned a few things.

Here are the symptoms:

mm2: corruption of Tux logo at boot, corruption of display at
powerdown, lockup and LCD blooming on next warm boot when radeonfb
starts. Ben suggested I try some radeonfb options, but none seemed to
have any effect.

mm1: no observed problems

mm2 - above patches: corruption still occurs but no lockup on next
warm boot.

I think I have a lead on the logo and shutdown corruption:

If I do a reboot(8) from inside X, I get switched to vt 0, but the
shutdown messages come out on vt 7, where X was running. As I'm
sitting on vt 0 during shutdown, I see character cells changed to
something like "_" (last two scanlines filled) slowly marching down
the screen corresponding to the shutdown messages.

So the logo corruption is probably getty popping up on the
other vts at the end of init. The timing and the screen placement seem
to agree.

Photos for the curious (be sure to see "executioner Tux" glitch):
http://selenic.com/radeon

-- 
Mathematics is the supreme nostalgia of our time.

Re: [PATCH]: fix the bug of __free_pages() of mm/page_alloc.c

2005-01-20 Thread zhan rongkai

--- linux-2.6.10.orig/mm/page_alloc.c   2004-12-25 05:33:51.0 +0800
+++ linux-2.6.10/mm/page_alloc.c2005-01-21 11:46:58.0 +0800
@@ -788,7 +788,22 @@
 
 fastcall void __free_pages(struct page *page, unsigned int order)
 {
-   if (!PageReserved(page) && put_page_testzero(page)) {
+   if (!PageReserved(page)) {
+#ifdef CONFIG_MMU
+   if (!put_page_testzero(page))
+   return;
+#else
+   int i, result = 1;
+
+   /*
+* We need to de-reference all the pages for this order -- see
set_page_refs()
+*/
+   for (i = 0; i < (1 << order); i++)
+   result &= put_page_testzero(page+i);
+   if (!result)
+   BUG();
+#endif /* CONFIG_MMU */
+
if (order == 0)
free_hot_page(page);
else


On Fri, 21 Jan 2005 11:40:52 +0800, zhan rongkai <[EMAIL PROTECTED]> wrote:
> --- linux-2.6.10.orig/mm/page_alloc.c   2004-12-25 05:33:51.0 +0800
> +++ linux-2.6.10/mm/page_alloc.c2005-01-21 11:43:44.0 +0800
> @@ -788,7 +788,22 @@
> 
>  fastcall void __free_pages(struct page *page, unsigned int order)
>  {
> -   if (!PageReserved(page) && put_page_testzero(page)) {
> +   if (!PageReserved(page)) {
> +#ifdef CONFIG_MMU
> +   if (!put_page_testzero(page))
> +   return;
> +#else
> +   int i, result = 1;
> +
> +   /*
> +* We need to de-reference all the pages for this order -- see
> set_page_refs()
> +*/
> +for (i = 0; i < (1 << order); i++)
> +result &= put_page_testzero(page+i);
> +if (!result)
> +BUG();
> +#endif /* CONFIG_MMU */
> +
> if (order == 0)
> free_hot_page(page);
> else
> 
> --
> Rongkai Zhan
> 


-- 
Rongkai Zhan

Re: [PATCH]: fix the bug of __free_pages() of mm/page_alloc.c

2005-01-20 Thread zhan rongkai

--- linux-2.6.10.orig/mm/page_alloc.c   2004-12-25 05:33:51.0 +0800
+++ linux-2.6.10/mm/page_alloc.c2005-01-21 11:43:44.0 +0800
@@ -788,7 +788,22 @@
 
 fastcall void __free_pages(struct page *page, unsigned int order)
 {
-   if (!PageReserved(page) && put_page_testzero(page)) {
+   if (!PageReserved(page)) {
+#ifdef CONFIG_MMU
+   if (!put_page_testzero(page))
+   return;
+#else
+   int i, result = 1;
+
+   /*
+* We need to de-reference all the pages for this order -- see
set_page_refs()
+*/
+for (i = 0; i < (1 << order); i++)
+result &= put_page_testzero(page+i);
+if (!result)
+BUG();
+#endif /* CONFIG_MMU */
+
if (order == 0)
free_hot_page(page);
else


-- 
Rongkai Zhan

Re: [PATCH]: fix the bug of __free_pages() of mm/page_alloc.c

2005-01-20 Thread zhan rongkai

On Thu, 20 Jan 2005 14:31:34 +, Russell King
<[EMAIL PROTECTED]> wrote:
> On Thu, Jan 20, 2005 at 09:34:17PM +0800, zhan rongkai wrote:
> > [PATCH]: fix the bug of __free_pages() of mm/page_alloc.c
> > =
> >
> > The buddy allocator's __free_pages() function seems to be buggy.
> >
> > The following codes are from kernel 2.6.10:
> >
> > fastcall void __free_pages(struct page *page, unsigned int order)
> > {
> >   if (!PageReserved(page) && put_page_testzero(page)) {
> >   if (order == 0)
> >   free_hot_page(page);
> >   else
> >   __free_pages_ok(page, order);
> >   }
> > }
> >
> > As you know, before truely freeing all pages, this function calls
> > put_page_testzero(page) to
> > drop the refcount of the pages.
> >
> > But, in fact the macro put_page_testzero(page) **only** drops **one**
> > page's refcount.
> > Therefore, if (order > 0), the refcounts of (page+1) ..
> > (page+(1<
> > This will cause __free_pages_ok() to dump stack, because it finds some
> > pages' page_count()
> > are not zero!
> 
> When you allocate a page with order > 0, the first 0-order page has a
> refcount of 1, and the remaining 0-order pages have a refcount of 0.

Thank you for pointing this out.

> If you're triggering this check, I suspect you're fiddling about with
> the individual pages (using get_page on them individually?) which is
> a no-no.
> 
> --
> Russell King
> 

Oh, I forgot to tell you that my CPU has no MMU, sorry:-)
Let's look at set_page_refs(), which is called by
prep_new_page():

static inline void set_page_refs(struct page *page, int order)
{
#ifdef CONFIG_MMU
set_page_count(page, 1);
#else
int i;

/*
 * We need to reference all the pages for this order, otherwise if
 * anyone accesses one of the pages with (get/put) it will be freed.
 */
for (i = 0; i < (1 << order); i++)
set_page_count(page+i, 1);
#endif /* CONFIG_MMU */
}

We can see that it sets all pages' refcount to 1 when there is no MMU.
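
The mismatch can be shown with a tiny userspace model of the refcounts.
The struct and helper names are invented for this sketch; only the
behaviour of set_page_refs() and of the 2.6.10 __free_pages() quoted above
is modelled.

```c
/* Stand-in for struct page; only the reference count is modelled. */
struct page_sketch {
    int count;
};

/* Mirrors set_page_refs(): with an MMU only the head page is referenced. */
static void sketch_set_refs(struct page_sketch *page, int order, int has_mmu)
{
    int i;

    if (has_mmu) {
        page->count = 1;
        return;
    }
    for (i = 0; i < (1 << order); i++)
        page[i].count = 1;
}

/* Drops only the head page's reference, as 2.6.10's __free_pages() does. */
static int sketch_put_head_only(struct page_sketch *page)
{
    return --page->count == 0;
}

/* Counts the pages of the block that still hold a reference. */
static int sketch_still_referenced(const struct page_sketch *page, int order)
{
    int i, n = 0;

    for (i = 0; i < (1 << order); i++)
        if (page[i].count != 0)
            n++;
    return n;
}
```

For an order-2 block without MMU, dropping only the head reference leaves
three pages with a non-zero count, which is the state __free_pages_ok()
complains about; with an MMU the same sequence leaves none.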

My previous patch is wrong. Here is a new one:


--- linux-2.6.10.orig/mm/page_alloc.c   2004-12-25 05:33:51.0 +0800
+++ linux-2.6.10/mm/page_alloc.c2005-01-21 11:34:57.0 +0800
@@ -787,8 +787,23 @@
 }
 
 fastcall void __free_pages(struct page *page, unsigned int order)
-{
-   if (!PageReserved(page) && put_page_testzero(page)) {
+{  
+   if (!PageReserved(page)) {
+#ifdef CONFIG_MMU
+   if (!put_page_testzero(page))
+   return;
+#else
+   int i, result = 1;
+
+   /*
+* We need to de-reference all the pages for this order -- see
set_page_refs()
+*/
+for (i = 0; i < (1 << order); i++)
+result &= put_page_testzero(page);
+if (!result)
+BUG();
+#endif /* CONFIG_MMU */
+
if (order == 0)
free_hot_page(page);
else

-- 
Rongkai Zhan

Re: PROBLEM: possible memleak in 2.6.11-rc1

2005-01-20 Thread Andrew Morton

Lennert Van Alboom <[EMAIL PROTECTED]> wrote:
>
> Possible memleak in 2.6.11-rc1?

Please wait for it to happen again and then send the contents of
/proc/meminfo and /proc/slabinfo.


Re: Typo in [AGPGART] i915GM support patch

2005-01-20 Thread Dave Jones

On Thu, Jan 20, 2005 at 05:46:22PM +0100, Marco Cipullo wrote:
 > - if (agp_bridge->dev->device == PCI_DEVICE_ID_INTEL_82915G_HB)
 > + if (agp_bridge->dev->device == PCI_DEVICE_ID_INTEL_82915G_HB ||
 > + agp_bridge->dev->device == PCI_DEVICE_ID_INTEL_82915G_HB)
 >  gtt_entries = MB(48) - KB(size);
 >  else
 >  gtt_entries = 0;
 >  break;
 > Perhaps it should be:
 > 
 > @@ -415,14 +415,16 @@
 >  break;
 >  case I915_GMCH_GMS_STOLEN_48M:
 >  /* Check it's really I915G */
 > - if (agp_bridge->dev->device == PCI_DEVICE_ID_INTEL_82915G_HB)
 > + if (agp_bridge->dev->device == PCI_DEVICE_ID_INTEL_82915G_HB ||
 > + agp_bridge->dev->device == PCI_DEVICE_ID_INTEL_82915GM_HB)
 >  gtt_entries = MB(48) - KB(size);
 >  else
 >  gtt_entries = 0;
 >  break;
 > 
 > The same applies a few lines below

Duh, yes. Thanks.
Fix sent to Linus.

Dave


Re: [RFC] tracepipe -- event streams, debugfs, and pipe_buffers

2005-01-20 Thread Karim Yaghmour

Zach Brown wrote:
> Only briefly.  They've always seemed more involved than the sort of
> thing I was after.  I'll try and sit down and investigate in more detail.

There's definitely an opportunity for interfacing here. If nothing else,
this clearly shows the interest for the kind of things both relayfs and
ltt attempt to achieve.

So here are a few comments regarding the implementation and how this
relates to the stuff I'm working on.

> While it's running the kernel subsystem can send binary blobs, less than
> the length of a page, down this channel.  The blobs are copied into
> per-cpu lists of pages.  Cutesy little headers with get_cycles() and the
> cpu id are prepended to each blob.  The traces are only recorded if user
> space has open references to the file.

In the case of LTT, we just open one relay channel per cpu. This avoids
having to write the CPUID to the trace, that's 2 bytes less per event,
and also avoids any need for synchronization.

As for get_cycles(), some architectures don't have anything useful to
give. Here's for ARM (include/asm-arm/timex.h):
static inline cycles_t get_cycles (void)
{
return 0;
}

In the case of LTT, we just use the, albeit expensive, do_gettimeofday
when hardware counters aren't there (currently all non-x86 tracing does
this, but this should be fixed.) Also, in the case of the x86 at least,
we just write the lower 32-bits of the TSC, so that's 4 bytes less per
event. Instead, we use the buffer_start and buffer_end callbacks provided
by relayfs to write a header and footer containing full do_gettimeofday
value and TSC value.
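
The truncated-timestamp scheme above can be sketched in user-space C. This is only an illustration, not LTT's actual code: the names `buf_header`, `expand_tsc` and `first_event_time` are hypothetical, and it assumes events within a buffer are stored in time order and that fewer than 2^32 cycles pass between consecutive events.

```c
#include <stdint.h>

/* Hypothetical sub-buffer header: the full 64-bit counter value
 * recorded once, when the buffer was started. */
struct buf_header {
    uint64_t start_tsc;
};

/* Rebuild a full 64-bit timestamp from the low 32 bits stored with an
 * event. 'prev' is the full timestamp of the previous event. Assumes
 * fewer than 2^32 cycles elapsed since 'prev'. */
uint64_t expand_tsc(uint64_t prev, uint32_t low32)
{
    uint64_t full = (prev & ~0xffffffffULL) | low32;

    if (full < prev)            /* low word wrapped since 'prev' */
        full += 1ULL << 32;
    return full;
}

/* The first event in a buffer is expanded against the header value. */
uint64_t first_event_time(const struct buf_header *h, uint32_t low32)
{
    return expand_tsc(h->start_tsc, low32);
}
```

A reader of the trace would call `first_event_time()` for the first event in a buffer and then feed each result back in as `prev` for the next event, so only 4 bytes of timestamp are needed per event.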

> As the pages fill they're kicked off to a work_struct worker who puts
> them in the bufs[] array in the debugfs pipe file.  Userspace can then
> do whatever it wants with the data via the pipe.  One can imagine it
> wanting to splice() these pages to disk in huge batches, or perhaps some
> zero-copy network card, etc.  I've only tested this so far as verifying
> that 'cat' is able to push data into a regular file.

It seems to me that while this is a nice use of pipes, it isn't as fast
as ram-locked pages. relayfs does the bttv driver magic (or what used
to be done in there, I haven't checked what they do lately): we
allocate pages, lock them into ram and remap them for use as a single
memory area. No caching necessary. It goes from the buffer to whatever
media you want (disk, network, etc.) IOW, user-space does an open(),
mmap(), write(). Also, the channels exist whether user-space has done
an open or not. That's good for flight-recording.

Looking at the code:

- tracepipe_event() does a get_cpu()/put_cpu() for protecting the
writing to the buffer. What about tracing within an interrupt?
local_irq_save()?

- I hadn't thought of doing something like this to write the header:
+   hdr = tcpu->next_region;
+   hdr->cycles = get_cycles();
+   hdr->cpu = cpu;
I will replace some of the memcpy() code in LTT with something like this.

- From what I assume is a "wishlist":
+ * - actually communicate missed to userspace

Already done in LTT.

+ * - how to specify wrapping or dropping

relayfs provides RELAY_MODE_CONTINUOUS and RELAY_MODE_NO_OVERWRITE.

+ * - non-temporal stores into bufs

The latest relayfs code doesn't care about timestamps. That is its
clients' job (e.g. ltt).

+ * - let caller reserve space and get a pointer into buf

This is the relevant relayfs function:
char* relay_reserve(struct rchan *rchan, u32 len, int *err, int *interrupting)
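
The "reserve space, get a pointer, fill the header in place" idea from the list above can be modelled in plain C. This is a toy sketch under stated assumptions, not relayfs code: `toy_buf`, `toy_hdr`, `toy_reserve` and `toy_log` are hypothetical names, there is a single buffer with no wrapping, and nothing here protects against interrupts or other CPUs.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_BUF_SIZE 4096

/* Toy buffer; not relayfs's struct rchan. */
struct toy_buf {
    uint64_t words[TOY_BUF_SIZE / 8];  /* storage, 8-byte aligned */
    size_t used;                       /* bytes already reserved */
    unsigned long missed;              /* events dropped for lack of room */
};

/* Hypothetical event header, modelled on the hdr->cycles / hdr->cpu
 * assignments quoted above. */
struct toy_hdr {
    uint64_t cycles;
    uint32_t cpu;
    uint32_t len;                      /* payload length in bytes */
};

/* Reserve 'len' bytes and return a pointer into the buffer, or NULL
 * (counting a missed event) when there is not enough room. */
void *toy_reserve(struct toy_buf *b, size_t len)
{
    size_t need = (len + 7) & ~(size_t)7;  /* keep headers aligned */
    void *p;

    if (len > TOY_BUF_SIZE || need > TOY_BUF_SIZE - b->used) {
        b->missed++;
        return NULL;
    }
    p = (unsigned char *)b->words + b->used;
    b->used += need;
    return p;
}

/* Write the header directly into reserved space, then copy the
 * payload right behind it. */
int toy_log(struct toy_buf *b, uint64_t cycles, uint32_t cpu,
            const void *payload, uint32_t len)
{
    struct toy_hdr *hdr = toy_reserve(b, sizeof(*hdr) + len);

    if (!hdr)
        return -1;
    hdr->cycles = cycles;
    hdr->cpu = cpu;
    hdr->len = len;
    memcpy(hdr + 1, payload, len);
    return 0;
}
```

The `missed` counter is the kind of value that would have to be communicated to user space, and a real implementation would also need to decide between wrapping and dropping when the buffer fills.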

Karim
-- 
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [EMAIL PROTECTED] || 1-866-677-4546

RE: [ANNOUNCE][RFC] plugsched-2.0 patches ...

2005-01-20 Thread Marc E. Fiuczynski

Hi Peter,

> I'm hoping that the CKRM folks will send me a patch to add their
> scheduler to plugsched :-)

They are planning to release a patch against 2.6.10.  But their patch won't
stand alone against 2.6.10 and so it might be difficult for you to integrate
their code into a scheduler for plugsched.

Also, the CKRM scheduler only modifies Ingo's O(1) scheduler.  It certainly
would be interesting to have CKRM variants of the other schedulers.  This
points to a whole new level of 'plugsched' in that general O(1) schedulers
need to support fair share plugins.

Marc


Re: [PATCH] PPC64: EEH Recovery

2005-01-20 Thread Paul Mackerras

Linas Vepstas writes:

> > 2. I don't see why the device nodes for the PCI subtree being reset
> >would go away, and thus I don't see the need for your eeh_cfg_tree
> >struct.
> 
> It's not the reset, it's the hot-plug remove.  The hot plug code assumes
> that you are going to physically remove the device from the slot, so
> it removes the device_node as part of the "unconfig".  

OK, I missed that.  It seems a bit bogus to me.  Could you point me at
where in the code this happens?

> > 3. Is there a good reason why we can't use the assigned-addresses
> >property on the relevant device tree nodes to tell us what to set
> >the BARs to?
> 
> Yes, the reason is that after a reset, that property doesn't hold any 
> decent data.   I discussed this with the firmware developers, and their 
> response was that it is the kernel's responsibility to compute 
> (or save/restore) such values.  (Except for bridges, which they will do for 
> us).

The not holding any decent data is a consequence of the device nodes
getting thrown away, isn't it?  I fail to see how resetting the device
can of itself affect our copy of the device tree.

> > In particular I think it should be a
> >userland write to a sysfs file that kicks off the restart process
> >rather than it just happening after 5 seconds.  Anyway, what
> >process or thread is executing that 5 second sleep?  Is it keventd
> >or something?
> 
> It's a workqueue.

Which get run in keventd's context.  In other words no other
workqueues will get run during the 5 second sleep, or at least not on
that cpu.

Paul.

Re: [ANNOUNCE][RFC] plugsched-2.0 patches ...

2005-01-20 Thread Peter Williams

Marc E. Fiuczynski wrote:
> Peter, thank you for maintaining Con's plugsched code in light of Linus' and
> Ingo's prior objections to this idea.  On the one hand, I partially agree
> with Linus's prior views that when there is only one scheduler that the
> rest of the world + dog will focus on making it better. On the other hand,
> having a clean framework that lets developers in a clean way plug in new
> schedulers is quite useful.
>
> Linus & Ingo, it would be good to have an indepth discussion on this topic.
> I'd argue that the Linux kernel NEEDS a clean pluggable scheduling
> framework.
>
> Let me make a case for this NEED by example.  Ingo's scheduler belongs to
> the egalitarian regime of schedulers that do a poor job of isolating
> workloads from each other in multiprogrammed environments such as those
> found on Enterprise servers and in my case on PlanetLab (www.planet-lab.org)
> nodes.  This has been rectified by HP-UX, Solaris, and AIX through the use
> of fair share schedulers that use O(1) schedulers within a share.  Currently
> PlanetLab uses a CKRM modified version of Ingo's scheduler.

I'm hoping that the CKRM folks will send me a patch to add their 
scheduler to plugsched :-)

> Similarly, the
> linux-vserver project also modifies Ingo's scheduler to construct an
> entitlement based scheduling regime. These are not just variants of O(1)
> schedulers in the sense of Con's staircase O(1). Nor is it clear what the
> best type of scheduler is for these environments (i.e., HP-UX, Solaris and
> AIX don't have it fully solved yet either). The ability to dynamically swap
> out schedulers on a production system like PlanetLab would help in
> determining what type of scheduler is the most appropriate.  This is because
> it is non-trivial, if not impossible, to recreate the multiprogrammed
> workloads that we see in a lab.
>
> For these reasons, it would be useful for plugsched (or something like it)
> to make its way into the mainline kernel as a framework to plug in different
> schedulers.  Alternatively, it would be useful to consider in what way
> Ingo's scheduler needs to support plugins such as the CKRM and Vserver types
> of changes.
>
> Best regards,
> Marc

--
Peter Williams   [EMAIL PROTECTED]
"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce

Re: [PATCH][RFC] swsusp: speed up image restoring on x86-64

2005-01-20 Thread hugang

On Thu, Jan 20, 2005 at 10:46:37PM +0100, Rafael J. Wysocki wrote:
> On Thursday, 20 of January 2005 21:59, Pavel Machek wrote:
> 
> Sure, but I think it's there for a reason.
> 
> > Anyway, this is likely to clash with hugang's work; I'd prefer this not to 
> > be applied.
> 
> I am aware of that, but you are not going to merge the hugang's patches soon, 
> are you?
> If necessary, I can change the patch to work with his code (hugang, what do 
> you think?).
> 
I like this patch, and I have changed my code to work with it. Please
have a look; it passes in qemu x86_64. :)

The full patch is still available from
 http://soulinfo.com/~hugang/swsusp/2005-1-21/

Here is only the x86_64 part.

--- 2.6.11-rc1-mm1/arch/x86_64/kernel/suspend_asm.S 2004-12-30 14:56:35.0 +0800
+++ 2.6.11-rc1-mm1-swsusp-x86_64/arch/x86_64/kernel/suspend_asm.S   2005-01-21 10:13:15.0 +0800
@@ -35,6 +35,7 @@ ENTRY(swsusp_arch_suspend)
        call swsusp_save
        ret
 
+       .section .data.nosave
 ENTRY(swsusp_arch_resume)
        /* set up cr3 */
        leaq    init_level4_pgt(%rip),%rax
@@ -49,43 +50,32 @@ ENTRY(swsusp_arch_resume)
        movq    %rcx, %cr3;
        movq    %rax, %cr4;  # turn PGE back on
 
-       movl    nr_copy_pages(%rip), %eax
-       xorl    %ecx, %ecx
-       movq    $0, %r10
-       testl   %eax, %eax
-       jz      done
-.L105:
-       xorl    %esi, %esi
-       movq    $0, %r11
-       jmp     .L104
-       .p2align 4,,7
-copy_one_page:
-       movq    %r10, %rcx
-.L104:
-       movq    pagedir_nosave(%rip), %rdx
-       movq    %rcx, %rax
-       salq    $5, %rax
-       movq    8(%rdx,%rax), %rcx
-       movq    (%rdx,%rax), %rax
-       movzbl  (%rsi,%rax), %eax
-       movb    %al, (%rsi,%rcx)
-
-       movq    %cr3, %rax;  # flush TLB
-       movq    %rax, %cr3;
-
-       movq    %r11, %rax
-       incq    %rax
-       cmpq    $4095, %rax
-       movq    %rax, %rsi
-       movq    %rax, %r11
-       jbe     copy_one_page
-       movq    %r10, %rax
-       incq    %rax
-       movq    %rax, %rcx
-       movq    %rax, %r10
-       mov     nr_copy_pages(%rip), %eax
-       cmpq    %rax, %rcx
-       jb      .L105
+       movq    pagedir_nosave(%rip), %rax
+       testq   %rax, %rax
+       je      done
+
+copyback_page:
+       movq    24(%rax), %r9
+       xorl    %r8d, %r8d
+
+copy_one_pgdir:
+       movq    8(%rax), %rdi
+       testq   %rdi, %rdi
+       je      done
+       movq    (%rax), %rsi
+       movq    $512, %rcx
+       rep
+       movsq
+
+       incq    %r8
+       addq    $32, %rax
+       cmpq    $127, %r8
+       jbe     copy_one_pgdir; # copy one pgdir
+
+       testq   %r9, %r9
+       movq    %r9, %rax
+       jne     copyback_page
+
 done:
        movl    $24, %eax
        movl    %eax, %ds
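
For readers who do not read x86-64 assembly, the new copy loop can be modelled roughly in C. This is a sketch only: the entry layout (`toy_pbe`, with source at offset 0, destination at offset 8 and a next-page pointer at offset 24) is inferred from the offsets used in the assembly and assumes 8-byte pointers; the names are hypothetical and this is not the kernel's own structure definition.

```c
#include <string.h>

#define TOY_PAGE_SIZE     4096
#define ENTRIES_PER_PGDIR 128   /* 4096 / 32-byte entries ("cmpq $127") */

/* Assumed 32-byte entry, matching the offsets 0, 8 and 24 above. */
struct toy_pbe {
    void *src;              /* offset 0:  saved copy of the page (%rsi) */
    void *dst;              /* offset 8:  where the page belongs (%rdi) */
    void *unused;           /* offset 16: not touched by the loop */
    struct toy_pbe *next;   /* offset 24: next pgdir page (%r9) */
};

/* Walk the chain of pgdir pages, copying one page per entry.
 * Returns the number of pages copied. */
unsigned long toy_restore(struct toy_pbe *pgdir)
{
    unsigned long copied = 0;

    while (pgdir) {
        /* movq 24(%rax), %r9: read from the first entry of the page */
        struct toy_pbe *next = pgdir[0].next;
        int i;

        for (i = 0; i < ENTRIES_PER_PGDIR; i++) {
            if (!pgdir[i].dst)          /* testq %rdi, %rdi; je done */
                return copied;
            /* rep movsq with %rcx = 512: 512 * 8 = 4096 bytes */
            memcpy(pgdir[i].dst, pgdir[i].src, TOY_PAGE_SIZE);
            copied++;
        }
        pgdir = next;                   /* movq %r9, %rax; jne ... */
    }
    return copied;
}
```

Compared with the removed loop, which copied one byte per iteration and reloaded %cr3 each time, this copies a whole page per entry with a single string instruction.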

-- 
Hu Gang   .-.
  /v\
 // \\ 
Linux User  /(   )\  [204016]
GPG Key ID   ^^-^^   http://soulinfo.com/~hugang/hugang.asc

Re: [RFC][PATCH] /proc//rlimit

2005-01-20 Thread Bill Rugolsky Jr.

On Thu, Jan 20, 2005 at 03:43:58PM +0100, Pavel Machek wrote:
> It would be nice if you could make it "value-per-file". That way,
> it could become writable in future. If "max nice level" ever becomes rlimit,
> this would be very useful.

Agreed, though write support presents difficulties.

My principal concern is that we don't want users changing resource limits
of privileged processes.  If we want an ordinary user to be allowed to
change limits, the rules would have to be similar to those allowed for
ptrace(), e.g., no-setuid processes, etc.  [With ptrace(), one can of
course attach to the process and invoke the setrlimit() syscall directly].
Additionally, sys_setrlimit() has an LSM hook:

security_task_setrlimit(unsigned int resource, struct rlimit *)

One would need to take account of changing the limit from a different
context.  It's a bit of a mess, and outside of the standard API; that's
why I didn't bother.

Anyway, for Jan, here's my incomplete and unmergeable cut-n-paste hack
to implement write on top of my previous patch.  Format is as was
suggested by Jan:

 <%u|unlimited> <%u|unlimited>

E.g.,
echo  memlock 65536 65536 > /proc/1/rlimit

Writing is limited to root (i.e. CAP_SYS_PTRACE), though see
fs/proc/base.c:may_ptrace_attach() for an idea of how to change that.
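
On the user-space side, building a line in the format shown above might look like the following. This is a hedged sketch: `format_rlimit_line` and `put_limit` are hypothetical helpers, the `/proc/<pid>/rlimit` file exists only with this patch applied, and the only resource name taken from this mail is `memlock`.

```c
#include <stdio.h>
#include <string.h>
#include <sys/resource.h>

/* Print one limit value, spelling RLIM_INFINITY as "unlimited". */
static int put_limit(char *out, size_t size, rlim_t v)
{
    if (v == RLIM_INFINITY)
        return snprintf(out, size, "unlimited");
    return snprintf(out, size, "%llu", (unsigned long long)v);
}

/* Build "<name> <soft> <hard>\n". Returns 0 on success, -1 if
 * soft > hard or the line does not fit in 'size' bytes. */
int format_rlimit_line(char *out, size_t size, const char *name,
                       rlim_t soft, rlim_t hard)
{
    char s[32], h[32];
    int n;

    if (soft > hard)
        return -1;
    put_limit(s, sizeof(s), soft);
    put_limit(h, sizeof(h), hard);
    n = snprintf(out, size, "%s %s %s\n", name, s, h);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

A caller would then write the resulting line to the rlimit file of the target process. Since the patch below rejects writes longer than MAX_RLIMIT_WRITE (79) bytes, an 80-byte buffer is a natural size to pass in.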

-Bill


--- linux-2.6.11-rc1-bk6/fs/proc/base.c.proc-pid-rlimit-write
+++ linux-2.6.11-rc1-bk6/fs/proc/base.c
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -127,7 +128,7 @@
E(PROC_TGID_ROOT,  "root",S_IFLNK|S_IRWXUGO),
E(PROC_TGID_EXE,   "exe", S_IFLNK|S_IRWXUGO),
E(PROC_TGID_MOUNTS,"mounts",  S_IFREG|S_IRUGO),
-   E(PROC_TGID_RLIMIT,"rlimit",  S_IFREG|S_IRUGO),
+   E(PROC_TGID_RLIMIT,"rlimit",  S_IFREG|S_IRUGO|S_IWUSR),
 #ifdef CONFIG_SECURITY
E(PROC_TGID_ATTR,  "attr",S_IFDIR|S_IRUGO|S_IXUGO),
 #endif
@@ -153,7 +154,7 @@
E(PROC_TID_ROOT,   "root",S_IFLNK|S_IRWXUGO),
E(PROC_TID_EXE,"exe", S_IFLNK|S_IRWXUGO),
E(PROC_TID_MOUNTS, "mounts",  S_IFREG|S_IRUGO),
-   E(PROC_TID_RLIMIT, "rlimit",  S_IFREG|S_IRUGO),
+   E(PROC_TID_RLIMIT, "rlimit",  S_IFREG|S_IRUGO|S_IWUSR),
 #ifdef CONFIG_SECURITY
E(PROC_TID_ATTR,   "attr",S_IFDIR|S_IRUGO|S_IXUGO),
 #endif
@@ -595,9 +596,99 @@
return single_release(inode, file);
 }
 
+static inline char *skip_ws(char *s)
+{
+   while (isspace(*s))
+   s++;
+   return s;
+}
+
+static inline char *find_ws(char *s)
+{
+   while (!isspace(*s) && *s != '\0')
+   s++;
+   return s;
+}
+
+#define MAX_RLIMIT_WRITE 79
+static ssize_t rlimit_write(struct file * file, const char * buf,
+ size_t count, loff_t *ppos)
+{
+   struct task_struct *task = proc_task(file->f_dentry->d_inode);
+   struct rlimit new_rlim, *old_rlim;
+   unsigned int i;
+   char *s, *t, kbuf[MAX_RLIMIT_WRITE+1];
+
+   /* changing resources limits can crash or subvert a process */
+   if (!capable(CAP_SYS_PTRACE) || security_ptrace(current,task))
+   return -ESRCH;
+
+if (count > MAX_RLIMIT_WRITE)
+return -EINVAL;
+if (copy_from_user(&kbuf, buf, count))
+return -EFAULT;
+kbuf[MAX_RLIMIT_WRITE] = '\0'; 
+
+   /* parse the resource id */
+   s = skip_ws(kbuf);
+   t = find_ws(s);
+   if (*t == '\0')
+   return -EINVAL;
+   *t++ = '\0';
+   for (i = 0 ; i < RLIM_NLIMITS ; i++)
+   if (rlim_name[i] && !strcmp(s,rlim_name[i]))
+   break;
+   if (i >= RLIM_NLIMITS) {
+   if (!strncmp(s, "rlimit-",7))
+   s += 7;
+   if (sscanf(s, "%u", &i) != 1 || i >= RLIM_NLIMITS)
+   return -EINVAL;
+   }
+
+   /* parse the soft limit */
+   s = skip_ws(t);
+   t = find_ws(s);
+   if (*t == '\0')
+   return -EINVAL;
+   *t++ = '\0';
+   if (!strcmp(s, "unlimited")) 
+   new_rlim.rlim_cur = RLIM_INFINITY;
+   else if (sscanf(s, "%lu", &new_rlim.rlim_cur) != 1)
+   return -EINVAL;
+
+   /* parse the hard limit */
+   s = skip_ws(t);
+   t = find_ws(s);
+   *t = '\0';
+   if (!strcmp(s, "unlimited")) 
+   new_rlim.rlim_max = RLIM_INFINITY;
+   else if (sscanf(s, "%lu", &new_rlim.rlim_max) != 1)
+   return -EINVAL;
+
+   /* validate the values; copied from sys_setrlimit() */
+   if (new_rlim.rlim_cur > new_rlim.rlim_max)
+   return -EINVAL;
+old_rlim = task->signal->rlim + i;
+   if ((new_rlim.rlim_max > old_rlim->rlim_max) &&
+   !capable(CAP_SYS_RESOURCE))
+   return -EPERM;
+   if (i == RLIMIT_NOFILE && new_rlim.rlim_max > NR_OPEN)
+   return -EPERM;
+
+   /*

Re: [PATCH] relayfs redux for 2.6.10: lean and mean

2005-01-20 Thread Peter Williams

Karim Yaghmour wrote:
> Greg KH wrote:
>> Hm, how about this idea for cutting about 500 more lines from the code:
>>
>> Why not drop the "fs" part of relayfs and just make the code a set of
>> struct file_operations.  That way you could have "relayfs-like" files in
>> any ram based file system that is being used.  Then, a user could use
>> these fops and assorted interface to create debugfs or even procfs files
>> using this type of interface.
>>
>> As relayfs really is almost the same (conceptually wise) as debugfs as
>> far as concept of what kinds of files will be in there (nothing anyone
>> would ever rely on for normal operations, but for debugging only) this
>> keeps users and developers from having to spread their debugging and
>> instrumenting files across two different file systems.
>
> However this assumes that the users of relayfs are not going to want
> it during normal system operation. This is an assumption that fails
> with at least LTT as it is targeted at sysadmins, application developers
> and power users who need to be able to trace their systems at any time.
>
> I don't mind piggy-backing off another fs, if it makes sense, but
> unlike debugfs, relayfs is meant for general use, and all files in there
> are of the same type: relay channels for dumping huge amounts of data
> to user-space. It seems to me the target audience and basic idea (relay
> channels only in the fs) are different, but let me know if there's a
> compelling argument for doing this in another way without making it too
> confusing for users of those special "files" (IOW, when this starts
> being used in distros, it'll be more straightforward for users to
> understand if all files in a mounted fs behave a certain way than if
> they have certain "odd" files in certain directories, even if it's
> /proc.)

Perhaps the logical solution is to implement debugfs in terms of relayfs?

Peter
--
Peter Williams   [EMAIL PROTECTED]
"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce

Re: [PATCH] relayfs redux for 2.6.10: lean and mean

2005-01-20 Thread Karim Yaghmour

Greg KH wrote:
> Hm, how about this idea for cutting about 500 more lines from the code:
> 
> Why not drop the "fs" part of relayfs and just make the code a set of
> struct file_operations.  That way you could have "relayfs-like" files in
> any ram based file system that is being used.  Then, a user could use
> these fops and assorted interface to create debugfs or even procfs files
> using this type of interface.
> 
> As relayfs really is almost the same (conceptually wise) as debugfs as
> far as concept of what kinds of files will be in there (nothing anyone
> would ever rely on for normal operations, but for debugging only) this
> keeps users and developers from having to spread their debugging and
> instrumenting files across two different file systems.

However this assumes that the users of relayfs are not going to want
it during normal system operation. This is an assumption that fails
with at least LTT as it is targeted at sysadmins, application developers
and power users who need to be able to trace their systems at any time.

I don't mind piggy-backing off another fs, if it makes sense, but
unlike debugfs, relayfs is meant for general use, and all files in there
are of the same type: relay channels for dumping huge amounts of data
to user-space. It seems to me the target audience and basic idea (relay
channels only in the fs) are different, but let me know if there's a
compelling argument for doing this in another way without making it too
confusing for users of those special "files" (IOW, when this starts
being used in distros, it'll be more straightforward for users to
understand if all files in a mounted fs behave a certain way than if
they have certain "odd" files in certain directories, even if it's
/proc.)

Karim
-- 
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [EMAIL PROTECTED] || 1-866-677-4546

[PATCH] compat ioctl security hook fixup

2005-01-20 Thread Chris Wright

* Michael S. Tsirkin ([EMAIL PROTECTED]) wrote:
> Security hook seems to be missing before compat_ioctl in mm2.
> And, it would be nice to avoid calling it twice on some paths.
> 
> Chris Wright's patch addressed this in the most elegant way I think,
> by adding vfs_ioctl.

The patch below is against Linus' tree as per Andrew's request.  It will
conflict with some of the changes in -mm2 (including the some-fixes bit
from Andi, and LTT).  I also have a patch directly against -mm2 if anyone
would like to see that instead.

thanks,
-chris
--

Introduce a simple helper, vfs_ioctl(), so that both sys_ioctl() and
compat_sys_ioctl() call the security hook in all cases and without
duplication.
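
The shape of the change can be shown with a toy model. This is an illustrative sketch only: every `toy_*` name is hypothetical, and the bodies stand in for the real hook and ioctl dispatch.

```c
/* Counts how often the toy security hook runs. */
static int hook_calls;

/* Stand-in for the security hook: deny command 13, allow the rest. */
static int toy_security_check(unsigned int cmd)
{
    hook_calls++;
    return cmd == 13 ? -1 : 0;
}

/* Shared helper: does the work and never calls the hook itself. */
static int toy_vfs_ioctl(unsigned int cmd)
{
    return (int)cmd * 2;
}

/* Native entry point: hook once, then the helper. */
int toy_sys_ioctl(unsigned int cmd)
{
    int error = toy_security_check(cmd);

    if (error)
        return error;
    return toy_vfs_ioctl(cmd);
}

/* Compat entry point: hook once, then the same helper. Calling
 * toy_sys_ioctl() here instead would run the hook a second time. */
int toy_compat_sys_ioctl(unsigned int cmd)
{
    int error = toy_security_check(cmd);

    if (error)
        return error;
    return toy_vfs_ioctl(cmd);
}
```

Whichever entry point is used, the hook runs exactly once per call and a denial stops the request before the helper is reached.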

Signed-off-by: Chris Wright <[EMAIL PROTECTED]>

= fs/ioctl.c 1.15 vs edited =
--- 1.15/fs/ioctl.c 2005-01-15 14:31:01 -08:00
+++ edited/fs/ioctl.c   2005-01-18 11:18:33 -08:00
@@ -77,21 +77,10 @@ static int file_ioctl(struct file *filp,
return do_ioctl(filp, cmd, arg);
 }
 
-
-asmlinkage long sys_ioctl(unsigned int fd, unsigned int cmd, unsigned long arg)
+int vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd, unsigned long arg)
 {
-   struct file * filp;
unsigned int flag;
-   int on, error = -EBADF;
-   int fput_needed;
-
-   filp = fget_light(fd, &fput_needed);
-   if (!filp)
-   goto out;
-
-   error = security_file_ioctl(filp, cmd, arg);
-   if (error)
-   goto out_fput;
+   int on, error = 0;
 
switch (cmd) {
case FIOCLEX:
@@ -157,6 +146,24 @@ asmlinkage long sys_ioctl(unsigned int f
error = do_ioctl(filp, cmd, arg);
break;
}
+   return error;
+}
+
+asmlinkage long sys_ioctl(unsigned int fd, unsigned int cmd, unsigned long arg)
+{
+   struct file * filp;
+   int error = -EBADF;
+   int fput_needed;
+
+   filp = fget_light(fd, &fput_needed);
+   if (!filp)
+   goto out;
+
+   error = security_file_ioctl(filp, cmd, arg);
+   if (error)
+   goto out_fput;
+
+   error = vfs_ioctl(filp, fd, cmd, arg);
  out_fput:
fput_light(filp, fput_needed);
  out:
= fs/compat.c 1.48 vs edited =
--- 1.48/fs/compat.c2005-01-15 14:31:01 -08:00
+++ edited/fs/compat.c  2005-01-18 11:07:56 -08:00
@@ -437,6 +437,11 @@ asmlinkage long compat_sys_ioctl(unsigne
if (!filp)
goto out;
 
+   /* RED-PEN how should LSM module know it's handling 32bit? */
+   error = security_file_ioctl(filp, cmd, arg);
+   if (error)
+   goto out_fput;
+
if (filp->f_op && filp->f_op->compat_ioctl) {
error = filp->f_op->compat_ioctl(filp, cmd, arg);
if (error != -ENOIOCTLCMD)
@@ -477,7 +482,7 @@ asmlinkage long compat_sys_ioctl(unsigne
 
up_read(&ioctl32_sem);
  do_ioctl:
-   error = sys_ioctl(fd, cmd, arg);
+   error = vfs_ioctl(filp, fd, cmd, arg);
  out_fput:
fput_light(filp, fput_needed);
  out:
= include/linux/fs.h 1.373 vs edited =
--- 1.373/include/linux/fs.h2005-01-15 14:31:01 -08:00
+++ edited/include/linux/fs.h   2005-01-18 11:10:54 -08:00
@@ -1564,6 +1564,8 @@ extern int vfs_stat(char __user *, struc
 extern int vfs_lstat(char __user *, struct kstat *);
 extern int vfs_fstat(unsigned int, struct kstat *);
 
+extern int vfs_ioctl(struct file *, unsigned int, unsigned int, unsigned long);
+
 extern struct file_system_type *get_fs_type(const char *name);
 extern struct super_block *get_super(struct block_device *);
 extern struct super_block *user_get_super(dev_t);

Re: [patch] Job - inescapable job containers

2005-01-20 Thread Limin Gu

> I'm totally not in a position to evaluate the completeness, desirability,
> interest-level, etc of this patch, I'm afraid.  This is an opportunity for
> other stakeholders to weigh in..

Thanks Andrew!

First, Job can work as a standalone kernel module.
The current implementation provides the inescapable job container.
Job provides global unique Job ID (jid) to processes in a cluster
environment. Job initiation on Linux is performed via a PAM session
module with authentication and security checks. Root level processes,
or those with the CAP_SYS_RESOURCE capability, can create new jobs
or escape from a job.

Second, Job-based batch schedulers or resource limit tools can take
advantage of the process control ability Job provides.

Third, Job provides a registration mechanism to various accounting modules
for setting and getting job-based accounting information.
CSA (Comprehensive System Accounting) is one example of such an accounting
module. (CSA code maintainer Jay Lan is currently on vacation; he will
be back on Feb. 1.)

We are pushing Job for inclusion in the Linux kernel. If anybody has been
using Job in open source software, please respond to show the desirability
and interest level for Job. We would also highly appreciate suggestions on
its completeness.

Thank you!

--Limin

Re: [patch 5/5] Disallow in-inode attributes for reserved inodes

2005-01-20 Thread Andreas Gruenbacher

On Friday 21 January 2005 00:05, Andreas Dilger wrote:
> [...]
> But as your patch stands it doesn't ever check if i_extra_isize is valid
> for the root or lost+found inode.  It just always sets i_extra_isize = 0
(that's the in-memory i_extra_isize)
> and never uses it.  Given that the root inode is fairly high-traffic it
> makes sense to use the faster EA space if it is available.

It's only a single block we're talking about, not all the overhead you run 
into with huge amounts of attributes in many xattr disk blocks. It sure would 
be much cleaner to use the root inode's in-inode space like with all other 
inodes, but performance wise I don't think it matters.

> If these inodes have a BAD i_extra_isize it is OK to skip it, but I'm
> not so keen to have an ext3_error() there.  If the user doesn't have an
> e2fsck with ea-in-inode support there isn't anything they can do to fix
> it and they will get a full e2fsck on each boot.

Agreed, that would be really bad. We should get e2fsck fixed ASAP.

> Even so, for the effort of setting i_extra_isize = 4 (or larger if we
> initialize the fixed fields) we can do the equivalent of what e2fsck will
> do when it finds a bogus value.

We cannot ask the user, and we don't have the kind of global view that e2fsck 
has. Something different may be messed up, and may have led to the 
corruption. It's unlikely, but not impossible.

Cheers,
-- 
Andreas Gruenbacher <[EMAIL PROTECTED]>
SUSE Labs, SUSE LINUX PRODUCTS GMBH
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] BUG in io_destroy (fs/aio.c:1248)

2005-01-20 Thread Darrick J. Wong

Hi all,
[Please cc me on any replies because I'm not subscribed to linux-aio or 
linux-kernel.]

I was running a random system call generator against mainline the other 
day and got this bug report about AIO in dmesg:

[ cut here ]
kernel BUG at fs/aio.c:1249!
invalid operand:  [#1]
PREEMPT SMP
Modules linked in: 8250 serial_core isofs zlib_inflate ipt_limit 
iptable_mangle ipt_LOG ipt_MASQUERADE iptable_nat ipt_TOS ipt_REJECT 
ip_conntrack_irc ip_conntrack_ftp ipt_state ip_conntrack iptable_filter 
ip_tables snd_intel8x0 snd_ac97_codec snd_pcm snd_timer snd soundcore 
snd_page_alloc intel_agp agpgart evdev ehci_hcd uhci_hcd usbcore piix 
ext2 ide_generic ide_cd ide_core cdrom
CPU:0
EIP:0060:[]Not tainted VLI
EFLAGS: 00010286   (2.6.10-elm3a74)
EIP is at io_destroy+0xb1/0xce
eax:    ebx: f6b2a300   ecx:    edx: cfca1000
esi: d0da5e80   edi: f6b2a488   ebp: cfca1fa4   esp: cfca1f94
ds: 007b   es: 007b   ss: 0068
Process io_destroy (pid: 6610, threadinfo=cfca1000 task=f6cdca20)
Stack:  08048008 fff2 fff2 cfca1fbc c01702fc d0da5e80 
0010
   b7fd8c50 b3d4 cfca1000 c0102eb3 0010 08048008 080482fd 
b7fd8c50
   b3d4 b3e8 00f5 007b 007b 00f5 b7f7c60d 
0073
Call Trace:
 [] show_stack+0x7a/0x90
 [] show_registers+0x152/0x1ca
 [] die+0x100/0x184
 [] do_invalid_op+0xa3/0xad
 [] error_code+0x2b/0x30
 [] sys_io_setup+0x9a/0xa9
 [] syscall_call+0x7/0xb
Code: 1c 8b 06 85 c0 78 24 83 c4 04 5b 5e 5f 5d c3 8b 0a 85 c9 2e 74 b5 
8b 46 10 89 02 eb ae 83 c4 04 89 f0 5b 5e 5f 5d e9 9b ee ff ff <0f> 0b 
e1 04 f5 f6 2b c0 eb d2 89 f0 e8 8a ee ff ff eb ab 0f 0b

This is a fairly run-of-the-mill P4 box with SCSI disks and a plain 
vanilla 2.6.10 kernel on Debian. I've written a test case that exposes 
this bug:

http://submarine.dyndns.org/~djwong/docs/io_destroy.c

The program takes as its only argument the address of a region of
read-only memory.  The libc mmap is a pretty good place for this, so you
can run the program thusly:

$ ./io_destroy `cat /proc/$$/maps | grep libc- | grep 'r-' | \
awk -F "-" '{print $1}'`

...and watch the program segfault.  If you can't find an address, 
8048000 seems to work in most cases.

I think I've found the cause of this bug.  Each ioctx structure has a
"users" field that acts as a reference counter for the ioctx, and a
"dead" flag that seems to indicate that the ioctx isn't associated with
any particular list of IO requests.

The problem, then, lies in aio.c:1247.  The io_destroy function checks
the (old) value of the dead flag--if it's false (i.e. the ioctx is
alive), then the function calls put_ioctx to decrease the reference
count on the assumption that the ioctx is no longer associated with any
requests.  Later, it calls put_ioctx again, on the assumption that
someone called lookup_ioctx to perform some operation at some point.

This BUG is caused by the reference counts being off.  The testcase that
I provided looks for a chunk of user memory that's read-only and passes
that to the sys_io_setup syscall.  sys_io_setup checks that the pointer
is readable, creates the ioctx and then tries to write the ioctx handle
back to userland.  This is where the problems start to surface.

Since the pointer points to a non-writable region of memory, the write
fails.  The syscall handler then destroys the ioctx.  The dead flag is
zero, so io_destroy calls put_ioctx...but wait!  Nobody ever put the
ioctx into a request list.  The ioctx is alive but not in a list, yet
the io_destroy code assumes that being alive implies being in a request
list somewhere.  Hence, calling put_ioctx is bogus; the reference count
becomes 0, and the ioctx is freed.  Worse yet, put_ioctx is called again
(on a freed pointer!) to clear up the lookup_ioctx that never happened.
put_ioctx sees that the reference count has become negative and BUGs.

The patch that I've provided calls aio_cancel_all before calling
io_destroy in this failure case.  aio_cancel_all sets ioctx->dead = 1
and cancels all requests (there shouldn't be any in this case) in
progress.  Since the dead flag is 1, io_destroy calls put_ioctx once to
zero the reference count and free the ioctx, and thus the BUG condition
doesn't get triggered.  The userland program receives an error code
instead of a segfault.

This patch is against 2.6.10; the problem doesn't seem to be fixed in
2.6.11-rc1.  A simpler version of this fix would simply say "ioctx->dead
= 1;" (or even call "get_ioctx(ioctx);" to inflate the refcounts
artificially), but as I'm not an AIO developer I don't want to be the
one making that call.
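
The refcount sequence described above can be modeled in user space.  What
follows is only a toy sketch, not the code in fs/aio.c: the struct, the
toy_* helper names, and the starting count of one reference are all
assumptions made for illustration.

```c
#include <stdbool.h>

/* Toy stand-in for the kernel's ioctx; every name here is hypothetical. */
struct toy_ioctx {
    int users;     /* reference count */
    bool dead;     /* set once the context has been cancelled */
    bool freed;    /* set when the last reference is dropped */
    int bug_hits;  /* counts puts that the kernel would BUG on */
};

/* Drop one reference; record a "BUG" for a put on a freed context. */
static void toy_put_ioctx(struct toy_ioctx *ctx)
{
    if (ctx->freed || ctx->users <= 0) {
        ctx->bug_hits++;
        return;
    }
    if (--ctx->users == 0)
        ctx->freed = true;
}

/* Models only the side effect of aio_cancel_all() that matters here. */
static void toy_cancel_all(struct toy_ioctx *ctx)
{
    ctx->dead = true;
}

/*
 * Models io_destroy() as described in the text: one put if the context
 * was still alive, followed by one unconditional put.
 */
static void toy_io_destroy(struct toy_ioctx *ctx)
{
    bool was_dead = ctx->dead;

    ctx->dead = true;
    if (!was_dead)
        toy_put_ioctx(ctx);
    toy_put_ioctx(ctx);
}
```

Starting from a single reference, calling toy_io_destroy() directly
records one "BUG" in this model, whereas calling toy_cancel_all() first
drops the count to zero exactly once.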

--Darrick
-
Signed-off-by: Darrick Wong <[EMAIL PROTECTED]>
--- linux-2.6.10-a74/fs/aio.c   2004-12-24 13:34:44.0 -0800
+++ linux-2.6.10/fs/aio.c   2005-01-12 16:09:37.0 -0800
@@ -1285,6 +1285,7 @@
if (!ret)
return 0;
+   aio_cancel_all(ioctx);

[PATCH] mips: fixed LTT build errors

2005-01-20 Thread Yoichi Yuasa

This patch fixes LTT build errors on MIPS.

Yoichi

Signed-off-by: Yoichi Yuasa <[EMAIL PROTECTED]>

diff -urN -X dontdiff a-orig/arch/mips/kernel/irq.c a/arch/mips/kernel/irq.c
--- a-orig/arch/mips/kernel/irq.c   Fri Jan 21 00:15:19 2005
+++ a/arch/mips/kernel/irq.c    Fri Jan 21 08:17:31 2005
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
diff -urN -X dontdiff a-orig/arch/mips/kernel/traps.c a/arch/mips/kernel/traps.c
--- a-orig/arch/mips/kernel/traps.c Fri Jan 21 00:15:19 2005
+++ a/arch/mips/kernel/traps.c  Fri Jan 21 08:17:31 2005
@@ -13,6 +13,7 @@
  */
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
diff -urN -X dontdiff a-orig/arch/mips/mm/fault.c a/arch/mips/mm/fault.c
--- a-orig/arch/mips/mm/fault.c Fri Jan 21 00:15:19 2005
+++ a/arch/mips/mm/fault.c  Fri Jan 21 08:17:31 2005
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 

Re: [PATCH]sched: Isochronous class v2 for unprivileged soft rt scheduling

2005-01-20 Thread Jack O'Quin

Peter Chubb <[EMAIL PROTECTED]> writes:

>> "Jack" == Jack O'Quin <[EMAIL PROTECTED]> writes:
>
> Jack> Looks like we need to do another study to determine which
> Jack> filesystem works best for multi-track audio recording and
> Jack> playback.  XFS looks promising, but only if they get the latency
> Jack> right.  Any experience with that?  
>
> The nice thing about audio/video and XFS is that if you know ahead of
> time the max size of a file (and you usually do -- because you know
> ahead of time how long a take is going to be) you can precreate the
> file as a contiguous chunk, then just fill it in, for minimum disc
> latency.

I am not talking about disk latency.  The problem Con uncovered in
ReiserFS was CPU hogging.  Every 20 seconds there was a 6msec latency
glitch in system response.
-- 
  joq

Re: linux capabilities ?

2005-01-20 Thread Chris Wright

* jnf ([EMAIL PROTECTED]) wrote:
> I will read the paper before commenting on it further, however I cannot
> see what dangers it would really provide that a setuid program doesn't
> already have- other than the ability to give another non-root process root
> like abilities. However, the more I ponder it, it seems as if you could

What hit sendmail was a dangerous failure mode that occurs when a
capability isn't present.

> accomplish a lot of things with a set of ACL's and Capabilities (think
> compartmentalizing everything from each other where no one thing has full
> control of anything other than its particular subsystem).

Yes, that's the ideal.  Unfortunately it doesn't work out quite so
neatly ;-/

> > Since /proc/kmsg is 0400 you need CAP_DAC_READ_SEARCH (don't necessarily
> > need full override).  Otherwise, you are right, you do need CAP_SYS_ADMIN.
> > Or just use syslog(2) directly, and you'll avoid the DAC requirement.
> 
> Hrm, even a chmod of it didn't appear to really affect things?

Should, and it makes a difference for me.

thanks,
-chris
-- 
Linux Security Modules http://lsm.immunix.org http://lsm.bkbits.net

Re: [Linux-fbdev-devel] Re: Radeon framebuffer weirdness in -mm2

2005-01-20 Thread James Simmons


> > > > I'm seeing radeonfb on my ThinkPad T30 go weird on reboot (lots of
> > > > horizontal lines) and require powercycling to fix. Worked fine with 
> > > > 2.6.10.
> > > 
> > > Which radeon driver? CONFIG_FB_RADEON_OLD or CONFIG_FB_RADEON?
> > 
> > FB_RADEON.
> > 
> > > (cc Ben, who is the likely culprit ;)
> > 
> > Btw, ajoshi's address from MAINTAINERS is bouncing.
> 
> The file should be updated, I am the radeonfb maintainer now.

Speaking of. Should we nuke the old radeonfb driver?
 

Re: intel8x0 and 2.6.11-rc1

2005-01-20 Thread Paul Ionescu

Hi Takashi,

The same applies to the IBM T40/T41/R50p machines I have tested so far.
I had to disable "Headphone Jack Sense" and "Line Jack Sense" too.
So, what's the deal with these?
What are they supposed to do?
Should we report this as a bug on the ALSA lists?

Thanks,
Paul

On Thu, 20 Jan 2005 16:55:55 +0100, Takashi Iwai wrote:
>>   If you have "Headphone Jack Sense" mixer control,
>>   try to turn it off.
>> 
>> That did the trick. thanks..
> 
> Glad to hear that.  What machine do you have?


Re: [PATCH][RFC] swsusp: speed up image restoring on x86-64

2005-01-20 Thread Rafael J. Wysocki

Hi,

On Friday, 21 of January 2005 00:06, Pavel Machek wrote:
> Hi!
> 
> > > > The readability of code is also important, IMHO.
> > > 
> > > It did not seem too much better to me.
> > 
> > Well, the beauty is in the eye of the beholder. :-)
> > 
> > Still, it shrinks the code (22 lines vs 37 lines), it uses less GPRs (5 vs 
> > 7), it uses less
> > SIB arithmetics (0 vs 4 times), it uses a well known scheme for copying 
> > data pages.
> > As far as the result is concerned, it is equivalent to the existing code, 
> > but it's simpler
> > (and faster).  IMO, simpler code is always easier to understand.
> > 
> > 
> > > > > If you want cheap way to speed it up, kill cr3 manipulation.
> > > > 
> > > > Sure, but I think it's there for a reason.
> > > 
> > > Reason is "to crash it early if we have wrong pagetables".
> > > 
> > > > > Anyway, this is likely to clash with hugang's work; I'd prefer this 
> > > > > not to be applied.
> > > > 
> > > > I am aware of that, but you are not going to merge the hugang's patches 
> > > > soon, are you?
> > > > If necessary, I can change the patch to work with his code (hugang, 
> > > > what do you think?).
> > > 
> > > I think it is just not worth the effort.
> > 
> > Why?  It won't take much time.  I've spent more time for writing the 
> > messages
> > in this thread ... ;-)
> 
> Well, I know that current code works. It was produced by C compiler,
> btw. Now, new code works for you, but it was not in kernel for 4
> releases, and... this code is pretty subtle.

Now, I'm confused. :-)  It's roughly this:

struct pbe *pbe = pagedir_nosave, *end;
unsigned n = nr_copy_pages;
if (n) {
    end = pbe + n;
    do {
        memcpy((void *)pbe->orig_address, (void *)pbe->address,
               PAGE_SIZE);
        pbe++;
    } while (pbe < end);
}

where memcpy() is of course a hand-written inline that includes the cr3
manipulation, and pbe, end, n are registers.
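
For anyone who wants to poke at the loop outside the kernel, here is a
user-space sketch of the same copy scheme.  It is not the assembly under
discussion: the toy_pbe struct, the 64-byte stand-in page size, and the
plain libc memcpy() (with no cr3 handling) are assumptions made so that
it compiles on its own.

```c
#include <string.h>

#define TOY_PAGE_SIZE 64  /* stand-in for PAGE_SIZE, kept small on purpose */

/* Hypothetical user-space analogue of struct pbe. */
struct toy_pbe {
    void *address;       /* where the saved copy of the page lives */
    void *orig_address;  /* where the page has to be restored to */
};

/* Copy every saved page back to its original location. */
static void toy_restore(struct toy_pbe *pbe, unsigned n)
{
    struct toy_pbe *end;

    if (!n)
        return;

    end = pbe + n;
    do {
        memcpy(pbe->orig_address, pbe->address, TOY_PAGE_SIZE);
        pbe++;
    } while (pbe < end);
}
```

The explicit check for n == 0 matters because the do/while form would
otherwise copy one entry even for an empty list.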

> And it is hand-made, not C produced.

Yes, it is.

> So... your code may be better but I do not think it is so much better
> that I'd like to risk it.

Now, that's clear. :-)

Anyway, if anyone could test it or look at it and say a word, please do so.

Greets,
RJW


-- 
- Would you tell me, please, which way I ought to go from here?
- That depends a good deal on where you want to get to.
-- Lewis Carroll "Alice's Adventures in Wonderland"

Re: 2.6.11-rc1 vs. PowerMac 8500/G3 (and VAIO laptop) [usb-storage oops]

2005-01-20 Thread Greg KH

On Thu, Jan 20, 2005 at 08:40:07AM +, David Woodhouse wrote:
> On Wed, 2005-01-19 at 15:39 -0800, John Mock wrote:
> > New to 2.6.11-rc1 is that 'lsusb' exhibits 'endian' problems on the
> > PowerMac.
> 
> Is that really new to 2.6.11-rc1? The kernel byte-swaps the bcdUSB,
> idVendor, idProduct, and bcdDevice fields in the device descriptor. It
> should probably swap them back before copying it up to userspace.

Doh, sorry for missing this one.  I've applied your patch to my trees,
and it will show up in the next -mm release.

thanks.

greg k-h

Re: Bug when using custom baud rates....

2005-01-20 Thread Greg KH

On Thu, Jan 20, 2005 at 04:22:56PM +0100, Rogier Wolff wrote:
> On Thu, Jan 20, 2005 at 07:08:58AM -0800, Greg KH wrote:
> > On Thu, Jan 20, 2005 at 03:54:22PM +0100, Rogier Wolff wrote:
> > > Hi,
> > > 
> > > When using custom baud rates, the code does: 
> > > 
> > > 
> > >if ((new_serial.baud_base != priv->baud_base) ||
> > > (new_serial.baud_base < 9600))
> > > return -EINVAL;
> > > 
> > > Which translates to english as: 
> > > 
> > >   If you changed the baud-base, OR the new one is
> > >   invalid, return invalid. 
> > > 
> > > but it should be:
> > > 
> > >   If you changed the baud-base, OR the new one is
> > >   invalid, return invalid. 
> > 
> > You mean AND, not OR here, right?  :)
> 
> :-) Sorry. Too noisy here. 
> 
> > > Patch attached. 
> > 
> > Have a 2.6 patch?
> 
> Patch told me: 
>patching file drivers/usb/serial/ftdi_sio.c
>Hunk #1 succeeded at 1137 (offset 156 lines).
> 
> but the resulting patch is attached. 

Applied, thanks.
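
For the archive, the difference between the two conditions quoted above
can be sketched in a few lines.  This is an illustration only, not the
ftdi_sio.c code: the function names and the TOY_MIN_BAUD_BASE constant
are made up, and the sketch assumes the fix is the OR-to-AND change
discussed in the thread.

```c
#include <errno.h>

#define TOY_MIN_BAUD_BASE 9600  /* threshold taken from the quoted check */

/* Condition as quoted: any change to baud_base is rejected. */
static int toy_check_baud_base_or(int new_base, int cur_base)
{
    if (new_base != cur_base || new_base < TOY_MIN_BAUD_BASE)
        return -EINVAL;
    return 0;
}

/* Condition with AND: only a change to a too-small value is rejected. */
static int toy_check_baud_base_and(int new_base, int cur_base)
{
    if (new_base != cur_base && new_base < TOY_MIN_BAUD_BASE)
        return -EINVAL;
    return 0;
}
```

With the OR form, moving from 24000000 to 48000000 returns -EINVAL even
though the new value is above the threshold; with the AND form the same
change is accepted.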

greg k-h

Re: RFC: [2.6 patch] let BLK_DEV_UB depend on USB_STORAGE=n

2005-01-20 Thread Greg KH

On Wed, Jan 19, 2005 at 06:49:00PM -0800, Matthew Dharm wrote:
> On Wed, Jan 19, 2005 at 02:07:07PM -0800, Greg KH wrote:
> > On Thu, Dec 23, 2004 at 03:40:31AM +0100, Adrian Bunk wrote:
> > > On Sun, Dec 19, 2004 at 04:31:46PM -0800, Greg KH wrote:
> > > > On Mon, Dec 20, 2004 at 01:16:44AM +0100, Adrian Bunk wrote:
> > > > > I've already seen people crippling their usb-storage driver with 
> > > > > enabling BLK_DEV_UB - and I doubt the warning in the help text added 
> > > > > after 2.6.9 will fix all such problems.
> > > > > 
> > > > > Is there except for kernel size any good reason for using BLK_DEV_UB 
> > > > > instead of USB_STORAGE?
> > > > 
> > > > You don't want to use the scsi layer?  You like the stability of it at
> > > > times?  :)
> > > > 
> > > > > If not, I'd suggest the patch below to let BLK_DEV_UB depend
> > > > > on EMBEDDED.
> > > > 
> > > > No, it's good for non-embedded boxes too.
> > > 
> > > 
> > > My current understanding is:
> > > - BLK_DEV_UB supports a subset of what USB_STORAGE can support
> > > - for an average user, there's no reason to enable BLK_DEV_UB
> > > - if you really know what you are doing, there might be several reasons
> > >   why you might want to use BLK_DEV_UB
> > 
> > I have been running with just the code portion of this patch for a while
> > now, with good results (no Kconfig changes.)
> > 
> > Pete and Matt, do you mind me applying the following portion of the
> > patch to the kernel tree?
> 
> I have no objection.

Ok, I've committed the change to my trees, thanks.

greg k-h

Re: security hook missing in compat ioctl in 2.6.11-rc1-mm2

2005-01-20 Thread Chris Wright

* Michael S. Tsirkin ([EMAIL PROTECTED]) wrote:
> Hi!
> Security hook seems to be missing before compat_ioctl in mm2.
> And, it would be nice to avoid calling it twice on some paths.
> 
> Chris Wright's patch addressed this in the most elegant way I think,
> by adding vfs_ioctl.
> 
> Accordingly, this change:
> 
> @@ -468,6 +496,11 @@ asmlinkage long compat_sys_ioctl(unsigne
>  
>   found_handler:
>   if (t->handler) {
> + /* RED-PEN how should LSM module know it's handling 32bit? */
> + error = security_file_ioctl(filp, cmd, arg);
> + if (error)
> + goto out_fput;
> +
>   lock_kernel();
>   error = t->handler(fd, cmd, arg, filp);
>   unlock_kernel();
> 
>  from Andy's "some fixes" patch won't be needed.
> 
> Chris - are you planning to update your patch to -rc1-mm2?
> I'd like to see this addressed, after this I believe logically
> we'll get everything right, then I have a couple of small
> cosmetic patches, and I believe we'll be set.

Yes, Andrew asked me to wait until mm2 came out, so I'll rediff and send
shortly.

thanks,
-chris
-- 
Linux Security Modules http://lsm.immunix.org http://lsm.bkbits.net

security hook missing in compat ioctl in 2.6.11-rc1-mm2

2005-01-20 Thread Michael S. Tsirkin

Hi!
Security hook seems to be missing before compat_ioctl in mm2.
And, it would be nice to avoid calling it twice on some paths.

Chris Wright's patch addressed this in the most elegant way I think,
by adding vfs_ioctl.

Accordingly, this change:

@@ -468,6 +496,11 @@ asmlinkage long compat_sys_ioctl(unsigne
 
  found_handler:
if (t->handler) {
+   /* RED-PEN how should LSM module know it's handling 32bit? */
+   error = security_file_ioctl(filp, cmd, arg);
+   if (error)
+   goto out_fput;
+
lock_kernel();
error = t->handler(fd, cmd, arg, filp);
unlock_kernel();

 from Andy's "some fixes" patch won't be needed.

Chris - are you planning to update your patch to -rc1-mm2?
I'd like to see this addressed, after this I believe logically
we'll get everything right, then I have a couple of small
cosmetic patches, and I believe we'll be set.
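
To make the "call the hook exactly once" point concrete, here is a
user-space sketch of routing both entry points through one shared helper.
It is not Chris Wright's patch and not kernel code: every toy_* name, the
denied command value, and the handler's return value are hypothetical.

```c
static int toy_hook_calls;  /* how many times the security hook ran */

/* Stand-in for the LSM hook; pretends that one command is denied. */
static int toy_security_file_ioctl(unsigned int cmd)
{
    toy_hook_calls++;
    return cmd == 0xdeadu ? -1 : 0;
}

/* Stand-in for the real ioctl handler. */
static int toy_handler(unsigned int cmd)
{
    return (int)(cmd & 0xffu);
}

/* Shared helper: the only place where the hook is invoked. */
static int toy_vfs_ioctl(unsigned int cmd)
{
    int error = toy_security_file_ioctl(cmd);

    if (error)
        return error;
    return toy_handler(cmd);
}

/* Native entry point. */
static int toy_native_ioctl(unsigned int cmd)
{
    return toy_vfs_ioctl(cmd);
}

/* Compat entry point; 32-bit argument translation would happen here. */
static int toy_compat_ioctl(unsigned int cmd)
{
    return toy_vfs_ioctl(cmd);
}
```

Because neither entry point calls the hook itself, no path can skip it
and no path can run it twice.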

-- 
I dont speak for Mellanox.

Re: Radeon framebuffer weirdness in -mm2

2005-01-20 Thread Benjamin Herrenschmidt

On Thu, 2005-01-20 at 15:48 -0800, Matt Mackall wrote:
> On Thu, Jan 20, 2005 at 03:39:21PM -0800, Andrew Morton wrote:
> > Matt Mackall <[EMAIL PROTECTED]> wrote:
> > >
> > > I'm seeing radeonfb on my ThinkPad T30 go weird on reboot (lots of
> > > horizontal lines) and require powercycling to fix. Worked fine with 
> > > 2.6.10.
> > 
> > Which radeon driver? CONFIG_FB_RADEON_OLD or CONFIG_FB_RADEON?
> 
> FB_RADEON.
> 
> > (cc Ben, who is the likely culprit ;)
> 
> Btw, ajoshi's address from MAINTAINERS is bouncing.

The file should be updated, I am the radeonfb maintainer now.

> > Which -mm2, btw?  2.6.10-mm2 or 2.6.11-rc1-mm2?
> 
> 2.6.11-rc1-mm2
> 
> > Did you try the corresponding -mm1?
> 
> Nothing between that and .10 yet. Building -mm1 now.

Thanks.

Ben.



Re: Radeon framebuffer weirdness in -mm2

2005-01-20 Thread Benjamin Herrenschmidt

On Thu, 2005-01-20 at 15:39 -0800, Andrew Morton wrote:
> Matt Mackall <[EMAIL PROTECTED]> wrote:
> >
> > I'm seeing radeonfb on my ThinkPad T30 go weird on reboot (lots of
> > horizontal lines) and require powercycling to fix. Worked fine with 2.6.10.
> 
> Which radeon driver? CONFIG_FB_RADEON_OLD or CONFIG_FB_RADEON?
> 
> (cc Ben, who is the likely culprit ;)
> 
> Which -mm2, btw?  2.6.10-mm2 or 2.6.11-rc1-mm2?
> 
> Did you try the corresponding -mm1?

/me curses possible BIOS crap ...

radeonfb tries to restore the initial mode when the module is closed, which
wouldn't work for a VGA text thing in fact... I suspect something causes
driver remove() routines to be called on reboot, can you confirm? Or is
it a module that gets removed? It may well be a problem that has always
been there (regardless of the radeon driver version) and just triggered
by something the kernel does on reboot...

Ben.


SCSI oops in 2.6.10 [was: usb-storage oops (PowerMac 8500/G3)]

2005-01-20 Thread John Mock

Sorry about the confusion, but it appears the 'oops' is not specific to
the USB subsystem, as it also seems to occur with an ordinary SCSI module
under 2.6.10 (PPC).  In this case, it's a ZIP drive connected via the
'mac53c94' module; this is in addition to, as noted before, the same problem
with a USB digital camera and an IOMEGA CD/RW drive (see earlier posting
"2.6.11-rc1 vs. PowerMac 8500/G3 (and VAIO laptop) [usb-storage oops]").

Additional details gladly provided upon request.

 -- JM

Attachments: SCSI oops from 'mac53c94'
 Example of usb-storage variant of same/similar 'oops'
---
...
scsi1 : 53C94
  Vendor: IOMEGA    Model: ZIP 100   Rev: C.18
  Type:   Direct-Access  ANSI SCSI revision: 02
Oops: kernel access of bad area, sig: 11 [#1]
PREEMPT 
NIP: C009ABA4 LR: C009ABA4 SP: CC467C40 REGS: cc467b90 TRAP: 0600    Not tainted
MSR: 9032 EE: 1 PR: 0 FP: 0 ME: 1 IR/DR: 11
DAR: 6B6B6BD7, DSISR: 
TASK = cd3b4b30[1094] 'modprobe' THREAD: cc466000
Last syscall: 120 
GPR00: C009ABA4 CC467C40 CD3B4B30 C01AADE0 0047 0047 CC467C78 000A 
GPR08:  8000 CDC2A6AC CC467C40 42002448 1001E284 100013A4  
GPR16:     100013A4 100186E0  CC467D98 
GPR24: CC467D9C 0001  CCC84994 CC467C78 CC40941C CCC84998 6B6B6BD7 
NIP [c009aba4] create_dir+0x38/0x1d0
LR [c009aba4] create_dir+0x38/0x1d0
Call trace:
 [c009ad98] sysfs_create_dir+0x48/0x94
 [c00ad688] create_dir+0x28/0x6c
 [c00ad98c] kobject_add+0x5c/0x15c
 [c00eaa40] device_add+0xb8/0x18c
 [c0111f9c] scsi_sysfs_add_sdev+0x78/0x39c
 [c01107c4] scsi_add_lun+0x2f8/0x364
 [c011091c] scsi_probe_and_add_lun+0xec/0x1d8
 [c010] scsi_scan_target+0x7c/0xec
 [c0fc] scsi_scan_channel+0x7c/0x9c
 [c01112f4] scsi_scan_host_selected+0xd8/0x138
 [cf854b40] mac53c94_probe+0x208/0x26c [mac53c94]
 [c0103fd4] macio_device_probe+0x80/0x9c
 [c00ebfec] driver_probe_device+0x4c/0xa0
 [c00ec184] driver_attach+0x88/0xc8
 [c00ec7c8] bus_add_driver+0xd0/0x11c
---
Jan 19 15:17:58 penngrove kernel:   Vendor: NIKON Model: NIKON DSC E4500   
Rev: 1.00
Jan 19 15:17:58 penngrove kernel:   Type:   Direct-Access  
ANSI SCSI revision: 02
Jan 19 15:17:58 penngrove kernel: input: Logitech N48 on usb-:00:0e.0-1
Jan 19 15:17:58 penngrove kernel: hub 4-0:1.0: port 2, status 0100, change 
, 12 Mb/s
Jan 19 15:17:58 penngrove kernel: Oops: kernel access of bad area, sig: 11 [#1]
Jan 19 15:17:58 penngrove kernel: PREEMPT 
Jan 19 15:17:58 penngrove kernel: NIP: C009BF14 LR: C009BF14 SP: CCF63DC0 REGS: 
ccf63d10 TRAP: 0300    Not tainted
Jan 19 15:17:58 penngrove kernel: MSR: 9032 EE: 1 PR: 0 FP: 0 ME: 1 IR/DR: 
11
Jan 19 15:17:58 penngrove kernel: DAR: 0074, DSISR: 4000
Jan 19 15:17:58 penngrove kernel: TASK = ccb5ebf0[1651] 'usb-stor-scan' THREAD: 
ccf62000
Jan 19 15:17:58 penngrove kernel: Last syscall: -1 
Jan 19 15:17:58 penngrove kernel: GPR00: C009BF14 CCF63DC0 CCB5EBF0 C01AF674 
0047 0047 CCF63DF8 000A 
Jan 19 15:17:58 penngrove kernel: GPR08:  8000 CC9185C8 CCF63DC0 
42002448 1001E284 100013A4  
Jan 19 15:17:58 penngrove kernel: GPR16:     
100013A4 100187C0  CCF63F18 
Jan 19 15:17:58 penngrove kernel: GPR24: CCF63F1C 0001  CCC61184 
CCF63DF8 CCC1EE84 CCC61188 0074 
Jan 19 15:17:58 penngrove kernel: NIP [c009bf14] create_dir+0x38/0x1d0
Jan 19 15:17:58 penngrove kernel: LR [c009bf14] create_dir+0x38/0x1d0
Jan 19 15:17:58 penngrove kernel: Call trace:
Jan 19 15:17:58 penngrove kernel:  [c009c108] sysfs_create_dir+0x48/0x94
Jan 19 15:17:58 penngrove kernel:  [c00ae97c] create_dir+0x28/0x6c
Jan 19 15:17:58 penngrove kernel:  [c00aec80] kobject_add+0x5c/0x15c
Jan 19 15:17:58 penngrove kernel:  [c00ec960] device_add+0xc4/0x19c
Jan 19 15:17:58 penngrove kernel:  [c0114590] scsi_sysfs_add_sdev+0x78/0x3a4
Jan 19 15:17:58 penngrove kernel:  [c0112a88] scsi_add_lun+0x2f8/0x364
Jan 19 15:17:58 penngrove kernel:  [c0112be0] scsi_probe_and_add_lun+0xec/0x1fc
Jan 19 15:17:58 penngrove kernel:  [c0113414] scsi_scan_target+0x7c/0xec
Jan 19 15:17:58 penngrove kernel:  [c0113500] scsi_scan_channel+0x7c/0x9c
Jan 19 15:17:58 penngrove kernel:  [c01135f8] scsi_scan_host_selected+0xd8/0x138
Jan 19 15:17:58 penngrove kernel:  [cfb04dc0] usb_stor_scan_thread+0x6c/0x124 
[usb_storage]
Jan 19 15:17:58 penngrove kernel:  [c00066c4] kernel_thread+0x44/0x60
===

[RFC] [PATCH] move bio code from dm into bio

2005-01-20 Thread Dave Olien


Jens, last December you observed there was bio code
duplicated in the dm drivers.

Here is a collection of patches that implements
support for local bio and bvec pools in bio.c and then
removes the duplicate bio code from the dm drivers.

It also replaces a call to bio_alloc() in dm.c with
a call that uses a local bio pool.  This removes a
deadlock case in that code.

These patches are against 2.6.11-rc1.  If that's not
a good source version to patch against, let me know
what versions I should generate patches for.

I still need to implement some form of congestion
control in the dm code.  As things are now, the snapshot
target can consume all the bios in the system.  With the elimination
of the deadlock case in dm.c, the system no longer deadlocks.
But instead, it starts invoking the oom killer.  That'll
be the next patch set to this code.

Please review.

Thanks!


diff -ur linux-2.6.11-rc1-original/drivers/md/dm.c linux-2.6.11-rc1-bio/drivers/md/dm.c
--- linux-2.6.11-rc1-original/drivers/md/dm.c   2004-12-24 13:35:24.0 -0800
+++ linux-2.6.11-rc1-bio/drivers/md/dm.c        2005-01-19 15:51:49.0 -0800
@@ -96,10 +96,16 @@
 static kmem_cache_t *_io_cache;
 static kmem_cache_t *_tio_cache;
 
+static struct bio_set *dm_set;
+
 static int __init local_init(void)
 {
int r;
 
+   dm_set = bioset_create(512, 512, 1);
+   if (!dm_set)
+   return -ENOMEM;
+
/* allocate a slab for the dm_ios */
_io_cache = kmem_cache_create("dm_io",
  sizeof(struct dm_io), 0, 0, NULL, NULL);
@@ -133,6 +139,8 @@
kmem_cache_destroy(_tio_cache);
kmem_cache_destroy(_io_cache);
 
+   bioset_free(dm_set);
+   
if (unregister_blkdev(_major, _name) < 0)
DMERR("devfs_unregister_blkdev failed");
 
@@ -393,7 +401,7 @@
struct bio *clone;
struct bio_vec *bv = bio->bi_io_vec + idx;
 
-   clone = bio_alloc(GFP_NOIO, 1);
+   clone = bio_alloc_bs(GFP_NOIO, 1, dm_set);
*clone->bi_io_vec = *bv;
 
clone->bi_sector = sector;
diff -ur linux-2.6.11-rc1-original/drivers/md/dm-io.c linux-2.6.11-rc1-bio/drivers/md/dm-io.c
--- linux-2.6.11-rc1-original/drivers/md/dm-io.c        2004-12-24 13:35:39.0 -0800
+++ linux-2.6.11-rc1-bio/drivers/md/dm-io.c     2005-01-19 15:26:55.0 -0800
@@ -12,207 +12,7 @@
 #include 
 #include 
 
-#define BIO_POOL_SIZE 256
-
-
-/*-
- * Bio set, move this to bio.c
- *---*/
-#define BV_NAME_SIZE 16
-struct biovec_pool {
-   int nr_vecs;
-   char name[BV_NAME_SIZE];
-   kmem_cache_t *slab;
-   mempool_t *pool;
-   atomic_t allocated; /* FIXME: debug */
-};
-
-#define BIOVEC_NR_POOLS 6
-struct bio_set {
-   char name[BV_NAME_SIZE];
-   kmem_cache_t *bio_slab;
-   mempool_t *bio_pool;
-   struct biovec_pool pools[BIOVEC_NR_POOLS];
-};
-
-static void bio_set_exit(struct bio_set *bs)
-{
-   unsigned i;
-   struct biovec_pool *bp;
-
-   if (bs->bio_pool)
-   mempool_destroy(bs->bio_pool);
-
-   if (bs->bio_slab)
-   kmem_cache_destroy(bs->bio_slab);
-
-   for (i = 0; i < BIOVEC_NR_POOLS; i++) {
-   bp = bs->pools + i;
-   if (bp->pool)
-   mempool_destroy(bp->pool);
-
-   if (bp->slab)
-   kmem_cache_destroy(bp->slab);
-   }
-}
-
-static void mk_name(char *str, size_t len, const char *prefix, unsigned count)
-{
-   snprintf(str, len, "%s-%u", prefix, count);
-}
-
-static int bio_set_init(struct bio_set *bs, const char *slab_prefix,
-unsigned pool_entries, unsigned scale)
-{
-   /* FIXME: this must match bvec_index(), why not go the
-* whole hog and have a pool per power of 2 ? */
-   static unsigned _vec_lengths[BIOVEC_NR_POOLS] = {
-   1, 4, 16, 64, 128, BIO_MAX_PAGES
-   };
-
-
-   unsigned i, size;
-   struct biovec_pool *bp;
-
-   /* zero the bs so we can tear down properly on error */
-   memset(bs, 0, sizeof(*bs));
-
-   /*
-* Set up the bio pool.
-*/
-   snprintf(bs->name, sizeof(bs->name), "%s-bio", slab_prefix);
-
-   bs->bio_slab = kmem_cache_create(bs->name, sizeof(struct bio), 0,
-SLAB_HWCACHE_ALIGN, NULL, NULL);
-   if (!bs->bio_slab) {
-   DMWARN("can't init bio slab");
-   goto bad;
-   }
-
-   bs->bio_pool = mempool_create(pool_entries, mempool_alloc_slab,
- mempool_free_slab, bs->bio_slab);
-   if (!bs->bio_pool) {
-   DMWARN("can't init bio pool");
-   goto bad;
-   }
-
-   /*
-* Set up the biovec pools.
-*/
-   for (i = 0; i < BIOVEC_NR_POOLS; i++) {
-   bp

[PATCH] to fix xtime lock for in the RT kernel patch

2005-01-20 Thread George Anzinger

It seems to me that we need to either do the attached or to rewrite the timer 
front end code to just gather the offset info and defer to the timer irq thread 
to update jiffies and the offset stuff.  In either case we really can not split 
the two and we do need the xtime_lock protection.
--
George Anzinger   george@mvista.com
High-res-timers:  http://sourceforge.net/projects/high-res-timers/
Source: MontaVista Software, Inc.  George Anzinger 
Type: Defect Fix 
Keywords:
Signed-off-by: George Anzinger 
Description:
This patch changes the timer interrupt code for the RT patch to 
respect the xtime_lock which should protect jiffies and to collect
offset information on jiffies interrupts.  This offset info must
be collected as soon as possible during the jiffies interrupt and 
also needs to be protected by the xtime_lock.

The xtime_lock is thus a "raw" lock.
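
As a rough illustration of why the offset capture and the jiffies update
belong under one write-side critical section, here is a single-threaded
user-space model of the sequence-counter idea.  It is not the kernel's
seqlock implementation: it has no spinlock and no memory barriers, and
every toy_* name is an assumption made for this sketch.

```c
/* Toy sequence counter; an odd value means a write is in progress. */
struct toy_seqlock {
    unsigned int sequence;
};

/* The two values that must always be observed as a consistent pair. */
struct toy_time {
    struct toy_seqlock lock;
    unsigned long jiffies;
    unsigned long offset;
};

static void toy_write_seqlock(struct toy_seqlock *sl)
{
    sl->sequence++;
}

static void toy_write_sequnlock(struct toy_seqlock *sl)
{
    sl->sequence++;
}

static unsigned int toy_read_seqbegin(const struct toy_seqlock *sl)
{
    return sl->sequence;
}

/* Nonzero if the reader raced with a writer and has to retry. */
static int toy_read_seqretry(const struct toy_seqlock *sl, unsigned int start)
{
    return (start & 1u) || sl->sequence != start;
}

/* One timer tick: both fields change inside the same write section. */
static void toy_timer_tick(struct toy_time *t, unsigned long hw_offset)
{
    toy_write_seqlock(&t->lock);
    t->offset = hw_offset;  /* stands in for cur_timer->mark_offset() */
    t->jiffies++;           /* stands in for do_timer_interrupt_hook() */
    toy_write_sequnlock(&t->lock);
}
```

A reader that sampled the counter before the tick is told to retry
afterwards, so it never pairs an old jiffies value with a new offset.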

 arch/i386/kernel/time.c |8 +---
 include/linux/time.h|2 +-
 kernel/timer.c  |2 +-
 3 files changed, 7 insertions(+), 5 deletions(-)

Index: topdir/kernel/timer.c
===
--- topdir.orig/kernel/timer.c
+++ topdir/kernel/timer.c
@@ -946,7 +946,7 @@ unsigned long wall_jiffies = INITIAL_JIF
  * playing with xtime and avenrun.
  */
 #ifndef ARCH_HAVE_XTIME_LOCK
-DECLARE_SEQLOCK(xtime_lock);
+DECLARE_RAW_SEQLOCK(xtime_lock);
 
 EXPORT_SYMBOL(xtime_lock);
 #endif
Index: topdir/include/linux/time.h
===
--- topdir.orig/include/linux/time.h
+++ topdir/include/linux/time.h
@@ -80,7 +80,7 @@ mktime (unsigned int year, unsigned int 
 
 extern struct timespec xtime;
 extern struct timespec wall_to_monotonic;
-extern seqlock_t xtime_lock;
+extern raw_seqlock_t xtime_lock;
 
 static inline unsigned long get_seconds(void)
 { 
Index: topdir/arch/i386/kernel/time.c
===
--- topdir.orig/arch/i386/kernel/time.c
+++ topdir/arch/i386/kernel/time.c
@@ -20,7 +20,7 @@
  * monotonic gettimeofday() with fast_get_timeoffset(),
  * drift-proof precision TSC calibration on boot
  * (C. Scott Ananian <[EMAIL PROTECTED]>, Andrew D.
- * Balsa <[EMAIL PROTECTED]>, Philip Gladstone <[EMAIL PROTECTED]>;
+ * Balsa <[EMAIL PROTECTED]>, Philip Gladstone <[EMAIL PROTECTED]>;
  * ported from 2.0.35 Jumbo-9 by Michael Krause <[EMAIL PROTECTED]>).
  * 1998-12-16Andrea Arcangeli
  * Fixed Jumbo-9 code in 2.1.131: do_gettimeofday was missing 1 jiffy
@@ -224,7 +224,10 @@ EXPORT_SYMBOL(profile_pc);
  */
 void direct_timer_interrupt(struct pt_regs *regs)
 {
+   write_seqlock(&xtime_lock);
+   cur_timer->mark_offset();
do_timer_interrupt_hook(regs);
+   write_sequnlock(&xtime_lock);
 }
 
 #endif
@@ -254,6 +257,7 @@ static inline void do_timer_interrupt(in
 #endif
 
 #ifndef CONFIG_PREEMPT_HARDIRQS
+   cur_timer->mark_offset();
do_timer_interrupt_hook(regs);
 #endif
 
@@ -312,8 +316,6 @@ irqreturn_t timer_interrupt(int irq, voi
 * locally disabled. -arca
 */
write_seqlock(&xtime_lock);
-
-   cur_timer->mark_offset();
  
do_timer_interrupt(irq, NULL, regs);

Re: Radeon framebuffer weirdness in -mm2

2005-01-20 Thread Andrew Morton

Matt Mackall <[EMAIL PROTECTED]> wrote:
>
> > Which radeon driver? CONFIG_FB_RADEON_OLD or CONFIG_FB_RADEON?
> 
> FB_RADEON.

Ah, OK.  Likely culprits are

radeonfb-massive-update-of-pm-code.patch
radeonfb-build-fix.patch


Re: sparse warning, or why does jifies_to_msecs() return an int?

2005-01-20 Thread David Mosberger

> On Sat, 15 Jan 2005 10:05:37 -0800 (PST), Linus Torvalds <[EMAIL PROTECTED]> said:

  Linus> Hmm.. I don't think your patch is wrong per se, but I do
  Linus> think it's a bit too subtle. I'd almost rather make
  Linus> "jiffies_to_msecs()" just test for overflow instead, and that
  Linus> should also fix it.

You sure about that?

Actually, I think my patch was broken anyhow for HZ < 1000 because you
can potentially get integer-overflows in temporary results which could
make things come out wrong again.

I _think_ the attached patch works for all reasonable cases reasonably
uniformly, but if you thought the previous patch was subtle, I'm sure
you're going to like this one even less.

Note that with the patch, platforms where HZ is not a power of two and
doesn't fit any of the other special cases (namely (HZ % 1000) != 0 &&
(1000 % HZ) != 0) would suffer a penalty.  AFAICS, this is true only
for Alpha/Rawhide (HZ=1200).  In such a case, rather than:

(j * 1000)/HZ

the new code would compute:

(j/HZ)*1000 + ((j%HZ)*1000)/HZ

It looks to me like we could get rid of all the ugly & complex
intermediate overflow-checks if we defined MAX_JIFFY_OFFSET
as:
(~0UL / 1000)

However, on a 32-bit platform that runs at 1000 Hz, this would limit
us to 4294 seconds.  That may be cutting it a bit close.

--david

= include/linux/jiffies.h 1.11 vs edited =
--- 1.11/include/linux/jiffies.h2005-01-04 18:48:02 -08:00
+++ edited/include/linux/jiffies.h  2005-01-20 15:21:14 -08:00
@@ -254,13 +254,32 @@
  */
 static inline unsigned int jiffies_to_msecs(const unsigned long j)
 {
+   unsigned long res;
+
 #if HZ <= 1000 && !(1000 % HZ)
-   return (1000 / HZ) * j;
+   unsigned long max = ~0UL / (1000 / HZ);
+
+   if (j > max)
+   j = max;
+   res = (1000 / HZ) * j;
 #elif HZ > 1000 && !(HZ % 1000)
-   return (j + (HZ / 1000) - 1)/(HZ / 1000);
+   res = (j + (HZ / 1000) - 1) / (HZ / 1000);
 #else
-   return (j * 1000) / HZ;
+   /*
+* HZ better be a power of two; otherwise this gets real
+* expensive.  Better expensive than wrong, though.
+*/
+# if HZ < 1000
+   unsigned long max = (~0UL / 1000) * HZ;
+
+   if (j > max)
+   j = max;
+# endif
+   res = (j / HZ) * 1000 + ((j % HZ) * 1000) / HZ;
 #endif
+   if (res > ~0U)
+   return ~0U;
+   return res;
 }
 
 static inline unsigned int jiffies_to_usecs(const unsigned long j)
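
As a rough user-space illustration of the split computation described above,
(j/HZ)*1000 + ((j%HZ)*1000)/HZ, the sketch below takes the tick rate as a
run-time parameter.  The function name and the hz parameter are assumptions
made for this example (the kernel uses the compile-time HZ constant), and this
is not the patch itself, only one way the clamping could be arranged.

```c
#include <limits.h>

/*
 * Hypothetical helper, not kernel code: convert j ticks at hz ticks per
 * second into milliseconds, saturating at UINT_MAX instead of wrapping.
 */
static unsigned int jiffies_to_msecs_sketch(unsigned long j, unsigned long hz)
{
	unsigned long secs = j / hz;	/* whole seconds */
	unsigned long rem = j % hz;	/* leftover ticks, always < hz */
	unsigned long res;

	/*
	 * (rem * 1000) / hz is at most 999, so secs * 1000 + 999 must
	 * still fit in an unsigned long.
	 */
	if (secs > (ULONG_MAX - 999) / 1000)
		return UINT_MAX;

	/* rem * 1000 cannot overflow for any realistic tick rate. */
	res = secs * 1000 + (rem * 1000) / hz;

	/* On 64-bit platforms the result may still exceed unsigned int. */
	if (res > UINT_MAX)
		return UINT_MAX;
	return (unsigned int)res;
}
```

By contrast, the naive (j * 1000) / HZ wraps on a 32-bit platform as soon as
j exceeds ~0UL / 1000, which is the limit discussed in the mail.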

Re: Radeon framebuffer weirdness in -mm2

2005-01-20 Thread Matt Mackall

On Thu, Jan 20, 2005 at 03:39:21PM -0800, Andrew Morton wrote:
> Matt Mackall <[EMAIL PROTECTED]> wrote:
> >
> > I'm seeing radeonfb on my ThinkPad T30 go weird on reboot (lots of
> > horizontal lines) and require powercycling to fix. Worked fine with 2.6.10.
> 
> Which radeon driver? CONFIG_FB_RADEON_OLD or CONFIG_FB_RADEON?

FB_RADEON.

> (cc Ben, who is the likely culprit ;)

Btw, ajoshi's address from MAINTAINERS is bouncing.
 
> Which -mm2, btw?  2.6.10-mm2 or 2.6.11-rc1-mm2?

2.6.11-rc1-mm2

> Did you try the corresponding -mm1?

Nothing between that and .10 yet. Building -mm1 now.

-- 
Mathematics is the supreme nostalgia of our time.

[PATCH] drivers/usb/devio.c, against ioctl bug in 2.4.28 & 2.4.29

2005-01-20 Thread Kaupo Arulo

Hi!
Here is the tested patch against modem_run and eciadsl hang since 2.4.28.
Longer discussion about it is in:
http://sourceforge.net/mailarchive/forum.php?thread_id=6054671_id=5398
and feedback from users is in:
http://www.mail-archive.com/speedtouch%40ml.free.fr/msg06848.html
The patch itself is also located in:
http://linux.ee/~kaups/devio.patch
It:
- prevents grabbing the exclusive_access mutex for ioctls that don't need it
- prevents grabbing the exclusive_access mutex for nonexistent ioctls
- uses interruptible sleep instead of uninterruptible sleep
PS. keep me in CC since I'm not subscribed...
--
best regards,
Kaupo Arulo

--- devio.c.orig 2004-11-28 22:24:49.0 +0200
+++ devio.c 2004-12-01 12:47:02.0 +0200
@@ -1153,45 +1153,62 @@ static int usbdev_ioctl(struct inode *in
 
if (!(file->f_mode & FMODE_WRITE))
return -EPERM;
-   down_read(&ps->devsem);
+   down_read(&ps->devsem); /* FIXME: should we set devsem also per "case" 
+  like exclusive_access to avoid
+  blocking nonexistent ioctls ? */
if (!ps->dev) {
up_read(&ps->devsem);
return -ENODEV;
}
-
-   /*
-* grab device's exclusive_access mutex to prevent its driver from
-* using this device while it is being accessed by us.
+/*
+ * Some ioctls don't touch the device and can be called without
+ * grabbing its exclusive_access mutex; they are handled together 
+ * in same switch with ioctls which need it. Exclusive_access is handled in
+ * particular switch branches, so we grab device's exclusive_access 
+* mutex ONLY if needed and WHEN actually needed!!! 
 */
-   down(&ps->dev->exclusive_access);
-
switch (cmd) {
case USBDEVFS_CONTROL:
-   ret = proc_control(ps, (void *)arg);
-   if (ret >= 0)
-   inode->i_mtime = CURRENT_TIME;
+   if (down_interruptible(&ps->dev->exclusive_access) == 0) {
+   ret = proc_control(ps, (void *)arg);
+   up(&ps->dev->exclusive_access);
+   if (ret >= 0)
+   inode->i_mtime = CURRENT_TIME;
+   } else ret = -ERESTARTSYS;
break;
 
case USBDEVFS_BULK:
-   ret = proc_bulk(ps, (void *)arg);
-   if (ret >= 0)
-   inode->i_mtime = CURRENT_TIME;
+   if (down_interruptible(&ps->dev->exclusive_access) == 0) {
+   ret = proc_bulk(ps, (void *)arg);
+   up(&ps->dev->exclusive_access);
+   if (ret >= 0)
+   inode->i_mtime = CURRENT_TIME;
+   } else ret = -ERESTARTSYS;
break;
 
case USBDEVFS_RESETEP:
-   ret = proc_resetep(ps, (void *)arg);
-   if (ret >= 0)
-   inode->i_mtime = CURRENT_TIME;
+   if (down_interruptible(&ps->dev->exclusive_access) == 0) {
+   ret = proc_resetep(ps, (void *)arg);
+   up(&ps->dev->exclusive_access);
+   if (ret >= 0)
+   inode->i_mtime = CURRENT_TIME;
+   } else ret = -ERESTARTSYS;
break;
 
case USBDEVFS_RESET:
-   ret = proc_resetdevice(ps);
+   if (down_interruptible(&ps->dev->exclusive_access) == 0) {
+   ret = proc_resetdevice(ps);
+   up(&ps->dev->exclusive_access);
+   } else ret = -ERESTARTSYS;
break;

case USBDEVFS_CLEAR_HALT:
-   ret = proc_clearhalt(ps, (void *)arg);
-   if (ret >= 0)
-   inode->i_mtime = CURRENT_TIME;
+   if (down_interruptible(&ps->dev->exclusive_access) == 0) {
+   ret = proc_clearhalt(ps, (void *)arg);
+   up(&ps->dev->exclusive_access);
+   if (ret >= 0)
+   inode->i_mtime = CURRENT_TIME;
+   } else ret = -ERESTARTSYS;
break;
 
case USBDEVFS_GETDRIVER:
@@ -1203,21 +1220,33 @@ static int usbdev_ioctl(struct inode *in
break;
 
case USBDEVFS_SETINTERFACE:
-   ret = proc_setintf(ps, (void *)arg);
+   if (down_interruptible(&ps->dev->exclusive_access) == 0) {
+   ret = proc_setintf(ps, (void *)arg);
+   up(&ps->dev->exclusive_access);
+   } else ret = -ERESTARTSYS;
break;
 
case USBDEVFS_SETCONFIGURATION:
-   ret = proc_setconfig(ps, (void *)arg);
+   if (down_interruptible(&ps->dev->exclusive_access) == 0) {
+   ret = proc_setconfig(ps, (void *)arg);
+   up(&ps->dev->exclusive_access);
+   }
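
The locking pattern this patch describes (take the lock only in the switch
branches that need it, sleep interruptibly, and never lock for unknown
commands) can be sketched in user space roughly as follows.  Everything here
is hypothetical: the toy lock, the command numbers and the function names are
invented for illustration and do not come from devio.c, and -EINTR stands in
for the kernel-internal -ERESTARTSYS.

```c
#include <errno.h>

/* Toy lock that only records how often it was taken. */
struct toy_lock {
	int held;
	int acquisitions;
};

/* Made-up command numbers for the example. */
enum { CMD_TRANSFER = 1, CMD_QUERY = 2 };

/* Mimics an interruptible sleep: fails if a signal is pending. */
static int toy_lock_interruptible(struct toy_lock *l, int signal_pending)
{
	if (signal_pending)
		return -1;
	l->held = 1;
	l->acquisitions++;
	return 0;
}

static void toy_unlock(struct toy_lock *l)
{
	l->held = 0;
}

static int toy_ioctl(struct toy_lock *l, int cmd, int signal_pending)
{
	int ret;

	switch (cmd) {
	case CMD_TRANSFER:
		/* Touches the device, so the lock is taken in this branch only. */
		if (toy_lock_interruptible(l, signal_pending) != 0)
			return -EINTR;
		ret = 0;	/* the real work would happen here */
		toy_unlock(l);
		return ret;
	case CMD_QUERY:
		/* Bookkeeping only: no lock needed. */
		return 0;
	default:
		/* Unknown command: fail without ever touching the lock. */
		return -ENOTTY;
	}
}
```

With this arrangement a caller issuing an unsupported command can no longer
block behind a long-running transfer.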

Re: [patch, BK-curr] nonintrusive spin-polling loop in kernel/spinlock.c

2005-01-20 Thread Linus Torvalds



Btw, I think I've now merged everything to bring us back to where we 
wanted to be - can people verify that the architecture they care about has 
all the right "read_can_lock()" etc infrastructure (and preferably that it 
_works_ too ;), and that I've not missed or incorrectly ignored some 
patches in this thread?

Linus

Re: Radeon framebuffer weirdness in -mm2

2005-01-20 Thread Andrew Morton

Matt Mackall <[EMAIL PROTECTED]> wrote:
>
> I'm seeing radeonfb on my ThinkPad T30 go weird on reboot (lots of
> horizontal lines) and require powercycling to fix. Worked fine with 2.6.10.

Which radeon driver? CONFIG_FB_RADEON_OLD or CONFIG_FB_RADEON?

(cc Ben, who is the likely culprit ;)

Which -mm2, btw?  2.6.10-mm2 or 2.6.11-rc1-mm2?

Did you try the corresponding -mm1?

Re: TCP checksum calculation

2005-01-20 Thread Stephen Hemminger

On Thu, 20 Jan 2005 15:52:34 -0500 (EST)
Rahul Jain <[EMAIL PROTECTED]> wrote:

> Hi,
> 
> I have written a module that changes IP addrs and TCP port values. After
> changing these fields, I am able to recalculate the IP checksum  within
> the module. To recalculate the TCP checksum, I wrote a new function in
> tcp_ipv4.c which is very similar to tcp_v4_send_check(). The only
> difference is that, my function does not use the sock parameter and gets
> the saddr and daddr from sk_buff. I call this function before the
> following piece of code in tcp_v4_rcv()
> 
> if ((skb->ip_summed != CHECKSUM_UNNECESSARY &&
>  tcp_v4_checksum_init(skb) < 0))
> goto bad_packet;
> 
> However I am still getting a bad tcp checksum error. Does anyone know what
> I am missing and point me in the right direction.
> 

Look at the netfilter code, in fact if you are changing values there may already
be a netfilter module to do what you want, and you could have saved the effort.

-- 
Stephen Hemminger   <[EMAIL PROTECTED]>
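
For background on the question above: the TCP checksum covers not only the
TCP header and payload but also a pseudo-header made of the source address,
destination address, protocol number and TCP length, so rewriting addresses
or ports changes the expected checksum.  The following is a minimal
user-space sketch of the full computation, not the kernel's or netfilter's
implementation; the function names are invented for the example, addresses
are taken in host byte order, and the segment is assumed to be shorter than
64 KiB.

```c
#include <stddef.h>
#include <stdint.h>

/* Add the buffer to sum as big-endian 16-bit words (odd tail zero-padded). */
static uint32_t sum16_sketch(const uint8_t *buf, size_t len, uint32_t sum)
{
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
	if (len & 1)
		sum += (uint32_t)buf[len - 1] << 8;
	return sum;
}

/* Fold the carries back in and return the one's complement. */
static uint16_t fold_sketch(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}

/*
 * seg points at the TCP header followed by the payload.  To generate a
 * checksum, the checksum field (bytes 16 and 17) must be zero; to verify
 * one, leave it in place and expect a result of 0.
 */
static uint16_t tcp_checksum_sketch(uint32_t saddr, uint32_t daddr,
				    const uint8_t *seg, size_t len)
{
	uint32_t sum = 0;

	/* Pseudo-header: addresses, protocol 6 (TCP), TCP length. */
	sum += saddr >> 16;
	sum += saddr & 0xffff;
	sum += daddr >> 16;
	sum += daddr & 0xffff;
	sum += 6;
	sum += (uint32_t)len;

	return fold_sketch(sum16_sketch(seg, len, sum));
}
```

A segment whose stored checksum matches its addresses verifies to 0; verifying
the same bytes against a changed address gives a nonzero result.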

Re: oom killer gone nuts

2005-01-20 Thread Andrea Arcangeli

On Thu, Jan 20, 2005 at 03:57:07PM -0600, Chris Friesen wrote:
> Andries Brouwer wrote:
> 
> >But let me stress that I also consider the earlier situation
> >unacceptable. It is really bad to lose a few weeks of computation.
> 
> Shouldn't the application be backing up intermediate results to disk 
> periodically?  Power outages do occur, as do bus faults, electrical 
> glitches, dead fans, etc.

Agreed. Plus if you truly cannot change the app because it's binary only
at least you can set the ulimit based on the virtual sizes, ulimit
should work reliably even if overcommit doesn't.
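
One way to apply such a limit from a wrapper program, rather than from the
shell, is setrlimit() with RLIMIT_AS, which is the resource that "ulimit -v"
adjusts.  This is only a sketch: the helper name is made up, and the caller
is assumed to exec the binary-only application after setting the limit.

```c
#include <sys/resource.h>

/*
 * Hypothetical helper: lower the soft address-space limit to `bytes`,
 * clamped to the current hard limit.  Returns 0 on success, -1 on error.
 */
static int cap_address_space(rlim_t bytes)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_AS, &rl) != 0)
		return -1;
	if (rl.rlim_max != RLIM_INFINITY && bytes > rl.rlim_max)
		bytes = rl.rlim_max;
	rl.rlim_cur = bytes;
	return setrlimit(RLIMIT_AS, &rl);
}
```

Once the limit is in place, allocations beyond it fail with ENOMEM inside the
application itself instead of pushing the whole machine toward the OOM killer.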

Re: [RFC] tracepipe -- event streams, debugfs, and pipe_buffers

2005-01-20 Thread Zach Brown

Karim Yaghmour wrote:
> Zach Brown wrote:
> 
>>Thoughts?  I, for one, am tired of writing throw-away per-cpu tracing
>>patches ;)
> 
> Have you taken a look at relayfs and ltt?

Only briefly.  They've always seemed more involved than the sort of
thing I was after.  I'll try and sit down and investigate in more detail.

Radeon framebuffer weirdness in -mm2

2005-01-20 Thread Matt Mackall

I'm seeing radeonfb on my ThinkPad T30 go weird on reboot (lots of
horizontal lines) and require powercycling to fix. Worked fine with 2.6.10.

-- 
Mathematics is the supreme nostalgia of our time.

Re: [PATCH]sched: Isochronous class v2 for unprivileged soft rt scheduling

2005-01-20 Thread Con Kolivas

[EMAIL PROTECTED] wrote:
> On Thu, Jan 20, 2005 at 10:42:24AM -0500, Paul Davis wrote:
> > over on #ardour last week, we saw appalling performance from
> > reiserfs. a 120GB filesystem with 11GB of space failed to be able to
> > deliver enough read/write speed to keep up with a 16 track
> > session. When the filesystem was cleared to provide 36GB of space,
> > things improved. The actual recording takes place using writes of
> > 256kB, and no more than a few hundred MB was being written during the
> > failed tests.
>
> It's been a long while since I followed ReiserFS development closely,
> *however*, this issue used to be a common problem with ReiserFS - when
> free space starts to drop below 10%, performance takes a big hit.  So
> performance improved when space was cleared up.
> I don't remember what causes this or what the status is in modern
> ReiserFS systems.
>
> > everything i read about reiser suggests it is unsuitable for audio
> > work: it is optimized around the common case of filesystems with many
> > small files. the filesystems where we record audio is typically filled
> > with a relatively small number of very, very large files.
>
> Anecdotally, I've found this to not be the case.  I only use ReiserFS
> and have a few reasonably sized projects in Ardour that work fine:
> maybe 20 tracks, with 10-15 plugins (in the whole project), and I can
> do overdubs with no problems.  It may be relevant that I only have a
> four track card and so load is too small.
> But at least in my practice, it hasn't been a huge hindrance.

This is my understanding of the situation, which is not gospel but my
interpretation of the information I have had available.

Reiserfs3.6 is in maintenance mode. Its performance was very good in 2.4 
days, but since 2.6 the block layer has matured so much that the code 
paths that were fast in reiserfs are no longer so impressive compared to 
those shared by ext3.

In terms of recommendation, the latency of non-preemptible codepaths 
will be fastest in ext3 in 2.6 due to the nature of it constantly being 
examined, addressed and updated. That does not mean it has the fastest 
performance by any stretch of the imagination. XFS, I believe, has 
significantly faster large file performance, and reiser3.6 has 
significantly faster small file performance. But if throughput is not a 
problem, and latency is, then ext3 is a better choice. Reiser4 is a 
curious beast with obviously high throughput, but for the moment I do 
not think it is remotely suitable for low latency applications.

As for the %full issue; no filesystem works well as it approaches full 
capacity. Performance degrades dramatically beyond 75% on all of them, 
becoming woeful once beyond 85%. If you're looking for good performance, 
more free capacity is more effective than changing filesystems.

All of this should be taken into consideration if you're worried about 
low latency cpu scheduling, as it all will collapse if your filesystem 
code has high latency in the kernel. It also would make benchmarking low 
latency cpu scheduling potentially prone to disastrous mis-interpretation.

Cheers,
Con

[PATCH 2.6.11-rc1-mm2] mips: fixed conflicting types

2005-01-20 Thread Yoichi Yuasa

This patch fixes the following 2 conflicting-type errors.

Yoichi

arch/mips/lib/csum_partial_copy.c:21: error: conflicting types for `csum_partial_copy_nocheck'
include/asm/checksum.h:65: error: previous declaration of `csum_partial_copy_nocheck'
arch/mips/lib/csum_partial_copy.c:38: error: conflicting types for `csum_partial_copy_from_user'
include/asm/checksum.h:38: error: previous declaration of `csum_partial_copy_from_user'
make[1]: *** [arch/mips/lib/csum_partial_copy.o] Error 1
make: *** [arch/mips/lib] Error 2

Signed-off-by: Yoichi Yuasa <[EMAIL PROTECTED]>

diff -urN -X dontdiff a-orig/arch/mips/lib/csum_partial_copy.c a/arch/mips/lib/csum_partial_copy.c
--- a-orig/arch/mips/lib/csum_partial_copy.c Wed Jan 12 13:02:09 2005
+++ a/arch/mips/lib/csum_partial_copy.c Fri Jan 21 07:47:35 2005
@@ -16,7 +16,7 @@
 /*
  * copy while checksumming, otherwise like csum_partial
  */
-unsigned int csum_partial_copy_nocheck(const char *src, char *dst,
+unsigned int csum_partial_copy_nocheck(const unsigned char *src, unsigned char *dst,
int len, unsigned int sum)
 {
/*
@@ -33,7 +33,7 @@
  * Copy from userspace and compute checksum.  If we catch an exception
  * then zero the rest of the buffer.
  */
-unsigned int csum_partial_copy_from_user (const char *src, char *dst,
+unsigned int csum_partial_copy_from_user (const unsigned char *src, unsigned char *dst,
int len, unsigned int sum, int *err_ptr)
 {
int missing;

inotify-0.18-rml-4: Oops

2005-01-20 Thread Juerg Billeter

Hi

I reproducibly get the following Oops as soon as I start using inotify
with gamin and/or beagle. This happens with linux 2.6.10-as1 + inotify
0.18-rml-4 on multiple x86 machines.

Unable to handle kernel NULL pointer dereference at virtual address 
 printing eip:
c01d6d31
*pde = 
Oops:  [#1]
PREEMPT SMP 
Modules linked in: nfs lockd sunrpc mga af_packet autofs4 md5 ipv6 e100 mii
snd_cmipci snd_opl3_lib snd_hwdep snd_mpu401_uart snd_rawmidi snd_seq_device
intel_agp agpgart snd_intel8x0 snd_ac97_codec tun snd_pcm_oss snd_pcm snd_timer
snd_page_alloc snd_mixer_oss snd soundcore ext3 jbd mbcache binfmt_misc xfs
sd_mod pl2303 usbserial ide_cd cdrom ide_disk aic7xxx scsi_mod piix ide_core
ehci_hcd uhci_hcd usbcore
CPU:0
EIP:0060:[inotify_dev_queue_event+353/368]Not tainted VLI
EFLAGS: 00010246   (2.6.10-paldo4) 
EIP is at inotify_dev_queue_event+0x161/0x170
eax:    ebx: d7a50f00   ecx: 0003   edx: c6c7a2cc
esi:    edi:    ebp: 0020   esp: c8b6bf6c
ds: 007b   es: 007b   ss: 0068
Process multiload-apple (pid: 2756, threadinfo=c8b6a000 task=e76bc020)
Stack: c014b27d    ddc822e8 ddc822e8 cbda31ac  
   0020 c01d72c9   0024 d8dd3980 f7772000 c8b6a000 
   c015826f  b777e8fc b777e8fc 8000 c0103029 b777e8fc  
Call Trace:
 [remove_vm_struct+93/144] remove_vm_struct+0x5d/0x90
 [inotify_inode_queue_event+73/128] inotify_inode_queue_event+0x49/0x80
 [sys_open+95/176] sys_open+0x5f/0xb0
 [sysenter_past_esp+82/117] sysenter_past_esp+0x52/0x75
Code: 24 18 8b 7c 24 1c 8b 6c 24 20 83 c4 24 c3 c7 04 24 00 00 00 00 8b 4c 24
0c ba 00 40 00 00 b8 ff ff ff ff e9 3d ff ff ff 8b 42 18 <80> 38 00 eb bf 8d
76 00 8d bc 27 00 00 00 00 53 89 c3 8b 4b 20 
 <6>note: multiload-apple[2756] exited with preempt_count 1

I can provide more information on request.

Thanks for any advice

Jürg

(please cc me on replies)

-- 
Juerg Billeter <[EMAIL PROTECTED]>


Re: patch to fix set_itimer() behaviour in boundary cases

2005-01-20 Thread George Anzinger

Arjan van de Ven wrote:
> On Wed, 2005-01-19 at 15:51 -0800, George Anzinger wrote:
> > Arjan van de Ven wrote:
> > > On Sun, 2005-01-16 at 00:58 +, Alan Cox wrote:
> > > > On Sad, 2005-01-15 at 09:30, Andrew Morton wrote:
> > > > > Matthias Lang <[EMAIL PROTECTED]> wrote:
> > > > > These are things we probably cannot change now.  All three are arguably
> > > > > sensible behaviour and do satisfy the principle of least surprise.  So
> > > > > there may be apps out there which will break if we "fix" these things.
> > > > > If the kernel version was 2.7.0 then well maybe...
> > > >
> > > > These are things we should fix. They are bugs. Since there is no 2.7
> > > > plan pick a date to fix it. We should certainly error the overflow case
> > > > *now* because the behaviour is undefined/broken. The other cases I'm not
> > > > clear about. setitimer() is a library interface and it can do the basic
> > > > checking and error if it wants to be strictly posixly compliant.
> > >
> > > why error?
> > > I'm pretty sure we can make a loop in the setitimer code that detects
> > > we're at the end of jiffies but haven't upsurped the entire interval the
> > > user requested yet, so that the code should just do another round of
> > > sleeping...
> >
> > That would work for sleep (but glibc uses nanosleep for that) but an itimer
> > delivers a signal.  Rather hard to trap that in glibc.
>
> This one I meant to fix in the kernel fwiw; we can put that loop inside
> the kernel easily I'm sure

Yes, but it will increase the data size of the timer...
--
George Anzinger   george@mvista.com
High-res-timers:  http://sourceforge.net/projects/high-res-timers/
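
The "another round of sleeping" idea quoted above amounts to splitting one
long interval into chunks that each fit the largest timeout a single timer
can express, re-arming until nothing is left.  A minimal sketch of that
bookkeeping, with a hypothetical helper name and plain integers standing in
for jiffies, might look like this; it is not taken from any posted patch.

```c
/*
 * Hypothetical helper: return the length of the next timer to arm and
 * subtract it from *remaining.  Returns 0 once the whole interval has
 * been consumed, at which point the signal would finally be delivered.
 */
static unsigned long next_timer_chunk(unsigned long long *remaining,
				      unsigned long max_chunk)
{
	unsigned long chunk;

	if (*remaining > max_chunk)
		chunk = max_chunk;
	else
		chunk = (unsigned long)*remaining;
	*remaining -= chunk;
	return chunk;
}
```

The extra state this needs, the remaining interval, is presumably the kind of
per-timer data the size concern above refers to.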

Re: [PATCH] dynamic tick patch

2005-01-20 Thread George Anzinger

Tony Lindgren wrote:
> * George Anzinger  [050119 16:25]:
> > Tony Lindgren wrote:
> > > * George Anzinger  [050119 15:00]:
> > > >
> > > > I don't think you will ever get good time if you EVER reprogram the PIT.
> > > > That is why the VST patch on sourceforge does NOT touch the PIT, it only
> > > > turns off the interrupt by interrupting the interrupt path (not changing
> > > > the PIT).  This allows the PIT to be the "gold standard" in time that it
> > > > is designed to be.  The wake up interrupt, then needs to come from an
> > > > independent timer.  My patch requires a local APIC for this.  Patch is
> > > > available at http://sourceforge.net/projects/high-res-timers/
> > >
> > > Well on my test systems I have pretty good accurate time. But I agree,
> > > PIT is not the best option for interrupt. It should be possible to use
> > > other interrupt sources as well.
> > > It should not matter where the timer interrupt comes from, as long as
> > > it comes when programmed. Updating time should be separate from timer
> > > interrupts. Currently we have a problem where time is tied to the
> > > timer interrupt.
> >
> > In the HRT code time is most correctly stated as wall_time +
> > get_arch_cycles_since(wall_jiffies) (plus conversion or two:)).  This is
> > somewhat removed from the tick interrupt, but is resynced to that
> > interrupt more or less each interrupt.
>
> That sounds very accurate :)
>
> > A second issue is trying to get the jiffies update as close to the run of
> > the timer list as possible.  Without this we have no hope of high res
> > timers.
>
> OK. But if the timer interrupt is separated from updating the time,
> the next timer interrupt should be programmable to happen exactly
> when a HRT timer needs it, right?

First, HRT uses a two phase system of timing.  The first phase is the normal
timer list expires the timer.  The timer is then handed to the high res code
which keeps a list of timers that are to expire prior to the next jiffie.  An
interrupt is scheduled to make this happen.  Depending on the hardware
available, this can come from the same timer or a different timer.  For example
on x86 systems with a local apic we use the apic timer to generate this
interrupt.  It triggers either a tasklet for UP or SMP without per cpu timers
or a soft irq for SMP systems with per cpu timers.

What this means is that, for timers near but just after a jiffie, the run_timer
list being late can make the HR timer late.

This code is on sourceforge if you want a closer look...

> Hmm, how about using a pool of programmable timers available on the
> system for the timer interrupts and HRT? Or is one interrupt source
> always enough?

Hardware heaven :), but no thanks.  A reliable tick generator for the jiffies
timer and one additional timer (or one per cpu) works well in the x86.

If you have something like the PPC where you can mess with the timer without
losing time, that works well also.  The correct formulation would be a "clock"
that can be read quickly and a timer tied to the same "rock" that uses the same
count units as the clock.  PARISC has a counter that just counts and a compare
register.  When they are equal an interrupt is generated.  That is a nice set up.

Now the X86 is bad and has little hope of being fixed for these reasons:
a.) the TSC is fast and easy to read but it's not clocked at any given frequency
and, on some platforms, it changes without notifying the software.
b.) the PIT and the PMTIMER are both in I/O space and so take forever to access.
c.) All three of these use different units (but at least the PMTIMER is
(supposed to be) related to the PIT clock).
d.) the HPET, again is in I/O space.  I suspect that it uses a reasonable "rock"
but, as I understand it, it knocks out the PIT and, of course it uses units
unrelated to all the others.

--
George Anzinger   george@mvista.com
High-res-timers:  http://sourceforge.net/projects/high-res-timers/
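
The formulation mentioned in this thread, current time = wall time at the
last tick plus the cycles counted since that tick, can be sketched in user
space as below.  The function and parameter names are assumptions for the
example (get_arch_cycles_since() itself is not reproduced), and the cycle
counter is assumed to run at a fixed, known rate below roughly 18 GHz so the
intermediate product fits in 64 bits.

```c
#include <stdint.h>

/*
 * Hypothetical helper: nanoseconds "now", given the wall time recorded at
 * the last tick and a free-running cycle counter sampled then and now.
 */
static uint64_t now_ns_sketch(uint64_t wall_ns_at_tick,
			      uint64_t cycles_at_tick,
			      uint64_t cycles_now,
			      uint64_t cycles_per_sec)
{
	uint64_t delta = cycles_now - cycles_at_tick;
	uint64_t secs = delta / cycles_per_sec;
	uint64_t rem = delta % cycles_per_sec;	/* always < cycles_per_sec */

	/* Convert whole seconds and the remainder separately to avoid overflow. */
	return wall_ns_at_tick + secs * 1000000000ull
		+ (rem * 1000000000ull) / cycles_per_sec;
}
```

For a counter running at 1193182 cycles per second, a delta of 1193182 cycles
adds exactly one second to the wall time recorded at the tick.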

Re: [RFC] tracepipe -- event streams, debugfs, and pipe_buffers

2005-01-20 Thread Karim Yaghmour


Zach Brown wrote:
> Thoughts?  I, for one, am tired of writing throw-away per-cpu tracing
> patches ;)

Have you taken a look at relayfs and ltt?

Karim
-- 
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [EMAIL PROTECTED] || 1-866-677-4546

Re: [PATCH][RFC] swsusp: speed up image restoring on x86-64

2005-01-20 Thread Pavel Machek

Hi!

> > > The readability of code is also important, IMHO.
> > 
> > It did not seem too much better to me.
> 
> Well, the beauty is in the eye of the beholder. :-)
> 
> Still, it shrinks the code (22 lines vs 37 lines), it uses less GPRs (5 vs 
> 7), it uses less
> SIB arithmetics (0 vs 4 times), it uses a well known scheme for copying data 
> pages.
> As far as the result is concerned, it is equivalent to the existing code, but 
> it's simpler
> (and faster).  IMO, simpler code is always easier to understand.
> 
> 
> > > > If you want cheap way to speed it up, kill cr3 manipulation.
> > > 
> > > Sure, but I think it's there for a reason.
> > 
> > Reason is "to crash it early if we have wrong pagetables".
> > 
> > > > Anyway, this is likely to clash with hugang's work; I'd prefer this not 
> > > > to be applied.
> > > 
> > > I am aware of that, but you are not going to merge the hugang's patches 
> > > soon, are you?
> > > If necessary, I can change the patch to work with his code (hugang, what 
> > > do you think?).
> > 
> > I think it is just not worth the effort.
> 
> Why?  It won't take much time.  I've spent more time for writing the messages
> in this thread ... ;-)

Well, I know that current code works. It was produced by C compiler,
btw. Now, new code works for you, but it was not in kernel for 4
releases, and... this code is pretty subtle. And it is hand-made, not
C produced.

So... your code may be better but I do not think it is so much better
that I'd like to risk it.

Pavel
-- 
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!

Re: [patch 5/5] Disallow in-inode attributes for reserved inodes

2005-01-20 Thread Andreas Dilger

On Jan 20, 2005  14:29 +0100, Andreas Gruenbacher wrote:
> The ea-in-inode patch totally relies on getting all the available inode space 
> cleared out by the kernel (or mke2fs, or e2fsck). If this is not the case for 
> any inode we find, then i_extra_isize may contain a random number, and we've 
> just lost, period: There is no way of sanitizing a random i_extra_isize; we 
> cannot know what the right number would be.

The large-inode support is designed to allow different amounts of
"fixed" optional data (i.e. what is stored inside i_extra_isize),
so it is valid to set this to 4 (i.e. just enough to hold i_extra_isize
itself) and store the EA data after that.  Any code which reads "fixed"
fields from a large inode (e.g. i_mtime_nsec) needs to validate that
i_extra_isize on that inode is large enough for that data to actually
be in the fixed area in the large inode.

If the kernel is setting i_extra_isize > 4 (i.e. it is storing optional
fields there like i_mtime_msb_and_ns) it should/is-able-to also initialize
those values since it should know what they are or they shouldn't be in
struct ext3_inode.

The whole point of i_extra_isize is that it is possible for inodes
to have different amounts of the optional fixed fields in each large
inode, depending on what the kernel that wrote the inode knew about.
So any value for i_extra_isize is valid as long as those fields
are initialized.  If we arbitrarily set i_extra_isize = 4 instead of
leaving the bad value this is no different than waiting for e2fsck to
do the same.

> > It is debatable whether we should mark inodes bad if the i_extra_isize
> > field is bad, or if we should just initialize i_extra_isize in that case.
> 
> IMHO it's not debatable. Taking an i_extra_isize that looks odd and simply 
> changing it to something we think is better is a really bad idea.
> 
> You may have an access acl on the inode. Not being able to read an access acl 
> is a clear sign of trouble. The same applies for everything else in the 
> system.* and security.* namespaces, at least.

Well, I said it was debatable and we're having a debate ;-).  I don't have
a strong opinion either way.  If we ext3_error() in this case at least we
will check the fs on the next boot (which will just zero i_extra_isize)
instead of never doing anything to resolve the situation.

> > For the root and lost+found inodes it looks like we can never store an
> > EA in the extra part of the inode regardless of whether i_extra_isize is
> > good or not.  If a bad value is found we could just initialize it and
> > start using that space (though not print an ext3_error() in that case,
> > an ext3_warning() if anything since this is probably the fault of mke2fs).
> 
> I disagree. We cannot just use the space when we think the inode is corrupted.

But as your patch stands it doesn't ever check if i_extra_isize is valid
for the root or lost+found inode.  It just always sets i_extra_isize = 0
and never uses it.  Given that the root inode is fairly high-traffic it
makes sense to use the faster EA space if it is available.

If these inodes have a BAD i_extra_isize it is OK to skip it, but I'm
not so keen to have an ext3_error() there.  If the user doesn't have an
e2fsck with ea-in-inode support there isn't anything they can do to fix
it and they will get a full e2fsck on each boot.

Even so, for the effort of setting i_extra_isize = 4 (or larger if we
initialize the fixed fields) we can do the equivalent of what e2fsck will
do when it finds a bogus value.

The good news is that we can still apply your patch as-is and address my
concerns later since this is a transient issue.  Also, given that there
are probably only a handful of filesystems in the world using large inodes
(excluding Lustre filesystems which aren't affected by this) I don't think
it is a pressing issue yet.  I'm going to be away for 2 weeks, so I'll say
accept this patch as is and we can look at it again when I get back, and
maybe Ted and Stephen will have weighed in on this issue also.

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://members.shaw.ca/adilger/ http://members.shaw.ca/golinux/
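
The validation rule described above (a fixed optional field may only be read
if i_extra_isize on that inode actually covers it) can be written as a small
bounds check.  This is a sketch under stated assumptions: the helper name and
its parameters are invented for illustration, the base inode is taken to be
128 bytes, and offsets are measured from the start of the on-disk inode.

```c
#include <stddef.h>
#include <stdint.h>

#define GOOD_OLD_INODE_SIZE_SKETCH 128u	/* size of the original inode */

/*
 * Hypothetical helper: returns 1 if the field at [field_offset,
 * field_offset + field_size) lies inside the fixed extra area that
 * i_extra_isize claims, and that area itself fits in the on-disk inode.
 */
static int extra_field_present(uint16_t i_extra_isize, size_t field_offset,
			       size_t field_size, size_t inode_size)
{
	size_t fixed_end = GOOD_OLD_INODE_SIZE_SKETCH + (size_t)i_extra_isize;

	if (fixed_end > inode_size)
		return 0;	/* bogus i_extra_isize: trust nothing */
	if (field_offset < GOOD_OLD_INODE_SIZE_SKETCH)
		return 0;	/* not an extra field at all */
	return field_offset + field_size <= fixed_end;
}
```

With i_extra_isize = 4, only the 16-bit i_extra_isize field itself (and its
padding) counts as present; a 4-byte field starting at offset 132 needs
i_extra_isize of at least 8.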


Re: [PATCH][RFC] swsusp: speed up image restoring on x86-64

2005-01-20 Thread Rafael J. Wysocki

Hi,

On Thursday, 20 of January 2005 23:06, Pavel Machek wrote:
> Hi!
> 
> > > > The following patch speeds up the restoring of swsusp images on x86-64
> > > > and makes the assembly code more readable (tested and works on AMD64).  
> > > > It's
> > > > against 2.6.11-rc1-mm1, but applies to 2.6.11-rc1-mm2.  Please consider 
> > > > for applying.
> > > 
> > > Can you really measure the speedup?
> > 
> > In terms of time?  Probably I can, but I prefer to measure it in terms of 
> > the numbers of
> > operations to be performed.
> > 
> > With this patch, at least 8 times less memory accesses are required to 
> > restore an image
> > than without it, and in the original code cr3 is reloaded after copying 
> > each _byte_,
> > let alone the SIB arithmetics.  I'd expect it to be 10 times faster
> > or so.
> 
> Well, 8 times less cr3 reloads may be significant... for the copy
> loop. Speeding up copy loop that takes  ... 100msec?... of whole
> resume (30 seconds) does not seem too important to me.
> 
> > The readability of code is also important, IMHO.
> 
> It did not seem too much better to me.

Well, the beauty is in the eye of the beholder. :-)

Still, it shrinks the code (22 lines vs 37 lines), it uses fewer GPRs (5 vs 7),
it uses less SIB arithmetic (0 vs 4 times), and it uses a well-known scheme for
copying data pages.  As far as the result is concerned, it is equivalent to the
existing code, but it's simpler (and faster).  IMO, simpler code is always
easier to understand.


> > > If you want cheap way to speed it up, kill cr3 manipulation.
> > 
> > Sure, but I think it's there for a reason.
> 
> Reason is "to crash it early if we have wrong pagetables".
> 
> > > Anyway, this is likely to clash with hugang's work; I'd prefer this not 
> > > to be applied.
> > 
> > I am aware of that, but you are not going to merge the hugang's patches 
> > soon, are you?
> > If necessary, I can change the patch to work with his code (hugang, what do 
> > you think?).
> 
> I think it is just not worth the effort.

Why?  It won't take much time.  I've spent more time writing the messages
in this thread ... ;-)

Greets,
RJW


-- 
- Would you tell me, please, which way I ought to go from here?
- That depends a good deal on where you want to get to.
-- Lewis Carroll "Alice's Adventures in Wonderland"
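
The "at least 8 times less memory accesses" figure in this thread follows
from copying a page in 8-byte words rather than single bytes: a 4096-byte
page takes 512 word copies instead of 4096 byte copies.  The C sketch below
only illustrates that arithmetic; the actual patch is x86-64 assembly that is
not shown in this thread, and the names here are made up.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u

/*
 * Hypothetical helper: copy one page in 64-bit words and return how many
 * word copies were needed.
 */
static size_t copy_page_words(uint64_t *dst, const uint64_t *src)
{
	size_t i;
	size_t n = SKETCH_PAGE_SIZE / sizeof(uint64_t);

	for (i = 0; i < n; i++)
		dst[i] = src[i];
	return n;
}
```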
