Re: [PATCH] relayfs redux for 2.6.10: lean and mean
On Fri, Jan 21, 2005 at 06:27:43PM +1100, Peter Williams wrote:
> Greg KH wrote:
> >On Fri, Jan 21, 2005 at 01:15:28PM +1100, Peter Williams wrote:
> >
> >>Perhaps the logical solution is to implement debugfs in terms of relayfs?
> >
> >
> >What do you mean by this statement?
>
> I mean that if, as you say, debugfs is very similar to relayfs only more
> restricted (i.e. a debugging option) then it should be implementable as
> an instance or specialization of the more general relayfs and that this
> should be a better solution than two independent implementations of
> similar functionality.

Ah. No. The implementations are not of the same functionality, or so Karim says.

thanks,

greg k-h
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: patch to fix set_itimer() behaviour in boundary cases
> > > > This one I meant to fix in the kernel fwiw; we can put that loop inside
> > > > the kernel easily I'm sure
> >
> > Yes, but it will increase the data size of the timer...
>

Eh, how? The way I think it can be done is to just have multiple timers fire until the total time is up. It's not a performance issue after all (a timer firing every 24 days... who cares, especially since such long delays are rare anyway).
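The "multiple timers fire until the total time is up" idea in the reply above can be sketched in plain C. This is only an illustration under assumptions, not the kernel's itimer code: `MAX_SINGLE_TIMEOUT`, `struct long_timer` and the helper names are all hypothetical, and the thread does not state the real per-timer limit.

```c
#include <stdint.h>

/* Hypothetical cap on one arming of the underlying timer, in ticks.
 * The thread only says a timer would fire roughly every 24 days. */
#define MAX_SINGLE_TIMEOUT ((uint64_t)INT32_MAX)

/* Hypothetical bookkeeping for a timeout longer than one arming allows. */
struct long_timer {
	uint64_t remaining;	/* ticks still to wait in total */
};

/* How long to arm the underlying timer for on the next round. */
static uint64_t long_timer_next_chunk(const struct long_timer *t)
{
	return t->remaining > MAX_SINGLE_TIMEOUT ? MAX_SINGLE_TIMEOUT
						 : t->remaining;
}

/* Called when the underlying timer fires after `elapsed` ticks.
 * Returns 1 when the total time is up, 0 when it must be re-armed. */
static int long_timer_fired(struct long_timer *t, uint64_t elapsed)
{
	t->remaining = elapsed >= t->remaining ? 0 : t->remaining - elapsed;
	return t->remaining == 0;
}

/* Count how many firings a total timeout of `total` ticks needs. */
static unsigned int long_timer_count_firings(uint64_t total)
{
	struct long_timer t = { .remaining = total };
	unsigned int fires = 0;

	for (;;) {
		uint64_t chunk = long_timer_next_chunk(&t);

		fires++;
		if (long_timer_fired(&t, chunk))
			return fires;
	}
}
```

Only the remaining total has to be tracked here, and the extra wakeups happen once per maximum interval, which is the reason given above for not treating it as a performance issue.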
Re: [PATCH]sched: Isochronous class v2 for unprivileged soft rt scheduling
* Con Kolivas <[EMAIL PROTECTED]> wrote:

> In terms of recommendation, the latency of non-preemptible codepaths
> will be fastest in ext3 in 2.6 due to the nature of it constantly
> being examined, addressed and updated. That does not mean it has the
> fastest performance by any stretch of the imagination. [...]

i agree with the latency observation. But ext3 got two significant performance boosts recently, at two ends of the performance spectrum:

 - in the (lots-of-)small-files area: the addition of the htree feature

 - in the large-files-throughput case: with the addition of the reservation feature.

ext3 installed by a recent distro should have both features enabled. (i know for sure that Fedora Core 3 with the update/erratum kernel installed will create ext3 filesystems that utilize both of these features by default.)

I encourage everyone to try the famous 'create and read 1 million small files' test on both recent ext3 and on other filesystems.

	Ingo
Re: 2.6.11-rc1 vs. PowerMac 8500/G3 (and VAIO laptop) [usb-storage oops]
On Thu, 2005-01-20 at 16:08 -0800, Greg KH wrote:
> Doh, sorry for missing this one. I've applied your patch to my trees,
> and will show up in the next -mm release.

Actually I think John's problem was that the usb core code has now _stopped_ doing this byteswapping, and he has a lsusb which is hacked to expect it. So if you apply my patch you're preserving the userspace ABI by reverting to the extremely stupid behaviour of byteswapping _some_ of the fields in the descriptor we pass to userspace.

-- 
dwmw2
Re: [Linux-ATM-General] Kernel 2.6.10 and 2.4.29 Oops fore200e (fwd)
On Tue, 18 Jan 2005, chas williams - CONTRACTOR wrote:

> the system keeps running right? the error is a 'warning' that the
> fore200e driver is sleeping when it should not (probably while holding
> interrupts). the schedule() around line 1782 is not a good idea since
> the fore200e_send() might not be running in a sleepable context. just
> try commenting that line for now.

Sorry, but I don't understand which line; I am not a kernel guru. :/

oceanic:/usr/src/linux-2.4.29$ grep fore200e_send * -r
drivers/atm/fore200e.c:fore200e_send(struct atm_vcc *vcc, struct sk_buff *skb)
drivers/atm/fore200e.c: send: fore200e_send,

It happened on 2.4.29, too. Is it an interrupt problem?

Below is the Oops from 2.4.29:

ksymoops 2.4.11 on i686 2.4.29. Options used
 -V (default)
 -k /proc/ksyms (default)
 -l /proc/modules (default)
 -o /lib/modules/2.4.29/ (default)
 -m /lib/modules/2.4.29/System.map (specified)

kernel BUG at sched.c:564!
invalid operand:
CPU:0
EIP:0010:[]Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010286
eax: 0018 ebx: f76d2088 ecx: c02b2000 edx: f7651f7c
esi: edi: ebp: c02b3cdc esp: c02b3cac
ds: 0018 es: 0018 ss: 0018
Process swapper (pid: 0, stackpage=c02b3000)
Stack: c026b646 376e8c01 f470 0054 c02b2000 f7c95494 c02b2000 0054
       f76d2088 0246 f76d3084 f76d00e8 f8843d42 f76d f950 0038
       0001 f67d7c10 0038 0038 001f
Call Trace:[] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] []
Code: 0f 0b 34 02 3e b6 26 c0 e9 17 fb ff ff 0f 0b 2d 02 3e b6 26

EIP; c0114f57<=
ebx; f76d2088 <_end+3738b1bc/384fb194>
ecx; c02b2000
edx; f7651f7c <_end+3730b0b0/384fb194>
ebp; c02b3cdc
esp; c02b3cac

Trace; f8843d42 <[fore_200e]fore200e_send+172/6d0>
Trace; c02599d6
Trace; c01fe4a9
Trace; c01f36df
Trace; c020fa03
Trace; c01fda4f
Trace; c020f920
Trace; c020e3c2
Trace; c020f920
Trace; c020d060
Trace; c01fda4f
Trace; c020d010
Trace; c020cf4a
Trace; c020d010
Trace; c020bd09
Trace; c01fda4f
Trace; c020bb00
Trace; c020b920
Trace; c020bb00
Trace; c01f3cb4
Trace; c01f3e0d
Trace; c01f3f55
Trace; c011d0a6
Trace; c0109296
Trace; c0105330
Trace; c010b938
Trace; c0105330
Trace; c0105359
Trace; c01053f2
Trace; c0105000 <_stext+0/0>

Code; c0114f57 <_EIP>:
Code; c0114f57<=  0: 0f 0b              ud2a <=
Code; c0114f59    2: 34 02              xor $0x2,%al
Code; c0114f5b    4: 3e                 ds
Code; c0114f5c    5: b6 26              mov $0x26,%dh
Code; c0114f5e    7: c0 e9 17           shr $0x17,%cl
Code; c0114f61    a: fb                 sti
Code; c0114f62    b: ff                 (bad)
Code; c0114f63    c: ff 0f              decl (%edi)
Code; c0114f65    e: 0b 2d 02 3e b6 26  or 0x26b63e02,%ebp

<0>Kernel panic: Aiee, killing interrupt handler!

-- 
*[ Łukasz Trąbiński ]*
SysAdmin @wsisiz.edu.pl
Re: oom killer gone nuts
On Thu, Jan 20 2005, Andrea Arcangeli wrote:
> On Thu, Jan 20, 2005 at 02:15:56PM +0100, Andries Brouwer wrote:
> > On Thu, Jan 20, 2005 at 01:34:06PM +0100, Jens Axboe wrote:
> >
> > > Using current BK on my x86-64 workstation, it went completely nuts today
> > > killing tasks left and right with oodles of free memory available.
> >
> > Yes, the fact that the oom-killer exists is a serious problem.
> > People work on trying to tune it, instead of just removing it.
>
> I'm working on fixing it, not just tuning it. The bugs in mainline
> aren't about the selection algorithm (which is normally what people
> calls oom killer). The bugs in mainline are about being able to kill a
> task reliably, regardless of which task we pick, and every linux kernel
> out there has always killed some task when it was oom. So the bugs are
> just obvious regressions of 2.6 if compared to 2.4.
>
> But this is all fixed now, I'm starting sending the first patches to
> Andrew very shortly (last week there was still the oracle stuff going
> on). Now I can fix the rejects.
>
> I will guarantee nothing about which task will be picked (that's the old
> code at works, I changed not a bit in what normally people calls "the oom
> killer", plus the recent improvement from Thomas), but I guarantee the
> VM won't kill tasks right and left like it does now (i.e. by invoking the
> oom killer multiple times).

And especially not with 500MB of zone normal free, thanks :)

2.6.11-rc1-xx vm behaviour is looking a _lot_ worse than 2.6.10 btw. I haven't looked closer at what has changed yet, it's just a subjective feeling. I regularly have to run a fillmem.c hog to prune caches or it runs like an old dog.

-- 
Jens Axboe
Bug report : drivers/net/hamradio/Kconfig
Hello,

I'm translating some Kconfig files to French for the kernelFR project (http://kernelfr.traduc.org), and I found an error while reading drivers/net/hamradio/Kconfig.

The kernel is 2.6.10.

In the section "Baycom ser12 halfduplex driver for AX.25" (the 9th section), the 3rd line reads:

"The driver supports the ser12 design in full-duplex mode."

instead of "half-duplex mode."

Please send all your answers to: [EMAIL PROTECTED]
I'm not a member of the mailing list.

+ Simon
Re: [PATCH] relayfs redux for 2.6.10: lean and mean
Greg KH wrote:
> On Fri, Jan 21, 2005 at 01:15:28PM +1100, Peter Williams wrote:
> > Perhaps the logical solution is to implement debugfs in terms of relayfs?
>
> What do you mean by this statement?

I mean that if, as you say, debugfs is very similar to relayfs only more restricted (i.e. a debugging option) then it should be implementable as an instance or specialization of the more general relayfs and that this should be a better solution than two independent implementations of similar functionality.

Peter
-- 
Peter Williams [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
Re: OOM fixes 2/5
On Fri, Jan 21, 2005 at 08:08:21AM +0100, Andi Kleen wrote:
> So at least for GFP_DMA it seems to be definitely needed.

Indeed. Plus if you add a pci32 zone, it'll be needed for it too on x86-64, like for the normal zone on x86, since ptes will go in highmem while pci32 allocations will not. So while floppy might be fixed, this issue would remain for a brand new pci32 zone needed by some device (i.e. nvidia, so not such an unlikely corner case).
Re: OOM fixes 2/5
On Fri, Jan 21, 2005 at 06:04:25PM +1100, Nick Piggin wrote:
> OK this is a fairly lame example... but the current code is more or
> less just lucky that ZONE_DMA doesn't usually fill up with pinned mem
> on machines that need explicit ZONE_DMA allocations.

Yep. For the DMA zone all slab cache will be a memory pin (like ptes for highmem, but not that many people run with 3G of ram in ptes, and I guess the ones doing it aren't normally using a mainline kernel in the first place so they're likely not running into it either). Slab cache pinning the normal zone, on the other hand, has more probability of being reproduced on l-k in random usages.
Re: OOM fixes 2/5
On Thu, Jan 20, 2005 at 11:00:16PM -0800, Andrew Morton wrote:
> Last time we discussed this you pointed out that reserving more lowmem from
> highmem-capable allocations may actually *help* things. (Tries to remember
> why) By reducing inode/dentry eviction rates? I asked Martin Bligh if he
> could test that on a big NUMA box but iirc the results were inconclusive.

This is correct: by guaranteeing more memory to be freeable in lowmem (ptes aren't freeable without a sigkill for example), the icache/dcache will at least have a margin where it can grow independently from highmem allocations.

> Maybe it just won't make much difference. Hard to say.

I don't know myself if it makes a performance difference, all old benchmarks have been run with this applied. This was applied for correctness (i.e. to avoid sigkills or lockups), it wasn't applied for performance. But I don't see how it could hurt performance (especially given current code already does the check at runtime, which is practically the only fast-path cost ;).

> > The sysctl name had to change to lowmem_reserve_ratio because its
> > semantics are completely different now.
>
> That reminds me. Documentation/filesystems/proc.txt ;)

Woops, forgotten about it ;)

> I'll cook something up for that.

Thanks. If you prefer I can write it too to relieve you from this load, it's up to you. If you want to fix it yourself go ahead of course ;)
Re: COMMAND_LINE_SIZE increasing in 2.6.11-rc1-bk6
> I really suggest to push this limit to 4k. My reason is that under UML I
> need to put a lot of stuff in command line and uml crash if I not extend
> this limit. Can we make it depend on architecture?

It's dependent on the architecture already.

I would like to enable it on i386/x86-64 because the kernel command line is often used to pass parameters to installers, and having a small limit there can be awkward. But first we need to figure out what went wrong with EDD.

Matt D., do you have thoughts on this?

-Andi
Re: OOM fixes 2/5
On Thu, 2005-01-20 at 22:46 -0800, Andrew Morton wrote:
> Nick Piggin <[EMAIL PROTECTED]> wrote:
> > It does turn on lowmem protection by default. We never reached
> > an agreement about doing this though, but Andrea has shown that
> > it fixes trivial OOM cases.
> >
> > I think it should be turned on by default. I can't recall what
> > your reservations were...?
> >
>
> Just that it throws away a bunch of potentially usable memory. In three
> years I've seen zero reports of any problems which would have been solved
> by increasing the protection ratio.
>
> Thus empirically, it appears that the number of machines which need a
> non-zero protection ratio is exceedingly small. Why change the setting on
> all machines for the benefit of the tiny few? Seems weird. Especially
> when this problem could be solved with a few-line initscript. Ho hum.

That is true, but it should not reserve a great deal of memory on small memory machines. ZONE_NORMAL reservation may not even be too noticeable as you'll usually have ZONE_NORMAL allocations during the course of normal running.

Although it is true that there haven't been many problems attributed to this, one example I can remember is when we fixed the __alloc_pages watermark code, we fixed a bug that was reserving much more ZONE_DMA than it was supposed to. This caused all those page allocation failure problems. So we raised the atomic reserve, but that didn't bring ZONE_DMA reservation back to its previous levels.

"So the buffer between GFP_KERNEL and GFP_ATOMIC allocations is:

2.6.8     | 465 dma, 117 norm, 582 tot = 2328K
2.6.10-rc |   2 dma, 146 norm, 148 tot =  592K
patch     |  12 dma, 500 norm, 512 tot = 2048K"

So we were still seeing GFP_DMA allocation failures in the sound code. You recently had to make that NOWARN to shut it up.

OK this is a fairly lame example... but the current code is more or less just lucky that ZONE_DMA doesn't usually fill up with pinned mem on machines that need explicit ZONE_DMA allocations.
Re: OOM fixes 2/5
Andrew Morton <[EMAIL PROTECTED]> writes:

> Just that it throws away a bunch of potentially usable memory. In three
> years I've seen zero reports of any problems which would have been solved
> by increasing the protection ratio.

We ran into a big problem with this on x86-64. The SUSE installer would load the floppy driver during installation. Floppy driver would try to allocate some pages with GFP_DMA and on a small memory x86-64 system (256-512MB) the OOM killer would always start to kill things trying to free some DMA pages. This was quite a show stopper because you effectively couldn't install.

So at least for GFP_DMA it seems to be definitely needed.

-Andi
Re: OOM fixes 2/5
On Thu, Jan 20, 2005 at 10:46:45PM -0800, Andrew Morton wrote:
> Thus empirically, it appears that the number of machines which need a
> non-zero protection ratio is exceedingly small. Why change the setting on
> all machines for the benefit of the tiny few? Seems weird. Especially
> when this problem could be solved with a few-line initscript. Ho hum.

It's up to you. IMHO you're making a mistake, but I don't mind as long as our customers aren't at risk of early oom kills (or worse, kernel crashes) with some db load (especially without swap the risk is huge for all users, since all anonymous memory will be pinned like ptes, but with ~3G of pagetables they're at risk even with swap).

At least you *must* admit that without my patch applied as I posted, there's a >0 probability of running out of normal zone, which will lead to an oom-kill or a deadlock even though 10G of highmem might still be freeable (like with clean cache). And my patch obviously cannot make it impossible to run out of normal zone, since there's only 800m of normal zone and one can open more files than what fits in normal zone, but at least it gives the user the security that a certain workload can run reliably. Without this patch there's no guarantee at all that any workload will run when >1G of ptes is allocated.

The fix below is needed as well, and you won't find reports of people reproducing this race condition. Please apply. CC'ed Hugh. Sorry Hugh, I know you were working on it (you said not in the weekend IIRC), but I've been upgraded to latest bk so I had to fix it up quickly or I would have to run the racy code on my smp systems to test new kernels.

From: Andrea Arcangeli <[EMAIL PROTECTED]>
Subject: fixup smp race introduced in 2.6.11-rc1

Signed-off-by: Andrea Arcangeli <[EMAIL PROTECTED]>

--- x/mm/memory.c.~1~	2005-01-21 06:58:14.747335048 +0100
+++ x/mm/memory.c	2005-01-21 07:16:15.318063328 +0100
@@ -1555,8 +1555,17 @@ void unmap_mapping_range(struct address_
 
 	spin_lock(&mapping->i_mmap_lock);
 
+	/* serialize i_size write against truncate_count write */
+	smp_wmb();
 	/* Protect against page faults, and endless unmapping loops */
 	mapping->truncate_count++;
+	/*
+	 * For archs where spin_lock has inclusive semantics like ia64
+	 * this smp_mb() will prevent to read pagetable contents
+	 * before the truncate_count increment is visible to
+	 * other cpus.
+	 */
+	smp_mb();
 	if (unlikely(is_restart_addr(mapping->truncate_count))) {
 		if (mapping->truncate_count == 0)
 			reset_vma_truncate_counts(mapping);
@@ -1864,10 +1873,18 @@ do_no_page(struct mm_struct *mm, struct 
 	if (vma->vm_file) {
 		mapping = vma->vm_file->f_mapping;
 		sequence = mapping->truncate_count;
+		smp_rmb(); /* serializes i_size against truncate_count */
 	}
 retry:
 	cond_resched();
 	new_page = vma->vm_ops->nopage(vma, address & PAGE_MASK, &ret);
+	/*
+	 * No smp_rmb is needed here as long as there's a full
+	 * spin_lock/unlock sequence inside the ->nopage callback
+	 * (for the pagecache lookup) that acts as an implicit
+	 * smp_mb() and prevents the i_size read to happen
+	 * after the next truncate_count read.
+	 */
 
 	/* no page was available -- either SIGBUS or OOM */
 	if (new_page == NOPAGE_SIGBUS)
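The ordering that the patch above enforces between i_size and truncate_count can be approximated in userspace with C11 atomics. This is a simplified sketch under assumptions, not the kernel code: `struct mapping_sketch`, `truncate_sketch()` and `fault_sketch()` are hypothetical names, the fences stand in for smp_wmb()/smp_mb()/smp_rmb(), and the re-check here is a plain retry loop, whereas do_no_page() re-checks the counter later in the fault path.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two address_space fields involved. */
struct mapping_sketch {
	atomic_uint truncate_count;	/* bumped once per truncate */
	atomic_long i_size;		/* file size in bytes */
};

/* Writer side: store the new size, then bump the counter. */
static void truncate_sketch(struct mapping_sketch *m, long new_size)
{
	atomic_store_explicit(&m->i_size, new_size, memory_order_relaxed);
	/* plays the role of smp_wmb(): size store before counter store */
	atomic_thread_fence(memory_order_release);
	atomic_fetch_add_explicit(&m->truncate_count, 1, memory_order_relaxed);
	/* plays the role of smp_mb(): counter visible before later accesses */
	atomic_thread_fence(memory_order_seq_cst);
}

/* Reader side: sample the counter, read the size, retry if the counter
 * moved in between. Returns true if `offset` is inside the file. */
static bool fault_sketch(struct mapping_sketch *m, long offset)
{
	unsigned int seq;
	long size;

	do {
		seq = atomic_load_explicit(&m->truncate_count,
					   memory_order_relaxed);
		/* plays the role of smp_rmb(): counter load before size load */
		atomic_thread_fence(memory_order_acquire);
		size = atomic_load_explicit(&m->i_size, memory_order_relaxed);
		/* order the size load before the counter re-check */
		atomic_thread_fence(memory_order_acquire);
	} while (seq != atomic_load_explicit(&m->truncate_count,
					     memory_order_relaxed));

	return offset < size;
}
```

A reader that samples the incremented counter is guaranteed to read the new size as well, because the writer publishes the size before the counter and the reader loads them in the opposite order.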
[PATCH] Reserving backup region for kexec based crashdumps.
Hi Andrew,

Following patch is against 2.6.11-rc1-mm2. As mentioned by following note from Eric, crashdump code is currently broken.

>
> The crashdump code is currently slightly broken. I have attempted to
> minimize the breakage so things can quick be made to work again.

We have started doing changes to make crashdump up and running again. Following are few identified items to be done.

1. Reserve the backup region (640k) during kernel bootup.
2. Copy the data to backup region during crash. (moved to kexec user space code, patch posted in separate mail)
3. Prepare elf headers while loading kexec panic kernel and store in reserved memory area.
4. Pass required information to crashdump kernel, which parses it and exports through /proc/vmcore. (may be user space utility, open to discussion)

Following patch implements item 1) in the list. Soon we shall be rolling out the patches for rest.

Thanks
Vivek

This patch adds support for reserving 640k memory as backup region as required by crashdump kernel for x86.
---

Signed-off-by: Vivek Goyal <[EMAIL PROTECTED]>
---

 linux-2.6.11-rc1-mm2-kexec-eric-root/arch/i386/kernel/setup.c |    8 ++++++++
 linux-2.6.11-rc1-mm2-kexec-eric-root/include/linux/kexec.h    |    6 +++++-
 linux-2.6.11-rc1-mm2-kexec-eric-root/kernel/kexec.c           |    8 ++++++++
 3 files changed, 21 insertions(+), 1 deletion(-)

diff -puN arch/i386/kernel/setup.c~crashdump-x86-reserve-640k-memory arch/i386/kernel/setup.c
--- linux-2.6.11-rc1-mm2-kexec-eric/arch/i386/kernel/setup.c~crashdump-x86-reserve-640k-memory	2005-01-20 13:55:33.0 +0530
+++ linux-2.6.11-rc1-mm2-kexec-eric-root/arch/i386/kernel/setup.c	2005-01-20 13:55:33.0 +0530
@@ -1159,6 +1159,13 @@ static unsigned long __init setup_memory
 #ifdef CONFIG_KEXEC
 	if (crashk_res.start != crashk_res.end) {
 		reserve_bootmem(crashk_res.start, crashk_res.end - crashk_res.start + 1);
+
+#define CRASH_DUMP_BACKUP 0xa0000
+		/* Reserve another 640Kb for crashdump backup. */
+		crashdumpk_res.start = crashk_res.end + 1;
+		crashdumpk_res.end = crashdumpk_res.start +
+					CRASH_DUMP_BACKUP -1;
+		reserve_bootmem(crashdumpk_res.start, CRASH_DUMP_BACKUP);
 	}
 #endif
 	return max_low_pfn;
@@ -1202,6 +1209,7 @@ legacy_init_iomem_resources(struct resou
 			request_resource(res, data_resource);
 #ifdef CONFIG_KEXEC
 			request_resource(res, &crashk_res);
+			request_resource(res, &crashdumpk_res);
 #endif
 		}
 	}
diff -puN include/linux/kexec.h~crashdump-x86-reserve-640k-memory include/linux/kexec.h
--- linux-2.6.11-rc1-mm2-kexec-eric/include/linux/kexec.h~crashdump-x86-reserve-640k-memory	2005-01-20 13:55:33.0 +0530
+++ linux-2.6.11-rc1-mm2-kexec-eric-root/include/linux/kexec.h	2005-01-20 13:55:33.0 +0530
@@ -79,7 +79,7 @@ struct kimage {
 	unsigned long control_page;
 
 	/* Flags to indicate special processing */
-	int type : 1;
+	unsigned int type : 1;
 #define KEXEC_TYPE_DEFAULT 0
 #define KEXEC_TYPE_CRASH 1
 };
@@ -122,6 +122,10 @@ extern struct kimage *kexec_crash_image;
  */
 extern struct resource crashk_res;
 
+/* Location of backup region to hold the crashdump kernel data.
+ */
+extern struct resource crashdumpk_res;
+
 #else /* !CONFIG_KEXEC */
 static inline void crash_kexec(void) { }
 #endif /* CONFIG_KEXEC */
diff -puN kernel/kexec.c~crashdump-x86-reserve-640k-memory kernel/kexec.c
--- linux-2.6.11-rc1-mm2-kexec-eric/kernel/kexec.c~crashdump-x86-reserve-640k-memory	2005-01-20 13:55:33.0 +0530
+++ linux-2.6.11-rc1-mm2-kexec-eric-root/kernel/kexec.c	2005-01-20 13:55:33.0 +0530
@@ -32,6 +32,14 @@ struct resource crashk_res = {
 	.flags = IORESOURCE_BUSY | IORESOURCE_MEM
 };
 
+/* Location of the backup area for the crash dump kernel */
+struct resource crashdumpk_res = {
+	.name = "Crash Dump Backup",
+	.start = 0,
+	.end = 0,
+	.flags = IORESOURCE_BUSY | IORESOURCE_MEM
+};
+
 /*
  * When kexec transitions to the new kernel there is a one-to-one
  * mapping between physical and virtual addresses. On processors
_
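The address arithmetic in the setup.c hunk above (a backup region placed directly after the crash kernel region, with inclusive end addresses) can be checked in isolation. A minimal userspace sketch, assuming `struct region` as a hypothetical stand-in for the start/end fields of `struct resource`; the example addresses in the note below are made up for illustration.

```c
/* 640 KiB, the backup size named in the patch description. */
#define BACKUP_SIZE 0xa0000UL

/* Hypothetical stand-in for struct resource: both bounds are inclusive. */
struct region {
	unsigned long start;
	unsigned long end;
};

/* Size in bytes of a region with inclusive bounds. */
static unsigned long region_size(const struct region *r)
{
	return r->end - r->start + 1;
}

/* Place the backup region immediately after the crash kernel region. */
static struct region backup_after(const struct region *crash)
{
	struct region backup;

	backup.start = crash->end + 1;
	backup.end = backup.start + BACKUP_SIZE - 1;
	return backup;
}
```

For a crash region of 0x1000000-0x1ffffff this yields a backup region of 0x2000000-0x209ffff. The `- 1` keeps the end inclusive, matching the `end - start + 1` size computation used in the reserve_bootmem() call for crashk_res.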
Re: COMMAND_LINE_SIZE increasing in 2.6.11-rc1-bk6
On Thu, 20 Jan 2005, Andi Kleen wrote:

> > AOL:
> > - lilo 22.6.1
> > - CONFIG_EDD=y
> > - 2.6.10-mm1 and 2.6.11-rc1 did boot
> > - 2.6.11-rc1-mm1 and 2.6.11-rc1-mm2 didn't boot
> > - 2.6.11-rc1-mm2 with this ChangeSet reverted boots.
>
> What I gather so far the problem seems to only happen with lilo and EDD
> together. grub appears to work. Or did anyone see problems with grub too?
> I'll dig a bit, but reverting for now is probably best. Thanks Linus.

I really suggest pushing this limit to 4k. My reason is that under UML I need to put a lot of stuff on the command line, and UML crashes if I do not extend this limit. Can we make it depend on the architecture?

Thanks.

---
Catalin(ux aka Dino) BOIE
catab at deuroconsult.ro
http://kernel.umbrella.ro/
Re: OOM fixes 2/5
Andrea Arcangeli <[EMAIL PROTECTED]> wrote:
>
> Anyway if you leave it off by default I don't mind, with my new code
> forward ported straight from 2.4 mainline, it's possible for the first
> time to set it from userspace without having to embed knowledge on the
> kernel min_kbytes settings at boot time.

Last time we discussed this you pointed out that reserving more lowmem from highmem-capable allocations may actually *help* things. (Tries to remember why) By reducing inode/dentry eviction rates? I asked Martin Bligh if he could test that on a big NUMA box but iirc the results were inconclusive.

Maybe it just won't make much difference. Hard to say.

> The sysctl name had to change to lowmem_reserve_ratio because its
> semantics are completely different now.

That reminds me. Documentation/filesystems/proc.txt ;)

I'll cook something up for that.
Re: [PATCH] relayfs redux for 2.6.10: lean and mean
On Fri, Jan 21, 2005 at 01:15:28PM +1100, Peter Williams wrote:
>
> Perhaps the logical solution is to implement debugfs in terms of relayfs?

What do you mean by this statement?

greg k-h
Re: [PATCH] relayfs redux for 2.6.10: lean and mean
On Thu, Jan 20, 2005 at 08:38:25PM -0500, Karim Yaghmour wrote:
>
> Greg KH wrote:
> > Hm, how about this idea for cutting about 500 more lines from the code:
> >
> > Why not drop the "fs" part of relayfs and just make the code a set of
> > struct file_operations. That way you could have "relayfs-like" files in
> > any ram based file system that is being used. Then, a user could use
> > these fops and assorted interface to create debugfs or even procfs files
> > using this type of interface.
> >
> > As relayfs really is almost the same (conceptually wise) as debugfs as
> > far as concept of what kinds of files will be in there (nothing anyone
> > would ever rely on for normal operations, but for debugging only) this
> > keeps users and developers from having to spread their debugging and
> > instrumenting files across two different file systems.
>
> However this assumes that the users of relayfs are not going to want
> it during normal system operation.

That is true.

> This is an assumption that fails with at least LTT as it is targeted
> at sysadmins, application developers and power users who need to be
> able to trace their systems at any time.

Are they willing to trade off the performance of LTT to get this? I thought this was being touted as a "when you need to test" type of thing, not a "run it all the time" type of feature.

> I don't mind piggy-backing off another fs, if it makes sense, but
> unlike debugfs, relayfs is meant for general use, and all files in there
> are of the same type: relay channels for dumping huge amounts of data
> to user-space.

And a driver will never want to have both a relay channel, and a simple debug output at the same time? You are now requiring them to look for that data in two different points in the fs.

> It seems to me the target audience and basic idea (relay
> channels only in the fs) are different, but let me know if there's a
> compelling argument for doing this in another way without making it too
> confusing for users of those special "files" (IOW, when this starts
> being used in distros, it'll be more straightforward for users to
> understand if all files in a mounted fs behave a certain way than if
> they have certain "odd" files in certain directories, even if it's
> /proc.)

So, since you are proposing that relayfs be mounted all the time, where do you want to mount it at? I had to provide a "standard" location for debugfs for people to be happy with it, and the same issue comes up here.

Also, why not export your relayfs ops so that someone using debugfs can create a relay channel in it, or in any other type of fs they might create?

thanks,

greg k-h
Re: OOM fixes 2/5
On Fri, Jan 21, 2005 at 05:36:14PM +1100, Nick Piggin wrote:
> I think it should be turned on by default. I can't recall what

I think so too, since the number of people that can be bitten by this is certainly higher than the number of people who know the VM internals and know for what kind of workloads they need to enable this by hand to avoid risking lockups (notably on boxes without swap or with heavy pagetable allocations all the time, which is not uncommon with db usage).

This is needed on x86-64 too, to prevent pagetables from locking up the dma zone. And it's needed on x86 as well for the dma zone on <1G boxes.

Anyway if you leave it off by default I don't mind: with my new code forward ported straight from 2.4 mainline, it's possible for the first time to set it from userspace without having to embed knowledge of the kernel min_kbytes settings at boot time. So if you want it off by default it simply means we'll guarantee it on our distro with userland. Setting a sysctl at boot time is no big deal for us (of course leaving it enabled by default in kernel space benefits older distros where userland isn't yet aware of it). So it's pretty much up to you; as long as we can easily fix it up in userland it's fine with me, and I already tried a dozen times to push mainline in what I believe to be the right direction (like I already did in 2.4 mainline, since that same code is enabled by default in 2.4).

The sysctl name had to change to lowmem_reserve_ratio because its semantics are completely different now.
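How a per-zone reserve ratio protects low zones from highmem-capable allocations can be sketched as follows. The thread does not spell out the formula, so this is an assumption: the reserve a lower zone holds back from an allocation that could have used higher zones is taken to be the page count of those higher zones divided by the lower zone's ratio. The struct, the function names, and the numbers used with them are hypothetical.

```c
#include <stdbool.h>

enum { ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM, NR_ZONES };

/* Hypothetical, cut-down zone descriptor. */
struct zone_sketch {
	unsigned long present_pages;
	/* pages held back from allocations whose preferred zone is [j] */
	unsigned long lowmem_reserve[NR_ZONES];
};

/* Assumed rule: reserve[idx][j] = pages in zones idx+1..j / ratio[idx]. */
static void setup_lowmem_reserve(struct zone_sketch zones[NR_ZONES],
				 const unsigned long ratio[NR_ZONES])
{
	for (int i = 0; i < NR_ZONES; i++)
		for (int j = 0; j < NR_ZONES; j++)
			zones[i].lowmem_reserve[j] = 0;

	for (int j = 0; j < NR_ZONES; j++) {
		unsigned long higher_pages = zones[j].present_pages;

		for (int idx = j - 1; idx >= 0; idx--) {
			zones[idx].lowmem_reserve[j] = higher_pages / ratio[idx];
			higher_pages += zones[idx].present_pages;
		}
	}
}

/* An allocation preferring `classzone` may fall back to `z` only while
 * z keeps more free pages than its watermark plus the reserve. */
static bool zone_alloc_ok(const struct zone_sketch *z, int classzone,
			  unsigned long free_pages, unsigned long min_pages)
{
	return free_pages > min_pages + z->lowmem_reserve[classzone];
}
```

With made-up sizes of 4000 DMA, 200000 normal and 800000 highmem pages and ratios of 256 and 32, the normal zone would hold back 25000 pages from highmem-capable allocations, while allocations that can only use the normal zone see no reserve at all. A ratio set from userspace at boot, as described above, would only change the divisors.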
Re: OOM fixes 2/5
Nick Piggin <[EMAIL PROTECTED]> wrote: > > On Thu, 2005-01-20 at 22:20 -0800, Andrew Morton wrote: > > Andrea Arcangeli <[EMAIL PROTECTED]> wrote: > > > > > > This is the forward port to 2.6 of the lowmem_reserved algorithm I > > > invented in 2.4.1*, merged in 2.4.2x already and needed to fix workloads > > > like google (especially without swap) on x86 with >1G of ram, but it's > > > needed in all sort of workloads with lots of ram on x86, it's also > > > needed on x86-64 for dma allocations. This brings 2.6 in sync with > > > latest 2.4.2x. > > > > But this patch doesn't change anything at all in the page allocation path > > apart from renaming lots of things, does it? > > > > AFAICT all it does is to change the default values in the protection map. > > It does it via a simplification, which is nice, but I can't see how it > > fixes anything. > > > > Confused. > > > It does turn on lowmem protection by default. We never reached > an agreement about doing this though, but Andrea has shown that > it fixes trivial OOM cases. > > I think it should be turned on by default. I can't recall what > your reservations were...? > Just that it throws away a bunch of potentially usable memory. In three years I've seen zero reports of any problems which would have been solved by increasing the protection ratio. Thus empirically, it appears that the number of machines which need a non-zero protection ratio is exceedingly small. Why change the setting on all machines for the benefit of the tiny few? Seems weird. Especially when this problem could be solved with a few-line initscript. Ho hum. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.10-mm2: it87 sensor driver stops CPU fan
Hi Nicolas, > I confirm that 0x7f is full speed. So at least the polarity bit is correct, and Gigabyte isn't to blame. > > Once you know if the polarity is correct, you can try different > > values of PWM between 0x00 and 0x7F and see how exactly your fan > > reacts to them. > > That's where things get really really interesting. As mentioned > above 0x7f drives the fan full speed (2596 RPM). Now lowering that > value slows the CPU fan gradually down to a certain point. With a > value of 0x3f the fan turns at 1041 RPM. But below 0x3f the fan > starts speeding up again to reach a peak of 2280 RPM with a value > of 0x31, then it slows down again toward 0 RPM as the register > value is decreased down to 0. > > Bit 3 of register 0x14, when set, only modifies the curve so the > first minimum is instead reached at 0x30 then the peak occurs at 0x1d > before dropping to 0. > > Changing the PWM base clock select has no effect. Wow! Unexpected, to say the least. This is the first time I've seen such behavior. Could it be that your CPU fan isn't a simple passive device but one of these high-tech models with an embedded thermal sensor and automatic speed adjustment? This would possibly interact with the motherboard PWM capability and could explain the strange speed curve you obtained. I would also like you to try a similar test with your case fan. Enable "smart guardian" mode for this one (by writing 0x73 to register 0x13), then scan the 0x7f-0x00 range (register 0x16) like you did for your CPU fan. I wonder if you will obtain the same kind of result or a standard linear curve. (Note that PWM2 might not be wired at all on your motherboard, so don't be surprised if the case fan speed doesn't change at all.) Thanks, -- Jean Delvare http://khali.linux-fr.org/
Re: writeback-highmem
On Thu, Jan 20, 2005 at 10:26:30PM -0800, Andrew Morton wrote: > Andrea Arcangeli <[EMAIL PROTECTED]> wrote: > > > > This needed highmem fix from Rik is still missing too, so please apply > > along the other 5 (it's orthogonal so you can apply this one in any > > order you want). > > > > From: Rik van Riel <[EMAIL PROTECTED]> > > Subject: [PATCH][1/2] adjust dirty threshold for lowmem-only mappings > > I've held off on this one because the recent throttling fix should have > helped this problem. Has anyone confirmed that this patch still actually > fixes something? If so, what was the scenario? Without this fix write throttling is completely broken for a blkdev and it won't start _at_all_ and it'll just keep hanging in the allocation routines. I agree it won't explain oom (with the other fixes the VM should writeback synchronously instead of running oom) but it may make the box completely unusable under a cp /dev/zero /dev/somedevice. There is a reason why we start write throttling before 100% of ram is being locked by dirty pages in the pagecache path. The beauty of this fix is that Rik allowed the pagecache not to have the limit (in 2.4 pagecache had the limit too). Probably async writeback won't start but at least the write throttling will and that's all we need to keep the box running other apps at the same time of the write. If the system goes unresponsive for 10 minutes and swaps during backups or workloads working on the blkdev, they'll file bugreports and they'd be correct. In short I agree this shouldn't be applied for oom, but it's still definitely a correct and needed fix (and I rate it a bit more than just an optimization). - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] compat ioctl security hook fixup (take2)
* Andi Kleen ([EMAIL PROTECTED]) wrote: > On Thu, Jan 20, 2005 at 09:51:03PM -0800, Chris Wright wrote: > > > If you add it make at least sure it's not EXPORT_SYMBOL()ed. > > > > It's certainly not, nor intended to be. Would a comment to that > > affect alleviate your concern? > > Yes please. Patch respun, with comment added. thanks, -chris -- Introduce a simple helper, vfs_ioctl(), so that both sys_ioctl() and compat_sys_ioctl() call the security hook in all cases and without duplication. Signed-off-by: Chris Wright <[EMAIL PROTECTED]> = fs/ioctl.c 1.15 vs edited = --- 1.15/fs/ioctl.c 2005-01-15 14:31:01 -08:00 +++ edited/fs/ioctl.c 2005-01-20 22:27:43 -08:00 @@ -77,21 +77,13 @@ static int file_ioctl(struct file *filp, return do_ioctl(filp, cmd, arg); } - -asmlinkage long sys_ioctl(unsigned int fd, unsigned int cmd, unsigned long arg) +/* Simple helper for sys_ioctl and compat_sys_ioctl. Not for drivers' + * use, and not intended to be EXPORT_SYMBOL()'d + */ +int vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd, unsigned long arg) { - struct file * filp; unsigned int flag; - int on, error = -EBADF; - int fput_needed; - - filp = fget_light(fd, _needed); - if (!filp) - goto out; - - error = security_file_ioctl(filp, cmd, arg); - if (error) - goto out_fput; + int on, error = 0; switch (cmd) { case FIOCLEX: @@ -157,6 +149,24 @@ asmlinkage long sys_ioctl(unsigned int f error = do_ioctl(filp, cmd, arg); break; } + return error; +} + +asmlinkage long sys_ioctl(unsigned int fd, unsigned int cmd, unsigned long arg) +{ + struct file * filp; + int error = -EBADF; + int fput_needed; + + filp = fget_light(fd, _needed); + if (!filp) + goto out; + + error = security_file_ioctl(filp, cmd, arg); + if (error) + goto out_fput; + + error = vfs_ioctl(filp, fd, cmd, arg); out_fput: fput_light(filp, fput_needed); out: = fs/compat.c 1.48 vs edited = --- 1.48/fs/compat.c2005-01-15 14:31:01 -08:00 +++ edited/fs/compat.c 2005-01-20 22:25:33 -08:00 @@ -437,6 +437,11 @@ 
asmlinkage long compat_sys_ioctl(unsigne if (!filp) goto out; + /* RED-PEN how should LSM module know it's handling 32bit? */ + error = security_file_ioctl(filp, cmd, arg); + if (error) + goto out_fput; + if (filp->f_op && filp->f_op->compat_ioctl) { error = filp->f_op->compat_ioctl(filp, cmd, arg); if (error != -ENOIOCTLCMD) @@ -477,7 +482,7 @@ asmlinkage long compat_sys_ioctl(unsigne up_read(_sem); do_ioctl: - error = sys_ioctl(fd, cmd, arg); + error = vfs_ioctl(filp, fd, cmd, arg); out_fput: fput_light(filp, fput_needed); out: = include/linux/fs.h 1.373 vs edited = --- 1.373/include/linux/fs.h2005-01-15 14:31:01 -08:00 +++ edited/include/linux/fs.h 2005-01-20 22:25:33 -08:00 @@ -1564,6 +1564,8 @@ extern int vfs_stat(char __user *, struc extern int vfs_lstat(char __user *, struct kstat *); extern int vfs_fstat(unsigned int, struct kstat *); +extern int vfs_ioctl(struct file *, unsigned int, unsigned int, unsigned long); + extern struct file_system_type *get_fs_type(const char *name); extern struct super_block *get_super(struct block_device *); extern struct super_block *user_get_super(dev_t); - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
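For readers skimming the diff above, the shape of the change can be sketched in user space. This is only a mock under stated assumptions: struct mock_file, the mock_* functions and the "result equals command number" behaviour are all invented for illustration and are not the kernel code. What it tries to show is that each entry point runs the security hook once and then calls the shared helper, so the compat path no longer has to go through the native syscall.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct file; not the kernel definition. */
struct mock_file {
    int deny_ioctl;   /* nonzero: the mock security hook refuses */
    int hook_calls;   /* how many times the hook ran for this file */
};

/* Mock of the security hook: 0 allows, a negative errno denies. */
static int mock_security_file_ioctl(struct mock_file *filp, unsigned int cmd)
{
    (void)cmd;
    filp->hook_calls++;
    return filp->deny_ioctl ? -EPERM : 0;
}

/* Shared helper: command dispatch only, no security check inside. */
static int mock_vfs_ioctl(struct mock_file *filp, unsigned int cmd)
{
    assert(filp != NULL);
    return (int)cmd;  /* pretend the result is the command number */
}

/* Native entry point: look up, check once, dispatch. */
int mock_sys_ioctl(struct mock_file *filp, unsigned int cmd)
{
    int error;

    if (filp == NULL)
        return -EBADF;
    error = mock_security_file_ioctl(filp, cmd);
    if (error)
        return error;
    return mock_vfs_ioctl(filp, cmd);
}

/* Compat entry point: same check, then the shared helper rather than
 * mock_sys_ioctl(), so the hook runs exactly once on this path too. */
int mock_compat_sys_ioctl(struct mock_file *filp, unsigned int cmd)
{
    int error;

    if (filp == NULL)
        return -EBADF;
    error = mock_security_file_ioctl(filp, cmd);
    if (error)
        return error;
    return mock_vfs_ioctl(filp, cmd);
}
```

Keeping the helper free of the hook is what lets both callers share it without double-checking; it is also why the helper should stay out of drivers' hands, as discussed in the thread.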
Re: OOM fixes 2/5
On Thu, 2005-01-20 at 22:20 -0800, Andrew Morton wrote: > Andrea Arcangeli <[EMAIL PROTECTED]> wrote: > > > > This is the forward port to 2.6 of the lowmem_reserved algorithm I > > invented in 2.4.1*, merged in 2.4.2x already and needed to fix workloads > > like google (especially without swap) on x86 with >1G of ram, but it's > > needed in all sort of workloads with lots of ram on x86, it's also > > needed on x86-64 for dma allocations. This brings 2.6 in sync with > > latest 2.4.2x. > > But this patch doesn't change anything at all in the page allocation path > apart from renaming lots of things, does it? > > AFAICT all it does is to change the default values in the protection map. > It does it via a simplification, which is nice, but I can't see how it > fixes anything. > > Confused. It does turn on lowmem protection by default. We never reached an agreement about doing this though, but Andrea has shown that it fixes trivial OOM cases. I think it should be turned on by default. I can't recall what your reservations were...? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] to fix xtime lock for in the RT kernel patch
* George Anzinger wrote:

> It seems to me that we need to either do the attached or to rewrite
> the timer front end code to just gather the offset info and defer to
> the timer irq thread to update jiffies and the offset stuff. In
> either case we really can not split the two and we do need the
> xtime_lock protection.

how about the patch below? One of the important benefits of the threaded timer IRQ is the ability to make xtime_lock a mutex.

	Ingo

--- linux/arch/i386/kernel/time.c.orig2
+++ linux/arch/i386/kernel/time.c
@@ -313,6 +313,7 @@ irqreturn_t timer_interrupt(int irq, voi
 	write_seqlock(&xtime_lock);
 	cur_timer->mark_offset();
+	do_timer(regs);
 	do_timer_interrupt(irq, NULL, regs);
--- linux/include/asm-i386/mach-default/do_timer.h.orig2
+++ linux/include/asm-i386/mach-default/do_timer.h
@@ -16,7 +16,6 @@
 static inline void do_timer_interrupt_hook(struct pt_regs *regs)
 {
-	do_timer(regs);
 #ifndef CONFIG_SMP
 	update_process_times(user_mode(regs));
 #endif
Re: OOM fixes 2/5
On Thu, Jan 20, 2005 at 10:20:56PM -0800, Andrew Morton wrote: > Andrea Arcangeli <[EMAIL PROTECTED]> wrote: > > > > This is the forward port to 2.6 of the lowmem_reserved algorithm I > > invented in 2.4.1*, merged in 2.4.2x already and needed to fix workloads > > like google (especially without swap) on x86 with >1G of ram, but it's > > needed in all sort of workloads with lots of ram on x86, it's also > > needed on x86-64 for dma allocations. This brings 2.6 in sync with > > latest 2.4.2x. > > But this patch doesn't change anything at all in the page allocation path > apart from renaming lots of things, does it? In the allocation path not, but it rewrites the setting algorithm, so from somebody watching it from userspace it's a completely different thing, usable for the first time ever in 2.6. Otherwise userspace would be required to have knowledge about the kernel internals to be able to set it to a sane value. Plus the new init code is much cleaner too. > AFAICT all it does is to change the default values in the protection map. > It does it via a simplification, which is nice, but I can't see how it > fixes anything. Having this patch applied is a major fix. See again the google fix thread in 2.4.1x. 2.6 is vulnerable to it again. This patch makes the feature usable and enables the feature as well, which is definitely a fix as far as an end user is concerned (google was the user in this case). - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: usbmon, usb core, ARM
On Thursday 20 January 2005 11:35 am, Pete Zaitcev wrote: > On Wed, 19 Jan 2005 09:08:34 -0800, David Brownell <[EMAIL PROTECTED]> wrote: > I do not like to refer to a dev because I do not quite understand where > the necessary usb_dev_get/_put are now. But if you guarantee that the > urb->dev is refcounted properly while urb is processed by > usb_hcd_giveback_urb, > I do not mind an extra indirection. We have no reason to suspect bugs there; if there were any, lots of things would have been breaking for a long time now. > What would be the right test in usb_hcd_giveback_urb, then? > It looks to me that you want me to use this: > > urb_is_for_root_hub(urb) { Actually it'd be more like dev_is_root_hub(dev, bus), since both values are readily at hand -- you're basically just wanting to wrap "dev == hcd->self.root_hub" in most cases. Though I'm still not clear why you'd want to change that working code; nothing's broken now, after all. By the way ... on the topic of usbmon rather than changing usbcore, is there a brief writeup of what you want this new version to be doing -- and how? Like, why put the spy hooks in that location, rather than any of the other choices. (Many of them would be less surprising to me!) - Dave > return urb->dev == urb->dev->bus->hcpriv->self.root_hub; > } > > This is just ... ew. Can we use pipe for now or do you have > a better idea? > > -- Pete > - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
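A minimal sketch of the helper suggested in this message, wrapping the "dev == root hub" comparison so callers do not spell out the indirection. The mock_usb_bus and mock_usb_device structs are hypothetical stand-ins with only the fields needed here; the real usbcore structures carry far more and are not reproduced.

```c
#include <assert.h>
#include <stddef.h>

struct mock_usb_device;

/* Hypothetical stand-in for the bus: only the root hub pointer. */
struct mock_usb_bus {
    struct mock_usb_device *root_hub;
};

/* Hypothetical stand-in for a device: only the back pointer to its bus. */
struct mock_usb_device {
    struct mock_usb_bus *bus;
};

/* Returns 1 when dev is the root hub of bus, 0 otherwise. */
int dev_is_root_hub(const struct mock_usb_device *dev,
                    const struct mock_usb_bus *bus)
{
    assert(bus != NULL);
    return dev == bus->root_hub;
}
```

Taking both dev and bus as arguments follows the remark that both values are already at hand in the caller, which avoids chasing urb->dev->bus inside the helper.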
Re: writeback-highmem
Andrea Arcangeli <[EMAIL PROTECTED]> wrote: > > This needed highmem fix from Rik is still missing too, so please apply > along the other 5 (it's orthogonal so you can apply this one in any > order you want). > > From: Rik van Riel <[EMAIL PROTECTED]> > Subject: [PATCH][1/2] adjust dirty threshold for lowmem-only mappings I've held off on this one because the recent throttling fix should have helped this problem. Has anyone confirmed that this patch still actually fixes something? If so, what was the scenario? Thanks.
Re: OOM fixes 2/5
Andrea Arcangeli <[EMAIL PROTECTED]> wrote: > > This is the forward port to 2.6 of the lowmem_reserved algorithm I > invented in 2.4.1*, merged in 2.4.2x already and needed to fix workloads > like google (especially without swap) on x86 with >1G of ram, but it's > needed in all sort of workloads with lots of ram on x86, it's also > needed on x86-64 for dma allocations. This brings 2.6 in sync with > latest 2.4.2x. But this patch doesn't change anything at all in the page allocation path apart from renaming lots of things, does it? AFAICT all it does is to change the default values in the protection map. It does it via a simplification, which is nice, but I can't see how it fixes anything. Confused.
Re: 2.6.11-rc1-mm1
OK, I finally come around to answering this ... Roman Zippel wrote: > Sorry, you missunderstood me. At the moment I'm only secondarily > interested in the API details, primarily I want to work out the details of > what exactly relayfs/ltt are supposed to do. One main question here I > can't answer yet, why you insist on multiple relayfs modes. I should have avoided earlier confusing the use of a certain type of relayfs channel for a given purpose (i.e. LTT should not necessarily depend on the managed mode.) I believe that there is a need for more than one mode in relayfs independently of LTT. There are users who want to be able to manage the data in a buffer (by manage I mean: receive notification of important buffer events, be able to insert important data at boundaries, etc.), and there are users who just want to dump as much information as possible in as fast a way as possible without having to deal with non-essential codepaths. > This is what I basically have in mind for the relay_write function: > > cpu = get_cpu(); > buffer = relay_get_buffer(chan, cpu); > while(1) { > offset = local_add_return(buffer->offset, length); > if (likely(offset + length <= buffer->size)) > break; > buffer = relay_switch_buffer(chan, buffer, offset); > } > memcpy(buffer->data + offset, data, length); > put_cpu(); looking at this code: 1) get_cpu() and put_cpu() won't do. You need to outright disable interrupts because you may be called from an interrupt handler. 2) You assume that relayfs creates one buffer per cpu for each channel. We think this is wrong. Relayfs should not need to care about the number of CPUs, it's the clients' responsibility to create as many channels as they see fit, whether it be one channel per CPU or 10 channels per CPU or 1 channel per interrupt, etc. 
3) I'm unclear about the need for local_add_return(), why not just: if (likely(buffer->offset + length <= buffer->size)) In any case, here's what we do in relay_write(): write_pos = relay_reserve(rchan, count, &reserve_code, &interrupting); If there's any buffer switching required, that will be done in relay_reserve(). This has the added advantage that clients that want to write directly to the buffer without using relay_write() can do so by calling relay_reserve() and not care about required buffer switching. 4) After securing the area, you simply go ahead and do a memcpy() and leave. We think that this is insufficient. Here's what we do: if (likely(write_pos != NULL)) { relay_write_direct(write_pos, data_ptr, count); relay_commit(rchan, write_pos, count, reserve_code, interrupting); *wrote_pos = write_pos; The relay_write_direct() is basically a memcpy(). We also do a relay_commit(). This actually effects the delivery of the event. If, for example, there had been a buffer switch at the previous relay_reserve(), then this call to relay_commit() will generate a call to the client's deliver() callback function. In the case of LTT, for example, this is how it knows that it's got to notify the user-space daemon that there are buffers to consume (i.e. write to disk.) > ltt_log_event should only be a few lines more (for writing header and > event data). Actually no, you don't want ltt_log_event using relay_write(), for one thing because it can generate variable size events. Instead, ltt_log_event does (basically): data_size = sizeof(event_id) + sizeof(time_delta) + sizeof(data_size); relay_lock_channel(); relay_reserve(); relay_write_direct(&event_id, sizeof(event_id)); relay_write_direct(&time_delta, sizeof(time_delta)); if (var_data) { relay_write_direct(var_data, var_data_len); data_size += var_data_len; } relay_write_direct(&data_size, sizeof(data_size)); relay_commit(); relay_unlock_channel(); > What I'd like to know now are the reasons why you need more than this. 
I hope the above explanation clarifies things. > It's not the amount of data and any timing requirements have to be done by > the caller. During processing you either take the events in the order they > were recorded (often that's good enough) or you sort them which is not > that difficult. Ordering is a non-issue to be honest. Unless you've got some hardware scope in there, it's almost impossible to pinpoint exactly when an event occurred. There is no single line of code where an event occurs, so it's all an educated guess anyway. You want things to resemble what really happened in as much as possible though. > I know you don't want to touch the topic of kernel debugging, but its > requirements greatly overlap with what you want to do with ltt, e.g. one > needs very often information about scheduling events as many kernel > processes rely more and more on kernel threads. The only real requirement > for kernel debugging
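As a rough illustration of the reserve / write-direct / commit sequence described in this message, here is a single-threaded user-space sketch. Everything in it is assumed for the example: the mock_* names, the fixed sub-buffer sizes, the wrap-around policy and the one-byte header fields are invented, and the channel locking the message mentions is left out entirely. It is not the relayfs implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SUBBUF_SIZE 16   /* assumed sub-buffer size, bytes */
#define N_SUBBUFS   4    /* assumed number of sub-buffers */

struct mock_chan {
    char data[N_SUBBUFS][SUBBUF_SIZE];
    size_t cur;       /* index of the sub-buffer being written */
    size_t offset;    /* bytes used in the current sub-buffer */
    int switched;     /* set by reserve when it moved to a new sub-buffer */
    int delivered;    /* number of deliver notifications so far */
};

/* Reserve len bytes, switching sub-buffers when the event does not fit.
 * Returns NULL if the event can never fit in one sub-buffer. */
char *mock_reserve(struct mock_chan *c, size_t len)
{
    char *pos;

    assert(c != NULL);
    if (len > SUBBUF_SIZE)
        return NULL;
    if (c->offset + len > SUBBUF_SIZE) {
        c->cur = (c->cur + 1) % N_SUBBUFS;  /* wrap-around is an assumption */
        c->offset = 0;
        c->switched = 1;
    }
    pos = &c->data[c->cur][c->offset];
    c->offset += len;
    return pos;
}

/* Copy into reserved space and return the next write position. */
char *mock_write_direct(char *pos, const void *src, size_t len)
{
    memcpy(pos, src, len);
    return pos + len;
}

/* Commit: if the preceding reserve switched buffers, notify the client. */
void mock_commit(struct mock_chan *c)
{
    if (c->switched) {
        c->delivered++;
        c->switched = 0;
    }
}

/* Log one variable-size event: id byte, optional payload, size byte. */
int mock_log_event(struct mock_chan *c, unsigned char id,
                   const void *var, size_t var_len)
{
    size_t total = sizeof(id) + var_len + sizeof(unsigned char);
    unsigned char size;
    char *pos;

    if (total > SUBBUF_SIZE)
        return -1;
    size = (unsigned char)total;
    pos = mock_reserve(c, total);
    if (pos == NULL)
        return -1;
    pos = mock_write_direct(pos, &id, sizeof(id));
    if (var_len)
        pos = mock_write_direct(pos, var, var_len);
    mock_write_direct(pos, &size, sizeof(size));
    mock_commit(c);
    return 0;
}
```

Reserving the whole event in one call and writing the pieces afterwards is what lets a variable-size event stay contiguous, and doing the notification in the commit step keeps the buffer-switch bookkeeping out of the copy path.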
kernel panic with 2.4.26
Hi. Every now and then (maybe twice a week) my server panics. This is a dual Xeon system with 5Gb memory. I did my best to get the full oops from the screen and doublechecked. Sorry, but I don't understand anything from the ksymoops output. Any help will be appreciated. ksymoops 2.4.5 on i686 2.4.26-msi1. Options used -V (default) -k /proc/ksyms (default) -l /proc/modules (default) -o /lib/modules/2.4.26-msi1/ (default) -m System.map-2.4.26-msi1.nogood (specified) f893281d *pde = Oops: 0002 CPU:0 EIP:0010:[]Not tainted Using defaults from ksymoops -t elf32-i386 -a i386 EFLAGS: 00010256 eax: fffc43fc ebx: 0002 ecx: f703b000 edx: 000d esi: f187d000 edi: ebp: f7005c1c esp: c0353ed4 ds: 0018 es: 0018 ss: 0018 Process swapper (pid: 0, stackpage=c0353000) Stack: f7040d00 f778a480 0040 f7005c00 f703b000 00040d00 f7007e80 f8921982 f7040d00 f7030b08 f778a480 0002 f703b200 f8921a9c f778a480 f7040d08 f77dc680 Call Trace:[] [] [] [] [] [] [] [] [] [] [] [] Code: 88 08 8b 86 58 01 00 00 ff 86 5c 01 00 00 88 10 ff 86 58 01 >>EIP; f893281d <_end+3851dc61/385fa444> <= >>eax; fffc43fc >>ecx; f703b000 <_end+36c26444/385fa444> >>esi; f187d000 <_end+31468444/385fa444> >>ebp; f7005c1c <_end+36bf1060/385fa444> >>esp; c0353ed4 Trace; f8921982 <_end+3850cdc6/385fa444> Trace; f8921a9c <_end+3850cee0/385fa444> Trace; c010a041 Trace; c010a236 Trace; c0106d60 Trace; c0106d60 Trace; c0106d60 Trace; c0106d60 Trace; c0106d89 Trace; c0106df2 Trace; c0105000 <_stext+0/0> Trace; c010504f Code; f893281d <_end+3851dc61/385fa444> <_EIP>: Code; f893281d <_end+3851dc61/385fa444> <= 0: 88 08 mov%cl,(%eax) <= Code; f893281f <_end+3851dc63/385fa444> 2: 8b 86 58 01 00 00 mov0x158(%esi),%eax Code; f8932825 <_end+3851dc69/385fa444> 8: ff 86 5c 01 00 00 incl 0x15c(%esi) Code; f893282b <_end+3851dc6f/385fa444> e: 88 10 mov%dl,(%eax) Code; f893282d <_end+3851dc71/385fa444> 10: ff 86 58 01 00 00 incl 0x158(%esi) <0>Kernel panic: Aiee, killing interrupt handler! Could you please help me out? 
klaus
Re: [PATCH] compat ioctl security hook fixup
On Thu, Jan 20, 2005 at 09:51:03PM -0800, Chris Wright wrote: > > If you add it make at least sure it's not EXPORT_SYMBOL()ed. > > It's certainly not, nor intended to be. Would a comment to that > effect alleviate your concern? Yes please. -Andi
Re: Radeon framebuffer weirdness in -mm2
On Thu, Jan 20, 2005 at 08:07:11PM -0800, Andrew Morton wrote: > Andrew Morton <[EMAIL PROTECTED]> wrote: > > > > Next suspects would be: > > > > +cleanup-vc-array-access.patch > > +remove-console_macrosh.patch > > +merge-vt_struct-into-vc_data.patch > > > > > > Make that: > > +cleanup-vc-array-access.patch > +remove-console_macrosh.patch > +merge-vt_struct-into-vc_data.patch > +vgacon-fixes-to-help-font-restauration-in-x11.patch It's something in this batch. Which is good, as I'd be a bit disappointed if the "vt leakage" were somehow attributable to the fb layer. More bisection after dinner. -- Mathematics is the supreme nostalgia of our time.
writeback-highmem
This needed highmem fix from Rik is still missing too, so please apply along the other 5 (it's orthogonal so you can apply this one in any order you want). From: Rik van Riel <[EMAIL PROTECTED]> Subject: [PATCH][1/2] adjust dirty threshold for lowmem-only mappings Simply running "dd if=/dev/zero of=/dev/hd" will result in OOM kills, with the dirty pagecache completely filling up lowmem. This patch is part 1 to fixing that problem. This patch effectively lowers the dirty limit for mappings which cannot be cached in highmem, counting the dirty limit as a percentage of lowmem instead. This should prevent heavy block device writers from pushing the VM over the edge and triggering OOM kills. Signed-off-by: Rik van Riel <[EMAIL PROTECTED]> Acked-by: Andrea Arcangeli <[EMAIL PROTECTED]> --- x/mm/page-writeback.c.orig 2005-01-04 01:13:30.0 +0100 +++ x/mm/page-writeback.c 2005-01-04 02:41:29.573177184 +0100 @@ -133,7 +133,8 @@ static void get_writeback_state(struct w * clamping level. */ static void -get_dirty_limits(struct writeback_state *wbs, long *pbackground, long *pdirty) +get_dirty_limits(struct writeback_state *wbs, long *pbackground, long *pdirty, +struct address_space *mapping) { int background_ratio; /* Percentages */ int dirty_ratio; @@ -141,10 +142,20 @@ get_dirty_limits(struct writeback_state long background; long dirty; struct task_struct *tsk; + unsigned long available_memory = total_pages; get_writeback_state(wbs); - unmapped_ratio = 100 - (wbs->nr_mapped * 100) / total_pages; +#ifdef CONFIG_HIGHMEM + /* +* In some cases we can only allocate from low memory, +* so we exclude high memory from our count. 
+*/ + if (mapping && !(mapping_gfp_mask(mapping) & __GFP_HIGHMEM)) + available_memory -= totalhigh_pages; +#endif + + unmapped_ratio = 100 - (wbs->nr_mapped * 100) / available_memory; dirty_ratio = vm_dirty_ratio; if (dirty_ratio > unmapped_ratio / 2) @@ -194,7 +205,7 @@ static void balance_dirty_pages(struct a .nr_to_write= write_chunk, }; - get_dirty_limits(&wbs, &background_thresh, &dirty_thresh); + get_dirty_limits(&wbs, &background_thresh, &dirty_thresh, mapping); nr_reclaimable = wbs.nr_dirty + wbs.nr_unstable; if (nr_reclaimable + wbs.nr_writeback <= dirty_thresh) break; @@ -210,7 +221,7 @@ static void balance_dirty_pages(struct a if (nr_reclaimable) { writeback_inodes(); get_dirty_limits(&wbs, &background_thresh, - &dirty_thresh); + &dirty_thresh, mapping); nr_reclaimable = wbs.nr_dirty + wbs.nr_unstable; if (nr_reclaimable + wbs.nr_writeback <= dirty_thresh) break; @@ -296,7 +307,7 @@ static void background_writeout(unsigned long background_thresh; long dirty_thresh; - get_dirty_limits(&wbs, &background_thresh, &dirty_thresh); + get_dirty_limits(&wbs, &background_thresh, &dirty_thresh, NULL); if (wbs.nr_dirty + wbs.nr_unstable < background_thresh && min_pages <= 0) break;
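To make the effect of the patch description concrete, here is a small arithmetic sketch of a dirty limit counted against lowmem only. It is a simplification under stated assumptions: the function names are invented, the unmapped_ratio clamp from the patch is left out, and the page counts and the 40% ratio used in the note below are illustrative figures rather than values taken from this thread.

```c
#include <assert.h>

/* Pages the dirty limit is measured against. lowmem_only models a mapping
 * whose gfp mask lacks __GFP_HIGHMEM (for example a block device). */
unsigned long available_pages(unsigned long total_pages,
                              unsigned long totalhigh_pages,
                              int lowmem_only)
{
    assert(totalhigh_pages <= total_pages);
    return lowmem_only ? total_pages - totalhigh_pages : total_pages;
}

/* Dirty threshold in pages: dirty_ratio percent of the available pages.
 * Simplification: no unmapped_ratio clamp, unlike the patch. */
unsigned long dirty_limit_pages(unsigned long total_pages,
                                unsigned long totalhigh_pages,
                                unsigned int dirty_ratio,
                                int lowmem_only)
{
    unsigned long avail = available_pages(total_pages, totalhigh_pages,
                                          lowmem_only);

    return (dirty_ratio * avail) / 100;
}
```

With 1048576 total pages (4 GiB of 4 KiB pages), 819200 of them highmem, and a 40% ratio, the unadjusted limit is 419430 pages, which is more than the 229376 lowmem pages that exist; counted against lowmem it drops to 91750 pages. That gap is the situation the patch description says lets a block device writer fill lowmem with dirty pagecache.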
Re: [PATCH] compat ioctl security hook fixup
* Andi Kleen ([EMAIL PROTECTED]) wrote: > I'm not sure really adding vfs_ioctl is a good idea politically. > I predict we'll see drivers starting to use it, which will cause quite > broken design. Yes, that'd be quite broken. I didn't have the same expectation. > If you add it make at least sure it's not EXPORT_SYMBOL()ed. It's certainly not, nor intended to be. Would a comment to that effect alleviate your concern? thanks, -chris
OOM fixes 5/5
From: Andrea Arcangeli <[EMAIL PROTECTED]> Subject: Convert the unsafe signed (16bit) used_math to a safe and optimal PF_USED_MATH On Sat, Dec 25, 2004 at 04:24:30AM +0100, Andrea Arcangeli wrote: > Here it is the first part. This makes memdie a TIF_MEMDIE. It's And here is the final incremental part converting ->used_math to PF_USED_MATH. I might have broken arm, see the very first change in the patch to asm-offsets.c, rest looks ok at first glance. If you want used_math to return 0 or 1 (instead of 0 or PF_USED_MATH), just s/!!// in the below patch and place !! in sched.h::*used_math() accordingly after applying the patch, it should work just fine. Using !! only when necessary as the below is optimal. Signed-off-by: Andrea Arcangeli <[EMAIL PROTECTED]> --- mainline-5/arch/arm26/kernel/asm-offsets.c.orig 2003-07-17 01:52:38.0 +0200 +++ mainline-5/arch/arm26/kernel/asm-offsets.c 2005-01-21 06:20:01.999885640 +0100 @@ -42,7 +42,6 @@ int main(void) { - DEFINE(TSK_USED_MATH,offsetof(struct task_struct, used_math)); DEFINE(TSK_ACTIVE_MM,offsetof(struct task_struct, active_mm)); BLANK(); DEFINE(VMA_VM_MM,offsetof(struct vm_area_struct, vm_mm)); --- mainline-5/arch/arm26/kernel/process.c.orig 2005-01-15 20:44:48.0 +0100 +++ mainline-5/arch/arm26/kernel/process.c 2005-01-21 06:20:02.013883512 +0100 @@ -271,7 +271,7 @@ void flush_thread(void) memset(>thread.debug, 0, sizeof(struct debug_info)); memset(>fpstate, 0, sizeof(union fp_state)); - current->used_math = 0; + clear_used_math(); } void release_thread(struct task_struct *dead_task) @@ -305,7 +305,7 @@ copy_thread(int nr, unsigned long clone_ int dump_fpu (struct pt_regs *regs, struct user_fp *fp) { struct thread_info *thread = current_thread_info(); - int used_math = current->used_math; + int used_math = !!used_math(); if (used_math) memcpy(fp, >fpstate.soft, sizeof (*fp)); --- mainline-5/arch/arm26/kernel/ptrace.c.orig 2005-01-04 01:13:09.0 +0100 +++ mainline-5/arch/arm26/kernel/ptrace.c 2005-01-21 06:20:02.018882752 
+0100 @@ -540,7 +540,7 @@ static int ptrace_getfpregs(struct task_ */ static int ptrace_setfpregs(struct task_struct *tsk, void *ufp) { - tsk->used_math = 1; + set_stopped_child_used_math(tsk); return copy_from_user(>thread_info->fpstate, ufp, sizeof(struct user_fp)) ? -EFAULT : 0; } --- mainline-5/arch/i386/kernel/cpu/common.c.orig 2005-01-15 20:44:49.0 +0100 +++ mainline-5/arch/i386/kernel/cpu/common.c2005-01-21 06:20:02.027881384 +0100 @@ -629,6 +629,6 @@ void __init cpu_init (void) * Force FPU initialization: */ current_thread_info()->status = 0; - current->used_math = 0; + clear_used_math(); mxcsr_feature_mask_init(); } --- mainline-5/arch/i386/kernel/i387.c.orig 2005-01-20 18:20:09.0 +0100 +++ mainline-5/arch/i386/kernel/i387.c 2005-01-21 06:20:02.040879408 +0100 @@ -60,7 +60,8 @@ void init_fpu(struct task_struct *tsk) tsk->thread.i387.fsave.twd = 0xu; tsk->thread.i387.fsave.fos = 0xu; } - tsk->used_math = 1; + /* only the device not available exception or ptrace can call init_fpu */ + set_stopped_child_used_math(tsk); } /* @@ -331,13 +332,13 @@ static int save_i387_fxsave( struct _fps int save_i387( struct _fpstate __user *buf ) { - if ( !current->used_math ) + if ( !used_math() ) return 0; /* This will cause a "finit" to be triggered by the next * attempted FPU operation by the 'current' process. 
*/ - current->used_math = 0; + clear_used_math(); if ( HAVE_HWFP ) { if ( cpu_has_fxsr ) { @@ -383,7 +384,7 @@ int restore_i387( struct _fpstate __user } else { err = restore_i387_soft( >thread.i387.soft, buf ); } - current->used_math = 1; + set_used_math(); return err; } @@ -507,7 +508,7 @@ int dump_fpu( struct pt_regs *regs, stru int fpvalid; struct task_struct *tsk = current; - fpvalid = tsk->used_math; + fpvalid = !!used_math(); if ( fpvalid ) { unlazy_fpu( tsk ); if ( cpu_has_fxsr ) { @@ -522,7 +523,7 @@ int dump_fpu( struct pt_regs *regs, stru int dump_task_fpu(struct task_struct *tsk, struct user_i387_struct *fpu) { - int fpvalid = tsk->used_math; + int fpvalid = !!tsk_used_math(tsk); if (fpvalid) { if (tsk == current) @@ -537,7 +538,7 @@ int dump_task_fpu(struct task_struct *ts int dump_task_extended_fpu(struct task_struct *tsk, struct user_fxsr_struct *fpu) { - int fpvalid = tsk->used_math && cpu_has_fxsr; + int fpvalid = tsk_used_math(tsk) &&
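The remark about `!!` above can be illustrated with a small user-space sketch. It is not the sched.h code from the patch (that hunk is not shown in this mail); the flag value and every name ending in `_sketch` are assumptions made up for illustration. The point is only that a flag test returns 0 or the flag bit, and `!!` is needed only where a caller wants exactly 0 or 1.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag value, chosen only for this sketch. */
#define PF_USED_MATH_SKETCH 0x2000UL

/* Stand-in for the part of task_struct that carries the flag word. */
struct task_sketch {
	unsigned long flags;
};

/* Returns 0 or PF_USED_MATH_SKETCH, not 0 or 1. */
static unsigned long tsk_used_math_sketch(const struct task_sketch *t)
{
	assert(t != NULL);
	return t->flags & PF_USED_MATH_SKETCH;
}

static void set_used_math_sketch(struct task_sketch *t)
{
	assert(t != NULL);
	t->flags |= PF_USED_MATH_SKETCH;
}

static void clear_used_math_sketch(struct task_sketch *t)
{
	assert(t != NULL);
	t->flags &= ~PF_USED_MATH_SKETCH;
}

/* A caller that stores the result in an int "valid" field normalises with !!. */
static int fpvalid_sketch(const struct task_sketch *t)
{
	return !!tsk_used_math_sketch(t);
}
```

In a plain truth test such as `if (!used_math())`, or on the left of `&&`, the raw value is enough, which is presumably why the patch adds `!!` only at the assignments.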
OOM fixes 4/5
From: Andrea Arcangeli <[EMAIL PROTECTED]> Subject: convert memdie to an atomic thread bitflag On Sat, Dec 25, 2004 at 03:27:21AM +0100, Andrea Arcangeli wrote: > So my current plan is to make used_math a PF_USED_MATH, and memdie a > TIF_MEMDIE. And of course oomtaskadj an int (that one requires more than This makes memdie a TIF_MEMDIE. memdie will not be modified by the current task, so it cannot be a PF_MEMDIE but it must be a TIF_MEMDIE. Signed-off-by: Andrea Arcangeli <[EMAIL PROTECTED]> --- mainline-4/include/asm-alpha/thread_info.h.orig 2004-12-04 08:55:03.0 +0100 +++ mainline-4/include/asm-alpha/thread_info.h 2005-01-21 06:17:24.780786576 +0100 @@ -77,6 +77,7 @@ register struct thread_info *__current_t #define TIF_UAC_NOPRINT6 /* see sysinfo.h */ #define TIF_UAC_NOFIX 7 #define TIF_UAC_SIGBUS 8 +#define TIF_MEMDIE 9 #define _TIF_SYSCALL_TRACE (1flags & PF_EXITING)) && !(p->flags & PF_DEAD)) + if ((unlikely(test_tsk_thread_flag(p, TIF_MEMDIE)) || (p->flags & PF_EXITING)) && + !(p->flags & PF_DEAD)) return ERR_PTR(-1UL); if (p->flags & PF_SWAPOFF) return p; @@ -196,7 +197,7 @@ static void __oom_kill_task(task_t *p) * exit() and clear out its resources quickly... */ p->time_slice = HZ; - p->memdie = 1; + set_tsk_thread_flag(p, TIF_MEMDIE); /* This process has hardware access, be more careful. */ if (cap_t(p->cap_effective) & CAP_TO_MASK(CAP_SYS_RAWIO)) { --- mainline-4/mm/page_alloc.c.orig 2005-01-21 06:09:43.068977440 +0100 +++ mainline-4/mm/page_alloc.c 2005-01-21 06:17:24.996753744 +0100 @@ -756,7 +756,7 @@ __alloc_pages(unsigned int gfp_mask, uns } /* This allocation should allow future memory freeing. 
 */ - if (((p->flags & PF_MEMALLOC) || p->memdie) && !in_interrupt()) { + if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE))) && !in_interrupt()) { /* go through the zonelist yet again, ignoring mins */ for (i = 0; (z = zones[i]) != NULL; i++) { page = buffered_rmqueue(z, order, gfp_mask);
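The reasoning above (a flag set by a task other than its owner has to be updated atomically) can be sketched in user space with C11 atomics. This is not the kernel's `set_tsk_thread_flag()` implementation; the struct and the `_sketch` helpers are hypothetical, and only the bit number 9 is taken from the alpha hunk above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Bit number as in the alpha hunk above; other architectures may differ. */
#define TIF_MEMDIE_SKETCH 9

/* Stand-in for thread_info: one atomically updated flag word. */
struct thread_info_sketch {
	atomic_ulong flags;
};

/*
 * Atomic read-modify-write: safe even when the caller is not the task
 * that owns the flag word, because no concurrent update can be lost.
 */
static void set_thread_flag_sketch(struct thread_info_sketch *ti, int bit)
{
	assert(ti != NULL);
	atomic_fetch_or(&ti->flags, 1UL << bit);
}

static bool test_thread_flag_sketch(struct thread_info_sketch *ti, int bit)
{
	assert(ti != NULL);
	return (atomic_load(&ti->flags) >> bit) & 1UL;
}
```

A plain `flags |= bit` is a non-atomic load, modify, store sequence; if the owning task and the OOM killer both did that on the same word, one of the two updates could be overwritten.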
OOM fixes 3/5
From: Andrea Arcangeli <[EMAIL PROTECTED]> Subject: fix several oom killer bugs, most important avoid spurious oom kills badness algorithm tweaked by Thomas Gleixner to deal with fork bombs This is the core of the oom-killer fixes I developed, partly taking from Thomas's patches the idea of getting feedback from the exit path, plus I moved the oom killer into page_alloc.c, where it should be in order to check the watermarks before killing more stuff. This also tweaks the badness to take thread bombs more into account (that change to badness is from Thomas; for my part I'd rather rewrite badness from scratch instead, but that's an orthogonal issue ;). With this applied the oom killer is very sane, no more 5 sec waits and spurious oom kills. Signed-off-by: Andrea Arcangeli <[EMAIL PROTECTED]> --- mainline-2/include/linux/sched.h2005-01-20 18:27:45.0 +0100 +++ mainline-3/include/linux/sched.h2005-01-21 06:01:08.585190864 +0100 @@ -615,6 +615,11 @@ struct task_struct { struct key *thread_keyring; /* keyring private to this thread */ #endif /* + * All archs should support atomic ops with + * 1 byte granularity. + */ + unsigned char memdie; +/* * Must be changed atomically so it shouldn't be * be a shareable bitflag. 
*/ @@ -736,8 +741,7 @@ do { if (atomic_dec_and_test(&(tsk)->usa #define PF_DUMPCORE0x0200 /* dumped core */ #define PF_SIGNALED0x0400 /* killed by a signal */ #define PF_MEMALLOC0x0800 /* Allocating memory */ -#define PF_MEMDIE 0x1000 /* Killed for out-of-memory */ -#define PF_FLUSHER 0x2000 /* responsible for disk writeback */ +#define PF_FLUSHER 0x1000 /* responsible for disk writeback */ #define PF_FREEZE 0x4000 /* this task is being frozen for suspend now */ #define PF_NOFREEZE0x8000 /* this thread should not be frozen */ --- mainline-2/mm/oom_kill.c2005-01-20 18:26:30.0 +0100 +++ mainline-3/mm/oom_kill.c2005-01-21 06:14:00.290873768 +0100 @@ -45,18 +45,30 @@ unsigned long badness(struct task_struct *p, unsigned long uptime) { unsigned long points, cpu_time, run_time, s; + struct list_head *tsk; if (!p->mm) return 0; - if (p->flags & PF_MEMDIE) - return 0; /* * The memory size of the process is the basis for the badness. */ points = p->mm->total_vm; /* +* Processes which fork a lot of child processes are likely +* a good choice. We add the vmsize of the childs if they +* have an own mm. This prevents forking servers to flood the +* machine with an endless amount of childs +*/ + list_for_each(tsk, >children) { + struct task_struct *chld; + chld = list_entry(tsk, struct task_struct, sibling); + if (chld->mm != p->mm && chld->mm) + points += chld->mm->total_vm; + } + + /* * CPU time is in tens of seconds and run time is in thousands * of seconds. There is no particular reason for this other than * that it turned out to work very well in practice. @@ -132,14 +144,24 @@ static struct task_struct * select_bad_p do_posix_clock_monotonic_gettime(); do_each_thread(g, p) - if (p->pid) { - unsigned long points = badness(p, uptime.tv_sec); - if (points > maxpoints) { + /* skip the init task with pid == 1 */ + if (p->pid > 1) { + unsigned long points; + + /* +* This is in the process of releasing memory so wait it +* to finish before killing some other task by mistake. 
+*/ + if ((p->memdie || (p->flags & PF_EXITING)) && !(p->flags & PF_DEAD)) + return ERR_PTR(-1UL); + if (p->flags & PF_SWAPOFF) + return p; + + points = badness(p, uptime.tv_sec); + if (points > maxpoints || !chosen) { chosen = p; maxpoints = points; } - if (p->flags & PF_SWAPOFF) - return p; } while_each_thread(g, p); return chosen; @@ -152,6 +174,12 @@ static struct task_struct * select_bad_p */ static void __oom_kill_task(task_t *p) { + if (p->pid == 1) { + WARN_ON(1); + printk(KERN_WARNING "tried to kill init!\n"); + return; + } + task_lock(p); if (!p->mm || p->mm == _mm) { WARN_ON(1); @@ -168,7 +196,7 @@ static void __oom_kill_task(task_t *p) * exit() and clear out its resources
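The child-accounting step added to badness() above can be sketched outside the kernel as follows. The types here are hypothetical stand-ins (an array of children instead of the kernel's list_head walk), and only the first part of the score is modelled; the CPU-time and run-time adjustments that follow in the real function are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for mm_struct: only the field the score uses. */
struct mm_sketch {
	unsigned long total_vm;
};

/* Stand-in for task_struct with an array of children. */
struct oom_task {
	struct mm_sketch *mm;       /* NULL for kernel threads */
	struct oom_task **children;
	size_t nr_children;
};

/*
 * Base score: the task's own vm size plus the vm size of every child
 * that has an mm of its own. Children sharing the parent's mm (threads)
 * are not counted twice.
 */
static unsigned long badness_base(const struct oom_task *p)
{
	unsigned long points;
	size_t i;

	assert(p != NULL);
	if (!p->mm)
		return 0;

	points = p->mm->total_vm;
	for (i = 0; i < p->nr_children; i++) {
		const struct oom_task *chld = p->children[i];

		if (chld->mm && chld->mm != p->mm)
			points += chld->mm->total_vm;
	}
	return points;
}
```

With a parent of 100 pages, one forked child of 50 pages, one thread sharing the parent's mm and one child without an mm, the base score is 150.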
OOM fixes 2/5
From: Andrea Arcangeli <[EMAIL PROTECTED]> Subject: keep balance between different classzones This is the forward port to 2.6 of the lowmem_reserved algorithm I invented in 2.4.1*, merged in 2.4.2x already and needed to fix workloads like google (especially without swap) on x86 with >1G of ram, but it's needed in all sort of workloads with lots of ram on x86, it's also needed on x86-64 for dma allocations. This brings 2.6 in sync with latest 2.4.2x. Signed-off-by: Andrea Arcangeli <[EMAIL PROTECTED]> --- mainline-2/include/linux/mmzone.h.orig 2005-01-15 20:45:00.0 +0100 +++ mainline-2/include/linux/mmzone.h 2005-01-21 05:55:28.644869648 +0100 @@ -112,18 +112,14 @@ struct zone { unsigned long free_pages; unsigned long pages_min, pages_low, pages_high; /* -* protection[] is a pre-calculated number of extra pages that must be -* available in a zone in order for __alloc_pages() to allocate memory -* from the zone. i.e., for a GFP_KERNEL alloc of "order" there must -* be "(1< --- mainline-2/include/linux/sysctl.h.orig 2005-01-15 20:45:00.0 +0100 +++ mainline-2/include/linux/sysctl.h 2005-01-21 05:55:28.646869344 +0100 @@ -160,7 +160,7 @@ enum VM_PAGEBUF=17, /* struct: Control pagebuf parameters */ VM_HUGETLB_PAGES=18,/* int: Number of available Huge Pages */ VM_SWAPPINESS=19, /* Tendency to steal mapped memory */ - VM_LOWER_ZONE_PROTECTION=20,/* Amount of protection of lower zones */ + VM_LOWMEM_RESERVE_RATIO=20,/* reservation ratio for lower memory zones */ VM_MIN_FREE_KBYTES=21, /* Minimum free kilobytes to maintain */ VM_MAX_MAP_COUNT=22,/* int: Maximum number of mmaps/address-space */ VM_LAPTOP_MODE=23, /* vm laptop mode */ --- mainline-2/kernel/sysctl.c.orig 2005-01-15 20:45:00.0 +0100 +++ mainline-2/kernel/sysctl.c 2005-01-21 05:55:28.648869040 +0100 @@ -61,7 +61,6 @@ extern int core_uses_pid; extern char core_pattern[]; extern int cad_pid; extern int pid_max; -extern int sysctl_lower_zone_protection; extern int min_free_kbytes; extern int 
printk_ratelimit_jiffies; extern int printk_ratelimit_burst; @@ -745,14 +744,13 @@ static ctl_table vm_table[] = { }, #endif { - .ctl_name = VM_LOWER_ZONE_PROTECTION, - .procname = "lower_zone_protection", - .data = _lower_zone_protection, - .maxlen = sizeof(sysctl_lower_zone_protection), + .ctl_name = VM_LOWMEM_RESERVE_RATIO, + .procname = "lowmem_reserve_ratio", + .data = _lowmem_reserve_ratio, + .maxlen = sizeof(sysctl_lowmem_reserve_ratio), .mode = 0644, - .proc_handler = _zone_protection_sysctl_handler, + .proc_handler = _reserve_ratio_sysctl_handler, .strategy = _intvec, - .extra1 = , }, { .ctl_name = VM_MIN_FREE_KBYTES, --- mainline-2/mm/page_alloc.c.orig 2005-01-15 20:45:00.0 +0100 +++ mainline-2/mm/page_alloc.c 2005-01-21 05:58:53.338751448 +0100 @@ -44,7 +44,15 @@ struct pglist_data *pgdat_list; unsigned long totalram_pages; unsigned long totalhigh_pages; long nr_swap_pages; -int sysctl_lower_zone_protection = 0; +/* + * results with 256, 32 in the lowmem_reserve sysctl: + * 1G machine -> (16M dma, 800M-16M normal, 1G-800M high) + * 1G machine -> (16M dma, 784M normal, 224M high) + * NORMAL allocation will leave 784M/256 of ram reserved in the ZONE_DMA + * HIGHMEM allocation will leave 224M/32 of ram reserved in ZONE_NORMAL + * HIGHMEM allocation will (224M+784M)/256 of ram reserved in ZONE_DMA + */ +int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES-1] = { 256, 32 }; EXPORT_SYMBOL(totalram_pages); EXPORT_SYMBOL(nr_swap_pages); @@ -654,7 +662,7 @@ buffered_rmqueue(struct zone *zone, int * of the allocation. 
*/ int zone_watermark_ok(struct zone *z, int order, unsigned long mark, - int alloc_type, int can_try_harder, int gfp_high) + int classzone_idx, int can_try_harder, int gfp_high) { /* free_pages my go negative - that's OK */ long min = mark, free_pages = z->free_pages - (1 << order) + 1; @@ -665,7 +673,7 @@ int zone_watermark_ok(struct zone *z, in if (can_try_harder) min -= min / 4; - if (free_pages <= min + z->protection[alloc_type]) + if (free_pages <= min + z->lowmem_reserve[classzone_idx]) return 0; for (o = 0; o < order; o++) { /* At the next order, this order's pages become unavailable */ @@ -682,19 +690,6 @@ int zone_watermark_ok(struct zone *z, in /* * This is the 'heart' of the zoned buddy allocator. - * - * Herein lies the mysterious
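The arithmetic in the comment above sysctl_lowmem_reserve_ratio can be reproduced with a short user-space sketch. The loop structure, the names and the 4 KiB page size are assumptions for illustration; the patch's own setup function is not visible in this excerpt. The watermark helper models only the single comparison shown in the hunk, not the can_try_harder, gfp_high or per-order handling.

```c
#include <assert.h>
#include <stddef.h>

#define NR_ZONES 3
enum { ZONE_DMA_IDX, ZONE_NORMAL_IDX, ZONE_HIGHMEM_IDX };

/*
 * reserve[z][c] is the number of pages zone z holds back from an
 * allocation whose classzone is c. A zone reserves
 * (pages in all zones above it, up to and including c) / ratio[z].
 */
static void compute_lowmem_reserve(const unsigned long present[NR_ZONES],
				   const unsigned long ratio[NR_ZONES - 1],
				   unsigned long reserve[NR_ZONES][NR_ZONES])
{
	assert(present != NULL && ratio != NULL && reserve != NULL);

	for (int z = 0; z < NR_ZONES; z++)
		for (int c = 0; c < NR_ZONES; c++)
			reserve[z][c] = 0;

	for (int c = 1; c < NR_ZONES; c++) {
		unsigned long higher = 0;

		for (int z = c - 1; z >= 0; z--) {
			higher += present[z + 1];
			reserve[z][c] = higher / ratio[z];
		}
	}
}

/* Only the comparison from the hunk: fail when free <= min + reserve. */
static int watermark_ok_sketch(long free_pages, long min, unsigned long reserve)
{
	return free_pages > min + (long)reserve;
}
```

For the 1G example in the comment (16M dma, 784M normal, 224M high, 4 KiB pages) this gives 784 pages of ZONE_DMA reserved against NORMAL allocations (784M/256), 1792 pages of ZONE_NORMAL against HIGHMEM allocations (224M/32), and 1008 pages of ZONE_DMA against HIGHMEM allocations ((224M+784M)/256).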
OOM fixes 1/5
I'm sending 5 patches incremental with each other updated to the latest bk snapshot I could find on kernel.org [kernel cvs is still unusable for me, is it my mistake?] From: [EMAIL PROTECTED] Subject: protect-pids This is protect-pids, a patch to allow the admin to tune the oom killer. The tweak is inherited between parent and child so it's easy to write a wrapper for complex apps. I made used_math a char at the light of later patches. Current patch breaks alpha, but future patches will fix it. Signed-off-by: Andrea Arcangeli <[EMAIL PROTECTED]> --- mainline/fs/proc/base.c 2005-01-15 20:44:58.0 +0100 +++ mainline-1/fs/proc/base.c 2005-01-20 18:26:29.0 +0100 @@ -72,6 +72,8 @@ enum pid_directory_inos { PROC_TGID_ATTR_FSCREATE, #endif PROC_TGID_FD_DIR, + PROC_TGID_OOM_SCORE, + PROC_TGID_OOM_ADJUST, PROC_TID_INO, PROC_TID_STATUS, PROC_TID_MEM, @@ -98,6 +100,8 @@ enum pid_directory_inos { PROC_TID_ATTR_FSCREATE, #endif PROC_TID_FD_DIR = 0x8000, /* 0x8000-0x */ + PROC_TID_OOM_SCORE, + PROC_TID_OOM_ADJUST, }; struct pid_entry { @@ -133,6 +137,8 @@ static struct pid_entry tgid_base_stuff[ #ifdef CONFIG_SCHEDSTATS E(PROC_TGID_SCHEDSTAT, "schedstat", S_IFREG|S_IRUGO), #endif + E(PROC_TGID_OOM_SCORE, "oom_score",S_IFREG|S_IRUGO), + E(PROC_TGID_OOM_ADJUST,"oom_adj", S_IFREG|S_IRUGO|S_IWUSR), {0,0,NULL,0} }; static struct pid_entry tid_base_stuff[] = { @@ -158,6 +164,8 @@ static struct pid_entry tid_base_stuff[] #ifdef CONFIG_SCHEDSTATS E(PROC_TID_SCHEDSTAT, "schedstat",S_IFREG|S_IRUGO), #endif + E(PROC_TID_OOM_SCORE, "oom_score",S_IFREG|S_IRUGO), + E(PROC_TID_OOM_ADJUST, "oom_adj", S_IFREG|S_IRUGO|S_IWUSR), {0,0,NULL,0} }; @@ -384,6 +392,18 @@ static int proc_pid_schedstat(struct tas } #endif +/* The badness from the OOM killer */ +unsigned long badness(struct task_struct *p, unsigned long uptime); +static int proc_oom_score(struct task_struct *task, char *buffer) +{ + unsigned long points; + struct timespec uptime; + + do_posix_clock_monotonic_gettime(); + points = 
badness(task, uptime.tv_sec); + return sprintf(buffer, "%lu\n", points); +} + // /* Here the fs part begins*/ // @@ -657,6 +677,55 @@ static struct file_operations proc_mem_o .open = mem_open, }; +static ssize_t oom_adjust_read(struct file * file, char * buf, + size_t count, loff_t *ppos) +{ + struct task_struct *task = proc_task(file->f_dentry->d_inode); + char buffer[8]; + size_t len; + int oom_adjust = task->oomkilladj; + + len = sprintf(buffer, "%i\n", oom_adjust) + 1; + if (*ppos >= len) + return 0; + if (count > len-*ppos) + count = len-*ppos; + if (copy_to_user(buf, buffer + *ppos, count)) + return -EFAULT; + *ppos += count; + return count; +} + +static ssize_t oom_adjust_write(struct file * file, const char * buf, + size_t count, loff_t *ppos) +{ + struct task_struct *task = proc_task(file->f_dentry->d_inode); + char buffer[8], *end; + int oom_adjust; + + if (!capable(CAP_SYS_RESOURCE)) + return -EPERM; + memset(buffer, 0, 8); + if (count > 6) + count = 6; + if (copy_from_user(buffer, buf, count)) + return -EFAULT; + oom_adjust = simple_strtol(buffer, , 0); + if (oom_adjust < -16 || oom_adjust > 15) + return -EINVAL; + if (*end == '\n') + end++; + task->oomkilladj = oom_adjust; + if (end - buffer == 0) + return -EIO; + return end - buffer; +} + +static struct file_operations proc_oom_adjust_operations = { + read: oom_adjust_read, + write: oom_adjust_write, +}; + static struct inode_operations proc_mem_inode_operations = { .permission = proc_permission, }; @@ -1336,6 +1405,15 @@ static struct dentry *proc_pident_lookup ei->op.proc_read = proc_pid_schedstat; break; #endif + case PROC_TID_OOM_SCORE: + case PROC_TGID_OOM_SCORE: + inode->i_fop = _info_file_operations; + ei->op.proc_read = proc_oom_score; + break; + case PROC_TID_OOM_ADJUST: + case PROC_TGID_OOM_ADJUST: + inode->i_fop = _oom_adjust_operations; + break; default: printk("procfs: impossible type (%d)",p->type);
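The parsing done by oom_adjust_write() above can be sketched in user space. The function name is hypothetical, and the sketch deliberately differs from the patch in one respect: it rejects input with no digits before storing anything, whereas the patch stores the value first and then returns -EIO. The range of -16 to 15 and the 6-byte cap are taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse a value written to an oom_adj-like file.
 * Returns 0 and stores the value in *out, or a negative errno.
 */
static int parse_oom_adjust(const char *buf, size_t count, int *out)
{
	char tmp[8];
	char *end;
	long val;

	assert(buf != NULL && out != NULL);

	/* Copy at most 6 bytes so tmp is always NUL-terminated. */
	memset(tmp, 0, sizeof(tmp));
	if (count > 6)
		count = 6;
	memcpy(tmp, buf, count);

	/* Base 0: accepts decimal, 0x-prefixed hex and 0-prefixed octal. */
	val = strtol(tmp, &end, 0);
	if (end == tmp)
		return -EIO;		/* nothing parsed */
	if (val < -16 || val > 15)
		return -EINVAL;		/* outside the range the patch allows */

	*out = (int)val;
	return 0;
}
```

From a shell, the interface added by the patch would be used by writing such a value to the oom_adj file of a process; the write handler requires CAP_SYS_RESOURCE.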
System calls effect after booting phase ??
--- [EMAIL PROTECTED] wrote:
> Possibility 1:
> Load them from an initrd image while booting. If you're already
> using an initrd, and this is "early enough", you just need to put the
> module into the initrd, and make sure the /linuxrc or whatever script
> does an insmod for it. This has the advantage of working for
> out-of-tree modules.

I am now using an initrd image. How can I load my module from it? In which file should I insert the corresponding line? Can you tell me more about how to do this?

I am using kernel 2.4.28. Do I have to recompile the whole kernel again?

Thanks,
selva
Re: [PATCH] compat ioctl security hook fixup
On Thu, Jan 20, 2005 at 05:26:56PM -0800, Chris Wright wrote:
> * Michael S. Tsirkin ([EMAIL PROTECTED]) wrote:
> > Security hook seems to be missing before compat_ioctl in mm2.
> > And, it would be nice to avoid calling it twice on some paths.
> >
> > Chris Wright's patch addressed this in the most elegant way I think,
> > by adding vfs_ioctl.
>
> The patch below is against Linus' tree as per Andrew's request. It will
> conflict with some of the changes in -mm2 (including the some-fixes bit
> from Andi, and LTT). I also have a patch directly against -mm2 if anyone
> would like to see that instead.

I'm not sure adding vfs_ioctl is really a good idea politically. I predict we'll see drivers starting to use it, which will lead to quite broken designs. If you add it, at least make sure it's not EXPORT_SYMBOL()ed.

-Andi
Re: Radeon framebuffer weirdness in -mm2
Andrew Morton <[EMAIL PROTECTED]> wrote:
>
> Next suspects would be:
>
> +cleanup-vc-array-access.patch
> +remove-console_macrosh.patch
> +merge-vt_struct-into-vc_data.patch
>

Make that:

+cleanup-vc-array-access.patch
+remove-console_macrosh.patch
+merge-vt_struct-into-vc_data.patch
+vgacon-fixes-to-help-font-restauration-in-x11.patch

and the fbdev updates, maybe:

+radeonfb-set-accelerator-id.patch
+vesafb-change-return-error-id.patch
+intelfb-workaround-for-830m.patch
+fbcon-save-blank-state-last.patch
+backlight-fix-compile-error-if-config_fb-is-unset.patch
+matroxfb-fb_matrox_g-kconfig-changes.patch
Re: Radeon framebuffer weirdness in -mm2
Matt Mackall <[EMAIL PROTECTED]> wrote:
>
> Here are the symptoms:
>
> mm2: corruption of Tux logo at boot, corruption of display at
> powerdown, lockup and LCD blooming on next warm boot when radeonfb
> starts. Ben suggested I try some radeonfb options, but none seemed to
> have any effect.
>
> mm1: no observed problems
>
> mm2 - above patches: corruption still occurs but no lockup on next
> warm boot.

So we have multiple bugs?

Next suspects would be:

+cleanup-vc-array-access.patch
+remove-console_macrosh.patch
+merge-vt_struct-into-vc_data.patch
Re: Radeon framebuffer weirdness in -mm2
On Thu, Jan 20, 2005 at 04:01:23PM -0800, Andrew Morton wrote:
> Matt Mackall <[EMAIL PROTECTED]> wrote:
> >
> > > Which radeon driver? CONFIG_FB_RADEON_OLD or CONFIG_FB_RADEON?
> >
> > FB_RADEON.
>
> Ah, OK. Likely culprits are
>
> radeonfb-massive-update-of-pm-code.patch
> radeonfb-build-fix.patch

Ok, learned a few things. Here are the symptoms:

mm2: corruption of Tux logo at boot, corruption of display at powerdown, lockup and LCD blooming on next warm boot when radeonfb starts. Ben suggested I try some radeonfb options, but none seemed to have any effect.

mm1: no observed problems

mm2 - above patches: corruption still occurs but no lockup on next warm boot.

I think I have a lead on the logo and shutdown corruption: If I do a reboot(8) from inside X, I get switched to vt 0, but the shutdown messages come out on vt 7, where X was running. As I'm sitting on vt 0 during shutdown, I see character cells changed to something like "_" (last two scanlines filled) slowly marching down the screen corresponding to the shutdown messages.

So the logo corruption is probably getty popping up on the other vts at the end of init. The timing and the screen placement seem to agree.

Photos for the curious (be sure to see "executioner Tux" glitch):

http://selenic.com/radeon

--
Mathematics is the supreme nostalgia of our time.
Re: [PATCH]: fix the bug of __free_pages() of mm/page_alloc.c
--- linux-2.6.10.orig/mm/page_alloc.c 2004-12-25 05:33:51.0 +0800
+++ linux-2.6.10/mm/page_alloc.c 2005-01-21 11:46:58.0 +0800
@@ -788,7 +788,22 @@
 
 fastcall void __free_pages(struct page *page, unsigned int order)
 {
-	if (!PageReserved(page) && put_page_testzero(page)) {
+	if (!PageReserved(page)) {
+#ifdef CONFIG_MMU
+		if (!put_page_testzero(page))
+			return;
+#else
+		int i, result = 1;
+
+		/*
+		 * We need to de-reference all the pages for this order -- see set_page_refs()
+		 */
+		for (i = 0; i < (1 << order); i++)
+			result &= put_page_testzero(page+i);
+		if (!result)
+			BUG();
+#endif /* CONFIG_MMU */
+
 		if (order == 0)
 			free_hot_page(page);
 		else

On Fri, 21 Jan 2005 11:40:52 +0800, zhan rongkai <[EMAIL PROTECTED]> wrote:
> --- linux-2.6.10.orig/mm/page_alloc.c 2004-12-25 05:33:51.0 +0800
> +++ linux-2.6.10/mm/page_alloc.c 2005-01-21 11:43:44.0 +0800
> @@ -788,7 +788,22 @@
>
>  fastcall void __free_pages(struct page *page, unsigned int order)
>  {
> -	if (!PageReserved(page) && put_page_testzero(page)) {
> +	if (!PageReserved(page)) {
> +#ifdef CONFIG_MMU
> +		if (!put_page_testzero(page))
> +			return;
> +#else
> +		int i, result = 1;
> +
> +		/*
> +		 * We need to de-reference all the pages for this order -- see set_page_refs()
> +		 */
> +for (i = 0; i < (1 << order); i++)
> +result &= put_page_testzero(page+i);
> +if (!result)
> +BUG();
> +#endif /* CONFIG_MMU */
> +
>  		if (order == 0)
>  			free_hot_page(page);
>  		else
>
> --
> Rongkai Zhan

--
Rongkai Zhan
Re: [PATCH]: fix the bug of __free_pages() of mm/page_alloc.c
--- linux-2.6.10.orig/mm/page_alloc.c 2004-12-25 05:33:51.0 +0800
+++ linux-2.6.10/mm/page_alloc.c 2005-01-21 11:43:44.0 +0800
@@ -788,7 +788,22 @@
 
 fastcall void __free_pages(struct page *page, unsigned int order)
 {
-	if (!PageReserved(page) && put_page_testzero(page)) {
+	if (!PageReserved(page)) {
+#ifdef CONFIG_MMU
+		if (!put_page_testzero(page))
+			return;
+#else
+		int i, result = 1;
+
+		/*
+		 * We need to de-reference all the pages for this order -- see set_page_refs()
+		 */
+for (i = 0; i < (1 << order); i++)
+result &= put_page_testzero(page+i);
+if (!result)
+BUG();
+#endif /* CONFIG_MMU */
+
 		if (order == 0)
 			free_hot_page(page);
 		else

--
Rongkai Zhan
Re: [PATCH]: fix the bug of __free_pages() of mm/page_alloc.c
On Thu, 20 Jan 2005 14:31:34 +, Russell King <[EMAIL PROTECTED]> wrote:
> On Thu, Jan 20, 2005 at 09:34:17PM +0800, zhan rongkai wrote:
> > [PATCH]: fix the bug of __free_pages() of mm/page_alloc.c
> > =
> >
> > The buddy allocator's __free_pages() function seems to be buggy.
> >
> > The following code is from kernel 2.6.10:
> >
> > fastcall void __free_pages(struct page *page, unsigned int order)
> > {
> > 	if (!PageReserved(page) && put_page_testzero(page)) {
> > 		if (order == 0)
> > 			free_hot_page(page);
> > 		else
> > 			__free_pages_ok(page, order);
> > 	}
> > }
> >
> > As you know, before truly freeing all pages, this function calls
> > put_page_testzero(page) to drop the refcount of the pages.
> >
> > But, in fact the macro put_page_testzero(page) **only** drops **one**
> > page's refcount.
> > Therefore, if (order > 0), the refcounts of (page+1) .. (page+(1<
> > This will cause __free_pages_ok() to dump stack, because it finds some
> > pages' page_count() are not zero!
>
> When you allocate a page with order > 0, the first 0-order page has a
> refcount of 1, and the remaining 0-order pages have a refcount of 0.

Thank you for pointing this out.

> If you're triggering this check, I suspect you're fiddling about with
> the individual pages (using get_page on them individually?) which is
> a no-no.
>
> --
> Russell King
>

Oh, I forgot to tell you that my CPU has no MMU, sorry :-)

Let's look at the function set_page_refs(), which is called by prep_new_page():

static inline void set_page_refs(struct page *page, int order)
{
#ifdef CONFIG_MMU
	set_page_count(page, 1);
#else
	int i;

	/*
	 * We need to reference all the pages for this order, otherwise if
	 * anyone accesses one of the pages with (get/put) it will be freed.
	 */
	for (i = 0; i < (1 << order); i++)
		set_page_count(page+i, 1);
#endif /* CONFIG_MMU */
}

We can see that it sets the refcount of every page to 1 when there is no MMU.

My previous patch is wrong. Here is a new one:

--- linux-2.6.10.orig/mm/page_alloc.c 2004-12-25 05:33:51.0 +0800
+++ linux-2.6.10/mm/page_alloc.c 2005-01-21 11:34:57.0 +0800
@@ -787,8 +787,23 @@
 }
 
 fastcall void __free_pages(struct page *page, unsigned int order)
-{
-	if (!PageReserved(page) && put_page_testzero(page)) {
+{
+	if (!PageReserved(page)) {
+#ifdef CONFIG_MMU
+		if (!put_page_testzero(page))
+			return;
+#else
+		int i, result = 1;
+
+		/*
+		 * We need to de-reference all the pages for this order -- see set_page_refs()
+		 */
+for (i = 0; i < (1 << order); i++)
+result &= put_page_testzero(page);
+if (!result)
+BUG();
+#endif /* CONFIG_MMU */
+
 		if (order == 0)
 			free_hot_page(page);
 		else

--
Rongkai Zhan
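The reference counting discussed above can be modelled in user space. This is a sketch only: `struct page_sketch` and the helpers are hypothetical stand-ins, not the kernel's struct page or its atomic counters. It mirrors the no-MMU branch of set_page_refs() quoted above, where every page of a high-order block starts with a count of 1, so freeing has to drop one reference on each of them.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for struct page: only a plain reference count. */
struct page_sketch {
	int count;
};

/* Mirrors the !CONFIG_MMU branch of set_page_refs(): every page gets 1. */
static void set_page_refs_nommu(struct page_sketch *page, unsigned int order)
{
	assert(page != NULL);
	for (unsigned int i = 0; i < (1u << order); i++)
		page[i].count = 1;
}

/* Drop one reference and report whether the count reached zero. */
static int put_page_testzero_sketch(struct page_sketch *page)
{
	return --page->count == 0;
}

/*
 * Drop one reference on every page of the block. Returns 1 only if all
 * counts reached zero, i.e. the block may be handed back to the allocator.
 */
static int put_all_pages_nommu(struct page_sketch *page, unsigned int order)
{
	int all_zero = 1;

	assert(page != NULL);
	for (unsigned int i = 0; i < (1u << order); i++)
		all_zero &= put_page_testzero_sketch(page + i);
	return all_zero;
}
```

In this model the loop has to index `page + i`; dropping the reference on `page` alone in every iteration would drive the first counter negative and leave the others at 1.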
Re: PROBLEM: possible memleak in 2.6.11-rc1
Lennert Van Alboom <[EMAIL PROTECTED]> wrote:
>
> Possible memleak in 2.6.11-rc1?

Please wait for it to happen again and then send the contents of /proc/meminfo and /proc/slabinfo.
Re: Typo in [AGPGART] i915GM support patch
On Thu, Jan 20, 2005 at 05:46:22PM +0100, Marco Cipullo wrote:
> -	if (agp_bridge->dev->device == PCI_DEVICE_ID_INTEL_82915G_HB)
> +	if (agp_bridge->dev->device == PCI_DEVICE_ID_INTEL_82915G_HB ||
> +	    agp_bridge->dev->device == PCI_DEVICE_ID_INTEL_82915G_HB)
> 		gtt_entries = MB(48) - KB(size);
> 	else
> 		gtt_entries = 0;
> 	break;
>
> Perhaps it should be:
>
> @@ -415,14 +415,16 @@
> 	break;
> case I915_GMCH_GMS_STOLEN_48M:
> 	/* Check it's really I915G */
> -	if (agp_bridge->dev->device == PCI_DEVICE_ID_INTEL_82915G_HB)
> +	if (agp_bridge->dev->device == PCI_DEVICE_ID_INTEL_82915G_HB ||
> +	    agp_bridge->dev->device == PCI_DEVICE_ID_INTEL_82915GM_HB)
> 		gtt_entries = MB(48) - KB(size);
> 	else
> 		gtt_entries = 0;
> 	break;
>
> The same applies a few lines below

Duh, yes. Thanks. Fix sent to Linus.

Dave
Re: [RFC] tracepipe -- event streams, debugfs, and pipe_buffers
Zach Brown wrote:
> Only briefly. They've always seemed more involved than the sort of
> thing I was after. I'll try and sit down and investigate in more detail.

There's definitely an opportunity for interfacing here. If nothing else, this clearly shows the interest in the kind of things both relayfs and ltt attempt to achieve. So here are a few comments regarding the implementation and how this relates to the stuff I'm working on.

> While it's running the kernel subsystem can send binary blobs, less than
> the length of a page, down this channel. The blobs are copied into
> per-cpu lists of pages. Cutesy little headers with get_cycles() and the
> cpu id are prepended to each blob. The traces are only recorded if user
> space has open references to the file.

In the case of LTT, we just open one relay channel per cpu. This avoids having to write the CPUID to the trace, that's 2 bytes less per event, and also avoids any need for synchronization. As for get_cycles(), some architectures don't have anything useful to give. Here's for ARM (include/asm-arm/timex.h):

static inline cycles_t get_cycles (void)
{
	return 0;
}

In the case of LTT, we just use the, albeit expensive, do_gettimeofday when hardware counters aren't there (currently all non-x86 tracing does this, but this should be fixed.) Also, in the case of the x86 at least, we just write the lower 32-bits of the TSC, so that's 4 bytes less per event. Instead, we use the buffer_start and buffer_end callbacks provided by relayfs to write a header and footer containing full do_gettimeofday value and TSC value.

> As the pages fill they're kicked off to a work_struct worker who puts
> them in the bufs[] array in the debugfs pipe file. Userspace can then
> do whatever it wants with the data via the pipe. One can imagine it
> wanting to splice() these pages to disk in huge batches, or perhaps some
> zero-copy network card, etc. I've only tested this so far as verifying
> that 'cat' is able to push data into a regular file.

It seems to me that while this is a nice use of pipes, it isn't as fast as ram-locked pages. relayfs does the bttv driver magic (or what used to be done in there, I haven't checked what they do lately): we allocate pages, lock them into ram and remap them for use as a single memory area. No caching necessary. It goes from the buffer to whatever media you want (disk, network, etc.) IOW, user-space does an open(), mmap(), write(). Also, the channels exist whether user-space has done an open or not. That's good for flight-recording.

Looking at the code:

- tracepipe_event() does a get_cpu()/put_cpu() for protecting the writing to the buffer. What about tracing within an interrupt? local_irq_save()?

- I hadn't thought of doing something like this to write the header:
+ hdr = tcpu->next_region;
+ hdr->cycles = get_cycles();
+ hdr->cpu = cpu;
  I will replace some of the memcpy() code in LTT with something like this.

- From what I assume is a "wishlist":
+ * - actually communicate missed to userspace
  Already done in LTT.
+ * - how to specify wrapping or dropping
  relayfs provides RELAY_MODE_CONTINUOUS and RELAY_MODE_NO_OVERWRITE.
+ * - non-temporal stores into bufs
  The latest relayfs code doesn't care about timestamps. It's its clients' job to do that (ex. ltt).
+ * - let caller reserve space and get a pointer into buf
  This is the relevant relayfs function:
  char* relay_reserve(struct rchan *rchan, u32 len, int *err, int *interrupting)

Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [EMAIL PROTECTED] || 1-866-677-4546
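The space saving described above (logging only the lower 32 bits of the TSC per event, with a full timestamp in the buffer header) implies that a trace reader has to rebuild 64-bit timestamps. The sketch below shows one way a reader could do that. It is not taken from the LTT or relayfs sources; the function name is hypothetical, and it assumes that consecutive events in a buffer are less than 2^32 cycles apart.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Rebuild a full 64-bit cycle count from the previous full value (the
 * buffer-start header for the first event, the previous event after
 * that) and the 32 low bits logged with the current event.
 */
static uint64_t expand_cycles(uint64_t prev_full, uint32_t lo)
{
	/* Start from the previous upper half combined with the new lower half. */
	uint64_t full = (prev_full & ~0xffffffffULL) | lo;

	/* The low word wrapped since the previous value: carry into bit 32. */
	if (full < prev_full)
		full += 1ULL << 32;

	assert(full >= prev_full);
	return full;
}
```

A reader would seed `prev_full` with the full TSC from the buffer header and feed each event's 32-bit value through the function in order, carrying the result forward.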
RE: [ANNOUNCE][RFC] plugsched-2.0 patches ...
Hi Peter,

> I'm hoping that the CKRM folks will send me a patch to add their
> scheduler to plugsched :-)

They are planning to release a patch against 2.6.10. But their patch won't stand alone against 2.6.10, so it might be difficult for you to integrate their code into a scheduler for plugsched. Also, the CKRM scheduler only modifies Ingo's O(1) scheduler. It certainly would be interesting to have CKRM variants of the other schedulers. This points to a whole new level of 'plugsched' in that general O(1) schedulers need to support fair share plugins.

Marc
Re: [PATCH] PPC64: EEH Recovery
Linas Vepstas writes:

> > 2. I don't see why the device nodes for the PCI subtree being reset
> >    would go away, and thus I don't see the need for your eeh_cfg_tree
> >    struct.
>
> It's not the reset, it's the hot-plug remove. The hot plug code assumes
> that you are going to physically remove the device from the slot, so
> it removes the device_node as part of the "unconfig".

OK, I missed that. It seems a bit bogus to me. Could you point me at where in the code this happens?

> > 3. Is there a good reason why we can't use the assigned-addresses
> >    property on the relevant device tree nodes to tell us what to set
> >    the BARs to?
>
> Yes, the reason is that after a reset, that property doesn't hold any
> decent data. I discussed this with the firmware developers, and their
> response was that it is the kernel's responsibility to compute
> (or save/restore) such values. (Except for bridges, which they will do for
> us).

The property not holding any decent data is a consequence of the device nodes getting thrown away, isn't it? I fail to see how resetting the device can of itself affect our copy of the device tree.

> >    In particular I think it should be a
> >    userland write to a sysfs file that kicks off the restart process
> >    rather than it just happening after 5 seconds. Anyway, what
> >    process or thread is executing that 5 second sleep? Is it keventd
> >    or something?
>
> It's a workqueue.

Which gets run in keventd's context. In other words, no other workqueues will get run during the 5 second sleep, or at least not on that cpu.

Paul.
Re: [ANNOUNCE][RFC] plugsched-2.0 patches ...
Marc E. Fiuczynski wrote:

Peter, thank you for maintaining Con's plugsched code in light of Linus' and Ingo's prior objections to this idea. On the one hand, I partially agree with Linus's prior view that when there is only one scheduler, the rest of the world + dog will focus on making it better. On the other hand, having a clean framework that lets developers plug in new schedulers in a clean way is quite useful.

Linus & Ingo, it would be good to have an in-depth discussion on this topic. I'd argue that the Linux kernel NEEDS a clean pluggable scheduling framework.

Let me make a case for this NEED by example. Ingo's scheduler belongs to the egalitarian regime of schedulers that do a poor job of isolating workloads from each other in multiprogrammed environments such as those found on Enterprise servers and in my case on PlanetLab (www.planet-lab.org) nodes. This has been rectified by HP-UX, Solaris, and AIX through the use of fair share schedulers that use O(1) schedulers within a share. Currently PlanetLab uses a CKRM modified version of Ingo's scheduler.

I'm hoping that the CKRM folks will send me a patch to add their scheduler to plugsched :-)

Similarly, the linux-vserver project also modifies Ingo's scheduler to construct an entitlement based scheduling regime. These are not just variants of O(1) schedulers in the sense of Con's staircase O(1). Nor is it clear what the best type of scheduler is for these environments (i.e., HP-UX, Solaris and AIX don't have it fully solved yet either). The ability to dynamically swap out schedulers on a production system like PlanetLab would help in determining what type of scheduler is the most appropriate. This is because it is non-trivial, if not impossible, to recreate in a lab the multiprogrammed workloads that we see.

For these reasons, it would be useful for plugsched (or something like it) to make its way into the mainline kernel as a framework to plug in different schedulers. Alternatively, it would be useful to consider in what way Ingo's scheduler needs to support plugins such as the CKRM and Vserver types of changes.

Best regards,
Marc

--
Peter Williams [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
Re: [PATCH][RFC] swsusp: speed up image restoring on x86-64
On Thu, Jan 20, 2005 at 10:46:37PM +0100, Rafael J. Wysocki wrote:
> On Thursday, 20 of January 2005 21:59, Pavel Machek wrote:
>
> Sure, but I think it's there for a reason.
>
> > Anyway, this is likely to clash with hugang's work; I'd prefer this not to
> > be applied.
>
> I am aware of that, but you are not going to merge the hugang's patches soon,
> are you?
> If necessary, I can change the patch to work with his code (hugang, what do
> you think?).
>

I like this patch, and I have changed my code to use it. Please have a look; it passes in qemu x86_64. :)

The full patch can still be fetched from
 http://soulinfo.com/~hugang/swsusp/2005-1-21/

Here is only the x86_64 part.

--- 2.6.11-rc1-mm1/arch/x86_64/kernel/suspend_asm.S 2004-12-30 14:56:35.0 +0800
+++ 2.6.11-rc1-mm1-swsusp-x86_64/arch/x86_64/kernel/suspend_asm.S 2005-01-21 10:13:15.0 +0800
@@ -35,6 +35,7 @@ ENTRY(swsusp_arch_suspend)
         call swsusp_save
         ret
 
+        .section .data.nosave
 ENTRY(swsusp_arch_resume)
         /* set up cr3 */
         leaq    init_level4_pgt(%rip),%rax
@@ -49,43 +50,32 @@ ENTRY(swsusp_arch_resume)
         movq    %rcx, %cr3;
         movq    %rax, %cr4;  # turn PGE back on
 
-        movl    nr_copy_pages(%rip), %eax
-        xorl    %ecx, %ecx
-        movq    $0, %r10
-        testl   %eax, %eax
-        jz      done
-.L105:
-        xorl    %esi, %esi
-        movq    $0, %r11
-        jmp     .L104
-        .p2align 4,,7
-copy_one_page:
-        movq    %r10, %rcx
-.L104:
-        movq    pagedir_nosave(%rip), %rdx
-        movq    %rcx, %rax
-        salq    $5, %rax
-        movq    8(%rdx,%rax), %rcx
-        movq    (%rdx,%rax), %rax
-        movzbl  (%rsi,%rax), %eax
-        movb    %al, (%rsi,%rcx)
-
-        movq    %cr3, %rax;  # flush TLB
-        movq    %rax, %cr3;
-
-        movq    %r11, %rax
-        incq    %rax
-        cmpq    $4095, %rax
-        movq    %rax, %rsi
-        movq    %rax, %r11
-        jbe     copy_one_page
-        movq    %r10, %rax
-        incq    %rax
-        movq    %rax, %rcx
-        movq    %rax, %r10
-        mov     nr_copy_pages(%rip), %eax
-        cmpq    %rax, %rcx
-        jb      .L105
+        movq    pagedir_nosave(%rip), %rax
+        testq   %rax, %rax
+        je      done
+
+copyback_page:
+        movq    24(%rax), %r9
+        xorl    %r8d, %r8d
+
+copy_one_pgdir:
+        movq    8(%rax), %rdi
+        testq   %rdi, %rdi
+        je      done
+        movq    (%rax), %rsi
+        movq    $512, %rcx
+        rep
+        movsq
+
+        incq    %r8
+        addq    $32, %rax
+        cmpq    $127, %r8
+        jbe     copy_one_pgdir;  # copy one pgdir
+
+        testq   %r9, %r9
+        movq    %r9, %rax
+        jne     copyback_page
+
 done:
         movl    $24, %eax
         movl    %eax, %ds

--
Hu Gang .-. /v\ // \\ Linux User /( )\ [204016] GPG Key ID ^^-^^ http://soulinfo.com/~hugang/hugang.asc
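For readers who do not read x86-64 assembly, the new copy loop in the patch above can be rendered roughly as the following user-space C. This is a sketch under assumptions, not the swsusp source: the struct layout is inferred only from the offsets the assembly uses (0, 8 and 24, with 32 bytes per entry on x86-64), the field name pad is a placeholder for whatever lives at offset 16, and copy_back() is a made-up name.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE      4096
#define PBES_PER_PAGE  128      /* loop bound: cmpq $127, %r8 ; jbe */

/* Layout inferred from the offsets in the assembly (x86-64). */
struct pbe {
    void *address;        /* offset 0:  saved copy   (movq (%rax), %rsi)   */
    void *orig_address;   /* offset 8:  destination  (movq 8(%rax), %rdi)  */
    uintptr_t pad;        /* offset 16: not touched by the loop            */
    struct pbe *next;     /* offset 24: next pagedir (movq 24(%rax), %r9)  */
};

/* Walk a chain of pagedir pages, copying each saved page back to its
 * original location. Returns the number of pages copied. */
static unsigned long copy_back(struct pbe *pagedir)
{
    unsigned long copied = 0;

    while (pagedir) {                              /* testq/jne copyback_page */
        struct pbe *next_dir = pagedir[0].next;    /* read before copying */
        unsigned int i;

        for (i = 0; i < PBES_PER_PAGE; i++) {
            struct pbe *p = &pagedir[i];

            if (!p->orig_address)                  /* testq %rdi,%rdi ; je done */
                return copied;
            assert(p->address != NULL);
            /* rep movsq with %rcx = 512: 512 * 8 = 4096 bytes */
            memcpy(p->orig_address, p->address, PAGE_SIZE);
            copied++;
        }
        pagedir = next_dir;
    }
    return copied;
}
```

The point of the rewrite is visible here: one bulk copy per page instead of the old byte-at-a-time inner loop, and no cr3 reload inside the loop.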
Re: [RFC][PATCH] /proc/<pid>/rlimit
On Thu, Jan 20, 2005 at 03:43:58PM +0100, Pavel Machek wrote:
> It would be nice if you could make it "value-per-file". That way,
> it could become writable in future. If "max nice level" ever becomes rlimit,
> this would be very useful.

Agreed, though write support presents difficulties. My principal concern is that we don't want users changing resource limits of privileged processes. If we want an ordinary user to be allowed to change limits, the rules would have to be similar to those allowed for ptrace(), e.g., no-setuid processes, etc. [With ptrace(), one can of course attach to the process and invoke the setrlimit() syscall directly].

Additionally, sys_setrlimit() has an LSM hook:

 security_task_setrlimit(unsigned int resource, struct rlimit *)

One would need to take account of changing the limit from a different context.

It's a bit of a mess, and outside of the standard API; that's why I didn't bother.

Anyway, for Jan, here's my incomplete and unmergeable cut-n-paste hack to implement write on top of my previous patch. Format is as was suggested by Jan:

 <%u|unlimited> <%u|unlimited>

E.g.,

 echo memlock 65536 65536 > /proc/1/rlimit

Writing is limited to root (i.e. CAP_SYS_PTRACE), though see fs/proc/base.c:may_ptrace_attach() for an idea of how to change that.
-Bill --- linux-2.6.11-rc1-bk6/fs/proc/base.c.proc-pid-rlimit-write +++ linux-2.6.11-rc1-bk6/fs/proc/base.c @@ -23,6 +23,7 @@ #include #include #include +#include #include #include #include @@ -127,7 +128,7 @@ E(PROC_TGID_ROOT, "root",S_IFLNK|S_IRWXUGO), E(PROC_TGID_EXE, "exe", S_IFLNK|S_IRWXUGO), E(PROC_TGID_MOUNTS,"mounts", S_IFREG|S_IRUGO), - E(PROC_TGID_RLIMIT,"rlimit", S_IFREG|S_IRUGO), + E(PROC_TGID_RLIMIT,"rlimit", S_IFREG|S_IRUGO|S_IWUSR), #ifdef CONFIG_SECURITY E(PROC_TGID_ATTR, "attr",S_IFDIR|S_IRUGO|S_IXUGO), #endif @@ -153,7 +154,7 @@ E(PROC_TID_ROOT, "root",S_IFLNK|S_IRWXUGO), E(PROC_TID_EXE,"exe", S_IFLNK|S_IRWXUGO), E(PROC_TID_MOUNTS, "mounts", S_IFREG|S_IRUGO), - E(PROC_TID_RLIMIT, "rlimit", S_IFREG|S_IRUGO), + E(PROC_TID_RLIMIT, "rlimit", S_IFREG|S_IRUGO|S_IWUSR), #ifdef CONFIG_SECURITY E(PROC_TID_ATTR, "attr",S_IFDIR|S_IRUGO|S_IXUGO), #endif @@ -595,9 +596,99 @@ return single_release(inode, file); } +static inline char *skip_ws(char *s) +{ + while (isspace(*s)) + s++; + return s; +} + +static inline char *find_ws(char *s) +{ + while (!isspace(*s) && *s != '\0') + s++; + return s; +} + +#define MAX_RLIMIT_WRITE 79 +static ssize_t rlimit_write(struct file * file, const char * buf, + size_t count, loff_t *ppos) +{ + struct task_struct *task = proc_task(file->f_dentry->d_inode); + struct rlimit new_rlim, *old_rlim; + unsigned int i; + char *s, *t, kbuf[MAX_RLIMIT_WRITE+1]; + + /* changing resources limits can crash or subvert a process */ + if (!capable(CAP_SYS_PTRACE) || security_ptrace(current,task)) + return -ESRCH; + +if (count > MAX_RLIMIT_WRITE) +return -EINVAL; +if (copy_from_user(, buf, count)) +return -EFAULT; +kbuf[MAX_RLIMIT_WRITE] = '\0'; + + /* parse the resource id */ + s = skip_ws(kbuf); + t = find_ws(s); + if (*t == '\0') + return -EINVAL; + *t++ = '\0'; + for (i = 0 ; i < RLIM_NLIMITS ; i++) + if (rlim_name[i] && !strcmp(s,rlim_name[i])) + break; + if (i >= RLIM_NLIMITS) { + if (!strncmp(s, "rlimit-",7)) + s += 7; + if (sscanf(s, 
"%u", ) != 1 || i >= RLIM_NLIMITS) + return -EINVAL; + } + + /* parse the soft limit */ + s = skip_ws(t); + t = find_ws(s); + if (*t == '\0') + return -EINVAL; + *t++ = '\0'; + if (!strcmp(s, "unlimited")) + new_rlim.rlim_cur = RLIM_INFINITY; + else if (sscanf(s, "%lu", _rlim.rlim_cur) != 1) + return -EINVAL; + + /* parse the hard limit */ + s = skip_ws(t); + t = find_ws(s); + *t = '\0'; + if (!strcmp(s, "unlimited")) + new_rlim.rlim_max = RLIM_INFINITY; + else if (sscanf(s, "%lu", _rlim.rlim_max) != 1) + return -EINVAL; + + /* validate the values; copied from sys_setrlimit() */ + if (new_rlim.rlim_cur > new_rlim.rlim_max) + return -EINVAL; +old_rlim = task->signal->rlim + i; + if ((new_rlim.rlim_max > old_rlim->rlim_max) && + !capable(CAP_SYS_RESOURCE)) + return -EPERM; + if (i == RLIMIT_NOFILE && new_rlim.rlim_max > NR_OPEN) + return -EPERM; + + /*
Re: [PATCH] relayfs redux for 2.6.10: lean and mean
Karim Yaghmour wrote:
> Greg KH wrote:
>
> > Hm, how about this idea for cutting about 500 more lines from the code:
> >
> > Why not drop the "fs" part of relayfs and just make the code a set of
> > struct file_operations. That way you could have "relayfs-like" files in
> > any ram based file system that is being used. Then, a user could use
> > these fops and assorted interface to create debugfs or even procfs files
> > using this type of interface.
> >
> > As relayfs really is almost the same (conceptually wise) as debugfs as
> > far as concept of what kinds of files will be in there (nothing anyone
> > would ever rely on for normal operations, but for debugging only) this
> > keeps users and developers from having to spread their debugging and
> > instrumenting files across two different file systems.
>
> However this assumes that the users of relayfs are not going to want
> it during normal system operation. This is an assumption that fails
> with at least LTT as it is targeted at sysadmins, application developers
> and power users who need to be able to trace their systems at any time.
>
> I don't mind piggy-backing off another fs, if it makes sense, but unlike
> debugfs, relayfs is meant for general use, and all files in there are
> of the same type: relay channels for dumping huge amounts of data to
> user-space. It seems to me the target audience and basic idea (relay
> channels only in the fs) are different, but let me know if there's a
> compelling argument for doing this in another way without making it
> too confusing for users of those special "files" (IOW, when this starts
> being used in distros, it'll be more straightforward for users to
> understand if all files in a mounted fs behave a certain way than if
> they have certain "odd" files in certain directories, even if it's /proc.)

Perhaps the logical solution is to implement debugfs in terms of relayfs?

Peter
--
Peter Williams [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
Re: [PATCH] relayfs redux for 2.6.10: lean and mean
Greg KH wrote:
> Hm, how about this idea for cutting about 500 more lines from the code:
>
> Why not drop the "fs" part of relayfs and just make the code a set of
> struct file_operations. That way you could have "relayfs-like" files in
> any ram based file system that is being used. Then, a user could use
> these fops and assorted interface to create debugfs or even procfs files
> using this type of interface.
>
> As relayfs really is almost the same (conceptually wise) as debugfs as
> far as concept of what kinds of files will be in there (nothing anyone
> would ever rely on for normal operations, but for debugging only) this
> keeps users and developers from having to spread their debugging and
> instrumenting files across two different file systems.

However this assumes that the users of relayfs are not going to want it during normal system operation. This is an assumption that fails with at least LTT as it is targeted at sysadmins, application developers and power users who need to be able to trace their systems at any time.

I don't mind piggy-backing off another fs, if it makes sense, but unlike debugfs, relayfs is meant for general use, and all files in there are of the same type: relay channels for dumping huge amounts of data to user-space. It seems to me the target audience and basic idea (relay channels only in the fs) are different, but let me know if there's a compelling argument for doing this in another way without making it too confusing for users of those special "files" (IOW, when this starts being used in distros, it'll be more straightforward for users to understand if all files in a mounted fs behave a certain way than if they have certain "odd" files in certain directories, even if it's /proc.)

Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [EMAIL PROTECTED] || 1-866-677-4546
[PATCH] compat ioctl security hook fixup
* Michael S. Tsirkin ([EMAIL PROTECTED]) wrote:
> Security hook seems to be missing before compat_ioctl in mm2.
> And, it would be nice to avoid calling it twice on some paths.
>
> Chris Wright's patch addressed this in the most elegant way I think,
> by adding vfs_ioctl.

The patch below is against Linus' tree as per Andrew's request. It will conflict with some of the changes in -mm2 (including the some-fixes bit from Andi, and LTT). I also have a patch directly against -mm2 if anyone would like to see that instead.

thanks,
-chris
--

Introduce a simple helper, vfs_ioctl(), so that both sys_ioctl() and compat_sys_ioctl() call the security hook in all cases and without duplication.

Signed-off-by: Chris Wright <[EMAIL PROTECTED]>

= fs/ioctl.c 1.15 vs edited =
--- 1.15/fs/ioctl.c 2005-01-15 14:31:01 -08:00
+++ edited/fs/ioctl.c 2005-01-18 11:18:33 -08:00
@@ -77,21 +77,10 @@ static int file_ioctl(struct file *filp,
         return do_ioctl(filp, cmd, arg);
 }
 
-
-asmlinkage long sys_ioctl(unsigned int fd, unsigned int cmd, unsigned long arg)
+int vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd, unsigned long arg)
 {
-        struct file * filp;
         unsigned int flag;
-        int on, error = -EBADF;
-        int fput_needed;
-
-        filp = fget_light(fd, &fput_needed);
-        if (!filp)
-                goto out;
-
-        error = security_file_ioctl(filp, cmd, arg);
-        if (error)
-                goto out_fput;
+        int on, error = 0;
 
         switch (cmd) {
                 case FIOCLEX:
@@ -157,6 +146,24 @@ asmlinkage long sys_ioctl(unsigned int f
                         error = do_ioctl(filp, cmd, arg);
                         break;
         }
+        return error;
+}
+
+asmlinkage long sys_ioctl(unsigned int fd, unsigned int cmd, unsigned long arg)
+{
+        struct file * filp;
+        int error = -EBADF;
+        int fput_needed;
+
+        filp = fget_light(fd, &fput_needed);
+        if (!filp)
+                goto out;
+
+        error = security_file_ioctl(filp, cmd, arg);
+        if (error)
+                goto out_fput;
+
+        error = vfs_ioctl(filp, fd, cmd, arg);
  out_fput:
         fput_light(filp, fput_needed);
  out:
= fs/compat.c 1.48 vs edited =
--- 1.48/fs/compat.c 2005-01-15 14:31:01 -08:00
+++ edited/fs/compat.c 2005-01-18 11:07:56 -08:00
@@ -437,6 +437,11 @@ asmlinkage long compat_sys_ioctl(unsigne
         if (!filp)
                 goto out;
 
+        /* RED-PEN how should LSM module know it's handling 32bit? */
+        error = security_file_ioctl(filp, cmd, arg);
+        if (error)
+                goto out_fput;
+
         if (filp->f_op && filp->f_op->compat_ioctl) {
                 error = filp->f_op->compat_ioctl(filp, cmd, arg);
                 if (error != -ENOIOCTLCMD)
@@ -477,7 +482,7 @@ asmlinkage long compat_sys_ioctl(unsigne
         up_read(&ioctl32_sem);
 
  do_ioctl:
-        error = sys_ioctl(fd, cmd, arg);
+        error = vfs_ioctl(filp, fd, cmd, arg);
  out_fput:
         fput_light(filp, fput_needed);
  out:
= include/linux/fs.h 1.373 vs edited =
--- 1.373/include/linux/fs.h 2005-01-15 14:31:01 -08:00
+++ edited/include/linux/fs.h 2005-01-18 11:10:54 -08:00
@@ -1564,6 +1564,8 @@ extern int vfs_stat(char __user *, struc
 extern int vfs_lstat(char __user *, struct kstat *);
 extern int vfs_fstat(unsigned int, struct kstat *);
 
+extern int vfs_ioctl(struct file *, unsigned int, unsigned int, unsigned long);
+
 extern struct file_system_type *get_fs_type(const char *name);
 extern struct super_block *get_super(struct block_device *);
 extern struct super_block *user_get_super(dev_t);
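The shape of the change in the patch above (each entry point runs the security check exactly once, then both fall through to one shared helper that contains no check) can be modelled in a few lines of user-space C. Everything here is a stand-in invented for illustration: security_hook(), shared_ioctl(), native_entry() and compat_entry() are not kernel functions, and the return values are arbitrary.

```c
#include <assert.h>

static int hook_calls;   /* how many times the security check ran */

/* Stand-in for security_file_ioctl(): counts calls, vetoes command 13. */
static int security_hook(unsigned int cmd)
{
    hook_calls++;
    return cmd == 13 ? -1 : 0;
}

/* Stand-in for the shared helper (vfs_ioctl() in the patch): the common
 * work only, with no security check inside. */
static int shared_ioctl(unsigned int cmd)
{
    return (int)cmd * 2;
}

/* Native entry point: check once, then do the common work. */
static int native_entry(unsigned int cmd)
{
    int error = security_hook(cmd);

    if (error)
        return error;
    return shared_ioctl(cmd);
}

/* Compat entry point: check once up front, so the driver's compat handler
 * is covered too; the fallback goes to the shared helper, not back through
 * native_entry(), so the hook is not called a second time. */
static int compat_entry(unsigned int cmd, int has_compat_handler)
{
    int error = security_hook(cmd);

    if (error)
        return error;
    if (has_compat_handler)
        return (int)cmd + 1000;   /* pretend the driver handled it */
    return shared_ioctl(cmd);
}
```

In this model every path increments hook_calls by exactly one, which is the property the patch description asks for: the hook runs "in all cases and without duplication".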
Re: [patch] Job - inescapable job containers
> I'm totally not in a position to evaluate the completeness, desirability,
> interest-level, etc of this patch, I'm afraid. This is an opportunity for
> other stakeholders to weigh in..

Thanks Andrew!

First, Job can work as a standalone kernel module. The current implementation provides the inescapable job container. Job provides a globally unique Job ID (jid) to processes in a cluster environment. Job initiation on Linux is performed via a PAM session module with authentication and security checks. Root level processes, or those with the CAP_SYS_RESOURCE capability, can create new jobs or escape from a job.

Second, Job-based batch schedulers or resource limit tools can take advantage of the process control ability Job provides.

Third, Job provides a registration mechanism to various accounting modules for setting and getting job-based accounting information. CSA (Comprehensive System Accounting) is one example of the accounting modules (CSA code maintainer Jay Lan is currently on vacation; he will be back on Feb. 1).

We are pushing Job to the Linux kernel. If anybody has been using Job in open source software, please respond to show the desirability and interest-level for Job; we would also highly appreciate your suggestions on its completeness.

Thank you!
--Limin
Re: [patch 5/5] Disallow in-inode attributes for reserved inodes
On Friday 21 January 2005 00:05, Andreas Dilger wrote:
> [...]
> But as your patch stands it doesn't ever check if i_extra_isize is valid
> for the root or lost+found inode. It just always sets i_extra_isize = 0

(that's the in-memory i_extra_isize)

> and never uses it. Given that the root inode is fairly high-traffic it
> makes sense to use the faster EA space if it is available.

It's only a single block we're talking about, not all the overhead you run into with huge amounts of attributes in many xattr disk blocks. It sure would be much cleaner to use the root inode's in-inode space like with all other inodes, but performance-wise I don't think it matters.

> If these inodes have a BAD i_extra_isize it is OK to skip it, but I'm
> not so keen to have an ext3_error() there. If the user doesn't have an
> e2fsck with ea-in-inode support there isn't anything they can do to fix
> it and they will get a full e2fsck on each boot.

Agreed, that would be really bad. We should get e2fsck fixed ASAP.

> Even so, for the effort of setting i_extra_isize = 4 (or larger if we
> initialize the fixed fields) we can do the equivalent of what e2fsck will
> do when it finds a bogus value.

We cannot ask the user, and we don't have the kind of global view that e2fsck has. Something different may be messed up, and may have led to the corruption. It's unlikely, but not impossible.

Cheers,
--
Andreas Gruenbacher <[EMAIL PROTECTED]>
SUSE Labs, SUSE LINUX PRODUCTS GMBH
[PATCH] BUG in io_destroy (fs/aio.c:1248)
Hi all,

[Please cc me on any replies because I'm not subscribed to linux-aio or linux-kernel.]

I was running a random system call generator against mainline the other day and got this bug report about AIO in dmesg:

[ cut here ]
kernel BUG at fs/aio.c:1249!
invalid operand: [#1]
PREEMPT SMP
Modules linked in: 8250 serial_core isofs zlib_inflate ipt_limit iptable_mangle ipt_LOG ipt_MASQUERADE iptable_nat ipt_TOS ipt_REJECT ip_conntrack_irc ip_conntrack_ftp ipt_state ip_conntrack iptable_filter ip_tables snd_intel8x0 snd_ac97_codec snd_pcm snd_timer snd soundcore snd_page_alloc intel_agp agpgart evdev ehci_hcd uhci_hcd usbcore piix ext2 ide_generic ide_cd ide_core cdrom
CPU:0
EIP:0060:[]Not tainted VLI
EFLAGS: 00010286 (2.6.10-elm3a74)
EIP is at io_destroy+0xb1/0xce
eax: ebx: f6b2a300 ecx: edx: cfca1000
esi: d0da5e80 edi: f6b2a488 ebp: cfca1fa4 esp: cfca1f94
ds: 007b es: 007b ss: 0068
Process io_destroy (pid: 6610, threadinfo=cfca1000 task=f6cdca20)
Stack: 08048008 fff2 fff2 cfca1fbc c01702fc d0da5e80 0010 b7fd8c50 b3d4 cfca1000 c0102eb3 0010 08048008 080482fd b7fd8c50 b3d4 b3e8 00f5 007b 007b 00f5 b7f7c60d 0073
Call Trace:
 [] show_stack+0x7a/0x90
 [] show_registers+0x152/0x1ca
 [] die+0x100/0x184
 [] do_invalid_op+0xa3/0xad
 [] error_code+0x2b/0x30
 [] sys_io_setup+0x9a/0xa9
 [] syscall_call+0x7/0xb
Code: 1c 8b 06 85 c0 78 24 83 c4 04 5b 5e 5f 5d c3 8b 0a 85 c9 2e 74 b5 8b 46 10 89 02 eb ae 83 c4 04 89 f0 5b 5e 5f 5d e9 9b ee ff ff <0f> 0b e1 04 f5 f6 2b c0 eb d2 89 f0 e8 8a ee ff ff eb ab 0f 0b

This is a fairly run-of-the-mill P4 box with SCSI disks and a plain vanilla 2.6.10 kernel on Debian. I've written a test case that exposes this bug:

http://submarine.dyndns.org/~djwong/docs/io_destroy.c

The program takes as its only argument the address of a region of read-only memory.
The libc mmap is a pretty good place for this, so you can run the program thusly: $ ./io_destroy `cat /proc/$$/maps | grep libc- | grep 'r-' | \ awk -F "-" '{print $1}'` ...and watch the program segfault. If you can't find an address, 8048000 seems to work in most cases. I think I've found the cause of this bug. Each ioctx structure has a "users" field that acts as a reference counter for the ioctx, and a "dead" flag that seems to indicate that the ioctx isn't associated with any particular list of IO requests. The problem, then, lies in aio.c:1247. The io_destroy function checks the (old) value of the dead flag--if it's false (i.e. the ioctx is alive), then the function calls put_ioctx to decrease the reference count on the assumption that the ioctx is no longer associated with any requests. Later, it calls put_ioctx again, on the assumption that someone called lookup_ioctx to perform some operation at some point. This BUG is caused by the reference counts being off. The testcase that I provided looks for a chunk of user memory that's read-only and passes that to the sys_io_setup syscall. sys_io_setup checks that the pointer is readable, creates the ioctx and then tries to write the ioctx handle back to userland. This is where the problems start to surface. Since the pointer points to a non-writable region of memory, the write fails. The syscall handler then destroys the ioctx. The dead flag is zero, so io_destroy calls put_ioctx...but wait! Nobody ever put the ioctx into a request list. The ioctx is alive but not in a list, yet the io_destroy code assumes that being alive implies being in a request list somewhere. Hence, calling put_ioctx is bogus; the reference count becomes 0, and the ioctx is freed. Worse yet, put_ioctx is called again (on a freed pointer!) to clear up the lookup_ioctx that never happened. put_ioctx sees that the reference count has become negative and BUGs. 
The patch that I've provided calls aio_cancel_all before calling io_destroy in this failure case. aio_cancel_all sets ioctx->dead = 1 and cancels all requests (there shouldn't be any in this case) in progress. Since the dead flag is 1, io_destroy calls put_ioctx once to zero the reference count and free the ioctx, and thus the BUG condition doesn't get triggered. The userland program receives an error code instead of a segfault. This patch is against 2.6.10; the problem doesn't seem to be fixed in 2.6.11-rc1. A simpler version of this fix would simply say "ioctx->dead = 1;" (or even call "get_ioctx(ioctx);" to inflate the refcounts artificially), but as I'm not an AIO developer I don't want to be the one making that call. --Darrick - Signed-off-by: Darrick Wong <[EMAIL PROTECTED]> --- linux-2.6.10-a74/fs/aio.c 2004-12-24 13:34:44.0 -0800 +++ linux-2.6.10/fs/aio.c 2005-01-12 16:09:37.0 -0800 @@ -1285,6 +1285,7 @@ if (!ret) return 0; + aio_cancel_all(ioctx);
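The reference-count reasoning in the message above can be checked with a tiny user-space model. This is a sketch built only from the description in the message, not from fs/aio.c: the struct, the function names with the _model suffix, and the freed and bug flags are all invented for illustration, and the with_fix switch stands for the effect the message attributes to aio_cancel_all() (setting the dead flag before io_destroy runs).

```c
#include <assert.h>

struct ioctx_model {
    int users;   /* reference count */
    int dead;    /* set once teardown has started */
    int freed;   /* model only: set when users drops to 0 */
    int bug;     /* model only: set where the kernel would hit BUG() */
};

static void put_ioctx_model(struct ioctx_model *ctx)
{
    ctx->users--;
    if (ctx->users < 0)
        ctx->bug = 1;        /* put on an already-freed context */
    else if (ctx->users == 0)
        ctx->freed = 1;
}

static void io_destroy_model(struct ioctx_model *ctx)
{
    int was_dead = ctx->dead;

    ctx->dead = 1;
    if (!was_dead)
        put_ioctx_model(ctx);   /* assumes "alive" implies an extra reference */
    put_ioctx_model(ctx);       /* for the lookup the caller is assumed to have done */
}

/* Failure path of io_setup as described: the context exists with a single
 * reference and was never looked up, then gets destroyed. */
static struct ioctx_model setup_failure_model(int with_fix)
{
    struct ioctx_model ctx = { 1, 0, 0, 0 };

    if (with_fix)
        ctx.dead = 1;           /* what aio_cancel_all() does, per the message */
    io_destroy_model(&ctx);
    return ctx;
}
```

Without the fix the model performs two puts on a count of one and trips the bug flag; with the dead flag set first it performs a single put, the count reaches zero, and the context is freed cleanly.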
[PATCH] mips: fixed LTT build errors
This patch fixes LTT build errors on MIPS.

Yoichi

Signed-off-by: Yoichi Yuasa <[EMAIL PROTECTED]>

diff -urN -X dontdiff a-orig/arch/mips/kernel/irq.c a/arch/mips/kernel/irq.c
--- a-orig/arch/mips/kernel/irq.c Fri Jan 21 00:15:19 2005
+++ a/arch/mips/kernel/irq.c Fri Jan 21 08:17:31 2005
@@ -14,6 +14,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
diff -urN -X dontdiff a-orig/arch/mips/kernel/traps.c a/arch/mips/kernel/traps.c
--- a-orig/arch/mips/kernel/traps.c Fri Jan 21 00:15:19 2005
+++ a/arch/mips/kernel/traps.c Fri Jan 21 08:17:31 2005
@@ -13,6 +13,7 @@
  */
 #include
 #include
+#include
 #include
 #include
 #include
diff -urN -X dontdiff a-orig/arch/mips/mm/fault.c a/arch/mips/mm/fault.c
--- a-orig/arch/mips/mm/fault.c Fri Jan 21 00:15:19 2005
+++ a/arch/mips/mm/fault.c Fri Jan 21 08:17:31 2005
@@ -13,6 +13,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
Re: [PATCH]sched: Isochronous class v2 for unprivileged soft rt scheduling
Peter Chubb <[EMAIL PROTECTED]> writes:

>> "Jack" == Jack O'Quin <[EMAIL PROTECTED]> writes:
>
> Jack> Looks like we need to do another study to determine which
> Jack> filesystem works best for multi-track audio recording and
> Jack> playback. XFS looks promising, but only if they get the latency
> Jack> right. Any experience with that?
>
> The nice thing about audio/video and XFS is that if you know ahead of
> time the max size of a file (and you usually do -- because you know
> ahead of time how long a take is going to be) you can precreate the
> file as a contiguous chunk, then just fill it in, for minimum disc
> latency.

I am not talking about disk latency. The problem Con uncovered in ReiserFS was CPU hogging. Every 20 seconds there was a 6msec latency glitch in system response.
--
 joq
Re: linux capabilities ?
* jnf ([EMAIL PROTECTED]) wrote:
> I will read the paper before commenting on it further, however I cannot
> see what dangers it would really provide that a setuid program doesn't
> already have- other than the ability to give another non-root process root
> like abilities. However, the more I ponder it, it seems as if you could

It was a dangerous failure mode when a capability isn't present that hit sendmail.

> accomplish a lot of things with a set of ACL's and Capabilities (think
> compartmentalizing everything from each other where no one thing has full
> control of anything other than its particular subsystem).

Yes, that's the ideal. Unfortunately it doesn't work out quite so neatly ;-/

> > Since /proc/kmsg is 0400 you need CAP_DAC_READ_SEARCH (don't necessarily
> > need full override). Otherwise, you are right, you do need CAP_SYS_ADMIN.
> > Or just use syslog(2) directly, and you'll avoid the DAC requirement.
>
> Hrm, even a chmod of it didn't appear to really affect things?

Should, and it makes a difference for me.

thanks,
-chris
--
Linux Security Modules http://lsm.immunix.org http://lsm.bkbits.net
Re: [Linux-fbdev-devel] Re: Radeon framebuffer weirdness in -mm2
> > > > I'm seeing radeonfb on my ThinkPad T30 go weird on reboot (lots of
> > > > horizontal lines) and require powercycling to fix. Worked fine with
> > > > 2.6.10.
> > >
> > > Which radeon driver? CONFIG_FB_RADEON_OLD or CONFIG_FB_RADEON?
> >
> > FB_RADEON.
> >
> > > (cc Ben, who is the likely culprit ;)
> >
> > Btw, ajoshi's address from MAINTAINERS is bouncing.
>
> The file should be updated, I am the radeonfb maintainer now.

Speaking of which: should we nuke the old radeonfb driver?
Re: intel8x0 and 2.6.11-rc1
Hi Takashi,

The same applies to the IBM T40/T41/R50p machines I have tested so far. I had to disable "Headphone Jack Sense" and "Line Jack Sense" too.

So, what's the deal with these? What are they supposed to do? Should we report this as a bug on the alsa lists?

Thanks,
Paul

On Thu, 20 Jan 2005 16:55:55 +0100, Takashi Iwai wrote:
>> If you have "Headphone Jack Sense" mixer control,
>> try to turn it off.
>>
>> That did the trick. thanks..
>
> Glad to hear that. What machine do you have?
Re: [PATCH][RFC] swsusp: speed up image restoring on x86-64
Hi, On Friday, 21 of January 2005 00:06, Pavel Machek wrote: > Hi! > > > > > The readability of code is also important, IMHO. > > > > > > It did not seem too much better to me. > > > > Well, the beauty is in the eye of the beholder. :-) > > > > Still, it shrinks the code (22 lines vs 37 lines), it uses less GPRs (5 vs > > 7), it uses less > > SIB arithmetics (0 vs 4 times), it uses a well known scheme for copying > > data pages. > > As far as the result is concerned, it is equivalent to the existing code, > > but it's simpler > > (and faster). IMO, simpler code is always easier to understand. > > > > > > > > > If you want cheap way to speed it up, kill cr3 manipulation. > > > > > > > > Sure, but I think it's there for a reason. > > > > > > Reason is "to crash it early if we have wrong pagetables". > > > > > > > > Anyway, this is likely to clash with hugang's work; I'd prefer this > > > > > not to be applied. > > > > > > > > I am aware of that, but you are not going to merge the hugang's patches > > > > soon, are you? > > > > If necessary, I can change the patch to work with his code (hugang, > > > > what do you think?). > > > > > > I think it is just not worth the effort. > > > > Why? It won't take much time. I've spent more time for writing the > > messages > > in this thread ... ;-) > > Well, I know that current code works. It was produced by C compiler, > btw. Now, new code works for you, but it was not in kernel for 4 > releases, and... this code is pretty subtle. Now, I'm confused. :-) It's roughly this: struct pbe *pbe = pagedir_nosave, *end; unsigned n = nr_copy_pages; if (n) { end = pbe + n; do { memcpy((void *)pbe->orig_address, (void *)pbe->address, PAGE_SIZE); pbe++; } while (pbe < end); } where memcpy() is of course a hand-written inline that includes the cr3 manipulation, and pbe, end, n are registers. > And it is hand-made, not C produced. Yes, it is. > So... your code may be better but I do not think it is so much better > that I'd like to risk it. 
Now, that's clear. :-) Anyway, if anyone could test it or look at it and say a word, please do so. Greets, RJW -- - Would you tell me, please, which way I ought to go from here? - That depends a good deal on where you want to get to. -- Lewis Carroll "Alice's Adventures in Wonderland"
Re: 2.6.11-rc1 vs. PowerMac 8500/G3 (and VAIO laptop) [usb-storage oops]
On Thu, Jan 20, 2005 at 08:40:07AM +, David Woodhouse wrote: > On Wed, 2005-01-19 at 15:39 -0800, John Mock wrote: > > New to 2.6.11-rc1 is that 'lsusb' exhibits 'endian' problems on the > > PowerMac. > > Is that really new to 2.6.11-rc1? The kernel byte-swaps the bcdUSB, > idVendor, idProduct, and bcdDevice fields in the device descriptor. It > should probably swap them back before copying it up to userspace. Doh, sorry for missing this one. I've applied your patch to my trees, and it will show up in the next -mm release. thanks. greg k-h
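To illustrate the "swap them back" point above, here is a minimal userspace sketch, not the kernel patch itself. It assumes a hypothetical struct that holds only the four 16-bit fields named in the thread, in CPU byte order, and writes them out least significant byte first, which is the order USB uses on the wire. The names `desc_fields`, `put_le16` and `desc_fields_to_wire` are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical container: only the four 16-bit fields named in the
 * thread, held in CPU byte order. A real device descriptor has more
 * fields and a different layout.
 */
struct desc_fields {
	uint16_t bcdUSB;
	uint16_t idVendor;
	uint16_t idProduct;
	uint16_t bcdDevice;
};

/* Store a 16-bit value least significant byte first, on any host. */
static void put_le16(uint8_t *dst, uint16_t v)
{
	dst[0] = (uint8_t)(v & 0xff);
	dst[1] = (uint8_t)(v >> 8);
}

/* Convert the CPU-order fields back to little-endian bytes. */
static void desc_fields_to_wire(const struct desc_fields *d, uint8_t out[8])
{
	assert(d && out);
	put_le16(out + 0, d->bcdUSB);
	put_le16(out + 2, d->idVendor);
	put_le16(out + 4, d->idProduct);
	put_le16(out + 6, d->bcdDevice);
}
```

Because the bytes are written explicitly rather than copied from memory, the output is the same on big-endian machines such as the PowerMac and on little-endian ones.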
Re: Bug when using custom baud rates....
On Thu, Jan 20, 2005 at 04:22:56PM +0100, Rogier Wolff wrote: > On Thu, Jan 20, 2005 at 07:08:58AM -0800, Greg KH wrote: > > On Thu, Jan 20, 2005 at 03:54:22PM +0100, Rogier Wolff wrote: > > > Hi, > > > > > > When using custom baud rates, the code does: > > > > > > > > >if ((new_serial.baud_base != priv->baud_base) || > > > (new_serial.baud_base < 9600)) > > > return -EINVAL; > > > > > > Which translates to english as: > > > > > > If you changed the baud-base, OR the new one is > > > invalid, return invalid. > > > > > > but it should be: > > > > > > If you changed the baud-base, OR the new one is > > > invalid, return invalid. > > > > You mean AND, not OR here, right? :) > > :-) Sorry. Too noisy here. > > > > Patch attached. > > > > Have a 2.6 patch? > > Patch told me: >patching file drivers/usb/serial/ftdi_sio.c >Hunk #1 succeeded at 1137 (offset 156 lines). > > but the resulting patch is attached. Applied, thanks. greg k-h
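For readers following the OR/AND exchange above, this is a small standalone sketch of the two readings of the check. It is not the ftdi_sio patch (which is not shown in the thread); the function names and plain `int` parameters are stand-ins for `new_serial.baud_base` and `priv->baud_base`, and it assumes the intended rule is the AND form Greg describes.

```c
#include <assert.h>
#include <errno.h>

/* Check as quoted in the thread: rejects any change of the baud base. */
static int check_baud_base_or(int new_base, int cur_base)
{
	assert(cur_base > 0);
	if (new_base != cur_base || new_base < 9600)
		return -EINVAL;
	return 0;
}

/*
 * Assumed intent ("AND, not OR"): reject only when the base was
 * changed and the new value is below 9600.
 */
static int check_baud_base_and(int new_base, int cur_base)
{
	assert(cur_base > 0);
	if (new_base != cur_base && new_base < 9600)
		return -EINVAL;
	return 0;
}
```

With the OR form a caller can never change the baud base at all, which is why custom rates were refused; with the AND form only a changed, too-small base is refused.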
Re: RFC: [2.6 patch] let BLK_DEV_UB depend on USB_STORAGE=n
On Wed, Jan 19, 2005 at 06:49:00PM -0800, Matthew Dharm wrote: > On Wed, Jan 19, 2005 at 02:07:07PM -0800, Greg KH wrote: > > On Thu, Dec 23, 2004 at 03:40:31AM +0100, Adrian Bunk wrote: > > > On Sun, Dec 19, 2004 at 04:31:46PM -0800, Greg KH wrote: > > > > On Mon, Dec 20, 2004 at 01:16:44AM +0100, Adrian Bunk wrote: > > > > > I've already seen people crippling their usb-storage driver with > > > > > enabling BLK_DEV_UB - and I doubt the warning in the help text added > > > > > after 2.6.9 will fix all such problems. > > > > > > > > > > Is there except for kernel size any good reason for using BLK_DEV_UB > > > > > instead of USB_STORAGE? > > > > > > > > You don't want to use the scsi layer? You like the stability of it at > > > > times? :) > > > > > > > > > If not, I'd suggest the patch below to let BLK_DEV_UB depend > > > > > on EMBEDDED. > > > > > > > > No, it's good for non-embedded boxes too. > > > > > > > > > My current understanding is: > > > - BLK_DEV_UB supports a subset of what USB_STORAGE can support > > > - for an average user, there's no reason to enable BLK_DEV_UB > > > - if you really know what you are doing, there might be several reasons > > > why you might want to use BLK_DEV_UB > > > > I have been running with just the code portion of this patch for a while > > now, with good results (no Kconfig changes.) > > > > Pete and Matt, do you mind me applying the following portion of the > > patch to the kernel tree? > > I have no objection. Ok, I've committed the change to my trees, thanks. greg k-h
Re: security hook missing in compat ioctl in 2.6.11-rc1-mm2
* Michael S. Tsirkin ([EMAIL PROTECTED]) wrote: > Hi! > Security hook seems to be missing before compat_ioctl in mm2. > And, it would be nice to avoid calling it twice on some paths. > > Chris Wright's patch addressed this in the most elegant way I think, > by adding vfs_ioctl. > > Accordingly, this change: > > @@ -468,6 +496,11 @@ asmlinkage long compat_sys_ioctl(unsigne > > found_handler: > if (t->handler) { > + /* RED-PEN how should LSM module know it's handling 32bit? */ > + error = security_file_ioctl(filp, cmd, arg); > + if (error) > + goto out_fput; > + > lock_kernel(); > error = t->handler(fd, cmd, arg, filp); > unlock_kernel(); > > from Andy's "some fixes" patch wont be needed. > > Chris - are you planning to update your patch to -rc1-mm2? > I'd like to see this addressed, after this I believe logically > we'll get everything right, then I have a couple of small > cosmetic patches, and I believe we'll be set. Yes, Andrew asked me to wait until mm2 came out, so I'll rediff and send shortly. thanks, -chris -- Linux Security Modules http://lsm.immunix.org http://lsm.bkbits.net
security hook missing in compat ioctl in 2.6.11-rc1-mm2
Hi!
Security hook seems to be missing before compat_ioctl in mm2. And, it would be nice to avoid calling it twice on some paths.

Chris Wright's patch addressed this in the most elegant way I think, by adding vfs_ioctl.

Accordingly, this change:

@@ -468,6 +496,11 @@ asmlinkage long compat_sys_ioctl(unsigne
 found_handler:
 	if (t->handler) {
+		/* RED-PEN how should LSM module know it's handling 32bit? */
+		error = security_file_ioctl(filp, cmd, arg);
+		if (error)
+			goto out_fput;
+
 		lock_kernel();
 		error = t->handler(fd, cmd, arg, filp);
 		unlock_kernel();

from Andy's "some fixes" patch won't be needed.

Chris - are you planning to update your patch to -rc1-mm2? I'd like to see this addressed, after this I believe logically we'll get everything right, then I have a couple of small cosmetic patches, and I believe we'll be set.

-- I dont speak for Mellanox.
Re: Radeon framebuffer weirdness in -mm2
On Thu, 2005-01-20 at 15:48 -0800, Matt Mackall wrote: > On Thu, Jan 20, 2005 at 03:39:21PM -0800, Andrew Morton wrote: > > Matt Mackall <[EMAIL PROTECTED]> wrote: > > > > > > I'm seeing radeonfb on my ThinkPad T30 go weird on reboot (lots of > > > horizontal lines) and require powercycling to fix. Worked fine with > > > 2.6.10. > > > > Which radeon driver? CONFIG_FB_RADEON_OLD or CONFIG_FB_RADEON? > > FB_RADEON. > > > (cc Ben, who is the likely cuprit ;) > > Btw, ajoshi's address from MAINTAINERS is bouncing. The file should be updated, I am the radeonfb maintainer now. > > Which -mm2, btw? 2.6.10-mm2 or 2.6.11-rc1-mm2? > > 2.6.11-rc1-mm2 > > > Did you try the corresponding -mm1? > > Nothing between that and .10 yet. Building -mm1 now. Thanks. Ben.
Re: Radeon framebuffer weirdness in -mm2
On Thu, 2005-01-20 at 15:39 -0800, Andrew Morton wrote: > Matt Mackall <[EMAIL PROTECTED]> wrote: > > > > I'm seeing radeonfb on my ThinkPad T30 go weird on reboot (lots of > > horizontal lines) and require powercycling to fix. Worked fine with 2.6.10. > > Which radeon driver? CONFIG_FB_RADEON_OLD or CONFIG_FB_RADEON? > > (cc Ben, who is the likely cuprit ;) > > Which -mm2, btw? 2.6.10-mm2 or 2.6.11-rc1-mm2? > > Did you try the corresponding -mm1? /me curses possible BIOS crap ... radeonfb tries to restore the initial mode when the module is closed, which wouldn't work for a VGA text thing in fact... I suspect something causes driver remove() routines to be called on reboot, can you confirm? Or is it a module that gets removed? It may well be a problem that has always been there (regardless of the radeon driver version) and is just triggered by something the kernel does on reboot... Ben.
SCSI oops in 2.6.10 [was: usb-storage oops (PowerMac 8500/G3)]
Sorry about the confusion, but it appears the 'oops' is not specific to the USB subsystem, as it seems also to occur with an ordinary SCSI module as well under 2.6.10 (PPC). In this case, it's a ZIP drive connected via 'mac53c94' module, and in addition to, as noted before, the same problem with USB digital camera and an IOMEGA CD/RW drive (see earlier posting "2.6.11-rc1 vs. PowerMac 8500/G3 (and VAIO laptop) [usb-storage oops]"). Additional details gladly provided upon request. -- JM Attachments: SCSI oops from 'mac53c94' Example of usb-storage variant of same/similar 'oops' --- ... scsi1 : 53C94 Vendor: IOMEGAModel: ZIP 100 Rev: C.18 Type: Direct-Access ANSI SCSI revision: 02 Oops: kernel access of bad area, sig: 11 [#1] PREEMPT NIP: C009ABA4 LR: C009ABA4 SP: CC467C40 REGS: cc467b90 TRAP: 0600Not tainted MSR: 9032 EE: 1 PR: 0 FP: 0 ME: 1 IR/DR: 11 DAR: 6B6B6BD7, DSISR: TASK = cd3b4b30[1094] 'modprobe' THREAD: cc466000 Last syscall: 120 GPR00: C009ABA4 CC467C40 CD3B4B30 C01AADE0 0047 0047 CC467C78 000A GPR08: 8000 CDC2A6AC CC467C40 42002448 1001E284 100013A4 GPR16: 100013A4 100186E0 CC467D98 GPR24: CC467D9C 0001 CCC84994 CC467C78 CC40941C CCC84998 6B6B6BD7 NIP [c009aba4] create_dir+0x38/0x1d0 LR [c009aba4] create_dir+0x38/0x1d0 Call trace: [c009ad98] sysfs_create_dir+0x48/0x94 [c00ad688] create_dir+0x28/0x6c [c00ad98c] kobject_add+0x5c/0x15c [c00eaa40] device_add+0xb8/0x18c [c0111f9c] scsi_sysfs_add_sdev+0x78/0x39c [c01107c4] scsi_add_lun+0x2f8/0x364 [c011091c] scsi_probe_and_add_lun+0xec/0x1d8 [c010] scsi_scan_target+0x7c/0xec [c0fc] scsi_scan_channel+0x7c/0x9c [c01112f4] scsi_scan_host_selected+0xd8/0x138 [cf854b40] mac53c94_probe+0x208/0x26c [mac53c94] [c0103fd4] macio_device_probe+0x80/0x9c [c00ebfec] driver_probe_device+0x4c/0xa0 [c00ec184] driver_attach+0x88/0xc8 [c00ec7c8] bus_add_driver+0xd0/0x11c --- Jan 19 15:17:58 penngrove kernel: Vendor: NIKON Model: NIKON DSC E4500 Rev: 1.00 Jan 19 15:17:58 penngrove kernel: Type: Direct-Access ANSI SCSI revision: 
02 Jan 19 15:17:58 penngrove kernel: input: Logitech N48 on usb-:00:0e.0-1 Jan 19 15:17:58 penngrove kernel: hub 4-0:1.0: port 2, status 0100, change , 12 Mb/s Jan 19 15:17:58 penngrove kernel: Oops: kernel access of bad area, sig: 11 [#1] Jan 19 15:17:58 penngrove kernel: PREEMPT Jan 19 15:17:58 penngrove kernel: NIP: C009BF14 LR: C009BF14 SP: CCF63DC0 REGS: ccf63d10 TRAP: 0300Not tainted Jan 19 15:17:58 penngrove kernel: MSR: 9032 EE: 1 PR: 0 FP: 0 ME: 1 IR/DR: 11 Jan 19 15:17:58 penngrove kernel: DAR: 0074, DSISR: 4000 Jan 19 15:17:58 penngrove kernel: TASK = ccb5ebf0[1651] 'usb-stor-scan' THREAD: ccf62000 Jan 19 15:17:58 penngrove kernel: Last syscall: -1 Jan 19 15:17:58 penngrove kernel: GPR00: C009BF14 CCF63DC0 CCB5EBF0 C01AF674 0047 0047 CCF63DF8 000A Jan 19 15:17:58 penngrove kernel: GPR08: 8000 CC9185C8 CCF63DC0 42002448 1001E284 100013A4 Jan 19 15:17:58 penngrove kernel: GPR16: 100013A4 100187C0 CCF63F18 Jan 19 15:17:58 penngrove kernel: GPR24: CCF63F1C 0001 CCC61184 CCF63DF8 CCC1EE84 CCC61188 0074 Jan 19 15:17:58 penngrove kernel: NIP [c009bf14] create_dir+0x38/0x1d0 Jan 19 15:17:58 penngrove kernel: LR [c009bf14] create_dir+0x38/0x1d0 Jan 19 15:17:58 penngrove kernel: Call trace: Jan 19 15:17:58 penngrove kernel: [c009c108] sysfs_create_dir+0x48/0x94 Jan 19 15:17:58 penngrove kernel: [c00ae97c] create_dir+0x28/0x6c Jan 19 15:17:58 penngrove kernel: [c00aec80] kobject_add+0x5c/0x15c Jan 19 15:17:58 penngrove kernel: [c00ec960] device_add+0xc4/0x19c Jan 19 15:17:58 penngrove kernel: [c0114590] scsi_sysfs_add_sdev+0x78/0x3a4 Jan 19 15:17:58 penngrove kernel: [c0112a88] scsi_add_lun+0x2f8/0x364 Jan 19 15:17:58 penngrove kernel: [c0112be0] scsi_probe_and_add_lun+0xec/0x1fc Jan 19 15:17:58 penngrove kernel: [c0113414] scsi_scan_target+0x7c/0xec Jan 19 15:17:58 penngrove kernel: [c0113500] scsi_scan_channel+0x7c/0x9c Jan 19 15:17:58 penngrove kernel: [c01135f8] scsi_scan_host_selected+0xd8/0x138 Jan 19 15:17:58 penngrove kernel: [cfb04dc0] 
usb_stor_scan_thread+0x6c/0x124 [usb_storage] Jan 19 15:17:58 penngrove kernel: [c00066c4] kernel_thread+0x44/0x60 ===
[RFC] [PATCH] move bio code from dm into bio
Jens, last December you observed there was bio code duplicated in the dm drivers. Here is a collection of patches that implements support for local bio and bvec pools in bio.c and then removes the duplicate bio code from the dm drivers. It also replaces a call to bio_alloc() in dm.c with a call that uses a local bio pool. This removes a deadlock case in that code.

These patches are against 2.6.11-rc1. If that's not a good source version to patch against, let me know what versions I should generate patches for.

I still need to implement some form of congestion control in the dm code. As things are now, the snapshot target can consume all the bios in the system. With the elimination of the deadlock case in dm.c, the system no longer deadlocks. But instead, it starts invoking the oom killer. That'll be the next patch set to this code.

Please review. Thanks!

diff -ur linux-2.6.11-rc1-original/drivers/md/dm.c linux-2.6.11-rc1-bio/drivers/md/dm.c
--- linux-2.6.11-rc1-original/drivers/md/dm.c 2004-12-24 13:35:24.0 -0800
+++ linux-2.6.11-rc1-bio/drivers/md/dm.c 2005-01-19 15:51:49.0 -0800
@@ -96,10 +96,16 @@
 static kmem_cache_t *_io_cache;
 static kmem_cache_t *_tio_cache;
+static struct bio_set *dm_set;
+
 static int __init local_init(void)
 {
 	int r;
+	dm_set = bioset_create(512, 512, 1);
+	if (!dm_set)
+		return -ENOMEM;
+
 	/* allocate a slab for the dm_ios */
 	_io_cache = kmem_cache_create("dm_io", sizeof(struct dm_io),
 				      0, 0, NULL, NULL);
@@ -133,6 +139,8 @@
 	kmem_cache_destroy(_tio_cache);
 	kmem_cache_destroy(_io_cache);
+	bioset_free(dm_set);
+
 	if (unregister_blkdev(_major, _name) < 0)
 		DMERR("devfs_unregister_blkdev failed");
@@ -393,7 +401,7 @@
 	struct bio *clone;
 	struct bio_vec *bv = bio->bi_io_vec + idx;
-	clone = bio_alloc(GFP_NOIO, 1);
+	clone = bio_alloc_bs(GFP_NOIO, 1, dm_set);
 	*clone->bi_io_vec = *bv;
 	clone->bi_sector = sector;
diff -ur linux-2.6.11-rc1-original/drivers/md/dm-io.c linux-2.6.11-rc1-bio/drivers/md/dm-io.c ---
linux-2.6.11-rc1-original/drivers/md/dm-io.c2004-12-24 13:35:39.0 -0800 +++ linux-2.6.11-rc1-bio/drivers/md/dm-io.c 2005-01-19 15:26:55.0 -0800 @@ -12,207 +12,7 @@ #include #include -#define BIO_POOL_SIZE 256 - - -/*- - * Bio set, move this to bio.c - *---*/ -#define BV_NAME_SIZE 16 -struct biovec_pool { - int nr_vecs; - char name[BV_NAME_SIZE]; - kmem_cache_t *slab; - mempool_t *pool; - atomic_t allocated; /* FIXME: debug */ -}; - -#define BIOVEC_NR_POOLS 6 -struct bio_set { - char name[BV_NAME_SIZE]; - kmem_cache_t *bio_slab; - mempool_t *bio_pool; - struct biovec_pool pools[BIOVEC_NR_POOLS]; -}; - -static void bio_set_exit(struct bio_set *bs) -{ - unsigned i; - struct biovec_pool *bp; - - if (bs->bio_pool) - mempool_destroy(bs->bio_pool); - - if (bs->bio_slab) - kmem_cache_destroy(bs->bio_slab); - - for (i = 0; i < BIOVEC_NR_POOLS; i++) { - bp = bs->pools + i; - if (bp->pool) - mempool_destroy(bp->pool); - - if (bp->slab) - kmem_cache_destroy(bp->slab); - } -} - -static void mk_name(char *str, size_t len, const char *prefix, unsigned count) -{ - snprintf(str, len, "%s-%u", prefix, count); -} - -static int bio_set_init(struct bio_set *bs, const char *slab_prefix, -unsigned pool_entries, unsigned scale) -{ - /* FIXME: this must match bvec_index(), why not go the -* whole hog and have a pool per power of 2 ? */ - static unsigned _vec_lengths[BIOVEC_NR_POOLS] = { - 1, 4, 16, 64, 128, BIO_MAX_PAGES - }; - - - unsigned i, size; - struct biovec_pool *bp; - - /* zero the bs so we can tear down properly on error */ - memset(bs, 0, sizeof(*bs)); - - /* -* Set up the bio pool. 
-*/ - snprintf(bs->name, sizeof(bs->name), "%s-bio", slab_prefix); - - bs->bio_slab = kmem_cache_create(bs->name, sizeof(struct bio), 0, -SLAB_HWCACHE_ALIGN, NULL, NULL); - if (!bs->bio_slab) { - DMWARN("can't init bio slab"); - goto bad; - } - - bs->bio_pool = mempool_create(pool_entries, mempool_alloc_slab, - mempool_free_slab, bs->bio_slab); - if (!bs->bio_pool) { - DMWARN("can't init bio pool"); - goto bad; - } - - /* -* Set up the biovec pools. -*/ - for (i = 0; i < BIOVEC_NR_POOLS; i++) { - bp
[PATCH] to fix xtime lock for in the RT kernel patch
It seems to me that we need to either do the attached or to rewrite the timer front end code to just gather the offset info and defer to the timer irq thread to update jiffies and the offset stuff. In either case we really can not split the two and we do need the xtime_lock protection.

-- George Anzinger george@mvista.com High-res-timers: http://sourceforge.net/projects/high-res-timers/

Source: MontaVista Software, Inc. George Anzinger
Type: Defect Fix
Keywords:
Signed-off-by: George Anzinger
Description: This patch changes the timer interrupt code for the RT patch to respect the xtime_lock which should protect jiffies and to collect offset information on jiffies interrupts. This offset info must be collected as soon as possible during the jiffies interrupt and also needs to be protected by the xtime_lock. The xtime_lock is thus a "raw" lock.

 arch/i386/kernel/time.c |8 +---
 include/linux/time.h|2 +-
 kernel/timer.c |2 +-
 3 files changed, 7 insertions(+), 5 deletions(-)

Index: topdir/kernel/timer.c
===
--- topdir.orig/kernel/timer.c
+++ topdir/kernel/timer.c
@@ -946,7 +946,7 @@ unsigned long wall_jiffies = INITIAL_JIF
  * playing with xtime and avenrun.
  */
 #ifndef ARCH_HAVE_XTIME_LOCK
-DECLARE_SEQLOCK(xtime_lock);
+DECLARE_RAW_SEQLOCK(xtime_lock);
 EXPORT_SYMBOL(xtime_lock);
 #endif
Index: topdir/include/linux/time.h
===
--- topdir.orig/include/linux/time.h
+++ topdir/include/linux/time.h
@@ -80,7 +80,7 @@ mktime (unsigned int year, unsigned int
 extern struct timespec xtime;
 extern struct timespec wall_to_monotonic;
-extern seqlock_t xtime_lock;
+extern raw_seqlock_t xtime_lock;
 static inline unsigned long get_seconds(void)
 {
Index: topdir/arch/i386/kernel/time.c
===
--- topdir.orig/arch/i386/kernel/time.c
+++ topdir/arch/i386/kernel/time.c
@@ -20,7 +20,7 @@
  * monotonic gettimeofday() with fast_get_timeoffset(),
  * drift-proof precision TSC calibration on boot
  * (C. Scott Ananian <[EMAIL PROTECTED]>, Andrew D.
- * Balsa <[EMAIL PROTECTED]>, Philip Gladstone <[EMAIL PROTECTED]>;
+ * Balsa <[EMAIL PROTECTED]>, Philip Gladstone <[EMAIL PROTECTED]>;
  * ported from 2.0.35 Jumbo-9 by Michael Krause <[EMAIL PROTECTED]>).
  * 1998-12-16Andrea Arcangeli
  * Fixed Jumbo-9 code in 2.1.131: do_gettimeofday was missing 1 jiffy
@@ -224,7 +224,10 @@ EXPORT_SYMBOL(profile_pc);
  */
 void direct_timer_interrupt(struct pt_regs *regs)
 {
+	write_seqlock(&xtime_lock);
+	cur_timer->mark_offset();
 	do_timer_interrupt_hook(regs);
+	write_sequnlock(&xtime_lock);
 }
 #endif
@@ -254,6 +257,7 @@ static inline void do_timer_interrupt(in
 #endif
 #ifndef CONFIG_PREEMPT_HARDIRQS
+	cur_timer->mark_offset();
 	do_timer_interrupt_hook(regs);
 #endif
@@ -312,8 +316,6 @@ irqreturn_t timer_interrupt(int irq, voi
 	 * locally disabled. -arca
 	 */
 	write_seqlock(&xtime_lock);
-
-	cur_timer->mark_offset();
 	do_timer_interrupt(irq, NULL, regs);
Re: Radeon framebuffer weirdness in -mm2
Matt Mackall <[EMAIL PROTECTED]> wrote: > > > Which radeon driver? CONFIG_FB_RADEON_OLD or CONFIG_FB_RADEON? > > FB_RADEON. Ah, OK. Likely culprits are radeonfb-massive-update-of-pm-code.patch radeonfb-build-fix.patch
Re: sparse warning, or why does jifies_to_msecs() return an int?
> On Sat, 15 Jan 2005 10:05:37 -0800 (PST), Linus Torvalds <[EMAIL PROTECTED]> said:

Linus> Hmm.. I don't think your patch is wrong per se, but I do
Linus> think it's a bit too subtle. I'd almost rather make
Linus> "jiffies_to_msecs()" just test for overflow instead, and that
Linus> should also fix it.

You sure about that? Actually, I think my patch was broken anyhow for HZ < 1000 because you can potentially get integer-overflows in temporary results which could make things come out wrong again.

I _think_ the attached patch works for all reasonable cases reasonably uniformly, but if you thought the previous patch was subtle, I'm sure you're going to like this one even less.

Note that with the patch, platforms where HZ is not a power of two and doesn't fit any of the other special cases (namely (HZ % 1000) != 0 && (1000 % HZ) != 0) would suffer a penalty. AFAICS, this is true only for Alpha/Rawhide (HZ=1200). In such a case, rather than:

	(j * 1000)/HZ

the new code would compute:

	(j/HZ)*1000 + ((j%HZ)*1000)/HZ

It looks to me like we could get rid of all the ugly & complex intermediate overflow-checks if we defined MAX_JIFFY_OFFSET as:

	(~0UL / 1000)

However, on a 32-bit platform that runs at 1000 Hz, this would limit us to 4294 seconds. That may be cutting it a bit close.

	--david

= include/linux/jiffies.h 1.11 vs edited =
--- 1.11/include/linux/jiffies.h 2005-01-04 18:48:02 -08:00
+++ edited/include/linux/jiffies.h 2005-01-20 15:21:14 -08:00
@@ -254,13 +254,32 @@
  */
 static inline unsigned int jiffies_to_msecs(const unsigned long j)
 {
+	unsigned long res;
+
 #if HZ <= 1000 && !(1000 % HZ)
-	return (1000 / HZ) * j;
+	unsigned long max = ~0UL / (1000 / HZ);
+
+	if (j > max)
+		max = j;
+	res = (1000 / HZ) * j;
 #elif HZ > 1000 && !(HZ % 1000)
-	return (j + (HZ / 1000) - 1)/(HZ / 1000);
+	res = (j + (HZ / 1000) - 1) / (HZ / 1000);
 #else
-	return (j * 1000) / HZ;
+	/*
+	 * HZ better be a power of two; otherwise this gets real
+	 * expensive. Better expensive than wrong, though.
+	 */
+# if HZ < 1000
+	unsigned long max = (~0UL / 1000) * HZ;
+
+	if (j > max)
+		j = max;
+# endif
+	res = (j / HZ) * 1000 + ((j % HZ) * 1000) / HZ;
 #endif
+	if (res > ~0U)
+		return ~0U;
+	return res;
 }
 static inline unsigned int jiffies_to_usecs(const unsigned long j)
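The overflow argument above can be tried out in userspace. The following is a rough sketch, not the posted patch: it uses `uint32_t` to stand in for a 32-bit `unsigned long`, takes HZ as a runtime parameter instead of a compile-time constant, and the names `msecs_naive` and `msecs_split` are invented for the example. It compares the naive `(j * 1000) / HZ` with the split form `(j/HZ)*1000 + ((j%HZ)*1000)/HZ`, clamping the input first when HZ < 1000 as the mail suggests.

```c
#include <assert.h>
#include <stdint.h>

/* Naive form: j * 1000 can wrap around in 32 bits before the division. */
static uint32_t msecs_naive(uint32_t j, uint32_t hz)
{
	assert(hz > 0);
	return (j * 1000u) / hz;
}

/*
 * Split form: whole seconds and the sub-second remainder are converted
 * separately, so no intermediate product needs more than 32 bits.
 * For hz < 1000 the input is clamped so the result still fits.
 */
static uint32_t msecs_split(uint32_t j, uint32_t hz)
{
	assert(hz > 0);
	if (hz < 1000u) {
		uint32_t max = (UINT32_MAX / 1000u) * hz;

		if (j > max)
			j = max;
	}
	return (j / hz) * 1000u + ((j % hz) * 1000u) / hz;
}
```

With HZ=1200 (the Alpha/Rawhide case mentioned above) and j = 5000000, the naive product no longer fits in 32 bits, while the split form gives 4166666 ms. The price is the extra division and modulo the mail refers to as the penalty.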
Re: Radeon framebuffer weirdness in -mm2
On Thu, Jan 20, 2005 at 03:39:21PM -0800, Andrew Morton wrote: > Matt Mackall <[EMAIL PROTECTED]> wrote: > > > > I'm seeing radeonfb on my ThinkPad T30 go weird on reboot (lots of > > horizontal lines) and require powercycling to fix. Worked fine with 2.6.10. > > Which radeon driver? CONFIG_FB_RADEON_OLD or CONFIG_FB_RADEON? FB_RADEON. > (cc Ben, who is the likely cuprit ;) Btw, ajoshi's address from MAINTAINERS is bouncing. > Which -mm2, btw? 2.6.10-mm2 or 2.6.11-rc1-mm2? 2.6.11-rc1-mm2 > Did you try the corresponding -mm1? Nothing between that and .10 yet. Building -mm1 now. -- Mathematics is the supreme nostalgia of our time.
[PATCH] drivers/usb/devio.c, against ioctl bug in 2.4.28 & 2.4.29
Hi!

Here is the tested patch against the modem_run and eciadsl hang since 2.4.28. Longer discussion about it is in: http://sourceforge.net/mailarchive/forum.php?thread_id=6054671_id=5398 and feedback from users is in: http://www.mail-archive.com/speedtouch%40ml.free.fr/msg06848.html The patch itself is also located in: http://linux.ee/~kaups/devio.patch

It:
- prevents grabbing the exclusive_access mutex for ioctls that don't need it
- prevents grabbing the exclusive_access mutex for nonexistent ioctls
- uses interruptible sleep instead of uninterruptible

PS. keep me in CC since I'm not subscribed...

-- best regards, Kaupo Arulo

--- devio.c.orig 2004-11-28 22:24:49.0 +0200
+++ devio.c 2004-12-01 12:47:02.0 +0200
@@ -1153,45 +1153,62 @@ static int usbdev_ioctl(struct inode *in
 	if (!(file->f_mode & FMODE_WRITE))
 		return -EPERM;
-	down_read(&ps->devsem);
+	down_read(&ps->devsem);	/* FIXME: should we set devsem also per "case"
+				   like exclusive_access to avoid
+				   blocking nonexistent ioctls ? */
 	if (!ps->dev) {
 		up_read(&ps->devsem);
 		return -ENODEV;
 	}
-
-	/*
-	 * grab device's exclusive_access mutex to prevent its driver from
-	 * using this device while it is being accessed by us.
+/*
+ * Some ioctls don't touch the device and can be called without
+ * grabbing its exclusive_access mutex; they are handled together
+ * in same switch with ioctls which need it. Exclusive_access is handled in
+ * particular switch branches, so we grab device's exclusive_access
+	 * mutex ONLY if needed and WHEN actually needed!!!
 	 */
-	down(&ps->dev->exclusive_access);
-
 	switch (cmd) {
 	case USBDEVFS_CONTROL:
-		ret = proc_control(ps, (void *)arg);
-		if (ret >= 0)
-			inode->i_mtime = CURRENT_TIME;
+		if (down_interruptible(&ps->dev->exclusive_access) == 0) {
+			ret = proc_control(ps, (void *)arg);
+			up(&ps->dev->exclusive_access);
+			if (ret >= 0)
+				inode->i_mtime = CURRENT_TIME;
+		} else ret = -ERESTARTSYS;
 		break;
 	case USBDEVFS_BULK:
-		ret = proc_bulk(ps, (void *)arg);
-		if (ret >= 0)
-			inode->i_mtime = CURRENT_TIME;
+		if (down_interruptible(&ps->dev->exclusive_access) == 0) {
+			ret = proc_bulk(ps, (void *)arg);
+			up(&ps->dev->exclusive_access);
+			if (ret >= 0)
+				inode->i_mtime = CURRENT_TIME;
+		} else ret = -ERESTARTSYS;
 		break;
 	case USBDEVFS_RESETEP:
-		ret = proc_resetep(ps, (void *)arg);
-		if (ret >= 0)
-			inode->i_mtime = CURRENT_TIME;
+		if (down_interruptible(&ps->dev->exclusive_access) == 0) {
+			ret = proc_resetep(ps, (void *)arg);
+			up(&ps->dev->exclusive_access);
+			if (ret >= 0)
+				inode->i_mtime = CURRENT_TIME;
+		} else ret = -ERESTARTSYS;
 		break;
 	case USBDEVFS_RESET:
-		ret = proc_resetdevice(ps);
+		if (down_interruptible(&ps->dev->exclusive_access) == 0) {
+			ret = proc_resetdevice(ps);
+			up(&ps->dev->exclusive_access);
+		} else ret = -ERESTARTSYS;
 		break;
 	case USBDEVFS_CLEAR_HALT:
-		ret = proc_clearhalt(ps, (void *)arg);
-		if (ret >= 0)
-			inode->i_mtime = CURRENT_TIME;
+		if (down_interruptible(&ps->dev->exclusive_access) == 0) {
+			ret = proc_clearhalt(ps, (void *)arg);
+			up(&ps->dev->exclusive_access);
+			if (ret >= 0)
+				inode->i_mtime = CURRENT_TIME;
+		} else ret = -ERESTARTSYS;
 		break;
 	case USBDEVFS_GETDRIVER:
@@ -1203,21 +1220,33 @@ static int usbdev_ioctl(struct inode *in
 		break;
 	case USBDEVFS_SETINTERFACE:
-		ret = proc_setintf(ps, (void *)arg);
+		if (down_interruptible(&ps->dev->exclusive_access) == 0) {
+			ret = proc_setintf(ps, (void *)arg);
+			up(&ps->dev->exclusive_access);
+		} else ret = -ERESTARTSYS;
 		break;
 	case USBDEVFS_SETCONFIGURATION:
-		ret = proc_setconfig(ps, (void *)arg);
+		if (down_interruptible(&ps->dev->exclusive_access) == 0) {
+			ret = proc_setconfig(ps, (void *)arg);
+			up(&ps->dev->exclusive_access);
+		}
Re: [patch, BK-curr] nonintrusive spin-polling loop in kernel/spinlock.c
Btw, I think I've now merged everything to bring us back to where we wanted to be - can people verify that the architecture they care about has all the right "read_can_lock()" etc infrastructure (and preferably that it _works_ too ;), and that I've not missed or incorrectly ignored some patches in this thread? Linus
Re: Radeon framebuffer weirdness in -mm2
Matt Mackall <[EMAIL PROTECTED]> wrote: > > I'm seeing radeonfb on my ThinkPad T30 go weird on reboot (lots of > horizontal lines) and require powercycling to fix. Worked fine with 2.6.10. Which radeon driver? CONFIG_FB_RADEON_OLD or CONFIG_FB_RADEON? (cc Ben, who is the likely culprit ;) Which -mm2, btw? 2.6.10-mm2 or 2.6.11-rc1-mm2? Did you try the corresponding -mm1?
Re: TCP checksum calculation
On Thu, 20 Jan 2005 15:52:34 -0500 (EST) Rahul Jain <[EMAIL PROTECTED]> wrote: > Hi, > > I have written a module that changes IP addrs and TCP port values. After > changing these fields, I am able to recalculate the IP checksum within > the module. To recalculate the TCP checksum, I wrote a new function in > tcp_ipv4.c which is very similar to tcp_v4_send_check(). The only > difference is that, my function does not use the sock parameter and gets > the saddr and daddr from sk_buff. I call this function before the > following piece of code in tcp_v4_rcv() > > if ((skb->ip_summed != CHECKSUM_UNNECESSARY && > tcp_v4_checksum_init(skb) < 0)) > goto bad_packet; > > However I am still getting a bad tcp checksum error. Does anyone know what > I am missing and point me in the right direction. > Look at the netfilter code, in fact if you are changing values there may already be a netfilter module to do what you want, and you could have saved the effort. -- Stephen Hemminger <[EMAIL PROTECTED]>
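As background to the question above: the TCP checksum covers a pseudo-header containing the source and destination addresses as well as the TCP header and payload, so rewriting addresses or ports invalidates it. The following is a plain userspace sketch of that calculation, not the kernel or netfilter code; the function names are made up for the example, and it assumes IPv4 addresses passed as four bytes in network order and a segment whose checksum field has been zeroed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Add a byte buffer to a running sum as big-endian 16-bit words. */
static uint32_t sum_words(const uint8_t *p, size_t len, uint32_t sum)
{
	while (len > 1) {
		sum += ((uint32_t)p[0] << 8) | p[1];
		p += 2;
		len -= 2;
	}
	if (len)			/* odd trailing byte, padded with zero */
		sum += (uint32_t)p[0] << 8;
	return sum;
}

/* Fold carries back into 16 bits and take the ones' complement. */
static uint16_t fold_complement(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}

/*
 * Checksum over pseudo-header (saddr, daddr, zero, protocol 6, TCP
 * length) followed by the TCP segment. Store the result big-endian at
 * bytes 16..17 of the segment.
 */
static uint16_t tcp_checksum(const uint8_t saddr[4], const uint8_t daddr[4],
			     const uint8_t *seg, size_t len)
{
	uint8_t pseudo[12];

	assert(len <= 0xffff);
	memcpy(pseudo, saddr, 4);
	memcpy(pseudo + 4, daddr, 4);
	pseudo[8] = 0;
	pseudo[9] = 6;			/* IPPROTO_TCP */
	pseudo[10] = (uint8_t)(len >> 8);
	pseudo[11] = (uint8_t)(len & 0xff);

	return fold_complement(sum_words(seg, len,
					 sum_words(pseudo, sizeof(pseudo), 0)));
}
```

A quick self-check is to store the result in the segment and run the function again over the same data: a correct checksum makes the second call return 0. Changing either address without recomputing breaks that property, which matches the symptom described in the question.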
Re: oom killer gone nuts
On Thu, Jan 20, 2005 at 03:57:07PM -0600, Chris Friesen wrote:
> Andries Brouwer wrote:
>
> >But let me stress that I also consider the earlier situation
> >unacceptable. It is really bad to lose a few weeks of computation.
>
> Shouldn't the application be backing up intermediate results to disk
> periodically? Power outages do occur, as do bus faults, electrical
> glitches, dead fans, etc.

Agreed. Plus, if you truly cannot change the app because it's binary-only,
you can at least set the ulimit based on the virtual sizes; ulimit should
work reliably even if overcommit doesn't.
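As a rough illustration of the ulimit suggestion: a small launcher can cap
its own virtual address space with setrlimit(RLIMIT_AS) before exec'ing the
binary-only program, which is what the shell's "ulimit -v" does under the
hood. This is a minimal sketch; cap_address_space() is a hypothetical
helper name, and picking the right size for a given job is left open.

```c
#include <sys/resource.h>

/*
 * Hypothetical helper: set the soft address-space limit to `bytes`,
 * clamped to the current hard limit (an unprivileged process cannot
 * raise the soft limit above it). Returns 0 on success, -1 on error.
 */
int cap_address_space(rlim_t bytes)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_AS, &rl) != 0)
        return -1;
    if (rl.rlim_max != RLIM_INFINITY && bytes > rl.rlim_max)
        bytes = rl.rlim_max;
    rl.rlim_cur = bytes;
    return setrlimit(RLIMIT_AS, &rl);
}
```

With the limit in place, an allocation that would push the process past the
cap fails with ENOMEM in that process instead of leaving the decision to
the OOM killer later.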
Re: [RFC] tracepipe -- event streams, debugfs, and pipe_buffers
Karim Yaghmour wrote:
> Zach Brown wrote:
>
>>Thoughts? I, for one, am tired of writing throw-away per-cpu tracing
>>patches ;)
>
> Have you taken a look at relayfs and ltt?

Only briefly. They've always seemed more involved than the sort of thing
I was after. I'll try and sit down and investigate in more detail.
Radeon framebuffer weirdness in -mm2
I'm seeing radeonfb on my ThinkPad T30 go weird on reboot (lots of
horizontal lines) and require powercycling to fix. Worked fine with 2.6.10.

--
Mathematics is the supreme nostalgia of our time.
Re: [PATCH]sched: Isochronous class v2 for unprivileged soft rt scheduling
[EMAIL PROTECTED] wrote:
> On Thu, Jan 20, 2005 at 10:42:24AM -0500, Paul Davis wrote:
>
>> over on #ardour last week, we saw appalling performance from
>> reiserfs. a 120GB filesystem with 11GB of space failed to be able to
>> deliver enough read/write speed to keep up with a 16 track
>> session. When the filesystem was cleared to provide 36GB of space,
>> things improved. The actual recording takes place using writes of
>> 256kB, and no more than a few hundred MB was being written during the
>> failed tests.
>
> It's been a long while since I followed ReiserFS development closely,
> *however*, this issue used to be a common problem with ReiserFS - when
> free space starts to drop below 10%, performance takes a big hit. So
> performance improved when space was cleared up. I don't remember what
> causes this or what the status is in modern ReiserFS systems.
>
>> everything i read about reiser suggests it is unsuitable for audio
>> work: it is optimized around the common case of filesystems with many
>> small files. the filesystems where we record audio is typically
>> filled with a relatively small number of very, very large files.
>
> Anecdotally, I've found this to not be the case. I only use ReiserFS and
> have a few reasonably sized projects in Ardour that work fine: maybe 20
> tracks, with 10-15 plugins (in the whole project), and I can do overdubs
> with no problems. It may be relevant that I only have a four track card
> and so load is too small. But at least in my practice, it hasn't been a
> huge hindrance.

This is my understanding of the situation, which is not gospel but
interpretation of the information I have had available.

Reiserfs3.6 is in maintenance mode. Its performance was very good in 2.4
days, but since 2.6 the block layer has matured so much that the code
paths that were fast in reiserfs are no longer so impressive compared to
those shared by ext3.

In terms of recommendation, the latency of non-preemptible codepaths will
be fastest in ext3 in 2.6 due to the nature of it constantly being
examined, addressed and updated. That does not mean it has the fastest
performance by any stretch of the imagination. XFS, I believe, has
significantly faster large file performance, and reiser3.6 has
significantly faster small file performance. But if throughput is not a
problem, and latency is, then ext3 is a better choice. Reiser4 is a
curious beast with obviously high throughput, but for the moment I do not
think it is remotely suitable for low latency applications.

As for the %full issue; no filesystem works well as it approaches full
capacity. Performance degrades dramatically beyond 75% on all of them,
becoming woeful once beyond 85%. If you're looking for good performance,
more free capacity is more effective than changing filesystems.

All of this should be taken into consideration if you're worried about low
latency cpu scheduling, as it all will collapse if your filesystem code
has high latency in the kernel. It also would make benchmarking low
latency cpu scheduling potentially prone to disastrous mis-interpretation.

Cheers,
Con
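To put the %full figures next to the numbers reported from #ardour: 11GB
free of 120GB is roughly 91% full, past the 85% mark, while 36GB free is
70% full, below the 75% mark. A quick way to check where a mounted
filesystem sits is statvfs(); the sketch below is only that, and the helper
names percent_used() and fs_percent_used() are invented for the example.

```c
#include <sys/statvfs.h>

/* Percentage of blocks in use, from total and free block counts. */
double percent_used(unsigned long long total_blocks,
                    unsigned long long free_blocks)
{
    if (total_blocks == 0)
        return 0.0;
    return 100.0 * (double)(total_blocks - free_blocks)
                 / (double)total_blocks;
}

/*
 * Percent used for the filesystem containing `path`,
 * or -1.0 if statvfs() fails.
 */
double fs_percent_used(const char *path)
{
    struct statvfs sv;

    if (statvfs(path, &sv) != 0)
        return -1.0;
    return percent_used(sv.f_blocks, sv.f_bfree);
}
```

A recording setup could call fs_percent_used() on the session directory
before a take and warn once the result passes whatever threshold it trusts.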
[PATCH 2.6.11-rc1-mm2] mips: fixed conflicting types
This patch fixes the following two conflicting-type errors.

Yoichi

arch/mips/lib/csum_partial_copy.c:21: error: conflicting types for `csum_partial_copy_nocheck'
include/asm/checksum.h:65: error: previous declaration of `csum_partial_copy_nocheck'
arch/mips/lib/csum_partial_copy.c:38: error: conflicting types for `csum_partial_copy_from_user'
include/asm/checksum.h:38: error: previous declaration of `csum_partial_copy_from_user'
make[1]: *** [arch/mips/lib/csum_partial_copy.o] Error 1
make: *** [arch/mips/lib] Error 2

Signed-off-by: Yoichi Yuasa <[EMAIL PROTECTED]>

diff -urN -X dontdiff a-orig/arch/mips/lib/csum_partial_copy.c a/arch/mips/lib/csum_partial_copy.c
--- a-orig/arch/mips/lib/csum_partial_copy.c  Wed Jan 12 13:02:09 2005
+++ a/arch/mips/lib/csum_partial_copy.c  Fri Jan 21 07:47:35 2005
@@ -16,7 +16,7 @@
 /*
  * copy while checksumming, otherwise like csum_partial
  */
-unsigned int csum_partial_copy_nocheck(const char *src, char *dst,
+unsigned int csum_partial_copy_nocheck(const unsigned char *src, unsigned char *dst,
 	int len, unsigned int sum)
 {
 	/*
@@ -33,7 +33,7 @@
  * Copy from userspace and compute checksum. If we catch an exception
  * then zero the rest of the buffer.
  */
-unsigned int csum_partial_copy_from_user (const char *src, char *dst,
+unsigned int csum_partial_copy_from_user (const unsigned char *src, unsigned char *dst,
 	int len, unsigned int sum, int *err_ptr)
 {
 	int missing;
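Some background on why the compiler complains: in C, char, signed char and
unsigned char are three distinct types, so a definition taking
"const char *" does not match a prototype taking "const unsigned char *"
(presumably what include/asm/checksum.h declares; the header is not shown
here). The C11 sketch below only demonstrates that the pointer types are
told apart; CHAR_PTR_KIND and the two functions are names invented for it.

```c
/* Select a label according to the exact pointer type of the argument. */
#define CHAR_PTR_KIND(p) _Generic((p),            \
    const char *:          "char",                \
    const signed char *:   "signed char",         \
    const unsigned char *: "unsigned char")

/* A plain char pointer is its own type, whatever the signedness of char. */
const char *kind_plain(void)
{
    const char *p = 0;
    return CHAR_PTR_KIND(p);
}

/* An unsigned char pointer selects a different branch. */
const char *kind_unsigned(void)
{
    const unsigned char *p = 0;
    return CHAR_PTR_KIND(p);
}
```

kind_plain() yields "char" and kind_unsigned() yields "unsigned char",
which is the same distinction the compiler is drawing between the
definition and the earlier declaration.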
inotify-0.18-rml-4: Oops
Hi

I reproducibly get the following Oops as soon as I start using inotify
with gamin and/or beagle. This happens with linux 2.6.10-as1 + inotify
0.18-rml-4 on multiple x86 machines.

Unable to handle kernel NULL pointer dereference at virtual address
 printing eip:
c01d6d31
*pde =
Oops: [#1]
PREEMPT SMP
Modules linked in: nfs lockd sunrpc mga af_packet autofs4 md5 ipv6 e100
mii snd_cmipci snd_opl3_lib snd_hwdep snd_mpu401_uart snd_rawmidi
snd_seq_device intel_agp agpgart snd_intel8x0 snd_ac97_codec tun
snd_pcm_oss snd_pcm snd_timer snd_page_alloc snd_mixer_oss snd soundcore
ext3 jbd mbcache binfmt_misc xfs sd_mod pl2303 usbserial ide_cd cdrom
ide_disk aic7xxx scsi_mod piix ide_core ehci_hcd uhci_hcd usbcore
CPU:0
EIP:0060:[inotify_dev_queue_event+353/368] Not tainted VLI
EFLAGS: 00010246 (2.6.10-paldo4)
EIP is at inotify_dev_queue_event+0x161/0x170
eax: ebx: d7a50f00 ecx: 0003 edx: c6c7a2cc
esi: edi: ebp: 0020 esp: c8b6bf6c
ds: 007b es: 007b ss: 0068
Process multiload-apple (pid: 2756, threadinfo=c8b6a000 task=e76bc020)
Stack: c014b27d ddc822e8 ddc822e8 cbda31ac 0020 c01d72c9 0024 d8dd3980
       f7772000 c8b6a000 c015826f b777e8fc b777e8fc 8000 c0103029 b777e8fc
Call Trace:
 [remove_vm_struct+93/144] remove_vm_struct+0x5d/0x90
 [inotify_inode_queue_event+73/128] inotify_inode_queue_event+0x49/0x80
 [sys_open+95/176] sys_open+0x5f/0xb0
 [sysenter_past_esp+82/117] sysenter_past_esp+0x52/0x75
Code: 24 18 8b 7c 24 1c 8b 6c 24 20 83 c4 24 c3 c7 04 24 00 00 00 00 8b 4c
24 0c ba 00 40 00 00 b8 ff ff ff ff e9 3d ff ff ff 8b 42 18 <80> 38 00 eb
bf 8d 76 00 8d bc 27 00 00 00 00 53 89 c3 8b 4b 20
 <6>note: multiload-apple[2756] exited with preempt_count 1

I can provide more information on request.

Thanks for any advice

Jürg

(please cc me on replies)

--
Juerg Billeter <[EMAIL PROTECTED]>
Re: patch to fix set_itimer() behaviour in boundary cases
Arjan van de Ven wrote:
> On Wed, 2005-01-19 at 15:51 -0800, George Anzinger wrote:
>> Arjan van de Ven wrote:
>>> On Sun, 2005-01-16 at 00:58 +, Alan Cox wrote:
>>>> On Sad, 2005-01-15 at 09:30, Andrew Morton wrote:
>>>>> Matthias Lang <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>> These are things we probably cannot change now. All three are
>>>>> arguably sensible behaviour and do satisfy the principle of least
>>>>> surprise. So there may be apps out there which will break if we
>>>>> "fix" these things. If the kernel version was 2.7.0 then well
>>>>> maybe...
>>>>
>>>> These are things we should fix. They are bugs. Since there is no 2.7
>>>> plan pick a date to fix it. We should certainly error the overflow
>>>> case *now* because the behaviour is undefined/broken. The other cases
>>>> I'm not clear about. setitimer() is a library interface and it can do
>>>> the basic checking and error if it wants to be strictly posixly
>>>> compliant.
>>>
>>> why error? I'm pretty sure we can make a loop in the setitimer code
>>> that detects we're at the end of jiffies but haven't used up the
>>> entire interval the user requested yet, so that the code should just
>>> do another round of sleeping...
>>
>> That would work for sleep (but glibc uses nanosleep for that) but an
>> itimer delivers a signal. Rather hard to trap that in glibc.
>
> This one I meant to fix in the kernel fwiw; we can put that loop inside
> the kernel easily I'm sure

Yes, but it will increase the data size of the timer...

--
George Anzinger   george@mvista.com
High-res-timers:  http://sourceforge.net/projects/high-res-timers/
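The "loop" idea can be modelled in a few lines: when the requested interval
is longer than one timer can express, arm the timer for the largest
representable chunk, and on each expiry re-arm for what is left; only the
final expiry would deliver the signal. This is a userspace model under
stated assumptions, not kernel code: MAX_SINGLE_DELAY, next_chunk() and
rounds_needed() are invented here, and the limit is taken to be a signed
32-bit tick count (about 24.8 days at 1000 ticks per second).

```c
#include <stdint.h>

/* Assumed limit: longest delay a single timer can represent, in ticks. */
#define MAX_SINGLE_DELAY ((uint64_t)INT32_MAX)

/* Length of the next timer to arm, given the ticks still owed. */
uint64_t next_chunk(uint64_t remaining)
{
    return remaining > MAX_SINGLE_DELAY ? MAX_SINGLE_DELAY : remaining;
}

/*
 * Number of timer expirations needed to cover `total` ticks.
 * All but the last would silently re-arm instead of signalling.
 */
unsigned rounds_needed(uint64_t total)
{
    unsigned rounds = 0;

    while (total > 0) {
        total -= next_chunk(total);
        rounds++;
    }
    return rounds;
}
```

Note that the count of ticks still owed has to be kept somewhere between
expirations, which is presumably the extra per-timer data mentioned in the
reply above.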
Re: [PATCH] dynamic tick patch
Tony Lindgren wrote:
> * George Anzinger [050119 16:25]:
>> Tony Lindgren wrote:
>>> * George Anzinger [050119 15:00]:
>>>> I don't think you will ever get good time if you EVER reprogram the
>>>> PIT. That is why the VST patch on sourceforge does NOT touch the PIT,
>>>> it only turns off the interrupt by interrupting the interrupt path
>>>> (not changing the PIT). This allows the PIT to be the "gold standard"
>>>> in time that it is designed to be. The wake up interrupt, then needs
>>>> to come from an independent timer. My patch requires a local APIC for
>>>> this. Patch is available at
>>>> http://sourceforge.net/projects/high-res-timers/
>>>
>>> Well on my test systems I have pretty good accurate time. But I agree,
>>> PIT is not the best option for interrupt. It should be possible to use
>>> other interrupt sources as well.
>>>
>>> It should not matter where the timer interrupt comes from, as long as
>>> it comes when programmed. Updating time should be separate from timer
>>> interrupts. Currently we have a problem where time is tied to the
>>> timer interrupt.
>>
>> In the HRT code time is most correctly stated as wall_time +
>> get_arch_cycles_since(wall_jiffies) (plus a conversion or two :)). This
>> is somewhat removed from the tick interrupt, but is resynced to that
>> interrupt more or less each interrupt.
>
> That sounds very accurate :)
>
>> A second issue is trying to get the jiffies update as close to the run
>> of the timer list as possible. Without this we have no hope of high res
>> timers.
>
> OK. But if the timer interrupt is separated from updating the time, the
> next timer interrupt should be programmable to happen exactly when a HRT
> timer needs it, right?

First, HRT uses a two phase system of timing. The first phase is the
normal timer list expires the timer. The timer is then handed to the high
res code which keeps a list of timers that are to expire prior to the next
jiffie. An interrupt is scheduled to make this happen. Depending on the
hardware available, this can come from the same timer or a different
timer. For example on x86 systems with a local apic we use the apic timer
to generate this interrupt. It triggers either a tasklet for UP or SMP
without per cpu timers or a soft irq for SMP systems with per cpu timers.

What this means is that, for timers near but just after a jiffie, the
run_timer list being late can make the HR timer late.

This code is on sourceforge if you want a closer look...

> Hmm, how about using a pool of programmable timers available on the
> system for the timer interrupts and HRT? Or is one interrupt source
> always enough?

Hardware heaven :), but no thanks. A reliable tick generator for the
jiffies timer and one additional timer (or one per cpu) works well in the
x86. If you have something like the PPC where you can mess with the timer
without losing time, that works well also.

The correct formulation would be a "clock" that can be read quickly and a
timer tied to the same "rock" that uses the same count units as the clock.
PARISC has a counter that just counts and a compare register. When they
are equal an interrupt is generated. That is a nice set up.

Now the X86 is bad and has little hope of being fixed for these reasons:
a.) the TSC is fast and easy to read but it's not clocked at any given
    frequency and, on some platforms, it changes without notifying the
    software.
b.) the PIT and the PMTIMER are both in I/O space and so take forever to
    access.
c.) All three of these use different units (but at least the PMTIMER is
    (supposed to be) related to the PIT clock).
d.) the HPET, again is in I/O space. I suspect that it uses a reasonable
    "rock" but, as I understand it, it knocks out the PIT and, of course
    it uses units unrelated to all the others.

--
George Anzinger   george@mvista.com
High-res-timers:  http://sourceforge.net/projects/high-res-timers/
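The "wall_time + get_arch_cycles_since(wall_jiffies)" formulation can be
sketched as a base time captured at the last tick plus the cycles elapsed
since then, converted to nanoseconds. The code below is a simplified model
that assumes a free-running counter with a constant, known frequency;
struct timebase, cycles_to_ns() and current_time_ns() are made-up names,
not the HRT patch's own.

```c
#include <stdint.h>

struct timebase {
    uint64_t wall_ns;         /* wall time recorded at the last tick, in ns */
    uint64_t cycles_at_tick;  /* counter value captured at that tick */
    uint64_t cycles_per_sec;  /* counter frequency, assumed constant */
};

/*
 * Convert a cycle count to nanoseconds. Whole seconds and the remainder
 * are converted separately so the multiplication does not overflow for
 * realistic frequencies (below roughly 18 GHz).
 */
static uint64_t cycles_to_ns(uint64_t cycles, uint64_t hz)
{
    return (cycles / hz) * 1000000000ull
         + (cycles % hz) * 1000000000ull / hz;
}

/* Current time = time at the last tick + time elapsed since that tick. */
uint64_t current_time_ns(const struct timebase *tb, uint64_t now_cycles)
{
    return tb->wall_ns
         + cycles_to_ns(now_cycles - tb->cycles_at_tick, tb->cycles_per_sec);
}
```

The constant-frequency assumption is exactly what point a.) says the TSC
cannot promise: if the counter changes rate without telling the software,
cycles_per_sec is wrong and the computed time drifts until the next resync
at a tick.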
Re: [RFC] tracepipe -- event streams, debugfs, and pipe_buffers
Zach Brown wrote:
> Thoughts? I, for one, am tired of writing throw-away per-cpu tracing
> patches ;)

Have you taken a look at relayfs and ltt?

Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [EMAIL PROTECTED] || 1-866-677-4546
Re: [PATCH][RFC] swsusp: speed up image restoring on x86-64
Hi!

> > > The readability of code is also important, IMHO.
> >
> > It did not seem too much better to me.
>
> Well, the beauty is in the eye of the beholder. :-)
>
> Still, it shrinks the code (22 lines vs 37 lines), it uses less GPRs
> (5 vs 7), it uses less SIB arithmetics (0 vs 4 times), it uses a well
> known scheme for copying data pages. As far as the result is concerned,
> it is equivalent to the existing code, but it's simpler (and faster).
> IMO, simpler code is always easier to understand.
>
> > > > If you want cheap way to speed it up, kill cr3 manipulation.
> > >
> > > Sure, but I think it's there for a reason.
> >
> > Reason is "to crash it early if we have wrong pagetables".
> >
> > > > Anyway, this is likely to clash with hugang's work; I'd prefer
> > > > this not to be applied.
> > >
> > > I am aware of that, but you are not going to merge the hugang's
> > > patches soon, are you? If necessary, I can change the patch to work
> > > with his code (hugang, what do you think?).
> >
> > I think it is just not worth the effort.
>
> Why? It won't take much time. I've spent more time for writing the
> messages in this thread ... ;-)

Well, I know that current code works. It was produced by C compiler,
btw. Now, new code works for you, but it was not in kernel for 4
releases, and... this code is pretty subtle. And it is hand-made, not
C produced.

So... your code may be better but I do not think it is so much better
that I'd like to risk it.
								Pavel
--
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!
Re: [patch 5/5] Disallow in-inode attributes for reserved inodes
On Jan 20, 2005  14:29 +0100, Andreas Gruenbacher wrote:
> The ea-in-inode patch totally relies on getting all the available inode space
> cleared out by the kernel (or mke2fs, or e2fsck). If this is not the case for
> any inode we find, then i_extra_isize may contain a random number, and we've
> just lost, period: There is no way of sanitizing a random i_extra_isize; we
> cannot know what the right number would be.

The large-inode support is designed to allow different amounts of "fixed"
optional data (i.e. what is stored inside i_extra_isize), so it is valid
to set this to 4 (i.e. just enough to hold i_extra_isize itself) and store
the EA data after that. Any code which reads "fixed" fields from a large
inode (e.g. i_mtime_nsec) needs to validate that i_extra_isize on that
inode is large enough for that data to actually be in the fixed area in
the large inode.

If the kernel is setting i_extra_isize > 4 (i.e. it is storing optional
fields there like i_mtime_msb_and_ns) it should/is-able-to also initialize
those values since it should know what they are or they shouldn't be in
struct ext3_inode.

The whole point of i_extra_isize is that it is possible for inodes to have
different amounts of the optional fixed fields in each large inode,
depending on what the kernel that wrote the inode knew about. So any value
for i_extra_isize is valid as long as those fields are initialized. If we
arbitrarily set i_extra_isize = 4 instead of leaving the bad value this is
no different than waiting for e2fsck to do the same.

> > It is debatable whether we should mark inodes bad if the i_extra_isize
> > field is bad, or if we should just initialize i_extra_isize in that case.
>
> IMHO it's not debatable. Taking an i_extra_isize that looks odd and simply
> changing it to something we think is better is a really bad idea.
>
> You may have an access acl on the inode. Not being able to read an access acl
> is a clear sign of trouble. The same applies for everything else in the
> system.* and security.* namespaces, at least.

Well, I said it was debatable and we're having a debate ;-). I don't have
a strong opinion either way. If we ext3_error() in this case at least we
will check the fs on the next boot (which will just zero i_extra_isize)
instead of never doing anything to resolve the situation.

> > For the root and lost+found inodes it looks like we can never store an
> > EA in the extra part of the inode regardless of whether i_extra_isize is
> > good or not. If a bad value is found we could just initialize it and
> > start using that space (though not print an ext3_error() in that case,
> > an ext3_warning() if anything since this is probably the fault of mke2fs).
>
> I disagree. We cannot just use the space when we think the inode is corrupted.

But as your patch stands it doesn't ever check if i_extra_isize is valid
for the root or lost+found inode. It just always sets i_extra_isize = 0
and never uses it. Given that the root inode is fairly high-traffic it
makes sense to use the faster EA space if it is available.

If these inodes have a BAD i_extra_isize it is OK to skip it, but I'm not
so keen to have an ext3_error() there. If the user doesn't have an e2fsck
with ea-in-inode support there isn't anything they can do to fix it and
they will get a full e2fsck on each boot. Even so, for the effort of
setting i_extra_isize = 4 (or larger if we initialize the fixed fields) we
can do the equivalent of what e2fsck will do when it finds a bogus value.

The good news is that we can still apply your patch as-is and address my
concerns later since this is a transient issue. Also, given that there are
probably only a handful of filesystems in the world using large inodes
(excluding Lustre filesystems which aren't affected by this) I don't think
it is a pressing issue yet.

I'm going to be away for 2 weeks, so I'll say accept this patch as is and
we can look at it again when I get back, and maybe Ted and Stephen will
have weighed in on this issue also.

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://members.shaw.ca/adilger/             http://members.shaw.ca/golinux/
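The validation being argued about can be made concrete with a small model.
The sketch below assumes a 128-byte classic inode followed by an extra area
whose first field is i_extra_isize; the struct layout, the i_mtime_extra
field, extra_isize_ok() and FIELD_FITS() are simplified stand-ins invented
for illustration, not the ext3 patch's actual definitions.

```c
#include <stddef.h>
#include <stdint.h>

#define GOOD_OLD_INODE_SIZE 128u   /* size of the classic on-disk inode */

/* Simplified model of the area that follows the classic inode. */
struct extra_area {
    uint16_t i_extra_isize;  /* bytes of fixed fields, counted from here */
    uint16_t i_pad;
    uint32_t i_mtime_extra;  /* illustrative optional fixed field */
};

/*
 * Is i_extra_isize plausible for an inode of inode_size bytes?
 * It must cover at least itself (4 bytes with padding) and must not
 * reach past the end of the on-disk inode.
 */
int extra_isize_ok(uint16_t extra_isize, uint32_t inode_size)
{
    return inode_size > GOOD_OLD_INODE_SIZE &&
           extra_isize >= 4 &&
           extra_isize <= inode_size - GOOD_OLD_INODE_SIZE;
}

/* Does the fixed area declared by extra_isize actually contain `field`? */
#define FIELD_FITS(extra_isize, field)                          \
    ((size_t)(extra_isize) >=                                   \
     offsetof(struct extra_area, field) +                       \
     sizeof(((struct extra_area *)0)->field))
```

In this model an inode written with i_extra_isize = 4 is perfectly valid,
but a reader must treat i_mtime_extra as absent on it, because
FIELD_FITS(4, i_mtime_extra) is false. A value such as 200 on a 256-byte
inode fails extra_isize_ok(), and that is the case where the thread
disagrees on whether to flag the inode or reset the field.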
Re: [PATCH][RFC] swsusp: speed up image restoring on x86-64
Hi,

On Thursday, 20 of January 2005 23:06, Pavel Machek wrote:
> Hi!
>
> > > > The following patch speeds up the restoring of swsusp images on
> > > > x86-64 and makes the assembly code more readable (tested and works
> > > > on AMD64). It's against 2.6.11-rc1-mm1, but applies to
> > > > 2.6.11-rc1-mm2. Please consider for applying.
> > >
> > > Can you really measure the speedup?
> >
> > In terms of time? Probably I can, but I prefer to measure it in terms
> > of the numbers of operations to be performed.
> >
> > With this patch, at least 8 times less memory accesses are required to
> > restore an image than without it, and in the original code cr3 is
> > reloaded after copying each _byte_, let alone the SIB arithmetics.
> > I'd expect it to be 10 times faster or so.
>
> Well, 8 times less cr3 reloads may be significant... for the copy
> loop. Speeding up copy loop that takes ... 100msec?... of whole
> resume (30 seconds) does not seem too important to me.
>
> > The readability of code is also important, IMHO.
>
> It did not seem too much better to me.

Well, the beauty is in the eye of the beholder. :-)

Still, it shrinks the code (22 lines vs 37 lines), it uses less GPRs (5 vs
7), it uses less SIB arithmetics (0 vs 4 times), it uses a well known
scheme for copying data pages. As far as the result is concerned, it is
equivalent to the existing code, but it's simpler (and faster). IMO,
simpler code is always easier to understand.

> > > If you want cheap way to speed it up, kill cr3 manipulation.
> >
> > Sure, but I think it's there for a reason.
>
> Reason is "to crash it early if we have wrong pagetables".
>
> > > Anyway, this is likely to clash with hugang's work; I'd prefer this
> > > not to be applied.
> >
> > I am aware of that, but you are not going to merge the hugang's
> > patches soon, are you? If necessary, I can change the patch to work
> > with his code (hugang, what do you think?).
>
> I think it is just not worth the effort.

Why? It won't take much time. I've spent more time for writing the
messages in this thread ... ;-)

Greets,
RJW

--
- Would you tell me, please, which way I ought to go from here?
- That depends a good deal on where you want to get to.
		-- Lewis Carroll "Alice's Adventures in Wonderland"
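The "8 times less memory accesses" figure falls out of the word size: a
4096-byte page takes 4096 stores when copied a byte at a time and 512 when
copied as 64-bit words. The patch itself is assembly and is not shown in
this thread, so the C below is only a model of the idea; struct page_pair
is a hypothetical stand-in for the list of pages to restore, and
restore_pages() is not a kernel function.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_BYTES     4096u
#define WORDS_PER_PAGE (PAGE_BYTES / sizeof(uint64_t))

/* Hypothetical stand-in for one entry in the list of pages to restore. */
struct page_pair {
    const uint64_t *src;     /* saved copy of the page */
    uint64_t *dst;           /* where the page originally lived */
    struct page_pair *next;
};

/*
 * Copy every page on the list back to its original location,
 * one 64-bit word at a time. Returns the number of word copies made.
 */
size_t restore_pages(const struct page_pair *list)
{
    size_t copies = 0;

    for (; list; list = list->next) {
        for (size_t i = 0; i < WORDS_PER_PAGE; i++)
            list->dst[i] = list->src[i];
        copies += WORDS_PER_PAGE;
    }
    return copies;
}
```

For two pages this performs 1024 word copies where a byte loop would need
8192, which matches the factor of eight quoted above. Whether that matters
against a 30 second resume is the separate question the thread is arguing
about.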