Re: crash in kmem_cache_init
On Tue, Jan 22, Christoph Lameter wrote:

> > 0xc00fe018 is in setup_cpu_cache
> > (/home/olaf/kernel/git/linux-2.6-numa/mm/slab.c:2111).
> > 2106                            BUG_ON(!cachep->nodelists[node]);
> > 2107                            kmem_list3_init(cachep->nodelists[node]);
> > 2108                    }
> > 2109            }
> > 2110    }
> >
>
> if (cachep->nodelists[numa_node_id()])
>         return;

Does not help.

Linux version 2.6.24-rc8-ppc64 ([EMAIL PROTECTED]) (gcc version 4.1.2 20070115 (prerelease) (SUSE Linux)) #48 SMP Wed Jan 23 08:54:23 CET 2008
[boot]0012 Setup Arch
EEH: PCI Enhanced I/O Error Handling Enabled
PPC64 nvram contains 8192 bytes
Zone PFN ranges:
  DMA             0 ->   892928
  Normal     892928 ->   892928
Movable zone start PFN for each node
early_node_map[1] active PFN ranges
    1:        0 ->   892928
Could not find start_pfn for node 0
[boot]0015 Setup Done
Built 2 zonelists in Node order, mobility grouping on.  Total pages: 880720
Policy zone: DMA
Kernel command line: debug xmon=on panic=1
[boot]0020 XICS Init
xics: no ISA interrupt controller
[boot]0021 XICS Done
PID hash table entries: 4096 (order: 12, 32768 bytes)
time_init: decrementer frequency = 275.07 MHz
time_init: processor frequency   = 2197.80 MHz
clocksource: timebase mult[e8ab05] shift[22] registered
clockevent: decrementer mult[466a] shift[16] cpu[0]
Console: colour dummy device 80x25
console handover: boot [udbg-1] -> real [hvc0]
Dentry cache hash table entries: 524288 (order: 10, 4194304 bytes)
Inode-cache hash table entries: 262144 (order: 9, 2097152 bytes)
freeing bootmem node 1
Memory: 3496632k/3571712k available (6188k kernel code, 75080k reserved, 1324k data, 1220k bss, 304k init)
Kernel panic - not syncing: kmem_cache_create(): failed to create slab `size-32(DMA)'

Rebooting in 1 seconds..
---
 mm/slab.c |   17 ++---
 1 file changed, 14 insertions(+), 3 deletions(-)

--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1590,7 +1590,7 @@ void __init kmem_cache_init(void)
 		/* Replace the static kmem_list3 structures for the boot cpu */
 		init_list(&cache_cache, &initkmem_list3[CACHE_CACHE], node);
 
-		for_each_node_state(nid, N_NORMAL_MEMORY) {
+		for_each_online_node(nid) {
 			init_list(malloc_sizes[INDEX_AC].cs_cachep,
 				  &initkmem_list3[SIZE_AC + nid], nid);
 
@@ -1968,7 +1968,7 @@ static void __init set_up_list3s(struct
 {
 	int node;
 
-	for_each_node_state(node, N_NORMAL_MEMORY) {
+	for_each_online_node(node) {
 		cachep->nodelists[node] = &initkmem_list3[index + node];
 		cachep->nodelists[node]->next_reap = jiffies +
 		    REAPTIMEOUT_LIST3 +
@@ -2108,6 +2108,8 @@ static int __init_refok setup_cpu_cache(
 			}
 		}
 	}
+	if (!cachep->nodelists[numa_node_id()])
+		return -ENODEV;
 	cachep->nodelists[numa_node_id()]->next_reap =
 			jiffies + REAPTIMEOUT_LIST3 +
 			((unsigned long)cachep) % REAPTIMEOUT_LIST3;
@@ -2775,6 +2777,11 @@ static int cache_grow(struct kmem_cache
 	/* Take the l3 list lock to change the colour_next on this node */
 	check_irq_off();
 	l3 = cachep->nodelists[nodeid];
+	if (!l3) {
+		nodeid = numa_node_id();
+		l3 = cachep->nodelists[nodeid];
+	}
+	BUG_ON(!l3);
 	spin_lock(&l3->list_lock);
 
 	/* Get colour for the slab, and cal the next value. */
@@ -3317,6 +3324,10 @@ static void *cache_alloc_node(struct
 	int x;
 
 	l3 = cachep->nodelists[nodeid];
+	if (!l3) {
+		nodeid = numa_node_id();
+		l3 = cachep->nodelists[nodeid];
+	}
 	BUG_ON(!l3);
 
 retry:
@@ -3815,7 +3826,7 @@ static int alloc_kmemlist(struct kmem_ca
 	struct array_cache *new_shared;
 	struct array_cache **new_alien = NULL;
 
-	for_each_node_state(node, N_NORMAL_MEMORY) {
+	for_each_online_node(node) {
 
 		if (use_alien_caches) {
 			new_alien = alloc_alien_cache(node, cachep->limit);
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
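The fallback that the patch above adds to cache_grow() and cache_alloc_node() can be looked at in isolation. What follows is only a userspace sketch under stated assumptions, not kernel code: struct node_list, struct cache, MAX_NODES and pick_node_list() are hypothetical names, and the local_node argument stands in for numa_node_id().

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 4	/* hypothetical; stands in for MAX_NUMNODES */

struct node_list {	/* models struct kmem_list3 */
	int free_objects;
};

struct cache {		/* models struct kmem_cache */
	struct node_list *nodelists[MAX_NODES];
};

/*
 * Return the per-node list for *nodeid. If that node has no list
 * (for example a node without memory), fall back to local_node and
 * update *nodeid so the caller keeps using the node actually chosen.
 * Returns NULL if the local node has no list either; the patch uses
 * BUG_ON() at that point.
 */
static struct node_list *pick_node_list(struct cache *c, int *nodeid,
					int local_node)
{
	struct node_list *l3;

	assert(*nodeid >= 0 && *nodeid < MAX_NODES);
	assert(local_node >= 0 && local_node < MAX_NODES);

	l3 = c->nodelists[*nodeid];
	if (!l3) {
		*nodeid = local_node;
		l3 = c->nodelists[*nodeid];
	}
	return l3;
}
```

Updating the node id together with the pointer matters in this model because later steps keep using the id; whether that is sufficient for the real slab code is exactly what the thread is still testing.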
Re: 24rc8: unregister_netdevice: waiting for ... to become free. Usage count = 1?
On Tue, 2008-01-22 at 22:44 -0800, David Miller wrote:
> From: Soeren Sonnenburg <[EMAIL PROTECTED]>
> Date: Wed, 23 Jan 2008 07:42:21 +0100
>
> > Dear all,
> >
> > since some 2.6.24rc version I suddenly experience such messages on
> > console when trying to shutdown a vpn connection:
> >
> > unregister_netdevice: waiting for tun0 to become free. Usage count = 1
> >
> > or when removing an usb wlan dongle (although it was ifconfig wlan0
> > down'd before)
>
> Current GIT already has a fix for this, attached below:

Thank you very much for pointing this out!

git pull ; make ; ...

Soeren
Re: Could you please merge the x86_64 EFI boot support patchset?
Huang wrote:
> This patchset has been merged into Linux 2.6.24.

Excellent.

> Unfortunately, the new EFI support patches do not use EFI memory map for
> system boot up ... So, I think the resolution for your problem is the
> "struct setup_data" mechanism proposed by H. Peter Anvin.

So you're saying that the EFI in the kernel now still won't support more than 128 or so chunks of memory in the boottime memory map, because it still goes via the legacy E820h memory map code?

I'll have to study the code more and give it a try.

Are you optimistic that some variation of H. Peter Anvin's "struct setup_data" mechanism will make it into 2.6.25 or thereabouts?

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[EMAIL PROTECTED]> 1.940.382.4214
Re: 2.6.24-rc8-mm1 : net tcp_input.c warnings
On Jan 23, 2008 3:41 PM, Ilpo Järvinen <[EMAIL PROTECTED]> wrote:
> On Tue, 22 Jan 2008, David Miller wrote:
>
> > From: "Dave Young" <[EMAIL PROTECTED]>
> > Date: Wed, 23 Jan 2008 09:44:30 +0800
> >
> > > On Jan 22, 2008 6:47 PM, Ilpo Järvinen <[EMAIL PROTECTED]> wrote:
> > > > [PATCH] [TCP]: debug S+L
> > >
> > > Thanks, If there's new findings I will let you know.
> >
> > Thanks for helping with this bug Dave.
>
> I noticed, btw, that this thing might (is likely to) spuriously trigger at
> WARN_ON(sacked != tp->sacked_out); because those won't be equal when SACK
> is not enabled. If that happens too often, I'll send a fixed patch for
> it; still, the fact that I print tp->rx_opt.sack_ok already allows
> identification of those cases, as it's zero when SACK is not enabled.
>
> Just ask if you need the updated debug patch.

Thanks, please send, I would like to get it.

> --
>  i.
Re: 2.6.24-rc8-mm1 : net tcp_input.c warnings
On Tue, 22 Jan 2008, David Miller wrote:
> From: "Dave Young" <[EMAIL PROTECTED]>
> Date: Wed, 23 Jan 2008 09:44:30 +0800
>
> > On Jan 22, 2008 6:47 PM, Ilpo Järvinen <[EMAIL PROTECTED]> wrote:
> > > [PATCH] [TCP]: debug S+L
> >
> > Thanks, If there's new findings I will let you know.
>
> Thanks for helping with this bug Dave.

I noticed, btw, that this thing might (is likely to) spuriously trigger at WARN_ON(sacked != tp->sacked_out); because those won't be equal when SACK is not enabled. If that happens too often, I'll send a fixed patch for it; still, the fact that I print tp->rx_opt.sack_ok already allows identification of those cases, as it's zero when SACK is not enabled.

Just ask if you need the updated debug patch.

--
 i.
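The spurious trigger described above could be avoided by skipping the comparison when SACK is off. The following is a minimal sketch of that idea only, not the actual debug patch: struct tp_model and sacked_mismatch() are hypothetical names, and the two fields merely model tp->sacked_out and tp->rx_opt.sack_ok.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the few tcp_sock fields the mail mentions. */
struct tp_model {
	unsigned int sacked_out;	/* models tp->sacked_out */
	unsigned int sack_ok;		/* models tp->rx_opt.sack_ok; 0 when SACK is off */
};

/*
 * Return true when the recounted total disagrees with sacked_out in a
 * way worth warning about. Without SACK the two values are not
 * expected to match (per the mail), so the comparison is skipped.
 */
static bool sacked_mismatch(const struct tp_model *tp, unsigned int sacked)
{
	assert(tp != NULL);

	if (!tp->sack_ok)
		return false;
	return sacked != tp->sacked_out;
}
```

In the kernel the result would feed something like WARN_ON(); printing sack_ok next to the warning, as the existing debug patch does, gives the same information after the fact.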
Re: [PATCH] sound: fix opti9xx/miro section mismatch
On Tue, Jan 22, 2008 at 09:39:47PM -0800, Randy Dunlap wrote: > From: Randy Dunlap <[EMAIL PROTECTED]> > > snd_opti93x_mixer() is only called by __devinit snd_opti93x_probe(), > so the former can also be __devinit. > > snd_miro_mixer() is only called by __devinit snd_miro_probe(), > so the former can also be __devinit. > > sound/isa/opti9xx/opti92x-ad1848.c: > WARNING: vmlinux.o(.text+0xf91cd7): Section mismatch: reference to > .init.data:snd_opti93x_controls (between 'snd_opti93x_mixer' and > 'snd_card_opti9xx_free') > WARNING: vmlinux.o(.text+0xf91d66): Section mismatch: reference to > .init.data:snd_miro_controls (between 'snd_opti93x_mixer' and > 'snd_card_opti9xx_free') > > opti9xx/miro.c: > WARNING: vmlinux.o(.text+0xf926c2): Section mismatch: reference to > .init.data:snd_miro_controls (between 'snd_miro_mixer' and > 'snd_legacy_find_free_ioport') > WARNING: vmlinux.o(.text+0xf926e5): Section mismatch: reference to > .init.data:snd_miro_eq_controls (between 'snd_miro_mixer' and > 'snd_legacy_find_free_ioport') > WARNING: vmlinux.o(.text+0xf926f9): Section mismatch: reference to > .init.data:snd_miro_line_control (between 'snd_miro_mixer' and > 'snd_legacy_find_free_ioport') > WARNING: vmlinux.o(.text+0xf92716): Section mismatch: reference to > .init.data:snd_miro_amp_control (between 'snd_miro_mixer' and > 'snd_legacy_find_free_ioport') > WARNING: vmlinux.o(.text+0xf9273e): Section mismatch: reference to > .init.data:snd_miro_preamp_control (between 'snd_miro_mixer' and > 'snd_legacy_find_free_ioport') > WARNING: vmlinux.o(.text+0xf92764): Section mismatch: reference to > .init.data:snd_miro_capture_control (between 'snd_miro_mixer' and > 'snd_legacy_find_free_ioport') > WARNING: vmlinux.o(.text+0xf92783): Section mismatch: reference to > .init.data:snd_miro_radio_control (between 'snd_miro_mixer' and > 'snd_legacy_find_free_ioport') > WARNING: vmlinux.o(.text+0xf9279a): Section mismatch: reference to > .init.data:snd_miro_eq_controls (between 
'snd_miro_mixer' and > 'snd_legacy_find_free_ioport') > WARNING: vmlinux.o(.text+0xf927b9): Section mismatch: reference to > .init.data:snd_miro_radio_control (between 'snd_miro_mixer' and > 'snd_legacy_find_free_ioport') > > Signed-off-by: Randy Dunlap <[EMAIL PROTECTED]> Acked-by: Sam Ravnborg <[EMAIL PROTECTED]> > --- > sound/isa/opti9xx/miro.c |2 +- > sound/isa/opti9xx/opti92x-ad1848.c |2 +- > 2 files changed, 2 insertions(+), 2 deletions(-) > > --- linux-2.6.24-rc8-git5.orig/sound/isa/opti9xx/miro.c > +++ linux-2.6.24-rc8-git5/sound/isa/opti9xx/miro.c > @@ -662,7 +662,7 @@ static int __devinit snd_set_aci_init_va > return 0; > } > > -static int snd_miro_mixer(struct snd_miro *miro) > +static int __devinit snd_miro_mixer(struct snd_miro *miro) > { > struct snd_card *card; > unsigned int idx; > --- linux-2.6.24-rc8-git5.orig/sound/isa/opti9xx/opti92x-ad1848.c > +++ linux-2.6.24-rc8-git5/sound/isa/opti9xx/opti92x-ad1848.c > @@ -1595,7 +1595,7 @@ OPTi93X_DOUBLE("Capture Volume", 0, OPTi > } > }; > > -static int snd_opti93x_mixer(struct snd_opti93x *chip) > +static int __devinit snd_opti93x_mixer(struct snd_opti93x *chip) > { > struct snd_card *card; > struct snd_kcontrol_new knew; > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to [EMAIL PROTECTED] > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Could you please merge the x86_64 EFI boot support patchset?
On Wed, 2008-01-23 at 01:00 -0600, Paul Jackson wrote:
> In Nov 2007, Huang Ying wrote:
> > Could you please merge the following patchset:
> >
> > [PATCH 0/2 -v3] x86_64 EFI boot support
> > [PATCH 1/2 -v3] x86_64 EFI boot support: EFI frame buffer driver
> > [PATCH 2/2 -v3] x86_64 EFI boot support: EFI boot document
>
> Huang - what has become of this patchset?

This patchset has been merged into Linux 2.6.24.

> We (SGI) have designs on a big honkin NUMA box using x86_64 arch, and
> we are running up against the arbitrary limits on the memory map size
> due to the ancient 4k zero page size limit (hence H. Peter Anvin added
> to the CC list, since he knows a bazillion times more about any such
> limits than I do.). The limits on 128 or so local chunks of memory
> imposed by the kernel code that invokes Int 15 E820h are too small for
> us.
>
> Since we are already accustomed to dealing with EFI on our IA64
> Itanium boxes, I'm figuring that it will be easier in the short run,
> and better in the long run, to just use EFI on these upcoming big
> x86_64 NUMA boxes.

Unfortunately, the new EFI support patches do not use the EFI memory map for system boot up (just for runtime service support). The EFI memory map is converted into an E820 memory map in the bootloader. The main reason for this change is to remove the duplication between the E820 memory map and EFI memory map handling code.

So, I think the resolution for your problem is the "struct setup_data" mechanism proposed by H. Peter Anvin. That is a linked-list data structure for boot parameters without a size limitation. I once wrote a patch for it, but there were some issues with the implementation scheme. Most people think it should be based on the "early reservation/early allocation" mechanism from Andi Kleen, so I am waiting for that to be merged into -mm or git-x86.

Best Regards,
Huang Ying
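For readers unfamiliar with the idea, a linked list of boot parameter blocks removes the fixed-size limit because each block only needs to point at the next one. The sketch below is an illustration under assumptions, not the proposed kernel interface: struct setup_node, its field names and setup_total_len() are hypothetical, and a real boot protocol would most likely link blocks by physical address rather than by C pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical node layout for a "struct setup_data"-style list.
 * The field names are assumptions made for this illustration.
 */
struct setup_node {
	struct setup_node *next;	/* NULL terminates the list */
	uint32_t type;			/* kind of payload carried by this node */
	uint32_t len;			/* payload length in bytes */
	const void *data;		/* payload, e.g. a block of memory map entries */
};

/* Sum the payload bytes of every node of the given type. */
static size_t setup_total_len(const struct setup_node *head, uint32_t type)
{
	size_t total = 0;

	for (const struct setup_node *n = head; n != NULL; n = n->next) {
		if (n->type != type)
			continue;
		assert(total + n->len >= total);	/* no wrap-around in this sketch */
		total += n->len;
	}
	return total;
}
```

With such a list, a memory map larger than the 128 or so entries discussed above could be spread over as many nodes as needed, at the cost of the consumer having to walk the list early in boot.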
Re: [PATCH] sound: fix cs5535 section mismatch
On Tue, Jan 22, 2008 at 09:39:43PM -0800, Randy Dunlap wrote: > From: Randy Dunlap <[EMAIL PROTECTED]> > > snd_cs5535audio_mixer() is only called by __devinit snd_cs5535audio_probe(), > so the mixer function can also be __devinit. > > WARNING: vmlinux.o(.text+0xfdbba0): Section mismatch: reference to > .init.data:ac97_quirks (between 'snd_cs5535audio_mixer' and 'process_bm0_irq') > > Signed-off-by: Randy Dunlap <[EMAIL PROTECTED]> Acked-by: Sam Ravnborg <[EMAIL PROTECTED]> > --- > sound/pci/cs5535audio/cs5535audio.c |2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > --- linux-2.6.24-rc8-git5.orig/sound/pci/cs5535audio/cs5535audio.c > +++ linux-2.6.24-rc8-git5/sound/pci/cs5535audio/cs5535audio.c > @@ -145,7 +145,7 @@ static unsigned short snd_cs5535audio_ac > return snd_cs5535audio_codec_read(cs5535au, reg); > } > > -static int snd_cs5535audio_mixer(struct cs5535audio *cs5535au) > +static int __devinit snd_cs5535audio_mixer(struct cs5535audio *cs5535au) > { > struct snd_card *card = cs5535au->card; > struct snd_ac97_bus *pbus; > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to [EMAIL PROTECTED] > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] radio: fix sf16fmi section mismatch
On Tue, Jan 22, 2008 at 09:39:39PM -0800, Randy Dunlap wrote: > From: Randy Dunlap <[EMAIL PROTECTED]> > > isapnp_fmi_probe() is only called by fmi_init(), which is __init, > so isapnp_fmi_probe() can also be __init. > > media/radio/radio-sf16fmi.c: > WARNING: vmlinux.o(.text+0x994e19): Section mismatch: reference to > .init.data: (between 'isapnp_fmi_probe' and 'vidioc_s_tuner') > WARNING: vmlinux.o(.text+0x994e22): Section mismatch: reference to > .init.data: (between 'isapnp_fmi_probe' and 'vidioc_s_tuner') > WARNING: vmlinux.o(.text+0x994e3a): Section mismatch: reference to > .init.data:id_table (between 'isapnp_fmi_probe' and 'vidioc_s_tuner') > > Signed-off-by: Randy Dunlap <[EMAIL PROTECTED]> Acked-by: Sam Ravnborg <[EMAIL PROTECTED]> > --- > drivers/media/radio/radio-sf16fmi.c |2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > --- linux-2.6.24-rc8-git5.orig/drivers/media/radio/radio-sf16fmi.c > +++ linux-2.6.24-rc8-git5/drivers/media/radio/radio-sf16fmi.c > @@ -321,7 +321,7 @@ static struct isapnp_device_id id_table[ > > MODULE_DEVICE_TABLE(isapnp, id_table); > > -static int isapnp_fmi_probe(void) > +static int __init isapnp_fmi_probe(void) > { > int i = 0; > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to [EMAIL PROTECTED] > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Could you please merge the x86_64 EFI boot support patchset?
In Nov 2007, Huang Ying wrote:
> Could you please merge the following patchset:
>
> [PATCH 0/2 -v3] x86_64 EFI boot support
> [PATCH 1/2 -v3] x86_64 EFI boot support: EFI frame buffer driver
> [PATCH 2/2 -v3] x86_64 EFI boot support: EFI boot document

Huang - what has become of this patchset?

We (SGI) have designs on a big honkin NUMA box using x86_64 arch, and we are running up against the arbitrary limits on the memory map size due to the ancient 4k zero page size limit (hence H. Peter Anvin added to the CC list, since he knows a bazillion times more about any such limits than I do.). The limits on 128 or so local chunks of memory imposed by the kernel code that invokes Int 15 E820h are too small for us.

Since we are already accustomed to dealing with EFI on our IA64 Itanium boxes, I'm figuring that it will be easier in the short run, and better in the long run, to just use EFI on these upcoming big x86_64 NUMA boxes.

But we'd need to get this patch, or something equivalent, into the Linux kernel in the next 2.6.* cycle or so, for this to work for us.

Also, is there a current version of this patch set available, against either 2.6.24-rc8-mm1 or Ingo's recent x86 git tree? I should try it out.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[EMAIL PROTECTED]> 1.940.382.4214
Re: 24rc8: unregister_netdevice: waiting for ... to become free. Usage count = 1?
From: Soeren Sonnenburg <[EMAIL PROTECTED]>
Date: Wed, 23 Jan 2008 07:42:21 +0100

> Dear all,
>
> since some 2.6.24rc version I suddenly experience such messages on
> console when trying to shutdown a vpn connection:
>
> unregister_netdevice: waiting for tun0 to become free. Usage count = 1
>
> or when removing an usb wlan dongle (although it was ifconfig wlan0
> down'd before)

Current GIT already has a fix for this, attached below:

[NEIGH]: Revert 'Fix race between neigh_parms_release and neightbl_fill_parms'

Commit 9cd40029423701c376391da59d2c6469672b4bed (Fix race between
neigh_parms_release and neightbl_fill_parms) introduced device
reference counting regressions for several people, see:

	http://bugzilla.kernel.org/show_bug.cgi?id=9778

for example.

Signed-off-by: David S. Miller <[EMAIL PROTECTED]>

diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index cc8a2f1..29b8ee4 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -1316,6 +1316,8 @@ void neigh_parms_release(struct neigh_table *tbl, struct neigh_parms *parms)
 			*p = parms->next;
 			parms->dead = 1;
 			write_unlock_bh(&tbl->lock);
+			if (parms->dev)
+				dev_put(parms->dev);
 			call_rcu(&parms->rcu_head, neigh_rcu_free_parms);
 			return;
 		}
@@ -1326,8 +1328,6 @@ void neigh_parms_release(struct neigh_table *tbl, struct neigh_parms *parms)
 
 void neigh_parms_destroy(struct neigh_parms *parms)
 {
-	if (parms->dev)
-		dev_put(parms->dev);
 	kfree(parms);
 }
 
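The diff above moves dev_put() out of neigh_parms_destroy() and back into neigh_parms_release(), so the device reference is dropped when the parms are released rather than when the structure is finally freed. A small userspace model may make the effect on "Usage count" easier to see. Every name in it (dev_model, parms_model and the helper functions) is hypothetical, and it ignores locking and RCU entirely.

```c
#include <assert.h>
#include <stdlib.h>

struct dev_model {		/* models the refcount of struct net_device */
	int refcnt;
};

struct parms_model {		/* models struct neigh_parms */
	struct dev_model *dev;
	int dead;
};

static void dev_hold_model(struct dev_model *d)
{
	d->refcnt++;
}

static void dev_put_model(struct dev_model *d)
{
	assert(d->refcnt > 0);
	d->refcnt--;
}

/* Allocate parms that pin the device with one reference. */
static struct parms_model *parms_alloc(struct dev_model *dev)
{
	struct parms_model *p = malloc(sizeof(*p));

	if (!p)
		return NULL;
	p->dev = dev;
	p->dead = 0;
	dev_hold_model(dev);
	return p;
}

/* Release: mark dead and drop the device reference immediately. */
static void parms_release(struct parms_model *p)
{
	p->dead = 1;
	if (p->dev)
		dev_put_model(p->dev);
}

/* Destroy: runs later in the real code; here it only frees memory. */
static void parms_destroy(struct parms_model *p)
{
	free(p);
}
```

In this model the count reaches zero as soon as parms_release() returns, so anything waiting for the device to become free does not depend on when parms_destroy() eventually runs.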
24rc8: unregister_netdevice: waiting for ... to become free. Usage count = 1?
Dear all,

since some 2.6.24rc version I suddenly experience such messages on console when trying to shutdown a vpn connection:

unregister_netdevice: waiting for tun0 to become free. Usage count = 1

or when removing an usb wlan dongle (although it was ifconfig wlan0 down'd before)

unregister_netdevice: waiting for wlan0 to become free. Usage count = 1

Then only when all potential connections going over that iface are gone these messages disappear (sometimes this does not happen and the kernel then hangs on reboot...)

Is this intended?

Soeren
[RFC] some page can't be migrated
Anonymous page might have fs-private metadata, the page is truncated. As the page has no mapping, page migration refuses to migrate the page. It appears the page is only freed in page reclaim, and if zone watermark is low, the page is never freed; as a result migration always fails. I thought we could free the metadata so such a page can be freed in migration, which would make migration more reliable?

Thanks,
Shaohua

diff --git a/mm/migrate.c b/mm/migrate.c
index 6a207e8..6bc38f7 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -633,6 +633,17 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
 			goto unlock;
 		wait_on_page_writeback(page);
 	}
+
+	/*
+	 * See truncate_complete_page(). Anonymous page might have
+	 * fs-private metadata, the page is truncated. Such page can't be
+	 * migrated. Try to free metadata, so the page can be freed.
+	 */
+	if (!page->mapping && !PageAnon(page) && PagePrivate(page)) {
+		try_to_release_page(page, GFP_KERNEL);
+		goto unlock;
+	}
+
 	/*
 	 * By try_to_unmap(), page->mapcount goes down to 0 here. In this case,
 	 * we cannot notice that anon_vma is freed while we migrates a page.
Re: do_remount_sb(RDONLY) race? (was: XFS oops under 2.6.23.9)
On Wed, Jan 23, 2008 at 04:24:33PM +1030, Jonathan Woithe wrote:
> > On Wed, Jan 23, 2008 at 03:00:48PM +1030, Jonathan Woithe wrote:
> > > Last night my laptop suffered an oops during closedown. The full oops
> > > reports can be downloaded from
> > >
> > > http://www.atrad.com.au/~jwoithe/xfs_oops/
> >
> > Assertion failed: atomic_read(&mp->m_active_trans) == 0, file:
> > fs/xfs/xfs_vfsops.c, line 689.
> >
> > The remount read-only of the root drive supposedly completed
> > while there was still active modification of the filesystem
> > taking place. .
> > The read only flag only gets set *after* we've made the filesystem
> > readonly, which means before we are truly read only, we can race
> > with other threads opening files read/write or filesystem
> > modifications can take place.
> >
> > The result of that race (if it is really unsafe) will be the assert
> > you see. The patch I wrote a couple of months ago to fix the problem
> > is attached below
>
> Thanks for the patch. I will apply it and see what happens.
>
> Will this be in 2.6.24?

No - because hitting the problem is so rare that I'm not even sure it's a problem. One of the VFS gurus will need to comment on whether this really is a problem, and if so the correct fix is to change do_remount_sb() so that it closes the hole for everyone.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: 2.6.24-rc8-mm1: sparc64 warning at fs/file_table.c:49 __fput+0x1a8/0x1e0()
On Tue, Jan 22, 2008 at 03:13:58PM -0800, Dave Hansen wrote:
> The emergency remount code forcibly removes FMODE_WRITE from
> filps. The r/o bind mount code notices that this was done
> without a proper mnt_drop_write() and properly gives a
> warning.
>
> This patch does a mnt_drop_write() and also notes in the
> filp that this was done to suppress any warning that would
> have otherwise been triggered.
>
> I also wonder if inode->i_writecount is made inconsistent
> by the emergency remount code. I guess it is, but the
> damage is limited to a single inode instead of being
> visible more globally like the mnt write count. Probably
> not really worth fixing.

The right fix is not simply to remove FMODE_WRITE, but to remove this whole function. Until we have a proper revoke it will cause more harm than good.
Re: do_remount_sb(RDONLY) race? (was: XFS oops under 2.6.23.9)
Hi Dave

> On Wed, Jan 23, 2008 at 03:00:48PM +1030, Jonathan Woithe wrote:
> > Last night my laptop suffered an oops during closedown. The full oops
> > reports can be downloaded from
> >
> > http://www.atrad.com.au/~jwoithe/xfs_oops/
>
> Assertion failed: atomic_read(&mp->m_active_trans) == 0, file:
> fs/xfs/xfs_vfsops.c, line 689.
>
> The remount read-only of the root drive supposedly completed
> while there was still active modification of the filesystem
> taking place.
>
> > Kernel version was kernel.org 2.6.23.9 compiled as a low latency desktop.
>
> The patch in 2.6.23 that introduced this check was:
>
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=516b2e7c2661615ba5d5ad9fb584f068363502d3
>
> Basically, the remount-readonly path was not flushing things
> properly, so we changed it to flush things properly and ensured we
> got bug reports if it wasn't. Yours is the second report of not
> shutting down correctly since this change went in (we've seen it
> once in ~8 months in a QA environment).
>
> I've had suspicions of a race in the remount-ro code in
> do_remount_sb() w.r.t. the fs_may_remount_ro() check. That is, we
> do an unlocked check to see if we can remount readonly and then fail
> to check again once we've locked the superblock out and start the
> remount.
>
> The read only flag only gets set *after* we've made the filesystem
> readonly, which means before we are truly read only, we can race
> with other threads opening files read/write or filesystem
> modifications can take place.
>
> The result of that race (if it is really unsafe) will be the assert
> you see. The patch I wrote a couple of months ago to fix the problem
> is attached below

Thanks for the patch. I will apply it and see what happens.

Will this be in 2.6.24?

Regards
  jonathan
Re: InfiniBand/RDMA merge plans for 2.6.25
On Tue, Jan 22, 2008 at 01:56:00PM -0800, Roland Dreier wrote:
> be improved (sparse endianness annotation,

that's a blocker for sure. No new code that's not sparse clean, please.
[PATCH] sound: fix opti9xx/miro section mismatch
From: Randy Dunlap <[EMAIL PROTECTED]> snd_opti93x_mixer() is only called by __devinit snd_opti93x_probe(), so the former can also be __devinit. snd_miro_mixer() is only called by __devinit snd_miro_probe(), so the former can also be __devinit. sound/isa/opti9xx/opti92x-ad1848.c: WARNING: vmlinux.o(.text+0xf91cd7): Section mismatch: reference to .init.data:snd_opti93x_controls (between 'snd_opti93x_mixer' and 'snd_card_opti9xx_free') WARNING: vmlinux.o(.text+0xf91d66): Section mismatch: reference to .init.data:snd_miro_controls (between 'snd_opti93x_mixer' and 'snd_card_opti9xx_free') opti9xx/miro.c: WARNING: vmlinux.o(.text+0xf926c2): Section mismatch: reference to .init.data:snd_miro_controls (between 'snd_miro_mixer' and 'snd_legacy_find_free_ioport') WARNING: vmlinux.o(.text+0xf926e5): Section mismatch: reference to .init.data:snd_miro_eq_controls (between 'snd_miro_mixer' and 'snd_legacy_find_free_ioport') WARNING: vmlinux.o(.text+0xf926f9): Section mismatch: reference to .init.data:snd_miro_line_control (between 'snd_miro_mixer' and 'snd_legacy_find_free_ioport') WARNING: vmlinux.o(.text+0xf92716): Section mismatch: reference to .init.data:snd_miro_amp_control (between 'snd_miro_mixer' and 'snd_legacy_find_free_ioport') WARNING: vmlinux.o(.text+0xf9273e): Section mismatch: reference to .init.data:snd_miro_preamp_control (between 'snd_miro_mixer' and 'snd_legacy_find_free_ioport') WARNING: vmlinux.o(.text+0xf92764): Section mismatch: reference to .init.data:snd_miro_capture_control (between 'snd_miro_mixer' and 'snd_legacy_find_free_ioport') WARNING: vmlinux.o(.text+0xf92783): Section mismatch: reference to .init.data:snd_miro_radio_control (between 'snd_miro_mixer' and 'snd_legacy_find_free_ioport') WARNING: vmlinux.o(.text+0xf9279a): Section mismatch: reference to .init.data:snd_miro_eq_controls (between 'snd_miro_mixer' and 'snd_legacy_find_free_ioport') WARNING: vmlinux.o(.text+0xf927b9): Section mismatch: reference to .init.data:snd_miro_radio_control 
(between 'snd_miro_mixer' and 'snd_legacy_find_free_ioport') Signed-off-by: Randy Dunlap <[EMAIL PROTECTED]> --- sound/isa/opti9xx/miro.c |2 +- sound/isa/opti9xx/opti92x-ad1848.c |2 +- 2 files changed, 2 insertions(+), 2 deletions(-) --- linux-2.6.24-rc8-git5.orig/sound/isa/opti9xx/miro.c +++ linux-2.6.24-rc8-git5/sound/isa/opti9xx/miro.c @@ -662,7 +662,7 @@ static int __devinit snd_set_aci_init_va return 0; } -static int snd_miro_mixer(struct snd_miro *miro) +static int __devinit snd_miro_mixer(struct snd_miro *miro) { struct snd_card *card; unsigned int idx; --- linux-2.6.24-rc8-git5.orig/sound/isa/opti9xx/opti92x-ad1848.c +++ linux-2.6.24-rc8-git5/sound/isa/opti9xx/opti92x-ad1848.c @@ -1595,7 +1595,7 @@ OPTi93X_DOUBLE("Capture Volume", 0, OPTi } }; -static int snd_opti93x_mixer(struct snd_opti93x *chip) +static int __devinit snd_opti93x_mixer(struct snd_opti93x *chip) { struct snd_card *card; struct snd_kcontrol_new knew; -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] radio: fix sf16fmi section mismatch
From: Randy Dunlap <[EMAIL PROTECTED]>

isapnp_fmi_probe() is only called by fmi_init(), which is __init,
so isapnp_fmi_probe() can also be __init.

media/radio/radio-sf16fmi.c:
WARNING: vmlinux.o(.text+0x994e19): Section mismatch: reference to .init.data: (between 'isapnp_fmi_probe' and 'vidioc_s_tuner')
WARNING: vmlinux.o(.text+0x994e22): Section mismatch: reference to .init.data: (between 'isapnp_fmi_probe' and 'vidioc_s_tuner')
WARNING: vmlinux.o(.text+0x994e3a): Section mismatch: reference to .init.data:id_table (between 'isapnp_fmi_probe' and 'vidioc_s_tuner')

Signed-off-by: Randy Dunlap <[EMAIL PROTECTED]>
---
 drivers/media/radio/radio-sf16fmi.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- linux-2.6.24-rc8-git5.orig/drivers/media/radio/radio-sf16fmi.c
+++ linux-2.6.24-rc8-git5/drivers/media/radio/radio-sf16fmi.c
@@ -321,7 +321,7 @@ static struct isapnp_device_id id_table[
 
 MODULE_DEVICE_TABLE(isapnp, id_table);
 
-static int isapnp_fmi_probe(void)
+static int __init isapnp_fmi_probe(void)
 {
 	int i = 0;
 
[PATCH] sound: fix cs5535 section mismatch
From: Randy Dunlap <[EMAIL PROTECTED]>

snd_cs5535audio_mixer() is only called by __devinit snd_cs5535audio_probe(),
so the mixer function can also be __devinit.

WARNING: vmlinux.o(.text+0xfdbba0): Section mismatch: reference to .init.data:ac97_quirks (between 'snd_cs5535audio_mixer' and 'process_bm0_irq')

Signed-off-by: Randy Dunlap <[EMAIL PROTECTED]>
---
 sound/pci/cs5535audio/cs5535audio.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- linux-2.6.24-rc8-git5.orig/sound/pci/cs5535audio/cs5535audio.c
+++ linux-2.6.24-rc8-git5/sound/pci/cs5535audio/cs5535audio.c
@@ -145,7 +145,7 @@ static unsigned short snd_cs5535audio_ac
 	return snd_cs5535audio_codec_read(cs5535au, reg);
 }

-static int snd_cs5535audio_mixer(struct cs5535audio *cs5535au)
+static int __devinit snd_cs5535audio_mixer(struct cs5535audio *cs5535au)
 {
 	struct snd_card *card = cs5535au->card;
 	struct snd_ac97_bus *pbus;
do_remount_sb(RDONLY) race? (was: XFS oops under 2.6.23.9)
On Wed, Jan 23, 2008 at 03:00:48PM +1030, Jonathan Woithe wrote:
> Last night my laptop suffered an oops during closedown. The full oops
> reports can be downloaded from
>
> http://www.atrad.com.au/~jwoithe/xfs_oops/

Assertion failed: atomic_read(&mp->m_active_trans) == 0, file: fs/xfs/xfs_vfsops.c, line 689.

The remount read-only of the root drive supposedly completed while there was still active modification of the filesystem taking place.

> Kernel version was kernel.org 2.6.23.9 compiled as a low latency desktop.

The patch in 2.6.23 that introduced this check was:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=516b2e7c2661615ba5d5ad9fb584f068363502d3

Basically, the remount-readonly path was not flushing things properly, so we changed it to flush things properly and to ensure we got bug reports if it wasn't. Yours is the second report of not shutting down correctly since this change went in (we've seen it once in ~8 months in a QA environment).

I've had suspicions of a race in the remount-ro code in do_remount_sb() w.r.t. the fs_may_remount_ro() check. That is, we do an unlocked check to see if we can remount readonly and then fail to check again once we've locked the superblock out and started the remount. The read-only flag only gets set *after* we've made the filesystem readonly, which means that before we are truly read only, we can race with other threads opening files read/write, or filesystem modifications can take place.

The result of that race (if it is really unsafe) will be the assert you see. The patch I wrote a couple of months ago to fix the problem is attached below.

Cheers,

Dave.
---

Set the MS_RDONLY before we check to see if we can remount read only so that we close a race between checking remount is ok and setting the superblock flag that allows other processes to start modifying the filesystem while it is being remounted.
Signed-off-by: Dave Chinner <[EMAIL PROTECTED]>
---
 fs/xfs/linux-2.6/xfs_super.c |   16 ++++++++++++++++
 1 file changed, 16 insertions(+)

Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_super.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_super.c	2008-01-22 14:57:07.753782292 +1100
+++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_super.c	2008-01-23 16:22:16.940279351 +1100
@@ -1222,6 +1222,22 @@ xfs_fs_remount(
 	struct xfs_mount_args *args = xfs_args_allocate(sb, 0);
 	int error;

+	/*
+	 * We need to have the MS_RDONLY flag set on the filesystem before we
+	 * try to quiesce it down to a sane state. If we don't set the
+	 * MS_RDONLY before we check the fs_may_remount_ro(sb) state, we have a
+	 * race where write operations can start after we've checked it is OK
+	 * to remount read only. This results in assert failures due to being
+	 * unable to quiesce the transaction subsystem correctly.
+	 */
+	if (!(sb->s_flags & MS_RDONLY) && (*flags & MS_RDONLY)) {
+		sb->s_flags |= MS_RDONLY;
+		if (!fs_may_remount_ro(sb)) {
+			sb->s_flags &= ~MS_RDONLY;
+			return -EBUSY;
+		}
+	}
+
 	error = xfs_parseargs(mp, options, args, 1);
 	if (!error)
 		error = xfs_mntupdate(mp, flags, args);
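The ordering that the patch relies on can be shown outside the kernel. What follows is a minimal single-threaded userspace sketch, not the XFS code: struct toy_sb, toy_open_for_write(), toy_may_remount_ro() and toy_remount_ro() are hypothetical names, TOY_RDONLY is an illustrative value, and a plain writer count stands in for whatever fs_may_remount_ro() really inspects. Locking is deliberately left out; the only point is the set-flag, check, roll-back-on-failure sequence.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_RDONLY 0x1u	/* stand-in for MS_RDONLY; the value is illustrative */

/* Hypothetical, heavily simplified superblock. */
struct toy_sb {
	unsigned int flags;
	int writers;		/* files currently open for writing */
};

/* A new writer is refused as soon as the read-only flag is visible. */
static int toy_open_for_write(struct toy_sb *sb)
{
	if (sb->flags & TOY_RDONLY)
		return -EROFS;
	sb->writers++;
	return 0;
}

/* Stand-in for fs_may_remount_ro(): true only when no writer remains. */
static int toy_may_remount_ro(const struct toy_sb *sb)
{
	return sb->writers == 0;
}

/* Set the flag first, then check, and undo the flag if the check fails. */
static int toy_remount_ro(struct toy_sb *sb)
{
	assert(sb != NULL);
	if (sb->flags & TOY_RDONLY)
		return 0;			/* already read-only */
	sb->flags |= TOY_RDONLY;		/* stop new writers first */
	if (!toy_may_remount_ro(sb)) {
		sb->flags &= ~TOY_RDONLY;	/* roll back, still busy */
		return -EBUSY;
	}
	return 0;
}
```

With the flag set before the check, a writer that arrives between the two steps is turned away instead of slipping in after the check has already passed.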
Re: [patch] x86: test case for the RODATA config option
On Wed, 23 Jan 2008 12:11:41 +1100
Nick Piggin <[EMAIL PROTECTED]> wrote:

> > #ifdef CONFIG_DEBUG_RODATA
> >
> > +const int rodata_test_data = 5;
>
> I guess this should match the 32-bit case, and be zero instead of
> 5?

actually it should have been 5 for both (well any non-zero number)

> Can you disallow building as a module, and put this in the test
> code? It could be run from the end of mark_rodata_ro()...

fair; I was developing it as a module (just easier) but yeah it makes more sense as part of mark_rodata_ro(). I'll do that in the next rev

--
If you want to reach me at my work email, use [EMAIL PROTECTED]
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
Re: [RFC/PATCH] dma: dma_{un}map_{single|sg}_attrs() interface
> --- a/include/linux/dma-attrs.h
> +++ b/include/linux/dma-attrs.h
> @@ -0,0 +1,33 @@
> +#ifndef _DMA_ATTR_H
> +#define _DMA_ATTR_H
> +
> +#include
> +
> +enum {
> +	DMA_ATTR_INVALID,
> +	DMA_ATTR_BARRIER,
> +	DMA_ATTR_FOO,
> +	DMA_ATTR_GOO,
> +	DMA_ATTR_MAX,
> +};
> +
> +struct dma_attrs {
> +	unsigned flags;
> +};
> +
> +static inline int dma_set_attr(struct dma_attrs *attrs, unsigned attr) {

maybe this would be cleaner if you named the DMA_ATTR enum and used that instead of unsigned here (and below)?

> +	BUG_ON(attrs == NULL);

does this BUG_ON() buy us much? It seems the only thing we would fail to oops on is if someone did dma_set_attr(NULL, INVALID) and I'm not sure it's worth it to BUG here.

> +	if (attr > DMA_ATTR_INVALID && attr < DMA_ATTR_MAX) {
> +		attrs->flags = (1 << attr);
> +		return 0;
> +	}
> +	return 1;

returning -EINVAL here instead of 1 would probably be more "kernelish".

> +}
> +
> +static inline int dma_get_attr(struct dma_attrs *attrs, unsigned attr) {
> +	if (attrs)
> +		return attrs->flags & (1 << attr);

so it's OK to pass attrs == NULL into dma_get_attr() but not into dma_set_attr()? seems kind of odd.

> +	return 0;
> +}

It seems you're missing a way to initialize a struct dma_attrs. How do I clear the flags field to start with? A macro like DEFINE_DMA_ATTRS() that initializes things for you (like LIST_HEAD or DEFINE_SPIN_LOCK) would probably be a good thing to have as well.

Also I guess you could test ARCH_USES_DMA_ATTRS in this file and stub everything out and define an empty structure if it's not defined. save a few bytes of stack etc.

> +
> +#endif /* _DMA_ATTR_H */
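Taken together, the review comments point at a shape roughly like the one below. This is a userspace sketch under stated assumptions, not the posted patch and not a proposal for the final interface: the enum tag dma_attr, the helper dma_attr_valid(), the body of DEFINE_DMA_ATTRS(), the choice to treat a NULL attrs the same way in both accessors, and the use of |= so that several attributes can accumulate are all illustrative. DMA_ATTR_FOO and DMA_ATTR_GOO are the placeholders from the quoted header.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Named enum, as suggested in the review (the tag name is an assumption). */
enum dma_attr {
	DMA_ATTR_INVALID,
	DMA_ATTR_BARRIER,
	DMA_ATTR_FOO,
	DMA_ATTR_GOO,
	DMA_ATTR_MAX,
};

struct dma_attrs {
	unsigned int flags;
};

/* Hypothetical initializer in the spirit of LIST_HEAD(): starts with no bits set. */
#define DEFINE_DMA_ATTRS(name) struct dma_attrs name = { .flags = 0 }

/* Only values strictly between INVALID and MAX name a real attribute. */
static inline int dma_attr_valid(enum dma_attr attr)
{
	return attr > DMA_ATTR_INVALID && attr < DMA_ATTR_MAX;
}

/* Returns 0 on success and -EINVAL for a NULL set or an unknown attribute. */
static inline int dma_set_attr(struct dma_attrs *attrs, enum dma_attr attr)
{
	if (attrs == NULL || !dma_attr_valid(attr))
		return -EINVAL;
	attrs->flags |= 1u << attr;	/* assumption: attributes accumulate */
	return 0;
}

/* Returns 1 if the attribute is set, 0 otherwise (including for NULL). */
static inline int dma_get_attr(const struct dma_attrs *attrs, enum dma_attr attr)
{
	if (attrs == NULL || !dma_attr_valid(attr))
		return 0;
	return (attrs->flags & (1u << attr)) != 0;
}
```

A caller would write DEFINE_DMA_ATTRS(attrs); followed by dma_set_attr(&attrs, DMA_ATTR_BARRIER); and could then hand &attrs to a mapping call.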
[PATCH 1/3] generic: Percpu infrastructure to rebase the per cpu area to zero
* Support an option CONFIG_HAVE_ZERO_BASED_PER_CPU that makes offsets for per cpu variables to start at zero. If a percpu area starts at zero then: - We do not need RELOC_HIDE anymore - Provides for the future capability of architectures providing a per cpu allocator that returns offsets instead of pointers. The offsets would be independent of the processor so that address calculations can be done in a processor independent way. Per cpu instructions can then add the processor specific offset at the last minute possibly in an atomic instruction. The data the linker provides is different for zero based percpu segments: __per_cpu_load -> The address at which the percpu area was loaded __per_cpu_size -> The length of the per cpu area * Removes the &__per_cpu_x in lockdep. The __per_cpu_x are already pointers. There is no need to take the address. * Changes generic setup_per_cpu_areas to allocate per_cpu space in node local memory. This requires a generic early_cpu_to_node function. Based on 2.6.24-rc8-mm1 Signed-off-by: Mike Travis <[EMAIL PROTECTED]> Reviewed-by: Christoph Lameter <[EMAIL PROTECTED]> --- include/asm-alpha/topology.h |1 + include/asm-generic/percpu.h |7 ++- include/asm-generic/sections.h| 10 ++ include/asm-generic/topology.h|3 +++ include/asm-generic/vmlinux.lds.h | 15 +++ include/asm-ia64/topology.h |1 + include/asm-mips/mach-ip27/topology.h |1 + include/asm-powerpc/topology.h|1 + init/main.c | 18 ++ kernel/lockdep.c |4 ++-- 10 files changed, 50 insertions(+), 11 deletions(-) --- a/include/asm-alpha/topology.h +++ b/include/asm-alpha/topology.h @@ -6,6 +6,7 @@ #include #ifdef CONFIG_NUMA +#define early_cpu_to_node(cpu) cpu_to_node(cpu) static inline int cpu_to_node(int cpu) { int node; --- a/include/asm-generic/percpu.h +++ b/include/asm-generic/percpu.h @@ -43,7 +43,12 @@ extern unsigned long __per_cpu_offset[NR * Only S390 provides its own means of moving the pointer. 
*/ #ifndef SHIFT_PERCPU_PTR -#define SHIFT_PERCPU_PTR(__p, __offset)RELOC_HIDE((__p), (__offset)) +# ifdef CONFIG_HAVE_ZERO_BASED_PER_CPU +# define SHIFT_PERCPU_PTR(__p, __offset) \ + ((__typeof(__p))(((void *)(__p)) + (__offset))) +# else +# define SHIFT_PERCPU_PTR(__p, __offset) RELOC_HIDE((__p), (__offset)) +# endif /* CONFIG_HAVE_ZERO_BASED_PER_CPU */ #endif /* --- a/include/asm-generic/sections.h +++ b/include/asm-generic/sections.h @@ -9,7 +9,17 @@ extern char __bss_start[], __bss_stop[]; extern char __init_begin[], __init_end[]; extern char _sinittext[], _einittext[]; extern char _end[]; +#ifdef CONFIG_HAVE_ZERO_BASED_PER_CPU +extern char __per_cpu_load[]; +extern char per_cpu_size[]; +#define __per_cpu_size ((unsigned long)&per_cpu_size) +#define __per_cpu_start ((char *)0) +#define __per_cpu_end ((char *)__per_cpu_size) +#else extern char __per_cpu_start[], __per_cpu_end[]; +#define __per_cpu_load __per_cpu_start +#define __per_cpu_size (__per_cpu_end - __per_cpu_start) +#endif extern char __kprobes_text_start[], __kprobes_text_end[]; extern char __initdata_begin[], __initdata_end[]; extern char __start_rodata[], __end_rodata[]; --- a/include/asm-generic/topology.h +++ b/include/asm-generic/topology.h @@ -32,6 +32,9 @@ #ifndef cpu_to_node #define cpu_to_node(cpu) (0) #endif +#ifndef early_cpu_to_node +#define early_cpu_to_node(cpu) cpu_to_node(cpu) +#endif #ifndef parent_node #define parent_node(node) (0) #endif --- a/include/asm-generic/vmlinux.lds.h +++ b/include/asm-generic/vmlinux.lds.h @@ -255,6 +255,20 @@ *(.initcall7.init) \ *(.initcall7s.init) +#ifdef CONFIG_HAVE_ZERO_BASED_PER_CPU +#define PERCPU(align) \ + . = ALIGN(align); \ + percpu : { } :percpu\ + __per_cpu_load = .; \ + .data.percpu 0 : AT(__per_cpu_load - LOAD_OFFSET) { \ + *(.data.percpu.first) \ + *(.data.percpu) \ + *(.data.percpu.shared_aligned) \ + per_cpu_size = .; \ + } \ + . = __per_cpu_load + per_cpu_size; \ + data : { } :data +#else #define PERCPU(align) \ . = ALIGN(align);
[PATCH 2/3] x86_64: Fold pda into per cpu area
* Declare the pda as a per cpu variable. This will move the pda area to an address accessible by the x86_64 per cpu macros. Subtraction of __per_cpu_start will make the offset based from the beginning of the per cpu area. Since %gs is pointing to the pda, it will then also point to the per cpu variables and can be accessed thusly: %gs:[_cpu_ - __per_cpu_start] * The boot_pdas are only needed in head64.c so move the declaration over there and make it static. * Remove the code that allocates special pda data structures. Based on 2.6.24-rc8-mm1 Signed-off-by: Mike Travis <[EMAIL PROTECTED]> Reviewed-by: Christoph Lameter <[EMAIL PROTECTED]> --- arch/x86/kernel/head64.c |6 ++ arch/x86/kernel/setup64.c | 12 ++-- arch/x86/kernel/smpboot_64.c | 16 include/asm-generic/vmlinux.lds.h |1 + include/asm-x86/pda.h |1 - include/asm-x86/percpu.h | 30 +++--- include/linux/percpu.h| 13 - 7 files changed, 48 insertions(+), 31 deletions(-) --- a/arch/x86/kernel/head64.c +++ b/arch/x86/kernel/head64.c @@ -22,6 +22,12 @@ #include #include +/* + * Only used before the per cpu areas are setup. 
The use for the non possible + * cpus continues after boot + */ +static struct x8664_pda boot_cpu_pda[NR_CPUS] __cacheline_aligned; + static void __init zap_identity_mappings(void) { pgd_t *pgd = pgd_offset_k(0UL); --- a/arch/x86/kernel/setup64.c +++ b/arch/x86/kernel/setup64.c @@ -34,7 +34,9 @@ cpumask_t cpu_initialized __cpuinitdata struct x8664_pda *_cpu_pda[NR_CPUS] __read_mostly; EXPORT_SYMBOL(_cpu_pda); -struct x8664_pda boot_cpu_pda[NR_CPUS] __cacheline_aligned; + +DEFINE_PER_CPU_FIRST(struct x8664_pda, pda); +EXPORT_PER_CPU_SYMBOL(pda); struct desc_ptr idt_descr = { 256 * 16 - 1, (unsigned long) idt_table }; @@ -150,10 +152,16 @@ void __init setup_per_cpu_areas(void) } if (!ptr) panic("Cannot allocate cpu data for CPU %d\n", i); - cpu_pda(i)->data_offset = ptr - __per_cpu_start; memcpy(ptr, __per_cpu_start, __per_cpu_end - __per_cpu_start); + /* Relocate the pda */ + memcpy(ptr, cpu_pda(i), sizeof(struct x8664_pda)); + cpu_pda(i) = (struct x8664_pda *)ptr; + cpu_pda(i)->data_offset = ptr - __per_cpu_start; } + /* Fix up pda for this processor */ + pda_init(0); + /* setup percpu data maps early */ setup_per_cpu_maps(); } --- a/arch/x86/kernel/smpboot_64.c +++ b/arch/x86/kernel/smpboot_64.c @@ -566,22 +566,6 @@ static int __cpuinit do_boot_cpu(int cpu return -1; } - /* Allocate node local memory for AP pdas */ - if (cpu_pda(cpu) == _cpu_pda[cpu]) { - struct x8664_pda *newpda, *pda; - int node = cpu_to_node(cpu); - pda = cpu_pda(cpu); - newpda = kmalloc_node(sizeof (struct x8664_pda), GFP_ATOMIC, - node); - if (newpda) { - memcpy(newpda, pda, sizeof (struct x8664_pda)); - cpu_pda(cpu) = newpda; - } else - printk(KERN_ERR - "Could not allocate node local PDA for CPU %d on node %d\n", - cpu, node); - } - alternatives_smp_switch(1); c_idle.idle = get_idle_for_cpu(cpu); --- a/include/asm-generic/vmlinux.lds.h +++ b/include/asm-generic/vmlinux.lds.h @@ -273,6 +273,7 @@ . 
= ALIGN(align); \ __per_cpu_start = .;\ .data.percpu : AT(ADDR(.data.percpu) - LOAD_OFFSET) { \ + *(.data.percpu.first) \ *(.data.percpu) \ *(.data.percpu.shared_aligned) \ } \ --- a/include/asm-x86/pda.h +++ b/include/asm-x86/pda.h @@ -39,7 +39,6 @@ struct x8664_pda { } cacheline_aligned_in_smp; extern struct x8664_pda *_cpu_pda[]; -extern struct x8664_pda boot_cpu_pda[]; extern void pda_init(int); #define cpu_pda(i) (_cpu_pda[i]) --- a/include/asm-x86/percpu.h +++ b/include/asm-x86/percpu.h @@ -16,7 +16,14 @@ #define __my_cpu_offset read_pda(data_offset) #define per_cpu_offset(x) (__per_cpu_offset(x)) +#define __percpu_seg "%%gs:" +/* Calculate the offset to use with the segment register */ +#define seg_offset(name) (*SHIFT_PERCPU_PTR(_cpu_var(name), \ + - (unsigned long)__per_cpu_start)) +#else +#define __percpu_seg "" +#define seg_offset(name) per_cpu_var(name) #endif #include @@ -64,16 +71,11
[PATCH 0/3] percpu: Optimize percpu accesses
This patchset provides the following:

  * Generic: Percpu infrastructure to rebase the per cpu area to zero

    This provides for the capability of accessing the percpu variables using a local register instead of having to go through a table on node 0 to find the cpu-specific offsets. It also would allow atomic operations on percpu variables to reduce required locking.

  * x86_64: Fold pda into per cpu area

    Declare the pda as a per cpu variable. This will move the pda area to an address accessible by the x86_64 per cpu macros. Subtraction of __per_cpu_start will make the offset based from the beginning of the per cpu area. Since %gs is pointing to the pda, it will then also point to the per cpu variables and can be accessed thusly:

	%gs:[_cpu_ - __per_cpu_start]

  * x86_64: Rebase per cpu variables to zero

    Take advantage of the zero-based per cpu area provided above. Then we can directly use the x86_32 percpu operations. x86_32 offsets %fs by __per_cpu_start. x86_64 has %gs pointing directly to the pda and the per cpu area thereby allowing access to the pda with the x86_64 pda operations and access to the per cpu variables using x86_32 percpu operations.

Based on 2.6.24-rc8-mm1

Signed-off-by: Mike Travis <[EMAIL PROTECTED]>
Reviewed-by: Christoph Lameter <[EMAIL PROTECTED]>
---
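The idea of a zero-based per cpu area can be modelled in userspace. The sketch below is only a model under stated assumptions: struct pcpu_area, pcpu_load, cpu_base[], PCPU_OFF(), toy_setup_per_cpu_areas() and pcpu_ptr() are all hypothetical, and nothing here reflects the real linker script or the %gs handling. It shows that once a variable's "address" is simply its offset from the start of the area, a CPU-specific pointer is the CPU's base plus that offset, with no subtraction of __per_cpu_start.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NR_CPUS 4

/*
 * Template of the per cpu area. Field offsets play the role of the
 * zero-based symbol values a linker would hand out.
 */
struct pcpu_area {
	int cpunumber;		/* placed first, like .data.percpu.first */
	long counter;
};

/* Initial image of the area (the role played by __per_cpu_load). */
static const struct pcpu_area pcpu_load = { .cpunumber = -1, .counter = 100 };

/* One private copy per CPU plus the base address of each copy. */
static struct pcpu_area cpu_area[NR_CPUS];
static char *cpu_base[NR_CPUS];

/* Zero-based "address" of a per cpu variable. */
#define PCPU_OFF(field) offsetof(struct pcpu_area, field)

/* Copy the initial image once per CPU and record where each copy lives. */
static void toy_setup_per_cpu_areas(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		memcpy(&cpu_area[cpu], &pcpu_load, sizeof(pcpu_load));
		cpu_area[cpu].cpunumber = cpu;
		cpu_base[cpu] = (char *)&cpu_area[cpu];
	}
}

/* base + offset: the only arithmetic a zero-based layout needs. */
static void *pcpu_ptr(int cpu, size_t off)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return cpu_base[cpu] + off;
}
```

In this model the first field sits at offset zero, which mirrors how the pda can be reached through the same base as every other per cpu variable.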
[PATCH 3/3] x86_64: Rebase per cpu variables to zero
* Relocate the x86_64 percpu variables to begin at zero. Then we can directly use the x86_32 percpu operations. x86_32 offsets %fs by __per_cpu_start. x86_64 has %gs pointing directly to the pda and the per cpu area thereby allowing access to the pda with the x86_64 pda operations and access to the per cpu variables using x86_32 percpu operations. * This also supports further integration of x86_32/64. Based on 2.6.24-rc8-mm1 Signed-off-by: Mike Travis <[EMAIL PROTECTED]> Reviewed-by: Christoph Lameter <[EMAIL PROTECTED]> --- arch/x86/Kconfig |3 +++ arch/x86/kernel/setup64.c|2 +- arch/x86/kernel/vmlinux_64.lds.S |1 + kernel/module.c |7 --- 4 files changed, 9 insertions(+), 4 deletions(-) --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -107,6 +107,9 @@ config GENERIC_TIME_VSYSCALL bool default X86_64 +config HAVE_ZERO_BASED_PER_CPU + def_bool X86_64 + config ARCH_SUPPORTS_OPROFILE bool default y --- a/arch/x86/kernel/setup64.c +++ b/arch/x86/kernel/setup64.c @@ -152,7 +152,7 @@ void __init setup_per_cpu_areas(void) } if (!ptr) panic("Cannot allocate cpu data for CPU %d\n", i); - memcpy(ptr, __per_cpu_start, __per_cpu_end - __per_cpu_start); + memcpy(ptr, __per_cpu_load, __per_cpu_size); /* Relocate the pda */ memcpy(ptr, cpu_pda(i), sizeof(struct x8664_pda)); cpu_pda(i) = (struct x8664_pda *)ptr; --- a/arch/x86/kernel/vmlinux_64.lds.S +++ b/arch/x86/kernel/vmlinux_64.lds.S @@ -16,6 +16,7 @@ jiffies_64 = jiffies; _proxy_pda = 1; PHDRS { text PT_LOAD FLAGS(5); /* R_E */ + percpu PT_LOAD FLAGS(4);/* R__ */ data PT_LOAD FLAGS(7); /* RWE */ user PT_LOAD FLAGS(7); /* RWE */ data.init PT_LOAD FLAGS(7); /* RWE */ --- a/kernel/module.c +++ b/kernel/module.c @@ -45,6 +45,7 @@ #include #include #include +#include #include #include @@ -351,7 +352,7 @@ static void *percpu_modalloc(unsigned lo align = PAGE_SIZE; } - ptr = __per_cpu_start; + ptr = __per_cpu_load; for (i = 0; i < pcpu_num_used; ptr += block_size(pcpu_size[i]), i++) { /* Extra for alignment requirement. 
*/ extra = ALIGN((unsigned long)ptr, align) - (unsigned long)ptr; @@ -386,7 +387,7 @@ static void *percpu_modalloc(unsigned lo static void percpu_modfree(void *freeme) { unsigned int i; - void *ptr = __per_cpu_start + block_size(pcpu_size[0]); + void *ptr = __per_cpu_load + block_size(pcpu_size[0]); /* First entry is core kernel percpu data. */ for (i = 1; i < pcpu_num_used; ptr += block_size(pcpu_size[i]), i++) { @@ -437,7 +438,7 @@ static int percpu_modinit(void) pcpu_size = kmalloc(sizeof(pcpu_size[0]) * pcpu_num_allocated, GFP_KERNEL); /* Static in-kernel percpu data (used). */ - pcpu_size[0] = -(__per_cpu_end-__per_cpu_start); + pcpu_size[0] = -__per_cpu_size; /* Free room. */ pcpu_size[1] = PERCPU_ENOUGH_ROOM + pcpu_size[0]; if (pcpu_size[1] < 0) { -- -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
XFS oops under 2.6.23.9
Last night my laptop suffered an oops during closedown. The full oops reports can be downloaded from

http://www.atrad.com.au/~jwoithe/xfs_oops/

as photos of the screen. Since the laptop was unusable at this point I wasn't able to cut and paste the details, and they weren't in the logs when the machine was rebooted.

The initial complaint claims to be an "invalid opcode". Is this possibly a memory fault developing or does it ring any bells for anyone? memtest86 finds no fault with the memory.

Kernel version was kernel.org 2.6.23.9 compiled as a low latency desktop. The RT patches were not applied.

Regards
  jonathan
Re: W1: w1_slave units, standardize 1C or .001C? Break API
David Fries wrote:
> On Mon, Jan 21, 2008 at 07:11:07PM -0800, H. Peter Anvin wrote:
> > H. Peter Anvin wrote:
> > > Millikelvins would have the nice property of never being negative. :)
>
> True, but the sensor returns the value as a signed integer in C. That is
> where the earlier negative number problem was, it would have to do yet
> another conversion to go to Kelvin, and it would be just one more
> potential for error. Everyone knows that a bad conversion doomed at
> least one space craft, let's stick to Centigrade.

Uhm... the conversion is exact as long as you have at least centikelvin precision (0 °C = 273.15 K by definition, and the multiplier is 1.)

> > Alternatively, centikelvins would fit nicely in 16 bits if anyone cares...
> >
> > 655.35 K = 382.20 °C = 719.96 °F
>
> The range for the sensor is -55 to 125 C, if an application didn't care
> about precision they could store it in a signed 8 bit value just fine.

This was more a comment as to it possibly being a convenient format for more than this particular sensor. The nice thing with kelvins is no need to worry about negative numbers and something misparsing them, that's all.

I certainly did not imply that we should even consider using °F. That's obviously ridiculous.

	-hpa
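The arithmetic behind the centikelvin remark is small enough to write down. This is a sketch only: mdegc_to_centikelvin() and CK_AT_ZERO_C are hypothetical names, and the input unit (millidegrees Celsius) and the round-to-nearest behaviour are assumptions made for the example, not something the w1 driver is known to do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 0 degC is 273.15 K by definition, which is 27315 centikelvin. */
#define CK_AT_ZERO_C 27315

/*
 * Convert millidegrees Celsius to centikelvin, rounding half away from zero.
 * Returns 0 on success, or -1 if the result does not fit in 16 unsigned bits
 * (below 0 K or above 655.35 K).
 */
static int mdegc_to_centikelvin(int32_t mdegc, uint16_t *ck)
{
	int64_t v = mdegc;	/* widen first so the rounding cannot overflow */

	assert(ck != NULL);
	v = (v >= 0 ? v + 5 : v - 5) / 10 + CK_AT_ZERO_C;
	if (v < 0 || v > UINT16_MAX)
		return -1;
	*ck = (uint16_t)v;
	return 0;
}
```

The sensor range quoted in the thread, -55 to 125 C, maps to 21815 and 39815 centikelvin, and the largest value that still fits is 382.20 C, which matches the figure above.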
Re: CONFIG_MARKERS
On Tue, 2008-01-22 at 22:10 -0500, Mathieu Desnoyers wrote:
> * Frank Ch. Eigler ([EMAIL PROTECTED]) wrote:
> >
> > Jon Masters <[EMAIL PROTECTED]> writes:
> >
> > > I notice in module.c:
> > >
> > > #ifdef CONFIG_MARKERS
> > >       if (!mod->taints)
> > >               marker_update_probe_range(mod->markers,
> > >                       mod->markers + mod->num_markers, NULL, NULL);
> > > #endif
> > >
> > > Is this an attempt to not set a marker for proprietary modules? [...]
> >
> > I can't seem to find any discussion about this aspect. If this is the
> > intent, it seems misguided to me. There may instead be a relationship
> > to TAINT_FORCED_{RMMOD,MODULE}. Mathieu?
> >
> > - FChE
>
> On my part, its mostly a matter of not crashing the kernel when someone
> tries to force modprobe of a proprietary module (where the checksums
> doesn't match) on a kernel that supports the markers. Not doing so
> causes the markers to try to find the marker-specific information in
> struct module which doesn't exist and OOPSes.
>
> Christoph's point of view is rather more drastic than mine : it's not
> interesting for the kernel community to help proprietary modules writers,
> so it's a good idea not to give them marker support. (I CC'ed him so he
> can clarify his position).

Right. I thought that was your collective opinion, and I happen to personally agree with you, but my question was more that you should be explicitly comparing to whether it's proprietary and not just whether the taints field is set - there are other flags in there too.

Jon.
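The difference between testing the whole taints field and testing specific flags can be sketched as follows. Everything here is hypothetical and lives in userspace: struct toy_module, the TOY_TAINT_* names and their bit values are made up for the example (the real flag definitions are in the kernel headers), and which flags belong in the mask is exactly the open question in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative bit values only; not the kernel's definitions. */
#define TOY_TAINT_PROPRIETARY_MODULE	(1u << 0)
#define TOY_TAINT_FORCED_MODULE		(1u << 1)
#define TOY_TAINT_OTHER			(1u << 2)	/* any unrelated taint */

/* Hypothetical cut-down module descriptor. */
struct toy_module {
	unsigned int taints;
};

/* The check as quoted: any taint at all disables marker updates. */
static int markers_allowed_any_taint(const struct toy_module *mod)
{
	assert(mod != NULL);
	return !mod->taints;
}

/* The explicit variant: only the flags named in the mask disable markers. */
static int markers_allowed_masked(const struct toy_module *mod, unsigned int mask)
{
	assert(mod != NULL);
	return !(mod->taints & mask);
}
```

With a mask of TOY_TAINT_PROPRIETARY_MODULE | TOY_TAINT_FORCED_MODULE, a module carrying only an unrelated taint keeps its markers, while the quoted check would have skipped it.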
Re: [PATCH] procfs: constify function pointer tables
On Jan 23, 2008 4:00 AM, Jan Engelhardt <[EMAIL PROTECTED]> wrote: > Hi, > > > This touches so many different places that I did not feel like creating > a miniscule patch for each architecture. I hope that is ok. > > ===Patch begins=== > [PATCH] procfs: constify function pointer tables > > Signed-off-by: Jan Engelhardt <[EMAIL PROTECTED]> > --- > arch/alpha/kernel/setup.c |2 +- > arch/blackfin/kernel/setup.c |2 +- > arch/cris/kernel/setup.c |2 +- > arch/frv/kernel/setup.c |2 +- > arch/h8300/kernel/setup.c |2 +- > arch/m32r/kernel/setup.c |2 +- > arch/m68k/kernel/setup.c |2 +- > arch/m68knommu/kernel/setup.c |2 +- > arch/parisc/kernel/setup.c|2 +- > arch/ppc/kernel/setup.c |2 +- > arch/v850/kernel/procfs.c |2 +- > arch/xtensa/kernel/setup.c|2 +- > fs/proc/base.c|6 +++--- > fs/proc/nommu.c |2 +- > fs/proc/proc_misc.c | 22 +++--- > fs/proc/proc_sysctl.c |4 ++-- > fs/proc/proc_tty.c|2 +- > fs/proc/task_mmu.c|8 > fs/proc/task_nommu.c |2 +- > 19 files changed, 35 insertions(+), 35 deletions(-) > > diff --git a/arch/alpha/kernel/setup.c b/arch/alpha/kernel/setup.c > index bd5e68c..823f18e 100644 > --- a/arch/alpha/kernel/setup.c > +++ b/arch/alpha/kernel/setup.c > @@ -1472,7 +1472,7 @@ c_stop(struct seq_file *f, void *v) > { > } > > -struct seq_operations cpuinfo_op = { > +const struct seq_operations cpuinfo_op = { > .start = c_start, > .next = c_next, > .stop = c_stop, > diff --git a/arch/blackfin/kernel/setup.c b/arch/blackfin/kernel/setup.c > index d282201..d67cf54 100644 > --- a/arch/blackfin/kernel/setup.c > +++ b/arch/blackfin/kernel/setup.c > @@ -691,7 +691,7 @@ static void c_stop(struct seq_file *m, void *v) > { > } > > -struct seq_operations cpuinfo_op = { > +const struct seq_operations cpuinfo_op = { > .start = c_start, > .next = c_next, > .stop = c_stop, Thanks, I understand the seq_xxx() API needs "const struct seq_operations *". So for Blackfin part, I agree with Mike. 
Acked-by: Bryan Wu <[EMAIL PROTECTED]>

but there are still some other files that need "const" added:
---
/opt/git-tree/blackfin-2.6$ grep -r seq_operations arch/*
arch/alpha/kernel/setup.c:struct seq_operations cpuinfo_op = {
arch/arm/kernel/setup.c:struct seq_operations cpuinfo_op = {
arch/arm/mach-davinci/clock.c:static struct seq_operations davinci_ck_op = {
arch/avr32/kernel/cpu.c:struct seq_operations cpuinfo_op = {
arch/avr32/mm/tlb.c:static struct seq_operations tlb_ops = {
arch/blackfin/kernel/setup.c:struct seq_operations cpuinfo_op = {
arch/cris/kernel/setup.c:struct seq_operations cpuinfo_op = {
arch/frv/kernel/setup.c:struct seq_operations cpuinfo_op = {
arch/h8300/kernel/setup.c:struct seq_operations cpuinfo_op = {
arch/ia64/hp/common/sba_iommu.c:static struct seq_operations ioc_seq_ops = {
arch/ia64/kernel/perfmon.c:struct seq_operations pfm_seq_ops = {
arch/ia64/kernel/setup.c:struct seq_operations cpuinfo_op = {
arch/ia64/sn/kernel/sn2/sn2_smp.c:static struct seq_operations sn2_ptc_seq_ops = {
arch/ia64/sn/kernel/sn2/sn_hwperf.c:static struct seq_operations sn_topology_seq_ops = {
arch/m32r/kernel/setup.c:struct seq_operations cpuinfo_op = {
arch/m68k/kernel/setup.c:struct seq_operations cpuinfo_op = {
arch/m68knommu/kernel/setup.c:struct seq_operations cpuinfo_op = {
arch/mips/kernel/proc.c:struct seq_operations cpuinfo_op = {
arch/parisc/kernel/setup.c:struct seq_operations cpuinfo_op = {
arch/powerpc/kernel/setup-common.c:struct seq_operations cpuinfo_op = {
[!snip!]

Regards,
-Bryan
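For readers who have not met the pattern, here is what constifying a table of function pointers looks like in plain C. It is a userspace miniature and not the kernel's seq_file interface: struct toy_seq_ops, toy_start(), toy_next(), toy_cpuinfo_op and toy_count() are hypothetical names. The point it illustrates is the one made above: as long as the consumer takes a pointer to const, the table itself can be declared const.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature of an operations table. */
struct toy_seq_ops {
	const char *(*start)(size_t *pos);
	const char *(*next)(size_t *pos);
};

static const char *const toy_items[] = { "cpu0", "cpu1" };
#define TOY_N_ITEMS (sizeof(toy_items) / sizeof(toy_items[0]))

/* Return the item at *pos, or NULL once the end is reached. */
static const char *toy_start(size_t *pos)
{
	return *pos < TOY_N_ITEMS ? toy_items[*pos] : NULL;
}

/* Advance the position and return the item found there, if any. */
static const char *toy_next(size_t *pos)
{
	++*pos;
	return toy_start(pos);
}

/* const: the table is never written after initialization. */
static const struct toy_seq_ops toy_cpuinfo_op = {
	.start	= toy_start,
	.next	= toy_next,
};

/* The consumer takes a pointer to const, so a const table needs no cast. */
static size_t toy_count(const struct toy_seq_ops *ops)
{
	size_t pos = 0, n = 0;

	assert(ops != NULL);
	for (const char *p = ops->start(&pos); p != NULL; p = ops->next(&pos))
		n++;
	return n;
}
```

An attempt to assign to toy_cpuinfo_op.start after this point is rejected by the compiler, which is the practical benefit of the change.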
Re: [PATCH] ppc: fix #ifdef-s in mediabay driver
On Wed, 2008-01-23 at 01:58 +0100, Bartlomiej Zolnierkiewicz wrote:
> I'm more worried about breaking automatic build checking (make randconfig)
> than a few extra bytes so if you remove all #ifdefs you'll have to either
> make BLK_DEV_IDE_PMAC select PMAC_MEDIABAY or make PMAC_MEDIABAY depend
> on BLK_DEV_IDE_PMAC (otherwise BLK_DEV_IDE=n && PMAC_MEDIABAY=y will fail
> since mediabay.c is referencing IDE code).

I was thinking about having the pmac arch code provide an exported function pointer to put the hook in to avoid that problem.

Ben.
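The function-pointer idea can be sketched generically. This is a userspace illustration with made-up names (toy_mediabay_ide_hook, toy_mediabay_notify(), toy_ide_attach(), toy_ide_register()); it does not reflect the actual mediabay or IDE code. One side owns a pointer that starts out NULL and calls through it only when it is set, and the other side fills it in when it is present, so the first side never references the second side's symbols directly.

```c
#include <assert.h>
#include <stddef.h>

/* Owned by the always-built side; NULL until a driver registers. */
static int (*toy_mediabay_ide_hook)(int bay) = NULL;

/* Counts how often the driver side was actually invoked. */
static int toy_ide_calls;

/* Always-built side: call the hook only if somebody provided one. */
static int toy_mediabay_notify(int bay)
{
	assert(bay >= 0);
	if (toy_mediabay_ide_hook == NULL)
		return 0;	/* no IDE support present, nothing to do */
	return toy_mediabay_ide_hook(bay);
}

/* Optional driver side: the real work would happen here. */
static int toy_ide_attach(int bay)
{
	toy_ide_calls++;
	return bay + 1;		/* arbitrary result for the example */
}

/* Optional driver side: install the hook when the driver is present. */
static void toy_ide_register(void)
{
	toy_mediabay_ide_hook = toy_ide_attach;
}
```

Because the always-built side only touches the pointer, it still builds and links when the optional side is configured out.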
Re: W1: w1_slave units, standardize 1C or .001C? Break API
On Wed, Jan 23, 2008 at 12:06:27AM +0300, Evgeniy Polyakov wrote:
>
> What about instead of breaking application just add new sysfs file,
> which will only return temperature instead of full rom content.
> It can be millidegrees Centigrade, another one can be millikelvins :)

If someone wrote their application to read degrees C because they have a ds18b20, the application will break anyway if they run it with a ds1820 sensor. Or the opposite way around. Yes, it would be better not to break a program, but I think having a consistent interface for both sensors is the better option.

> Actually it is already posible for applications to decode whatever
> precision they like from the rom content displayed, although that can be
> not very convenient.

I was first surprised then glad that the raw data was included in the user available data. I was wanting the full precision, so that was my plan.

> Even more, what about possibility of changing of the base, relative to
> which temperature is displayed? By default I vote for centigrades,
> those, who live behind the oceans, can setup Fahrenheit, Kelvin or anything
> else, but please in a new file :)
> David will this work for you?

I'm biased toward Fahrenheit, against Kelvin, but I think continuing to keep Centigrade is the correct choice here. I don't like the idea of selecting the base the kernel displays by a userland option, too easy to make assumptions, give it one interface and let the application do the conversion, C/1000.0*9/5+32 is pretty easy (for millidegrees C that is).

I'll get the trivial patch to change the ds18b20 output to millidegrees C to make things consistent. I'm out of time tonight.

It does sound like a good idea to have a sysfs file that just returns the millidegrees C in ASCII without any other text. It would be easier to parse. If the conversion fails return 0 bytes. Just an idea, but if someone wants it they can write the patch.
-- David Fries <[EMAIL PROTECTED]> http://fries.net/~david/ (PGP encryption key available)
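How an application might consume such a file can be sketched in a few lines. The file itself is only an idea in this thread, so everything about it here is an assumption: parse_mdegc() and mdegc_to_fahrenheit() are hypothetical helpers, the content is assumed to be a single decimal integer with an optional trailing newline, and empty content is treated as "the conversion failed". The Fahrenheit formula is the one given in the message.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Parse the assumed file content as millidegrees Celsius.
 * Returns 0 on success, -1 for empty or malformed input.
 */
static int parse_mdegc(const char *buf, long *mdegc)
{
	char *end;
	long v;

	assert(buf != NULL && mdegc != NULL);
	if (*buf == '\0')
		return -1;		/* zero bytes: the conversion failed */
	errno = 0;
	v = strtol(buf, &end, 10);
	if (errno != 0 || end == buf)
		return -1;		/* out of range, or no digits at all */
	if (*end == '\n')
		end++;			/* tolerate one trailing newline */
	if (*end != '\0')
		return -1;		/* trailing garbage */
	*mdegc = v;
	return 0;
}

/* C/1000.0*9/5+32, with C given in millidegrees Celsius. */
static double mdegc_to_fahrenheit(long mdegc)
{
	return mdegc / 1000.0 * 9 / 5 + 32;
}
```

Keeping the kernel side to one unit and doing this conversion in the application is the split argued for in the message.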
Re: W1: w1_slave units, standardize 1C or .001C? Break API
On Mon, Jan 21, 2008 at 07:11:07PM -0800, H. Peter Anvin wrote:
> H. Peter Anvin wrote:
> >Millikelvins would have the nice property of never being negative. :)

True, but the sensor returns the value as a signed integer in C. That is where the earlier negative number problem was, it would have to do yet another conversion to go to Kelvin, and it would be just one more potential for error. Everyone knows that a bad conversion doomed at least one space craft, let's stick to Centigrade.

> Alternatively, centikelvins would fit nicely in 16 bits if anyone cares...
>
> 655.35 K = 382.20 °C = 719.96 °F

The range for the sensor is -55 to 125 C, if an application didn't care about precision they could store it in a signed 8 bit value just fine.

--
David Fries <[EMAIL PROTECTED]>
http://fries.net/~david/ (PGP encryption key available)
Re: [PATCH 0/6] IO context sharing
On Tue, Jan 22, 2008 at 10:49:15AM +0100, Jens Axboe wrote:
> Hi,
>
> Today io contexts are per-process and define the (surprise) io context
> of that process. In some situations it would be handy if several
> processes share an IO context.

I think that the nfsd threads should probably share as well. It should probably provide an io context per thread pool.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: [PATCH] x86_32: trim memory by updating e820 v2
On Monday 21 January 2008 01:37:09 pm Justin Piszcz wrote:
> On Mon, 21 Jan 2008, Yinghai Lu wrote:
> > On Monday 21 January 2008 11:14:02 am Justin Piszcz wrote:
> > please get x86.git
> >
> > git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
> > cd linux-2.6
> > #--{ x86.git instructions }-->
> > # Add Linus's tree as a remote
> > git remote add linus git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
> >
> > # Add Ingo's tree as a remote
> > git remote add x86 git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86.git
> >
> > # With that setup, just run the following to get any changes you
> > # don't have. It will also notice any new branches Ingo/Linus
> > # add to their repo. Look in .git/config afterwards, the format
> > # to add new remotes is easy to figure out.
> > git remote update
> > #-
> > git merge x86/master
> > git merge x86/mm
> >
> > and apply
> >
> > [PATCH] x86_64: check if Tom2 is enabled
> > http://lkml.org/lkml/2008/1/21/20
> > [PATCH] x86_64: update e820 instead of updating end_pfn v3
> > http://lkml.org/lkml/2008/1/21/19
> > [PATCH] x86_32: trim memory by updating e820 v2
> > http://lkml.org/lkml/2008/1/21/18
> >
> > YH
>
> Thanks, I am all patched up and ready to test, unfortunately one of my disks
> in my RAID 1 just died, I already filled out the advanced replacement form,
> I will test when I receive the replacement disk.

please get x86.git and apply

[PATCH] x86_32: trim memory by updating e820 v3
http://lkml.org/lkml/2008/1/22/394

Ingo already put other two into the tree.

Thanks
YH
Re: CONFIG_MARKERS
* Frank Ch. Eigler ([EMAIL PROTECTED]) wrote:
> Jon Masters <[EMAIL PROTECTED]> writes:
>
> > I notice in module.c:
> >
> > #ifdef CONFIG_MARKERS
> >       if (!mod->taints)
> >               marker_update_probe_range(mod->markers,
> >                       mod->markers + mod->num_markers, NULL, NULL);
> > #endif
> >
> > Is this an attempt to not set a marker for proprietary modules? [...]
>
> I can't seem to find any discussion about this aspect. If this is the
> intent, it seems misguided to me. There may instead be a relationship
> to TAINT_FORCED_{RMMOD,MODULE}. Mathieu?
>
> - FChE

On my part, it's mostly a matter of not crashing the kernel when someone tries to force modprobe of a proprietary module (where the checksums don't match) on a kernel that supports the markers. Not doing so causes the markers to try to find the marker-specific information in struct module, which doesn't exist, and OOPSes.

Christoph's point of view is rather more drastic than mine: it's not interesting for the kernel community to help proprietary module writers, so it's a good idea not to give them marker support. (I CC'ed him so he can clarify his position.)

Mathieu

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
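The shape of that guard can be shown in user space with mock types. Everything below is a stand-in invented for illustration; the real struct module and marker_update_probe_range() are only what the quoted snippet shows, and nothing here is kernel code.

```c
#include <stddef.h>

/* Mock stand-ins for illustration only; not the kernel's definitions. */
struct mock_marker {
	const char *name;
};

struct mock_module {
	unsigned int taints;         /* nonzero once the module is tainted */
	struct mock_marker *markers; /* may be invalid after a forced load */
	unsigned int num_markers;
};

/*
 * Returns how many markers would be handed to the update routine.
 * A tainted module is skipped entirely, so its (possibly bogus)
 * marker table is never dereferenced.
 */
static unsigned int markers_to_update(const struct mock_module *mod)
{
	if (mod->taints)
		return 0;
	return mod->num_markers;
}
```

The check is coarse: it skips every tainted module, not just force-loaded ones, which is the point raised about TAINT_FORCED_{RMMOD,MODULE} in the quoted question.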
Re: CONFIG_MARKERS
Jon Masters <[EMAIL PROTECTED]> writes:

> I notice in module.c:
>
> #ifdef CONFIG_MARKERS
>       if (!mod->taints)
>               marker_update_probe_range(mod->markers,
>                       mod->markers + mod->num_markers, NULL, NULL);
> #endif
>
> Is this an attempt to not set a marker for proprietary modules? [...]

I can't seem to find any discussion about this aspect. If this is the intent, it seems misguided to me. There may instead be a relationship to TAINT_FORCED_{RMMOD,MODULE}. Mathieu?

- FChE
Re: 2.6.24-rc8-mm1 : net tcp_input.c warnings
From: "Dave Young" <[EMAIL PROTECTED]>
Date: Wed, 23 Jan 2008 09:44:30 +0800

> On Jan 22, 2008 6:47 PM, Ilpo Järvinen <[EMAIL PROTECTED]> wrote:
> > [PATCH] [TCP]: debug S+L
>
> Thanks, If there's new findings I will let you know.

Thanks for helping with this bug Dave.
Re: [PATCH 06/26] atl1: update initialization parameters
Jay Cliburn wrote:
> On Tue, 22 Jan 2008 04:56:11 -0500 Jeff Garzik <[EMAIL PROTECTED]> wrote:
>
> > [EMAIL PROTECTED] wrote:
> > > From: Jay Cliburn <[EMAIL PROTECTED]>
> > >
> > > Update initialization parameters to match the current vendor driver
> > > version 1.2.40.2.
> [...]
> > ACK without any better knowledge... but is any additional insight
> > available at all?
>
> No, sorry Jeff. I simply took the vendor's current driver and matched
> his initialization settings. I can only assume he discovered these
> values through lab testing.
>
> For this and the other "conform to vendor driver" patches in this set, I
> thought it important to have the in-tree driver match the vendor driver
> as closely as possible. The primary motivations are (1) my belief that
> he's in a better position to test the NIC, and (2) to be able to go to
> him for assistance occasionally and not be rejected because of
> significant differences between his and our drivers.

I don't think we should be doing this without justification. From all the atl1 and atl2 code I've looked at, I've gotten the impression that their driver development processes are extremely ad-hoc. There is code in the Atheros version of atl2 that cannot *possibly* apply to that hardware and was just copied and pasted from atl1, just as much of atl1 was copied and pasted from e1000. The fact that various versions have different magic numbers may simply mean they copied and pasted from different irrelevant and incorrect sources.

Our contacts at Atheros seem to be very good electrical engineers, so when they tell us that a certain setting should be changed to match particular properties of the hardware, I trust them. They are not, however, experienced and disciplined kernel developers, so absent such justification I think we should stick with what we have, which has been improved and reviewed by people who *are* experienced and disciplined kernel developers. We have at least as much to teach Atheros about writing kernel code as they have to teach us about their hardware.
-- Chris
Re: [PATCH 06/26] atl1: update initialization parameters
On Tue, 22 Jan 2008 04:56:11 -0500 Jeff Garzik <[EMAIL PROTECTED]> wrote:

> [EMAIL PROTECTED] wrote:
> > From: Jay Cliburn <[EMAIL PROTECTED]>
> >
> > Update initialization parameters to match the current vendor driver
> > version 1.2.40.2.
[...]
> ACK without any better knowledge... but is any additional insight
> available at all?

No, sorry Jeff. I simply took the vendor's current driver and matched his initialization settings. I can only assume he discovered these values through lab testing.

For this and the other "conform to vendor driver" patches in this set, I thought it important to have the in-tree driver match the vendor driver as closely as possible. The primary motivations are (1) my belief that he's in a better position to test the NIC, and (2) to be able to go to him for assistance occasionally and not be rejected because of significant differences between his and our drivers.

Jay
Re: [PATCH]PCIE ASPM support - takes 3
On Tue, 2008-01-22 at 14:58 -0800, Greg KH wrote: > On Fri, Jan 18, 2008 at 09:56:28AM +0800, Shaohua Li wrote: > > v3->v2, fixed the issues Matthew Wilcox raised. > > > > PCI Express ASPM defines a protocol for PCI Express components in the D0 > > state to reduce Link power by placing their Links into a low power state > > and instructing the other end of the Link to do likewise. This > > capability allows hardware-autonomous, dynamic Link power reduction > > beyond what is achievable by software-only controlled power management. > > However, The device should be configured by software appropriately. > > Enabling ASPM will save power, but will introduce device latency. > > > > This patch adds ASPM support in Linux. It introduces a global policy for > > ASPM, a sysfs file /sys/module/pcie_aspm/parameters/policy can control > > it. The interface can be used as a boot option too. Currently we have > > below setting: > > -default, BIOS default setting > > -powersave, highest power saving mode, enable all available ASPM > > state > > and clock power management > > -performance, highest performance, disable ASPM and clock power > > management > > By default, the 'default' policy is used currently. > > > > In my test, power difference between powersave mode and performance mode > > is about 1.3w in a system with 3 PCIE links. > > > > please review, any comments will be appreciated. > > Can you please fix up all of the warnings that checkpatch.pl and sparse > produce from this patch? > > Also, one small thing: > > > --- linux.orig/include/linux/pci.h 2008-01-16 15:59:42.0 +0800 > > +++ linux/include/linux/pci.h 2008-01-18 09:41:20.0 +0800 > > @@ -164,6 +164,10 @@ struct pci_dev { > >this is D0-D3, D0 being fully > > functional, > >and D3 being off. */ > > > > +#ifdef CONFIG_PCIEASPM > > + void*link_state;/* ASPM link state. */ > > +#endif > > Can we make this a "real" pointer to a structure? 
I note that you use > two different structures here in this pointer, should you really do > that? It's good to get type-checks whereever possible. The structure is just for internal use of ASPM, just don't want make it global. Fixed, now sparse and checkpatch.pl haven't warning. Signed-off-by: Shaohua Li <[EMAIL PROTECTED]> --- drivers/pci/pci-sysfs.c |5 drivers/pci/pci.c |4 drivers/pci/pcie/Kconfig | 20 + drivers/pci/pcie/Makefile |3 drivers/pci/pcie/aspm.c | 818 ++ drivers/pci/probe.c |5 drivers/pci/remove.c |4 include/linux/aspm.h | 44 ++ include/linux/pci.h |4 include/linux/pci_regs.h |8 10 files changed, 915 insertions(+) Index: linux/drivers/pci/pcie/Makefile === --- linux.orig/drivers/pci/pcie/Makefile2008-01-23 10:14:14.0 +0800 +++ linux/drivers/pci/pcie/Makefile 2008-01-23 10:14:46.0 +0800 @@ -2,6 +2,9 @@ # Makefile for PCI-Express PORT Driver # +# Build PCI Express ASPM if needed +obj-$(CONFIG_PCIEASPM) += aspm.o + pcieportdrv-y := portdrv_core.o portdrv_pci.o portdrv_bus.o obj-$(CONFIG_PCIEPORTBUS) += pcieportdrv.o Index: linux/drivers/pci/pcie/aspm.c === --- /dev/null 1970-01-01 00:00:00.0 + +++ linux/drivers/pci/pcie/aspm.c 2008-01-23 10:14:46.0 +0800 @@ -0,0 +1,818 @@ +/* + * File: drivers/pci/pcie/aspm.c + * Enabling PCIE link L0s/L1 state and Clock Power Management + * + * Copyright (C) 2007 Intel + * Copyright (C) Zhang Yanmin ([EMAIL PROTECTED]) + * Copyright (C) Shaohua Li ([EMAIL PROTECTED]) + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include "../pci.h" + +#ifdef MODULE_PARAM_PREFIX +#undef MODULE_PARAM_PREFIX +#endif +#define MODULE_PARAM_PREFIX "pcie_aspm." 
+ +/* only for downstream port */ +struct link_state { + struct list_head sibiling; + struct pci_dev *pdev; + + /* ASPM state */ + unsigned int support_state; + unsigned int enabled_state; + unsigned int bios_aspm_state; + /* upstream component */ + unsigned int l0s_upper_latency; + unsigned int l1_upper_latency; + /* downstream component */ + unsigned int l0s_down_latency; + unsigned int l1_down_latency; + /* Clock PM state*/ + unsigned int clk_pm_capable; + unsigned int clk_pm_enabled; + unsigned int bios_clk_state; + +}; + +/* Only for endpoint */ +struct endpoint_state { + unsigned int l0s_acceptable_latency; + unsigned int l1_acceptable_latency; +}; + +static int aspm_disabled; +static DEFINE_MUTEX(aspm_lock); +static
Re: [PATCH 06/26] atl1: update initialization parameters
Jay Cliburn wrote:
> On Tue, 22 Jan 2008 04:56:11 -0500 Jeff Garzik <[EMAIL PROTECTED]> wrote:
>
> > [EMAIL PROTECTED] wrote:
> > > From: Jay Cliburn <[EMAIL PROTECTED]>
> > >
> > > Update initialization parameters to match the current vendor driver
> > > version 1.2.40.2.
> [...]
> > ACK without any better knowledge... but is any additional insight
> > available at all?
>
> No, sorry Jeff. I simply took the vendor's current driver and matched
> his initialization settings. I can only assume he discovered these
> values through lab testing.
>
> For this and the other "conform to vendor driver" patches in this set, I
> thought it important to have the in-tree driver match the vendor driver
> as closely as possible. The primary motivations are (1) my belief that
> he's in a better position to test the NIC, and (2) to be able to go to
> him for assistance occasionally and not be rejected because of
> significant differences between his and our drivers.

Since these changes are not simply moving code around, we really do need full explanations for them, and to understand their need.

Blindly copying code from an exterior driver is pointless, and no way at all to run an engineering process. If the driver is not going to get the review and attention necessary, bug fixes and feedback attended-to, then there's not much point in having this driver in the kernel at all.

You will only lead yourself to frustration, if you set up a system where changes only flow one way. That's not how Linux development is done at all.

Jeff
Re: [PATCH 0/2] Relax restrictions on setting CONFIG_NUMA on x86
Hi mel

> Hi
>
> > A fix[1] was merged to the x86.git tree that allowed NUMA kernels to boot
> > on normal x86 machines (and not just NUMA-Q, Summit etc.). I took a look
> > at the restrictions on setting NUMA on x86 to see if they could be lifted.
>
> Interesting!
>
> I will test tomorrow.

Hmm... It doesn't work on my machine. It panics at boot in __free_pages_ok() with the call trace below.

[] free_all_bootmem_core
[] mem_init
[] alloc_large_system_hash
[] inode_init_early
[] start_kernel
[] unknown_bootoption

my machine spec
  CPU: Pentium4 with HT
  MEM: 512M

I will try to investigate more, but I have no time for a while, sorry ;-)

BTW: when sparse mem is configured instead of discontig mem, it panics at boot in get_pageblock_flags_group() with the call stack below.

free_initrd
free_init_pages
free_hot_cold_page

- kosaki
Re: 2.6.24-rc8: iwl3945 gets stuck
On Tuesday 22 January 2008 17:15:42 John W. Linville wrote:
> On Tue, Jan 22, 2008 at 09:54:11PM +0100, Harald Dunkel wrote:
> > If I put some heavy load on the iwl3945, then the network connection
> > gets stuck after some time. To fix it I have to reload the module.
>
> Can you quantify this a bit more? What constitutes a "heavy load"?
> What (if any) encryption are you using? Are you using any options
> for iwl3945 in /etc/modprobe.conf?
>
> Could you include the output of dmesg and/or the contents of
> /var/log/messages (trimmed for the most recent boot)?
>
> > AFAICS this problem was a topic on lkml almost 3 months ago. Any news
> > about this? I would be glad to help to track this down, but I have
> > no idea how to change the scaling algorithm to iwl-3945-rs .
>
> This should happen automatically now.
>
> John

I've been getting a warning in the dmesg of my laptop with every boot since I started using 2.6.24-rc7 that might be related. This doesn't appear to cause any problems, but from looking at the source of the warning it appears that the ipw3945 hardware might be causing the problem.
[ 31.460143] ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready [ 31.549722] WARNING: at net/mac80211/rx.c:1486 __ieee80211_rx() [ 31.549817] Pid: 4436, comm: amixer Not tainted 2.6.24-rc7-git2 #1 [ 31.549903] [ 31.549904] Call Trace: [ 31.550063][] :mac80211:__ieee80211_rx+0xc99/0xd60 [ 31.550236] [] _spin_unlock_irqrestore+0x16/0x40 [ 31.550332] [] :iwl3945:iwl_rx_queue_restock+0xca/0x170 [ 31.550422] [] _spin_unlock_irqrestore+0x16/0x40 [ 31.550520] [] :mac80211:ieee80211_tasklet_handler+0xb8/0x120 [ 31.550646] [] tasklet_action+0x51/0xc0 [ 31.550732] [] _spin_unlock+0x14/0x40 [ 31.550820] [] __do_softirq+0x64/0xe0 [ 31.550909] [] call_softirq+0x1c/0x30 [ 31.550995] [] do_softirq+0x3d/0x90 [ 31.551083] [] irq_exit+0x88/0xa0 [ 31.551169] [] do_IRQ+0xc5/0x1b0 [ 31.551257] [] ret_from_intr+0x0/0xa [ 31.551369][] get_page_from_freelist+0x30e/0x670 [ 31.551519] [] __alloc_pages+0x6e/0x3b0 [ 31.551608] [] generic_file_aio_read+0xd7/0x180 [ 31.551699] [] alloc_page_vma+0x9c/0xf0 [ 31.551788] [] handle_mm_fault+0x50e/0x780 [ 31.551874] [] _spin_unlock+0x14/0x40 [ 31.551962] [] _spin_unlock_irqrestore+0x16/0x40 [ 31.552052] [] do_page_fault+0x228/0x970 [ 31.552146] [] _spin_unlock+0x14/0x40 [ 31.552251] [] vfs_read+0x13e/0x180 [ 31.552340] [] error_exit+0x0/0x51 [ 31.552436] The location of the warning is: hdrlen = ieee80211_get_hdrlen(rx.fc); line in question -->WARN_ON_ONCE(((unsigned long)(skb->data + hdrlen)) & 3); if (type == IEEE80211_FTYPE_DATA || type == IEEE80211_FTYPE_MGMT) local->dot11ReceivedFragmentCount++; sta = rx.sta = sta_info_get(local, hdr->addr2); Now, the problem is that this might be nothing, and it might be the cause of the problem. (I don't think it is the cause, myself, because I've subjected my laptop to a lot of activity - to the point that the card was starting to drop packets - and have seen no problems) DRH -- Dialup is like pissing through a pipette. Slow and excruciatingly painful. 
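For readers unfamiliar with the check, the quoted WARN_ON_ONCE() fires when the data that follows the 802.11 header does not start on a 4-byte boundary. A user-space sketch of the same test is below; the function name is made up, and the header lengths in the usage note (24 bytes for a plain data frame, 26 with a QoS field) are stated from general 802.11 knowledge rather than from this thread.

```c
#include <stdint.h>

/*
 * Mirrors the quoted expression ((unsigned long)(skb->data + hdrlen)) & 3:
 * returns 1 if the payload after a header of hdrlen bytes is not
 * 4-byte aligned, 0 otherwise. Hypothetical helper for illustration.
 */
static int payload_misaligned(const void *data, unsigned int hdrlen)
{
	return (((uintptr_t)data + hdrlen) & 3) != 0;
}
```

With a 4-byte-aligned buffer, a 24-byte header leaves the payload aligned while a 26-byte header does not, unless the buffer itself starts 2 bytes off. That would be consistent with the warning depending on how the driver places received frames, though the thread does not confirm the cause.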
Re: [PATCH] kernel/params.c: fix the module name length in param_sysfs_builtin
On Wednesday 23 January 2008 10:13:37 Jan Engelhardt wrote:
> On Jan 21 2008 22:16, Rusty Russell wrote:
> >On Monday 21 January 2008 20:08:25 Denis Cheng wrote:
> >> the original code use KOBJ_NAME_LEN for built-in module name length,
> >> that's defined to 20 in linux/kobject.h, but this is not enough
> >> apparently, many module names are longer than this;
> >> #define KOBJ_NAME_LEN 20
> >
> >Thanks, applied. I was surprised to learn that we have a 35-char source
> >filename in the kernel.
> >
> >And congratulations to nf_conntrack_l3proto_ipv4_compat.c!
>
> But nf..dada_compat.c gets linked into nf_conntrack_ipv4.ko,
> and that is what is used in /sys/module - and it fits the 20.
> Any place where nf_conntrack_l3proto_ipv4_compat would still be used?

Of course, but my point was that we already have a 35 char filename in the kernel, and lots of > 22 chars, so increasing it is not unreasonable.

FYI make allmodconfig here gives me the following of 21 chars or longer:

dvb-usb-af9005-remote
dvb-usb-dibusb-common
nf_conntrack_netbios_ns
nf_conntrack_proto_udplite
nf_conntrack_proto_sctp
nf_conntrack_proto_gre

Cheers,
Rusty.
Re: [PATCH] kernel/params.c: fix the module name length in param_sysfs_builtin
On Jan 23, 2008 7:13 AM, Jan Engelhardt <[EMAIL PROTECTED]> wrote:
> But nf..dada_compat.c gets linked into nf_conntrack_ipv4.ko,
> and that is what is used in /sys/module - and it fits the 20.
> Any place where nf_conntrack_l3proto_ipv4_compat would still be used?

there is a module named nf_conntrack_proto_icmp.ko, length 23.

and you can find all of them by:

$ make allmodconfig && make modules
$ find -name '*.ko' -printf '%f\n' |gawk '{print length($0), $0}' |sort -n
...
24 dvb-usb-af9005-remote.ko
24 dvb-usb-dibusb-common.ko
25 nf_conntrack_proto_gre.ko
26 nf_conntrack_netbios_ns.ko
26 nf_conntrack_proto_sctp.ko
29 nf_conntrack_proto_udplite.ko

so currently the max length of a module name is 26 (in nf_conntrack_proto_udplite), but there is still no length limit for module names in Documentation/, so we either have to reserve space for future modules, or mark MODULE_NAME_LEN as the module name length limit in Documentation/?

Simply speaking, MODULE_NAME_LEN does the better job.

--
Denis Cheng
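The size argument in this thread can be checked with a few lines of C. This is a user-space sketch, not the code in kernel/params.c: the helper names are invented, the names are the six modules listed in the thread (without the .ko suffix), and the larger limit is written as 64 - sizeof(unsigned long), which is my understanding of how MODULE_NAME_LEN is defined and should be treated as an assumption.

```c
#include <stddef.h>
#include <string.h>

#define OLD_NAME_LEN 20	/* KOBJ_NAME_LEN as quoted in the thread */

/* Module names reported in the thread as 21 characters or longer. */
static const char *const long_names[] = {
	"dvb-usb-af9005-remote",
	"dvb-usb-dibusb-common",
	"nf_conntrack_netbios_ns",
	"nf_conntrack_proto_udplite",
	"nf_conntrack_proto_sctp",
	"nf_conntrack_proto_gre",
};

/* 1 if name plus its NUL terminator fits in a buffer of buflen bytes. */
static int name_fits(const char *name, size_t buflen)
{
	return strlen(name) + 1 <= buflen;
}

/* Count the names that would not fit in a buflen-byte buffer. */
static size_t count_too_long(const char *const *names, size_t n, size_t buflen)
{
	size_t i, bad = 0;

	for (i = 0; i < n; i++)
		if (!name_fits(names[i], buflen))
			bad++;
	return bad;
}
```

None of the six names fits in a 20-byte buffer, while all of them fit with room to spare under the larger limit.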
Re: 2.6.24-rc8-mm1 : net tcp_input.c warnings
On Jan 22, 2008 6:47 PM, Ilpo Järvinen <[EMAIL PROTECTED]> wrote: > > On Tue, 22 Jan 2008, Dave Young wrote: > > > On Jan 22, 2008 12:37 PM, Dave Young <[EMAIL PROTECTED]> wrote: > > > > > > On Jan 22, 2008 5:14 AM, Ilpo Järvinen <[EMAIL PROTECTED]> wrote: > > > > > > > > On Mon, 21 Jan 2008, Dave Young wrote: > > > > > > > > > Please see the kernel messages following,(trigged while using some > > > > > qemu session) > > > > > BTW, seems there's some e100 error message as well. > > > > > > > > > > PCI: Setting latency timer of device :00:1b.0 to 64 > > > > > e100: Intel(R) PRO/100 Network Driver, 3.5.23-k4-NAPI > > > > > e100: Copyright(c) 1999-2006 Intel Corporation > > > > > ACPI: PCI Interrupt :03:08.0[A] -> GSI 20 (level, low) -> IRQ 20 > > > > > modprobe:2331 conflicting cache attribute efaff000-efb0 > > > > > uncached<->default > > > > > e100: :03:08.0: e100_probe: Cannot map device registers, aborting. > > > > > ACPI: PCI interrupt for device :03:08.0 disabled > > > > > e100: probe of :03:08.0 failed with error -12 > > > > > eth0: setting full-duplex. > > > > > [ cut here ] > > > > > WARNING: at net/ipv4/tcp_input.c:2169 tcp_mark_head_lost+0x121/0x150() > > > > > Modules linked in: snd_seq_dummy snd_seq_oss snd_seq_midi_event > > > > > snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss eeprom e100 psmouse > > > > > snd_hda_intel snd_pcm snd_timer btusb rtc_cmos thermal bluetooth > > > > > rtc_core serio_raw intel_agp button processor sg snd rtc_lib i2c_i801 > > > > > evdev agpgart soundcore dcdbas 3c59x pcspkr snd_page_alloc > > > > > Pid: 0, comm: swapper Not tainted 2.6.24-rc8-mm1 #4 > > > > > [] ? printk+0x0/0x20 > > > > > [] warn_on_slowpath+0x54/0x80 > > > > > [] ? ip_finish_output+0x128/0x2e0 > > > > > [] ? ip_output+0xe7/0x100 > > > > > [] ? ip_local_out+0x18/0x20 > > > > > [] ? ip_queue_xmit+0x3dc/0x470 > > > > > [] ? _spin_unlock_irqrestore+0x5e/0x70 > > > > > [] ? 
check_pad_bytes+0x61/0x80 > > > > > [] tcp_mark_head_lost+0x121/0x150 > > > > > [] tcp_update_scoreboard+0x4c/0x170 > > > > > [] tcp_fastretrans_alert+0x48a/0x6b0 > > > > > [] tcp_ack+0x1b3/0x3a0 > > > > > [] tcp_rcv_established+0x3eb/0x710 > > > > > [] tcp_v4_do_rcv+0xe5/0x100 > > > > > [] tcp_v4_rcv+0x5db/0x660 > > > > > > > > Doh, once more these S+L things..., the rest are symptom of the first > > > > problem. > > > > > > What is the S+L thing? Could you explain a bit? > > It means that one of the skbs is both SACKed and marked as LOST at the > same time in the counters (might be due to miscount of lost/sacked_out > too, not necessarilily in the ->sacked bits). Such state is logically > invalid because it would mean that the sender thinks that the same packet > both reached the receiver and is lost in the network. > > Traditionally TCP has just silently "corrected" over-estimates > (sacked_out+lost_out > packets_out). I changed this couple of releases ago > because those over-estimates often are due to bugs that should be fixed > (there have been couple of them but it has been very quite on this front > long time, months or even half year already; but I might have broken > something with the early Dec changes). > > These problem may originate from a bug that occurred a number of ACKs > earlier the WARN_ON triggered, therefore they are a bit tricky to track, > those WARN_ON serve just for alerting purposes and usually do not point > out where the bug actually occurred. > > I usually just asked people to include exhaustive verifier which compares > ->sacked bitmaps with sacked/lost_out counters and report immediately when > the problem shows up, rather than waiting for the cheaper S+L check we do > in the WARN_ON to trigger. I tried to collect tracking patch from the > previous efforts (hopefully got it right after modifications). > > > > I'm a bit worried about its > > > > reproducability if it takes this far to see it... 
> > > > > > > > It's trigged again in my pc, just while using firefox. > > ...Good, then there's some chance to catch it. > > -- > i. > > [PATCH] [TCP]: debug S+L Thanks, If there's new findings I will let you know. > > --- > include/net/tcp.h |8 +++- > net/ipv4/tcp_input.c |6 +++ > net/ipv4/tcp_ipv4.c | 101 > + > net/ipv4/tcp_output.c | 21 +++--- > 4 files changed, 129 insertions(+), 7 deletions(-) > > diff --git a/include/net/tcp.h b/include/net/tcp.h > index 7de4ea3..0685035 100644 > --- a/include/net/tcp.h > +++ b/include/net/tcp.h > @@ -272,6 +272,8 @@ DECLARE_SNMP_STAT(struct tcp_mib, tcp_statistics); > #define TCP_ADD_STATS_BH(field, val) SNMP_ADD_STATS_BH(tcp_statistics, > field, val) > #define TCP_ADD_STATS_USER(field, val) SNMP_ADD_STATS_USER(tcp_statistics, > field, val) > > +extern voidtcp_verify_wq(struct sock *sk); > + > extern voidtcp_v4_err(struct sk_buff *skb, u32); > > extern void
[PATCH] x86: left over fix for leak of early_ioremp in dmi_scan
[PATCH] x86: left over fix for leak of early_ioremp in dmi_scan

Signed-off-by: Yinghai Lu <[EMAIL PROTECTED]>

Index: linux-2.6/drivers/firmware/dmi_scan.c
===
--- linux-2.6.orig/drivers/firmware/dmi_scan.c
+++ linux-2.6/drivers/firmware/dmi_scan.c
@@ -353,6 +353,7 @@ void __init dmi_scan_machine(void)
 				return;
 			}
 		}
+		dmi_iounmap(p, 0x1);
 	}
  out:	printk(KERN_INFO "DMI not present or invalid.\n");
 }
Re: [kvm-devel] [PATCH] export notifier #1
On Tue, Jan 22, 2008 at 04:40:50PM -0800, Christoph Lameter wrote:
> On Wed, 23 Jan 2008, Benjamin Herrenschmidt wrote:
>
> > > - anon_vma/inode and pte locks are held during callbacks.
> >
> > So how does that fix the problem of sleeping then ?
>
> The locks are taken in the mmu_ops patch. This patch does not hold them
> while performing the callbacks.

Let me start by clarifying: the page is referenced prior to exporting, and that reference is not removed until after recall is complete and memory protections are back to normal.

As Christoph pointed out, the mmu_ops callouts do not allow sleeping. This is a problem for us, as our recall path includes a message to one or more other hosts and a wait until we receive a response. That message sequence can take seconds or more to complete. It includes an operation to ensure the memory is in a cross-partition clean state and then changes memory protection. When that is complete we remove our page reference and return.

Christoph's patch allows that long slow activity to happen prior to the mmu_ops callout. By the time the mmu_ops callout is made, we are no longer exporting the page, so the cleanup is equivalent to the cleanup of a page we have never used.

Thanks,
Robin Holt
Re: [PATCH] cgroup: limit block I/O bandwidth
On 22/01/2008, Andrea Righi <[EMAIL PROTECTED]> wrote:
> Naveen Gupta wrote:
> > See if using priority levels to have per level bandwidth limit can
> > solve the priority inversion problem you were seeing earlier. I have a
> > priority scheduling patch for anticipatory scheduler, if you want to
> > try it. It's much simpler than CFQ priority. I still need to port it
> > to 2.6.24 though and send across for review.
> >
> > Though as already said, this would be for read side only.
> >
> > -Naveen
>
> Thanks Naveen, I can test your scheduler if you want, but the priority
> inversion problem (or better we should call it a "bandwidth limiting"
> that impacts in wrong tasks) occurs only with write operations and, as
> said by Jens, the I/O scheduler is not the right place to implement this
> kind of limiting, because at this level the processes have already
> performed the operations (dirty pages in memory) that raise the requests
> to the I/O scheduler (made by different processes asynchronously).

If the i/o submission is happening in bursts, and we limit the rate during submission, we will have to stop the current task from submitting any further i/o and hence change its pattern. Also, then we are limiting the submission rate and not the rate which is going on the wire, as the scheduler may reorder.

One of the ways could be to limit the rate when the i/o is sent out from the scheduler, and if we see that the number of allocated requests is above a threshold we disallow request allocation in the offending task. This way an application submitting bursts under the allowed average rate will not stop frequently. Something like a leaky bucket.

Now for dirtying of memory happening in a different context than the submission path, you could still put a limit looking at the dirty ratio, with this limit higher than the actual b/w rate you are looking to achieve. In the process you make sure you always have something to write and still do not blow your entire memory.
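To make the "something like a leaky bucket" remark concrete, here is a minimal user-space sketch of a leaky bucket used as a meter. It is not taken from any patch in this thread: the struct, field and function names are invented, time is passed in as milliseconds and assumed to be monotonic, and the integer division drops fractional bytes on each update.

```c
#include <stdint.h>

struct leaky_bucket {
	uint64_t rate;     /* bytes drained per second (allowed average) */
	uint64_t capacity; /* largest backlog tolerated, in bytes (burst) */
	uint64_t level;    /* current backlog in bytes */
	uint64_t last_ms;  /* time of the previous update, milliseconds */
};

/* Drain the bucket for the time elapsed since the last update. */
static void lb_drain(struct leaky_bucket *b, uint64_t now_ms)
{
	uint64_t leaked = (now_ms - b->last_ms) * b->rate / 1000;

	b->level = leaked >= b->level ? 0 : b->level - leaked;
	b->last_ms = now_ms;
}

/*
 * Returns 1 and accounts for the request if it fits in the bucket,
 * 0 if the submitting task should be held back for now.
 */
static int lb_admit(struct leaky_bucket *b, uint64_t bytes, uint64_t now_ms)
{
	lb_drain(b, now_ms);
	if (b->level + bytes > b->capacity)
		return 0;
	b->level += bytes;
	return 1;
}
```

A burst smaller than capacity is admitted at once, and the task is only refused when its backlog outruns what the average rate has drained, which matches the goal of not stopping well-behaved bursty submitters.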
Or you can get really fancy and track who dirtied the i/o and start limiting it that way. > > A possible way to model the write limiting is to look at the dirty page > ratio that is, in part, the principal reason for the requests to the I/O > scheduler. But in this way we would limit also the re-write operations > in memory and this is too much limiting. > > So, the cgroup dirty page throttling could be very interesting anyway, > but it's not the same thing as limiting the real write I/O bandwidth. > > For now I've rewritten my patch as following, moving away the code from > the I/O scheduler, it seems to work in my small tests (apart all the > things said above), but I'd like to find a different way to have a more > sophisticated I/O throttling approach (probably looking also directly at > the read()/write() level)... just investigating for now... > > BTW I've seen that also OpenVZ has not a solution for this problem, yet. > AFAIU OpenVZ I/O activity is accounted in virtual enviromnents (VE) by > the user beancounters (http://wiki.openvz.org/IO_accounting), but > there's not any policy that implements the block I/O limiting, except > that it's possible to set different per-VE I/O priorities (mapped on CFQ > priorities). But I've not understood if this just sets this I/O priority > to all processes in the VE, or if it does something different. I still > need to look at the code in details. 
> > -Andrea > > Signed-off-by: Andrea Righi <[EMAIL PROTECTED]> > --- > > diff -urpN linux-2.6.24-rc8/block/io-throttle.c > linux-2.6.24-rc8-cgroup-io-throttling/block/io-throttle.c > --- linux-2.6.24-rc8/block/io-throttle.c1970-01-01 01:00:00.0 > +0100 > +++ linux-2.6.24-rc8-cgroup-io-throttling/block/io-throttle.c 2008-01-22 > 23:06:09.0 +0100 > @@ -0,0 +1,222 @@ > +/* > + * io-throttle.c > + * > + * This program is free software; you can redistribute it and/or > + * modify it under the terms of the GNU General Public > + * License as published by the Free Software Foundation; either > + * version 2 of the License, or (at your option) any later version. > + * > + * This program is distributed in the hope that it will be useful, > + * but WITHOUT ANY WARRANTY; without even the implied warranty of > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + * General Public License for more details. > + * > + * You should have received a copy of the GNU General Public > + * License along with this program; if not, write to the > + * Free Software Foundation, Inc., 59 Temple Place - Suite 330, > + * Boston, MA 021110-1307, USA. > + * > + * Copyright (C) 2008 Andrea Righi <[EMAIL PROTECTED]> > + */ > + > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +struct iothrottle { > + struct cgroup_subsys_state css; > + spinlock_t lock; > + unsigned long iorate; > + unsigned long req; > + unsigned long last_request; >
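For reference, the "something like leaky bucket" suggestion made in the reply above can be sketched in user space. This is only an illustration, not code from the patch: it is written as a token bucket (a common way to express the same policy), and every name in it (io_bucket, io_bucket_take, the byte-based units) is an assumption made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical token bucket; names and units are illustrative only. */
struct io_bucket {
	uint64_t rate;    /* tokens (e.g. bytes) added per second */
	uint64_t burst;   /* maximum number of tokens the bucket holds */
	uint64_t tokens;  /* tokens currently available */
	uint64_t last_ns; /* time of the last refill, in nanoseconds */
};

static void io_bucket_init(struct io_bucket *b, uint64_t rate,
			   uint64_t burst, uint64_t now_ns)
{
	assert(rate > 0 && burst > 0);
	b->rate = rate;
	b->burst = burst;
	b->tokens = burst;	/* start full so an initial burst is allowed */
	b->last_ns = now_ns;
}

/* Add the tokens earned since the last refill, capped at the burst size. */
static void io_bucket_refill(struct io_bucket *b, uint64_t now_ns)
{
	uint64_t add;

	if (now_ns <= b->last_ns)
		return;
	/* Sketch only: the multiplication can overflow for huge intervals. */
	add = (now_ns - b->last_ns) * b->rate / 1000000000ULL;
	if (add == 0)
		return;		/* keep last_ns so short intervals accumulate */
	b->tokens = (b->tokens + add > b->burst) ? b->burst : b->tokens + add;
	b->last_ns = now_ns;
}

/* Return true if 'cost' tokens were available and have been consumed. */
static bool io_bucket_take(struct io_bucket *b, uint64_t cost, uint64_t now_ns)
{
	io_bucket_refill(b, now_ns);
	if (b->tokens < cost)
		return false;	/* caller would delay or refuse the request */
	b->tokens -= cost;
	return true;
}
```

A task that stays under the average rate never sees io_bucket_take() fail, while a burst larger than the bucket is refused until enough time has passed; that is the behaviour the reply asks for.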
Re: [PATCH 1/6] driver-core : add class iteration api
On Jan 23, 2008 6:25 AM, Greg KH <[EMAIL PROTECTED]> wrote:
> On Tue, Jan 22, 2008 at 03:27:08PM +0800, Dave Young wrote:
> >
> > Add the following class iteration functions for driver use:
> > class_for_each_device
> > class_find_device
> > class_for_each_child
> > class_find_child
>
> As class_for_each_child() is not used by anyone in this patch series,
> and we want to heavily discourage the use of class_device (only scsi and
> IB are the last remaining users), I'll cut out this portion of the
> patch.
>
> Any objection?

Looks good to me.

> thanks,
>
> greg k-h

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] x86: test case for the RODATA config option
On Wednesday 23 January 2008 09:44, Arjan van de Ven wrote:
> From: Arjan van de Ven <[EMAIL PROTECTED]>
> Subject: x86: test case for the RODATA config option
>
> This patch adds a test module for the DEBUG_RODATA config
> option to make sure change_page_attr() did indeed make
> "const" data read only.
>
> This testcase both tests the DEBUG_RODATA code as well as
> the change_page_attr() code for correct operation.
>
> When the tests/ patch gets merged, this module should move
> to the tests/ directory.
>
> Signed-off-by: Arjan van de Ven <[EMAIL PROTECTED]>
> ---
>  arch/x86/Kconfig.debug        |    8 +
>  arch/x86/kernel/Makefile_32   |    1
>  arch/x86/kernel/Makefile_64   |    2 +
>  arch/x86/kernel/test_rodata.c |   65 ++
>  arch/x86/mm/init_32.c         |    3 +
>  arch/x86/mm/init_64.c         |    3 +
>  6 files changed, 82 insertions(+)
>
> Index: linux-2.6.24-rc8/arch/x86/Kconfig.debug
> ===
> --- linux-2.6.24-rc8.orig/arch/x86/Kconfig.debug
> +++ linux-2.6.24-rc8/arch/x86/Kconfig.debug
> @@ -57,6 +57,14 @@ config DEBUG_RODATA
>  	  portion of the kernel code won't be covered by a 2MB TLB anymore.
>  	  If in doubt, say "N".
>
> +config DEBUG_RODATA_TEST
> +	tristate "Testcase for the DEBUG_RODATA feature"
> +	depends on DEBUG_RODATA && m
> +	help
> +	  This option enables a testcase for the DEBUG_RODATA
> +	  feature as well as for the change_page_attr() infrastructure.
> +	  If in doubt, say "N"
> +
>  config 4KSTACKS
>  	bool "Use 4Kb for kernel stacks instead of 8Kb"
>  	depends on DEBUG_KERNEL
> Index: linux-2.6.24-rc8/arch/x86/mm/init_32.c
> ===
> --- linux-2.6.24-rc8.orig/arch/x86/mm/init_32.c
> +++ linux-2.6.24-rc8/arch/x86/mm/init_32.c
> @@ -790,6 +790,9 @@ static int noinline do_test_wp_bit(void)
>
>  #ifdef CONFIG_DEBUG_RODATA
>
> +const int rodata_test_data;
> +EXPORT_SYMBOL_GPL(rodata_test_data);
> +
>  void mark_rodata_ro(void)
>  {
>  	unsigned long start = PFN_ALIGN(_text);
> Index: linux-2.6.24-rc8/arch/x86/mm/init_64.c
> ===
> --- linux-2.6.24-rc8.orig/arch/x86/mm/init_64.c
> +++ linux-2.6.24-rc8/arch/x86/mm/init_64.c
> @@ -590,6 +590,9 @@ void free_initmem(void)
>
>  #ifdef CONFIG_DEBUG_RODATA
>
> +const int rodata_test_data = 5;

I guess this should match the 32-bit case, and be zero instead of 5?

Can you disallow building as a module, and put this in the test code?
It could be run from the end of mark_rodata_ro()...
Re: 2.6.24 regression: pan hanging unkilleable and un-straceable
On Tuesday 22 January 2008 21:37, Ingo Molnar wrote:
> * Nick Piggin <[EMAIL PROTECTED]> wrote:
> > Well I've twice tried to submit a patch to print stacks for running
> > tasks as well, but nobody seems interested. It would at least give a
> > chance to see something.
>
> i definitely remembering having done this myself a couple of times (it
> makes tons of sense to get _some_ info out of the system) but some
> problem in -mm kept reverting it. I dont remember the specifics ... it
> was some race.

Hmm, that's not unlikely. But there is nothing in the backtrace code
which prevents a task from being woken up anyway, is there? I guess
it will be more common now, but if we find a race we can try to fix
the root cause.
Re: [PATCH RESEND] Minimal fix for private_list handling races
On Wednesday 23 January 2008 04:10, Jan Kara wrote:
> Hi,
>
> as I got no answer for a week, I'm resending this fix for races in
> private_list handling. Andrew, do you like them more than the previous
> version?

FWIW, I reviewed this, and it looks OK although I think some comments
would be in order.

What would be really nice is to avoid the use of b_assoc_buffers
completely in this function like I've attempted (untested). I don't
know if you'd actually call that an improvement...?

Couple of things I noticed while looking at this code.

- What is osync_buffers_list supposed to do? I couldn't actually work
  it out. Why do we care about waiting for these buffers on here that
  were added while waiting for writeout of other buffers to finish? Can
  we just remove it completely? I must be missing something.

- What are the get_bh(bh) things supposed to do? Protect the lifetime
  of a given bh while "lock" is dropped? That's nice, ignoring the fact
  that we brelse(bh) *before* taking the lock again... but isn't every
  single other buffer that we _haven't_ elevated its reference exposed
  to exactly the same lifetime problem? IOW, either it is not required
  at all, or it is required for _all_ buffers? (my patch should fix
  this).

Hmm, now I remember why I rewrote this file :P

Index: linux-2.6/fs/buffer.c
===
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -792,47 +792,53 @@ EXPORT_SYMBOL(__set_page_dirty_buffers);
  */
 static int fsync_buffers_list(spinlock_t *lock, struct list_head *list)
 {
+	struct buffer_head *batch[16];
+	int i, idx, done;
 	struct buffer_head *bh;
-	struct list_head tmp;
 	int err = 0, err2;
 
-	INIT_LIST_HEAD(&tmp);
-
+again:
 	spin_lock(lock);
+	idx = 0;
 	while (!list_empty(list)) {
 		bh = BH_ENTRY(list->next);
 		__remove_assoc_queue(bh);
 		if (buffer_dirty(bh) || buffer_locked(bh)) {
-			list_add(&bh->b_assoc_buffers, &tmp);
-			if (buffer_dirty(bh)) {
-				get_bh(bh);
-				spin_unlock(lock);
-				/*
-				 * Ensure any pending I/O completes so that
-				 * ll_rw_block() actually writes the current
-				 * contents - it is a noop if I/O is still in
-				 * flight on potentially older contents.
-				 */
-				ll_rw_block(SWRITE, 1, &bh);
-				brelse(bh);
-				spin_lock(lock);
-			}
+			batch[idx++] = bh;
+			get_bh(bh);
 		}
+
+		if (idx == 16)
+			break;
 	}
+	done = list_empty(list);
+	spin_unlock(lock);
 
-	while (!list_empty(&tmp)) {
-		bh = BH_ENTRY(tmp.prev);
-		list_del_init(&bh->b_assoc_buffers);
-		get_bh(bh);
-		spin_unlock(lock);
+	for (i = 0; i < idx; i++) {
+		bh = batch[i];
+		if (buffer_dirty(bh)) {
+			/*
+			 * Ensure any pending I/O completes so
+			 * that ll_rw_block() actually writes
+			 * the current contents - it is a noop
+			 * if I/O is still in flight on
+			 * potentially older contents.
+			 */
+			ll_rw_block(SWRITE, 1, &bh);
+		}
+	}
+	for (i = 0; i < idx; i++) {
+		bh = batch[i];
 		wait_on_buffer(bh);
 		if (!buffer_uptodate(bh))
 			err = -EIO;
 		brelse(bh);
-		spin_lock(lock);
 	}
+
+	idx = 0;
+	if (!done)
+		goto again;
 
-	spin_unlock(lock);
 	err2 = osync_buffers_list(lock, list);
 	if (err)
 		return err;
Re: [PATCH] ppc: fix #ifdef-s in mediabay driver
Hi,

On Wednesday 23 January 2008, Benjamin Herrenschmidt wrote:
>
> On Wed, 2008-01-23 at 00:12 +0100, Bartlomiej Zolnierkiewicz wrote:
> > * Replace incorrect CONFIG_BLK_DEV_IDE #ifdef in
> >   check_media_bay() by CONFIG_MAC_FLOPPY one.
> >
> > * Replace incorrect CONFIG_BLK_DEV_IDE #ifdef-s by
> >   CONFIG_BLK_DEV_IDE_PMAC ones.
> >
> > * check_media_bay() is used only by drivers/block/swim3.c
> >   so make this function available only if CONFIG_MAC_FLOPPY
> >   is defined.
> >
> > * check_media_bay_by_base() and media_bay_set_ide_infos()
> >   are used only by drivers/ide/ppc/pmac.c so make these
> >   functions available only if CONFIG_MAC_FLOPPY is defined.
> >
> > Signed-off-by: Bartlomiej Zolnierkiewicz <[EMAIL PROTECTED]>
> > ---
> > Ben, IMO this patch is safe for 2.6.24 (assuming that it builds fine :),
> > otherwise I would like to ask for permission to merge it through IDE
> > tree since I have other pending IDE patches depending on this one.
>
> I'd rather avoid touching 2.6.24 unless it actually fixes a bug or
> regression...

Well, it is a bugfix for PMAC_MEDIABAY=y && BLK_DEV_IDE=n && MAC_FLOPPY=y. :)

> I'm tempted to actually remove all ifdef's ... if you have a media-bay,
> then there are about 99% chances it contains an IDE device, with the
> remaining percent being split with putting a floppy or a battery in. I
> doubt anybody will care building a kernel without the support for these
> and with the mediabay support, and still want to save a handful of bytes
> in that driver.

I'm more worried about breaking automatic build checking (make randconfig)
than about a few extra bytes, so if you remove all #ifdefs you'll have to
either make BLK_DEV_IDE_PMAC select PMAC_MEDIABAY or make PMAC_MEDIABAY
depend on BLK_DEV_IDE_PMAC (otherwise BLK_DEV_IDE=n && PMAC_MEDIABAY=y will
fail since mediabay.c is referencing IDE code).

Thanks,
Bart
[PATCH 27/27] NFS: Separate caching by superblock, explicitly if necessary
Separate caching by superblock, explicitly if necessary. This means mounts of the same remote data with different parameters do not share cache objects for common files. The administrator may also provide a uniquifier to further enhance the uniqueness. Where it is otherwise impossible to distinguish superblocks because all the parameters are identical, but the 'nosharecache' option is supplied, a uniquifying string must be supplied, else only the first mount will be permitted to use the cache. If there's a key collision, then the second mount will disable caching and give a warning into the kernel log. There are three variant NFS mount options that can be added to a mount command to control caching for a mount. Only the last one specified takes effect: (*) Adding "fsc" will request caching. (*) Adding "fsc=" will request caching and also specify a uniquifier. (*) Adding "nofsc" will disable caching. Signed-off-by: David Howells <[EMAIL PROTECTED]> --- fs/nfs/fscache-def.c | 33 fs/nfs/fscache.c | 122 - fs/nfs/fscache.h | 46 - fs/nfs/internal.h |3 + fs/nfs/super.c| 24 +++-- include/linux/nfs_fs_sb.h |3 + 6 files changed, 220 insertions(+), 11 deletions(-) diff --git a/fs/nfs/fscache-def.c b/fs/nfs/fscache-def.c index bc20b7d..1d10b4e 100644 --- a/fs/nfs/fscache-def.c +++ b/fs/nfs/fscache-def.c @@ -117,6 +117,39 @@ const struct fscache_cookie_def nfs_cache_server_index_def = { }; /* + * Generate a key to describe a superblock key in the main NFS index + */ +static uint16_t nfs_super_get_key(const void *cookie_netfs_data, + void *buffer, uint16_t bufmax) +{ + const struct nfs_fscache_key *key; + const struct nfs_server *nfss = cookie_netfs_data; + uint16_t len; + + key = nfss->fscache_key; + len = sizeof(key->key) + key->key.uniq_len; + if (len > bufmax) { + len = 0; + } else { + memcpy(buffer, >key, sizeof(key->key)); + memcpy(buffer + sizeof(key->key), + key->key.uniquifier, key->key.uniq_len); + } + + return len; +} + +/* + * The superblock index for the filesystem 
is defined by all the NFS parameters + * that might cause a separate superblock + */ +const struct fscache_cookie_def nfs_cache_super_index_def = { + .name = "NFS.supers", + .type = FSCACHE_COOKIE_TYPE_INDEX, + .get_key= nfs_super_get_key, +}; + +/* * Generate a key to describe an NFS inode in an NFS server's index */ static uint16_t nfs_fh_get_key(const void *cookie_netfs_data, diff --git a/fs/nfs/fscache.c b/fs/nfs/fscache.c index 465f961..af9c65c 100644 --- a/fs/nfs/fscache.c +++ b/fs/nfs/fscache.c @@ -23,6 +23,9 @@ #define NFSDBG_FACILITYNFSDBG_FSCACHE +static struct rb_root nfs_fscache_keys = RB_ROOT; +static DEFINE_SPINLOCK(nfs_fscache_keys_lock); + /* * Get the per-client index cookie for an NFS client if the appropriate mount * flag was set @@ -52,6 +55,118 @@ void nfs_fscache_release_client_cookie(struct nfs_client *clp) } /* + * get a cookie for a superblock + */ +void nfs_fscache_get_super_cookie(struct super_block *sb, + struct nfs_parsed_mount_data *data) +{ + struct nfs_fscache_key *key, *xkey; + struct nfs_server *nfss = NFS_SB(sb); + struct rb_node **p, *parent; + const char *uniq = data->fscache_uniq ?: ""; + int diff, ulen; + + ulen = strlen(uniq); + key = kzalloc(sizeof(*key) + ulen, GFP_KERNEL); + if (!key) + return; + + key->nfs_client = nfss->nfs_client; + key->key.super.s_flags = sb->s_flags & NFS_MS_MASK; + key->key.nfs_server.flags = nfss->flags; + key->key.nfs_server.rsize = nfss->rsize; + key->key.nfs_server.wsize = nfss->wsize; + key->key.nfs_server.acregmin = nfss->acregmin; + key->key.nfs_server.acregmax = nfss->acregmax; + key->key.nfs_server.acdirmin = nfss->acdirmin; + key->key.nfs_server.acdirmax = nfss->acdirmax; + key->key.nfs_server.fsid = nfss->fsid; + key->key.rpc_auth.au_flavor = nfss->client->cl_auth->au_flavor; + + key->key.uniq_len = ulen; + memcpy(key->key.uniquifier, uniq, ulen); + + spin_lock(_fscache_keys_lock); + p = _fscache_keys.rb_node; + parent = NULL; + while (*p) { + parent = *p; + xkey = rb_entry(parent, struct 
nfs_fscache_key, node); + + if (key->nfs_client < xkey->nfs_client) + goto go_left; + if (key->nfs_client > xkey->nfs_client) + goto go_right; + + diff = memcmp(>key, >key, sizeof(key->key)); + if (diff < 0) + goto go_left; + if (diff > 0) +
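The key-generation step in nfs_super_get_key() above (check that the fixed part plus the uniquifier fits, then copy both, or report a zero length) can be tried outside the kernel. The following is only a user-space sketch of that pattern: struct demo_key, struct demo_key_hdr and demo_get_key() are made-up names, and the header fields are a small stand-in for the real key contents.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical fixed-size part of a key; the fields are placeholders. */
struct demo_key_hdr {
	uint32_t rsize;
	uint32_t wsize;
	uint16_t uniq_len;	/* length of the uniquifier that follows */
	uint16_t pad;		/* explicit padding so the layout is fixed */
};

struct demo_key {
	struct demo_key_hdr hdr;
	const char *uniquifier;	/* not NUL-terminated in the packed form */
};

/*
 * Pack the header followed by the uniquifier into 'buffer'.
 * Returns the number of bytes written, or 0 if the key does not fit,
 * mirroring the "len = 0" convention used in the patch.
 */
static uint16_t demo_get_key(const struct demo_key *key, void *buffer,
			     uint16_t bufmax)
{
	size_t len;

	assert(key != NULL && buffer != NULL);
	len = sizeof(key->hdr) + key->hdr.uniq_len;
	if (len > bufmax)
		return 0;

	memcpy(buffer, &key->hdr, sizeof(key->hdr));
	memcpy((char *)buffer + sizeof(key->hdr),
	       key->uniquifier, key->hdr.uniq_len);
	return (uint16_t)len;
}
```

Two mounts with identical parameters but different uniquifier strings produce different packed keys, which is what lets them keep separate cache objects.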
[PATCH 26/27] NFS: Display local caching state
Display the local caching state in /proc/fs/nfsfs/volumes.

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 fs/nfs/client.c  |    7 ---
 fs/nfs/fscache.h |   15 +++
 2 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index 92f9b84..68d3124 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -1335,7 +1335,7 @@ static int nfs_volume_list_show(struct seq_file *m, void *v)
 
 	/* display header on line 1 */
 	if (v == &nfs_volume_list) {
-		seq_puts(m, "NV SERVER PORT DEV FSID\n");
+		seq_puts(m, "NV SERVER PORT DEV FSID FSC\n");
 		return 0;
 	}
 	/* display one transport per line on subsequent lines */
@@ -1349,12 +1349,13 @@ static int nfs_volume_list_show(struct seq_file *m, void *v)
 		 (unsigned long long) server->fsid.major,
 		 (unsigned long long) server->fsid.minor);
 
-	seq_printf(m, "v%d %02x%02x%02x%02x %4hx %-7s %-17s\n",
+	seq_printf(m, "v%d %02x%02x%02x%02x %4hx %-7s %-17s %s\n",
 		   clp->cl_nfsversion,
 		   NIPQUAD(clp->cl_addr.sin_addr),
 		   ntohs(clp->cl_addr.sin_port),
 		   dev,
-		   fsid);
+		   fsid,
+		   nfs_server_fscache_state(server));
 
 	return 0;
 }
diff --git a/fs/nfs/fscache.h b/fs/nfs/fscache.h
index 144fb58..9a735fc 100644
--- a/fs/nfs/fscache.h
+++ b/fs/nfs/fscache.h
@@ -53,6 +53,17 @@ extern void __nfs_fscache_invalidate_page(struct page *, struct inode *);
 extern int nfs_fscache_release_page(struct page *, gfp_t);
 
 /*
+ * indicate the client caching state as readable text
+ */
+static inline const char *nfs_server_fscache_state(struct nfs_server *server)
+{
+	if (server->nfs_client->fscache &&
+	    (server->options & NFS_OPTION_FSCACHE))
+		return "yes";
+	return "no ";
+}
+
+/*
  * release the caching state associated with a page if undergoing complete page
  * invalidation
  */
@@ -109,6 +120,10 @@ static inline void nfs4_fscache_get_client_cookie(struct nfs_client *clp) {}
 static inline void nfs_fscache_release_client_cookie(struct nfs_client *clp) {}
 static inline void nfs_fscache_show_stats(struct seq_file *m, struct nfs_server *nfss) {}
 
+static inline const char *nfs_server_fscache_state(struct nfs_server *server)
+{
+	return "no ";
+}
 static inline void nfs_fscache_init_fh_cookie(struct inode *inode) {}
 static inline void nfs_fscache_enable_fh_cookie(struct inode *inode) {}
[PATCH 25/27] NFS: Configuration and mount option changes to enable local caching on NFS
Changes to the kernel configuration definitions and to the NFS mount options to
allow the local caching support added by the previous patch to be enabled.

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 fs/Kconfig        |    8 
 fs/nfs/client.c   |    2 ++
 fs/nfs/internal.h |    1 +
 fs/nfs/super.c    |   14 ++
 4 files changed, 25 insertions(+), 0 deletions(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index e0eedf9..8352dc7 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -1650,6 +1650,14 @@ config NFS_V4
 
 	  If unsure, say N.
 
+config NFS_FSCACHE
+	bool "Provide NFS client caching support (EXPERIMENTAL)"
+	depends on EXPERIMENTAL
+	depends on NFS_FS=m && FSCACHE || NFS_FS=y && FSCACHE=y
+	help
+	  Say Y here if you want NFS data to be cached locally on disc through
+	  the general filesystem cache manager
+
 config NFS_DIRECTIO
 	bool "Allow direct I/O on NFS files"
 	depends on NFS_FS
diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index bcdc5d0..92f9b84 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -572,6 +572,7 @@ static int nfs_init_server(struct nfs_server *server,
 	/* Initialise the client representation from the mount data */
 	server->flags = data->flags & NFS_MOUNT_FLAGMASK;
 
+	server->options = data->options;
 	if (data->rsize)
 		server->rsize = nfs_block_size(data->rsize, NULL);
@@ -931,6 +932,7 @@ static int nfs4_init_server(struct nfs_server *server,
 	/* Initialise the client representation from the mount data */
 	server->flags = data->flags & NFS_MOUNT_FLAGMASK;
 	server->caps |= NFS_CAP_ATOMIC_OPEN;
+	server->options = data->options;
 
 	if (data->rsize)
 		server->rsize = nfs_block_size(data->rsize, NULL);
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index f3acf48..ef09e00 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -35,6 +35,7 @@ struct nfs_parsed_mount_data {
 	int			acregmin, acregmax,
 				acdirmin, acdirmax;
 	int			namlen;
+	unsigned int		options;
 	unsigned int		bsize;
 	unsigned int		auth_flavor_len;
 	rpc_authflavor_t	auth_flavors[1];
diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index 6dd628f..0542550 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -74,6 +74,7 @@ enum {
 	Opt_acl, Opt_noacl,
 	Opt_rdirplus, Opt_nordirplus,
 	Opt_sharecache, Opt_nosharecache,
+	Opt_fscache, Opt_nofscache,
 
 	/* Mount options that take integer arguments */
 	Opt_port,
@@ -123,6 +124,8 @@ static match_table_t nfs_mount_option_tokens = {
 	{ Opt_nordirplus, "nordirplus" },
 	{ Opt_sharecache, "sharecache" },
 	{ Opt_nosharecache, "nosharecache" },
+	{ Opt_fscache, "fsc" },
+	{ Opt_nofscache, "nofsc" },
 
 	{ Opt_port, "port=%u" },
 	{ Opt_rsize, "rsize=%u" },
@@ -459,6 +462,8 @@ static void nfs_show_mount_options(struct seq_file *m, struct nfs_server *nfss,
 	seq_printf(m, ",timeo=%lu", 10U * clp->retrans_timeo / HZ);
 	seq_printf(m, ",retrans=%u", clp->retrans_count);
 	seq_printf(m, ",sec=%s", nfs_pseudoflavour_to_name(nfss->client->cl_auth->au_flavor));
+	if (nfss->options & NFS_OPTION_FSCACHE)
+		seq_printf(m, ",fsc");
 }
 
 /*
@@ -697,6 +702,15 @@ static int nfs_parse_mount_options(char *raw,
 			break;
 		case Opt_nosharecache:
 			mnt->flags |= NFS_MOUNT_UNSHARED;
+			mnt->options &= ~NFS_OPTION_FSCACHE;
+			break;
+		case Opt_fscache:
+			/* sharing is mandatory with fscache */
+			mnt->options |= NFS_OPTION_FSCACHE;
+			mnt->flags &= ~NFS_MOUNT_UNSHARED;
+			break;
+		case Opt_nofscache:
+			mnt->options &= ~NFS_OPTION_FSCACHE;
 			break;
 
 		case Opt_port:
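The interaction between "fsc", "nofsc" and "nosharecache" in the nfs_parse_mount_options() hunk above is easiest to see by running it on a few option strings. The sketch below is a user-space imitation, not kernel code: OPT_FSCACHE and FLAG_UNSHARED are stand-ins with arbitrary values for NFS_OPTION_FSCACHE and NFS_MOUNT_UNSHARED, and parse_options() is a hypothetical helper that only understands these three options.

```c
#include <assert.h>
#include <string.h>

#define OPT_FSCACHE	0x1u	/* stand-in for NFS_OPTION_FSCACHE */
#define FLAG_UNSHARED	0x2u	/* stand-in for NFS_MOUNT_UNSHARED */

struct parsed {
	unsigned int options;
	unsigned int flags;
};

/* Apply one option, following the three cases added by the patch. */
static void apply_option(struct parsed *p, const char *opt)
{
	if (strcmp(opt, "nosharecache") == 0) {
		p->flags |= FLAG_UNSHARED;
		p->options &= ~OPT_FSCACHE;
	} else if (strcmp(opt, "fsc") == 0) {
		/* sharing is mandatory with fscache */
		p->options |= OPT_FSCACHE;
		p->flags &= ~FLAG_UNSHARED;
	} else if (strcmp(opt, "nofsc") == 0) {
		p->options &= ~OPT_FSCACHE;
	}
}

/* Split a comma-separated list in place; 'raw' must be writable. */
static void parse_options(struct parsed *p, char *raw)
{
	char *tok;

	assert(p != NULL && raw != NULL);
	for (tok = strtok(raw, ","); tok != NULL; tok = strtok(NULL, ","))
		apply_option(p, tok);
}
```

Because each case overwrites the bits set by the others, the order on the command line matters: "fsc,nosharecache" ends with caching off and the mount unshared, while "nosharecache,fsc" ends with caching on and sharing restored.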
[PATCH 24/27] NFS: Use local caching
The attached patch makes it possible for the NFS filesystem to make use of the network filesystem local caching service (FS-Cache). To be able to use this, an updated mount program is required. This can be obtained from: http://people.redhat.com/steved/fscache/util-linux/ To mount an NFS filesystem to use caching, add an "fsc" option to the mount: mount warthog:/ /a -o fsc Signed-off-by: David Howells <[EMAIL PROTECTED]> --- fs/nfs/Makefile |1 fs/nfs/client.c |5 + fs/nfs/file.c | 37 fs/nfs/fscache-def.c | 289 + fs/nfs/fscache.c | 391 + fs/nfs/fscache.h | 148 + fs/nfs/inode.c| 47 + fs/nfs/read.c | 28 +++ fs/nfs/super.c|3 fs/nfs/sysctl.c |1 include/linux/nfs_fs.h|9 + include/linux/nfs_fs_sb.h | 18 ++ 12 files changed, 968 insertions(+), 9 deletions(-) create mode 100644 fs/nfs/fscache-def.c create mode 100644 fs/nfs/fscache.c create mode 100644 fs/nfs/fscache.h diff --git a/fs/nfs/Makefile b/fs/nfs/Makefile index df0f41e..073d04c 100644 --- a/fs/nfs/Makefile +++ b/fs/nfs/Makefile @@ -16,3 +16,4 @@ nfs-$(CONFIG_NFS_V4) += nfs4proc.o nfs4xdr.o nfs4state.o nfs4renewd.o \ nfs4namespace.o nfs-$(CONFIG_NFS_DIRECTIO) += direct.o nfs-$(CONFIG_SYSCTL) += sysctl.o +nfs-$(CONFIG_NFS_FSCACHE) += fscache.o fscache-def.o diff --git a/fs/nfs/client.c b/fs/nfs/client.c index a6f6254..bcdc5d0 100644 --- a/fs/nfs/client.c +++ b/fs/nfs/client.c @@ -43,6 +43,7 @@ #include "delegation.h" #include "iostat.h" #include "internal.h" +#include "fscache.h" #define NFSDBG_FACILITYNFSDBG_CLIENT @@ -139,6 +140,8 @@ static struct nfs_client *nfs_alloc_client(const char *hostname, clp->cl_state = 1 << NFS4CLNT_LEASE_EXPIRED; #endif + nfs_fscache_get_client_cookie(clp); + return clp; error_3: @@ -170,6 +173,8 @@ static void nfs_free_client(struct nfs_client *clp) nfs4_shutdown_client(clp); + nfs_fscache_release_client_cookie(clp); + /* -EIO all pending I/O */ if (!IS_ERR(clp->cl_rpcclient)) rpc_shutdown_client(clp->cl_rpcclient); diff --git a/fs/nfs/file.c b/fs/nfs/file.c index b3bb89f..d492cd7 
100644 --- a/fs/nfs/file.c +++ b/fs/nfs/file.c @@ -35,6 +35,7 @@ #include "delegation.h" #include "internal.h" #include "iostat.h" +#include "fscache.h" #define NFSDBG_FACILITYNFSDBG_FILE @@ -352,22 +353,48 @@ static int nfs_write_end(struct file *file, struct address_space *mapping, return status < 0 ? status : copied; } +/* + * Partially or wholly invalidate a page + * - Release the private state associated with a page if undergoing complete + * page invalidation + * - Called if either PG_private or PG_fscache set on the page + * - Caller holds page lock + */ static void nfs_invalidate_page(struct page *page, unsigned long offset) { if (offset != 0) return; /* Cancel any unstarted writes on this page */ nfs_wb_page_cancel(page->mapping->host, page); + + nfs_fscache_invalidate_page(page, page->mapping->host); } +/* + * Release the private state associated with a page + * - Called if either PG_private or PG_fscache set on the page + * - Caller holds page lock + * - Return true (may release) or false (may not) + */ static int nfs_release_page(struct page *page, gfp_t gfp) { /* If PagePrivate() is set, then the page is not freeable */ - return 0; + if (PagePrivate(page)) + return 0; + return nfs_fscache_release_page(page, gfp); } +/* + * Attempt to clear the private state associated with a page when an error + * occurs that requires the cached contents of an inode to be written back or + * destroyed + * - Called if either PG_private or PG_fscache set on the page + * - Caller holds page lock + * - Return 0 if successful, -error otherwise + */ static int nfs_launder_page(struct page *page) { + wait_on_page_fscache_write(page); return nfs_wb_page(page->mapping->host, page); } @@ -387,6 +414,11 @@ const struct address_space_operations nfs_file_aops = { .launder_page = nfs_launder_page, }; +/* + * Notification that a PTE pointing to an NFS page is about to be made + * writable, implying that someone is about to modify the page through a + * shared-writable mapping + */ 
static int nfs_vm_page_mkwrite(struct vm_area_struct *vma, struct page *page) { struct file *filp = vma->vm_file; @@ -396,6 +428,9 @@ static int nfs_vm_page_mkwrite(struct vm_area_struct *vma, struct page *page) struct address_space *mapping; loff_t offset; + /* make sure the cache has finished storing the page */ + wait_on_page_fscache_write(page); + lock_page(page);
[PATCH 23/27] NFS: Fix memory leak
Fix a memory leak whereby multiple clientaddr=xxx mount options just overwrite
the duplicated client_address option pointer, without freeing the old memory.

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 fs/nfs/super.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index 0b0c72a..7f5e747 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -936,6 +936,7 @@ static int nfs_parse_mount_options(char *raw,
 			string = match_strdup(args);
 			if (string == NULL)
 				goto out_nomem;
+			kfree(mnt->client_address);
 			mnt->client_address = string;
 			break;
 		case Opt_mountaddr:
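The same free-before-overwrite pattern can be shown in user space. This is a sketch only: struct mount_opts and set_client_address() are hypothetical names, and malloc()/free() stand in for the kernel's match_strdup()/kfree().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the parsed mount data. */
struct mount_opts {
	char *client_address;	/* owned by this structure, may be NULL */
};

/*
 * Store a private copy of 'value', releasing any previous copy first.
 * Returns 0 on success or -1 if the allocation fails (the old value
 * is left untouched in that case).
 */
static int set_client_address(struct mount_opts *mnt, const char *value)
{
	size_t len;
	char *copy;

	assert(mnt != NULL && value != NULL);
	len = strlen(value) + 1;
	copy = malloc(len);
	if (copy == NULL)
		return -1;
	memcpy(copy, value, len);

	free(mnt->client_address);	/* free(NULL) is a no-op */
	mnt->client_address = copy;
	return 0;
}
```

Without the free() call, every repeated option would leave the previous copy unreachable, which is the leak the one-line kfree() in the patch closes.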
[PATCH 21/27] CacheFiles: Export things for CacheFiles
Export a number of functions for CacheFiles's use.

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 fs/super.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index ceaf2e3..cd199ae 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -266,6 +266,7 @@ int fsync_super(struct super_block *sb)
 	__fsync_super(sb);
 	return sync_blockdev(sb->s_bdev);
 }
+EXPORT_SYMBOL_GPL(fsync_super);
 
 /**
  * generic_shutdown_super - common helper for ->kill_sb()
[PATCH 20/27] CacheFiles: Permit the page lock state to be monitored
Add a function to install a monitor on the page lock waitqueue for a particular
page, thus allowing the page being unlocked to be detected.

This is used by CacheFiles to detect read completion on a page in the backing
filesystem so that it can then copy the data to the waiting netfs page.

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 include/linux/pagemap.h |    5 +
 mm/filemap.c            |   18 ++
 2 files changed, 23 insertions(+), 0 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index f9e0f81..e9f37b3 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -223,6 +223,11 @@ static inline void wait_on_page_owner_priv_2(struct page *page)
 extern void end_page_owner_priv_2(struct page *page);
 
 /*
+ * Add an arbitrary waiter to a page's wait queue
+ */
+extern void add_page_wait_queue(struct page *page, wait_queue_t *waiter);
+
+/*
  * Fault a userspace page into pagetables. Return non-zero on a fault.
  *
  * This assumes that two userspace pages are always sufficient. That's
diff --git a/mm/filemap.c b/mm/filemap.c
index ed52b0b..4d50623 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -533,6 +533,24 @@ void fastcall wait_on_page_bit(struct page *page, int bit_nr)
 EXPORT_SYMBOL(wait_on_page_bit);
 
 /**
+ * add_page_wait_queue - Add an arbitrary waiter to a page's wait queue
+ * @page - Page defining the wait queue of interest
+ * @waiter - Waiter to add to the queue
+ *
+ * Add an arbitrary @waiter to the wait queue for the nominated @page.
+ */
+void add_page_wait_queue(struct page *page, wait_queue_t *waiter)
+{
+	wait_queue_head_t *q = page_waitqueue(page);
+	unsigned long flags;
+
+	spin_lock_irqsave(&q->lock, flags);
+	__add_wait_queue(q, waiter);
+	spin_unlock_irqrestore(&q->lock, flags);
+}
+EXPORT_SYMBOL_GPL(add_page_wait_queue);
+
+/**
  * unlock_page - unlock a locked page
  * @page: the page
  *
[PATCH 19/27] CacheFiles: Add a hook to write a single page of data to an inode
Add an address space operation to write one single page of data to an inode at a page-aligned location (thus permitting the implementation to be highly optimised). The data source is a single page. This is used by CacheFiles to store the contents of netfs pages into their backing file pages. Supply a generic implementation for this that uses the write_begin() and write_end() address_space operations to bind a copy directly into the page cache. Hook the Ext2 and Ext3 operations to the generic implementation. Signed-off-by: David Howells <[EMAIL PROTECTED]> --- fs/ext2/inode.c|2 ++ fs/ext3/inode.c|3 +++ include/linux/fs.h |7 ++ mm/filemap.c | 61 4 files changed, 73 insertions(+), 0 deletions(-) diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c index b1ab32a..cfa56e6 100644 --- a/fs/ext2/inode.c +++ b/fs/ext2/inode.c @@ -796,6 +796,7 @@ const struct address_space_operations ext2_aops = { .direct_IO = ext2_direct_IO, .writepages = ext2_writepages, .migratepage= buffer_migrate_page, + .write_one_page = generic_file_buffered_write_one_page, }; const struct address_space_operations ext2_aops_xip = { @@ -814,6 +815,7 @@ const struct address_space_operations ext2_nobh_aops = { .direct_IO = ext2_direct_IO, .writepages = ext2_writepages, .migratepage= buffer_migrate_page, + .write_one_page = generic_file_buffered_write_one_page, }; /* diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c index bc918d3..435c684 100644 --- a/fs/ext3/inode.c +++ b/fs/ext3/inode.c @@ -1780,6 +1780,7 @@ static const struct address_space_operations ext3_ordered_aops = { .releasepage= ext3_releasepage, .direct_IO = ext3_direct_IO, .migratepage= buffer_migrate_page, + .write_one_page = generic_file_buffered_write_one_page, }; static const struct address_space_operations ext3_writeback_aops = { @@ -1794,6 +1795,7 @@ static const struct address_space_operations ext3_writeback_aops = { .releasepage= ext3_releasepage, .direct_IO = ext3_direct_IO, .migratepage= buffer_migrate_page, + .write_one_page = 
generic_file_buffered_write_one_page, }; static const struct address_space_operations ext3_journalled_aops = { @@ -1807,6 +1809,7 @@ static const struct address_space_operations ext3_journalled_aops = { .bmap = ext3_bmap, .invalidatepage = ext3_invalidatepage, .releasepage= ext3_releasepage, + .write_one_page = generic_file_buffered_write_one_page, }; void ext3_set_aops(struct inode *inode) diff --git a/include/linux/fs.h b/include/linux/fs.h index 850d3fc..a3c3369 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -479,6 +479,11 @@ struct address_space_operations { int (*migratepage) (struct address_space *, struct page *, struct page *); int (*launder_page) (struct page *); + /* write the contents of the source page over the page at the specified +* index in the target address space (the source page does not need to +* be related to the target address space) */ + int (*write_one_page)(struct address_space *, pgoff_t, struct page *); + }; /* @@ -1801,6 +1806,8 @@ extern ssize_t generic_file_direct_write(struct kiocb *, const struct iovec *, unsigned long *, loff_t, loff_t *, size_t, size_t); extern ssize_t generic_file_buffered_write(struct kiocb *, const struct iovec *, unsigned long, loff_t, loff_t *, size_t, ssize_t); +extern int generic_file_buffered_write_one_page(struct address_space *, + pgoff_t, struct page *); extern ssize_t do_sync_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos); extern ssize_t do_sync_write(struct file *filp, const char __user *buf, size_t len, loff_t *ppos); extern void do_generic_mapping_read(struct address_space *mapping, diff --git a/mm/filemap.c b/mm/filemap.c index f22801a..ed52b0b 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2333,6 +2333,67 @@ generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov, } EXPORT_SYMBOL(generic_file_buffered_write); +/** + * generic_file_buffered_write_one_page - Write a single page of data to an + * inode + * @mapping - The address space of the target 
inode + * @index - The target page in the target inode to fill + * @source - The data to write into the target page + * + * Write the data from the source page to the page in the nominated address + * space at the @index specified. Note that the file will not be extended if + * the page crosses the EOF marker, in which case only the first part of the + * page will be written. + * + * The @source page does not need to have any association
[PATCH 18/27] CacheFiles: Be consistent about the use of mapping vs file->f_mapping in Ext3
Change all the usages of file->f_mapping in ext3_*write_end() functions to use
the mapping argument directly.  This has two consequences:

 (*) Consistency.  Without this patch sometimes one is used and sometimes the
     other is.

 (*) A NULL file pointer can be passed.  This feature is then made use of by
     the generic hook in the next patch, which is used by CacheFiles to write
     pages to a file without setting up a file struct.

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 fs/ext3/inode.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 9b162cd..bc918d3 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1227,7 +1227,7 @@ static int ext3_generic_write_end(struct file *file,
 				loff_t pos, unsigned len, unsigned copied,
 				struct page *page, void *fsdata)
 {
-	struct inode *inode = file->f_mapping->host;
+	struct inode *inode = mapping->host;
 
 	copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
 
@@ -1252,7 +1252,7 @@ static int ext3_ordered_write_end(struct file *file,
 				struct page *page, void *fsdata)
 {
 	handle_t *handle = ext3_journal_current_handle();
-	struct inode *inode = file->f_mapping->host;
+	struct inode *inode = mapping->host;
 	unsigned from, to;
 	int ret = 0, ret2;
 
@@ -1293,7 +1293,7 @@ static int ext3_writeback_write_end(struct file *file,
 				struct page *page, void *fsdata)
 {
 	handle_t *handle = ext3_journal_current_handle();
-	struct inode *inode = file->f_mapping->host;
+	struct inode *inode = mapping->host;
 	int ret = 0, ret2;
 	loff_t new_i_size;
 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
[PATCH 17/27] CacheFiles: Add missing copy_page export for ia64
This one-line patch fixes the missing export of copy_page introduced by the
cachefile patches.  This patch is not yet upstream, but is required for
cachefile on ia64.  It will be pushed upstream when cachefile goes upstream.

Signed-off-by: Prarit Bhargava <[EMAIL PROTECTED]>
Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 arch/ia64/kernel/ia64_ksyms.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/ia64/kernel/ia64_ksyms.c b/arch/ia64/kernel/ia64_ksyms.c
index c3b4412..e64fd61 100644
--- a/arch/ia64/kernel/ia64_ksyms.c
+++ b/arch/ia64/kernel/ia64_ksyms.c
@@ -43,6 +43,7 @@ EXPORT_SYMBOL(__do_clear_user);
 EXPORT_SYMBOL(__strlen_user);
 EXPORT_SYMBOL(__strncpy_from_user);
 EXPORT_SYMBOL(__strnlen_user);
+EXPORT_SYMBOL(copy_page);
 
 /* from arch/ia64/lib */
 extern void __divsi3(void);
[PATCH 15/27] FS-Cache: Provide an add_wait_queue_tail() function
Provide an add_wait_queue_tail() function to add a waiter to the back of a
wait queue instead of the front.

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 include/linux/wait.h |    2 ++
 kernel/wait.c        |   18 ++
 2 files changed, 20 insertions(+), 0 deletions(-)

diff --git a/include/linux/wait.h b/include/linux/wait.h
index 0e68628..f1038d0 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -118,6 +118,8 @@ static inline int waitqueue_active(wait_queue_head_t *q)
 #define is_sync_wait(wait)	(!(wait) || ((wait)->private))
 
 extern void FASTCALL(add_wait_queue(wait_queue_head_t *q, wait_queue_t * wait));
+extern void FASTCALL(add_wait_queue_tail(wait_queue_head_t *q,
+					 wait_queue_t *wait));
 extern void FASTCALL(add_wait_queue_exclusive(wait_queue_head_t *q, wait_queue_t * wait));
 extern void FASTCALL(remove_wait_queue(wait_queue_head_t *q, wait_queue_t * wait));
 
diff --git a/kernel/wait.c b/kernel/wait.c
index 444ddbf..7acc9cc 100644
--- a/kernel/wait.c
+++ b/kernel/wait.c
@@ -29,6 +29,24 @@ void fastcall add_wait_queue(wait_queue_head_t *q, wait_queue_t *wait)
 }
 EXPORT_SYMBOL(add_wait_queue);
 
+/**
+ * add_wait_queue_tail - Add a waiter to the back of a waitqueue
+ * @q: the wait queue to append the waiter to
+ * @wait: the waiter to be queued
+ *
+ * Add a waiter to the back of a waitqueue so that it gets woken up last.
+ */
+void fastcall add_wait_queue_tail(wait_queue_head_t *q, wait_queue_t *wait)
+{
+	unsigned long flags;
+
+	wait->flags &= ~WQ_FLAG_EXCLUSIVE;
+	spin_lock_irqsave(&q->lock, flags);
+	__add_wait_queue_tail(q, wait);
+	spin_unlock_irqrestore(&q->lock, flags);
+}
+EXPORT_SYMBOL(add_wait_queue_tail);
+
 void fastcall add_wait_queue_exclusive(wait_queue_head_t *q, wait_queue_t *wait)
 {
 	unsigned long flags;
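The difference between queueing at the front and at the back can be seen in a
small userspace sketch.  This is not the kernel code: struct list_node, struct
waiter and the queue_*() helpers are hypothetical stand-ins for list_head,
wait_queue_t and the add_wait_queue*() functions, and no locking is modelled.
It only illustrates that a waiter added at the tail is visited last when the
queue is walked from the front.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modelled loosely on list_head. */
struct list_node {
	struct list_node *prev, *next;
};

/* Hypothetical stand-in for wait_queue_t: an id plus its list linkage. */
struct waiter {
	int id;
	struct list_node link;
};

static void list_init(struct list_node *head)
{
	head->prev = head;
	head->next = head;
}

/* Link 'node' between two adjacent nodes. */
static void list_link(struct list_node *node, struct list_node *prev,
		      struct list_node *next)
{
	node->prev = prev;
	node->next = next;
	prev->next = node;
	next->prev = node;
}

/* Front insertion: the waiter becomes the first one visited. */
static void queue_add_head(struct list_node *head, struct waiter *w)
{
	list_link(&w->link, head, head->next);
}

/* Back insertion: the waiter becomes the last one visited. */
static void queue_add_tail(struct list_node *head, struct waiter *w)
{
	list_link(&w->link, head->prev, head);
}

/* Walk from the front, recording ids in the order a wake-up would visit. */
static size_t queue_wake_order(const struct list_node *head, int *out,
			       size_t max)
{
	const struct list_node *pos;
	size_t n = 0;

	for (pos = head->next; pos != head && n < max; pos = pos->next) {
		const struct waiter *w = (const struct waiter *)
			((const char *)pos - offsetof(struct waiter, link));
		out[n++] = w->id;
	}
	return n;
}
```

Adding waiter 1 at the head, waiter 2 at the tail and waiter 3 at the head
leaves the queue in the order 3, 1, 2, so the tail waiter is reached last.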
[PATCH 14/27] FS-Cache: Recruit a couple of page flags for cache management
Recruit a couple of page flags to aid in cache management.  The following extra
flags are defined:

 (1) PG_fscache (PG_private_2)

     The marked page is backed by a local cache and is pinning resources in the
     cache driver.

 (2) PG_fscache_write (PG_owner_priv_2)

     The marked page is being written to the local cache.  The page may not be
     modified whilst this is in progress.

If PG_fscache is set, then things that checked for PG_private will now also
check for that.  This includes things like truncation and page invalidation.
The function page_has_private() has been added to make the checks for both
PG_private and PG_private_2 at the same time.

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 fs/splice.c                |    2 +-
 include/linux/page-flags.h |   39 +--
 include/linux/pagemap.h    |   11 +++
 mm/filemap.c               |   16
 mm/migrate.c               |    2 +-
 mm/page_alloc.c            |    3 +++
 mm/readahead.c             |    9 +
 mm/swap.c                  |    4 ++--
 mm/swap_state.c            |    4 ++--
 mm/truncate.c              |   10 +-
 mm/vmscan.c                |    2 +-
 11 files changed, 84 insertions(+), 18 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index 6bdcb61..61edad7 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -58,7 +58,7 @@ static int page_cache_pipe_buf_steal(struct pipe_inode_info *pipe,
 		 */
 		wait_on_page_writeback(page);
 
-		if (PagePrivate(page))
+		if (page_has_private(page))
 			try_to_release_page(page, GFP_KERNEL);
 
 		/*
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 209d3a4..f375e3b 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -77,25 +77,32 @@
 #define PG_active		 6
 #define PG_slab			 7	/* slab debug (Suparna wants this) */
 
-#define PG_owner_priv_1		 8	/* Owner use. If pagecache, fs may use*/
+#define PG_owner_priv_1		 8	/* Owner use. fs may use in pagecache */
 #define PG_arch_1		 9
 #define PG_reserved		10
 #define PG_private		11	/* If pagecache, has fs-private data */
 
 #define PG_writeback		12	/* Page is under writeback */
+#define PG_private_2		13	/* If pagecache, has fs aux data */
 #define PG_compound		14	/* Part of a compound page */
 #define PG_swapcache		15	/* Swap page: swp_entry_t in private */
 
 #define PG_mappedtodisk		16	/* Has blocks allocated on-disk */
 #define PG_reclaim		17	/* To be reclaimed asap */
+#define PG_owner_priv_2		18	/* Owner use. fs may use in pagecache */
 #define PG_buddy		19	/* Page is free, on buddy lists */
 
 /* PG_readahead is only used for file reads; PG_reclaim is only for writes */
 #define PG_readahead		PG_reclaim /* Reminder to do async read-ahead */
 
-/* PG_owner_priv_1 users should have descriptive aliases */
+/* PG_owner_priv_1/2 users should have descriptive aliases */
 #define PG_checked		PG_owner_priv_1 /* Used by some filesystems */
 #define PG_pinned		PG_owner_priv_1	/* Xen pinned pagetable */
+#define PG_fscache_write	PG_owner_priv_2	/* Writing to local cache */
+
+/* PG_private_2 causes releasepage() and co to be invoked */
+#define PG_fscache		PG_private_2	/* Backed by local cache */
+
 
 #if (BITS_PER_LONG > 32)
 /*
@@ -199,6 +206,23 @@ static inline void SetPageUptodate(struct page *page)
 #define TestClearPageWriteback(page) test_and_clear_bit(PG_writeback,	\
 							&(page)->flags)
 
+#define PagePrivate2(page)	test_bit(PG_private_2, &(page)->flags)
+#define SetPagePrivate2(page)	set_bit(PG_private_2, &(page)->flags)
+#define ClearPagePrivate2(page)	clear_bit(PG_private_2, &(page)->flags)
+#define TestSetPagePrivate2(page) test_and_set_bit(PG_private_2, &(page)->flags)
+#define TestClearPagePrivate2(page) test_and_clear_bit(PG_private_2,	\
+							&(page)->flags)
+
+#define PageOwnerPriv2(page)	test_bit(PG_owner_priv_2,		\
+					 &(page)->flags)
+#define SetPageOwnerPriv2(page)	set_bit(PG_owner_priv_2, &(page)->flags)
+#define ClearPageOwnerPriv2(page) clear_bit(PG_owner_priv_2,		\
+					    &(page)->flags)
+#define TestSetPageOwnerPriv2(page) test_and_set_bit(PG_owner_priv_2,	\
+						&(page)->flags)
+#define TestClearPageOwnerPriv2(page) test_and_clear_bit(PG_owner_priv_2, \
+							&(page)->flags)
+
 #define
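The combined check that page_has_private() is described as making can be
sketched in userspace as a single mask test.  The definition of
page_has_private() is not visible in this excerpt, so the version below is an
assumption about its shape, not the patch's code: struct page is cut down to a
flags word, the helpers are hypothetical, and only the bit numbers are taken
from the page-flags.h hunk.

```c
#include <assert.h>
#include <stdbool.h>

/* Bit numbers as given in the page-flags.h hunk of the patch. */
#define PG_private	11
#define PG_writeback	12
#define PG_private_2	13
#define PG_fscache	PG_private_2	/* alias introduced by the patch */

/* Hypothetical cut-down page: only the flags word is modelled. */
struct page {
	unsigned long flags;
};

static void set_page_flag(struct page *page, int bit)
{
	page->flags |= 1UL << bit;
}

static bool test_page_flag(const struct page *page, int bit)
{
	return (page->flags >> bit) & 1UL;
}

/*
 * Assumed shape of the combined check: true if either PG_private or
 * PG_private_2 is set, so callers that used to test PG_private alone
 * also notice pages marked PG_fscache.
 */
static bool page_has_private_sketch(const struct page *page)
{
	const unsigned long mask = (1UL << PG_private) | (1UL << PG_private_2);

	return (page->flags & mask) != 0;
}
```

With this shape, a page that carries only PG_fscache still takes the release
path in callers such as the fs/splice.c hunk, while unrelated bits such as
PG_writeback do not trigger it.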
[PATCH 13/27] FS-Cache: Release page->private after failed readahead
The attached patch causes read_cache_pages() to release page-private data on a
page for which add_to_page_cache() fails or the filler function fails.  This
permits pages with caching references associated with them to be cleaned up.

The invalidatepage() address space op is called (indirectly) to do the honours.

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 mm/readahead.c |   39 +--
 1 files changed, 37 insertions(+), 2 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index c9c50ca..75aa6b6 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -44,6 +44,41 @@ EXPORT_SYMBOL_GPL(file_ra_state_init);
 
 #define list_to_page(head) (list_entry((head)->prev, struct page, lru))
 
+/*
+ * see if a page needs releasing upon read_cache_pages() failure
+ * - the caller of read_cache_pages() may have set PG_private before calling,
+ *   such as the NFS fs marking pages that are cached locally on disk, thus we
+ *   need to give the fs a chance to clean up in the event of an error
+ */
+static void read_cache_pages_invalidate_page(struct address_space *mapping,
+					     struct page *page)
+{
+	if (PagePrivate(page)) {
+		if (TestSetPageLocked(page))
+			BUG();
+		page->mapping = mapping;
+		do_invalidatepage(page, 0);
+		page->mapping = NULL;
+		unlock_page(page);
+	}
+	page_cache_release(page);
+}
+
+/*
+ * release a list of pages, invalidating them first if need be
+ */
+static void read_cache_pages_invalidate_pages(struct address_space *mapping,
+					      struct list_head *pages)
+{
+	struct page *victim;
+
+	while (!list_empty(pages)) {
+		victim = list_to_page(pages);
+		list_del(&victim->lru);
+		read_cache_pages_invalidate_page(mapping, victim);
+	}
+}
+
 /**
  * read_cache_pages - populate an address space with some pages & start reads against them
  * @mapping: the address_space
@@ -65,14 +100,14 @@ int read_cache_pages(struct address_space *mapping, struct list_head *pages,
 		list_del(&page->lru);
 		if (add_to_page_cache_lru(page, mapping,
 					page->index, GFP_KERNEL)) {
-			page_cache_release(page);
+			read_cache_pages_invalidate_page(mapping, page);
 			continue;
 		}
 		page_cache_release(page);
 		ret = filler(data, page);
 		if (unlikely(ret)) {
-			put_pages_list(pages);
+			read_cache_pages_invalidate_pages(mapping, pages);
 			break;
 		}
 		task_io_account_read(PAGE_CACHE_SIZE);
[PATCH 12/27] Security: Make NFSD work with detached security
Make NFSD work with detached security, using the patches that excise the
security information from task_struct to struct task_security as a base.

Each time NFSD wants a new security descriptor (to do NFS4 recovery or just to
do NFS operations), a task_security record is derived from NFSD's *objective*
security, modified and then applied as the *subjective* security.  This means
(a) the changes are not visible to anyone looking at NFSD through /proc, (b)
there is no leakage between two consecutive ops with different security
configurations.

Consideration should probably be given to caching the task_security record on
the basis that there'll probably be several ops that will want to use any
particular security configuration.

Furthermore, nfs4recover.c perhaps ought to set an appropriate LSM context on
the record pointed to by rec_security so that the disk is accessed
appropriately (see set_security_override[_from_ctx]()).

NOTE!  This patch must be rolled in to one of the earlier security patches to
make it compile fully.

Signed-off-by: David Howells <[EMAIL PROTECTED]> --- fs/nfsd/auth.c| 31 +--- fs/nfsd/nfs4recover.c | 64 +++-- 2 files changed, 62 insertions(+), 33 deletions(-) diff --git a/fs/nfsd/auth.c b/fs/nfsd/auth.c index b2e19c8..32d8e34 100644 --- a/fs/nfsd/auth.c +++ b/fs/nfsd/auth.c @@ -6,6 +6,7 @@ #include #include +#include #include #include #include @@ -28,11 +29,17 @@ int nfsexp_flags(struct svc_rqst *rqstp, struct svc_export *exp) int nfsd_setuser(struct svc_rqst *rqstp, struct svc_export *exp) { + struct task_security *sec, *old; struct svc_cred cred = rqstp->rq_cred; int i; int flags = nfsexp_flags(rqstp, exp); int ret; + /* derive the new security record from nfsd's objective security */ + sec = get_kernel_security(current); + if (!sec) + return -ENOMEM; + if (flags & NFSEXP_ALLSQUASH) { cred.cr_uid = exp->ex_anon_uid; cred.cr_gid = exp->ex_anon_gid; @@ -56,24 +63,30 @@ int nfsd_setuser(struct svc_rqst *rqstp, struct svc_export *exp) get_group_info(cred.cr_group_info); if (cred.cr_uid != (uid_t) -1) - current->act_as->fsuid = cred.cr_uid; + sec->fsuid = cred.cr_uid; else - current->act_as->fsuid = exp->ex_anon_uid; + sec->fsuid = exp->ex_anon_uid; if (cred.cr_gid != (gid_t) -1) - current->act_as->fsgid = cred.cr_gid; + sec->fsgid = cred.cr_gid; else - current->act_as->fsgid = exp->ex_anon_gid; + sec->fsgid = exp->ex_anon_gid; - if (!cred.cr_group_info) + if (!cred.cr_group_info) { + put_task_security(sec); return -ENOMEM; - ret = set_groups(current->act_as, cred.cr_group_info); + } + ret = set_groups(sec, cred.cr_group_info); put_group_info(cred.cr_group_info); if ((cred.cr_uid)) { - cap_t(current->act_as->cap_effective) &= ~CAP_NFSD_MASK; + cap_t(sec->cap_effective) &= ~CAP_NFSD_MASK; } else { - cap_t(current->act_as->cap_effective) |= - (CAP_NFSD_MASK & current->act_as->cap_permitted); + cap_t(sec->cap_effective) |= CAP_NFSD_MASK & sec->cap_permitted; } + + /* set the new security as nfsd's subjective security */ + old = current->act_as; + current->act_as = 
sec; + put_task_security(old); return ret; } diff --git a/fs/nfsd/nfs4recover.c b/fs/nfsd/nfs4recover.c index bf0217a..ae91262 100644 --- a/fs/nfsd/nfs4recover.c +++ b/fs/nfsd/nfs4recover.c @@ -46,27 +46,37 @@ #include #include #include +#include #define NFSDDBG_FACILITYNFSDDBG_PROC /* Globals */ static struct nameidata rec_dir; static int rec_dir_init = 0; +static struct task_security *rec_security; +/* + * switch the special recovery access security in on the current task's + * subjective security + */ static void -nfs4_save_user(uid_t *saveuid, gid_t *savegid) +nfs4_begin_secure(struct task_security **saved_sec) { - *saveuid = current->act_as->fsuid; - *savegid = current->act_as->fsgid; - current->act_as->fsuid = 0; - current->act_as->fsgid = 0; + *saved_sec = current->act_as; + current->act_as = get_task_security(rec_security); } +/* + * return the current task's subjective security to its former glory + */ static void -nfs4_reset_user(uid_t saveuid, gid_t savegid) +nfs4_end_secure(struct task_security *saved_sec) { - current->act_as->fsuid = saveuid; - current->act_as->fsgid = savegid; + struct task_security *discard; + + discard = current->act_as; + current->act_as = saved_sec; + put_task_security(discard); } static void @@ -128,10 +138,9
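The derive, modify and apply sequence described for NFSD can be sketched in
userspace with a reference-counted record.  Everything here is a hypothetical
model rather than the kernel API: struct sec_record stands in for struct
task_security, the "objective" and "act_as" fields model the objective and
subjective security of a task, and live_records exists only so the example
can show that the replaced record is released.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, cut-down security record: a refcount and two ids. */
struct sec_record {
	int usage;
	unsigned int fsuid;
	unsigned int fsgid;
};

/* Hypothetical task: 'objective' is what others see, 'act_as' is what
 * the task uses when it acts on other objects. */
struct task {
	struct sec_record *objective;
	struct sec_record *act_as;
};

static int live_records;	/* bookkeeping for the example only */

static struct sec_record *sec_alloc(unsigned int fsuid, unsigned int fsgid)
{
	struct sec_record *sec = malloc(sizeof(*sec));

	if (!sec)
		return NULL;
	sec->usage = 1;
	sec->fsuid = fsuid;
	sec->fsgid = fsgid;
	live_records++;
	return sec;
}

static struct sec_record *sec_get(struct sec_record *sec)
{
	sec->usage++;
	return sec;
}

static void sec_put(struct sec_record *sec)
{
	if (--sec->usage == 0) {
		live_records--;
		free(sec);
	}
}

/* Derive, modify, apply: the per-request sequence in one place. */
static int task_set_fs_ids(struct task *t, unsigned int fsuid,
			   unsigned int fsgid)
{
	struct sec_record *sec, *old;

	/* 1: derive a private copy from the objective record */
	sec = sec_alloc(t->objective->fsuid, t->objective->fsgid);
	if (!sec)
		return -1;

	/* 2: modify the copy; the objective record is left untouched */
	sec->fsuid = fsuid;
	sec->fsgid = fsgid;

	/* 3: install it as the subjective record and drop the old one */
	old = t->act_as;
	t->act_as = sec;
	sec_put(old);
	return 0;
}
```

Because each request starts again from the objective record, nothing set for
one request carries over to the next, and the objective record never changes.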
[PATCH 11/27] Security: Allow kernel services to override LSM settings for task actions
Allow kernel services to override LSM settings appropriate to the actions performed by a task by duplicating a security record, modifying it and then using task_struct::act_as to point to it when performing operations on behalf of a task. This is used, for example, by CacheFiles which has to transparently access the cache on behalf of a process that thinks it is doing, say, NFS accesses with a potentially inappropriate (with respect to accessing the cache) set of security data. This patch provides two LSM hooks for modifying a task security record: (*) security_kernel_act_as() which allows modification of the security datum with which a task acts on other objects (most notably files). (*) security_create_files_as() which allows modification of the security datum that is used to initialise the security data on a file that a task creates. Signed-off-by: David Howells <[EMAIL PROTECTED]> --- include/linux/cred.h| 23 +++ include/linux/security.h| 43 +- kernel/cred.c | 111 +++ security/dummy.c| 17 + security/security.c | 15 - security/selinux/hooks.c| 51 security/selinux/include/security.h |2 - security/selinux/ss/services.c |5 +- 8 files changed, 258 insertions(+), 9 deletions(-) create mode 100644 include/linux/cred.h diff --git a/include/linux/cred.h b/include/linux/cred.h new file mode 100644 index 000..497af5b --- /dev/null +++ b/include/linux/cred.h @@ -0,0 +1,23 @@ +/* Credential management + * + * Copyright (C) 2007 Red Hat, Inc. All Rights Reserved. + * Written by David Howells ([EMAIL PROTECTED]) + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public Licence + * as published by the Free Software Foundation; either version + * 2 of the Licence, or (at your option) any later version. 
+ */ + +#ifndef _LINUX_CRED_H +#define _LINUX_CRED_H + +struct task_security; +struct inode; + +extern struct task_security *get_kernel_security(struct task_struct *); +extern int set_security_override(struct task_security *, u32); +extern int set_security_override_from_ctx(struct task_security *, const char *); +extern int change_create_files_as(struct task_security *, struct inode *); + +#endif /* _LINUX_CRED_H */ diff --git a/include/linux/security.h b/include/linux/security.h index e8f2f2d..e6be746 100644 --- a/include/linux/security.h +++ b/include/linux/security.h @@ -557,6 +557,19 @@ struct request_sock; * Duplicate and attach the security structure currently attached to the * p->security field. * Return 0 if operation was successful. + * @task_kernel_act_as: + * Set the credentials for a kernel service to act as (subjective context). + * @p points to the task that nominated @secid. + * @sec points to the task security record to be modified. + * @secid specifies the security ID to be set + * Return 0 if successful. + * @task_create_files_as: + * Set the file creation context in a task security record to be the same + * as the objective context of the specified inode. + * @p points to the task that nominated @inode. + * @sec points to the task security record to be modified. + * @inode points to the inode to use as a reference. + * Return 0 if successful. * @task_setuid: * Check permission before setting one or more of the user identity * attributes of the current process. 
The @flags parameter indicates @@ -1325,6 +1338,11 @@ struct security_operations { int (*task_alloc_security) (struct task_struct *p); void (*task_free_security) (struct task_security *p); int (*task_dup_security) (struct task_security *p); + int (*task_kernel_act_as)(struct task_struct *p, + struct task_security *sec, u32 secid); + int (*task_create_files_as)(struct task_struct *p, + struct task_security *sec, + struct inode *inode); int (*task_setuid) (uid_t id0, uid_t id1, uid_t id2, int flags); int (*task_post_setuid) (uid_t old_ruid /* or fsuid */ , uid_t old_euid, uid_t old_suid, int flags); @@ -1393,7 +1411,7 @@ struct security_operations { int (*getprocattr)(struct task_struct *p, char *name, char **value); int (*setprocattr)(struct task_struct *p, char *name, void *value, size_t size); int (*secid_to_secctx)(u32 secid, char **secdata, u32 *seclen); - int (*secctx_to_secid)(char *secdata, u32 seclen, u32 *secid); + int (*secctx_to_secid)(const char *secdata, u32 seclen, u32 *secid); void (*release_secctx)(char *secdata, u32 seclen); #ifdef CONFIG_SECURITY_NETWORK @@ -1576,6 +1594,11 @@ int security_task_create(unsigned long
[PATCH 10/27] Security: Add a kernel_service object class to SELinux
Add a 'kernel_service' object class to SELinux and give this object class two
access vectors: 'use_as_override' and 'create_files_as'.

The first vector is used to grant a process the right to nominate an alternate
process security ID for the kernel to use as an override for the SELinux
subjective security when accessing stuff on behalf of another process.

For example, CacheFiles when accessing the cache on behalf of a process
accessing an NFS file needs to use a subjective security ID appropriate to the
cache rather than the one the calling process is using.  The cachefilesd
daemon will nominate the security ID to be used.

The second vector is used to grant a process the right to nominate a file
creation label for a kernel service to use.

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 security/selinux/include/av_perm_to_string.h |    2 ++
 security/selinux/include/av_permissions.h    |    2 ++
 security/selinux/include/class_to_string.h   |    1 +
 security/selinux/include/flask.h             |    1 +
 4 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/security/selinux/include/av_perm_to_string.h b/security/selinux/include/av_perm_to_string.h
index caa0634..6ba8200 100644
--- a/security/selinux/include/av_perm_to_string.h
+++ b/security/selinux/include/av_perm_to_string.h
@@ -164,3 +164,5 @@
    S_(SECCLASS_DCCP_SOCKET, DCCP_SOCKET__NAME_CONNECT, "name_connect")
    S_(SECCLASS_MEMPROTECT, MEMPROTECT__MMAP_ZERO, "mmap_zero")
    S_(SECCLASS_PEER, PEER__RECV, "recv")
+   S_(SECCLASS_KERNEL_SERVICE, KERNEL_SERVICE__USE_AS_OVERRIDE, "use_as_override")
+   S_(SECCLASS_KERNEL_SERVICE, KERNEL_SERVICE__CREATE_FILES_AS, "create_files_as")
diff --git a/security/selinux/include/av_permissions.h b/security/selinux/include/av_permissions.h
index c2b5bb2..9500ba3 100644
--- a/security/selinux/include/av_permissions.h
+++ b/security/selinux/include/av_permissions.h
@@ -829,3 +829,5 @@
 #define DCCP_SOCKET__NAME_CONNECT                 0x0080UL
 #define MEMPROTECT__MMAP_ZERO                     0x0001UL
 #define PEER__RECV                                0x0001UL
+#define KERNEL_SERVICE__USE_AS_OVERRIDE           0x0001UL
+#define KERNEL_SERVICE__CREATE_FILES_AS           0x0002UL
diff --git a/security/selinux/include/class_to_string.h b/security/selinux/include/class_to_string.h
index b1b0d1d..efe9efa 100644
--- a/security/selinux/include/class_to_string.h
+++ b/security/selinux/include/class_to_string.h
@@ -71,3 +71,4 @@
     S_(NULL)
     S_(NULL)
     S_("peer")
+    S_("kernel_service")
diff --git a/security/selinux/include/flask.h b/security/selinux/include/flask.h
index 09e9dd2..2bc251a 100644
--- a/security/selinux/include/flask.h
+++ b/security/selinux/include/flask.h
@@ -51,6 +51,7 @@
 #define SECCLASS_DCCP_SOCKET                             60
 #define SECCLASS_MEMPROTECT                              61
 #define SECCLASS_PEER                                    68
+#define SECCLASS_KERNEL_SERVICE                          69
 
 /*
  * Security identifier indices for initial entities
[PATCH 09/27] Security: Pre-add additional non-caching classes
Pre-add additional non-caching classes that are in the SELinux upstream repository, but not in the upstream kernel so they don't get in the fscache class patch. Signed-off-by: David Howells <[EMAIL PROTECTED]> --- security/selinux/include/av_perm_to_string.h |5 + security/selinux/include/av_permissions.h|5 + security/selinux/include/class_to_string.h |7 +++ security/selinux/include/flask.h |1 + 4 files changed, 18 insertions(+), 0 deletions(-) diff --git a/security/selinux/include/av_perm_to_string.h b/security/selinux/include/av_perm_to_string.h index 049bf69..caa0634 100644 --- a/security/selinux/include/av_perm_to_string.h +++ b/security/selinux/include/av_perm_to_string.h @@ -37,6 +37,8 @@ S_(SECCLASS_NODE, NODE__ENFORCE_DEST, "enforce_dest") S_(SECCLASS_NODE, NODE__DCCP_RECV, "dccp_recv") S_(SECCLASS_NODE, NODE__DCCP_SEND, "dccp_send") + S_(SECCLASS_NODE, NODE__RECVFROM, "recvfrom") + S_(SECCLASS_NODE, NODE__SENDTO, "sendto") S_(SECCLASS_NETIF, NETIF__TCP_RECV, "tcp_recv") S_(SECCLASS_NETIF, NETIF__TCP_SEND, "tcp_send") S_(SECCLASS_NETIF, NETIF__UDP_RECV, "udp_recv") @@ -45,6 +47,8 @@ S_(SECCLASS_NETIF, NETIF__RAWIP_SEND, "rawip_send") S_(SECCLASS_NETIF, NETIF__DCCP_RECV, "dccp_recv") S_(SECCLASS_NETIF, NETIF__DCCP_SEND, "dccp_send") + S_(SECCLASS_NETIF, NETIF__INGRESS, "ingress") + S_(SECCLASS_NETIF, NETIF__EGRESS, "egress") S_(SECCLASS_UNIX_STREAM_SOCKET, UNIX_STREAM_SOCKET__CONNECTTO, "connectto") S_(SECCLASS_UNIX_STREAM_SOCKET, UNIX_STREAM_SOCKET__NEWCONN, "newconn") S_(SECCLASS_UNIX_STREAM_SOCKET, UNIX_STREAM_SOCKET__ACCEPTFROM, "acceptfrom") @@ -159,3 +163,4 @@ S_(SECCLASS_DCCP_SOCKET, DCCP_SOCKET__NODE_BIND, "node_bind") S_(SECCLASS_DCCP_SOCKET, DCCP_SOCKET__NAME_CONNECT, "name_connect") S_(SECCLASS_MEMPROTECT, MEMPROTECT__MMAP_ZERO, "mmap_zero") + S_(SECCLASS_PEER, PEER__RECV, "recv") diff --git a/security/selinux/include/av_permissions.h b/security/selinux/include/av_permissions.h index eda89a2..c2b5bb2 100644 --- 
a/security/selinux/include/av_permissions.h +++ b/security/selinux/include/av_permissions.h @@ -292,6 +292,8 @@ #define NODE__ENFORCE_DEST0x0040UL #define NODE__DCCP_RECV 0x0080UL #define NODE__DCCP_SEND 0x0100UL +#define NODE__RECVFROM0x0200UL +#define NODE__SENDTO 0x0400UL #define NETIF__TCP_RECV 0x0001UL #define NETIF__TCP_SEND 0x0002UL #define NETIF__UDP_RECV 0x0004UL @@ -300,6 +302,8 @@ #define NETIF__RAWIP_SEND 0x0020UL #define NETIF__DCCP_RECV 0x0040UL #define NETIF__DCCP_SEND 0x0080UL +#define NETIF__INGRESS0x0100UL +#define NETIF__EGRESS 0x0200UL #define NETLINK_SOCKET__IOCTL 0x0001UL #define NETLINK_SOCKET__READ 0x0002UL #define NETLINK_SOCKET__WRITE 0x0004UL @@ -824,3 +828,4 @@ #define DCCP_SOCKET__NODE_BIND0x0040UL #define DCCP_SOCKET__NAME_CONNECT 0x0080UL #define MEMPROTECT__MMAP_ZERO 0x0001UL +#define PEER__RECV0x0001UL diff --git a/security/selinux/include/class_to_string.h b/security/selinux/include/class_to_string.h index e77de0e..b1b0d1d 100644 --- a/security/selinux/include/class_to_string.h +++ b/security/selinux/include/class_to_string.h @@ -64,3 +64,10 @@ S_(NULL) S_("dccp_socket") S_("memprotect") +S_(NULL) +S_(NULL) +S_(NULL) +S_(NULL) +S_(NULL) +S_(NULL) +S_("peer") diff --git a/security/selinux/include/flask.h b/security/selinux/include/flask.h index a9c2b20..09e9dd2 100644 --- a/security/selinux/include/flask.h +++ b/security/selinux/include/flask.h @@ -50,6 +50,7 @@ #define SECCLASS_KEY 58 #define SECCLASS_DCCP_SOCKET 60 #define SECCLASS_MEMPROTECT 61 +#define SECCLASS_PEER68 /* * Security identifier indices for initial entities -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 04/27] KEYS: Add keyctl function to get a security label
Add a keyctl() function to get the security label of a key. The following is added to Documentation/keys.txt: (*) Get the LSM security context attached to a key. long keyctl(KEYCTL_GET_SECURITY, key_serial_t key, char *buffer, size_t buflen) This function returns a string that represents the LSM security context attached to a key in the buffer provided. Unless there's an error, it always returns the amount of data it could produce, even if that's too big for the buffer, but it won't copy more than requested to userspace. If the buffer pointer is NULL then no copy will take place. A NUL character is included at the end of the string if the buffer is sufficiently big. This is included in the returned count. If no LSM is in force then an empty string will be returned. A process must have view permission on the key for this function to be successful. Signed-off-by: David Howells <[EMAIL PROTECTED]> Acked-by: Stephen Smalley <[EMAIL PROTECTED]> --- Documentation/keys.txt | 21 +++ include/linux/keyctl.h |1 + include/linux/security.h | 20 +- security/dummy.c |8 ++ security/keys/compat.c |3 ++ security/keys/keyctl.c | 66 ++ security/security.c |5 +++ security/selinux/hooks.c | 21 +-- 8 files changed, 141 insertions(+), 4 deletions(-) diff --git a/Documentation/keys.txt b/Documentation/keys.txt index b82d38d..be424b0 100644 --- a/Documentation/keys.txt +++ b/Documentation/keys.txt @@ -711,6 +711,27 @@ The keyctl syscall functions are: The assumed authoritative key is inherited across fork and exec. + (*) Get the LSM security context attached to a key. + + long keyctl(KEYCTL_GET_SECURITY, key_serial_t key, char *buffer, + size_t buflen) + + This function returns a string that represents the LSM security context + attached to a key in the buffer provided. + + Unless there's an error, it always returns the amount of data it could + produce, even if that's too big for the buffer, but it won't copy more + than requested to userspace. 
If the buffer pointer is NULL then no copy + will take place. + + A NUL character is included at the end of the string if the buffer is + sufficiently big. This is included in the returned count. If no LSM is + in force then an empty string will be returned. + + A process must have view permission on the key for this function to be + successful. + + === KERNEL SERVICES === diff --git a/include/linux/keyctl.h b/include/linux/keyctl.h index 3365945..656ee6b 100644 --- a/include/linux/keyctl.h +++ b/include/linux/keyctl.h @@ -49,5 +49,6 @@ #define KEYCTL_SET_REQKEY_KEYRING 14 /* set default request-key keyring */ #define KEYCTL_SET_TIMEOUT 15 /* set key timeout */ #define KEYCTL_ASSUME_AUTHORITY16 /* assume request_key() authorisation */ +#define KEYCTL_GET_SECURITY17 /* get key security label */ #endif /* _LINUX_KEYCTL_H */ diff --git a/include/linux/security.h b/include/linux/security.h index ac05083..8d9e946 100644 --- a/include/linux/security.h +++ b/include/linux/security.h @@ -959,6 +959,17 @@ struct request_sock; * @perm describes the combination of permissions required of this key. * Return 1 if permission granted, 0 if permission denied and -ve it the * normal permissions model should be effected. + * @key_getsecurity: + * Get a textual representation of the security context attached to a key + * for the purposes of honouring KEYCTL_GETSECURITY. This function + * allocates the storage for the NUL-terminated string and the caller + * should free it. + * @key points to the key to be queried. + * @_buffer points to a pointer that should be set to point to the + * resulting string (if no label or an error occurs). + * Return the length of the string (including terminating NUL) or -ve if + * an error. + * May also return 0 (and a NULL buffer pointer) if there is no label. * * Security hooks affecting all System V IPC operations. 
* @@ -1437,7 +1448,7 @@ struct security_operations { int (*key_permission)(key_ref_t key_ref, struct task_struct *context, key_perm_t perm); - + int (*key_getsecurity)(struct key *key, char **_buffer); #endif /* CONFIG_KEYS */ }; @@ -2567,6 +2578,7 @@ int security_key_alloc(struct key *key, struct task_struct *tsk, unsigned long f void security_key_free(struct key *key); int security_key_permission(key_ref_t key_ref, struct task_struct *context, key_perm_t perm); +int security_key_getsecurity(struct key *key, char **_buffer); #else @@ -2588,6 +2600,12 @@ static inline int
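The return-value convention documented for KEYCTL_GET_SECURITY (always report
the full size, copy no more than the buffer holds, skip the copy for a NULL
buffer) lends itself to a two-call pattern: ask for the size, allocate, then
fetch.  The sketch below models that convention in userspace.
get_security_sketch() and fetch_label() are hypothetical helpers that take the
label as a parameter instead of a key; they are not the syscall or any library
wrapper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical model of the documented semantics: returns the number of
 * bytes the label needs, including the terminating NUL, regardless of
 * buflen.  Copies at most buflen bytes, and nothing if buffer is NULL.
 */
static long get_security_sketch(const char *label, char *buffer,
				size_t buflen)
{
	size_t need = strlen(label) + 1;	/* count includes the NUL */

	if (buffer && buflen > 0)
		memcpy(buffer, label, need < buflen ? need : buflen);
	return (long)need;
}

/*
 * Two-call pattern: size the result with a NULL buffer, allocate, then
 * fetch.  Returns a malloc()ed string the caller must free, or NULL.
 */
static char *fetch_label(const char *label)
{
	long need = get_security_sketch(label, NULL, 0);
	char *buf;

	if (need <= 0)
		return NULL;
	buf = malloc((size_t)need);
	if (!buf)
		return NULL;
	if (get_security_sketch(label, buf, (size_t)need) != need) {
		free(buf);
		return NULL;
	}
	return buf;
}
```

A caller that passes a buffer that is too small gets a truncated copy without
a NUL terminator, so it should compare the return value against the buffer
size before treating the buffer as a string.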
[PATCH 08/27] Add a secctx_to_secid() LSM hook to go along with the existing
secid_to_secctx() LSM hook. This patch also includes the SELinux implementation for this hook. Signed-off-by: Paul Moore <[EMAIL PROTECTED]> Acked-by: Stephen Smalley <[EMAIL PROTECTED]> --- include/linux/security.h | 13 + security/dummy.c |6 ++ security/security.c |6 ++ security/selinux/hooks.c |6 ++ 4 files changed, 31 insertions(+), 0 deletions(-) diff --git a/include/linux/security.h b/include/linux/security.h index b7ba073..e8f2f2d 100644 --- a/include/linux/security.h +++ b/include/linux/security.h @@ -1200,6 +1200,10 @@ struct request_sock; * Convert secid to security context. * @secid contains the security ID. * @secdata contains the pointer that stores the converted security context. + * @secctx_to_secid: + * Convert security context to secid. + * @secid contains the pointer to the generated security ID. + * @secdata contains the security context. * * @release_secctx: * Release the security context. @@ -1389,6 +1393,7 @@ struct security_operations { int (*getprocattr)(struct task_struct *p, char *name, char **value); int (*setprocattr)(struct task_struct *p, char *name, void *value, size_t size); int (*secid_to_secctx)(u32 secid, char **secdata, u32 *seclen); + int (*secctx_to_secid)(char *secdata, u32 seclen, u32 *secid); void (*release_secctx)(char *secdata, u32 seclen); #ifdef CONFIG_SECURITY_NETWORK @@ -1623,6 +1628,7 @@ int security_setprocattr(struct task_struct *p, char *name, void *value, size_t int security_netlink_send(struct sock *sk, struct sk_buff *skb); int security_netlink_recv(struct sk_buff *skb, int cap); int security_secid_to_secctx(u32 secid, char **secdata, u32 *seclen); +int security_secctx_to_secid(char *secdata, u32 seclen, u32 *secid); void security_release_secctx(char *secdata, u32 seclen); #else /* CONFIG_SECURITY */ @@ -2305,6 +2311,13 @@ static inline int security_secid_to_secctx(u32 secid, char **secdata, u32 *secle return -EOPNOTSUPP; } +static inline int security_secctx_to_secid(char *secdata, + u32 seclen, + u32 *secid) +{ 
+ return -EOPNOTSUPP; +} + static inline void security_release_secctx(char *secdata, u32 seclen) { } diff --git a/security/dummy.c b/security/dummy.c index 6f97089..72f1666 100644 --- a/security/dummy.c +++ b/security/dummy.c @@ -943,6 +943,11 @@ static int dummy_secid_to_secctx(u32 secid, char **secdata, u32 *seclen) return -EOPNOTSUPP; } +static int dummy_secctx_to_secid(char *secdata, u32 seclen, u32 *secid) +{ + return -EOPNOTSUPP; +} + static void dummy_release_secctx(char *secdata, u32 seclen) { } @@ -1109,6 +1114,7 @@ void security_fixup_ops (struct security_operations *ops) set_to_dummy_if_null(ops, getprocattr); set_to_dummy_if_null(ops, setprocattr); set_to_dummy_if_null(ops, secid_to_secctx); + set_to_dummy_if_null(ops, secctx_to_secid); set_to_dummy_if_null(ops, release_secctx); #ifdef CONFIG_SECURITY_NETWORK set_to_dummy_if_null(ops, unix_stream_connect); diff --git a/security/security.c b/security/security.c index 92d66d6..1ef4908 100644 --- a/security/security.c +++ b/security/security.c @@ -821,6 +821,12 @@ int security_secid_to_secctx(u32 secid, char **secdata, u32 *seclen) } EXPORT_SYMBOL(security_secid_to_secctx); +int security_secctx_to_secid(char *secdata, u32 seclen, u32 *secid) +{ + return security_ops->secctx_to_secid(secdata, seclen, secid); +} +EXPORT_SYMBOL(security_secctx_to_secid); + void security_release_secctx(char *secdata, u32 seclen) { return security_ops->release_secctx(secdata, seclen); diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c index 20a6b55..1d3eab7 100644 --- a/security/selinux/hooks.c +++ b/security/selinux/hooks.c @@ -4734,6 +4734,11 @@ static int selinux_secid_to_secctx(u32 secid, char **secdata, u32 *seclen) return security_sid_to_context(secid, secdata, seclen); } +static int selinux_secctx_to_secid(char *secdata, u32 seclen, u32 *secid) +{ + return security_context_to_sid(secdata, seclen, secid); +} + static void selinux_release_secctx(char *secdata, u32 seclen) { kfree(secdata); @@ -4937,6 +4942,7 
@@ static struct security_operations selinux_ops = { .setprocattr = selinux_setprocattr, .secid_to_secctx = selinux_secid_to_secctx, + .secctx_to_secid = selinux_secctx_to_secid, .release_secctx = selinux_release_secctx, .unix_stream_connect = selinux_socket_unix_stream_connect, -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
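The dispatch pattern this patch extends (an operations table, a dummy default that returns -EOPNOTSUPP, and a thin wrapper) can be sketched in plain userspace C. This is only an illustration under stated assumptions: struct sec_ops, fixup_ops(), call_secctx_to_secid() and the toy hook are made-up names, not the kernel's, and the fixed ID 42 is arbitrary.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, cut-down operations table holding just one hook. */
struct sec_ops {
	int (*secctx_to_secid)(const char *secdata, uint32_t seclen,
			       uint32_t *secid);
};

/* Default used when no security module supplies the hook. */
static int dummy_secctx_to_secid(const char *secdata, uint32_t seclen,
				 uint32_t *secid)
{
	(void)secdata;
	(void)seclen;
	(void)secid;
	return -EOPNOTSUPP;
}

/* Toy module hook: maps any non-empty context to the made-up ID 42. */
static int toy_secctx_to_secid(const char *secdata, uint32_t seclen,
			       uint32_t *secid)
{
	if (!secdata || seclen == 0)
		return -EINVAL;
	*secid = 42;
	return 0;
}

/* Same idea as set_to_dummy_if_null(): fill any gap with the dummy. */
static void fixup_ops(struct sec_ops *ops)
{
	if (!ops->secctx_to_secid)
		ops->secctx_to_secid = dummy_secctx_to_secid;
}

/* Wrapper for callers; no NULL check is needed once fixup_ops() has run. */
static int call_secctx_to_secid(const struct sec_ops *ops,
				const char *secdata, uint32_t seclen,
				uint32_t *secid)
{
	return ops->secctx_to_secid(secdata, seclen, secid);
}
```

The point of the dummy entry is that the wrapper can call through the table unconditionally, which is why the patch touches dummy.c and security_fixup_ops() as well as the SELinux table.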
[PATCH 05/27] Security: Change current->fs[ug]id to current_fs[ug]id()
Change current->fs[ug]id to current_fs[ug]id() so that fsgid and fsuid can be separated from the task_struct. Signed-off-by: David Howells <[EMAIL PROTECTED]> --- arch/ia64/kernel/perfmon.c|4 ++-- arch/powerpc/platforms/cell/spufs/inode.c |4 ++-- drivers/isdn/capi/capifs.c|4 ++-- drivers/usb/core/inode.c |4 ++-- fs/9p/fid.c |2 +- fs/9p/vfs_inode.c |4 ++-- fs/9p/vfs_super.c |4 ++-- fs/affs/inode.c |4 ++-- fs/anon_inodes.c |4 ++-- fs/attr.c |4 ++-- fs/bfs/dir.c |4 ++-- fs/cifs/cifsproto.h |2 +- fs/cifs/dir.c | 12 ++-- fs/cifs/inode.c |8 fs/cifs/misc.c|4 ++-- fs/coda/cache.c |6 +++--- fs/coda/upcall.c |4 ++-- fs/devpts/inode.c |4 ++-- fs/dquot.c|2 +- fs/exec.c |4 ++-- fs/ext2/balloc.c |2 +- fs/ext2/ialloc.c |4 ++-- fs/ext2/ioctl.c |2 +- fs/ext3/balloc.c |2 +- fs/ext3/ialloc.c |4 ++-- fs/ext4/balloc.c |2 +- fs/ext4/ialloc.c |4 ++-- fs/fuse/dev.c |4 ++-- fs/gfs2/inode.c | 10 +- fs/hfs/inode.c|4 ++-- fs/hfsplus/inode.c|4 ++-- fs/hpfs/namei.c | 24 fs/hugetlbfs/inode.c | 16 fs/jffs2/fs.c |4 ++-- fs/jfs/jfs_inode.c|4 ++-- fs/locks.c|2 +- fs/minix/bitmap.c |4 ++-- fs/namei.c|8 fs/nfsd/vfs.c |4 ++-- fs/ocfs2/dlm/dlmfs.c |8 fs/ocfs2/namei.c |4 ++-- fs/pipe.c |4 ++-- fs/posix_acl.c|4 ++-- fs/ramfs/inode.c |4 ++-- fs/reiserfs/namei.c |4 ++-- fs/sysv/ialloc.c |4 ++-- fs/udf/ialloc.c |4 ++-- fs/udf/namei.c|2 +- fs/ufs/ialloc.c |4 ++-- fs/xfs/linux-2.6/xfs_linux.h |4 ++-- fs/xfs/xfs_acl.c |6 +++--- fs/xfs/xfs_attr.c |2 +- fs/xfs/xfs_inode.c|6 +++--- fs/xfs/xfs_vnodeops.c |8 include/linux/fs.h|2 +- include/linux/sched.h |3 +++ ipc/mqueue.c |4 ++-- kernel/cgroup.c |4 ++-- mm/shmem.c|8 net/9p/client.c |2 +- net/socket.c |4 ++-- net/sunrpc/auth.c |8 security/commoncap.c |8 security/keys/key.c |2 +- security/keys/keyctl.c|2 +- security/keys/request_key.c | 10 +- security/keys/request_key_auth.c |2 +- 67 files changed, 163 insertions(+), 160 deletions(-) diff --git a/arch/ia64/kernel/perfmon.c b/arch/ia64/kernel/perfmon.c index 73e7c2e..ef383d9 100644 --- a/arch/ia64/kernel/perfmon.c +++ 
b/arch/ia64/kernel/perfmon.c @@ -2206,8 +2206,8 @@ pfm_alloc_fd(struct file **cfile) DPRINT(("new inode ino=%ld @%p\n", inode->i_ino, inode)); inode->i_mode = S_IFCHR|S_IRUGO; - inode->i_uid = current->fsuid; - inode->i_gid = current->fsgid; + inode->i_uid = current_fsuid(); + inode->i_gid = current_fsgid(); sprintf(name, "[%lu]", inode->i_ino); this.name = name; diff --git a/arch/powerpc/platforms/cell/spufs/inode.c b/arch/powerpc/platforms/cell/spufs/inode.c index c0e968a..4efe7bf 100644 --- a/arch/powerpc/platforms/cell/spufs/inode.c +++ b/arch/powerpc/platforms/cell/spufs/inode.c @@ -85,8 +85,8 @@ spufs_new_inode(struct super_block *sb, int mode) goto out; inode->i_mode = mode; - inode->i_uid = current->fsuid; - inode->i_gid = current->fsgid; + inode->i_uid = current_fsuid();
[PATCH 02/27] KEYS: Check starting keyring as part of search
Check the starting keyring as part of the search to (a) see if that is what
we're searching for, and (b) to check it is still valid for searching.

The scenario: User in process A does things that cause things to be created in
its process session keyring.  The user then does an su to another user and
starts a new process, B.  The two processes now share the same process session
keyring.

Process B does an NFS access which results in an upcall to gssd.  When gssd
attempts to instantiate the context key (to be linked into the process session
keyring), it is denied access even though it has an authorization key.

The order of calls is:

   keyctl_instantiate_key()
      lookup_user_key()                            (the default: case)
         search_process_keyrings(current)
            search_process_keyrings(rka->context)  (recursive call)
               keyring_search_aux()

keyring_search_aux() verifies the keys and keyrings underneath the top-level
keyring it is given, but that top-level keyring is neither fully validated nor
checked to see if it is the thing being searched for.

This patch changes keyring_search_aux() to:
1) do more validation on the top keyring it is given and
2) check whether that top-level keyring is the thing being searched for

Signed-off-by: Kevin Coffman <[EMAIL PROTECTED]>
Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 security/keys/keyring.c |   35 +++
 1 files changed, 31 insertions(+), 4 deletions(-)

diff --git a/security/keys/keyring.c b/security/keys/keyring.c
index 88292e3..76b89b2 100644
--- a/security/keys/keyring.c
+++ b/security/keys/keyring.c
@@ -292,7 +292,7 @@ key_ref_t keyring_search_aux(key_ref_t keyring_ref,
 
 	struct keyring_list *keylist;
 	struct timespec now;
-	unsigned long possessed;
+	unsigned long possessed, kflags;
 	struct key *keyring, *key;
 	key_ref_t key_ref;
 	long err;
@@ -318,6 +318,32 @@ key_ref_t keyring_search_aux(key_ref_t keyring_ref,
 	now = current_kernel_time();
 	err = -EAGAIN;
 	sp = 0;
+
+	/* firstly we should check to see if this top-level keyring is what we
+	 * are looking for */
+	key_ref = ERR_PTR(-EAGAIN);
+	kflags = keyring->flags;
+	if (keyring->type == type && match(keyring, description)) {
+		key = keyring;
+
+		/* check it isn't negative and hasn't expired or been
+		 * revoked */
+		if (kflags & (1 << KEY_FLAG_REVOKED))
+			goto error_2;
+		if (key->expiry && now.tv_sec >= key->expiry)
+			goto error_2;
+		key_ref = ERR_PTR(-ENOKEY);
+		if (kflags & (1 << KEY_FLAG_NEGATIVE))
+			goto error_2;
+		goto found;
+	}
+
+	/* otherwise, the top keyring must not be revoked, expired, or
+	 * negatively instantiated if we are to search it */
+	key_ref = ERR_PTR(-EAGAIN);
+	if (kflags & ((1 << KEY_FLAG_REVOKED) | (1 << KEY_FLAG_NEGATIVE)) ||
+	    (keyring->expiry && now.tv_sec >= keyring->expiry))
+		goto error_2;
 
 	/* start processing a new keyring */
 descend:
@@ -331,13 +357,14 @@ descend:
 	/* iterate through the keys in this keyring first */
 	for (kix = 0; kix < keylist->nkeys; kix++) {
 		key = keylist->keys[kix];
+		kflags = key->flags;
 
 		/* ignore keys not of this type */
 		if (key->type != type)
 			continue;
 
 		/* skip revoked keys and expired keys */
-		if (test_bit(KEY_FLAG_REVOKED, &key->flags))
+		if (kflags & (1 << KEY_FLAG_REVOKED))
 			continue;
 
 		if (key->expiry && now.tv_sec >= key->expiry)
@@ -352,8 +379,8 @@ descend:
 					    context, KEY_SEARCH) < 0)
 			continue;
 
-		/* we set a different error code if we find a negative key */
-		if (test_bit(KEY_FLAG_NEGATIVE, &key->flags)) {
+		/* we set a different error code if we pass a negative key */
+		if (kflags & (1 << KEY_FLAG_NEGATIVE)) {
 			err = -ENOKEY;
 			continue;
 		}
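The decision the patch adds for the top-level keyring can be sketched outside the kernel as a small pure function. This is a simplified model under stated assumptions: struct toy_key, the FLAG_* values and check_top_keyring() are hypothetical names, string comparison stands in for the key type's match function, and the possession and permission checks are left out.

```c
#include <errno.h>
#include <stdbool.h>
#include <string.h>
#include <time.h>

#ifndef ENOKEY
#define ENOKEY 126	/* Linux value, defined here only for portability */
#endif

/* Hypothetical flag bits standing in for KEY_FLAG_REVOKED / KEY_FLAG_NEGATIVE. */
enum { FLAG_REVOKED = 1u << 0, FLAG_NEGATIVE = 1u << 1 };

/* Hypothetical, cut-down key: just what the check needs. */
struct toy_key {
	const char *type;
	const char *description;
	unsigned long flags;
	time_t expiry;		/* 0 means "never expires" */
};

/*
 * Returns 1 if the top-level keyring is itself the thing searched for,
 * 0 if it is not a match but is valid to descend into, or a negative
 * errno if the search has to stop at the top.
 */
static int check_top_keyring(const struct toy_key *top, const char *type,
			     const char *description, time_t now)
{
	bool expired = top->expiry && now >= top->expiry;

	if (strcmp(top->type, type) == 0 &&
	    strcmp(top->description, description) == 0) {
		/* a revoked or expired match is reported as -EAGAIN */
		if ((top->flags & FLAG_REVOKED) || expired)
			return -EAGAIN;
		/* a negatively instantiated match is reported as -ENOKEY */
		if (top->flags & FLAG_NEGATIVE)
			return -ENOKEY;
		return 1;
	}

	/* not a match: only search below it if it is still usable */
	if ((top->flags & (FLAG_REVOKED | FLAG_NEGATIVE)) || expired)
		return -EAGAIN;
	return 0;
}
```

The error codes follow the order of the added hunk: -EAGAIN is the default, and -ENOKEY is used only once the matching keyring is known to be neither revoked nor expired.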
[PATCH 03/27] KEYS: Allow the callout data to be passed as a blob rather than a string
Allow the callout data to be passed as a blob rather than a string for internal kernel services that call any request_key_*() interface other than request_key(). request_key() itself still takes a NUL-terminated string. The functions that change are: request_key_with_auxdata() request_key_async() request_key_async_with_auxdata() Signed-off-by: David Howells <[EMAIL PROTECTED]> --- Documentation/keys-request-key.txt | 11 +--- Documentation/keys.txt | 14 +++--- include/linux/key.h|9 --- security/keys/internal.h |9 --- security/keys/keyctl.c |7 - security/keys/request_key.c| 49 ++-- security/keys/request_key_auth.c | 12 + 7 files changed, 70 insertions(+), 41 deletions(-) diff --git a/Documentation/keys-request-key.txt b/Documentation/keys-request-key.txt index 266955d..09b55e4 100644 --- a/Documentation/keys-request-key.txt +++ b/Documentation/keys-request-key.txt @@ -11,26 +11,29 @@ request_key*(): struct key *request_key(const struct key_type *type, const char *description, - const char *callout_string); + const char *callout_info); or: struct key *request_key_with_auxdata(const struct key_type *type, const char *description, -const char *callout_string, +const char *callout_info, +size_t callout_len, void *aux); or: struct key *request_key_async(const struct key_type *type, const char *description, - const char *callout_string); + const char *callout_info, + size_t callout_len); or: struct key *request_key_async_with_auxdata(const struct key_type *type, const char *description, - const char *callout_string, + const char *callout_info, + size_t callout_len, void *aux); Or by userspace invoking the request_key system call: diff --git a/Documentation/keys.txt b/Documentation/keys.txt index 51652d3..b82d38d 100644 --- a/Documentation/keys.txt +++ b/Documentation/keys.txt @@ -771,7 +771,7 @@ payload contents" for more information. 
struct key *request_key(const struct key_type *type, const char *description, - const char *callout_string); + const char *callout_info); This is used to request a key or keyring with a description that matches the description specified according to the key type's match function. This @@ -793,24 +793,28 @@ payload contents" for more information. struct key *request_key_with_auxdata(const struct key_type *type, const char *description, -const char *callout_string, +const void *callout_info, +size_t callout_len, void *aux); This is identical to request_key(), except that the auxiliary data is -passed to the key_type->request_key() op if it exists. +passed to the key_type->request_key() op if it exists, and the callout_info +is a blob of length callout_len, if given (the length may be 0). (*) A key can be requested asynchronously by calling one of: struct key *request_key_async(const struct key_type *type, const char *description, - const char *callout_string); + const void *callout_info, + size_t callout_len); or: struct key *request_key_async_with_auxdata(const struct key_type *type, const char *description, - const char *callout_string, + const char *callout_info, + size_t callout_len, void *aux); which are asynchronous equivalents of request_key() and diff --git a/include/linux/key.h b/include/linux/key.h index a70b8a8..163f864 100644 --- a/include/linux/key.h +++
[PATCH 01/27] KEYS: Increase the payload size when instantiating a key
Increase the size of a payload that can be used to instantiate a key in
add_key() and keyctl_instantiate_key().  This permits huge CIFS SPNEGO blobs to
be passed around.  The limit is raised to 1MB.  If kmalloc() can't allocate a
buffer of sufficient size, vmalloc() will be tried instead.

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 security/keys/keyctl.c |   38 ++
 1 files changed, 30 insertions(+), 8 deletions(-)

diff --git a/security/keys/keyctl.c b/security/keys/keyctl.c
index d9ca15c..8ec8432 100644
--- a/security/keys/keyctl.c
+++ b/security/keys/keyctl.c
@@ -19,6 +19,7 @@
 #include
 #include
 #include
+#include
 #include
 
 #include "internal.h"
@@ -62,9 +63,10 @@ asmlinkage long sys_add_key(const char __user *_type,
 	char type[32], *description;
 	void *payload;
 	long ret;
+	bool vm;
 
 	ret = -EINVAL;
-	if (plen > 32767)
+	if (plen > 1024 * 1024 - 1)
 		goto error;
 
 	/* draw all the data into kernel space */
@@ -81,11 +83,18 @@ asmlinkage long sys_add_key(const char __user *_type,
 	/* pull the payload in if one was supplied */
 	payload = NULL;
 
+	vm = false;
 	if (_payload) {
 		ret = -ENOMEM;
 		payload = kmalloc(plen, GFP_KERNEL);
-		if (!payload)
-			goto error2;
+		if (!payload) {
+			if (plen <= PAGE_SIZE)
+				goto error2;
+			vm = true;
+			payload = vmalloc(plen);
+			if (!payload)
+				goto error2;
+		}
 
 		ret = -EFAULT;
 		if (copy_from_user(payload, _payload, plen) != 0)
@@ -113,7 +122,10 @@ asmlinkage long sys_add_key(const char __user *_type,
 
 	key_ref_put(keyring_ref);
  error3:
-	kfree(payload);
+	if (!vm)
+		kfree(payload);
+	else
+		vfree(payload);
  error2:
 	kfree(description);
  error:
@@ -821,9 +833,10 @@ long keyctl_instantiate_key(key_serial_t id,
 	key_ref_t keyring_ref;
 	void *payload;
 	long ret;
+	bool vm = false;
 
 	ret = -EINVAL;
-	if (plen > 32767)
+	if (plen > 1024 * 1024 - 1)
 		goto error;
 
 	/* the appropriate instantiation authorisation key must have been
@@ -843,8 +856,14 @@ long keyctl_instantiate_key(key_serial_t id,
 	if (_payload) {
 		ret = -ENOMEM;
 		payload = kmalloc(plen, GFP_KERNEL);
-		if (!payload)
-			goto error;
+		if (!payload) {
+			if (plen <= PAGE_SIZE)
+				goto error;
+			vm = true;
+			payload = vmalloc(plen);
+			if (!payload)
+				goto error;
+		}
 
 		ret = -EFAULT;
 		if (copy_from_user(payload, _payload, plen) != 0)
@@ -877,7 +896,10 @@ long keyctl_instantiate_key(key_serial_t id,
 	}
 
 error2:
-	kfree(payload);
+	if (!vm)
+		kfree(payload);
+	else
+		vfree(payload);
 error:
 	return ret;
 
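The allocate-with-fallback logic can be tried out in userspace with stand-in allocators. This is a sketch under stated assumptions, not the kernel code: toy_contig_alloc() and toy_virt_alloc() are hypothetical stand-ins for kmalloc() and vmalloc() (both built on malloc()), the 8-page failure threshold is invented so that the fallback path is exercised, and TOY_PAGE_SIZE is assumed to be 4096.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_PAGE_SIZE	4096u
#define TOY_MAX_PAYLOAD	(1024u * 1024u - 1u)

/* Stand-in for kmalloc(): pretend contiguous requests above 8 pages fail. */
static void *toy_contig_alloc(size_t len)
{
	return len > 8 * TOY_PAGE_SIZE ? NULL : malloc(len);
}

/* Stand-in for vmalloc(): no size-related failure in this model. */
static void *toy_virt_alloc(size_t len)
{
	return malloc(len);
}

static void toy_contig_free(void *p) { free(p); }
static void toy_virt_free(void *p) { free(p); }

/* Buffer plus a record of which allocator produced it. */
struct payload_buf {
	void *data;
	bool vm;
};

/* Try the contiguous allocator first, fall back for multi-page sizes. */
static int payload_alloc(struct payload_buf *buf, size_t plen)
{
	buf->data = NULL;
	buf->vm = false;

	if (plen > TOY_MAX_PAYLOAD)
		return -EINVAL;

	buf->data = toy_contig_alloc(plen);
	if (!buf->data) {
		/* a sub-page request that fails will not do better elsewhere */
		if (plen <= TOY_PAGE_SIZE)
			return -ENOMEM;
		buf->vm = true;
		buf->data = toy_virt_alloc(plen);
		if (!buf->data)
			return -ENOMEM;
	}
	return 0;
}

/* The free path must mirror the allocator that was actually used. */
static void payload_free(struct payload_buf *buf)
{
	if (!buf->vm)
		toy_contig_free(buf->data);
	else
		toy_virt_free(buf->data);
	buf->data = NULL;
}
```

Keeping the flag next to the pointer is the design point: the patch carries a separate bool vm for the same reason, because the pointer alone does not say which free routine is the right one.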
[PATCH 00/27] Permit filesystem local caching
These patches add local caching for network filesystems such as NFS.

The patches can roughly be broken down into a number of sets:

  (*) 01-keys-inc-payload.diff
  (*) 02-keys-search-keyring.diff
  (*) 03-keys-callout-blob.diff

      Three patches to the keyring code made to help the CIFS people.
      Included because of patches 05-08.

  (*) 04-keys-get-label.diff

      A patch to allow the security label of a key to be retrieved.
      Included because of patches 05-08.

  (*) 05-security-current-fsugid.diff
  (*) 06-security-separate-task-bits.diff
  (*) 07-security-subjective.diff
  (*) 08-security-secctx2secid.diff
  (*) 09-security-additional-classes.diff
  (*) 10-security-kernel_service-class.diff
  (*) 11-security-kernel-service.diff
  (*) 12-security-nfsd.diff

      Patches to permit the subjective security of a task to be overridden.
      All the security details in task_struct are decanted into a new struct
      that task_struct then has two pointers to: one that defines the
      objective security of that task (how other tasks may affect it) and one
      that defines the subjective security (how it may affect other objects).

      Note that I have dropped the idea of struct cred for the moment.  With
      the amount of stuff that was excluded from it, it wasn't actually any
      use to me.  However, it can be added later.

      Required for cachefiles.

  (*) 13-release-page.diff
  (*) 14-fscache-page-flags.diff
  (*) 15-add_wait_queue_tail.diff
  (*) 16-fscache.diff

      Patches to provide a local caching facility for network filesystems.

  (*) 17-cachefiles-ia64.diff
  (*) 18-cachefiles-ext3-f_mapping.diff
  (*) 19-cachefiles-write.diff
  (*) 20-cachefiles-monitor.diff
  (*) 21-cachefiles-export.diff
  (*) 22-cachefiles.diff

      Patches to provide a local cache in a directory of an already mounted
      filesystem.

  (*) 23-nfs-memleak.diff
  (*) 24-fscache-nfs.diff
  (*) 25-fscache-nfs-mount.diff
  (*) 26-fscache-nfs-display.diff
  (*) 27-fscache-nfs-persb.diff

      Patches to provide NFS with local caching.  The fifth of these patches
      makes caching configurable per superblock.

I've updated the patches to compile on as many arches as I can get compilers
for and can get to compile.  However, for patch 06, the sparc and alpha arches
need some asm work as they access security information from asm code, using
asm-offsets to calculate the offset.

The SELinux base code will also need updating to have the security class, lest
the following error appear in dmesg:

	context_struct_compute_av:  unrecognized class 69

I've provided a patch to make NFSd use task_security and current->act_as to
change its security settings.

I've also renamed the accessors for the PG_fscache and PG_fscache_write bits in
page-flags.h, pagemap.h and filemap.c (they subclass PG_private_2 and
PG_owner_priv_2 so these are the accessors in the main headers).  I've then
wrapped them in fscache.h.

--
A tarball of the patches is available at:

	http://people.redhat.com/~dhowells/fscache/patches/nfs+fscache-27.tar.bz2

To use this version of CacheFiles, the cachefilesd-0.9 is also required.  It
is available as an SRPM:

	http://people.redhat.com/~dhowells/fscache/cachefilesd-0.9-1.fc7.src.rpm

Or as individual bits:

	http://people.redhat.com/~dhowells/fscache/cachefilesd-0.9.tar.bz2
	http://people.redhat.com/~dhowells/fscache/cachefilesd.fc
	http://people.redhat.com/~dhowells/fscache/cachefilesd.if
	http://people.redhat.com/~dhowells/fscache/cachefilesd.te
	http://people.redhat.com/~dhowells/fscache/cachefilesd.spec

The .fc, .if and .te files are for manipulating SELinux.

David
Re: [kvm-devel] [PATCH] export notifier #1
On Wed, 23 Jan 2008, Benjamin Herrenschmidt wrote:

> > - anon_vma/inode and pte locks are held during callbacks.
>
> So how does that fix the problem of sleeping then ?

The locks are taken in the mmu_ops patch. This patch does not hold them
while performing the callbacks.
Re: Massive IDE problems. Who leaves data here?
Manuel Reimer wrote:
> Hello,
>
> anything started with a try to burn Slackware 12.0 from the original DVD
> to an new medium with different boot settings. I always got corrupted
> results and didn't know why. So I started with an "md5sum -c
> CHECKSUMS.md5" directly on the original media. This resulted in
> "anything OK". Now I copied the whole DVD to my hard drive and created
> an ISO from it. I mounted the ISO locally and my md5sum now results in 5
> corrupted files.
>
> --> A Bug in mkisofs? No, unfortunately not, as a md5sum on the copy, I
> have created from the original DVD by using "cp -vr" is corrupted, too!

Possibly a known kernel problem, you may have read past the end of data
into the pad sectors of the DVD and gotten garbage at the end of the ISO
image. Use isoinfo to determine the correct size of the ISO filesystem,
and compare. You can try setting readahead on the DVD reader to zero
with blockdev. If the file is smaller, other bug, if readahead hit EOF
it returns no data instead of a short read, the blockdev fix should
handle that as well. This was supposed to be fixed in recent kernels,
that may be true.

I suggest the [EMAIL PROTECTED] mailing list is a better forum for
CD/DVD/BR problems, good technical people, unfortunately with personal
agendas in some cases.

> So md5sum on the original DVD is OK, but after copying to my hard drive,
> several files are corrupted.

That's odd, I would expect the data on the disk to just be the wrong
size, and get a CRC on that. You might also use readcd to pull the data,
that almost always does what it should.

> I'm using kernel 2.6.21.5. Distribution is Slackware 12.0
>
> All my "partitions" are LVs in LVM2
>
> I also updated the kernel to 2.6.23.12 to test with this one, but I
> still get corrupted files.
>
> Is this a LVM bug? Do I already have a corrupted LVM filesystem? How to
> check/fix it?
>
> Is this a known kernel bug? Which may be the reason for corrupted files?
>
> I've created a backup of my important data to a second disc to a "real
> ext2 partition" (without LVM), but this is connected to the same IDE
> controller and I don't even know if I may still trust my mainboard...
>
> I also get those kernel messages via dmesg:
>
> http://pastebin.org/16537

Could be anything, in no order dirty lens, bad drive, bad DVD, firmware
error, cable, power supply, acpi confused... could even be a poorly
handled end of data on the DVD. Not enough info for me to tell, for
sure. Trying readcd is cheap, turning off readahead on the DVD drive is
easy, if the problem persists you probably want to take it to the
mailing list.

> Thank you very much in advance for any help!

I'm not sure I helped, but you now have more and better things about
which to be confused. ;-)

> Yours
>
> Manuel

--
Bill Davidsen <[EMAIL PROTECTED]>
  "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot
Re: CPA boot crash (was: [PATCH] [0/36] Great change_page_attr patch series v3)
On Tue, 22 Jan 2008, Andi Kleen wrote:
> According to you and Ingo "the global perspective" is to get
> simple stuff first in. But in this case you're doing the complicated
> (and worse the unfinished) stuff first which seems to be against
> your own principles.

No, the global perspective is to get a stable and reliable system,
which allows us to do new features like gbpages, PAT and whatever
comes up next in a clean way.

Your patches just shove another extra into the existing code base
without doing any consolidation work and without any consideration of
problems we need to urgently solve in this area.

Your only care is to get stuff merged which is interesting for you. I
can understand that, but it should be entirely clear to you as an
engineer that ignoring the existing problems and adding more (even
simple) stuff makes it more complex to consolidate and is nothing else
than bad engineering.

PAT is high on the requirements list, not because it's not complex (it
definitely is), but simply because Linux has a years-long backlog
(it's the only modern OS on the planet still not using PAT) and
hardware makers are stepping beyond the limits of MTRRs. There is an
increasing number of systems which don't work under Linux properly due
to the MTRR limitations, but work perfectly fine with other OSs.
Should we ignore that ?

While PAT is a 10 years old hardware feature, gbpages is a feature for
a brand new chip, which is not even available to mere mortals in a
useable form. And there is no real problem with not having gbpages for
some time. So where is the pressure to get that in? Just because it
can be done and happens to work on some test machine?

PAT patches have been around for years and nothing happened - while
the first time gbpages were submitted was 19 days ago by you. Of all
pending features, PAT has a priority simply because it affects
users. The lack of gbpages does not.

We are not going to rush PAT in before it is stable, but we hold
everything off which interferes with getting it to that point.

Please stop arguing around with the subtle undertone of us having no
clue about the topics. We looked into the whole set of pending issues,
including your gbpages patchset and we well understand the
implications.

It is quite clear that we need to fix the underlying system _before_
we add more things to it. That applies to PAT, CPA and gbpages in the
same way. In the end all of those features will benefit from a
consolidated implementation.

It's up to you whether you help to get there sooner or just sit back
and argue in circles until others have done the hard work and you can
add gbpages.

Thanks,

	tglx
Re: [PATCH 09/26] atl1: refactor tx processing
On Tue, 22 Jan 2008 04:58:17 -0500
Jeff Garzik <[EMAIL PROTECTED]> wrote:

> [EMAIL PROTECTED] wrote:
> > From: Jay Cliburn <[EMAIL PROTECTED]>
> >
> > Refactor tx processing to use a less convoluted tx packet
> > descriptor and to conform generally with the vendor's current
> > version 1.2.40.2.
> >
> > Signed-off-by: Jay Cliburn <[EMAIL PROTECTED]>
> > ---
> >  drivers/net/atlx/atl1.c |  265 +--
> >  drivers/net/atlx/atl1.h |  201 +++-
> >  2 files changed, 246 insertions(+), 220 deletions(-)
>
> for such a huge patch, this description is very tiny.  [describe]
> what is refactored, and why.

Okay, I'll go back and rework the offending descriptions for this and
the other patches in this set.

> what does "less convoluted" mean?

I should have written "simpler," I suppose.

Before:
===
struct tso_param {
	u32 tsopu;	/* tso_param upper word */
	u32 tsopl;	/* tso_param lower word */
};

struct csum_param {
	u32 csumpu;	/* csum_param upper word */
	u32 csumpl;	/* csum_param lower word */
};

union tpd_descr {
	u64 data;
	struct csum_param csum;
	struct tso_param tso;
};

struct tx_packet_desc {
	__le64 buffer_addr;
	union tpd_descr desc;
};

After:
==
struct tx_packet_desc {
	__le64 buffer_addr;
	__le32 word2;
	__le32 word3;
};
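One way to see what "simpler" buys here is to compare the two layouts side by side. The sketch below is only an illustration under stated assumptions: plain uint64_t/uint32_t stand in for the kernel's u64, __le64 and __le32, the old_/new_ prefixes are added so both versions can coexist, and layouts_match() is a hypothetical helper. It checks only sizes and offsets; the thread does not say which bits of each word carry which field, so nothing is claimed about that.

```c
#include <stddef.h>
#include <stdint.h>

/* "Before" layout, with fixed-width stand-ins for the kernel types. */
struct old_tso_param {
	uint32_t tsopu;		/* tso_param upper word */
	uint32_t tsopl;		/* tso_param lower word */
};

struct old_csum_param {
	uint32_t csumpu;	/* csum_param upper word */
	uint32_t csumpl;	/* csum_param lower word */
};

union old_tpd_descr {
	uint64_t data;
	struct old_csum_param csum;
	struct old_tso_param tso;
};

struct old_tx_packet_desc {
	uint64_t buffer_addr;
	union old_tpd_descr desc;
};

/* "After" layout: the union is flattened into two 32-bit words. */
struct new_tx_packet_desc {
	uint64_t buffer_addr;
	uint32_t word2;
	uint32_t word3;
};

/* Returns 1 if both descriptors cover the same bytes at the same offsets. */
static int layouts_match(void)
{
	return sizeof(struct old_tx_packet_desc) ==
	       sizeof(struct new_tx_packet_desc) &&
	       offsetof(struct old_tx_packet_desc, desc.tso.tsopu) ==
	       offsetof(struct new_tx_packet_desc, word2) &&
	       offsetof(struct old_tx_packet_desc, desc.tso.tsopl) ==
	       offsetof(struct new_tx_packet_desc, word3) &&
	       offsetof(struct old_tx_packet_desc, desc.csum.csumpu) ==
	       offsetof(struct new_tx_packet_desc, word2) &&
	       offsetof(struct old_tx_packet_desc, desc.csum.csumpl) ==
	       offsetof(struct new_tx_packet_desc, word3);
}
```

On common ABIs the two structs occupy the same bytes, so the change is in how the driver names and fills the words rather than in the descriptor's footprint.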
Re: PROBLEM: SECCOMP documentation outdated in some arch/*/Kconfig
On Tue, 22 Jan 2008 15:41:58 +0100 Helmut Grohne wrote:

> Hi,
>
> I didn't find out whom to report this bug to and thus report to
> linux-kernel@vger.kernel.org as described in
> http://kernel.org/pub/linux/docs/lkml/reporting-bugs.html.

Andrea cc-ed.

Helmut, would you care to make a patch that you think should be applied
to the current kernel source tree?

> I'm posting from outside, so please CC me.
>
> [1] The description about seccomp is outdated in some arch/*/Kconfig
> files.
>
> [2] According to the source (2.6.23.14) seccomp is to be activated using
> prctl. It was previously activated using a file /proc//seccomp.
> The Kconfig documentation (also displayed in menuconfig) does not
> reflect this change and is thus wrong.
>
> [3] seccomp documentation Kconfig
>
> [4] 2.6.23.14, seems to also apply to git head:
>
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=arch/x86/Kconfig;h=80b7ba4056dbbb566841c1e1cbef9475730fe199;hb=HEAD
>
> [5] no oops
>
> [6] less arch/x86_64/Kconfig
> /SECCOMP
>
> [7] Ask me again if you really think you need information about the
> environment for a documentation bug.

---
~Randy
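For reference, the prctl-based activation the report refers to looks roughly like the sketch below. It is a minimal userspace illustration under stated assumptions: seccomp_strict_demo() is a made-up helper name, the fallback definition of PR_SET_SECCOMP is only for old headers, and the function deliberately reports "could not enable" rather than failing, since sandboxes and kernels built without seccomp will refuse the call.

```c
#include <signal.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef PR_SET_SECCOMP
#define PR_SET_SECCOMP 22	/* value from <linux/prctl.h> */
#endif

/*
 * Forks a child, switches it to seccomp mode 1 through prctl(), and lets
 * it attempt a system call that mode 1 forbids.
 *
 * Returns 1 if the kernel killed the child with SIGKILL (seccomp was
 * enforced), 0 if seccomp could not be enabled or was not enforced here,
 * and -1 if fork() or waitpid() failed.
 */
static int seccomp_strict_demo(void)
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;

	if (pid == 0) {
		if (prctl(PR_SET_SECCOMP, 1, 0, 0, 0) != 0)
			_exit(2);	/* not available in this environment */

		/* From here on only read, write, exit and sigreturn are allowed. */
		syscall(SYS_getpid);	/* forbidden: the kernel kills the child */
		syscall(SYS_exit, 3);	/* reached only if nothing was enforced */
	}

	if (waitpid(pid, &status, 0) != pid)
		return -1;

	return WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL;
}
```

The child uses the raw exit system call after entering mode 1 because the C library's _exit() may use exit_group, which is not on the allowed list.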
[PATCH 4/4] PCI: Run ACPI _OSC method on root bridges only
From: Andrew Patterson <[EMAIL PROTECTED]>

According to the PCI Firmware Specification Revision 3.0 section 4.5,
_OSC should only be called on a root bridge. Here is the relevant
passage: "The _OSC interface defined in this section applies only to
Host Bridge ACPI devices that originate PCI, PCI-X, or PCI Express
hierarchies".

Changed the code to find the parent root bridge of the device and call
_OSC on that.

Signed-off-by: Andrew Patterson <[EMAIL PROTECTED]>
---

 drivers/pci/pcie/aer/aerdrv_acpi.c |   22 ++
 1 files changed, 6 insertions(+), 16 deletions(-)

diff --git a/drivers/pci/pcie/aer/aerdrv_acpi.c b/drivers/pci/pcie/aer/aerdrv_acpi.c
index f685bf5..8c199ae 100644
--- a/drivers/pci/pcie/aer/aerdrv_acpi.c
+++ b/drivers/pci/pcie/aer/aerdrv_acpi.c
@@ -31,23 +31,13 @@ int aer_osc_setup(struct pcie_device *pciedev)
 {
 	acpi_status status = AE_NOT_FOUND;
 	struct pci_dev *pdev = pciedev->port;
-	acpi_handle handle = DEVICE_ACPI_HANDLE(&pdev->dev);
-	struct pci_bus *parent;
+	acpi_handle handle = 0;
 
-	while (!handle) {
-		if (!pdev || !pdev->bus->parent)
-			break;
-		parent = pdev->bus->parent;
-		if (!parent->self)
-			/* Parent must be a host bridge */
-			handle = acpi_get_pci_rootbridge_handle(
-					pci_domain_nr(parent),
-					parent->number);
-		else
-			handle = DEVICE_ACPI_HANDLE(
-					&(parent->self->dev));
-		pdev = parent->self;
-	}
+	/* Find root host bridge */
+	while (pdev->bus && pdev->bus->self)
+		pdev = pdev->bus->self;
+	handle = acpi_get_pci_rootbridge_handle(
+			pci_domain_nr(pdev->bus), pdev->bus->number);
 
 	if (handle) {
 		pcie_osc_support_set(OSC_EXT_PCI_CONFIG_SUPPORT);
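The new loop relies on one property of the bus tree: a root bus has no bridge device above it, so its self pointer is NULL. A small standalone model makes the walk easy to follow. It is a sketch under stated assumptions: struct toy_dev, struct toy_bus and find_root_bus() are hypothetical stand-ins for the kernel's struct pci_dev, struct pci_bus and the inline loop in the patch.

```c
#include <stddef.h>

struct toy_bus;

/* Hypothetical device: it only knows which bus it sits on. */
struct toy_dev {
	struct toy_bus *bus;
};

/* Hypothetical bus: self is the bridge that leads to it, NULL for a root bus. */
struct toy_bus {
	struct toy_dev *self;
	int domain;
	int number;
};

/* Climb from a device through its upstream bridges to the root bus. */
static struct toy_bus *find_root_bus(struct toy_dev *dev)
{
	while (dev->bus && dev->bus->self)
		dev = dev->bus->self;
	return dev->bus;
}
```

Once the walk stops, the root bus's domain and number are what the patch passes to acpi_get_pci_rootbridge_handle(), so the handle always belongs to a host bridge no matter how deep the device sits.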
[PATCH 2/4] PCI ACPI: AER driver should only register PCIe devices with _OSC.
From: Andrew Patterson <[EMAIL PROTECTED]>

AER is only used with PCIe devices so we should only check PCIe devices
for _OSC support.

Signed-off-by: Andrew Patterson <[EMAIL PROTECTED]>
---
 drivers/pci/pcie/aer/aerdrv_acpi.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/pci/pcie/aer/aerdrv_acpi.c b/drivers/pci/pcie/aer/aerdrv_acpi.c
index 1a1eb45..f685bf5 100644
--- a/drivers/pci/pcie/aer/aerdrv_acpi.c
+++ b/drivers/pci/pcie/aer/aerdrv_acpi.c
@@ -50,7 +50,7 @@ int aer_osc_setup(struct pcie_device *pciedev)
 	}
 
 	if (handle) {
-		pci_osc_support_set(OSC_EXT_PCI_CONFIG_SUPPORT);
+		pcie_osc_support_set(OSC_EXT_PCI_CONFIG_SUPPORT);
 		status = pci_osc_control_set(handle,
 					OSC_PCI_EXPRESS_AER_CONTROL |
 					OSC_PCI_EXPRESS_CAP_STRUCTURE_CONTROL);
[PATCH 1/4] PCI ACPI: Added a function to register _OSC with only PCIe devices.
From: Andrew Patterson <[EMAIL PROTECTED]>

The function pci_osc_support_set() traverses every root bridge when
checking for _OSC support for a capability. It quits as soon as it finds
a device/bridge that doesn't support the requested capability. This
won't work for systems that have mixed PCI and PCIe bridges when
checking for PCIe features. I split this function into two --
pci_osc_support_set() and pcie_osc_support_set(). The latter is used
when only PCIe devices should be traversed.

Signed-off-by: Andrew Patterson <[EMAIL PROTECTED]>
---
 drivers/pci/pci-acpi.c   |    6 +++---
 include/linux/pci-acpi.h |   11 ++++++++++-
 2 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/drivers/pci/pci-acpi.c b/drivers/pci/pci-acpi.c
index 02e4876..ec61428 100644
--- a/drivers/pci/pci-acpi.c
+++ b/drivers/pci/pci-acpi.c
@@ -156,13 +156,13 @@ run_osc_out:
 }
 
 /**
- * pci_osc_support_set - register OS support to Firmware
+ * __pci_osc_support_set - register OS support to Firmware
  * @flags: OS support bits
  *
  * Update OS support fields and doing a _OSC Query to obtain an update
  * from Firmware on supported control bits.
  **/
-acpi_status pci_osc_support_set(u32 flags)
+acpi_status __pci_osc_support_set(u32 flags, const char *hid)
 {
 	u32 temp;
 	acpi_status retval;
@@ -176,7 +176,7 @@ acpi_status pci_osc_support_set(u32 flags)
 	temp = ctrlset_buf[OSC_CONTROL_TYPE];
 	ctrlset_buf[OSC_QUERY_TYPE] = OSC_QUERY_ENABLE;
 	ctrlset_buf[OSC_CONTROL_TYPE] = OSC_CONTROL_MASKS;
-	acpi_get_devices ( PCI_ROOT_HID_STRING,
+	acpi_get_devices(hid,
 			acpi_query_osc,
 			ctrlset_buf,
 			(void **) &retval );
diff --git a/include/linux/pci-acpi.h b/include/linux/pci-acpi.h
index 936ef82..3ba2506 100644
--- a/include/linux/pci-acpi.h
+++ b/include/linux/pci-acpi.h
@@ -48,7 +48,15 @@
 
 #ifdef CONFIG_ACPI
 extern acpi_status pci_osc_control_set(acpi_handle handle, u32 flags);
-extern acpi_status pci_osc_support_set(u32 flags);
+extern acpi_status __pci_osc_support_set(u32 flags, const char *hid);
+static inline acpi_status pci_osc_support_set(u32 flags)
+{
+	return __pci_osc_support_set(flags, PCI_ROOT_HID_STRING);
+}
+static inline acpi_status pcie_osc_support_set(u32 flags)
+{
+	return __pci_osc_support_set(flags, PCI_EXPRESS_ROOT_HID_STRING);
+}
 #else
 #if !defined(AE_ERROR)
 typedef u32 acpi_status;
@@ -57,6 +65,7 @@ typedef u32 acpi_status;
 static inline acpi_status pci_osc_control_set(acpi_handle handle, u32 flags)
 {return AE_ERROR;}
 static inline acpi_status pci_osc_support_set(u32 flags) {return AE_ERROR;}
+static inline acpi_status pcie_osc_support_set(u32 flags) {return AE_ERROR;}
 #endif
 
 #endif /* _PCI_ACPI_H_ */
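The split described in this patch (one worker taking the HID, two thin inline wrappers) can be tried out in user space. The sketch below is a toy model under stated assumptions, not the kernel implementation: `struct toy_bridge` and all `toy_*` names are hypothetical, the worker matches on an exact HID string only (the real acpi_get_devices() also considers CIDs), and the "PNP0A03" / "PNP0A08" values are used here as the conventional PCI and PCI Express host bridge IDs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_PCI_ROOT_HID	"PNP0A03"	/* assumed PCI host bridge ID */
#define TOY_PCIE_ROOT_HID	"PNP0A08"	/* assumed PCIe host bridge ID */

/* Hypothetical model of a root bridge as seen through ACPI. */
struct toy_bridge {
	const char *hid;	/* hardware ID of the bridge */
	unsigned int supported;	/* capability bits the firmware accepts */
};

/*
 * Worker: visit only bridges whose HID equals hid and fail on the
 * first one that does not accept every bit in flags.
 */
static int toy_osc_support_set_hid(const struct toy_bridge *b, size_t n,
				   unsigned int flags, const char *hid)
{
	size_t i;

	assert(hid != NULL);
	for (i = 0; i < n; i++) {
		if (strcmp(b[i].hid, hid) != 0)
			continue;	/* not the kind of bridge asked about */
		if ((b[i].supported & flags) != flags)
			return -1;	/* first refusal ends the walk */
	}
	return 0;
}

/* Thin wrappers, mirroring pci_osc_support_set()/pcie_osc_support_set(). */
static inline int toy_pci_osc_support_set(const struct toy_bridge *b,
					  size_t n, unsigned int flags)
{
	return toy_osc_support_set_hid(b, n, flags, TOY_PCI_ROOT_HID);
}

static inline int toy_pcie_osc_support_set(const struct toy_bridge *b,
					   size_t n, unsigned int flags)
{
	return toy_osc_support_set_hid(b, n, flags, TOY_PCIE_ROOT_HID);
}
```

On a mixed system where only the PCIe bridge accepts a PCIe-specific bit, the PCI wrapper fails while the PCIe wrapper succeeds, which is the situation the commit message describes.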
[PATCH 3/4] ACPI: Check for any matching CID when walking namespace.
From: Andrew Patterson <[EMAIL PROTECTED]>

The callback function acpi_ns_get_device_callback called from
acpi_get_devices() will check CID's if the HID does not match. This code
has a bug where it requires that all CIDs match the HID. Changed the
code so that any CID match will do.

Signed-off-by: Andrew Patterson <[EMAIL PROTECTED]>
---
 drivers/acpi/namespace/nsxfeval.c |   11 ++++++++---
 1 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/acpi/namespace/nsxfeval.c b/drivers/acpi/namespace/nsxfeval.c
index f39fbc6..e562b24 100644
--- a/drivers/acpi/namespace/nsxfeval.c
+++ b/drivers/acpi/namespace/nsxfeval.c
@@ -443,6 +443,7 @@ acpi_ns_get_device_callback(acpi_handle obj_handle,
 	struct acpica_device_id hid;
 	struct acpi_compatible_id_list *cid;
 	acpi_native_uint i;
+	int found;
 
 	status = acpi_ut_acquire_mutex(ACPI_MTX_NAMESPACE);
 	if (ACPI_FAILURE(status)) {
@@ -496,16 +497,20 @@ acpi_ns_get_device_callback(acpi_handle obj_handle,
 
 			/* Walk the CID list */
 
+			found = 0;
 			for (i = 0; i < cid->count; i++) {
 				if (ACPI_STRNCMP(cid->id[i].value, info->hid,
 						 sizeof(struct
-							acpi_compatible_id)) !=
+							acpi_compatible_id)) ==
 				    0) {
-					ACPI_FREE(cid);
-					return (AE_OK);
+					found = 1;
+					break;
 				}
 			}
 			ACPI_FREE(cid);
+			if (!found) {
+				return (AE_OK);
+			}
 		}
 	}
 
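The difference between the old and new matching rule is easy to miss in the diff, so here is a small user-space sketch of both. It assumes plain NUL-terminated ID strings and uses hypothetical helper names (`cid_all_match()`, `cid_any_match()`); the kernel code compares fixed-size `acpi_compatible_id` buffers with ACPI_STRNCMP instead.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Old rule: the device is kept only if every CID equals hid. */
static int cid_all_match(const char *const *cids, size_t count,
			 const char *hid)
{
	size_t i;

	assert(hid != NULL);
	for (i = 0; i < count; i++)
		if (strcmp(cids[i], hid) != 0)
			return 0;	/* one mismatch rejects the device */
	return 1;
}

/* New rule: the device is kept as soon as one CID equals hid. */
static int cid_any_match(const char *const *cids, size_t count,
			 const char *hid)
{
	size_t i;

	assert(hid != NULL);
	for (i = 0; i < count; i++)
		if (strcmp(cids[i], hid) == 0)
			return 1;	/* first match is enough */
	return 0;
}
```

A device that lists two compatible IDs, only one of which equals the requested ID, is rejected under the old rule and accepted under the new one.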
[PATCH 0/4] ACPI fixes for PCIe AER
The following patch series fixes some bugs in how Linux determines
whether PCIe Advanced Error Reporting (AER) is supported on a platform.
It is currently broken on at least HP IA-64 systems.

- PCI: Run ACPI _OSC method on root bridges only
- ACPI: Check for any matching CID when walking namespace.
- PCI ACPI: AER driver should only register PCIe devices with _OSC.
- PCI ACPI: Added a function to register _OSC with only PCIe devices.

These patches apply to gregkh's patch tree:

git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/patches

--
Andrew Patterson
[PATCH] x86_32: trim memory by updating e820 v3
When the MTRRs do not cover all of the RAM in the e820 table, the RAM
has to be trimmed, which means updating e820.

This reuses some code from x86_64. It needs to add early_get_cap() and
use it in early_cpu_detect(), and to move mtrr_bp_init() earlier.

Need Justine to test with his special system with the buggy BIOS.

Signed-off-by: Yinghai Lu <[EMAIL PROTECTED]>

Index: linux-2.6/arch/x86/kernel/cpu/common.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/common.c
+++ linux-2.6/arch/x86/kernel/cpu/common.c
@@ -278,6 +278,33 @@ void __init cpu_detect(struct cpuinfo_x8
 		c->x86_cache_alignment = ((misc >> 8) & 0xff) * 8;
 	}
 }
+static void __cpuinit early_get_cap(struct cpuinfo_x86 *c)
+{
+	u32 tfms, xlvl;
+	int ebx;
+
+	memset(&c->x86_capability, 0, sizeof c->x86_capability);
+	if (have_cpuid_p()) {
+		/* Intel-defined flags: level 0x00000001 */
+		if (c->cpuid_level >= 0x00000001) {
+			u32 capability, excap;
+			cpuid(0x00000001, &tfms, &ebx, &excap, &capability);
+			c->x86_capability[0] = capability;
+			c->x86_capability[4] = excap;
+		}
+
+		/* AMD-defined flags: level 0x80000001 */
+		xlvl = cpuid_eax(0x80000000);
+		if ((xlvl & 0xffff0000) == 0x80000000) {
+			if (xlvl >= 0x80000001) {
+				c->x86_capability[1] = cpuid_edx(0x80000001);
+				c->x86_capability[6] = cpuid_ecx(0x80000001);
+			}
+		}
+
+	}
+
+}
 
 /* Do minimum CPU detection early.
    Fields really needed: vendor, cpuid_level, family, model, mask, cache alignment.
@@ -306,6 +333,8 @@ static void __init early_cpu_detect(void early_init_intel(c); break; } + + early_get_cap(c); } static void __cpuinit generic_identify(struct cpuinfo_x86 * c) @@ -485,7 +514,6 @@ void __init identify_boot_cpu(void) identify_cpu(_cpu_data); sysenter_setup(); enable_sep_cpu(); - mtrr_bp_init(); } void __cpuinit identify_secondary_cpu(struct cpuinfo_x86 *c) Index: linux-2.6/arch/x86/kernel/setup_32.c === --- linux-2.6.orig/arch/x86/kernel/setup_32.c +++ linux-2.6/arch/x86/kernel/setup_32.c @@ -49,6 +49,7 @@ #include +#include #include #include #include @@ -762,6 +763,11 @@ void __init setup_arch(char **cmdline_p) max_low_pfn = setup_memory(); + /* update e820 for memory not covered by WB MTRRs */ + mtrr_bp_init(); + if (mtrr_trim_uncached_memory(max_pfn)) + max_low_pfn = setup_memory(); + #ifdef CONFIG_VMI /* * Must be after max_low_pfn is determined, and before kernel Index: linux-2.6/arch/x86/kernel/cpu/mtrr/main.c === --- linux-2.6.orig/arch/x86/kernel/cpu/mtrr/main.c +++ linux-2.6/arch/x86/kernel/cpu/mtrr/main.c @@ -624,7 +624,6 @@ static struct sysdev_driver mtrr_sysdev_ .resume = mtrr_restore, }; -#ifdef CONFIG_X86_64 static int disable_mtrr_trim; static int __init disable_mtrr_trim_setup(char *str) @@ -726,7 +725,6 @@ int __init mtrr_trim_uncached_memory(uns return 0; } -#endif /** * mtrr_bp_init - initialize mtrrs on the boot CPU Index: linux-2.6/Documentation/kernel-parameters.txt === --- linux-2.6.orig/Documentation/kernel-parameters.txt +++ linux-2.6/Documentation/kernel-parameters.txt @@ -575,7 +575,7 @@ and is between 256 and 4096 characters. See drivers/char/README.epca and Documentation/digiepca.txt. - disable_mtrr_trim [X86-64, Intel only] + disable_mtrr_trim [X86, Intel and AMD only] By default the kernel will trim any uncacheable memory out of your available memory pool based on MTRR settings. 
This parameter disables that behavior, Index: linux-2.6/arch/x86/kernel/e820_32.c === --- linux-2.6.orig/arch/x86/kernel/e820_32.c +++ linux-2.6/arch/x86/kernel/e820_32.c @@ -749,3 +749,14 @@ static int __init parse_memmap(char *arg return 0; } early_param("memmap", parse_memmap); +void __init update_e820(void) +{ + u8 nr_map; + + nr_map = e820.nr_map; + if (sanitize_e820_map(e820.map, _map)) + return; + e820.nr_map = nr_map; + printk(KERN_INFO "modified physical RAM map:\n"); + print_memory_map("modified"); +} Index: linux-2.6/include/asm-x86/e820_32.h === --- linux-2.6.orig/include/asm-x86/e820_32.h +++
AACRAID driver broken in 2.6.22.x (and beyond?) [WAS: Re: 2.6.22.16 MD raid1 doesn't mark removed disk faulty, MD thread goes UN]
On Jan 22, 2008 12:29 AM, Mike Snitzer <[EMAIL PROTECTED]> wrote:
> cc'ing Tanaka-san given his recent raid1 BUG report:
> http://lkml.org/lkml/2008/1/14/515
>
>
> On Jan 21, 2008 6:04 PM, Mike Snitzer <[EMAIL PROTECTED]> wrote:
> > Under 2.6.22.16, I physically pulled a SATA disk (/dev/sdac, connected to
> > an aacraid controller) that was acting as the local raid1 member of
> > /dev/md30.
> >
> > Linux MD didn't see an /dev/sdac1 error until I tried forcing the issue by
> > doing a read (with dd) from /dev/md30:

> The raid1d thread is locked at line 720 in raid1.c (raid1d+2437); aka
> freeze_array:
>
> (gdb) l *0x2539
> 0x2539 is in raid1d (drivers/md/raid1.c:720).
> 715              * wait until barrier+nr_pending match nr_queued+2
> 716              */
> 717             spin_lock_irq(&conf->resync_lock);
> 718             conf->barrier++;
> 719             conf->nr_waiting++;
> 720             wait_event_lock_irq(conf->wait_barrier,
> 721                                 conf->barrier+conf->nr_pending ==
> conf->nr_queued+2,
> 722                                 conf->resync_lock,
> 723                                 raid1_unplug(conf->mddev->queue));
> 724             spin_unlock_irq(&conf->resync_lock);
>
> Given Tanaka-san's report against 2.6.23 and me hitting what seems to
> be the same deadlock in 2.6.22.16; it stands to reason this affects
> raid1 in 2.6.24-rcX too.

Turns out that the aacraid driver in 2.6.22.x is HORRIBLY BROKEN (when
you pull a drive); it responds to MD's write requests with uptodate=1
(in raid1_end_write_request) for the drive that was pulled! I've not
looked to see if aacraid has been fixed in newer kernels... are others
aware of any crucial aacraid fixes in 2.6.23.x or 2.6.24?

After the drive was physically pulled, and small periodic writes
continued to the associated MD device, the raid1 MD driver did _NOT_
detect the pulled drive's writes as having failed (verified this with
systemtap). MD happily thought the write completed to both members (so
MD had no reason to mark the pulled drive "faulty"; or mark the raid
"degraded").

Installing an Adaptec-provided 1.1-5[2451] driver enabled raid1 to
work as expected.

That said, I now have a recipe for hitting the raid1 deadlock that
Tanaka first reported over a week ago. I'm still surprised that all
of this chatter about that BUG hasn't drawn interest/scrutiny from
others!?

regards,
Mike
[PATCH 2/3] x86: Unify fault_32|64.c with ifdefs
Elimination of these ifdefs can be done in a unified file. Signed-off-by: Harvey Harrison <[EMAIL PROTECTED]> --- arch/x86/mm/fault_32.c | 100 +-- arch/x86/mm/fault_64.c | 93 +++- 2 files changed, 177 insertions(+), 16 deletions(-) diff --git a/arch/x86/mm/fault_32.c b/arch/x86/mm/fault_32.c index 2d8a577..be0921c 100644 --- a/arch/x86/mm/fault_32.c +++ b/arch/x86/mm/fault_32.c @@ -48,7 +48,11 @@ static inline int notify_page_fault(struct pt_regs *regs) int ret = 0; /* kprobe_running() needs smp_processor_id() */ +#ifdef CONFIG_X86_32 if (!user_mode_vm(regs)) { +#else + if (!user_mode(regs)) { +#endif preempt_disable(); if (kprobe_running() && kprobe_fault_handler(regs, 14)) ret = 1; @@ -429,11 +433,15 @@ static noinline void pgtable_bad(unsigned long address, struct pt_regs *regs, #endif /* + * X86_32 * Handle a fault on the vmalloc or module mapping area * + * X86_64 + * Handle a fault on the vmalloc area + * * This assumes no large pages in there. */ -static inline int vmalloc_fault(unsigned long address) +static int vmalloc_fault(unsigned long address) { #ifdef CONFIG_X86_32 unsigned long pgd_paddr; @@ -508,6 +516,9 @@ int show_unhandled_signals = 1; * and the problem, and then passes it off to one of the appropriate * routines. */ +#ifdef CONFIG_X86_64 +asmlinkage +#endif void __kprobes do_page_fault(struct pt_regs *regs, unsigned long error_code) { struct task_struct *tsk; @@ -516,6 +527,9 @@ void __kprobes do_page_fault(struct pt_regs *regs, unsigned long error_code) unsigned long address; int write, si_code; int fault; +#ifdef CONFIG_X86_64 + unsigned long flags; +#endif /* * We can fault from pretty much anywhere, with unknown IRQ state. @@ -547,6 +561,7 @@ void __kprobes do_page_fault(struct pt_regs *regs, unsigned long error_code) * (error_code & 4) == 0, and that the fault was not a * protection error (error_code & 9) == 0. 
*/ +#ifdef CONFIG_X86_32 if (unlikely(address >= TASK_SIZE)) { if (!(error_code & (PF_RSVD|PF_USER|PF_PROT)) && vmalloc_fault(address) >= 0) @@ -569,7 +584,45 @@ void __kprobes do_page_fault(struct pt_regs *regs, unsigned long error_code) */ if (in_atomic() || !mm) goto bad_area_nosemaphore; +#else /* CONFIG_X86_64 */ + if (unlikely(address >= TASK_SIZE64)) { + /* +* Don't check for the module range here: its PML4 +* is always initialized because it's shared with the main +* kernel text. Only vmalloc may need PML4 syncups. +*/ + if (!(error_code & (PF_RSVD|PF_USER|PF_PROT)) && + ((address >= VMALLOC_START && address < VMALLOC_END))) { + if (vmalloc_fault(address) >= 0) + return; + } + /* +* Don't take the mm semaphore here. If we fixup a prefetch +* fault we could otherwise deadlock. +*/ + goto bad_area_nosemaphore; + } + if (likely(regs->flags & X86_EFLAGS_IF)) + local_irq_enable(); + + if (unlikely(error_code & PF_RSVD)) + pgtable_bad(address, regs, error_code); + + /* +* If we're in an interrupt, have no user context or are running in an +* atomic region then we must not take the fault. +*/ + if (unlikely(in_atomic() || !mm)) + goto bad_area_nosemaphore; + /* +* User-mode registers count as a user access even for any +* potential system fault or CPU buglet. +*/ + if (user_mode_vm(regs)) + error_code |= PF_USER; +again: +#endif /* When running in the kernel we expect faults to occur only to * addresses in user space. All other faults represent errors in the * kernel and should generate an OOPS. 
Unfortunately, in the case of an @@ -595,7 +648,11 @@ void __kprobes do_page_fault(struct pt_regs *regs, unsigned long error_code) vma = find_vma(mm, address); if (!vma) goto bad_area; +#ifdef CONFIG_X86_32 if (vma->vm_start <= address) +#else + if (likely(vma->vm_start <= address)) +#endif goto good_area; if (!(vma->vm_flags & VM_GROWSDOWN)) goto bad_area; @@ -633,7 +690,9 @@ good_area: goto bad_area; } - survive: +#ifdef CONFIG_X86_32 +survive: +#endif /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo @@ -704,6 +763,7 @@ bad_area_nosemaphore:
RE: [ofa-general] Re: InfiniBand/RDMA merge plans for 2.6.25
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] Behalf Of
> Roland Dreier
> Sent: Tuesday, January 22, 2008 3:56 PM
> To: Christoph Hellwig
> Cc: linux-kernel@vger.kernel.org; [EMAIL PROTECTED]
> Subject: [ofa-general] Re: InfiniBand/RDMA merge plans for 2.6.25
>
>  > >  - Neteffect "nes" driver.  It's not terribly clean code but since
>  > >    it's a new driver that is completely self-contained, I plan on
>  > >    merging it and letting cleanups happen upstream.
>  >
>  > New code should be better quality than old code, not worse.  I haven't
>  > actually seen the driver yet, but by that statement I'd be clearly
>  > against a merge.
>
> The driver has been posted a few times; the latest code is in the
> "neteffect" branch of my tree:
>
>     git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git neteffect
>
> It's not *that* bad -- certainly there are lots of things that could
> be improved (sparse endianness annotation, too many lines that are way
> too long, strange indentation of case labels, etc, etc) but it is a
> self-contained hardware driver.  I agree with Linus's position (stated
> at the last kernel summit) that we ought to merge hardware drivers
> early, so that users get the drivers with as little hassle as
> possible.  We lose a little leverage in getting cleanups done, but the
> number of people who see the code and are able to clean it up
> increases, so I think it's a good trade-off.
>
>  - R.
>

My view is the code should and will be cleaned up based upon the
feedback we've gotten from the community. It is a priority for me.
Several cleanup fixes are in the queue and are being worked. Haven't
slipped into complacency at the prospect of the merge.

Glenn
[EMAIL PROTECTED]
[PATCH 1/3] x86: Unify fault_32|64.c by ifdef'd function bodies
It's about time to get on with unifying these files, elimination of the ugly ifdefs can occur in the unified file. Signed-off-by: Harvey Harrison <[EMAIL PROTECTED]> --- OK, time to bite the bullet, it's ugly, but we can now do the cleanups once in the unified files. arch/x86/mm/fault_32.c | 116 + arch/x86/mm/fault_64.c | 148 +++- 2 files changed, 263 insertions(+), 1 deletions(-) diff --git a/arch/x86/mm/fault_32.c b/arch/x86/mm/fault_32.c index f85e7c9..2d8a577 100644 --- a/arch/x86/mm/fault_32.c +++ b/arch/x86/mm/fault_32.c @@ -172,8 +172,17 @@ static void force_sig_info_fault(int si_signo, int si_code, force_sig_info(si_signo, , tsk); } +#ifdef CONFIG_X86_64 +static int bad_address(void *p) +{ + unsigned long dummy; + return probe_kernel_address((unsigned long *)p, dummy); +} +#endif + void dump_pagetable(unsigned long address) { +#ifdef CONFIG_X86_32 __typeof__(pte_val(__pte(0))) page; page = read_cr3(); @@ -208,8 +217,42 @@ void dump_pagetable(unsigned long address) } printk("\n"); +#else /* CONFIG_X86_64 */ + pgd_t *pgd; + pud_t *pud; + pmd_t *pmd; + pte_t *pte; + + pgd = (pgd_t *)read_cr3(); + + pgd = __va((unsigned long)pgd & PHYSICAL_PAGE_MASK); + pgd += pgd_index(address); + if (bad_address(pgd)) goto bad; + printk("PGD %lx ", pgd_val(*pgd)); + if (!pgd_present(*pgd)) goto ret; + + pud = pud_offset(pgd, address); + if (bad_address(pud)) goto bad; + printk("PUD %lx ", pud_val(*pud)); + if (!pud_present(*pud)) goto ret; + + pmd = pmd_offset(pud, address); + if (bad_address(pmd)) goto bad; + printk("PMD %lx ", pmd_val(*pmd)); + if (!pmd_present(*pmd) || pmd_large(*pmd)) goto ret; + + pte = pte_offset_kernel(pmd, address); + if (bad_address(pte)) goto bad; + printk("PTE %lx", pte_val(*pte)); +ret: + printk("\n"); + return; +bad: + printk("BAD\n"); +#endif } +#ifdef CONFIG_X86_32 static inline pmd_t *vmalloc_sync_one(pgd_t *pgd, unsigned long address) { unsigned index = pgd_index(address); @@ -245,6 +288,7 @@ static inline pmd_t *vmalloc_sync_one(pgd_t *pgd, 
unsigned long address) BUG_ON(pmd_page(*pmd) != pmd_page(*pmd_k)); return pmd_k; } +#endif #ifdef CONFIG_X86_64 static const char errata93_warning[] = @@ -325,6 +369,7 @@ static int is_f00f_bug(struct pt_regs *regs, unsigned long address) static void show_fault_oops(struct pt_regs *regs, unsigned long error_code, unsigned long address) { +#ifdef CONFIG_X86_32 if (!oops_may_print()) return; @@ -349,7 +394,39 @@ static void show_fault_oops(struct pt_regs *regs, unsigned long error_code, printk(KERN_ALERT "IP:"); printk_address(regs->ip, 1); dump_pagetable(address); +#else /* CONFIG_X86_64 */ + printk(KERN_ALERT "BUG: unable to handle kernel "); + if (address < PAGE_SIZE) + printk(KERN_CONT "NULL pointer dereference"); + else + printk(KERN_CONT "paging request"); + printk(KERN_CONT " at %016lx\n", address); + + printk(KERN_ALERT "IP:"); + printk_address(regs->ip, 1); + dump_pagetable(address); +#endif +} + +#ifdef CONFIG_X86_64 +static noinline void pgtable_bad(unsigned long address, struct pt_regs *regs, +unsigned long error_code) +{ + unsigned long flags = oops_begin(); + struct task_struct *tsk; + + printk(KERN_ALERT "%s: Corrupted page table at address %lx\n", + current->comm, address); + dump_pagetable(address); + tsk = current; + tsk->thread.cr2 = address; + tsk->thread.trap_no = 14; + tsk->thread.error_code = error_code; + if (__die("Bad pagetable", regs, error_code)) + regs = NULL; + oops_end(flags, regs, SIGKILL); } +#endif /* * Handle a fault on the vmalloc or module mapping area @@ -705,6 +782,7 @@ do_sigbus: void vmalloc_sync_all(void) { +#ifdef CONFIG_X86_32 /* * Note that races in the updates of insync and start aren't * problematic: insync can only get set bits added, and updates to @@ -739,4 +817,42 @@ void vmalloc_sync_all(void) if (address == start && test_bit(pgd_index(address), insync)) start = address + PGDIR_SIZE; } +#else /* CONFIG_X86_64 */ + /* +* Note that races in the updates of insync and start aren't +* problematic: insync can only get set 
bits added, and updates to +* start are only improving performance (without affecting correctness +* if undone). +*/ + static DECLARE_BITMAP(insync, PTRS_PER_PGD); + static unsigned long start = VMALLOC_START & PGDIR_MASK; + unsigned long
Re: [patch] x86: test case for the RODATA config option
cool! could you perhaps also do an add-on:

> +	/* test 1: read the value */
> +	/* test 2: write to the variable; this should fault */
> +	/* test 3: check the value hasn't changed */

 test 4: make it writable again
 test 5: make it NX -> check that it's not executable

and perhaps also check that normal kernel allocations (kmalloc(), etc.)
are NX as well? (with the same section trick you use in this patch -
perhaps try to call a kmalloc()-ed buffer that contains a 'ret'
instruction - if that call faults then the test is OK, if the call
succeeds then the test failed.)

	Ingo
Re: [PATCHSET] printk: implement printk_header() and merging printk, take #2
Jan Engelhardt wrote:
> On Jan 23 2008 08:51, Tejun Heo wrote:
>> What do you think about the second suggestion then?
>>
>>  ata1.00: line0
>>  ata1.00  line1
>>  ata1.00  line2
>>
>> It allows you to grab for the header && has indication for message
>> boundaries.
>
> Then again, why not "[ata1.00] line0", then it matches what sd_mod does :)

Well, that's fine too but using ':' is much more common.  Just take a
look at the boot log and if we go with '[]', any ideas on how to
indicate multiline messages?

--
tejun
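To make the proposed layout concrete, here is a small user-space sketch of the "header plus ':' on the first line, bare header on continuation lines" format quoted above. It is only an illustration of the output format under discussion: `toy_format_with_header()` is a hypothetical helper, not part of the printk_header() patchset, and it writes into a caller-supplied buffer with a single space after the header on every line.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Prefix every line of msg with hdr. The first line gets "hdr: ",
 * continuation lines get "hdr " so that message boundaries stay visible.
 * Returns 0 on success, -1 if out is too small.
 */
static int toy_format_with_header(char *out, size_t outsz,
				  const char *hdr, const char *msg)
{
	size_t used = 0;
	int first = 1;

	assert(out != NULL && outsz > 0);
	out[0] = '\0';
	while (*msg) {
		size_t len = strcspn(msg, "\n");	/* length of this line */
		int n = snprintf(out + used, outsz - used, "%s%s %.*s\n",
				 hdr, first ? ":" : "", (int)len, msg);

		if (n < 0 || (size_t)n >= outsz - used)
			return -1;	/* output buffer too small */
		used += (size_t)n;
		msg += len;
		if (*msg == '\n')
			msg++;		/* skip the line separator */
		first = 0;
	}
	return 0;
}
```

Grepping the result for "ata1.00" finds every line of the message, while the ':' still marks where a new message starts.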
Re: [patch 07/10] unprivileged mounts: add sysctl tunable for "safe" property
Quoting Miklos Szeredi ([EMAIL PROTECTED]): > > > > What do you think about doing this only if FS_SAFE is also set, > > > > so for instance at first only FUSE would allow itself to be > > > > made user-mountable? > > > > > > > > A safe thing to do, or overly intrusive? > > > > > > It goes somewhat against the "no policy in kernel" policy ;). I think > > > the warning in the documentation should be enough to make sysadmins > > > think twice before doing anything foolish: > > > > Warning in which documentation? A sysadmin considering setting fs_safe > > for ext2 or xfs isn't going to be looking at fuse docs, which I think is > > what you're talking about. Are you going to add a file under > > Documentation/filesystems? > > Yes, I meant documentation of the new sysctl tunable in > Documentation/filesystems/proc.txt: Argh, sorry. > > Index: linux/Documentation/filesystems/proc.txt > > === > > --- linux.orig/Documentation/filesystems/proc.txt 2008-01-16 > > 13:25:07.0 +0100 > > +++ linux/Documentation/filesystems/proc.txt2008-01-16 > > 13:25:09.0 +0100 > > @@ -43,6 +43,7 @@ Table of Contents > >2.13 /proc//oom_score - Display current oom-killer score > >2.14 /proc//io - Display the IO accounting fields > >2.15 /proc//coredump_filter - Core dump filtering settings > > + 2.16 /proc/sys/fs/types - File system type specific parameters > > > > > > -- > > Preface > > @@ -2283,4 +2284,21 @@ For example: > >$ echo 0x7 > /proc/self/coredump_filter > >$ ./some_program > > > > +2.16 /proc/sys/fs/types/ - File system type specific parameters > > + > > + > > +There's a separate directory /proc/sys/fs/types// for each > > +filesystem type, containing the following files: > > + > > +usermount_safe > > +-- > > + > > +Setting this to non-zero will allow filesystems of this type to be > > +mounted by unprivileged users (note, that there are other > > +prerequisites as well). 
> > +
> > +Care should be taken when enabling this, since most
> > +filesystems haven't been designed with unprivileged mounting
> > +in mind.
> > +
> >
> > --
>
> Do you think this is enough?  Or do we need something more, to prevent
> sysadmin inadvertently setting this for an unsafe filesystem?

I would think something more would be good.  First explaining that fuse
should be safe modulo warnings in the fuse documentation, procfs and
sysfs may be safe, while other filesystems are not known safe at all.
Then explaining the dangers with not-known-safe filesystems and what is
needed to make them safe.  Clearly making sure input validation is
properly done so for instance getsb() doesn't turn into a buffer
overflow, etc.

Such a checklist also would be useful for holding a meaningful
discussion about the other filesystems and maybe turning some people
loose on an audit of other filesystems.

thanks,
-serge
Re: CPA boot crash (was: [PATCH] [0/36] Great change_page_attr patch series v3)
* Andi Kleen <[EMAIL PROTECTED]> wrote:

> > because it interferes/interacts with CPA and the page table code. So
>
> No that is not its main problem I believe. Main problem are all the
> driver and other subsystem interactions (it is a little bit similar to
> power management where you have lots of little bits all over right
> instead of a single big one). [...]

that is (yet another) major misconception on your part.

"Drivers" are an easy to blame target (i guess because there's no one
out there to defend a vague "drivers" accusation), and they are not the
problem here _at all_.

Drivers tell the architecture code which physical pages they'd like to
have access to (or which page range they'd like to see different cache
attributes on) and that's it. They are plain users of the ioremap() and
change_page_attr() APIs. Nothing more, nothing less.

It is the utmost duty of architecture code to make those APIs
fool-proof. Hardware _will_ mess up the physical parameters that get
passed in every possible way - and drivers just try to use what the
hardware tells them to use. So robustness is key and there's just no
"driver reason" why these APIs cannot be robust.

so you are delusional if you think that the c_p_a() problems are "driver
and other subsystem interactions".

And your analogy with power management could not be more mistaken. Power
management and suspend/resume in particular is so complex because it is
analogous to a _full bootup and shutdown cycle_, with the following,
hard to meet expectation from the user: 'this stuff must work all the
time, and must be instantaneous'. Suspend/resume is an _incredibly
complex_ machinery and the user does not realize (and does not accept
the consequences) of this complexity. It is a codepath that is affected
by tens and tens of thousands of lines of driver and core kernel code.
Just one single mistake and "resume does not work".

ioremap() and change_page_attr() on the other hand are a small, few
hundred lines codebase for a stable and well-defined purpose. There's no
significant "subsystem interactions" whatsoever.

by far the most intense and most high-frequency user of the
change_page_attr() code is CONFIG_DEBUG_PAGEALLOC=y. It does a cpa call
for every single page and slab allocation/freeing. But this debug
feature ... is not enabled on the 64-bit side - why? So unfortunately we
don't have any real robustness track record of the 64-bit side of the
CPA code, and that's exactly the code your clflush and gbpages code
changes.

oh, and due to that i'll probably revert these two patches of yours:

  Subject: x86: c_p_a(), change kernel_map_pages to not use c_p_a()
  Subject: x86: c_p_a(), change 32-bit back to init_mm semaphore locking

as with these changes you've removed _the_ most important stress-tester
for the c_p_a() code: DEBUG_PAGEALLOC.

	Ingo
Re: [PATCH] ppc: fix #ifdef-s in mediabay driver
On Wed, 2008-01-23 at 00:12 +0100, Bartlomiej Zolnierkiewicz wrote:
> * Replace incorrect CONFIG_BLK_DEV_IDE #ifdef in
>   check_media_bay() by CONFIG_MAC_FLOPPY one.
>
> * Replace incorrect CONFIG_BLK_DEV_IDE #ifdef-s by
>   CONFIG_BLK_DEV_IDE_PMAC ones.
>
> * check_media_bay() is used only by drivers/block/swim3.c
>   so make this function available only if CONFIG_MAC_FLOPPY
>   is defined.
>
> * check_media_bay_by_base() and media_bay_set_ide_infos()
>   are used only by drivers/ide/ppc/pmac.c so make these
>   functions available only if CONFIG_MAC_FLOPPY is defined.
>
> Signed-off-by: Bartlomiej Zolnierkiewicz <[EMAIL PROTECTED]>
> ---
> Ben, IMO this patch is safe for 2.6.24 (assuming that it builds fine :),
> otherwise I would like to ask for permission to merge it through IDE
> tree since I have other pending IDE patches depending on this one.

I'd rather avoid touching 2.6.24 unless it actually fixes a bug or
regression...

I'm tempted to actually remove all ifdef's ... if you have a media-bay,
then there is about a 99% chance it contains an IDE device, with the
remaining percent being split between a floppy and a battery. I doubt
anybody will care about building a kernel with mediabay support but
without support for these, just to save a handful of bytes in that
driver.

Ben.