Re: crash in kmem_cache_init

2008-01-22 Thread Olaf Hering
On Tue, Jan 22, Christoph Lameter wrote:

> > 0xc00fe018 is in setup_cpu_cache 
> > (/home/olaf/kernel/git/linux-2.6-numa/mm/slab.c:2111).
> > 2106            BUG_ON(!cachep->nodelists[node]);
> > 2107            kmem_list3_init(cachep->nodelists[node]);
> > 2108        }
> > 2109    }
> > 2110 }
> 
> if (cachep->nodelists[numa_node_id()])
>   return;

Does not help.


Linux version 2.6.24-rc8-ppc64 ([EMAIL PROTECTED]) (gcc version 4.1.2 20070115 
(prerelease) (SUSE Linux)) #48 SMP Wed Jan 23 08:54:23 CET 2008
[boot]0012 Setup Arch
EEH: PCI Enhanced I/O Error Handling Enabled
PPC64 nvram contains 8192 bytes
Zone PFN ranges:
  DMA 0 ->   892928
  Normal 892928 ->   892928
Movable zone start PFN for each node
early_node_map[1] active PFN ranges
1:0 ->   892928
Could not find start_pfn for node 0
[boot]0015 Setup Done
Built 2 zonelists in Node order, mobility grouping on.  Total pages: 880720
Policy zone: DMA
Kernel command line: debug xmon=on panic=1  
[boot]0020 XICS Init
xics: no ISA interrupt controller
[boot]0021 XICS Done
PID hash table entries: 4096 (order: 12, 32768 bytes)
time_init: decrementer frequency = 275.07 MHz
time_init: processor frequency   = 2197.80 MHz
clocksource: timebase mult[e8ab05] shift[22] registered
clockevent: decrementer mult[466a] shift[16] cpu[0]
Console: colour dummy device 80x25
console handover: boot [udbg-1] -> real [hvc0]
Dentry cache hash table entries: 524288 (order: 10, 4194304 bytes)
Inode-cache hash table entries: 262144 (order: 9, 2097152 bytes)
freeing bootmem node 1
Memory: 3496632k/3571712k available (6188k kernel code, 75080k reserved, 1324k 
data, 1220k bss, 304k init)
Kernel panic - not syncing: kmem_cache_create(): failed to create slab 
`size-32(DMA)'

Rebooting in 1 seconds..

---
 mm/slab.c |   17 ++---
 1 file changed, 14 insertions(+), 3 deletions(-)

--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1590,7 +1590,7 @@ void __init kmem_cache_init(void)
/* Replace the static kmem_list3 structures for the boot cpu */
init_list(&cache_cache, &initkmem_list3[CACHE_CACHE], node);
 
-   for_each_node_state(nid, N_NORMAL_MEMORY) {
+   for_each_online_node(nid) {
init_list(malloc_sizes[INDEX_AC].cs_cachep,
  &initkmem_list3[SIZE_AC + nid], nid);
 
@@ -1968,7 +1968,7 @@ static void __init set_up_list3s(struct 
 {
int node;
 
-   for_each_node_state(node, N_NORMAL_MEMORY) {
+   for_each_online_node(node) {
cachep->nodelists[node] = &initkmem_list3[index + node];
cachep->nodelists[node]->next_reap = jiffies +
REAPTIMEOUT_LIST3 +
@@ -2108,6 +2108,8 @@ static int __init_refok setup_cpu_cache(
}
}
}
+   if (!cachep->nodelists[numa_node_id()])
+   return -ENODEV;
cachep->nodelists[numa_node_id()]->next_reap =
jiffies + REAPTIMEOUT_LIST3 +
((unsigned long)cachep) % REAPTIMEOUT_LIST3;
@@ -2775,6 +2777,11 @@ static int cache_grow(struct kmem_cache 
/* Take the l3 list lock to change the colour_next on this node */
check_irq_off();
l3 = cachep->nodelists[nodeid];
+   if (!l3) {
+   nodeid = numa_node_id();
+   l3 = cachep->nodelists[nodeid];
+   }
+   BUG_ON(!l3);
spin_lock(&l3->list_lock);
 
/* Get colour for the slab, and cal the next value. */
@@ -3317,6 +3324,10 @@ static void *cache_alloc_node(struct
int x;
 
l3 = cachep->nodelists[nodeid];
+   if (!l3) {
+   nodeid = numa_node_id();
+   l3 = cachep->nodelists[nodeid];
+   }
BUG_ON(!l3);
 
 retry:
@@ -3815,7 +3826,7 @@ static int alloc_kmemlist(struct kmem_ca
struct array_cache *new_shared;
struct array_cache **new_alien = NULL;
 
-   for_each_node_state(node, N_NORMAL_MEMORY) {
+   for_each_online_node(node) {
 
 if (use_alien_caches) {
 new_alien = alloc_alien_cache(node, cachep->limit);
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
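
The cache_grow() and cache_alloc_node() hunks in the patch above share one pattern: when the requested node has no kmem_list3, fall back to the local node's list. A minimal userspace sketch of that lookup follows, assuming a simplified cache with a small per-node pointer array. MAX_NODES, struct node_list, struct cache_model and pick_node_list() are illustrative names, not kernel API.

```c
#include <stddef.h>

#define MAX_NODES 4  /* illustrative size, not the kernel's MAX_NUMNODES */

struct node_list {
    int free_objects;
};

struct cache_model {
    /* NULL when a node has no list, e.g. a node without memory */
    struct node_list *nodelists[MAX_NODES];
};

/*
 * Return the list for *nodeid, falling back to local_node when the
 * requested node has none. *nodeid is updated to the node actually used.
 * Returns NULL if neither node has a list; the patch hits BUG_ON() there.
 */
static struct node_list *pick_node_list(struct cache_model *c, int *nodeid,
                                        int local_node)
{
    struct node_list *l3 = c->nodelists[*nodeid];

    if (!l3) {
        *nodeid = local_node;
        l3 = c->nodelists[*nodeid];
    }
    return l3;
}
```

With the layout from the boot log (node 0 has no memory, node 1 does), a request for node 0 issued from node 1 would be served from node 1's list.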


Re: 24rc8: unregister_netdevice: waiting for ... to become free. Usage count = 1?

2008-01-22 Thread Soeren Sonnenburg
On Tue, 2008-01-22 at 22:44 -0800, David Miller wrote:
> From: Soeren Sonnenburg <[EMAIL PROTECTED]>
> Date: Wed, 23 Jan 2008 07:42:21 +0100
> 
> > Dear all,
> > 
> > since some 2.6.24rc version I suddenly experience such messages on
> > console when trying to shut down a vpn connection:
> > 
> > unregister_netdevice: waiting for tun0 to become free. Usage count = 1
> > 
> > or when removing a USB wlan dongle (although it was ifconfig wlan0
> > down'd before)
> 
> Current GIT already has a fix for this, attached below:

Thank you very much for pointing this out!

git pull ; make ; ...
Soeren


Re: Could you please merge the x86_64 EFI boot support patchset?

2008-01-22 Thread Paul Jackson
Huang wrote:
> This patchset has been merged into Linux 2.6.24.

Excellent.

> Unfortunately, the new EFI support patches do not use EFI memory map for
> system boot up ...  So, I think the resolution for your problem is the
> "struct setup_data" mechanism proposed by H. Peter Anvin.

So you're saying that the EFI in the kernel now still won't support more
than 128 or so chunks of memory in the boottime memory map, because it
still goes via the legacy E820h memory map code?

I'll have to study the code more and give it a try.

Are you optimistic that some variation of H. Peter Anvin's "struct
setup_data" mechanism will make it into 2.6.25 or thereabouts?

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.940.382.4214


Re: 2.6.24-rc8-mm1 : net tcp_input.c warnings

2008-01-22 Thread Dave Young
On Jan 23, 2008 3:41 PM, Ilpo Järvinen <[EMAIL PROTECTED]> wrote:
>
> On Tue, 22 Jan 2008, David Miller wrote:
>
> > From: "Dave Young" <[EMAIL PROTECTED]>
> > Date: Wed, 23 Jan 2008 09:44:30 +0800
> >
> > > On Jan 22, 2008 6:47 PM, Ilpo Järvinen <[EMAIL PROTECTED]> wrote:
> > > > [PATCH] [TCP]: debug S+L
> > >
> > > Thanks, If there's new findings I will let you know.
> >
> > Thanks for helping with this bug Dave.
>
> I noticed btw that this thing might (is likely to) spuriously trigger at
> WARN_ON(sacked != tp->sacked_out); because those won't be equal when SACK
> is not enabled. If that happens too often, I'll send a fixed patch for
> it; still, the fact that I print tp->rx_opt.sack_ok already allows
> identification of those cases, as it's zero when SACK is not
> enabled.
>
> Just ask if you need the updated debug patch.

Thanks,  please send, I would like to get it.

>
> --
>  i.


Re: 2.6.24-rc8-mm1 : net tcp_input.c warnings

2008-01-22 Thread Ilpo Järvinen
On Tue, 22 Jan 2008, David Miller wrote:

> From: "Dave Young" <[EMAIL PROTECTED]>
> Date: Wed, 23 Jan 2008 09:44:30 +0800
> 
> > On Jan 22, 2008 6:47 PM, Ilpo Järvinen <[EMAIL PROTECTED]> wrote:
> > > [PATCH] [TCP]: debug S+L
> > 
> > Thanks, If there's new findings I will let you know.
> 
> Thanks for helping with this bug Dave.

I noticed btw that this thing might (is likely to) spuriously trigger at
WARN_ON(sacked != tp->sacked_out); because those won't be equal when SACK
is not enabled. If that happens too often, I'll send a fixed patch for
it; still, the fact that I print tp->rx_opt.sack_ok already allows
identification of those cases, as it's zero when SACK is not
enabled.

Just ask if you need the updated debug patch.

-- 
 i.
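
The spurious trigger described above could be avoided by making the comparison only when SACK is in use. The following is a hedged userspace sketch of such a guard, not the updated debug patch itself (which is not shown in this thread). struct tcp_model and sacked_mismatch() are hypothetical names; only sack_ok and sacked_out mirror fields mentioned in the message.

```c
#include <stdbool.h>

struct tcp_model {
    int sack_ok;     /* zero when SACK is not enabled */
    int sacked_out;  /* the stack's own counter */
};

/*
 * Return true when the debug check should warn: the recounted value
 * differs from the stored counter AND SACK is enabled. Without SACK
 * the two values are expected to differ, so no warning is raised.
 */
static bool sacked_mismatch(const struct tcp_model *tp, int sacked)
{
    if (!tp->sack_ok)
        return false;
    return sacked != tp->sacked_out;
}
```

A caller would then warn only when sacked_mismatch() returns true, instead of comparing the two counters unconditionally.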

Re: [PATCH] sound: fix opti9xx/miro section mismatch

2008-01-22 Thread Sam Ravnborg
On Tue, Jan 22, 2008 at 09:39:47PM -0800, Randy Dunlap wrote:
> From: Randy Dunlap <[EMAIL PROTECTED]>
> 
> snd_opti93x_mixer() is only called by __devinit snd_opti93x_probe(),
> so the former can also be __devinit.
> 
> snd_miro_mixer() is only called by __devinit snd_miro_probe(),
> so the former can also be __devinit.
> 
> sound/isa/opti9xx/opti92x-ad1848.c:
> WARNING: vmlinux.o(.text+0xf91cd7): Section mismatch: reference to 
> .init.data:snd_opti93x_controls (between 'snd_opti93x_mixer' and 
> 'snd_card_opti9xx_free')
> WARNING: vmlinux.o(.text+0xf91d66): Section mismatch: reference to 
> .init.data:snd_miro_controls (between 'snd_opti93x_mixer' and 
> 'snd_card_opti9xx_free')
> 
> opti9xx/miro.c:
> WARNING: vmlinux.o(.text+0xf926c2): Section mismatch: reference to 
> .init.data:snd_miro_controls (between 'snd_miro_mixer' and 
> 'snd_legacy_find_free_ioport')
> WARNING: vmlinux.o(.text+0xf926e5): Section mismatch: reference to 
> .init.data:snd_miro_eq_controls (between 'snd_miro_mixer' and 
> 'snd_legacy_find_free_ioport')
> WARNING: vmlinux.o(.text+0xf926f9): Section mismatch: reference to 
> .init.data:snd_miro_line_control (between 'snd_miro_mixer' and 
> 'snd_legacy_find_free_ioport')
> WARNING: vmlinux.o(.text+0xf92716): Section mismatch: reference to 
> .init.data:snd_miro_amp_control (between 'snd_miro_mixer' and 
> 'snd_legacy_find_free_ioport')
> WARNING: vmlinux.o(.text+0xf9273e): Section mismatch: reference to 
> .init.data:snd_miro_preamp_control (between 'snd_miro_mixer' and 
> 'snd_legacy_find_free_ioport')
> WARNING: vmlinux.o(.text+0xf92764): Section mismatch: reference to 
> .init.data:snd_miro_capture_control (between 'snd_miro_mixer' and 
> 'snd_legacy_find_free_ioport')
> WARNING: vmlinux.o(.text+0xf92783): Section mismatch: reference to 
> .init.data:snd_miro_radio_control (between 'snd_miro_mixer' and 
> 'snd_legacy_find_free_ioport')
> WARNING: vmlinux.o(.text+0xf9279a): Section mismatch: reference to 
> .init.data:snd_miro_eq_controls (between 'snd_miro_mixer' and 
> 'snd_legacy_find_free_ioport')
> WARNING: vmlinux.o(.text+0xf927b9): Section mismatch: reference to 
> .init.data:snd_miro_radio_control (between 'snd_miro_mixer' and 
> 'snd_legacy_find_free_ioport')
> 
> Signed-off-by: Randy Dunlap <[EMAIL PROTECTED]>
Acked-by: Sam Ravnborg <[EMAIL PROTECTED]>

> ---
>  sound/isa/opti9xx/miro.c   |2 +-
>  sound/isa/opti9xx/opti92x-ad1848.c |2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> --- linux-2.6.24-rc8-git5.orig/sound/isa/opti9xx/miro.c
> +++ linux-2.6.24-rc8-git5/sound/isa/opti9xx/miro.c
> @@ -662,7 +662,7 @@ static int __devinit snd_set_aci_init_va
>   return 0;
>  }
>  
> -static int snd_miro_mixer(struct snd_miro *miro)
> +static int __devinit snd_miro_mixer(struct snd_miro *miro)
>  {
>   struct snd_card *card;
>   unsigned int idx;
> --- linux-2.6.24-rc8-git5.orig/sound/isa/opti9xx/opti92x-ad1848.c
> +++ linux-2.6.24-rc8-git5/sound/isa/opti9xx/opti92x-ad1848.c
> @@ -1595,7 +1595,7 @@ OPTi93X_DOUBLE("Capture Volume", 0, OPTi
>  }
>  };
>  
> -static int snd_opti93x_mixer(struct snd_opti93x *chip)
> +static int __devinit snd_opti93x_mixer(struct snd_opti93x *chip)
>  {
>   struct snd_card *card;
>   struct snd_kcontrol_new knew;


Re: Could you please merge the x86_64 EFI boot support patchset?

2008-01-22 Thread Huang, Ying
On Wed, 2008-01-23 at 01:00 -0600, Paul Jackson wrote:
> In Nov 2007, Huang Ying wrote:
> > Could you please merge the following patchset:
> > 
> > [PATCH 0/2 -v3] x86_64 EFI boot support
> > [PATCH 1/2 -v3] x86_64 EFI boot support: EFI frame buffer driver
> > [PATCH 2/2 -v3] x86_64 EFI boot support: EFI boot document
> 
> Huang - what has become of this patchset?

This patchset has been merged into Linux 2.6.24.

> We (SGI) have designs on a big honkin NUMA box using x86_64 arch, and
> we are running up against the arbitrary limits on the memory map size
> due to the ancient 4k zero page size limit (hence H. Peter Anvin added
> to the CC list, since he knows a bazillion times more about any such
> limits than I do.).  The limits on 128 or so local chunks of memory
> imposed by the kernel code that invokes Int 15 E820h are too small for
> us.
> 
> Since we are already accustomed to dealing with EFI on our IA64
> Itanium boxes, I'm figuring that it will be easier in the short run,
> and better in the long run, to just use EFI on these upcoming big
> x86_64 NUMA boxes.

Unfortunately, the new EFI support patches do not use the EFI memory map
for system boot (only for runtime service support). The EFI memory map is
converted into an E820 memory map in the bootloader. The main reason for
this change is to remove the duplication between the E820 and EFI memory
map handling code.

So I think the solution to your problem is the "struct setup_data"
mechanism proposed by H. Peter Anvin: a linked-list data structure for
boot parameters with no size limit. I once wrote a patch for it, but there
are some issues with the implementation scheme. Most people think it
should be based on the "early reservation/early allocation" mechanism from
Andi Kleen, so I am waiting for that to be merged into -mm or git-x86.

Best Regards,
Huang Ying
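
The contrast drawn above, a fixed-size table versus a linked list of boot parameters, can be sketched in userspace C. The layout below is an assumption made purely for illustration: struct boot_node, boot_list_append() and boot_list_count() are hypothetical names, and this is not the structure from H. Peter Anvin's proposal.

```c
#include <stddef.h>
#include <stdint.h>

/* One boot parameter record; records are chained, so there is no
 * fixed upper bound on how many can be passed. */
struct boot_node {
    struct boot_node *next;
    uint32_t type;     /* what kind of payload this record carries */
    uint32_t len;      /* payload length in bytes */
    const void *data;  /* payload, e.g. one memory map chunk */
};

/* Append node at the tail of the list rooted at *head. */
static void boot_list_append(struct boot_node **head, struct boot_node *node)
{
    while (*head)
        head = &(*head)->next;
    node->next = NULL;
    *head = node;
}

/* Count the records of the given type. */
static size_t boot_list_count(const struct boot_node *head, uint32_t type)
{
    size_t n = 0;

    for (; head; head = head->next)
        if (head->type == type)
            n++;
    return n;
}
```

Because each record points to the next, the producer can hand over more than 128 memory chunks without enlarging any fixed table; the cost is that the consumer must walk the list.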



Re: [PATCH] sound: fix cs5535 section mismatch

2008-01-22 Thread Sam Ravnborg
On Tue, Jan 22, 2008 at 09:39:43PM -0800, Randy Dunlap wrote:
> From: Randy Dunlap <[EMAIL PROTECTED]>
> 
> snd_cs5535audio_mixer() is only called by __devinit snd_cs5535audio_probe(),
> so the mixer function can also be __devinit.
> 
> WARNING: vmlinux.o(.text+0xfdbba0): Section mismatch: reference to 
> .init.data:ac97_quirks (between 'snd_cs5535audio_mixer' and 'process_bm0_irq')
> 
> Signed-off-by: Randy Dunlap <[EMAIL PROTECTED]>
Acked-by: Sam Ravnborg <[EMAIL PROTECTED]>

> ---
>  sound/pci/cs5535audio/cs5535audio.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> --- linux-2.6.24-rc8-git5.orig/sound/pci/cs5535audio/cs5535audio.c
> +++ linux-2.6.24-rc8-git5/sound/pci/cs5535audio/cs5535audio.c
> @@ -145,7 +145,7 @@ static unsigned short snd_cs5535audio_ac
>   return snd_cs5535audio_codec_read(cs5535au, reg);
>  }
>  
> -static int snd_cs5535audio_mixer(struct cs5535audio *cs5535au)
> +static int __devinit snd_cs5535audio_mixer(struct cs5535audio *cs5535au)
>  {
>   struct snd_card *card = cs5535au->card;
>   struct snd_ac97_bus *pbus;


Re: [PATCH] radio: fix sf16fmi section mismatch

2008-01-22 Thread Sam Ravnborg
On Tue, Jan 22, 2008 at 09:39:39PM -0800, Randy Dunlap wrote:
> From: Randy Dunlap <[EMAIL PROTECTED]>
> 
> isapnp_fmi_probe() is only called by fmi_init(), which is __init,
> so isapnp_fmi_probe() can also be __init.
> 
> media/radio/radio-sf16fmi.c:
> WARNING: vmlinux.o(.text+0x994e19): Section mismatch: reference to 
> .init.data: (between 'isapnp_fmi_probe' and 'vidioc_s_tuner')
> WARNING: vmlinux.o(.text+0x994e22): Section mismatch: reference to 
> .init.data: (between 'isapnp_fmi_probe' and 'vidioc_s_tuner')
> WARNING: vmlinux.o(.text+0x994e3a): Section mismatch: reference to 
> .init.data:id_table (between 'isapnp_fmi_probe' and 'vidioc_s_tuner')
> 
> Signed-off-by: Randy Dunlap <[EMAIL PROTECTED]>
Acked-by: Sam Ravnborg <[EMAIL PROTECTED]>

> ---
>  drivers/media/radio/radio-sf16fmi.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> --- linux-2.6.24-rc8-git5.orig/drivers/media/radio/radio-sf16fmi.c
> +++ linux-2.6.24-rc8-git5/drivers/media/radio/radio-sf16fmi.c
> @@ -321,7 +321,7 @@ static struct isapnp_device_id id_table[
>  
>  MODULE_DEVICE_TABLE(isapnp, id_table);
>  
> -static int isapnp_fmi_probe(void)
> +static int __init isapnp_fmi_probe(void)
>  {
>   int i = 0;
>  


Re: Could you please merge the x86_64 EFI boot support patchset?

2008-01-22 Thread Paul Jackson
In Nov 2007, Huang Ying wrote:
> Could you please merge the following patchset:
> 
> [PATCH 0/2 -v3] x86_64 EFI boot support
> [PATCH 1/2 -v3] x86_64 EFI boot support: EFI frame buffer driver
> [PATCH 2/2 -v3] x86_64 EFI boot support: EFI boot document

Huang - what has become of this patchset?

We (SGI) have designs on a big honkin NUMA box using x86_64 arch, and
we are running up against the arbitrary limits on the memory map size
due to the ancient 4k zero page size limit (hence H. Peter Anvin added
to the CC list, since he knows a bazillion times more about any such
limits than I do.).  The limits on 128 or so local chunks of memory
imposed by the kernel code that invokes Int 15 E820h are too small for
us.

Since we are already accustomed to dealing with EFI on our IA64
Itanium boxes, I'm figuring that it will be easier in the short run,
and better in the long run, to just use EFI on these upcoming big
x86_64 NUMA boxes.

But we'd need to get this patch, or something equivalent, into the
Linux kernel in the next 2.6.* cycle or so, for this to work for us.

Also, is there a current version of this patch set available, against
either 2.6.24-rc8-mm1 or Ingo's recent x86 git tree?  I should try it
out.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.940.382.4214


Re: 24rc8: unregister_netdevice: waiting for ... to become free. Usage count = 1?

2008-01-22 Thread David Miller
From: Soeren Sonnenburg <[EMAIL PROTECTED]>
Date: Wed, 23 Jan 2008 07:42:21 +0100

> Dear all,
> 
> since some 2.6.24rc version I suddenly experience such messages on
> console when trying to shut down a vpn connection:
> 
> unregister_netdevice: waiting for tun0 to become free. Usage count = 1
> 
> or when removing a USB wlan dongle (although it was ifconfig wlan0
> down'd before)

Current GIT already has a fix for this, attached below:

[NEIGH]: Revert 'Fix race between neigh_parms_release and neightbl_fill_parms'

Commit 9cd40029423701c376391da59d2c6469672b4bed (Fix race between
neigh_parms_release and neightbl_fill_parms) introduced device
reference counting regressions for several people, see:

http://bugzilla.kernel.org/show_bug.cgi?id=9778

for example.

Signed-off-by: David S. Miller <[EMAIL PROTECTED]>

diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index cc8a2f1..29b8ee4 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -1316,6 +1316,8 @@ void neigh_parms_release(struct neigh_table *tbl, struct 
neigh_parms *parms)
*p = parms->next;
parms->dead = 1;
write_unlock_bh(&tbl->lock);
+   if (parms->dev)
+   dev_put(parms->dev);
call_rcu(&parms->rcu_head, neigh_rcu_free_parms);
return;
}
@@ -1326,8 +1328,6 @@ void neigh_parms_release(struct neigh_table *tbl, struct 
neigh_parms *parms)
 
 void neigh_parms_destroy(struct neigh_parms *parms)
 {
-   if (parms->dev)
-   dev_put(parms->dev);
kfree(parms);
 }
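
The revert above moves dev_put() from neigh_parms_destroy() back to neigh_parms_release(). One plausible reading of why that matters can be modelled in userspace: if the device reference is dropped only when the parms object is destroyed, any other holder of the parms keeps the device pinned. This is a simplified sketch that ignores RCU and locking. All names are hypothetical, and the put_dev_on_destroy flag exists only to compare the two orderings.

```c
#include <stdlib.h>

struct dev_model {
    int refcnt;
};

struct parms_model {
    struct dev_model *dev;
    int refcnt;
    int dead;
};

/* Allocate parms with one reference; it takes a reference on dev. */
static struct parms_model *parms_alloc(struct dev_model *dev)
{
    struct parms_model *p = malloc(sizeof(*p));

    if (!p)
        return NULL;
    p->dev = dev;
    p->refcnt = 1;
    p->dead = 0;
    dev->refcnt++;
    return p;
}

/* Drop one parms reference and destroy the object on the last one.
 * With put_dev_on_destroy set, the device reference is dropped only here. */
static void parms_put(struct parms_model *p, int put_dev_on_destroy)
{
    if (--p->refcnt == 0) {
        if (put_dev_on_destroy && p->dev)
            p->dev->refcnt--;
        free(p);
    }
}

/* Release parms. With put_dev_on_destroy clear (the ordering restored by
 * the revert), the device reference is dropped immediately. */
static void parms_release(struct parms_model *p, int put_dev_on_destroy)
{
    p->dead = 1;
    if (!put_dev_on_destroy && p->dev)
        p->dev->refcnt--;
    parms_put(p, put_dev_on_destroy);
}
```

In this model, releasing parms while another user still holds a parms reference leaves the device count at 1 with the deferred ordering, and at 0 with the ordering the revert restores.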
 


24rc8: unregister_netdevice: waiting for ... to become free. Usage count = 1?

2008-01-22 Thread Soeren Sonnenburg
Dear all,

since some 2.6.24rc version I suddenly experience such messages on
console when trying to shut down a vpn connection:

unregister_netdevice: waiting for tun0 to become free. Usage count = 1

or when removing a USB wlan dongle (although it was ifconfig wlan0
down'd before)

unregister_netdevice: waiting for wlan0 to become free. Usage count = 1

These messages only disappear once all potential connections going over
that iface are gone (sometimes this does not happen and the kernel
then hangs on reboot...)

Is this intended?

Soeren


[RFC] some page can't be migrated

2008-01-22 Thread Shaohua Li
An anonymous page might still have fs-private metadata after the page has
been truncated. As the page has no mapping, page migration refuses to
migrate it. It appears the page is only freed in page reclaim, and if the
zone watermark is low, the page is never freed; as a result, migration
always fails. I thought we could free the metadata so that such a page can
be freed during migration, making migration more reliable?

Thanks,
Shaohua

diff --git a/mm/migrate.c b/mm/migrate.c
index 6a207e8..6bc38f7 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -633,6 +633,17 @@ static int unmap_and_move(new_page_t get_new_page, 
unsigned long private,
goto unlock;
wait_on_page_writeback(page);
}
+
+   /*
+* See truncate_complete_page(). Anonymous page might have
+* fs-private metadata, the page is truncated. Such page can't be
+* migrated. Try to free metadata, so the page can be freed.
+*/
+   if (!page->mapping && !PageAnon(page) && PagePrivate(page)) {
+   try_to_release_page(page, GFP_KERNEL);
+   goto unlock;
+   }
+
/*
 * By try_to_unmap(), page->mapcount goes down to 0 here. In this case,
 * we cannot notice that anon_vma is freed while we migrates a page.




Re: do_remount_sb(RDONLY) race? (was: XFS oops under 2.6.23.9)

2008-01-22 Thread David Chinner
On Wed, Jan 23, 2008 at 04:24:33PM +1030, Jonathan Woithe wrote:
> > On Wed, Jan 23, 2008 at 03:00:48PM +1030, Jonathan Woithe wrote:
> > > Last night my laptop suffered an oops during closedown.  The full oops
> > > reports can be downloaded from
> > > 
> > >   http://www.atrad.com.au/~jwoithe/xfs_oops/
> > 
> > Assertion failed: atomic_read(&mp->m_active_trans) == 0, file:
> > fs/xfs/xfs_vfsops.c, line 689.
> > 
> > The remount read-only of the root drive supposedly completed
> > while there was still active modification of the filesystem
> > taking place.
.
> > The read only flag only gets set *after* we've made the filesystem
> > readonly, which means before we are truly read only, we can race
> > with other threads opening files read/write, or filesystem
> > modifications can take place.
> > 
> > The result of that race (if it is really unsafe) will be the assert you
> > see. The patch I wrote a couple of months ago to fix the problem
> > is attached below
> 
> Thanks for the patch.  I will apply it and see what happens.
> 
> Will this be in 2.6.24?

No - because hitting the problem is so rare that I'm not even
sure it's a problem. One of the VFS gurus will need to comment
on whether this really is a problem, and if so the correct fix
is to do_remount_sb() so that it closes the hole for everyone.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: 2.6.24-rc8-mm1: sparc64 warning at fs/file_table.c:49 __fput+0x1a8/0x1e0()

2008-01-22 Thread Christoph Hellwig
On Tue, Jan 22, 2008 at 03:13:58PM -0800, Dave Hansen wrote:
> The emergency remount code forcibly removes FMODE_WRITE from
> filps.  The r/o bind mount code notices that this was done
> without a proper mnt_drop_write() and properly gives a
> warning.
> 
> This patch does a mnt_drop_write() and also notes in the
> filp that this was done to suppress any warning that would
> have otherwise been triggered.
> 
> I also wonder if inode->i_writecount is made inconsistent
> by the emergency remount code.  I guess it is, but the
> damage is limited to a single inode instead of being
> visible more globally like the mnt write count.  Probably
> not really worth fixing.

The right fix is to not simply remove FMODE_WRITE, but just remove
this whole function.  Until we have a proper revoke it will cause more
harm than good.


Re: do_remount_sb(RDONLY) race? (was: XFS oops under 2.6.23.9)

2008-01-22 Thread Jonathan Woithe
Hi Dave

> On Wed, Jan 23, 2008 at 03:00:48PM +1030, Jonathan Woithe wrote:
> > Last night my laptop suffered an oops during closedown.  The full oops
> > reports can be downloaded from
> > 
> >   http://www.atrad.com.au/~jwoithe/xfs_oops/
> 
> Assertion failed: atomic_read(&mp->m_active_trans) == 0, file:
> fs/xfs/xfs_vfsops.c, line 689.
> 
> The remount read-only of the root drive supposedly completed
> while there was still active modification of the filesystem
> taking place.
> 
> > Kernel version was kernel.org 2.6.23.9 compiled as a low latency desktop. 
> 
> The patch in 2.6.23 that introduced this check was:
> 
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=516b2e7c2661615ba5d5ad9fb584f068363502d3
> 
> Basically, the remount-readonly path was not flushing things
> properly, so we changed it to flush things properly and to ensure we
> got bug reports if it wasn't. Yours is the second report of not
> shutting down correctly since this change went in (we've seen it
> once in ~8 months in a QA environment).
> 
> I've had suspicions of a race in the remount-ro code in
> do_remount_sb() w.r.t. the fs_may_remount_ro() check.  That is, we
> do an unlocked check to see if we can remount readonly and then fail
> to check again once we've locked the superblock out and start the
> remount.
> 
> The read only flag only gets set *after* we've made the filesystem
> readonly, which means before we are truly read only, we can race
> with other threads opening files read/write, or filesystem
> modifications can take place.
> 
> The result of that race (if it is really unsafe) will be the assert you
> see. The patch I wrote a couple of months ago to fix the problem
> is attached below

Thanks for the patch.  I will apply it and see what happens.

Will this be in 2.6.24?

Regards
  jonathan
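
The suspected race quoted above is a check made without the lock and never repeated once the lock is held. The sketch below models the alternative, where the writer check and the read-only transition happen under the same lock. It is not the patch referred to in the thread (which is not included here). struct sb_model and its helpers are hypothetical, and a C11 atomic_flag spinlock stands in for the superblock lock.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

struct sb_model {
    atomic_flag lock;  /* stands in for the superblock lock */
    int writers;       /* files currently open for writing */
    bool readonly;
};

static void sb_lock(struct sb_model *sb)
{
    while (atomic_flag_test_and_set(&sb->lock)) {
        /* spin until the lock is free */
    }
}

static void sb_unlock(struct sb_model *sb)
{
    atomic_flag_clear(&sb->lock);
}

/* Register a writer unless the filesystem is already read-only. */
static int open_for_write(struct sb_model *sb)
{
    int ret = 0;

    sb_lock(sb);
    if (sb->readonly)
        ret = -EROFS;
    else
        sb->writers++;
    sb_unlock(sb);
    return ret;
}

static void close_writer(struct sb_model *sb)
{
    sb_lock(sb);
    sb->writers--;
    sb_unlock(sb);
}

/* Check for writers and flip to read-only under one lock hold, so no
 * writer can slip in between the check and the transition. */
static int remount_readonly(struct sb_model *sb)
{
    int ret = 0;

    sb_lock(sb);
    if (sb->writers > 0)
        ret = -EBUSY;
    else
        sb->readonly = true;
    sb_unlock(sb);
    return ret;
}
```

In this model a remount attempted while a writer is registered fails with -EBUSY, and once the remount succeeds, later write opens fail with -EROFS.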


Re: InfiniBand/RDMA merge plans for 2.6.25

2008-01-22 Thread Christoph Hellwig
On Tue, Jan 22, 2008 at 01:56:00PM -0800, Roland Dreier wrote:
> be improved (sparse endianness annotation,

that's a blocker for sure.  No new code that's not sparse clean, please.



[PATCH] sound: fix opti9xx/miro section mismatch

2008-01-22 Thread Randy Dunlap
From: Randy Dunlap <[EMAIL PROTECTED]>

snd_opti93x_mixer() is only called by __devinit snd_opti93x_probe(),
so the former can also be __devinit.

snd_miro_mixer() is only called by __devinit snd_miro_probe(),
so the former can also be __devinit.

sound/isa/opti9xx/opti92x-ad1848.c:
WARNING: vmlinux.o(.text+0xf91cd7): Section mismatch: reference to 
.init.data:snd_opti93x_controls (between 'snd_opti93x_mixer' and 
'snd_card_opti9xx_free')
WARNING: vmlinux.o(.text+0xf91d66): Section mismatch: reference to 
.init.data:snd_miro_controls (between 'snd_opti93x_mixer' and 
'snd_card_opti9xx_free')

opti9xx/miro.c:
WARNING: vmlinux.o(.text+0xf926c2): Section mismatch: reference to 
.init.data:snd_miro_controls (between 'snd_miro_mixer' and 
'snd_legacy_find_free_ioport')
WARNING: vmlinux.o(.text+0xf926e5): Section mismatch: reference to 
.init.data:snd_miro_eq_controls (between 'snd_miro_mixer' and 
'snd_legacy_find_free_ioport')
WARNING: vmlinux.o(.text+0xf926f9): Section mismatch: reference to 
.init.data:snd_miro_line_control (between 'snd_miro_mixer' and 
'snd_legacy_find_free_ioport')
WARNING: vmlinux.o(.text+0xf92716): Section mismatch: reference to 
.init.data:snd_miro_amp_control (between 'snd_miro_mixer' and 
'snd_legacy_find_free_ioport')
WARNING: vmlinux.o(.text+0xf9273e): Section mismatch: reference to 
.init.data:snd_miro_preamp_control (between 'snd_miro_mixer' and 
'snd_legacy_find_free_ioport')
WARNING: vmlinux.o(.text+0xf92764): Section mismatch: reference to 
.init.data:snd_miro_capture_control (between 'snd_miro_mixer' and 
'snd_legacy_find_free_ioport')
WARNING: vmlinux.o(.text+0xf92783): Section mismatch: reference to 
.init.data:snd_miro_radio_control (between 'snd_miro_mixer' and 
'snd_legacy_find_free_ioport')
WARNING: vmlinux.o(.text+0xf9279a): Section mismatch: reference to 
.init.data:snd_miro_eq_controls (between 'snd_miro_mixer' and 
'snd_legacy_find_free_ioport')
WARNING: vmlinux.o(.text+0xf927b9): Section mismatch: reference to 
.init.data:snd_miro_radio_control (between 'snd_miro_mixer' and 
'snd_legacy_find_free_ioport')

Signed-off-by: Randy Dunlap <[EMAIL PROTECTED]>
---
 sound/isa/opti9xx/miro.c   |2 +-
 sound/isa/opti9xx/opti92x-ad1848.c |2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

--- linux-2.6.24-rc8-git5.orig/sound/isa/opti9xx/miro.c
+++ linux-2.6.24-rc8-git5/sound/isa/opti9xx/miro.c
@@ -662,7 +662,7 @@ static int __devinit snd_set_aci_init_va
return 0;
 }
 
-static int snd_miro_mixer(struct snd_miro *miro)
+static int __devinit snd_miro_mixer(struct snd_miro *miro)
 {
struct snd_card *card;
unsigned int idx;
--- linux-2.6.24-rc8-git5.orig/sound/isa/opti9xx/opti92x-ad1848.c
+++ linux-2.6.24-rc8-git5/sound/isa/opti9xx/opti92x-ad1848.c
@@ -1595,7 +1595,7 @@ OPTi93X_DOUBLE("Capture Volume", 0, OPTi
 }
 };
 
-static int snd_opti93x_mixer(struct snd_opti93x *chip)
+static int __devinit snd_opti93x_mixer(struct snd_opti93x *chip)
 {
struct snd_card *card;
struct snd_kcontrol_new knew;


[PATCH] radio: fix sf16fmi section mismatch

2008-01-22 Thread Randy Dunlap
From: Randy Dunlap <[EMAIL PROTECTED]>

isapnp_fmi_probe() is only called by fmi_init(), which is __init,
so isapnp_fmi_probe() can also be __init.

media/radio/radio-sf16fmi.c:
WARNING: vmlinux.o(.text+0x994e19): Section mismatch: reference to .init.data: 
(between 'isapnp_fmi_probe' and 'vidioc_s_tuner')
WARNING: vmlinux.o(.text+0x994e22): Section mismatch: reference to .init.data: 
(between 'isapnp_fmi_probe' and 'vidioc_s_tuner')
WARNING: vmlinux.o(.text+0x994e3a): Section mismatch: reference to 
.init.data:id_table (between 'isapnp_fmi_probe' and 'vidioc_s_tuner')

Signed-off-by: Randy Dunlap <[EMAIL PROTECTED]>
---
 drivers/media/radio/radio-sf16fmi.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- linux-2.6.24-rc8-git5.orig/drivers/media/radio/radio-sf16fmi.c
+++ linux-2.6.24-rc8-git5/drivers/media/radio/radio-sf16fmi.c
@@ -321,7 +321,7 @@ static struct isapnp_device_id id_table[
 
 MODULE_DEVICE_TABLE(isapnp, id_table);
 
-static int isapnp_fmi_probe(void)
+static int __init isapnp_fmi_probe(void)
 {
int i = 0;
 


[PATCH] sound: fix cs5535 section mismatch

2008-01-22 Thread Randy Dunlap
From: Randy Dunlap <[EMAIL PROTECTED]>

snd_cs5535audio_mixer() is only called by __devinit snd_cs5535audio_probe(),
so the mixer function can also be __devinit.

WARNING: vmlinux.o(.text+0xfdbba0): Section mismatch: reference to 
.init.data:ac97_quirks (between 'snd_cs5535audio_mixer' and 'process_bm0_irq')

Signed-off-by: Randy Dunlap <[EMAIL PROTECTED]>
---
 sound/pci/cs5535audio/cs5535audio.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- linux-2.6.24-rc8-git5.orig/sound/pci/cs5535audio/cs5535audio.c
+++ linux-2.6.24-rc8-git5/sound/pci/cs5535audio/cs5535audio.c
@@ -145,7 +145,7 @@ static unsigned short snd_cs5535audio_ac
return snd_cs5535audio_codec_read(cs5535au, reg);
 }
 
-static int snd_cs5535audio_mixer(struct cs5535audio *cs5535au)
+static int __devinit snd_cs5535audio_mixer(struct cs5535audio *cs5535au)
 {
struct snd_card *card = cs5535au->card;
struct snd_ac97_bus *pbus;


do_remount_sb(RDONLY) race? (was: XFS oops under 2.6.23.9)

2008-01-22 Thread David Chinner
On Wed, Jan 23, 2008 at 03:00:48PM +1030, Jonathan Woithe wrote:
> Last night my laptop suffered an oops during closedown.  The full oops
> reports can be downloaded from
> 
>   http://www.atrad.com.au/~jwoithe/xfs_oops/

Assertion failed: atomic_read(&mp->m_active_trans) == 0, file:
fs/xfs/xfs_vfsops.c, line 689.

The remount read-only of the root drive supposedly completed
while there was still active modification of the filesystem
taking place.

> Kernel version was kernel.org 2.6.23.9 compiled as a low latency desktop. 

The patch in 2.6.23 that introduced this check was:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=516b2e7c2661615ba5d5ad9fb584f068363502d3

Basically, the remount-readonly path was not flushing things
properly, so we changed it to flush things properly and to ensure we
get bug reports if it doesn't. Yours is the second report of not
shutting down correctly since this change went in (we've seen it
once in ~8 months in a QA environment).

I've had suspicions of a race in the remount-ro code in
do_remount_sb() w.r.t. the fs_may_remount_ro() check.  That is, we
do an unlocked check to see if we can remount readonly and then fail
to check again once we've locked the superblock out and started the
remount.

The read only flag only gets set *after* we've made the filesystem
readonly, which means that before we are truly read only, other
threads can still open files read/write or start filesystem
modifications.

The result of that race (if it is really unsafe) will be the assert
you see. The patch I wrote a couple of months ago to fix the problem
is attached below.
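The ordering change can be sketched in userspace C. This is only a
single-threaded toy model, not the XFS code: toy_sb, SB_RDONLY,
toy_may_remount_ro() and toy_try_open_write() are made-up stand-ins for the
superblock, MS_RDONLY, fs_may_remount_ro() and a writer opening a file, and
real kernel code would also need proper locking.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define SB_RDONLY 0x1u	/* hypothetical stand-in for MS_RDONLY */

struct toy_sb {
	unsigned int flags;
	int open_writers;	/* files currently open for writing */
};

/* Stand-in for fs_may_remount_ro(): fine only if nobody is writing. */
static bool toy_may_remount_ro(const struct toy_sb *sb)
{
	return sb->open_writers == 0;
}

/* A new writer is admitted only while the read-only flag is clear. */
static bool toy_try_open_write(struct toy_sb *sb)
{
	if (sb->flags & SB_RDONLY)
		return false;
	sb->open_writers++;
	return true;
}

/*
 * Ordering used by the patch: set the flag first so no new writer can
 * slip in, then check, and roll the flag back if the check fails.
 */
static int toy_remount_ro(struct toy_sb *sb)
{
	assert(sb != NULL);
	if (sb->flags & SB_RDONLY)
		return 0;
	sb->flags |= SB_RDONLY;
	if (!toy_may_remount_ro(sb)) {
		sb->flags &= ~SB_RDONLY;
		return -EBUSY;
	}
	return 0;
}
```

With the flag set before the check, a writer arriving between the check and
the quiesce is refused; with the old check-then-set order it would have been
admitted.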

Cheers,

Dave.

---

Set the MS_RDONLY before we check to see if we can remount
read only so that we close a race between checking remount
is ok and setting the superblock flag that allows other
processes to start modifying the filesystem while it is
being remounted.

Signed-off-by: Dave Chinner <[EMAIL PROTECTED]>
---
 fs/xfs/linux-2.6/xfs_super.c |   16 
 1 file changed, 16 insertions(+)

Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_super.c
===
--- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_super.c	2008-01-22 14:57:07.753782292 +1100
+++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_super.c	2008-01-23 16:22:16.940279351 +1100
@@ -1222,6 +1222,22 @@ xfs_fs_remount(
 	struct xfs_mount_args	*args = xfs_args_allocate(sb, 0);
 	int			error;
 
+	/*
+	 * We need to have the MS_RDONLY flag set on the filesystem before we
+	 * try to quiesce it down to a sane state. If we don't set the
+	 * MS_RDONLY before we check the fs_may_remount_ro(sb) state, we have a
+	 * race where write operations can start after we've checked it is OK
+	 * to remount read only. This results in assert failures due to being
+	 * unable to quiesce the transaction subsystem correctly.
+	 */
+	if (!(sb->s_flags & MS_RDONLY) && (*flags & MS_RDONLY)) {
+		sb->s_flags |= MS_RDONLY;
+		if (!fs_may_remount_ro(sb)) {
+			sb->s_flags &= ~MS_RDONLY;
+			return -EBUSY;
+		}
+	}
+
 	error = xfs_parseargs(mp, options, args, 1);
 	if (!error)
 		error = xfs_mntupdate(mp, flags, args);


Re: [patch] x86: test case for the RODATA config option

2008-01-22 Thread Arjan van de Ven
On Wed, 23 Jan 2008 12:11:41 +1100
Nick Piggin <[EMAIL PROTECTED]> wrote:

> >  #ifdef CONFIG_DEBUG_RODATA
> >
> > +const int rodata_test_data = 5;
> 
> I guess this should match the 32-bit case, and be zero instead of
> 5?

actually it should have been 5 for both (well any non-zero number)
> 
> Can you disallow building as a module, and put this in the test
> code? It could be run from the end of mark_rodata_ro()...

fair; I was developing it as a module (just easier) but yeah it makes more
sense as part of mark_rodata_ro(). I'll do that in the next rev


-- 
If you want to reach me at my work email, use [EMAIL PROTECTED]
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org


Re: [RFC/PATCH] dma: dma_{un}map_{single|sg}_attrs() interface

2008-01-22 Thread Roland Dreier
 > --- a/include/linux/dma-attrs.h
 > +++ b/include/linux/dma-attrs.h
 > @@ -0,0 +1,33 @@
 > +#ifndef _DMA_ATTR_H
 > +#define _DMA_ATTR_H
 > +
 > +#include 
 > +
 > +enum {
 > +DMA_ATTR_INVALID,
 > +DMA_ATTR_BARRIER,
 > +DMA_ATTR_FOO,
 > +DMA_ATTR_GOO,
 > +DMA_ATTR_MAX,
 > +};
 > +
 > +struct dma_attrs {
 > +unsigned flags;
 > +};
 > +
 > +static inline int dma_set_attr(struct dma_attrs *attrs, unsigned attr) {

maybe this would be cleaner if you named the DMA_ATTR enum and used
that instead of unsigned here (and below)?

 > +BUG_ON(attrs == NULL);

does this BUG_ON() buy us much?  It seems the only thing we would fail
to oops on is if someone did dma_set_attr(NULL, INVALID) and I'm not
sure it's worth it to BUG here.

 > +if (attr > DMA_ATTR_INVALID && attr < DMA_ATTR_MAX) {
 > +attrs->flags = (1 << attr);
 > +return 0;
 > +}
 > +return 1;

returning -EINVAL here instead of 1 would probably be more "kernelish".

 > +}
 > +
 > +static inline int dma_get_attr(struct dma_attrs *attrs, unsigned attr) {
 > +if (attrs) 
 > + return attrs->flags & (1 << attr);

so it's OK to pass attrs == NULL into dma_get_attr() but not into
dma_set_attr()?  seems kind of odd.

 > +return 0;
 > +}

It seems you're missing a way to initialize a struct dma_attrs.  How
do I clear the flags field to start with?

A macro like DEFINE_DMA_ATTRS() that initializes things for you (like
LIST_HEAD or DEFINE_SPIN_LOCK) would probably be a good thing to have
as well.

Also I guess you could test ARCH_USES_DMA_ATTRS in this file and stub
everything out and define an empty structure if it's not defined.
save a few bytes of stack etc.

 > +
 > +#endif /* _DMA_ATTR_H */
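Taken together, the review comments above might lead to something like the
following. This is a hedged sketch, not a revised patch: the enum tag
dma_attr and the DEFINE_DMA_ATTRS() body are guesses at what was meant, and
using |= so that several attributes can be set at once is an assumption (the
quoted code assigns with =).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Named enum, so the accessors can take it instead of a bare unsigned. */
enum dma_attr {
	DMA_ATTR_INVALID,
	DMA_ATTR_BARRIER,
	DMA_ATTR_FOO,
	DMA_ATTR_GOO,
	DMA_ATTR_MAX,
};

struct dma_attrs {
	unsigned int flags;
};

/* Every attribute needs its own bit in the flags word. */
static_assert(DMA_ATTR_MAX <= 8 * sizeof(unsigned int),
	      "too many DMA attributes for the flags word");

/* Hypothetical initializer in the style of LIST_HEAD(). */
#define DEFINE_DMA_ATTRS(name) struct dma_attrs name = { .flags = 0 }

/* Set one attribute; NULL or out-of-range input yields -EINVAL. */
static inline int dma_set_attr(struct dma_attrs *attrs, enum dma_attr attr)
{
	if (attrs == NULL || attr <= DMA_ATTR_INVALID || attr >= DMA_ATTR_MAX)
		return -EINVAL;
	attrs->flags |= 1u << attr;	/* assumption: accumulate, not overwrite */
	return 0;
}

/* Test one attribute; NULL is treated the same way as in the setter's check. */
static inline int dma_get_attr(const struct dma_attrs *attrs,
			       enum dma_attr attr)
{
	if (attrs == NULL || attr <= DMA_ATTR_INVALID || attr >= DMA_ATTR_MAX)
		return 0;
	return (attrs->flags & (1u << attr)) != 0;
}
```

Both accessors now handle attrs == NULL without a BUG_ON(), which removes the
asymmetry noted above; whether a silent 0 from the getter is the right answer
for a NULL pointer is a design choice left open here.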


[PATCH 1/3] generic: Percpu infrastructure to rebase the per cpu area to zero

2008-01-22 Thread travis
* Support an option

CONFIG_HAVE_ZERO_BASED_PER_CPU

  that makes offsets for per cpu variables start at zero.

  If a percpu area starts at zero then:

-  We do not need RELOC_HIDE anymore

-  Provides for the future capability of architectures providing
   a per cpu allocator that returns offsets instead of pointers.
   The offsets would be independent of the processor so that
   address calculations can be done in a processor independent way.
   Per cpu instructions can then add the processor specific offset
   at the last minute possibly in an atomic instruction.

  The data the linker provides is different for zero based percpu segments:

__per_cpu_load  -> The address at which the percpu area was loaded
__per_cpu_size  -> The length of the per cpu area

* Removes the &__per_cpu_x in lockdep. The __per_cpu_x are already
  pointers. There is no need to take the address.

* Changes generic setup_per_cpu_areas to allocate per_cpu space in
  node local memory.  This requires a generic early_cpu_to_node function.
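The idea of offsets that are independent of the processor can be modelled in
plain userspace C. This is only an illustration under assumptions:
toy_percpu_area, toy_cpu_base and toy_shift_percpu_ptr() are invented names,
and malloc() stands in for the node local allocation (the sketch never frees
the areas).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define NR_TOY_CPUS 4

/* Template of the per cpu section; field offsets start at zero. */
struct toy_percpu_area {
	long counter;
	int state;
};

/* One base address per cpu; this plays the role of the per cpu offset. */
static char *toy_cpu_base[NR_TOY_CPUS];

/* Give every cpu its own copy of the template. Returns 0 or -1. */
static int toy_setup_per_cpu_areas(const struct toy_percpu_area *tmpl)
{
	for (int cpu = 0; cpu < NR_TOY_CPUS; cpu++) {
		toy_cpu_base[cpu] = malloc(sizeof(*tmpl));
		if (toy_cpu_base[cpu] == NULL)
			return -1;
		memcpy(toy_cpu_base[cpu], tmpl, sizeof(*tmpl));
	}
	return 0;
}

/*
 * Zero-based access: the variable is named by its offset alone, and the
 * cpu-specific base is added at the last moment with a plain addition.
 */
static void *toy_shift_percpu_ptr(size_t offset, int cpu)
{
	assert(cpu >= 0 && cpu < NR_TOY_CPUS);
	return toy_cpu_base[cpu] + offset;
}
```

The offset passed in is the same for every cpu, which is the property the
zero-based layout is after; only the final addition depends on the processor.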

Based on 2.6.24-rc8-mm1

Signed-off-by: Mike Travis <[EMAIL PROTECTED]>
Reviewed-by: Christoph Lameter <[EMAIL PROTECTED]>
---
 include/asm-alpha/topology.h  |1 +
 include/asm-generic/percpu.h  |7 ++-
 include/asm-generic/sections.h|   10 ++
 include/asm-generic/topology.h|3 +++
 include/asm-generic/vmlinux.lds.h |   15 +++
 include/asm-ia64/topology.h   |1 +
 include/asm-mips/mach-ip27/topology.h |1 +
 include/asm-powerpc/topology.h|1 +
 init/main.c   |   18 ++
 kernel/lockdep.c  |4 ++--
 10 files changed, 50 insertions(+), 11 deletions(-)

--- a/include/asm-alpha/topology.h
+++ b/include/asm-alpha/topology.h
@@ -6,6 +6,7 @@
 #include 
 
 #ifdef CONFIG_NUMA
+#define early_cpu_to_node(cpu) cpu_to_node(cpu)
 static inline int cpu_to_node(int cpu)
 {
int node;
--- a/include/asm-generic/percpu.h
+++ b/include/asm-generic/percpu.h
@@ -43,7 +43,12 @@ extern unsigned long __per_cpu_offset[NR
  * Only S390 provides its own means of moving the pointer.
  */
 #ifndef SHIFT_PERCPU_PTR
-#define SHIFT_PERCPU_PTR(__p, __offset)RELOC_HIDE((__p), (__offset))
+# ifdef CONFIG_HAVE_ZERO_BASED_PER_CPU
+#  define SHIFT_PERCPU_PTR(__p, __offset) \
+   ((__typeof(__p))(((void *)(__p)) + (__offset)))
+# else
+#  define SHIFT_PERCPU_PTR(__p, __offset)  RELOC_HIDE((__p), (__offset))
+# endif /* CONFIG_HAVE_ZERO_BASED_PER_CPU */
 #endif
 
 /*
--- a/include/asm-generic/sections.h
+++ b/include/asm-generic/sections.h
@@ -9,7 +9,17 @@ extern char __bss_start[], __bss_stop[];
 extern char __init_begin[], __init_end[];
 extern char _sinittext[], _einittext[];
 extern char _end[];
+#ifdef CONFIG_HAVE_ZERO_BASED_PER_CPU
+extern char __per_cpu_load[];
+extern char per_cpu_size[];
+#define __per_cpu_size ((unsigned long)&per_cpu_size)
+#define __per_cpu_start ((char *)0)
+#define __per_cpu_end ((char *)__per_cpu_size)
+#else
 extern char __per_cpu_start[], __per_cpu_end[];
+#define __per_cpu_load __per_cpu_start
+#define __per_cpu_size (__per_cpu_end - __per_cpu_start)
+#endif
 extern char __kprobes_text_start[], __kprobes_text_end[];
 extern char __initdata_begin[], __initdata_end[];
 extern char __start_rodata[], __end_rodata[];
--- a/include/asm-generic/topology.h
+++ b/include/asm-generic/topology.h
@@ -32,6 +32,9 @@
 #ifndef cpu_to_node
 #define cpu_to_node(cpu)   (0)
 #endif
+#ifndef early_cpu_to_node
+#define early_cpu_to_node(cpu) cpu_to_node(cpu)
+#endif
 #ifndef parent_node
 #define parent_node(node)  (0)
 #endif
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -255,6 +255,20 @@
*(.initcall7.init)  \
*(.initcall7s.init)
 
+#ifdef CONFIG_HAVE_ZERO_BASED_PER_CPU
+#define PERCPU(align)  \
+   . = ALIGN(align);   \
+   percpu : { } :percpu\
+   __per_cpu_load = .; \
+   .data.percpu 0 : AT(__per_cpu_load - LOAD_OFFSET) { \
+   *(.data.percpu.first)   \
+   *(.data.percpu) \
+   *(.data.percpu.shared_aligned)  \
+   per_cpu_size = .;   \
+   }   \
+   . = __per_cpu_load + per_cpu_size;  \
+   data : { } :data
+#else
 #define PERCPU(align)  \
. = ALIGN(align); 

[PATCH 2/3] x86_64: Fold pda into per cpu area

2008-01-22 Thread travis
  * Declare the pda as a per cpu variable. This will move the pda area
to an address accessible by the x86_64 per cpu macros.  Subtraction
of __per_cpu_start will make the offset based from the beginning
of the per cpu area.  Since %gs is pointing to the pda, it will
then also point to the per cpu variables and can be accessed thusly:

%gs:[_cpu_ - __per_cpu_start]

  * The boot_pdas are only needed in head64.c so move the declaration
over there and make it static.

  * Remove the code that allocates special pda data structures.

Based on 2.6.24-rc8-mm1

Signed-off-by: Mike Travis <[EMAIL PROTECTED]>
Reviewed-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 arch/x86/kernel/head64.c  |6 ++
 arch/x86/kernel/setup64.c |   12 ++--
 arch/x86/kernel/smpboot_64.c  |   16 
 include/asm-generic/vmlinux.lds.h |1 +
 include/asm-x86/pda.h |1 -
 include/asm-x86/percpu.h  |   30 +++---
 include/linux/percpu.h|   13 -
 7 files changed, 48 insertions(+), 31 deletions(-)

--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -22,6 +22,12 @@
 #include 
 #include 
 
+/*
+ * Only used before the per cpu areas are setup. The use for the non possible
+ * cpus continues after boot
+ */
+static struct x8664_pda boot_cpu_pda[NR_CPUS] __cacheline_aligned;
+
 static void __init zap_identity_mappings(void)
 {
pgd_t *pgd = pgd_offset_k(0UL);
--- a/arch/x86/kernel/setup64.c
+++ b/arch/x86/kernel/setup64.c
@@ -34,7 +34,9 @@ cpumask_t cpu_initialized __cpuinitdata 
 
 struct x8664_pda *_cpu_pda[NR_CPUS] __read_mostly;
 EXPORT_SYMBOL(_cpu_pda);
-struct x8664_pda boot_cpu_pda[NR_CPUS] __cacheline_aligned;
+
+DEFINE_PER_CPU_FIRST(struct x8664_pda, pda);
+EXPORT_PER_CPU_SYMBOL(pda);
 
 struct desc_ptr idt_descr = { 256 * 16 - 1, (unsigned long) idt_table };
 
@@ -150,10 +152,16 @@ void __init setup_per_cpu_areas(void)
}
if (!ptr)
panic("Cannot allocate cpu data for CPU %d\n", i);
-   cpu_pda(i)->data_offset = ptr - __per_cpu_start;
memcpy(ptr, __per_cpu_start, __per_cpu_end - __per_cpu_start);
+   /* Relocate the pda */
+   memcpy(ptr, cpu_pda(i), sizeof(struct x8664_pda));
+   cpu_pda(i) = (struct x8664_pda *)ptr;
+   cpu_pda(i)->data_offset = ptr - __per_cpu_start;
}
 
+   /* Fix up pda for this processor  */
+   pda_init(0);
+
/* setup percpu data maps early */
setup_per_cpu_maps();
 } 
--- a/arch/x86/kernel/smpboot_64.c
+++ b/arch/x86/kernel/smpboot_64.c
@@ -566,22 +566,6 @@ static int __cpuinit do_boot_cpu(int cpu
return -1;
}
 
-   /* Allocate node local memory for AP pdas */
-   if (cpu_pda(cpu) == _cpu_pda[cpu]) {
-   struct x8664_pda *newpda, *pda;
-   int node = cpu_to_node(cpu);
-   pda = cpu_pda(cpu);
-   newpda = kmalloc_node(sizeof (struct x8664_pda), GFP_ATOMIC,
- node);
-   if (newpda) {
-   memcpy(newpda, pda, sizeof (struct x8664_pda));
-   cpu_pda(cpu) = newpda;
-   } else
-   printk(KERN_ERR
-   "Could not allocate node local PDA for CPU %d on node %d\n",
-   cpu, node);
-   }
-
alternatives_smp_switch(1);
 
c_idle.idle = get_idle_for_cpu(cpu);
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -273,6 +273,7 @@
. = ALIGN(align);   \
__per_cpu_start = .;\
.data.percpu  : AT(ADDR(.data.percpu) - LOAD_OFFSET) {  \
+   *(.data.percpu.first)   \
*(.data.percpu) \
*(.data.percpu.shared_aligned)  \
}   \
--- a/include/asm-x86/pda.h
+++ b/include/asm-x86/pda.h
@@ -39,7 +39,6 @@ struct x8664_pda {
 } cacheline_aligned_in_smp;
 
 extern struct x8664_pda *_cpu_pda[];
-extern struct x8664_pda boot_cpu_pda[];
 extern void pda_init(int);
 
 #define cpu_pda(i) (_cpu_pda[i])
--- a/include/asm-x86/percpu.h
+++ b/include/asm-x86/percpu.h
@@ -16,7 +16,14 @@
 #define __my_cpu_offset read_pda(data_offset)
 
 #define per_cpu_offset(x) (__per_cpu_offset(x))
+#define __percpu_seg "%%gs:"
+/* Calculate the offset to use with the segment register */
+#define seg_offset(name)   (*SHIFT_PERCPU_PTR(_cpu_var(name), \
+   - (unsigned long)__per_cpu_start))
 
+#else
+#define __percpu_seg ""
+#define seg_offset(name)   per_cpu_var(name)
 #endif
 #include 
 
@@ -64,16 +71,11 

[PATCH 0/3] percpu: Optimize percpu accesses

2008-01-22 Thread travis

This patchset provides the following:

  * Generic: Percpu infrastructure to rebase the per cpu area to zero

This provides for the capability of accessing the percpu variables
using a local register instead of having to go through a table
on node 0 to find the cpu-specific offsets.  It also would allow
atomic operations on percpu variables to reduce required locking.

  * x86_64: Fold pda into per cpu area

Declare the pda as a per cpu variable. This will move the pda
area to an address accessible by the x86_64 per cpu macros.
Subtraction of __per_cpu_start will make the offset based from
the beginning of the per cpu area.  Since %gs is pointing to the
pda, it will then also point to the per cpu variables and can be
accessed thusly:

%gs:[_cpu_ - __per_cpu_start]

  * x86_64: Rebase per cpu variables to zero

Take advantage of the zero-based per cpu area provided above.
Then we can directly use the x86_32 percpu operations. x86_32
offsets %fs by __per_cpu_start. x86_64 has %gs pointing directly
to the pda and the per cpu area thereby allowing access to the
pda with the x86_64 pda operations and access to the per cpu
variables using x86_32 percpu operations.

Based on 2.6.24-rc8-mm1

Signed-off-by: Mike Travis <[EMAIL PROTECTED]>
Reviewed-by: Christoph Lameter <[EMAIL PROTECTED]>
---



[PATCH 3/3] x86_64: Rebase per cpu variables to zero

2008-01-22 Thread travis
  * Relocate the x86_64 percpu variables to begin at zero. Then
we can directly use the x86_32 percpu operations. x86_32
offsets %fs by __per_cpu_start. x86_64 has %gs pointing
directly to the pda and the per cpu area thereby allowing
access to the pda with the x86_64 pda operations and access
to the per cpu variables using x86_32 percpu operations.

  * This also supports further integration of x86_32/64.

Based on 2.6.24-rc8-mm1

Signed-off-by: Mike Travis <[EMAIL PROTECTED]>
Reviewed-by: Christoph Lameter <[EMAIL PROTECTED]>
---
 arch/x86/Kconfig |3 +++
 arch/x86/kernel/setup64.c|2 +-
 arch/x86/kernel/vmlinux_64.lds.S |1 +
 kernel/module.c  |7 ---
 4 files changed, 9 insertions(+), 4 deletions(-)

--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -107,6 +107,9 @@ config GENERIC_TIME_VSYSCALL
bool
default X86_64
 
+config HAVE_ZERO_BASED_PER_CPU
+   def_bool X86_64
+
 config ARCH_SUPPORTS_OPROFILE
bool
default y
--- a/arch/x86/kernel/setup64.c
+++ b/arch/x86/kernel/setup64.c
@@ -152,7 +152,7 @@ void __init setup_per_cpu_areas(void)
}
if (!ptr)
panic("Cannot allocate cpu data for CPU %d\n", i);
-   memcpy(ptr, __per_cpu_start, __per_cpu_end - __per_cpu_start);
+   memcpy(ptr, __per_cpu_load, __per_cpu_size);
/* Relocate the pda */
memcpy(ptr, cpu_pda(i), sizeof(struct x8664_pda));
cpu_pda(i) = (struct x8664_pda *)ptr;
--- a/arch/x86/kernel/vmlinux_64.lds.S
+++ b/arch/x86/kernel/vmlinux_64.lds.S
@@ -16,6 +16,7 @@ jiffies_64 = jiffies;
 _proxy_pda = 1;
 PHDRS {
text PT_LOAD FLAGS(5);  /* R_E */
+   percpu PT_LOAD FLAGS(4);/* R__ */
data PT_LOAD FLAGS(7);  /* RWE */
user PT_LOAD FLAGS(7);  /* RWE */
data.init PT_LOAD FLAGS(7); /* RWE */
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -45,6 +45,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -351,7 +352,7 @@ static void *percpu_modalloc(unsigned lo
align = PAGE_SIZE;
}
 
-   ptr = __per_cpu_start;
+   ptr = __per_cpu_load;
for (i = 0; i < pcpu_num_used; ptr += block_size(pcpu_size[i]), i++) {
/* Extra for alignment requirement. */
extra = ALIGN((unsigned long)ptr, align) - (unsigned long)ptr;
@@ -386,7 +387,7 @@ static void *percpu_modalloc(unsigned lo
 static void percpu_modfree(void *freeme)
 {
unsigned int i;
-   void *ptr = __per_cpu_start + block_size(pcpu_size[0]);
+   void *ptr = __per_cpu_load + block_size(pcpu_size[0]);
 
/* First entry is core kernel percpu data. */
for (i = 1; i < pcpu_num_used; ptr += block_size(pcpu_size[i]), i++) {
@@ -437,7 +438,7 @@ static int percpu_modinit(void)
pcpu_size = kmalloc(sizeof(pcpu_size[0]) * pcpu_num_allocated,
GFP_KERNEL);
/* Static in-kernel percpu data (used). */
-   pcpu_size[0] = -(__per_cpu_end-__per_cpu_start);
+   pcpu_size[0] = -__per_cpu_size;
/* Free room. */
pcpu_size[1] = PERCPU_ENOUGH_ROOM + pcpu_size[0];
if (pcpu_size[1] < 0) {



XFS oops under 2.6.23.9

2008-01-22 Thread Jonathan Woithe
Last night my laptop suffered an oops during closedown.  The full oops
reports can be downloaded from

  http://www.atrad.com.au/~jwoithe/xfs_oops/

as photos of the screen.  Since the laptop was unusable at this point I
wasn't able to cut and paste the details, and they weren't in the logs when
the machine was rebooted.

The initial complaint claims to be an "invalid opcode".  Is this possibly a
memory fault developing or does it ring any bells for anyone?  memtest86
finds no fault with the memory.

Kernel version was kernel.org 2.6.23.9 compiled as a low latency desktop. 
The RT patches were not applied.

Regards
  jonathan


Re: W1: w1_slave units, standardize 1C or .001C? Break API

2008-01-22 Thread H. Peter Anvin

David Fries wrote:
> On Mon, Jan 21, 2008 at 07:11:07PM -0800, H. Peter Anvin wrote:
>> H. Peter Anvin wrote:
>>> Millikelvins would have the nice property of never being negative.  :)
>
> True, but the sensor returns the value as a signed integer in C.  That
> is where the earlier negative number problem was, it would have to do
> yet another conversion to go to Kelvin, and it would be just one more
> potential for error.  Everyone knows that a bad conversion doomed at
> least one space craft, let's stick to Centigrade.

Uhm... the conversion is exact as long as you have at least centikelvin
precision (0 °C = 273.15 K by definition, and the multiplier is 1.)

>> Alternatively, centikelvins would fit nicely in 16 bits if anyone cares...
>>
>> 655.35 K = 382.20 °C = 719.96 °F
>
> The range for the sensor is -55 to 125 C, if an application didn't
> care about precision they could store it in a signed 8 bit value just
> fine.

This was more a comment as to it possibly being a convenient format for
more than this particular sensor.

The nice thing with kelvins is no need to worry about negative numbers
and something misparsing them, that's all.

I certainly did not imply that we should even consider using °F.  That's
obviously ridiculous.


-hpa
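The centikelvin arithmetic above can be checked with a few lines of C. This
is a minimal sketch with invented names (centi_c_to_centi_k,
CENTIKELVIN_AT_0C), taking centidegrees C as input so that the conversion is
a pure integer addition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 0 °C = 273.15 K, i.e. 27315 centikelvin. */
#define CENTIKELVIN_AT_0C 27315

/*
 * Convert centidegrees C to centikelvin in an unsigned 16 bit value.
 * Returns 0 on success, -1 if the result would not fit in 0..65535.
 */
static int centi_c_to_centi_k(int32_t centi_c, uint16_t *centi_k)
{
	assert(centi_k != NULL);
	/* Range check first, so the addition below cannot overflow. */
	if (centi_c < -CENTIKELVIN_AT_0C ||
	    centi_c > UINT16_MAX - CENTIKELVIN_AT_0C)
		return -1;
	*centi_k = (uint16_t)(centi_c + CENTIKELVIN_AT_0C);
	return 0;
}
```

The sensor range quoted in the thread, -55 to 125 C, maps to 21815 and 39815
centikelvin, and the largest representable value, 65535, corresponds to the
382.20 °C figure given above.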



Re: CONFIG_MARKERS

2008-01-22 Thread Jon Masters

On Tue, 2008-01-22 at 22:10 -0500, Mathieu Desnoyers wrote:
> * Frank Ch. Eigler ([EMAIL PROTECTED]) wrote:
> > 
> > Jon Masters <[EMAIL PROTECTED]> writes:
> > 
> > > I notice in module.c:
> > >
> > > #ifdef CONFIG_MARKERS
> > >   if (!mod->taints)
> > >   marker_update_probe_range(mod->markers,
> > >   mod->markers + mod->num_markers, NULL, NULL);
> > > #endif
> > >
> > > Is this an attempt to not set a marker for proprietary modules? [...]
> > 
> > I can't seem to find any discussion about this aspect.  If this is the
> > intent, it seems misguided to me.  There may instead be a relationship
> > to TAINT_FORCED_{RMMOD,MODULE}.  Mathieu?
> > 
> > - FChE
> 
> On my part, it's mostly a matter of not crashing the kernel when someone
> tries to force modprobe of a proprietary module (where the checksums
> don't match) on a kernel that supports the markers. Not doing so
> causes the markers to try to find the marker-specific information in
> struct module which doesn't exist and OOPSes.
> 
> Christoph's point of view is rather more drastic than mine : it's not
> interesting for the kernel community to help proprietary modules writers,
> so it's a good idea not to give them marker support. (I CC'ed him so he
> can clarify his position).

Right. I thought that was your collective opinion, and I happen to
personally agree with you, but my question was more that you should be
explicitly comparing to whether it's proprietary and not just whether
the taints field is set - there are other flags in there too.
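To make the distinction concrete, here is a rough userspace sketch of the
two checks. The TOY_TAINT_* names and bit positions are made up for
illustration and are not copied from any kernel header; which taints should
really block marker setup is exactly the open question in this thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit positions only; not taken from the kernel. */
#define TOY_TAINT_PROPRIETARY_MODULE	(1u << 0)
#define TOY_TAINT_FORCED_MODULE		(1u << 1)
#define TOY_TAINT_FORCED_RMMOD		(1u << 2)

/* Assumed policy: only these taints keep a module away from markers. */
#define TOY_MARKER_BLOCKING_TAINTS \
	(TOY_TAINT_PROPRIETARY_MODULE | TOY_TAINT_FORCED_MODULE)

static_assert((TOY_MARKER_BLOCKING_TAINTS & TOY_TAINT_FORCED_RMMOD) == 0,
	      "unrelated taint bits must not block markers");

/* Mirrors the quoted !mod->taints test: any taint bit disables markers. */
static bool toy_markers_ok_any_taint(unsigned int taints)
{
	return taints == 0;
}

/* Explicit test: only the named taint bits disable markers. */
static bool toy_markers_ok_explicit(unsigned int taints)
{
	return (taints & TOY_MARKER_BLOCKING_TAINTS) == 0;
}
```

A module carrying only an unrelated taint bit is rejected by the first check
but accepted by the second, which is the difference being pointed out.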

Jon.




Re: [PATCH] procfs: constify function pointer tables

2008-01-22 Thread Bryan Wu
On Jan 23, 2008 4:00 AM, Jan Engelhardt <[EMAIL PROTECTED]> wrote:
> Hi,
>
>
> This touches so many different places that I did not feel like creating
> a minuscule patch for each architecture. I hope that is ok.
>
> ===Patch begins===
> [PATCH] procfs: constify function pointer tables
>
> Signed-off-by: Jan Engelhardt <[EMAIL PROTECTED]>
> ---
>  arch/alpha/kernel/setup.c |2 +-
>  arch/blackfin/kernel/setup.c  |2 +-
>  arch/cris/kernel/setup.c  |2 +-
>  arch/frv/kernel/setup.c   |2 +-
>  arch/h8300/kernel/setup.c |2 +-
>  arch/m32r/kernel/setup.c  |2 +-
>  arch/m68k/kernel/setup.c  |2 +-
>  arch/m68knommu/kernel/setup.c |2 +-
>  arch/parisc/kernel/setup.c|2 +-
>  arch/ppc/kernel/setup.c   |2 +-
>  arch/v850/kernel/procfs.c |2 +-
>  arch/xtensa/kernel/setup.c|2 +-
>  fs/proc/base.c|6 +++---
>  fs/proc/nommu.c   |2 +-
>  fs/proc/proc_misc.c   |   22 +++---
>  fs/proc/proc_sysctl.c |4 ++--
>  fs/proc/proc_tty.c|2 +-
>  fs/proc/task_mmu.c|8 
>  fs/proc/task_nommu.c  |2 +-
>  19 files changed, 35 insertions(+), 35 deletions(-)
>
> diff --git a/arch/alpha/kernel/setup.c b/arch/alpha/kernel/setup.c
> index bd5e68c..823f18e 100644
> --- a/arch/alpha/kernel/setup.c
> +++ b/arch/alpha/kernel/setup.c
> @@ -1472,7 +1472,7 @@ c_stop(struct seq_file *f, void *v)
>  {
>  }
>
> -struct seq_operations cpuinfo_op = {
> +const struct seq_operations cpuinfo_op = {
> .start  = c_start,
> .next   = c_next,
> .stop   = c_stop,
> diff --git a/arch/blackfin/kernel/setup.c b/arch/blackfin/kernel/setup.c
> index d282201..d67cf54 100644
> --- a/arch/blackfin/kernel/setup.c
> +++ b/arch/blackfin/kernel/setup.c
> @@ -691,7 +691,7 @@ static void c_stop(struct seq_file *m, void *v)
>  {
>  }
>
> -struct seq_operations cpuinfo_op = {
> +const struct seq_operations cpuinfo_op = {
> .start = c_start,
> .next = c_next,
> .stop = c_stop,

Thanks, I understand the seq_xxx() API needs "const struct seq_operations *".
So for the Blackfin part, I agree with Mike.

Acked-by: Bryan Wu <[EMAIL PROTECTED]>

but there are still some other files that need "const" added:
---
/opt/git-tree/blackfin-2.6$ grep -r seq_operations arch/*
arch/alpha/kernel/setup.c:struct seq_operations cpuinfo_op = {
arch/arm/kernel/setup.c:struct seq_operations cpuinfo_op = {
arch/arm/mach-davinci/clock.c:static struct seq_operations davinci_ck_op = {
arch/avr32/kernel/cpu.c:struct seq_operations cpuinfo_op = {
arch/avr32/mm/tlb.c:static struct seq_operations tlb_ops = {
arch/blackfin/kernel/setup.c:struct seq_operations cpuinfo_op = {
arch/cris/kernel/setup.c:struct seq_operations cpuinfo_op = {
arch/frv/kernel/setup.c:struct seq_operations cpuinfo_op = {
arch/h8300/kernel/setup.c:struct seq_operations cpuinfo_op = {
arch/ia64/hp/common/sba_iommu.c:static struct seq_operations ioc_seq_ops = {
arch/ia64/kernel/perfmon.c:struct seq_operations pfm_seq_ops = {
arch/ia64/kernel/setup.c:struct seq_operations cpuinfo_op = {
arch/ia64/sn/kernel/sn2/sn2_smp.c:static struct seq_operations sn2_ptc_seq_ops = {
arch/ia64/sn/kernel/sn2/sn_hwperf.c:static struct seq_operations sn_topology_seq_ops = {
arch/m32r/kernel/setup.c:struct seq_operations cpuinfo_op = {
arch/m68k/kernel/setup.c:struct seq_operations cpuinfo_op = {
arch/m68knommu/kernel/setup.c:struct seq_operations cpuinfo_op = {
arch/mips/kernel/proc.c:struct seq_operations cpuinfo_op = {
arch/parisc/kernel/setup.c:struct seq_operations cpuinfo_op = {
arch/powerpc/kernel/setup-common.c:struct seq_operations cpuinfo_op = {


[!snip!]

Regards,
-Bryan


Re: [PATCH] ppc: fix #ifdef-s in mediabay driver

2008-01-22 Thread Benjamin Herrenschmidt

On Wed, 2008-01-23 at 01:58 +0100, Bartlomiej Zolnierkiewicz wrote:
> I'm more worried about breaking automatic build checking (make randconfig)
> than a few extra bytes so if you remove all #ifdefs you'll have to either
> make BLK_DEV_IDE_PMAC select PMAC_MEDIABAY or make PMAC_MEDIABAY depend
> on BLK_DEV_IDE_PMAC (otherwise BLK_DEV_IDE=n && PMAC_MEDIABAY=y will fail
> since mediabay.c is referencing IDE code).

I was thinking about having the pmac arch code provide an exported
function pointer to put the hook in to avoid that problem.

Ben.




Re: W1: w1_slave units, standardize 1C or .001C? Break API

2008-01-22 Thread David Fries

On Wed, Jan 23, 2008 at 12:06:27AM +0300, Evgeniy Polyakov wrote:
> 
> What about instead of breaking application just add new sysfs file,
> which will only return temperature instead of full rom content.
> It can be millidegrees Centigrade, another one can be millikelvins :)

If someone wrote their application to read degrees C because they have
a ds18b20, the application will break anyway if they run it with a
ds1820 sensor.  Or the opposite way around.  Yes it would be better
not to break a program, but I think having a consistent interface for
both sensors is the better option.

> Actually it is already possible for applications to decode whatever
> precision they like from the rom content displayed, although that can be
> not very convenient.

I was first surprised then glad that the raw data was included in the
user available data.  I was wanting the full precision, so that was my
plan.

> Even more, what about possibility of changing of the base, relative to
> which temperature is displayed? By default I vote for centigrades,
> those, who live behind the oceans, can setup Fahrenheit, Kelvin or anything
> else, but please in a new file :)
> David will this work for you?

I'm biased toward Fahrenheit, against Kelvin, but I think continuing
to keep Centigrade is the correct choice here.  I don't like the idea
of selecting the base the kernel displays by a userland option, too
easy to make assumptions, give it one interface and let the
application do the conversion, C/1000.0*9/5+32 is pretty easy (for
millidegrees C that is).
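
As an illustration only, the userland conversion mentioned above could
look like the sketch below.  The function name is made up for this
example, and the absolute-zero check is an added assumption, not
something the driver does.

```c
#include <assert.h>

/*
 * Convert a reading in millidegrees Celsius to degrees Fahrenheit,
 * using the formula from the mail: C/1000.0*9/5+32.
 * millic_to_fahrenheit() is a hypothetical helper name.
 */
double millic_to_fahrenheit(long millic)
{
	/* Sanity check (assumption): nothing is colder than absolute zero. */
	assert(millic >= -273150L);

	return millic / 1000.0 * 9 / 5 + 32;
}
```

An application would call this on whatever integer it read from the
kernel, so the kernel only ever has to expose one unit.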

I'll get the trivial patch to change the ds18b20 output in
millidegrees C to make things consistent.  I'm out of time tonight.

It does sound like a good idea to have a sysfs file that just returns
the millidegrees C in ASCII without any other text.  It would be
easier to parse.  If the conversion fails return 0 bytes.  Just an
idea, but if someone wants it they can write the patch.
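
To show why such a file would be easier to parse, here is a rough
userland sketch.  It assumes the file content is a single signed decimal
integer, optionally followed by a newline, and that an empty read means
the conversion failed.  No such file exists yet, so the format and both
function names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Parse ASCII millidegrees C from buf into *out.
 * Returns 0 on success, -1 if buf is empty (conversion failed in the
 * driver) or does not hold exactly one decimal integer.
 */
int parse_millic(const char *buf, long *out)
{
	char *end;
	long val;

	assert(buf != NULL && out != NULL);

	if (*buf == '\0')
		return -1;		/* 0 bytes: no valid reading */

	errno = 0;
	val = strtol(buf, &end, 10);
	if (errno != 0 || end == buf)
		return -1;		/* overflow or no digits at all */

	if (*end == '\n')
		end++;			/* tolerate one trailing newline */
	if (*end != '\0')
		return -1;		/* trailing garbage */

	*out = val;
	return 0;
}

/* Convenience wrapper: return the reading, or fallback on any error. */
long parse_millic_or(const char *buf, long fallback)
{
	long val;

	return parse_millic(buf, &val) == 0 ? val : fallback;
}
```

Compare that with picking the temperature out of the full w1_slave
text, where the application has to skip the raw bytes first.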

-- 
David Fries <[EMAIL PROTECTED]>
http://fries.net/~david/ (PGP encryption key available)


Re: W1: w1_slave units, standardize 1C or .001C? Break API

2008-01-22 Thread David Fries
On Mon, Jan 21, 2008 at 07:11:07PM -0800, H. Peter Anvin wrote:
> H. Peter Anvin wrote:
> >Millikelvins would have the nice property of never being negative.  :)

True, but the sensor returns the value as a signed integer in C.  That
is where the earlier negative number problem was, it would have to do
yet another conversion to go to Kelvin, and it would be just one more
potential for error.  Everyone knows that a bad conversion doomed at
least one space craft, let's stick to Centigrade.

> Alternatively, centikelvins would fit nicely in 16 bits if anyone cares...
> 
> 655.35 K = 382.20 °C = 719.96 °F

The range for the sensor is -55 to 125 C, if an application didn't
care about precision they could store it in a signed 8 bit value just
fine.
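
A small sketch of the two storage options discussed here, assuming the
-55 to 125 C range quoted above.  The helper names and range macros are
invented for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Sensor range from the mail, expressed in millidegrees C. */
#define SENSOR_MIN_MILLIC (-55000L)
#define SENSOR_MAX_MILLIC 125000L

/*
 * Whole degrees C: -55..125 lies inside the int8_t range -128..127.
 * Integer division truncates toward zero, so precision is dropped.
 */
int8_t millic_to_whole_c(long millic)
{
	assert(millic >= SENSOR_MIN_MILLIC && millic <= SENSOR_MAX_MILLIC);
	return (int8_t)(millic / 1000);
}

/*
 * Centikelvins: (millic + 273150) / 10.  Over the sensor range this
 * gives 21815..39815, well below the uint16_t maximum of 65535.
 */
uint16_t millic_to_centikelvin(long millic)
{
	assert(millic >= SENSOR_MIN_MILLIC && millic <= SENSOR_MAX_MILLIC);
	return (uint16_t)((millic + 273150L) / 10);
}
```

Either way the extra conversion lives in the application, not in the
kernel interface.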

-- 
David Fries <[EMAIL PROTECTED]>
http://fries.net/~david/ (PGP encryption key available)


Re: [PATCH 0/6] IO context sharing

2008-01-22 Thread David Chinner
On Tue, Jan 22, 2008 at 10:49:15AM +0100, Jens Axboe wrote:
> Hi,
> 
> Today io contexts are per-process and define the (surprise) io context
> of that process. In some situations it would be handy if several
> processes share an IO context.

I think that the nfsd threads should probably share as
well. It should probably provide an io context per thread
pool.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [PATCH] x86_32: trim memory by updating e820 v2

2008-01-22 Thread Yinghai Lu
On Monday 21 January 2008 01:37:09 pm Justin Piszcz wrote:
> 
> On Mon, 21 Jan 2008, Yinghai Lu wrote:
> 
> > On Monday 21 January 2008 11:14:02 am Justin Piszcz wrote:
> > please get x86.git
> >
> >  git clone 
> > git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
> >  cd linux-2.6
> >  #--{ x86.git instructions }-->
> >  # Add Linus's tree as a remote
> >  git remote add linus 
> > git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
> >
> >  # Add Ingo's tree as a remote
> >  git remote add x86 
> > git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86.git
> >
> >  # With that setup, just run the following to get any changes you
> >  # don't have.  It will also notice any new branches Ingo/Linus
> >  # add to their repo.  Look in .git/config afterwards, the format
> >  # to add new remotes is easy to figure out.
> >  git remote update
> >  #-
> >  git merge x86/master
> >  git merge x86/mm
> >
> > and apply
> >
> > [PATCH] x86_64: check if Tom2 is enabled
> > http://lkml.org/lkml/2008/1/21/20
> > [PATCH] x86_64: update e820 instead of updating end_pfn v3
> > http://lkml.org/lkml/2008/1/21/19
> > [PATCH] x86_32: trim memory by updating e820 v2
> > http://lkml.org/lkml/2008/1/21/18
> >
> > YH
> >
> 
> Thanks, I am all patched up and ready to test, unfortunately one of my disks
> in my RAID 1 just died, I already filled out the advanced replacement form,
> I will test when I receive the replacement disk.

please get x86.git and apply
[PATCH] x86_32: trim memory by updating e820 v3
http://lkml.org/lkml/2008/1/22/394

Ingo already put other two into the tree.

Thanks

YH


Re: CONFIG_MARKERS

2008-01-22 Thread Mathieu Desnoyers
* Frank Ch. Eigler ([EMAIL PROTECTED]) wrote:
> 
> Jon Masters <[EMAIL PROTECTED]> writes:
> 
> > I notice in module.c:
> >
> > #ifdef CONFIG_MARKERS
> > if (!mod->taints)
> > marker_update_probe_range(mod->markers,
> > mod->markers + mod->num_markers, NULL, NULL);
> > #endif
> >
> > Is this an attempt to not set a marker for proprietary modules? [...]
> 
> I can't seem to find any discussion about this aspect.  If this is the
> intent, it seems misguided to me.  There may instead be a relationship
> to TAINT_FORCED_{RMMOD,MODULE}.  Mathieu?
> 
> - FChE

On my part, it's mostly a matter of not crashing the kernel when someone
tries to force modprobe of a proprietary module (where the checksums
don't match) on a kernel that supports the markers. Not doing so
causes the markers to try to find the marker-specific information in
struct module which doesn't exist and OOPSes.

Christoph's point of view is rather more drastic than mine : it's not
interesting for the kernel community to help proprietary modules writers,
so it's a good idea not to give them marker support. (I CC'ed him so he
can clarify his position).

Mathieu

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


Re: CONFIG_MARKERS

2008-01-22 Thread Frank Ch. Eigler

Jon Masters <[EMAIL PROTECTED]> writes:

> I notice in module.c:
>
> #ifdef CONFIG_MARKERS
>   if (!mod->taints)
>   marker_update_probe_range(mod->markers,
>   mod->markers + mod->num_markers, NULL, NULL);
> #endif
>
> Is this an attempt to not set a marker for proprietary modules? [...]

I can't seem to find any discussion about this aspect.  If this is the
intent, it seems misguided to me.  There may instead be a relationship
to TAINT_FORCED_{RMMOD,MODULE}.  Mathieu?

- FChE


Re: 2.6.24-rc8-mm1 : net tcp_input.c warnings

2008-01-22 Thread David Miller
From: "Dave Young" <[EMAIL PROTECTED]>
Date: Wed, 23 Jan 2008 09:44:30 +0800

> On Jan 22, 2008 6:47 PM, Ilpo Järvinen <[EMAIL PROTECTED]> wrote:
> > [PATCH] [TCP]: debug S+L
> 
> Thanks, If there's new findings I will let you know.

Thanks for helping with this bug Dave.


Re: [PATCH 06/26] atl1: update initialization parameters

2008-01-22 Thread Chris Snook

Jay Cliburn wrote:
> On Tue, 22 Jan 2008 04:56:11 -0500
> Jeff Garzik <[EMAIL PROTECTED]> wrote:
>
> > [EMAIL PROTECTED] wrote:
> > > From: Jay Cliburn <[EMAIL PROTECTED]>
> > >
> > > Update initialization parameters to match the current vendor driver
> > > version 1.2.40.2.
>
> [...]
>
> > ACK without any better knowledge...  but is any addition insight
> > available at all?
>
> No, sorry Jeff.  I simply took the vendor's current driver and matched
> his initialization settings.  I can only assume he discovered these
> values through lab testing.
>
> For this and the other "conform to vendor driver" patches in this set, I
> thought it important to have the in-tree driver match the vendor driver
> as closely as possible.  The primary motivations are (1) my belief that
> he's in a better position to test the NIC, and (2) to be able to go to
> him for assistance occasionally and not be rejected because of
> significant differences between his and our drivers.


I don't think we should be doing this without justification.  From all the atl1 
and atl2 code I've looked at, I've gotten the impression that their driver 
development processes are extremely ad-hoc.  There is code in the Atheros 
version of atl2 that cannot *possibly* apply to that hardware and was just 
copied and pasted from atl1, just as much of atl1 was copied and pasted from 
e1000.  The fact that various versions have different magic numbers may simply 
mean they copied and pasted from different irrelevant and incorrect sources.


Our contacts at Atheros seem to be very good electrical engineers, so when they 
tell us that a certain setting should be changed to match particular properties 
of the hardware, I trust them.  They are not, however, experienced and 
disciplined kernel developers, so absent such justification I think we should 
stick with what we have, which has been improved and reviewed by people who 
*are* experienced and disciplined kernel developers.


We have at least as much to teach Atheros about writing kernel code as they have 
to teach us about their hardware.


-- Chris


Re: [PATCH 06/26] atl1: update initialization parameters

2008-01-22 Thread Jay Cliburn
On Tue, 22 Jan 2008 04:56:11 -0500
Jeff Garzik <[EMAIL PROTECTED]> wrote:

> [EMAIL PROTECTED] wrote:
> > From: Jay Cliburn <[EMAIL PROTECTED]>
> > 
> > Update initialization parameters to match the current vendor driver
> > version 1.2.40.2.

[...]

> ACK without any better knowledge...  but is any addition insight 
> available at all?

No, sorry Jeff.  I simply took the vendor's current driver and matched
his initialization settings.  I can only assume he discovered these
values through lab testing.

For this and the other "conform to vendor driver" patches in this set, I
thought it important to have the in-tree driver match the vendor driver
as closely as possible.  The primary motivations are (1) my belief that
he's in a better position to test the NIC, and (2) to be able to go to
him for assistance occasionally and not be rejected because of
significant differences between his and our drivers.

Jay


Re: [PATCH]PCIE ASPM support - takes 3

2008-01-22 Thread Shaohua Li

On Tue, 2008-01-22 at 14:58 -0800, Greg KH wrote:
> On Fri, Jan 18, 2008 at 09:56:28AM +0800, Shaohua Li wrote:
> > v3->v2, fixed the issues Matthew Wilcox raised.
> > 
> > PCI Express ASPM defines a protocol for PCI Express components in the D0
> > state to reduce Link power by placing their Links into a low power state
> > and instructing the other end of the Link to do likewise. This
> > capability allows hardware-autonomous, dynamic Link power reduction
> > beyond what is achievable by software-only controlled power management.
> > However, The device should be configured by software appropriately.
> > Enabling ASPM will save power, but will introduce device latency.
> > 
> > This patch adds ASPM support in Linux. It introduces a global policy for
> > ASPM, a sysfs file /sys/module/pcie_aspm/parameters/policy can control
> > it. The interface can be used as a boot option too. Currently we have
> > below setting:
> > -default, BIOS default setting
> > -powersave, highest power saving mode, enable all available ASPM
> > state
> > and clock power management
> > -performance, highest performance, disable ASPM and clock power
> > management
> > By default, the 'default' policy is used currently.
> > 
> > In my test, power difference between powersave mode and performance mode
> > is about 1.3w in a system with 3 PCIE links.
> > 
> > please review, any comments will be appreciated.
> 
> Can you please fix up all of the warnings that checkpatch.pl and sparse
> produce from this patch?
> 
> Also, one small thing:
> 
> > --- linux.orig/include/linux/pci.h  2008-01-16 15:59:42.0 +0800
> > +++ linux/include/linux/pci.h   2008-01-18 09:41:20.0 +0800
> > @@ -164,6 +164,10 @@ struct pci_dev {
> >this is D0-D3, D0 being fully 
> > functional,
> >and D3 being off. */
> >  
> > +#ifdef CONFIG_PCIEASPM
> > +   void*link_state;/* ASPM link state. */
> > +#endif
> 
> Can we make this a "real" pointer to a structure?  I note that you use
> two different structures here in this pointer, should you really do
> that?  It's good to get type-checks whereever possible.
The structure is just for internal use by ASPM; I just don't want to make it
global.

Fixed, sparse and checkpatch.pl no longer report any warnings.

Signed-off-by: Shaohua Li <[EMAIL PROTECTED]>
---
 drivers/pci/pci-sysfs.c   |5 
 drivers/pci/pci.c |4 
 drivers/pci/pcie/Kconfig  |   20 +
 drivers/pci/pcie/Makefile |3 
 drivers/pci/pcie/aspm.c   |  818 ++
 drivers/pci/probe.c   |5 
 drivers/pci/remove.c  |4 
 include/linux/aspm.h  |   44 ++
 include/linux/pci.h   |4 
 include/linux/pci_regs.h  |8 
 10 files changed, 915 insertions(+)

Index: linux/drivers/pci/pcie/Makefile
===
--- linux.orig/drivers/pci/pcie/Makefile 2008-01-23 10:14:14.0 +0800
+++ linux/drivers/pci/pcie/Makefile 2008-01-23 10:14:46.0 +0800
@@ -2,6 +2,9 @@
 # Makefile for PCI-Express PORT Driver
 #
 
+# Build PCI Express ASPM if needed
+obj-$(CONFIG_PCIEASPM) += aspm.o
+
 pcieportdrv-y  := portdrv_core.o portdrv_pci.o portdrv_bus.o
 
 obj-$(CONFIG_PCIEPORTBUS)  += pcieportdrv.o
Index: linux/drivers/pci/pcie/aspm.c
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ linux/drivers/pci/pcie/aspm.c   2008-01-23 10:14:46.0 +0800
@@ -0,0 +1,818 @@
+/*
+ * File:   drivers/pci/pcie/aspm.c
+ * Enabling PCIE link L0s/L1 state and Clock Power Management
+ *
+ * Copyright (C) 2007 Intel
+ * Copyright (C) Zhang Yanmin ([EMAIL PROTECTED])
+ * Copyright (C) Shaohua Li ([EMAIL PROTECTED])
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "../pci.h"
+
+#ifdef MODULE_PARAM_PREFIX
+#undef MODULE_PARAM_PREFIX
+#endif
+#define MODULE_PARAM_PREFIX "pcie_aspm."
+
+/* only for downstream port */
+struct link_state {
+   struct list_head sibiling;
+   struct pci_dev *pdev;
+
+   /* ASPM state */
+   unsigned int support_state;
+   unsigned int enabled_state;
+   unsigned int bios_aspm_state;
+   /* upstream component */
+   unsigned int l0s_upper_latency;
+   unsigned int l1_upper_latency;
+   /* downstream component */
+   unsigned int l0s_down_latency;
+   unsigned int l1_down_latency;
+   /* Clock PM state*/
+   unsigned int clk_pm_capable;
+   unsigned int clk_pm_enabled;
+   unsigned int bios_clk_state;
+
+};
+
+/* Only for endpoint */
+struct endpoint_state {
+   unsigned int l0s_acceptable_latency;
+   unsigned int l1_acceptable_latency;
+};
+
+static int aspm_disabled;
+static DEFINE_MUTEX(aspm_lock);
+static 

Re: [PATCH 06/26] atl1: update initialization parameters

2008-01-22 Thread Jeff Garzik

Jay Cliburn wrote:
> On Tue, 22 Jan 2008 04:56:11 -0500
> Jeff Garzik <[EMAIL PROTECTED]> wrote:
>
> > [EMAIL PROTECTED] wrote:
> > > From: Jay Cliburn <[EMAIL PROTECTED]>
> > >
> > > Update initialization parameters to match the current vendor driver
> > > version 1.2.40.2.
>
> [...]
>
> > ACK without any better knowledge...  but is any addition insight
> > available at all?
>
> No, sorry Jeff.  I simply took the vendor's current driver and matched
> his initialization settings.  I can only assume he discovered these
> values through lab testing.
>
> For this and the other "conform to vendor driver" patches in this set, I
> thought it important to have the in-tree driver match the vendor driver
> as closely as possible.  The primary motivations are (1) my belief that
> he's in a better position to test the NIC, and (2) to be able to go to
> him for assistance occasionally and not be rejected because of
> significant differences between his and our drivers.


Since these changes are not simply moving code around, we really do need 
full explanations for them, and to understand their need.


Blindly copying code from an exterior driver is pointless, and no way at 
all to run an engineering process.


If the driver is not going to get the review and attention necessary, 
bug fixes and feedback attended-to, then there's not much point in 
having this driver in the kernel at all.


You will only lead yourself to frustration, if you set up a system where 
changes only flow one way.  That's not how Linux development is done at all.


Jeff





Re: [PATCH 0/2] Relax restrictions on setting CONFIG_NUMA on x86

2008-01-22 Thread KOSAKI Motohiro
Hi mel

> Hi 
> 
> > A fix[1] was merged to the x86.git tree that allowed NUMA kernels to boot
> > on normal x86 machines (and not just NUMA-Q, Summit etc.). I took a look
> > at the restrictions on setting NUMA on x86 to see if they could be lifted.
> 
> Interesting!
> 
> I will test tomorrow.

Hmm...
It doesn't work on my machine.

It panics at boot in __free_pages_ok() with the call trace below.

[] free_all_bootmem_core
[] mem_init
[] alloc_large_system_hash
[] inode_init_early
[] start_kernel
[] unknown_bootoption

my machine spec
CPU:   Pentium4 with HT
MEM:   512M

I will try to investigate further,
but I have no time for a while, sorry ;-)


BTW:
when sparsemem is configured instead of discontigmem,
it panics at boot in get_pageblock_flags_group() with the call stack below.

free_initrd
free_init_pages
free_hot_cold_page



- kosaki




Re: 2.6.24-rc8: iwl3945 gets stuck

2008-01-22 Thread Daniel Hazelton
On Tuesday 22 January 2008 17:15:42 John W. Linville wrote:
> On Tue, Jan 22, 2008 at 09:54:11PM +0100, Harald Dunkel wrote:
> > If I put some heavy load on the iwl3945, then the network connection
> > gets stuck after some time. To fix it I have to reload the module.
>
> Can you quantify this a bit more?  What constitutes a "heavy load"?
> What (if any) encryption are you using?  Are you using any options
> for iwl3945 in /etc/modprobe.conf?
>
> Could you include the output of dmesg and/or the contents of
> /var/log/messages (trimmed for the most recent boot)?
>
> > AFAICS this problem was a topic on lkml almost 3 months ago. Any news
> > about this? I would be glad to help to track this down, but I have
> > no idea how to change the scaling algorithm to iwl-3945-rs .
>
> This should happen automatically now.
>
> John

I've been getting a warning in the dmesg of my laptop with every boot since I 
started using 2.6.24-rc7 that might be related.

This doesn't appear to cause any problems, but from looking at the source 
of the warning it appears that the ipw3945 hardware might be causing the 
problem.
 
[   31.460143] ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
[   31.549722] WARNING: at net/mac80211/rx.c:1486 __ieee80211_rx()
[   31.549817] Pid: 4436, comm: amixer Not tainted 2.6.24-rc7-git2 #1
[   31.549903]
[   31.549904] Call Trace:
[   31.550063][] :mac80211:__ieee80211_rx+0xc99/0xd60
[   31.550236]  [] _spin_unlock_irqrestore+0x16/0x40
[   31.550332]  [] :iwl3945:iwl_rx_queue_restock+0xca/0x170
[   31.550422]  [] _spin_unlock_irqrestore+0x16/0x40
[   31.550520]  [] :mac80211:ieee80211_tasklet_handler+0xb8/0x120
[   31.550646]  [] tasklet_action+0x51/0xc0
[   31.550732]  [] _spin_unlock+0x14/0x40
[   31.550820]  [] __do_softirq+0x64/0xe0
[   31.550909]  [] call_softirq+0x1c/0x30
[   31.550995]  [] do_softirq+0x3d/0x90
[   31.551083]  [] irq_exit+0x88/0xa0
[   31.551169]  [] do_IRQ+0xc5/0x1b0
[   31.551257]  [] ret_from_intr+0x0/0xa
[   31.551369][] get_page_from_freelist+0x30e/0x670
[   31.551519]  [] __alloc_pages+0x6e/0x3b0
[   31.551608]  [] generic_file_aio_read+0xd7/0x180
[   31.551699]  [] alloc_page_vma+0x9c/0xf0
[   31.551788]  [] handle_mm_fault+0x50e/0x780
[   31.551874]  [] _spin_unlock+0x14/0x40
[   31.551962]  [] _spin_unlock_irqrestore+0x16/0x40
[   31.552052]  [] do_page_fault+0x228/0x970
[   31.552146]  [] _spin_unlock+0x14/0x40
[   31.552251]  [] vfs_read+0x13e/0x180
[   31.552340]  [] error_exit+0x0/0x51
[   31.552436]

The location of the warning is:
hdrlen = ieee80211_get_hdrlen(rx.fc);
line in question --> WARN_ON_ONCE(((unsigned long)(skb->data + hdrlen)) & 3);

if (type == IEEE80211_FTYPE_DATA || type == IEEE80211_FTYPE_MGMT)
local->dot11ReceivedFragmentCount++;

sta = rx.sta = sta_info_get(local, hdr->addr2);

Now, the problem is that this might be nothing, and it might be the cause 
of the problem. (I don't think it is the cause, myself, because I've subjected
my laptop to a lot of activity - to the point that the card was starting to drop
packets - and have seen no problems)
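
For reference, the condition in that WARN_ON_ONCE only tests whether the
byte right after the 802.11 header lands on a 4-byte boundary.  A
userland sketch of the same test follows.  The function names are made
up, and the header lengths in the usage note (24 bytes for a plain data
frame, 26 with a QoS control field) are my assumption for illustration,
not something taken from the trace above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Nonzero if the payload that follows a header of hdrlen bytes,
 * starting at address data_addr, is NOT 4-byte aligned.
 * Mirrors the expression ((unsigned long)(skb->data + hdrlen)) & 3.
 */
int addr_misaligned(uintptr_t data_addr, size_t hdrlen)
{
	return ((data_addr + hdrlen) & 3) != 0;
}

/* Same check for a real buffer pointer. */
int payload_misaligned(const unsigned char *data, size_t hdrlen)
{
	assert(data != NULL);
	return addr_misaligned((uintptr_t)data, hdrlen);
}
```

With a buffer at a 4-byte aligned address, a 24-byte header leaves the
payload aligned while a 26-byte header does not, so the warning can
fire depending on the frame type and on where the driver places the
frame in the buffer.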

DRH

--
Dialup is like pissing through a pipette. Slow and excruciatingly painful.


Re: [PATCH] kernel/params.c: fix the module name length in param_sysfs_builtin

2008-01-22 Thread Rusty Russell
On Wednesday 23 January 2008 10:13:37 Jan Engelhardt wrote:
> On Jan 21 2008 22:16, Rusty Russell wrote:
> >On Monday 21 January 2008 20:08:25 Denis Cheng wrote:
> >> the original code uses KOBJ_NAME_LEN for built-in module name length,
> >> that's defined to 20 in linux/kobject.h, but this is not enough
> >> apparently, many module names are longer than this;
> >>  #define KOBJ_NAME_LEN   20
> >
> >Thanks, applied.  I was surprised to learn that we have a 35-char source
> >filename in the kernel.
> >
> >And congratulations to nf_conntrack_l3proto_ipv4_compat.c!
>
> But nf..dada_compat.c gets linked into nf_conntrack_ipv4.ko,
> and that is what is used in /sys/module - and it fits the 20.
> Any place where nf_conntrack_l3proto_ipv4_compat would still be used?

Of course, but my point was that we already have a 35 char filename in the 
kernel, and lots of > 22 chars, so increasing it is not unreasonable.

FYI make allmodconfig here gives me the following of 21 chars or longer:

dvb-usb-af9005-remote
dvb-usb-dibusb-common
nf_conntrack_netbios_ns
nf_conntrack_proto_udplite
nf_conntrack_proto_sctp
nf_conntrack_proto_gre

Cheers,
Rusty.


Re: [PATCH] kernel/params.c: fix the module name length in param_sysfs_builtin

2008-01-22 Thread rae l
On Jan 23, 2008 7:13 AM, Jan Engelhardt <[EMAIL PROTECTED]> wrote:
> But nf..dada_compat.c gets linked into nf_conntrack_ipv4.ko,
> and that is what is used in /sys/module - and it fits the 20.
> Any place where nf_conntrack_l3proto_ipv4_compat would still be used?
There is a module named nf_conntrack_proto_icmp.ko, length 23, and you
can find them all with:

$ make allmodconfig && make modules

$ find -name '*.ko' -printf '%f\n' |gawk '{print length($0), $0}' |sort -n
...
24 dvb-usb-af9005-remote.ko
24 dvb-usb-dibusb-common.ko
25 nf_conntrack_proto_gre.ko
26 nf_conntrack_netbios_ns.ko
26 nf_conntrack_proto_sctp.ko
29 nf_conntrack_proto_udplite.ko

So currently the maximum module name length is 26 (in
nf_conntrack_proto_udplite), but Documentation/ still does not state
any length limit for module names. Should we reserve space for longer
module names later, or document MODULE_NAME_LEN as the limit on module
names in Documentation/?

Simply speaking, MODULE_NAME_LEN does the better job.
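
To make the size problem concrete, here is a small userland sketch.
KOBJ_NAME_LEN is given the value 20 quoted earlier in the thread;
name_fits() is a made-up helper, and it assumes the buffer must also
hold the terminating NUL.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define KOBJ_NAME_LEN 20	/* value quoted from linux/kobject.h in this thread */

/*
 * Nonzero if name, including its terminating NUL, fits in a buffer
 * of buflen bytes.  Hypothetical helper for illustration only.
 */
int name_fits(const char *name, size_t buflen)
{
	assert(name != NULL);
	return strlen(name) + 1 <= buflen;
}
```

Under that assumption nf_conntrack_ipv4 (17 chars) fits into
KOBJ_NAME_LEN bytes, while nf_conntrack_proto_udplite (26 chars) and
dvb-usb-af9005-remote (21 chars) do not.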
>

-- 
Denis Cheng


Re: 2.6.24-rc8-mm1 : net tcp_input.c warnings

2008-01-22 Thread Dave Young
On Jan 22, 2008 6:47 PM, Ilpo Järvinen <[EMAIL PROTECTED]> wrote:
>
> On Tue, 22 Jan 2008, Dave Young wrote:
>
> > On Jan 22, 2008 12:37 PM, Dave Young <[EMAIL PROTECTED]> wrote:
> > >
> > > On Jan 22, 2008 5:14 AM, Ilpo Järvinen <[EMAIL PROTECTED]> wrote:
> > > >
> > > > On Mon, 21 Jan 2008, Dave Young wrote:
> > > >
> > > > > Please see the kernel messages following,(trigged while using some 
> > > > > qemu session)
> > > > > BTW, seems there's some e100 error message as well.
> > > > >
> > > > > PCI: Setting latency timer of device :00:1b.0 to 64
> > > > > e100: Intel(R) PRO/100 Network Driver, 3.5.23-k4-NAPI
> > > > > e100: Copyright(c) 1999-2006 Intel Corporation
> > > > > ACPI: PCI Interrupt :03:08.0[A] -> GSI 20 (level, low) -> IRQ 20
> > > > > modprobe:2331 conflicting cache attribute efaff000-efb0 
> > > > > uncached<->default
> > > > > e100: :03:08.0: e100_probe: Cannot map device registers, aborting.
> > > > > ACPI: PCI interrupt for device :03:08.0 disabled
> > > > > e100: probe of :03:08.0 failed with error -12
> > > > > eth0:  setting full-duplex.
> > > > > [ cut here ]
> > > > > WARNING: at net/ipv4/tcp_input.c:2169 tcp_mark_head_lost+0x121/0x150()
> > > > > Modules linked in: snd_seq_dummy snd_seq_oss snd_seq_midi_event 
> > > > > snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss eeprom e100 psmouse 
> > > > > snd_hda_intel snd_pcm snd_timer btusb rtc_cmos thermal bluetooth 
> > > > > rtc_core serio_raw intel_agp button processor sg snd rtc_lib i2c_i801 
> > > > > evdev agpgart soundcore dcdbas 3c59x pcspkr snd_page_alloc
> > > > > Pid: 0, comm: swapper Not tainted 2.6.24-rc8-mm1 #4
> > > > >  [] ? printk+0x0/0x20
> > > > >  [] warn_on_slowpath+0x54/0x80
> > > > >  [] ? ip_finish_output+0x128/0x2e0
> > > > >  [] ? ip_output+0xe7/0x100
> > > > >  [] ? ip_local_out+0x18/0x20
> > > > >  [] ? ip_queue_xmit+0x3dc/0x470
> > > > >  [] ? _spin_unlock_irqrestore+0x5e/0x70
> > > > >  [] ? check_pad_bytes+0x61/0x80
> > > > >  [] tcp_mark_head_lost+0x121/0x150
> > > > >  [] tcp_update_scoreboard+0x4c/0x170
> > > > >  [] tcp_fastretrans_alert+0x48a/0x6b0
> > > > >  [] tcp_ack+0x1b3/0x3a0
> > > > >  [] tcp_rcv_established+0x3eb/0x710
> > > > >  [] tcp_v4_do_rcv+0xe5/0x100
> > > > >  [] tcp_v4_rcv+0x5db/0x660
> > > >
> > > > Doh, once more these S+L things..., the rest are symptom of the first
> > > > problem.
> > >
> > > What is the S+L thing? Could you explain a bit?
>
> It means that one of the skbs is both SACKed and marked as LOST at the
> same time in the counters (might be due to miscount of lost/sacked_out
> too, not necessarily in the ->sacked bits). Such state is logically
> invalid because it would mean that the sender thinks that the same packet
> both reached the receiver and is lost in the network.
>
> Traditionally TCP has just silently "corrected" over-estimates
> (sacked_out+lost_out > packets_out). I changed this a couple of releases ago
> because those over-estimates often are due to bugs that should be fixed
> (there have been a couple of them but it has been very quiet on this front
> for a long time, months or even half a year already; but I might have broken
> something with the early Dec changes).
>
> These problems may originate from a bug that occurred a number of ACKs
> earlier than the WARN_ON triggered, therefore they are a bit tricky to track,
> those WARN_ON serve just for alerting purposes and usually do not point
> out where the bug actually occurred.
>
> I usually just asked people to include exhaustive verifier which compares
> ->sacked bitmaps with sacked/lost_out counters and report immediately when
> the problem shows up, rather than waiting for the cheaper S+L check we do
> in the WARN_ON to trigger. I tried to collect tracking patch from the
> previous efforts (hopefully got it right after modifications).
>
> > > I'm a bit worried about its
> > > > reproducability if it takes this far to see it...
> > > >
> >
> > It's trigged again in my pc, just while using firefox.
>
> ...Good, then there's some chance to catch it.
>
> --
>  i.
>
> [PATCH] [TCP]: debug S+L

Thanks, If there's new findings I will let you know.
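
For readers following along, the two conditions Ilpo describes in the
quoted explanation can be sketched in plain C as below.  This is not
the kernel code: the flag values, the function names and the flat
counter arguments are all invented for illustration.

```c
#include <assert.h>

/* Hypothetical per-packet state bits (not the kernel's definitions). */
#define PKT_SACKED 0x1u
#define PKT_LOST   0x2u

/* A packet marked both SACKed and LOST is the invalid "S+L" state. */
int pkt_sacked_and_lost(unsigned int flags)
{
	return (flags & PKT_SACKED) && (flags & PKT_LOST);
}

/*
 * The cheap counter check: sacked_out + lost_out must never exceed
 * packets_out.  The sum is widened so it cannot wrap around.
 */
int counters_overestimated(unsigned int packets_out,
			   unsigned int sacked_out,
			   unsigned int lost_out)
{
	return (unsigned long long)sacked_out + lost_out > packets_out;
}

/* Abort on an over-estimate, loosely analogous to the WARN_ON. */
void verify_counters(unsigned int packets_out,
		     unsigned int sacked_out,
		     unsigned int lost_out)
{
	assert(!counters_overestimated(packets_out, sacked_out, lost_out));
}
```

The exhaustive verifier in the patch walks the write queue and compares
the per-packet bits against the counters, which catches the problem
closer to where it happens than the counter check alone.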

>
> ---
>  include/net/tcp.h |8 +++-
>  net/ipv4/tcp_input.c  |6 +++
>  net/ipv4/tcp_ipv4.c   |  101 +
>  net/ipv4/tcp_output.c |   21 +++---
>  4 files changed, 129 insertions(+), 7 deletions(-)
>
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 7de4ea3..0685035 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -272,6 +272,8 @@ DECLARE_SNMP_STAT(struct tcp_mib, tcp_statistics);
>  #define TCP_ADD_STATS_BH(field, val)   SNMP_ADD_STATS_BH(tcp_statistics, field, val)
>  #define TCP_ADD_STATS_USER(field, val) SNMP_ADD_STATS_USER(tcp_statistics, field, val)
>
> +extern voidtcp_verify_wq(struct sock *sk);
> +
>  extern voidtcp_v4_err(struct sk_buff *skb, u32);
>
>  extern void

[PATCH] x86: left over fix for leak of early_ioremp in dmi_scan

2008-01-22 Thread Yinghai Lu
[PATCH] x86: left over fix for leak of early_ioremp in dmi_scan

Signed-off-by: Yinghai Lu <[EMAIL PROTECTED]>

Index: linux-2.6/drivers/firmware/dmi_scan.c
===
--- linux-2.6.orig/drivers/firmware/dmi_scan.c
+++ linux-2.6/drivers/firmware/dmi_scan.c
@@ -353,6 +353,7 @@ void __init dmi_scan_machine(void)
return;
}
}
+   dmi_iounmap(p, 0x1);
}
  out:  printk(KERN_INFO "DMI not present or invalid.\n");
 }


Re: [kvm-devel] [PATCH] export notifier #1

2008-01-22 Thread Robin Holt
On Tue, Jan 22, 2008 at 04:40:50PM -0800, Christoph Lameter wrote:
> On Wed, 23 Jan 2008, Benjamin Herrenschmidt wrote:
> 
> > > - anon_vma/inode and pte locks are held during callbacks.
> > 
> > So how does that fix the problem of sleeping then ?
> 
> The locks are taken in the mmu_ops patch. This patch does not hold them 
> while performing the callbacks.

Let me start by clarifying: the page is referenced prior to exporting,
and that reference is not removed until after recall is complete and
memory protections are back to normal.

As Christoph pointed out, the mmu_ops callouts do not allow sleeping.
This is a problem for us as our recall path includes a message to one or
more other hosts and a wait until we receive a response.  That message
sequence can take seconds or more to complete.  It includes an operation
to ensure the memory is in a cross-partition clean state and then changes
memory protection.  When that is complete we remove our page reference
and return.

Christoph's patch allows that long slow activity to happen prior to the
mmu_ops callout.  By the time the mmu_ops callout is made, we no longer
are exporting the page so the cleanup is equivalent to the cleanup of
a page we have never used.

Thanks,
Robin Holt


Re: [PATCH] cgroup: limit block I/O bandwidth

2008-01-22 Thread Naveen Gupta
On 22/01/2008, Andrea Righi <[EMAIL PROTECTED]> wrote:
> Naveen Gupta wrote:
> > See if using priority levels to have per level bandwidth limit can
> > solve the priority inversion problem you were seeing earlier. I have a
> > priority scheduling patch for anticipatory scheduler, if you want to
> > try it. It's much simpler than CFQ priority.  I still need to port it
> > to 2.6.24 though and send across for review.
> >
> > Though as already said, this would be for read side only.
> >
> > -Naveen
>
> Thanks Naveen, I can test your scheduler if you want, but the priority
> inversion problem (or better, we should call it a "bandwidth limiting"
> that impacts the wrong tasks) occurs only with write operations and, as
> said by Jens, the I/O scheduler is not the right place to implement this
> kind of limiting, because at this level the processes have already
> performed the operations (dirty pages in memory) that raise the requests
> to the I/O scheduler (made by different processes asynchronously).

If the i/o submission is happening in bursts, and we limit the rate
during submission, we will have to stop the current task from
submitting any further i/o and hence change its pattern. Also, then
we are limiting the submission rate and not the rate which is going on
the wire, as the scheduler may reorder.

One way could be to limit the rate when the i/o is sent out
from the scheduler, and if we see that the number of allocated requests
is above a threshold we disallow request allocation in the offending
task. This way an application submitting bursts under the allowed
average rate will not stop frequently. Something like a leaky bucket.
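
A minimal userspace sketch of the leaky-bucket idea described above,
assuming a per-task byte counter that drains at the allowed average rate
and a burst threshold above which new requests are refused. All names here
(struct leaky_bucket, bucket_drain, bucket_admit) are hypothetical and are
not taken from any patch in this thread:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-task limiter state; not taken from any kernel patch. */
struct leaky_bucket {
	uint64_t level;     /* bytes admitted but not yet drained */
	uint64_t capacity;  /* burst threshold in bytes */
	uint64_t rate;      /* allowed average rate in bytes per second */
	uint64_t last_ms;   /* timestamp of the last drain, in milliseconds */
};

/* Drain the bucket according to the time elapsed since the last drain. */
static void bucket_drain(struct leaky_bucket *b, uint64_t now_ms)
{
	uint64_t leaked;

	if (now_ms <= b->last_ms)
		return;
	leaked = (now_ms - b->last_ms) * b->rate / 1000;
	if (leaked == 0)
		return; /* too little time has passed; keep last_ms so it accumulates */
	/* The fractional byte lost to integer division is ignored for brevity. */
	b->level = leaked >= b->level ? 0 : b->level - leaked;
	b->last_ms = now_ms;
}

/*
 * Return true if a request of the given size may be allocated now.
 * A refused request leaves the level unchanged, so the caller can retry.
 */
static bool bucket_admit(struct leaky_bucket *b, uint64_t bytes, uint64_t now_ms)
{
	assert(b != NULL);
	bucket_drain(b, now_ms);
	if (b->level + bytes > b->capacity)
		return false;
	b->level += bytes;
	return true;
}
```

With a capacity of 4096 bytes and a rate of 1024 bytes/s, a 4096-byte burst
is admitted at once; further requests are refused until enough has drained.
A task that stays under the average rate is therefore rarely stopped.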

Now for dirtying of memory happening in a different context than the
submission path, you could still put a limit based on the dirty
ratio, with this limit set higher than the actual b/w rate you are
looking to achieve. In the process you make sure you always have something
to write and still do not blow your entire memory. Or you can get really
fancy and track who dirtied the i/o and start limiting it that way.



>
> A possible way to model the write limiting is to look at the dirty page
> ratio that is, in part, the principal reason for the requests to the I/O
> scheduler. But in this way we would limit also the re-write operations
> in memory and this is too much limiting.
>
> So, the cgroup dirty page throttling could be very interesting anyway,
> but it's not the same thing as limiting the real write I/O bandwidth.
>
> For now I've rewritten my patch as follows, moving the code out of
> the I/O scheduler. It seems to work in my small tests (apart from all the
> things said above), but I'd like to find a different way to have a more
> sophisticated I/O throttling approach (probably looking also directly at
> the read()/write() level)... just investigating for now...
>
> BTW I've seen that OpenVZ does not have a solution for this problem yet either.
> AFAIU OpenVZ I/O activity is accounted in virtual environments (VE) by
> the user beancounters (http://wiki.openvz.org/IO_accounting), but
> there is no policy that implements block I/O limiting, except
> that it's possible to set different per-VE I/O priorities (mapped on CFQ
> priorities). But I've not understood if this just sets this I/O priority
> for all processes in the VE, or if it does something different. I still
> need to look at the code in detail.
>
> -Andrea
>
> Signed-off-by: Andrea Righi <[EMAIL PROTECTED]>
> ---
>
> diff -urpN linux-2.6.24-rc8/block/io-throttle.c linux-2.6.24-rc8-cgroup-io-throttling/block/io-throttle.c
> --- linux-2.6.24-rc8/block/io-throttle.c	1970-01-01 01:00:00.0 +0100
> +++ linux-2.6.24-rc8-cgroup-io-throttling/block/io-throttle.c	2008-01-22 23:06:09.0 +0100
> @@ -0,0 +1,222 @@
> +/*
> + * io-throttle.c
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public
> + * License as published by the Free Software Foundation; either
> + * version 2 of the License, or (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public
> + * License along with this program; if not, write to the
> + * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
> + * Boston, MA 02111-1307, USA.
> + *
> + * Copyright (C) 2008 Andrea Righi <[EMAIL PROTECTED]>
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +struct iothrottle {
> +   struct cgroup_subsys_state css;
> +   spinlock_t lock;
> +   unsigned long iorate;
> +   unsigned long req;
> +   unsigned long last_request;
> 

Re: [PATCH 1/6] driver-core : add class iteration api

2008-01-22 Thread Dave Young
On Jan 23, 2008 6:25 AM, Greg KH <[EMAIL PROTECTED]> wrote:
> On Tue, Jan 22, 2008 at 03:27:08PM +0800, Dave Young wrote:
> >
> > Add the following class iteration functions for driver use:
> > class_for_each_device
> > class_find_device
> > class_for_each_child
> > class_find_child
>
> As class_for_each_child() is not used by anyone in this patch series,
> and we want to heavily discourage the use of class_device (only scsi and
> IB are the last remaining users), I'll cut out this portion of the
> patch.
>
> Any objection?

Looks good to me.

>
> thanks,
>
> greg k-h
>


Re: [patch] x86: test case for the RODATA config option

2008-01-22 Thread Nick Piggin
On Wednesday 23 January 2008 09:44, Arjan van de Ven wrote:
> From: Arjan van de Ven <[EMAIL PROTECTED]>
> Subject: x86: test case for the RODATA config option
>
> This patch adds a test module for the DEBUG_RODATA config
> option to make sure change_page_attr() did indeed make
> "const" data read only.
>
> This testcase both tests the DEBUG_RODATA code as well as
> the change_page_attr() code for correct operation.
>
> When the tests/ patch gets merged, this module should move
> to the tests/ directory.
>
> Signed-off-by: Arjan van de Ven <[EMAIL PROTECTED]>
> ---
>  arch/x86/Kconfig.debug|8 +
>  arch/x86/kernel/Makefile_32   |1
>  arch/x86/kernel/Makefile_64   |2 +
>  arch/x86/kernel/test_rodata.c |   65 ++
>  arch/x86/mm/init_32.c |3 +
>  arch/x86/mm/init_64.c |3 +
>  6 files changed, 82 insertions(+)
>
> Index: linux-2.6.24-rc8/arch/x86/Kconfig.debug
> ===
> --- linux-2.6.24-rc8.orig/arch/x86/Kconfig.debug
> +++ linux-2.6.24-rc8/arch/x86/Kconfig.debug
> @@ -57,6 +57,14 @@ config DEBUG_RODATA
> portion of the kernel code won't be covered by a 2MB TLB anymore.
> If in doubt, say "N".
>
> +config DEBUG_RODATA_TEST
> + tristate "Testcase for the DEBUG_RODATA feature"
> + depends on DEBUG_RODATA && m
> + help
> +   This option enables a testcase for the DEBUG_RODATA
> +   feature as well as for the change_page_attr() infrastructure.
> +   If in doubt, say "N"
> +
>  config 4KSTACKS
>   bool "Use 4Kb for kernel stacks instead of 8Kb"
>   depends on DEBUG_KERNEL
> Index: linux-2.6.24-rc8/arch/x86/mm/init_32.c
> ===
> --- linux-2.6.24-rc8.orig/arch/x86/mm/init_32.c
> +++ linux-2.6.24-rc8/arch/x86/mm/init_32.c
> @@ -790,6 +790,9 @@ static int noinline do_test_wp_bit(void)
>
>  #ifdef CONFIG_DEBUG_RODATA
>
> +const int rodata_test_data;
> +EXPORT_SYMBOL_GPL(rodata_test_data);
> +
>  void mark_rodata_ro(void)
>  {
>   unsigned long start = PFN_ALIGN(_text);
> Index: linux-2.6.24-rc8/arch/x86/mm/init_64.c
> ===
> --- linux-2.6.24-rc8.orig/arch/x86/mm/init_64.c
> +++ linux-2.6.24-rc8/arch/x86/mm/init_64.c
> @@ -590,6 +590,9 @@ void free_initmem(void)
>
>  #ifdef CONFIG_DEBUG_RODATA
>
> +const int rodata_test_data = 5;

I guess this should match the 32-bit case, and be zero instead of
5?

Can you disallow building as a module, and put this in the test
code? It could be run from the end of mark_rodata_ro()...


Re: 2.6.24 regression: pan hanging unkilleable and un-straceable

2008-01-22 Thread Nick Piggin
On Tuesday 22 January 2008 21:37, Ingo Molnar wrote:
> * Nick Piggin <[EMAIL PROTECTED]> wrote:
> > Well I've twice tried to submit a patch to print stacks for running
> > tasks as well, but nobody seems interested. It would at least give a
> > chance to see something.
>
> i definitely remember having done this myself a couple of times (it
> makes tons of sense to get _some_ info out of the system) but some
> problem in -mm kept reverting it. I don't remember the specifics ... it
> was some race.

Hmm, that's not unlikely. But there is nothing in the backtrace code
which prevents a task from being woken up anyway, is there? I guess
it will be more common now, but if we find a race we can try to fix
the root cause.


Re: [PATCH RESEND] Minimal fix for private_list handling races

2008-01-22 Thread Nick Piggin
On Wednesday 23 January 2008 04:10, Jan Kara wrote:
>   Hi,
>
>   as I got no answer for a week, I'm resending this fix for races in
> private_list handling. Andrew, do you like them more than the previous
> version?

FWIW, I reviewed this, and it looks OK although I think some comments
would be in order.

What would be really nice is to avoid the use of b_assoc_buffers
completely in this function like I've attempted (untested). I don't
know if you'd actually call that an improvement...?

Couple of things I noticed while looking at this code.

- What is osync_buffers_list supposed to do? I couldn't actually
  work it out. Why do we care about waiting on buffers that were
  added here while we were waiting for writeout of other buffers
  to finish? Can we just remove it completely? I must be missing
  something.

- What are the get_bh(bh) things supposed to do? Protect the lifetime
  of a given bh while "lock" is dropped? That's nice, ignoring the
  fact that we brelse(bh) *before* taking the lock again... but isn't
  every single other buffer whose reference we _haven't_ elevated
  exposed to exactly the same lifetime problem? IOW, either it is not
  required at all, or it is required for _all_ buffers? (my patch
  should fix this).

Hmm, now I remember why I rewrote this file :P
Index: linux-2.6/fs/buffer.c
===
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -792,47 +792,53 @@ EXPORT_SYMBOL(__set_page_dirty_buffers);
  */
 static int fsync_buffers_list(spinlock_t *lock, struct list_head *list)
 {
+	struct buffer_head *batch[16];
+	int i, idx, done;
 	struct buffer_head *bh;
-	struct list_head tmp;
 	int err = 0, err2;
 
-	INIT_LIST_HEAD(&tmp);
-
+again:
 	spin_lock(lock);
+	idx = 0;
 	while (!list_empty(list)) {
 		bh = BH_ENTRY(list->next);
 		__remove_assoc_queue(bh);
 		if (buffer_dirty(bh) || buffer_locked(bh)) {
-			list_add(&bh->b_assoc_buffers, &tmp);
-			if (buffer_dirty(bh)) {
-				get_bh(bh);
-				spin_unlock(lock);
-				/*
-				 * Ensure any pending I/O completes so that
-				 * ll_rw_block() actually writes the current
-				 * contents - it is a noop if I/O is still in
-				 * flight on potentially older contents.
-				 */
-				ll_rw_block(SWRITE, 1, &bh);
-				brelse(bh);
-				spin_lock(lock);
-			}
+			batch[idx++] = bh;
+			get_bh(bh);
 		}
+
+		if (idx == 16)
+			break;
 	}
+	done = list_empty(list);
+	spin_unlock(lock);
 
-	while (!list_empty(&tmp)) {
-		bh = BH_ENTRY(tmp.prev);
-		list_del_init(&bh->b_assoc_buffers);
-		get_bh(bh);
-		spin_unlock(lock);
+	for (i = 0; i < idx; i++) {
+		bh = batch[i];
+		if (buffer_dirty(bh)) {
+			/*
+			 * Ensure any pending I/O completes so
+			 * that ll_rw_block() actually writes
+			 * the current contents - it is a noop
+			 * if I/O is still in flight on
+			 * potentially older contents.
+			 */
+			ll_rw_block(SWRITE, 1, &bh);
+		}
+	}
+	for (i = 0; i < idx; i++) {
+		bh = batch[i];
 		wait_on_buffer(bh);
 		if (!buffer_uptodate(bh))
 			err = -EIO;
 		brelse(bh);
-		spin_lock(lock);
 	}
+
+	idx = 0;
+	if (!done)
+		goto again;
 	
-	spin_unlock(lock);
 	err2 = osync_buffers_list(lock, list);
 	if (err)
 		return err;
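
The batching pattern in the patch above can be illustrated outside the
kernel. This is a single-threaded sketch under stated assumptions: the lock
is a plain flag that only asserts correct pairing, refs stands in for the
buffer_head reference count, and write_item() stands in for the
ll_rw_block()/wait_on_buffer() step. None of these names exist in
fs/buffer.c:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define BATCH_MAX 16

/* Hypothetical stand-in for a buffer_head on a private list. */
struct item {
	struct item *next;
	int refs;       /* stand-in for the reference count */
	bool dirty;     /* stand-in for buffer_dirty() */
	bool written;   /* set by the slow step */
};

struct queue {
	struct item *head;
	bool locked;    /* stand-in for the spinlock; single-threaded sketch */
};

static void queue_lock(struct queue *q)   { assert(!q->locked); q->locked = true; }
static void queue_unlock(struct queue *q) { assert(q->locked);  q->locked = false; }

/* Slow step: must never run with the lock held. */
static void write_item(struct queue *q, struct item *it)
{
	assert(!q->locked);
	it->written = true;
	it->dirty = false;
}

/* Flush the queue in batches; returns the number of items written. */
static int flush_queue(struct queue *q)
{
	struct item *batch[BATCH_MAX];
	int flushed = 0;
	bool done;

	do {
		int i, n = 0;

		queue_lock(q);
		while (q->head != NULL && n < BATCH_MAX) {
			struct item *it = q->head;

			q->head = it->next;     /* unlink from the list */
			it->next = NULL;
			if (it->dirty) {
				it->refs++;     /* pin before the lock is dropped */
				batch[n++] = it;
			}
		}
		done = (q->head == NULL);
		queue_unlock(q);

		/* Every item in the batch is pinned, so none can vanish here. */
		for (i = 0; i < n; i++) {
			write_item(q, batch[i]);
			batch[i]->refs--;       /* drop the pin taken above */
			flushed++;
		}
	} while (!done);

	return flushed;
}
```

The point of the sketch is that every item touched after the lock is
dropped holds an elevated reference, not just the one currently being
written.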


Re: [PATCH] ppc: fix #ifdef-s in mediabay driver

2008-01-22 Thread Bartlomiej Zolnierkiewicz

Hi,

On Wednesday 23 January 2008, Benjamin Herrenschmidt wrote:
> 
> On Wed, 2008-01-23 at 00:12 +0100, Bartlomiej Zolnierkiewicz wrote:
> > * Replace incorrect CONFIG_BLK_DEV_IDE #ifdef in
> >   check_media_bay() by CONFIG_MAC_FLOPPY one.
> > 
> > * Replace incorrect CONFIG_BLK_DEV_IDE #ifdef-s by
> >   CONFIG_BLK_DEV_IDE_PMAC ones.
> > 
> > * check_media_bay() is used only by drivers/block/swim3.c
> >   so make this function available only if CONFIG_MAC_FLOPPY
> >   is defined.
> > 
> > * check_media_bay_by_base() and media_bay_set_ide_infos()
> >   are used only by drivers/ide/ppc/pmac.c so make these
> >   functions available only if CONFIG_MAC_FLOPPY is defined.
> > 
> > Signed-off-by: Bartlomiej Zolnierkiewicz <[EMAIL PROTECTED]>
> > ---
> > Ben, IMO this patch is safe for 2.6.24 (assuming that it builds fine :),
> > otherwise I would like to ask for permission to merge it through IDE
> > tree since I have other pending IDE patches depending on this one.
> 
> I'd rather avoid touching 2.6.24 unless it actually fixes a bug or
> regression...

Well, it is a bugfix for PMAC_MEDIABAY=y && BLK_DEV_IDE=n && MAC_FLOPPY=y. :)

> I'm tempted to actually remove all ifdef's ... if you have a media-bay,
> then there are about 99% chances it contains an IDE device, with the
> remaining percent being split with putting a floppy or a battery in. I
> doubt anybody will care building a kernel without the support for these
> and with the mediabay support, and still want to save a handful of bytes
> in that driver.

I'm more worried about breaking automatic build checking (make randconfig)
than a few extra bytes so if you remove all #ifdefs you'll have to either
make BLK_DEV_IDE_PMAC select PMAC_MEDIABAY or make PMAC_MEDIABAY depend
on BLK_DEV_IDE_PMAC (otherwise BLK_DEV_IDE=n && PMAC_MEDIABAY=y will fail
since mediabay.c is referencing IDE code).

Thanks,
Bart


[PATCH 27/27] NFS: Separate caching by superblock, explicitly if necessary

2008-01-22 Thread David Howells
Separate caching by superblock, explicitly if necessary.  This means mounts of
the same remote data with different parameters do not share cache objects for
common files.  The administrator may also provide a uniquifier to further
enhance the uniqueness.

Where it is otherwise impossible to distinguish superblocks because all the
parameters are identical, but the 'nosharecache' option is supplied, a
uniquifying string must be supplied, else only the first mount will be
permitted to use the cache.

If there's a key collision, then the second mount will disable caching and give
a warning into the kernel log.

There are three variant NFS mount options that can be added to a mount command
to control caching for a mount.  Only the last one specified takes effect:

 (*) Adding "fsc" will request caching.

 (*) Adding "fsc=" will request caching and also specify a uniquifier.

 (*) Adding "nofsc" will disable caching.
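
As an illustration of the "last one specified takes effect" rule, here is a
small userspace sketch. It is not the kernel's option parser: struct
fsc_choice and parse_fsc_options() are hypothetical, and treating a later
bare "fsc" as clearing an earlier uniquifier is an assumption:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result of scanning a comma-separated mount option string. */
struct fsc_choice {
	bool enabled;   /* true if caching was requested by the last option */
	char uniq[64];  /* uniquifier from "fsc=...", empty if none */
};

static void parse_fsc_options(const char *opts, struct fsc_choice *out)
{
	assert(opts != NULL && out != NULL);
	out->enabled = false;
	out->uniq[0] = '\0';

	while (*opts) {
		size_t len = strcspn(opts, ",");   /* length of this option */

		if (len == 3 && memcmp(opts, "fsc", 3) == 0) {
			out->enabled = true;
			out->uniq[0] = '\0';
		} else if (len >= 4 && memcmp(opts, "fsc=", 4) == 0) {
			size_t ulen = len - 4;

			if (ulen >= sizeof(out->uniq))
				ulen = sizeof(out->uniq) - 1;  /* truncate */
			memcpy(out->uniq, opts + 4, ulen);
			out->uniq[ulen] = '\0';
			out->enabled = true;
		} else if (len == 5 && memcmp(opts, "nofsc", 5) == 0) {
			out->enabled = false;
			out->uniq[0] = '\0';
		}
		/* Any other option is ignored by this sketch. */

		opts += len;
		if (*opts == ',')
			opts++;
	}
}
```

Each matching option simply overwrites the previous decision, so scanning
left to right leaves the last one in effect.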

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 fs/nfs/fscache-def.c  |   33 
 fs/nfs/fscache.c  |  122 -
 fs/nfs/fscache.h  |   46 -
 fs/nfs/internal.h |3 +
 fs/nfs/super.c|   24 +++--
 include/linux/nfs_fs_sb.h |3 +
 6 files changed, 220 insertions(+), 11 deletions(-)


diff --git a/fs/nfs/fscache-def.c b/fs/nfs/fscache-def.c
index bc20b7d..1d10b4e 100644
--- a/fs/nfs/fscache-def.c
+++ b/fs/nfs/fscache-def.c
@@ -117,6 +117,39 @@ const struct fscache_cookie_def nfs_cache_server_index_def = {
 };
 
 /*
+ * Generate a key to describe a superblock key in the main NFS index
+ */
+static uint16_t nfs_super_get_key(const void *cookie_netfs_data,
+ void *buffer, uint16_t bufmax)
+{
+   const struct nfs_fscache_key *key;
+   const struct nfs_server *nfss = cookie_netfs_data;
+   uint16_t len;
+
+   key = nfss->fscache_key;
+   len = sizeof(key->key) + key->key.uniq_len;
+   if (len > bufmax) {
+   len = 0;
+   } else {
+   memcpy(buffer, &key->key, sizeof(key->key));
+   memcpy(buffer + sizeof(key->key),
+  key->key.uniquifier, key->key.uniq_len);
+   }
+
+   return len;
+}
+
+/*
+ * The superblock index for the filesystem is defined by all the NFS parameters
+ * that might cause a separate superblock
+ */
+const struct fscache_cookie_def nfs_cache_super_index_def = {
+   .name   = "NFS.supers",
+   .type   = FSCACHE_COOKIE_TYPE_INDEX,
+   .get_key= nfs_super_get_key,
+};
+
+/*
  * Generate a key to describe an NFS inode in an NFS server's index
  */
 static uint16_t nfs_fh_get_key(const void *cookie_netfs_data,
diff --git a/fs/nfs/fscache.c b/fs/nfs/fscache.c
index 465f961..af9c65c 100644
--- a/fs/nfs/fscache.c
+++ b/fs/nfs/fscache.c
@@ -23,6 +23,9 @@
 
 #define NFSDBG_FACILITY		NFSDBG_FSCACHE
 
+static struct rb_root nfs_fscache_keys = RB_ROOT;
+static DEFINE_SPINLOCK(nfs_fscache_keys_lock);
+
 /*
  * Get the per-client index cookie for an NFS client if the appropriate mount
  * flag was set
@@ -52,6 +55,118 @@ void nfs_fscache_release_client_cookie(struct nfs_client *clp)
 }
 
 /*
+ * get a cookie for a superblock
+ */
+void nfs_fscache_get_super_cookie(struct super_block *sb,
+ struct nfs_parsed_mount_data *data)
+{
+   struct nfs_fscache_key *key, *xkey;
+   struct nfs_server *nfss = NFS_SB(sb);
+   struct rb_node **p, *parent;
+   const char *uniq = data->fscache_uniq ?: "";
+   int diff, ulen;
+
+   ulen = strlen(uniq);
+   key = kzalloc(sizeof(*key) + ulen, GFP_KERNEL);
+   if (!key)
+   return;
+
+   key->nfs_client = nfss->nfs_client;
+   key->key.super.s_flags = sb->s_flags & NFS_MS_MASK;
+   key->key.nfs_server.flags = nfss->flags;
+   key->key.nfs_server.rsize = nfss->rsize;
+   key->key.nfs_server.wsize = nfss->wsize;
+   key->key.nfs_server.acregmin = nfss->acregmin;
+   key->key.nfs_server.acregmax = nfss->acregmax;
+   key->key.nfs_server.acdirmin = nfss->acdirmin;
+   key->key.nfs_server.acdirmax = nfss->acdirmax;
+   key->key.nfs_server.fsid = nfss->fsid;
+   key->key.rpc_auth.au_flavor = nfss->client->cl_auth->au_flavor;
+
+   key->key.uniq_len = ulen;
+   memcpy(key->key.uniquifier, uniq, ulen);
+
+   spin_lock(&nfs_fscache_keys_lock);
+   p = &nfs_fscache_keys.rb_node;
+   parent = NULL;
+   while (*p) {
+   parent = *p;
+   xkey = rb_entry(parent, struct nfs_fscache_key, node);
+
+   if (key->nfs_client < xkey->nfs_client)
+   goto go_left;
+   if (key->nfs_client > xkey->nfs_client)
+   goto go_right;
+
+   diff = memcmp(&key->key, &xkey->key, sizeof(key->key));
+   if (diff < 0)
+   goto go_left;
+   if (diff > 0)
+ 

[PATCH 26/27] NFS: Display local caching state

2008-01-22 Thread David Howells
Display the local caching state in /proc/fs/nfsfs/volumes.

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 fs/nfs/client.c  |7 ---
 fs/nfs/fscache.h |   15 +++
 2 files changed, 19 insertions(+), 3 deletions(-)


diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index 92f9b84..68d3124 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -1335,7 +1335,7 @@ static int nfs_volume_list_show(struct seq_file *m, void *v)
 
/* display header on line 1 */
if (v == &nfs_volume_list) {
-   seq_puts(m, "NV SERVER   PORT DEV FSID\n");
+   seq_puts(m, "NV SERVER   PORT DEV FSID  FSC\n");
return 0;
}
/* display one transport per line on subsequent lines */
@@ -1349,12 +1349,13 @@ static int nfs_volume_list_show(struct seq_file *m, void *v)
 (unsigned long long) server->fsid.major,
 (unsigned long long) server->fsid.minor);
 
-   seq_printf(m, "v%d %02x%02x%02x%02x %4hx %-7s %-17s\n",
+   seq_printf(m, "v%d %02x%02x%02x%02x %4hx %-7s %-17s %s\n",
   clp->cl_nfsversion,
   NIPQUAD(clp->cl_addr.sin_addr),
   ntohs(clp->cl_addr.sin_port),
   dev,
-  fsid);
+  fsid,
+  nfs_server_fscache_state(server));
 
return 0;
 }
diff --git a/fs/nfs/fscache.h b/fs/nfs/fscache.h
index 144fb58..9a735fc 100644
--- a/fs/nfs/fscache.h
+++ b/fs/nfs/fscache.h
@@ -53,6 +53,17 @@ extern void __nfs_fscache_invalidate_page(struct page *, struct inode *);
 extern int nfs_fscache_release_page(struct page *, gfp_t);
 
 /*
+ * indicate the client caching state as readable text
+ */
+static inline const char *nfs_server_fscache_state(struct nfs_server *server)
+{
+   if (server->nfs_client->fscache &&
+   (server->options & NFS_OPTION_FSCACHE))
+   return "yes";
+   return "no ";
+}
+
+/*
  * release the caching state associated with a page if undergoing complete page
  * invalidation
  */
@@ -109,6 +120,10 @@ static inline void nfs4_fscache_get_client_cookie(struct nfs_client *clp) {}
 static inline void nfs_fscache_release_client_cookie(struct nfs_client *clp) {}
 static inline void nfs_fscache_show_stats(struct seq_file *m,
  struct nfs_server *nfss) {}
+static inline const char *nfs_server_fscache_state(struct nfs_server *server)
+{
+   return "no ";
+}
 
 static inline void nfs_fscache_init_fh_cookie(struct inode *inode) {}
 static inline void nfs_fscache_enable_fh_cookie(struct inode *inode) {}



[PATCH 25/27] NFS: Configuration and mount option changes to enable local caching on NFS

2008-01-22 Thread David Howells
Changes to the kernel configuration definitions and to the NFS mount options to
allow the local caching support added by the previous patch to be enabled.

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 fs/Kconfig|8 
 fs/nfs/client.c   |2 ++
 fs/nfs/internal.h |1 +
 fs/nfs/super.c|   14 ++
 4 files changed, 25 insertions(+), 0 deletions(-)


diff --git a/fs/Kconfig b/fs/Kconfig
index e0eedf9..8352dc7 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -1650,6 +1650,14 @@ config NFS_V4
 
  If unsure, say N.
 
+config NFS_FSCACHE
+   bool "Provide NFS client caching support (EXPERIMENTAL)"
+   depends on EXPERIMENTAL
+   depends on NFS_FS=m && FSCACHE || NFS_FS=y && FSCACHE=y
+   help
+ Say Y here if you want NFS data to be cached locally on disc through
+ the general filesystem cache manager
+
 config NFS_DIRECTIO
bool "Allow direct I/O on NFS files"
depends on NFS_FS
diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index bcdc5d0..92f9b84 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -572,6 +572,7 @@ static int nfs_init_server(struct nfs_server *server,
 
/* Initialise the client representation from the mount data */
server->flags = data->flags & NFS_MOUNT_FLAGMASK;
+   server->options = data->options;
 
if (data->rsize)
server->rsize = nfs_block_size(data->rsize, NULL);
@@ -931,6 +932,7 @@ static int nfs4_init_server(struct nfs_server *server,
/* Initialise the client representation from the mount data */
server->flags = data->flags & NFS_MOUNT_FLAGMASK;
server->caps |= NFS_CAP_ATOMIC_OPEN;
+   server->options = data->options;
 
if (data->rsize)
server->rsize = nfs_block_size(data->rsize, NULL);
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index f3acf48..ef09e00 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -35,6 +35,7 @@ struct nfs_parsed_mount_data {
int acregmin, acregmax,
acdirmin, acdirmax;
int namlen;
+   unsigned int options;
unsigned int bsize;
unsigned int auth_flavor_len;
rpc_authflavor_t auth_flavors[1];
diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index 6dd628f..0542550 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -74,6 +74,7 @@ enum {
Opt_acl, Opt_noacl,
Opt_rdirplus, Opt_nordirplus,
Opt_sharecache, Opt_nosharecache,
+   Opt_fscache, Opt_nofscache,
 
/* Mount options that take integer arguments */
Opt_port,
@@ -123,6 +124,8 @@ static match_table_t nfs_mount_option_tokens = {
{ Opt_nordirplus, "nordirplus" },
{ Opt_sharecache, "sharecache" },
{ Opt_nosharecache, "nosharecache" },
+   { Opt_fscache, "fsc" },
+   { Opt_nofscache, "nofsc" },
 
{ Opt_port, "port=%u" },
{ Opt_rsize, "rsize=%u" },
@@ -459,6 +462,8 @@ static void nfs_show_mount_options(struct seq_file *m, struct nfs_server *nfss,
seq_printf(m, ",timeo=%lu", 10U * clp->retrans_timeo / HZ);
seq_printf(m, ",retrans=%u", clp->retrans_count);
seq_printf(m, ",sec=%s", 
nfs_pseudoflavour_to_name(nfss->client->cl_auth->au_flavor));
+   if (nfss->options & NFS_OPTION_FSCACHE)
+   seq_printf(m, ",fsc");
 }
 
 /*
@@ -697,6 +702,15 @@ static int nfs_parse_mount_options(char *raw,
break;
case Opt_nosharecache:
mnt->flags |= NFS_MOUNT_UNSHARED;
+   mnt->options &= ~NFS_OPTION_FSCACHE;
+   break;
+   case Opt_fscache:
+   /* sharing is mandatory with fscache */
+   mnt->options |= NFS_OPTION_FSCACHE;
+   mnt->flags &= ~NFS_MOUNT_UNSHARED;
+   break;
+   case Opt_nofscache:
+   mnt->options &= ~NFS_OPTION_FSCACHE;
break;
 
case Opt_port:



[PATCH 24/27] NFS: Use local caching

2008-01-22 Thread David Howells
The attached patch makes it possible for the NFS filesystem to make use of the
network filesystem local caching service (FS-Cache).

To be able to use this, an updated mount program is required.  This can be
obtained from:

http://people.redhat.com/steved/fscache/util-linux/

To mount an NFS filesystem to use caching, add an "fsc" option to the mount:

mount warthog:/ /a -o fsc

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 fs/nfs/Makefile   |1 
 fs/nfs/client.c   |5 +
 fs/nfs/file.c |   37 
 fs/nfs/fscache-def.c  |  289 +
 fs/nfs/fscache.c  |  391 +
 fs/nfs/fscache.h  |  148 +
 fs/nfs/inode.c|   47 +
 fs/nfs/read.c |   28 +++
 fs/nfs/super.c|3 
 fs/nfs/sysctl.c   |1 
 include/linux/nfs_fs.h|9 +
 include/linux/nfs_fs_sb.h |   18 ++
 12 files changed, 968 insertions(+), 9 deletions(-)
 create mode 100644 fs/nfs/fscache-def.c
 create mode 100644 fs/nfs/fscache.c
 create mode 100644 fs/nfs/fscache.h


diff --git a/fs/nfs/Makefile b/fs/nfs/Makefile
index df0f41e..073d04c 100644
--- a/fs/nfs/Makefile
+++ b/fs/nfs/Makefile
@@ -16,3 +16,4 @@ nfs-$(CONFIG_NFS_V4)  += nfs4proc.o nfs4xdr.o nfs4state.o nfs4renewd.o \
   nfs4namespace.o
 nfs-$(CONFIG_NFS_DIRECTIO) += direct.o
 nfs-$(CONFIG_SYSCTL) += sysctl.o
+nfs-$(CONFIG_NFS_FSCACHE) += fscache.o fscache-def.o
diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index a6f6254..bcdc5d0 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -43,6 +43,7 @@
 #include "delegation.h"
 #include "iostat.h"
 #include "internal.h"
+#include "fscache.h"
 
 #define NFSDBG_FACILITY		NFSDBG_CLIENT
 
@@ -139,6 +140,8 @@ static struct nfs_client *nfs_alloc_client(const char *hostname,
clp->cl_state = 1 << NFS4CLNT_LEASE_EXPIRED;
 #endif
 
+   nfs_fscache_get_client_cookie(clp);
+
return clp;
 
 error_3:
@@ -170,6 +173,8 @@ static void nfs_free_client(struct nfs_client *clp)
 
nfs4_shutdown_client(clp);
 
+   nfs_fscache_release_client_cookie(clp);
+
/* -EIO all pending I/O */
if (!IS_ERR(clp->cl_rpcclient))
rpc_shutdown_client(clp->cl_rpcclient);
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index b3bb89f..d492cd7 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -35,6 +35,7 @@
 #include "delegation.h"
 #include "internal.h"
 #include "iostat.h"
+#include "fscache.h"
 
 #define NFSDBG_FACILITY		NFSDBG_FILE
 
@@ -352,22 +353,48 @@ static int nfs_write_end(struct file *file, struct address_space *mapping,
return status < 0 ? status : copied;
 }
 
+/*
+ * Partially or wholly invalidate a page
+ * - Release the private state associated with a page if undergoing complete
+ *   page invalidation
+ * - Called if either PG_private or PG_fscache set on the page
+ * - Caller holds page lock
+ */
 static void nfs_invalidate_page(struct page *page, unsigned long offset)
 {
if (offset != 0)
return;
/* Cancel any unstarted writes on this page */
nfs_wb_page_cancel(page->mapping->host, page);
+
+   nfs_fscache_invalidate_page(page, page->mapping->host);
 }
 
+/*
+ * Release the private state associated with a page
+ * - Called if either PG_private or PG_fscache set on the page
+ * - Caller holds page lock
+ * - Return true (may release) or false (may not)
+ */
 static int nfs_release_page(struct page *page, gfp_t gfp)
 {
/* If PagePrivate() is set, then the page is not freeable */
-   return 0;
+   if (PagePrivate(page))
+   return 0;
+   return nfs_fscache_release_page(page, gfp);
 }
 
+/*
+ * Attempt to clear the private state associated with a page when an error
+ * occurs that requires the cached contents of an inode to be written back or
+ * destroyed
+ * - Called if either PG_private or PG_fscache set on the page
+ * - Caller holds page lock
+ * - Return 0 if successful, -error otherwise
+ */
 static int nfs_launder_page(struct page *page)
 {
+   wait_on_page_fscache_write(page);
return nfs_wb_page(page->mapping->host, page);
 }
 
@@ -387,6 +414,11 @@ const struct address_space_operations nfs_file_aops = {
.launder_page = nfs_launder_page,
 };
 
+/*
+ * Notification that a PTE pointing to an NFS page is about to be made
+ * writable, implying that someone is about to modify the page through a
+ * shared-writable mapping
+ */
 static int nfs_vm_page_mkwrite(struct vm_area_struct *vma, struct page *page)
 {
struct file *filp = vma->vm_file;
@@ -396,6 +428,9 @@ static int nfs_vm_page_mkwrite(struct vm_area_struct *vma, struct page *page)
struct address_space *mapping;
loff_t offset;
 
+   /* make sure the cache has finished storing the page */
+   wait_on_page_fscache_write(page);
+
lock_page(page);

[PATCH 23/27] NFS: Fix memory leak

2008-01-22 Thread David Howells
Fix a memory leak whereby multiple clientaddr=xxx mount options just overwrite
the duplicated client_address option pointer, without freeing the old memory.
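
The pattern behind the fix can be sketched in plain C, with malloc()/free()
standing in for match_strdup()/kfree(). struct mount_data and
set_client_address() are hypothetical names used only for illustration:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the parsed mount data. */
struct mount_data {
	char *client_address;   /* owned heap string, or NULL */
};

/*
 * Store a private copy of value, releasing whatever an earlier
 * occurrence of the option left behind. Returns 0 on success, -1 if
 * the copy could not be allocated (the old value is then kept).
 */
static int set_client_address(struct mount_data *mnt, const char *value)
{
	char *copy;

	assert(mnt != NULL && value != NULL);
	copy = malloc(strlen(value) + 1);
	if (copy == NULL)
		return -1;
	strcpy(copy, value);

	free(mnt->client_address);      /* free(NULL) is a no-op */
	mnt->client_address = copy;
	return 0;
}
```

Without the free() call, each repeated option would orphan the previous
allocation, which is the leak the one-line patch closes.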

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 fs/nfs/super.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)


diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index 0b0c72a..7f5e747 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -936,6 +936,7 @@ static int nfs_parse_mount_options(char *raw,
string = match_strdup(args);
if (string == NULL)
goto out_nomem;
+   kfree(mnt->client_address);
mnt->client_address = string;
break;
case Opt_mountaddr:

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 21/27] CacheFiles: Export things for CacheFiles

2008-01-22 Thread David Howells
Export a number of functions for CacheFiles's use.

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 fs/super.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)


diff --git a/fs/super.c b/fs/super.c
index ceaf2e3..cd199ae 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -266,6 +266,7 @@ int fsync_super(struct super_block *sb)
__fsync_super(sb);
return sync_blockdev(sb->s_bdev);
 }
+EXPORT_SYMBOL_GPL(fsync_super);
 
 /**
  * generic_shutdown_super  -   common helper for ->kill_sb()



[PATCH 20/27] CacheFiles: Permit the page lock state to be monitored

2008-01-22 Thread David Howells
Add a function to install a monitor on the page lock waitqueue for a particular
page, thus allowing the page being unlocked to be detected.

This is used by CacheFiles to detect read completion on a page in the backing
filesystem so that it can then copy the data to the waiting netfs page.
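
As a rough illustration of the idea, here is a single-threaded userspace sketch of a monitor that is notified when a page is unlocked. All names (struct waiter, struct page_sim, page_add_waiter(), page_unlock()) are hypothetical, and the spinlock that the real function takes around the queue is left out on the assumption that only one thread touches the structure.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; the kernel types are wait_queue_t etc. */
struct waiter {
	void (*func)(struct waiter *w);	/* called when the page is unlocked */
	struct waiter *next;
	int fired;
};

struct page_sim {
	int locked;
	struct waiter *waiters;
};

/* Analogue of add_page_wait_queue(): hook a monitor onto the page. */
static void page_add_waiter(struct page_sim *page, struct waiter *w)
{
	w->next = page->waiters;
	page->waiters = w;
}

/* Unlock the page and notify every registered monitor. */
static void page_unlock(struct page_sim *page)
{
	struct waiter *w;

	page->locked = 0;
	for (w = page->waiters; w; w = w->next)
		w->func(w);
}

/* Example monitor: just records that read completion was seen. */
static void note_read_done(struct waiter *w)
{
	w->fired = 1;
}
```

In the CacheFiles case described above, the monitor callback would be the point at which the data is copied to the waiting netfs page.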

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 include/linux/pagemap.h |5 +
 mm/filemap.c|   18 ++
 2 files changed, 23 insertions(+), 0 deletions(-)


diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index f9e0f81..e9f37b3 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -223,6 +223,11 @@ static inline void wait_on_page_owner_priv_2(struct page *page)
 extern void end_page_owner_priv_2(struct page *page);
 
 /*
+ * Add an arbitrary waiter to a page's wait queue
+ */
+extern void add_page_wait_queue(struct page *page, wait_queue_t *waiter);
+
+/*
  * Fault a userspace page into pagetables.  Return non-zero on a fault.
  *
  * This assumes that two userspace pages are always sufficient.  That's
diff --git a/mm/filemap.c b/mm/filemap.c
index ed52b0b..4d50623 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -533,6 +533,24 @@ void fastcall wait_on_page_bit(struct page *page, int bit_nr)
 EXPORT_SYMBOL(wait_on_page_bit);
 
 /**
+ * add_page_wait_queue - Add an arbitrary waiter to a page's wait queue
+ * @page: Page defining the wait queue of interest
+ * @waiter: Waiter to add to the queue
+ *
+ * Add an arbitrary @waiter to the wait queue for the nominated @page.
+ */
+void add_page_wait_queue(struct page *page, wait_queue_t *waiter)
+{
+   wait_queue_head_t *q = page_waitqueue(page);
+   unsigned long flags;
+
+   spin_lock_irqsave(&q->lock, flags);
+   __add_wait_queue(q, waiter);
+   spin_unlock_irqrestore(&q->lock, flags);
+}
+EXPORT_SYMBOL_GPL(add_page_wait_queue);
+
+/**
  * unlock_page - unlock a locked page
  * @page: the page
  *



[PATCH 19/27] CacheFiles: Add a hook to write a single page of data to an inode

2008-01-22 Thread David Howells
Add an address space operation to write one single page of data to an inode at
a page-aligned location (thus permitting the implementation to be highly
optimised).  The data source is a single page.

This is used by CacheFiles to store the contents of netfs pages into their
backing file pages.

Supply a generic implementation for this that uses the write_begin() and
write_end() address_space operations to bind a copy directly into the page
cache.
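
The semantics can be sketched in userspace with a byte buffer standing in for the file. This is only an illustration under stated assumptions: struct file_sim, SIM_PAGE_SIZE and write_one_page_sim() are hypothetical, the copy is a plain memcpy() rather than the write_begin()/write_end() sequence, and returning 0 for a page wholly beyond EOF is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SIM_PAGE_SIZE 16	/* tiny page size, chosen only for the demo */

/* Hypothetical in-memory "file": a byte buffer plus its current size. */
struct file_sim {
	unsigned char data[4 * SIM_PAGE_SIZE];
	size_t size;		/* EOF position in bytes */
};

/*
 * Copy one source page over the page at the given index.
 * Returns the number of bytes written.  The file is never extended, so a
 * page that straddles EOF is only partially written and a page entirely
 * beyond EOF writes nothing.
 */
static size_t write_one_page_sim(struct file_sim *f, size_t index,
				 const unsigned char *source)
{
	size_t start = index * SIM_PAGE_SIZE;
	size_t len = SIM_PAGE_SIZE;

	if (start >= f->size)
		return 0;
	if (start + len > f->size)
		len = f->size - start;
	memcpy(f->data + start, source, len);
	return len;
}
```

Because the target offset is always page aligned and the length is at most one page, the implementation never has to deal with a copy that spans two target pages.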

Hook the Ext2 and Ext3 operations to the generic implementation.

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 fs/ext2/inode.c|2 ++
 fs/ext3/inode.c|3 +++
 include/linux/fs.h |7 ++
 mm/filemap.c   |   61 
 4 files changed, 73 insertions(+), 0 deletions(-)


diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index b1ab32a..cfa56e6 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -796,6 +796,7 @@ const struct address_space_operations ext2_aops = {
.direct_IO  = ext2_direct_IO,
.writepages = ext2_writepages,
.migratepage= buffer_migrate_page,
+   .write_one_page = generic_file_buffered_write_one_page,
 };
 
 const struct address_space_operations ext2_aops_xip = {
@@ -814,6 +815,7 @@ const struct address_space_operations ext2_nobh_aops = {
.direct_IO  = ext2_direct_IO,
.writepages = ext2_writepages,
.migratepage= buffer_migrate_page,
+   .write_one_page = generic_file_buffered_write_one_page,
 };
 
 /*
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index bc918d3..435c684 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1780,6 +1780,7 @@ static const struct address_space_operations ext3_ordered_aops = {
.releasepage= ext3_releasepage,
.direct_IO  = ext3_direct_IO,
.migratepage= buffer_migrate_page,
+   .write_one_page = generic_file_buffered_write_one_page,
 };
 
 static const struct address_space_operations ext3_writeback_aops = {
@@ -1794,6 +1795,7 @@ static const struct address_space_operations ext3_writeback_aops = {
.releasepage= ext3_releasepage,
.direct_IO  = ext3_direct_IO,
.migratepage= buffer_migrate_page,
+   .write_one_page = generic_file_buffered_write_one_page,
 };
 
 static const struct address_space_operations ext3_journalled_aops = {
@@ -1807,6 +1809,7 @@ static const struct address_space_operations ext3_journalled_aops = {
.bmap   = ext3_bmap,
.invalidatepage = ext3_invalidatepage,
.releasepage= ext3_releasepage,
+   .write_one_page = generic_file_buffered_write_one_page,
 };
 
 void ext3_set_aops(struct inode *inode)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 850d3fc..a3c3369 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -479,6 +479,11 @@ struct address_space_operations {
int (*migratepage) (struct address_space *,
struct page *, struct page *);
int (*launder_page) (struct page *);
+   /* write the contents of the source page over the page at the specified
+* index in the target address space (the source page does not need to
+* be related to the target address space) */
+   int (*write_one_page)(struct address_space *, pgoff_t, struct page *);
+
 };
 
 /*
@@ -1801,6 +1806,8 @@ extern ssize_t generic_file_direct_write(struct kiocb *, const struct iovec *,
unsigned long *, loff_t, loff_t *, size_t, size_t);
 extern ssize_t generic_file_buffered_write(struct kiocb *, const struct iovec *,
unsigned long, loff_t, loff_t *, size_t, ssize_t);
+extern int generic_file_buffered_write_one_page(struct address_space *,
+   pgoff_t, struct page *);
 extern ssize_t do_sync_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos);
 extern ssize_t do_sync_write(struct file *filp, const char __user *buf, size_t len, loff_t *ppos);
 extern void do_generic_mapping_read(struct address_space *mapping,
diff --git a/mm/filemap.c b/mm/filemap.c
index f22801a..ed52b0b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2333,6 +2333,67 @@ generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov,
 }
 EXPORT_SYMBOL(generic_file_buffered_write);
 
+/**
+ * generic_file_buffered_write_one_page - Write a single page of data to an
+ * inode
+ * @mapping: The address space of the target inode
+ * @index: The target page in the target inode to fill
+ * @source: The data to write into the target page
+ *
+ * Write the data from the source page to the page in the nominated address
+ * space at the @index specified.  Note that the file will not be extended if
+ * the page crosses the EOF marker, in which case only the first part of the
+ * page will be written.
+ *
+ * The @source page does not need to have any association 

[PATCH 18/27] CacheFiles: Be consistent about the use of mapping vs file->f_mapping in Ext3

2008-01-22 Thread David Howells
Change all the usages of file->f_mapping in ext3_*write_end() functions to use
the mapping argument directly.  This has two consequences:

 (*) Consistency.  Without this patch sometimes one is used and sometimes the
 other is.

 (*) A NULL file pointer can be passed.  This feature is then made use of by
 the generic hook in the next patch, which is used by CacheFiles to write
 pages to a file without setting up a file struct.
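
The second point can be shown with a reduced userspace sketch. The structures below (struct inode_sim, struct mapping_sim, struct file_stub) are hypothetical cut-down versions of the kernel ones, and write_end_inode() only models the inode lookup, not the rest of a write_end() operation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily reduced versions of the kernel structures. */
struct inode_sim { long i_size; };
struct mapping_sim { struct inode_sim *host; };
struct file_stub { struct mapping_sim *f_mapping; };

/*
 * write_end-style helper: the inode is reached through the mapping
 * argument, so file is never dereferenced and may be NULL.
 */
static struct inode_sim *write_end_inode(struct file_stub *file,
					 struct mapping_sim *mapping)
{
	(void)file;	/* deliberately unused */
	return mapping->host;
}
```

Had the helper used file->f_mapping->host instead, a caller passing a NULL file would fault on the first dereference.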

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 fs/ext3/inode.c |6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)


diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 9b162cd..bc918d3 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1227,7 +1227,7 @@ static int ext3_generic_write_end(struct file *file,
loff_t pos, unsigned len, unsigned copied,
struct page *page, void *fsdata)
 {
-   struct inode *inode = file->f_mapping->host;
+   struct inode *inode = mapping->host;
 
copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
 
@@ -1252,7 +1252,7 @@ static int ext3_ordered_write_end(struct file *file,
struct page *page, void *fsdata)
 {
handle_t *handle = ext3_journal_current_handle();
-   struct inode *inode = file->f_mapping->host;
+   struct inode *inode = mapping->host;
unsigned from, to;
int ret = 0, ret2;
 
@@ -1293,7 +1293,7 @@ static int ext3_writeback_write_end(struct file *file,
struct page *page, void *fsdata)
 {
handle_t *handle = ext3_journal_current_handle();
-   struct inode *inode = file->f_mapping->host;
+   struct inode *inode = mapping->host;
int ret = 0, ret2;
loff_t new_i_size;
 



[PATCH 17/27] CacheFiles: Add missing copy_page export for ia64

2008-01-22 Thread David Howells
This one-line patch fixes the missing export of copy_page introduced
by the cachefile patches.  This patch is not yet upstream, but is required
for cachefile on ia64.  It will be pushed upstream when cachefile goes
upstream.

Signed-off-by: Prarit Bhargava <[EMAIL PROTECTED]>
Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 arch/ia64/kernel/ia64_ksyms.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)


diff --git a/arch/ia64/kernel/ia64_ksyms.c b/arch/ia64/kernel/ia64_ksyms.c
index c3b4412..e64fd61 100644
--- a/arch/ia64/kernel/ia64_ksyms.c
+++ b/arch/ia64/kernel/ia64_ksyms.c
@@ -43,6 +43,7 @@ EXPORT_SYMBOL(__do_clear_user);
 EXPORT_SYMBOL(__strlen_user);
 EXPORT_SYMBOL(__strncpy_from_user);
 EXPORT_SYMBOL(__strnlen_user);
+EXPORT_SYMBOL(copy_page);
 
 /* from arch/ia64/lib */
 extern void __divsi3(void);



[PATCH 15/27] FS-Cache: Provide an add_wait_queue_tail() function

2008-01-22 Thread David Howells
Provide an add_wait_queue_tail() function to add a waiter to the back of a
wait queue instead of the front.
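
The difference between front and back insertion is easy to see on a toy queue. This is a sketch only: struct wq_entry, struct wq_head, wq_add_head() and wq_add_tail() are hypothetical, a singly linked list replaces the kernel's struct list_head, and no locking is done.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical minimal queue; the kernel uses struct list_head instead. */
struct wq_entry {
	int id;
	struct wq_entry *next;
};

struct wq_head {
	struct wq_entry *first;
	struct wq_entry *last;
};

/* Add at the front: this waiter will be visited first on wake-up. */
static void wq_add_head(struct wq_head *q, struct wq_entry *w)
{
	w->next = q->first;
	q->first = w;
	if (!q->last)
		q->last = w;
}

/* Add at the back: this waiter will be visited last on wake-up. */
static void wq_add_tail(struct wq_head *q, struct wq_entry *w)
{
	w->next = NULL;
	if (q->last)
		q->last->next = w;
	else
		q->first = w;
	q->last = w;
}
```

A wake-up that walks the queue from the front therefore reaches a tail-added waiter only after all the waiters that were added at the head.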

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 include/linux/wait.h |2 ++
 kernel/wait.c|   18 ++
 2 files changed, 20 insertions(+), 0 deletions(-)


diff --git a/include/linux/wait.h b/include/linux/wait.h
index 0e68628..f1038d0 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -118,6 +118,8 @@ static inline int waitqueue_active(wait_queue_head_t *q)
 #define is_sync_wait(wait) (!(wait) || ((wait)->private))
 
 extern void FASTCALL(add_wait_queue(wait_queue_head_t *q, wait_queue_t * wait));
+extern void FASTCALL(add_wait_queue_tail(wait_queue_head_t *q,
+wait_queue_t *wait));
 extern void FASTCALL(add_wait_queue_exclusive(wait_queue_head_t *q, wait_queue_t * wait));
 extern void FASTCALL(remove_wait_queue(wait_queue_head_t *q, wait_queue_t * wait));
 
diff --git a/kernel/wait.c b/kernel/wait.c
index 444ddbf..7acc9cc 100644
--- a/kernel/wait.c
+++ b/kernel/wait.c
@@ -29,6 +29,24 @@ void fastcall add_wait_queue(wait_queue_head_t *q, wait_queue_t *wait)
 }
 EXPORT_SYMBOL(add_wait_queue);
 
+/**
+ * add_wait_queue_tail - Add a waiter to the back of a waitqueue
+ * @q: the wait queue to append the waiter to
+ * @wait: the waiter to be queued
+ *
+ * Add a waiter to the back of a waitqueue so that it gets woken up last.
+ */
+void fastcall add_wait_queue_tail(wait_queue_head_t *q, wait_queue_t *wait)
+{
+   unsigned long flags;
+
+   wait->flags &= ~WQ_FLAG_EXCLUSIVE;
+   spin_lock_irqsave(&q->lock, flags);
+   __add_wait_queue_tail(q, wait);
+   spin_unlock_irqrestore(&q->lock, flags);
+}
+EXPORT_SYMBOL(add_wait_queue_tail);
+
 void fastcall add_wait_queue_exclusive(wait_queue_head_t *q, wait_queue_t *wait)
 {
unsigned long flags;



[PATCH 14/27] FS-Cache: Recruit a couple of page flags for cache management

2008-01-22 Thread David Howells
Recruit a couple of page flags to aid in cache management.  The following extra
flags are defined:

 (1) PG_fscache (PG_private_2)

 The marked page is backed by a local cache and is pinning resources in the
 cache driver.

 (2) PG_fscache_write (PG_owner_priv_2)

 The marked page is being written to the local cache.  The page may not be
 modified whilst this is in progress.

If PG_fscache is set, then things that checked for PG_private will now also
check for that.  This includes things like truncation and page invalidation.
The function page_has_private() has been added to make the checks for both
PG_private and PG_private_2 at the same time.
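
The combined check can be modelled in userspace as follows. The bit numbers are the ones given in the page-flags.h hunk of this patch; everything else (struct page_flags_sim, test_flag(), set_flag(), page_has_private_sim()) is hypothetical, the flag updates are not atomic as the kernel's set_bit() is, and the body of page_has_private_sim() is an assumption about the shape of the real helper, which is not shown in this excerpt.

```c
#include <assert.h>

/* Bit numbers as given in the page-flags.h hunk of this patch. */
#define PG_private       11
#define PG_private_2     13
#define PG_owner_priv_2  18
#define PG_fscache       PG_private_2
#define PG_fscache_write PG_owner_priv_2

/* Hypothetical reduced page: only the flags word is modelled. */
struct page_flags_sim { unsigned long flags; };

static int test_flag(const struct page_flags_sim *p, int nr)
{
	return (int)((p->flags >> nr) & 1UL);
}

static void set_flag(struct page_flags_sim *p, int nr)
{
	p->flags |= 1UL << nr;
}

/* Assumed shape of page_has_private(): true if either bit is set. */
static int page_has_private_sim(const struct page_flags_sim *p)
{
	return test_flag(p, PG_private) || test_flag(p, PG_private_2);
}
```

In this model, setting PG_fscache_write alone does not make the page look private, whereas setting PG_fscache does, even though PG_private itself stays clear.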

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 fs/splice.c|2 +-
 include/linux/page-flags.h |   39 +--
 include/linux/pagemap.h|   11 +++
 mm/filemap.c   |   16 
 mm/migrate.c   |2 +-
 mm/page_alloc.c|3 +++
 mm/readahead.c |9 +
 mm/swap.c  |4 ++--
 mm/swap_state.c|4 ++--
 mm/truncate.c  |   10 +-
 mm/vmscan.c|2 +-
 11 files changed, 84 insertions(+), 18 deletions(-)


diff --git a/fs/splice.c b/fs/splice.c
index 6bdcb61..61edad7 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -58,7 +58,7 @@ static int page_cache_pipe_buf_steal(struct pipe_inode_info *pipe,
 */
wait_on_page_writeback(page);
 
-   if (PagePrivate(page))
+   if (page_has_private(page))
try_to_release_page(page, GFP_KERNEL);
 
/*
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 209d3a4..f375e3b 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -77,25 +77,32 @@
 #define PG_active               6
 #define PG_slab                 7  /* slab debug (Suparna wants this) */
 
-#define PG_owner_priv_1         8  /* Owner use. If pagecache, fs may use*/
+#define PG_owner_priv_1         8  /* Owner use. fs may use in pagecache */
 #define PG_arch_1               9
 #define PG_reserved            10
 #define PG_private             11  /* If pagecache, has fs-private data */
 
 #define PG_writeback           12  /* Page is under writeback */
+#define PG_private_2           13  /* If pagecache, has fs aux data */
 #define PG_compound            14  /* Part of a compound page */
 #define PG_swapcache           15  /* Swap page: swp_entry_t in private */
 
 #define PG_mappedtodisk        16  /* Has blocks allocated on-disk */
 #define PG_reclaim             17  /* To be reclaimed asap */
+#define PG_owner_priv_2        18  /* Owner use. fs may use in pagecache */
 #define PG_buddy               19  /* Page is free, on buddy lists */
 
 /* PG_readahead is only used for file reads; PG_reclaim is only for writes */
 #define PG_readahead   PG_reclaim /* Reminder to do async read-ahead */
 
-/* PG_owner_priv_1 users should have descriptive aliases */
+/* PG_owner_priv_1/2 users should have descriptive aliases */
 #define PG_checked PG_owner_priv_1 /* Used by some filesystems */
 #define PG_pinned  PG_owner_priv_1 /* Xen pinned pagetable */
+#define PG_fscache_write   PG_owner_priv_2 /* Writing to local cache */
+
+/* PG_private_2 causes releasepage() and co to be invoked */
+#define PG_fscache PG_private_2 /* Backed by local cache */
+
 
 #if (BITS_PER_LONG > 32)
 /*
@@ -199,6 +206,23 @@ static inline void SetPageUptodate(struct page *page)
 #define TestClearPageWriteback(page) test_and_clear_bit(PG_writeback,  \
&(page)->flags)
 
+#define PagePrivate2(page) test_bit(PG_private_2, &(page)->flags)
+#define SetPagePrivate2(page)  set_bit(PG_private_2, &(page)->flags)
+#define ClearPagePrivate2(page)    clear_bit(PG_private_2, &(page)->flags)
+#define TestSetPagePrivate2(page) test_and_set_bit(PG_private_2, &(page)->flags)
+#define TestClearPagePrivate2(page) test_and_clear_bit(PG_private_2, \
+ &(page)->flags)
+
+#define PageOwnerPriv2(page)   test_bit(PG_owner_priv_2, \
+&(page)->flags)
+#define SetPageOwnerPriv2(page)    set_bit(PG_owner_priv_2, &(page)->flags)
+#define ClearPageOwnerPriv2(page)  clear_bit(PG_owner_priv_2, \
+ &(page)->flags)
+#define TestSetPageOwnerPriv2(page)    test_and_set_bit(PG_owner_priv_2, \
+&(page)->flags)
+#define TestClearPageOwnerPriv2(page)  test_and_clear_bit(PG_owner_priv_2, \
+  &(page)->flags)
+
 #define 

[PATCH 13/27] FS-Cache: Release page->private after failed readahead

2008-01-22 Thread David Howells
The attached patch causes read_cache_pages() to release page-private data on a
page for which add_to_page_cache() fails or the filler function fails. This
permits pages with caching references associated with them to be cleaned up.

The invalidatepage() address space op is called (indirectly) to do the honours.
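
A userspace sketch of the clean-up path is given below, under the assumption that a simple flag can stand in for PG_private and for the filesystem's invalidation hook. The names struct ra_page, ra_invalidate_page() and ra_invalidate_pages() are hypothetical; they mirror, but are not, the two helpers added by the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reduced page: a private-data marker and a list link. */
struct ra_page {
	int has_private;	/* stands in for PG_private */
	int invalidated;	/* set when the fs clean-up hook has run */
	int released;		/* set when the page reference is dropped */
	struct ra_page *next;
};

/* Give the filesystem a chance to clean up, then drop the page. */
static void ra_invalidate_page(struct ra_page *page)
{
	if (page->has_private)
		page->invalidated = 1;	/* the patch calls do_invalidatepage() */
	page->released = 1;
}

/* Release every page still on the list, emptying it. */
static void ra_invalidate_pages(struct ra_page **list)
{
	while (*list) {
		struct ra_page *victim = *list;

		*list = victim->next;
		victim->next = NULL;
		ra_invalidate_page(victim);
	}
}
```

Pages without private data go straight to release, so the cost for filesystems that never mark pages before calling read_cache_pages() is a single flag test per page.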

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 mm/readahead.c |   39 +--
 1 files changed, 37 insertions(+), 2 deletions(-)


diff --git a/mm/readahead.c b/mm/readahead.c
index c9c50ca..75aa6b6 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -44,6 +44,41 @@ EXPORT_SYMBOL_GPL(file_ra_state_init);
 
 #define list_to_page(head) (list_entry((head)->prev, struct page, lru))
 
+/*
+ * see if a page needs releasing upon read_cache_pages() failure
+ * - the caller of read_cache_pages() may have set PG_private before calling,
+ *   such as the NFS fs marking pages that are cached locally on disk, thus we
+ *   need to give the fs a chance to clean up in the event of an error
+ */
+static void read_cache_pages_invalidate_page(struct address_space *mapping,
+struct page *page)
+{
+   if (PagePrivate(page)) {
+   if (TestSetPageLocked(page))
+   BUG();
+   page->mapping = mapping;
+   do_invalidatepage(page, 0);
+   page->mapping = NULL;
+   unlock_page(page);
+   }
+   page_cache_release(page);
+}
+
+/*
+ * release a list of pages, invalidating them first if need be
+ */
+static void read_cache_pages_invalidate_pages(struct address_space *mapping,
+ struct list_head *pages)
+{
+   struct page *victim;
+
+   while (!list_empty(pages)) {
+   victim = list_to_page(pages);
+   list_del(&victim->lru);
+   read_cache_pages_invalidate_page(mapping, victim);
+   }
+}
+
 /**
  * read_cache_pages - populate an address space with some pages & start reads against them
  * @mapping: the address_space
@@ -65,14 +100,14 @@ int read_cache_pages(struct address_space *mapping, struct list_head *pages,
list_del(&page->lru);
if (add_to_page_cache_lru(page, mapping,
page->index, GFP_KERNEL)) {
-   page_cache_release(page);
+   read_cache_pages_invalidate_page(mapping, page);
continue;
}
page_cache_release(page);
 
ret = filler(data, page);
if (unlikely(ret)) {
-   put_pages_list(pages);
+   read_cache_pages_invalidate_pages(mapping, pages);
break;
}
task_io_account_read(PAGE_CACHE_SIZE);



[PATCH 12/27] Security: Make NFSD work with detached security

2008-01-22 Thread David Howells
Make NFSD work with detached security, using the patches that excise the
security information from task_struct to struct task_security as a base.

Each time NFSD wants a new security descriptor (to do NFS4 recovery or just to
do NFS operations), a task_security record is derived from NFSD's *objective*
security, modified and then applied as the *subjective* security.  This means
(a) the changes are not visible to anyone looking at NFSD through /proc, (b)
there is no leakage between two consecutive ops with different security
configurations.
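
The derive, modify and apply sequence can be sketched in userspace with a reference-counted record. This is an illustration under assumptions, not the kernel implementation: struct sec_sim, struct task_sim, sec_put() and task_act_as_fsuid() are hypothetical, the objective pointer is called sec here purely for the sketch, and the reference count is a plain int rather than an atomic.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical reduced record; the real one is struct task_security. */
struct sec_sim {
	int usage;		/* reference count */
	unsigned int fsuid;
};

struct task_sim {
	struct sec_sim *sec;	/* objective: how others see the task */
	struct sec_sim *act_as;	/* subjective: how the task acts on objects */
};

/* Drop a reference, freeing the record when the last one goes. */
static void sec_put(struct sec_sim *s)
{
	if (--s->usage == 0)
		free(s);
}

/*
 * Derive a private copy of the objective record, modify it, and install
 * it as the subjective record.  Returns 0, or -1 if out of memory.
 */
static int task_act_as_fsuid(struct task_sim *t, unsigned int fsuid)
{
	struct sec_sim *sec = malloc(sizeof(*sec));
	struct sec_sim *old;

	if (!sec)
		return -1;
	*sec = *t->sec;		/* always start from the objective state */
	sec->usage = 1;
	sec->fsuid = fsuid;

	old = t->act_as;
	t->act_as = sec;
	sec_put(old);
	return 0;
}
```

Because each call starts again from the objective record, nothing set by one operation carries over into the next, and the objective record itself is never modified.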

Consideration should probably be given to caching the task_security record on
the basis that there'll probably be several ops that will want to use any
particular security configuration.

Furthermore, nfs4recover.c perhaps ought to set an appropriate LSM context on
the record pointed to by rec_security so that the disk is accessed
appropriately (see set_security_override[_from_ctx]()).

NOTE!  This patch must be rolled in to one of the earlier security patches to
make it compile fully.

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 fs/nfsd/auth.c|   31 +---
 fs/nfsd/nfs4recover.c |   64 +++--
 2 files changed, 62 insertions(+), 33 deletions(-)


diff --git a/fs/nfsd/auth.c b/fs/nfsd/auth.c
index b2e19c8..32d8e34 100644
--- a/fs/nfsd/auth.c
+++ b/fs/nfsd/auth.c
@@ -6,6 +6,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -28,11 +29,17 @@ int nfsexp_flags(struct svc_rqst *rqstp, struct svc_export *exp)
 
 int nfsd_setuser(struct svc_rqst *rqstp, struct svc_export *exp)
 {
+   struct task_security *sec, *old;
struct svc_cred cred = rqstp->rq_cred;
int i;
int flags = nfsexp_flags(rqstp, exp);
int ret;
 
+   /* derive the new security record from nfsd's objective security */
+   sec = get_kernel_security(current);
+   if (!sec)
+   return -ENOMEM;
+
if (flags & NFSEXP_ALLSQUASH) {
cred.cr_uid = exp->ex_anon_uid;
cred.cr_gid = exp->ex_anon_gid;
@@ -56,24 +63,30 @@ int nfsd_setuser(struct svc_rqst *rqstp, struct svc_export *exp)
get_group_info(cred.cr_group_info);
 
if (cred.cr_uid != (uid_t) -1)
-   current->act_as->fsuid = cred.cr_uid;
+   sec->fsuid = cred.cr_uid;
else
-   current->act_as->fsuid = exp->ex_anon_uid;
+   sec->fsuid = exp->ex_anon_uid;
if (cred.cr_gid != (gid_t) -1)
-   current->act_as->fsgid = cred.cr_gid;
+   sec->fsgid = cred.cr_gid;
else
-   current->act_as->fsgid = exp->ex_anon_gid;
+   sec->fsgid = exp->ex_anon_gid;
 
-   if (!cred.cr_group_info)
+   if (!cred.cr_group_info) {
+   put_task_security(sec);
return -ENOMEM;
-   ret = set_groups(current->act_as, cred.cr_group_info);
+   }
+   ret = set_groups(sec, cred.cr_group_info);
put_group_info(cred.cr_group_info);
if ((cred.cr_uid)) {
-   cap_t(current->act_as->cap_effective) &= ~CAP_NFSD_MASK;
+   cap_t(sec->cap_effective) &= ~CAP_NFSD_MASK;
} else {
-   cap_t(current->act_as->cap_effective) |=
-   (CAP_NFSD_MASK & current->act_as->cap_permitted);
+   cap_t(sec->cap_effective) |= CAP_NFSD_MASK & sec->cap_permitted;
}
+
+   /* set the new security as nfsd's subjective security */
+   old = current->act_as;
+   current->act_as = sec;
+   put_task_security(old);
return ret;
 }
 
diff --git a/fs/nfsd/nfs4recover.c b/fs/nfsd/nfs4recover.c
index bf0217a..ae91262 100644
--- a/fs/nfsd/nfs4recover.c
+++ b/fs/nfsd/nfs4recover.c
@@ -46,27 +46,37 @@
 #include 
 #include 
 #include 
+#include 
 
 #define NFSDDBG_FACILITY                NFSDDBG_PROC
 
 /* Globals */
 static struct nameidata rec_dir;
 static int rec_dir_init = 0;
+static struct task_security *rec_security;
 
+/*
+ * switch the special recovery access security in on the current task's
+ * subjective security
+ */
 static void
-nfs4_save_user(uid_t *saveuid, gid_t *savegid)
+nfs4_begin_secure(struct task_security **saved_sec)
 {
-   *saveuid = current->act_as->fsuid;
-   *savegid = current->act_as->fsgid;
-   current->act_as->fsuid = 0;
-   current->act_as->fsgid = 0;
+   *saved_sec = current->act_as;
+   current->act_as = get_task_security(rec_security);
 }
 
+/*
+ * return the current task's subjective security to its former glory
+ */
 static void
-nfs4_reset_user(uid_t saveuid, gid_t savegid)
+nfs4_end_secure(struct task_security *saved_sec)
 {
-   current->act_as->fsuid = saveuid;
-   current->act_as->fsgid = savegid;
+   struct task_security *discard;
+
+   discard = current->act_as;
+   current->act_as = saved_sec;
+   put_task_security(discard);
 }
 
 static void
@@ -128,10 +138,9 

[PATCH 11/27] Security: Allow kernel services to override LSM settings for task actions

2008-01-22 Thread David Howells
Allow kernel services to override LSM settings appropriate to the actions
performed by a task by duplicating a security record, modifying it and then
using task_struct::act_as to point to it when performing operations on behalf
of a task.

This is used, for example, by CacheFiles which has to transparently access the
cache on behalf of a process that thinks it is doing, say, NFS accesses with a
potentially inappropriate (with respect to accessing the cache) set of
security data.

This patch provides two LSM hooks for modifying a task security record:

 (*) security_kernel_act_as() which allows modification of the security datum
 with which a task acts on other objects (most notably files).

 (*) security_create_files_as() which allows modification of the security
 datum that is used to initialise the security data on a file that a task
 creates.
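
How a kernel service might bracket its work with such an override can be sketched as follows. This is a hedged userspace model: struct sec_rec, struct task_rec, begin_override() and end_override() are hypothetical names, the two fields merely stand in for the "act as" and "create files as" data, and reference counting and the LSM permission checks are omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record: one datum for acting, one for file creation. */
struct sec_rec {
	unsigned int sid;		/* used when acting on other objects */
	unsigned int create_sid;	/* used to label newly created files */
};

struct task_rec {
	struct sec_rec *act_as;		/* subjective security pointer */
};

/* Point the task at the override record; returns the record to restore. */
static struct sec_rec *begin_override(struct task_rec *t,
				      struct sec_rec *override)
{
	struct sec_rec *saved = t->act_as;

	t->act_as = override;
	return saved;
}

/* Put back whatever the task was acting as before the override. */
static void end_override(struct task_rec *t, struct sec_rec *saved)
{
	t->act_as = saved;
}
```

Everything the service does between the two calls is then judged against the override record rather than against the security data of the process that triggered the work.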

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 include/linux/cred.h|   23 +++
 include/linux/security.h|   43 +-
 kernel/cred.c   |  111 +++
 security/dummy.c|   17 +
 security/security.c |   15 -
 security/selinux/hooks.c|   51 
 security/selinux/include/security.h |2 -
 security/selinux/ss/services.c  |5 +-
 8 files changed, 258 insertions(+), 9 deletions(-)
 create mode 100644 include/linux/cred.h


diff --git a/include/linux/cred.h b/include/linux/cred.h
new file mode 100644
index 000..497af5b
--- /dev/null
+++ b/include/linux/cred.h
@@ -0,0 +1,23 @@
+/* Credential management
+ *
+ * Copyright (C) 2007 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells ([EMAIL PROTECTED])
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#ifndef _LINUX_CRED_H
+#define _LINUX_CRED_H
+
+struct task_security;
+struct inode;
+
+extern struct task_security *get_kernel_security(struct task_struct *);
+extern int set_security_override(struct task_security *, u32);
+extern int set_security_override_from_ctx(struct task_security *, const char *);
+extern int change_create_files_as(struct task_security *, struct inode *);
+
+#endif /* _LINUX_CRED_H */
diff --git a/include/linux/security.h b/include/linux/security.h
index e8f2f2d..e6be746 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -557,6 +557,19 @@ struct request_sock;
  * Duplicate and attach the security structure currently attached to the
  * p->security field.
  * Return 0 if operation was successful.
+ * @task_kernel_act_as:
+ * Set the credentials for a kernel service to act as (subjective context).
+ * @p points to the task that nominated @secid.
+ * @sec points to the task security record to be modified.
+ * @secid specifies the security ID to be set
+ * Return 0 if successful.
+ * @task_create_files_as:
+ * Set the file creation context in a task security record to be the same
+ * as the objective context of the specified inode.
+ * @p points to the task that nominated @inode.
+ * @sec points to the task security record to be modified.
+ * @inode points to the inode to use as a reference.
+ * Return 0 if successful.
  * @task_setuid:
  * Check permission before setting one or more of the user identity
  * attributes of the current process.  The @flags parameter indicates
@@ -1325,6 +1338,11 @@ struct security_operations {
int (*task_alloc_security) (struct task_struct *p);
void (*task_free_security) (struct task_security *p);
int (*task_dup_security) (struct task_security *p);
+   int (*task_kernel_act_as)(struct task_struct *p,
+ struct task_security *sec, u32 secid);
+   int (*task_create_files_as)(struct task_struct *p,
+   struct task_security *sec,
+   struct inode *inode);
int (*task_setuid) (uid_t id0, uid_t id1, uid_t id2, int flags);
int (*task_post_setuid) (uid_t old_ruid /* or fsuid */ ,
 uid_t old_euid, uid_t old_suid, int flags);
@@ -1393,7 +1411,7 @@ struct security_operations {
int (*getprocattr)(struct task_struct *p, char *name, char **value);
int (*setprocattr)(struct task_struct *p, char *name, void *value, size_t size);
int (*secid_to_secctx)(u32 secid, char **secdata, u32 *seclen);
-   int (*secctx_to_secid)(char *secdata, u32 seclen, u32 *secid);
+   int (*secctx_to_secid)(const char *secdata, u32 seclen, u32 *secid);
void (*release_secctx)(char *secdata, u32 seclen);
 
 #ifdef CONFIG_SECURITY_NETWORK
@@ -1576,6 +1594,11 @@ int security_task_create(unsigned long 

[PATCH 10/27] Security: Add a kernel_service object class to SELinux

2008-01-22 Thread David Howells
Add a 'kernel_service' object class to SELinux and give this object class two
access vectors: 'use_as_override' and 'create_files_as'.

The first vector is used to grant a process the right to nominate an alternate
process security ID for the kernel to use as an override for the SELinux
subjective security when accessing stuff on behalf of another process.

For example, when CacheFiles accesses the cache on behalf of a process that is
accessing an NFS file, it needs to use a subjective security ID appropriate to
the cache rather than the one the calling process is using.  The cachefilesd
daemon will nominate the security ID to be used.

The second vector is used to grant a process the right to nominate a file
creation label for a kernel service to use.
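
An access vector is a bitmask, so a permission test reduces to a mask comparison. The sketch below uses the two permission values as they appear in the av_permissions.h hunk of this patch; av_allowed() is a hypothetical helper and does not represent the SELinux access vector cache API.

```c
#include <assert.h>

/* Permission bits as they appear in the av_permissions.h hunk. */
#define KERNEL_SERVICE__USE_AS_OVERRIDE 0x0001UL
#define KERNEL_SERVICE__CREATE_FILES_AS 0x0002UL

/*
 * Hypothetical check: every requested bit must be present in the set of
 * permissions that policy allows for the kernel_service class.
 */
static int av_allowed(unsigned long allowed, unsigned long requested)
{
	return (allowed & requested) == requested;
}
```

A daemon granted only use_as_override could thus nominate a security ID for the kernel to act as, but not a file creation label.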

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 security/selinux/include/av_perm_to_string.h |2 ++
 security/selinux/include/av_permissions.h|2 ++
 security/selinux/include/class_to_string.h   |1 +
 security/selinux/include/flask.h |1 +
 4 files changed, 6 insertions(+), 0 deletions(-)


diff --git a/security/selinux/include/av_perm_to_string.h b/security/selinux/include/av_perm_to_string.h
index caa0634..6ba8200 100644
--- a/security/selinux/include/av_perm_to_string.h
+++ b/security/selinux/include/av_perm_to_string.h
@@ -164,3 +164,5 @@
S_(SECCLASS_DCCP_SOCKET, DCCP_SOCKET__NAME_CONNECT, "name_connect")
S_(SECCLASS_MEMPROTECT, MEMPROTECT__MMAP_ZERO, "mmap_zero")
S_(SECCLASS_PEER, PEER__RECV, "recv")
+   S_(SECCLASS_KERNEL_SERVICE, KERNEL_SERVICE__USE_AS_OVERRIDE, "use_as_override")
+   S_(SECCLASS_KERNEL_SERVICE, KERNEL_SERVICE__CREATE_FILES_AS, "create_files_as")
diff --git a/security/selinux/include/av_permissions.h b/security/selinux/include/av_permissions.h
index c2b5bb2..9500ba3 100644
--- a/security/selinux/include/av_permissions.h
+++ b/security/selinux/include/av_permissions.h
@@ -829,3 +829,5 @@
 #define DCCP_SOCKET__NAME_CONNECT 0x0080UL
 #define MEMPROTECT__MMAP_ZERO 0x0001UL
 #define PEER__RECV 0x0001UL
+#define KERNEL_SERVICE__USE_AS_OVERRIDE   0x0001UL
+#define KERNEL_SERVICE__CREATE_FILES_AS   0x0002UL
diff --git a/security/selinux/include/class_to_string.h b/security/selinux/include/class_to_string.h
index b1b0d1d..efe9efa 100644
--- a/security/selinux/include/class_to_string.h
+++ b/security/selinux/include/class_to_string.h
@@ -71,3 +71,4 @@
 S_(NULL)
 S_(NULL)
 S_("peer")
+S_("kernel_service")
diff --git a/security/selinux/include/flask.h b/security/selinux/include/flask.h
index 09e9dd2..2bc251a 100644
--- a/security/selinux/include/flask.h
+++ b/security/selinux/include/flask.h
@@ -51,6 +51,7 @@
 #define SECCLASS_DCCP_SOCKET 60
 #define SECCLASS_MEMPROTECT  61
 #define SECCLASS_PEER68
+#define SECCLASS_KERNEL_SERVICE  69
 
 /*
  * Security identifier indices for initial entities

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 09/27] Security: Pre-add additional non-caching classes

2008-01-22 Thread David Howells
Pre-add additional non-caching classes that are in the SELinux upstream
repository, but not in the upstream kernel so they don't get in the fscache
class patch.

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 security/selinux/include/av_perm_to_string.h |5 +
 security/selinux/include/av_permissions.h|5 +
 security/selinux/include/class_to_string.h   |7 +++
 security/selinux/include/flask.h |1 +
 4 files changed, 18 insertions(+), 0 deletions(-)


diff --git a/security/selinux/include/av_perm_to_string.h 
b/security/selinux/include/av_perm_to_string.h
index 049bf69..caa0634 100644
--- a/security/selinux/include/av_perm_to_string.h
+++ b/security/selinux/include/av_perm_to_string.h
@@ -37,6 +37,8 @@
S_(SECCLASS_NODE, NODE__ENFORCE_DEST, "enforce_dest")
S_(SECCLASS_NODE, NODE__DCCP_RECV, "dccp_recv")
S_(SECCLASS_NODE, NODE__DCCP_SEND, "dccp_send")
+   S_(SECCLASS_NODE, NODE__RECVFROM, "recvfrom")
+   S_(SECCLASS_NODE, NODE__SENDTO, "sendto")
S_(SECCLASS_NETIF, NETIF__TCP_RECV, "tcp_recv")
S_(SECCLASS_NETIF, NETIF__TCP_SEND, "tcp_send")
S_(SECCLASS_NETIF, NETIF__UDP_RECV, "udp_recv")
@@ -45,6 +47,8 @@
S_(SECCLASS_NETIF, NETIF__RAWIP_SEND, "rawip_send")
S_(SECCLASS_NETIF, NETIF__DCCP_RECV, "dccp_recv")
S_(SECCLASS_NETIF, NETIF__DCCP_SEND, "dccp_send")
+   S_(SECCLASS_NETIF, NETIF__INGRESS, "ingress")
+   S_(SECCLASS_NETIF, NETIF__EGRESS, "egress")
S_(SECCLASS_UNIX_STREAM_SOCKET, UNIX_STREAM_SOCKET__CONNECTTO, "connectto")
S_(SECCLASS_UNIX_STREAM_SOCKET, UNIX_STREAM_SOCKET__NEWCONN, "newconn")
S_(SECCLASS_UNIX_STREAM_SOCKET, UNIX_STREAM_SOCKET__ACCEPTFROM, 
"acceptfrom")
@@ -159,3 +163,4 @@
S_(SECCLASS_DCCP_SOCKET, DCCP_SOCKET__NODE_BIND, "node_bind")
S_(SECCLASS_DCCP_SOCKET, DCCP_SOCKET__NAME_CONNECT, "name_connect")
S_(SECCLASS_MEMPROTECT, MEMPROTECT__MMAP_ZERO, "mmap_zero")
+   S_(SECCLASS_PEER, PEER__RECV, "recv")
diff --git a/security/selinux/include/av_permissions.h 
b/security/selinux/include/av_permissions.h
index eda89a2..c2b5bb2 100644
--- a/security/selinux/include/av_permissions.h
+++ b/security/selinux/include/av_permissions.h
@@ -292,6 +292,8 @@
 #define NODE__ENFORCE_DEST0x0040UL
 #define NODE__DCCP_RECV   0x0080UL
 #define NODE__DCCP_SEND   0x0100UL
+#define NODE__RECVFROM0x0200UL
+#define NODE__SENDTO  0x0400UL
 #define NETIF__TCP_RECV   0x0001UL
 #define NETIF__TCP_SEND   0x0002UL
 #define NETIF__UDP_RECV   0x0004UL
@@ -300,6 +302,8 @@
 #define NETIF__RAWIP_SEND 0x0020UL
 #define NETIF__DCCP_RECV  0x0040UL
 #define NETIF__DCCP_SEND  0x0080UL
+#define NETIF__INGRESS0x0100UL
+#define NETIF__EGRESS 0x0200UL
 #define NETLINK_SOCKET__IOCTL 0x0001UL
 #define NETLINK_SOCKET__READ  0x0002UL
 #define NETLINK_SOCKET__WRITE 0x0004UL
@@ -824,3 +828,4 @@
 #define DCCP_SOCKET__NODE_BIND0x0040UL
 #define DCCP_SOCKET__NAME_CONNECT 0x0080UL
 #define MEMPROTECT__MMAP_ZERO 0x0001UL
+#define PEER__RECV0x0001UL
diff --git a/security/selinux/include/class_to_string.h 
b/security/selinux/include/class_to_string.h
index e77de0e..b1b0d1d 100644
--- a/security/selinux/include/class_to_string.h
+++ b/security/selinux/include/class_to_string.h
@@ -64,3 +64,10 @@
 S_(NULL)
 S_("dccp_socket")
 S_("memprotect")
+S_(NULL)
+S_(NULL)
+S_(NULL)
+S_(NULL)
+S_(NULL)
+S_(NULL)
+S_("peer")
diff --git a/security/selinux/include/flask.h b/security/selinux/include/flask.h
index a9c2b20..09e9dd2 100644
--- a/security/selinux/include/flask.h
+++ b/security/selinux/include/flask.h
@@ -50,6 +50,7 @@
 #define SECCLASS_KEY 58
 #define SECCLASS_DCCP_SOCKET 60
 #define SECCLASS_MEMPROTECT  61
+#define SECCLASS_PEER68
 
 /*
  * Security identifier indices for initial entities



[PATCH 04/27] KEYS: Add keyctl function to get a security label

2008-01-22 Thread David Howells
Add a keyctl() function to get the security label of a key.

The following is added to Documentation/keys.txt:

 (*) Get the LSM security context attached to a key.

long keyctl(KEYCTL_GET_SECURITY, key_serial_t key, char *buffer,
size_t buflen)

 This function returns a string that represents the LSM security context
 attached to a key in the buffer provided.

 Unless there's an error, it always returns the amount of data it could
 produce, even if that's too big for the buffer, but it won't copy more
 than requested to userspace. If the buffer pointer is NULL then no copy
 will take place.

 A NUL character is included at the end of the string if the buffer is
 sufficiently big.  This is included in the returned count.  If no LSM is
 in force then an empty string will be returned.

 A process must have view permission on the key for this function to be
 successful.
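
The return convention described above can be modelled in user space as
follows.  This is only a sketch of the documented behaviour, not the kernel
implementation; model_get_security() is a hypothetical name and the label is
passed in directly instead of being looked up from a key.

```c
#include <stddef.h>
#include <string.h>

/*
 * User-space model of the documented return convention (not kernel code).
 * Returns the full length of the label including its terminating NUL, and
 * copies at most buflen bytes into buffer when buffer is non-NULL.
 */
static long model_get_security(const char *label, char *buffer, size_t buflen)
{
    size_t need = strlen(label) + 1;    /* count includes the NUL */

    if (buffer != NULL && buflen > 0) {
        size_t n = need < buflen ? need : buflen;
        memcpy(buffer, label, n);       /* never more than requested */
    }
    return (long)need;
}
```

A caller can therefore pass a NULL buffer first to learn the size, allocate
that many bytes, and call again.  If the second call returns more than the
buffer size, the result was truncated and has no terminating NUL.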

Signed-off-by: David Howells <[EMAIL PROTECTED]>
Acked-by:  Stephen Smalley <[EMAIL PROTECTED]>
---

 Documentation/keys.txt   |   21 +++
 include/linux/keyctl.h   |1 +
 include/linux/security.h |   20 +-
 security/dummy.c |8 ++
 security/keys/compat.c   |3 ++
 security/keys/keyctl.c   |   66 ++
 security/security.c  |5 +++
 security/selinux/hooks.c |   21 +--
 8 files changed, 141 insertions(+), 4 deletions(-)


diff --git a/Documentation/keys.txt b/Documentation/keys.txt
index b82d38d..be424b0 100644
--- a/Documentation/keys.txt
+++ b/Documentation/keys.txt
@@ -711,6 +711,27 @@ The keyctl syscall functions are:
  The assumed authoritative key is inherited across fork and exec.
 
 
+ (*) Get the LSM security context attached to a key.
+
+   long keyctl(KEYCTL_GET_SECURITY, key_serial_t key, char *buffer,
+   size_t buflen)
+
+ This function returns a string that represents the LSM security context
+ attached to a key in the buffer provided.
+
+ Unless there's an error, it always returns the amount of data it could
+ produce, even if that's too big for the buffer, but it won't copy more
+ than requested to userspace. If the buffer pointer is NULL then no copy
+ will take place.
+
+ A NUL character is included at the end of the string if the buffer is
+ sufficiently big.  This is included in the returned count.  If no LSM is
+ in force then an empty string will be returned.
+
+ A process must have view permission on the key for this function to be
+ successful.
+
+
 ===
 KERNEL SERVICES
 ===
diff --git a/include/linux/keyctl.h b/include/linux/keyctl.h
index 3365945..656ee6b 100644
--- a/include/linux/keyctl.h
+++ b/include/linux/keyctl.h
@@ -49,5 +49,6 @@
 #define KEYCTL_SET_REQKEY_KEYRING  14  /* set default request-key 
keyring */
 #define KEYCTL_SET_TIMEOUT 15  /* set key timeout */
 #define KEYCTL_ASSUME_AUTHORITY16  /* assume request_key() 
authorisation */
+#define KEYCTL_GET_SECURITY17  /* get key security label */
 
 #endif /*  _LINUX_KEYCTL_H */
diff --git a/include/linux/security.h b/include/linux/security.h
index ac05083..8d9e946 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -959,6 +959,17 @@ struct request_sock;
  * @perm describes the combination of permissions required of this key.
  * Return 1 if permission granted, 0 if permission denied and -ve it the
  *  normal permissions model should be effected.
+ * @key_getsecurity:
+ * Get a textual representation of the security context attached to a key
+ * for the purposes of honouring KEYCTL_GETSECURITY.  This function
+ * allocates the storage for the NUL-terminated string and the caller
+ * should free it.
+ * @key points to the key to be queried.
+ * @_buffer points to a pointer that should be set to point to the
+ *  resulting string (if no label or an error occurs).
+ * Return the length of the string (including terminating NUL) or -ve if
+ *  an error.
+ * May also return 0 (and a NULL buffer pointer) if there is no label.
  *
  * Security hooks affecting all System V IPC operations.
  *
@@ -1437,7 +1448,7 @@ struct security_operations {
int (*key_permission)(key_ref_t key_ref,
  struct task_struct *context,
  key_perm_t perm);
-
+   int (*key_getsecurity)(struct key *key, char **_buffer);
 #endif /* CONFIG_KEYS */
 
 };
@@ -2567,6 +2578,7 @@ int security_key_alloc(struct key *key, struct 
task_struct *tsk, unsigned long f
 void security_key_free(struct key *key);
 int security_key_permission(key_ref_t key_ref,
struct task_struct *context, key_perm_t perm);
+int security_key_getsecurity(struct key *key, char **_buffer);
 
 #else
 
@@ -2588,6 +2600,12 @@ static inline int 

[PATCH 08/27] Add a secctx_to_secid() LSM hook to go along with the existing

2008-01-22 Thread David Howells
secid_to_secctx() LSM hook.  This patch also includes the SELinux
implementation for this hook.
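
To show how the two hooks are meant to mirror each other, here is a toy
round trip in user space.  The table, the context strings and the model_*
function names are all made up for the example; a real LSM such as SELinux
resolves contexts through its policy database instead.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical table standing in for an LSM's context <-> ID mapping. */
static const char *const ctx_table[] = {
    NULL,                           /* ID 0 is treated as "no label" here */
    "system_u:system_r:kernel_t",
    "system_u:object_r:file_t",
};
#define CTX_COUNT (sizeof(ctx_table) / sizeof(ctx_table[0]))

/* Model of secid_to_secctx(): ID -> string; returns 0 or -1. */
static int model_secid_to_secctx(unsigned int secid, const char **secdata)
{
    if (secid == 0 || secid >= CTX_COUNT)
        return -1;
    *secdata = ctx_table[secid];
    return 0;
}

/* Model of secctx_to_secid(): string of length seclen -> ID; 0 or -1. */
static int model_secctx_to_secid(const char *secdata, size_t seclen,
                                 unsigned int *secid)
{
    for (unsigned int i = 1; i < CTX_COUNT; i++) {
        if (strlen(ctx_table[i]) == seclen &&
            memcmp(ctx_table[i], secdata, seclen) == 0) {
            *secid = i;
            return 0;
        }
    }
    return -1;
}
```

Converting an ID to a context and back should give the original ID, while an
unknown context is rejected.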

Signed-off-by: Paul Moore <[EMAIL PROTECTED]>
Acked-by: Stephen Smalley <[EMAIL PROTECTED]>
---

 include/linux/security.h |   13 +
 security/dummy.c |6 ++
 security/security.c  |6 ++
 security/selinux/hooks.c |6 ++
 4 files changed, 31 insertions(+), 0 deletions(-)


diff --git a/include/linux/security.h b/include/linux/security.h
index b7ba073..e8f2f2d 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -1200,6 +1200,10 @@ struct request_sock;
  * Convert secid to security context.
  * @secid contains the security ID.
  * @secdata contains the pointer that stores the converted security 
context.
+ * @secctx_to_secid:
+ *  Convert security context to secid.
+ *  @secid contains the pointer to the generated security ID.
+ *  @secdata contains the security context.
  *
  * @release_secctx:
  * Release the security context.
@@ -1389,6 +1393,7 @@ struct security_operations {
int (*getprocattr)(struct task_struct *p, char *name, char **value);
int (*setprocattr)(struct task_struct *p, char *name, void *value, 
size_t size);
int (*secid_to_secctx)(u32 secid, char **secdata, u32 *seclen);
+   int (*secctx_to_secid)(char *secdata, u32 seclen, u32 *secid);
void (*release_secctx)(char *secdata, u32 seclen);
 
 #ifdef CONFIG_SECURITY_NETWORK
@@ -1623,6 +1628,7 @@ int security_setprocattr(struct task_struct *p, char 
*name, void *value, size_t
 int security_netlink_send(struct sock *sk, struct sk_buff *skb);
 int security_netlink_recv(struct sk_buff *skb, int cap);
 int security_secid_to_secctx(u32 secid, char **secdata, u32 *seclen);
+int security_secctx_to_secid(char *secdata, u32 seclen, u32 *secid);
 void security_release_secctx(char *secdata, u32 seclen);
 
 #else /* CONFIG_SECURITY */
@@ -2305,6 +2311,13 @@ static inline int security_secid_to_secctx(u32 secid, 
char **secdata, u32 *secle
return -EOPNOTSUPP;
 }
 
+static inline int security_secctx_to_secid(char *secdata,
+  u32 seclen,
+  u32 *secid)
+{
+   return -EOPNOTSUPP;
+}
+
 static inline void security_release_secctx(char *secdata, u32 seclen)
 {
 }
diff --git a/security/dummy.c b/security/dummy.c
index 6f97089..72f1666 100644
--- a/security/dummy.c
+++ b/security/dummy.c
@@ -943,6 +943,11 @@ static int dummy_secid_to_secctx(u32 secid, char 
**secdata, u32 *seclen)
return -EOPNOTSUPP;
 }
 
+static int dummy_secctx_to_secid(char *secdata, u32 seclen, u32 *secid)
+{
+   return -EOPNOTSUPP;
+}
+
 static void dummy_release_secctx(char *secdata, u32 seclen)
 {
 }
@@ -1109,6 +1114,7 @@ void security_fixup_ops (struct security_operations *ops)
set_to_dummy_if_null(ops, getprocattr);
set_to_dummy_if_null(ops, setprocattr);
set_to_dummy_if_null(ops, secid_to_secctx);
+   set_to_dummy_if_null(ops, secctx_to_secid);
set_to_dummy_if_null(ops, release_secctx);
 #ifdef CONFIG_SECURITY_NETWORK
set_to_dummy_if_null(ops, unix_stream_connect);
diff --git a/security/security.c b/security/security.c
index 92d66d6..1ef4908 100644
--- a/security/security.c
+++ b/security/security.c
@@ -821,6 +821,12 @@ int security_secid_to_secctx(u32 secid, char **secdata, 
u32 *seclen)
 }
 EXPORT_SYMBOL(security_secid_to_secctx);
 
+int security_secctx_to_secid(char *secdata, u32 seclen, u32 *secid)
+{
+   return security_ops->secctx_to_secid(secdata, seclen, secid);
+}
+EXPORT_SYMBOL(security_secctx_to_secid);
+
 void security_release_secctx(char *secdata, u32 seclen)
 {
return security_ops->release_secctx(secdata, seclen);
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 20a6b55..1d3eab7 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -4734,6 +4734,11 @@ static int selinux_secid_to_secctx(u32 secid, char 
**secdata, u32 *seclen)
return security_sid_to_context(secid, secdata, seclen);
 }
 
+static int selinux_secctx_to_secid(char *secdata, u32 seclen, u32 *secid)
+{
+   return security_context_to_sid(secdata, seclen, secid);
+}
+
 static void selinux_release_secctx(char *secdata, u32 seclen)
 {
kfree(secdata);
@@ -4937,6 +4942,7 @@ static struct security_operations selinux_ops = {
.setprocattr =  selinux_setprocattr,
 
.secid_to_secctx =  selinux_secid_to_secctx,
+   .secctx_to_secid =  selinux_secctx_to_secid,
.release_secctx =   selinux_release_secctx,
 
 .unix_stream_connect = selinux_socket_unix_stream_connect,


[PATCH 05/27] Security: Change current->fs[ug]id to current_fs[ug]id()

2008-01-22 Thread David Howells
Change current->fs[ug]id to current_fs[ug]id() so that fsgid and fsuid can be
separated from the task_struct.
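
The benefit of going through an accessor can be seen in a simplified
user-space model.  All type and variable names below (task_model,
task_security_model, current_model and the *_model() macros) are
hypothetical stand-ins and do not reflect the real kernel layout.

```c
/* Hypothetical, simplified stand-ins for kernel types; not the real layout. */
struct task_security_model {
    unsigned int fsuid;
    unsigned int fsgid;
};

struct task_model {
    struct task_security_model *sec;    /* IDs moved out of the task itself */
};

static struct task_security_model boot_sec = { 1000, 100 };
static struct task_model boot_task = { &boot_sec };
static struct task_model *current_model = &boot_task;

/* Callers use accessors, so only these two lines know where the IDs live. */
#define current_fsuid_model() (current_model->sec->fsuid)
#define current_fsgid_model() (current_model->sec->fsgid)
```

Code written as "inode->i_uid = current_fsuid_model();" keeps compiling
unchanged when the fields are later moved again; only the macros change.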

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 arch/ia64/kernel/perfmon.c|4 ++--
 arch/powerpc/platforms/cell/spufs/inode.c |4 ++--
 drivers/isdn/capi/capifs.c|4 ++--
 drivers/usb/core/inode.c  |4 ++--
 fs/9p/fid.c   |2 +-
 fs/9p/vfs_inode.c |4 ++--
 fs/9p/vfs_super.c |4 ++--
 fs/affs/inode.c   |4 ++--
 fs/anon_inodes.c  |4 ++--
 fs/attr.c |4 ++--
 fs/bfs/dir.c  |4 ++--
 fs/cifs/cifsproto.h   |2 +-
 fs/cifs/dir.c |   12 ++--
 fs/cifs/inode.c   |8 
 fs/cifs/misc.c|4 ++--
 fs/coda/cache.c   |6 +++---
 fs/coda/upcall.c  |4 ++--
 fs/devpts/inode.c |4 ++--
 fs/dquot.c|2 +-
 fs/exec.c |4 ++--
 fs/ext2/balloc.c  |2 +-
 fs/ext2/ialloc.c  |4 ++--
 fs/ext2/ioctl.c   |2 +-
 fs/ext3/balloc.c  |2 +-
 fs/ext3/ialloc.c  |4 ++--
 fs/ext4/balloc.c  |2 +-
 fs/ext4/ialloc.c  |4 ++--
 fs/fuse/dev.c |4 ++--
 fs/gfs2/inode.c   |   10 +-
 fs/hfs/inode.c|4 ++--
 fs/hfsplus/inode.c|4 ++--
 fs/hpfs/namei.c   |   24 
 fs/hugetlbfs/inode.c  |   16 
 fs/jffs2/fs.c |4 ++--
 fs/jfs/jfs_inode.c|4 ++--
 fs/locks.c|2 +-
 fs/minix/bitmap.c |4 ++--
 fs/namei.c|8 
 fs/nfsd/vfs.c |4 ++--
 fs/ocfs2/dlm/dlmfs.c  |8 
 fs/ocfs2/namei.c  |4 ++--
 fs/pipe.c |4 ++--
 fs/posix_acl.c|4 ++--
 fs/ramfs/inode.c  |4 ++--
 fs/reiserfs/namei.c   |4 ++--
 fs/sysv/ialloc.c  |4 ++--
 fs/udf/ialloc.c   |4 ++--
 fs/udf/namei.c|2 +-
 fs/ufs/ialloc.c   |4 ++--
 fs/xfs/linux-2.6/xfs_linux.h  |4 ++--
 fs/xfs/xfs_acl.c  |6 +++---
 fs/xfs/xfs_attr.c |2 +-
 fs/xfs/xfs_inode.c|6 +++---
 fs/xfs/xfs_vnodeops.c |8 
 include/linux/fs.h|2 +-
 include/linux/sched.h |3 +++
 ipc/mqueue.c  |4 ++--
 kernel/cgroup.c   |4 ++--
 mm/shmem.c|8 
 net/9p/client.c   |2 +-
 net/socket.c  |4 ++--
 net/sunrpc/auth.c |8 
 security/commoncap.c  |8 
 security/keys/key.c   |2 +-
 security/keys/keyctl.c|2 +-
 security/keys/request_key.c   |   10 +-
 security/keys/request_key_auth.c  |2 +-
 67 files changed, 163 insertions(+), 160 deletions(-)


diff --git a/arch/ia64/kernel/perfmon.c b/arch/ia64/kernel/perfmon.c
index 73e7c2e..ef383d9 100644
--- a/arch/ia64/kernel/perfmon.c
+++ b/arch/ia64/kernel/perfmon.c
@@ -2206,8 +2206,8 @@ pfm_alloc_fd(struct file **cfile)
DPRINT(("new inode ino=%ld @%p\n", inode->i_ino, inode));
 
inode->i_mode = S_IFCHR|S_IRUGO;
-   inode->i_uid  = current->fsuid;
-   inode->i_gid  = current->fsgid;
+   inode->i_uid  = current_fsuid();
+   inode->i_gid  = current_fsgid();
 
sprintf(name, "[%lu]", inode->i_ino);
this.name = name;
diff --git a/arch/powerpc/platforms/cell/spufs/inode.c 
b/arch/powerpc/platforms/cell/spufs/inode.c
index c0e968a..4efe7bf 100644
--- a/arch/powerpc/platforms/cell/spufs/inode.c
+++ b/arch/powerpc/platforms/cell/spufs/inode.c
@@ -85,8 +85,8 @@ spufs_new_inode(struct super_block *sb, int mode)
goto out;
 
inode->i_mode = mode;
-   inode->i_uid = current->fsuid;
-   inode->i_gid = current->fsgid;
+   inode->i_uid = current_fsuid();

[PATCH 02/27] KEYS: Check starting keyring as part of search

2008-01-22 Thread David Howells
Check the starting keyring as part of the search to (a) see if that is what
we're searching for, and (b) to check it is still valid for searching.

The scenario:  User in process A does things that cause things to be
created in its process session keyring.  The user then does an su to
another user and starts a new process, B.  The two processes now
share the same process session keyring.

Process B does an NFS access which results in an upcall to gssd.
When gssd attempts to instantiate the context key (to be linked
into the process session keyring), it is denied access even though it
has an authorization key.

The order of calls is:

   keyctl_instantiate_key()
  lookup_user_key() (the default: case)
 search_process_keyrings(current)
search_process_keyrings(rka->context)   (recursive call)
   keyring_search_aux()

keyring_search_aux() verifies the keys and keyrings underneath the
top-level keyring it is given, but that top-level keyring is neither
fully validated nor checked to see if it is the thing being searched for.

This patch changes keyring_search_aux() to:
1) do more validation on the top keyring it is given and
2) check whether that top-level keyring is the thing being searched for
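
The order of the new checks can be summarised with a small user-space
model.  It follows the error codes used in the diff below (-EAGAIN for a
revoked or expired keyring, -ENOKEY for a negative match), but the struct,
the flag bit numbers and the function name are placeholders for this sketch.

```c
#include <errno.h>
#include <stdbool.h>

#ifndef ENOKEY
#define ENOKEY 126      /* Linux value; fallback for other platforms */
#endif

/* Flag bit numbers are placeholders for this sketch, not the kernel's. */
enum { MODEL_FLAG_REVOKED = 1, MODEL_FLAG_NEGATIVE = 2 };

enum { TOP_SEARCH = 0, TOP_FOUND = 1 };     /* non-error outcomes */

struct model_key {
    unsigned long flags;
    long expiry;                            /* 0 means "never expires" */
};

static int check_top_keyring(const struct model_key *k, bool matches, long now)
{
    bool revoked  = (k->flags & (1UL << MODEL_FLAG_REVOKED)) != 0;
    bool negative = (k->flags & (1UL << MODEL_FLAG_NEGATIVE)) != 0;
    bool expired  = k->expiry && now >= k->expiry;

    if (matches) {
        /* the top-level keyring is itself the thing being searched for */
        if (revoked || expired)
            return -EAGAIN;
        if (negative)
            return -ENOKEY;
        return TOP_FOUND;
    }

    /* otherwise it must at least be valid before we descend into it */
    if (revoked || negative || expired)
        return -EAGAIN;
    return TOP_SEARCH;
}
```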


Signed-off-by: Kevin Coffman <[EMAIL PROTECTED]>
Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 security/keys/keyring.c |   35 +++
 1 files changed, 31 insertions(+), 4 deletions(-)


diff --git a/security/keys/keyring.c b/security/keys/keyring.c
index 88292e3..76b89b2 100644
--- a/security/keys/keyring.c
+++ b/security/keys/keyring.c
@@ -292,7 +292,7 @@ key_ref_t keyring_search_aux(key_ref_t keyring_ref,
 
struct keyring_list *keylist;
struct timespec now;
-   unsigned long possessed;
+   unsigned long possessed, kflags;
struct key *keyring, *key;
key_ref_t key_ref;
long err;
@@ -318,6 +318,32 @@ key_ref_t keyring_search_aux(key_ref_t keyring_ref,
now = current_kernel_time();
err = -EAGAIN;
sp = 0;
+   
+   /* firstly we should check to see if this top-level keyring is what we
+* are looking for */
+   key_ref = ERR_PTR(-EAGAIN);
+   kflags = keyring->flags;
+   if (keyring->type == type && match(keyring, description)) {
+   key = keyring;
+
+   /* check it isn't negative and hasn't expired or been
+* revoked */
+   if (kflags & (1 << KEY_FLAG_REVOKED))
+   goto error_2;
+   if (key->expiry && now.tv_sec >= key->expiry)
+   goto error_2;
+   key_ref = ERR_PTR(-ENOKEY);
+   if (kflags & (1 << KEY_FLAG_NEGATIVE))
+   goto error_2;
+   goto found;
+   }
+
+   /* otherwise, the top keyring must not be revoked, expired, or
+* negatively instantiated if we are to search it */
+   key_ref = ERR_PTR(-EAGAIN);
+   if (kflags & ((1 << KEY_FLAG_REVOKED) | (1 << KEY_FLAG_NEGATIVE)) ||
+   (keyring->expiry && now.tv_sec >= keyring->expiry))
+   goto error_2;
 
/* start processing a new keyring */
 descend:
@@ -331,13 +357,14 @@ descend:
/* iterate through the keys in this keyring first */
for (kix = 0; kix < keylist->nkeys; kix++) {
key = keylist->keys[kix];
+   kflags = key->flags;
 
/* ignore keys not of this type */
if (key->type != type)
continue;
 
/* skip revoked keys and expired keys */
-   if (test_bit(KEY_FLAG_REVOKED, &key->flags))
+   if (kflags & (1 << KEY_FLAG_REVOKED))
continue;
 
if (key->expiry && now.tv_sec >= key->expiry)
@@ -352,8 +379,8 @@ descend:
context, KEY_SEARCH) < 0)
continue;
 
-   /* we set a different error code if we find a negative key */
-   if (test_bit(KEY_FLAG_NEGATIVE, &key->flags)) {
+   /* we set a different error code if we pass a negative key */
+   if (kflags & (1 << KEY_FLAG_NEGATIVE)) {
err = -ENOKEY;
continue;
}



[PATCH 03/27] KEYS: Allow the callout data to be passed as a blob rather than a string

2008-01-22 Thread David Howells
Allow the callout data to be passed as a blob rather than a string for internal
kernel services that call any request_key_*() interface other than
request_key().  request_key() itself still takes a NUL-terminated string.

The functions that change are:

request_key_with_auxdata()
request_key_async()
request_key_async_with_auxdata()
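
A short user-space sketch of why an explicit length matters: binary callout
data may contain NUL bytes, so a string-only interface would see it cut
short.  The struct callout_blob and the helper length_as_string() are
hypothetical and exist only for this illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical pair mirroring the (callout_info, callout_len) arguments. */
struct callout_blob {
    const void *info;
    size_t len;
};

/* Length a string-only interface would see: it stops at the first NUL. */
static size_t length_as_string(const struct callout_blob *b)
{
    const void *nul = memchr(b->info, '\0', b->len);

    return nul ? (size_t)((const char *)nul - (const char *)b->info)
               : b->len;
}
```

For a four-byte blob whose second byte is zero, length_as_string() reports 1
even though callout_len is 4, which is the data loss the blob interface
avoids.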

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 Documentation/keys-request-key.txt |   11 +---
 Documentation/keys.txt |   14 +++---
 include/linux/key.h|9 ---
 security/keys/internal.h   |9 ---
 security/keys/keyctl.c |7 -
 security/keys/request_key.c|   49 ++--
 security/keys/request_key_auth.c   |   12 +
 7 files changed, 70 insertions(+), 41 deletions(-)


diff --git a/Documentation/keys-request-key.txt 
b/Documentation/keys-request-key.txt
index 266955d..09b55e4 100644
--- a/Documentation/keys-request-key.txt
+++ b/Documentation/keys-request-key.txt
@@ -11,26 +11,29 @@ request_key*():
 
struct key *request_key(const struct key_type *type,
const char *description,
-   const char *callout_string);
+   const char *callout_info);
 
 or:
 
struct key *request_key_with_auxdata(const struct key_type *type,
 const char *description,
-const char *callout_string,
+const char *callout_info,
+size_t callout_len,
 void *aux);
 
 or:
 
struct key *request_key_async(const struct key_type *type,
  const char *description,
- const char *callout_string);
+ const char *callout_info,
+ size_t callout_len);
 
 or:
 
struct key *request_key_async_with_auxdata(const struct key_type *type,
   const char *description,
-  const char *callout_string,
+  const char *callout_info,
+  size_t callout_len,
   void *aux);
 
 Or by userspace invoking the request_key system call:
diff --git a/Documentation/keys.txt b/Documentation/keys.txt
index 51652d3..b82d38d 100644
--- a/Documentation/keys.txt
+++ b/Documentation/keys.txt
@@ -771,7 +771,7 @@ payload contents" for more information.
 
struct key *request_key(const struct key_type *type,
const char *description,
-   const char *callout_string);
+   const char *callout_info);
 
 This is used to request a key or keyring with a description that matches
 the description specified according to the key type's match function. This
@@ -793,24 +793,28 @@ payload contents" for more information.
 
struct key *request_key_with_auxdata(const struct key_type *type,
 const char *description,
-const char *callout_string,
+const void *callout_info,
+size_t callout_len,
 void *aux);
 
 This is identical to request_key(), except that the auxiliary data is
-passed to the key_type->request_key() op if it exists.
+passed to the key_type->request_key() op if it exists, and the callout_info
+is a blob of length callout_len, if given (the length may be 0).
 
 
 (*) A key can be requested asynchronously by calling one of:
 
struct key *request_key_async(const struct key_type *type,
  const char *description,
- const char *callout_string);
+ const void *callout_info,
+ size_t callout_len);
 
 or:
 
struct key *request_key_async_with_auxdata(const struct key_type *type,
   const char *description,
-  const char *callout_string,
+  const char *callout_info,
+  size_t callout_len,
   void *aux);
 
 which are asynchronous equivalents of request_key() and
diff --git a/include/linux/key.h b/include/linux/key.h
index a70b8a8..163f864 100644
--- a/include/linux/key.h
+++ 

[PATCH 01/27] KEYS: Increase the payload size when instantiating a key

2008-01-22 Thread David Howells
Increase the size of a payload that can be used to instantiate a key in
add_key() and keyctl_instantiate_key().  This permits huge CIFS SPNEGO blobs to
be passed around.  The limit is raised to 1MB.  If kmalloc() can't allocate a
buffer of sufficient size, vmalloc() will be tried instead.
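
The fallback logic can be sketched in user space as below.  model_kmalloc()
and model_vmalloc() are stand-ins built on malloc(), and the 4096-byte page
size is an assumption; the point is only the control flow and the need to
remember which allocator produced the buffer.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define MODEL_PAGE_SIZE 4096u   /* assumption for the sketch */

/* Stand-ins: the "contiguous" allocator refuses anything above one page. */
static void *model_kmalloc(size_t n)
{
    return n <= MODEL_PAGE_SIZE ? malloc(n) : NULL;
}

static void *model_vmalloc(size_t n)
{
    return malloc(n);
}

struct payload_buf {
    void *ptr;
    bool vm;            /* remembers which allocator produced ptr */
};

static bool payload_alloc(struct payload_buf *p, size_t plen)
{
    p->vm = false;
    p->ptr = model_kmalloc(plen);
    if (!p->ptr) {
        if (plen <= MODEL_PAGE_SIZE)
            return false;       /* a small request failed: give up */
        p->vm = true;
        p->ptr = model_vmalloc(plen);
    }
    return p->ptr != NULL;
}

static void payload_free(struct payload_buf *p)
{
    /* In the kernel this would pick kfree() or vfree() based on p->vm. */
    free(p->ptr);
    p->ptr = NULL;
}
```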

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 security/keys/keyctl.c |   38 ++
 1 files changed, 30 insertions(+), 8 deletions(-)


diff --git a/security/keys/keyctl.c b/security/keys/keyctl.c
index d9ca15c..8ec8432 100644
--- a/security/keys/keyctl.c
+++ b/security/keys/keyctl.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "internal.h"
 
@@ -62,9 +63,10 @@ asmlinkage long sys_add_key(const char __user *_type,
char type[32], *description;
void *payload;
long ret;
+   bool vm;
 
ret = -EINVAL;
-   if (plen > 32767)
+   if (plen > 1024 * 1024 - 1)
goto error;
 
/* draw all the data into kernel space */
@@ -81,11 +83,18 @@ asmlinkage long sys_add_key(const char __user *_type,
/* pull the payload in if one was supplied */
payload = NULL;
 
+   vm = false;
if (_payload) {
ret = -ENOMEM;
payload = kmalloc(plen, GFP_KERNEL);
-   if (!payload)
-   goto error2;
+   if (!payload) {
+   if (plen <= PAGE_SIZE)
+   goto error2;
+   vm = true;
+   payload = vmalloc(plen);
+   if (!payload)
+   goto error2;
+   }
 
ret = -EFAULT;
if (copy_from_user(payload, _payload, plen) != 0)
@@ -113,7 +122,10 @@ asmlinkage long sys_add_key(const char __user *_type,
 
key_ref_put(keyring_ref);
  error3:
-   kfree(payload);
+   if (!vm)
+   kfree(payload);
+   else
+   vfree(payload);
  error2:
kfree(description);
  error:
@@ -821,9 +833,10 @@ long keyctl_instantiate_key(key_serial_t id,
key_ref_t keyring_ref;
void *payload;
long ret;
+   bool vm = false;
 
ret = -EINVAL;
-   if (plen > 32767)
+   if (plen > 1024 * 1024 - 1)
goto error;
 
/* the appropriate instantiation authorisation key must have been
@@ -843,8 +856,14 @@ long keyctl_instantiate_key(key_serial_t id,
if (_payload) {
ret = -ENOMEM;
payload = kmalloc(plen, GFP_KERNEL);
-   if (!payload)
-   goto error;
+   if (!payload) {
+   if (plen <= PAGE_SIZE)
+   goto error;
+   vm = true;
+   payload = vmalloc(plen);
+   if (!payload)
+   goto error;
+   }
 
ret = -EFAULT;
if (copy_from_user(payload, _payload, plen) != 0)
@@ -877,7 +896,10 @@ long keyctl_instantiate_key(key_serial_t id,
}
 
 error2:
-   kfree(payload);
+   if (!vm)
+   kfree(payload);
+   else
+   vfree(payload);
 error:
return ret;
 



[PATCH 00/27] Permit filesystem local caching

2008-01-22 Thread David Howells


These patches add local caching for network filesystems such as NFS.

The patches can roughly be broken down into a number of sets:

  (*) 01-keys-inc-payload.diff
  (*) 02-keys-search-keyring.diff
  (*) 03-keys-callout-blob.diff

  Three patches to the keyring code made to help the CIFS people.
  Included because of patches 05-08.

  (*) 04-keys-get-label.diff

  A patch to allow the security label of a key to be retrieved.
  Included because of patches 05-08.

  (*) 05-security-current-fsugid.diff
  (*) 06-security-separate-task-bits.diff
  (*) 07-security-subjective.diff
  (*) 08-security-secctx2secid.diff
  (*) 09-security-additional-classes.diff
  (*) 10-security-kernel_service-class.diff
  (*) 11-security-kernel-service.diff
  (*) 12-security-nfsd.diff

  Patches to permit the subjective security of a task to be overridden.
  All the security details in task_struct are decanted into a new struct
  that task_struct then has two pointers to: one that defines the
  objective security of that task (how other tasks may affect it) and one
  that defines the subjective security (how it may affect other objects).

  Note that I have dropped the idea of struct cred for the moment.  With
  the amount of stuff that was excluded from it, it wasn't actually any
  use to me.  However, it can be added later.

  Required for cachefiles.

  (*) 13-release-page.diff
  (*) 14-fscache-page-flags.diff
  (*) 15-add_wait_queue_tail.diff
  (*) 16-fscache.diff

  Patches to provide a local caching facility for network filesystems.

  (*) 17-cachefiles-ia64.diff
  (*) 18-cachefiles-ext3-f_mapping.diff
  (*) 19-cachefiles-write.diff
  (*) 20-cachefiles-monitor.diff
  (*) 21-cachefiles-export.diff
  (*) 22-cachefiles.diff

  Patches to provide a local cache in a directory of an already mounted
  filesystem.

  (*) 23-nfs-memleak.diff
  (*) 24-fscache-nfs.diff
  (*) 25-fscache-nfs-mount.diff
  (*) 26-fscache-nfs-display.diff
  (*) 27-fscache-nfs-persb.diff

  Patches to provide NFS with local caching.  The fifth of these patches
  makes caching configurable per superblock.
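
The objective/subjective split described for the security patches above can
be pictured with a minimal user-space model.  Every name in it (sec_model,
task_model2, override_subjective(), revert_subjective()) is hypothetical;
the real patches define their own structures and helpers.

```c
/* Hypothetical, heavily simplified types; names are illustrative only. */
struct sec_model {
    unsigned int fsuid;
    unsigned int secid;
};

struct task_model2 {
    struct sec_model *objective;    /* how other tasks may act on this task */
    struct sec_model *subjective;   /* how this task acts on other objects */
};

/* Temporarily act with different security; returns the previous pointer. */
static struct sec_model *override_subjective(struct task_model2 *t,
                                             struct sec_model *as)
{
    struct sec_model *old = t->subjective;

    t->subjective = as;
    return old;
}

static void revert_subjective(struct task_model2 *t, struct sec_model *old)
{
    t->subjective = old;
}
```

Only the subjective pointer changes during an override, so other tasks
acting on this one still see its own, unchanged objective security.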


I've updated the patches to compile on as many arches as I can get compilers
for and can get to compile.  However, for patch 06, the sparc and alpha arches
need some asm work as they access security information from asm code, using
asm-offsets to calculate the offset.


The SELinux base code will also need updating to have the security class, lest
the following error appear in dmesg:

context_struct_compute_av:  unrecognized class 69


I've provided a patch to make NFSd use task_security and current->act_as to
change its security settings.


I've also renamed the accessors for the PG_fscache and PG_fscache_write bits in
page-flags.h, pagemap.h and filemap.c (they subclass PG_private_2 and
PG_owner_priv_2 so these are the accessors in the main headers).  I've then
wrapped them in fscache.h.

--
A tarball of the patches is available at:


http://people.redhat.com/~dhowells/fscache/patches/nfs+fscache-27.tar.bz2


To use this version of CacheFiles, the cachefilesd-0.9 is also required.  It
is available as an SRPM:

http://people.redhat.com/~dhowells/fscache/cachefilesd-0.9-1.fc7.src.rpm

Or as individual bits:

http://people.redhat.com/~dhowells/fscache/cachefilesd-0.9.tar.bz2
http://people.redhat.com/~dhowells/fscache/cachefilesd.fc
http://people.redhat.com/~dhowells/fscache/cachefilesd.if
http://people.redhat.com/~dhowells/fscache/cachefilesd.te
http://people.redhat.com/~dhowells/fscache/cachefilesd.spec

The .fc, .if and .te files are for manipulating SELinux.

David


Re: [kvm-devel] [PATCH] export notifier #1

2008-01-22 Thread Christoph Lameter
On Wed, 23 Jan 2008, Benjamin Herrenschmidt wrote:

> > - anon_vma/inode and pte locks are held during callbacks.
> 
> So how does that fix the problem of sleeping then ?

The locks are taken in the mmu_ops patch. This patch does not hold them 
while performing the callbacks.



Re: Massive IDE problems. Who leaves data here?

2008-01-22 Thread Bill Davidsen

Manuel Reimer wrote:
> Hello,
>
> everything started with an attempt to burn Slackware 12.0 from the
> original DVD to a new medium with different boot settings. I always got
> corrupted results and didn't know why.
>
> So I started with an "md5sum -c CHECKSUMS.md5" directly on the original
> media. This resulted in "everything OK".
>
> Now I copied the whole DVD to my hard drive and created an ISO from it.
> I mounted the ISO locally and my md5sum now results in 5 corrupted files.
>
> --> A Bug in mkisofs?
>
> No, unfortunately not, as an md5sum on the copy I created from the
> original DVD by using "cp -vr" is corrupted, too!

Possibly a known kernel problem: you may have read past the end of data
into the pad sectors of the DVD and gotten garbage at the end of the ISO
image. Use isoinfo to determine the correct size of the ISO filesystem,
and compare. You can try setting readahead on the DVD reader to zero
with blockdev.

If the file is smaller, that is a different bug: if readahead hits EOF it
returns no data instead of a short read. The blockdev fix should handle
that as well. This was supposed to be fixed in recent kernels; that may
be true.

I suggest the [EMAIL PROTECTED] mailing list is a better forum
for CD/DVD/BR problems: good technical people, unfortunately with
personal agendas in some cases.

> So md5sum on the original DVD is OK, but after copying to my hard drive,
> several files are corrupted.

That's odd, I would expect the data on the disk to just be the wrong
size, and get a CRC on that. You might also use readcd to pull the data;
that almost always does what it should.

> I'm using kernel 2.6.21.5. Distribution is Slackware 12.0
> All my "partitions" are LVs in LVM2
>
> I also updated the kernel to 2.6.23.12 to test with this one, but I
> still get corrupted files.
>
> Is this an LVM bug? Do I already have a corrupted LVM filesystem? How to
> check/fix it? Is this a known kernel bug? What may be the reason for
> corrupted files?
>
> I've created a backup of my important data to a second disc to a "real
> ext2 partition" (without LVM), but this is connected to the same IDE
> controller and I don't even know if I may still trust my mainboard...
>
> I also get those kernel messages via dmesg:
>
> http://pastebin.org/16537

Could be anything; in no particular order: dirty lens, bad drive, bad
DVD, firmware error, cable, power supply, acpi confused... could even be
a poorly handled end of data on the DVD. Not enough info for me to tell
for sure. Trying readcd is cheap and turning off readahead on the DVD
drive is easy; if the problem persists you probably want to take it to
the mailing list.

> Thank you very much in advance for any help!

I'm not sure I helped, but you now have more and better things about
which to be confused.  ;-)

> Yours
>
> Manuel




--
Bill Davidsen <[EMAIL PROTECTED]>
  "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot
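
The size comparison suggested in this reply can also be done by hand. What
follows is a minimal sketch, not part of the original message, assuming a
standard ISO 9660 image: the primary volume descriptor sits in sector 16
(byte offset 32768), with the volume size in blocks at offset 80 and the
logical block size at offset 128, both stored little-endian first. The
name iso9660_fs_bytes() is made up for illustration.

```c
#include <stdint.h>
#include <string.h>

#define ISO_SECTOR_SIZE 2048
/* The primary volume descriptor (PVD) starts at sector 16 of the image. */
#define ISO_PVD_OFFSET  (16 * ISO_SECTOR_SIZE)

/* Decode a little-endian 32-bit value from a byte buffer. */
static uint32_t le32(const unsigned char *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Decode a little-endian 16-bit value from a byte buffer. */
static uint16_t le16(const unsigned char *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

/*
 * pvd points at the 2048 bytes read from ISO_PVD_OFFSET.
 * Returns the filesystem size in bytes, or 0 if the buffer does not
 * look like a primary volume descriptor.
 */
uint64_t iso9660_fs_bytes(const unsigned char *pvd)
{
	/* Type 1 plus the "CD001" identifier marks a PVD. */
	if (pvd[0] != 1 || memcmp(pvd + 1, "CD001", 5) != 0)
		return 0;

	/* blocks (offset 80) times block size (offset 128) */
	return (uint64_t)le32(pvd + 80) * le16(pvd + 128);
}
```

Read 2048 bytes at ISO_PVD_OFFSET from the image or the drive, pass them
to iso9660_fs_bytes(), and compare the result with the size of the image
file. Bytes beyond that value would be the kind of trailing garbage
described above.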


Re: CPA boot crash (was: [PATCH] [0/36] Great change_page_attr patch series v3)

2008-01-22 Thread Thomas Gleixner
On Tue, 22 Jan 2008, Andi Kleen wrote:
> According to you and Ingo "the global perspective" is to get
> simple stuff first in. But in this case you're doing the complicated
> (and worse the unfinished) stuff first which seems to be against
> your own principles.

No, the global perspective is to get a stable and reliable system,
which allows us to do new features like gbpages, PAT and whatever
comes up next in a clean way.

Your patches just shove another extra into the existing code base
without doing any consolidation work and without any consideration of
problems we need to urgently solve in this area.

Your only care is to get stuff merged which is interesting for you. I
can understand that, but it should be entirely clear to you as an
engineer that ignoring the existing problems and adding more (even
simple) stuff makes it more complex to consolidate and is nothing else
than bad engineering.

PAT is high on the requirements list, not because it's not complex (it
definitely is), but simply because Linux has a years-long backlog
(it's the only modern OS on the planet still not using PAT) and
hardware makers are stepping beyond the limits of MTRRs. There is an
increasing number of systems which don't work under Linux properly due
to the MTRR limitations, but work perfectly fine with other
OSs. Should we ignore that ?

While PAT is a 10-year-old hardware feature, gbpages is a feature for
a brand new chip, which is not even available to mere mortals in a
useable form. And there is no real problem with not having gbpages for
some time. So where is the pressure to get that in? Just because it
can be done and happens to work on some test machine?

PAT patches have been around for years and nothing happened - while
the first time gbpages were submitted was 19 days ago by you.

Of all pending features, PAT has a priority simply because it
affects users. The lack of gbpages does not. We are not going to rush
PAT in before it is stable, but we hold everything off which
interferes with getting it to that point.

Please stop arguing around with the subtle undertone of us having no
clue about the topics. We looked into the whole set of pending issues,
including your gbpages patchset and we well understand the
implications. It is quite clear that we need to fix the underlying
system _before_ we add more things to it. That applies to PAT, CPA and
gbpages in the same way.

In the end all of those features will benefit from a consolidated
implementation.

It's up to you whether you help to get there sooner or just sit back
and argue in circles until others have done the hard work and you can
add gbpages.

Thanks,
tglx


Re: [PATCH 09/26] atl1: refactor tx processing

2008-01-22 Thread Jay Cliburn
On Tue, 22 Jan 2008 04:58:17 -0500
Jeff Garzik <[EMAIL PROTECTED]> wrote:

> [EMAIL PROTECTED] wrote:
> > From: Jay Cliburn <[EMAIL PROTECTED]>
> > 
> > Refactor tx processing to use a less convoluted tx packet
> > descriptor and to conform generally with the vendor's current
> > version 1.2.40.2.
> > 
> > Signed-off-by: Jay Cliburn <[EMAIL PROTECTED]>
> > ---
> >  drivers/net/atlx/atl1.c |  265
> > +--
> > drivers/net/atlx/atl1.h |  201 +++-
> > 2 files changed, 246 insertions(+), 220 deletions(-)
> 
> for such a huge patch, this description is very tiny.  [describe]
> what is refactored, and why.

Okay, I'll go back and rework the offending descriptions for this and
the other patches in this set.

> what does "less convoluted" mean?

I should have written "simpler," I suppose.

Before:
===
struct tso_param {
	u32 tsopu;	/* tso_param upper word */
	u32 tsopl;	/* tso_param lower word */
};

struct csum_param {
	u32 csumpu;	/* csum_param upper word */
	u32 csumpl;	/* csum_param lower word */
};

union tpd_descr {
	u64 data;
	struct csum_param csum;
	struct tso_param tso;
};

struct tx_packet_desc {
	__le64 buffer_addr;
	union tpd_descr desc;
};


After:
==
struct tx_packet_desc {
	__le64 buffer_addr;
	__le32 word2;
	__le32 word3;
};
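
A quick way to see that the simplification does not change the
descriptor's footprint is to compile both layouts side by side. This is
only a userspace sketch: the kernel types are replaced by <stdint.h>
stand-ins, the endianness annotations are dropped, and the _old/_new
suffixes and the helper function names are invented here. How word2 and
word3 map onto the former upper and lower words is not stated in this
message, so the sketch checks sizes and offsets only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel's fixed-width types (assumption). */
typedef uint32_t u32;
typedef uint64_t u64;

struct tso_param {
	u32 tsopu;	/* tso_param upper word */
	u32 tsopl;	/* tso_param lower word */
};

struct csum_param {
	u32 csumpu;	/* csum_param upper word */
	u32 csumpl;	/* csum_param lower word */
};

union tpd_descr {
	u64 data;
	struct csum_param csum;
	struct tso_param tso;
};

/* Layout before the refactoring. */
struct tx_packet_desc_old {
	u64 buffer_addr;
	union tpd_descr desc;
};

/* Layout after the refactoring. */
struct tx_packet_desc_new {
	u64 buffer_addr;
	u32 word2;
	u32 word3;
};

/* Both descriptors occupy 16 bytes; fail the build otherwise. */
static_assert(sizeof(struct tx_packet_desc_old) == 16, "old size");
static_assert(sizeof(struct tx_packet_desc_new) == 16, "new size");

size_t old_desc_size(void) { return sizeof(struct tx_packet_desc_old); }
size_t new_desc_size(void) { return sizeof(struct tx_packet_desc_new); }
```

The two 32-bit words start where the union used to start (offset 8), so
code that fills the descriptor only has to change how it names the
fields, not where the data ends up.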



Re: PROBLEM: SECCOMP documentation outdated in some arch/*/Kconfig

2008-01-22 Thread Randy Dunlap
On Tue, 22 Jan 2008 15:41:58 +0100 Helmut Grohne wrote:

> Hi,
> 
> I didn't find out whom to report this bug to and thus report to
> linux-kernel@vger.kernel.org as described in
> http://kernel.org/pub/linux/docs/lkml/reporting-bugs.html.

Andrea cc-ed.

Helmut, would you care to make a patch that you think should be
applied to the current kernel source tree?


> I'm posting from outside, so please CC me.
> 
> [1] The description about seccomp is outdated in some arch/*/Kconfig
> files.
> 
> [2] According to the source (2.6.23.14) seccomp is to be activated using
> prctl. It was previously activated using a file /proc/<pid>/seccomp.
> The Kconfig documentation (also displayed in menuconfig) does not
> reflect this change and is thus wrong.
> 
> [3] seccomp documentation Kconfig
> 
> [4] 2.6.23.14, seems to also apply to git head:
> 
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=arch/x86/Kconfig;h=80b7ba4056dbbb566841c1e1cbef9475730fe199;hb=HEAD
> 
> [5] no oops
> 
> [6] less arch/x86_64/Kconfig
> /SECCOMP
> 
> [7] Ask me again if you really think you need information about the
> environment for a documentation bug.

---
~Randy


[PATCH 4/4] PCI: Run ACPI _OSC method on root bridges only

2008-01-22 Thread Andrew Patterson
From: Andrew Patterson <[EMAIL PROTECTED]>

According to the PCI Firmware Specification Revision 3.0 section 4.5, _OSC
should only be called on a root bridge.  Here is the relevant passage: "The
_OSC interface defined in this section applies only to Host Bridge ACPI
devices that originate PCI, PCI-X, or PCI Express hierarchies". Changed the
code to find the parent root bridge of the device and call _OSC on that.

Signed-off-by: Andrew Patterson <[EMAIL PROTECTED]>
---

 drivers/pci/pcie/aer/aerdrv_acpi.c |   22 ++
 1 files changed, 6 insertions(+), 16 deletions(-)

diff --git a/drivers/pci/pcie/aer/aerdrv_acpi.c 
b/drivers/pci/pcie/aer/aerdrv_acpi.c
index f685bf5..8c199ae 100644
--- a/drivers/pci/pcie/aer/aerdrv_acpi.c
+++ b/drivers/pci/pcie/aer/aerdrv_acpi.c
@@ -31,23 +31,13 @@ int aer_osc_setup(struct pcie_device *pciedev)
 {
acpi_status status = AE_NOT_FOUND;
struct pci_dev *pdev = pciedev->port;
-   acpi_handle handle = DEVICE_ACPI_HANDLE(&pdev->dev);
-   struct pci_bus *parent;
+   acpi_handle handle = 0;
 
-   while (!handle) {
-   if (!pdev || !pdev->bus->parent)
-   break;
-   parent = pdev->bus->parent;
-   if (!parent->self)
-   /* Parent must be a host bridge */
-   handle = acpi_get_pci_rootbridge_handle(
-   pci_domain_nr(parent),
-   parent->number);
-   else
-   handle = DEVICE_ACPI_HANDLE(
-   &(parent->self->dev));
-   pdev = parent->self;
-   }
+   /* Find root host bridge */
+   while (pdev->bus && pdev->bus->self)
+   pdev = pdev->bus->self;
+   handle = acpi_get_pci_rootbridge_handle(
+   pci_domain_nr(pdev->bus), pdev->bus->number);
 
if (handle) {
pcie_osc_support_set(OSC_EXT_PCI_CONFIG_SUPPORT);
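
The new loop can be tried outside the kernel with mock structures. This
is a minimal sketch, assuming (as the patch does) that a root bus is the
one whose self pointer is NULL. struct mock_dev, struct mock_bus and
find_root_bus() are hypothetical stand-ins for struct pci_dev,
struct pci_bus and the inline loop in aer_osc_setup().

```c
#include <stddef.h>

struct mock_bus;

/* Stand-in for struct pci_dev: only the bus the device sits on. */
struct mock_dev {
	struct mock_bus *bus;
};

/* Stand-in for struct pci_bus. */
struct mock_bus {
	struct mock_dev *self;	/* bridge leading to this bus, NULL on a root bus */
	int number;
};

/*
 * Walk upstream from dev until the bus has no bridge above it,
 * then return that bus (the root bus of the hierarchy).
 */
const struct mock_bus *find_root_bus(const struct mock_dev *dev)
{
	while (dev->bus && dev->bus->self)
		dev = dev->bus->self;
	return dev->bus;
}
```

In the patch, the domain and number of the bus found this way are what
get passed to acpi_get_pci_rootbridge_handle(), so _OSC is evaluated on
the host bridge rather than on an intermediate device.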



[PATCH 2/4] PCI ACPI: AER driver should only register PCIe devices with _OSC.

2008-01-22 Thread Andrew Patterson
From: Andrew Patterson <[EMAIL PROTECTED]>

AER is only used with PCIe devices so we should only check PCIe devices for
_OSC support.

Signed-off-by: Andrew Patterson <[EMAIL PROTECTED]>
---

 drivers/pci/pcie/aer/aerdrv_acpi.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/pci/pcie/aer/aerdrv_acpi.c 
b/drivers/pci/pcie/aer/aerdrv_acpi.c
index 1a1eb45..f685bf5 100644
--- a/drivers/pci/pcie/aer/aerdrv_acpi.c
+++ b/drivers/pci/pcie/aer/aerdrv_acpi.c
@@ -50,7 +50,7 @@ int aer_osc_setup(struct pcie_device *pciedev)
}
 
if (handle) {
-   pci_osc_support_set(OSC_EXT_PCI_CONFIG_SUPPORT);
+   pcie_osc_support_set(OSC_EXT_PCI_CONFIG_SUPPORT);
status = pci_osc_control_set(handle,
OSC_PCI_EXPRESS_AER_CONTROL |
OSC_PCI_EXPRESS_CAP_STRUCTURE_CONTROL);



[PATCH 1/4] PCI ACPI: Added a function to register _OSC with only PCIe devices.

2008-01-22 Thread Andrew Patterson
From: Andrew Patterson <[EMAIL PROTECTED]>

The function pci_osc_support_set() traverses every root bridge when
checking for _OSC support for a capability.  It quits as soon as it finds a
device/bridge that doesn't support the requested capability. This won't
work for systems that have mixed PCI and PCIe bridges when checking for
PCIe features.  I split this function into two -- pci_osc_support_set() and
pcie_osc_support_set(). The latter is used when only PCIe devices should be
traversed.

Signed-off-by: Andrew Patterson <[EMAIL PROTECTED]>
---

 drivers/pci/pci-acpi.c   |6 +++---
 include/linux/pci-acpi.h |   11 ++-
 2 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/drivers/pci/pci-acpi.c b/drivers/pci/pci-acpi.c
index 02e4876..ec61428 100644
--- a/drivers/pci/pci-acpi.c
+++ b/drivers/pci/pci-acpi.c
@@ -156,13 +156,13 @@ run_osc_out:
 }
 
 /**
- * pci_osc_support_set - register OS support to Firmware
+ * __pci_osc_support_set - register OS support to Firmware
  * @flags: OS support bits
  *
  * Update OS support fields and doing a _OSC Query to obtain an update
  * from Firmware on supported control bits.
  **/
-acpi_status pci_osc_support_set(u32 flags)
+acpi_status __pci_osc_support_set(u32 flags, const char *hid)
 {
u32 temp;
acpi_status retval;
@@ -176,7 +176,7 @@ acpi_status pci_osc_support_set(u32 flags)
temp = ctrlset_buf[OSC_CONTROL_TYPE];
ctrlset_buf[OSC_QUERY_TYPE] = OSC_QUERY_ENABLE;
ctrlset_buf[OSC_CONTROL_TYPE] = OSC_CONTROL_MASKS;
-   acpi_get_devices ( PCI_ROOT_HID_STRING,
+   acpi_get_devices(hid,
acpi_query_osc,
ctrlset_buf,
(void **) &retval );
diff --git a/include/linux/pci-acpi.h b/include/linux/pci-acpi.h
index 936ef82..3ba2506 100644
--- a/include/linux/pci-acpi.h
+++ b/include/linux/pci-acpi.h
@@ -48,7 +48,15 @@
 
 #ifdef CONFIG_ACPI
 extern acpi_status pci_osc_control_set(acpi_handle handle, u32 flags);
-extern acpi_status pci_osc_support_set(u32 flags);
+extern acpi_status __pci_osc_support_set(u32 flags, const char *hid);
+static inline acpi_status pci_osc_support_set(u32 flags)
+{
+   return __pci_osc_support_set(flags, PCI_ROOT_HID_STRING);
+}
+static inline acpi_status pcie_osc_support_set(u32 flags)
+{
+   return __pci_osc_support_set(flags, PCI_EXPRESS_ROOT_HID_STRING);
+}
 #else
 #if !defined(AE_ERROR)
 typedef u32acpi_status;
@@ -57,6 +65,7 @@ typedef u32   acpi_status;
 static inline acpi_status pci_osc_control_set(acpi_handle handle, u32 flags)
 {return AE_ERROR;}
 static inline acpi_status pci_osc_support_set(u32 flags) {return AE_ERROR;} 
+static inline acpi_status pcie_osc_support_set(u32 flags) {return AE_ERROR;}
 #endif
 
 #endif /* _PCI_ACPI_H_ */



[PATCH 3/4] ACPI: Check for any matching CID when walking namespace.

2008-01-22 Thread Andrew Patterson
From: Andrew Patterson <[EMAIL PROTECTED]>

The callback function acpi_ns_get_device_callback(), called from
acpi_get_devices(), checks the CIDs if the HID does not match.  This code
has a bug where it requires that all CIDs match the HID. Changed the code
so that any CID match will do.

Signed-off-by: Andrew Patterson <[EMAIL PROTECTED]>
---

 drivers/acpi/namespace/nsxfeval.c |   11 ---
 1 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/acpi/namespace/nsxfeval.c 
b/drivers/acpi/namespace/nsxfeval.c
index f39fbc6..e562b24 100644
--- a/drivers/acpi/namespace/nsxfeval.c
+++ b/drivers/acpi/namespace/nsxfeval.c
@@ -443,6 +443,7 @@ acpi_ns_get_device_callback(acpi_handle obj_handle,
struct acpica_device_id hid;
struct acpi_compatible_id_list *cid;
acpi_native_uint i;
+   int found;
 
status = acpi_ut_acquire_mutex(ACPI_MTX_NAMESPACE);
if (ACPI_FAILURE(status)) {
@@ -496,16 +497,20 @@ acpi_ns_get_device_callback(acpi_handle obj_handle,
 
/* Walk the CID list */
 
+   found = 0;
for (i = 0; i < cid->count; i++) {
if (ACPI_STRNCMP(cid->id[i].value, info->hid,
 sizeof(struct
-   acpi_compatible_id)) !=
+   acpi_compatible_id)) ==
0) {
-   ACPI_FREE(cid);
-   return (AE_OK);
+   found = 1;
+   break;
}
}
ACPI_FREE(cid);
+   if (!found) {
+   return (AE_OK);
+   }
}
}
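
The behavioural difference is easy to reproduce in isolation. The sketch
below is not the ACPICA code: any_cid_matches() and all_cids_match() are
hypothetical helpers that use plain C strings and strcmp() in place of
the acpi_compatible_id structures and ACPI_STRNCMP, and the ID strings in
the usage note are sample values only.

```c
#include <stddef.h>
#include <string.h>

/* New behaviour: accept the device if any CID equals the wanted HID. */
int any_cid_matches(const char *const *cids, size_t count, const char *hid)
{
	int found = 0;
	size_t i;

	for (i = 0; i < count; i++) {
		if (strcmp(cids[i], hid) == 0) {
			found = 1;
			break;
		}
	}
	return found;
}

/* Old behaviour: reject the device at the first CID that differs. */
int all_cids_match(const char *const *cids, size_t count, const char *hid)
{
	size_t i;

	for (i = 0; i < count; i++) {
		if (strcmp(cids[i], hid) != 0)
			return 0;
	}
	return 1;
}
```

With a CID list of { "PNP0A08", "PNP0A03" } and a wanted ID of
"PNP0A03", any_cid_matches() accepts the device while all_cids_match()
rejects it, which is the case the commit message describes.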
 



[PATCH 0/4] ACPI fixes for PCIe AER

2008-01-22 Thread Andrew Patterson

The following patch series fixes some bugs in how Linux determines
whether PCIe Advanced Error Reporting (AER) is supported on a platform.  It
is currently broken on at least HP IA-64 systems.

 - PCI: Run ACPI _OSC method on root bridges only
 - ACPI: Check for any matching CID when walking namespace.
 - PCI ACPI: AER driver should only register PCIe devices with _OSC.
 - PCI ACPI: Added a function to register _OSC with only PCIe devices.


These patches apply to gregkh's patch tree:

  git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/patches


-- 
Andrew Patterson


[PATCH] x86_32: trim memory by updating e820 v3

2008-01-22 Thread Yinghai Lu
[PATCH] x86_32: trim memory by updating e820 v3

When the MTRRs do not cover all of the RAM listed in the e820 table, the
uncovered RAM needs to be trimmed, which means updating e820.

This reuses some code from x86_64.

To do that, add early_get_cap() and call it from early_cpu_detect(), and
move mtrr_bp_init() earlier.

Justine needs to test this on his special system with the buggy BIOS.

Signed-off-by: Yinghai Lu <[EMAIL PROTECTED]>

Index: linux-2.6/arch/x86/kernel/cpu/common.c
===
--- linux-2.6.orig/arch/x86/kernel/cpu/common.c
+++ linux-2.6/arch/x86/kernel/cpu/common.c
@@ -278,6 +278,33 @@ void __init cpu_detect(struct cpuinfo_x8
c->x86_cache_alignment = ((misc >> 8) & 0xff) * 8;
}
 }
+static void __cpuinit early_get_cap(struct cpuinfo_x86 *c)
+{
+	u32 tfms, xlvl;
+	int ebx;
+
+	memset(&c->x86_capability, 0, sizeof c->x86_capability);
+	if (have_cpuid_p()) {
+		/* Intel-defined flags: level 0x00000001 */
+		if (c->cpuid_level >= 0x00000001) {
+			u32 capability, excap;
+			cpuid(0x00000001, &tfms, &ebx, &capability, &excap);
+			c->x86_capability[0] = capability;
+			c->x86_capability[4] = excap;
+		}
+
+		/* AMD-defined flags: level 0x80000001 */
+		xlvl = cpuid_eax(0x80000000);
+		if ((xlvl & 0xffff0000) == 0x80000000) {
+			if (xlvl >= 0x80000001) {
+				c->x86_capability[1] = cpuid_edx(0x80000001);
+				c->x86_capability[6] = cpuid_ecx(0x80000001);
+			}
+		}
+
+	}
+
+}
 
 /* Do minimum CPU detection early.
Fields really needed: vendor, cpuid_level, family, model, mask, cache 
alignment.
@@ -306,6 +333,8 @@ static void __init early_cpu_detect(void
early_init_intel(c);
break;
}
+
+   early_get_cap(c);
 }
 
 static void __cpuinit generic_identify(struct cpuinfo_x86 * c)
@@ -485,7 +514,6 @@ void __init identify_boot_cpu(void)
identify_cpu(&boot_cpu_data);
sysenter_setup();
enable_sep_cpu();
-   mtrr_bp_init();
 }
 
 void __cpuinit identify_secondary_cpu(struct cpuinfo_x86 *c)
Index: linux-2.6/arch/x86/kernel/setup_32.c
===
--- linux-2.6.orig/arch/x86/kernel/setup_32.c
+++ linux-2.6/arch/x86/kernel/setup_32.c
@@ -49,6 +49,7 @@
 
 #include 
 
+#include 
 #include 
 #include 
 #include 
@@ -762,6 +763,11 @@ void __init setup_arch(char **cmdline_p)
 
max_low_pfn = setup_memory();
 
+   /* update e820 for memory not covered by WB MTRRs */
+   mtrr_bp_init();
+   if (mtrr_trim_uncached_memory(max_pfn))
+   max_low_pfn = setup_memory();
+
 #ifdef CONFIG_VMI
/*
 * Must be after max_low_pfn is determined, and before kernel
Index: linux-2.6/arch/x86/kernel/cpu/mtrr/main.c
===
--- linux-2.6.orig/arch/x86/kernel/cpu/mtrr/main.c
+++ linux-2.6/arch/x86/kernel/cpu/mtrr/main.c
@@ -624,7 +624,6 @@ static struct sysdev_driver mtrr_sysdev_
.resume = mtrr_restore,
 };
 
-#ifdef CONFIG_X86_64
 static int disable_mtrr_trim;
 
 static int __init disable_mtrr_trim_setup(char *str)
@@ -726,7 +725,6 @@ int __init mtrr_trim_uncached_memory(uns
 
return 0;
 }
-#endif
 
 /**
  * mtrr_bp_init - initialize mtrrs on the boot CPU
Index: linux-2.6/Documentation/kernel-parameters.txt
===
--- linux-2.6.orig/Documentation/kernel-parameters.txt
+++ linux-2.6/Documentation/kernel-parameters.txt
@@ -575,7 +575,7 @@ and is between 256 and 4096 characters. 
See drivers/char/README.epca and
Documentation/digiepca.txt.
 
-   disable_mtrr_trim [X86-64, Intel only]
+   disable_mtrr_trim [X86, Intel and AMD only]
By default the kernel will trim any uncacheable
memory out of your available memory pool based on
MTRR settings.  This parameter disables that behavior,
Index: linux-2.6/arch/x86/kernel/e820_32.c
===
--- linux-2.6.orig/arch/x86/kernel/e820_32.c
+++ linux-2.6/arch/x86/kernel/e820_32.c
@@ -749,3 +749,14 @@ static int __init parse_memmap(char *arg
return 0;
 }
 early_param("memmap", parse_memmap);
+void __init update_e820(void)
+{
+   u8 nr_map;
+
+   nr_map = e820.nr_map;
+   if (sanitize_e820_map(e820.map, &nr_map))
+   return;
+   e820.nr_map = nr_map;
+   printk(KERN_INFO "modified physical RAM map:\n");
+   print_memory_map("modified");
+}
Index: linux-2.6/include/asm-x86/e820_32.h
===
--- linux-2.6.orig/include/asm-x86/e820_32.h
+++ 

AACRAID driver broken in 2.6.22.x (and beyond?) [WAS: Re: 2.6.22.16 MD raid1 doesn't mark removed disk faulty, MD thread goes UN]

2008-01-22 Thread Mike Snitzer
On Jan 22, 2008 12:29 AM, Mike Snitzer <[EMAIL PROTECTED]> wrote:
> cc'ing Tanaka-san given his recent raid1 BUG report:
> http://lkml.org/lkml/2008/1/14/515
>
>
> On Jan 21, 2008 6:04 PM, Mike Snitzer <[EMAIL PROTECTED]> wrote:
> > Under 2.6.22.16, I physically pulled a SATA disk (/dev/sdac, connected to
> > an aacraid controller) that was acting as the local raid1 member of
> > /dev/md30.
> >
> > Linux MD didn't see an /dev/sdac1 error until I tried forcing the issue by
> > doing a read (with dd) from /dev/md30:

> The raid1d thread is locked at line 720 in raid1.c (raid1d+2437); aka
> freeze_array:
>
> (gdb) l *0x2539
> 0x2539 is in raid1d (drivers/md/raid1.c:720).
> 715  * wait until barrier+nr_pending match nr_queued+2
> 716  */
> 717 spin_lock_irq(&conf->resync_lock);
> 718 conf->barrier++;
> 719 conf->nr_waiting++;
> 720 wait_event_lock_irq(conf->wait_barrier,
> 721 conf->barrier+conf->nr_pending == conf->nr_queued+2,
> 722 conf->resync_lock,
> 723 raid1_unplug(conf->mddev->queue));
> 724 spin_unlock_irq(&conf->resync_lock);
>
> Given Tanaka-san's report against 2.6.23 and me hitting what seems to
> be the same deadlock in 2.6.22.16; it stands to reason this affects
> raid1 in 2.6.24-rcX too.

Turns out that the aacraid driver in 2.6.22.x is HORRIBLY BROKEN (when
you pull a drive); it responds to MD's write requests with uptodate=1
(in raid1_end_write_request) for the drive that was pulled!  I've not
looked to see if aacraid has been fixed in newer kernels... are others
aware of any crucial aacraid fixes in 2.6.23.x or 2.6.24?

After the drive was physically pulled, and small periodic writes
continued to the associated MD device, the raid1 MD driver did _NOT_
detect the pulled drive's writes as having failed (verified this with
systemtap).  MD happily thought the write completed to both members
(so MD had no reason to mark the pulled drive "faulty"; or mark the
raid "degraded").

Installing an Adaptec-provided 1.1-5[2451] driver enabled raid1 to
work as expected.

That said, I now have a recipe for hitting the raid1 deadlock that
Tanaka first reported over a week ago.  I'm still surprised that all
of this chatter about that BUG hasn't drawn interest/scrutiny from
others!?

regards,
Mike


[PATCH 2/3] x86: Unify fault_32|64.c with ifdefs

2008-01-22 Thread Harvey Harrison
Elimination of these ifdefs can be done in a unified file.

Signed-off-by: Harvey Harrison <[EMAIL PROTECTED]>
---
 arch/x86/mm/fault_32.c |  100 +--
 arch/x86/mm/fault_64.c |   93 +++-
 2 files changed, 177 insertions(+), 16 deletions(-)

diff --git a/arch/x86/mm/fault_32.c b/arch/x86/mm/fault_32.c
index 2d8a577..be0921c 100644
--- a/arch/x86/mm/fault_32.c
+++ b/arch/x86/mm/fault_32.c
@@ -48,7 +48,11 @@ static inline int notify_page_fault(struct pt_regs *regs)
int ret = 0;
 
/* kprobe_running() needs smp_processor_id() */
+#ifdef CONFIG_X86_32
if (!user_mode_vm(regs)) {
+#else
+   if (!user_mode(regs)) {
+#endif
preempt_disable();
if (kprobe_running() && kprobe_fault_handler(regs, 14))
ret = 1;
@@ -429,11 +433,15 @@ static noinline void pgtable_bad(unsigned long address, 
struct pt_regs *regs,
 #endif
 
 /*
+ * X86_32
  * Handle a fault on the vmalloc or module mapping area
  *
+ * X86_64
+ * Handle a fault on the vmalloc area
+ *
  * This assumes no large pages in there.
  */
-static inline int vmalloc_fault(unsigned long address)
+static int vmalloc_fault(unsigned long address)
 {
 #ifdef CONFIG_X86_32
unsigned long pgd_paddr;
@@ -508,6 +516,9 @@ int show_unhandled_signals = 1;
  * and the problem, and then passes it off to one of the appropriate
  * routines.
  */
+#ifdef CONFIG_X86_64
+asmlinkage
+#endif
 void __kprobes do_page_fault(struct pt_regs *regs, unsigned long error_code)
 {
struct task_struct *tsk;
@@ -516,6 +527,9 @@ void __kprobes do_page_fault(struct pt_regs *regs, unsigned 
long error_code)
unsigned long address;
int write, si_code;
int fault;
+#ifdef CONFIG_X86_64
+   unsigned long flags;
+#endif
 
/*
 * We can fault from pretty much anywhere, with unknown IRQ state.
@@ -547,6 +561,7 @@ void __kprobes do_page_fault(struct pt_regs *regs, unsigned 
long error_code)
 * (error_code & 4) == 0, and that the fault was not a
 * protection error (error_code & 9) == 0.
 */
+#ifdef CONFIG_X86_32
if (unlikely(address >= TASK_SIZE)) {
if (!(error_code & (PF_RSVD|PF_USER|PF_PROT)) &&
vmalloc_fault(address) >= 0)
@@ -569,7 +584,45 @@ void __kprobes do_page_fault(struct pt_regs *regs, 
unsigned long error_code)
 */
if (in_atomic() || !mm)
goto bad_area_nosemaphore;
+#else /* CONFIG_X86_64 */
+   if (unlikely(address >= TASK_SIZE64)) {
+   /*
+* Don't check for the module range here: its PML4
+* is always initialized because it's shared with the main
+* kernel text. Only vmalloc may need PML4 syncups.
+*/
+   if (!(error_code & (PF_RSVD|PF_USER|PF_PROT)) &&
+ ((address >= VMALLOC_START && address < VMALLOC_END))) {
+   if (vmalloc_fault(address) >= 0)
+   return;
+   }
+   /*
+* Don't take the mm semaphore here. If we fixup a prefetch
+* fault we could otherwise deadlock.
+*/
+   goto bad_area_nosemaphore;
+   }
+   if (likely(regs->flags & X86_EFLAGS_IF))
+   local_irq_enable();
+
+   if (unlikely(error_code & PF_RSVD))
+   pgtable_bad(address, regs, error_code);
+
+   /*
+* If we're in an interrupt, have no user context or are running in an
+* atomic region then we must not take the fault.
+*/
+   if (unlikely(in_atomic() || !mm))
+   goto bad_area_nosemaphore;
 
+   /*
+* User-mode registers count as a user access even for any
+* potential system fault or CPU buglet.
+*/
+   if (user_mode_vm(regs))
+   error_code |= PF_USER;
+again:
+#endif
/* When running in the kernel we expect faults to occur only to
 * addresses in user space.  All other faults represent errors in the
 * kernel and should generate an OOPS.  Unfortunately, in the case of an
@@ -595,7 +648,11 @@ void __kprobes do_page_fault(struct pt_regs *regs, 
unsigned long error_code)
vma = find_vma(mm, address);
if (!vma)
goto bad_area;
+#ifdef CONFIG_X86_32
if (vma->vm_start <= address)
+#else
+   if (likely(vma->vm_start <= address))
+#endif
goto good_area;
if (!(vma->vm_flags & VM_GROWSDOWN))
goto bad_area;
@@ -633,7 +690,9 @@ good_area:
goto bad_area;
}
 
- survive:
+#ifdef CONFIG_X86_32
+survive:
+#endif
/*
 * If for any reason at all we couldn't handle the fault,
 * make sure we exit gracefully rather than endlessly redo
@@ -704,6 +763,7 @@ bad_area_nosemaphore:

RE: [ofa-general] Re: InfiniBand/RDMA merge plans for 2.6.25

2008-01-22 Thread Glenn Streiff

> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] Behalf Of 
> Roland Dreier
> Sent: Tuesday, January 22, 2008 3:56 PM
> To: Christoph Hellwig
> Cc: linux-kernel@vger.kernel.org; [EMAIL PROTECTED]
> Subject: [ofa-general] Re: InfiniBand/RDMA merge plans for 2.6.25
> 
> 
>  > >  - Neteffect "nes" driver.  It's not terribly clean code but since
>  > >it's a new driver that is completely self-contained, I plan on
>  > >merging it and letting cleanups happen upstream.
>  > 
>  > New code should be better quality than old code, not worse.   I haven't
>  > actually seen the driver yet, but by that statement I'd be clearly
>  > against a merge.
> 
> The driver has been posted a few times; the latest code is in the
> "neteffect" branch of my tree:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git neteffect
> 
> It's not *that* bad -- certainly there are lots of things that could
> be improved (sparse endianness annotation, too many lines that are way
> too long, strange indentation of case labels, etc, etc) but it is a
> self-contained hardware driver.  I agree with Linus's position (stated
> at the last kernel summit) that we ought to merge hardware drivers
> early, so that users get the drivers with as little hassle as
> possible.  We lose a little leverage in getting cleanups done, but the
> number of people who see the code and are able to clean it up
> increases, so I think it's a good trade-off.
> 
>  - R.
> 

My view is the code should and will be cleaned up based upon
the feedback we've gotten from the community.  It is a priority
for me.

Several cleanup fixes are in the queue and are being worked on.
We haven't slipped into complacency at the prospect of the merge.

Glenn
[EMAIL PROTECTED]


[PATCH 1/3] x86: Unify fault_32|64.c by ifdef'd function bodies

2008-01-22 Thread Harvey Harrison
It's about time to get on with unifying these files; elimination
of the ugly ifdefs can occur in the unified file.

Signed-off-by: Harvey Harrison <[EMAIL PROTECTED]>
---
OK, time to bite the bullet, it's ugly, but we can now do the cleanups
once in the unified files.

 arch/x86/mm/fault_32.c |  116 +
 arch/x86/mm/fault_64.c |  148 +++-
 2 files changed, 263 insertions(+), 1 deletions(-)

diff --git a/arch/x86/mm/fault_32.c b/arch/x86/mm/fault_32.c
index f85e7c9..2d8a577 100644
--- a/arch/x86/mm/fault_32.c
+++ b/arch/x86/mm/fault_32.c
@@ -172,8 +172,17 @@ static void force_sig_info_fault(int si_signo, int si_code,
force_sig_info(si_signo, &info, tsk);
 }
 
+#ifdef CONFIG_X86_64
+static int bad_address(void *p)
+{
+   unsigned long dummy;
+   return probe_kernel_address((unsigned long *)p, dummy);
+}
+#endif
+
 void dump_pagetable(unsigned long address)
 {
+#ifdef CONFIG_X86_32
__typeof__(pte_val(__pte(0))) page;
 
page = read_cr3();
@@ -208,8 +217,42 @@ void dump_pagetable(unsigned long address)
}
 
printk("\n");
+#else /* CONFIG_X86_64 */
+   pgd_t *pgd;
+   pud_t *pud;
+   pmd_t *pmd;
+   pte_t *pte;
+
+   pgd = (pgd_t *)read_cr3();
+
+   pgd = __va((unsigned long)pgd & PHYSICAL_PAGE_MASK);
+   pgd += pgd_index(address);
+   if (bad_address(pgd)) goto bad;
+   printk("PGD %lx ", pgd_val(*pgd));
+   if (!pgd_present(*pgd)) goto ret;
+
+   pud = pud_offset(pgd, address);
+   if (bad_address(pud)) goto bad;
+   printk("PUD %lx ", pud_val(*pud));
+   if (!pud_present(*pud)) goto ret;
+
+   pmd = pmd_offset(pud, address);
+   if (bad_address(pmd)) goto bad;
+   printk("PMD %lx ", pmd_val(*pmd));
+   if (!pmd_present(*pmd) || pmd_large(*pmd)) goto ret;
+
+   pte = pte_offset_kernel(pmd, address);
+   if (bad_address(pte)) goto bad;
+   printk("PTE %lx", pte_val(*pte));
+ret:
+   printk("\n");
+   return;
+bad:
+   printk("BAD\n");
+#endif
 }
 
+#ifdef CONFIG_X86_32
 static inline pmd_t *vmalloc_sync_one(pgd_t *pgd, unsigned long address)
 {
unsigned index = pgd_index(address);
@@ -245,6 +288,7 @@ static inline pmd_t *vmalloc_sync_one(pgd_t *pgd, unsigned long address)
BUG_ON(pmd_page(*pmd) != pmd_page(*pmd_k));
return pmd_k;
 }
+#endif
 
 #ifdef CONFIG_X86_64
 static const char errata93_warning[] =
@@ -325,6 +369,7 @@ static int is_f00f_bug(struct pt_regs *regs, unsigned long address)
 static void show_fault_oops(struct pt_regs *regs, unsigned long error_code,
unsigned long address)
 {
+#ifdef CONFIG_X86_32
if (!oops_may_print())
return;
 
@@ -349,7 +394,39 @@ static void show_fault_oops(struct pt_regs *regs, unsigned long error_code,
printk(KERN_ALERT "IP:");
printk_address(regs->ip, 1);
dump_pagetable(address);
+#else /* CONFIG_X86_64 */
+   printk(KERN_ALERT "BUG: unable to handle kernel ");
+   if (address < PAGE_SIZE)
+   printk(KERN_CONT "NULL pointer dereference");
+   else
+   printk(KERN_CONT "paging request");
+   printk(KERN_CONT " at %016lx\n", address);
+
+   printk(KERN_ALERT "IP:");
+   printk_address(regs->ip, 1);
+   dump_pagetable(address);
+#endif
+}
+
+#ifdef CONFIG_X86_64
+static noinline void pgtable_bad(unsigned long address, struct pt_regs *regs,
+unsigned long error_code)
+{
+   unsigned long flags = oops_begin();
+   struct task_struct *tsk;
+
+   printk(KERN_ALERT "%s: Corrupted page table at address %lx\n",
+  current->comm, address);
+   dump_pagetable(address);
+   tsk = current;
+   tsk->thread.cr2 = address;
+   tsk->thread.trap_no = 14;
+   tsk->thread.error_code = error_code;
+   if (__die("Bad pagetable", regs, error_code))
+   regs = NULL;
+   oops_end(flags, regs, SIGKILL);
 }
+#endif
 
 /*
  * Handle a fault on the vmalloc or module mapping area
@@ -705,6 +782,7 @@ do_sigbus:
 
 void vmalloc_sync_all(void)
 {
+#ifdef CONFIG_X86_32
/*
 * Note that races in the updates of insync and start aren't
 * problematic: insync can only get set bits added, and updates to
@@ -739,4 +817,42 @@ void vmalloc_sync_all(void)
if (address == start && test_bit(pgd_index(address), insync))
start = address + PGDIR_SIZE;
}
+#else /* CONFIG_X86_64 */
+   /*
+* Note that races in the updates of insync and start aren't
+* problematic: insync can only get set bits added, and updates to
+* start are only improving performance (without affecting correctness
+* if undone).
+*/
+   static DECLARE_BITMAP(insync, PTRS_PER_PGD);
+   static unsigned long start = VMALLOC_START & PGDIR_MASK;
+   unsigned long 

Re: [patch] x86: test case for the RODATA config option

2008-01-22 Thread Ingo Molnar

cool!

could you perhaps also do an add-on:

> + /* test 1: read the value */
> + /* test 2: write to the variable; this should fault */
> + /* test 3: check the value hasn't changed */

   test 4: make it writable again
   test 5: make it NX -> check that it's not executable

and perhaps also check that normal kernel allocations (kmalloc(), etc.) are NX 
as well? (with the same section trick you use in 
this patch - perhaps try to call a kmalloc()-ed buffer that contains a 
'ret' instruction - if that call faults then the test is OK, if the call 
succeeds then the test failed.)
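
For illustration only, the sequence above (read, write that must fault,
value unchanged, writable again, NX call that must fault) can be tried
out in user space. This is not the kernel test from the patch: mprotect()
stands in for the kernel's page attribute changes, faults are caught with
a SIGSEGV/SIGBUS handler, and map_page(), write_faults() and
call_faults() are hypothetical helper names.

```c
#define _GNU_SOURCE
#include <setjmp.h>
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf fault_env;

/* Fault handler: jump back to the sigsetjmp() point in the prober. */
static void on_fault(int sig)
{
	(void)sig;
	siglongjmp(fault_env, 1);
}

static void install_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_fault;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGSEGV, &sa, NULL);
	sigaction(SIGBUS, &sa, NULL);
}

/* Map one anonymous page, readable and writable but not executable. */
static void *map_page(void)
{
	void *p = mmap(NULL, (size_t)sysconf(_SC_PAGESIZE),
		       PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	return p == MAP_FAILED ? NULL : p;
}

/* Return 1 if the store to *p faults, 0 if it goes through. */
static int write_faults(volatile int *p, int val)
{
	install_handler();
	if (sigsetjmp(fault_env, 1))
		return 1;	/* arrived here from on_fault() */
	*p = val;
	return 0;
}

/* Return 1 if calling into buf faults, 0 if the call returns. */
static int call_faults(void *buf)
{
	void (*fn)(void);

	memcpy(&fn, &buf, sizeof(fn));	/* data pointer -> function pointer */
	install_handler();
	if (sigsetjmp(fault_env, 1))
		return 1;
	fn();
	return 0;
}
```

A caller would map a page, store a value, mprotect() it to PROT_READ and
expect write_faults() to return 1 with the value unchanged, then restore
PROT_WRITE and expect 0. For the NX step, storing the x86 'ret' opcode
(0xc3) into a freshly mapped page and passing it to call_faults() should
give 1 on a system that enforces NX; if it returns 0 the page was
executable.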

Ingo


Re: [PATCHSET] printk: implement printk_header() and merging printk, take #2

2008-01-22 Thread Tejun Heo
Jan Engelhardt wrote:
> On Jan 23 2008 08:51, Tejun Heo wrote:
>> What do you think about the second suggestion then?
>>
>> ata1.00: line0
>> ata1.00  line1
>> ata1.00  line2
>>
>> It allows you to grab for the header && has indication for message
>> boundaries.
> 
> Then again, why not "[ata1.00] line0", then it matches what sd_mod does :)

Well, that's fine too but using ':' is much more common.  Just take a
look at the boot log and if we go with '[]', any ideas on how to
indicate multiline messages?
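
To make the quoted ':' format concrete, here is a small user-space
sketch that prefixes every line of a message with the header, using
"header: " on the first line and "header  " on continuation lines. It is
only an illustration of the output format under discussion;
format_with_header() is a hypothetical name and not the printk_header()
interface from the patchset.

```c
#include <stdio.h>
#include <string.h>

/*
 * Format msg (lines separated by '\n') into out. The first line gets
 * "hdr: ", later lines get "hdr  " so message boundaries stay visible.
 * Returns the number of bytes written, or -1 if out is too small.
 */
static int format_with_header(char *out, size_t outsz,
			      const char *hdr, const char *msg)
{
	size_t used = 0;
	int first = 1;

	if (outsz == 0)
		return -1;
	out[0] = '\0';

	while (*msg) {
		size_t len = strcspn(msg, "\n");	/* length of this line */
		int n = snprintf(out + used, outsz - used, "%s%s %.*s\n",
				 hdr, first ? ":" : " ", (int)len, msg);

		if (n < 0 || (size_t)n >= outsz - used)
			return -1;	/* output truncated */
		used += (size_t)n;
		first = 0;

		msg += len;
		if (*msg == '\n')
			msg++;		/* skip the separator */
	}
	return (int)used;
}
```

Called with hdr "ata1.00" and msg "line0\nline1\nline2", this produces
the three lines quoted above.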

-- 
tejun


Re: [patch 07/10] unprivileged mounts: add sysctl tunable for "safe" property

2008-01-22 Thread Serge E. Hallyn
Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> > > > What do you think about doing this only if FS_SAFE is also set,
> > > > so for instance at first only FUSE would allow itself to be
> > > > made user-mountable?
> > > > 
> > > > A safe thing to do, or overly intrusive?
> > > 
> > > It goes somewhat against the "no policy in kernel" policy ;).  I think
> > > the warning in the documentation should be enough to make sysadmins
> > > think twice before doing anything foolish:
> > 
> > Warning in which documentation?  A sysadmin considering setting fs_safe
> > for ext2 or xfs isn't going to be looking at fuse docs, which I think is
> > what you're talking about.  Are you going to add a file under
> > Documentation/filesystems?
> 
> Yes, I meant documentation of the new sysctl tunable in
> Documentation/filesystems/proc.txt:

Argh, sorry.

> > Index: linux/Documentation/filesystems/proc.txt
> > ===
> > --- linux.orig/Documentation/filesystems/proc.txt   2008-01-16 13:25:07.0 +0100
> > +++ linux/Documentation/filesystems/proc.txt        2008-01-16 13:25:09.0 +0100
> > @@ -43,6 +43,7 @@ Table of Contents
> >2.13 /proc/<pid>/oom_score - Display current oom-killer score
> >2.14 /proc/<pid>/io - Display the IO accounting fields
> >2.15 /proc/<pid>/coredump_filter - Core dump filtering settings
> > +  2.16 /proc/sys/fs/types - File system type specific parameters
> >  
> >  
> > --
> >  Preface
> > @@ -2283,4 +2284,21 @@ For example:
> >$ echo 0x7 > /proc/self/coredump_filter
> >$ ./some_program
> >  
> > +2.16 /proc/sys/fs/types/ - File system type specific parameters
> > +
> > +
> > +There's a separate directory /proc/sys/fs/types// for each
> > +filesystem type, containing the following files:
> > +
> > +usermount_safe
> > +--
> > +
> > +Setting this to non-zero will allow filesystems of this type to be
> > +mounted by unprivileged users (note, that there are other
> > +prerequisites as well).
> > +
> > +Care should be taken when enabling this, since most
> > +filesystems haven't been designed with unprivileged mounting
> > +in mind.
> > +
> >  
> > --
> > 
> 
> Do you think this is enough?  Or do we need something more, to prevent
> sysadmin inadvertently setting this for an unsafe filesystem?

I would think something more would be good.  First explaining
that fuse should be safe modulo warnings in the fuse documentation,
procfs and sysfs may be safe, while other filesystems are not known safe
at all.

Then explaining the dangers with not-known-safe filesystems and what is
needed to make them safe.  Clearly making sure input validation is
properly done so for instance getsb() doesn't turn into a buffer
overflow, etc.

Such a checklist also would be useful for holding a meaningful discussion
about the other filesystems and maybe turning some people loose on
an audit of other filesystems.

thanks,
-serge


Re: CPA boot crash (was: [PATCH] [0/36] Great change_page_attr patch series v3)

2008-01-22 Thread Ingo Molnar

* Andi Kleen <[EMAIL PROTECTED]> wrote:

> > because it interferes/interacts with CPA and the page table code. So
> 
> No that is not its main problem I believe. Main problem are all the 
> driver and other subsystem interactions (it is a little bit similar to 
> power management where you have lots of little bits all over right 
> instead of a single big one). [...]

that is (yet another) major misconception on your part. "Drivers" are an 
easy to blame target (i guess because there's no one out there to defend 
a vague "drivers" accusation), and they are not the problem here _at 
all_.

Drivers tell the architecture code which physical pages they'd like to 
have access to (or which page range they'd like to see different cache 
attributes on) and that's it. They are plain users of the ioremap() and 
change_page_attr() APIs. Nothing more, nothing less.

It is the utmost duty of architecture code to make those APIs 
fool-proof. Hardware _will_ mess up the physical parameters that get 
passed in every possible way - and drivers just try to use what the 
hardware tells them to use. So robustness is key and there's just no 
"driver reason" why these APIs cannot be robust.

so you are delusional if you think that the c_p_a() problems are "driver 
and other subsystem interactions".

And your analogy with power management could not be more mistaken. Power 
management and suspend/resume in particular is so complex because it is 
analogous to a _full bootup and shutdown cycle_, with the following, 
hard to meet expectation from the user: 'this stuff must work all the 
time, and must be instantaneous'. Suspend/resume is an _incredibly 
complex_ machinery and the user does not realize (and does not accept 
the consequences) of this complexity. It is a codepath that is affected 
by tens and tens of thousands of lines of driver and core kernel code. Just one 
single mistake and "resume does not work".

ioremap() and change_page_attr() on the other hand are a small, few 
hundred lines codebase with a stable and well-defined purpose. There are 
no significant "subsystem interactions" whatsoever.

by far the most intense and most high-frequency user of the 
change_page_attr() code is CONFIG_DEBUG_PAGEALLOC=y. It does a cpa call 
for every single page and slab allocation/freeing. But this debug 
feature ... is not enabled on the 64-bit side - why? So unfortunately we 
don't have any real robustness track record of the 64-bit side of the CPA 
code, and that's exactly the code your clflush and gbpages code changes.

oh, and due to that i'll probably revert these two patches of yours:

  Subject: x86: c_p_a(), change kernel_map_pages to not use c_p_a()
  Subject: x86: c_p_a(), change 32-bit back to init_mm semaphore locking

as with these changes you've removed _the_ most important stress-tester 
for the c_p_a() code: DEBUG_PAGEALLOC.

Ingo


Re: [PATCH] ppc: fix #ifdef-s in mediabay driver

2008-01-22 Thread Benjamin Herrenschmidt

On Wed, 2008-01-23 at 00:12 +0100, Bartlomiej Zolnierkiewicz wrote:
> * Replace incorrect CONFIG_BLK_DEV_IDE #ifdef in
>   check_media_bay() by CONFIG_MAC_FLOPPY one.
> 
> * Replace incorrect CONFIG_BLK_DEV_IDE #ifdef-s by
>   CONFIG_BLK_DEV_IDE_PMAC ones.
> 
> * check_media_bay() is used only by drivers/block/swim3.c
>   so make this function available only if CONFIG_MAC_FLOPPY
>   is defined.
> 
> * check_media_bay_by_base() and media_bay_set_ide_infos()
>   are used only by drivers/ide/ppc/pmac.c so make these
>   functions available only if CONFIG_MAC_FLOPPY is defined.
> 
> Signed-off-by: Bartlomiej Zolnierkiewicz <[EMAIL PROTECTED]>
> ---
> Ben, IMO this patch is safe for 2.6.24 (assuming that it builds fine :),
> otherwise I would like to ask for permission to merge it through IDE
> tree since I have other pending IDE patches depending on this one.

I'd rather avoid touching 2.6.24 unless it actually fixes a bug or
regression...

I'm tempted to actually remove all ifdef's ... if you have a media-bay,
then there are about 99% chances it contains an IDE device, with the
remaining percent being split with putting a floppy or a battery in. I
doubt anybody will care about building a kernel without the support for these
and with the mediabay support, and still want to save a handful of bytes
in that driver.

Ben.



