Re: [PATCH 24/25] r/o bind mounts: track number of mount writers
On Mon, 2007-09-24 at 12:42 -0700, Andrew Morton wrote:
> On Mon, 24 Sep 2007 12:28:11 -0700
> Dave Hansen <[EMAIL PROTECTED]> wrote:
> > On Mon, 2007-09-24 at 18:54 +0100, Christoph Hellwig wrote:
> > > As we already said in various messages, the percpu counters in here
> > > look rather fishy.  I'd recommend taking a look at the per-cpu
> > > superblock counters in XFS, as they've been debugged quite well by
> > > now and could probably be lifted into a generic library for this
> > > kind of thing.  The code is mostly in fs/xfs/xfs_mount.c and can
> > > be spotted by being under #ifdef HAVE_PERCPU_SB.
> > >
> > > It also handles cases like cpu hotplug nicely, which this code
> > > seems to work around by always iterating over all possible cpus,
> > > which might not be nice on a dual-core laptop with a distro kernel
> > > that also has to support big iron.
> >
> > I'll take a look at xfs to see what I can get out of it.
>
> And at include/linux/percpu_counter.h, please.

The basic incompatibility between what that provides and what I need is
that percpu_counters allow some fuzziness in the numbers.  One cpu can be
summing the numbers up while another is still adding to the local percpu
counters.  That's fine for statistics, but horrible for questions like,
"can anybody write to and corrupt my FS right now?"

It could probably be modified to do what I want, but it would still have
the "invented your own lock" problem, and would likely impact the
scalability of the existing "fuzzy" users.  We could introduce fuzzy and
coherent variants of the function calls, but that would probably
introduce more code than what I have now for the very specific
mnt_writer_lock.

> > There are basically two times when you have to do this
> > for_each_possible_cpu() stuff:
> > 1. when doing a r/w->r/o transition, which is rare, and
> >    certainly not a fast path
> > 2. when the per-cpu writer count underflows.  This requires
> >    a _minimum_ of 1<<16 file opens (configurable), each of which
> >    is closed on a different cpu than it was opened on.  Even
> >    if you were trying, I'm not sure you'd notice the overhead.
>
> Sounds like what you're doing is more akin to the local_t-based module
> refcounting.  `grep local_ kernel/module.c'.
>
> That code should be converted from NR_CPUS to for_each_possible_cpu()...

I think that can get converted to use the percpu_counters pretty easily.
I've coded that up and sent it [RFC] to lkml.  Rusty will forward it on
into mainline.

-- Dave
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Re: [PATCH 24/25] r/o bind mounts: track number of mount writers
On Mon, 2007-09-24 at 16:15 -0700, Andrew Morton wrote:
> On Mon, 24 Sep 2007 16:05:37 -0700
> Dave Hansen <[EMAIL PROTECTED]> wrote:
> > On Mon, 2007-09-24 at 15:25 -0700, Andrew Morton wrote:
> > > hm.  I saw that warning on my 2-way.  It has CONFIG_NR_CPUS=8 so
> > > perhaps the kernel has decided that this machine can possibly have
> > > eight CPUs.
> > >
> > > It's an old super-micro board, doesn't have ACPI.
> >
> > Well, it's looking like we only set cpu_possible_map from data we find
> > in the MP table, which makes sense.  The only question is how your
> > system gets more than ~8 possible cpus.  Do you have the .config handy?
>
> http://userweb.kernel.org/~akpm/config-vmm.txt

I've reproduced this.  I'll do more runtime testing with all of the
debugging, CONFIG_CPU_HOTPLUG=y, and NR_CPUS set to a high value.  It
should help catch things like this in the future.

-- Dave
Re: [PATCH 24/25] r/o bind mounts: track number of mount writers
On Mon, 24 Sep 2007 16:05:37 -0700 Dave Hansen <[EMAIL PROTECTED]> wrote:
> On Mon, 2007-09-24 at 15:25 -0700, Andrew Morton wrote:
> > hm.  I saw that warning on my 2-way.  It has CONFIG_NR_CPUS=8 so
> > perhaps the kernel has decided that this machine can possibly have
> > eight CPUs.
> >
> > It's an old super-micro board, doesn't have ACPI.
>
> Well, it's looking like we only set cpu_possible_map from data we find
> in the MP table, which makes sense.  The only question is how your
> system gets more than ~8 possible cpus.  Do you have the .config handy?

http://userweb.kernel.org/~akpm/config-vmm.txt
Re: [PATCH 24/25] r/o bind mounts: track number of mount writers
On Mon, 2007-09-24 at 15:25 -0700, Andrew Morton wrote:
> hm.  I saw that warning on my 2-way.  It has CONFIG_NR_CPUS=8 so perhaps
> the kernel has decided that this machine can possibly have eight CPUs.
>
> It's an old super-micro board, doesn't have ACPI.

Well, it's looking like we only set cpu_possible_map from data we find
in the MP table, which makes sense.  The only question is how your
system gets more than ~8 possible cpus.  Do you have the .config handy?

-- Dave
Re: [PATCH 24/25] r/o bind mounts: track number of mount writers
On Mon, 24 Sep 2007 15:06:42 -0700 Dave Hansen <[EMAIL PROTECTED]> wrote:
> On Sun, 2007-09-23 at 23:17 -0700, Andrew Morton wrote:
> > It looks like a false positive to me, but really, for a patchset of
> > this complexity and maturity I cannot fathom how it could have escaped
> > any lockdep testing.
>
> I test with lockdep all the time.  The problem was that lockdep doesn't
> complain until you have 8 nested locks, and I only tested on a 4-cpu
> system.
>
> I lowered the lockdep nesting limit to 3, and got the warning on my
> machine.

hm.  I saw that warning on my 2-way.  It has CONFIG_NR_CPUS=8 so perhaps
the kernel has decided that this machine can possibly have eight CPUs.

It's an old super-micro board, doesn't have ACPI.

OEM ID: INTEL    Product ID: 440BX    APIC at: 0xFEE0
Processor #0 6:8 APIC version 17
Processor #1 6:8 APIC version 17
I/O APIC #2 Version 17 at 0xFEC0.
Enabling APIC mode:  Flat.  Using 1 I/O APICs
Processors: 2
...
Checking if this processor honours the WP bit even in supervisor mode... Ok.
Calibrating delay using timer specific routine.. 1707.03 BogoMIPS (lpj=3414067)
Mount-cache hash table entries: 512
CPU: After generic identify, caps: 0383fbff
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 256K
CPU: After all inits, caps: 0383fbff 0040
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
Compat vDSO mapped to e000.
Checking 'hlt' instruction... OK.
lockdep: not fixing up alternatives.
CPU0: Intel Pentium III (Coppermine) stepping 03
lockdep: not fixing up alternatives.
Booting processor 1/1 eip 2000
Initializing CPU#1
Calibrating delay using timer specific routine.. 1704.02 BogoMIPS (lpj=3408054)
CPU: After generic identify, caps: 0383fbff
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 256K
CPU: After all inits, caps: 0383fbff 0040
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#1.
CPU1: Intel Pentium III (Coppermine) stepping 03
Total of 2 processors activated (3411.06 BogoMIPS).
ExtINT not setup in hardware but reported by MP table
ENABLING IO-APIC IRQs
..TIMER: vector=0x31 apic1=0 pin1=2 apic2=0 pin2=0
APIC timer registered as dummy, due to nmi_watchdog=1!
checking TSC synchronization [CPU#0 -> CPU#1]: passed.
Brought up 2 CPUs

One would think that the kernel would work out that eight CPUs ain't
possible on such a machine.  But we don't seem to print out any info
which allows me to confirm that this is really what the kernel did.
Re: [PATCH 24/25] r/o bind mounts: track number of mount writers
On Sun, 2007-09-23 at 23:17 -0700, Andrew Morton wrote:
> It looks like a false positive to me, but really, for a patchset of this
> complexity and maturity I cannot fathom how it could have escaped any
> lockdep testing.

I test with lockdep all the time.  The problem was that lockdep doesn't
complain until you have 8 nested locks, and I only tested on a 4-cpu
system.

I lowered the lockdep nesting limit to 3, and got the warning on my
machine.

-- Dave
Re: [PATCH 24/25] r/o bind mounts: track number of mount writers
On Mon, 24 Sep 2007 12:28:11 -0700 Dave Hansen <[EMAIL PROTECTED]> wrote:
> On Mon, 2007-09-24 at 18:54 +0100, Christoph Hellwig wrote:
> > As we already said in various messages, the percpu counters in here
> > look rather fishy.  I'd recommend taking a look at the per-cpu
> > superblock counters in XFS, as they've been debugged quite well by
> > now and could probably be lifted into a generic library for this
> > kind of thing.  The code is mostly in fs/xfs/xfs_mount.c and can
> > be spotted by being under #ifdef HAVE_PERCPU_SB.
> >
> > It also handles cases like cpu hotplug nicely, which this code
> > seems to work around by always iterating over all possible cpus,
> > which might not be nice on a dual-core laptop with a distro kernel
> > that also has to support big iron.
>
> I'll take a look at xfs to see what I can get out of it.

And at include/linux/percpu_counter.h, please.

> There are basically two times when you have to do this
> for_each_possible_cpu() stuff:
> 1. when doing a r/w->r/o transition, which is rare, and
>    certainly not a fast path
> 2. when the per-cpu writer count underflows.  This requires
>    a _minimum_ of 1<<16 file opens (configurable), each of which
>    is closed on a different cpu than it was opened on.  Even
>    if you were trying, I'm not sure you'd notice the overhead.

Sounds like what you're doing is more akin to the local_t-based module
refcounting.  `grep local_ kernel/module.c'.

That code should be converted from NR_CPUS to for_each_possible_cpu()...
Re: [PATCH 24/25] r/o bind mounts: track number of mount writers
On Mon, 2007-09-24 at 18:54 +0100, Christoph Hellwig wrote:
> As we already said in various messages, the percpu counters in here
> look rather fishy.  I'd recommend taking a look at the per-cpu
> superblock counters in XFS, as they've been debugged quite well by
> now and could probably be lifted into a generic library for this
> kind of thing.  The code is mostly in fs/xfs/xfs_mount.c and can
> be spotted by being under #ifdef HAVE_PERCPU_SB.
>
> It also handles cases like cpu hotplug nicely, which this code
> seems to work around by always iterating over all possible cpus,
> which might not be nice on a dual-core laptop with a distro kernel
> that also has to support big iron.

I'll take a look at xfs to see what I can get out of it.

There are basically two times when you have to do this
for_each_possible_cpu() stuff:
1. when doing a r/w->r/o transition, which is rare, and
   certainly not a fast path
2. when the per-cpu writer count underflows.  This requires
   a _minimum_ of 1<<16 file opens (configurable), each of which
   is closed on a different cpu than it was opened on.  Even
   if you were trying, I'm not sure you'd notice the overhead.

-- Dave
Re: [PATCH 24/25] r/o bind mounts: track number of mount writers
On Mon, Sep 24, 2007 at 12:10:35PM -0700, Andrew Morton wrote:
> > As we already said in various messages, the percpu counters in here
> > look rather fishy.  I'd recommend taking a look at the per-cpu
> > superblock counters in XFS, as they've been debugged quite well by
> > now and could probably be lifted into a generic library for this
> > kind of thing.  The code is mostly in fs/xfs/xfs_mount.c and can
> > be spotted by being under #ifdef HAVE_PERCPU_SB.
> >
> > It also handles cases like cpu hotplug nicely, which this code
> > seems to work around by always iterating over all possible cpus,
> > which might not be nice on a dual-core laptop with a distro kernel
> > that also has to support big iron.
>
> hm.  How come xfs invented a new version of percpu_counters?

This code actually predates the generic percpu_counters, even if it was
merged to mainline later.  Probably neither Dave, who wrote it, nor I,
who reviewed it before it was merged, thought of percpu_counters.  Then
again, this code is considerably more complex due to features actually
needed in a very hot fastpath.
Re: [PATCH 24/25] r/o bind mounts: track number of mount writers
On Mon, 24 Sep 2007 18:54:11 +0100 Christoph Hellwig <[EMAIL PROTECTED]> wrote:
> As we already said in various messages, the percpu counters in here
> look rather fishy.  I'd recommend taking a look at the per-cpu
> superblock counters in XFS, as they've been debugged quite well by
> now and could probably be lifted into a generic library for this
> kind of thing.  The code is mostly in fs/xfs/xfs_mount.c and can
> be spotted by being under #ifdef HAVE_PERCPU_SB.
>
> It also handles cases like cpu hotplug nicely, which this code
> seems to work around by always iterating over all possible cpus,
> which might not be nice on a dual-core laptop with a distro kernel
> that also has to support big iron.

hm.  How come xfs invented a new version of percpu_counters?
Re: [PATCH 24/25] r/o bind mounts: track number of mount writers
As we already said in various messages, the percpu counters in here look
rather fishy.  I'd recommend taking a look at the per-cpu superblock
counters in XFS, as they've been debugged quite well by now and could
probably be lifted into a generic library for this kind of thing.  The
code is mostly in fs/xfs/xfs_mount.c and can be spotted by being under
#ifdef HAVE_PERCPU_SB.

It also handles cases like cpu hotplug nicely, which this code seems to
work around by always iterating over all possible cpus, which might not
be nice on a dual-core laptop with a distro kernel that also has to
support big iron.
Re: [PATCH 24/25] r/o bind mounts: track number of mount writers
> It looks like a false positive to me, but really, for a patchset of
> this complexity and maturity I cannot fathom how it could have
> escaped any lockdep testing.

The code tries to implement per-cpu spinlocks, or rather it tries to
bring back the brlocks from way past.  Cute.

We can educate lockdep about this quite easily; but isn't there some
primitive already in existence that we can use?
Re: [PATCH 24/25] r/o bind mounts: track number of mount writers
On Thu, 20 Sep 2007 12:53:20 -0700 Dave Hansen <[EMAIL PROTECTED]> wrote:
> This is the real meat of the entire series.  It actually
> implements the tracking of the number of writers to a mount.
> However, it causes scalability problems because there can
> be hundreds of cpus doing open()/close() on files on the
> same mnt at the same time.  Even an atomic_t in the mnt
> has massive scaling problems because the cacheline gets
> so terribly contended.
>
> This uses a statically-allocated percpu variable.  All
> operations are local to a cpu as long as that cpu operates on
> the same mount, and there are no writer count imbalances.
> Writer count imbalances happen when a write is taken on one
> cpu, and released on another, like when an open/close pair
> is performed on two different cpus because the task moved.

Did you test with lockdep enabled?

=============================================
[ INFO: possible recursive locking detected ]
2.6.23-rc7-mm1 #1
---------------------------------------------
swapper/1 is trying to acquire lock:
 (&writer->lock){--..}, at: [<c0197a32>] lock_and_coalesce_cpu_mnt_writer_counts+0x32/0x70

but task is already holding lock:
 (&writer->lock){--..}, at: [<c0197a32>] lock_and_coalesce_cpu_mnt_writer_counts+0x32/0x70

other info that might help us debug this:
1 lock held by swapper/1:
 #0:  (&writer->lock){--..}, at: [<c0197a32>] lock_and_coalesce_cpu_mnt_writer_counts+0x32/0x70

stack backtrace:
 [<c0103ffa>] show_trace_log_lvl+0x1a/0x30
 [<c0104b82>] show_trace+0x12/0x20
 [<c0104c96>] dump_stack+0x16/0x20
 [<c0144dc5>] __lock_acquire+0xde5/0x10a0
 [<c01450fa>] lock_acquire+0x7a/0xa0
 [<c03e734c>] _spin_lock+0x2c/0x40
 [<c0197a32>] lock_and_coalesce_cpu_mnt_writer_counts+0x32/0x70
 [<c01982c6>] mntput_no_expire+0x36/0xc0
 [<c0188f15>] path_release_on_umount+0x15/0x20
 [<c0198930>] sys_umount+0x40/0x230
 [<c010070b>] name_to_dev_t+0x9b/0x270
 [<c05230c2>] prepare_namespace+0x62/0x1b0
 [<c05226ca>] kernel_init+0x21a/0x320
 [<c0103b47>] kernel_thread_helper+0x7/0x10
===============================================

It looks like a false positive to me, but really, for a patchset of this
complexity and maturity I cannot fathom how it could have escaped any
lockdep testing.
Re: [PATCH 24/25] r/o bind mounts: track number of mount writers
On Thu, 20 Sep 2007 12:53:20 -0700 Dave Hansen [EMAIL PROTECTED] wrote: This is the real meat of the entire series. It actually implements the tracking of the number of writers to a mount. However, it causes scalability problems because there can be hundreds of cpus doing open()/close() on files on the same mnt at the same time. Even an atomic_t in the mnt has massive scalaing problems because the cacheline gets so terribly contended. This uses a statically-allocated percpu variable. All operations are local to a cpu as long that cpu operates on the same mount, and there are no writer count imbalances. Writer count imbalances happen when a write is taken on one cpu, and released on another, like when an open/close pair is performed on two different cpus because the task moved. Did you test with lockdep enabled? = [ INFO: possible recursive locking detected ] 2.6.23-rc7-mm1 #1 - swapper/1 is trying to acquire lock: (writer-lock){--..}, at: [c0197a32] lock_and_coalesce_cpu_mnt_writer_counts+0x32/0x70 but task is already holding lock: (writer-lock){--..}, at: [c0197a32] lock_and_coalesce_cpu_mnt_writer_counts+0x32/0x70 other info that might help us debug this: 1 lock held by swapper/1: #0: (writer-lock){--..}, at: [c0197a32] lock_and_coalesce_cpu_mnt_writer_counts+0x32/0x70 stack backtrace: [c0103ffa] show_trace_log_lvl+0x1a/0x30 [c0104b82] show_trace+0x12/0x20 [c0104c96] dump_stack+0x16/0x20 [c0144dc5] __lock_acquire+0xde5/0x10a0 [c01450fa] lock_acquire+0x7a/0xa0 [c03e734c] _spin_lock+0x2c/0x40 [c0197a32] lock_and_coalesce_cpu_mnt_writer_counts+0x32/0x70 [c01982c6] mntput_no_expire+0x36/0xc0 [c0188f15] path_release_on_umount+0x15/0x20 [c0198930] sys_umount+0x40/0x230 [c010070b] name_to_dev_t+0x9b/0x270 [c05230c2] prepare_namespace+0x62/0x1b0 [c05226ca] kernel_init+0x21a/0x320 [c0103b47] kernel_thread_helper+0x7/0x10 === It look like a false positive to me, but really, for a patchset of this complexity and maturity I cannot fathom how it could have escaped any lockdep 
testing. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 24/25] r/o bind mounts: track number of mount writers
It look like a false positive to me, but really, for a patchset of this complexity and maturity I cannot fathom how it could have escaped any lockdep testing. the code tries to implement per cpu spinlocks, or rather it tries to bring back the brlocks from way past cute. we can educate lockdep about this quite easily; but isn't there some primitive already in existence that we can use? - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 24/25] r/o bind mounts: track number of mount writers
As we already say in various messages the percpu counters in here look rather fishy. I'd recomment to take a look at the per-cpu superblock counters in XFS as they've been debugged quite well now and could probably be lifted into a generic library for this kind of think. The code is mostly in fs/xfs/xfs_mount.c can can be spotted by beeing under #ifdef HAVE_PERCPU_SB. It also handles cases like hotplug cpu nicely that this code seems to work around by always iterating over all possible cpus which might not be nice on a dual core laptop with a distro kernel that also has to support big iron. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 24/25] r/o bind mounts: track number of mount writers
On Mon, 24 Sep 2007 18:54:11 +0100 Christoph Hellwig [EMAIL PROTECTED] wrote: As we already say in various messages the percpu counters in here look rather fishy. I'd recomment to take a look at the per-cpu superblock counters in XFS as they've been debugged quite well now and could probably be lifted into a generic library for this kind of think. The code is mostly in fs/xfs/xfs_mount.c can can be spotted by beeing under #ifdef HAVE_PERCPU_SB. It also handles cases like hotplug cpu nicely that this code seems to work around by always iterating over all possible cpus which might not be nice on a dual core laptop with a distro kernel that also has to support big iron. hm. How come xfs invented a new version of percpu_counters? - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 24/25] r/o bind mounts: track number of mount writers
On Mon, Sep 24, 2007 at 12:10:35PM -0700, Andrew Morton wrote: As we already say in various messages the percpu counters in here look rather fishy. I'd recomment to take a look at the per-cpu superblock counters in XFS as they've been debugged quite well now and could probably be lifted into a generic library for this kind of think. The code is mostly in fs/xfs/xfs_mount.c can can be spotted by beeing under #ifdef HAVE_PERCPU_SB. It also handles cases like hotplug cpu nicely that this code seems to work around by always iterating over all possible cpus which might not be nice on a dual core laptop with a distro kernel that also has to support big iron. hm. How come xfs invented a new version of percpu_counters? This code actually predates the generic percpu_counters even if it was merged to mainline later. Neither Dave who wrote it nor me who reviewed it before it was merged thought of percpu_counters probably. Then again this code is considerably more complex due to features actually needed in a very hot fastpath. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 24/25] r/o bind mounts: track number of mount writers
On Mon, 24 Sep 2007 12:28:11 -0700 Dave Hansen [EMAIL PROTECTED] wrote: On Mon, 2007-09-24 at 18:54 +0100, Christoph Hellwig wrote: As we already say in various messages the percpu counters in here look rather fishy. I'd recomment to take a look at the per-cpu superblock counters in XFS as they've been debugged quite well now and could probably be lifted into a generic library for this kind of think. The code is mostly in fs/xfs/xfs_mount.c can can be spotted by beeing under #ifdef HAVE_PERCPU_SB. It also handles cases like hotplug cpu nicely that this code seems to work around by always iterating over all possible cpus which might not be nice on a dual core laptop with a distro kernel that also has to support big iron. I'll take a look at xfs to see what I can get out of it. And at include/linux/percpu_counter.h, please. There are basically two times when you have to do this for_each_possible_cpu() stuff: 1. when doing a r/w-r/o transition, which is rare, and certainly not a fast path 2. Where the per-cpu writer count underflows. This requires a _minimum_ of 116 file opens (configurable) each of which is closed on a different cpu than it was opened on. Even if you were trying, I'm not sure you'd notice the overhead. Sounds like what you're doing is more akin to the local_t-based module refcounting. `grep local_ kernel/module.c'. That code should be converted from NR_CPUS to for_each_possible_cpu().. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 24/25] r/o bind mounts: track number of mount writers
On Mon, 2007-09-24 at 18:54 +0100, Christoph Hellwig wrote: As we already say in various messages the percpu counters in here look rather fishy. I'd recomment to take a look at the per-cpu superblock counters in XFS as they've been debugged quite well now and could probably be lifted into a generic library for this kind of think. The code is mostly in fs/xfs/xfs_mount.c can can be spotted by beeing under #ifdef HAVE_PERCPU_SB. It also handles cases like hotplug cpu nicely that this code seems to work around by always iterating over all possible cpus which might not be nice on a dual core laptop with a distro kernel that also has to support big iron. I'll take a look at xfs to see what I can get out of it. There are basically two times when you have to do this for_each_possible_cpu() stuff: 1. when doing a r/w-r/o transition, which is rare, and certainly not a fast path 2. Where the per-cpu writer count underflows. This requires a _minimum_ of 116 file opens (configurable) each of which is closed on a different cpu than it was opened on. Even if you were trying, I'm not sure you'd notice the overhead. -- Dave - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 24/25] r/o bind mounts: track number of mount writers
On Sun, 2007-09-23 at 23:17 -0700, Andrew Morton wrote: It look like a false positive to me, but really, for a patchset of this complexity and maturity I cannot fathom how it could have escaped any lockdep testing. I test with lockdep all the time. The problem was that lockdep doesn't complain until you have 8 nested locks, and I only tested on a 4-cpu system. I lowered the lockdep nesting limit to 3, and got the warning on my machine. -- Dave - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 24/25] r/o bind mounts: track number of mount writers
On Mon, 24 Sep 2007 15:06:42 -0700 Dave Hansen <[EMAIL PROTECTED]> wrote:
> On Sun, 2007-09-23 at 23:17 -0700, Andrew Morton wrote:
> > It looks like a false positive to me, but really, for a patchset of
> > this complexity and maturity I cannot fathom how it could have
> > escaped any lockdep testing.
>
> I test with lockdep all the time.  The problem was that lockdep
> doesn't complain until you have 8 nested locks, and I only tested on
> a 4-cpu system.  I lowered the lockdep nesting limit to 3, and got
> the warning on my machine.

hm.  I saw that warning on my 2-way.  It has CONFIG_NR_CPUS=8 so
perhaps the kernel has decided that this machine can possibly have
eight CPUs.  It's an old super-micro board, doesn't have ACPI.

OEM ID: INTEL  Product ID: 440BX  APIC at: 0xFEE0
Processor #0 6:8 APIC version 17
Processor #1 6:8 APIC version 17
I/O APIC #2 Version 17 at 0xFEC0.
Enabling APIC mode:  Flat.  Using 1 I/O APICs
Processors: 2
...
Checking if this processor honours the WP bit even in supervisor mode... Ok.
Calibrating delay using timer specific routine.. 1707.03 BogoMIPS (lpj=3414067)
Mount-cache hash table entries: 512
CPU: After generic identify, caps: 0383fbff
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 256K
CPU: After all inits, caps: 0383fbff 0040
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
Compat vDSO mapped to e000.
Checking 'hlt' instruction... OK.
lockdep: not fixing up alternatives.
CPU0: Intel Pentium III (Coppermine) stepping 03
lockdep: not fixing up alternatives.
Booting processor 1/1 eip 2000
Initializing CPU#1
Calibrating delay using timer specific routine.. 1704.02 BogoMIPS (lpj=3408054)
CPU: After generic identify, caps: 0383fbff
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 256K
CPU: After all inits, caps: 0383fbff 0040
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#1.
CPU1: Intel Pentium III (Coppermine) stepping 03
Total of 2 processors activated (3411.06 BogoMIPS).
ExtINT not setup in hardware but reported by MP table
ENABLING IO-APIC IRQs
..TIMER: vector=0x31 apic1=0 pin1=2 apic2=0 pin2=0
APIC timer registered as dummy, due to nmi_watchdog=1!
checking TSC synchronization [CPU#0 - CPU#1]: passed.
Brought up 2 CPUs

One would think that the kernel would work out that eight CPUs ain't
possible on such a machine.  But we don't seem to print out any info
which allows me to confirm that this is really what the kernel did.
Re: [PATCH 24/25] r/o bind mounts: track number of mount writers
On Mon, 2007-09-24 at 15:25 -0700, Andrew Morton wrote:
> hm.  I saw that warning on my 2-way.  It has CONFIG_NR_CPUS=8 so
> perhaps the kernel has decided that this machine can possibly have
> eight CPUs.  It's an old super-micro board, doesn't have ACPI.

Well, it's looking like we only set cpu_possible_map from data we find
in the MP table, which makes sense.  The only question is how your
system gets more than ~8 possible cpus.  Do you have the .config
handy?

-- Dave
Re: [PATCH 24/25] r/o bind mounts: track number of mount writers
On Mon, 24 Sep 2007 16:05:37 -0700 Dave Hansen <[EMAIL PROTECTED]> wrote:
> On Mon, 2007-09-24 at 15:25 -0700, Andrew Morton wrote:
> > hm.  I saw that warning on my 2-way.  It has CONFIG_NR_CPUS=8 so
> > perhaps the kernel has decided that this machine can possibly have
> > eight CPUs.  It's an old super-micro board, doesn't have ACPI.
>
> Well, it's looking like we only set cpu_possible_map from data we
> find in the MP table, which makes sense.  The only question is how
> your system gets more than ~8 possible cpus.  Do you have the
> .config handy?

http://userweb.kernel.org/~akpm/config-vmm.txt
[PATCH 24/25] r/o bind mounts: track number of mount writers
If we can't pull just this patch in for now, it would be great to get
everything leading up to here pulled in.  I've re-implemented this
several ways, and it has never caused the preceding patches to change
at all.

--

This is the real meat of the entire series.  It actually implements
the tracking of the number of writers to a mount.  However, it causes
scalability problems because there can be hundreds of cpus doing
open()/close() on files on the same mnt at the same time.  Even an
atomic_t in the mnt has massive scaling problems because the cacheline
gets so terribly contended.

This uses a statically-allocated percpu variable.  All operations are
local to a cpu as long as that cpu operates on the same mount, and
there are no writer count imbalances.  Writer count imbalances happen
when a write is taken on one cpu, and released on another, like when
an open/close pair is performed on two different cpus because the task
moved.

Upon a remount,ro request, all of the data from the percpu variables
is collected (expensive, but very rare) and we determine if there are
any outstanding writers to the mount.

I've written a little benchmark to sit in a loop for a couple of
seconds in several cpus in parallel doing open/write/close loops.

http://sr71.net/~dave/linux/openbench.c

The code in here is a worst-possible case for this patch.  It does
opens on a _pair_ of files in two different mounts in parallel.  This
should cause my code to lose its "operate on the same mount"
optimization completely.  This worst-case scenario causes a 3%
degradation in the benchmark.

I could probably get rid of even this 3%, but it would be more complex
than what I have here, and I think this is getting into acceptable
territory.  In practice, I expect writing more than 3 bytes to a file,
as well as disk I/O, to mask any effects that this has.

(To get rid of that 3%, we could have an #defined number of mounts in
the percpu variable. 
So, instead of a CPU getting to operate only on percpu data when it
accesses only one mount, it could stay on percpu data when it only
accesses N or fewer mounts.)

Signed-off-by: Dave Hansen <[EMAIL PROTECTED]>
---

 lxc-dave/fs/namespace.c        |  205 ++---
 lxc-dave/include/linux/mount.h |    8 +
 2 files changed, 198 insertions(+), 15 deletions(-)

diff -puN fs/namespace.c~numa_mnt_want_write fs/namespace.c
--- lxc/fs/namespace.c~numa_mnt_want_write	2007-09-20 12:16:23.000000000 -0700
+++ lxc-dave/fs/namespace.c	2007-09-20 12:16:23.000000000 -0700
@@ -17,6 +17,7 @@
 #include <linux/quotaops.h>
 #include <linux/acct.h>
 #include <linux/capability.h>
+#include <linux/cpumask.h>
 #include <linux/module.h>
 #include <linux/sysfs.h>
 #include <linux/seq_file.h>
@@ -52,6 +53,8 @@ static inline unsigned long hash(struct
 	return tmp & hash_mask;
 }
 
+#define MNT_WRITER_UNDERFLOW_LIMIT -(1<<16)
+
 struct vfsmount *alloc_vfsmnt(const char *name)
 {
 	struct vfsmount *mnt = kmem_cache_zalloc(mnt_cache, GFP_KERNEL);
@@ -65,6 +68,7 @@ struct vfsmount *alloc_vfsmnt(const char
 	INIT_LIST_HEAD(&mnt->mnt_share);
 	INIT_LIST_HEAD(&mnt->mnt_slave_list);
 	INIT_LIST_HEAD(&mnt->mnt_slave);
+	atomic_set(&mnt->__mnt_writers, 0);
 	if (name) {
 		int size = strlen(name) + 1;
 		char *newname = kmalloc(size, GFP_KERNEL);
@@ -85,6 +89,84 @@ struct vfsmount *alloc_vfsmnt(const char
  * we can determine when writes are able to occur to
  * a filesystem.
  */
+/*
+ * __mnt_is_readonly: check whether a mount is read-only
+ * @mnt: the mount to check for its write status
+ *
+ * This shouldn't be used directly outside of the VFS.
+ * It does not guarantee that the filesystem will stay
+ * r/w, just that it is right *now*.  This can not and
+ * should not be used in place of IS_RDONLY(inode).
+ * mnt_want/drop_write() will _keep_ the filesystem
+ * r/w.
+ */
+int __mnt_is_readonly(struct vfsmount *mnt)
+{
+	return (mnt->mnt_sb->s_flags & MS_RDONLY);
+}
+EXPORT_SYMBOL_GPL(__mnt_is_readonly);
+
+struct mnt_writer {
+	/*
+	 * If holding multiple instances of this lock, they
+	 * must be ordered by cpu number.
+	 */
+	spinlock_t lock;
+	unsigned long count;
+	struct vfsmount *mnt;
+} ____cacheline_aligned_in_smp;
+static DEFINE_PER_CPU(struct mnt_writer, mnt_writers);
+
+static int __init init_mnt_writers(void)
+{
+	int cpu;
+	for_each_possible_cpu(cpu) {
+		struct mnt_writer *writer = &per_cpu(mnt_writers, cpu);
+		spin_lock_init(&writer->lock);
+		writer->count = 0;
+	}
+	return 0;
+}
+fs_initcall(init_mnt_writers);
+
+static void mnt_unlock_cpus(void)
+{
+	int cpu;
+	struct mnt_writer *cpu_writer;
+
+	for_each_possible_cpu(cpu) {
+		cpu_writer = &per_cpu(mnt_writers, cpu);
+		spin_unlock(&cpu_writer->lock);
+	}
+}
+
+static inline void __clear_mnt_count(struct mnt_writer *cpu_writer)
+{
+	if (!cpu_writer->mnt)
+		return;
+