Re: [PATCH 1/1] irqchip: exynos-combiner: Save IRQ enable set on suspend
>>>>> "Javier" == Javier Martinez Canillas writes:

Javier> The Exynos interrupt combiner IP looses its state when the SoC

s/looses/loses/

Peter C
--
Dr Peter Chubb  peter.chubb AT nicta.com.au
http://www.ssrg.nicta.com.au  Software Systems Research Group/NICTA
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Documentation: ARM: EXYNOS: Describe boot loaders interface
>>>>> "Krzysztof" == Krzysztof Kozlowski writes:

Krzysztof> Various boot loaders for Exynos based boards use certain
Krzysztof> memory addresses during booting for different
Krzysztof> purposes. Mostly this is one of following : 1. as a CPU
Krzysztof> boot address, 2. for storing magic cookie related to low
Krzysztof> power mode (AFTR, sleep).

Krzysztof> The document, based solely on kernel source code, tries to
Krzysztof> group the information scattered over different files. This
Krzysztof> would help in the future when adding support for new SoC or
Krzysztof> when extending features related to low power modes.

Is it worth grabbing the info from U-Boot and documenting it here
(it's not documented other than in the hardkernel U-Boot source)?

I can send you the info, or you can see it in
https://github.com/hardkernel/u-boot/blob/odroidxu3-v2012.07/board/samsung/smdk5420/lowlevel_init.S
at symbol nscode_base near line 104

--
Dr Peter Chubb  peter.chubb AT nicta.com.au
http://www.ssrg.nicta.com.au  Software Systems Research Group/NICTA
Re: [PATCH] debug: Do not permit CONFIG_DEBUG_STACK_USAGE=y on IA64 or PARISC
>>>>> "Ingo" == Ingo Molnar writes:

Ingo> * James Bottomley wrote:

>> Since the problem is an invalid assumption about how the stack
>> grows, why not just condition it on that. We actually have a
>> config option for this: CONFIG_STACK_GROWSUP. But for some reason
>> ia64 doesn't define this, why not, Tony? It looks deliberate
>> because you have replaced a lot of
>>
>> #ifdef CONFIG_STACK_GROWSUP
>>
>> with
>>
>> #if defined(CONFIG_STACK_GROWSUP) || defined(CONFIG_IA64)
>>
>> but not all of them.

Ingo> Yes, that's another possible solution, assuming that it's really
Ingo> only about the up/down difference.

Ingo> Thanks,

IA64 has two stacks -- the standard one, that grows down, and the
register stack engine backing store, that grows up. The usual
mechanisms for stack growth are used, so only some of the bits
predicated on `STACK_GROWSUP' are useful.

Peter C
--
Dr Peter Chubb  peter.chubb AT nicta.com.au
http://www.ssrg.nicta.com.au  Software Systems Research Group/NICTA
Fix compilation with gcc 4.2
gcc-4.2 is a lot more picky about its symbol handling. EXPORT_SYMBOL
no longer works on symbols that are undefined or defined with static
scope.

For example, with CONFIG_PROFILING off, I see:

  kernel/profile.c:206: error: __ksymtab_profile_event_unregister causes a section type conflict
  kernel/profile.c:205: error: __ksymtab_profile_event_register causes a section type conflict

This patch moves the EXPORTs inside the #ifdef CONFIG_PROFILING, so we
only try to export symbols that are defined.

Also, in kernel/kprobes.c there's an EXPORT_SYMBOL_GPL() for
jprobe_return, which if CONFIG_JPROBES is undefined is a static
inline and gives the same error.

And in drivers/acpi/resources/rsxface.c, there's an
ACPI_EXPORT_SYMBOL() for a static symbol. If it's static, it's not
accessible from outside the compilation unit, so should not be
exported.

These three changes allow building a zx1_defconfig kernel with gcc 4.2
on IA64.

Signed-off-by: Peter Chubb <[EMAIL PROTECTED]>

Index: linux-2.6-git/kernel/profile.c
===================================================================
--- linux-2.6-git.orig/kernel/profile.c	2007-08-09 12:10:19.921216500 +1000
+++ linux-2.6-git/kernel/profile.c	2007-08-09 12:10:26.061162039 +1000
@@ -199,11 +199,11 @@ EXPORT_SYMBOL_GPL(register_timer_hook);
 EXPORT_SYMBOL_GPL(unregister_timer_hook);
 EXPORT_SYMBOL_GPL(task_handoff_register);
 EXPORT_SYMBOL_GPL(task_handoff_unregister);
+EXPORT_SYMBOL_GPL(profile_event_register);
+EXPORT_SYMBOL_GPL(profile_event_unregister);
 
 #endif /* CONFIG_PROFILING */
 
-EXPORT_SYMBOL_GPL(profile_event_register);
-EXPORT_SYMBOL_GPL(profile_event_unregister);
 
 #ifdef CONFIG_SMP
 /*
Index: linux-2.6-git/kernel/kprobes.c
===================================================================
--- linux-2.6-git.orig/kernel/kprobes.c	2007-08-09 12:14:48.898830198 +1000
+++ linux-2.6-git/kernel/kprobes.c	2007-08-09 14:09:50.180322576 +1000
@@ -1063,6 +1063,8 @@ EXPORT_SYMBOL_GPL(register_kprobe);
 EXPORT_SYMBOL_GPL(unregister_kprobe);
 EXPORT_SYMBOL_GPL(register_jprobe);
 EXPORT_SYMBOL_GPL(unregister_jprobe);
-EXPORT_SYMBOL_GPL(jprobe_return);
+
+#ifdef CONFIG_KPROBES
 EXPORT_SYMBOL_GPL(register_kretprobe);
 EXPORT_SYMBOL_GPL(unregister_kretprobe);
+#endif
Index: linux-2.6-git/drivers/acpi/resources/rsxface.c
===================================================================
--- linux-2.6-git.orig/drivers/acpi/resources/rsxface.c	2007-08-09 13:06:59.040346772 +1000
+++ linux-2.6-git/drivers/acpi/resources/rsxface.c	2007-08-09 13:12:03.125801491 +1000
@@ -474,8 +474,6 @@ acpi_rs_match_vendor_resource(struct acp
 	return (AE_CTRL_TERMINATE);
 }
 
-ACPI_EXPORT_SYMBOL(acpi_rs_match_vendor_resource)
-
 /***
  *
  * FUNCTION:    acpi_walk_resources

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  [EMAIL PROTECTED]
http://www.ertos.nicta.com.au  ERTOS within National ICT Australia
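The rule the patch applies (only export a symbol in the configuration where it is really defined with external linkage) can be sketched outside the kernel. This is a minimal illustration, not the kernel's code: DEMO_EXPORT_SYMBOL, CONFIG_DEMO_FEATURE and demo_event_register are hypothetical names, and the macro only records the symbol name where the real EXPORT_SYMBOL_GPL() emits a __ksymtab entry.

```c
/*
 * Hypothetical stand-in for EXPORT_SYMBOL_GPL(): it only records the
 * symbol's name. The real kernel macro emits a __ksymtab entry instead.
 */
#define DEMO_EXPORT_SYMBOL(sym) \
	const char *const demo_export_name_##sym = #sym

/* Hypothetical config switch; remove this line to model "feature off". */
#define CONFIG_DEMO_FEATURE 1

#ifdef CONFIG_DEMO_FEATURE
/* Real definition with external linkage: safe to export. */
int demo_event_register(int type)
{
	return type >= 0 ? 0 : -1;
}
DEMO_EXPORT_SYMBOL(demo_event_register);
#else
/*
 * Stub used when the feature is off. It is static inline, so it has no
 * external linkage and must not be exported; the export above therefore
 * lives inside the #ifdef rather than after the #endif.
 */
static inline int demo_event_register(int type)
{
	(void)type;
	return -1;
}
#endif
```

Keeping the export next to the non-static definition means the two can never disagree about which configuration they belong to.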
Re: [RFC] Deferred interrupt handling.
The problem you're having is essentially the same as the user-level
interrupt handler problem I've been dealing with for ages.

The basic rule is: don't share interrupts between devices on the host
and devices in the guest. But you *can* share interrupts between
devices in a single guest.

If you want the code, see
http://www.gelato.unsw.edu.au/cgi-bin/viewvc.cgi/cvs/kernel/usrdrivers/latest/
and look at generic-irq.patch and fasync (which adds asynchronous
notifications).

For the KVM work it'll need modifying a little, but the basic
infrastructure is there. We've currently got this working to pass
interrupts to a type-II (hosted) virtual machine monitor running a
guest kernel with native drivers.

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
http://www.ertos.nicta.com.au  ERTOS within National ICT Australia
Re: linux-ia64 build warning messages
>>>>> "Russ" == Russ Anderson <[EMAIL PROTECTED]> writes:

Russ> Tony Luck wrote:
>> > I used the sn2_defconfig in the tree :)
>>
>> So there is something odd happening. Russ complained that he was
>> still seeing several errors from the sn2_defconfig build too when I
>> posted the "last fix" to Len. But I don't see them when I build.

Russ> An additional data point. I have a copy of Tony's test tree
Russ> pulled down on March 30th that builds without the warning
Russ> messages. The copy of Tony's test tree pulled down on May 22nd
Russ> does have warning messages. I'm building both with the same
Russ> compiler (etc). I'm fairly certain a tree I pulled down in
Russ> April built without warnings. I've since blown away that tree.

Change request 85bd2fddd68e757da8e1af98f857f61a3c9ce647 introduced
section-mismatch checking for vmlinux, which caused all these warnings
to become visible.

It looks as if gcc can create references from .sdata to .init.sdata
depending on what optimisations it chooses to do.

Ideally we could teach gcc to put its constants in the same section
they reference. But I'm no gcc guru. The alternative is to get
modpost to ignore such references, at the cost of perhaps missing a
real problem somewhere.

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
http://www.ertos.nicta.com.au  ERTOS within National ICT Australia
Re: BUG: sleeping function called from invalid context at kernel/fork.c:385
I see many many section mismatches when compiling with gcc 4.1 and
binutils 2.17.50.20070426.

They appear to be from .sdata to .init.data. This is with basic
zx1_defconfig with a few mods. The reason appears to be compiler
weirdness.

WARNING: init/built-in.o(.sdata+0x30): Section mismatch: reference to .init.data:ino (after 'root_mountflags')

(initramfs.s contains a 32-word table `head'. Code like:

	static __initdata struct hash {..} *head[32];
	for (p = head; p < head + 32; p++)

is generating:

	.section .sdata
L24:
	.data8 head#+256

Rather than adding 256 to head at run time, the compiler loads L24 and
uses that for the comparison. This triggers the warning.)

WARNING: arch/ia64/kernel/built-in.o(.sdata+0x110): Section mismatch: reference to .init.data:rsvd_region (between 'ia64_sal' and 'ia64_i_cache_stride_shift')
WARNING: mm/built-in.o(.sdata+0x48): Section mismatch: reference to .init.data:early_node_map before 'sysctl_lowmem_reserve_ratio' (at offset -0x0)
WARNING: mm/built-in.o(.sdata+0x50): Section mismatch: reference to .init.data:early_node_map before 'sysctl_lowmem_reserve_ratio' (at offset -0x0)
WARNING: mm/built-in.o(.sdata+0x58): Section mismatch: reference to .init.data:early_node_map before 'sysctl_lowmem_reserve_ratio' (at offset -0x0)
WARNING: mm/built-in.o(.sdata+0x60): Section mismatch: reference to .init.data:early_node_map before 'sysctl_lowmem_reserve_ratio' (at offset -0x0)
WARNING: mm/built-in.o(.sdata+0x68): Section mismatch: reference to .init.data:early_node_map before 'sysctl_lowmem_reserve_ratio' (at offset -0x0)
WARNING: mm/built-in.o(.sdata+0x70): Section mismatch: reference to .init.data:early_node_map before 'sysctl_lowmem_reserve_ratio' (at offset -0x0)
WARNING: mm/built-in.o(.sdata+0x78): Section mismatch: reference to .init.data:early_node_map before 'sysctl_lowmem_reserve_ratio' (at offset -0x0)
WARNING: mm/built-in.o(.sdata+0x80): Section mismatch: reference to .init.data:early_node_map before 'sysctl_lowmem_reserve_ratio' (at offset -0x0)
WARNING: mm/built-in.o(.sdata+0x3c8): Section mismatch: reference to .init.data: (between 'swap_list' and 'slab_early_init')
WARNING: mm/built-in.o(.sdata+0x3d8): Section mismatch: reference to .init.data:initkmem_list3 (between 'swap_list' and 'slab_early_init')
WARNING: mm/built-in.o(.sdata+0x3e0): Section mismatch: reference to .init.data:initkmem_list3 (between 'swap_list' and 'slab_early_init')
WARNING: drivers/built-in.o(.data.rel.local+0x20a8): Section mismatch: reference to .init.text:acpi_processor_start (between 'acpi_processor_driver' and 'acpi_thermal_driver')
WARNING: drivers/built-in.o(.data.rel+0x1d80): Section mismatch: reference to .init.text:serial8250_console_setup (between 'serial8250_console' and 'dpm_active')
WARNING: drivers/built-in.o(.sdata+0x788): Section mismatch: reference to .init.data: (between 'first.20152' and 'enabled')
WARNING: drivers/built-in.o(.sdata+0x790): Section mismatch: reference to .init.data: (between 'first.20152' and 'enabled')
WARNING: drivers/built-in.o(.sdata+0xa18): Section mismatch: reference to .init.data: (between 'scsi_null_device_strs' and 'fc_dev_loss_tmo')
WARNING: drivers/built-in.o(.sdata+0xa20): Section mismatch: reference to .init.data: (between 'scsi_null_device_strs' and 'fc_dev_loss_tmo')
WARNING: drivers/built-in.o(.sdata+0xa28): Section mismatch: reference to .init.data: (between 'scsi_null_device_strs' and 'fc_dev_loss_tmo')
WARNING: drivers/built-in.o(.sdata+0xac8): Section mismatch: reference to .init.data: (between 'Symbios_trailer.24436' and 'try_direct_io')
WARNING: drivers/built-in.o(.sdata+0xb00): Section mismatch: reference to .init.data: (between 'st_max_sg_segs' and 'osst_version')
WARNING: arch/ia64/hp/common/built-in.o(.data.rel.local+0xa8): Section mismatch: reference to .init.text:acpi_sba_ioc_add (between 'acpi_sba_ioc_driver' and 'ioc_seq_ops')
WARNING: arch/ia64/hp/common/built-in.o(.sdata+0x0): Section mismatch: reference to .init.data:__setup_str_sba_page_override before 'reserve_sba_gart' (at offset -0x204c2613)

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
http://www.ertos.nicta.com.au  ERTOS within National ICT Australia
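For readers wondering where the 256 in `head#+256` comes from, here is a small userland sketch of the same loop shape. It assumes a 64-bit (LP64) target such as IA64, where each of the 32 pointers is 8 bytes; the struct members and the helper names loop_end_offset() and count_slots() are made up for illustration, and __initdata is left out because it only exists in the kernel.

```c
#include <stddef.h>

/* Hypothetical layout; the original post elides the struct members. */
struct hash {
	struct hash *next;
};

/* In the kernel this table is marked __initdata, so it lives in .init.data. */
static struct hash *head[32];

/*
 * Byte distance from head to the loop bound head + 32. With 8-byte
 * pointers this is 32 * 8 = 256, the constant stored at L24.
 */
size_t loop_end_offset(void)
{
	return (size_t)((char *)(head + 32) - (char *)head);
}

/*
 * The same loop shape as in the post. The bound head + 32 is a link-time
 * constant, so the compiler is free to store it in a data section
 * instead of computing it at run time.
 */
int count_slots(void)
{
	struct hash **p;
	int n = 0;

	for (p = head; p < head + 32; p++)
		n++;
	return n;
}
```

The stored bound is an address inside .init.data that is referenced from .sdata, which is exactly the kind of cross-section reference modpost reports.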
Re: [PATCH Resend] - SN: validate smp_affinity mask on intr redirect
Jack> }
Jack> +
Jack> +bool is_affinity_mask_valid(cpumask_t cpumask)
Jack> +{
Jack> +	if (ia64_platform_is("sn2")) {
Jack> +		/* Only allow one CPU to be specified in the smp_affinity mask */
Jack> +		if (cpus_weight(cpumask) != 1)
Jack> +			return false;

Why not just:

	return cpus_weight(cpumask) == 1;

It's a Boolean; treat it as one.

(If you thought the average kernel programmer (who's s/he?) understood
the logical implication rule it could be:

	return !ia64_platform_is("sn2") || cpus_weight(cpumask) == 1;
)

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
http://www.ertos.nicta.com.au  ERTOS within National ICT Australia
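The implication form can be checked against the branching form with a few lines of plain C. This is only a sketch: the quoted patch is cut off, so the fall-through `return true` is an assumption, and the two boolean/int parameters are hypothetical stand-ins for ia64_platform_is("sn2") and cpus_weight(cpumask).

```c
#include <stdbool.h>

/*
 * Branching form, as in the quoted patch. The trailing "return true"
 * is assumed, since the quote stops before the end of the function.
 */
static bool valid_branching(bool is_sn2, int weight)
{
	if (is_sn2) {
		if (weight != 1)
			return false;
	}
	return true;
}

/* Implication form: "is_sn2 implies weight == 1". */
static bool valid_implication(bool is_sn2, int weight)
{
	return !is_sn2 || weight == 1;
}

/* Returns true if both forms agree on every tested input. */
bool forms_agree(void)
{
	for (int sn2 = 0; sn2 <= 1; sn2++)
		for (int w = 0; w <= 4; w++)
			if (valid_branching(sn2, w) != valid_implication(sn2, w))
				return false;
	return true;
}
```

Under that assumption the two forms give the same answer for every platform/weight combination tried, so the choice between them is purely one of readability.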
Re: [QUICKLIST 0/4] Arch independent quicklists V2
>>>>> "Jeremy" == Jeremy Fitzhardinge <[EMAIL PROTECTED]> writes:

Jeremy> And do the same in pte pages for actual mapped pages? Or do
Jeremy> you think they would be too densely populated for it to be
Jeremy> worthwhile?

We've been doing some measurements on how densely clumped ptes are.
On 32-bit platforms, they're pretty dense. On IA64, quite a bit
sparser, depending on the workload of course. I think that's mostly
because of the larger pagesize on IA64 -- with 64k pages, you don't
need very many to map a small object. I'm hoping IanW can give more
details.
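The page-size effect mentioned above is easy to put numbers on. The helper below is a hypothetical illustration (ptes_needed() is not a kernel function) and assumes the object starts on a page boundary; it simply rounds the object size up to whole pages.

```c
#include <stddef.h>

/*
 * Number of pages, and so of ptes, needed to map an object of `bytes`
 * bytes that starts on a page boundary. Hypothetical helper for
 * illustration only.
 */
size_t ptes_needed(size_t bytes, size_t page_size)
{
	return (bytes + page_size - 1) / page_size;
}
```

For a 1 MiB object this gives 256 ptes with 4 KiB pages but only 16 with 64 KiB pages, which is consistent with the guess that larger pages leave pte pages more sparsely filled.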
Re: Ski for huge page size !
>>>>> "sudhnesh" == sudhnesh adapawar <[EMAIL PROTECTED]> writes:

sudhnesh> Hey all ! I am thinking to use ski simulator as I can get
sudhnesh> the ia64 (Itanium 2)simulated on ia32 archiSo can I use
sudhnesh> this product for the project related to huge page size ???
sudhnesh> Will the problems related to huge pages such as
sudhnesh> swapping,IO,etc...will be covered if I use ski with 2.6
sudhnesh> kernel image configured for ia64 archi with huge page size
sudhnesh> support ?

Should work perfectly. We've been using Ski for similar work, looking
at SuperPage support.

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
http://www.ertos.nicta.com.au  ERTOS within National ICT Australia
Re: How to boot 2.6 kernel using hp ski simulator ???
Please check out
http://www.gelato.unsw.edu.au/IA64wiki/SkiSimulator
for lots of info on Ski. It works fine with Linux 2.6, and hugepages
work too.

> 1) I used 'make ARCH=ia64 menuconfig' to configure and followed the
> steps to get kernel image of version 2.6 ! I also selected the generic
> type as Ski-simulator and also selected the HP-ski drivers something
> simscsi,etc.etc.

I suggest you start with
	make sim_defconfig

Your symptoms look like a misconfigured or misbuilt vmlinux. The
sim_defconfig

If you're running on IA32, then you need something like:
	make CROSS_COMPILE=ia64-linux-gnu ARCH=ia64 boot
to build kernel and bootloader.

You need to get or build yourself a disk image. Instructions for
building at http://www.gelato.unsw.edu.au/IA64wiki/skidiskimage

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
http://www.ertos.nicta.com.au  ERTOS within National ICT Australia
RE: ip_contrack refuses to load if built UP as a module on IA64
This patch makes UP and SMP do the same thing as far as module per-cpu
data go. Unfortunately it affects core code.

To repeat the problem: IA64 keeps per-cpu data in a small data area
that is referenced by a 22-bit offset, for both UP and SMP cases. If a
module defines per-cpu data, it too will end up in the small-data
area. But the module loader at present special-cases the UP treatment
of per-cpu data, assumes that it is in the GP-relative data area, and
does nothing (for SMP it allocates space, and copies initialised data
items into it).

The effect is that modules defining per-cpu data fail to load if
they're built UP, because of an impossible relocation.

The appended patch makes the treatment of per-cpu data uniform between
UP and SMP cases. For most architectures, the per-cpu data section
will be empty for UP, and so the per-cpu setup code will not be
invoked.

Signed-off-by: Peter Chubb <[EMAIL PROTECTED]>

diff --git a/arch/ia64/kernel/module.c b/arch/ia64/kernel/module.c
--- a/arch/ia64/kernel/module.c
+++ b/arch/ia64/kernel/module.c
@@ -951,4 +951,10 @@ percpu_modcopy (void *pcpudst, const voi
 		if (cpu_possible(i))
 			memcpy(pcpudst + __per_cpu_offset[i], src, size);
 }
+#else
+void
+percpu_modcopy (void *pcpudst, const void *src, unsigned long size)
+{
+	memcpy(pcpudst, src, size);
+}
 #endif /* CONFIG_SMP */
diff --git a/kernel/module.c b/kernel/module.c
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -209,7 +209,6 @@ static struct module *find_module(const
 	return NULL;
 }
 
-#ifdef CONFIG_SMP
 /* Number of blocks used and allocated. */
 static unsigned int pcpu_num_used, pcpu_num_allocated;
 /* Size of each block. -ve means used. */
@@ -352,29 +351,7 @@ static int percpu_modinit(void)
 	return 0;
 }
 __initcall(percpu_modinit);
-#else /* ... !CONFIG_SMP */
-static inline void *percpu_modalloc(unsigned long size, unsigned long align,
-				    const char *name)
-{
-	return NULL;
-}
-static inline void percpu_modfree(void *pcpuptr)
-{
-	BUG();
-}
-static inline unsigned int find_pcpusec(Elf_Ehdr *hdr,
-					Elf_Shdr *sechdrs,
-					const char *secstrings)
-{
-	return 0;
-}
-static inline void percpu_modcopy(void *pcpudst, const void *src,
-				  unsigned long size)
-{
-	/* pcpusec should be 0, and size of that section should be 0. */
-	BUG_ON(size != 0);
-}
-#endif /* CONFIG_SMP */
+
 
 #ifdef CONFIG_MODULE_UNLOAD
 #define MODINFO_ATTR(field)	\
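For readers who want to see the difference between the two percpu_modcopy() flavours outside the kernel, the following is a user-space sketch only. DEMO_NR_CPUS, DEMO_AREA_SIZE, demo_per_cpu_offset and both function names are hypothetical stand-ins invented for illustration; the real code uses __per_cpu_offset[] and cpu_possible() as shown in the patch.

```c
#include <assert.h>
#include <string.h>

#define DEMO_NR_CPUS   4    /* hypothetical CPU count for this sketch */
#define DEMO_AREA_SIZE 64   /* hypothetical size of one CPU's data area */

/* Stand-in for the kernel's __per_cpu_offset[]: where each CPU's copy starts. */
const unsigned long demo_per_cpu_offset[DEMO_NR_CPUS] = {
    0 * DEMO_AREA_SIZE, 1 * DEMO_AREA_SIZE,
    2 * DEMO_AREA_SIZE, 3 * DEMO_AREA_SIZE
};

/* SMP flavour: replicate the initialised data into every CPU's area. */
void demo_percpu_modcopy_smp(void *pcpudst, const void *src, unsigned long size)
{
    int i;

    for (i = 0; i < DEMO_NR_CPUS; i++)
        memcpy((char *)pcpudst + demo_per_cpu_offset[i], src, size);
}

/* UP flavour, mirroring the #else branch added by the patch: one plain copy. */
void demo_percpu_modcopy_up(void *pcpudst, const void *src, unsigned long size)
{
    memcpy(pcpudst, src, size);
}
```

With the patch applied, the generic module loader calls the same function name in both configurations; only the body differs.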
'mdio_bus_exit' in discarded section .text.exit
When building with CONFIG_PHYLIB=y on Itanium, I see:

  `mdio_bus_exit' referenced in section `.init.text' of drivers/built-in.o:
  defined in discarded section `.exit.text' of drivers/built-in.o

I believe that mdio_bus_exit should not be declared __exit, because it
is referenced from __init sections in, say, phy_init().

Signed-off-by: Peter Chubb <[EMAIL PROTECTED]>

diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c
--- a/drivers/net/phy/mdio_bus.c
+++ b/drivers/net/phy/mdio_bus.c
@@ -170,7 +170,7 @@ int __init mdio_bus_init(void)
 	return bus_register(&mdio_bus_type);
 }
 
-void __exit mdio_bus_exit(void)
+void mdio_bus_exit(void)
 {
 	bus_unregister(&mdio_bus_type);
 }

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately, the political takes *forever*
Re: Where is the performance bottleneck?
>>>>> "Holger" == Holger Kiehl <[EMAIL PROTECTED]> writes:

Holger> Hello I have a system with the following setup:

(4-way CPUs, 8 spindles on two controllers)

Try using XFS.

See http://scalability.gelato.org/DiskScalability_2fResults --- ext3
is single threaded and tends not to get the full benefit of either the
multiple spindles or the multiple processors.

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately, the political takes *forever*
Include assembly entry points in TAGS
As it stands, etags doesn't find labels in the IA64 or i386 assembler
source code, because they're disguised inside a preprocessor macro.

I propose the attached fix, which adds a regular expression that
matches labels disguised by the ENTRY() and GLOBAL_ENTRY() macros.

There's a similar problem for MIPS, which needs to match
LEAF(entrypoint).

Signed-off-by: Peter Chubb <[EMAIL PROTECTED]>

diff --git a/Makefile b/Makefile
--- a/Makefile
+++ b/Makefile
@@ -1187,7 +1187,7 @@ cscope: FORCE
 	$(call cmd,cscope)
 
 quiet_cmd_TAGS = MAKE   $@
-cmd_TAGS = $(all-sources) | etags -
+cmd_TAGS = $(all-sources) | etags --regex='{asm}/\(GLOBAL_\)?ENTRY(\([^)]+\))/\2/' -
 
 # Exuberant ctags works better with -I
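To make it easier to see what the expression in the patch captures, here is a small sketch using POSIX regcomp(). It is not what etags runs: etags uses Emacs-style syntax, where \( \) group and a bare ( is literal, so the pattern below is a hand translation into POSIX extended syntax. The function name entry_label is made up for this example.

```c
#include <assert.h>
#include <regex.h>
#include <string.h>

/*
 * Copy the label wrapped by ENTRY() or GLOBAL_ENTRY() on `line` into `out`.
 * Returns 1 if a label was found, 0 otherwise.
 */
int entry_label(const char *line, char *out, size_t outlen)
{
    regex_t re;
    regmatch_t m[3];    /* whole match, optional GLOBAL_, label */
    int found = 0;

    if (outlen == 0)
        return 0;
    /* ERE translation of the etags pattern \(GLOBAL_\)?ENTRY(\([^)]+\)) */
    if (regcomp(&re, "(GLOBAL_)?ENTRY\\(([^)]+)\\)", REG_EXTENDED) != 0)
        return 0;
    if (regexec(&re, line, 3, m, 0) == 0 && m[2].rm_so != -1) {
        size_t len = (size_t)(m[2].rm_eo - m[2].rm_so);

        if (len >= outlen)
            len = outlen - 1;       /* truncate rather than overflow */
        memcpy(out, line + m[2].rm_so, len);
        out[len] = '\0';
        found = 1;
    }
    regfree(&re);
    return found;
}
```

The second capture group plays the role of \2 in the etags expression: it is the tag name that ends up in the TAGS file.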
Re: fcntl(F_GETLEASE) semantics??
>>>>> "Trond" == Trond Myklebust <[EMAIL PROTECTED]> writes:

Trond> On Thu 11.08.2005 at 09:48 (+1000), Peter Chubb wrote:
>> Hi, The LTP test fcntl23 is failing. It does, in essence,
>>     fd = open(xxx, O_RDWR|O_CREAT, 0777);
>>     if (fcntl(fd, F_SETLEASE, F_RDLCK) == -1)
>>         fail;
>>
>> fcntl always returns EAGAIN here. The manual page says that a read
>> lease causes notification when `another process' opens the file for
>> writing or truncates it. The kernel implements `any process'
>> (including the current one).
>>
>> Which semantics are correct? Personally I think that what the
>> kernel implements is correct (you can't get a read lease unless
>> there are no writers _at_ _all_)

Trond> A read lease should mean that there are no writers at all.

Trond> If we were to allow the current process to open for write, then
Trond> that would still mean that nobody else can get a lease. In
Trond> effect you have been granted a lease with exclusive semantics
Trond> (i.e. a write lease). You might as well request that instead of
Trond> pretending it is a read lease.

So the manual page is wrong. Fine.

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately, the political takes *forever*
fcntl(F_GETLEASE) semantics??
Hi,

The LTP test fcntl23 is failing. It does, in essence,

    fd = open(xxx, O_RDWR|O_CREAT, 0777);
    if (fcntl(fd, F_SETLEASE, F_RDLCK) == -1)
        fail;

fcntl always returns EAGAIN here. The manual page says that a read
lease causes notification when `another process' opens the file for
writing or truncates it. The kernel implements `any process'
(including the current one).

Which semantics are correct? Personally I think that what the kernel
implements is correct (you can't get a read lease unless there are no
writers _at_ _all_).

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately, the political takes *forever*
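The behaviour described above can be reproduced with a few lines of user-space C. This is a sketch rather than the LTP test itself; try_read_lease and the file path a caller passes in are invented for illustration. It returns the errno value instead of printing, so the caller can compare the O_RDWR and O_RDONLY cases.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE     /* F_SETLEASE is a Linux extension */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Open `path` with `oflags` (creating it if needed), ask for a read lease,
 * and return 0 if the lease was granted or the errno value if it was not.
 */
int try_read_lease(const char *path, int oflags)
{
    int err = 0;
    int fd = open(path, oflags | O_CREAT, 0600);

    if (fd == -1)
        return errno;
    if (fcntl(fd, F_SETLEASE, F_RDLCK) == -1)
        err = errno;
    else
        fcntl(fd, F_SETLEASE, F_UNLCK);     /* drop the lease again */
    close(fd);
    return err;
}
```

Calling it with O_RDWR corresponds to the failing test: the process is itself a writer, so the kernel is expected to refuse the lease. Calling it with O_RDONLY on a file the caller owns, with no writers anywhere, is the case in which a read lease can be granted, although some filesystems do not support leases at all.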
Re: How to get the physical page addresses from a kernel virtual address for DMA SG List?
You may want to take a look at the user-mode driver infrastructure
patches, which do almost exactly what you're trying to do.

Get them from
http://www.gelato.unsw.edu.au/cgi-bin/viewcvs.cgi/cvs/kernel/usrdrivers/kernel-2.6.12-rc3/

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately, the political takes *forever*
Re: Hangcheck problem
>>>>> "Noah" == Noah Silverman <[EMAIL PROTECTED]> writes:

Noah> Sorry 2.6.7

Noah> Burton Windle wrote:
>> Kernel version?

Are you running on an x86 machine without a TSC, e.g., a 486? The
hangcheck timer then falls back to using jiffies, and an error of a
single jiffy gives you the printout you mention.

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately, the political takes *forever*
Re: How to measure time accurately.
>>>>> "Chris" == Chris Friesen <[EMAIL PROTECTED]> writes:

Chris> krishna wrote:
>> Hi All,
>>
>> Can any one tell me how to measure time accurately for a block of C
>> code in device drivers. For example, If I want to measure the time
>> duration of firmware download.

Chris> Most cpus have some way of getting at a counter or decrementer
Chris> of various frequencies. Usually it requires low-level hardware
Chris> knowledge and often it needs assembly code.

As a device driver is inside the Linux kernel (unless you're writing a
user-mode device driver :-)) you can use the get_cycles() macro that's
defined for most architectures. It provides a snapshot of the
cycle counter.

Caveats:
  1. If you're running with power management, the cycle counter
     ticks at a variable rate.
  2. If you're on a multiprocessor, the cycle counters of different
     processors need not be synchronised.

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately, the political takes *forever*
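The arithmetic around two cycle-counter snapshots can be sketched in plain C. This is user-space illustration only: in a driver the two snapshots would come from the kernel's cycle-counter macro, whereas here they are ordinary arguments. demo_cycles_t, cycles_elapsed and cycles_to_usecs are names invented for this sketch, and the conversion assumes the counter ticks at a fixed, known rate, which caveat 1 above says is not always true.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t demo_cycles_t;     /* stand-in for the kernel's cycles_t */

/*
 * Cycles elapsed between two snapshots taken on the same CPU. Unsigned
 * subtraction still gives the right answer if the counter wrapped once.
 */
demo_cycles_t cycles_elapsed(demo_cycles_t start, demo_cycles_t end)
{
    return end - start;
}

/* Convert a cycle count to microseconds for a counter ticking at `hz`. */
uint64_t cycles_to_usecs(demo_cycles_t cycles, uint64_t hz)
{
    /* The multiplication can overflow for very long intervals. */
    return cycles * 1000000ULL / hz;
}
```

Taking both snapshots on the same processor, for example with preemption disabled around a short measured region, sidesteps caveat 2.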
Re: LBD/filesystems over 2TB: is it safe?
>>>>> "jniehof" == jniehof <[EMAIL PROTECTED]> writes:

jniehof> Someone posted to the LBD list last December regarding some
jniehof> supposedly horrible bugs in large filesystems:
jniehof> https://www.gelato.unsw.edu.au/archives/lbd/2004-December/75.html
jniehof> https://www.gelato.unsw.edu.au/archives/lbd/2004-December/74.html

The changes in those emails are irrelevant --- they fail to take into
account the properties of the filesystems that they modify, which mean
that the 32-bit quantities being shifted will not overflow.

They're typically of the form:

-	iblock = index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
+	iblock = (sector_t) index << (PAGE_CACHE_SHIFT - inode->i_blkbits);

Now, on a 32-bit processor with 4k pages, PAGE_CACHE_SHIFT is 12, and
i_blkbits is also 12 if you're using 4k blocks (which you have to to
get a large filesystem). So the shift count is 12 - 12 = 0: the cast
does nothing and the original code is safe.

The on-disk format for ext[23] uses 32-bit block numbers, so your
maximum filesystem size is 16TB, and your maximum value of iblock is
2^32-1.

Please do benchmark XFS and ext3 on your system before choosing. Our
tests (to be published in Linux.Conf.Au next month) show that XFS is
significantly faster for some workloads. Also its scalability to very
large filesystems is much more mature than ext3.

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately, the political takes *forever*
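The overflow argument can be checked with a short user-space sketch. The function names below are invented for illustration, and uint32_t and uint64_t stand in for a 32-bit page index and a 64-bit sector_t; none of this is kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Shift carried out in 32 bits: high bits are lost if the result needs more. */
uint64_t iblock_narrow(uint32_t index, unsigned int shift)
{
    return (uint32_t)(index << shift);
}

/* Widen first, as the proposed patches do with the (sector_t) cast. */
uint64_t iblock_wide(uint32_t index, unsigned int shift)
{
    return (uint64_t)index << shift;
}

/* Largest filesystem addressable with 32-bit block numbers, in bytes. */
uint64_t max_fs_bytes(unsigned int blkbits)
{
    return (uint64_t)1 << (32 + blkbits);
}
```

With a shift count of 0 (4k pages and 4k blocks) the two variants agree for every index, which is the point made above. They differ only when the block size is smaller than the page size, and 2^32 blocks of 4096 bytes comes to 2^44 bytes, the 16TB limit quoted.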
Re: forkbombing Linux distributions
>>>>> "William" == William Beebe <[EMAIL PROTECTED]> writes:

William> Sure enough, I created the following script and ran it as a
William> non-root user:

William> #!/bin/bash
William> $0 & $0 &

There are two approaches to fixing this.

  1. Rate limit fork(). Unfortunately some legitimate uses do a lot
     of forking, and you don't really want to slow them down.
  2. Limit (per user) the number of processes allowed. This is
     what's currently done; and if you as administrator want to, you
     can set RLIMIT_NPROC (the nproc item) in
     /etc/security/limits.conf

On an almost-single-user system such as most desktops, there isn't
much point in setting this. On shared systems, it can be useful.

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately, the political takes *forever*
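The per-user limit in approach 2 is the same resource limit a process can lower for itself with setrlimit(). A minimal sketch follows; the function name cap_process_count is invented for this example, and it only ever lowers the soft limit, so it needs no privilege.

```c
#include <assert.h>
#include <sys/resource.h>

/*
 * Lower this process's soft RLIMIT_NPROC to at most `max_procs`.
 * Returns 0 on success, -1 on error (errno is set by the failing call).
 */
int cap_process_count(rlim_t max_procs)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NPROC, &rl) == -1)
        return -1;
    if (rl.rlim_cur == RLIM_INFINITY || rl.rlim_cur > max_procs)
        rl.rlim_cur = max_procs;    /* never raise, only lower */
    return setrlimit(RLIMIT_NPROC, &rl);
}
```

Once the limit is in place, fork() in this process and its children fails with EAGAIN when the user already has that many processes, which is what stops the script above from taking the machine down. On Linux the limit is not enforced for root.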
Re: vm_dirty_ratio seems a bit large.
>>>>> "Andrew" == Andrew Morton <[EMAIL PROTECTED]> writes:

Andrew> Robin Holt <[EMAIL PROTECTED]> wrote:
>> One other issue we have is the vm_dirty_ratio and background_ratio
>> adjustments are a little coarse with these memory sizes. Since our
>> minimum adjustment is 1%, we are adjusting by 40GB on the largest
>> configuration from above. The hardware we are shipping today is
>> capable of going to far greater amounts of memory, but we don't
>> have customers demanding that yet. I would like to plan ahead for
>> that and change vm_dirty_ratio from a straight percent into a
>> millipercent (thousandth of a percent). Would that type of change
>> be acceptable?

Andrew> Oh drat. I think such a change would require a new set of
Andrew> /proc entries.

No, you could just extend them to understand fixed point. Keep
printing integers as integers, print non-integers with one (or two:
will we ever need 0.01% increments?) decimal places.

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately, the political takes *forever*
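One way to picture the fixed-point suggestion is a pair of helpers that store the ratio in hundredths of a percent. This is a user-space sketch under that assumption, not a proposed sysctl handler; parse_centipercent and format_centipercent are hypothetical names, and there is no overflow check on very large inputs.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Parse "N", "N.d" or "N.dd" into hundredths of a percent.
 * Returns -1 on malformed input.
 */
long parse_centipercent(const char *s)
{
    long whole = 0, frac = 0;
    int digits = 0;

    if (*s < '0' || *s > '9')
        return -1;
    while (*s >= '0' && *s <= '9')
        whole = whole * 10 + (*s++ - '0');
    if (*s == '.') {
        s++;
        while (*s >= '0' && *s <= '9' && digits < 2) {
            frac = frac * 10 + (*s++ - '0');
            digits++;
        }
        if (digits == 0)
            return -1;
        if (digits == 1)
            frac *= 10;             /* "1.5" means 1.50 */
    }
    if (*s != '\0' && *s != '\n')
        return -1;                  /* trailing junk or a third decimal */
    return whole * 100 + frac;
}

/* Print integers as integers, anything else with the decimals it needs. */
void format_centipercent(long value, char *buf, size_t len)
{
    long whole = value / 100, frac = value % 100;

    if (frac == 0)
        snprintf(buf, len, "%ld", whole);
    else if (frac % 10 == 0)
        snprintf(buf, len, "%ld.%ld", whole, frac / 10);
    else
        snprintf(buf, len, "%ld.%02ld", whole, frac);
}
```

Existing scripts that write "40" and read back "40" would see no change; only values that are not whole percentages gain a decimal point.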
Can no longer build ipv6 built-in (2.6.11, today's BK head)
Changeset [EMAIL PROTECTED]|ChangeSet|20050310043957|06845 added cleanup to
ipv6_init(), which calls ip6_route_cleanup().

ip6_route_cleanup() is marked __exit, so it cannot be called from an __init
section: the linker discards it from the built-in image (although it is
retained in a module). You get errors like this:

ip6_route_cleanup: discarded in section `.exit.text' from net/built-in.o
xfrm6_fini: discarded in section `.exit.text' from net/built-in.o
fib6_gc_cleanup: discarded in section `.exit.text' from net/built-in.o
ipv6_packet_cleanup: discarded in section `.exit.text' from net/built-in.o

A simple fix is to delete the __exit from the various functions, now that
they're called from somewhere other than module_exit.

Signed-off-by: Peter Chubb <[EMAIL PROTECTED]>

Index: linux-2.5-import/net/ipv6/route.c
===================================================================
--- linux-2.5-import.orig/net/ipv6/route.c	2005-03-16 10:12:44.742595387 +1100
+++ linux-2.5-import/net/ipv6/route.c	2005-03-16 13:01:50.246678866 +1100
@@ -2116,7 +2116,7 @@
 #endif
 }
 
-void __exit ip6_route_cleanup(void)
+void ip6_route_cleanup(void)
 {
 #ifdef CONFIG_PROC_FS
 	proc_net_remove("ipv6_route");
Index: linux-2.5-import/net/ipv6/ipv6_sockglue.c
===================================================================
--- linux-2.5-import.orig/net/ipv6/ipv6_sockglue.c	2005-03-16 10:12:44.736736056 +1100
+++ linux-2.5-import/net/ipv6/ipv6_sockglue.c	2005-03-16 13:24:19.095793200 +1100
@@ -698,7 +698,7 @@
 	dev_add_pack(&ipv6_packet_type);
 }
 
-void __exit ipv6_packet_cleanup(void)
+void ipv6_packet_cleanup(void)
 {
 	dev_remove_pack(&ipv6_packet_type);
 }
Index: linux-2.5-import/net/ipv6/ip6_fib.c
===================================================================
--- linux-2.5-import.orig/net/ipv6/ip6_fib.c	2005-03-15 12:28:44.819748921 +1100
+++ linux-2.5-import/net/ipv6/ip6_fib.c	2005-03-16 13:27:46.423351526 +1100
@@ -1218,7 +1218,7 @@
 		panic("cannot create fib6_nodes cache");
 }
 
-void __exit fib6_gc_cleanup(void)
+void fib6_gc_cleanup(void)
 {
 	del_timer(&ip6_fib_timer);
 	kmem_cache_destroy(fib6_node_kmem);
Index: linux-2.5-import/net/ipv6/xfrm6_policy.c
===================================================================
--- linux-2.5-import.orig/net/ipv6/xfrm6_policy.c	2005-03-15 12:28:44.853928319 +1100
+++ linux-2.5-import/net/ipv6/xfrm6_policy.c	2005-03-16 13:53:28.890552848 +1100
@@ -276,7 +276,7 @@
 	xfrm_policy_register_afinfo(&xfrm6_policy_afinfo);
 }
 
-static void __exit xfrm6_policy_fini(void)
+static void xfrm6_policy_fini(void)
 {
 	xfrm_policy_unregister_afinfo(&xfrm6_policy_afinfo);
 }
@@ -287,7 +287,7 @@
 	xfrm6_state_init();
 }
 
-void __exit xfrm6_fini(void)
+void xfrm6_fini(void)
 {
 	//xfrm6_input_fini();
 	xfrm6_policy_fini();
Index: linux-2.5-import/net/ipv6/xfrm6_state.c
===================================================================
--- linux-2.5-import.orig/net/ipv6/xfrm6_state.c	2005-03-15 12:28:44.854904874 +1100
+++ linux-2.5-import/net/ipv6/xfrm6_state.c	2005-03-16 13:29:30.183337361 +1100
@@ -129,7 +129,7 @@
 	xfrm_state_register_afinfo(&xfrm6_state_afinfo);
 }
 
-void __exit xfrm6_state_fini(void)
+void xfrm6_state_fini(void)
 {
 	xfrm_state_unregister_afinfo(&xfrm6_state_afinfo);
 }

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately, the political takes *forever*
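The pattern behind this patch can be tried outside the kernel. What follows is only a minimal user-space sketch: the names `demo_route_init`, `demo_late_step`, `demo_route_cleanup`, `demo_init` and `demo_exit` are hypothetical, and `__init` and `__exit` are defined as empty macros here, whereas in the kernel they place functions in `.init.text` and `.exit.text`. It shows a cleanup routine that is reachable from the init error path and therefore must not carry `__exit`.

```c
#include <stdbool.h>

/* In the kernel these macros select the .init.text / .exit.text sections,
 * and a built-in image discards .exit.text.  Here they are empty stand-ins
 * so that the sketch compiles in user space. */
#ifndef __init
#define __init
#endif
#ifndef __exit
#define __exit
#endif

static bool route_ready;   /* hypothetical state created by the init step */
static int cleanup_calls;  /* how many times the cleanup routine ran */

static int __init demo_route_init(void)
{
	route_ready = true;
	return 0;
}

/* Deliberately NOT marked __exit: it is called from the init error path
 * below, so it has to survive in a built-in image. */
static void demo_route_cleanup(void)
{
	route_ready = false;
	cleanup_calls++;
}

/* Hypothetical later init step that fails on request. */
static int __init demo_late_step(bool fail)
{
	return fail ? -1 : 0;
}

/* Init with unwinding: a late failure undoes the earlier step. */
int __init demo_init(bool fail_late)
{
	int err = demo_route_init();

	if (err)
		return err;
	err = demo_late_step(fail_late);
	if (err)
		goto undo_route;
	return 0;

undo_route:
	demo_route_cleanup();
	return err;
}

/* Module unload path: the only caller that may itself be __exit. */
void __exit demo_exit(void)
{
	demo_route_cleanup();
}
```

Calling `demo_init(true)` exercises the unwind path; in a real built-in kernel image the same call would fail to link if `demo_route_cleanup()` still lived in `.exit.text`.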
Re: User mode drivers: part 1, interrupt handling (patch for 2.6.11)
>>>>> "Jon" == Jon Smirl <[EMAIL PROTECTED]> writes:

Jon> On Tue, 15 Mar 2005 14:47:42 +1100, Peter Chubb
Jon> <[EMAIL PROTECTED]> wrote:
>> What I really want to do is deprivilege the driver code as much as
>> possible. Whatever a driver does, the rest of the system should
>> keep going. That way malicious or buggy drivers can only affect
>> the processes that are trying to use the device they manage.
>> Moreover, it should be possible to kill -9 a driver, then restart
>> it, without the rest of the system noticing more than a hiccup. To
>> do this, step one is to run the driver in user space, so that it's
>> subject to the same resource management control as any other
>> process. Step two, which is a lot harder, is to connect the driver
>> back into the kernel so that it can be shared. Tun/Tap can be used
>> for network devices, but it's really too slow -- you need zero-copy
>> and shared notification.

Jon> Have you considered running the drivers in a domain under Xen?

See the paper presented by Karlsruhe at OSDI: Joshua LeVasseur, Volkmar
Uhlig, Jan Stoess, and Stefan Götz: Unmodified Device Driver Reuse and
Improved System Dependability via Virtual Machines. OSDI '04.

They're using L4, rather than Xen, as the paravirtualisation layer.

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately, the political takes *forever*
Re: User mode drivers: part 1, interrupt handling (patch for 2.6.11)
>>>>> "Jon" == Jon Smirl <[EMAIL PROTECTED]> writes:

Jon> On Mon, 14 Mar 2005 12:42:27 +1100, Peter Chubb
Jon> <[EMAIL PROTECTED]> wrote:
>> >>>>> "Jon" == Jon Smirl <[EMAIL PROTECTED]> writes:
>>
>> >> The scenario I'm thinking about with these patches are things
>> >> like low-latency user-level networking between nodes in a
>> >> cluster, where for good performance even with a kernel driver
>> >> you don't want to share your interrupt line with anything else.
>>
>> Jon> The code needs to refuse to install if the IRQ line is shared.
>>
>> It does. The request_irq() call explicitly does not include
>> SA_SHARED in its flags, so if the line is shared, it'll return an
>> error to user space when the driver tries to open the file
>> representing the interrupt.

Jon> Please put some big comments warning people about adding
Jon> SA_SHARED. I can easily see someone thinking that they are fixing
Jon> a bug by adding it. I'd probably even write a paragraph about
Jon> what will happen if SA_SHARED is added.

Will do. The main problem here is X86, as other architectures either
don't care, or have enough interrupt lines. And the people who are
paying me for this kind of thing all run IA64.

What I really want to do is deprivilege the driver code as much as
possible. Whatever a driver does, the rest of the system should keep
going. That way malicious or buggy drivers can only affect the
processes that are trying to use the device they manage. Moreover, it
should be possible to kill -9 a driver, then restart it, without the
rest of the system noticing more than a hiccup.

To do this, step one is to run the driver in user space, so that it's
subject to the same resource management control as any other process.

Step two, which is a lot harder, is to connect the driver back into
the kernel so that it can be shared. Tun/Tap can be used for network
devices, but it's really too slow -- you need zero-copy and shared
notification.

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately, the political takes *forever*
inode_lock heavily contended in 2.6.11
When running reaim7 on a 12-way IA64 on an ext2 filesystem on a ram
disc, I see very heavy contention on inode_lock.

lockstat output shows:

SPINLOCKS         HOLD            WAIT
  UTIL   CON    MEAN(  MAX )   MEAN(  MAX )(% CPU)    TOTAL NOWAIT  SPIN RJECT  NAME
 46.8% 52.4%   1.9us( 130us)   20us(8073us)(21.5%)  5072151  47.6% 52.4%    0%  inode_lock
 15.9% 59.5%   3.8us(  61us)   18us(7067us)( 3.9%)   852983  40.5% 59.5%    0%    __sync_single_inode+0xf0
  9.2% 59.0%   1.2us(  25us)   20us(8073us)( 7.8%)  1596487  41.0% 59.0%    0%    generic_osync_inode+0xe0

(etc).

Is anyone else seeing this on more realistic workloads?

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately, the political takes *forever*
Re: User mode drivers: part 1, interrupt handling (patch for 2.6.11)
>>>>> "Jon" == Jon Smirl <[EMAIL PROTECTED]> writes:

Jon> On Mon, 14 Mar 2005 12:42:27 +1100, Peter Chubb
Jon> <[EMAIL PROTECTED]> wrote:
>> >>>>> "Jon" == Jon Smirl <[EMAIL PROTECTED]> writes:
>>
>> >> The scenario I'm thinking about with these patches are things
>> >> like low-latency user-level networking between nodes in a
>> >> cluster, where for good performance even with a kernel driver
>> >> you don't want to share your interrupt line with anything else.

Jon> Instead of making up a new API what about making a library of
Jon> calls that emulates the common entry points used by device
Jon> drivers. The version I did for UML could take the same driver and
Jon> run it in user space or the kernel without changing source
Jon> code. I found this very useful.

The in-kernel device drivers interface is very large --- I want to
start with something a bit simpler. We do have a compatibility
library, as yet unreleased, that allows the same drivers to run
in-kernel or in user space.

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately, the political takes *forever*
Re: User mode drivers: part 1, interrupt handling (patch for 2.6.11)
>>>>> "Jon" == Jon Smirl <[EMAIL PROTECTED]> writes:

Jon> On Sat, 12 Mar 2005 10:11:18 -0700 (MST), Zwane Mwaikambo
Jon> <[EMAIL PROTECTED]> wrote:
>> Alan's proposal sounds very plausible and additionally if we find
>> that we have an irq line screaming we could use the same supplied
>> information to disable userspace interrupt handled devices first.

Jon> I like it too and it would help Xen. Now we just need to modify
Jon> 800 device drivers to use it.

It's incomplete. But you probably knew that...

The main problem I see is that even with the proposed interface, you'd
need to disable the interrupt in the interrupt controller, because
merely acknowledging an interrupt to a device doesn't stop it from
interrupting. And you really want the device to stop asserting the
interrupt before doing an EOI, unless you're going to mask the
interrupt.

So you'd need to have an interface that not only acknowledged the
current interrupt but also prevented the device from interrupting.
That typically means reading a status register (slow!) and then
setting one or more bits in one or more control registers.

Also, for a user-level driver you really want to do the EOI before
invoking user space. Otherwise, depending on the interrupt controller,
lower numbered interrupts could be masked until the user space returns
--- which might be a long time off.

Reading the status register is typically one of the slowest single
parts of a device driver (latency can be > 2 usec), so you don't
really want to have to read it again within the driver... so you'd
probably want to pass it as part of the interrupt arguments to the
driver.

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately, the political takes *forever*
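The ordering argued for above (read the status register once, stop the device asserting, then hand the saved status to the driver) can be sketched with a simulated device. Everything here is hypothetical: `struct demo_dev`, its register layout and the function names are made up for illustration, and no real hardware or kernel interface is touched.

```c
#include <stdint.h>

/* Simulated device registers; the layout is purely illustrative. */
struct demo_dev {
	uint32_t status;        /* latched interrupt causes */
	uint32_t int_enable;    /* per-cause interrupt enable bits */
	unsigned status_reads;  /* counts the (slow) status register reads */
};

/* Stands in for the slow off-chip read of the status register. */
static uint32_t demo_read_status(struct demo_dev *d)
{
	d->status_reads++;
	return d->status;
}

/* Kernel-side stub: take one status snapshot, then stop the device from
 * asserting its interrupt.  After this the EOI can be issued safely. */
uint32_t demo_ack_and_quiesce(struct demo_dev *d)
{
	uint32_t status = demo_read_status(d);  /* the single slow read */

	d->int_enable = 0;                      /* device stops interrupting */
	return status;
}

/* User-level handler: receives the snapshot as an argument, so it does
 * not have to read the status register a second time.  Returns 1 if the
 * (hypothetical) "receive complete" cause in bit 0 was set. */
int demo_user_handler(uint32_t status)
{
	return (status & 0x1u) ? 1 : 0;
}
```

The point of the design is visible in `status_reads`: it stays at one per interrupt, however much work the user-level handler does with the snapshot.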
Re: User mode drivers: part 1, interrupt handling (patch for 2.6.11)
>>>>> "Jon" == Jon Smirl <[EMAIL PROTECTED]> writes:

>> The scenario I'm thinking about with these patches are things like
>> low-latency user-level networking between nodes in a cluster, where
>> for good performance even with a kernel driver you don't want to
>> share your interrupt line with anything else.

Jon> The code needs to refuse to install if the IRQ line is shared.

It does. The request_irq() call explicitly does not include SA_SHARED
in its flags, so if the line is shared, it'll return an error to user
space when the driver tries to open the file representing the
interrupt.

Jon> Also what about SMP, if you shut the IRQ off on one CPU isn't it
Jon> still enabled on all of the others?

Nope. disable_irq_nosync() talks to the interrupt controller, which
is common to all the processors. The main problem is that it's slow,
because it has to go off-chip.

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately, the political takes *forever*
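Both answers above can be modelled in a few lines of user-space C. This is a sketch under stated assumptions, not kernel code: `demo_request_irq`, `demo_disable_irq`, `demo_irq_deliverable` and the `DEMO_SHARED` flag are hypothetical stand-ins, and the model only mirrors the behaviour described in the message (an exclusive request fails on a line that is already in use, and masking is a single piece of state in the controller rather than per-CPU state).

```c
#include <errno.h>
#include <stdbool.h>

#define DEMO_NR_IRQS 16
#define DEMO_SHARED  0x1u  /* hypothetical stand-in for the sharing flag */

struct demo_irq_line {
	unsigned int users;  /* number of registered handlers */
	bool all_shared;     /* every current user agreed to share */
	bool masked;         /* held once, in the simulated controller */
};

static struct demo_irq_line demo_lines[DEMO_NR_IRQS];

/* Register a handler.  Fails with -EBUSY unless the line is free or
 * both the existing users and the new one asked for sharing. */
int demo_request_irq(unsigned int irq, unsigned int flags)
{
	struct demo_irq_line *l;
	bool want_shared = (flags & DEMO_SHARED) != 0;

	if (irq >= DEMO_NR_IRQS)
		return -EINVAL;
	l = &demo_lines[irq];
	if (l->users > 0 && !(want_shared && l->all_shared))
		return -EBUSY;
	if (l->users == 0)
		l->all_shared = want_shared;
	l->users++;
	return 0;
}

/* Mask the line at the controller. */
void demo_disable_irq(unsigned int irq)
{
	if (irq < DEMO_NR_IRQS)
		demo_lines[irq].masked = true;
}

/* The mask lives in the controller, so the answer does not depend on
 * which CPU asks. */
bool demo_irq_deliverable(unsigned int irq, unsigned int cpu)
{
	(void)cpu;
	return irq < DEMO_NR_IRQS && !demo_lines[irq].masked;
}
```

An exclusive claim on a line therefore acts as the refusal Jon asks for: a second request on the same line gets `-EBUSY` whether or not it sets the sharing flag.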
Re: User mode drivers: part 1, interrupt handling (patch for 2.6.11)
>>>>> "Jon" == Jon Smirl <[EMAIL PROTECTED]> writes:

Jon> On Fri, 11 Mar 2005 11:29:20 +0100, Pavel Machek <[EMAIL PROTECTED]>
Jon> wrote:
>> Hi!
>>
>> > As many of you will be aware, we've been working on infrastructure
>> > for user-mode PCI and other drivers. The first step is to be able to
>> > handle interrupts from user space. Subsequent patches add
>> > infrastructure for setting up DMA for PCI devices.
>> >
>> > The user-level interrupt code doesn't depend on the other patches,
>> > and is probably the most mature of this patchset.
>>
>> Okay, I like it; it means way easier PCI driver development.

Jon> It won't help with PCI driver development. I tried implementing
Jon> this for UML. If your driver has any bugs it won't get the
Jon> interrupts acknowledged correctly and you'll end up rebooting.

That's not actually true, at least when we developed drivers here.
The only times we had to reboot were the times we mucked up the dma
register settings, and dma'd all over the kernel by mistake...

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately, the political takes *forever*
Re: User mode drivers: part 1, interrupt handling (patch for 2.6.11)
>>>>> "Jon" == Jon Smirl <[EMAIL PROTECTED]> writes:

Jon> On Fri, 11 Mar 2005 14:36:10 +1100, Peter Chubb
Jon> <[EMAIL PROTECTED]> wrote:
>> As many of you will be aware, we've been working on infrastructure
>> for user-mode PCI and other drivers. The first step is to be able
>> to handle interrupts from user space. Subsequent patches add
>> infrastructure for setting up DMA for PCI devices.

Jon> I've tried implementing this before and could not get around the
Jon> interrupt problem. Most interrupts on the x86 architecture are
Jon> shared. Disabling the IRQ at the PIC blocks all of the shared

Fortunately, most interrupts on IA64, ARM, etc., are unshared. And
with PCI-Express, the problem will go away.

Even on X86, things aren't all bad: one can usually find a PCI slot
which doesn't share interrupts with anything you care about.

The scenario I'm thinking about with these patches are things like
low-latency user-level networking between nodes in a cluster, where
for good performance even with a kernel driver you don't want to share
your interrupt line with anything else.

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately, the political takes *forever*
Re: User mode drivers: part 2: PCI device handling (patch 1/2 for 2.6.11)
>>>>> "Greg" == Greg KH <[EMAIL PROTECTED]> writes:

Greg> On Fri, Mar 11, 2005 at 07:34:46PM +1100, Peter Chubb wrote:
>> >>>>> "Greg" == Greg KH <[EMAIL PROTECTED]> writes:
>>
>> Greg> On Fri, Mar 11, 2005 at 02:37:17PM +1100, Peter Chubb wrote:
>> >> +/*
>> >> + * The PCI subsystem is implemented as yet-another pseudo filesystem,
>> >> + * albeit one that is never mounted.
>> >> + * This is its magic number.
>> >> + */
>> >> +#define USR_PCI_MAGIC (0x12345678)
>>
>> Greg> If you make it a real, mountable filesystem, then you don't need
>> Greg> to have any of your new syscalls, right? Why not just do that
>> Greg> instead?
>>
>> The only call that would go is usr_pci_open() -- you'd still need
>> usr_pci_map()

Greg> see mmap(2)

mmap maps a file's contents into your own virtual memory.
usr_pci_map maps part of your own virtual memory into PCI bus space
for a particular device (using the IOMMU if your machine has one), and
returns a scatterlist of bus addresses to hand to the device.

Different semantics entirely.

Greg> In fact, both of the above can be done today from /proc/bus/pci/
Greg> right?

Nope.

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately, the political takes *forever*
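One reason a call like usr_pci_map() has to return a list rather than a single address is that a virtually contiguous buffer spans several pages, which need not be contiguous on the bus. The sketch below is not the usr_pci_map() interface (its signature is not given in this thread); `demo_split_by_page`, `struct demo_sg_entry` and the 4096-byte page size are assumptions, and the addresses it produces are plain virtual addresses where a real mapping would supply bus addresses.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096u  /* assumed page size for the illustration */

/* One segment of a scatterlist-like array.  A real mapping call would
 * fill addr with a bus address; here it is just the virtual address. */
struct demo_sg_entry {
	uintptr_t addr;
	size_t len;
};

/* Split [start, start + len) at page boundaries.
 * Returns the number of entries written, or -1 if 'max' is too small. */
int demo_split_by_page(uintptr_t start, size_t len,
		       struct demo_sg_entry *out, int max)
{
	int n = 0;

	while (len > 0) {
		/* Bytes left before the next page boundary. */
		size_t room = DEMO_PAGE_SIZE - (size_t)(start % DEMO_PAGE_SIZE);
		size_t chunk = len < room ? len : room;

		if (n == max)
			return -1;
		out[n].addr = start;
		out[n].len = chunk;
		n++;
		start += chunk;
		len -= chunk;
	}
	return n;
}
```

An 8192-byte buffer that starts 100 bytes into a page, for instance, comes back as three segments, each of which could be translated to a different bus address.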
Re: User mode drivers: part 1, interrupt handling (patch for 2.6.11)
On Gwe, 2005-03-11 at 03:36, Peter Chubb wrote:
> +static irqreturn_t irq_proc_irq_handler(int irq, void *vidp, struct pt_regs *regs)
> +{
> +	struct irq_proc *idp = (struct irq_proc *)vidp;
> +
> +	BUG_ON(idp->irq != irq);
> +	disable_irq_nosync(irq);
> +	atomic_inc(&idp->count);
> +	wake_up(&idp->q);
> +	return IRQ_HANDLED;

Alan> You just deadlocked the machine in many configurations. You can't use
Alan> disable_irq for this trick you have to tell the kernel how to handle it.

Can you elaborate, please? In particular, why doesn't essentially the
same action (disabling an interrupt before the EOI) in
note_interrupt() lock up the machine?

I can see there'd be problems if the code allowed shared interrupts,
but it doesn't.

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately, the political takes *forever*
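The handler quoted above does three things: it masks the line, counts the interrupt, and wakes the waiter. A single-threaded user-space model makes that sequence easy to follow. This is a sketch only: `struct demo_irq_proc`, `demo_handler` and `demo_read` are hypothetical names, and the read side, in particular re-enabling the line when the count is consumed, is an assumption that the quoted fragment neither shows nor confirms.

```c
#include <stdbool.h>

/* User-space model of the per-interrupt state in the quoted handler. */
struct demo_irq_proc {
	int irq;
	bool masked;         /* models disable_irq_nosync() / re-enable */
	unsigned int count;  /* interrupts seen since the last read */
};

/* Models the quoted handler: mask the line, then count the event.
 * Returns 1 if the interrupt was delivered, 0 if the line was masked
 * or the number did not match. */
int demo_handler(struct demo_irq_proc *idp, int irq)
{
	if (idp->irq != irq || idp->masked)
		return 0;
	idp->masked = true;  /* no further interrupts until re-enabled */
	idp->count++;
	return 1;
}

/* Assumed read side (not shown in the quoted patch): report how many
 * interrupts are pending, reset the count, and unmask the line. */
unsigned int demo_read(struct demo_irq_proc *idp)
{
	unsigned int n = idp->count;

	idp->count = 0;
	idp->masked = false;
	return n;
}
```

In this model a device that keeps asserting its line produces one counted event per read, because the line stays masked between the handler and the next `demo_read()`.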
Re: User mode drivers: part 1, interrupt handling (patch for 2.6.11)
On Gwe, 2005-03-11 at 03:36, Peter Chubb wrote: +static irqreturn_t irq_proc_irq_handler(int irq, void *vidp, struct pt_regs *regs) +{ + struct irq_proc *idp = (struct irq_proc *)vidp; + + BUG_ON(idp-irq != irq); + disable_irq_nosync(irq); + atomic_inc(idp-count); + wake_up(idp-q); + return IRQ_HANDLED; Alan You just deadlocked the machine in many configurations. You can't use Alan disable_irq for this trick you have to tell the kernel how to handle it. Can you elaborate, please? In particular, why doesn't essentially the same action (disabling an interrupt before the EOI) in note_interrupt() not lock up the machine? I can see there'd be problems if the code allowed shared interrupts, but it doesn't. -- Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au The technical we do immediately, the political takes *forever* - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: User mode drivers: part 2: PCI device handling (patch 1/2 for 2.6.11)
Greg == Greg KH [EMAIL PROTECTED] writes: Greg On Fri, Mar 11, 2005 at 07:34:46PM +1100, Peter Chubb wrote: Greg == Greg KH [EMAIL PROTECTED] writes: Greg On Fri, Mar 11, 2005 at 02:37:17PM +1100, Peter Chubb wrote: +/* + * The PCI subsystem is implemented as yet-another pseudo filesystem, + * albeit one that is never mounted. + * This is its magic number. + */ +#define USR_PCI_MAGIC (0x12345678) Greg If you make it a real, mountable filesystem, then you don't need Greg to have any of your new syscalls, right? Why not just do that Greg instead? The only call that would go is usr_pci_open() -- you'd still need usr_pci_map() Greg see mmap(2) mmap maps a file's contents into your own virtual memory. usr_pci_map maps part of your own virtual memory into pci bus space for a particular device (using the IOMMU if your machine has one), and returns a scatterlist of bus addresses to hand to the device. Different semantics entirely. Greg In fact, both of the above can be done today from /proc/bus/pci/ Greg right? Nope. -- Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au The technical we do immediately, the political takes *forever* - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: User mode drivers: part 1, interrupt handling (patch for 2.6.11)
Jon == Jon Smirl [EMAIL PROTECTED] writes: Jon On Fri, 11 Mar 2005 14:36:10 +1100, Peter Chubb Jon [EMAIL PROTECTED] wrote: As many of you will be aware, we've been working on infrastructure for user-mode PCI and other drivers. The first step is to be able to handle interrupts from user space. Subsequent patches add infrastructure for setting up DMA for PCI devices. Jon I've tried implementing this before and could not get around the Jon interrupt problem. Most interrupts on the x86 architecture are Jon shared. Disabling the IRQ at the PIC blocks all of the shared Fortunately, most interrupts on IA64, ARM, etc., are unshared. And with PCI-Express, the problem will go away. Even on X86, things aren't all bad: one can usually find a PCI slot which doesn't share interrupts with anything you care about. The scenario I'm thinking about with these patches are things like low-latency user-level networking between nodes in a cluster, where for good performance even with a kernel driver you don't want to share your interrupt line with anything else. -- Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au The technical we do immediately, the political takes *forever* - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: User mode drivers: part 1, interrupt handling (patch for 2.6.11)
Jon == Jon Smirl [EMAIL PROTECTED] writes: Jon On Fri, 11 Mar 2005 11:29:20 +0100, Pavel Machek [EMAIL PROTECTED] Jon wrote: Hi! As many of you will be aware, we've been working on infrastructure for user-mode PCI and other drivers. The first step is to be able to handle interrupts from user space. Subsequent patches add infrastructure for setting up DMA for PCI devices. The user-level interrupt code doesn't depend on the other patches, and is probably the most mature of this patchset. Okay, I like it; it means way easier PCI driver development. Jon It won't help with PCI driver development. I tried implementing Jon this for UML. If your driver has any bugs it won't get the Jon interrupts acknowledged correctly and you'll end up rebooting. That's not actually true, at least when we developed drivers here. The only times we had to reboot were the times we mucked up the dma register settings, and dma'd all over the kernel by mistake... -- Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au The technical we do immediately, the political takes *forever* - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: User mode drivers: part 1, interrupt handling (patch for 2.6.11)
Jon == Jon Smirl [EMAIL PROTECTED] writes: The scenario I'm thinking about with these patches are things like low-latency user-level networking between nodes in a cluster, where for good performance even with a kernel driver you don't want to share your interrupt line with anything else. Jon The code needs to refuse to install if the IRQ line is shared. It does. The request_irq() call explicitly does not include SA_SHARED in its flags, so if the line is shared, it'll return an error to user space when the driver tries to open the file representing the interrupt. Jon Also what about SMP, if you shut the IRQ off on one CPU isn't it Jon still enabled on all of the others? Nope. disable_irq_nosync() talks to the interrupt controller, which is common to all the processors. The main problem is that it's slow, because it has to go off-chip. -- Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au The technical we do immediately, the political takes *forever* - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: User mode drivers: part 1, interrupt handling (patch for 2.6.11)
Jon == Jon Smirl [EMAIL PROTECTED] writes: Jon On Sat, 12 Mar 2005 10:11:18 -0700 (MST), Zwane Mwaikambo Jon [EMAIL PROTECTED] wrote: Alan's proposal sounds very plausible and additionally if we find that we have an irq line screaming we could use the same supplied information to disable userspace interrupt handled devices first. Jon I like it too and it would help Xen. Now we just need to modify Jon 800 device drivers to use it. It's incomplete. But you probably knew that... The main problem I see is that even with the proposed interface, you'd need to disable the interrupt in the interrupt controller, because merely acknowledging an interrupt to a device doesn't stop it from interrupting. And you really want the device to stop asserting the interrupt before doing an EOI, unless you're going to mask the interrupt. So you'd need to have an interface that not only acknowledged the current interrupt but also prevented the device from interrupting. That typically means reading a status register (slow!) and then setting one or more bits in one or more control registers. Also for a user level driver you really want to do the EIO before invoking user space. Otherwise, depending on the interrupt controller, lower numbered interrupts could be masked until the user space returns --- which might be a long time off. Reading the status register is typically one of the slowest single parts of a device driver (latency can be 2 usec), so you don't really want to have to read it again within the driver... so you'd probably want to pass it as part of the interrupt arguments to the driver. -- Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au The technical we do immediately, the political takes *forever* - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: User mode drivers: part 1, interrupt handling (patch for 2.6.11)
>>>>> "Jon" == Jon Smirl <[EMAIL PROTECTED]> writes:

Jon> On Mon, 14 Mar 2005 12:42:27 +1100, Peter Chubb
Jon> <[EMAIL PROTECTED]> wrote:
>> >>>>> "Jon" == Jon Smirl <[EMAIL PROTECTED]> writes:
>>
>> The scenario I'm thinking about with these patches are things like
>> low-latency user-level networking between nodes in a cluster, where
>> for good performance even with a kernel driver you don't want to
>> share your interrupt line with anything else.

Jon> Instead of making up a new API what about making a library of
Jon> calls that emulates the common entry points used by device
Jon> drivers. The version I did for UML could take the same driver and
Jon> run it in user space or the kernel without changing source
Jon> code. I found this very useful.

The in-kernel device drivers interface is very large --- I want to
start with something a bit simpler. We do have a compatibility
library, as yet unreleased, that allows the same drivers to run
in-kernel or in user space.

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately, the political takes *forever*
inode_lock heavily contended in 2.6.11
When running reaim7 on a 12-way IA64 on an ext2 filesystem on a ram
disc, I see very heavy contention on inode_lock.

lockstat output shows:

SPINLOCKS         HOLD            WAIT
  UTIL  CON    MEAN(  MAX )   MEAN(  MAX )(% CPU)     TOTAL NOWAIT  SPIN RJECT  NAME
 46.8% 52.4%  1.9us( 130us)   20us(8073us)(21.5%)   5072151  47.6% 52.4%    0%  inode_lock
 15.9% 59.5%  3.8us(  61us)   18us(7067us)( 3.9%)    852983  40.5% 59.5%    0%    __sync_single_inode+0xf0
  9.2% 59.0%  1.2us(  25us)   20us(8073us)( 7.8%)   1596487  41.0% 59.0%    0%    generic_osync_inode+0xe0
(etc).

Is anyone else seeing this on more realistic workloads?

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately, the political takes *forever*
Re: Microstate Accounting for 2.6.11
>>>>> "Andi" == Andi Kleen <[EMAIL PROTECTED]> writes:

Andi> Andrew Morton <[EMAIL PROTECTED]> writes:
>> Why does the kernel need this feature?
>>
>> Have you any numbers on the overhead?

Andi> It does RDTSC and lots of complicated stuff twice for each
Andi> system call. On P4 this will be extremly slow (> 1000cycles
Andi> combined) It is pretty unlikely that whatever it does justifies
Andi> this extreme overhead in a critical fast path.

Not really `lots of complicated stuff'. Just swap a timer and set a
flag on entry:

	msp->timers[msp->laststate] += now - msp->lastchange
	msp->lastchange = now
	msp->laststate = ONCPU_SYS
	msp->cflags |= MSA_SYS

And swap timers and clear the flag on exit. The flag's needed to
force return to ONCPU_SYS rather than ONCPU_USR if the task is
preempted or interrupted while in a system call.

If there's a simpler, cheaper, faster way to track time spent in
system calls (as opposed to time spent in interrupt handlers, or on
the run queue) then I'd like to know what it is.

And I recognise there are lots of people who don't want this --- but
there are some who do. I've maintained this patch since mid 2003, and
have seen a steady trickle of downloads --- one or two a week.

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately, the political takes *forever*
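The entry and exit bookkeeping described in this message can be sketched as a small standalone C model. It is an illustration under assumptions, not the code from the patch: the struct layout, the `msa_syscall_entry()` and `msa_syscall_exit()` names, the two-state enum and the value of `MSA_SYS` are all made up here, and the timestamp is passed in as a parameter instead of being read from a hardware counter.

```c
#include <stdint.h>

typedef uint64_t clk_t;

/* Only the two states needed for the illustration. */
enum msa_state { ONCPU_USR, ONCPU_SYS, MSA_NR_STATES };

#define MSA_SYS 0x1u	/* assumed flag value: "inside a system call" */

struct microstates {
	clk_t timers[MSA_NR_STATES];	/* accumulated time per state */
	clk_t lastchange;		/* timestamp of the last state change */
	enum msa_state laststate;	/* state the task has been in since then */
	unsigned int cflags;
};

/* On system call entry: charge the elapsed time to the old state,
 * then switch to ONCPU_SYS and set the flag. */
static void msa_syscall_entry(struct microstates *msp, clk_t now)
{
	msp->timers[msp->laststate] += now - msp->lastchange;
	msp->lastchange = now;
	msp->laststate = ONCPU_SYS;
	msp->cflags |= MSA_SYS;
}

/* On system call exit: the same swap in the other direction,
 * and the flag is cleared. */
static void msa_syscall_exit(struct microstates *msp, clk_t now)
{
	msp->timers[msp->laststate] += now - msp->lastchange;
	msp->lastchange = now;
	msp->laststate = ONCPU_USR;
	msp->cflags &= ~MSA_SYS;
}
```

For example, a task that was in user mode since timestamp 100, enters a system call at 150 and leaves it at 180 ends up with 50 units charged to ONCPU_USR and 30 to ONCPU_SYS. The per-call cost is one clock read plus a handful of loads and stores, which is the point the message is making.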
Re: User mode drivers: part 2: PCI device handling (patch 1/2 for 2.6.11)
>>>>> "Greg" == Greg KH <[EMAIL PROTECTED]> writes:

Greg> On Fri, Mar 11, 2005 at 02:37:17PM +1100, Peter Chubb wrote:
>> +/*
>> + * The PCI subsystem is implemented as yet-another pseudo filesystem,
>> + * albeit one that is never mounted.
>> + * This is its magic number.
>> + */
>> +#define USR_PCI_MAGIC (0x12345678)

Greg> If you make it a real, mountable filesystem, then you don't need
Greg> to have any of your new syscalls, right? Why not just do that
Greg> instead?

The only call that would go is usr_pci_open() -- you'd still need
usr_pci_map(), usr_pci_unmap() and usr_pci_get_consistent().

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately, the political takes *forever*
Re: Microstate Accounting for 2.6.11
>>>>> "Andrew" == Andrew Morton <[EMAIL PROTECTED]> writes:

Andrew> Peter Chubb <[EMAIL PROTECTED]> wrote:
>> Timing data on threads at present is pretty crude: when the timer
>> interrupt occurs, a tick is added to either system time or user
>> time for the currently running thread. Thus in an unpacthed kernel
>> one can distinguish three timed states: On-cpu in userspace, on-cpu
>> in system space, and not running.
>>
>> The actual number of states is much larger. A thread can be on a
>> runqueue or the expired queue (i.e., ready to run but not running),
>> sleeping on a semaphore or on a futex, having its time stolen to
>> service an interrupt, etc., etc.
>>
>> This patch adds timers per-state to each struct task_struct, so
>> that time in all these states can be tracked. This patch contains
>> the core code do the timing, and to initialise the timers.
>> Subsequent patches enable the code (by adding Kconfig options) and
>> add hooks to track state changes.

Andrew> Why does the kernel need this feature?

I find that it's useful when trying to work out why a thread is going
more slowly than it needs to. Userspace tools in the CVS repository
at gelato.unsw.edu.au let you graph in real time the time spent in
each state, so you get graphs like this:
http://gelato.unsw.edu.au/patches/snapshot.png
which shows mplay skipping because of a slow disk/filesystem.

Andrew> Have you any numbers on the overhead?

Around 5% on LMbench context switch numbers for uniprocessor,
negligible on SMP (but SMP context switch results are horrible at the
moment according to LMbench2 -- almost 16usec); select on 10 fd goes
from 1.665 usec to 1.701 usec.

Andrew> The preempt_disable() in sys_msa() seems odd.

Yes, I only added that yesterday. It's to prevent migration while
updating the current timer. All the other places where the current
timer is updated are naturally protected against this. It should
probably be a local_irq_disable() instead.

Peter C
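To make the "pretty crude" point about tick-based accounting concrete, here is a toy simulation in C. It is not taken from the patch; the workload, the `accounting` struct and `simulate()` are hypothetical. It assumes a thread that wakes at the start of every tick period, runs for `run_us` microseconds (at most one period) and then sleeps, with the timer interrupt firing at the end of each period. Sampling at the tick charges a whole tick only if the thread is still running at that instant, while a per-state timer records the time actually spent on the CPU.

```c
#include <stdint.h>

struct accounting {
	uint64_t ticks_charged;	/* ticks charged by sampling at the timer interrupt */
	uint64_t exact_run_us;	/* CPU time a per-state timer would record */
};

/*
 * Simulate nperiods tick periods of period_us each. In every period
 * the thread runs for the first run_us microseconds and sleeps for
 * the rest. Assumes run_us <= period_us.
 */
static struct accounting simulate(uint64_t period_us, uint64_t run_us,
				  uint64_t nperiods)
{
	struct accounting acc = { 0, 0 };
	uint64_t i;

	for (i = 0; i < nperiods; i++) {
		/* Exact accounting: charge what was really used. */
		acc.exact_run_us += run_us;

		/* Tick accounting: the interrupt at the end of the
		 * period sees the thread only if it is still running. */
		if (run_us >= period_us)
			acc.ticks_charged++;
	}
	return acc;
}
```

With a 1000 usec tick and a thread that runs 400 usec per period, 100 periods give 40000 usec of real CPU time but zero charged ticks. This is a deliberately extreme case, chosen only to show the kind of error that sampling can produce.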
Microstate accounting, IA64 support
Microstate Accounting: Add suppoort for IA64. linux-2.6-ustate/arch/ia64/Kconfig | 25 +++ linux-2.6-ustate/arch/ia64/kernel/entry.S| 44 +++ linux-2.6-ustate/arch/ia64/kernel/irq_ia64.c | 21 +++- linux-2.6-ustate/arch/ia64/kernel/ivt.S |8 +++- linux-2.6-ustate/include/asm-ia64/msa.h | 33 linux-2.6-ustate/include/asm-ia64/unistd.h |1 7 files changed, 129 insertions(+), 5 deletions(-) Index: linux-2.6-ustate/arch/ia64/Kconfig === --- linux-2.6-ustate.orig/arch/ia64/Kconfig 2005-03-10 09:13:01.780632777 +1100 +++ linux-2.6-ustate/arch/ia64/Kconfig 2005-03-10 09:16:14.593655619 +1100 @@ -302,6 +302,31 @@ little bigger and slows down execution a bit, but it is generally a good idea to turn this on. If you're unsure, say Y. +config MICROSTATE + bool "Microstate accounting" + help + This option causes the kernel to keep very accurate track of + how long your threads spend on the runqueues, running, or asleep or + stopped. It will slow down your kernel. + Times are reported in /proc/pid/msa and through a new msa() + system call. +choice + depends on MICROSTATE + prompt "Microstate timing source" + default MICROSTATE_ITC + help + On IA64 one can use two timeing sources for the microstate + accounting; the on-chip interval counter, or Linux's + time-of-day clock. The first is very cheap; the other is + more accurate on SMP systems. + +config MICROSTATE_ITC + bool "Use on-chip ITC for microstate timing" + +config MICROSTATE_TOD + bool "Use time-of-day clock for microstate timings" +endchoice + config IA64_PALINFO tristate "/proc/pal support" help Index: linux-2.6-ustate/include/asm-ia64/msa.h === --- /dev/null 1970-01-01 00:00:00.0 + +++ linux-2.6-ustate/include/asm-ia64/msa.h 2005-03-10 09:16:14.594632174 +1100 @@ -0,0 +1,33 @@ +/ + * asm-ia64/msa.h + * + * Provide an architecture-specific clock. 
+ */ + +#ifndef _ASM_IA64_MSA_H +#define _ASM_IA64_MSA_H + +#include +#include +#include + + +# if defined(CONFIG_MICROSTATE_ITC) +# define MSA_NOW(now) do { now = (clk_t)get_cycles(); } while (0) + +# define MSA_TO_NSEC(clk) ((10*clk) / cpu_data(smp_processor_id())->itc_freq) + +# elif defined(CONFIG_MICROSTATE_TOD) +static inline void msa_now(clk_t *nsp) { + struct timeval tv; + do_gettimeofday(); + *nsp = tv.tv_sec * 100 + tv.tv_usec; +} +# define MSA_NOW(x) msa_now() +# define MSA_TO_NSEC(clk) ((clk) * 1000) + +# else +# include +# endif + +#endif /* _ASM_IA64_MSA_H */ Microstate Accounting: Track time in system calls for IA64 arch/ia64/kernel/entry.S | 44 arch/ia64/kernel/ivt.S |8 ++-- 2 files changed, 50 insertions(+), 2 deletions(-) Index: linux-2.6-ustate/arch/ia64/kernel/entry.S === --- linux-2.6-ustate.orig/arch/ia64/kernel/entry.S 2005-03-10 09:13:01.149778160 +1100 +++ linux-2.6-ustate/arch/ia64/kernel/entry.S 2005-03-10 09:16:15.157128068 +1100 @@ -589,6 +589,46 @@ .ret4: br.cond.sptk ia64_leave_kernel END(ia64_strace_leave_kernel) +#ifdef CONFIG_MICROSTATE +/* + * preserve input registers, + * and r8 + */ +GLOBAL_ENTRY(invoke_msa_end_syscall) + .prologue ASM_UNW_PRLG_RP|ASM_UNW_PRLG_PFS, ASM_UNW_PRLG_GRSAVE(8) + alloc loc1=ar.pfs,8,4,0,0 + mov loc0=rp + .body + ;; + mov loc2=ret0 + mov loc3=ret2 + br.call.sptk.many rp=msa_end_syscall +1: mov rp=loc0 + mov ret0=loc2 + mov ret2=loc3 + mov ar.pfs=loc1 + br.ret.sptk.many rp +END(invoke_msa_end_syscall) +/* + * Preserves in0-7, and all callee-save registers. 
+ */ +GLOBAL_ENTRY(invoke_msa_start_syscall) + .prologue ASM_UNW_PRLG_RP|ASM_UNW_PRLG_PFS, ASM_UNW_PRLG_GRSAVE(8) + alloc loc1=ar.pfs,8,4,0,0 + mov loc0=rp + .body + mov loc2=r3 + mov loc3=r15 + ;; + br.call.sptk.many rp=msa_start_syscall +1: mov r15=loc3 + mov r3=loc2 + mov ar.pfs=loc1 + mov rp=loc0 + br.ret.sptk.many rp +END(invoke_msa_start_syscall) +#endif /* CONFIG_MICROSTATE */ + GLOBAL_ENTRY(ia64_ret_from_clone) PT_REGS_UNWIND_INFO(0) { /* @@ -671,6 +711,10 @@ */ ENTRY(ia64_leave_syscall) PT_REGS_UNWIND_INFO(0) +#ifdef CONFIG_MICROSTATE + br.call.sptk.many rp=invoke_msa_end_syscall +1: +#endif /* * work.need_resched etc. mustn't get changed by this CPU before it returns to * user- or fsys-mode, hence we
Microstate Accounting for 2.6.11, patch 4/6
Microstate accounting: Account for time in interrupt handlers for I386.

 arch/i386/kernel/irq.c |   13 ++++++++++++-
 1 files changed, 12 insertions(+), 1 deletion(-)

Index: linux-2.6-ustate/arch/i386/kernel/irq.c
===================================================================
--- linux-2.6-ustate.orig/arch/i386/kernel/irq.c	2005-03-10 09:13:00.115606274 +1100
+++ linux-2.6-ustate/arch/i386/kernel/irq.c	2005-03-10 09:16:16.032121680 +1100
@@ -55,6 +55,8 @@
 #endif
 
 	irq_enter();
+	msa_start_irq(irq);
+
 #ifdef CONFIG_DEBUG_STACKOVERFLOW
 	/* Debugging check for stack overflow: is there less than 1KB free? */
 	{
@@ -101,6 +103,7 @@
 #endif
 	__do_IRQ(irq, regs);
 
+	msa_finish_irq(irq);
 	irq_exit();
 
 	return 1;
@@ -221,10 +224,18 @@
 		seq_printf(p, "%3d: ",i);
 #ifndef CONFIG_SMP
 		seq_printf(p, "%10u ", kstat_irqs(i));
+#ifdef CONFIG_MICROSTATE
+		seq_printf(p, "%10llu", msa_irq_time(0, i));
+#endif
 #else
 		for (j = 0; j < NR_CPUS; j++)
-			if (cpu_online(j))
+			if (cpu_online(j)) {
 				seq_printf(p, "%10u ", kstat_cpu(j).irqs[i]);
+#ifdef CONFIG_MICROSTATE
+				seq_printf(p, "%10llu", msa_irq_time(j, i));
+#endif
+			}
+
 #endif
 		seq_printf(p, " %14s", irq_desc[i].handler->typename);
 		seq_printf(p, " %s", action->name);
Microstate Accounting for 2.6.11, patch 6/6
Microstate accounting: Track time spent asleep while paging, in poll() or select(), or on a futex separately from other sleeps. fs/select.c |2 ++ kernel/futex.c |2 ++ mm/memory.c |6 +- Index: linux-2.6-ustate/mm/memory.c === --- linux-2.6-ustate.orig/mm/memory.c 2005-03-10 09:12:59.492564100 +1100 +++ linux-2.6-ustate/mm/memory.c2005-03-10 09:16:16.583875465 +1100 @@ -2079,6 +2079,7 @@ if (is_vm_hugetlb_page(vma)) return VM_FAULT_SIGBUS; /* mapping truncation does this. */ + msa_next_state(current, PAGING_SLEEP); /* * We need the page table lock to synchronize with kswapd * and the SMP-safe atomic PTE updates. @@ -2098,10 +2099,13 @@ if (!pte) goto oom; - return handle_pte_fault(mm, vma, address, write_access, pte, pmd); + int ret = handle_pte_fault(mm, vma, address, write_access, pte, pmd); + msa_next_state(current, MSA_UNKNOWN); + return ret; oom: spin_unlock(>page_table_lock); + msa_next_state(current, MSA_UNKNOWN); return VM_FAULT_OOM; } Index: linux-2.6-ustate/kernel/futex.c === --- linux-2.6-ustate.orig/kernel/futex.c2005-03-10 09:12:58.843154938 +1100 +++ linux-2.6-ustate/kernel/futex.c 2005-03-10 09:16:17.109262256 +1100 @@ -39,6 +39,7 @@ #include #include #include +#include #define FUTEX_HASHBITS (CONFIG_BASE_SMALL ? 4 : 8) @@ -571,6 +572,7 @@ * wakes us up. */ + msa_next_state(current, FUTEX_SLEEP); /* add_wait_queue is the barrier after __set_current_state. 
*/ __set_current_state(TASK_INTERRUPTIBLE); add_wait_queue(, ); Index: linux-2.6-ustate/fs/select.c === --- linux-2.6-ustate.orig/fs/select.c 2005-03-10 09:12:59.182996124 +1100 +++ linux-2.6-ustate/fs/select.c2005-03-10 09:16:16.843639194 +1100 @@ -256,6 +256,7 @@ retval = table.error; break; } + msa_next_state(current, POLL_SLEEP); __timeout = schedule_timeout(__timeout); } __set_current_state(TASK_RUNNING); @@ -447,6 +448,7 @@ count = wait->error; if (count) break; + msa_next_state(current, POLL_SLEEP); timeout = schedule_timeout(timeout); } __set_current_state(TASK_RUNNING); - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Microstate Accounting for 2.6.11, patch 5/6
Microstate accounting: Add the I386 system call.

 arch/i386/kernel/entry.S  |    2 +-
 include/asm-i386/unistd.h |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

Index: linux-2.6-ustate/arch/i386/kernel/entry.S
===================================================================
--- linux-2.6-ustate.orig/arch/i386/kernel/entry.S	2005-03-10 09:16:14.888575341 +1100
+++ linux-2.6-ustate/arch/i386/kernel/entry.S	2005-03-10 09:16:15.446188457 +1100
@@ -876,7 +876,7 @@
 	.long sys_mq_getsetattr
 	.long sys_ni_syscall	/* reserved for kexec */
 	.long sys_waitid
-	.long sys_ni_syscall	/* 285 */ /* available */
+	.long sys_msa		/* 285 */ /* available */
 	.long sys_add_key
 	.long sys_request_key
 	.long sys_keyctl
Index: linux-2.6-ustate/include/asm-i386/unistd.h
===================================================================
--- linux-2.6-ustate.orig/include/asm-i386/unistd.h	2005-03-10 09:13:00.813843194 +1100
+++ linux-2.6-ustate/include/asm-i386/unistd.h	2005-03-10 09:16:15.448141568 +1100
@@ -290,7 +290,7 @@
 #define __NR_mq_getsetattr	(__NR_mq_open+5)
 #define __NR_sys_kexec_load	283
 #define __NR_waitid		284
-/* #define __NR_sys_setaltroot	285 */
+#define __NR_sys_msa		285
 #define __NR_add_key		286
 #define __NR_request_key	287
 #define __NR_keyctl		288
Microstate Accounting for 2.6.11, patch 3/6
Microstate accounting: Provide I386-dependent MSA clocks, and Kconfig options. arch/i386/Kconfig | 39 ++- include/asm-i386/msa.h | 49 + 2 files changed, 87 insertions(+), 1 deletion(-) Signed-off-by: Peter Chubb <[EMAIL PROTECTED]> Index: linux-2.6-ustate/arch/i386/Kconfig === --- linux-2.6-ustate.orig/arch/i386/Kconfig 2005-03-11 09:59:38.773632446 +1100 +++ linux-2.6-ustate/arch/i386/Kconfig 2005-03-11 09:59:38.777538666 +1100 @@ -923,8 +923,45 @@ If unsure, say Y. Only embedded should say N here. -endmenu +config MICROSTATE + bool "Microstate accounting" + help + This option causes the kernel to keep very accurate track of +how long your threads spend on the runqueues, running, or asleep or +stopped. It will slow down your kernel. +Times are reported in /proc/pid/msa and through a new msa() +system call. + +choice + depends on MICROSTATE + prompt "Microstate timing source" + default MICROSTATE_TSC + +config MICROSTATE_PM + bool "Use Power-Management timer for microstate timings" + depends on X86_PM_TIMER + help +If your machine is ACPI enabled and uses power-management, then the +TSC runs at a variable rate, which will distort the +microstate measurements. This timer, although having +slightly more overhead, and a lower resolution (279 +nanoseconds or so) will always run at a constant rate. + +config MICROSTATE_TSC + bool "Use on-chip TSC for microstate timings" + depends on X86_TSC + help + If your machine's clock runs at constant rate, then this timer +gives you cycle precision in measureing times spent in microstates. + +config MICROSTATE_TOD + bool "Use time-of-day clock for microstate timings" + help + If none of the other timers are any good for you, this timer +will give you micro-second precision. 
+endchoice +endmenu menu "Power management options (ACPI, APM)" depends on !X86_VOYAGER Index: linux-2.6-ustate/include/asm-i386/msa.h === --- /dev/null 1970-01-01 00:00:00.0 + +++ linux-2.6-ustate/include/asm-i386/msa.h 2005-03-11 09:59:38.779491777 +1100 @@ -0,0 +1,49 @@ +/ + * asm-i386/msa.h + * + * Provide an architecture-specific clock. + */ + +#ifndef _ASM_I386_MSA_H +# define _ASM_I386_MSA_H + +# include + + +# if defined(CONFIG_MICROSTATE_TSC) +/* + * Use the processor's time-stamp counter as a timesource + */ +# include +# include + +# define MSA_NOW(now) rdtscll(now) + +extern unsigned long cpu_khz; +# define MSA_TO_NSEC(clk) ({ clk_t _x = ((clk) * 100ULL); do_div(_x, cpu_khz); _x; }) + +# elif defined(CONFIG_MICROSTATE_PM) +/* + * Use the system's monotonic clock as a timesource. + * This will only be enabled if the Power Management Timer is enabled. + */ +unsigned long long monotonic_clock(void); +# define MSA_NOW(now) do { now = monotonic_clock(); } while (0) +# define MSA_TO_NSEC(clk) (clk) + +# elif defined(CONFIG_MICROSTATE_TOD) +/* + * Fall back to gettimeofday. + * This one is incompatible with interrupt-time measurement on some processors. + */ +static inline void msa_now(clk_t *nsp) { + struct timeval tv; + do_gettimeofday(); + *nsp = tv.tv_sec * 100 + tv.tv_usec; +} +# define MSA_NOW(x) msa_now() +# define MSA_TO_NSEC(clk) ((clk) * 1000) +# endif + + +#endif /* _ASM_I386_MSA_H */ I386 - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
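The unit conversions behind the three timing sources can be checked with a few lines of standalone C. This is a sketch, not the header from the patch: the function names are made up, and the multiplier of 10^6 is an assumption based on dimensional analysis (cycles divided by a frequency in kHz gives milliseconds, so a factor of 10^6 is needed to reach nanoseconds). The constants in the archived macros appear to have lost digits, so they are not reproduced here.

```c
#include <stdint.h>

/*
 * Convert a cycle count to nanoseconds for a counter running at
 * cpu_khz kilohertz. Assumption: ns = cycles * 10^6 / kHz.
 * Note: the multiplication overflows 64 bits for very large cycle
 * counts (above roughly 1.8e13); a real implementation would need
 * to handle that.
 */
static uint64_t cycles_to_nsec(uint64_t cycles, uint64_t cpu_khz)
{
	return cycles * 1000000ULL / cpu_khz;
}

/*
 * Convert a time-of-day reading (seconds plus microseconds) to
 * nanoseconds: first to microseconds, then times 1000.
 */
static uint64_t tod_to_nsec(uint64_t sec, uint64_t usec)
{
	return (sec * 1000000ULL + usec) * 1000ULL;
}
```

As a sanity check, 3000 cycles on a 3 GHz clock (3000000 kHz) come out as 1000 ns, and one tick of a counter running at about 3580 kHz (the ACPI power-management timer runs at 3.579545 MHz) comes out as 279 ns, which matches the resolution quoted in the Kconfig help text.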
Microstate Accounting for 2.6.11, patch 2/6
Microstate Accounting: Add hooks into the scheduler to track state changes. Arrange for parent process's child times to be updated at process exit. kernel/sched.c |8 kernel/exit.c |3 +++ Index: linux-2.6-ustate/kernel/sched.c === --- linux-2.6-ustate.orig/kernel/sched.c2005-03-11 09:59:31.109628035 +1100 +++ linux-2.6-ustate/kernel/sched.c 2005-03-11 09:59:31.116463921 +1100 @@ -635,6 +635,7 @@ */ static inline void __activate_task(task_t *p, runqueue_t *rq) { + msa_set_timer(p, ONACTIVEQUEUE); enqueue_task(p, rq->active); rq->nr_running++; } @@ -1238,6 +1239,7 @@ if (unlikely(!current->array)) __activate_task(p, rq); else { + msa_set_timer(p, ONACTIVEQUEUE); p->prio = current->prio; list_add_tail(>run_list, >run_list); p->array = current->array; @@ -2422,6 +2424,7 @@ if (!rq->expired_timestamp) rq->expired_timestamp = jiffies; if (!TASK_INTERACTIVE(p) || EXPIRED_STARVING(rq)) { + msa_next_state(p, ONEXPIREDQUEUE); enqueue_task(p, rq->expired); if (p->static_prio < rq->best_expired_prio) rq->best_expired_prio = p->static_prio; @@ -2733,6 +2736,7 @@ array = rq->active; rq->expired_timestamp = 0; rq->best_expired_prio = MAX_PRIO; + msa_flip_expired(prev); } else schedstat_inc(rq, sched_noswitch); @@ -2773,6 +2777,8 @@ rq->curr = next; ++*switch_count; + msa_switch(prev, next); + prepare_arch_switch(rq, next); prev = context_switch(rq, prev, next); barrier(); @@ -3693,6 +3699,8 @@ */ if (rt_task(current)) target = rq->active; + else + msa_next_state(current, ONEXPIREDQUEUE); if (current->array->nr_active == 1) { schedstat_inc(rq, yld_act_empty); Index: linux-2.6-ustate/kernel/exit.c === --- linux-2.6-ustate.orig/kernel/exit.c 2005-03-11 09:59:36.360564796 +1100 +++ linux-2.6-ustate/kernel/exit.c 2005-03-11 09:59:36.364471017 +1100 @@ -93,6 +93,9 @@ } sched_exit(p); + + msa_update_parent(p->parent, p); + write_unlock_irq(_lock); spin_unlock(>proc_lock); proc_pid_flush(proc_dentry); - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of 
a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Microstate Accounting for 2.6.11, patch 3/
Microstate Accounting: Track time in system calls and interrupts, i386 code. Signed-off-by; Peter Chubb <[EMAIL PROTECTED]> arch/i386/kernel/entry.S | 16 arch/i386/kernel/irq.c | 13 - Index: linux-2.6-ustate/arch/i386/kernel/entry.S === --- linux-2.6-ustate.orig/arch/i386/kernel/entry.S 2005-03-10 09:13:01.448604031 +1100 +++ linux-2.6-ustate/arch/i386/kernel/entry.S 2005-03-10 09:16:14.888575341 +1100 @@ -222,10 +222,18 @@ /* Note, _TIF_SECCOMP is bit number 8, and so it needs testw and not testb */ testw $(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT|_TIF_SECCOMP),TI_flags(%ebp) jnz syscall_trace_entry +#ifdef CONFIG_MICROSTATE + pushl %eax + call msa_start_syscall + popl%eax +#endif cmpl $(nr_syscalls), %eax jae syscall_badsys call *sys_call_table(,%eax,4) movl %eax,EAX(%esp) +#ifdef CONFIG_MICROSTATE + call msa_end_syscall +#endif cli movl TI_flags(%ebp), %ecx testw $_TIF_ALLWORK_MASK, %cx @@ -250,9 +258,17 @@ cmpl $(nr_syscalls), %eax jae syscall_badsys syscall_call: +#ifdef CONFIG_MICROSTATE + pushl %eax + call msa_start_syscall + popl%eax +#endif call *sys_call_table(,%eax,4) movl %eax,EAX(%esp) # store the return value syscall_exit: +#ifdef CONFIG_MICROSTATE + call msa_end_syscall +#endif cli # make sure we don't miss an interrupt # setting need_resched or sigpending # between sampling and the iret Index: linux-2.6-ustate/arch/i386/kernel/irq.c === --- linux-2.6-ustate.orig/arch/i386/kernel/irq.c2005-03-10 09:13:00.115606274 +1100 +++ linux-2.6-ustate/arch/i386/kernel/irq.c 2005-03-10 09:16:16.032121680 +1100 @@ -55,6 +55,8 @@ #endif irq_enter(); + msa_start_irq(irq); + #ifdef CONFIG_DEBUG_STACKOVERFLOW /* Debugging check for stack overflow: is there less than 1KB free? 
*/ { @@ -101,6 +103,7 @@ #endif __do_IRQ(irq, regs); + msa_finish_irq(irq); irq_exit(); return 1; @@ -221,10 +224,18 @@ seq_printf(p, "%3d: ",i); #ifndef CONFIG_SMP seq_printf(p, "%10u ", kstat_irqs(i)); +#ifdef CONFIG_MICROSTATE + seq_printf(p, "%10llu", msa_irq_time(0, i)); +#endif #else for (j = 0; j < NR_CPUS; j++) - if (cpu_online(j)) + if (cpu_online(j)) { seq_printf(p, "%10u ", kstat_cpu(j).irqs[i]); +#ifdef CONFIG_MICROSTATE + seq_printf(p, "%10llu", msa_irq_time(j, i)); +#endif + } + #endif seq_printf(p, " %14s", irq_desc[i].handler->typename); seq_printf(p, " %s", action->name); - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Microstate Accounting for 2.6.11
Microstate Accounting - Timing data on threads at present is pretty crude: when the timer interrupt occurs, a tick is added to either system time or user time for the currently running thread. Thus in an unpacthed kernel one can distinguish three timed states: On-cpu in userspace, on-cpu in system space, and not running. The actual number of states is much larger. A thread can be on a runqueue or the expired queue (i.e., ready to run but not running), sleeping on a semaphore or on a futex, having its time stolen to service an interrupt, etc., etc. This patch adds timers per-state to each struct task_struct, so that time in all these states can be tracked. This patch contains the core code do the timing, and to initialise the timers. Subsequent patches enable the code (by adding Kconfig options) and add hooks to track state changes. Signed-off-by: Peter Chubb <[EMAIL PROTECTED]> include/asm-generic/msa.h | 21 ++ include/linux/msa-kernel.h | 99 + include/linux/msa.h| 46 include/linux/sched.h |4 kernel/Makefile|2 kernel/fork.c |2 kernel/msa.c | 472 + 7 files changed, 645 insertions(+), 1 deletion(-) Index: linux-2.6-ustate/kernel/msa.c === --- /dev/null 1970-01-01 00:00:00.0 + +++ linux-2.6-ustate/kernel/msa.c 2005-03-11 09:58:20.574030768 +1100 @@ -0,0 +1,472 @@ +/* + * Microstate accounting. + * Try to account for various states much more accurately than + * the normal code does. + * + * Copyright (c) Peter Chubb 2005 + * UNSW and National ICT Australia + * This code is released under the Gnu Public Licence, version 2. + */ + + +#include +#include +#include +#include +#ifdef CONFIG_MICROSTATE +#include +#include +#include +#include + +#include + +/* + * Track time spend in interrupt handlers. 
+ */ +struct msa_irq { + clk_t times; + clk_t last_entered; +}; + +/* + * When the scheduler last swapped active and expired queues + */ +static DEFINE_PER_CPU(clk_t, queueflip_time); + +/* + * Time spent in interrupt handlers + */ +static DEFINE_PER_CPU(struct msa_irq[NR_IRQS+1], msa_irq); + + +/** + * msa_switch: Update microstate timers when switching from one task to another. + * @prev, @next: The prev task is coming off the processor; + *the new task is about to run on the processor. + * + * Update the times in both prev and next. It may be necessary to infer the + * next state for each task. + * + */ +void +msa_switch(struct task_struct *prev, struct task_struct *next) +{ + struct microstates *msprev = >microstates; + struct microstates *msnext = >microstates; + clk_t now; + enum thread_state next_state; + int interrupted = msprev->cur_state == INTERRUPTED; + + preempt_disable(); + + MSA_NOW(now); + + if (msprev->flags & QUEUE_FLIPPED) { + __get_cpu_var(queueflip_time) = now; + msprev->flags &= ~QUEUE_FLIPPED; + } + + /* +* If the queues have been flipped, +* update the state as of the last flip time. 
+*/ + if (msnext->cur_state == ONEXPIREDQUEUE) { + clk_t qfp = per_cpu(queueflip_time, msnext->lastqueued); + msnext->cur_state = ONACTIVEQUEUE; + msnext->timers[ONEXPIREDQUEUE] += qfp - msnext->last_change; + msnext->last_change = qfp; + } + + msprev->timers[msprev->cur_state] += now - msprev->last_change; + msnext->timers[msnext->cur_state] += now - msnext->last_change; + + /* Update states */ + switch (msprev->next_state) { + case MSA_UNKNOWN: + /* +* Infer from actual state +*/ + switch (prev->state) { + case TASK_INTERRUPTIBLE: + next_state = INTERRUPTIBLE_SLEEP; + break; + + case TASK_UNINTERRUPTIBLE: + next_state = UNINTERRUPTIBLE_SLEEP; + break; + + case TASK_STOPPED: + next_state = STOPPED; + break; + + case EXIT_DEAD: + case EXIT_ZOMBIE: + next_state = ZOMBIE; + break; + + case TASK_RUNNING: + next_state = ONACTIVEQUEUE; + break; + + default: + next_state = MSA_UNKNOWN; + break; + + } + break; + + case PAGING_SLEEP: /* + * Sleep states are PAGING_SLEEP; + * others inferred fro
User mode drivers: part 2: PCI device handling (patch 2/2 for 2.6.11)
User-level drivers: Add system calls for I386 and IA64.

Signed-Off-By: Peter Chubb <[EMAIL PROTECTED]>

#
#  arch/i386/kernel/entry.S  |    4
#  arch/ia64/kernel/entry.S  |    8
#  include/asm-i386/unistd.h |    6 +-
#  include/asm-ia64/unistd.h |    4
#  4 files changed, 17 insertions(+), 5 deletions(-)
#
Index: linux-2.6.11-usrdrivers/arch/ia64/kernel/entry.S
===================================================================
--- linux-2.6.11-usrdrivers.orig/arch/ia64/kernel/entry.S	2005-03-11 13:59:28.940744950 +1100
+++ linux-2.6.11-usrdrivers/arch/ia64/kernel/entry.S	2005-03-11 13:59:41.236542676 +1100
@@ -1577,10 +1577,10 @@
 	data8 sys_add_key
 	data8 sys_request_key
 	data8 sys_keyctl
-	data8 sys_ni_syscall
-	data8 sys_ni_syscall			// 1275
-	data8 sys_ni_syscall
-	data8 sys_ni_syscall
+	data8 sys_usr_pci_open
+	data8 sys_usr_pci_mmap			// 1275
+	data8 sys_usr_pci_munmap
+	data8 sys_usr_pci_get_consistent
 	data8 sys_ni_syscall
 	data8 sys_ni_syscall
Index: linux-2.6.11-usrdrivers/include/asm-i386/unistd.h
===================================================================
--- linux-2.6.11-usrdrivers.orig/include/asm-i386/unistd.h	2005-03-11 13:59:28.942698059 +1100
+++ linux-2.6.11-usrdrivers/include/asm-i386/unistd.h	2005-03-11 13:59:41.245331667 +1100
@@ -294,8 +294,12 @@
 #define __NR_add_key		286
 #define __NR_request_key	287
 #define __NR_keyctl		288
+#define __NR_usr_pci_open	289
+#define __NR_usr_pci_mmap	(__NR_usr_pci_open+1)
+#define __NR_usr_pci_munmap	(__NR_usr_pci_open+2)
+#define __NR_usr_pci_get_consistent	(__NR_usr_pci_open+3)
 
-#define NR_syscalls 289
+#define NR_syscalls 293
 
 /*
  * user-visible error numbers are in the range -1 - -128: see
Index: linux-2.6.11-usrdrivers/include/asm-ia64/unistd.h
===================================================================
--- linux-2.6.11-usrdrivers.orig/include/asm-ia64/unistd.h	2005-03-11 13:59:28.942698059 +1100
+++ linux-2.6.11-usrdrivers/include/asm-ia64/unistd.h	2005-03-11 13:59:41.247284776 +1100
@@ -263,6 +263,10 @@
 #define __NR_add_key			1271
 #define __NR_request_key		1272
 #define __NR_keyctl			1273
+#define __NR_usr_pci_open		1274
+#define __NR_usr_pci_mmap		1275
+#define __NR_usr_pci_unmap		1276
+#define __NR_usr_pci_get_consistent	1277
 
 #ifdef __KERNEL__
 
Index: linux-2.6.11-usrdrivers/arch/i386/kernel/entry.S
===================================================================
--- linux-2.6.11-usrdrivers.orig/arch/i386/kernel/entry.S	2005-03-11 13:59:28.941721505 +1100
+++ linux-2.6.11-usrdrivers/arch/i386/kernel/entry.S	2005-03-11 13:59:41.248261330 +1100
@@ -864,5 +864,9 @@
 	.long sys_add_key
 	.long sys_request_key
 	.long sys_keyctl
+	.long sys_usr_pci_open
+	.long sys_usr_pci_mmap		/* 290 */
+	.long sys_usr_pci_munmap
+	.long sys_usr_pci_get_consistent
 
 syscall_table_size=(.-sys_call_table)
User mode drivers: part 2: PCI device handling (patch 1/2 for 2.6.11)
USER LEVEL DRIVERS: enable PCI device drivers at user space.

This patch adds the capability for suitably privileged user-level
processes to enable a PCI device, and set up DMA for it. A subsequent
patch hooks up the actual system calls.

There are four new system calls:

long usr_pci_open(int bus, int slot, int function, __u64 dma_mask);

	Returns a file descriptor for the PCI device described by
	bus,slot,function. It also enables the device, and sets it up
	as a bus-mastering DMA device, with the specified dma mask.

	Error codes are:
	ENOMEM: insufficient kernel memory to fulfil your request
	ENOENT: the specified device doesn't exist, or is otherwise
		invisible to Linux.
	EBUSY:  Another driver has claimed the device
	EIO:    The specified dma mask is invalid for this device.
	ENFILE: too many open files

long usr_pci_get_consistent(int fd, size_t size, void **vaddrp,
			    unsigned long *dmaaddrp)

	Call pci_alloc_consistent() to get size worth of pci consistent
	memory (currently an error if size != PAGESIZE); map the
	allocated memory into the user's address space; return the
	virtual user address in *vaddrp, and the bus address in
	*dmaaddrp.

	ERRORS:
	EINVAL: the file descriptor was not one obtained from
		usr_pci_open(), or size != PAGESIZE
	ENOMEM: insufficient appropriate memory or insufficient free
		virtual address space in the user program.
	EFAULT: vaddrp or dmaaddrp didn't point to writeable memory.

	The mapping obtained can be cleaned up with munmap().

long usr_pci_mmap(int fd, struct mapping_info *mp)

	Map some memory for DMA to/from the device represented by fd,
	which was obtained from usr_pci_open().

	struct mapping_info contains:
	    void *virtaddr -- the virtual address to dma to
	    int size -- how many bytes to set up
	    struct usr_pci_sglist *sglist -- a pointer to a scatterlist
	    int nents -- how many entries in the scatterlist
	    enum dma_data_direction direction -- which way the dma is
		going to happen.

	The scatterlist should be sized at least size/PAGESIZE + 2.
	usr_pci_mmap() will call pci_map_sg() on the virtual region,
	then copy the resulting scatterlist into *sglist. The nents
	field will be updated with the actual number of scatterlist
	entries filled in.

	Failure codes are:
	EINVAL: the fd wasn't obtained from usr_pci_open, or direction
		wasn't one of DMA_TO_DEVICE, DMA_FROM_DEVICE or
		DMA_BIDIRECTIONAL, or the size of the scatterlist is
		insufficient to map the region.
	EFAULT: mp was a bad pointer, or the region of memory spanned
		by (virtaddr, virtaddr + size) was not all mapped.
	ENOMEM: insufficient appropriate memory

long usr_pci_munmap(int fd, struct mapping_info *mp)

	Unmap a dma region mapped by usr_pci_mmap(). The struct
	mapping_info is the same one used in usr_pci_mmap().

	Error codes are:
	EINVAL: the fd wasn't obtained from usr_pci_open, or the
		struct mapping_info was never mapped for this device

Signed-off-by: Peter Chubb <[EMAIL PROTECTED]>

#
#  drivers/Makefile       |    3
#  drivers/pci/Kconfig    |    6
#  drivers/usr/Makefile   |    2
#  drivers/usr/sys.c      |  952 +
#  include/linux/usrdrv.h |   63 +++
#  5 files changed, 1026 insertions(+)
#
Index: linux-2.6.11-usrdrivers/drivers/Makefile
===================================================================
--- linux-2.6.11-usrdrivers.orig/drivers/Makefile	2005-03-11 12:25:29.169139978 +1100
+++ linux-2.6.11-usrdrivers/drivers/Makefile	2005-03-11 12:25:41.159270471 +1100
@@ -13,6 +13,9 @@
 # was used and do nothing if so
 obj-$(CONFIG_PNP)		+= pnp/
 
+# User level device drivers
+obj-$(CONFIG_USRDEV)		+= usr/
+
 # char/ comes before serial/ etc so that the VT console is the boot-time
 # default.
 obj-y				+= char/
Index: linux-2.6.11-usrdrivers/drivers/usr/Makefile
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.11-usrdrivers/drivers/usr/Makefile	2005-03-11 12:25:41.160247026 +1100
@@ -0,0 +1,2 @@
+obj-y += sys.o
+obj-$(CONFIG_USRBLKDEV) += blkdev.o
Index: linux-2.6.11-usrdrivers/drivers/usr/sys.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.11-usrdrivers/drivers/usr/sys.c	2005-03-11 14:15:59.897394833 +1100
@@ -0,0 +1,952 @@
+/*
+ * Expose PCI-DMA interface to user mode.
+ *
+ * Copyright 2005 Peter Chubb
+ * National ICT
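Going back to the usr_pci_mmap() description in this mail, the following is a minimal user-space sketch of how a caller might fill in a struct mapping_info and size the scatterlist as size/PAGESIZE + 2 before making the call. The field list of struct mapping_info is taken from the description; the layout of struct usr_pci_sglist, the helper names, and the use of sysconf() for the page size are assumptions made for illustration and are not taken from include/linux/usrdrv.h, which is not shown in this archive.

```c
#include <stddef.h>
#include <stdlib.h>
#include <unistd.h>

/* Same names and values as the kernel's enum dma_data_direction. */
enum dma_data_direction {
	DMA_BIDIRECTIONAL = 0,
	DMA_TO_DEVICE = 1,
	DMA_FROM_DEVICE = 2,
};

/* Hypothetical layout; the real definition is in the patch's usrdrv.h. */
struct usr_pci_sglist {
	unsigned long dma_address;
	unsigned int length;
};

/* Fields as listed in the patch description. */
struct mapping_info {
	void *virtaddr;
	int size;
	struct usr_pci_sglist *sglist;
	int nents;
	enum dma_data_direction direction;
};

/* Minimum scatterlist length the description asks for. */
static int sglist_entries_needed(size_t size, size_t pagesize)
{
	return (int)(size / pagesize) + 2;
}

/*
 * Fill in *mp for a buffer of `size` bytes at `buf`.
 * Returns 0 on success, -1 if the arguments are bad or allocation fails.
 * The caller frees mp->sglist when done.
 */
static int mapping_info_prepare(struct mapping_info *mp, void *buf, int size,
				enum dma_data_direction dir)
{
	long pagesize = sysconf(_SC_PAGESIZE);
	int nents;

	if (pagesize <= 0 || size < 0)
		return -1;

	nents = sglist_entries_needed((size_t)size, (size_t)pagesize);
	mp->sglist = calloc((size_t)nents, sizeof *mp->sglist);
	if (mp->sglist == NULL)
		return -1;

	mp->virtaddr = buf;
	mp->size = size;
	mp->nents = nents;
	mp->direction = dir;
	return 0;
}
```

On a kernel carrying this patch, the prepared structure would then be handed to usr_pci_mmap(fd, &mi), and nents would be read back to find out how many entries were actually filled in.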
User mode drivers: part 1, interrupt handling (patch for 2.6.11)
As many of you will be aware, we've been working on infrastructure for
user-mode PCI and other drivers. The first step is to be able to
handle interrupts from user space. Subsequent patches add
infrastructure for setting up DMA for PCI devices.

The user-level interrupt code doesn't depend on the other patches, and
is probably the most mature of this patchset.

This patch adds a new file to /proc/irq/<nnn>/ called irq. Suitably
privileged processes can open this file. Reading the file returns the
number of interrupts (if any) that have occurred since the last read.
If the file is opened in blocking mode, reading it blocks until
an interrupt occurs. poll(2) and select(2) work as one would expect, to
allow interrupts to be one of many events to wait for.
(If you don't like the file, one could have a special system call to
return the file descriptor).

Interrupts are usually masked; while a thread is in poll(2) or read(2) on the
file they are unmasked.

All architectures that use CONFIG_GENERIC_HARDIRQ are supported by
this patch.

A low latency user level interrupt handler would do something like
this, on a CONFIG_PREEMPT kernel:

	int irqfd;
	int n_ints;
	struct sched_param sched_param;

	irqfd = open("/proc/irq/513/irq", O_RDONLY);
	mlockall(MCL_CURRENT | MCL_FUTURE);
	sched_param.sched_priority = sched_get_priority_max(SCHED_FIFO) - 10;
	sched_setscheduler(0, SCHED_FIFO, &sched_param);

	while (read(irqfd, &n_ints, sizeof n_ints) == sizeof n_ints) {
		... talk to device to handle interrupt
	}

If you don't care about latency, then forget about the mlockall() and
setting the priority, and you don't need CONFIG_PREEMPT.
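A slightly fuller sketch of the read step, for anyone who wants to try the interface: the helper below wraps one blocking read of an interrupt count and retries if a signal interrupts it. The name wait_for_irqs() is made up for this example and is not part of the patch. Since /proc/irq/<nnn>/irq only exists with the patch applied, the helper takes any file descriptor that delivers one int per read.

```c
#include <errno.h>
#include <unistd.h>

/*
 * Block until fd delivers an interrupt count.
 * Returns the number of interrupts reported by this read,
 * or -1 on error, end of file, or a short read.
 */
static int wait_for_irqs(int fd)
{
	int n_ints;
	ssize_t r;

	/* Retry if a signal arrived before any data did. */
	do {
		r = read(fd, &n_ints, sizeof n_ints);
	} while (r < 0 && errno == EINTR);

	if (r != (ssize_t)sizeof n_ints)
		return -1;

	return n_ints;
}
```

With the patch applied, one would open /proc/irq/513/irq read-only and call wait_for_irqs() in a loop, talking to the device each time it returns a positive count.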
Signed-off-by: Peter Chubb <[EMAIL PROTECTED]>

 kernel/irq/proc.c |  163 ++
 1 files changed, 153 insertions(+), 10 deletions(-)

Index: linux-2.6.11-usrdrivers/kernel/irq/proc.c
===================================================================
--- linux-2.6.11-usrdrivers.orig/kernel/irq/proc.c	2005-03-11 10:30:57.875619102 +1100
+++ linux-2.6.11-usrdrivers/kernel/irq/proc.c	2005-03-11 10:45:07.146928168 +1100
@@ -9,6 +9,8 @@
 #include <linux/irq.h>
 #include <linux/proc_fs.h>
 #include <linux/interrupt.h>
+#include <linux/poll.h>
+#include "internals.h"
 
 static struct proc_dir_entry *root_irq_dir, *irq_dir[NR_IRQS];
 
@@ -90,27 +92,168 @@
 	action->dir = proc_mkdir(name, irq_dir[irq]);
 }
 
+struct irq_proc {
+	unsigned long irq;
+	wait_queue_head_t q;
+	atomic_t count;
+	char devname[TASK_COMM_LEN];
+};
+
+static irqreturn_t irq_proc_irq_handler(int irq, void *vidp, struct pt_regs *regs)
+{
+	struct irq_proc *idp = (struct irq_proc *)vidp;
+
+	BUG_ON(idp->irq != irq);
+	disable_irq_nosync(irq);
+	atomic_inc(&idp->count);
+	wake_up(&idp->q);
+	return IRQ_HANDLED;
+}
+
+
+/*
+ * Signal to userspace an interrupt has occurred.
+ */
+static ssize_t irq_proc_read(struct file *filp, char __user *bufp, size_t len, loff_t *ppos)
+{
+	struct irq_proc *ip = (struct irq_proc *)filp->private_data;
+	irq_desc_t *idp = irq_desc + ip->irq;
+	int pending;
+
+	DEFINE_WAIT(wait);
+
+	if (len < sizeof(int))
+		return -EINVAL;
+
+	pending = atomic_read(&ip->count);
+	if (pending == 0) {
+		if (idp->status & IRQ_DISABLED)
+			enable_irq(ip->irq);
+		if (filp->f_flags & O_NONBLOCK)
+			return -EWOULDBLOCK;
+	}
+
+	while (pending == 0) {
+		prepare_to_wait(&ip->q, &wait, TASK_INTERRUPTIBLE);
+		pending = atomic_read(&ip->count);
+		if (pending == 0)
+			schedule();
+		finish_wait(&ip->q, &wait);
+		if (signal_pending(current))
+			return -ERESTARTSYS;
+	}
+
+	if (copy_to_user(bufp, &pending, sizeof pending))
+		return -EFAULT;
+
+	*ppos += sizeof pending;
+
+	atomic_sub(pending, &ip->count);
+	return sizeof pending;
+}
+
+
+static int irq_proc_open(struct inode *inop, struct file *filp)
+{
+	struct irq_proc *ip;
+	struct proc_dir_entry *ent = PDE(inop);
+	int error;
+
+	ip = kmalloc(sizeof *ip, GFP_KERNEL);
+	if (ip == NULL)
+		return -ENOMEM;
+
+	memset(ip, 0, sizeof(*ip));
+	strcpy(ip->devname, current->comm);
+	init_waitqueue_head(&ip->q);
+	atomic_set(&ip->count, 0);
+	ip->irq = (unsigned long)ent->data;
+
+	error = request_irq(ip->irq,
+			    irq_proc_irq_handler,
+			    SA_INTERRUPT,
+			    ip->devname,
+			    ip);
+	if (error < 0) {
+		kfree(ip);
+		return error;
+	}
+	filp->private_data = (void *)ip;
+
+	return 0;
+}
+
+static int irq_proc_release(struct inode *inop, struct file *filp
Re: binary drivers and development
>>>>> "John" == John Richard Moser <[EMAIL PROTECTED]> writes:

John> I've done more thought, here's a small list of advantages on
John> using binary drivers, specifically considering UDI. You can
John> consider a different implementation for binary drivers as well,
John> with most of the same advantages.

Almost all these advantages are also present for user-mode
drivers... and getting drivers out of the kernel, where possible, is a
much better approach IMHO than trying to maintain a leaky in-kernel
interface.

The problem with in-kernel interfaces, even if set in concrete, is
that any binary driver can go outside the interface --- there's no
encapsulation --- and so break when the kernel changes.

Peter C
Microstate Accounting for 2.6.11
Microstate Accounting

Timing data on threads at present is pretty crude: when the timer
interrupt occurs, a tick is added to either system time or user time
for the currently running thread. Thus in an unpatched kernel one can
distinguish three timed states: on-cpu in userspace, on-cpu in system
space, and not running.

The actual number of states is much larger. A thread can be on a
runqueue or the expired queue (i.e., ready to run but not running),
sleeping on a semaphore or on a futex, having its time stolen to
service an interrupt, etc., etc.

This patch adds timers per-state to each struct task_struct, so that
time in all these states can be tracked. This patch contains the core
code to do the timing, and to initialise the timers. Subsequent patches
enable the code (by adding Kconfig options) and add hooks to track
state changes.

Signed-off-by: Peter Chubb <[EMAIL PROTECTED]>

 include/asm-generic/msa.h  |   21 ++
 include/linux/msa-kernel.h |   99 +
 include/linux/msa.h        |   46
 include/linux/sched.h      |    4
 kernel/Makefile            |    2
 kernel/fork.c              |    2
 kernel/msa.c               |  472 +
 7 files changed, 645 insertions(+), 1 deletion(-)

Index: linux-2.6-ustate/kernel/msa.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-ustate/kernel/msa.c	2005-03-11 09:58:20.574030768 +1100
@@ -0,0 +1,472 @@
+/*
+ * Microstate accounting.
+ * Try to account for various states much more accurately than
+ * the normal code does.
+ *
+ * Copyright (c) Peter Chubb 2005
+ *  UNSW and National ICT Australia
+ * This code is released under the Gnu Public Licence, version 2.
+ */
+
+
+#include <linux/config.h>
+#include <linux/types.h>
+#include <linux/errno.h>
+#include <linux/linkage.h>
+#ifdef CONFIG_MICROSTATE
+#include <linux/irq.h>
+#include <linux/hardirq.h>
+#include <linux/sched.h>
+#include <linux/jiffies.h>
+
+#include <asm/uaccess.h>
+
+/*
+ * Track time spent in interrupt handlers.
+ */
+struct msa_irq {
+	clk_t times;
+	clk_t last_entered;
+};
+
+/*
+ * When the scheduler last swapped active and expired queues
+ */
+static DEFINE_PER_CPU(clk_t, queueflip_time);
+
+/*
+ * Time spent in interrupt handlers
+ */
+static DEFINE_PER_CPU(struct msa_irq[NR_IRQS+1], msa_irq);
+
+
+/**
+ * msa_switch: Update microstate timers when switching from one task to another.
+ * @prev, @next:  The prev task is coming off the processor;
+ *                the new task is about to run on the processor.
+ *
+ * Update the times in both prev and next.  It may be necessary to infer the
+ * next state for each task.
+ *
+ */
+void
+msa_switch(struct task_struct *prev, struct task_struct *next)
+{
+	struct microstates *msprev = &prev->microstates;
+	struct microstates *msnext = &next->microstates;
+	clk_t now;
+	enum thread_state next_state;
+	int interrupted = msprev->cur_state == INTERRUPTED;
+
+	preempt_disable();
+
+	MSA_NOW(now);
+
+	if (msprev->flags & QUEUE_FLIPPED) {
+		__get_cpu_var(queueflip_time) = now;
+		msprev->flags &= ~QUEUE_FLIPPED;
+	}
+
+	/*
+	 * If the queues have been flipped,
+	 * update the state as of the last flip time.
+	 */
+	if (msnext->cur_state == ONEXPIREDQUEUE) {
+		clk_t qfp = per_cpu(queueflip_time, msnext->lastqueued);
+		msnext->cur_state = ONACTIVEQUEUE;
+		msnext->timers[ONEXPIREDQUEUE] += qfp - msnext->last_change;
+		msnext->last_change = qfp;
+	}
+
+	msprev->timers[msprev->cur_state] += now - msprev->last_change;
+	msnext->timers[msnext->cur_state] += now - msnext->last_change;
+
+	/* Update states */
+	switch (msprev->next_state) {
+	case MSA_UNKNOWN:
+		/*
+		 * Infer from actual state
+		 */
+		switch (prev->state) {
+		case TASK_INTERRUPTIBLE:
+			next_state = INTERRUPTIBLE_SLEEP;
+			break;
+
+		case TASK_UNINTERRUPTIBLE:
+			next_state = UNINTERRUPTIBLE_SLEEP;
+			break;
+
+		case TASK_STOPPED:
+			next_state = STOPPED;
+			break;
+
+		case EXIT_DEAD:
+		case EXIT_ZOMBIE:
+			next_state = ZOMBIE;
+			break;
+
+		case TASK_RUNNING:
+			next_state = ONACTIVEQUEUE;
+			break;
+
+		default:
+			next_state = MSA_UNKNOWN;
+			break;
+
+		}
+		break;
+
+	case PAGING_SLEEP: /*
+		 * Sleep states are PAGING_SLEEP
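The accounting rule at the heart of msa_switch() is simple: on every state change, the time since the last change is charged to the state being left. Here is a minimal user-space sketch of just that rule, for illustration only. The names sketch_init() and sketch_change_state() are made up, the state list is cut down (ONCPU_USER is a hypothetical stand-in for the on-cpu states, while ONACTIVEQUEUE and INTERRUPTIBLE_SLEEP are taken from the patch), and clk_t is assumed to be a 64-bit counter.

```c
#include <stdint.h>

typedef uint64_t clk_t;		/* assumed width of the MSA clock type */

enum thread_state {
	ONCPU_USER,		/* hypothetical name for this sketch */
	ONACTIVEQUEUE,
	INTERRUPTIBLE_SLEEP,
	NR_SKETCH_STATES
};

struct microstates {
	enum thread_state cur_state;
	clk_t last_change;
	clk_t timers[NR_SKETCH_STATES];
};

/* Start accounting in `state` at time `now`, with all timers zeroed. */
static void sketch_init(struct microstates *ms, enum thread_state state, clk_t now)
{
	int i;

	for (i = 0; i < NR_SKETCH_STATES; i++)
		ms->timers[i] = 0;
	ms->cur_state = state;
	ms->last_change = now;
}

/* Charge the elapsed time to the state being left, then enter new_state. */
static void sketch_change_state(struct microstates *ms,
				enum thread_state new_state, clk_t now)
{
	ms->timers[ms->cur_state] += now - ms->last_change;
	ms->cur_state = new_state;
	ms->last_change = now;
}
```

The kernel code has more to do than this (inferring the next state, and back-dating the move from the expired queue to the active queue to the last queue flip), but each of those cases ends in the same add-then-restamp step.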
Microstate Accounting for 2.6.11, patch 3/
Microstate Accounting: Track time in system calls and interrupts, i386 code.

Signed-off-by: Peter Chubb <[EMAIL PROTECTED]>

 arch/i386/kernel/entry.S |   16
 arch/i386/kernel/irq.c   |   13 -

Index: linux-2.6-ustate/arch/i386/kernel/entry.S
===================================================================
--- linux-2.6-ustate.orig/arch/i386/kernel/entry.S	2005-03-10 09:13:01.448604031 +1100
+++ linux-2.6-ustate/arch/i386/kernel/entry.S	2005-03-10 09:16:14.888575341 +1100
@@ -222,10 +222,18 @@
 	/* Note, _TIF_SECCOMP is bit number 8, and so it needs testw and not testb */
 	testw $(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT|_TIF_SECCOMP),TI_flags(%ebp)
 	jnz syscall_trace_entry
+#ifdef CONFIG_MICROSTATE
+	pushl %eax
+	call msa_start_syscall
+	popl %eax
+#endif
 	cmpl $(nr_syscalls), %eax
 	jae syscall_badsys
 	call *sys_call_table(,%eax,4)
 	movl %eax,EAX(%esp)
+#ifdef CONFIG_MICROSTATE
+	call msa_end_syscall
+#endif
 	cli
 	movl TI_flags(%ebp), %ecx
 	testw $_TIF_ALLWORK_MASK, %cx
@@ -250,9 +258,17 @@
 	cmpl $(nr_syscalls), %eax
 	jae syscall_badsys
 syscall_call:
+#ifdef CONFIG_MICROSTATE
+	pushl %eax
+	call msa_start_syscall
+	popl %eax
+#endif
 	call *sys_call_table(,%eax,4)
 	movl %eax,EAX(%esp)		# store the return value
 syscall_exit:
+#ifdef CONFIG_MICROSTATE
+	call msa_end_syscall
+#endif
 	cli				# make sure we don't miss an interrupt
 					# setting need_resched or sigpending
 					# between sampling and the iret
Index: linux-2.6-ustate/arch/i386/kernel/irq.c
===================================================================
--- linux-2.6-ustate.orig/arch/i386/kernel/irq.c	2005-03-10 09:13:00.115606274 +1100
+++ linux-2.6-ustate/arch/i386/kernel/irq.c	2005-03-10 09:16:16.032121680 +1100
@@ -55,6 +55,8 @@
 #endif
 
 	irq_enter();
+	msa_start_irq(irq);
+
 #ifdef CONFIG_DEBUG_STACKOVERFLOW
 	/* Debugging check for stack overflow: is there less than 1KB free? */
 	{
@@ -101,6 +103,7 @@
 #endif
 		__do_IRQ(irq, regs);
 
+	msa_finish_irq(irq);
 	irq_exit();
 
 	return 1;
@@ -221,10 +224,18 @@
 		seq_printf(p, "%3d: ",i);
 #ifndef CONFIG_SMP
 		seq_printf(p, "%10u ", kstat_irqs(i));
+#ifdef CONFIG_MICROSTATE
+		seq_printf(p, "%10llu", msa_irq_time(0, i));
+#endif
 #else
 		for (j = 0; j < NR_CPUS; j++)
-			if (cpu_online(j))
+			if (cpu_online(j)) {
 				seq_printf(p, "%10u ", kstat_cpu(j).irqs[i]);
+#ifdef CONFIG_MICROSTATE
+				seq_printf(p, "%10llu", msa_irq_time(j, i));
+#endif
+			}
+
 #endif
 		seq_printf(p, " %14s", irq_desc[i].handler->typename);
 		seq_printf(p, "  %s", action->name);
Microstate Accounting for 2.6.11, patch 2/6
Microstate Accounting: Add hooks into the scheduler to track state changes.
Arrange for parent process's child times to be updated at process exit.

 kernel/sched.c |    8
 kernel/exit.c  |    3 +++

Index: linux-2.6-ustate/kernel/sched.c
===================================================================
--- linux-2.6-ustate.orig/kernel/sched.c	2005-03-11 09:59:31.109628035 +1100
+++ linux-2.6-ustate/kernel/sched.c	2005-03-11 09:59:31.116463921 +1100
@@ -635,6 +635,7 @@
  */
 static inline void __activate_task(task_t *p, runqueue_t *rq)
 {
+	msa_set_timer(p, ONACTIVEQUEUE);
 	enqueue_task(p, rq->active);
 	rq->nr_running++;
 }
@@ -1238,6 +1239,7 @@
 		if (unlikely(!current->array))
 			__activate_task(p, rq);
 		else {
+			msa_set_timer(p, ONACTIVEQUEUE);
 			p->prio = current->prio;
 			list_add_tail(&p->run_list, &current->run_list);
 			p->array = current->array;
@@ -2422,6 +2424,7 @@
 		if (!rq->expired_timestamp)
 			rq->expired_timestamp = jiffies;
 		if (!TASK_INTERACTIVE(p) || EXPIRED_STARVING(rq)) {
+			msa_next_state(p, ONEXPIREDQUEUE);
 			enqueue_task(p, rq->expired);
 			if (p->static_prio < rq->best_expired_prio)
 				rq->best_expired_prio = p->static_prio;
@@ -2733,6 +2736,7 @@
 		array = rq->active;
 		rq->expired_timestamp = 0;
 		rq->best_expired_prio = MAX_PRIO;
+		msa_flip_expired(prev);
 	} else
 		schedstat_inc(rq, sched_noswitch);
 
@@ -2773,6 +2777,8 @@
 		rq->curr = next;
 		++*switch_count;
 
+		msa_switch(prev, next);
+
 		prepare_arch_switch(rq, next);
 		prev = context_switch(rq, prev, next);
 		barrier();
@@ -3693,6 +3699,8 @@
 	 */
 	if (rt_task(current))
 		target = rq->active;
+	else
+		msa_next_state(current, ONEXPIREDQUEUE);
 
 	if (current->array->nr_active == 1) {
 		schedstat_inc(rq, yld_act_empty);
Index: linux-2.6-ustate/kernel/exit.c
===================================================================
--- linux-2.6-ustate.orig/kernel/exit.c	2005-03-11 09:59:36.360564796 +1100
+++ linux-2.6-ustate/kernel/exit.c	2005-03-11 09:59:36.364471017 +1100
@@ -93,6 +93,9 @@
 	}
 
 	sched_exit(p);
+
+	msa_update_parent(p->parent, p);
+
 	write_unlock_irq(&tasklist_lock);
 	spin_unlock(&p->proc_lock);
 	proc_pid_flush(proc_dentry);
Microstate Accounting for 2.6.11, patch 3/6
Microstate accounting: Provide I386-dependent MSA clocks, and Kconfig options.

 arch/i386/Kconfig      |   39 ++++++++++++++++++++++++++++++++++++++-
 include/asm-i386/msa.h |   49 +++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 87 insertions(+), 1 deletion(-)

Signed-off-by: Peter Chubb <[EMAIL PROTECTED]>

Index: linux-2.6-ustate/arch/i386/Kconfig
===================================================================
--- linux-2.6-ustate.orig/arch/i386/Kconfig	2005-03-11 09:59:38.773632446 +1100
+++ linux-2.6-ustate/arch/i386/Kconfig	2005-03-11 09:59:38.777538666 +1100
@@ -923,8 +923,45 @@

 	  If unsure, say Y. Only embedded should say N here.

-endmenu
+config MICROSTATE
+	bool "Microstate accounting"
+	help
+	  This option causes the kernel to keep very accurate track of
+	  how long your threads spend on the runqueues, running, or asleep or
+	  stopped.  It will slow down your kernel.
+	  Times are reported in /proc/<pid>/msa and through a new msa()
+	  system call.
+
+choice
+	depends on MICROSTATE
+	prompt "Microstate timing source"
+	default MICROSTATE_TSC
+
+config MICROSTATE_PM
+	bool "Use Power-Management timer for microstate timings"
+	depends on X86_PM_TIMER
+	help
+	  If your machine is ACPI enabled and uses power-management, then the
+	  TSC runs at a variable rate, which will distort the
+	  microstate measurements.  This timer, although having
+	  slightly more overhead, and a lower resolution (279
+	  nanoseconds or so) will always run at a constant rate.
+
+config MICROSTATE_TSC
+	bool "Use on-chip TSC for microstate timings"
+	depends on X86_TSC
+	help
+	  If your machine's clock runs at constant rate, then this timer
+	  gives you cycle precision in measuring times spent in microstates.
+
+config MICROSTATE_TOD
+	bool "Use time-of-day clock for microstate timings"
+	help
+	  If none of the other timers are any good for you, this timer
+	  will give you micro-second precision.
+endchoice

+endmenu

 menu "Power management options (ACPI, APM)"
 	depends on !X86_VOYAGER
Index: linux-2.6-ustate/include/asm-i386/msa.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-ustate/include/asm-i386/msa.h	2005-03-11 09:59:38.779491777 +1100
@@ -0,0 +1,49 @@
+/************************************************************
+ * asm-i386/msa.h
+ *
+ * Provide an architecture-specific clock.
+ */
+
+#ifndef _ASM_I386_MSA_H
+# define _ASM_I386_MSA_H
+
+# include <linux/config.h>
+
+
+# if defined(CONFIG_MICROSTATE_TSC)
+/*
+ * Use the processor's time-stamp counter as a timesource
+ */
+# include <asm/msr.h>
+# include <asm/div64.h>
+
+# define MSA_NOW(now)  rdtscll(now)
+
+extern unsigned long cpu_khz;
+# define MSA_TO_NSEC(clk) ({ clk_t _x = ((clk) * 1000000ULL); do_div(_x, cpu_khz); _x; })
+
+# elif defined(CONFIG_MICROSTATE_PM)
+/*
+ * Use the system's monotonic clock as a timesource.
+ * This will only be enabled if the Power Management Timer is enabled.
+ */
+unsigned long long monotonic_clock(void);
+# define MSA_NOW(now) do { now = monotonic_clock(); } while (0)
+# define MSA_TO_NSEC(clk) (clk)
+
+# elif defined(CONFIG_MICROSTATE_TOD)
+/*
+ * Fall back to gettimeofday.
+ * This one is incompatible with interrupt-time measurement on some processors.
+ */
+static inline void msa_now(clk_t *nsp) {
+	struct timeval tv;
+	do_gettimeofday(&tv);
+	*nsp = tv.tv_sec * 1000000 + tv.tv_usec;
+}
+# define MSA_NOW(x) msa_now(&x)
+# define MSA_TO_NSEC(clk) ((clk) * 1000)
+# endif
+
+
+#endif /* _ASM_I386_MSA_H */
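The unit conversions behind MSA_TO_NSEC can be checked in userspace. What follows is only a sketch, not code from the patch: it assumes the TSC variant multiplies the cycle count by 1,000,000 before dividing by the CPU frequency in kHz, and that the time-of-day variant counts microseconds. The function names cycles_to_nsec() and tod_to_usec() are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical userspace stand-in for the TSC flavour of MSA_TO_NSEC.
 * cycles / (khz * 1000) is the elapsed time in seconds, so
 * cycles * 1000000 / khz is the elapsed time in nanoseconds.
 */
static uint64_t cycles_to_nsec(uint64_t cycles, uint32_t khz)
{
	assert(khz != 0);		/* a zero frequency would divide by zero */
	return cycles * 1000000ULL / khz;
}

/*
 * Hypothetical stand-in for the time-of-day flavour: fold a
 * seconds/microseconds pair into a single microsecond count.
 * Multiplying the result by 1000 then gives nanoseconds.
 */
static uint64_t tod_to_usec(uint64_t sec, uint32_t usec)
{
	return sec * 1000000ULL + usec;
}
```

For example, 2400 cycles on a 2.4 GHz part (khz = 2400000) comes out as 1000 ns. Note that the multiplication is done before the division, so the intermediate product can overflow 64 bits for very large cycle counts; that is a property of this sketch's formula and would need care in any real use.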
Microstate Accounting for 2.6.11, patch 5/6
Microstate accounting: Add the I386 system call.

 arch/i386/kernel/entry.S  |    2 +-
 include/asm-i386/unistd.h |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

Index: linux-2.6-ustate/arch/i386/kernel/entry.S
===================================================================
--- linux-2.6-ustate.orig/arch/i386/kernel/entry.S	2005-03-10 09:16:14.888575341 +1100
+++ linux-2.6-ustate/arch/i386/kernel/entry.S	2005-03-10 09:16:15.446188457 +1100
@@ -876,7 +876,7 @@
 	.long sys_mq_getsetattr
 	.long sys_ni_syscall	/* reserved for kexec */
 	.long sys_waitid
-	.long sys_ni_syscall	/* 285 */ /* available */
+	.long sys_msa		/* 285 */ /* available */
 	.long sys_add_key
 	.long sys_request_key
 	.long sys_keyctl
Index: linux-2.6-ustate/include/asm-i386/unistd.h
===================================================================
--- linux-2.6-ustate.orig/include/asm-i386/unistd.h	2005-03-10 09:13:00.813843194 +1100
+++ linux-2.6-ustate/include/asm-i386/unistd.h	2005-03-10 09:16:15.448141568 +1100
@@ -290,7 +290,7 @@
 #define __NR_mq_getsetattr	(__NR_mq_open+5)
 #define __NR_sys_kexec_load	283
 #define __NR_waitid		284
-/* #define __NR_sys_setaltroot	285 */
+#define __NR_sys_msa		285
 #define __NR_add_key		286
 #define __NR_request_key	287
 #define __NR_keyctl		288
Microstate Accounting for 2.6.11, patch 6/6
Microstate accounting: Track time spent asleep while paging, in poll() or
select(), or on a futex separately from other sleeps.

 fs/select.c    |    2 ++
 kernel/futex.c |    2 ++
 mm/memory.c    |    6 +++++-

Index: linux-2.6-ustate/mm/memory.c
===================================================================
--- linux-2.6-ustate.orig/mm/memory.c	2005-03-10 09:12:59.492564100 +1100
+++ linux-2.6-ustate/mm/memory.c	2005-03-10 09:16:16.583875465 +1100
@@ -2079,6 +2079,7 @@
 	if (is_vm_hugetlb_page(vma))
 		return VM_FAULT_SIGBUS;	/* mapping truncation does this. */

+	msa_next_state(current, PAGING_SLEEP);
 	/*
 	 * We need the page table lock to synchronize with kswapd
 	 * and the SMP-safe atomic PTE updates.
@@ -2098,10 +2099,13 @@
 	if (!pte)
 		goto oom;

-	return handle_pte_fault(mm, vma, address, write_access, pte, pmd);
+	int ret = handle_pte_fault(mm, vma, address, write_access, pte, pmd);
+	msa_next_state(current, MSA_UNKNOWN);
+	return ret;

  oom:
 	spin_unlock(&mm->page_table_lock);
+	msa_next_state(current, MSA_UNKNOWN);
 	return VM_FAULT_OOM;
 }

Index: linux-2.6-ustate/kernel/futex.c
===================================================================
--- linux-2.6-ustate.orig/kernel/futex.c	2005-03-10 09:12:58.843154938 +1100
+++ linux-2.6-ustate/kernel/futex.c	2005-03-10 09:16:17.109262256 +1100
@@ -39,6 +39,7 @@
 #include <linux/mount.h>
 #include <linux/pagemap.h>
 #include <linux/syscalls.h>
+#include <linux/msa.h>

 #define FUTEX_HASHBITS (CONFIG_BASE_SMALL ? 4 : 8)

@@ -571,6 +572,7 @@
 	 * wakes us up.
 	 */

+	msa_next_state(current, FUTEX_SLEEP);
 	/* add_wait_queue is the barrier after __set_current_state. */
 	__set_current_state(TASK_INTERRUPTIBLE);
 	add_wait_queue(&q.waiters, &wait);
Index: linux-2.6-ustate/fs/select.c
===================================================================
--- linux-2.6-ustate.orig/fs/select.c	2005-03-10 09:12:59.182996124 +1100
+++ linux-2.6-ustate/fs/select.c	2005-03-10 09:16:16.843639194 +1100
@@ -256,6 +256,7 @@
 			retval = table.error;
 			break;
 		}
+		msa_next_state(current, POLL_SLEEP);
 		__timeout = schedule_timeout(__timeout);
 	}
 	__set_current_state(TASK_RUNNING);
@@ -447,6 +448,7 @@
 		count = wait->error;
 		if (count)
 			break;
+		msa_next_state(current, POLL_SLEEP);
 		timeout = schedule_timeout(timeout);
 	}
 	__set_current_state(TASK_RUNNING);
Microstate Accounting for 2.6.11, patch 4/6
Microstate accounting: Account for time in interrupt handlers for I386.

 arch/i386/kernel/irq.c |   13 ++++++++++++-
 1 files changed, 12 insertions(+), 1 deletion(-)

Index: linux-2.6-ustate/arch/i386/kernel/irq.c
===================================================================
--- linux-2.6-ustate.orig/arch/i386/kernel/irq.c	2005-03-10 09:13:00.115606274 +1100
+++ linux-2.6-ustate/arch/i386/kernel/irq.c	2005-03-10 09:16:16.032121680 +1100
@@ -55,6 +55,8 @@
 #endif

 	irq_enter();
+	msa_start_irq(irq);
+
 #ifdef CONFIG_DEBUG_STACKOVERFLOW
 	/* Debugging check for stack overflow: is there less than 1KB free? */
 	{
@@ -101,6 +103,7 @@
 #endif
 		__do_IRQ(irq, regs);

+	msa_finish_irq(irq);
 	irq_exit();

 	return 1;
@@ -221,10 +224,18 @@
 		seq_printf(p, "%3d: ",i);
 #ifndef CONFIG_SMP
 		seq_printf(p, "%10u ", kstat_irqs(i));
+#ifdef CONFIG_MICROSTATE
+		seq_printf(p, "%10llu", msa_irq_time(0, i));
+#endif
 #else
 		for (j = 0; j < NR_CPUS; j++)
-			if (cpu_online(j))
+			if (cpu_online(j)) {
 				seq_printf(p, "%10u ", kstat_cpu(j).irqs[i]);
+#ifdef CONFIG_MICROSTATE
+				seq_printf(p, "%10llu", msa_irq_time(j, i));
+#endif
+			}
+
 #endif
 		seq_printf(p, " %14s", irq_desc[i].handler->typename);
 		seq_printf(p, "  %s", action->name);
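The start/finish bracketing around the interrupt handler can be pictured with a small userspace model. This is a guess at the bookkeeping only, since the bodies of msa_start_irq(), msa_finish_irq() and msa_irq_time() are not shown in this patch. Every name below (struct irq_clock, sketch_start_irq, sketch_finish_irq, SKETCH_NR_IRQS) is hypothetical, and timestamps are passed in explicitly rather than read from a clock.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NR_IRQS 16	/* arbitrary table size for this sketch */

/* One of these per CPU in this model; nested interrupts are not handled. */
struct irq_clock {
	uint64_t entered;			/* timestamp taken at handler entry */
	uint64_t total[SKETCH_NR_IRQS];		/* accumulated time per IRQ line */
};

/* Record the entry timestamp, as a start-of-interrupt hook might. */
static void sketch_start_irq(struct irq_clock *c, uint64_t now)
{
	c->entered = now;
}

/* Charge the elapsed time since entry to the given IRQ line. */
static void sketch_finish_irq(struct irq_clock *c, unsigned int irq, uint64_t now)
{
	assert(irq < SKETCH_NR_IRQS);
	c->total[irq] += now - c->entered;
}
```

Calling sketch_start_irq(&c, 1000) followed by sketch_finish_irq(&c, 3, 1400) charges 400 time units to IRQ 3. A per-line total like this is the kind of figure a /proc/interrupts column could then print next to the interrupt count.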
Microstate accounting, IA64 support
Microstate Accounting: Add support for IA64.

 linux-2.6-ustate/arch/ia64/Kconfig           |   25 +++
 linux-2.6-ustate/arch/ia64/kernel/entry.S    |   44 +++
 linux-2.6-ustate/arch/ia64/kernel/irq_ia64.c |   21 +++-
 linux-2.6-ustate/arch/ia64/kernel/ivt.S      |    8 +++-
 linux-2.6-ustate/include/asm-ia64/msa.h      |   33
 linux-2.6-ustate/include/asm-ia64/unistd.h   |    1
 7 files changed, 129 insertions(+), 5 deletions(-)

Index: linux-2.6-ustate/arch/ia64/Kconfig
===================================================================
--- linux-2.6-ustate.orig/arch/ia64/Kconfig	2005-03-10 09:13:01.780632777 +1100
+++ linux-2.6-ustate/arch/ia64/Kconfig	2005-03-10 09:16:14.593655619 +1100
@@ -302,6 +302,31 @@
 	  little bigger and slows down execution a bit, but it is generally a good
 	  idea to turn this on.  If you're unsure, say Y.

+config MICROSTATE
+	bool "Microstate accounting"
+	help
+	  This option causes the kernel to keep very accurate track of
+	  how long your threads spend on the runqueues, running, or asleep or
+	  stopped.  It will slow down your kernel.
+	  Times are reported in /proc/<pid>/msa and through a new msa()
+	  system call.
+choice
+	depends on MICROSTATE
+	prompt "Microstate timing source"
+	default MICROSTATE_ITC
+	help
+	  On IA64 one can use two timing sources for the microstate
+	  accounting; the on-chip interval counter, or Linux's
+	  time-of-day clock.  The first is very cheap; the other is
+	  more accurate on SMP systems.
+
+config MICROSTATE_ITC
+	bool "Use on-chip ITC for microstate timing"
+
+config MICROSTATE_TOD
+	bool "Use time-of-day clock for microstate timings"
+endchoice
+
 config IA64_PALINFO
 	tristate "/proc/pal support"
 	help
Index: linux-2.6-ustate/include/asm-ia64/msa.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-ustate/include/asm-ia64/msa.h	2005-03-10 09:16:14.594632174 +1100
@@ -0,0 +1,33 @@
+/************************************************************
+ * asm-ia64/msa.h
+ *
+ * Provide an architecture-specific clock.
+ */
+
+#ifndef _ASM_IA64_MSA_H
+#define _ASM_IA64_MSA_H
+
+#include <asm/processor.h>
+#include <asm/timex.h>
+#include <asm/smp.h>
+
+
+# if defined(CONFIG_MICROSTATE_ITC)
+#  define MSA_NOW(now)  do { now = (clk_t)get_cycles(); } while (0)
+
+#  define MSA_TO_NSEC(clk) ((1000000000*clk) / cpu_data(smp_processor_id())->itc_freq)
+
+# elif defined(CONFIG_MICROSTATE_TOD)
+static inline void msa_now(clk_t *nsp) {
+	struct timeval tv;
+	do_gettimeofday(&tv);
+	*nsp = tv.tv_sec * 1000000 + tv.tv_usec;
+}
+# define MSA_NOW(x) msa_now(&x)
+# define MSA_TO_NSEC(clk) ((clk) * 1000)
+
+# else
+# include <asm-generic/msa.h>
+# endif
+
+#endif /* _ASM_IA64_MSA_H */

Microstate Accounting: Track time in system calls for IA64

 arch/ia64/kernel/entry.S |   44
 arch/ia64/kernel/ivt.S   |    8 ++--
 2 files changed, 50 insertions(+), 2 deletions(-)

Index: linux-2.6-ustate/arch/ia64/kernel/entry.S
===================================================================
--- linux-2.6-ustate.orig/arch/ia64/kernel/entry.S	2005-03-10 09:13:01.149778160 +1100
+++ linux-2.6-ustate/arch/ia64/kernel/entry.S	2005-03-10 09:16:15.157128068 +1100
@@ -589,6 +589,46 @@
 .ret4:	br.cond.sptk ia64_leave_kernel
 END(ia64_strace_leave_kernel)

+#ifdef CONFIG_MICROSTATE
+/*
+ * preserve input registers,
+ * and r8
+ */
+GLOBAL_ENTRY(invoke_msa_end_syscall)
+	.prologue ASM_UNW_PRLG_RP|ASM_UNW_PRLG_PFS, ASM_UNW_PRLG_GRSAVE(8)
+	alloc loc1=ar.pfs,8,4,0,0
+	mov loc0=rp
+	.body
+	;;
+	mov loc2=ret0
+	mov loc3=ret2
+	br.call.sptk.many rp=msa_end_syscall
+1:	mov rp=loc0
+	mov ret0=loc2
+	mov ret2=loc3
+	mov ar.pfs=loc1
+	br.ret.sptk.many rp
+END(invoke_msa_end_syscall)
+/*
+ * Preserves in0-7, and all callee-save registers.
+ */
+GLOBAL_ENTRY(invoke_msa_start_syscall)
+	.prologue ASM_UNW_PRLG_RP|ASM_UNW_PRLG_PFS, ASM_UNW_PRLG_GRSAVE(8)
+	alloc loc1=ar.pfs,8,4,0,0
+	mov loc0=rp
+	.body
+	mov loc2=r3
+	mov loc3=r15
+	;;
+	br.call.sptk.many rp=msa_start_syscall
+1:	mov r15=loc3
+	mov r3=loc2
+	mov ar.pfs=loc1
+	mov rp=loc0
+	br.ret.sptk.many rp
+END(invoke_msa_start_syscall)
+#endif /* CONFIG_MICROSTATE */
+
 GLOBAL_ENTRY(ia64_ret_from_clone)
 	PT_REGS_UNWIND_INFO(0)
 {	/*
@@ -671,6 +711,10 @@
  */
 ENTRY(ia64_leave_syscall)
 	PT_REGS_UNWIND_INFO(0)
+#ifdef CONFIG_MICROSTATE
+	br.call.sptk.many rp=invoke_msa_end_syscall
+1:
+#endif
 	/*
 	 * work.need_resched etc. mustn't get changed by this CPU before it returns to
Re: Microstate Accounting for 2.6.11
>>>>> "Andrew" == Andrew Morton <[EMAIL PROTECTED]> writes:

Andrew> Peter Chubb <[EMAIL PROTECTED]> wrote:
>> Timing data on threads at present is pretty crude: when the timer
>> interrupt occurs, a tick is added to either system time or user
>> time for the currently running thread.  Thus in an unpatched kernel
>> one can distinguish three timed states: On-cpu in userspace, on-cpu
>> in system space, and not running.
>>
>> The actual number of states is much larger.  A thread can be on a
>> runqueue or the expired queue (i.e., ready to run but not running),
>> sleeping on a semaphore or on a futex, having its time stolen to
>> service an interrupt, etc., etc.
>>
>> This patch adds timers per-state to each struct task_struct, so
>> that time in all these states can be tracked.  This patch contains
>> the core code to do the timing, and to initialise the timers.
>> Subsequent patches enable the code (by adding Kconfig options) and
>> add hooks to track state changes.

Andrew> Why does the kernel need this feature?

I find that it's useful when trying to work out why a thread is going
more slowly than it needs to.  Userspace tools in the CVS repository
at gelato.unsw.edu.au let you graph in real time the time spent in
each state, so you get graphs like this:
	http://gelato.unsw.edu.au/patches/snapshot.png
which shows mplay skipping because of a slow disk/filesystem.

Andrew> Have you any numbers on the overhead?

Around 5% on LMbench context switch numbers for uniprocessor,
negligible on SMP (but SMP context switch results are horrible at the
moment according to LMbench2 -- almost 16usec); select on 10 fd goes
from 1.665 usec to 1.701.

Andrew> The preempt_disable() in sys_msa() seems odd.

Yes, I only added that yesterday.  It's to prevent migration while
updating the current timer.  All the other places where the current
timer is updated are naturally protected against this.  It should
probably be a local_irq_disable() instead.

Peter C
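The per-state timers described in the quoted patch summary can be modelled in a few lines of userspace C. This is a sketch of the idea only, not the patch's data structures: the type and function names (struct msa_timers, msa_sketch_init, msa_sketch_next_state) and the reduced state list are invented for illustration, and timestamps are supplied by the caller instead of being read from a clock.

```c
#include <assert.h>
#include <stdint.h>

/* A reduced, hypothetical set of states; the real list is longer. */
enum msa_state {
	MSA_ONCPU_USER,
	MSA_ONCPU_SYS,
	MSA_ON_RUNQUEUE,
	MSA_SLEEPING,
	MSA_NR_STATES
};

/* One accumulator per state, plus where we are and since when. */
struct msa_timers {
	enum msa_state cur_state;
	uint64_t last_change;
	uint64_t timers[MSA_NR_STATES];
};

/* Zero all accumulators and start the clock in the given state. */
static void msa_sketch_init(struct msa_timers *m, enum msa_state s, uint64_t now)
{
	for (int i = 0; i < MSA_NR_STATES; i++)
		m->timers[i] = 0;
	m->cur_state = s;
	m->last_change = now;
}

/* Charge the time since the last change to the old state, then switch. */
static void msa_sketch_next_state(struct msa_timers *m, enum msa_state next,
				  uint64_t now)
{
	m->timers[m->cur_state] += now - m->last_change;
	m->cur_state = next;
	m->last_change = now;
}
```

Every state change costs one clock read and one addition, which is consistent with the overhead showing up mainly in context-switch benchmarks. In a kernel the update would also have to be protected against preemption or interrupts between reading the clock and storing the new state, which is the concern behind the preempt_disable() discussion above.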
Re: Reading large /proc entry from kernel module
>>>>> "Kristian" == Kristian Sørensen <[EMAIL PROTECTED]> writes:

Kristian> Hi all!  I have some trouble reading a 2346 byte /proc entry
Kristian> from our Umbrella kernel module.

Kristian> static int umb_proc_write(struct file *file, const char *buffer,
Kristian>                           unsigned long count, void *data) {
Kristian>         char *policy;
Kristian>         int *lbuf;
Kristian>         int i;

Here's your problem:  lbuf should be a char * not an int *.
When you look at lbuf[0] you'll get the first four characters packed
into the int.

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*
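The effect is easy to reproduce outside the kernel. The sketch below is illustrative only and is not taken from the Umbrella module: first_as_int() and first_as_char() are made-up helper names, and memcpy() is used in place of a real int * cast so that the example stays free of alignment and aliasing problems.

```c
#include <assert.h>
#include <string.h>

/*
 * What lbuf[0] would hold if lbuf were an int * pointing at buf:
 * the first sizeof(int) bytes of the text, packed into one integer.
 * The caller must supply at least sizeof(int) bytes.
 */
static int first_as_int(const char *buf)
{
	int packed;

	memcpy(&packed, buf, sizeof packed);
	return packed;
}

/* What lbuf[0] holds when lbuf is a char *: just the first character. */
static char first_as_char(const char *buf)
{
	return buf[0];
}
```

With a buffer starting "allow", first_as_char() gives 'a', while first_as_int() gives an integer whose bytes are 'a', 'l', 'l', 'o' (on a machine with 4-byte ints), in an order that depends on the machine's endianness. Comparing that integer against a single character will never match, and each lbuf[i] step advances sizeof(int) bytes rather than one.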