Re: [PATCH 1/1] irqchip: exynos-combiner: Save IRQ enable set on suspend

2015-06-10 Thread Peter Chubb
>>>>> "Javier" == Javier Martinez Canillas  
>>>>> writes:

Javier> The Exynos interrupt combiner IP looses its state when the SoC
     s/looses/loses/

Peter C
-- 
Dr Peter Chubb  peter.chubb AT nicta.com.au
http://www.ssrg.nicta.com.au  Software Systems Research Group/NICTA
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Documentation: ARM: EXYNOS: Describe boot loaders interface

2015-06-06 Thread Peter Chubb
>>>>> "Krzysztof" == Krzysztof Kozlowski  writes:

Krzysztof> Various boot loaders for Exynos based boards use certain
Krzysztof> memory addresses during booting for different
Krzysztof> purposes. Mostly this is one of following : 1. as a CPU
Krzysztof> boot address, 2. for storing magic cookie related to low
Krzysztof> power mode (AFTR, sleep).

Krzysztof> The document, based solely on kernel source code, tries to
Krzysztof> group the information scattered over different files. This
Krzysztof> would help in the future when adding support for new SoC or
Krzysztof> when extending features related to low power modes.

Is it worth grabbing the info from U-Boot and documenting it here
(it's not documented other than in the hardkernel U-Boot source)?

I can send you the info, or you can see it in
https://github.com/hardkernel/u-boot/blob/odroidxu3-v2012.07/board/samsung/smdk5420/lowlevel_init.S
at symbol nscode_base near line 104

-- 
Dr Peter Chubb  peter.chubb AT nicta.com.au
http://www.ssrg.nicta.com.au  Software Systems Research Group/NICTA


Re: [PATCH] debug: Do not permit CONFIG_DEBUG_STACK_USAGE=y on IA64 or PARISC

2012-07-26 Thread Peter Chubb
>>>>> "Ingo" == Ingo Molnar  writes:

Ingo> * James Bottomley  wrote:
>> Since the problem is an invalid assumption about how the stack
>> grows, why not just condition it on that.  We actually have a
>> config option for this: CONFIG_STACK_GROWSUP.  But for some reason
>> ia64 doesn't define this, why not, Tony?  It looks deliberate
>> because you have replaced a lot of
>> 
>> #ifdef CONFIG_STACK_GROWSUP
>> 
>> with
>> 
>> #if defined(CONFIG_STACK_GROWSUP) || defined(CONFIG_IA64)
>> 
>> but not all of them.

Ingo> Yes, that's another possible solution, assuming that it's really
Ingo> only about the up/down difference.

Ingo> Thanks,

IA64 has two stacks -- the standard one, that grows down, and the
register stack engine backing store, that grows up.  The usual
mechanisms for stack growth are used, so only some of the bits
predicated on `STACK_GROWSUP' are useful.

Peter C
--
Dr Peter Chubb  peter.chubb AT nicta.com.au
http://www.ssrg.nicta.com.au  Software Systems Research Group/NICTA


Fix compilation with gcc 4.2

2007-08-08 Thread Peter Chubb

gcc-4.2 is a lot more picky about its symbol handling.  EXPORT_SYMBOL
no longer works on symbols that are undefined or defined with static scope.

For example, with CONFIG_PROFILING off, I see:

  kernel/profile.c:206: error: __ksymtab_profile_event_unregister causes a section type conflict
  kernel/profile.c:205: error: __ksymtab_profile_event_register causes a section type conflict

This patch moves the EXPORTs inside the #ifdef CONFIG_PROFILING, so we
only try to export symbols that are defined.

Also, in kernel/kprobes.c there's an EXPORT_SYMBOL_GPL() for
jprobe_return, which if CONFIG_JPROBES is undefined is a static
inline and gives the same error.

And in drivers/acpi/resources/rsxface.c, there's an
ACPI_EXPORT_SYMBOL() for a static symbol. If it's static, it's not
accessible from outside the compilation unit, so should not be exported.

These three changes allow building a zx1_defconfig kernel with gcc 4.2
on IA64.

Signed-off-by: Peter Chubb <[EMAIL PROTECTED]>

Index: linux-2.6-git/kernel/profile.c
===
--- linux-2.6-git.orig/kernel/profile.c 2007-08-09 12:10:19.921216500 +1000
+++ linux-2.6-git/kernel/profile.c  2007-08-09 12:10:26.061162039 +1000
@@ -199,11 +199,11 @@ EXPORT_SYMBOL_GPL(register_timer_hook);
 EXPORT_SYMBOL_GPL(unregister_timer_hook);
 EXPORT_SYMBOL_GPL(task_handoff_register);
 EXPORT_SYMBOL_GPL(task_handoff_unregister);
+EXPORT_SYMBOL_GPL(profile_event_register);
+EXPORT_SYMBOL_GPL(profile_event_unregister);
 
 #endif /* CONFIG_PROFILING */
 
-EXPORT_SYMBOL_GPL(profile_event_register);
-EXPORT_SYMBOL_GPL(profile_event_unregister);
 
 #ifdef CONFIG_SMP
 /*
Index: linux-2.6-git/kernel/kprobes.c
===
--- linux-2.6-git.orig/kernel/kprobes.c 2007-08-09 12:14:48.898830198 +1000
+++ linux-2.6-git/kernel/kprobes.c  2007-08-09 14:09:50.180322576 +1000
@@ -1063,6 +1063,8 @@ EXPORT_SYMBOL_GPL(register_kprobe);
 EXPORT_SYMBOL_GPL(unregister_kprobe);
 EXPORT_SYMBOL_GPL(register_jprobe);
 EXPORT_SYMBOL_GPL(unregister_jprobe);
-EXPORT_SYMBOL_GPL(jprobe_return);
+
+#ifdef CONFIG_KPROBES
 EXPORT_SYMBOL_GPL(register_kretprobe);
 EXPORT_SYMBOL_GPL(unregister_kretprobe);
+#endif
Index: linux-2.6-git/drivers/acpi/resources/rsxface.c
===
--- linux-2.6-git.orig/drivers/acpi/resources/rsxface.c 2007-08-09 13:06:59.040346772 +1000
+++ linux-2.6-git/drivers/acpi/resources/rsxface.c  2007-08-09 13:12:03.125801491 +1000
@@ -474,8 +474,6 @@ acpi_rs_match_vendor_resource(struct acp
return (AE_CTRL_TERMINATE);
 }
 
-ACPI_EXPORT_SYMBOL(acpi_rs_match_vendor_resource)
-
 
/***
  *
  * FUNCTION:acpi_walk_resources


--
Dr Peter Chubb http://www.gelato.unsw.edu.au  [EMAIL PROTECTED]
http://www.ertos.nicta.com.au   ERTOS within National ICT Australia



Re: [RFC] Deferred interrupt handling.

2007-07-18 Thread Peter Chubb

The problem you're having is essentially the same as the user-level
interrupt handler problem I've been dealing with for ages.

The basic rule is: don't share interrupts between devices on the host
and devices in the guest.  But you *can* share interrupts between
devices in a single guest.

If you want the code, see
http://www.gelato.unsw.edu.au/cgi-bin/viewvc.cgi/cvs/kernel/usrdrivers/latest/
and look at generic-irq.patch and fasync (which adds asynchronous notifications)

For the KVM work it'll need modifying a little, but the basic
infrastructure is there.

We've currently got this working to pass interrupts to a type-II (hosted)
virtual machine monitor running a guest kernel with native drivers.

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
http://www.ertos.nicta.com.au   ERTOS within National ICT Australia


Re: linux-ia64 build warning messages

2007-06-06 Thread Peter Chubb
>>>>> "Russ" == Russ Anderson <[EMAIL PROTECTED]> writes:

Russ> Tony Luck wrote:
>> > I used the sn2_defconfig in the tree :)
>> 
>> So there is something odd happening.  Russ complained that he was
>> still seeing several errors from the sn2_defconfig build too when I
>> posted the "last fix" to Len.  But I don't see them when I build.

Russ> An additional data point.  I have a copy of Tony's test tree
Russ> pulled down on March 30th that builds without the warning
Russ> messages.  The copy of Tony's test tree pulled down on May 22nd
Russ> does have warning messages.  I'm building both with the same
Russ> compiler (etc).  I'm fairly certain a tree I pulled down in
Russ> April built without warnings.  I've since blown away that tree.

Change request 85bd2fddd68e757da8e1af98f857f61a3c9ce647 introduced
section-mismatch checking for vmlinux, which caused all these warnings
to become visible.

It looks as if gcc can create references from .sdata to .init.sdata
depending on what optimisations it chooses to do.  Ideally we could
teach gcc to put its constants in the same section they reference.
But I'm no gcc guru.  The alternative is to get modpost to ignore such
references, at the cost of perhaps missing a real problem somewhere.
--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
http://www.ertos.nicta.com.au   ERTOS within National ICT Australia


Re: BUG: sleeping function called from invalid context at kernel/fork.c:385

2007-06-03 Thread Peter Chubb

I see many, many section mismatches when compiling with gcc 4.1 and
binutils 2.17.50.20070426.  They appear to be from .sdata to
.init.data.

This is with basic zx1_defconfig with a few mods.

The reason appears to be compiler weirdness.


WARNING: init/built-in.o(.sdata+0x30): Section mismatch: reference to .init.data:ino (after 'root_mountflags')

initramfs.s contains a 32-word table `head'.  Code like:
static __initdata struct hash {..} *head[32];

for (p = head; p < head + 32; p++)
is generating:
  .section .sdata
L24:
.data8 head#+256


Rather than adding 256 to head at run time, the compiler loads L24 and
uses that for the comparison.  This triggers the warning.
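
To illustrate where the head#+256 constant comes from, here is a minimal
user-space sketch.  It assumes 8-byte pointers, as on IA64; __initdata is
stubbed out as an empty macro, the struct fields are made up, and
end_offset_bytes() and count_slots() are hypothetical helpers, none of them
taken from the kernel source.

```c
#include <stddef.h>

/* Stand-in for the kernel's __initdata section annotation (assumption:
 * outside the kernel we simply define it away). */
#define __initdata

struct hash { int ino; struct hash *next; };   /* fields are illustrative only */

static __initdata struct hash *head[32];

/* Byte distance from head to the loop bound head + 32.  With 8-byte
 * pointers this is the 256 that shows up as "head#+256" in .sdata. */
size_t end_offset_bytes(void)
{
    return (size_t)((char *)(head + 32) - (char *)head);
}

/* Same loop shape as in the message: the bound head + 32 is a link-time
 * constant, so a compiler is free to materialise it as data. */
int count_slots(void)
{
    struct hash **p;
    int n = 0;

    for (p = head; p < head + 32; p++)
        n++;
    return n;
}
```

Because the precomputed bound lives in ordinary data while head itself is
init data, the stored address is exactly the kind of cross-section reference
the checker reports.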




WARNING: arch/ia64/kernel/built-in.o(.sdata+0x110): Section mismatch: reference to .init.data:rsvd_region (between 'ia64_sal' and 'ia64_i_cache_stride_shift')
WARNING: mm/built-in.o(.sdata+0x48): Section mismatch: reference to .init.data:early_node_map before 'sysctl_lowmem_reserve_ratio' (at offset -0x0)
WARNING: mm/built-in.o(.sdata+0x50): Section mismatch: reference to .init.data:early_node_map before 'sysctl_lowmem_reserve_ratio' (at offset -0x0)
WARNING: mm/built-in.o(.sdata+0x58): Section mismatch: reference to .init.data:early_node_map before 'sysctl_lowmem_reserve_ratio' (at offset -0x0)
WARNING: mm/built-in.o(.sdata+0x60): Section mismatch: reference to .init.data:early_node_map before 'sysctl_lowmem_reserve_ratio' (at offset -0x0)
WARNING: mm/built-in.o(.sdata+0x68): Section mismatch: reference to .init.data:early_node_map before 'sysctl_lowmem_reserve_ratio' (at offset -0x0)
WARNING: mm/built-in.o(.sdata+0x70): Section mismatch: reference to .init.data:early_node_map before 'sysctl_lowmem_reserve_ratio' (at offset -0x0)
WARNING: mm/built-in.o(.sdata+0x78): Section mismatch: reference to .init.data:early_node_map before 'sysctl_lowmem_reserve_ratio' (at offset -0x0)
WARNING: mm/built-in.o(.sdata+0x80): Section mismatch: reference to .init.data:early_node_map before 'sysctl_lowmem_reserve_ratio' (at offset -0x0)
WARNING: mm/built-in.o(.sdata+0x3c8): Section mismatch: reference to .init.data: (between 'swap_list' and 'slab_early_init')
WARNING: mm/built-in.o(.sdata+0x3d8): Section mismatch: reference to .init.data:initkmem_list3 (between 'swap_list' and 'slab_early_init')
WARNING: mm/built-in.o(.sdata+0x3e0): Section mismatch: reference to .init.data:initkmem_list3 (between 'swap_list' and 'slab_early_init')
WARNING: drivers/built-in.o(.data.rel.local+0x20a8): Section mismatch: reference to .init.text:acpi_processor_start (between 'acpi_processor_driver' and 'acpi_thermal_driver')
WARNING: drivers/built-in.o(.data.rel+0x1d80): Section mismatch: reference to .init.text:serial8250_console_setup (between 'serial8250_console' and 'dpm_active')
WARNING: drivers/built-in.o(.sdata+0x788): Section mismatch: reference to .init.data: (between 'first.20152' and 'enabled')
WARNING: drivers/built-in.o(.sdata+0x790): Section mismatch: reference to .init.data: (between 'first.20152' and 'enabled')
WARNING: drivers/built-in.o(.sdata+0xa18): Section mismatch: reference to .init.data: (between 'scsi_null_device_strs' and 'fc_dev_loss_tmo')
WARNING: drivers/built-in.o(.sdata+0xa20): Section mismatch: reference to .init.data: (between 'scsi_null_device_strs' and 'fc_dev_loss_tmo')
WARNING: drivers/built-in.o(.sdata+0xa28): Section mismatch: reference to .init.data: (between 'scsi_null_device_strs' and 'fc_dev_loss_tmo')
WARNING: drivers/built-in.o(.sdata+0xac8): Section mismatch: reference to .init.data: (between 'Symbios_trailer.24436' and 'try_direct_io')
WARNING: drivers/built-in.o(.sdata+0xb00): Section mismatch: reference to .init.data: (between 'st_max_sg_segs' and 'osst_version')
WARNING: arch/ia64/hp/common/built-in.o(.data.rel.local+0xa8): Section mismatch: reference to .init.text:acpi_sba_ioc_add (between 'acpi_sba_ioc_driver' and 'ioc_seq_ops')
WARNING: arch/ia64/hp/common/built-in.o(.sdata+0x0): Section mismatch: reference to .init.data:__setup_str_sba_page_override before 'reserve_sba_gart' (at offset -0x204c2613)
--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
http://www.ertos.nicta.com.au   ERTOS within National ICT Australia


Re: [PATCH Resend] - SN: validate smp_affinity mask on intr redirect

2007-05-08 Thread Peter Chubb

Jack>  }
Jack> +
Jack> +bool is_affinity_mask_valid(cpumask_t cpumask)
Jack> +{
Jack> + if (ia64_platform_is("sn2")) {
Jack> + /* Only allow one CPU to be specified in the smp_affinity mask */
Jack> + if (cpus_weight(cpumask) != 1)
Jack> + return false;

Why not just:
return cpus_weight(cpumask) == 1;


It's a Boolean; treat it as one.
(If you thought the average kernel programmer (who's s/he?) understood
the logical implication rule it could be:
return !ia64_platform_is("sn2") || cpus_weight(cpumask) == 1;
)
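
The equivalence behind that last form can be checked with a small
stand-alone sketch.  The parameters platform_is_sn2 and weight are
hypothetical stand-ins for ia64_platform_is("sn2") and
cpus_weight(cpumask); the if-based variant assumes that the part of the
function not quoted above ends by returning true.

```c
#include <stdbool.h>

/* Shape of the quoted patch; the final "return true" is an assumption,
 * since the rest of the function is not quoted. */
bool valid_if_form(bool platform_is_sn2, int weight)
{
    if (platform_is_sn2) {
        /* Only one CPU may be set in the mask on this platform. */
        if (weight != 1)
            return false;
    }
    return true;
}

/* Same test written as a logical implication: sn2 implies weight == 1. */
bool valid_implication_form(bool platform_is_sn2, int weight)
{
    return !platform_is_sn2 || weight == 1;
}

/* Returns true if both forms agree for every weight in 0..max_weight. */
bool forms_agree(int max_weight)
{
    for (int sn2 = 0; sn2 <= 1; sn2++)
        for (int w = 0; w <= max_weight; w++)
            if (valid_if_form(sn2, w) != valid_implication_form(sn2, w))
                return false;
    return true;
}
```

Under that assumption the two forms return the same value for every input,
so the one-line return is a drop-in replacement.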
--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
http://www.ertos.nicta.com.au   ERTOS within National ICT Australia



Re: [QUICKLIST 0/4] Arch independent quicklists V2

2007-03-13 Thread Peter Chubb
> "Jeremy" == Jeremy Fitzhardinge <[EMAIL PROTECTED]> writes:


Jeremy> And do the same in pte pages for actual mapped pages?  Or do
Jeremy> you think they would be too densely populated for it to be
Jeremy> worthwhile?

We've been doing some measurements on how densely clumped ptes are.
On 32-bit platforms, they're pretty dense.  On IA64, quite a bit
sparser, depending on the workload of course.  I think that's mostly because
of the larger pagesize on IA64 -- with 64k pages, you don't need very
many to map a small object.
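
As a rough illustration of the page-size effect, the sketch below counts
how many pages (and so how many ptes) a mapping needs.  The 4k comparison
size and the 256 KiB object are assumptions picked for the example, not
measurements, and ptes_needed() is a hypothetical helper.

```c
#include <stddef.h>

/* Number of pages, and therefore ptes, needed to map an object of
 * object_size bytes using pages of page_size bytes (rounding up). */
size_t ptes_needed(size_t object_size, size_t page_size)
{
    return (object_size + page_size - 1) / page_size;
}
```

For a 256 KiB object this gives 64 ptes with 4k pages but only 4 with 64k
pages, so the same object leaves a pte page far less densely populated.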

I'm hoping IanW can give more details.


Re: Ski for huge page size !

2006-11-27 Thread Peter Chubb
>>>>> "sudhnesh" == sudhnesh adapawar <[EMAIL PROTECTED]> writes:

sudhnesh> Hey all !  I am thinking to use ski simulator as I can get
sudhnesh> the ia64 (Itanium 2)simulated on ia32 archiSo can I use
sudhnesh> this product for the project related to huge page size ???
sudhnesh> Will the problems related to huge pages such as
sudhnesh> swapping,IO,etc...will be covered if I use ski with 2.6
sudhnesh> kernel image configured for ia64 archi with huge page size
sudhnesh> support ?


Should work perfectly.  We've been using Ski for similar work, looking
at SuperPage support.
--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
http://www.ertos.nicta.com.au   ERTOS within National ICT Australia


Re: How to boot 2.6 kernel using hp ski simulator ???

2006-11-27 Thread Peter Chubb

Please check out http://www.gelato.unsw.edu.au/IA64wiki/SkiSimulator
for lots of info on Ski.

It works fine with Linux 2.6, and hugepages work too.

> 1) I used 'make ARCH=ia64 menuconfig' to configure and followed the
> steps to get kernel image of version 2.6 ! I also selected the generic
> type as Ski-simulator and also selected the HP-ski drivers something
> simscsi,etc.etc.

I suggest you start with
make sim_defconfig

Your symptoms look like a misconfigured or misbuilt vmlinux.  The sim_defconfig

If you're running on IA32, then you need something like:
make CROSS_COMPILE=ia64-linux-gnu- ARCH=ia64 boot
to build kernel and bootloader.

You need to get or build yourself a disk image.  Instructions for
building at http://www.gelato.unsw.edu.au/IA64wiki/skidiskimage 




--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
http://www.ertos.nicta.com.au   ERTOS within National ICT Australia


RE: ip_contrack refuses to load if built UP as a module on IA64

2005-08-31 Thread Peter Chubb


This patch makes UP and SMP do the same thing as far as module per-cpu
data go.

Unfortunately it affects core code.

To repeat the problem:
  IA64 keeps per-cpu data in a small data area that is referenced by a
  22-bit offset, for both UP and SMP cases.  If a module defines
  per-cpu data, it too will end up in the small-data area.  But the
  module loader at present special-cases the UP treatment of per-cpu
  data, assumes that it is in the GP-relative data area, and does
  nothing (for SMP it allocates space, and copies initialised data
  items into it).

  The effect is that modules defining per-cpu data fail to load if
  they're built UP, because of an impossible relocation.

  The appended patch makes the treatment of per-cpu data uniform
  between UP and SMP cases.  For most architectures, the per-cpu data
  section will be empty for UP, and so the per-cpu setup code will not
  be invoked.

Signed-off-by: Peter Chubb <[EMAIL PROTECTED]>

diff --git a/arch/ia64/kernel/module.c b/arch/ia64/kernel/module.c
--- a/arch/ia64/kernel/module.c
+++ b/arch/ia64/kernel/module.c
@@ -951,4 +951,10 @@ percpu_modcopy (void *pcpudst, const voi
        if (cpu_possible(i))
            memcpy(pcpudst + __per_cpu_offset[i], src, size);
 }
+#else
+void
+percpu_modcopy (void *pcpudst, const void *src, unsigned long size)
+{
+   memcpy(pcpudst, src, size);
+}
 #endif /* CONFIG_SMP */
diff --git a/kernel/module.c b/kernel/module.c
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -209,7 +209,6 @@ static struct module *find_module(const 
    return NULL;
 }
 
-#ifdef CONFIG_SMP
 /* Number of blocks used and allocated. */
 static unsigned int pcpu_num_used, pcpu_num_allocated;
 /* Size of each block.  -ve means used. */
@@ -352,29 +351,7 @@ static int percpu_modinit(void)
    return 0;
 }  
 __initcall(percpu_modinit);
-#else /* ... !CONFIG_SMP */
-static inline void *percpu_modalloc(unsigned long size, unsigned long align,
-   const char *name)
-{
-   return NULL;
-}
-static inline void percpu_modfree(void *pcpuptr)
-{
-   BUG();
-}
-static inline unsigned int find_pcpusec(Elf_Ehdr *hdr,
-   Elf_Shdr *sechdrs,
-   const char *secstrings)
-{
-   return 0;
-}
-static inline void percpu_modcopy(void *pcpudst, const void *src,
- unsigned long size)
-{
-   /* pcpusec should be 0, and size of that section should be 0. */
-   BUG_ON(size != 0);
-}
-#endif /* CONFIG_SMP */
+
 
 #ifdef CONFIG_MODULE_UNLOAD
 #define MODINFO_ATTR(field)\


'mdio_bus_exit' in discarded section .text.exit

2005-08-31 Thread Peter Chubb

When building with CONFIG_PHYLIB=y on Itanium, I see:
 `mdio_bus_exit' referenced in section `.init.text' of
drivers/built-in.o: defined in discarded section `.exit.text' of
drivers/built-in.o

I believe that mdio_bus_exit should not be declared __exit, because it
is referenced from __init sections in, say, phy_init().

Signed-off-by: Peter Chubb <[EMAIL PROTECTED]>

diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c
--- a/drivers/net/phy/mdio_bus.c
+++ b/drivers/net/phy/mdio_bus.c
@@ -170,7 +170,7 @@ int __init mdio_bus_init(void)
   return bus_register(&mdio_bus_type);
 }
 
-void __exit mdio_bus_exit(void)
+void mdio_bus_exit(void)
 {
   bus_unregister(&mdio_bus_type);
 }


-- 
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


Re: Where is the performance bottleneck?

2005-08-29 Thread Peter Chubb
>>>>> "Holger" == Holger Kiehl <[EMAIL PROTECTED]> writes:

Holger> Hello I have a system with the following setup:

(4-way CPUs, 8 spindles on two controllers)

Try using XFS.

See http://scalability.gelato.org/DiskScalability_2fResults --- ext3
is single threaded and tends not to get the full benefit of either the
multiple spindles or the multiple processors.

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


Include assembly entry points in TAGS

2005-08-22 Thread Peter Chubb

As it stands, etags doesn't find labels in the IA64 or i386 assembler source
code, because they're disguised inside a preprocessor macro.

I propose the attached fix, which adds a regular expression so that etags
also picks up labels disguised by ENTRY() and GLOBAL_ENTRY() macros.

There's a similar problem for MIPS, which needs to match LEAF(entrypoint).

Signed-off-by: Peter Chubb <[EMAIL PROTECTED]>

diff --git a/Makefile b/Makefile
--- a/Makefile
+++ b/Makefile
@@ -1187,7 +1187,7 @@ cscope: FORCE
$(call cmd,cscope)
 
 quiet_cmd_TAGS = MAKE   $@
-cmd_TAGS = $(all-sources) | etags -
+cmd_TAGS = $(all-sources) | etags --regex='{asm}/\(GLOBAL_\)?ENTRY(\([^)]+\))/\2/' -
 
 #  Exuberant ctags works better with -I
 


Re: fcntl(F_GETLEASE) semantics??

2005-08-10 Thread Peter Chubb
>>>>> "Trond" == Trond Myklebust <[EMAIL PROTECTED]> writes:

Trond> On Thu, 11.08.2005 at 09:48 (+1000), Peter Chubb wrote:
>> Hi, The LTP test fcntl23 is failing.  It does, in essence, fd =
>> open(xxx, O_RDWR|O_CREAT, 0777); if (fcntl(fd, F_SETLEASE, F_RDLCK)
>> == -1) fail;
>> 
>> fcntl always returns EAGAIN here.  The manual page says that a read
>> lease causes notification when `another process' opens the file for
>> writing or truncates it.  The kernel implements `any process'
>> (including the current one).
>> 
>> Which semantics are correct?  Personally I think that what the
>> kernel implements is correct (you can't get a read lease unless
>> there are no writers _at_ _all_)

Trond> A read lease should mean that there are no writers at all.

Trond> If we were to allow the current process to open for write, then
Trond> that would still mean that nobody else can get a lease. In
Trond> effect you have been granted a lease with exclusive semantics
Trond> (i.e. a write lease). You might as well request that instead of
Trond> pretending it is a read lease.

So the manual page is wrong.  Fine.


-- 
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


fcntl(F_GETLEASE) semantics??

2005-08-10 Thread Peter Chubb

Hi,
The LTP test fcntl23 is failing.  It does, in essence, 
fd = open(xxx, O_RDWR|O_CREAT, 0777);
if (fcntl(fd, F_SETLEASE, F_RDLCK) == -1)
   fail;

fcntl always returns EAGAIN here.  The manual page says that a read
lease causes notification when `another process' opens the file for
writing or truncates it.  The kernel implements `any process'
(including the current one).

Which semantics are correct?  Personally I think that what the kernel
implements is correct (you can't get a read lease unless there are no
writers _at_ _all_)
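
The behaviour described above can be reproduced outside LTP with a few lines of C. This is only a sketch, not the fcntl23 source: the function name and the idea of returning the errno value are choices made for the example, and it assumes a Linux system and a filesystem that supports leases.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE          /* F_SETLEASE is a Linux extension */
#endif
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Open `path` read-write, then ask for a read lease on the same fd.
 * Returns 0 if the lease was granted, the errno value from fcntl() if
 * it was refused, or -1 if the file could not be opened at all.
 */
static int read_lease_on_rdwr_fd(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    int err = 0;

    if (fd == -1)
        return -1;
    if (fcntl(fd, F_SETLEASE, F_RDLCK) == -1)
        err = errno;                      /* we hold the file open for writing */
    else
        fcntl(fd, F_SETLEASE, F_UNLCK);   /* granted: drop the lease again */
    close(fd);
    unlink(path);                         /* remove the scratch file */
    return err;
}
```

On a kernel that treats the caller as a writer, the return value should be EAGAIN; opening the file O_RDONLY instead would be the way to check that the lease is otherwise obtainable.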


-- 
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


Re: How to get the physical page addresses from a kernel virtual address for DMA SG List?

2005-08-04 Thread Peter Chubb
You may want to take a look at the user-mode driver infrastructure
patches, which do almost exactly what you're trying to do.

Get them from
http://www.gelato.unsw.edu.au/cgi-bin/viewcvs.cgi/cvs/kernel/usrdrivers/kernel-2.6.12-rc3/

-- 
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


Re: Hangcheck problem

2005-03-30 Thread Peter Chubb
>>>>> "Noah" == Noah Silverman <[EMAIL PROTECTED]> writes:

Noah> Sorry 2.6.7


Noah> Burton Windle wrote:
>> Kernel version?

Are you running on an x86 machine without TSC, e.g., a 486?  The
Hangcheck timer then devolves into using jiffies, and a single jiffy
error gives you the printout you mention.

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


Re: How to measure time accurately.

2005-03-29 Thread Peter Chubb
>>>>> "Chris" == Chris Friesen <[EMAIL PROTECTED]> writes:

Chris> krishna wrote:
>> Hi All,
>> 
>> Can any one tell me how to measure time accurately for a block of C
>> code in device drivers.  For example, If I want to measure the time
>> duration of firmware download.

Chris> Most cpus have some way of getting at a counter or decrementer
Chris> of various frequencies.  Usually it requires low-level hardware
Chris> knowledge and often it needs assembly code.

As a device driver is inside the Linux kernel (unless you're writing a
user-mode device driver :-)) you can use the get_cycles() helper that's
defined for most architectures.  It provides a snapshot of the
cycle-counter.

Caveats:
1.  If you're running with power management, the cycle
counter ticks at a variable rate.
2.  If you're on a multiprocessor, the cycle counters of
different processors need not be synchronised.
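
The snapshot-before, snapshot-after pattern is the same whatever the counter. The cycle counter itself is only reachable this way from kernel code, so the sketch below swaps in clock_gettime(CLOCK_MONOTONIC) as a user-space stand-in; the names now_ns() and elapsed_ns() are invented for the example.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for clock_gettime() under strict -std= */
#endif
#include <stdint.h>
#include <time.h>

/* Monotonic timestamp in nanoseconds (stand-in for a cycle-counter read). */
static uint64_t now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Nanoseconds elapsed since an earlier now_ns() snapshot. */
static uint64_t elapsed_ns(uint64_t start)
{
    return now_ns() - start;
}
```

Take start = now_ns() before the block being measured and call elapsed_ns(start) after it. Inside a driver the two snapshots would be cycle-counter reads instead, and the caveats above apply to the difference.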
-- 
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


Re: LBD/filesystems over 2TB: is it safe?

2005-03-21 Thread Peter Chubb
>>>>> "jniehof" == jniehof  <[EMAIL PROTECTED]> writes:

jniehof> Someone posted to the LBD list last December regarding some
jniehof> supposedly horrible bugs in large filesystems:
jniehof> https://www.gelato.unsw.edu.au/archives/lbd/2004-December/75.html
jniehof> https://www.gelato.unsw.edu.au/archives/lbd/2004-December/74.html

The changes in those emails are irrelevant --- they fail to take into
account the properties of the filesystems that they modify, which mean
that the 32-bit quantities being shifted will not overflow.

They're typically of the form:
-   iblock = index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
+   iblock = (sector_t) index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
 
Now, on a 32-bit processor with 4k pages, PAGE_CACHE_SHIFT is 12, and
i_blkbits is also 12 if you're using 4k blocks (which you have to to
get a large filesystem).  So this does nothing and is safe.  The
on-disk format for ext[23] uses 32-bit block numbers, so your maximum
filesystem size is 16TB, and your maximum value of iblock is 2^32-1.
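
The arithmetic can be checked in user space with a small sketch. The typedef and the two function names are assumptions made for the example: uint32_t stands in for the 32-bit page index, and sector_t is taken to be 64 bits wide.

```c
#include <stdint.h>

typedef uint64_t sector_t;   /* assumption: 64-bit sector_t */

/* Shift carried out in 32 bits, as on a 32-bit kernel without the cast. */
static sector_t iblock_nocast(uint32_t index, unsigned int page_shift,
                              unsigned int blkbits)
{
    return index << (page_shift - blkbits);
}

/* Shift carried out in 64 bits, as in the proposed change. */
static sector_t iblock_cast(uint32_t index, unsigned int page_shift,
                            unsigned int blkbits)
{
    return (sector_t)index << (page_shift - blkbits);
}
```

With page_shift and blkbits both 12 the shift count is 0, so the two functions agree for every index, including 0xFFFFFFFF. They only differ when the shift is non-zero and the index is large, e.g. index 0x40000000 with 1k blocks (blkbits 10) gives 0 without the cast and 0x100000000 with it; that case is included purely to show what the cast guards against in principle.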

Please do benchmark XFS and ext3 on your system before choosing.  Our
tests (to be published in Linux.Conf.Au next month) show that XFS is
significantly faster for some workloads.
Also, its scalability to very large filesystems is much more mature than ext3's.

-- 
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


Re: forkbombing Linux distributions

2005-03-20 Thread Peter Chubb
>>>>> "William" == William Beebe <[EMAIL PROTECTED]> writes:

William> Sure enough, I created the following script and ran it as a
William> non-root user:

William> #!/bin/bash $0 & $0 &

There are two approaches to fixing this.
  1.  Rate limit fork().  Unfortunately some legitimate uses do a lot
  of forking, and you don't really want to slow them down.
  2.  Limit (per user) the number of processes allowed. This is what's
  currently done; and if you as administrator want to you can set
  RLIMIT_NPROC (the "nproc" item) in /etc/security/limits.conf.

On an almost-single-user system such as most desktops, there isn't much
point in setting this.  On shared systems, it can be useful.
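
The same limit can also be applied from inside a process with setrlimit(). A minimal sketch, assuming Linux; the function name cap_nproc() is made up for the example, and it only ever lowers the soft limit, which needs no privilege:

```c
#include <sys/resource.h>

/*
 * Lower the soft RLIMIT_NPROC of the calling process to at most
 * `max_procs`.  Returns 0 on success, -1 on failure (errno is set by
 * getrlimit() or setrlimit()).
 */
static int cap_nproc(rlim_t max_procs)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NPROC, &rl) == -1)
        return -1;
    if (rl.rlim_cur > max_procs)      /* also covers RLIM_INFINITY */
        rl.rlim_cur = max_procs;
    return setrlimit(RLIMIT_NPROC, &rl);
}
```

Children inherit the limit, so a login shell or wrapper that calls this before exec'ing user code gets much the same effect as an "nproc" line in limits.conf. Note that the limit counts all processes of the user, not just descendants of the caller.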

-- 
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


Re: vm_dirty_ratio seems a bit large.

2005-03-17 Thread Peter Chubb
>>>>> "Andrew" == Andrew Morton <[EMAIL PROTECTED]> writes:

Andrew> Robin Holt <[EMAIL PROTECTED]> wrote:

>>  One other issue we have is the vm_dirty_ratio and background_ratio
>> adjustments are a little coarse with these memory sizes.  Since our
>> minimum adjustment is 1%, we are adjusting by 40GB on the largest
>> configuration from above.  The hardware we are shipping today is
>> capable of going to far greater amounts of memory, but we don't
>> have customers demanding that yet.  I would like to plan ahead for
>> that and change vm_dirty_ratio from a straight percent into a
>> millipercent (thousandth of a percent).  Would that type of change
>> be acceptable?

Andrew> Oh drat.  I think such a change would require a new set of
Andrew> /proc entries.  

No, you could just extend them to understand fixed point.  Keep
printing integers as integers, print non-integers with one (or two:
will we ever need 0.01% increments?) decimal places.
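
A rough sketch of what "understanding fixed point" could look like, written as ordinary user-space C rather than kernel code. The function names and the choice of hundredths of a percent as the internal unit are assumptions made for illustration; they are not taken from any patch.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Parse "N", "N.d" or "N.dd" (optionally newline-terminated) into
 * hundredths of a percent.  Returns -1 for malformed input or for
 * values above 100%.  (Hypothetical helper for this sketch.)
 */
long parse_centipercent(const char *s)
{
    long whole = 0, frac = 0;
    int digits = 0;

    if (!isdigit((unsigned char)*s))
        return -1;
    while (isdigit((unsigned char)*s)) {
        whole = whole * 10 + (*s++ - '0');
        if (whole > 100)
            return -1;
    }
    if (*s == '.') {
        s++;
        while (isdigit((unsigned char)*s) && digits < 2) {
            frac = frac * 10 + (*s++ - '0');
            digits++;
        }
        if (digits == 0)
            return -1;
        if (digits == 1)
            frac *= 10;        /* "0.5" means 50 hundredths */
    }
    if (*s == '\n')
        s++;
    if (*s != '\0' || (whole == 100 && frac != 0))
        return -1;
    return whole * 100 + frac;
}

/*
 * Print integers as integers and everything else with one or two
 * decimal places, so existing readers of whole values see no change.
 */
int format_centipercent(long v, char *buf, size_t len)
{
    if (v % 100 == 0)
        return snprintf(buf, len, "%ld", v / 100);
    if (v % 10 == 0)
        return snprintf(buf, len, "%ld.%ld", v / 100, (v % 100) / 10);
    return snprintf(buf, len, "%ld.%02ld", v / 100, v % 100);
}
```

A tool that writes "40" and reads back "40" keeps working, while a large machine can write "0.5".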

-- 
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


Can no longer build ipv6 built-in (2.6.11, today's BK head)

2005-03-15 Thread Peter Chubb


Changeset 
  [EMAIL PROTECTED]|ChangeSet|20050310043957|06845
added cleanup to ipv6_init(), which calls ip6_route_cleanup()

ip6_route_cleanup() is marked __exit so cannot be called from an
__init section -- it's discarded by the linker from the image
(although it'll be retained in a module).

You get errors like this:
ip6_route_cleanup: discarded in section `.exit.text' from net/built-in.o
xfrm6_fini: discarded in section `.exit.text' from net/built-in.o
fib6_gc_cleanup: discarded in section `.exit.text' from net/built-in.o
ipv6_packet_cleanup: discarded in section `.exit.text' from net/built-in.o


A simple fix is to delete the __exit from the various functions now that
they're called other than at module_exit.

Signed-off-by: Peter Chubb <[EMAIL PROTECTED]>

Index: linux-2.5-import/net/ipv6/route.c
===================================================================
--- linux-2.5-import.orig/net/ipv6/route.c	2005-03-16 10:12:44.742595387 +1100
+++ linux-2.5-import/net/ipv6/route.c	2005-03-16 13:01:50.246678866 +1100
@@ -2116,7 +2116,7 @@
 #endif
 }
 
-void __exit ip6_route_cleanup(void)
+void ip6_route_cleanup(void)
 {
 #ifdef CONFIG_PROC_FS
 	proc_net_remove("ipv6_route");
Index: linux-2.5-import/net/ipv6/ipv6_sockglue.c
===================================================================
--- linux-2.5-import.orig/net/ipv6/ipv6_sockglue.c	2005-03-16 10:12:44.736736056 +1100
+++ linux-2.5-import/net/ipv6/ipv6_sockglue.c	2005-03-16 13:24:19.095793200 +1100
@@ -698,7 +698,7 @@
 	dev_add_pack(&ipv6_packet_type);
 }
 
-void __exit ipv6_packet_cleanup(void)
+void ipv6_packet_cleanup(void)
 {
 	dev_remove_pack(&ipv6_packet_type);
 }
Index: linux-2.5-import/net/ipv6/ip6_fib.c
===================================================================
--- linux-2.5-import.orig/net/ipv6/ip6_fib.c	2005-03-15 12:28:44.819748921 +1100
+++ linux-2.5-import/net/ipv6/ip6_fib.c	2005-03-16 13:27:46.423351526 +1100
@@ -1218,7 +1218,7 @@
 		panic("cannot create fib6_nodes cache");
 }
 
-void __exit fib6_gc_cleanup(void)
+void fib6_gc_cleanup(void)
 {
 	del_timer(&ip6_fib_timer);
 	kmem_cache_destroy(fib6_node_kmem);
Index: linux-2.5-import/net/ipv6/xfrm6_policy.c
===================================================================
--- linux-2.5-import.orig/net/ipv6/xfrm6_policy.c	2005-03-15 12:28:44.853928319 +1100
+++ linux-2.5-import/net/ipv6/xfrm6_policy.c	2005-03-16 13:53:28.890552848 +1100
@@ -276,7 +276,7 @@
 	xfrm_policy_register_afinfo(&xfrm6_policy_afinfo);
 }
 
-static void __exit xfrm6_policy_fini(void)
+static void xfrm6_policy_fini(void)
 {
 	xfrm_policy_unregister_afinfo(&xfrm6_policy_afinfo);
 }
@@ -287,7 +287,7 @@
 	xfrm6_state_init();
 }
 
-void __exit xfrm6_fini(void)
+void xfrm6_fini(void)
 {
 	//xfrm6_input_fini();
 	xfrm6_policy_fini();
Index: linux-2.5-import/net/ipv6/xfrm6_state.c
===================================================================
--- linux-2.5-import.orig/net/ipv6/xfrm6_state.c	2005-03-15 12:28:44.854904874 +1100
+++ linux-2.5-import/net/ipv6/xfrm6_state.c	2005-03-16 13:29:30.183337361 +1100
@@ -129,7 +129,7 @@
 	xfrm_state_register_afinfo(&xfrm6_state_afinfo);
 }
 
-void __exit xfrm6_state_fini(void)
+void xfrm6_state_fini(void)
 {
 	xfrm_state_unregister_afinfo(&xfrm6_state_afinfo);
 }



-- 
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


Re: User mode drivers: part 1, interrupt handling (patch for 2.6.11)

2005-03-14 Thread Peter Chubb
>>>>> "Jon" == Jon Smirl <[EMAIL PROTECTED]> writes:

Jon> On Tue, 15 Mar 2005 14:47:42 +1100, Peter Chubb
Jon> <[EMAIL PROTECTED]> wrote:
>> What I really want to do is deprivilege the driver code as much as
>> possible.  Whatever a driver does, the rest of the system should
>> keep going.  That way malicious or buggy drivers can only affect
>> the processes that are trying to use the device they manage.
>> Moreover, it should be possible to kill -9 a driver, then restart
>> it, without the rest of the system noticing more than a hiccup.  To
>> do this, step one is to run the driver in user space, so that it's
>> subject to the same resource management control as any other
>> process.  Step two, which is a lot harder, is to connect the driver
>> back into the kernel so that it can be shared.  Tun/Tap can be used
>> for network devices, but it's really too slow -- you need zero-copy
>> and shared notification.

Jon> Have you considered running the drivers in a domain under Xen?

See the paper presented by Karlsruhe at OSDI:

Joshua LeVasseur, Volkmar Uhlig, Jan Stoess, and Stefan Götz:
Unmodified Device Driver Reuse and Improved System Dependability via
    Virtual Machines.  OSDI '04.

They're using L4, rather than Xen, as the paravirtualisation layer.

-- 
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


Re: User mode drivers: part 1, interrupt handling (patch for 2.6.11)

2005-03-14 Thread Peter Chubb
>>>>> "Jon" == Jon Smirl <[EMAIL PROTECTED]> writes:

Jon> On Mon, 14 Mar 2005 12:42:27 +1100, Peter Chubb
Jon> <[EMAIL PROTECTED]> wrote:
>> >>>>> "Jon" == Jon Smirl <[EMAIL PROTECTED]> writes:
>> 
>> >> The scenario I'm thinking about with these patches are things
>> >> like low-latency user-level networking between nodes in a
>> >> cluster, where for good performance even with a kernel driver
>> >> you don't want to share your interrupt line with anything else.
>> 
Jon> The code needs to refuse to install if the IRQ line is shared.
>>  It does.  The request_irq() call explicitly does not include
>> SA_SHARED in its flags, so if the line is shared, it'll return an
>> error to user space when the driver tries to open the file
>> representing the interrupt.

Jon> Please put some big comments warning people about adding
Jon> SA_SHARED. I can easily see someone thinking that they are fixing
Jon> a bug by adding it. I'd probably even write a paragraph about
Jon> what will happen if SA_SHARED is added.

Will do.  The main problem here is X86, as other architectures either
don't care, or have enough interrupt lines.  And the people who are
paying me for this kind of thing all run IA64.

What I really want to do is deprivilege the driver code as much as
possible.  Whatever a driver does, the rest of the system should keep
going.  That way malicious or buggy drivers can only affect the
processes that are trying to use the device they manage.  Moreover, it
should be possible to kill -9 a driver, then restart it, without the
rest of the system noticing more than a hiccup.  To do this,
step one is to run the driver in user space, so that it's subject to
the same resource management control as any other process.  Step two,
which is a lot harder, is to connect the driver back into the kernel
so that it can be shared.  Tun/Tap can be used for network devices,
but it's really too slow -- you need zero-copy and shared notification.


-- 
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


inode_lock heavily contended in 2.6.11

2005-03-13 Thread Peter Chubb

When running reaim7 on a 12-way IA64 on an ext2 filesystem on a ram
disc, I see very heavy contention on inode_lock.

lockstat output shows:

SPINLOCKS        HOLD            WAIT
  UTIL   CON   MEAN(  MAX )   MEAN(  MAX )(% CPU)     TOTAL NOWAIT  SPIN RJECT  NAME
 46.8% 52.4%  1.9us( 130us)   20us(8073us)(21.5%)   5072151  47.6% 52.4%    0%  inode_lock
 15.9% 59.5%  3.8us(  61us)   18us(7067us)( 3.9%)    852983  40.5% 59.5%    0%    __sync_single_inode+0xf0
  9.2% 59.0%  1.2us(  25us)   20us(8073us)( 7.8%)   1596487  41.0% 59.0%    0%    generic_osync_inode+0xe0

 (etc).

Is anyone else seeing this on more realistic workloads?

-- 
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


Re: User mode drivers: part 1, interrupt handling (patch for 2.6.11)

2005-03-13 Thread Peter Chubb
>>>>> "Jon" == Jon Smirl <[EMAIL PROTECTED]> writes:

Jon> On Mon, 14 Mar 2005 12:42:27 +1100, Peter Chubb
Jon> <[EMAIL PROTECTED]> wrote:
>> >>>>> "Jon" == Jon Smirl <[EMAIL PROTECTED]> writes:
>> 
>> >> The scenario I'm thinking about with these patches are things
>> >> like low-latency user-level networking between nodes in a
>> >> cluster, where for good performance even with a kernel driver
>> >> you don't want to share your interrupt line with anything else.

Jon> Instead of making up a new API what about making a library of
Jon> calls that emulates the common entry points used by device
Jon> drivers. The version I did for UML could take the same driver and
Jon> run it in user space or the kernel without changing source
Jon> code. I found this very useful.

The in-kernel device drivers interface is very large --- I want to
start with something a bit simpler.  We do have a compatibility
library, as yet unreleased, that allows the same drivers to run
in-kernel or in user space.

-- 
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


Re: User mode drivers: part 1, interrupt handling (patch for 2.6.11)

2005-03-13 Thread Peter Chubb
>>>>> "Jon" == Jon Smirl <[EMAIL PROTECTED]> writes:

Jon> On Sat, 12 Mar 2005 10:11:18 -0700 (MST), Zwane Mwaikambo
Jon> <[EMAIL PROTECTED]> wrote:
>> Alan's proposal sounds very plausible and additionally if we find
>> that we have an irq line screaming we could use the same supplied
>> information to disable userspace interrupt handled devices first.

Jon> I like it too and it would help Xen. Now we just need to modify
Jon> 800 device drivers to use it.

It's incomplete.  But you probably knew that...

The main problem I see is that even with the proposed interface, you'd
need to disable the interrupt in the interrupt controller, because
merely acknowledging an interrupt to a device doesn't stop it from
interrupting.  And you really want the device to stop asserting the
interrupt before doing an EOI, unless you're going to mask the
interrupt.  So you'd need to have an interface that not only
acknowledged the current interrupt but also prevented the device from
interrupting.  That typically means reading a status register (slow!)
and then setting one or more bits in one or more control registers.

Also for a user level driver you really want to do the EOI before
invoking user space.  Otherwise, depending on the interrupt
controller, lower numbered interrupts could be masked until the user
space returns --- which might be a long time off.

Reading the status register is typically one of the slowest
single parts of a device driver (latency can be > 2 usec), so you don't
really want to have to read it again within the driver... so you'd
probably want to pass it as part of the interrupt arguments to the
driver.
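
To make the ordering concrete, here is a toy user-space model in C of a single unshared interrupt line. It is only a sketch: every structure and function name is invented for illustration, and nothing here corresponds to a real kernel or interrupt-controller interface. It shows the status register being read once, the line being masked, the EOI happening before user space runs, and the saved status travelling with the event.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model of one unshared interrupt line; all names are hypothetical. */
struct sim_irq_line {
    bool device_asserting;   /* device is holding the line active */
    bool masked;             /* line disabled at the interrupt controller */
    bool in_service;         /* controller is still waiting for an EOI */
    uint32_t device_status;  /* contents of the device's status register */
    unsigned status_reads;   /* counts the (slow) status register reads */
};

/* What the kernel hands to the user-level driver with each interrupt. */
struct sim_irq_event {
    uint32_t status;
};

/* Kernel side: returns true if an event was queued for user space. */
bool sim_kernel_irq(struct sim_irq_line *line, struct sim_irq_event *ev)
{
    if (line->masked || !line->device_asserting)
        return false;             /* nothing is delivered */

    line->in_service = true;      /* controller delivers the interrupt */
    ev->status = line->device_status;
    line->status_reads++;         /* the one slow read, done once */
    line->masked = true;          /* device may keep asserting, harmlessly */
    line->in_service = false;     /* EOI now, before user space is woken */
    return true;
}

/* User side: the driver has dealt with the device and re-enables the line. */
void sim_user_driver_finish(struct sim_irq_line *line)
{
    line->device_asserting = false;
    line->device_status = 0;
    line->masked = false;
}
```

In this model a second call to sim_kernel_irq() while the user-level driver is still running delivers nothing, and the controller is never left waiting for an EOI across the trip to user space.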

-- 
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


Re: User mode drivers: part 1, interrupt handling (patch for 2.6.11)

2005-03-13 Thread Peter Chubb
>>>>> "Jon" == Jon Smirl <[EMAIL PROTECTED]> writes:

>>  The scenario I'm thinking about with these patches are things like
>> low-latency user-level networking between nodes in a cluster, where
>> for good performance even with a kernel driver you don't want to
>> share your interrupt line with anything else.

Jon> The code needs to refuse to install if the IRQ line is shared.

It does.  The request_irq() call explicitly does not include SA_SHARED
in its flags, so if the line is shared, it'll return an error to user
space when the driver tries to open the file representing the interrupt.

Jon> Also what about SMP, if you shut the IRQ off on one CPU isn't it
Jon> still enabled on all of the others?

Nope.   disable_irq_nosync() talks to the interrupt controller, which
is common to all the processors.  The main problem is that it's slow,
because it has to go off-chip.

-- 
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


Re: User mode drivers: part 1, interrupt handling (patch for 2.6.11)

2005-03-13 Thread Peter Chubb
>>>>> "Jon" == Jon Smirl <[EMAIL PROTECTED]> writes:

Jon> On Fri, 11 Mar 2005 11:29:20 +0100, Pavel Machek <[EMAIL PROTECTED]>
Jon> wrote:
>> Hi!
>> 
>> > As many of you will be aware, we've been working on
>> > infrastructure for user-mode PCI and other drivers.  The first
>> > step is to be able to handle interrupts from user
>> > space. Subsequent patches add infrastructure for setting up DMA
>> > for PCI devices.
>> >
>> > The user-level interrupt code doesn't depend on the other
>> > patches, and is probably the most mature of this patchset.
>> 
>> Okay, I like it; it means way easier PCI driver development.

Jon> It won't help with PCI driver development. I tried implementing
Jon> this for UML. If your driver has any bugs it won't get the
Jon> interrupts acknowledged correctly and you'll end up rebooting.

That's not actually true, at least when we developed drivers here.
The only times we had to reboot were the times we mucked up the dma
register settings, and dma'd all over the kernel by mistake...

-- 
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


Re: User mode drivers: part 1, interrupt handling (patch for 2.6.11)

2005-03-13 Thread Peter Chubb
>>>>> "Jon" == Jon Smirl <[EMAIL PROTECTED]> writes:

Jon> On Fri, 11 Mar 2005 14:36:10 +1100, Peter Chubb
Jon> <[EMAIL PROTECTED]> wrote:
>>  As many of you will be aware, we've been working on infrastructure
>> for user-mode PCI and other drivers.  The first step is to be able
>> to handle interrupts from user space. Subsequent patches add
>> infrastructure for setting up DMA for PCI devices.

Jon> I've tried implementing this before and could not get around the
Jon> interrupt problem. Most interrupts on the x86 architecture are
Jon> shared.  Disabling the IRQ at the PIC blocks all of the shared

Fortunately, most interrupts on IA64, ARM, etc.,  are unshared.  And
with PCI-Express, the problem will go away.  Even on X86, things
aren't all bad: one can usually find a PCI slot which doesn't share
interrupts with anything you care about.

The scenario I'm thinking about with these patches are things like
low-latency user-level networking between nodes in a cluster, where
for good performance even with a kernel driver you don't want to share
your interrupt line with anything else.

-- 
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


Re: User mode drivers: part 2: PCI device handling (patch 1/2 for 2.6.11)

2005-03-13 Thread Peter Chubb
>>>>> "Greg" == Greg KH <[EMAIL PROTECTED]> writes:

Greg> On Fri, Mar 11, 2005 at 07:34:46PM +1100, Peter Chubb wrote:
>> >>>>> "Greg" == Greg KH <[EMAIL PROTECTED]> writes:
>> 
Greg> On Fri, Mar 11, 2005 at 02:37:17PM +1100, Peter Chubb wrote:
>> >> +/*
>> >> + * The PCI subsystem is implemented as yet-another pseudo filesystem,
>> >> + * albeit one that is never mounted.
>> >> + * This is its magic number.
>> >> + */
>> >> +#define USR_PCI_MAGIC (0x12345678)
>> 
Greg> If you make it a real, mountable filesystem, then you don't need
Greg> to have any of your new syscalls, right?  Why not just do that
Greg> instead?
>> 
>> 
>> The only call that would go is usr_pci_open() -- you'd still need
>> usr_pci_map()

Greg> see mmap(2)

mmap maps a file's contents into your own virtual memory.
usr_pci_map maps part of your own virtual memory into pci bus space
for a particular device (using the IOMMU if your machine has one), and
returns a scatterlist of bus addresses to hand to the device.

Different semantics entirely.
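
For contrast, this is roughly all that mmap(2) gives you: a file's bytes appearing in your address space. The sketch below uses only standard POSIX calls; the helper name map_file() is made up for illustration. Nothing in it produces bus addresses or programs an IOMMU, which is the part usr_pci_map is for.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Map a whole file read-only into the caller's virtual memory.
 * Returns the mapping and stores its length in *len, or returns
 * NULL on failure (including an empty file).
 * (map_file is a hypothetical helper for this sketch.)
 */
const char *map_file(const char *path, size_t *len)
{
    struct stat st;
    void *p;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return NULL;
    if (fstat(fd, &st) != 0 || st.st_size == 0) {
        close(fd);
        return NULL;
    }
    p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                    /* the mapping outlives the descriptor */
    if (p == MAP_FAILED)
        return NULL;
    *len = (size_t)st.st_size;
    return p;
}
```

The caller gets virtual addresses it can read; a device doing DMA needs bus addresses for pinned pages, which is the other direction entirely.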


Greg> In fact, both of the above can be done today from /proc/bus/pci/
Greg> right?

Nope.

-- 
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


Re: User mode drivers: part 1, interrupt handling (patch for 2.6.11)

2005-03-13 Thread Peter Chubb

On Gwe, 2005-03-11 at 03:36, Peter Chubb wrote:
> +static irqreturn_t irq_proc_irq_handler(int irq, void *vidp, struct pt_regs *regs)
> +{
> +	struct irq_proc *idp = (struct irq_proc *)vidp;
> +
> +	BUG_ON(idp->irq != irq);
> +	disable_irq_nosync(irq);
> +	atomic_inc(&idp->count);
> +	wake_up(&idp->q);
> +	return IRQ_HANDLED;

Alan> You just deadlocked the machine in many configurations. You can't use
Alan> disable_irq for this trick you have to tell the kernel how to handle it.


Can you elaborate, please?  In particular, why doesn't essentially the
same action (disabling an interrupt before the EOI) in
note_interrupt() lock up the machine?

I can see there'd be problems if the code allowed shared interrupts,
but it doesn't.


--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: User mode drivers: part 1, interrupt handling (patch for 2.6.11)

2005-03-13 Thread Peter Chubb

On Gwe, 2005-03-11 at 03:36, Peter Chubb wrote:
 +static irqreturn_t irq_proc_irq_handler(int irq, void *vidp, struct pt_regs 
 *regs)
 +{
 + struct irq_proc *idp = (struct irq_proc *)vidp;
 + 
 + BUG_ON(idp-irq != irq);
 + disable_irq_nosync(irq);
 + atomic_inc(idp-count);
 + wake_up(idp-q);
 + return IRQ_HANDLED;

Alan You just deadlocked the machine in many configurations. You can't use
Alan disable_irq for this trick you have to tell the kernel how to handle it.


Can you elaborate, please?  In particular, why doesn't essentially the
same action (disabling an interrupt before the EOI) in
note_interrupt() not lock up the machine?

I can see there'd be problems if the code allowed shared interrupts,
but it doesn't.


--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: User mode drivers: part 2: PCI device handling (patch 1/2 for 2.6.11)

2005-03-13 Thread Peter Chubb
 Greg == Greg KH [EMAIL PROTECTED] writes:

Greg On Fri, Mar 11, 2005 at 07:34:46PM +1100, Peter Chubb wrote:
  Greg == Greg KH [EMAIL PROTECTED] writes:
 
Greg On Fri, Mar 11, 2005 at 02:37:17PM +1100, Peter Chubb wrote:
  +/* + * The PCI subsystem is implemented as yet-another pseudo
  filesystem, + * albeit one that is never mounted.  + * This is
 its  magic number.  + */ +#define USR_PCI_MAGIC (0x12345678)
 
Greg If you make it a real, mountable filesystem, then you don't need
Greg to have any of your new syscalls, right?  Why not just do that
Greg instead?
 
 
 The only call that would go is usr_pci_open() -- you'd still need
 usr_pci_map()

Greg see mmap(2)

mmap maps a file's contents into your own virtual memory.
usr_pci_map maps part of your own virtual memory into pci bus space
for a particular device (using the IOMMU if your machine has one), and
returns a scatterlist of bus addresses to hand to the device.

Different semantics entirely.


Greg In fact, both of the above can be done today from /proc/bus/pci/
Greg right?

Nope.

-- 
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


Re: User mode drivers: part 1, interrupt handling (patch for 2.6.11)

2005-03-13 Thread Peter Chubb
>>>>> "Jon" == Jon Smirl <[EMAIL PROTECTED]> writes:

Jon> On Fri, 11 Mar 2005 14:36:10 +1100, Peter Chubb
Jon> <[EMAIL PROTECTED]> wrote:
>> As many of you will be aware, we've been working on infrastructure
>> for user-mode PCI and other drivers.  The first step is to be able
>> to handle interrupts from user space. Subsequent patches add
>> infrastructure for setting up DMA for PCI devices.

Jon> I've tried implementing this before and could not get around the
Jon> interrupt problem. Most interrupts on the x86 architecture are
Jon> shared.  Disabling the IRQ at the PIC blocks all of the shared

Fortunately, most interrupts on IA64, ARM, etc.,  are unshared.  And
with PCI-Express, the problem will go away.  Even on X86, things
aren't all bad: one can usually find a PCI slot which doesn't share
interrupts with anything you care about.

The scenarios I'm thinking about with these patches are things like
low-latency user-level networking between nodes in a cluster, where
for good performance even with a kernel driver you don't want to share
your interrupt line with anything else.

-- 
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


Re: User mode drivers: part 1, interrupt handling (patch for 2.6.11)

2005-03-13 Thread Peter Chubb
>>>>> "Jon" == Jon Smirl <[EMAIL PROTECTED]> writes:

Jon> On Fri, 11 Mar 2005 11:29:20 +0100, Pavel Machek <[EMAIL PROTECTED]>
Jon> wrote:
>> Hi!
>>
>> > As many of you will be aware, we've been working on
>> > infrastructure for user-mode PCI and other drivers.  The first
>> > step is to be able to handle interrupts from user
>> > space. Subsequent patches add infrastructure for setting up DMA
>> > for PCI devices.
>>
>> > The user-level interrupt code doesn't depend on the other
>> > patches, and is probably the most mature of this patchset.
>>
>> Okay, I like it; it means way easier PCI driver development.

Jon> It won't help with PCI driver development. I tried implementing
Jon> this for UML. If your driver has any bugs it won't get the
Jon> interrupts acknowledged correctly and you'll end up rebooting.

That's not actually true, at least when we developed drivers here.
The only times we had to reboot were the times we mucked up the dma
register settings, and dma'd all over the kernel by mistake...

-- 
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


Re: User mode drivers: part 1, interrupt handling (patch for 2.6.11)

2005-03-13 Thread Peter Chubb
>>>>> "Jon" == Jon Smirl <[EMAIL PROTECTED]> writes:

>> The scenario I'm thinking about with these patches are things like
>> low-latency user-level networking between nodes in a cluster, where
>> for good performance even with a kernel driver you don't want to
>> share your interrupt line with anything else.

Jon> The code needs to refuse to install if the IRQ line is shared.

It does.  The request_irq() call explicitly does not include SA_SHIRQ
in its flags, so if the line is shared, it'll return an error to user
space when the driver tries to open the file representing the interrupt.
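
The exclusivity rule can be illustrated with a small user-space model.
This is only a sketch under stated assumptions: claim_irq(), irq_table
and IRQ_SHARABLE are made-up names for illustration, not the kernel's
request_irq() implementation.  It shows the behaviour being relied on:
a second claim on a line succeeds only if both parties asked to share.

```c
#include <errno.h>

#define NR_IRQS      16
#define IRQ_SHARABLE 0x1u   /* hypothetical stand-in for a "may share" flag */

struct irq_line {
	int users;            /* number of handlers currently installed */
	unsigned int flags;   /* flags supplied by the first handler */
};

static struct irq_line irq_table[NR_IRQS];

/*
 * Claim an interrupt line.  Returns 0 on success, -EINVAL for a bad
 * line number, and -EBUSY if the line is already in use and either
 * the existing owner or the new caller did not agree to share.
 */
int claim_irq(unsigned int irq, unsigned int flags)
{
	struct irq_line *line;

	if (irq >= NR_IRQS)
		return -EINVAL;

	line = &irq_table[irq];
	if (line->users > 0 &&
	    !((line->flags & IRQ_SHARABLE) && (flags & IRQ_SHARABLE)))
		return -EBUSY;

	if (line->users == 0)
		line->flags = flags;
	line->users++;
	return 0;
}
```

In this model a user-level driver that passes no sharing flag either
gets the line to itself or gets -EBUSY back, which is the error that
would be reported to user space on open.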

Jon> Also what about SMP, if you shut the IRQ off on one CPU isn't it
Jon> still enabled on all of the others?

Nope.   disable_irq_nosync() talks to the interrupt controller, which
is common to all the processors.  The main problem is that it's slow,
because it has to go off-chip.

-- 
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


Re: User mode drivers: part 1, interrupt handling (patch for 2.6.11)

2005-03-13 Thread Peter Chubb
>>>>> "Jon" == Jon Smirl <[EMAIL PROTECTED]> writes:

Jon> On Sat, 12 Mar 2005 10:11:18 -0700 (MST), Zwane Mwaikambo
Jon> <[EMAIL PROTECTED]> wrote:
>> Alan's proposal sounds very plausible and additionally if we find
>> that we have an irq line screaming we could use the same supplied
>> information to disable userspace interrupt handled devices first.

Jon> I like it too and it would help Xen. Now we just need to modify
Jon> 800 device drivers to use it.

It's incomplete.  But you probably knew that...

The main problem I see is that even with the proposed interface, you'd
need to disable the interrupt in the interrupt controller, because
merely acknowledging an interrupt to a device doesn't stop it from
interrupting.  And you really want the device to stop asserting the
interrupt before doing an EOI, unless you're going to mask the
interrupt.  So you'd need to have an interface that not only
acknowledged the current interrupt but also prevented the device from
interrupting.  That typically means reading a status register (slow!)
and then setting one or more bits in one or more control registers.

Also for a user level driver you really want to do the EOI before
invoking user space.  Otherwise, depending on the interrupt
controller, lower numbered interrupts could be masked until the user
space returns --- which might be a long time off.

Reading the status register is typically one of the slowest
single parts of a device driver (latency can be  2 usec), so you don't
really want to have to read it again within the driver... so you'd
probably want to pass it as part of the interrupt arguments to the
driver.
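
The ordering argued for above can be modelled as a tiny state machine.
This is a sketch only: struct irq_model, kernel_irq_stub() and
user_driver_done() are hypothetical names, and the device status is
simply passed in as a number rather than read from hardware.  The
point is the sequence: mask at the controller, EOI, hand the saved
status to user space, and unmask only once the driver has quietened
the device.

```c
struct irq_model {
	int masked;                /* line masked at the interrupt controller */
	int in_service;            /* controller is still waiting for an EOI */
	int user_pending;          /* user-level driver has work to do */
	unsigned int saved_status; /* device status handed to the driver */
};

/* Kernel-side stub for an interrupt owned by a user-level driver. */
void kernel_irq_stub(struct irq_model *m, unsigned int device_status)
{
	m->in_service = 1;               /* the controller delivered the interrupt */
	m->masked = 1;                   /* 1. mask: the device may keep asserting */
	m->in_service = 0;               /* 2. EOI before user space runs */
	m->saved_status = device_status; /* 3. keep status so it is not re-read */
	m->user_pending = 1;             /* 4. wake the user-level driver */
}

/* Called once the user-level driver has stopped the device interrupting. */
unsigned int user_driver_done(struct irq_model *m)
{
	unsigned int status = m->saved_status;

	m->user_pending = 0;
	m->masked = 0;                   /* safe to unmask the line again */
	return status;
}
```

Because the EOI happens inside kernel_irq_stub(), nothing else is held
up at the controller however long the user-level driver takes to run.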

-- 
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


Re: User mode drivers: part 1, interrupt handling (patch for 2.6.11)

2005-03-13 Thread Peter Chubb
>>>>> "Jon" == Jon Smirl <[EMAIL PROTECTED]> writes:

Jon> On Mon, 14 Mar 2005 12:42:27 +1100, Peter Chubb
Jon> <[EMAIL PROTECTED]> wrote:
>> >>>>> "Jon" == Jon Smirl <[EMAIL PROTECTED]> writes:
>>
>> > The scenario I'm thinking about with these patches are things
>> > like low-latency user-level networking between nodes in a
>> > cluster, where for good performance even with a kernel driver
>> > you don't want to share your interrupt line with anything else.

Jon> Instead of making up a new API what about making a library of
Jon> calls that emulates the common entry points used by device
Jon> drivers. The version I did for UML could take the same driver and
Jon> run it in user space or the kernel without changing source
Jon> code. I found this very useful.

The in-kernel device drivers interface is very large --- I want to
start with something a bit simpler.  We do have a compatibility
library, as yet unreleased, that allows the same drivers to run
in-kernel or in user space.

-- 
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


inode_lock heavily contended in 2.6.11

2005-03-13 Thread Peter Chubb

When running reaim7 on a 12-way IA64 on an ext2 filesystem on a ram
disc, I see very heavy contention on inode_lock.

lockstat output shows:

SPINLOCKS         HOLD            WAIT
  UTIL   CON    MEAN(  MAX )    MEAN(  MAX )(% CPU)     TOTAL NOWAIT  SPIN RJECT  NAME
 46.8% 52.4%   1.9us( 130us)    20us(8073us)(21.5%)   5072151  47.6% 52.4%    0%  inode_lock
 15.9% 59.5%   3.8us(  61us)    18us(7067us)( 3.9%)    852983  40.5% 59.5%    0%    __sync_single_inode+0xf0
  9.2% 59.0%   1.2us(  25us)    20us(8073us)( 7.8%)   1596487  41.0% 59.0%    0%    generic_osync_inode+0xe0

 (etc).

Is anyone else seeing this on more realistic workloads?

-- 
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


Re: Microstate Accounting for 2.6.11

2005-03-11 Thread Peter Chubb
>>>>> "Andi" == Andi Kleen <[EMAIL PROTECTED]> writes:

Andi> Andrew Morton <[EMAIL PROTECTED]> writes:
>> Why does the kernel need this feature?
>> 
>> Have you any numbers on the overhead?

Andi> It does RDTSC and lots of complicated stuff twice for each
Andi> system call.  On P4 this will be extremly slow (> 1000cycles
Andi> combined) It is pretty unlikely that whatever it does justifies
Andi> this extreme overhead in a critical fast path.

Not really `lots of complicated stuff'.  Just swap a timer and set a
flag on entry:

msp->timers[msp->laststate] += now - msp->lastchange
msp->lastchange = now
msp->laststate = ONCPU_SYS
msp->cflags |= MSA_SYS


And swap timers and clear the flag on exit.  The flag's needed to
force return to ONCPU_SYS rather than ONCPU_USR if the task is preempted or
interrupted while in a system call.

If there's a simpler, cheaper, faster way to track time spent in
system calls (as opposed to time spent in interrupt handlers, or on
the run queue) then I'd like to know what it is.
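
To make the bookkeeping concrete, here is a minimal user-space sketch
of the timer swap.  It is not the code from the patch: the struct
layout, the INTERRUPTED state and the helper names are assumptions
made for illustration, and timestamps are passed in by hand instead
of being read from a clock.

```c
typedef unsigned long long clk_t;

/* Hypothetical state indices; the patch defines its own set. */
enum msa_state { ONCPU_USR, ONCPU_SYS, INTERRUPTED, NR_STATES };

#define MSA_SYS 0x1u   /* set while the task is inside a system call */

struct microstates {
	clk_t timers[NR_STATES];   /* time accumulated in each state */
	clk_t lastchange;          /* timestamp of the last state change */
	enum msa_state laststate;  /* state currently being timed */
	unsigned int cflags;
};

/* Charge the elapsed time to the old state, then start timing the new one. */
void msa_switch_state(struct microstates *msp, enum msa_state next, clk_t now)
{
	msp->timers[msp->laststate] += now - msp->lastchange;
	msp->lastchange = now;
	msp->laststate = next;
}

void msa_syscall_entry(struct microstates *msp, clk_t now)
{
	msa_switch_state(msp, ONCPU_SYS, now);
	msp->cflags |= MSA_SYS;
}

void msa_syscall_exit(struct microstates *msp, clk_t now)
{
	msp->cflags &= ~MSA_SYS;
	msa_switch_state(msp, ONCPU_USR, now);
}

/* On return from an interrupt the flag selects which state to resume. */
void msa_irq_return(struct microstates *msp, clk_t now)
{
	msa_switch_state(msp,
			 (msp->cflags & MSA_SYS) ? ONCPU_SYS : ONCPU_USR, now);
}
```

For example, entering a system call at t=100, taking an interrupt
from t=130 to t=150 and leaving the system call at t=190 charges 100
ticks to ONCPU_USR, 70 to ONCPU_SYS and 20 to INTERRUPTED; without the
flag the 40 ticks after the interrupt would be charged to user time.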

And I recognise there are lots of people who don't want this ---
but there are some who do.  I've maintained this patch since mid 2003,
and have seen a steady trickle of downloads --- one or two a week.

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


Re: User mode drivers: part 2: PCI device handling (patch 1/2 for 2.6.11)

2005-03-11 Thread Peter Chubb
>>>>> "Greg" == Greg KH <[EMAIL PROTECTED]> writes:

Greg> On Fri, Mar 11, 2005 at 02:37:17PM +1100, Peter Chubb wrote:
>> +/* + * The PCI subsystem is implemented as yet-another pseudo
>> filesystem, + * albeit one that is never mounted.  + * This is its
>> magic number.  + */ +#define USR_PCI_MAGIC (0x12345678)

Greg> If you make it a real, mountable filesystem, then you don't need
Greg> to have any of your new syscalls, right?  Why not just do that
Greg> instead?


The only call that would go is usr_pci_open() -- you'd still need 
usr_pci_map(), usr_pci_unmap() and usr_pci_get_consistent().

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*



Re: Microstate Accounting for 2.6.11

2005-03-10 Thread Peter Chubb
>>>>> "Andrew" == Andrew Morton <[EMAIL PROTECTED]> writes:

Andrew> Peter Chubb <[EMAIL PROTECTED]> wrote:
>>  Timing data on threads at present is pretty crude: when the timer
>> interrupt occurs, a tick is added to either system time or user
>> time for the currently running thread.  Thus in an unpatched kernel
>> one can distinguish three timed states: On-cpu in userspace, on-cpu
>> in system space, and not running.
>> 
>> The actual number of states is much larger.  A thread can be on a
>> runqueue or the expired queue (i.e., ready to run but not running),
>> sleeping on a semaphore or on a futex, having its time stolen to
>> service an interrupt, etc., etc.
>> 
>> This patch adds timers per-state to each struct task_struct, so
>> that time in all these states can be tracked.  This patch contains
>> the core code to do the timing, and to initialise the timers.
>> Subsequent patches enable the code (by adding Kconfig options) and
>> add hooks to track state changes.

Andrew> Why does the kernel need this feature?

I find that it's useful when trying to work out why a thread is going
more slowly than it needs to.  Userspace tools in the CVS repository
at gelato.unsw.edu.au let you graph in real time the time spent in
each state, so you get graphs like this:

 http://gelato.unsw.edu.au/patches/snapshot.png

which shows mplay skipping because of a slow disk/filesystem.

Andrew> Have you any numbers on the overhead?

Around 5% on LMbench context switch numbers for uniprocessor,
negligible on SMP (but SMP context switch results are horrible at the
moment according to LMbench2 -- almost 16usec); select on 10 fd goes
from 1.665 usec to 1.701 usec.

Andrew> The preempt_disable() in sys_msa() seems odd.

Yes, I only added that yesterday.  It's to prevent migration while
updating the current timer.  All the other places where the current
timer is updated are naturally protected against this.  It should
probably be a local_irq_disable() instead.

Peter C



Microstate accounting, IA64 support

2005-03-10 Thread Peter Chubb
Microstate Accounting: 
Add support for IA64.


 linux-2.6-ustate/arch/ia64/Kconfig   |   25 +++
 linux-2.6-ustate/arch/ia64/kernel/entry.S|   44 +++
 linux-2.6-ustate/arch/ia64/kernel/irq_ia64.c |   21 +++-
 linux-2.6-ustate/arch/ia64/kernel/ivt.S  |8 +++-
 linux-2.6-ustate/include/asm-ia64/msa.h  |   33 
 linux-2.6-ustate/include/asm-ia64/unistd.h   |1 
 7 files changed, 129 insertions(+), 5 deletions(-)

Index: linux-2.6-ustate/arch/ia64/Kconfig
===
--- linux-2.6-ustate.orig/arch/ia64/Kconfig 2005-03-10 09:13:01.780632777 
+1100
+++ linux-2.6-ustate/arch/ia64/Kconfig  2005-03-10 09:16:14.593655619 +1100
@@ -302,6 +302,31 @@
  little bigger and slows down execution a bit, but it is generally
  a good idea to turn this on.  If you're unsure, say Y.
 
+config MICROSTATE
+   bool "Microstate accounting"
+   help
+ This option causes the kernel to keep very accurate track of
+ how long your threads spend on the runqueues, running, or asleep or
+ stopped.  It will slow down your kernel.
+ Times are reported in /proc/pid/msa and through a new msa()
+ system call.
+choice
+   depends on MICROSTATE
+   prompt "Microstate timing source"
+   default MICROSTATE_ITC
+   help
+  On IA64 one can use two timing sources for the microstate
+  accounting;  the on-chip interval counter, or Linux's
+  time-of-day clock.  The first is very cheap; the other is
+  more accurate on SMP systems.
+
+config MICROSTATE_ITC
+   bool "Use on-chip ITC for microstate timing"
+ 
+config MICROSTATE_TOD
+   bool "Use time-of-day clock for microstate timings"
+endchoice
+
 config IA64_PALINFO
tristate "/proc/pal support"
help
Index: linux-2.6-ustate/include/asm-ia64/msa.h
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ linux-2.6-ustate/include/asm-ia64/msa.h 2005-03-10 09:16:14.594632174 
+1100
@@ -0,0 +1,33 @@
+/*
+ * asm-ia64/msa.h
+ *
+ * Provide an architecture-specific clock.
+ */
+
+#ifndef _ASM_IA64_MSA_H
+#define _ASM_IA64_MSA_H
+
+#include 
+#include 
+#include 
+
+
+# if defined(CONFIG_MICROSTATE_ITC)
+#   define MSA_NOW(now)  do { now = (clk_t)get_cycles(); } while (0)
+
+#   define MSA_TO_NSEC(clk) ((1000000000*clk) / cpu_data(smp_processor_id())->itc_freq)
+
+# elif defined(CONFIG_MICROSTATE_TOD)
+static inline void msa_now(clk_t *nsp) {
+   struct timeval tv;
+   do_gettimeofday(&tv);
+   *nsp = (clk_t)tv.tv_sec * 1000000 + tv.tv_usec;
+}
+#   define MSA_NOW(x) msa_now(&x)
+#   define MSA_TO_NSEC(clk) ((clk) * 1000)
+
+# else
+#  include 
+# endif
+
+#endif /* _ASM_IA64_MSA_H */
Microstate Accounting: Track time in system calls for IA64

 arch/ia64/kernel/entry.S |   44 
 arch/ia64/kernel/ivt.S   |8 ++--
 2 files changed, 50 insertions(+), 2 deletions(-)

Index: linux-2.6-ustate/arch/ia64/kernel/entry.S
===
--- linux-2.6-ustate.orig/arch/ia64/kernel/entry.S  2005-03-10 
09:13:01.149778160 +1100
+++ linux-2.6-ustate/arch/ia64/kernel/entry.S   2005-03-10 09:16:15.157128068 
+1100
@@ -589,6 +589,46 @@
 .ret4: br.cond.sptk ia64_leave_kernel
 END(ia64_strace_leave_kernel)
 
+#ifdef CONFIG_MICROSTATE
+/*
+ * preserve input registers,
+ * and r8
+ */
+GLOBAL_ENTRY(invoke_msa_end_syscall)
+   .prologue ASM_UNW_PRLG_RP|ASM_UNW_PRLG_PFS, ASM_UNW_PRLG_GRSAVE(8)
+   alloc loc1=ar.pfs,8,4,0,0
+   mov loc0=rp
+   .body
+   ;;
+   mov loc2=ret0
+   mov loc3=ret2
+   br.call.sptk.many rp=msa_end_syscall
+1: mov rp=loc0
+   mov ret0=loc2
+   mov ret2=loc3
+   mov ar.pfs=loc1
+   br.ret.sptk.many rp
+END(invoke_msa_end_syscall)
+/*
+ * Preserves in0-7, and all callee-save registers.
+ */
+GLOBAL_ENTRY(invoke_msa_start_syscall)
+   .prologue ASM_UNW_PRLG_RP|ASM_UNW_PRLG_PFS, ASM_UNW_PRLG_GRSAVE(8)
+   alloc loc1=ar.pfs,8,4,0,0
+   mov loc0=rp
+   .body
+   mov loc2=r3
+   mov loc3=r15
+   ;;
+   br.call.sptk.many rp=msa_start_syscall
+1: mov r15=loc3
+   mov r3=loc2
+   mov ar.pfs=loc1
+   mov rp=loc0
+   br.ret.sptk.many rp
+END(invoke_msa_start_syscall)
+#endif /* CONFIG_MICROSTATE */
+
 GLOBAL_ENTRY(ia64_ret_from_clone)
PT_REGS_UNWIND_INFO(0)
 {  /*
@@ -671,6 +711,10 @@
  */
 ENTRY(ia64_leave_syscall)
PT_REGS_UNWIND_INFO(0)
+#ifdef CONFIG_MICROSTATE
+   br.call.sptk.many rp=invoke_msa_end_syscall
+1: 
+#endif
/*
 * work.need_resched etc. mustn't get changed by this CPU before it 
returns to
 * user- or fsys-mode, hence we 

Microstate Accounting for 2.6.11, patch 4/6

2005-03-10 Thread Peter Chubb
Microstate accounting:  Account for time in interrupt handlers for I386.

 arch/i386/kernel/irq.c |   13 -
 1 files changed, 12 insertions(+), 1 deletion(-)


Index: linux-2.6-ustate/arch/i386/kernel/irq.c
===
--- linux-2.6-ustate.orig/arch/i386/kernel/irq.c2005-03-10 
09:13:00.115606274 +1100
+++ linux-2.6-ustate/arch/i386/kernel/irq.c 2005-03-10 09:16:16.032121680 
+1100
@@ -55,6 +55,8 @@
 #endif
 
irq_enter();
+   msa_start_irq(irq);
+   
 #ifdef CONFIG_DEBUG_STACKOVERFLOW
/* Debugging check for stack overflow: is there less than 1KB free? */
{
@@ -101,6 +103,7 @@
 #endif
__do_IRQ(irq, regs);
 
+   msa_finish_irq(irq);
irq_exit();
 
return 1;
@@ -221,10 +224,18 @@
seq_printf(p, "%3d: ",i);
 #ifndef CONFIG_SMP
seq_printf(p, "%10u ", kstat_irqs(i));
+#ifdef CONFIG_MICROSTATE
+   seq_printf(p, "%10llu", msa_irq_time(0, i));
+#endif
 #else
for (j = 0; j < NR_CPUS; j++)
-   if (cpu_online(j))
+   if (cpu_online(j)) {
seq_printf(p, "%10u ", kstat_cpu(j).irqs[i]);
+#ifdef CONFIG_MICROSTATE
+   seq_printf(p, "%10llu", msa_irq_time(j, i));
+#endif
+   }
+
 #endif
seq_printf(p, " %14s", irq_desc[i].handler->typename);
seq_printf(p, "  %s", action->name);


Microstate Accounting for 2.6.11, patch 6/6

2005-03-10 Thread Peter Chubb


Microstate accounting: Track time spent asleep while paging,
in poll() or select(), or on a futex separately from other sleeps.

 fs/select.c |2 ++
 kernel/futex.c |2 ++
 mm/memory.c |6 +-


Index: linux-2.6-ustate/mm/memory.c
===
--- linux-2.6-ustate.orig/mm/memory.c   2005-03-10 09:12:59.492564100 +1100
+++ linux-2.6-ustate/mm/memory.c2005-03-10 09:16:16.583875465 +1100
@@ -2079,6 +2079,7 @@
if (is_vm_hugetlb_page(vma))
return VM_FAULT_SIGBUS; /* mapping truncation does this. */
 
+   msa_next_state(current, PAGING_SLEEP);
/*
 * We need the page table lock to synchronize with kswapd
 * and the SMP-safe atomic PTE updates.
@@ -2098,10 +2099,13 @@
if (!pte)
goto oom;

-   return handle_pte_fault(mm, vma, address, write_access, pte, pmd);
+   int ret = handle_pte_fault(mm, vma, address, write_access, pte, pmd);
+   msa_next_state(current, MSA_UNKNOWN);
+   return ret;
 
  oom:
spin_unlock(>page_table_lock);
+   msa_next_state(current, MSA_UNKNOWN);
return VM_FAULT_OOM;
 }
 

Index: linux-2.6-ustate/kernel/futex.c
===
--- linux-2.6-ustate.orig/kernel/futex.c2005-03-10 09:12:58.843154938 
+1100
+++ linux-2.6-ustate/kernel/futex.c 2005-03-10 09:16:17.109262256 +1100
@@ -39,6 +39,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define FUTEX_HASHBITS (CONFIG_BASE_SMALL ? 4 : 8)
 
@@ -571,6 +572,7 @@
 * wakes us up.
 */
 
+   msa_next_state(current, FUTEX_SLEEP);
/* add_wait_queue is the barrier after __set_current_state. */
__set_current_state(TASK_INTERRUPTIBLE);
add_wait_queue(, );


Index: linux-2.6-ustate/fs/select.c
===
--- linux-2.6-ustate.orig/fs/select.c   2005-03-10 09:12:59.182996124 +1100
+++ linux-2.6-ustate/fs/select.c2005-03-10 09:16:16.843639194 +1100
@@ -256,6 +256,7 @@
retval = table.error;
break;
}
+   msa_next_state(current, POLL_SLEEP);
__timeout = schedule_timeout(__timeout);
}
__set_current_state(TASK_RUNNING);
@@ -447,6 +448,7 @@
count = wait->error;
if (count)
break;
+   msa_next_state(current, POLL_SLEEP);
timeout = schedule_timeout(timeout);
}
__set_current_state(TASK_RUNNING);


Microstate Accounting for 2.6.11, patch 5/6

2005-03-10 Thread Peter Chubb
Microstate accounting: Add the I386 system call.

 arch/i386/kernel/entry.S  |2 +-
 include/asm-i386/unistd.h |2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

Index: linux-2.6-ustate/arch/i386/kernel/entry.S
===
--- linux-2.6-ustate.orig/arch/i386/kernel/entry.S  2005-03-10 
09:16:14.888575341 +1100
+++ linux-2.6-ustate/arch/i386/kernel/entry.S   2005-03-10 09:16:15.446188457 
+1100
@@ -876,7 +876,7 @@
.long sys_mq_getsetattr
.long sys_ni_syscall/* reserved for kexec */
.long sys_waitid
-   .long sys_ni_syscall/* 285 */ /* available */
+   .long sys_msa   /* 285 */ /* available */
.long sys_add_key
.long sys_request_key
.long sys_keyctl
Index: linux-2.6-ustate/include/asm-i386/unistd.h
===
--- linux-2.6-ustate.orig/include/asm-i386/unistd.h 2005-03-10 
09:13:00.813843194 +1100
+++ linux-2.6-ustate/include/asm-i386/unistd.h  2005-03-10 09:16:15.448141568 
+1100
@@ -290,7 +290,7 @@
 #define __NR_mq_getsetattr (__NR_mq_open+5)
 #define __NR_sys_kexec_load283
 #define __NR_waitid284
-/* #define __NR_sys_setaltroot 285 */
+#define __NR_sys_msa   285
 #define __NR_add_key   286
 #define __NR_request_key   287
 #define __NR_keyctl288


Microstate Accounting for 2.6.11, patch 3/6

2005-03-10 Thread Peter Chubb

Microstate accounting:  

Provide I386-dependent MSA clocks, and Kconfig options.

 arch/i386/Kconfig  |   39 ++-
 include/asm-i386/msa.h |   49 +
 2 files changed, 87 insertions(+), 1 deletion(-)

Signed-off-by: Peter Chubb <[EMAIL PROTECTED]>

Index: linux-2.6-ustate/arch/i386/Kconfig
===
--- linux-2.6-ustate.orig/arch/i386/Kconfig 2005-03-11 09:59:38.773632446 
+1100
+++ linux-2.6-ustate/arch/i386/Kconfig  2005-03-11 09:59:38.777538666 +1100
@@ -923,8 +923,45 @@
 
  If unsure, say Y. Only embedded should say N here.
 
-endmenu
+config MICROSTATE
+   bool "Microstate accounting"
+   help
+ This option causes the kernel to keep very accurate track of
+how long your threads spend on the runqueues, running, or asleep or
+stopped.  It will slow down your kernel.
+Times are reported in /proc/pid/msa and through a new msa()
+system call.
+
+choice 
+   depends on MICROSTATE
+   prompt "Microstate timing source"
+   default MICROSTATE_TSC
+
+config MICROSTATE_PM
+   bool "Use Power-Management timer for microstate timings"
+   depends on X86_PM_TIMER
+   help
+If your machine is ACPI enabled and uses power-management, then the 
+TSC runs at a variable rate, which will distort the 
+microstate measurements.  This timer, although having
+slightly more overhead, and a lower resolution (279
+nanoseconds or so) will always run at a constant rate.
+
+config MICROSTATE_TSC
+   bool "Use on-chip TSC for microstate timings"
+   depends on X86_TSC
+   help
+ If your machine's clock runs at constant rate, then this timer 
+gives you cycle precision in measuring times spent in microstates.
+
+config MICROSTATE_TOD
+   bool "Use time-of-day clock for microstate timings"
+   help
+ If none of the other timers are any good for you, this timer 
+will give you micro-second precision.
+endchoice
 
+endmenu
 
 menu "Power management options (ACPI, APM)"
depends on !X86_VOYAGER
Index: linux-2.6-ustate/include/asm-i386/msa.h
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ linux-2.6-ustate/include/asm-i386/msa.h 2005-03-11 09:59:38.779491777 
+1100
@@ -0,0 +1,49 @@
+/*
+ * asm-i386/msa.h
+ *
+ * Provide an architecture-specific clock.
+ */
+
+#ifndef _ASM_I386_MSA_H
+# define _ASM_I386_MSA_H
+
+# include 
+
+
+# if defined(CONFIG_MICROSTATE_TSC)
+/*
+ * Use the processor's time-stamp counter as a timesource
+ */
+#  include 
+#  include 
+
+#  define MSA_NOW(now)  rdtscll(now)
+
+extern unsigned long cpu_khz;
+#  define MSA_TO_NSEC(clk) ({ clk_t _x = ((clk) * 1000000ULL); do_div(_x, cpu_khz); _x; })
+
+# elif defined(CONFIG_MICROSTATE_PM)
+/*
+ * Use the system's monotonic clock as a timesource.
+ * This will only be enabled if the Power Management Timer is enabled.
+ */
+unsigned long long monotonic_clock(void);
+#  define MSA_NOW(now) do { now = monotonic_clock(); } while (0)
+#  define MSA_TO_NSEC(clk) (clk)
+
+# elif defined(CONFIG_MICROSTATE_TOD)
+/*
+ * Fall back to gettimeofday.
+ * This one is incompatible with interrupt-time measurement on some processors.
+ */
+static inline void msa_now(clk_t *nsp) {
+   struct timeval tv;
+   do_gettimeofday(&tv);
+   *nsp = (clk_t)tv.tv_sec * 1000000 + tv.tv_usec;
+}
+#   define MSA_NOW(x) msa_now(&x)
+#   define MSA_TO_NSEC(clk) ((clk) * 1000)
+# endif
+
+
+#endif /* _ASM_I386_MSA_H */
I386
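
For reference, the unit conversions behind the three clock sources
can be checked in user space.  The helpers below are a sketch with
made-up names (tsc_to_nsec(), timeval_to_usec(), usec_to_nsec()); they
mirror the arithmetic only, and use plain 64-bit division where kernel
code on a 32-bit machine would need do_div().

```c
typedef unsigned long long clk_t;

/*
 * TSC ticks to nanoseconds, given the CPU clock in kHz:
 * ticks / (khz * 1000) seconds == ticks * 1000000 / khz nanoseconds.
 */
clk_t tsc_to_nsec(clk_t ticks, unsigned long cpu_khz)
{
	return (ticks * 1000000ULL) / cpu_khz;
}

/* A gettimeofday-style (seconds, microseconds) pair as microseconds. */
clk_t timeval_to_usec(unsigned long sec, unsigned long usec)
{
	/* Widen before multiplying so a 32-bit seconds value cannot overflow. */
	return (clk_t)sec * 1000000ULL + usec;
}

/* Microseconds to nanoseconds, as for the time-of-day clock source. */
clk_t usec_to_nsec(clk_t usec)
{
	return usec * 1000ULL;
}
```

On a 2 GHz machine (cpu_khz of 2000000), two million ticks come out as
1000000 ns, i.e. one millisecond, which is a quick sanity check on the
scale factor.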


Microstate Accounting for 2.6.11, patch 2/6

2005-03-10 Thread Peter Chubb
Microstate Accounting:
Add hooks into the scheduler to track state changes.
Arrange for parent process's child times to be updated at process exit. 


 kernel/sched.c |8 
 kernel/exit.c  |3 +++

Index: linux-2.6-ustate/kernel/sched.c
===
--- linux-2.6-ustate.orig/kernel/sched.c2005-03-11 09:59:31.109628035 
+1100
+++ linux-2.6-ustate/kernel/sched.c 2005-03-11 09:59:31.116463921 +1100
@@ -635,6 +635,7 @@
  */
 static inline void __activate_task(task_t *p, runqueue_t *rq)
 {
+   msa_set_timer(p, ONACTIVEQUEUE);
enqueue_task(p, rq->active);
rq->nr_running++;
 }
@@ -1238,6 +1239,7 @@
if (unlikely(!current->array))
__activate_task(p, rq);
else {
+   msa_set_timer(p, ONACTIVEQUEUE);
p->prio = current->prio;
list_add_tail(>run_list, >run_list);
p->array = current->array;
@@ -2422,6 +2424,7 @@
if (!rq->expired_timestamp)
rq->expired_timestamp = jiffies;
if (!TASK_INTERACTIVE(p) || EXPIRED_STARVING(rq)) {
+   msa_next_state(p, ONEXPIREDQUEUE);
enqueue_task(p, rq->expired);
if (p->static_prio < rq->best_expired_prio)
rq->best_expired_prio = p->static_prio;
@@ -2733,6 +2736,7 @@
array = rq->active;
rq->expired_timestamp = 0;
rq->best_expired_prio = MAX_PRIO;
+   msa_flip_expired(prev);
} else
schedstat_inc(rq, sched_noswitch);
 
@@ -2773,6 +2777,8 @@
rq->curr = next;
++*switch_count;
 
+   msa_switch(prev, next);
+
prepare_arch_switch(rq, next);
prev = context_switch(rq, prev, next);
barrier();
@@ -3693,6 +3699,8 @@
 */
if (rt_task(current))
target = rq->active;
+   else
+   msa_next_state(current, ONEXPIREDQUEUE);
 
if (current->array->nr_active == 1) {
schedstat_inc(rq, yld_act_empty);


Index: linux-2.6-ustate/kernel/exit.c
===
--- linux-2.6-ustate.orig/kernel/exit.c 2005-03-11 09:59:36.360564796 +1100
+++ linux-2.6-ustate/kernel/exit.c  2005-03-11 09:59:36.364471017 +1100
@@ -93,6 +93,9 @@
}
 
sched_exit(p);
+
+   msa_update_parent(p->parent, p);
+
write_unlock_irq(_lock);
spin_unlock(>proc_lock);
proc_pid_flush(proc_dentry);


Microstate Accounting for 2.6.11, patch 3/

2005-03-10 Thread Peter Chubb

Microstate Accounting: Track time in system calls and interrupts, i386 code.
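
The interrupt half of this accounting can be pictured as a pair of timestamps taken around each handler. What follows is a user-space sketch only, not the code in this patch: it borrows the struct msa_irq field names (times, last_entered) from the core patch in this series, while the function names, the 64-bit clock type, the explicit "now" argument and the table size of 16 are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NR_IRQS 16            /* assumed table size, illustration only */

typedef uint64_t sketch_clk_t;       /* assumed clock type */

/* Field names follow struct msa_irq in the core patch. */
struct msa_irq {
	sketch_clk_t times;          /* total time spent handling this IRQ */
	sketch_clk_t last_entered;   /* timestamp of the most recent entry */
};

static struct msa_irq irq_stats[SKETCH_NR_IRQS];

/* Record the moment the handler for irq starts running. */
void sketch_start_irq(unsigned int irq, sketch_clk_t now)
{
	assert(irq < SKETCH_NR_IRQS);
	irq_stats[irq].last_entered = now;
}

/* Add the time elapsed since entry to the running total for irq. */
void sketch_finish_irq(unsigned int irq, sketch_clk_t now)
{
	assert(irq < SKETCH_NR_IRQS);
	irq_stats[irq].times += now - irq_stats[irq].last_entered;
}

/* Total handler time accumulated so far for irq. */
sketch_clk_t sketch_irq_time(unsigned int irq)
{
	assert(irq < SKETCH_NR_IRQS);
	return irq_stats[irq].times;
}
```

In the kernel the table would be per-CPU and the timestamp would come from a cycle counter or similar; passing the time in explicitly just keeps the sketch deterministic.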

Signed-off-by: Peter Chubb <[EMAIL PROTECTED]>

 arch/i386/kernel/entry.S |   16 
 arch/i386/kernel/irq.c |   13 -


Index: linux-2.6-ustate/arch/i386/kernel/entry.S
===
--- linux-2.6-ustate.orig/arch/i386/kernel/entry.S 2005-03-10 09:13:01.448604031 +1100
+++ linux-2.6-ustate/arch/i386/kernel/entry.S 2005-03-10 09:16:14.888575341 +1100
@@ -222,10 +222,18 @@
/* Note, _TIF_SECCOMP is bit number 8, and so it needs testw and not testb */
testw $(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT|_TIF_SECCOMP),TI_flags(%ebp)
jnz syscall_trace_entry
+#ifdef CONFIG_MICROSTATE
+   pushl   %eax
+   call msa_start_syscall
+   popl    %eax
+#endif
cmpl $(nr_syscalls), %eax
jae syscall_badsys
call *sys_call_table(,%eax,4)
movl %eax,EAX(%esp)
+#ifdef CONFIG_MICROSTATE
+   call msa_end_syscall
+#endif
cli
movl TI_flags(%ebp), %ecx
testw $_TIF_ALLWORK_MASK, %cx
@@ -250,9 +258,17 @@
cmpl $(nr_syscalls), %eax
jae syscall_badsys
 syscall_call:
+#ifdef CONFIG_MICROSTATE
+   pushl   %eax
+   call msa_start_syscall
+   popl    %eax
+#endif
call *sys_call_table(,%eax,4)
movl %eax,EAX(%esp) # store the return value
 syscall_exit:
+#ifdef CONFIG_MICROSTATE
+   call msa_end_syscall
+#endif
cli # make sure we don't miss an interrupt
# setting need_resched or sigpending
# between sampling and the iret


Index: linux-2.6-ustate/arch/i386/kernel/irq.c
===
--- linux-2.6-ustate.orig/arch/i386/kernel/irq.c 2005-03-10 09:13:00.115606274 +1100
+++ linux-2.6-ustate/arch/i386/kernel/irq.c 2005-03-10 09:16:16.032121680 +1100
@@ -55,6 +55,8 @@
 #endif
 
irq_enter();
+   msa_start_irq(irq);
+   
 #ifdef CONFIG_DEBUG_STACKOVERFLOW
/* Debugging check for stack overflow: is there less than 1KB free? */
{
@@ -101,6 +103,7 @@
 #endif
__do_IRQ(irq, regs);
 
+   msa_finish_irq(irq);
irq_exit();
 
return 1;
@@ -221,10 +224,18 @@
seq_printf(p, "%3d: ",i);
 #ifndef CONFIG_SMP
seq_printf(p, "%10u ", kstat_irqs(i));
+#ifdef CONFIG_MICROSTATE
+   seq_printf(p, "%10llu", msa_irq_time(0, i));
+#endif
 #else
for (j = 0; j < NR_CPUS; j++)
-   if (cpu_online(j))
+   if (cpu_online(j)) {
seq_printf(p, "%10u ", kstat_cpu(j).irqs[i]);
+#ifdef CONFIG_MICROSTATE
+   seq_printf(p, "%10llu", msa_irq_time(j, i));
+#endif
+   }
+
 #endif
seq_printf(p, " %14s", irq_desc[i].handler->typename);
seq_printf(p, "  %s", action->name);


Microstate Accounting for 2.6.11

2005-03-10 Thread Peter Chubb


Microstate Accounting
---------------------

Timing data on threads at present is pretty crude:  when the timer
interrupt occurs, a tick is added to either system time or user time
for the currently running thread.  Thus in an unpatched kernel one can
distinguish three timed states:  On-cpu in userspace, on-cpu in system
space, and not running.
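
To see why tick sampling is crude, compare it with exact accounting over the same run. The sketch below is an illustration under assumed names (struct slice, tick_account, exact_account are not from the patch): a run is a list of slices, each spent wholly in user or system mode, and a tick charges a full tick period to whichever mode is current when it fires.

```c
#include <assert.h>
#include <stddef.h>

enum cpu_mode { MODE_USER, MODE_SYSTEM };

/* One uninterrupted stretch of execution in a single mode. */
struct slice {
	enum cpu_mode mode;
	unsigned long usec;
};

struct cpu_times {
	unsigned long user;
	unsigned long system;
};

/*
 * Tick sampling: every tick_usec a timer fires and charges one whole
 * tick to the mode that is current at that instant.
 */
struct cpu_times tick_account(const struct slice *s, size_t n,
			      unsigned long tick_usec)
{
	struct cpu_times t = { 0, 0 };
	unsigned long now = 0, next_tick = tick_usec;
	size_t i;

	assert(tick_usec > 0);
	for (i = 0; i < n; i++) {
		unsigned long end = now + s[i].usec;

		/* Charge every tick that fires during this slice. */
		while (next_tick <= end) {
			if (s[i].mode == MODE_USER)
				t.user += tick_usec;
			else
				t.system += tick_usec;
			next_tick += tick_usec;
		}
		now = end;
	}
	return t;
}

/* Exact accounting: charge each slice's real length to its own mode. */
struct cpu_times exact_account(const struct slice *s, size_t n)
{
	struct cpu_times t = { 0, 0 };
	size_t i;

	for (i = 0; i < n; i++) {
		if (s[i].mode == MODE_USER)
			t.user += s[i].usec;
		else
			t.system += s[i].usec;
	}
	return t;
}
```

With a 1000 usec tick, a thread that spends 900 usec in user mode and then 100 usec in the kernel is charged a full tick of system time and no user time by tick_account(), whereas exact_account() reports 900 and 100.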

The actual number of states is much larger.  A thread can be on a
runqueue or the expired queue (i.e., ready to run but not running),
sleeping on a semaphore or on a futex, having its time stolen to
service an interrupt, etc., etc.

This patch adds timers per-state to each struct task_struct, so that
time in all these states can be tracked.  This patch contains the core
code to do the timing, and to initialise the timers.  Subsequent patches
enable the code (by adding Kconfig options) and add hooks to track
state changes.
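
The per-state timer idea reduces to one rule: on every state change, charge the time since the last change to the state being left. Here is a minimal user-space sketch of that rule. The struct and field names (struct microstates, cur_state, last_change, timers) follow kernel/msa.c in this patch; the ONCPU_USER and ONCPU_SYS states, the function names and the explicit "now" argument are assumptions for illustration, and the real code has more states and reads a hardware clock.

```c
#include <assert.h>
#include <stdint.h>

/* A small, partly hypothetical subset of the states being tracked. */
enum thread_state {
	ONCPU_USER,              /* hypothetical name */
	ONCPU_SYS,               /* hypothetical name */
	ONACTIVEQUEUE,
	ONEXPIREDQUEUE,
	INTERRUPTIBLE_SLEEP,
	NR_SKETCH_STATES
};

struct microstates {
	enum thread_state cur_state;
	uint64_t last_change;                 /* time of the last transition */
	uint64_t timers[NR_SKETCH_STATES];    /* accumulated time per state */
};

/* Start all timers at zero, in the given state, as of time now. */
void sketch_msa_init(struct microstates *ms, enum thread_state initial,
		     uint64_t now)
{
	int i;

	for (i = 0; i < NR_SKETCH_STATES; i++)
		ms->timers[i] = 0;
	ms->cur_state = initial;
	ms->last_change = now;
}

/* Charge the elapsed time to the state being left, then switch. */
void sketch_msa_change_state(struct microstates *ms, enum thread_state next,
			     uint64_t now)
{
	assert(next < NR_SKETCH_STATES);
	ms->timers[ms->cur_state] += now - ms->last_change;
	ms->cur_state = next;
	ms->last_change = now;
}
```

A thread that waits on the active queue from time 100 to 150 and then runs in user mode until 400 ends up with 50 in timers[ONACTIVEQUEUE] and 250 in timers[ONCPU_USER].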

Signed-off-by: Peter Chubb <[EMAIL PROTECTED]>

 include/asm-generic/msa.h  |   21 ++
 include/linux/msa-kernel.h |   99 +
 include/linux/msa.h|   46 
 include/linux/sched.h  |4 
 kernel/Makefile|2 
 kernel/fork.c  |2 
 kernel/msa.c   |  472 +
 7 files changed, 645 insertions(+), 1 deletion(-)

Index: linux-2.6-ustate/kernel/msa.c
===
--- /dev/null   1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-ustate/kernel/msa.c   2005-03-11 09:58:20.574030768 +1100
@@ -0,0 +1,472 @@
+/*
+ * Microstate accounting.
+ * Try to account for various states much more accurately than
+ * the normal code does.
+ *
+ * Copyright (c) Peter Chubb 2005
+ *  UNSW and National ICT Australia
+ * This code is released under the Gnu Public Licence, version 2.
+ */
+
+
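+#include <linux/config.h>
+#include <linux/types.h>
+#include <linux/errno.h>
+#include <linux/linkage.h>
+#ifdef CONFIG_MICROSTATE
+#include <linux/irq.h>
+#include <linux/hardirq.h>
+#include <linux/sched.h>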
+#include 
+
+#include 
+
+/*
+ * Track time spent in interrupt handlers.
+ */
+struct msa_irq {
+   clk_t times;
+   clk_t last_entered;
+};
+
+/*
+ * When the scheduler last swapped active and expired queues
+ */
+static DEFINE_PER_CPU(clk_t, queueflip_time);
+
+/*
+ * Time spent in interrupt handlers
+ */
+static DEFINE_PER_CPU(struct msa_irq[NR_IRQS+1], msa_irq);
+
+
+/**
+ * msa_switch: Update microstate timers when switching from one task to another.
+ * @prev, @next:  The prev task is coming off the processor;
+ *the new task is about to run on the processor.
+ *
+ * Update the times in both prev and next.  It may be necessary to infer the 
+ * next state for each task.
+ *
+ */
+void
+msa_switch(struct task_struct *prev, struct task_struct *next)
+{
+   struct microstates *msprev = &prev->microstates;
+   struct microstates *msnext = &next->microstates;
+   clk_t now;
+   enum thread_state next_state;
+   int interrupted = msprev->cur_state == INTERRUPTED;
+
+   preempt_disable();
+
+   MSA_NOW(now);
+
+   if (msprev->flags & QUEUE_FLIPPED) {
+   __get_cpu_var(queueflip_time) = now;
+   msprev->flags &= ~QUEUE_FLIPPED;
+   }
+
+   /*
+* If the queues have been flipped,
+* update the state as of the last flip time.
+*/
+   if (msnext->cur_state == ONEXPIREDQUEUE) {
+   clk_t qfp = per_cpu(queueflip_time, msnext->lastqueued);
+   msnext->cur_state = ONACTIVEQUEUE;
+   msnext->timers[ONEXPIREDQUEUE] += qfp - msnext->last_change;
+   msnext->last_change = qfp;
+   }
+
+   msprev->timers[msprev->cur_state] += now - msprev->last_change;
+   msnext->timers[msnext->cur_state] += now - msnext->last_change;
+   
+   /* Update states */
+   switch (msprev->next_state) {
+   case MSA_UNKNOWN:
+   /*
+* Infer from actual state
+*/
+   switch (prev->state) {
+   case TASK_INTERRUPTIBLE:
+   next_state = INTERRUPTIBLE_SLEEP;
+   break;
+   
+   case TASK_UNINTERRUPTIBLE:
+   next_state = UNINTERRUPTIBLE_SLEEP;
+   break;
+
+   case TASK_STOPPED:
+   next_state = STOPPED;
+   break;
+
+   case EXIT_DEAD:
+   case EXIT_ZOMBIE:
+   next_state = ZOMBIE;
+   break;
+
+   case TASK_RUNNING:  
+   next_state = ONACTIVEQUEUE;
+   break;
+
+   default:
+   next_state = MSA_UNKNOWN;
+   break;
+
+   } 
+   break;
+
+   case PAGING_SLEEP: /*
+   * Sleep states are PAGING_SLEEP;
+   * others inferred fro

User mode drivers: part 2: PCI device handling (patch 2/2 for 2.6.11)

2005-03-10 Thread Peter Chubb

User-level drivers:  Add system calls for I386 and IA64.
Signed-Off-By: Peter Chubb <[EMAIL PROTECTED]>

# 
# arch/i386/kernel/entry.S  |4 
# arch/ia64/kernel/entry.S  |8 
# include/asm-i386/unistd.h |6 +-
# include/asm-ia64/unistd.h |4 
# 4 files changed, 17 insertions(+), 5 deletions(-)
#
Index: linux-2.6.11-usrdrivers/arch/ia64/kernel/entry.S
===
--- linux-2.6.11-usrdrivers.orig/arch/ia64/kernel/entry.S 2005-03-11 13:59:28.940744950 +1100
+++ linux-2.6.11-usrdrivers/arch/ia64/kernel/entry.S 2005-03-11 13:59:41.236542676 +1100
@@ -1577,10 +1577,10 @@
data8 sys_add_key
data8 sys_request_key
data8 sys_keyctl
-   data8 sys_ni_syscall
-   data8 sys_ni_syscall    // 1275
-   data8 sys_ni_syscall
-   data8 sys_ni_syscall
+   data8 sys_usr_pci_open
+   data8 sys_usr_pci_mmap  // 1275
+   data8 sys_usr_pci_munmap
+   data8 sys_usr_pci_get_consistent
data8 sys_ni_syscall
data8 sys_ni_syscall
 
Index: linux-2.6.11-usrdrivers/include/asm-i386/unistd.h
===
--- linux-2.6.11-usrdrivers.orig/include/asm-i386/unistd.h 2005-03-11 13:59:28.942698059 +1100
+++ linux-2.6.11-usrdrivers/include/asm-i386/unistd.h 2005-03-11 13:59:41.245331667 +1100
@@ -294,8 +294,12 @@
 #define __NR_add_key   286
 #define __NR_request_key   287
 #define __NR_keyctl    288
+#define __NR_usr_pci_open  289
+#define __NR_usr_pci_mmap  (__NR_usr_pci_open+1)
+#define __NR_usr_pci_munmap    (__NR_usr_pci_open+2)
+#define __NR_usr_pci_get_consistent    (__NR_usr_pci_open+3)
 
-#define NR_syscalls 289
+#define NR_syscalls 293
 
 /*
  * user-visible error numbers are in the range -1 - -128: see
Index: linux-2.6.11-usrdrivers/include/asm-ia64/unistd.h
===
--- linux-2.6.11-usrdrivers.orig/include/asm-ia64/unistd.h 2005-03-11 13:59:28.942698059 +1100
+++ linux-2.6.11-usrdrivers/include/asm-ia64/unistd.h 2005-03-11 13:59:41.247284776 +1100
@@ -263,6 +263,10 @@
 #define __NR_add_key   1271
 #define __NR_request_key   1272
 #define __NR_keyctl    1273
+#define __NR_usr_pci_open   1274
+#define __NR_usr_pci_mmap   1275
+#define __NR_usr_pci_unmap  1276
+#define __NR_usr_pci_get_consistent 1277
 
 #ifdef __KERNEL__
 
Index: linux-2.6.11-usrdrivers/arch/i386/kernel/entry.S
===
--- linux-2.6.11-usrdrivers.orig/arch/i386/kernel/entry.S 2005-03-11 13:59:28.941721505 +1100
+++ linux-2.6.11-usrdrivers/arch/i386/kernel/entry.S 2005-03-11 13:59:41.248261330 +1100
@@ -864,5 +864,9 @@
.long sys_add_key
.long sys_request_key
.long sys_keyctl
+   .long sys_usr_pci_open
+   .long sys_usr_pci_mmap  /* 290 */
+   .long sys_usr_pci_munmap
+   .long sys_usr_pci_get_consistent
 
 syscall_table_size=(.-sys_call_table)


User mode drivers: part 2: PCI device handling (patch 1/2 for 2.6.11)

2005-03-10 Thread Peter Chubb


USER LEVEL DRIVERS: enable PCI device drivers at user space.

This patch adds the capability for suitably privileged user-level processes to 
enable a PCI device, and set up DMA for it.  A subsequent patch hooks up 
the actual system calls.

There are four new system calls:

  long usr_pci_open(int bus, int slot, int function, __u64 dma_mask);
        Returns a file descriptor for the PCI device described
        by bus,slot,function.  It also enables the device, and sets it
        up as a bus-mastering DMA device, with the specified DMA mask.

        Error codes are:
        ENOMEM: insufficient kernel memory to fulfil your request
        ENOENT: the specified device doesn't exist, or is otherwise
                invisible to Linux.
        EBUSY:  another driver has claimed the device
        EIO:    the specified DMA mask is invalid for this device.
        ENFILE: too many open files

  long usr_pci_get_consistent(int fd, size_t size, void **vaddrp,
                              unsigned long *dmaaddrp)
        Call pci_alloc_consistent() to get size bytes of PCI
        consistent memory (currently an error if size != PAGESIZE);
        map the allocated memory into the user's address space;
        return the virtual user address in *vaddrp, and the bus
        address in *dmaaddrp.

        Error codes are:
        EINVAL: the file descriptor was not one obtained from
                usr_pci_open(), or size != PAGESIZE
        ENOMEM: insufficient appropriate memory or insufficient free
                virtual address space in the user program.
        EFAULT: vaddrp or dmaaddrp didn't point to writeable memory.

        The mapping obtained can be cleaned up with munmap().

  long usr_pci_mmap(int fd, struct mapping_info *mp)
        Map some memory for DMA to/from the device represented by fd,
        which was obtained from usr_pci_open().

        struct mapping_info contains:
        void *virtaddr -- the virtual address to DMA to
        int size -- how many bytes to set up
        struct usr_pci_sglist *sglist -- a pointer to a scatterlist
        int nents -- how many entries in the scatterlist
        enum dma_data_direction direction -- which way the
                DMA is going to happen.

        The scatterlist should be sized at least size/PAGESIZE + 2.

        usr_pci_mmap() will call pci_map_sg() on the virtual region,
        then copy the resulting scatterlist into *sglist.  The nents field
        will be updated with the actual number of scatterlist entries
        filled in.

        Error codes are:
        EINVAL: the fd wasn't obtained from usr_pci_open, or
                direction wasn't one of DMA_TO_DEVICE, DMA_FROM_DEVICE
                or DMA_BIDIRECTIONAL, or the size of the
                scatterlist is insufficient to map the region.
        EFAULT: mp was a bad pointer, or the region of memory spanned
                by (virtaddr, virtaddr + size) was not all mapped.
        ENOMEM: insufficient appropriate memory
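
As a rough guide to preparing a usr_pci_mmap() call, the sketch below declares struct mapping_info with the fields listed above and computes the minimum scatterlist length from the size/PAGESIZE + 2 rule. It is an assumption-laden illustration: the real declarations live in include/linux/usrdrv.h, which is not shown here, so the field order is a guess, struct usr_pci_sglist is left opaque, and the two helper functions are hypothetical names. The enum values mirror the kernel's enum dma_data_direction.

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors the kernel's enum dma_data_direction. */
enum dma_data_direction {
	DMA_BIDIRECTIONAL = 0,
	DMA_TO_DEVICE = 1,
	DMA_FROM_DEVICE = 2,
	DMA_NONE = 3
};

struct usr_pci_sglist;   /* layout not given in the description; left opaque */

/* Fields as listed in the description; the order is an assumption. */
struct mapping_info {
	void *virtaddr;                      /* the virtual address to DMA to */
	int size;                            /* how many bytes to set up */
	struct usr_pci_sglist *sglist;       /* caller-provided scatterlist */
	int nents;                           /* entries in the scatterlist */
	enum dma_data_direction direction;   /* which way the DMA goes */
};

/* Smallest scatterlist length the description allows: size/PAGESIZE + 2. */
size_t min_sglist_entries(size_t size, size_t pagesize)
{
	assert(pagesize > 0);
	return size / pagesize + 2;
}

/* Nonzero if d is a direction usr_pci_mmap() is documented to accept. */
int direction_is_valid(enum dma_data_direction d)
{
	return d == DMA_TO_DEVICE || d == DMA_FROM_DEVICE ||
	       d == DMA_BIDIRECTIONAL;
}
```

For an 8192-byte buffer and 4096-byte pages this gives four entries. The extra two presumably cover a buffer that starts and ends part-way through a page, so that it touches more pages than size/PAGESIZE alone suggests.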

   long usr_pci_munmap(int fd, struct mapping_info *mp)
        Unmap a DMA region mapped by usr_pci_mmap().
        struct mapping_info is the same one used in usr_pci_mmap().

        Error codes are:
        EINVAL: the fd wasn't obtained from usr_pci_open, or the
                struct mapping_info was never mapped for this device


Signed-off-by: Peter Chubb <[EMAIL PROTECTED]>  


#
# drivers/Makefile   |3 
# drivers/pci/Kconfig|6 
# drivers/usr/Makefile   |2 
# drivers/usr/sys.c  |  952 +
# include/linux/usrdrv.h |   63 +++
# 5 files changed, 1026 insertions(+)
#
Index: linux-2.6.11-usrdrivers/drivers/Makefile
===
--- linux-2.6.11-usrdrivers.orig/drivers/Makefile 2005-03-11 12:25:29.169139978 +1100
+++ linux-2.6.11-usrdrivers/drivers/Makefile 2005-03-11 12:25:41.159270471 +1100
@@ -13,6 +13,9 @@
 # was used and do nothing if so
 obj-$(CONFIG_PNP)  += pnp/
 
+# User level device drivers
+obj-$(CONFIG_USRDEV)   += usr/
+
 # char/ comes before serial/ etc so that the VT console is the boot-time
 # default.
 obj-y  += char/
Index: linux-2.6.11-usrdrivers/drivers/usr/Makefile
===
--- /dev/null   1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.11-usrdrivers/drivers/usr/Makefile 2005-03-11 12:25:41.160247026 +1100
@@ -0,0 +1,2 @@
+obj-y  += sys.o 
+obj-$(CONFIG_USRBLKDEV) += blkdev.o
Index: linux-2.6.11-usrdrivers/drivers/usr/sys.c
===
--- /dev/null   1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.11-usrdrivers/drivers/usr/sys.c 2005-03-11 14:15:59.897394833 +1100
@@ -0,0 +1,952 @@
+/*
+ * Expose PCI-DMA interface to user mode.
+ *
+ * Copyright 2005 Peter Chubb
+ * National ICT

User mode drivers: part 1, interrupt handling (patch for 2.6.11)

2005-03-10 Thread Peter Chubb

As many of you will be aware, we've been working on infrastructure for
user-mode PCI and other drivers.  The first step is to be able to
handle interrupts from user space. Subsequent patches add
infrastructure for setting up DMA for PCI devices.

The user-level interrupt code doesn't depend on the other patches, and
is probably the most mature of this patchset.


This patch adds a new file to /proc/irq/nnn/ called irq.  Suitably 
privileged processes can open this file.  Reading the file returns the 
number of interrupts (if any) that have occurred since the last read.
If the file is opened in blocking mode, reading it blocks until 
an interrupt occurs.  poll(2) and select(2) work as one would expect, to 
allow interrupts to be one of many events to wait for.
(If you didn't like the file, one could have a special system call to
return the file descriptor).
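
Waiting for an interrupt alongside other events could look like the helper below. It is a sketch, not part of the patch: wait_for_irq is a hypothetical name, the descriptor is assumed to come from opening the /proc/irq/nnn/irq file described above, and the only calls used are the standard poll(2) and read(2). Because it relies on nothing specific to that file, it can be tried against any readable descriptor that delivers an int, such as a pipe.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Wait up to timeout_ms milliseconds for fd to become readable, then
 * read one int from it (for the irq file, the number of interrupts
 * since the last read).
 *
 * Returns the value read, 0 on timeout, or -1 on error.
 */
int wait_for_irq(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int n_ints;
	int ready;

	assert(fd >= 0);

	/* Retry if a signal interrupts the wait. */
	do {
		ready = poll(&pfd, 1, timeout_ms);
	} while (ready < 0 && errno == EINTR);

	if (ready < 0)
		return -1;
	if (ready == 0)
		return 0;               /* timed out, nothing pending */

	if (read(fd, &n_ints, sizeof n_ints) != (ssize_t)sizeof n_ints)
		return -1;
	return n_ints;
}
```

To wait on several sources at once, the same struct pollfd would simply become one element of a larger array passed to poll().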

Interrupts are usually masked; while a thread is in poll(2) or read(2) on the 
file they are unmasked.  

All architectures that use CONFIG_GENERIC_HARDIRQ are supported by
this patch.

A low latency user level interrupt handler would do something like
this, on a CONFIG_PREEMPT kernel:

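  #include <fcntl.h>      /* open() */
  #include <sched.h>      /* sched_setscheduler() */
  #include <sys/mman.h>   /* mlockall() */
  #include <unistd.h>     /* read() */

  int irqfd;
  int n_ints;
  struct sched_param sched_param;

  irqfd = open("/proc/irq/513/irq", O_RDONLY);
  mlockall(MCL_CURRENT | MCL_FUTURE);   /* avoid page faults in the handler */
  sched_param.sched_priority = sched_get_priority_max(SCHED_FIFO) - 10;
  sched_setscheduler(0, SCHED_FIFO, &sched_param);

  /* Each read blocks until an interrupt arrives, then returns the count. */
  while (read(irqfd, &n_ints, sizeof n_ints) == sizeof n_ints) {
      ... talk to device to handle interrupt
  }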

If you don't care about latency, then forget about the mlockall() and
setting the priority, and you don't need CONFIG_PREEMPT.

Signed-off-by: Peter Chubb <[EMAIL PROTECTED]>

 kernel/irq/proc.c |  163 ++
 1 files changed, 153 insertions(+), 10 deletions(-)

Index: linux-2.6.11-usrdrivers/kernel/irq/proc.c
===
--- linux-2.6.11-usrdrivers.orig/kernel/irq/proc.c 2005-03-11 10:30:57.875619102 +1100
+++ linux-2.6.11-usrdrivers/kernel/irq/proc.c 2005-03-11 10:45:07.146928168 +1100
@@ -9,6 +9,8 @@
 #include <linux/irq.h>
 #include <linux/proc_fs.h>
 #include <linux/interrupt.h>
+#include <linux/poll.h>
+#include "internals.h"
 
 static struct proc_dir_entry *root_irq_dir, *irq_dir[NR_IRQS];
 
@@ -90,27 +92,168 @@
action->dir = proc_mkdir(name, irq_dir[irq]);
 }
 
+struct irq_proc {
+   unsigned long irq;
+   wait_queue_head_t q;
+   atomic_t count;
+   char devname[TASK_COMM_LEN];
+};
+ 
+static irqreturn_t irq_proc_irq_handler(int irq, void *vidp, struct pt_regs *regs)
+{
+   struct irq_proc *idp = (struct irq_proc *)vidp;
+ 
+   BUG_ON(idp->irq != irq);
+   disable_irq_nosync(irq);
+   atomic_inc(&idp->count);
+   wake_up(&idp->q);
+   return IRQ_HANDLED;
+}
+ 
+
+/*
+ * Signal to userspace an interrupt has occurred.
+ */
+static ssize_t irq_proc_read(struct file *filp, char __user *bufp, size_t len, loff_t *ppos)
+{
+   struct irq_proc *ip = (struct irq_proc *)filp->private_data;
+   irq_desc_t *idp = irq_desc + ip->irq;
+   int pending;
+   
+   DEFINE_WAIT(wait);
+   
+   if (len < sizeof(int))
+   return -EINVAL;
+   
+   pending = atomic_read(&ip->count);
+   if (pending == 0) {
+   if (idp->status & IRQ_DISABLED)
+   enable_irq(ip->irq);
+   if (filp->f_flags & O_NONBLOCK)
+   return -EWOULDBLOCK;
+   }
+   
+   while (pending == 0) {
+   prepare_to_wait(&ip->q, &wait, TASK_INTERRUPTIBLE);
+   pending = atomic_read(&ip->count);
+   if (pending == 0)
+   schedule();
+   finish_wait(&ip->q, &wait);
+   if (signal_pending(current))
+   return -ERESTARTSYS;
+   }
+   
+   if (copy_to_user(bufp, &pending, sizeof pending))
+   return -EFAULT;
+
+   *ppos += sizeof pending;
+   
+   atomic_sub(pending, &ip->count);
+   return sizeof pending;
+}
+
+
+static int irq_proc_open(struct inode *inop, struct file *filp)
+{
+   struct irq_proc *ip;
+   struct proc_dir_entry *ent = PDE(inop);
+   int error;
+
+   ip = kmalloc(sizeof *ip, GFP_KERNEL);
+   if (ip == NULL)
+   return -ENOMEM;
+   
+   memset(ip, 0, sizeof(*ip));
+   strcpy(ip->devname, current->comm);
+   init_waitqueue_head(&ip->q);
+   atomic_set(&ip->count, 0);
+   ip->irq = (unsigned long)ent->data;
+   
+   error = request_irq(ip->irq,
+   irq_proc_irq_handler,
+   SA_INTERRUPT,
+   ip->devname,
+   ip);
+   if (error < 0) {
+   kfree(ip);
+   return error;
+   }
+   filp->private_data = (void *)ip;
+
+   return 0;
+}
+
+static int irq_proc_release(struct inode *inop, struct file *filp

Re: binary drivers and development

2005-03-10 Thread Peter Chubb
>>>>> "John" == John Richard Moser <[EMAIL PROTECTED]> writes:


John> I've done more thought, here's a small list of advantages on
John> using binary drivers, specifically considering UDI.  You can
John> consider a different implementation for binary drivers as well,
John> with most of the same advantages.

Almost all these advantages are also present for user-mode drivers...
and getting drivers out of the kernel, where possible, is a much
better approach IMHO than trying to maintain a leaky in-kernel
interface.  The problem with in-kernel interfaces, even if set in
concrete, is that any binary driver can go outside the interface ---
there's no encapsulation --- and so break when the kernel changes.

Peter C




Microstate Accounting for 2.6.11

2005-03-10 Thread Peter Chubb


Microstate Accounting
-

Timing data on threads at present is pretty crude:  when the timer
interrupt occurs, a tick is added to either system time or user time
for the currently running thread.  Thus in an unpatched kernel one can
distinguish three timed states:  On-cpu in userspace, on-cpu in system
space, and not running.

The actual number of states is much larger.  A thread can be on a
runqueue or  the expired queue (i.e., ready to run but not running),
sleeping on a semaphore or on a futex, having its time stolen to
service an interrupt, etc., etc.

This patch adds timers per-state to each struct task_struct, so that
time in all these states can be tracked.  This patch contains the core
code to do the timing, and to initialise the timers.  Subsequent patches
enable the code (by adding Kconfig options) and add hooks to track
state changes.
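
The per-state timers described above can be sketched in userspace C.
This is only an illustration, not the code in kernel/msa.c: the state
list is shortened, and msa_init() and msa_change_state() are
hypothetical helper names chosen for this sketch.

```c
#include <stdint.h>

/* Hypothetical, shortened state list; the patch tracks more states. */
enum thread_state {
	ONCPU_USER,
	ONCPU_SYS,
	ONACTIVEQUEUE,
	ONEXPIREDQUEUE,
	INTERRUPTIBLE_SLEEP,
	NR_STATES
};

typedef uint64_t clk_t;

struct microstates {
	enum thread_state cur_state;	/* state the thread is in now */
	clk_t last_change;		/* timestamp of the last state change */
	clk_t timers[NR_STATES];	/* accumulated time per state */
};

/* Start accounting in `initial` at time `now`, with all timers at zero. */
static void msa_init(struct microstates *ms, enum thread_state initial,
		     clk_t now)
{
	for (int i = 0; i < NR_STATES; i++)
		ms->timers[i] = 0;
	ms->cur_state = initial;
	ms->last_change = now;
}

/* Charge the elapsed interval to the old state, then enter `next`. */
static void msa_change_state(struct microstates *ms, enum thread_state next,
			     clk_t now)
{
	ms->timers[ms->cur_state] += now - ms->last_change;
	ms->cur_state = next;
	ms->last_change = now;
}
```

Every state change costs one clock read and one addition, which is
where the extra overhead relative to tick-based accounting comes from.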

Signed-off-by: Peter Chubb [EMAIL PROTECTED]

 include/asm-generic/msa.h  |   21 ++
 include/linux/msa-kernel.h |   99 +
 include/linux/msa.h|   46 
 include/linux/sched.h  |4 
 kernel/Makefile|2 
 kernel/fork.c  |2 
 kernel/msa.c   |  472 +
 7 files changed, 645 insertions(+), 1 deletion(-)

Index: linux-2.6-ustate/kernel/msa.c
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ linux-2.6-ustate/kernel/msa.c   2005-03-11 09:58:20.574030768 +1100
@@ -0,0 +1,472 @@
+/*
+ * Microstate accounting.
+ * Try to account for various states much more accurately than
+ * the normal code does.
+ *
+ * Copyright (c) Peter Chubb 2005
+ *  UNSW and National ICT Australia
+ * This code is released under the Gnu Public Licence, version 2.
+ */
+
+
+#include <linux/config.h>
+#include <linux/types.h>
+#include <linux/errno.h>
+#include <linux/linkage.h>
+#ifdef CONFIG_MICROSTATE
+#include <linux/irq.h>
+#include <linux/hardirq.h>
+#include <linux/sched.h>
+#include <linux/jiffies.h>
+
+#include <asm/uaccess.h>
+
+/*
+ * Track time spent in interrupt handlers.
+ */
+struct msa_irq {
+   clk_t times;
+   clk_t last_entered;
+};
+
+/*
+ * When the scheduler last swapped active and expired queues
+ */
+static DEFINE_PER_CPU(clk_t, queueflip_time);
+
+/*
+ * Time spent in interrupt handlers
+ */
+static DEFINE_PER_CPU(struct msa_irq[NR_IRQS+1], msa_irq);
+
+
+/**
+ * msa_switch: Update microstate timers when switching from one task to another.
+ * @prev, @next:  The prev task is coming off the processor;
+ *the new task is about to run on the processor.
+ *
+ * Update the times in both prev and next.  It may be necessary to infer the 
+ * next state for each task.
+ *
+ */
+void
+msa_switch(struct task_struct *prev, struct task_struct *next)
+{
+   struct microstates *msprev = &prev->microstates;
+   struct microstates *msnext = &next->microstates;
+   clk_t now;
+   enum thread_state next_state;
+   int interrupted = msprev->cur_state == INTERRUPTED;
+
+   preempt_disable();
+
+   MSA_NOW(now);
+
+   if (msprev->flags & QUEUE_FLIPPED) {
+   __get_cpu_var(queueflip_time) = now;
+   msprev->flags &= ~QUEUE_FLIPPED;
+   }
+
+   /*
+* If the queues have been flipped,
+* update the state as of the last flip time.
+*/
+   if (msnext->cur_state == ONEXPIREDQUEUE) {
+   clk_t qfp = per_cpu(queueflip_time, msnext->lastqueued);
+   msnext->cur_state = ONACTIVEQUEUE;
+   msnext->timers[ONEXPIREDQUEUE] += qfp - msnext->last_change;
+   msnext->last_change = qfp;
+   }
+
+   msprev->timers[msprev->cur_state] += now - msprev->last_change;
+   msnext->timers[msnext->cur_state] += now - msnext->last_change;
+   
+   /* Update states */
+   switch (msprev-next_state) {
+   case MSA_UNKNOWN:
+   /*
+* Infer from actual state
+*/
+   switch (prev->state) {
+   case TASK_INTERRUPTIBLE:
+   next_state = INTERRUPTIBLE_SLEEP;
+   break;
+   
+   case TASK_UNINTERRUPTIBLE:
+   next_state = UNINTERRUPTIBLE_SLEEP;
+   break;
+
+   case TASK_STOPPED:
+   next_state = STOPPED;
+   break;
+
+   case EXIT_DEAD:
+   case EXIT_ZOMBIE:
+   next_state = ZOMBIE;
+   break;
+
+   case TASK_RUNNING:  
+   next_state = ONACTIVEQUEUE;
+   break;
+
+   default:
+   next_state = MSA_UNKNOWN;
+   break;
+
+   } 
+   break;
+
+   case PAGING_SLEEP: /*
+   * Sleep states are PAGING_SLEEP

Microstate Accounting for 2.6.11, patch 3/

2005-03-10 Thread Peter Chubb

Microstate Accounting: Track time in system calls and interrupts, i386 code.

Signed-off-by: Peter Chubb [EMAIL PROTECTED]

 arch/i386/kernel/entry.S |   16 
 arch/i386/kernel/irq.c |   13 -


Index: linux-2.6-ustate/arch/i386/kernel/entry.S
===
--- linux-2.6-ustate.orig/arch/i386/kernel/entry.S  2005-03-10 
09:13:01.448604031 +1100
+++ linux-2.6-ustate/arch/i386/kernel/entry.S   2005-03-10 09:16:14.888575341 
+1100
@@ -222,10 +222,18 @@
/* Note, _TIF_SECCOMP is bit number 8, and so it needs testw and not 
testb */
testw 
$(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT|_TIF_SECCOMP),TI_flags(%ebp)
jnz syscall_trace_entry
+#ifdef CONFIG_MICROSTATE
+   pushl   %eax
+   call msa_start_syscall
+   popl%eax
+#endif
cmpl $(nr_syscalls), %eax
jae syscall_badsys
call *sys_call_table(,%eax,4)
movl %eax,EAX(%esp)
+#ifdef CONFIG_MICROSTATE
+   call msa_end_syscall
+#endif
cli
movl TI_flags(%ebp), %ecx
testw $_TIF_ALLWORK_MASK, %cx
@@ -250,9 +258,17 @@
cmpl $(nr_syscalls), %eax
jae syscall_badsys
 syscall_call:
+#ifdef CONFIG_MICROSTATE
+   pushl   %eax
+   call msa_start_syscall
+   popl%eax
+#endif
call *sys_call_table(,%eax,4)
movl %eax,EAX(%esp) # store the return value
 syscall_exit:
+#ifdef CONFIG_MICROSTATE
+   call msa_end_syscall
+#endif
cli # make sure we don't miss an interrupt
# setting need_resched or sigpending
# between sampling and the iret


Index: linux-2.6-ustate/arch/i386/kernel/irq.c
===
--- linux-2.6-ustate.orig/arch/i386/kernel/irq.c2005-03-10 
09:13:00.115606274 +1100
+++ linux-2.6-ustate/arch/i386/kernel/irq.c 2005-03-10 09:16:16.032121680 
+1100
@@ -55,6 +55,8 @@
 #endif
 
irq_enter();
+   msa_start_irq(irq);
+   
 #ifdef CONFIG_DEBUG_STACKOVERFLOW
/* Debugging check for stack overflow: is there less than 1KB free? */
{
@@ -101,6 +103,7 @@
 #endif
__do_IRQ(irq, regs);
 
+   msa_finish_irq(irq);
irq_exit();
 
return 1;
@@ -221,10 +224,18 @@
seq_printf(p, "%3d: ",i);
 #ifndef CONFIG_SMP
seq_printf(p, "%10u ", kstat_irqs(i));
+#ifdef CONFIG_MICROSTATE
+   seq_printf(p, "%10llu", msa_irq_time(0, i));
+#endif
 #else
for (j = 0; j < NR_CPUS; j++)
-   if (cpu_online(j))
+   if (cpu_online(j)) {
seq_printf(p, "%10u ", kstat_cpu(j).irqs[i]);
+#ifdef CONFIG_MICROSTATE
+   seq_printf(p, "%10llu", msa_irq_time(j, i));
+#endif
+   }
+
 #endif
seq_printf(p, " %14s", irq_desc[i].handler->typename);
seq_printf(p, "  %s", action->name);


Microstate Accounting for 2.6.11, patch 2/6

2005-03-10 Thread Peter Chubb
Microstate Accounting:
Add hooks into the scheduler to track state changes.
Arrange for parent process's child times to be updated at process exit. 
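
As a rough illustration of what the queue-flip hook buys: a task that
was put on the expired queue only becomes runnable-in-practice once the
scheduler swaps the queues, so its waiting time is split at the flip
timestamp.  The sketch below is standalone userspace C; struct
queue_split and split_at_flip() are hypothetical names for this
example, and it assumes queued <= flip <= now.

```c
#include <stdint.h>

typedef uint64_t clk_t;

struct queue_split {
	clk_t expired;	/* time charged to the expired queue */
	clk_t active;	/* time charged to the active queue */
};

/*
 * A task was put on the expired queue at `queued`, the scheduler
 * swapped the queues at `flip`, and the task was picked to run at
 * `now`.  Time before the flip counts as expired-queue time, time
 * after it as active-queue time.
 */
static struct queue_split split_at_flip(clk_t queued, clk_t flip, clk_t now)
{
	struct queue_split s;

	s.expired = flip - queued;
	s.active = now - flip;
	return s;
}
```

For example, a task queued at 1000, flipped at 1400 and run at 1500 is
charged 400 units on the expired queue and 100 on the active queue.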


 kernel/sched.c |8 
 kernel/exit.c  |3 +++

Index: linux-2.6-ustate/kernel/sched.c
===
--- linux-2.6-ustate.orig/kernel/sched.c2005-03-11 09:59:31.109628035 
+1100
+++ linux-2.6-ustate/kernel/sched.c 2005-03-11 09:59:31.116463921 +1100
@@ -635,6 +635,7 @@
  */
 static inline void __activate_task(task_t *p, runqueue_t *rq)
 {
+   msa_set_timer(p, ONACTIVEQUEUE);
enqueue_task(p, rq-active);
rq-nr_running++;
 }
@@ -1238,6 +1239,7 @@
if (unlikely(!current-array))
__activate_task(p, rq);
else {
+   msa_set_timer(p, ONACTIVEQUEUE);
p-prio = current-prio;
list_add_tail(p-run_list, current-run_list);
p-array = current-array;
@@ -2422,6 +2424,7 @@
if (!rq-expired_timestamp)
rq-expired_timestamp = jiffies;
if (!TASK_INTERACTIVE(p) || EXPIRED_STARVING(rq)) {
+   msa_next_state(p, ONEXPIREDQUEUE);
enqueue_task(p, rq-expired);
if (p-static_prio  rq-best_expired_prio)
rq-best_expired_prio = p-static_prio;
@@ -2733,6 +2736,7 @@
array = rq-active;
rq-expired_timestamp = 0;
rq-best_expired_prio = MAX_PRIO;
+   msa_flip_expired(prev);
} else
schedstat_inc(rq, sched_noswitch);
 
@@ -2773,6 +2777,8 @@
rq-curr = next;
++*switch_count;
 
+   msa_switch(prev, next);
+
prepare_arch_switch(rq, next);
prev = context_switch(rq, prev, next);
barrier();
@@ -3693,6 +3699,8 @@
 */
if (rt_task(current))
target = rq-active;
+   else
+   msa_next_state(current, ONEXPIREDQUEUE);
 
if (current-array-nr_active == 1) {
schedstat_inc(rq, yld_act_empty);


Index: linux-2.6-ustate/kernel/exit.c
===
--- linux-2.6-ustate.orig/kernel/exit.c 2005-03-11 09:59:36.360564796 +1100
+++ linux-2.6-ustate/kernel/exit.c  2005-03-11 09:59:36.364471017 +1100
@@ -93,6 +93,9 @@
}
 
sched_exit(p);
+
+   msa_update_parent(p-parent, p);
+
write_unlock_irq(tasklist_lock);
spin_unlock(p-proc_lock);
proc_pid_flush(proc_dentry);


Microstate Accounting for 2.6.11, patch 3/6

2005-03-10 Thread Peter Chubb

Microstate accounting:  

Provide I386-dependent MSA clocks, and Kconfig options.
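
The unit conversions behind these clocks are simple arithmetic, shown
here as a userspace sketch.  cycles_to_nsec() and tod_to_usec() are
hypothetical names for this example; cpu_khz is assumed to be the CPU
clock in kHz and must be non-zero.

```c
#include <stdint.h>

/*
 * Convert a cycle count to nanoseconds.  One cycle lasts
 * 1 / (cpu_khz * 1000) seconds, i.e. 1000000 / cpu_khz nanoseconds.
 */
static uint64_t cycles_to_nsec(uint64_t cycles, uint64_t cpu_khz)
{
	return cycles * 1000000ULL / cpu_khz;
}

/* Convert a gettimeofday-style (seconds, microseconds) pair to microseconds. */
static uint64_t tod_to_usec(uint64_t sec, uint64_t usec)
{
	return sec * 1000000ULL + usec;
}
```

Note that the multiplication in cycles_to_nsec() overflows 64 bits once
the cycle count exceeds roughly 1.8e13, so a real implementation would
need to apply it to intervals rather than to very large running totals.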

 arch/i386/Kconfig  |   39 ++-
 include/asm-i386/msa.h |   49 +
 2 files changed, 87 insertions(+), 1 deletion(-)

Signed-off-by: Peter Chubb [EMAIL PROTECTED]

Index: linux-2.6-ustate/arch/i386/Kconfig
===
--- linux-2.6-ustate.orig/arch/i386/Kconfig 2005-03-11 09:59:38.773632446 
+1100
+++ linux-2.6-ustate/arch/i386/Kconfig  2005-03-11 09:59:38.777538666 +1100
@@ -923,8 +923,45 @@
 
  If unsure, say Y. Only embedded should say N here.
 
-endmenu
+config MICROSTATE
+   bool "Microstate accounting"
+   help
+     This option causes the kernel to keep very accurate track of
+     how long your threads spend on the runqueues, running, or asleep or
+     stopped.  It will slow down your kernel.
+     Times are reported in /proc/pid/msa and through a new msa()
+     system call.
+
+choice
+   depends on MICROSTATE
+   prompt "Microstate timing source"
+   default MICROSTATE_TSC
+
+config MICROSTATE_PM
+   bool "Use Power-Management timer for microstate timings"
+   depends on X86_PM_TIMER
+   help
+     If your machine is ACPI enabled and uses power-management, then the
+     TSC runs at a variable rate, which will distort the
+     microstate measurements.  This timer, although having
+     slightly more overhead, and a lower resolution (279
+     nanoseconds or so) will always run at a constant rate.
+
+config MICROSTATE_TSC
+   bool "Use on-chip TSC for microstate timings"
+   depends on X86_TSC
+   help
+     If your machine's clock runs at a constant rate, then this timer
+     gives you cycle precision in measuring times spent in microstates.
+
+config MICROSTATE_TOD
+   bool "Use time-of-day clock for microstate timings"
+   help
+     If none of the other timers are any good for you, this timer
+     will give you micro-second precision.
+endchoice
 
+endmenu
 
 menu "Power management options (ACPI, APM)"
depends on !X86_VOYAGER
Index: linux-2.6-ustate/include/asm-i386/msa.h
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ linux-2.6-ustate/include/asm-i386/msa.h 2005-03-11 09:59:38.779491777 
+1100
@@ -0,0 +1,49 @@
+/*
+ * asm-i386/msa.h
+ *
+ * Provide an architecture-specific clock.
+ */
+
+#ifndef _ASM_I386_MSA_H
+# define _ASM_I386_MSA_H
+
+# include <linux/config.h>
+
+
+# if defined(CONFIG_MICROSTATE_TSC)
+/*
+ * Use the processor's time-stamp counter as a timesource
+ */
+#  include <asm/msr.h>
+#  include <asm/div64.h>
+
+#  define MSA_NOW(now)  rdtscll(now)
+
+extern unsigned long cpu_khz;
+#  define MSA_TO_NSEC(clk) ({ clk_t _x = ((clk) * 1000000ULL); do_div(_x, cpu_khz); _x; })
+
+# elif defined(CONFIG_MICROSTATE_PM)
+/*
+ * Use the system's monotonic clock as a timesource.
+ * This will only be enabled if the Power Management Timer is enabled.
+ */
+unsigned long long monotonic_clock(void);
+#  define MSA_NOW(now) do { now = monotonic_clock(); } while (0)
+#  define MSA_TO_NSEC(clk) (clk)
+
+# elif defined(CONFIG_MICROSTATE_TOD)
+/*
+ * Fall back to gettimeofday.
+ * This one is incompatible with interrupt-time measurement on some processors.
+ */
+static inline void msa_now(clk_t *nsp) {
+   struct timeval tv;
+   do_gettimeofday(&tv);
+   *nsp = tv.tv_sec * 1000000 + tv.tv_usec;
+}
+#   define MSA_NOW(x) msa_now(&x)
+#   define MSA_TO_NSEC(clk) ((clk) * 1000)
+# endif
+
+
+#endif /* _ASM_I386_MSA_H */


Microstate Accounting for 2.6.11, patch 5/6

2005-03-10 Thread Peter Chubb
Microstate accounting: Add the I386 system call.

 arch/i386/kernel/entry.S  |2 +-
 include/asm-i386/unistd.h |2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

Index: linux-2.6-ustate/arch/i386/kernel/entry.S
===
--- linux-2.6-ustate.orig/arch/i386/kernel/entry.S  2005-03-10 
09:16:14.888575341 +1100
+++ linux-2.6-ustate/arch/i386/kernel/entry.S   2005-03-10 09:16:15.446188457 
+1100
@@ -876,7 +876,7 @@
.long sys_mq_getsetattr
.long sys_ni_syscall/* reserved for kexec */
.long sys_waitid
-   .long sys_ni_syscall/* 285 */ /* available */
+   .long sys_msa   /* 285 */ /* available */
.long sys_add_key
.long sys_request_key
.long sys_keyctl
Index: linux-2.6-ustate/include/asm-i386/unistd.h
===
--- linux-2.6-ustate.orig/include/asm-i386/unistd.h 2005-03-10 
09:13:00.813843194 +1100
+++ linux-2.6-ustate/include/asm-i386/unistd.h  2005-03-10 09:16:15.448141568 
+1100
@@ -290,7 +290,7 @@
 #define __NR_mq_getsetattr (__NR_mq_open+5)
 #define __NR_sys_kexec_load 283
 #define __NR_waitid 284
-/* #define __NR_sys_setaltroot 285 */
+#define __NR_sys_msa   285
 #define __NR_add_key   286
 #define __NR_request_key   287
 #define __NR_keyctl 288


Microstate Accounting for 2.6.11, patch 6/6

2005-03-10 Thread Peter Chubb


Microstate accounting: Track time spent asleep while paging,
in poll() or select(), or on a futex separately from other sleeps.
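
The idea behind the msa_next_state() calls in this patch is that code
which knows why it is about to sleep leaves a hint, and the context
switch code uses the hint in preference to guessing from the task
state.  The sketch below is a userspace illustration under that
assumption; resolve_next_state() and the shortened enums are
hypothetical, and the exact rules in msa_switch() may differ.

```c
/* Shortened stand-ins for the scheduler's task states. */
enum task_state { TASK_RUNNING, TASK_INTERRUPTIBLE, TASK_UNINTERRUPTIBLE };

/* Shortened stand-ins for the microstates named in the patches. */
enum thread_state {
	MSA_UNKNOWN,
	ONACTIVEQUEUE,
	INTERRUPTIBLE_SLEEP,
	UNINTERRUPTIBLE_SLEEP,
	PAGING_SLEEP,
	FUTEX_SLEEP,
	POLL_SLEEP
};

/*
 * If the sleeper left a hint, use it; otherwise infer the microstate
 * from the scheduler's view of the task.
 */
static enum thread_state resolve_next_state(enum thread_state hint,
					    enum task_state st)
{
	if (hint != MSA_UNKNOWN)
		return hint;

	switch (st) {
	case TASK_INTERRUPTIBLE:
		return INTERRUPTIBLE_SLEEP;
	case TASK_UNINTERRUPTIBLE:
		return UNINTERRUPTIBLE_SLEEP;
	case TASK_RUNNING:
		return ONACTIVEQUEUE;
	}
	return MSA_UNKNOWN;
}
```

With this scheme a futex wait and a poll() wait, which both look like
TASK_INTERRUPTIBLE to the scheduler, end up in separate timers.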

 fs/select.c |2 ++
 kernel/futex.c |2 ++
 mm/memory.c |6 +-


Index: linux-2.6-ustate/mm/memory.c
===
--- linux-2.6-ustate.orig/mm/memory.c   2005-03-10 09:12:59.492564100 +1100
+++ linux-2.6-ustate/mm/memory.c2005-03-10 09:16:16.583875465 +1100
@@ -2079,6 +2079,7 @@
if (is_vm_hugetlb_page(vma))
return VM_FAULT_SIGBUS; /* mapping truncation does this. */
 
+   msa_next_state(current, PAGING_SLEEP);
/*
 * We need the page table lock to synchronize with kswapd
 * and the SMP-safe atomic PTE updates.
@@ -2098,10 +2099,13 @@
if (!pte)
goto oom;

-   return handle_pte_fault(mm, vma, address, write_access, pte, pmd);
+   int ret = handle_pte_fault(mm, vma, address, write_access, pte, pmd);
+   msa_next_state(current, MSA_UNKNOWN);
+   return ret;
 
  oom:
spin_unlock(mm-page_table_lock);
+   msa_next_state(current, MSA_UNKNOWN);
return VM_FAULT_OOM;
 }
 

Index: linux-2.6-ustate/kernel/futex.c
===
--- linux-2.6-ustate.orig/kernel/futex.c2005-03-10 09:12:58.843154938 
+1100
+++ linux-2.6-ustate/kernel/futex.c 2005-03-10 09:16:17.109262256 +1100
@@ -39,6 +39,7 @@
 #include <linux/mount.h>
 #include <linux/pagemap.h>
 #include <linux/syscalls.h>
+#include <linux/msa.h>
 
 #define FUTEX_HASHBITS (CONFIG_BASE_SMALL ? 4 : 8)
 
@@ -571,6 +572,7 @@
 * wakes us up.
 */
 
+   msa_next_state(current, FUTEX_SLEEP);
/* add_wait_queue is the barrier after __set_current_state. */
__set_current_state(TASK_INTERRUPTIBLE);
add_wait_queue(q.waiters, wait);


Index: linux-2.6-ustate/fs/select.c
===
--- linux-2.6-ustate.orig/fs/select.c   2005-03-10 09:12:59.182996124 +1100
+++ linux-2.6-ustate/fs/select.c2005-03-10 09:16:16.843639194 +1100
@@ -256,6 +256,7 @@
retval = table.error;
break;
}
+   msa_next_state(current, POLL_SLEEP);
__timeout = schedule_timeout(__timeout);
}
__set_current_state(TASK_RUNNING);
@@ -447,6 +448,7 @@
count = wait-error;
if (count)
break;
+   msa_next_state(current, POLL_SLEEP);
timeout = schedule_timeout(timeout);
}
__set_current_state(TASK_RUNNING);


Microstate Accounting for 2.6.11, patch 4/6

2005-03-10 Thread Peter Chubb
Microstate accounting:  Account for time in interrupt handlers for I386.

 arch/i386/kernel/irq.c |   13 -
 1 files changed, 12 insertions(+), 1 deletion(-)


Index: linux-2.6-ustate/arch/i386/kernel/irq.c
===
--- linux-2.6-ustate.orig/arch/i386/kernel/irq.c2005-03-10 
09:13:00.115606274 +1100
+++ linux-2.6-ustate/arch/i386/kernel/irq.c 2005-03-10 09:16:16.032121680 
+1100
@@ -55,6 +55,8 @@
 #endif
 
irq_enter();
+   msa_start_irq(irq);
+   
 #ifdef CONFIG_DEBUG_STACKOVERFLOW
/* Debugging check for stack overflow: is there less than 1KB free? */
{
@@ -101,6 +103,7 @@
 #endif
__do_IRQ(irq, regs);
 
+   msa_finish_irq(irq);
irq_exit();
 
return 1;
@@ -221,10 +224,18 @@
seq_printf(p, "%3d: ",i);
 #ifndef CONFIG_SMP
seq_printf(p, "%10u ", kstat_irqs(i));
+#ifdef CONFIG_MICROSTATE
+   seq_printf(p, "%10llu", msa_irq_time(0, i));
+#endif
 #else
for (j = 0; j < NR_CPUS; j++)
-   if (cpu_online(j))
+   if (cpu_online(j)) {
seq_printf(p, "%10u ", kstat_cpu(j).irqs[i]);
+#ifdef CONFIG_MICROSTATE
+   seq_printf(p, "%10llu", msa_irq_time(j, i));
+#endif
+   }
+
 #endif
seq_printf(p, " %14s", irq_desc[i].handler->typename);
seq_printf(p, "  %s", action->name);


Microstate accounting, IA64 support

2005-03-10 Thread Peter Chubb
Microstate Accounting: 
Add support for IA64.


 linux-2.6-ustate/arch/ia64/Kconfig   |   25 +++
 linux-2.6-ustate/arch/ia64/kernel/entry.S|   44 +++
 linux-2.6-ustate/arch/ia64/kernel/irq_ia64.c |   21 +++-
 linux-2.6-ustate/arch/ia64/kernel/ivt.S  |8 +++-
 linux-2.6-ustate/include/asm-ia64/msa.h  |   33 
 linux-2.6-ustate/include/asm-ia64/unistd.h   |1 
 7 files changed, 129 insertions(+), 5 deletions(-)

Index: linux-2.6-ustate/arch/ia64/Kconfig
===
--- linux-2.6-ustate.orig/arch/ia64/Kconfig 2005-03-10 09:13:01.780632777 
+1100
+++ linux-2.6-ustate/arch/ia64/Kconfig  2005-03-10 09:16:14.593655619 +1100
@@ -302,6 +302,31 @@
  little bigger and slows down execution a bit, but it is generally
  a good idea to turn this on.  If you're unsure, say Y.
 
+config MICROSTATE
+   bool "Microstate accounting"
+   help
+     This option causes the kernel to keep very accurate track of
+     how long your threads spend on the runqueues, running, or asleep or
+     stopped.  It will slow down your kernel.
+     Times are reported in /proc/pid/msa and through a new msa()
+     system call.
+choice
+   depends on MICROSTATE
+   prompt "Microstate timing source"
+   default MICROSTATE_ITC
+   help
+     On IA64 one can use two timing sources for the microstate
+     accounting: the on-chip interval counter, or Linux's
+     time-of-day clock.  The first is very cheap; the other is
+     more accurate on SMP systems.
+
+config MICROSTATE_ITC
+   bool "Use on-chip ITC for microstate timing"
+
+config MICROSTATE_TOD
+   bool "Use time-of-day clock for microstate timings"
+endchoice
+
 config IA64_PALINFO
tristate "/proc/pal support"
help
Index: linux-2.6-ustate/include/asm-ia64/msa.h
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ linux-2.6-ustate/include/asm-ia64/msa.h 2005-03-10 09:16:14.594632174 
+1100
@@ -0,0 +1,33 @@
+/*
+ * asm-ia64/msa.h
+ *
+ * Provide an architecture-specific clock.
+ */
+
+#ifndef _ASM_IA64_MSA_H
+#define _ASM_IA64_MSA_H
+
+#include <asm/processor.h>
+#include <asm/timex.h>
+#include <asm/smp.h>
+
+
+# if defined(CONFIG_MICROSTATE_ITC)
+#   define MSA_NOW(now)  do { now = (clk_t)get_cycles(); } while (0)
+
+#   define MSA_TO_NSEC(clk) ((1000000000*(clk)) / cpu_data(smp_processor_id())->itc_freq)
+
+# elif defined(CONFIG_MICROSTATE_TOD)
+static inline void msa_now(clk_t *nsp) {
+   struct timeval tv;
+   do_gettimeofday(&tv);
+   *nsp = tv.tv_sec * 1000000 + tv.tv_usec;
+}
+#   define MSA_NOW(x) msa_now(&x)
+#   define MSA_TO_NSEC(clk) ((clk) * 1000)
+
+# else
+#  include <asm-generic/msa.h>
+# endif
+
+#endif /* _ASM_IA64_MSA_H */
Microstate Accounting: Track time in system calls for IA64

 arch/ia64/kernel/entry.S |   44 
 arch/ia64/kernel/ivt.S   |8 ++--
 2 files changed, 50 insertions(+), 2 deletions(-)

Index: linux-2.6-ustate/arch/ia64/kernel/entry.S
===
--- linux-2.6-ustate.orig/arch/ia64/kernel/entry.S  2005-03-10 
09:13:01.149778160 +1100
+++ linux-2.6-ustate/arch/ia64/kernel/entry.S   2005-03-10 09:16:15.157128068 
+1100
@@ -589,6 +589,46 @@
 .ret4: br.cond.sptk ia64_leave_kernel
 END(ia64_strace_leave_kernel)
 
+#ifdef CONFIG_MICROSTATE
+/*
+ * preserve input registers,
+ * and r8
+ */
+GLOBAL_ENTRY(invoke_msa_end_syscall)
+   .prologue ASM_UNW_PRLG_RP|ASM_UNW_PRLG_PFS, ASM_UNW_PRLG_GRSAVE(8)
+   alloc loc1=ar.pfs,8,4,0,0
+   mov loc0=rp
+   .body
+   ;;
+   mov loc2=ret0
+   mov loc3=ret2
+   br.call.sptk.many rp=msa_end_syscall
+1: mov rp=loc0
+   mov ret0=loc2
+   mov ret2=loc3
+   mov ar.pfs=loc1
+   br.ret.sptk.many rp
+END(invoke_msa_end_syscall)
+/*
+ * Preserves in0-7, and all callee-save registers.
+ */
+GLOBAL_ENTRY(invoke_msa_start_syscall)
+   .prologue ASM_UNW_PRLG_RP|ASM_UNW_PRLG_PFS, ASM_UNW_PRLG_GRSAVE(8)
+   alloc loc1=ar.pfs,8,4,0,0
+   mov loc0=rp
+   .body
+   mov loc2=r3
+   mov loc3=r15
+   ;;
+   br.call.sptk.many rp=msa_start_syscall
+1: mov r15=loc3
+   mov r3=loc2
+   mov ar.pfs=loc1
+   mov rp=loc0
+   br.ret.sptk.many rp
+END(invoke_msa_start_syscall)
+#endif /* CONFIG_MICROSTATE */
+
 GLOBAL_ENTRY(ia64_ret_from_clone)
PT_REGS_UNWIND_INFO(0)
 {  /*
@@ -671,6 +711,10 @@
  */
 ENTRY(ia64_leave_syscall)
PT_REGS_UNWIND_INFO(0)
+#ifdef CONFIG_MICROSTATE
+   br.call.sptk.many rp=invoke_msa_end_syscall
+1: 
+#endif
/*
 * work.need_resched etc. mustn't get changed by this CPU before it 
returns to
   

Re: Microstate Accounting for 2.6.11

2005-03-10 Thread Peter Chubb
>>>>> "Andrew" == Andrew Morton <[EMAIL PROTECTED]> writes:

Andrew> Peter Chubb <[EMAIL PROTECTED]> wrote:
>> Timing data on threads at present is pretty crude: when the timer
>> interrupt occurs, a tick is added to either system time or user
>> time for the currently running thread.  Thus in an unpatched kernel
>> one can distinguish three timed states: On-cpu in userspace, on-cpu
>> in system space, and not running.
>>
>> The actual number of states is much larger.  A thread can be on a
>> runqueue or the expired queue (i.e., ready to run but not running),
>> sleeping on a semaphore or on a futex, having its time stolen to
>> service an interrupt, etc., etc.
>>
>> This patch adds timers per-state to each struct task_struct, so
>> that time in all these states can be tracked.  This patch contains
>> the core code to do the timing, and to initialise the timers.
>> Subsequent patches enable the code (by adding Kconfig options) and
>> add hooks to track state changes.

Andrew> Why does the kernel need this feature?

I find that it's useful when trying to work out why a thread is going
more slowly than it needs to.  Userspace tools in the CVS repository
at gelato.unsw.edu.au let you graph in real time the time spent in
each state, so you get graphs like this:

 http://gelato.unsw.edu.au/patches/snapshot.png

which shows mplay skipping because of a slow disk/filesystem.

Andrew> Have you any numbers on the overhead?

Around 5% on LMbench context switch numbers for uniprocessor,
negligible on SMP (but SMP context switch results are horrible at the
moment according to LMbench2 -- almost 16usec); select on 10 fds goes
from 1.665 usec to 1.701 usec.

Andrew> The preempt_disable() in sys_msa() seems odd.

Yes, I only added that yesterday.  It's to prevent migration while
updating the current timer.  All the other places where the current
timer is updated are naturally protected against this.  It should
probably be a local_irq_disable() instead.

Peter C



Re: Reading large /proc entry from kernel module

2005-03-08 Thread Peter Chubb
>>>>> "Kristian" == Kristian Sørensen <[EMAIL PROTECTED]> writes:

Kristian> Hi all!  I have some trouble reading a 2346 byte /proc entry
Kristian> from our Umbrella kernel module.


Kristian> static int umb_proc_write(struct file *file, const char *buffer,
Kristian>  unsigned long count, void *data) {
Kristian>   char *policy;
Kristian>   int *lbuf;
Kristian>   int i;

Here's your problem:  lbuf should be a char * not an int *.
When you look at lbuf[0] you'll get the first four characters packed
into the int.
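
A small userspace sketch of the effect, assuming a 4-byte int;
first_word() and first_char() are hypothetical helpers for this
illustration, and memcpy() is used so the example itself stays free of
aliasing problems:

```c
#include <stdint.h>
#include <string.h>

/*
 * What an int * view of the buffer sees at index 0: the first four
 * bytes of the text packed into one 32-bit word.
 */
static uint32_t first_word(const char *buf)
{
	uint32_t w;

	memcpy(&w, buf, sizeof w);
	return w;
}

/* What a char * view sees at index 0: just the first character. */
static char first_char(const char *buf)
{
	return buf[0];
}
```

For the string "abcd", first_char() returns 'a', while first_word()
returns a value whose four bytes are 'a', 'b', 'c' and 'd' (in an
order that depends on endianness), so comparing it against a single
character will never match.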
-- 
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*

