Re: [perfmon2] perfmon2 merge news

2007-12-15 Thread Frank Ch. Eigler
Stephane Eranian <[EMAIL PROTECTED]> writes:

> [...]
>> > [...]  AFAIK, there is no single call to stop T1 and wait until it
>> > is completely off the CPU, unless we go through the (internal)
>> > ptrace interface.
>> 
>> The utrace code supports this style of thread manipulation better
>> than ptrace.
>
> Are you saying that utrace provides a utrace_thread_stop(tid) call
> that returns only when the thread tid is off the CPU, and that there
> is a matching utrace_thread_resume(tid) call? If that's the case, then
> that is what I need.

While I see no single call, it can be synthesized from a sequence of
them: utrace_attach, utrace_set_flags (... UTRACE_ACTION_QUESCE ...),
then waiting for a callback.  Roland, is there a more compact way?

> How are we with regards to utrace integration?

Roland McGrath is working on breaking the patches down.

- FChE
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [perfmon2] perfmon2 merge news

2007-12-14 Thread Stephane Eranian
Charles,

On Fri, Dec 14, 2007 at 02:12:17PM -0500, Frank Ch. Eigler wrote:
> 
> Stephane Eranian <[EMAIL PROTECTED]> writes:
> 
> > [...]  AFAIK, there is no single call to stop T1 and wait until it
> > is completely off the CPU, unless we go through the (internal)
> > ptrace interface.
> 
> The utrace code supports this style of thread manipulation better
> than ptrace.

Are you saying that utrace provides a utrace_thread_stop(tid) call
that returns only when the thread tid is off the CPU, and that there
is a matching utrace_thread_resume(tid) call? If that's the case, then
that is what I need.

How are we with regards to utrace integration?

Thanks.

-- 
-Stephane


Re: [perfmon2] perfmon2 merge news

2007-12-14 Thread Frank Ch. Eigler

Stephane Eranian <[EMAIL PROTECTED]> writes:

> [...]  AFAIK, there is no single call to stop T1 and wait until it
> is completely off the CPU, unless we go through the (internal)
> ptrace interface.

The utrace code supports this style of thread manipulation better
than ptrace.

- FChE


Re: [perfmon2] perfmon2 merge news

2007-12-13 Thread Stephane Eranian
Hello,

A few weeks back, I mentioned that I would post some
interesting problems that I have encountered while
implementing perfmon and for which I am still looking
for better solutions.

Here is one that I would like to solve right now and
for which I am interested in your comments.

One of the perfmon syscalls (pfm_restart()) is used to
resume monitoring after a user-level notification. When
operating in per-thread, non-self-monitoring mode, the
syscall needs to operate on the machine state of the
monitored thread. So you get into this situation:


Thread T0                           Thread T1
    |                                   |
 pfm_restart()                          |
    |                                   |
 spin_lock_irqsave()                    |
    |                                   |
 <modify T1's machine state>        --->|
    |                                   |
 spin_unlock_irqrestore()               |
    |                                   |
    v                                   v

Thread T1 may be running at the time T0 needs to modify its state.
The current solution is to set a TIF flag in T1. That TIF flag will
cause T1 (on kernel exit) to go into a perfmon function that will
then modify the state, i.e., state is self-modified. That works okay
but there are a few race conditions. For self-monitoring sessions
(e.g., system-wide or per-thread), it is easy because we operate in
the correct thread.

But there is a big difference between self-monitoring and non
self-monitoring. The pfm_restart() syscall does not provide the
same guarantee.

In self-monitoring modes, the interface guarantees that by the time you
return from the call, the effects of the call are visible. When
monitoring another thread, however, the call currently provides no such
guarantee, i.e., it does not wait until T1 has seen the TIF flag and
completed the state modification before returning. We could add a semaphore
to enforce that guarantee, but it gets difficult with corner cases and
cleanups in case of unexpected termination.
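
To make the missing guarantee concrete, here is a minimal single-threaded
model in C of the deferred update described above. It is only a sketch: the
structure and function names are hypothetical and not part of perfmon, and
real kernel code would need proper locking or atomics around the flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the monitored thread T1; not a perfmon type. */
struct monitored_thread {
    bool     restart_pending;  /* plays the role of the TIF flag */
    uint64_t pmu_state;        /* machine state that only T1 modifies */
    uint64_t pending_value;    /* value T0 asked T1 to install */
    unsigned completions;      /* incremented by T1 after each update */
};

/* T0 side: post the request and return at once, without waiting for T1. */
static unsigned request_restart(struct monitored_thread *t1, uint64_t value)
{
    assert(t1 != NULL);
    t1->pending_value = value;
    t1->restart_pending = true;
    return t1->completions + 1;  /* ticket T0 would have to wait for */
}

/* T1 side: runs on the way out of the kernel; T1 modifies its own state. */
static void on_kernel_exit(struct monitored_thread *t1)
{
    if (!t1->restart_pending)
        return;
    t1->pmu_state = t1->pending_value;
    t1->restart_pending = false;
    t1->completions++;
}

/* The check a waiting T0 would need: has T1 applied the ticketed request? */
static bool restart_completed(const struct monitored_thread *t1,
                              unsigned ticket)
{
    return t1->completions >= ticket;
}
```

In this model, request_restart() returning says nothing about pmu_state;
only after T1 has passed through on_kernel_exit() does restart_completed()
become true. A blocking variant would have T0 sleep until that point, which
is where the corner cases around unexpected termination come in.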

AFAIK, there is no single call to stop T1 and wait until it is completely
off the CPU, unless we go through the (internal) ptrace interface. 

Would you have anything better to suggest?

Thanks.

--
-Stephane


Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-19 Thread David Miller
From: Stephane Eranian <[EMAIL PROTECTED]>
Date: Mon, 19 Nov 2007 12:53:30 -0800

> In any case, I would be happy to integrate your sparc64 patches.

I sent these to Philip Mucci late last night, but in the meantime
I finished implementing breakpoint support as well for pfmon.

Let me clean up my diffs and I'll send it all out to you in a
few hours.


Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-19 Thread David Miller
From: Stephane Eranian <[EMAIL PROTECTED]>
Date: Mon, 19 Nov 2007 14:48:46 -0800

> Looks like we will have to use bytes (u8) instead.  This may have some
> performance impact as well. Several bitmaps are used in the context/interrupt
> routines. Even with u8, there is still a problem with the bitmap*() macros.
> Now, only a small subset of the bitmap() macros are used, so it may be okay
> to duplicate them for u8.

I think it would be fine to just create a set of bitop interfaces that
operate on u32 objects instead of "unsigned long".

Currently perfmon2 does not need the atomic variants at all, and those
could thus be provided entirely under include/asm-generic/bitops/
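
As a rough idea of what such non-atomic u32 helpers might look like, here is
a small stand-alone sketch in C. The function names are made up for
illustration and are not existing kernel interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BITS_PER_U32 32u

/* Non-atomic: set bit nr in a bitmap stored as an array of 32-bit words. */
static void set_bit_u32(unsigned nr, uint32_t *map)
{
    assert(map != NULL);
    map[nr / BITS_PER_U32] |= UINT32_C(1) << (nr % BITS_PER_U32);
}

/* Non-atomic: clear bit nr. */
static void clear_bit_u32(unsigned nr, uint32_t *map)
{
    assert(map != NULL);
    map[nr / BITS_PER_U32] &= ~(UINT32_C(1) << (nr % BITS_PER_U32));
}

/* Return 1 if bit nr is set, 0 otherwise. */
static int test_bit_u32(unsigned nr, const uint32_t *map)
{
    assert(map != NULL);
    return (int)((map[nr / BITS_PER_U32] >> (nr % BITS_PER_U32)) & 1u);
}
```

Because the word size is fixed at 32 bits, bit nr lands in the same word on
32-bit and 64-bit kernels, which is the property the unsigned long based
operations do not give you.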


Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-19 Thread Stephane Eranian
Paul,

On Tue, Nov 20, 2007 at 08:43:32AM +1100, Paul Mackerras wrote:
> David Miller writes:
> 
> > As a result I've found that perfmon2 is quite nice and allows
> > incredibly useful and powerful tools to be written.  The syscalls
> > aren't that bad and really I see no reason to block its inclusion.
> > 
> > I rescind all of my earlier objections, let's merge this soon :-)
> 
> Strongly agree.  However, I think we need to add structure size
> arguments to most of the syscalls so we can extend them later.
> 
Yes, that is one way. It works well if you only extend structures at the end.
Given that you need to obtain the file descriptor first via a pfm_create_context
call, an alternative could be that you pass a version number to that call to
identify the version the application is requesting.
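
A minimal sketch of that alternative, in C. Everything here is hypothetical
(the session_model type, the version constants, the v1/v2 layouts); it only
shows the version being fixed once at creation and then selecting the
argument layout for later calls.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define ABI_V1 1u
#define ABI_V2 2u

/* Two hypothetical generations of the same argument structure. */
struct pmc_arg_v1 {
    uint32_t reg_num;
    uint32_t reserved;
    uint64_t reg_value;
};

struct pmc_arg_v2 {
    uint32_t reg_num;
    uint32_t reserved;
    uint64_t reg_value;
    uint64_t reg_flags;   /* field added in the later version */
};

/* Stand-in for the per-descriptor state created by the first call. */
struct session_model {
    unsigned abi_version;
};

/* Record the version the application asked for, or reject it. */
static int create_session_model(unsigned requested, struct session_model *s)
{
    assert(s != NULL);
    if (requested < ABI_V1 || requested > ABI_V2)
        return -EINVAL;
    s->abi_version = requested;
    return 0;
}

/* Later calls look up the layout from the session instead of the caller. */
static size_t pmc_arg_size(const struct session_model *s)
{
    return s->abi_version == ABI_V1 ? sizeof(struct pmc_arg_v1)
                                    : sizeof(struct pmc_arg_v2);
}
```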

> Also, something I've been meaning to mention to Stephane is that the
> use of the cast_ulp() macro in perfmon is bogus and won't work on
> 32-bit big-endian platforms such as ppc32 and sparc32.  On such

I don't like those cast_ulp() macros. They were put there to avoid compiler
warnings on some architectures. Clearly with the big-endian issue, we need
to find something else. The bitmap*() macros take an unsigned long *.

The interface uses fixed-size types to ensure ABI compatibility between
32-bit and 64-bit modes. This way there is no need to marshal syscall
arguments for a 32-bit app running on a 64-bit host.

Looks like we will have to use bytes (u8) instead.  This may have some
performance impact as well. Several bitmaps are used in the context/interrupt
routines. Even with u8, there is still a problem with the bitmap*() macros.
Now, only a small subset of the bitmap() macros are used, so it may be okay
to duplicate them for u8.

What do you think?

> platforms you can't take a pointer to an array of u64, cast it to
> unsigned long * and expect the kernel bitmap operations to work
> correctly on it.  At the least you also need to XOR the bit numbers
> with 32 on those platforms.  Another alternative is to define the
> bitmaps as arrays of bytes instead, which eliminates all byte ordering
> and wordsize problems (but makes it more tricky to use the kernel
> bitmap functions directly).
> 

-- 

-Stephane


Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-19 Thread Paul Mackerras
David Miller writes:

> As a result I've found that perfmon2 is quite nice and allows
> incredibly useful and powerful tools to be written.  The syscalls
> aren't that bad and really I see no reason to block its inclusion.
> 
> I rescind all of my earlier objections, let's merge this soon :-)

Strongly agree.  However, I think we need to add structure size
arguments to most of the syscalls so we can extend them later.
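
One possible shape for the size-argument idea, as a user-space sketch in C.
The structure and helper are hypothetical; the policy shown (zero-fill when
the caller's structure is older and shorter, reject when a newer and longer
one carries non-zero data in the part we do not understand) is an assumption
about how such an interface could behave, not a description of perfmon2.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical argument structure; reg_extra was added after version 1. */
struct pmc_arg_model {
    uint32_t reg_num;
    uint32_t reg_flags;
    uint64_t reg_value;
    uint64_t reg_extra;
};

/* Copy a caller structure of usize bytes into the current layout. */
static int copy_sized_arg(struct pmc_arg_model *dst, const void *src,
                          size_t usize)
{
    const unsigned char *p = src;
    size_t ksize = sizeof(*dst);
    size_t i;

    assert(dst != NULL && src != NULL);

    /* Anything shorter than the first published layout is invalid. */
    if (usize < offsetof(struct pmc_arg_model, reg_extra))
        return -EINVAL;

    /* A newer caller may pass more; the unknown tail must be zero. */
    for (i = ksize; i < usize; i++)
        if (p[i] != 0)
            return -E2BIG;

    /* An older caller passes less; missing fields read as zero. */
    memset(dst, 0, ksize);
    memcpy(dst, src, usize < ksize ? usize : ksize);
    return 0;
}
```

This only works cleanly if structures are extended at the end and new fields
are defined so that zero means "old behaviour".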

Also, something I've been meaning to mention to Stephane is that the
use of the cast_ulp() macro in perfmon is bogus and won't work on
32-bit big-endian platforms such as ppc32 and sparc32.  On such
platforms you can't take a pointer to an array of u64, cast it to
unsigned long * and expect the kernel bitmap operations to work
correctly on it.  At the least you also need to XOR the bit numbers
with 32 on those platforms.  Another alternative is to define the
bitmaps as arrays of bytes instead, which eliminates all byte ordering
and wordsize problems (but makes it more tricky to use the kernel
bitmap functions directly).
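
To illustrate the XOR-with-32 point, here is a small portable model in C. It
does not use the kernel bitmap functions; it just lays out one u64 the way a
32-bit big-endian CPU would see it through an unsigned long pointer, and
compares that with a byte-array bitmap. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Model of one u64 as seen through a 32-bit big-endian "unsigned long *":
 * word 0 holds the high half, word 1 the low half.
 */
static void u64_as_be32_words(uint64_t v, uint32_t w[2])
{
    w[0] = (uint32_t)(v >> 32);
    w[1] = (uint32_t)v;
}

/* What a generic 32-bit test_bit() would compute on that view. */
static int test_bit_ul32(unsigned nr, const uint32_t *w)
{
    assert(nr < 64);
    return (int)((w[nr / 32] >> (nr % 32)) & 1u);
}

/* Byte-array alternative: bit nr is always in byte nr / 8. */
static void set_bit_u8(unsigned nr, uint8_t *map)
{
    assert(map != NULL);
    map[nr / 8] |= (uint8_t)(1u << (nr % 8));
}

static int test_bit_u8(unsigned nr, const uint8_t *map)
{
    assert(map != NULL);
    return (map[nr / 8] >> (nr % 8)) & 1;
}
```

With logical bit 5 set in the u64, test_bit_ul32(5, w) looks in the wrong
half and finds nothing, while test_bit_ul32(5 ^ 32, w) finds it. The byte
array has no such dependence on word size or byte order.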

Paul.



Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-19 Thread Stephane Eranian
David,

On Mon, Nov 19, 2007 at 05:08:43AM -0800, David Miller wrote:
> 
> Instead of blabbering further about this topic, I decided to put my
> code where my mouth is and spent the weekend porting the perfmon2
> kernel bits, and the user bits (libpfm and pfmon) to sparc64.
> 

I appreciate your effort. I am glad to see that the interface
and implementation survived yet another architecture. I think at this
point ARM is the only major architecture missing. In any case, I would
be happy to integrate your sparc64 patches.

> As a result I've found that perfmon2 is quite nice and allows
> incredibly useful and powerful tools to be written.  The syscalls
> aren't that bad and really I see no reason to block its inclusion.
> 

As I said earlier, I am not opposed to changing the syscalls. I have
proposed a few schemes to address the issue of versioning. If vector
arguments are problematic, we can go with a single register per call.

I think there are other areas where perfmon2 could benefit from the
help of the LKML developers. I will post a list shortly.

> I rescind all of my earlier objections, let's merge this soon :-)

Thanks.

-- 
-Stephane


Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-19 Thread David Miller

Instead of blabbering further about this topic, I decided to put my
code where my mouth is and spent the weekend porting the perfmon2
kernel bits, and the user bits (libpfm and pfmon) to sparc64.

As a result I've found that perfmon2 is quite nice and allows
incredibly useful and powerful tools to be written.  The syscalls
aren't that bad and really I see no reason to block its inclusion.

I rescind all of my earlier objections, let's merge this soon :-)


Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-15 Thread Stephane Eranian
Hello,

On Wed, Nov 14, 2007 at 08:20:22PM -0800, dean gaudet wrote:
> On Wed, 14 Nov 2007, Andi Kleen wrote:
> 
> > Later a syscall might be needed with event multiplexing, but that seems
> > more like a far away non essential feature.
> 
> actually multiplexing is the main feature i am in need of. there are an 
> insufficient number of counters (even on k8 with 4 counters) to do 
> complete stall accounting or to get a general overview of L1d/L1i/L2 cache 
> hit rates, average miss latency, time spent in various stalls, and the 
> memory system utilization (or HT bus utilization).  this runs out to 
> something like 30 events which are interesting... and re-running a 
> benchmark over and over just to get around the lack of multiplexing is a 
> royal pain in the ass.
> 
> it's not a "far away non-essential feature" to me.  it's something i would 
> use daily if i had all the pieces together now (and i'm constrained 
> because i cannot add an out-of-tree patch which adds unofficial syscalls 
> to the kernel i use).
> 

Multiplexing in the context of perfmon2 means that you can measure more events
than there are counters. To make this work, we create the notion of an event set
or more precisely a register set. Each set encapsulates the full PMU state. Then
the kernel multiplexes the sets onto the actual PMU hardware.

Why do we need this?

As Dean pointed out, there are many important metrics which require more
events than there are counters. Making multiple runs can be difficult with
some workloads.

But there are also other, less well-known reasons why you'd want to do this.
Having lots of counters does not necessarily mean that you can measure lots
of related events simultaneously. Take the Pentium 4, for instance: it has 18
counters, but for most interesting metrics you cannot measure all the events
at once. Why? Because there are important hardware constraints which
translate into event combination constraints. It is not uncommon to have
constraints such as:
- event A and B cannot be measured together
- event A can only be measured by counter X
- if event A is measured, then only events B, C, D can be measured
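
As an illustration of the second kind of constraint only (an event restricted
to certain counters), here is a small sketch in C that checks whether a group
of events can be placed on distinct counters. The event_model type and the
counter masks are hypothetical; real PMU constraint tables are richer, and
the pairwise exclusions above are not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_EVENTS 8

/* Bit c set in allowed_counters means the event may use counter c. */
struct event_model {
    const char *name;
    uint32_t    allowed_counters;
};

/* Backtracking search: place event i, then recurse on the remaining ones. */
static bool assign_from(const struct event_model *ev, int n, int i,
                        uint32_t used, int *counter_of)
{
    if (i == n)
        return true;
    for (int c = 0; c < 32; c++) {
        uint32_t bit = UINT32_C(1) << c;

        if (!(ev[i].allowed_counters & bit) || (used & bit))
            continue;
        counter_of[i] = c;
        if (assign_from(ev, n, i + 1, used | bit, counter_of))
            return true;
    }
    return false;
}

/* True if all n events fit at once; counter_of[] then holds the placement. */
static bool can_measure_together(const struct event_model *ev, int n,
                                 int *counter_of)
{
    assert(ev != NULL && counter_of != NULL);
    assert(n >= 0 && n <= MAX_EVENTS);
    return assign_from(ev, n, 0, 0, counter_of);
}
```

When a group does not fit, it has to be split into several sets, which is
exactly the case multiplexing is meant to handle.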

This is not just on Itanium. Power has limitations, Intel Core 2 has
limitations, and AMD Opterons also have limitations.

When you combine a limited number of counters with strong constraints, it
can quickly become difficult to make measurements in one run.

Multiplexing is, of course, not as good as measuring all events continuously,
but if you run for long enough and with a reasonable switching period, the
*estimates* you get by scaling the obtained counts can be very close to what
they would have been had you measured all events all the time. You have to
balance precision with overhead.
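
The scaling itself is simple proportional extrapolation. A minimal sketch in
C, assuming the tool knows how long a set was actually loaded on the PMU
(time_active) and how long the whole measurement lasted (time_total), both in
the same unit; the function name is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Estimate the count an event would have reached had it been measured
 * for the whole run, from the count seen while its set was active.
 */
static uint64_t scale_count(uint64_t raw, uint64_t time_active,
                            uint64_t time_total)
{
    assert(time_active <= time_total);
    if (time_active == 0)
        return 0;  /* set never ran: nothing to extrapolate from */
    return (uint64_t)((double)raw * (double)time_total /
                      (double)time_active + 0.5);
}
```

For example, a set that was active for a quarter of the run and counted 1000
events gives an estimate of 4000. The estimate is only as good as the
assumption that the workload behaves the same while the set is switched out.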

Why do this in the kernel?

One might argue that there is nothing preventing tools from multiplexing at
the user level. That's true and we do support this as well. You have to:
- stop monitoring
- read out the current counters
- reprogram config and data registers
- restart monitoring
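
The four steps can be sketched against a simulated PMU as follows. This is
not perfmon2 code: pmu_model and event_set_model are hypothetical stand-ins,
and in a real tool each numbered step would be at least one system call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_COUNTERS 2
#define NUM_SETS     2

/* Simulated hardware: config and data registers plus a run bit. */
struct pmu_model {
    bool     running;
    uint64_t config[NUM_COUNTERS];
    uint64_t data[NUM_COUNTERS];
};

/* One event set: what to program, and the counts accumulated so far. */
struct event_set_model {
    uint64_t config[NUM_COUNTERS];
    uint64_t total[NUM_COUNTERS];
};

/* Switch from set *cur to the next one, round-robin. */
static void switch_set(struct pmu_model *pmu, struct event_set_model *sets,
                       int *cur)
{
    struct event_set_model *old, *next;
    int i;

    assert(pmu != NULL && sets != NULL && cur != NULL);
    assert(*cur >= 0 && *cur < NUM_SETS);

    pmu->running = false;                    /* 1. stop monitoring */

    old = &sets[*cur];
    for (i = 0; i < NUM_COUNTERS; i++)       /* 2. read out counters */
        old->total[i] += pmu->data[i];

    *cur = (*cur + 1) % NUM_SETS;
    next = &sets[*cur];
    for (i = 0; i < NUM_COUNTERS; i++) {     /* 3. reprogram registers */
        pmu->config[i] = next->config[i];
        pmu->data[i] = 0;
    }

    pmu->running = true;                     /* 4. restart monitoring */
}
```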

But there are some important benefits to doing this in the kernel, especially
for per-thread monitoring. When you are not self-monitoring, you would need
to stop the other thread first, then issue a minimum of 4 system calls and
incur a couple of context switches. By doing it in the kernel, you guarantee
that switching always occurs in the context of the monitored thread.

Furthermore it can be integrated with kernel-level sampling. Adding the notion
of event set is fairly pervasive and you need to make sure that it fits well
with the other parts of the interface.

-- 
-Stephane


Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-15 Thread Andi Kleen
Herbert Xu <[EMAIL PROTECTED]> writes:

> That's strong static typing.  Netlink is 90% strong static
> typing plus 10% strong dynamic typing.  That is, it'll tell
> you at run-time if you give it the wrong netlink attribute.

Well it tells you EINVAL no matter what is wrong.

That's roughly similar to a compiler whose only error message
is 'WRONG'. Or the ed school of error reporting.

That makes any checking it does barely useful.

-Andi


Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-15 Thread Stephane Eranian
Hi,

On Thu, Nov 15, 2007 at 12:11:10PM +1100, Paul Mackerras wrote:
> David Miller writes:
> 
> > From: Paul Mackerras <[EMAIL PROTECTED]>
> > Date: Thu, 15 Nov 2007 10:12:22 +1100
> > 
> > > *I* never had a problem with a few extra system calls.  I don't
> > >  understand why you (apparently) do.
> > 
> > We're stuck with them forever, they are hard to version and extend
> > cleanly.
> > 
> > Those are my main objections.
> 
> The first is valid (for suitable values of "forever") but applies to
> any user/kernel interface, not just system calls.
> 
Agreed.

> As for the second (hard to version) I don't see why it applies to
> syscalls specifically more than to other interfaces.  It's just a
> matter of designing it correctly in the first place.  For example, the
> sys_swapcontext system call we have on powerpc takes an argument which
> is the size of the ucontext_t that userland is using, which allows us
> to extend it in future if necessary.  (Note that I'm not saying that
> the current perfmon2 interfaces are well-designed in this respect.)
> 
> The third (hard to extend cleanly) is a good point, and is a valid
> criticism of the current set of perfmon2 system calls, I think.
> However, the goal of being able to extend the interface tends to be in
> opposition to the goal of having strong typing of the interface.
> Things like a multiplexed syscall or an ioctl are much easier to
> extend but that is at the expense of losing strong typing.  Something
> like my transaction() (or your weird kind of read() :) also provides
> extensibility but loses type safety to some degree.
> 
In the initial design there was only one perfmon syscall perfmonctl()
and it was a multiplexing call. People objected to it and thus I split it
up into multiple system calls. I like the strong typing but I agree that
it is harder to extend without creating new syscalls. In the current
state, all perfmon syscalls take a pointer to structs which have reserved
fields for future extensions. If you specify that reserved fields must be
zeroed, then it leaves you *some* flexibility for extending the structs.

Another alternative, similar to your ucontext, would be to pass the size
of the structure. If we assume we drop the vector arguments, we could do:

	pfm_write_pmcs(fd, &pmc, sizeof(pmc));
instead of
	pfm_write_pmcs(fd, &pmc);

Should the sizeof(pmc) need to change we could demultiplex inside the
kernel. Another, probably cleaner, possibility is to version structures
that are passed:
	union pfarg_pmc {
		int version;
		struct {
			int version;
			int reg_num;
			u64 reg_value;
		}
	}

But that seems overkill. I think versioning could be passed when the session
is created instead of at every call:

	fd = pfm_create_session(version, &ctx, );
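
The size-argument variant discussed above could look roughly like the sketch below, written as userland C so it can be compiled and tried. The struct names (pfarg_pmc_v1, pfarg_pmc_v2), the extra flags field and the helper pfm_copy_pmc() are all hypothetical; this is not the perfmon2 code, only one way the kernel side might demultiplex on the size passed by the caller.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical original layout that old binaries were built against. */
struct pfarg_pmc_v1 {
	uint64_t reg_value;
	uint32_t reg_num;
	uint32_t reserved;	/* must be zero */
};

/* Hypothetical extended layout: new fields are only ever added at the end. */
struct pfarg_pmc_v2 {
	uint64_t reg_value;
	uint32_t reg_num;
	uint32_t reserved;
	uint64_t flags;		/* made-up field, absent in v1 */
};

/*
 * Stand-in for the kernel side of pfm_write_pmcs(fd, &pmc, sizeof(pmc)).
 * Copies a caller struct of 'usize' bytes into the newest layout.
 * Returns 0 on success or a negative errno value.
 */
static int pfm_copy_pmc(struct pfarg_pmc_v2 *dst, const void *src, size_t usize)
{
	const unsigned char *p = src;
	size_t i;

	assert(dst != NULL && src != NULL);

	if (usize < sizeof(struct pfarg_pmc_v1))
		return -EINVAL;		/* smaller than any known version */

	/* A newer caller may pass a bigger struct; accept it only if the
	 * part this code does not understand is all zeroes. */
	for (i = sizeof(*dst); i < usize; i++)
		if (p[i] != 0)
			return -E2BIG;

	memset(dst, 0, sizeof(*dst));	/* fields the caller lacks default to 0 */
	memcpy(dst, src, usize < sizeof(*dst) ? usize : sizeof(*dst));
	return 0;
}
```

An old binary passing sizeof(struct pfarg_pmc_v1) gets flags == 0, while a newer one can set it, and neither needs a new system call.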


> Also, as Andi says, this is core CPU state that we are dealing with,
> not some I/O device, so treating the whole of perfmon2 (or any
> performance monitoring infrastructure) as a driver doesn't fit very
> well, and in fact system calls are appropriate.  Just like we don't
> try to make access to debugging facilities fit into a driver, we
> shouldn't make performance monitoring fit into a driver either.
> 

Agreed 100%. This is especially true because we support per-thread
monitoring.

-- 
-Stephane


Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-15 Thread Herbert Xu
Paul Mackerras <[EMAIL PROTECTED]> wrote:
>
> Well you must mean something different by "strong typing" from the
> rest of us.  Strong typing means that the compiler can check that you
> have passed in the correct types of arguments, but the compiler
> doesn't have any visibility into what structures are valid in netlink
> messages.

That's strong static typing.  Netlink is 90% strong static
typing plus 10% strong dynamic typing.  That is, it'll tell
you at run-time if you give it the wrong netlink attribute.

The types within each netlink attribute are checked at compile
time.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-15 Thread Andi Kleen
Herbert Xu <[EMAIL PROTECTED]> writes:

> That's strong static typing.  Netlink is 90% strong static
> typing plus 10% strong dynamic typing.  That is, it'll tell
> you at run-time if you give it the wrong netlink attribute.

Well it tells you EINVAL no matter what is wrong.

That's roughly similar to a compiler whose only error message
is 'WRONG'. Or the ed school of error reporting.

That makes any checking it does barely useful.

-Andi


Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-15 Thread Stephane Eranian
Hello,

On Wed, Nov 14, 2007 at 08:20:22PM -0800, dean gaudet wrote:
> On Wed, 14 Nov 2007, Andi Kleen wrote:
>
> > Later a syscall might be needed with event multiplexing, but that seems
> > more like a far away non essential feature.
>
> actually multiplexing is the main feature i am in need of. there are an
> insufficient number of counters (even on k8 with 4 counters) to do
> complete stall accounting or to get a general overview of L1d/L1i/L2 cache
> hit rates, average miss latency, time spent in various stalls, and the
> memory system utilization (or HT bus utilization).  this runs out to
> something like 30 events which are interesting... and re-running a
> benchmark over and over just to get around the lack of multiplexing is a
> royal pain in the ass.
>
> it's not a "far away non-essential feature" to me.  it's something i would
> use daily if i had all the pieces together now (and i'm constrained
> because i cannot add an out-of-tree patch which adds unofficial syscalls
> to the kernel i use).

Multiplexing in the context of perfmon2 means that you can measure more events
than there are counters. To make this work, we create the notion of an event set
or more precisely a register set. Each set encapsulates the full PMU state. Then
the kernel multiplexes the sets onto the actual PMU hardware.

Why do we need this?

As Dean pointed out, there are many important metrics which require more
events than there are counters. Making multiple runs can be difficult with
some workloads.

But there are also other, less known, reasons why you'd want to do this.
Having lots of counters does not necessarily mean that you can measure lots
of related events simultaneously. Take Pentium 4 for instance: it has 18
counters, but for most interesting metrics you cannot measure all the events
at once. Why? Because there are important hardware constraints which
translate into event combination constraints. It is not uncommon to have
constraints such as:
- event A and B cannot be measured together
- event A can only be measured by counter X
- if event A is measured, then only events B, C, D can be measured

This is not just on Itanium. Power has limitations, Intel Core 2 has
limitations, AMD Opterons also have limitations.

When you combine a limited number of counters with strong constraints, it can
quickly become difficult to make measurements in one run.

Multiplexing is, of course, not as good as measuring all events continuously,
but if you run for long enough and with a reasonable switching period, the
*estimates* you get by scaling the obtained counts can be very close to what
they would have been had you measured all events all the time. You have to
balance precision with overhead.
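
The scaling mentioned above can be sketched as follows. This is only an illustration: the struct, its field names and the function are hypothetical, and it assumes the tool knows for how long each event set was actually loaded on the PMU.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-event result as a tool might collect it. */
struct mpx_count {
	uint64_t raw_count;	/* occurrences seen while the set was active */
	uint64_t time_active;	/* time the event's set was on the PMU */
	uint64_t time_total;	/* total duration of the measurement */
};

/*
 * Estimate the count for the whole run by assuming that the event rate
 * observed while the set was active also holds while it was switched out.
 */
static double mpx_scale(const struct mpx_count *c)
{
	assert(c->time_active <= c->time_total);
	if (c->time_active == 0)
		return 0.0;	/* never scheduled: nothing to extrapolate from */
	return (double)c->raw_count * (double)c->time_total
		/ (double)c->time_active;
}
```

An event that was on the PMU for a quarter of the run gets its raw count multiplied by four. The shorter the active fraction, the more the result depends on the workload being uniform over time.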

Why do this in the kernel?

One might argue that there is nothing preventing tools from multiplexing at
the user level. That's true and we do support this as well. You have to:
- stop monitoring
- read out the current counters
- reprogram the config and data registers
- restart monitoring

But there are some important benefits to doing this in the kernel, especially
for per-thread monitoring. When you are not self-monitoring, you would need to
stop the other thread first, then issue a minimum of 4 system calls and incur
a couple of context switches. By doing it in the kernel, you are guaranteed
that switching always occurs in the context of the monitored thread.

Furthermore it can be integrated with kernel-level sampling. Adding the notion
of event set is fairly pervasive and you need to make sure that it fits well
with the other parts of the interface.

-- 
-Stephane


Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-14 Thread dean gaudet
On Thu, 15 Nov 2007, Paul Mackerras wrote:

> dean gaudet writes:
> 
> > actually multiplexing is the main feature i am in need of. there are an 
> > insufficient number of counters (even on k8 with 4 counters) to do 
> > complete stall accounting or to get a general overview of L1d/L1i/L2 cache 
> > hit rates, average miss latency, time spent in various stalls, and the 
> > memory system utilization (or HT bus utilization).  this runs out to 
> > something like 30 events which are interesting... and re-running a 
> > benchmark over and over just to get around the lack of multiplexing is a 
> > royal pain in the ass.
> 
> So by "multiplexing" do you mean the ability to have multiple event
> sets associated with a context and have the kernel switch between them
> automatically?

yep.

-dean


Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-14 Thread Paul Mackerras
dean gaudet writes:

> actually multiplexing is the main feature i am in need of. there are an 
> insufficient number of counters (even on k8 with 4 counters) to do 
> complete stall accounting or to get a general overview of L1d/L1i/L2 cache 
> hit rates, average miss latency, time spent in various stalls, and the 
> memory system utilization (or HT bus utilization).  this runs out to 
> something like 30 events which are interesting... and re-running a 
> benchmark over and over just to get around the lack of multiplexing is a 
> royal pain in the ass.

So by "multiplexing" do you mean the ability to have multiple event
sets associated with a context and have the kernel switch between them
automatically?

Paul.


Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-14 Thread dean gaudet
On Wed, 14 Nov 2007, Andi Kleen wrote:

> Later a syscall might be needed with event multiplexing, but that seems
> more like a far away non essential feature.

actually multiplexing is the main feature i am in need of. there are an 
insufficient number of counters (even on k8 with 4 counters) to do 
complete stall accounting or to get a general overview of L1d/L1i/L2 cache 
hit rates, average miss latency, time spent in various stalls, and the 
memory system utilization (or HT bus utilization).  this runs out to 
something like 30 events which are interesting... and re-running a 
benchmark over and over just to get around the lack of multiplexing is a 
royal pain in the ass.

it's not a "far away non-essential feature" to me.  it's something i would 
use daily if i had all the pieces together now (and i'm constrained 
because i cannot add an out-of-tree patch which adds unofficial syscalls 
to the kernel i use).

-dean


Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-14 Thread Paul Mackerras
David Miller writes:

> From: Paul Mackerras <[EMAIL PROTECTED]>
> Date: Thu, 15 Nov 2007 12:11:10 +1100
> 
> > The third (hard to extend cleanly) is a good point, and is a valid
> > criticism of the current set of perfmon2 system calls, I think.
> > However, the goal of being able to extend the interface tends to be in
> > opposition to the goal of having strong typing of the interface.
> > Things like a multiplexed syscall or an ioctl are much easier to
> > extend but that is at the expense of losing strong typing.
> 
> I disagree.
> 
> With netlink we can just add new attributes when a new need arises for
> a particular interface.  The attribute code describes the type
> precisely, so there is no loss of strong typing at all.

Well you must mean something different by "strong typing" from the
rest of us.  Strong typing means that the compiler can check that you
have passed in the correct types of arguments, but the compiler
doesn't have any visibility into what structures are valid in netlink
messages.

In any case, I think that adding a structure size argument to the
current perfmon2 system calls where appropriate would mean that we
could extend them cleanly later on if necessary.  It would mean that
we could add fields at the end, and that the kernel could know what
version of the structures that userspace was using.

Paul.


Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-14 Thread David Miller
From: Paul Mackerras <[EMAIL PROTECTED]>
Date: Thu, 15 Nov 2007 12:11:10 +1100

> The third (hard to extend cleanly) is a good point, and is a valid
> criticism of the current set of perfmon2 system calls, I think.
> However, the goal of being able to extend the interface tends to be in
> opposition to the goal of having strong typing of the interface.
> Things like a multiplexed syscall or an ioctl are much easier to
> extend but that is at the expense of losing strong typing.

I disagree.

With netlink we can just add new attributes when a new need arises for
a particular interface.  The attribute code describes the type
precisely, so there is no loss of strong typing at all.


Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-14 Thread Paul Mackerras
David Miller writes:

> From: Paul Mackerras <[EMAIL PROTECTED]>
> Date: Thu, 15 Nov 2007 10:12:22 +1100
> 
> > *I* never had a problem with a few extra system calls.  I don't
> >  understand why you (apparently) do.
> 
> We're stuck with them forever, they are hard to version and extend
> cleanly.
> 
> Those are my main objections.

The first is valid (for suitable values of "forever") but applies to
any user/kernel interface, not just system calls.

As for the second (hard to version) I don't see why it applies to
syscalls specifically more than to other interfaces.  It's just a
matter of designing it correctly in the first place.  For example, the
sys_swapcontext system call we have on powerpc takes an argument which
is the size of the ucontext_t that userland is using, which allows us
to extend it in future if necessary.  (Note that I'm not saying that
the current perfmon2 interfaces are well-designed in this respect.)

The third (hard to extend cleanly) is a good point, and is a valid
criticism of the current set of perfmon2 system calls, I think.
However, the goal of being able to extend the interface tends to be in
opposition to the goal of having strong typing of the interface.
Things like a multiplexed syscall or an ioctl are much easier to
extend but that is at the expense of losing strong typing.  Something
like my transaction() (or your weird kind of read() :) also provides
extensibility but loses type safety to some degree.

Also, as Andi says, this is core CPU state that we are dealing with,
not some I/O device, so treating the whole of perfmon2 (or any
performance monitoring infrastructure) as a driver doesn't fit very
well, and in fact system calls are appropriate.  Just like we don't
try to make access to debugging facilities fit into a driver, we
shouldn't make performance monitoring fit into a driver either.

Paul.


Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-14 Thread Paul Mackerras
Andi Kleen writes:

> > This only works when counting (not sampling) and only for self-monitoring.
> 
> It works for global monitoring too.

How would you provide access to the counters of another process?
Through an extension to ptrace perhaps?

Paul.


Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-14 Thread Stephane Eranian
Andi,

On Wed, Nov 14, 2007 at 03:24:11PM +0100, Andi Kleen wrote:
> On Wed, Nov 14, 2007 at 05:09:09AM -0800, Stephane Eranian wrote:
> > 
> > Partially true. The file descriptor becomes really useful when you
> > sample. You leverage the file descriptor to receive notifications of
> > counter overflows and full sampling buffer. You extract notification
> > messages via read() and you can use SIGIO, select/poll.
> 
> Hmm, ok for the event notification we would need a nice interface. Still
> have my doubts a file descriptor is the best way to do this though.
> 

Why do you think the existing interfaces are not a good fit for this?
Is this just because of your problem with file descriptors?

From my experience read(), select(), and SIGIO are fine. I know many tools
use that.

As for the file descriptor, you would need to replace that with another
identifier of some sort. As I pointed out in another message on this thread,
you don't want to use a pid-based identifier. This is not usable when you
monitor other threads and you want to read out the results after their death.


> > Are you suggesting something like: pfm_write_pmcs(fd, 0, 0x1234)?
> 
> See my example below.
> > 
> > That would be quite expensive when you have lots of registers to setup:
> > one syscall per register. The perfmon syscalls to read/write registers
> > accept vector of arguments to amortize the cost of the syscall over
> > multiple registers (similar to poll(2)).
> 
> 
> First system calls are not that slow on Linux. Measure it.
> 
If people do not like vector arguments, then I think I can live with N system
calls to program N registers. Now you have two choices for passing the
arguments:

- a pointer to a struct
	struct pfarg_pmc {
		uint64_t reg_value;
		uint16_t reg_num;
	} pmc0;
	pmc0.reg_num = 0; pmc0.reg_value = 0x1234;
	pfm_write_pmcs(fd, &pmc0);

- explicitly passing every field:
	pfm_write_pmcs(fd, 0x0, 0x1234);

Given that event set and multiplexing would not be in initially, we would want
to allow for them to be added later without having to create yet another
system call, right?

Of course the same approach would work for the data registers, at least for
counting.

> > With many tools, registers are not just setup once. During certain
> > measurements, data registers may be read multiple times. When you sample
> > or multiplex at
> 
> I think you optimize the wrong thing here.
> 
> There are basically two cases I see:
> 
> -  Global measurement of lots of things:

I am not sure I understand what you mean by 'lots of things'?
Are you still talking per-thread and self-monitoring?


> Things are slow anyways with large context switch overheads. The 
> overheads are large anyways. Doing one or more system calls probably
> does not matter much. Most important is a clean interface.
> 
> - Exact measurement of the current process. For that you need very
> low latencies. Any system call is too slow. That is why CPUs have
> instructions like RDPMC that allow to read those registers with
> minimal latency in user space. Interface should support those.
> 

I don't have a problem with that. And in fact, I already support that
at least on Itanium. I had that in there for X86 but I dropped it after
you said that you would enable cr4.pce globally. I don't have a problem
adding it back for self-monitoring sessions.


> Also for this case programming time does not matter too much. You
> just program once and then do RDPMC before code to measure and then
> afterwards and take the difference. The actual counter setup is out 
> of the latency critical path.
> 
Agreed.
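
The "read before, read after, take the difference" pattern can be sketched like this. The two readings would come from RDPMC (or whatever user-level read the architecture offers); the function name and the width parameter are assumptions made for the illustration, and the modulo arithmetic only covers a single wrap-around between the two readings.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Events counted between two raw readings of the same hardware counter.
 * 'width' is the counter width in bits. Hardware counters are often
 * narrower than 64 bits, so the subtraction is done modulo 2^width.
 */
static uint64_t counter_delta(uint64_t before, uint64_t after,
			      unsigned int width)
{
	uint64_t mask;

	assert(width >= 1 && width <= 64);
	mask = (width == 64) ? UINT64_MAX : ((UINT64_C(1) << width) - 1);
	return (after - before) & mask;
}
```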

> 
> > It depends on what you are doing. Here, this was not really necessary. It
> > was meant to show how you can program the data registers as well. Perfmon2
> > provides default values for all data registers. For counters, the value is
> > guaranteed to be zero.
> > 
> > But it is important to note that not all data registers are counters. That
> > is the case of Itanium 2, some are just buffers. On AMD Barcelona IBS
> > several are buffers as well, and some may need to be initialized to non
> > zero value, i.e., the IBS sampling period.
> 
> Setting period should be a separate call. Mixing the two together into one
>  does not look like a nice interface.
> 
Periods are setup by data register. Given that there is already a call to
program the data register, why add another one? You don't need to treat the
sampling period differently from the register value. This is just a value that
will cause the register to overflow after an explicit number of occurrences.
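
As a rough illustration of that last point, here is how a period translates into a register value, assuming an up-counting register of a given width that raises the overflow when it wraps to zero. The function name and the width parameter are hypothetical and not part of the perfmon2 interface.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Value to load into an up-counting register of 'width' bits so that it
 * wraps around to zero after exactly 'period' events.
 */
static uint64_t period_to_reg_value(uint64_t period, unsigned int width)
{
	uint64_t mask;

	assert(width >= 1 && width <= 64);
	mask = (width == 64) ? UINT64_MAX : ((UINT64_C(1) << width) - 1);
	assert(period >= 1 && period <= mask);
	return (0 - period) & mask;	/* 2^width - period */
}
```

With that, "take a sample every 2000 events" is just another value written to the data register, which is why no separate call is needed.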


> > With event-based sampling, the period is expressed as the number of
> > occurrences of an event. For instance, you can say: "take a sample every
> > 2000 L2 cache misses". The way you express this with perfmon2 is that

Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-14 Thread Nick Piggin
On Thursday 15 November 2007 09:56, Chuck Ebbert wrote:
> On 11/14/2007 05:17 AM, Nick Piggin wrote:
> > But in general, for special files, I guess the response is usually
> > some structured data (that is not visible at the syscall layer).
> > So I don't see a big problem to have a similarly arbitrarily
> > structured request.
>
> IOW, an ioctl.

In the same way a read of structured data from a special file
"is an" ioctl, yeah. You could implement either with an ioctl.

The main difference is they have more explicitly typed interfaces.
Whether that's enough argument (and if Paul's proposal is widely
usable enough) is another question. Which I won't try to answer.


Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-14 Thread David Miller
From: Paul Mackerras <[EMAIL PROTECTED]>
Date: Thu, 15 Nov 2007 10:12:22 +1100

> *I* never had a problem with a few extra system calls.  I don't
>  understand why you (apparently) do.

We're stuck with them forever, they are hard to version and extend
cleanly.

Those are my main objections.


Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-14 Thread Paul Mackerras
David Miller writes:

> From: Paul Mackerras <[EMAIL PROTECTED]>
> Date: Thu, 15 Nov 2007 08:50:22 +1100
> 
> > I'd prefer to have a transaction() system call like I suggested to
> > Nick rather than overloading read() like this.
> 
> So much for getting rid of the extra system calls...

*I* never had a problem with a few extra system calls.  I don't
 understand why you (apparently) do.

Paul.


Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-14 Thread David Miller
From: Paul Mackerras <[EMAIL PROTECTED]>
Date: Thu, 15 Nov 2007 08:50:22 +1100

> I'd prefer to have a transaction() system call like I suggested to
> Nick rather than overloading read() like this.

So much for getting rid of the extra system calls...


Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-14 Thread Chuck Ebbert
On 11/14/2007 05:17 AM, Nick Piggin wrote:
> 
> But in general, for special files, I guess the response is usually
> some structured data (that is not visible at the syscall layer).
> So I don't see a big problem to have a similarly arbitrarily
> structured request.
> 
> 

IOW, an ioctl.


Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-14 Thread Nick Piggin
On Thursday 15 November 2007 08:30, Paul Mackerras wrote:
> Nick Piggin writes:
> > What I really mean is a readv-like syscall, but one that also
> > vectorises the file offset. Maybe this is useful enough as a generic
> > syscall that also helps Paul's example...
>
> I've sometimes thought it would be useful to have a "transaction"
> system call that is like a write + read combined into one:
>
>   int transaction(int fd, char *req, size_t req_nb,
>   char *reply, size_t reply_nb);
>
> as a way to provide a general request/reply interface for special
> files.

Maybe not a bad idea, though I'm not the one to ask about taste ;)
In this case, it is enough for your requests to be a set of scalars
(eg. file offsets), so it _could_ be handled with vectorised offsets...

But in general, for special files, I guess the response is usually
some structured data (that is not visible at the syscall layer).
So I don't see a big problem to have a similarly arbitrarily
structured request.


> > Of course, I guess this all depends on whether the atomicity is an
> > important requirement. If not, you can obviously just do it with
> > multiple read syscalls...
>
> That would take N system calls instead of one, which could have a
> performance impact if you need to read the counters frequently (which
> I believe you do in some performance monitoring situations).

That's true too.


Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-14 Thread Paul Mackerras
David Miller writes:

> > You're suggesting that the behaviour of a read() should depend on what
> > was in the buffer before the read?  Gack!  Surely you have better
> > taste than that?
> 
> Absolutely that's what I mean, it's atomic and gives you exactly what
> you need.
> 
> I see nothing wrong or gross with these semantics.  Nothing in the
> "book of UNIX" specifies that for a device or special file the passed
> in buffer cannot contain input control data.

Oh kay *shudders*

It really violates the abstract model of "read" pretty badly.  "Read"
is "fill in the buffer with data from the device", not "do some
arbitrary stuff with this area of memory".

I'd prefer to have a transaction() system call like I suggested to
Nick rather than overloading read() like this.

> > Then you end up with two system calls to get the data rather than one
> > (one to send the request and another to read the reply).  For
> > something that needs to be quick that is a suboptimal interface.
> 
> Not necessarily, consider the possibility of using recvmsg() control
> message data.  With that it could be done in one go.
> 
> This also suggests that it could be implemented as its own protocol
> family.

There's all sorts of possible ways that it could be implemented.  On
the one hand we have an actual proposed implementation, and on the
other we have various people saying "oh but it could be implemented
this other way" without providing any actual code.

Now if those people can show that their way of doing it is
significantly simpler and better than the existing implementation,
then that's useful.  I really don't think that doing a whole new
net protocol family is a simpler and better way of doing a performance
monitor interface, though.

Paul.


Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-14 Thread Paul Mackerras
Nick Piggin writes:

> What I really mean is a readv-like syscall, but one that also
> vectorises the file offset. Maybe this is useful enough as a generic
> syscall that also helps Paul's example...

I've sometimes thought it would be useful to have a "transaction"
system call that is like a write + read combined into one:

int transaction(int fd, char *req, size_t req_nb,
char *reply, size_t reply_nb);

as a way to provide a general request/reply interface for special
files.

> Of course, I guess this all depends on whether the atomicity is an
> important requirement. If not, you can obviously just do it with
> multiple read syscalls...

That would take N system calls instead of one, which could have a
performance impact if you need to read the counters frequently (which
I believe you do in some performance monitoring situations).

Paul.


Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-14 Thread David Miller
From: Andi Kleen <[EMAIL PROTECTED]>
Date: Wed, 14 Nov 2007 13:38:38 +0100

> At least for x86 and I suspect some other architectures we don't
> initially need a syscall at all for this. There is an instruction
> RDPMC which can read a performance counter just fine. It is also much
> faster and generally preferable for the case where a process measures
> events about itself. In fact it is essential for one of the use cases
> I would like to see perfmon used (replacement of RDTSC for cycle
> counting) 

I wouldn't even want to use a syscall for something like
that on Sparc, I'd rather give this a dedicated software
trap so that I can code it completely in assembler.


Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-14 Thread Andi Kleen
> BTW, isn't rdpmc only enabled for ring 0 on Linux? I remember a patch
> to disable it, dunno if it has been applied.

Obviously -- without a system call to set up performance counters it
would be fairly useless. But of course once such system calls are in
they should be able to trigger the bit for each process.

-Andi


Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-14 Thread Philippe Elie
On Wed, 14 Nov 2007 at 10:44 +, Will Cohen wrote:

> Andi Kleen wrote:
> 
> >>One approach does not prevent the other. Assuming you allow cr4.pce, then
> >>nothing prevents a self-monitoring thread from reading the counters
> >>directly. You'll just get the lower 32-bit of it. So if you read
> >>frequently enough, you should not have a problem.
> >
> >Hmm? RDPMC is 64bit.
> 
> There are a number of processors that have 32-bit counters such as the IBM 
> power processors. On many x86 processors the upper bits of the counter are 
> sign extended from the lower 32 bits. Thus, one can only assume the lower 
> 32-bit are available. Roll over of values is quite possible (<2 seconds of 
> cycle count), so additional work needs to be done to obtain a valid value.

On x86 they are sign-extended only on write. On read they are 40 bits wide
for Intel, 48 bits for AMD.

BTW, isn't rdpmc only enabled for ring 0 on Linux? I remember a patch
to disable it, dunno if it has been applied.

-- 
Phe



Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-14 Thread Stephane Eranian

On Wed, Nov 14, 2007 at 10:44:20AM -0500, William Cohen wrote:
> Andi Kleen wrote:
> 
> >>One approach does not prevent the other. Assuming you allow cr4.pce, then
> >>nothing prevents a self-monitoring thread from reading the counters
> >>directly. You'll just get the lower 32-bit of it. So if you read
> >>frequently enough, you should not have a problem.
> >
> >Hmm? RDPMC is 64bit.
> 
> There are a number of processors that have 32-bit counters such as the IBM 
> power processors. On many x86 processors the upper bits of the counter are 
> sign extended from the lower 32 bits. Thus, one can only assume the lower 
> 32-bit are available. Roll over of values is quite possible (<2 seconds of 
> cycle count), so additional work needs to be done to obtain a valid value.
> 

Exactly, on Intel only the bottom 32 bits are actually usable, the rest is
sign-extension. That's why it is okay for measuring small sections of code,
but that's it. On AMD, I think it is better. On Itanium you get the full
47 bits' worth. Don't know about Power or Cell.

-- 
-Stephane


Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-14 Thread William Cohen

Andi Kleen wrote:


>> One approach does not prevent the other. Assuming you allow cr4.pce, then
>> nothing prevents a self-monitoring thread from reading the counters
>> directly. You'll just get the lower 32-bit of it. So if you read
>> frequently enough, you should not have a problem.
>
> Hmm? RDPMC is 64bit.


There are a number of processors that have 32-bit counters such as the IBM power 
processors. On many x86 processors the upper bits of the counter are sign 
extended from the lower 32 bits. Thus, one can only assume the lower 32-bit are 
available. Roll over of values is quite possible (<2 seconds of cycle count), so 
additional work needs to be done to obtain a valid value.
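
The roll-over handling can be sketched as below. This is an illustration,
not perfmon2 code: counter_delta() and struct soft_counter are
hypothetical names, and the sketch assumes the counter wraps at most once
between two reads, which is why the reads have to be frequent enough.

```c
#include <stdint.h>

/* Difference between two raw reads of a counter that is only `width`
 * bits wide (1 <= width <= 64). Correct as long as the counter wrapped
 * at most once between the two reads. */
uint64_t counter_delta(uint64_t prev, uint64_t now, unsigned width)
{
    uint64_t mask = (width >= 64) ? ~0ULL : ((1ULL << width) - 1);

    return (now - prev) & mask;
}

/* Software-maintained 64-bit view of a narrower hardware counter. */
struct soft_counter {
    uint64_t total;   /* accumulated 64-bit value */
    uint64_t last;    /* previous raw read, already masked */
    unsigned width;   /* number of usable hardware bits */
};

/* Fold a new raw read into the 64-bit total. Masking first discards any
 * sign-extended upper bits. */
void soft_counter_update(struct soft_counter *c, uint64_t raw)
{
    uint64_t mask = (c->width >= 64) ? ~0ULL : ((1ULL << c->width) - 1);

    raw &= mask;
    c->total += counter_delta(c->last, raw, c->width);
    c->last = raw;
}
```

With width set to 32, a read of 0x10 following a read of 0xFFFFFFF0
yields a delta of 0x20 rather than a huge negative jump.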



>> But keep in mind that we do want a uniform interface across all hardware
>> and all type of sessions (self-monitoring, CPU-wide, monitoring of another
>> thread). You don't want an interface that says on x86 you have to use
>> rdpmc, on Itanium pfm_read_pmds() and so
>
> I disagree. Using RDPMC is essential for at least some of the things I
> would like to do with perfmon2. If the interface does not provide it it is
> useless to me at least. System calls are far too slow for cycle
> measurements.


What range of cycles are you interested in measuring? 100's of cycles? A couple 
thousand? Are you just looking at cycle counts or other events?


-Will


Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-14 Thread Andi Kleen
On Wed, Nov 14, 2007 at 06:13:42AM -0800, Stephane Eranian wrote:
> > At least for x86 and I suspect some other architectures we don't
> > initially need a syscall at all for this. There is an instruction
> > RDPMC which can read a performance counter just fine. It is also much
> > faster and generally preferable for the case where a process measures
> > events about itself. In fact it is essential for one of the use cases
> > I would like to see perfmon used (replacement of RDTSC for cycle
> > counting) 
> > 
> 
> This only works when counting (not sampling) and only for self-monitoring.

It works for global monitoring too.

> 
> > Later a syscall might be needed with event multiplexing, but that seems
> > more like a far away non essential feature.
> > 
> On a machine with only two generic counters such as MIPS or Intel Core 2 Duo,
> multiplexing offers some advantages. If NMI watchdog is enabled, then you drop
> to one generic counter on Core 2.

NMI watchdog is off by default now.

Yes longer term we might need multiplexing, but definitely not as first step.

-Andi


Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-14 Thread Andi Kleen
On Wed, Nov 14, 2007 at 05:09:09AM -0800, Stephane Eranian wrote:
> 
> Partially true. The file descriptor becomes really useful when you sample.
> You leverage the file descriptor to receive notifications of counter overflows
> and full sampling buffer. You extract notification messages via read() and
> you can use SIGIO, select/poll.

Hmm, ok for the event notification we would need a nice interface. Still
have my doubts a file descriptor is the best way to do this though.

> Are you suggesting something like: pfm_write_pmcs(fd, 0, 0x1234)?

See my example below.
> 
> That would be quite expensive when you have lots of registers to setup: one
> syscall per register. The perfmon syscalls to read/write registers accept
> vector of arguments to amortize the cost of the syscall over multiple
> registers (similar to poll(2)).


First, system calls are not that slow on Linux. Measure it.


> 
> With many tools, registers are not just setup once. During certain
> measurements, data registers may be read multiple times. When you sample
> or multiplex at

I think you optimize the wrong thing here.

There are basically two cases I see:

-  Global measurement of lots of things:
Things are slow anyways with large context switch overheads. The 
overheads are large anyways. Doing one or more system calls probably
does not matter much. Most important is a clean interface.

- Exact measurement of the current process. For that you need very
low latencies. Any system call is too slow. That is why CPUs have
instructions like RDPMC that allow to read those registers with
minimal latency in user space. Interface should support those.

Also for this case programming time does not matter too much. You
just program once and then do RDPMC before code to measure and then
afterwards and take the difference. The actual counter setup is out 
of the latency critical path.
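
The read-before, read-after pattern can be sketched as follows. This is
only an illustration: measure(), fake_counter() and fake_code() are
hypothetical names. The rdpmc() wrapper uses the real x86 instruction
(counter index in ECX, result in EDX:EAX), but executing it in user space
faults unless the kernel has set CR4.PCE, so the sketch takes the counter
as a function pointer and ships a stand-in counter.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
/* Read performance counter `idx` with RDPMC. Faults in user space unless
 * the kernel has enabled CR4.PCE. */
static inline uint64_t rdpmc(uint32_t idx)
{
    uint32_t lo, hi;

    __asm__ volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"(idx));
    return ((uint64_t)hi << 32) | lo;
}
#endif

typedef uint64_t (*read_counter_fn)(void);

/* Read the counter, run the code under test, read the counter again and
 * return the difference. Counter setup happens elsewhere, off this path. */
uint64_t measure(read_counter_fn read_counter, void (*code)(void))
{
    uint64_t before = read_counter();

    code();
    return read_counter() - before;
}

/* Stand-in counter so the pattern can be tried where RDPMC is not enabled. */
static uint64_t fake_ticks;
uint64_t fake_counter(void) { return fake_ticks; }
void fake_code(void) { fake_ticks += 100; }
```

On hardware where user-space RDPMC is permitted, a small wrapper calling
rdpmc() with the chosen counter index would be passed in place of
fake_counter().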


> It depends on what you are doing. Here, this was not really necessary. It was
> meant to show how you can program the data registers as well. Perfmon2
> provides default values for all data registers. For counters, the value is
> guaranteed to be zero.
> 
> But it is important to note that not all data registers are counters. That is
> the case of Itanium 2, some are just buffers. On AMD Barcelona IBS several are
> buffers as well, and some may need to be initialized to non zero value, i.e.,
> the IBS sampling period.

Setting period should be a separate call. Mixing the two together into one
does not look like a nice interface.

> 
> With event-based sampling, the period is expressed as the number of
> occurrences of an event. For instance, you can say: "take a sample every
> 2000 L2 cache misses". The way you express this with perfmon2 is that you
> program a counter to measure L2 cache misses, and then you initialize the
> corresponding data register (counter) to overflow after 2000 occurrences.
> Given that the interface guarantees all counters are 64-bit regardless of
> the hardware, you simply have to program the counter to -2000. Thus you see
> that you need a call to actually program the data registers.

I didn't object to providing the initial value -- my example had that.
Just having a separate concept of data registers seems too complicated to me.
You should just pass event types and values and the kernel gives you
a register number.


> Perfmon2 decouples the two operations. In fact, no PMU hardware is actually
> touched before you attach to either a CPU or a thread. This way, you can
> prepare your measurement and then attach-and-go. Thus it is possible to
> create batches of ready-to-go sessions. That is useful, for instance, when
> you are trying to measure across fork, pthread_create which you can catch
> on-the-fly.
> 
> Take the per-thread example, you can setup your session before you fork/exec
> the program you want to measure.

And?  You didn't say what the advantage of that is? 

All the approaches add context switch latencies. It is not clear that the
separate session setup helps it all that much.

> 
> Note also that perfmon2 supports attaching to an already running thread. So
> there is more than "GLOBAL CONTEXT" versus "MY CONTEXT".

What is the use case of this? Do users use that? 

> 
> 
> > >   /* activate monitoring */
> > >   pfm_start(ctx_fd, NULL);
> > 
> > Why can't that be done by the call setting up the register?
> > 
> 
> Good question. If you do what you say, you assume that the start/stop bit
> lives in the config (or data) registers of the PMU. This is not true on all
> hardware. On Itanium for instance, the start/stop bit is part of the
> Processor Status Register (psr). That is not a PMU register.


Well the system call layer can manage that transparently with a little
software state (counter). No need to expose it.

> One approach does not prevent the other. Assuming you allow cr4.pce, then
> nothing prevents a self-monitoring thread from reading the

Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-14 Thread Stephane Eranian
Andi,

On Wed, Nov 14, 2007 at 01:38:38PM +0100, Andi Kleen wrote:
> Christoph Hellwig <[EMAIL PROTECTED]> writes:
> >
> > I've done this a gazillion times before, so maybe instead of being a lazy
> > bastard you could look up the mailing list archive.  It's not like this is the
> > first discussion of perfmon.  But to get started look at the system calls,
> > many of them are beasts like:
> >
> >   int pfm_read_pmds(int fd, pfarg_pmd_t *pmds, int n)
> >
> > This is basically a read(2) (or for other syscalls a write) on something
> 
> At least for x86 and I suspect some other architectures we don't
> initially need a syscall at all for this. There is an instruction
> RDPMC which can read a performance counter just fine. It is also much
> faster and generally preferable for the case where a process measures
> events about itself. In fact it is essential for one of the use cases
> I would like to see perfmon used (replacement of RDTSC for cycle
> counting) 
> 

This only works when counting (not sampling) and only for self-monitoring.

> Later a syscall might be needed with event multiplexing, but that seems
> more like a far away non essential feature.
> 
On a machine with only two generic counters such as MIPS or Intel Core 2 Duo,
multiplexing offers some advantages. If NMI watchdog is enabled, then you drop
to one generic counter on Core 2.

> > else than the file descriptor provided to the system call.   The right thing
> 
> I don't like read/write for this too much. I think it's better to
> have individual syscalls.  After all that is CPU state and having
> syscalls for that does seem reasonable.

As I said earlier, we do use read(), not for reading counters but to extract
overflow notification messages when we are sampling. It makes more sense for
this usage because this is where you want to leverage some key mechanisms
such as:

 - asynchronous notification via SIGIO. this is how you can implement
   self-sampling for instance.

 - select/poll to allow monitoring tools to wait for notification coming
   from multiple sessions in one call. This is useful when monitoring across
   fork or pthread_create.

-- 
-Stephane


Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-14 Thread Stephane Eranian

On Wed, Nov 14, 2007 at 10:44:56PM +1100, Paul Mackerras wrote:
> David Miller writes:
> 
> > This is my impression too, all of the things being done with
> > a slew of system calls would be better served by real special
> > files and appropriate fops.
> 
> Special files and fops really only work well if you can coerce the
> interface into one where data flows predominantly one way.  I don't
> think they work so well for something that is more like an RPC across
> the user/kernel barrier.  For that a system call is better.
> 
> For instance, if you have something that kind-of looks like
> 
>   read_pmds(int n, int *pmd_numbers, u64 *pmd_values);
> 
> where the caller supplies an array of PMD numbers and the function
> returns their values (and you want that reading to be done atomically
> in some sense), how would you do that using special files and fops?
> 
Yes, the read call could be simplified to the level proposed above by Paul.
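
For illustration, the request/reply shape of such a call can be modelled
in plain C. Everything here is an assumption made for the sketch: the
registers are simulated by an array (pmd_file, NUM_PMDS), u64 is spelled
uint64_t, and the int return value with -1 for a bad register number is
invented. A real implementation would read the PMU inside the kernel.

```c
#include <stdint.h>

#define NUM_PMDS 8

/* Simulated data registers standing in for the real PMU state. */
static uint64_t pmd_file[NUM_PMDS];

/* The caller supplies n register numbers and gets n values back. The
 * whole request is validated before any value is written, so a bad
 * request leaves pmd_values untouched. */
int read_pmds(int n, const int *pmd_numbers, uint64_t *pmd_values)
{
    for (int i = 0; i < n; i++)
        if (pmd_numbers[i] < 0 || pmd_numbers[i] >= NUM_PMDS)
            return -1;

    for (int i = 0; i < n; i++)
        pmd_values[i] = pmd_file[pmd_numbers[i]];

    return 0;
}
```

The input array is the request and the output array is the reply, so one
call covers any subset of registers.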

-- 
-Stephane


Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-14 Thread Stephane Eranian
Hello,

On Wed, Nov 14, 2007 at 10:39:24PM +1100, Paul Mackerras wrote:
> Christoph Hellwig writes:
> 
> >   int pfm_read_pmds(int fd, pfarg_pmd_t *pmds, int n)
> > 
> > This is basically a read(2) (or for other syscalls a write) on something
> > else than the file descriptor provided to the system call.
> 
> No it's not basically a read().  It's more like a request/reply
> interface, which a read()/write() interface doesn't handle very well.
> The request in this case is "tell me about this particular collection
> of PMDs" and the reply is the values.
> 

Exactly. This is not a brute force read()! On input you pass the list
of registers you want to read. Upon return, you get the list of values.

Now, I think the current call could be optimized even more by making
the structure smaller. Today, the same structure is passed to both read
and write PMD registers. On write, we pass other information such as
the reset values (sampling periods), randomization parameters and some
flags. They are not needed on read.

> It seems to me that an important part of this is to be able to collect
> values from several PMDs at a single point in time, or at least an
> approximation to a single point in time.  So that means that you don't
> want a file per PMD either.
> 

Yes, we want to be able to read one or many registers in one call.
The number of PMU counters is not going to shrink, so having a file
descriptor per register looks like overkill to me.

> Basically we don't have a good abstraction for a request/reply (or
> command/response) type of interface, and this is a case where we need
> one.  Having a syscall that takes a struct containing the request and
> reply is as good a way as any, particularly for something that needs
> to be quick.
> 

-- 
-Stephane


Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-14 Thread Stephane Eranian
Andi,

On Wed, Nov 14, 2007 at 03:07:02AM +0100, Andi Kleen wrote:
> 
> [dropped all these bouncing email lists. Adding closed lists to public
> cc lists is just a bad idea]
> 

Just want to make sure perfmon2 users participate in this discussion.

> > int
> > main(int argc, char **argv)
> > {
> > int ctx_fd;
> > pfarg_pmd_t pd[1];
> > pfarg_pmc_t pc[1];
> > pfarg_ctx_t ctx;
> > pfarg_load_t load_args;
> > 
> > memset(&ctx, 0, sizeof(ctx));
> > memset(pc, 0, sizeof(pc));
> > memset(pd, 0, sizeof(pd));
> > 
> > /* create session (context) and get file descriptor back (identifier) */
> > ctx_fd = pfm_create_context(&ctx, NULL, NULL, 0);
> 
> There's nothing in your example that makes the file descriptor needed.
> 

Partially true. The file descriptor becomes really useful when you sample.
You leverage the file descriptor to receive notifications of counter overflows
and full sampling buffer. You extract notification messages via read() and you
can use SIGIO, select/poll.

The example shows how you can leverage existing mechanisms to destroy the
session, i.e., free the associated kernel resources. For that, you use close()
instead of adding yet another syscall. It also provides a resource limitation
mechanism to control consumption of kernel memory, i.e., you can only create
as many sessions as you can have open files.

> > 
> > /* setup one config register (PMC0) */
> > pc[0].reg_num   = 0;
> > pc[0].reg_value = 0x1234;
> 
> That would be nicer if it was just two arguments.
> 
Are you suggesting something like: pfm_write_pmcs(fd, 0, 0x1234)?

That would be quite expensive when you have lots of registers to setup: one
syscall per register. The perfmon syscalls to read/write registers accept
vectors of arguments to amortize the cost of the syscall over multiple
registers (similar to poll(2)).

With many tools, registers are not just setup once. During certain measurements,
data registers may be read multiple times. When you sample or multiplex at
the user level, you do need to reprogram the PMU state and that is on the
critical path.

You do not want a call that programs the entire PMU state all at once either.
Many times, you only want to modify a small subset. Having the full state does
also cause some portability problems.


> > 
> > /* setup one data register (PMD0) */
> > pd[0].reg_num = 0;
> > pd[0].reg_value = 0;
> 
> Why do you need to set the data register? Wouldn't it make
> more sense to let the kernel handle that and just return one.
> 
It depends on what you are doing. Here, this was not really necessary. It was
meant to show how you can program the data registers as well. Perfmon2 provides
default values for all data registers. For counters, the value is guaranteed to
be zero.

But it is important to note that not all data registers are counters. That is
the case of Itanium 2, some are just buffers. On AMD Barcelona IBS several are
buffers as well, and some may need to be initialized to non zero value, i.e.,
the IBS sampling period.

With event-based sampling, the period is expressed as the number of occurrences
of an event. For instance, you can say: "take a sample every 2000 L2 cache
misses". The way you express this with perfmon2 is that you program a counter
to measure L2 cache misses, and then you initialize the corresponding data
register (counter) to overflow after 2000 occurrences. Given that the interface
guarantees all counters are 64-bit regardless of the hardware, you simply have
to program the counter to -2000. Thus you see that you need a call to actually
program the data registers.
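
The arithmetic behind "program the counter to -2000" can be sketched as
follows. The helper names period_to_counter() and events_until_overflow()
are hypothetical and not part of the perfmon2 interface; the only
assumption taken from the text is that counters are exposed as 64-bit
values that overflow when they wrap to zero.

```c
#include <stdint.h>

/* Initial value for a 64-bit counter so that it overflows after `period`
 * events: the two's complement of the period, i.e. -period modulo 2^64. */
uint64_t period_to_counter(uint64_t period)
{
    return (uint64_t)0 - period;
}

/* Number of events left before a counter holding the non-zero `value`
 * wraps around to zero. */
uint64_t events_until_overflow(uint64_t value)
{
    return (uint64_t)0 - value;
}
```

For a period of 2000 the programmed value is 0xFFFFFFFFFFFFF830, and
2000 increments later the counter wraps to zero and raises the overflow.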

> > 
> > /* program the registers */
> > pfm_write_pmcs(ctx_fd, pc, 1);
> > pfm_write_pmds(ctx_fd, pd, 1);
> > 
> > /* attach the context to self */
> > load_args.load_pid = getpid();
> > pfm_load_context(ctx_fd, &load_args);
> 
> My replacement would be to just add a flags argument to write_pmcs 
> with one flag bit meaning "GLOBAL CONTEXT" versus "MY CONTEXT"
> > 

You are mixing PMU programming with the type of measurement you want to do.

Perfmon2 decouples the two operations. In fact, no PMU hardware is actually
touched before you attach to either a CPU or a thread. This way, you can
prepare your measurement and then attach-and-go. Thus it is possible to create
batches of ready-to-go sessions. That is useful, for instance, when you are
trying to measure across fork, pthread_create which you can catch on-the-fly.

Take the per-thread example, you can setup your session before you fork/exec
the program you want to measure.

Note also that perfmon2 supports attaching to an already running thread. So
there is more than "GLOBAL CONTEXT" versus "MY CONTEXT".


> > /* activate monitoring */
> > pfm_start(ctx_fd, NULL);
> 
> Why can't that be done by the call setting up the register?
> 

Good question. If you do what you say, you assume that the start/stop bit
lives in the config (or data) registers of the

Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-14 Thread Andi Kleen
Christoph Hellwig <[EMAIL PROTECTED]> writes:
>
> I've done this a gazillion times before, so maybe instead of being a lazy
> bastard you could look up the mailing list archive.  It's not like this is the
> first discussion of perfmon.  But to get started look at the system calls,
> many of them are beasts like:
>
>   int pfm_read_pmds(int fd, pfarg_pmd_t *pmds, int n)
>
> This is basically a read(2) (or for other syscalls a write) on something

At least for x86 and I suspect some other architectures we don't
initially need a syscall at all for this. There is an instruction
RDPMC which can read a performance counter just fine. It is also much
faster and generally preferable for the case where a process measures
events about itself. In fact it is essential for one of the use cases
I would like to see perfmon used (replacement of RDTSC for cycle
counting) 

Later a syscall might be needed with event multiplexing, but that seems
more like a far away non essential feature.

> else than the file descriptor provided to the system call.   The right thing

I don't like read/write for this too much. I think it's better to
have individual syscalls.  After all that is CPU state and having
syscalls for that does seem reasonable.

-Andi


Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-14 Thread Nick Piggin
On Wednesday 14 November 2007 23:07, David Miller wrote:
> From: Paul Mackerras <[EMAIL PROTECTED]>
> Date: Wed, 14 Nov 2007 23:03:24 +1100
>
> > You're suggesting that the behaviour of a read() should depend on what
> > was in the buffer before the read?  Gack!  Surely you have better
> > taste than that?
>
> Absolutely that's what I mean, it's atomic and gives you exactly what
> you need.
>
> I see nothing wrong or gross with these semantics.  Nothing in the
> "book of UNIX" specifies that for a device or special file the passed
> in buffer cannot contain input control data.

True, but is it then any different from an ioctl?


Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-14 Thread Nick Piggin
On Wednesday 14 November 2007 22:58, David Miller wrote:
> From: Nick Piggin <[EMAIL PROTECTED]>
> Date: Wed, 14 Nov 2007 10:49:48 +1100
>
> > On Wednesday 14 November 2007 22:44, Paul Mackerras wrote:
> > > David Miller writes:
> > > > This is my impression too, all of the things being done with
> > > > a slew of system calls would be better served by real special
> > > > files and appropriate fops.
> > >
> > > Special files and fops really only work well if you can coerce the
> > > interface into one where data flows predominantly one way.  I don't
> > > think they work so well for something that is more like an RPC across
> > > the user/kernel barrier.  For that a system call is better.
> > >
> > > For instance, if you have something that kind-of looks like
> > >
> > >   read_pmds(int n, int *pmd_numbers, u64 *pmd_values);
> > >
> > > where the caller supplies an array of PMD numbers and the function
> > > returns their values (and you want that reading to be done atomically
> > > in some sense), how would you do that using special files and fops?
> >
> > Could you implement it with readv()?
>
> Sure, why not?  Just cook up an iovec.  pmd_numbers goes to offset
> X and pmd_values goes to offset Y, with some helpers like what
> we have in the networking already for recvmsg.
>
> But why would you want readv() for this?  The syscall thing
> Paul asked me to translate into a read() doesn't provide
> iovec-like behavior so I don't see why readv() is necessary
> at all.

Ah sorry, that's what I get for typing before I think: of course
readv doesn't vectorise the right part of the equation.

What I really mean is a readv-like syscall, but one that also
vectorises the file offset. Maybe this is useful enough as a generic
syscall that also helps Paul's example...

Of course, I guess this all depends on whether the atomicity is an
important requirement. If not, you can obviously just do it with
multiple read syscalls...


Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-14 Thread David Miller
From: Paul Mackerras <[EMAIL PROTECTED]>
Date: Wed, 14 Nov 2007 23:03:24 +1100

> You're suggesting that the behaviour of a read() should depend on what
> was in the buffer before the read?  Gack!  Surely you have better
> taste than that?

Absolutely that's what I mean, it's atomic and gives you exactly what
you need.

I see nothing wrong or gross with these semantics.  Nothing in the
"book of UNIX" specifies that for a device or special file the passed
in buffer cannot contain input control data.

> > Another alternative is to use generic netlink.
> 
> Then you end up with two system calls to get the data rather than one
> (one to send the request and another to read the reply).  For
> something that needs to be quick that is a suboptimal interface.

Not necessarily, consider the possibility of using recvmsg() control
message data.  With that it could be done in one go.

This also suggests that it could be implemented as its own protocol
family.


Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-14 Thread Paul Mackerras
David Miller writes:

> The same way we handle some of the multicast "getsockopt()"
> calls.  The parameters passed in are both inputs and outputs.

For a read??!!!

> For the above example:
> 
>   struct pmd_info {
>   int *pmd_numbers;
>   u64 *pmd_values;
>   int n;
>   } *p;
> 
>   buffer_size = N;
>   p = malloc(buffer_size);
>   p->pmd_numbers = p + foo;
>   p->pmd_values = p + bar;
>   p->n = whatever(N);
>   err = read(fd, p, N);

You're suggesting that the behaviour of a read() should depend on what
was in the buffer before the read?  Gack!  Surely you have better
taste than that?

Or are you saying that a read (or write) has a side-effect of altering
some other area of memory besides the buffer you give to read()?  That
seems even worse to me.

> Another alternative is to use generic netlink.

Then you end up with two system calls to get the data rather than one
(one to send the request and another to read the reply).  For
something that needs to be quick that is a suboptimal interface.

Paul.


Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-14 Thread David Miller
From: Nick Piggin <[EMAIL PROTECTED]>
Date: Wed, 14 Nov 2007 10:49:48 +1100

> On Wednesday 14 November 2007 22:44, Paul Mackerras wrote:
> > David Miller writes:
> > > This is my impression too, all of the things being done with
> > > a slew of system calls would be better served by real special
> > > files and appropriate fops.
> >
> > Special files and fops really only work well if you can coerce the
> > interface into one where data flows predominantly one way.  I don't
> > think they work so well for something that is more like an RPC across
> > the user/kernel barrier.  For that a system call is better.
> >
> > For instance, if you have something that kind-of looks like
> >
> > read_pmds(int n, int *pmd_numbers, u64 *pmd_values);
> >
> > where the caller supplies an array of PMD numbers and the function
> > returns their values (and you want that reading to be done atomically
> > in some sense), how would you do that using special files and fops?
> 
> Could you implement it with readv()?

Sure, why not?  Just cook up an iovec.  pmd_numbers goes to offset
X and pmd_values goes to offset Y, with some helpers like what
we have in the networking already for recvmsg.

But why would you want readv() for this?  The syscall thing
Paul asked me to translate into a read() doesn't provide
iovec-like behavior so I don't see why readv() is necessary
at all.
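
For reference, "cooking up an iovec" could be sketched as follows. A pipe stands in for the special file, and the function names are hypothetical. Note that the iovec only describes where the output lands; it does not carry the PMD numbers into the kernel, so in this sketch the "device" supplies both arrays.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

enum { NPMD = 2 };

/*
 * Scatter a single read across two separate arrays with readv():
 * the first n 32-bit words land in numbers[], the next n 64-bit
 * words land in values[].  Returns the byte count or -1.
 */
ssize_t readv_numbers_and_values(int fd, uint32_t *numbers,
                                 uint64_t *values, int n)
{
    struct iovec iov[2];

    assert(n > 0);
    iov[0].iov_base = numbers;
    iov[0].iov_len = (size_t)n * sizeof(numbers[0]);
    iov[1].iov_base = values;
    iov[1].iov_len = (size_t)n * sizeof(values[0]);
    return readv(fd, iov, 2);
}

/* Demo: a pipe plays the role of the device producing the data. */
int demo_readv(void)
{
    const uint32_t src_numbers[NPMD] = {0, 3};
    const uint64_t src_values[NPMD] = {0x1234, 0x5678};
    uint32_t numbers[NPMD] = {0};
    uint64_t values[NPMD] = {0};
    int p[2];

    if (pipe(p) != 0)
        return -1;
    if (write(p[1], src_numbers, sizeof(src_numbers)) != (ssize_t)sizeof(src_numbers) ||
        write(p[1], src_values, sizeof(src_values)) != (ssize_t)sizeof(src_values)) {
        close(p[0]);
        close(p[1]);
        return -1;
    }
    ssize_t got = readv_numbers_and_values(p[0], numbers, values, NPMD);
    close(p[0]);
    close(p[1]);
    if (got != (ssize_t)(sizeof(src_numbers) + sizeof(src_values)))
        return -1;
    return (numbers[1] == 3 && values[0] == 0x1234 && values[1] == 0x5678) ? 0 : 1;
}
```

demo_readv() returns 0 when one readv() call fills both arrays.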


Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-14 Thread David Miller
From: Paul Mackerras <[EMAIL PROTECTED]>
Date: Wed, 14 Nov 2007 22:44:56 +1100

> For instance, if you have something that kind-of looks like
> 
>   read_pmds(int n, int *pmd_numbers, u64 *pmd_values);
> 
> where the caller supplies an array of PMD numbers and the function
> returns their values (and you want that reading to be done atomically
> in some sense), how would you do that using special files and fops?

The same way we handle some of the multicast "getsockopt()"
calls.  The parameters passed in are both inputs and outputs.

For the above example:

	struct pmd_info {
		int *pmd_numbers;
		u64 *pmd_values;
		int n;
	} *p;

	buffer_size = N;
	p = malloc(buffer_size);
	p->pmd_numbers = p + foo;
	p->pmd_values = p + bar;
	p->n = whatever(N);
	err = read(fd, p, N);

It's definitely doable, use your imagination.

You can encode all kinds of operation types into the
header as well.

Another alternative is to use generic netlink.
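
A user-space sketch of the in/out buffer idea, under stated assumptions: the struct pmd_req layout (byte offsets instead of the pointers in the example above), the fake_pmd_read() stand-in for the driver's read fop, and the "value = 1000 + number" rule are all invented here for illustration. No real device or perfmon2 interface is involved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical header placed at the start of the buffer given to read(). */
struct pmd_req {
    uint32_t n;            /* in: number of PMDs requested */
    uint32_t numbers_off;  /* in: byte offset of the uint32_t numbers[] */
    uint32_t values_off;   /* in: byte offset of the uint64_t values[] */
    uint32_t pad;          /* keeps the header a multiple of 8 bytes */
};

/*
 * Stand-in for the read fop: parse the request already in the buffer,
 * then write the reply into the same buffer.  PMD k reads as 1000 + k.
 * Only bounds are checked; overlapping regions are not rejected.
 */
long fake_pmd_read(void *buf, size_t len)
{
    unsigned char *base = buf;
    struct pmd_req hdr;

    if (len < sizeof(hdr))
        return -1;
    memcpy(&hdr, base, sizeof(hdr));
    if (hdr.numbers_off > len ||
        hdr.n > (len - hdr.numbers_off) / sizeof(uint32_t))
        return -1;
    if (hdr.values_off > len ||
        hdr.n > (len - hdr.values_off) / sizeof(uint64_t))
        return -1;
    for (uint32_t i = 0; i < hdr.n; i++) {
        uint32_t num;
        uint64_t val;

        memcpy(&num, base + hdr.numbers_off + i * sizeof(num), sizeof(num));
        val = 1000u + (uint64_t)num;
        memcpy(base + hdr.values_off + i * sizeof(val), &val, sizeof(val));
    }
    return (long)len;
}

/* Demo: request PMDs 0, 2 and 5 in one call and check the reply. */
int demo_inout_read(void)
{
    enum { N = 3 };
    const uint32_t wanted[N] = {0, 2, 5};
    uint64_t values[N];
    struct pmd_req hdr;
    size_t len = sizeof(hdr) + N * sizeof(uint64_t) + N * sizeof(uint32_t);
    unsigned char *buf = calloc(1, len);

    if (buf == NULL)
        return -1;
    hdr.n = N;
    hdr.values_off = (uint32_t)sizeof(hdr);
    hdr.numbers_off = (uint32_t)(sizeof(hdr) + N * sizeof(uint64_t));
    hdr.pad = 0;
    memcpy(buf, &hdr, sizeof(hdr));
    memcpy(buf + hdr.numbers_off, wanted, sizeof(wanted));

    long rc = fake_pmd_read(buf, len);
    memcpy(values, buf + hdr.values_off, sizeof(values));
    free(buf);
    if (rc != (long)len)
        return -1;
    return (values[0] == 1000 && values[1] == 1002 && values[2] == 1005) ? 0 : 1;
}
```

Offsets are used rather than embedded pointers so that the whole request lives inside the single buffer handed to read().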


Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-14 Thread David Miller
From: Paul Mackerras <[EMAIL PROTECTED]>
Date: Wed, 14 Nov 2007 22:39:24 +1100

> No it's not basically a read().  It's more like a request/reply
> interface, which a read()/write() interface doesn't handle very well.

Yes it can, see my other reply.


Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-14 Thread Nick Piggin
On Wednesday 14 November 2007 22:44, Paul Mackerras wrote:
> David Miller writes:
> > This is my impression too, all of the things being done with
> > a slew of system calls would be better served by real special
> > files and appropriate fops.
>
> Special files and fops really only work well if you can coerce the
> interface into one where data flows predominantly one way.  I don't
> think they work so well for something that is more like an RPC across
> the user/kernel barrier.  For that a system call is better.
>
> For instance, if you have something that kind-of looks like
>
>   read_pmds(int n, int *pmd_numbers, u64 *pmd_values);
>
> where the caller supplies an array of PMD numbers and the function
> returns their values (and you want that reading to be done atomically
> in some sense), how would you do that using special files and fops?

Could you implement it with readv()?


Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-14 Thread Paul Mackerras
Christoph Hellwig writes:

>   int pfm_read_pmds(int fd, pfarg_pmd_t *pmds, int n)
> 
> This is basically a read(2) (or for other syscalls a write) on something
> else than the file descriptor provided to the system call.

No it's not basically a read().  It's more like a request/reply
interface, which a read()/write() interface doesn't handle very well.
The request in this case is "tell me about this particular collection
of PMDs" and the reply is the values.

It seems to me that an important part of this is to be able to collect
values from several PMDs at a single point in time, or at least an
approximation to a single point in time.  So that means that you don't
want a file per PMD either.

Basically we don't have a good abstraction for a request/reply (or
command/response) type of interface, and this is a case where we need
one.  Having a syscall that takes a struct containing the request and
reply is as good a way as any, particularly for something that needs
to be quick.
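
A minimal sketch of such a request/reply struct, modelled loosely on the pfm_read_pmds(fd, pmds, n) shape quoted above. The struct pmd_arg type, the mock_read_pmds() function and the fake counter table are hypothetical; this is a user-space mock, not the perfmon2 implementation.

```c
#include <assert.h>
#include <stdint.h>

enum { NUM_PMDS = 4 };

/* Hypothetical element: the caller fills reg_num, the callee fills reg_value. */
struct pmd_arg {
    uint16_t reg_num;    /* request: which PMD */
    uint64_t reg_value;  /* reply: its value */
};

/* Pretend hardware counters. */
static uint64_t fake_counters[NUM_PMDS] = {11, 22, 33, 44};

/*
 * One call carries the whole request and returns the whole reply.
 * All register numbers are validated before anything is written,
 * so a rejected request leaves the caller's array untouched.
 */
int mock_read_pmds(struct pmd_arg *pmds, int n)
{
    assert(pmds != 0 && n >= 0);
    for (int i = 0; i < n; i++)
        if (pmds[i].reg_num >= NUM_PMDS)
            return -1;
    for (int i = 0; i < n; i++)
        pmds[i].reg_value = fake_counters[pmds[i].reg_num];
    return 0;
}

/* Demo: one good request for PMD3 and PMD0, one rejected request. */
int demo_request_reply(void)
{
    struct pmd_arg req[2] = { { .reg_num = 3 }, { .reg_num = 0 } };
    struct pmd_arg bad[1] = { { .reg_num = 9, .reg_value = 7 } };

    if (mock_read_pmds(req, 2) != 0)
        return -1;
    if (mock_read_pmds(bad, 1) != -1 || bad[0].reg_value != 7)
        return 1;
    return (req[0].reg_value == 44 && req[1].reg_value == 11) ? 0 : 1;
}
```

Because every requested register is handled inside one call, a kernel implementation would have a single place to take the snapshot.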

Paul.


Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-14 Thread Paul Mackerras
David Miller writes:

> This is my impression too, all of the things being done with
> a slew of system calls would be better served by real special
> files and appropriate fops.

Special files and fops really only work well if you can coerce the
interface into one where data flows predominantly one way.  I don't
think they work so well for something that is more like an RPC across
the user/kernel barrier.  For that a system call is better.

For instance, if you have something that kind-of looks like

read_pmds(int n, int *pmd_numbers, u64 *pmd_values);

where the caller supplies an array of PMD numbers and the function
returns their values (and you want that reading to be done atomically
in some sense), how would you do that using special files and fops?

>  Whether the thing is some kind
> of misc device or procfs is less important than simply getting
> away from these system calls.

Why?  What's inherently offensive about system calls?

Paul.


Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-14 Thread David Miller

Ok, I just got 4 freakin' bounces from all of these subscriber only
perfmon etc. mailing lists.

Please remove those lists from the CC: as it's pointless for those of
us not on the lists to participate if those lists can't even see the
feedback we are giving.



Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-14 Thread David Miller
From: Christoph Hellwig <[EMAIL PROTECTED]>
Date: Wed, 14 Nov 2007 11:00:09 +

> I've done this a gazillion times before, so maybe instead of being a lazy
> bastard you could look up the mailing list archive.  It's not like this is the
> first discussion of perfmon.  But to get started, look at the system calls,
> many of them are beasts like:
> 
>   int pfm_read_pmds(int fd, pfarg_pmd_t *pmds, int n)
> 
> This is basically a read(2) (or for other syscalls a write) on something
> else than the file descriptor provided to the system call.   The right thing
> to do is obviously have a pmds and pmcs file in procfs for the thread being
> monitored instead of these special-case files, with another set for global
> tracing.  Similarly I'm pretty sure we can get a much better interface
> if we introduce matching files in procfs for the other calls.

This is my impression too, all of the things being done with
a slew of system calls would be better served by real special
files and appropriate fops.  Whether the thing is some kind
of misc device or procfs is less important than simply getting
away from these system calls.


Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-14 Thread Christoph Hellwig
On Wed, Nov 14, 2007 at 09:43:02PM +1100, Paul Mackerras wrote:
> Christoph Hellwig writes:
> 
> > Mine for example.  The whole userspace interface is just on crack,
> > and the code is full of complexities as well.
> 
> Could you give some _technical_ details of what you don't like?

I've done this a gazillion times before, so maybe instead of being a lazy
bastard you could look up the mailing list archive.  It's not like this is the
first discussion of perfmon.  But to get started, look at the system calls,
many of them are beasts like:

  int pfm_read_pmds(int fd, pfarg_pmd_t *pmds, int n)

This is basically a read(2) (or for other syscalls a write) on something
else than the file descriptor provided to the system call.   The right thing
to do is obviously have a pmds and pmcs file in procfs for the thread being
monitored instead of these special-case files, with another set for global
tracing.  Similarly I'm pretty sure we can get a much better interface
if we introduce matching files in procfs for the other calls.



Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-14 Thread Paul Mackerras
Christoph Hellwig writes:

> Mine for example.  The whole userspace interface is just on crack,
> and the code is full of complexities as well.

Could you give some _technical_ details of what you don't like?

Paul.


Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-14 Thread Christoph Hellwig
On Wed, Nov 14, 2007 at 06:24:36PM +1100, Paul Mackerras wrote:
> Whose sentiment?

Mine for example.  The whole userspace interface is just on crack,
and the code is full of complexities as well.



Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-14 Thread Nick Piggin
On Wednesday 14 November 2007 23:07, David Miller wrote:
> From: Paul Mackerras <[EMAIL PROTECTED]>
> Date: Wed, 14 Nov 2007 23:03:24 +1100
>
> > You're suggesting that the behaviour of a read() should depend on what
> > was in the buffer before the read?  Gack!  Surely you have better
> > taste than that?
>
> Absolutely that's what I mean, it's atomic and gives you exactly what
> you need.
>
> I see nothing wrong or gross with these semantics.  Nothing in the
> "book of UNIX" specifies that for a device or special file the passed
> in buffer cannot contain input control data.

True, but is it now all that different from an ioctl?


Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-14 Thread Andi Kleen
Christoph Hellwig <[EMAIL PROTECTED]> writes:

> I've done this a gazillion times before, so maybe instead of being a lazy
> bastard you could look up the mailing list archive.  It's not like this is the
> first discussion of perfmon.  But to get started, look at the system calls,
> many of them are beasts like:
>
>   int pfm_read_pmds(int fd, pfarg_pmd_t *pmds, int n)
>
> This is basically a read(2) (or for other syscalls a write) on something

At least for x86, and I suspect some other architectures, we don't
initially need a syscall at all for this. There is an instruction,
RDPMC, which can read a performance counter just fine. It is also much
faster and generally preferable for the case where a process measures
events about itself. In fact it is essential for one of the use cases
I would like to see perfmon used for (replacement of RDTSC for cycle
counting).

Later a syscall might be needed with event multiplexing, but that seems
more like a far-away, non-essential feature.

> else than the file descriptor provided to the system call.   The right thing

I don't like read/write for this too much. I think it's better to
have individual syscalls.  After all that is CPU state and having
syscalls for that does seem reasonable.

-Andi


Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-14 Thread Stephane Eranian
Andi,

On Wed, Nov 14, 2007 at 03:07:02AM +0100, Andi Kleen wrote:
 
 [dropped all these bouncing email lists. Adding closed lists to public
 cc lists is just a bad idea]
 

Just want to make sure perfmon2 users participate in this discussion.

  int
  main(int argc, char **argv)
  {
  int ctx_fd;
  pfarg_pmd_t pd[1];
  pfarg_pmc_t pc[1];
  pfarg_ctx_t ctx;
  pfarg_load_t load_args;
  
  memset(&ctx, 0, sizeof(ctx));
  memset(pc, 0, sizeof(pc));
  memset(pd, 0, sizeof(pd));
  
  /* create session (context) and get file descriptor back (identifier) */
  ctx_fd = pfm_create_context(&ctx, NULL, NULL, 0);
 
 There's nothing in your example that makes the file descriptor needed.
 

Partially true. The file descriptor becomes really useful when you sample.
You leverage the file descriptor to receive notifications of counter overflows
and of a full sampling buffer. You extract notification messages via read(), and
you can use SIGIO and select/poll.

The example shows how you can leverage existing mechanisms to destroy the
session, i.e., free the associated kernel resources. For that, you use close()
instead of adding yet another syscall. It also provides a resource limitation
mechanism to control consumption of kernel memory, i.e., you can only create as
many sessions as you can have open files.

  
  /* setup one config register (PMC0) */
  pc[0].reg_num   = 0;
  pc[0].reg_value = 0x1234;
 
 That would be nicer if it was just two arguments.
 
Are you suggesting something like: pfm_write_pmcs(fd, 0, 0x1234)?

That would be quite expensive when you have lots of registers to set up: one
syscall per register. The perfmon syscalls to read/write registers accept a
vector of arguments to amortize the cost of the syscall over multiple registers
(similar to poll(2)).

With many tools, registers are not just set up once. During certain measurements,
data registers may be read multiple times. When you sample or multiplex at
the user level, you do need to reprogram the PMU state, and that is on the
critical path.

You do not want a call that programs the entire PMU state all at once either.
Many times, you only want to modify a small subset. Having the full state also
causes some portability problems.


  
  /* setup one data register (PMD0) */
  pd[0].reg_num = 0;
  pd[0].reg_value = 0;
 
 Why do you need to set the data register? Wouldn't it make
 more sense to let the kernel handle that and just return one.
 
It depends on what you are doing. Here, this was not really necessary. It was
meant to show how you can program the data registers as well. Perfmon2 provides
default values for all data registers. For counters, the value is guaranteed to
be zero.

But it is important to note that not all data registers are counters. That is
the case on Itanium 2, where some are just buffers. With AMD Barcelona IBS,
several are buffers as well, and some may need to be initialized to a non-zero
value, i.e., the IBS sampling period.

With event-based sampling, the period is expressed as the number of occurrences
of an event. For instance, you can say: take a sample every 2000 L2 cache
misses. The way you express this with perfmon2 is that you program a counter to
measure L2 cache misses, and then you initialize the corresponding data register
(counter) to overflow after 2000 occurrences. Given that the interface
guarantees all counters are 64-bit regardless of the hardware, you simply have
to program the counter to -2000. Thus you see that you need a call to actually
program the data registers.
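
To make the -2000 arithmetic concrete, here is a minimal C sketch. It assumes
a plain 64-bit up-counter that signals overflow when it wraps to zero; both
helper names are hypothetical and are not part of the perfmon2 API.

```c
#include <assert.h>
#include <stdint.h>

/* Initial value to program into a 64-bit up-counter so that it wraps to
 * zero (overflows) after exactly `period` events, i.e. "-period". */
uint64_t period_to_initial_value(uint64_t period)
{
    assert(period != 0);
    return (uint64_t)0 - period;   /* unsigned wraparound is well defined */
}

/* Hypothetical software model of the counter: count events until it wraps.
 * Only meant for small periods; it really does loop once per event. */
uint64_t events_until_overflow(uint64_t counter)
{
    uint64_t events = 0;
    do {
        counter++;
        events++;
    } while (counter != 0);
    return events;
}
```

For a period of 2000 the programmed value is 0xFFFFFFFFFFFFF830, and the model
counter wraps after exactly 2000 increments.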

  
  /* program the registers */
  pfm_write_pmcs(ctx_fd, pc, 1);
  pfm_write_pmds(ctx_fd, pd, 1);
  
  /* attach the context to self */
  load_args.load_pid = getpid();
  pfm_load_context(ctx_fd, &load_args);
 
 My replacement would be to just add a flags argument to write_pmcs 
 with one flag bit meaning GLOBAL CONTEXT versus MY CONTEXT
  

You are mixing PMU programming with the type of measurement you want to do.

Perfmon2 decouples the two operations. In fact, no PMU hardware is actually
touched before you attach to either a CPU or a thread. This way, you can prepare
your measurement and then attach-and-go. Thus it is possible to create batches
of ready-to-go sessions. That is useful, for instance, when you are trying to
measure across fork or pthread_create, which you can catch on the fly.

Take the per-thread example: you can set up your session before you fork/exec
the program you want to measure.

Note also that perfmon2 supports attaching to an already running thread. So
there is more than GLOBAL CONTEXT versus MY CONTEXT.


  /* activate monitoring */
  pfm_start(ctx_fd, NULL);
 
 Why can't that be done by the call setting up the register?
 

Good question. If you do what you say, you assume that the start/stop bit lives
in the config (or data) registers of the PMU. This is not true on all hardware.
On Itanium, for instance, the start/stop bit is part of the Processor Status
Register (psr). That is not a PMU register.

Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-14 Thread Stephane Eranian
Hello,

On Wed, Nov 14, 2007 at 10:39:24PM +1100, Paul Mackerras wrote:
 Christoph Hellwig writes:
 
int pfm_read_pmds(int fd, pfarg_pmd_t *pmds, int n)
  
  This is basically a read(2) (or for other syscalls a write) on something
  else than the file descriptor provided to the system call.
 
 No it's not basically a read().  It's more like a request/reply
 interface, which a read()/write() interface doesn't handle very well.
 The request in this case is "tell me about this particular collection
 of PMDs" and the reply is the values.
 

Exactly. This is not a brute force read()! On input you pass the list
of registers you want to read. Upon return, you get the list of values.

Now, I think the current call could be optimized even more by making
the structure smaller. Today, the same structure is passed to read and
to write PMD registers. On write, we pass other information such as
the reset values (sampling periods), randomization parameters and some
flags. They are not needed on read.

 It seems to me that an important part of this is to be able to collect
 values from several PMDs at a single point in time, or at least an
 approximation to a single point in time.  So that means that you don't
 want a file per PMD either.
 

Yes, we want to be able to read one or many registers in one call.
The number of PMU counters is not going to shrink, so having a file
descriptor per register looks like overkill to me.

 Basically we don't have a good abstraction for a request/reply (or
 command/response) type of interface, and this is a case where we need
 one.  Having a syscall that takes a struct containing the request and
 reply is as good a way as any, particularly for something that needs
 to be quick.
 

-- 
-Stephane


Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-14 Thread Stephane Eranian

On Wed, Nov 14, 2007 at 10:44:56PM +1100, Paul Mackerras wrote:
 David Miller writes:
 
  This is my impression too, all of the things being done with
  a slew of system calls would be better served by real special
  files and appropriate fops.
 
 Special files and fops really only work well if you can coerce the
 interface into one where data flows predominantly one way.  I don't
 think they work so well for something that is more like an RPC across
 the user/kernel barrier.  For that a system call is better.
 
 For instance, if you have something that kind-of looks like
 
   read_pmds(int n, int *pmd_numbers, u64 *pmd_values);
 
 where the caller supplies an array of PMD numbers and the function
 returns their values (and you want that reading to be done atomically
 in some sense), how would you do that using special files and fops?
 
Yes, the read call could be simplified to the level proposed above by Paul.
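
As a rough user-space illustration of that request/reply shape, here is a
sketch in C. It is only a mock: mock_pmd[] stands in for the PMU/kernel state,
and mock_read_pmds() is a hypothetical name modelled on Paul's prototype, not
an existing perfmon2 call. The caller passes the list of register numbers and
gets the values back in a single call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MOCK_NUM_PMDS 8

/* Stand-in for the data registers the kernel would read on our behalf. */
uint64_t mock_pmd[MOCK_NUM_PMDS];

/* Request: n register numbers. Reply: their values, in the same order.
 * Returns 0 on success, or -1 (writing nothing) if any number is invalid,
 * so the caller never sees a partially filled reply. */
int mock_read_pmds(int n, const int *pmd_numbers, uint64_t *pmd_values)
{
    assert(pmd_numbers != NULL && pmd_values != NULL);
    if (n < 0)
        return -1;
    for (int i = 0; i < n; i++)
        if (pmd_numbers[i] < 0 || pmd_numbers[i] >= MOCK_NUM_PMDS)
            return -1;
    for (int i = 0; i < n; i++)
        pmd_values[i] = mock_pmd[pmd_numbers[i]];
    return 0;
}
```

Validating the whole request before copying anything is one simple way to get
the all-or-nothing behaviour that a single vectored call makes possible.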

-- 
-Stephane


Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-14 Thread Stephane Eranian
Andi,

On Wed, Nov 14, 2007 at 01:38:38PM +0100, Andi Kleen wrote:
 Christoph Hellwig [EMAIL PROTECTED] writes:
 
  I've done this a gazillion times before, so maybe instead of being a lazy
  bastard you could look up the mailing list archive.  It's not like this is the
  first discussion of perfmon.  But to get started, look at the system calls;
  many of them are beasts like:
 
int pfm_read_pmds(int fd, pfarg_pmd_t *pmds, int n)
 
  This is basically a read(2) (or for other syscalls a write) on something
 
 At least for x86, and I suspect some other architectures, we don't
 initially need a syscall at all for this. There is an instruction,
 RDPMC, which can read a performance counter just fine. It is also much
 faster and generally preferable for the case where a process measures
 events about itself. In fact it is essential for one of the use cases
 I would like to see perfmon used for (replacement of RDTSC for cycle
 counting).
 

This only works when counting (not sampling) and only for self-monitoring.

 Later a syscall might be needed with event multiplexing, but that seems
 more like a far away non essential feature.
 
On a machine with only two generic counters such as MIPS or Intel Core 2 Duo,
multiplexing offers some advantages. If NMI watchdog is enabled, then you drop
to one generic counter on Core 2.

  else than the file descriptor provided to the system call.   The right thing
 
 I don't like read/write for this too much. I think it's better to
 have individual syscalls.  After all that is CPU state and having
 syscalls for that does seem reasonable.

As I said earlier, we do use read(), not for reading counters but to extract
overflow notification messages when we are sampling. It makes more sense for
this usage because this is where you want to leverage some key mechanisms such
as:

 - asynchronous notification via SIGIO. This is how you can implement
   self-sampling, for instance.

 - select/poll, to allow monitoring tools to wait for notifications coming
   from multiple sessions in one call. This is useful when monitoring across
   fork or pthread_create.
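
A rough sketch of the select/poll point in C. Ordinary descriptors stand in
for perfmon2 session descriptors (an assumption, so the example does not need
perfmon2 at all), and wait_any_readable() is a hypothetical helper, not a
perfmon2 call.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

#define MAX_SESSIONS 16

/* Wait up to timeout_ms for a notification on any of the n descriptors.
 * Returns the index of the first readable one, or -1 on timeout or error. */
int wait_any_readable(const int *fds, int n, int timeout_ms)
{
    struct pollfd pfds[MAX_SESSIONS];

    assert(fds != NULL && n >= 1 && n <= MAX_SESSIONS);
    for (int i = 0; i < n; i++) {
        pfds[i].fd = fds[i];
        pfds[i].events = POLLIN;
        pfds[i].revents = 0;
    }
    if (poll(pfds, (nfds_t)n, timeout_ms) <= 0)
        return -1;
    for (int i = 0; i < n; i++)
        if (pfds[i].revents & POLLIN)
            return i;
    return -1;
}
```

A tool monitoring several threads would call this in a loop and then read()
the pending message from whichever descriptor is reported ready.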

-- 
-Stephane


Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-14 Thread Andi Kleen
On Wed, Nov 14, 2007 at 05:09:09AM -0800, Stephane Eranian wrote:
 
 Partially true. The file descriptor becomes really useful when you sample.
 You leverage the file descriptor to receive notifications of counter overflows
 and full sampling buffer. You extract notification messages via read() and
 you can use SIGIO, select/poll.

Hmm, ok for the event notification we would need a nice interface. Still
have my doubts a file descriptor is the best way to do this though.

 Are you suggesting something like: pfm_write_pmcs(fd, 0, 0x1234)?

See my example below.
 
 That would be quite expensive when you have lots of registers to setup: one
 syscall per register. The perfmon syscalls to read/write registers accept
 vector of arguments to amortize the cost of the syscall over multiple registers
 (similar to poll(2)).


First, system calls are not that slow on Linux. Measure it.


 
 With many tools, registers are not just setup once. During certain
 measurements, data registers may be read multiple times. When you sample or
 multiplex at

I think you optimize the wrong thing here.

There are basically two cases I see:

- Global measurement of lots of things:
Things are slow anyway, with large context switch overheads. Doing one or
more system calls probably does not matter much. Most important is a clean
interface.

- Exact measurement of the current process. For that you need very
low latencies. Any system call is too slow. That is why CPUs have
instructions like RDPMC that allow reading those registers with
minimal latency in user space. The interface should support those.

Also for this case programming time does not matter too much. You
just program once, do RDPMC before and after the code to measure,
and take the difference. The actual counter setup is out
of the latency-critical path.


 It depends on what you are doing. Here, this was not really necessary. It was
 meant to show how you can program the data registers as well. Perfmon2
 provides default values for all data registers. For counters, the value is
 guaranteed to be zero.
 
 But it is important to note that not all data registers are counters. That is
 the case of Itanium 2, some are just buffers. On AMD Barcelona IBS several are
 buffers as well, and some may need to be initialized to non zero value, i.e.,
 the IBS sampling period.

Setting the period should be a separate call. Mixing the two together into one
does not look like a nice interface.

 
 With event-based sampling, the period is expressed as the number of
 occurrences of an event. For instance, you can say: take a sample every 2000
 L2 cache misses. The way you express this with perfmon2 is that you program a
 counter to measure L2 cache misses, and then you initialize the corresponding
 data register (counter) to overflow after 2000 occurrences. Given that the
 interface guarantees all counters are 64-bit regardless of the hardware, you
 simply have to program the counter to -2000.
 Thus you see that you need a call to actual program the data registers.

I didn't object to providing the initial value -- my example had that.
Just having a separate concept of data registers seems too complicated to me.
You should just pass event types and values and the kernel gives you
a register number.


 Perfmon2 decouples the two operations. In fact, no PMU hardware is actually
 touched before you attach to either a CPU or a thread. This way, you can
 prepare your measurement and then attach-and-go. Thus is is possible to create
 batches of ready-to-go sessions. That is useful, for instance, when you are
 trying to measure across fork, pthread_create which you can catch on-the-fly.
 
 Take the per-thread example, you can setup your session before you fork/exec
 the program you want to measure.

And?  You didn't say what the advantage of that is.

All the approaches add context switch latencies. It is not clear that the
separate session setup helps all that much.

 
 Note also that perfmon2 supports attaching to an already running thread. So
 there is more than GLOBAL CONTEXT versus MY CONTEXT.

What is the use case of this? Do users use that? 

 
 
 /* activate monitoring */
 pfm_start(ctx_fd, NULL);
  
  Why can't that be done by the call setting up the register?
  
 
 Good question. If you do what you say, you assume that the start/stop bit
 lives in the config (or data) registers of the PMU. This is not true on all
 hardware. On Itanium for instance, the start/stop bit is part of the Processor
 Status Register (psr). That is not a PMU register.


Well, the system call layer can manage that transparently with a little
software state (counter). No need to expose it.

 One approach does not prevent the other. Assuming you allow cr4.pce, then
 nothing prevents a self-monitoring thread from reading the counters directly.
 You'll just get the lower 32-bit of it. So if you read frequently enough, you
 should not have a problem.

Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-14 Thread Andi Kleen
On Wed, Nov 14, 2007 at 06:13:42AM -0800, Stephane Eranian wrote:
  At least for x86, and I suspect some other architectures, we don't
  initially need a syscall at all for this. There is an instruction,
  RDPMC, which can read a performance counter just fine. It is also much
  faster and generally preferable for the case where a process measures
  events about itself. In fact it is essential for one of the use cases
  I would like to see perfmon used for (replacement of RDTSC for cycle
  counting).
  
 
 This only works when counting (not sampling) and only for self-monitoring.

It works for global monitoring too.

 
  Later a syscall might be needed with event multiplexing, but that seems
  more like a far away non essential feature.
  
 On a machine with only two generic counters such as MIPS or Intel Core 2 Duo,
 multiplexing offers some advantages. If NMI watchdog is enabled, then you drop
 to one generic counter on Core 2.

NMI watchdog is off by default now.

Yes, longer term we might need multiplexing, but definitely not as a first step.

-Andi


Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-14 Thread William Cohen

Andi Kleen wrote:


One approach does not prevent the other. Assuming you allow cr4.pce, then
nothing prevents a self-monitoring thread from reading the counters directly.
You'll just get the lower 32-bit of it. So if you read frequently enough, you
should not have a problem.


Hmm? RDPMC is 64bit.


There are a number of processors that have 32-bit counters, such as the IBM Power
processors. On many x86 processors the upper bits of the counter are sign
extended from the lower 32 bits. Thus, one can only assume the lower 32 bits are
available. Rollover of values is quite possible (2 seconds of cycle count), so
additional work needs to be done to obtain a valid value.
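
A small C sketch of the arithmetic behind that remark. The 2 GHz clock used in
the note below is an assumption chosen to match the "2 seconds" figure, and
both function names are made up for the illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Events between two reads when only the low 32 bits are trusted.
 * Correct as long as fewer than 2^32 events occurred in between. */
uint32_t delta32(uint64_t before, uint64_t after)
{
    return (uint32_t)after - (uint32_t)before;   /* modulo 2^32 */
}

/* Seconds until a 32-bit cycle counter wraps at clock rate `hz`. */
double wrap_seconds(double hz)
{
    assert(hz > 0.0);
    return 4294967296.0 / hz;   /* 2^32 cycles */
}
```

At an assumed 2 GHz, wrap_seconds(2e9) is roughly 2.1 seconds. delta32() still
gives the right answer across one wrap, e.g. from 0xFFFFFFF0 to 0x10 it yields
0x20, but it cannot detect that more than one wrap happened between reads.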



But keep in mind that we do want a uniform interface across all hardware and
all type of sessions (self-monitoring, CPU-wide, monitoring of another thread).
You don't want an interface that says on x86 you have to use rdpmc, on Itanium
pfm_read_pmds() and so

I disagree. Using RDPMC is essential for at least some of the things I would
like to do with perfmon2. If the interface does not provide it it is useless
to me at least. System calls are far too slow for cycle measurements.


What range of cycles are you interested in measuring? 100's of cycles? A couple 
thousand? Are you just looking at cycle counts or other events?


-Will


Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-14 Thread Stephane Eranian

On Wed, Nov 14, 2007 at 10:44:20AM -0500, William Cohen wrote:
 Andi Kleen wrote:
 
 One approach does not prevent the other. Assuming you allow cr4.pce, then
 nothing prevents a self-monitoring thread from reading the counters directly.
 You'll just get the lower 32-bit of it. So if you read frequently enough, you
 should not have a problem.
 
 Hmm? RDPMC is 64bit.
 
 There are a number of processors that have 32-bit counters such as the IBM 
 power processors. On many x86 processors the upper bits of the counter are 
 sign extended from the lower 32 bits. Thus, one can only assume the lower 
 32-bit are available. Roll over of values is quite possible (2 seconds of 
 cycle count), so additional work needs to be done to obtain a valid value.
 

Exactly, on Intel processors only the bottom 32 bits are actually usable; the
rest is sign-extension. That's why it is okay for measuring small sections of
code, but that's it. On AMD, I think it is better. On Itanium you get the full
47 bits. Don't know about Power or Cell.

-- 
-Stephane


Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-14 Thread Philippe Elie
On Wed, 14 Nov 2007 at 10:44 +, Will Cohen wrote:

 Andi Kleen wrote:
 
 One approach does not prevent the other. Assuming you allow cr4.pce, then
 nothing prevents a self-monitoring thread from reading the counters directly.
 You'll just get the lower 32-bit of it. So if you read frequently enough, you
 should not have a problem.
 
 Hmm? RDPMC is 64bit.
 
 There are a number of processors that have 32-bit counters such as the IBM 
 power processors. On many x86 processors the upper bits of the counter are 
 sign extended from the lower 32 bits. Thus, one can only assume the lower 
 32-bit are available. Roll over of values is quite possible (2 seconds of 
 cycle count), so additional work needs to be done to obtain a valid value.

On x86 they are sign-extended only on write; on read they are 40 bits wide
for Intel, 48 bits for AMD.
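
One way to live with counters narrower than 64 bits is to fold successive raw
reads into a 64-bit software total. The sketch below is only an illustration
under assumptions: the struct and function names are hypothetical, the width
(40, 48, ...) is supplied by the caller, and the counter must be read at least
once per wrap period for the total to stay correct.

```c
#include <assert.h>
#include <stdint.h>

/* Software view of a hardware counter that is only `width` bits wide. */
struct soft_counter {
    uint64_t total;      /* accumulated 64-bit count */
    uint64_t last_raw;   /* previous raw read, already masked to width */
    unsigned width;      /* number of valid bits in a raw read, 1..64 */
};

uint64_t width_mask(unsigned width)
{
    assert(width >= 1 && width <= 64);
    return width == 64 ? UINT64_MAX : (((uint64_t)1 << width) - 1);
}

/* Fold a new raw read into the total, tolerating one wrap since last read. */
void soft_counter_update(struct soft_counter *c, uint64_t raw)
{
    uint64_t mask = width_mask(c->width);

    raw &= mask;
    c->total += (raw - c->last_raw) & mask;   /* delta modulo 2^width */
    c->last_raw = raw;
}
```

With width set to 40, a raw read going from 0xFFFFFFFFF0 to 0x10 adds 0x20 to
the total instead of producing a huge bogus delta.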

BTW, isn't rdpmc only enabled for ring 0 on Linux? I remember a patch
to disable it, dunno if it has been applied.

-- 
Phe



Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-14 Thread Andi Kleen
 BTW, isn't rdpmc only enabled for ring 0 on Linux? I remember a patch
 to disable it, dunno if it has been applied.

Obviously -- without a system call to set up performance counters it
would be fairly useless. But of course once such system calls are in
they should be able to trigger the bit for each process.

-Andi


Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-14 Thread David Miller
From: Andi Kleen [EMAIL PROTECTED]
Date: Wed, 14 Nov 2007 13:38:38 +0100

 At least for x86, and I suspect some other architectures, we don't
 initially need a syscall at all for this. There is an instruction,
 RDPMC, which can read a performance counter just fine. It is also much
 faster and generally preferable for the case where a process measures
 events about itself. In fact it is essential for one of the use cases
 I would like to see perfmon used for (replacement of RDTSC for cycle
 counting).

I wouldn't even want to use a syscall for something like
that on Sparc, I'd rather give this a dedicated software
trap so that I can code it completely in assembler.


Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-14 Thread Paul Mackerras
David Miller writes:

  You're suggesting that the behaviour of a read() should depend on what
  was in the buffer before the read?  Gack!  Surely you have better
  taste than that?
 
 Absolutely that's what I mean, it's atomic and gives you exactly what
 you need.
 
 I see nothing wrong or gross with these semantics.  Nothing in the
 book of UNIX specifies that for a device or special file the passed
 in buffer cannot contain input control data.

Oh kay *shudders*

It really violates the abstract model of read pretty badly.  Read
is "fill in the buffer with data from the device", not "do some
arbitrary stuff with this area of memory".

I'd prefer to have a transaction() system call like I suggested to
Nick rather than overloading read() like this.

  Then you end up with two system calls to get the data rather than one
  (one to send the request and another to read the reply).  For
  something that needs to be quick that is a suboptimal interface.
 
 Not necessarily, consider the possibility of using recvmsg() control
 message data.  With that it could be done in one go.
 
 This also suggests that it could be implemented as its own protocol
 family.

There's all sorts of possible ways that it could be implemented.  On
the one hand we have an actual proposed implementation, and on the
other we have various people saying "oh but it could be implemented
this other way" without providing any actual code.

Now if those people can show that their way of doing it is
significantly simpler and better than the existing implementation,
then that's useful.  I really don't think that doing a whole new
net protocol family is a simpler and better way of doing a performance
monitor interface, though.

Paul.


Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-14 Thread Paul Mackerras
Nick Piggin writes:

 What I really mean is a readv-like syscall, but one that also
 vectorises the file offset. Maybe this is useful enough as a generic
 syscall that also helps Paul's example...

I've sometimes thought it would be useful to have a "transaction"
system call that is like a write + read combined into one:

int transaction(int fd, char *req, size_t req_nb,
char *reply, size_t reply_nb);

as a way to provide a general request/reply interface for special
files.

 Of course, I guess this all depends on whether the atomicity is an
 important requirement. If not, you can obviously just do it with
 multiple read syscalls...

That would take N system calls instead of one, which could have a
performance impact if you need to read the counters frequently (which
I believe you do in some performance monitoring situations).

Paul.

