Re: [perfmon2] perfmon2 merge news
Stephane Eranian <[EMAIL PROTECTED]> writes:

> [...]
>> > [...] AFAIK, there is no single call to stop T1 and wait until it
>> > is completely off the CPU, unless we go through the (internal)
>> > ptrace interface.
>>
>> The utrace code supports this style of thread manipulation better
>> than ptrace.
>
> Are you saying that utrace provides a utrace_thread_stop(tid) call
> that returns only when the thread tid is off the CPU, and that there
> is a utrace_thread_resume(tid) call? If that's the case then that is
> what I need.

While I see no single call, it can be synthesized from a sequence of
them: utrace_attach, utrace_set_flags (... UTRACE_ACTION_QUESCE ...),
then waiting for a callback. Roland, is there a more compact way?

> How are we with regards to utrace integration?

Roland McGrath is working on breaking the patches down.

- FChE
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [perfmon2] perfmon2 merge news
Charles,

On Fri, Dec 14, 2007 at 02:12:17PM -0500, Frank Ch. Eigler wrote:
>
> Stephane Eranian <[EMAIL PROTECTED]> writes:
>
> > [...] AFAIK, there is no single call to stop T1 and wait until it
> > is completely off the CPU, unless we go through the (internal)
> > ptrace interface.
>
> The utrace code supports this style of thread manipulation better
> than ptrace.

Are you saying that utrace provides a utrace_thread_stop(tid) call
that returns only when the thread tid is off the CPU, and that there
is a utrace_thread_resume(tid) call? If that's the case then that is
what I need.

How are we with regards to utrace integration?

Thanks.

--
-Stephane
Re: [perfmon2] perfmon2 merge news
Stephane Eranian <[EMAIL PROTECTED]> writes:

> [...] AFAIK, there is no single call to stop T1 and wait until it
> is completely off the CPU, unless we go through the (internal)
> ptrace interface.

The utrace code supports this style of thread manipulation better
than ptrace.

- FChE
Re: [perfmon2] perfmon2 merge news
Hello,

A few weeks back, I mentioned that I would post some interesting problems
that I have encountered while implementing perfmon and for which I am still
looking for better solutions. Here is one that I would like to solve right
now and for which I am interested in your comments.

One of the perfmon syscalls (pfm_restart()) is used to resume monitoring
after a user level notification. When operating in per-thread non
self-monitoring mode, the syscall needs to operate on the machine state of
the monitored thread. So you get into this situation:

    Thread T0                          Thread T1
        |                                  |
    pfm_restart()                          |
        |                                  |
    spin_lock_irqsave()                    |
        |                                  |
    modify T1's machine state ------------>|
        |                                  |
    spin_unlock_irqrestore()               |
        |                                  |
        v                                  v

Thread T1 may be running at the time T0 needs to modify its state. The
current solution is to set a TIF flag in T1. That TIF flag will cause T1 (on
kernel exit) to go into a perfmon function that will then modify the state,
i.e., the state is self-modified. That works okay but there are a few race
conditions.

For self-monitoring sessions (e.g., system-wide or per-thread), it is easy
because we operate in the correct thread. But there is a big difference
between self-monitoring and non self-monitoring: the pfm_restart() syscall
does not provide the same guarantee. In self-monitoring modes, the interface
guarantees that by the time you return from the call, the effects of the
call are visible. Whereas when monitoring another thread, the call currently
does not provide such a guarantee, i.e., it does not wait until T1 has seen
the TIF flag and completed the state modification before returning.

We could add a semaphore to enforce that guarantee but it gets difficult
with corner cases and cleanups in case of unexpected termination.

AFAIK, there is no single call to stop T1 and wait until it is completely
off the CPU, unless we go through the (internal) ptrace interface.

Would you have anything better to suggest?

Thanks.

--
-Stephane
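[Editorial sketch] The difference in guarantees described above can be modelled in a few lines of single-threaded C. This is only an illustration of the deferred, self-modified update, not perfmon code: `struct task_model`, `model_request_restart()` and `model_kernel_exit_hook()` are hypothetical names, and the boolean field merely stands in for the TIF flag.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the monitored thread T1. */
struct task_model {
    bool     restart_pending; /* plays the role of the TIF flag */
    uint64_t restart_arg;     /* value queued by the requesting thread T0 */
    uint64_t pmu_state;       /* machine state that only T1 itself updates */
};

/* T0 side: queue the request and return at once, without waiting for T1. */
static void model_request_restart(struct task_model *t1, uint64_t new_state)
{
    t1->restart_arg = new_state;
    t1->restart_pending = true;
}

/*
 * T1 side: meant to run on the "kernel exit" path. Applies the queued
 * change, if any, and reports whether something was applied.
 */
static bool model_kernel_exit_hook(struct task_model *t1)
{
    if (!t1->restart_pending)
        return false;
    t1->pmu_state = t1->restart_arg;
    t1->restart_pending = false;
    return true;
}
```

Between the return of `model_request_restart()` and the next run of `model_kernel_exit_hook()`, `pmu_state` still holds the old value. That window is the missing guarantee in the non self-monitoring case; closing it requires T0 to wait for T1, which is where the semaphore or a stop-and-wait primitive would come in.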
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
From: Stephane Eranian <[EMAIL PROTECTED]>
Date: Mon, 19 Nov 2007 12:53:30 -0800

> In any case, I would be happy to integrate your sparc64 patches.

I sent these to Philip Mucci late last night, but in the meantime
I finished implementing breakpoint support as well for pfmon.

Let me clean up my diffs and I'll send it all out to you in a
few hours.
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
From: Stephane Eranian <[EMAIL PROTECTED]>
Date: Mon, 19 Nov 2007 14:48:46 -0800

> Looks like we will have to use bytes (u8) instead. This may have some
> performance impact as well. Several bitmaps are used in the context/interrupt
> routines. Even with u8, there is still a problem with the bitmap*() macros.
> Now, only a small subset of the bitmap() macros are used, so it may be okay
> to duplicate them for u8.

I think it would be fine to just create a set of bitop interfaces
that operate on u32 objects instead of "unsigned long".

Currently perfmon2 does not need the atomic variants at all, and
those could thus be provided entirely under include/asm-generic/bitops/
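[Editorial sketch] A minimal version of what non-atomic bit operations on u32 objects could look like. The function names are hypothetical and this is plain user-space C, not the contents of any file under include/asm-generic/bitops/.

```c
#include <stdbool.h>
#include <stdint.h>

/* Non-atomic bit operations on an array of 32-bit words (hypothetical names). */

static inline void set_bit32(unsigned int nr, uint32_t *map)
{
    map[nr / 32] |= (uint32_t)1 << (nr % 32);
}

static inline void clear_bit32(unsigned int nr, uint32_t *map)
{
    map[nr / 32] &= ~((uint32_t)1 << (nr % 32));
}

static inline bool test_bit32(unsigned int nr, const uint32_t *map)
{
    return (map[nr / 32] >> (nr % 32)) & 1u;
}
```

Because the word size is fixed at 32 bits, the bit-to-word mapping is the same for 32-bit and 64-bit kernels, which is the property the "unsigned long" based helpers lack.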
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
Paul,

On Tue, Nov 20, 2007 at 08:43:32AM +1100, Paul Mackerras wrote:
> David Miller writes:
>
> > As a result I've found that perfmon2 is quite nice and allows
> > incredibly useful and powerful tools to be written. The syscalls
> > aren't that bad and really I see no reason to block its inclusion.
> >
> > I rescind all of my earlier objections, let's merge this soon :-)
>
> Strongly agree. However, I think we need to add structure size
> arguments to most of the syscalls so we can extend them later.
>
Yes, that is one way. It works well if you only extend structures at the
end. Given that you need to obtain the file descriptor first via a
pfm_create_context call, an alternative could be that you pass a version
number to that call to identify the version the application is requesting.

> Also, something I've been meaning to mention to Stephane is that the
> use of the cast_ulp() macro in perfmon is bogus and won't work on
> 32-bit big-endian platforms such as ppc32 and sparc32. On such

I don't like those cast_ulp() macros. They were put there to avoid compiler
warnings on some architectures. Clearly with the big-endian issue, we need
to find something else. The bitmap*() macros take unsigned long *. The
interface uses fixed size types to ensure ABI compatibility between 32 and
64 bit modes. This way there is no need to marshal syscall arguments for a
32-bit app running on a 64-bit host.

Looks like we will have to use bytes (u8) instead. This may have some
performance impact as well. Several bitmaps are used in the context/interrupt
routines. Even with u8, there is still a problem with the bitmap*() macros.
Now, only a small subset of the bitmap() macros are used, so it may be okay
to duplicate them for u8.

What do you think?

> platforms you can't take a pointer to an array of u64, cast it to
> unsigned long * and expect the kernel bitmap operations to work
> correctly on it. At the least you also need to XOR the bit numbers
> with 32 on those platforms. Another alternative is to define the
> bitmaps as arrays of bytes instead, which eliminates all byte ordering
> and wordsize problems (but makes it more tricky to use the kernel
> bitmap functions directly).
>

--
-Stephane
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
David Miller writes:

> As a result I've found that perfmon2 is quite nice and allows
> incredibly useful and powerful tools to be written. The syscalls
> aren't that bad and really I see no reason to block its inclusion.
>
> I rescind all of my earlier objections, let's merge this soon :-)

Strongly agree. However, I think we need to add structure size
arguments to most of the syscalls so we can extend them later.

Also, something I've been meaning to mention to Stephane is that the
use of the cast_ulp() macro in perfmon is bogus and won't work on
32-bit big-endian platforms such as ppc32 and sparc32. On such
platforms you can't take a pointer to an array of u64, cast it to
unsigned long * and expect the kernel bitmap operations to work
correctly on it. At the least you also need to XOR the bit numbers
with 32 on those platforms. Another alternative is to define the
bitmaps as arrays of bytes instead, which eliminates all byte ordering
and wordsize problems (but makes it more tricky to use the kernel
bitmap functions directly).

Paul.
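[Editorial sketch] The "XOR the bit numbers with 32" remark can be checked on any host by simulating the memory layout instead of relying on the host's endianness. The helpers below are hypothetical and written for this note: `store_be64()` lays a u64 out the way a big-endian machine would, and `be32_test_bit()` reads it back the way bitmap code with a 32-bit unsigned long would. The last two functions show the byte-array alternative.

```c
#include <stdbool.h>
#include <stdint.h>

/* Store a 64-bit value in memory the way a big-endian CPU would. */
static void store_be64(uint8_t *mem, uint64_t v)
{
    for (int i = 0; i < 8; i++)
        mem[i] = (uint8_t)(v >> (56 - 8 * i));
}

/* Load 32-bit word number idx the way a 32-bit big-endian CPU would. */
static uint32_t load_be32(const uint8_t *mem, unsigned int idx)
{
    const uint8_t *p = mem + 4 * idx;
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* test_bit() as seen through a 32-bit "unsigned long" view of that memory. */
static bool be32_test_bit(const uint8_t *mem, unsigned int nr)
{
    return (load_be32(mem, nr / 32) >> (nr % 32)) & 1u;
}

/* Byte-array bitmap: bit nr lives in byte nr / 8 regardless of endianness. */
static void bytemap_set_bit(uint8_t *map, unsigned int nr)
{
    map[nr / 8] |= (uint8_t)(1u << (nr % 8));
}

static bool bytemap_test_bit(const uint8_t *map, unsigned int nr)
{
    return (map[nr / 8] >> (nr % 8)) & 1u;
}
```

With bit 3 set in the u64, `be32_test_bit(mem, 3)` is false while `be32_test_bit(mem, 3 ^ 32)` is true, because the low half of the u64 sits in the second 32-bit word. The byte-array form has no such dependency.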
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
David,

On Mon, Nov 19, 2007 at 05:08:43AM -0800, David Miller wrote:
>
> Instead of blabbering further about this topic, I decided to put my
> code where my mouth is and spent the weekend porting the perfmon2
> kernel bits, and the user bits (libpfm and pfmon) to sparc64.
>
I appreciate your effort. I am glad to see that the interface and
implementation survived yet another architecture. I think at this point
ARM is the only major architecture missing.

In any case, I would be happy to integrate your sparc64 patches.

> As a result I've found that perfmon2 is quite nice and allows
> incredibly useful and powerful tools to be written. The syscalls
> aren't that bad and really I see no reason to block its inclusion.
>
As I said earlier, I am not opposed to changing the syscalls. I have
proposed a few schemes to address the issue of versioning. If vector
arguments are problematic, we can go with a single register per call.

I think there are other areas where perfmon2 could benefit from the help
of the LKML developers. I will post a list shortly.

> I rescind all of my earlier objections, let's merge this soon :-)

Thanks.

--
-Stephane
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
Instead of blabbering further about this topic, I decided to put my
code where my mouth is and spent the weekend porting the perfmon2
kernel bits, and the user bits (libpfm and pfmon) to sparc64.

As a result I've found that perfmon2 is quite nice and allows
incredibly useful and powerful tools to be written. The syscalls
aren't that bad and really I see no reason to block its inclusion.

I rescind all of my earlier objections, let's merge this soon :-)
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
Hello,

On Wed, Nov 14, 2007 at 08:20:22PM -0800, dean gaudet wrote:
> On Wed, 14 Nov 2007, Andi Kleen wrote:
>
> > Later a syscall might be needed with event multiplexing, but that seems
> > more like a far away non essential feature.
>
> actually multiplexing is the main feature i am in need of. there are an
> insufficient number of counters (even on k8 with 4 counters) to do
> complete stall accounting or to get a general overview of L1d/L1i/L2 cache
> hit rates, average miss latency, time spent in various stalls, and the
> memory system utilization (or HT bus utilization). this runs out to
> something like 30 events which are interesting... and re-running a
> benchmark over and over just to get around the lack of multiplexing is a
> royal pain in the ass.
>
> it's not a "far away non-essential feature" to me. it's something i would
> use daily if i had all the pieces together now (and i'm constrained
> because i cannot add an out-of-tree patch which adds unofficial syscalls
> to the kernel i use).
>
Multiplexing in the context of perfmon2 means that you can measure more
events than there are counters. To make this work, we create the notion of
an event set or, more precisely, a register set. Each set encapsulates the
full PMU state. Then the kernel multiplexes the sets onto the actual PMU
hardware.

Why do we need this?

As Dean pointed out, there are many important metrics which do require more
events than there are counters. Making multiple runs can be difficult with
some workloads.

But there are also other, less known, reasons why you'd want to do this.
Having lots of counters does not necessarily mean that you can measure lots
of related events simultaneously. Take the Pentium 4 for instance: it has
18 counters, but for most interesting metrics, you cannot measure all the
events at once. Why? Because there are important hardware constraints which
translate into event combination constraints. It is not uncommon to have
constraints such as:
	- event A and B cannot be measured together
	- event A can only be measured by counter X
	- if event A is measured, then only events B, C, D can be measured

This is not just on Itanium. Power has limitations, Intel Core 2 has
limitations, AMD Opterons also have limitations.

When you combine a limited number of counters with strong constraints, it
can quickly become difficult to make measurements in one run.

Multiplexing is, of course, not as good as measuring all events continuously
but if you run for long enough and with a reasonable switching period, the
*estimates* you get by scaling the obtained counts can be very close to what
they would have been had you measured all events all the time. You have to
balance precision with overhead.

Why do this in the kernel?

One might argue that there is nothing preventing tools from multiplexing at
the user level. That's true and we do support this as well. You have to:
	- stop monitoring
	- read out current counter
	- reprogram config and data registers
	- restart monitoring

But there are some important benefits to doing this in the kernel,
especially for per-thread monitoring. When you are not self-monitoring, you
would need to stop the other thread first, then issue a minimum of 4 system
calls and incur a couple of context switches. By doing it in the kernel, you
are guaranteed that switching always occurs in the context of the monitored
thread. Furthermore it can be integrated with kernel-level sampling.

Adding the notion of event set is fairly pervasive and you need to make sure
that it fits well with the other parts of the interface.

--
-Stephane
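[Editorial sketch] The scaling step mentioned above can be written down as a simple linear extrapolation. The message does not give the formula, so the one used here (count multiplied by total time over active time) is an assumption, as are the `struct mux_sample` layout and the function name.

```c
#include <stdint.h>

/* Hypothetical per-event bookkeeping for a multiplexed run. */
struct mux_sample {
    uint64_t raw_count;   /* events counted while the event's set was loaded */
    uint64_t time_active; /* time the set was actually on the PMU */
    uint64_t time_total;  /* duration of the whole run, same unit */
};

/*
 * Linear scaling: assume the event rate seen while the set was active
 * also held for the rest of the run. Returns 0 if the set never ran.
 */
static double mux_scaled_estimate(const struct mux_sample *s)
{
    if (s->time_active == 0)
        return 0.0;
    return (double)s->raw_count * (double)s->time_total /
           (double)s->time_active;
}
```

For example, 1000 events counted during 25 time units of a 100 unit run scale to an estimate of 4000. The shorter `time_active` is relative to `time_total`, the more the result depends on the workload being steady, which is the precision versus overhead trade-off described above.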
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
Herbert Xu <[EMAIL PROTECTED]> writes:

> That's strong static typing. Netlink is 90% strong static
> typing plus 10% strong dynamic typing. That is, it'll tell
> you at run-time if you give it the wrong netlink attribute.

Well it tells you EINVAL no matter what is wrong. That's roughly similar
to a compiler whose only error message is 'WRONG'. Or the ed school
of error reporting.

That makes any checking it does barely useful.

-Andi
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
Hi,

On Thu, Nov 15, 2007 at 12:11:10PM +1100, Paul Mackerras wrote:
> David Miller writes:
>
> > From: Paul Mackerras <[EMAIL PROTECTED]>
> > Date: Thu, 15 Nov 2007 10:12:22 +1100
> >
> > > *I* never had a problem with a few extra system calls. I don't
> > > understand why you (apparently) do.
> >
> > We're stuck with them forever, they are hard to version and extend
> > cleanly.
> >
> > Those are my main objections.
>
> The first is valid (for suitable values of "forever") but applies to
> any user/kernel interface, not just system calls.
>
Agreed.

> As for the second (hard to version) I don't see why it applies to
> syscalls specifically more than to other interfaces. It's just a
> matter of designing it correctly in the first place. For example, the
> sys_swapcontext system call we have on powerpc takes an argument which
> is the size of the ucontext_t that userland is using, which allows us
> to extend it in future if necessary. (Note that I'm not saying that
> the current perfmon2 interfaces are well-designed in this respect.)
>
> The third (hard to extend cleanly) is a good point, and is a valid
> criticism of the current set of perfmon2 system calls, I think.
> However, the goal of being able to extend the interface tends to be in
> opposition to the goal of having strong typing of the interface.
> Things like a multiplexed syscall or an ioctl are much easier to
> extend but that is at the expense of losing strong typing. Something
> like my transaction() (or your weird kind of read() :) also provides
> extensibility but loses type safety to some degree.
>
In the initial design there was only one perfmon syscall, perfmonctl(), and
it was a multiplexing call. People objected to it and thus I split it up
into multiple system calls. I like the strong typing but I agree that it is
harder to extend without creating new syscalls.

In the current state, all perfmon syscalls take a pointer to structs which
have reserved fields for future extensions. If you specify that reserved
fields must be zeroed, then it leaves you *some* flexibility for extending
the structs.

Another alternative, similar to your ucontext, would be to pass the size of
the structure. If we assume we drop the vector arguments, we could do:

	pfm_write_pmcs(fd, &pmc, sizeof(pmc));

instead of

	pfm_write_pmcs(fd, &pmc);

Should the sizeof(pmc) need to change we could demultiplex inside the kernel.

Another, probably cleaner, possibility is to version structures that are
passed:

	union pfarg_pmc {
		int version;
		struct {
			int version;
			int reg_num;
			u64 reg_value;
		} v1;
	};

But that seems overkill. I think versioning could be passed when the
session is created instead of at every call:

	fd = pfm_create_session(version, , );

> Also, as Andi says, this is core CPU state that we are dealing with,
> not some I/O device, so treating the whole of perfmon2 (or any
> performance monitoring infrastructure) as a driver doesn't fit very
> well, and in fact system calls are appropriate. Just like we don't
> try to make access to debugging facilities fit into a driver, we
> shouldn't make performance monitoring fit into a driver either.
>
Agreed 100%. This is especially true because we support per-thread
monitoring.

--
-Stephane
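[Editorial sketch] One way the "demultiplex on size inside the kernel" idea could work, modelled in user-space C. Nothing here is perfmon2 code: `struct pmc_arg_model`, its `ext_value` field and `model_copy_arg()` are hypothetical, and the policy shown (zero-fill what an older caller did not supply, reject non-zero bytes a newer caller supplies beyond what is understood) is just one reasonable choice.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical current layout known to the receiving side. */
struct pmc_arg_model {
    int32_t  reg_num;
    uint32_t flags;
    uint64_t reg_value;
    uint64_t ext_value; /* pretend this field was appended in a later revision */
};

/*
 * Copy a caller-supplied structure of usize bytes into dst.
 * - Shorter (older) callers: missing trailing fields read as zero.
 * - Longer (newer) callers: accepted only if the unknown tail is all zero.
 */
static int model_copy_arg(struct pmc_arg_model *dst, const void *src,
                          size_t usize)
{
    const unsigned char *p = src;
    size_t ksize = sizeof(*dst);
    size_t n = usize < ksize ? usize : ksize;

    if (usize == 0)
        return -EINVAL;
    for (size_t i = ksize; i < usize; i++)
        if (p[i] != 0)
            return -E2BIG;

    memset(dst, 0, ksize);
    memcpy(dst, src, n);
    return 0;
}
```

This only works when structures grow at the end, which matches the caveat raised elsewhere in the thread. A version number passed once at session creation would avoid the per-call size argument at the cost of keeping per-session state.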
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
Paul Mackerras <[EMAIL PROTECTED]> wrote:
>
> Well you must mean something different by "strong typing" from the
> rest of us. Strong typing means that the compiler can check that you
> have passed in the correct types of arguments, but the compiler
> doesn't have any visibility into what structures are valid in netlink
> messages.

That's strong static typing. Netlink is 90% strong static typing plus 10% strong dynamic typing. That is, it'll tell you at run-time if you give it the wrong netlink attribute.

The types within each netlink attribute are checked at compile time.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
Herbert Xu <[EMAIL PROTECTED]> writes:
>
> That's strong static typing. Netlink is 90% strong static typing
> plus 10% strong dynamic typing. That is, it'll tell you at run-time
> if you give it the wrong netlink attribute.

Well it tells you EINVAL no matter what is wrong. That's roughly similar to a compiler whose only error message is 'WRONG'. Or the ed school of error reporting. That makes any checking it does barely useful.

-Andi
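To make the exchange concrete, here is a rough sketch of a run-time attribute check of the kind being argued about. It is not the netlink API: attr_hdr, attr_policy and attr_validate are names invented for this example. It shows both points at once: the type tag is checked against a policy table at run time, and every failure collapses into the same -EINVAL.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical attribute header: a type tag and the payload length. */
struct attr_hdr {
    uint16_t type;
    uint16_t len;     /* payload bytes following the header */
};

enum { ATTR_REG_NUM = 1, ATTR_REG_VALUE = 2, ATTR_MAX = 2 };

/* Expected payload size per attribute type; index 0 is unused. */
static const uint16_t attr_policy[ATTR_MAX + 1] = {
    [ATTR_REG_NUM]   = sizeof(uint16_t),
    [ATTR_REG_VALUE] = sizeof(uint64_t),
};

/* Return 0 if the attribute matches the policy, -EINVAL otherwise. */
static int attr_validate(const struct attr_hdr *hdr)
{
    if (hdr->type == 0 || hdr->type > ATTR_MAX)
        return -EINVAL;              /* unknown attribute */
    if (hdr->len != attr_policy[hdr->type])
        return -EINVAL;              /* wrong payload size for this type */
    return 0;
}
```

The caller cannot tell an unknown attribute from a bad length, which is the complaint above; returning distinct codes or an offset to the offending attribute would be one way around it.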
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
Hello,

On Wed, Nov 14, 2007 at 08:20:22PM -0800, dean gaudet wrote:
> On Wed, 14 Nov 2007, Andi Kleen wrote:
>
> > Later a syscall might be needed with event multiplexing, but that seems
> > more like a far away non essential feature.
>
> actually multiplexing is the main feature i am in need of. there are an
> insufficient number of counters (even on k8 with 4 counters) to do
> complete stall accounting or to get a general overview of L1d/L1i/L2 cache
> hit rates, average miss latency, time spent in various stalls, and the
> memory system utilization (or HT bus utilization). this runs out to
> something like 30 events which are interesting... and re-running a
> benchmark over and over just to get around the lack of multiplexing is a
> royal pain in the ass.
>
> it's not a "far away non-essential feature" to me. it's something i would
> use daily if i had all the pieces together now (and i'm constrained
> because i cannot add an out-of-tree patch which adds unofficial syscalls
> to the kernel i use).
>
Multiplexing in the context of perfmon2 means that you can measure more events than there are counters. To make this work, we create the notion of an event set or, more precisely, a register set. Each set encapsulates the full PMU state. Then the kernel multiplexes the sets onto the actual PMU hardware.

Why do we need this?

As Dean pointed out, there are many important metrics which require more events than there are counters. Making multiple runs can be difficult with some workloads.

But there are also other, less known, reasons why you'd want to do this. Having lots of counters does not necessarily mean that you can measure lots of related events simultaneously. Take Pentium 4 for instance: it has 18 counters, but for most interesting metrics, you cannot measure all the events at once. Why? Because there are important hardware constraints which translate into event combination constraints. It is not uncommon to have constraints such as:

- event A and B cannot be measured together
- event A can only be measured by counter X
- if event A is measured, then only events B, C, D can be measured

This is not just on Itanium. Power has limitations, Intel Core 2 has limitations, AMD Opterons also have limitations.

When you combine a limited number of counters with strong constraints, it can quickly become difficult to make measurements in one run.

Multiplexing is, of course, not as good as measuring all events continuously, but if you run for long enough and with a reasonable switching period, the *estimates* you get by scaling the obtained counts can be very close to what they would have been had you measured all events all the time. You have to balance precision with overhead.

Why do this in the kernel?

One might argue that there is nothing preventing tools from multiplexing at the user level. That's true and we do support this as well. You have to:

- stop monitoring
- read out current counter
- reprogram config and data registers
- restart monitoring

But there are some important benefits to doing this in the kernel, especially for per-thread monitoring. When you are not self-monitoring, you would need to stop the other thread first, then issue a minimum of 4 system calls and incur a couple of context switches. By doing it in the kernel, you are guaranteed that switching always occurs in the context of the monitored thread. Furthermore it can be integrated with kernel-level sampling.

Adding the notion of event set is fairly pervasive and you need to make sure that it fits well with the other parts of the interface.

--
-Stephane
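The scaling step mentioned above can be sketched as follows. This is a minimal illustration, assuming the tool knows how long each event set was actually loaded on the PMU and how long the whole run lasted; scale_count and its parameters are names made up for this example, not part of the perfmon2 interface.

```c
#include <stdint.h>

/*
 * Estimate what a counter would have read had its event set been
 * active for the whole run.
 *   raw       - count accumulated while the set was loaded
 *   active_ns - time the set was actually loaded on the PMU
 *   total_ns  - duration of the whole measurement
 */
static uint64_t scale_count(uint64_t raw, uint64_t active_ns,
                            uint64_t total_ns)
{
    if (active_ns == 0)
        return 0;   /* the set never ran: no basis for an estimate */
    return (uint64_t)((double)raw * (double)total_ns / (double)active_ns);
}
```

For example, a set that was loaded for a quarter of the run and counted 1000 events yields an estimate of 4000. The estimate is only as good as the assumption that the workload behaves similarly while the set is switched out.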
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
On Thu, 15 Nov 2007, Paul Mackerras wrote:
> dean gaudet writes:
>
> > actually multiplexing is the main feature i am in need of. there are an
> > insufficient number of counters (even on k8 with 4 counters) to do
> > complete stall accounting or to get a general overview of L1d/L1i/L2 cache
> > hit rates, average miss latency, time spent in various stalls, and the
> > memory system utilization (or HT bus utilization). this runs out to
> > something like 30 events which are interesting... and re-running a
> > benchmark over and over just to get around the lack of multiplexing is a
> > royal pain in the ass.
>
> So by "multiplexing" do you mean the ability to have multiple event
> sets associated with a context and have the kernel switch between them
> automatically?

yep.

-dean
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
dean gaudet writes:

> actually multiplexing is the main feature i am in need of. there are an
> insufficient number of counters (even on k8 with 4 counters) to do
> complete stall accounting or to get a general overview of L1d/L1i/L2 cache
> hit rates, average miss latency, time spent in various stalls, and the
> memory system utilization (or HT bus utilization). this runs out to
> something like 30 events which are interesting... and re-running a
> benchmark over and over just to get around the lack of multiplexing is a
> royal pain in the ass.

So by "multiplexing" do you mean the ability to have multiple event sets associated with a context and have the kernel switch between them automatically?

Paul.
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
On Wed, 14 Nov 2007, Andi Kleen wrote:

> Later a syscall might be needed with event multiplexing, but that seems
> more like a far away non essential feature.

actually multiplexing is the main feature i am in need of. there are an insufficient number of counters (even on k8 with 4 counters) to do complete stall accounting or to get a general overview of L1d/L1i/L2 cache hit rates, average miss latency, time spent in various stalls, and the memory system utilization (or HT bus utilization). this runs out to something like 30 events which are interesting... and re-running a benchmark over and over just to get around the lack of multiplexing is a royal pain in the ass.

it's not a "far away non-essential feature" to me. it's something i would use daily if i had all the pieces together now (and i'm constrained because i cannot add an out-of-tree patch which adds unofficial syscalls to the kernel i use).

-dean
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
David Miller writes:

> From: Paul Mackerras <[EMAIL PROTECTED]>
> Date: Thu, 15 Nov 2007 12:11:10 +1100
>
> > The third (hard to extend cleanly) is a good point, and is a valid
> > criticism of the current set of perfmon2 system calls, I think.
> > However, the goal of being able to extend the interface tends to be in
> > opposition to the goal of having strong typing of the interface.
> > Things like a multiplexed syscall or an ioctl are much easier to
> > extend but that is at the expense of losing strong typing.
>
> I disagree.
>
> With netlink we can just add new attributes when a new need arises for
> a particular interface. The attribute code describes the type
> precisely, so there is no loss of strong typing at all.

Well you must mean something different by "strong typing" from the rest of us. Strong typing means that the compiler can check that you have passed in the correct types of arguments, but the compiler doesn't have any visibility into what structures are valid in netlink messages.

In any case, I think that adding a structure size argument to the current perfmon2 system calls where appropriate would mean that we could extend them cleanly later on if necessary. It would mean that we could add fields at the end, and that the kernel could know what version of the structures that userspace was using.

Paul.
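The structure-size idea can be illustrated with a small user-space sketch. Everything here is assumed for the example: pmc_v1, pmc_v2 and pmc_import are invented names, and memcpy() stands in for what would be a copy from user space in the kernel. The layouts use explicit padding so that the two versions really do differ in sizeof; with implicit tail padding they could end up the same size and the scheme would not work.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical first-generation layout (16 bytes). */
struct pmc_v1 {
    uint64_t reg_value;
    uint32_t reg_num;
    uint32_t pad;
};

/* Hypothetical later layout: same prefix, one field appended (24 bytes). */
struct pmc_v2 {
    uint64_t reg_value;
    uint32_t reg_num;
    uint32_t pad;
    uint64_t reg_set;   /* new field, e.g. an event-set index */
};

/*
 * Import a caller-supplied struct of 'size' bytes into the newest
 * layout. The size identifies the version; fields the caller did not
 * supply are left zero.
 */
static int pmc_import(struct pmc_v2 *dst, const void *src, size_t size)
{
    if (size != sizeof(struct pmc_v1) && size != sizeof(struct pmc_v2))
        return -EINVAL;
    memset(dst, 0, sizeof(*dst));
    memcpy(dst, src, size);   /* valid because v1 is a prefix of v2 */
    return 0;
}
```

An old binary passing sizeof(struct pmc_v1) keeps working, and the kernel sees reg_set as zero, which it can treat as the default.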
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
From: Paul Mackerras <[EMAIL PROTECTED]>
Date: Thu, 15 Nov 2007 12:11:10 +1100

> The third (hard to extend cleanly) is a good point, and is a valid
> criticism of the current set of perfmon2 system calls, I think.
> However, the goal of being able to extend the interface tends to be in
> opposition to the goal of having strong typing of the interface.
> Things like a multiplexed syscall or an ioctl are much easier to
> extend but that is at the expense of losing strong typing.

I disagree.

With netlink we can just add new attributes when a new need arises for a particular interface. The attribute code describes the type precisely, so there is no loss of strong typing at all.
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
David Miller writes:

> From: Paul Mackerras <[EMAIL PROTECTED]>
> Date: Thu, 15 Nov 2007 10:12:22 +1100
>
> > *I* never had a problem with a few extra system calls. I don't
> > understand why you (apparently) do.
>
> We're stuck with them forever, they are hard to version and extend
> cleanly.
>
> Those are my main objections.

The first is valid (for suitable values of "forever") but applies to any user/kernel interface, not just system calls.

As for the second (hard to version) I don't see why it applies to syscalls specifically more than to other interfaces. It's just a matter of designing it correctly in the first place. For example, the sys_swapcontext system call we have on powerpc takes an argument which is the size of the ucontext_t that userland is using, which allows us to extend it in future if necessary. (Note that I'm not saying that the current perfmon2 interfaces are well-designed in this respect.)

The third (hard to extend cleanly) is a good point, and is a valid criticism of the current set of perfmon2 system calls, I think. However, the goal of being able to extend the interface tends to be in opposition to the goal of having strong typing of the interface. Things like a multiplexed syscall or an ioctl are much easier to extend but that is at the expense of losing strong typing. Something like my transaction() (or your weird kind of read() :) also provides extensibility but loses type safety to some degree.

Also, as Andi says, this is core CPU state that we are dealing with, not some I/O device, so treating the whole of perfmon2 (or any performance monitoring infrastructure) as a driver doesn't fit very well, and in fact system calls are appropriate. Just like we don't try to make access to debugging facilities fit into a driver, we shouldn't make performance monitoring fit into a driver either.

Paul.
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
Andi Kleen writes:

> > This only works when counting (not sampling) and only for self-monitoring.
>
> It works for global monitoring too.

How would you provide access to the counters of another process? Through an extension to ptrace perhaps?

Paul.
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
Andi,

On Wed, Nov 14, 2007 at 03:24:11PM +0100, Andi Kleen wrote:
> On Wed, Nov 14, 2007 at 05:09:09AM -0800, Stephane Eranian wrote:
> >
> > Partially true. The file descriptor becomes really useful when you
> > sample. You leverage the file descriptor to receive notifications of
> > counter overflows and full sampling buffer. You extract notification
> > messages via read() and you can use SIGIO, select/poll.
>
> Hmm, ok for the event notification we would need a nice interface. Still
> have my doubts a file descriptor is the best way to do this though.
>
Why do you think the existing interfaces are not a good fit for this? Is this just because of your problem with file descriptors?

From my experience read(), select(), and SIGIO are fine. I know many tools use that.

As for the file descriptor, you would need to replace that with another identifier of some sort. As I pointed out in another message on this thread, you don't want to use a pid-based identifier. This is not usable when you monitor other threads and you want to read out the results after their death.

> > Are you suggesting something like: pfm_write_pmcs(fd, 0, 0x1234)?
>
> See my example below.
> >
> > That would be quite expensive when you have lots of registers to setup:
> > one syscall per register. The perfmon syscalls to read/write registers
> > accept vector of arguments to amortize the cost of the syscall over
> > multiple registers (similar to poll(2)).
>
> First system calls are not that slow on Linux. Measure it.
>
If people do not like vector arguments, then I think I can live with N system calls to program N registers. Now you have two choices for passing the arguments:

- a pointer to a struct:

    struct pfarg_pmc { uint64_t reg_value; uint16_t reg_num; } pmc0;

    pmc0.reg_num = 0;
    pmc0.reg_value = 0x1234;
    pfm_write_pmcs(fd, &pmc0);

- explicitly passing every field:

    pfm_write_pmcs(fd, 0x0, 0x1234);

Given that event set and multiplexing would not be in initially, we would want to allow for them to be added later without having to create yet another system call, right?

Of course the same approach would work for the data registers at least for counting.

> > With many tools, registers are not just setup once. During certain
> > measurements, data registers may be read multiple times. When you
> > sample or multiplex at
>
> I think you optimize the wrong thing here.
>
> There are basically two cases I see:
>
> - Global measurement of lots of things:

I am not sure I understand what you mean by 'lots of things'? Are you still talking per-thread and self-monitoring?

> Things are slow anyways with large context switch overheads. The
> overheads are large anyways. Doing one or more system calls probably
> does not matter much. Most important is a clean interface.
>
> - Exact measurement of the current process. For that you need very
> low latencies. Any system call is too slow. That is why CPUs have
> instructions like RDPMC that allow to read those registers with
> minimal latency in user space. Interface should support those.
>
I don't have a problem with that. And in fact, I already support that at least on Itanium. I had that in there for X86 but I dropped it after you said that you would enable cr4.pce globally. I don't have a problem adding it back for self-monitoring sessions.

> Also for this case programming time does not matter too much. You
> just program once and then do RDPMC before code to measure and then
> afterwards and take the difference. The actual counter setup is out
> of the latency critical path.
>
Agreed.

> > It depends on what you are doing. Here, this was not really necessary.
> > It was meant to show how you can program the data registers as well.
> > Perfmon2 provides default values for all data registers. For counters,
> > the value is guaranteed to be zero.
> >
> > But it is important to note that not all data registers are counters.
> > That is the case of Itanium 2, some are just buffers. On AMD Barcelona
> > IBS several are buffers as well, and some may need to be initialized to
> > non zero value, i.e., the IBS sampling period.
>
> Setting period should be a separate call. Mixing the two together into one
> does not look like a nice interface.
>
Periods are set up by data register. Given that there is already a call to program the data register, why add another one? You don't need to treat the sampling period differently from the register value. This is just a value that will cause the register to overflow after an explicit number of occurrences.

> > With event-based sampling, the period is expressed as the number of
> > occurrences of an event. For instance, you can say: "take a sample
> > every 2000 L2 cache misses". The way you express this with perfmon2 is that
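The remark that a sampling period is "just a value that will cause the register to overflow after an explicit number of occurrences" can be illustrated with a little arithmetic. The sketch below assumes an up-counting register of a given bit width that raises its overflow condition when it wraps to zero; period_to_reg is a name invented for this example and the exact register semantics vary by PMU.

```c
#include <stdint.h>

/*
 * Initial value for a 'width'-bit up-counter that should wrap to zero
 * after 'period' more events. 'width' must be between 1 and 63 here.
 */
static uint64_t period_to_reg(uint64_t period, unsigned width)
{
    uint64_t mask = (UINT64_C(1) << width) - 1;

    return (0 - period) & mask;   /* 2^width - period, modulo 2^width */
}
```

With a 32-bit counter and a period of 2000 events, the register would be loaded with 0xFFFFF830, and after 2000 increments it wraps to zero.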
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
On Thursday 15 November 2007 09:56, Chuck Ebbert wrote:
> On 11/14/2007 05:17 AM, Nick Piggin wrote:
> > But in general, for special files, I guess the response is usually
> > some structured data (that is not visible at the syscall layer).
> > So I don't see a big problem to have a similarly arbitrarily
> > structured request.
>
> IOW, an ioctl.

In the same way a read of structured data from a special file "is an" ioctl, yeah. You could implement either with an ioctl. The main difference is they have more explicitly typed interfaces.

Whether that's enough argument (and if Paul's proposal is widely usable enough) is another question. Which I won't try to answer.
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
From: Paul Mackerras <[EMAIL PROTECTED]>
Date: Thu, 15 Nov 2007 10:12:22 +1100

> *I* never had a problem with a few extra system calls. I don't
> understand why you (apparently) do.

We're stuck with them forever, they are hard to version and extend cleanly.

Those are my main objections.
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
David Miller writes:

> From: Paul Mackerras <[EMAIL PROTECTED]>
> Date: Thu, 15 Nov 2007 08:50:22 +1100
>
> > I'd prefer to have a transaction() system call like I suggested to
> > Nick rather than overloading read() like this.
>
> So much for getting rid of the extra system calls...

*I* never had a problem with a few extra system calls. I don't understand why you (apparently) do.

Paul.
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
From: Paul Mackerras <[EMAIL PROTECTED]>
Date: Thu, 15 Nov 2007 08:50:22 +1100

> I'd prefer to have a transaction() system call like I suggested to
> Nick rather than overloading read() like this.

So much for getting rid of the extra system calls...
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
On 11/14/2007 05:17 AM, Nick Piggin wrote:
>
> But in general, for special files, I guess the response is usually
> some structured data (that is not visible at the syscall layer).
> So I don't see a big problem to have a similarly arbitrarily
> structured request.
>

IOW, an ioctl.
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
On Thursday 15 November 2007 08:30, Paul Mackerras wrote:
> Nick Piggin writes:
> > What I really mean is a readv-like syscall, but one that also
> > vectorises the file offset. Maybe this is useful enough as a generic
> > syscall that also helps Paul's example...
>
> I've sometimes thought it would be useful to have a "transaction"
> system call that is like a write + read combined into one:
>
> int transaction(int fd, char *req, size_t req_nb,
>                 char *reply, size_t reply_nb);
>
> as a way to provide a general request/reply interface for special
> files.

Maybe not a bad idea, though I'm not the one to ask about taste ;)

In this case, it is enough for your requests to be a set of scalars (eg. file offsets), so it _could_ be handled with vectorised offsets...

But in general, for special files, I guess the response is usually some structured data (that is not visible at the syscall layer). So I don't see a big problem to have a similarly arbitrarily structured request.

> > Of course, I guess this all depends on whether the atomicity is an
> > important requirement. If not, you can obviously just do it with
> > multiple read syscalls...
>
> That would take N system calls instead of one, which could have a
> performance impact if you need to read the counters frequently (which
> I believe you do in some performance monitoring situations).

That's true too.
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
David Miller writes:

> > You're suggesting that the behaviour of a read() should depend on what
> > was in the buffer before the read? Gack! Surely you have better
> > taste than that?
>
> Absolutely that's what I mean, it's atomic and gives you exactly what
> you need.
>
> I see nothing wrong or gross with these semantics. Nothing in the
> "book of UNIX" specifies that for a device or special file the passed
> in buffer cannot contain input control data.

Oh kay *shudders*

It really violates the abstract model of "read" pretty badly. "Read" is "fill in the buffer with data from the device", not "do some arbitrary stuff with this area of memory".

I'd prefer to have a transaction() system call like I suggested to Nick rather than overloading read() like this.

> > Then you end up with two system calls to get the data rather than one
> > (one to send the request and another to read the reply). For
> > something that needs to be quick that is a suboptimal interface.
>
> Not necessarily, consider the possibility of using recvmsg() control
> message data. With that it could be done in one go.
>
> This also suggests that it could be implemented as it's own protocol
> family.

There's all sorts of possible ways that it could be implemented. On the one hand we have an actual proposed implementation, and on the other we have various people saying "oh but it could be implemented this other way" without providing any actual code.

Now if those people can show that their way of doing it is significantly simpler and better than the existing implementation, then that's useful. I really don't think that doing a whole new net protocol family is a simpler and better way of doing a performance monitor interface, though.

Paul.
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
Nick Piggin writes:

> What I really mean is a readv-like syscall, but one that also
> vectorises the file offset. Maybe this is useful enough as a generic
> syscall that also helps Paul's example...

I've sometimes thought it would be useful to have a "transaction" system call that is like a write + read combined into one:

    int transaction(int fd, char *req, size_t req_nb,
                    char *reply, size_t reply_nb);

as a way to provide a general request/reply interface for special files.

> Of course, I guess this all depends on whether the atomicity is an
> important requirement. If not, you can obviously just do it with
> multiple read syscalls...

That would take N system calls instead of one, which could have a performance impact if you need to read the counters frequently (which I believe you do in some performance monitoring situations).

Paul.
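For comparison, this is roughly what the request/reply pattern looks like today when built from write() and read() in user space. It is a stand-in for the proposed call, not an implementation of it: it costs two system calls and is not atomic, which are exactly the properties the proposal is meant to fix. The transaction_demo() helper and the use of a socketpair as the "special file" are assumptions made only to give the sketch something to run against.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <string.h>
#include <unistd.h>

/*
 * User-space stand-in for the proposed call: send the request, then
 * wait for the reply. Returns the number of reply bytes, or -1 on
 * failure (including a short write of the request).
 */
static int transaction(int fd, char *req, size_t req_nb,
                       char *reply, size_t reply_nb)
{
    ssize_t n = write(fd, req, req_nb);

    if (n < 0 || (size_t)n != req_nb)
        return -1;
    n = read(fd, reply, reply_nb);
    return (int)n;
}

/* Round trip over a socketpair; the peer's reply is queued in advance. */
static int transaction_demo(void)
{
    int sv[2];
    char req[] = "ping";
    char reply[8] = {0}, seen[8] = {0};
    int ok;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    ok = write(sv[1], "pong", 4) == 4
      && transaction(sv[0], req, 4, reply, sizeof(reply)) == 4
      && read(sv[1], seen, sizeof(seen)) == 4
      && memcmp(reply, "pong", 4) == 0
      && memcmp(seen, "ping", 4) == 0;
    close(sv[0]);
    close(sv[1]);
    return ok ? 0 : -1;
}
```

Between the write() and the read() another thread could issue its own request on the same descriptor, so replies can be mismatched; a single in-kernel call would not have that window.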
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
From: Andi Kleen <[EMAIL PROTECTED]>
Date: Wed, 14 Nov 2007 13:38:38 +0100

> At least for x86 and I suspect some other architectures we don't
> initially need a syscall at all for this. There is an instruction
> RDPMC who can read a performance counter just fine. It is also much
> faster and generally preferable for the case where a process measures
> events about itself. In fact it is essential for one of the use cases
> I would like to see perfmon used (replacement of RDTSC for cycle
> counting)

I wouldn't even want to use a syscall for something like that on Sparc, I'd rather give this a dedicated software trap so that I can code it completely in assembler.
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
> BTW, isn't rdpmc only enabled for ring 0 on Linux? I remember a patch to disable it, dunno if it has been applied.

Obviously -- without a system call to set up performance counters it would be fairly useless. But of course once such system calls are in, they should be able to set the bit for each process.

-Andi
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
On Wed, 14 Nov 2007 at 10:44 +, Will Cohen wrote:
> Andi Kleen wrote:
>
> >> One approach does not prevent the other. Assuming you allow cr4.pce, then nothing prevents a self-monitoring thread from reading the counters directly. You'll just get the lower 32-bit of it. So if you read frequently enough, you should not have a problem.
> >
> > Hmm? RDPMC is 64bit.
>
> There are a number of processors that have 32-bit counters such as the IBM power processors. On many x86 processors the upper bits of the counter are sign extended from the lower 32 bits. Thus, one can only assume the lower 32-bit are available. Roll over of values is quite possible (<2 seconds of cycle count), so additional work needs to be done to obtain a valid value.

On x86 they are sign-extended only on write; on read they are 40 bits wide for Intel, 48 bits for AMD.

BTW, isn't rdpmc only enabled for ring 0 on Linux? I remember a patch to disable it, dunno if it has been applied.

--
Phe
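To make the "additional work" concrete, here is a minimal sketch of how a reader could turn raw reads of a narrow counter into a 64-bit software total. The function names and the idea of passing the width as a parameter are illustrative assumptions, not part of perfmon2 or of any kernel interface. It only gives the right answer if the counter wraps at most once between two reads.

```c
#include <assert.h>
#include <stdint.h>

/* Mask selecting the low `width_bits` bits of a counter value (1..64). */
static uint64_t counter_mask(unsigned width_bits)
{
    assert(width_bits >= 1 && width_bits <= 64);
    return width_bits == 64 ? UINT64_MAX
                            : ((UINT64_C(1) << width_bits) - 1);
}

/*
 * Events elapsed between two raw reads of a counter that is only
 * `width_bits` wide. Unsigned subtraction followed by masking handles a
 * single wrap-around between the two reads.
 */
static uint64_t counter_delta(uint64_t prev, uint64_t now, unsigned width_bits)
{
    return (now - prev) & counter_mask(width_bits);
}

/* Fold a new raw read into a 64-bit software total (hypothetical helper). */
static void counter_accumulate(uint64_t *total, uint64_t *prev,
                               uint64_t now, unsigned width_bits)
{
    *total += counter_delta(*prev, now, width_bits);
    *prev = now & counter_mask(width_bits);
}
```

With a width of 32, 40 or 48 the same helper covers the cases mentioned in this thread; the caller still has to read often enough that no more than one wrap goes unnoticed.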
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
On Wed, Nov 14, 2007 at 10:44:20AM -0500, William Cohen wrote:
> Andi Kleen wrote:
>
> >> One approach does not prevent the other. Assuming you allow cr4.pce, then nothing prevents a self-monitoring thread from reading the counters directly. You'll just get the lower 32-bit of it. So if you read frequently enough, you should not have a problem.
> >
> > Hmm? RDPMC is 64bit.
>
> There are a number of processors that have 32-bit counters such as the IBM power processors. On many x86 processors the upper bits of the counter are sign extended from the lower 32 bits. Thus, one can only assume the lower 32-bit are available. Roll over of values is quite possible (<2 seconds of cycle count), so additional work needs to be done to obtain a valid value.
>

Exactly. On Intel, only the bottom 32 bits are actually usable; the rest is sign-extension. That's why it is okay for measuring small sections of code, but that's it. On AMD, I think it is better. On Itanium you get 47 bits' worth. Don't know about Power or Cell.

--
-Stephane
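For a feel of what "small sections of code" means, the wrap time is just 2^width divided by the event rate. The sketch below works that out; the function name and the clock rates used in the note afterwards are assumptions for illustration, not figures from this thread.

```c
/*
 * Seconds until a free-running counter of `width_bits` bits wraps when it
 * increments `events_per_sec` times per second.
 */
static double seconds_to_wrap(unsigned width_bits, double events_per_sec)
{
    double span = 1.0;

    /* span = 2^width_bits, computed without needing libm */
    for (unsigned i = 0; i < width_bits; i++)
        span *= 2.0;
    return span / events_per_sec;
}
```

Assuming a cycle counter on a hypothetical 3 GHz core, 32 usable bits wrap after roughly 1.4 seconds, which matches the "<2 seconds" quoted above, while 40 bits last about six minutes and 48 bits about 26 hours.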
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
Andi Kleen wrote:
>> One approach does not prevent the other. Assuming you allow cr4.pce, then nothing prevents a self-monitoring thread from reading the counters directly. You'll just get the lower 32-bit of it. So if you read frequently enough, you should not have a problem.
>
> Hmm? RDPMC is 64bit.

There are a number of processors that have 32-bit counters such as the IBM power processors. On many x86 processors the upper bits of the counter are sign extended from the lower 32 bits. Thus, one can only assume the lower 32-bit are available. Roll over of values is quite possible (<2 seconds of cycle count), so additional work needs to be done to obtain a valid value.

>> But keep in mind that we do want a uniform interface across all hardware and all type of sessions (self-monitoring, CPU-wide, monitoring of another thread). You don't want an interface that says on x86 you have to use rdpmc, on Itanium pfm_read_pmds() and so
>
> I disagree. Using RDPMC is essential for at least some of the things I would like to do with perfmon2. If the interface does not provide it it is useless to me at least. System calls are far too slow for cycle measurements.

What range of cycles are you interested in measuring? 100's of cycles? A couple thousand? Are you just looking at cycle counts or other events?

-Will
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
On Wed, Nov 14, 2007 at 06:13:42AM -0800, Stephane Eranian wrote:
> > At least for x86, and I suspect some other architectures, we don't initially need a syscall at all for this. There is an instruction, RDPMC, which can read a performance counter just fine. It is also much faster and generally preferable for the case where a process measures events about itself. In fact it is essential for one of the use cases I would like to see perfmon used for (replacement of RDTSC for cycle counting).
> >
>
> This only works when counting (not sampling) and only for self-monitoring.

It works for global monitoring too.

> > Later a syscall might be needed with event multiplexing, but that seems more like a far away non essential feature.
> >
> On a machine with only two generic counters such as MIPS or Intel Core 2 Duo, multiplexing offers some advantages. If NMI watchdog is enabled, then you drop to one generic counter on Core 2.

NMI watchdog is off by default now. Yes, longer term we might need multiplexing, but definitely not as a first step.

-Andi
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
On Wed, Nov 14, 2007 at 05:09:09AM -0800, Stephane Eranian wrote:
>
> Partially true. The file descriptor becomes really useful when you sample. You leverage the file descriptor to receive notifications of counter overflows and full sampling buffer. You extract notification messages via read() and you can use SIGIO, select/poll.

Hmm, ok, for the event notification we would need a nice interface. I still have my doubts that a file descriptor is the best way to do this, though.

> Are you suggesting something like: pfm_write_pmcs(fd, 0, 0x1234)?

See my example below.

>
> That would be quite expensive when you have lots of registers to set up: one syscall per register. The perfmon syscalls to read/write registers accept a vector of arguments to amortize the cost of the syscall over multiple registers (similar to poll(2)).

First, system calls are not that slow on Linux. Measure it.

>
> With many tools, registers are not just set up once. During certain measurements, data registers may be read multiple times. When you sample or multiplex at

I think you optimize the wrong thing here. There are basically two cases I see:

- Global measurement of lots of things: Things are slow anyway; the context switch overheads are large. Doing one or more system calls probably does not matter much. Most important is a clean interface.

- Exact measurement of the current process: For that you need very low latencies. Any system call is too slow. That is why CPUs have instructions like RDPMC that allow reading those registers with minimal latency in user space. The interface should support those. Also, for this case programming time does not matter too much. You just program once, then do RDPMC before the code to measure and again afterwards, and take the difference. The actual counter setup is out of the latency-critical path.

> It depends on what you are doing. Here, this was not really necessary. It was meant to show how you can program the data registers as well. Perfmon2 provides default values for all data registers. For counters, the value is guaranteed to be zero.
>
> But it is important to note that not all data registers are counters. That is the case of Itanium 2, some are just buffers. On AMD Barcelona IBS several are buffers as well, and some may need to be initialized to non zero value, i.e., the IBS sampling period.

Setting the period should be a separate call. Mixing the two together into one does not look like a nice interface.

>
> With event-based sampling, the period is expressed as the number of occurrences of an event. For instance, you can say: "take a sample every 2000 L2 cache misses". The way you express this with perfmon2 is that you program a counter to measure L2 cache misses, and then you initialize the corresponding data register (counter) to overflow after 2000 occurrences. Given that the interface guarantees all counters are 64-bit regardless of the hardware, you simply have to program the counter to -2000. Thus you see that you need a call to actually program the data registers.

I didn't object to providing the initial value -- my example had that. Just having a separate concept of data registers seems too complicated to me. You should just pass event types and values and the kernel gives you a register number.

> Perfmon2 decouples the two operations. In fact, no PMU hardware is actually touched before you attach to either a CPU or a thread. This way, you can prepare your measurement and then attach-and-go. Thus it is possible to create batches of ready-to-go sessions. That is useful, for instance, when you are trying to measure across fork, pthread_create which you can catch on-the-fly.
>
> Take the per-thread example, you can set up your session before you fork/exec the program you want to measure.

And? You didn't say what the advantage of that is. All the approaches add context switch latencies. It is not clear that the separate session setup helps all that much.

>
> Note also that perfmon2 supports attaching to an already running thread. So there is more than "GLOBAL CONTEXT" versus "MY CONTEXT".

What is the use case of this? Do users use that?

> > >     /* activate monitoring */
> > >     pfm_start(ctx_fd, NULL);
> >
> > Why can't that be done by the call setting up the register?
> >
>
> Good question. If you do what you say, you assume that the start/stop bit lives in the config (or data) registers of the PMU. This is not true on all hardware. On Itanium for instance, the start/stop bit is part of the Processor Status Register (psr). That is not a PMU register.

Well, the system call layer can manage that transparently with a little software state (counter). No need to expose it.

> One approach does not prevent the other. Assuming you allow cr4.pce, then nothing prevents a self-monitoring thread from reading the
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
Andi,

On Wed, Nov 14, 2007 at 01:38:38PM +0100, Andi Kleen wrote:
> Christoph Hellwig <[EMAIL PROTECTED]> writes:
> >
> > I've done this a gazillion times before, so maybe instead of being a lazy bastard you could look up the mailing list archive. It's not like this is the first discussion of perfmon. But to get started, look at the system calls; many of them are beasts like:
> >
> >     int pfm_read_pmds(int fd, pfarg_pmd_t *pmds, int n)
> >
> > This is basically a read(2) (or for other syscalls a write) on something
>
> At least for x86, and I suspect some other architectures, we don't initially need a syscall at all for this. There is an instruction, RDPMC, which can read a performance counter just fine. It is also much faster and generally preferable for the case where a process measures events about itself. In fact it is essential for one of the use cases I would like to see perfmon used for (replacement of RDTSC for cycle counting).
>

This only works when counting (not sampling) and only for self-monitoring.

> Later a syscall might be needed with event multiplexing, but that seems more like a far away non essential feature.
>

On a machine with only two generic counters such as MIPS or Intel Core 2 Duo, multiplexing offers some advantages. If NMI watchdog is enabled, then you drop to one generic counter on Core 2.

> > else than the file descriptor provided to the system call. The right thing
>
> I don't like read/write for this too much. I think it's better to have individual syscalls. After all that is CPU state and having syscalls for that does seem reasonable.

As I said earlier, we do use read(), not for reading counters but to extract overflow notification messages when we are sampling. It makes more sense for this usage because this is where you want to leverage some key mechanisms such as:

  - asynchronous notification via SIGIO. This is how you can implement self-sampling, for instance.
  - select/poll to allow monitoring tools to wait for notifications coming from multiple sessions in one call. This is useful when monitoring across fork or pthread_create.

--
-Stephane
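As an illustration of the select/poll point, here is a minimal sketch of a tool waiting on several session file descriptors with a single poll(2) call. It treats the descriptors as opaque readable fds and the notification message as opaque bytes; the helper names and the MAX_SESSIONS limit are made up for the example and are not perfmon2 API.

```c
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

#define MAX_SESSIONS 64   /* arbitrary limit for this sketch */

/*
 * Wait until at least one of `n` session descriptors is readable.
 * Returns the index of the first readable descriptor, or -1 on timeout,
 * error, or when n is out of range.
 */
static int wait_for_notification(const int *fds, size_t n, int timeout_ms)
{
    struct pollfd pfds[MAX_SESSIONS];

    if (n == 0 || n > MAX_SESSIONS)
        return -1;
    for (size_t i = 0; i < n; i++) {
        pfds[i].fd = fds[i];
        pfds[i].events = POLLIN;
        pfds[i].revents = 0;
    }
    if (poll(pfds, (nfds_t)n, timeout_ms) <= 0)
        return -1;
    for (size_t i = 0; i < n; i++)
        if (pfds[i].revents & POLLIN)
            return (int)i;
    return -1;
}

/* Pull one pending message off a ready descriptor (opaque bytes here). */
static ssize_t read_notification(int fd, void *buf, size_t len)
{
    return read(fd, buf, len);
}
```

A monitoring tool that follows a workload across fork would add each new session's descriptor to the array and keep calling the wait helper in a loop, reading a message from whichever descriptor comes back ready.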
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
On Wed, Nov 14, 2007 at 10:44:56PM +1100, Paul Mackerras wrote:
> David Miller writes:
>
> > This is my impression too, all of the things being done with a slew of system calls would be better served by real special files and appropriate fops.
>
> Special files and fops really only work well if you can coerce the interface into one where data flows predominantly one way. I don't think they work so well for something that is more like an RPC across the user/kernel barrier. For that a system call is better.
>
> For instance, if you have something that kind-of looks like
>
>     read_pmds(int n, int *pmd_numbers, u64 *pmd_values);
>
> where the caller supplies an array of PMD numbers and the function returns their values (and you want that reading to be done atomically in some sense), how would you do that using special files and fops?
>

Yes, the read call could be simplified to the level proposed above by Paul.

--
-Stephane
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
Hello,

On Wed, Nov 14, 2007 at 10:39:24PM +1100, Paul Mackerras wrote:
> Christoph Hellwig writes:
>
> > int pfm_read_pmds(int fd, pfarg_pmd_t *pmds, int n)
> >
> > This is basically a read(2) (or for other syscalls a write) on something else than the file descriptor provided to the system call.
>
> No it's not basically a read(). It's more like a request/reply interface, which a read()/write() interface doesn't handle very well. The request in this case is "tell me about this particular collection of PMDs" and the reply is the values.
>

Exactly. This is not a brute force read()! On input you pass the list of registers you want to read. Upon return, you get the list of values.

Now, I think the current call could be optimized even more by making the structure smaller. Today, the same structure is passed to read and to write PMD registers. On write, we pass other information such as the reset values (sampling periods), randomization parameters and some flags. They are not needed on read.

> It seems to me that an important part of this is to be able to collect values from several PMDs at a single point in time, or at least an approximation to a single point in time. So that means that you don't want a file per PMD either.
>

Yes, we want to be able to read one or many registers in one call. The number of PMU counters is not going to shrink, so having a file descriptor per register looks like overkill to me.

> Basically we don't have a good abstraction for a request/reply (or command/response) type of interface, and this is a case where we need one. Having a syscall that takes a struct containing the request and reply is as good a way as any, particularly for something that needs to be quick.
>

--
-Stephane
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
Andi,

On Wed, Nov 14, 2007 at 03:07:02AM +0100, Andi Kleen wrote:
>
> [dropped all these bouncing email lists. Adding closed lists to public cc lists is just a bad idea]
>

Just want to make sure perfmon2 users participate in this discussion.

> > int
> > main(int argc, char **argv)
> > {
> >     int ctx_fd;
> >     pfarg_pmd_t pd[1];
> >     pfarg_pmc_t pc[1];
> >     pfarg_ctx_t ctx;
> >     pfarg_load_t load_args;
> >
> >     memset(&ctx, 0, sizeof(ctx));
> >     memset(pc, 0, sizeof(pc));
> >     memset(pd, 0, sizeof(pd));
> >
> >     /* create session (context) and get file descriptor back (identifier) */
> >     ctx_fd = pfm_create_context(&ctx, NULL, NULL, 0);
>
> There's nothing in your example that makes the file descriptor needed.
>

Partially true. The file descriptor becomes really useful when you sample. You leverage the file descriptor to receive notifications of counter overflows and of a full sampling buffer. You extract notification messages via read() and you can use SIGIO, select/poll.

The example shows how you can leverage existing mechanisms to destroy the session, i.e., free the associated kernel resources. For that, you use close() instead of adding yet another syscall. It also provides a resource limitation mechanism to control consumption of kernel memory, i.e., you can only create as many sessions as you can have open files.

> >
> >     /* setup one config register (PMC0) */
> >     pc[0].reg_num = 0;
> >     pc[0].reg_value = 0x1234;
>
> That would be nicer if it was just two arguments.
>

Are you suggesting something like: pfm_write_pmcs(fd, 0, 0x1234)?

That would be quite expensive when you have lots of registers to set up: one syscall per register. The perfmon syscalls to read/write registers accept a vector of arguments to amortize the cost of the syscall over multiple registers (similar to poll(2)).

With many tools, registers are not just set up once. During certain measurements, data registers may be read multiple times.
When you sample or multiplex at the user level, you do need to reprogram the PMU state and that is on the critical path. You do not want a call that programs the entire PMU state all at once either. Many times, you only want to modify a small subset. Having the full state does also cause some portability problems.

> >
> >     /* setup one data register (PMD0) */
> >     pd[0].reg_num = 0;
> >     pd[0].reg_value = 0;
>
> Why do you need to set the data register? Wouldn't it make more sense to let the kernel handle that and just return one.
>

It depends on what you are doing. Here, this was not really necessary. It was meant to show how you can program the data registers as well. Perfmon2 provides default values for all data registers. For counters, the value is guaranteed to be zero.

But it is important to note that not all data registers are counters. That is the case on Itanium 2, where some are just buffers. On AMD Barcelona IBS, several are buffers as well, and some may need to be initialized to a non-zero value, i.e., the IBS sampling period.

With event-based sampling, the period is expressed as the number of occurrences of an event. For instance, you can say: "take a sample every 2000 L2 cache misses". The way you express this with perfmon2 is that you program a counter to measure L2 cache misses, and then you initialize the corresponding data register (counter) to overflow after 2000 occurrences. Given that the interface guarantees all counters are 64-bit regardless of the hardware, you simply have to program the counter to -2000. Thus you see that you need a call to actually program the data registers.
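The "program the counter to -2000" trick can be checked with plain unsigned arithmetic. The sketch below is a user-space simulation only; the function names are hypothetical and nothing here touches a real PMU or the perfmon2 calls. It assumes a 64-bit counter that signals an overflow when it wraps to zero and is then re-armed with the same value.

```c
#include <stdint.h>

/*
 * Initial value for a 64-bit counter that should overflow (wrap to 0)
 * after exactly `period` events: the two's-complement negation.
 */
static uint64_t sampling_reset_value(uint64_t period)
{
    return UINT64_C(0) - period;
}

/*
 * Simulate `n_events` increments of such a counter and return how many
 * samples (overflows) would be taken, re-arming after each overflow.
 */
static unsigned simulate_samples(uint64_t period, uint64_t n_events)
{
    uint64_t counter = sampling_reset_value(period);
    unsigned samples = 0;

    for (uint64_t i = 0; i < n_events; i++) {
        if (++counter == 0) {                        /* overflow */
            samples++;
            counter = sampling_reset_value(period);  /* re-arm */
        }
    }
    return samples;
}
```

For a period of 2000 the reset value is 0xFFFFFFFFFFFFF830, and 6000 simulated events yield three samples.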
> >
> >     /* program the registers */
> >     pfm_write_pmcs(ctx_fd, pc, 1);
> >     pfm_write_pmds(ctx_fd, pd, 1);
> >
> >     /* attach the context to self */
> >     load_args.load_pid = getpid();
> >     pfm_load_context(ctx_fd, &load_args);
>
> My replacement would be to just add a flags argument to write_pmcs with one flag bit meaning "GLOBAL CONTEXT" versus "MY CONTEXT"
>

You are mixing PMU programming with the type of measurement you want to do.

Perfmon2 decouples the two operations. In fact, no PMU hardware is actually touched before you attach to either a CPU or a thread. This way, you can prepare your measurement and then attach-and-go. Thus it is possible to create batches of ready-to-go sessions. That is useful, for instance, when you are trying to measure across fork or pthread_create, which you can catch on-the-fly.

Take the per-thread example: you can set up your session before you fork/exec the program you want to measure.

Note also that perfmon2 supports attaching to an already running thread. So there is more than "GLOBAL CONTEXT" versus "MY CONTEXT".

> >     /* activate monitoring */
> >     pfm_start(ctx_fd, NULL);
>
> Why can't that be done by the call setting up the register?
>

Good question. If you do what you say, you assume that the start/stop bit lives in the config (or data) registers of the
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
Christoph Hellwig <[EMAIL PROTECTED]> writes:
>
> I've done this a gazillion times before, so maybe instead of being a lazy bastard you could look up the mailing list archive. It's not like this is the first discussion of perfmon. But to get started, look at the system calls; many of them are beasts like:
>
>     int pfm_read_pmds(int fd, pfarg_pmd_t *pmds, int n)
>
> This is basically a read(2) (or for other syscalls a write) on something

At least for x86, and I suspect some other architectures, we don't initially need a syscall at all for this. There is an instruction, RDPMC, which can read a performance counter just fine. It is also much faster and generally preferable for the case where a process measures events about itself. In fact it is essential for one of the use cases I would like to see perfmon used for (replacement of RDTSC for cycle counting).

Later a syscall might be needed with event multiplexing, but that seems more like a far away non essential feature.

> else than the file descriptor provided to the system call. The right thing

I don't like read/write for this too much. I think it's better to have individual syscalls. After all that is CPU state and having syscalls for that does seem reasonable.

-Andi
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
On Wednesday 14 November 2007 23:07, David Miller wrote:
> From: Paul Mackerras <[EMAIL PROTECTED]>
> Date: Wed, 14 Nov 2007 23:03:24 +1100
>
> > You're suggesting that the behaviour of a read() should depend on what was in the buffer before the read? Gack! Surely you have better taste than that?
>
> Absolutely that's what I mean, it's atomic and gives you exactly what you need.
>
> I see nothing wrong or gross with these semantics. Nothing in the "book of UNIX" specifies that for a device or special file the passed in buffer cannot contain input control data.

True, but is it then really any different from an ioctl?
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
On Wednesday 14 November 2007 22:58, David Miller wrote:
> From: Nick Piggin <[EMAIL PROTECTED]>
> Date: Wed, 14 Nov 2007 10:49:48 +1100
>
> > On Wednesday 14 November 2007 22:44, Paul Mackerras wrote:
> > > David Miller writes:
> > > > This is my impression too, all of the things being done with a slew of system calls would be better served by real special files and appropriate fops.
> > >
> > > Special files and fops really only work well if you can coerce the interface into one where data flows predominantly one way. I don't think they work so well for something that is more like an RPC across the user/kernel barrier. For that a system call is better.
> > >
> > > For instance, if you have something that kind-of looks like
> > >
> > >     read_pmds(int n, int *pmd_numbers, u64 *pmd_values);
> > >
> > > where the caller supplies an array of PMD numbers and the function returns their values (and you want that reading to be done atomically in some sense), how would you do that using special files and fops?
> >
> > Could you implement it with readv()?
>
> Sure, why not? Just cook up an iovec. pmd_numbers goes to offset X and pmd_values goes to offset Y, with some helpers like what we have in the networking already for recvmsg.
>
> But why would you want readv() for this? The syscall thing Paul asked me to translate into a read() doesn't provide iovec-like behavior so I don't see why readv() is necessary at all.

Ah sorry, that's what I get for typing before I think: of course readv doesn't vectorise the right part of the equation.

What I really mean is a readv-like syscall, but one that also vectorises the file offset. Maybe this is useful enough as a generic syscall that also helps Paul's example...

Of course, I guess this all depends on whether the atomicity is an important requirement. If not, you can obviously just do it with multiple read syscalls...
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
From: Paul Mackerras <[EMAIL PROTECTED]>
Date: Wed, 14 Nov 2007 23:03:24 +1100

> You're suggesting that the behaviour of a read() should depend on what was in the buffer before the read? Gack! Surely you have better taste than that?

Absolutely that's what I mean, it's atomic and gives you exactly what you need.

I see nothing wrong or gross with these semantics. Nothing in the "book of UNIX" specifies that for a device or special file the passed in buffer cannot contain input control data.

> > Another alternative is to use generic netlink.
>
> Then you end up with two system calls to get the data rather than one (one to send the request and another to read the reply). For something that needs to be quick that is a suboptimal interface.

Not necessarily, consider the possibility of using recvmsg() control message data. With that it could be done in one go.

This also suggests that it could be implemented as its own protocol family.
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
David Miller writes:

> The same way we handle some of the multicast "getsockopt()" calls. The parameters passed in are both inputs and outputs.

For a read??!!!

> For the above example:
>
>     struct pmd_info {
>         int *pmd_numbers;
>         u64 *pmd_values;
>         int n;
>     } *p;
>
>     buffer_size = N;
>     p = malloc(buffer_size);
>     p->pmd_numbers = p + foo;
>     p->pmd_values = p + bar;
>     p->n = whatever(N);
>     err = read(fd, p, N);

You're suggesting that the behaviour of a read() should depend on what was in the buffer before the read? Gack! Surely you have better taste than that?

Or are you saying that a read (or write) has a side-effect of altering some other area of memory besides the buffer you give to read()? That seems even worse to me.

> Another alternative is to use generic netlink.

Then you end up with two system calls to get the data rather than one (one to send the request and another to read the reply). For something that needs to be quick that is a suboptimal interface.

Paul.
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
From: Nick Piggin <[EMAIL PROTECTED]>
Date: Wed, 14 Nov 2007 10:49:48 +1100

> On Wednesday 14 November 2007 22:44, Paul Mackerras wrote:
> > David Miller writes:
> > > This is my impression too, all of the things being done with a slew of system calls would be better served by real special files and appropriate fops.
> >
> > Special files and fops really only work well if you can coerce the interface into one where data flows predominantly one way. I don't think they work so well for something that is more like an RPC across the user/kernel barrier. For that a system call is better.
> >
> > For instance, if you have something that kind-of looks like
> >
> >     read_pmds(int n, int *pmd_numbers, u64 *pmd_values);
> >
> > where the caller supplies an array of PMD numbers and the function returns their values (and you want that reading to be done atomically in some sense), how would you do that using special files and fops?
>
> Could you implement it with readv()?

Sure, why not? Just cook up an iovec. pmd_numbers goes to offset X and pmd_values goes to offset Y, with some helpers like what we have in the networking already for recvmsg.

But why would you want readv() for this? The syscall thing Paul asked me to translate into a read() doesn't provide iovec-like behavior so I don't see why readv() is necessary at all.
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
From: Paul Mackerras <[EMAIL PROTECTED]>
Date: Wed, 14 Nov 2007 22:44:56 +1100

> For instance, if you have something that kind-of looks like
>
>     read_pmds(int n, int *pmd_numbers, u64 *pmd_values);
>
> where the caller supplies an array of PMD numbers and the function returns their values (and you want that reading to be done atomically in some sense), how would you do that using special files and fops?

The same way we handle some of the multicast "getsockopt()" calls. The parameters passed in are both inputs and outputs.

For the above example:

    struct pmd_info {
        int *pmd_numbers;
        u64 *pmd_values;
        int n;
    } *p;

    buffer_size = N;
    p = malloc(buffer_size);
    p->pmd_numbers = p + foo;
    p->pmd_values = p + bar;
    p->n = whatever(N);
    err = read(fd, p, N);

It's definitely doable, use your imagination. You can encode all kinds of operation types into the header as well.

Another alternative is to use generic netlink.
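For readers who want to see the in/out buffer idea spelled out, here is a small user-space mock of it. Everything in it is an assumption made for illustration: the struct pmd_xfer layout (fixed-size arrays instead of the pointers in the pseudocode above), the MAX_PMDS limit, the fake register file, and mock_pmd_read(), which merely stands in for what a driver's read handler might do. It is not an existing kernel interface.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_PMDS  8   /* arbitrary limit for this sketch */
#define FAKE_REGS 4

/* Hypothetical request/reply layout carried in one read() buffer. */
struct pmd_xfer {
    uint32_t n;                      /* in:  number of entries used   */
    uint32_t pmd_numbers[MAX_PMDS];  /* in:  which registers to read  */
    uint64_t pmd_values[MAX_PMDS];   /* out: filled in by the handler */
};

/* Stand-in register file so the sketch runs in user space. */
static const uint64_t fake_pmds[FAKE_REGS] = { 100, 200, 300, 400 };

/*
 * Mock of a read handler: parse the request already present in the
 * buffer, then overwrite the output part. Returns the number of bytes
 * "read", or -1 on a malformed request.
 */
static long mock_pmd_read(void *buf, size_t len)
{
    struct pmd_xfer x;

    if (len < sizeof(x))
        return -1;
    memcpy(&x, buf, sizeof(x));      /* plays the role of a copy-in  */
    if (x.n > MAX_PMDS)
        return -1;
    for (uint32_t i = 0; i < x.n; i++) {
        if (x.pmd_numbers[i] >= FAKE_REGS)
            return -1;
        x.pmd_values[i] = fake_pmds[x.pmd_numbers[i]];
    }
    memcpy(buf, &x, sizeof(x));      /* plays the role of a copy-out */
    return (long)sizeof(x);
}
```

The caller fills in n and pmd_numbers, hands the whole buffer to the read, and finds pmd_values populated on return, so the request and the reply travel in a single call.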
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
From: Paul Mackerras <[EMAIL PROTECTED]>
Date: Wed, 14 Nov 2007 22:39:24 +1100

> No it's not basically a read(). It's more like a request/reply interface, which a read()/write() interface doesn't handle very well.

Yes it can, see my other reply.
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
On Wednesday 14 November 2007 22:44, Paul Mackerras wrote: > David Miller writes: > > This is my impression too, all of the things being done with > > a slew of system calls would be better served by real special > > files and appropriate fops. > > Special files and fops really only work well if you can coerce the > interface into one where data flows predominantly one way. I don't > think they work so well for something that is more like an RPC across > the user/kernel barrier. For that a system call is better. > > For instance, if you have something that kind-of looks like > > read_pmds(int n, int *pmd_numbers, u64 *pmd_values); > > where the caller supplies an array of PMD numbers and the function > returns their values (and you want that reading to be done atomically > in some sense), how would you do that using special files and fops? Could you implement it with readv()?
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
Christoph Hellwig writes: > int pfm_read_pmds(int fd, pfarg_pmd_t *pmds, int n) > > This is basically a read(2) (or for other syscalls a write) on something > else than the file descriptor provided to the system call. No it's not basically a read(). It's more like a request/reply interface, which a read()/write() interface doesn't handle very well. The request in this case is "tell me about this particular collection of PMDs" and the reply is the values. It seems to me that an important part of this is to be able to collect values from several PMDs at a single point in time, or at least an approximation to a single point in time. So that means that you don't want a file per PMD either. Basically we don't have a good abstraction for a request/reply (or command/response) type of interface, and this is a case where we need one. Having a syscall that takes a struct containing the request and reply is as good a way as any, particularly for something that needs to be quick. Paul.
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
David Miller writes: > This is my impression too, all of the things being done with > a slew of system calls would be better served by real special > files and appropriate fops. Special files and fops really only work well if you can coerce the interface into one where data flows predominantly one way. I don't think they work so well for something that is more like an RPC across the user/kernel barrier. For that a system call is better. For instance, if you have something that kind-of looks like read_pmds(int n, int *pmd_numbers, u64 *pmd_values); where the caller supplies an array of PMD numbers and the function returns their values (and you want that reading to be done atomically in some sense), how would you do that using special files and fops? > Whether the thing is some kind > of misc device or procfs is less important than simply getting > away from these system calls. Why? What's inherently offensive about system calls? Paul.
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
Ok, I just got 4 freakin' bounces from all of these subscriber only perfmon etc. mailing lists. Please remove those lists from the CC: as it's pointless for those of us not on the lists to participate if those lists can't even see the feedback we are giving.
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
From: Christoph Hellwig <[EMAIL PROTECTED]> Date: Wed, 14 Nov 2007 11:00:09 + > I've done this a gazillion times before, so maybe instead of being a lazy > bastard you could look up the mailing list archive. It's not like this is the > first discussion of perfmon. But to get started, look at the system calls, > many of them are beasts like: > > int pfm_read_pmds(int fd, pfarg_pmd_t *pmds, int n) > > This is basically a read(2) (or for other syscalls a write) on something > else than the file descriptor provided to the system call. The right thing > to do is obviously have a pmds and pmcs file in procfs for the thread being > monitored instead of these special-case files, with another set for global > tracing. Similarly I'm pretty sure we can get a much better interface > if we introduce matching files in procfs for the other calls. This is my impression too, all of the things being done with a slew of system calls would be better served by real special files and appropriate fops. Whether the thing is some kind of misc device or procfs is less important than simply getting away from these system calls.
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
On Wed, Nov 14, 2007 at 09:43:02PM +1100, Paul Mackerras wrote: > Christoph Hellwig writes: > > > Mine for example. The whole userspace interface is just on crack, > > and the code is full of complexities as well. > > Could you give some _technical_ details of what you don't like? I've done this a gazillion times before, so maybe instead of being a lazy bastard you could look up the mailing list archive. It's not like this is the first discussion of perfmon. But to get started, look at the system calls, many of them are beasts like: int pfm_read_pmds(int fd, pfarg_pmd_t *pmds, int n) This is basically a read(2) (or for other syscalls a write) on something else than the file descriptor provided to the system call. The right thing to do is obviously have a pmds and pmcs file in procfs for the thread being monitored instead of these special-case files, with another set for global tracing. Similarly I'm pretty sure we can get a much better interface if we introduce matching files in procfs for the other calls.
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
Christoph Hellwig writes: > Mine for example. The whole userspace interface is just on crack, > and the code is full of complexities as well. Could you give some _technical_ details of what you don't like? Paul.
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
On Wed, Nov 14, 2007 at 06:24:36PM +1100, Paul Mackerras wrote: > Whose sentiment? Mine for example. The whole userspace interface is just on crack, and the code is full of complexities as well.
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
From: Paul Mackerras <[EMAIL PROTECTED]> Date: Wed, 14 Nov 2007 23:03:24 +1100 > You're suggesting that the behaviour of a read() should depend on what > was in the buffer before the read? Gack! Surely you have better taste > than that? Absolutely that's what I mean, it's atomic and gives you exactly what you need. I see nothing wrong or gross with these semantics. Nothing in the book of UNIX specifies that for a device or special file the passed in buffer cannot contain input control data. > > Another alternative is to use generic netlink. > > Then you end up with two system calls to get the data rather than one > (one to send the request and another to read the reply). For > something that needs to be quick that is a suboptimal interface. Not necessarily, consider the possibility of using recvmsg() control message data. With that it could be done in one go. This also suggests that it could be implemented as its own protocol family.
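[Editor's sketch] To make the control-message suggestion above concrete, here is how a list of PMD numbers could be packed into, and parsed out of, an ancillary-data buffer with the standard CMSG_* macros. It is a sketch under assumptions: SOL_PMU_SKETCH and PMU_REQ_NUMBERS are made-up level/type values, no socket is opened, and no existing protocol family is known to accept input this way on recvmsg(); that behaviour is what the message proposes.

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

#define SOL_PMU_SKETCH  0x7fff  /* hypothetical cmsg level */
#define PMU_REQ_NUMBERS 1       /* hypothetical cmsg type */

/*
 * Pack n register numbers as a single control message into ctl, which must
 * be suitably aligned for struct cmsghdr. Returns the number of control
 * bytes used, or 0 if ctl is too small.
 */
size_t pack_pmd_request(void *ctl, size_t ctl_len, const uint32_t *nums, size_t n)
{
    struct msghdr msg;
    struct cmsghdr *cm;
    size_t payload = n * sizeof(uint32_t);

    if (ctl_len < CMSG_SPACE(payload))
        return 0;

    memset(&msg, 0, sizeof(msg));
    memset(ctl, 0, ctl_len);
    msg.msg_control = ctl;
    msg.msg_controllen = ctl_len;

    cm = CMSG_FIRSTHDR(&msg);
    cm->cmsg_level = SOL_PMU_SKETCH;
    cm->cmsg_type = PMU_REQ_NUMBERS;
    cm->cmsg_len = CMSG_LEN(payload);
    memcpy(CMSG_DATA(cm), nums, payload);
    return CMSG_SPACE(payload);
}

/*
 * Walk the control buffer and copy out the register numbers of the first
 * matching message. Returns how many numbers were copied (at most max).
 */
size_t unpack_pmd_request(void *ctl, size_t ctl_len, uint32_t *nums, size_t max)
{
    struct msghdr msg;
    struct cmsghdr *cm;

    memset(&msg, 0, sizeof(msg));
    msg.msg_control = ctl;
    msg.msg_controllen = ctl_len;

    for (cm = CMSG_FIRSTHDR(&msg); cm != NULL; cm = CMSG_NXTHDR(&msg, cm)) {
        size_t n;

        if (cm->cmsg_level != SOL_PMU_SKETCH || cm->cmsg_type != PMU_REQ_NUMBERS)
            continue;
        n = (cm->cmsg_len - CMSG_LEN(0)) / sizeof(uint32_t);
        if (n > max)
            n = max;
        memcpy(nums, CMSG_DATA(cm), n * sizeof(uint32_t));
        return n;
    }
    return 0;
}
```

In a real design the packed buffer would be attached as msg_control of the struct msghdr handed to recvmsg(), with the values coming back through msg_iov, so that request and reply travel in one system call.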
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
David Miller writes: > The same way we handle some of the multicast "getsockopt()" calls. > The parameters passed in are both inputs and outputs. For a read??!!! > For the above example: > > struct pmd_info { > int *pmd_numbers; > u64 *pmd_values; > int n; > } *p; > > buffer_size = N; > p = malloc(buffer_size); > p->pmd_numbers = p + foo; > p->pmd_values = p + bar; > p->n = whatever(N); > err = read(fd, p, N); You're suggesting that the behaviour of a read() should depend on what was in the buffer before the read? Gack! Surely you have better taste than that? Or are you saying that a read (or write) has a side-effect of altering some other area of memory besides the buffer you give to read()? That seems even worse to me. > Another alternative is to use generic netlink. Then you end up with two system calls to get the data rather than one (one to send the request and another to read the reply). For something that needs to be quick that is a suboptimal interface. Paul.
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
On Wednesday 14 November 2007 22:58, David Miller wrote: > From: Nick Piggin <[EMAIL PROTECTED]> > Date: Wed, 14 Nov 2007 10:49:48 +1100 > > > On Wednesday 14 November 2007 22:44, Paul Mackerras wrote: > > > David Miller writes: > > > > This is my impression too, all of the things being done with > > > > a slew of system calls would be better served by real special > > > > files and appropriate fops. > > > > > > Special files and fops really only work well if you can coerce the > > > interface into one where data flows predominantly one way. I don't > > > think they work so well for something that is more like an RPC across > > > the user/kernel barrier. For that a system call is better. > > > > > > For instance, if you have something that kind-of looks like > > > > > > read_pmds(int n, int *pmd_numbers, u64 *pmd_values); > > > > > > where the caller supplies an array of PMD numbers and the function > > > returns their values (and you want that reading to be done atomically > > > in some sense), how would you do that using special files and fops? > > > > Could you implement it with readv()? > > Sure, why not? Just cook up an iovec. pmd_numbers goes to offset X and pmd_values goes to offset Y, with some helpers like what we have in the networking already for recvmsg. > > But why would you want readv() for this? The syscall thing Paul asked me to translate into a read() doesn't provide iovec-like behavior so I don't see why readv() is necessary at all. Ah sorry, that's what I get for typing before I think: of course readv doesn't vectorise the right part of the equation. What I really mean is a readv-like syscall, but one that also vectorises the file offset. Maybe this is useful enough as a generic syscall that also helps Paul's example... Of course, I guess this all depends on whether the atomicity is an important requirement. If not, you can obviously just do it with multiple read syscalls...
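[Editor's sketch] The "readv-like syscall that also vectorises the file offset" described above has no counterpart named in the thread, so the following only emulates its shape in user space with a loop of pread() calls. struct off_iovec and readv_at_offsets() are hypothetical names, and because each element is a separate system call the emulation is not atomic, which is the caveat raised at the end of the message.

```c
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical vector element: like struct iovec, plus a per-element file offset. */
struct off_iovec {
    off_t  off;   /* file offset to read from */
    void  *base;  /* destination buffer */
    size_t len;   /* number of bytes wanted */
};

/*
 * Emulate an offset-vectored read with one pread() per element.
 * Returns the total number of bytes read, or -1 on error.
 * NOT atomic: the file may change between the individual pread() calls.
 */
ssize_t readv_at_offsets(int fd, const struct off_iovec *vec, int cnt)
{
    ssize_t total = 0;
    int i;

    for (i = 0; i < cnt; i++) {
        ssize_t got = pread(fd, vec[i].base, vec[i].len, vec[i].off);

        if (got < 0)
            return -1;
        total += got;
        if ((size_t)got < vec[i].len)
            break;  /* short read: stop, as readv() would */
    }
    return total;
}
```

With a file that exposes one 64-bit value per register, reading PMD 3 and PMD 1 becomes a two-element vector with offsets 3 * 8 and 1 * 8. A real syscall of this shape could take the snapshot in one kernel entry, which the loop above cannot.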
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
On Wednesday 14 November 2007 23:07, David Miller wrote: > From: Paul Mackerras <[EMAIL PROTECTED]> > Date: Wed, 14 Nov 2007 23:03:24 +1100 > > > You're suggesting that the behaviour of a read() should depend on what > > was in the buffer before the read? Gack! Surely you have better taste > > than that? > > Absolutely that's what I mean, it's atomic and gives you exactly what you need. > > I see nothing wrong or gross with these semantics. Nothing in the book of UNIX specifies that for a device or special file the passed in buffer cannot contain input control data. True, but is it now any different from an ioctl?
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
Christoph Hellwig <[EMAIL PROTECTED]> writes: > I've done this a gazillion times before, so maybe instead of being a lazy > bastard you could look up the mailing list archive. It's not like this is the > first discussion of perfmon. But to get started, look at the system calls, > many of them are beasts like: > > int pfm_read_pmds(int fd, pfarg_pmd_t *pmds, int n) > > This is basically a read(2) (or for other syscalls a write) on something At least for x86, and I suspect some other architectures, we don't initially need a syscall at all for this. There is an instruction, RDPMC, that can read a performance counter just fine. It is also much faster and generally preferable for the case where a process measures events about itself. In fact it is essential for one of the use cases I would like to see perfmon used for (replacement of RDTSC for cycle counting). Later a syscall might be needed with event multiplexing, but that seems more like a far away non essential feature. > else than the file descriptor provided to the system call. The right thing I don't like read/write for this too much. I think it's better to have individual syscalls. After all that is CPU state and having syscalls for that does seem reasonable. -Andi
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
Andi, On Wed, Nov 14, 2007 at 03:07:02AM +0100, Andi Kleen wrote: > [dropped all these bouncing email lists. Adding closed lists to public > cc lists is just a bad idea] Just want to make sure perfmon2 users participate in this discussion. > > int main(int argc, char **argv) > > { > > int ctx_fd; > > pfarg_pmd_t pd[1]; > > pfarg_pmc_t pc[1]; > > pfarg_ctx_t ctx; > > pfarg_load_t load_args; > > > > memset(&ctx, 0, sizeof(ctx)); > > memset(pc, 0, sizeof(pc)); > > memset(pd, 0, sizeof(pd)); > > > > /* create session (context) and get file descriptor back (identifier) */ > > ctx_fd = pfm_create_context(&ctx, NULL, NULL, 0); > > There's nothing in your example that makes the file descriptor needed. Partially true. The file descriptor becomes really useful when you sample. You leverage the file descriptor to receive notifications of counter overflows and full sampling buffer. You extract notification messages via read() and you can use SIGIO, select/poll. The example shows how you can leverage existing mechanisms to destroy the session, i.e., free the associated kernel resources. For that, you use close() instead of adding yet another syscall. It also provides a resource limitation mechanism to control consumption of kernel memory, i.e., you can only create as many sessions as you can have open files. > > /* setup one config register (PMC0) */ > > pc[0].reg_num = 0; > > pc[0].reg_value = 0x1234; > > That would be nicer if it was just two arguments. Are you suggesting something like: pfm_write_pmcs(fd, 0, 0x1234)? That would be quite expensive when you have lots of registers to setup: one syscall per register. The perfmon syscalls to read/write registers accept a vector of arguments to amortize the cost of the syscall over multiple registers (similar to poll(2)). With many tools, registers are not just setup once. During certain measurements, data registers may be read multiple times. When you sample or multiplex at the user level, you do need to reprogram the PMU state and that is on the critical path. 
You do not want a call that programs the entire PMU state all at once either. Many times, you only want to modify a small subset. Having the full state does also cause some portability problems. > > /* setup one data register (PMD0) */ > > pd[0].reg_num = 0; > > pd[0].reg_value = 0; > > Why do you need to set the data register? Wouldn't it make more sense > to let the kernel handle that and just return one. It depends on what you are doing. Here, this was not really necessary. It was meant to show how you can program the data registers as well. Perfmon2 provides default values for all data registers. For counters, the value is guaranteed to be zero. But it is important to note that not all data registers are counters. That is the case of Itanium 2, some are just buffers. On AMD Barcelona IBS several are buffers as well, and some may need to be initialized to a non zero value, i.e., the IBS sampling period. With event-based sampling, the period is expressed as the number of occurrences of an event. For instance, you can say: take a sample every 2000 L2 cache misses. The way you express this with perfmon2 is that you program a counter to measure L2 cache misses, and then you initialize the corresponding data register (counter) to overflow after 2000 occurrences. Given that the interface guarantees all counters are 64-bit regardless of the hardware, you simply have to program the counter to -2000. Thus you see that you need a call to actually program the data registers. > > /* program the registers */ > > pfm_write_pmcs(ctx_fd, pc, 1); > > pfm_write_pmds(ctx_fd, pd, 1); > > > > /* attach the context to self */ > > load_args.load_pid = getpid(); > > pfm_load_context(ctx_fd, &load_args); > > My replacement would be to just add a flags argument to write_pmcs > with one flag bit meaning GLOBAL CONTEXT versus MY CONTEXT You are mixing PMU programming with the type of measurement you want to do. Perfmon2 decouples the two operations. 
In fact, no PMU hardware is actually touched before you attach to either a CPU or a thread. 
This way, you can prepare your measurement and then attach-and-go. Thus it is possible to create batches of ready-to-go sessions. That is useful, for instance, when you are trying to measure across fork, pthread_create which you can catch on-the-fly. Take the per-thread example, you can setup your session before you fork/exec the program you want to measure. Note also that perfmon2 supports attaching to an already running thread. So there is more than GLOBAL CONTEXT versus MY CONTEXT. > > /* activate monitoring */ > > pfm_start(ctx_fd, NULL); > > Why can't that be done by the call setting up the register? Good question. If you do what you say, you assume that the start/stop bit lives in the config (or data) registers of the PMU. This is not true on all hardware. On Itanium for instance, the start/stop bit is
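[Editor's sketch] The sampling-period arithmetic in the message above ("program the counter to -2000") can be written out as follows. The sketch assumes, as the message states, a 64-bit counter that counts up and signals overflow when it wraps past 2^64; pmd_reset_value() and events_until_overflow() are hypothetical helper names, not perfmon2 functions.

```c
#include <stdint.h>

/*
 * Value to load into a 64-bit up-counter so that it wraps to zero after
 * exactly `period` events. In unsigned arithmetic, 0 - period equals
 * 2^64 - period, which is the two's-complement encoding of -period.
 */
uint64_t pmd_reset_value(uint64_t period)
{
    return (uint64_t)0 - period;
}

/* Number of events left before a counter holding `counter` wraps to zero. */
uint64_t events_until_overflow(uint64_t counter)
{
    return (uint64_t)0 - counter;
}
```

For a period of 2000 the reset value is 0xFFFFFFFFFFFFF830. Adding 2000 events to it yields 0 modulo 2^64, which is the overflow that triggers the sample.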
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
Hello, On Wed, Nov 14, 2007 at 10:39:24PM +1100, Paul Mackerras wrote: > Christoph Hellwig writes: > > > int pfm_read_pmds(int fd, pfarg_pmd_t *pmds, int n) > > > > This is basically a read(2) (or for other syscalls a write) on something > > else than the file descriptor provided to the system call. > > No it's not basically a read(). It's more like a request/reply > interface, which a read()/write() interface doesn't handle very well. > The request in this case is "tell me about this particular collection > of PMDs" and the reply is the values. Exactly. This is not a brute force read()! On input you pass the list of registers you want to read. Upon return, you get the list of values. Now, I think the current call could be optimized even more by making the structure smaller. Today, the same structure is passed to read and to write PMD registers. On write, we pass other information such as the reset values (sampling periods), randomization parameters and some flags. They are not needed on read. > It seems to me that an important part of this is to be able to collect > values from several PMDs at a single point in time, or at least an > approximation to a single point in time. So that means that you don't > want a file per PMD either. Yes, we want to be able to read one or many registers in one call. The number of PMU counters is not going to shrink, so having a file descriptor per register looks like overkill to me. > Basically we don't have a good abstraction for a request/reply (or > command/response) type of interface, and this is a case where we need > one. Having a syscall that takes a struct containing the request and > reply is as good a way as any, particularly for something that needs > to be quick. -- -Stephane
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
On Wed, Nov 14, 2007 at 10:44:56PM +1100, Paul Mackerras wrote:
> David Miller writes:
>
>> This is my impression too, all of the things being done with a slew of system calls would be better served by real special files and appropriate fops.
>
> Special files and fops really only work well if you can coerce the interface into one where data flows predominantly one way. I don't think they work so well for something that is more like an RPC across the user/kernel barrier. For that a system call is better.
>
> For instance, if you have something that kind-of looks like
>
>     read_pmds(int n, int *pmd_numbers, u64 *pmd_values);
>
> where the caller supplies an array of PMD numbers and the function returns their values (and you want that reading to be done atomically in some sense), how would you do that using special files and fops?

Yes, the read call could be simplified to the level proposed above by Paul.

--
-Stephane
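[Editorial sketch] The request/reply shape of the simplified read_pmds() call proposed above can be illustrated in user space. This is only a mock under stated assumptions: NUM_PMDS and the fake_pmds array are hypothetical stand-ins for real PMU state, uint64_t replaces the kernel's u64, and nothing here is perfmon2 code. It shows the caller supplying register numbers as the request and receiving the values as the reply in a single call.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_PMDS 8  /* hypothetical number of data registers */

/* Simulated PMD contents (assumption); a real kernel would read the
 * hardware or its saved per-context state instead. */
static uint64_t fake_pmds[NUM_PMDS] = { 10, 11, 12, 13, 14, 15, 16, 17 };

/*
 * Request/reply in one call: pmd_numbers[] is the request and
 * pmd_values[] receives the reply. Returns 0 on success, or -1 if the
 * arguments are invalid or any register number is out of range, in
 * which case pmd_values[] is left untouched.
 */
int read_pmds(int n, const int *pmd_numbers, uint64_t *pmd_values)
{
    int i;

    if (n < 0 || pmd_numbers == NULL || pmd_values == NULL)
        return -1;

    /* Validate the whole request first so the reply is all-or-nothing. */
    for (i = 0; i < n; i++)
        if (pmd_numbers[i] < 0 || pmd_numbers[i] >= NUM_PMDS)
            return -1;

    /* Copy out every requested register in one pass. */
    for (i = 0; i < n; i++)
        pmd_values[i] = fake_pmds[pmd_numbers[i]];

    return 0;
}
```

A caller would fill an int array with the register numbers of interest, pass a uint64_t array of the same length, and check the return value before using the reply. Validating before copying is one way to approximate the "atomic in some sense" requirement in this mock; a kernel implementation would have to settle that separately.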
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
Andi,

On Wed, Nov 14, 2007 at 01:38:38PM +0100, Andi Kleen wrote:
> Christoph Hellwig [EMAIL PROTECTED] writes:
>
>> I've done this a gazillion times before, so maybe instead of being a lazy bastard you could look up the mailing list archive. It's not like this is the first discussion of perfmon. But to get started look at the system calls, many of them are beasts like:
>>
>> int pfm_read_pmds(int fd, pfarg_pmd_t *pmds, int n)
>>
>> This is basically a read(2) (or for other syscalls a write) on something
>
> At least for x86 and I suspect some other architectures we don't initially need a syscall at all for this. There is an instruction, RDPMC, which can read a performance counter just fine. It is also much faster and generally preferable for the case where a process measures events about itself. In fact it is essential for one of the use cases I would like to see perfmon used for (replacement of RDTSC for cycle counting).

This only works when counting (not sampling) and only for self-monitoring.

> Later a syscall might be needed with event multiplexing, but that seems more like a far away non essential feature.

On a machine with only two generic counters such as MIPS or Intel Core 2 Duo, multiplexing offers some advantages. If the NMI watchdog is enabled, then you drop to one generic counter on Core 2.

>> else than the file descriptor provided to the system call. The right thing
>
> I don't like read/write for this too much. I think it's better to have individual syscalls. After all that is CPU state and having syscalls for that does seem reasonable.

As I said earlier, we do use read(), not for reading counters but to extract overflow notification messages when we are sampling. It makes more sense for this usage because this is where you want to leverage some key mechanisms such as:

- asynchronous notification via SIGIO. This is how you can implement self-sampling, for instance.
- select/poll, to allow monitoring tools to wait for notifications coming from multiple sessions in one call. This is useful when monitoring across fork or pthread_create.

--
-Stephane
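[Editorial sketch] The select/poll point above, waiting on several monitoring sessions in one call, can be shown with plain POSIX poll(2). The helper name wait_any_readable and the MAX_SESSIONS limit are made up for illustration, and the descriptors could be any readable file descriptors (pipes work for a quick trial). Whether perfmon2 session descriptors behave exactly like this is taken from the message above, not verified here.

```c
#include <poll.h>
#include <unistd.h>

#define MAX_SESSIONS 16  /* hypothetical upper bound for this sketch */

/*
 * Wait up to timeout_ms for any of the n descriptors to become readable.
 * Returns the index of the first readable descriptor, or -1 on timeout,
 * on error, or if n is out of range.
 */
int wait_any_readable(const int *fds, int n, int timeout_ms)
{
    struct pollfd pfds[MAX_SESSIONS];
    int i;

    if (n <= 0 || n > MAX_SESSIONS)
        return -1;

    for (i = 0; i < n; i++) {
        pfds[i].fd = fds[i];
        pfds[i].events = POLLIN;   /* interested in "data to read" only */
        pfds[i].revents = 0;
    }

    /* One system call covers every session. */
    if (poll(pfds, (nfds_t)n, timeout_ms) <= 0)
        return -1;

    for (i = 0; i < n; i++)
        if (pfds[i].revents & POLLIN)
            return i;

    return -1;
}
```

A monitoring tool following this pattern would call the helper in a loop and then read() the notification message from the descriptor whose index is returned.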
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
On Wed, Nov 14, 2007 at 05:09:09AM -0800, Stephane Eranian wrote:
> Partially true. The file descriptor becomes really useful when you sample. You leverage the file descriptor to receive notifications of counter overflows and full sampling buffer. You extract notification messages via read() and you can use SIGIO, select/poll.

Hmm, ok for the event notification we would need a nice interface. Still have my doubts a file descriptor is the best way to do this though.

> Are you suggesting something like: pfm_write_pmcs(fd, 0, 0x1234)?

See my example below.

> That would be quite expensive when you have lots of registers to set up: one syscall per register. The perfmon syscalls to read/write registers accept a vector of arguments to amortize the cost of the syscall over multiple registers (similar to poll(2)).

First, system calls are not that slow on Linux. Measure it.

> With many tools, registers are not just set up once. During certain measurements, data registers may be read multiple times. When you sample or multiplex at

I think you optimize the wrong thing here. There are basically two cases I see:

- Global measurement of lots of things: things are slow anyway, with large context switch overheads. The overheads are large anyway. Doing one or more system calls probably does not matter much. Most important is a clean interface.
- Exact measurement of the current process: for that you need very low latencies. Any system call is too slow. That is why CPUs have instructions like RDPMC that allow reading those registers with minimal latency in user space. The interface should support those.

Also for this case programming time does not matter too much. You just program once and then do RDPMC before the code to measure and then afterwards, and take the difference. The actual counter setup is out of the latency critical path.

> It depends on what you are doing. Here, this was not really necessary. It was meant to show how you can program the data registers as well. Perfmon2 provides default values for all data registers. For counters, the value is guaranteed to be zero. But it is important to note that not all data registers are counters. That is the case on Itanium 2, where some are just buffers. On AMD Barcelona IBS several are buffers as well, and some may need to be initialized to a non-zero value, e.g., the IBS sampling period.

Setting the period should be a separate call. Mixing the two together into one does not look like a nice interface.

> With event-based sampling, the period is expressed as the number of occurrences of an event. For instance, you can say: take a sample every 2000 L2 cache misses. The way you express this with perfmon2 is that you program a counter to measure L2 cache misses, and then you initialize the corresponding data register (counter) to overflow after 2000 occurrences. Given that the interface guarantees all counters are 64-bit regardless of the hardware, you simply have to program the counter to -2000. Thus you see that you need a call to actually program the data registers.

I didn't object to providing the initial value -- my example had that. Just having a separate concept of data registers seems too complicated to me. You should just pass event types and values and the kernel gives you a register number.

> Perfmon2 decouples the two operations. In fact, no PMU hardware is actually touched before you attach to either a CPU or a thread. This way, you can prepare your measurement and then attach-and-go. Thus it is possible to create batches of ready-to-go sessions. That is useful, for instance, when you are trying to measure across fork or pthread_create, which you can catch on-the-fly. Take the per-thread example: you can set up your session before you fork/exec the program you want to measure.

And? You didn't say what the advantage of that is. All the approaches add context switch latencies. It is not clear that the separate session setup helps it all that much.

> Note also that perfmon2 supports attaching to an already running thread. So there is more than GLOBAL CONTEXT versus MY CONTEXT.

What is the use case of this? Do users use that?

>>> /* activate monitoring */
>>> pfm_start(ctx_fd, NULL);
>>
>> Why can't that be done by the call setting up the register?
>
> Good question. If you do what you say, you assume that the start/stop bit lives in the config (or data) registers of the PMU. This is not true on all hardware. On Itanium for instance, the start/stop bit is part of the Processor Status Register (psr). That is not a PMU register.

Well the system call layer can manage that transparently with a little software state (counter). No need to expose it.

> One approach does not prevent the other. Assuming you allow cr4.pce, then nothing prevents a self-monitoring thread from reading the counters directly. You'll just get the lower 32-bit of it. So if you read frequently enough,
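[Editorial sketch] The "program the counter to -2000" remark in the message above relies on unsigned wraparound: a 64-bit up-counter loaded with the two's complement of the period reaches zero, i.e. overflows, after exactly that many events. The small simulation below illustrates the arithmetic only. The function names are invented for this note and no PMU or perfmon2 call is involved.

```c
#include <stdint.h>

/*
 * Initial value for a 64-bit up-counter so that it wraps to zero
 * (overflows) after exactly `period` events. This is the same bit
 * pattern as (uint64_t)-period. The period must be non-zero.
 */
uint64_t period_to_counter(uint64_t period)
{
    return (uint64_t)0 - period;
}

/*
 * Simulate events hitting the counter one at a time and return how
 * many were needed before it wrapped to zero. Intended for small
 * periods only, since it really iterates once per event.
 */
uint64_t events_until_overflow(uint64_t counter)
{
    uint64_t events = 0;

    do {
        counter++;
        events++;
    } while (counter != 0);

    return events;
}
```

With a period of 2000, period_to_counter() yields 2^64 - 2000, and the simulated counter wraps after 2000 increments, matching the "take a sample every 2000 L2 cache misses" example.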
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
On Wed, Nov 14, 2007 at 06:13:42AM -0800, Stephane Eranian wrote:
>> At least for x86 and I suspect some other architectures we don't initially need a syscall at all for this. There is an instruction, RDPMC, which can read a performance counter just fine. It is also much faster and generally preferable for the case where a process measures events about itself. In fact it is essential for one of the use cases I would like to see perfmon used for (replacement of RDTSC for cycle counting).
>
> This only works when counting (not sampling) and only for self-monitoring.

It works for global monitoring too.

>> Later a syscall might be needed with event multiplexing, but that seems more like a far away non essential feature.
>
> On a machine with only two generic counters such as MIPS or Intel Core 2 Duo, multiplexing offers some advantages. If the NMI watchdog is enabled, then you drop to one generic counter on Core 2.

NMI watchdog is off by default now. Yes, longer term we might need multiplexing, but definitely not as a first step.

-Andi
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
Andi Kleen wrote:
>> One approach does not prevent the other. Assuming you allow cr4.pce, then nothing prevents a self-monitoring thread from reading the counters directly. You'll just get the lower 32-bit of it. So if you read frequently enough, you should not have a problem.
>
> Hmm? RDPMC is 64bit.

There are a number of processors that have 32-bit counters, such as the IBM Power processors. On many x86 processors the upper bits of the counter are sign extended from the lower 32 bits. Thus, one can only assume the lower 32 bits are available. Rollover of values is quite possible (2 seconds of cycle count), so additional work needs to be done to obtain a valid value.

>> But keep in mind that we do want a uniform interface across all hardware and all types of sessions (self-monitoring, CPU-wide, monitoring of another thread). You don't want an interface that says on x86 you have to use rdpmc, on Itanium pfm_read_pmds() and so
>
> I disagree. Using RDPMC is essential for at least some of the things I would like to do with perfmon2. If the interface does not provide it, it is useless to me at least. System calls are far too slow for cycle measurements.

What range of cycles are you interested in measuring? Hundreds of cycles? A couple thousand? Are you just looking at cycle counts or other events?

-Will
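[Editorial sketch] The "additional work" needed when only the lower 32 bits of a counter can be trusted is commonly handled by accumulating modulo-2^32 deltas into a 64-bit total. The code below is one such approach with hypothetical names (counter32_ext and friends). It is not taken from perfmon2 and does not itself execute RDPMC. It is only correct if fewer than 2^32 events elapse between two readings, which is why the thread stresses reading frequently enough. The seconds_to_wrap() helper reproduces the "2 seconds" figure under the assumption of a 2 GHz cycle counter, a clock rate the thread does not state.

```c
#include <stdint.h>

/* Extends a wrapping 32-bit counter into a 64-bit running total. */
struct counter32_ext {
    uint32_t last;   /* last raw 32-bit value seen */
    uint64_t total;  /* events accumulated since counter32_init() */
};

void counter32_init(struct counter32_ext *c, uint32_t raw)
{
    c->last = raw;
    c->total = 0;
}

/*
 * Fold a new raw reading into the total. The subtraction is done
 * modulo 2^32, so the delta stays correct across a wrap as long as
 * fewer than 2^32 events happened since the previous reading.
 */
uint64_t counter32_update(struct counter32_ext *c, uint32_t raw)
{
    uint32_t delta = raw - c->last;

    c->last = raw;
    c->total += delta;
    return c->total;
}

/* Seconds until a counter of `bits` bits (bits < 64) wraps at `hz`
 * events per second. */
double seconds_to_wrap(unsigned bits, double hz)
{
    return (double)(1ULL << bits) / hz;
}
```

For example, a reading of 0xFFFFFF00 followed by 0x00000100 yields a delta of 0x200 (512) even though the raw value went down, and seconds_to_wrap(32, 2e9) is a little over 2.1 seconds.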
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
On Wed, Nov 14, 2007 at 10:44:20AM -0500, William Cohen wrote:
> Andi Kleen wrote:
>>> One approach does not prevent the other. Assuming you allow cr4.pce, then nothing prevents a self-monitoring thread from reading the counters directly. You'll just get the lower 32-bit of it. So if you read frequently enough, you should not have a problem.
>>
>> Hmm? RDPMC is 64bit.
>
> There are a number of processors that have 32-bit counters, such as the IBM Power processors. On many x86 processors the upper bits of the counter are sign extended from the lower 32 bits. Thus, one can only assume the lower 32 bits are available. Rollover of values is quite possible (2 seconds of cycle count), so additional work needs to be done to obtain a valid value.

Exactly, on Intel only the bottom 32 bits are actually usable, the rest is sign-extension. That's why it is okay for measuring small sections of code, but that's it. On AMD, I think it is better. On Itanium you get 47 bits' worth. Don't know about Power or Cell.

--
-Stephane
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
On Wed, 14 Nov 2007 at 10:44 +, Will Cohen wrote:
> Andi Kleen wrote:
>>> One approach does not prevent the other. Assuming you allow cr4.pce, then nothing prevents a self-monitoring thread from reading the counters directly. You'll just get the lower 32-bit of it. So if you read frequently enough, you should not have a problem.
>>
>> Hmm? RDPMC is 64bit.
>
> There are a number of processors that have 32-bit counters, such as the IBM Power processors. On many x86 processors the upper bits of the counter are sign extended from the lower 32 bits. Thus, one can only assume the lower 32 bits are available. Rollover of values is quite possible (2 seconds of cycle count), so additional work needs to be done to obtain a valid value.

On x86 they are sign-extended only on write; on read they are 40 bits wide for Intel, 48 bits for AMD.

BTW, isn't rdpmc only enabled for ring 0 on Linux? I remember a patch to disable it, dunno if it has been applied.

--
Phe
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
> BTW, isn't rdpmc only enabled for ring 0 on Linux? I remember a patch to disable it, dunno if it has been applied.

Obviously -- without a system call to set up performance counters it would be fairly useless. But of course once such system calls are in they should be able to trigger the bit for each process.

-Andi
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
From: Andi Kleen [EMAIL PROTECTED]
Date: Wed, 14 Nov 2007 13:38:38 +0100

> At least for x86 and I suspect some other architectures we don't initially need a syscall at all for this. There is an instruction, RDPMC, which can read a performance counter just fine. It is also much faster and generally preferable for the case where a process measures events about itself. In fact it is essential for one of the use cases I would like to see perfmon used for (replacement of RDTSC for cycle counting).

I wouldn't even want to use a syscall for something like that on Sparc, I'd rather give this a dedicated software trap so that I can code it completely in assembler.
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
David Miller writes:
>> You're suggesting that the behaviour of a read() should depend on what was in the buffer before the read? Gack! Surely you have better taste than that?
>
> Absolutely that's what I mean, it's atomic and gives you exactly what you need. I see nothing wrong or gross with these semantics. Nothing in the book of UNIX specifies that for a device or special file the passed in buffer cannot contain input control data.

Oh kay *shudders*

It really violates the abstract model of read pretty badly. Read is "fill in the buffer with data from the device", not "do some arbitrary stuff with this area of memory". I'd prefer to have a transaction() system call like I suggested to Nick rather than overloading read() like this.

>> Then you end up with two system calls to get the data rather than one (one to send the request and another to read the reply). For something that needs to be quick that is a suboptimal interface.
>
> Not necessarily, consider the possibility of using recvmsg() control message data. With that it could be done in one go. This also suggests that it could be implemented as its own protocol family.

There's all sorts of possible ways that it could be implemented. On the one hand we have an actual proposed implementation, and on the other we have various people saying "oh but it could be implemented this other way" without providing any actual code. Now if those people can show that their way of doing it is significantly simpler and better than the existing implementation, then that's useful. I really don't think that doing a whole new net protocol family is a simpler and better way of doing a performance monitor interface, though.

Paul.
Re: [perfmon] Re: [perfmon2] perfmon2 merge news
Nick Piggin writes:
> What I really mean is a readv-like syscall, but one that also vectorises the file offset. Maybe this is useful enough as a generic syscall that also helps Paul's example...

I've sometimes thought it would be useful to have a "transaction" system call that is like a write + read combined into one:

    int transaction(int fd, char *req, size_t req_nb, char *reply, size_t reply_nb);

as a way to provide a general request/reply interface for special files.

> Of course, I guess this all depends on whether the atomicity is an important requirement. If not, you can obviously just do it with multiple read syscalls...

That would take N system calls instead of one, which could have a performance impact if you need to read the counters frequently (which I believe you do in some performance monitoring situations).

Paul.