Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units
Andi Kleen writes:

>> So create two events, one for the PT stuff and one to track the
>> side-band stuff. We have a NOP event for just this purpose.
>
> Ok I guess that could work.
>
> Essentially replace the magic mmap offset with a second fd.
>
> Alex, what do you think?

Yes, that's what I suggested some time ago in [1]. A second buffer
(through another fd or otherwise) is an essential thing from my point
of view.

[1] http://marc.info/?l=linux-kernel&m=138737306725663

Regards,
--
Alex
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units
restoring the list.. I really should drop all emails you send off list
into /dev/null.

On Wed, Jan 08, 2014 at 09:28:40AM +0100, Peter Zijlstra wrote:
> On Tue, Jan 07, 2014 at 10:23:22PM +0100, Andi Kleen wrote:
> > > Yes we very much rely on the FREEZE bits for LBR. PT and LBR being
> > > mutually exclusive wasn't documented (or I missed it) and completely
> > > blows.
> >
> > Can you describe why it is a problem? I had considered it only a minor
> > inconvenience, for many things you would use LBRs for PT is far better.
>
> Because if someone writes a GCC tool using perf-LBR support for some
> basic block analysis, and someone else writes another tool for PT, then
> the first tool magically stops working when the PT tool is started.
>
> We cannot refuse to create perf-LBR events, because at that time there
> might not be a PT user -- and even if there was one, it might go away.
>
> But as long as there's a PT user around, the LBR events will not be able
> to be scheduled and will simply starve, for no apparent reason.
>
> Complete and utterly miserable position.
>
> And it makes sense to write LBR tools because they cover a much greater
> spread of hardware.
Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units
On Tue, Jan 07, 2014 at 09:51:45PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 07, 2014 at 04:42:55PM +0100, Andi Kleen wrote:
> > > Yes; go read this:
> > >
> > > lkml.kernel.org/r/20131219125205.gt3...@twins.programming.kicks-ass.net
> >
> > Hmm, but AFAIK we're not using freeze counters on PMI today.
> > We just rely on the explicit disabling in the counters through the global
> > ctrl.
> >
> > So it should be the same as with any other PMI which also does not
> > automatically freeze. Not true?
>
> Regardless whether it's used or not; I'd very much like that answered.

The freeze always starts with the counter overflow, independent of
whether the interrupt is blocked or not. So everything should be ok.

-Andi
Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units
On Tue, Jan 07, 2014 at 04:42:55PM +0100, Andi Kleen wrote:
> > Yes; go read this:
> >
> > lkml.kernel.org/r/20131219125205.gt3...@twins.programming.kicks-ass.net
>
> Hmm, but AFAIK we're not using freeze counters on PMI today.
> We just rely on the explicit disabling in the counters through the global
> ctrl.
>
> So it should be the same as with any other PMI which also does not
> automatically freeze. Not true?

Regardless whether it's used or not; I'd very much like that answered.

> Or do you mean interaction with the LBRs here?
> (currently LBRs and PT are mutually exclusive)

Yes we very much rely on the FREEZE bits for LBR. PT and LBR being
mutually exclusive wasn't documented (or I missed it) and completely
blows.
Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units
> So create two events, one for the PT stuff and one to track the
> side-band stuff. We have a NOP event for just this purpose.

Ok I guess that could work.

Essentially replace the magic mmap offset with a second fd.

Alex, what do you think?

-Andi

--
a...@linux.intel.com -- Speaking for myself only.
Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units
> Also, the PT interrupt doesn't actually need to be an NMI; when the
> proposed S/G implementation would actually work as stated there can be
> plenty room left when we trigger the interrupt.

That's true.

-andi

--
a...@linux.intel.com -- Speaking for myself only.
Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units
> Yes; go read this:
>
> lkml.kernel.org/r/20131219125205.gt3...@twins.programming.kicks-ass.net

Hmm, but AFAIK we're not using freeze counters on PMI today.
We just rely on the explicit disabling in the counters through the global
ctrl.

So it should be the same as with any other PMI which also does not
automatically freeze. Not true?

Or do you mean interaction with the LBRs here?
(currently LBRs and PT are mutually exclusive)

-Andi

--
a...@linux.intel.com -- Speaking for myself only.
Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units
On Tue, Jan 07, 2014 at 01:52:31AM +0100, Andi Kleen wrote:
> > > Also of course it requires disabling/enabling PT explicitly for
> > > every perf message, which is slow. So you add at least 2*WRMSR cost
> > > (thousands of cycles).
> >
> > That's just dumb; no, flush the entire PT buffer into a few large
> > records.
>
> How would that work?
>
> You mean a separate buffer and then copy or map?
>
> --
>
> Also here are some more problems with interleaving:
>
> A common PT config is to just run it as a ring buffer in the background
> and only take the data out when something happens (sample, crash etc.)
>
> But the side band still needs to be logged and at arbitrary times.
>
> So the PT wrapping will happen much more often than the perf wrapping.

So create two events, one for the PT stuff and one to track the
side-band stuff. We have a NOP event for just this purpose.
Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units
On Mon, Jan 06, 2014 at 03:10:28PM -0800, Andi Kleen wrote:
> > To me it seems very weird that PT is hooked to the same PMI as the
> > normal PMU, it really should have been a different interrupt.
>
> It's in the same STATUS register, so it's cheap to check both.
>
> It shouldn't add any new spurious problems (or at least nothing
> worse than what we already have)
>
> I understand that it would be nice to separate other NMI users
> from all of PMI, but that would be an orthogonal problem.
>
> Any other issues?

Aside from the fact that PT and the PMU are otherwise unrelated, so it
being in the global status register is weird too.

Also, the PT interrupt doesn't actually need to be an NMI; when the
proposed S/G implementation would actually work as stated there can be
plenty room left when we trigger the interrupt.

But again, see the other email I referenced; the PMU triggering a PMI
while we're in one PT triggered is my biggest concern; esp. since both
have different FREEZE semantics.
Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units
On Mon, Jan 06, 2014 at 03:10:28PM -0800, Andi Kleen wrote:
> Peter Zijlstra writes:
> > Also, do clarify the other points I asked about. Esp. the non
> > FREEZE_ON_PMI behaviour of the PT PMI is worrying me immensely.
>
> The only reason for hardware freeze is when you have a few entries (like
> with LBRs) so the interrupt entry code could overwhelm it.
>
> But PT is not small, it's gigantic: even with the smallest buffer you
> have many thousands of entries.
>
> So you will get a few branches in the interrupt entry, but it's not a problem
> because everything you really wanted to trace is still there.
>
> Eventually the handler disables PT, so there's no risk of racing with
> the update or anything like that.
>
> Did I miss anything?

Yes; go read this:

lkml.kernel.org/r/20131219125205.gt3...@twins.programming.kicks-ass.net
Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units
On Tue, Jan 07, 2014 at 01:52:31AM +0100, Andi Kleen wrote:
> > > Also of course it requires disabling/enabling PT explicitly for
> > > every perf message, which is slow. So you add at least 2*WRMSR cost
> > > (thousands of cycles).
> >
> > That's just dumb; no, flush the entire PT buffer into a few large
> > records.
>
> How would that work?
>
> You mean a separate buffer and then copy or map?
>
> --
>
> Also here are some more problems with interleaving:
>
> A common PT config is to just run it as a ring buffer in the background
> and only take the data out when something happens (sample, crash etc.)
>
> But the side band still needs to be logged and at arbitrary times.
>
> So the PT wrapping will happen much more often than the perf wrapping.
>
> If you interleave you may actually end up with lots of small rings
> in a single buffer, unless you stop every time the buffer fills up
> (which would add a lot more overhead)
>
> I suppose it could be somehow parsed, but it would be very different
> from what perf does today.

Thinking about it more it's likely very hard to parse.

Dropping instructions is fine, dropping perf metadata is not (or only as
a last resort). If we miss an MMAP we may never be able to parse that
code region. If we miss a context switch we may also be completely lost
until the next switch.

That means PT couldn't overwrite perf metadata normally.

So you could easily get into situations where the interleaved PT buffer
is between two perf metadata statements and ends up really small, while
large other parts of the buffer are unused.

The only way around it would likely be to move entries around -- to
garbage collect so to say -- but doing that non-blocking from an NMI
will be challenging.

With the separate buffers we don't have any of these problems.

-Andi
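To make the separate-buffer argument above concrete, here is a toy model
in Python. It is only a sketch and not perf code: the class names, the
capacities and the "one context switch every 1000 branches" workload are
all made up for illustration. The trace buffer runs in overwrite mode
(losing old trace data is acceptable), while the side-band buffer never
overwrites, so MMAP or context-switch style records survive trace wrapping.

```python
from collections import deque


class OverwriteRing:
    """Toy trace buffer: keeps only the newest `capacity` items."""

    def __init__(self, capacity: int) -> None:
        self.items = deque(maxlen=capacity)
        self.dropped = 0

    def write(self, item) -> None:
        if len(self.items) == self.items.maxlen:
            self.dropped += 1  # the oldest entry is about to be overwritten
        self.items.append(item)


class LosslessLog:
    """Toy side-band buffer: rejects new records rather than overwrite old ones."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.items = []
        self.rejected = 0

    def write(self, item) -> bool:
        if len(self.items) >= self.capacity:
            self.rejected += 1
            return False
        self.items.append(item)
        return True


def run(n_branches: int, switch_every: int, trace_cap: int, sideband_cap: int):
    """Feed a hypothetical workload into two independent buffers."""
    trace = OverwriteRing(trace_cap)
    sideband = LosslessLog(sideband_cap)
    for i in range(n_branches):
        if i % switch_every == 0:
            sideband.write(("switch", i))  # side-band metadata record
        trace.write(("branch", i))         # high-volume trace record
    return trace, sideband


if __name__ == "__main__":
    trace, sideband = run(10_000, 1_000, trace_cap=256, sideband_cap=64)
    print(len(trace.items), trace.dropped, len(sideband.items), sideband.rejected)
```

In this model the trace ring wraps many times while every side-band
record is still present, which is the property the interleaved layout
has trouble providing.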
Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units
> > Also of course it requires disabling/enabling PT explicitly for
> > every perf message, which is slow. So you add at least 2*WRMSR cost
> > (thousands of cycles).
>
> That's just dumb; no, flush the entire PT buffer into a few large
> records.

How would that work?

You mean a separate buffer and then copy or map?

--

Also here are some more problems with interleaving:

A common PT config is to just run it as a ring buffer in the background
and only take the data out when something happens (sample, crash etc.)

But the side band still needs to be logged and at arbitrary times.

So the PT wrapping will happen much more often than the perf wrapping.

If you interleave you may actually end up with lots of small rings
in a single buffer, unless you stop every time the buffer fills up
(which would add a lot more overhead)

I suppose it could be somehow parsed, but it would be very different
from what perf does today.

-Andi

--
a...@linux.intel.com -- Speaking for myself only.
Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units
Peter Zijlstra writes:

Can you please clarify your position on the interleaved buffer?

I still can't see how it is an efficient design. It's generally true in
scatter-gather (be it software or hardware) that each additional SG entry
increases the cost. So to make things efficient you always want to minimize
entries as much as possible.

>> I don't think the PT design is broken in any way, it's straight
>> forward and simple.
>
> Also, do clarify the other points I asked about. Esp. the non
> FREEZE_ON_PMI behaviour of the PT PMI is worrying me immensely.

The only reason for hardware freeze is when you have a few entries (like
with LBRs) so the interrupt entry code could overwhelm it.

But PT is not small, it's gigantic: even with the smallest buffer you
have many thousands of entries.

So you will get a few branches in the interrupt entry, but it's not a problem
because everything you really wanted to trace is still there.

Eventually the handler disables PT, so there's no risk of racing with
the update or anything like that.

Did I miss anything?

> To me it seems very weird that PT is hooked to the same PMI as the
> normal PMU, it really should have been a different interrupt.

It's in the same STATUS register, so it's cheap to check both.

It shouldn't add any new spurious problems (or at least nothing
worse than what we already have)

I understand that it would be nice to separate other NMI users
from all of PMI, but that would be an orthogonal problem.

Any other issues?

-Andi

--
a...@linux.intel.com -- Speaking for myself only
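The size argument about hardware freeze above can be put into numbers
with a few lines of Python. This is a back-of-the-envelope sketch only:
the 16-entry and 4096-entry buffer sizes and the 20 branches executed
during interrupt entry are hypothetical figures chosen for illustration,
not measurements, and the model ignores the PMI nesting concern raised
elsewhere in the thread.

```python
def surviving_fraction(buffer_entries: int, entry_branches: int) -> float:
    """Fraction of a last-N branch buffer that still holds pre-interrupt
    branches after `entry_branches` further branches are recorded
    without any freeze."""
    if buffer_entries <= 0:
        raise ValueError("buffer_entries must be positive")
    kept = max(buffer_entries - entry_branches, 0)
    return kept / buffer_entries


if __name__ == "__main__":
    # Hypothetical: 20 branches taken on the way into the handler.
    print(surviving_fraction(16, 20))    # small, LBR-like buffer
    print(surviving_fraction(4096, 20))  # large, PT-like buffer
```

With these assumed numbers the small buffer keeps nothing of interest,
while the large one loses well under one percent.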
Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units
> I don't think the PT design is broken in any way, it's straight
> forward and simple.

Also, do clarify the other points I asked about. Esp. the non
FREEZE_ON_PMI behaviour of the PT PMI is worrying me immensely.

To me it seems very weird that PT is hooked to the same PMI as the
normal PMU, it really should have been a different interrupt.
Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units
On Mon, Jan 06, 2014 at 01:25:02PM -0800, Andi Kleen wrote:
> Peter Zijlstra writes:
>
> > On Thu, Dec 19, 2013 at 04:30:53PM +0200, Alexander Shishkin wrote:
> >> So I'd like to steer away from the ways in which hardware can be broken
> >> and talk about a usable interface, to begin with.
> >
> > Just dump it into the regular one buffer like I outlined.
>
> Just getting back to this.
>
> Do you realize that PT buffers have to be page aligned?
>
> So to mix it with a regular perf buffer would need padding every PT
> message by 4K, which wastes a lot of memory. The side band messages
> are usually only a few bytes (e.g. context switch).
>
> If the sideband is frequent it could even take up >half of the buffer,
> but mostly only with padding.
>
> Is that what you intended?
>
> perf doesn't support gaps today, so your proposal wouldn't even
> seem to fit into the current perf design.

That would be a really trivial addition.

> Also of course it requires disabling/enabling PT explicitly for
> every perf message, which is slow. So you add at least 2*WRMSR cost
> (thousands of cycles).

That's just dumb; no, flush the entire PT buffer into a few large
records.

> > That said; we very much need to have at least two architectures
> > implemented for any of this code to move.
> >
> > But we cannot ignore the hardware trainwreck; we cannot shape our
> > interface around something that's utterly broken.
> >
> > Some hardware is just too broken to support.
>
> I don't think the PT design is broken in any way, it's straight
> forward and simple.

If it were actually implemented like the spec says and not have this
crappy S/G limitation, then maybe.

> Trying to mix hardware tracing and software tracing in the same buffer
> on the other hand ...
>
> Anyways if perf is not flexible enough to support this I suppose
> it could switch to a simple device driver, and only run perf with
> separate fds for side band purposes.
>
> Would you prefer that?

Don't be stupid.
Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units
Peter Zijlstra writes:

> On Thu, Dec 19, 2013 at 04:30:53PM +0200, Alexander Shishkin wrote:
>> So I'd like to steer away from the ways in which hardware can be broken
>> and talk about a usable interface, to begin with.
>
> Just dump it into the regular one buffer like I outlined.

Just getting back to this.

Do you realize that PT buffers have to be page aligned?

So to mix it with a regular perf buffer would need padding every PT
message by 4K, which wastes a lot of memory. The side band messages
are usually only a few bytes (e.g. context switch).

If the sideband is frequent it could even take up >half of the buffer,
but mostly only with padding.

Is that what you intended?

perf doesn't support gaps today, so your proposal wouldn't even
seem to fit into the current perf design.

Also of course it requires disabling/enabling PT explicitly for
every perf message, which is slow. So you add at least 2*WRMSR cost
(thousands of cycles).

> That said; we very much need to have at least two architectures
> implemented for any of this code to move.
>
> But we cannot ignore the hardware trainwreck; we cannot shape our
> interface around something that's utterly broken.
>
> Some hardware is just too broken to support.

I don't think the PT design is broken in any way, it's straight
forward and simple.

Trying to mix hardware tracing and software tracing in the same buffer
on the other hand ...

Anyways if perf is not flexible enough to support this I suppose
it could switch to a simple device driver, and only run perf with
separate fds for side band purposes.

Would you prefer that?

-Andi

--
a...@linux.intel.com -- Speaking for myself only
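The padding cost described in the message above can be estimated with a
short Python sketch. It assumes 4 KiB pages (the "4K" mentioned in the
message) and a worst-case model in which every side-band record has to be
rounded up to a page boundary so that the following PT data starts page
aligned. The 32-byte record size is a hypothetical value standing in for
"a few bytes".

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def padded_size(record_bytes: int, page_size: int = PAGE_SIZE) -> int:
    """Round a record up to the next page boundary (ceiling division)."""
    if record_bytes <= 0:
        raise ValueError("record_bytes must be positive")
    pages = -(-record_bytes // page_size)
    return pages * page_size


def padding_overhead(record_bytes: int, page_size: int = PAGE_SIZE) -> float:
    """Fraction of the padded slot that is padding rather than payload."""
    slot = padded_size(record_bytes, page_size)
    return (slot - record_bytes) / slot


if __name__ == "__main__":
    size = 32  # hypothetical side-band record size in bytes
    print(padded_size(size), padding_overhead(size))
```

Under these assumptions a small record occupies a whole page and almost
all of that page is padding, which is the waste the message refers to.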
Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units
Peter Zijlstra pet...@infradead.org writes: On Thu, Dec 19, 2013 at 04:30:53PM +0200, Alexander Shishkin wrote: So I'd like to steer away from the ways in which hardware can be broken and talk about a usable interface, to begin with. Just dump it into the regular one buffer like I outlined. Just getting back to this. Do you realize that PT buffers have to be page aligned. So to mix it with a regular perf buffer would need padding every PT message by 4K, which wastes a lot of memory. The side band messages are usually only a few bytes (e.g. context switch). If the sideband is mfrequent it could even take up half of the buffer, but mostly only with padding. Is that what you intended? perf doesn't support gaps today, so your proposal wouldn't even seem to fit into the current perf design. Also of course it requires disabling/enabling PT explicitly for every perf message, which is slow. So you add at least 2*WRMSR cost (thousands of cycles). That said; we very much need to have at least two architectures implemented for any of this code to move. But we cannot ignore the hardware trainwreck; we cannot shape our interface around something that's utterly broken. Some hardware is just too broken to support. I don't think the PT design is broken in any way, it's straight forward and simple. Trying to mix hardware tracing and software tracing in the same buffer on the other hand ... Anyways if perf is not flexible enough to support this I suppose it could switch to a simple device driver, and only run perf with separate fds for side band purposes. Would you prefer that? -Andi -- a...@linux.intel.com -- Speaking for myself only -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units
On Mon, Jan 06, 2014 at 01:25:02PM -0800, Andi Kleen wrote:
> Peter Zijlstra pet...@infradead.org writes:
>
> > On Thu, Dec 19, 2013 at 04:30:53PM +0200, Alexander Shishkin wrote:
> > > So I'd like to steer away from the ways in which hardware can be broken
> > > and talk about a usable interface, to begin with.
> >
> > Just dump it into the regular one buffer like I outlined.
>
> Just getting back to this.
>
> Do you realize that PT buffers have to be page aligned. So to mix it with a regular perf buffer would need padding every PT message by 4K, which wastes a lot of memory. The side band messages are usually only a few bytes (e.g. context switch). If the sideband is frequent it could even take up more than half of the buffer, but mostly only with padding.
>
> Is that what you intended?
>
> perf doesn't support gaps today, so your proposal wouldn't even seem to fit into the current perf design.

That would be a really trivial addition.

> Also of course it requires disabling/enabling PT explicitly for every perf message, which is slow. So you add at least 2*WRMSR cost (thousands of cycles).

That's just dumb; no, flush the entire PT buffer into a few large records.

> > That said; we very much need to have at least two architectures
> > implemented for any of this code to move.
> >
> > But we cannot ignore the hardware trainwreck; we cannot shape our
> > interface around something that's utterly broken.
> >
> > Some hardware is just too broken to support.
>
> I don't think the PT design is broken in any way, it's straight forward and simple.

If it were actually implemented like the spec says and didn't have this crappy S/G limitation, then maybe.

> Trying to mix hardware tracing and software tracing in the same buffer on the other hand ...
>
> Anyways if perf is not flexible enough to support this I suppose it could switch to a simple device driver, and only run perf with separate fds for side band purposes.
>
> Would you prefer that?

Don't be stupid.
Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units
> I don't think the PT design is broken in any way, it's straight forward and simple.

Also, do clarify the other points I asked about. Esp. the non FREEZE_ON_PMI behaviour of the PT PMI is worrying me immensely.

To me it seems very weird that PT is hooked to the same PMI as the normal PMU, it really should have been a different interrupt.
Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units
Peter Zijlstra pet...@infradead.org writes:

Can you please clarify your position on the interleaved buffer?

I still can't see how it is an efficient design. It's generally true in scatter-gather (be it software or hardware) that each additional SG entry increases the cost. So to make things efficient you always want to minimize entries as much as possible.

> > I don't think the PT design is broken in any way, it's straight forward and simple.
>
> Also, do clarify the other points I asked about. Esp. the non FREEZE_ON_PMI behaviour of the PT PMI is worrying me immensely.

The only reason for hardware freeze is when you have a few entries (like with LBRs) so the interrupt entry code could overwhelm it. But PT is not small, it's gigantic: even with the smallest buffer you have many thousands of entries. So you will get a few branches in the interrupt entry, but it's not a problem because everything you really wanted to trace is still there.

Eventually the handler disables PT, so there's no risk of racing with the update or anything like that.

Did I miss anything?

> To me it seems very weird that PT is hooked to the same PMI as the normal PMU, it really should have been a different interrupt.

It's in the same STATUS register, so it's cheap to check both. It shouldn't add any new spurious problems (or at least nothing worse than what we already have).

I understand that it would be nice to separate other NMI users from all of PMI, but that would be an orthogonal problem.

Any other issues?

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only
Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units
> > Also of course it requires disabling/enabling PT explicitly for every perf message, which is slow. So you add at least 2*WRMSR cost (thousands of cycles).
>
> That's just dumb; no, flush the entire PT buffer into a few large records.

How would that work? You mean a separate buffer and then copy or map?

--

Also here are some more problems with interleaving:

A common PT config is to just run it as a ring buffer in the background and only take the data out when something happens (sample, crash etc.)

But the side band still needs to be logged, and at arbitrary times. So the PT wrapping will happen much more often than the perf wrapping.

If you interleave you may actually end up with lots of small rings in a single buffer, unless you stop every time the buffer fills up (which would add a lot more overhead).

I suppose it could be somehow parsed, but it would be very different from what perf does today.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.
Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units
On Tue, Jan 07, 2014 at 01:52:31AM +0100, Andi Kleen wrote:
> > > Also of course it requires disabling/enabling PT explicitly for every perf message, which is slow. So you add at least 2*WRMSR cost (thousands of cycles).
> >
> > That's just dumb; no, flush the entire PT buffer into a few large records.
>
> How would that work? You mean a separate buffer and then copy or map?
>
> --
>
> Also here are some more problems with interleaving:
>
> A common PT config is to just run it as a ring buffer in the background and only take the data out when something happens (sample, crash etc.)
>
> But the side band still needs to be logged, and at arbitrary times. So the PT wrapping will happen much more often than the perf wrapping.
>
> If you interleave you may actually end up with lots of small rings in a single buffer, unless you stop every time the buffer fills up (which would add a lot more overhead).
>
> I suppose it could be somehow parsed, but it would be very different from what perf does today.

Thinking about it more, it's likely very hard to parse.

Dropping instructions is fine, dropping perf metadata is not (or only as a last resort). If we miss an MMAP we may never be able to parse that code region. If we miss a context switch we may also be completely lost until the next switch.

That means PT couldn't overwrite perf metadata normally.

So you could easily get into situations where the interleaved PT buffer is between two perf metadata statements and ends up really small, while large other parts of the buffer are unused.

The only way around it would likely be to move entries around -- to garbage collect so to say -- but doing that non-blocking from an NMI will be challenging.

With the separate buffers we don't have any of these problems.

-Andi
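[Editor's note] The fragmentation concern in the message above can be illustrated with a small model. This is a simplified sketch only: it treats the buffer as linear (no wrap-around), assumes metadata records are pinned at fixed offsets and may not be overwritten, and uses a made-up helper name that has nothing to do with the actual perf ring buffer code.

```python
def largest_free_gap(buffer_size, pinned):
    """Return the largest contiguous run of bytes not covered by pinned
    records. `pinned` is a list of (offset, length) tuples describing
    metadata records that trace data must not overwrite."""
    gaps = []
    cursor = 0
    for offset, length in sorted(pinned):
        gaps.append(offset - cursor)          # free space before this record
        cursor = max(cursor, offset + length)  # skip past the record
    gaps.append(buffer_size - cursor)          # free space after the last one
    return max(gaps)


if __name__ == "__main__":
    # 64 KiB buffer with three 64-byte metadata records scattered in it.
    records = [(16384, 64), (20480, 64), (40960, 64)]
    print(largest_free_gap(65536, records))
```

In this example only 192 bytes are pinned, yet the largest region a trace could occupy is well under half of the buffer, which is the kind of situation the message describes.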
Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units
On Thu, Dec 19, 2013 at 04:54:27PM +0200, Alexander Shishkin wrote:
> Peter Zijlstra writes:
>
> > On Thu, Dec 19, 2013 at 12:57:59PM +0100, Peter Zijlstra wrote:
> > So you're basically forced to stop the tracing on PMI anyhow; so your
> > continuous tracing argument goes out the window.
>
> It's only stopped inside the PMI handler to set up another buffer, and
> is then started again, so no useful trace is lost. PMI handler is not
> traced. What you're proposing is stopping it for good till perf collects
> the previous data, which will lose us a lot of trace. So my argument
> stands.

That is not what I proposed at all. The PMI will swizzle the pages and resume recording. If there is no space in the output buffer, we'll simply re-use the existing pages and overwrite data.
Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units
On Thu, Dec 19, 2013 at 04:30:53PM +0200, Alexander Shishkin wrote:
> So I'd like to steer away from the ways in which hardware can be broken
> and talk about a usable interface, to begin with.

Just dump it into the regular one buffer like I outlined.

That said; we very much need to have at least two architectures implemented for any of this code to move.

But we cannot ignore the hardware trainwreck; we cannot shape our interface around something that's utterly broken.

Some hardware is just too broken to support.
Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units
On Thu, Dec 19, 2013 at 03:49:42PM +0100, Frederic Weisbecker wrote:
> On Thu, Dec 19, 2013 at 04:30:53PM +0200, Alexander Shishkin wrote:
> > Or the interface and implementation of BTS support in the kernel
> > discourage its use and that is why it is so rarely used.
>
> I never heard complaints about it. It's a simple dump of from/to address couples.
> I just think nobody took the time to develop userspace tooling to exploit it.
> But its famous slowness might have had a bad influence on this. And maybe
> also the fact that it's very architecture specific. AMD doesn't support BTS
> if I recall correctly. Or maybe it has its own different implementation?

No, AMD doesn't do anything like that.

There was some attempt to cure some of the wobblies: https://lkml.org/lkml/2013/7/8/154

But people never pursued that.

That said, if people want overwrite mode to work for PT we'd need to fix the same thing.
Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units
Peter Zijlstra writes:

> On Thu, Dec 19, 2013 at 12:57:59PM +0100, Peter Zijlstra wrote:
> So you're basically forced to stop the tracing on PMI anyhow; so your
> continuous tracing argument goes out the window.

It's only stopped inside the PMI handler to set up another buffer, and is then started again, so no useful trace is lost. PMI handler is not traced. What you're proposing is stopping it for good till perf collects the previous data, which will lose us a lot of trace. So my argument stands.

Regards,
--
Alex
Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units
On Thu, Dec 19, 2013 at 04:30:53PM +0200, Alexander Shishkin wrote:
> Or the interface and implementation of BTS support in the kernel
> discourage its use and that is why it is so rarely used.

I never heard complaints about it. It's a simple dump of from/to address couples. I just think nobody took the time to develop userspace tooling to exploit it. But its famous slowness might have had a bad influence on this. And maybe also the fact that it's very architecture specific. AMD doesn't support BTS if I recall correctly. Or maybe it has its own different implementation?
Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units
Ingo Molnar writes: > * Peter Zijlstra wrote: > >> On Thu, Dec 19, 2013 at 01:17:51PM +0200, Alexander Shishkin wrote: >> > Peter Zijlstra writes: >> > >> > > On Thu, Dec 19, 2013 at 09:53:44AM +0200, Alexander Shishkin wrote: >> > >> Yes and some implementations of PT have the same issue, but you can do a >> > >> sufficiently large high order allocation and map it to userspace and >> > >> still no copying (or parsing/decoding) in kernel space required. >> > > >> > > What's sufficiently large? The largest we could possibly allocate is >> > > something like 4k^11 which is 8M or so. That's not all that big given >> > > you keep saying it generates in the order of 100 MB/s. >> > >> > One chunk is 8M. You can have as many as the buddy allocator permits you >> > to have. When you get a PMI, you simply switch one chunk for another and >> > on the tracing goes. >> >> This document you referred me to looks to specify something with a >> proper s/g implementation; called ToPA. There doesn't appear to be a >> limit to the linked entries and you can specify a size per entry, and I >> don't see anywhere why 4k would be bad. >> >> That said, I'm still reading.. >> >> > > Also, 'some implementations', that sounds like a fail right there. Why >> > > are there already different implementations, and some which such stupid >> > > design, of something this new? >> > > >> > > How about just saying NO to the ones that requires physically contiguous >> > > allocations? >> > >> > No reason to leave those out, because they are still extremely useful >> > for tracing and fit perfectly fine in a model with two buffers. >> >> Maybe; but lets start with the sane hardware. Then we'll look at the >> amount of pain needed to support these broken pieces of crap and >> decide later. >> >> So drop all support for crappy hardware now. > > Absolutely agreed ... 
>
> The thing is, BTS itself is rarely used (and not primarily because
> it's slow, but because its tooling and thus its utility is poor), so
> the last thing we want is another piece of broken hardware with a
> quirky software interface to it that tooling has trouble utilizing.

Or the interface and implementation of BTS support in the kernel discourage its use and that is why it is so rarely used.

What I'm proposing is a unified interface for trace units to export their traces, and not only the "non-crappy" ones, in a way that won't discourage its use from day one.

So I'd like to steer away from the ways in which hardware can be broken and talk about a usable interface, to begin with.

Regards,
--
Alex
Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units
On Thu, Dec 19, 2013 at 12:57:59PM +0100, Peter Zijlstra wrote: > On Thu, Dec 19, 2013 at 12:28:12PM +0100, Peter Zijlstra wrote: > > This document you referred me to looks to specify something with a > > proper s/g implementation; called ToPA. There doesn't appear to be a > > limit to the linked entries and you can specify a size per entry, and I > > don't see anywhere why 4k would be bad. > > > > That said, I'm still reading.. > > Found it: > > "Single Output Region ToPA Implementation > > The first processor generation to implement Intel PT supports only ToPA > configurations with a single ToPA entry followed by an END entry that > points back to the first entry (creating one circular output buffer). > Such processors enumerate CPUID.(EAX=14H,ECX=0):EBX[bit 1] as 0." > > So basically you guys buggered the hardware. > "ToPA PMI and Single Output Region ToPA Implementation A processor that supports only a single ToPA output region implementation (such that only one output region is supported; see above) will attempt to signal a ToPA PMI interrupt before the output wraps and overwrites the top of the buffer. To support this functionality, the PMI handler should disable packet generation as soon as possible. Due to PMI skid, it is possible, in rare cases, that the wrap will have occurred before the PMI is delivered. Software can avoid this by setting the STOP bit in the ToPA entry (see Table 11-3); this will disable tracing once the region is filled, and no wrap will occur. This approach has the downside of disabling packet generation so that some of the instructions that led up to the PMI will not be traced. If the PMI skid is significant enough to cause the region to fill and tracing to be disabled, the PMI handler will need to clear the IA32_RTIT_STATUS.Stopped indication before tracing can resume." So you're basically forced to stop the tracing on PMI anyhow; so your continuous tracing argument goes out the window. Also, what a complete clusterfuck. 
I think we're far better off pretending PT doesn't exist until it's fixed.
Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units
Found more:

"Note that no “freezing” takes place with the ToPA PMI. Thus, packet generation is not frozen, and the interrupt handler will be traced (though filtering can prevent this). Further, the setting of IA32_DEBUGCTL.Freeze_Perfmon_on_PMI is ignored and performance counters are not frozen by a ToPA PMI."

Can someone confirm with the hardware people what happens when an actual PMU counter overflows and tries to raise the PMI while we're in one that ignores the 'Freeze_perfmon_on_PMI' bit?

Since you cannot assert an interrupt that is already asserted, but that handler can see the overflow status bit set and will likely process it; assuming the PMU is actually frozen.

Also, this just smells ripe for errata and ugly bugs.
Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units
Peter Zijlstra writes: > On Thu, Dec 19, 2013 at 01:14:09PM +0200, Alexander Shishkin wrote: >> Peter Zijlstra writes: >> >> > On Thu, Dec 19, 2013 at 09:53:44AM +0200, Alexander Shishkin wrote: >> >> Peter Zijlstra writes: >> >> > The thing is; why can't you zero-copy whatever buffer the hardware >> >> > writes into, into the normal buffer? >> >> >> >> I'm not sure I understand. You mean, have the buffer split between perf >> >> data and trace data? >> > >> > Yep, I don't see any reason why this wouldn't work. >> > >> > When the hardware thing sends an interrupt to notify us its buffer is >> > 'full', stop the recorder, try to create a single record in the buffer >> > that's big enough + 1 page, then swizzle the hardware pages and the >> > buffer pages for that record, using the +1 page to page align the actual >> > data. Then (re)start the hardware on the 'new' pages. >> >> We configure the hardware thing to send an interrupt *before* the buffer >> is full, keep the recorder running while userspace saves stuff to >> perf.data file. Recording only stops if perf fails to read the trace >> data out fast enough and the buffer fills up. So you'd have a complete >> trace. >> >> Also, we have what we call a "snapshot" mode, where we keep the hardware >> thing running, writing data to a circular buffer till it's stopped, in >> case we're only interested in the most recent trace data to see what it >> is that takes too long to respond, etc. And while it is running, we're >> getting new records in the perf stream all the time (mmaps, etc). >> >> Put simple: perf data and trace data are two different separate types of >> information that originate from two different sources, can exist and >> make sense separately from one another and should not be mixed. > > Well you're either having to change your stance or we're done talking > right now. I'm making a case in favor of 2 separate buffers just like you asked in one of the previous emails. 
It's backed by some very real usecases. That said, I'm not personally attached to any one design, only what makes sense. There is no 'stance'.

Regards,
--
Alex
Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units
* Peter Zijlstra wrote: > On Thu, Dec 19, 2013 at 01:17:51PM +0200, Alexander Shishkin wrote: > > Peter Zijlstra writes: > > > > > On Thu, Dec 19, 2013 at 09:53:44AM +0200, Alexander Shishkin wrote: > > >> Yes and some implementations of PT have the same issue, but you can do a > > >> sufficiently large high order allocation and map it to userspace and > > >> still no copying (or parsing/decoding) in kernel space required. > > > > > > What's sufficiently large? The largest we could possibly allocate is > > > something like 4k^11 which is 8M or so. That's not all that big given > > > you keep saying it generates in the order of 100 MB/s. > > > > One chunk is 8M. You can have as many as the buddy allocator permits you > > to have. When you get a PMI, you simply switch one chunk for another and > > on the tracing goes. > > This document you referred me to looks to specify something with a > proper s/g implementation; called ToPA. There doesn't appear to be a > limit to the linked entries and you can specify a size per entry, and I > don't see anywhere why 4k would be bad. > > That said, I'm still reading.. > > > > Also, 'some implementations', that sounds like a fail right there. Why > > > are there already different implementations, and some which such stupid > > > design, of something this new? > > > > > > How about just saying NO to the ones that requires physically contiguous > > > allocations? > > > > No reason to leave those out, because they are still extremely useful > > for tracing and fit perfectly fine in a model with two buffers. > > Maybe; but lets start with the sane hardware. Then we'll look at the > amount of pain needed to support these broken pieces of crap and > decide later. > > So drop all support for crappy hardware now. Absolutely agreed ... 
The thing is, BTS itself is rarely used (and not primarily because it's slow, but because its tooling and thus its utility is poor), so the last thing we want is another piece of broken hardware with a quirky software interface to it that tooling has trouble utilizing.

Sigh, when will Intel learn to talk to Linux PMU experts _before_ committing to a hardware interface??

Thanks,

Ingo
Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units
On Thu, Dec 19, 2013 at 12:28:12PM +0100, Peter Zijlstra wrote:
> This document you referred me to looks to specify something with a
> proper s/g implementation; called ToPA. There doesn't appear to be a
> limit to the linked entries and you can specify a size per entry, and I
> don't see anywhere why 4k would be bad.
>
> That said, I'm still reading..

Found it:

"Single Output Region ToPA Implementation

The first processor generation to implement Intel PT supports only ToPA configurations with a single ToPA entry followed by an END entry that points back to the first entry (creating one circular output buffer). Such processors enumerate CPUID.(EAX=14H,ECX=0):EBX[bit 1] as 0."

So basically you guys buggered the hardware.

More specifically, what actual hardware is this? Is this first generation HSW or so? Please enumerate the actual hardware that supports this PT stuff and which hardware has it fixed.
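[Editor's note] The enumeration rule in the quoted text boils down to testing one bit of a CPUID output register. The sketch below only decodes a register value that has already been read by some other means; it does not execute CPUID itself. The register and bit position are taken verbatim from the text quoted in this message and should be checked against the current Intel SDM before being relied on; the function names are made up for illustration.

```python
def bit_is_set(value, bit):
    """Return True if the given bit of an integer register value is 1."""
    return bool((value >> bit) & 1)


def supports_multiple_topa_entries(ebx):
    """Interpret CPUID.(EAX=14H,ECX=0):EBX as described in the quoted text:
    bit 1 reads as 0 on single-output-region implementations."""
    return bit_is_set(ebx, 1)


if __name__ == "__main__":
    # Hypothetical register values, not read from real hardware.
    print(supports_multiple_topa_entries(0b01))
    print(supports_multiple_topa_entries(0b11))
```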
Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units
Peter Zijlstra writes:

> On Thu, Dec 19, 2013 at 01:17:51PM +0200, Alexander Shishkin wrote:
>> Peter Zijlstra writes:
>>
>> > On Thu, Dec 19, 2013 at 09:53:44AM +0200, Alexander Shishkin wrote:
>> >> Yes and some implementations of PT have the same issue, but you can do a
>> >> sufficiently large high order allocation and map it to userspace and
>> >> still no copying (or parsing/decoding) in kernel space required.
>> >
>> > What's sufficiently large? The largest we could possibly allocate is
>> > something like 4k^11 which is 8M or so. That's not all that big given
>> > you keep saying it generates in the order of 100 MB/s.
>>
>> One chunk is 8M. You can have as many as the buddy allocator permits you
>> to have. When you get a PMI, you simply switch one chunk for another and
>> on the tracing goes.
>
> This document you referred me to looks to specify something with a
> proper s/g implementation; called ToPA. There doesn't appear to be a
> limit to the linked entries and you can specify a size per entry, and I
> don't see anywhere why 4k would be bad.

JFYI, 11.2.4.1, "Single Output Region ToPA Implementation".

Regards,
--
Alex
Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units
On Thu, Dec 19, 2013 at 01:17:51PM +0200, Alexander Shishkin wrote:
> Peter Zijlstra writes:
>
> > On Thu, Dec 19, 2013 at 09:53:44AM +0200, Alexander Shishkin wrote:
> >> Yes and some implementations of PT have the same issue, but you can do a
> >> sufficiently large high order allocation and map it to userspace and
> >> still no copying (or parsing/decoding) in kernel space required.
> >
> > What's sufficiently large? The largest we could possibly allocate is
> > something like 4k^11 which is 8M or so. That's not all that big given
> > you keep saying it generates in the order of 100 MB/s.
>
> One chunk is 8M. You can have as many as the buddy allocator permits you
> to have. When you get a PMI, you simply switch one chunk for another and
> on the tracing goes.

This document you referred me to looks to specify something with a proper s/g implementation; called ToPA. There doesn't appear to be a limit to the linked entries and you can specify a size per entry, and I don't see anywhere why 4k would be bad.

That said, I'm still reading..

> > Also, 'some implementations', that sounds like a fail right there. Why
> > are there already different implementations, and some which such stupid
> > design, of something this new?
> >
> > How about just saying NO to the ones that requires physically contiguous
> > allocations?
>
> No reason to leave those out, because they are still extremely useful
> for tracing and fit perfectly fine in a model with two buffers.

Maybe; but let's start with the sane hardware. Then we'll look at the amount of pain needed to support these broken pieces of crap and decide later.

So drop all support for crappy hardware now.
Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units
On Thu, Dec 19, 2013 at 01:14:09PM +0200, Alexander Shishkin wrote: > Peter Zijlstra writes: > > > On Thu, Dec 19, 2013 at 09:53:44AM +0200, Alexander Shishkin wrote: > >> Peter Zijlstra writes: > >> > The thing is; why can't you zero-copy whatever buffer the hardware > >> > writes into, into the normal buffer? > >> > >> I'm not sure I understand. You mean, have the buffer split between perf > >> data and trace data? > > > > Yep, I don't see any reason why this wouldn't work. > > > > When the hardware thing sends an interrupt to notify us its buffer is > > 'full', stop the recorder, try to create a single record in the buffer > > that's big enough + 1 page, then swizzle the hardware pages and the > > buffer pages for that record, using the +1 page to page align the actual > > data. Then (re)start the hardware on the 'new' pages. > > We configure the hardware thing to send an interrupt *before* the buffer > is full, keep the recorder running while userspace saves stuff to > perf.data file. Recording only stops if perf fails to read the trace > data out fast enough and the buffer fills up. So you'd have a complete > trace. > > Also, we have what we call a "snapshot" mode, where we keep the hardware > thing running, writing data to a circular buffer till it's stopped, in > case we're only interested in the most recent trace data to see what it > is that takes too long to respond, etc. And while it is running, we're > getting new records in the perf stream all the time (mmaps, etc). > > Put simple: perf data and trace data are two different separate types of > information that originate from two different sources, can exist and > make sense separately from one another and should not be mixed. Well you're either having to change your stance or we're done talking right now. 
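[Editor's note] The "big enough + 1 page" scheme quoted above can be sketched as a small layout calculation: reserve the trace pages plus one spare page, then start the trace data at the first page boundary after the record header. This is only an illustration under assumptions (4 KiB pages, a hypothetical fixed header size, made-up function names); it is not the perf implementation.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def align_up(addr, page_size=PAGE_SIZE):
    """Round addr up to the next multiple of page_size."""
    return (addr + page_size - 1) // page_size * page_size


def place_trace_data(record_start, header_bytes, data_pages,
                     page_size=PAGE_SIZE):
    """Lay out a record of (data_pages + 1) pages starting at record_start.

    Returns (data_start, fits): the page-aligned offset where the trace
    pages would begin, and whether they still end inside the record."""
    record_end = record_start + (data_pages + 1) * page_size
    data_start = align_up(record_start + header_bytes, page_size)
    data_end = data_start + data_pages * page_size
    return data_start, data_end <= record_end


if __name__ == "__main__":
    print(place_trace_data(8, 8, 2))      # header sits early in a page
    print(place_trace_data(4092, 8, 2))   # header straddles a page boundary
```

The second call shows an edge case of this simple model: if the header itself crosses a page boundary, a single spare page is not quite enough, so a real implementation would have to constrain where records may start.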
Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units
Peter Zijlstra writes:

> On Thu, Dec 19, 2013 at 09:53:44AM +0200, Alexander Shishkin wrote:
>> Yes and some implementations of PT have the same issue, but you can do a
>> sufficiently large high order allocation and map it to userspace and
>> still no copying (or parsing/decoding) in kernel space required.
>
> What's sufficiently large? The largest we could possibly allocate is
> something like 4k^11 which is 8M or so. That's not all that big given
> you keep saying it generates in the order of 100 MB/s.

One chunk is 8M. You can have as many as the buddy allocator permits you to have. When you get a PMI, you simply switch one chunk for another and on the tracing goes.

> Also, 'some implementations', that sounds like a fail right there. Why
> are there already different implementations, and some which such stupid
> design, of something this new?
>
> How about just saying NO to the ones that requires physically contiguous
> allocations?

No reason to leave those out, because they are still extremely useful for tracing and fit perfectly fine in a model with two buffers.

Regards,
--
Alex
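[Editor's note] The "4k^11 which is 8M" figure in the quoted text reads as 4 KiB pages times 2^11 pages. A quick worked calculation, using only the numbers mentioned in the thread (a 2^11-page chunk and a trace rate on the order of 100 MB/s), shows how long one such chunk lasts. The constants are the thread's figures, not values checked against any particular kernel configuration.

```python
PAGE_SIZE = 4096  # bytes, the "4k" in the thread
ORDER = 11        # the thread's figure: a chunk of 2**11 pages


def chunk_bytes(order=ORDER, page_size=PAGE_SIZE):
    """Size in bytes of a physically contiguous chunk of 2**order pages."""
    return page_size << order


def fill_time_ms(chunk, rate_bytes_per_s):
    """Milliseconds until a chunk of the given size fills at a steady rate."""
    return 1000.0 * chunk / rate_bytes_per_s


if __name__ == "__main__":
    size = chunk_bytes()
    print(size, "bytes =", size // (1024 * 1024), "MiB")
    print(round(fill_time_ms(size, 100e6), 1), "ms at 100 MB/s")
```

At that rate a single chunk fills in well under a tenth of a second, which is why the discussion turns to switching between several chunks on each PMI.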
Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units
Peter Zijlstra writes: > On Thu, Dec 19, 2013 at 09:53:44AM +0200, Alexander Shishkin wrote: >> Peter Zijlstra writes: >> > The thing is; why can't you zero-copy whatever buffer the hardware >> > writes into, into the normal buffer? >> >> I'm not sure I understand. You mean, have the buffer split between perf >> data and trace data? > > Yep, I don't see any reason why this wouldn't work. > > When the hardware thing sends an interrupt to notify us its buffer is > 'full', stop the recorder, try to create a single record in the buffer > that's big enough + 1 page, then swizzle the hardware pages and the > buffer pages for that record, using the +1 page to page align the actual > data. Then (re)start the hardware on the 'new' pages. We configure the hardware thing to send an interrupt *before* the buffer is full, keep the recorder running while userspace saves stuff to perf.data file. Recording only stops if perf fails to read the trace data out fast enough and the buffer fills up. So you'd have a complete trace. Also, we have what we call a "snapshot" mode, where we keep the hardware thing running, writing data to a circular buffer till it's stopped, in case we're only interested in the most recent trace data to see what it is that takes too long to respond, etc. And while it is running, we're getting new records in the perf stream all the time (mmaps, etc). Put simple: perf data and trace data are two different separate types of information that originate from two different sources, can exist and make sense separately from one another and should not be mixed. Regards, -- Alex -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
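The "snapshot" mode described in this message leaves userspace with a circular buffer whose newest data ends at the hardware write position. A minimal sketch of how a reader could linearize such a buffer, oldest data first, might look like the following. The `struct span` type, the function name and the `wrapped` flag are hypothetical and not taken from the patch set.

```c
#include <assert.h>
#include <stddef.h>

/* One contiguous region of the circular buffer. */
struct span {
    size_t off;
    size_t len;
};

/*
 * Fill out[] with the valid regions of a circular buffer, oldest first.
 * head is the next write position; wrapped says whether the writer has
 * gone past the end at least once. Returns the number of spans (0..2).
 */
static int snapshot_spans(size_t size, size_t head, int wrapped,
                          struct span out[2])
{
    int n = 0;

    assert(head < size);
    if (wrapped) {
        /* Oldest surviving data starts at head and runs to the end. */
        out[n].off = head;
        out[n].len = size - head;
        n++;
    }
    if (head > 0) {
        /* Newest data runs from the start of the buffer up to head. */
        out[n].off = 0;
        out[n].len = head;
        n++;
    }
    return n;
}
```

For a 16-byte buffer with `head == 5` after a wrap, this yields the spans `[5, 16)` followed by `[0, 5)`.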
Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units
On Thu, Dec 19, 2013 at 09:53:44AM +0200, Alexander Shishkin wrote:
> Yes and some implementations of PT have the same issue, but you can do a
> sufficiently large high order allocation and map it to userspace and
> still no copying (or parsing/decoding) in kernel space required.

What's sufficiently large? The largest we could possibly allocate is
something like 4k^11 which is 8M or so. That's not all that big given
you keep saying it generates in the order of 100 MB/s.

Also, 'some implementations', that sounds like a fail right there. Why
are there already different implementations, and some with such stupid
design, of something this new?

How about just saying NO to the ones that require physically contiguous
allocations?
Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units
On Thu, Dec 19, 2013 at 09:53:44AM +0200, Alexander Shishkin wrote:
> Peter Zijlstra writes:
> > The thing is; why can't you zero-copy whatever buffer the hardware
> > writes into, into the normal buffer?
>
> I'm not sure I understand. You mean, have the buffer split between perf
> data and trace data?

Yep, I don't see any reason why this wouldn't work.

When the hardware thing sends an interrupt to notify us its buffer is
'full', stop the recorder, try to create a single record in the buffer
that's big enough + 1 page, then swizzle the hardware pages and the
buffer pages for that record, using the +1 page to page align the actual
data. Then (re)start the hardware on the 'new' pages.
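The "+1 page" in this proposal exists so that the payload of the record can start on a page boundary, which is what makes swapping whole pages possible. A minimal sketch of the padding calculation follows, assuming a record header of `hdr_len` bytes written at ring-buffer offset `head`; both parameter names and the function itself are hypothetical, not perf's actual code.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Bytes of padding needed after the record header so that the payload
 * begins on a page boundary. The result is always smaller than one page,
 * which is why reserving one extra page is enough.
 */
static size_t pad_to_page(size_t head, size_t hdr_len, size_t page_size)
{
    size_t start;

    /* page_size must be a non-zero power of two. */
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

    start = head + hdr_len;
    return (page_size - (start & (page_size - 1))) & (page_size - 1);
}
```

For example, a header ending at offset 108 with 4 KiB pages needs 3988 bytes of padding, while a header ending exactly on a page boundary needs none.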
Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units
On Thu, Dec 19, 2013 at 01:17:51PM +0200, Alexander Shishkin wrote: Peter Zijlstra pet...@infradead.org writes: On Thu, Dec 19, 2013 at 09:53:44AM +0200, Alexander Shishkin wrote: Yes and some implementations of PT have the same issue, but you can do a sufficiently large high order allocation and map it to userspace and still no copying (or parsing/decoding) in kernel space required. What's sufficiently large? The largest we could possibly allocate is something like 4k^11 which is 8M or so. That's not all that big given you keep saying it generates in the order of 100 MB/s. One chunk is 8M. You can have as many as the buddy allocator permits you to have. When you get a PMI, you simply switch one chunk for another and on the tracing goes. This document you referred me to looks to specify something with a proper s/g implementation; called ToPA. There doesn't appear to be a limit to the linked entries and you can specify a size per entry, and I don't see anywhere why 4k would be bad. That said, I'm still reading.. Also, 'some implementations', that sounds like a fail right there. Why are there already different implementations, and some which such stupid design, of something this new? How about just saying NO to the ones that requires physically contiguous allocations? No reason to leave those out, because they are still extremely useful for tracing and fit perfectly fine in a model with two buffers. Maybe; but lets start with the sane hardware. Then we'll look at the amount of pain needed to support these broken pieces of crap and decide later. So drop all support for crappy hardware now. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units
Peter Zijlstra pet...@infradead.org writes: On Thu, Dec 19, 2013 at 01:17:51PM +0200, Alexander Shishkin wrote: Peter Zijlstra pet...@infradead.org writes: On Thu, Dec 19, 2013 at 09:53:44AM +0200, Alexander Shishkin wrote: Yes and some implementations of PT have the same issue, but you can do a sufficiently large high order allocation and map it to userspace and still no copying (or parsing/decoding) in kernel space required. What's sufficiently large? The largest we could possibly allocate is something like 4k^11 which is 8M or so. That's not all that big given you keep saying it generates in the order of 100 MB/s. One chunk is 8M. You can have as many as the buddy allocator permits you to have. When you get a PMI, you simply switch one chunk for another and on the tracing goes. This document you referred me to looks to specify something with a proper s/g implementation; called ToPA. There doesn't appear to be a limit to the linked entries and you can specify a size per entry, and I don't see anywhere why 4k would be bad. JFYI, 11.2.4.1, Single Output Region ToPA Implementation. Regards, -- Alex -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units
On Thu, Dec 19, 2013 at 12:28:12PM +0100, Peter Zijlstra wrote:
> This document you referred me to looks to specify something with a
> proper s/g implementation; called ToPA. There doesn't appear to be a
> limit to the linked entries and you can specify a size per entry, and I
> don't see anywhere why 4k would be bad.
>
> That said, I'm still reading..

Found it:

  Single Output Region ToPA Implementation

  The first processor generation to implement Intel PT supports only ToPA
  configurations with a single ToPA entry followed by an END entry that
  points back to the first entry (creating one circular output buffer).
  Such processors enumerate CPUID.(EAX=14H,ECX=0):EBX[bit 1] as 0.

So basically you guys buggered the hardware.

More specifically, what actual hardware is this? Is this first generation
HSW or so? Please enumerate the actual hardware that supports this PT
stuff and which hardware has it fixed.
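The enumeration rule quoted in this message reduces to a single bit test on a CPUID output register. The sketch below is only an illustration: the macro and function names are hypothetical, and the bit position and register are taken from the document text quoted in the thread. Other revisions of Intel's documentation may enumerate this capability differently, so the current SDM should be checked before relying on it. The function takes the register value as an argument rather than executing CPUID, so it stays portable.

```c
#include <stdbool.h>
#include <stdint.h>

#define PT_CPUID_LEAF            0x14u /* CPUID leaf named in the quoted text */
#define PT_MULTI_ENTRY_TOPA_BIT  1u    /* bit position as quoted; an assumption */

/*
 * Returns true if the given CPUID(EAX=14H, ECX=0) register value indicates
 * that only a single ToPA output region is supported, i.e. the bit is 0.
 */
static bool pt_single_region_only(uint32_t reg)
{
    return ((reg >> PT_MULTI_ENTRY_TOPA_BIT) & 1u) == 0;
}
```

A driver following Peter's suggestion to start with the "sane hardware" could use a check of this kind to refuse to bind on single-region parts.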
Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units
* Peter Zijlstra pet...@infradead.org wrote: On Thu, Dec 19, 2013 at 01:17:51PM +0200, Alexander Shishkin wrote: Peter Zijlstra pet...@infradead.org writes: On Thu, Dec 19, 2013 at 09:53:44AM +0200, Alexander Shishkin wrote: Yes and some implementations of PT have the same issue, but you can do a sufficiently large high order allocation and map it to userspace and still no copying (or parsing/decoding) in kernel space required. What's sufficiently large? The largest we could possibly allocate is something like 4k^11 which is 8M or so. That's not all that big given you keep saying it generates in the order of 100 MB/s. One chunk is 8M. You can have as many as the buddy allocator permits you to have. When you get a PMI, you simply switch one chunk for another and on the tracing goes. This document you referred me to looks to specify something with a proper s/g implementation; called ToPA. There doesn't appear to be a limit to the linked entries and you can specify a size per entry, and I don't see anywhere why 4k would be bad. That said, I'm still reading.. Also, 'some implementations', that sounds like a fail right there. Why are there already different implementations, and some which such stupid design, of something this new? How about just saying NO to the ones that requires physically contiguous allocations? No reason to leave those out, because they are still extremely useful for tracing and fit perfectly fine in a model with two buffers. Maybe; but lets start with the sane hardware. Then we'll look at the amount of pain needed to support these broken pieces of crap and decide later. So drop all support for crappy hardware now. Absolutely agreed ... The thing is, BTS itself is rarely used (and not primarily because it's slow, but because its tooling and thus its utility is poor), so the last thing we want is another piece of broken hardware with a quirky software interface to it that tooling has trouble utilizing. 
Sigh, when will Intel learn to talk to Linux PMU experts _before_ committing to a hardware interface?? Thanks, Ingo
Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units
Peter Zijlstra pet...@infradead.org writes: On Thu, Dec 19, 2013 at 01:14:09PM +0200, Alexander Shishkin wrote: Peter Zijlstra pet...@infradead.org writes: On Thu, Dec 19, 2013 at 09:53:44AM +0200, Alexander Shishkin wrote: Peter Zijlstra pet...@infradead.org writes: The thing is; why can't you zero-copy whatever buffer the hardware writes into, into the normal buffer? I'm not sure I understand. You mean, have the buffer split between perf data and trace data? Yep, I don't see any reason why this wouldn't work. When the hardware thing sends an interrupt to notify us its buffer is 'full', stop the recorder, try to create a single record in the buffer that's big enough + 1 page, then swizzle the hardware pages and the buffer pages for that record, using the +1 page to page align the actual data. Then (re)start the hardware on the 'new' pages. We configure the hardware thing to send an interrupt *before* the buffer is full, keep the recorder running while userspace saves stuff to perf.data file. Recording only stops if perf fails to read the trace data out fast enough and the buffer fills up. So you'd have a complete trace. Also, we have what we call a snapshot mode, where we keep the hardware thing running, writing data to a circular buffer till it's stopped, in case we're only interested in the most recent trace data to see what it is that takes too long to respond, etc. And while it is running, we're getting new records in the perf stream all the time (mmaps, etc). Put simple: perf data and trace data are two different separate types of information that originate from two different sources, can exist and make sense separately from one another and should not be mixed. Well you're either having to change your stance or we're done talking right now. I'm making a case in favor of 2 separate buffers just like you asked in one of the previous emails. It's backed by some very real usecases. That said, I'm not personally attached to any one design, only what makes sense. 
There is no 'stance'. Regards, -- Alex
Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units
Found more:

  Note that no “freezing” takes place with the ToPA PMI. Thus, packet
  generation is not frozen, and the interrupt handler will be traced
  (though filtering can prevent this). Further, the setting of
  IA32_DEBUGCTL.Freeze_Perfmon_on_PMI is ignored and performance counters
  are not frozen by a ToPA PMI.

Can someone confirm with the hardware people what happens when an actual
PMU counter overflows and tries to raise the PMI while we're in one that
ignores the 'Freeze_perfmon_on_PMI' bit?

Since you cannot assert an interrupt that is already asserted, but that
handler can see the overflow status bit set and will likely process it;
assuming the PMU is actually frozen.

Also, this just smells ripe for errata and ugly bugs.
Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units
On Thu, Dec 19, 2013 at 12:57:59PM +0100, Peter Zijlstra wrote: On Thu, Dec 19, 2013 at 12:28:12PM +0100, Peter Zijlstra wrote: This document you referred me to looks to specify something with a proper s/g implementation; called ToPA. There doesn't appear to be a limit to the linked entries and you can specify a size per entry, and I don't see anywhere why 4k would be bad. That said, I'm still reading.. Found it: Single Output Region ToPA Implementation The first processor generation to implement Intel PT supports only ToPA configurations with a single ToPA entry followed by an END entry that points back to the first entry (creating one circular output buffer). Such processors enumerate CPUID.(EAX=14H,ECX=0):EBX[bit 1] as 0. So basically you guys buggered the hardware. ToPA PMI and Single Output Region ToPA Implementation A processor that supports only a single ToPA output region implementation (such that only one output region is supported; see above) will attempt to signal a ToPA PMI interrupt before the output wraps and overwrites the top of the buffer. To support this functionality, the PMI handler should disable packet generation as soon as possible. Due to PMI skid, it is possible, in rare cases, that the wrap will have occurred before the PMI is delivered. Software can avoid this by setting the STOP bit in the ToPA entry (see Table 11-3); this will disable tracing once the region is filled, and no wrap will occur. This approach has the downside of disabling packet generation so that some of the instructions that led up to the PMI will not be traced. If the PMI skid is significant enough to cause the region to fill and tracing to be disabled, the PMI handler will need to clear the IA32_RTIT_STATUS.Stopped indication before tracing can resume. So you're basically forced to stop the tracing on PMI anyhow; so your continuous tracing argument goes out the window. Also, what a complete clusterfuck. 
I think we're far better off pretending PT doesn't exist until it's fixed.
Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units
Ingo Molnar mi...@kernel.org writes: * Peter Zijlstra pet...@infradead.org wrote: On Thu, Dec 19, 2013 at 01:17:51PM +0200, Alexander Shishkin wrote: Peter Zijlstra pet...@infradead.org writes: On Thu, Dec 19, 2013 at 09:53:44AM +0200, Alexander Shishkin wrote: Yes and some implementations of PT have the same issue, but you can do a sufficiently large high order allocation and map it to userspace and still no copying (or parsing/decoding) in kernel space required. What's sufficiently large? The largest we could possibly allocate is something like 4k^11 which is 8M or so. That's not all that big given you keep saying it generates in the order of 100 MB/s. One chunk is 8M. You can have as many as the buddy allocator permits you to have. When you get a PMI, you simply switch one chunk for another and on the tracing goes. This document you referred me to looks to specify something with a proper s/g implementation; called ToPA. There doesn't appear to be a limit to the linked entries and you can specify a size per entry, and I don't see anywhere why 4k would be bad. That said, I'm still reading.. Also, 'some implementations', that sounds like a fail right there. Why are there already different implementations, and some which such stupid design, of something this new? How about just saying NO to the ones that requires physically contiguous allocations? No reason to leave those out, because they are still extremely useful for tracing and fit perfectly fine in a model with two buffers. Maybe; but lets start with the sane hardware. Then we'll look at the amount of pain needed to support these broken pieces of crap and decide later. So drop all support for crappy hardware now. Absolutely agreed ... The thing is, BTS itself is rarely used (and not primarily because it's slow, but because its tooling and thus its utility is poor), so the last thing we want is another piece of broken hardware with a quirky software interface to it that tooling has trouble utilizing. 
Or the interface and implementation of BTS support in the kernel
discourage its use and that is why it is so rarely used.

What I'm proposing is a unified interface for trace units to export
their traces and not only the non-crappy ones, in a way that won't
discourage its use from day one.

So I'd like to steer away from the ways in which hardware can be broken
and talk about a usable interface, to begin with.

Regards,
--
Alex
Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units
On Thu, Dec 19, 2013 at 04:30:53PM +0200, Alexander Shishkin wrote:
> Or the interface and implementation of BTS support in the kernel
> discourage its use and that is why it is so rarely used.

I never heard complaints about it. It's a simple dump of from/to address
couples. I just think nobody took the time to develop userspace tooling
to exploit it. But its famous slowness might have had a bad influence on
this.

And maybe also the fact that it's very architecture specific. AMD
doesn't support BTS if I recall correctly. Or maybe it has its own
different implementation?
Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units
Peter Zijlstra pet...@infradead.org writes:

> On Thu, Dec 19, 2013 at 12:57:59PM +0100, Peter Zijlstra wrote:
> So you're basically forced to stop the tracing on PMI anyhow; so your
> continuous tracing argument goes out the window.

It's only stopped inside the PMI handler to set up another buffer, and
is then started again, so no useful trace is lost. PMI handler is not
traced. What you're proposing is stopping it for good till perf collects
the previous data, which will lose us a lot of trace. So my argument
stands.

Regards,
--
Alex
Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units
On Thu, Dec 19, 2013 at 03:49:42PM +0100, Frederic Weisbecker wrote:
> On Thu, Dec 19, 2013 at 04:30:53PM +0200, Alexander Shishkin wrote:
> > Or the interface and implementation of BTS support in the kernel
> > discourage its use and that is why it is so rarely used.
>
> I never heard complaints about it. It's a simple dump of from/to address
> couples. I just think nobody took the time to develop userspace tooling
> to exploit it. But its famous slowness might have had a bad influence on
> this.
>
> And maybe also the fact that it's very architecture specific. AMD
> doesn't support BTS if I recall correctly. Or maybe it has its own
> different implementation?

No, AMD doesn't do anything like that.

There was some attempt to cure some of the wobblies:

  https://lkml.org/lkml/2013/7/8/154

But people never pursued that. That said, if people want overwrite mode
to work for PT we'd need to fix the same thing.
Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units
On Thu, Dec 19, 2013 at 04:30:53PM +0200, Alexander Shishkin wrote:
> So I'd like to steer away from the ways in which hardware can be broken
> and talk about a usable interface, to begin with.

Just dump it into the regular one buffer like I outlined.

That said; we very much need to have at least two architectures
implemented for any of this code to move.

But we cannot ignore the hardware trainwreck; we cannot shape our
interface around something that's utterly broken.

Some hardware is just too broken to support.
Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units
On Thu, Dec 19, 2013 at 04:54:27PM +0200, Alexander Shishkin wrote:
> Peter Zijlstra pet...@infradead.org writes:
> > On Thu, Dec 19, 2013 at 12:57:59PM +0100, Peter Zijlstra wrote:
> > So you're basically forced to stop the tracing on PMI anyhow; so your
> > continuous tracing argument goes out the window.
>
> It's only stopped inside the PMI handler to set up another buffer, and
> is then started again, so no useful trace is lost. PMI handler is not
> traced. What you're proposing is stopping it for good till perf collects
> the previous data, which will lose us a lot of trace. So my argument
> stands.

That is not what I proposed at all. The PMI will swizzle the pages and
resume recording. If there is no space in the output buffer, we'll simply
re-use the existing pages and overwrite data.
Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units
Peter Zijlstra writes: > On Wed, Dec 18, 2013 at 04:22:36PM +0200, Alexander Shishkin wrote: >> > Still confused, if you cannot copy it into one buffer, then why can you >> > copy it into a second buffer? >> >> It's not copied, hardware writes directly into that second buffer. > > Where's the PT documentation? I can't find it in the SDM and your ISA > extensions link is a generic Intel website which is friggin useless > (like all corporate websites strive to be). [1] > Your actual PT patch doesn't describe how the things works either, and > while I could go read the code, I'm too lazy. > > The thing is; why can't you zero-copy whatever buffer the hardware > writes into, into the normal buffer? I'm not sure I understand. You mean, have the buffer split between perf data and trace data? > Machinery like that would also be useful to zero-copy bits out of the > buffer right into the page-cache. Please elaborate. >> I've done the same with BTS now (as Ingo suggested) and it also benefits >> from this approach. > > The problem with DS is that it needs physically contiguous pages is it > not? So you cannot really allocate a large buffer, and you end up > needing to copy or swizzle stuff. Yes and some implementations of PT have the same issue, but you can do a sufficiently large high order allocation and map it to userspace and still no copying (or parsing/decoding) in kernel space required. [1] http://download-software.intel.com/sites/default/files/managed/71/2e/319433-017.pdf Regards, -- Alex -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units
On Wed, Dec 18, 2013 at 04:22:36PM +0200, Alexander Shishkin wrote:
> > Still confused, if you cannot copy it into one buffer, then why can you
> > copy it into a second buffer?
>
> It's not copied, hardware writes directly into that second buffer.

Where's the PT documentation? I can't find it in the SDM and your ISA
extensions link is a generic Intel website which is friggin useless
(like all corporate websites strive to be).

Your actual PT patch doesn't describe how the things works either, and
while I could go read the code, I'm too lazy.

The thing is; why can't you zero-copy whatever buffer the hardware
writes into, into the normal buffer?

Machinery like that would also be useful to zero-copy bits out of the
buffer right into the page-cache.

> I've done the same with BTS now (as Ingo suggested) and it also benefits
> from this approach.

The problem with DS is that it needs physically contiguous pages is it
not? So you cannot really allocate a large buffer, and you end up
needing to copy or swizzle stuff.
Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units
Peter Zijlstra writes:

> On Wed, Dec 18, 2013 at 04:01:04PM +0200, Alexander Shishkin wrote:
>> > Why don't you start by explaining _why_ you need a second stream to
>> > begin with?
>>
>> Oh, I'm sure I've explained it earlier ([1], [2])
>
> See, I didn't read 0 because that information gets lost and patches
> should be self explanatory, and I didn't get to the Intel driver yet
> because well, I got stuck in the generic code.

Sure. The general concept is more important than the actual driver at
this point anyway.

>> but why not. The data
>> in the second stream is generated at a rate which is hundreds of
>> megabytes per second per core. Decoding this data is ~1000 times slower
>> than generating it. Ergo, can't be done in kernel, needs to be exported
>> as-is to userspace for later retrieval and decoding. Doing it via perf
>> stream means an extra copy, which at these rates is a waste. Ergo, a
>> second buffer.
>
> Still confused, if you cannot copy it into one buffer, then why can you
> copy it into a second buffer?

It's not copied, hardware writes directly into that second buffer.

I've done the same with BTS now (as Ingo suggested) and it also benefits
from this approach.

Regards,
--
Alex
Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units
On Wed, Dec 18, 2013 at 04:01:04PM +0200, Alexander Shishkin wrote:
> > Why don't you start by explaining _why_ you need a second stream to
> > begin with?
>
> Oh, I'm sure I've explained it earlier ([1], [2])

See, I didn't read 0 because that information gets lost and patches
should be self explanatory, and I didn't get to the Intel driver yet
because well, I got stuck in the generic code.

> but why not. The data
> in the second stream is generated at a rate which is hundreds of
> megabytes per second per core. Decoding this data is ~1000 times slower
> than generating it. Ergo, can't be done in kernel, needs to be exported
> as-is to userspace for later retrieval and decoding. Doing it via perf
> stream means an extra copy, which at these rates is a waste. Ergo, a
> second buffer.

Still confused, if you cannot copy it into one buffer, then why can you
copy it into a second buffer?
Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units
Peter Zijlstra writes:

> On Wed, Dec 18, 2013 at 03:23:41PM +0200, Alexander Shishkin wrote:
>> Peter Zijlstra writes:
>>
>> > On Wed, Dec 11, 2013 at 02:36:16PM +0200, Alexander Shishkin wrote:
>> >> Instruction tracing PMUs are capable of recording a log of instruction
>> >> execution flow on a cpu core, which can be useful for profiling and crash
>> >> analysis. This patch adds itrace infrastructure for perf events and the
>> >> rest of the kernel to use.
>> >>
>> >> Since such PMUs can produce copious amounts of trace data, it may be
>> >> impractical to process it inside the kernel in real time, but instead export
>> >> raw trace streams to userspace for subsequent analysis. Thus, itrace PMUs
>> >> may export their trace buffers, which can be mmap()ed to userspace from a
>> >> perf event fd with a PERF_EVENT_ITRACE_OFFSET offset. To that end, perf
>> >> is extended to work with multiple ring buffers per event, reusing the
>> >> ring_buffer code in an attempt to reduce complexity.
>> >
>> > Please read the thread here: https://lkml.org/lkml/2008/12/4/64
>> >
>> > On my thoughts of this creative mmap() usage.
>>
>> That's unfortunate, it made sense to me. But let's then have a look at
>> the alternative approaches. Bearing in mind that it is crucial for us to
>> export trace buffers to userspace as opposed to processing the trace
>> data in the kernel, the fact that we still need the normal perf data
>> stream and your dislike for mmap trickery, we need two separate file
>> descriptors: one for the perf data and one for the trace data.
>
> Why don't you start by explaining _why_ you need a second stream to
> begin with?

Oh, I'm sure I've explained it earlier ([1], [2]), but why not. The data
in the second stream is generated at a rate which is hundreds of
megabytes per second per core. Decoding this data is ~1000 times slower
than generating it. Ergo, can't be done in kernel, needs to be exported
as-is to userspace for later retrieval and decoding. Doing it via perf
stream means an extra copy, which at these rates is a waste. Ergo, a
second buffer.

[1] https://lkml.org/lkml/2013/12/11/213
[2] https://lkml.org/lkml/2013/12/11/358

Regards,
--
Alex
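[Editorial sketch] The rate argument above can be put into rough numbers. The message only says "hundreds of megabytes per second per core" and "~1000 times slower"; the concrete figures used below (200 MB/s, 4 cores) are illustrative assumptions, not measurements, and the function names are hypothetical.

```python
def decode_seconds(trace_seconds, slowdown=1000):
    """Time needed to decode a trace that took trace_seconds to record."""
    return trace_seconds * slowdown


def extra_copy_traffic_mb_per_s(rate_mb_per_s, cores):
    """Memory traffic added by one extra copy of the trace stream.

    Every copied byte is read once and written once, hence the factor 2.
    """
    return 2 * rate_mb_per_s * cores


# Assumed figures: 200 MB/s per core, 4 cores, 1000x decode slowdown.
print(decode_seconds(1))
print(extra_copy_traffic_mb_per_s(200, 4))
```

Under these assumptions one second of trace needs on the order of a quarter of an hour to decode, and an extra copy would add well over a gigabyte per second of memory traffic, which is the case being made for letting the hardware write straight into a buffer that userspace can map.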
Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units
On Wed, Dec 18, 2013 at 03:23:41PM +0200, Alexander Shishkin wrote:
> Peter Zijlstra writes:
>
> > On Wed, Dec 11, 2013 at 02:36:16PM +0200, Alexander Shishkin wrote:
> >> Instruction tracing PMUs are capable of recording a log of instruction
> >> execution flow on a cpu core, which can be useful for profiling and crash
> >> analysis. This patch adds itrace infrastructure for perf events and the
> >> rest of the kernel to use.
> >>
> >> Since such PMUs can produce copious amounts of trace data, it may be
> >> impractical to process it inside the kernel in real time, but instead export
> >> raw trace streams to userspace for subsequent analysis. Thus, itrace PMUs
> >> may export their trace buffers, which can be mmap()ed to userspace from a
> >> perf event fd with a PERF_EVENT_ITRACE_OFFSET offset. To that end, perf
> >> is extended to work with multiple ring buffers per event, reusing the
> >> ring_buffer code in an attempt to reduce complexity.
> >
> > Please read the thread here: https://lkml.org/lkml/2008/12/4/64
> >
> > On my thoughts of this creative mmap() usage.
>
> That's unfortunate, it made sense to me. But let's then have a look at
> the alternative approaches. Bearing in mind that it is crucial for us to
> export trace buffers to userspace as opposed to processing the trace
> data in the kernel, the fact that we still need the normal perf data
> stream and your dislike for mmap trickery, we need two separate file
> descriptors: one for the perf data and one for the trace data.

Why don't you start by explaining _why_ you need a second stream to
begin with?
Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units
Peter Zijlstra writes:

> On Wed, Dec 11, 2013 at 02:36:16PM +0200, Alexander Shishkin wrote:
>> Instruction tracing PMUs are capable of recording a log of instruction
>> execution flow on a cpu core, which can be useful for profiling and crash
>> analysis. This patch adds itrace infrastructure for perf events and the
>> rest of the kernel to use.
>>
>> Since such PMUs can produce copious amounts of trace data, it may be
>> impractical to process it inside the kernel in real time, but instead export
>> raw trace streams to userspace for subsequent analysis. Thus, itrace PMUs
>> may export their trace buffers, which can be mmap()ed to userspace from a
>> perf event fd with a PERF_EVENT_ITRACE_OFFSET offset. To that end, perf
>> is extended to work with multiple ring buffers per event, reusing the
>> ring_buffer code in an attempt to reduce complexity.
>
> Please read the thread here: https://lkml.org/lkml/2008/12/4/64
>
> On my thoughts of this creative mmap() usage.

That's unfortunate, it made sense to me. But let's then have a look at
the alternative approaches. Bearing in mind that it is crucial for us to
export trace buffers to userspace as opposed to processing the trace
data in the kernel, the fact that we still need the normal perf data
stream and your dislike for mmap trickery, we need two separate file
descriptors: one for the perf data and one for the trace data.

One way of doing this would be to call sys_perf_event_open() once for
each. The first call would return a file descriptor, which provides good
old perf data buffer; the second call would use this file descriptor for
a group leader and will return another descriptor (thus creating another
perf_event), which, when mmap()ed, will provide a trace buffer.

Or, we could introduce a new PERF_FLAG_XXX to mean that we want a
descriptor with a trace buffer.

And then, of course, one could always add an ioctl(), but that'd
probably be a bit over the top.

Do any of these sound reasonable? Any other possibilities that I'm
missing here?

Thanks,
--
Alex
Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units
On Wed, Dec 11, 2013 at 02:36:16PM +0200, Alexander Shishkin wrote:
> Instruction tracing PMUs are capable of recording a log of instruction
> execution flow on a cpu core, which can be useful for profiling and crash
> analysis. This patch adds itrace infrastructure for perf events and the
> rest of the kernel to use.
>
> Since such PMUs can produce copious amounts of trace data, it may be
> impractical to process it inside the kernel in real time, but instead export
> raw trace streams to userspace for subsequent analysis. Thus, itrace PMUs
> may export their trace buffers, which can be mmap()ed to userspace from a
> perf event fd with a PERF_EVENT_ITRACE_OFFSET offset. To that end, perf
> is extended to work with multiple ring buffers per event, reusing the
> ring_buffer code in an attempt to reduce complexity.

Please read the thread here: https://lkml.org/lkml/2008/12/4/64

On my thoughts of this creative mmap() usage.

tl;dr: no f*cking way.