Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units

2014-01-08 Thread Alexander Shishkin
Andi Kleen writes:

>> So create two events, one for the PT stuff and one to track the
>> side-band stuff. We have a NOP event for just this purpose.
>
> Ok I guess that could work.
>
> Essentially replace the magic mmap offset with a second fd.
>
> Alex, what do you think?

Yes, that's what I suggested some time ago in [1]. A second buffer
(through another fd or otherwise) is an essential thing from my point of
view.

[1] http://marc.info/?l=linux-kernel&m=138737306725663

Regards,
--
Alex
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units

2014-01-08 Thread Peter Zijlstra
restoring the list.. I really should drop all emails you send off list
into /dev/null.

On Wed, Jan 08, 2014 at 09:28:40AM +0100, Peter Zijlstra wrote:
> On Tue, Jan 07, 2014 at 10:23:22PM +0100, Andi Kleen wrote:
> > > Yes we very much rely on the FREEZE bits for LBR. PT and LBR being
> > > mutually exclusive wasn't documented (or I missed it) and completely
> > > blows.
> > 
> > Can you describe why it is a problem? I had considered it only a minor
> > inconvenience, for many things you would use LBRs for PT is far better.
> 
> Because if someone writes a GCC tool using perf-LBR support for some
> basic block analysis, and someone else writes another tool for PT, then
> the first tool magically stops working when the PT tool is started.
> 
> We cannot refuse to create perf-LBR events, because at that time there
> might not be a PT user -- and even if there was one, it might go away.
> 
> But as long as there's a PT user around, the LBR events will not be able
> to be scheduled and will simply starve, for no apparent reason.
> 
> Complete and utterly miserable position.
> 
> And it makes sense to write LBR tools because they cover a much greater
> spread of hardware.
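
To make the starvation scenario above concrete, here is a toy model in
Python. It is not the perf scheduler; it only assumes, as stated in this
thread, that an LBR event cannot be scheduled while any PT user exists.
The function name and the tick-based timeline are made up for illustration.

```python
def lbr_starvation(pt_active):
    """Toy model of PT/LBR mutual exclusion (not kernel code).

    pt_active holds one boolean per scheduling tick: True while some PT
    user exists. Returns (ticks where the LBR event runs, longest run of
    consecutive ticks where it is starved).
    """
    runnable = []
    longest = current = 0
    for tick, pt_in_use in enumerate(pt_active):
        if pt_in_use:
            current += 1  # the LBR event exists but cannot be scheduled
            longest = max(longest, current)
        else:
            current = 0
            runnable.append(tick)
    return runnable, longest


# The LBR event is created at tick 0, when no PT user is around, so there
# is no reason to refuse it. A PT tool then runs during ticks 2-5.
timeline = [False, False, True, True, True, True, False]
runnable, longest = lbr_starvation(timeline)
print(runnable, longest)
```

The LBR tool sees no error at any point; it just stops getting scheduled
for as long as the unrelated PT tool stays around.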


Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units

2014-01-07 Thread Andi Kleen
On Tue, Jan 07, 2014 at 09:51:45PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 07, 2014 at 04:42:55PM +0100, Andi Kleen wrote:
> > > Yes; go read this:
> > > 
> > >  lkml.kernel.org/r/20131219125205.gt3...@twins.programming.kicks-ass.net
> > 
> > Hmm, but AFAIK we're not using freeze counters on PMI today.
> > We just rely on the explicit disabling in the counters through the global
> > ctrl.
> > 
> > So it should be the same as with any other PMI which also does not
> > automatically freeze. Not true?
> 
> Regardless of whether it's used or not; I'd very much like that answered.

The freeze always starts with the counter overflow, regardless of whether the
interrupt is blocked. So everything should be ok.

-Andi


Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units

2014-01-07 Thread Peter Zijlstra
On Tue, Jan 07, 2014 at 04:42:55PM +0100, Andi Kleen wrote:
> > Yes; go read this:
> > 
> >  lkml.kernel.org/r/20131219125205.gt3...@twins.programming.kicks-ass.net
> 
> Hmm, but AFAIK we're not using freeze counters on PMI today.
> We just rely on the explicit disabling in the counters through the global
> ctrl.
> 
> So it should be the same as with any other PMI which also does not
> automatically freeze. Not true?

Regardless of whether it's used or not; I'd very much like that answered.

> Or do you mean interaction with the LBRs here?
> (currently LBRs and PT are mutually exclusive)

Yes we very much rely on the FREEZE bits for LBR. PT and LBR being
mutually exclusive wasn't documented (or I missed it) and completely
blows.


Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units

2014-01-07 Thread Andi Kleen
> So create two events, one for the PT stuff and one to track the
> side-band stuff. We have a NOP event for just this purpose.

Ok I guess that could work.

Essentially replace the magic mmap offset with a second fd.

Alex, what do you think?

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units

2014-01-07 Thread Andi Kleen
> Also, the PT interrupt doesn't actually need to be an NMI; when the
> proposed S/G implementation would actually work as stated there can be
> plenty room left when we trigger the interrupt.

That's true.

-andi

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units

2014-01-07 Thread Andi Kleen
> Yes; go read this:
> 
>  lkml.kernel.org/r/20131219125205.gt3...@twins.programming.kicks-ass.net

Hmm, but AFAIK we're not using freeze counters on PMI today.
We just rely on the explicit disabling in the counters through the global
ctrl.

So it should be the same as with any other PMI which also does not
automatically freeze. Not true?

Or do you mean interaction with the LBRs here?
(currently LBRs and PT are mutually exclusive)

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units

2014-01-07 Thread Peter Zijlstra
On Tue, Jan 07, 2014 at 01:52:31AM +0100, Andi Kleen wrote:
> > > Also of course it requires disabling/enabling PT explicitly for 
> > > every perf message, which is slow. So you add at least 2*WRMSR cost
> > > (thousands of cycles).
> > 
> > That's just dumb, no flush the entire PT buffer into a few large
> > records.
> 
> How would that work?
> 
> You mean a separate buffer and then copy or map?
> 
> --
> 
> Also here are some more problems with interleaving: 
> 
> A common PT config is to just run it as a ring buffer in the background
> and only take the data out when something happens (sample, crash etc.)
> 
> But the side band still needs to be logged and at arbitrary times.
> 
> So the PT wrapping will happen much more often than the perf wrapping.

So create two events, one for the PT stuff and one to track the
side-band stuff. We have a NOP event for just this purpose.


Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units

2014-01-07 Thread Peter Zijlstra
On Mon, Jan 06, 2014 at 03:10:28PM -0800, Andi Kleen wrote:
> > To me it seems very weird that PT is hooked to the same PMI as the
> > normal PMU, it really should have been a different interrupt.
> 
> It's in the same STATUS register, so it's cheap to check both.
> 
> It shouldn't add any new spurious problems (or at least nothing
> worse than what we already have)
> 
> I understand that it would be nice to separate other NMI users
> from all of PMI, but that would be an orthogonal problem.
> 
> Any other issues?

Aside from the fact that PT and the PMU are otherwise unrelated, so it
being in the global status register is weird too.

Also, the PT interrupt doesn't actually need to be an NMI; when the
proposed S/G implementation would actually work as stated there can be
plenty room left when we trigger the interrupt.

But again, see the other email I referenced; the PMU triggering a PMI
while we're in one PT triggered is my biggest concern; esp. since both
have different FREEZE semantics.



Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units

2014-01-07 Thread Peter Zijlstra
On Mon, Jan 06, 2014 at 03:10:28PM -0800, Andi Kleen wrote:
> Peter Zijlstra writes:
> > Also, do clarify the other points I asked about. Esp. the non
> > FREEZE_ON_PMI behaviour of the PT PMI is worrying me immensely.
> 
> The only reason for hardware freeze is when you have a few entries (like
> with LBRs) so the interrupt entry code could overwhelm it.
> 
> But PT is not small, it's gigantic: even with the smallest buffer you
> have many thousands of entries.
> 
> So you will get a few branches in the interrupt entry, but it's not a problem
> because everything you really wanted to trace is still there.
> 
> Eventually the handler disables PT, so there's no risk of racing with
> the update or anything like that.
> 
> Did I miss anything?

Yes; go read this:

 lkml.kernel.org/r/20131219125205.gt3...@twins.programming.kicks-ass.net


Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units

2014-01-06 Thread Andi Kleen
On Tue, Jan 07, 2014 at 01:52:31AM +0100, Andi Kleen wrote:
> > > Also of course it requires disabling/enabling PT explicitly for 
> > > every perf message, which is slow. So you add at least 2*WRMSR cost
> > > (thousands of cycles).
> > 
> > That's just dumb, no flush the entire PT buffer into a few large
> > records.
> 
> How would that work?
> 
> You mean a separate buffer and then copy or map?
> 
> --
> 
> Also here are some more problems with interleaving: 
> 
> A common PT config is to just run it as a ring buffer in the background
> and only take the data out when something happens (sample, crash etc.)
> 
> But the side band still needs to be logged and at arbitrary times.
> 
> So the PT wrapping will happen much more often than the perf wrapping.
> 
> If you interleave you may actually end up with lots of small rings 
> in a single buffer, unless you stop every time the buffer fills up
> (which would add a lot more overhead)
> 
> I suppose it could be somehow parsed, but it would be very different
> from what perf does today.

Thinking about it more it's likely very hard to parse. Dropping instructions is
fine, dropping perf metadata is not (or only as last resort). 

If we miss a MMAP we may never be able to parse that code region.
If we miss a context switch we may be also completely lost until the
next switch.

That means PT couldn't overwrite perf metadata normally.

So you could easily get into situations where the interleaved PT buffer
is between two perf metadata statements and ends up really small, while
large other parts of the buffer are unused.

The only way around it would be likely to move entries around -- to 
garbage collect so to say -- but doing that non-blocking from a NMI will be
challenging.

With the separate buffers we don't have any of these problems.

-Andi
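
The fragmentation argument above can be sketched with a small Python
model. It assumes a single circular buffer in which perf metadata records
are pinned (PT may not overwrite them) and PT data can only use the gaps
in between. The function name, buffer size and record positions are
hypothetical; this is not how perf lays out its buffer today.

```python
def free_gaps(buf_size, pinned):
    """Sizes of the free gaps that follow each pinned (offset, length)
    record in a circular buffer of buf_size bytes.

    Assumes the pinned records do not overlap each other.
    """
    if not pinned:
        return [buf_size]
    recs = sorted(pinned)
    gaps = []
    for i, (off, length) in enumerate(recs):
        end = off + length
        next_off = recs[(i + 1) % len(recs)][0]
        # Modulo handles the wrap from the last record back to the first.
        gaps.append((next_off - end) % buf_size)
    return gaps


# Three small metadata records pinned in a 64 KiB buffer.
gaps = free_gaps(65536, [(0, 64), (200, 32), (40000, 48)])
print(gaps)
```

If the PT writer happens to sit in the first gap, its effective ring is
only 136 bytes even though most of the 64 KiB buffer is unused. With
separate buffers the whole PT buffer is always one ring.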


Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units

2014-01-06 Thread Andi Kleen
> > Also of course it requires disabling/enabling PT explicitly for 
> > every perf message, which is slow. So you add at least 2*WRMSR cost
> > (thousands of cycles).
> 
> That's just dumb, no flush the entire PT buffer into a few large
> records.

How would that work?

You mean a separate buffer and then copy or map?

--

Also here are some more problems with interleaving: 

A common PT config is to just run it as a ring buffer in the background
and only take the data out when something happens (sample, crash etc.)

But the side band still needs to be logged and at arbitrary times.

So the PT wrapping will happen much more often than the perf wrapping.

If you interleave you may actually end up with lots of small rings 
in a single buffer, unless you stop every time the buffer fills up
(which would add a lot more overhead)

I suppose it could be somehow parsed, but it would be very different
from what perf does today.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.
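
As a rough illustration of the wrapping point above, the following Python
sketch compares how much history an overwriting ring retains when PT and
side band have separate rings versus one shared ring. All rates and sizes
are hypothetical numbers picked only to show the difference in scale; none
of them come from this thread.

```python
KiB = 1024
MiB = 1024 * KiB


def history_seconds(ring_bytes, bytes_per_second):
    """Seconds of data an overwriting ring retains before it wraps."""
    return ring_bytes / bytes_per_second


# Hypothetical rates and sizes (assumptions, not measurements).
PT_RATE = 64 * MiB        # trace bytes per second
SIDEBAND_RATE = 16 * KiB  # mmap/context-switch record bytes per second
PT_RING = 4 * MiB
SIDEBAND_RING = 512 * KiB

separate_pt = history_seconds(PT_RING, PT_RATE)
separate_sideband = history_seconds(SIDEBAND_RING, SIDEBAND_RATE)
# One shared ring of the same total size, wrapping at the combined rate.
shared = history_seconds(PT_RING + SIDEBAND_RING, PT_RATE + SIDEBAND_RATE)

print(separate_pt, separate_sideband, shared)
```

With separate rings the side band keeps 32 seconds of records while the
PT ring wraps every 1/16 of a second. In a shared ring the side band is
overwritten at nearly the PT rate, so the records needed to decode the
trace can be gone long before anyone reads the buffer.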


Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units

2014-01-06 Thread Andi Kleen
Peter Zijlstra writes:

Can you please clarify your position on the interleaved buffer?

I still can't see how it is an efficient design.

It's generally true in scatter-gather (be it software or hardware)
that each additional SG entry increases the cost. So to make things
efficient you always want to minimize entries as much as possible.

>> I don't think the PT design is broken in any way, it's straight 
>> forward and simple.
>
> Also, do clarify the other points I asked about. Esp. the non
> FREEZE_ON_PMI behaviour of the PT PMI is worrying me immensely.

The only reason for hardware freeze is when you have a few entries (like
with LBRs) so the interrupt entry code could overwhelm it.

But PT is not small, it's gigantic: even with the smallest buffer you
have many thousands of entries.

So you will get a few branches in the interrupt entry, but it's not a problem
because everything you really wanted to trace is still there.

Eventually the handler disables PT, so there's no risk of racing with
the update or anything like that.

Did I miss anything?

> To me it seems very weird that PT is hooked to the same PMI as the
> normal PMU, it really should have been a different interrupt.

It's in the same STATUS register, so it's cheap to check both.

It shouldn't add any new spurious problems (or at least nothing
worse than what we already have)

I understand that it would be nice to separate other NMI users
from all of PMI, but that would be an orthogonal problem.

Any other issues?

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only


Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units

2014-01-06 Thread Peter Zijlstra
> I don't think the PT design is broken in any way, it's straight 
> forward and simple.

Also, do clarify the other points I asked about. Esp. the non
FREEZE_ON_PMI behaviour of the PT PMI is worrying me immensely.

To me it seems very weird that PT is hooked to the same PMI as the
normal PMU, it really should have been a different interrupt.


Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units

2014-01-06 Thread Peter Zijlstra
On Mon, Jan 06, 2014 at 01:25:02PM -0800, Andi Kleen wrote:
> Peter Zijlstra writes:
> 
> > On Thu, Dec 19, 2013 at 04:30:53PM +0200, Alexander Shishkin wrote:
> >> So I'd like to steer away from the ways in which hardware can be broken
> >> and talk about a usable interface, to begin with.
> >
> > Just dump it into the regular one buffer like I outlined.
> 
> Just getting back to this. 
> 
> Do you realize that PT buffers have to be page aligned?
> 
> So to mix it with a regular perf buffer would need padding every PT
> message by 4K, which wastes a lot of memory. The side band messages
> are usually only a few bytes (e.g. context switch).
> 
> If the sideband is frequent it could even take up >half of the buffer,
> but mostly only with padding.
> 
> Is that what you intended?
> 
> perf doesn't support gaps today, so your proposal wouldn't even
> seem to fit into the current perf design.

That would be a really trivial addition.

> Also of course it requires disabling/enabling PT explicitly for 
> every perf message, which is slow. So you add at least 2*WRMSR cost
> (thousands of cycles).

That's just dumb, no flush the entire PT buffer into a few large
records.

> > That said; we very much need to have at least two architectures
> > implemented for any of this code to move.
> >
> > But we cannot ignore the hardware trainwreck; we cannot shape our
> > interface around something that's utterly broken.
> >
> > Some hardware is just too broken to support.
> 
> I don't think the PT design is broken in any way, it's straight 
> forward and simple.

If it were actually implemented like the spec says and not have this
crappy S/G limitation, then maybe.

> Trying to mix hardware tracing and software tracing in the same buffer
> on the other hand ...
> 
> Anyways if perf is not flexible enough to support this I suppose
> it could switch to a simple device driver, and only run perf with
> separate fds for side band purposes. 
> 
> Would you prefer that?

Don't be stupid.


Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units

2014-01-06 Thread Andi Kleen
Peter Zijlstra writes:

> On Thu, Dec 19, 2013 at 04:30:53PM +0200, Alexander Shishkin wrote:
>> So I'd like to steer away from the ways in which hardware can be broken
>> and talk about a usable interface, to begin with.
>
> Just dump it into the regular one buffer like I outlined.

Just getting back to this. 

Do you realize that PT buffers have to be page aligned?

So to mix it with a regular perf buffer would need padding every PT
message by 4K, which wastes a lot of memory. The side band messages
are usually only a few bytes (e.g. context switch).

If the sideband is frequent it could even take up >half of the buffer,
but mostly only with padding.

Is that what you intended?

perf doesn't support gaps today, so your proposal wouldn't even
seem to fit into the current perf design.

Also of course it requires disabling/enabling PT explicitly for 
every perf message, which is slow. So you add at least 2*WRMSR cost
(thousands of cycles).

> That said; we very much need to have at least two architectures
> implemented for any of this code to move.
>
> But we cannot ignore the hardware trainwreck; we cannot shape our
> interface around something that's utterly broken.
>
> Some hardware is just too broken to support.

I don't think the PT design is broken in any way, it's straight 
forward and simple.

Trying to mix hardware tracing and software tracing in the same buffer
on the other hand ...

Anyways if perf is not flexible enough to support this I suppose
it could switch to a simple device driver, and only run perf with
separate fds for side band purposes. 

Would you prefer that?

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only
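
The padding cost described above can be worked through with a short
Python sketch. It assumes 4K pages and a single interleaved buffer in
which every PT chunk must start on a page boundary while side band
records are packed as-is. The record sizes and the alternating workload
are hypothetical.

```python
PAGE_SIZE = 4096  # PT output must start page aligned (per this thread)


def interleaved_layout(records):
    """Lay out (kind, size) records in one buffer.

    Returns (total_bytes, padding_bytes). Before each "pt" chunk the
    offset is padded up to the next page boundary; any other record kind
    is appended without alignment.
    """
    offset = 0
    padding = 0
    for kind, size in records:
        if kind == "pt":
            pad = -offset % PAGE_SIZE  # bytes needed to reach a boundary
            padding += pad
            offset += pad
        offset += size
    return offset, padding


# Hypothetical workload: a 32-byte side band record (e.g. a context
# switch) followed by one page of PT data, repeated 100 times.
records = [("sideband", 32), ("pt", PAGE_SIZE)] * 100
total, padding = interleaved_layout(records)
print(total, padding, padding / total)
```

In this example each 32-byte record drags in 4064 bytes of padding, so
roughly half of the buffer holds no data at all.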


Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units

2014-01-06 Thread Andi Kleen
Peter Zijlstra pet...@infradead.org writes:

 On Thu, Dec 19, 2013 at 04:30:53PM +0200, Alexander Shishkin wrote:
 So I'd like to steer away from the ways in which hardware can be broken
 and talk about a usable interface, to begin with.

 Just dump it into the regular one buffer like I outlined.

Just getting back to this. 

Do you realize that PT buffers have to be page aligned. 

So to mix it with a regular perf buffer would need padding every PT
message by 4K, which wastes a lot of memory. The side band messages
are usually only a few bytes (e.g. context switch).

If the sideband is frequent it could even take up >half of the buffer,
but mostly only with padding.

Is that what you intended?

perf doesn't support gaps today, so your proposal wouldn't even
seem to fit into the current perf design.

Also of course it requires disabling/enabling PT explicitly for 
every perf message, which is slow. So you add at least 2*WRMSR cost
(thousands of cycles).

> That said; we very much need to have at least two architectures
> implemented for any of this code to move.
>
> But we cannot ignore the hardware trainwreck; we cannot shape our
> interface around something that's utterly broken.
>
> Some hardware is just too broken to support.

I don't think the PT design is broken in any way, it's straight 
forward and simple.

Trying to mix hardware tracing and software tracing in the same buffer
on the other hand ...

Anyways if perf is not flexible enough to support this I suppose
it could switch to a simple device driver, and only run perf with
separate fds for side band purposes. 

Would you prefer that?

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only


Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units

2014-01-06 Thread Peter Zijlstra
On Mon, Jan 06, 2014 at 01:25:02PM -0800, Andi Kleen wrote:
> Peter Zijlstra <pet...@infradead.org> writes:
>
> > On Thu, Dec 19, 2013 at 04:30:53PM +0200, Alexander Shishkin wrote:
> > So I'd like to steer away from the ways in which hardware can be broken
> > and talk about a usable interface, to begin with.
> >
> > Just dump it into the regular one buffer like I outlined.
>
> Just getting back to this.
>
> Do you realize that PT buffers have to be page aligned.
>
> So to mix it with a regular perf buffer would need padding every PT
> message by 4K, which wastes a lot of memory. The side band messages
> are usually only a few bytes (e.g. context switch).
>
> If the sideband is frequent it could even take up >half of the buffer,
> but mostly only with padding.
>
> Is that what you intended?
>
> perf doesn't support gaps today, so your proposal wouldn't even
> seem to fit into the current perf design.

That would be a really trivial addition.

> Also of course it requires disabling/enabling PT explicitly for
> every perf message, which is slow. So you add at least 2*WRMSR cost
> (thousands of cycles).

That's just dumb; no, flush the entire PT buffer into a few large
records.

> > That said; we very much need to have at least two architectures
> > implemented for any of this code to move.
> >
> > But we cannot ignore the hardware trainwreck; we cannot shape our
> > interface around something that's utterly broken.
> >
> > Some hardware is just too broken to support.
>
> I don't think the PT design is broken in any way, it's straight
> forward and simple.

If it were actually implemented like the spec says and did not have this
crappy S/G limitation, then maybe.

> Trying to mix hardware tracing and software tracing in the same buffer
> on the other hand ...
>
> Anyways if perf is not flexible enough to support this I suppose
> it could switch to a simple device driver, and only run perf with
> separate fds for side band purposes.
>
> Would you prefer that?

Don't be stupid.


Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units

2014-01-06 Thread Peter Zijlstra
> I don't think the PT design is broken in any way, it's straight
> forward and simple.

Also, do clarify the other points I asked about. Esp. the non
FREEZE_ON_PMI behaviour of the PT PMI is worrying me immensely.

To me it seems very weird that PT is hooked to the same PMI as the
normal PMU, it really should have been a different interrupt.


Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units

2014-01-06 Thread Andi Kleen
Peter Zijlstra <pet...@infradead.org> writes:

Can you please clarify your position on the interleaved buffer?

I still can't see how it is an efficient design.

It's generally true in scatter-gather (be it software or hardware)
that each additional SG entry increases the cost. So to make things
efficient you always want to minimize entries as much as possible.

> I don't think the PT design is broken in any way, it's straight
> forward and simple.
>
> Also, do clarify the other points I asked about. Esp. the non
> FREEZE_ON_PMI behaviour of the PT PMI is worrying me immensely.

The only reason for hardware freeze is when you have a few entries (like
with LBRs) so the interrupt entry code could overwhelm it.

But PT is not small, it's gigantic: even with the smallest buffer you
have many thousands of entries.

So you will get a few branches in the interrupt entry, but it's not a problem
because everything you really wanted to trace is still there.

Eventually the handler disables PT, so there's no risk of racing with
the update or anything like that.

Did I miss anything?

> To me it seems very weird that PT is hooked to the same PMI as the
> normal PMU, it really should have been a different interrupt.

It's in the same STATUS register, so it's cheap to check both.

It shouldn't add any new spurious problems (or at least nothing
worse than what we already have).

I understand that it would be nice to separate other NMI users
from all of PMI, but that would be an orthogonal problem.

Any other issues?

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only
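
[Illustration] A minimal sketch of the "same STATUS register, so it's cheap
to check both" point above: one read of a status word tells the handler
whether counters overflowed, whether the trace unit raised the PMI, or both.
The bit position used for the trace PMI and the number of counters are
assumptions made for this example; check the architecture manual for the
real layout.

```python
TRACE_PMI_BIT = 55   # assumption for this sketch, not taken from the thread
NUM_COUNTERS = 4     # assumption: four general-purpose counters in bits 0..3


def pending_work(status):
    """Split one status word into (overflowed counter indices, trace PMI flag)."""
    counters = [i for i in range(NUM_COUNTERS) if (status >> i) & 1]
    trace_pmi = bool((status >> TRACE_PMI_BIT) & 1)
    return counters, trace_pmi


# Counters 0 and 2 overflowed and the trace unit also signalled.
print(pending_work((1 << TRACE_PMI_BIT) | 0b0101))
```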


Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units

2014-01-06 Thread Andi Kleen
> > Also of course it requires disabling/enabling PT explicitly for
> > every perf message, which is slow. So you add at least 2*WRMSR cost
> > (thousands of cycles).
>
> That's just dumb, no flush the entire PT buffer into a few large
> records.

How would that work?

You mean a separate buffer and then copy or map?

--

Also here are some more problems with interleaving: 

A common PT config is to just run it as a ring buffer in the background
and only take the data out when something happens (sample, crash etc.)

But the side band still needs to be logged and at arbitrary times.

So the PT wrapping will happen much more often than the perf wrapping.

If you interleave you may actually end up with lots of small rings 
in a single buffer, unless you stop every time the buffer fills up
(which would add a lot more overhead)

I suppose it could be somehow parsed, but it would be very different
from what perf does today.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units

2014-01-06 Thread Andi Kleen
On Tue, Jan 07, 2014 at 01:52:31AM +0100, Andi Kleen wrote:
> > > Also of course it requires disabling/enabling PT explicitly for
> > > every perf message, which is slow. So you add at least 2*WRMSR cost
> > > (thousands of cycles).
> >
> > That's just dumb, no flush the entire PT buffer into a few large
> > records.
>
> How would that work?
>
> You mean a separate buffer and then copy or map?
>
> --
>
> Also here are some more problems with interleaving:
>
> A common PT config is to just run it as a ring buffer in the background
> and only take the data out when something happens (sample, crash etc.)
>
> But the side band still needs to be logged and at arbitrary times.
>
> So the PT wrapping will happen much more often than the perf wrapping.
>
> If you interleave you may actually end up with lots of small rings
> in a single buffer, unless you stop every time the buffer fills up
> (which would add a lot more overhead)
>
> I suppose it could be somehow parsed, but it would be very different
> from what perf does today.

Thinking about it more it's likely very hard to parse. Dropping instructions is
fine, dropping perf metadata is not (or only as last resort). 

If we miss a MMAP we may never be able to parse that code region.
If we miss a context switch we may be also completely lost until the
next switch.

That means PT couldn't overwrite perf metadata normally.

So you could easily get into situations where the interleaved PT buffer
is between two perf metadata statements and ends up really small, while
large other parts of the buffer are unused.

The only way around it would be likely to move entries around -- to 
garbage collect so to say -- but doing that non-blocking from a NMI will be
challenging.

With the separate buffers we don't have any of these problems.

-Andi


Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units

2013-12-19 Thread Peter Zijlstra
On Thu, Dec 19, 2013 at 04:54:27PM +0200, Alexander Shishkin wrote:
> Peter Zijlstra  writes:
> 
> > On Thu, Dec 19, 2013 at 12:57:59PM +0100, Peter Zijlstra wrote:
> > So you're basically forced to stop the tracing on PMI anyhow; so your
> > continuous tracing argument goes out the window.
> 
> It's only stopped inside the PMI handler to set up another buffer, and
> is then started again, so no useful trace is lost. PMI handler is not
> traced. What you're proposing is stopping it for good till perf collects
> the previous data, which will lose us a lot of trace. So my argument
> stands.

That is not what I proposed at all.

The PMI will swizzle the pages and resume recording. If there is no
space in the output buffer, we'll simply re-use the existing pages and
overwrite data.





Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units

2013-12-19 Thread Peter Zijlstra
On Thu, Dec 19, 2013 at 04:30:53PM +0200, Alexander Shishkin wrote:
> So I'd like to steer away from the ways in which hardware can be broken
> and talk about a usable interface, to begin with.

Just dump it into the regular one buffer like I outlined.

That said; we very much need to have at least two architectures
implemented for any of this code to move.

But we cannot ignore the hardware trainwreck; we cannot shape our
interface around something that's utterly broken.

Some hardware is just too broken to support.


Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units

2013-12-19 Thread Peter Zijlstra
On Thu, Dec 19, 2013 at 03:49:42PM +0100, Frederic Weisbecker wrote:
> On Thu, Dec 19, 2013 at 04:30:53PM +0200, Alexander Shishkin wrote:
> > Or the interface and implementation of BTS support in the kernel
> > discourage its use and that is why it is so rarely used.
> 
> I never heard complains about it. It's a simple dump of from/to address 
> couples.
> I just think nobody take the time to develop userspace tooling to exploit it.
> But it's famous slowness might have had a bad influence on this. And may be
> also the fact that it's very architecture specific. AMD doesn't support BTS 
> if I recall
> correctly. Or may be it has its own different implementation?

No AMD doesn't do anything like that.

There was some attempt to cure some of the wobblies:

  https://lkml.org/lkml/2013/7/8/154

But people never pursued that.

That said, if people want overwrite mode to work for PT we'd need to fix
the same thing.


Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units

2013-12-19 Thread Alexander Shishkin
Peter Zijlstra  writes:

> On Thu, Dec 19, 2013 at 12:57:59PM +0100, Peter Zijlstra wrote:
> So you're basically forced to stop the tracing on PMI anyhow; so your
> continuous tracing argument goes out the window.

It's only stopped inside the PMI handler to set up another buffer, and
is then started again, so no useful trace is lost. PMI handler is not
traced. What you're proposing is stopping it for good till perf collects
the previous data, which will lose us a lot of trace. So my argument
stands.

Regards,
--
Alex


Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units

2013-12-19 Thread Frederic Weisbecker
On Thu, Dec 19, 2013 at 04:30:53PM +0200, Alexander Shishkin wrote:
> Or the interface and implementation of BTS support in the kernel
> discourage its use and that is why it is so rarely used.

I never heard complaints about it. It's a simple dump of from/to address couples.
I just think nobody took the time to develop userspace tooling to exploit it.
But its famous slowness might have had a bad influence on this. And maybe
also the fact that it's very architecture specific. AMD doesn't support BTS if
I recall correctly. Or maybe it has its own different implementation?


Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units

2013-12-19 Thread Alexander Shishkin
Ingo Molnar  writes:

> * Peter Zijlstra  wrote:
>
>> On Thu, Dec 19, 2013 at 01:17:51PM +0200, Alexander Shishkin wrote:
>> > Peter Zijlstra  writes:
>> > 
>> > > On Thu, Dec 19, 2013 at 09:53:44AM +0200, Alexander Shishkin wrote:
>> > >> Yes and some implementations of PT have the same issue, but you can do a
>> > >> sufficiently large high order allocation and map it to userspace and
>> > >> still no copying (or parsing/decoding) in kernel space required.
>> > >
>> > > What's sufficiently large? The largest we could possibly allocate is
>> > > something like 4k^11 which is 8M or so. That's not all that big given
>> > > you keep saying it generates in the order of 100 MB/s.
>> > 
>> > One chunk is 8M. You can have as many as the buddy allocator permits you
>> > to have. When you get a PMI, you simply switch one chunk for another and
>> > on the tracing goes.
>> 
>> This document you referred me to looks to specify something with a
>> proper s/g implementation; called ToPA. There doesn't appear to be a
>> limit to the linked entries and you can specify a size per entry, and I
>> don't see anywhere why 4k would be bad.
>> 
>> That said, I'm still reading..
>> 
>> > > Also, 'some implementations', that sounds like a fail right there. Why
>> > > are there already different implementations, and some which such stupid
>> > > design, of something this new?
>> > >
>> > > How about just saying NO to the ones that requires physically contiguous
>> > > allocations?
>> > 
>> > No reason to leave those out, because they are still extremely useful
>> > for tracing and fit perfectly fine in a model with two buffers.
>> 
>> Maybe; but lets start with the sane hardware. Then we'll look at the 
>> amount of pain needed to support these broken pieces of crap and 
>> decide later.
>> 
>> So drop all support for crappy hardware now.
>
> Absolutely agreed ...
>
> The thing is, BTS itself is rarely used (and not primarily because 
> it's slow, but because its tooling and thus its utility is poor), so 
> the last thing we want is another piece of broken hardware with a 
> quirky software interface to it that tooling has trouble utilizing.

Or the interface and implementation of BTS support in the kernel
discourage its use and that is why it is so rarely used.

What I'm proposing is a unified interface for trace units to export
their traces and not only the "non-crappy" ones, in a way that won't
discourage its use from day one.

So I'd like to steer away from the ways in which hardware can be broken
and talk about a usable interface, to begin with.

Regards,
--
Alex


Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units

2013-12-19 Thread Peter Zijlstra
On Thu, Dec 19, 2013 at 12:57:59PM +0100, Peter Zijlstra wrote:
> On Thu, Dec 19, 2013 at 12:28:12PM +0100, Peter Zijlstra wrote:
> > This document you referred me to looks to specify something with a
> > proper s/g implementation; called ToPA. There doesn't appear to be a
> > limit to the linked entries and you can specify a size per entry, and I
> > don't see anywhere why 4k would be bad.
> > 
> > That said, I'm still reading..
> 
> Found it:
> 
> "Single Output Region ToPA Implementation
> 
> The first processor generation to implement Intel PT supports only ToPA
> configurations with a single ToPA entry followed by an END entry that
> points back to the first entry (creating one circular output buffer).
> Such processors enumerate CPUID.(EAX=14H,ECX=0):EBX[bit 1] as 0."
> 
> So basically you guys buggered the hardware.
> 

"ToPA PMI and Single Output Region ToPA Implementation

A processor that supports only a single ToPA output region
implementation (such that only one output region is supported; see
above) will attempt to signal a ToPA PMI interrupt before the output
wraps and overwrites the top of the buffer. To support this
functionality, the PMI handler should disable packet generation as soon
as possible.  Due to PMI skid, it is possible, in rare cases, that the
wrap will have occurred before the PMI is delivered. Software can avoid
this by setting the STOP bit in the ToPA entry (see Table 11-3); this
will disable tracing once the region is filled, and no wrap will occur.
This approach has the downside of disabling packet generation so that
some of the instructions that led up to the PMI will not be traced. If
the PMI skid is significant enough to cause the region to fill and
tracing to be disabled, the PMI handler will need to clear the
IA32_RTIT_STATUS.Stopped indication before tracing can resume."


So you're basically forced to stop the tracing on PMI anyhow; so your
continuous tracing argument goes out the window.

Also, what a complete clusterfuck. I think we're far better off
pretending PT doesn't exist until it's fixed.


Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units

2013-12-19 Thread Peter Zijlstra


Found more:

"Note that no “freezing” takes place with the ToPA PMI. Thus, packet
generation is not frozen, and the interrupt handler will be traced
(though filtering can prevent this). Further, the setting of
IA32_DEBUGCTL.Freeze_Perfmon_on_PMI is ignored and performance counters
are not frozen by a ToPA PMI."


Can someone confirm with the hardware people what happens when an actual
PMU counter overflows and tries to raise the PMI while we're in one that
ignores the 'Freeze_perfmon_on_PMI' bit?

You cannot assert an interrupt that is already asserted, but the running
handler can see the overflow status bit set and will likely process it,
assuming the PMU is actually frozen.

Also, this just smells ripe for errata and ugly bugs.



Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units

2013-12-19 Thread Alexander Shishkin
Peter Zijlstra  writes:

> On Thu, Dec 19, 2013 at 01:14:09PM +0200, Alexander Shishkin wrote:
>> Peter Zijlstra  writes:
>> 
>> > On Thu, Dec 19, 2013 at 09:53:44AM +0200, Alexander Shishkin wrote:
>> >> Peter Zijlstra  writes:
>> >> > The thing is; why can't you zero-copy whatever buffer the hardware
>> >> > writes into, into the normal buffer?
>> >> 
>> >> I'm not sure I understand. You mean, have the buffer split between perf
>> >> data and trace data?
>> >
>> > Yep, I don't see any reason why this wouldn't work.
>> >
>> > When the hardware thing sends an interrupt to notify us its buffer is
>> > 'full', stop the recorder, try to create a single record in the buffer
>> > that's big enough + 1 page, then swizzle the hardware pages and the
>> > buffer pages for that record, using the +1 page to page align the actual
>> > data. Then (re)start the hardware on the 'new' pages.
>> 
>> We configure the hardware thing to send an interrupt *before* the buffer
>> is full, keep the recorder running while userspace saves stuff to
>> perf.data file. Recording only stops if perf fails to read the trace
>> data out fast enough and the buffer fills up. So you'd have a complete
>> trace.
>> 
>> Also, we have what we call a "snapshot" mode, where we keep the hardware
>> thing running, writing data to a circular buffer till it's stopped, in
>> case we're only interested in the most recent trace data to see what it
>> is that takes too long to respond, etc. And while it is running, we're
>> getting new records in the perf stream all the time (mmaps, etc).
>> 
>> Put simple: perf data and trace data are two different separate types of
>> information that originate from two different sources, can exist and
>> make sense separately from one another and should not be mixed.
>
> Well you're either having to change your stance or we're done talking
> right now.

I'm making a case in favor of 2 separate buffers just like you asked in
one of the previous emails. It's backed by some very real usecases. That
said, I'm not personally attached to any one design, only what makes
sense. There is no 'stance'.

Regards,
--
Alex


Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units

2013-12-19 Thread Ingo Molnar

* Peter Zijlstra  wrote:

> On Thu, Dec 19, 2013 at 01:17:51PM +0200, Alexander Shishkin wrote:
> > Peter Zijlstra  writes:
> > 
> > > On Thu, Dec 19, 2013 at 09:53:44AM +0200, Alexander Shishkin wrote:
> > >> Yes and some implementations of PT have the same issue, but you can do a
> > >> sufficiently large high order allocation and map it to userspace and
> > >> still no copying (or parsing/decoding) in kernel space required.
> > >
> > > What's sufficiently large? The largest we could possibly allocate is
> > > something like 4k^11 which is 8M or so. That's not all that big given
> > > you keep saying it generates in the order of 100 MB/s.
> > 
> > One chunk is 8M. You can have as many as the buddy allocator permits you
> > to have. When you get a PMI, you simply switch one chunk for another and
> > on the tracing goes.
> 
> This document you referred me to looks to specify something with a
> proper s/g implementation; called ToPA. There doesn't appear to be a
> limit to the linked entries and you can specify a size per entry, and I
> don't see anywhere why 4k would be bad.
> 
> That said, I'm still reading..
> 
> > > Also, 'some implementations', that sounds like a fail right there. Why
> > > are there already different implementations, and some which such stupid
> > > design, of something this new?
> > >
> > > How about just saying NO to the ones that requires physically contiguous
> > > allocations?
> > 
> > No reason to leave those out, because they are still extremely useful
> > for tracing and fit perfectly fine in a model with two buffers.
> 
> Maybe; but lets start with the sane hardware. Then we'll look at the 
> amount of pain needed to support these broken pieces of crap and 
> decide later.
> 
> So drop all support for crappy hardware now.

Absolutely agreed ...

The thing is, BTS itself is rarely used (and not primarily because 
it's slow, but because its tooling and thus its utility is poor), so 
the last thing we want is another piece of broken hardware with a 
quirky software interface to it that tooling has trouble utilizing.

Sigh, when will Intel learn to talk to Linux PMU experts _before_ 
committing to a hardware interface??

Thanks,

Ingo


Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units

2013-12-19 Thread Peter Zijlstra
On Thu, Dec 19, 2013 at 12:28:12PM +0100, Peter Zijlstra wrote:
> This document you referred me to looks to specify something with a
> proper s/g implementation; called ToPA. There doesn't appear to be a
> limit to the linked entries and you can specify a size per entry, and I
> don't see anywhere why 4k would be bad.
> 
> That said, I'm still reading..

Found it:

"Single Output Region ToPA Implementation

The first processor generation to implement Intel PT supports only ToPA
configurations with a single ToPA entry followed by an END entry that
points back to the first entry (creating one circular output buffer).
Such processors enumerate CPUID.(EAX=14H,ECX=0):EBX[bit 1] as 0."

So basically you guys buggered the hardware.

More specifically, what actual hardware is this? Is this first
generation HSW or so?

Please enumerate the actual hardware that supports this PT stuff and
which hardware has it fixed.
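
[Illustration] The enumeration rule in the excerpt quoted above can be
sketched as a bit test on a register value that the caller has already
obtained from CPUID leaf 14H, sub-leaf 0. The register and bit follow the
quoted text (EBX bit 1); later revisions of the manual may assign this bit
differently, so treat the layout as an assumption and check the current
documentation. The function names are invented for this example.

```python
def bit_is_set(value, bit):
    """Return True when the given bit of value is 1."""
    return bool((value >> bit) & 1)


def single_region_topa_only(cpuid_14h_0_ebx):
    """True when the quoted enumeration bit reads 0.

    Per the quoted excerpt, such processors support only one ToPA entry
    followed by an END entry that points back to it.
    """
    return not bit_is_set(cpuid_14h_0_ebx, 1)


print(single_region_topa_only(0b01))  # bit 1 clear
print(single_region_topa_only(0b10))  # bit 1 set
```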


Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units

2013-12-19 Thread Alexander Shishkin
Peter Zijlstra  writes:

> On Thu, Dec 19, 2013 at 01:17:51PM +0200, Alexander Shishkin wrote:
>> Peter Zijlstra  writes:
>> 
>> > On Thu, Dec 19, 2013 at 09:53:44AM +0200, Alexander Shishkin wrote:
>> >> Yes and some implementations of PT have the same issue, but you can do a
>> >> sufficiently large high order allocation and map it to userspace and
>> >> still no copying (or parsing/decoding) in kernel space required.
>> >
>> > What's sufficiently large? The largest we could possibly allocate is
>> > something like 4k^11 which is 8M or so. That's not all that big given
>> > you keep saying it generates in the order of 100 MB/s.
>> 
>> One chunk is 8M. You can have as many as the buddy allocator permits you
>> to have. When you get a PMI, you simply switch one chunk for another and
>> on the tracing goes.
>
> This document you referred me to looks to specify something with a
> proper s/g implementation; called ToPA. There doesn't appear to be a
> limit to the linked entries and you can specify a size per entry, and I
> don't see anywhere why 4k would be bad.

JFYI, 11.2.4.1, "Single Output Region ToPA Implementation".

Regards,
--
Alex


Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units

2013-12-19 Thread Peter Zijlstra
On Thu, Dec 19, 2013 at 01:17:51PM +0200, Alexander Shishkin wrote:
> Peter Zijlstra  writes:
> 
> > On Thu, Dec 19, 2013 at 09:53:44AM +0200, Alexander Shishkin wrote:
> >> Yes and some implementations of PT have the same issue, but you can do a
> >> sufficiently large high order allocation and map it to userspace and
> >> still no copying (or parsing/decoding) in kernel space required.
> >
> > What's sufficiently large? The largest we could possibly allocate is
> > something like 4k^11 which is 8M or so. That's not all that big given
> > you keep saying it generates in the order of 100 MB/s.
> 
> One chunk is 8M. You can have as many as the buddy allocator permits you
> to have. When you get a PMI, you simply switch one chunk for another and
> on the tracing goes.

This document you referred me to looks to specify something with a
proper s/g implementation; called ToPA. There doesn't appear to be a
limit to the linked entries and you can specify a size per entry, and I
don't see anywhere why 4k would be bad.

That said, I'm still reading..

> > Also, 'some implementations', that sounds like a fail right there. Why
> > are there already different implementations, and some which such stupid
> > design, of something this new?
> >
> > How about just saying NO to the ones that requires physically contiguous
> > allocations?
> 
> No reason to leave those out, because they are still extremely useful
> for tracing and fit perfectly fine in a model with two buffers.

Maybe; but let's start with the sane hardware. Then we'll look at the
amount of pain needed to support these broken pieces of crap and decide
later.

So drop all support for crappy hardware now.
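
[Illustration] The sizes quoted above work out as follows, reading "4k^11"
as a 4 KiB page shifted left by an allocation order of 11, and "100 MB/s"
as 10^8 bytes per second. Both readings are assumptions made for this
sketch, and the helper names are invented.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def chunk_bytes(order, page_size=PAGE_SIZE):
    """Size of one physically contiguous allocation of the given order."""
    return page_size << order


def seconds_to_fill(nbytes, bytes_per_second):
    """Time for a trace stream of the given rate to fill nbytes."""
    return nbytes / bytes_per_second


chunk = chunk_bytes(11)
print(chunk // (1024 * 1024), "MiB per chunk")
print(seconds_to_fill(chunk, 100_000_000), "seconds to fill one chunk")
```

At that rate a single chunk fills in well under a tenth of a second, which
is why the discussion turns to how many chunks can be chained or swapped.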


Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units

2013-12-19 Thread Peter Zijlstra
On Thu, Dec 19, 2013 at 01:14:09PM +0200, Alexander Shishkin wrote:
> Peter Zijlstra  writes:
> 
> > On Thu, Dec 19, 2013 at 09:53:44AM +0200, Alexander Shishkin wrote:
> >> Peter Zijlstra  writes:
> >> > The thing is; why can't you zero-copy whatever buffer the hardware
> >> > writes into, into the normal buffer?
> >> 
> >> I'm not sure I understand. You mean, have the buffer split between perf
> >> data and trace data?
> >
> > Yep, I don't see any reason why this wouldn't work.
> >
> > When the hardware thing sends an interrupt to notify us its buffer is
> > 'full', stop the recorder, try to create a single record in the buffer
> > that's big enough + 1 page, then swizzle the hardware pages and the
> > buffer pages for that record, using the +1 page to page align the actual
> > data. Then (re)start the hardware on the 'new' pages.
> 
> We configure the hardware thing to send an interrupt *before* the buffer
> is full, keep the recorder running while userspace saves stuff to
> perf.data file. Recording only stops if perf fails to read the trace
> data out fast enough and the buffer fills up. So you'd have a complete
> trace.
> 
> Also, we have what we call a "snapshot" mode, where we keep the hardware
> thing running, writing data to a circular buffer till it's stopped, in
> case we're only interested in the most recent trace data to see what it
> is that takes too long to respond, etc. And while it is running, we're
> getting new records in the perf stream all the time (mmaps, etc).
> 
> Put simple: perf data and trace data are two different separate types of
> information that originate from two different sources, can exist and
> make sense separately from one another and should not be mixed.

Well you're either having to change your stance or we're done talking
right now.


Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units

2013-12-19 Thread Alexander Shishkin
Peter Zijlstra  writes:

> On Thu, Dec 19, 2013 at 09:53:44AM +0200, Alexander Shishkin wrote:
>> Yes and some implementations of PT have the same issue, but you can do a
>> sufficiently large high order allocation and map it to userspace and
>> still no copying (or parsing/decoding) in kernel space required.
>
> What's sufficiently large? The largest we could possibly allocate is
> something like 4k^11 which is 8M or so. That's not all that big given
> you keep saying it generates in the order of 100 MB/s.

One chunk is 8M. You can have as many as the buddy allocator permits you
to have. When you get a PMI, you simply switch one chunk for another and
on the tracing goes.
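
As a rough illustration of the chunk switching described above, the
PMI-side logic could look something like the sketch below. This is not
code from the patch set: NR_CHUNKS, struct trace_ring and both helpers
are hypothetical names, and the "busy" flag merely stands in for whatever
mechanism tells the kernel that userspace has consumed a chunk.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical count; the thread only says "as many as the buddy
 * allocator permits". */
#define NR_CHUNKS 4

struct trace_chunk {
	void *base;	/* start of one physically contiguous allocation */
	bool busy;	/* true while userspace has not yet consumed it */
};

struct trace_ring {
	struct trace_chunk chunk[NR_CHUNKS];
	size_t cur;	/* chunk the hardware is currently writing to */
};

/*
 * Would be called from the PMI handler: mark the filled chunk as pending
 * for the reader and move the hardware to the next chunk. Returns the new
 * chunk, or NULL if the reader has fallen behind and no chunk is free.
 */
static struct trace_chunk *trace_ring_switch(struct trace_ring *ring)
{
	size_t next = (ring->cur + 1) % NR_CHUNKS;

	ring->chunk[ring->cur].busy = true;
	if (ring->chunk[next].busy)
		return NULL;	/* nothing free: tracing has to pause */

	ring->cur = next;
	return &ring->chunk[next];
}

/* Would be called once userspace has saved the contents of chunk idx. */
static void trace_ring_release(struct trace_ring *ring, size_t idx)
{
	ring->chunk[idx].busy = false;
}
```

The NULL return corresponds to the case mentioned elsewhere in the thread
where perf fails to read the trace data out fast enough.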

> Also, 'some implementations', that sounds like a fail right there. Why
> are there already different implementations, and some which such stupid
> design, of something this new?
>
> How about just saying NO to the ones that requires physically contiguous
> allocations?

No reason to leave those out, because they are still extremely useful
for tracing and fit perfectly fine in a model with two buffers.

Regards,
--
Alex


Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units

2013-12-19 Thread Alexander Shishkin
Peter Zijlstra  writes:

> On Thu, Dec 19, 2013 at 09:53:44AM +0200, Alexander Shishkin wrote:
>> Peter Zijlstra  writes:
>> > The thing is; why can't you zero-copy whatever buffer the hardware
>> > writes into, into the normal buffer?
>> 
>> I'm not sure I understand. You mean, have the buffer split between perf
>> data and trace data?
>
> Yep, I don't see any reason why this wouldn't work.
>
> When the hardware thing sends an interrupt to notify us its buffer is
> 'full', stop the recorder, try to create a single record in the buffer
> that's big enough + 1 page, then swizzle the hardware pages and the
> buffer pages for that record, using the +1 page to page align the actual
> data. Then (re)start the hardware on the 'new' pages.

We configure the hardware thing to send an interrupt *before* the buffer
is full, keep the recorder running while userspace saves stuff to
perf.data file. Recording only stops if perf fails to read the trace
data out fast enough and the buffer fills up. So you'd have a complete
trace.

Also, we have what we call a "snapshot" mode, where we keep the hardware
thing running, writing data to a circular buffer till it's stopped, in
case we're only interested in the most recent trace data to see what it
is that takes too long to respond, etc. And while it is running, we're
getting new records in the perf stream all the time (mmaps, etc).
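
To make the "snapshot" idea concrete, here is a minimal sketch of an
overwrite-mode circular buffer. It is only an illustration under
assumptions: struct snap_buf and both functions are hypothetical names,
and the byte-wise writer stands in for the hardware, which would of
course write the buffer by itself.

```c
#include <stddef.h>
#include <stdint.h>

struct snap_buf {
	uint8_t *data;
	size_t size;	/* capacity in bytes */
	size_t head;	/* offset of the next write */
	int wrapped;	/* nonzero once old data has been overwritten */
};

/* Writer side: keeps going forever, overwriting the oldest data. */
static void snap_write(struct snap_buf *b, const uint8_t *src, size_t len)
{
	for (size_t i = 0; i < len; i++) {
		b->data[b->head] = src[i];
		b->head = (b->head + 1) % b->size;
		if (b->head == 0)
			b->wrapped = 1;
	}
}

/*
 * Snapshot: copy the most recent data, oldest byte first, into dst
 * (which must hold at least b->size bytes). Returns the byte count.
 */
static size_t snap_read(const struct snap_buf *b, uint8_t *dst)
{
	size_t n = b->wrapped ? b->size : b->head;
	size_t start = b->wrapped ? b->head : 0;

	for (size_t i = 0; i < n; i++)
		dst[i] = b->data[(start + i) % b->size];
	return n;
}
```

Only the last b->size bytes survive, which is the point: the trace is
read once, after the interesting event, rather than streamed.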

Put simple: perf data and trace data are two different separate types of
information that originate from two different sources, can exist and
make sense separately from one another and should not be mixed.

Regards,
--
Alex


Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units

2013-12-19 Thread Peter Zijlstra
On Thu, Dec 19, 2013 at 09:53:44AM +0200, Alexander Shishkin wrote:
> Yes and some implementations of PT have the same issue, but you can do a
> sufficiently large high order allocation and map it to userspace and
> still no copying (or parsing/decoding) in kernel space required.

What's sufficiently large? The largest we could possibly allocate is
something like 4k^11 which is 8M or so. That's not all that big given
you keep saying it generates in the order of 100 MB/s.
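
The arithmetic behind that estimate, taking the figures in this mail at
face value (4 KiB pages, an order-11 allocation, 100 MB/s of trace
data), can be sketched as below. The function names are made up for the
example, and whether an order-11 allocation is actually obtainable
depends on the kernel configuration.

```c
#include <stdint.h>

/* Size of one physically contiguous chunk: page_size * 2^order. */
static uint64_t chunk_bytes(uint64_t page_size, unsigned int order)
{
	return page_size << order;
}

/* Whole milliseconds until a chunk of that size fills at a given rate. */
static uint64_t fill_time_ms(uint64_t bytes, uint64_t bytes_per_sec)
{
	return bytes * 1000 / bytes_per_sec;
}
```

With these inputs, chunk_bytes(4096, 11) is 8388608 bytes (8 MiB) and
fill_time_ms(8388608, 100000000) is 83, i.e. such a chunk would fill in
well under a tenth of a second.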

Also, 'some implementations', that sounds like a fail right there. Why
are there already different implementations, and some which such stupid
design, of something this new?

How about just saying NO to the ones that requires physically contiguous
allocations?


Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units

2013-12-19 Thread Peter Zijlstra
On Thu, Dec 19, 2013 at 09:53:44AM +0200, Alexander Shishkin wrote:
> Peter Zijlstra  writes:
> > The thing is; why can't you zero-copy whatever buffer the hardware
> > writes into, into the normal buffer?
> 
> I'm not sure I understand. You mean, have the buffer split between perf
> data and trace data?

Yep, I don't see any reason why this wouldn't work.

When the hardware thing sends an interrupt to notify us its buffer is
'full', stop the recorder, try to create a single record in the buffer
that's big enough + 1 page, then swizzle the hardware pages and the
buffer pages for that record, using the +1 page to page align the actual
data. Then (re)start the hardware on the 'new' pages.
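
A very rough sketch of the two pieces of that proposal: swapping page
pointers between the hardware buffer and the slots reserved for one
record, and computing the padding that the extra page makes room for.
Everything here is hypothetical (names, the idea of plain pointer
arrays); real code would operate on struct page and the perf ring
buffer internals.

```c
#include <stddef.h>

/*
 * Exchange n page pointers: the filled hardware pages become part of the
 * record, and the record's former pages become the hardware's new buffer.
 */
static void swizzle_pages(void **hw_pages, void **rb_pages, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		void *tmp = hw_pages[i];

		hw_pages[i] = rb_pages[i];
		rb_pages[i] = tmp;
	}
}

/*
 * Padding needed after a record header of hdr bytes starting at offset
 * off so that the payload begins on a page boundary. It is always less
 * than one page, hence the "+ 1 page" in the record size.
 */
static size_t align_pad(size_t off, size_t hdr, size_t page_size)
{
	return (page_size - (off + hdr) % page_size) % page_size;
}
```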


Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units

2013-12-19 Thread Alexander Shishkin
Peter Zijlstra <pet...@infradead.org> writes:

> On Thu, Dec 19, 2013 at 01:17:51PM +0200, Alexander Shishkin wrote:
>> Peter Zijlstra <pet...@infradead.org> writes:
>>
>> > On Thu, Dec 19, 2013 at 09:53:44AM +0200, Alexander Shishkin wrote:
>> >> Yes and some implementations of PT have the same issue, but you can do a
>> >> sufficiently large high order allocation and map it to userspace and
>> >> still no copying (or parsing/decoding) in kernel space required.
>> >
>> > What's sufficiently large? The largest we could possibly allocate is
>> > something like 4k^11 which is 8M or so. That's not all that big given
>> > you keep saying it generates in the order of 100 MB/s.
>>
>> One chunk is 8M. You can have as many as the buddy allocator permits you
>> to have. When you get a PMI, you simply switch one chunk for another and
>> on the tracing goes.
>
> This document you referred me to looks to specify something with a
> proper s/g implementation; called ToPA. There doesn't appear to be a
> limit to the linked entries and you can specify a size per entry, and I
> don't see anywhere why 4k would be bad.

JFYI, 11.2.4.1, Single Output Region ToPA Implementation.

Regards,
--
Alex


Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units

2013-12-19 Thread Peter Zijlstra
On Thu, Dec 19, 2013 at 12:28:12PM +0100, Peter Zijlstra wrote:
> This document you referred me to looks to specify something with a
> proper s/g implementation; called ToPA. There doesn't appear to be a
> limit to the linked entries and you can specify a size per entry, and I
> don't see anywhere why 4k would be bad.
>
> That said, I'm still reading..

Found it:

Single Output Region ToPA Implementation

The first processor generation to implement Intel PT supports only ToPA
configurations with a single ToPA entry followed by an END entry that
points back to the first entry (creating one circular output buffer).
Such processors enumerate CPUID.(EAX=14H,ECX=0):EBX[bit 1] as 0.
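
For illustration, decoding that enumeration bit might look like the
sketch below. The bit position is taken verbatim from the excerpt above
(EBX bit 1 of leaf 14H, sub-leaf 0); other revisions of the
documentation may place it elsewhere, so treat the constant as an
assumption. The macro and function names are invented for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Per the excerpt: CPUID.(EAX=14H,ECX=0):EBX[bit 1] reads 0 on parts
 * that support only a single ToPA output region. */
#define PT_CPUID14_EBX_MULTI_TOPA	(1u << 1)

/* Takes the EBX value returned by CPUID leaf 14H, sub-leaf 0. */
static bool pt_single_region_only(uint32_t cpuid14_ebx)
{
	return !(cpuid14_ebx & PT_CPUID14_EBX_MULTI_TOPA);
}
```

In userspace the register value could be obtained with a CPUID helper
such as __get_cpuid_count() from the compiler's <cpuid.h>, assuming the
toolchain provides it; in the kernel one would use its own cpuid
helpers instead.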

So basically you guys buggered the hardware.

More specifically, what actual hardware is this? Is this first
generation HSW or so?

Please enumerate the actual hardware that supports this PT stuff and
which hardware has it fixed.


Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units

2013-12-19 Thread Ingo Molnar

* Peter Zijlstra <pet...@infradead.org> wrote:

> On Thu, Dec 19, 2013 at 01:17:51PM +0200, Alexander Shishkin wrote:
> > Peter Zijlstra <pet...@infradead.org> writes:
> >
> > > On Thu, Dec 19, 2013 at 09:53:44AM +0200, Alexander Shishkin wrote:
> > >> Yes and some implementations of PT have the same issue, but you can do a
> > >> sufficiently large high order allocation and map it to userspace and
> > >> still no copying (or parsing/decoding) in kernel space required.
> > >
> > > What's sufficiently large? The largest we could possibly allocate is
> > > something like 4k^11 which is 8M or so. That's not all that big given
> > > you keep saying it generates in the order of 100 MB/s.
> >
> > One chunk is 8M. You can have as many as the buddy allocator permits you
> > to have. When you get a PMI, you simply switch one chunk for another and
> > on the tracing goes.
>
> This document you referred me to looks to specify something with a
> proper s/g implementation; called ToPA. There doesn't appear to be a
> limit to the linked entries and you can specify a size per entry, and I
> don't see anywhere why 4k would be bad.
>
> That said, I'm still reading..
>
> > > Also, 'some implementations', that sounds like a fail right there. Why
> > > are there already different implementations, and some which such stupid
> > > design, of something this new?
> > >
> > > How about just saying NO to the ones that requires physically contiguous
> > > allocations?
> >
> > No reason to leave those out, because they are still extremely useful
> > for tracing and fit perfectly fine in a model with two buffers.
>
> Maybe; but let's start with the sane hardware. Then we'll look at the
> amount of pain needed to support these broken pieces of crap and
> decide later.
>
> So drop all support for crappy hardware now.

Absolutely agreed ...

The thing is, BTS itself is rarely used (and not primarily because 
it's slow, but because its tooling and thus its utility is poor), so 
the last thing we want is another piece of broken hardware with a 
quirky software interface to it that tooling has trouble utilizing.

Sigh, when will Intel learn to talk to Linux PMU experts _before_ 
committing to a hardware interface??

Thanks,

Ingo


Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units

2013-12-19 Thread Alexander Shishkin
Peter Zijlstra <pet...@infradead.org> writes:

> On Thu, Dec 19, 2013 at 01:14:09PM +0200, Alexander Shishkin wrote:
>> Peter Zijlstra <pet...@infradead.org> writes:
>>
>> > On Thu, Dec 19, 2013 at 09:53:44AM +0200, Alexander Shishkin wrote:
>> >> Peter Zijlstra <pet...@infradead.org> writes:
>> >> > The thing is; why can't you zero-copy whatever buffer the hardware
>> >> > writes into, into the normal buffer?
>> >>
>> >> I'm not sure I understand. You mean, have the buffer split between perf
>> >> data and trace data?
>> >
>> > Yep, I don't see any reason why this wouldn't work.
>> >
>> > When the hardware thing sends an interrupt to notify us its buffer is
>> > 'full', stop the recorder, try to create a single record in the buffer
>> > that's big enough + 1 page, then swizzle the hardware pages and the
>> > buffer pages for that record, using the +1 page to page align the actual
>> > data. Then (re)start the hardware on the 'new' pages.
>>
>> We configure the hardware thing to send an interrupt *before* the buffer
>> is full, keep the recorder running while userspace saves stuff to
>> perf.data file. Recording only stops if perf fails to read the trace
>> data out fast enough and the buffer fills up. So you'd have a complete
>> trace.
>>
>> Also, we have what we call a "snapshot" mode, where we keep the hardware
>> thing running, writing data to a circular buffer till it's stopped, in
>> case we're only interested in the most recent trace data to see what it
>> is that takes too long to respond, etc. And while it is running, we're
>> getting new records in the perf stream all the time (mmaps, etc).
>>
>> Put simple: perf data and trace data are two different separate types of
>> information that originate from two different sources, can exist and
>> make sense separately from one another and should not be mixed.
>
> Well you're either having to change your stance or we're done talking
> right now.

I'm making a case in favor of 2 separate buffers just like you asked in
one of the previous emails. It's backed by some very real usecases. That
said, I'm not personally attached to any one design, only what makes
sense. There is no 'stance'.

Regards,
--
Alex


Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units

2013-12-19 Thread Peter Zijlstra


Found more:

Note that no “freezing” takes place with the ToPA PMI. Thus, packet
generation is not frozen, and the interrupt handler will be traced
(though filtering can prevent this). Further, the setting of
IA32_DEBUGCTL.Freeze_Perfmon_on_PMI is ignored and performance counters
are not frozen by a ToPA PMI.


Can someone confirm with the hardware people what happens when an actual
PMU counter overflows and tries to raise the PMI while we're in one that
ignores the 'Freeze_perfmon_on_PMI' bit?

You cannot assert an interrupt that is already asserted, but the
handler can see the overflow status bit set and will likely process it,
assuming the PMU is actually frozen.

Also, this just smells ripe for errata and ugly bugs.



Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units

2013-12-19 Thread Peter Zijlstra
On Thu, Dec 19, 2013 at 12:57:59PM +0100, Peter Zijlstra wrote:
> On Thu, Dec 19, 2013 at 12:28:12PM +0100, Peter Zijlstra wrote:
> > This document you referred me to looks to specify something with a
> > proper s/g implementation; called ToPA. There doesn't appear to be a
> > limit to the linked entries and you can specify a size per entry, and I
> > don't see anywhere why 4k would be bad.
> >
> > That said, I'm still reading..
>
> Found it:
>
> Single Output Region ToPA Implementation
>
> The first processor generation to implement Intel PT supports only ToPA
> configurations with a single ToPA entry followed by an END entry that
> points back to the first entry (creating one circular output buffer).
> Such processors enumerate CPUID.(EAX=14H,ECX=0):EBX[bit 1] as 0.
>
> So basically you guys buggered the hardware.

ToPA PMI and Single Output Region ToPA Implementation

A processor that supports only a single ToPA output region
implementation (such that only one output region is supported; see
above) will attempt to signal a ToPA PMI interrupt before the output
wraps and overwrites the top of the buffer. To support this
functionality, the PMI handler should disable packet generation as soon
as possible.  Due to PMI skid, it is possible, in rare cases, that the
wrap will have occurred before the PMI is delivered. Software can avoid
this by setting the STOP bit in the ToPA entry (see Table 11-3); this
will disable tracing once the region is filled, and no wrap will occur.
This approach has the downside of disabling packet generation so that
some of the instructions that led up to the PMI will not be traced. If
the PMI skid is significant enough to cause the region to fill and
tracing to be disabled, the PMI handler will need to clear the
IA32_RTIT_STATUS.Stopped indication before tracing can resume.


So you're basically forced to stop the tracing on PMI anyhow; so your
continuous tracing argument goes out the window.

Also, what a complete clusterfuck. I think we're far better off
pretending PT doesn't exist until it's fixed.


Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units

2013-12-19 Thread Alexander Shishkin
Ingo Molnar <mi...@kernel.org> writes:

> * Peter Zijlstra <pet...@infradead.org> wrote:
>
>> On Thu, Dec 19, 2013 at 01:17:51PM +0200, Alexander Shishkin wrote:
>> > Peter Zijlstra <pet...@infradead.org> writes:
>> >
>> > > On Thu, Dec 19, 2013 at 09:53:44AM +0200, Alexander Shishkin wrote:
>> > >> Yes and some implementations of PT have the same issue, but you can do a
>> > >> sufficiently large high order allocation and map it to userspace and
>> > >> still no copying (or parsing/decoding) in kernel space required.
>> > >
>> > > What's sufficiently large? The largest we could possibly allocate is
>> > > something like 4k^11 which is 8M or so. That's not all that big given
>> > > you keep saying it generates in the order of 100 MB/s.
>> >
>> > One chunk is 8M. You can have as many as the buddy allocator permits you
>> > to have. When you get a PMI, you simply switch one chunk for another and
>> > on the tracing goes.
>>
>> This document you referred me to looks to specify something with a
>> proper s/g implementation; called ToPA. There doesn't appear to be a
>> limit to the linked entries and you can specify a size per entry, and I
>> don't see anywhere why 4k would be bad.
>>
>> That said, I'm still reading..
>>
>> > > Also, 'some implementations', that sounds like a fail right there. Why
>> > > are there already different implementations, and some which such stupid
>> > > design, of something this new?
>> > >
>> > > How about just saying NO to the ones that requires physically contiguous
>> > > allocations?
>> >
>> > No reason to leave those out, because they are still extremely useful
>> > for tracing and fit perfectly fine in a model with two buffers.
>>
>> Maybe; but let's start with the sane hardware. Then we'll look at the
>> amount of pain needed to support these broken pieces of crap and
>> decide later.
>>
>> So drop all support for crappy hardware now.
>
> Absolutely agreed ...
>
> The thing is, BTS itself is rarely used (and not primarily because
> it's slow, but because its tooling and thus its utility is poor), so
> the last thing we want is another piece of broken hardware with a
> quirky software interface to it that tooling has trouble utilizing.

Or the interface and implementation of BTS support in the kernel
discourage its use and that is why it is so rarely used.

What I'm proposing is a unified interface for trace units to export
their traces and not only the non-crappy ones, in a way that won't
discourage its use from day one.

So I'd like to steer away from the ways in which hardware can be broken
and talk about a usable interface, to begin with.

Regards,
--
Alex


Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units

2013-12-19 Thread Frederic Weisbecker
On Thu, Dec 19, 2013 at 04:30:53PM +0200, Alexander Shishkin wrote:
> Or the interface and implementation of BTS support in the kernel
> discourage its use and that is why it is so rarely used.

I never heard complaints about it. It's a simple dump of from/to address couples.
I just think nobody took the time to develop userspace tooling to exploit it.
But its famous slowness might have had a bad influence on this. And maybe
also the fact that it's very architecture specific. AMD doesn't support BTS if
I recall correctly. Or maybe it has its own different implementation?


Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units

2013-12-19 Thread Alexander Shishkin
Peter Zijlstra <pet...@infradead.org> writes:

> On Thu, Dec 19, 2013 at 12:57:59PM +0100, Peter Zijlstra wrote:
> So you're basically forced to stop the tracing on PMI anyhow; so your
> continuous tracing argument goes out the window.

It's only stopped inside the PMI handler to set up another buffer, and
is then started again, so no useful trace is lost. PMI handler is not
traced. What you're proposing is stopping it for good till perf collects
the previous data, which will lose us a lot of trace. So my argument
stands.

Regards,
--
Alex


Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units

2013-12-19 Thread Peter Zijlstra
On Thu, Dec 19, 2013 at 03:49:42PM +0100, Frederic Weisbecker wrote:
> On Thu, Dec 19, 2013 at 04:30:53PM +0200, Alexander Shishkin wrote:
> > Or the interface and implementation of BTS support in the kernel
> > discourage its use and that is why it is so rarely used.
>
> I never heard complaints about it. It's a simple dump of from/to address
> couples. I just think nobody took the time to develop userspace tooling to
> exploit it. But its famous slowness might have had a bad influence on this.
> And maybe also the fact that it's very architecture specific. AMD doesn't
> support BTS if I recall correctly. Or maybe it has its own different
> implementation?

No, AMD doesn't do anything like that.

There was some attempt to cure some of the wobblies:

  https://lkml.org/lkml/2013/7/8/154

But people never pursued that.

That said, if people want overwrite mode to work for PT we'd need to fix
the same thing.


Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units

2013-12-19 Thread Peter Zijlstra
On Thu, Dec 19, 2013 at 04:30:53PM +0200, Alexander Shishkin wrote:
> So I'd like to steer away from the ways in which hardware can be broken
> and talk about a usable interface, to begin with.

Just dump it into the regular one buffer like I outlined.

That said; we very much need to have at least two architectures
implemented for any of this code to move.

But we cannot ignore the hardware trainwreck; we cannot shape our
interface around something that's utterly broken.

Some hardware is just too broken to support.


Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units

2013-12-19 Thread Peter Zijlstra
On Thu, Dec 19, 2013 at 04:54:27PM +0200, Alexander Shishkin wrote:
> Peter Zijlstra <pet...@infradead.org> writes:
>
> > On Thu, Dec 19, 2013 at 12:57:59PM +0100, Peter Zijlstra wrote:
> > So you're basically forced to stop the tracing on PMI anyhow; so your
> > continuous tracing argument goes out the window.
>
> It's only stopped inside the PMI handler to set up another buffer, and
> is then started again, so no useful trace is lost. PMI handler is not
> traced. What you're proposing is stopping it for good till perf collects
> the previous data, which will lose us a lot of trace. So my argument
> stands.

That is not what I proposed at all.

The PMI will swizzle the pages and resume recording. If there is no
space in the output buffer, we'll simply re-use the existing pages and
overwrite data.
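
The behaviour described above can be sketched as a toy model. This is not kernel code: the TraceBuffer class and its methods are hypothetical names, and the model only shows the policy (tracing never stops; a page that was not consumed in time gets reused and its old contents are lost).

```python
class TraceBuffer:
    """Toy model: the hardware fills one page at a time and a 'PMI'
    fires whenever the current page is full."""

    def __init__(self, nr_pages):
        self.pages = [None] * nr_pages  # None = free, otherwise trace data
        self.head = 0                   # page the hardware writes next
        self.overwritten = 0            # pages lost to a slow reader

    def pmi(self, data):
        """PMI handler: account for the full page, then move on to the
        next page so that recording resumes immediately."""
        if self.pages[self.head] is not None:
            # No free page left: reuse this one, the old data is lost.
            self.overwritten += 1
        self.pages[self.head] = data
        self.head = (self.head + 1) % len(self.pages)

    def consume(self, index):
        """Reader takes the contents of a page and frees the page."""
        data, self.pages[index] = self.pages[index], None
        return data


buf = TraceBuffer(nr_pages=2)
for chunk in ("a", "b", "c"):
    buf.pmi(chunk)
print(buf.pages, buf.overwritten)
```

In this model the reader being slow never stalls the writer; it only costs old data, which is the difference from stopping the trace until perf has collected the previous pages.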





Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units

2013-12-18 Thread Alexander Shishkin
Peter Zijlstra writes:

> On Wed, Dec 18, 2013 at 04:22:36PM +0200, Alexander Shishkin wrote:
>> > Still confused, if you cannot copy it into one buffer, then why can you
>> > copy it into a second buffer?
>> 
>> It's not copied, hardware writes directly into that second buffer.
>
> Where's the PT documentation? I can't find it in the SDM and your ISA
> extensions link is a generic Intel website which is friggin useless
> (like all corporate websites strive to be).

[1]

> Your actual PT patch doesn't describe how the thing works either, and
> while I could go read the code, I'm too lazy.
>
> The thing is; why can't you zero-copy whatever buffer the hardware
> writes into, into the normal buffer?

I'm not sure I understand. You mean, have the buffer split between perf
data and trace data?

> Machinery like that would also be useful to zero-copy bits out of the
> buffer right into the page-cache.

Please elaborate.

>> I've done the same with BTS now (as Ingo suggested) and it also benefits
>> from this approach.
>
> The problem with DS is that it needs physically contiguous pages is it
> not? So you cannot really allocate a large buffer, and you end up
> needing to copy or swizzle stuff.

Yes, and some implementations of PT have the same issue, but you can do a
sufficiently large high-order allocation and map it to userspace, and
still no copying (or parsing/decoding) is required in kernel space.
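
As a rough illustration of what a "sufficiently large high-order allocation" means in numbers, the sketch below computes the smallest allocation order that covers a given buffer size. It assumes 4 KiB pages; order_for is a made-up helper name that mirrors the usual "smallest order such that PAGE_SIZE << order >= size" rule.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def order_for(size):
    """Smallest order such that PAGE_SIZE << order >= size."""
    pages = max(1, -(-size // PAGE_SIZE))  # ceil(size / PAGE_SIZE)
    return (pages - 1).bit_length()


for size in (4096, 64 * 1024, 1 << 20, 3 << 20):
    order = order_for(size)
    print(f"{size} bytes -> order {order} ({PAGE_SIZE << order} bytes allocated)")
```

Sizes that are not a power-of-two number of pages get rounded up (3 MiB ends up as a 4 MiB block), and the higher the order, the more likely the allocation is to fail on a fragmented system, so the buffer size is a trade-off rather than a free choice.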

[1] http://download-software.intel.com/sites/default/files/managed/71/2e/319433-017.pdf

Regards,
--
Alex


Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units

2013-12-18 Thread Peter Zijlstra
On Wed, Dec 18, 2013 at 04:22:36PM +0200, Alexander Shishkin wrote:
> > Still confused, if you cannot copy it into one buffer, then why can you
> > copy it into a second buffer?
> 
> It's not copied, hardware writes directly into that second buffer.

Where's the PT documentation? I can't find it in the SDM and your ISA
extensions link is a generic Intel website which is friggin useless
(like all corporate websites strive to be).

Your actual PT patch doesn't describe how the thing works either, and
while I could go read the code, I'm too lazy.

The thing is; why can't you zero-copy whatever buffer the hardware
writes into, into the normal buffer?

Machinery like that would also be useful to zero-copy bits out of the
buffer right into the page-cache.

> I've done the same with BTS now (as Ingo suggested) and it also benefits
> from this approach.

The problem with DS is that it needs physically contiguous pages is it
not? So you cannot really allocate a large buffer, and you end up
needing to copy or swizzle stuff.


Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units

2013-12-18 Thread Alexander Shishkin
Peter Zijlstra writes:

> On Wed, Dec 18, 2013 at 04:01:04PM +0200, Alexander Shishkin wrote:
>> > Why don't you start by explaining _why_ you need a second stream to
>> > begin with?
>> 
>> Oh, I'm sure I've explained it earlier ([1], [2])
>
> See, I didn't read 0 because that information gets lost and patches
> should be self-explanatory, and I didn't get to the Intel driver yet
> because well, I got stuck in the generic code.

Sure. The general concept is more important than the actual driver at
this point anyway.

>> but why not. The data
>> in the second stream is generated at a rate which is hundreds of
>> megabytes per second per core. Decoding this data is ~1000 times slower
>> than generating it. Ergo, can't be done in kernel, needs to be exported
>> as-is to userspace for later retrieval and decoding. Doing it via perf
>> stream means an extra copy, which at these rates is a waste. Ergo, a
>> second buffer.
>
> Still confused, if you cannot copy it into one buffer, then why can you
> copy it into a second buffer?

It's not copied, hardware writes directly into that second buffer.

I've done the same with BTS now (as Ingo suggested) and it also benefits
from this approach.

Regards,
--
Alex


Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units

2013-12-18 Thread Peter Zijlstra
On Wed, Dec 18, 2013 at 04:01:04PM +0200, Alexander Shishkin wrote:
> > Why don't you start by explaining _why_ you need a second stream to
> > begin with?
> 
> Oh, I'm sure I've explained it earlier ([1], [2])

See, I didn't read 0 because that information gets lost and patches
should be self-explanatory, and I didn't get to the Intel driver yet
because well, I got stuck in the generic code.

> but why not. The data
> in the second stream is generated at a rate which is hundreds of
> megabytes per second per core. Decoding this data is ~1000 times slower
> than generating it. Ergo, can't be done in kernel, needs to be exported
> as-is to userspace for later retrieval and decoding. Doing it via perf
> stream means an extra copy, which at these rates is a waste. Ergo, a
> second buffer.

Still confused, if you cannot copy it into one buffer, then why can you
copy it into a second buffer?



Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units

2013-12-18 Thread Alexander Shishkin
Peter Zijlstra writes:

> On Wed, Dec 18, 2013 at 03:23:41PM +0200, Alexander Shishkin wrote:
>> Peter Zijlstra writes:
>> 
>> > On Wed, Dec 11, 2013 at 02:36:16PM +0200, Alexander Shishkin wrote:
>> >> Instruction tracing PMUs are capable of recording a log of instruction
>> >> execution flow on a cpu core, which can be useful for profiling and crash
>> >> analysis. This patch adds itrace infrastructure for perf events and the
>> >> rest of the kernel to use.
>> >> 
>> >> Since such PMUs can produce copious amounts of trace data, it may be
>> >> impractical to process it inside the kernel in real time, but instead export
>> >> raw trace streams to userspace for subsequent analysis. Thus, itrace PMUs
>> >> may export their trace buffers, which can be mmap()ed to userspace from a
>> >> perf event fd with a PERF_EVENT_ITRACE_OFFSET offset. To that end, perf
>> >> is extended to work with multiple ring buffers per event, reusing the
>> >> ring_buffer code in an attempt to reduce complexity.
>> >
>> > Please read the thread here: https://lkml.org/lkml/2008/12/4/64
>> >
>> > On my thoughts of this creative mmap() usage.
>> 
>> That's unfortunate, it made sense to me. But let's then have a look at
>> the alternative approaches. Bearing in mind that it is crucial for us to
>> export trace buffers to userspace as opposed to processing the trace
>> data in the kernel, the fact that we still need the normal perf data
>> stream and your dislike for mmap trickery, we need two separate file
>> descriptors: one for the perf data and one for the trace data.
>
> Why don't you start by explaining _why_ you need a second stream to
> begin with?

Oh, I'm sure I've explained it earlier ([1], [2]), but why not. The data
in the second stream is generated at a rate which is hundreds of
megabytes per second per core. Decoding this data is ~1000 times slower
than generating it. Ergo, can't be done in kernel, needs to be exported
as-is to userspace for later retrieval and decoding. Doing it via perf
stream means an extra copy, which at these rates is a waste. Ergo, a
second buffer.
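
A back-of-the-envelope version of this argument, as a small sketch. The 300 MB/s figure is only an assumed value inside the "hundreds of megabytes per second per core" range, and decode_backlog is a made-up helper name.

```python
def decode_backlog(trace_rate_mb_s, slowdown, seconds_traced):
    """Return (megabytes of trace produced, seconds needed to decode them)
    when decoding is `slowdown` times slower than generation."""
    produced = trace_rate_mb_s * seconds_traced
    decode_time = seconds_traced * slowdown
    return produced, decode_time


# Assumed numbers: 300 MB/s per core, decoding ~1000 times slower.
produced, decode_time = decode_backlog(trace_rate_mb_s=300, slowdown=1000,
                                       seconds_traced=1)
print(f"{produced} MB of trace, about {decode_time / 60:.1f} minutes to decode")
```

With these assumptions, one second of tracing on one core costs on the order of a quarter of an hour of decoding, which is why the decoding has to happen later and in userspace, and why an additional copy of every byte on the way out is worth avoiding.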

[1] https://lkml.org/lkml/2013/12/11/213
[2] https://lkml.org/lkml/2013/12/11/358

Regards,
--
Alex


Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units

2013-12-18 Thread Peter Zijlstra
On Wed, Dec 18, 2013 at 03:23:41PM +0200, Alexander Shishkin wrote:
> Peter Zijlstra writes:
> 
> > On Wed, Dec 11, 2013 at 02:36:16PM +0200, Alexander Shishkin wrote:
> >> Instruction tracing PMUs are capable of recording a log of instruction
> >> execution flow on a cpu core, which can be useful for profiling and crash
> >> analysis. This patch adds itrace infrastructure for perf events and the
> >> rest of the kernel to use.
> >> 
> >> Since such PMUs can produce copious amounts of trace data, it may be
> >> impractical to process it inside the kernel in real time, but instead export
> >> raw trace streams to userspace for subsequent analysis. Thus, itrace PMUs
> >> may export their trace buffers, which can be mmap()ed to userspace from a
> >> perf event fd with a PERF_EVENT_ITRACE_OFFSET offset. To that end, perf
> >> is extended to work with multiple ring buffers per event, reusing the
> >> ring_buffer code in an attempt to reduce complexity.
> >
> > Please read the thread here: https://lkml.org/lkml/2008/12/4/64
> >
> > On my thoughts of this creative mmap() usage.
> 
> That's unfortunate, it made sense to me. But let's then have a look at
> the alternative approaches. Bearing in mind that it is crucial for us to
> export trace buffers to userspace as opposed to processing the trace
> data in the kernel, the fact that we still need the normal perf data
> stream and your dislike for mmap trickery, we need two separate file
> descriptors: one for the perf data and one for the trace data.

Why don't you start by explaining _why_ you need a second stream to
begin with?


Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units

2013-12-18 Thread Alexander Shishkin
Peter Zijlstra writes:

> On Wed, Dec 11, 2013 at 02:36:16PM +0200, Alexander Shishkin wrote:
>> Instruction tracing PMUs are capable of recording a log of instruction
>> execution flow on a cpu core, which can be useful for profiling and crash
>> analysis. This patch adds itrace infrastructure for perf events and the
>> rest of the kernel to use.
>> 
>> Since such PMUs can produce copious amounts of trace data, it may be
>> impractical to process it inside the kernel in real time, but instead export
>> raw trace streams to userspace for subsequent analysis. Thus, itrace PMUs
>> may export their trace buffers, which can be mmap()ed to userspace from a
>> perf event fd with a PERF_EVENT_ITRACE_OFFSET offset. To that end, perf
>> is extended to work with multiple ring buffers per event, reusing the
>> ring_buffer code in an attempt to reduce complexity.
>
> Please read the thread here: https://lkml.org/lkml/2008/12/4/64
>
> On my thoughts of this creative mmap() usage.

That's unfortunate, it made sense to me. But let's then have a look at
the alternative approaches. Bearing in mind that it is crucial for us to
export trace buffers to userspace as opposed to processing the trace
data in the kernel, the fact that we still need the normal perf data
stream and your dislike for mmap trickery, we need two separate file
descriptors: one for the perf data and one for the trace data.

One way of doing this would be to call sys_perf_event_open() once for
each. The first call would return a file descriptor that provides the good
old perf data buffer; the second call would use this file descriptor as
the group leader and return another descriptor (thus creating another
perf_event), which, when mmap()ed, would provide a trace buffer.
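
From userspace, the two-call variant could look roughly like the sketch below. It is only an illustration of the idea, not an existing interface: the trace PMU type is a placeholder that a real driver would have to provide, the dummy software event for the first descriptor is an assumption, and the syscall number is the x86_64 one. Only the first 64 bytes of struct perf_event_attr are modelled.

```python
import ctypes
import os

SYS_perf_event_open = 298   # x86_64 syscall number; differs on other arches
PERF_TYPE_SOFTWARE = 1
PERF_COUNT_SW_DUMMY = 9     # event that counts nothing, used for side-band data


class PerfEventAttr(ctypes.Structure):
    """First 64 bytes (PERF_ATTR_SIZE_VER0) of struct perf_event_attr."""
    _fields_ = [
        ("type", ctypes.c_uint32),
        ("size", ctypes.c_uint32),
        ("config", ctypes.c_uint64),
        ("sample_period", ctypes.c_uint64),
        ("sample_type", ctypes.c_uint64),
        ("read_format", ctypes.c_uint64),
        ("flags", ctypes.c_uint64),        # bitfield; bit 0 is "disabled"
        ("wakeup_events", ctypes.c_uint32),
        ("bp_type", ctypes.c_uint32),
        ("config1", ctypes.c_uint64),
    ]


def make_attr(pmu_type, config=0):
    """Build a minimal attr for the given PMU type, created disabled."""
    attr = PerfEventAttr()
    attr.type = pmu_type
    attr.size = ctypes.sizeof(PerfEventAttr)
    attr.config = config
    attr.flags = 1
    return attr


def perf_event_open(attr, pid, cpu, group_fd, flags=0):
    """Thin wrapper around the raw syscall; raises OSError on failure."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    fd = libc.syscall(ctypes.c_long(SYS_perf_event_open), ctypes.byref(attr),
                      ctypes.c_long(pid), ctypes.c_long(cpu),
                      ctypes.c_long(group_fd), ctypes.c_ulong(flags))
    if fd < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return fd


def open_pair(trace_pmu_type, pid=0, cpu=-1):
    """First fd: ordinary perf data. Second fd: trace event grouped under
    the first. trace_pmu_type is a placeholder, not a real constant."""
    perf_fd = perf_event_open(
        make_attr(PERF_TYPE_SOFTWARE, PERF_COUNT_SW_DUMMY), pid, cpu, -1)
    trace_fd = perf_event_open(make_attr(trace_pmu_type), pid, cpu, perf_fd)
    return perf_fd, trace_fd
```

Each descriptor would then be mmap()ed separately, so the perf data and the trace data never share a mapping and no special mmap offset is needed.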

Or, we could introduce a new PERF_FLAG_XXX to mean that we want a
descriptor with a trace buffer. And then, of course, one could always
add an ioctl(), but that'd probably be a bit over the top.

Do any of these sound reasonable? Any other possibilities that I'm
missing here?

Thanks,
--
Alex


Re: [PATCH v0 04/71] itrace: Infrastructure for instruction flow tracing units

2013-12-17 Thread Peter Zijlstra
On Wed, Dec 11, 2013 at 02:36:16PM +0200, Alexander Shishkin wrote:
> Instruction tracing PMUs are capable of recording a log of instruction
> execution flow on a cpu core, which can be useful for profiling and crash
> analysis. This patch adds itrace infrastructure for perf events and the
> rest of the kernel to use.
> 
> Since such PMUs can produce copious amounts of trace data, it may be
> impractical to process it inside the kernel in real time, but instead export
> raw trace streams to userspace for subsequent analysis. Thus, itrace PMUs
> may export their trace buffers, which can be mmap()ed to userspace from a
> perf event fd with a PERF_EVENT_ITRACE_OFFSET offset. To that end, perf
> is extended to work with multiple ring buffers per event, reusing the
> ring_buffer code in an attempt to reduce complexity.

Please read the thread here: https://lkml.org/lkml/2008/12/4/64

On my thoughts of this creative mmap() usage.

tl;dr: no f*cking way.