Re: [PATCH] perf record: mmap output file - v2

2013-10-15 Thread David Ahern

On 10/15/13 7:52 PM, Namhyung Kim wrote:

Aha, okay.  So it mostly matters to syscall tracing, right?  For a
normal record session, it seems that the effect is not that large:


Yes, that's in the description "When recording raw_syscalls for the 
entire system"


There is a small benefit to all record sessions -- mmap+memcpy has less 
overhead than write(). I still need to look at Ingo's suggestion to use 
non-temporal stores which might reduce the overhead of the memcpy.


Try a workload that generates a HUGE data file -- say a full kernel 
build (e.g., perf record -g -- make O=/tmp/kbuild -j 16). You should see 
a much larger benefit from the mmap route. Be sure to use your callchain 
enhancements to look at that 1+G file. ;-)


David
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] perf record: mmap output file - v2

2013-10-15 Thread Namhyung Kim
On Tue, 15 Oct 2013 07:35:53 -0600, David Ahern wrote:
> On 10/15/13 1:31 AM, Namhyung Kim wrote:
>> Hi David,
>>
>> On Mon, 14 Oct 2013 20:55:31 -0600, David Ahern wrote:
>>> When recording raw_syscalls for the entire system, e.g.,
>>>  perf record -e raw_syscalls:*,sched:sched_switch -a -- sleep 1
>>>
>>> you end up with a negative feedback loop as perf itself calls
>>> write() fairly often. This patch handles the problem by mmap'ing the
>>> file in chunks of 64M at a time and copies events from the event buffers
>>> to the file avoiding write system calls.
>>>
>>> Before (with write syscall):
>>>
>>> perf record -o /tmp/perf.data -e raw_syscalls:*,sched:sched_switch -a -- 
>>> sleep 1
>>> [ perf record: Woken up 0 times to write data ]
>>> [ perf record: Captured and wrote 81.843 MB /tmp/perf.data (~3575786 
>>> samples) ]
>>>
>>> After (using mmap):
>>>
>>> perf record -o /tmp/perf.data -e raw_syscalls:*,sched:sched_switch -a -- 
>>> sleep 1
>>> [ perf record: Woken up 31 times to write data ]
>>> [ perf record: Captured and wrote 8.203 MB /tmp/perf.data (~358388 samples) 
>>> ]
>>
>> Why do they have that different size?
>
> perf calls write() for each mmap, each time through the loop. Each
> write generates 2 events (syscall entry + exit) -- ie., generates more
> events. That's the negative feedback loop.

Aha, okay.  So it mostly matters to syscall tracing, right?  For a
normal record session, it seems that the effect is not that large:

Before:

  $ perf stat -r3 --null --sync -- perf record -a -- sleep 5
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.467 MB perf.data (~20420 samples) ]
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.544 MB perf.data (~23750 samples) ]
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.482 MB perf.data (~21073 samples) ]
  
   Performance counter stats for 'perf record -a -- sleep 5' (3 runs):
  
 5.174476094 seconds time elapsed   
   ( +-  0.07% )

  $ perf record -- perf bench sched pipe
  # Running sched/pipe benchmark...
  # Executed 100 pipe operations between two processes
  
   Total time: 21.271 [sec]
  
21.271357 usecs/op
47011 ops/sec
  [ perf record: Woken up 21 times to write data ]
  [ perf record: Captured and wrote 5.643 MB perf.data (~246524 samples) ]


After:

  $ perf stat -r3 --null --sync -- perf record -a -- sleep 5
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.459 MB perf.data (~20055 samples) ]
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.463 MB perf.data (~20230 samples) ]
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.467 MB perf.data (~20401 samples) ]
  
   Performance counter stats for 'perf record -a -- sleep 5' (3 runs):
  
 5.085910919 seconds time elapsed   
   ( +-  0.06% )

  $ perf record -- perf bench sched pipe
  # Running sched/pipe benchmark...
  # Executed 100 pipe operations between two processes
  
   Total time: 21.175 [sec]
  
21.175406 usecs/op
47224 ops/sec
  [ perf record: Woken up 21 times to write data ]
  [ perf record: Captured and wrote 5.612 MB perf.data (~245206 samples) ]


Thanks,
Namhyung
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] perf record: mmap output file - v2

2013-10-15 Thread Namhyung Kim
Hi David,

On Tue, 15 Oct 2013 07:25:04 -0600, David Ahern wrote:
> On 10/15/13 1:09 AM, Namhyung Kim wrote:
>>> The stat() seems superfluous, here in __cmd_record() we've just checked
>>> the output_name and made sure it exists. Can that stat() call ever fail?
>>
>> AFAICS it's needed to check current file size.  But I think it's better
>> to use fstat().
>
> Sure fstat could be used over stat -- if it ends up staying.
>
>
>>>
>>> 3)
>>>
>>> The rec->bytes_at_mmap_start field feels a bit weird. If I read the code
>>> correctly, in every 'perf record' invocation, rec->bytes_written starts at
>>> 0 - i.e. we don't have repeat invocations of cmd_record().
>>
>> rec->bytes_written is updated when it writes to the output file for
>> synthesizing COMM/MMAP events (this mmap output is not used at that time).
>
> Ingo: I went through a number of itereations before using the
> bytes_at_mmap_start. One of those was to use the bytes_written
> counter. All failed. Header + synthesized events are written to the
> file before we start farming the ring buffers.
>
> Perhaps a good code cleanup will help figure out why. I needed the
> functionality ASAP for use with perf-trace -a so I stuck with the new
> variable. Since this change is working out well, I will look at a code
> clean up on the next round.

session->header.data_offset ?

>
> I am traveling to LinuxCon / KVM Forum / Tracing Forum on
> Friday. Perhaps the clean up and followup patch can be done on the
> long plane ride; more likely when I return which means 3.14 material.

See you there :)

>
>> Actually I worried about the mmap offset not being aligned to page
>> size.  But it seems that's not a problem.
>
> This code snippet makes sure the mmap offset is a multiple of 64M
> (rec->mmap_size). offset is the argument to mmap; mmap_offset is the
> where we are within the mmap for the next copy:
>
> + offset = rec->bytes_at_mmap_start + rec->bytes_written;
> + if (offset < (ssize_t) rec->mmap_size) {
> + rec->mmap_offset = offset;
> + offset = 0;
> + } else
> + rec->mmap_offset = 0;
>

Oh, I overlooked this code, it actually aligned offset.

Thanks,
Namhyung
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] perf record: mmap output file - v2

2013-10-15 Thread David Ahern

On 10/15/13 1:44 AM, Ingo Molnar wrote:


* Namhyung Kim  wrote:


[SNIP]

+/* mmap file big chunks at a time */
+#define MMAP_OUTPUT_SIZE   (64*1024*1024)


Why did you choose 64MB for the size?  Did you also test other sizes?


Btw., should this value go up if the ring buffer (mmap_pages) is larger
than 64MB?



I made mmap_size a variable:
+   size_t  mmap_size;  /* size of mmap segments */

with the above initial value. I was planning to make it in an option and 
just forgot to complete it.


Why 64M? mmap / munmap are also system calls and I was looking to trade 
off huge file size jumps versus the frequency of adjusting the maps. 64M 
Was just a nice round number between 1 and 100. 8, 16 are too small. 128 
seems to big for a default. That left only 32 and 64. 64M seems the 
better trade off of the two.


Making it a user knob would help with smaller deployments. Could also 
have mmap_size = 0 mean turn it off (use write over mmap).


Perhaps something that adjust automatically would be useful too. e.g., 
For the case that motivates the change I have 16 cpus each with a 4M 
buffer (1024 mmap pages). Should we generically set the size:


mmap_size = ncpus_online * mmap_pages * page_size?

Do that only for system wide profiling?

David
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] perf record: mmap output file - v2

2013-10-15 Thread David Ahern

On 10/15/13 1:31 AM, Namhyung Kim wrote:

Hi David,

On Mon, 14 Oct 2013 20:55:31 -0600, David Ahern wrote:

When recording raw_syscalls for the entire system, e.g.,
 perf record -e raw_syscalls:*,sched:sched_switch -a -- sleep 1

you end up with a negative feedback loop as perf itself calls
write() fairly often. This patch handles the problem by mmap'ing the
file in chunks of 64M at a time and copies events from the event buffers
to the file avoiding write system calls.

Before (with write syscall):

perf record -o /tmp/perf.data -e raw_syscalls:*,sched:sched_switch -a -- sleep 1
[ perf record: Woken up 0 times to write data ]
[ perf record: Captured and wrote 81.843 MB /tmp/perf.data (~3575786 samples) ]

After (using mmap):

perf record -o /tmp/perf.data -e raw_syscalls:*,sched:sched_switch -a -- sleep 1
[ perf record: Woken up 31 times to write data ]
[ perf record: Captured and wrote 8.203 MB /tmp/perf.data (~358388 samples) ]


Why do they have that different size?


perf calls write() for each mmap, each time through the loop. Each write 
generates 2 events (syscall entry + exit) -- ie., generates more events. 
That's the negative feedback loop.




[SNIP]

+
+   rec->mmap_addr = mmap(NULL, rec->mmap_size,
+PROT_WRITE | PROT_READ,
+MAP_SHARED,
+rec->output,
+offset);
+
+   if (rec->mmap_addr == MAP_FAILED) {
+   pr_err("mmap failed: %d: %s\n", errno, strerror(errno));
+   return -1;
+   }
+
+   /* expand file to include this mmap segment */
+   if (ftruncate(rec->output, offset + rec->mmap_size) != 0) {
+   pr_err("ftruncate failed\n");
+   return -1;
+   }


I think this mmap + ftruncate should be reordered.  Although it looks
work without problems the mmap man pages says it's unspecified behavior.

A file is mapped in multiples of the page size.  For a file that is not
a multiple of the page  size,  the  remaining  memory  is  zeroed  when
mapped, and writes to that region are not written out to the file.  The
effect of changing the size of the underlying file of a mapping on  the
pages  that  correspond  to  added  or  removed  regions of the file is
unspecified.


The mmap only expands the address range; the ftruncate expands the file 
behind the mmap. Both are needed and must succeed to function properly, 
and I don't see how the order matters. ie.,


This order has an extra call on the failure path:
  ftruncate
  mmap
  - on failure call ftruncate to reset file size

The order I have does not have that problem:
  mmap
  ftruncate

Here on failure just return -1 and we end the session.

David
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] perf record: mmap output file - v2

2013-10-15 Thread David Ahern

On 10/15/13 1:09 AM, Namhyung Kim wrote:

The stat() seems superfluous, here in __cmd_record() we've just checked
the output_name and made sure it exists. Can that stat() call ever fail?


AFAICS it's needed to check current file size.  But I think it's better
to use fstat().


Sure fstat could be used over stat -- if it ends up staying.




3)

The rec->bytes_at_mmap_start field feels a bit weird. If I read the code
correctly, in every 'perf record' invocation, rec->bytes_written starts at
0 - i.e. we don't have repeat invocations of cmd_record().


rec->bytes_written is updated when it writes to the output file for
synthesizing COMM/MMAP events (this mmap output is not used at that time).


Ingo: I went through a number of itereations before using the 
bytes_at_mmap_start. One of those was to use the bytes_written counter. 
All failed. Header + synthesized events are written to the file before 
we start farming the ring buffers.


Perhaps a good code cleanup will help figure out why. I needed the 
functionality ASAP for use with perf-trace -a so I stuck with the new 
variable. Since this change is working out well, I will look at a code 
clean up on the next round.


I am traveling to LinuxCon / KVM Forum / Tracing Forum on Friday. 
Perhaps the clean up and followup patch can be done on the long plane 
ride; more likely when I return which means 3.14 material.



Actually I worried about the mmap offset not being aligned to page
size.  But it seems that's not a problem.


This code snippet makes sure the mmap offset is a multiple of 64M 
(rec->mmap_size). offset is the argument to mmap; mmap_offset is the 
where we are within the mmap for the next copy:


+   offset = rec->bytes_at_mmap_start + rec->bytes_written;
+   if (offset < (ssize_t) rec->mmap_size) {
+   rec->mmap_offset = offset;
+   offset = 0;
+   } else
+   rec->mmap_offset = 0;


David
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] perf record: mmap output file - v2

2013-10-15 Thread Ingo Molnar

* Jiri Olsa  wrote:

> On Tue, Oct 15, 2013 at 09:25:40AM +0200, Ingo Molnar wrote:
> > 
> > * Namhyung Kim  wrote:
> > 
> > > > 3)
> > > >
> > > > The rec->bytes_at_mmap_start field feels a bit weird. If I read the 
> > > > code 
> > > > correctly, in every 'perf record' invocation, rec->bytes_written starts 
> > > > at 
> > > > 0 - i.e. we don't have repeat invocations of cmd_record().
> > > 
> > > rec->bytes_written is updated when it writes to the output file for 
> > > synthesizing COMM/MMAP events (this mmap output is not used at that 
> > > time).
> > 
> > Btw., while looking into it, I think advance_output() needlessly 
> > obfuscates as well:
> > 
> > static void advance_output(struct perf_record *rec, size_t size)
> > {
> > rec->bytes_written += size;
> > }
> > 
> > that code should just be written open coded.
> > 
> > So I think all this needs a few good rounds of cleanups, before we can 
> > complicate it with a new feature. (the cleanups can be on top of the 
> > feature, if they go in at the same time.)
> 
> I sent some file code cleanup week ago, I'll rebase and resend it soon:
>
> http://marc.info/?l=linux-kernel=138113836428425=2

Nice!

> it's mostly about centralizing the code into file object.. got no 
> comments so far

Too much development going on I suspect :-)

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] perf record: mmap output file - v2

2013-10-15 Thread Jiri Olsa
On Tue, Oct 15, 2013 at 09:25:40AM +0200, Ingo Molnar wrote:
> 
> * Namhyung Kim  wrote:
> 
> > > 3)
> > >
> > > The rec->bytes_at_mmap_start field feels a bit weird. If I read the code 
> > > correctly, in every 'perf record' invocation, rec->bytes_written starts 
> > > at 
> > > 0 - i.e. we don't have repeat invocations of cmd_record().
> > 
> > rec->bytes_written is updated when it writes to the output file for 
> > synthesizing COMM/MMAP events (this mmap output is not used at that 
> > time).
> 
> Btw., while looking into it, I think advance_output() needlessly 
> obfuscates as well:
> 
> static void advance_output(struct perf_record *rec, size_t size)
> {
> rec->bytes_written += size;
> }
> 
> that code should just be written open coded.
> 
> So I think all this needs a few good rounds of cleanups, before we can 
> complicate it with a new feature. (the cleanups can be on top of the 
> feature, if they go in at the same time.)

I sent some file code cleanup week ago, I'll rebase
and resend it soon:
http://marc.info/?l=linux-kernel=138113836428425=2

it's mostly about centralizing the code into file
object.. got no comments so far

jirka
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] perf record: mmap output file - v2

2013-10-15 Thread Namhyung Kim
On Tue, 15 Oct 2013 09:25:40 +0200, Ingo Molnar wrote:
> * Namhyung Kim  wrote:
>> No, st_size is bigger than rec->bytes_written due to the 
>> perf_file_header which is written without updating rec->bytes_written.
>
> Since all this is code that executes once during __cmd_record(), is this 
> all about the header writeout?
>
> That is what confused me about the stat() call and that's messy really: we 
> are the ones who write the file header, we have a very good idea about how 
> many bytes we wrote to the file! It should be entirely unnecessary to lose 
> that information and then execute a system call to recover that 
> information...

Right, I think session->header.data_offset has the info already.

Thanks,
Namhyung
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] perf record: mmap output file - v2

2013-10-15 Thread Ingo Molnar

* Namhyung Kim  wrote:

> [SNIP]
> > +/* mmap file big chunks at a time */
> > +#define MMAP_OUTPUT_SIZE   (64*1024*1024)
> 
> Why did you choose 64MB for the size?  Did you also test other sizes?

Btw., should this value go up if the ring buffer (mmap_pages) is larger 
than 64MB?

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] perf record: mmap output file - v2

2013-10-15 Thread Namhyung Kim
Hi David,

On Mon, 14 Oct 2013 20:55:31 -0600, David Ahern wrote:
> When recording raw_syscalls for the entire system, e.g.,
> perf record -e raw_syscalls:*,sched:sched_switch -a -- sleep 1
>
> you end up with a negative feedback loop as perf itself calls
> write() fairly often. This patch handles the problem by mmap'ing the
> file in chunks of 64M at a time and copies events from the event buffers
> to the file avoiding write system calls.
>
> Before (with write syscall):
>
> perf record -o /tmp/perf.data -e raw_syscalls:*,sched:sched_switch -a -- 
> sleep 1
> [ perf record: Woken up 0 times to write data ]
> [ perf record: Captured and wrote 81.843 MB /tmp/perf.data (~3575786 samples) 
> ]
>
> After (using mmap):
>
> perf record -o /tmp/perf.data -e raw_syscalls:*,sched:sched_switch -a -- 
> sleep 1
> [ perf record: Woken up 31 times to write data ]
> [ perf record: Captured and wrote 8.203 MB /tmp/perf.data (~358388 samples) ]

Why do they have that different size?

>
[SNIP]
> +/* mmap file big chunks at a time */
> +#define MMAP_OUTPUT_SIZE   (64*1024*1024)

Why did you choose 64MB for the size?  Did you also test other sizes?


[SNIP]
> +
> + rec->mmap_addr = mmap(NULL, rec->mmap_size,
> +  PROT_WRITE | PROT_READ,
> +  MAP_SHARED,
> +  rec->output,
> +  offset);
> +
> + if (rec->mmap_addr == MAP_FAILED) {
> + pr_err("mmap failed: %d: %s\n", errno, strerror(errno));
> + return -1;
> + }
> +
> + /* expand file to include this mmap segment */
> + if (ftruncate(rec->output, offset + rec->mmap_size) != 0) {
> + pr_err("ftruncate failed\n");
> + return -1;
> + }

I think this mmap + ftruncate should be reordered.  Although it looks
work without problems the mmap man pages says it's unspecified behavior.

   A file is mapped in multiples of the page size.  For a file that is not
   a multiple of the page  size,  the  remaining  memory  is  zeroed  when
   mapped, and writes to that region are not written out to the file.  The
   effect of changing the size of the underlying file of a mapping on  the
   pages  that  correspond  to  added  or  removed  regions of the file is
   unspecified.


Thanks,
Namhyung
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] perf record: mmap output file - v2

2013-10-15 Thread Ingo Molnar

* Namhyung Kim  wrote:

> > 3)
> >
> > The rec->bytes_at_mmap_start field feels a bit weird. If I read the code 
> > correctly, in every 'perf record' invocation, rec->bytes_written starts at 
> > 0 - i.e. we don't have repeat invocations of cmd_record().
> 
> rec->bytes_written is updated when it writes to the output file for 
> synthesizing COMM/MMAP events (this mmap output is not used at that 
> time).

Btw., while looking into it, I think advance_output() needlessly 
obfuscates as well:

static void advance_output(struct perf_record *rec, size_t size)
{
rec->bytes_written += size;
}

that code should just be written open coded.

So I think all this needs a few good rounds of cleanups, before we can 
complicate it with a new feature. (the cleanups can be on top of the 
feature, if they go in at the same time.)

> > That means that this:
> >
> > rec->bytes_at_mmap_start = st.st_size - rec->bytes_written;
> >
> > is really:
> >
> > rec->bytes_at_mmap_start = st.st_size;
> >
> > furthermore, since we don't allow appends anymore, st.st_size ought to be 
> > zero as well.
> >
> > Which means that ->bytes_at_mmap_start is always zero - and could be 
> > eliminated altogether.
> 
> No, st_size is bigger than rec->bytes_written due to the 
> perf_file_header which is written without updating rec->bytes_written.

Since all this is code that executes once during __cmd_record(), is this 
all about the header writeout?

That is what confused me about the stat() call and that's messy really: we 
are the ones who write the file header, we have a very good idea about how 
many bytes we wrote to the file! It should be entirely unnecessary to lose 
that information and then execute a system call to recover that 
information...

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] perf record: mmap output file - v2

2013-10-15 Thread Namhyung Kim
Hi Ingo,

On Tue, 15 Oct 2013 08:02:00 +0200, Ingo Molnar wrote:
> * David Ahern  wrote:
>
>> +/* for MMAP based file writes */
>> +void*mmap_addr;
>> +u64 bytes_at_mmap_start; /* bytes in file when mmap 
>> use starts */
>> +u64 mmap_offset;/* current location within mmap 
>> */
>> +size_t  mmap_size;  /* size of mmap segments */
>> +booluse_mmap;
>
>> +if (!rec->opts.pipe_output && stat(output_name, ) == 0) {
>> +rec->use_mmap = true;
>> +rec->bytes_at_mmap_start = st.st_size - rec->bytes_written;
>> +}
>
> 1)
>
> I think __cmd_record() has become way too large, nearly 300 lines of code. 
> It would be nice to split it into 2-3 helpers that operate on 'struct 
> perf_record' or so.

Agreed.

>
> 2)
>
> The stat() seems superfluous, here in __cmd_record() we've just checked 
> the output_name and made sure it exists. Can that stat() call ever fail?

AFAICS it's needed to check current file size.  But I think it's better
to use fstat().

>
> 3)
>
> The rec->bytes_at_mmap_start field feels a bit weird. If I read the code 
> correctly, in every 'perf record' invocation, rec->bytes_written starts at 
> 0 - i.e. we don't have repeat invocations of cmd_record().

rec->bytes_written is updated when it writes to the output file for
synthesizing COMM/MMAP events (this mmap output is not used at that time).

>
> That means that this:
>
>   rec->bytes_at_mmap_start = st.st_size - rec->bytes_written;
>
> is really:
>
>   rec->bytes_at_mmap_start = st.st_size;
>
> furthermore, since we don't allow appends anymore, st.st_size ought to be 
> zero as well.
>
> Which means that ->bytes_at_mmap_start is always zero - and could be 
> eliminated altogether.

No, st_size is bigger than rec->bytes_written due to the
perf_file_header which is written without updating rec->bytes_written.

Actually I worried about the mmap offset not being aligned to page
size.  But it seems that's not a problem.

Thanks,
Namhyung
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] perf record: mmap output file - v2

2013-10-15 Thread Ingo Molnar

* David Ahern  wrote:

> + /* for MMAP based file writes */
> + void*mmap_addr;
> + u64 bytes_at_mmap_start; /* bytes in file when mmap 
> use starts */
> + u64 mmap_offset;/* current location within mmap 
> */
> + size_t  mmap_size;  /* size of mmap segments */
> + booluse_mmap;

> + if (!rec->opts.pipe_output && stat(output_name, ) == 0) {
> + rec->use_mmap = true;
> + rec->bytes_at_mmap_start = st.st_size - rec->bytes_written;
> + }

1)

I think __cmd_record() has become way too large, nearly 300 lines of code. 
It would be nice to split it into 2-3 helpers that operate on 'struct 
perf_record' or so.

2)

The stat() seems superfluous, here in __cmd_record() we've just checked 
the output_name and made sure it exists. Can that stat() call ever fail?

3)

The rec->bytes_at_mmap_start field feels a bit weird. If I read the code 
correctly, in every 'perf record' invocation, rec->bytes_written starts at 
0 - i.e. we don't have repeat invocations of cmd_record().

That means that this:

rec->bytes_at_mmap_start = st.st_size - rec->bytes_written;

is really:

rec->bytes_at_mmap_start = st.st_size;

furthermore, since we don't allow appends anymore, st.st_size ought to be 
zero as well.

Which means that ->bytes_at_mmap_start is always zero - and could be 
eliminated altogether.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] perf record: mmap output file - v2

2013-10-15 Thread Namhyung Kim
Hi David,

On Tue, 15 Oct 2013 07:25:04 -0600, David Ahern wrote:
 On 10/15/13 1:09 AM, Namhyung Kim wrote:
 The stat() seems superfluous, here in __cmd_record() we've just checked
 the output_name and made sure it exists. Can that stat() call ever fail?

 AFAICS it's needed to check current file size.  But I think it's better
 to use fstat().

 Sure fstat could be used over stat -- if it ends up staying.



 3)

 The rec-bytes_at_mmap_start field feels a bit weird. If I read the code
 correctly, in every 'perf record' invocation, rec-bytes_written starts at
 0 - i.e. we don't have repeat invocations of cmd_record().

 rec-bytes_written is updated when it writes to the output file for
 synthesizing COMM/MMAP events (this mmap output is not used at that time).

 Ingo: I went through a number of itereations before using the
 bytes_at_mmap_start. One of those was to use the bytes_written
 counter. All failed. Header + synthesized events are written to the
 file before we start farming the ring buffers.

 Perhaps a good code cleanup will help figure out why. I needed the
 functionality ASAP for use with perf-trace -a so I stuck with the new
 variable. Since this change is working out well, I will look at a code
 clean up on the next round.

session-header.data_offset ?


 I am traveling to LinuxCon / KVM Forum / Tracing Forum on
 Friday. Perhaps the clean up and followup patch can be done on the
 long plane ride; more likely when I return which means 3.14 material.

See you there :)


 Actually I worried about the mmap offset not being aligned to page
 size.  But it seems that's not a problem.

 This code snippet makes sure the mmap offset is a multiple of 64M
 (rec-mmap_size). offset is the argument to mmap; mmap_offset is the
 where we are within the mmap for the next copy:

 + offset = rec-bytes_at_mmap_start + rec-bytes_written;
 + if (offset  (ssize_t) rec-mmap_size) {
 + rec-mmap_offset = offset;
 + offset = 0;
 + } else
 + rec-mmap_offset = 0;


Oh, I overlooked this code, it actually aligned offset.

Thanks,
Namhyung
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] perf record: mmap output file - v2

2013-10-15 Thread Namhyung Kim
On Tue, 15 Oct 2013 07:35:53 -0600, David Ahern wrote:
 On 10/15/13 1:31 AM, Namhyung Kim wrote:
 Hi David,

 On Mon, 14 Oct 2013 20:55:31 -0600, David Ahern wrote:
 When recording raw_syscalls for the entire system, e.g.,
  perf record -e raw_syscalls:*,sched:sched_switch -a -- sleep 1

 you end up with a negative feedback loop as perf itself calls
 write() fairly often. This patch handles the problem by mmap'ing the
 file in chunks of 64M at a time and copies events from the event buffers
 to the file avoiding write system calls.

 Before (with write syscall):

 perf record -o /tmp/perf.data -e raw_syscalls:*,sched:sched_switch -a -- 
 sleep 1
 [ perf record: Woken up 0 times to write data ]
 [ perf record: Captured and wrote 81.843 MB /tmp/perf.data (~3575786 
 samples) ]

 After (using mmap):

 perf record -o /tmp/perf.data -e raw_syscalls:*,sched:sched_switch -a -- 
 sleep 1
 [ perf record: Woken up 31 times to write data ]
 [ perf record: Captured and wrote 8.203 MB /tmp/perf.data (~358388 samples) 
 ]

 Why do they have that different size?

 perf calls write() for each mmap, each time through the loop. Each
 write generates 2 events (syscall entry + exit) -- ie., generates more
 events. That's the negative feedback loop.

Aha, okay.  So it mostly matters to syscall tracing, right?  For a
normal record session, it seems that the effect is not that large:

Before:

  $ perf stat -r3 --null --sync -- perf record -a -- sleep 5
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.467 MB perf.data (~20420 samples) ]
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.544 MB perf.data (~23750 samples) ]
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.482 MB perf.data (~21073 samples) ]
  
   Performance counter stats for 'perf record -a -- sleep 5' (3 runs):
  
 5.174476094 seconds time elapsed   
   ( +-  0.07% )

  $ perf record -- perf bench sched pipe
  # Running sched/pipe benchmark...
  # Executed 100 pipe operations between two processes
  
   Total time: 21.271 [sec]
  
21.271357 usecs/op
47011 ops/sec
  [ perf record: Woken up 21 times to write data ]
  [ perf record: Captured and wrote 5.643 MB perf.data (~246524 samples) ]


After:

  $ perf stat -r3 --null --sync -- perf record -a -- sleep 5
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.459 MB perf.data (~20055 samples) ]
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.463 MB perf.data (~20230 samples) ]
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.467 MB perf.data (~20401 samples) ]
  
   Performance counter stats for 'perf record -a -- sleep 5' (3 runs):
  
 5.085910919 seconds time elapsed   
   ( +-  0.06% )

  $ perf record -- perf bench sched pipe
  # Running sched/pipe benchmark...
  # Executed 100 pipe operations between two processes
  
   Total time: 21.175 [sec]
  
21.175406 usecs/op
47224 ops/sec
  [ perf record: Woken up 21 times to write data ]
  [ perf record: Captured and wrote 5.612 MB perf.data (~245206 samples) ]


Thanks,
Namhyung
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] perf record: mmap output file - v2

2013-10-15 Thread David Ahern

On 10/15/13 7:52 PM, Namhyung Kim wrote:

Aha, okay.  So it mostly matters to syscall tracing, right?  For a
normal record session, it seems that the effect is not that large:


Yes, that's in the description When recording raw_syscalls for the 
entire system


There is a small benefit to all record sessions -- mmap+memcpy has less 
overhead than write(). I still need to look at Ingo's suggestion to use 
non-temporal stores which might reduce the overhead of the memcpy.


Try a workload that generates a HUGE data file -- say a full kernel 
build (e.g., perf record -g -- make O=/tmp/kbuild -j 16). You should see 
a much larger benefit from the mmap route. Be sure to use your callchain 
enhancements to look at that 1+G file. ;-)


David
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] perf record: mmap output file - v2

2013-10-15 Thread Ingo Molnar

* David Ahern dsah...@gmail.com wrote:

 + /* for MMAP based file writes */
 + void*mmap_addr;
 + u64 bytes_at_mmap_start; /* bytes in file when mmap 
 use starts */
 + u64 mmap_offset;/* current location within mmap 
 */
 + size_t  mmap_size;  /* size of mmap segments */
 + booluse_mmap;

 + if (!rec-opts.pipe_output  stat(output_name, st) == 0) {
 + rec-use_mmap = true;
 + rec-bytes_at_mmap_start = st.st_size - rec-bytes_written;
 + }

1)

I think __cmd_record() has become way too large, nearly 300 lines of code. 
It would be nice to split it into 2-3 helpers that operate on 'struct 
perf_record' or so.

2)

The stat() seems superfluous, here in __cmd_record() we've just checked 
the output_name and made sure it exists. Can that stat() call ever fail?

3)

The rec-bytes_at_mmap_start field feels a bit weird. If I read the code 
correctly, in every 'perf record' invocation, rec-bytes_written starts at 
0 - i.e. we don't have repeat invocations of cmd_record().

That means that this:

rec-bytes_at_mmap_start = st.st_size - rec-bytes_written;

is really:

rec-bytes_at_mmap_start = st.st_size;

furthermore, since we don't allow appends anymore, st.st_size ought to be 
zero as well.

Which means that -bytes_at_mmap_start is always zero - and could be 
eliminated altogether.

Thanks,

Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] perf record: mmap output file - v2

2013-10-15 Thread Namhyung Kim
Hi Ingo,

On Tue, 15 Oct 2013 08:02:00 +0200, Ingo Molnar wrote:
 * David Ahern dsah...@gmail.com wrote:

 +/* for MMAP based file writes */
 +void*mmap_addr;
 +u64 bytes_at_mmap_start; /* bytes in file when mmap 
 use starts */
 +u64 mmap_offset;/* current location within mmap 
 */
 +size_t  mmap_size;  /* size of mmap segments */
 +booluse_mmap;

 +if (!rec-opts.pipe_output  stat(output_name, st) == 0) {
 +rec-use_mmap = true;
 +rec-bytes_at_mmap_start = st.st_size - rec-bytes_written;
 +}

 1)

 I think __cmd_record() has become way too large, nearly 300 lines of code. 
 It would be nice to split it into 2-3 helpers that operate on 'struct 
 perf_record' or so.

Agreed.


 2)

 The stat() seems superfluous, here in __cmd_record() we've just checked 
 the output_name and made sure it exists. Can that stat() call ever fail?

AFAICS it's needed to check current file size.  But I think it's better
to use fstat().


 3)

 The rec-bytes_at_mmap_start field feels a bit weird. If I read the code 
 correctly, in every 'perf record' invocation, rec-bytes_written starts at 
 0 - i.e. we don't have repeat invocations of cmd_record().

rec-bytes_written is updated when it writes to the output file for
synthesizing COMM/MMAP events (this mmap output is not used at that time).


 That means that this:

   rec-bytes_at_mmap_start = st.st_size - rec-bytes_written;

 is really:

   rec-bytes_at_mmap_start = st.st_size;

 furthermore, since we don't allow appends anymore, st.st_size ought to be 
 zero as well.

 Which means that -bytes_at_mmap_start is always zero - and could be 
 eliminated altogether.

No, st_size is bigger than rec-bytes_written due to the
perf_file_header which is written without updating rec-bytes_written.

Actually I worried about the mmap offset not being aligned to page
size.  But it seems that's not a problem.

Thanks,
Namhyung
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] perf record: mmap output file - v2

2013-10-15 Thread Ingo Molnar

* Namhyung Kim namhy...@kernel.org wrote:

  3)
 
  The rec-bytes_at_mmap_start field feels a bit weird. If I read the code 
  correctly, in every 'perf record' invocation, rec-bytes_written starts at 
  0 - i.e. we don't have repeat invocations of cmd_record().
 
 rec-bytes_written is updated when it writes to the output file for 
 synthesizing COMM/MMAP events (this mmap output is not used at that 
 time).

Btw., while looking into it, I think advance_output() needlessly 
obfuscates as well:

static void advance_output(struct perf_record *rec, size_t size)
{
rec-bytes_written += size;
}

that code should just be written open coded.

So I think all this needs a few good rounds of cleanups, before we can 
complicate it with a new feature. (the cleanups can be on top of the 
feature, if they go in at the same time.)

  That means that this:
 
  rec-bytes_at_mmap_start = st.st_size - rec-bytes_written;
 
  is really:
 
  rec-bytes_at_mmap_start = st.st_size;
 
  furthermore, since we don't allow appends anymore, st.st_size ought to be 
  zero as well.
 
  Which means that -bytes_at_mmap_start is always zero - and could be 
  eliminated altogether.
 
 No, st_size is bigger than rec-bytes_written due to the 
 perf_file_header which is written without updating rec-bytes_written.

Since all this is code that executes once during __cmd_record(), is this 
all about the header writeout?

That is what confused me about the stat() call and that's messy really: we 
are the ones who write the file header, we have a very good idea about how 
many bytes we wrote to the file! It should be entirely unnecessary to lose 
that information and then execute a system call to recover that 
information...

Thanks,

Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] perf record: mmap output file - v2

2013-10-15 Thread Namhyung Kim
Hi David,

On Mon, 14 Oct 2013 20:55:31 -0600, David Ahern wrote:
 When recording raw_syscalls for the entire system, e.g.,
 perf record -e raw_syscalls:*,sched:sched_switch -a -- sleep 1

 you end up with a negative feedback loop as perf itself calls
 write() fairly often. This patch handles the problem by mmap'ing the
 file in chunks of 64M at a time and copies events from the event buffers
 to the file avoiding write system calls.

 Before (with write syscall):

 perf record -o /tmp/perf.data -e raw_syscalls:*,sched:sched_switch -a -- 
 sleep 1
 [ perf record: Woken up 0 times to write data ]
 [ perf record: Captured and wrote 81.843 MB /tmp/perf.data (~3575786 samples) 
 ]

 After (using mmap):

 perf record -o /tmp/perf.data -e raw_syscalls:*,sched:sched_switch -a -- 
 sleep 1
 [ perf record: Woken up 31 times to write data ]
 [ perf record: Captured and wrote 8.203 MB /tmp/perf.data (~358388 samples) ]

Why do they have that different size?


[SNIP]
 +/* mmap file big chunks at a time */
 +#define MMAP_OUTPUT_SIZE   (64*1024*1024)

Why did you choose 64MB for the size?  Did you also test other sizes?


[SNIP]
 +
 + rec-mmap_addr = mmap(NULL, rec-mmap_size,
 +  PROT_WRITE | PROT_READ,
 +  MAP_SHARED,
 +  rec-output,
 +  offset);
 +
 + if (rec-mmap_addr == MAP_FAILED) {
 + pr_err(mmap failed: %d: %s\n, errno, strerror(errno));
 + return -1;
 + }
 +
 + /* expand file to include this mmap segment */
 + if (ftruncate(rec-output, offset + rec-mmap_size) != 0) {
 + pr_err(ftruncate failed\n);
 + return -1;
 + }

I think this mmap + ftruncate should be reordered.  Although it looks
work without problems the mmap man pages says it's unspecified behavior.

   A file is mapped in multiples of the page size.  For a file that is not
   a multiple of the page  size,  the  remaining  memory  is  zeroed  when
   mapped, and writes to that region are not written out to the file.  The
   effect of changing the size of the underlying file of a mapping on  the
   pages  that  correspond  to  added  or  removed  regions of the file is
   unspecified.


Thanks,
Namhyung
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] perf record: mmap output file - v2

2013-10-15 Thread Ingo Molnar

* Namhyung Kim namhy...@kernel.org wrote:

 [SNIP]
  +/* mmap file big chunks at a time */
  +#define MMAP_OUTPUT_SIZE   (64*1024*1024)
 
 Why did you choose 64MB for the size?  Did you also test other sizes?

Btw., should this value go up if the ring buffer (mmap_pages) is larger 
than 64MB?

Thanks,

Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] perf record: mmap output file - v2

2013-10-15 Thread Namhyung Kim
On Tue, 15 Oct 2013 09:25:40 +0200, Ingo Molnar wrote:
 * Namhyung Kim namhy...@kernel.org wrote:
 No, st_size is bigger than rec-bytes_written due to the 
 perf_file_header which is written without updating rec-bytes_written.

 Since all this is code that executes once during __cmd_record(), is this 
 all about the header writeout?

 That is what confused me about the stat() call and that's messy really: we 
 are the ones who write the file header, we have a very good idea about how 
 many bytes we wrote to the file! It should be entirely unnecessary to lose 
 that information and then execute a system call to recover that 
 information...

Right, I think session-header.data_offset has the info already.

Thanks,
Namhyung
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] perf record: mmap output file - v2

2013-10-15 Thread Jiri Olsa
On Tue, Oct 15, 2013 at 09:25:40AM +0200, Ingo Molnar wrote:
 
 * Namhyung Kim namhy...@kernel.org wrote:
 
   3)
  
   The rec-bytes_at_mmap_start field feels a bit weird. If I read the code 
   correctly, in every 'perf record' invocation, rec-bytes_written starts 
   at 
   0 - i.e. we don't have repeat invocations of cmd_record().
  
  rec-bytes_written is updated when it writes to the output file for 
  synthesizing COMM/MMAP events (this mmap output is not used at that 
  time).
 
 Btw., while looking into it, I think advance_output() needlessly 
 obfuscates as well:
 
 static void advance_output(struct perf_record *rec, size_t size)
 {
 rec-bytes_written += size;
 }
 
 that code should just be written open coded.
 
 So I think all this needs a few good rounds of cleanups, before we can 
 complicate it with a new feature. (the cleanups can be on top of the 
 feature, if they go in at the same time.)

I sent some file code cleanup week ago, I'll rebase
and resend it soon:
http://marc.info/?l=linux-kernelm=138113836428425w=2

it's mostly about centralizing the code into file
object.. got no comments so far

jirka
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] perf record: mmap output file - v2

2013-10-15 Thread Ingo Molnar

* Jiri Olsa jo...@redhat.com wrote:

 On Tue, Oct 15, 2013 at 09:25:40AM +0200, Ingo Molnar wrote:
  
  * Namhyung Kim namhy...@kernel.org wrote:
  
3)
   
The rec-bytes_at_mmap_start field feels a bit weird. If I read the 
code 
correctly, in every 'perf record' invocation, rec-bytes_written starts 
at 
0 - i.e. we don't have repeat invocations of cmd_record().
   
   rec-bytes_written is updated when it writes to the output file for 
   synthesizing COMM/MMAP events (this mmap output is not used at that 
   time).
  
  Btw., while looking into it, I think advance_output() needlessly 
  obfuscates as well:
  
  static void advance_output(struct perf_record *rec, size_t size)
  {
  rec-bytes_written += size;
  }
  
  that code should just be written open coded.
  
  So I think all this needs a few good rounds of cleanups, before we can 
  complicate it with a new feature. (the cleanups can be on top of the 
  feature, if they go in at the same time.)
 
 I sent some file code cleanup week ago, I'll rebase and resend it soon:

 http://marc.info/?l=linux-kernelm=138113836428425w=2

Nice!

 it's mostly about centralizing the code into file object.. got no 
 comments so far

Too much development going on I suspect :-)

Thanks,

Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] perf record: mmap output file - v2

2013-10-15 Thread David Ahern

On 10/15/13 1:09 AM, Namhyung Kim wrote:

The stat() seems superfluous, here in __cmd_record() we've just checked
the output_name and made sure it exists. Can that stat() call ever fail?


AFAICS it's needed to check current file size.  But I think it's better
to use fstat().


Sure fstat could be used over stat -- if it ends up staying.




3)

The rec-bytes_at_mmap_start field feels a bit weird. If I read the code
correctly, in every 'perf record' invocation, rec-bytes_written starts at
0 - i.e. we don't have repeat invocations of cmd_record().


rec-bytes_written is updated when it writes to the output file for
synthesizing COMM/MMAP events (this mmap output is not used at that time).


Ingo: I went through a number of itereations before using the 
bytes_at_mmap_start. One of those was to use the bytes_written counter. 
All failed. Header + synthesized events are written to the file before 
we start farming the ring buffers.


Perhaps a good code cleanup will help figure out why. I needed the 
functionality ASAP for use with perf-trace -a so I stuck with the new 
variable. Since this change is working out well, I will look at a code 
clean up on the next round.


I am traveling to LinuxCon / KVM Forum / Tracing Forum on Friday. 
Perhaps the clean up and followup patch can be done on the long plane 
ride; more likely when I return which means 3.14 material.



Actually I worried about the mmap offset not being aligned to page
size.  But it seems that's not a problem.


This code snippet makes sure the mmap offset is a multiple of 64M 
(rec-mmap_size). offset is the argument to mmap; mmap_offset is the 
where we are within the mmap for the next copy:


+   offset = rec-bytes_at_mmap_start + rec-bytes_written;
+   if (offset  (ssize_t) rec-mmap_size) {
+   rec-mmap_offset = offset;
+   offset = 0;
+   } else
+   rec-mmap_offset = 0;


David
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] perf record: mmap output file - v2

2013-10-15 Thread David Ahern

On 10/15/13 1:31 AM, Namhyung Kim wrote:

Hi David,

On Mon, 14 Oct 2013 20:55:31 -0600, David Ahern wrote:

When recording raw_syscalls for the entire system, e.g.,
 perf record -e raw_syscalls:*,sched:sched_switch -a -- sleep 1

you end up with a negative feedback loop as perf itself calls
write() fairly often. This patch handles the problem by mmap'ing the
file in chunks of 64M at a time and copies events from the event buffers
to the file avoiding write system calls.

Before (with write syscall):

perf record -o /tmp/perf.data -e raw_syscalls:*,sched:sched_switch -a -- sleep 1
[ perf record: Woken up 0 times to write data ]
[ perf record: Captured and wrote 81.843 MB /tmp/perf.data (~3575786 samples) ]

After (using mmap):

perf record -o /tmp/perf.data -e raw_syscalls:*,sched:sched_switch -a -- sleep 1
[ perf record: Woken up 31 times to write data ]
[ perf record: Captured and wrote 8.203 MB /tmp/perf.data (~358388 samples) ]


Why do they have that different size?


perf calls write() for each mmap, each time through the loop. Each write 
generates 2 events (syscall entry + exit) -- ie., generates more events. 
That's the negative feedback loop.




[SNIP]

+
+   rec-mmap_addr = mmap(NULL, rec-mmap_size,
+PROT_WRITE | PROT_READ,
+MAP_SHARED,
+rec-output,
+offset);
+
+   if (rec-mmap_addr == MAP_FAILED) {
+   pr_err(mmap failed: %d: %s\n, errno, strerror(errno));
+   return -1;
+   }
+
+   /* expand file to include this mmap segment */
+   if (ftruncate(rec-output, offset + rec-mmap_size) != 0) {
+   pr_err(ftruncate failed\n);
+   return -1;
+   }


I think this mmap + ftruncate should be reordered.  Although it looks
work without problems the mmap man pages says it's unspecified behavior.

A file is mapped in multiples of the page size.  For a file that is not
a multiple of the page  size,  the  remaining  memory  is  zeroed  when
mapped, and writes to that region are not written out to the file.  The
effect of changing the size of the underlying file of a mapping on  the
pages  that  correspond  to  added  or  removed  regions of the file is
unspecified.


The mmap only expands the address range; the ftruncate expands the file 
behind the mmap. Both are needed and must succeed to function properly, 
and I don't see how the order matters. ie.,


This order has an extra call on the failure path:
  ftruncate
  mmap
  - on failure call ftruncate to reset file size

The order I have does not have that problem:
  mmap
  ftruncate

Here on failure just return -1 and we end the session.

David
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] perf record: mmap output file - v2

2013-10-15 Thread David Ahern

On 10/15/13 1:44 AM, Ingo Molnar wrote:


* Namhyung Kim namhy...@kernel.org wrote:


[SNIP]

+/* mmap file big chunks at a time */
+#define MMAP_OUTPUT_SIZE   (64*1024*1024)


Why did you choose 64MB for the size?  Did you also test other sizes?


Btw., should this value go up if the ring buffer (mmap_pages) is larger
than 64MB?



I made mmap_size a variable:
+   size_t  mmap_size;  /* size of mmap segments */

with the above initial value. I was planning to make it in an option and 
just forgot to complete it.


Why 64M? mmap / munmap are also system calls and I was looking to trade 
off huge file size jumps versus the frequency of adjusting the maps. 64M 
Was just a nice round number between 1 and 100. 8, 16 are too small. 128 
seems to big for a default. That left only 32 and 64. 64M seems the 
better trade off of the two.


Making it a user knob would help with smaller deployments. Could also 
have mmap_size = 0 mean turn it off (use write over mmap).


Perhaps something that adjust automatically would be useful too. e.g., 
For the case that motivates the change I have 16 cpus each with a 4M 
buffer (1024 mmap pages). Should we generically set the size:


mmap_size = ncpus_online * mmap_pages * page_size?

Do that only for system wide profiling?

David
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] perf record: mmap output file - v2

2013-10-14 Thread David Ahern
When recording raw_syscalls for the entire system, e.g.,
perf record -e raw_syscalls:*,sched:sched_switch -a -- sleep 1

you end up with a negative feedback loop as perf itself calls
write() fairly often. This patch handles the problem by mmap'ing the
file in chunks of 64M at a time and copies events from the event buffers
to the file avoiding write system calls.

Before (with write syscall):

perf record -o /tmp/perf.data -e raw_syscalls:*,sched:sched_switch -a -- sleep 1
[ perf record: Woken up 0 times to write data ]
[ perf record: Captured and wrote 81.843 MB /tmp/perf.data (~3575786 samples) ]

After (using mmap):

perf record -o /tmp/perf.data -e raw_syscalls:*,sched:sched_switch -a -- sleep 1
[ perf record: Woken up 31 times to write data ]
[ perf record: Captured and wrote 8.203 MB /tmp/perf.data (~358388 samples) ]

In addition to perf-trace benefits using mmap lowers the overhead of
perf-record. For example,

  perf stat -i -- perf record -g -o /tmp/perf.data openssl speed aes

showsi a drop in time, CPU cycles, and instructions all drop by more than a
factor of 3. Jiri also ran a test that showed a big improvement.

v2: removed msync call before munmap per Jiri's suggestion

Signed-off-by: David Ahern 
Cc: Ingo Molnar 
Cc: Frederic Weisbecker 
Cc: Peter Zijlstra 
Cc: Jiri Olsa 
Cc: Namhyung Kim 
Cc: Mike Galbraith 
Cc: Stephane Eranian 
---
 tools/perf/builtin-record.c |   85 +++
 1 file changed, 85 insertions(+)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 92ca5419073b..a1cf20e65089 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -29,6 +29,9 @@
 #include 
 #include 
 
+/* mmap file big chunks at a time */
+#define MMAP_OUTPUT_SIZE   (64*1024*1024)
+
 #ifndef HAVE_ON_EXIT_SUPPORT
 #ifndef ATEXIT_MAX
 #define ATEXIT_MAX 32
@@ -64,6 +67,14 @@ static void __handle_on_exit_funcs(void)
 struct perf_record {
struct perf_tooltool;
struct perf_record_opts opts;
+
+   /* for MMAP based file writes */
+   void*mmap_addr;
+   u64 bytes_at_mmap_start; /* bytes in file when mmap 
use starts */
+   u64 mmap_offset;/* current location within mmap 
*/
+   size_t  mmap_size;  /* size of mmap segments */
+   booluse_mmap;
+
u64 bytes_written;
const char  *output_name;
struct perf_evlist  *evlist;
@@ -82,8 +93,65 @@ static void advance_output(struct perf_record *rec, size_t 
size)
rec->bytes_written += size;
 }
 
+static int do_mmap_output(struct perf_record *rec, void *buf, size_t size)
+{
+   u64 remaining;
+   off_t offset;
+
+   if (rec->mmap_addr == NULL) {
+do_mmap:
+   offset = rec->bytes_at_mmap_start + rec->bytes_written;
+   if (offset < (ssize_t) rec->mmap_size) {
+   rec->mmap_offset = offset;
+   offset = 0;
+   } else
+   rec->mmap_offset = 0;
+
+   rec->mmap_addr = mmap(NULL, rec->mmap_size,
+PROT_WRITE | PROT_READ,
+MAP_SHARED,
+rec->output,
+offset);
+
+   if (rec->mmap_addr == MAP_FAILED) {
+   pr_err("mmap failed: %d: %s\n", errno, strerror(errno));
+   return -1;
+   }
+
+   /* expand file to include this mmap segment */
+   if (ftruncate(rec->output, offset + rec->mmap_size) != 0) {
+   pr_err("ftruncate failed\n");
+   return -1;
+   }
+   }
+
+   remaining = rec->mmap_size - rec->mmap_offset;
+
+   if (size > remaining) {
+   memcpy(rec->mmap_addr + rec->mmap_offset, buf, remaining);
+   rec->bytes_written += remaining;
+
+   size -= remaining;
+   buf  += remaining;
+
+   munmap(rec->mmap_addr, rec->mmap_size);
+   goto do_mmap;
+   }
+
+   if (size) {
+   memcpy(rec->mmap_addr + rec->mmap_offset, buf, size);
+   rec->bytes_written += size;
+   rec->mmap_offset += size;
+   }
+
+   return 0;
+}
+
 static int write_output(struct perf_record *rec, void *buf, size_t size)
 {
+   if (rec->use_mmap)
+   return do_mmap_output(rec, buf, size);
+
while (size) {
int ret = write(rec->output, buf, size);
 
@@ -546,6 +614,11 @@ static int __cmd_record(struct perf_record *rec, int argc, 
const char **argv)
if (forks)
perf_evlist__start_workload(evsel_list);
 
+   if (!rec->opts.pipe_output && stat(output_name, ) == 0) {
+   rec->use_mmap = true;
+   

[PATCH] perf record: mmap output file - v2

2013-10-14 Thread David Ahern
When recording raw_syscalls for the entire system, e.g.,
perf record -e raw_syscalls:*,sched:sched_switch -a -- sleep 1

you end up with a negative feedback loop as perf itself calls
write() fairly often. This patch handles the problem by mmap'ing the
file in chunks of 64M at a time and copies events from the event buffers
to the file avoiding write system calls.

Before (with write syscall):

perf record -o /tmp/perf.data -e raw_syscalls:*,sched:sched_switch -a -- sleep 1
[ perf record: Woken up 0 times to write data ]
[ perf record: Captured and wrote 81.843 MB /tmp/perf.data (~3575786 samples) ]

After (using mmap):

perf record -o /tmp/perf.data -e raw_syscalls:*,sched:sched_switch -a -- sleep 1
[ perf record: Woken up 31 times to write data ]
[ perf record: Captured and wrote 8.203 MB /tmp/perf.data (~358388 samples) ]

In addition to perf-trace benefits using mmap lowers the overhead of
perf-record. For example,

  perf stat -i -- perf record -g -o /tmp/perf.data openssl speed aes

showsi a drop in time, CPU cycles, and instructions all drop by more than a
factor of 3. Jiri also ran a test that showed a big improvement.

v2: removed msync call before munmap per Jiri's suggestion

Signed-off-by: David Ahern dsah...@gmail.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Frederic Weisbecker fweis...@gmail.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Jiri Olsa jo...@redhat.com
Cc: Namhyung Kim namhy...@kernel.org
Cc: Mike Galbraith efa...@gmx.de
Cc: Stephane Eranian eran...@google.com
---
 tools/perf/builtin-record.c |   85 +++
 1 file changed, 85 insertions(+)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 92ca5419073b..a1cf20e65089 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -29,6 +29,9 @@
 #include sched.h
 #include sys/mman.h
 
+/* mmap file big chunks at a time */
+#define MMAP_OUTPUT_SIZE   (64*1024*1024)
+
 #ifndef HAVE_ON_EXIT_SUPPORT
 #ifndef ATEXIT_MAX
 #define ATEXIT_MAX 32
@@ -64,6 +67,14 @@ static void __handle_on_exit_funcs(void)
 struct perf_record {
struct perf_tooltool;
struct perf_record_opts opts;
+
+   /* for MMAP based file writes */
+   void*mmap_addr;
+   u64 bytes_at_mmap_start; /* bytes in file when mmap 
use starts */
+   u64 mmap_offset;/* current location within mmap 
*/
+   size_t  mmap_size;  /* size of mmap segments */
+   booluse_mmap;
+
u64 bytes_written;
const char  *output_name;
struct perf_evlist  *evlist;
@@ -82,8 +93,65 @@ static void advance_output(struct perf_record *rec, size_t 
size)
rec-bytes_written += size;
 }
 
+static int do_mmap_output(struct perf_record *rec, void *buf, size_t size)
+{
+   u64 remaining;
+   off_t offset;
+
+   if (rec-mmap_addr == NULL) {
+do_mmap:
+   offset = rec-bytes_at_mmap_start + rec-bytes_written;
+   if (offset  (ssize_t) rec-mmap_size) {
+   rec-mmap_offset = offset;
+   offset = 0;
+   } else
+   rec-mmap_offset = 0;
+
+   rec-mmap_addr = mmap(NULL, rec-mmap_size,
+PROT_WRITE | PROT_READ,
+MAP_SHARED,
+rec-output,
+offset);
+
+   if (rec-mmap_addr == MAP_FAILED) {
+   pr_err(mmap failed: %d: %s\n, errno, strerror(errno));
+   return -1;
+   }
+
+   /* expand file to include this mmap segment */
+   if (ftruncate(rec-output, offset + rec-mmap_size) != 0) {
+   pr_err(ftruncate failed\n);
+   return -1;
+   }
+   }
+
+   remaining = rec-mmap_size - rec-mmap_offset;
+
+   if (size  remaining) {
+   memcpy(rec-mmap_addr + rec-mmap_offset, buf, remaining);
+   rec-bytes_written += remaining;
+
+   size -= remaining;
+   buf  += remaining;
+
+   munmap(rec-mmap_addr, rec-mmap_size);
+   goto do_mmap;
+   }
+
+   if (size) {
+   memcpy(rec-mmap_addr + rec-mmap_offset, buf, size);
+   rec-bytes_written += size;
+   rec-mmap_offset += size;
+   }
+
+   return 0;
+}
+
 static int write_output(struct perf_record *rec, void *buf, size_t size)
 {
+   if (rec-use_mmap)
+   return do_mmap_output(rec, buf, size);
+
while (size) {
int ret = write(rec-output, buf, size);
 
@@ -546,6 +614,11 @@ static int __cmd_record(struct perf_record *rec, int argc, 
const char **argv)
if (forks)
perf_evlist__start_workload(evsel_list);
 
+   if