Re: [PATCH v1 0/4] perf: enable compression of record mode trace to save storage space

2019-01-14 Thread Alexey Budankov
On 14.01.2019 14:03, Jiri Olsa wrote:
> On Mon, Jan 14, 2019 at 11:43:31AM +0300, Alexey Budankov wrote:
>> Hi,
>> On 09.01.2019 20:28, Jiri Olsa wrote:
>>> On Mon, Dec 24, 2018 at 04:21:33PM +0300, Alexey Budankov wrote:

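>>>> The option is effective jointly with asynchronous trace writing because 
>>>> compression requires auxiliary memory buffers to operate on and memory 
>>>> buffers for asynchronous trace writing serve that purpose.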

>>>
>>> I don't like that it's for aio only, I can't really see why it's
>>
>> For serial streaming, on CPU bound codes, under full system utilization it 
>> can induce more runtime overhead and increase data loss because amount of 
>> code on performance critical path grows, of course size of written data 
>> reduces but still. Feeding kernel buffer content by user space code to a 
>> syscall is extended with intermediate copying to user space memory with 
>> doing some math on it in the middle.
>>
>>> a problem for normal data.. can't we just have one layer before and
>>> stream the data to the compress function instead of the file (or aio
>>> buffers).. and that compress functions would spit out 64K size COMPRESSED
>>> events, which would go to file (or aio buffers)
>>
>> It is already almost like that. Compression could be bridged using AIO 
>> buffers but then still streamed to file serially using record__pushfn() 
>> and that would make some sense for moderate profiling cases on systems 
>> without AIO support and trace streaming based on it.
>>
>>>
>>> the report side would process them (decompress) on the session layer
>>> before the tool callbacks are called
>>
>> It is already pretty similar to that.
> 
> hum, AFAICS you do that in the report code, not on the session layer

Correct. Decompressor and handling of compressed data chunks could be 
moved to session related code.
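
To make the idea concrete, here is a minimal sketch of session-level handling,
under loud assumptions: the record type id EV_COMPRESSED_HYPOTHETICAL, the
helper names, and the identity stub_decompress() are all made up for
illustration and are not taken from the patch set; a real version would call
the Zstandard decompressor instead. It also assumes every compressed record
expands to a whole number of plain records. The header layout mirrors
struct perf_event_header (u32 type, u16 misc, u16 size).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same layout as struct perf_event_header. */
struct ev_header {
	uint32_t type;
	uint16_t misc;
	uint16_t size;	/* header + payload, in bytes */
};

/* Placeholder type id for illustration only, not the value used by perf. */
#define EV_COMPRESSED_HYPOTHETICAL 0x7fffu

typedef void (*tool_cb)(const struct ev_header *hdr, void *ctx);

/* Example tool callback: logs the types of the records it is given. */
struct seen_log {
	uint32_t types[8];
	int count;
};

static void record_type(const struct ev_header *hdr, void *ctx)
{
	struct seen_log *log = ctx;

	if (log->count < 8)
		log->types[log->count] = hdr->type;
	log->count++;
}

/* Stand-in for the real decompressor: copies the payload unchanged. */
static size_t stub_decompress(uint8_t *dst, size_t cap,
			      const uint8_t *src, size_t len)
{
	if (len > cap)
		return 0;
	memcpy(dst, src, len);
	return len;
}

/* Deliver plain records to cb; returns their count, or -1 if malformed. */
static int deliver_plain(const uint8_t *buf, size_t len, tool_cb cb, void *ctx)
{
	int n = 0;

	while (len >= sizeof(struct ev_header)) {
		struct ev_header hdr;

		memcpy(&hdr, buf, sizeof(hdr));
		if (hdr.size < sizeof(hdr) || hdr.size > len)
			return -1;
		cb(&hdr, ctx);
		buf += hdr.size;
		len -= hdr.size;
		n++;
	}
	return len == 0 ? n : -1;
}

/*
 * Session layer: unwrap compressed records before any tool callback runs,
 * so the tool only ever sees plain records.
 */
static int session_process(const uint8_t *buf, size_t len, tool_cb cb, void *ctx)
{
	static uint8_t scratch[1 << 16];
	int total = 0;

	while (len >= sizeof(struct ev_header)) {
		struct ev_header hdr;

		memcpy(&hdr, buf, sizeof(hdr));
		if (hdr.size < sizeof(hdr) || hdr.size > len)
			return -1;
		if (hdr.type == EV_COMPRESSED_HYPOTHETICAL) {
			size_t raw = stub_decompress(scratch, sizeof(scratch),
						     buf + sizeof(hdr),
						     hdr.size - sizeof(hdr));
			int n;

			assert(raw <= sizeof(scratch));
			n = deliver_plain(scratch, raw, cb, ctx);
			if (n < 0)
				return -1;
			total += n;
		} else {
			cb(&hdr, ctx);
			total++;
		}
		buf += hdr.size;
		len -= hdr.size;
	}
	return len == 0 ? total : -1;
}
```

With this shape the report code would not need to know about compression at
all; only the session layer touches the wrapped records.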

Thanks,
Alexey

> 
> jirka
> 


Re: [PATCH v1 0/4] perf: enable compression of record mode trace to save storage space

2019-01-14 Thread Jiri Olsa
On Mon, Jan 14, 2019 at 11:43:31AM +0300, Alexey Budankov wrote:
> Hi,
> On 09.01.2019 20:28, Jiri Olsa wrote:
> > On Mon, Dec 24, 2018 at 04:21:33PM +0300, Alexey Budankov wrote:
> >>
> >> The patch set implements runtime record trace compression accompanied by 
> >> trace file decompression implemented in the tool report mode. Zstandard 
> >> library API [1] is used for compression/decompression of data that come 
> >> from perf_events kernel data buffers.
> >>
> >> Realized -z,--compression_level=n option provides ~3-5x avg. trace file 
> >> size reduction on the tested workloads what significantly saves user's 
> >> storage space on larger server systems where trace file size can easily 
> >> reach several tens or even hundreds of GiBs, especially when profiling 
> >> with stacks for later dwarf unwinding, context-switches tracing and etc.
> >>
> >> The option is effective jointly with asynchronous trace writing because 
> >> compression requires auxiliary memory buffers to operate on and memory 
> >> buffers for asynchronous trace writing serve that purpose.
> > 
> > I don't like that it's for aio only, I can't really see why it's
> 
> For serial streaming, on CPU bound codes, under full system utilization it 
> can induce more runtime overhead and increase data loss because amount of 
> code on performance critical path grows, of course size of written data 
> reduces but still. Feeding kernel buffer content by user space code to a 
> syscall is extended with intermediate copying to user space memory with 
> doing some math on it in the middle.
> 
> > a problem for normal data.. can't we just have one layer before and
> > stream the data to the compress function instead of the file (or aio
> > buffers).. and that compress functions would spit out 64K size COMPRESSED
> > events, which would go to file (or aio buffers)
> 
> It is already almost like that. Compression could be bridged using AIO 
> buffers but then still streamed to file serially using record__pushfn() 
> and that would make some sense for moderate profiling cases on systems 
> without AIO support and trace streaming based on it.
> 
> > 
> > the report side would process them (decompress) on the session layer
> > before the tool callbacks are called
> 
> It is already pretty similar to that.

hum, AFAICS you do that in the report code, not on the session layer

jirka


Re: [PATCH v1 0/4] perf: enable compression of record mode trace to save storage space

2019-01-14 Thread Alexey Budankov
Hi,
On 09.01.2019 20:28, Jiri Olsa wrote:
> On Mon, Dec 24, 2018 at 04:21:33PM +0300, Alexey Budankov wrote:
>>
>> The patch set implements runtime record trace compression accompanied by 
>> trace file decompression implemented in the tool report mode. Zstandard 
>> library API [1] is used for compression/decompression of data that come 
>> from perf_events kernel data buffers.
>>
>> Realized -z,--compression_level=n option provides ~3-5x avg. trace file 
>> size reduction on the tested workloads what significantly saves user's 
>> storage space on larger server systems where trace file size can easily 
>> reach several tens or even hundreds of GiBs, especially when profiling 
>> with stacks for later dwarf unwinding, context-switches tracing and etc.
>>
>> The option is effective jointly with asynchronous trace writing because 
>> compression requires auxiliary memory buffers to operate on and memory 
>> buffers for asynchronous trace writing serve that purpose.
> 
> I don't like that it's for aio only, I can't really see why it's

For serial streaming, with CPU-bound code under full system utilization, it 
can induce more runtime overhead and increase data loss, because the amount 
of code on the performance-critical path grows. The size of the written data 
shrinks, of course, but still. Feeding kernel buffer content from user space 
to a syscall is extended with an intermediate copy into user space memory 
and some math on it in the middle.

> a problem for normal data.. can't we just have one layer before and
> stream the data to the compress function instead of the file (or aio
> buffers).. and that compress functions would spit out 64K size COMPRESSED
> events, which would go to file (or aio buffers)

It is already almost like that. Compression could be bridged through the AIO 
buffers, with the result still streamed to the file serially using 
record__pushfn(). That would make some sense for moderate profiling cases on 
systems without AIO support and the trace streaming based on it.

> 
> the report side would process them (decompress) on the session layer
> before the tool callbacks are called

It is already pretty similar to that.

Thanks,
Alexey

> 
> jirka
> 


Re: [PATCH v1 0/4] perf: enable compression of record mode trace to save storage space

2019-01-09 Thread Jiri Olsa
On Mon, Dec 24, 2018 at 04:21:33PM +0300, Alexey Budankov wrote:
> 
> The patch set implements runtime record trace compression accompanied by 
> trace file decompression implemented in the tool report mode. Zstandard 
> library API [1] is used for compression/decompression of data that come 
> from perf_events kernel data buffers.
> 
> Realized -z,--compression_level=n option provides ~3-5x avg. trace file 
> size reduction on the tested workloads what significantly saves user's 
> storage space on larger server systems where trace file size can easily 
> reach several tens or even hundreds of GiBs, especially when profiling 
> with stacks for later dwarf unwinding, context-switches tracing and etc.
> 
> The option is effective jointly with asynchronous trace writing because 
> compression requires auxiliary memory buffers to operate on and memory 
> buffers for asynchronous trace writing serve that purpose.

I don't like that it's for aio only, I can't really see why it's
a problem for normal data.. can't we just have one layer before and
stream the data to the compress function instead of the file (or aio
buffers).. and that compress function would spit out 64K size COMPRESSED
events, which would go to file (or aio buffers)

the report side would process them (decompress) on the session layer
before the tool callbacks are called
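
A rough sketch of the record-side layer described above, under stated
assumptions: the type id EV_COMPRESSED_HYPOTHETICAL and the helper names are
invented for illustration, and stub_compress() is an identity copy standing
in for the real compressor. The header mirrors struct perf_event_header,
whose 16-bit size field is what limits one record to just under 64K.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same layout as struct perf_event_header. */
struct ev_header {
	uint32_t type;
	uint16_t misc;
	uint16_t size;	/* header + payload, so one record is < 64 KiB */
};

/* Placeholder type id for illustration only, not the value used by perf. */
#define EV_COMPRESSED_HYPOTHETICAL 0x7fffu
#define EV_MAX_PAYLOAD (UINT16_MAX - sizeof(struct ev_header))

/* Stand-in for the real compressor: copies the input unchanged. */
static size_t stub_compress(void *dst, size_t dst_cap,
			    const void *src, size_t src_len)
{
	size_t n = src_len < dst_cap ? src_len : dst_cap;

	memcpy(dst, src, n);
	return n;
}

/*
 * Wrap src into a sequence of COMPRESSED-style records in out.
 * Returns the number of bytes written, or 0 if out is too small.
 * The caller then hands out to the file writer or to the aio buffers.
 */
static size_t pack_compressed(uint8_t *out, size_t out_cap,
			      const uint8_t *src, size_t src_len)
{
	size_t written = 0;

	while (src_len > 0) {
		size_t chunk = src_len < EV_MAX_PAYLOAD ? src_len : EV_MAX_PAYLOAD;
		struct ev_header hdr;
		size_t payload;

		if (out_cap - written < sizeof(hdr) + chunk)
			return 0;

		payload = stub_compress(out + written + sizeof(hdr),
					EV_MAX_PAYLOAD, src, chunk);
		assert(payload <= EV_MAX_PAYLOAD);

		hdr.type = EV_COMPRESSED_HYPOTHETICAL;
		hdr.misc = 0;
		hdr.size = (uint16_t)(sizeof(hdr) + payload);
		memcpy(out + written, &hdr, sizeof(hdr));

		written += sizeof(hdr) + payload;
		src += chunk;
		src_len -= chunk;
	}
	return written;
}
```

Because the layer sits in front of the writer, the same function could feed
either the serial path or the aio path.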

jirka


[PATCH v1 0/4] perf: enable compression of record mode trace to save storage space

2018-12-24 Thread Alexey Budankov


The patch set implements runtime trace compression in record mode, 
accompanied by trace file decompression in the tool's report mode. The 
Zstandard library API [1] is used to compress and decompress data that comes 
from the perf_events kernel data buffers.

The new -z,--compression_level=n option provides ~3-5x average trace file 
size reduction on the tested workloads, which saves significant storage 
space on larger server systems where trace file size can easily reach 
several tens or even hundreds of GiBs, especially when profiling with stacks 
for later DWARF unwinding, context-switch tracing, etc.

The option takes effect only in combination with asynchronous trace writing, 
because compression requires auxiliary memory buffers to operate on, and the 
memory buffers for asynchronous trace writing serve that purpose.

The added --mmap-flush option can be used to avoid compressing every single 
byte of data from the mmaped kernel buffers to the trace file, which 
increases the compression ratio while lowering the tool's runtime overhead.
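
As an illustration only, the thresholding idea can be sketched like this. The
helper names, the free-running head/tail counters, and the "final" drain flag
are assumptions made for the sketch, not the patch's actual code: data is
handed to the compressor only once at least flush_threshold bytes are
pending, so each compression call sees a larger input.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bytes pending in a ring buffer, given free-running head/tail counters. */
static uint64_t ring_pending(uint64_t head, uint64_t tail)
{
	return head - tail;	/* unsigned arithmetic handles wraparound */
}

/*
 * Decide whether to drain the buffer now.
 * final: set when recording stops, so the remainder is flushed regardless.
 */
static bool should_flush(uint64_t head, uint64_t tail,
			 uint64_t flush_threshold, bool final)
{
	uint64_t pending = ring_pending(head, tail);

	assert(flush_threshold >= 1);
	if (pending == 0)
		return false;
	return final || pending >= flush_threshold;
}
```

With a threshold of 0x1000, as in the example command below, a buffer holding
0x800 bytes would be left alone until more data arrives or recording ends.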

The feature can be disabled at build time using the NO_LIBZSTD define, and 
the location of the Zstandard sources can be overridden using the value of 
the LIBZSTD_DIR define.

The patch set is for Arnaldo's perf/core repository.

Examples:

  $ make -C tools/perf NO_LIBZSTD=1 clean all
  $ make -C tools/perf LIBZSTD_DIR=/path/to/zstd-1.3.7 clean all

  $ tools/perf/perf record -F 42000 --aio -z 1 --mmap-flush 0x1000 -e cycles -- matrix.gcc
  Addr of buf1 = 0x7fc1bf183010
  Offs of buf1 = 0x7fc1bf183180
  Addr of buf2 = 0x7fc1bd182010
  Offs of buf2 = 0x7fc1bd1821c0
  Addr of buf3 = 0x7fc1bb181010
  Offs of buf3 = 0x7fc1bb181100
  Addr of buf4 = 0x7fc1b9180010
  Offs of buf4 = 0x7fc1b9180140
  Threads #: 8 Pthreads
  Matrix size: 2048
  Using multiply kernel: multiply1
  Execution time = 25.499 seconds
  [ perf record: Woken up 1157 times to write data ]
  [ perf record: Compressed 316.684 MB to 58.034 MB, ratio is 5.457 ]
  [ perf record: Captured and wrote 58.059 MB perf.data ]

  $ tools/perf/perf report -D --header
  # 
  # captured on: Mon Dec 24 13:19:52 2018
  # header version : 1
  # data offset: 296
  # data size  : 60878779
  # feat offset: 60879075
  # hostname : nntvtune39
  # os release : 4.19.9-300.fc29.x86_64
  # perf version : 4.13.rc5.gdbb7997
  # arch : x86_64
  # nrcpus online : 8
  # nrcpus avail : 8
  # cpudesc : Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
  # cpuid : GenuineIntel,6,94,3
  # total memory : 16153380 kB
  # cmdline : /root/abudanko/kernel/acme/tools/perf/perf record -F 42000 --aio -z 1 --mmap-flush 0x1000 -e cycles -- ../../matrix/linux/matrix.gcc 
  # event : name = cycles, , id = { 2315, 2316, 2317, 2318, 2319, 2320, 2321, 2322 }, size = 112, { sample_period, sample_freq } = 42000, sample_type = IP|TID|TIME|PERIOD, read_form>
  # CPU_TOPOLOGY info available, use -I to display
  # NUMA_TOPOLOGY info available, use -I to display
  # pmu mappings: intel_pt = 8, software = 1, power = 11, uprobe = 7, uncore_imc = 12, cpu = 4, cstate_core = 18, uncore_cbox_2 = 15, breakpoint = 5, uncore_cbox_0 = 13, tracepoint >
  # CACHE info available, use -I to display
  # time of first sample : 0.00
  # time of last sample : 0.00
  # sample duration :  0.000 ms
  # MEM_TOPOLOGY info available, use -I to display
  # compressed : Zstd, level = 1, ratio = 5
  # missing features: TRACING_DATA BUILD_ID BRANCH_STACK GROUP_DESC AUXTRACE STAT CLOCKID 
  # 
  #
  
  0x128 [0x20]: event: 79
  .
  . ... raw event: size 32 bytes
  .  0000:  4f 00 00 00 00 00 20 00 1f 00 00 00 00 00 00 00  O. .
  .  0010:  11 a6 ef 1f 00 00 00 00 f8 fe 7c b5 f5 ff ff ff  ..|.
  
  0 0x128 [0x20]: PERF_RECORD_TIME_CONV: unhandled!
  
  0x148 [0x50]: event: 1
  .
  . ... raw event: size 80 bytes
  .  0000:  01 00 00 00 01 00 50 00 ff ff ff ff 00 00 00 00  ..P.
  .  0010:  00 00 00 a8 ff ff ff ff 00 80 33 18 00 00 00 00  ..3.
  .  0020:  00 00 00 a8 ff ff ff ff 5b 6b 65 72 6e 65 6c 2e  [kernel.
  .  0030:  6b 61 6c 6c 73 79 6d 73 5d 5f 74 65 78 74 00 00  kallsyms]_text..
  .  0040:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
  
  0 0x148 [0x50]: PERF_RECORD_MMAP -1/0: [0xa800(0x18338000) @ 0xa800]: x [kernel.kallsyms]_text

  ...
  0x62d8e [0x8]: event: 68
  .
  . ... raw event: size 8 bytes
  .  0000:  44 00 00 00 00 00 08 00  D...
  
  0 0x62d8e [0x8]: PERF_RECORD_FINISHED_ROUND
  
  0 [0x28]: event: 9
  .
  . ... raw event: size 40 bytes
  .  0000:  09 00 00 00 01 00 28 00 76 78 06 a8 ff ff ff ff  ..(.vx..
  .  0010:  94 29 00 00 94 29 00 00 82 02 f4 af 33 3e 01 00  .)...)..3>..
  .  0020:  01 00 00 00 00 00 00 00  
  
  349866692969090 0 [0x28]: PERF_RECORD_SAMPLE(IP, 0x1): 10644/10644: 0xa8067876 period: 1 addr: 0
   ... thread: perf:10644
   .. dso: vmlinux
  
  0