Re: [PATCH v1 0/4] perf: enable compression of record mode trace to save storage space
On 14.01.2019 14:03, Jiri Olsa wrote:
> On Mon, Jan 14, 2019 at 11:43:31AM +0300, Alexey Budankov wrote:
>> Hi,
>> On 09.01.2019 20:28, Jiri Olsa wrote:
>>> On Mon, Dec 24, 2018 at 04:21:33PM +0300, Alexey Budankov wrote:
>>>> buffers for asynchronous trace writing serve that purpose.
>>>
>>> I don't like that it's for aio only, I can't really see why it's
>>
>> For serial streaming, on CPU-bound code, under full system utilization it
>> can induce more runtime overhead and increase data loss because the amount
>> of code on the performance-critical path grows. Of course the size of the
>> written data shrinks, but still. Feeding kernel buffer content from user
>> space code to a syscall is extended with intermediate copying to user
>> space memory, with some math done on it in the middle.
>>
>>> a problem for normal data.. can't we just have one layer before and
>>> stream the data to the compress function instead of the file (or aio
>>> buffers).. and that compress function would spit out 64K size COMPRESSED
>>> events, which would go to file (or aio buffers)
>>
>> It is already almost like that. Compression could be bridged using AIO
>> buffers but then still streamed to file serially using record__pushfn(),
>> and that would make some sense for moderate profiling cases on systems
>> without AIO support and trace streaming based on it.
>>
>>>
>>> the report side would process them (decompress) on the session layer
>>> before the tool callbacks are called
>>
>> It is already pretty similar to that.
>
> hum, AFAICS you do that in report code, not on the session layer

Correct. Decompressor and handling of compressed data chunks could be
moved to session related code.

Thanks,
Alexey

>
> jirka
>
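To make the "decompress on the session layer before the tool callbacks are called" idea concrete, here is a minimal sketch. It is not the perf code: it is a Python model in which zlib stands in for Zstandard, and the `Session` class, the `make_record` helper, and the `HYPOTHETICAL_COMPRESSED_TYPE` value are all assumptions made for illustration. Only the 8-byte header layout (type, misc, size) follows the perf event header.

```python
import struct
import zlib

# Header layout of a perf event: u32 type, u16 misc, u16 size (little-endian).
HEADER = struct.Struct("<IHH")

# Assumption: a placeholder record type for compressed containers.
HYPOTHETICAL_COMPRESSED_TYPE = 81


def make_record(rtype, payload):
    """Helper for the sketch: frame a payload with an event header."""
    return HEADER.pack(rtype, 0, HEADER.size + len(payload)) + payload


def iter_records(buf):
    """Walk a byte buffer record by record using the size field."""
    off = 0
    while off < len(buf):
        rtype, _misc, size = HEADER.unpack_from(buf, off)
        if size < HEADER.size:
            raise ValueError("corrupt record size")
        yield rtype, buf[off + HEADER.size:off + size]
        off += size


class Session:
    """Session layer: unwraps compressed containers, then calls the tool."""

    def __init__(self, tool_callback):
        self._callback = tool_callback
        self._inflate = zlib.decompressobj()  # stand-in for a Zstandard stream
        self._pending = b""                   # decompressed bytes not yet delivered

    def process(self, buf):
        for rtype, payload in iter_records(buf):
            if rtype == HYPOTHETICAL_COMPRESSED_TYPE:
                self._pending += self._inflate.decompress(payload)
                self._deliver_complete_records()
            else:
                self._callback(rtype, payload)

    def _deliver_complete_records(self):
        # An inner record may straddle two containers, so deliver only
        # records that are fully available and keep the remainder.
        while len(self._pending) >= HEADER.size:
            rtype, _misc, size = HEADER.unpack_from(self._pending)
            if size < HEADER.size:
                raise ValueError("corrupt record size")
            if len(self._pending) < size:
                break
            self._callback(rtype, self._pending[HEADER.size:size])
            self._pending = self._pending[size:]
```

With this split the tool callback only ever sees ordinary records, so report-mode code would need no knowledge of compression.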
Re: [PATCH v1 0/4] perf: enable compression of record mode trace to save storage space
On Mon, Jan 14, 2019 at 11:43:31AM +0300, Alexey Budankov wrote:
> Hi,
> On 09.01.2019 20:28, Jiri Olsa wrote:
> > On Mon, Dec 24, 2018 at 04:21:33PM +0300, Alexey Budankov wrote:
> >>
> >> The patch set implements runtime record trace compression accompanied by
> >> trace file decompression implemented in the tool report mode. Zstandard
> >> library API [1] is used for compression/decompression of data that come
> >> from perf_events kernel data buffers.
> >>
> >> The implemented -z,--compression_level=n option provides ~3-5x avg. trace
> >> file size reduction on the tested workloads, which significantly saves
> >> users' storage space on larger server systems where trace file size can
> >> easily reach several tens or even hundreds of GiBs, especially when
> >> profiling with stacks for later dwarf unwinding, context-switch tracing,
> >> etc.
> >>
> >> The option is effective jointly with asynchronous trace writing because
> >> compression requires auxiliary memory buffers to operate on and memory
> >> buffers for asynchronous trace writing serve that purpose.
> >
> > I don't like that it's for aio only, I can't really see why it's
>
> For serial streaming, on CPU-bound code, under full system utilization it
> can induce more runtime overhead and increase data loss because the amount
> of code on the performance-critical path grows. Of course the size of the
> written data shrinks, but still. Feeding kernel buffer content from user
> space code to a syscall is extended with intermediate copying to user
> space memory, with some math done on it in the middle.
>
> > a problem for normal data.. can't we just have one layer before and
> > stream the data to the compress function instead of the file (or aio
> > buffers).. and that compress function would spit out 64K size COMPRESSED
> > events, which would go to file (or aio buffers)
>
> It is already almost like that. Compression could be bridged using AIO
> buffers but then still streamed to file serially using record__pushfn(),
> and that would make some sense for moderate profiling cases on systems
> without AIO support and trace streaming based on it.
>
> >
> > the report side would process them (decompress) on the session layer
> > before the tool callbacks are called
>
> It is already pretty similar to that.

hum, AFAICS you do that in report code, not on the session layer

jirka
Re: [PATCH v1 0/4] perf: enable compression of record mode trace to save storage space
Hi,

On 09.01.2019 20:28, Jiri Olsa wrote:
> On Mon, Dec 24, 2018 at 04:21:33PM +0300, Alexey Budankov wrote:
>>
>> The patch set implements runtime record trace compression accompanied by
>> trace file decompression implemented in the tool report mode. Zstandard
>> library API [1] is used for compression/decompression of data that come
>> from perf_events kernel data buffers.
>>
>> The implemented -z,--compression_level=n option provides ~3-5x avg. trace
>> file size reduction on the tested workloads, which significantly saves
>> users' storage space on larger server systems where trace file size can
>> easily reach several tens or even hundreds of GiBs, especially when
>> profiling with stacks for later dwarf unwinding, context-switch tracing,
>> etc.
>>
>> The option is effective jointly with asynchronous trace writing because
>> compression requires auxiliary memory buffers to operate on and memory
>> buffers for asynchronous trace writing serve that purpose.
>
> I don't like that it's for aio only, I can't really see why it's

For serial streaming, on CPU-bound code, under full system utilization it
can induce more runtime overhead and increase data loss because the amount
of code on the performance-critical path grows. Of course the size of the
written data shrinks, but still. Feeding kernel buffer content from user
space code to a syscall is extended with intermediate copying to user
space memory, with some math done on it in the middle.

> a problem for normal data.. can't we just have one layer before and
> stream the data to the compress function instead of the file (or aio
> buffers).. and that compress function would spit out 64K size COMPRESSED
> events, which would go to file (or aio buffers)

It is already almost like that. Compression could be bridged using AIO
buffers but then still streamed to file serially using record__pushfn(),
and that would make some sense for moderate profiling cases on systems
without AIO support and trace streaming based on it.

>
> the report side would process them (decompress) on the session layer
> before the tool callbacks are called

It is already pretty similar to that.

Thanks,
Alexey

>
> jirka
>
Re: [PATCH v1 0/4] perf: enable compression of record mode trace to save storage space
On Mon, Dec 24, 2018 at 04:21:33PM +0300, Alexey Budankov wrote:
>
> The patch set implements runtime record trace compression accompanied by
> trace file decompression implemented in the tool report mode. Zstandard
> library API [1] is used for compression/decompression of data that come
> from perf_events kernel data buffers.
>
> The implemented -z,--compression_level=n option provides ~3-5x avg. trace
> file size reduction on the tested workloads, which significantly saves
> users' storage space on larger server systems where trace file size can
> easily reach several tens or even hundreds of GiBs, especially when
> profiling with stacks for later dwarf unwinding, context-switch tracing,
> etc.
>
> The option is effective jointly with asynchronous trace writing because
> compression requires auxiliary memory buffers to operate on and memory
> buffers for asynchronous trace writing serve that purpose.

I don't like that it's for aio only, I can't really see why it's
a problem for normal data.. can't we just have one layer before and
stream the data to the compress function instead of the file (or aio
buffers).. and that compress function would spit out 64K size COMPRESSED
events, which would go to file (or aio buffers)

the report side would process them (decompress) on the session layer
before the tool callbacks are called

jirka
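The layering suggested here (data goes to a compress function, which emits COMPRESSED events of at most 64K to whatever sink follows) could look roughly like the sketch below. It is a Python model, not perf code: zlib is swapped in for Zstandard, and `CompressLayer`, `max_record`, and `HYPOTHETICAL_COMPRESSED_TYPE` are names invented for illustration. The sink is any callable, so it could equally represent a serial file write or an aio buffer.

```python
import struct
import zlib

# Header layout of a perf event: u32 type, u16 misc, u16 size (little-endian).
HEADER = struct.Struct("<IHH")

# Assumption: a placeholder record type for compressed containers.
HYPOTHETICAL_COMPRESSED_TYPE = 81


class CompressLayer:
    """Sits in front of the output path and emits framed compressed records."""

    def __init__(self, sink, level=1, max_record=0xFFFF):
        # The header size field is 16 bits wide, so one record
        # (header included) cannot exceed 0xFFFF bytes.
        self._sink = sink
        self._max_payload = max_record - HEADER.size
        self._deflate = zlib.compressobj(level)  # stand-in for a Zstandard stream

    def push(self, data):
        """Compress one chunk of ring-buffer data and hand records to the sink."""
        blob = self._deflate.compress(data)
        # Sync flush so everything pushed so far is decodable by the reader.
        blob += self._deflate.flush(zlib.Z_SYNC_FLUSH)
        self._emit(blob)

    def finish(self):
        """Terminate the compressed stream."""
        self._emit(self._deflate.flush())

    def _emit(self, blob):
        for off in range(0, len(blob), self._max_payload):
            payload = blob[off:off + self._max_payload]
            header = HEADER.pack(HYPOTHETICAL_COMPRESSED_TYPE, 0,
                                 HEADER.size + len(payload))
            self._sink(header + payload)
```

For example, `CompressLayer(out_file.write)` would model serial streaming, while passing a function that appends to an in-memory buffer would model the aio case; the compress layer itself does not need to know which one it feeds.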
[PATCH v1 0/4] perf: enable compression of record mode trace to save storage space
The patch set implements runtime record trace compression accompanied by
trace file decompression implemented in the tool report mode. Zstandard
library API [1] is used for compression/decompression of data that come
from perf_events kernel data buffers.

The implemented -z,--compression_level=n option provides ~3-5x avg. trace
file size reduction on the tested workloads, which significantly saves
users' storage space on larger server systems where trace file size can
easily reach several tens or even hundreds of GiBs, especially when
profiling with stacks for later dwarf unwinding, context-switch tracing,
etc.

The option is effective jointly with asynchronous trace writing because
compression requires auxiliary memory buffers to operate on and memory
buffers for asynchronous trace writing serve that purpose.

The added --mmap-flush option can be used to avoid compressing every
single byte of data from mmaped kernel buffers to the trace file and to
increase the compression ratio while lowering tool runtime overhead.

The feature can be disabled from the command line using the NO_LIBZSTD
define, and the Zstandard sources can be overridden using the value of
the LIBZSTD_DIR define.

The patch set is for Arnaldo's perf/core repository.
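A toy model of why batching before compression helps the ratio, assuming --mmap-flush acts as a minimum number of pending bytes before data is handed to the compressor, and assuming each flush is compressed as an independent unit. Both are assumptions of this sketch, not statements about the patches. zlib is used as a stand-in for Zstandard, and `compressed_size` is a hypothetical helper.

```python
import zlib


def compressed_size(chunks, flush_threshold, level=1):
    """Total compressed bytes when data is flushed in batches.

    Chunks accumulate until at least `flush_threshold` bytes are pending;
    each flush is then compressed on its own (an assumption of this model).
    """
    total = 0
    pending = bytearray()
    for chunk in chunks:
        pending += chunk
        if len(pending) >= flush_threshold:
            total += len(zlib.compress(bytes(pending), level))
            pending.clear()
    if pending:
        # Flush whatever is left at the end of the run.
        total += len(zlib.compress(bytes(pending), level))
    return total


# Synthetic, repetitive "samples" standing in for ring-buffer content.
chunks = [b"sample ip=%08x pid=1234 tid=1234\n" % i for i in range(256)]
per_chunk = compressed_size(chunks, 1)       # compress every small chunk alone
batched = compressed_size(chunks, 0x1000)    # wait for 4 KiB before compressing
```

In this model `batched` comes out smaller than `per_chunk`, because each independent compression call pays fixed framing overhead and cannot reuse history from earlier chunks. Fewer compressor invocations also mean less per-call work on the hot path.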
Examples:

$ make -C tools/perf NO_LIBZSTD=1 clean all
$ make -C tools/perf LIBZSTD_DIR=/path/to/zstd-1.3.7 clean all

$ tools/perf/perf record -F 42000 --aio -z 1 --mmap-flush 0x1000 -e cycles -- matrix.gcc
Addr of buf1 = 0x7fc1bf183010
Offs of buf1 = 0x7fc1bf183180
Addr of buf2 = 0x7fc1bd182010
Offs of buf2 = 0x7fc1bd1821c0
Addr of buf3 = 0x7fc1bb181010
Offs of buf3 = 0x7fc1bb181100
Addr of buf4 = 0x7fc1b9180010
Offs of buf4 = 0x7fc1b9180140
Threads #: 8 Pthreads
Matrix size: 2048
Using multiply kernel: multiply1
Execution time = 25.499 seconds
[ perf record: Woken up 1157 times to write data ]
[ perf record: Compressed 316.684 MB to 58.034 MB, ratio is 5.457 ]
[ perf record: Captured and wrote 58.059 MB perf.data ]

$ tools/perf/perf report -D --header
#
# captured on: Mon Dec 24 13:19:52 2018
# header version : 1
# data offset: 296
# data size : 60878779
# feat offset: 60879075
# hostname : nntvtune39
# os release : 4.19.9-300.fc29.x86_64
# perf version : 4.13.rc5.gdbb7997
# arch : x86_64
# nrcpus online : 8
# nrcpus avail : 8
# cpudesc : Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
# cpuid : GenuineIntel,6,94,3
# total memory : 16153380 kB
# cmdline : /root/abudanko/kernel/acme/tools/perf/perf record -F 42000 --aio -z 1 --mmap-flush 0x1000 -e cycles -- ../../matrix/linux/matrix.gcc
# event : name = cycles, , id = { 2315, 2316, 2317, 2318, 2319, 2320, 2321, 2322 }, size = 112, { sample_period, sample_freq } = 42000, sample_type = IP|TID|TIME|PERIOD, read_form>
# CPU_TOPOLOGY info available, use -I to display
# NUMA_TOPOLOGY info available, use -I to display
# pmu mappings: intel_pt = 8, software = 1, power = 11, uprobe = 7, uncore_imc = 12, cpu = 4, cstate_core = 18, uncore_cbox_2 = 15, breakpoint = 5, uncore_cbox_0 = 13, tracepoint >
# CACHE info available, use -I to display
# time of first sample : 0.00
# time of last sample : 0.00
# sample duration : 0.000 ms
# MEM_TOPOLOGY info available, use -I to display
# compressed : Zstd, level = 1, ratio = 5
# missing features: TRACING_DATA BUILD_ID BRANCH_STACK GROUP_DESC AUXTRACE STAT CLOCKID
#
#

0x128 [0x20]: event: 79
.
. ... raw event: size 32 bytes
.  0000:  4f 00 00 00 00 00 20 00 1f 00 00 00 00 00 00 00  O..... .........
.  0010:  11 a6 ef 1f 00 00 00 00 f8 fe 7c b5 f5 ff ff ff  ..........|.....

0 0x128 [0x20]: PERF_RECORD_TIME_CONV: unhandled!

0x148 [0x50]: event: 1
.
. ... raw event: size 80 bytes
.  0000:  01 00 00 00 01 00 50 00 ff ff ff ff 00 00 00 00  ......P.........
.  0010:  00 00 00 a8 ff ff ff ff 00 80 33 18 00 00 00 00  ..........3.....
.  0020:  00 00 00 a8 ff ff ff ff 5b 6b 65 72 6e 65 6c 2e  ........[kernel.
.  0030:  6b 61 6c 6c 73 79 6d 73 5d 5f 74 65 78 74 00 00  kallsyms]_text..
.  0040:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................

0 0x148 [0x50]: PERF_RECORD_MMAP -1/0: [0xa800(0x18338000) @ 0xa800]: x [kernel.kallsyms]_text

...

0x62d8e [0x8]: event: 68
.
. ... raw event: size 8 bytes
.  0000:  44 00 00 00 00 00 08 00                          D.......

0 0x62d8e [0x8]: PERF_RECORD_FINISHED_ROUND

0 [0x28]: event: 9
.
. ... raw event: size 40 bytes
.  0000:  09 00 00 00 01 00 28 00 76 78 06 a8 ff ff ff ff  ......(.vx......
.  0010:  94 29 00 00 94 29 00 00 82 02 f4 af 33 3e 01 00  .)...)......3>..
.  0020:  01 00 00 00 00 00 00 00                          ........

349866692969090 0 [0x28]: PERF_RECORD_SAMPLE(IP, 0x1): 10644/10644: 0xa8067876 period: 1 addr: 0
 ... thread: perf:10644
 .. dso: vmlinux
0
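As a reading aid for the raw dumps above: the first eight bytes of every event are the perf event header, a little-endian u32 type followed by a u16 misc and a u16 size. The short sketch below decodes the header bytes copied from the dump; `parse_header` is a helper name made up for this example.

```python
import struct

# perf event header: u32 type, u16 misc, u16 size (little-endian on x86_64).
HEADER = struct.Struct("<IHH")


def parse_header(hex_bytes):
    """Decode the leading 8 bytes of a raw event given as a hex string."""
    raw = bytes.fromhex(hex_bytes)
    return HEADER.unpack_from(raw)  # (type, misc, size)


print(parse_header("4f 00 00 00 00 00 20 00"))  # (79, 0, 32)  PERF_RECORD_TIME_CONV
print(parse_header("01 00 00 00 01 00 50 00"))  # (1, 1, 80)   PERF_RECORD_MMAP
print(parse_header("44 00 00 00 00 00 08 00"))  # (68, 0, 8)   PERF_RECORD_FINISHED_ROUND
print(parse_header("09 00 00 00 01 00 28 00"))  # (9, 1, 40)   PERF_RECORD_SAMPLE
```

The decoded type and size values match the "event: N" and "size N bytes" lines that perf report -D prints for each record.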