Re: [PATCH 1/2] perf: Add munmap callback
On 11/6/2018 10:00 AM, Stephane Eranian wrote:
> /*
>  * mmap 1 page at the location of the unmapped page (should reuse the
>  * virtual space). This creates a continuous region built from two
>  * mmaps and potentially two different sources, especially with
>  * jitted runtimes.
>  */

The two mmaps are both anon. To my understanding, we cannot symbolize from an anonymous address, can we? Can't we build the same test case using an actual file mapping (both mmaps from the same file)? If they are from the same file, both mmaps have the same symbolization name, so we don't need to distinguish them either. I thought the symbolization issue can only happen when the two mmaps come from different sources. Right?

Thanks,
Kan
Re: [PATCH 1/2] perf: Add munmap callback
On Mon, Nov 5, 2018 at 7:43 AM Liang, Kan wrote:
>
> On 11/5/2018 5:59 AM, Stephane Eranian wrote:
> > Hi Kan,
> >
> > I built a small test case for you to demonstrate the issue for code and
> > data. Compile the test program and then do:
> >
> > For text:
> > $ perf record ./mmap
> > $ perf report -D | fgrep MMAP2
> >
> > The test program mmaps 2 pages, unmaps the second, and remaps 1 page
> > over the freed space. If you look at the MMAP2 records, you will not be
> > able to reconstruct what happened, and perf will get confused should it
> > try to symbolize from that address range.
> >
> > With text:
> > PERF_RECORD_MMAP2 5937/5937: [0x40(0x1000) @ 0 08:01 400938 824817672]: r-xp /home/eranian/mmap
> > PERF_RECORD_MMAP2 5937/5937: [0x7f7c01019000(0x2000) @ 0x7f7c01019000 00:00 0 0]: rwxp //anon
> > PERF_RECORD_MMAP2 5937/5937: [0x7f7c01019000(0x2000) @ 0x7f7c01019000 00:00 0 0]: rwxp //anon
> >
> > The last record captures the whole VMA but not the mapping change in
> > user space.
> >
> > For data:
> > $ perf record -d ./mmap
> > $ perf report -D | fgrep MMAP2
> >
> > With data:
> > PERF_RECORD_MMAP2 6430/6430: [0x40(0x1000) @ 0 08:01 400938 3278843184]: r-xp /home/eranian/mmap
> > PERF_RECORD_MMAP2 6430/6430: [0x7f4aa704b000(0x2000) @ 0x7f4aa704b000 00:00 0 0]: rw-p //anon
> > PERF_RECORD_MMAP2 6430/6430: [0x7f4aa704b000(0x2000) @ 0x7f4aa704b000 00:00 0 0]: rw-p //anon
> >
> > Same test case with data. Perf will think the entire 2 pages have been
> > replaced when in fact only the second has.
> > I believe the problem is likely to impact data and jitted code caches.
> >
> > #include <sys/mman.h>
> > #include <unistd.h>
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <getopt.h>
> > #include <err.h>
> >
> > int main(int argc, char **argv)
> > {
> > 	void *addr1, *addr2;
> > 	size_t pgsz = sysconf(_SC_PAGESIZE);
> > 	int n = 2;
> > 	int ret;
> > 	int c, mode = 0;
> >
> > 	while ((c = getopt(argc, argv, "hd")) != -1) {
> > 		switch (c) {
> > 		case 'h':
> > 			printf("[-h]\tget this help\n");
> > 			printf("[-d]\tuse data mmaps (no PROT_EXEC)\n");
> > 			return 0;
> > 		case 'd':
> > 			mode = PROT_WRITE;
> > 			break;
> > 		default:
> > 			errx(1, "unknown option");
> > 		}
> > 	}
> > 	/* default to text */
> > 	if (mode == 0)
> > 		mode = PROT_EXEC;
> >
> > 	/*
> > 	 * mmap 2 contiguous pages
> > 	 */
> > 	addr1 = mmap(NULL, n * pgsz, PROT_READ | mode,
> > 		     MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
> > 	if (addr1 == (void *)MAP_FAILED)
> > 		err(1, "mmap 1 failed");
> >
> > 	printf("addr1=[%p : %p]\n", addr1, addr1 + n * pgsz);
> >
> > 	/*
> > 	 * unmap only the second page
> > 	 */
> > 	ret = munmap(addr1 + pgsz, pgsz);
> > 	if (ret == -1)
> > 		err(1, "munmap failed");
> >
> > 	/*
> > 	 * mmap 1 page at the location of the unmapped page (should reuse the
> > 	 * virtual space). This creates a continuous region built from two
> > 	 * mmaps and potentially two different sources, especially with
> > 	 * jitted runtimes.
> > 	 */
>
> The two mmaps are both anon. To my understanding, we cannot symbolize
> from an anonymous address, can we? Can't we build the same test case
> using an actual file mapping (both mmaps from the same file)?
> If we cannot, why do we have to distinguish between them? I think we do
> not need to know their sources for symbolization.
>
> To my understanding, only --jit can inject MMAP events, which tag an
> anon region. Perf can symbolize the address after that; only then is the
> unmap needed.

Yes, perf inject --jit injects timestamped MMAP2 records to cover the jitted regions, which helps symbolize anons.
> Thanks,
> Kan

> > 	addr2 = mmap(addr1 + pgsz, 1 * pgsz, PROT_READ|PROT_WRITE|mode,
> > 		     MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
> >
> > 	printf("addr2=%p\n", addr2);
> >
> > 	if (addr2 == (void *)MAP_FAILED)
> > 		err(1, "mmap 2 failed");
> > 	if (addr2 != (addr1 + pgsz))
> > 		errx(1, "wrong mmap2 address");
> >
> > 	sleep(1);
> >
> > 	return 0;
> > }
> >
> > On Thu, Nov 1, 2018 at 7:10 AM Liang, Kan wrote:
> > >
> > > On 10/24/2018 3:30 PM, Stephane Eranian wrote:
> > > > The need for this new record type extends beyond physical address
> > > > conversions and PEBS. A long while ago, someone reported issues with
> > > > symbolization related to perf lacking munmap tracking. It had to do
> > > > with vma merging. I think the sequence of mmaps was as follows in the
> > > > problematic case:
> > > > 1. addr1 = mmap(8192);
> > > > 2. munmap(addr1 + 4096, 4096)
> > > > 3. addr2 = mmap(addr1+4096, 4096)
> > > >
> > > > If successful, that yields addr2 = addr1 + 4096 (could also get the
> > > > same without forcing the address).
> > > >
> > > > In that case, if I recall correctly, the vma for the 1st mapping (now
> > > > 4k) and that of the 2nd mapping (4k) get merged into a single 8k vma,
> > > > and this is what perf_events will record for PERF_RECORD_MMAP.
> > > > On the perf tool side, it is assumed that if two timestamped mappings
> > > > overlap then the latter overrides the former. In this case, perf would
> > > > lose the mapping of the first 4kb and assume all symbols come from the
> > > > 2nd mapping.
Re: [PATCH 1/2] perf: Add munmap callback
On 11/5/2018 5:59 AM, Stephane Eranian wrote:
> Hi Kan,
>
> I built a small test case for you to demonstrate the issue for code and
> data. Compile the test program and then do:
>
> For text:
> $ perf record ./mmap
> $ perf report -D | fgrep MMAP2
>
> The test program mmaps 2 pages, unmaps the second, and remaps 1 page
> over the freed space. If you look at the MMAP2 records, you will not be
> able to reconstruct what happened, and perf will get confused should it
> try to symbolize from that address range.
>
> With text:
> PERF_RECORD_MMAP2 5937/5937: [0x40(0x1000) @ 0 08:01 400938 824817672]: r-xp /home/eranian/mmap
> PERF_RECORD_MMAP2 5937/5937: [0x7f7c01019000(0x2000) @ 0x7f7c01019000 00:00 0 0]: rwxp //anon
> PERF_RECORD_MMAP2 5937/5937: [0x7f7c01019000(0x2000) @ 0x7f7c01019000 00:00 0 0]: rwxp //anon
>
> The last record captures the whole VMA but not the mapping change in
> user space.
>
> For data:
> $ perf record -d ./mmap
> $ perf report -D | fgrep MMAP2
>
> With data:
> PERF_RECORD_MMAP2 6430/6430: [0x40(0x1000) @ 0 08:01 400938 3278843184]: r-xp /home/eranian/mmap
> PERF_RECORD_MMAP2 6430/6430: [0x7f4aa704b000(0x2000) @ 0x7f4aa704b000 00:00 0 0]: rw-p //anon
> PERF_RECORD_MMAP2 6430/6430: [0x7f4aa704b000(0x2000) @ 0x7f4aa704b000 00:00 0 0]: rw-p //anon
>
> Same test case with data. Perf will think the entire 2 pages have been
> replaced when in fact only the second has.
> I believe the problem is likely to impact data and jitted code caches.
>
> #include <sys/mman.h>
> #include <unistd.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <getopt.h>
> #include <err.h>
>
> int main(int argc, char **argv)
> {
> 	void *addr1, *addr2;
> 	size_t pgsz = sysconf(_SC_PAGESIZE);
> 	int n = 2;
> 	int ret;
> 	int c, mode = 0;
>
> 	while ((c = getopt(argc, argv, "hd")) != -1) {
> 		switch (c) {
> 		case 'h':
> 			printf("[-h]\tget this help\n");
> 			printf("[-d]\tuse data mmaps (no PROT_EXEC)\n");
> 			return 0;
> 		case 'd':
> 			mode = PROT_WRITE;
> 			break;
> 		default:
> 			errx(1, "unknown option");
> 		}
> 	}
> 	/* default to text */
> 	if (mode == 0)
> 		mode = PROT_EXEC;
>
> 	/*
> 	 * mmap 2 contiguous pages
> 	 */
> 	addr1 = mmap(NULL, n * pgsz, PROT_READ | mode,
> 		     MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
> 	if (addr1 == (void *)MAP_FAILED)
> 		err(1, "mmap 1 failed");
>
> 	printf("addr1=[%p : %p]\n", addr1, addr1 + n * pgsz);
>
> 	/*
> 	 * unmap only the second page
> 	 */
> 	ret = munmap(addr1 + pgsz, pgsz);
> 	if (ret == -1)
> 		err(1, "munmap failed");
>
> 	/*
> 	 * mmap 1 page at the location of the unmapped page (should reuse the
> 	 * virtual space). This creates a continuous region built from two
> 	 * mmaps and potentially two different sources, especially with
> 	 * jitted runtimes.
> 	 */

The two mmaps are both anon. To my understanding, we cannot symbolize from an anonymous address, can we? If we cannot, why do we have to distinguish between them? I think we do not need to know their sources for symbolization.

To my understanding, only --jit can inject MMAP events, which tag an anon region. Perf can symbolize the address after that; only then is the unmap needed.

Thanks,
Kan

> 	addr2 = mmap(addr1 + pgsz, 1 * pgsz, PROT_READ|PROT_WRITE|mode,
> 		     MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
>
> 	printf("addr2=%p\n", addr2);
>
> 	if (addr2 == (void *)MAP_FAILED)
> 		err(1, "mmap 2 failed");
> 	if (addr2 != (addr1 + pgsz))
> 		errx(1, "wrong mmap2 address");
>
> 	sleep(1);
>
> 	return 0;
> }
>
> On Thu, Nov 1, 2018 at 7:10 AM Liang, Kan wrote:
> >
> > On 10/24/2018 3:30 PM, Stephane Eranian wrote:
> > > The need for this new record type extends beyond physical address
> > > conversions and PEBS.
> > > A long while ago, someone reported issues with symbolization related
> > > to perf lacking munmap tracking. It had to do with vma merging. I
> > > think the sequence of mmaps was as follows in the problematic case:
> > > 1. addr1 = mmap(8192);
> > > 2. munmap(addr1 + 4096, 4096)
> > > 3. addr2 = mmap(addr1+4096, 4096)
> > >
> > > If successful, that yields addr2 = addr1 + 4096 (could also get the
> > > same without forcing the address).
> > >
> > > In that case, if I recall correctly, the vma for the 1st mapping (now
> > > 4k) and that of the 2nd mapping (4k) get merged into a single 8k vma,
> > > and this is what perf_events will record for PERF_RECORD_MMAP.
> > > On the perf tool side, it is assumed that if two timestamped mappings
> > > overlap then the latter overrides the former. In this case, perf would
> > > lose the mapping of the first 4kb and assume all symbols come from the
> > > 2nd mapping. Hopefully I got the scenario right. If so, then you'd
> > > need PERF_RECORD_UNMAP to disambiguate, assuming the perf tool is
> > > modified accordingly.
> >
> > Hi Stephane and Peter,
> >
> > I went through the link (https://lkml.org/lkml/2017/1/27/452). I'm
> > trying to understand the problematic case.
> >
> > It looks like the issue can only be triggered by perf inject --jit,
> > because it can inject extra MMAP events.
> > To my understanding, the Linux kernel only tries to merge VMAs if they
> > are both anon or both from the same file. --jit breaks that rule and
> > makes the merged VMA partly from anon, partly from a file.
> > Now there is a new MMAP event whose range covers the modified VMA.
> > Without the help of a MUNMAP event, the perf tool has no idea whether
> > the new one is a newly merged VMA (modified VMA + a new VMA) or a brand
> > new VMA. The current code simply overwrites the modified VMAs.
Re: [PATCH 1/2] perf: Add munmap callback
Hi Kan,

I built a small test case for you to demonstrate the issue for code and data. Compile the test program and then do:

For text:
$ perf record ./mmap
$ perf report -D | fgrep MMAP2

The test program mmaps 2 pages, unmaps the second, and remaps 1 page over the freed space. If you look at the MMAP2 records, you will not be able to reconstruct what happened, and perf will get confused should it try to symbolize from that address range.

With text:
PERF_RECORD_MMAP2 5937/5937: [0x40(0x1000) @ 0 08:01 400938 824817672]: r-xp /home/eranian/mmap
PERF_RECORD_MMAP2 5937/5937: [0x7f7c01019000(0x2000) @ 0x7f7c01019000 00:00 0 0]: rwxp //anon
PERF_RECORD_MMAP2 5937/5937: [0x7f7c01019000(0x2000) @ 0x7f7c01019000 00:00 0 0]: rwxp //anon

The last record captures the whole VMA but not the mapping change in user space.

For data:
$ perf record -d ./mmap
$ perf report -D | fgrep MMAP2

With data:
PERF_RECORD_MMAP2 6430/6430: [0x40(0x1000) @ 0 08:01 400938 3278843184]: r-xp /home/eranian/mmap
PERF_RECORD_MMAP2 6430/6430: [0x7f4aa704b000(0x2000) @ 0x7f4aa704b000 00:00 0 0]: rw-p //anon
PERF_RECORD_MMAP2 6430/6430: [0x7f4aa704b000(0x2000) @ 0x7f4aa704b000 00:00 0 0]: rw-p //anon

Same test case with data. Perf will think the entire 2 pages have been replaced when in fact only the second has.
I believe the problem is likely to impact data and jitted code caches.

#include <sys/mman.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <getopt.h>
#include <err.h>

int main(int argc, char **argv)
{
	void *addr1, *addr2;
	size_t pgsz = sysconf(_SC_PAGESIZE);
	int n = 2;
	int ret;
	int c, mode = 0;

	while ((c = getopt(argc, argv, "hd")) != -1) {
		switch (c) {
		case 'h':
			printf("[-h]\tget this help\n");
			printf("[-d]\tuse data mmaps (no PROT_EXEC)\n");
			return 0;
		case 'd':
			mode = PROT_WRITE;
			break;
		default:
			errx(1, "unknown option");
		}
	}
	/* default to text */
	if (mode == 0)
		mode = PROT_EXEC;

	/*
	 * mmap 2 contiguous pages
	 */
	addr1 = mmap(NULL, n * pgsz, PROT_READ | mode,
		     MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
	if (addr1 == (void *)MAP_FAILED)
		err(1, "mmap 1 failed");

	printf("addr1=[%p : %p]\n", addr1, addr1 + n * pgsz);

	/*
	 * unmap only the second page
	 */
	ret = munmap(addr1 + pgsz, pgsz);
	if (ret == -1)
		err(1, "munmap failed");

	/*
	 * mmap 1 page at the location of the unmapped page (should reuse the
	 * virtual space). This creates a continuous region built from two
	 * mmaps and potentially two different sources, especially with
	 * jitted runtimes.
	 */
	addr2 = mmap(addr1 + pgsz, 1 * pgsz, PROT_READ|PROT_WRITE|mode,
		     MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);

	printf("addr2=%p\n", addr2);

	if (addr2 == (void *)MAP_FAILED)
		err(1, "mmap 2 failed");
	if (addr2 != (addr1 + pgsz))
		errx(1, "wrong mmap2 address");

	sleep(1);

	return 0;
}

On Thu, Nov 1, 2018 at 7:10 AM Liang, Kan wrote:
>
> On 10/24/2018 3:30 PM, Stephane Eranian wrote:
> > The need for this new record type extends beyond physical address
> > conversions and PEBS. A long while ago, someone reported issues with
> > symbolization related to perf lacking munmap tracking. It had to do
> > with vma merging. I think the sequence of mmaps was as follows in the
> > problematic case:
> > 1. addr1 = mmap(8192);
> > 2. munmap(addr1 + 4096, 4096)
> > 3. addr2 = mmap(addr1+4096, 4096)
> >
> > If successful, that yields addr2 = addr1 + 4096 (could also get the
> > same without forcing the address).
> > In that case, if I recall correctly, the vma for the 1st mapping (now
> > 4k) and that of the 2nd mapping (4k) get merged into a single 8k vma,
> > and this is what perf_events will record for PERF_RECORD_MMAP.
> > On the perf tool side, it is assumed that if two timestamped mappings
> > overlap then the latter overrides the former. In this case, perf would
> > lose the mapping of the first 4kb and assume all symbols come from the
> > 2nd mapping. Hopefully I got the scenario right. If so, then you'd need
> > PERF_RECORD_UNMAP to disambiguate, assuming the perf tool is modified
> > accordingly.
>
> Hi Stephane and Peter,
>
> I went through the link (https://lkml.org/lkml/2017/1/27/452). I'm
> trying to understand the problematic case.
>
> It looks like the issue can only be triggered by perf inject --jit,
> because it can inject extra MMAP events.
> To my understanding, the Linux kernel only tries to merge VMAs if they
> are both anon or both from the same file. --jit breaks that rule and
> makes the merged VMA partly from anon, partly from a file.
> Now there is a new MMAP event whose range covers the modified VMA.
> Without the help of a MUNMAP event, the perf tool has no idea whether
> the new one is a newly merged VMA (modified VMA + a new VMA) or a brand
> new VMA. The current code simply overwrites the modified VMAs. The VMA
> information which --jit injected may be lost. The symbolization may be
> lost as well.
>
> Except for --jit, the VMA information should be consistent between the
> kernel and the perf tool. We shouldn't observe the problem. A MUNMAP
> event is not needed.
>
> Is my understanding correct?
>
> Do you have a test case for the problem?
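[Editor's note: the --jit flow discussed in this thread can be exercised end to end roughly as follows. This is a sketch: the agent path and workload name are placeholders, while the flags and the perf.data.jitted convention follow the perf-inject documentation.]

```shell
# Record with a monotonic clock so jitdump timestamps can be correlated
perf record -k mono -o perf.data -- java -agentpath:/usr/lib/libperf-jvmti.so MyApp

# Inject timestamped MMAP2 records covering the jitted code regions
perf inject --jit -i perf.data -o perf.data.jitted

# Symbolize using the injected maps
perf report -i perf.data.jitted
```

It is exactly these injected MMAP2 records that a later kernel-merged VMA record can silently override, which is the symbolization loss Kan describes.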
Re: [PATCH 1/2] perf: Add munmap callback
On 10/24/2018 3:30 PM, Stephane Eranian wrote:
> The need for this new record type extends beyond physical address
> conversions and PEBS. A long while ago, someone reported issues with
> symbolization related to perf lacking munmap tracking. It had to do with
> vma merging. I think the sequence of mmaps was as follows in the
> problematic case:
> 1. addr1 = mmap(8192);
> 2. munmap(addr1 + 4096, 4096)
> 3. addr2 = mmap(addr1+4096, 4096)
>
> If successful, that yields addr2 = addr1 + 4096 (could also get the same
> without forcing the address).
>
> In that case, if I recall correctly, the vma for the 1st mapping (now 4k)
> and that of the 2nd mapping (4k) get merged into a single 8k vma, and
> this is what perf_events will record for PERF_RECORD_MMAP.
> On the perf tool side, it is assumed that if two timestamped mappings
> overlap then the latter overrides the former. In this case, perf would
> lose the mapping of the first 4kb and assume all symbols come from the
> 2nd mapping. Hopefully I got the scenario right. If so, then you'd need
> PERF_RECORD_UNMAP to disambiguate, assuming the perf tool is modified
> accordingly.

Hi Stephane and Peter,

I went through the link (https://lkml.org/lkml/2017/1/27/452). I'm trying to understand the problematic case.

It looks like the issue can only be triggered by perf inject --jit, because it can inject extra MMAP events.
To my understanding, the Linux kernel only tries to merge VMAs if they are both anon or both from the same file. --jit breaks that rule and makes the merged VMA partly from anon, partly from a file.
Now there is a new MMAP event whose range covers the modified VMA. Without the help of a MUNMAP event, the perf tool has no idea whether the new one is a newly merged VMA (modified VMA + a new VMA) or a brand new VMA. The current code simply overwrites the modified VMAs. The VMA information which --jit injected may be lost. The symbolization may be lost as well.

Except for --jit, the VMA information should be consistent between the kernel and the perf tool.
We shouldn't observe the problem, and a MUNMAP event is not needed. Is my understanding correct? Do you have a test case for the problem? Thanks, Kan
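The merge sequence discussed above (map 8k, unmap the second page, map 4k back into the hole) can be reproduced with a small userspace program. This is a hypothetical sketch modeled on the steps Stephane lists, not his actual test case from later in the thread:

```c
/* Hypothetical reproducer for the VMA-merge scenario discussed above:
 * map two anon pages, unmap the second, then map one page back over the
 * freed range with MAP_FIXED. The kernel may merge the two anon VMAs,
 * so perf only ever sees a single 8k PERF_RECORD_MMAP for the result. */
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>

#define PAGE 4096L

static void *remap_second_page(void)
{
	/* step 1: addr1 = mmap(8192) */
	char *addr1 = mmap(NULL, 2 * PAGE, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (addr1 == MAP_FAILED)
		return MAP_FAILED;
	memset(addr1, 1, 2 * PAGE);		/* fault both pages in */
	/* step 2: munmap(addr1 + 4096, 4096) */
	if (munmap(addr1 + PAGE, PAGE))
		return MAP_FAILED;
	/* step 3: MAP_FIXED guarantees addr2 == addr1 + PAGE on success */
	char *addr2 = mmap(addr1 + PAGE, PAGE, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
	return (addr2 == addr1 + PAGE) ? addr1 : MAP_FAILED;
}
```

Running this under perf record and grepping the dump for MMAP2 (as shown at the top of the thread) is what exposes the ambiguity: the records describe the merged region, not the two distinct mappings.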
Re: [PATCH 1/2] perf: Add munmap callback
On Thu, Oct 25, 2018 at 10:00:07AM -0400, Liang, Kan wrote: > Is this patch you mentioned? > https://lkml.org/lkml/2017/1/27/452 Yes.
Re: [PATCH 1/2] perf: Add munmap callback
On 10/24/2018 8:29 PM, Peter Zijlstra wrote: On Wed, Oct 24, 2018 at 08:11:15AM -0700, kan.li...@linux.intel.com wrote: +void perf_event_munmap(void) +{ + struct perf_cpu_context *cpuctx; + unsigned long flags; + struct pmu *pmu; + + local_irq_save(flags); It is impossible to get here with IRQs already disabled. I don't think so. Based on my test, IRQs are still enabled here. I once observed a deadlock with my stress test, so I have to explicitly disable the IRQs here. + list_for_each_entry(cpuctx, this_cpu_ptr(&sched_cb_list), sched_cb_entry) { + pmu = cpuctx->ctx.pmu; + + if (!pmu->munmap) + continue; + + perf_ctx_lock(cpuctx, cpuctx->task_ctx); + perf_pmu_disable(pmu); + + pmu->munmap(); + + perf_pmu_enable(pmu); + + perf_ctx_unlock(cpuctx, cpuctx->task_ctx); + } + local_irq_restore(flags); +} + static void perf_event_switch(struct task_struct *task, struct task_struct *next_prev, bool sched_in); diff --git a/mm/mmap.c b/mm/mmap.c index 5f2b2b184c60..61978ad8c480 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -2777,6 +2777,7 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len, /* * Remove the vma's, and unmap the actual pages */ + perf_event_munmap(); I think that if you add the munmap hook, you should do it right and at least do it such that we can solve the other munmap problem. Is this patch you mentioned? https://lkml.org/lkml/2017/1/27/452 I will take a look and find a right place for both problems. Thanks, Kan detach_vmas_to_be_unmapped(mm, vma, prev, end); unmap_region(mm, vma, prev, start, end); -- 2.17.1
Re: [PATCH 1/2] perf: Add munmap callback
On Wed, Oct 24, 2018 at 5:34 PM Peter Zijlstra wrote: > > On Wed, Oct 24, 2018 at 05:25:59PM -0700, Stephane Eranian wrote: > > On Wed, Oct 24, 2018 at 5:23 PM Peter Zijlstra wrote: > > > > That is actually a different problem. And you're right, we never did fix > > > that. > > > > > it is a different problem but the solution is the same: PERF_RECORD_UNMAP! > > But he's not actually doing PERF_RECORD_UNMAP, nor is his perf_munmap() > hook sufficient to actually generate those. > You're right. But I saw that and thought, this is going in the right direction: tracking munmap. I agree with you that we should use this opportunity to track unmap for his purpose but also the issue I raised, which needs a new record type based on the new unmap hook. > Now I agree that if he's going to do an munmap hook, he should do it > 'right' and at the very least allow for PERF_RECORD_UNMAP to be done, > but ideally simply pick up and finish that patch we had back then. Agreed.
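The PERF_RECORD_UNMAP record that Stephane and Peter agree on would plausibly mirror the existing PERF_RECORD_MMAP layout minus the file fields. The following is an illustrative sketch only; the 2017 RFC linked in the thread defines the actual proposal, and the struct and field names here are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical munmap record body, patterned after PERF_RECORD_MMAP:
 * only the task identity and the unmapped range are needed, since the
 * tool already knows what was mapped there and can split its map. */
struct perf_record_munmap {
	/* struct perf_event_header header; with a new PERF_RECORD_MUNMAP
	 * type would precede this in the ring buffer */
	uint32_t pid, tid;	/* task doing the munmap */
	uint64_t addr;		/* start of the unmapped range */
	uint64_t len;		/* length of the unmapped range */
	/* followed by the usual sample_id trailer (time, etc.) */
};
```

With such a record, the tool-side rule "a later overlapping mapping overrides the earlier one" could be refined: an unmap first truncates or splits the existing map, so the subsequent 4k MMAP no longer shadows the surviving first page.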
Re: [PATCH 1/2] perf: Add munmap callback
On Wed, Oct 24, 2018 at 08:11:15AM -0700, kan.li...@linux.intel.com wrote: > +void perf_event_munmap(void) > +{ > + struct perf_cpu_context *cpuctx; > + unsigned long flags; > + struct pmu *pmu; > + > + local_irq_save(flags); It is impossible to get here with IRQs already disabled. > + list_for_each_entry(cpuctx, this_cpu_ptr(&sched_cb_list), > sched_cb_entry) { > + pmu = cpuctx->ctx.pmu; > + > + if (!pmu->munmap) > + continue; > + > + perf_ctx_lock(cpuctx, cpuctx->task_ctx); > + perf_pmu_disable(pmu); > + > + pmu->munmap(); > + > + perf_pmu_enable(pmu); > + > + perf_ctx_unlock(cpuctx, cpuctx->task_ctx); > + } > + local_irq_restore(flags); > +} > + > static void perf_event_switch(struct task_struct *task, > struct task_struct *next_prev, bool sched_in); > > diff --git a/mm/mmap.c b/mm/mmap.c > index 5f2b2b184c60..61978ad8c480 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -2777,6 +2777,7 @@ int do_munmap(struct mm_struct *mm, unsigned long > start, size_t len, > /* >* Remove the vma's, and unmap the actual pages >*/ > + perf_event_munmap(); I think that if you add the munmap hook, you should do it right and at least do it such that we can solve the other munmap problem. > detach_vmas_to_be_unmapped(mm, vma, prev, end); > unmap_region(mm, vma, prev, start, end); > > -- > 2.17.1 >
Re: [PATCH 1/2] perf: Add munmap callback
On Wed, Oct 24, 2018 at 05:25:59PM -0700, Stephane Eranian wrote: > On Wed, Oct 24, 2018 at 5:23 PM Peter Zijlstra wrote: > > That is actually a different problem. And you're right, we never did fix > > that. > > > it is a different problem but the solution is the same: PERF_RECORD_UNMAP! But he's not actually doing PERF_RECORD_UNMAP, nor is his perf_munmap() hook sufficient to actually generate those. Now I agree that if he's going to do an munmap hook, he should do it 'right' and at the very least allow for PERF_RECORD_UNMAP to be done, but ideally simply pick up and finish that patch we had back then.
Re: [PATCH 1/2] perf: Add munmap callback
On Wed, Oct 24, 2018 at 11:28:54AM -0700, Andi Kleen wrote: > > > void perf_event_mmap(struct vm_area_struct *vma) > > > { > > > struct perf_mmap_event mmap_event; > > > > > > if (!atomic_read(&nr_mmap_events)) > > > return; > > > > > > } > > > > > > > Thanks. I'll add the nr_mmap_events check in V2. > > No, that's the wrong check here. The PEBS flush is independent of mmap > events being requested. > > If anything would need to check for any PEBS events active, which > would need a new counter. I think the easiest is to just check if > this_cpu_ptr(&sched_cb_list) > is empty, which should be good enough. It is just the CLI+STI, not PUSHF;CLI+POPF that are required and that is a lot cheaper already. Also, you need to have preemption disabled in order to check the per-cpu cb list. So I don't think it really makes much sense to try and frob a fast path in there.
Re: [PATCH 1/2] perf: Add munmap callback
On Wed, Oct 24, 2018 at 5:23 PM Peter Zijlstra wrote: > > On Wed, Oct 24, 2018 at 12:30:52PM -0700, Stephane Eranian wrote: > > Hi, > > > > On Wed, Oct 24, 2018 at 8:12 AM wrote: > > > > > > From: Kan Liang > > > > > > To calculate the physical address, perf needs to walk the pages tables. > > > The related mapping may has already been removed from pages table in > > > some cases (e.g. large PEBS). The virtual address recorded in the first > > > PEBS records may already be unmapped before draining PEBS buffers. > > > > > > Add a munmap callback to notify the PMU of any unmapping, which only be > > > invoked when the munmap is implemented. > > > > > The need for this new record type extends beyond physical address > > conversions > > and PEBS. A long while ago, someone reported issues with symbolization > > related > > to perf lacking munmap tracking. It had to do with vma merging. I think the > > sequence of mmaps was as follows in the problematic case: > > 1. addr1 = mmap(8192); > > 2. munmap(addr1 + 4096, 4096) > > 3. addr2 = mmap(addr1+4096, 4096) > > > > If successful, that yields addr2 = addr1 + 4096 (could also get the > > same without forcing the address). > > That is actually a different problem. And you're right, we never did fix > that. > it is a different problem but the solution is the same: PERF_RECORD_UNMAP! That's why I mentioned it here. To show that this is needed for more than one reason ;-) > I agree with you that Kan's Changelog is somewhat cryptic; it took me at > least 3 times reading and looking at what the patches actually do to > understand :/
Re: [PATCH 1/2] perf: Add munmap callback
On Wed, Oct 24, 2018 at 12:30:52PM -0700, Stephane Eranian wrote: > Hi, > > On Wed, Oct 24, 2018 at 8:12 AM wrote: > > > > From: Kan Liang > > > > To calculate the physical address, perf needs to walk the pages tables. > > The related mapping may has already been removed from pages table in > > some cases (e.g. large PEBS). The virtual address recorded in the first > > PEBS records may already be unmapped before draining PEBS buffers. > > > > Add a munmap callback to notify the PMU of any unmapping, which only be > > invoked when the munmap is implemented. > > > The need for this new record type extends beyond physical address conversions > and PEBS. A long while ago, someone reported issues with symbolization related > to perf lacking munmap tracking. It had to do with vma merging. I think the > sequence of mmaps was as follows in the problematic case: > 1. addr1 = mmap(8192); > 2. munmap(addr1 + 4096, 4096) > 3. addr2 = mmap(addr1+4096, 4096) > > If successful, that yields addr2 = addr1 + 4096 (could also get the > same without forcing the address). That is actually a different problem. And you're right, we never did fix that. I agree with you that Kan's Changelog is somewhat cryptic; it took me at least 3 times reading and looking at what the patches actually do to understand :/
Re: [PATCH 1/2] perf: Add munmap callback
Hi, On Wed, Oct 24, 2018 at 8:12 AM wrote: > > From: Kan Liang > > To calculate the physical address, perf needs to walk the pages tables. > The related mapping may has already been removed from pages table in > some cases (e.g. large PEBS). The virtual address recorded in the first > PEBS records may already be unmapped before draining PEBS buffers. > > Add a munmap callback to notify the PMU of any unmapping, which only be > invoked when the munmap is implemented. > The need for this new record type extends beyond physical address conversions and PEBS. A long while ago, someone reported issues with symbolization related to perf lacking munmap tracking. It had to do with vma merging. I think the sequence of mmaps was as follows in the problematic case: 1. addr1 = mmap(8192); 2. munmap(addr1 + 4096, 4096) 3. addr2 = mmap(addr1+4096, 4096) If successful, that yields addr2 = addr1 + 4096 (could also get the same without forcing the address). In that case, if I recall correctly, the vma for 1st mapping (now at 4k) and that of the 2nd mapping (4k) get merged into a single 8k vma and this is what perf_events will record for PERF_RECORD_MMAP. On the perf tool side, it is assumed that if two timestamped mappings overlap then, the latter overrides the former. In this case, perf would lose the mapping of the first 4kb and assume all symbols come from the 2nd mapping. Hopefully I got the scenario right. If so, then you'd need PERF_RECORD_UNMAP to disambiguate assuming the perf tool is modified accordingly. 
> > Signed-off-by: Kan Liang > --- > include/linux/perf_event.h | 3 +++ > kernel/events/core.c | 25 + > mm/mmap.c | 1 + > 3 files changed, 29 insertions(+) > > diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h > index 53c500f0ca79..7f0a9258ce1f 100644 > --- a/include/linux/perf_event.h > +++ b/include/linux/perf_event.h > @@ -400,6 +400,7 @@ struct pmu { > */ > void (*sched_task) (struct perf_event_context *ctx, > bool sched_in); > + void (*munmap) (void); > /* > * PMU specific data size > */ > @@ -1113,6 +1114,7 @@ static inline void perf_event_task_sched_out(struct > task_struct *prev, > } > > extern void perf_event_mmap(struct vm_area_struct *vma); > +extern void perf_event_munmap(void); > extern struct perf_guest_info_callbacks *perf_guest_cbs; > extern int perf_register_guest_info_callbacks(struct > perf_guest_info_callbacks *callbacks); > extern int perf_unregister_guest_info_callbacks(struct > perf_guest_info_callbacks *callbacks); > @@ -1333,6 +1335,7 @@ static inline int perf_unregister_guest_info_callbacks > (struct perf_guest_info_callbacks *callbacks) { > return 0; } > > static inline void perf_event_mmap(struct vm_area_struct *vma) { } > +static inline void perf_event_munmap(void) { } > static inline void perf_event_exec(void) { } > static inline void perf_event_comm(struct task_struct *tsk, bool exec) { } > static inline void perf_event_namespaces(struct task_struct *tsk) { } > diff --git a/kernel/events/core.c b/kernel/events/core.c > index 5a97f34bc14c..00338d6fbed7 100644 > --- a/kernel/events/core.c > +++ b/kernel/events/core.c > @@ -3184,6 +3184,31 @@ static void perf_pmu_sched_task(struct task_struct > *prev, > } > } > > +void perf_event_munmap(void) > +{ > + struct perf_cpu_context *cpuctx; > + unsigned long flags; > + struct pmu *pmu; > + > + local_irq_save(flags); > + list_for_each_entry(cpuctx, this_cpu_ptr(&sched_cb_list), > sched_cb_entry) { > + pmu = cpuctx->ctx.pmu; > + > + if (!pmu->munmap) > + continue; > + > + 
perf_ctx_lock(cpuctx, cpuctx->task_ctx); > + perf_pmu_disable(pmu); > + > + pmu->munmap(); > + > + perf_pmu_enable(pmu); > + > + perf_ctx_unlock(cpuctx, cpuctx->task_ctx); > + } > + local_irq_restore(flags); > +} > + > static void perf_event_switch(struct task_struct *task, > struct task_struct *next_prev, bool sched_in); > > diff --git a/mm/mmap.c b/mm/mmap.c > index 5f2b2b184c60..61978ad8c480 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -2777,6 +2777,7 @@ int do_munmap(struct mm_struct *mm, unsigned long > start, size_t len, > /* > * Remove the vma's, and unmap the actual pages > */ > + perf_event_munmap(); > detach_vmas_to_be_unmapped(mm, vma, prev, end); > unmap_region(mm, vma, prev, start, end); > > -- > 2.17.1 >
Re: [PATCH 1/2] perf: Add munmap callback
Em Wed, Oct 24, 2018 at 02:12:54PM -0400, Liang, Kan escreveu: > > > On 10/24/2018 12:32 PM, Arnaldo Carvalho de Melo wrote: > > Em Wed, Oct 24, 2018 at 09:23:34AM -0700, Andi Kleen escreveu: > > > > +void perf_event_munmap(void) > > > > +{ > > > > + struct perf_cpu_context *cpuctx; > > > > + unsigned long flags; > > > > + struct pmu *pmu; > > > > + > > > > + local_irq_save(flags); > > > > + list_for_each_entry(cpuctx, this_cpu_ptr(&sched_cb_list), > > > > sched_cb_entry) { > > > > > > Would be good have a fast path here that checks for the list being empty > > > without disabling the interrupts. munmap can be somewhat hot. I think > > > it's ok to make it slower with perf running, but we shouldn't impact > > > it without perf. > > > > Right, look at how its counterpart, perf_event_mmap() works: > > > > void perf_event_mmap(struct vm_area_struct *vma) > > { > > struct perf_mmap_event mmap_event; > > > > if (!atomic_read(&nr_mmap_events)) > > return; > > > > } > > > > Thanks. I'll add the nr_mmap_events check in V2. That would be a nr_munmap_events, if this is tied to PERF_RECORD_MUNMAP (right?), which it isn't right now, check Andi's response, mine was more of a "hey, perf_event_mmap does an atomic check, before grabbing any locks". - Arnaldo
Re: [PATCH 1/2] perf: Add munmap callback
> > void perf_event_mmap(struct vm_area_struct *vma) > > { > > struct perf_mmap_event mmap_event; > > > > if (!atomic_read(&nr_mmap_events)) > > return; > > > > } > > > > Thanks. I'll add the nr_mmap_events check in V2. No, that's the wrong check here. The PEBS flush is independent of mmap events being requested. If anything would need to check for any PEBS events active, which would need a new counter. I think the easiest is to just check if this_cpu_ptr(&sched_cb_list) is empty, which should be good enough. -Andi
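Andi's suggestion — bail out on an empty callback list before doing any expensive work — can be sketched in userspace. Everything here is a hypothetical stand-in: a plain singly linked list plays the role of the per-CPU sched_cb_list, a counter stands in for the IRQ-off region, and fake_pebs_flush stands in for pmu->munmap():

```c
#include <assert.h>
#include <stddef.h>

struct cb_entry {
	struct cb_entry *next;
	void (*munmap_cb)(void);	/* stand-in for pmu->munmap */
};

static struct cb_entry *cb_list;	/* stand-in for the per-CPU list */
static int irq_off_sections;		/* counts slow-path entries */
static int flushes;

static void fake_pebs_flush(void) { flushes++; }

static void munmap_hook(void)
{
	if (!cb_list)			/* fast path: nothing registered,
					 * no IRQ fiddling at all */
		return;
	irq_off_sections++;		/* stand-in for local_irq_save() */
	for (struct cb_entry *e = cb_list; e; e = e->next)
		if (e->munmap_cb)
			e->munmap_cb();
	/* stand-in for local_irq_restore() */
}
```

Peter's follow-up caveat applies to the real kernel version of this: safely reading a per-CPU list requires preemption to be disabled, which this single-threaded sketch glosses over.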
Re: [PATCH 1/2] perf: Add munmap callback
On 10/24/2018 12:32 PM, Arnaldo Carvalho de Melo wrote: Em Wed, Oct 24, 2018 at 09:23:34AM -0700, Andi Kleen escreveu: +void perf_event_munmap(void) +{ + struct perf_cpu_context *cpuctx; + unsigned long flags; + struct pmu *pmu; + + local_irq_save(flags); + list_for_each_entry(cpuctx, this_cpu_ptr(&sched_cb_list), sched_cb_entry) { Would be good have a fast path here that checks for the list being empty without disabling the interrupts. munmap can be somewhat hot. I think it's ok to make it slower with perf running, but we shouldn't impact it without perf. Right, look at how its counterpart, perf_event_mmap() works: void perf_event_mmap(struct vm_area_struct *vma) { struct perf_mmap_event mmap_event; if (!atomic_read(&nr_mmap_events)) return; } Thanks. I'll add the nr_mmap_events check in V2. Thanks, Kan
Re: [PATCH 1/2] perf: Add munmap callback
On Wed, Oct 24, 2018 at 09:23:34AM -0700, Andi Kleen wrote:
> > +void perf_event_munmap(void)
> > +{
> > +	struct perf_cpu_context *cpuctx;
> > +	unsigned long flags;
> > +	struct pmu *pmu;
> > +
> > +	local_irq_save(flags);
> > +	list_for_each_entry(cpuctx, this_cpu_ptr(&sched_cb_list),
> > +			    sched_cb_entry) {
>
> Would be good to have a fast path here that checks for the list being
> empty without disabling the interrupts. munmap can be somewhat hot. I
> think it's ok to make it slower with perf running, but we shouldn't
> impact it without perf.

Right, look at how its counterpart, perf_event_mmap(), works:

void perf_event_mmap(struct vm_area_struct *vma)
{
	struct perf_mmap_event mmap_event;

	if (!atomic_read(&nr_mmap_events))
		return;
}

- Arnaldo
Re: [PATCH 1/2] perf: Add munmap callback
> +void perf_event_munmap(void)
> +{
> +	struct perf_cpu_context *cpuctx;
> +	unsigned long flags;
> +	struct pmu *pmu;
> +
> +	local_irq_save(flags);
> +	list_for_each_entry(cpuctx, this_cpu_ptr(&sched_cb_list),
> +			    sched_cb_entry) {

Would be good to have a fast path here that checks for the list being
empty without disabling the interrupts. munmap can be somewhat hot. I
think it's ok to make it slower with perf running, but we shouldn't
impact it without perf.

-Andi