Re: perf: multiple mmap of fd behavior on x86/ARM
On Fri, Aug 11, 2017 at 04:53:30PM +0200, Peter Zijlstra wrote:
> On Fri, Aug 11, 2017 at 12:06:39PM +0100, Mark Rutland wrote:
> > On Fri, Aug 11, 2017 at 12:52:52PM +0200, Peter Zijlstra wrote:
> > > On Fri, Aug 11, 2017 at 11:01:27AM +0100, Mark Rutland wrote:
> > > > On Thu, Aug 10, 2017 at 02:48:52PM -0400, Vince Weaver wrote:
> > > > >
> > > > > So I was working on my perf_event_tests on ARM/ARM64 (the end goal
> > > > > was to get ARM64 rdpmc support working, but apparently those
> > > > > patches never made it upstream?)
> > > >
> > > > IIUC by 'rdpmc' you mean direct userspace counter access?
> > > >
> > > > Patches for that never made it upstream. Last I saw, there were no
> > > > patches in a suitable state for review.
> > > >
> > > > There are also difficulties (e.g. big.LITTLE systems where the number of
> > > > counters can differ across CPUs) which have yet to be solved.
> > >
> > > How would that be a problem? The API gives an explicit index to use with
> > > the 'rdpmc' instruction.
> >
> > It's a problem because accesses to unimplemented counters trap. So if a
> > task gets migrated from a CPU with N counters to one with N-1, accessing
> > counter N would be problematic.
> >
> > So we'd need to account for that somehow, in addition to the usual
> > sequence counter fun to verify the index was valid when the access was
> > performed.
>
> Aah, you need restartable-sequences :-)

Or, in the absence of those, I wouldn't mind only supporting this for
non-big/little platforms initially.

Will
Re: perf: multiple mmap of fd behavior on x86/ARM
On Fri, 11 Aug 2017, Mark Rutland wrote:

> > This isn't some key thing that needs to be fixed, I was just curious about
> > the behavior difference between x86 and ARM.
>
> Sure; likewise I'm curious.

Well, I finally got a current git 64-bit kernel booted on the pi3.

Challenge: USB known to be broken currently, so no keyboard or ethernet.
Extra challenge: had the RX/TX lines switched on the serial connector.
Bonus challenge: the bcm2837 dts file doesn't enable armv8 PMU

I got through all of that, only to find:

$ uname -a
Linux pi3-git 4.13.0-rc4-00152-g2627393 #2 SMP PREEMPT Fri Aug 11 13:58:42 EDT 2017 aarch64 GNU/Linux

$ ./mmap_multiple
Trying to mmap same perf_event fd multiple times...PASSED

So maybe the issue was fixed between 4.9 and current?

Vince
Re: perf: multiple mmap of fd behavior on x86/ARM
On Fri, Aug 11, 2017 at 12:51:12PM -0400, Vince Weaver wrote:
> On Fri, 11 Aug 2017, Mark Rutland wrote:
> > Just to check, how does x86 behave on each of those kernel releases?
> >
> > Many things have changed since v4.4.
>
> I'm fairly sure this test (well, the equivalent code in
> tests/record_sample/record_mmap that I based the test on) has been passing
> on all of my x86 test machines since ~3.10 or so, or else I would have
> noticed.

Ok.

> If I can get a custom kernel to boot on one of my machines I can start
> digging in and see if I can find where the EINVAL comes from.

From a quick scan, I can't spot anything obvious that has changed since
v4.9 and would affect the arm64 perf mmap behaviour.

> This isn't some key thing that needs to be fixed, I was just curious about
> the behavior difference between x86 and ARM.

Sure; likewise I'm curious.

Thanks,
Mark.
Re: perf: multiple mmap of fd behavior on x86/ARM
On Fri, 11 Aug 2017, Mark Rutland wrote:

> IIRC, patches were sent back in 2014, but as I mentioned above, those
> were far from suitable for upstream, even ignoring cases like
> big.LITTLE. Said patches were never reworked and reposted.

Here's the commit message in the perf_event_tests tree; I'm having trouble
finding the original e-mail that went with it.

commit 2cc2e21e349243889ba59408527cc1a97dd0dc44
Author: Yogesh Tillu
Date:   Tue Mar 1 14:18:22 2016 +0530

    Add support for RDPMC test with mmap way

    This test adds support for reading perf hw counter from userspace.
    Method (2) rdpmc_comparision_mmap: Test read perf hw counter in
    userspace using open/mmap syscall. It requires kernel with perf mmap
    patchset and
        echo 1 > /sys/bus/platform/drivers/armv8-pmu/rdpmc
    Above Method Tested On:(X86/ARM)
    It is tested with perf mmap patchset on kernel v4.5.0-rc5+
    With above Tests, we can benchmark access of perf hw counters in
    userspace with syscall vs perf_event_mmap_page way.

    Signed-off-by: Yogesh Tillu

> Just to check, how does x86 behave on each of those kernel releases?
>
> Many things have changed since v4.4.

I'm fairly sure this test (well, the equivalent code in
tests/record_sample/record_mmap that I based the test on) has been passing
on all of my x86 test machines since ~3.10 or so, or else I would have
noticed.

If I can get a custom kernel to boot on one of my machines I can start
digging in and see if I can find where the EINVAL comes from.

This isn't some key thing that needs to be fixed, I was just curious about
the behavior difference between x86 and ARM.

There are a few other minor x86/ARM differences, especially relating to
perf_event_open() error returns, that I had to special case in a few of my
tests.

Vince
Re: perf: multiple mmap of fd behavior on x86/ARM
On Fri, Aug 11, 2017 at 11:25:37AM -0400, Vince Weaver wrote:
> On Fri, 11 Aug 2017, Mark Rutland wrote:
>
> > IIUC by 'rdpmc' you mean direct userspace counter access?
> >
> > Patches for that never made it upstream. Last I saw, there were no
> > patches in a suitable state for review.
>
> yes, someone from Linaro sent me some code a while back that implemented
> the userspace side and claimed the kernel patches would appear at some
> point. I should try to dig up that e-mail.

IIRC, patches were sent back in 2014, but as I mentioned above, those
were far from suitable for upstream, even ignoring cases like
big.LITTLE. Said patches were never reworked and reposted.

> > > On ARM/ARM64 you can only mmap() it once, any other attempts fail.
> >
> > Interesting. Which platform(s) are you testing on, with which kernel
> > version(s)?
>
> This is on a Dragonboard 401c running a vendor 64-bit 4.4 kernel,
> a Nvidia Jetson TX-1 board running a 64-bit 3.10 vendor kernel,

Just to check, how does x86 behave on each of those kernel releases?

Many things have changed since v4.4.

> as well as a Raspberry Pi 3B running a 32-bit 4.9 pi foundation kernel.

Hmm. On 32-bit this might be down to some arch/arm/mm cache aliasing
code, or it might be down to something that's changed since v4.9.

> It's a pain getting a recent-git kernel on these boards but I'm most of
> the way to getting one booting on the Pi 3B. (got distracted by the fact
> that Linpack still reliably crashes the Pi-3b even with a heatsink).

IIUC, were you to modify this test to use SW events, you could test it
on an aarch64 kernel running under QEMU. To the best of my knowledge,
the code paths for HW and SW PMU are identical for mmap.

Otherwise, you might have more luck using a foundation model, which has
a PMU.

Thanks,
Mark.
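As a rough illustration of the "use SW events" suggestion above, the event attributes could be set up along these lines. This is only a sketch: the helper name init_sw_clock_attr, the choice of PERF_COUNT_SW_CPU_CLOCK, and the sample period are assumptions made for the example, and the thread does not show which event mmap_multiple actually opens.

```c
#include <assert.h>
#include <string.h>
#include <linux/perf_event.h>

/*
 * Hypothetical helper (not from perf_event_tests): describe a software
 * clock event, so that opening it does not depend on a hardware PMU
 * being present, e.g. under QEMU.
 */
void init_sw_clock_attr(struct perf_event_attr *attr)
{
    assert(attr != NULL);

    memset(attr, 0, sizeof(*attr));
    attr->size = sizeof(*attr);
    attr->type = PERF_TYPE_SOFTWARE;         /* software PMU, not PERF_TYPE_HARDWARE */
    attr->config = PERF_COUNT_SW_CPU_CLOCK;  /* assumed event choice */
    attr->sample_period = 100000;            /* arbitrary period, an assumption */
    attr->sample_type = PERF_SAMPLE_IP;
    attr->disabled = 1;                      /* caller enables the event later */
    attr->exclude_kernel = 1;
    attr->exclude_hv = 1;
}
```

The filled-in structure would then be passed to perf_event_open() in place of the hardware event, leaving the mmap() part of the test unchanged.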
Re: perf: multiple mmap of fd behavior on x86/ARM
On Fri, 11 Aug 2017, Mark Rutland wrote:

> IIUC by 'rdpmc' you mean direct userspace counter access?
>
> Patches for that never made it upstream. Last I saw, there were no
> patches in a suitable state for review.

yes, someone from Linaro sent me some code a while back that implemented
the userspace side and claimed the kernel patches would appear at some
point. I should try to dig up that e-mail.

The "rdpmc" code looked something like this

    if (counter == PERF_COUNT_HW_CPU_CYCLES)
        asm volatile("mrs %0, pmccntr_el0" : "=r" (ret));
    else {
        asm volatile("msr pmselr_el0, %0" : : "r" ((counter-1)));
        asm volatile("mrs %0, pmxevcntr_el0" : "=r" (ret));
    }

> > On ARM/ARM64 you can only mmap() it once, any other attempts fail.
>
> Interesting. Which platform(s) are you testing on, with which kernel
> version(s)?

This is on a Dragonboard 401c running a vendor 64-bit 4.4 kernel,
a Nvidia Jetson TX-1 board running a 64-bit 3.10 vendor kernel,
as well as a Raspberry Pi 3B running a 32-bit 4.9 pi foundation kernel.

It's a pain getting a recent-git kernel on these boards but I'm most of
the way to getting one booting on the Pi 3B. (got distracted by the fact
that Linpack still reliably crashes the Pi-3b even with a heatsink).

Here's strace from the Dragonboard:

perf_event_open(0x7fc649e900, 0, -1, -1, 0) = 3
mmap(NULL, 36864, PROT_READ|PROT_WRITE, MAP_SHARED, 3, 0) = 0x7f7e1b1000
mmap(NULL, 36864, PROT_READ|PROT_WRITE, MAP_SHARED, 3, 0) = -1 EINVAL (Invalid argument)

Vince
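For readers who want to reproduce the two mmap() calls in the strace above outside the test suite, a minimal sketch follows. The helper names perf_mmap_len and mmap_fd_twice are hypothetical and not taken from perf_event_tests. The length of 36864 bytes in the trace is consistent with one metadata page plus eight data pages at a 4 KiB page size, which is an assumption about how the test sizes its buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Length of a perf ring-buffer mapping: one metadata page followed by
 * the data pages. With 8 data pages of 4096 bytes this gives 36864,
 * the length seen in the strace.
 */
size_t perf_mmap_len(size_t data_pages, size_t page_size)
{
    assert(page_size > 0);
    return (1 + data_pages) * page_size;
}

/*
 * Map the same fd twice with the flags from the strace.
 * Returns 0 if both mappings succeed, 1 if the first mmap() fails,
 * and 2 if only the second one fails (the behaviour reported on ARM).
 */
int mmap_fd_twice(int fd, size_t len)
{
    void *first, *second;
    int rc = 0;

    first = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (first == MAP_FAILED)
        return 1;

    second = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (second == MAP_FAILED)
        rc = 2;
    else
        munmap(second, len);

    munmap(first, len);
    return rc;
}
```

The fd would come from perf_event_open(), which has no libc wrapper and is normally invoked as syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0), matching the arguments in the trace. A return value of 2 corresponds to the EINVAL case shown above.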
Re: perf: multiple mmap of fd behavior on x86/ARM
On Fri, Aug 11, 2017 at 12:06:39PM +0100, Mark Rutland wrote:
> On Fri, Aug 11, 2017 at 12:52:52PM +0200, Peter Zijlstra wrote:
> > On Fri, Aug 11, 2017 at 11:01:27AM +0100, Mark Rutland wrote:
> > > On Thu, Aug 10, 2017 at 02:48:52PM -0400, Vince Weaver wrote:
> > > >
> > > > So I was working on my perf_event_tests on ARM/ARM64 (the end goal
> > > > was to get ARM64 rdpmc support working, but apparently those patches
> > > > never made it upstream?)
> > >
> > > IIUC by 'rdpmc' you mean direct userspace counter access?
> > >
> > > Patches for that never made it upstream. Last I saw, there were no
> > > patches in a suitable state for review.
> > >
> > > There are also difficulties (e.g. big.LITTLE systems where the number of
> > > counters can differ across CPUs) which have yet to be solved.
> >
> > How would that be a problem? The API gives an explicit index to use with
> > the 'rdpmc' instruction.
>
> It's a problem because accesses to unimplemented counters trap. So if a
> task gets migrated from a CPU with N counters to one with N-1, accessing
> counter N would be problematic.
>
> So we'd need to account for that somehow, in addition to the usual
> sequence counter fun to verify the index was valid when the access was
> performed.

Aah, you need restartable-sequences :-)
Re: perf: multiple mmap of fd behavior on x86/ARM
On Fri, Aug 11, 2017 at 12:52:52PM +0200, Peter Zijlstra wrote:
> On Fri, Aug 11, 2017 at 11:01:27AM +0100, Mark Rutland wrote:
> > On Thu, Aug 10, 2017 at 02:48:52PM -0400, Vince Weaver wrote:
> > >
> > > So I was working on my perf_event_tests on ARM/ARM64 (the end goal was to
> > > get ARM64 rdpmc support working, but apparently those patches never made
> > > it upstream?)
> >
> > IIUC by 'rdpmc' you mean direct userspace counter access?
> >
> > Patches for that never made it upstream. Last I saw, there were no
> > patches in a suitable state for review.
> >
> > There are also difficulties (e.g. big.LITTLE systems where the number of
> > counters can differ across CPUs) which have yet to be solved.
>
> How would that be a problem? The API gives an explicit index to use with
> the 'rdpmc' instruction.

It's a problem because accesses to unimplemented counters trap. So if a
task gets migrated from a CPU with N counters to one with N-1, accessing
counter N would be problematic.

So we'd need to account for that somehow, in addition to the usual
sequence counter fun to verify the index was valid when the access was
performed.

Thanks,
Mark.
Re: perf: multiple mmap of fd behavior on x86/ARM
On Fri, Aug 11, 2017 at 11:01:27AM +0100, Mark Rutland wrote:
> On Thu, Aug 10, 2017 at 02:48:52PM -0400, Vince Weaver wrote:
> >
> > So I was working on my perf_event_tests on ARM/ARM64 (the end goal was to
> > get ARM64 rdpmc support working, but apparently those patches never made
> > it upstream?)
>
> IIUC by 'rdpmc' you mean direct userspace counter access?
>
> Patches for that never made it upstream. Last I saw, there were no
> patches in a suitable state for review.
>
> There are also difficulties (e.g. big.LITTLE systems where the number of
> counters can differ across CPUs) which have yet to be solved.

How would that be a problem? The API gives an explicit index to use with
the 'rdpmc' instruction.
Re: perf: multiple mmap of fd behavior on x86/ARM
On Thu, Aug 10, 2017 at 02:48:52PM -0400, Vince Weaver wrote:
>
> So I was working on my perf_event_tests on ARM/ARM64 (the end goal was to
> get ARM64 rdpmc support working, but apparently those patches never made
> it upstream?)

IIUC by 'rdpmc' you mean direct userspace counter access?

Patches for that never made it upstream. Last I saw, there were no
patches in a suitable state for review.

There are also difficulties (e.g. big.LITTLE systems where the number of
counters can differ across CPUs) which have yet to be solved.

> anyway one test was failing due to an x86/arm difference, which is
> possibly only tangentially perf related.
>
> On x86 you can mmap() a perf_event_open() file descriptor multiple times
> and it works.
>
> On ARM/ARM64 you can only mmap() it once, any other attempts fail.

Interesting. Which platform(s) are you testing on, with which kernel
version(s)?

> Is this expected behavior?

I'm not sure, but it sounds surprising.

> You can run the
> tests/record_sample/mmap_multiple
> test in the current git of my perf_event_tests testsuite for a testcase.

This appears to work for me:

nanook@ribbensteg:~/src/perf_event_tests/tests/record_sample$ ./mmap_multiple
Trying to mmap same perf_event fd multiple times...PASSED
nanook@ribbensteg:~/src/perf_event_tests/tests/record_sample$ git log --oneline HEAD~1..
c82c4dd tests: huge_grou_start: add info that this was fixed in Linux 4.3
nanook@ribbensteg:~/src/perf_event_tests/tests/record_sample$ uname -a
Linux ribbensteg 4.13.0-rc4-00010-g2ce1491 #229 SMP PREEMPT Thu Aug 10 17:06:56 BST 2017 aarch64 aarch64 aarch64 GNU/Linux
nanook@ribbensteg:~/src/perf_event_tests/tests/record_sample$ strace ./mmap_multiple
execve("./mmap_multiple", ["./mmap_multiple"], [/* 18 vars */]) = 0
brk(0) = 0x2d9aa000
faccessat(AT_FDCWD, "/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x9d10e000
faccessat(AT_FDCWD, "/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=42361, ...}) = 0
mmap(NULL, 42361, PROT_READ, MAP_PRIVATE, 3, 0) = 0x9d103000
close(3) = 0
faccessat(AT_FDCWD, "/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/lib/aarch64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0\267\0\1\0\0\0(\17\2\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1283776, ...}) = 0
mmap(NULL, 1356664, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x9cf9b000
mprotect(0x9d0ce000, 61440, PROT_NONE) = 0
mmap(0x9d0dd000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x132000) = 0x9d0dd000
mmap(0x9d0e3000, 13176, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x9d0e3000
close(3) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x9cf9a000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x9cf99000
mprotect(0x9d0dd000, 16384, PROT_READ) = 0
mprotect(0x412000, 4096, PROT_READ) = 0
mprotect(0x9d112000, 4096, PROT_READ) = 0
munmap(0x9d103000, 42361) = 0
perf_event_open(0xfbff0310, 0, -1, -1, 0) = 3
mmap(NULL, 36864, PROT_READ|PROT_WRITE, MAP_SHARED, 3, 0) = 0x9d105000
mmap(NULL, 36864, PROT_READ|PROT_WRITE, MAP_SHARED, 3, 0) = 0x9cf9
ioctl(1, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, {B38400 opost isig icanon echo ...}) = 0
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 1), ...}) = 0
mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x9cf8
write(1, "Trying to mmap same perf_event f"..., 77Trying to mmap same perf_event fd multiple times...PASSED
) = 77
exit_group(0) = ?
+++ exited with 0 +++

Thanks,
Mark.
perf: multiple mmap of fd behavior on x86/ARM
So I was working on my perf_event_tests on ARM/ARM64 (the end goal was to
get ARM64 rdpmc support working, but apparently those patches never made
it upstream?)

anyway one test was failing due to an x86/arm difference, which is
possibly only tangentially perf related.

On x86 you can mmap() a perf_event_open() file descriptor multiple times
and it works.

On ARM/ARM64 you can only mmap() it once, any other attempts fail.

Is this expected behavior?

You can run the
    tests/record_sample/mmap_multiple
test in the current git of my perf_event_tests testsuite for a testcase.

Vince