Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
Em Tue, Mar 23, 2010 at 11:14:41AM +0800, Zhang, Yanmin escreveu:

On Mon, 2010-03-22 at 13:44 -0300, Arnaldo Carvalho de Melo wrote:

Em Mon, Mar 22, 2010 at 03:24:47PM +0800, Zhang, Yanmin escreveu:

On Fri, 2010-03-19 at 09:21 +0100, Ingo Molnar wrote:

Then, perf could access all files. It's possible because a guest OS instance happens to be multi-threading in a process. One of the defects is that access to the guest OS becomes slow or impossible when the guest OS is very busy.

If the MMAP events on the guest included a cookie that could later be used to query for the symtab of that DSO, we wouldn't need to access the guest FS at all, right?

It depends on the specific sub-commands. As for 'perf kvm top', developers want to see the profiling immediately. Even with 'perf kvm record', developers also want to

That is not a problem: if you have the relevant build-ids in your cache (look in your machine at ~/.debug/), it will be as fast as ever. If you use a distro that has its userspace with build-ids, you probably use it always without noticing :-)

see results quickly. At least I'm eager for the results when investigating a performance issue.

Sure thing. With build-ids and debuginfo-install like tools, the symbol resolution could be performed by using the cookies (build-ids) as keys to get to the *-debuginfo packages with matching symtabs (and DWARF for source annotation, etc).

We can't make sure the guest OS uses the same OS images, or we don't know where we could find the original DVD images used to install the guest OS.

You don't have to have guest and host sharing the same OS image, you just have to somehow populate your build-id cache with what you need, be it using sshfs as Ingo is suggesting, or using what your vendor provides (debuginfo packages). And you just have to do it once, for the relevant apps, to have it in your build-id cache.

Current perf does save build-ids, including both the kernel's and other applications' libs/executables.

Yeah, I know, I implemented it.
:-) We have that for the kernel as:

[a...@doppio linux-2.6-tip]$ l /sys/kernel/notes
-r--r--r-- 1 root root 36 2010-03-22 13:14 /sys/kernel/notes
[a...@doppio linux-2.6-tip]$ l /sys/module/ipv6/sections/.note.gnu.build-id
-r--r--r-- 1 root root 4096 2010-03-22 13:38 /sys/module/ipv6/sections/.note.gnu.build-id
[a...@doppio linux-2.6-tip]$

That way we would cover DSOs being reinstalled in long running 'perf record' sessions too.

That's one of the objectives of perf, to support long running sessions. But it doesn't fully support that right now,

as I explained: build-ids are collected at the end of the record session, because we have to open the DSOs that had hits to get the 20-byte cookie we need, the build-id. If we had it in the PERF_RECORD_MMAP record, we would close this race, and the added cost at load time should be minimal: get the ELF section with it and put it somewhere in the task struct. If only we could coalesce it a bit to reclaim this:

[a...@doppio linux-2.6-tip]$ pahole -C task_struct ../build/v2.6.34-rc1-tip+/kernel/sched.o | tail -5
        /* size: 5968, cachelines: 94, members: 150 */
        /* sum members: 5943, holes: 7, sum holes: 25 */
        /* bit holes: 1, sum bit holes: 28 bits */
        /* last cacheline: 16 bytes */
};
[a...@doppio linux-2.6-tip]$

8-)

Or at least get just one of those 4-byte holes; then we could stick it at the end to get our build-id there. Accessing it would be done only at PERF_RECORD_MMAP injection time, i.e. close to the time when we actually are loading the executable mmap, i.e. close to the time when the loader is injecting the build-id, so I guess the extra memory and processing costs would be in the noise. This was discussed some time ago but would require help from the bits that load DSOs. build-ids then would be first class citizens.

- Arnaldo

--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
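As an aside for readers: /sys/kernel/notes above is a raw ELF note blob, and the 20-byte build-id cookie discussed in this mail lives inside it. The sketch below shows, with standard tools only, how such a blob can be picked apart. It is an illustration, not part of the patch: it assumes a little-endian host and that the first note in the blob is the NT_GNU_BUILD_ID note (name "GNU", type 3); a robust tool would iterate over all notes.

```shell
# Sketch: extract a GNU build-id from an ELF note blob such as /sys/kernel/notes.
# Note layout: 4-byte namesz, 4-byte descsz, 4-byte type, padded name, descriptor.
# Assumes little-endian and that the first note is the build-id note.
extract_buildid() {
    blob=$1
    namesz=$(od -A n -t u4 -j 0 -N 4 "$blob" | tr -d ' ')   # length of the note name
    descsz=$(od -A n -t u4 -j 4 -N 4 "$blob" | tr -d ' ')   # length of the descriptor
    pad=$(( (namesz + 3) / 4 * 4 ))                          # name is 4-byte padded
    # the descriptor (the build-id bytes) starts after the 12-byte header + name
    od -A n -t x1 -j $((12 + pad)) -N "$descsz" "$blob" | tr -d ' \n'
    echo
}
```

e.g. `extract_buildid /sys/kernel/notes` would print the running kernel's build-id as a hex string, matching what `perf buildid-list` reports for the kernel.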
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
Joerg Roedel j...@8bytes.org writes:

On Mon, Mar 22, 2010 at 11:59:27AM +0100, Ingo Molnar wrote:

Best would be if you demonstrated any problems of the perf symbol lookup code you are aware of on the host side, as it has that exact design you are criticising here. We are eager to fix any bugs in it. If you claim that it's buggy then that should very much be demonstratable - no need to go into theoretical arguments about it.

I am not claiming anything. I was just trying to imagine how your proposal would look in practice, and forgot that symbol resolution is done at a later point. But even with deferred symbol resolution we need more information from the guest than just the rip falling out of KVM. The guest needs to tell us about the process where the event happened (information that the host has about itself without any hassle) and which executable files it was loaded from.

Slightly tangential, but there is another case that has some of the same problems: profiling other language runtimes than C and C++, say Python. At the moment profilers will generally tell you what is going on inside the python runtime, but not what the python program itself is doing. To fix that problem, it seems like we need some way to have python export what is going on. Maybe the same mechanism could be used to access what is going on in both qemu and python.

Soren
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
Soeren Sandmann sandm...@daimi.au.dk writes:

To fix that problem, it seems like we need some way to have python export what is going on. Maybe the same mechanism could be used to access what is going on in both qemu and python.

oprofile already has an interface to let JITs export information about the JITed code. CPython is not a JIT, but presumably one of the python JITs could do it.

http://oprofile.sourceforge.net/doc/devel/index.html

I know it's not en vogue anymore and you won't be an approved cool kid if you do, but you could just use oprofile? Ok, presumably one would need to do a python interface for this first. I believe it's currently only implemented for Java and Mono. I presume it might work today with IronPython on Mono. IMHO it doesn't make sense to invent another interface for this, although I'm sure someone will propose just that.

-Andi

--
a...@linux.intel.com -- Speaking for myself only.
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
Andi Kleen a...@firstfloor.org writes:

Soeren Sandmann sandm...@daimi.au.dk writes:

To fix that problem, it seems like we need some way to have python export what is going on. Maybe the same mechanism could be used to access what is going on in both qemu and python.

oprofile already has an interface to let JITs export information about the JITed code. CPython is not a JIT, but presumably one of the python JITs could do it.

http://oprofile.sourceforge.net/doc/devel/index.html

It's not that I personally want to profile a particular python program. I'm interested in the more general problem of extracting more information from profiled user space programs than just stack traces. Examples:

- What is going on inside QEMU?
- Which client is the X server servicing?
- What parts of a python/shell/scheme/javascript program are taking the most CPU time?

I don't think the oprofile JIT interface solves any of these problems. (In fact, I don't see why the JIT problem is even hard. The JIT compiler can just generate a little ELF file with symbols in it, and the profiler can pick it up through the mmap events that you get through the perf interface.)

I know it's not en vogue anymore and you won't be an approved cool kid if you do, but you could just use oprofile?

I am bringing this up because I want to extend sysprof to be more useful.

Soren
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
Em Tue, Mar 23, 2010 at 02:49:01PM +0100, Andi Kleen escreveu:

Soeren Sandmann sandm...@daimi.au.dk writes:

To fix that problem, it seems like we need some way to have python export what is going on. Maybe the same mechanism could be used to access what is going on in both qemu and python.

oprofile already has an interface to let JITs export information about the JITed code. CPython is not a JIT, but presumably one of the python JITs could do it.

http://oprofile.sourceforge.net/doc/devel/index.html

I know it's not en vogue anymore and you won't be an approved cool kid if you do, but you could just use oprofile?

perf also has support for this, and Pekka Enberg's jato uses it:

http://penberg.blogspot.com/2009/06/jato-has-profiler.html

- Arnaldo
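For readers unfamiliar with the perf-side mechanism being referred to: perf can resolve symbols in JITed code from a per-process map file, /tmp/perf-&lt;pid&gt;.map, with one "start size name" line (hex start address, hex size, symbol name) per generated function. A minimal sketch of a runtime emitting such a file follows; the addresses and symbol names are made up for illustration.

```shell
# Sketch: the /tmp/perf-<pid>.map convention perf uses for JITed code.
# Each line is "<start-addr-hex> <size-hex> <symbol-name>"; perf consults the
# file of the pid that produced a sample in an otherwise symbol-less mapping.
# Addresses and names below are illustrative, not from a real runtime.
emit_jit_map() {
    pid=$1
    cat > "/tmp/perf-${pid}.map" <<'EOF'
40000000 100 py::fib
40000100 80 py::main
EOF
}
```

A runtime would append a line each time it emits a new function; 'perf report' then shows py::fib instead of an anonymous address.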
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
Soeren Sandmann sandm...@daimi.au.dk writes:

Examples:

- What is going on inside QEMU?

That's something the JIT interface could answer.

- Which client is the X server servicing?
- What parts of a python/shell/scheme/javascript program are taking the most CPU time?

I suspect for those you rather need event based tracers of some sort, similar to kernel trace points. Otherwise you would need your own separate stacks and other complications. systemtap has some effort to use the dtrace instrumentation that crops up in more and more user programs for this. It wouldn't surprise me if that was already in python and other programs you're interested in. I presume right now it only works if you apply the utrace monstrosity, though, but perhaps the new uprobes patches floating around will come to the rescue. There also was some effort to have a pure user space daemon based approach for LTT, but I believe that currently needs its own trace points. Again, I fully expect someone to reinvent the wheel here and afterwards complain about community inefficiencies :-)

I don't think the oprofile JIT interface solves any of these problems. (In fact, I don't see why the JIT problem is even hard. The JIT compiler can just generate a little ELF file with symbols in it, and the profiler can pick it up through the mmap events that you get through the perf interface.)

That would require keeping those temporary ELF files around for a potentially unlimited time (profilers today look at the ELF files at the final analysis phase, which might be weeks away). Also, that would be a lot of overhead for the JIT and most likely be a larger scale rewrite for a given JIT code base.

-Andi

--
a...@linux.intel.com -- Speaking for myself only.
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
Em Tue, Mar 23, 2010 at 03:20:11PM +0100, Andi Kleen escreveu:

Soeren Sandmann sandm...@daimi.au.dk writes:

I don't think the oprofile JIT interface solves any of these problems. (In fact, I don't see why the JIT problem is even hard. The JIT compiler can just generate a little ELF file with symbols in it, and the profiler can pick it up through the mmap events that you get through the perf interface.)

That would require keeping those temporary ELF files around for a potentially unlimited time (profilers today look at the ELF files at the final analysis phase, which might be weeks away).

'perf record' will traverse the perf.data file just collected and, if the binaries have build-ids, will stash them in ~/.debug/, keyed by build-id just like the -debuginfo packages do. So only the binaries with hits. Also, one can use 'perf archive' to create a tar.bz2 file with the files with hits for the specified perf.data file, which can then be transferred to another machine, whatever the arch, untarred at ~/.debug, and then the report can be done there. As it is done by build-id, multiple 'perf record' sessions share files in the cache.

Right now the whole ELF file (or /proc/kallsyms copy) is stored if collected from the DSO directly, or the bits that are stored in -debuginfo files if we find them installed (so smaller). We could strip that down further by storing just the ELF sections needed to make sense of the symtab.

- Arnaldo
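The cross-machine flow Arnaldo describes boils down to a handful of commands, sketched below. The helper function mirrors the build-id-keyed cache layout (first two hex digits as a subdirectory, the rest as the entry name); treat the exact path scheme as an illustrative assumption rather than a spec, since the on-disk layout is an implementation detail of the perf cache.

```shell
# Sketch of the 'perf archive' workflow described above.
# On the recording machine:
#   perf record -a sleep 10
#   perf archive perf.data          # bundles the DSOs with hits, keyed by build-id
#   scp perf.data perf.data.tar.bz2 analysis-box:
# On the analysis machine:
#   tar xjf perf.data.tar.bz2 -C ~/.debug
#   perf report -i perf.data
#
# Helper: compute a cache entry path for a given build-id, assuming the
# common "first two hex chars as a subdirectory" layout.
buildid_link_path() {
    id=$1
    printf '%s/.debug/.build-id/%s/%s\n' "$HOME" \
        "$(printf %s "$id" | cut -c1-2)" "$(printf %s "$id" | cut -c3-)"
}
```

Because entries are keyed by build-id, archives from multiple record sessions can be untarred into the same ~/.debug without colliding.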
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
Soeren Sandmann sandm...@daimi.au.dk writes:

[...]
- What is going on inside QEMU?
- Which client is the X server servicing?
- What parts of a python/shell/scheme/javascript program are taking the most CPU time?
[...]

These kinds of questions usually require navigation through internal data of the user-space process (where in this linked list is this pointer?), and often also correlating them with history (which socket/fd was most recently serviced?). Systemtap excels at letting one express such things.

- FChE
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On Tue, 2010-03-23 at 11:10 -0300, Arnaldo Carvalho de Melo wrote:

Em Tue, Mar 23, 2010 at 02:49:01PM +0100, Andi Kleen escreveu:

Soeren Sandmann sandm...@daimi.au.dk writes:

To fix that problem, it seems like we need some way to have python export what is going on. Maybe the same mechanism could be used to access what is going on in both qemu and python.

oprofile already has an interface to let JITs export information about the JITed code. CPython is not a JIT, but presumably one of the python JITs could do it.

http://oprofile.sourceforge.net/doc/devel/index.html

I know it's not en vogue anymore and you won't be an approved cool kid if you do, but you could just use oprofile?

perf also has support for this, and Pekka Enberg's jato uses it:

http://penberg.blogspot.com/2009/06/jato-has-profiler.html

Right, we need to move that into a library though (always meant to do that, never got around to doing it). That way the app can link against a DSO with weak empty stubs and have perf record LD_PRELOAD a version that has a suitable implementation. That all has the advantage of not exposing the actual interface like we do now.
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On Tue, 2010-03-23 at 10:15 -0300, Arnaldo Carvalho de Melo wrote:

Em Tue, Mar 23, 2010 at 11:14:41AM +0800, Zhang, Yanmin escreveu:

On Mon, 2010-03-22 at 13:44 -0300, Arnaldo Carvalho de Melo wrote:

Em Mon, Mar 22, 2010 at 03:24:47PM +0800, Zhang, Yanmin escreveu:

On Fri, 2010-03-19 at 09:21 +0100, Ingo Molnar wrote:

Then, perf could access all files. It's possible because a guest OS instance happens to be multi-threading in a process. One of the defects is that access to the guest OS becomes slow or impossible when the guest OS is very busy.

If the MMAP events on the guest included a cookie that could later be used to query for the symtab of that DSO, we wouldn't need to access the guest FS at all, right?

It depends on the specific sub-commands. As for 'perf kvm top', developers want to see the profiling immediately. Even with 'perf kvm record', developers also want to

That is not a problem: if you have the relevant build-ids in your cache (look in your machine at ~/.debug/), it will be as fast as ever. If you use a distro that has its userspace with build-ids, you probably use it always without noticing :-)

see results quickly. At least I'm eager for the results when investigating a performance issue.

Sure thing. With build-ids and debuginfo-install like tools, the symbol resolution could be performed by using the cookies (build-ids) as keys to get to the *-debuginfo packages with matching symtabs (and DWARF for source annotation, etc).

We can't make sure the guest OS uses the same OS images, or we don't know where we could find the original DVD images used to install the guest OS.

You don't have to have guest and host sharing the same OS image, you just have to somehow populate your build-id cache with what you need, be it using sshfs as Ingo is suggesting, or using what your vendor provides (debuginfo packages). And you just have to do it once, for the relevant apps, to have it in your build-id cache.
Current perf does save build-ids, including both the kernel's and other applications' libs/executables.

Yeah, I know, I implemented it. :-) We have that for the kernel as:

[a...@doppio linux-2.6-tip]$ l /sys/kernel/notes
-r--r--r-- 1 root root 36 2010-03-22 13:14 /sys/kernel/notes
[a...@doppio linux-2.6-tip]$ l /sys/module/ipv6/sections/.note.gnu.build-id
-r--r--r-- 1 root root 4096 2010-03-22 13:38 /sys/module/ipv6/sections/.note.gnu.build-id
[a...@doppio linux-2.6-tip]$

That way we would cover DSOs being reinstalled in long running 'perf record' sessions too.

That's one of the objectives of perf, to support long running sessions. But it doesn't fully support that right now,

as I explained: build-ids are collected at the end of the record session, because we have to open the DSOs that had hits to get the 20-byte cookie we need, the build-id. If we had it in the PERF_RECORD_MMAP record, we would close this race, and the added cost at load time should be minimal: get the ELF section with it and put it somewhere in the task struct.

Well, you are improving upon perfection.

If only we could coalesce it a bit to reclaim this:

[a...@doppio linux-2.6-tip]$ pahole -C task_struct ../build/v2.6.34-rc1-tip+/kernel/sched.o | tail -5
        /* size: 5968, cachelines: 94, members: 150 */
        /* sum members: 5943, holes: 7, sum holes: 25 */
        /* bit holes: 1, sum bit holes: 28 bits */
        /* last cacheline: 16 bytes */
};
[a...@doppio linux-2.6-tip]$

That reminds me I listened to your presentation at OLS 2007. :)

8-)

Or at least get just one of those 4-byte holes; then we could stick it at the end to get our build-id there. Accessing it would be done only at PERF_RECORD_MMAP injection time, i.e. close to the time when we actually are loading the executable mmap, i.e. close to the time when the loader is injecting the build-id, so I guess the extra memory and processing costs would be in the noise. This was discussed some time ago but would require help from the bits that load DSOs.
build-ids then would be first class citizens.

- Arnaldo
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On Fri, 2010-03-19 at 09:21 +0100, Ingo Molnar wrote:

Nice progress! This bit:

1) perf kvm top

[r...@lkp-ne01 norm]# perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms --guestmodules=/home/ymzhang/guest/modules top

will really be painful to developers - to enter that long line while we have these things called 'computers' that ought to reduce human work. Also, it's incomplete: we need access to the guest system's binaries to do ELF symbol resolution and dwarf decoding.

Yes, I agree with you and Avi that we need the enhancement to be user-friendly. One of my starting points is to keep the tool having less dependency on other components. Admin/developers could write script wrappers quickly if perf has parameters to support the new capability.

So we really need some good, automatic way to get to the guest symbol space, so that if a developer types:

perf kvm top

then the obvious thing happens by default. (Which is to show the guest overhead.) There's no technical barrier on the perf tooling side to implement all that: perf supports build-ids extensively and can deal with multiple symbol spaces - as long as it has access to them. The guest kernel could be ID-ed based on its /sys/kernel/notes and /sys/module/*/notes/.note.gnu.build-id build-ids.

I tried sshfs quickly. sshfs could mount the root filesystem of the guest OS nicely, and I could access the files quickly. However, it doesn't work when I access /proc/ and /sys/, because sshfs/scp depend on file size, while the sizes of most files under /proc/ and /sys/ are 0.

So some sort of --guestmount option would be the natural solution, which points to the guest system's root: and a Qemu enumeration of guest mounts (which would be off by default and configurable) from which perf can pick up the target guest all automatically.
(obviously only under allowed permissions so that such access is secure)

If sshfs could access /proc/ and /sys/ correctly, here is a design: --guestmount points to a directory which consists of a list of sub-directories. Every sub-directory's name is just the qemu process id of a guest OS. Admin/developer mounts every guest OS instance's root directory to the corresponding sub-directory. Then, perf could access all files. It's possible because a guest OS instance happens to be multi-threading in a process. One of the defects is that access to the guest OS becomes slow or impossible when the guest OS is very busy.

This would allow not just kallsyms access via $guest/proc/kallsyms but also gives us the full space of symbol features: access to the guest binaries for annotation and general symbol resolution, command/binary name identification, etc. Such a mount would obviously not broaden existing privileges - and as an additional control a guest would also have a way to indicate that it does not wish a guest mount at all.

Unfortunately, in a previous thread the Qemu maintainer has indicated that he will essentially NAK any attempt to enhance Qemu to provide an easily discoverable, self-contained, transparent guest mount on the host side. No technical justification was given for that NAK, despite my repeated requests to articulate the exact security problems that such an approach would cause. If that NAK does not stand in that form then i'd like to know about it - it makes no sense for us to try to code up a solution against a standing maintainer NAK ...

The other option is some sysadmin level hackery to NFS-mount the guest or so. This is a vastly inferior method that brings us back to the abysmal usability levels of OProfile:

1) it won't be guest transparent
2) it has to be re-done for every guest image
3) even if packaged it has to be gotten into every. single. Linux. distro. separately.
4) old Linux guests won't work out of the box

In other words: it's very inconvenient on multiple levels and won't ever happen on any reasonable enough scale to make a difference to Linux. Which is an unfortunate situation - and the ball is on the KVM/Qemu side so i can do little about it.

Thanks,

Ingo
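The --guestmount layout proposed in this mail (one sub-directory per qemu process id, each holding that guest's mounted root) can be sketched as follows. The mount step and the GUESTROOT path are illustrative assumptions, not part of the patch:

```shell
# Sketch of the proposed --guestmount layout: a root directory containing one
# sub-directory per qemu process id, each holding that guest's mounted /.
# GUESTROOT and the sshfs example are illustrative assumptions.
GUESTROOT=${GUESTROOT:-/tmp/guestmount}

setup_guest_dirs() {
    # create one sub-directory per qemu pid passed in
    for qemu_pid in "$@"; do
        mkdir -p "$GUESTROOT/$qemu_pid"
        # the admin would then mount the guest's root here, e.g.:
        #   sshfs root@guest:/ "$GUESTROOT/$qemu_pid"
    done
}

# perf would then resolve a guest's kernel symbols from
#   "$GUESTROOT/<qemu-pid>/proc/kallsyms"
```

Note that, per the sshfs limitation reported above, a plain sshfs mount would not serve /proc/kallsyms correctly, which is exactly why the thread keeps looking for a better transport.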
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On Sun, Mar 21, 2010 at 07:43:00PM +0100, Ingo Molnar wrote:

Having access to the actual executable files that include the symbols achieves precisely that - with the additional robustness that all this functionality is concentrated in the host, while the guest side is kept minimal (and transparent).

If you want to access the guest's file-system you need a piece of software running in the guest which gives you this access. But when you get an event, this piece of software may not be runnable (if the guest is in an interrupt handler or any other non-preemptible code path). When the host finally gets access to the guest's filesystem again, the source of that event may already be gone (process has exited, module unloaded...). The only way to solve that is to pass the event information to the guest immediately and let it collect the information we want. It can decide whether it exposes the files.

Nor are there any security issues to begin with.

I am not talking about security. Security was sufficiently flamed about already.

You need to be aware of the fact that symbol resolution is a separate step from call chain generation.

The same concern as above applies to call-chain generation too.

How we speak to the guest was already discussed in this thread. My personal opinion is that going through qemu is an unnecessary step and we can solve that more cleverly and transparently for perf.

Meaning exactly what?

Avi was against that, but I think it would make sense to give names to virtual machines (with a default, similar to network interface names). Then we can create a directory in /dev/ with that name (e.g. /dev/vm/fedora/). Inside the guest a (privileged) process can create some kind of named virt-pipe which results in a device file created in the guest's directory (perf could create /dev/vm/fedora/perf for example). This file is used for guest-host communication.
Thanks,

Joerg
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
* Joerg Roedel j...@8bytes.org wrote:

It can decide whether it exposes the files.

Nor are there any security issues to begin with.

I am not talking about security. [...]

You were talking about security, in the portion of your mail that you snipped out, and which i replied to:

2. The guest can decide on its own if it wants to pass this information to the host-perf. No security issues at all.

I understood that portion to mean what it says: that you claim that your proposal 'has no security issues at all', in contrast to my suggestion.

[...] Security was sufficiently flamed about already.

All i saw was my suggestion to allow a guest to securely (and scalably and conveniently) integrate/mount its filesystems to the host, if both sides (both the host and the guest) permit it, to make it easier for instrumentation to pick up symbol details. I.e. if a guest runs, then its filesystem may be present on the host side as:

/guests/Fedora-G1/
/guests/Fedora-G1/proc/
/guests/Fedora-G1/usr/
/guests/Fedora-G1/.../

(This feature would be configurable and would be default-off, to maintain the current status quo.)

i.e. it's a bit like sshfs or NFS or loopback block mounts, just in an integrated and working fashion (sshfs doesn't work well with /proc, for example), more guest transparent (obviously sshfs or NFS exports need per guest configuration), and lower overhead than sshfs/NFS - i.e. without the (unnecessary) networking overhead.

That suggestion was 'countered' by an unsubstantiated claim by Anthony that this kind of usability feature would somehow be a 'security nightmare'. In reality it is just an incremental, more usable, faster and more guest-transparent form of what is already possible today via:

- loopback mounts on the host
- NFS exports
- SMB exports
- sshfs
- (and other mechanisms)

I wish there was at least flaming about it - as flames tend to have at least some specifics in them.
What i saw instead was a claim about a 'security nightmare' which, when i asked for specifics, was followed by deafening silence. And you appear to have repeated that claim here, unwilling to back it up with specifics.

Thanks,

Ingo
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
* Joerg Roedel j...@8bytes.org wrote:

On Sun, Mar 21, 2010 at 07:43:00PM +0100, Ingo Molnar wrote:

Having access to the actual executable files that include the symbols achieves precisely that - with the additional robustness that all this functionality is concentrated in the host, while the guest side is kept minimal (and transparent).

If you want to access the guest's file-system you need a piece of software running in the guest which gives you this access. But when you get an event, this piece of software may not be runnable (if the guest is in an interrupt handler or any other non-preemptible code path). When the host finally gets access to the guest's filesystem again, the source of that event may already be gone (process has exited, module unloaded...). The only way to solve that is to pass the event information to the guest immediately and let it collect the information we want.

The very same is true of profiling in the host space as well (KVM is nothing special here, other than its unreasonable insistence on not enumerating readily available information in a more usable way). So are you suggesting a solution to a perf problem we already solved differently? (And which i argue we solved in a better way.)

We have solved that in the host space already (and quite elaborately so), and not via your suggestion of moving symbol resolution to a different stage, but by properly generating the right events to allow the post-processing stage to see processes that have already exited, to robustly handle files that have been rebuilt, etc.

From an instrumentation POV it is fundamentally better to acquire the right data and delay any complexities to the analysis stage (the perf model) than to complicate sampling (the oprofile dcookies model). Your proposal of 'doing the symbol resolution in the guest context' is in essence re-arguing that very similar point that oprofile lost. Did you really intend to re-argue that point as well?
If yes, then please propose an alternative implementation for everything that perf does wrt. symbol lookups. What we propose for 'perf kvm' right now is simply a straightforward extension of the existing (and well working) symbol handling code to virtualization.

You need to be aware of the fact that symbol resolution is a separate step from call chain generation.

The same concern as above applies to call-chain generation too.

Best would be if you demonstrated any problems of the perf symbol lookup code you are aware of on the host side, as it has that exact design you are criticising here. We are eager to fix any bugs in it. If you claim that it's buggy then that should very much be demonstratable - no need to go into theoretical arguments about it. (You should be aware of the fact that perf currently works with 'processes exiting prematurely' and similar scenarios just fine, so if you want to demonstrate that it's broken you will probably need a different example.)

How we speak to the guest was already discussed in this thread. My personal opinion is that going through qemu is an unnecessary step and we can solve that more cleverly and transparently for perf.

Meaning exactly what?

Avi was against that, but I think it would make sense to give names to virtual machines (with a default, similar to network interface names). Then we can create a directory in /dev/ with that name (e.g. /dev/vm/fedora/). Inside the guest a (privileged) process can create some kind of named virt-pipe which results in a device file created in the guest's directory (perf could create /dev/vm/fedora/perf for example). This file is used for guest-host communication.

That is kind of half of my suggestion - the built-in enumeration of guests and a guaranteed channel to them accessible to tools. (KVM already has its own special channel, so it's not like channels of communication are useless.)
The other half of my suggestion is that if we bring this thought to its logical conclusion then we might as well walk the whole mile and not use quirky, binary-API single-channel pipes. I.e. we could use this convenient, human-readable, structured, hierarchical abstraction to expose information in a fine-grained, scalable way, which has a world-class implementation in Linux: the 'VFS namespace'. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On Mon, Mar 22, 2010 at 11:59:27AM +0100, Ingo Molnar wrote: Best would be if you demonstrated any problems of the perf symbol lookup code you are aware of on the host side, as it has that exact design you are criticising here. We are eager to fix any bugs in it. If you claim that it's buggy then that should very much be demonstrable - no need to go into theoretical arguments about it. I am not claiming anything. I was just trying to imagine what your proposal would look like in practice and forgot that symbol resolution is done at a later point. But even with deferred symbol resolution we need more information from the guest than just the rip falling out of KVM. The guest needs to tell us about the process where the event happened (information that the host has about itself without any hassle) and which executable files it was loaded from. Avi was against that but I think it would make sense to give names to virtual machines (with a default, similar to network interface names). Then we can create a directory in /dev/ with that name (e.g. /dev/vm/fedora/). Inside the guest a (privileged) process can create some kind of named virt-pipe which results in a device file created in the guest's directory (perf could create /dev/vm/fedora/perf for example). This file is used for guest-host communication. That is kind of half of my suggestion - the built-in enumeration of guests and a guaranteed channel to them accessible to tools. (KVM already has its own special channel so it's not like channels of communication are useless.) The other half of my suggestion is that if we bring this thought to its logical conclusion then we might as well walk the whole mile and not use quirky, binary-API single-channel pipes. I.e. we could use this convenient, human-readable, structured, hierarchical abstraction to expose information in a fine-grained, scalable way, which has a world-class implementation in Linux: the 'VFS namespace'. Probably.
At least it is the solution that fits best into the current design of perf. But we should think about how this will be done. Raw disk access is no solution because we need to access virtual file-systems of the guest too. Network filesystems may be a solution but then we come back to the 'deployment-nightmare'. Joerg
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
* Joerg Roedel j...@8bytes.org wrote: On Mon, Mar 22, 2010 at 11:59:27AM +0100, Ingo Molnar wrote: Best would be if you demonstrated any problems of the perf symbol lookup code you are aware of on the host side, as it has that exact design you are criticising here. We are eager to fix any bugs in it. If you claim that it's buggy then that should very much be demonstrable - no need to go into theoretical arguments about it. I am not claiming anything. I was just trying to imagine what your proposal would look like in practice and forgot that symbol resolution is done at a later point. But even with deferred symbol resolution we need more information from the guest than just the rip falling out of KVM. The guest needs to tell us about the process where the event happened (information that the host has about itself without any hassle) and which executable files it was loaded from. Correct - for full information we need a good paravirt perf integration of the kernel bits to pass that through. (I.e. we want to 'integrate' the PID space as well, at least within the perf notion of PIDs.) Initially we can do without that as well. Probably. At least it is the solution that fits best into the current design of perf. But we should think about how this will be done. Raw disk access is no solution because we need to access virtual file-systems of the guest too. [...] I never said anything about 'raw disk access'. Have you seen my proposal of (optional) VFS namespace integration? (It can be found repeated for the Nth time in my mail you replied to) Thanks, Ingo
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
Em Mon, Mar 22, 2010 at 03:24:47PM +0800, Zhang, Yanmin escreveu: On Fri, 2010-03-19 at 09:21 +0100, Ingo Molnar wrote: So some sort of --guestmount option would be the natural solution, which points to the guest system's root: and a Qemu enumeration of guest mounts (which would be off by default and configurable) from which perf can pick up the target guest all automatically. (obviously only under allowed permissions so that such access is secure) If sshfs could access /proc/ and /sys correctly, here is a design: --guestmount points to a directory which consists of a list of sub-directories. Every sub-directory's name is just the qemu process id of guest os. Admin/developer mounts every guest os instance's root directory to corresponding sub-directory. Then, perf could access all files. It's possible because guest os instance happens to be multi-threading in a process. One of the defects is the accessing to guest os becomes slow or impossible when guest os is very busy. If the MMAP events on the guest included a cookie that could later be used to query for the symtab of that DSO, we wouldn't need to access the guest FS at all, right? With build-ids and debuginfo-install like tools the symbol resolution could be performed by using the cookies (build-ids) as keys to get to the *-debuginfo packages with matching symtabs (and DWARF for source annotation, etc). We have that for the kernel as: [a...@doppio linux-2.6-tip]$ l /sys/kernel/notes -r--r--r-- 1 root root 36 2010-03-22 13:14 /sys/kernel/notes [a...@doppio linux-2.6-tip]$ l /sys/module/ipv6/sections/.note.gnu.build-id -r--r--r-- 1 root root 4096 2010-03-22 13:38 /sys/module/ipv6/sections/.note.gnu.build-id [a...@doppio linux-2.6-tip]$ That way we would cover DSOs being reinstalled in long running 'perf record' sessions too. This was discussed some time ago but would require help from the bits that load DSOs. build-ids then would be first class citizens. 
- Arnaldo
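Arnaldo's point about build-ids as cookies can be made concrete: /sys/kernel/notes and the per-module .note.gnu.build-id files contain raw ELF note records, and extracting the build-id is a small amount of parsing. Below is a minimal sketch of that walk; the helper name and interface are hypothetical, not perf's actual code, and 32-bit note fields are read in host byte order.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* An ELF note record is: namesz, descsz, type (three 32-bit words),
 * then the name and descriptor, each padded to a 4-byte boundary.
 * For the GNU build-id note the name is "GNU" and the descriptor
 * holds the raw build-id bytes. */
#define NT_GNU_BUILD_ID 3

static int find_build_id(const unsigned char *buf, size_t len,
                         char *out, size_t outlen)
{
    size_t off = 0;

    while (off + 12 <= len) {
        uint32_t namesz, descsz, type;
        size_t name_off, desc_off, next;

        memcpy(&namesz, buf + off, 4);
        memcpy(&descsz, buf + off + 4, 4);
        memcpy(&type,   buf + off + 8, 4);
        name_off = off + 12;
        desc_off = name_off + ((namesz + 3) & ~(size_t)3);
        next     = desc_off + ((descsz + 3) & ~(size_t)3);
        if (next > len)
            break;
        if (type == NT_GNU_BUILD_ID && namesz == 4 &&
            memcmp(buf + name_off, "GNU", 4) == 0) {
            size_t i;
            /* render the descriptor bytes as a lowercase hex string */
            for (i = 0; i < descsz && 2 * i + 2 < outlen; i++)
                sprintf(out + 2 * i, "%02x", buf[desc_off + i]);
            return 0;
        }
        off = next;    /* not a build-id note: skip to the next record */
    }
    return -1;
}
```

Fed the bytes read from /sys/kernel/notes, such a helper yields exactly the kind of cookie string that can key a debuginfo or buildid-cache lookup.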
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On Mon, 2010-03-22 at 13:44 -0300, Arnaldo Carvalho de Melo wrote: Em Mon, Mar 22, 2010 at 03:24:47PM +0800, Zhang, Yanmin escreveu: On Fri, 2010-03-19 at 09:21 +0100, Ingo Molnar wrote: So some sort of --guestmount option would be the natural solution, which points to the guest system's root: and a Qemu enumeration of guest mounts (which would be off by default and configurable) from which perf can pick up the target guest all automatically. (obviously only under allowed permissions so that such access is secure) If sshfs could access /proc/ and /sys correctly, here is a design: --guestmount points to a directory which consists of a list of sub-directories. Every sub-directory's name is just the qemu process id of a guest os. Admin/developer mounts every guest os instance's root directory to the corresponding sub-directory. Then, perf could access all files. It's possible because a guest os instance happens to be multi-threading in a process. One of the defects is that access to the guest os becomes slow or impossible when the guest os is very busy. If the MMAP events on the guest included a cookie that could later be used to query for the symtab of that DSO, we wouldn't need to access the guest FS at all, right? It depends on the specific subcommands. As for 'perf kvm top', developers want to see the profiling immediately. Even with 'perf kvm record', developers also want to see results quickly. At least I'm eager for the results when investigating a performance issue. With build-ids and debuginfo-install like tools the symbol resolution could be performed by using the cookies (build-ids) as keys to get to the *-debuginfo packages with matching symtabs (and DWARF for source annotation, etc). We can't make sure the guest os uses the same os images, and we may not know where we could find the original DVD images being used to install the guest os. Current perf does save build-ids, including both the kernel's and other applications' libs/executables.
We have that for the kernel as: [a...@doppio linux-2.6-tip]$ l /sys/kernel/notes -r--r--r-- 1 root root 36 2010-03-22 13:14 /sys/kernel/notes [a...@doppio linux-2.6-tip]$ l /sys/module/ipv6/sections/.note.gnu.build-id -r--r--r-- 1 root root 4096 2010-03-22 13:38 /sys/module/ipv6/sections/.note.gnu.build-id [a...@doppio linux-2.6-tip]$ That way we would cover DSOs being reinstalled in long running 'perf record' sessions too. That's one of the objectives of perf, to support long-running sessions. This was discussed some time ago but would require help from the bits that load DSOs. build-ids then would be first class citizens. - Arnaldo
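The buildid cache mentioned earlier in the thread (~/.debug) keys objects by their build-id, with the first two hex characters used as a directory name and the remainder as the entry name. A sketch of mapping a build-id cookie to that cache path follows; the helper name is hypothetical and the layout is as described here, not lifted from perf's sources.

```c
#include <stdio.h>
#include <string.h>

/* Build the buildid-cache path for a given build-id hex string,
 * following the <home>/.debug/.build-id/<2 hex chars>/<rest> scheme.
 * Hypothetical helper, illustrating the lookup described in the thread. */
static int build_id_cache_path(const char *home, const char *build_id,
                               char *out, size_t outlen)
{
    if (strlen(build_id) < 3)
        return -1;    /* need at least the 2-char directory prefix */
    snprintf(out, outlen, "%s/.debug/.build-id/%.2s/%s",
             home, build_id, build_id + 2);
    return 0;
}
```

With such a scheme, populating the cache once (via sshfs, debuginfo packages, or any other means) makes later symbol resolution independent of live access to the guest filesystem.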
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
* Joerg Roedel j...@8bytes.org wrote: On Fri, Mar 19, 2010 at 09:21:22AM +0100, Ingo Molnar wrote: Unfortunately, in a previous thread the Qemu maintainer has indicated that he will essentially NAK any attempt to enhance Qemu to provide an easily discoverable, self-contained, transparent guest mount on the host side. No technical justification was given for that NAK, despite my repeated requests to articulate the exact security problems that such an approach would cause. If that NAK does not stand in that form then i'd like to know about it - it makes no sense for us to try to code up a solution against a standing maintainer NAK ... I still think it is the best and most generic way to let the guest do the symbol resolution. [...] Not really. [...] This has several advantages: 1. The guest knows best about its symbol space. So this would be extensible to other guest operating systems. A brave developer may even implement symbol passing for Windows or the BSDs ;-) Having access to the actual executable files that include the symbols achieves precisely that - with the additional robustness that all this functionality is concentrated into the host, while the guest side is kept minimal (and transparent). 2. The guest can decide on its own if it wants to pass this information to the host-perf. No security issues at all. It can decide whether it exposes the files. Nor are there any security issues to begin with. 3. The guest can also pass us the call-chain and we don't need to care about the complications of fetching it from the guest ourselves. You need to be aware of the fact that symbol resolution is a separate step from call chain generation. I.e. call-chains are an (entirely) separate issue, and could reasonably be done in the guest or in the host. It has no bearing on this symbol resolution question. 4. This way is extensible to nested virtualization too.
Nested virtualization is actually already taken care of by the filesystem solution via an existing method called 'subdirectories'. If the guest offers sub-guests then those symbols will be exposed in a similar way via its own 'guest files' directory hierarchy. I.e. if we have 'Guest-2' nested inside the 'Guest-Fedora-1' instance, we get: /guests/ /guests/Guest-Fedora-1/etc/ /guests/Guest-Fedora-1/usr/ we'd also have: /guests/Guest-Fedora-1/guests/Guest-2/ So this is taken care of automatically. I.e. none of the four 'advantages' listed here are actually advantages over my proposed solution, so your conclusion is subsequently flawed as well. How we speak to the guest was already discussed in this thread. My personal opinion is that going through qemu is an unnecessary step and we can solve it in a more clever and transparent way for perf. Meaning exactly what? Thanks, Ingo
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
Nice progress! This bit: 1) perf kvm top [r...@lkp-ne01 norm]# perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms --guestmodules=/home/ymzhang/guest/modules top will really be painful for developers - having to enter that long line while we have these things called 'computers' that ought to reduce human work. Also, it's incomplete: we need access to the guest system's binaries to do ELF symbol resolution and DWARF decoding. So we really need some good, automatic way to get to the guest symbol space, so that if a developer types: perf kvm top then the obvious thing happens by default. (which is to show the guest overhead) There's no technical barrier on the perf tooling side to implement all that: perf supports build-ids extensively and can deal with multiple symbol spaces - as long as it has access to them. The guest kernel could be ID-ed based on its /sys/kernel/notes and /sys/module/*/notes/.note.gnu.build-id build-ids. So some sort of --guestmount option would be the natural solution, which points to the guest system's root: and a Qemu enumeration of guest mounts (which would be off by default and configurable) from which perf can pick up the target guest all automatically. (obviously only under allowed permissions so that such access is secure) This would allow not just kallsyms access via $guest/proc/kallsyms but also gives us the full space of symbol features: access to the guest binaries for annotation and general symbol resolution, command/binary name identification, etc. Such a mount would obviously not broaden existing privileges - and as an additional control a guest would also have a way to indicate that it does not wish a guest mount at all. Unfortunately, in a previous thread the Qemu maintainer has indicated that he will essentially NAK any attempt to enhance Qemu to provide an easily discoverable, self-contained, transparent guest mount on the host side.
No technical justification was given for that NAK, despite my repeated requests to articulate the exact security problems that such an approach would cause. If that NAK does not stand in that form then i'd like to know about it - it makes no sense for us to try to code up a solution against a standing maintainer NAK ... The other option is some sysadmin-level hackery to NFS-mount the guest or so. This is a vastly inferior method that brings us back to the abysmal usability levels of OProfile: 1) it won't be guest transparent 2) it has to be re-done for every guest image 3) even if packaged it has to be gotten into every. single. Linux. distro. separately. 4) old Linux guests won't work out of the box In other words: it's very inconvenient on multiple levels and won't ever happen on any reasonable enough scale to make a difference to Linux. Which is an unfortunate situation - and the ball is on the KVM/Qemu side so i can do little about it. Thanks, Ingo
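To make the --guestmount idea concrete: once a guest root is visible at a known host path, resolving guest kernel symbols is the same text parsing perf already does on the host's /proc/kallsyms. A minimal sketch, with hypothetical helper names; the path layout follows the one-sub-directory-per-qemu-pid scheme proposed elsewhere in this thread.

```c
#include <stdio.h>

struct ksym {
    unsigned long long addr;
    char type;       /* 'T', 't', 'D', ... as in nm(1) */
    char name[64];
};

/* Build the path to a guest's kallsyms under --guestmount, assuming
 * the proposed layout of one sub-directory per qemu process id. */
static void guest_kallsyms_path(const char *guestmount, int qemu_pid,
                                char *out, size_t outlen)
{
    snprintf(out, outlen, "%s/%d/proc/kallsyms", guestmount, qemu_pid);
}

/* Parse one kallsyms line: "<hex addr> <type> <name> [module]". */
static int parse_kallsyms_line(const char *line, struct ksym *s)
{
    return sscanf(line, "%llx %c %63s", &s->addr, &s->type, s->name) == 3
        ? 0 : -1;
}
```

Sorting the parsed symbols by address then turns a sampled guest rip into a name with an ordinary binary search, exactly as on the host side.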
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On Fri, Mar 19, 2010 at 09:21:22AM +0100, Ingo Molnar wrote: Unfortunately, in a previous thread the Qemu maintainer has indicated that he will essentially NAK any attempt to enhance Qemu to provide an easily discoverable, self-contained, transparent guest mount on the host side. No technical justification was given for that NAK, despite my repeated requests to articulate the exact security problems that such an approach would cause. If that NAK does not stand in that form then i'd like to know about it - it makes no sense for us to try to code up a solution against a standing maintainer NAK ... I still think it is the best and most generic way to let the guest do the symbol resolution. This has several advantages: 1. The guest knows best about its symbol space. So this would be extensible to other guest operating systems. A brave developer may even implement symbol passing for Windows or the BSDs ;-) 2. The guest can decide on its own if it wants to pass this information to the host-perf. No security issues at all. 3. The guest can also pass us the call-chain and we don't need to care about the complications of fetching it from the guest ourselves. 4. This way is extensible to nested virtualization too. How we speak to the guest was already discussed in this thread. My personal opinion is that going through qemu is an unnecessary step and we can solve it in a more clever and transparent way for perf. Joerg
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On Thu, 2010-03-18 at 10:45 +0800, Zhang, Yanmin wrote: On Wed, 2010-03-17 at 17:26 +0800, Zhang, Yanmin wrote: On Tue, 2010-03-16 at 10:47 +0100, Ingo Molnar wrote: * Zhang, Yanmin yanmin_zh...@linux.intel.com wrote: On Tue, 2010-03-16 at 15:48 +0800, Zhang, Yanmin wrote: On Tue, 2010-03-16 at 07:41 +0200, Avi Kivity wrote: On 03/16/2010 07:27 AM, Zhang, Yanmin wrote: From: Zhang, Yanmin yanmin_zh...@linux.intel.com Based on the discussion in the KVM community, I worked out the patch to support perf to collect guest os statistics from the host side. This patch is implemented with Ingo, Peter and some other guys' kind help. Yang Sheng pointed out a critical bug and provided good suggestions with other guys. I really appreciate their kind help. The patch adds the new subcommand kvm to perf: perf kvm top perf kvm record perf kvm report perf kvm diff The new perf could profile the guest os kernel except guest os user space, but it could summarize guest os user space utilization per guest os. Below are some examples. 1) perf kvm top [r...@lkp-ne01 norm]# perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms --guestmodules=/home/ymzhang/guest/modules top Thanks for your kind comments. Excellent, support for guest kernel != host kernel is critical (I can't remember the last time I ran the same kernels). How would we support multiple guests with different kernels? With the patch, 'perf kvm report --sort pid' could show summary statistics for all guest os instances. Then, use the --pid parameter of 'perf kvm record' to collect data for a single problematic instance. Sorry. I found that currently --pid isn't a process but a thread (main thread). Ingo, Is it possible to support a new parameter or extend --inherit, so 'perf record' and 'perf top' could collect data on all threads of a process while the process is running? If not, I need to add a new ugly parameter which is similar to --pid to filter out process data in userspace. Yeah.
For maximum utility i'd suggest to extend --pid to include this, and introduce --tid for the previous, limited-to-a-single-task functionality. Most users would expect --pid to work like a 'late attach' - i.e. to work like strace -f or like a gdb attach. Thanks Ingo, Avi. I worked out the below patch against tip/master of March 15th. Subject: [PATCH] Change perf's parameter --pid to process-wide collection From: Zhang, Yanmin yanmin_zh...@linux.intel.com Change parameter -p (--pid) to real process pid and add -t (--tid) meaning thread id. Now, --pid means perf collects the statistics of all threads of the process, while --tid means perf just collects the statistics of that thread. BTW, the patch fixes a bug of 'perf stat -p'. 'perf stat' always configures attr->disabled = 1 if it isn't a system-wide collection. If there is a '-p' and no forks, 'perf stat -p' doesn't collect any data. In addition, the while (!done) loop in run_perf_stat consumes 100% of a single cpu's time, which has a bad impact on the running workload. I added a sleep(1) in the loop. Signed-off-by: Zhang Yanmin yanmin_zh...@linux.intel.com Ingo, Sorry, the patch has bugs. I need to do a better job and will work out 2 separate patches against the 2 issues. I worked out 3 new patches against the tip/master tree of Mar. 17th. 1) Patch perf_stat: Fix the issue that perf doesn't enable counters when target_pid != -1. Change the condition to fork/exec the subcommand. If there is a subcommand parameter, perf always forks/execs it. The usage example is: #perf stat -a sleep 10 So this command could collect statistics for 10 seconds precisely. User still could stop it by CTRL+C. 2) Patch perf_record: Fix the issue that when perf forks/execs a subcommand, it should enable all counters after the new process starts execing. Change the condition to fork/exec the subcommand. If there is a subcommand parameter, perf always forks/execs it. The usage example is: #perf record -f -a sleep 10 So this command could collect statistics for 10 seconds precisely.
User still could stop it by CTRL+C. 3) perf_pid: Change parameter --pid to process-wide collection. Add --tid which means collecting thread-wide statistics. Usage example is: #perf top -p #perf record -p -f sleep 10 #perf stat -p -f sleep 10 Arnaldo, Pls. apply the 3 attached patches. Yanmin diff -Nraup linux-2.6_tipmaster0317/tools/perf/builtin-stat.c linux-2.6_tipmaster0317_fixstat/tools/perf/builtin-stat.c --- linux-2.6_tipmaster0317/tools/perf/builtin-stat.c 2010-03-18 09:04:40.938289813 +0800 +++ linux-2.6_tipmaster0317_fixstat/tools/perf/builtin-stat.c 2010-03-18
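The process-wide --pid semantics discussed in this exchange boil down to enumerating every thread of the target and attaching a counter per tid. On Linux the thread list is visible under /proc/<pid>/task; a sketch of that enumeration step follows (the helper name is mine and this is not the actual perf code, just the information it would need to gather):

```c
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>

/* Collect the thread ids of a process by listing /proc/<pid>/task.
 * Returns the number of tids stored in tids[], or -1 on error.
 * Hypothetical sketch of the enumeration a process-wide --pid needs. */
static int list_threads(int pid, int *tids, int max)
{
    char path[64];
    struct dirent *ent;
    DIR *dir;
    int n = 0;

    snprintf(path, sizeof(path), "/proc/%d/task", pid);
    dir = opendir(path);
    if (!dir)
        return -1;
    while ((ent = readdir(dir)) != NULL && n < max) {
        int tid = atoi(ent->d_name);
        if (tid > 0)            /* skip "." and ".." entries */
            tids[n++] = tid;
    }
    closedir(dir);
    return n;
}
```

A tool would then open one counter per returned tid, which is what distinguishes the process-wide --pid behavior from the single-task --tid behavior.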
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
* Zhang, Yanmin yanmin_zh...@linux.intel.com wrote: I worked out 3 new patches against tip/master tree of Mar. 17th. Cool! Mind sending them as a series of patches instead of attachments? That makes it easier to review them. Also, the Signed-off-by lines seem to be missing, plus we need a per-patch changelog as well. Thanks, Ingo
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On 03/17/2010 07:41 PM, Sheng Yang wrote: On Thursday 18 March 2010 13:22:28 Sheng Yang wrote: On Thursday 18 March 2010 12:50:58 Zachary Amsden wrote: On 03/17/2010 03:19 PM, Sheng Yang wrote: On Thursday 18 March 2010 05:14:52 Zachary Amsden wrote: On 03/16/2010 11:28 PM, Sheng Yang wrote: On Wednesday 17 March 2010 10:34:33 Zhang, Yanmin wrote: On Tue, 2010-03-16 at 11:32 +0200, Avi Kivity wrote: On 03/16/2010 09:48 AM, Zhang, Yanmin wrote: Right, but there is a scope between kvm_guest_enter and really running in the guest os, where a perf event might overflow. Anyway, the scope is very narrow, I will change it to use the flag PF_VCPU. There is also a window between setting the flag and calling 'int $2' where an NMI might happen and be accounted incorrectly. Perhaps separate the 'int $2' into a direct call into perf and another call for the rest of NMI handling. I don't see how it would work on svm though - AFAICT the NMI is held whereas vmx swallows it. I guess NMIs will be disabled until the next IRET so it isn't racy, just tricky. I'm not sure whether vmexit breaks the NMI context or not. Hardware NMI context isn't reentrant till an IRET. YangSheng would like to double check it. After more checking, I think VMX won't retain the NMI blocked state for the host. That means, if an NMI happens while the processor is in VMX non-root mode, it would only result in a VMExit, with a reason indicating that it's due to an NMI, but no further state change in the host. So in that sense, there _is_ a window between VMExit and KVM handling the NMI. Moreover, I think we _can't_ stop the re-entrance of NMI handling code because int $2 doesn't block a following NMI. And if the NMI sequence is not important (I think so), then we need to generate a real NMI in the current vmexit-after code. Having the APIC send an NMI IPI to itself seems a good idea. I am debugging a patch based on apic->send_IPI_self(NMI_VECTOR) to replace int $2. Something unexpected is happening...
You can't use the APIC to send vectors 0x00-0x1f, or at least, aren't supposed to be able to. Um? Why? Especially as the kernel is already using it to deliver NMIs. That's the only defined case, and it is defined because the vector field is ignored for DM_NMI. Vol 3A (exact section numbers may vary depending on your version): 8.5.1 / 8.6.1 '100 (NMI) Delivers an NMI interrupt to the target processor or processors. The vector information is ignored' 8.5.2 Valid Interrupt Vectors 'Local and I/O APICs support 240 of these vectors (in the range of 16 to 255) as valid interrupts.' 8.8.4 Interrupt Acceptance for Fixed Interrupts '...; vectors 0 through 15 are reserved by the APIC (see also: Section 8.5.2, Valid Interrupt Vectors)' So I misremembered, apparently you can deliver interrupts 0x10-0x1f, but vectors 0x00-0x0f are not valid to send via the APIC or I/O APIC. As you pointed out, NMI is not a fixed interrupt. If we want to send an NMI, it would need a specific delivery mode rather than a vector number. And if you look at the code, if we specify NMI_VECTOR, the delivery mode would be set to NMI. So what's wrong here? OK, I think I understand your points now. You meant that these vectors can't be filled in the vector field directly, right? But NMI is an exception due to DM_NMI. Is that your point? I think we agree on this. Yes, I think we agree. NMI is the only vector in 0x0-0xf which can be sent via self-IPI because the vector itself does not matter for NMI. Zach
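The SDM fields quoted above can be checked mechanically: in the APIC's ICR, the delivery mode occupies bits 8-10 (100b = NMI, which is exactly why the vector field is ignored) and the destination shorthand occupies bits 18-19 (01b = self). A sketch of the encoding a self-NMI IPI would use; the macro values mirror the SDM bit layout (and happen to match Linux's apic.h constants), but this is illustrative only - real kernels go through apic->send_IPI_self().

```c
#include <stdint.h>

/* ICR bit fields per the SDM sections quoted above. */
#define APIC_DM_NMI    (4u << 8)   /* delivery mode bits 8-10: 100b = NMI  */
#define APIC_DEST_SELF (1u << 18)  /* dest shorthand bits 18-19: 01b = self */

/* Encode a self-NMI IPI.  The vector field (bits 0-7) is ignored for
 * DM_NMI, so it is left zero. */
static uint32_t icr_self_nmi(void)
{
    return APIC_DEST_SELF | APIC_DM_NMI;
}
```

Writing that value to the ICR (low dword) is what generates a genuine hardware NMI on the current CPU, as opposed to int $2, which merely invokes the handler without establishing the NMI-blocked state.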
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
Em Thu, Mar 18, 2010 at 09:03:25AM +0100, Ingo Molnar escreveu: * Zhang, Yanmin yanmin_zh...@linux.intel.com wrote: I worked out 3 new patches against tip/master tree of Mar. 17th. Cool! Mind sending them as a series of patches instead of attachment? That makes it easier to review them. Also, the Signed-off-by lines seem to be missing plus we need a per patch changelog as well. Yeah, please, and I hadn't merged them, so the resend was the best thing to do. - Arnaldo
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
* Joerg Roedel j...@8bytes.org wrote: On Tue, Mar 16, 2010 at 12:25:00PM +0100, Ingo Molnar wrote: Hm, that sounds rather messy if we want to use it to basically expose kernel functionality in a guest/host unified way. Is the qemu process discoverable in some secure way? Can we trust it? Is there some proper tooling available to do it, or do we have to push it through 2-3 packages to get such a useful feature done? Since we want to implement a pmu usable for the guest anyway, why don't we just use the guest's perf to get all the information we want? [...] Look at the previous posting of this patch, this is something new and rather unique. The main power in the 'perf kvm' kind of instrumentation is to profile _both_ the host and the guest on the host, using the same tool (often using the same kernel) and using similar workloads, and do profile comparisons using 'perf diff'. Note that KVM's in-kernel design makes it easy to offer this kind of host/guest shared implementation that Yanmin has created. Other virtualization solutions with a poorer design (for example where the hypervisor code base is split away from the guest implementation) will have a much harder time creating something similar. That kind of integrated approach can result in very interesting finds straight away, see: http://lkml.indiana.edu/hypermail/linux/kernel/1003.0/00613.html ( the profile there demoes the need for spinlock accelerators for example - there's clearly asymmetrically large overhead in guest spinlock code. Guess how much else we'll be able to find with a full 'perf kvm' implementation. ) One of the main goals of a virtualization implementation is to eliminate as many performance differences to the host kernel as possible. From the first day KVM was released the overriding question from users was always: 'how much slower is it than native, and which workloads are hit worst, and why, and could you pretty please speed up important workload XYZ'.
'perf kvm' helps exactly that kind of development workflow. Note that with oprofile you can already do separate guest space and host space profiling (with the timer-driven fallback in the guest). One idea with 'perf kvm' is to change that paradigm of forced separation and forced duplication and to support the workflow that most developers employ: use the host space for development and unify instrumentation in an intuitive framework. Yanmin's 'perf kvm' patch is a very good step towards that goal. Anyway ... look at the patches, try them and see it for yourself. Back in the days when i did KVM performance work i wish i had something like Yanmin's 'perf kvm' feature. I'd probably still be hacking KVM today ;-) So, the code is there, it's useful and it's up to you guys whether you live with this opportunity - the perf developers are certainly eager to help out with the details. There are already tons of per-kernel-subsystem perf helper tools: perf sched, perf kmem, perf lock, perf bench, perf timechart. 'perf kvm' is really a natural and good next step IMO that underlines the main design goodness KVM brought to the world of virtualization: proper guest/host code base integration. Thanks, Ingo
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
* Frank Ch. Eigler f...@redhat.com wrote: Hi - On Tue, Mar 16, 2010 at 06:04:10PM -0500, Anthony Liguori wrote: [...] The only way to really address this is to change the interaction. Instead of running perf externally to qemu, we should support a perf command in the qemu monitor that can then tie directly to the perf tooling. That gives us the best possible user experience. To what extent could this be solved with less crossing of isolation/abstraction layers, if the perfctr facilities were properly virtualized? [...] Note, 'perfctr' is a different out-of-tree Linux kernel project run by someone else: it offers the /dev/perfctr special-purpose device that allows raw, unabstracted, low-level access to the PMU. I suspect the one you wanted to mention here is called 'perf' or 'perf events'. (and used to be called 'performance counters' or 'perfcounters' until it got renamed about a year ago) Thanks, Ingo
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
* Avi Kivity a...@redhat.com wrote: Monitoring guests from the host is useful for kvm developers, but less so for users. Guest space profiling is easy, and 'perf kvm' is not about that. (plain 'perf' will work if a proper paravirt channel is opened to the host) I think you might have misunderstood the purpose and role of the 'perf kvm' patch here? 'perf kvm' is aimed at KVM developers: it is them who improve KVM code, not guest kernel users. Ingo
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On 03/17/2010 10:16 AM, Ingo Molnar wrote: * Avi Kivity a...@redhat.com wrote: Monitoring guests from the host is useful for kvm developers, but less so for users. Guest space profiling is easy, and 'perf kvm' is not about that. (plain 'perf' will work if a proper paravirt channel is opened to the host) I think you might have misunderstood the purpose and role of the 'perf kvm' patch here? 'perf kvm' is aimed at KVM developers: it is them who improve KVM code, not guest kernel users. Of course I understood it. My point was that 'perf kvm' serves a tiny minority of users. That doesn't mean it isn't useful, just that it doesn't satisfy all needs by itself. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
* Anthony Liguori aligu...@linux.vnet.ibm.com wrote: If you want to use a synthetic filesystem as the management interface for qemu, that's one thing. But you suggested exposing the guest filesystem in its entirety and that's what I disagreed with. What did you think, that it would be world-readable? Why would we do such a stupid thing? Any mounted content should at minimum match whatever policy covers the image file. The mounting of contents is not a privilege escalation and it is already possible today - just not integrated properly and not practical. (and apparently not implemented for all the wrong 'security' reasons) The guest may encrypt its disk image. It still ought to be possible to run perf against that guest, no? _In_ the guest you can of course run it just fine. (once paravirt bits are in place) That has no connection to 'perf kvm' though, which this patch submission is about ... If you want unified profiling of both host and guest then you need access to both the guest and the host. This is what the 'perf kvm' patch is about. Please read the patch, i think you might be misunderstanding what it does ... Regarding encrypted contents - that's really a distraction but the host has absolute, 100% control over the guest and there's nothing the guest can do about that - unless you are thinking about the sub-sub-case of Orwellian DRM-locked-down systems - in which case there's nothing for the host to mount and the guest can reject any requests for information on itself and impose additional policy that way. So it's a security non-issue. Note that DRM is pretty much the worst place to look at when it comes to usability: DRM lock-down is the antithesis of usability. Do you really want KVM to match the mind-set of the RIAA and MPAA? Why do you pretend that a developer cannot mount his own disk image? Pretty please, help Linux instead, where development is driven by usability and accessibility ...
Thanks, Ingo
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
* Avi Kivity a...@redhat.com wrote: On 03/17/2010 10:16 AM, Ingo Molnar wrote: * Avi Kivity a...@redhat.com wrote: Monitoring guests from the host is useful for kvm developers, but less so for users. Guest space profiling is easy, and 'perf kvm' is not about that. (plain 'perf' will work if a proper paravirt channel is opened to the host) I think you might have misunderstood the purpose and role of the 'perf kvm' patch here? 'perf kvm' is aimed at KVM developers: it is them who improve KVM code, not guest kernel users. Of course I understood it. My point was that 'perf kvm' serves a tiny minority of users. [...] I hope you won't be disappointed to learn that 100% of Linux, all 13+ million lines of it, was and is being developed by a tiny, tiny, tiny minority of users ;-) [...] That doesn't mean it isn't useful, just that it doesn't satisfy all needs by itself. Of course - and it doesn't bring world peace either. One step at a time. Thanks, Ingo
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On Tue, 2010-03-16 at 10:47 +0100, Ingo Molnar wrote: * Zhang, Yanmin yanmin_zh...@linux.intel.com wrote: On Tue, 2010-03-16 at 15:48 +0800, Zhang, Yanmin wrote: On Tue, 2010-03-16 at 07:41 +0200, Avi Kivity wrote: On 03/16/2010 07:27 AM, Zhang, Yanmin wrote: From: Zhang, Yanmin yanmin_zh...@linux.intel.com Based on the discussion in KVM community, I worked out the patch to support perf to collect guest os statistics from host side. This patch is implemented with Ingo, Peter and some other guys' kind help. Yang Sheng pointed out a critical bug and provided good suggestions with other guys. I really appreciate their kind help. The patch adds new subcommand kvm to perf. perf kvm top perf kvm record perf kvm report perf kvm diff The new perf could profile guest os kernel except guest os user space, but it could summarize guest os user space utilization per guest os. Below are some examples. 1) perf kvm top [r...@lkp-ne01 norm]# perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms --guestmodules=/home/ymzhang/guest/modules top Thanks for your kind comments. Excellent, support for guest kernel != host kernel is critical (I can't remember the last time I ran same kernels). How would we support multiple guests with different kernels? With the patch, 'perf kvm report --sort pid' could show summary statistics for all guest os instances. Then, use parameter --pid of 'perf kvm record' to collect single problematic instance data. Sorry. I found currently --pid isn't process but a thread (main thread). Ingo, Is it possible to support a new parameter or extend --inherit, so 'perf record' and 'perf top' could collect data on all threads of a process when the process is running? If not, I need to add a new ugly parameter which is similar to --pid to filter out process data in userspace. Yeah. For maximum utility i'd suggest to extend --pid to include this, and introduce --tid for the previous, limited-to-a-single-task functionality.
Most users would expect --pid to work like a 'late attach' - i.e. to work like strace -f or like a gdb attach. Thanks Ingo, Avi. I worked out below patch against tip/master of March 15th. Subject: [PATCH] Change perf's parameter --pid to process-wide collection From: Zhang, Yanmin yanmin_zh...@linux.intel.com Change parameter -p (--pid) to real process pid and add -t (--tid) meaning thread id. Now, --pid means perf collects the statistics of all threads of the process, while --tid means perf just collects the statistics of that thread. BTW, the patch fixes a bug of 'perf stat -p'. 'perf stat' always configures attr->disabled=1 if it isn't a system-wide collection. If there is a '-p' and no forks, 'perf stat -p' doesn't collect any data. In addition, the while(!done) loop in run_perf_stat consumes 100% of a single cpu, which has a bad impact on the running workload. I added a sleep(1) in the loop. Signed-off-by: Zhang Yanmin yanmin_zh...@linux.intel.com
---
diff -Nraup linux-2.6_tipmaster0315/tools/perf/builtin-record.c linux-2.6_tipmaster0315_perfpid/tools/perf/builtin-record.c
--- linux-2.6_tipmaster0315/tools/perf/builtin-record.c 2010-03-16 08:59:54.896488489 +0800
+++ linux-2.6_tipmaster0315_perfpid/tools/perf/builtin-record.c 2010-03-17 16:30:17.71706 +0800
@@ -27,7 +27,7 @@
 #include <unistd.h>
 #include <sched.h>

-static int fd[MAX_NR_CPUS][MAX_COUNTERS];
+static int *fd[MAX_NR_CPUS][MAX_COUNTERS];

 static long default_interval = 0;
@@ -43,6 +43,9 @@
 static int raw_samples = 0;
 static int system_wide = 0;
 static int profile_cpu = -1;
 static pid_t target_pid = -1;
+static pid_t target_tid = -1;
+static int *all_tids = NULL;
+static int thread_num = 0;
 static pid_t child_pid = -1;
 static int inherit = 1;
 static int force = 0;
@@ -60,7 +63,7 @@
 static struct timeval this_read;

 static u64 bytes_written = 0;

-static struct pollfd event_array[MAX_NR_CPUS * MAX_COUNTERS];
+static struct pollfd *event_array;

 static int nr_poll = 0;
 static int nr_cpu = 0;
@@ -77,7 +80,7 @@
 struct mmap_data {
 	unsigned int
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On Wednesday 17 March 2010 10:34:33 Zhang, Yanmin wrote: On Tue, 2010-03-16 at 11:32 +0200, Avi Kivity wrote: On 03/16/2010 09:48 AM, Zhang, Yanmin wrote: Right, but there is a scope between kvm_guest_enter and really running in guest os, where a perf event might overflow. Anyway, the scope is very narrow, I will change it to use flag PF_VCPU. There is also a window between setting the flag and calling 'int $2' where an NMI might happen and be accounted incorrectly. Perhaps separate the 'int $2' into a direct call into perf and another call for the rest of NMI handling. I don't see how it would work on svm though - AFAICT the NMI is held whereas vmx swallows it. I guess NMIs will be disabled until the next IRET so it isn't racy, just tricky. I'm not sure if vmexit does break NMI context or not. Hardware NMI context isn't reentrant till an IRET. YangSheng would like to double check it. After more checking, I think VMX won't retain the NMI blocking state for the host. That means, if an NMI happened while the processor is in VMX non-root mode, it would only result in a VMExit, with a reason indicating that it's due to an NMI, but no more state change in the host. So in that meaning, there _is_ a window between the VMExit and KVM handling the NMI. Moreover, I think we _can't_ stop the re-entrance of the NMI handling code because int $2 doesn't block a following NMI. And if the NMI sequence is not important (I think so), then we need to generate a real NMI in the current after-vmexit code. Letting the APIC send an NMI IPI to itself seems a good idea. I am debugging a patch based on apic->send_IPI_self(NMI_VECTOR) to replace int $2. Something unexpected is happening... -- regards Yang, Sheng
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On 03/17/2010 11:28 AM, Sheng Yang wrote: I'm not sure if vmexit does break NMI context or not. Hardware NMI context isn't reentrant till a IRET. YangSheng would like to double check it. After more check, I think VMX won't remained NMI block state for host. That's means, if NMI happened and processor is in VMX non-root mode, it would only result in VMExit, with a reason indicate that it's due to NMI happened, but no more state change in the host. So in that meaning, there _is_ a window between VMExit and KVM handle the NMI. Moreover, I think we _can't_ stop the re-entrance of NMI handling code because int $2 don't have effect to block following NMI. That's pretty bad, as NMI runs on a separate stack (via IST). So if another NMI happens while our int $2 is running, the stack will be corrupted. And if the NMI sequence is not important(I think so), then we need to generate a real NMI in current vmexit-after code. Seems let APIC send a NMI IPI to itself is a good idea. I am debugging a patch based on apic->send_IPI_self(NMI_VECTOR) to replace int $2. Something unexpected is happening... I think you need DM_NMI for that to work correctly. An alternative is to call the NMI handler directly. -- error compiling committee.c: too many arguments to function
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On Wednesday 17 March 2010 17:41:58 Avi Kivity wrote: On 03/17/2010 11:28 AM, Sheng Yang wrote: I'm not sure if vmexit does break NMI context or not. Hardware NMI context isn't reentrant till a IRET. YangSheng would like to double check it. After more check, I think VMX won't remained NMI block state for host. That's means, if NMI happened and processor is in VMX non-root mode, it would only result in VMExit, with a reason indicate that it's due to NMI happened, but no more state change in the host. So in that meaning, there _is_ a window between VMExit and KVM handle the NMI. Moreover, I think we _can't_ stop the re-entrance of NMI handling code because int $2 don't have effect to block following NMI. That's pretty bad, as NMI runs on a separate stack (via IST). So if another NMI happens while our int $2 is running, the stack will be corrupted. Though hardware didn't provide this kind of block, software at least would warn about it... nmi_enter() still would be executed by int $2, and result in BUG() if we are already in NMI context (OK, it is a little better than a mysterious crash due to a corrupted stack). And if the NMI sequence is not important(I think so), then we need to generate a real NMI in current vmexit-after code. Seems let APIC send a NMI IPI to itself is a good idea. I am debugging a patch based on apic->send_IPI_self(NMI_VECTOR) to replace int $2. Something unexpected is happening... I think you need DM_NMI for that to work correctly. An alternative is to call the NMI handler directly. apic_send_IPI_self() already took care of APIC_DM_NMI. And the NMI handler would block the following NMI? -- regards Yang, Sheng
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On 03/17/2010 11:51 AM, Sheng Yang wrote: I think you need DM_NMI for that to work correctly. An alternative is to call the NMI handler directly. apic_send_IPI_self() already took care of APIC_DM_NMI. So it does (though not for x2apic?). I don't see why it doesn't work. And NMI handler would block the following NMI? It wouldn't - won't work without extensive changes. -- error compiling committee.c: too many arguments to function
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On 03/16/2010 11:28 PM, Sheng Yang wrote: On Wednesday 17 March 2010 10:34:33 Zhang, Yanmin wrote: On Tue, 2010-03-16 at 11:32 +0200, Avi Kivity wrote: On 03/16/2010 09:48 AM, Zhang, Yanmin wrote: Right, but there is a scope between kvm_guest_enter and really running in guest os, where a perf event might overflow. Anyway, the scope is very narrow, I will change it to use flag PF_VCPU. There is also a window between setting the flag and calling 'int $2' where an NMI might happen and be accounted incorrectly. Perhaps separate the 'int $2' into a direct call into perf and another call for the rest of NMI handling. I don't see how it would work on svm though - AFAICT the NMI is held whereas vmx swallows it. I guess NMIs will be disabled until the next IRET so it isn't racy, just tricky. I'm not sure if vmexit does break NMI context or not. Hardware NMI context isn't reentrant till a IRET. YangSheng would like to double check it. After more check, I think VMX won't remained NMI block state for host. That's means, if NMI happened and processor is in VMX non-root mode, it would only result in VMExit, with a reason indicate that it's due to NMI happened, but no more state change in the host. So in that meaning, there _is_ a window between VMExit and KVM handle the NMI. Moreover, I think we _can't_ stop the re-entrance of NMI handling code because int $2 don't have effect to block following NMI. And if the NMI sequence is not important(I think so), then we need to generate a real NMI in current vmexit-after code. Seems let APIC send a NMI IPI to itself is a good idea. I am debugging a patch based on apic->send_IPI_self(NMI_VECTOR) to replace int $2. Something unexpected is happening... You can't use the APIC to send vectors 0x00-0x1f, or at least, aren't supposed to be able to. Zach
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On Thursday 18 March 2010 05:14:52 Zachary Amsden wrote: On 03/16/2010 11:28 PM, Sheng Yang wrote: On Wednesday 17 March 2010 10:34:33 Zhang, Yanmin wrote: On Tue, 2010-03-16 at 11:32 +0200, Avi Kivity wrote: On 03/16/2010 09:48 AM, Zhang, Yanmin wrote: Right, but there is a scope between kvm_guest_enter and really running in guest os, where a perf event might overflow. Anyway, the scope is very narrow, I will change it to use flag PF_VCPU. There is also a window between setting the flag and calling 'int $2' where an NMI might happen and be accounted incorrectly. Perhaps separate the 'int $2' into a direct call into perf and another call for the rest of NMI handling. I don't see how it would work on svm though - AFAICT the NMI is held whereas vmx swallows it. I guess NMIs will be disabled until the next IRET so it isn't racy, just tricky. I'm not sure if vmexit does break NMI context or not. Hardware NMI context isn't reentrant till a IRET. YangSheng would like to double check it. After more check, I think VMX won't remained NMI block state for host. That's means, if NMI happened and processor is in VMX non-root mode, it would only result in VMExit, with a reason indicate that it's due to NMI happened, but no more state change in the host. So in that meaning, there _is_ a window between VMExit and KVM handle the NMI. Moreover, I think we _can't_ stop the re-entrance of NMI handling code because int $2 don't have effect to block following NMI. And if the NMI sequence is not important(I think so), then we need to generate a real NMI in current vmexit-after code. Seems let APIC send a NMI IPI to itself is a good idea. I am debugging a patch based on apic->send_IPI_self(NMI_VECTOR) to replace int $2. Something unexpected is happening... You can't use the APIC to send vectors 0x00-0x1f, or at least, aren't supposed to be able to. Um? Why? Especially the kernel is already using it to deliver NMIs.
-- regards Yang, Sheng
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On Wed, 2010-03-17 at 17:26 +0800, Zhang, Yanmin wrote: On Tue, 2010-03-16 at 10:47 +0100, Ingo Molnar wrote: * Zhang, Yanmin yanmin_zh...@linux.intel.com wrote: On Tue, 2010-03-16 at 15:48 +0800, Zhang, Yanmin wrote: On Tue, 2010-03-16 at 07:41 +0200, Avi Kivity wrote: On 03/16/2010 07:27 AM, Zhang, Yanmin wrote: From: Zhang, Yanmin yanmin_zh...@linux.intel.com Based on the discussion in KVM community, I worked out the patch to support perf to collect guest os statistics from host side. This patch is implemented with Ingo, Peter and some other guys' kind help. Yang Sheng pointed out a critical bug and provided good suggestions with other guys. I really appreciate their kind help. The patch adds new subcommand kvm to perf. perf kvm top perf kvm record perf kvm report perf kvm diff The new perf could profile guest os kernel except guest os user space, but it could summarize guest os user space utilization per guest os. Below are some examples. 1) perf kvm top [r...@lkp-ne01 norm]# perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms --guestmodules=/home/ymzhang/guest/modules top Thanks for your kind comments. Excellent, support for guest kernel != host kernel is critical (I can't remember the last time I ran same kernels). How would we support multiple guests with different kernels? With the patch, 'perf kvm report --sort pid' could show summary statistics for all guest os instances. Then, use parameter --pid of 'perf kvm record' to collect single problematic instance data. Sorry. I found currently --pid isn't process but a thread (main thread). Ingo, Is it possible to support a new parameter or extend --inherit, so 'perf record' and 'perf top' could collect data on all threads of a process when the process is running? If not, I need to add a new ugly parameter which is similar to --pid to filter out process data in userspace. Yeah.
For maximum utility i'd suggest to extend --pid to include this, and introduce --tid for the previous, limited-to-a-single-task functionality. Most users would expect --pid to work like a 'late attach' - i.e. to work like strace -f or like a gdb attach. Thanks Ingo, Avi. I worked out below patch against tip/master of March 15th. Subject: [PATCH] Change perf's parameter --pid to process-wide collection From: Zhang, Yanmin yanmin_zh...@linux.intel.com Change parameter -p (--pid) to real process pid and add -t (--tid) meaning thread id. Now, --pid means perf collects the statistics of all threads of the process, while --tid means perf just collects the statistics of that thread. BTW, the patch fixes a bug of 'perf stat -p'. 'perf stat' always configures attr->disabled=1 if it isn't a system-wide collection. If there is a '-p' and no forks, 'perf stat -p' doesn't collect any data. In addition, the while(!done) loop in run_perf_stat consumes 100% of a single cpu, which has a bad impact on the running workload. I added a sleep(1) in the loop. Signed-off-by: Zhang Yanmin yanmin_zh...@linux.intel.com Ingo, Sorry, the patch has bugs. I need to do a better job and will work out 2 separate patches against the 2 issues. Yanmin
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On 03/17/2010 03:19 PM, Sheng Yang wrote: On Thursday 18 March 2010 05:14:52 Zachary Amsden wrote: On 03/16/2010 11:28 PM, Sheng Yang wrote: On Wednesday 17 March 2010 10:34:33 Zhang, Yanmin wrote: On Tue, 2010-03-16 at 11:32 +0200, Avi Kivity wrote: On 03/16/2010 09:48 AM, Zhang, Yanmin wrote: Right, but there is a scope between kvm_guest_enter and really running in guest os, where a perf event might overflow. Anyway, the scope is very narrow, I will change it to use flag PF_VCPU. There is also a window between setting the flag and calling 'int $2' where an NMI might happen and be accounted incorrectly. Perhaps separate the 'int $2' into a direct call into perf and another call for the rest of NMI handling. I don't see how it would work on svm though - AFAICT the NMI is held whereas vmx swallows it. I guess NMIs will be disabled until the next IRET so it isn't racy, just tricky. I'm not sure if vmexit does break NMI context or not. Hardware NMI context isn't reentrant till a IRET. YangSheng would like to double check it. After more check, I think VMX won't remained NMI block state for host. That's means, if NMI happened and processor is in VMX non-root mode, it would only result in VMExit, with a reason indicate that it's due to NMI happened, but no more state change in the host. So in that meaning, there _is_ a window between VMExit and KVM handle the NMI. Moreover, I think we _can't_ stop the re-entrance of NMI handling code because int $2 don't have effect to block following NMI. And if the NMI sequence is not important(I think so), then we need to generate a real NMI in current vmexit-after code. Seems let APIC send a NMI IPI to itself is a good idea. I am debugging a patch based on apic->send_IPI_self(NMI_VECTOR) to replace int $2. Something unexpected is happening... You can't use the APIC to send vectors 0x00-0x1f, or at least, aren't supposed to be able to. Um? Why? Especially kernel is already using it to deliver NMI.
That's the only defined case, and it is defined because the vector field is ignored for DM_NMI. Vol 3A (exact section numbers may vary depending on your version). 8.5.1 / 8.6.1 '100 (NMI) Delivers an NMI interrupt to the target processor or processors. The vector information is ignored' 8.5.2 Valid Interrupt Vectors 'Local and I/O APICs support 240 of these vectors (in the range of 16 to 255) as valid interrupts.' 8.8.4 Interrupt Acceptance for Fixed Interrupts '...; vectors 0 through 15 are reserved by the APIC (see also: Section 8.5.2, Valid Interrupt Vectors)' So I misremembered, apparently you can deliver interrupts 0x10-0x1f, but vectors 0x00-0x0f are not valid to send via APIC or I/O APIC. Zach
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On Thursday 18 March 2010 12:50:58 Zachary Amsden wrote: On 03/17/2010 03:19 PM, Sheng Yang wrote: On Thursday 18 March 2010 05:14:52 Zachary Amsden wrote: On 03/16/2010 11:28 PM, Sheng Yang wrote: On Wednesday 17 March 2010 10:34:33 Zhang, Yanmin wrote: On Tue, 2010-03-16 at 11:32 +0200, Avi Kivity wrote: On 03/16/2010 09:48 AM, Zhang, Yanmin wrote: Right, but there is a scope between kvm_guest_enter and really running in guest os, where a perf event might overflow. Anyway, the scope is very narrow, I will change it to use flag PF_VCPU. There is also a window between setting the flag and calling 'int $2' where an NMI might happen and be accounted incorrectly. Perhaps separate the 'int $2' into a direct call into perf and another call for the rest of NMI handling. I don't see how it would work on svm though - AFAICT the NMI is held whereas vmx swallows it. I guess NMIs will be disabled until the next IRET so it isn't racy, just tricky. I'm not sure if vmexit does break NMI context or not. Hardware NMI context isn't reentrant till a IRET. YangSheng would like to double check it. After more check, I think VMX won't remained NMI block state for host. That's means, if NMI happened and processor is in VMX non-root mode, it would only result in VMExit, with a reason indicate that it's due to NMI happened, but no more state change in the host. So in that meaning, there _is_ a window between VMExit and KVM handle the NMI. Moreover, I think we _can't_ stop the re-entrance of NMI handling code because int $2 don't have effect to block following NMI. And if the NMI sequence is not important(I think so), then we need to generate a real NMI in current vmexit-after code. Seems let APIC send a NMI IPI to itself is a good idea. I am debugging a patch based on apic->send_IPI_self(NMI_VECTOR) to replace int $2. Something unexpected is happening... You can't use the APIC to send vectors 0x00-0x1f, or at least, aren't supposed to be able to. Um? Why?
Especially kernel is already using it to deliver NMI. That's the only defined case, and it is defined because the vector field is ignored for DM_NMI. Vol 3A (exact section numbers may vary depending on your version). 8.5.1 / 8.6.1 '100 (NMI) Delivers an NMI interrupt to the target processor or processors. The vector information is ignored' 8.5.2 Valid Interrupt Vectors 'Local and I/O APICs support 240 of these vectors (in the range of 16 to 255) as valid interrupts.' 8.8.4 Interrupt Acceptance for Fixed Interrupts '...; vectors 0 through 15 are reserved by the APIC (see also: Section 8.5.2, Valid Interrupt Vectors)' So I misremembered, apparently you can deliver interrupts 0x10-0x1f, but vectors 0x00-0x0f are not valid to send via APIC or I/O APIC. As you pointed out, NMI is not a Fixed interrupt. If we want to send an NMI, it would need a specific delivery mode rather than a vector number. And if you look at the code, if we specify NMI_VECTOR, the delivery mode would be set to NMI. So what's wrong here? -- regards Yang, Sheng
RE: [PATCH] Enhance perf to collect KVM guest os statistics from host side
Hi Avi, Ingo, I've been following through this long thread since the very first email. I'm a performance engineer whose job is to tune workloads run on top of KVM (and Xen previously). As a performance engineer, I desperately want to have a tool that can monitor the host and guests at the same time. Think about 100 guests mixed with Linux/Windows running together on a single system: being able to know what's happening is critical to do performance analysis. Actually I am the person who asked Yanmin to add the feature for CPU utilization break-down (into host_usr, host_krn, guest_usr, guest_krn) so that I can monitor dozens of running guests. I haven't made this patch work on my system yet but I _do_ think this patch is a very good start. And finally, monitoring guests from the host is useful for users too (administrators and performance guys like me). I really appreciate you guys' work and would love to provide feedback from my point of view if needed. Regards, HUANG, Zhiteng Intel SSG/SSD/SPA/PRC Scalability Lab -Original Message- From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On Behalf Of Avi Kivity Sent: Wednesday, March 17, 2010 11:55 AM To: Frank Ch. Eigler Cc: Anthony Liguori; Ingo Molnar; Zhang, Yanmin; Peter Zijlstra; Sheng Yang; linux-ker...@vger.kernel.org; kvm@vger.kernel.org; Marcelo Tosatti; Joerg Roedel; Jes Sorensen; Gleb Natapov; Zachary Amsden; ziteng.hu...@intel.com Subject: Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side On 03/17/2010 02:41 AM, Frank Ch. Eigler wrote: Hi - On Tue, Mar 16, 2010 at 06:04:10PM -0500, Anthony Liguori wrote: [...] The only way to really address this is to change the interaction. Instead of running perf externally to qemu, we should support a perf command in the qemu monitor that can then tie directly to the perf tooling. That gives us the best possible user experience.
To what extent could this be solved with less crossing of isolation/abstraction layers, if the perfctr facilities were properly virtualized? That's the more interesting (by far) usage model. In general guest owners don't have access to the host, and host owners can't (and shouldn't) change guests. Monitoring guests from the host is useful for kvm developers, but less so for users. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On Thursday 18 March 2010 13:22:28 Sheng Yang wrote: On Thursday 18 March 2010 12:50:58 Zachary Amsden wrote: On 03/17/2010 03:19 PM, Sheng Yang wrote: On Thursday 18 March 2010 05:14:52 Zachary Amsden wrote: On 03/16/2010 11:28 PM, Sheng Yang wrote: On Wednesday 17 March 2010 10:34:33 Zhang, Yanmin wrote: On Tue, 2010-03-16 at 11:32 +0200, Avi Kivity wrote: On 03/16/2010 09:48 AM, Zhang, Yanmin wrote: Right, but there is a scope between kvm_guest_enter and really running in guest os, where a perf event might overflow. Anyway, the scope is very narrow, I will change it to use flag PF_VCPU. There is also a window between setting the flag and calling 'int $2' where an NMI might happen and be accounted incorrectly. Perhaps separate the 'int $2' into a direct call into perf and another call for the rest of NMI handling. I don't see how it would work on svm though - AFAICT the NMI is held whereas vmx swallows it. I guess NMIs will be disabled until the next IRET so it isn't racy, just tricky. I'm not sure if vmexit does break NMI context or not. Hardware NMI context isn't reentrant till a IRET. YangSheng would like to double check it. After more check, I think VMX won't remained NMI block state for host. That's means, if NMI happened and processor is in VMX non-root mode, it would only result in VMExit, with a reason indicate that it's due to NMI happened, but no more state change in the host. So in that meaning, there _is_ a window between VMExit and KVM handle the NMI. Moreover, I think we _can't_ stop the re-entrance of NMI handling code because int $2 don't have effect to block following NMI. And if the NMI sequence is not important(I think so), then we need to generate a real NMI in current vmexit-after code. Seems let APIC send a NMI IPI to itself is a good idea. I am debugging a patch based on apic->send_IPI_self(NMI_VECTOR) to replace int $2. Something unexpected is happening...
You can't use the APIC to send vectors 0x00-0x1f, or at least, aren't supposed to be able to. Um? Why? Especially since the kernel is already using it to deliver NMIs. That's the only defined case, and it is defined because the vector field is ignored for DM_NMI. Vol 3A (exact section numbers may vary depending on your version). 8.5.1 / 8.6.1 '100 (NMI) Delivers an NMI interrupt to the target processor or processors. The vector information is ignored' 8.5.2 Valid Interrupt Vectors 'Local and I/O APICs support 240 of these vectors (in the range of 16 to 255) as valid interrupts.' 8.8.4 Interrupt Acceptance for Fixed Interrupts '...; vectors 0 through 15 are reserved by the APIC (see also: Section 8.5.2, Valid Interrupt Vectors)' So I misremembered, apparently you can deliver interrupts 0x10-0x1f, but vectors 0x00-0x0f are not valid to send via the APIC or I/O APIC. As you pointed out, NMI is not a Fixed interrupt. If we want to send an NMI, it needs a specific delivery mode rather than a vector number. And if you look at the code, if we specify NMI_VECTOR, the delivery mode is set to NMI. So what's wrong here? OK, I think I understand your point now. You mean that these vectors can't be put in the vector field directly, right? But NMI is an exception due to DM_NMI. Is that your point? I think we agree on this. -- regards Yang, Sheng
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
* Avi Kivity <a...@redhat.com> wrote: On 03/16/2010 07:27 AM, Zhang, Yanmin wrote: From: Zhang, Yanmin <yanmin_zh...@linux.intel.com> Based on the discussion in KVM community, I worked out the patch to support perf to collect guest os statistics from host side. This patch is implemented with Ingo, Peter and some other guys' kind help. Yang Sheng pointed out a critical bug and provided good suggestions with other guys. I really appreciate their kind help. The patch adds new subcommand kvm to perf. perf kvm top perf kvm record perf kvm report perf kvm diff The new perf could profile guest os kernel except guest os user space, but it could summarize guest os user space utilization per guest os. Below are some examples. 1) perf kvm top [r...@lkp-ne01 norm]# perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms --guestmodules=/home/ymzhang/guest/modules top Excellent, support for guest kernel != host kernel is critical (I can't remember the last time I ran same kernels). How would we support multiple guests with different kernels? Perhaps a symbol server that perf can connect to (and that would connect to guests in turn)? The highest quality solution would be if KVM offered a 'guest extension' to the guest kernel's /proc/kallsyms that made it easy for user-space to get this information from an authoritative source. That's the main reason why the host side /proc/kallsyms is so popular and so useful: while in theory it's mostly redundant information which can be gleaned from the System.map and other sources of symbol information, it's easily available and is _always_ trustable to come from the host kernel. Separate System.map's have a tendency to go out of sync (or go missing when a devel kernel gets rebuilt, or if a devel package is not installed), and server ports (be that a TCP port space server or an UDP port space mount-point) are both a configuration hassle and are not guest-transparent.
So for instrumentation infrastructure (such as perf) we have a large and well founded preference for intrinsic, built-in, kernel-provided information: i.e. a largely 'built-in' and transparent mechanism to get to guest symbols. Thanks, Ingo
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On Tue, 2010-03-16 at 07:41 +0200, Avi Kivity wrote: On 03/16/2010 07:27 AM, Zhang, Yanmin wrote: From: Zhang, Yanmin <yanmin_zh...@linux.intel.com> Based on the discussion in KVM community, I worked out the patch to support perf to collect guest os statistics from host side. This patch is implemented with Ingo, Peter and some other guys' kind help. Yang Sheng pointed out a critical bug and provided good suggestions with other guys. I really appreciate their kind help. The patch adds new subcommand kvm to perf. perf kvm top perf kvm record perf kvm report perf kvm diff The new perf could profile guest os kernel except guest os user space, but it could summarize guest os user space utilization per guest os. Below are some examples. 1) perf kvm top [r...@lkp-ne01 norm]# perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms --guestmodules=/home/ymzhang/guest/modules top Thanks for your kind comments. Excellent, support for guest kernel != host kernel is critical (I can't remember the last time I ran same kernels). How would we support multiple guests with different kernels? With the patch, 'perf kvm report --sort pid' could show summary statistics for all guest os instances. Then, use the --pid parameter of 'perf kvm record' to collect data for a single problematic instance. Perhaps a symbol server that perf can connect to (and that would connect to guests in turn)?
diff -Nraup linux-2.6_tipmaster0315/arch/x86/kvm/vmx.c linux-2.6_tipmaster0315_perfkvm/arch/x86/kvm/vmx.c
--- linux-2.6_tipmaster0315/arch/x86/kvm/vmx.c 2010-03-16 08:59:11.825295404 +0800
+++ linux-2.6_tipmaster0315_perfkvm/arch/x86/kvm/vmx.c 2010-03-16 09:01:09.976084492 +0800
@@ -26,6 +26,7 @@
 #include <linux/sched.h>
 #include <linux/moduleparam.h>
 #include <linux/ftrace_event.h>
+#include <linux/perf_event.h>
 #include "kvm_cache_regs.h"
 #include "x86.h"
@@ -3632,6 +3633,43 @@ static void update_cr8_intercept(struct
 	vmcs_write32(TPR_THRESHOLD, irr);
 }
+DEFINE_PER_CPU(int, kvm_in_guest) = {0};
+
+static void kvm_set_in_guest(void)
+{
+	percpu_write(kvm_in_guest, 1);
+}
+
+static int kvm_is_in_guest(void)
+{
+	return percpu_read(kvm_in_guest);
+}

There is already PF_VCPU for this.

Right, but there is a window between kvm_guest_enter and really running in the guest os, where a perf event might overflow. Anyway, the window is very narrow; I will change it to use the flag PF_VCPU.

+static struct perf_guest_info_callbacks kvm_guest_cbs = {
+	.is_in_guest		= kvm_is_in_guest,
+	.is_user_mode		= kvm_is_user_mode,
+	.get_guest_ip		= kvm_get_guest_ip,
+	.reset_in_guest		= kvm_reset_in_guest,
+};

Should be in common code, not vmx specific.

Right. I discussed with Yangsheng. I will move the above data structures and callbacks to arch/x86/kvm/x86.c, and add get_ip, a new callback, to kvm_x86_ops.

Yanmin
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On 03/16/2010 09:24 AM, Ingo Molnar wrote: * Avi Kivity <a...@redhat.com> wrote: On 03/16/2010 07:27 AM, Zhang, Yanmin wrote: From: Zhang, Yanmin <yanmin_zh...@linux.intel.com> Based on the discussion in KVM community, I worked out the patch to support perf to collect guest os statistics from host side. This patch is implemented with Ingo, Peter and some other guys' kind help. Yang Sheng pointed out a critical bug and provided good suggestions with other guys. I really appreciate their kind help. The patch adds new subcommand kvm to perf. perf kvm top perf kvm record perf kvm report perf kvm diff The new perf could profile guest os kernel except guest os user space, but it could summarize guest os user space utilization per guest os. Below are some examples. 1) perf kvm top [r...@lkp-ne01 norm]# perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms --guestmodules=/home/ymzhang/guest/modules top Excellent, support for guest kernel != host kernel is critical (I can't remember the last time I ran same kernels). How would we support multiple guests with different kernels? Perhaps a symbol server that perf can connect to (and that would connect to guests in turn)? The highest quality solution would be if KVM offered a 'guest extension' to the guest kernel's /proc/kallsyms that made it easy for user-space to get this information from an authoritative source. That's the main reason why the host side /proc/kallsyms is so popular and so useful: while in theory it's mostly redundant information which can be gleaned from the System.map and other sources of symbol information, it's easily available and is _always_ trustable to come from the host kernel. Separate System.map's have a tendency to go out of sync (or go missing when a devel kernel gets rebuilt, or if a devel package is not installed), and server ports (be that a TCP port space server or an UDP port space mount-point) are both a configuration hassle and are not guest-transparent.
So for instrumentation infrastructure (such as perf) we have a large and well founded preference for intrinsic, built-in, kernel-provided information: i.e. a largely 'built-in' and transparent mechanism to get to guest symbols. The symbol server's client can certainly access the bits through vmchannel. -- error compiling committee.c: too many arguments to function
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On Tue, 2010-03-16 at 15:48 +0800, Zhang, Yanmin wrote: On Tue, 2010-03-16 at 07:41 +0200, Avi Kivity wrote: On 03/16/2010 07:27 AM, Zhang, Yanmin wrote: From: Zhang, Yanmin <yanmin_zh...@linux.intel.com> Based on the discussion in KVM community, I worked out the patch to support perf to collect guest os statistics from host side. This patch is implemented with Ingo, Peter and some other guys' kind help. Yang Sheng pointed out a critical bug and provided good suggestions with other guys. I really appreciate their kind help. The patch adds new subcommand kvm to perf. perf kvm top perf kvm record perf kvm report perf kvm diff The new perf could profile guest os kernel except guest os user space, but it could summarize guest os user space utilization per guest os. Below are some examples. 1) perf kvm top [r...@lkp-ne01 norm]# perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms --guestmodules=/home/ymzhang/guest/modules top Thanks for your kind comments. Excellent, support for guest kernel != host kernel is critical (I can't remember the last time I ran same kernels). How would we support multiple guests with different kernels? With the patch, 'perf kvm report --sort pid' could show summary statistics for all guest os instances. Then, use the --pid parameter of 'perf kvm record' to collect data for a single problematic instance. Sorry. I found that currently --pid isn't per-process but per-thread (the main thread). Ingo, is it possible to support a new parameter or extend --inherit, so 'perf record' and 'perf top' could collect data on all threads of a process while the process is running? If not, I need to add a new ugly parameter, similar to --pid, to filter out process data in userspace. Yanmin
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On 03/16/2010 09:48 AM, Zhang, Yanmin wrote: Excellent, support for guest kernel != host kernel is critical (I can't remember the last time I ran same kernels). How would we support multiple guests with different kernels? With the patch, 'perf kvm report --sort pid' could show summary statistics for all guest os instances. Then, use the --pid parameter of 'perf kvm record' to collect data for a single problematic instance. That certainly works, though automatic association of guest data with guest symbols is friendlier.

diff -Nraup linux-2.6_tipmaster0315/arch/x86/kvm/vmx.c linux-2.6_tipmaster0315_perfkvm/arch/x86/kvm/vmx.c
--- linux-2.6_tipmaster0315/arch/x86/kvm/vmx.c 2010-03-16 08:59:11.825295404 +0800
+++ linux-2.6_tipmaster0315_perfkvm/arch/x86/kvm/vmx.c 2010-03-16 09:01:09.976084492 +0800
@@ -26,6 +26,7 @@
 #include <linux/sched.h>
 #include <linux/moduleparam.h>
 #include <linux/ftrace_event.h>
+#include <linux/perf_event.h>
 #include "kvm_cache_regs.h"
 #include "x86.h"
@@ -3632,6 +3633,43 @@ static void update_cr8_intercept(struct
 	vmcs_write32(TPR_THRESHOLD, irr);
 }
+DEFINE_PER_CPU(int, kvm_in_guest) = {0};
+
+static void kvm_set_in_guest(void)
+{
+	percpu_write(kvm_in_guest, 1);
+}
+
+static int kvm_is_in_guest(void)
+{
+	return percpu_read(kvm_in_guest);
+}

There is already PF_VCPU for this.

Right, but there is a window between kvm_guest_enter and really running in the guest os, where a perf event might overflow. Anyway, the window is very narrow; I will change it to use the flag PF_VCPU.

There is also a window between setting the flag and calling 'int $2' where an NMI might happen and be accounted incorrectly. Perhaps separate the 'int $2' into a direct call into perf and another call for the rest of NMI handling. I don't see how it would work on svm though - AFAICT the NMI is held, whereas vmx swallows it. I guess NMIs will be disabled until the next IRET so it isn't racy, just tricky.
+static struct perf_guest_info_callbacks kvm_guest_cbs = {
+	.is_in_guest		= kvm_is_in_guest,
+	.is_user_mode		= kvm_is_user_mode,
+	.get_guest_ip		= kvm_get_guest_ip,
+	.reset_in_guest		= kvm_reset_in_guest,
+};

Should be in common code, not vmx specific.

Right. I discussed with Yangsheng. I will move the above data structures and callbacks to arch/x86/kvm/x86.c, and add get_ip, a new callback, to kvm_x86_ops.

You will need access to the vcpu pointer (kvm_rip_read() needs it); you can put it in a percpu variable. I guess if it's not null, you know you're in a guest, so there is no need for PF_VCPU.
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On 03/16/2010 11:28 AM, Zhang, Yanmin wrote: Sorry. I found that currently --pid isn't per-process but per-thread (the main thread). Ingo, is it possible to support a new parameter or extend --inherit, so 'perf record' and 'perf top' could collect data on all threads of a process while the process is running? That seems like a worthwhile addition regardless of this thread. Profile all current threads and any new ones. It probably makes sense to call this --pid and rename the existing --pid to --thread.
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
* Zhang, Yanmin <yanmin_zh...@linux.intel.com> wrote: On Tue, 2010-03-16 at 15:48 +0800, Zhang, Yanmin wrote: On Tue, 2010-03-16 at 07:41 +0200, Avi Kivity wrote: On 03/16/2010 07:27 AM, Zhang, Yanmin wrote: From: Zhang, Yanmin <yanmin_zh...@linux.intel.com> Based on the discussion in KVM community, I worked out the patch to support perf to collect guest os statistics from host side. This patch is implemented with Ingo, Peter and some other guys' kind help. Yang Sheng pointed out a critical bug and provided good suggestions with other guys. I really appreciate their kind help. The patch adds new subcommand kvm to perf. perf kvm top perf kvm record perf kvm report perf kvm diff The new perf could profile guest os kernel except guest os user space, but it could summarize guest os user space utilization per guest os. Below are some examples. 1) perf kvm top [r...@lkp-ne01 norm]# perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms --guestmodules=/home/ymzhang/guest/modules top Thanks for your kind comments. Excellent, support for guest kernel != host kernel is critical (I can't remember the last time I ran same kernels). How would we support multiple guests with different kernels? With the patch, 'perf kvm report --sort pid' could show summary statistics for all guest os instances. Then, use the --pid parameter of 'perf kvm record' to collect data for a single problematic instance. Sorry. I found that currently --pid isn't per-process but per-thread (the main thread). Ingo, is it possible to support a new parameter or extend --inherit, so 'perf record' and 'perf top' could collect data on all threads of a process while the process is running? If not, I need to add a new ugly parameter, similar to --pid, to filter out process data in userspace. Yeah. For maximum utility i'd suggest extending --pid to include this, and introducing --tid for the previous, limited-to-a-single-task functionality. Most users would expect --pid to work like a 'late attach' - i.e.
to work like strace -f or like a gdb attach. Ingo
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
* Avi Kivity <a...@redhat.com> wrote: On 03/16/2010 09:24 AM, Ingo Molnar wrote: * Avi Kivity <a...@redhat.com> wrote: On 03/16/2010 07:27 AM, Zhang, Yanmin wrote: From: Zhang, Yanmin <yanmin_zh...@linux.intel.com> Based on the discussion in KVM community, I worked out the patch to support perf to collect guest os statistics from host side. This patch is implemented with Ingo, Peter and some other guys' kind help. Yang Sheng pointed out a critical bug and provided good suggestions with other guys. I really appreciate their kind help. The patch adds new subcommand kvm to perf. perf kvm top perf kvm record perf kvm report perf kvm diff The new perf could profile guest os kernel except guest os user space, but it could summarize guest os user space utilization per guest os. Below are some examples. 1) perf kvm top [r...@lkp-ne01 norm]# perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms --guestmodules=/home/ymzhang/guest/modules top Excellent, support for guest kernel != host kernel is critical (I can't remember the last time I ran same kernels). How would we support multiple guests with different kernels? Perhaps a symbol server that perf can connect to (and that would connect to guests in turn)? The highest quality solution would be if KVM offered a 'guest extension' to the guest kernel's /proc/kallsyms that made it easy for user-space to get this information from an authoritative source. That's the main reason why the host side /proc/kallsyms is so popular and so useful: while in theory it's mostly redundant information which can be gleaned from the System.map and other sources of symbol information, it's easily available and is _always_ trustable to come from the host kernel.
Separate System.map's have a tendency to go out of sync (or go missing when a devel kernel gets rebuilt, or if a devel package is not installed), and server ports (be that a TCP port space server or an UDP port space mount-point) are both a configuration hassle and are not guest-transparent. So for instrumentation infrastructure (such as perf) we have a large and well founded preference for intrinsic, built-in, kernel-provided information: i.e. a largely 'built-in' and transparent mechanism to get to guest symbols. The symbol server's client can certainly access the bits through vmchannel. Ok, that would work i suspect. Would be nice to have the symbol server in tools/perf/ and also make it easy to add it to the initrd via a .config switch or so. That would have basically all of the advantages of being built into the kernel (availability, configurability, transparency, hackability), while having all the advantages of a user-space approach as well (flexibility, extensibility, robustness, ease of maintenance, etc.). If only we had tools/xorg/ integrated via the initrd that way ;-) Thanks, Ingo
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On 03/16/2010 11:53 AM, Ingo Molnar wrote: * Avi Kivity <a...@redhat.com> wrote: On 03/16/2010 09:24 AM, Ingo Molnar wrote: * Avi Kivity <a...@redhat.com> wrote: On 03/16/2010 07:27 AM, Zhang, Yanmin wrote: From: Zhang, Yanmin <yanmin_zh...@linux.intel.com> Based on the discussion in KVM community, I worked out the patch to support perf to collect guest os statistics from host side. This patch is implemented with Ingo, Peter and some other guys' kind help. Yang Sheng pointed out a critical bug and provided good suggestions with other guys. I really appreciate their kind help. The patch adds new subcommand kvm to perf. perf kvm top perf kvm record perf kvm report perf kvm diff The new perf could profile guest os kernel except guest os user space, but it could summarize guest os user space utilization per guest os. Below are some examples. 1) perf kvm top [r...@lkp-ne01 norm]# perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms --guestmodules=/home/ymzhang/guest/modules top Excellent, support for guest kernel != host kernel is critical (I can't remember the last time I ran same kernels). How would we support multiple guests with different kernels? Perhaps a symbol server that perf can connect to (and that would connect to guests in turn)? The highest quality solution would be if KVM offered a 'guest extension' to the guest kernel's /proc/kallsyms that made it easy for user-space to get this information from an authoritative source. That's the main reason why the host side /proc/kallsyms is so popular and so useful: while in theory it's mostly redundant information which can be gleaned from the System.map and other sources of symbol information, it's easily available and is _always_ trustable to come from the host kernel.
Separate System.map's have a tendency to go out of sync (or go missing when a devel kernel gets rebuilt, or if a devel package is not installed), and server ports (be that a TCP port space server or an UDP port space mount-point) are both a configuration hassle and are not guest-transparent. So for instrumentation infrastructure (such as perf) we have a large and well founded preference for intrinsic, built-in, kernel-provided information: i.e. a largely 'built-in' and transparent mechanism to get to guest symbols. The symbol server's client can certainly access the bits through vmchannel. Ok, that would work i suspect. Would be nice to have the symbol server in tools/perf/ and also make it easy to add it to the initrd via a .config switch or so. That would have basically all of the advantages of being built into the kernel (availability, configurability, transparency, hackability), while having all the advantages of a user-space approach as well (flexibility, extensibility, robustness, ease of maintenance, etc.). Note, I am not advocating building the vmchannel client into the host kernel. While that makes everything simpler for the user, it increases the kernel footprint with all the disadvantages that come with that (any bug is converted into a host DoS or worse). So, perf would connect to qemu via (say) a well-known unix domain socket, which would then talk to the guest kernel. I know you won't like it, we'll continue to disagree on this unfortunately.
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
* Avi Kivity <a...@redhat.com> wrote: On 03/16/2010 11:53 AM, Ingo Molnar wrote: * Avi Kivity <a...@redhat.com> wrote: On 03/16/2010 09:24 AM, Ingo Molnar wrote: * Avi Kivity <a...@redhat.com> wrote: On 03/16/2010 07:27 AM, Zhang, Yanmin wrote: From: Zhang, Yanmin <yanmin_zh...@linux.intel.com> Based on the discussion in KVM community, I worked out the patch to support perf to collect guest os statistics from host side. This patch is implemented with Ingo, Peter and some other guys' kind help. Yang Sheng pointed out a critical bug and provided good suggestions with other guys. I really appreciate their kind help. The patch adds new subcommand kvm to perf. perf kvm top perf kvm record perf kvm report perf kvm diff The new perf could profile guest os kernel except guest os user space, but it could summarize guest os user space utilization per guest os. Below are some examples. 1) perf kvm top [r...@lkp-ne01 norm]# perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms --guestmodules=/home/ymzhang/guest/modules top Excellent, support for guest kernel != host kernel is critical (I can't remember the last time I ran same kernels). How would we support multiple guests with different kernels? Perhaps a symbol server that perf can connect to (and that would connect to guests in turn)? The highest quality solution would be if KVM offered a 'guest extension' to the guest kernel's /proc/kallsyms that made it easy for user-space to get this information from an authoritative source. That's the main reason why the host side /proc/kallsyms is so popular and so useful: while in theory it's mostly redundant information which can be gleaned from the System.map and other sources of symbol information, it's easily available and is _always_ trustable to come from the host kernel.
Separate System.map's have a tendency to go out of sync (or go missing when a devel kernel gets rebuilt, or if a devel package is not installed), and server ports (be that a TCP port space server or an UDP port space mount-point) are both a configuration hassle and are not guest-transparent. So for instrumentation infrastructure (such as perf) we have a large and well founded preference for intrinsic, built-in, kernel-provided information: i.e. a largely 'built-in' and transparent mechanism to get to guest symbols. The symbol server's client can certainly access the bits through vmchannel. Ok, that would work i suspect. Would be nice to have the symbol server in tools/perf/ and also make it easy to add it to the initrd via a .config switch or so. That would have basically all of the advantages of being built into the kernel (availability, configurability, transparency, hackability), while having all the advantages of a user-space approach as well (flexibility, extensibility, robustness, ease of maintenance, etc.). Note, I am not advocating building the vmchannel client into the host kernel. [...] Neither am i. What i suggested was a user-space binary/executable built in tools/perf and put into the initrd. That approach has the advantages i listed above, without having the disadvantages of in-kernel code you listed. Thanks, Ingo
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On 03/16/2010 12:20 PM, Ingo Molnar wrote: The symbol server's client can certainly access the bits through vmchannel. Ok, that would work i suspect. Would be nice to have the symbol server in tools/perf/ and also make it easy to add it to the initrd via a .config switch or so. That would have basically all of the advantages of being built into the kernel (availability, configurability, transparency, hackability), while having all the advantages of a user-space approach as well (flexibility, extensibility, robustness, ease of maintenance, etc.). Note, I am not advocating building the vmchannel client into the host kernel. [...] Neither am i. What i suggested was a user-space binary/executable built in tools/perf and put into the initrd. I'm confused - initrd seems to be guest-side. I was talking about the host side. For the guest, placing the symbol server in tools/ is reasonable.
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
* Avi Kivity <a...@redhat.com> wrote: On 03/16/2010 12:20 PM, Ingo Molnar wrote: The symbol server's client can certainly access the bits through vmchannel. Ok, that would work i suspect. Would be nice to have the symbol server in tools/perf/ and also make it easy to add it to the initrd via a .config switch or so. That would have basically all of the advantages of being built into the kernel (availability, configurability, transparency, hackability), while having all the advantages of a user-space approach as well (flexibility, extensibility, robustness, ease of maintenance, etc.). Note, I am not advocating building the vmchannel client into the host kernel. [...] Neither am i. What i suggested was a user-space binary/executable built in tools/perf and put into the initrd. I'm confused - initrd seems to be guest-side. I was talking about the host side. The host side doesn't need much support - just some client capability in perf itself. I suspect vmchannels are sufficiently flexible and configuration-free for such purposes? (i.e. like a filesystem in essence) Ingo
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On 03/16/2010 12:50 PM, Ingo Molnar wrote: I'm confused - initrd seems to be guest-side. I was talking about the host side. The host side doesn't need much support - just some client capability in perf itself. I suspect vmchannels are sufficiently flexible and configuration-free for such purposes? (i.e. like a filesystem in essence) I haven't followed vmchannel closely, but I think it is. vmchannel is terminated in qemu on the host side, not in the host kernel. So perf would need to connect to qemu.
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
* Avi Kivity <a...@redhat.com> wrote: On 03/16/2010 12:50 PM, Ingo Molnar wrote: I'm confused - initrd seems to be guest-side. I was talking about the host side. The host side doesn't need much support - just some client capability in perf itself. I suspect vmchannels are sufficiently flexible and configuration-free for such purposes? (i.e. like a filesystem in essence) I haven't followed vmchannel closely, but I think it is. vmchannel is terminated in qemu on the host side, not in the host kernel. So perf would need to connect to qemu. Hm, that sounds rather messy if we want to use it to basically expose kernel functionality in a guest/host unified way. Is the qemu process discoverable in some secure way? Can we trust it? Is there some proper tooling available to do it, or do we have to push it through 2-3 packages to get such a useful feature done? ( That is the general thought process by which many cross-discipline useful desktop/server features hit the bit bucket before having had any chance of being vetted by users, and why Linux sucks so much when it comes to feature integration and application usability. ) Ingo
On 03/16/2010 01:25 PM, Ingo Molnar wrote: I haven't followed vmchannel closely, but I think it is. vmchannel is terminated in qemu on the host side, not in the host kernel. So perf would need to connect to qemu. Hm, that sounds rather messy if we want to use it to basically expose kernel functionality in a guest/host unified way. Is the qemu process discoverable in some secure way? We know its pid. Can we trust it? No choice, it contains the guest address space. Is there some proper tooling available to do it, or do we have to push it through 2-3 packages to get such a useful feature done? libvirt manages qemu processes, but I don't think this should go through libvirt. qemu can do this directly by opening a unix domain socket in a well-known place. ( That is the general thought process by which many cross-discipline useful desktop/server features hit the bit bucket before having had any chance of being vetted by users, and why Linux sucks so much when it comes to feature integration and application usability. ) You can't solve everything in the kernel, even with a well-populated tools/.
* Avi Kivity a...@redhat.com wrote: On 03/16/2010 01:25 PM, Ingo Molnar wrote: I haven't followed vmchannel closely, but I think it is. vmchannel is terminated in qemu on the host side, not in the host kernel. So perf would need to connect to qemu. Hm, that sounds rather messy if we want to use it to basically expose kernel functionality in a guest/host unified way. Is the qemu process discoverable in some secure way? We know its pid. How do i get a list of all 'guest instance PIDs', and what is the way to talk to Qemu? Can we trust it? No choice, it contains the guest address space. I mean, i can trust a kernel service and i can trust /proc/kallsyms. Can perf trust a random process claiming to be Qemu? What's the trust mechanism here? Is there some proper tooling available to do it, or do we have to push it through 2-3 packages to get such a useful feature done? libvirt manages qemu processes, but I don't think this should go through libvirt. qemu can do this directly by opening a unix domain socket in a well-known place. So Qemu has never run into such problems before? ( Sounds weird - i think Qemu configuration itself should be done via a unix domain socket driven configuration protocol as well. ) ( That is the general thought process by which many cross-discipline useful desktop/server features hit the bit bucket before having had any chance of being vetted by users, and why Linux sucks so much when it comes to feature integration and application usability. ) You can't solve everything in the kernel, even with a well-populated tools/. Certainly not, but this is a technical problem in the kernel's domain, so it's a fair (and natural) expectation to be able to solve this within the kernel project. Ingo
On 03/16/2010 02:29 PM, Ingo Molnar wrote: * Avi Kivity a...@redhat.com wrote: On 03/16/2010 01:25 PM, Ingo Molnar wrote: I haven't followed vmchannel closely, but I think it is. vmchannel is terminated in qemu on the host side, not in the host kernel. So perf would need to connect to qemu. Hm, that sounds rather messy if we want to use it to basically expose kernel functionality in a guest/host unified way. Is the qemu process discoverable in some secure way? We know its pid. How do i get a list of all 'guest instance PIDs', Libvirt manages all qemus, but this should be implemented independently of libvirt. and what is the way to talk to Qemu? In general qemu exposes communication channels (such as the monitor) as tcp connections, unix-domain sockets, stdio, etc. It's very flexible. Can we trust it? No choice, it contains the guest address space. I mean, i can trust a kernel service and i can trust /proc/kallsyms. Can perf trust a random process claiming to be Qemu? What's the trust mechanism here? Obviously you can't trust anything you get from a guest, no matter how you get it. How do you trust a userspace program's symbols? You don't. How do you get them? They're in a well-known location. Is there some proper tooling available to do it, or do we have to push it through 2-3 packages to get such a useful feature done? libvirt manages qemu processes, but I don't think this should go through libvirt. qemu can do this directly by opening a unix domain socket in a well-known place. So Qemu has never run into such problems before? ( Sounds weird - i think Qemu configuration itself should be done via a unix domain socket driven configuration protocol as well. ) That's exactly what happens. You invoke qemu with -monitor unix:blah,server (or -qmp for a machine-readable format) and have your management application connect to that. You can redirect guest serial ports, console, parallel port, etc. to unix-domain or tcp sockets. vmchannel is an extension of that mechanism.
( That is the general thought process by which many cross-discipline useful desktop/server features hit the bit bucket before having had any chance of being vetted by users, and why Linux sucks so much when it comes to feature integration and application usability. ) You can't solve everything in the kernel, even with a well-populated tools/. Certainly not, but this is a technical problem in the kernel's domain, so it's a fair (and natural) expectation to be able to solve this within the kernel project. Someone writing perf-gui outside the kernel would have the same problems, no?
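The monitor/QMP handshake Avi describes is simple enough to sketch from the client ("perf") side. Below is a minimal illustration with an in-process stand-in for qemu's socket endpoint, so it runs without qemu; the socket path is invented and the greeting fields only mimic the shape of a real QMP banner:

```python
import json
import os
import socket
import tempfile
import threading

path = os.path.join(tempfile.mkdtemp(), "qmp.sock")

# Stand-in for 'qemu -qmp unix:...,server': greets with a QMP-style
# banner, then acks the client's qmp_capabilities command.
srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
srv.bind(path)
srv.listen(1)

def fake_qemu():
    conn, _ = srv.accept()
    banner = {"QMP": {"version": {}, "capabilities": []}}
    conn.sendall(json.dumps(banner).encode() + b"\n")
    conn.recv(4096)                      # the qmp_capabilities request
    conn.sendall(b'{"return": {}}\n')
    conn.close()

t = threading.Thread(target=fake_qemu)
t.start()

# What a perf-side client would do: connect to the well-known socket,
# read the greeting, negotiate capabilities.
c = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
c.connect(path)
f = c.makefile("rb")
greeting = json.loads(f.readline())
c.sendall(b'{"execute": "qmp_capabilities"}\n')
ack = json.loads(f.readline())
c.close()
t.join()
srv.close()

print("QMP" in greeting, ack)            # → True {'return': {}}
```

The point of the "well-known place" is exactly that the client side stays this small: no discovery protocol, just a connect plus a line-oriented JSON exchange.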
* Avi Kivity a...@redhat.com wrote: On 03/16/2010 02:29 PM, Ingo Molnar wrote: I mean, i can trust a kernel service and i can trust /proc/kallsyms. Can perf trust a random process claiming to be Qemu? What's the trust mechanism here? Obviously you can't trust anything you get from a guest, no matter how you get it. I'm not talking about the symbol strings and addresses, and the object contents for annotation (or debuginfo). I'm talking about the basic protocol of establishing which guest is which. I.e. we really want users to be able to: 1) have it all working with a single guest, without having to specify 'which' guest (qemu PID) to work with. That is the dominant usecase both for developers and for a fair portion of testers. 2) Have some reasonable symbolic identification for guests. For example a usable approach would be to have 'perf kvm list', which would list all currently active guests:

$ perf kvm list
[1] Fedora
[2] OpenSuse
[3] Windows-XP
[4] Windows-7

And from that point on 'perf kvm -g OpenSuse record' would do the obvious thing. Users will be able to just use the 'OpenSuse' symbolic name for that guest, even if the guest got restarted and switched its main PID. Any such facility needs trusted enumeration and a protocol where i can trust that the information i got is authoritative. (I.e. 'OpenSuse' truly matches to the OpenSuse session - not to some local user starting up a Qemu instance that claims to be 'OpenSuse'.) Is such a scheme possible/available? I suspect all the KVM configuration tools (i haven't used them in some time - gui and command-line tools alike) use similar methods to ease guest management? Ingo
On 03/16/2010 03:08 PM, Ingo Molnar wrote: I mean, i can trust a kernel service and i can trust /proc/kallsyms. Can perf trust a random process claiming to be Qemu? What's the trust mechanism here? Obviously you can't trust anything you get from a guest, no matter how you get it. I'm not talking about the symbol strings and addresses, and the object contents for annotation (or debuginfo). I'm talking about the basic protocol of establishing which guest is which. There is none. So far, qemu only dealt with managing just its own guest, and left all multiple guest management to higher levels up the stack (like libvirt). I.e. we really want users to be able to: 1) have it all working with a single guest, without having to specify 'which' guest (qemu PID) to work with. That is the dominant usecase both for developers and for a fair portion of testers. That's reasonable if we can get it working simply. 2) Have some reasonable symbolic identification for guests. For example a usable approach would be to have 'perf kvm list', which would list all currently active guests:

$ perf kvm list
[1] Fedora
[2] OpenSuse
[3] Windows-XP
[4] Windows-7

And from that point on 'perf kvm -g OpenSuse record' would do the obvious thing. Users will be able to just use the 'OpenSuse' symbolic name for that guest, even if the guest got restarted and switched its main PID. Any such facility needs trusted enumeration and a protocol where i can trust that the information i got is authoritative. (I.e. 'OpenSuse' truly matches to the OpenSuse session - not to some local user starting up a Qemu instance that claims to be 'OpenSuse'.) Is such a scheme possible/available? I suspect all the KVM configuration tools (i haven't used them in some time - gui and command-line tools alike) use similar methods to ease guest management? You can do that through libvirt, but that only works for guests started through libvirt.
libvirt provides command-line tools to list and manage guests (for example autostarting them on startup), and tools built on top of libvirt can manage guests graphically. Looks like we have a layer inversion here. Maybe we need a plugin system - libvirt drops a .so into perf that teaches it how to list guests and get their symbols.
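The plugin system Avi suggests can be sketched abstractly. Everything below is hypothetical - perf has no such plugin interface, and all the names are invented - but it shows the shape: each management stack registers an enumerator, and a 'perf kvm list' equivalent walks them all:

```python
# Hypothetical shape of the plugin interface: each management stack
# (libvirt, a custom script, ...) registers a callable that returns
# the guests it knows about.
from typing import Callable, List, NamedTuple

class Guest(NamedTuple):
    name: str
    pid: int
    kallsyms: str       # where guest symbols could be fetched from

_enumerators: List[Callable[[], List[Guest]]] = []

def register_enumerator(fn):
    """What a dropped-in plugin would call at load time."""
    _enumerators.append(fn)
    return fn

def list_guests() -> List[Guest]:
    """A 'perf kvm list' equivalent: walk every registered plugin."""
    guests = []
    for fn in _enumerators:
        guests.extend(fn())
    return guests

# A toy 'libvirt' plugin standing in for the real .so:
@register_enumerator
def libvirt_plugin():
    return [Guest("Fedora", 4242, "/tmp/fedora.kallsyms")]

print([g.name for g in list_guests()])   # → ['Fedora']
```

The design point is that perf itself stays agnostic: the layer inversion is resolved by having the management tool push knowledge down into the profiler, rather than the profiler hard-coding any one management stack.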
* Avi Kivity a...@redhat.com wrote: On 03/16/2010 03:08 PM, Ingo Molnar wrote: I mean, i can trust a kernel service and i can trust /proc/kallsyms. Can perf trust a random process claiming to be Qemu? What's the trust mechanism here? Obviously you can't trust anything you get from a guest, no matter how you get it. I'm not talking about the symbol strings and addresses, and the object contents for annotation (or debuginfo). I'm talking about the basic protocol of establishing which guest is which. There is none. So far, qemu only dealt with managing just its own guest, and left all multiple guest management to higher levels up the stack (like libvirt). I.e. we really want users to be able to: 1) have it all working with a single guest, without having to specify 'which' guest (qemu PID) to work with. That is the dominant usecase both for developers and for a fair portion of testers. That's reasonable if we can get it working simply. IMO such ease of use is reasonable and required, full stop. If it cannot be gotten simply then that's a bug: either in the code, or in the design, or in the development process that led to the design. Bugs need fixing. 2) Have some reasonable symbolic identification for guests. For example a usable approach would be to have 'perf kvm list', which would list all currently active guests:

$ perf kvm list
[1] Fedora
[2] OpenSuse
[3] Windows-XP
[4] Windows-7

And from that point on 'perf kvm -g OpenSuse record' would do the obvious thing. Users will be able to just use the 'OpenSuse' symbolic name for that guest, even if the guest got restarted and switched its main PID. Any such facility needs trusted enumeration and a protocol where i can trust that the information i got is authoritative. (I.e. 'OpenSuse' truly matches to the OpenSuse session - not to some local user starting up a Qemu instance that claims to be 'OpenSuse'.) Is such a scheme possible/available?
I suspect all the KVM configuration tools (i haven't used them in some time - gui and command-line tools alike) use similar methods to ease guest management? You can do that through libvirt, but that only works for guests started through libvirt. libvirt provides command-line tools to list and manage guests (for example autostarting them on startup), and tools built on top of libvirt can manage guests graphically. Looks like we have a layer inversion here. Maybe we need a plugin system - libvirt drops a .so into perf that teaches it how to list guests and get their symbols. Is libvirt used to start up all KVM guests? If not, if it's only used on some distros while other distros have other solutions then there's apparently no good way to get to such information, and the kernel bits of KVM do not provide it. To the user (and to me) this looks like a KVM bug / missing feature. (and the user doesn't care where the blame is) If that is true then apparently the current KVM design has no technically actionable solution for certain categories of features! Ingo
On 03/16/2010 03:31 PM, Ingo Molnar wrote: You can do that through libvirt, but that only works for guests started through libvirt. libvirt provides command-line tools to list and manage guests (for example autostarting them on startup), and tools built on top of libvirt can manage guests graphically. Looks like we have a layer inversion here. Maybe we need a plugin system - libvirt drops a .so into perf that teaches it how to list guests and get their symbols. Is libvirt used to start up all KVM guests? If not, if it's only used on some distros while other distros have other solutions then there's apparently no good way to get to such information, and the kernel bits of KVM do not provide it. Developers tend to start qemu from the command line, but the majority of users and all distros I know of use libvirt. Some users cobble up their own scripts. To the user (and to me) this looks like a KVM bug / missing feature. (and the user doesn't care where the blame is) If that is true then apparently the current KVM design has no technically actionable solution for certain categories of features! A plugin system allows anyone who is interested to provide the information; they just need to write a plugin for their management tool. Since we can't prevent people from writing management tools, I don't see what else we can do.
Ingo Molnar mi...@elte.hu writes: [...] I.e. we really want users to be able to: 1) have it all working with a single guest, without having to specify 'which' guest (qemu PID) to work with. That is the dominant usecase both for developers and for a fair portion of testers. That's reasonable if we can get it working simply. IMO such ease of use is reasonable and required, full stop. If it cannot be gotten simply then that's a bug: either in the code, or in the design, or in the development process that led to the design. Bugs need fixing. [...] Perhaps the fact that kvm happens to deal with an interesting application area (virtualization) is misleading here. As far as the host kernel or other host userspace is concerned, qemu is just some random unprivileged userspace program (with some *optional* /dev/kvm services that might happen to require temporary root). As such, perf trying to instrument qemu is no different than perf trying to instrument any other userspace widget. Therefore, expecting 'trusted enumeration' of instances is just as sensible as using 'trusted ps' and 'trusted /var/run/FOO.pid files'. - FChE
* Frank Ch. Eigler f...@redhat.com wrote: Ingo Molnar mi...@elte.hu writes: [...] I.e. we really want users to be able to: 1) have it all working with a single guest, without having to specify 'which' guest (qemu PID) to work with. That is the dominant usecase both for developers and for a fair portion of testers. That's reasonable if we can get it working simply. IMO such ease of use is reasonable and required, full stop. If it cannot be gotten simply then that's a bug: either in the code, or in the design, or in the development process that led to the design. Bugs need fixing. [...] Perhaps the fact that kvm happens to deal with an interesting application area (virtualization) is misleading here. As far as the host kernel or other host userspace is concerned, qemu is just some random unprivileged userspace program (with some *optional* /dev/kvm services that might happen to require temporary root). As such, perf trying to instrument qemu is no different than perf trying to instrument any other userspace widget. Therefore, expecting 'trusted enumeration' of instances is just as sensible as using 'trusted ps' and 'trusted /var/run/FOO.pid files'. You are quite mistaken: KVM isn't really a 'random unprivileged application' in this context, it is clearly an extension of system/kernel services. ( Which can be seen from the simple fact that what started the discussion was 'how do we get /proc/kallsyms from the guest'. I.e. an extension of the existing host-space /proc/kallsyms was desired. ) In that sense the most natural 'extension' would be the solution i mentioned a week or two ago: to have a (read only) mount of all guest filesystems, plus a channel for profiling/tracing data. That would make symbol parsing easier and it's what extends the existing 'host space' abstraction in the most natural way. ( It doesn't even have to be done via the kernel - Qemu could implement that via FUSE for example. ) As a second best option a 'symbol server' might be used too.
Thanks, Ingo
Hi - On Tue, Mar 16, 2010 at 04:52:21PM +0100, Ingo Molnar wrote: [...] Perhaps the fact that kvm happens to deal with an interesting application area (virtualization) is misleading here. As far as the host kernel or other host userspace is concerned, qemu is just some random unprivileged userspace program [...] You are quite mistaken: KVM isn't really a 'random unprivileged application' in this context, it is clearly an extension of system/kernel services. I don't know what extension of system/kernel services means in this context, beyond something running on the system/kernel, like every other process. To clarify, to what extent do you consider your classification similarly clear for a host that is running:

* multiple kvm instances run as unprivileged users
* non-kvm OS simulators such as vmware or xen or gdb
* kvm instances running something other than linux

( Which can be seen from the simple fact that what started the discussion was 'how do we get /proc/kallsyms from the guest'. I.e. an extension of the existing host-space /proc/kallsyms was desired. ) (Sorry, that smacks of circular reasoning.) It may be a charming convenience function for perf users to give them shortcuts for certain favoured configurations (kvm running freshest linux), but that says more about perf than kvm. - FChE
* Frank Ch. Eigler f...@redhat.com wrote: Hi - On Tue, Mar 16, 2010 at 04:52:21PM +0100, Ingo Molnar wrote: [...] Perhaps the fact that kvm happens to deal with an interesting application area (virtualization) is misleading here. As far as the host kernel or other host userspace is concerned, qemu is just some random unprivileged userspace program [...] You are quite mistaken: KVM isn't really a 'random unprivileged application' in this context, it is clearly an extension of system/kernel services. I don't know what extension of system/kernel services means in this context, beyond something running on the system/kernel, like every other process. [...] It means something like my example of 'extended to guest space' /proc/kallsyms: [...] ( Which can be seen from the simple fact that what started the discussion was 'how do we get /proc/kallsyms from the guest'. I.e. an extension of the existing host-space /proc/kallsyms was desired. ) (Sorry, that smacks of circular reasoning.) To me it sounds like an example supporting my point. /proc/kallsyms is a service by the kernel, and 'perf kvm' desires this to be extended to guest space as well. Thanks, Ingo
On 03/16/2010 08:08 AM, Ingo Molnar wrote: * Avi Kivity a...@redhat.com wrote: On 03/16/2010 02:29 PM, Ingo Molnar wrote: I mean, i can trust a kernel service and i can trust /proc/kallsyms. Can perf trust a random process claiming to be Qemu? What's the trust mechanism here? Obviously you can't trust anything you get from a guest, no matter how you get it. I'm not talking about the symbol strings and addresses, and the object contents for annotation (or debuginfo). I'm talking about the basic protocol of establishing which guest is which. I.e. we really want users to be able to: 1) have it all working with a single guest, without having to specify 'which' guest (qemu PID) to work with. That is the dominant usecase both for developers and for a fair portion of testers. You're making too many assumptions. There is no list of guests any more than there is a list of web browsers. You can have a multi-tenant scenario where you have distinct groups of virtual machines running as unprivileged users. 2) Have some reasonable symbolic identification for guests. For example a usable approach would be to have 'perf kvm list', which would list all currently active guests:

$ perf kvm list
[1] Fedora
[2] OpenSuse
[3] Windows-XP
[4] Windows-7

And from that point on 'perf kvm -g OpenSuse record' would do the obvious thing. Users will be able to just use the 'OpenSuse' symbolic name for that guest, even if the guest got restarted and switched its main PID. Does perf kvm list always run as root? What if two unprivileged users both have a VM named Fedora? If we look at the use-case, it's going to be something like, a user is creating virtual machines and wants to get performance information about them. Having to run a separate tool like perf is not going to be what they would expect they had to do. Instead, they would either use their existing GUI tool (like virt-manager) or they would use their management interface (either QMP or libvirt).
The complexity of interaction is due to the fact that perf shouldn't be a standalone tool. It should be a library or something with a programmatic interface that another tool can make use of. Regards, Anthony Liguori
On 03/16/2010 10:52 AM, Ingo Molnar wrote: You are quite mistaken: KVM isn't really a 'random unprivileged application' in this context, it is clearly an extension of system/kernel services. ( Which can be seen from the simple fact that what started the discussion was 'how do we get /proc/kallsyms from the guest'. I.e. an extension of the existing host-space /proc/kallsyms was desired. ) Random tools (like perf) should not be able to do what you describe. It's a security nightmare. If it's desirable to have /proc/kallsyms available, we can expose an interface in QEMU to provide that. That can then be plumbed through libvirt and QMP. Then a management tool can use libvirt or QMP to obtain that information and interact with the kernel appropriately. In that sense the most natural 'extension' would be the solution i mentioned a week or two ago: to have a (read only) mount of all guest filesystems, plus a channel for profiling/tracing data. That would make symbol parsing easier and it's what extends the existing 'host space' abstraction in the most natural way. ( It doesn't even have to be done via the kernel - Qemu could implement that via FUSE for example. ) No way. The guest has sensitive data and exposing it widely on the host is a bad thing to do. It's a bad interface. We can expose specific information about guests but only through our existing channels which are validated through a security infrastructure. Ultimately, your goal is to keep perf a simple tool with few dependencies. But practically speaking, if you want to add features to it, it's going to have to interact with other subsystems in the appropriate way. That means, it's going to need to interact with libvirt or QMP.
If you want all applications to expose their data via synthetic file systems, then there's always plan9 :-) Regards, Anthony Liguori
* Anthony Liguori anth...@codemonkey.ws wrote: On 03/16/2010 08:08 AM, Ingo Molnar wrote: * Avi Kivity a...@redhat.com wrote: On 03/16/2010 02:29 PM, Ingo Molnar wrote: I mean, i can trust a kernel service and i can trust /proc/kallsyms. Can perf trust a random process claiming to be Qemu? What's the trust mechanism here? Obviously you can't trust anything you get from a guest, no matter how you get it. I'm not talking about the symbol strings and addresses, and the object contents for annotation (or debuginfo). I'm talking about the basic protocol of establishing which guest is which. I.e. we really want users to be able to: 1) have it all working with a single guest, without having to specify 'which' guest (qemu PID) to work with. That is the dominant usecase both for developers and for a fair portion of testers. You're making too many assumptions. There is no list of guests any more than there is a list of web browsers. You can have a multi-tenant scenario where you have distinct groups of virtual machines running as unprivileged users. multi-tenant and groups are not a valid excuse at all for giving crappy technology in the simplest case: when there's a single VM. Yes, eventually it can be supported and any sane scheme will naturally support it too, but it's by no means what we care about primarily when it comes to these tools. I thought everyone learned the lesson behind SystemTap's failure (and to a certain degree this was behind Oprofile's failure as well): when it comes to tooling/instrumentation we don't want to concentrate on the fancy complex setups and abstract requirements drawn up by CIOs, as development isn't being done there. Concentrate on our developers today, and provide no-compromises usability to those who contribute stuff. If we don't help make the simplest (and most common) use-case convenient then we are failing on a fundamental level. 2) Have some reasonable symbolic identification for guests.
For example a usable approach would be to have 'perf kvm list', which would list all currently active guests:

$ perf kvm list
[1] Fedora
[2] OpenSuse
[3] Windows-XP
[4] Windows-7

And from that point on 'perf kvm -g OpenSuse record' would do the obvious thing. Users will be able to just use the 'OpenSuse' symbolic name for that guest, even if the guest got restarted and switched its main PID. Does perf kvm list always run as root? What if two unprivileged users both have a VM named Fedora? Again, the single-VM case is the most important case, by far. If you have multiple VMs running and want to develop the kernel on multiple VMs (sounds rather messy if you think it through ...), what would happen is similar to what happens when we have two probes for example:

# perf probe schedule
Added new event:
probe:schedule (on schedule+0)
You can now use it on all perf tools, such as:
perf record -e probe:schedule -a sleep 1

# perf probe -f schedule
Added new event:
probe:schedule_1 (on schedule+0)
You can now use it on all perf tools, such as:
perf record -e probe:schedule_1 -a sleep 1

# perf probe -f schedule
Added new event:
probe:schedule_2 (on schedule+0)
You can now use it on all perf tools, such as:
perf record -e probe:schedule_2 -a sleep 1

Something similar could be used for KVM/Qemu: whichever got created first is named 'Fedora', the second is named 'Fedora-2'. If we look at the use-case, it's going to be something like, a user is creating virtual machines and wants to get performance information about them. Having to run a separate tool like perf is not going to be what they would expect they had to do. Instead, they would either use their existing GUI tool (like virt-manager) or they would use their management interface (either QMP or libvirt). The complexity of interaction is due to the fact that perf shouldn't be a standalone tool. It should be a library or something with a programmatic interface that another tool can make use of. But ...
a GUI interface/integration is of course possible too, and it's being worked on. perf is mainly a kernel developer tool, and kernel developers generally don't use GUIs to do their stuff: which is the (sole) reason why the first ~850 commits of tools/perf/ were done without a GUI. We go where our developers are. In any case it's not an excuse to have no proper command-line tooling. In fact if you cannot get simpler, more atomic command-line tooling right then you'll probably doubly suck at doing a GUI as well. Ingo
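The naming scheme Ingo proposes (the first guest keeps its bare name, later duplicates get a numeric suffix, mirroring perf probe's schedule/schedule_1 behavior) is easy to pin down. A sketch, with the function name invented for illustration:

```python
def assign_names(requested):
    """De-duplicate guest names the way Ingo sketches: the first
    'Fedora' keeps its name, later ones become 'Fedora-2', 'Fedora-3'."""
    seen = {}
    out = []
    for name in requested:
        seen[name] = seen.get(name, 0) + 1
        out.append(name if seen[name] == 1 else f"{name}-{seen[name]}")
    return out

print(assign_names(["Fedora", "OpenSuse", "Fedora", "Fedora"]))
# → ['Fedora', 'OpenSuse', 'Fedora-2', 'Fedora-3']
```

Note this keys only on creation order, which sidesteps the two-users-named-Fedora question only within one namespace; across unprivileged users the enumeration problem from earlier in the thread remains.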
* Anthony Liguori aligu...@linux.vnet.ibm.com wrote: On 03/16/2010 10:52 AM, Ingo Molnar wrote: You are quite mistaken: KVM isnt really a 'random unprivileged application' in this context, it is clearly an extension of system/kernel services. ( Which can be seen from the simple fact that what started the discussion was 'how do we get /proc/kallsyms from the guest'. I.e. an extension of the existing host-space /proc/kallsyms was desired. ) Random tools (like perf) should not be able to do what you describe. It's a security nightmare. A security nightmare exactly how? Mind to go into details as i dont understand your point. If it's desirable to have /proc/kallsyms available, we can expose an interface in QEMU to provide that. That can then be plumbed through libvirt and QMP. Then a management tool can use libvirt or QMP to obtain that information and interact with the kernel appropriately. In that sense the most natural 'extension' would be the solution i mentioned a week or two ago: to have a (read only) mount of all guest filesystems, plus a channel for profiling/tracing data. That would make symbol parsing easier and it's what extends the existing 'host space' abstraction in the most natural way. ( It doesnt even have to be done via the kernel - Qemu could implement that via FUSE for example. ) No way. The guest has sensitive data and exposing it widely on the host is a bad thing to do. [...] Firstly, you are putting words into my mouth, as i said nothing about 'exposing it widely'. I suggest exposing it under the privileges of whoever has access to the guest image. Secondly, regarding confidentiality, and this is guest security 101: whoever can access the image on the host _already_ has access to all the guest data! A Linux image can generally be loopback mounted straight away: losetup -o 32256 /dev/loop0 ./guest-image.img mount -o ro /dev/loop0 /mnt-guest (Or, if you are an unprivileged user who cannot mount, it can be read via ext2 tools.) 
There's nothing the guest can do about that. The host is in total control of guest image data for heaven's sake! All i'm suggesting is to make what is already possible more convenient. Ingo
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On 03/16/2010 12:52 PM, Ingo Molnar wrote: * Anthony Liguorialigu...@linux.vnet.ibm.com wrote: On 03/16/2010 10:52 AM, Ingo Molnar wrote: You are quite mistaken: KVM isnt really a 'random unprivileged application' in this context, it is clearly an extension of system/kernel services. ( Which can be seen from the simple fact that what started the discussion was 'how do we get /proc/kallsyms from the guest'. I.e. an extension of the existing host-space /proc/kallsyms was desired. ) Random tools (like perf) should not be able to do what you describe. It's a security nightmare. A security nightmare exactly how? Mind to go into details as i dont understand your point. Assume you're using SELinux to implement mandatory access control. How do you label this file system? Generally speaking, we don't know the difference between /proc/kallsyms vs. /dev/mem if we do generic passthrough. While it might be safe to have a relaxed label of kallsyms (since it's read only), it's clearly not safe to do that for /dev/mem, /etc/shadow, or any file containing sensitive information. Rather, we ought to expose a higher level interface that we have more confidence in with respect to understanding the ramifications of exposing that guest data. No way. The guest has sensitive data and exposing it widely on the host is a bad thing to do. [...] Firstly, you are putting words into my mouth, as i said nothing about 'exposing it widely'. I suggest exposing it under the privileges of whoever has access to the guest image. That doesn't work as nicely with SELinux. It's completely reasonable to have a user that can interact in a read only mode with a VM via libvirt but cannot read the guest's disk images or the guest's memory contents. Secondly, regarding confidentiality, and this is guest security 101: whoever can access the image on the host _already_ has access to all the guest data! 
A Linux image can generally be loopback mounted straight away:

  losetup -o 32256 /dev/loop0 ./guest-image.img
  mount -o ro /dev/loop0 /mnt-guest

(Or, if you are an unprivileged user who cannot mount, it can be read via ext2 tools.) There's nothing the guest can do about that. The host is in total control of guest image data for heaven's sake! It's not that simple in a MAC environment. Regards, Anthony Liguori All i'm suggesting is to make what is already possible more convenient. Ingo
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
* Anthony Liguori aligu...@linux.vnet.ibm.com wrote: On 03/16/2010 12:52 PM, Ingo Molnar wrote: * Anthony Liguorialigu...@linux.vnet.ibm.com wrote: On 03/16/2010 10:52 AM, Ingo Molnar wrote: You are quite mistaken: KVM isnt really a 'random unprivileged application' in this context, it is clearly an extension of system/kernel services. ( Which can be seen from the simple fact that what started the discussion was 'how do we get /proc/kallsyms from the guest'. I.e. an extension of the existing host-space /proc/kallsyms was desired. ) Random tools (like perf) should not be able to do what you describe. It's a security nightmare. A security nightmare exactly how? Mind to go into details as i dont understand your point. Assume you're using SELinux to implement mandatory access control. How do you label this file system? Generally speaking, we don't know the difference between /proc/kallsyms vs. /dev/mem if we do generic passthrough. While it might be safe to have a relaxed label of kallsyms (since it's read only), it's clearly not safe to do that for /dev/mem, /etc/shadow, or any file containing sensitive information. What's your _point_? Please outline a threat model, a vector of attack, _anything_ that substantiates your it's a security nightmare claim. Rather, we ought to expose a higher level interface that we have more confidence in with respect to understanding the ramifications of exposing that guest data. Exactly, we want something that has a flexible namespace and works well with Linux tools in general. Preferably that namespace should be human readable, and it should be hierarchic, and it should have a well-known permission model. This concept exists in Linux and is generally called a 'filesystem'. No way. The guest has sensitive data and exposing it widely on the host is a bad thing to do. [...] Firstly, you are putting words into my mouth, as i said nothing about 'exposing it widely'. 
I suggest exposing it under the privileges of whoever has access to the guest image. That doesn't work as nicely with SELinux. It's completely reasonable to have a user that can interact in a read only mode with a VM via libvirt but cannot read the guest's disk images or the guest's memory contents. If a user cannot read the image file then the user has no access to its contents via other namespaces either. That is, of course, a basic security aspect. ( That is perfectly true with a non-SELinux Unix permission model as well, and is true in the SELinux case as well. ) Secondly, regarding confidentiality, and this is guest security 101: whoever can access the image on the host _already_ has access to all the guest data! A Linux image can generally be loopback mounted straight away: losetup -o 32256 /dev/loop0 ./guest-image.img mount -o ro /dev/loop0 /mnt-guest (Or, if you are an unprivileged user who cannot mount, it can be read via ext2 tools.) There's nothing the guest can do about that. The host is in total control of guest image data for heaven's sake! It's not that simple in a MAC environment. Erm. Please explain to me, what exactly is 'not that simple' in a MAC environment? Also, i'd like to note that the 'restrictive SELinux setups' usecases are pretty secondary. To demonstrate that, i'd like every KVM developer on this list who reads this mail and who has their home development system where they produce their patches set up in a restrictive MAC environment, in that you cannot even read the images you are using, to chime in with a I'm doing that reply. If there's just a _single_ KVM developer amongst dozens and dozens of developers on this list who develops in an environment like that i'd be surprised. That result should pretty much tell you where the weight of instrumentation focus should lie - and it isnt on restrictive MAC environments ... 
Ingo
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On Tue, Mar 16, 2010 at 12:25:00PM +0100, Ingo Molnar wrote: Hm, that sounds rather messy if we want to use it to basically expose kernel functionality in a guest/host unified way. Is the qemu process discoverable in some secure way? Can we trust it? Is there some proper tooling available to do it, or do we have to push it through 2-3 packages to get such a useful feature done? Since we want to implement a pmu usable for the guest anyway, why don't we just use the guest's perf to get all information we want? If we get a pmu-nmi from the guest we just re-inject it to the guest, and perf in the guest gives us all information we want, including kernel and userspace symbols, stack traces, and so on. In the previous thread we discussed a direct trace channel between guest and host kernel (which can be used for ftrace events for example). This channel could be used to transport this information to the host kernel. The only additional feature needed is a way for the host to start a perf instance in the guest. Opinions? Joerg
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
Joerg Roedel wrote: On Tue, Mar 16, 2010 at 12:25:00PM +0100, Ingo Molnar wrote: Hm, that sounds rather messy if we want to use it to basically expose kernel functionality in a guest/host unified way. Is the qemu process discoverable in some secure way? Can we trust it? Is there some proper tooling available to do it, or do we have to push it through 2-3 packages to get such a useful feature done? Since we want to implement a pmu usable for the guest anyway, why don't we just use the guest's perf to get all information we want? If we get a pmu-nmi from the guest we just re-inject it to the guest, and perf in the guest gives us all information we want, including kernel and userspace symbols, stack traces, and so on. I guess this aims to get information from old environments running on kvm for life extension :) In the previous thread we discussed a direct trace channel between guest and host kernel (which can be used for ftrace events for example). This channel could be used to transport this information to the host kernel. Interesting! I know the people who are trying to do that with systemtap. See, http://vesper.sourceforge.net/ The only additional feature needed is a way for the host to start a perf instance in the guest. # ssh localguest perf record --host-channel ... ? B-) Thank you, Opinions? Joerg -- Masami Hiramatsu e-mail: mhira...@redhat.com
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On 03/16/2010 01:28 PM, Ingo Molnar wrote: * Anthony Liguori <aligu...@linux.vnet.ibm.com> wrote: On 03/16/2010 12:52 PM, Ingo Molnar wrote: * Anthony Liguori <aligu...@linux.vnet.ibm.com> wrote: On 03/16/2010 10:52 AM, Ingo Molnar wrote: You are quite mistaken: KVM isnt really a 'random unprivileged application' in this context, it is clearly an extension of system/kernel services. ( Which can be seen from the simple fact that what started the discussion was 'how do we get /proc/kallsyms from the guest'. I.e. an extension of the existing host-space /proc/kallsyms was desired. ) Random tools (like perf) should not be able to do what you describe. It's a security nightmare. A security nightmare exactly how? Mind to go into details as i dont understand your point. Assume you're using SELinux to implement mandatory access control. How do you label this file system? Generally speaking, we don't know the difference between /proc/kallsyms vs. /dev/mem if we do generic passthrough. While it might be safe to have a relaxed label of kallsyms (since it's read only), it's clearly not safe to do that for /dev/mem, /etc/shadow, or any file containing sensitive information. What's your _point_? Please outline a threat model, a vector of attack, _anything_ that substantiates your 'it's a security nightmare' claim. You suggested to have a (read only) mount of all guest filesystems. As I described earlier, not all of the information within the guest filesystem has the same level of sensitivity. If you exposed a generic interface like this, it makes it very difficult to delegate privileges. Delegating privileges is important because in a higher-security environment, you may want to prevent a management tool from accessing the VM's disk directly, but still allow it to do basic operations (in particular, to view performance statistics).
Rather, we ought to expose a higher level interface that we have more confidence in with respect to understanding the ramifications of exposing that guest data. Exactly, we want something that has a flexible namespace and works well with Linux tools in general. Preferably that namespace should be human readable, and it should be hierarchic, and it should have a well-known permission model. This concept exists in Linux and is generally called a 'filesystem'. If you want to use a synthetic filesystem as the management interface for qemu, that's one thing. But you suggested exposing the guest filesystem in its entirety and that's what I disagreed with. If a user cannot read the image file then the user has no access to its contents via other namespaces either. That is, of course, a basic security aspect. ( That is perfectly true with a non-SELinux Unix permission model as well, and is true in the SELinux case as well. ) I don't think that's reasonable at all. The guest may encrypt its disk image. It still ought to be possible to run perf against that guest, no? Erm. Please explain to me, what exactly is 'not that simple' in a MAC environment? Also, i'd like to note that the 'restrictive SELinux setups' usecases are pretty secondary. To demonstrate that, i'd like every KVM developer on this list who reads this mail and who has their home development system where they produce their patches set up in a restrictive MAC environment, in that you cannot even read the images you are using, to chime in with an 'I'm doing that' reply. My home system doesn't run SELinux but I work daily with systems that are using SELinux. I want to be able to run tools like perf on these systems because ultimately, I need to debug these systems on a daily basis. But that's missing the point. We want to have an interface that works for both cases so that we're not maintaining two separate interfaces. We've rat holed a bit though.
You want:
  1) to run perf kvm list and be able to enumerate KVM guests
  2) for this to Just Work with qemu guests launched from the command line

You could achieve (1) by tying perf to libvirt but that won't work for (2). There are a few practical problems with (2). qemu does not require the user to associate any uniquely identifying information with a VM. We've also optimized the command line use case so that if all you want to do is run a disk image, you just execute qemu foo.img. To satisfy your use case, we would either have to force a user to always specify unique information, which would be less convenient for our users, or we would have to let the name be an optional parameter. As it turns out, we already support qemu -name Fedora foo.img. What we don't do today, but I've been suggesting we should, is automatically create a QMP management socket in a well known location based on the -name parameter when it's specified. That would let a tool like perf Just Work provided that a user specified -name. No one uses -name
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On 03/16/2010 12:39 PM, Ingo Molnar wrote: If we look at the use-case, it's going to be something like, a user is creating virtual machines and wants to get performance information about them. Having to run a separate tool like perf is not going to be what they would expect they had to do. Instead, they would either use their existing GUI tool (like virt-manager) or they would use their management interface (either QMP or libvirt). The complexity of interaction is due to the fact that perf shouldn't be a stand alone tool. It should be a library or something with a programmatic interface that another tool can make use of. But ... a GUI interface/integration is of course possible too, and it's being worked on. perf is mainly a kernel developer tool, and kernel developers generally dont use GUIs to do their stuff: which is the (sole) reason why its first ~850 commits of tools/perf/ were done without a GUI. We go where our developers are. In any case it's not an excuse to have no proper command-line tooling. In fact if you cannot get simpler, more atomic command-line tooling right then you'll probably doubly suck at doing a GUI as well. It's about who owns the user interface. If qemu owns the user interface, then we can satisfy this in a very simple way by adding a perf monitor command. If we have to support third party tools, then it significantly complicates things. Regards, Anthony Liguori Ingo
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
Hi - On Tue, Mar 16, 2010 at 06:04:10PM -0500, Anthony Liguori wrote: [...] The only way to really address this is to change the interaction. Instead of running perf externally to qemu, we should support a perf command in the qemu monitor that can then tie directly to the perf tooling. That gives us the best possible user experience. To what extent could this be solved with less crossing of isolation/abstraction layers, if the perfctr facilities were properly virtualized? That way guests could run perf goo internally. Optionally virt tools on the host side could aggregate data from cooperating self-monitoring guests. - FChE
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On Tue, 2010-03-16 at 11:32 +0200, Avi Kivity wrote: On 03/16/2010 09:48 AM, Zhang, Yanmin wrote: Excellent, support for guest kernel != host kernel is critical (I can't remember the last time I ran same kernels). How would we support multiple guests with different kernels? With the patch, 'perf kvm report --sort pid' could show summary statistics for all guest os instances. Then, use parameter --pid of 'perf kvm record' to collect single problematic instance data. That certainly works, though automatic association of guest data with guest symbols is friendlier. Thanks. Originally, I planned to add a -G parameter to perf. Such like -G :/XXX/XXX/guestkallsyms:/XXX/XXX/modules,8889:/XXX/XXX/guestkallsyms:/XXX/XXX/modules and 8889 are just qemu guest pid. So we could define multiple guest os symbol files. But it seems ugly, and 'perf kvm report --sort pid' and 'perf kvm top --pid' could provide similar functionality.

diff -Nraup linux-2.6_tipmaster0315/arch/x86/kvm/vmx.c linux-2.6_tipmaster0315_perfkvm/arch/x86/kvm/vmx.c
--- linux-2.6_tipmaster0315/arch/x86/kvm/vmx.c	2010-03-16 08:59:11.825295404 +0800
+++ linux-2.6_tipmaster0315_perfkvm/arch/x86/kvm/vmx.c	2010-03-16 09:01:09.976084492 +0800
@@ -26,6 +26,7 @@
 #include <linux/sched.h>
 #include <linux/moduleparam.h>
 #include <linux/ftrace_event.h>
+#include <linux/perf_event.h>
 #include "kvm_cache_regs.h"
 #include "x86.h"
@@ -3632,6 +3633,43 @@ static void update_cr8_intercept(struct
 	vmcs_write32(TPR_THRESHOLD, irr);
 }
+DEFINE_PER_CPU(int, kvm_in_guest) = {0};
+
+static void kvm_set_in_guest(void)
+{
+	percpu_write(kvm_in_guest, 1);
+}
+
+static int kvm_is_in_guest(void)
+{
+	return percpu_read(kvm_in_guest);
+}

There is already PF_VCPU for this. Right, but there is a window between kvm_guest_enter and really running in guest os, where a perf event might overflow. Anyway, the window is very narrow; I will change it to use flag PF_VCPU.
There is also a window between setting the flag and calling 'int $2' where an NMI might happen and be accounted incorrectly. Perhaps separate the 'int $2' into a direct call into perf and another call for the rest of NMI handling. I don't see how it would work on svm though - AFAICT the NMI is held whereas vmx swallows it. I guess NMIs will be disabled until the next IRET so it isn't racy, just tricky. I'm not sure if vmexit does break NMI context or not. Hardware NMI context isn't reentrant until an IRET. YangSheng would like to double check it.

+static struct perf_guest_info_callbacks kvm_guest_cbs = {
+	.is_in_guest		= kvm_is_in_guest,
+	.is_user_mode		= kvm_is_user_mode,
+	.get_guest_ip		= kvm_get_guest_ip,
+	.reset_in_guest		= kvm_reset_in_guest,
+};

Should be in common code, not vmx specific. Right. I discussed with Yangsheng. I will move the above data structures and callbacks to file arch/x86/kvm/x86.c, and add get_ip, a new callback, to kvm_x86_ops. You will need access to the vcpu pointer (kvm_rip_read() needs it); you can put it in a percpu variable. We do so now in a new patch. I guess if it's not null, you know you're in a guest, so no need for PF_VCPU. Good suggestion. Thanks.
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On 03/17/2010 02:41 AM, Frank Ch. Eigler wrote: Hi - On Tue, Mar 16, 2010 at 06:04:10PM -0500, Anthony Liguori wrote: [...] The only way to really address this is to change the interaction. Instead of running perf externally to qemu, we should support a perf command in the qemu monitor that can then tie directly to the perf tooling. That gives us the best possible user experience. To what extent could this be solved with less crossing of isolation/abstraction layers, if the perfctr facilities were properly virtualized? That's the more interesting (by far) usage model. In general guest owners don't have access to the host, and host owners can't (and shouldn't) change guests. Monitoring guests from the host is useful for kvm developers, but less so for users. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On 03/16/2010 07:27 AM, Zhang, Yanmin wrote: From: Zhang, Yanmin <yanmin_zh...@linux.intel.com> Based on the discussion in KVM community, I worked out the patch to support perf to collect guest os statistics from host side. This patch is implemented with Ingo, Peter and some other guys' kind help. Yang Sheng pointed out a critical bug and provided good suggestions with other guys. I really appreciate their kind help. The patch adds new subcommand kvm to perf:

  perf kvm top
  perf kvm record
  perf kvm report
  perf kvm diff

The new perf could profile guest os kernel except guest os user space, but it could summarize guest os user space utilization per guest os. Below are some examples.

1) perf kvm top

  [r...@lkp-ne01 norm]# perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms --guestmodules=/home/ymzhang/guest/modules top

Excellent, support for guest kernel != host kernel is critical (I can't remember the last time I ran same kernels). How would we support multiple guests with different kernels? Perhaps a symbol server that perf can connect to (and that would connect to guests in turn)?

diff -Nraup linux-2.6_tipmaster0315/arch/x86/kvm/vmx.c linux-2.6_tipmaster0315_perfkvm/arch/x86/kvm/vmx.c
--- linux-2.6_tipmaster0315/arch/x86/kvm/vmx.c	2010-03-16 08:59:11.825295404 +0800
+++ linux-2.6_tipmaster0315_perfkvm/arch/x86/kvm/vmx.c	2010-03-16 09:01:09.976084492 +0800
@@ -26,6 +26,7 @@
 #include <linux/sched.h>
 #include <linux/moduleparam.h>
 #include <linux/ftrace_event.h>
+#include <linux/perf_event.h>
 #include "kvm_cache_regs.h"
 #include "x86.h"
@@ -3632,6 +3633,43 @@ static void update_cr8_intercept(struct
 	vmcs_write32(TPR_THRESHOLD, irr);
 }
+DEFINE_PER_CPU(int, kvm_in_guest) = {0};
+
+static void kvm_set_in_guest(void)
+{
+	percpu_write(kvm_in_guest, 1);
+}
+
+static int kvm_is_in_guest(void)
+{
+	return percpu_read(kvm_in_guest);
+}

There is already PF_VCPU for this.
+static struct perf_guest_info_callbacks kvm_guest_cbs = {
+	.is_in_guest		= kvm_is_in_guest,
+	.is_user_mode		= kvm_is_user_mode,
+	.get_guest_ip		= kvm_get_guest_ip,
+	.reset_in_guest		= kvm_reset_in_guest,
+};

Should be in common code, not vmx specific.