Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
Em Tue, Mar 23, 2010 at 11:14:41AM +0800, Zhang, Yanmin escreveu:

On Mon, 2010-03-22 at 13:44 -0300, Arnaldo Carvalho de Melo wrote:

Em Mon, Mar 22, 2010 at 03:24:47PM +0800, Zhang, Yanmin escreveu:

On Fri, 2010-03-19 at 09:21 +0100, Ingo Molnar wrote:

Then, perf could access all files. It's possible because a guest OS instance happens to be multi-threading in a process. One of the defects is that access to the guest OS becomes slow or impossible when the guest OS is very busy.

If the MMAP events on the guest included a cookie that could later be used to query for the symtab of that DSO, we wouldn't need to access the guest FS at all, right?

It depends on the specific sub-commands. As for 'perf kvm top', developers want to see the profiling immediately. Even with 'perf kvm record', developers also want to

That is not a problem: if you have the relevant build-ids in your cache (look in your machine at ~/.debug/), it will be as fast as ever. If you use a distro that has its userspace with build-ids, you probably use it always without noticing :-)

see results quickly. At least I'm eager for the results when investigating a performance issue.

Sure thing. With build-ids and debuginfo-install like tools, the symbol resolution could be performed by using the cookies (build-ids) as keys to get to the *-debuginfo packages with matching symtabs (and DWARF for source annotation, etc).

We can't make sure the guest OS uses the same OS images, or we don't know where we could find the original DVD images used to install the guest OS.

You don't have to have guest and host sharing the same OS image, you just have to somehow populate your build-id cache with what you need, be it using sshfs as Ingo is suggesting, or using what your vendor provides (debuginfo packages). And you just have to do it once, for the relevant apps, to have it in your build-id cache.

Current perf does save build-ids, including both the kernel's and other applications' libs/executables.

Yeah, I know, I implemented it.
:-) We have that for the kernel as:

[a...@doppio linux-2.6-tip]$ l /sys/kernel/notes
-r--r--r-- 1 root root 36 2010-03-22 13:14 /sys/kernel/notes
[a...@doppio linux-2.6-tip]$ l /sys/module/ipv6/sections/.note.gnu.build-id
-r--r--r-- 1 root root 4096 2010-03-22 13:38 /sys/module/ipv6/sections/.note.gnu.build-id
[a...@doppio linux-2.6-tip]$

That way we would cover DSOs being reinstalled in long running 'perf record' sessions too.

That's one of the objectives of perf, to support long running sessions. But it doesn't fully support that right now,

as I explained: build-ids are collected at the end of the record session, because we have to open the DSOs that had hits to get the 20-byte cookie we need, the build-id. If we had it in the PERF_RECORD_MMAP record, we would close this race, and the added cost at load time should be minimal: get the ELF section with it and put it somewhere in the task struct. If only we could coalesce it a bit to reclaim this:

[a...@doppio linux-2.6-tip]$ pahole -C task_struct ../build/v2.6.34-rc1-tip+/kernel/sched.o | tail -5
        /* size: 5968, cachelines: 94, members: 150 */
        /* sum members: 5943, holes: 7, sum holes: 25 */
        /* bit holes: 1, sum bit holes: 28 bits */
        /* last cacheline: 16 bytes */
};
[a...@doppio linux-2.6-tip]$

8-)

Or at least get just one of those 4-byte holes; then we could stick it at the end to get our build-id there. Accessing it would be done only at PERF_RECORD_MMAP injection time, i.e. close to the time when we actually are loading the executable mmap, i.e. close to the time when the loader is injecting the build-id, so I guess the extra memory and processing costs would be in the noise. This was discussed some time ago but would require help from the bits that load DSOs. build-ids then would be first class citizens.

- Arnaldo

--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
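As an aside for readers: /sys/kernel/notes above is a raw ELF note blob, and the 20-byte build-id cookie discussed in this mail lives inside it. The sketch below shows, with standard tools only, how such a blob can be picked apart. It is an illustration, not part of the patch: it assumes a little-endian host and that the first note in the blob is the NT_GNU_BUILD_ID note (name "GNU", type 3); a robust tool would iterate over all notes.

```shell
# Sketch: extract a GNU build-id from an ELF note blob such as /sys/kernel/notes.
# Note layout: 4-byte namesz, 4-byte descsz, 4-byte type, padded name, descriptor.
# Assumes little-endian and that the first note is the build-id note.
extract_buildid() {
    blob=$1
    namesz=$(od -A n -t u4 -j 0 -N 4 "$blob" | tr -d ' ')   # length of the note name
    descsz=$(od -A n -t u4 -j 4 -N 4 "$blob" | tr -d ' ')   # length of the descriptor
    pad=$(( (namesz + 3) / 4 * 4 ))                          # name is 4-byte padded
    # the descriptor (the build-id bytes) starts after the 12-byte header + name
    od -A n -t x1 -j $((12 + pad)) -N "$descsz" "$blob" | tr -d ' \n'
    echo
}
```

e.g. `extract_buildid /sys/kernel/notes` would print the running kernel's build-id as a hex string, matching what `perf buildid-list` reports for the kernel.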
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
Joerg Roedel j...@8bytes.org writes:

On Mon, Mar 22, 2010 at 11:59:27AM +0100, Ingo Molnar wrote:

Best would be if you demonstrated any problems of the perf symbol lookup code you are aware of on the host side, as it has that exact design you are criticising here. We are eager to fix any bugs in it. If you claim that it's buggy then that should very much be demonstratable - no need to go into theoretical arguments about it.

I am not claiming anything. I was just trying to imagine how your proposal would look in practice, and forgot that symbol resolution is done at a later point. But even with deferred symbol resolution we need more information from the guest than just the rip falling out of KVM. The guest needs to tell us about the process where the event happened (information that the host has about itself without any hassle) and which executable files it was loaded from.

Slightly tangential, but there is another case that has some of the same problems: profiling other language runtimes than C and C++, say Python. At the moment profilers will generally tell you what is going on inside the python runtime, but not what the python program itself is doing. To fix that problem, it seems like we need some way to have python export what is going on. Maybe the same mechanism could be used to access what is going on in both qemu and python.

Soren
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
Soeren Sandmann sandm...@daimi.au.dk writes:

To fix that problem, it seems like we need some way to have python export what is going on. Maybe the same mechanism could be used to access what is going on in both qemu and python.

oprofile already has an interface to let JITs export information about the JITed code. CPython is not a JIT, but presumably one of the python JITs could do it.

http://oprofile.sourceforge.net/doc/devel/index.html

I know it's not en vogue anymore and you won't be an approved cool kid if you do, but you could just use oprofile? Ok, presumably one would need to do a python interface for this first. I believe it's currently only implemented for Java and Mono. I presume it might work today with IronPython on Mono. IMHO it doesn't make sense to invent another interface for this, although I'm sure someone will propose just that.

-Andi

--
a...@linux.intel.com -- Speaking for myself only.
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
Andi Kleen a...@firstfloor.org writes:

Soeren Sandmann sandm...@daimi.au.dk writes:

To fix that problem, it seems like we need some way to have python export what is going on. Maybe the same mechanism could be used to access what is going on in both qemu and python.

oprofile already has an interface to let JITs export information about the JITed code. CPython is not a JIT, but presumably one of the python JITs could do it.

http://oprofile.sourceforge.net/doc/devel/index.html

It's not that I personally want to profile a particular python program. I'm interested in the more general problem of extracting more information from profiled user space programs than just stack traces. Examples:

- What is going on inside QEMU?
- Which client is the X server servicing?
- What parts of a python/shell/scheme/javascript program are taking the most CPU time?

I don't think the oprofile JIT interface solves any of these problems. (In fact, I don't see why the JIT problem is even hard. The JIT compiler can just generate a little ELF file with symbols in it, and the profiler can pick it up through the mmap events that you get through the perf interface.)

I know it's not en vogue anymore and you won't be an approved cool kid if you do, but you could just use oprofile?

I am bringing this up because I want to extend sysprof to be more useful.

Soren
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
Em Tue, Mar 23, 2010 at 02:49:01PM +0100, Andi Kleen escreveu:

Soeren Sandmann sandm...@daimi.au.dk writes:

To fix that problem, it seems like we need some way to have python export what is going on. Maybe the same mechanism could be used to access what is going on in both qemu and python.

oprofile already has an interface to let JITs export information about the JITed code. CPython is not a JIT, but presumably one of the python JITs could do it.

http://oprofile.sourceforge.net/doc/devel/index.html

I know it's not en vogue anymore and you won't be an approved cool kid if you do, but you could just use oprofile?

perf also has support for this, and Pekka Enberg's jato uses it:

http://penberg.blogspot.com/2009/06/jato-has-profiler.html

- Arnaldo
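For readers unfamiliar with the perf-side mechanism being referred to: perf can resolve symbols in JITed code from a per-process map file, /tmp/perf-&lt;pid&gt;.map, with one "start size name" line (hex start address, hex size, symbol name) per generated function. A minimal sketch of a runtime emitting such a file follows; the addresses and symbol names are made up for illustration.

```shell
# Sketch: the /tmp/perf-<pid>.map convention perf uses for JITed code.
# Each line is "<start-addr-hex> <size-hex> <symbol-name>"; perf consults the
# file of the pid that produced a sample in an otherwise symbol-less mapping.
# Addresses and names below are illustrative, not from a real runtime.
emit_jit_map() {
    pid=$1
    cat > "/tmp/perf-${pid}.map" <<'EOF'
40000000 100 py::fib
40000100 80 py::main
EOF
}
```

A runtime would append a line each time it emits a new function; 'perf report' then shows py::fib instead of an anonymous address.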
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
Soeren Sandmann sandm...@daimi.au.dk writes:

Examples:

- What is going on inside QEMU?

That's something the JIT interface could answer.

- Which client is the X server servicing?
- What parts of a python/shell/scheme/javascript program are taking the most CPU time?

I suspect for those you rather need event based tracers of some sort, similar to kernel trace points. Otherwise you would need your own separate stacks and other complications. systemtap has some effort to use the dtrace instrumentation that crops up in more and more user programs for this. It wouldn't surprise me if that was already in python and other programs you're interested in. I presume right now it only works if you apply the utrace monstrosity, though, but perhaps the new uprobes patches floating around will come to the rescue. There also was some effort to have a pure user space daemon based approach for LTT, but I believe that currently needs its own trace points. Again, I fully expect someone to reinvent the wheel here and afterwards complain about community inefficiencies :-)

I don't think the oprofile JIT interface solves any of these problems. (In fact, I don't see why the JIT problem is even hard. The JIT compiler can just generate a little ELF file with symbols in it, and the profiler can pick it up through the mmap events that you get through the perf interface.)

That would require keeping those temporary ELF files around for a potentially unlimited time (profilers today look at the ELF files at the final analysis phase, which might be weeks away). Also, that would be a lot of overhead for the JIT and most likely be a larger scale rewrite for a given JIT code base.

-Andi

--
a...@linux.intel.com -- Speaking for myself only.
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
Em Tue, Mar 23, 2010 at 03:20:11PM +0100, Andi Kleen escreveu:

Soeren Sandmann sandm...@daimi.au.dk writes:

I don't think the oprofile JIT interface solves any of these problems. (In fact, I don't see why the JIT problem is even hard. The JIT compiler can just generate a little ELF file with symbols in it, and the profiler can pick it up through the mmap events that you get through the perf interface.)

That would require keeping those temporary ELF files around for a potentially unlimited time (profilers today look at the ELF files at the final analysis phase, which might be weeks away).

'perf record' will traverse the perf.data file just collected and, if the binaries have build-ids, will stash them in ~/.debug/, keyed by build-id just like the -debuginfo packages do. So only the binaries with hits. Also, one can use 'perf archive' to create a tar.bz2 file with the files with hits for the specified perf.data file, which can then be transferred to another machine, whatever the arch, untarred at ~/.debug, and then the report can be done there. As it is done by build-id, multiple 'perf record' sessions share files in the cache.

Right now the whole ELF file (or /proc/kallsyms copy) is stored if collected from the DSO directly, or the bits that are stored in -debuginfo files if we find them installed (so smaller). We could strip that down further by storing just the ELF sections needed to make sense of the symtab.

- Arnaldo
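The cross-machine flow Arnaldo describes boils down to a handful of commands, sketched below. The helper function mirrors the build-id-keyed cache layout (first two hex digits as a subdirectory, the rest as the entry name); treat the exact path scheme as an illustrative assumption rather than a spec, since the on-disk layout is an implementation detail of the perf cache.

```shell
# Sketch of the 'perf archive' workflow described above.
# On the recording machine:
#   perf record -a sleep 10
#   perf archive perf.data          # bundles the DSOs with hits, keyed by build-id
#   scp perf.data perf.data.tar.bz2 analysis-box:
# On the analysis machine:
#   tar xjf perf.data.tar.bz2 -C ~/.debug
#   perf report -i perf.data
#
# Helper: compute a cache entry path for a given build-id, assuming the
# common "first two hex chars as a subdirectory" layout.
buildid_link_path() {
    id=$1
    printf '%s/.debug/.build-id/%s/%s\n' "$HOME" \
        "$(printf %s "$id" | cut -c1-2)" "$(printf %s "$id" | cut -c3-)"
}
```

Because entries are keyed by build-id, archives from multiple record sessions can be untarred into the same ~/.debug without colliding.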
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
Soeren Sandmann sandm...@daimi.au.dk writes:

[...]
- What is going on inside QEMU?
- Which client is the X server servicing?
- What parts of a python/shell/scheme/javascript program are taking the most CPU time?
[...]

These kinds of questions usually require navigation through internal data of the user-space process (where in this linked list is this pointer?), and often also correlating them with history (which socket/fd was most recently serviced?). Systemtap excels at letting one express such things.

- FChE
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On Tue, 2010-03-23 at 11:10 -0300, Arnaldo Carvalho de Melo wrote:

Em Tue, Mar 23, 2010 at 02:49:01PM +0100, Andi Kleen escreveu:

Soeren Sandmann sandm...@daimi.au.dk writes:

To fix that problem, it seems like we need some way to have python export what is going on. Maybe the same mechanism could be used to access what is going on in both qemu and python.

oprofile already has an interface to let JITs export information about the JITed code. CPython is not a JIT, but presumably one of the python JITs could do it.

http://oprofile.sourceforge.net/doc/devel/index.html

I know it's not en vogue anymore and you won't be an approved cool kid if you do, but you could just use oprofile?

perf also has support for this, and Pekka Enberg's jato uses it:

http://penberg.blogspot.com/2009/06/jato-has-profiler.html

Right, we need to move that into a library though (always meant to do that, never got around to doing it). That way the app can link against a DSO with weak empty stubs and have perf record LD_PRELOAD a version that has a suitable implementation. That all has the advantage of not exposing the actual interface like we do now.
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On Tue, 2010-03-23 at 10:15 -0300, Arnaldo Carvalho de Melo wrote:

Em Tue, Mar 23, 2010 at 11:14:41AM +0800, Zhang, Yanmin escreveu:

On Mon, 2010-03-22 at 13:44 -0300, Arnaldo Carvalho de Melo wrote:

Em Mon, Mar 22, 2010 at 03:24:47PM +0800, Zhang, Yanmin escreveu:

On Fri, 2010-03-19 at 09:21 +0100, Ingo Molnar wrote:

Then, perf could access all files. It's possible because a guest OS instance happens to be multi-threading in a process. One of the defects is that access to the guest OS becomes slow or impossible when the guest OS is very busy.

If the MMAP events on the guest included a cookie that could later be used to query for the symtab of that DSO, we wouldn't need to access the guest FS at all, right?

It depends on the specific sub-commands. As for 'perf kvm top', developers want to see the profiling immediately. Even with 'perf kvm record', developers also want to

That is not a problem: if you have the relevant build-ids in your cache (look in your machine at ~/.debug/), it will be as fast as ever. If you use a distro that has its userspace with build-ids, you probably use it always without noticing :-)

see results quickly. At least I'm eager for the results when investigating a performance issue.

Sure thing. With build-ids and debuginfo-install like tools, the symbol resolution could be performed by using the cookies (build-ids) as keys to get to the *-debuginfo packages with matching symtabs (and DWARF for source annotation, etc).

We can't make sure the guest OS uses the same OS images, or we don't know where we could find the original DVD images used to install the guest OS.

You don't have to have guest and host sharing the same OS image, you just have to somehow populate your build-id cache with what you need, be it using sshfs as Ingo is suggesting, or using what your vendor provides (debuginfo packages). And you just have to do it once, for the relevant apps, to have it in your build-id cache.
Current perf does save build-ids, including both the kernel's and other applications' libs/executables.

Yeah, I know, I implemented it. :-) We have that for the kernel as:

[a...@doppio linux-2.6-tip]$ l /sys/kernel/notes
-r--r--r-- 1 root root 36 2010-03-22 13:14 /sys/kernel/notes
[a...@doppio linux-2.6-tip]$ l /sys/module/ipv6/sections/.note.gnu.build-id
-r--r--r-- 1 root root 4096 2010-03-22 13:38 /sys/module/ipv6/sections/.note.gnu.build-id
[a...@doppio linux-2.6-tip]$

That way we would cover DSOs being reinstalled in long running 'perf record' sessions too.

That's one of the objectives of perf, to support long running sessions. But it doesn't fully support that right now,

as I explained: build-ids are collected at the end of the record session, because we have to open the DSOs that had hits to get the 20-byte cookie we need, the build-id. If we had it in the PERF_RECORD_MMAP record, we would close this race, and the added cost at load time should be minimal: get the ELF section with it and put it somewhere in the task struct.

Well, you are improving upon perfection.

If only we could coalesce it a bit to reclaim this:

[a...@doppio linux-2.6-tip]$ pahole -C task_struct ../build/v2.6.34-rc1-tip+/kernel/sched.o | tail -5
        /* size: 5968, cachelines: 94, members: 150 */
        /* sum members: 5943, holes: 7, sum holes: 25 */
        /* bit holes: 1, sum bit holes: 28 bits */
        /* last cacheline: 16 bytes */
};
[a...@doppio linux-2.6-tip]$

That reminds me I listened to your presentation at OLS 2007. :)

8-)

Or at least get just one of those 4-byte holes; then we could stick it at the end to get our build-id there. Accessing it would be done only at PERF_RECORD_MMAP injection time, i.e. close to the time when we actually are loading the executable mmap, i.e. close to the time when the loader is injecting the build-id, so I guess the extra memory and processing costs would be in the noise. This was discussed some time ago but would require help from the bits that load DSOs.
build-ids then would be first class citizens.

- Arnaldo
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On Fri, 2010-03-19 at 09:21 +0100, Ingo Molnar wrote:

Nice progress! This bit:

1) perf kvm top

[r...@lkp-ne01 norm]# perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms --guestmodules=/home/ymzhang/guest/modules top

will really be painful to developers - to enter that long line while we have these things called 'computers' that ought to reduce human work. Also, it's incomplete: we need access to the guest system's binaries to do ELF symbol resolution and dwarf decoding.

Yes, I agree with you and Avi that we need the enhancement to be user-friendly. One of my starting points is to keep the tool having less dependency on other components. Admin/developers could write script wrappers quickly if perf has parameters to support the new capability.

So we really need some good, automatic way to get to the guest symbol space, so that if a developer types:

perf kvm top

then the obvious thing happens by default. (Which is to show the guest overhead.) There's no technical barrier on the perf tooling side to implement all that: perf supports build-ids extensively and can deal with multiple symbol spaces - as long as it has access to them. The guest kernel could be ID-ed based on its /sys/kernel/notes and /sys/module/*/notes/.note.gnu.build-id build-ids.

I tried sshfs quickly. sshfs could mount the root filesystem of the guest OS nicely, and I could access the files quickly. However, it doesn't work when I access /proc/ and /sys/, because sshfs/scp depend on file size, while the sizes of most files under /proc/ and /sys/ are 0.

So some sort of --guestmount option would be the natural solution, which points to the guest system's root: and a Qemu enumeration of guest mounts (which would be off by default and configurable) from which perf can pick up the target guest all automatically.
(obviously only under allowed permissions so that such access is secure)

If sshfs could access /proc/ and /sys/ correctly, here is a design: --guestmount points to a directory which consists of a list of sub-directories. Every sub-directory's name is just the qemu process id of a guest OS. Admin/developer mounts every guest OS instance's root directory to the corresponding sub-directory. Then, perf could access all files. It's possible because a guest OS instance happens to be multi-threading in a process. One of the defects is that access to the guest OS becomes slow or impossible when the guest OS is very busy.

This would allow not just kallsyms access via $guest/proc/kallsyms but also gives us the full space of symbol features: access to the guest binaries for annotation and general symbol resolution, command/binary name identification, etc. Such a mount would obviously not broaden existing privileges - and as an additional control a guest would also have a way to indicate that it does not wish a guest mount at all.

Unfortunately, in a previous thread the Qemu maintainer has indicated that he will essentially NAK any attempt to enhance Qemu to provide an easily discoverable, self-contained, transparent guest mount on the host side. No technical justification was given for that NAK, despite my repeated requests to articulate the exact security problems that such an approach would cause. If that NAK does not stand in that form then i'd like to know about it - it makes no sense for us to try to code up a solution against a standing maintainer NAK ...

The other option is some sysadmin level hackery to NFS-mount the guest or so. This is a vastly inferior method that brings us back to the abysmal usability levels of OProfile:

1) it won't be guest transparent
2) it has to be re-done for every guest image
3) even if packaged it has to be gotten into every. single. Linux. distro. separately.
4) old Linux guests won't work out of the box

In other words: it's very inconvenient on multiple levels and won't ever happen on any reasonable enough scale to make a difference to Linux. Which is an unfortunate situation - and the ball is on the KVM/Qemu side so i can do little about it.

Thanks,

Ingo
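The --guestmount layout proposed in this mail (one sub-directory per qemu process id, each holding that guest's mounted root) can be sketched as follows. The mount step and the GUESTROOT path are illustrative assumptions, not part of the patch:

```shell
# Sketch of the proposed --guestmount layout: a root directory containing one
# sub-directory per qemu process id, each holding that guest's mounted /.
# GUESTROOT and the sshfs example are illustrative assumptions.
GUESTROOT=${GUESTROOT:-/tmp/guestmount}

setup_guest_dirs() {
    # create one sub-directory per qemu pid passed in
    for qemu_pid in "$@"; do
        mkdir -p "$GUESTROOT/$qemu_pid"
        # the admin would then mount the guest's root here, e.g.:
        #   sshfs root@guest:/ "$GUESTROOT/$qemu_pid"
    done
}

# perf would then resolve a guest's kernel symbols from
#   "$GUESTROOT/<qemu-pid>/proc/kallsyms"
```

Note that, per the sshfs limitation reported above, a plain sshfs mount would not serve /proc/kallsyms correctly, which is exactly why the thread keeps looking for a better transport.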
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On Sun, Mar 21, 2010 at 07:43:00PM +0100, Ingo Molnar wrote:

Having access to the actual executable files that include the symbols achieves precisely that - with the additional robustness that all this functionality is concentrated in the host, while the guest side is kept minimal (and transparent).

If you want to access the guest's file-system you need a piece of software running in the guest which gives you this access. But when you get an event, this piece of software may not be runnable (if the guest is in an interrupt handler or any other non-preemptible code path). When the host finally gets access to the guest's filesystem again, the source of that event may already be gone (process has exited, module unloaded...). The only way to solve that is to pass the event information to the guest immediately and let it collect the information we want. It can decide whether it exposes the files.

Nor are there any security issues to begin with.

I am not talking about security. Security was sufficiently flamed about already.

You need to be aware of the fact that symbol resolution is a separate step from call chain generation.

The same concern as above applies to call-chain generation too.

How we speak to the guest was already discussed in this thread. My personal opinion is that going through qemu is an unnecessary step and we can solve that more cleverly and transparently for perf.

Meaning exactly what?

Avi was against that, but I think it would make sense to give names to virtual machines (with a default, similar to network interface names). Then we can create a directory in /dev/ with that name (e.g. /dev/vm/fedora/). Inside the guest a (privileged) process can create some kind of named virt-pipe which results in a device file created in the guest's directory (perf could create /dev/vm/fedora/perf for example). This file is used for guest-host communication.
Thanks,

Joerg
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
* Joerg Roedel j...@8bytes.org wrote:

It can decide whether it exposes the files.

Nor are there any security issues to begin with.

I am not talking about security. [...]

You were talking about security, in the portion of your mail that you snipped out, and which i replied to:

2. The guest can decide on its own if it wants to pass this information to the host-perf. No security issues at all.

I understood that portion to mean what it says: that you claim that your proposal 'has no security issues at all', in contrast to my suggestion.

[...] Security was sufficiently flamed about already.

All i saw was my suggestion to allow a guest to securely (and scalably and conveniently) integrate/mount its filesystems to the host, if both sides (both the host and the guest) permit it, to make it easier for instrumentation to pick up symbol details. I.e. if a guest runs, then its filesystem may be present on the host side as:

/guests/Fedora-G1/
/guests/Fedora-G1/proc/
/guests/Fedora-G1/usr/
/guests/Fedora-G1/.../

(This feature would be configurable and would be default-off, to maintain the current status quo.)

i.e. it's a bit like sshfs or NFS or loopback block mounts, just in an integrated and working fashion (sshfs doesn't work well with /proc, for example), more guest transparent (obviously sshfs or NFS exports need per guest configuration), and lower overhead than sshfs/NFS - i.e. without the (unnecessary) networking overhead.

That suggestion was 'countered' by an unsubstantiated claim by Anthony that this kind of usability feature would somehow be a 'security nightmare'. In reality it is just an incremental, more usable, faster and more guest-transparent form of what is already possible today via:

- loopback mounts on the host
- NFS exports
- SMB exports
- sshfs
- (and other mechanisms)

I wish there was at least flaming about it - as flames tend to have at least some specifics in them.
What i saw instead was a claim about a 'security nightmare' which, when i asked for specifics, was followed by deafening silence. And you appear to have repeated that claim here, unwilling to back it up with specifics.

Thanks,

Ingo
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
* Joerg Roedel j...@8bytes.org wrote:

On Sun, Mar 21, 2010 at 07:43:00PM +0100, Ingo Molnar wrote:

Having access to the actual executable files that include the symbols achieves precisely that - with the additional robustness that all this functionality is concentrated in the host, while the guest side is kept minimal (and transparent).

If you want to access the guest's file-system you need a piece of software running in the guest which gives you this access. But when you get an event, this piece of software may not be runnable (if the guest is in an interrupt handler or any other non-preemptible code path). When the host finally gets access to the guest's filesystem again, the source of that event may already be gone (process has exited, module unloaded...). The only way to solve that is to pass the event information to the guest immediately and let it collect the information we want.

The very same is true of profiling in the host space as well (KVM is nothing special here, other than its unreasonable insistence on not enumerating readily available information in a more usable way). So are you suggesting a solution to a perf problem we already solved differently? (And which i argue we solved in a better way.)

We have solved that in the host space already (and quite elaborately so), and not via your suggestion of moving symbol resolution to a different stage, but by properly generating the right events to allow the post-processing stage to see processes that have already exited, to robustly handle files that have been rebuilt, etc.

From an instrumentation POV it is fundamentally better to acquire the right data and delay any complexities to the analysis stage (the perf model) than to complicate sampling (the oprofile dcookies model). Your proposal of 'doing the symbol resolution in the guest context' is in essence re-arguing that very similar point that oprofile lost. Did you really intend to re-argue that point as well?
If yes, then please propose an alternative implementation for everything that perf does wrt. symbol lookups. What we propose for 'perf kvm' right now is simply a straightforward extension of the existing (and well working) symbol handling code to virtualization.

You need to be aware of the fact that symbol resolution is a separate step from call chain generation.

The same concern as above applies to call-chain generation too.

Best would be if you demonstrated any problems of the perf symbol lookup code you are aware of on the host side, as it has that exact design you are criticising here. We are eager to fix any bugs in it. If you claim that it's buggy then that should very much be demonstratable - no need to go into theoretical arguments about it. (You should be aware of the fact that perf currently works with 'processes exiting prematurely' and similar scenarios just fine, so if you want to demonstrate that it's broken you will probably need a different example.)

How we speak to the guest was already discussed in this thread. My personal opinion is that going through qemu is an unnecessary step and we can solve that more cleverly and transparently for perf.

Meaning exactly what?

Avi was against that, but I think it would make sense to give names to virtual machines (with a default, similar to network interface names). Then we can create a directory in /dev/ with that name (e.g. /dev/vm/fedora/). Inside the guest a (privileged) process can create some kind of named virt-pipe which results in a device file created in the guest's directory (perf could create /dev/vm/fedora/perf for example). This file is used for guest-host communication.

That is kind of half of my suggestion - the built-in enumeration of guests and a guaranteed channel to them accessible to tools. (KVM already has its own special channel, so it's not like channels of communication are useless.)
The other half of my suggestion is that if we bring this thought to its logical conclusion then we might as well walk the whole mile and not use quirky, binary-API single-channel pipes. I.e. we could use this convenient, human-readable, structured, hierarchical abstraction to expose information in a fine-grained, scalable way, which has a world-class implementation in Linux: the 'VFS namespace'. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On Mon, Mar 22, 2010 at 11:59:27AM +0100, Ingo Molnar wrote: Best would be if you demonstrated any problems of the perf symbol lookup code you are aware of on the host side, as it has that exact design you are criticising here. We are eager to fix any bugs in it. If you claim that it's buggy then that should very much be demonstrable - no need to go into theoretical arguments about it. I am not claiming anything. I was just trying to imagine what your proposal would look like in practice and forgot that symbol resolution is done at a later point. But even with deferred symbol resolution we need more information from the guest than just the rip falling out of KVM. The guest needs to tell us about the process where the event happened (information that the host has about itself without any hassle) and which executable files it was loaded from. Avi was against that but I think it would make sense to give names to virtual machines (with a default, similar to network interface names). Then we can create a directory in /dev/ with that name (e.g. /dev/vm/fedora/). Inside the guest a (privileged) process can create some kind of named virt-pipe which results in a device file created in the guest's directory (perf could create /dev/vm/fedora/perf for example). This file is used for guest-host communication. That is kind of half of my suggestion - the built-in enumeration of guests and a guaranteed channel to them accessible to tools. (KVM already has its own special channel so it's not like channels of communication are useless.) The other half of my suggestion is that if we bring this thought to its logical conclusion then we might as well walk the whole mile and not use quirky, binary-API single-channel pipes. I.e. we could use this convenient, human-readable, structured, hierarchical abstraction to expose information in a fine-grained, scalable way, which has a world-class implementation in Linux: the 'VFS namespace'. Probably.
At least it is the solution that fits best into the current design of perf. But we should think about how this will be done. Raw disk access is no solution because we need to access virtual file-systems of the guest too. Network filesystems may be a solution but then we come back to the 'deployment-nightmare'. Joerg
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
* Joerg Roedel j...@8bytes.org wrote: On Mon, Mar 22, 2010 at 11:59:27AM +0100, Ingo Molnar wrote: Best would be if you demonstrated any problems of the perf symbol lookup code you are aware of on the host side, as it has that exact design you are criticising here. We are eager to fix any bugs in it. If you claim that it's buggy then that should very much be demonstrable - no need to go into theoretical arguments about it. I am not claiming anything. I was just trying to imagine what your proposal would look like in practice and forgot that symbol resolution is done at a later point. But even with deferred symbol resolution we need more information from the guest than just the rip falling out of KVM. The guest needs to tell us about the process where the event happened (information that the host has about itself without any hassle) and which executable files it was loaded from. Correct - for full information we need a good paravirt perf integration of the kernel bits to pass that through. (I.e. we want to 'integrate' the PID space as well, at least within the perf notion of PIDs.) Initially we can do without that as well. Probably. At least it is the solution that fits best into the current design of perf. But we should think about how this will be done. Raw disk access is no solution because we need to access virtual file-systems of the guest too. [...] I never said anything about 'raw disk access'. Have you seen my proposal of (optional) VFS namespace integration? (It can be found repeated for the Nth time in my mail you replied to) Thanks, Ingo
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
Em Mon, Mar 22, 2010 at 03:24:47PM +0800, Zhang, Yanmin escreveu: On Fri, 2010-03-19 at 09:21 +0100, Ingo Molnar wrote: So some sort of --guestmount option would be the natural solution, which points to the guest system's root: and a Qemu enumeration of guest mounts (which would be off by default and configurable) from which perf can pick up the target guest all automatically. (obviously only under allowed permissions so that such access is secure) If sshfs could access /proc/ and /sys correctly, here is a design: --guestmount points to a directory which consists of a list of sub-directories. Every sub-directory's name is just the qemu process id of guest os. Admin/developer mounts every guest os instance's root directory to corresponding sub-directory. Then, perf could access all files. It's possible because guest os instance happens to be multi-threading in a process. One of the defects is the accessing to guest os becomes slow or impossible when guest os is very busy. If the MMAP events on the guest included a cookie that could later be used to query for the symtab of that DSO, we wouldn't need to access the guest FS at all, right? With build-ids and debuginfo-install like tools the symbol resolution could be performed by using the cookies (build-ids) as keys to get to the *-debuginfo packages with matching symtabs (and DWARF for source annotation, etc). We have that for the kernel as: [a...@doppio linux-2.6-tip]$ l /sys/kernel/notes -r--r--r-- 1 root root 36 2010-03-22 13:14 /sys/kernel/notes [a...@doppio linux-2.6-tip]$ l /sys/module/ipv6/sections/.note.gnu.build-id -r--r--r-- 1 root root 4096 2010-03-22 13:38 /sys/module/ipv6/sections/.note.gnu.build-id [a...@doppio linux-2.6-tip]$ That way we would cover DSOs being reinstalled in long running 'perf record' sessions too. This was discussed some time ago but would require help from the bits that load DSOs. build-ids then would be first class citizens. 
- Arnaldo
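Arnaldo's point about build-ids as cookies can be made concrete: /sys/kernel/notes and the per-module .note.gnu.build-id files contain raw ELF note records, and extracting the build-id is a small amount of parsing. Below is a minimal sketch of that walk; the helper name and interface are hypothetical, not perf's actual code, and 32-bit note fields are read in host byte order.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* An ELF note record is: namesz, descsz, type (three 32-bit words),
 * then the name and descriptor, each padded to a 4-byte boundary.
 * For the GNU build-id note the name is "GNU" and the descriptor
 * holds the raw build-id bytes. */
#define NT_GNU_BUILD_ID 3

static int find_build_id(const unsigned char *buf, size_t len,
                         char *out, size_t outlen)
{
    size_t off = 0;

    while (off + 12 <= len) {
        uint32_t namesz, descsz, type;
        size_t name_off, desc_off, next;

        memcpy(&namesz, buf + off, 4);
        memcpy(&descsz, buf + off + 4, 4);
        memcpy(&type,   buf + off + 8, 4);
        name_off = off + 12;
        desc_off = name_off + ((namesz + 3) & ~(size_t)3);
        next     = desc_off + ((descsz + 3) & ~(size_t)3);
        if (next > len)
            break;
        if (type == NT_GNU_BUILD_ID && namesz == 4 &&
            memcmp(buf + name_off, "GNU", 4) == 0) {
            size_t i;
            /* render the descriptor bytes as a lowercase hex string */
            for (i = 0; i < descsz && 2 * i + 2 < outlen; i++)
                sprintf(out + 2 * i, "%02x", buf[desc_off + i]);
            return 0;
        }
        off = next;    /* not a build-id note: skip to the next record */
    }
    return -1;
}
```

Fed the bytes read from /sys/kernel/notes, such a helper yields exactly the kind of cookie string that can key a debuginfo or buildid-cache lookup.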
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On Mon, 2010-03-22 at 13:44 -0300, Arnaldo Carvalho de Melo wrote: Em Mon, Mar 22, 2010 at 03:24:47PM +0800, Zhang, Yanmin escreveu: On Fri, 2010-03-19 at 09:21 +0100, Ingo Molnar wrote: So some sort of --guestmount option would be the natural solution, which points to the guest system's root: and a Qemu enumeration of guest mounts (which would be off by default and configurable) from which perf can pick up the target guest all automatically. (obviously only under allowed permissions so that such access is secure) If sshfs could access /proc/ and /sys correctly, here is a design: --guestmount points to a directory which consists of a list of sub-directories. Every sub-directory's name is just the qemu process id of a guest os. Admin/developer mounts every guest os instance's root directory to the corresponding sub-directory. Then, perf could access all files. It's possible because a guest os instance happens to be multi-threading in a process. One of the defects is that access to the guest os becomes slow or impossible when the guest os is very busy. If the MMAP events on the guest included a cookie that could later be used to query for the symtab of that DSO, we wouldn't need to access the guest FS at all, right? It depends on the specific subcommands. As for 'perf kvm top', developers want to see the profiling immediately. Even with 'perf kvm record', developers also want to see results quickly. At least I'm eager for the results when investigating a performance issue. With build-ids and debuginfo-install like tools the symbol resolution could be performed by using the cookies (build-ids) as keys to get to the *-debuginfo packages with matching symtabs (and DWARF for source annotation, etc). We can't make sure the guest os uses the same os images, and we may not know where we could find the original DVD images being used to install the guest os. Current perf does save build-ids, including both the kernel's and other applications' libs/executables.
We have that for the kernel as: [a...@doppio linux-2.6-tip]$ l /sys/kernel/notes -r--r--r-- 1 root root 36 2010-03-22 13:14 /sys/kernel/notes [a...@doppio linux-2.6-tip]$ l /sys/module/ipv6/sections/.note.gnu.build-id -r--r--r-- 1 root root 4096 2010-03-22 13:38 /sys/module/ipv6/sections/.note.gnu.build-id [a...@doppio linux-2.6-tip]$ That way we would cover DSOs being reinstalled in long running 'perf record' sessions too. That's one of the objectives of perf, to support long-running sessions. This was discussed some time ago but would require help from the bits that load DSOs. build-ids then would be first class citizens. - Arnaldo
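The buildid cache mentioned earlier in the thread (~/.debug) keys objects by their build-id, with the first two hex characters used as a directory name and the remainder as the entry name. A sketch of mapping a build-id cookie to that cache path follows; the helper name is hypothetical and the layout is as described here, not lifted from perf's sources.

```c
#include <stdio.h>
#include <string.h>

/* Build the buildid-cache path for a given build-id hex string,
 * following the <home>/.debug/.build-id/<2 hex chars>/<rest> scheme.
 * Hypothetical helper, illustrating the lookup described in the thread. */
static int build_id_cache_path(const char *home, const char *build_id,
                               char *out, size_t outlen)
{
    if (strlen(build_id) < 3)
        return -1;    /* need at least the 2-char directory prefix */
    snprintf(out, outlen, "%s/.debug/.build-id/%.2s/%s",
             home, build_id, build_id + 2);
    return 0;
}
```

With such a scheme, populating the cache once (via sshfs, debuginfo packages, or any other means) makes later symbol resolution independent of live access to the guest filesystem.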
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
* Joerg Roedel j...@8bytes.org wrote: On Fri, Mar 19, 2010 at 09:21:22AM +0100, Ingo Molnar wrote: Unfortunately, in a previous thread the Qemu maintainer has indicated that he will essentially NAK any attempt to enhance Qemu to provide an easily discoverable, self-contained, transparent guest mount on the host side. No technical justification was given for that NAK, despite my repeated requests to articulate the exact security problems that such an approach would cause. If that NAK does not stand in that form then i'd like to know about it - it makes no sense for us to try to code up a solution against a standing maintainer NAK ... I still think it is the best and most generic way to let the guest do the symbol resolution. [...] Not really. [...] This has several advantages: 1. The guest knows best about its symbol space. So this would be extensible to other guest operating systems. A brave developer may even implement symbol passing for Windows or the BSDs ;-) Having access to the actual executable files that include the symbols achieves precisely that - with the additional robustness that all this functionality is concentrated into the host, while the guest side is kept minimal (and transparent). 2. The guest can decide on its own if it wants to pass this information to the host-perf. No security issues at all. It can decide whether it exposes the files. Nor are there any security issues to begin with. 3. The guest can also pass us the call-chain and we don't need to care about the complications of fetching it from the guest ourselves. You need to be aware of the fact that symbol resolution is a separate step from call chain generation. I.e. call-chains are an (entirely) separate issue, and could reasonably be done in the guest or in the host. It has no bearing on this symbol resolution question. 4. This way is extensible to nested virtualization too.
Nested virtualization is actually already taken care of by the filesystem solution via an existing method called 'subdirectories'. If the guest offers sub-guests then those symbols will be exposed in a similar way via its own 'guest files' directory hierarchy. I.e. if we have 'Guest-2' nested inside the 'Guest-Fedora-1' instance, we get: /guests/ /guests/Guest-Fedora-1/etc/ /guests/Guest-Fedora-1/usr/ we'd also have: /guests/Guest-Fedora-1/guests/Guest-2/ So this is taken care of automatically. I.e. none of the four 'advantages' listed here are actually advantages over my proposed solution, so your conclusion is subsequently flawed as well. How we speak to the guest was already discussed in this thread. My personal opinion is that going through qemu is an unnecessary step and we can solve it in a more clever and transparent way for perf. Meaning exactly what? Thanks, Ingo
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
Nice progress! This bit: 1) perf kvm top [r...@lkp-ne01 norm]# perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms --guestmodules=/home/ymzhang/guest/modules top will really be painful for developers - having to enter that long line while we have these things called 'computers' that ought to reduce human work. Also, it's incomplete: we need access to the guest system's binaries to do ELF symbol resolution and DWARF decoding. So we really need some good, automatic way to get to the guest symbol space, so that if a developer types: perf kvm top then the obvious thing happens by default. (which is to show the guest overhead) There's no technical barrier on the perf tooling side to implement all that: perf supports build-ids extensively and can deal with multiple symbol spaces - as long as it has access to them. The guest kernel could be ID-ed based on its /sys/kernel/notes and /sys/module/*/notes/.note.gnu.build-id build-ids. So some sort of --guestmount option would be the natural solution, which points to the guest system's root: and a Qemu enumeration of guest mounts (which would be off by default and configurable) from which perf can pick up the target guest all automatically. (obviously only under allowed permissions so that such access is secure) This would allow not just kallsyms access via $guest/proc/kallsyms but also gives us the full space of symbol features: access to the guest binaries for annotation and general symbol resolution, command/binary name identification, etc. Such a mount would obviously not broaden existing privileges - and as an additional control a guest would also have a way to indicate that it does not wish a guest mount at all. Unfortunately, in a previous thread the Qemu maintainer has indicated that he will essentially NAK any attempt to enhance Qemu to provide an easily discoverable, self-contained, transparent guest mount on the host side.
No technical justification was given for that NAK, despite my repeated requests to articulate the exact security problems that such an approach would cause. If that NAK does not stand in that form then i'd like to know about it - it makes no sense for us to try to code up a solution against a standing maintainer NAK ... The other option is some sysadmin-level hackery to NFS-mount the guest or so. This is a vastly inferior method that brings us back to the abysmal usability levels of OProfile: 1) it won't be guest transparent 2) it has to be re-done for every guest image 3) even if packaged it has to be gotten into every. single. Linux. distro. separately. 4) old Linux guests won't work out of the box In other words: it's very inconvenient on multiple levels and won't ever happen on any reasonable enough scale to make a difference to Linux. Which is an unfortunate situation - and the ball is on the KVM/Qemu side so i can do little about it. Thanks, Ingo
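To make the --guestmount idea concrete: once a guest root is visible at a known host path, resolving guest kernel symbols is the same text parsing perf already does on the host's /proc/kallsyms. A minimal sketch, with hypothetical helper names; the path layout follows the one-sub-directory-per-qemu-pid scheme proposed elsewhere in this thread.

```c
#include <stdio.h>

struct ksym {
    unsigned long long addr;
    char type;       /* 'T', 't', 'D', ... as in nm(1) */
    char name[64];
};

/* Build the path to a guest's kallsyms under --guestmount, assuming
 * the proposed layout of one sub-directory per qemu process id. */
static void guest_kallsyms_path(const char *guestmount, int qemu_pid,
                                char *out, size_t outlen)
{
    snprintf(out, outlen, "%s/%d/proc/kallsyms", guestmount, qemu_pid);
}

/* Parse one kallsyms line: "<hex addr> <type> <name> [module]". */
static int parse_kallsyms_line(const char *line, struct ksym *s)
{
    return sscanf(line, "%llx %c %63s", &s->addr, &s->type, s->name) == 3
        ? 0 : -1;
}
```

Sorting the parsed symbols by address then turns a sampled guest rip into a name with an ordinary binary search, exactly as on the host side.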
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On Fri, Mar 19, 2010 at 09:21:22AM +0100, Ingo Molnar wrote: Unfortunately, in a previous thread the Qemu maintainer has indicated that he will essentially NAK any attempt to enhance Qemu to provide an easily discoverable, self-contained, transparent guest mount on the host side. No technical justification was given for that NAK, despite my repeated requests to articulate the exact security problems that such an approach would cause. If that NAK does not stand in that form then i'd like to know about it - it makes no sense for us to try to code up a solution against a standing maintainer NAK ... I still think it is the best and most generic way to let the guest do the symbol resolution. This has several advantages: 1. The guest knows best about its symbol space. So this would be extensible to other guest operating systems. A brave developer may even implement symbol passing for Windows or the BSDs ;-) 2. The guest can decide on its own if it wants to pass this information to the host-perf. No security issues at all. 3. The guest can also pass us the call-chain and we don't need to care about the complications of fetching it from the guest ourselves. 4. This way is extensible to nested virtualization too. How we speak to the guest was already discussed in this thread. My personal opinion is that going through qemu is an unnecessary step and we can solve it in a more clever and transparent way for perf. Joerg
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On Thu, 2010-03-18 at 10:45 +0800, Zhang, Yanmin wrote: On Wed, 2010-03-17 at 17:26 +0800, Zhang, Yanmin wrote: On Tue, 2010-03-16 at 10:47 +0100, Ingo Molnar wrote: * Zhang, Yanmin yanmin_zh...@linux.intel.com wrote: On Tue, 2010-03-16 at 15:48 +0800, Zhang, Yanmin wrote: On Tue, 2010-03-16 at 07:41 +0200, Avi Kivity wrote: On 03/16/2010 07:27 AM, Zhang, Yanmin wrote: From: Zhang, Yanmin yanmin_zh...@linux.intel.com Based on the discussion in the KVM community, I worked out the patch to support perf to collect guest os statistics from the host side. This patch is implemented with Ingo, Peter and some other guys' kind help. Yang Sheng pointed out a critical bug and provided good suggestions with other guys. I really appreciate their kind help. The patch adds the new subcommand kvm to perf: perf kvm top perf kvm record perf kvm report perf kvm diff The new perf could profile the guest os kernel except guest os user space, but it could summarize guest os user space utilization per guest os. Below are some examples. 1) perf kvm top [r...@lkp-ne01 norm]# perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms --guestmodules=/home/ymzhang/guest/modules top Thanks for your kind comments. Excellent, support for guest kernel != host kernel is critical (I can't remember the last time I ran the same kernels). How would we support multiple guests with different kernels? With the patch, 'perf kvm report --sort pid' could show summary statistics for all guest os instances. Then, use the --pid parameter of 'perf kvm record' to collect data for a single problematic instance. Sorry. I found that currently --pid isn't a process but a thread (main thread). Ingo, Is it possible to support a new parameter or extend --inherit, so 'perf record' and 'perf top' could collect data on all threads of a process while the process is running? If not, I need to add a new ugly parameter which is similar to --pid to filter out process data in userspace. Yeah.
For maximum utility i'd suggest to extend --pid to include this, and introduce --tid for the previous, limited-to-a-single-task functionality. Most users would expect --pid to work like a 'late attach' - i.e. to work like strace -f or like a gdb attach. Thanks Ingo, Avi. I worked out the below patch against tip/master of March 15th. Subject: [PATCH] Change perf's parameter --pid to process-wide collection From: Zhang, Yanmin yanmin_zh...@linux.intel.com Change parameter -p (--pid) to real process pid and add -t (--tid) meaning thread id. Now, --pid means perf collects the statistics of all threads of the process, while --tid means perf just collects the statistics of that thread. BTW, the patch fixes a bug of 'perf stat -p'. 'perf stat' always configures attr->disabled = 1 if it isn't a system-wide collection. If there is a '-p' and no forks, 'perf stat -p' doesn't collect any data. In addition, the while (!done) loop in run_perf_stat consumes 100% of a single cpu's time, which has a bad impact on the running workload. I added a sleep(1) in the loop. Signed-off-by: Zhang Yanmin yanmin_zh...@linux.intel.com Ingo, Sorry, the patch has bugs. I need to do a better job and will work out 2 separate patches against the 2 issues. I worked out 3 new patches against the tip/master tree of Mar. 17th. 1) Patch perf_stat: Fix the issue that perf doesn't enable counters when target_pid != -1. Change the condition to fork/exec the subcommand. If there is a subcommand parameter, perf always forks/execs it. The usage example is: #perf stat -a sleep 10 So this command could collect statistics for 10 seconds precisely. User still could stop it by CTRL+C. 2) Patch perf_record: Fix the issue that when perf forks/execs a subcommand, it should enable all counters after the new process starts execing. Change the condition to fork/exec the subcommand. If there is a subcommand parameter, perf always forks/execs it. The usage example is: #perf record -f -a sleep 10 So this command could collect statistics for 10 seconds precisely.
User still could stop it by CTRL+C. 3) perf_pid: Change parameter --pid to process-wide collection. Add --tid which means collecting thread-wide statistics. Usage example is: #perf top -p #perf record -p -f sleep 10 #perf stat -p -f sleep 10 Arnaldo, Pls. apply the 3 attached patches. Yanmin diff -Nraup linux-2.6_tipmaster0317/tools/perf/builtin-stat.c linux-2.6_tipmaster0317_fixstat/tools/perf/builtin-stat.c --- linux-2.6_tipmaster0317/tools/perf/builtin-stat.c 2010-03-18 09:04:40.938289813 +0800 +++ linux-2.6_tipmaster0317_fixstat/tools/perf/builtin-stat.c 2010-03-18
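The process-wide --pid semantics discussed in this exchange boil down to enumerating every thread of the target and attaching a counter per tid. On Linux the thread list is visible under /proc/<pid>/task; a sketch of that enumeration step follows (the helper name is mine and this is not the actual perf code, just the information it would need to gather):

```c
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>

/* Collect the thread ids of a process by listing /proc/<pid>/task.
 * Returns the number of tids stored in tids[], or -1 on error.
 * Hypothetical sketch of the enumeration a process-wide --pid needs. */
static int list_threads(int pid, int *tids, int max)
{
    char path[64];
    struct dirent *ent;
    DIR *dir;
    int n = 0;

    snprintf(path, sizeof(path), "/proc/%d/task", pid);
    dir = opendir(path);
    if (!dir)
        return -1;
    while ((ent = readdir(dir)) != NULL && n < max) {
        int tid = atoi(ent->d_name);
        if (tid > 0)            /* skip "." and ".." entries */
            tids[n++] = tid;
    }
    closedir(dir);
    return n;
}
```

A tool would then open one counter per returned tid, which is what distinguishes the process-wide --pid behavior from the single-task --tid behavior.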
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
* Zhang, Yanmin yanmin_zh...@linux.intel.com wrote: I worked out 3 new patches against tip/master tree of Mar. 17th. Cool! Mind sending them as a series of patches instead of attachments? That makes it easier to review them. Also, the Signed-off-by lines seem to be missing, plus we need a per-patch changelog as well. Thanks, Ingo
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On 03/17/2010 07:41 PM, Sheng Yang wrote: On Thursday 18 March 2010 13:22:28 Sheng Yang wrote: On Thursday 18 March 2010 12:50:58 Zachary Amsden wrote: On 03/17/2010 03:19 PM, Sheng Yang wrote: On Thursday 18 March 2010 05:14:52 Zachary Amsden wrote: On 03/16/2010 11:28 PM, Sheng Yang wrote: On Wednesday 17 March 2010 10:34:33 Zhang, Yanmin wrote: On Tue, 2010-03-16 at 11:32 +0200, Avi Kivity wrote: On 03/16/2010 09:48 AM, Zhang, Yanmin wrote: Right, but there is a scope between kvm_guest_enter and really running in the guest os, where a perf event might overflow. Anyway, the scope is very narrow, I will change it to use the flag PF_VCPU. There is also a window between setting the flag and calling 'int $2' where an NMI might happen and be accounted incorrectly. Perhaps separate the 'int $2' into a direct call into perf and another call for the rest of NMI handling. I don't see how it would work on svm though - AFAICT the NMI is held whereas vmx swallows it. I guess NMIs will be disabled until the next IRET so it isn't racy, just tricky. I'm not sure whether vmexit breaks the NMI context or not. Hardware NMI context isn't reentrant till an IRET. YangSheng would like to double check it. After more checking, I think VMX won't retain the NMI blocked state for the host. That means, if an NMI happens while the processor is in VMX non-root mode, it would only result in a VMExit, with a reason indicating that it's due to an NMI, but no further state change in the host. So in that sense, there _is_ a window between VMExit and KVM handling the NMI. Moreover, I think we _can't_ stop the re-entrance of NMI handling code because int $2 doesn't block a following NMI. And if the NMI sequence is not important (I think so), then we need to generate a real NMI in the current vmexit-after code. Having the APIC send an NMI IPI to itself seems a good idea. I am debugging a patch based on apic->send_IPI_self(NMI_VECTOR) to replace int $2. Something unexpected is happening...
You can't use the APIC to send vectors 0x00-0x1f, or at least, aren't supposed to be able to. Um? Why? Especially as the kernel is already using it to deliver NMIs. That's the only defined case, and it is defined because the vector field is ignored for DM_NMI. Vol 3A (exact section numbers may vary depending on your version): 8.5.1 / 8.6.1 '100 (NMI) Delivers an NMI interrupt to the target processor or processors. The vector information is ignored' 8.5.2 Valid Interrupt Vectors 'Local and I/O APICs support 240 of these vectors (in the range of 16 to 255) as valid interrupts.' 8.8.4 Interrupt Acceptance for Fixed Interrupts '...; vectors 0 through 15 are reserved by the APIC (see also: Section 8.5.2, Valid Interrupt Vectors)' So I misremembered, apparently you can deliver interrupts 0x10-0x1f, but vectors 0x00-0x0f are not valid to send via the APIC or I/O APIC. As you pointed out, NMI is not a fixed interrupt. If we want to send an NMI, it would need a specific delivery mode rather than a vector number. And if you look at the code, if we specify NMI_VECTOR, the delivery mode would be set to NMI. So what's wrong here? OK, I think I understand your points now. You meant that these vectors can't be filled in the vector field directly, right? But NMI is an exception due to DM_NMI. Is that your point? I think we agree on this. Yes, I think we agree. NMI is the only vector in 0x0-0xf which can be sent via self-IPI because the vector itself does not matter for NMI. Zach
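The SDM fields quoted above can be checked mechanically: in the APIC's ICR, the delivery mode occupies bits 8-10 (100b = NMI, which is exactly why the vector field is ignored) and the destination shorthand occupies bits 18-19 (01b = self). A sketch of the encoding a self-NMI IPI would use; the macro values mirror the SDM bit layout (and happen to match Linux's apic.h constants), but this is illustrative only - real kernels go through apic->send_IPI_self().

```c
#include <stdint.h>

/* ICR bit fields per the SDM sections quoted above. */
#define APIC_DM_NMI    (4u << 8)   /* delivery mode bits 8-10: 100b = NMI  */
#define APIC_DEST_SELF (1u << 18)  /* dest shorthand bits 18-19: 01b = self */

/* Encode a self-NMI IPI.  The vector field (bits 0-7) is ignored for
 * DM_NMI, so it is left zero. */
static uint32_t icr_self_nmi(void)
{
    return APIC_DEST_SELF | APIC_DM_NMI;
}
```

Writing that value to the ICR (low dword) is what generates a genuine hardware NMI on the current CPU, as opposed to int $2, which merely invokes the handler without establishing the NMI-blocked state.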
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
Em Thu, Mar 18, 2010 at 09:03:25AM +0100, Ingo Molnar escreveu: * Zhang, Yanmin yanmin_zh...@linux.intel.com wrote: I worked out 3 new patches against tip/master tree of Mar. 17th. Cool! Mind sending them as a series of patches instead of attachment? That makes it easier to review them. Also, the Signed-off-by lines seem to be missing plus we need a per patch changelog as well. Yeah, please, and I hadn't merged them, so the resend was the best thing to do. - Arnaldo
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
* Joerg Roedel j...@8bytes.org wrote: On Tue, Mar 16, 2010 at 12:25:00PM +0100, Ingo Molnar wrote: Hm, that sounds rather messy if we want to use it to basically expose kernel functionality in a guest/host unified way. Is the qemu process discoverable in some secure way? Can we trust it? Is there some proper tooling available to do it, or do we have to push it through 2-3 packages to get such a useful feature done? Since we want to implement a pmu usable for the guest anyway, why don't we just use the guest's perf to get all the information we want? [...] Look at the previous posting of this patch, this is something new and rather unique. The main power in the 'perf kvm' kind of instrumentation is to profile _both_ the host and the guest on the host, using the same tool (often using the same kernel) and using similar workloads, and do profile comparisons using 'perf diff'. Note that KVM's in-kernel design makes it easy to offer this kind of host/guest shared implementation that Yanmin has created. Other virtualization solutions with a poorer design (for example where the hypervisor code base is split away from the guest implementation) will have a much harder time creating something similar. That kind of integrated approach can result in very interesting finds straight away, see: http://lkml.indiana.edu/hypermail/linux/kernel/1003.0/00613.html ( the profile there demoes the need for spinlock accelerators for example - there's clearly asymmetrically large overhead in guest spinlock code. Guess how much else we'll be able to find with a full 'perf kvm' implementation. ) One of the main goals of a virtualization implementation is to eliminate as many performance differences to the host kernel as possible. From the first day KVM was released the overriding question from users was always: 'how much slower is it than native, and which workloads are hit worst, and why, and could you pretty please speed up important workload XYZ'.
'perf kvm' helps exactly that kind of development workflow. Note that with oprofile you can already do separate guest space and host space profiling (with the timer-driven fallback in the guest). One idea with 'perf kvm' is to change that paradigm of forced separation and forced duplication and to support the workflow that most developers employ: use the host space for development and unify instrumentation in an intuitive framework. Yanmin's 'perf kvm' patch is a very good step towards that goal. Anyway ... look at the patches, try them and see it for yourself. Back in the days when i did KVM performance work i wish i had something like Yanmin's 'perf kvm' feature. I'd probably still be hacking KVM today ;-) So, the code is there, it's useful and it's up to you guys whether you live with this opportunity - the perf developers are certainly eager to help out with the details. There are already tons of per-kernel-subsystem perf helper tools: perf sched, perf kmem, perf lock, perf bench, perf timechart. 'perf kvm' is really a natural and good next step IMO that underlines the main design goodness KVM brought to the world of virtualization: proper guest/host code base integration. Thanks, Ingo
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
* Frank Ch. Eigler f...@redhat.com wrote: Hi - On Tue, Mar 16, 2010 at 06:04:10PM -0500, Anthony Liguori wrote: [...] The only way to really address this is to change the interaction. Instead of running perf externally to qemu, we should support a perf command in the qemu monitor that can then tie directly to the perf tooling. That gives us the best possible user experience. To what extent could this be solved with less crossing of isolation/abstraction layers, if the perfctr facilities were properly virtualized? [...] Note, 'perfctr' is a different out-of-tree Linux kernel project run by someone else: it offers the /dev/perfctr special-purpose device that allows raw, unabstracted, low-level access to the PMU. I suspect the one you wanted to mention here is called 'perf' or 'perf events'. (and used to be called 'performance counters' or 'perfcounters' until it got renamed about a year ago) Thanks, Ingo
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
* Avi Kivity a...@redhat.com wrote: Monitoring guests from the host is useful for kvm developers, but less so for users. Guest space profiling is easy, and 'perf kvm' is not about that. (plain 'perf' will work if a proper paravirt channel is opened to the host) I think you might have misunderstood the purpose and role of the 'perf kvm' patch here? 'perf kvm' is aimed at KVM developers: it is them who improve KVM code, not guest kernel users. Ingo
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On 03/17/2010 10:16 AM, Ingo Molnar wrote: * Avi Kivity a...@redhat.com wrote: Monitoring guests from the host is useful for kvm developers, but less so for users. Guest space profiling is easy, and 'perf kvm' is not about that. (plain 'perf' will work if a proper paravirt channel is opened to the host) I think you might have misunderstood the purpose and role of the 'perf kvm' patch here? 'perf kvm' is aimed at KVM developers: it is them who improve KVM code, not guest kernel users. Of course I understood it. My point was that 'perf kvm' serves a tiny minority of users. That doesn't mean it isn't useful, just that it doesn't satisfy all needs by itself. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
* Anthony Liguori aligu...@linux.vnet.ibm.com wrote: If you want to use a synthetic filesystem as the management interface for qemu, that's one thing. But you suggested exposing the guest filesystem in its entirety and that's what I disagreed with. What did you think, that it would be world-readable? Why would we do such a stupid thing? Any mounted content should at minimum match whatever policy covers the image file. The mounting of contents is not a privilege escalation and it is already possible today - just not integrated properly and not practical. (and apparently not implemented for all the wrong 'security' reasons) The guest may encrypt its disk image. It still ought to be possible to run perf against that guest, no? _In_ the guest you can of course run it just fine. (once paravirt bits are in place) That has no connection to 'perf kvm' though, which this patch submission is about ... If you want unified profiling of both host and guest then you need access to both the guest and the host. This is what the 'perf kvm' patch is about. Please read the patch, i think you might be misunderstanding what it does ... Regarding encrypted contents - that's really a distraction but the host has absolute, 100% control over the guest and there's nothing the guest can do about that - unless you are thinking about the sub-sub-case of Orwellian DRM-locked-down systems - in which case there's nothing for the host to mount and the guest can reject any requests for information on itself and impose additional policy that way. So it's a security non-issue. Note that DRM is pretty much the worst place to look at when it comes to usability: DRM lock-down is the antithesis of usability. Do you really want KVM to match the mind-set of the RIAA and MPAA? Why do you pretend that a developer cannot mount his own disk image? Pretty please, help Linux instead, where development is driven by usability and accessibility ...
Thanks, Ingo
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
* Avi Kivity a...@redhat.com wrote: On 03/17/2010 10:16 AM, Ingo Molnar wrote: * Avi Kivity a...@redhat.com wrote: Monitoring guests from the host is useful for kvm developers, but less so for users. Guest space profiling is easy, and 'perf kvm' is not about that. (plain 'perf' will work if a proper paravirt channel is opened to the host) I think you might have misunderstood the purpose and role of the 'perf kvm' patch here? 'perf kvm' is aimed at KVM developers: it is them who improve KVM code, not guest kernel users. Of course I understood it. My point was that 'perf kvm' serves a tiny minority of users. [...] I hope you won't be disappointed to learn that 100% of Linux, all 13+ million lines of it, was and is being developed by a tiny, tiny, tiny minority of users ;-) [...] That doesn't mean it isn't useful, just that it doesn't satisfy all needs by itself. Of course - and it doesn't bring world peace either. One step at a time. Thanks, Ingo
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On Tue, 2010-03-16 at 10:47 +0100, Ingo Molnar wrote: * Zhang, Yanmin yanmin_zh...@linux.intel.com wrote: On Tue, 2010-03-16 at 15:48 +0800, Zhang, Yanmin wrote: On Tue, 2010-03-16 at 07:41 +0200, Avi Kivity wrote: On 03/16/2010 07:27 AM, Zhang, Yanmin wrote: From: Zhang, Yanmin yanmin_zh...@linux.intel.com Based on the discussion in KVM community, I worked out the patch to support perf to collect guest os statistics from host side. This patch is implemented with Ingo, Peter and some other guys' kind help. Yang Sheng pointed out a critical bug and provided good suggestions with other guys. I really appreciate their kind help. The patch adds new subcommand kvm to perf. perf kvm top perf kvm record perf kvm report perf kvm diff The new perf could profile guest os kernel except guest os user space, but it could summarize guest os user space utilization per guest os. Below are some examples. 1) perf kvm top [r...@lkp-ne01 norm]# perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms --guestmodules=/home/ymzhang/guest/modules top Thanks for your kind comments. Excellent, support for guest kernel != host kernel is critical (I can't remember the last time I ran same kernels). How would we support multiple guests with different kernels? With the patch, 'perf kvm report --sort pid' could show summary statistics for all guest os instances. Then, use parameter --pid of 'perf kvm record' to collect single problematic instance data. Sorry. I found currently --pid isn't process but a thread (main thread). Ingo, Is it possible to support a new parameter or extend --inherit, so 'perf record' and 'perf top' could collect data on all threads of a process when the process is running? If not, I need to add a new ugly parameter which is similar to --pid to filter out process data in userspace. Yeah. For maximum utility i'd suggest to extend --pid to include this, and introduce --tid for the previous, limited-to-a-single-task functionality.
Most users would expect --pid to work like a 'late attach' - i.e. to work like strace -f or like a gdb attach. Thanks Ingo, Avi. I worked out below patch against tip/master of March 15th. Subject: [PATCH] Change perf's parameter --pid to process-wide collection From: Zhang, Yanmin yanmin_zh...@linux.intel.com Change parameter -p (--pid) to real process pid and add -t (--tid) meaning thread id. Now, --pid means perf collects the statistics of all threads of the process, while --tid means perf just collects the statistics of that thread. BTW, the patch fixes a bug of 'perf stat -p'. 'perf stat' always configures attr->disabled=1 if it isn't a system-wide collection. If there is a '-p' and no forks, 'perf stat -p' doesn't collect any data. In addition, the while(!done) loop in run_perf_stat consumes 100% of a single cpu, which has a bad impact on the running workload. I added a sleep(1) in the loop. Signed-off-by: Zhang Yanmin yanmin_zh...@linux.intel.com
---
diff -Nraup linux-2.6_tipmaster0315/tools/perf/builtin-record.c linux-2.6_tipmaster0315_perfpid/tools/perf/builtin-record.c
--- linux-2.6_tipmaster0315/tools/perf/builtin-record.c 2010-03-16 08:59:54.896488489 +0800
+++ linux-2.6_tipmaster0315_perfpid/tools/perf/builtin-record.c 2010-03-17 16:30:17.71706 +0800
@@ -27,7 +27,7 @@
 #include <unistd.h>
 #include <sched.h>

-static int fd[MAX_NR_CPUS][MAX_COUNTERS];
+static int *fd[MAX_NR_CPUS][MAX_COUNTERS];

 static long default_interval = 0;
@@ -43,6 +43,9 @@
 static int raw_samples = 0;
 static int system_wide = 0;
 static int profile_cpu = -1;
 static pid_t target_pid = -1;
+static pid_t target_tid = -1;
+static int *all_tids = NULL;
+static int thread_num = 0;
 static pid_t child_pid = -1;
 static int inherit = 1;
 static int force = 0;
@@ -60,7 +63,7 @@
 static struct timeval this_read;

 static u64 bytes_written = 0;

-static struct pollfd event_array[MAX_NR_CPUS * MAX_COUNTERS];
+static struct pollfd *event_array;

 static int nr_poll = 0;
 static int nr_cpu = 0;
@@ -77,7 +80,7 @@
 struct mmap_data {
 	unsigned int
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On Wednesday 17 March 2010 10:34:33 Zhang, Yanmin wrote: On Tue, 2010-03-16 at 11:32 +0200, Avi Kivity wrote: On 03/16/2010 09:48 AM, Zhang, Yanmin wrote: Right, but there is a scope between kvm_guest_enter and really running in guest os, where a perf event might overflow. Anyway, the scope is very narrow, I will change it to use flag PF_VCPU. There is also a window between setting the flag and calling 'int $2' where an NMI might happen and be accounted incorrectly. Perhaps separate the 'int $2' into a direct call into perf and another call for the rest of NMI handling. I don't see how it would work on svm though - AFAICT the NMI is held whereas vmx swallows it. I guess NMIs will be disabled until the next IRET so it isn't racy, just tricky. I'm not sure if vmexit does break NMI context or not. Hardware NMI context isn't reentrant till an IRET. YangSheng would like to double check it. After more checking, I think VMX won't retain the NMI blocking state for the host. That means, if an NMI happened while the processor is in VMX non-root mode, it would only result in a VMExit, with a reason indicating that it's due to an NMI, but no more state change in the host. So in that meaning, there _is_ a window between the VMExit and KVM handling the NMI. Moreover, I think we _can't_ stop the re-entrance of the NMI handling code because int $2 doesn't block a following NMI. And if the NMI sequence is not important (I think so), then we need to generate a real NMI in the current after-vmexit code. Letting the APIC send an NMI IPI to itself seems a good idea. I am debugging a patch based on apic->send_IPI_self(NMI_VECTOR) to replace int $2. Something unexpected is happening... -- regards Yang, Sheng
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On 03/17/2010 11:28 AM, Sheng Yang wrote: I'm not sure if vmexit does break NMI context or not. Hardware NMI context isn't reentrant till a IRET. YangSheng would like to double check it. After more check, I think VMX won't remained NMI block state for host. That's means, if NMI happened and processor is in VMX non-root mode, it would only result in VMExit, with a reason indicate that it's due to NMI happened, but no more state change in the host. So in that meaning, there _is_ a window between VMExit and KVM handle the NMI. Moreover, I think we _can't_ stop the re-entrance of NMI handling code because int $2 don't have effect to block following NMI. That's pretty bad, as NMI runs on a separate stack (via IST). So if another NMI happens while our int $2 is running, the stack will be corrupted. And if the NMI sequence is not important(I think so), then we need to generate a real NMI in current vmexit-after code. Seems let APIC send a NMI IPI to itself is a good idea. I am debugging a patch based on apic->send_IPI_self(NMI_VECTOR) to replace int $2. Something unexpected is happening... I think you need DM_NMI for that to work correctly. An alternative is to call the NMI handler directly. -- error compiling committee.c: too many arguments to function
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On Wednesday 17 March 2010 17:41:58 Avi Kivity wrote: On 03/17/2010 11:28 AM, Sheng Yang wrote: I'm not sure if vmexit does break NMI context or not. Hardware NMI context isn't reentrant till a IRET. YangSheng would like to double check it. After more check, I think VMX won't remained NMI block state for host. That's means, if NMI happened and processor is in VMX non-root mode, it would only result in VMExit, with a reason indicate that it's due to NMI happened, but no more state change in the host. So in that meaning, there _is_ a window between VMExit and KVM handle the NMI. Moreover, I think we _can't_ stop the re-entrance of NMI handling code because int $2 don't have effect to block following NMI. That's pretty bad, as NMI runs on a separate stack (via IST). So if another NMI happens while our int $2 is running, the stack will be corrupted. Though hardware didn't provide this kind of block, software at least would warn about it... nmi_enter() still would be executed by int $2, and result in BUG() if we are already in NMI context (OK, it is a little better than a mysterious crash due to a corrupted stack). And if the NMI sequence is not important(I think so), then we need to generate a real NMI in current vmexit-after code. Seems let APIC send a NMI IPI to itself is a good idea. I am debugging a patch based on apic->send_IPI_self(NMI_VECTOR) to replace int $2. Something unexpected is happening... I think you need DM_NMI for that to work correctly. An alternative is to call the NMI handler directly. apic_send_IPI_self() already took care of APIC_DM_NMI. And the NMI handler would block the following NMI? -- regards Yang, Sheng
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On 03/17/2010 11:51 AM, Sheng Yang wrote: I think you need DM_NMI for that to work correctly. An alternative is to call the NMI handler directly. apic_send_IPI_self() already took care of APIC_DM_NMI. So it does (though not for x2apic?). I don't see why it doesn't work. And NMI handler would block the following NMI? It wouldn't - won't work without extensive changes. -- error compiling committee.c: too many arguments to function
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On 03/16/2010 11:28 PM, Sheng Yang wrote: On Wednesday 17 March 2010 10:34:33 Zhang, Yanmin wrote: On Tue, 2010-03-16 at 11:32 +0200, Avi Kivity wrote: On 03/16/2010 09:48 AM, Zhang, Yanmin wrote: Right, but there is a scope between kvm_guest_enter and really running in guest os, where a perf event might overflow. Anyway, the scope is very narrow, I will change it to use flag PF_VCPU. There is also a window between setting the flag and calling 'int $2' where an NMI might happen and be accounted incorrectly. Perhaps separate the 'int $2' into a direct call into perf and another call for the rest of NMI handling. I don't see how it would work on svm though - AFAICT the NMI is held whereas vmx swallows it. I guess NMIs will be disabled until the next IRET so it isn't racy, just tricky. I'm not sure if vmexit does break NMI context or not. Hardware NMI context isn't reentrant till a IRET. YangSheng would like to double check it. After more check, I think VMX won't remained NMI block state for host. That's means, if NMI happened and processor is in VMX non-root mode, it would only result in VMExit, with a reason indicate that it's due to NMI happened, but no more state change in the host. So in that meaning, there _is_ a window between VMExit and KVM handle the NMI. Moreover, I think we _can't_ stop the re-entrance of NMI handling code because int $2 don't have effect to block following NMI. And if the NMI sequence is not important(I think so), then we need to generate a real NMI in current vmexit-after code. Seems let APIC send a NMI IPI to itself is a good idea. I am debugging a patch based on apic->send_IPI_self(NMI_VECTOR) to replace int $2. Something unexpected is happening... You can't use the APIC to send vectors 0x00-0x1f, or at least, aren't supposed to be able to. Zach
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On Thursday 18 March 2010 05:14:52 Zachary Amsden wrote: On 03/16/2010 11:28 PM, Sheng Yang wrote: On Wednesday 17 March 2010 10:34:33 Zhang, Yanmin wrote: On Tue, 2010-03-16 at 11:32 +0200, Avi Kivity wrote: On 03/16/2010 09:48 AM, Zhang, Yanmin wrote: Right, but there is a scope between kvm_guest_enter and really running in guest os, where a perf event might overflow. Anyway, the scope is very narrow, I will change it to use flag PF_VCPU. There is also a window between setting the flag and calling 'int $2' where an NMI might happen and be accounted incorrectly. Perhaps separate the 'int $2' into a direct call into perf and another call for the rest of NMI handling. I don't see how it would work on svm though - AFAICT the NMI is held whereas vmx swallows it. I guess NMIs will be disabled until the next IRET so it isn't racy, just tricky. I'm not sure if vmexit does break NMI context or not. Hardware NMI context isn't reentrant till a IRET. YangSheng would like to double check it. After more check, I think VMX won't remained NMI block state for host. That's means, if NMI happened and processor is in VMX non-root mode, it would only result in VMExit, with a reason indicate that it's due to NMI happened, but no more state change in the host. So in that meaning, there _is_ a window between VMExit and KVM handle the NMI. Moreover, I think we _can't_ stop the re-entrance of NMI handling code because int $2 don't have effect to block following NMI. And if the NMI sequence is not important(I think so), then we need to generate a real NMI in current vmexit-after code. Seems let APIC send a NMI IPI to itself is a good idea. I am debugging a patch based on apic->send_IPI_self(NMI_VECTOR) to replace int $2. Something unexpected is happening... You can't use the APIC to send vectors 0x00-0x1f, or at least, aren't supposed to be able to. Um? Why? Especially the kernel is already using it to deliver NMIs.
-- regards Yang, Sheng
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On Wed, 2010-03-17 at 17:26 +0800, Zhang, Yanmin wrote: On Tue, 2010-03-16 at 10:47 +0100, Ingo Molnar wrote: * Zhang, Yanmin yanmin_zh...@linux.intel.com wrote: On Tue, 2010-03-16 at 15:48 +0800, Zhang, Yanmin wrote: On Tue, 2010-03-16 at 07:41 +0200, Avi Kivity wrote: On 03/16/2010 07:27 AM, Zhang, Yanmin wrote: From: Zhang, Yanmin yanmin_zh...@linux.intel.com Based on the discussion in KVM community, I worked out the patch to support perf to collect guest os statistics from host side. This patch is implemented with Ingo, Peter and some other guys' kind help. Yang Sheng pointed out a critical bug and provided good suggestions with other guys. I really appreciate their kind help. The patch adds new subcommand kvm to perf. perf kvm top perf kvm record perf kvm report perf kvm diff The new perf could profile guest os kernel except guest os user space, but it could summarize guest os user space utilization per guest os. Below are some examples. 1) perf kvm top [r...@lkp-ne01 norm]# perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms --guestmodules=/home/ymzhang/guest/modules top Thanks for your kind comments. Excellent, support for guest kernel != host kernel is critical (I can't remember the last time I ran same kernels). How would we support multiple guests with different kernels? With the patch, 'perf kvm report --sort pid' could show summary statistics for all guest os instances. Then, use parameter --pid of 'perf kvm record' to collect single problematic instance data. Sorry. I found currently --pid isn't process but a thread (main thread). Ingo, Is it possible to support a new parameter or extend --inherit, so 'perf record' and 'perf top' could collect data on all threads of a process when the process is running? If not, I need to add a new ugly parameter which is similar to --pid to filter out process data in userspace. Yeah.
For maximum utility i'd suggest to extend --pid to include this, and introduce --tid for the previous, limited-to-a-single-task functionality. Most users would expect --pid to work like a 'late attach' - i.e. to work like strace -f or like a gdb attach. Thanks Ingo, Avi. I worked out below patch against tip/master of March 15th. Subject: [PATCH] Change perf's parameter --pid to process-wide collection From: Zhang, Yanmin yanmin_zh...@linux.intel.com Change parameter -p (--pid) to real process pid and add -t (--tid) meaning thread id. Now, --pid means perf collects the statistics of all threads of the process, while --tid means perf just collects the statistics of that thread. BTW, the patch fixes a bug of 'perf stat -p'. 'perf stat' always configures attr->disabled=1 if it isn't a system-wide collection. If there is a '-p' and no forks, 'perf stat -p' doesn't collect any data. In addition, the while(!done) loop in run_perf_stat consumes 100% of a single cpu, which has a bad impact on the running workload. I added a sleep(1) in the loop. Signed-off-by: Zhang Yanmin yanmin_zh...@linux.intel.com Ingo, Sorry, the patch has bugs. I need to do a better job and will work out 2 separate patches against the 2 issues. Yanmin
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On 03/17/2010 03:19 PM, Sheng Yang wrote: On Thursday 18 March 2010 05:14:52 Zachary Amsden wrote: On 03/16/2010 11:28 PM, Sheng Yang wrote: On Wednesday 17 March 2010 10:34:33 Zhang, Yanmin wrote: On Tue, 2010-03-16 at 11:32 +0200, Avi Kivity wrote: On 03/16/2010 09:48 AM, Zhang, Yanmin wrote: Right, but there is a scope between kvm_guest_enter and really running in guest os, where a perf event might overflow. Anyway, the scope is very narrow, I will change it to use flag PF_VCPU. There is also a window between setting the flag and calling 'int $2' where an NMI might happen and be accounted incorrectly. Perhaps separate the 'int $2' into a direct call into perf and another call for the rest of NMI handling. I don't see how it would work on svm though - AFAICT the NMI is held whereas vmx swallows it. I guess NMIs will be disabled until the next IRET so it isn't racy, just tricky. I'm not sure if vmexit does break NMI context or not. Hardware NMI context isn't reentrant till a IRET. YangSheng would like to double check it. After more check, I think VMX won't remained NMI block state for host. That's means, if NMI happened and processor is in VMX non-root mode, it would only result in VMExit, with a reason indicate that it's due to NMI happened, but no more state change in the host. So in that meaning, there _is_ a window between VMExit and KVM handle the NMI. Moreover, I think we _can't_ stop the re-entrance of NMI handling code because int $2 don't have effect to block following NMI. And if the NMI sequence is not important(I think so), then we need to generate a real NMI in current vmexit-after code. Seems let APIC send a NMI IPI to itself is a good idea. I am debugging a patch based on apic->send_IPI_self(NMI_VECTOR) to replace int $2. Something unexpected is happening... You can't use the APIC to send vectors 0x00-0x1f, or at least, aren't supposed to be able to. Um? Why? Especially kernel is already using it to deliver NMI.
That's the only defined case, and it is defined because the vector field is ignored for DM_NMI. Vol 3A (exact section numbers may vary depending on your version). 8.5.1 / 8.6.1 '100 (NMI) Delivers an NMI interrupt to the target processor or processors. The vector information is ignored' 8.5.2 Valid Interrupt Vectors 'Local and I/O APICs support 240 of these vectors (in the range of 16 to 255) as valid interrupts.' 8.8.4 Interrupt Acceptance for Fixed Interrupts '...; vectors 0 through 15 are reserved by the APIC (see also: Section 8.5.2, Valid Interrupt Vectors)' So I misremembered, apparently you can deliver interrupts 0x10-0x1f, but vectors 0x00-0x0f are not valid to send via APIC or I/O APIC. Zach
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On Thursday 18 March 2010 12:50:58 Zachary Amsden wrote: On 03/17/2010 03:19 PM, Sheng Yang wrote: On Thursday 18 March 2010 05:14:52 Zachary Amsden wrote: On 03/16/2010 11:28 PM, Sheng Yang wrote: On Wednesday 17 March 2010 10:34:33 Zhang, Yanmin wrote: On Tue, 2010-03-16 at 11:32 +0200, Avi Kivity wrote: On 03/16/2010 09:48 AM, Zhang, Yanmin wrote: Right, but there is a scope between kvm_guest_enter and really running in guest os, where a perf event might overflow. Anyway, the scope is very narrow, I will change it to use flag PF_VCPU. There is also a window between setting the flag and calling 'int $2' where an NMI might happen and be accounted incorrectly. Perhaps separate the 'int $2' into a direct call into perf and another call for the rest of NMI handling. I don't see how it would work on svm though - AFAICT the NMI is held whereas vmx swallows it. I guess NMIs will be disabled until the next IRET so it isn't racy, just tricky. I'm not sure if vmexit does break NMI context or not. Hardware NMI context isn't reentrant till a IRET. YangSheng would like to double check it. After more check, I think VMX won't remained NMI block state for host. That's means, if NMI happened and processor is in VMX non-root mode, it would only result in VMExit, with a reason indicate that it's due to NMI happened, but no more state change in the host. So in that meaning, there _is_ a window between VMExit and KVM handle the NMI. Moreover, I think we _can't_ stop the re-entrance of NMI handling code because int $2 don't have effect to block following NMI. And if the NMI sequence is not important(I think so), then we need to generate a real NMI in current vmexit-after code. Seems let APIC send a NMI IPI to itself is a good idea. I am debugging a patch based on apic->send_IPI_self(NMI_VECTOR) to replace int $2. Something unexpected is happening... You can't use the APIC to send vectors 0x00-0x1f, or at least, aren't supposed to be able to. Um? Why?
Especially kernel is already using it to deliver NMI. That's the only defined case, and it is defined because the vector field is ignored for DM_NMI. Vol 3A (exact section numbers may vary depending on your version). 8.5.1 / 8.6.1 '100 (NMI) Delivers an NMI interrupt to the target processor or processors. The vector information is ignored' 8.5.2 Valid Interrupt Vectors 'Local and I/O APICs support 240 of these vectors (in the range of 16 to 255) as valid interrupts.' 8.8.4 Interrupt Acceptance for Fixed Interrupts '...; vectors 0 through 15 are reserved by the APIC (see also: Section 8.5.2, Valid Interrupt Vectors)' So I misremembered, apparently you can deliver interrupts 0x10-0x1f, but vectors 0x00-0x0f are not valid to send via APIC or I/O APIC. As you pointed out, NMI is not a Fixed interrupt. If we want to send an NMI, it would need a specific delivery mode rather than a vector number. And if you look at the code, if we specify NMI_VECTOR, the delivery mode would be set to NMI. So what's wrong here? -- regards Yang, Sheng
RE: [PATCH] Enhance perf to collect KVM guest os statistics from host side
Hi Avi, Ingo, I've been following through this long thread since the very first email. I'm a performance engineer whose job is to tune workloads run on top of KVM (and Xen previously). As a performance engineer, I desperately want to have a tool that can monitor the host and guests at the same time. Think about 100 guests mixed with Linux/Windows running together on a single system: being able to know what's happening is critical to do performance analysis. Actually I am the person who asked Yanmin to add the feature for CPU utilization break-down (into host_usr, host_krn, guest_usr, guest_krn) so that I can monitor dozens of running guests. I haven't made this patch work on my system yet but I _do_ think this patch is a very good start. And finally, monitoring guests from the host is useful for users too (administrators and performance guys like me). I really appreciate you guys' work and would love to provide feedback from my point of view if needed. Regards, HUANG, Zhiteng Intel SSG/SSD/SPA/PRC Scalability Lab -Original Message- From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On Behalf Of Avi Kivity Sent: Wednesday, March 17, 2010 11:55 AM To: Frank Ch. Eigler Cc: Anthony Liguori; Ingo Molnar; Zhang, Yanmin; Peter Zijlstra; Sheng Yang; linux-ker...@vger.kernel.org; kvm@vger.kernel.org; Marcelo Tosatti; Joerg Roedel; Jes Sorensen; Gleb Natapov; Zachary Amsden; ziteng.hu...@intel.com Subject: Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side On 03/17/2010 02:41 AM, Frank Ch. Eigler wrote: Hi - On Tue, Mar 16, 2010 at 06:04:10PM -0500, Anthony Liguori wrote: [...] The only way to really address this is to change the interaction. Instead of running perf externally to qemu, we should support a perf command in the qemu monitor that can then tie directly to the perf tooling. That gives us the best possible user experience.
To what extent could this be solved with less crossing of isolation/abstraction layers, if the perfctr facilities were properly virtualized? That's the more interesting (by far) usage model. In general guest owners don't have access to the host, and host owners can't (and shouldn't) change guests. Monitoring guests from the host is useful for kvm developers, but less so for users. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On Thursday 18 March 2010 13:22:28 Sheng Yang wrote: On Thursday 18 March 2010 12:50:58 Zachary Amsden wrote: On 03/17/2010 03:19 PM, Sheng Yang wrote: On Thursday 18 March 2010 05:14:52 Zachary Amsden wrote: On 03/16/2010 11:28 PM, Sheng Yang wrote: On Wednesday 17 March 2010 10:34:33 Zhang, Yanmin wrote: On Tue, 2010-03-16 at 11:32 +0200, Avi Kivity wrote: On 03/16/2010 09:48 AM, Zhang, Yanmin wrote: Right, but there is a scope between kvm_guest_enter and really running in guest os, where a perf event might overflow. Anyway, the scope is very narrow, I will change it to use flag PF_VCPU. There is also a window between setting the flag and calling 'int $2' where an NMI might happen and be accounted incorrectly. Perhaps separate the 'int $2' into a direct call into perf and another call for the rest of NMI handling. I don't see how it would work on svm though - AFAICT the NMI is held whereas vmx swallows it. I guess NMIs will be disabled until the next IRET so it isn't racy, just tricky. I'm not sure if vmexit does break NMI context or not. Hardware NMI context isn't reentrant till a IRET. YangSheng would like to double check it. After more check, I think VMX won't remained NMI block state for host. That's means, if NMI happened and processor is in VMX non-root mode, it would only result in VMExit, with a reason indicate that it's due to NMI happened, but no more state change in the host. So in that meaning, there _is_ a window between VMExit and KVM handle the NMI. Moreover, I think we _can't_ stop the re-entrance of NMI handling code because int $2 don't have effect to block following NMI. And if the NMI sequence is not important(I think so), then we need to generate a real NMI in current vmexit-after code. Seems let APIC send a NMI IPI to itself is a good idea. I am debugging a patch based on apic->send_IPI_self(NMI_VECTOR) to replace int $2. Something unexpected is happening...
You can't use the APIC to send vectors 0x00-0x1f, or at least, aren't supposed to be able to. Um? Why? Especially since the kernel is already using it to deliver NMIs. That's the only defined case, and it is defined because the vector field is ignored for DM_NMI. Vol 3A (exact section numbers may vary depending on your version). 8.5.1 / 8.6.1 '100 (NMI) Delivers an NMI interrupt to the target processor or processors. The vector information is ignored' 8.5.2 Valid Interrupt Vectors 'Local and I/O APICs support 240 of these vectors (in the range of 16 to 255) as valid interrupts.' 8.8.4 Interrupt Acceptance for Fixed Interrupts '...; vectors 0 through 15 are reserved by the APIC (see also: Section 8.5.2, Valid Interrupt Vectors)' So I misremembered, apparently you can deliver interrupts 0x10-0x1f, but vectors 0x00-0x0f are not valid to send via the APIC or I/O APIC. As you pointed out, NMI is not a Fixed interrupt. If we want to send an NMI, it needs a specific delivery mode rather than a vector number. And if you look at the code, if we specify NMI_VECTOR, the delivery mode is set to NMI. So what's wrong here? OK, I think I understand your point now. You mean that these vectors can't be put in the vector field directly, right? But NMI is an exception due to DM_NMI. Is that your point? I think we agree on this. -- regards Yang, Sheng
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
* Avi Kivity <a...@redhat.com> wrote: On 03/16/2010 07:27 AM, Zhang, Yanmin wrote: From: Zhang, Yanmin <yanmin_zh...@linux.intel.com> Based on the discussion in KVM community, I worked out the patch to support perf to collect guest os statistics from host side. This patch is implemented with Ingo, Peter and some other guys' kind help. Yang Sheng pointed out a critical bug and provided good suggestions with other guys. I really appreciate their kind help. The patch adds new subcommand kvm to perf. perf kvm top perf kvm record perf kvm report perf kvm diff The new perf could profile guest os kernel except guest os user space, but it could summarize guest os user space utilization per guest os. Below are some examples. 1) perf kvm top [r...@lkp-ne01 norm]# perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms --guestmodules=/home/ymzhang/guest/modules top Excellent, support for guest kernel != host kernel is critical (I can't remember the last time I ran same kernels). How would we support multiple guests with different kernels? Perhaps a symbol server that perf can connect to (and that would connect to guests in turn)? The highest quality solution would be if KVM offered a 'guest extension' to the guest kernel's /proc/kallsyms that made it easy for user-space to get this information from an authoritative source. That's the main reason why the host side /proc/kallsyms is so popular and so useful: while in theory it's mostly redundant information which can be gleaned from the System.map and other sources of symbol information, it's easily available and is _always_ trustable to come from the host kernel. Separate System.map's have a tendency to go out of sync (or go missing when a devel kernel gets rebuilt, or if a devel package is not installed), and server ports (be that a TCP port space server or an UDP port space mount-point) are both a configuration hassle and are not guest-transparent.
So for instrumentation infrastructure (such as perf) we have a large and well founded preference for intrinsic, built-in, kernel-provided information: i.e. a largely 'built-in' and transparent mechanism to get to guest symbols. Thanks, Ingo
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On Tue, 2010-03-16 at 07:41 +0200, Avi Kivity wrote: On 03/16/2010 07:27 AM, Zhang, Yanmin wrote: From: Zhang, Yanmin <yanmin_zh...@linux.intel.com> Based on the discussion in KVM community, I worked out the patch to support perf to collect guest os statistics from host side. This patch is implemented with Ingo, Peter and some other guys' kind help. Yang Sheng pointed out a critical bug and provided good suggestions with other guys. I really appreciate their kind help. The patch adds new subcommand kvm to perf. perf kvm top perf kvm record perf kvm report perf kvm diff The new perf could profile guest os kernel except guest os user space, but it could summarize guest os user space utilization per guest os. Below are some examples. 1) perf kvm top [r...@lkp-ne01 norm]# perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms --guestmodules=/home/ymzhang/guest/modules top Thanks for your kind comments. Excellent, support for guest kernel != host kernel is critical (I can't remember the last time I ran same kernels). How would we support multiple guests with different kernels? With the patch, 'perf kvm report --sort pid' could show summary statistics for all guest os instances. Then, use the --pid parameter of 'perf kvm record' to collect data for a single problematic instance. Perhaps a symbol server that perf can connect to (and that would connect to guests in turn)?
diff -Nraup linux-2.6_tipmaster0315/arch/x86/kvm/vmx.c linux-2.6_tipmaster0315_perfkvm/arch/x86/kvm/vmx.c
--- linux-2.6_tipmaster0315/arch/x86/kvm/vmx.c 2010-03-16 08:59:11.825295404 +0800
+++ linux-2.6_tipmaster0315_perfkvm/arch/x86/kvm/vmx.c 2010-03-16 09:01:09.976084492 +0800
@@ -26,6 +26,7 @@
 #include <linux/sched.h>
 #include <linux/moduleparam.h>
 #include <linux/ftrace_event.h>
+#include <linux/perf_event.h>
 #include "kvm_cache_regs.h"
 #include "x86.h"
@@ -3632,6 +3633,43 @@ static void update_cr8_intercept(struct
 	vmcs_write32(TPR_THRESHOLD, irr);
 }
+DEFINE_PER_CPU(int, kvm_in_guest) = {0};
+
+static void kvm_set_in_guest(void)
+{
+	percpu_write(kvm_in_guest, 1);
+}
+
+static int kvm_is_in_guest(void)
+{
+	return percpu_read(kvm_in_guest);
+}

There is already PF_VCPU for this.

Right, but there is a window between kvm_guest_enter and really running in the guest os, where a perf event might overflow. Anyway, the window is very narrow; I will change it to use the flag PF_VCPU.

+static struct perf_guest_info_callbacks kvm_guest_cbs = {
+	.is_in_guest		= kvm_is_in_guest,
+	.is_user_mode		= kvm_is_user_mode,
+	.get_guest_ip		= kvm_get_guest_ip,
+	.reset_in_guest		= kvm_reset_in_guest,
+};

Should be in common code, not vmx specific.

Right. I discussed with Yangsheng. I will move the above data structures and callbacks to arch/x86/kvm/x86.c, and add get_ip, a new callback, to kvm_x86_ops.

Yanmin
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On 03/16/2010 09:24 AM, Ingo Molnar wrote: * Avi Kivity <a...@redhat.com> wrote: On 03/16/2010 07:27 AM, Zhang, Yanmin wrote: From: Zhang, Yanmin <yanmin_zh...@linux.intel.com> Based on the discussion in KVM community, I worked out the patch to support perf to collect guest os statistics from host side. This patch is implemented with Ingo, Peter and some other guys' kind help. Yang Sheng pointed out a critical bug and provided good suggestions with other guys. I really appreciate their kind help. The patch adds new subcommand kvm to perf. perf kvm top perf kvm record perf kvm report perf kvm diff The new perf could profile guest os kernel except guest os user space, but it could summarize guest os user space utilization per guest os. Below are some examples. 1) perf kvm top [r...@lkp-ne01 norm]# perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms --guestmodules=/home/ymzhang/guest/modules top Excellent, support for guest kernel != host kernel is critical (I can't remember the last time I ran same kernels). How would we support multiple guests with different kernels? Perhaps a symbol server that perf can connect to (and that would connect to guests in turn)? The highest quality solution would be if KVM offered a 'guest extension' to the guest kernel's /proc/kallsyms that made it easy for user-space to get this information from an authoritative source. That's the main reason why the host side /proc/kallsyms is so popular and so useful: while in theory it's mostly redundant information which can be gleaned from the System.map and other sources of symbol information, it's easily available and is _always_ trustable to come from the host kernel. Separate System.map's have a tendency to go out of sync (or go missing when a devel kernel gets rebuilt, or if a devel package is not installed), and server ports (be that a TCP port space server or an UDP port space mount-point) are both a configuration hassle and are not guest-transparent.
So for instrumentation infrastructure (such as perf) we have a large and well founded preference for intrinsic, built-in, kernel-provided information: i.e. a largely 'built-in' and transparent mechanism to get to guest symbols. The symbol server's client can certainly access the bits through vmchannel. -- error compiling committee.c: too many arguments to function
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On Tue, 2010-03-16 at 15:48 +0800, Zhang, Yanmin wrote: On Tue, 2010-03-16 at 07:41 +0200, Avi Kivity wrote: On 03/16/2010 07:27 AM, Zhang, Yanmin wrote: From: Zhang, Yanmin <yanmin_zh...@linux.intel.com> Based on the discussion in KVM community, I worked out the patch to support perf to collect guest os statistics from host side. This patch is implemented with Ingo, Peter and some other guys' kind help. Yang Sheng pointed out a critical bug and provided good suggestions with other guys. I really appreciate their kind help. The patch adds new subcommand kvm to perf. perf kvm top perf kvm record perf kvm report perf kvm diff The new perf could profile guest os kernel except guest os user space, but it could summarize guest os user space utilization per guest os. Below are some examples. 1) perf kvm top [r...@lkp-ne01 norm]# perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms --guestmodules=/home/ymzhang/guest/modules top Thanks for your kind comments. Excellent, support for guest kernel != host kernel is critical (I can't remember the last time I ran same kernels). How would we support multiple guests with different kernels? With the patch, 'perf kvm report --sort pid' could show summary statistics for all guest os instances. Then, use the --pid parameter of 'perf kvm record' to collect data for a single problematic instance. Sorry. I found that currently --pid isn't per-process but per-thread (the main thread). Ingo, is it possible to support a new parameter or extend --inherit, so 'perf record' and 'perf top' could collect data on all threads of a process while the process is running? If not, I need to add a new ugly parameter, similar to --pid, to filter out process data in userspace. Yanmin
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On 03/16/2010 09:48 AM, Zhang, Yanmin wrote: Excellent, support for guest kernel != host kernel is critical (I can't remember the last time I ran same kernels). How would we support multiple guests with different kernels? With the patch, 'perf kvm report --sort pid' could show summary statistics for all guest os instances. Then, use the --pid parameter of 'perf kvm record' to collect data for a single problematic instance. That certainly works, though automatic association of guest data with guest symbols is friendlier.

diff -Nraup linux-2.6_tipmaster0315/arch/x86/kvm/vmx.c linux-2.6_tipmaster0315_perfkvm/arch/x86/kvm/vmx.c
--- linux-2.6_tipmaster0315/arch/x86/kvm/vmx.c 2010-03-16 08:59:11.825295404 +0800
+++ linux-2.6_tipmaster0315_perfkvm/arch/x86/kvm/vmx.c 2010-03-16 09:01:09.976084492 +0800
@@ -26,6 +26,7 @@
 #include <linux/sched.h>
 #include <linux/moduleparam.h>
 #include <linux/ftrace_event.h>
+#include <linux/perf_event.h>
 #include "kvm_cache_regs.h"
 #include "x86.h"
@@ -3632,6 +3633,43 @@ static void update_cr8_intercept(struct
 	vmcs_write32(TPR_THRESHOLD, irr);
 }
+DEFINE_PER_CPU(int, kvm_in_guest) = {0};
+
+static void kvm_set_in_guest(void)
+{
+	percpu_write(kvm_in_guest, 1);
+}
+
+static int kvm_is_in_guest(void)
+{
+	return percpu_read(kvm_in_guest);
+}

There is already PF_VCPU for this.

Right, but there is a window between kvm_guest_enter and really running in the guest os, where a perf event might overflow. Anyway, the window is very narrow; I will change it to use the flag PF_VCPU.

There is also a window between setting the flag and calling 'int $2' where an NMI might happen and be accounted incorrectly. Perhaps separate the 'int $2' into a direct call into perf and another call for the rest of NMI handling. I don't see how it would work on svm though - AFAICT the NMI is held, whereas vmx swallows it. I guess NMIs will be disabled until the next IRET so it isn't racy, just tricky.
+static struct perf_guest_info_callbacks kvm_guest_cbs = {
+	.is_in_guest		= kvm_is_in_guest,
+	.is_user_mode		= kvm_is_user_mode,
+	.get_guest_ip		= kvm_get_guest_ip,
+	.reset_in_guest		= kvm_reset_in_guest,
+};

Should be in common code, not vmx specific.

Right. I discussed with Yangsheng. I will move the above data structures and callbacks to arch/x86/kvm/x86.c, and add get_ip, a new callback, to kvm_x86_ops.

You will need access to the vcpu pointer (kvm_rip_read() needs it); you can put it in a percpu variable. I guess if it's not null, you know you're in a guest, so there is no need for PF_VCPU.
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On 03/16/2010 11:28 AM, Zhang, Yanmin wrote: Sorry. I found that currently --pid isn't per-process but per-thread (the main thread). Ingo, is it possible to support a new parameter or extend --inherit, so 'perf record' and 'perf top' could collect data on all threads of a process while the process is running? That seems like a worthwhile addition regardless of this thread. Profile all current threads and any new ones. It probably makes sense to call this --pid and rename the existing --pid to --thread.
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
* Zhang, Yanmin <yanmin_zh...@linux.intel.com> wrote: On Tue, 2010-03-16 at 15:48 +0800, Zhang, Yanmin wrote: On Tue, 2010-03-16 at 07:41 +0200, Avi Kivity wrote: On 03/16/2010 07:27 AM, Zhang, Yanmin wrote: From: Zhang, Yanmin <yanmin_zh...@linux.intel.com> Based on the discussion in KVM community, I worked out the patch to support perf to collect guest os statistics from host side. This patch is implemented with Ingo, Peter and some other guys' kind help. Yang Sheng pointed out a critical bug and provided good suggestions with other guys. I really appreciate their kind help. The patch adds new subcommand kvm to perf. perf kvm top perf kvm record perf kvm report perf kvm diff The new perf could profile guest os kernel except guest os user space, but it could summarize guest os user space utilization per guest os. Below are some examples. 1) perf kvm top [r...@lkp-ne01 norm]# perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms --guestmodules=/home/ymzhang/guest/modules top Thanks for your kind comments. Excellent, support for guest kernel != host kernel is critical (I can't remember the last time I ran same kernels). How would we support multiple guests with different kernels? With the patch, 'perf kvm report --sort pid' could show summary statistics for all guest os instances. Then, use the --pid parameter of 'perf kvm record' to collect data for a single problematic instance. Sorry. I found that currently --pid isn't per-process but per-thread (the main thread). Ingo, is it possible to support a new parameter or extend --inherit, so 'perf record' and 'perf top' could collect data on all threads of a process while the process is running? If not, I need to add a new ugly parameter, similar to --pid, to filter out process data in userspace. Yeah. For maximum utility i'd suggest extending --pid to include this, and introducing --tid for the previous, limited-to-a-single-task functionality. Most users would expect --pid to work like a 'late attach' - i.e.
to work like strace -f or like a gdb attach. Ingo
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
* Avi Kivity <a...@redhat.com> wrote: On 03/16/2010 09:24 AM, Ingo Molnar wrote: * Avi Kivity <a...@redhat.com> wrote: On 03/16/2010 07:27 AM, Zhang, Yanmin wrote: From: Zhang, Yanmin <yanmin_zh...@linux.intel.com> Based on the discussion in KVM community, I worked out the patch to support perf to collect guest os statistics from host side. This patch is implemented with Ingo, Peter and some other guys' kind help. Yang Sheng pointed out a critical bug and provided good suggestions with other guys. I really appreciate their kind help. The patch adds new subcommand kvm to perf. perf kvm top perf kvm record perf kvm report perf kvm diff The new perf could profile guest os kernel except guest os user space, but it could summarize guest os user space utilization per guest os. Below are some examples. 1) perf kvm top [r...@lkp-ne01 norm]# perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms --guestmodules=/home/ymzhang/guest/modules top Excellent, support for guest kernel != host kernel is critical (I can't remember the last time I ran same kernels). How would we support multiple guests with different kernels? Perhaps a symbol server that perf can connect to (and that would connect to guests in turn)? The highest quality solution would be if KVM offered a 'guest extension' to the guest kernel's /proc/kallsyms that made it easy for user-space to get this information from an authoritative source. That's the main reason why the host side /proc/kallsyms is so popular and so useful: while in theory it's mostly redundant information which can be gleaned from the System.map and other sources of symbol information, it's easily available and is _always_ trustable to come from the host kernel.
Separate System.map's have a tendency to go out of sync (or go missing when a devel kernel gets rebuilt, or if a devel package is not installed), and server ports (be that a TCP port space server or an UDP port space mount-point) are both a configuration hassle and are not guest-transparent. So for instrumentation infrastructure (such as perf) we have a large and well founded preference for intrinsic, built-in, kernel-provided information: i.e. a largely 'built-in' and transparent mechanism to get to guest symbols. The symbol server's client can certainly access the bits through vmchannel. Ok, that would work i suspect. Would be nice to have the symbol server in tools/perf/ and also make it easy to add it to the initrd via a .config switch or so. That would have basically all of the advantages of being built into the kernel (availability, configurability, transparency, hackability), while having all the advantages of a user-space approach as well (flexibility, extensibility, robustness, ease of maintenance, etc.). If only we had tools/xorg/ integrated via the initrd that way ;-) Thanks, Ingo
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On 03/16/2010 11:53 AM, Ingo Molnar wrote: * Avi Kivity <a...@redhat.com> wrote: On 03/16/2010 09:24 AM, Ingo Molnar wrote: * Avi Kivity <a...@redhat.com> wrote: On 03/16/2010 07:27 AM, Zhang, Yanmin wrote: From: Zhang, Yanmin <yanmin_zh...@linux.intel.com> Based on the discussion in KVM community, I worked out the patch to support perf to collect guest os statistics from host side. This patch is implemented with Ingo, Peter and some other guys' kind help. Yang Sheng pointed out a critical bug and provided good suggestions with other guys. I really appreciate their kind help. The patch adds new subcommand kvm to perf. perf kvm top perf kvm record perf kvm report perf kvm diff The new perf could profile guest os kernel except guest os user space, but it could summarize guest os user space utilization per guest os. Below are some examples. 1) perf kvm top [r...@lkp-ne01 norm]# perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms --guestmodules=/home/ymzhang/guest/modules top Excellent, support for guest kernel != host kernel is critical (I can't remember the last time I ran same kernels). How would we support multiple guests with different kernels? Perhaps a symbol server that perf can connect to (and that would connect to guests in turn)? The highest quality solution would be if KVM offered a 'guest extension' to the guest kernel's /proc/kallsyms that made it easy for user-space to get this information from an authoritative source. That's the main reason why the host side /proc/kallsyms is so popular and so useful: while in theory it's mostly redundant information which can be gleaned from the System.map and other sources of symbol information, it's easily available and is _always_ trustable to come from the host kernel.
Separate System.map's have a tendency to go out of sync (or go missing when a devel kernel gets rebuilt, or if a devel package is not installed), and server ports (be that a TCP port space server or an UDP port space mount-point) are both a configuration hassle and are not guest-transparent. So for instrumentation infrastructure (such as perf) we have a large and well founded preference for intrinsic, built-in, kernel-provided information: i.e. a largely 'built-in' and transparent mechanism to get to guest symbols. The symbol server's client can certainly access the bits through vmchannel. Ok, that would work i suspect. Would be nice to have the symbol server in tools/perf/ and also make it easy to add it to the initrd via a .config switch or so. That would have basically all of the advantages of being built into the kernel (availability, configurability, transparency, hackability), while having all the advantages of a user-space approach as well (flexibility, extensibility, robustness, ease of maintenance, etc.). Note, I am not advocating building the vmchannel client into the host kernel. While that makes everything simpler for the user, it increases the kernel footprint with all the disadvantages that come with that (any bug is converted into a host DoS or worse). So, perf would connect to qemu via (say) a well-known unix domain socket, which would then talk to the guest kernel. I know you won't like it, we'll continue to disagree on this unfortunately.
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
* Avi Kivity <a...@redhat.com> wrote: On 03/16/2010 11:53 AM, Ingo Molnar wrote: * Avi Kivity <a...@redhat.com> wrote: On 03/16/2010 09:24 AM, Ingo Molnar wrote: * Avi Kivity <a...@redhat.com> wrote: On 03/16/2010 07:27 AM, Zhang, Yanmin wrote: From: Zhang, Yanmin <yanmin_zh...@linux.intel.com> Based on the discussion in KVM community, I worked out the patch to support perf to collect guest os statistics from host side. This patch is implemented with Ingo, Peter and some other guys' kind help. Yang Sheng pointed out a critical bug and provided good suggestions with other guys. I really appreciate their kind help. The patch adds new subcommand kvm to perf. perf kvm top perf kvm record perf kvm report perf kvm diff The new perf could profile guest os kernel except guest os user space, but it could summarize guest os user space utilization per guest os. Below are some examples. 1) perf kvm top [r...@lkp-ne01 norm]# perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms --guestmodules=/home/ymzhang/guest/modules top Excellent, support for guest kernel != host kernel is critical (I can't remember the last time I ran same kernels). How would we support multiple guests with different kernels? Perhaps a symbol server that perf can connect to (and that would connect to guests in turn)? The highest quality solution would be if KVM offered a 'guest extension' to the guest kernel's /proc/kallsyms that made it easy for user-space to get this information from an authoritative source. That's the main reason why the host side /proc/kallsyms is so popular and so useful: while in theory it's mostly redundant information which can be gleaned from the System.map and other sources of symbol information, it's easily available and is _always_ trustable to come from the host kernel.
Separate System.map's have a tendency to go out of sync (or go missing when a devel kernel gets rebuilt, or if a devel package is not installed), and server ports (be that a TCP port space server or an UDP port space mount-point) are both a configuration hassle and are not guest-transparent. So for instrumentation infrastructure (such as perf) we have a large and well founded preference for intrinsic, built-in, kernel-provided information: i.e. a largely 'built-in' and transparent mechanism to get to guest symbols. The symbol server's client can certainly access the bits through vmchannel. Ok, that would work i suspect. Would be nice to have the symbol server in tools/perf/ and also make it easy to add it to the initrd via a .config switch or so. That would have basically all of the advantages of being built into the kernel (availability, configurability, transparency, hackability), while having all the advantages of a user-space approach as well (flexibility, extensibility, robustness, ease of maintenance, etc.). Note, I am not advocating building the vmchannel client into the host kernel. [...] Neither am i. What i suggested was a user-space binary/executable built in tools/perf and put into the initrd. That approach has the advantages i listed above, without having the disadvantages of in-kernel code you listed. Thanks, Ingo
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On 03/16/2010 12:20 PM, Ingo Molnar wrote: The symbol server's client can certainly access the bits through vmchannel. Ok, that would work i suspect. Would be nice to have the symbol server in tools/perf/ and also make it easy to add it to the initrd via a .config switch or so. That would have basically all of the advantages of being built into the kernel (availability, configurability, transparency, hackability), while having all the advantages of a user-space approach as well (flexibility, extensibility, robustness, ease of maintenance, etc.). Note, I am not advocating building the vmchannel client into the host kernel. [...] Neither am i. What i suggested was a user-space binary/executable built in tools/perf and put into the initrd. I'm confused - initrd seems to be guest-side. I was talking about the host side. For the guest, placing the symbol server in tools/ is reasonable.
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
* Avi Kivity <a...@redhat.com> wrote: On 03/16/2010 12:20 PM, Ingo Molnar wrote: The symbol server's client can certainly access the bits through vmchannel. Ok, that would work i suspect. Would be nice to have the symbol server in tools/perf/ and also make it easy to add it to the initrd via a .config switch or so. That would have basically all of the advantages of being built into the kernel (availability, configurability, transparency, hackability), while having all the advantages of a user-space approach as well (flexibility, extensibility, robustness, ease of maintenance, etc.). Note, I am not advocating building the vmchannel client into the host kernel. [...] Neither am i. What i suggested was a user-space binary/executable built in tools/perf and put into the initrd. I'm confused - initrd seems to be guest-side. I was talking about the host side. The host side doesn't need much support - just some client capability in perf itself. I suspect vmchannels are sufficiently flexible and configuration-free for such purposes? (i.e. like a filesystem in essence) Ingo
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On 03/16/2010 12:50 PM, Ingo Molnar wrote: I'm confused - initrd seems to be guest-side. I was talking about the host side. The host side doesn't need much support - just some client capability in perf itself. I suspect vmchannels are sufficiently flexible and configuration-free for such purposes? (i.e. like a filesystem in essence) I haven't followed vmchannel closely, but I think it is. vmchannel is terminated in qemu on the host side, not in the host kernel. So perf would need to connect to qemu.
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
* Avi Kivity <a...@redhat.com> wrote: On 03/16/2010 12:50 PM, Ingo Molnar wrote: I'm confused - initrd seems to be guest-side. I was talking about the host side. The host side doesn't need much support - just some client capability in perf itself. I suspect vmchannels are sufficiently flexible and configuration-free for such purposes? (i.e. like a filesystem in essence) I haven't followed vmchannel closely, but I think it is. vmchannel is terminated in qemu on the host side, not in the host kernel. So perf would need to connect to qemu. Hm, that sounds rather messy if we want to use it to basically expose kernel functionality in a guest/host unified way. Is the qemu process discoverable in some secure way? Can we trust it? Is there some proper tooling available to do it, or do we have to push it through 2-3 packages to get such a useful feature done? ( That is the general thought process by which many cross-discipline useful desktop/server features hit the bit bucket before having had any chance of being vetted by users, and why Linux sucks so much when it comes to feature integration and application usability. ) Ingo
On 03/16/2010 01:25 PM, Ingo Molnar wrote: I haven't followed vmchannel closely, but I think it is. vmchannel is terminated in qemu on the host side, not in the host kernel. So perf would need to connect to qemu. Hm, that sounds rather messy if we want to use it to basically expose kernel functionality in a guest/host unified way. Is the qemu process discoverable in some secure way? We know its pid. Can we trust it? No choice, it contains the guest address space. Is there some proper tooling available to do it, or do we have to push it through 2-3 packages to get such a useful feature done? libvirt manages qemu processes, but I don't think this should go through libvirt. qemu can do this directly by opening a unix domain socket in a well-known place. ( That is the general thought process by which many cross-discipline useful desktop/server features hit the bit bucket before having had any chance of being vetted by users, and why Linux sucks so much when it comes to feature integration and application usability. ) You can't solve everything in the kernel, even with a well-populated tools/.
* Avi Kivity a...@redhat.com wrote: On 03/16/2010 01:25 PM, Ingo Molnar wrote: I haven't followed vmchannel closely, but I think it is. vmchannel is terminated in qemu on the host side, not in the host kernel. So perf would need to connect to qemu. Hm, that sounds rather messy if we want to use it to basically expose kernel functionality in a guest/host unified way. Is the qemu process discoverable in some secure way? We know its pid. How do i get a list of all 'guest instance PIDs', and what is the way to talk to Qemu? Can we trust it? No choice, it contains the guest address space. I mean, i can trust a kernel service and i can trust /proc/kallsyms. Can perf trust a random process claiming to be Qemu? What's the trust mechanism here? Is there some proper tooling available to do it, or do we have to push it through 2-3 packages to get such a useful feature done? libvirt manages qemu processes, but I don't think this should go through libvirt. qemu can do this directly by opening a unix domain socket in a well-known place. So Qemu has never run into such problems before? ( Sounds weird - i think Qemu configuration itself should be done via a unix domain socket driven configuration protocol as well. ) ( That is the general thought process by which many cross-discipline useful desktop/server features hit the bit bucket before having had any chance of being vetted by users, and why Linux sucks so much when it comes to feature integration and application usability. ) You can't solve everything in the kernel, even with a well-populated tools/. Certainly not, but this is a technical problem in the kernel's domain, so it's a fair (and natural) expectation to be able to solve this within the kernel project. Ingo
On 03/16/2010 02:29 PM, Ingo Molnar wrote: * Avi Kivity a...@redhat.com wrote: On 03/16/2010 01:25 PM, Ingo Molnar wrote: I haven't followed vmchannel closely, but I think it is. vmchannel is terminated in qemu on the host side, not in the host kernel. So perf would need to connect to qemu. Hm, that sounds rather messy if we want to use it to basically expose kernel functionality in a guest/host unified way. Is the qemu process discoverable in some secure way? We know its pid. How do i get a list of all 'guest instance PIDs', Libvirt manages all qemus, but this should be implemented independently of libvirt. and what is the way to talk to Qemu? In general qemu exposes communication channels (such as the monitor) as tcp connections, unix-domain sockets, stdio, etc. It's very flexible. Can we trust it? No choice, it contains the guest address space. I mean, i can trust a kernel service and i can trust /proc/kallsyms. Can perf trust a random process claiming to be Qemu? What's the trust mechanism here? Obviously you can't trust anything you get from a guest, no matter how you get it. How do you trust a userspace program's symbols? You don't. How do you get them? They're in a well-known location. Is there some proper tooling available to do it, or do we have to push it through 2-3 packages to get such a useful feature done? libvirt manages qemu processes, but I don't think this should go through libvirt. qemu can do this directly by opening a unix domain socket in a well-known place. So Qemu has never run into such problems before? ( Sounds weird - i think Qemu configuration itself should be done via a unix domain socket driven configuration protocol as well. ) That's exactly what happens. You invoke qemu with -monitor unix:blah,server (or -qmp for a machine-readable format) and have your management application connect to that. You can redirect guest serial ports, console, parallel port, etc. to unix-domain or tcp sockets. vmchannel is an extension of that mechanism.
( That is the general thought process by which many cross-discipline useful desktop/server features hit the bit bucket before having had any chance of being vetted by users, and why Linux sucks so much when it comes to feature integration and application usability. ) You can't solve everything in the kernel, even with a well-populated tools/. Certainly not, but this is a technical problem in the kernel's domain, so it's a fair (and natural) expectation to be able to solve this within the kernel project. Someone writing perf-gui outside the kernel would have the same problems, no?
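The monitor/QMP handshake Avi describes is simple enough to sketch from the client ("perf") side. Below is a minimal illustration with an in-process stand-in for qemu's socket endpoint, so it runs without qemu; the socket path is invented and the greeting fields only mimic the shape of a real QMP banner:

```python
import json
import os
import socket
import tempfile
import threading

path = os.path.join(tempfile.mkdtemp(), "qmp.sock")

# Stand-in for 'qemu -qmp unix:...,server': greets with a QMP-style
# banner, then acks the client's qmp_capabilities command.
srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
srv.bind(path)
srv.listen(1)

def fake_qemu():
    conn, _ = srv.accept()
    banner = {"QMP": {"version": {}, "capabilities": []}}
    conn.sendall(json.dumps(banner).encode() + b"\n")
    conn.recv(4096)                      # the qmp_capabilities request
    conn.sendall(b'{"return": {}}\n')
    conn.close()

t = threading.Thread(target=fake_qemu)
t.start()

# What a perf-side client would do: connect to the well-known socket,
# read the greeting, negotiate capabilities.
c = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
c.connect(path)
f = c.makefile("rb")
greeting = json.loads(f.readline())
c.sendall(b'{"execute": "qmp_capabilities"}\n')
ack = json.loads(f.readline())
c.close()
t.join()
srv.close()

print("QMP" in greeting, ack)            # → True {'return': {}}
```

The point of the "well-known place" is exactly that the client side stays this small: no discovery protocol, just a connect plus a line-oriented JSON exchange.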
* Avi Kivity a...@redhat.com wrote: On 03/16/2010 02:29 PM, Ingo Molnar wrote: I mean, i can trust a kernel service and i can trust /proc/kallsyms. Can perf trust a random process claiming to be Qemu? What's the trust mechanism here? Obviously you can't trust anything you get from a guest, no matter how you get it. I'm not talking about the symbol strings and addresses, and the object contents for annotation (or debuginfo). I'm talking about the basic protocol of establishing which guest is which. I.e. we really want users to be able to: 1) have it all working with a single guest, without having to specify 'which' guest (qemu PID) to work with. That is the dominant usecase both for developers and for a fair portion of testers. 2) Have some reasonable symbolic identification for guests. For example a usable approach would be to have 'perf kvm list', which would list all currently active guests:

$ perf kvm list
[1] Fedora
[2] OpenSuse
[3] Windows-XP
[4] Windows-7

And from that point on 'perf kvm -g OpenSuse record' would do the obvious thing. Users will be able to just use the 'OpenSuse' symbolic name for that guest, even if the guest got restarted and switched its main PID. Any such facility needs trusted enumeration and a protocol where i can trust that the information i got is authoritative. (I.e. 'OpenSuse' truly matches to the OpenSuse session - not to some local user starting up a Qemu instance that claims to be 'OpenSuse'.) Is such a scheme possible/available? I suspect all the KVM configuration tools (i haven't used them in some time - gui and command-line tools alike) use similar methods to ease guest management? Ingo
On 03/16/2010 03:08 PM, Ingo Molnar wrote: I mean, i can trust a kernel service and i can trust /proc/kallsyms. Can perf trust a random process claiming to be Qemu? What's the trust mechanism here? Obviously you can't trust anything you get from a guest, no matter how you get it. I'm not talking about the symbol strings and addresses, and the object contents for annotation (or debuginfo). I'm talking about the basic protocol of establishing which guest is which. There is none. So far, qemu only dealt with managing just its own guest, and left all multiple guest management to higher levels up the stack (like libvirt). I.e. we really want users to be able to: 1) have it all working with a single guest, without having to specify 'which' guest (qemu PID) to work with. That is the dominant usecase both for developers and for a fair portion of testers. That's reasonable if we can get it working simply. 2) Have some reasonable symbolic identification for guests. For example a usable approach would be to have 'perf kvm list', which would list all currently active guests:

$ perf kvm list
[1] Fedora
[2] OpenSuse
[3] Windows-XP
[4] Windows-7

And from that point on 'perf kvm -g OpenSuse record' would do the obvious thing. Users will be able to just use the 'OpenSuse' symbolic name for that guest, even if the guest got restarted and switched its main PID. Any such facility needs trusted enumeration and a protocol where i can trust that the information i got is authoritative. (I.e. 'OpenSuse' truly matches to the OpenSuse session - not to some local user starting up a Qemu instance that claims to be 'OpenSuse'.) Is such a scheme possible/available? I suspect all the KVM configuration tools (i haven't used them in some time - gui and command-line tools alike) use similar methods to ease guest management? You can do that through libvirt, but that only works for guests started through libvirt.
libvirt provides command-line tools to list and manage guests (for example autostarting them on startup), and tools built on top of libvirt can manage guests graphically. Looks like we have a layer inversion here. Maybe we need a plugin system - libvirt drops a .so into perf that teaches it how to list guests and get their symbols.
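The plugin system Avi suggests can be sketched abstractly. Everything below is hypothetical - perf has no such plugin interface, and all the names are invented - but it shows the shape: each management stack registers an enumerator, and a 'perf kvm list' equivalent walks them all:

```python
# Hypothetical shape of the plugin interface: each management stack
# (libvirt, a custom script, ...) registers a callable that returns
# the guests it knows about.
from typing import Callable, List, NamedTuple

class Guest(NamedTuple):
    name: str
    pid: int
    kallsyms: str       # where guest symbols could be fetched from

_enumerators: List[Callable[[], List[Guest]]] = []

def register_enumerator(fn):
    """What a dropped-in plugin would call at load time."""
    _enumerators.append(fn)
    return fn

def list_guests() -> List[Guest]:
    """A 'perf kvm list' equivalent: walk every registered plugin."""
    guests = []
    for fn in _enumerators:
        guests.extend(fn())
    return guests

# A toy 'libvirt' plugin standing in for the real .so:
@register_enumerator
def libvirt_plugin():
    return [Guest("Fedora", 4242, "/tmp/fedora.kallsyms")]

print([g.name for g in list_guests()])   # → ['Fedora']
```

The design point is that perf itself stays agnostic: the layer inversion is resolved by having the management tool push knowledge down into the profiler, rather than the profiler hard-coding any one management stack.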
* Avi Kivity a...@redhat.com wrote: On 03/16/2010 03:08 PM, Ingo Molnar wrote: I mean, i can trust a kernel service and i can trust /proc/kallsyms. Can perf trust a random process claiming to be Qemu? What's the trust mechanism here? Obviously you can't trust anything you get from a guest, no matter how you get it. I'm not talking about the symbol strings and addresses, and the object contents for annotation (or debuginfo). I'm talking about the basic protocol of establishing which guest is which. There is none. So far, qemu only dealt with managing just its own guest, and left all multiple guest management to higher levels up the stack (like libvirt). I.e. we really want users to be able to: 1) have it all working with a single guest, without having to specify 'which' guest (qemu PID) to work with. That is the dominant usecase both for developers and for a fair portion of testers. That's reasonable if we can get it working simply. IMO such ease of use is reasonable and required, full stop. If it cannot be gotten simply then that's a bug: either in the code, or in the design, or in the development process that led to the design. Bugs need fixing. 2) Have some reasonable symbolic identification for guests. For example a usable approach would be to have 'perf kvm list', which would list all currently active guests:

$ perf kvm list
[1] Fedora
[2] OpenSuse
[3] Windows-XP
[4] Windows-7

And from that point on 'perf kvm -g OpenSuse record' would do the obvious thing. Users will be able to just use the 'OpenSuse' symbolic name for that guest, even if the guest got restarted and switched its main PID. Any such facility needs trusted enumeration and a protocol where i can trust that the information i got is authoritative. (I.e. 'OpenSuse' truly matches to the OpenSuse session - not to some local user starting up a Qemu instance that claims to be 'OpenSuse'.) Is such a scheme possible/available?
I suspect all the KVM configuration tools (i haven't used them in some time - gui and command-line tools alike) use similar methods to ease guest management? You can do that through libvirt, but that only works for guests started through libvirt. libvirt provides command-line tools to list and manage guests (for example autostarting them on startup), and tools built on top of libvirt can manage guests graphically. Looks like we have a layer inversion here. Maybe we need a plugin system - libvirt drops a .so into perf that teaches it how to list guests and get their symbols. Is libvirt used to start up all KVM guests? If not, if it's only used on some distros while other distros have other solutions then there's apparently no good way to get to such information, and the kernel bits of KVM do not provide it. To the user (and to me) this looks like a KVM bug / missing feature. (and the user doesn't care where the blame is) If that is true then apparently the current KVM design has no technically actionable solution for certain categories of features! Ingo
On 03/16/2010 03:31 PM, Ingo Molnar wrote: You can do that through libvirt, but that only works for guests started through libvirt. libvirt provides command-line tools to list and manage guests (for example autostarting them on startup), and tools built on top of libvirt can manage guests graphically. Looks like we have a layer inversion here. Maybe we need a plugin system - libvirt drops a .so into perf that teaches it how to list guests and get their symbols. Is libvirt used to start up all KVM guests? If not, if it's only used on some distros while other distros have other solutions then there's apparently no good way to get to such information, and the kernel bits of KVM do not provide it. Developers tend to start qemu from the command line, but the majority of users and all distros I know of use libvirt. Some users cobble up their own scripts. To the user (and to me) this looks like a KVM bug / missing feature. (and the user doesn't care where the blame is) If that is true then apparently the current KVM design has no technically actionable solution for certain categories of features! A plugin system allows anyone who is interested to provide the information; they just need to write a plugin for their management tool. Since we can't prevent people from writing management tools, I don't see what else we can do.
Ingo Molnar mi...@elte.hu writes: [...] I.e. we really want users to be able to: 1) have it all working with a single guest, without having to specify 'which' guest (qemu PID) to work with. That is the dominant usecase both for developers and for a fair portion of testers. That's reasonable if we can get it working simply. IMO such ease of use is reasonable and required, full stop. If it cannot be gotten simply then that's a bug: either in the code, or in the design, or in the development process that led to the design. Bugs need fixing. [...] Perhaps the fact that kvm happens to deal with an interesting application area (virtualization) is misleading here. As far as the host kernel or other host userspace is concerned, qemu is just some random unprivileged userspace program (with some *optional* /dev/kvm services that might happen to require temporary root). As such, perf trying to instrument qemu is no different than perf trying to instrument any other userspace widget. Therefore, expecting 'trusted enumeration' of instances is just as sensible as using 'trusted ps' and 'trusted /var/run/FOO.pid files'. - FChE
* Frank Ch. Eigler f...@redhat.com wrote: Ingo Molnar mi...@elte.hu writes: [...] I.e. we really want users to be able to: 1) have it all working with a single guest, without having to specify 'which' guest (qemu PID) to work with. That is the dominant usecase both for developers and for a fair portion of testers. That's reasonable if we can get it working simply. IMO such ease of use is reasonable and required, full stop. If it cannot be gotten simply then that's a bug: either in the code, or in the design, or in the development process that led to the design. Bugs need fixing. [...] Perhaps the fact that kvm happens to deal with an interesting application area (virtualization) is misleading here. As far as the host kernel or other host userspace is concerned, qemu is just some random unprivileged userspace program (with some *optional* /dev/kvm services that might happen to require temporary root). As such, perf trying to instrument qemu is no different than perf trying to instrument any other userspace widget. Therefore, expecting 'trusted enumeration' of instances is just as sensible as using 'trusted ps' and 'trusted /var/run/FOO.pid files'. You are quite mistaken: KVM isn't really a 'random unprivileged application' in this context, it is clearly an extension of system/kernel services. ( Which can be seen from the simple fact that what started the discussion was 'how do we get /proc/kallsyms from the guest'. I.e. an extension of the existing host-space /proc/kallsyms was desired. ) In that sense the most natural 'extension' would be the solution i mentioned a week or two ago: to have a (read only) mount of all guest filesystems, plus a channel for profiling/tracing data. That would make symbol parsing easier and it's what extends the existing 'host space' abstraction in the most natural way. ( It doesn't even have to be done via the kernel - Qemu could implement that via FUSE for example. ) As a second best option a 'symbol server' might be used too.
Thanks, Ingo
Hi - On Tue, Mar 16, 2010 at 04:52:21PM +0100, Ingo Molnar wrote: [...] Perhaps the fact that kvm happens to deal with an interesting application area (virtualization) is misleading here. As far as the host kernel or other host userspace is concerned, qemu is just some random unprivileged userspace program [...] You are quite mistaken: KVM isn't really a 'random unprivileged application' in this context, it is clearly an extension of system/kernel services. I don't know what extension of system/kernel services means in this context, beyond something running on the system/kernel, like every other process. To clarify, to what extent do you consider your classification similarly clear for a host that is running:

* multiple kvm instances run as unprivileged users
* non-kvm OS simulators such as vmware or xen or gdb
* kvm instances running something other than linux

( Which can be seen from the simple fact that what started the discussion was 'how do we get /proc/kallsyms from the guest'. I.e. an extension of the existing host-space /proc/kallsyms was desired. ) (Sorry, that smacks of circular reasoning.) It may be a charming convenience function for perf users to give them shortcuts for certain favoured configurations (kvm running freshest linux), but that says more about perf than kvm. - FChE
* Frank Ch. Eigler f...@redhat.com wrote: Hi - On Tue, Mar 16, 2010 at 04:52:21PM +0100, Ingo Molnar wrote: [...] Perhaps the fact that kvm happens to deal with an interesting application area (virtualization) is misleading here. As far as the host kernel or other host userspace is concerned, qemu is just some random unprivileged userspace program [...] You are quite mistaken: KVM isn't really a 'random unprivileged application' in this context, it is clearly an extension of system/kernel services. I don't know what extension of system/kernel services means in this context, beyond something running on the system/kernel, like every other process. [...] It means something like my example of 'extended to guest space' /proc/kallsyms: [...] ( Which can be seen from the simple fact that what started the discussion was 'how do we get /proc/kallsyms from the guest'. I.e. an extension of the existing host-space /proc/kallsyms was desired. ) (Sorry, that smacks of circular reasoning.) To me it sounds like an example supporting my point. /proc/kallsyms is a service by the kernel, and 'perf kvm' desires this to be extended to guest space as well. Thanks, Ingo
On 03/16/2010 08:08 AM, Ingo Molnar wrote: * Avi Kivity a...@redhat.com wrote: On 03/16/2010 02:29 PM, Ingo Molnar wrote: I mean, i can trust a kernel service and i can trust /proc/kallsyms. Can perf trust a random process claiming to be Qemu? What's the trust mechanism here? Obviously you can't trust anything you get from a guest, no matter how you get it. I'm not talking about the symbol strings and addresses, and the object contents for annotation (or debuginfo). I'm talking about the basic protocol of establishing which guest is which. I.e. we really want users to be able to: 1) have it all working with a single guest, without having to specify 'which' guest (qemu PID) to work with. That is the dominant usecase both for developers and for a fair portion of testers. You're making too many assumptions. There is no list of guests any more than there is a list of web browsers. You can have a multi-tenant scenario where you have distinct groups of virtual machines running as unprivileged users. 2) Have some reasonable symbolic identification for guests. For example a usable approach would be to have 'perf kvm list', which would list all currently active guests:

$ perf kvm list
[1] Fedora
[2] OpenSuse
[3] Windows-XP
[4] Windows-7

And from that point on 'perf kvm -g OpenSuse record' would do the obvious thing. Users will be able to just use the 'OpenSuse' symbolic name for that guest, even if the guest got restarted and switched its main PID. Does perf kvm list always run as root? What if two unprivileged users both have a VM named Fedora? If we look at the use-case, it's going to be something like, a user is creating virtual machines and wants to get performance information about them. Having to run a separate tool like perf is not going to be what they would expect they had to do. Instead, they would either use their existing GUI tool (like virt-manager) or they would use their management interface (either QMP or libvirt).
The complexity of interaction is due to the fact that perf shouldn't be a standalone tool. It should be a library or something with a programmatic interface that another tool can make use of. Regards, Anthony Liguori
On 03/16/2010 10:52 AM, Ingo Molnar wrote: You are quite mistaken: KVM isn't really a 'random unprivileged application' in this context, it is clearly an extension of system/kernel services. ( Which can be seen from the simple fact that what started the discussion was 'how do we get /proc/kallsyms from the guest'. I.e. an extension of the existing host-space /proc/kallsyms was desired. ) Random tools (like perf) should not be able to do what you describe. It's a security nightmare. If it's desirable to have /proc/kallsyms available, we can expose an interface in QEMU to provide that. That can then be plumbed through libvirt and QMP. Then a management tool can use libvirt or QMP to obtain that information and interact with the kernel appropriately. In that sense the most natural 'extension' would be the solution i mentioned a week or two ago: to have a (read only) mount of all guest filesystems, plus a channel for profiling/tracing data. That would make symbol parsing easier and it's what extends the existing 'host space' abstraction in the most natural way. ( It doesn't even have to be done via the kernel - Qemu could implement that via FUSE for example. ) No way. The guest has sensitive data and exposing it widely on the host is a bad thing to do. It's a bad interface. We can expose specific information about guests but only through our existing channels which are validated through a security infrastructure. Ultimately, your goal is to keep perf a simple tool with few dependencies. But practically speaking, if you want to add features to it, it's going to have to interact with other subsystems in the appropriate way. That means, it's going to need to interact with libvirt or QMP.
If you want all applications to expose their data via synthetic file systems, then there's always plan9 :-) Regards, Anthony Liguori
* Anthony Liguori anth...@codemonkey.ws wrote: On 03/16/2010 08:08 AM, Ingo Molnar wrote: * Avi Kivity a...@redhat.com wrote: On 03/16/2010 02:29 PM, Ingo Molnar wrote: I mean, i can trust a kernel service and i can trust /proc/kallsyms. Can perf trust a random process claiming to be Qemu? What's the trust mechanism here? Obviously you can't trust anything you get from a guest, no matter how you get it. I'm not talking about the symbol strings and addresses, and the object contents for annotation (or debuginfo). I'm talking about the basic protocol of establishing which guest is which. I.e. we really want users to be able to: 1) have it all working with a single guest, without having to specify 'which' guest (qemu PID) to work with. That is the dominant usecase both for developers and for a fair portion of testers. You're making too many assumptions. There is no list of guests any more than there is a list of web browsers. You can have a multi-tenant scenario where you have distinct groups of virtual machines running as unprivileged users. multi-tenant and groups are not a valid excuse at all for giving crappy technology in the simplest case: when there's a single VM. Yes, eventually it can be supported and any sane scheme will naturally support it too, but it's by no means what we care about primarily when it comes to these tools. I thought everyone learned the lesson behind SystemTap's failure (and to a certain degree this was behind Oprofile's failure as well): when it comes to tooling/instrumentation we don't want to concentrate on the fancy complex setups and abstract requirements drawn up by CIOs, as development isn't being done there. Concentrate on our developers today, and provide no-compromises usability to those who contribute stuff. If we don't help make the simplest (and most common) use-case convenient then we are failing on a fundamental level. 2) Have some reasonable symbolic identification for guests.
For example a usable approach would be to have 'perf kvm list', which would list all currently active guests:

$ perf kvm list
[1] Fedora
[2] OpenSuse
[3] Windows-XP
[4] Windows-7

And from that point on 'perf kvm -g OpenSuse record' would do the obvious thing. Users will be able to just use the 'OpenSuse' symbolic name for that guest, even if the guest got restarted and switched its main PID. Does perf kvm list always run as root? What if two unprivileged users both have a VM named Fedora? Again, the single-VM case is the most important case, by far. If you have multiple VMs running and want to develop the kernel on multiple VMs (sounds rather messy if you think it through ...), what would happen is similar to what happens when we have two probes for example:

# perf probe schedule
Added new event:
probe:schedule (on schedule+0)
You can now use it on all perf tools, such as:
perf record -e probe:schedule -a sleep 1

# perf probe -f schedule
Added new event:
probe:schedule_1 (on schedule+0)
You can now use it on all perf tools, such as:
perf record -e probe:schedule_1 -a sleep 1

# perf probe -f schedule
Added new event:
probe:schedule_2 (on schedule+0)
You can now use it on all perf tools, such as:
perf record -e probe:schedule_2 -a sleep 1

Something similar could be used for KVM/Qemu: whichever got created first is named 'Fedora', the second is named 'Fedora-2'. If we look at the use-case, it's going to be something like, a user is creating virtual machines and wants to get performance information about them. Having to run a separate tool like perf is not going to be what they would expect they had to do. Instead, they would either use their existing GUI tool (like virt-manager) or they would use their management interface (either QMP or libvirt). The complexity of interaction is due to the fact that perf shouldn't be a standalone tool. It should be a library or something with a programmatic interface that another tool can make use of. But ...
a GUI interface/integration is of course possible too, and it's being worked on. perf is mainly a kernel developer tool, and kernel developers generally don't use GUIs to do their stuff: which is the (sole) reason why the first ~850 commits of tools/perf/ were done without a GUI. We go where our developers are. In any case it's not an excuse to have no proper command-line tooling. In fact if you cannot get simpler, more atomic command-line tooling right then you'll probably doubly suck at doing a GUI as well. Ingo
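The naming scheme Ingo proposes (the first guest keeps its bare name, later duplicates get a numeric suffix, mirroring perf probe's schedule/schedule_1 behavior) is easy to pin down. A sketch, with the function name invented for illustration:

```python
def assign_names(requested):
    """De-duplicate guest names the way Ingo sketches: the first
    'Fedora' keeps its name, later ones become 'Fedora-2', 'Fedora-3'."""
    seen = {}
    out = []
    for name in requested:
        seen[name] = seen.get(name, 0) + 1
        out.append(name if seen[name] == 1 else f"{name}-{seen[name]}")
    return out

print(assign_names(["Fedora", "OpenSuse", "Fedora", "Fedora"]))
# → ['Fedora', 'OpenSuse', 'Fedora-2', 'Fedora-3']
```

Note this keys only on creation order, which sidesteps the two-users-named-Fedora question only within one namespace; across unprivileged users the enumeration problem from earlier in the thread remains.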
* Anthony Liguori aligu...@linux.vnet.ibm.com wrote: On 03/16/2010 10:52 AM, Ingo Molnar wrote: You are quite mistaken: KVM isnt really a 'random unprivileged application' in this context, it is clearly an extension of system/kernel services. ( Which can be seen from the simple fact that what started the discussion was 'how do we get /proc/kallsyms from the guest'. I.e. an extension of the existing host-space /proc/kallsyms was desired. ) Random tools (like perf) should not be able to do what you describe. It's a security nightmare. A security nightmare exactly how? Mind to go into details as i dont understand your point. If it's desirable to have /proc/kallsyms available, we can expose an interface in QEMU to provide that. That can then be plumbed through libvirt and QMP. Then a management tool can use libvirt or QMP to obtain that information and interact with the kernel appropriately. In that sense the most natural 'extension' would be the solution i mentioned a week or two ago: to have a (read only) mount of all guest filesystems, plus a channel for profiling/tracing data. That would make symbol parsing easier and it's what extends the existing 'host space' abstraction in the most natural way. ( It doesnt even have to be done via the kernel - Qemu could implement that via FUSE for example. ) No way. The guest has sensitive data and exposing it widely on the host is a bad thing to do. [...] Firstly, you are putting words into my mouth, as i said nothing about 'exposing it widely'. I suggest exposing it under the privileges of whoever has access to the guest image. Secondly, regarding confidentiality, and this is guest security 101: whoever can access the image on the host _already_ has access to all the guest data! A Linux image can generally be loopback mounted straight away: losetup -o 32256 /dev/loop0 ./guest-image.img mount -o ro /dev/loop0 /mnt-guest (Or, if you are an unprivileged user who cannot mount, it can be read via ext2 tools.) 
There's nothing the guest can do about that. The host is in total control of guest image data for heaven's sake! All i'm suggesting is to make what is already possible more convenient. Ingo
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On 03/16/2010 12:52 PM, Ingo Molnar wrote: * Anthony Liguorialigu...@linux.vnet.ibm.com wrote: On 03/16/2010 10:52 AM, Ingo Molnar wrote: You are quite mistaken: KVM isnt really a 'random unprivileged application' in this context, it is clearly an extension of system/kernel services. ( Which can be seen from the simple fact that what started the discussion was 'how do we get /proc/kallsyms from the guest'. I.e. an extension of the existing host-space /proc/kallsyms was desired. ) Random tools (like perf) should not be able to do what you describe. It's a security nightmare. A security nightmare exactly how? Mind to go into details as i dont understand your point. Assume you're using SELinux to implement mandatory access control. How do you label this file system? Generally speaking, we don't know the difference between /proc/kallsyms vs. /dev/mem if we do generic passthrough. While it might be safe to have a relaxed label of kallsyms (since it's read only), it's clearly not safe to do that for /dev/mem, /etc/shadow, or any file containing sensitive information. Rather, we ought to expose a higher level interface that we have more confidence in with respect to understanding the ramifications of exposing that guest data. No way. The guest has sensitive data and exposing it widely on the host is a bad thing to do. [...] Firstly, you are putting words into my mouth, as i said nothing about 'exposing it widely'. I suggest exposing it under the privileges of whoever has access to the guest image. That doesn't work as nicely with SELinux. It's completely reasonable to have a user that can interact in a read only mode with a VM via libvirt but cannot read the guest's disk images or the guest's memory contents. Secondly, regarding confidentiality, and this is guest security 101: whoever can access the image on the host _already_ has access to all the guest data! 
A Linux image can generally be loopback mounted straight away:

  losetup -o 32256 /dev/loop0 ./guest-image.img
  mount -o ro /dev/loop0 /mnt-guest

(Or, if you are an unprivileged user who cannot mount, it can be read via ext2 tools.) There's nothing the guest can do about that. The host is in total control of guest image data for heaven's sake! It's not that simple in a MAC environment. Regards, Anthony Liguori All i'm suggesting is to make what is already possible more convenient. Ingo
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
* Anthony Liguori aligu...@linux.vnet.ibm.com wrote: On 03/16/2010 12:52 PM, Ingo Molnar wrote: * Anthony Liguorialigu...@linux.vnet.ibm.com wrote: On 03/16/2010 10:52 AM, Ingo Molnar wrote: You are quite mistaken: KVM isnt really a 'random unprivileged application' in this context, it is clearly an extension of system/kernel services. ( Which can be seen from the simple fact that what started the discussion was 'how do we get /proc/kallsyms from the guest'. I.e. an extension of the existing host-space /proc/kallsyms was desired. ) Random tools (like perf) should not be able to do what you describe. It's a security nightmare. A security nightmare exactly how? Mind to go into details as i dont understand your point. Assume you're using SELinux to implement mandatory access control. How do you label this file system? Generally speaking, we don't know the difference between /proc/kallsyms vs. /dev/mem if we do generic passthrough. While it might be safe to have a relaxed label of kallsyms (since it's read only), it's clearly not safe to do that for /dev/mem, /etc/shadow, or any file containing sensitive information. What's your _point_? Please outline a threat model, a vector of attack, _anything_ that substantiates your it's a security nightmare claim. Rather, we ought to expose a higher level interface that we have more confidence in with respect to understanding the ramifications of exposing that guest data. Exactly, we want something that has a flexible namespace and works well with Linux tools in general. Preferably that namespace should be human readable, and it should be hierarchic, and it should have a well-known permission model. This concept exists in Linux and is generally called a 'filesystem'. No way. The guest has sensitive data and exposing it widely on the host is a bad thing to do. [...] Firstly, you are putting words into my mouth, as i said nothing about 'exposing it widely'. 
I suggest exposing it under the privileges of whoever has access to the guest image. That doesn't work as nicely with SELinux. It's completely reasonable to have a user that can interact in a read only mode with a VM via libvirt but cannot read the guest's disk images or the guest's memory contents. If a user cannot read the image file then the user has no access to its contents via other namespaces either. That is, of course, a basic security aspect. ( That is perfectly true with a non-SELinux Unix permission model as well, and is true in the SELinux case as well. ) Secondly, regarding confidentiality, and this is guest security 101: whoever can access the image on the host _already_ has access to all the guest data! A Linux image can generally be loopback mounted straight away: losetup -o 32256 /dev/loop0 ./guest-image.img mount -o ro /dev/loop0 /mnt-guest (Or, if you are an unprivileged user who cannot mount, it can be read via ext2 tools.) There's nothing the guest can do about that. The host is in total control of guest image data for heaven's sake! It's not that simple in a MAC environment. Erm. Please explain to me, what exactly is 'not that simple' in a MAC environment? Also, i'd like to note that the 'restrictive SELinux setups' usecases are pretty secondary. To demonstrate that, i'd like every KVM developer on this list who reads this mail and who has their home development system where they produce their patches set up in a restrictive MAC environment, in that you cannot even read the images you are using, to chime in with a I'm doing that reply. If there's just a _single_ KVM developer amongst dozens and dozens of developers on this list who develops in an environment like that i'd be surprised. That result should pretty much tell you where the weight of instrumentation focus should lie - and it isnt on restrictive MAC environments ... 
Ingo
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On Tue, Mar 16, 2010 at 12:25:00PM +0100, Ingo Molnar wrote: Hm, that sounds rather messy if we want to use it to basically expose kernel functionality in a guest/host unified way. Is the qemu process discoverable in some secure way? Can we trust it? Is there some proper tooling available to do it, or do we have to push it through 2-3 packages to get such a useful feature done? Since we want to implement a pmu usable for the guest anyway, why don't we just use the guest's perf to get all information we want? If we get a pmu-nmi from the guest we just re-inject it to the guest, and perf in the guest gives us all information we want, including kernel and userspace symbols, stack traces, and so on. In the previous thread we discussed a direct trace channel between guest and host kernel (which can be used for ftrace events for example). This channel could be used to transport this information to the host kernel. The only additional feature needed is a way for the host to start a perf instance in the guest. Opinions? Joerg
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
Joerg Roedel wrote: On Tue, Mar 16, 2010 at 12:25:00PM +0100, Ingo Molnar wrote: Hm, that sounds rather messy if we want to use it to basically expose kernel functionality in a guest/host unified way. Is the qemu process discoverable in some secure way? Can we trust it? Is there some proper tooling available to do it, or do we have to push it through 2-3 packages to get such a useful feature done? Since we want to implement a pmu usable for the guest anyway, why don't we just use the guest's perf to get all information we want? If we get a pmu-nmi from the guest we just re-inject it to the guest, and perf in the guest gives us all information we want, including kernel and userspace symbols, stack traces, and so on. I guess this aims to get information from old environments running on kvm for life extension :) In the previous thread we discussed a direct trace channel between guest and host kernel (which can be used for ftrace events for example). This channel could be used to transport this information to the host kernel. Interesting! I know the people who are trying to do that with systemtap. See, http://vesper.sourceforge.net/ The only additional feature needed is a way for the host to start a perf instance in the guest. # ssh localguest perf record --host-channel ... ? B-) Thank you, Opinions? Joerg -- Masami Hiramatsu e-mail: mhira...@redhat.com
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On 03/16/2010 01:28 PM, Ingo Molnar wrote: * Anthony Liguori <aligu...@linux.vnet.ibm.com> wrote: On 03/16/2010 12:52 PM, Ingo Molnar wrote: * Anthony Liguori <aligu...@linux.vnet.ibm.com> wrote: On 03/16/2010 10:52 AM, Ingo Molnar wrote: You are quite mistaken: KVM isnt really a 'random unprivileged application' in this context, it is clearly an extension of system/kernel services. ( Which can be seen from the simple fact that what started the discussion was 'how do we get /proc/kallsyms from the guest'. I.e. an extension of the existing host-space /proc/kallsyms was desired. ) Random tools (like perf) should not be able to do what you describe. It's a security nightmare. A security nightmare exactly how? Mind to go into details as i dont understand your point. Assume you're using SELinux to implement mandatory access control. How do you label this file system? Generally speaking, we don't know the difference between /proc/kallsyms vs. /dev/mem if we do generic passthrough. While it might be safe to have a relaxed label of kallsyms (since it's read only), it's clearly not safe to do that for /dev/mem, /etc/shadow, or any file containing sensitive information. What's your _point_? Please outline a threat model, a vector of attack, _anything_ that substantiates your 'it's a security nightmare' claim. You suggested to have a (read only) mount of all guest filesystems. As I described earlier, not all of the information within the guest filesystem has the same level of sensitivity. If you exposed a generic interface like this, it makes it very difficult to delegate privileges. Delegating privileges is important because in a higher-security environment, you may want to prevent a management tool from accessing the VM's disk directly, but still allow it to do basic operations (in particular, to view performance statistics).
Rather, we ought to expose a higher level interface that we have more confidence in with respect to understanding the ramifications of exposing that guest data. Exactly, we want something that has a flexible namespace and works well with Linux tools in general. Preferably that namespace should be human readable, and it should be hierarchic, and it should have a well-known permission model. This concept exists in Linux and is generally called a 'filesystem'. If you want to use a synthetic filesystem as the management interface for qemu, that's one thing. But you suggested exposing the guest filesystem in its entirety and that's what I disagreed with. If a user cannot read the image file then the user has no access to its contents via other namespaces either. That is, of course, a basic security aspect. ( That is perfectly true with a non-SELinux Unix permission model as well, and is true in the SELinux case as well. ) I don't think that's reasonable at all. The guest may encrypt its disk image. It still ought to be possible to run perf against that guest, no? Erm. Please explain to me, what exactly is 'not that simple' in a MAC environment? Also, i'd like to note that the 'restrictive SELinux setups' usecases are pretty secondary. To demonstrate that, i'd like every KVM developer on this list who reads this mail and who has their home development system where they produce their patches set up in a restrictive MAC environment, in that you cannot even read the images you are using, to chime in with an 'I'm doing that' reply. My home system doesn't run SELinux but I work daily with systems that are using SELinux. I want to be able to run tools like perf on these systems because ultimately, I need to debug these systems on a daily basis. But that's missing the point. We want to have an interface that works for both cases so that we're not maintaining two separate interfaces. We've rat holed a bit though.
You want:
  1) to run perf kvm list and be able to enumerate KVM guests
  2) for this to Just Work with qemu guests launched from the command line

You could achieve (1) by tying perf to libvirt but that won't work for (2). There are a few practical problems with (2). qemu does not require the user to associate any uniquely identifying information with a VM. We've also optimized the command line use case so that if all you want to do is run a disk image, you just execute qemu foo.img. To satisfy your use case, we would either have to force a user to always specify unique information, which would be less convenient for our users, or we would have to let the name be an optional parameter. As it turns out, we already support qemu -name Fedora foo.img. What we don't do today, but I've been suggesting we should, is automatically create a QMP management socket in a well known location based on the -name parameter when it's specified. That would let a tool like perf Just Work provided that a user specified -name. No one uses -name
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On 03/16/2010 12:39 PM, Ingo Molnar wrote: If we look at the use-case, it's going to be something like, a user is creating virtual machines and wants to get performance information about them. Having to run a separate tool like perf is not going to be what they would expect they had to do. Instead, they would either use their existing GUI tool (like virt-manager) or they would use their management interface (either QMP or libvirt). The complexity of interaction is due to the fact that perf shouldn't be a stand alone tool. It should be a library or something with a programmatic interface that another tool can make use of. But ... a GUI interface/integration is of course possible too, and it's being worked on. perf is mainly a kernel developer tool, and kernel developers generally dont use GUIs to do their stuff: which is the (sole) reason why its first ~850 commits of tools/perf/ were done without a GUI. We go where our developers are. In any case it's not an excuse to have no proper command-line tooling. In fact if you cannot get simpler, more atomic command-line tooling right then you'll probably doubly suck at doing a GUI as well. It's about who owns the user interface. If qemu owns the user interface, then we can satisfy this in a very simple way by adding a perf monitor command. If we have to support third party tools, then it significantly complicates things. Regards, Anthony Liguori Ingo
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
Hi - On Tue, Mar 16, 2010 at 06:04:10PM -0500, Anthony Liguori wrote: [...] The only way to really address this is to change the interaction. Instead of running perf externally to qemu, we should support a perf command in the qemu monitor that can then tie directly to the perf tooling. That gives us the best possible user experience. To what extent could this be solved with less crossing of isolation/abstraction layers, if the perfctr facilities were properly virtualized? That way guests could run perf goo internally. Optionally virt tools on the host side could aggregate data from cooperating self-monitoring guests. - FChE
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On Tue, 2010-03-16 at 11:32 +0200, Avi Kivity wrote: On 03/16/2010 09:48 AM, Zhang, Yanmin wrote: Excellent, support for guest kernel != host kernel is critical (I can't remember the last time I ran same kernels). How would we support multiple guests with different kernels? With the patch, 'perf kvm report --sort pid' could show summary statistics for all guest os instances. Then, use parameter --pid of 'perf kvm record' to collect single problematic instance data. That certainly works, though automatic association of guest data with guest symbols is friendlier. Thanks. Originally, I planned to add a -G parameter to perf. Such like -G :/XXX/XXX/guestkallsyms:/XXX/XXX/modules,8889:/XXX/XXX/guestkallsyms:/XXX/XXX/modules and 8889 are just qemu guest pid. So we could define multiple guest os symbol files. But it seems ugly, and 'perf kvm report --sort pid' and 'perf kvm top --pid' could provide similar functionality.

diff -Nraup linux-2.6_tipmaster0315/arch/x86/kvm/vmx.c linux-2.6_tipmaster0315_perfkvm/arch/x86/kvm/vmx.c
--- linux-2.6_tipmaster0315/arch/x86/kvm/vmx.c	2010-03-16 08:59:11.825295404 +0800
+++ linux-2.6_tipmaster0315_perfkvm/arch/x86/kvm/vmx.c	2010-03-16 09:01:09.976084492 +0800
@@ -26,6 +26,7 @@
 #include <linux/sched.h>
 #include <linux/moduleparam.h>
 #include <linux/ftrace_event.h>
+#include <linux/perf_event.h>
 #include "kvm_cache_regs.h"
 #include "x86.h"
@@ -3632,6 +3633,43 @@ static void update_cr8_intercept(struct
 	vmcs_write32(TPR_THRESHOLD, irr);
 }
+DEFINE_PER_CPU(int, kvm_in_guest) = {0};
+
+static void kvm_set_in_guest(void)
+{
+	percpu_write(kvm_in_guest, 1);
+}
+
+static int kvm_is_in_guest(void)
+{
+	return percpu_read(kvm_in_guest);
+}

There is already PF_VCPU for this. Right, but there is a window between kvm_guest_enter and really running in guest os, where a perf event might overflow. Anyway, the window is very narrow; I will change it to use flag PF_VCPU.
There is also a window between setting the flag and calling 'int $2' where an NMI might happen and be accounted incorrectly. Perhaps separate the 'int $2' into a direct call into perf and another call for the rest of NMI handling. I don't see how it would work on svm though - AFAICT the NMI is held whereas vmx swallows it. I guess NMIs will be disabled until the next IRET so it isn't racy, just tricky. I'm not sure if vmexit does break NMI context or not. Hardware NMI context isn't reentrant until an IRET. YangSheng would like to double check it.

+static struct perf_guest_info_callbacks kvm_guest_cbs = {
+	.is_in_guest		= kvm_is_in_guest,
+	.is_user_mode		= kvm_is_user_mode,
+	.get_guest_ip		= kvm_get_guest_ip,
+	.reset_in_guest		= kvm_reset_in_guest,
+};

Should be in common code, not vmx specific. Right. I discussed with Yangsheng. I will move the above data structures and callbacks to file arch/x86/kvm/x86.c, and add get_ip, a new callback, to kvm_x86_ops. You will need access to the vcpu pointer (kvm_rip_read() needs it); you can put it in a percpu variable. We do so now in a new patch. I guess if it's not null, you know you're in a guest, so no need for PF_VCPU. Good suggestion. Thanks.
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On 03/17/2010 02:41 AM, Frank Ch. Eigler wrote: Hi - On Tue, Mar 16, 2010 at 06:04:10PM -0500, Anthony Liguori wrote: [...] The only way to really address this is to change the interaction. Instead of running perf externally to qemu, we should support a perf command in the qemu monitor that can then tie directly to the perf tooling. That gives us the best possible user experience. To what extent could this be solved with less crossing of isolation/abstraction layers, if the perfctr facilities were properly virtualized? That's the more interesting (by far) usage model. In general guest owners don't have access to the host, and host owners can't (and shouldn't) change guests. Monitoring guests from the host is useful for kvm developers, but less so for users. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On 03/16/2010 07:27 AM, Zhang, Yanmin wrote: From: Zhang, Yanmin <yanmin_zh...@linux.intel.com> Based on the discussion in KVM community, I worked out the patch to support perf to collect guest os statistics from host side. This patch is implemented with Ingo, Peter and some other guys' kind help. Yang Sheng pointed out a critical bug and provided good suggestions with other guys. I really appreciate their kind help. The patch adds new subcommand kvm to perf:

  perf kvm top
  perf kvm record
  perf kvm report
  perf kvm diff

The new perf could profile guest os kernel except guest os user space, but it could summarize guest os user space utilization per guest os. Below are some examples.

1) perf kvm top

  [r...@lkp-ne01 norm]# perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms --guestmodules=/home/ymzhang/guest/modules top

Excellent, support for guest kernel != host kernel is critical (I can't remember the last time I ran same kernels). How would we support multiple guests with different kernels? Perhaps a symbol server that perf can connect to (and that would connect to guests in turn)?

diff -Nraup linux-2.6_tipmaster0315/arch/x86/kvm/vmx.c linux-2.6_tipmaster0315_perfkvm/arch/x86/kvm/vmx.c
--- linux-2.6_tipmaster0315/arch/x86/kvm/vmx.c	2010-03-16 08:59:11.825295404 +0800
+++ linux-2.6_tipmaster0315_perfkvm/arch/x86/kvm/vmx.c	2010-03-16 09:01:09.976084492 +0800
@@ -26,6 +26,7 @@
 #include <linux/sched.h>
 #include <linux/moduleparam.h>
 #include <linux/ftrace_event.h>
+#include <linux/perf_event.h>
 #include "kvm_cache_regs.h"
 #include "x86.h"
@@ -3632,6 +3633,43 @@ static void update_cr8_intercept(struct
 	vmcs_write32(TPR_THRESHOLD, irr);
 }
+DEFINE_PER_CPU(int, kvm_in_guest) = {0};
+
+static void kvm_set_in_guest(void)
+{
+	percpu_write(kvm_in_guest, 1);
+}
+
+static int kvm_is_in_guest(void)
+{
+	return percpu_read(kvm_in_guest);
+}

There is already PF_VCPU for this.
+static struct perf_guest_info_callbacks kvm_guest_cbs = {
+	.is_in_guest		= kvm_is_in_guest,
+	.is_user_mode		= kvm_is_user_mode,
+	.get_guest_ip		= kvm_get_guest_ip,
+	.reset_in_guest		= kvm_reset_in_guest,
+};

Should be in common code, not vmx specific.