date:20120921

Re: [Qemu-devel] Dynamic Binary Instrumentation

2012-09-21 Thread Wei-Ren Chen

Hi Liuis,

On Tue, Sep 04, 2012 at 10:08:09PM +0200, Lluís Vilanova wrote:
 Hi there,
 
 Given that right now I don't have enough time to write the paper that should
 accompany this work, I've decided to open it up so that whoever is interested
 can have access to it.
 
 You can get some instructions here:
 
   https://projects.gso.ac.upc.edu/projects/qemu-dbi/wiki

  The website is down. :/ Would you like to take a look on that?
Thanks.

Regards,
chenwj

-- 
Wei-Ren Chen (陳韋任)
Computer Systems Lab, Institute of Information Science,
Academia Sinica, Taiwan (R.O.C.)
Tel:886-2-2788-3799 #1667
Homepage: http://people.cs.nctu.edu.tw/~chenwj

Re: [Qemu-devel] [PATCH] ehci: Fix interrupt packet MULT handling

2012-09-21 Thread Gerd Hoffmann

On 09/20/12 17:38, Hans de Goede wrote:
 There are several issues with our handling of the MULT epcap field
 of interrupt qhs, which this patch fixes.

Patch added to usb patch queue.

thanks,
  Gerd

Re: [Qemu-devel] [PATCH] blockdev: preserve readonly and snapshot states across media changes

2012-09-21 Thread Kevin Wolf

Am 21.09.2012 01:20, schrieb Kevin Shanahan:
 If readonly=on is given at device creation time, the -readonly flag
 needs to be set in the block driver state for this device so that
 readonly-ness is preserved across media changes (qmp change command).
 Similarly, to preserve the snapshot property requires -open_flags to
 be correct.
 
 Signed-off-by: Kevin Shanahan kmsha...@disenchant.net

Thanks, applied to the block branch.

Kevin

Re: [Qemu-devel] [big lock] Discussion about the convention of device's DMA each other after breaking down biglock

2012-09-21 Thread liu ping fan

On Thu, Sep 20, 2012 at 5:07 PM, Avi Kivity a...@redhat.com wrote:
 On 09/20/2012 10:51 AM, liu ping fan wrote:
 On Wed, Sep 19, 2012 at 5:23 PM, Avi Kivity a...@redhat.com wrote:
 On 09/19/2012 12:19 PM, liu ping fan wrote:
 On Wed, Sep 19, 2012 at 5:14 PM, Paolo Bonzini pbonz...@redhat.com wrote:
 Il 19/09/2012 11:11, liu ping fan ha scritto:
  Why not? devA will drop its local lock, devX will retake the big lock
  recursively, devB will take its local lock.  In the end, we have 
  biglock
  - devB.
 
 But when adopting local lock, we assume take local lock, then biglock.

 No, because the local lock will be dropped before taking the biglock.
 The order must always be coarse-fine.

 But if we takes coarse firstly, then the mmio-dispatcher will still
 contend for the big lock against each other.

 Can you detail the sequence?

 LOCK(local lock)
 ...
 LOCK(big lock)
 Access timer/block/network subsystem
 UNLOCK(big lock)
 .
 UNLOCK(local lock)

 This is an invalid sequence.  Either the subsystem has to be fine-grain
 locked, or the lock order has to be reversed.

Oh!  And from this thread, my understanding of the reason for the rule
of lock sequence: coarse-fine is that biglock means higher
possibility of conflict, so we try it first, then try the fine-lock.
In this way, we have a smaller window for holding fine-lock which
means the other thread can get this lock more smoothly.  Right?
NOT want to open an argument, just a question, is there any reason for
the sequence
devlock-timelock?

Regards,
pingfan
 Before we finish subsystem conversion, an mmio dispatcher may look like:

 dev_write(...)
 {
 lock(s-lock)
 switch (addr) {
 case REGA:
 ...
 case REGB:
 ...
 case REGC:
 unlock(s-lock)
 lock(big lock)
 lock(s-lock)
 qemu_mod_timer()
 unlock(bit lock)
 break;
 ...
 }
 unlock(s-lock)
 }



 --
 error compiling committee.c: too many arguments to function

Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap

2012-09-21 Thread liu ping fan

On Thu, Sep 20, 2012 at 5:15 PM, Avi Kivity a...@redhat.com wrote:
 On 09/20/2012 10:51 AM, liu ping fan wrote:
 On Wed, Sep 19, 2012 at 5:05 PM, Avi Kivity a...@redhat.com wrote:
 On 09/19/2012 11:36 AM, liu ping fan wrote:

 It basically means you can't hold contents of device state in local
 variables.  You need to read everything again from the device.  That
 includes things like DMA enable bits.

 I think that read everything again from the device can not work.
 Suppose the following scene: If the device's state contains the change
 of a series of internal registers (supposing partA+partB;
 partC+partD), after partA changed, the device's lock is broken.  At
 this point, another access to this device, it will work on partA+partB
 to determine C+D, but since partB is not correctly updated yet. So C+D
 may be decided by broken context and be wrong.

 That's the guest's problem.  Note it could have happened even before the
 change, since the writes to A/B/C/D are unordered wrt the DMA.

 Yes, agree, it is the guest's problem.  So it means that ready_of(A+B)
 is not signaled to guest, the guest should not launch operations on
 (C+D). Right?   But here comes the question, if ready not signaled to
 guest, how can guest launch operation on (A+B) again?

 It may be evil.

 i.e. although local lock is broken, the (A+B) is still intact when
 re-acquire local lock.  So need not to read everything again from the
 device.  Wrong?

 The device needs to perform according to its specifications.  If the
 specifications allow for this kind of access, we must ensure it works.
 If they don't, we must ensure something sane happens, instead of a qemu
 crash or exploit.  This means that anything dangerous like pointers must
 be revalidated.  To be on the safe side, I recommend revalidating (or
 reloading) everything, but it may not be necessary in all cases.

Yeah, catch the two points exactly.

 What my suggestion is:
 lock();
 set_and_test(dev-busy);
 if busy
   unlock and return;
 changing device registers;
 do other things including calling to c_p_m_rw() //here,lock broken,
 but set_and_test() works
 clear(dev-busy);
 unlock();

 So changing device registers is protected, and unbreakable.

 But the changes may be legitimate.  Maybe you're writing to a completely
 unrelated register, from a different vcpu, now that write is lost.

 But I think it will mean more-fine locks for each groups of unrelated
 register, and accordingly, the busy should be bitmap for each group.

 It's possible.  Let's defer the discussion until a concrete case is
 before us.  It may be that different devices will want different
 solutions (though there is value in applying one solution everywhere).

Okay. appreciate for the total detail explanation.

Regards,
pingfan

 --
 error compiling committee.c: too many arguments to function

Re: [Qemu-devel] [PATCH v3 3/3] Fix address handling in inet_nonblocking_connect

2012-09-21 Thread Markus Armbruster

Orit Wasserman owass...@redhat.com writes:

 On 09/20/2012 04:14 PM, Markus Armbruster wrote:
 Orit Wasserman owass...@redhat.com writes:
 
 getaddrinfo can give us a list of addresses, but we only try to
 connect to the first one. If that fails we never proceed to
 the next one.  This is common on desktop setups that often have ipv6
 configured but not actually working.

 To fix this make inet_connect_nonblocking retry connection with a different
 address.
 callers on inet_nonblocking_connect register a callback function that will
 be called when connect opertion completes, in case of failure the fd will 
 have
 a negative value

 Signed-off-by: Orit Wasserman owass...@redhat.com
 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 ---
  migration-tcp.c |   29 +---
  qemu-char.c |2 +-
  qemu-sockets.c  |   95 
 +++---
  qemu_socket.h   |   13 ++--
  4 files changed, 102 insertions(+), 37 deletions(-)

 diff --git a/migration-tcp.c b/migration-tcp.c
 index 7f6ad98..cadea36 100644
 --- a/migration-tcp.c
 +++ b/migration-tcp.c
 @@ -53,29 +53,18 @@ static int tcp_close(MigrationState *s)
  return r;
  }
  
 -static void tcp_wait_for_connect(void *opaque)
 +static void tcp_wait_for_connect(int fd, void *opaque)
  {
  MigrationState *s = opaque;
 -int val, ret;
 -socklen_t valsize = sizeof(val);
  
 -DPRINTF(connect completed\n);
 -do {
 -ret = getsockopt(s-fd, SOL_SOCKET, SO_ERROR, (void *) val, 
 valsize);
 -} while (ret == -1  (socket_error()) == EINTR);
 -
 -if (ret  0) {
 +if (fd  0) {
 +DPRINTF(migrate connect error\n);
 +s-fd = -1;
  migrate_fd_error(s);
 -return;
 -}
 -
 -qemu_set_fd_handler2(s-fd, NULL, NULL, NULL, NULL);
 -
 -if (val == 0)
 +} else {
 +DPRINTF(migrate connect success\n);
 +s-fd = fd;
  migrate_fd_connect(s);
 -else {
 -DPRINTF(error connecting %d\n, val);
 -migrate_fd_error(s);
  }
  }
  
 @@ -88,7 +77,8 @@ int tcp_start_outgoing_migration(MigrationState *s, const 
 char *host_port,
  s-write = socket_write;
  s-close = tcp_close;
  
 -s-fd = inet_nonblocking_connect(host_port, in_progress, errp);
 +s-fd = inet_nonblocking_connect(host_port, tcp_wait_for_connect, s,
 + in_progress, errp);
  if (error_is_set(errp)) {
  migrate_fd_error(s);
  return -1;
 @@ -96,7 +86,6 @@ int tcp_start_outgoing_migration(MigrationState *s, const 
 char *host_port,
  
  if (in_progress) {
  DPRINTF(connect in progress\n);
 -qemu_set_fd_handler2(s-fd, NULL, NULL, tcp_wait_for_connect, s);
  } else {
  migrate_fd_connect(s);
  }
 diff --git a/qemu-char.c b/qemu-char.c
 index c442952..11cd5ef 100644
 --- a/qemu-char.c
 +++ b/qemu-char.c
 @@ -2459,7 +2459,7 @@ static CharDriverState *qemu_chr_open_socket(QemuOpts 
 *opts)
  if (is_listen) {
  fd = inet_listen_opts(opts, 0, NULL);
  } else {
 -fd = inet_connect_opts(opts, true, NULL, NULL);
 +fd = inet_connect_opts(opts, true, NULL, NULL, NULL);
  }
  }
  if (fd  0) {
 diff --git a/qemu-sockets.c b/qemu-sockets.c
 index 212075d..d321c58 100644
 --- a/qemu-sockets.c
 +++ b/qemu-sockets.c
 @@ -24,6 +24,7 @@
  
  #include qemu_socket.h
  #include qemu-common.h /* for qemu_isdigit */
 +#include main-loop.h
  
  #ifndef AI_ADDRCONFIG
  # define AI_ADDRCONFIG 0
 @@ -217,11 +218,69 @@ listen:
  ((rc) == -EINPROGRESS)
  #endif
  
 +/* Struct to store connect state for non blocking connect */
 +typedef struct ConnectState {
 +int fd;
 +struct addrinfo *addr_list;
 +struct addrinfo *current_addr;
 +ConnectHandler *callback;
 +void *opaque;
 +Error *errp;
 +} ConnectState;
 +
  static int inet_connect_addr(struct addrinfo *addr, bool block,
 - bool *in_progress)
 + bool *in_progress, ConnectState 
 *connect_state);
 +
 +static void wait_for_connect(void *opaque)
 +{
 +ConnectState *s = opaque;
 +int val = 0, rc = 0;
 +socklen_t valsize = sizeof(val);
 +bool in_progress = false;
 +
 +qemu_set_fd_handler2(s-fd, NULL, NULL, NULL, NULL);
 +
 +do {
 +rc = getsockopt(s-fd, SOL_SOCKET, SO_ERROR, (void *) val, 
 valsize);
 +} while (rc == -1  socket_error() == EINTR);
 +
 +/* update rc to contain error details */
 +if (!rc  val) {
 +rc = -val;
 
 Would rc = -1 suffice?  I'd find that clearer.
 I guess so, I want the errno for more detailed error message
 but those will come in another patch set and I can handle it than.
 I agree that using -1 will make the code much cleaner.
 
 +}
 +
 +/* connect error */
 +if (rc  0) {
 +closesocket(s-fd);
 +s-fd = rc;
 +}
 +
 +/* try to connect to the next address on the list */
 +while

Re: [Qemu-devel] [PATCH v3 0/3] nonblocking connect address handling cleanup

2012-09-21 Thread Markus Armbruster

Orit Wasserman owass...@redhat.com writes:

 On 09/20/2012 04:19 PM, Markus Armbruster wrote:
 Orit Wasserman owass...@redhat.com writes:
 
 Changes from v2:
 - remove the use of getnameinfo
 - remove errp for inet_connect_addr
 - remove QemuOpt block
 - fix errors in wait_for_connect 
 - pass ConnectState as a parameter to allow concurrent connect ops

 getaddrinfo can give us a list of addresses, but we only try to
 connect to the first one. If that fails we never proceed to
 the next one.  This is common on desktop setups that often have ipv6
 configured but not actually working.
 A simple way to reproduce the problem is migration:
 for the destination use -incoming tcp:0:, run migrate -d
 tcp:localhost:
 migration will fail on hosts that have both IPv4 and IPV6 address
 for localhost.

 To fix this, refactor address resolution code and make
 inet_nonblocking_connect
 retry connection with a different address.
 
 Almost there for connect.
 
 I'm afraid we have a similar problem with listen: we bind only on the
 first address that works.  Shouldn't we bind all of them?
 
 http://www.akkadia.org/drepper/userapi-ipv6.html
 
 yes listen should be fixed but lets do it in a separate patch set.

Absolutely.

Re: [Qemu-devel] [PATCH] sparc-dis: Remove redundant NULL check

2012-09-21 Thread Andreas Färber

Am 20.09.2012 19:03, schrieb Stefan Weil:
 Am 05.09.2012 19:45, schrieb Stefan Weil:
 Am 05.09.2012 19:15, schrieb Stefan Weil:

 AFAIK, binutils moved to GPL 3. Therefore I don't expect that
 QEMU will update to upstream in the next years.

 We'll have to maintain the code which we have.

 Try git log *-dis.c or gitk *-dis.c: there are already lots
 of more trivial changes which got applied to the disassembler files.

 = The patch should be applied.

 Regards,
 Stefan

 Here is some additional information:

 binutils switched from GPL 2 to GPL 3 with version 2.18:

 Changes in 2.18:

 * The binutils sources are now released under version 3 of the GNU
 General
   Public License.


 sparc-dis.c is already based on 2.17, so we won't get anything newer.
 Even the latest version still uses the redundant NULL check, so I
 can send my patch to the binutils maintainers, too.

So did you? And did they accept it?

Regards,
Andreas

 Ping? If nobody objects, I suggest to apply the patch via qemu-trivial.
 
 Regards
 
 Stefan W.

-- 
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg

Re: [Qemu-devel] [big lock] Discussion about the convention of device's DMA each other after breaking down biglock

2012-09-21 Thread Paolo Bonzini

 Oh!  And from this thread, my understanding of the reason for the rule
 of lock sequence: coarse-fine is that biglock means higher
 possibility of conflict, so we try it first, then try the fine-lock.
 In this way, we have a smaller window for holding fine-lock which
 means the other thread can get this lock more smoothly.  Right?

Yes.

 NOT want to open an argument, just a question, is there any reason
 for the sequence devlock-timelock?

Not any particular reason, just that it makes more sense. :)
Backend subsystems (timers, network, I/O handlers, ...) typically
will not call anything else.  So it makes sense that their locks
are held inside the device locks (and inside the big lock).

Also note that while the timer subsystem could call the devices,
it is perfectly plausible that the devices will do for example a
qemu_mod_timer from a timer callback.  Thus, in some sense the
timer subsystem already has to expect modifications of its state
during the callback.  Releasing the lock during callbacks is just
an extension of the current behavior.

Of course some changes are needed to the invariants, such as the one
Avi pointed out elsewhere in the thread, but devlock-timerlock is
overall more natural than the opposite.

Paolo

Re: [Qemu-devel] [PATCH v7 5/5] block: Support GlusterFS as a QEMU block backend.

2012-09-21 Thread Paolo Bonzini

Il 21/09/2012 05:50, Bharata B Rao ha scritto:
  Just shooting around a possibility: why reinvent the wheel poorly if we
  can use a full-blown URI parsing library?  The libxml2 one is very good
  and easy to use.
  
  It is also pretty much self-contained and has hardly seen a commit in 3
  years, so we can even make it completely self-contained and keep it in
  tree.  See attachment, which also includes some functions taken from
  libvirt to parse query parameters.
  
  (I know this is controversial and would rather add a build dependency,
  but at the same time it is tempting not to).
 It would be nice to have such URI parsing capability in QEMU. This would
 make it very easy to parse gluster URI.

Feel free to take the two attached files in v8.

Paolo

Re: [Qemu-devel] [PATCH v4 15/19] block: raw-win32 driver reopen support

2012-09-21 Thread Kevin Wolf

Am 20.09.2012 21:13, schrieb Jeff Cody:
 This is the support for reopen, for win32.  Note that there is a key
 difference in the win32 reopen, namely that it is not guaranteed safe
 like all the other drivers.  Attempts are made to reopen the file, or
 open the file in a new handle prior to closing the old handle.  However,
 this is not always feasible, and there are times when we must close the
 existing file and then open the new file, breaking the transactional nature
 of the reopen.
 
 Signed-off-by: Jeff Cody jc...@redhat.com
 ---
  block/raw-win32.c | 105 
 ++
  1 file changed, 105 insertions(+)

I can't really review win32 code, but two comments anyway:

 diff --git a/block/raw-win32.c b/block/raw-win32.c
 index 78c8306..8a698d3 100644
 --- a/block/raw-win32.c
 +++ b/block/raw-win32.c
 @@ -31,6 +31,7 @@
  #define FTYPE_FILE 0
  #define FTYPE_CD 1
  #define FTYPE_HARDDISK 2
 +#define WINDOWS_VISTA 6
  
  typedef struct BDRVRawState {
  HANDLE hfile;
 @@ -38,6 +39,10 @@ typedef struct BDRVRawState {
  char drive_path[16]; /* format: d:\ */
  } BDRVRawState;
  
 +typedef struct BDRVRawReopenState {
 +HANDLE hfile;
 +} BDRVRawReopenState;
 +
  int qemu_ftruncate64(int fd, int64_t length)
  {
  LARGE_INTEGER li;
 @@ -117,6 +122,103 @@ static int raw_open(BlockDriverState *bs, const char 
 *filename, int flags)
  return 0;
  }
  
 +static int raw_reopen_prepare(BDRVReopenState *state,
 +  BlockReopenQueue *queue, Error **errp)
 +{
 +BDRVRawState *s;
 +BDRVRawReopenState *raw_s;
 +int ret = 0;
 +int access_flags;
 +DWORD overlapped;
 +OSVERSIONINFO osvi;
 +
 +assert(state != NULL);
 +assert(state-bs != NULL);
 +
 +s = state-bs-opaque;
 +
 +state-opaque = g_malloc0(sizeof(BDRVRawReopenState));
 +raw_s = state-opaque;
 +
 +raw_parse_flags(state-flags, access_flags, overlapped);
 +
 +ZeroMemory(osvi, sizeof(OSVERSIONINFO));
 +osvi.dwOSVersionInfoSize = sizeof(OSVERSIONINFO);
 +
 +GetVersionEx(osvi);
 +raw_s-hfile = INVALID_HANDLE_VALUE;
 +
 +if (osvi.dwMajorVersion = WINDOWS_VISTA) {
 +raw_s-hfile = ReOpenFile(s-hfile, access_flags, FILE_SHARE_READ,
 +  overlapped);
 +}
 +
 +/* could not reopen the file handle, so fall back to opening
 + * new file (CreateFile) */
 +if (raw_s-hfile == INVALID_HANDLE_VALUE) {
 +raw_s-hfile = CreateFile(state-bs-filename, access_flags,
 +  FILE_SHARE_READ, NULL, OPEN_EXISTING,
 +  overlapped, NULL);
 +if (raw_s-hfile == INVALID_HANDLE_VALUE) {
 +/* this could happen because the access_flags requested are
 + * incompatible with the existing share mode of s-hfile,
 + * so our only option now is to close s-hfile, and try again.
 + * This could end badly */
 +CloseHandle(s-hfile);

How common is this case?

We do have another option, namely not reopen at all and return an error.
Of course, this only makes sense if it doesn't mean that we almost never
succeed.

 +s-hfile = INVALID_HANDLE_VALUE;
 +raw_s-hfile = CreateFile(state-bs-filename, access_flags,
 +  FILE_SHARE_READ, NULL, OPEN_EXISTING,
 +  overlapped, NULL);
 +if (raw_s-hfile == INVALID_HANDLE_VALUE) {
 +/* Unrecoverable error */
 +error_set(errp, ERROR_CLASS_GENERIC_ERROR,
 +  Fatal error reopening %s file; file closed and 
 cannot be opened\n,

This line is longer than 80 characters.

Kevin

Re: [Qemu-devel] [PATCH 2/2] tcg: add TB sanity checking

2012-09-21 Thread Michael Tokarev

On 21.09.2012 04:18, Max Filippov wrote:

 diff --git a/tcg/tcg.c b/tcg/tcg.c

 +#ifdef CONFIG_DEBUG_TCG
 +static void tcg_sanity_check(TCGContext *s)

#ifndef CONFIG_DEBUG_TCG
#define tcg_sanity_check(s) /*empty*/
#else
static void tcg_sanity_check(TCGContext *s)

 +{
[]
 +}
 +#endif

 @@ -2082,6 +2147,10 @@ static inline int tcg_gen_code_common(TCGContext *s, 
 uint8_t *gen_code_buf,
 +#ifdef CONFIG_DEBUG_TCG
 +tcg_sanity_check(s);
 +#endif

And here we can drop the #ifdef.  FWIW.

/mjt

Re: [Qemu-devel] [PATCH v4 02/17] target-i386: Add missing kvm bits.

2012-09-21 Thread Igor Mammedov

On Thu, 20 Sep 2012 16:03:17 -0400
Don Slutz d...@cloudswitch.com wrote:

 Fix duplicate name (kvmclock = kvm_clock2) also.
 
 Signed-off-by: Don Slutz d...@cloudswitch.com
 ---
  target-i386/cpu.c |   12 
  1 files changed, 8 insertions(+), 4 deletions(-)
 
 diff --git a/target-i386/cpu.c b/target-i386/cpu.c
 index 0313cf5..5f9866a 100644
 --- a/target-i386/cpu.c
 +++ b/target-i386/cpu.c
 @@ -87,10 +87,14 @@ static const char *ext3_feature_name[] = {
  };
  
  static const char *kvm_feature_name[] = {
 -kvmclock, kvm_nopiodelay, kvm_mmu, kvmclock, kvm_asyncpf, 
 NULL, kvm_pv_eoi, NULL,
 -NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL,
 -NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL,
 -NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL,
 +kvmclock, kvm_nopiodelay, kvm_mmu, kvm_clock2,
before patch if kvmclock is specified it would set 0 and 3 bits,
after patch only bit 0 is set.
Is it correct/expected behavior? if yes, please add rationale into patch
description.
 
 +kvm_asyncpf, kvm_steal_time, kvm_pv_eoi, NULL,
 +NULL, NULL, NULL, NULL,
 +NULL, NULL, NULL, NULL,
 +NULL, NULL, NULL, NULL,
 +NULL, NULL, NULL, NULL,
 +kvm_clock_stable, NULL, NULL, NULL,
 +NULL, NULL, NULL, NULL,
  };
  
  static const char *svm_feature_name[] = {
 -- 
 1.7.1
 


-- 
Regards,
  Igor

Re: [Qemu-devel] [PATCH v4 15/19] block: raw-win32 driver reopen support

2012-09-21 Thread Paolo Bonzini

Il 21/09/2012 10:33, Kevin Wolf ha scritto:
  +/* could not reopen the file handle, so fall back to opening
  + * new file (CreateFile) */
  +if (raw_s-hfile == INVALID_HANDLE_VALUE) {
  +raw_s-hfile = CreateFile(state-bs-filename, access_flags,
  +  FILE_SHARE_READ, NULL, OPEN_EXISTING,
  +  overlapped, NULL);
  +if (raw_s-hfile == INVALID_HANDLE_VALUE) {
  +/* this could happen because the access_flags requested are
  + * incompatible with the existing share mode of s-hfile,
  + * so our only option now is to close s-hfile, and try again.
  + * This could end badly */
  +CloseHandle(s-hfile);
 How common is this case?
 
 We do have another option, namely not reopen at all and return an error.
 Of course, this only makes sense if it doesn't mean that we almost never
 succeed.

Probably pretty common since we specify FILE_SHARE_READ for the sharing
mode, meaning that subsequent open operations on a file or device are
only able to request read access.

I would change it to FILE_SHARE_READ|FILE_SHARE_WRITE and remove this code.

Paolo

[Qemu-devel] [PATCH 02/41] fix migration sync

2012-09-21 Thread Juan Quintela

Signed-off-by: Juan Quintela quint...@redhat.com
---
 arch_init.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch_init.c b/arch_init.c
index f849f9b..cdd8ab7 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -489,6 +489,7 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
 ram_addr_t addr;
 RAMBlock *block;

+memory_global_sync_dirty_bitmap(get_system_memory());
 bytes_transferred = 0;
 last_block = NULL;
 last_offset = 0;
-- 
1.7.11.4

[Qemu-devel] [PATCH 26/41] migration: make migrate_fd_wait_for_unfreeze() return errors

2012-09-21 Thread Juan Quintela

Adjust all callers

Signed-off-by: Juan Quintela quint...@redhat.com
---
 buffered_file.c | 8 ++--
 migration.c | 7 ---
 migration.h | 2 +-
 3 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/buffered_file.c b/buffered_file.c
index 9db73dc..6d9a50b 100644
--- a/buffered_file.c
+++ b/buffered_file.c
@@ -149,8 +149,12 @@ static int buffered_close(void *opaque)
 if (ret  0) {
 break;
 }
-if (s-freeze_output)
-migrate_fd_wait_for_unfreeze(s-migration_state);
+if (s-freeze_output) {
+ret = migrate_fd_wait_for_unfreeze(s-migration_state);
+if (ret  0) {
+break;
+}
+}
 }

 ret2 = migrate_fd_close(s-migration_state);
diff --git a/migration.c b/migration.c
index 56014dd..6a505c1 100644
--- a/migration.c
+++ b/migration.c
@@ -368,13 +368,13 @@ static void migrate_fd_cancel(MigrationState *s)
 migrate_fd_cleanup(s);
 }

-void migrate_fd_wait_for_unfreeze(MigrationState *s)
+int migrate_fd_wait_for_unfreeze(MigrationState *s)
 {
 int ret;

 DPRINTF(wait for unfreeze\n);
 if (s-state != MIG_STATE_ACTIVE)
-return;
+return -EINVAL;

 do {
 fd_set wfds;
@@ -386,8 +386,9 @@ void migrate_fd_wait_for_unfreeze(MigrationState *s)
 } while (ret == -1  (s-get_error(s)) == EINTR);

 if (ret == -1) {
-qemu_file_set_error(s-file, -s-get_error(s));
+return -s-get_error(s);
 }
+return 0;
 }

 int migrate_fd_close(MigrationState *s)
diff --git a/migration.h b/migration.h
index ec022d6..1c3e9b7 100644
--- a/migration.h
+++ b/migration.h
@@ -81,7 +81,7 @@ void migrate_fd_connect(MigrationState *s);
 ssize_t migrate_fd_put_buffer(MigrationState *s, const void *data,
   size_t size);
 void migrate_fd_put_ready(MigrationState *s);
-void migrate_fd_wait_for_unfreeze(MigrationState *s);
+int migrate_fd_wait_for_unfreeze(MigrationState *s);
 int migrate_fd_close(MigrationState *s);

 void add_migration_state_change_notifier(Notifier *notify);
-- 
1.7.11.4

[Qemu-devel] [PATCH 38/41] block-migration: handle errors with the return codes correctly

2012-09-21 Thread Juan Quintela

Signed-off-by: Juan Quintela quint...@redhat.com
---
 block-migration.c | 21 +++--
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/block-migration.c b/block-migration.c
index 565628f..9f94733 100644
--- a/block-migration.c
+++ b/block-migration.c
@@ -423,10 +423,9 @@ static int mig_save_device_dirty(QEMUFile *f, 
BlkMigDevState *bmds,

 error:
 DPRINTF(Error reading sector % PRId64 \n, sector);
-qemu_file_set_error(f, ret);
 g_free(blk-buf);
 g_free(blk);
-return 0;
+return ret;
 }

 /* return value:
@@ -440,7 +439,7 @@ static int blk_mig_save_dirty_block(QEMUFile *f, int 
is_async)

 QSIMPLEQ_FOREACH(bmds, block_mig_state.bmds_list, entry) {
 ret = mig_save_device_dirty(f, bmds, is_async);
-if (ret == 0) {
+if (ret = 0) {
 break;
 }
 }
@@ -598,12 +597,17 @@ static int block_save_iterate(QEMUFile *f, void *opaque)
 block_mig_state.bulk_completed = 1;
 }
 } else {
-if (blk_mig_save_dirty_block(f, 1) != 0) {
+ret = blk_mig_save_dirty_block(f, 1);
+if (ret != 0) {
 /* no more dirty blocks */
 break;
 }
 }
 }
+if (ret) {
+blk_mig_cleanup();
+return ret;
+}

 ret = flush_blks(f);
 if (ret) {
@@ -635,18 +639,15 @@ static int block_save_complete(QEMUFile *f, void *opaque)
all async read completed */
 assert(block_mig_state.submitted == 0);

-while (blk_mig_save_dirty_block(f, 0) == 0) {
+while ((ret = blk_mig_save_dirty_block(f, 0)) == 0) {
 /* Do nothing */
 }
 blk_mig_cleanup();
-
-/* report completion */
-qemu_put_be64(f, (100  BDRV_SECTOR_BITS) | BLK_MIG_FLAG_PROGRESS);
-
-ret = qemu_file_get_error(f);
 if (ret) {
 return ret;
 }
+/* report completion */
+qemu_put_be64(f, (100  BDRV_SECTOR_BITS) | BLK_MIG_FLAG_PROGRESS);

 DPRINTF(Block migration completed\n);

-- 
1.7.11.4

[Qemu-devel] [PATCH 36/41] block-migration: make flush_blks() return errors

2012-09-21 Thread Juan Quintela

This means we don't need to pass through qemu_file to get the errors.
Adjust all callers.

Signed-off-by: Juan Quintela quint...@redhat.com
---
 block-migration.c | 22 --
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/block-migration.c b/block-migration.c
index 7def8ab..a822bb2 100644
--- a/block-migration.c
+++ b/block-migration.c
@@ -444,9 +444,10 @@ static int blk_mig_save_dirty_block(QEMUFile *f, int 
is_async)
 return ret;
 }

-static void flush_blks(QEMUFile* f)
+static int flush_blks(QEMUFile *f)
 {
 BlkMigBlock *blk;
+int ret = 0;

 DPRINTF(%s Enter submitted %d read_done %d transferred %d\n,
 __FUNCTION__, block_mig_state.submitted, block_mig_state.read_done,
@@ -457,7 +458,7 @@ static void flush_blks(QEMUFile* f)
 break;
 }
 if (blk-ret  0) {
-qemu_file_set_error(f, blk-ret);
+ret = blk-ret;
 break;
 }
 blk_send(f, blk);
@@ -474,6 +475,7 @@ static void flush_blks(QEMUFile* f)
 DPRINTF(%s Exit submitted %d read_done %d transferred %d\n, __FUNCTION__,
 block_mig_state.submitted, block_mig_state.read_done,
 block_mig_state.transferred);
+return ret;
 }

 static int64_t get_remaining_dirty(void)
@@ -553,9 +555,7 @@ static int block_save_setup(QEMUFile *f, void *opaque)
 /* start track dirty blocks */
 set_dirty_tracking(1);

-flush_blks(f);
-
-ret = qemu_file_get_error(f);
+ret = flush_blks(f);
 if (ret) {
 blk_mig_cleanup();
 return ret;
@@ -575,9 +575,7 @@ static int block_save_iterate(QEMUFile *f, void *opaque)
 DPRINTF(Enter save live iterate submitted %d transferred %d\n,
 block_mig_state.submitted, block_mig_state.transferred);

-flush_blks(f);
-
-ret = qemu_file_get_error(f);
+ret = flush_blks(f);
 if (ret) {
 blk_mig_cleanup();
 return ret;
@@ -603,9 +601,7 @@ static int block_save_iterate(QEMUFile *f, void *opaque)
 }
 }

-flush_blks(f);
-
-ret = qemu_file_get_error(f);
+ret = flush_blks(f);
 if (ret) {
 blk_mig_cleanup();
 return ret;
@@ -623,9 +619,7 @@ static int block_save_complete(QEMUFile *f, void *opaque)
 DPRINTF(Enter save live complete submitted %d transferred %d\n,
 block_mig_state.submitted, block_mig_state.transferred);

-flush_blks(f);
-
-ret = qemu_file_get_error(f);
+ret = flush_blks(f);
 if (ret) {
 blk_mig_cleanup();
 return ret;
-- 
1.7.11.4

[Qemu-devel] [PATCH 13/41] ram: create trace event for migration sync bitmap

2012-09-21 Thread Juan Quintela

Signed-off-by: Juan Quintela quint...@redhat.com
---
 arch_init.c  | 6 ++
 trace-events | 4 
 2 files changed, 10 insertions(+)

diff --git a/arch_init.c b/arch_init.c
index a58e8c3..6e0d7c4 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -45,6 +45,7 @@
 #include hw/pcspk.h
 #include qemu/page_cache.h
 #include qmp-commands.h
+#include trace.h

 #ifdef DEBUG_ARCH_INIT
 #define DPRINTF(fmt, ...) \
@@ -359,7 +360,12 @@ static inline void migration_bitmap_set_dirty(MemoryRegion 
*mr, int length)

 static void migration_bitmap_sync(void)
 {
+uint64_t num_dirty_pages_init = ram_list.dirty_pages;
+
+trace_migration_bitmap_sync_start();
 memory_global_sync_dirty_bitmap(get_system_memory());
+trace_migration_bitmap_sync_end(ram_list.dirty_pages
+- num_dirty_pages_init);
 }


diff --git a/trace-events b/trace-events
index b48fe2d..2666191 100644
--- a/trace-events
+++ b/trace-events
@@ -914,6 +914,10 @@ ppm_save(const char *filename, void *display_surface) %s 
surface=%p
 savevm_section_start(void) 
 savevm_section_end(unsigned int section_id) section_id %u

+# arch_init.c
+migration_bitmap_sync_start(void) 
+migration_bitmap_sync_end(uint64_t dirty_pages) dirty_pages % PRIu64
+
 # hw/qxl.c
 disable qxl_interface_set_mm_time(int qid, uint32_t mm_time) %d %d
 disable qxl_io_write_vga(int qid, const char *mode, uint32_t addr, uint32_t 
val) %d %s addr=%u val=%u
-- 
1.7.11.4

[Qemu-devel] [PATCH 25/41] buffered_file: make buffered_flush return the error code

2012-09-21 Thread Juan Quintela

Or the amount of data written if there is no error.  Adjust all callers.

Signed-off-by: Juan Quintela quint...@redhat.com
---
 buffered_file.c | 32 
 1 file changed, 24 insertions(+), 8 deletions(-)

diff --git a/buffered_file.c b/buffered_file.c
index 747d672..9db73dc 100644
--- a/buffered_file.c
+++ b/buffered_file.c
@@ -58,26 +58,26 @@ static void buffered_append(QEMUFileBuffered *s,
 s-buffer_size += size;
 }

-static void buffered_flush(QEMUFileBuffered *s)
+static int buffered_flush(QEMUFileBuffered *s)
 {
 size_t offset = 0;
+int ret = 0;

 DPRINTF(flushing %zu byte(s) of data\n, s-buffer_size);

 while (s-bytes_xfer  s-xfer_limit  offset  s-buffer_size) {
-ssize_t ret;

 ret = migrate_fd_put_buffer(s-migration_state, s-buffer + offset,
 s-buffer_size - offset);
 if (ret == -EAGAIN) {
 DPRINTF(backend not ready, freezing\n);
+ret = 0;
 s-freeze_output = 1;
 break;
 }

 if (ret = 0) {
 DPRINTF(error flushing data, %zd\n, ret);
-qemu_file_set_error(s-file, ret);
 break;
 } else {
 DPRINTF(flushed %zd byte(s)\n, ret);
@@ -89,6 +89,11 @@ static void buffered_flush(QEMUFileBuffered *s)
 DPRINTF(flushed %zu of %zu byte(s)\n, offset, s-buffer_size);
 memmove(s-buffer, s-buffer + offset, s-buffer_size - offset);
 s-buffer_size -= offset;
+
+if (ret  0) {
+return ret;
+}
+return offset;
 }

 static int buffered_put_buffer(void *opaque, const uint8_t *buf, int64_t pos, 
int size)
@@ -112,7 +117,13 @@ static int buffered_put_buffer(void *opaque, const uint8_t 
*buf, int64_t pos, in
 buffered_append(s, buf, size);
 }

-buffered_flush(s);
+error = buffered_flush(s);
+if (error  0) {
+DPRINTF(buffered flush error. bailing: %s\n, strerror(-error));
+qemu_file_set_error(s-file, error);
+
+return error;
+}

 if (pos == 0  size == 0) {
 DPRINTF(file is ready\n);
@@ -128,19 +139,24 @@ static int buffered_put_buffer(void *opaque, const 
uint8_t *buf, int64_t pos, in
 static int buffered_close(void *opaque)
 {
 QEMUFileBuffered *s = opaque;
-int ret;
+int ret = 0, ret2;

 DPRINTF(closing\n);

 s-xfer_limit = INT_MAX;
 while (!qemu_file_get_error(s-file)  s-buffer_size) {
-buffered_flush(s);
+ret = buffered_flush(s);
+if (ret  0) {
+break;
+}
 if (s-freeze_output)
 migrate_fd_wait_for_unfreeze(s-migration_state);
 }

-ret = migrate_fd_close(s-migration_state);
-
+ret2 = migrate_fd_close(s-migration_state);
+if (ret = 0) {
+ret = ret2;
+}
 qemu_del_timer(s-timer);
 qemu_free_timer(s-timer);
 g_free(s-buffer);
-- 
1.7.11.4

Re: [Qemu-devel] [PATCH 0/6] Reduce compaction scanning and lock contention

2012-09-21 Thread Richard Davies

Hi Mel,

Thank you for this series. I have applied on clean 3.6-rc5 and tested, and
it works well for me - the lock contention is (still) gone and
isolate_freepages_block is much reduced.

Here is a typical test with these patches:

# grep -F '[k]' report | head -8
65.20% qemu-kvm  [kernel.kallsyms] [k] clear_page_c
 2.18% qemu-kvm  [kernel.kallsyms] [k] isolate_freepages_block
 1.56% qemu-kvm  [kernel.kallsyms] [k] _raw_spin_lock
 1.40% qemu-kvm  [kernel.kallsyms] [k] svm_vcpu_run
 1.38%  swapper  [kernel.kallsyms] [k] default_idle
 1.35% qemu-kvm  [kernel.kallsyms] [k] get_page_from_freelist
 0.74% ksmd  [kernel.kallsyms] [k] memcmp
 0.72% qemu-kvm  [kernel.kallsyms] [k] free_pages_prepare


I did manage to get a couple which were slightly worse, but nothing like as
bad as before. Here are the results:

# grep -F '[k]' report | head -8
45.60%   qemu-kvm  [kernel.kallsyms] [k] clear_page_c
11.26%   qemu-kvm  [kernel.kallsyms] [k] isolate_freepages_block
 3.21%   qemu-kvm  [kernel.kallsyms] [k] _raw_spin_lock
 2.27%   ksmd  [kernel.kallsyms] [k] memcmp
 2.02%swapper  [kernel.kallsyms] [k] default_idle
 1.58%   qemu-kvm  [kernel.kallsyms] [k] svm_vcpu_run
 1.30%   qemu-kvm  [kernel.kallsyms] [k] _raw_spin_lock_irqsave
 1.09%   qemu-kvm  [kernel.kallsyms] [k] get_page_from_freelist

# grep -F '[k]' report | head -8
61.29%   qemu-kvm  [kernel.kallsyms] [k] clear_page_c
 4.52%   qemu-kvm  [kernel.kallsyms] [k] _raw_spin_lock_irqsave
 2.64%   qemu-kvm  [kernel.kallsyms] [k] copy_page_c
 1.61%swapper  [kernel.kallsyms] [k] default_idle
 1.57%   qemu-kvm  [kernel.kallsyms] [k] _raw_spin_lock
 1.18%   qemu-kvm  [kernel.kallsyms] [k] get_page_from_freelist
 1.18%   qemu-kvm  [kernel.kallsyms] [k] isolate_freepages_block
 1.11%   qemu-kvm  [kernel.kallsyms] [k] svm_vcpu_run

I will follow up with the detailed traces for these three tests.

Thank you!

Richard.

[Qemu-devel] [PATCH 33/41] savevm: make qemu_fill_buffer() be consistent

2012-09-21 Thread Juan Quintela

It was setting last_error directly once, and with the helper the other time.

Signed-off-by: Juan Quintela quint...@redhat.com
---
 savevm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/savevm.c b/savevm.c
index 8ddb9d5..4e4aa3c 100644
--- a/savevm.c
+++ b/savevm.c
@@ -501,7 +501,7 @@ static void qemu_fill_buffer(QEMUFile *f)
 f-buf_size += len;
 f-buf_offset += len;
 } else if (len == 0) {
-f-last_error = -EIO;
+qemu_file_set_error(f, -EIO);
 } else if (len != -EAGAIN)
 qemu_file_set_error(f, len);
 }
-- 
1.7.11.4

[Qemu-devel] [PATCH 40/41] savevm: make qemu_file_put_notify() return errors

2012-09-21 Thread Juan Quintela

Signed-off-by: Juan Quintela quint...@redhat.com
---
 migration.c | 5 +++--
 qemu-file.h | 2 +-
 savevm.c| 4 ++--
 3 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/migration.c b/migration.c
index 6a505c1..2c29d04 100644
--- a/migration.c
+++ b/migration.c
@@ -285,10 +285,11 @@ static void migrate_fd_completed(MigrationState *s)
 static void migrate_fd_put_notify(void *opaque)
 {
 MigrationState *s = opaque;
+int ret;

 qemu_set_fd_handler2(s-fd, NULL, NULL, NULL, NULL);
-qemu_file_put_notify(s-file);
-if (s-file  qemu_file_get_error(s-file)) {
+ret = qemu_file_put_notify(s-file);
+if (ret) {
 migrate_fd_error(s);
 }
 }
diff --git a/qemu-file.h b/qemu-file.h
index 8dd9207..9c8985b 100644
--- a/qemu-file.h
+++ b/qemu-file.h
@@ -107,7 +107,7 @@ int qemu_file_get_error(QEMUFile *f);
 /* Try to send any outstanding data.  This function is useful when output is
  * halted due to rate limiting or EAGAIN errors occur as it can be used to
  * resume output. */
-void qemu_file_put_notify(QEMUFile *f);
+int qemu_file_put_notify(QEMUFile *f);

 static inline void qemu_put_be64s(QEMUFile *f, const uint64_t *pv)
 {
diff --git a/savevm.c b/savevm.c
index 68c0464..2ea1fa6 100644
--- a/savevm.c
+++ b/savevm.c
@@ -523,9 +523,9 @@ int qemu_fclose(QEMUFile *f)
 return ret;
 }

-void qemu_file_put_notify(QEMUFile *f)
+int qemu_file_put_notify(QEMUFile *f)
 {
-f-put_buffer(f-opaque, NULL, 0, 0);
+return f-put_buffer(f-opaque, NULL, 0, 0);
 }

 void qemu_put_buffer(QEMUFile *f, const uint8_t *buf, int size)
-- 
1.7.11.4

[Qemu-devel] [PATCH 21/41] buffered_file: unfold migrate_fd_put_buffer

2012-09-21 Thread Juan Quintela

We only used it once, just remove the callback indirection.

Signed-off-by: Juan Quintela quint...@redhat.com
---
 buffered_file.c | 5 +
 buffered_file.h | 2 --
 migration.c | 4 +---
 migration.h | 1 +
 4 files changed, 3 insertions(+), 9 deletions(-)

diff --git a/buffered_file.c b/buffered_file.c
index 4c6a797..d257496 100644
--- a/buffered_file.c
+++ b/buffered_file.c
@@ -23,7 +23,6 @@

 typedef struct QEMUFileBuffered
 {
-BufferedWaitForUnfreezeFunc *wait_for_unfreeze;
 BufferedCloseFunc *close;
 MigrationState *migration_state;
 QEMUFile *file;
@@ -145,7 +144,7 @@ static int buffered_close(void *opaque)
 while (!qemu_file_get_error(s-file)  s-buffer_size) {
 buffered_flush(s);
 if (s-freeze_output)
-s-wait_for_unfreeze(s-migration_state);
+migrate_fd_wait_for_unfreeze(s-migration_state);
 }

 ret = s-close(s-migration_state);
@@ -226,7 +225,6 @@ static void buffered_rate_tick(void *opaque)

 QEMUFile *qemu_fopen_ops_buffered(MigrationState *migration_state,
   size_t bytes_per_sec,
-  BufferedWaitForUnfreezeFunc 
*wait_for_unfreeze,
   BufferedCloseFunc *close)
 {
 QEMUFileBuffered *s;
@@ -235,7 +233,6 @@ QEMUFile *qemu_fopen_ops_buffered(MigrationState 
*migration_state,

 s-migration_state = migration_state;
 s-xfer_limit = bytes_per_sec / 10;
-s-wait_for_unfreeze = wait_for_unfreeze;
 s-close = close;

 s-file = qemu_fopen_ops(s, buffered_put_buffer, NULL,
diff --git a/buffered_file.h b/buffered_file.h
index dd239b3..926e5c6 100644
--- a/buffered_file.h
+++ b/buffered_file.h
@@ -17,12 +17,10 @@
 #include hw/hw.h
 #include migration.h

-typedef void (BufferedWaitForUnfreezeFunc)(void *opaque);
 typedef int (BufferedCloseFunc)(void *opaque);

 QEMUFile *qemu_fopen_ops_buffered(MigrationState *migration_state,
   size_t xfer_limit,
-  BufferedWaitForUnfreezeFunc 
*wait_for_unfreeze,
   BufferedCloseFunc *close);

 #endif
diff --git a/migration.c b/migration.c
index d0d1014..add4632 100644
--- a/migration.c
+++ b/migration.c
@@ -368,9 +368,8 @@ static void migrate_fd_cancel(MigrationState *s)
 migrate_fd_cleanup(s);
 }

-static void migrate_fd_wait_for_unfreeze(void *opaque)
+void migrate_fd_wait_for_unfreeze(MigrationState *s)
 {
-MigrationState *s = opaque;
 int ret;

 DPRINTF(wait for unfreeze\n);
@@ -432,7 +431,6 @@ void migrate_fd_connect(MigrationState *s)
 s-state = MIG_STATE_ACTIVE;
 s-file = qemu_fopen_ops_buffered(s,
   s-bandwidth_limit,
-  migrate_fd_wait_for_unfreeze,
   migrate_fd_close);

 DPRINTF(beginning savevm\n);
diff --git a/migration.h b/migration.h
index 031c2ab..d6341d6 100644
--- a/migration.h
+++ b/migration.h
@@ -81,6 +81,7 @@ void migrate_fd_connect(MigrationState *s);
 ssize_t migrate_fd_put_buffer(MigrationState *s, const void *data,
   size_t size);
 void migrate_fd_put_ready(MigrationState *s);
+void migrate_fd_wait_for_unfreeze(MigrationState *s);

 void add_migration_state_change_notifier(Notifier *notify);
 void remove_migration_state_change_notifier(Notifier *notify);
-- 
1.7.11.4

[Qemu-devel] [PATCH 29/41] savevm: Remove qemu_fseek()

2012-09-21 Thread Juan Quintela

It has no users, and is only half implemented.

Signed-off-by: Juan Quintela quint...@redhat.com
---
 qemu-file.h |  1 -
 savevm.c| 21 -
 2 files changed, 22 deletions(-)

diff --git a/qemu-file.h b/qemu-file.h
index d8487cd..7fe7274 100644
--- a/qemu-file.h
+++ b/qemu-file.h
@@ -232,6 +232,5 @@ static inline void qemu_get_sbe64s(QEMUFile *f, int64_t *pv)
 }

 int64_t qemu_ftell(QEMUFile *f);
-int64_t qemu_fseek(QEMUFile *f, int64_t pos, int whence);

 #endif
diff --git a/savevm.c b/savevm.c
index 1ec6ff1..6865862 100644
--- a/savevm.c
+++ b/savevm.c
@@ -676,27 +676,6 @@ int64_t qemu_ftell(QEMUFile *f)
 return f-buf_offset - f-buf_size + f-buf_index;
 }

-int64_t qemu_fseek(QEMUFile *f, int64_t pos, int whence)
-{
-if (whence == SEEK_SET) {
-/* nothing to do */
-} else if (whence == SEEK_CUR) {
-pos += qemu_ftell(f);
-} else {
-/* SEEK_END not supported */
-return -1;
-}
-if (f-put_buffer) {
-qemu_fflush(f);
-f-buf_offset = pos;
-} else {
-f-buf_offset = pos;
-f-buf_index = 0;
-f-buf_size = 0;
-}
-return pos;
-}
-
 int qemu_file_rate_limit(QEMUFile *f)
 {
 if (f-rate_limit)
-- 
1.7.11.4

[Qemu-devel] [PATCH 32/41] savevm: unexport qemu_ftell()

2012-09-21 Thread Juan Quintela

It was unused out of savevm.c.

Signed-off-by: Juan Quintela quint...@redhat.com
---
 qemu-file.h | 3 ---
 savevm.c| 2 +-
 2 files changed, 1 insertion(+), 4 deletions(-)

diff --git a/qemu-file.h b/qemu-file.h
index 7fe7274..289849a 100644
--- a/qemu-file.h
+++ b/qemu-file.h
@@ -230,7 +230,4 @@ static inline void qemu_get_sbe64s(QEMUFile *f, int64_t *pv)
 {
 qemu_get_be64s(f, (uint64_t *)pv);
 }
-
-int64_t qemu_ftell(QEMUFile *f);
-
 #endif
diff --git a/savevm.c b/savevm.c
index 8efa7cc..8ddb9d5 100644
--- a/savevm.c
+++ b/savevm.c
@@ -664,7 +664,7 @@ int qemu_get_byte(QEMUFile *f)
 return result;
 }

-int64_t qemu_ftell(QEMUFile *f)
+static int64_t qemu_ftell(QEMUFile *f)
 {
 return f-buf_offset - f-buf_size + f-buf_index;
 }
-- 
1.7.11.4

[Qemu-devel] [PATCH 39/41] savevm: un-export qemu_file_set_error()

2012-09-21 Thread Juan Quintela

Signed-off-by: Juan Quintela quint...@redhat.com
---
 qemu-file.h | 1 -
 savevm.c| 2 +-
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/qemu-file.h b/qemu-file.h
index 289849a..8dd9207 100644
--- a/qemu-file.h
+++ b/qemu-file.h
@@ -103,7 +103,6 @@ int qemu_file_rate_limit(QEMUFile *f);
 int64_t qemu_file_set_rate_limit(QEMUFile *f, int64_t new_rate);
 int64_t qemu_file_get_rate_limit(QEMUFile *f);
 int qemu_file_get_error(QEMUFile *f);
-void qemu_file_set_error(QEMUFile *f, int error);

 /* Try to send any outstanding data.  This function is useful when output is
  * halted due to rate limiting or EAGAIN errors occur as it can be used to
diff --git a/savevm.c b/savevm.c
index 59ec8bf..68c0464 100644
--- a/savevm.c
+++ b/savevm.c
@@ -440,7 +440,7 @@ int qemu_file_get_error(QEMUFile *f)
 return f-last_error;
 }

-void qemu_file_set_error(QEMUFile *f, int ret)
+static void qemu_file_set_error(QEMUFile *f, int ret)
 {
 f-last_error = ret;
 }
-- 
1.7.11.4

[Qemu-devel] [PATCH 27/41] savevm: unexport qemu_fflush

2012-09-21 Thread Juan Quintela

It is not used outside of savevm.c

Signed-off-by: Juan Quintela quint...@redhat.com
---
 qemu-file.h | 1 -
 savevm.c| 2 +-
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/qemu-file.h b/qemu-file.h
index 31b83f6..d8487cd 100644
--- a/qemu-file.h
+++ b/qemu-file.h
@@ -71,7 +71,6 @@ QEMUFile *qemu_fopen_socket(int fd);
 QEMUFile *qemu_popen(FILE *popen_file, const char *mode);
 QEMUFile *qemu_popen_cmd(const char *command, const char *mode);
 int qemu_stdio_fd(QEMUFile *f);
-void qemu_fflush(QEMUFile *f);
 int qemu_fclose(QEMUFile *f);
 void qemu_put_buffer(QEMUFile *f, const uint8_t *buf, int size);
 void qemu_put_byte(QEMUFile *f, int v);
diff --git a/savevm.c b/savevm.c
index c7fe283..1ec6ff1 100644
--- a/savevm.c
+++ b/savevm.c
@@ -461,7 +461,7 @@ static void qemu_file_set_if_error(QEMUFile *f, int ret)
  *
  * In case of error, last_error is set.
  */
-void qemu_fflush(QEMUFile *f)
+static void qemu_fflush(QEMUFile *f)
 {
 if (!f-put_buffer)
 return;
-- 
1.7.11.4

Re: [Qemu-devel] [PATCH 0/6] Reduce compaction scanning and lock contention

2012-09-21 Thread Mel Gorman

On Fri, Sep 21, 2012 at 10:13:33AM +0100, Richard Davies wrote:
 Hi Mel,
 
 Thank you for this series. I have applied on clean 3.6-rc5 and tested, and
 it works well for me - the lock contention is (still) gone and
 isolate_freepages_block is much reduced.
 

Excellent!

 Here is a typical test with these patches:
 
 # grep -F '[k]' report | head -8
 65.20% qemu-kvm  [kernel.kallsyms] [k] clear_page_c
  2.18% qemu-kvm  [kernel.kallsyms] [k] isolate_freepages_block
  1.56% qemu-kvm  [kernel.kallsyms] [k] _raw_spin_lock
  1.40% qemu-kvm  [kernel.kallsyms] [k] svm_vcpu_run
  1.38%  swapper  [kernel.kallsyms] [k] default_idle
  1.35% qemu-kvm  [kernel.kallsyms] [k] get_page_from_freelist
  0.74% ksmd  [kernel.kallsyms] [k] memcmp
  0.72% qemu-kvm  [kernel.kallsyms] [k] free_pages_prepare
 

Ok, so that is more or less acceptable. I would like to reduce the scanning
even further but I'll take this as a start -- largely because I do not have
any new good ideas on how it could be reduced further without incurring
a large cost in the page allocator :)

 I did manage to get a couple which were slightly worse, but nothing like as
 bad as before. Here are the results:
 
 # grep -F '[k]' report | head -8
 45.60%   qemu-kvm  [kernel.kallsyms] [k] clear_page_c
 11.26%   qemu-kvm  [kernel.kallsyms] [k] isolate_freepages_block
  3.21%   qemu-kvm  [kernel.kallsyms] [k] _raw_spin_lock
  2.27%   ksmd  [kernel.kallsyms] [k] memcmp
  2.02%swapper  [kernel.kallsyms] [k] default_idle
  1.58%   qemu-kvm  [kernel.kallsyms] [k] svm_vcpu_run
  1.30%   qemu-kvm  [kernel.kallsyms] [k] _raw_spin_lock_irqsave
  1.09%   qemu-kvm  [kernel.kallsyms] [k] get_page_from_freelist
 
 # grep -F '[k]' report | head -8
 61.29%   qemu-kvm  [kernel.kallsyms] [k] clear_page_c
  4.52%   qemu-kvm  [kernel.kallsyms] [k] _raw_spin_lock_irqsave
  2.64%   qemu-kvm  [kernel.kallsyms] [k] copy_page_c
  1.61%swapper  [kernel.kallsyms] [k] default_idle
  1.57%   qemu-kvm  [kernel.kallsyms] [k] _raw_spin_lock
  1.18%   qemu-kvm  [kernel.kallsyms] [k] get_page_from_freelist
  1.18%   qemu-kvm  [kernel.kallsyms] [k] isolate_freepages_block
  1.11%   qemu-kvm  [kernel.kallsyms] [k] svm_vcpu_run
 
 

Were the boot times acceptable even when these slightly worse figures
were recorded?

 I will follow up with the detailed traces for these three tests.
 
 Thank you!
 

Thank you for the detailed reporting and the testing, it's much
appreciated. I've already rebased the patches to Andrew's tree and tested
them overnight and the figures look good on my side. I'll update the
changelog and push them shortly.

-- 
Mel Gorman
SUSE Labs

[Qemu-devel] [PATCH 18/41] buffered_file: opaque is MigrationState

2012-09-21 Thread Juan Quintela

It always have that type, just change it.

Signed-off-by: Juan Quintela quint...@redhat.com
---
 buffered_file.c | 6 +++---
 buffered_file.h | 4 +++-
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/buffered_file.c b/buffered_file.c
index 33b700b..59d952d 100644
--- a/buffered_file.c
+++ b/buffered_file.c
@@ -27,7 +27,7 @@ typedef struct QEMUFileBuffered
 BufferedPutReadyFunc *put_ready;
 BufferedWaitForUnfreezeFunc *wait_for_unfreeze;
 BufferedCloseFunc *close;
-void *migration_state;
+MigrationState *migration_state;
 QEMUFile *file;
 int freeze_output;
 size_t bytes_xfer;
@@ -226,7 +226,7 @@ static void buffered_rate_tick(void *opaque)
 buffered_put_buffer(s, NULL, 0, 0);
 }

-QEMUFile *qemu_fopen_ops_buffered(void *opaque,
+QEMUFile *qemu_fopen_ops_buffered(MigrationState *migration_state,
   size_t bytes_per_sec,
   BufferedPutFunc *put_buffer,
   BufferedPutReadyFunc *put_ready,
@@ -237,7 +237,7 @@ QEMUFile *qemu_fopen_ops_buffered(void *opaque,

 s = g_malloc0(sizeof(*s));

-s-migration_state = opaque;
+s-migration_state = migration_state;
 s-xfer_limit = bytes_per_sec / 10;
 s-put_buffer = put_buffer;
 s-put_ready = put_ready;
diff --git a/buffered_file.h b/buffered_file.h
index 98d358b..39f7fa0 100644
--- a/buffered_file.h
+++ b/buffered_file.h
@@ -15,13 +15,15 @@
 #define QEMU_BUFFERED_FILE_H

 #include hw/hw.h
+#include migration.h

 typedef ssize_t (BufferedPutFunc)(void *opaque, const void *data, size_t size);
 typedef void (BufferedPutReadyFunc)(void *opaque);
 typedef void (BufferedWaitForUnfreezeFunc)(void *opaque);
 typedef int (BufferedCloseFunc)(void *opaque);

-QEMUFile *qemu_fopen_ops_buffered(void *opaque, size_t xfer_limit,
+QEMUFile *qemu_fopen_ops_buffered(MigrationState *migration_state,
+  size_t xfer_limit,
   BufferedPutFunc *put_buffer,
   BufferedPutReadyFunc *put_ready,
   BufferedWaitForUnfreezeFunc 
*wait_for_unfreeze,
-- 
1.7.11.4

[Qemu-devel] [PATCH 31/41] savevm: unfold qemu_fclose_internal()

2012-09-21 Thread Juan Quintela

It was used only one, and was only one if.  It makes error handling
saner.

Signed-off-by: Juan Quintela quint...@redhat.com
---
 savevm.c | 26 ++
 1 file changed, 6 insertions(+), 20 deletions(-)

diff --git a/savevm.c b/savevm.c
index 0953695..8efa7cc 100644
--- a/savevm.c
+++ b/savevm.c
@@ -506,22 +506,6 @@ static void qemu_fill_buffer(QEMUFile *f)
 qemu_file_set_error(f, len);
 }

-/** Calls close function and set last_error if needed
- *
- * Internal function. qemu_fflush() must be called before this.
- *
- * Returns f-close() return value, or 0 if close function is not set.
- */
-static int qemu_fclose_internal(QEMUFile *f)
-{
-int ret = 0;
-if (f-close) {
-ret = f-close(f-opaque);
-qemu_file_set_if_error(f, ret);
-}
-return ret;
-}
-
 /** Closes the file
  *
  * Returns negative error value if any error happened on previous operations or
@@ -532,12 +516,14 @@ static int qemu_fclose_internal(QEMUFile *f)
  */
 int qemu_fclose(QEMUFile *f)
 {
-int ret, ret2;
+int ret;
 ret = qemu_fflush(f);
-ret2 = qemu_fclose_internal(f);

-if (ret = 0) {
-ret = ret2;
+if (f-close) {
+int ret2 = f-close(f-opaque);
+if (ret = 0) {
+ret = ret2;
+}
 }
 /* If any error was spotted before closing, we should report it
  * instead of the close() return value.
-- 
1.7.11.4

[Qemu-devel] [PATCH 16/41] BufferedFile: append, then flush

2012-09-21 Thread Juan Quintela

From: Paolo Bonzini pbonz...@redhat.com

Simplify the logic for pushing data from the buffer to the output
pipe/socket.  This also matches more closely what will be the
operation of the migration thread.

Signed-off-by: Paolo Bonzini pbonz...@redhat.com
Signed-off-by: Juan Quintela quint...@redhat.com
---
 buffered_file.c | 50 +++---
 1 file changed, 11 insertions(+), 39 deletions(-)

diff --git a/buffered_file.c b/buffered_file.c
index 4148abb..7155800 100644
--- a/buffered_file.c
+++ b/buffered_file.c
@@ -75,7 +75,7 @@ static void buffered_flush(QEMUFileBuffered *s)

 DPRINTF(flushing %zu byte(s) of data\n, s-buffer_size);

-while (offset  s-buffer_size) {
+while (s-bytes_xfer  s-xfer_limit  offset  s-buffer_size) {
 ssize_t ret;

 ret = s-put_buffer(s-opaque, s-buffer + offset,
@@ -93,6 +93,7 @@ static void buffered_flush(QEMUFileBuffered *s)
 } else {
 DPRINTF(flushed %zd byte(s)\n, ret);
 offset += ret;
+s-bytes_xfer += ret;
 }
 }

@@ -104,8 +105,7 @@ static void buffered_flush(QEMUFileBuffered *s)
 static int buffered_put_buffer(void *opaque, const uint8_t *buf, int64_t pos, 
int size)
 {
 QEMUFileBuffered *s = opaque;
-int offset = 0, error;
-ssize_t ret;
+int error;

 DPRINTF(putting %d bytes at % PRId64 \n, size, pos);

@@ -118,48 +118,22 @@ static int buffered_put_buffer(void *opaque, const 
uint8_t *buf, int64_t pos, in
 DPRINTF(unfreezing output\n);
 s-freeze_output = 0;

-buffered_flush(s);
-
-while (!s-freeze_output  offset  size) {
-if (s-bytes_xfer  s-xfer_limit) {
-DPRINTF(transfer limit exceeded when putting\n);
-break;
-}
-
-ret = s-put_buffer(s-opaque, buf + offset, size - offset);
-if (ret == -EAGAIN) {
-DPRINTF(backend not ready, freezing\n);
-s-freeze_output = 1;
-break;
-}
-
-if (ret = 0) {
-DPRINTF(error putting\n);
-qemu_file_set_error(s-file, ret);
-offset = -EINVAL;
-break;
-}
-
-DPRINTF(put %zd byte(s)\n, ret);
-offset += ret;
-s-bytes_xfer += ret;
-}
-
-if (offset = 0) {
+if (size  0) {
 DPRINTF(buffering %d bytes\n, size - offset);
-buffered_append(s, buf + offset, size - offset);
-offset = size;
+buffered_append(s, buf, size);
 }

+buffered_flush(s);
+
 if (pos == 0  size == 0) {
 DPRINTF(file is ready\n);
-if (s-bytes_xfer = s-xfer_limit) {
+if (!s-freeze_output  s-bytes_xfer  s-xfer_limit) {
 DPRINTF(notifying client\n);
 s-put_ready(s-opaque);
 }
 }

-return offset;
+return size;
 }

 static int buffered_close(void *opaque)
@@ -169,6 +143,7 @@ static int buffered_close(void *opaque)

 DPRINTF(closing\n);

+s-xfer_limit = INT_MAX;
 while (!qemu_file_get_error(s-file)  s-buffer_size) {
 buffered_flush(s);
 if (s-freeze_output)
@@ -248,10 +223,7 @@ static void buffered_rate_tick(void *opaque)

 s-bytes_xfer = 0;

-buffered_flush(s);
-
-/* Add some checks around this */
-s-put_ready(s-opaque);
+buffered_put_buffer(s, NULL, 0, 0);
 }

 QEMUFile *qemu_fopen_ops_buffered(void *opaque,
-- 
1.7.11.4

[Qemu-devel] [PATCH 15/41] migration: Add dirty_pages_rate to query migrate output

2012-09-21 Thread Juan Quintela

It indicates how many pages were dirtied during the last second.

Signed-off-by: Juan Quintela quint...@redhat.com
---
 arch_init.c  | 18 ++
 hmp.c|  4 
 migration.c  |  2 ++
 migration.h  |  1 +
 qapi-schema.json |  8 ++--
 5 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index 0279d06..d96e888 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -370,6 +370,14 @@ static void migration_bitmap_sync(void)
 RAMBlock *block;
 ram_addr_t addr;
 uint64_t num_dirty_pages_init = migration_dirty_pages;
+MigrationState *s = migrate_get_current();
+static int64_t start_time;
+static int64_t num_dirty_pages_period;
+int64_t end_time;
+
+if (!start_time) {
+start_time = qemu_get_clock_ms(rt_clock);
+}

 trace_migration_bitmap_sync_start();
 memory_global_sync_dirty_bitmap(get_system_memory());
@@ -386,6 +394,16 @@ static void migration_bitmap_sync(void)
 }
 trace_migration_bitmap_sync_end(migration_dirty_pages
 - num_dirty_pages_init);
+num_dirty_pages_period += migration_dirty_pages - num_dirty_pages_init;
+end_time = qemu_get_clock_ms(rt_clock);
+
+/* more than 1 second = 1000 millisecons */
+if (end_time  start_time + 1000) {
+s-dirty_pages_rate = num_dirty_pages_period * 1000
+/ (end_time - start_time);
+start_time = end_time;
+num_dirty_pages_period = 0;
+}
 }


diff --git a/hmp.c b/hmp.c
index 71c9292..67a529a 100644
--- a/hmp.c
+++ b/hmp.c
@@ -175,6 +175,10 @@ void hmp_info_migrate(Monitor *mon)
info-ram-normal);
 monitor_printf(mon, normal bytes: % PRIu64  kbytes\n,
info-ram-normal_bytes  10);
+if (info-ram-dirty_pages_rate) {
+monitor_printf(mon, dirty pages rate: % PRIu64  pages\n,
+   info-ram-dirty_pages_rate);
+}
 }

 if (info-has_disk) {
diff --git a/migration.c b/migration.c
index 62c8fe9..05634d5 100644
--- a/migration.c
+++ b/migration.c
@@ -180,6 +180,8 @@ MigrationInfo *qmp_query_migrate(Error **errp)
 info-ram-duplicate = dup_mig_pages_transferred();
 info-ram-normal = norm_mig_pages_transferred();
 info-ram-normal_bytes = norm_mig_bytes_transferred();
+info-ram-dirty_pages_rate = s-dirty_pages_rate;
+

 if (blk_mig_active()) {
 info-has_disk = true;
diff --git a/migration.h b/migration.h
index 552200c..66d7f68 100644
--- a/migration.h
+++ b/migration.h
@@ -42,6 +42,7 @@ struct MigrationState
 int64_t total_time;
 int64_t downtime;
 int64_t expected_downtime;
+int64_t dirty_pages_rate;
 bool enabled_capabilities[MIGRATION_CAPABILITY_MAX];
 int64_t xbzrle_cache_size;
 };
diff --git a/qapi-schema.json b/qapi-schema.json
index b8a1244..4a9ae52 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -358,13 +358,17 @@
 #
 # @normal : number of normal pages (since 1.2)
 #
-# @normal-bytes : number of normal bytes sent (since 1.2)
+# @normal-bytes: number of normal bytes sent (since 1.2)
+#
+# @dirty-pages-rate: number of pages dirtied by second by the
+#guest (since 1.3)
 #
 # Since: 0.14.0
 ##
 { 'type': 'MigrationStats',
   'data': {'transferred': 'int', 'remaining': 'int', 'total': 'int' ,
-   'duplicate': 'int', 'normal': 'int', 'normal-bytes': 'int' } }
+   'duplicate': 'int', 'normal': 'int', 'normal-bytes': 'int',
+   'dirty-pages-rate' : 'int' } }

 ##
 # @XBZRLECacheStats
-- 
1.7.11.4

[Qemu-devel] [PATCH 06/41] migration: export migrate_get_current()

2012-09-21 Thread Juan Quintela

Signed-off-by: Juan Quintela quint...@redhat.com
---
 migration.c | 2 +-
 migration.h | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/migration.c b/migration.c
index c1655b3..2827663 100644
--- a/migration.c
+++ b/migration.c
@@ -53,7 +53,7 @@ static NotifierList migration_state_notifiers =
migrations at once.  For now we don't need to add
dynamic creation of migration */

-static MigrationState *migrate_get_current(void)
+MigrationState *migrate_get_current(void)
 {
 static MigrationState current_migration = {
 .state = MIG_STATE_SETUP,
diff --git a/migration.h b/migration.h
index 3462917..dabc333 100644
--- a/migration.h
+++ b/migration.h
@@ -81,6 +81,7 @@ void remove_migration_state_change_notifier(Notifier *notify);
 bool migration_is_active(MigrationState *);
 bool migration_has_finished(MigrationState *);
 bool migration_has_failed(MigrationState *);
+MigrationState *migrate_get_current(void);

 uint64_t ram_bytes_remaining(void);
 uint64_t ram_bytes_transferred(void);
-- 
1.7.11.4

[Qemu-devel] [PATCH 03/41] migration: store end_time in a local variable

2012-09-21 Thread Juan Quintela

Signed-off-by: Juan Quintela quint...@redhat.com
---
 migration.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/migration.c b/migration.c
index 1edeec5..1e3f791 100644
--- a/migration.c
+++ b/migration.c
@@ -327,6 +327,7 @@ static void migrate_fd_put_ready(void *opaque)
 migrate_fd_error(s);
 } else if (ret == 1) {
 int old_vm_running = runstate_is_running();
+int64_t end_time;

 DPRINTF(done iterating\n);
 qemu_system_wakeup_request(QEMU_WAKEUP_REASON_OTHER);
@@ -337,7 +338,8 @@ static void migrate_fd_put_ready(void *opaque)
 } else {
 migrate_fd_completed(s);
 }
-s-total_time = qemu_get_clock_ms(rt_clock) - s-total_time;
+end_time = qemu_get_clock_ms(rt_clock);
+s-total_time = end_time - s-total_time;
 if (s-state != MIG_STATE_COMPLETED) {
 if (old_vm_running) {
 vm_start();
-- 
1.7.11.4

Re: [Qemu-devel] [PATCH 0/6] Reduce compaction scanning and lock contention

2012-09-21 Thread Richard Davies

Mel Gorman wrote:
  I did manage to get a couple which were slightly worse, but nothing like as
  bad as before. Here are the results:
 
  # grep -F '[k]' report | head -8
  45.60%   qemu-kvm  [kernel.kallsyms] [k] clear_page_c
  11.26%   qemu-kvm  [kernel.kallsyms] [k] isolate_freepages_block
   3.21%   qemu-kvm  [kernel.kallsyms] [k] _raw_spin_lock
   2.27%   ksmd  [kernel.kallsyms] [k] memcmp
   2.02%swapper  [kernel.kallsyms] [k] default_idle
   1.58%   qemu-kvm  [kernel.kallsyms] [k] svm_vcpu_run
   1.30%   qemu-kvm  [kernel.kallsyms] [k] _raw_spin_lock_irqsave
   1.09%   qemu-kvm  [kernel.kallsyms] [k] get_page_from_freelist
 
  # grep -F '[k]' report | head -8
  61.29%   qemu-kvm  [kernel.kallsyms] [k] clear_page_c
   4.52%   qemu-kvm  [kernel.kallsyms] [k] _raw_spin_lock_irqsave
   2.64%   qemu-kvm  [kernel.kallsyms] [k] copy_page_c
   1.61%swapper  [kernel.kallsyms] [k] default_idle
   1.57%   qemu-kvm  [kernel.kallsyms] [k] _raw_spin_lock
   1.18%   qemu-kvm  [kernel.kallsyms] [k] get_page_from_freelist
   1.18%   qemu-kvm  [kernel.kallsyms] [k] isolate_freepages_block
   1.11%   qemu-kvm  [kernel.kallsyms] [k] svm_vcpu_run

 Were the boot times acceptable even when these slightly worse figures
 were recorded?

Yes, they were 10-20% slower as you might expect from the traces, rather
than a factor slower.

 Thank you for the detailed reporting and the testing, it's much
 appreciated. I've already rebased the patches to Andrew's tree and tested
 them overnight and the figures look good on my side. I'll update the
 changelog and push them shortly.

Great. On my side, I'm delighted that senior kernel developers such as you,
Rik and Avi took our bug report seriously and helped fix it!

Thank you,

Richard.

[Qemu-devel] [PATCH 28/41] virtio-net: use qemu_get_buffer() in a temp buffer

2012-09-21 Thread Juan Quintela

qemu_fseek() is known to be wrong.  Would be removed on the next
commit.  This code should never been used (value has been
MAC_TABLE_ENTRIES since 2009).

Signed-off-by: Juan Quintela quint...@redhat.com
---
 hw/virtio-net.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/hw/virtio-net.c b/hw/virtio-net.c
index 6490743..e8c43af 100644
--- a/hw/virtio-net.c
+++ b/hw/virtio-net.c
@@ -921,7 +921,9 @@ static int virtio_net_load(QEMUFile *f, void *opaque, int 
version_id)
 qemu_get_buffer(f, n-mac_table.macs,
 n-mac_table.in_use * ETH_ALEN);
 } else if (n-mac_table.in_use) {
-qemu_fseek(f, n-mac_table.in_use * ETH_ALEN, SEEK_CUR);
+uint8_t *buf = g_malloc0(n-mac_table.in_use);
+qemu_get_buffer(f, buf, n-mac_table.in_use * ETH_ALEN);
+g_free(buf);
 n-mac_table.multi_overflow = n-mac_table.uni_overflow = 1;
 n-mac_table.in_use = 0;
 }
-- 
1.7.11.4

[Qemu-devel] [PATCH 05/41] migration: rename expected_time to expected_downtime

2012-09-21 Thread Juan Quintela

Signed-off-by: Juan Quintela quint...@redhat.com
---
 arch_init.c | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index cdd8ab7..013e5e5 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -539,7 +539,7 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
 double bwidth = 0;
 int ret;
 int i;
-uint64_t expected_time;
+uint64_t expected_downtime;

 bytes_transferred_last = bytes_transferred;
 bwidth = qemu_get_clock_ns(rt_clock);
@@ -578,24 +578,24 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
 bwidth = qemu_get_clock_ns(rt_clock) - bwidth;
 bwidth = (bytes_transferred - bytes_transferred_last) / bwidth;

-/* if we haven't transferred anything this round, force expected_time to a
- * a very high value, but without crashing */
+/* if we haven't transferred anything this round, force
+ * expected_downtime to a very high value, but without
+ * crashing */
 if (bwidth == 0) {
 bwidth = 0.01;
 }

 qemu_put_be64(f, RAM_SAVE_FLAG_EOS);

-expected_time = ram_save_remaining() * TARGET_PAGE_SIZE / bwidth;
+expected_downtime = ram_save_remaining() * TARGET_PAGE_SIZE / bwidth;
+DPRINTF(ram_save_live: expected(% PRIu64 ) = max( PRIu64 )?\n,
+expected_downtime, migrate_max_downtime());

-DPRINTF(ram_save_live: expected(% PRIu64 ) = max(% PRIu64 )?\n,
-expected_time, migrate_max_downtime());
-
-if (expected_time = migrate_max_downtime()) {
+if (expected_downtime = migrate_max_downtime()) {
 memory_global_sync_dirty_bitmap(get_system_memory());
-expected_time = ram_save_remaining() * TARGET_PAGE_SIZE / bwidth;
+expected_downtime = ram_save_remaining() * TARGET_PAGE_SIZE / bwidth;

-return expected_time = migrate_max_downtime();
+return expected_downtime = migrate_max_downtime();
 }
 return 0;
 }
-- 
1.7.11.4

[Qemu-devel] [PATCH 23/41] buffered_file: We can access directly to bandwidth_limit

2012-09-21 Thread Juan Quintela

Signed-off-by: Juan Quintela quint...@redhat.com
---
 buffered_file.c | 5 ++---
 buffered_file.h | 3 +--
 migration.c | 2 +-
 3 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/buffered_file.c b/buffered_file.c
index 4fca774..43e68b6 100644
--- a/buffered_file.c
+++ b/buffered_file.c
@@ -222,15 +222,14 @@ static void buffered_rate_tick(void *opaque)
 buffered_put_buffer(s, NULL, 0, 0);
 }

-QEMUFile *qemu_fopen_ops_buffered(MigrationState *migration_state,
-  size_t bytes_per_sec)
+QEMUFile *qemu_fopen_ops_buffered(MigrationState *migration_state)
 {
 QEMUFileBuffered *s;

 s = g_malloc0(sizeof(*s));

 s-migration_state = migration_state;
-s-xfer_limit = bytes_per_sec / 10;
+s-xfer_limit = migration_state-bandwidth_limit / 10;

 s-file = qemu_fopen_ops(s, buffered_put_buffer, NULL,
  buffered_close, buffered_rate_limit,
diff --git a/buffered_file.h b/buffered_file.h
index 8a38754..ef010fe 100644
--- a/buffered_file.h
+++ b/buffered_file.h
@@ -17,7 +17,6 @@
 #include hw/hw.h
 #include migration.h

-QEMUFile *qemu_fopen_ops_buffered(MigrationState *migration_state,
-  size_t xfer_limit);
+QEMUFile *qemu_fopen_ops_buffered(MigrationState *migration_state);

 #endif
diff --git a/migration.c b/migration.c
index 6f1e4d3..56014dd 100644
--- a/migration.c
+++ b/migration.c
@@ -427,7 +427,7 @@ void migrate_fd_connect(MigrationState *s)
 int ret;

 s-state = MIG_STATE_ACTIVE;
-s-file = qemu_fopen_ops_buffered(s, s-bandwidth_limit);
+s-file = qemu_fopen_ops_buffered(s);

 DPRINTF(beginning savevm\n);
 ret = qemu_savevm_state_begin(s-file, s-params);
-- 
1.7.11.4

[Qemu-devel] [PATCH 09/41] ram: introduce migration_bitmap_set_dirty()

2012-09-21 Thread Juan Quintela

It just marks a region of memory as dirty.

Signed-off-by: Juan Quintela quint...@redhat.com
---
 arch_init.c | 20 +---
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index 57f7f1a..b2dcc24 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -332,6 +332,18 @@ static int save_xbzrle_page(QEMUFile *f, uint8_t 
*current_data,
 static RAMBlock *last_block;
 static ram_addr_t last_offset;

+static inline void migration_bitmap_set_dirty(MemoryRegion *mr, int length)
+{
+ram_addr_t addr;
+
+for (addr = 0; addr  length; addr += TARGET_PAGE_SIZE) {
+if (!memory_region_get_dirty(mr, addr, TARGET_PAGE_SIZE,
+ DIRTY_MEMORY_MIGRATION)) {
+memory_region_set_dirty(mr, addr, TARGET_PAGE_SIZE);
+}
+}
+}
+
 /*
  * ram_save_block: Writes a page of memory to the stream f
  *
@@ -494,7 +506,6 @@ static void reset_ram_globals(void)

 static int ram_save_setup(QEMUFile *f, void *opaque)
 {
-ram_addr_t addr;
 RAMBlock *block;

 memory_global_sync_dirty_bitmap(get_system_memory());
@@ -516,12 +527,7 @@ static int ram_save_setup(QEMUFile *f, void *opaque)

 /* Make sure all dirty bits are set */
 QLIST_FOREACH(block, ram_list.blocks, next) {
-for (addr = 0; addr  block-length; addr += TARGET_PAGE_SIZE) {
-if (!memory_region_get_dirty(block-mr, addr, TARGET_PAGE_SIZE,
- DIRTY_MEMORY_MIGRATION)) {
-memory_region_set_dirty(block-mr, addr, TARGET_PAGE_SIZE);
-}
-}
+migration_bitmap_set_dirty(block-mr, block-length);
 }

 memory_global_dirty_log_start();
-- 
1.7.11.4

[Qemu-devel] [PATCH 07/41] migration: print expected downtime in info migrate

2012-09-21 Thread Juan Quintela

Signed-off-by: Juan Quintela quint...@redhat.com
---
 arch_init.c  | 2 ++
 hmp.c| 4 
 migration.c  | 2 ++
 migration.h  | 1 +
 qapi-schema.json | 5 +
 qmp-commands.hx  | 6 ++
 6 files changed, 20 insertions(+)

diff --git a/arch_init.c b/arch_init.c
index 013e5e5..52ccc7b 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -540,6 +540,7 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
 int ret;
 int i;
 uint64_t expected_downtime;
+MigrationState *s = migrate_get_current();

 bytes_transferred_last = bytes_transferred;
 bwidth = qemu_get_clock_ns(rt_clock);
@@ -594,6 +595,7 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
 if (expected_downtime = migrate_max_downtime()) {
 memory_global_sync_dirty_bitmap(get_system_memory());
 expected_downtime = ram_save_remaining() * TARGET_PAGE_SIZE / bwidth;
+s-expected_downtime = expected_downtime / 100; /* ns - ms */

 return expected_downtime = migrate_max_downtime();
 }
diff --git a/hmp.c b/hmp.c
index 40b0c05..71c9292 100644
--- a/hmp.c
+++ b/hmp.c
@@ -152,6 +152,10 @@ void hmp_info_migrate(Monitor *mon)
 monitor_printf(mon, Migration status: %s\n, info-status);
 monitor_printf(mon, total time: % PRIu64  milliseconds\n,
info-total_time);
+if (info-has_expected_downtime) {
+monitor_printf(mon, expected downtime: % PRIu64  
milliseconds\n,
+   info-expected_downtime);
+}
 if (info-has_downtime) {
 monitor_printf(mon, downtime: % PRIu64  milliseconds\n,
info-downtime);
diff --git a/migration.c b/migration.c
index 2827663..62c8fe9 100644
--- a/migration.c
+++ b/migration.c
@@ -169,6 +169,8 @@ MigrationInfo *qmp_query_migrate(Error **errp)
 info-has_total_time = true;
 info-total_time = qemu_get_clock_ms(rt_clock)
 - s-total_time;
+info-has_expected_downtime = true;
+info-expected_downtime = s-expected_downtime;

 info-has_ram = true;
 info-ram = g_malloc0(sizeof(*info-ram));
diff --git a/migration.h b/migration.h
index dabc333..552200c 100644
--- a/migration.h
+++ b/migration.h
@@ -41,6 +41,7 @@ struct MigrationState
 MigrationParams params;
 int64_t total_time;
 int64_t downtime;
+int64_t expected_downtime;
 bool enabled_capabilities[MIGRATION_CAPABILITY_MAX];
 int64_t xbzrle_cache_size;
 };
diff --git a/qapi-schema.json b/qapi-schema.json
index b5a4360..b8a1244 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -417,6 +417,10 @@
 #total downtime in milliseconds for the guest.
 #(since 1.3)
 #
+# @expected-downtime: #optional only present while migration is active
+#expected downtime in milliseconds for the guest in last walk
+#of the dirty bitmap. (since 1.3)
+#
 # Since: 0.14.0
 ##
 { 'type': 'MigrationInfo',
@@ -424,6 +428,7 @@
'*disk': 'MigrationStats',
'*xbzrle-cache': 'XBZRLECacheStats',
'*total-time': 'int',
+   '*expected-downtime': 'int',
'*downtime': 'int'} }

 ##
diff --git a/qmp-commands.hx b/qmp-commands.hx
index 37be613..68b6580 100644
--- a/qmp-commands.hx
+++ b/qmp-commands.hx
@@ -2269,6 +2269,9 @@ The main json-object contains the following:
 time (json-int)
 - downtime: only present when migration has finished correctly
 total amount in ms for downtime that happened (json-int)
+- expected-downtime: only present while migration is active
+total amount in ms for downtime that was calculated on
+   the last bitmap round (json-int)
 - ram: only present if status is active, it is a json-object with the
   following RAM information (in bytes):
  - transferred: amount transferred (json-int)
@@ -2330,6 +2333,7 @@ Examples:
 remaining:123,
 total:246,
 total-time:12345,
+expected-downtime:12345,
 duplicate:123,
 normal:123,
 normal-bytes:123456
@@ -2348,6 +2352,7 @@ Examples:
 remaining:1053304,
 transferred:3720,
 total-time:12345,
+expected-downtime:12345,
 duplicate:123,
 normal:123,
 normal-bytes:123456
@@ -2372,6 +2377,7 @@ Examples:
 remaining:1053304,
 transferred:3720,
 total-time:12345,
+expected-downtime:12345,
 duplicate:10,
 normal:,
 normal-bytes:3412992
-- 
1.7.11.4

Re: [Qemu-devel] [PATCH 0/6] Reduce compaction scanning and lock contention

2012-09-21 Thread Mel Gorman

On Fri, Sep 21, 2012 at 10:17:01AM +0100, Richard Davies wrote:
 Richard Davies wrote:
  I did manage to get a couple which were slightly worse, but nothing like as
  bad as before. Here are the results:
  
  # grep -F '[k]' report | head -8
  45.60%   qemu-kvm  [kernel.kallsyms] [k] clear_page_c
  11.26%   qemu-kvm  [kernel.kallsyms] [k] isolate_freepages_block
   3.21%   qemu-kvm  [kernel.kallsyms] [k] _raw_spin_lock
   2.27%   ksmd  [kernel.kallsyms] [k] memcmp
   2.02%swapper  [kernel.kallsyms] [k] default_idle
   1.58%   qemu-kvm  [kernel.kallsyms] [k] svm_vcpu_run
   1.30%   qemu-kvm  [kernel.kallsyms] [k] _raw_spin_lock_irqsave
   1.09%   qemu-kvm  [kernel.kallsyms] [k] get_page_from_freelist
 
 # 
 # captured on: Fri Sep 21 08:17:52 2012
 # os release : 3.6.0-rc5-elastic+
 # perf version : 3.5.2
 # arch : x86_64
 # nrcpus online : 16
 # nrcpus avail : 16
 # cpudesc : AMD Opteron(tm) Processor 6128
 # cpuid : AuthenticAMD,16,9,1
 # total memory : 131973276 kB
 # cmdline : /home/root/bin/perf record -g -a 
 # event : name = cycles, type = 0, config = 0x0, config1 = 0x0, config2 = 
 0x0, excl_usr = 0, excl_kern = 0, id = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 
 12, 13, 14, 15, 16 }
 # HEADER_CPU_TOPOLOGY info available, use -I to display
 # HEADER_NUMA_TOPOLOGY info available, use -I to display
 # 
 #
 # Samples: 283K of event 'cycles'
 # Event count (approx.): 109057976176
 #
 # OverheadCommand Shared Object   
Symbol
 #   .    
 ..
 #
 45.60%   qemu-kvm  [kernel.kallsyms] [k] clear_page_c 
  
  |
  --- clear_page_c
 |  
 |--93.35%-- do_huge_pmd_anonymous_page

This is unavoidable. If THP was disabled, the cost would still be
incurred, just on base pages instead of huge pages.

 SNIP
 11.26%   qemu-kvm  [kernel.kallsyms] [k] isolate_freepages_block  
  
  |
  --- isolate_freepages_block
  compaction_alloc
  migrate_pages
  compact_zone
  compact_zone_order
  try_to_compact_pages
  __alloc_pages_direct_compact
  __alloc_pages_nodemask
  alloc_pages_vma
  do_huge_pmd_anonymous_page

And this is showing that we're still spending a lot of time scanning
for free pages to isolate. I do not have a great idea on how this can be
reduced further without interfering with the page allocator.

One ok idea I considered in the past was using the buddy lists to find
free pages quickly but there is first the problem that the buddy lists
themselves may need to be searched and now that the zone lock is not held
during the scan it would be particularly difficult. The harder problem is
deciding when compaction finishes. I'll put more thought into it over
the weekend and see if something falls out but I'm not going to hold up
this series waiting for inspiration.

  3.21%   qemu-kvm  [kernel.kallsyms] [k] _raw_spin_lock   
  
  |
  --- _raw_spin_lock
 |  
 |--39.96%-- tdp_page_fault

Nothing very interesting here until...

 |--1.69%-- free_pcppages_bulk
 |  |  
 |  |--77.53%-- drain_pages
 |  |  |  
 |  |  |--95.77%-- drain_local_pages
 |  |  |  |  
 |  |  |  |--97.90%-- 
 generic_smp_call_function_interrupt
 |  |  |  |  
 smp_call_function_interrupt
 |  |  |  |  
 call_function_interrupt
 |  |  |  |  |  
 |  |  |  |  |--23.37%-- 
 kvm_vcpu_ioctl
 |  |  |  |  |  
 do_vfs_ioctl
 |  |  |  |  |  
 sys_ioctl
 |  |  |  |  |  
 system_call_fastpath
 |  |  |  |  |  
 ioctl
 |  |  |  |  |  |  
 
 |  |  |  |  |  
 |--97.22%-- 0x1010006
 |  |  |

[Qemu-devel] [PATCH 22/41] buffered_file: unfold migrate_fd_put_buffer

2012-09-21 Thread Juan Quintela

We only used it once, just remove the callback indirection.

Signed-off-by: Juan Quintela quint...@redhat.com
---
 buffered_file.c | 7 ++-
 buffered_file.h | 5 +
 migration.c | 8 ++--
 migration.h | 1 +
 4 files changed, 6 insertions(+), 15 deletions(-)

diff --git a/buffered_file.c b/buffered_file.c
index d257496..4fca774 100644
--- a/buffered_file.c
+++ b/buffered_file.c
@@ -23,7 +23,6 @@

 typedef struct QEMUFileBuffered
 {
-BufferedCloseFunc *close;
 MigrationState *migration_state;
 QEMUFile *file;
 int freeze_output;
@@ -147,7 +146,7 @@ static int buffered_close(void *opaque)
 migrate_fd_wait_for_unfreeze(s-migration_state);
 }

-ret = s-close(s-migration_state);
+ret = migrate_fd_close(s-migration_state);

 qemu_del_timer(s-timer);
 qemu_free_timer(s-timer);
@@ -224,8 +223,7 @@ static void buffered_rate_tick(void *opaque)
 }

 QEMUFile *qemu_fopen_ops_buffered(MigrationState *migration_state,
-  size_t bytes_per_sec,
-  BufferedCloseFunc *close)
+  size_t bytes_per_sec)
 {
 QEMUFileBuffered *s;

@@ -233,7 +231,6 @@ QEMUFile *qemu_fopen_ops_buffered(MigrationState 
*migration_state,

 s-migration_state = migration_state;
 s-xfer_limit = bytes_per_sec / 10;
-s-close = close;

 s-file = qemu_fopen_ops(s, buffered_put_buffer, NULL,
  buffered_close, buffered_rate_limit,
diff --git a/buffered_file.h b/buffered_file.h
index 926e5c6..8a38754 100644
--- a/buffered_file.h
+++ b/buffered_file.h
@@ -17,10 +17,7 @@
 #include hw/hw.h
 #include migration.h

-typedef int (BufferedCloseFunc)(void *opaque);
-
 QEMUFile *qemu_fopen_ops_buffered(MigrationState *migration_state,
-  size_t xfer_limit,
-  BufferedCloseFunc *close);
+  size_t xfer_limit);

 #endif
diff --git a/migration.c b/migration.c
index add4632..6f1e4d3 100644
--- a/migration.c
+++ b/migration.c
@@ -390,10 +390,8 @@ void migrate_fd_wait_for_unfreeze(MigrationState *s)
 }
 }

-static int migrate_fd_close(void *opaque)
+int migrate_fd_close(MigrationState *s)
 {
-MigrationState *s = opaque;
-
 qemu_set_fd_handler2(s-fd, NULL, NULL, NULL, NULL);
 return s-close(s);
 }
@@ -429,9 +427,7 @@ void migrate_fd_connect(MigrationState *s)
 int ret;

 s-state = MIG_STATE_ACTIVE;
-s-file = qemu_fopen_ops_buffered(s,
-  s-bandwidth_limit,
-  migrate_fd_close);
+s-file = qemu_fopen_ops_buffered(s, s-bandwidth_limit);

 DPRINTF(beginning savevm\n);
 ret = qemu_savevm_state_begin(s-file, s-params);
diff --git a/migration.h b/migration.h
index d6341d6..ec022d6 100644
--- a/migration.h
+++ b/migration.h
@@ -82,6 +82,7 @@ ssize_t migrate_fd_put_buffer(MigrationState *s, const void 
*data,
   size_t size);
 void migrate_fd_put_ready(MigrationState *s);
 void migrate_fd_wait_for_unfreeze(MigrationState *s);
+int migrate_fd_close(MigrationState *s);

 void add_migration_state_change_notifier(Notifier *notify);
 void remove_migration_state_change_notifier(Notifier *notify);
-- 
1.7.11.4

[Qemu-devel] [PATCH 12/41] ram: introduce migration_bitmap_sync()

2012-09-21 Thread Juan Quintela

Helper that we use each time that we need to syncronize the migration
bitmap with the other dirty bitmaps.

Signed-off-by: Juan Quintela quint...@redhat.com
---
 arch_init.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index acc057f..a58e8c3 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -357,6 +357,12 @@ static inline void migration_bitmap_set_dirty(MemoryRegion 
*mr, int length)
 }
 }

+static void migration_bitmap_sync(void)
+{
+memory_global_sync_dirty_bitmap(get_system_memory());
+}
+
+
 /*
  * ram_save_block: Writes a page of memory to the stream f
  *
@@ -614,7 +620,7 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
 expected_downtime, migrate_max_downtime());

 if (expected_downtime = migrate_max_downtime()) {
-memory_global_sync_dirty_bitmap(get_system_memory());
+migration_bitmap_sync();
 expected_downtime = ram_save_remaining() * TARGET_PAGE_SIZE / bwidth;
 s-expected_downtime = expected_downtime / 100; /* ns - ms */

@@ -625,7 +631,7 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)

 static int ram_save_complete(QEMUFile *f, void *opaque)
 {
-memory_global_sync_dirty_bitmap(get_system_memory());
+migration_bitmap_sync();

 /* try transferring iterative blocks of memory */

-- 
1.7.11.4

[Qemu-devel] [PATCH 19/41] buffered_file: unfold migrate_fd_put_buffer

2012-09-21 Thread Juan Quintela

We only used it once, just remove the callback indirection

Signed-off-by: Juan Quintela quint...@redhat.com
---
 buffered_file.c | 7 ++-
 buffered_file.h | 2 --
 migration.c | 6 ++
 migration.h | 3 +++
 4 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/buffered_file.c b/buffered_file.c
index 59d952d..702a726 100644
--- a/buffered_file.c
+++ b/buffered_file.c
@@ -23,7 +23,6 @@

 typedef struct QEMUFileBuffered
 {
-BufferedPutFunc *put_buffer;
 BufferedPutReadyFunc *put_ready;
 BufferedWaitForUnfreezeFunc *wait_for_unfreeze;
 BufferedCloseFunc *close;
@@ -78,8 +77,8 @@ static void buffered_flush(QEMUFileBuffered *s)
 while (s-bytes_xfer  s-xfer_limit  offset  s-buffer_size) {
 ssize_t ret;

-ret = s-put_buffer(s-migration_state, s-buffer + offset,
-s-buffer_size - offset);
+ret = migrate_fd_put_buffer(s-migration_state, s-buffer + offset,
+s-buffer_size - offset);
 if (ret == -EAGAIN) {
 DPRINTF(backend not ready, freezing\n);
 s-freeze_output = 1;
@@ -228,7 +227,6 @@ static void buffered_rate_tick(void *opaque)

 QEMUFile *qemu_fopen_ops_buffered(MigrationState *migration_state,
   size_t bytes_per_sec,
-  BufferedPutFunc *put_buffer,
   BufferedPutReadyFunc *put_ready,
   BufferedWaitForUnfreezeFunc 
*wait_for_unfreeze,
   BufferedCloseFunc *close)
@@ -239,7 +237,6 @@ QEMUFile *qemu_fopen_ops_buffered(MigrationState 
*migration_state,

 s-migration_state = migration_state;
 s-xfer_limit = bytes_per_sec / 10;
-s-put_buffer = put_buffer;
 s-put_ready = put_ready;
 s-wait_for_unfreeze = wait_for_unfreeze;
 s-close = close;
diff --git a/buffered_file.h b/buffered_file.h
index 39f7fa0..ca7e62d 100644
--- a/buffered_file.h
+++ b/buffered_file.h
@@ -17,14 +17,12 @@
 #include hw/hw.h
 #include migration.h

-typedef ssize_t (BufferedPutFunc)(void *opaque, const void *data, size_t size);
 typedef void (BufferedPutReadyFunc)(void *opaque);
 typedef void (BufferedWaitForUnfreezeFunc)(void *opaque);
 typedef int (BufferedCloseFunc)(void *opaque);

 QEMUFile *qemu_fopen_ops_buffered(MigrationState *migration_state,
   size_t xfer_limit,
-  BufferedPutFunc *put_buffer,
   BufferedPutReadyFunc *put_ready,
   BufferedWaitForUnfreezeFunc 
*wait_for_unfreeze,
   BufferedCloseFunc *close);
diff --git a/migration.c b/migration.c
index 05634d5..a958f4f 100644
--- a/migration.c
+++ b/migration.c
@@ -293,10 +293,9 @@ static void migrate_fd_put_notify(void *opaque)
 }
 }

-static ssize_t migrate_fd_put_buffer(void *opaque, const void *data,
- size_t size)
+ssize_t migrate_fd_put_buffer(MigrationState *s, const void *data,
+  size_t size)
 {
-MigrationState *s = opaque;
 ssize_t ret;

 if (s-state != MIG_STATE_ACTIVE) {
@@ -434,7 +433,6 @@ void migrate_fd_connect(MigrationState *s)
 s-state = MIG_STATE_ACTIVE;
 s-file = qemu_fopen_ops_buffered(s,
   s-bandwidth_limit,
-  migrate_fd_put_buffer,
   migrate_fd_put_ready,
   migrate_fd_wait_for_unfreeze,
   migrate_fd_close);
diff --git a/migration.h b/migration.h
index 66d7f68..02d0219 100644
--- a/migration.h
+++ b/migration.h
@@ -78,6 +78,9 @@ void migrate_fd_error(MigrationState *s);

 void migrate_fd_connect(MigrationState *s);

+ssize_t migrate_fd_put_buffer(MigrationState *s, const void *data,
+  size_t size);
+
 void add_migration_state_change_notifier(Notifier *notify);
 void remove_migration_state_change_notifier(Notifier *notify);
 bool migration_is_active(MigrationState *);
-- 
1.7.11.4

[Qemu-devel] [PATCH 20/41] buffered_file: unfold migrate_fd_put_ready

2012-09-21 Thread Juan Quintela

We only use it once, just remove the callback indirection.

Signed-off-by: Juan Quintela quint...@redhat.com
---
 buffered_file.c | 5 +
 buffered_file.h | 2 --
 migration.c | 4 +---
 migration.h | 1 +
 4 files changed, 3 insertions(+), 9 deletions(-)

diff --git a/buffered_file.c b/buffered_file.c
index 702a726..4c6a797 100644
--- a/buffered_file.c
+++ b/buffered_file.c
@@ -23,7 +23,6 @@

 typedef struct QEMUFileBuffered
 {
-BufferedPutReadyFunc *put_ready;
 BufferedWaitForUnfreezeFunc *wait_for_unfreeze;
 BufferedCloseFunc *close;
 MigrationState *migration_state;
@@ -128,7 +127,7 @@ static int buffered_put_buffer(void *opaque, const uint8_t 
*buf, int64_t pos, in
 DPRINTF(file is ready\n);
 if (!s-freeze_output  s-bytes_xfer  s-xfer_limit) {
 DPRINTF(notifying client\n);
-s-put_ready(s-migration_state);
+migrate_fd_put_ready(s-migration_state);
 }
 }

@@ -227,7 +226,6 @@ static void buffered_rate_tick(void *opaque)

 QEMUFile *qemu_fopen_ops_buffered(MigrationState *migration_state,
   size_t bytes_per_sec,
-  BufferedPutReadyFunc *put_ready,
   BufferedWaitForUnfreezeFunc 
*wait_for_unfreeze,
   BufferedCloseFunc *close)
 {
@@ -237,7 +235,6 @@ QEMUFile *qemu_fopen_ops_buffered(MigrationState 
*migration_state,

 s-migration_state = migration_state;
 s-xfer_limit = bytes_per_sec / 10;
-s-put_ready = put_ready;
 s-wait_for_unfreeze = wait_for_unfreeze;
 s-close = close;

diff --git a/buffered_file.h b/buffered_file.h
index ca7e62d..dd239b3 100644
--- a/buffered_file.h
+++ b/buffered_file.h
@@ -17,13 +17,11 @@
 #include hw/hw.h
 #include migration.h

-typedef void (BufferedPutReadyFunc)(void *opaque);
 typedef void (BufferedWaitForUnfreezeFunc)(void *opaque);
 typedef int (BufferedCloseFunc)(void *opaque);

 QEMUFile *qemu_fopen_ops_buffered(MigrationState *migration_state,
   size_t xfer_limit,
-  BufferedPutReadyFunc *put_ready,
   BufferedWaitForUnfreezeFunc 
*wait_for_unfreeze,
   BufferedCloseFunc *close);

diff --git a/migration.c b/migration.c
index a958f4f..d0d1014 100644
--- a/migration.c
+++ b/migration.c
@@ -316,9 +316,8 @@ ssize_t migrate_fd_put_buffer(MigrationState *s, const void 
*data,
 return ret;
 }

-static void migrate_fd_put_ready(void *opaque)
+void migrate_fd_put_ready(MigrationState *s)
 {
-MigrationState *s = opaque;
 int ret;

 if (s-state != MIG_STATE_ACTIVE) {
@@ -433,7 +432,6 @@ void migrate_fd_connect(MigrationState *s)
 s-state = MIG_STATE_ACTIVE;
 s-file = qemu_fopen_ops_buffered(s,
   s-bandwidth_limit,
-  migrate_fd_put_ready,
   migrate_fd_wait_for_unfreeze,
   migrate_fd_close);

diff --git a/migration.h b/migration.h
index 02d0219..031c2ab 100644
--- a/migration.h
+++ b/migration.h
@@ -80,6 +80,7 @@ void migrate_fd_connect(MigrationState *s);

 ssize_t migrate_fd_put_buffer(MigrationState *s, const void *data,
   size_t size);
+void migrate_fd_put_ready(MigrationState *s);

 void add_migration_state_change_notifier(Notifier *notify);
 void remove_migration_state_change_notifier(Notifier *notify);
-- 
1.7.11.4

[Qemu-devel] [PATCH 41/41] cpus: create qemu_cpu_is_vcpu()

2012-09-21 Thread Juan Quintela

Old code used !io_thread to know if a thread was an vcpu or not.  That
fails when we introduce the iothread.

Signed-off-by: Juan Quintela quint...@redhat.com
---
 cpus.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/cpus.c b/cpus.c
index e476a3c..1b7061a 100644
--- a/cpus.c
+++ b/cpus.c
@@ -902,6 +902,11 @@ int qemu_cpu_is_self(void *_env)
 return qemu_thread_is_self(cpu-thread);
 }

+static bool qemu_cpu_is_vcpu(void)
+{
+return cpu_single_env  qemu_cpu_is_self(cpu_single_env);
+}
+
 void qemu_mutex_lock_iothread(void)
 {
 if (!tcg_enabled()) {
@@ -947,7 +952,7 @@ void pause_all_vcpus(void)
 penv = penv-next_cpu;
 }

-if (!qemu_thread_is_self(io_thread)) {
+if (qemu_cpu_is_vcpu()) {
 cpu_stop_current();
 if (!kvm_enabled()) {
 while (penv) {
@@ -1064,7 +1069,7 @@ void cpu_stop_current(void)

 void vm_stop(RunState state)
 {
-if (!qemu_thread_is_self(io_thread)) {
+if (qemu_cpu_is_vcpu()) {
 qemu_system_vmstop_request(state);
 /*
  * FIXME: should not return to device code in case
-- 
1.7.11.4

Re: [Qemu-devel] [PATCH v4 04/19] block: move aio initialization into a helper function

2012-09-21 Thread Kevin Wolf

Am 20.09.2012 21:13, schrieb Jeff Cody:
 Move AIO initialization for raw-posix block driver into a helper function.
 
 In addition to just code motion, the aio_ctx pointer is checked for NULL,
 prior to calling laio_init(), to make sure laio_init() is only run once.
 
 Signed-off-by: Jeff Cody jc...@redhat.com
 ---
  block/raw-posix.c | 53 +++--
  1 file changed, 35 insertions(+), 18 deletions(-)
 
 diff --git a/block/raw-posix.c b/block/raw-posix.c
 index 6be20b1..5981d04 100644
 --- a/block/raw-posix.c
 +++ b/block/raw-posix.c

 @@ -240,25 +272,10 @@ static int raw_open_common(BlockDriverState *bs, const 
 char *filename,
  }
  
  #ifdef CONFIG_LINUX_AIO
 -/*
 - * Currently Linux do AIO only for files opened with O_DIRECT
 - * specified so check NOCACHE flag too
 - */
 -if ((bdrv_flags  (BDRV_O_NOCACHE|BDRV_O_NATIVE_AIO)) ==
 -  (BDRV_O_NOCACHE|BDRV_O_NATIVE_AIO)) {
 -
 -s-aio_ctx = laio_init();
 -if (!s-aio_ctx) {
 -goto out_free_buf;
 -}
 -s-use_aio = 1;
 -} else
 -#endif
 -{
 -#ifdef CONFIG_LINUX_AIO
 -s-use_aio = 0;
 -#endif
 +if (raw_set_aio(s-aio_ctx, s-use_aio, bdrv_flags)) {
 +goto out_close;

This leaks s-aligned_buf. It's removed later in the series anyway, so
no big deal, but if you need to respin for other reasons, probably worth
fixing.

Kevin

[Qemu-devel] [PATCH 35/41] buffered_file: buffered_put_buffer() don't need to set last_error

2012-09-21 Thread Juan Quintela

Callers on savevm.c:qemu_fflush() will set it.

Signed-off-by: Juan Quintela quint...@redhat.com
---
 buffered_file.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/buffered_file.c b/buffered_file.c
index 6d9a50b..318d0f0 100644
--- a/buffered_file.c
+++ b/buffered_file.c
@@ -120,8 +120,6 @@ static int buffered_put_buffer(void *opaque, const uint8_t 
*buf, int64_t pos, in
 error = buffered_flush(s);
 if (error  0) {
 DPRINTF(buffered flush error. bailing: %s\n, strerror(-error));
-qemu_file_set_error(s-file, error);
-
 return error;
 }

-- 
1.7.11.4

[Qemu-devel] [PATCH 01/41] buffered_file: g_realloc() can't fail

2012-09-21 Thread Juan Quintela

Signed-off-by: Juan Quintela quint...@redhat.com
---
 buffered_file.c | 10 +-
 1 file changed, 1 insertion(+), 9 deletions(-)

diff --git a/buffered_file.c b/buffered_file.c
index f170aa0..4148abb 100644
--- a/buffered_file.c
+++ b/buffered_file.c
@@ -50,20 +50,12 @@ static void buffered_append(QEMUFileBuffered *s,
 const uint8_t *buf, size_t size)
 {
 if (size  (s-buffer_capacity - s-buffer_size)) {
-void *tmp;
-
 DPRINTF(increasing buffer capacity from %zu by %zu\n,
 s-buffer_capacity, size + 1024);

 s-buffer_capacity += size + 1024;

-tmp = g_realloc(s-buffer, s-buffer_capacity);
-if (tmp == NULL) {
-fprintf(stderr, qemu file buffer expansion failed\n);
-exit(1);
-}
-
-s-buffer = tmp;
+s-buffer = g_realloc(s-buffer, s-buffer_capacity);
 }

 memcpy(s-buffer + s-buffer_size, buf, size);
-- 
1.7.11.4

[Qemu-devel] [PATCH 37/41] block-migration: Switch meaning of return value

2012-09-21 Thread Juan Quintela

Make consistent the result of blk_mig_save_dirty_block() and
mig_save_device_dirty()

Signed-off-by: Juan Quintela quint...@redhat.com
---
 block-migration.c | 14 +-
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/block-migration.c b/block-migration.c
index a822bb2..565628f 100644
--- a/block-migration.c
+++ b/block-migration.c
@@ -429,14 +429,18 @@ error:
 return 0;
 }

+/* return value:
+ * 0: too much data for max_downtime
+ * 1: few enough data for max_downtime
+*/
 static int blk_mig_save_dirty_block(QEMUFile *f, int is_async)
 {
 BlkMigDevState *bmds;
-int ret = 0;
+int ret = 1;

 QSIMPLEQ_FOREACH(bmds, block_mig_state.bmds_list, entry) {
-if (mig_save_device_dirty(f, bmds, is_async) == 0) {
-ret = 1;
+ret = mig_save_device_dirty(f, bmds, is_async);
+if (ret == 0) {
 break;
 }
 }
@@ -594,7 +598,7 @@ static int block_save_iterate(QEMUFile *f, void *opaque)
 block_mig_state.bulk_completed = 1;
 }
 } else {
-if (blk_mig_save_dirty_block(f, 1) == 0) {
+if (blk_mig_save_dirty_block(f, 1) != 0) {
 /* no more dirty blocks */
 break;
 }
@@ -631,7 +635,7 @@ static int block_save_complete(QEMUFile *f, void *opaque)
all async read completed */
 assert(block_mig_state.submitted == 0);

-while (blk_mig_save_dirty_block(f, 0) != 0) {
+while (blk_mig_save_dirty_block(f, 0) == 0) {
 /* Do nothing */
 }
 blk_mig_cleanup();
-- 
1.7.11.4

[Qemu-devel] [PATCH 14/41] Separate migration bitmap

2012-09-21 Thread Juan Quintela

This patch creates a migration bitmap, which is periodically kept in
sync with the qemu bitmap. A separate copy of the dirty bitmap for the
migration limits the amount of concurrent access to the qemu bitmap
from iothread and migration thread (which requires taking the big
lock).

We use the qemu bitmap type.  We have to undo the dirty_pages
counting optimization on the general dirty bitmap and do the counting
optimization with the migration local bitmap.

Signed-off-by: Umesh Deshpande udesh...@redhat.com
Signed-off-by: Paolo Bonzini pbonz...@redhat.com
Signed-off-by: Juan Quintela quint...@redhat.com
---
 arch_init.c | 63 +++--
 cpu-all.h   |  1 -
 exec-obsolete.h | 10 -
 3 files changed, 43 insertions(+), 31 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index 6e0d7c4..0279d06 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -31,6 +31,8 @@
 #include config.h
 #include monitor.h
 #include sysemu.h
+#include bitops.h
+#include bitmap.h
 #include arch_init.h
 #include audio/audio.h
 #include hw/pc.h
@@ -332,39 +334,57 @@ static int save_xbzrle_page(QEMUFile *f, uint8_t 
*current_data,

 static RAMBlock *last_block;
 static ram_addr_t last_offset;
+static unsigned long *migration_bitmap;
+static uint64_t migration_dirty_pages;

 static inline bool migration_bitmap_test_and_reset_dirty(MemoryRegion *mr,
  ram_addr_t offset)
 {
-bool ret = memory_region_get_dirty(mr, offset, TARGET_PAGE_SIZE,
-   DIRTY_MEMORY_MIGRATION);
+bool ret;
+int nr = (mr-ram_addr + offset)  TARGET_PAGE_BITS;
+
+ret = test_and_clear_bit(nr, migration_bitmap);

 if (ret) {
-memory_region_reset_dirty(mr, offset, TARGET_PAGE_SIZE,
-  DIRTY_MEMORY_MIGRATION);
+migration_dirty_pages--;
 }
 return ret;
 }

-static inline void migration_bitmap_set_dirty(MemoryRegion *mr, int length)
+static inline bool migration_bitmap_set_dirty(MemoryRegion *mr,
+  ram_addr_t offset)
 {
-ram_addr_t addr;
+bool ret;
+int nr = (mr-ram_addr + offset)  TARGET_PAGE_BITS;

-for (addr = 0; addr  length; addr += TARGET_PAGE_SIZE) {
-if (!memory_region_get_dirty(mr, addr, TARGET_PAGE_SIZE,
- DIRTY_MEMORY_MIGRATION)) {
-memory_region_set_dirty(mr, addr, TARGET_PAGE_SIZE);
-}
+ret = test_and_set_bit(nr, migration_bitmap);
+
+if (!ret) {
+migration_dirty_pages++;
 }
+return ret;
 }

 static void migration_bitmap_sync(void)
 {
-uint64_t num_dirty_pages_init = ram_list.dirty_pages;
+RAMBlock *block;
+ram_addr_t addr;
+uint64_t num_dirty_pages_init = migration_dirty_pages;

 trace_migration_bitmap_sync_start();
 memory_global_sync_dirty_bitmap(get_system_memory());
-trace_migration_bitmap_sync_end(ram_list.dirty_pages
+
+QLIST_FOREACH(block, ram_list.blocks, next) {
+for (addr = 0; addr  block-length; addr += TARGET_PAGE_SIZE) {
+if (memory_region_get_dirty(block-mr, addr, TARGET_PAGE_SIZE,
+DIRTY_MEMORY_MIGRATION)) {
+migration_bitmap_set_dirty(block-mr, addr);
+}
+}
+memory_region_reset_dirty(block-mr, 0, block-length,
+  DIRTY_MEMORY_MIGRATION);
+}
+trace_migration_bitmap_sync_end(migration_dirty_pages
 - num_dirty_pages_init);
 }

@@ -443,7 +463,7 @@ static uint64_t bytes_transferred;

 static ram_addr_t ram_save_remaining(void)
 {
-return ram_list.dirty_pages;
+return migration_dirty_pages;
 }

 uint64_t ram_bytes_remaining(void)
@@ -528,8 +548,13 @@ static void reset_ram_globals(void)
 static int ram_save_setup(QEMUFile *f, void *opaque)
 {
 RAMBlock *block;
+int64_t ram_pages = last_ram_offset()  TARGET_PAGE_BITS;

-memory_global_sync_dirty_bitmap(get_system_memory());
+migration_bitmap = bitmap_new(ram_pages);
+bitmap_set(migration_bitmap, 1, ram_pages);
+migration_dirty_pages = ram_pages;
+
+migration_bitmap_sync();
 bytes_transferred = 0;
 reset_ram_globals();

@@ -546,11 +571,6 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
 acct_clear();
 }

-/* Make sure all dirty bits are set */
-QLIST_FOREACH(block, ram_list.blocks, next) {
-migration_bitmap_set_dirty(block-mr, block-length);
-}
-
 memory_global_dirty_log_start();

 qemu_put_be64(f, ram_bytes_total() | RAM_SAVE_FLAG_MEM_SIZE);
@@ -656,6 +676,9 @@ static int ram_save_complete(QEMUFile *f, void *opaque)

 qemu_put_be64(f, RAM_SAVE_FLAG_EOS);

+g_free(migration_bitmap);
+migration_bitmap = NULL;
+
 return 0;
 }

diff --git a/cpu-all.h b/cpu-all.h
index 5408782..db25f73 100644
--- a/cpu-all.h
+++ b/cpu-all.h
@@

[Qemu-devel] [PATCH 34/41] savevm: Only qemu_fflush() can generate errors

2012-09-21 Thread Juan Quintela

Move the error check to the beggining of the callers.  Once this is fixed
qemu_file_set_if_error() is not used anymore, so remove it.

Signed-off-by: Juan Quintela quint...@redhat.com
---
 savevm.c | 35 ++-
 1 file changed, 18 insertions(+), 17 deletions(-)

diff --git a/savevm.c b/savevm.c
index 4e4aa3c..59ec8bf 100644
--- a/savevm.c
+++ b/savevm.c
@@ -445,18 +445,6 @@ void qemu_file_set_error(QEMUFile *f, int ret)
 f-last_error = ret;
 }

-/** Sets last_error conditionally
- *
- * Sets last_error only if ret is negative _and_ no error
- * was set before.
- */
-static void qemu_file_set_if_error(QEMUFile *f, int ret)
-{
-if (ret  0  !f-last_error) {
-qemu_file_set_error(f, ret);
-}
-}
-
 /** Flushes QEMUFile buffer
  *
  */
@@ -544,13 +532,17 @@ void qemu_put_buffer(QEMUFile *f, const uint8_t *buf, int 
size)
 {
 int l;

-if (!f-last_error  f-is_write == 0  f-buf_index  0) {
+if (f-last_error) {
+return;
+}
+
+if (f-is_write == 0  f-buf_index  0) {
 fprintf(stderr,
 Attempted to write to buffer while read buffer is not 
empty\n);
 abort();
 }

-while (!f-last_error  size  0) {
+while (size  0) {
 l = IO_BUF_SIZE - f-buf_index;
 if (l  size)
 l = size;
@@ -561,14 +553,21 @@ void qemu_put_buffer(QEMUFile *f, const uint8_t *buf, int 
size)
 size -= l;
 if (f-buf_index = IO_BUF_SIZE) {
 int ret = qemu_fflush(f);
-qemu_file_set_if_error(f, ret);
+if (ret  0) {
+qemu_file_set_error(f, ret);
+break;
+}
 }
 }
 }

 void qemu_put_byte(QEMUFile *f, int v)
 {
-if (!f-last_error  f-is_write == 0  f-buf_index  0) {
+if (f-last_error) {
+return;
+}
+
+if (f-is_write == 0  f-buf_index  0) {
 fprintf(stderr,
 Attempted to write to buffer while read buffer is not 
empty\n);
 abort();
@@ -578,7 +577,9 @@ void qemu_put_byte(QEMUFile *f, int v)
 f-is_write = 1;
 if (f-buf_index = IO_BUF_SIZE) {
 int ret = qemu_fflush(f);
-qemu_file_set_if_error(f, ret);
+if (ret  0) {
+qemu_file_set_error(f, ret);
+}
 }
 }

-- 
1.7.11.4

[Qemu-devel] [PATCH 10/41] ram: Introduce migration_bitmap_test_and_reset_dirty()

2012-09-21 Thread Juan Quintela

It just test if the dirty bit is set, and clears it.

Signed-off-by: Juan Quintela quint...@redhat.com
---
 arch_init.c | 19 ++-
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index b2dcc24..acc057f 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -332,6 +332,19 @@ static int save_xbzrle_page(QEMUFile *f, uint8_t 
*current_data,
 static RAMBlock *last_block;
 static ram_addr_t last_offset;

+static inline bool migration_bitmap_test_and_reset_dirty(MemoryRegion *mr,
+ ram_addr_t offset)
+{
+bool ret = memory_region_get_dirty(mr, offset, TARGET_PAGE_SIZE,
+   DIRTY_MEMORY_MIGRATION);
+
+if (ret) {
+memory_region_reset_dirty(mr, offset, TARGET_PAGE_SIZE,
+  DIRTY_MEMORY_MIGRATION);
+}
+return ret;
+}
+
 static inline void migration_bitmap_set_dirty(MemoryRegion *mr, int length)
 {
 ram_addr_t addr;
@@ -365,14 +378,10 @@ static int ram_save_block(QEMUFile *f, bool last_stage)

 do {
 mr = block-mr;
-if (memory_region_get_dirty(mr, offset, TARGET_PAGE_SIZE,
-DIRTY_MEMORY_MIGRATION)) {
+if (migration_bitmap_test_and_reset_dirty(mr, offset)) {
 uint8_t *p;
 int cont = (block == last_block) ? RAM_SAVE_FLAG_CONTINUE : 0;

-memory_region_reset_dirty(mr, offset, TARGET_PAGE_SIZE,
-  DIRTY_MEMORY_MIGRATION);
-
 p = memory_region_get_ram_ptr(mr) + offset;

 if (is_dup_page(p)) {
-- 
1.7.11.4

[Qemu-devel] [PATCH 30/41] savevm: make qemu_fflush() return an error code

2012-09-21 Thread Juan Quintela

Adjust all the callers.  We moved the set of last_error from inside
qemu_fflush() to all the callers.

Signed-off-by: Juan Quintela quint...@redhat.com
---
 savevm.c | 39 +++
 1 file changed, 23 insertions(+), 16 deletions(-)

diff --git a/savevm.c b/savevm.c
index 6865862..0953695 100644
--- a/savevm.c
+++ b/savevm.c
@@ -459,23 +459,22 @@ static void qemu_file_set_if_error(QEMUFile *f, int ret)

 /** Flushes QEMUFile buffer
  *
- * In case of error, last_error is set.
  */
-static void qemu_fflush(QEMUFile *f)
+static int qemu_fflush(QEMUFile *f)
 {
+int ret = 0;
+
 if (!f-put_buffer)
-return;
+return 0;

 if (f-is_write  f-buf_index  0) {
-int len;
-
-len = f-put_buffer(f-opaque, f-buf, f-buf_offset, f-buf_index);
-if (len  0)
+ret = f-put_buffer(f-opaque, f-buf, f-buf_offset, f-buf_index);
+if (ret = 0) {
 f-buf_offset += f-buf_index;
-else
-qemu_file_set_error(f, -EINVAL);
+}
 f-buf_index = 0;
 }
+return ret;
 }

 static void qemu_fill_buffer(QEMUFile *f)
@@ -533,9 +532,13 @@ static int qemu_fclose_internal(QEMUFile *f)
  */
 int qemu_fclose(QEMUFile *f)
 {
-int ret;
-qemu_fflush(f);
-ret = qemu_fclose_internal(f);
+int ret, ret2;
+ret = qemu_fflush(f);
+ret2 = qemu_fclose_internal(f);
+
+if (ret = 0) {
+ret = ret2;
+}
 /* If any error was spotted before closing, we should report it
  * instead of the close() return value.
  */
@@ -570,8 +573,10 @@ void qemu_put_buffer(QEMUFile *f, const uint8_t *buf, int 
size)
 f-buf_index += l;
 buf += l;
 size -= l;
-if (f-buf_index = IO_BUF_SIZE)
-qemu_fflush(f);
+if (f-buf_index = IO_BUF_SIZE) {
+int ret = qemu_fflush(f);
+qemu_file_set_if_error(f, ret);
+}
 }
 }

@@ -585,8 +590,10 @@ void qemu_put_byte(QEMUFile *f, int v)

 f-buf[f-buf_index++] = v;
 f-is_write = 1;
-if (f-buf_index = IO_BUF_SIZE)
-qemu_fflush(f);
+if (f-buf_index = IO_BUF_SIZE) {
+int ret = qemu_fflush(f);
+qemu_file_set_if_error(f, ret);
+}
 }

 static void qemu_file_skip(QEMUFile *f, int size)
-- 
1.7.11.4

[Qemu-devel] [PATCH 11/41] ram: Export last_ram_offset()

2012-09-21 Thread Juan Quintela

Is the only way of knowing the RAM size.

Signed-off-by: Juan Quintela quint...@redhat.com
---
 cpu-all.h | 2 ++
 exec.c| 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/cpu-all.h b/cpu-all.h
index 74d3681..5408782 100644
--- a/cpu-all.h
+++ b/cpu-all.h
@@ -517,6 +517,8 @@ extern int mem_prealloc;
 void dump_exec_info(FILE *f, fprintf_function cpu_fprintf);
 #endif /* !CONFIG_USER_ONLY */

+ram_addr_t last_ram_offset(void);
+
 int cpu_memory_rw_debug(CPUArchState *env, target_ulong addr,
 uint8_t *buf, int len, int is_write);

diff --git a/exec.c b/exec.c
index f22e9e6..ad2cc2e 100644
--- a/exec.c
+++ b/exec.c
@@ -2464,7 +2464,7 @@ static ram_addr_t find_ram_offset(ram_addr_t size)
 return offset;
 }

-static ram_addr_t last_ram_offset(void)
+ram_addr_t last_ram_offset(void)
 {
 RAMBlock *block;
 ram_addr_t last = 0;
-- 
1.7.11.4

[Qemu-devel] [PATCH 24/41] buffered_file: callers of buffered_flush() already check for errors

2012-09-21 Thread Juan Quintela

Signed-off-by: Juan Quintela quint...@redhat.com
---
 buffered_file.c | 7 ---
 1 file changed, 7 deletions(-)

diff --git a/buffered_file.c b/buffered_file.c
index 43e68b6..747d672 100644
--- a/buffered_file.c
+++ b/buffered_file.c
@@ -61,13 +61,6 @@ static void buffered_append(QEMUFileBuffered *s,
 static void buffered_flush(QEMUFileBuffered *s)
 {
 size_t offset = 0;
-int error;
-
-error = qemu_file_get_error(s-file);
-if (error != 0) {
-DPRINTF(flush when error, bailing: %s\n, strerror(-error));
-return;
-}

 DPRINTF(flushing %zu byte(s) of data\n, s-buffer_size);

-- 
1.7.11.4

[Qemu-devel] [PATCH 17/41] buffered_file: rename opaque to migration_state

2012-09-21 Thread Juan Quintela

Signed-off-by: Juan Quintela quint...@redhat.com
---
 buffered_file.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/buffered_file.c b/buffered_file.c
index 7155800..33b700b 100644
--- a/buffered_file.c
+++ b/buffered_file.c
@@ -27,7 +27,7 @@ typedef struct QEMUFileBuffered
 BufferedPutReadyFunc *put_ready;
 BufferedWaitForUnfreezeFunc *wait_for_unfreeze;
 BufferedCloseFunc *close;
-void *opaque;
+void *migration_state;
 QEMUFile *file;
 int freeze_output;
 size_t bytes_xfer;
@@ -78,7 +78,7 @@ static void buffered_flush(QEMUFileBuffered *s)
 while (s-bytes_xfer  s-xfer_limit  offset  s-buffer_size) {
 ssize_t ret;

-ret = s-put_buffer(s-opaque, s-buffer + offset,
+ret = s-put_buffer(s-migration_state, s-buffer + offset,
 s-buffer_size - offset);
 if (ret == -EAGAIN) {
 DPRINTF(backend not ready, freezing\n);
@@ -129,7 +129,7 @@ static int buffered_put_buffer(void *opaque, const uint8_t 
*buf, int64_t pos, in
 DPRINTF(file is ready\n);
 if (!s-freeze_output  s-bytes_xfer  s-xfer_limit) {
 DPRINTF(notifying client\n);
-s-put_ready(s-opaque);
+s-put_ready(s-migration_state);
 }
 }

@@ -147,10 +147,10 @@ static int buffered_close(void *opaque)
 while (!qemu_file_get_error(s-file)  s-buffer_size) {
 buffered_flush(s);
 if (s-freeze_output)
-s-wait_for_unfreeze(s-opaque);
+s-wait_for_unfreeze(s-migration_state);
 }

-ret = s-close(s-opaque);
+ret = s-close(s-migration_state);

 qemu_del_timer(s-timer);
 qemu_free_timer(s-timer);
@@ -237,7 +237,7 @@ QEMUFile *qemu_fopen_ops_buffered(void *opaque,

 s = g_malloc0(sizeof(*s));

-s-opaque = opaque;
+s-migration_state = opaque;
 s-xfer_limit = bytes_per_sec / 10;
 s-put_buffer = put_buffer;
 s-put_ready = put_ready;
-- 
1.7.11.4

[Qemu-devel] [PATCH 00/41] Migration cleanups, refactorings, stats, and more

2012-09-21 Thread Juan Quintela

Hi

This is the mergable part of migration thread work that I am doing.
What it does:
- cleanups left and right
- 2nd patch is a (fix migration sync) should be applied to stable IMHO
   Othrewise, we could be sending zero pages 2 times.
- It introduces the stats discused previously
   * expected_downtime
   * downtime
   * dirty_pages_rate
 All Eric  Luiz comments have been addressed
- buffered file has been basically unfolded.  It was used only once,
  with one user.  So we de-abstracted it.
  Notice that with migration patches, it makes no sense anymore
- migration bitmap

  Now the bitmap is 1bit per page.  Notice that this only affects to
  migration.  For the rest (aka TCG  VGA), we still use the old bitmap.

- remove qemu_file* functions that were not used or used only once
- last_error: we have made that almost all error paths return one error
  instead of zero, so we use much, much less the error on the qemu_file.
  Some of the last remmants are:
  * still there due to callbacks (can be removed when the thread is
integrated)
  * are on the read path (I haven't touched a lot of those yet)
- qemu_cpu_is_vcpu()

  We used to test if we were _not_ an vcpu testing if we were the
  iothread.  With the migration thread that is not true anymore.  So
  create a function that does the right thing.

Please review.

Thanks, Juan.

The following changes since commit c26032b2c91721245bfec542d94f37a0238e986e:

  target-xtensa: don't emit extra tcg_gen_goto_tb (2012-09-21 03:07:27 +0400)

are available in the git repository at:

  http://repo.or.cz/r/qemu/quintela.git migration-next-20120921

for you to fetch changes up to 4bce0b88b10ed790ad3669ce4ff61c945cd655eb:

  cpus: create qemu_cpu_is_vcpu() (2012-09-21 10:43:10 +0200)


Juan Quintela (40):
  buffered_file: g_realloc() can't fail
  fix migration sync
  migration: store end_time in a local variable
  migration: print total downtime for final phase of migration
  migration: rename expected_time to expected_downtime
  migration: export migrate_get_current()
  migration: print expected downtime in info migrate
  savevm: Factorize ram globals reset in its own function
  ram: introduce migration_bitmap_set_dirty()
  ram: Introduce migration_bitmap_test_and_reset_dirty()
  ram: Export last_ram_offset()
  ram: introduce migration_bitmap_sync()
  ram: create trace event for migration sync bitmap
  Separate migration bitmap
  migration: Add dirty_pages_rate to query migrate output
  buffered_file: rename opaque to migration_state
  buffered_file: opaque is MigrationState
  buffered_file: unfold migrate_fd_put_buffer
  buffered_file: unfold migrate_fd_put_ready
  buffered_file: unfold migrate_fd_put_buffer
  buffered_file: unfold migrate_fd_put_buffer
  buffered_file: We can access directly to bandwidth_limit
  buffered_file: callers of buffered_flush() already check for errors
  buffered_file: make buffered_flush return the error code
  migration: make migrate_fd_wait_for_unfreeze() return errors
  savevm: unexport qemu_fflush
  virtio-net: use qemu_get_buffer() in a temp buffer
  savevm: Remove qemu_fseek()
  savevm: make qemu_fflush() return an error code
  savevm: unfold qemu_fclose_internal()
  savevm: unexport qemu_ftell()
  savevm: make qemu_fill_buffer() be consistent
  savevm: Only qemu_fflush() can generate errors
  buffered_file: buffered_put_buffer() don't need to set last_error
  block-migration: make flush_blks() return errors
  block-migration:  Switch meaning of return value
  block-migration: handle errors with the return codes correctly
  savevm: un-export qemu_file_set_error()
  savevm: make qemu_file_put_notify() return errors
  cpus: create qemu_cpu_is_vcpu()

Paolo Bonzini (1):
  BufferedFile: append, then flush

 arch_init.c   | 141 +-
 block-migration.c |  51 ++--
 buffered_file.c   | 128 +
 buffered_file.h   |  12 +
 cpu-all.h |   3 +-
 cpus.c|   9 +++-
 exec-obsolete.h   |  10 
 exec.c|   2 +-
 hmp.c |  12 +
 hw/virtio-net.c   |   4 +-
 migration.c   |  46 +-
 migration.h   |  10 
 qapi-schema.json  |  20 ++--
 qemu-file.h   |   8 +---
 qmp-commands.hx   |   9 
 savevm.c  | 117 +---
 trace-events  |   4 ++
 17 files changed, 316 insertions(+), 270 deletions(-)

-- 
1.7.11.4

[Qemu-devel] [PATCH 08/41] savevm: Factorize ram globals reset in its own function

2012-09-21 Thread Juan Quintela

Signed-off-by: Juan Quintela quint...@redhat.com
---
 arch_init.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index 52ccc7b..57f7f1a 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -482,6 +482,14 @@ static void ram_migration_cancel(void *opaque)
 migration_end();
 }

+
+static void reset_ram_globals(void)
+{
+last_block = NULL;
+last_offset = 0;
+sort_ram_list();
+}
+
 #define MAX_WAIT 50 /* ms, half buffered_file limit */

 static int ram_save_setup(QEMUFile *f, void *opaque)
@@ -491,9 +499,7 @@ static int ram_save_setup(QEMUFile *f, void *opaque)

 memory_global_sync_dirty_bitmap(get_system_memory());
 bytes_transferred = 0;
-last_block = NULL;
-last_offset = 0;
-sort_ram_list();
+reset_ram_globals();

 if (migrate_use_xbzrle()) {
 XBZRLE.cache = cache_init(migrate_xbzrle_cache_size() /
-- 
1.7.11.4

[Qemu-devel] [PATCH 04/41] migration: print total downtime for final phase of migration

2012-09-21 Thread Juan Quintela

Signed-off-by: Juan Quintela quint...@redhat.com
---
 hmp.c| 4 
 migration.c  | 6 +-
 migration.h  | 1 +
 qapi-schema.json | 7 ++-
 qmp-commands.hx  | 3 +++
 5 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/hmp.c b/hmp.c
index ba6fbd3..40b0c05 100644
--- a/hmp.c
+++ b/hmp.c
@@ -152,6 +152,10 @@ void hmp_info_migrate(Monitor *mon)
 monitor_printf(mon, Migration status: %s\n, info-status);
 monitor_printf(mon, total time: % PRIu64  milliseconds\n,
info-total_time);
+if (info-has_downtime) {
+monitor_printf(mon, downtime: % PRIu64  milliseconds\n,
+   info-downtime);
+}
 }

 if (info-has_ram) {
diff --git a/migration.c b/migration.c
index 1e3f791..c1655b3 100644
--- a/migration.c
+++ b/migration.c
@@ -195,6 +195,8 @@ MigrationInfo *qmp_query_migrate(Error **errp)
 info-has_status = true;
 info-status = g_strdup(completed);
 info-total_time = s-total_time;
+info-has_downtime = true;
+info-downtime = s-downtime;

 info-has_ram = true;
 info-ram = g_malloc0(sizeof(*info-ram));
@@ -327,9 +329,10 @@ static void migrate_fd_put_ready(void *opaque)
 migrate_fd_error(s);
 } else if (ret == 1) {
 int old_vm_running = runstate_is_running();
-int64_t end_time;
+int64_t start_time, end_time;

 DPRINTF(done iterating\n);
+start_time = qemu_get_clock_ms(rt_clock);
 qemu_system_wakeup_request(QEMU_WAKEUP_REASON_OTHER);
 vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);

@@ -340,6 +343,7 @@ static void migrate_fd_put_ready(void *opaque)
 }
 end_time = qemu_get_clock_ms(rt_clock);
 s-total_time = end_time - s-total_time;
+s-downtime = end_time - start_time;
 if (s-state != MIG_STATE_COMPLETED) {
 if (old_vm_running) {
 vm_start();
diff --git a/migration.h b/migration.h
index a9852fc..3462917 100644
--- a/migration.h
+++ b/migration.h
@@ -40,6 +40,7 @@ struct MigrationState
 void *opaque;
 MigrationParams params;
 int64_t total_time;
+int64_t downtime;
 bool enabled_capabilities[MIGRATION_CAPABILITY_MAX];
 int64_t xbzrle_cache_size;
 };
diff --git a/qapi-schema.json b/qapi-schema.json
index 14e4419..b5a4360 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -413,13 +413,18 @@
 #If migration has ended, it returns the total migration
 #time. (since 1.2)
 #
+# @downtime: #optional only present when migration finishes correctly
+#total downtime in milliseconds for the guest.
+#(since 1.3)
+#
 # Since: 0.14.0
 ##
 { 'type': 'MigrationInfo',
   'data': {'*status': 'str', '*ram': 'MigrationStats',
'*disk': 'MigrationStats',
'*xbzrle-cache': 'XBZRLECacheStats',
-   '*total-time': 'int'} }
+   '*total-time': 'int',
+   '*downtime': 'int'} }

 ##
 # @query-migrate
diff --git a/qmp-commands.hx b/qmp-commands.hx
index 6e21ddb..37be613 100644
--- a/qmp-commands.hx
+++ b/qmp-commands.hx
@@ -2267,6 +2267,8 @@ The main json-object contains the following:
 - total-time: total amount of ms since migration started.  If
 migration has ended, it returns the total migration
 time (json-int)
+- downtime: only present when migration has finished correctly
+total amount in ms for downtime that happened (json-int)
 - ram: only present if status is active, it is a json-object with the
   following RAM information (in bytes):
  - transferred: amount transferred (json-int)
@@ -2304,6 +2306,7 @@ Examples:
   remaining:123,
   total:246,
   total-time:12345,
+  downtime:12345,
   duplicate:123,
   normal:123,
   normal-bytes:123456
-- 
1.7.11.4

Re: [Qemu-devel] [PATCH 13/13] tcg: Fix !USE_DIRECT_JUMP

2012-09-21 Thread Aurelien Jarno

On Mon, Sep 17, 2012 at 08:28:52AM -0700, Richard Henderson wrote:
 Commit 6375e09e changed the type of TranslationBlock.tb_next,
 but failed to change the type of TCGContext.tb_next.
 
 Signed-off-by: Richard Henderson r...@twiddle.net
 ---
  tcg/tcg.h | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)
 
 diff --git a/tcg/tcg.h b/tcg/tcg.h
 index 7a72729..21016ec 100644
 --- a/tcg/tcg.h
 +++ b/tcg/tcg.h
 @@ -343,7 +343,7 @@ struct TCGContext {
  
  /* goto_tb support */
  uint8_t *code_buf;
 -unsigned long *tb_next;
 +uintptr_t *tb_next;
  uint16_t *tb_next_offset;
  uint16_t *tb_jmp_offset; /* != NULL if USE_DIRECT_JUMP */
  

Not directly sparc related, and definitely correct.

Reviewed-by: Aurelien Jarno aurel...@aurel32.net

-- 
Aurelien Jarno  GPG: 1024D/F1BCDB73
aurel...@aurel32.net http://www.aurel32.net

Re: [Qemu-devel] [PATCH v4 00/19] block: bdrv_reopen() patches

2012-09-21 Thread Kevin Wolf

Am 20.09.2012 21:13, schrieb Jeff Cody:
 These patches are based off Supriya Kannery's original bdrv_reopen()
 patches as part of the hostcache series.
 
 This provides support for safe reopen of a single image, or transactional
 reopening of multiple images atomically.
 
 These changes are all reflected in my github repo:
 
 git://github.com/codyprime/qemu-kvm-jtc.git  branch: 
 jtc-live-commit-1.3-v7
 

 Jeff Cody (19):
   block: correctly set the keep_read_only flag
   block: make bdrv_set_enable_write_cache() modify open_flags
   block: Framework for reopening files safely
   block: move aio initialization into a helper function
   block: move open flag parsing in raw block drivers to helper
 functions
   block: do not parse BDRV_O_CACHE_WB in block drivers
   block: use BDRV_O_NOCACHE instead of s-aligned_buf in raw-posix.c
   block: purge s-aligned_buf and s-aligned_buf_size from raw-posix.c
   block: raw-posix image file reopen
   block: raw image file reopen
   block: qed image file reopen
   block: qcow2 image file reopen
   block: qcow image file reopen
   block: vmdk image file reopen
   block: raw-win32 driver reopen support
   block: vdi image file reopen
   block: vpc image file reopen
   block: convert bdrv_commit() to use bdrv_reopen()
   block: remove keep_read_only flag from BlockDriverState struct
 
  block.c   | 299 
 +-
  block.h   |  18 
  block/iscsi.c |   4 -
  block/qcow.c  |  10 ++
  block/qcow2.c |  10 ++
  block/qed.c   |   9 ++
  block/raw-posix.c | 225 ++--
  block/raw-win32.c | 145 ++
  block/raw.c   |  10 ++
  block/rbd.c   |   6 --
  block/sheepdog.c  |  14 ++-
  block/vdi.c   |   7 ++
  block/vmdk.c  |  35 +++
  block/vpc.c   |   7 ++
  block_int.h   |   9 +-
  15 files changed, 666 insertions(+), 142 deletions(-)

Thanks, applied all to the block branch, except for patch 15
(raw-win32), which I think can safely be applied on top when we've come
to a conclusion.

Kevin

[Qemu-devel] [PATCH 2/9] Revert mm-compaction-abort-compaction-loop-if-lock-is-contended-or-run-too-long-fix

2012-09-21 Thread Mel Gorman

This reverts
mm-compaction-abort-compaction-loop-if-lock-is-contended-or-run-too-long-fix
as it is replaced by a later patch in the series.

Signed-off-by: Mel Gorman mgor...@suse.de
---
 mm/compaction.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 4a77b4b..1c873bb 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -907,7 +907,8 @@ static unsigned long compact_zone_order(struct zone *zone,
INIT_LIST_HEAD(cc.migratepages);
 
ret = compact_zone(zone, cc);
-   *contended = cc.contended;
+   if (contended)
+   *contended = cc.contended;
return ret;
 }
 
-- 
1.7.9.2

[Qemu-devel] [PATCH 4/9] mm: compaction: Abort compaction loop if lock is contended or run too long

2012-09-21 Thread Mel Gorman

From: Shaohua Li s...@fusionio.com

Changelog since V2
o Fix BUG_ON triggered due to pages left on cc.migratepages
o Make compact_zone_order() require non-NULL arg `contended'

Changelog since V1
o only abort the compaction if lock is contended or run too long
o Rearranged the code by Andrea Arcangeli.

isolate_migratepages_range() might isolate no pages if for example when
zone-lru_lock is contended and running asynchronous compaction. In this
case, we should abort compaction, otherwise, compact_zone will run a
useless loop and make zone-lru_lock is even contended.

[minc...@kernel.org: Putback pages isolated for migration if aborting]
[a...@linux-foundation.org: compact_zone_order requires non-NULL arg contended]
Signed-off-by: Andrea Arcangeli aarca...@redhat.com
Signed-off-by: Shaohua Li s...@fusionio.com
Signed-off-by: Mel Gorman mgor...@suse.de
Acked-by: Rik van Riel r...@redhat.com
---
 mm/compaction.c |   17 -
 mm/internal.h   |2 +-
 2 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 614f18b..6b55491 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -70,8 +70,7 @@ static bool compact_checklock_irqsave(spinlock_t *lock, 
unsigned long *flags,
 
/* async aborts if taking too long or contended */
if (!cc-sync) {
-   if (cc-contended)
-   *cc-contended = true;
+   cc-contended = true;
return false;
}
 
@@ -686,7 +685,7 @@ static isolate_migrate_t isolate_migratepages(struct zone 
*zone,
 
/* Perform the isolation */
low_pfn = isolate_migratepages_range(zone, cc, low_pfn, end_pfn);
-   if (!low_pfn)
+   if (!low_pfn || cc-contended)
return ISOLATE_ABORT;
 
cc-migrate_pfn = low_pfn;
@@ -846,6 +845,8 @@ static int compact_zone(struct zone *zone, struct 
compact_control *cc)
switch (isolate_migratepages(zone, cc)) {
case ISOLATE_ABORT:
ret = COMPACT_PARTIAL;
+   putback_lru_pages(cc-migratepages);
+   cc-nr_migratepages = 0;
goto out;
case ISOLATE_NONE:
continue;
@@ -894,6 +895,7 @@ static unsigned long compact_zone_order(struct zone *zone,
 bool sync, bool *contended,
 struct page **page)
 {
+   unsigned long ret;
struct compact_control cc = {
.nr_freepages = 0,
.nr_migratepages = 0,
@@ -901,13 +903,18 @@ static unsigned long compact_zone_order(struct zone *zone,
.migratetype = allocflags_to_migratetype(gfp_mask),
.zone = zone,
.sync = sync,
-   .contended = contended,
.page = page,
};
INIT_LIST_HEAD(cc.freepages);
INIT_LIST_HEAD(cc.migratepages);
 
-   return compact_zone(zone, cc);
+   ret = compact_zone(zone, cc);
+
+   VM_BUG_ON(!list_empty(cc.freepages));
+   VM_BUG_ON(!list_empty(cc.migratepages));
+
+   *contended = cc.contended;
+   return ret;
 }
 
 int sysctl_extfrag_threshold = 500;
diff --git a/mm/internal.h b/mm/internal.h
index 386772f..eebbed5 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -131,7 +131,7 @@ struct compact_control {
int order;  /* order a direct compactor needs */
int migratetype;/* MOVABLE, RECLAIMABLE etc */
struct zone *zone;
-   bool *contended;/* True if a lock was contended */
+   bool contended; /* True if a lock was contended */
struct page **page; /* Page captured of requested size */
 };
 
-- 
1.7.9.2

Re: [Qemu-devel] [PATCH 7/9] fbdev: move to pixman

2012-09-21 Thread Stefano Stabellini

On Fri, 21 Sep 2012, Gerd Hoffmann wrote:
 On 09/20/12 17:33, Stefano Stabellini wrote:
  On Thu, 20 Sep 2012, Stefano Stabellini wrote:
  On Thu, 20 Sep 2012, Gerd Hoffmann wrote:
Hi,
 
  In any graphics mode relevant today vga emulation will use
  qemu_create_displaysurface_from().  Whenever a DisplayAllocator is
  present or not doesn't make any difference then.
 
  Unfortunately if my memory doesn't fail me, Windows uses 24 bpp.
  So actually the DisplayAllocator interface is the one that is going to be
  used all the time.
 
  Guess we want implement 24bpp support in displaylisteners then.
 
  vnc doesn't support 24bpp
  
  I mean the vnc protocol doesn't support 24bpp, so it couldn't help vnc
  (I am aware that at the moment vnc is not using a DisplayAllocator, but
  I guess it could in the future).
 
 Yes, vnc should transform 24bpp into 32bpp.  Given that vnc keeps a
 shadow copy of the guest display _anyway_ (to figure which parts of the
 guest display did _really_ change) we don't have to do any extra copying
 work in vnc.  We can just keep the shadow at 32bpp.  The 'compare+copy'
 code in vnc_refresh_server_surface must be able to cope with 24bpp guest
 + 32bpp server surface.  Done.  And we've dropped the 24-32 bpp
 conversion in the vga emulation along the way.

OK, I am sold :)

[Qemu-devel] [PATCH 8/9] mm: compaction: Cache if a pageblock was scanned and no pages were isolated

2012-09-21 Thread Mel Gorman

When compaction was implemented it was known that scanning could potentially
be excessive. The ideal was that a counter be maintained for each pageblock
but maintaining this information would incur a severe penalty due to a
shared writable cache line. It has reached the point where the scanning
costs are an serious problem, particularly on long-lived systems where a
large process starts and allocates a large number of THPs at the same time.

Instead of using a shared counter, this patch adds another bit to the
pageblock flags called PG_migrate_skip. If a pageblock is scanned by
either migrate or free scanner and 0 pages were isolated, the pageblock
is marked to be skipped in the future. When scanning, this bit is checked
before any scanning takes place and the block skipped if set.

The main difficulty with a patch like this is when to ignore the cached
information? If it's ignored too often, the scanning rates will still
be excessive. If the information is too stale then allocations will fail
that might have otherwise succeeded. In this patch

o CMA always ignores the information
o If the migrate and free scanner meet then the cached information will
  be discarded if it's at least 5 seconds since the last time the cache
  was discarded
o If there are a large number of allocation failures, discard the cache.

The time-based heuristic is very clumsy but there are few choices for a
better event. Depending solely on multiple allocation failures still allows
excessive scanning when THP allocations are failing in quick succession
due to memory pressure. Waiting until memory pressure is relieved would
cause compaction to continually fail instead of using reclaim/compaction
to try allocate the page. The time-based mechanism is clumsy but a better
option is not obvious.

Signed-off-by: Mel Gorman mgor...@suse.de
Acked-by: Rik van Riel r...@redhat.com
---
 include/linux/mmzone.h  |3 ++
 include/linux/pageblock-flags.h |   19 +++-
 mm/compaction.c |   93 +--
 mm/internal.h   |1 +
 mm/page_alloc.c |1 +
 5 files changed, 111 insertions(+), 6 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 603d0b5..a456361 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -368,6 +368,9 @@ struct zone {
 */
spinlock_t  lock;
int all_unreclaimable; /* All pages pinned */
+#if defined CONFIG_COMPACTION || defined CONFIG_CMA
+   unsigned long   compact_blockskip_expire;
+#endif
 #ifdef CONFIG_MEMORY_HOTPLUG
/* see spanned/present_pages for more description */
seqlock_t   span_seqlock;
diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
index 19ef95d..eed27f4 100644
--- a/include/linux/pageblock-flags.h
+++ b/include/linux/pageblock-flags.h
@@ -30,6 +30,9 @@ enum pageblock_bits {
PB_migrate,
PB_migrate_end = PB_migrate + 3 - 1,
/* 3 bits required for migrate types */
+#ifdef CONFIG_COMPACTION
+   PB_migrate_skip,/* If set the block is skipped by compaction */
+#endif /* CONFIG_COMPACTION */
NR_PAGEBLOCK_BITS
 };
 
@@ -65,10 +68,22 @@ unsigned long get_pageblock_flags_group(struct page *page,
 void set_pageblock_flags_group(struct page *page, unsigned long flags,
int start_bitidx, int end_bitidx);
 
+#ifdef CONFIG_COMPACTION
+#define get_pageblock_skip(page) \
+   get_pageblock_flags_group(page, PB_migrate_skip, \
+   PB_migrate_skip + 1)
+#define clear_pageblock_skip(page) \
+   set_pageblock_flags_group(page, 0, PB_migrate_skip,  \
+   PB_migrate_skip + 1)
+#define set_pageblock_skip(page) \
+   set_pageblock_flags_group(page, 1, PB_migrate_skip,  \
+   PB_migrate_skip + 1)
+#endif /* CONFIG_COMPACTION */
+
 #define get_pageblock_flags(page) \
-   get_pageblock_flags_group(page, 0, NR_PAGEBLOCK_BITS-1)
+   get_pageblock_flags_group(page, 0, PB_migrate_end)
 #define set_pageblock_flags(page, flags) \
set_pageblock_flags_group(page, flags,  \
- 0, NR_PAGEBLOCK_BITS-1)
+ 0, PB_migrate_end)
 
 #endif /* PAGEBLOCK_FLAGS_H */
diff --git a/mm/compaction.c b/mm/compaction.c
index 9fc1b61..9276bc8 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -50,6 +50,64 @@ static inline bool migrate_async_suitable(int migratetype)
return is_migrate_cma(migratetype) || migratetype == MIGRATE_MOVABLE;
 }
 
+/* Returns true if the pageblock should be scanned for pages to isolate. */
+static inline bool

[Qemu-devel] [RFC PATCH v3 02/19][SeaBIOS] Add SSDT memory device support

2012-09-21 Thread Vasilis Liaskovitis

Define SSDT hotplug-able memory devices in _SB namespace. The dynamically
generated SSDT includes per memory device hotplug methods. These methods
just call methods defined in the DSDT. Also dynamically generate a MTFY
method and a MEON array of the online/available memory devices.  ACPI
extraction macros are used to place the AML code in variables later used by
src/acpi. The design is taken from SSDT cpu generation.

Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com
---
 Makefile |2 +-
 src/ssdt-mem.dsl |   65 ++
 2 files changed, 66 insertions(+), 1 deletions(-)
 create mode 100644 src/ssdt-mem.dsl

diff --git a/Makefile b/Makefile
index 5486f88..e82cfc9 100644
--- a/Makefile
+++ b/Makefile
@@ -233,7 +233,7 @@ $(OUT)%.hex: src/%.dsl ./tools/acpi_extract_preprocess.py 
./tools/acpi_extract.p
$(Q)$(PYTHON) ./tools/acpi_extract.py $(OUT)$*.lst  $(OUT)$*.off
$(Q)cat $(OUT)$*.off  $@
 
-$(OUT)ccode32flat.o: $(OUT)acpi-dsdt.hex $(OUT)ssdt-proc.hex 
$(OUT)ssdt-pcihp.hex $(OUT)ssdt-susp.hex
+$(OUT)ccode32flat.o: $(OUT)acpi-dsdt.hex $(OUT)ssdt-proc.hex 
$(OUT)ssdt-pcihp.hex $(OUT)ssdt-susp.hex $(OUT)ssdt-mem.hex
 
  Kconfig rules
 
diff --git a/src/ssdt-mem.dsl b/src/ssdt-mem.dsl
new file mode 100644
index 000..ee322f0
--- /dev/null
+++ b/src/ssdt-mem.dsl
@@ -0,0 +1,65 @@
+/* This file is the basis for the ssdt_mem[] variable in src/acpi.c.
+ * It is similar in design to the ssdt_proc variable.
+ * It defines the contents of the per-cpu Processor() object.  At
+ * runtime, a dynamically generated SSDT will contain one copy of this
+ * AML snippet for every possible memory device in the system.  The
+ * objects will * be placed in the \_SB_ namespace.
+ *
+ * In addition to the aml code generated from this file, the
+ * src/acpi.c file creates a MEMNTFY method with an entry for each memdevice:
+ * Method(MTFY, 2) {
+ * If (LEqual(Arg0, 0x00)) { Notify(MP00, Arg1) }
+ * If (LEqual(Arg0, 0x01)) { Notify(MP01, Arg1) }
+ * ...
+ * }
+ * and a MEON array with the list of active and inactive memory devices:
+ * Name(MEON, Package() { One, One, ..., Zero, Zero, ... })
+ */
+ACPI_EXTRACT_ALL_CODE ssdm_mem_aml
+
+DefinitionBlock (ssdt-mem.aml, SSDT, 0x02, BXPC, CSSDT, 0x1)
+/*  v-- DO NOT EDIT --v */
+{
+ACPI_EXTRACT_DEVICE_START ssdt_mem_start
+ACPI_EXTRACT_DEVICE_END ssdt_mem_end
+ACPI_EXTRACT_DEVICE_STRING ssdt_mem_name
+Device(MPAA) {
+ACPI_EXTRACT_NAME_BYTE_CONST ssdt_mem_id
+Name(ID, 0xAA)
+/*  ^-- DO NOT EDIT --^
+ *
+ * The src/acpi.c code requires the above layout so that it can update
+ * MPAA and 0xAA with the appropriate MEMDEVICE id (see
+ * SD_OFFSET_MEMHEX/MEMID1/MEMID2).  Don't change the above without
+ * also updating the C code.
+ */
+Name(_HID, EISAID(PNP0C80))
+Name(_PXM, 0xAA)
+
+External(CMST, MethodObj)
+External(MPEJ, MethodObj)
+
+Name(_CRS, ResourceTemplate() {
+QwordMemory(
+   ResourceConsumer,
+   ,
+   MinFixed,
+   MaxFixed,
+   Cacheable,
+   ReadWrite,
+   0x0,
+   0xDEADBEEF,
+   0xE6ADBEEE,
+   0x,
+   0x0800,
+   )
+})
+Method (_STA, 0) {
+Return(CMST(ID))
+}
+Method (_EJ0, 1, NotSerialized) {
+MPEJ(ID, Arg0)
+}
+}
+}
+
-- 
1.7.9

[Qemu-devel] [RFC PATCH v3 06/19] Implement -dimm command line option

2012-09-21 Thread Vasilis Liaskovitis

Example:
-dimm id=dimm0,size=512M,node=0,populated=off
will define a 512M memory slot belonging to numa node 0.

When populated=on, a DimmDevice is created and hot-plugged at system startup.

Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com
---
 hw/Makefile.objs |2 +-
 qemu-config.c|   25 +
 qemu-options.hx  |5 +
 sysemu.h |1 +
 vl.c |   50 ++
 5 files changed, 82 insertions(+), 1 deletions(-)

diff --git a/hw/Makefile.objs b/hw/Makefile.objs
index 6dfebd2..8c5c39a 100644
--- a/hw/Makefile.objs
+++ b/hw/Makefile.objs
@@ -26,7 +26,7 @@ hw-obj-$(CONFIG_I8254) += i8254_common.o i8254.o
 hw-obj-$(CONFIG_PCSPK) += pcspk.o
 hw-obj-$(CONFIG_PCKBD) += pckbd.o
 hw-obj-$(CONFIG_FDC) += fdc.o
-hw-obj-$(CONFIG_ACPI) += acpi.o acpi_piix4.o
+hw-obj-$(CONFIG_ACPI) += acpi.o acpi_piix4.o dimm.o
 hw-obj-$(CONFIG_APM) += pm_smbus.o apm.o
 hw-obj-$(CONFIG_DMA) += dma.o
 hw-obj-$(CONFIG_I82374) += i82374.o
diff --git a/qemu-config.c b/qemu-config.c
index eba977e..4022d64 100644
--- a/qemu-config.c
+++ b/qemu-config.c
@@ -646,6 +646,30 @@ QemuOptsList qemu_boot_opts = {
 },
 };
 
+static QemuOptsList qemu_dimm_opts = {
+.name = dimm,
+.head = QTAILQ_HEAD_INITIALIZER(qemu_dimm_opts.head),
+.desc = {
+{
+.name = id,
+.type = QEMU_OPT_STRING,
+.help = id of this dimm device,
+},{
+.name = size,
+.type = QEMU_OPT_SIZE,
+.help = memory size for this dimm,
+},{
+.name = populated,
+.type = QEMU_OPT_BOOL,
+.help = populated for this dimm,
+},{
+.name = node,
+.type = QEMU_OPT_NUMBER,
+.help = NUMA node number (i.e. proximity) for this dimm,
+},
+{ /* end of list */ }
+},
+};
 static QemuOptsList *vm_config_groups[32] = {
 qemu_drive_opts,
 qemu_chardev_opts,
@@ -662,6 +686,7 @@ static QemuOptsList *vm_config_groups[32] = {
 qemu_boot_opts,
 qemu_iscsi_opts,
 qemu_sandbox_opts,
+qemu_dimm_opts,
 NULL,
 };
 
diff --git a/qemu-options.hx b/qemu-options.hx
index 804a2d1..3687722 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -2842,3 +2842,8 @@ HXCOMM This is the last statement. Insert new options 
before this line!
 STEXI
 @end table
 ETEXI
+
+DEF(dimm, HAS_ARG, QEMU_OPTION_dimm,
+-dimm id=dimmid,size=sz,node=nd,populated=on|off\n
+specify memory dimm device with name dimmid, size sz on node nd,
+QEMU_ARCH_ALL)
diff --git a/sysemu.h b/sysemu.h
index 65552ac..7baf9c9 100644
--- a/sysemu.h
+++ b/sysemu.h
@@ -139,6 +139,7 @@ extern QEMUClock *rtc_clock;
 extern int nb_numa_nodes;
 extern uint64_t node_mem[MAX_NODES];
 extern unsigned long *node_cpumask[MAX_NODES];
+extern int nb_hp_dimms;
 
 #define MAX_OPTION_ROMS 16
 typedef struct QEMUOptionRom {
diff --git a/vl.c b/vl.c
index 7c577fa..af1745c 100644
--- a/vl.c
+++ b/vl.c
@@ -126,6 +126,7 @@ int main(int argc, char **argv)
 #include hw/xen.h
 #include hw/qdev.h
 #include hw/loader.h
+#include hw/dimm.h
 #include bt-host.h
 #include net.h
 #include net/slirp.h
@@ -248,6 +249,7 @@ QTAILQ_HEAD(, FWBootEntry) fw_boot_order = 
QTAILQ_HEAD_INITIALIZER(fw_boot_order
 int nb_numa_nodes;
 uint64_t node_mem[MAX_NODES];
 unsigned long *node_cpumask[MAX_NODES];
+int nb_hp_dimms;
 
 uint8_t qemu_uuid[16];
 
@@ -530,6 +532,37 @@ static void configure_rtc_date_offset(const char 
*startdate, int legacy)
 }
 }
 
+static void configure_dimm(QemuOpts *opts)
+{
+const char *id;
+uint64_t size, node;
+bool populated;
+QemuOpts *devopts;
+char buf[256];
+if (nb_hp_dimms == MAX_DIMMS) {
+fprintf(stderr, qemu: maximum number of DIMMs (%d) exceeded\n,
+MAX_DIMMS);
+exit(1);
+}
+id = qemu_opts_id(opts);
+size = qemu_opt_get_size(opts, size, DEFAULT_DIMMSIZE);
+populated = qemu_opt_get_bool(opts, populated, 0);
+node = qemu_opt_get_number(opts, node, 0);
+
+dimm_config_create((char*)id, size, node, nb_hp_dimms, 0);
+
+if (populated) {
+devopts = qemu_opts_create(qemu_find_opts(device), id, 0, NULL);
+qemu_opt_set(devopts, driver, dimm);
+snprintf(buf, sizeof(buf), %lu, size);
+qemu_opt_set(devopts, size, buf);
+snprintf(buf, sizeof(buf), %lu, node);
+qemu_opt_set(devopts, node, buf);
+qemu_opt_set(devopts, bus, membus);
+}
+nb_hp_dimms++;
+}
+
 static void configure_rtc(QemuOpts *opts)
 {
 const char *value;
@@ -2354,6 +2387,8 @@ int main(int argc, char **argv, char **envp)
 DisplayChangeListener *dcl;
 int cyls, heads, secs, translation;
 QemuOpts *hda_opts = NULL, *opts, *machine_opts;
+QemuOpts *dimm_opts[MAX_DIMMS];
+int nb_dimm_opts = 0;
 QemuOptsList *olist;
 int optind;
 const char *optarg;
@@ -3288,6 +3323,18 @@ int

[Qemu-devel] [PATCH 3/9] Revert mm: compaction: abort compaction loop if lock is contended or run too long

2012-09-21 Thread Mel Gorman

This reverts
mm-compaction-abort-compaction-loop-if-lock-is-contended-or-run-too-long.patch
as it is replaced by a later patch in the series.

Signed-off-by: Mel Gorman mgor...@suse.de
---
 mm/compaction.c |   12 +---
 mm/internal.h   |2 +-
 2 files changed, 6 insertions(+), 8 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 1c873bb..614f18b 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -70,7 +70,8 @@ static bool compact_checklock_irqsave(spinlock_t *lock, 
unsigned long *flags,
 
/* async aborts if taking too long or contended */
if (!cc-sync) {
-   cc-contended = true;
+   if (cc-contended)
+   *cc-contended = true;
return false;
}
 
@@ -685,7 +686,7 @@ static isolate_migrate_t isolate_migratepages(struct zone 
*zone,
 
/* Perform the isolation */
low_pfn = isolate_migratepages_range(zone, cc, low_pfn, end_pfn);
-   if (!low_pfn || cc-contended)
+   if (!low_pfn)
return ISOLATE_ABORT;
 
cc-migrate_pfn = low_pfn;
@@ -893,7 +894,6 @@ static unsigned long compact_zone_order(struct zone *zone,
 bool sync, bool *contended,
 struct page **page)
 {
-   unsigned long ret;
struct compact_control cc = {
.nr_freepages = 0,
.nr_migratepages = 0,
@@ -901,15 +901,13 @@ static unsigned long compact_zone_order(struct zone *zone,
.migratetype = allocflags_to_migratetype(gfp_mask),
.zone = zone,
.sync = sync,
+   .contended = contended,
.page = page,
};
INIT_LIST_HEAD(cc.freepages);
INIT_LIST_HEAD(cc.migratepages);
 
-   ret = compact_zone(zone, cc);
-   if (contended)
-   *contended = cc.contended;
-   return ret;
+   return compact_zone(zone, cc);
 }
 
 int sysctl_extfrag_threshold = 500;
diff --git a/mm/internal.h b/mm/internal.h
index eebbed5..386772f 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -131,7 +131,7 @@ struct compact_control {
int order;  /* order a direct compactor needs */
int migratetype;/* MOVABLE, RECLAIMABLE etc */
struct zone *zone;
-   bool contended; /* True if a lock was contended */
+   bool *contended;/* True if a lock was contended */
struct page **page; /* Page captured of requested size */
 };
 
-- 
1.7.9.2

[Qemu-devel] [RFC PATCH v3 19/19][SeaBIOS] Calculate pcimem_start and pcimem64_start from SRAT entries

2012-09-21 Thread Vasilis Liaskovitis

pcimem_start and pcimem64_start are adjusted from srat entries. For this reason,
paravirt info (NUMA SRAT entries and number of cpus) need to be read before 
pci_setup.
Imho, this is an ugly code change since SRAT bios tables and number of
cpus have to be read earlier. But the advantage is that no new paravirt 
interface
is introduced. Suggestions to make the code change cleaner are welcome.

The alternative patch (will be sent as a reply to this patch) implements a
paravirt interface to read the starting values of pcimem_start and
pcimem64_start from QEMU.

Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com
---
 src/acpi.c|   82 
 src/acpi.h|3 ++
 src/pciinit.c |6 +++-
 src/post.c|3 ++
 src/smp.c |4 +++
 5 files changed, 72 insertions(+), 26 deletions(-)

diff --git a/src/acpi.c b/src/acpi.c
index 1223b52..9e99aa7 100644
--- a/src/acpi.c
+++ b/src/acpi.c
@@ -428,7 +428,10 @@ encodeLen(u8 *ssdt_ptr, int length, int bytes)
 #define MEM_OFFSET_END   63
 #define MEM_OFFSET_SIZE  79
 
-u64 nb_hp_memslots = 0;
+u64 nb_hp_memslots = 0, nb_numanodes;
+u64 *numa_data, *hp_memdata;
+u64 below_4g_hp_mem_size = 0;
+u64 above_4g_hp_mem_size = 0;
 struct srat_memory_affinity *mem;
 
 #define SSDT_SIGNATURE 0x54445353 // SSDT
@@ -763,17 +766,7 @@ acpi_build_srat_memory(struct srat_memory_affinity 
*numamem,
 static void *
 build_srat(void)
 {
-int nb_numa_nodes = qemu_cfg_get_numa_nodes();
-
-u64 *numadata = malloc_tmphigh(sizeof(u64) * (MaxCountCPUs + 
nb_numa_nodes));
-if (!numadata) {
-warn_noalloc();
-return NULL;
-}
-
-qemu_cfg_get_numa_data(numadata, MaxCountCPUs + nb_numa_nodes);
-
-qemu_cfg_get_numa_data(nb_hp_memslots, 1);
+int nb_numa_nodes = nb_numanodes;
 struct system_resource_affinity_table *srat;
 int srat_size = sizeof(*srat) +
 sizeof(struct srat_processor_affinity) * MaxCountCPUs +
@@ -782,7 +775,7 @@ build_srat(void)
 srat = malloc_high(srat_size);
 if (!srat) {
 warn_noalloc();
-free(numadata);
+free(numa_data);
 return NULL;
 }
 
@@ -791,6 +784,7 @@ build_srat(void)
 struct srat_processor_affinity *core = (void*)(srat + 1);
 int i;
 u64 curnode;
+u64 *numadata = numa_data;
 
 for (i = 0; i  MaxCountCPUs; ++i) {
 core-type = SRAT_PROCESSOR;
@@ -847,15 +841,7 @@ build_srat(void)
 mem = (void*)numamem;
 
 if (nb_hp_memslots) {
-u64 *hpmemdata = malloc_tmphigh(sizeof(u64) * (3 * nb_hp_memslots));
-if (!hpmemdata) {
-warn_noalloc();
-free(hpmemdata);
-free(numadata);
-return NULL;
-}
-
-qemu_cfg_get_numa_data(hpmemdata, 3 * nb_hp_memslots);
+u64 *hpmemdata = hp_memdata;
 
 for (i = 1; i  nb_hp_memslots + 1; ++i) {
 mem_base = *hpmemdata++;
@@ -865,7 +851,7 @@ build_srat(void)
 numamem++;
 slots++;
 }
-free(hpmemdata);
+free(hp_memdata);
 }
 
 for (; slots  nb_numa_nodes + nb_hp_memslots + 2; slots++) {
@@ -875,10 +861,58 @@ build_srat(void)
 
 build_header((void*)srat, SRAT_SIGNATURE, srat_size, 1);
 
-free(numadata);
+free(numa_data);
 return srat;
 }
 
+/* QEMU paravirt SRAT entries need to be read in before pci initilization */
+void read_srat_early(void)
+{
+int i;
+
+nb_numanodes = qemu_cfg_get_numa_nodes();
+u64 *hpmemdata;
+u64 mem_len, mem_base;
+
+numa_data = malloc_tmphigh(sizeof(u64) * (MaxCountCPUs + nb_numanodes));
+if (!numa_data) {
+warn_noalloc();
+}
+
+qemu_cfg_get_numa_data(numa_data, MaxCountCPUs + nb_numanodes);
+qemu_cfg_get_numa_data(nb_hp_memslots, 1);
+
+if (nb_hp_memslots) {
+hp_memdata = malloc_tmphigh(sizeof(u64) * (3 * nb_hp_memslots));
+if (!hp_memdata) {
+warn_noalloc();
+free(hp_memdata);
+free(numa_data);
+}
+
+qemu_cfg_get_numa_data(hp_memdata, 3 * nb_hp_memslots);
+hpmemdata = hp_memdata;
+
+for (i = 1; i  nb_hp_memslots + 1; ++i) {
+mem_base = *hpmemdata++;
+mem_len = *hpmemdata++;
+hpmemdata++;
+if (mem_base = 0x1LL) {
+above_4g_hp_mem_size += mem_len;
+}
+/* if dimm fits before pci hole, append it normally */
+else if (mem_base + mem_len = BUILD_PCIMEM_START) {
+below_4g_hp_mem_size += mem_len;
+}
+/* otherwise place it above 4GB */
+else {
+above_4g_hp_mem_size += mem_len;
+}
+}
+
+}
+}
+
 static const struct pci_device_id acpi_find_tbl[] = {
 /* PIIX4 Power Management device. */
 PCI_DEVICE(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82371AB_3, NULL),
diff --git a/src/acpi.h b/src/acpi.h
index cb21561..d29837f

[Qemu-devel] [RFC PATCH v3 15/19] Add _OST dimm support

2012-09-21 Thread Vasilis Liaskovitis

This allows qemu to receive notifications from the guest OS on success or
failure of a memory hotplug request. The guest OS needs to implement the _OST
functionality for this to work (linux-next: http://lkml.org/lkml/2012/6/25/321)

This patch also updates dimm bitmap state and hot-remove pending flag
on hot-remove fail.  This allows failed hot operations to be retried at
anytime. This only works for guests that use _OST notification.
Also adds new _OST registers in  docs/specs/acpi_hotplug.txt

Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com
---
 docs/specs/acpi_hotplug.txt |   25 +
 hw/acpi_piix4.c |   35 ++-
 hw/dimm.c   |   28 +++-
 hw/dimm.h   |   10 +-
 4 files changed, 95 insertions(+), 3 deletions(-)

diff --git a/docs/specs/acpi_hotplug.txt b/docs/specs/acpi_hotplug.txt
index cf86242..536da16 100644
--- a/docs/specs/acpi_hotplug.txt
+++ b/docs/specs/acpi_hotplug.txt
@@ -20,3 +20,28 @@ ejected.
 
 Written by ACPI memory device _EJ0 method to notify qemu of successfull
 hot-removal.  Write-only.
+
+Memory Dimm ejection failure notification (IO port 0xafa1, 1-byte access):
+---
+Dimm hot-remove _OST notification. Byte value indicates Dimm slot for which
+ejection failed.
+
+Written by ACPI memory device _OST method to notify qemu of failed
+hot-removal.  Write-only.
+
+Memory Dimm insertion success notification (IO port 0xafa2, 1-byte access):
+---
+Dimm hot-remove _OST notification. Byte value indicates Dimm slot for which
+insertion succeeded.
+
+Written by ACPI memory device _OST method to notify qemu of failed
+hot-add.  Write-only.
+
+Memory Dimm insertion failure notification (IO port 0xafa3, 1-byte access):
+---
+Dimm hot-remove _OST notification. Byte value indicates Dimm slot for which
+insertion failed.
+
+Written by ACPI memory device _OST method to notify qemu of failed
+hot-add.  Write-only.
+
diff --git a/hw/acpi_piix4.c b/hw/acpi_piix4.c
index 8776669..f7220d4 100644
--- a/hw/acpi_piix4.c
+++ b/hw/acpi_piix4.c
@@ -49,6 +49,9 @@
 #define PCI_RMV_BASE 0xae0c
 #define MEM_BASE 0xaf80
 #define MEM_EJ_BASE 0xafa0
+#define MEM_OST_REMOVE_FAIL 0xafa1
+#define MEM_OST_ADD_SUCCESS 0xafa2
+#define MEM_OST_ADD_FAIL 0xafa3
 
 #define PIIX4_MEM_HOTPLUG_STATUS 8
 #define PIIX4_PCI_HOTPLUG_STATUS 2
@@ -87,6 +90,7 @@ typedef struct PIIX4PMState {
 uint8_t s4_val;
 } PIIX4PMState;
 
+static int piix4_dimm_revert(DeviceState *qdev, DimmDevice *dev, int add);
 static void piix4_acpi_system_hot_add_init(PCIBus *bus, PIIX4PMState *s);
 
 #define ACPI_ENABLE 0xf1
@@ -531,6 +535,15 @@ static void gpe_writeb(void *opaque, uint32_t addr, 
uint32_t val)
 case MEM_EJ_BASE:
 dimm_notify(val, DIMM_REMOVE_SUCCESS);
 break;
+case MEM_OST_REMOVE_FAIL:
+dimm_notify(val, DIMM_REMOVE_FAIL);
+break;
+case MEM_OST_ADD_SUCCESS:
+dimm_notify(val, DIMM_ADD_SUCCESS);
+break;
+case MEM_OST_ADD_FAIL:
+dimm_notify(val, DIMM_ADD_FAIL);
+break;
 default:
 acpi_gpe_ioport_writeb(s-ar, addr, val);
 }
@@ -604,13 +617,16 @@ static void piix4_acpi_system_hot_add_init(PCIBus *bus, 
PIIX4PMState *s)
 
 register_ioport_read(MEM_BASE, DIMM_BITMAP_BYTES, 1,  gpe_readb, s);
 register_ioport_write(MEM_EJ_BASE, 1, 1,  gpe_writeb, s);
+register_ioport_write(MEM_OST_REMOVE_FAIL, 1, 1,  gpe_writeb, s);
+register_ioport_write(MEM_OST_ADD_SUCCESS, 1, 1,  gpe_writeb, s);
+register_ioport_write(MEM_OST_ADD_FAIL, 1, 1,  gpe_writeb, s);
 
 for(i = 0; i  DIMM_BITMAP_BYTES; i++) {
 s-gperegs.mems_sts[i] = 0;
 }
 
 pci_bus_hotplug(bus, piix4_device_hotplug, s-dev.qdev);
-dimm_bus_hotplug(piix4_dimm_hotplug, s-dev.qdev);
+dimm_bus_hotplug(piix4_dimm_hotplug, piix4_dimm_revert, s-dev.qdev);
 }
 
 static void enable_device(PIIX4PMState *s, int slot)
@@ -656,6 +672,23 @@ static int piix4_dimm_hotplug(DeviceState *qdev, 
DimmDevice *dev, int
 return 0;
 }
 
+static int piix4_dimm_revert(DeviceState *qdev, DimmDevice *dev, int add)
+{
+PCIDevice *pci_dev = DO_UPCAST(PCIDevice, qdev, qdev);
+PIIX4PMState *s = DO_UPCAST(PIIX4PMState, dev, pci_dev);
+struct gpe_regs *g = s-gperegs;
+DimmDevice *slot = DIMM(dev);
+int idx = slot-idx;
+
+if (add) {
+g-mems_sts[idx/8] = ~(1  (idx%8));
+}
+else {
+g-mems_sts[idx/8] |= (1  (idx%8));
+}
+return 0;
+}
+
 static int piix4_device_hotplug(DeviceState *qdev, PCIDevice *dev,
PCIHotplugState state)
 {
diff --git a/hw/dimm.c b/hw/dimm.c
index 21626f6..1521462 100644
--- a/hw/dimm.c
+++ b/hw/dimm.c
@@

[Qemu-devel] [RFC PATCH v3 13/19] balloon: update with hotplugged memory

2012-09-21 Thread Vasilis Liaskovitis

query-balloon and info balloon should report total memory available to the
guest.

balloon inflate/ deflate can also use all memory available to the guest (initial
+ hotplugged memory)

Ballon driver has been minimaly tested with the patch, please review and test.

Caveat: if the guest does not online hotplugged-memory, it's easy for a balloon
inflate command to OOM a guest.

Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com
---
 hw/virtio-balloon.c |   13 +
 1 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/hw/virtio-balloon.c b/hw/virtio-balloon.c
index dd1a650..bca21bc 100644
--- a/hw/virtio-balloon.c
+++ b/hw/virtio-balloon.c
@@ -22,6 +22,7 @@
 #include virtio-balloon.h
 #include kvm.h
 #include exec-memory.h
+#include dimm.h
 
 #if defined(__linux__)
 #include sys/mman.h
@@ -147,10 +148,11 @@ static void virtio_balloon_set_config(VirtIODevice *vdev,
 VirtIOBalloon *dev = to_virtio_balloon(vdev);
 struct virtio_balloon_config config;
 uint32_t oldactual = dev-actual;
+uint64_t hotplugged_ram_size = get_hp_memory_total();
 memcpy(config, config_data, 8);
 dev-actual = le32_to_cpu(config.actual);
 if (dev-actual != oldactual) {
-qemu_balloon_changed(ram_size -
+qemu_balloon_changed(ram_size + hotplugged_ram_size -
  (dev-actual  VIRTIO_BALLOON_PFN_SHIFT));
 }
 }
@@ -188,17 +190,20 @@ static void virtio_balloon_stat(void *opaque, BalloonInfo 
*info)
 
 info-actual = ram_size - ((uint64_t) dev-actual 
VIRTIO_BALLOON_PFN_SHIFT);
+info-actual += get_hp_memory_total(); 
 }
 
 static void virtio_balloon_to_target(void *opaque, ram_addr_t target)
 {
 VirtIOBalloon *dev = opaque;
+uint64_t hotplugged_ram_size = get_hp_memory_total();
 
-if (target  ram_size) {
-target = ram_size;
+if (target  ram_size + hotplugged_ram_size) {
+target = ram_size + hotplugged_ram_size;
 }
 if (target) {
-dev-num_pages = (ram_size - target)  VIRTIO_BALLOON_PFN_SHIFT;
+dev-num_pages = (ram_size + hotplugged_ram_size - target) 
+ VIRTIO_BALLOON_PFN_SHIFT;
 virtio_notify_config(dev-vdev);
 }
 }
-- 
1.7.9

[Qemu-devel] [RFC PATCH v3 14/19][SeaBIOS] Add _OST dimm method

2012-09-21 Thread Vasilis Liaskovitis

Add support for _OST method. _OST method will write into the correct I/O byte to
signal success / failure of hot-add or hot-remove to qemu.

Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com
---
 src/acpi-dsdt.dsl |   50 ++
 src/ssdt-mem.dsl  |4 
 2 files changed, 54 insertions(+), 0 deletions(-)

diff --git a/src/acpi-dsdt.dsl b/src/acpi-dsdt.dsl
index 5d3e92b..0d37bbc 100644
--- a/src/acpi-dsdt.dsl
+++ b/src/acpi-dsdt.dsl
@@ -762,6 +762,28 @@ DefinitionBlock (
 MPE, 8
 }
 
+
+/* Memory hot-remove notify failure byte */
+OperationRegion(MEEF, SystemIO, 0xafa1, 1)
+Field (MEEF, ByteAcc, NoLock, Preserve)
+{
+MEF, 8
+}
+
+/* Memory hot-add notify success byte */
+OperationRegion(MPIS, SystemIO, 0xafa2, 1)
+Field (MPIS, ByteAcc, NoLock, Preserve)
+{
+MIS, 8
+}
+
+/* Memory hot-add notify failure byte */
+OperationRegion(MPIF, SystemIO, 0xafa3, 1)
+Field (MPIF, ByteAcc, NoLock, Preserve)
+{
+MIF, 8
+}
+
 Method(MESC, 0) {
 // Local5 = active memdevice bitmap
 Store (MES, Local5)
@@ -802,6 +824,34 @@ DefinitionBlock (
 Store(Arg0, MPE)
 Sleep(200)
 }
+Method (MOST, 3, Serialized) {
+// _OST method - OS status indication
+Switch (And(Arg0, 0xFF)) {
+Case(0x3)
+{
+Switch(And(Arg1, 0xFF)) {
+Case(0x1) {
+Store(Arg2, MEF)
+// Revert MEON flag for this memory device to one
+Store(One, Index(MEON, Arg2))
+}
+}
+}
+Case(0x1)
+{
+Switch(And(Arg1, 0xFF)) {
+Case(0x0) {
+Store(Arg2, MIS)
+}
+Case(0x1) {
+Store(Arg2, MIF)
+// Revert MEON flag for this memory device to zero
+Store(Zero, Index(MEON, Arg2))
+}
+}
+}
+}
+}
 }
 
 
diff --git a/src/ssdt-mem.dsl b/src/ssdt-mem.dsl
index ee322f0..041d301 100644
--- a/src/ssdt-mem.dsl
+++ b/src/ssdt-mem.dsl
@@ -38,6 +38,7 @@ DefinitionBlock (ssdt-mem.aml, SSDT, 0x02, BXPC, 
CSSDT, 0x1)
 
 External(CMST, MethodObj)
 External(MPEJ, MethodObj)
+External(MOST, MethodObj)
 
 Name(_CRS, ResourceTemplate() {
 QwordMemory(
@@ -60,6 +61,9 @@ DefinitionBlock (ssdt-mem.aml, SSDT, 0x02, BXPC, 
CSSDT, 0x1)
 Method (_EJ0, 1, NotSerialized) {
 MPEJ(ID, Arg0)
 }
+Method (_OST, 3) {
+MOST(Arg0, Arg1, ID)
+}
 }
 }
 
-- 
1.7.9

[Qemu-devel] [RFC PATCH v3 16/19] Update dimm state on reset

2012-09-21 Thread Vasilis Liaskovitis

in case of hot-remove failure on a guest that does not implement _OST,
the dimm bitmaps in qemu and Seabios show the dimm as unplugged, but the dimm
is still present on the qdev/memory bus. To avoid this inconsistency, we set the
dimm state to active/hot-plugged on a reset of the associated acpi_pm device.
This way the dimm is still active after a VM reboot and dimm visibility has
always the same behaviour, regardless of _OST support in the guest.

Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com
---
 hw/acpi_piix4.c |1 +
 hw/dimm.c   |   20 
 hw/dimm.h   |1 +
 3 files changed, 22 insertions(+), 0 deletions(-)

diff --git a/hw/acpi_piix4.c b/hw/acpi_piix4.c
index f7220d4..8bf58a6 100644
--- a/hw/acpi_piix4.c
+++ b/hw/acpi_piix4.c
@@ -373,6 +373,7 @@ static void piix4_reset(void *opaque)
 pci_conf[0x5B] = 0x02;
 }
 piix4_update_hotplug(s);
+dimm_state_sync();
 }
 
 static void piix4_powerdown(void *opaque, int irq, int power_failing)
diff --git a/hw/dimm.c b/hw/dimm.c
index 1521462..b993668 100644
--- a/hw/dimm.c
+++ b/hw/dimm.c
@@ -182,6 +182,26 @@ static DimmDevice *dimm_find_from_idx(uint32_t idx)
 return NULL;
 }
 
+void dimm_state_sync(void)
+{
+DimmBus *bus = main_memory_bus;
+DimmDevice *slot;
+
+/* if a hot-remove operation is pending on reset, it means the hot-remove
+ * operation has failed, but the guest hasn't notified us e.g. because the
+ * guest does not provide _OST notifications. The device is still present 
on
+ * the dimmbus, but the qemu and Seabios dimm bitmaps show this device as
+ * unplugged. To avoid this inconsistency, we set the dimm bits to active
+ * i.e. hot-plugged for each dimm present on the dimmbus.
+ */
+QTAILQ_FOREACH(slot, bus-dimmlist, nextdimm) {
+if (slot-pending == DIMM_REMOVE_PENDING) {
+if (bus-dimm_revert)
+bus-dimm_revert(bus-dimm_hotplug_qdev, slot, 0);
+}
+}
+}
+
 /* used to create a dimm device, only on incoming migration of a hotplugged
  * RAMBlock
  */
diff --git a/hw/dimm.h b/hw/dimm.h
index a6c6e6f..ce091fe 100644
--- a/hw/dimm.h
+++ b/hw/dimm.h
@@ -95,5 +95,6 @@ void main_memory_bus_create(Object *parent);
 void dimm_config_create(char *id, uint64_t size, uint64_t node,
 uint32_t dimm_idx, uint32_t populated);
 uint64_t get_hp_memory_total(void);
+void dimm_state_sync(void);
 
 #endif
-- 
1.7.9

[Qemu-devel] [PATCH 9/9] mm: compaction: Restart compaction from near where it left off

2012-09-21 Thread Mel Gorman

This is almost entirely based on Rik's previous patches and discussions
with him about how this might be implemented.

Order  0 compaction stops when enough free pages of the correct page
order have been coalesced.  When doing subsequent higher order allocations,
it is possible for compaction to be invoked many times.

However, the compaction code always starts out looking for things to compact
at the start of the zone, and for free pages to compact things to at the
end of the zone.

This can cause quadratic behaviour, with isolate_freepages starting at
the end of the zone each time, even though previous invocations of the
compaction code already filled up all free memory on that end of the zone.
This can cause isolate_freepages to take enormous amounts of CPU with
certain workloads on larger memory systems.

This patch caches where the migration and free scanner should start from on
subsequent compaction invocations using the pageblock-skip information. When
compaction starts it begins from the cached restart points and will
update the cached restart points until a page is isolated or a pageblock
is skipped that would have been scanned by synchronous compaction.

Signed-off-by: Mel Gorman mgor...@suse.de
Acked-by: Rik van Riel r...@redhat.com
---
 include/linux/mmzone.h |4 
 mm/compaction.c|   54 
 mm/internal.h  |4 
 3 files changed, 53 insertions(+), 9 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a456361..e7792a3 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -370,6 +370,10 @@ struct zone {
int all_unreclaimable; /* All pages pinned */
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
unsigned long   compact_blockskip_expire;
+
+   /* pfns where compaction scanners should start */
+   unsigned long   compact_cached_free_pfn;
+   unsigned long   compact_cached_migrate_pfn;
 #endif
 #ifdef CONFIG_MEMORY_HOTPLUG
/* see spanned/present_pages for more description */
diff --git a/mm/compaction.c b/mm/compaction.c
index 9276bc8..4bd96f3 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -79,6 +79,9 @@ static void reset_isolation_suitable(struct zone *zone)
 */
if (time_before(jiffies, zone-compact_blockskip_expire))
return;
+
+   zone-compact_cached_migrate_pfn = start_pfn;
+   zone-compact_cached_free_pfn = end_pfn;
zone-compact_blockskip_expire = jiffies + (HZ * 5);
 
/* Walk the zone and mark every pageblock as suitable for isolation */
@@ -99,13 +102,29 @@ static void reset_isolation_suitable(struct zone *zone)
  * If no pages were isolated then mark this pageblock to be skipped in the
  * future. The information is later cleared by reset_isolation_suitable().
  */
-static void update_pageblock_skip(struct page *page, unsigned long nr_isolated)
+static void update_pageblock_skip(struct compact_control *cc,
+   struct page *page, unsigned long nr_isolated,
+   bool migrate_scanner)
 {
+   struct zone *zone = cc-zone;
if (!page)
return;
 
-   if (!nr_isolated)
+   if (!nr_isolated) {
+   unsigned long pfn = page_to_pfn(page);
set_pageblock_skip(page);
+
+   /* Update where compaction should restart */
+   if (migrate_scanner) {
+   if (!cc-finished_update_migrate 
+   pfn  zone-compact_cached_migrate_pfn)
+   zone-compact_cached_migrate_pfn = pfn;
+   } else {
+   if (!cc-finished_update_free 
+   pfn  zone-compact_cached_free_pfn)
+   zone-compact_cached_free_pfn = pfn;
+   }
+   }
 }
 
 static inline bool should_release_lock(spinlock_t *lock)
@@ -315,7 +334,7 @@ out:
 
/* Update the pageblock-skip if the whole pageblock was scanned */
if (blockpfn == end_pfn)
-   update_pageblock_skip(valid_page, total_isolated);
+   update_pageblock_skip(cc, valid_page, total_isolated, false);
 
return total_isolated;
 }
@@ -530,6 +549,7 @@ isolate_migratepages_range(struct zone *zone, struct 
compact_control *cc,
 */
if (!cc-sync  last_pageblock_nr != pageblock_nr 
!migrate_async_suitable(get_pageblock_migratetype(page))) {
+   cc-finished_update_migrate = true;
goto next_pageblock;
}
 
@@ -578,6 +598,7 @@ isolate_migratepages_range(struct zone *zone, struct 
compact_control *cc,
VM_BUG_ON(PageTransCompound(page));
 
/* Successfully isolated */
+   cc-finished_update_migrate = true;
del_page_from_lru_list(page, lruvec, page_lru(page));

Re: [Qemu-devel] [PATCH] New syscalls to the seccomp whitelist

2012-09-21 Thread Kevin Wolf

Am 20.09.2012 23:00, schrieb Eduardo Otubo:
 Seccomp syscall whitelist updated after tests running qemu under
 libvirt. Reference to the bug -
 https://bugzilla.redhat.com/show_bug.cgi?id=855162
 
 Regards,
 ---
  qemu-seccomp.c | 21 -
  1 file changed, 20 insertions(+), 1 deletion(-)

SoB is missing.

Kevin

[Qemu-devel] [RFC PATCH v3 07/19] acpi_piix4: Implement memory device hotplug registers

2012-09-21 Thread Vasilis Liaskovitis

A 32-byte register is used to present up to 256 hotplug-able memory devices
to BIOS and OSPM. Hot-add and hot-remove functions trigger an ACPI hotplug
event through these. Only reads are allowed from these registers.

An ACPI hot-remove event but needs to wait for OSPM to eject the device.
We use a single-byte register to know when OSPM has called the _EJ function
for a particular dimm. A write to this byte will depopulate the respective dimm.
Only writes are allowed to this byte.

v1-v2:
mems_sts address moved from 0xaf20 to 0xaf80 (to accomodate more space for
cpu-hotplugging in the future).
_EJ array is reduced to a single byte.
Add documentation in docs/specs/acpi_hotplug.txt

v2-v3:
minor name changes

Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com
---
 docs/specs/acpi_hotplug.txt |   22 +
 hw/acpi_piix4.c |   73 --
 2 files changed, 91 insertions(+), 4 deletions(-)
 create mode 100644 docs/specs/acpi_hotplug.txt

diff --git a/docs/specs/acpi_hotplug.txt b/docs/specs/acpi_hotplug.txt
new file mode 100644
index 000..cf86242
--- /dev/null
+++ b/docs/specs/acpi_hotplug.txt
@@ -0,0 +1,22 @@
+QEMU-ACPI BIOS hotplug interface
+--
+This document describes the interface between QEMU and the ACPI BIOS for 
non-PCI
+space. For the PCI interface please look at docs/specs/acpi_pci_hotplug.txt
+
+QEMU-ACPI BIOS memory hotplug interface
+--
+
+Memory Dimm status array (IO port 0xaf80-0xaf9f, 1-byte access):
+---
+Dimm hot-plug notification pending. One bit per slot.
+
+Read by ACPI BIOS GPE.3 handler to notify OS of memory hot-add or hot-remove
+events.  Read-only.
+
+Memory Dimm ejection success notification (IO port 0xafa0, 1-byte access):
+---
+Dimm hot-remove _EJ0 notification. Byte value indicates Dimm slot that was
+ejected.
+
+Written by ACPI memory device _EJ0 method to notify qemu of successfull
+hot-removal.  Write-only.
diff --git a/hw/acpi_piix4.c b/hw/acpi_piix4.c
index c56220b..8776669 100644
--- a/hw/acpi_piix4.c
+++ b/hw/acpi_piix4.c
@@ -28,6 +28,8 @@
 #include range.h
 #include ioport.h
 #include fw_cfg.h
+#include sysbus.h
+#include dimm.h
 
 //#define DEBUG
 
@@ -45,9 +47,15 @@
 #define PCI_DOWN_BASE 0xae04
 #define PCI_EJ_BASE 0xae08
 #define PCI_RMV_BASE 0xae0c
+#define MEM_BASE 0xaf80
+#define MEM_EJ_BASE 0xafa0
 
+#define PIIX4_MEM_HOTPLUG_STATUS 8
 #define PIIX4_PCI_HOTPLUG_STATUS 2
 
+struct gpe_regs {
+uint8_t mems_sts[DIMM_BITMAP_BYTES];
+};
 struct pci_status {
 uint32_t up; /* deprecated, maintained for migration compatibility */
 uint32_t down;
@@ -69,6 +77,7 @@ typedef struct PIIX4PMState {
 Notifier machine_ready;
 
 /* for pci hotplug */
+struct gpe_regs gperegs;
 struct pci_status pci0_status;
 uint32_t pci0_hotplug_enable;
 uint32_t pci0_slot_device_present;
@@ -93,8 +102,8 @@ static void pm_update_sci(PIIX4PMState *s)
ACPI_BITMASK_POWER_BUTTON_ENABLE |
ACPI_BITMASK_GLOBAL_LOCK_ENABLE |
ACPI_BITMASK_TIMER_ENABLE)) != 0) ||
-(((s-ar.gpe.sts[0]  s-ar.gpe.en[0])
-   PIIX4_PCI_HOTPLUG_STATUS) != 0);
+(((s-ar.gpe.sts[0]  s-ar.gpe.en[0]) 
+  (PIIX4_PCI_HOTPLUG_STATUS | PIIX4_MEM_HOTPLUG_STATUS)) != 0);
 
 qemu_set_irq(s-irq, sci_level);
 /* schedule a timer interruption if needed */
@@ -499,7 +508,16 @@ type_init(piix4_pm_register_types)
 static uint32_t gpe_readb(void *opaque, uint32_t addr)
 {
 PIIX4PMState *s = opaque;
-uint32_t val = acpi_gpe_ioport_readb(s-ar, addr);
+uint32_t val = 0;
+struct gpe_regs *g = s-gperegs;
+
+switch (addr) {
+case MEM_BASE ... MEM_BASE+DIMM_BITMAP_BYTES:
+val = g-mems_sts[addr - MEM_BASE];
+break;
+default:
+val = acpi_gpe_ioport_readb(s-ar, addr);
+}
 
 PIIX4_DPRINTF(gpe read %x == %x\n, addr, val);
 return val;
@@ -509,7 +527,13 @@ static void gpe_writeb(void *opaque, uint32_t addr, 
uint32_t val)
 {
 PIIX4PMState *s = opaque;
 
-acpi_gpe_ioport_writeb(s-ar, addr, val);
+switch (addr) {
+case MEM_EJ_BASE:
+dimm_notify(val, DIMM_REMOVE_SUCCESS);
+break;
+default:
+acpi_gpe_ioport_writeb(s-ar, addr, val);
+}
 pm_update_sci(s);
 
 PIIX4_DPRINTF(gpe write %x == %d\n, addr, val);
@@ -560,9 +584,11 @@ static uint32_t pcirmv_read(void *opaque, uint32_t addr)
 
 static int piix4_device_hotplug(DeviceState *qdev, PCIDevice *dev,
 PCIHotplugState state);
+static int piix4_dimm_hotplug(DeviceState *qdev, DimmDevice *dev, int add);
 
 static void piix4_acpi_system_hot_add_init(PCIBus *bus, PIIX4PMState *s)
 {
+int i = 0;

[Qemu-devel] [PATCH 1/9] Revert mm: compaction: check lock contention first before taking lock

2012-09-21 Thread Mel Gorman

This reverts
mm-compaction-check-lock-contention-first-before-taking-lock.patch as it
is replaced by a later patch in the series.

Signed-off-by: Mel Gorman mgor...@suse.de
---
 mm/compaction.c |5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 3bb7232..4a77b4b 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -347,9 +347,8 @@ isolate_migratepages_range(struct zone *zone, struct 
compact_control *cc,
 
/* Time to isolate some pages for migration */
cond_resched();
-   locked = compact_trylock_irqsave(zone-lru_lock, flags, cc);
-   if (!locked)
-   return 0;
+   spin_lock_irqsave(zone-lru_lock, flags);
+   locked = true;
for (; low_pfn  end_pfn; low_pfn++) {
struct page *page;
 
-- 
1.7.9.2

[Qemu-devel] [PATCH 5/9] mm: compaction: Acquire the zone-lru_lock as late as possible

2012-09-21 Thread Mel Gorman

Compactions migrate scanner acquires the zone-lru_lock when scanning a range
of pages looking for LRU pages to acquire. It does this even if there are
no LRU pages in the range. If multiple processes are compacting then this
can cause severe locking contention. To make matters worse commit b2eef8c0
(mm: compaction: minimise the time IRQs are disabled while isolating pages
for migration) releases the lru_lock every SWAP_CLUSTER_MAX pages that are
scanned.

This patch makes two changes to how the migrate scanner acquires the LRU
lock. First, it only releases the LRU lock every SWAP_CLUSTER_MAX pages if
the lock is contended. This reduces the number of times it unnecessarily
disables and re-enables IRQs. The second is that it defers acquiring the
LRU lock for as long as possible. If there are no LRU pages or the only
LRU pages are transhuge then the LRU lock will not be acquired at all
which reduces contention on zone-lru_lock.

Signed-off-by: Mel Gorman mgor...@suse.de
Acked-by: Rik van Riel r...@redhat.com
---
 mm/compaction.c |   63 +--
 1 file changed, 43 insertions(+), 20 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 6b55491..a6068ff 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -50,6 +50,11 @@ static inline bool migrate_async_suitable(int migratetype)
return is_migrate_cma(migratetype) || migratetype == MIGRATE_MOVABLE;
 }
 
+static inline bool should_release_lock(spinlock_t *lock)
+{
+   return need_resched() || spin_is_contended(lock);
+}
+
 /*
  * Compaction requires the taking of some coarse locks that are potentially
  * very heavily contended. Check if the process needs to be scheduled or
@@ -62,7 +67,7 @@ static inline bool migrate_async_suitable(int migratetype)
 static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
  bool locked, struct compact_control *cc)
 {
-   if (need_resched() || spin_is_contended(lock)) {
+   if (should_release_lock(lock)) {
if (locked) {
spin_unlock_irqrestore(lock, *flags);
locked = false;
@@ -327,7 +332,7 @@ isolate_migratepages_range(struct zone *zone, struct 
compact_control *cc,
isolate_mode_t mode = 0;
struct lruvec *lruvec;
unsigned long flags;
-   bool locked;
+   bool locked = false;
 
/*
 * Ensure that there are not too many pages isolated from the LRU
@@ -347,23 +352,17 @@ isolate_migratepages_range(struct zone *zone, struct 
compact_control *cc,
 
/* Time to isolate some pages for migration */
cond_resched();
-   spin_lock_irqsave(zone-lru_lock, flags);
-   locked = true;
for (; low_pfn  end_pfn; low_pfn++) {
struct page *page;
 
/* give a chance to irqs before checking need_resched() */
-   if (!((low_pfn+1) % SWAP_CLUSTER_MAX)) {
-   spin_unlock_irqrestore(zone-lru_lock, flags);
-   locked = false;
+   if (locked  !((low_pfn+1) % SWAP_CLUSTER_MAX)) {
+   if (should_release_lock(zone-lru_lock)) {
+   spin_unlock_irqrestore(zone-lru_lock, flags);
+   locked = false;
+   }
}
 
-   /* Check if it is ok to still hold the lock */
-   locked = compact_checklock_irqsave(zone-lru_lock, flags,
-   locked, cc);
-   if (!locked || fatal_signal_pending(current))
-   break;
-
/*
 * migrate_pfn does not necessarily start aligned to a
 * pageblock. Ensure that pfn_valid is called when moving
@@ -403,21 +402,38 @@ isolate_migratepages_range(struct zone *zone, struct 
compact_control *cc,
pageblock_nr = low_pfn  pageblock_order;
if (!cc-sync  last_pageblock_nr != pageblock_nr 
!migrate_async_suitable(get_pageblock_migratetype(page))) {
-   low_pfn += pageblock_nr_pages;
-   low_pfn = ALIGN(low_pfn, pageblock_nr_pages) - 1;
-   last_pageblock_nr = pageblock_nr;
-   continue;
+   goto next_pageblock;
}
 
+   /* Check may be lockless but that's ok as we recheck later */
if (!PageLRU(page))
continue;
 
/*
-* PageLRU is set, and lru_lock excludes isolation,
-* splitting and collapsing (collapsing has already
-* happened if PageLRU is set).
+* PageLRU is set. lru_lock normally excludes isolation
+* splitting and collapsing (collapsing has already happened
+* if PageLRU is set) but the lock

[Qemu-devel] [RFC PATCH v3 10/19] fix live-migration when populated=on is missing

2012-09-21 Thread Vasilis Liaskovitis

Live migration works after memory hot-add events, as long as the
qemu command line -dimm arguments are changed on the destination host
to specify populated=on for the dimms that have been hot-added.

If a command-line change has not occured, the destination host does not yet
have the corresponding ramblock in its ram_list. Activate the dimm on the
destination during ram_load.

Perhaps several fields of the DimmDevice should be part of a
VMStateDescription to handle migration in a cleaner way. But the problem
is that ramblocks are checked before qdev vmstates.

Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com
---
 arch_init.c |   24 +---
 1 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index 5a1173e..b63caa7 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -45,6 +45,7 @@
 #include hw/pcspk.h
 #include qemu/page_cache.h
 #include qmp-commands.h
+#include hw/dimm.h
 
 #ifdef DEBUG_ARCH_INIT
 #define DPRINTF(fmt, ...) \
@@ -740,10 +741,27 @@ static int ram_load(QEMUFile *f, void *opaque, int 
version_id)
 }
 
 if (!block) {
-fprintf(stderr, Unknown ramblock \%s\, cannot 
+/* this can happen if a dimm was hot-added at source 
host */
+bool ramblock_found = false;
+if (dimm_add(id)) {
+fprintf(stderr, Cannot add unknown ramblock 
\%s\, 
+cannot accept migration\n, id);
+ret = -EINVAL;
+goto done;
+}
+/* rescan ram_list, verify ramblock is there now */
+QLIST_FOREACH(block, ram_list.blocks, next) {
+if (!strncmp(id, block-idstr, sizeof(id))) {
+ramblock_found = true;
+break;
+}
+}
+if (!ramblock_found) {
+fprintf(stderr, Unknown ramblock \%s\, cannot 
 accept migration\n, id);
-ret = -EINVAL;
-goto done;
+ret = -EINVAL;
+goto done;
+}
 }
 
 total_ram_bytes -= length;
-- 
1.7.9

[Qemu-devel] [RFC PATCH v3 17/19][SeaBIOS] Implement _PS3 method for memory device

2012-09-21 Thread Vasilis Liaskovitis

Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com
---
 src/acpi-dsdt.dsl |   15 +++
 src/ssdt-mem.dsl  |4 
 2 files changed, 19 insertions(+), 0 deletions(-)

diff --git a/src/acpi-dsdt.dsl b/src/acpi-dsdt.dsl
index 0d37bbc..8a18770 100644
--- a/src/acpi-dsdt.dsl
+++ b/src/acpi-dsdt.dsl
@@ -784,6 +784,13 @@ DefinitionBlock (
 MIF, 8
 }
 
+/* Memory _PS3 byte */
+OperationRegion(MPSB, SystemIO, 0xafa4, 1)
+Field (MPSB, ByteAcc, NoLock, Preserve)
+{
+MPS, 8
+}
+
 Method(MESC, 0) {
 // Local5 = active memdevice bitmap
 Store (MES, Local5)
@@ -824,6 +831,14 @@ DefinitionBlock (
 Store(Arg0, MPE)
 Sleep(200)
 }
+
+Method (MPS3, 1, NotSerialized) {
+// _PS3 method - power-off method
+Store(Arg0, MPS)
+Store(Zero, Index(MEON, Arg0))
+Sleep(200)
+}
+
 Method (MOST, 3, Serialized) {
 // _OST method - OS status indication
 Switch (And(Arg0, 0xFF)) {
diff --git a/src/ssdt-mem.dsl b/src/ssdt-mem.dsl
index 041d301..7423fc6 100644
--- a/src/ssdt-mem.dsl
+++ b/src/ssdt-mem.dsl
@@ -39,6 +39,7 @@ DefinitionBlock (ssdt-mem.aml, SSDT, 0x02, BXPC, 
CSSDT, 0x1)
 External(CMST, MethodObj)
 External(MPEJ, MethodObj)
 External(MOST, MethodObj)
+External(MPS3, MethodObj)
 
 Name(_CRS, ResourceTemplate() {
 QwordMemory(
@@ -64,6 +65,9 @@ DefinitionBlock (ssdt-mem.aml, SSDT, 0x02, BXPC, 
CSSDT, 0x1)
 Method (_OST, 3) {
 MOST(Arg0, Arg1, ID)
 }
+Method (_PS3, 0) {
+MPS3(ID)
+}
 }
 }
 
-- 
1.7.9

[Qemu-devel] [RFC PATCH v3 04/19][SeaBIOS] acpi: generate hotplug memory devices

2012-09-21 Thread Vasilis Liaskovitis

The memory device generation is guided by qemu paravirt info. Seabios
first uses the info to setup SRAT entries for the hotplug-able memory slots.
Afterwards, build_memssdt uses the created SRAT entries to generate
appropriate memory device objects. One memory device (and corresponding SRAT
entry) is generated for each hotplug-able qemu memslot. Currently no SSDT
memory device is created for initial system memory.

We only support up to 255 DIMMs for now (PackageOp used for the MEON array can
only describe an array of at most 255 elements. VarPackageOp would be needed to
support more than 255 devices)

v1-v2:
Seabios reads mems_sts from qemu to build e820_map
SSDT size and some offsets are calculated with extraction macros.

v2-v3:
Minor name changes

Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com
---
 src/acpi.c |  158 +--
 1 files changed, 152 insertions(+), 6 deletions(-)

diff --git a/src/acpi.c b/src/acpi.c
index 6d239fa..1223b52 100644
--- a/src/acpi.c
+++ b/src/acpi.c
@@ -13,6 +13,7 @@
 #include pci_regs.h // PCI_INTERRUPT_LINE
 #include ioport.h // inl
 #include paravirt.h // qemu_cfg_irq0_override
+#include memmap.h
 
 //
 /* ACPI tables init */
@@ -416,11 +417,26 @@ encodeLen(u8 *ssdt_ptr, int length, int bytes)
 #define PCIHP_AML (ssdp_pcihp_aml + *ssdt_pcihp_start)
 #define PCI_SLOTS 32
 
+/* 0x5B 0x82 DeviceOp PkgLength NameString DimmID */
+#define MEM_BASE 0xaf80
+#define MEM_AML (ssdm_mem_aml + *ssdt_mem_start)
+#define MEM_SIZEOF (*ssdt_mem_end - *ssdt_mem_start)
+#define MEM_OFFSET_HEX (*ssdt_mem_name - *ssdt_mem_start + 2)
+#define MEM_OFFSET_ID (*ssdt_mem_id - *ssdt_mem_start)
+#define MEM_OFFSET_PXM 31
+#define MEM_OFFSET_START 55
+#define MEM_OFFSET_END   63
+#define MEM_OFFSET_SIZE  79
+
+u64 nb_hp_memslots = 0;
+struct srat_memory_affinity *mem;
+
 #define SSDT_SIGNATURE 0x54445353 // SSDT
 #define SSDT_HEADER_LENGTH 36
 
 #include ssdt-susp.hex
 #include ssdt-pcihp.hex
+#include ssdt-mem.hex
 
 #define PCI_RMV_BASE 0xae0c
 
@@ -472,6 +488,111 @@ static void patch_pcihp(int slot, u8 *ssdt_ptr, u32 eject)
 }
 }
 
+static void build_memdev(u8 *ssdt_ptr, int i, u64 mem_base, u64 mem_len, u8 
node)
+{
+memcpy(ssdt_ptr, MEM_AML, MEM_SIZEOF);
+ssdt_ptr[MEM_OFFSET_HEX] = getHex(i  4);
+ssdt_ptr[MEM_OFFSET_HEX+1] = getHex(i);
+ssdt_ptr[MEM_OFFSET_ID] = i;
+ssdt_ptr[MEM_OFFSET_PXM] = node;
+*(u64*)(ssdt_ptr + MEM_OFFSET_START) = mem_base;
+*(u64*)(ssdt_ptr + MEM_OFFSET_END) = mem_base + mem_len;
+*(u64*)(ssdt_ptr + MEM_OFFSET_SIZE) = mem_len;
+}
+
+static void*
+build_memssdt(void)
+{
+u64 mem_base;
+u64 mem_len;
+u8  node;
+int i;
+struct srat_memory_affinity *entry = mem;
+u64 nb_memdevs = nb_hp_memslots;
+u8  memslot_status, enabled;
+
+int length = ((1+3+4)
+  + (nb_memdevs * MEM_SIZEOF)
+  + (1+2+5+(12*nb_memdevs))
+  + (6+2+1+(1*nb_memdevs)));
+u8 *ssdt = malloc_high(sizeof(struct acpi_table_header) + length);
+if (! ssdt) {
+warn_noalloc();
+return NULL;
+}
+u8 *ssdt_ptr = ssdt + sizeof(struct acpi_table_header);
+
+// build Scope(_SB_) header
+*(ssdt_ptr++) = 0x10; // ScopeOp
+ssdt_ptr = encodeLen(ssdt_ptr, length-1, 3);
+*(ssdt_ptr++) = '_';
+*(ssdt_ptr++) = 'S';
+*(ssdt_ptr++) = 'B';
+*(ssdt_ptr++) = '_';
+
+for (i = 0; i  nb_memdevs; i++) {
+mem_base = (((u64)(entry-base_addr_high)  32 )| 
entry-base_addr_low);
+mem_len = (((u64)(entry-length_high)  32 )| entry-length_low);
+node = entry-proximity[0];
+build_memdev(ssdt_ptr, i, mem_base, mem_len, node);
+ssdt_ptr += MEM_SIZEOF;
+entry++;
+}
+
+// build Method(MTFY, 2) {If (LEqual(Arg0, 0x00)) {Notify(CM00, Arg1)} 
...}
+*(ssdt_ptr++) = 0x14; // MethodOp
+ssdt_ptr = encodeLen(ssdt_ptr, 2+5+(12*nb_memdevs), 2);
+*(ssdt_ptr++) = 'M';
+*(ssdt_ptr++) = 'T';
+*(ssdt_ptr++) = 'F';
+*(ssdt_ptr++) = 'Y';
+*(ssdt_ptr++) = 0x02;
+for (i=0; inb_memdevs; i++) {
+*(ssdt_ptr++) = 0xA0; // IfOp
+   ssdt_ptr = encodeLen(ssdt_ptr, 11, 1);
+*(ssdt_ptr++) = 0x93; // LEqualOp
+*(ssdt_ptr++) = 0x68; // Arg0Op
+*(ssdt_ptr++) = 0x0A; // BytePrefix
+*(ssdt_ptr++) = i;
+*(ssdt_ptr++) = 0x86; // NotifyOp
+*(ssdt_ptr++) = 'M';
+*(ssdt_ptr++) = 'P';
+*(ssdt_ptr++) = getHex(i  4);
+*(ssdt_ptr++) = getHex(i);
+*(ssdt_ptr++) = 0x69; // Arg1Op
+}
+
+// build Name(MEON, Package() { One, One, ..., Zero, Zero, ... })
+*(ssdt_ptr++) = 0x08; // NameOp
+*(ssdt_ptr++) = 'M';
+*(ssdt_ptr++) = 'E';
+*(ssdt_ptr++) = 'O';
+*(ssdt_ptr++) = 'N';
+*(ssdt_ptr++) = 0x12; // PackageOp
+ssdt_ptr = encodeLen(ssdt_ptr, 2+1+(1*nb_memdevs), 2);
+*(ssdt_ptr++) =

[Qemu-devel] [RFC PATCH v3 19/19] alternative: Introduce paravirt interface QEMU_CFG_PCI_WINDOW

2012-09-21 Thread Vasilis Liaskovitis

Qemu already calculates the 32-bit and 64-bit PCI starting offsets based on
initial memory and hotplug-able dimms. This info needs to be passed to Seabios
for PCI initialization.

Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com
---
 docs/specs/fwcfg.txt |9 +
 hw/fw_cfg.h  |1 +
 hw/pc_piix.c |   10 ++
 3 files changed, 20 insertions(+), 0 deletions(-)

diff --git a/docs/specs/fwcfg.txt b/docs/specs/fwcfg.txt
index 55f96d9..d9fa215 100644
--- a/docs/specs/fwcfg.txt
+++ b/docs/specs/fwcfg.txt
@@ -26,3 +26,12 @@ Entry max_cpus+nb_numa_nodes+1 contains the number of memory 
dimms (nb_hp_dimms)
 The last 3 * nb_hp_dimms entries are organized in triplets: Each triplet 
contains
 the physical address offset, size (in bytes), and node proximity for the
 respective dimm.
+
+FW_CFG_PCI_WINDOW paravirt info
+
+QEMU passes the starting address for the 32-bit and 64-bit PCI windows to BIOS.
+The following layouts are followed:
+
+
+pcimem32_start | pcimem64_start | 
+
diff --git a/hw/fw_cfg.h b/hw/fw_cfg.h
index 856bf91..6c8c151 100644
--- a/hw/fw_cfg.h
+++ b/hw/fw_cfg.h
@@ -27,6 +27,7 @@
 #define FW_CFG_SETUP_SIZE   0x17
 #define FW_CFG_SETUP_DATA   0x18
 #define FW_CFG_FILE_DIR 0x19
+#define FW_CFG_PCI_WINDOW   0x1a
 
 #define FW_CFG_FILE_FIRST   0x20
 #define FW_CFG_FILE_SLOTS   0x10
diff --git a/hw/pc_piix.c b/hw/pc_piix.c
index d1fd276..034761f 100644
--- a/hw/pc_piix.c
+++ b/hw/pc_piix.c
@@ -44,6 +44,7 @@
 #include memory.h
 #include exec-memory.h
 #include dimm.h
+#include fw_cfg.h
 #ifdef CONFIG_XEN
 #  include xen/hvm/hvm_info_table.h
 #endif
@@ -149,6 +150,7 @@ static void pc_init1(MemoryRegion *system_memory,
 MemoryRegion *pci_memory;
 MemoryRegion *rom_memory;
 void *fw_cfg = NULL;
+uint64_t *pci_window_fw_cfg;
 
 pc_cpus_init(cpu_model);
 
@@ -205,6 +207,14 @@ static void pc_init1(MemoryRegion *system_memory,
? 0
: ((uint64_t)1  62)),
   pci_memory, ram_memory);
+
+pci_window_fw_cfg = g_malloc0(2 * 8);
+pci_window_fw_cfg[0] = cpu_to_le64(below_4g_mem_size +
+below_4g_hp_mem_size);
+pci_window_fw_cfg[1] = cpu_to_le64(0x1ULL + above_4g_mem_size
++ above_4g_hp_mem_size);
+fw_cfg_add_bytes(fw_cfg, FW_CFG_PCI_WINDOW, 
+(uint8_t *)pci_window_fw_cfg, 2 * 8);
 } else {
 pci_bus = NULL;
 i440fx_state = NULL;
-- 
1.7.9

[Qemu-devel] [RFC PATCH v3 05/19] Implement dimm device abstraction

2012-09-21 Thread Vasilis Liaskovitis

Each hotplug-able memory slot is a DimmDevice. All DimmDevices are attached
to a new bus called DimmBus. This bus is introduced so that we no longer
depend on hotplug-capability of main system bus (the main bus does not allow
hotplugging). The DimmBus should be attached to a chipset Device (i440fx in case
of the pc)

A hot-add operation for a particular dimm:
- creates a new DimmDevice and attaches it to the DimmBus
- creates a new MemoryRegion of the given physical address offset, size and
node proximity, and attaches it to main system memory as a sub_region.

A successful hot-remove operation detaches and frees the MemoryRegion from
system memory, and removes the DimmDevice from the DimmBus.

Hotplug operations are done through normal device_add /device_del commands.
Also add properties to DimmDevice.

Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com
---
 hw/dimm.c |  305 +
 hw/dimm.h |   90 ++
 2 files changed, 395 insertions(+), 0 deletions(-)
 create mode 100644 hw/dimm.c
 create mode 100644 hw/dimm.h

diff --git a/hw/dimm.c b/hw/dimm.c
new file mode 100644
index 000..288b997
--- /dev/null
+++ b/hw/dimm.c
@@ -0,0 +1,305 @@
+/*
+ * Dimm device for Memory Hotplug
+ *
+ * Copyright ProfitBricks GmbH 2012
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see http://www.gnu.org/licenses/
+ */
+
+#include trace.h
+#include qdev.h
+#include dimm.h
+#include time.h
+#include ../exec-memory.h
+#include qmp-commands.h
+
+/* the system-wide memory bus. */
+static DimmBus *main_memory_bus;
+/* the following list is used to hold dimm config info before machine
+ * initialization. After machine init, the list is emptied and not used 
anymore.*/
+static DimmConfiglist dimmconfig_list = 
QTAILQ_HEAD_INITIALIZER(dimmconfig_list);
+
+static void dimmbus_dev_print(Monitor *mon, DeviceState *dev, int indent);
+static char *dimmbus_get_fw_dev_path(DeviceState *dev);
+
+static Property dimm_properties[] = {
+DEFINE_PROP_UINT64(start, DimmDevice, start, 0),
+DEFINE_PROP_UINT64(size, DimmDevice, size, DEFAULT_DIMMSIZE),
+DEFINE_PROP_UINT32(node, DimmDevice, node, 0),
+DEFINE_PROP_END_OF_LIST(),
+};
+
+static void dimmbus_dev_print(Monitor *mon, DeviceState *dev, int indent)
+{
+}
+
+static char *dimmbus_get_fw_dev_path(DeviceState *dev)
+{
+char path[40];
+
+snprintf(path, sizeof(path), %s, qdev_fw_name(dev));
+return strdup(path);
+}
+
+static void dimm_bus_class_init(ObjectClass *klass, void *data)
+{
+BusClass *k = BUS_CLASS(klass);
+
+k-print_dev = dimmbus_dev_print;
+k-get_fw_dev_path = dimmbus_get_fw_dev_path;
+}
+
+static void dimm_bus_initfn(Object *obj)
+{
+DimmConfig *dimm_cfg, *next_dimm_cfg;
+DimmBus *bus = DIMM_BUS(obj);
+QTAILQ_INIT(bus-dimmconfig_list);
+QTAILQ_INIT(bus-dimmlist);
+
+QTAILQ_FOREACH_SAFE(dimm_cfg, dimmconfig_list, nextdimmcfg, 
next_dimm_cfg) {
+QTAILQ_REMOVE(dimmconfig_list, dimm_cfg, nextdimmcfg);
+QTAILQ_INSERT_TAIL(bus-dimmconfig_list, dimm_cfg, nextdimmcfg);
+}
+}
+
+static const TypeInfo dimm_bus_info = {
+.name = TYPE_DIMM_BUS,
+.parent = TYPE_BUS,
+.instance_size = sizeof(DimmBus),
+.instance_init = dimm_bus_initfn,
+.class_init = dimm_bus_class_init,
+};
+
+void main_memory_bus_create(Object *parent)
+{
+main_memory_bus = g_malloc0(dimm_bus_info.instance_size);
+main_memory_bus-qbus.glib_allocated = true;
+qbus_create_inplace(main_memory_bus-qbus, TYPE_DIMM_BUS, DEVICE(parent),
+membus);
+}
+
+static void dimm_populate(DimmDevice *s)
+{
+DeviceState *dev= (DeviceState*)s;
+MemoryRegion *new = NULL;
+
+new = g_malloc(sizeof(MemoryRegion));
+memory_region_init_ram(new, dev-id, s-size);
+vmstate_register_ram_global(new);
+memory_region_add_subregion(get_system_memory(), s-start, new);
+s-mr = new;
+}
+
+static void dimm_depopulate(DimmDevice *s)
+{
+assert(s);
+vmstate_unregister_ram(s-mr, NULL);
+memory_region_del_subregion(get_system_memory(), s-mr);
+memory_region_destroy(s-mr);
+s-mr = NULL;
+}
+
+void dimm_config_create(char *id, uint64_t size, uint64_t node, uint32_t
+dimm_idx, uint32_t populated)
+{
+DimmConfig *dimm_cfg;
+dimm_cfg = (DimmConfig*) g_malloc0(sizeof(DimmConfig));
+dimm_cfg-name = id;

[Qemu-devel] [RFC PATCH v3 18/19] Implement _PS3 for dimm

2012-09-21 Thread Vasilis Liaskovitis

This will allow us to update dimm state on OSPM-initiated eject operations e.g.
with echo 1  /sys/bus/acpi/devices/PNP0C80\:00/eject

Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com
---
 docs/specs/acpi_hotplug.txt |7 +++
 hw/acpi_piix4.c |5 +
 hw/dimm.c   |3 +++
 hw/dimm.h   |3 ++-
 4 files changed, 17 insertions(+), 1 deletions(-)

diff --git a/docs/specs/acpi_hotplug.txt b/docs/specs/acpi_hotplug.txt
index 536da16..69868fe 100644
--- a/docs/specs/acpi_hotplug.txt
+++ b/docs/specs/acpi_hotplug.txt
@@ -45,3 +45,10 @@ insertion failed.
 Written by ACPI memory device _OST method to notify qemu of failed
 hot-add.  Write-only.
 
+Memory Dimm _PS3 power-off initiated by OSPM (IO port 0xafa4, 1-byte access):
+---
+Dimm hot-add _PS3 initiated by OSPM. Byte value indicates Dimm slot which
+entered D3 state.
+
+Written by ACPI memory device _PS3 method to notify qemu of power-off state for
+the dimm.  Write-only.
diff --git a/hw/acpi_piix4.c b/hw/acpi_piix4.c
index 8bf58a6..aad78ca 100644
--- a/hw/acpi_piix4.c
+++ b/hw/acpi_piix4.c
@@ -52,6 +52,7 @@
 #define MEM_OST_REMOVE_FAIL 0xafa1
 #define MEM_OST_ADD_SUCCESS 0xafa2
 #define MEM_OST_ADD_FAIL 0xafa3
+#define MEM_PS3 0xafa4
 
 #define PIIX4_MEM_HOTPLUG_STATUS 8
 #define PIIX4_PCI_HOTPLUG_STATUS 2
@@ -545,6 +546,9 @@ static void gpe_writeb(void *opaque, uint32_t addr, 
uint32_t val)
 case MEM_OST_ADD_FAIL:
 dimm_notify(val, DIMM_ADD_FAIL);
 break;
+case MEM_PS3:
+dimm_notify(val, DIMM_OSPM_POWEROFF);
+break;
 default:
 acpi_gpe_ioport_writeb(s-ar, addr, val);
 }
@@ -621,6 +625,7 @@ static void piix4_acpi_system_hot_add_init(PCIBus *bus, 
PIIX4PMState *s)
 register_ioport_write(MEM_OST_REMOVE_FAIL, 1, 1,  gpe_writeb, s);
 register_ioport_write(MEM_OST_ADD_SUCCESS, 1, 1,  gpe_writeb, s);
 register_ioport_write(MEM_OST_ADD_FAIL, 1, 1,  gpe_writeb, s);
+register_ioport_write(MEM_PS3, 1, 1,  gpe_writeb, s);
 
 for(i = 0; i  DIMM_BITMAP_BYTES; i++) {
 s-gperegs.mems_sts[i] = 0;
diff --git a/hw/dimm.c b/hw/dimm.c
index b993668..08f66d5 100644
--- a/hw/dimm.c
+++ b/hw/dimm.c
@@ -319,6 +319,9 @@ void dimm_notify(uint32_t idx, uint32_t event)
 qdev_simple_unplug_cb((DeviceState*)s);
 QTAILQ_INSERT_TAIL(bus-dimm_hp_result_queue, result, next);
 break;
+case DIMM_OSPM_POWEROFF:
+if (bus-dimm_revert)
+bus-dimm_revert(bus-dimm_hotplug_qdev, s, 1);
 default:
 g_free(result);
 break;
diff --git a/hw/dimm.h b/hw/dimm.h
index ce091fe..8d73b8f 100644
--- a/hw/dimm.h
+++ b/hw/dimm.h
@@ -15,7 +15,8 @@ typedef enum {
 DIMM_REMOVE_SUCCESS = 0,
 DIMM_REMOVE_FAIL = 1,
 DIMM_ADD_SUCCESS = 2,
-DIMM_ADD_FAIL = 3
+DIMM_ADD_FAIL = 3,
+DIMM_OSPM_POWEROFF = 4
 } dimm_hp_result_code;
 
 typedef enum {
-- 
1.7.9

[Qemu-devel] [RFC PATCH v3 11/19] Implement qmp and hmp commands for notification lists

2012-09-21 Thread Vasilis Liaskovitis

Guest can respond to ACPI hotplug events e.g. with _EJ or _OST method.
This patch implements a tail queue to store guest notifications for memory
hot-add and hot-remove requests.

Guest responses for memory hotplug command on a per-dimm basis can be detected
with the new hmp command info memhp or the new qmp command query-memhp
Examples:

(qemu) device_add dimm,id=ram0
(qemu) info memory-hotplug
dimm: ram0 hot-add success
or
dimm: ram0 hot-add failure

(qemu) device_del ram3
(qemu) info memory-hotplug
dimm: ram3 hot-remove success
or
dimm: ram3 hot-remove failure

Results are removed from the queue once read.

This patch only queues _EJ events that signal hot-remove success.
For  _OST event queuing, which cover the hot-remove failure and
hot-add success/failure cases, the _OST patches in this series are  are also
needed.

These notification items should probably be part of migration state (not yet
implemented).

Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com
---
 hmp-commands.hx  |2 +
 hmp.c|   17 ++
 hmp.h|1 +
 hw/dimm.c|   62 +-
 hw/dimm.h|2 +-
 monitor.c|7 ++
 qapi-schema.json |   26 ++
 qmp-commands.hx  |   37 
 8 files changed, 152 insertions(+), 2 deletions(-)

diff --git a/hmp-commands.hx b/hmp-commands.hx
index ed67e99..cfb1b67 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -1462,6 +1462,8 @@ show device tree
 show qdev device model list
 @item info roms
 show roms
+@item info memory-hotplug
+show memory-hotplug
 @end table
 ETEXI
 
diff --git a/hmp.c b/hmp.c
index ba6fbd3..4b3d63d 100644
--- a/hmp.c
+++ b/hmp.c
@@ -1168,3 +1168,20 @@ void hmp_screen_dump(Monitor *mon, const QDict *qdict)
 qmp_screendump(filename, err);
 hmp_handle_error(mon, err);
 }
+
+void hmp_info_memory_hotplug(Monitor *mon)
+{
+MemHpInfoList *info;
+MemHpInfoList *item;
+MemHpInfo *dimm;
+
+info = qmp_query_memory_hotplug(NULL);
+for (item = info; item; item = item-next) {
+dimm = item-value;
+monitor_printf(mon, dimm: %s %s %s\n, dimm-dimm,
+dimm-request, dimm-result);
+dimm-dimm = NULL;
+}
+
+qapi_free_MemHpInfoList(info);
+}
diff --git a/hmp.h b/hmp.h
index 48b9c59..986705a 100644
--- a/hmp.h
+++ b/hmp.h
@@ -73,5 +73,6 @@ void hmp_getfd(Monitor *mon, const QDict *qdict);
 void hmp_closefd(Monitor *mon, const QDict *qdict);
 void hmp_send_key(Monitor *mon, const QDict *qdict);
 void hmp_screen_dump(Monitor *mon, const QDict *qdict);
+void hmp_info_memory_hotplug(Monitor *mon);
 
 #endif
diff --git a/hw/dimm.c b/hw/dimm.c
index 288b997..fbd93a8 100644
--- a/hw/dimm.c
+++ b/hw/dimm.c
@@ -65,6 +65,7 @@ static void dimm_bus_initfn(Object *obj)
 DimmBus *bus = DIMM_BUS(obj);
 QTAILQ_INIT(bus-dimmconfig_list);
 QTAILQ_INIT(bus-dimmlist);
+QTAILQ_INIT(bus-dimm_hp_result_queue);
 
 QTAILQ_FOREACH_SAFE(dimm_cfg, dimmconfig_list, nextdimmcfg, 
next_dimm_cfg) {
 QTAILQ_REMOVE(dimmconfig_list, dimm_cfg, nextdimmcfg);
@@ -236,20 +237,78 @@ void dimm_notify(uint32_t idx, uint32_t event)
 {
 DimmBus *bus = main_memory_bus;
 DimmDevice *s;
+DimmConfig *slotcfg;
+struct dimm_hp_result *result;
+
 s = dimm_find_from_idx(idx);
 assert(s != NULL);
+result = g_malloc0(sizeof(*result));
+slotcfg = dimmcfg_find_from_name(DEVICE(s)-id);
+result-dimmname = slotcfg-name;
 
 switch(event) {
 case DIMM_REMOVE_SUCCESS:
 dimm_depopulate(s);
-qdev_simple_unplug_cb((DeviceState*)s);
 QTAILQ_REMOVE(bus-dimmlist, s, nextdimm);
+qdev_simple_unplug_cb((DeviceState*)s);
+QTAILQ_INSERT_TAIL(bus-dimm_hp_result_queue, result, next);
 break;
 default:
+g_free(result);
 break;
 }
 }
 
+MemHpInfoList *qmp_query_memory_hotplug(Error **errp)
+{
+DimmBus *bus = main_memory_bus;
+MemHpInfoList *head = NULL, *cur_item = NULL, *info;
+struct dimm_hp_result *item, *nextitem;
+
+QTAILQ_FOREACH_SAFE(item, bus-dimm_hp_result_queue, next, nextitem) {
+
+info = g_malloc0(sizeof(*info));
+info-value = g_malloc0(sizeof(*info-value));
+info-value-dimm = g_malloc0(sizeof(char) * 32);
+info-value-request = g_malloc0(sizeof(char) * 16);
+info-value-result = g_malloc0(sizeof(char) * 16);
+switch (item-ret) {
+case DIMM_REMOVE_SUCCESS:
+strcpy(info-value-request, hot-remove);
+strcpy(info-value-result, success);
+break;
+case DIMM_REMOVE_FAIL:
+strcpy(info-value-request, hot-remove);
+strcpy(info-value-result, failure);
+break;
+case DIMM_ADD_SUCCESS:
+strcpy(info-value-request, hot-add);
+

Re: [Qemu-devel] [PATCH] usb-redir: Allow to attach USB 2.0 devices to 1.1 host controller

2012-09-21 Thread Jan Kiszka

On 2012-09-18 11:41, Hans de Goede wrote:
 Hi,
 
 On 09/17/2012 06:22 PM, Jan Kiszka wrote:
 If that does not work, add the debug parameter to the usb-redir device,
 set it
 to 4, collect logs of trying to redirect the device and send me the logs
 please, ie:
 -device usb-redir,chardev=usbredirchardev1,id=usbredirdev1,debug=4

 Also be aware that usb-redir relies on chardev flowcontrol working,
 which it does not upstream! See for example here for the chardev flow
 control patch set which RHEL / Fedora carry:
 http://cgit.freedesktop.org/~jwrdegoede/qemu/log/?h=qemu-kvm-1.2-usbredirofs=50


 And then the first 13 patches after: Merge tag 'v1.2.0'

 Oh, and also, if you're running qemu git master, make sure you've:
 http://cgit.freedesktop.org/~jwrdegoede/qemu/commit/?id=81e34f5973d8d6a1ef998a50c4a4bf66abb3b56b


 I used qemu-kvm-1.2-usbredir^ (the last commit is apparently broken -
 copypaste bug?).
 
 Yeah, that has been fixed now.
 
 I'm getting this right after typing cat /dev/ACM0 in
 the guest. It's an endless stream, and so is the output in the guest
 although there should be nothing to dump (that's the proper behaviour on
 the host).
 
 Hmm, can you try commenting out line 1608 of hw/usb/redirect.c:
  usb_ep-pipeline = true;
 
 And see if that helps. If it does not help, please bump the debug level to 5
 (this will also make it log packet contents), and then generate another log, 
 and
 then it is time to dive into the ACM protocol to see what is happening...

As it looks like now, I was just using the wrong test on the guest side.
Retried this morning briefly with a terminal program, and it was all
fine, even when forwarding from host-ehci to guest-uhci (with my broken
patch), even when using current QEMU git head. Sorry for the noise

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux

[Qemu-devel] [RFC PATCH v3 01/19][SeaBIOS] Add ACPI_EXTRACT_DEVICE* macros

2012-09-21 Thread Vasilis Liaskovitis

This allows to extract the beginning, end and name of a Device object.

Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com
---
 tools/acpi_extract.py |   28 
 1 files changed, 28 insertions(+), 0 deletions(-)

diff --git a/tools/acpi_extract.py b/tools/acpi_extract.py
index 167a322..cb2540e 100755
--- a/tools/acpi_extract.py
+++ b/tools/acpi_extract.py
@@ -195,6 +195,28 @@ def aml_package_start(offset):
 offset += 1
 return offset + aml_pkglen_bytes(offset) + 1
 
+def aml_device_start(offset):
+#0x5B 0x82 DeviceOp PkgLength NameString ProcID
+if ((aml[offset] != 0x5B) or (aml[offset + 1] != 0x82)):
+die( Name offset 0x%x: expected 0x5B 0x83 actual 0x%x 0x%x %
+ (offset, aml[offset], aml[offset + 1]));
+return offset
+
+def aml_device_string(offset):
+#0x5B 0x82 DeviceOp PkgLength NameString ProcID
+start = aml_device_start(offset)
+offset += 2
+pkglenbytes = aml_pkglen_bytes(offset)
+offset += pkglenbytes
+return offset
+
+def aml_device_end(offset):
+start = aml_device_start(offset)
+offset += 2
+pkglenbytes = aml_pkglen_bytes(offset)
+pkglen = aml_pkglen(offset)
+return offset + pkglen
+
 lineno = 0
 for line in fileinput.input():
 # Strip trailing newline
@@ -279,6 +301,12 @@ for i in range(len(asl)):
 offset = aml_processor_end(offset)
 elif (directive == ACPI_EXTRACT_PKG_START):
 offset = aml_package_start(offset)
+elif (directive == ACPI_EXTRACT_DEVICE_START):
+offset = aml_device_start(offset)
+elif (directive == ACPI_EXTRACT_DEVICE_STRING):
+offset = aml_device_string(offset)
+elif (directive == ACPI_EXTRACT_DEVICE_END):
+offset = aml_device_end(offset)
 else:
 die(Unsupported directive %s % directive)
 
-- 
1.7.9

[Qemu-devel] [PATCH 0/9] Reduce compaction scanning and lock contention

2012-09-21 Thread Mel Gorman

Hi Andrew,

Richard Davies and Shaohua Li have both reported lock contention
problems in compaction on the zone and LRU locks as well as
significant amounts of time being spent in compaction. This series
aims to reduce lock contention and scanning rates to reduce that CPU
usage. Richard reported at https://lkml.org/lkml/2012/9/21/91 that
this series made a big different to a problem he reported in August
(http://marc.info/?l=kvmm=134511507015614w=2).

Patches 1-3 reverts existing patches in Andrew's tree that get replaced
later in the series.

Patch 4 is a fix for c67fe375 (mm: compaction: Abort async compaction if
locks are contended or taking too long) to properly abort in all
cases when contention is detected.

Patch 5 defers acquiring the zone-lru_lock as long as possible.

Patch 6 defers acquiring the zone-lock as lock as possible.

Patch 7 reverts Rik's skip-free patches as the core concept gets
reimplemented later and the remaining patches are easier to
understand if this is reverted first.

Patch 8 adds a pageblock-skip bit to the pageblock flags to cache what
pageblocks should be skipped by the migrate and free scanners.
This drastically reduces the amount of scanning compaction has
to do.

Patch 9 reimplements something similar to Rik's idea except it uses the
pageblock-skip information to decide where the scanners should
restart from and does not need to wrap around.

I tested this on 3.6-rc6 + linux-next/akpm. Kernels tested were

akpm-20120920   3.6-rc6 + linux-next/akpm as of Septeber 20th, 2012
lesslockPatches 1-6
revert  Patches 1-7
cachefail   Patches 1-8
skipuseless Patches 1-9

Stress high-order allocation tests looked ok. Success rates are more or
less the same with the full series applied but there is an expectation that
there is less opportunity to race with other allocation requests if there is
less scanning. The time to complete the tests did not vary that much and are
uninteresting as were the vmstat statistics so I will not present them here.

Using ftrace I recorded how much scanning was done by compaction and got this

3.6.0-rc6 3.6.0-rc6   3.6.0-rc6  3.6.0-rc6 
3.6.0-rc6
akpm-20120920 lockless  revert-v2r2  cachefail 
skipuseless

Total   freescanned 360753976  515414028  565479007   17103281   
18916589 
Total   freeisolated  285242935973694048601 670493 
727840 
Total   freeefficiency0.0079%0.0070%0.0072%0.0392%
0.0385% 
Total   migrate scanned 247728664  822729112 1004645830   17946827   
14118903 
Total   migrate isolated  255532432459373437501 616359 
658616 
Total   migrate efficiency0.0103%0.0039%0.0034%0.0343%
0.0466% 

The efficiency is worthless because of the nature of the test and the
number of failures.  The really interesting point as far as this patch
series is concerned is the number of pages scanned. Note that reverting
Rik's patches massively increases the number of pages scanned indicating
that those patches really did make a difference to CPU usage.

However, caching what pageblocks should be skipped has a much higher
impact. With patches 1-8 applied, free page and migrate page scanning are
both reduced by 95% in comparison to the akpm kernel.  If the basic concept
of Rik's patches are implemened on top then scanning then the free scanner
barely changed but migrate scanning was further reduced. That said, tests
on 3.6-rc5 indicated that the last patch had greater impact than what was
measured here so it is a bit variable.

One way or the other, this series has a large impact on the amount of
scanning compaction does when there is a storm of THP allocations.

 include/linux/mmzone.h  |5 +-
 include/linux/pageblock-flags.h |   19 +-
 mm/compaction.c |  397 +--
 mm/internal.h   |   11 +-
 mm/page_alloc.c |6 +-
 5 files changed, 280 insertions(+), 158 deletions(-)

-- 
1.7.9.2

[Qemu-devel] [RFC PATCH v3 09/19] pc: Add dimm paravirt SRAT info

2012-09-21 Thread Vasilis Liaskovitis

The numa_fw_cfg paravirt interface is extended to include SRAT information for
all hotplug-able dimms. There are 3 words for each hotplug-able memory slot,
denoting start address, size and node proximity. The new info is appended after
existing numa info, so that the fw_cfg layout does not break.  This information
is used by Seabios to build hotplug memory device objects at runtime.
nb_numa_nodes is set to 1 by default (not 0), so that we always pass srat info
to SeaBIOS.

v1-v2:
Dimm SRAT info (#dimms) is appended at end of existing numa fw_cfg in order not
to break existing layout
Documentation of the new fwcfg layout is included in docs/specs/fwcfg.txt

Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com
---
 docs/specs/fwcfg.txt |   28 
 hw/pc.c  |   14 --
 2 files changed, 40 insertions(+), 2 deletions(-)
 create mode 100644 docs/specs/fwcfg.txt

diff --git a/docs/specs/fwcfg.txt b/docs/specs/fwcfg.txt
new file mode 100644
index 000..55f96d9
--- /dev/null
+++ b/docs/specs/fwcfg.txt
@@ -0,0 +1,28 @@
+QEMU-BIOS Paravirt Documentation
+--
+
+This document describes paravirt data structures passed from QEMU to BIOS.
+
+FW_CFG_NUMA paravirt info
+
+The SRAT info passed from QEMU to BIOS has the following layout:
+
+---
+#nodes | cpu0_pxm | cpu1_pxm | ... | cpulast_pxm | node0_mem | node1_mem | ... 
| nodelast_mem
+
+---
+#dimms | dimm0_start | dimm0_sz | dimm0_pxm | ... | dimmlast_start | 
dimmlast_sz | dimmlast_pxm
+
+Entry 0 contains the number of numa nodes (nb_numa_nodes).
+
+Entries 1..max_cpus: The next max_cpus entries describe node proximity for each
+one of the vCPUs in the system.
+
+Entries max_cpus+1..max_cpus+nb_numa_nodes+1:  The next nb_numa_nodes entries
+describe the memory size for each one of the NUMA nodes in the system.
+
+Entry max_cpus+nb_numa_nodes+1 contains the number of memory dimms 
(nb_hp_dimms)
+
+The last 3 * nb_hp_dimms entries are organized in triplets: Each triplet 
contains
+the physical address offset, size (in bytes), and node proximity for the
+respective dimm.
diff --git a/hw/pc.c b/hw/pc.c
index 2c9664d..f2604ae 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -598,6 +598,7 @@ static void *bochs_bios_init(void)
 uint8_t *smbios_table;
 size_t smbios_len;
 uint64_t *numa_fw_cfg;
+uint64_t *hp_dimms_fw_cfg;
 int i, j;
 
 register_ioport_write(0x400, 1, 2, bochs_bios_write, NULL);
@@ -632,8 +633,10 @@ static void *bochs_bios_init(void)
 /* allocate memory for the NUMA channel: one (64bit) word for the number
  * of nodes, one word for each VCPU-node and one word for each node to
  * hold the amount of memory.
+ * Finally one word for the number of hotplug memory slots and three words
+ * for each hotplug memory slot (start address, size and node proximity).
  */
-numa_fw_cfg = g_malloc0((1 + max_cpus + nb_numa_nodes) * 8);
+numa_fw_cfg = g_malloc0((2 + max_cpus + nb_numa_nodes + 3 * nb_hp_dimms) * 
8);
 numa_fw_cfg[0] = cpu_to_le64(nb_numa_nodes);
 for (i = 0; i  max_cpus; i++) {
 for (j = 0; j  nb_numa_nodes; j++) {
@@ -646,8 +649,15 @@ static void *bochs_bios_init(void)
 for (i = 0; i  nb_numa_nodes; i++) {
 numa_fw_cfg[max_cpus + 1 + i] = cpu_to_le64(node_mem[i]);
 }
+
+numa_fw_cfg[1 + max_cpus + nb_numa_nodes] = cpu_to_le64(nb_hp_dimms);
+
+hp_dimms_fw_cfg = numa_fw_cfg + 2 + max_cpus + nb_numa_nodes;
+if (nb_hp_dimms)
+setup_fwcfg_hp_dimms(hp_dimms_fw_cfg);
+
 fw_cfg_add_bytes(fw_cfg, FW_CFG_NUMA, (uint8_t *)numa_fw_cfg,
- (1 + max_cpus + nb_numa_nodes) * 8);
+ (2 + max_cpus + nb_numa_nodes + 3 * nb_hp_dimms) * 8);
 
 return fw_cfg;
 }
-- 
1.7.9

[Qemu-devel] [PATCH 7/9] Revert mm: have order 0 compaction start off where it left

2012-09-21 Thread Mel Gorman

This reverts commit 7db8889a (mm: have order  0 compaction start off
where it left) and commit de74f1cc (mm: have order  0 compaction start
near a pageblock with free pages). These patches were a good idea and
tests confirmed that they massively reduced the amount of scanning but
the implementation is complex and tricky to understand. A later patch
will cache what pageblocks should be skipped and reimplements the
concept of compact_cached_free_pfn on top for both migration and
free scanners.

Signed-off-by: Mel Gorman mgor...@suse.de
Acked-by: Rik van Riel r...@redhat.com
---
 include/linux/mmzone.h |4 ---
 mm/compaction.c|   65 
 mm/internal.h  |6 -
 mm/page_alloc.c|5 
 4 files changed, 5 insertions(+), 75 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 2daa54f..603d0b5 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -368,10 +368,6 @@ struct zone {
 */
spinlock_t  lock;
int all_unreclaimable; /* All pages pinned */
-#if defined CONFIG_COMPACTION || defined CONFIG_CMA
-   /* pfn where the last incremental compaction isolated free pages */
-   unsigned long   compact_cached_free_pfn;
-#endif
 #ifdef CONFIG_MEMORY_HOTPLUG
/* see spanned/present_pages for more description */
seqlock_t   span_seqlock;
diff --git a/mm/compaction.c b/mm/compaction.c
index 8e56594..9fc1b61 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -539,20 +539,6 @@ next_pageblock:
 #endif /* CONFIG_COMPACTION || CONFIG_CMA */
 #ifdef CONFIG_COMPACTION
 /*
- * Returns the start pfn of the last page block in a zone.  This is the 
starting
- * point for full compaction of a zone.  Compaction searches for free pages 
from
- * the end of each zone, while isolate_freepages_block scans forward inside 
each
- * page block.
- */
-static unsigned long start_free_pfn(struct zone *zone)
-{
-   unsigned long free_pfn;
-   free_pfn = zone-zone_start_pfn + zone-spanned_pages;
-   free_pfn = ~(pageblock_nr_pages-1);
-   return free_pfn;
-}
-
-/*
  * Based on information in the current compact_control, find blocks
  * suitable for isolating free pages from and then isolate them.
  */
@@ -620,19 +606,8 @@ static void isolate_freepages(struct zone *zone,
 * looking for free pages, the search will restart here as
 * page migration may have returned some pages to the allocator
 */
-   if (isolated) {
+   if (isolated)
high_pfn = max(high_pfn, pfn);
-
-   /*
-* If the free scanner has wrapped, update
-* compact_cached_free_pfn to point to the highest
-* pageblock with free pages. This reduces excessive
-* scanning of full pageblocks near the end of the
-* zone
-*/
-   if (cc-order  0  cc-wrapped)
-   zone-compact_cached_free_pfn = high_pfn;
-   }
}
 
/* split_free_page does not map the pages */
@@ -640,11 +615,6 @@ static void isolate_freepages(struct zone *zone,
 
cc-free_pfn = high_pfn;
cc-nr_freepages = nr_freepages;
-
-   /* If compact_cached_free_pfn is reset then set it now */
-   if (cc-order  0  !cc-wrapped 
-   zone-compact_cached_free_pfn == start_free_pfn(zone))
-   zone-compact_cached_free_pfn = high_pfn;
 }
 
 /*
@@ -739,26 +709,8 @@ static int compact_finished(struct zone *zone,
if (fatal_signal_pending(current))
return COMPACT_PARTIAL;
 
-   /*
-* A full (order == -1) compaction run starts at the beginning and
-* end of a zone; it completes when the migrate and free scanner meet.
-* A partial (order  0) compaction can start with the free scanner
-* at a random point in the zone, and may have to restart.
-*/
-   if (cc-free_pfn = cc-migrate_pfn) {
-   if (cc-order  0  !cc-wrapped) {
-   /* We started partway through; restart at the end. */
-   unsigned long free_pfn = start_free_pfn(zone);
-   zone-compact_cached_free_pfn = free_pfn;
-   cc-free_pfn = free_pfn;
-   cc-wrapped = 1;
-   return COMPACT_CONTINUE;
-   }
-   return COMPACT_COMPLETE;
-   }
-
-   /* We wrapped around and ended up where we started. */
-   if (cc-wrapped  cc-free_pfn = cc-start_free_pfn)
+   /* Compaction run completes if the migrate and free scanner meet */
+   if (cc-free_pfn = cc-migrate_pfn)
return COMPACT_COMPLETE;
 
/*
@@ -864,15 +816,8 @@ static int

[Qemu-devel] [PATCH 6/9] mm: compaction: Acquire the zone-lock as late as possible

2012-09-21 Thread Mel Gorman

Compactions free scanner acquires the zone-lock when checking for PageBuddy
pages and isolating them. It does this even if there are no PageBuddy pages
in the range.

This patch defers acquiring the zone lock for as long as possible. In the
event there are no free pages in the pageblock then the lock will not be
acquired at all which reduces contention on zone-lock.

Signed-off-by: Mel Gorman mgor...@suse.de
Acked-by: Rik van Riel r...@redhat.com
---
 mm/compaction.c |  141 ++-
 1 file changed, 78 insertions(+), 63 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index a6068ff..8e56594 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -93,6 +93,28 @@ static inline bool compact_trylock_irqsave(spinlock_t *lock,
return compact_checklock_irqsave(lock, flags, false, cc);
 }
 
+/* Returns true if the page is within a block suitable for migration to */
+static bool suitable_migration_target(struct page *page)
+{
+
+   int migratetype = get_pageblock_migratetype(page);
+
+   /* Don't interfere with memory hot-remove or the min_free_kbytes blocks 
*/
+   if (migratetype == MIGRATE_ISOLATE || migratetype == MIGRATE_RESERVE)
+   return false;
+
+   /* If the page is a large free page, then allow migration */
+   if (PageBuddy(page)  page_order(page) = pageblock_order)
+   return true;
+
+   /* If the block is MIGRATE_MOVABLE or MIGRATE_CMA, allow migration */
+   if (migrate_async_suitable(migratetype))
+   return true;
+
+   /* Otherwise skip the block */
+   return false;
+}
+
 static void compact_capture_page(struct compact_control *cc)
 {
unsigned long flags;
@@ -153,13 +175,16 @@ static void compact_capture_page(struct compact_control 
*cc)
  * pages inside of the pageblock (even though it may still end up isolating
  * some pages).
  */
-static unsigned long isolate_freepages_block(unsigned long blockpfn,
+static unsigned long isolate_freepages_block(struct compact_control *cc,
+   unsigned long blockpfn,
unsigned long end_pfn,
struct list_head *freelist,
bool strict)
 {
int nr_scanned = 0, total_isolated = 0;
struct page *cursor;
+   unsigned long flags;
+   bool locked = false;
 
cursor = pfn_to_page(blockpfn);
 
@@ -168,23 +193,38 @@ static unsigned long isolate_freepages_block(unsigned 
long blockpfn,
int isolated, i;
struct page *page = cursor;
 
-   if (!pfn_valid_within(blockpfn)) {
-   if (strict)
-   return 0;
-   continue;
-   }
+   if (!pfn_valid_within(blockpfn))
+   goto strict_check;
nr_scanned++;
 
-   if (!PageBuddy(page)) {
-   if (strict)
-   return 0;
-   continue;
-   }
+   if (!PageBuddy(page))
+   goto strict_check;
+
+   /*
+* The zone lock must be held to isolate freepages. This
+* unfortunately this is a very coarse lock and can be
+* heavily contended if there are parallel allocations
+* or parallel compactions. For async compaction do not
+* spin on the lock and we acquire the lock as late as
+* possible.
+*/
+   locked = compact_checklock_irqsave(cc-zone-lock, flags,
+   locked, cc);
+   if (!locked)
+   break;
+
+   /* Recheck this is a suitable migration target under lock */
+   if (!strict  !suitable_migration_target(page))
+   break;
+
+   /* Recheck this is a buddy page under lock */
+   if (!PageBuddy(page))
+   goto strict_check;
 
/* Found a free page, break it into order-0 pages */
isolated = split_free_page(page);
if (!isolated  strict)
-   return 0;
+   goto strict_check;
total_isolated += isolated;
for (i = 0; i  isolated; i++) {
list_add(page-lru, freelist);
@@ -196,9 +236,23 @@ static unsigned long isolate_freepages_block(unsigned long 
blockpfn,
blockpfn += isolated - 1;
cursor += isolated - 1;
}
+
+   continue;
+
+strict_check:
+   /* Abort isolation if the caller requested strict isolation */
+   if (strict) {
+   total_isolated = 0;
+   goto out;
+   }
}

[Qemu-devel] [RFC PATCH v3 03/19][SeaBIOS] acpi-dsdt: Implement functions for memory hotplug

2012-09-21 Thread Vasilis Liaskovitis

Extend the DSDT to include methods for handling memory hot-add and hot-remove
notifications and memory device status requests. These functions are called
from the memory device SSDT methods.

Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com
---
 src/acpi-dsdt.dsl |   70 +++-
 1 files changed, 68 insertions(+), 2 deletions(-)

diff --git a/src/acpi-dsdt.dsl b/src/acpi-dsdt.dsl
index 2060686..5d3e92b 100644
--- a/src/acpi-dsdt.dsl
+++ b/src/acpi-dsdt.dsl
@@ -737,6 +737,71 @@ DefinitionBlock (
 }
 Return(One)
 }
+/* Objects filled in by run-time generated SSDT */
+External(MTFY, MethodObj)
+External(MEON, PkgObj)
+
+Method (CMST, 1, NotSerialized) {
+// _STA method - return ON status of memdevice
+// Local0 = MEON flag for this cpu
+Store(DerefOf(Index(MEON, Arg0)), Local0)
+If (Local0) { Return(0xF) } Else { Return(0x0) }
+}
+
+/* Memory hotplug notify array */
+OperationRegion(MEST, SystemIO, 0xaf80, 32)
+Field (MEST, ByteAcc, NoLock, Preserve)
+{
+MES, 256
+}
+ 
+/* Memory eject byte */
+OperationRegion(MEMJ, SystemIO, 0xafa0, 1)
+Field (MEMJ, ByteAcc, NoLock, Preserve)
+{
+MPE, 8
+}
+
+Method(MESC, 0) {
+// Local5 = active memdevice bitmap
+Store (MES, Local5)
+// Local2 = last read byte from bitmap
+Store (Zero, Local2)
+// Local0 = memory device iterator
+Store (Zero, Local0)
+While (LLess(Local0, SizeOf(MEON))) {
+// Local1 = MEON flag for this memory device
+Store(DerefOf(Index(MEON, Local0)), Local1)
+If (And(Local0, 0x07)) {
+// Shift down previously read bitmap byte
+ShiftRight(Local2, 1, Local2)
+} Else {
+// Read next byte from memdevice bitmap
+Store(DerefOf(Index(Local5, ShiftRight(Local0, 3))), 
Local2)
+}
+// Local3 = active state for this memory device
+Store(And(Local2, 1), Local3)
+
+If (LNotEqual(Local1, Local3)) {
+// State change - update MEON with new state
+Store(Local3, Index(MEON, Local0))
+// Do MEM notify
+If (LEqual(Local3, 1)) {
+MTFY(Local0, 1)
+} Else {
+MTFY(Local0, 3)
+}
+}
+Increment(Local0)
+}
+Return(One)
+}
+
+Method (MPEJ, 2, NotSerialized) {
+// _EJ0 method - eject callback
+Store(Arg0, MPE)
+Sleep(200)
+}
 }
 
 
@@ -759,8 +824,9 @@ DefinitionBlock (
 // CPU hotplug event
 Return(\_SB.PRSC())
 }
-Method(_L03) {
-Return(0x01)
+Method(_E03) {
+// Memory hotplug event
+Return(\_SB.MESC())
 }
 Method(_L04) {
 Return(0x01)
-- 
1.7.9

[Qemu-devel] [RFC PATCH v3 12/19] Implement info memory-total and query-memory-total

2012-09-21 Thread Vasilis Liaskovitis

Returns total physical memory available to guest in bytes, including hotplugged
memory. Note that the number reported here may be different from what the guest
sees e.g. if the guest has not logically onlined hotplugged memory.

This functionality is provided independently of a balloon device, since a
guest can be using ACPI memory hotplug without using a balloon device.

Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com
---
 hmp-commands.hx  |2 ++
 hmp.c|7 +++
 hmp.h|1 +
 hw/dimm.c|   21 +
 hw/dimm.h|1 +
 monitor.c|7 +++
 qapi-schema.json |   11 +++
 qmp-commands.hx  |   20 
 8 files changed, 70 insertions(+), 0 deletions(-)

diff --git a/hmp-commands.hx b/hmp-commands.hx
index cfb1b67..988d207 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -1464,6 +1464,8 @@ show qdev device model list
 show roms
 @item info memory-hotplug
 show memory-hotplug
+@item info memory-total
+show memory-total
 @end table
 ETEXI
 
diff --git a/hmp.c b/hmp.c
index 4b3d63d..cc31ddc 100644
--- a/hmp.c
+++ b/hmp.c
@@ -1185,3 +1185,10 @@ void hmp_info_memory_hotplug(Monitor *mon)
 
 qapi_free_MemHpInfoList(info);
 }
+
+void hmp_info_memory_total(Monitor *mon)
+{
+uint64_t ram_total;
+ram_total = (uint64_t)qmp_query_memory_total(NULL);
+monitor_printf(mon, MemTotal: %lu \n, ram_total);
+}
diff --git a/hmp.h b/hmp.h
index 986705a..ab96dba 100644
--- a/hmp.h
+++ b/hmp.h
@@ -74,5 +74,6 @@ void hmp_closefd(Monitor *mon, const QDict *qdict);
 void hmp_send_key(Monitor *mon, const QDict *qdict);
 void hmp_screen_dump(Monitor *mon, const QDict *qdict);
 void hmp_info_memory_hotplug(Monitor *mon);
+void hmp_info_memory_total(Monitor *mon);
 
 #endif
diff --git a/hw/dimm.c b/hw/dimm.c
index fbd93a8..21626f6 100644
--- a/hw/dimm.c
+++ b/hw/dimm.c
@@ -28,6 +28,7 @@ static DimmBus *main_memory_bus;
 /* the following list is used to hold dimm config info before machine
  * initialization. After machine init, the list is emptied and not used 
anymore.*/
 static DimmConfiglist dimmconfig_list = 
QTAILQ_HEAD_INITIALIZER(dimmconfig_list);
+extern ram_addr_t ram_size;
 
 static void dimmbus_dev_print(Monitor *mon, DeviceState *dev, int indent);
 static char *dimmbus_get_fw_dev_path(DeviceState *dev);
@@ -233,6 +234,26 @@ void setup_fwcfg_hp_dimms(uint64_t *fw_cfg_slots)
 }
 }
 
+uint64_t get_hp_memory_total(void)
+{
+DimmBus *bus = main_memory_bus;
+DimmDevice *slot;
+uint64_t info = 0;
+
+QTAILQ_FOREACH(slot, bus-dimmlist, nextdimm) {
+info += slot-size;
+}
+return info;
+}
+
+int64_t qmp_query_memory_total(Error **errp)
+{
+uint64_t info;
+info = ram_size + get_hp_memory_total();
+
+return (int64_t)info;
+}
+
 void dimm_notify(uint32_t idx, uint32_t event)
 {
 DimmBus *bus = main_memory_bus;
diff --git a/hw/dimm.h b/hw/dimm.h
index 95251ba..21225be 100644
--- a/hw/dimm.h
+++ b/hw/dimm.h
@@ -86,5 +86,6 @@ int dimm_add(char *id);
 void main_memory_bus_create(Object *parent);
 void dimm_config_create(char *id, uint64_t size, uint64_t node,
 uint32_t dimm_idx, uint32_t populated);
+uint64_t get_hp_memory_total(void);
 
 #endif
diff --git a/monitor.c b/monitor.c
index be9a1d9..4f5ea60 100644
--- a/monitor.c
+++ b/monitor.c
@@ -2747,6 +2747,13 @@ static mon_cmd_t info_cmds[] = {
 .mhandler.info = hmp_info_memory_hotplug,
 },
 {
+.name   = memory-total,
+.args_type  = ,
+.params = ,
+.help   = show total memory size,
+.mhandler.info = hmp_info_memory_total,
+},
+{
 .name   = NULL,
 },
 };
diff --git a/qapi-schema.json b/qapi-schema.json
index 3706a2a..c1d2571 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -2581,3 +2581,14 @@
 # Since: 1.3
 ##
 { 'command': 'query-memory-hotplug', 'returns': ['MemHpInfo'] }
+
+##
+# @query-memory-total:
+#
+# Returns total memory in bytes, including hotplugged dimms
+#
+# Returns: int
+#
+# Since: 1.3
+##
+{ 'command': 'query-memory-total', 'returns': 'int' }
diff --git a/qmp-commands.hx b/qmp-commands.hx
index e50dcc2..20b7eea 100644
--- a/qmp-commands.hx
+++ b/qmp-commands.hx
@@ -2576,3 +2576,23 @@ Example:
}
 
 EQMP
+
+{
+.name   = query-memory-total,
+.args_type  = ,
+.mhandler.cmd_new = qmp_marshal_input_query_memory_total
+},
+SQMP
+query-memory-total
+--
+
+Return total memory in bytes, including hotplugged dimms
+
+Example:
+
+- { execute: query-memory-total }
+- {
+  return: 1073741824
+   }
+
+EQMP
-- 
1.7.9

[Qemu-devel] [RFC PATCH v3 20/19][SeaBIOS] alternative: Use paravirt interface for pci windows

2012-09-21 Thread Vasilis Liaskovitis

Initialize the 32-bit and 64-bit pci starting offsets from values passed in by
the qemu paravirt interface QEMU_CFG_PCI_WINDOW. Qemu calculates the starting
offsets based on initial memory and hotplug-able dimms.

Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com
---
 src/paravirt.c |6 ++
 src/paravirt.h |2 ++
 src/pciinit.c  |5 ++---
 3 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/src/paravirt.c b/src/paravirt.c
index 2a98d53..390ef30 100644
--- a/src/paravirt.c
+++ b/src/paravirt.c
@@ -346,3 +346,9 @@ void qemu_cfg_romfile_setup(void)
 dprintf(3, Found fw_cfg file: %s (size=%d)\n, file-name, 
file-size);
 }
 }
+
+void qemu_cfg_get_pci_offsets(u64 *pcimem_start, u64 *pcimem64_start)
+{
+qemu_cfg_read_entry(pcimem_start, QEMU_CFG_PCI_WINDOW, sizeof(u64));
+qemu_cfg_read((u8*)(pcimem64_start), sizeof(u64));
+}
diff --git a/src/paravirt.h b/src/paravirt.h
index a284c41..b53ff88 100644
--- a/src/paravirt.h
+++ b/src/paravirt.h
@@ -35,6 +35,7 @@ static inline int kvm_para_available(void)
 #define QEMU_CFG_BOOT_MENU  0x0e
 #define QEMU_CFG_MAX_CPUS   0x0f
 #define QEMU_CFG_FILE_DIR   0x19
+#define QEMU_CFG_PCI_WINDOW 0x1a
 #define QEMU_CFG_ARCH_LOCAL 0x8000
 #define QEMU_CFG_ACPI_TABLES(QEMU_CFG_ARCH_LOCAL + 0)
 #define QEMU_CFG_SMBIOS_ENTRIES (QEMU_CFG_ARCH_LOCAL + 1)
@@ -65,5 +66,6 @@ struct e820_reservation {
 u32 qemu_cfg_e820_entries(void);
 void* qemu_cfg_e820_load_next(void *addr);
 void qemu_cfg_romfile_setup(void);
+void qemu_cfg_get_pci_offsets(u64 *pcimem_start, u64 *pcimem64_start);
 
 #endif
diff --git a/src/pciinit.c b/src/pciinit.c
index 68f302a..64468a0 100644
--- a/src/pciinit.c
+++ b/src/pciinit.c
@@ -592,8 +592,7 @@ static void pci_region_map_entries(struct pci_bus *busses, 
struct pci_region *r)
 
 static void pci_bios_map_devices(struct pci_bus *busses)
 {
-pcimem_start = RamSize;
-
+qemu_cfg_get_pci_offsets(pcimem_start, pcimem64_start);
 if (pci_bios_init_root_regions(busses)) {
 struct pci_region r64_mem, r64_pref;
 r64_mem.list = NULL;
@@ -611,7 +610,7 @@ static void pci_bios_map_devices(struct pci_bus *busses)
 u64 align_mem = pci_region_align(r64_mem);
 u64 align_pref = pci_region_align(r64_pref);
 
-r64_mem.base = ALIGN(0x1LL + RamSizeOver4G, align_mem);
+r64_mem.base = ALIGN(pcimem64_start, align_mem);
 r64_pref.base = ALIGN(r64_mem.base + sum_mem, align_pref);
 pcimem64_start = r64_mem.base;
 pcimem64_end = r64_pref.base + sum_pref;
-- 
1.7.9

[Qemu-devel] [RFC PATCH v3 08/19] pc: calculate dimm physical addresses and adjust memory map

2012-09-21 Thread Vasilis Liaskovitis

Dimm physical address offsets are calculated automatically and memory map is
adjusted accordingly. If a DIMM can fit before the PCI_HOLE_START (currently
0xe000), it will be added normally, otherwise its physical address will be
above 4GB.

Also create memory bus on i440fx-pcihost device.

Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com
---
 hw/pc.c  |   41 +
 hw/pc.h  |6 ++
 hw/pc_piix.c |   20 ++--
 vl.c |1 +
 4 files changed, 62 insertions(+), 6 deletions(-)

diff --git a/hw/pc.c b/hw/pc.c
index 112739a..2c9664d 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -52,6 +52,7 @@
 #include arch_init.h
 #include bitmap.h
 #include vga-pci.h
+#include dimm.h
 
 /* output Bochs bios info messages */
 //#define DEBUG_BIOS
@@ -93,6 +94,9 @@ struct e820_table {
 static struct e820_table e820_table;
 struct hpet_fw_config hpet_cfg = {.count = UINT8_MAX};
 
+ram_addr_t below_4g_hp_mem_size = 0;
+ram_addr_t above_4g_hp_mem_size = 0;
+extern target_phys_addr_t ram_hp_offset;
 void gsi_handler(void *opaque, int n, int level)
 {
 GSIState *s = opaque;
@@ -1160,3 +1164,40 @@ void pc_pci_device_init(PCIBus *pci_bus)
 pci_create_simple(pci_bus, -1, lsi53c895a);
 }
 }
+
+
+/* Function to configure memory offsets of hotpluggable dimms */
+
+target_phys_addr_t pc_set_hp_memory_offset(uint64_t size)
+{
+target_phys_addr_t ret;
+
+/* on first call, initialize ram_hp_offset */
+if (!ram_hp_offset) {
+if (ram_size = PCI_HOLE_START ) {
+ram_hp_offset = 0x1LL + (ram_size - PCI_HOLE_START);
+} else {
+ram_hp_offset = ram_size;
+}
+}
+
+if (ram_hp_offset = 0x1LL) {
+ret = ram_hp_offset;
+above_4g_hp_mem_size += size;
+ram_hp_offset += size;
+}
+/* if dimm fits before pci hole, append it normally */
+else if (ram_hp_offset + size = PCI_HOLE_START) {
+ret = ram_hp_offset;
+below_4g_hp_mem_size += size;
+ram_hp_offset += size;
+}
+/* otherwise place it above 4GB */
+else {
+ret = 0x1LL;
+above_4g_hp_mem_size += size;
+ram_hp_offset = 0x1LL + size;
+}
+
+return ret;
+}
diff --git a/hw/pc.h b/hw/pc.h
index e4db071..f3304fc 100644
--- a/hw/pc.h
+++ b/hw/pc.h
@@ -10,6 +10,7 @@
 #include memory.h
 #include ioapic.h
 
+#define PCI_HOLE_START 0xe000
 /* PC-style peripherals (also used by other machines).  */
 
 /* serial.c */
@@ -214,6 +215,11 @@ static inline bool isa_ne2000_init(ISABus *bus, int base, 
int irq, NICInfo *nd)
 /* pc_sysfw.c */
 void pc_system_firmware_init(MemoryRegion *rom_memory);
 
+/* memory hotplug */
+target_phys_addr_t pc_set_hp_memory_offset(uint64_t size);
+extern ram_addr_t below_4g_hp_mem_size;
+extern ram_addr_t above_4g_hp_mem_size;
+
 /* e820 types */
 #define E820_RAM1
 #define E820_RESERVED   2
diff --git a/hw/pc_piix.c b/hw/pc_piix.c
index 88ff041..d1fd276 100644
--- a/hw/pc_piix.c
+++ b/hw/pc_piix.c
@@ -43,6 +43,7 @@
 #include xen.h
 #include memory.h
 #include exec-memory.h
+#include dimm.h
 #ifdef CONFIG_XEN
 #  include xen/hvm/hvm_info_table.h
 #endif
@@ -155,9 +156,9 @@ static void pc_init1(MemoryRegion *system_memory,
 kvmclock_create();
 }
 
-if (ram_size = 0xe000 ) {
-above_4g_mem_size = ram_size - 0xe000;
-below_4g_mem_size = 0xe000;
+if (ram_size = PCI_HOLE_START ) {
+above_4g_mem_size = ram_size - PCI_HOLE_START;
+below_4g_mem_size = PCI_HOLE_START;
 } else {
 above_4g_mem_size = 0;
 below_4g_mem_size = ram_size;
@@ -172,6 +173,9 @@ static void pc_init1(MemoryRegion *system_memory,
 rom_memory = system_memory;
 }
 
+/* adjust memory map for hotplug dimms */
+dimm_calc_offsets(pc_set_hp_memory_offset);
+
 /* allocate ram and load rom/bios */
 if (!xen_enabled()) {
 fw_cfg = pc_memory_init(system_memory,
@@ -192,9 +196,11 @@ static void pc_init1(MemoryRegion *system_memory,
 if (pci_enabled) {
 pci_bus = i440fx_init(i440fx_state, piix3_devfn, isa_bus, gsi,
   system_memory, system_io, ram_size,
-  below_4g_mem_size,
-  0x1ULL - below_4g_mem_size,
-  0x1ULL + above_4g_mem_size,
+  below_4g_mem_size + below_4g_hp_mem_size,
+  0x1ULL - below_4g_mem_size
+- below_4g_hp_mem_size,
+  0x1ULL + above_4g_mem_size
++ above_4g_hp_mem_size,
   (sizeof(target_phys_addr_t) == 4
? 0
: ((uint64_t)1  62)),
@@ -223,6 +229,8 @@ static void pc_init1(MemoryRegion *system_memory,

[Qemu-devel] [RFC PATCH v3 00/19] ACPI memory hotplug

2012-09-21 Thread Vasilis Liaskovitis

This is v3 of the ACPI memory hotplug functionality. Only x86_64 target is 
supported
for now.

Overview:

Dimm device layout is modeled with a new qemu command line 

-dimm id=name,size=sz,node=pxm,populated=on|off

The starting physical address for all dimms is calculated automatically from
top of memory, skipping the pci hole at [PCI_HOLE_START, 4G).
Node is defining numa proximity for this dimm. When not defined it defaults
to zero.
-dimm id=dimm0,size=512M,node=0,populated=off
will define a 512M memory slot belonging to numa node 0.

Dimms are added or removed with normal device_add, device_del operations:
Hot-add syntax: device_add dimm,id=mydimm0
Hot-remove syntax: dimm_del dimm,id=mydimm0

Changes v2-v3

- qdev integration. Dimms are attached to a dimmbus. The dimmbus is a child
  of i440fx device in the pc machine. Hot-add and hot-remove are done with 
normal
  device_add / device_del operations on the dimmbus. New commands dimm_add and
  dimm_del are obsolete. (In previous versions, dimms were always present on 
the
  qdev tree, and dimm_add/del simply meant allocating or deallocating memory for
  the devices. This version actually does hot-operations on the qdev tree)
- Add _PS3 method to allow OSPM-induced hot operations.
- pci-window calculation in Seabios takes dimms into account(for both 32-bit and
  64-bit windows)
- rename new qmp commands: query-memory-total and query-memory-hotplug
- balloon driver can see the hotplugged memory

Changes v1-v2

- memory map is automatically calculated for hotplug dimms. Dimms are added from
top-of-memory skipping the pci hole at [PCI_HOLE_START, 4G).
- Renamed from -memslot to -dimm. Commands changed to dimm_add, 
dimm_del.
- Seabios ejection array reduced to a byte. Use extraction macros for dimm ssdt.
- additional SRAT paravirt info does not break previous SRAT fw_cfg layout.
- Documentation of new acpi_piix4 registers and paravirt data.
- add ACPI _OST support for _OST enabled guests. This allows qemu to receive
notification for success / failure of memory hot-add and hot-remove operations.
Guest needs to support _OST (https://lkml.org/lkml/2012/6/25/321)
- add monitor info command to report total guest memory (initial + hot-added)
- add command line options and monitor commands for batch dimm
creation/population (obsolete from v3 onwards)

Issues:

- A main blocker issue is windows guest functionality. The patchset does not 
work for
windows currently. My guess is the windows pnpmem driver does not like the
seabios dimm device implementation (or the seabios dimm implementation is not
fully ACPI-compliant). If someone can review the seabios patches or has any
ideas to debug this, let me know.

Testing on win2012 server RC or windows2008 consumer prerelease. When adding a
DIMM, the device shows up in DeviceManager but does not work. 
Relevant messages:

 This device cannot start. (Code 10) 
Device configured(memory.inf) (UserPnP eventID 400)
Device installed (memory.inf) ACPI/PNP0C80\2daba3ff1 was configured 
Device not started(PNPMEM) (Kernel-PnP eventID 411, kernelID) 
Device ACPI\PNP0C80\2daba3ff1 had a problem starting Driver Name: memory.inf
(c:\Windows\system32\DRIVERS\pnpmem.sys 6.2.8400 winmain_win8rc))
Memory range:0x8000 - 0x9000 (Initial memory of VM is 2GB. The 
hotplugged DIMM
 was a 256GB with physical address range starting at 2GB ) 
Conflicting device list: No conflicts.  

Adding a 2nd or more dimms causes a crash (PNP_DETECTED_FATAL_ERROR with blue
screen of death) and makes windows reboot. After this, the VM keeps rebooting 
with
ACPI_BIOS_ERROR. The VM refuses to boot anymore once a 2nd (or more) extra dimm 
is
plugged-in.

- Is the dimmbus the correct way to go about integrating into qdev/qom? In a v1
comment, Anthony mentioned attaching dimms directly to an i440fx device as
children. Is this possible without a bus?

- Live migration works as long as the dimm layout (-dimm command line args) are
identical at the source and destination qemu command line. Patch 10/19
creates the DimmDevice that corresponds to the unknown incoming ramblock. 
Ramblocks are migrated before qdev VMStates are migrated (the DimmDevice 
structure
currently does not define a VMStateDescription). So the DimmDevice is handled
diferrently than other devices. If this is not acceptable, any suggestions on
how should it be reworked?

- Hot-operation notification lists need to be added to migration state.

Please review. Could people state which other issues they consider blocker for
including this upstream?

Does this patchset need to wait for 1.4 or could this be considered for 1.3 
(assuming
blockers are resolved)? The patchset has been revised every few months, but
I will provide quicker version updates onwards. I can also bring this up on a 
weekly
meeting agenda if needed. 

series is based on uq/master for qemu-kvm, and master for seabios. Can be found
also at:

http://github.com/vliaskov/qemu-kvm/commits/memhp-v3

Re: [Qemu-devel] [PATCH 01/41] buffered_file: g_realloc() can't fail

2012-09-21 Thread Paolo Bonzini

Il 21/09/2012 10:46, Juan Quintela ha scritto:
 Signed-off-by: Juan Quintela quint...@redhat.com
 ---
  buffered_file.c | 10 +-
  1 file changed, 1 insertion(+), 9 deletions(-)
 
 diff --git a/buffered_file.c b/buffered_file.c
 index f170aa0..4148abb 100644
 --- a/buffered_file.c
 +++ b/buffered_file.c
 @@ -50,20 +50,12 @@ static void buffered_append(QEMUFileBuffered *s,
  const uint8_t *buf, size_t size)
  {
  if (size  (s-buffer_capacity - s-buffer_size)) {
 -void *tmp;
 -
  DPRINTF(increasing buffer capacity from %zu by %zu\n,
  s-buffer_capacity, size + 1024);
 
  s-buffer_capacity += size + 1024;
 
 -tmp = g_realloc(s-buffer, s-buffer_capacity);
 -if (tmp == NULL) {
 -fprintf(stderr, qemu file buffer expansion failed\n);
 -exit(1);
 -}
 -
 -s-buffer = tmp;
 +s-buffer = g_realloc(s-buffer, s-buffer_capacity);
  }
 
  memcpy(s-buffer + s-buffer_size, buf, size);
 

Reviewed-by: Paolo Bonzini pbonz...@redhat.com

Paolo

Re: [Qemu-devel] [PATCH v4 15/19] block: raw-win32 driver reopen support

2012-09-21 Thread Jeff Cody

On 09/21/2012 04:43 AM, Paolo Bonzini wrote:
 Il 21/09/2012 10:33, Kevin Wolf ha scritto:
 +/* could not reopen the file handle, so fall back to opening
 + * new file (CreateFile) */
 +if (raw_s-hfile == INVALID_HANDLE_VALUE) {
 +raw_s-hfile = CreateFile(state-bs-filename, access_flags,
 +  FILE_SHARE_READ, NULL, OPEN_EXISTING,
 +  overlapped, NULL);
 +if (raw_s-hfile == INVALID_HANDLE_VALUE) {
 +/* this could happen because the access_flags requested are
 + * incompatible with the existing share mode of s-hfile,
 + * so our only option now is to close s-hfile, and try again.
 + * This could end badly */
 +CloseHandle(s-hfile);
 How common is this case?

 We do have another option, namely not reopen at all and return an error.
 Of course, this only makes sense if it doesn't mean that we almost never
 succeed.
 
 Probably pretty common since we specify FILE_SHARE_READ for the sharing
 mode, meaning that subsequent open operations on a file or device are
 only able to request read access.
 

Yes, I think this is by far the most common case.


 I would change it to FILE_SHARE_READ|FILE_SHARE_WRITE and remove this code.
 
 Paolo
 

I contemplated doing that, but I wasn't sure if there was any particular
reason it was originally done with FILE_SHARE_READ only in the first
place (security, etc..). I was hesitant to override that behaviour as
the new default under w32.  Do we know if this is acceptable / safe?

1 2 3 4 5 >

1 - 100 of 401 matches

Mail list logo