[Qemu-devel] [RFC PATCH RDMA support v6: 0/7] additional cleanup and consolidation

2013-04-09 Thread mrhines
From: Michael R. Hines mrhi...@us.ibm.com

Several changes since v5:

- Only one new file in the patch now! (migration-rdma.c)
- Smaller number of files touched, fewer prototypes
- Merged files as requested (rdma.c and migration-rdma.c)
- Eliminated header as requested (rdma.h)
- Created new function pointers for hooks in arch_init.c
  to be cleaner and removed all explicit RDMA checks
  to instead use QEMUFileOps

Contents:
=========
* Running
* RDMA Protocol Description
* Versioning
* QEMUFileRDMA Interface
* Migration of pc.ram
* Error handling
* TODO
* Performance

RUNNING:
========

First, decide if you want dynamic page registration on the server-side.
This always happens on the primary-VM side, but is optional on the server.
Doing this allows you to support overcommit (such as cgroups or ballooning)
with a smaller footprint on the server-side without having to register the
entire VM memory footprint. 
NOTE: This significantly slows down RDMA throughput (about 30% slower).

$ virsh qemu-monitor-command --hmp \
--cmd migrate_set_capability chunk_register_destination on # disabled by default

Next, if you decided *not* to use chunked registration on the server,
it is recommended to also disable zero page detection. While this is not
strictly necessary, zero page detection also significantly slows down
performance on higher-throughput links (by about 50%), such as 40 Gbps InfiniBand cards:

$ virsh qemu-monitor-command --hmp \
--cmd migrate_set_capability check_for_zero off # always enabled by default

Next, set the migration speed to match your hardware's capabilities:

$ virsh qemu-monitor-command --hmp \
--cmd migrate_set_speed 40g # or whatever is the MAX of your RDMA device

Finally, perform the actual migration:

$ virsh migrate domain rdma:xx.xx.xx.xx:port

RDMA Protocol Description:
==========================

Migration with RDMA is separated into two parts:

1. The transmission of the pages using RDMA
2. Everything else (a control channel is introduced)

Everything else is transmitted using a formal 
protocol now, consisting of infiniband SEND / RECV messages.

An infiniband SEND message is the standard ibverbs
message used by applications of infiniband hardware.
The only difference between a SEND message and an RDMA
message is that SEND messages cause completion notifications
to be posted to the completion queue (CQ) on the 
infiniband receiver side, whereas RDMA messages (used
for pc.ram) do not (to behave like an actual DMA).

Messages in infiniband require two things:

1. registration of the memory that will be transmitted
2. (SEND/RECV only) work requests to be posted on both
   sides of the network before the actual transmission
   can occur.

RDMA messages are much easier to deal with. Once the memory
on the receiver side is registered and pinned, we're
basically done. All that is required is for the sender
side to start dumping bytes onto the link.
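As a hedged illustration of the bookkeeping behind dynamic "chunk" registration mentioned throughout this series: memory is registered in fixed-size chunks on first touch instead of pinning the whole guest up front. The 1 MB chunk size, the bitmap layout, and all function names below are illustrative assumptions, not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define CHUNK_SHIFT 20                    /* 1 MB chunks (assumed size) */
#define CHUNK_SIZE  (1UL << CHUNK_SHIFT)

/* Map a byte offset within a RAM block to its chunk number. */
static unsigned long chunk_index(uint64_t block_offset)
{
    return (unsigned long)(block_offset >> CHUNK_SHIFT);
}

/* Returns 1 if this chunk still needs a registration round trip. */
static int chunk_needs_register(const uint8_t *registered_bitmap,
                                uint64_t block_offset)
{
    unsigned long idx = chunk_index(block_offset);
    return !(registered_bitmap[idx / 8] & (1u << (idx % 8)));
}

/* Record that the peer has registered (pinned) this chunk. */
static void chunk_mark_registered(uint8_t *registered_bitmap,
                                  uint64_t block_offset)
{
    unsigned long idx = chunk_index(block_offset);
    registered_bitmap[idx / 8] |= 1u << (idx % 8);
}
```

Only the first page touched in a chunk pays the registration cost; subsequent pages in the same chunk find the bit already set.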

SEND messages require more coordination because the
receiver must have reserved space (using a receive
work request) on the receive queue (RQ) before QEMUFileRDMA
can start using them to carry all the bytes as
a transport for migration of device state.

To begin the migration, the initial connection setup is
as follows (migration-rdma.c):

1. Receiver and Sender are started (command line or libvirt)
2. Both sides post two RQ work requests
3. Receiver does listen()
4. Sender does connect()
5. Receiver accept()
6. Check versioning and capabilities (described later)

At this point, we define a control channel on top of SEND messages
which is described by a formal protocol. Each SEND message has a 
header portion and a data portion (but together are transmitted 
as a single SEND message).

Header:
* Length  (of the data portion)
* Type    (what command to perform, described below)
* Version (protocol version validated before send/recv occurs)

The 'type' field has 7 different command values:
1. None
2. Ready (control-channel is available) 
3. QEMU File (for sending non-live device state) 
4. RAM Blocks    (used right after connection setup)
5. Register request  (dynamic chunk registration) 
6. Register result   ('rkey' to be used by sender)
7. Register finished (registration for current iteration finished)
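The header layout and command values above could be sketched as follows. This is a minimal illustration, not the struct from the patch: field names, widths, and the wire layout are assumptions; the only facts carried over are the three header fields and the seven command types.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

/* The seven control-channel command values described above. */
enum {
    RDMA_CONTROL_NONE = 0,
    RDMA_CONTROL_READY,
    RDMA_CONTROL_QEMU_FILE,
    RDMA_CONTROL_RAM_BLOCKS,
    RDMA_CONTROL_REGISTER_REQUEST,
    RDMA_CONTROL_REGISTER_RESULT,
    RDMA_CONTROL_REGISTER_FINISHED,
};

/* Hypothetical header: length of the data portion, command type,
 * and protocol version (validated before send/recv occurs). */
typedef struct {
    uint32_t len;
    uint32_t type;
    uint32_t version;
} RDMAControlHeader;

/* Serialize in network byte order so both sides agree regardless
 * of host endianness. */
static void control_header_pack(const RDMAControlHeader *h, uint8_t *wire)
{
    uint32_t v;
    v = htonl(h->len);     memcpy(wire + 0, &v, 4);
    v = htonl(h->type);    memcpy(wire + 4, &v, 4);
    v = htonl(h->version); memcpy(wire + 8, &v, 4);
}

static void control_header_unpack(RDMAControlHeader *h, const uint8_t *wire)
{
    uint32_t v;
    memcpy(&v, wire + 0, 4); h->len = ntohl(v);
    memcpy(&v, wire + 4, 4); h->type = ntohl(v);
    memcpy(&v, wire + 8, 4); h->version = ntohl(v);
}
```

The header and data portion travel together in a single SEND message; the receiver unpacks the header first to learn how many data bytes follow.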

After connection setup is completed, we have two protocol-level
functions, responsible for communicating control-channel commands
using the above list of values: 

Logically:

qemu_rdma_exchange_recv(header, expected command type)

1. We transmit a READY command to let the sender know that 
   we are *ready* to receive some data bytes on the control channel.
2. Before attempting to receive the expected command, we post another
   RQ work request to replace the one we just used up.
3. Block on a CQ event channel and wait for the SEND to arrive.
4. When the send arrives, librdmacm will unblock us.
5. Verify that the command type and version received match the ones we expected.

[Qemu-devel] [RFC PATCH RDMA support v6: 3/7] Introduce QEMURamControlOps

2013-04-09 Thread mrhines
From: Michael R. Hines mrhi...@us.ibm.com

RDMA requires hooks before and after each iteration round
in order to coordinate the new dynamic page registration support.
This is done now by introducing a new set of function pointers
which are only used by arch_init.c.

Pointers include:
1. save_ram_page (which can be defined by anyone, not just RDMA)
2. hook after each iteration
3. hook before each iteration

The pointers are then installed in savevm.c because they
need visibility into QEMUFile.

Now that we have a proper set of pointers, we no longer need
specific checks anymore to determine whether or not RDMA
is enabled.
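The shape of that change can be sketched as below. This is an illustrative stand-in, not the patch's code: generic migration code calls through optional function pointers, and a transport that needs no hook simply leaves the pointer NULL, so no call site ever tests "is RDMA enabled?".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mirror of the hook style used by QEMURamControlOps. */
typedef void (HookFunc)(void *opaque, int section);

typedef struct {
    HookFunc *before_iterate;  /* may be NULL */
    HookFunc *after_iterate;   /* may be NULL */
} RamHooks;

/* Generic caller: runs the hook only when one is installed.
 * Note the absence of any transport-specific check. */
static void run_before_hook(const RamHooks *ops, void *opaque, int section)
{
    if (ops && ops->before_iterate) {
        ops->before_iterate(opaque, section);
    }
}

/* Example hook that just counts its invocations. */
static void counting_hook(void *opaque, int section)
{
    (void)section;
    (*(int *)opaque)++;
}
```

Installing `counting_hook` in one `RamHooks` instance and leaving another all-NULL exercises both paths through the same generic caller.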

Signed-off-by: Michael R. Hines mrhi...@us.ibm.com
---
 include/migration/migration.h |   52 +
 savevm.c  |  104 +
 2 files changed, 146 insertions(+), 10 deletions(-)

diff --git a/include/migration/migration.h b/include/migration/migration.h
index e2acec6..0287321 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -21,6 +21,7 @@
 #include "qapi/error.h"
 #include "migration/vmstate.h"
 #include "qapi-types.h"
+#include "exec/cpu-common.h"
 
 struct MigrationParams {
 bool blk;
@@ -75,6 +76,10 @@ void fd_start_incoming_migration(const char *path, Error **errp);
 
 void fd_start_outgoing_migration(MigrationState *s, const char *fdname, Error **errp);
 
+void rdma_start_outgoing_migration(void *opaque, const char *host_port, Error **errp);
+
+void rdma_start_incoming_migration(const char *host_port, Error **errp);
+
 void migrate_fd_error(MigrationState *s);
 
 void migrate_fd_connect(MigrationState *s);
@@ -127,4 +132,51 @@ int migrate_use_xbzrle(void);
 int64_t migrate_xbzrle_cache_size(void);
 
 int64_t xbzrle_cache_resize(int64_t new_size);
+
+bool migrate_check_for_zero(void);
+bool migrate_chunk_register_destination(void);
+
+/*
+ * Hooks before and after each iteration round to perform special functions.
+ * In the case of RDMA, this is to handle dynamic server registration.
+ */
+#define RAM_CONTROL_SETUP    0
+#define RAM_CONTROL_ROUND    1
+#define RAM_CONTROL_REGISTER 2
+#define RAM_CONTROL_FINISH   3
+
+typedef void (RAMFunc)(QEMUFile *f, void *opaque, int section);
+
+struct QEMURamControlOps {
+RAMFunc *before_ram_iterate;
+RAMFunc *after_ram_iterate;
+RAMFunc *register_ram_iterate;
+size_t (*save_page)(QEMUFile *f,
+   void *opaque, ram_addr_t block_offset, 
+   ram_addr_t offset, int cont, size_t size, 
+   bool zero);
+};
+
+const QEMURamControlOps *qemu_savevm_get_control(QEMUFile *f);
+
+void ram_control_before_iterate(QEMUFile *f, int section);
+void ram_control_after_iterate(QEMUFile *f, int section);
+void ram_control_register_iterate(QEMUFile *f, int section);
+size_t ram_control_save_page(QEMUFile *f, ram_addr_t block_offset, 
+ram_addr_t offset, int cont, 
+size_t size, bool zero);
+
+#ifdef CONFIG_RDMA
+extern const QEMURamControlOps qemu_rdma_control;
+
+size_t qemu_rdma_save_page(QEMUFile *f, void *opaque,
+   ram_addr_t block_offset, 
+   ram_addr_t offset, int cont, 
+   size_t size, bool zero);
+
+void qemu_rdma_registration_stop(QEMUFile *f, void *opaque, int section);
+void qemu_rdma_registration_handle(QEMUFile *f, void *opaque, int section);
+void qemu_ram_registration_start(QEMUFile *f, void *opaque, int section);
+#endif
+
 #endif
diff --git a/savevm.c b/savevm.c
index b1d8988..26eabb3 100644
--- a/savevm.c
+++ b/savevm.c
@@ -409,16 +409,24 @@ static const QEMUFileOps socket_write_ops = {
 .close =  socket_close
 };
 
-QEMUFile *qemu_fopen_socket(int fd, const char *mode)
+bool qemu_file_mode_is_not_valid(const char *mode)
 {
-QEMUFileSocket *s = g_malloc0(sizeof(QEMUFileSocket));
-
 if (mode == NULL ||
 (mode[0] != 'r' && mode[0] != 'w') ||
 mode[1] != 'b' || mode[2] != 0) {
 fprintf(stderr, "qemu_fopen: Argument validity check failed\n");
-return NULL;
+return true;
 }
+
+return false;
+}
+
+QEMUFile *qemu_fopen_socket(int fd, const char *mode)
+{
+QEMUFileSocket *s = g_malloc0(sizeof(QEMUFileSocket));
+
+if(qemu_file_mode_is_not_valid(mode))
+   return NULL;
 
 s->fd = fd;
 if (mode[0] == 'w') {
@@ -430,16 +438,44 @@ QEMUFile *qemu_fopen_socket(int fd, const char *mode)
 return s->file;
 }
 
+#ifdef CONFIG_RDMA
+const QEMURamControlOps qemu_rdma_write_control = {
+.before_ram_iterate = qemu_ram_registration_start,
+.after_ram_iterate = qemu_rdma_registration_stop,
+.register_ram_iterate = qemu_rdma_registration_handle,
+.save_page = qemu_rdma_save_page, 
+};
+
+const QEMURamControlOps qemu_rdma_read_control = {
+.register_ram_iterate = qemu_rdma_registration_handle,
+};
+
+const QEMUFileOps rdma_read_ops = {
+.get_buffer  = 

[Qemu-devel] [RFC PATCH RDMA support v6: 6/7] send pc.ram over RDMA

2013-04-09 Thread mrhines
From: Michael R. Hines mrhi...@us.ibm.com

All that is left for this part of the patch is:

1. use the new (optionally defined) save_ram_page function pointer
   to decide what to do with the page depending on whether RDMA is
   enabled, and return -ENOTSUP as agreed.
2. invoke hooks from QEMURamControlOps function pointers to hook
   into the RDMA protocol at the right points in order to perform
   dynamic page registration.
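The -ENOTSUP contract in point 1 can be sketched as below. This is an illustrative simplification with hypothetical names, not the arch_init.c code: an optional save-page hook either handles the page (returning the byte count) or returns -ENOTSUP, in which case the caller falls back to the ordinary stream path.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the save_page function pointer. */
typedef long (SavePageFunc)(void *opaque, unsigned long offset, size_t size);

static long save_one_page(SavePageFunc *hook, void *opaque,
                          unsigned long offset, size_t size)
{
    if (hook) {
        long sent = hook(opaque, offset, size);
        if (sent >= 0) {
            return sent;        /* transport (e.g. RDMA) took the page */
        }
        /* sent == -ENOTSUP: hook declined; fall through */
    }
    return (long)size;          /* ordinary path: send the page inline */
}

/* A hook that declines every page, as a non-RDMA transport would. */
static long declining_hook(void *opaque, unsigned long offset, size_t size)
{
    (void)opaque; (void)offset; (void)size;
    return -ENOTSUP;
}
```

Whether the hook is absent or present-but-declining, the caller ends up on the same fallback path, which is why no explicit RDMA check is needed.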

Signed-off-by: Michael R. Hines mrhi...@us.ibm.com
---
 arch_init.c |   45 +++--
 1 file changed, 43 insertions(+), 2 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index 769ce77..a7d5b16 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -115,6 +115,7 @@ const uint32_t arch_type = QEMU_ARCH;
 #define RAM_SAVE_FLAG_EOS  0x10
 #define RAM_SAVE_FLAG_CONTINUE 0x20
 #define RAM_SAVE_FLAG_XBZRLE   0x40
+#define RAM_SAVE_FLAG_REGISTER 0x80 /* perform hook during iteration */
 
 
 static struct defconfig_file {
@@ -170,6 +171,13 @@ static struct {
 .cache = NULL,
 };
 
+#ifdef CONFIG_RDMA
+void qemu_ram_registration_start(QEMUFile *f, void *opaque, int section)
+{
+DPRINTF("start section: %d\n", section);
+qemu_put_be64(f, RAM_SAVE_FLAG_REGISTER);
+}
+#endif
 
 int64_t xbzrle_cache_resize(int64_t new_size)
 {
@@ -447,15 +455,22 @@ static int ram_save_block(QEMUFile *f, bool last_stage)
 ram_bulk_stage = false;
 }
 } else {
+bool zero;
 uint8_t *p;
 int cont = (block == last_sent_block) ?
 RAM_SAVE_FLAG_CONTINUE : 0;
 
 p = memory_region_get_ram_ptr(mr) + offset;
 
+/* use capability now, defaults to true */
+zero = migrate_check_for_zero() ? is_zero_page(p) : false;
+
 /* In doubt sent page as normal */
 bytes_sent = -1;
-if (is_zero_page(p)) {
+if ((bytes_sent = ram_control_save_page(f, block->offset,
+offset, cont, TARGET_PAGE_SIZE, zero)) >= 0) {
+acct_info.norm_pages++;
+} else if (zero) {
 acct_info.dup_pages++;
 if (!ram_bulk_stage) {
 bytes_sent = save_block_hdr(f, block, offset, cont,
@@ -476,7 +491,7 @@ static int ram_save_block(QEMUFile *f, bool last_stage)
 }
 
 /* XBZRLE overflow or normal page */
-if (bytes_sent == -1) {
+if (bytes_sent == -1 || bytes_sent == -ENOTSUP) {
 bytes_sent = save_block_hdr(f, block, offset, cont, 
RAM_SAVE_FLAG_PAGE);
 qemu_put_buffer_async(f, p, TARGET_PAGE_SIZE);
 bytes_sent += TARGET_PAGE_SIZE;
@@ -598,6 +613,18 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
 }
 
 qemu_mutex_unlock_ramlist();
+
+/*
+ * These following calls generate reserved messages for future expansion
+ * of the RDMA protocol. If the ops are not defined, nothing will happen.
+ *
+ * Please leave in place. They are intended to be used to pre-register
+ * memory in the future to mitigate the extremely high cost of dynamic page
+ * registration.
+ */
+ram_control_before_iterate(f, RAM_CONTROL_SETUP);
+ram_control_after_iterate(f, RAM_CONTROL_SETUP);
+
 qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
 
 return 0;
@@ -616,6 +643,8 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
 reset_ram_globals();
 }
 
+ram_control_before_iterate(f, RAM_CONTROL_ROUND);
+
 t0 = qemu_get_clock_ns(rt_clock);
 i = 0;
 while ((ret = qemu_file_rate_limit(f)) == 0) {
@@ -646,6 +675,12 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
 
 qemu_mutex_unlock_ramlist();
 
+/* 
+ * must occur before EOS (or any QEMUFile operation) 
+ * because of RDMA protocol 
+ */
+ram_control_after_iterate(f, RAM_CONTROL_ROUND);
+
 if (ret < 0) {
 bytes_transferred += total_sent;
 return ret;
@@ -663,6 +698,8 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
 qemu_mutex_lock_ramlist();
 migration_bitmap_sync();
 
+ram_control_before_iterate(f, RAM_CONTROL_FINISH);
+
 /* try transferring iterative blocks of memory */
 
 /* flush all remaining blocks regardless of rate limiting */
@@ -676,6 +713,8 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
 }
 bytes_transferred += bytes_sent;
 }
+
+ram_control_after_iterate(f, RAM_CONTROL_FINISH);
 migration_end();
 
 qemu_mutex_unlock_ramlist();
@@ -864,6 +903,8 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
 ret = -EINVAL;
 goto done;
 }
+} else if (flags & RAM_SAVE_FLAG_REGISTER) {
+ram_control_register_iterate(f, RAM_CONTROL_REGISTER);
 }
 error = qemu_file_get_error(f);
 if (error) {
-- 
1.7.10.4




[Qemu-devel] [RFC PATCH RDMA support v6: 7/7] introduce qemu_ram_foreach_block()

2013-04-09 Thread mrhines
From: Michael R. Hines mrhi...@us.ibm.com

This is used during RDMA initialization in order to transmit
a description of all the RAM blocks to the peer for later
dynamic chunk registration purposes.
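The iterator added by this patch can be sketched with a minimal stand-in. The structure and field names below are illustrative, not QEMU's: the point is simply walking a linked list of RAM blocks and handing each (host_addr, offset, length) triple to a callback, the way the RDMA setup code transmits block descriptions to the peer.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for a RAMBlock list entry. */
typedef struct Block {
    void *host;
    unsigned long offset;
    unsigned long length;
    struct Block *next;
} Block;

typedef void (BlockIterFunc)(void *host, unsigned long offset,
                             unsigned long length, void *opaque);

/* Analogue of qemu_ram_foreach_block(): visit every block once. */
static void foreach_block(Block *head, BlockIterFunc *func, void *opaque)
{
    for (Block *b = head; b != NULL; b = b->next) {
        func(b->host, b->offset, b->length, opaque);
    }
}

/* Example callback: sum the total length of all blocks. */
static void sum_length(void *host, unsigned long offset,
                       unsigned long length, void *opaque)
{
    (void)host; (void)offset;
    *(unsigned long *)opaque += length;
}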

Signed-off-by: Michael R. Hines mrhi...@us.ibm.com
---
 exec.c|9 +
 include/exec/cpu-common.h |5 +
 2 files changed, 14 insertions(+)

diff --git a/exec.c b/exec.c
index fa1e0c3..0e5a2c3 100644
--- a/exec.c
+++ b/exec.c
@@ -2631,3 +2631,12 @@ bool cpu_physical_memory_is_io(hwaddr phys_addr)
  memory_region_is_romd(section->mr));
 }
 #endif
+
+void qemu_ram_foreach_block(RAMBlockIterFunc func, void *opaque)
+{
+RAMBlock *block;
+
+QTAILQ_FOREACH(block, &ram_list.blocks, next) {
+func(block->host, block->offset, block->length, opaque);
+}
+}
diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index 2e5f11f..88cb741 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -119,6 +119,11 @@ extern struct MemoryRegion io_mem_rom;
 extern struct MemoryRegion io_mem_unassigned;
 extern struct MemoryRegion io_mem_notdirty;
 
+typedef void (RAMBlockIterFunc)(void *host_addr,
+ram_addr_t offset, ram_addr_t length, void *opaque);
+
+void qemu_ram_foreach_block(RAMBlockIterFunc func, void *opaque);
+
 #endif
 
 #endif /* !CPU_COMMON_H */
-- 
1.7.10.4




[Qemu-devel] [RFC PATCH RDMA support v6: 2/7] documentation (docs/rdma.txt)

2013-04-09 Thread mrhines
From: Michael R. Hines mrhi...@us.ibm.com

Verbose documentation is included, for both the protocol and
interface to QEMU.

Additionally, there is a Features/RDMALiveMigration wiki as
well as a patch on github.com (hinesmr/qemu.git)

Signed-off-by: Michael R. Hines mrhi...@us.ibm.com
---
 docs/rdma.txt |  300 +
 1 file changed, 300 insertions(+)
 create mode 100644 docs/rdma.txt

diff --git a/docs/rdma.txt b/docs/rdma.txt
new file mode 100644
index 000..583836e
--- /dev/null
+++ b/docs/rdma.txt
@@ -0,0 +1,300 @@
+Several changes since v5:
+
+- Only one new file in the patch now! (migration-rdma.c)
+- Smaller number of files touched, fewer prototypes
+- Merged files as requested (rdma.c and migration-rdma.c)
+- Eliminated header as requested (rdma.h)
+- Created new function pointers for hooks in arch_init.c
+  to be cleaner and removed all explicit RDMA checks
+  to instead use QEMUFileOps
+
+Contents:
+=========
+* Running
+* RDMA Protocol Description
+* Versioning
+* QEMUFileRDMA Interface
+* Migration of pc.ram
+* Error handling
+* TODO
+* Performance
+
+RUNNING:
+========
+
+First, decide if you want dynamic page registration on the server-side.
+This always happens on the primary-VM side, but is optional on the server.
+Doing this allows you to support overcommit (such as cgroups or ballooning)
+with a smaller footprint on the server-side without having to register the
+entire VM memory footprint. 
+NOTE: This significantly slows down RDMA throughput (about 30% slower).
+
+$ virsh qemu-monitor-command --hmp \
+--cmd migrate_set_capability chunk_register_destination on # disabled by default
+
+Next, if you decided *not* to use chunked registration on the server,
+it is recommended to also disable zero page detection. While this is not
+strictly necessary, zero page detection also significantly slows down
+performance on higher-throughput links (by about 50%), such as 40 Gbps InfiniBand cards:
+
+$ virsh qemu-monitor-command --hmp \
+--cmd migrate_set_capability check_for_zero off # always enabled by default
+
+Next, set the migration speed to match your hardware's capabilities:
+
+$ virsh qemu-monitor-command --hmp \
+--cmd migrate_set_speed 40g # or whatever is the MAX of your RDMA device
+
+Finally, perform the actual migration:
+
+$ virsh migrate domain rdma:xx.xx.xx.xx:port
+
+RDMA Protocol Description:
+==========================
+
+Migration with RDMA is separated into two parts:
+
+1. The transmission of the pages using RDMA
+2. Everything else (a control channel is introduced)
+
+Everything else is transmitted using a formal 
+protocol now, consisting of infiniband SEND / RECV messages.
+
+An infiniband SEND message is the standard ibverbs
+message used by applications of infiniband hardware.
+The only difference between a SEND message and an RDMA
+message is that SEND messages cause completion notifications
+to be posted to the completion queue (CQ) on the 
+infiniband receiver side, whereas RDMA messages (used
+for pc.ram) do not (to behave like an actual DMA).
+
+Messages in infiniband require two things:
+
+1. registration of the memory that will be transmitted
+2. (SEND/RECV only) work requests to be posted on both
+   sides of the network before the actual transmission
+   can occur.
+
+RDMA messages are much easier to deal with. Once the memory
+on the receiver side is registered and pinned, we're
+basically done. All that is required is for the sender
+side to start dumping bytes onto the link.
+
+SEND messages require more coordination because the
+receiver must have reserved space (using a receive
+work request) on the receive queue (RQ) before QEMUFileRDMA
+can start using them to carry all the bytes as
+a transport for migration of device state.
+
+To begin the migration, the initial connection setup is
+as follows (migration-rdma.c):
+
+1. Receiver and Sender are started (command line or libvirt)
+2. Both sides post two RQ work requests
+3. Receiver does listen()
+4. Sender does connect()
+5. Receiver accept()
+6. Check versioning and capabilities (described later)
+
+At this point, we define a control channel on top of SEND messages
+which is described by a formal protocol. Each SEND message has a 
+header portion and a data portion (but together are transmitted 
+as a single SEND message).
+
+Header:
+* Length  (of the data portion)
+* Type    (what command to perform, described below)
+* Version (protocol version validated before send/recv occurs)
+
+The 'type' field has 7 different command values:
+1. None
+2. Ready (control-channel is available) 
+3. QEMU File (for sending non-live device state) 
+4. RAM Blocks    (used right after connection setup)
+5. Register request  (dynamic chunk registration) 
+6. Register result   ('rkey' to be used by sender)
+7. Register finished (registration for current iteration finished)

[Qemu-devel] [RFC PATCH RDMA support v6: 4/7] Introduce two new capabilities

2013-04-09 Thread mrhines
From: Michael R. Hines mrhi...@us.ibm.com

RDMA performs very slowly with zero-page checking.
Without the ability to disable it, the throughput and
latency promises of RDMA and high-performance links
cannot be fully realized.

On the other hand, dynamic page registration support is also
included in the RDMA protocol. This second capability also
cannot be fully realized without the ability to enable zero
page scanning.

So, we have two new capabilities which work together:

1. migrate_set_capability check_for_zero on|off (default on)
2. migrate_set_capability chunk_register_destination on|off (default off)

Signed-off-by: Michael R. Hines mrhi...@us.ibm.com
---
 include/migration/qemu-file.h |   15 +++
 migration.c   |   33 +++--
 qapi-schema.json  |2 +-
 3 files changed, 47 insertions(+), 3 deletions(-)

diff --git a/include/migration/qemu-file.h b/include/migration/qemu-file.h
index 623c434..b6f3256 100644
--- a/include/migration/qemu-file.h
+++ b/include/migration/qemu-file.h
@@ -57,12 +57,15 @@ typedef int (QEMUFileGetFD)(void *opaque);
 typedef ssize_t (QEMUFileWritevBufferFunc)(void *opaque, struct iovec *iov,
int iovcnt);
 
+typedef struct QEMURamControlOps QEMURamControlOps;
+
 typedef struct QEMUFileOps {
 QEMUFilePutBufferFunc *put_buffer;
 QEMUFileGetBufferFunc *get_buffer;
 QEMUFileCloseFunc *close;
 QEMUFileGetFD *get_fd;
 QEMUFileWritevBufferFunc *writev_buffer;
+const QEMURamControlOps *ram_control;
 } QEMUFileOps;
 
 QEMUFile *qemu_fopen_ops(void *opaque, const QEMUFileOps *ops);
@@ -80,6 +83,18 @@ void qemu_put_byte(QEMUFile *f, int v);
  * The buffer should be available till it is sent asynchronously.
  */
 void qemu_put_buffer_async(QEMUFile *f, const uint8_t *buf, int size);
+void qemu_file_set_error(QEMUFile *f, int ret);
+
+void qemu_rdma_cleanup(void *opaque);
+int qemu_rdma_close(void *opaque);
+int qemu_rdma_get_fd(void *opaque);
+int qemu_rdma_get_buffer(void *opaque, uint8_t *buf, int64_t pos, int size);
+int qemu_rdma_put_buffer(void *opaque, const uint8_t *buf, 
+int64_t pos, int size);
+bool qemu_file_mode_is_not_valid(const char *mode);
+
+extern const QEMUFileOps rdma_read_ops;
+extern const QEMUFileOps rdma_write_ops;
 
 static inline void qemu_put_ubyte(QEMUFile *f, unsigned int v)
 {
diff --git a/migration.c b/migration.c
index 3b4b467..875cee3 100644
--- a/migration.c
+++ b/migration.c
@@ -66,6 +66,7 @@ MigrationState *migrate_get_current(void)
 .state = MIG_STATE_SETUP,
 .bandwidth_limit = MAX_THROTTLE,
 .xbzrle_cache_size = DEFAULT_MIGRATE_CACHE_SIZE,
+.enabled_capabilities[MIGRATION_CAPABILITY_CHECK_FOR_ZERO] = true,
 };
 
 return current_migration;
@@ -77,6 +78,10 @@ void qemu_start_incoming_migration(const char *uri, Error **errp)
 
 if (strstart(uri, "tcp:", &p))
 tcp_start_incoming_migration(p, errp);
+#ifdef CONFIG_RDMA
+else if (strstart(uri, "rdma:", &p))
+rdma_start_incoming_migration(p, errp);
+#endif
 #if !defined(WIN32)
 else if (strstart(uri, "exec:", &p))
 exec_start_incoming_migration(p, errp);
@@ -120,8 +125,10 @@ void process_incoming_migration(QEMUFile *f)
 Coroutine *co = qemu_coroutine_create(process_incoming_migration_co);
 int fd = qemu_get_fd(f);
 
-assert(fd != -1);
-qemu_set_nonblock(fd);
+if(fd != -2) { /* rdma returns -2 */
+assert(fd != -1);
+qemu_set_nonblock(fd);
+}
 qemu_coroutine_enter(co, f);
 }
 
@@ -405,6 +412,10 @@ void qmp_migrate(const char *uri, bool has_blk, bool blk,
 
 if (strstart(uri, "tcp:", &p)) {
 tcp_start_outgoing_migration(s, p, local_err);
+#ifdef CONFIG_RDMA
+} else if (strstart(uri, "rdma:", &p)) {
+rdma_start_outgoing_migration(s, p, local_err);
+#endif
 #if !defined(WIN32)
 } else if (strstart(uri, "exec:", &p)) {
 exec_start_outgoing_migration(s, p, local_err);
@@ -474,6 +485,24 @@ void qmp_migrate_set_downtime(double value, Error **errp)
 max_downtime = (uint64_t)value;
 }
 
+bool migrate_chunk_register_destination(void)
+{
+MigrationState *s;
+
+s = migrate_get_current();
+
+return s->enabled_capabilities[MIGRATION_CAPABILITY_CHUNK_REGISTER_DESTINATION];
+}
+
+bool migrate_check_for_zero(void)
+{
+MigrationState *s;
+
+s = migrate_get_current();
+
+return s->enabled_capabilities[MIGRATION_CAPABILITY_CHECK_FOR_ZERO];
+}
+
 int migrate_use_xbzrle(void)
 {
 MigrationState *s;
diff --git a/qapi-schema.json b/qapi-schema.json
index db542f6..7ebcf99 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -602,7 +602,7 @@
 # Since: 1.2
 ##
 { 'enum': 'MigrationCapability',
-  'data': ['xbzrle'] }
+  'data': ['xbzrle', 'check_for_zero', 'chunk_register_destination'] }
 
 ##
 # @MigrationCapabilityStatus
-- 
1.7.10.4




Re: [Qemu-devel] [PATCHv2] rdma: add a new IB_ACCESS_GIFT flag

2013-04-09 Thread Michael R. Hines

On 04/09/2013 11:24 PM, Michael S. Tsirkin wrote:
Which mechanism do you refer to? You patches still seem to pin each 
page in guest memory at some point, which will break all COW. In 
particular any pagemap tricks to detect duplicates on source that I 
suggested won't work. 


Sorry, I misspoke. I'm referring to dynamic server page registration.

Of course it does not eliminate pinning, but it does mitigate the memory
footprint of the VM, which was a requested feature.


I have implemented it and documented it.

- Michael


On 04/09/2013 03:03 PM, Michael S. Tsirkin wrote:

presumably is_dup_page reads the page, so should not break COW ...

I'm not sure about the cgroups swap limit - you might have
too many non COW pages so attempting to fault them all in
makes you exceed the limit. You really should look at
what is going on in the pagemap, to see if there's
measurable gain from the patch.


On Fri, Apr 05, 2013 at 05:32:30PM -0400, Michael R. Hines wrote:

Well, I have the is_dup_page() commented out...when RDMA is
activated.

Is there something else in QEMU that could be touching the page that
I don't know about?

- Michael


On 04/05/2013 05:03 PM, Roland Dreier wrote:

On Fri, Apr 5, 2013 at 1:51 PM, Michael R. Hines
mrhi...@linux.vnet.ibm.com wrote:

Sorry, I was wrong. ignore the comments about cgroups. That's still broken.
(i.e. trying to register RDMA memory while using a cgroup swap limit causes
the process to get killed).

But the GIFT flag patch works (my understanding is that GIFT flag allows the
adapter to transmit stale memory information, it does not have anything to
do with cgroups specifically).

The point of the GIFT patch is to avoid triggering copy-on-write so
that memory doesn't blow up during migration.  If that doesn't work
then there's no point to the patch.

  - R.






Re: [Qemu-devel] [SeaBIOS] [PATCH v16] Add pvpanic device driver

2013-04-09 Thread Hu Tao
On Tue, Apr 09, 2013 at 08:37:16PM -0400, Kevin O'Connor wrote:
 On Tue, Apr 02, 2013 at 12:07:46PM +0300, Gleb Natapov wrote:
  On Mon, Apr 01, 2013 at 08:22:57PM -0400, Kevin O'Connor wrote:
   On Sun, Mar 31, 2013 at 05:34:10PM +0300, Gleb Natapov wrote:
On Sat, Mar 30, 2013 at 09:20:09AM -0400, Kevin O'Connor wrote:
The patch uses existing channel between qemu and seabios, one
romfile_loadint() is all it takes. We already have number of interfaces
to change OS visible ACPI tables, that's why we want to move ACPI table
creation to QEMU in the first place. It is unfortunate to start blocking
features now before we have an alternative. When ACPI table creation
will move into QEMU the code in this patch will be dropped along with
all the other code that serves similar purpose.
   
   If there is a general consensus that this feature is important then
   we'll go forward with adding it as is.  To be clear though, my
   preference would be to go forward with moving ACPI tables into QEMU,
   and then add this stuff on top of that.  If no one beats me to it,
   I'll send some initial patches myself.
   
  If we can accomplish the move before next major QEMU release we do not
  need this new fw_cfg file obviously. Paolo thinks this is not feasible,
  I haven't followed this work closely enough to have an informed opinion.
 
 I was hoping I'd get a chance to submit some QEMU patches for this
 before the soft-freeze, but unfortunately I have not been able to.
 Since I don't want to hold up features, I remove my earlier objection
 and I'm okay with committing this to SeaBIOS.

Glad to hear that!

 
 Hu Tao - if the QEMU part of the pvpanic series is committed to QEMU
 I'll commit the corresponding SeaBIOS parts.

Thanks a lot!

-- 
Regards,
Hu Tao


