Re: [PATCH V6 2/2] virtio-blk: Add REQ_FLUSH and REQ_FUA support to bio path

2012-08-08 Thread Asias He

On 08/07/2012 05:15 PM, Christoph Hellwig wrote:

At least after review is done I really think this patch sopuld be folded
into the previous one.


OK.


Some more comments below:


@@ -58,6 +58,12 @@ struct virtblk_req
struct bio *bio;
struct virtio_blk_outhdr out_hdr;
struct virtio_scsi_inhdr in_hdr;
+   struct work_struct work;
+   struct virtio_blk *vblk;


I think using bio-bi_private for the virtio_blk pointer would
be cleaner.


I wish I could use bio-bi_private but I am seeing this when using 
bio-bi_priate to store virito_blk pointer:



[1.100335] Call Trace:
[1.100335]  IRQ
[1.100335]  [811dd4b0] ? end_bio_bh_io_sync+0x30/0x50
[1.100335]  [811e167d] bio_endio+0x1d/0x40
[1.100335]  [81551fb2] virtblk_done+0xa2/0x260
[1.100335]  [813d714d] vring_interrupt+0x2d/0x40
[1.100335]  [81119c0d] handle_irq_event_percpu+0x6d/0x210
[1.100335]  [81119df1] handle_irq_event+0x41/0x70
[1.100335]  [8111d2c9] handle_edge_irq+0x69/0x120
[1.100335]  [810613a2] handle_irq+0x22/0x30
[1.100335]  [81aadccd] do_IRQ+0x5d/0xe0
[1.100335]  [81aa432f] common_interrupt+0x6f/0x6f

end_bio_bh_io_sync() uses bio-private:

   struct buffer_head *bh = bio-bi_private;




+   bool is_flush;
+   bool req_flush;
+   bool req_data;
+   bool req_fua;


This could be a bitmap, or better even a single state field.


Will use a bitmap for now.


+static int virtblk_bio_send_flush(struct virtio_blk *vblk,
+ struct virtblk_req *vbr)



+static int virtblk_bio_send_data(struct virtio_blk *vblk,
+struct virtblk_req *vbr)


These should only get the struct virtblk_req * argument as the virtio_blk
structure is easily derivable from it.


Yes. Will clean it up.


+static inline void virtblk_bio_done_flush(struct virtio_blk *vblk,
+ struct virtblk_req *vbr)
  {
+   if (vbr-req_data) {
+   /* Send out the actual write data */
+   struct virtblk_req *_vbr;
+   _vbr = virtblk_alloc_req(vblk, GFP_NOIO);
+   if (!_vbr) {
+   bio_endio(vbr-bio, -ENOMEM);
+   goto out;
+   }
+   _vbr-req_fua = vbr-req_fua;
+   _vbr-bio = vbr-bio;
+   _vbr-vblk = vblk;
+   INIT_WORK(_vbr-work, virtblk_bio_send_data_work);
+   queue_work(virtblk_wq, _vbr-work);


The _vbr naming isn't too nice.  Also can you explain why the original
request can't be reused in a comment here


Ah, the original request can be reused. Will fix this.


Also if using a state variable I think the whole code would be
a bit cleaner if the bio_done helpers are combined.


-   if (writeback  !use_bio)
+   if (writeback)
blk_queue_flush(vblk-disk-queue, REQ_FLUSH);


Shouldn't this be REQ_FLUSH | REQ_FUA for the bio case?


Without REQ_FUA, I am also seeing bio with REQ_FUA flag set. Do we need 
to set REQ_FUA explicitly?



Btw, did you verify that flushes really work correctly for all cases
using tracing in qemu?


I added some debug code in both kernel and kvm tool to verity the flush.

--
Asias
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V6 0/2] Improve virtio-blk performance

2012-08-08 Thread Asias He

On 08/08/2012 11:09 AM, Asias He wrote:

On 08/07/2012 05:16 PM, Christoph Hellwig wrote:

On Tue, Aug 07, 2012 at 04:47:13PM +0800, Asias He wrote:

1) Ramdisk device
  With bio-based IO path, sequential read/write, random read/write
  IOPS boost : 28%, 24%, 21%, 16%
  Latency improvement: 32%, 17%, 21%, 16%
2) Fusion IO device
  With bio-based IO path, sequential read/write, random read/write
  IOPS boost : 11%, 11%, 13%, 10%
  Latency improvement: 10%, 10%, 12%, 10%


Do you also have numbers for normal SAS/SATA disks?  The changelog should
have a reall good explanation of why this is optional, and numbers are a
very important part of that.


Yes. I posted the numbers on normal SATA disks a few days ago in the 
thread with Sasha. Will add that data and explanation of why optional to 
the changelog


--
Asias
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 02/15] qom: using atomic ops to re-implement object_ref

2012-08-08 Thread Liu Ping Fan
From: Liu Ping Fan pingf...@linux.vnet.ibm.com

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 include/qemu/object.h |3 ++-
 qom/object.c  |   13 +
 2 files changed, 7 insertions(+), 9 deletions(-)

diff --git a/include/qemu/object.h b/include/qemu/object.h
index 8b17776..58db9d0 100644
--- a/include/qemu/object.h
+++ b/include/qemu/object.h
@@ -18,6 +18,7 @@
 #include stdint.h
 #include stdbool.h
 #include qemu-queue.h
+#include qemu/atomic.h
 
 struct Visitor;
 struct Error;
@@ -262,7 +263,7 @@ struct Object
 ObjectClass *class;
 GSList *interfaces;
 QTAILQ_HEAD(, ObjectProperty) properties;
-uint32_t ref;
+Atomic ref;
 Object *parent;
 };
 
diff --git a/qom/object.c b/qom/object.c
index 00bb3b0..822bdb7 100644
--- a/qom/object.c
+++ b/qom/object.c
@@ -378,7 +378,7 @@ void object_finalize(void *data)
 object_deinit(obj, ti);
 object_property_del_all(obj);
 
-g_assert(obj-ref == 0);
+g_assert(atomic_read(obj-ref) == 0);
 }
 
 Object *object_new_with_type(Type type)
@@ -405,7 +405,7 @@ Object *object_new(const char *typename)
 void object_delete(Object *obj)
 {
 object_unref(obj);
-g_assert(obj-ref == 0);
+g_assert(atomic_read(obj-ref) == 0);
 g_free(obj);
 }
 
@@ -639,16 +639,13 @@ GSList *object_class_get_list(const char *implements_type,
 
 void object_ref(Object *obj)
 {
-obj-ref++;
+atomic_inc(obj-ref);
 }
 
 void object_unref(Object *obj)
 {
-g_assert(obj-ref  0);
-obj-ref--;
-
-/* parent always holds a reference to its children */
-if (obj-ref == 0) {
+g_assert(atomic_read(obj-ref)  0);
+if (atomic_dec_and_test(obj-ref)) {
 object_finalize(obj);
 }
 }
-- 
1.7.4.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 03/15] qom: introduce reclaimer to release obj

2012-08-08 Thread Liu Ping Fan
From: Liu Ping Fan pingf...@linux.vnet.ibm.com

Collect unused object and release them at caller demand.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 include/qemu/reclaimer.h |   28 ++
 main-loop.c  |5 
 qemu-tool.c  |5 
 qom/Makefile.objs|2 +-
 qom/reclaimer.c  |   58 ++
 5 files changed, 97 insertions(+), 1 deletions(-)
 create mode 100644 include/qemu/reclaimer.h
 create mode 100644 qom/reclaimer.c

diff --git a/include/qemu/reclaimer.h b/include/qemu/reclaimer.h
new file mode 100644
index 000..9307e93
--- /dev/null
+++ b/include/qemu/reclaimer.h
@@ -0,0 +1,28 @@
+/*
+ * QEMU reclaimer
+ *
+ * Copyright IBM, Corp. 2012
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#ifndef QEMU_RECLAIMER
+#define QEMU_RECLAIMER
+
+typedef void ReleaseHandler(void *opaque);
+typedef struct Chunk {
+QLIST_ENTRY(Chunk) list;
+void *opaque;
+ReleaseHandler *release;
+} Chunk;
+
+typedef struct ChunkHead {
+struct Chunk *lh_first;
+} ChunkHead;
+
+void reclaimer_enqueue(ChunkHead *head, void *opaque, ReleaseHandler *release);
+void reclaimer_worker(ChunkHead *head);
+void qemu_reclaimer_enqueue(void *opaque, ReleaseHandler *release);
+void qemu_reclaimer(void);
+#endif
diff --git a/main-loop.c b/main-loop.c
index eb3b6e6..be9d095 100644
--- a/main-loop.c
+++ b/main-loop.c
@@ -26,6 +26,7 @@
 #include qemu-timer.h
 #include slirp/slirp.h
 #include main-loop.h
+#include qemu/reclaimer.h
 
 #ifndef _WIN32
 
@@ -505,5 +506,9 @@ int main_loop_wait(int nonblocking)
them.  */
 qemu_bh_poll();
 
+/* ref to device from iohandler/bh/timer do not obey the rules, so delay
+ * reclaiming until now.
+ */
+qemu_reclaimer();
 return ret;
 }
diff --git a/qemu-tool.c b/qemu-tool.c
index 318c5fc..f5fe319 100644
--- a/qemu-tool.c
+++ b/qemu-tool.c
@@ -21,6 +21,7 @@
 #include main-loop.h
 #include qemu_socket.h
 #include slirp/libslirp.h
+#include qemu/reclaimer.h
 
 #include sys/time.h
 
@@ -75,6 +76,10 @@ void qemu_mutex_unlock_iothread(void)
 {
 }
 
+void qemu_reclaimer(void)
+{
+}
+
 int use_icount;
 
 void qemu_clock_warp(QEMUClock *clock)
diff --git a/qom/Makefile.objs b/qom/Makefile.objs
index 5ef060a..a579261 100644
--- a/qom/Makefile.objs
+++ b/qom/Makefile.objs
@@ -1,4 +1,4 @@
-qom-obj-y = object.o container.o qom-qobject.o
+qom-obj-y = object.o container.o qom-qobject.o reclaimer.o
 qom-obj-twice-y = cpu.o
 common-obj-y = $(qom-obj-twice-y)
 user-obj-y = $(qom-obj-twice-y)
diff --git a/qom/reclaimer.c b/qom/reclaimer.c
new file mode 100644
index 000..6cb53e3
--- /dev/null
+++ b/qom/reclaimer.c
@@ -0,0 +1,58 @@
+/*
+ * QEMU reclaimer
+ *
+ * Copyright IBM, Corp. 2012
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include qemu-common.h
+#include qemu-thread.h
+#include main-loop.h
+#include qemu-queue.h
+#include qemu/reclaimer.h
+
+static struct QemuMutex reclaimer_lock;
+static QLIST_HEAD(rcl, Chunk) reclaimer_list;
+
+void reclaimer_enqueue(ChunkHead *head, void *opaque, ReleaseHandler *release)
+{
+Chunk *r = g_malloc0(sizeof(Chunk));
+r-opaque = opaque;
+r-release = release;
+QLIST_INSERT_HEAD_RCU(head, r, list);
+}
+
+void reclaimer_worker(ChunkHead *head)
+{
+Chunk *cur, *next;
+
+QLIST_FOREACH_SAFE(cur, head, list, next) {
+QLIST_REMOVE(cur, list);
+cur-release(cur-opaque);
+g_free(cur);
+}
+}
+
+void qemu_reclaimer_enqueue(void *opaque, ReleaseHandler *release)
+{
+Chunk *r = g_malloc0(sizeof(Chunk));
+r-opaque = opaque;
+r-release = release;
+qemu_mutex_lock(reclaimer_lock);
+QLIST_INSERT_HEAD_RCU(reclaimer_list, r, list);
+qemu_mutex_unlock(reclaimer_lock);
+}
+
+
+void qemu_reclaimer(void)
+{
+Chunk *cur, *next;
+
+QLIST_FOREACH_SAFE(cur, reclaimer_list, list, next) {
+QLIST_REMOVE(cur, list);
+cur-release(cur-opaque);
+g_free(cur);
+}
+}
-- 
1.7.4.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 04/15] memory: MemoryRegion topology must be stable when updating

2012-08-08 Thread Liu Ping Fan
From: Liu Ping Fan pingf...@linux.vnet.ibm.com

Using mem_map_lock to protect among updaters. So we can get the intact
snapshot of mem topology -- FlatView  radix-tree.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 exec.c   |3 +++
 memory.c |   22 ++
 memory.h |2 ++
 3 files changed, 27 insertions(+), 0 deletions(-)

diff --git a/exec.c b/exec.c
index 8244d54..0e29ef9 100644
--- a/exec.c
+++ b/exec.c
@@ -210,6 +210,8 @@ static unsigned phys_map_nodes_nb, phys_map_nodes_nb_alloc;
The bottom level has pointers to MemoryRegionSections.  */
 static PhysPageEntry phys_map = { .ptr = PHYS_MAP_NODE_NIL, .is_leaf = 0 };
 
+QemuMutex mem_map_lock;
+
 static void io_mem_init(void);
 static void memory_map_init(void);
 
@@ -637,6 +639,7 @@ void cpu_exec_init_all(void)
 #if !defined(CONFIG_USER_ONLY)
 memory_map_init();
 io_mem_init();
+qemu_mutex_init(mem_map_lock);
 #endif
 }
 
diff --git a/memory.c b/memory.c
index aab4a31..5986532 100644
--- a/memory.c
+++ b/memory.c
@@ -761,7 +761,9 @@ void memory_region_transaction_commit(void)
 assert(memory_region_transaction_depth);
 --memory_region_transaction_depth;
 if (!memory_region_transaction_depth  memory_region_update_pending) {
+qemu_mutex_lock(mem_map_lock);
 memory_region_update_topology(NULL);
+qemu_mutex_unlock(mem_map_lock);
 }
 }
 
@@ -1069,8 +1071,10 @@ void memory_region_set_log(MemoryRegion *mr, bool log, 
unsigned client)
 {
 uint8_t mask = 1  client;
 
+qemu_mutex_lock(mem_map_lock);
 mr-dirty_log_mask = (mr-dirty_log_mask  ~mask) | (log * mask);
 memory_region_update_topology(mr);
+qemu_mutex_unlock(mem_map_lock);
 }
 
 bool memory_region_get_dirty(MemoryRegion *mr, target_phys_addr_t addr,
@@ -1103,8 +1107,10 @@ void memory_region_sync_dirty_bitmap(MemoryRegion *mr)
 void memory_region_set_readonly(MemoryRegion *mr, bool readonly)
 {
 if (mr-readonly != readonly) {
+qemu_mutex_lock(mem_map_lock);
 mr-readonly = readonly;
 memory_region_update_topology(mr);
+qemu_mutex_unlock(mem_map_lock);
 }
 }
 
@@ -1112,7 +1118,9 @@ void memory_region_rom_device_set_readable(MemoryRegion 
*mr, bool readable)
 {
 if (mr-readable != readable) {
 mr-readable = readable;
+qemu_mutex_lock(mem_map_lock);
 memory_region_update_topology(mr);
+qemu_mutex_unlock(mem_map_lock);
 }
 }
 
@@ -1206,6 +1214,7 @@ void memory_region_add_eventfd(MemoryRegion *mr,
 };
 unsigned i;
 
+qemu_mutex_lock(mem_map_lock);
 for (i = 0; i  mr-ioeventfd_nb; ++i) {
 if (memory_region_ioeventfd_before(mrfd, mr-ioeventfds[i])) {
 break;
@@ -1218,6 +1227,7 @@ void memory_region_add_eventfd(MemoryRegion *mr,
 sizeof(*mr-ioeventfds) * (mr-ioeventfd_nb-1 - i));
 mr-ioeventfds[i] = mrfd;
 memory_region_update_topology(mr);
+qemu_mutex_unlock(mem_map_lock);
 }
 
 void memory_region_del_eventfd(MemoryRegion *mr,
@@ -1236,6 +1246,7 @@ void memory_region_del_eventfd(MemoryRegion *mr,
 };
 unsigned i;
 
+qemu_mutex_lock(mem_map_lock);
 for (i = 0; i  mr-ioeventfd_nb; ++i) {
 if (memory_region_ioeventfd_equal(mrfd, mr-ioeventfds[i])) {
 break;
@@ -1248,6 +1259,7 @@ void memory_region_del_eventfd(MemoryRegion *mr,
 mr-ioeventfds = g_realloc(mr-ioeventfds,
   sizeof(*mr-ioeventfds)*mr-ioeventfd_nb + 
1);
 memory_region_update_topology(mr);
+qemu_mutex_unlock(mem_map_lock);
 }
 
 static void memory_region_add_subregion_common(MemoryRegion *mr,
@@ -1259,6 +1271,8 @@ static void 
memory_region_add_subregion_common(MemoryRegion *mr,
 assert(!subregion-parent);
 subregion-parent = mr;
 subregion-addr = offset;
+
+qemu_mutex_lock(mem_map_lock);
 QTAILQ_FOREACH(other, mr-subregions, subregions_link) {
 if (subregion-may_overlap || other-may_overlap) {
 continue;
@@ -1289,6 +1303,7 @@ static void 
memory_region_add_subregion_common(MemoryRegion *mr,
 QTAILQ_INSERT_TAIL(mr-subregions, subregion, subregions_link);
 done:
 memory_region_update_topology(mr);
+qemu_mutex_unlock(mem_map_lock);
 }
 
 
@@ -1316,8 +1331,11 @@ void memory_region_del_subregion(MemoryRegion *mr,
 {
 assert(subregion-parent == mr);
 subregion-parent = NULL;
+
+qemu_mutex_lock(mem_map_lock);
 QTAILQ_REMOVE(mr-subregions, subregion, subregions_link);
 memory_region_update_topology(mr);
+qemu_mutex_unlock(mem_map_lock);
 }
 
 void memory_region_set_enabled(MemoryRegion *mr, bool enabled)
@@ -1325,8 +1343,10 @@ void memory_region_set_enabled(MemoryRegion *mr, bool 
enabled)
 if (enabled == mr-enabled) {
 return;
 }
+qemu_mutex_lock(mem_map_lock);
 mr-enabled = enabled;
 memory_region_update_topology(NULL);
+qemu_mutex_unlock(mem_map_lock);
 }
 
 void memory_region_set_address(MemoryRegion *mr, target_phys_addr_t 

[PATCH 0/15 v2] prepare unplug out of protection of global lock

2012-08-08 Thread Liu Ping Fan
background:
refer to orignal plan posted by Marcelo Tosatti,
http://lists.gnu.org/archive/html/qemu-devel/2012-06/msg04315.html

prev version:
https://lists.gnu.org/archive/html/qemu-devel/2012-07/msg03312.html


changes v1-v2

--introduce atomic ops
--introduce ref cnt for MemoryRegion
--make memory's flat view and radix tree toward rcu style.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 05/15] memory: introduce life_ops to MemoryRegion

2012-08-08 Thread Liu Ping Fan
From: Liu Ping Fan pingf...@linux.vnet.ibm.com

The types of referred object by MemoryRegion are variable, ex,
another mr, DeviceState, or other struct defined by drivers.
So the refer/unrefer may be different by drivers.

Using this ops, we can mange the backend object.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 hw/ide/piix.c |6 ++--
 hw/pckbd.c|6 +++-
 hw/serial.c   |2 +-
 ioport.c  |3 +-
 memory.c  |   69 -
 memory.h  |   16 +
 6 files changed, 94 insertions(+), 8 deletions(-)

diff --git a/hw/ide/piix.c b/hw/ide/piix.c
index f5a74c2..bdd70b1 100644
--- a/hw/ide/piix.c
+++ b/hw/ide/piix.c
@@ -93,11 +93,11 @@ static void bmdma_setup_bar(PCIIDEState *d)
 for(i = 0;i  2; i++) {
 BMDMAState *bm = d-bmdma[i];
 
-memory_region_init_io(bm-extra_io, piix_bmdma_ops, bm,
+memory_region_init_io_ext(bm-extra_io, piix_bmdma_ops, NULL, bm,
   piix-bmdma, 4);
 memory_region_add_subregion(d-bmdma_bar, i * 8, bm-extra_io);
-memory_region_init_io(bm-addr_ioport, bmdma_addr_ioport_ops, bm,
-  bmdma, 4);
+memory_region_init_io_ext(bm-addr_ioport, bmdma_addr_ioport_ops,
+  NULL, bm, bmdma, 4);
 memory_region_add_subregion(d-bmdma_bar, i * 8 + 4, 
bm-addr_ioport);
 }
 }
diff --git a/hw/pckbd.c b/hw/pckbd.c
index 69857ba..de3c46d 100644
--- a/hw/pckbd.c
+++ b/hw/pckbd.c
@@ -485,10 +485,12 @@ static int i8042_initfn(ISADevice *dev)
 isa_init_irq(dev, s-irq_kbd, 1);
 isa_init_irq(dev, s-irq_mouse, 12);
 
-memory_region_init_io(isa_s-io + 0, i8042_data_ops, s, i8042-data, 1);
+memory_region_init_io_ext(isa_s-io + 0, i8042_data_ops, NULL, s,
+i8042-data, 1);
 isa_register_ioport(dev, isa_s-io + 0, 0x60);
 
-memory_region_init_io(isa_s-io + 1, i8042_cmd_ops, s, i8042-cmd, 1);
+memory_region_init_io_ext(isa_s-io + 1, i8042_cmd_ops, NULL, s,
+i8042-cmd, 1);
 isa_register_ioport(dev, isa_s-io + 1, 0x64);
 
 s-kbd = ps2_kbd_init(kbd_update_kbd_irq, s);
diff --git a/hw/serial.c b/hw/serial.c
index a421d1e..e992c6a 100644
--- a/hw/serial.c
+++ b/hw/serial.c
@@ -794,7 +794,7 @@ static int serial_isa_initfn(ISADevice *dev)
 serial_init_core(s);
 qdev_set_legacy_instance_id(dev-qdev, isa-iobase, 3);
 
-memory_region_init_io(s-io, serial_io_ops, s, serial, 8);
+memory_region_init_io_ext(s-io, serial_io_ops, NULL, s, serial, 8);
 isa_register_ioport(dev, s-io, isa-iobase);
 return 0;
 }
diff --git a/ioport.c b/ioport.c
index 6e4ca0d..768e271 100644
--- a/ioport.c
+++ b/ioport.c
@@ -384,7 +384,8 @@ static void portio_list_add_1(PortioList *piolist,
  * Use an alias so that the callback is called with an absolute address,
  * rather than an offset relative to to start + off_low.
  */
-memory_region_init_io(region, ops, piolist-opaque, piolist-name,
+memory_region_init_io_ext(region, ops, NULL, piolist-opaque,
+  piolist-name,
   INT64_MAX);
 memory_region_init_alias(alias, piolist-name,
  region, start + off_low, off_high - off_low);
diff --git a/memory.c b/memory.c
index 5986532..80c7529 100644
--- a/memory.c
+++ b/memory.c
@@ -19,6 +19,7 @@
 #include bitops.h
 #include kvm.h
 #include assert.h
+#include hw/qdev.h
 
 #define WANT_EXEC_OBSOLETE
 #include exec-obsolete.h
@@ -799,6 +800,7 @@ static bool memory_region_wrong_endianness(MemoryRegion *mr)
 #endif
 }
 
+static MemoryRegionLifeOps nops;
 void memory_region_init(MemoryRegion *mr,
 const char *name,
 uint64_t size)
@@ -809,6 +811,7 @@ void memory_region_init(MemoryRegion *mr,
 if (size == UINT64_MAX) {
 mr-size = int128_2_64();
 }
+mr-life_ops = nops;
 mr-addr = 0;
 mr-subpage = false;
 mr-enabled = true;
@@ -931,6 +934,66 @@ static void memory_region_dispatch_write(MemoryRegion *mr,
   memory_region_write_accessor, mr);
 }
 
+static void mr_object_get(MemoryRegion *mr)
+{
+object_dynamic_cast_assert(OBJECT(mr-opaque), TYPE_DEVICE);
+object_ref(OBJECT(mr-opaque));
+}
+
+static void mr_object_put(MemoryRegion *mr)
+{
+object_unref(OBJECT(mr-opaque));
+}
+
+static MemoryRegionLifeOps obj_ops = {
+.get = mr_object_get,
+.put = mr_object_put,
+};
+
+static void mr_alias_get(MemoryRegion *mr)
+{
+}
+
+static void mr_alias_put(MemoryRegion *mr)
+{
+}
+
+static MemoryRegionLifeOps alias_ops = {
+.get = mr_alias_get,
+.put = mr_alias_put,
+};
+
+static void mr_nop_get(MemoryRegion *mr)
+{
+}
+
+static void mr_nop_put(MemoryRegion *mr)
+{
+}
+
+static MemoryRegionLifeOps nops = {
+.get = mr_nop_get,
+.put = mr_nop_put,
+};
+
+void memory_region_init_io_ext(MemoryRegion 

[PATCH 06/15] memory: use refcnt to manage MemoryRegion

2012-08-08 Thread Liu Ping Fan
From: Liu Ping Fan pingf...@linux.vnet.ibm.com

Using refcnt for mr, so we can separate mr's life cycle management
from refered object.
  When mr-ref 0-1, inc the refered object.
  When mr-ref 1-0, dec the refered object.

The refered object can be DeviceStae, another mr, or other opaque.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 memory.c |   18 ++
 memory.h |5 +
 2 files changed, 23 insertions(+), 0 deletions(-)

diff --git a/memory.c b/memory.c
index 80c7529..5dc8b59 100644
--- a/memory.c
+++ b/memory.c
@@ -811,6 +811,7 @@ void memory_region_init(MemoryRegion *mr,
 if (size == UINT64_MAX) {
 mr-size = int128_2_64();
 }
+atomic_set(mr-ref, 0);
 mr-life_ops = nops;
 mr-addr = 0;
 mr-subpage = false;
@@ -1090,6 +1091,23 @@ static const MemoryRegionOps reservation_ops = {
 .endianness = DEVICE_NATIVE_ENDIAN,
 };
 
+void memory_region_get(MemoryRegion *mr)
+{
+if (atomic_add_and_return(1, mr-ref) == 1) {
+mr-life_ops-get(mr);
+}
+}
+
+void memory_region_put(MemoryRegion *mr)
+{
+assert(atomic_read(mr-ref)  0);
+
+if (atomic_dec_and_test(mr-ref)) {
+/* to fix, using call_rcu( ,release) */
+mr-life_ops-put(mr);
+}
+}
+
 void memory_region_init_reservation(MemoryRegion *mr,
 const char *name,
 uint64_t size)
diff --git a/memory.h b/memory.h
index 8fb543b..740f018 100644
--- a/memory.h
+++ b/memory.h
@@ -18,6 +18,7 @@
 
 #include stdint.h
 #include stdbool.h
+#include qemu/atomic.h
 #include qemu-common.h
 #include cpu-common.h
 #include targphys.h
@@ -26,6 +27,7 @@
 #include ioport.h
 #include int128.h
 #include qemu-thread.h
+#include qemu/reclaimer.h
 
 typedef struct MemoryRegionOps MemoryRegionOps;
 typedef struct MemoryRegionLifeOps MemoryRegionLifeOps;
@@ -126,6 +128,7 @@ typedef struct MemoryRegionIoeventfd MemoryRegionIoeventfd;
 struct MemoryRegion {
 /* All fields are private - violators will be prosecuted */
 const MemoryRegionOps *ops;
+Atomic ref;
 MemoryRegionLifeOps *life_ops;
 void *opaque;
 MemoryRegion *parent;
@@ -766,6 +769,8 @@ void memory_global_dirty_log_stop(void);
 
 void mtree_info(fprintf_function mon_printf, void *f);
 
+void memory_region_get(MemoryRegion *mr);
+void memory_region_put(MemoryRegion *mr);
 #endif
 
 #endif
-- 
1.7.4.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 01/15] atomic: introduce atomic operations

2012-08-08 Thread Liu Ping Fan
From: Liu Ping Fan pingf...@linux.vnet.ibm.com

If out of global lock, we will be challenged by SMP in low level,
so need atomic ops.

This file is heavily copied from kernel. Currently, only x86 atomic ops
included, and will be extended for other arch for future.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 include/qemu/atomic.h |  161 +
 1 files changed, 161 insertions(+), 0 deletions(-)
 create mode 100644 include/qemu/atomic.h

diff --git a/include/qemu/atomic.h b/include/qemu/atomic.h
new file mode 100644
index 000..8e1fc3e
--- /dev/null
+++ b/include/qemu/atomic.h
@@ -0,0 +1,161 @@
+/*
+ * Simple interface for atomic operations.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef __QEMU_ATOMIC_H
+#define __QEMU_ATOMIC_H 1
+
+typedef struct Atomic {
+int counter;
+} Atomic;
+
+
+#if defined(__i386__) || defined(__x86_64__)
+
+/**
+ *  * atomic_read - read atomic variable
+ *   * @v: pointer of type Atomic
+ **
+ * * Atomically reads the value of @v.
+ *  */
+static inline int atomic_read(const Atomic *v)
+{
+return (*(volatile int *)(v)-counter);
+}
+
+/**
+ *  * atomic_set - set atomic variable
+ *   * @v: pointer of type Atomic
+ ** @i: required value
+ * *
+ *  * Atomically sets the value of @v to @i.
+ *   */
+static inline void atomic_set(Atomic *v, int i)
+{
+v-counter = i;
+}
+
+/**
+ *  * atomic_add - add integer to atomic variable
+ *   * @i: integer value to add
+ ** @v: pointer of type Atomic
+ * *
+ *  * Atomically adds @i to @v.
+ *   */
+static inline void atomic_add(int i, Atomic *v)
+{
+asm volatile(lock; addl %1,%0
+ : +m (v-counter)
+ : ir (i));
+}
+
+/**
+ *  * atomic_sub - subtract integer from atomic variable
+ *   * @i: integer value to subtract
+ ** @v: pointer of type Atomic
+ * *
+ *  * Atomically subtracts @i from @v.
+ *   */
+static inline void atomic_sub(int i, Atomic *v)
+{
+asm volatile(lock; subl %1,%0
+ : +m (v-counter)
+ : ir (i));
+}
+
+/**
+ *  * atomic_sub_and_test - subtract value from variable and test result
+ *   * @i: integer value to subtract
+ ** @v: pointer of type Atomic
+ * *
+ *  * Atomically subtracts @i from @v and returns
+ *   * true if the result is zero, or false for all
+ ** other cases.
+ * */
+static inline int atomic_sub_and_test(int i, Atomic *v)
+{
+unsigned char c;
+
+asm volatile(lock; subl %2,%0; sete %1
+ : +m (v-counter), =qm (c)
+ : ir (i) : memory);
+return c;
+}
+
+/**
+ *  * atomic_inc - increment atomic variable
+ *   * @v: pointer of type Atomic
+ **
+ * * Atomically increments @v by 1.
+ *  */
+static inline void atomic_inc(Atomic *v)
+{
+asm volatile(lock; incl %0
+ : +m (v-counter));
+}
+
+/**
+ *  * atomic_dec - decrement atomic variable
+ *   * @v: pointer of type Atomic
+ **
+ * * Atomically decrements @v by 1.
+ *  */
+static inline void atomic_dec(Atomic *v)
+{
+asm volatile(lock; decl %0
+ : +m (v-counter));
+}
+
+/**
+ *  * atomic_dec_and_test - decrement and test
+ *   * @v: pointer of type Atomic
+ **
+ * * Atomically decrements @v by 1 and
+ *  * returns true if the result is 0, or false for all other
+ *   * cases.
+ **/
+static inline int atomic_dec_and_test(Atomic *v)
+{
+unsigned char c;
+
+asm volatile(lock; decl %0; sete %1
+ : +m (v-counter), =qm (c)
+ : : memory);
+return c != 0;
+}
+
+/**
+ *  * atomic_inc_and_test - increment and test
+ *   * @v: pointer of type Atomic
+ **
+ * * Atomically increments @v by 1
+ *  * and returns true if the result is zero, or false for all
+ *   * other cases.
+ **/
+static inline int atomic_inc_and_test(Atomic *v)
+{
+unsigned char c;
+
+asm volatile(lock; incl %0; sete %1
+ : +m (v-counter), =qm (c)
+ : : memory);
+return c != 0;
+}
+
+static inline int atomic_add_and_return(int i, Atomic *v)
+{
+int ret = i;
+
+asm volatile (lock; xaddl %0, %1
+: +r (ret), +m (v-counter)
+: : memory, cc);
+
+return ret + i;
+}
+#endif
+
+#endif
-- 
1.7.4.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 07/15] memory: inc/dec mr's ref when adding/removing from mem view

2012-08-08 Thread Liu Ping Fan
From: Liu Ping Fan pingf...@linux.vnet.ibm.com

memory_region_{add,del}_subregion will inc/dec mr's refcnt.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 memory.c |4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/memory.c b/memory.c
index 5dc8b59..2eaa2fc 100644
--- a/memory.c
+++ b/memory.c
@@ -1356,7 +1356,7 @@ static void 
memory_region_add_subregion_common(MemoryRegion *mr,
 assert(!subregion-parent);
 subregion-parent = mr;
 subregion-addr = offset;
-
+memory_region_get(subregion);
 qemu_mutex_lock(mem_map_lock);
 QTAILQ_FOREACH(other, mr-subregions, subregions_link) {
 if (subregion-may_overlap || other-may_overlap) {
@@ -1420,6 +1420,8 @@ void memory_region_del_subregion(MemoryRegion *mr,
 qemu_mutex_lock(mem_map_lock);
 QTAILQ_REMOVE(mr-subregions, subregion, subregions_link);
 memory_region_update_topology(mr);
+/* mr may be still in use by reader of radix, must delay to release */
+memory_region_put(subregion);
 qemu_mutex_unlock(mem_map_lock);
 }
 
-- 
1.7.4.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 08/15] memory: introduce PhysMap to present snapshot of toploygy

2012-08-08 Thread Liu Ping Fan
From: Liu Ping Fan pingf...@linux.vnet.ibm.com

PhysMap contain the flatview and radix-tree view, they are snapshot
of system topology and should be consistent. With PhysMap, we can
swap the pointer when updating and achieve the atomic.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 exec.c   |8 
 memory.c |   33 -
 memory.h |   62 --
 3 files changed, 60 insertions(+), 43 deletions(-)

diff --git a/exec.c b/exec.c
index 0e29ef9..01b91b0 100644
--- a/exec.c
+++ b/exec.c
@@ -156,8 +156,6 @@ typedef struct PageDesc {
 #endif
 
 /* Size of the L2 (and L3, etc) page tables.  */
-#define L2_BITS 10
-#define L2_SIZE (1  L2_BITS)
 
 #define P_L2_LEVELS \
 (((TARGET_PHYS_ADDR_SPACE_BITS - TARGET_PAGE_BITS - 1) / L2_BITS) + 1)
@@ -185,7 +183,6 @@ uintptr_t qemu_host_page_mask;
 static void *l1_map[V_L1_SIZE];
 
 #if !defined(CONFIG_USER_ONLY)
-typedef struct PhysPageEntry PhysPageEntry;
 
 static MemoryRegionSection *phys_sections;
 static unsigned phys_sections_nb, phys_sections_nb_alloc;
@@ -194,11 +191,6 @@ static uint16_t phys_section_notdirty;
 static uint16_t phys_section_rom;
 static uint16_t phys_section_watch;
 
-struct PhysPageEntry {
-uint16_t is_leaf : 1;
- /* index into phys_sections (is_leaf) or phys_map_nodes (!is_leaf) */
-uint16_t ptr : 15;
-};
 
 /* Simple allocator for PhysPageEntry nodes */
 static PhysPageEntry (*phys_map_nodes)[L2_SIZE];
diff --git a/memory.c b/memory.c
index 2eaa2fc..c7f2cfd 100644
--- a/memory.c
+++ b/memory.c
@@ -31,17 +31,6 @@ static bool global_dirty_log = false;
 static QTAILQ_HEAD(memory_listeners, MemoryListener) memory_listeners
 = QTAILQ_HEAD_INITIALIZER(memory_listeners);
 
-typedef struct AddrRange AddrRange;
-
-/*
- * Note using signed integers limits us to physical addresses at most
- * 63 bits wide.  They are needed for negative offsetting in aliases
- * (large MemoryRegion::alias_offset).
- */
-struct AddrRange {
-Int128 start;
-Int128 size;
-};
 
 static AddrRange addrrange_make(Int128 start, Int128 size)
 {
@@ -197,28 +186,6 @@ static bool 
memory_region_ioeventfd_equal(MemoryRegionIoeventfd a,
  !memory_region_ioeventfd_before(b, a);
 }
 
-typedef struct FlatRange FlatRange;
-typedef struct FlatView FlatView;
-
-/* Range of memory in the global map.  Addresses are absolute. */
-struct FlatRange {
-MemoryRegion *mr;
-target_phys_addr_t offset_in_region;
-AddrRange addr;
-uint8_t dirty_log_mask;
-bool readable;
-bool readonly;
-};
-
-/* Flattened global view of current active memory hierarchy.  Kept in sorted
- * order.
- */
-struct FlatView {
-FlatRange *ranges;
-unsigned nr;
-unsigned nr_allocated;
-};
-
 typedef struct AddressSpace AddressSpace;
 typedef struct AddressSpaceOps AddressSpaceOps;
 
diff --git a/memory.h b/memory.h
index 740f018..357edd8 100644
--- a/memory.h
+++ b/memory.h
@@ -29,12 +29,72 @@
 #include qemu-thread.h
 #include qemu/reclaimer.h
 
+typedef struct AddrRange AddrRange;
+typedef struct FlatRange FlatRange;
+typedef struct FlatView FlatView;
+typedef struct PhysPageEntry PhysPageEntry;
+typedef struct PhysMap PhysMap;
+typedef struct MemoryRegionSection MemoryRegionSection;
 typedef struct MemoryRegionOps MemoryRegionOps;
 typedef struct MemoryRegionLifeOps MemoryRegionLifeOps;
 typedef struct MemoryRegion MemoryRegion;
 typedef struct MemoryRegionPortio MemoryRegionPortio;
 typedef struct MemoryRegionMmio MemoryRegionMmio;
 
+/*
+ * Note using signed integers limits us to physical addresses at most
+ * 63 bits wide.  They are needed for negative offsetting in aliases
+ * (large MemoryRegion::alias_offset).
+ */
+struct AddrRange {
+Int128 start;
+Int128 size;
+};
+
+/* Range of memory in the global map.  Addresses are absolute. */
+struct FlatRange {
+MemoryRegion *mr;
+target_phys_addr_t offset_in_region;
+AddrRange addr;
+uint8_t dirty_log_mask;
+bool readable;
+bool readonly;
+};
+
+/* Flattened global view of current active memory hierarchy.  Kept in sorted
+ * order.
+ */
+struct FlatView {
+FlatRange *ranges;
+unsigned nr;
+unsigned nr_allocated;
+};
+
+struct PhysPageEntry {
+uint16_t is_leaf:1;
+ /* index into phys_sections (is_leaf) or phys_map_nodes (!is_leaf) */
+uint16_t ptr:15;
+};
+
+#define L2_BITS 10
+#define L2_SIZE (1  L2_BITS)
+/* This is a multi-level map on the physical address space.
+   The bottom level has pointers to MemoryRegionSections.  */
+struct PhysMap {
+Atomic ref;
+PhysPageEntry root;
+PhysPageEntry (*phys_map_nodes)[L2_SIZE];
+unsigned phys_map_nodes_nb;
+unsigned phys_map_nodes_nb_alloc;
+
+MemoryRegionSection *phys_sections;
+unsigned phys_sections_nb;
+unsigned phys_sections_nb_alloc;
+
+/* FlatView */
+FlatView views[2];
+};
+
 /* Must match *_DIRTY_FLAGS in cpu-all.h.  To be replaced with dynamic
  * 

[PATCH 09/15] memory: prepare flatview and radix-tree for rcu style access

2012-08-08 Thread Liu Ping Fan
From: Liu Ping Fan pingf...@linux.vnet.ibm.com

Flatview and radix view are all under the protection of pointer.
And this make sure the change of them seem to be atomic!

The mr accessed by radix-tree leaf or flatview will be reclaimed
after the prev PhysMap not in use any longer

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 exec.c  |  303 +++---
 hw/vhost.c  |2 +-
 hw/xen_pt.c |2 +-
 kvm-all.c   |2 +-
 memory.c|   92 ++-
 memory.h|9 ++-
 vl.c|1 +
 xen-all.c   |2 +-
 8 files changed, 286 insertions(+), 127 deletions(-)

diff --git a/exec.c b/exec.c
index 01b91b0..97addb9 100644
--- a/exec.c
+++ b/exec.c
@@ -24,6 +24,7 @@
 #include sys/mman.h
 #endif
 
+#include qemu/atomic.h
 #include qemu-common.h
 #include cpu.h
 #include tcg.h
@@ -35,6 +36,8 @@
 #include qemu-timer.h
 #include memory.h
 #include exec-memory.h
+#include qemu-thread.h
+#include qemu/reclaimer.h
 #if defined(CONFIG_USER_ONLY)
 #include qemu.h
 #if defined(__FreeBSD__) || defined(__FreeBSD_kernel__)
@@ -184,25 +187,17 @@ static void *l1_map[V_L1_SIZE];
 
 #if !defined(CONFIG_USER_ONLY)
 
-static MemoryRegionSection *phys_sections;
-static unsigned phys_sections_nb, phys_sections_nb_alloc;
 static uint16_t phys_section_unassigned;
 static uint16_t phys_section_notdirty;
 static uint16_t phys_section_rom;
 static uint16_t phys_section_watch;
 
-
-/* Simple allocator for PhysPageEntry nodes */
-static PhysPageEntry (*phys_map_nodes)[L2_SIZE];
-static unsigned phys_map_nodes_nb, phys_map_nodes_nb_alloc;
-
 #define PHYS_MAP_NODE_NIL (((uint16_t)~0)  1)
 
-/* This is a multi-level map on the physical address space.
-   The bottom level has pointers to MemoryRegionSections.  */
-static PhysPageEntry phys_map = { .ptr = PHYS_MAP_NODE_NIL, .is_leaf = 0 };
-
+static QemuMutex cur_map_lock;
+static PhysMap *cur_map;
 QemuMutex mem_map_lock;
+static PhysMap *next_map;
 
 static void io_mem_init(void);
 static void memory_map_init(void);
@@ -383,41 +378,38 @@ static inline PageDesc *page_find(tb_page_addr_t index)
 
 #if !defined(CONFIG_USER_ONLY)
 
-static void phys_map_node_reserve(unsigned nodes)
+static void phys_map_node_reserve(PhysMap *map, unsigned nodes)
 {
-if (phys_map_nodes_nb + nodes  phys_map_nodes_nb_alloc) {
+if (map-phys_map_nodes_nb + nodes  map-phys_map_nodes_nb_alloc) {
 typedef PhysPageEntry Node[L2_SIZE];
-phys_map_nodes_nb_alloc = MAX(phys_map_nodes_nb_alloc * 2, 16);
-phys_map_nodes_nb_alloc = MAX(phys_map_nodes_nb_alloc,
-  phys_map_nodes_nb + nodes);
-phys_map_nodes = g_renew(Node, phys_map_nodes,
- phys_map_nodes_nb_alloc);
+map-phys_map_nodes_nb_alloc = MAX(map-phys_map_nodes_nb_alloc * 2,
+16);
+map-phys_map_nodes_nb_alloc = MAX(map-phys_map_nodes_nb_alloc,
+  map-phys_map_nodes_nb + nodes);
+map-phys_map_nodes = g_renew(Node, map-phys_map_nodes,
+ map-phys_map_nodes_nb_alloc);
 }
 }
 
-static uint16_t phys_map_node_alloc(void)
+static uint16_t phys_map_node_alloc(PhysMap *map)
 {
 unsigned i;
 uint16_t ret;
 
-ret = phys_map_nodes_nb++;
+ret = map-phys_map_nodes_nb++;
 assert(ret != PHYS_MAP_NODE_NIL);
-assert(ret != phys_map_nodes_nb_alloc);
+assert(ret != map-phys_map_nodes_nb_alloc);
 for (i = 0; i  L2_SIZE; ++i) {
-phys_map_nodes[ret][i].is_leaf = 0;
-phys_map_nodes[ret][i].ptr = PHYS_MAP_NODE_NIL;
+map-phys_map_nodes[ret][i].is_leaf = 0;
+map-phys_map_nodes[ret][i].ptr = PHYS_MAP_NODE_NIL;
 }
 return ret;
 }
 
-static void phys_map_nodes_reset(void)
-{
-phys_map_nodes_nb = 0;
-}
-
-
-static void phys_page_set_level(PhysPageEntry *lp, target_phys_addr_t *index,
-target_phys_addr_t *nb, uint16_t leaf,
+static void phys_page_set_level(PhysMap *map, PhysPageEntry *lp,
+target_phys_addr_t *index,
+target_phys_addr_t *nb,
+uint16_t leaf,
 int level)
 {
 PhysPageEntry *p;
@@ -425,8 +417,8 @@ static void phys_page_set_level(PhysPageEntry *lp, 
target_phys_addr_t *index,
 target_phys_addr_t step = (target_phys_addr_t)1  (level * L2_BITS);
 
 if (!lp-is_leaf  lp-ptr == PHYS_MAP_NODE_NIL) {
-lp-ptr = phys_map_node_alloc();
-p = phys_map_nodes[lp-ptr];
+lp-ptr = phys_map_node_alloc(map);
+p = map-phys_map_nodes[lp-ptr];
 if (level == 0) {
 for (i = 0; i  L2_SIZE; i++) {
 p[i].is_leaf = 1;
@@ -434,7 +426,7 @@ static void phys_page_set_level(PhysPageEntry *lp, 
target_phys_addr_t *index,
 }
 }
 } else 

[PATCH 10/15] memory: change tcg related code to using PhysMap

2012-08-08 Thread Liu Ping Fan
From: Liu Ping Fan pingf...@linux.vnet.ibm.com

Change tcg code to use PhysMap.
This is separated from the prev patch for review purpose. Should be
merged into prev one.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 exec.c |   27 +--
 1 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/exec.c b/exec.c
index 97addb9..8d0dea5 100644
--- a/exec.c
+++ b/exec.c
@@ -1923,6 +1923,7 @@ target_phys_addr_t 
memory_region_section_get_iotlb(CPUArchState *env,
 {
 target_phys_addr_t iotlb;
 CPUWatchpoint *wp;
+PhysMap *map = cur_map_get();
 
 if (memory_region_is_ram(section-mr)) {
 /* Normal RAM.  */
@@ -1940,7 +1941,7 @@ target_phys_addr_t 
memory_region_section_get_iotlb(CPUArchState *env,
and avoid full address decoding in every device.
We can't use the high bits of pd for this because
IO_MEM_ROMD uses these as a ram address.  */
-iotlb = section - phys_sections;
+iotlb = section - map-phys_sections;
 iotlb += memory_region_section_addr(section, paddr);
 }
 
@@ -1956,6 +1957,7 @@ target_phys_addr_t 
memory_region_section_get_iotlb(CPUArchState *env,
 }
 }
 }
+physmap_put(map);
 
 return iotlb;
 }
@@ -3185,7 +3187,12 @@ static uint16_t dummy_section(PhysMap *map, MemoryRegion 
*mr)
 
 MemoryRegion *iotlb_to_region(target_phys_addr_t index)
 {
-return phys_sections[index  ~TARGET_PAGE_MASK].mr;
+MemoryRegion *ret;
+PhysMap *map = cur_map_get();
+
+ret = map-phys_sections[index  ~TARGET_PAGE_MASK].mr;
+physmap_put(map);
+return ret;
 }
 
 static void io_mem_init(void)
@@ -3946,13 +3953,14 @@ void stl_phys_notdirty(target_phys_addr_t addr, 
uint32_t val)
 {
 uint8_t *ptr;
 MemoryRegionSection *section;
+PhysMap *map = cur_map_get();
 
 section = phys_page_find(addr  TARGET_PAGE_BITS);
 
 if (!memory_region_is_ram(section-mr) || section-readonly) {
 addr = memory_region_section_addr(section, addr);
 if (memory_region_is_ram(section-mr)) {
-section = phys_sections[phys_section_rom];
+section = map-phys_sections[phys_section_rom];
 }
 io_mem_write(section-mr, addr, val, 4);
 } else {
@@ -3972,19 +3980,21 @@ void stl_phys_notdirty(target_phys_addr_t addr, 
uint32_t val)
 }
 }
 }
+physmap_put(map);
 }
 
 void stq_phys_notdirty(target_phys_addr_t addr, uint64_t val)
 {
 uint8_t *ptr;
 MemoryRegionSection *section;
+PhysMap *map = cur_map_get();
 
 section = phys_page_find(addr  TARGET_PAGE_BITS);
 
 if (!memory_region_is_ram(section-mr) || section-readonly) {
 addr = memory_region_section_addr(section, addr);
 if (memory_region_is_ram(section-mr)) {
-section = phys_sections[phys_section_rom];
+section = map-phys_sections[phys_section_rom];
 }
 #ifdef TARGET_WORDS_BIGENDIAN
 io_mem_write(section-mr, addr, val  32, 4);
@@ -3999,6 +4009,7 @@ void stq_phys_notdirty(target_phys_addr_t addr, uint64_t 
val)
+ memory_region_section_addr(section, addr));
 stq_p(ptr, val);
 }
+physmap_put(map);
 }
 
 /* warning: addr must be aligned */
@@ -4008,12 +4019,13 @@ static inline void stl_phys_internal(target_phys_addr_t 
addr, uint32_t val,
 uint8_t *ptr;
 MemoryRegionSection *section;
 
+PhysMap *map = cur_map_get();
 section = phys_page_find(addr  TARGET_PAGE_BITS);
 
 if (!memory_region_is_ram(section-mr) || section-readonly) {
 addr = memory_region_section_addr(section, addr);
 if (memory_region_is_ram(section-mr)) {
-section = phys_sections[phys_section_rom];
+section = map-phys_sections[phys_section_rom];
 }
 #if defined(TARGET_WORDS_BIGENDIAN)
 if (endian == DEVICE_LITTLE_ENDIAN) {
@@ -4050,6 +4062,7 @@ static inline void stl_phys_internal(target_phys_addr_t 
addr, uint32_t val,
 (0xff  ~CODE_DIRTY_FLAG));
 }
 }
+physmap_put(map);
 }
 
 void stl_phys(target_phys_addr_t addr, uint32_t val)
@@ -4081,12 +4094,13 @@ static inline void stw_phys_internal(target_phys_addr_t 
addr, uint32_t val,
 uint8_t *ptr;
 MemoryRegionSection *section;
 
+PhysMap *map = cur_map_get();
 section = phys_page_find(addr  TARGET_PAGE_BITS);
 
 if (!memory_region_is_ram(section-mr) || section-readonly) {
 addr = memory_region_section_addr(section, addr);
 if (memory_region_is_ram(section-mr)) {
-section = phys_sections[phys_section_rom];
+section = map-phys_sections[phys_section_rom];
 }
 #if defined(TARGET_WORDS_BIGENDIAN)
 if (endian == DEVICE_LITTLE_ENDIAN) {
@@ -4123,6 +4137,7 @@ static inline void stw_phys_internal(target_phys_addr_t 
addr, uint32_t val,
 (0xff  ~CODE_DIRTY_FLAG));
 }
 }
+physmap_put(map);
 }
 
 void 

[PATCH 11/15] lock: introduce global lock for device tree

2012-08-08 Thread Liu Ping Fan
From: Liu Ping Fan pingf...@linux.vnet.ibm.com

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 cpus.c  |   12 
 main-loop.h |3 +++
 2 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/cpus.c b/cpus.c
index b182b3d..a734b36 100644
--- a/cpus.c
+++ b/cpus.c
@@ -611,6 +611,7 @@ static void qemu_tcg_init_cpu_signals(void)
 }
 #endif /* _WIN32 */
 
+QemuMutex qemu_device_tree_mutex;
 QemuMutex qemu_global_mutex;
 static QemuCond qemu_io_proceeded_cond;
 static bool iothread_requesting_mutex;
@@ -634,6 +635,7 @@ void qemu_init_cpu_loop(void)
 qemu_cond_init(qemu_work_cond);
 qemu_cond_init(qemu_io_proceeded_cond);
 qemu_mutex_init(qemu_global_mutex);
+qemu_mutex_init(qemu_device_tree_mutex);
 
 qemu_thread_get_self(io_thread);
 }
@@ -911,6 +913,16 @@ void qemu_mutex_unlock_iothread(void)
 qemu_mutex_unlock(qemu_global_mutex);
 }
 
+void qemu_lock_devtree(void)
+{
+qemu_mutex_lock(qemu_device_tree_mutex);
+}
+
+void qemu_unlock_devtree(void)
+{
+qemu_mutex_unlock(qemu_device_tree_mutex);
+}
+
 static int all_vcpus_paused(void)
 {
 CPUArchState *penv = first_cpu;
diff --git a/main-loop.h b/main-loop.h
index dce1cd9..17e959a 100644
--- a/main-loop.h
+++ b/main-loop.h
@@ -353,6 +353,9 @@ void qemu_mutex_lock_iothread(void);
  */
 void qemu_mutex_unlock_iothread(void);
 
+void qemu_lock_devtree(void);
+void qemu_unlock_devtree(void);
+
 /* internal interfaces */
 
 void qemu_fd_register(int fd);
-- 
1.7.4.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 12/15] qdev: using devtree lock to protect device's accessing

2012-08-08 Thread Liu Ping Fan
From: Liu Ping Fan pingf...@linux.vnet.ibm.com

lock:
  qemu_device_tree_mutex

competitors:
  --device_del(destruction of device will be postphoned until unplug
ack from guest),
  --pci hot-unplug
  --iteration (qdev_reset_all)
  --device_add

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 hw/pci-hotplug.c  |4 
 hw/qdev-monitor.c |   17 -
 hw/qdev.c |2 ++
 3 files changed, 22 insertions(+), 1 deletions(-)

diff --git a/hw/pci-hotplug.c b/hw/pci-hotplug.c
index e7fb780..33a9dfe 100644
--- a/hw/pci-hotplug.c
+++ b/hw/pci-hotplug.c
@@ -265,9 +265,11 @@ static int pci_device_hot_remove(Monitor *mon, const char 
*pci_addr)
 return -1;
 }
 
+qemu_lock_devtree();
 d = pci_find_device(pci_find_root_bus(dom), bus, PCI_DEVFN(slot, 0));
 if (!d) {
 monitor_printf(mon, slot %d empty\n, slot);
+qemu_unlock_devtree();
 return -1;
 }
 
@@ -275,9 +277,11 @@ static int pci_device_hot_remove(Monitor *mon, const char 
*pci_addr)
 if (error_is_set(local_err)) {
 monitor_printf(mon, %s\n, error_get_pretty(local_err));
 error_free(local_err);
+qemu_unlock_devtree();
 return -1;
 }
 
+qemu_unlock_devtree();
 return 0;
 }
 
diff --git a/hw/qdev-monitor.c b/hw/qdev-monitor.c
index 7915b45..2d47fe0 100644
--- a/hw/qdev-monitor.c
+++ b/hw/qdev-monitor.c
@@ -429,14 +429,18 @@ DeviceState *qdev_device_add(QemuOpts *opts)
 
 /* find bus */
 path = qemu_opt_get(opts, bus);
+
+qemu_lock_devtree();
 if (path != NULL) {
 bus = qbus_find(path);
 if (!bus) {
+qemu_unlock_devtree();
 return NULL;
 }
 if (strcmp(object_get_typename(OBJECT(bus)), k-bus_type) != 0) {
 qerror_report(QERR_BAD_BUS_FOR_DEVICE,
   driver, object_get_typename(OBJECT(bus)));
+qemu_unlock_devtree();
 return NULL;
 }
 } else {
@@ -444,11 +448,13 @@ DeviceState *qdev_device_add(QemuOpts *opts)
 if (!bus) {
 qerror_report(QERR_NO_BUS_FOR_DEVICE,
   driver, k-bus_type);
+qemu_unlock_devtree();
 return NULL;
 }
 }
 if (qdev_hotplug  !bus-allow_hotplug) {
 qerror_report(QERR_BUS_NO_HOTPLUG, bus-name);
+qemu_unlock_devtree();
 return NULL;
 }
 
@@ -466,6 +472,7 @@ DeviceState *qdev_device_add(QemuOpts *opts)
 }
 if (qemu_opt_foreach(opts, set_property, qdev, 1) != 0) {
 qdev_free(qdev);
+qemu_unlock_devtree();
 return NULL;
 }
 if (qdev-id) {
@@ -478,6 +485,8 @@ DeviceState *qdev_device_add(QemuOpts *opts)
   OBJECT(qdev), NULL);
 g_free(name);
 }
+qemu_unlock_devtree();
+
 if (qdev_init(qdev)  0) {
 qerror_report(QERR_DEVICE_INIT_FAILED, driver);
 return NULL;
@@ -600,13 +609,19 @@ void qmp_device_del(const char *id, Error **errp)
 {
 DeviceState *dev;
 
+/* protect against unplug ack from guest, where we really remove device
+ * from system
+ */
+qemu_lock_devtree();
 dev = qdev_find_recursive(sysbus_get_default(), id);
 if (NULL == dev) {
 error_set(errp, QERR_DEVICE_NOT_FOUND, id);
+qemu_unlock_devtree();
 return;
 }
-
+/* Just remove from system, and drop refcnt there*/
 qdev_unplug(dev, errp);
+qemu_unlock_devtree();
 }
 
 void qdev_machine_init(void)
diff --git a/hw/qdev.c b/hw/qdev.c
index af54467..17525fe 100644
--- a/hw/qdev.c
+++ b/hw/qdev.c
@@ -230,7 +230,9 @@ static int qbus_reset_one(BusState *bus, void *opaque)
 
 void qdev_reset_all(DeviceState *dev)
 {
+qemu_lock_devtree();
 qdev_walk_children(dev, qdev_reset_one, qbus_reset_one, NULL);
+qemu_unlock_devtree();
 }
 
 void qbus_reset_all_fn(void *opaque)
-- 
1.7.4.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 13/15] hotplug: introduce qdev_unplug_complete() to remove device from views

2012-08-08 Thread Liu Ping Fan
From: Liu Ping Fan pingf...@linux.vnet.ibm.com

When guest confirm the removal of device, we should
--unmap from MemoryRegion view
--isolated from device tree view

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 hw/acpi_piix4.c |4 ++--
 hw/pci.c|   13 -
 hw/pci.h|2 ++
 hw/qdev.c   |   28 
 hw/qdev.h   |3 ++-
 5 files changed, 46 insertions(+), 4 deletions(-)

diff --git a/hw/acpi_piix4.c b/hw/acpi_piix4.c
index 0aace60..c209ff7 100644
--- a/hw/acpi_piix4.c
+++ b/hw/acpi_piix4.c
@@ -305,8 +305,8 @@ static void acpi_piix_eject_slot(PIIX4PMState *s, unsigned 
slots)
 if (pc-no_hotplug) {
 slot_free = false;
 } else {
-object_unparent(OBJECT(dev));
-qdev_free(qdev);
+/* refcnt will be decreased */
+qdev_unplug_complete(qdev, NULL);
 }
 }
 }
diff --git a/hw/pci.c b/hw/pci.c
index 99a4304..2095abf 100644
--- a/hw/pci.c
+++ b/hw/pci.c
@@ -856,12 +856,22 @@ static int pci_unregister_device(DeviceState *dev)
 if (ret)
 return ret;
 
-pci_unregister_io_regions(pci_dev);
 pci_del_option_rom(pci_dev);
 do_pci_unregister_device(pci_dev);
 return 0;
 }
 
+static void pci_unmap_device(DeviceState *dev)
+{
+PCIDevice *pci_dev = PCI_DEVICE(dev);
+PCIDeviceClass *pc = PCI_DEVICE_GET_CLASS(pci_dev);
+
+pci_unregister_io_regions(pci_dev);
+if (pc-unmap) {
+pc-unmap(pci_dev);
+}
+}
+
 void pci_register_bar(PCIDevice *pci_dev, int region_num,
   uint8_t type, MemoryRegion *memory)
 {
@@ -2022,6 +2032,7 @@ static void pci_device_class_init(ObjectClass *klass, 
void *data)
 DeviceClass *k = DEVICE_CLASS(klass);
 k-init = pci_qdev_init;
 k-unplug = pci_unplug_device;
+k-unmap = pci_unmap_device;
 k-exit = pci_unregister_device;
 k-bus_type = TYPE_PCI_BUS;
 k-props = pci_props;
diff --git a/hw/pci.h b/hw/pci.h
index 79d38fd..1c5b909 100644
--- a/hw/pci.h
+++ b/hw/pci.h
@@ -145,6 +145,8 @@ typedef struct PCIDeviceClass {
 DeviceClass parent_class;
 
 int (*init)(PCIDevice *dev);
+void (*unmap)(PCIDevice *dev);
+
 PCIUnregisterFunc *exit;
 PCIConfigReadFunc *config_read;
 PCIConfigWriteFunc *config_write;
diff --git a/hw/qdev.c b/hw/qdev.c
index 17525fe..530eabe 100644
--- a/hw/qdev.c
+++ b/hw/qdev.c
@@ -104,6 +104,14 @@ void qdev_set_parent_bus(DeviceState *dev, BusState *bus)
 bus_add_child(bus, dev);
 }
 
+static void qdev_unset_parent(DeviceState *dev)
+{
+BusState *b = dev-parent_bus;
+
+object_unparent(OBJECT(dev));
+bus_remove_child(b, dev);
+}
+
 /* Create a new device.  This only initializes the device state structure
and allows properties to be set.  qdev_init should be called to
initialize the actual device emulation.  */
@@ -194,6 +202,26 @@ void qdev_set_legacy_instance_id(DeviceState *dev, int 
alias_id,
 dev-alias_required_for_version = required_for_version;
 }
 
+static int qdev_unmap(DeviceState *dev)
+{
+DeviceClass *dc =  DEVICE_GET_CLASS(dev);
+if (dc-unmap) {
+dc-unmap(dev);
+}
+return 0;
+}
+
+void qdev_unplug_complete(DeviceState *dev, Error **errp)
+{
+/* isolate from mem view */
+qdev_unmap(dev);
+qemu_lock_devtree();
+/* isolate from device tree */
+qdev_unset_parent(dev);
+qemu_unlock_devtree();
+object_unref(OBJECT(dev));
+}
+
 void qdev_unplug(DeviceState *dev, Error **errp)
 {
 DeviceClass *dc = DEVICE_GET_CLASS(dev);
diff --git a/hw/qdev.h b/hw/qdev.h
index f4683dc..705635a 100644
--- a/hw/qdev.h
+++ b/hw/qdev.h
@@ -47,7 +47,7 @@ typedef struct DeviceClass {
 
 /* callbacks */
 void (*reset)(DeviceState *dev);
-
+void (*unmap)(DeviceState *dev);
 /* device state */
 const VMStateDescription *vmsd;
 
@@ -162,6 +162,7 @@ void qdev_init_nofail(DeviceState *dev);
 void qdev_set_legacy_instance_id(DeviceState *dev, int alias_id,
  int required_for_version);
 void qdev_unplug(DeviceState *dev, Error **errp);
+void qdev_unplug_complete(DeviceState *dev, Error **errp);
 void qdev_free(DeviceState *dev);
 int qdev_simple_unplug_cb(DeviceState *dev);
 void qdev_machine_creation_done(void);
-- 
1.7.4.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 14/15] qom: object_unref call reclaimer

2012-08-08 Thread Liu Ping Fan
From: Liu Ping Fan pingf...@linux.vnet.ibm.com

iohandler/bh/timer may use DeviceState when its refcnt=0,
postpone the reclaimer till they have done with it.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 qom/object.c |9 -
 1 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/qom/object.c b/qom/object.c
index 822bdb7..1452b1b 100644
--- a/qom/object.c
+++ b/qom/object.c
@@ -23,6 +23,8 @@
 #include qbool.h
 #include qint.h
 #include qstring.h
+#include hw/qdev.h
+#include qemu/reclaimer.h
 
 #define MAX_INTERFACES 32
 
@@ -646,7 +648,12 @@ void object_unref(Object *obj)
 {
 g_assert(atomic_read(obj-ref)  0);
 if (atomic_dec_and_test(obj-ref)) {
-object_finalize(obj);
+/* fixme, maybe introduce obj-finalze to make this more elegant */
+if (object_dynamic_cast(obj, TYPE_DEVICE) != NULL) {
+qemu_reclaimer_enqueue(obj, object_finalize);
+} else {
+object_finalize(obj);
+}
 }
 }
 
-- 
1.7.4.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 15/15] e1000: using new interface--unmap to unplug

2012-08-08 Thread Liu Ping Fan
From: Liu Ping Fan pingf...@linux.vnet.ibm.com

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 hw/e1000.c |8 
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/hw/e1000.c b/hw/e1000.c
index 4573f13..fa71455 100644
--- a/hw/e1000.c
+++ b/hw/e1000.c
@@ -1192,6 +1192,13 @@ e1000_cleanup(VLANClientState *nc)
 s-nic = NULL;
 }
 
+static void
+pci_e1000_unmap(PCIDevice *p)
+{
+/* DO NOT FREE anything!until refcnt=0 */
+/* isolate from memory view */
+}
+
 static int
 pci_e1000_uninit(PCIDevice *dev)
 {
@@ -1275,6 +1282,7 @@ static void e1000_class_init(ObjectClass *klass, void 
*data)
 PCIDeviceClass *k = PCI_DEVICE_CLASS(klass);
 
 k-init = pci_e1000_init;
+k-unmap = pci_e1000_unmap;
 k-exit = pci_e1000_uninit;
 k-romfile = pxe-e1000.rom;
 k-vendor_id = PCI_VENDOR_ID_INTEL;
-- 
1.7.4.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 6/8] KVM: PPC: E500: Implement MMU notifiers

2012-08-08 Thread Paul Mackerras
On Tue, Aug 07, 2012 at 12:57:14PM +0200, Alexander Graf wrote:
 The e500 target has lived without mmu notifiers ever since it got
 introduced, but fails for the user space check on them with hugetlbfs.

Ironically that user space check isn't necessary any more since David
Gibson's fix for the hugetlbfs bug went in (90481622, hugepages: fix
use after free bug in quota handling).  So on sufficiently recent
kernels you can just remove the userspace check.  Implementing
mmu-notifiers is a good thing for other reasons though.

Paul.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 8/8] KVM: PPC: Add cache flush on page map

2012-08-08 Thread Alexander Graf


On 07.08.2012, at 23:01, Scott Wood scottw...@freescale.com wrote:

 On 08/07/2012 05:57 AM, Alexander Graf wrote:
 When we map a page that wasn't icache cleared before, do so when first
 mapping it in KVM using the same information bits as the Linux mapping
 logic. That way we are 100% sure that any page we map does not have stale
 entries in the icache.
 
 What we really would need is a method to flush the icache only when we
 actually map a page executable for the guest. But with the current
 abstraction that is hard to do, and this way we don't have a huge performance
 penalty, so better be safe now and micro optimize things later.
 
 Signed-off-by: Alexander Graf ag...@suse.de
 ---
 arch/powerpc/include/asm/kvm_host.h |   10 ++
 arch/powerpc/mm/mem.c   |1 +
 2 files changed, 11 insertions(+), 0 deletions(-)
 
 diff --git a/arch/powerpc/include/asm/kvm_host.h 
 b/arch/powerpc/include/asm/kvm_host.h
 index ed75bc9..c0a2fc1 100644
 --- a/arch/powerpc/include/asm/kvm_host.h
 +++ b/arch/powerpc/include/asm/kvm_host.h
 @@ -33,6 +33,7 @@
 #include asm/kvm_asm.h
 #include asm/processor.h
 #include asm/page.h
 +#include asm/cacheflush.h
 
 #define KVM_MAX_VCPUSNR_CPUS
 #define KVM_MAX_VCORESNR_CPUS
 @@ -550,5 +551,14 @@ struct kvm_vcpu_arch {
 #define KVM_MMIO_REG_FQPR0x0060
 
 #define __KVM_HAVE_ARCH_WQP
 +#define __KVM_HAVE_ARCH_MAP_PAGE
 +static inline void kvm_arch_map_page(struct page *page)
 +{
 +/* Need to invalidate the icache for new pages */
 +if (!test_bit(PG_arch_1, page-flags)) {
 +flush_dcache_icache_page(page);
 +set_bit(PG_arch_1, page-flags);
 +}
 +}
 
 Shouldn't this test CPU_FTR_COHERENT_ICACHE?

flush_icache_range checks for it a bit further down the code stream. I mostly 
modeled things the same way as set_pre_filter.


Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 6/8] KVM: PPC: E500: Implement MMU notifiers

2012-08-08 Thread Alexander Graf


On 08.08.2012, at 05:31, Paul Mackerras pau...@samba.org wrote:

 On Tue, Aug 07, 2012 at 12:57:14PM +0200, Alexander Graf wrote:
 The e500 target has lived without mmu notifiers ever since it got
 introduced, but fails for the user space check on them with hugetlbfs.
 
 Ironically that user space check isn't necessary any more since David
 Gibson's fix for the hugetlbfs bug went in (90481622, hugepages: fix
 use after free bug in quota handling).  So on sufficiently recent
 kernels you can just remove the userspace check.  Implementing
 mmu-notifiers is a good thing for other reasons though.

Yeah, it's probably best to just require mmu notifiers from every target. It 
would however be good to have a CAP nevertheless (or change the semantics of 
SYNC_MMU) so that we can run HV KVM on 970 without changes to QEMU.


Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH V7 0/2] Improve virtio-blk performance

2012-08-08 Thread Asias He
Hi, all

Changes in v7:
- Using vbr-flags to trace request type
- Dropped unnecessary struct virtio_blk *vblk parameter
- Reuse struct virtblk_req in bio done function
- Added performance data on normal SATA device and the reason why make it 
optional

Fio test shows bio-based IO path gives the following performance improvement:

1) Ramdisk device
 With bio-based IO path, sequential read/write, random read/write
 IOPS boost : 28%, 24%, 21%, 16%
 Latency improvement: 32%, 17%, 21%, 16%
2) Fusion IO device
 With bio-based IO path, sequential read/write, random read/write
 IOPS boost : 11%, 11%, 13%, 10%
 Latency improvement: 10%, 10%, 12%, 10%
3) Normal SATA device
 With bio-based IO path, sequential read/write, random read/write
 IOPS boost : -10%, -10%, 4.4%, 0.5%
 Latency improvement: -12%, -15%, 2.5%, 0.8%

Asias He (2):
  virtio-blk: Add bio-based IO path for virtio-blk
  virtio-blk: Add REQ_FLUSH and REQ_FUA support to bio path

 drivers/block/virtio_blk.c | 301 +++--
 1 file changed, 264 insertions(+), 37 deletions(-)

-- 
1.7.11.2

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH V7 2/2] virtio-blk: Add REQ_FLUSH and REQ_FUA support to bio path

2012-08-08 Thread Asias He
We need to support both REQ_FLUSH and REQ_FUA for bio based path since
it does not get the sequencing of REQ_FUA into REQ_FLUSH that request
based drivers can request.

REQ_FLUSH is emulated by:
A) If the bio has no data to write:
1. Send VIRTIO_BLK_T_FLUSH to device,
2. In the flush I/O completion handler, finish the bio

B) If the bio has data to write:
1. Send VIRTIO_BLK_T_FLUSH to device
2. In the flush I/O completion handler, send the actual write data to device
3. In the write I/O completion handler, finish the bio

REQ_FUA is emulated by:
1. Send the actual write data to device
2. In the write I/O completion handler, send VIRTIO_BLK_T_FLUSH to device
3. In the flush I/O completion handler, finish the bio

Changes in v7:
- Using vbr-flags to trace request type
- Dropped unnecessary struct virtio_blk *vblk parameter
- Reuse struct virtblk_req in bio done function

Cahnges in v6:
- Reworked REQ_FLUSH and REQ_FUA emulatation order

Cc: Rusty Russell ru...@rustcorp.com.au
Cc: Jens Axboe ax...@kernel.dk
Cc: Christoph Hellwig h...@lst.de
Cc: Tejun Heo t...@kernel.org
Cc: Shaohua Li s...@kernel.org
Cc: Michael S. Tsirkin m...@redhat.com
Cc: kvm@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: virtualizat...@lists.linux-foundation.org
Signed-off-by: Asias He as...@redhat.com
---
 drivers/block/virtio_blk.c | 272 +++--
 1 file changed, 188 insertions(+), 84 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 95cfeed..2edfb5c 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -58,10 +58,20 @@ struct virtblk_req
struct bio *bio;
struct virtio_blk_outhdr out_hdr;
struct virtio_scsi_inhdr in_hdr;
+   struct work_struct work;
+   struct virtio_blk *vblk;
+   int flags;
u8 status;
struct scatterlist sg[];
 };
 
+enum {
+   VBLK_IS_FLUSH   = 1,
+   VBLK_REQ_FLUSH  = 2,
+   VBLK_REQ_DATA   = 4,
+   VBLK_REQ_FUA= 8,
+};
+
 static inline int virtblk_result(struct virtblk_req *vbr)
 {
switch (vbr-status) {
@@ -74,9 +84,133 @@ static inline int virtblk_result(struct virtblk_req *vbr)
}
 }
 
-static inline void virtblk_request_done(struct virtio_blk *vblk,
-   struct virtblk_req *vbr)
+static inline struct virtblk_req *virtblk_alloc_req(struct virtio_blk *vblk,
+   gfp_t gfp_mask)
+{
+   struct virtblk_req *vbr;
+
+   vbr = mempool_alloc(vblk-pool, gfp_mask);
+   if (vbr  use_bio)
+   sg_init_table(vbr-sg, vblk-sg_elems);
+
+   vbr-vblk = vblk;
+
+   return vbr;
+}
+
+static void virtblk_add_buf_wait(struct virtio_blk *vblk,
+struct virtblk_req *vbr,
+unsigned long out,
+unsigned long in)
+{
+   DEFINE_WAIT(wait);
+
+   for (;;) {
+   prepare_to_wait_exclusive(vblk-queue_wait, wait,
+ TASK_UNINTERRUPTIBLE);
+
+   spin_lock_irq(vblk-disk-queue-queue_lock);
+   if (virtqueue_add_buf(vblk-vq, vbr-sg, out, in, vbr,
+ GFP_ATOMIC)  0) {
+   spin_unlock_irq(vblk-disk-queue-queue_lock);
+   io_schedule();
+   } else {
+   virtqueue_kick(vblk-vq);
+   spin_unlock_irq(vblk-disk-queue-queue_lock);
+   break;
+   }
+
+   }
+
+   finish_wait(vblk-queue_wait, wait);
+}
+
+static inline void virtblk_add_req(struct virtblk_req *vbr,
+  unsigned int out, unsigned int in)
+{
+   struct virtio_blk *vblk = vbr-vblk;
+
+   spin_lock_irq(vblk-disk-queue-queue_lock);
+   if (unlikely(virtqueue_add_buf(vblk-vq, vbr-sg, out, in, vbr,
+   GFP_ATOMIC)  0)) {
+   spin_unlock_irq(vblk-disk-queue-queue_lock);
+   virtblk_add_buf_wait(vblk, vbr, out, in);
+   return;
+   }
+   virtqueue_kick(vblk-vq);
+   spin_unlock_irq(vblk-disk-queue-queue_lock);
+}
+
+static int virtblk_bio_send_flush(struct virtblk_req *vbr)
+{
+   unsigned int out = 0, in = 0;
+
+   vbr-flags |= VBLK_IS_FLUSH;
+   vbr-out_hdr.type = VIRTIO_BLK_T_FLUSH;
+   vbr-out_hdr.sector = 0;
+   vbr-out_hdr.ioprio = 0;
+   sg_set_buf(vbr-sg[out++], vbr-out_hdr, sizeof(vbr-out_hdr));
+   sg_set_buf(vbr-sg[out + in++], vbr-status, sizeof(vbr-status));
+
+   virtblk_add_req(vbr, out, in);
+
+   return 0;
+}
+
+static int virtblk_bio_send_data(struct virtblk_req *vbr)
 {
+   struct virtio_blk *vblk = vbr-vblk;
+   unsigned int num, out = 0, in = 0;
+   struct bio *bio = vbr-bio;
+
+   vbr-flags = ~VBLK_IS_FLUSH;
+   vbr-out_hdr.type = 0;

[PATCH V7 1/2] virtio-blk: Add bio-based IO path for virtio-blk

2012-08-08 Thread Asias He
This patch introduces bio-based IO path for virtio-blk.

Compared to request-based IO path, bio-based IO path uses driver
provided -make_request_fn() method to bypasses the IO scheduler. It
handles the bio to device directly without allocating a request in block
layer. This reduces the IO path in guest kernel to achieve high IOPS
and lower latency. The downside is that guest can not use the IO
scheduler to merge and sort requests. However, this is not a big problem
if the backend disk in host side uses faster disk device.

When the bio-based IO path is not enabled, virtio-blk still uses the
original request-based IO path, no performance difference is observed.

Using a slow device e.g. normal SATA disk, the bio-based IO path for
sequential read and write are slower than req-based IO path due to lack
of merge in guest kernel. So we make the bio-based path optional.

Performance evaluation:
-
1) Fio test is performed in a 8 vcpu guest with ramdisk based guest using
kvm tool.

Short version:
 With bio-based IO path, sequential read/write, random read/write
 IOPS boost : 28%, 24%, 21%, 16%
 Latency improvement: 32%, 17%, 21%, 16%

Long version:
 With bio-based IO path:
  seq-read  : io=2048.0MB, bw=116996KB/s, iops=233991 , runt= 17925msec
  seq-write : io=2048.0MB, bw=100829KB/s, iops=201658 , runt= 20799msec
  rand-read : io=3095.7MB, bw=112134KB/s, iops=224268 , runt= 28269msec
  rand-write: io=3095.7MB, bw=96198KB/s,  iops=192396 , runt= 32952msec
clat (usec): min=0 , max=2631.6K, avg=58716.99, stdev=191377.30
clat (usec): min=0 , max=1753.2K, avg=66423.25, stdev=81774.35
clat (usec): min=0 , max=2915.5K, avg=61685.70, stdev=120598.39
clat (usec): min=0 , max=1933.4K, avg=76935.12, stdev=96603.45
  cpu : usr=74.08%, sys=703.84%, ctx=29661403, majf=21354, minf=22460954
  cpu : usr=70.92%, sys=702.81%, ctx=77219828, majf=13980, minf=27713137
  cpu : usr=72.23%, sys=695.37%, ctx=88081059, majf=18475, minf=28177648
  cpu : usr=69.69%, sys=654.13%, ctx=145476035, majf=15867, minf=26176375
 With request-based IO path:
  seq-read  : io=2048.0MB, bw=91074KB/s, iops=182147 , runt= 23027msec
  seq-write : io=2048.0MB, bw=80725KB/s, iops=161449 , runt= 25979msec
  rand-read : io=3095.7MB, bw=92106KB/s, iops=184211 , runt= 34416msec
  rand-write: io=3095.7MB, bw=82815KB/s, iops=165630 , runt= 38277msec
clat (usec): min=0 , max=1932.4K, avg=77824.17, stdev=170339.49
clat (usec): min=0 , max=2510.2K, avg=78023.96, stdev=146949.15
clat (usec): min=0 , max=3037.2K, avg=74746.53, stdev=128498.27
clat (usec): min=0 , max=1363.4K, avg=89830.75, stdev=114279.68
  cpu : usr=53.28%, sys=724.19%, ctx=37988895, majf=17531, minf=23577622
  cpu : usr=49.03%, sys=633.20%, ctx=205935380, majf=18197, minf=27288959
  cpu : usr=55.78%, sys=722.40%, ctx=101525058, majf=19273, minf=28067082
  cpu : usr=56.55%, sys=690.83%, ctx=228205022, majf=18039, minf=26551985

2) Fio test is performed in a 8 vcpu guest with Fusion-IO based guest using
kvm tool.

Short version:
 With bio-based IO path, sequential read/write, random read/write
 IOPS boost : 11%, 11%, 13%, 10%
 Latency improvement: 10%, 10%, 12%, 10%
Long Version:
 With bio-based IO path:
  read : io=2048.0MB, bw=58920KB/s, iops=117840 , runt= 35593msec
  write: io=2048.0MB, bw=64308KB/s, iops=128616 , runt= 32611msec
  read : io=3095.7MB, bw=59633KB/s, iops=119266 , runt= 53157msec
  write: io=3095.7MB, bw=62993KB/s, iops=125985 , runt= 50322msec
clat (usec): min=0 , max=1284.3K, avg=128109.01, stdev=71513.29
clat (usec): min=94 , max=962339 , avg=116832.95, stdev=65836.80
clat (usec): min=0 , max=1846.6K, avg=128509.99, stdev=89575.07
clat (usec): min=0 , max=2256.4K, avg=121361.84, stdev=82747.25
  cpu : usr=56.79%, sys=421.70%, ctx=147335118, majf=21080, minf=19852517
  cpu : usr=61.81%, sys=455.53%, ctx=143269950, majf=16027, minf=24800604
  cpu : usr=63.10%, sys=455.38%, ctx=178373538, majf=16958, minf=24822612
  cpu : usr=62.04%, sys=453.58%, ctx=226902362, majf=16089, minf=23278105
 With request-based IO path:
  read : io=2048.0MB, bw=52896KB/s, iops=105791 , runt= 39647msec
  write: io=2048.0MB, bw=57856KB/s, iops=115711 , runt= 36248msec
  read : io=3095.7MB, bw=52387KB/s, iops=104773 , runt= 60510msec
  write: io=3095.7MB, bw=57310KB/s, iops=114619 , runt= 55312msec
clat (usec): min=0 , max=1532.6K, avg=142085.62, stdev=109196.84
clat (usec): min=0 , max=1487.4K, avg=129110.71, stdev=114973.64
clat (usec): min=0 , max=1388.6K, avg=145049.22, stdev=107232.55
clat (usec): min=0 , max=1465.9K, avg=133585.67, stdev=110322.95
  cpu : usr=44.08%, sys=590.71%, ctx=451812322, majf=14841, minf=17648641
  cpu : usr=48.73%, sys=610.78%, ctx=418953997, majf=22164, minf=26850689
  cpu : usr=45.58%, sys=581.16%, ctx=714079216, majf=21497, minf=22558223
  cpu : usr=48.40%, sys=599.65%, ctx=656089423, majf=16393, minf=23824409

3) Fio test is performed in a 8 vcpu guest with normal 

Re: KVM segfaults with 3.5 while installing ubuntu 12.04

2012-08-08 Thread Stefan Hajnoczi
On Wed, Aug 08, 2012 at 07:51:07AM +0200, Stefan Priebe wrote:
 Any news? Was this applied upstream?

Kevin is ill.  He has asked me to review and test patches in his
absence.  When he gets back later this week this will get picked up (and
included in QEMU 1.2).

Here is the tree, it includes this patch:

https://github.com/stefanha/qemu/commits/for-kevin

Stefan

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH 2/5] s390: Virtual channel subsystem support.

2012-08-08 Thread Cornelia Huck
On Tue, 7 Aug 2012 21:00:59 +
Blue Swirl blauwir...@gmail.com wrote:


  diff --git a/hw/s390x/css.c b/hw/s390x/css.c
  new file mode 100644
  index 000..7941c44
  --- /dev/null
  +++ b/hw/s390x/css.c
  @@ -0,0 +1,440 @@
  +/*
  + * Channel subsystem base support.
  + *
  + * Copyright 2012 IBM Corp.
  + * Author(s): Cornelia Huck cornelia.h...@de.ibm.com
  + *
  + * This work is licensed under the terms of the GNU GPL, version 2 or (at
  + * your option) any later version. See the COPYING file in the top-level
  + * directory.
  + */
  +
  +#include qemu-thread.h
  +#include qemu-queue.h
  +#include hw/qdev.h
  +#include kvm.h
  +#include cpu.h
  +#include ioinst.h
  +#include css.h
  +
  +struct chp_info {
 
 CamelCase, please.

OK.
 
  +uint8_t in_use;
  +uint8_t type;
  +};
  +
  +static struct chp_info chpids[MAX_CSSID + 1][MAX_CHPID + 1];
  +
  +static css_subch_cb_func css_subch_cb;
 
 Probably these can be put to a container structure which can be passed around.

Still trying to come up with a good model for that.

 

  +case CCW_CMD_SENSE_ID:
  +{
  +uint8_t sense_bytes[256];
  +
  +/* Sense ID information is device specific. */
  +memcpy(sense_bytes, sch-id, sizeof(sense_bytes));
  +if (check_len) {
  +if (ccw-count != sizeof(sense_bytes)) {
  +ret = -EINVAL;
  +break;
  +}
  +}
  +len = MIN(ccw-count, sizeof(sense_bytes));
  +/*
  + * Only indicate 0xff in the first sense byte if we actually
  + * have enough place to store at least bytes 0-3.
  + */
  +if (len = 4) {
  +stb_phys(ccw-cda, 0xff);
  +} else {
  +stb_phys(ccw-cda, 0);
  +}
  +i = 1;
  +for (i = 1; i  len - 1; i++) {
  +stb_phys(ccw-cda + i, sense_bytes[i]);
  +}
 
 cpu_physical_memory_write()

Hm, what's wrong with storing byte-by-byte?

 
  +sch-curr_status.scsw.count = ccw-count - len;
  +ret = 0;
  +break;
  +}
  +case CCW_CMD_TIC:
  +if (sch-last_cmd-cmd_code == CCW_CMD_TIC) {
  +ret = -EINVAL;
  +break;
  +}
  +if (ccw-flags  (CCW_FLAG_CC | CCW_FLAG_DC)) {
  +ret = -EINVAL;
  +break;
  +}
  +sch-channel_prog = qemu_get_ram_ptr(ccw-cda);
  +ret = sch-channel_prog ? -EAGAIN : -EFAULT;
  +break;
  +default:
  +if (sch-ccw_cb) {
  +/* Handle device specific commands. */
  +ret = sch-ccw_cb(sch, ccw);
  +} else {
  +ret = -EOPNOTSUPP;
  +}
  +break;
  +}
  +sch-last_cmd = ccw;
  +if (ret == 0) {
  +if (ccw-flags  CCW_FLAG_CC) {
  +sch-channel_prog += 8;
  +ret = -EAGAIN;
  +}
  +}
  +
  +return ret;

  diff --git a/hw/s390x/css.h b/hw/s390x/css.h
  new file mode 100644
  index 000..b8a95cc
  --- /dev/null
  +++ b/hw/s390x/css.h
  @@ -0,0 +1,62 @@
  +/*
  + * Channel subsystem structures and definitions.
  + *
  + * Copyright 2012 IBM Corp.
  + * Author(s): Cornelia Huck cornelia.h...@de.ibm.com
  + *
  + * This work is licensed under the terms of the GNU GPL, version 2 or (at
  + * your option) any later version. See the COPYING file in the top-level
  + * directory.
  + */
  +
  +#ifndef CSS_H
  +#define CSS_H
  +
  +#include ioinst.h
  +
  +/* Channel subsystem constants. */
  +#define MAX_SCHID 65535
  +#define MAX_SSID 3
  +#define MAX_CSSID 254 /* 255 is reserved */
  +#define MAX_CHPID 255
  +
  +#define MAX_CIWS 8
  +
  +struct senseid {
 
 SenseID

OK.
 
  +/* common part */
  +uint32_t  reserved:8;/* always 0x'FF' */
 
 The standard syntax calls for 'unsigned' instead of uint32_t for bit
 fields. But bit fields are not very well defined, it's better to avoid
 them.

Well, the equivalent Linux structure also looks like that :) But I can
switch this to a uint8_t/uint16_t structure.

 
  +uint32_t  cu_type:16;/* control unit type */
  +uint32_t  cu_model:8;/* control unit model */
  +uint32_t  dev_type:16;   /* device type */
  +uint32_t  dev_model:8;   /* device model */
  +uint32_t  unused:8;  /* padding byte */
  +/* extended part */
  +uint32_t ciw[MAX_CIWS];  /* variable # of CIWs */
  +};
  +

  diff --git a/target-s390x/ioinst.h b/target-s390x/ioinst.h
  new file mode 100644
  index 000..79628b4
  --- /dev/null
  +++ b/target-s390x/ioinst.h
  @@ -0,0 +1,173 @@
  +/*
  + * S/390 channel I/O instructions
  + *
  + * Copyright 2012 IBM Corp.
  + * Author(s): Cornelia Huck cornelia.h...@de.ibm.com
  + *
  + * This work is licensed under the terms of the GNU GPL, version 2 or (at
  + * your option) any later version. See the COPYING file in the top-level
  + * directory.
  +*/
  +
  +#ifndef IOINST_S390X_H
  +#define IOINST_S390X_H
  +
  +/*
  + * Channel I/O 

Re: [Qemu-devel] [PATCH 2/5] s390: Virtual channel subsystem support.

2012-08-08 Thread Peter Maydell
On 7 August 2012 15:52, Cornelia Huck cornelia.h...@de.ibm.com wrote:
 +static void sch_handle_clear_func(SubchDev *sch)
 +{
 +struct pmcw *p = sch-curr_status.pmcw;
 +struct scsw *s = sch-curr_status.scsw;
 +int path;
 +
 +/* Path management: In our simple css, we always choose the only path. */
 +path = 0x80;
 +
 +/* Reset values prior to 'issueing the clear signal'. */
 +p-lpum = 0;

This is unnecessary since we're going to set p-lpum to something else about
ten lines later, right?

 +p-pom = 0xff;
 +s-pno = 0;
 +
 +/* We always 'attempt to issue the clear signal', and we always succeed. 
 */
 +sch-orb = NULL;
 +sch-channel_prog = NULL;
 +sch-last_cmd = NULL;
 +s-actl = ~SCSW_ACTL_CLEAR_PEND;
 +s-stctl |= SCSW_STCTL_STATUS_PEND;
 +
 +s-dstat = 0;
 +s-cstat = 0;
 +p-lpum = path;
 +
 +}

-- PMM
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH 3/5] s390: Add new channel I/O based virtio transport.

2012-08-08 Thread Cornelia Huck
On Tue, 7 Aug 2012 20:47:22 +
Blue Swirl blauwir...@gmail.com wrote:


  diff --git a/hw/s390x/virtio-ccw.c b/hw/s390x/virtio-ccw.c
  new file mode 100644
  index 000..8a90c3a
  --- /dev/null
  +++ b/hw/s390x/virtio-ccw.c
  @@ -0,0 +1,962 @@
  +/*
  + * virtio ccw target implementation
  + *
  + * Copyright 2012 IBM Corp.
  + * Author(s): Cornelia Huck cornelia.h...@de.ibm.com
  + *
  + * This work is licensed under the terms of the GNU GPL, version 2 or (at
  + * your option) any later version. See the COPYING file in the top-level
  + * directory.
  + */
  +
  +#include hw/hw.h
  +#include block.h
  +#include blockdev.h
  +#include sysemu.h
  +#include net.h
  +#include monitor.h
  +#include qemu-thread.h
  +#include ../virtio.h
  +#include ../virtio-serial.h
  +#include ../virtio-net.h
  +#include ../sysbus.h
 
 hw/virtio... for the above

OK.
 
  +#include bitops.h
  +
  +#include ioinst.h
  +#include css.h
  +#include virtio-ccw.h
  +
  +static const TypeInfo virtio_ccw_bus_info = {
  +.name = TYPE_VIRTIO_CCW_BUS,
  +.parent = TYPE_BUS,
  +.instance_size = sizeof(VirtioCcwBus),
  +};
  +
  +static const VirtIOBindings virtio_ccw_bindings;
  +
  +typedef struct sch_entry {
  +SubchDev *sch;
  +QLIST_ENTRY(sch_entry) entry;
  +} sch_entry;
 
 SubchEntry, see CODING_STYLE. Also other struct and typedef names below.
 
  +
  +QLIST_HEAD(subch_list, sch_entry);
 
 static, but please put this to a structure that is passed around instead.
 
  +
  +typedef struct devno_entry {
  +uint16_t devno;
  +QLIST_ENTRY(devno_entry) entry;
  +} devno_entry;
  +
  +QLIST_HEAD(devno_list, devno_entry);
 
 Ditto
 
  +
  +struct subch_set {
  +struct subch_list *s_list[256];
  +struct devno_list *d_list[256];
  +};
  +
  +struct css_set {
  +struct subch_set *set[MAX_SSID + 1];
  +};
  +
  +static struct css_set *channel_subsys[MAX_CSSID + 1];

OK, will try to come up with some kind of structure for this and
CamelCasify it.

  +
  +VirtIODevice *virtio_ccw_get_vdev(SubchDev *sch)
  +{
  +VirtIODevice *vdev = NULL;
  +
  +if (sch-driver_data) {
  +vdev = ((VirtioCcwData *)sch-driver_data)-vdev;
  +}
  +return vdev;
  +}
  +

  +VirtioCcwBus *virtio_ccw_bus_init(void)
  +{
  +VirtioCcwBus *bus;
  +BusState *_bus;
 
 Please avoid identifiers with leading underscores.

OK.

 
  +DeviceState *dev;
  +
  +css_set_subch_cb(virtio_ccw_find_subch);
  +
  +/* Create bridge device */
  +dev = qdev_create(NULL, virtio-ccw-bridge);
  +qdev_init_nofail(dev);
  +
  +/* Create bus on bridge device */
  +_bus = qbus_create(TYPE_VIRTIO_CCW_BUS, dev, virtio-ccw);
  +bus = DO_UPCAST(VirtioCcwBus, bus, _bus);
  +
  +/* Enable hotplugging */
  +_bus-allow_hotplug = 1;
  +
  +return bus;
  +}
  +
  +struct vq_info_block {
  +uint64_t queue;
  +uint16_t num;
  +} QEMU_PACKED;
  +
  +struct vq_config_block {
  +uint16_t index;
  +uint16_t num;
  +} QEMU_PACKED;
 
 Aren't these KVM structures? They should be defined in a KVM header
 file file in linux-headers.

Not really, virtio-ccw isn't tied to kvm.

I see this more as command blocks that are specific to the control
unit - like something that would be defined in an attachment
specification for a classic s390 device (and in the virtio spec in this
case) and modeled as C structures here.

 

  +case CCW_CMD_WRITE_CONF:
  +if (check_len) {
  +if (ccw-count  data-vdev-config_len) {
  +ret = -EINVAL;
  +break;
  +}
  +}
  +len = MIN(ccw-count, data-vdev-config_len);
  +config = qemu_get_ram_ptr(ccw-cda);
 
 Please use cpu_physical_memory_read() (or DMA versions) instead of
 this + memcpy().

Will check.
 
  +if (!config) {
  +ret = -EFAULT;
  +} else {
  +memcpy(data-vdev-config, config, len);
  +if (data-vdev-set_config) {
  +data-vdev-set_config(data-vdev, data-vdev-config);
  +}
  +sch-curr_status.scsw.count = ccw-count - len;
  +ret = 0;
  +}
  +break;

  +case CCW_CMD_READ_VQ_CONF:
  +if (check_len) {
  +if (ccw-count != sizeof(vq_config)) {
  +ret = -EINVAL;
  +break;
  +}
  +} else if (ccw-count  sizeof(vq_config)) {
  +/* Can't execute command. */
  +ret = -EINVAL;
  +break;
  +}
  +if (!qemu_get_ram_ptr(ccw-cda)) {
  +ret = -EFAULT;
  +} else {
  +vq_config.index = lduw_phys(ccw-cda);
 
 lduw_{b,l}e_phys()
 
  +vq_config.num = virtio_queue_get_num(data-vdev, 
  vq_config.index);
  +stw_phys(ccw-cda + sizeof(vq_config.index), vq_config.num);
 
 stw_{b,l]e_phys(), likewise elsewhere.


Will check as well.
 
  +sch-curr_status.scsw.count = ccw-count - 

Re: KVM segfaults with 3.5 while installing ubuntu 12.04

2012-08-08 Thread Stefan Priebe

ah OK - thanks. Will there be a fixed 1.1.2 as well?

Stefan

Am 08.08.2012 10:06, schrieb Stefan Hajnoczi:

On Wed, Aug 08, 2012 at 07:51:07AM +0200, Stefan Priebe wrote:

Any news? Was this applied upstream?


Kevin is ill.  He has asked me to review and test patches in his
absence.  When he gets back later this week this will get picked up (and
included in QEMU 1.2).

Here is the tree, it includes this patch:

https://github.com/stefanha/qemu/commits/for-kevin

Stefan


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Reset problem vs. MMIO emulation, hypercalls, etc...

2012-08-08 Thread Avi Kivity
On 08/08/2012 12:09 AM, Benjamin Herrenschmidt wrote:
 On Tue, 2012-08-07 at 16:13 +0300, Avi Kivity wrote:
 
 
 Peter has started to fix up this naming mess in qemu.  I guess we should
 do the same for the kernel (except for ABIs) and document it, because it
 keeps generating confusion. 
 
 Ok so our current situation is that the XICS and MPIC are emulated in
 userspace entirely but the link between them and the VCPU is the
 asynchronous EE line so we are fine.
 
 I'm currently working on moving the XICS into the kernel for performance
 reasons, however, just like ARM VGIC, I can't seem to find a way to
 fit in the generic irqchip code in there. It's just not generic at
 all and quite x86 centric :-)

The generic code is for the two apic architectures: x64 and ia64.  Don't
try to shoehorn it in, you'll damage both the shoe and your foot.

 
 So for now I'm just doing my own version of CREATE_IRQCHIP to create it
 and KVM_INTERRUPT to trigger the various interrupts. None of the mapping
 stuff (which we really don't need).

You mean KVM_IRQ_LINE.  KVM_INTERRUPT is a synchronous vcpu ioctl.

 
 That's a bit of a problem vs. some of the code qemu-side such as in
 virtio-pci which does seem to be written around the model exposed by the
 x86 stuff and relies on doing such mappings so I think we'll have to
 butcher some of that.

Can you elaborate? virtio-pci is pci-centric, there should be nothing
x86 specific there.


-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH 2/5] s390: Virtual channel subsystem support.

2012-08-08 Thread Cornelia Huck
On Wed, 8 Aug 2012 09:27:32 +0100
Peter Maydell peter.mayd...@linaro.org wrote:

 On 7 August 2012 15:52, Cornelia Huck cornelia.h...@de.ibm.com wrote:
  +static void sch_handle_clear_func(SubchDev *sch)
  +{
  +struct pmcw *p = sch-curr_status.pmcw;
  +struct scsw *s = sch-curr_status.scsw;
  +int path;
  +
  +/* Path management: In our simple css, we always choose the only path. 
  */
  +path = 0x80;
  +
  +/* Reset values prior to 'issueing the clear signal'. */
  +p-lpum = 0;
 
 This is unnecessary since we're going to set p-lpum to something else about
 ten lines later, right?

My intention here was following the description in the architecture
closely. In our case, 'attempting to issue the clear signal' will
always succeed, and so we will always set the lpum later on - on real
hardware, this may be different.

 
  +p-pom = 0xff;
  +s-pno = 0;
  +
  +/* We always 'attempt to issue the clear signal', and we always 
  succeed. */
  +sch-orb = NULL;
  +sch-channel_prog = NULL;
  +sch-last_cmd = NULL;
  +s-actl = ~SCSW_ACTL_CLEAR_PEND;
  +s-stctl |= SCSW_STCTL_STATUS_PEND;
  +
  +s-dstat = 0;
  +s-cstat = 0;
  +p-lpum = path;
  +
  +}
 
 -- PMM
 

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 01/15] atomic: introduce atomic operations

2012-08-08 Thread Paolo Bonzini
Il 08/08/2012 08:25, Liu Ping Fan ha scritto:
 From: Liu Ping Fan pingf...@linux.vnet.ibm.com
 
 If out of global lock, we will be challenged by SMP in low level,
 so need atomic ops.
 
 This file is heavily copied from kernel.

Then it cannot be GPLv2 _or later_.  Please use the version that I
pointed you to.

Paolo

 Currently, only x86 atomic ops
 included, and will be extended for other arch for future.
 
 Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
 ---
  include/qemu/atomic.h |  161 
 +
  1 files changed, 161 insertions(+), 0 deletions(-)
  create mode 100644 include/qemu/atomic.h
 
 diff --git a/include/qemu/atomic.h b/include/qemu/atomic.h
 new file mode 100644
 index 000..8e1fc3e
 --- /dev/null
 +++ b/include/qemu/atomic.h
 @@ -0,0 +1,161 @@
 +/*
 + * Simple interface for atomic operations.
 + *
 + * This work is licensed under the terms of the GNU GPL, version 2 or later.
 + * See the COPYING file in the top-level directory.
 + *
 + */


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Reset problem vs. MMIO emulation, hypercalls, etc...

2012-08-08 Thread Avi Kivity
On 08/08/2012 03:49 AM, David Gibson wrote:
  We never have irqchip in kernel (because we haven't written that yet)
  but we still sleep in-kernel for CEDE.  I haven't spotted any problem
  with that, but now I'm wondering if there is one, since x86 don't do
  it in what seems like the analogous situation.
  
  It's possible this works because our decrementer (timer) interrupts
  are different at the core level from external interrupts coming from
  the PIC, and *are* handled in kernel, but I haven't actually followed
  the logic to work out if this is the case.
  
   Meaning the normal state of things is to sleep in
  the kernel (whether or not you have an emulated interrupt controller in
  the kernel -- the term irqchip in kernel is overloaded for x86).
  
  Uh.. overloaded in what way.
 
 On x86, irqchip-in-kernel means that the local APICs, the IOAPIC, and
 the two PICs are emulated in the kernel.  Now the IOAPIC and the PICs
 correspond to non-x86 interrupt controllers, but the local APIC is more
 tightly coupled to the core.  Interrupt acceptance by the core is an
 operation that involved synchronous communication with the local APIC:
 the APIC presents the interrupt, the core accepts it based on the value
 of the interrupt enable flag and possible a register (CR8), then the
 APIC updates the ISR and IRR.
 
 The upshot is that if the local APIC is in userspace, interrupts must be
 synchronous with vcpu exection, so that KVM_INTERRUPT is a vcpu ioctl
 and HLT is emulated in userspace (so that local APIC emulation can check
 if an interrupt wakes it up or not).
 
 Sorry, still not 100% getting it.  When the vcpu is actually running
 code, that synchronous communication must still be accomplished via
 the KVM_INTERRUPT ioctl, yes?  So what makes HLT different, that the
 communication can't be accomplished in that case.

No, you're correct.  HLT could have been emulated in userspace, it just
wasn't.  The correct statement is that HLT was arbitrarily chosen to be
emulated in userspace with the synchronous model, but the asynchronous
model forced it into the kernel.


-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 01/15] atomic: introduce atomic operations

2012-08-08 Thread Avi Kivity
On 08/08/2012 09:25 AM, Liu Ping Fan wrote:
 From: Liu Ping Fan pingf...@linux.vnet.ibm.com
 
 If out of global lock, we will be challenged by SMP in low level,
 so need atomic ops.
 
 This file is heavily copied from kernel. Currently, only x86 atomic ops
 included, and will be extended for other arch for future.
 

I propose we use gcc builtins.  We get automatic architecture support,
and tuning for newer processors if the user so chooses.

http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html

In May 2031 we can switch to C11 atomics.

-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 03/15] qom: introduce reclaimer to release obj

2012-08-08 Thread Avi Kivity
On 08/08/2012 09:25 AM, Liu Ping Fan wrote:
 From: Liu Ping Fan pingf...@linux.vnet.ibm.com
 
 Collect unused object and release them at caller demand.
 

Please explain the motivation for this patch.


-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH 01/15] atomic: introduce atomic operations

2012-08-08 Thread Wei-Ren Chen
 I propose we use gcc builtins.  We get automatic architecture support,
 and tuning for newer processors if the user so chooses.
 
 http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html
 
 In May 2031 we can switch to C11 atomics.
 
  Maybe 2013?

-- 
Wei-Ren Chen (陳韋任)
Computer Systems Lab, Institute of Information Science,
Academia Sinica, Taiwan (R.O.C.)
Tel:886-2-2788-3799 #1667
Homepage: http://people.cs.nctu.edu.tw/~chenwj
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 03/15] qom: introduce reclaimer to release obj

2012-08-08 Thread Paolo Bonzini
Il 08/08/2012 11:05, Avi Kivity ha scritto:
  From: Liu Ping Fan pingf...@linux.vnet.ibm.com
  
  Collect unused object and release them at caller demand.
  
 Please explain the motivation for this patch.

It's poor man RCU, I think?

Paolo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v8] kvm: notify host when the guest is panicked

2012-08-08 Thread Andrew Jones
On Wed, Aug 08, 2012 at 10:43:01AM +0800, Wen Congyang wrote:
 diff --git a/Documentation/virtual/kvm/pv_event.txt 
 b/Documentation/virtual/kvm/pv_event.txt
 new file mode 100644
 index 000..0ebc890
 --- /dev/null
 +++ b/Documentation/virtual/kvm/pv_event.txt
 @@ -0,0 +1,32 @@
 +The KVM paravirtual event interface
 +=
 +
 +Initializing the paravirtual event interface
 +==
 +kvm_pv_event_init()
 +Argiments:
 + None
 +
 +Return Value:
 + 0 : The guest kernel can't use paravirtual event interface.
 + -1: The guest kernel can use paravirtual event interface.
 +

This documentation has the can and can't backwards.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 04/15] memory: MemoryRegion topology must be stable when updating

2012-08-08 Thread Avi Kivity
On 08/08/2012 09:25 AM, Liu Ping Fan wrote:
 From: Liu Ping Fan pingf...@linux.vnet.ibm.com
 
 Using mem_map_lock to protect among updaters. So we can get the intact
 snapshot of mem topology -- FlatView  radix-tree.
 
 Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
 ---
  exec.c   |3 +++
  memory.c |   22 ++
  memory.h |2 ++
  3 files changed, 27 insertions(+), 0 deletions(-)
 
 diff --git a/exec.c b/exec.c
 index 8244d54..0e29ef9 100644
 --- a/exec.c
 +++ b/exec.c
 @@ -210,6 +210,8 @@ static unsigned phys_map_nodes_nb, 
 phys_map_nodes_nb_alloc;
 The bottom level has pointers to MemoryRegionSections.  */
  static PhysPageEntry phys_map = { .ptr = PHYS_MAP_NODE_NIL, .is_leaf = 0 };
  
 +QemuMutex mem_map_lock;
 +
  static void io_mem_init(void);
  static void memory_map_init(void);
  
 @@ -637,6 +639,7 @@ void cpu_exec_init_all(void)
  #if !defined(CONFIG_USER_ONLY)
  memory_map_init();
  io_mem_init();
 +qemu_mutex_init(mem_map_lock);
  #endif
  }
  
 diff --git a/memory.c b/memory.c
 index aab4a31..5986532 100644
 --- a/memory.c
 +++ b/memory.c
 @@ -761,7 +761,9 @@ void memory_region_transaction_commit(void)
  assert(memory_region_transaction_depth);
  --memory_region_transaction_depth;
  if (!memory_region_transaction_depth  memory_region_update_pending) {
 +qemu_mutex_lock(mem_map_lock);
  memory_region_update_topology(NULL);
 +qemu_mutex_unlock(mem_map_lock);
  }
  }

Seems to me that nothing in memory.c can susceptible to races.  It must
already be called under the big qemu lock, and with the exception of
mutators (memory_region_set_*), changes aren't directly visible.

I think it's sufficient to take the mem_map_lock at the beginning of
core_begin() and drop it at the end of core_commit().  That means all
updates of volatile state, phys_map, are protected.



-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 03/15] qom: introduce reclaimer to release obj

2012-08-08 Thread Avi Kivity
On 08/08/2012 12:07 PM, Paolo Bonzini wrote:
 Il 08/08/2012 11:05, Avi Kivity ha scritto:
  From: Liu Ping Fan pingf...@linux.vnet.ibm.com
  
  Collect unused object and release them at caller demand.
  
 Please explain the motivation for this patch.
 
 It's poor man RCU, I think?

I thought that it was to defer destructors (finalizers) to a more
suitable context.  But why is the unref context unsuitable?

I don't see how it relates to RCU, where is the C and the U?

Anyway the list eagerly awaits the explanation.

-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH 01/15] atomic: introduce atomic operations

2012-08-08 Thread Avi Kivity
On 08/08/2012 12:05 PM, 陳韋任 (Wei-Ren Chen) wrote:
 I propose we use gcc builtins.  We get automatic architecture support,
 and tuning for newer processors if the user so chooses.
 
 http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html
 
 In May 2031 we can switch to C11 atomics.
  
   Maybe 2013?
 

Maybe in 2013 we'll get a compiler that supports C11...

-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 05/15] memory: introduce life_ops to MemoryRegion

2012-08-08 Thread Avi Kivity
On 08/08/2012 09:25 AM, Liu Ping Fan wrote:
 From: Liu Ping Fan pingf...@linux.vnet.ibm.com
 
 The types of referred object by MemoryRegion are variable, ex,
 another mr, DeviceState, or other struct defined by drivers.
 So the refer/unrefer may be different by drivers.
 
 Using this ops, we can mange the backend object.
 

Seems to be a needless abstration - we already have lifetime manangement
for objects.

I suggested previously to replace the opaque parameter with an Object,
and use Object's refcounting.  That's a lot of work, but IMO is worth it
as the opaques are dangerous to live lying around.


-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 06/15] memory: use refcnt to manage MemoryRegion

2012-08-08 Thread Avi Kivity
On 08/08/2012 09:25 AM, Liu Ping Fan wrote:
 From: Liu Ping Fan pingf...@linux.vnet.ibm.com
 
 Using refcnt for mr, so we can separate mr's life cycle management
 from refered object.
   When mr-ref 0-1, inc the refered object.
   When mr-ref 1-0, dec the refered object.
 
 The refered object can be DeviceStae, another mr, or other opaque.

Please explain the motivation more fully.

Usually a MemoryRegion will be embedded within some DeviceState, or its
lifecycle will be managed by the DeviceState.  So long as we keep the
DeviceState alive all associated MemoryRegions should be alive as well.
 Why not do this directly?


-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH 01/15] atomic: introduce atomic operations

2012-08-08 Thread Peter Maydell
On 8 August 2012 07:25, Liu Ping Fan qemul...@gmail.com wrote:
 +static inline void atomic_sub(int i, Atomic *v)
 +{
 +asm volatile(lock; subl %1,%0
 + : +m (v-counter)
 + : ir (i));
 +}

NAK. We don't want random inline assembly implementations of locking
primitives in QEMU, they are way too hard to keep working with all the
possible host architectures we support. I spent some time a while back
getting rid of the (variously busted) versions we had previously.

If you absolutely must use atomic ops, use the gcc builtins. For
preference, stick to higher level and less error-prone abstractions.

-- PMM
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v8] kvm: notify host when the guest is panicked

2012-08-08 Thread Wen Congyang
At 08/08/2012 05:12 PM, Andrew Jones Wrote:
 On Wed, Aug 08, 2012 at 10:43:01AM +0800, Wen Congyang wrote:
 diff --git a/Documentation/virtual/kvm/pv_event.txt 
 b/Documentation/virtual/kvm/pv_event.txt
 new file mode 100644
 index 000..0ebc890
 --- /dev/null
 +++ b/Documentation/virtual/kvm/pv_event.txt
 @@ -0,0 +1,32 @@
 +The KVM paravirtual event interface
 +=
 +
 +Initializing the paravirtual event interface
 +==
 +kvm_pv_event_init()
 +Argiments:
 +None
 +
 +Return Value:
 +0 : The guest kernel can't use paravirtual event interface.
 +-1: The guest kernel can use paravirtual event interface.
 +
 
 This documentation has the can and can't backwards.
 

Yes, I will fix it.

Thanks
Wen Congyang
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Reset problem vs. MMIO emulation, hypercalls, etc...

2012-08-08 Thread Benjamin Herrenschmidt
On Wed, 2012-08-08 at 11:52 +0300, Avi Kivity wrote:
  So for now I'm just doing my own version of CREATE_IRQCHIP to create it
  and KVM_INTERRUPT to trigger the various interrupts. None of the mapping
  stuff (which we really don't need).
 
 You mean KVM_IRQ_LINE.  KVM_INTERRUPT is a synchronous vcpu ioctl.

Yes, sorry, brain and fingers not agreeing :-)

  That's a bit of a problem vs. some of the code qemu-side such as in
  virtio-pci which does seem to be written around the model exposed by the
  x86 stuff and relies on doing such mappings so I think we'll have to
  butcher some of that.
 
 Can you elaborate? virtio-pci is pci-centric, there should be nothing
 x86 specific there.

The kvm_irqchip_add_msi_route  co is a problem, it more/less dives into
the x86/ia64 specific routing stuff at a generic level, I'll have to
hook things up differently in qemu I suspect.

Not a huge deal, we can sort it out.

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 08/15] memory: introduce PhysMap to present snapshot of toploygy

2012-08-08 Thread Avi Kivity
On 08/08/2012 09:25 AM, Liu Ping Fan wrote:
 From: Liu Ping Fan pingf...@linux.vnet.ibm.com
 
 PhysMap contain the flatview and radix-tree view, they are snapshot
 of system topology and should be consistent. With PhysMap, we can
 swap the pointer when updating and achieve the atomic.
 
 Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
 ---
  exec.c   |8 
  memory.c |   33 -
  memory.h |   62 
 --
  3 files changed, 60 insertions(+), 43 deletions(-)
 
 diff --git a/exec.c b/exec.c
 index 0e29ef9..01b91b0 100644
 --- a/exec.c
 +++ b/exec.c
 @@ -156,8 +156,6 @@ typedef struct PageDesc {
  #endif
  
  /* Size of the L2 (and L3, etc) page tables.  */
 -#define L2_BITS 10
 -#define L2_SIZE (1  L2_BITS)
  
  #define P_L2_LEVELS \
  (((TARGET_PHYS_ADDR_SPACE_BITS - TARGET_PAGE_BITS - 1) / L2_BITS) + 1)
 @@ -185,7 +183,6 @@ uintptr_t qemu_host_page_mask;
  static void *l1_map[V_L1_SIZE];
  
  #if !defined(CONFIG_USER_ONLY)
 -typedef struct PhysPageEntry PhysPageEntry;


This (and the other stuff you're moving) is private memory internals.
It should be moved to memory-internals.h (currently named exec-obsolete.h).



-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH 12/15] qdev: using devtree lock to protect device's accessing

2012-08-08 Thread Peter Maydell
On 8 August 2012 07:25, Liu Ping Fan qemul...@gmail.com wrote:
 From: Liu Ping Fan pingf...@linux.vnet.ibm.com

 lock:
   qemu_device_tree_mutex

Looking at where it's used, this doesn't seem to have anything to do
with device trees (ie dtb, see www.devicetree.org) : poorly named lock?

-- PMM
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 03/15] qom: introduce reclaimer to release obj

2012-08-08 Thread Paolo Bonzini
Il 08/08/2012 08:25, Liu Ping Fan ha scritto:
 From: Liu Ping Fan pingf...@linux.vnet.ibm.com
 
 Collect unused object and release them at caller demand.
 
 Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
 ---
  include/qemu/reclaimer.h |   28 ++
  main-loop.c  |5 
  qemu-tool.c  |5 
  qom/Makefile.objs|2 +-
  qom/reclaimer.c  |   58 
 ++
  5 files changed, 97 insertions(+), 1 deletions(-)
  create mode 100644 include/qemu/reclaimer.h
  create mode 100644 qom/reclaimer.c
 
 diff --git a/include/qemu/reclaimer.h b/include/qemu/reclaimer.h
 new file mode 100644
 index 000..9307e93
 --- /dev/null
 +++ b/include/qemu/reclaimer.h
 @@ -0,0 +1,28 @@
 +/*
 + * QEMU reclaimer
 + *
 + * Copyright IBM, Corp. 2012
 + *
 + * This work is licensed under the terms of the GNU GPL, version 2 or later.
 + * See the COPYING file in the top-level directory.
 + */
 +
 +#ifndef QEMU_RECLAIMER
 +#define QEMU_RECLAIMER
 +
 +typedef void ReleaseHandler(void *opaque);
 +typedef struct Chunk {
 +QLIST_ENTRY(Chunk) list;
 +void *opaque;
 +ReleaseHandler *release;
 +} Chunk;
 +
 +typedef struct ChunkHead {
 +struct Chunk *lh_first;
 +} ChunkHead;
 +
 +void reclaimer_enqueue(ChunkHead *head, void *opaque, ReleaseHandler 
 *release);
 +void reclaimer_worker(ChunkHead *head);
 +void qemu_reclaimer_enqueue(void *opaque, ReleaseHandler *release);
 +void qemu_reclaimer(void);

So enqueue is call_rcu and qemu_reclaimer marks a quiescent state +
empties the pending call_rcu.

But what's the difference between the two pairs of APIs?

 +#endif
 diff --git a/main-loop.c b/main-loop.c
 index eb3b6e6..be9d095 100644
 --- a/main-loop.c
 +++ b/main-loop.c
 @@ -26,6 +26,7 @@
  #include qemu-timer.h
  #include slirp/slirp.h
  #include main-loop.h
 +#include qemu/reclaimer.h
  
  #ifndef _WIN32
  
 @@ -505,5 +506,9 @@ int main_loop_wait(int nonblocking)
 them.  */
  qemu_bh_poll();
  
 +/* ref to device from iohandler/bh/timer do not obey the rules, so delay
 + * reclaiming until now.
 + */
 +qemu_reclaimer();
  return ret;
  }
 diff --git a/qemu-tool.c b/qemu-tool.c
 index 318c5fc..f5fe319 100644
 --- a/qemu-tool.c
 +++ b/qemu-tool.c
 @@ -21,6 +21,7 @@
  #include main-loop.h
  #include qemu_socket.h
  #include slirp/libslirp.h
 +#include qemu/reclaimer.h
  
  #include sys/time.h
  
 @@ -75,6 +76,10 @@ void qemu_mutex_unlock_iothread(void)
  {
  }
  
 +void qemu_reclaimer(void)
 +{
 +}
 +
  int use_icount;
  
  void qemu_clock_warp(QEMUClock *clock)
 diff --git a/qom/Makefile.objs b/qom/Makefile.objs
 index 5ef060a..a579261 100644
 --- a/qom/Makefile.objs
 +++ b/qom/Makefile.objs
 @@ -1,4 +1,4 @@
 -qom-obj-y = object.o container.o qom-qobject.o
 +qom-obj-y = object.o container.o qom-qobject.o reclaimer.o
  qom-obj-twice-y = cpu.o
  common-obj-y = $(qom-obj-twice-y)
  user-obj-y = $(qom-obj-twice-y)
 diff --git a/qom/reclaimer.c b/qom/reclaimer.c
 new file mode 100644
 index 000..6cb53e3
 --- /dev/null
 +++ b/qom/reclaimer.c
 @@ -0,0 +1,58 @@
 +/*
 + * QEMU reclaimer
 + *
 + * Copyright IBM, Corp. 2012
 + *
 + * This work is licensed under the terms of the GNU GPL, version 2 or later.
 + * See the COPYING file in the top-level directory.
 + */
 +
 +#include qemu-common.h
 +#include qemu-thread.h
 +#include main-loop.h
 +#include qemu-queue.h
 +#include qemu/reclaimer.h
 +
 +static struct QemuMutex reclaimer_lock;
 +static QLIST_HEAD(rcl, Chunk) reclaimer_list;
 +
 +void reclaimer_enqueue(ChunkHead *head, void *opaque, ReleaseHandler 
 *release)
 +{
 +Chunk *r = g_malloc0(sizeof(Chunk));
 +r-opaque = opaque;
 +r-release = release;
 +QLIST_INSERT_HEAD_RCU(head, r, list);
 +}

No lock?

 +void reclaimer_worker(ChunkHead *head)
 +{
 +Chunk *cur, *next;
 +
 +QLIST_FOREACH_SAFE(cur, head, list, next) {
 +QLIST_REMOVE(cur, list);
 +cur-release(cur-opaque);
 +g_free(cur);
 +}

QLIST_REMOVE needs a lock too, so using the lockless
QLIST_INSERT_HEAD_RCU is not necessary.

 +}
 +
 +void qemu_reclaimer_enqueue(void *opaque, ReleaseHandler *release)
 +{
 +Chunk *r = g_malloc0(sizeof(Chunk));
 +r-opaque = opaque;
 +r-release = release;
 +qemu_mutex_lock(reclaimer_lock);
 +QLIST_INSERT_HEAD_RCU(reclaimer_list, r, list);
 +qemu_mutex_unlock(reclaimer_lock);
 +}
 +
 +
 +void qemu_reclaimer(void)
 +{
 +Chunk *cur, *next;
 +
 +QLIST_FOREACH_SAFE(cur, reclaimer_list, list, next) {
 +QLIST_REMOVE(cur, list);
 +cur-release(cur-opaque);
 +g_free(cur);
 +}

Same here.

 +}
 


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 14/15] qom: object_unref call reclaimer

2012-08-08 Thread Paolo Bonzini
Il 08/08/2012 08:25, Liu Ping Fan ha scritto:
 From: Liu Ping Fan pingf...@linux.vnet.ibm.com
 
 iohandler/bh/timer may use DeviceState when its refcnt=0,
 postpone the reclaimer till they have done with it.
 
 Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
 ---
  qom/object.c |9 -
  1 files changed, 8 insertions(+), 1 deletions(-)
 
 diff --git a/qom/object.c b/qom/object.c
 index 822bdb7..1452b1b 100644
 --- a/qom/object.c
 +++ b/qom/object.c
 @@ -23,6 +23,8 @@
  #include qbool.h
  #include qint.h
  #include qstring.h
 +#include hw/qdev.h
 +#include qemu/reclaimer.h
  
  #define MAX_INTERFACES 32
  
 @@ -646,7 +648,12 @@ void object_unref(Object *obj)
  {
  g_assert(atomic_read(obj-ref)  0);
  if (atomic_dec_and_test(obj-ref)) {
 -object_finalize(obj);
 +/* fixme, maybe introduce obj-finalze to make this more elegant */
 +if (object_dynamic_cast(obj, TYPE_DEVICE) != NULL) {
 +qemu_reclaimer_enqueue(obj, object_finalize);
 +} else {
 +object_finalize(obj);
 +}
  }
  }
  
 

Just do this unconditionally.

Paolo

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 09/15] memory: prepare flatview and radix-tree for rcu style access

2012-08-08 Thread Avi Kivity
On 08/08/2012 09:25 AM, Liu Ping Fan wrote:
 From: Liu Ping Fan pingf...@linux.vnet.ibm.com
 
 Flatview and radix view are all under the protection of pointer.
 And this make sure the change of them seem to be atomic!
 
 The mr accessed by radix-tree leaf or flatview will be reclaimed
 after the prev PhysMap not in use any longer
 

IMO this cleverness should come much later.  Let's first take care of
dropping the big qemu lock, then make swithcing memory maps more efficient.

The initial paths could look like:

  lookup:
 take mem_map_lock
 lookup
 take ref
 drop mem_map_lock

  update:
 take mem_map_lock (in core_begin)
 do updates
 drop memo_map_lock

Later we can replace mem_map_lock with either a rwlock or (real) rcu.


  
  #if !defined(CONFIG_USER_ONLY)
  
 -static void phys_map_node_reserve(unsigned nodes)
 +static void phys_map_node_reserve(PhysMap *map, unsigned nodes)
  {
 -if (phys_map_nodes_nb + nodes  phys_map_nodes_nb_alloc) {
 +if (map-phys_map_nodes_nb + nodes  map-phys_map_nodes_nb_alloc) {
  typedef PhysPageEntry Node[L2_SIZE];
 -phys_map_nodes_nb_alloc = MAX(phys_map_nodes_nb_alloc * 2, 16);
 -phys_map_nodes_nb_alloc = MAX(phys_map_nodes_nb_alloc,
 -  phys_map_nodes_nb + nodes);
 -phys_map_nodes = g_renew(Node, phys_map_nodes,
 - phys_map_nodes_nb_alloc);
 +map-phys_map_nodes_nb_alloc = MAX(map-phys_map_nodes_nb_alloc * 2,
 +16);
 +map-phys_map_nodes_nb_alloc = MAX(map-phys_map_nodes_nb_alloc,
 +  map-phys_map_nodes_nb + nodes);
 +map-phys_map_nodes = g_renew(Node, map-phys_map_nodes,
 + map-phys_map_nodes_nb_alloc);
  }
  }

Please have a patch that just adds the map parameter to all these
functions.  This makes the later patch, that adds the copy, easier to read.

 +
 +void cur_map_update(PhysMap *next)
 +{
 +qemu_mutex_lock(cur_map_lock);
 +physmap_put(cur_map);
 +cur_map = next;
 +smp_mb();
 +qemu_mutex_unlock(cur_map_lock);
 +}

IMO this can be mem_map_lock.

If we take my previous suggestion:

  lookup:
 take mem_map_lock
 lookup
 take ref
 drop mem_map_lock

  update:
 take mem_map_lock (in core_begin)
 do updates
 drop memo_map_lock

And update it to


  update:
 prepare next_map (in core_begin)
 do updates
 take mem_map_lock (in core_commit)
 switch maps
 drop mem_map_lock
 free old map


Note the lookup path copies the MemoryRegionSection instead of
referencing it.  Thus we can destroy the old map without worrying; the
only pointers will point to MemoryRegions, which will be protected by
the refcounts on their Objects.

This can be easily switched to rcu:

  update:
 prepare next_map (in core_begin)
 do updates
 switch maps - rcu_assign_pointer
 call_rcu(free old map) (or synchronize_rcu; free old maps)

Again, this should be done after the simplictic patch that enables
parallel lookup but keeps just one map.



-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 11/15] lock: introduce global lock for device tree

2012-08-08 Thread Avi Kivity
On 08/08/2012 09:25 AM, Liu Ping Fan wrote:
 From: Liu Ping Fan pingf...@linux.vnet.ibm.com
 

Please explain the motivation.  AFAICT, the big qemu lock is sufficient.


-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 11/15] lock: introduce global lock for device tree

2012-08-08 Thread Paolo Bonzini
Il 08/08/2012 08:25, Liu Ping Fan ha scritto:
 From: Liu Ping Fan pingf...@linux.vnet.ibm.com
 
 Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
 ---
  cpus.c  |   12 
  main-loop.h |3 +++
  2 files changed, 15 insertions(+), 0 deletions(-)
 
 diff --git a/cpus.c b/cpus.c
 index b182b3d..a734b36 100644
 --- a/cpus.c
 +++ b/cpus.c
 @@ -611,6 +611,7 @@ static void qemu_tcg_init_cpu_signals(void)
  }
  #endif /* _WIN32 */
  
 +QemuMutex qemu_device_tree_mutex;
  QemuMutex qemu_global_mutex;
  static QemuCond qemu_io_proceeded_cond;
  static bool iothread_requesting_mutex;
 @@ -634,6 +635,7 @@ void qemu_init_cpu_loop(void)
  qemu_cond_init(qemu_work_cond);
  qemu_cond_init(qemu_io_proceeded_cond);
  qemu_mutex_init(qemu_global_mutex);
 +qemu_mutex_init(qemu_device_tree_mutex);
  
  qemu_thread_get_self(io_thread);
  }
 @@ -911,6 +913,16 @@ void qemu_mutex_unlock_iothread(void)
  qemu_mutex_unlock(qemu_global_mutex);
  }
  
 +void qemu_lock_devtree(void)
 +{
 +qemu_mutex_lock(qemu_device_tree_mutex);
 +}
 +
 +void qemu_unlock_devtree(void)
 +{
 +qemu_mutex_unlock(qemu_device_tree_mutex);
 +}

We don't need the wrappers.  They are there for the big lock just
because TCG needs extra work for iothread_requesting_mutex.

Paolo

  static int all_vcpus_paused(void)
  {
  CPUArchState *penv = first_cpu;
 diff --git a/main-loop.h b/main-loop.h
 index dce1cd9..17e959a 100644
 --- a/main-loop.h
 +++ b/main-loop.h
 @@ -353,6 +353,9 @@ void qemu_mutex_lock_iothread(void);
   */
  void qemu_mutex_unlock_iothread(void);
  
 +void qemu_lock_devtree(void);
 +void qemu_unlock_devtree(void);
 +
  /* internal interfaces */
  
  void qemu_fd_register(int fd);
 


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 13/15] hotplug: introduce qdev_unplug_complete() to remove device from views

2012-08-08 Thread Paolo Bonzini
Il 08/08/2012 08:25, Liu Ping Fan ha scritto:
 +void qdev_unplug_complete(DeviceState *dev, Error **errp)
 +{
 +/* isolate from mem view */
 +qdev_unmap(dev);
 +qemu_lock_devtree();
 +/* isolate from device tree */
 +qdev_unset_parent(dev);
 +qemu_unlock_devtree();
 +object_unref(OBJECT(dev));

Rather than deferring the free, you should defer the unref.  Otherwise
the following can happen when you have real RCU access to the memory
map on the read-side:

VCPU threadI/O thread
=
get MMIO request
rcu_read_lock()
walk memory map
   qdev_unmap()
   lock_devtree()
   ...
   unlock_devtree
   unref dev - refcnt=0, free enqueued
ref()
rcu_read_unlock()
   free()
dangling pointer!

If you defer the unref, you have instead

VCPU threadI/O thread
=
get MMIO request
rcu_read_lock()
walk memory map
   qdev_unmap()
   lock_devtree()
   ...
   unlock_devtree
   unref is enqueued
ref() - refcnt = 2
rcu_read_unlock()
   unref() - refcnt=1
unref() - refcnt = 1

So this also makes patch 14 unnecessary.

Paolo

 +}


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 15/15] e1000: using new interface--unmap to unplug

2012-08-08 Thread Paolo Bonzini
Il 08/08/2012 08:25, Liu Ping Fan ha scritto:
  
 +static void
 +pci_e1000_unmap(PCIDevice *p)
 +{
 +/* DO NOT FREE anything!until refcnt=0 */
 +/* isolate from memory view */
 +}

At least you need to call the superclass method.

Paolo

  static int
  pci_e1000_uninit(PCIDevice *dev)
  {
 @@ -1275,6 +1282,7 @@ static void e1000_class_init(ObjectClass *klass, void 
 *data)
  PCIDeviceClass *k = PCI_DEVICE_CLASS(klass);
  
  k-init = pci_e1000_init;
 +k-unmap = pci_e1000_unmap;
  k-exit = pci_e1000_uninit;
  k-romfile = pxe-e1000.rom;
  k-vendor_id = PCI_VENDOR_ID_INTEL;


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 13/15] hotplug: introduce qdev_unplug_complete() to remove device from views

2012-08-08 Thread Avi Kivity
On 08/08/2012 12:52 PM, Paolo Bonzini wrote:
 Il 08/08/2012 08:25, Liu Ping Fan ha scritto:
 +void qdev_unplug_complete(DeviceState *dev, Error **errp)
 +{
 +/* isolate from mem view */
 +qdev_unmap(dev);
 +qemu_lock_devtree();
 +/* isolate from device tree */
 +qdev_unset_parent(dev);
 +qemu_unlock_devtree();
 +object_unref(OBJECT(dev));
 
 Rather than deferring the free, you should defer the unref.  Otherwise
 the following can happen when you have real RCU access to the memory
 map on the read-side:
 
 VCPU threadI/O thread
 =
 get MMIO request
 rcu_read_lock()
 walk memory map
qdev_unmap()
lock_devtree()
...
unlock_devtree
unref dev - refcnt=0, free enqueued
 ref()
 rcu_read_unlock()
free()
 dangling pointer!

unref should follow either synchronize_rcu(), or be in a call_rcu()
callback (deferring the unref).  IMO synchronize_rcu() is sufficient
here, unplug is a truly slow path, esp. on real hardware.


-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] KVM: PPC: booke: Add watchdog emulation

2012-08-08 Thread Alexander Graf

On 07.08.2012, at 17:59, Bhushan Bharat-R65777 wrote:

 
 
 -Original Message-
 From: Alexander Graf [mailto:ag...@suse.de]
 Sent: Tuesday, August 07, 2012 3:38 PM
 To: Bhushan Bharat-R65777
 Cc: kvm-...@vger.kernel.org; kvm@vger.kernel.org; Bhushan Bharat-R65777
 Subject: Re: [PATCH 1/2] KVM: PPC: booke: Add watchdog emulation
 
 
 On 03.08.2012, at 07:30, Bharat Bhushan wrote:
 
 This patch adds the watchdog emulation in KVM. The watchdog
 emulation is enabled by KVM_ENABLE_CAP(KVM_CAP_PPC_BOOKE_WATCHDOG) ioctl.
 The kernel timer are used for watchdog emulation and emulates
 h/w watchdog state machine. On watchdog timer expiry, it exit to QEMU
 if TCR.WRC is non ZERO. QEMU can reset/shutdown etc depending upon how
 it is configured.
 
 Signed-off-by: Liu Yu yu@freescale.com
 Signed-off-by: Scott Wood scottw...@freescale.com
 [bharat.bhus...@freescale.com: reworked patch]
 Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com
 ---
 v7:
 - kvmppc_core_dequeue_watchdog() and kvmppc_core_enque_watchdog()
  are made staic
 - Clear the KVM_REQ_WATCHDOG request when TSR_ENW | ENW_WIS cleared
  rather than checking these bits on handling the request.
  Below changes (pasted for quick understanding)
 diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
 index 7eedaeb..a0f5922 100644
 --- a/arch/powerpc/kvm/booke.c
 +++ b/arch/powerpc/kvm/booke.c
 @@ -629,8 +629,7 @@ int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct
 kvm_vcpu *vcpu)
   goto out;
   }
 
 -   if (kvm_check_request(KVM_REQ_WATCHDOG, vcpu) 
 -   (vcpu-arch.tsr  (TSR_ENW | TSR_WIS))) {
 +   if (kvm_check_request(KVM_REQ_WATCHDOG, vcpu)) {
   kvm_run-exit_reason = KVM_EXIT_WATCHDOG;
   ret = 0;
   goto out;
 @@ -1098,8 +1097,7 @@ int kvmppc_handle_exit(struct kvm_run *run, struct
 kvm_vcpu *vcpu,
   }
   }
 
 -   if (kvm_check_request(KVM_REQ_WATCHDOG, vcpu) 
 -   (vcpu-arch.tsr  (TSR_ENW | TSR_WIS))) {
 +   if (kvm_check_request(KVM_REQ_WATCHDOG, vcpu)) {
   run-exit_reason = KVM_EXIT_WATCHDOG;
   r = RESUME_HOST | (r  RESUME_FLAG_NV);
   }
 @@ -1251,8 +1249,10 @@ static int set_sregs_base(struct kvm_vcpu *vcpu,
 
   vcpu-arch.tsr = sregs-u.e.tsr;
 
 -   if ((old_tsr ^ vcpu-arch.tsr)  (TSR_ENW | TSR_WIS))
 +   if ((old_tsr ^ vcpu-arch.tsr)  (TSR_ENW | TSR_WIS)) {
 +   clear_bit(KVM_REQ_WATCHDOG, vcpu-requests);
   arm_next_watchdog(vcpu);
 +   }
 
   update_timer_ints(vcpu);
   }
 @@ -1434,8 +1434,10 @@ void kvmppc_clr_tsr_bits(struct kvm_vcpu *vcpu, u32
 tsr_bits)
* We may have stopped the watchdog due to
* being stuck on final expiration.
*/
 -   if (tsr_bits  (TSR_ENW | TSR_WIS))
 +   if (tsr_bits  (TSR_ENW | TSR_WIS)) {
 +   clear_bit(KVM_REQ_WATCHDOG, vcpu-requests);
   arm_next_watchdog(vcpu);
 +   }
 --
 
 v6:
 - Added kvmppc_subarch_vcpu_unit/uninit() for subarch specific 
 initialization
 - Using spin_lock_irqsave()
 - Using del_timer_sync().
 
 v5:
 - Checking that TSR_ENW/TSR_WIS are still set on KVM_EXIT_WATCHDOG.
 - Moved watchdog_next_timeout() in lock.
 
 arch/powerpc/include/asm/kvm_host.h  |3 +
 arch/powerpc/include/asm/kvm_ppc.h   |2 +
 arch/powerpc/include/asm/reg_booke.h |7 ++
 arch/powerpc/kvm/book3s.c|9 ++
 arch/powerpc/kvm/booke.c |  158 
 ++
 arch/powerpc/kvm/booke_emulate.c |8 ++
 arch/powerpc/kvm/powerpc.c   |   14 +++-
 include/linux/kvm.h  |2 +
 include/linux/kvm_host.h |1 +
 9 files changed, 202 insertions(+), 2 deletions(-)
 
 diff --git a/arch/powerpc/include/asm/kvm_host.h
 b/arch/powerpc/include/asm/kvm_host.h
 index 572ad01..873ec11 100644
 --- a/arch/powerpc/include/asm/kvm_host.h
 +++ b/arch/powerpc/include/asm/kvm_host.h
 @@ -469,6 +469,8 @@ struct kvm_vcpu_arch {
 ulong fault_esr;
 ulong queued_dear;
 ulong queued_esr;
 +   spinlock_t wdt_lock;
 +   struct timer_list wdt_timer;
 u32 tlbcfg[4];
 u32 mmucfg;
 u32 epr;
 @@ -484,6 +486,7 @@ struct kvm_vcpu_arch {
 u8 osi_needed;
 u8 osi_enabled;
 u8 papr_enabled;
 +   u8 watchdog_enabled;
 u8 sane;
 u8 cpu_type;
 u8 hcall_needed;
 diff --git a/arch/powerpc/include/asm/kvm_ppc.h
 b/arch/powerpc/include/asm/kvm_ppc.h
 index 0124937..1438a5e 100644
 --- a/arch/powerpc/include/asm/kvm_ppc.h
 +++ b/arch/powerpc/include/asm/kvm_ppc.h
 @@ -68,6 +68,8 @@ extern void kvmppc_emulate_dec(struct kvm_vcpu *vcpu);
 extern u32 kvmppc_get_dec(struct kvm_vcpu *vcpu, u64 tb);
 extern void kvmppc_decrementer_func(unsigned long data);
 extern int kvmppc_sanity_check(struct kvm_vcpu *vcpu);
 +extern int kvmppc_subarch_vcpu_init(struct kvm_vcpu *vcpu);
 +extern void 

Re: Reset problem vs. MMIO emulation, hypercalls, etc...

2012-08-08 Thread David Gibson
On Wed, Aug 08, 2012 at 11:58:58AM +0300, Avi Kivity wrote:
 On 08/08/2012 03:49 AM, David Gibson wrote:
   We never have irqchip in kernel (because we haven't written that yet)
   but we still sleep in-kernel for CEDE.  I haven't spotted any problem
   with that, but now I'm wondering if there is one, since x86 don't do
   it in what seems like the analogous situation.
   
   It's possible this works because our decrementer (timer) interrupts
   are different at the core level from external interrupts coming from
   the PIC, and *are* handled in kernel, but I haven't actually followed
   the logic to work out if this is the case.
   
Meaning the normal state of things is to sleep in
   the kernel (whether or not you have an emulated interrupt controller in
   the kernel -- the term irqchip in kernel is overloaded for x86).
   
   Uh.. overloaded in what way.
  
  On x86, irqchip-in-kernel means that the local APICs, the IOAPIC, and
  the two PICs are emulated in the kernel.  Now the IOAPIC and the PICs
  correspond to non-x86 interrupt controllers, but the local APIC is more
  tightly coupled to the core.  Interrupt acceptance by the core is an
  operation that involved synchronous communication with the local APIC:
  the APIC presents the interrupt, the core accepts it based on the value
  of the interrupt enable flag and possible a register (CR8), then the
  APIC updates the ISR and IRR.
  
  The upshot is that if the local APIC is in userspace, interrupts must be
  synchronous with vcpu exection, so that KVM_INTERRUPT is a vcpu ioctl
  and HLT is emulated in userspace (so that local APIC emulation can check
  if an interrupt wakes it up or not).
  
  Sorry, still not 100% getting it.  When the vcpu is actually running
  code, that synchronous communication must still be accomplished via
  the KVM_INTERRUPT ioctl, yes?  So what makes HLT different, that the
  communication can't be accomplished in that case.
 
 No, you're correct.  HLT could have been emulated in userspace, it just
 wasn't.  The correct statement is that HLT was arbitrarily chosen to be
 emulated in userspace with the synchronous model, but the asynchronous
 model forced it into the kernel.

Aha!  Ok, understood.  Uh, assuming you meant kernelspace, not
userspace in the first line there, anyway.  Ok, so I am now reassured
that our current handling of CEDE in kernelspace does not cause
problems.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] KVM: correctly detect APIC SW state in kvm_apic_post_state_restore()

2012-08-08 Thread Gleb Natapov
For apic_set_spiv() to track APIC SW state correctly it needs to see
previous and next values of the spurious vector register, but currently
memset() overwrite the old value before apic_set_spiv() get a chance to
do tracking. Fix it by calling apic_set_spiv() before overwriting old
value.

Signed-off-by: Gleb Natapov g...@redhat.com
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index f019ca2..d72d79a 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1389,13 +1389,16 @@ int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu)
return vector;
 }
 
-void kvm_apic_post_state_restore(struct kvm_vcpu *vcpu)
+void kvm_apic_post_state_restore(struct kvm_vcpu *vcpu,
+   struct kvm_lapic_state *s)
 {
struct kvm_lapic *apic = vcpu-arch.apic;
 
kvm_lapic_set_base(vcpu, vcpu-arch.apic_base);
+   /* set SPIV separately to get count of SW disabled APICs right */
+   apic_set_spiv(apic, *((u32 *)(s-regs + APIC_SPIV)));
+   memcpy(vcpu-arch.apic-regs, s-regs, sizeof *s);
kvm_apic_set_version(vcpu);
-   apic_set_spiv(apic, kvm_apic_get_reg(apic, APIC_SPIV));
 
apic_update_ppr(apic);
hrtimer_cancel(apic-lapic_timer.timer);
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 2ad9caa..615a8b0 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -54,7 +54,8 @@ int kvm_apic_local_deliver(struct kvm_lapic *apic, int 
lvt_type);
 
 u64 kvm_get_apic_base(struct kvm_vcpu *vcpu);
 void kvm_set_apic_base(struct kvm_vcpu *vcpu, u64 data);
-void kvm_apic_post_state_restore(struct kvm_vcpu *vcpu);
+void kvm_apic_post_state_restore(struct kvm_vcpu *vcpu,
+   struct kvm_lapic_state *s);
 int kvm_lapic_find_highest_irr(struct kvm_vcpu *vcpu);
 
 u64 kvm_get_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 042754b..3c1ccea 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2345,8 +2345,7 @@ static int kvm_vcpu_ioctl_get_lapic(struct kvm_vcpu *vcpu,
 static int kvm_vcpu_ioctl_set_lapic(struct kvm_vcpu *vcpu,
struct kvm_lapic_state *s)
 {
-   memcpy(vcpu-arch.apic-regs, s-regs, sizeof *s);
-   kvm_apic_post_state_restore(vcpu);
+   kvm_apic_post_state_restore(vcpu, s);
update_cr8_intercept(vcpu);
 
return 0;
--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: x86 emulator: access GPRs on demand

2012-08-08 Thread Avi Kivity
On 08/08/2012 12:23 AM, Marcelo Tosatti wrote:
 @@ -281,8 +294,10 @@ struct x86_emulate_ctxt {
  bool rip_relative;
  unsigned long _eip;
  struct operand memop;
 +u32 regs_valid;  /* bitmaps of registers in _regs[] that can be read */
 +u32 regs_dirty;  /* bitmaps of registers in _regs[] that have been 
 written */
 
 emul_regs_dirty (to avoid with the other regs_dirty).

Well, it's in x86_emulate_ctxt, and there are other fields that can be
confused here.  I think it's okay as they're never used together.

  
 @@ -4453,7 +4447,6 @@ int kvm_inject_realmode_interrupt(struct kvm_vcpu 
 *vcpu, int irq, int inc_eip)
  return EMULATE_FAIL;
  
  ctxt-eip = ctxt-_eip;
 -memcpy(vcpu-arch.regs, ctxt-regs, sizeof ctxt-regs);
  kvm_rip_write(vcpu, ctxt-eip);
  kvm_set_rflags(vcpu, ctxt-eflags);
 
 
 Need to writeback here?

In emulate_in_real().  Good catch.

 
 @@ -4601,7 +4594,7 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu,
 changes registers values  during IO operation */
  if (vcpu-arch.emulate_regs_need_sync_from_vcpu) {
  vcpu-arch.emulate_regs_need_sync_from_vcpu = false;
 -memcpy(ctxt-regs, vcpu-arch.regs, sizeof ctxt-regs);
 +ctxt-regs_valid = 0;
  }
 
 I think you can improve this hack now (perhaps invalidate the emulator
 cache on SET_REGS?).

This is dangerous, if someone does a GET_REGS/SET_REGS pair within an
mmio transaction.

The documentation implies it should not be done, but it doesn't say so
explicitly.  It's likely qemu allows it if the mmio transaction drops
the lock, for example.

I agree that doing it in SET_REGS is better conceptually.

 
  vcpu-arch.emulate_regs_need_sync_to_vcpu = false;
  }
  regs-rax = kvm_register_read(vcpu, VCPU_REGS_RAX);
 @@ -5724,6 +5720,7 @@ int kvm_task_switch(struct kvm_vcpu *vcpu, u16 
 tss_selector, int idt_index,
  {
  struct x86_emulate_ctxt *ctxt = vcpu-arch.emulate_ctxt;
  int ret;
 +unsigned reg;
  
  init_emulate_ctxt(vcpu);
  
 @@ -5733,7 +5730,8 @@ int kvm_task_switch(struct kvm_vcpu *vcpu, u16 
 tss_selector, int idt_index,
  if (ret)
  return EMULATE_FAIL;
  
 -memcpy(vcpu-arch.regs, ctxt-regs, sizeof ctxt-regs);
 +for_each_set_bit(reg, (ulong *)ctxt-regs_dirty, 16)
 +kvm_register_write(vcpu, reg, ctxt-_regs[reg]);
 
 Should update regs_avail/regs_dirty? Better to do any of that in
 emulator.c via interfaces.

emulator_task_switch() should do it, yes.

 
  kvm_rip_write(vcpu, ctxt-eip);
  kvm_set_rflags(vcpu, ctxt-eflags);
  kvm_make_request(KVM_REQ_EVENT, vcpu);
 -- 
 1.7.11.2
 
 Did not double check that emulator.c convertion is complete with your
 patch (which should be done).

I did s/regs/_regs/ for that.  Or did you mean something else?

Regarding emulator interfaces, perhaps we should make all entry points
regular:

  emulator_init(ctxt, ops, vcpu); // just once

  emulator_reset(ctxt)
  while ((ret = emulator_entry_point(ctxt)) == EMULATOR_NEED_DATA)
   get that data from somewhere

emulator_reset() simply resets the state machine, can be as simple as
clearing a bool flag. emulator_entry_point (can be x86_emulate_insn(),
x86_emulate_int(), x86_emulate_task_switch(), etc.) runs until one of
the callbacks tells it to stop to get more data.

(of course the while loop is actually in userspace)


-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Reset problem vs. MMIO emulation, hypercalls, etc...

2012-08-08 Thread Avi Kivity
On 08/08/2012 02:59 PM, David Gibson wrote:
 
 No, you're correct.  HLT could have been emulated in userspace, it just
 wasn't.  The correct statement is that HLT was arbitrarily chosen to be
 emulated in userspace with the synchronous model, but the asynchronous
 model forced it into the kernel.
 
 Aha!  Ok, understood.  Uh, assuming you meant kernelspace, not
 userspace in the first line there, anyway.  

I did.

 Ok, so I am now reassured
 that our current handling of CEDE in kernelspace does not cause
 problems.

Great.  It's a real pity the original local APIC implementation was in
userspace, it causes no end of confusion, and in fact was broken for smp
until recently even though it is 7 years old.

-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH 01/15] atomic: introduce atomic operations

2012-08-08 Thread Stefan Hajnoczi
On Wed, Aug 8, 2012 at 10:21 AM, Peter Maydell peter.mayd...@linaro.org wrote:
 On 8 August 2012 07:25, Liu Ping Fan qemul...@gmail.com wrote:
 +static inline void atomic_sub(int i, Atomic *v)
 +{
 +asm volatile(lock; subl %1,%0
 + : +m (v-counter)
 + : ir (i));
 +}

 NAK. We don't want random inline assembly implementations of locking
 primitives in QEMU, they are way too hard to keep working with all the
 possible host architectures we support. I spent some time a while back
 getting rid of the (variously busted) versions we had previously.

 If you absolutely must use atomic ops, use the gcc builtins. For
 preference, stick to higher level and less error-prone abstractions.

We're spoilt for choice here:

1. Atomic built-ins from gcc
2. glib atomics

No need to roll our own or copy the implementation from the kernel.

Stefan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH 01/15] atomic: introduce atomic operations

2012-08-08 Thread Paolo Bonzini
Il 08/08/2012 15:09, Stefan Hajnoczi ha scritto:
  NAK. We don't want random inline assembly implementations of locking
  primitives in QEMU, they are way too hard to keep working with all the
  possible host architectures we support. I spent some time a while back
  getting rid of the (variously busted) versions we had previously.
 
  If you absolutely must use atomic ops, use the gcc builtins. For
  preference, stick to higher level and less error-prone abstractions.
 We're spoilt for choice here:
 
 1. Atomic built-ins from gcc
 2. glib atomics
 
 No need to roll our own or copy the implementation from the kernel.

To some extent we need to because:

1. GCC atomics look ugly, :) do not provide rmb/wmb, and in some
versions of GCC mb is known to be (wrongly) a no-op.

2. glib atomics do not provide mb/rmb/wmb either, and
g_atomic_int_get/g_atomic_int_set are inefficient: they add barriers
everywhere, while it is clearer if you put barriers manually, and you
often do not need barriers in the get side.  glib atomics also do not
provide xchg.

I agree however that a small wrapper around GCC atomics is much better
than assembly.  Assembly can be limited to the memory barriers (where we
already have it anyway).

Paolo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH 01/15] atomic: introduce atomic operations

2012-08-08 Thread Peter Maydell
On 8 August 2012 14:18, Paolo Bonzini pbonz...@redhat.com wrote:
 Il 08/08/2012 15:09, Stefan Hajnoczi ha scritto:
 No need to roll our own or copy the implementation from the kernel.

 To some extent we need to because:

 1. GCC atomics look ugly, :) do not provide rmb/wmb, and in some
 versions of GCC mb is known to be (wrongly) a no-op.

 2. glib atomics do not provide mb/rmb/wmb either, and
 g_atomic_int_get/g_atomic_int_set are inefficient: they add barriers
 everywhere, while it is clearer if you put barriers manually, and you
 often do not need barriers in the get side.  glib atomics also do not
 provide xchg.

These are arguments in favour of don't try to use atomic ops --
if serious large projects like GCC and glib can't produce working
efficient implementations for all target architectures, what chance
do we have?

-- PMM
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 01/15] atomic: introduce atomic operations

2012-08-08 Thread Paolo Bonzini
Il 08/08/2012 15:32, Peter Maydell ha scritto:
  1. GCC atomics look ugly, :) do not provide rmb/wmb, and in some
  versions of GCC mb is known to be (wrongly) a no-op.
 
  2. glib atomics do not provide mb/rmb/wmb either, and
  g_atomic_int_get/g_atomic_int_set are inefficient: they add barriers
  everywhere, while it is clearer if you put barriers manually, and you
  often do not need barriers in the get side.  glib atomics also do not
  provide xchg.
 These are arguments in favour of don't try to use atomic ops --
 if serious large projects like GCC and glib can't produce working
 efficient implementations for all target architectures, what chance
 do we have?

Well, maybe... but the flaws in both GCC and glib are small in size
(even though large in importance, at least for us) and we can work
around them easily.  mb/rmb/wmb is essentially the small set of atomic
operations that we're already using.

Paolo

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 01/15] atomic: introduce atomic operations

2012-08-08 Thread Avi Kivity
On 08/08/2012 04:49 PM, Paolo Bonzini wrote:
 Il 08/08/2012 15:32, Peter Maydell ha scritto:
  1. GCC atomics look ugly, :) do not provide rmb/wmb, and in some
  versions of GCC mb is known to be (wrongly) a no-op.
 
  2. glib atomics do not provide mb/rmb/wmb either, and
  g_atomic_int_get/g_atomic_int_set are inefficient: they add barriers
  everywhere, while it is clearer if you put barriers manually, and you
  often do not need barriers in the get side.  glib atomics also do not
  provide xchg.
 These are arguments in favour of don't try to use atomic ops --
 if serious large projects like GCC and glib can't produce working
 efficient implementations for all target architectures, what chance
 do we have?
 
 Well, maybe... but the flaws in both GCC and glib are small in size
 (even though large in importance, at least for us) and we can work
 around them easily.  mb/rmb/wmb is essentially the small set of atomic
 operations that we're already using.

We can easily define rmb()/wmb() to be __sync_synchronize() as a
default, and only override them where it matters.

-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] KVM call agenda for tuesday 31

2012-08-08 Thread Andreas Färber
Am 31.01.2012 15:01, schrieb Mitsyanko Igor:
 On 01/31/2012 05:15 PM, Andreas Färber wrote:
 Am 31.01.2012 00:53, schrieb Anthony Liguori:
 On 01/30/2012 05:41 PM, Andreas Färber wrote:
 Am 30.01.2012 19:55, schrieb Juan Quintela:
 Please send in any agenda items you are interested in covering.

 VMState:
 Anthony specifically said that VMState were not affected by QOM and
 that
 patches should not be deferred until the merge. Yet there's no review
 and/or decision-making for a month now. Ping^2 for AHCI+SDHC.

 Do you have pointers (to pending VMState patches)?

 http://patchwork.ozlabs.org/patch/137732/ (PATCH v4)

 It's basically about how to deal with variable-sized arrays. (Alex
 mentioned it on one call around November.) I found ways to deal with
 subsets of arrays embedded within the struct and variable-sized list of
 pointers to structs but no solution for a malloc()'ed array of structs.
 Maybe I'm just too stupid to see. Anyway, no one commented since Xmas.

 Igor posted (and refined for v2) a patch with a callback-based approach
 that I find promising. From my view, unofficially Juan is the VMState
 guy, he's been cc'ed. Are we lacking an official maintainer that cares?
 Or is Juan the official, undocumented maintainer but simply busy?

 SUSE's interest is making AHCI migratable, and my VMState workaround for
 that is simply ugly:

 http://patchwork.ozlabs.org/patch/133066/ (RFC)

 
 If I'm not mistaken, if you change AHCIState's .ports type to uint32_t
 you can use existing VMSTATE_BUFFER_MULTIPLY macro like this:
 
 VMSTATE_BUFFER_MULTIPLY(dev, AHCIState, 0, NULL, 0, ports,
 sizeof(AHCIDevice))

Igor, I finally got around to rebasing and trying this: Am I seeing
correctly that this tries to serialize the whole of AHCIDevice as an
opaque buffer? The difficulty here was that we were looking for a way to
serialize a variable number of structured elements with their own
vmstate_ahci_device specifying what fields to serialize.

Juan, how should we proceed there? Do as Igor suggested? Some VMSTATE_
macro I'm overlooking? Or do we need some new macro for this use case?

Regards,
Andreas

 VMSTATE_BUFFER_MULTIPLY currently lacks VMS_POINTER flag and therefore
 doesn't make any use of _start field (you don't need it anyway)
 
 Nevertheless, VMSTATE_BUFFER_MULTIPLY is just a partial solution to a
 bigger set of possible problems. SD card's vmstate implementation
 requires shift operation, SDHC gets size from switch {} statement,
 something else later may require division or addition e.t.c.,
 get_bufsize callback will cover all possible cases.

-- 
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


hang on reboot with 3.6-rc1

2012-08-08 Thread David Ahern
Not sure if KVM is the culprit, but it is the last line shown on the 
console. I have to power cycle the server to reboot.


David
attachment: kvm-3.6-rc1.png

Re: [kvmarm] [PATCH 5/8] KVM: Add hva_to_memslot

2012-08-08 Thread Alexander Graf

On 08.08.2012, at 06:55, Christoffer Dall wrote:

 On Tue, Aug 7, 2012 at 6:57 AM, Alexander Graf ag...@suse.de wrote:
 Architecture code might want to figure out if an hva that it gets for example
 via the mmu notifier callbacks actually is in guest mapped memory, and if so,
 in which memory slot.
 
 This patch introduces a helper function to enable it to do so. It is a
 prerequisite for the e500 mmu notifier implementation.
 
 Signed-off-by: Alexander Graf ag...@suse.de
 ---
 include/linux/kvm_host.h |1 +
 virt/kvm/kvm_main.c  |   14 ++
 2 files changed, 15 insertions(+), 0 deletions(-)
 
 diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
 index dbc65f9..2b92928 100644
 --- a/include/linux/kvm_host.h
 +++ b/include/linux/kvm_host.h
 @@ -464,6 +464,7 @@ int kvm_gfn_to_hva_cache_init(struct kvm *kvm, struct 
 gfn_to_hva_cache *ghc,
 int kvm_clear_guest_page(struct kvm *kvm, gfn_t gfn, int offset, int len);
 int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len);
 struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn);
 +struct kvm_memory_slot *hva_to_memslot(struct kvm *kvm, hva_t hva);
 int kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn);
 unsigned long kvm_host_page_size(struct kvm *kvm, gfn_t gfn);
 void mark_page_dirty(struct kvm *kvm, gfn_t gfn);
 diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
 index bcf973e..d42591d 100644
 --- a/virt/kvm/kvm_main.c
 +++ b/virt/kvm/kvm_main.c
 @@ -999,6 +999,20 @@ struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, 
 gfn_t gfn)
 }
 EXPORT_SYMBOL_GPL(gfn_to_memslot);
 
 +struct kvm_memory_slot *hva_to_memslot(struct kvm *kvm, hva_t hva)
 +{
 +   struct kvm_memslots *slots = kvm_memslots(kvm);
 +   struct kvm_memory_slot *memslot;
 +
 +   kvm_for_each_memslot(memslot, slots)
 +   if (hva = memslot-userspace_addr 
 + hva  memslot-userspace_addr + memslot-npages)
 
 addr + npages, this doesn't look right

Thanks a lot for spotting that one!

 
 +   return memslot;
 +
 +   return NULL;
 +}
 +EXPORT_SYMBOL_GPL(hva_to_memslot);
 
 consider also adding a hva_to_gpa wrapper now when you're add it, then
 ARM code will be so happy:
 
 bool hva_to_gpa(struct kvm *kvm, unsigned long hva, gpa_t *gpa)
 {
   struct kvm_memory_slot *memslot;
 
   memslot = hva_to_memslot(kvm, hva);
   if (!memslot)
   return false;
 
   gpa_t gpa_offset = hva - memslot-userspace_addr;
   *gpa = (memslot-base_gfn  PAGE_SHIFT) + gpa_offset;
   return true;
 }

What do you need this for? I usually don't like to add framework code that has 
no users (yet).


Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/8] KVM: PPC: E500: Implement MMU Notifiers

2012-08-08 Thread Alexander Graf

On 07.08.2012, at 12:57, Alexander Graf wrote:

 This patch set adds a very simple implementation of MMU notifiers for the
 e500 target. Along the way, I stumbled over a few other things that I've
 put into the same patch set, namely:
 
  * e500 prerequisites
  * icache flushing on page map (probably also hits ARM!)
  * exit trace point for e500
 
 Alexander Graf (8):
  KVM: PPC: BookE: Expose remote TLB flushes in debugfs
  KVM: PPC: E500: Fix clear_tlb_refs
  KVM: PPC: PR: Use generic tracepoint for guest exit
  KVM: PPC: Expose SYNC cap based on mmu notifiers

I applied the above 4 patches to kvm-ppc-next.


Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] handle device help before accelerator set up

2012-08-08 Thread Bruce Rogers
A command line device probe using just -device ? gets processed
after qemu-kvm initializes the accelerator. If /dev/kvm is not
present, the accelerator check will fail (kvm is defaulted to on),
which causes libvirt to not be set up to handle qemu guests.

Moving the device help handling before the accelerator set up allows
the device probe to work in this configuration and libvirt succeeds
in setting up for a qemu hypervisor mode.

Signed-off-by: Bruce Rogers brog...@suse.com
---
 vl.c |6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/vl.c b/vl.c
index 1a46d2d..5b75cf9 100644
--- a/vl.c
+++ b/vl.c
@@ -3380,6 +3380,9 @@ int main(int argc, char **argv, char **envp)
 ram_size = DEFAULT_RAM_SIZE * 1024 * 1024;
 }
 
+if (qemu_opts_foreach(qemu_find_opts(device), device_help_func, NULL, 0) 
!= 0)
+exit(0);
+
 configure_accelerator();
 
 qemu_init_cpu_loop();
@@ -3535,9 +3538,6 @@ int main(int argc, char **argv, char **envp)
 }
 select_vgahw(vga_model);
 
-if (qemu_opts_foreach(qemu_find_opts(device), device_help_func, NULL, 0) 
!= 0)
-exit(0);
-
 if (watchdog) {
 i = select_watchdog(watchdog);
 if (i  0)
-- 
1.7.7

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH v8 5/6] introduce a new qom device to deal with panicked event

2012-08-08 Thread Blue Swirl
On Wed, Aug 8, 2012 at 2:47 AM, Wen Congyang we...@cn.fujitsu.com wrote:
 If the target is x86/x86_64, the guest's kernel will write 0x01 to the
 port KVM_PV_EVENT_PORT when it is panciked. This patch introduces a new
 qom device kvm_pv_ioport to listen this I/O port, and deal with panicked
 event according to panicked_action's value. The possible actions are:
 1. emit QEVENT_GUEST_PANICKED only
 2. emit QEVENT_GUEST_PANICKED and pause the guest
 3. emit QEVENT_GUEST_PANICKED and poweroff the guest
 4. emit QEVENT_GUEST_PANICKED and reset the guest

 I/O ports does not work for some targets(for example: s390). And you
 can implement another qom device, and include it's code into pv_event.c
 for such target.

 Note: if we emit QEVENT_GUEST_PANICKED only, and the management
 application does not receive this event(the management may not
 run when the event is emitted), the management won't know the
 guest is panicked.

 Signed-off-by: Wen Congyang we...@cn.fujitsu.com
 ---
  hw/kvm/Makefile.objs |2 +-
  hw/kvm/pv_event.c|  109 
 ++
  hw/kvm/pv_ioport.c   |   93 ++
  hw/pc_piix.c |9 
  kvm.h|2 +
  5 files changed, 214 insertions(+), 1 deletions(-)
  create mode 100644 hw/kvm/pv_event.c
  create mode 100644 hw/kvm/pv_ioport.c

 diff --git a/hw/kvm/Makefile.objs b/hw/kvm/Makefile.objs
 index 226497a..23e3b30 100644
 --- a/hw/kvm/Makefile.objs
 +++ b/hw/kvm/Makefile.objs
 @@ -1 +1 @@
 -obj-$(CONFIG_KVM) += clock.o apic.o i8259.o ioapic.o i8254.o
 +obj-$(CONFIG_KVM) += clock.o apic.o i8259.o ioapic.o i8254.o pv_event.o
 diff --git a/hw/kvm/pv_event.c b/hw/kvm/pv_event.c
 new file mode 100644
 index 000..8897237
 --- /dev/null
 +++ b/hw/kvm/pv_event.c
 @@ -0,0 +1,109 @@
 +/*
 + * QEMU KVM support, paravirtual event device
 + *
 + * Copyright Fujitsu, Corp. 2012
 + *
 + * Authors:
 + * Wen Congyang we...@cn.fujitsu.com
 + *
 + * This work is licensed under the terms of the GNU GPL, version 2 or later.
 + * See the COPYING file in the top-level directory.
 + *
 + */
 +
 +#include linux/kvm_para.h
 +#include asm/kvm_para.h
 +#include qobject.h
 +#include qjson.h
 +#include monitor.h
 +#include sysemu.h
 +#include kvm.h
 +
 +/* Possible values for action parameter. */
 +#define PANICKED_REPORT 1   /* emit QEVENT_GUEST_PANICKED only */
 +#define PANICKED_PAUSE  2   /* emit QEVENT_GUEST_PANICKED and pause VM */
 +#define PANICKED_POWEROFF   3   /* emit QEVENT_GUEST_PANICKED and quit VM */
 +#define PANICKED_RESET  4   /* emit QEVENT_GUEST_PANICKED and reset VM */
 +
 +#define PV_EVENT_DRIVER kvm_pv_event
 +
 +struct pv_event_action {

PVEventAction

 +char *panicked_action;
 +int panicked_action_value;
 +};
 +
 +#define DEFINE_PV_EVENT_PROPERTIES(_state, _conf)   \
 +DEFINE_PROP_STRING(panicked_action, _state, _conf.panicked_action)
 +
 +static void panicked_mon_event(const char *action)
 +{
 +QObject *data;
 +
 +data = qobject_from_jsonf({ 'action': %s }, action);
 +monitor_protocol_event(QEVENT_GUEST_PANICKED, data);
 +qobject_decref(data);
 +}
 +
 +static void panicked_perform_action(uint32_t panicked_action)
 +{
 +switch (panicked_action) {
 +case PANICKED_REPORT:
 +panicked_mon_event(report);
 +break;
 +
 +case PANICKED_PAUSE:
 +panicked_mon_event(pause);
 +vm_stop(RUN_STATE_GUEST_PANICKED);
 +break;
 +
 +case PANICKED_POWEROFF:
 +panicked_mon_event(poweroff);
 +qemu_system_shutdown_request();
 +break;

Misses a line break unlike other cases.

 +case PANICKED_RESET:
 +panicked_mon_event(reset);
 +qemu_system_reset_request();
 +break;
 +}
 +}
 +
 +static uint64_t supported_event(void)
 +{
 +return 1  KVM_PV_FEATURE_PANICKED;
 +}
 +
 +static void handle_event(int event, struct pv_event_action *conf)
 +{
 +if (event == KVM_PV_EVENT_PANICKED) {
 +panicked_perform_action(conf-panicked_action_value);
 +}
 +}
 +
 +static int pv_event_init(struct pv_event_action *conf)
 +{
 +if (!conf-panicked_action) {
 +conf-panicked_action_value = PANICKED_REPORT;
 +} else if (strcasecmp(conf-panicked_action, none) == 0) {
 +conf-panicked_action_value = PANICKED_REPORT;
 +} else if (strcasecmp(conf-panicked_action, pause) == 0) {
 +conf-panicked_action_value = PANICKED_PAUSE;
 +} else if (strcasecmp(conf-panicked_action, poweroff) == 0) {
 +conf-panicked_action_value = PANICKED_POWEROFF;
 +} else if (strcasecmp(conf-panicked_action, reset) == 0) {
 +conf-panicked_action_value = PANICKED_RESET;
 +} else {
 +return -1;
 +}
 +
 +return 0;
 +}
 +
 +#if defined(KVM_PV_EVENT_PORT)
 +
 +#include pv_ioport.c

I'd rather not include any .c files but insert the contents here directly.

 +
 +#else
 +void kvm_pv_event_init(void *opaque)
 +{
 +}
 +#endif
 diff 

Re: [Qemu-devel] [PATCH 3/5] s390: Add new channel I/O based virtio transport.

2012-08-08 Thread Blue Swirl
On Wed, Aug 8, 2012 at 8:28 AM, Cornelia Huck cornelia.h...@de.ibm.com wrote:
 On Tue, 7 Aug 2012 20:47:22 +
 Blue Swirl blauwir...@gmail.com wrote:


  diff --git a/hw/s390x/virtio-ccw.c b/hw/s390x/virtio-ccw.c
  new file mode 100644
  index 000..8a90c3a
  --- /dev/null
  +++ b/hw/s390x/virtio-ccw.c
  @@ -0,0 +1,962 @@
  +/*
  + * virtio ccw target implementation
  + *
  + * Copyright 2012 IBM Corp.
  + * Author(s): Cornelia Huck cornelia.h...@de.ibm.com
  + *
  + * This work is licensed under the terms of the GNU GPL, version 2 or (at
  + * your option) any later version. See the COPYING file in the top-level
  + * directory.
  + */
  +
  +#include hw/hw.h
  +#include block.h
  +#include blockdev.h
  +#include sysemu.h
  +#include net.h
  +#include monitor.h
  +#include qemu-thread.h
  +#include ../virtio.h
  +#include ../virtio-serial.h
  +#include ../virtio-net.h
  +#include ../sysbus.h

 hw/virtio... for the above

 OK.

  +#include bitops.h
  +
  +#include ioinst.h
  +#include css.h
  +#include virtio-ccw.h
  +
  +static const TypeInfo virtio_ccw_bus_info = {
  +.name = TYPE_VIRTIO_CCW_BUS,
  +.parent = TYPE_BUS,
  +.instance_size = sizeof(VirtioCcwBus),
  +};
  +
  +static const VirtIOBindings virtio_ccw_bindings;
  +
  +typedef struct sch_entry {
  +SubchDev *sch;
  +QLIST_ENTRY(sch_entry) entry;
  +} sch_entry;

 SubchEntry, see CODING_STYLE. Also other struct and typedef names below.

  +
  +QLIST_HEAD(subch_list, sch_entry);

 static, but please put this to a structure that is passed around instead.

  +
  +typedef struct devno_entry {
  +uint16_t devno;
  +QLIST_ENTRY(devno_entry) entry;
  +} devno_entry;
  +
  +QLIST_HEAD(devno_list, devno_entry);

 Ditto

  +
  +struct subch_set {
  +struct subch_list *s_list[256];
  +struct devno_list *d_list[256];
  +};
  +
  +struct css_set {
  +struct subch_set *set[MAX_SSID + 1];
  +};
  +
  +static struct css_set *channel_subsys[MAX_CSSID + 1];

 OK, will try to come up with some kind of structure for this and
 CamelCasify it.

  +
  +VirtIODevice *virtio_ccw_get_vdev(SubchDev *sch)
  +{
  +VirtIODevice *vdev = NULL;
  +
  +if (sch-driver_data) {
  +vdev = ((VirtioCcwData *)sch-driver_data)-vdev;
  +}
  +return vdev;
  +}
  +

  +VirtioCcwBus *virtio_ccw_bus_init(void)
  +{
  +VirtioCcwBus *bus;
  +BusState *_bus;

 Please avoid identifiers with leading underscores.

 OK.


  +DeviceState *dev;
  +
  +css_set_subch_cb(virtio_ccw_find_subch);
  +
  +/* Create bridge device */
  +dev = qdev_create(NULL, virtio-ccw-bridge);
  +qdev_init_nofail(dev);
  +
  +/* Create bus on bridge device */
  +_bus = qbus_create(TYPE_VIRTIO_CCW_BUS, dev, virtio-ccw);
  +bus = DO_UPCAST(VirtioCcwBus, bus, _bus);
  +
  +/* Enable hotplugging */
  +_bus-allow_hotplug = 1;
  +
  +return bus;
  +}
  +
  +struct vq_info_block {
  +uint64_t queue;
  +uint16_t num;
  +} QEMU_PACKED;
  +
  +struct vq_config_block {
  +uint16_t index;
  +uint16_t num;
  +} QEMU_PACKED;

 Aren't these KVM structures? They should be defined in a KVM header
 file file in linux-headers.

 Not really, virtio-ccw isn't tied to kvm.

 I see this more as command blocks that are specific to the control
 unit - like something that would be defined in an attachment
 specification for a classic s390 device (and in the virtio spec in this
 case) and modeled as C structures here.

OK. Then please use CamelCase for these too.




  +case CCW_CMD_WRITE_CONF:
  +if (check_len) {
  +if (ccw-count  data-vdev-config_len) {
  +ret = -EINVAL;
  +break;
  +}
  +}
  +len = MIN(ccw-count, data-vdev-config_len);
  +config = qemu_get_ram_ptr(ccw-cda);

 Please use cpu_physical_memory_read() (or DMA versions) instead of
 this + memcpy().

 Will check.

  +if (!config) {
  +ret = -EFAULT;
  +} else {
  +memcpy(data-vdev-config, config, len);
  +if (data-vdev-set_config) {
  +data-vdev-set_config(data-vdev, data-vdev-config);
  +}
  +sch-curr_status.scsw.count = ccw-count - len;
  +ret = 0;
  +}
  +break;

  +case CCW_CMD_READ_VQ_CONF:
  +if (check_len) {
  +if (ccw-count != sizeof(vq_config)) {
  +ret = -EINVAL;
  +break;
  +}
  +} else if (ccw-count  sizeof(vq_config)) {
  +/* Can't execute command. */
  +ret = -EINVAL;
  +break;
  +}
  +if (!qemu_get_ram_ptr(ccw-cda)) {
  +ret = -EFAULT;
  +} else {
  +vq_config.index = lduw_phys(ccw-cda);

 lduw_{b,l}e_phys()

  +vq_config.num = virtio_queue_get_num(data-vdev, 
  vq_config.index);
  +stw_phys(ccw-cda + sizeof(vq_config.index), vq_config.num);

 

Re: [Qemu-devel] [PATCH 2/5] s390: Virtual channel subsystem support.

2012-08-08 Thread Blue Swirl
On Wed, Aug 8, 2012 at 8:17 AM, Cornelia Huck cornelia.h...@de.ibm.com wrote:
 On Tue, 7 Aug 2012 21:00:59 +
 Blue Swirl blauwir...@gmail.com wrote:


  diff --git a/hw/s390x/css.c b/hw/s390x/css.c
  new file mode 100644
  index 000..7941c44
  --- /dev/null
  +++ b/hw/s390x/css.c
  @@ -0,0 +1,440 @@
  +/*
  + * Channel subsystem base support.
  + *
  + * Copyright 2012 IBM Corp.
  + * Author(s): Cornelia Huck cornelia.h...@de.ibm.com
  + *
  + * This work is licensed under the terms of the GNU GPL, version 2 or (at
  + * your option) any later version. See the COPYING file in the top-level
  + * directory.
  + */
  +
  +#include qemu-thread.h
  +#include qemu-queue.h
  +#include hw/qdev.h
  +#include kvm.h
  +#include cpu.h
  +#include ioinst.h
  +#include css.h
  +
  +struct chp_info {

 CamelCase, please.

 OK.

  +uint8_t in_use;
  +uint8_t type;
  +};
  +
  +static struct chp_info chpids[MAX_CSSID + 1][MAX_CHPID + 1];
  +
  +static css_subch_cb_func css_subch_cb;

 Probably these can be put to a container structure which can be passed 
 around.

 Still trying to come up with a good model for that.



  +case CCW_CMD_SENSE_ID:
  +{
  +uint8_t sense_bytes[256];
  +
  +/* Sense ID information is device specific. */
  +memcpy(sense_bytes, sch-id, sizeof(sense_bytes));
  +if (check_len) {
  +if (ccw-count != sizeof(sense_bytes)) {
  +ret = -EINVAL;
  +break;
  +}
  +}
  +len = MIN(ccw-count, sizeof(sense_bytes));
  +/*
  + * Only indicate 0xff in the first sense byte if we actually
  + * have enough place to store at least bytes 0-3.
  + */
  +if (len = 4) {
  +stb_phys(ccw-cda, 0xff);
  +} else {
  +stb_phys(ccw-cda, 0);
  +}
  +i = 1;
  +for (i = 1; i  len - 1; i++) {
  +stb_phys(ccw-cda + i, sense_bytes[i]);
  +}

 cpu_physical_memory_write()

 Hm, what's wrong with storing byte-by-byte?

cpu_physical_memory_write() could be more optimal, for example resolve
guest addresses only once per page.



  +sch-curr_status.scsw.count = ccw-count - len;
  +ret = 0;
  +break;
  +}
  +case CCW_CMD_TIC:
  +if (sch-last_cmd-cmd_code == CCW_CMD_TIC) {
  +ret = -EINVAL;
  +break;
  +}
  +if (ccw-flags  (CCW_FLAG_CC | CCW_FLAG_DC)) {
  +ret = -EINVAL;
  +break;
  +}
  +sch-channel_prog = qemu_get_ram_ptr(ccw-cda);
  +ret = sch-channel_prog ? -EAGAIN : -EFAULT;
  +break;
  +default:
  +if (sch-ccw_cb) {
  +/* Handle device specific commands. */
  +ret = sch-ccw_cb(sch, ccw);
  +} else {
  +ret = -EOPNOTSUPP;
  +}
  +break;
  +}
  +sch-last_cmd = ccw;
  +if (ret == 0) {
  +if (ccw-flags  CCW_FLAG_CC) {
  +sch-channel_prog += 8;
  +ret = -EAGAIN;
  +}
  +}
  +
  +return ret;

  diff --git a/hw/s390x/css.h b/hw/s390x/css.h
  new file mode 100644
  index 000..b8a95cc
  --- /dev/null
  +++ b/hw/s390x/css.h
  @@ -0,0 +1,62 @@
  +/*
  + * Channel subsystem structures and definitions.
  + *
  + * Copyright 2012 IBM Corp.
  + * Author(s): Cornelia Huck cornelia.h...@de.ibm.com
  + *
  + * This work is licensed under the terms of the GNU GPL, version 2 or (at
  + * your option) any later version. See the COPYING file in the top-level
  + * directory.
  + */
  +
  +#ifndef CSS_H
  +#define CSS_H
  +
  +#include ioinst.h
  +
  +/* Channel subsystem constants. */
  +#define MAX_SCHID 65535
  +#define MAX_SSID 3
  +#define MAX_CSSID 254 /* 255 is reserved */
  +#define MAX_CHPID 255
  +
  +#define MAX_CIWS 8
  +
  +struct senseid {

 SenseID

 OK.

  +/* common part */
  +uint32_t  reserved:8;/* always 0x'FF' */

 The standard syntax calls for 'unsigned' instead of uint32_t for bit
 fields. But bit fields are not very well defined, it's better to avoid
 them.

 Well, the equivalent Linux structure also looks like that :) But I can
 switch this to a uint8_t/uint16_t structure.


  +uint32_t  cu_type:16;/* control unit type */
  +uint32_t  cu_model:8;/* control unit model */
  +uint32_t  dev_type:16;   /* device type */
  +uint32_t  dev_model:8;   /* device model */
  +uint32_t  unused:8;  /* padding byte */
  +/* extended part */
  +uint32_t ciw[MAX_CIWS];  /* variable # of CIWs */
  +};
  +

  diff --git a/target-s390x/ioinst.h b/target-s390x/ioinst.h
  new file mode 100644
  index 000..79628b4
  --- /dev/null
  +++ b/target-s390x/ioinst.h
  @@ -0,0 +1,173 @@
  +/*
  + * S/390 channel I/O instructions
  + *
  + * Copyright 2012 IBM Corp.
  + * Author(s): Cornelia Huck cornelia.h...@de.ibm.com
  + *
  + * This work is licensed under the terms of the GNU GPL, version 2 or 

Re: [PATCH 04/15] memory: MemoryRegion topology must be stable when updating

2012-08-08 Thread Blue Swirl
On Wed, Aug 8, 2012 at 6:25 AM, Liu Ping Fan qemul...@gmail.com wrote:
 From: Liu Ping Fan pingf...@linux.vnet.ibm.com

 Using mem_map_lock to protect among updaters. So we can get the intact
 snapshot of mem topology -- FlatView  radix-tree.

 Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
 ---
  exec.c   |3 +++
  memory.c |   22 ++
  memory.h |2 ++
  3 files changed, 27 insertions(+), 0 deletions(-)

 diff --git a/exec.c b/exec.c
 index 8244d54..0e29ef9 100644
 --- a/exec.c
 +++ b/exec.c
 @@ -210,6 +210,8 @@ static unsigned phys_map_nodes_nb, 
 phys_map_nodes_nb_alloc;
 The bottom level has pointers to MemoryRegionSections.  */
  static PhysPageEntry phys_map = { .ptr = PHYS_MAP_NODE_NIL, .is_leaf = 0 };

 +QemuMutex mem_map_lock;
 +
  static void io_mem_init(void);
  static void memory_map_init(void);

 @@ -637,6 +639,7 @@ void cpu_exec_init_all(void)
  #if !defined(CONFIG_USER_ONLY)
  memory_map_init();
  io_mem_init();
 +qemu_mutex_init(mem_map_lock);

I'd move this and the mutex to memory.c since there are no other uses.
The mutex could be static then.

  #endif
  }

 diff --git a/memory.c b/memory.c
 index aab4a31..5986532 100644
 --- a/memory.c
 +++ b/memory.c
 @@ -761,7 +761,9 @@ void memory_region_transaction_commit(void)
  assert(memory_region_transaction_depth);
  --memory_region_transaction_depth;
  if (!memory_region_transaction_depth  memory_region_update_pending) {
 +qemu_mutex_lock(mem_map_lock);
  memory_region_update_topology(NULL);
 +qemu_mutex_unlock(mem_map_lock);
  }
  }

 @@ -1069,8 +1071,10 @@ void memory_region_set_log(MemoryRegion *mr, bool log, 
 unsigned client)
  {
  uint8_t mask = 1  client;

 +qemu_mutex_lock(mem_map_lock);
  mr-dirty_log_mask = (mr-dirty_log_mask  ~mask) | (log * mask);
  memory_region_update_topology(mr);
 +qemu_mutex_unlock(mem_map_lock);
  }

  bool memory_region_get_dirty(MemoryRegion *mr, target_phys_addr_t addr,
 @@ -1103,8 +1107,10 @@ void memory_region_sync_dirty_bitmap(MemoryRegion *mr)
  void memory_region_set_readonly(MemoryRegion *mr, bool readonly)
  {
  if (mr-readonly != readonly) {
 +qemu_mutex_lock(mem_map_lock);
  mr-readonly = readonly;
  memory_region_update_topology(mr);
 +qemu_mutex_unlock(mem_map_lock);
  }
  }

 @@ -1112,7 +1118,9 @@ void memory_region_rom_device_set_readable(MemoryRegion 
 *mr, bool readable)
  {
  if (mr-readable != readable) {
  mr-readable = readable;
 +qemu_mutex_lock(mem_map_lock);
  memory_region_update_topology(mr);
 +qemu_mutex_unlock(mem_map_lock);
  }
  }

 @@ -1206,6 +1214,7 @@ void memory_region_add_eventfd(MemoryRegion *mr,
  };
  unsigned i;

 +qemu_mutex_lock(mem_map_lock);
  for (i = 0; i  mr-ioeventfd_nb; ++i) {
  if (memory_region_ioeventfd_before(mrfd, mr-ioeventfds[i])) {
  break;
 @@ -1218,6 +1227,7 @@ void memory_region_add_eventfd(MemoryRegion *mr,
  sizeof(*mr-ioeventfds) * (mr-ioeventfd_nb-1 - i));
  mr-ioeventfds[i] = mrfd;
  memory_region_update_topology(mr);
 +qemu_mutex_unlock(mem_map_lock);
  }

  void memory_region_del_eventfd(MemoryRegion *mr,
 @@ -1236,6 +1246,7 @@ void memory_region_del_eventfd(MemoryRegion *mr,
  };
  unsigned i;

 +qemu_mutex_lock(mem_map_lock);
  for (i = 0; i  mr-ioeventfd_nb; ++i) {
  if (memory_region_ioeventfd_equal(mrfd, mr-ioeventfds[i])) {
  break;
 @@ -1248,6 +1259,7 @@ void memory_region_del_eventfd(MemoryRegion *mr,
  mr-ioeventfds = g_realloc(mr-ioeventfds,
sizeof(*mr-ioeventfds)*mr-ioeventfd_nb + 
 1);
  memory_region_update_topology(mr);
 +qemu_mutex_unlock(mem_map_lock);
  }

  static void memory_region_add_subregion_common(MemoryRegion *mr,
 @@ -1259,6 +1271,8 @@ static void 
 memory_region_add_subregion_common(MemoryRegion *mr,
  assert(!subregion-parent);
  subregion-parent = mr;
  subregion-addr = offset;
 +
 +qemu_mutex_lock(mem_map_lock);
  QTAILQ_FOREACH(other, mr-subregions, subregions_link) {
  if (subregion-may_overlap || other-may_overlap) {
  continue;
 @@ -1289,6 +1303,7 @@ static void 
 memory_region_add_subregion_common(MemoryRegion *mr,
  QTAILQ_INSERT_TAIL(mr-subregions, subregion, subregions_link);
  done:
  memory_region_update_topology(mr);
 +qemu_mutex_unlock(mem_map_lock);
  }


 @@ -1316,8 +1331,11 @@ void memory_region_del_subregion(MemoryRegion *mr,
  {
  assert(subregion-parent == mr);
  subregion-parent = NULL;
 +
 +qemu_mutex_lock(mem_map_lock);
  QTAILQ_REMOVE(mr-subregions, subregion, subregions_link);
  memory_region_update_topology(mr);
 +qemu_mutex_unlock(mem_map_lock);
  }

  void memory_region_set_enabled(MemoryRegion *mr, bool enabled)
 @@ -1325,8 +1343,10 @@ void memory_region_set_enabled(MemoryRegion *mr, 

Re: [PATCH 08/15] memory: introduce PhysMap to present snapshot of toploygy

2012-08-08 Thread Blue Swirl
On Wed, Aug 8, 2012 at 6:25 AM, Liu Ping Fan qemul...@gmail.com wrote:
 From: Liu Ping Fan pingf...@linux.vnet.ibm.com

 PhysMap contain the flatview and radix-tree view, they are snapshot
 of system topology and should be consistent. With PhysMap, we can
 swap the pointer when updating and achieve the atomic.

 Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
 ---
  exec.c   |8 
  memory.c |   33 -
  memory.h |   62 
 --
  3 files changed, 60 insertions(+), 43 deletions(-)

 diff --git a/exec.c b/exec.c
 index 0e29ef9..01b91b0 100644
 --- a/exec.c
 +++ b/exec.c
 @@ -156,8 +156,6 @@ typedef struct PageDesc {
  #endif

  /* Size of the L2 (and L3, etc) page tables.  */

Please copy this comment to the header file.

 -#define L2_BITS 10
 -#define L2_SIZE (1  L2_BITS)

  #define P_L2_LEVELS \
  (((TARGET_PHYS_ADDR_SPACE_BITS - TARGET_PAGE_BITS - 1) / L2_BITS) + 1)
 @@ -185,7 +183,6 @@ uintptr_t qemu_host_page_mask;
  static void *l1_map[V_L1_SIZE];

  #if !defined(CONFIG_USER_ONLY)
 -typedef struct PhysPageEntry PhysPageEntry;

  static MemoryRegionSection *phys_sections;
  static unsigned phys_sections_nb, phys_sections_nb_alloc;
 @@ -194,11 +191,6 @@ static uint16_t phys_section_notdirty;
  static uint16_t phys_section_rom;
  static uint16_t phys_section_watch;

 -struct PhysPageEntry {
 -uint16_t is_leaf : 1;
 - /* index into phys_sections (is_leaf) or phys_map_nodes (!is_leaf) */
 -uint16_t ptr : 15;
 -};

  /* Simple allocator for PhysPageEntry nodes */
  static PhysPageEntry (*phys_map_nodes)[L2_SIZE];
 diff --git a/memory.c b/memory.c
 index 2eaa2fc..c7f2cfd 100644
 --- a/memory.c
 +++ b/memory.c
 @@ -31,17 +31,6 @@ static bool global_dirty_log = false;
  static QTAILQ_HEAD(memory_listeners, MemoryListener) memory_listeners
  = QTAILQ_HEAD_INITIALIZER(memory_listeners);

 -typedef struct AddrRange AddrRange;
 -
 -/*
 - * Note using signed integers limits us to physical addresses at most
 - * 63 bits wide.  They are needed for negative offsetting in aliases
 - * (large MemoryRegion::alias_offset).
 - */
 -struct AddrRange {
 -Int128 start;
 -Int128 size;
 -};

  static AddrRange addrrange_make(Int128 start, Int128 size)
  {
 @@ -197,28 +186,6 @@ static bool 
 memory_region_ioeventfd_equal(MemoryRegionIoeventfd a,
   !memory_region_ioeventfd_before(b, a);
  }

 -typedef struct FlatRange FlatRange;
 -typedef struct FlatView FlatView;
 -
 -/* Range of memory in the global map.  Addresses are absolute. */
 -struct FlatRange {
 -MemoryRegion *mr;
 -target_phys_addr_t offset_in_region;
 -AddrRange addr;
 -uint8_t dirty_log_mask;
 -bool readable;
 -bool readonly;
 -};
 -
 -/* Flattened global view of current active memory hierarchy.  Kept in sorted
 - * order.
 - */
 -struct FlatView {
 -FlatRange *ranges;
 -unsigned nr;
 -unsigned nr_allocated;
 -};
 -
  typedef struct AddressSpace AddressSpace;
  typedef struct AddressSpaceOps AddressSpaceOps;

 diff --git a/memory.h b/memory.h
 index 740f018..357edd8 100644
 --- a/memory.h
 +++ b/memory.h
 @@ -29,12 +29,72 @@
  #include qemu-thread.h
  #include qemu/reclaimer.h

 +typedef struct AddrRange AddrRange;
 +typedef struct FlatRange FlatRange;
 +typedef struct FlatView FlatView;
 +typedef struct PhysPageEntry PhysPageEntry;
 +typedef struct PhysMap PhysMap;
 +typedef struct MemoryRegionSection MemoryRegionSection;
  typedef struct MemoryRegionOps MemoryRegionOps;
  typedef struct MemoryRegionLifeOps MemoryRegionLifeOps;
  typedef struct MemoryRegion MemoryRegion;
  typedef struct MemoryRegionPortio MemoryRegionPortio;
  typedef struct MemoryRegionMmio MemoryRegionMmio;

 +/*
 + * Note using signed integers limits us to physical addresses at most
 + * 63 bits wide.  They are needed for negative offsetting in aliases
 + * (large MemoryRegion::alias_offset).
 + */
 +struct AddrRange {
 +Int128 start;
 +Int128 size;
 +};
 +
 +/* Range of memory in the global map.  Addresses are absolute. */
 +struct FlatRange {
 +MemoryRegion *mr;
 +target_phys_addr_t offset_in_region;
 +AddrRange addr;
 +uint8_t dirty_log_mask;
 +bool readable;
 +bool readonly;
 +};
 +
 +/* Flattened global view of current active memory hierarchy.  Kept in sorted
 + * order.
 + */
 +struct FlatView {
 +FlatRange *ranges;
 +unsigned nr;
 +unsigned nr_allocated;
 +};
 +
 +struct PhysPageEntry {
 +uint16_t is_leaf:1;
 + /* index into phys_sections (is_leaf) or phys_map_nodes (!is_leaf) */
 +uint16_t ptr:15;
 +};
 +
 +#define L2_BITS 10
 +#define L2_SIZE (1  L2_BITS)
 +/* This is a multi-level map on the physical address space.
 +   The bottom level has pointers to MemoryRegionSections.  */
 +struct PhysMap {
 +Atomic ref;
 +PhysPageEntry root;
 +PhysPageEntry (*phys_map_nodes)[L2_SIZE];
 +unsigned phys_map_nodes_nb;
 +unsigned 

Re: [PATCH 09/15] memory: prepare flatview and radix-tree for rcu style access

2012-08-08 Thread Blue Swirl
On Wed, Aug 8, 2012 at 6:25 AM, Liu Ping Fan qemul...@gmail.com wrote:
 From: Liu Ping Fan pingf...@linux.vnet.ibm.com

 Flatview and radix view are all under the protection of pointer.
 And this make sure the change of them seem to be atomic!

 The mr accessed by radix-tree leaf or flatview will be reclaimed
 after the prev PhysMap not in use any longer

 Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
 ---
  exec.c  |  303 +++---
  hw/vhost.c  |2 +-
  hw/xen_pt.c |2 +-
  kvm-all.c   |2 +-
  memory.c|   92 ++-
  memory.h|9 ++-
  vl.c|1 +
  xen-all.c   |2 +-
  8 files changed, 286 insertions(+), 127 deletions(-)

 diff --git a/exec.c b/exec.c
 index 01b91b0..97addb9 100644
 --- a/exec.c
 +++ b/exec.c
 @@ -24,6 +24,7 @@
  #include sys/mman.h
  #endif

 +#include qemu/atomic.h
  #include qemu-common.h
  #include cpu.h
  #include tcg.h
 @@ -35,6 +36,8 @@
  #include qemu-timer.h
  #include memory.h
  #include exec-memory.h
 +#include qemu-thread.h
 +#include qemu/reclaimer.h
  #if defined(CONFIG_USER_ONLY)
  #include qemu.h
  #if defined(__FreeBSD__) || defined(__FreeBSD_kernel__)
 @@ -184,25 +187,17 @@ static void *l1_map[V_L1_SIZE];

  #if !defined(CONFIG_USER_ONLY)

 -static MemoryRegionSection *phys_sections;
 -static unsigned phys_sections_nb, phys_sections_nb_alloc;
  static uint16_t phys_section_unassigned;
  static uint16_t phys_section_notdirty;
  static uint16_t phys_section_rom;
  static uint16_t phys_section_watch;

 -
 -/* Simple allocator for PhysPageEntry nodes */
 -static PhysPageEntry (*phys_map_nodes)[L2_SIZE];
 -static unsigned phys_map_nodes_nb, phys_map_nodes_nb_alloc;
 -
  #define PHYS_MAP_NODE_NIL (((uint16_t)~0)  1)

 -/* This is a multi-level map on the physical address space.
 -   The bottom level has pointers to MemoryRegionSections.  */
 -static PhysPageEntry phys_map = { .ptr = PHYS_MAP_NODE_NIL, .is_leaf = 0 };
 -
 +static QemuMutex cur_map_lock;
 +static PhysMap *cur_map;
  QemuMutex mem_map_lock;
 +static PhysMap *next_map;

  static void io_mem_init(void);
  static void memory_map_init(void);
 @@ -383,41 +378,38 @@ static inline PageDesc *page_find(tb_page_addr_t index)

  #if !defined(CONFIG_USER_ONLY)

 -static void phys_map_node_reserve(unsigned nodes)
 +static void phys_map_node_reserve(PhysMap *map, unsigned nodes)
  {
 -if (phys_map_nodes_nb + nodes  phys_map_nodes_nb_alloc) {
 +if (map-phys_map_nodes_nb + nodes  map-phys_map_nodes_nb_alloc) {
  typedef PhysPageEntry Node[L2_SIZE];
 -phys_map_nodes_nb_alloc = MAX(phys_map_nodes_nb_alloc * 2, 16);
 -phys_map_nodes_nb_alloc = MAX(phys_map_nodes_nb_alloc,
 -  phys_map_nodes_nb + nodes);
 -phys_map_nodes = g_renew(Node, phys_map_nodes,
 - phys_map_nodes_nb_alloc);
 +map-phys_map_nodes_nb_alloc = MAX(map-phys_map_nodes_nb_alloc * 2,
 +16);
 +map-phys_map_nodes_nb_alloc = MAX(map-phys_map_nodes_nb_alloc,
 +  map-phys_map_nodes_nb + nodes);
 +map-phys_map_nodes = g_renew(Node, map-phys_map_nodes,
 + map-phys_map_nodes_nb_alloc);
  }
  }

 -static uint16_t phys_map_node_alloc(void)
 +static uint16_t phys_map_node_alloc(PhysMap *map)
  {
  unsigned i;
  uint16_t ret;

 -ret = phys_map_nodes_nb++;
 +ret = map-phys_map_nodes_nb++;
  assert(ret != PHYS_MAP_NODE_NIL);
 -assert(ret != phys_map_nodes_nb_alloc);
 +assert(ret != map-phys_map_nodes_nb_alloc);
  for (i = 0; i  L2_SIZE; ++i) {
 -phys_map_nodes[ret][i].is_leaf = 0;
 -phys_map_nodes[ret][i].ptr = PHYS_MAP_NODE_NIL;
 +map-phys_map_nodes[ret][i].is_leaf = 0;
 +map-phys_map_nodes[ret][i].ptr = PHYS_MAP_NODE_NIL;
  }
  return ret;
  }

 -static void phys_map_nodes_reset(void)
 -{
 -phys_map_nodes_nb = 0;
 -}
 -
 -
 -static void phys_page_set_level(PhysPageEntry *lp, target_phys_addr_t *index,
 -target_phys_addr_t *nb, uint16_t leaf,
 +static void phys_page_set_level(PhysMap *map, PhysPageEntry *lp,
 +target_phys_addr_t *index,
 +target_phys_addr_t *nb,
 +uint16_t leaf,
  int level)
  {
  PhysPageEntry *p;
 @@ -425,8 +417,8 @@ static void phys_page_set_level(PhysPageEntry *lp, 
 target_phys_addr_t *index,
  target_phys_addr_t step = (target_phys_addr_t)1  (level * L2_BITS);

  if (!lp-is_leaf  lp-ptr == PHYS_MAP_NODE_NIL) {
 -lp-ptr = phys_map_node_alloc();
 -p = phys_map_nodes[lp-ptr];
 +lp-ptr = phys_map_node_alloc(map);
 +p = map-phys_map_nodes[lp-ptr];
  if (level == 0) {
  for (i = 0; i  

Re: [Qemu-devel] [PATCH 2/5] s390: Virtual channel subsystem support.

2012-08-08 Thread Peter Maydell
On 8 August 2012 20:16, Blue Swirl blauwir...@gmail.com wrote:
 On Wed, Aug 8, 2012 at 8:17 AM, Cornelia Huck cornelia.h...@de.ibm.com 
 wrote:
 On Tue, 7 Aug 2012 21:00:59 +
 Blue Swirl blauwir...@gmail.com wrote:
 Please use more descriptive names instead of acronyms, for example 
 SubChStatus.

 I'd rather leave these at the well-known scsw, pmcw, etc. names. These
 have been around for decades, and somebody familiar with channel I/O
 will instantly know what a struct scsw is, but will need to look hard
 at the code to figure out the meaning of SubChStatus.

 If they are well-known and have been around for so long time, are
 there any suitable header files (with compatible licenses) where they
 are defined which could be reused?

 Otherwise, please follow CODING_STYLE.

I think we should follow CODING_STYLE for capitalisation issues
but generally if the device's documentation has standard abbreviations
for register names, structures, etc, etc we should use them. Often
this code has to be maintained later by somebody else who might not
be familiar with the general operation of the hardware and who is trying
to match up the code with whatever the data sheet says. Following the
naming used in the h/w docs makes that job easier.

(for instance I took the opportunity of making a bunch of structure
member names in target-arm line up with the ARM ARM names
as part of the refactoring that went on a while back.)

-- PMM
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH 2/5] s390: Virtual channel subsystem support.

2012-08-08 Thread Blue Swirl
On Wed, Aug 8, 2012 at 7:34 PM, Peter Maydell peter.mayd...@linaro.org wrote:
 On 8 August 2012 20:16, Blue Swirl blauwir...@gmail.com wrote:
 On Wed, Aug 8, 2012 at 8:17 AM, Cornelia Huck cornelia.h...@de.ibm.com 
 wrote:
 On Tue, 7 Aug 2012 21:00:59 +
 Blue Swirl blauwir...@gmail.com wrote:
 Please use more descriptive names instead of acronyms, for example 
 SubChStatus.

 I'd rather leave these at the well-known scsw, pmcw, etc. names. These
 have been around for decades, and somebody familiar with channel I/O
 will instantly know what a struct scsw is, but will need to look hard
 at the code to figure out the meaning of SubChStatus.

 If they are well-known and have been around for so long time, are
 there any suitable header files (with compatible licenses) where they
 are defined which could be reused?

 Otherwise, please follow CODING_STYLE.

 I think we should follow CODING_STYLE for capitalisation issues
 but generally if the device's documentation has standard abbreviations
 for register names, structures, etc, etc we should use them. Often
 this code has to be maintained later by somebody else who might not
 be familiar with the general operation of the hardware and who is trying
 to match up the code with whatever the data sheet says. Following the
 naming used in the h/w docs makes that job easier.

Yes. typedef struct SCSW {} SCSW; should be OK too.


 (for instance I took the opportunity of making a bunch of structure
 member names in target-arm line up with the ARM ARM names
 as part of the refactoring that went on a while back.)

 -- PMM
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: UIO: missing resource mapping

2012-08-08 Thread Hans J. Koch
On Wed, Jul 18, 2012 at 12:40:47PM +0200, Dominic Eschweiler wrote:
 Am Montag, den 16.07.2012, 23:58 +0200 schrieb Hans J. Koch:
  Try to hack up a patch to add generic BAR mapping to uio_pci_generic.c
  and post it for review.
  
 
 Here we go ...

Thank you very much for your work. I'm really sorry for the long delay,
but I was busy finishing a project because I go to vacation tomorrow.
Sorry, that might cause further delay since I don't know yet how often
I can read my mail...
Greg, can you review the next one?

Here's a first review.

Thanks,
Hans

  
 Signed-off-by: Dominic Eschweiler eschwei...@fias.uni-frankfurt.de
 diff --git a/drivers/uio/uio_pci_generic.c
 b/drivers/uio/uio_pci_generic.c
 index 0bd08ef..e25991e 100644
 --- a/drivers/uio/uio_pci_generic.c
 +++ b/drivers/uio/uio_pci_generic.c
 @@ -25,10 +25,12 @@
  #include linux/slab.h
  #include linux/uio_driver.h
  
 -#define DRIVER_VERSION   0.01.0
 +#define DRIVER_VERSION   0.02.0
  #define DRIVER_AUTHORMichael S. Tsirkin m...@redhat.com
  #define DRIVER_DESC  Generic UIO driver for PCI 2.3 devices
  
 +#define DRV_NAME uio_pci_generic
 +
  struct uio_pci_generic_dev {
   struct uio_info info;
   struct pci_dev *pdev;
 @@ -58,6 +60,7 @@ static int __devinit probe(struct pci_dev *pdev,
  {
   struct uio_pci_generic_dev *gdev;
   int err;
 + int i;
  
   err = pci_enable_device(pdev);
   if (err) {
 @@ -67,8 +70,7 @@ static int __devinit probe(struct pci_dev *pdev,
   }
  
   if (!pdev-irq) {
 - dev_warn(pdev-dev, No IRQ assigned to device: 
 -  no support for interrupts?\n);
 + dev_warn(pdev-dev, No IRQ assigned to device: no support for
 interrupts?\n);

Please configure your mail client not to break lines when sending a patch.
It can't be applied like this.

Why did you make that change anyway? If it's just coding style, please send
another patch, don't mix functional changes with coding style fixes.

   pci_disable_device(pdev);
   return -ENODEV;
   }
 @@ -91,10 +93,31 @@ static int __devinit probe(struct pci_dev *pdev,
   gdev-info.handler = irqhandler;
   gdev-pdev = pdev;
  
 + /* request regions */
 + err = pci_request_regions(pdev, DRV_NAME);
 + if (err) {
 + dev_err(pdev-dev, Couldn't get PCI resources, aborting\n);
 + return err;
 + }
 +
 + /* create attributes for BAR mappings */
 + for (i = 0; i  PCI_NUM_RESOURCES; i++) {
 + if (pdev-resource[i].flags 
 + (pdev-resource[i].flags  IORESOURCE_MEM)) {
 + gdev-info.mem[i].addr = pci_resource_start(pdev, i);
 + gdev-info.mem[i].size = pci_resource_len(pdev, i);
 + gdev-info.mem[i].internal_addr = NULL;
 + gdev-info.mem[i].memtype = UIO_MEM_PHYS;
 + }
 + }
 +
   if (uio_register_device(pdev-dev, gdev-info))
   goto err_register;
   pci_set_drvdata(pdev, gdev);
  
 + pr_info(UIO_PCI_GENERIC : initialized new device (%x %x)\n,

Please use dev_info()

 + pdev-vendor, pdev-device);
 +
   return 0;
  err_register:
   kfree(gdev);
 @@ -107,17 +130,21 @@ err_verify:
  static void remove(struct pci_dev *pdev)
  {
   struct uio_pci_generic_dev *gdev = pci_get_drvdata(pdev);
 -
   uio_unregister_device(gdev-info);
 +
 + pci_release_regions(pdev);
   pci_disable_device(pdev);
   kfree(gdev);
 +
 + pr_info(UIO_PCI_GENERIC : removed device (%x %x)\n,

ditto

 + pdev-vendor, pdev-device);
  }
  
  static struct pci_driver driver = {
 - .name = uio_pci_generic,
 + .name = DRV_NAME,
   .id_table = NULL, /* only dynamic id's */
 - .probe = probe,
 - .remove = remove,
 + .probe= probe,
 + .remove   = remove,

As above: Please put coding style fixes in an extra patch (if you really
insist on tabs instead of spaces...)

  };
  
  static int __init init(void)
 
 -- 
 Gruß
   Dominic
 
 Frankfurt Institute for Advanced Studies (FIAS)
 Ruth-Moufang-Straße 1
 D-60438 Frankfurt am Main
 Germany
 
 Phone:  +49 69 79844114
 
 
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] kvm tools: Add initial virtio-scsi support

2012-08-08 Thread Asias He
This patch brings virito-scsi support to kvm tool.

With the introduce of tcm_vhost (vhost-scsi)

   tcm_vhost: Initial merge for vhost level target fabric driver

we can implement virito-scsi by simply having vhost-scsi to handle the
SCSI command.

Howto use:
1) Setup the tcm_vhost target through /sys/kernel/config

   [Stefan Hajnoczi, Thanks for the script to setup tcm_vhost]

   ** Setup wwpn and tpgt
   $ wwpn=naa.0
   $ tpgt=/sys/kernel/config/target/vhost/$wwpn/tpgt_0
   $ nexus=$tpgt/nexus
   $ mkdir -p $tpgt
   $ echo -n $wwpn  $nexus

   ** Setup lun using /dev/ram
   $ n=0
   $ lun=$tpgt/lun/lun_${n}
   $ data=/sys/kernel/config/target/core/iblock_0/data_${n}
   $ ram=/dev/ram${n}
   $ mkdir -p $lun
   $ mkdir -p $data
   $ echo -n udev_path=${ram}  $data/control
   $ echo -n 1  $data/enable
   $ ln -s $data $lun

2) Run kvm tool with the new disk option '-d scsi:$wwpn:$tpgt', e.g
   $ lkvm run -k /boot/bzImage -d ~/img/sid.img -d scsi:naa.0:0

Signed-off-by: Asias He asias.he...@gmail.com
Cc: Nicholas A. Bellinger n...@linux-iscsi.org
Cc: Stefan Hajnoczi stefa...@linux.vnet.ibm.com
Cc: Paolo Bonzini pbonz...@redhat.com
---
 tools/kvm/Makefile |   1 +
 tools/kvm/builtin-run.c|  25 +++
 tools/kvm/disk/core.c  |  14 ++
 tools/kvm/include/kvm/disk-image.h |   4 +
 tools/kvm/include/kvm/virtio-pci-dev.h |   1 +
 tools/kvm/include/kvm/virtio-scsi.h|  11 ++
 tools/kvm/include/linux/compiler.h |   1 +
 tools/kvm/virtio/blk.c |   2 +
 tools/kvm/virtio/scsi.c| 332 +
 9 files changed, 391 insertions(+)
 create mode 100644 tools/kvm/include/kvm/virtio-scsi.h
 create mode 100644 tools/kvm/virtio/scsi.c

diff --git a/tools/kvm/Makefile b/tools/kvm/Makefile
index f9e1ec1..749956a 100644
--- a/tools/kvm/Makefile
+++ b/tools/kvm/Makefile
@@ -55,6 +55,7 @@ OBJS  += mmio.o
 OBJS   += pci.o
 OBJS   += term.o
 OBJS   += virtio/blk.o
+OBJS   += virtio/scsi.o
 OBJS   += virtio/console.o
 OBJS   += virtio/core.o
 OBJS   += virtio/net.o
diff --git a/tools/kvm/builtin-run.c b/tools/kvm/builtin-run.c
index a36bd00..c3ec469 100644
--- a/tools/kvm/builtin-run.c
+++ b/tools/kvm/builtin-run.c
@@ -8,6 +8,7 @@
 #include kvm/framebuffer.h
 #include kvm/disk-image.h
 #include kvm/threadpool.h
+#include kvm/virtio-scsi.h
 #include kvm/virtio-blk.h
 #include kvm/virtio-net.h
 #include kvm/virtio-rng.h
@@ -171,6 +172,19 @@ static int img_name_parser(const struct option *opt, const 
char *arg, int unset)
 
disk_image[image_count].filename = arg;
cur = arg;
+
+   if (strncmp(arg, scsi:, 5) == 0) {
+   sep = strstr(arg, :);
+   if (sep)
+   disk_image[image_count].wwpn = sep + 1;
+   sep = strstr(sep + 1, :);
+   if (sep) {
+   *sep = 0;
+   disk_image[image_count].tpgt = sep + 1;
+   }
+   cur = sep + 1;
+   }
+
do {
sep = strstr(cur, ,);
if (sep) {
@@ -1184,6 +1198,13 @@ static int kvm_cmd_run_init(int argc, const char **argv)
goto fail;
}
 
+   r = virtio_scsi_init(kvm);
+   if (r  0) {
+   pr_err(virtio_scsi_init() failed with error %d\n, r);
+   goto fail;
+   }
+
+
if (active_console == CONSOLE_VIRTIO)
virtio_console__init(kvm);
 
@@ -1332,6 +1353,10 @@ static void kvm_cmd_run_exit(int guest_ret)
 
fb__stop();
 
+   r = virtio_scsi_exit(kvm);
+   if (r  0)
+   pr_warning(virtio_scsi_exit() failed with error %d\n, r);
+
r = virtio_blk__exit(kvm);
if (r  0)
pr_warning(virtio_blk__exit() failed with error %d\n, r);
diff --git a/tools/kvm/disk/core.c b/tools/kvm/disk/core.c
index 621c940..f7e2c7f 100644
--- a/tools/kvm/disk/core.c
+++ b/tools/kvm/disk/core.c
@@ -122,6 +122,8 @@ struct disk_image **disk_image__open_all(struct 
disk_image_params *params, int c
 {
struct disk_image **disks;
const char *filename;
+   const char *wwpn;
+   const char *tpgt;
bool readonly;
bool direct;
void *err;
@@ -140,6 +142,18 @@ struct disk_image **disk_image__open_all(struct 
disk_image_params *params, int c
filename = params[i].filename;
readonly = params[i].readonly;
direct = params[i].direct;
+   wwpn = params[i].wwpn;
+   tpgt = params[i].tpgt;
+
+   if (wwpn) {
+   disks[i] = malloc(sizeof(struct disk_image));
+   if (!disks[i])
+   return ERR_PTR(-ENOMEM);
+   disks[i]-wwpn = wwpn;
+   disks[i]-tpgt = tpgt;
+   continue;
+   }
+
if (!filename)
continue;
 
diff --git 

Re: [PATCH 6/8] KVM: PPC: E500: Implement MMU notifiers

2012-08-08 Thread Paul Mackerras
On Tue, Aug 07, 2012 at 12:57:14PM +0200, Alexander Graf wrote:
 The e500 target has lived without mmu notifiers ever since it got
 introduced, but fails for the user space check on them with hugetlbfs.

Ironically that user space check isn't necessary any more since David
Gibson's fix for the hugetlbfs bug went in (90481622, hugepages: fix
use after free bug in quota handling).  So on sufficiently recent
kernels you can just remove the userspace check.  Implementing
mmu-notifiers is a good thing for other reasons though.

Paul.
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 8/8] KVM: PPC: Add cache flush on page map

2012-08-08 Thread Alexander Graf


On 07.08.2012, at 23:01, Scott Wood scottw...@freescale.com wrote:

 On 08/07/2012 05:57 AM, Alexander Graf wrote:
 When we map a page that wasn't icache cleared before, do so when first
 mapping it in KVM using the same information bits as the Linux mapping
 logic. That way we are 100% sure that any page we map does not have stale
 entries in the icache.
 
 What we really would need is a method to flush the icache only when we
 actually map a page executable for the guest. But with the current
 abstraction that is hard to do, and this way we don't have a huge performance
 penalty, so better be safe now and micro optimize things later.
 
 Signed-off-by: Alexander Graf ag...@suse.de
 ---
 arch/powerpc/include/asm/kvm_host.h |   10 ++
 arch/powerpc/mm/mem.c   |1 +
 2 files changed, 11 insertions(+), 0 deletions(-)
 
 diff --git a/arch/powerpc/include/asm/kvm_host.h 
 b/arch/powerpc/include/asm/kvm_host.h
 index ed75bc9..c0a2fc1 100644
 --- a/arch/powerpc/include/asm/kvm_host.h
 +++ b/arch/powerpc/include/asm/kvm_host.h
 @@ -33,6 +33,7 @@
 #include asm/kvm_asm.h
 #include asm/processor.h
 #include asm/page.h
 +#include asm/cacheflush.h
 
 #define KVM_MAX_VCPUSNR_CPUS
 #define KVM_MAX_VCORESNR_CPUS
 @@ -550,5 +551,14 @@ struct kvm_vcpu_arch {
 #define KVM_MMIO_REG_FQPR0x0060
 
 #define __KVM_HAVE_ARCH_WQP
 +#define __KVM_HAVE_ARCH_MAP_PAGE
 +static inline void kvm_arch_map_page(struct page *page)
 +{
 +/* Need to invalidate the icache for new pages */
 +if (!test_bit(PG_arch_1, page-flags)) {
 +flush_dcache_icache_page(page);
 +set_bit(PG_arch_1, page-flags);
 +}
 +}
 
 Shouldn't this test CPU_FTR_COHERENT_ICACHE?

flush_icache_range checks for it a bit further down the code stream. I mostly 
modeled things the same way as set_pre_filter.


Alex

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Reset problem vs. MMIO emulation, hypercalls, etc...

2012-08-08 Thread Avi Kivity
On 08/08/2012 12:09 AM, Benjamin Herrenschmidt wrote:
 On Tue, 2012-08-07 at 16:13 +0300, Avi Kivity wrote:
 
 
 Peter has started to fix up this naming mess in qemu.  I guess we should
 do the same for the kernel (except for ABIs) and document it, because it
 keeps generating confusion. 
 
 Ok so our current situation is that the XICS and MPIC are emulated in
 userspace entirely but the link between them and the VCPU is the
 asynchronous EE line so we are fine.
 
 I'm currently working on moving the XICS into the kernel for performance
 reasons, however, just like ARM VGIC, I can't seem to find a way to
 fit in the generic irqchip code in there. It's just not generic at
 all and quite x86 centric :-)

The generic code is for the two apic architectures: x64 and ia64.  Don't
try to shoehorn it in, you'll damage both the shoe and your foot.

 
 So for now I'm just doing my own version of CREATE_IRQCHIP to create it
 and KVM_INTERRUPT to trigger the various interrupts. None of the mapping
 stuff (which we really don't need).

You mean KVM_IRQ_LINE.  KVM_INTERRUPT is a synchronous vcpu ioctl.

 
 That's a bit of a problem vs. some of the code qemu-side such as in
 virtio-pci which does seem to be written around the model exposed by the
 x86 stuff and relies on doing such mappings so I think we'll have to
 butcher some of that.

Can you elaborate? virtio-pci is pci-centric, there should be nothing
x86 specific there.


-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Reset problem vs. MMIO emulation, hypercalls, etc...

2012-08-08 Thread Avi Kivity
On 08/08/2012 03:49 AM, David Gibson wrote:
  We never have irqchip in kernel (because we haven't written that yet)
  but we still sleep in-kernel for CEDE.  I haven't spotted any problem
  with that, but now I'm wondering if there is one, since x86 don't do
  it in what seems like the analogous situation.
  
  It's possible this works because our decrementer (timer) interrupts
  are different at the core level from external interrupts coming from
  the PIC, and *are* handled in kernel, but I haven't actually followed
  the logic to work out if this is the case.
  
   Meaning the normal state of things is to sleep in
  the kernel (whether or not you have an emulated interrupt controller in
  the kernel -- the term irqchip in kernel is overloaded for x86).
  
  Uh.. overloaded in what way.
 
 On x86, irqchip-in-kernel means that the local APICs, the IOAPIC, and
 the two PICs are emulated in the kernel.  Now the IOAPIC and the PICs
 correspond to non-x86 interrupt controllers, but the local APIC is more
 tightly coupled to the core.  Interrupt acceptance by the core is an
 operation that involved synchronous communication with the local APIC:
 the APIC presents the interrupt, the core accepts it based on the value
 of the interrupt enable flag and possible a register (CR8), then the
 APIC updates the ISR and IRR.
 
 The upshot is that if the local APIC is in userspace, interrupts must be
 synchronous with vcpu exection, so that KVM_INTERRUPT is a vcpu ioctl
 and HLT is emulated in userspace (so that local APIC emulation can check
 if an interrupt wakes it up or not).
 
 Sorry, still not 100% getting it.  When the vcpu is actually running
 code, that synchronous communication must still be accomplished via
 the KVM_INTERRUPT ioctl, yes?  So what makes HLT different, that the
 communication can't be accomplished in that case.

No, you're correct.  HLT could have been emulated in userspace, it just
wasn't.  The correct statement is that HLT was arbitrarily chosen to be
emulated in userspace with the synchronous model, but the asynchronous
model forced it into the kernel.


-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Reset problem vs. MMIO emulation, hypercalls, etc...

2012-08-08 Thread Benjamin Herrenschmidt
On Wed, 2012-08-08 at 11:52 +0300, Avi Kivity wrote:
  So for now I'm just doing my own version of CREATE_IRQCHIP to create it
  and KVM_INTERRUPT to trigger the various interrupts. None of the mapping
  stuff (which we really don't need).
 
 You mean KVM_IRQ_LINE.  KVM_INTERRUPT is a synchronous vcpu ioctl.

Yes, sorry, brain and fingers not agreeing :-)

  That's a bit of a problem vs. some of the code qemu-side such as in
  virtio-pci which does seem to be written around the model exposed by the
  x86 stuff and relies on doing such mappings so I think we'll have to
  butcher some of that.
 
 Can you elaborate? virtio-pci is pci-centric, there should be nothing
 x86 specific there.

The kvm_irqchip_add_msi_route  co is a problem, it more/less dives into
the x86/ia64 specific routing stuff at a generic level, I'll have to
hook things up differently in qemu I suspect.

Not a huge deal, we can sort it out.

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] KVM: PPC: booke: Add watchdog emulation

2012-08-08 Thread Alexander Graf

On 07.08.2012, at 17:59, Bhushan Bharat-R65777 wrote:

 
 
 -Original Message-
 From: Alexander Graf [mailto:ag...@suse.de]
 Sent: Tuesday, August 07, 2012 3:38 PM
 To: Bhushan Bharat-R65777
 Cc: kvm-ppc@vger.kernel.org; k...@vger.kernel.org; Bhushan Bharat-R65777
 Subject: Re: [PATCH 1/2] KVM: PPC: booke: Add watchdog emulation
 
 
 On 03.08.2012, at 07:30, Bharat Bhushan wrote:
 
 This patch adds the watchdog emulation in KVM. The watchdog
 emulation is enabled by KVM_ENABLE_CAP(KVM_CAP_PPC_BOOKE_WATCHDOG) ioctl.
 The kernel timer are used for watchdog emulation and emulates
 h/w watchdog state machine. On watchdog timer expiry, it exit to QEMU
 if TCR.WRC is non ZERO. QEMU can reset/shutdown etc depending upon how
 it is configured.
 
 Signed-off-by: Liu Yu yu@freescale.com
 Signed-off-by: Scott Wood scottw...@freescale.com
 [bharat.bhus...@freescale.com: reworked patch]
 Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com
 ---
 v7:
 - kvmppc_core_dequeue_watchdog() and kvmppc_core_enque_watchdog()
  are made staic
 - Clear the KVM_REQ_WATCHDOG request when TSR_ENW | ENW_WIS cleared
  rather than checking these bits on handling the request.
  Below changes (pasted for quick understanding)
 diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
 index 7eedaeb..a0f5922 100644
 --- a/arch/powerpc/kvm/booke.c
 +++ b/arch/powerpc/kvm/booke.c
 @@ -629,8 +629,7 @@ int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct
 kvm_vcpu *vcpu)
   goto out;
   }
 
 -   if (kvm_check_request(KVM_REQ_WATCHDOG, vcpu) 
 -   (vcpu-arch.tsr  (TSR_ENW | TSR_WIS))) {
 +   if (kvm_check_request(KVM_REQ_WATCHDOG, vcpu)) {
   kvm_run-exit_reason = KVM_EXIT_WATCHDOG;
   ret = 0;
   goto out;
 @@ -1098,8 +1097,7 @@ int kvmppc_handle_exit(struct kvm_run *run, struct
 kvm_vcpu *vcpu,
   }
   }
 
 -   if (kvm_check_request(KVM_REQ_WATCHDOG, vcpu) 
 -   (vcpu-arch.tsr  (TSR_ENW | TSR_WIS))) {
 +   if (kvm_check_request(KVM_REQ_WATCHDOG, vcpu)) {
   run-exit_reason = KVM_EXIT_WATCHDOG;
   r = RESUME_HOST | (r  RESUME_FLAG_NV);
   }
 @@ -1251,8 +1249,10 @@ static int set_sregs_base(struct kvm_vcpu *vcpu,
 
   vcpu-arch.tsr = sregs-u.e.tsr;
 
 -   if ((old_tsr ^ vcpu-arch.tsr)  (TSR_ENW | TSR_WIS))
 +   if ((old_tsr ^ vcpu-arch.tsr)  (TSR_ENW | TSR_WIS)) {
 +   clear_bit(KVM_REQ_WATCHDOG, vcpu-requests);
   arm_next_watchdog(vcpu);
 +   }
 
   update_timer_ints(vcpu);
   }
 @@ -1434,8 +1434,10 @@ void kvmppc_clr_tsr_bits(struct kvm_vcpu *vcpu, u32
 tsr_bits)
* We may have stopped the watchdog due to
* being stuck on final expiration.
*/
 -   if (tsr_bits  (TSR_ENW | TSR_WIS))
 +   if (tsr_bits  (TSR_ENW | TSR_WIS)) {
 +   clear_bit(KVM_REQ_WATCHDOG, vcpu-requests);
   arm_next_watchdog(vcpu);
 +   }
 --
 
 v6:
 - Added kvmppc_subarch_vcpu_unit/uninit() for subarch specific 
 initialization
 - Using spin_lock_irqsave()
 - Using del_timer_sync().
 
 v5:
 - Checking that TSR_ENW/TSR_WIS are still set on KVM_EXIT_WATCHDOG.
 - Moved watchdog_next_timeout() in lock.
 
 arch/powerpc/include/asm/kvm_host.h  |3 +
 arch/powerpc/include/asm/kvm_ppc.h   |2 +
 arch/powerpc/include/asm/reg_booke.h |7 ++
 arch/powerpc/kvm/book3s.c|9 ++
 arch/powerpc/kvm/booke.c |  158 
 ++
 arch/powerpc/kvm/booke_emulate.c |8 ++
 arch/powerpc/kvm/powerpc.c   |   14 +++-
 include/linux/kvm.h  |2 +
 include/linux/kvm_host.h |1 +
 9 files changed, 202 insertions(+), 2 deletions(-)
 
 diff --git a/arch/powerpc/include/asm/kvm_host.h
 b/arch/powerpc/include/asm/kvm_host.h
 index 572ad01..873ec11 100644
 --- a/arch/powerpc/include/asm/kvm_host.h
 +++ b/arch/powerpc/include/asm/kvm_host.h
 @@ -469,6 +469,8 @@ struct kvm_vcpu_arch {
 ulong fault_esr;
 ulong queued_dear;
 ulong queued_esr;
 +   spinlock_t wdt_lock;
 +   struct timer_list wdt_timer;
 u32 tlbcfg[4];
 u32 mmucfg;
 u32 epr;
 @@ -484,6 +486,7 @@ struct kvm_vcpu_arch {
 u8 osi_needed;
 u8 osi_enabled;
 u8 papr_enabled;
 +   u8 watchdog_enabled;
 u8 sane;
 u8 cpu_type;
 u8 hcall_needed;
 diff --git a/arch/powerpc/include/asm/kvm_ppc.h
 b/arch/powerpc/include/asm/kvm_ppc.h
 index 0124937..1438a5e 100644
 --- a/arch/powerpc/include/asm/kvm_ppc.h
 +++ b/arch/powerpc/include/asm/kvm_ppc.h
 @@ -68,6 +68,8 @@ extern void kvmppc_emulate_dec(struct kvm_vcpu *vcpu);
 extern u32 kvmppc_get_dec(struct kvm_vcpu *vcpu, u64 tb);
 extern void kvmppc_decrementer_func(unsigned long data);
 extern int kvmppc_sanity_check(struct kvm_vcpu *vcpu);
 +extern int kvmppc_subarch_vcpu_init(struct kvm_vcpu *vcpu);
 +extern void 

Re: [PATCH 3/3] KVM: PPC: booke: Added debug handler

2012-08-08 Thread Alexander Graf

On 08.08.2012, at 03:02, Bhushan Bharat-R65777 wrote:

 
 
 -Original Message-
 From: Wood Scott-B07421
 Sent: Wednesday, August 08, 2012 2:15 AM
 To: Alexander Graf
 Cc: Bhushan Bharat-R65777; kvm-ppc@vger.kernel.org; k...@vger.kernel.org; 
 Bhushan
 Bharat-R65777
 Subject: Re: [PATCH 3/3] KVM: PPC: booke: Added debug handler
 
 On 08/07/2012 05:47 AM, Alexander Graf wrote:
 diff --git a/arch/powerpc/kvm/booke_interrupts.S
 b/arch/powerpc/kvm/booke_interrupts.S
 index 3539805..890673c 100644
 --- a/arch/powerpc/kvm/booke_interrupts.S
 +++ b/arch/powerpc/kvm/booke_interrupts.S
 @@ -73,6 +73,51 @@ _GLOBAL(kvmppc_handler_\ivor_nr)
bctr
 .endm
 
 +.macro KVM_DBG_HANDLER ivor_nr scratch srr0
 
 This is a lot of asm code. Any chance to share a good share of it with the
 generic handler?
 
 That entire file could use an update to lok more like bookehv_interrupts.S 
 and
 its use of asm macros.
 
 In booke there is assumption that size of KVM IVORs will not me more than 
 host IVORs size so that only IVPR is changed. 
 
 I tried to give it that shape of bookehv_interrupts.S  and found that size of 
 some IVORs become more than host IVORs.

We can always jump off to another (more generic?) function and only have a 
small stub in the IVOR referenced code.


Alex

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Reset problem vs. MMIO emulation, hypercalls, etc...

2012-08-08 Thread David Gibson
On Wed, Aug 08, 2012 at 11:58:58AM +0300, Avi Kivity wrote:
 On 08/08/2012 03:49 AM, David Gibson wrote:
   We never have irqchip in kernel (because we haven't written that yet)
   but we still sleep in-kernel for CEDE.  I haven't spotted any problem
   with that, but now I'm wondering if there is one, since x86 don't do
   it in what seems like the analogous situation.
   
   It's possible this works because our decrementer (timer) interrupts
   are different at the core level from external interrupts coming from
   the PIC, and *are* handled in kernel, but I haven't actually followed
   the logic to work out if this is the case.
   
Meaning the normal state of things is to sleep in
   the kernel (whether or not you have an emulated interrupt controller in
   the kernel -- the term irqchip in kernel is overloaded for x86).
   
   Uh.. overloaded in what way.
  
  On x86, irqchip-in-kernel means that the local APICs, the IOAPIC, and
  the two PICs are emulated in the kernel.  Now the IOAPIC and the PICs
  correspond to non-x86 interrupt controllers, but the local APIC is more
  tightly coupled to the core.  Interrupt acceptance by the core is an
  operation that involved synchronous communication with the local APIC:
  the APIC presents the interrupt, the core accepts it based on the value
  of the interrupt enable flag and possible a register (CR8), then the
  APIC updates the ISR and IRR.
  
  The upshot is that if the local APIC is in userspace, interrupts must be
  synchronous with vcpu exection, so that KVM_INTERRUPT is a vcpu ioctl
  and HLT is emulated in userspace (so that local APIC emulation can check
  if an interrupt wakes it up or not).
  
  Sorry, still not 100% getting it.  When the vcpu is actually running
  code, that synchronous communication must still be accomplished via
  the KVM_INTERRUPT ioctl, yes?  So what makes HLT different, that the
  communication can't be accomplished in that case.
 
 No, you're correct.  HLT could have been emulated in userspace, it just
 wasn't.  The correct statement is that HLT was arbitrarily chosen to be
 emulated in userspace with the synchronous model, but the asynchronous
 model forced it into the kernel.

Aha!  Ok, understood.  Uh, assuming you meant kernelspace, not
userspace in the first line there, anyway.  Ok, so I am now reassured
that our current handling of CEDE in kernelspace does not cause
problems.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Reset problem vs. MMIO emulation, hypercalls, etc...

2012-08-08 Thread Avi Kivity
On 08/08/2012 02:59 PM, David Gibson wrote:
 
 No, you're correct.  HLT could have been emulated in userspace, it just
 wasn't.  The correct statement is that HLT was arbitrarily chosen to be
 emulated in userspace with the synchronous model, but the asynchronous
 model forced it into the kernel.
 
 Aha!  Ok, understood.  Uh, assuming you meant kernelspace, not
 userspace in the first line there, anyway.  

I did.

 Ok, so I am now reassured
 that our current handling of CEDE in kernelspace does not cause
 problems.

Great.  It's a real pity the original local APIC implementation was in
userspace, it causes no end of confusion, and in fact was broken for smp
until recently even though it is 7 years old.

-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/8] KVM: PPC: E500: Implement MMU Notifiers

2012-08-08 Thread Alexander Graf

On 07.08.2012, at 12:57, Alexander Graf wrote:

 This patch set adds a very simple implementation of MMU notifiers for the
 e500 target. Along the way, I stumbled over a few other things that I've
 put into the same patch set, namely:
 
  * e500 prerequisites
  * icache flushing on page map (probably also hits ARM!)
  * exit trace point for e500
 
 Alexander Graf (8):
  KVM: PPC: BookE: Expose remote TLB flushes in debugfs
  KVM: PPC: E500: Fix clear_tlb_refs
  KVM: PPC: PR: Use generic tracepoint for guest exit
  KVM: PPC: Expose SYNC cap based on mmu notifiers

I applied the above 4 patches to kvm-ppc-next.


Alex

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html