[COMMIT master] KVM: Adjust makefile for x86_emulate.c rename

2009-08-13 Thread Avi Kivity
From: Avi Kivity a...@redhat.com

Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index afaaa76..0e7fe78 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -9,7 +9,7 @@ kvm-y			+= $(addprefix ../../../virt/kvm/, kvm_main.o ioapic.o \
 				coalesced_mmio.o irq_comm.o eventfd.o)
 kvm-$(CONFIG_IOMMU_API)+= $(addprefix ../../../virt/kvm/, iommu.o)
 
-kvm-y  += x86.o mmu.o x86_emulate.o i8259.o irq.o lapic.o \
+kvm-y  += x86.o mmu.o emulate.o i8259.o irq.o lapic.o \
   i8254.o timer.o
 kvm-intel-y+= vmx.o
 kvm-amd-y  += svm.o
--
To unsubscribe from this list: send the line unsubscribe kvm-commits in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/3] qemu-kvm: vhost net support

2009-08-13 Thread Michael S. Tsirkin
On Wed, Aug 12, 2009 at 04:27:44PM -0400, Gregory Haskins wrote:
> Michael S. Tsirkin wrote:
> > This adds support for vhost-net virtio kernel backend.
> > 
> > This is RFC, but works without issues for me.
> > 
> > Still needs to be split up, tested and benchmarked properly,
> > but posting it here in case people want to test drive
> > the kernel bits I posted.
> 
> This has a large degree of rejects against qemu-kvm.git/master.  What
> tree does this apply to?
> 
> -Greg
> 

Likely that tree has advanced since.
This is on top of commit b6bbd41fac4b6fb0efc65e083d2151ce1521f615.

-- 
MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] vhost_net: a kernel-level virtio server

2009-08-13 Thread Michael S. Tsirkin
On Wed, Aug 12, 2009 at 07:59:47PM +0200, Arnd Bergmann wrote:
> The trick is to swap the virtqueues instead. virtio-net is actually
> mostly symmetric in just the same way that the physical wires on a
> twisted pair ethernet are symmetric (I like how that analogy fits).

You need to really squint hard for it to look symmetric.

For example, for RX, virtio allocates an skb, puts a descriptor on a
ring and waits for the host to fill it in. The host cannot do the same:
the guest does not have access to host memory.

You can do a copy in transport to hide this fact, but it will kill
performance.
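
The asymmetry can be sketched in plain C. This is a toy model, not virtio
code: rx_ring, guest_post_buffer and host_fill are hypothetical names, and a
small fixed array stands in for the vring. The guest owns the memory and
posts empty buffers; the host can only write into what the guest has posted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define RING_SIZE 4
#define BUF_SIZE  64

/* One posted receive buffer: memory owned by the guest. */
struct rx_desc {
    char  *addr;   /* guest buffer the host may write into */
    size_t len;    /* capacity when posted, bytes used once filled */
    int    filled; /* set by the host after it has written data */
};

struct rx_ring {
    struct rx_desc desc[RING_SIZE];
    unsigned int posted;   /* descriptors the guest has posted so far */
    unsigned int consumed; /* descriptors the host has filled so far */
};

/* Guest side: hand an empty buffer to the host. Returns 0 on success. */
static int guest_post_buffer(struct rx_ring *r, char *buf, size_t cap)
{
    assert(buf != NULL);
    if (r->posted - r->consumed == RING_SIZE)
        return -1; /* ring full */
    struct rx_desc *d = &r->desc[r->posted % RING_SIZE];
    d->addr = buf;
    d->len = cap;
    d->filled = 0;
    r->posted++;
    return 0;
}

/* Host side: copy a packet into the next posted buffer, if there is one. */
static int host_fill(struct rx_ring *r, const char *pkt, size_t pkt_len)
{
    if (r->consumed == r->posted)
        return -1; /* nothing posted: the host has nowhere to write */
    struct rx_desc *d = &r->desc[r->consumed % RING_SIZE];
    if (pkt_len > d->len)
        return -1; /* packet does not fit in the posted buffer */
    memcpy(d->addr, pkt, pkt_len);
    d->len = pkt_len;
    d->filled = 1;
    r->consumed++;
    return 0;
}
```

In this model host_fill fails until guest_post_buffer has been called, which
is the one-directional dependency described above. Swapping the roles would
require the guest to write into host-owned memory, which it cannot reach.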

-- 
MST


Re: [PATCH 2/2] vhost_net: a kernel-level virtio server

2009-08-13 Thread Michael S. Tsirkin
On Wed, Aug 12, 2009 at 02:27:31PM -0500, Anthony Liguori wrote:
 Arnd Bergmann wrote:
 As I pointed out earlier, most code in virtio net is asymmetrical: guest
 provides buffers, host consumes them.  Possibly, one could use virtio
 rings in a symmetrical way, but support of existing guest virtio net
 means there's almost no shared code.
 

 The trick is to swap the virtqueues instead. virtio-net is actually
 mostly symmetric in just the same way that the physical wires on a
 twisted pair ethernet are symmetric (I like how that analogy fits).
   

 It's already been done between two guests.  See  
 http://article.gmane.org/gmane.linux.kernel.virtualization/5423

 Regards,

 Anthony Liguori

Yes, this works by copying data (see PATCH 5/5).  Another possibility is
page flipping.  Either will kill performance.

-- 
MST


[RFC] defer skb allocation in virtio_net -- mergable buff part

2009-08-13 Thread Shirley Ma
Guest virtio_net receives packets into its pre-allocated vring
buffers, then delivers these packets to the upper-layer protocols
as skbs. So it is not necessary to pre-allocate an skb for each
mergeable buffer and then free it when it turns out to be unneeded.

This patch defers skb allocation until a packet is received, which
reduces skb pre-allocations and skb frees. It introduces two page
lists, freed_pages and used_pages. used_pages tracks the pre-allocated
pages and is only needed when removing virtio_net.

This patch has been tested and measured against 2.6.31-rc4 git.
I expected it to improve large-packet performance, but I saw
netperf TCP_STREAM performance improve for small packets in both the
local guest to host and host to local guest cases. It also reduces the
UDP packet drop rate from host to local guest. I do not fully understand
why.

The netperf results from my laptop are:

mtu=1500
netperf -H xxx -l 120

                w/o patch      w/ patch (two runs)
guest to host:  3336.84Mb/s    3730.14Mb/s ~ 3582.88Mb/s
host to guest:  3165.10Mb/s    3370.39Mb/s ~ 3407.96Mb/s

Here is the patch for your review. The same approach can be applied to
non-mergeable buffers too, so the code can be shared. If there are no
objections, I will submit the non-mergeable buffers patch later.
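
The idea of recycling receive pages between two lists can be sketched in
userspace C. This is an illustration only and not the patch itself:
page_pool, pool_get and pool_put are hypothetical names, malloc stands in
for alloc_page, and a hand-rolled singly linked list stands in for the
kernel's list_head.

```c
#include <assert.h>
#include <stdlib.h>

#define PAGE_BYTES 4096

struct page_node {
    void *page;              /* stand-in for struct page */
    struct page_node *next;
};

struct page_pool {
    struct page_node *freed; /* nodes ready to be posted again */
    struct page_node *used;  /* nodes currently posted to the ring */
};

/* Take a node from the freed list, or allocate a new one; move it to used. */
static struct page_node *pool_get(struct page_pool *pool)
{
    struct page_node *n = pool->freed;

    if (n) {
        pool->freed = n->next;   /* reuse: no new allocation needed */
    } else {
        n = malloc(sizeof(*n));
        if (!n)
            return NULL;
        n->page = malloc(PAGE_BYTES);
        if (!n->page) {
            free(n);
            return NULL;
        }
    }
    n->next = pool->used;
    pool->used = n;
    return n;
}

/* Move a node from the used list back to the freed list once its packet
 * has been handed to the stack or dropped. */
static void pool_put(struct page_pool *pool, struct page_node *n)
{
    struct page_node **pp = &pool->used;

    assert(n != NULL);
    while (*pp && *pp != n)
        pp = &(*pp)->next;
    if (*pp)
        *pp = n->next;           /* unlink from the used list */
    n->next = pool->freed;
    pool->freed = n;
}
```

A second pool_get after a pool_put hands back the same node, so the steady
state allocates nothing. The used list only matters at teardown, when every
outstanding page has to be found and released.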


Signed-off-by: Shirley Ma x...@us.ibm.com
---

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 2a6e81d..e31ebc9 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -17,6 +17,7 @@
  * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
  */
 //#define DEBUG
+#include <linux/list.h>
 #include <linux/netdevice.h>
 #include <linux/etherdevice.h>
 #include <linux/ethtool.h>
@@ -39,6 +40,12 @@ module_param(gso, bool, 0444);
 
 #define VIRTNET_SEND_COMMAND_SG_MAX    2
 
+struct page_list
+{
+	struct page *page;
+	struct list_head list;
+};
+
 struct virtnet_info
 {
 	struct virtio_device *vdev;
@@ -72,6 +79,8 @@ struct virtnet_info
 
 	/* Chain pages by the private ptr. */
 	struct page *pages;
+	struct list_head used_pages;
+	struct list_head freed_pages;
 };
 
 static inline void *skb_vnet_hdr(struct sk_buff *skb)
@@ -106,6 +115,26 @@ static struct page *get_a_page(struct virtnet_info *vi, gfp_t gfp_mask)
 	return p;
 }
 
+static struct page_list *get_a_free_page(struct virtnet_info *vi, gfp_t gfp_mask)
+{
+	struct page_list *plist;
+
+	if (list_empty(&vi->freed_pages)) {
+		plist = kmalloc(sizeof(struct page_list), gfp_mask);
+		if (!plist)
+			return NULL;
+		list_add_tail(&plist->list, &vi->freed_pages);
+		plist->page = alloc_page(gfp_mask);
+	} else {
+		plist = list_first_entry(&vi->freed_pages, struct page_list, list);
+		if (!plist->page)
+			plist->page = alloc_page(gfp_mask);
+	}
+	if (plist->page)
+		list_move_tail(&plist->list, &vi->used_pages);
+	return plist;
+}
+
 static void skb_xmit_done(struct virtqueue *svq)
 {
 	struct virtnet_info *vi = svq->vdev->priv;
@@ -121,14 +150,14 @@ static void skb_xmit_done(struct virtqueue *svq)
 	tasklet_schedule(&vi->tasklet);
 }
 
-static void receive_skb(struct net_device *dev, struct sk_buff *skb,
+static void receive_skb(struct net_device *dev, void *buf,
 			unsigned len)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
-	struct virtio_net_hdr *hdr = skb_vnet_hdr(skb);
 	int err;
 	int i;
-
+	struct sk_buff *skb = NULL;
+	struct virtio_net_hdr *hdr = NULL;
 	if (unlikely(len < sizeof(struct virtio_net_hdr) + ETH_HLEN)) {
 		pr_debug("%s: short packet %i\n", dev->name, len);
 		dev->stats.rx_length_errors++;
@@ -136,15 +165,30 @@ static void receive_skb(struct net_device *dev, struct sk_buff *skb,
 	}
 
 	if (vi->mergeable_rx_bufs) {
-		struct virtio_net_hdr_mrg_rxbuf *mhdr = skb_vnet_hdr(skb);
+		struct virtio_net_hdr_mrg_rxbuf *mhdr;
 		unsigned int copy;
-		char *p = page_address(skb_shinfo(skb)->frags[0].page);
+		skb_frag_t *f;
+		struct page_list *plist = (struct page_list *)buf;
+		char *p = page_address(plist->page);
+
+		skb = netdev_alloc_skb(vi->dev, GOOD_COPY_LEN + NET_IP_ALIGN);
+		if (unlikely(!skb)) {
+			/* drop the packet */
+			dev->stats.rx_dropped++;
+			list_move_tail(&plist->list, &vi->freed_pages);
+			return;
+		}
+
+		skb_reserve(skb, NET_IP_ALIGN);
 
 		if (len > PAGE_SIZE)
 			len = PAGE_SIZE;
 		len -= sizeof(struct virtio_net_hdr_mrg_rxbuf);
 
-		memcpy(hdr, p, sizeof(*mhdr));
+		mhdr = skb_vnet_hdr(skb);
+   

Re: [PATCH 2/2] vhost_net: a kernel-level virtio server

2009-08-13 Thread Michael S. Tsirkin
On Wed, Aug 12, 2009 at 02:22:38PM -0500, Anthony Liguori wrote:
 Michael S. Tsirkin wrote:

 We discussed this before, and I still think this could be directly derived
 from struct virtqueue, in the same way that vring_virtqueue is derived from
 struct virtqueue.
 

 I prefer keeping it simple. Much of abstraction in virtio is due to the
 fact that it needs to work on top of different hardware emulations:
 lguest,kvm, possibly others in the future.  vhost is always working on
 real hardware, using eventfd as the interface, so it does not need that.
   

 Actually, vhost may not always be limited to real hardware.

Yes, any ethernet device will do. What I mean is that vhost does not
deal with emulation at all. All setup is done in userspace.


 We may one day use vhost as the basis of a driver domain.  There's quite
 a lot of interest in this for networking.

You can use veth for this. This works today.

 At any rate, I'd like to see performance results before we consider  
 trying to reuse virtio code.
 
 Regards,

 Anthony Liguori


[PATCHv2 0/3] qemu-kvm: vhost net support

2009-08-13 Thread Michael S. Tsirkin
This adds support for vhost-net virtio kernel backend.

This is RFC, but works without issues for me.

Still needs to be split up, tested and benchmarked properly,
but posting it here in case people want to test drive
the kernel bits I posted.

Changes since v1:
- rebased on top of 9dc275d9d660fe1cd64d36102d600885f9fdb88a

Michael S. Tsirkin (3):
  qemu-kvm: move virtio-pci.o to near pci.o
  virtio: move features to an inline function
  qemu-kvm: vhost-net implementation

 Makefile.hw         |    2 +-
 Makefile.target     |    3 +-
 hw/vhost_net.c      |  181 +++
 hw/vhost_net.h      |   30 +
 hw/virtio-balloon.c |    2 +-
 hw/virtio-blk.c     |    2 +-
 hw/virtio-console.c |    2 +-
 hw/virtio-net.c     |   34 +-
 hw/virtio-pci.c     |   43 +++-
 hw/virtio.c         |   19 --
 hw/virtio.h         |   38 ++-
 net.c               |    5 ++
 net.h               |    1 +
 qemu-kvm.h          |    9 +++
 14 files changed, 339 insertions(+), 32 deletions(-)
 create mode 100644 hw/vhost_net.c
 create mode 100644 hw/vhost_net.h


[PATCHv2 1/3] qemu-kvm: move virtio-pci.o to near pci.o

2009-08-13 Thread Michael S. Tsirkin
virtio-pci depends, and will always depend, on pci.c
so it makes sense to keep it in the same makefile,
(unlike the rest of virtio files which should eventually
 be moved out to Makefile.hw).

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 Makefile.hw     |    2 +-
 Makefile.target |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/Makefile.hw b/Makefile.hw
index 139412e..6472ec1 100644
--- a/Makefile.hw
+++ b/Makefile.hw
@@ -11,7 +11,7 @@ VPATH=$(SRC_PATH):$(SRC_PATH)/hw
 QEMU_CFLAGS+=-I.. -I$(SRC_PATH)/fpu
 
 obj-y =
-obj-y += virtio.o virtio-pci.o
+obj-y += virtio.o
 obj-y += fw_cfg.o
 obj-y += watchdog.o
 obj-y += nand.o ecc.o
diff --git a/Makefile.target b/Makefile.target
index aeda3fe..f6d9708 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -170,7 +170,7 @@ obj-y = vl.o osdep.o monitor.o pci.o loader.o isa_mmio.o machine.o \
 gdbstub.o gdbstub-xml.o msix.o ioport.o qemu-config.o
 # virtio has to be here due to weird dependency between PCI and virtio-net.
 # need to fix this properly
-obj-y += virtio-blk.o virtio-balloon.o virtio-net.o virtio-console.o
+obj-y += virtio-blk.o virtio-balloon.o virtio-net.o virtio-console.o virtio-pci.o
 obj-$(CONFIG_KVM) += kvm.o kvm-all.o
 
 LIBS+=-lz
-- 
1.6.2.5



[PATCHv2 2/3] virtio: move features to an inline function

2009-08-13 Thread Michael S. Tsirkin
Devices should have the final say over which virtio features they
support. E.g. indirect entries may or may not make sense in the context
of virtio-console.  Move the common bits from virtio-pci to an inline
function and let each device call it.

No functional changes.
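
As an illustration of what "final say" allows once the common bits are in a
helper, here is a minimal standalone sketch. The helper mirrors the one added
to hw/virtio.h in this patch; example_console_get_features is a hypothetical
device hook that is not part of the patch, and the bit numbers are assumed to
match the virtio headers of the time.

```c
#include <assert.h>
#include <stdint.h>

/* Feature bit numbers; assumed to match the virtio headers. */
#define VIRTIO_F_NOTIFY_ON_EMPTY    24
#define VIRTIO_RING_F_INDIRECT_DESC 28
#define VIRTIO_F_BAD_FEATURE        30

/* Bits every device gets by default, mirroring the helper in this patch. */
static inline uint32_t virtio_common_features(void)
{
    uint32_t features = 0;

    features |= (1u << VIRTIO_F_NOTIFY_ON_EMPTY);
    features |= (1u << VIRTIO_RING_F_INDIRECT_DESC);
    features |= (1u << VIRTIO_F_BAD_FEATURE);
    return features;
}

/* Hypothetical device hook: start from the common set, then opt out of
 * indirect descriptors. */
static uint32_t example_console_get_features(void)
{
    uint32_t features = virtio_common_features();

    features &= ~(1u << VIRTIO_RING_F_INDIRECT_DESC);
    assert(features & (1u << VIRTIO_F_NOTIFY_ON_EMPTY));
    return features;
}
```

With the bits OR-ed in unconditionally by virtio-pci, as before this patch, a
device has no way to clear one of them; with the helper, masking a bit is a
one-line change inside the device.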

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 hw/virtio-balloon.c |    2 +-
 hw/virtio-blk.c     |    2 +-
 hw/virtio-console.c |    2 +-
 hw/virtio-net.c     |    2 +-
 hw/virtio-pci.c     |    3 ---
 hw/virtio.h         |   10 ++
 6 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/hw/virtio-balloon.c b/hw/virtio-balloon.c
index 7ca783e..15b50bb 100644
--- a/hw/virtio-balloon.c
+++ b/hw/virtio-balloon.c
@@ -127,7 +127,7 @@ static void virtio_balloon_set_config(VirtIODevice *vdev,
 
 static uint32_t virtio_balloon_get_features(VirtIODevice *vdev)
 {
-return 0;
+return virtio_common_features();
 }
 
 static ram_addr_t virtio_balloon_to_target(void *opaque, ram_addr_t target)
diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
index c278d2e..a33eafb 100644
--- a/hw/virtio-blk.c
+++ b/hw/virtio-blk.c
@@ -378,7 +378,7 @@ static uint32_t virtio_blk_get_features(VirtIODevice *vdev)
     if (strcmp(s->serial_str, "0"))
         features |= 1 << VIRTIO_BLK_F_IDENTIFY;
 
-return features;
+return features | virtio_common_features();
 }
 
 static void virtio_blk_save(QEMUFile *f, void *opaque)
diff --git a/hw/virtio-console.c b/hw/virtio-console.c
index 663c8b9..ac25499 100644
--- a/hw/virtio-console.c
+++ b/hw/virtio-console.c
@@ -53,7 +53,7 @@ static void virtio_console_handle_input(VirtIODevice *vdev, VirtQueue *vq)
 
 static uint32_t virtio_console_get_features(VirtIODevice *vdev)
 {
-return 0;
+return virtio_common_features();
 }
 
 static int vcon_can_read(void *opaque)
diff --git a/hw/virtio-net.c b/hw/virtio-net.c
index ce8e6cb..469c6e3 100644
--- a/hw/virtio-net.c
+++ b/hw/virtio-net.c
@@ -154,7 +154,7 @@ static uint32_t virtio_net_get_features(VirtIODevice *vdev)
 }
 #endif
 
-return features;
+return features | virtio_common_features();
 }
 
 static uint32_t virtio_net_bad_features(VirtIODevice *vdev)
diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
index 8b57dfc..ab6e9c4 100644
--- a/hw/virtio-pci.c
+++ b/hw/virtio-pci.c
@@ -230,9 +230,6 @@ static uint32_t virtio_ioport_read(VirtIOPCIProxy *proxy, uint32_t addr)
     switch (addr) {
     case VIRTIO_PCI_HOST_FEATURES:
         ret = vdev->get_features(vdev);
-        ret |= (1 << VIRTIO_F_NOTIFY_ON_EMPTY);
-        ret |= (1 << VIRTIO_RING_F_INDIRECT_DESC);
-        ret |= (1 << VIRTIO_F_BAD_FEATURE);
         break;
     case VIRTIO_PCI_GUEST_FEATURES:
         ret = vdev->features;
diff --git a/hw/virtio.h b/hw/virtio.h
index c441a93..cbf472b 100644
--- a/hw/virtio.h
+++ b/hw/virtio.h
@@ -167,4 +167,14 @@ VirtIODevice *virtio_net_init(DeviceState *dev);
 VirtIODevice *virtio_console_init(DeviceState *dev);
 VirtIODevice *virtio_balloon_init(DeviceState *dev);
 
+static inline uint32_t virtio_common_features(void)
+{
+    uint32_t features = 0;
+    features |= (1 << VIRTIO_F_NOTIFY_ON_EMPTY);
+    features |= (1 << VIRTIO_RING_F_INDIRECT_DESC);
+    features |= (1 << VIRTIO_F_BAD_FEATURE);
+
+    return features;
+}
+
 #endif
-- 
1.6.2.5



[PATCHv2 3/3] qemu-kvm: vhost-net implementation

2009-08-13 Thread Michael S. Tsirkin
This adds support for vhost-net virtio kernel backend.
To enable (assuming device eth2):
1. enable promisc mode or program the guest MAC address in device eth2
2. disable TSO, GSO and LRO on the card
3. add vhost=eth2 to the -net flag
4. run with the CAP_NET_ADMIN privilege (e.g. as root)

This patch is RFC, but works without issues for me.

It still needs to be split up, tested and benchmarked properly,
but posting it here in case people want to test drive
the kernel bits I posted.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 Makefile.target |    3 +-
 hw/vhost_net.c  |  181 +++
 hw/vhost_net.h  |   30 +
 hw/virtio-net.c |   32 ++-
 hw/virtio-pci.c |   40 
 hw/virtio.c     |   19 --
 hw/virtio.h     |   28 -
 net.c           |    5 ++
 net.h           |    1 +
 qemu-kvm.h      |    9 +++
 10 files changed, 324 insertions(+), 24 deletions(-)
 create mode 100644 hw/vhost_net.c
 create mode 100644 hw/vhost_net.h

diff --git a/Makefile.target b/Makefile.target
index f6d9708..e941a36 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -170,7 +170,8 @@ obj-y = vl.o osdep.o monitor.o pci.o loader.o isa_mmio.o machine.o \
 gdbstub.o gdbstub-xml.o msix.o ioport.o qemu-config.o
 # virtio has to be here due to weird dependency between PCI and virtio-net.
 # need to fix this properly
-obj-y += virtio-blk.o virtio-balloon.o virtio-net.o virtio-console.o virtio-pci.o
+obj-y += virtio-blk.o virtio-balloon.o virtio-net.o virtio-console.o virtio-pci.o \
+	vhost_net.o
 obj-$(CONFIG_KVM) += kvm.o kvm-all.o
 
 LIBS+=-lz
diff --git a/hw/vhost_net.c b/hw/vhost_net.c
new file mode 100644
index 0000000..7d52de0
--- /dev/null
+++ b/hw/vhost_net.c
@@ -0,0 +1,181 @@
+#include <sys/eventfd.h>
+#include <sys/socket.h>
+#include <linux/kvm.h>
+#include <fcntl.h>
+#include <sys/ioctl.h>
+#include <linux/vhost.h>
+#include <linux/virtio_ring.h>
+#include <netpacket/packet.h>
+#include <net/ethernet.h>
+#include <net/if.h>
+#include <netinet/in.h>
+
+#include <stdio.h>
+
+#include "qemu-kvm.h"
+
+#include "vhost_net.h"
+
+const char *vhost_net_device;
+
+static int vhost_virtqueue_init(struct vhost_dev *dev,
+				struct VirtIODevice *vdev,
+				struct vhost_virtqueue *vq,
+				struct VirtQueue *q,
+				unsigned idx)
+{
+	target_phys_addr_t s, l;
+	int r;
+	struct vhost_vring_addr addr = {
+		.index = idx,
+	};
+	struct vhost_vring_file file = {
+		.index = idx,
+	};
+	struct vhost_vring_state size = {
+		.index = idx,
+	};
+
+	size.num = q->vring.num;
+	r = ioctl(dev->control, VHOST_SET_VRING_NUM, &size);
+	if (r)
+		return -errno;
+
+	file.fd = vq->kick = eventfd(0, 0);
+	r = ioctl(dev->control, VHOST_SET_VRING_KICK, &file);
+	if (r)
+		return -errno;
+	file.fd = vq->call = eventfd(0, 0);
+	r = ioctl(dev->control, VHOST_SET_VRING_CALL, &file);
+	if (r)
+		return -errno;
+
+	s = l = sizeof(struct vring_desc) * q->vring.num;
+	vq->desc = cpu_physical_memory_map(q->vring.desc, &l, 0);
+	if (!vq->desc || l != s)
+		return -ENOMEM;
+	addr.user_addr = (u_int64_t)(unsigned long)vq->desc;
+	r = ioctl(dev->control, VHOST_SET_VRING_DESC, &addr);
+	if (r < 0)
+		return -errno;
+	s = l = offsetof(struct vring_avail, ring) +
+		sizeof(u_int64_t) * q->vring.num;
+	vq->avail = cpu_physical_memory_map(q->vring.avail, &l, 0);
+	if (!vq->avail || l != s)
+		return -ENOMEM;
+	addr.user_addr = (u_int64_t)(unsigned long)vq->avail;
+	r = ioctl(dev->control, VHOST_SET_VRING_AVAIL, &addr);
+	if (r < 0)
+		return -errno;
+	s = l = offsetof(struct vring_used, ring) +
+		sizeof(struct vring_used_elem) * q->vring.num;
+	vq->used = cpu_physical_memory_map(q->vring.used, &l, 1);
+	if (!vq->used || l != s)
+		return -ENOMEM;
+	addr.user_addr = (u_int64_t)(unsigned long)vq->used;
+	r = ioctl(dev->control, VHOST_SET_VRING_USED, &addr);
+	if (r < 0)
+		return -errno;
+
+    r = vdev->binding->irqfd(vdev->binding_opaque, q->vector, vq->call);
+    if (r < 0)
+        return -errno;
+
+    r = vdev->binding->queuefd(vdev->binding_opaque, idx, vq->kick);
+    if (r < 0)
+        return -errno;
+
+	return 0;
+}
+
+static int vhost_dev_init(struct vhost_dev *hdev,
+			  VirtIODevice *vdev)
+{
+	int i, r, n = 0;
+	struct vhost_memory *mem;
+	hdev->control = open("/dev/vhost-net", O_RDWR);
+	if (hdev->control < 0)
+		return -errno;
+	r = ioctl(hdev->control, VHOST_SET_OWNER, NULL);
+	if (r < 0)
+		return -errno;
+	for (i = 0; i < KVM_MAX_NUM_MEM_REGIONS; 
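
The kick and call descriptors that vhost_virtqueue_init creates above are
plain eventfds. A minimal userspace sketch of that signalling primitive,
independent of vhost and Linux-only; notify and drain are hypothetical helper
names used here for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/* Signal the other side by adding 1 to the eventfd counter.
 * Returns 0 on success, -1 on error. */
static int notify(int fd)
{
    uint64_t one = 1;

    assert(fd >= 0);
    return write(fd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/* Consume pending notifications: returns the counter value and resets it
 * to zero, or -1 on error. Blocks if nothing is pending and the descriptor
 * is in blocking mode. */
static int64_t drain(int fd)
{
    uint64_t count;

    if (read(fd, &count, sizeof(count)) != (ssize_t)sizeof(count))
        return -1;
    return (int64_t)count;
}
```

Several notify calls before a drain are coalesced into one wakeup with a
count. That suits both directions here: a pending kick or interrupt only
needs to be seen once, however many times it was raised.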

Re: [PATCH v3 1/8] Do not call ack notifiers on PIC reset.

2009-08-13 Thread Marcelo Tosatti
On Wed, Aug 12, 2009 at 03:17:15PM +0300, Gleb Natapov wrote:
> For device assigned it may cause host hang since ack notifier callback
> enables host interrupt and guest not necessary cleared interrupt
> condition in an assigned device. For PIT we should not call ack notifier
> here since interrupt was not acked by a guest and should be redelivered.
> 
> Signed-off-by: Gleb Natapov g...@redhat.com
> ---
>  arch/x86/kvm/i8259.c |   16 
>  1 files changed, 0 insertions(+), 16 deletions(-)
> 
> diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c
> index 01f1516..eb2b8b7 100644
> --- a/arch/x86/kvm/i8259.c
> +++ b/arch/x86/kvm/i8259.c
> @@ -225,22 +225,6 @@ int kvm_pic_read_irq(struct kvm *kvm)
>  
>  void kvm_pic_reset(struct kvm_kpic_state *s)
>  {
> -	int irq, irqbase, n;
> -	struct kvm *kvm = s->pics_state->irq_request_opaque;
> -	struct kvm_vcpu *vcpu0 = kvm->bsp_vcpu;
> -
> -	if (s == &s->pics_state->pics[0])
> -		irqbase = 0;
> -	else
> -		irqbase = 8;
> -
> -	for (irq = 0; irq < PIC_NUM_PINS/2; irq++) {
> -		if (vcpu0 && kvm_apic_accept_pic_intr(vcpu0))
> -			if (s->irr & (1 << irq) || s->isr & (1 << irq)) {
> -				n = irq + irqbase;
> -				kvm_notify_acked_irq(kvm, SELECT_PIC(n), n);
> -			}
> -	}
>  	s->last_irr = 0;
>  	s->irr = 0;
>  	s->imr = 0;
> -- 
> 1.6.3.3

This used to be necessary to clear pending state from i8254.c
irq_acked logic. I think it'll break it.



Re: [PATCH v3 1/8] Do not call ack notifiers on PIC reset.

2009-08-13 Thread Gleb Natapov
On Thu, Aug 13, 2009 at 06:11:05AM -0300, Marcelo Tosatti wrote:
> On Wed, Aug 12, 2009 at 03:17:15PM +0300, Gleb Natapov wrote:
> > For device assigned it may cause host hang since ack notifier callback
> > enables host interrupt and guest not necessary cleared interrupt
> > condition in an assigned device. For PIT we should not call ack notifier
> > here since interrupt was not acked by a guest and should be redelivered.
> > 
> > Signed-off-by: Gleb Natapov g...@redhat.com
> > ---
> >  arch/x86/kvm/i8259.c |   16 
> >  1 files changed, 0 insertions(+), 16 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c
> > index 01f1516..eb2b8b7 100644
> > --- a/arch/x86/kvm/i8259.c
> > +++ b/arch/x86/kvm/i8259.c
> > @@ -225,22 +225,6 @@ int kvm_pic_read_irq(struct kvm *kvm)
> >  
> >  void kvm_pic_reset(struct kvm_kpic_state *s)
> >  {
> > -	int irq, irqbase, n;
> > -	struct kvm *kvm = s->pics_state->irq_request_opaque;
> > -	struct kvm_vcpu *vcpu0 = kvm->bsp_vcpu;
> > -
> > -	if (s == &s->pics_state->pics[0])
> > -		irqbase = 0;
> > -	else
> > -		irqbase = 8;
> > -
> > -	for (irq = 0; irq < PIC_NUM_PINS/2; irq++) {
> > -		if (vcpu0 && kvm_apic_accept_pic_intr(vcpu0))
> > -			if (s->irr & (1 << irq) || s->isr & (1 << irq)) {
> > -				n = irq + irqbase;
> > -				kvm_notify_acked_irq(kvm, SELECT_PIC(n), n);
> > -			}
> > -	}
> >  	s->last_irr = 0;
> >  	s->irr = 0;
> >  	s->imr = 0;
> > -- 
> > 1.6.3.3
> 
> This used to be necessary to clear pending state from i8254.c
> irq_acked logic. I think it'll break it.
This is just a hack then, and it does not exist in ioapic, so if
it is really needed the ioapic+pit combination is broken. But the problem
should be solved inside i8254.c, not somewhere else. Setting irq_acked to 1 in
pit_load_count() seems like the right thing to do, something like
the patch below. Ideally pending should be scaled instead of reset.
Also, maybe the problem exists because the PIC doesn't call mask notifiers?


diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
index b857ca3..aa7f68e 100644
--- a/arch/x86/kvm/i8254.c
+++ b/arch/x86/kvm/i8254.c
@@ -325,6 +325,9 @@ static void pit_load_count(struct kvm *kvm, int channel, u32 val)
 		return;
 	}
 
+	atomic_set(&pt->pending, 0);
+	ps->irq_ack = 1;
+
 	/* Two types of timer
 	 * mode 1 is one shot, mode 2 is period, otherwise del timer */
 	switch (ps->channels[0].mode) {

--
Gleb.


Re: [PATCH v3 7/8] Move IO APIC to its own lock.

2009-08-13 Thread Gleb Natapov
On Thu, Aug 13, 2009 at 06:13:30AM -0300, Marcelo Tosatti wrote:
> > +++ b/virt/kvm/ioapic.c
> > @@ -182,6 +182,7 @@ int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int level)
> >  	union kvm_ioapic_redirect_entry entry;
> >  	int ret = 1;
> >  
> > +	mutex_lock(&ioapic->lock);
> >  	if (irq >= 0 && irq < IOAPIC_NUM_PINS) {
> >  		entry = ioapic->redirtbl[irq];
> >  		level ^= entry.fields.polarity;
> 
> But this is an RCU critical section now, right? 
> 
Correct! Forget about that. It was spinlock, but Avi prefers mutexes.

> If so, you can't sleep, must use a spinlock.
Either that or I can collect callbacks in critical section and call them
afterwards.
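
The "collect, then call" option can be sketched in plain C. This is a toy
model and not KVM code: notifier_table, ack_irq and example_notifier are
hypothetical names, and a flag stands in for the RCU read-side section (or
spinlock) in which sleeping is not allowed.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NOTIFIERS 8

struct notifier_table;
typedef void (*notifier_fn)(struct notifier_table *t, int irq);

struct notifier_table {
    notifier_fn fn[MAX_NOTIFIERS];
    int         irq[MAX_NOTIFIERS];
    size_t      count;
    int         in_critical_section; /* models "sleeping not allowed here" */
};

/* Deliver acks for one irq. Matching callbacks are only collected while
 * the critical section is held, and are invoked after leaving it.
 * Returns the number of callbacks invoked. */
static size_t ack_irq(struct notifier_table *t, int irq)
{
    notifier_fn pending[MAX_NOTIFIERS];
    size_t n = 0, i;

    assert(t->count <= MAX_NOTIFIERS);

    t->in_critical_section = 1;      /* e.g. rcu_read_lock() */
    for (i = 0; i < t->count; i++)
        if (t->irq[i] == irq)
            pending[n++] = t->fn[i];
    t->in_critical_section = 0;      /* e.g. rcu_read_unlock() */

    for (i = 0; i < n; i++)
        pending[i](t, irq);          /* callbacks are now free to sleep */
    return n;
}

/* Example callback: checks it runs outside the critical section. */
static int example_calls;
static void example_notifier(struct notifier_table *t, int irq)
{
    (void)irq;
    assert(!t->in_critical_section);
    example_calls++;
}
```

The sketch ignores lifetime: in real code a callback collected under RCU
could be unregistered and freed before it is invoked, so the collected
entries would need a reference count or similar protection.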

--
Gleb.


Re: [PATCH v3 7/8] Move IO APIC to its own lock.

2009-08-13 Thread Avi Kivity

On 08/13/2009 12:48 PM, Gleb Natapov wrote:
> On Thu, Aug 13, 2009 at 06:13:30AM -0300, Marcelo Tosatti wrote:
>>> +++ b/virt/kvm/ioapic.c
>>> @@ -182,6 +182,7 @@ int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int level)
>>>  	union kvm_ioapic_redirect_entry entry;
>>>  	int ret = 1;
>>>
>>> +	mutex_lock(&ioapic->lock);
>>>  	if (irq >= 0 && irq < IOAPIC_NUM_PINS) {
>>>  		entry = ioapic->redirtbl[irq];
>>>  		level ^= entry.fields.polarity;
>>
>> But this is an RCU critical section now, right?
>
> Correct! Forget about that. It was spinlock, but Avi prefers mutexes.

Well, I prefer correct code to mutexes.

>> If so, you can't sleep, must use a spinlock.
>
> Either that or I can collect callbacks in critical section and call them
> afterwards.

There's also srcu.

--
error compiling committee.c: too many arguments to function



Re: [PATCH v3 7/8] Move IO APIC to its own lock.

2009-08-13 Thread Gleb Natapov
On Thu, Aug 13, 2009 at 12:49:45PM +0300, Avi Kivity wrote:
 On 08/13/2009 12:48 PM, Gleb Natapov wrote:
 On Thu, Aug 13, 2009 at 06:13:30AM -0300, Marcelo Tosatti wrote:
 +++ b/virt/kvm/ioapic.c
 @@ -182,6 +182,7 @@ int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int 
 irq, int level)
union kvm_ioapic_redirect_entry entry;
int ret = 1;
 
 +  mutex_lock(ioapic-lock);
if (irq= 0  irq  IOAPIC_NUM_PINS) {
entry = ioapic-redirtbl[irq];
level ^= entry.fields.polarity;
 But this is an RCU critical section now, right?
 
 Correct! Forget about that. It was spinlock, but Avi prefers mutexes.
 
 Well, I prefer correct code to mutexes.
 
 If so, you can't sleep, must use a spinlock.
 Either that or I can collect callbacks in critical section and call them
 afterwords.
 
 There's also srcu.
 
What are the disadvantages? There should be some, otherwise why not use
it all the time.

--
Gleb.


Re: [PATCH v3 7/8] Move IO APIC to its own lock.

2009-08-13 Thread Avi Kivity

On 08/13/2009 01:09 PM, Gleb Natapov wrote:
>> There's also srcu.
>
> What are the disadvantages? There should be some, otherwise why not use
> it all the time.

I think it incurs an atomic op in the read path, but not much overhead 
otherwise.  Paul?


--
error compiling committee.c: too many arguments to function



Re: [PATCH] Fix Makefile rule for compiling emulate.c

2009-08-13 Thread Avi Kivity

On 08/13/2009 12:42 AM, Mohammed Gamal wrote:
> Signed-off-by: Mohammed Gamal m.gamal...@gmail.com
> ---
>  arch/x86/kvm/Makefile |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
> index afaaa76..0e7fe78 100644
> --- a/arch/x86/kvm/Makefile
> +++ b/arch/x86/kvm/Makefile
> @@ -9,7 +9,7 @@ kvm-y			+= $(addprefix ../../../virt/kvm/, kvm_main.o ioapic.o \
>  				coalesced_mmio.o irq_comm.o eventfd.o)
>  kvm-$(CONFIG_IOMMU_API)	+= $(addprefix ../../../virt/kvm/, iommu.o)
>
> -kvm-y			+= x86.o mmu.o x86_emulate.o i8259.o irq.o lapic.o \
> +kvm-y			+= x86.o mmu.o emulate.o i8259.o irq.o lapic.o \

Already have the same fix in my tree, thanks.

--
error compiling committee.c: too many arguments to function



Re: [PATCH 0/3] qemu-kvm: vhost net support

2009-08-13 Thread Gregory Haskins
Michael S. Tsirkin wrote:
 On Wed, Aug 12, 2009 at 04:27:44PM -0400, Gregory Haskins wrote:
 Michael S. Tsirkin wrote:
 This adds support for vhost-net virtio kernel backend.

 This is RFC, but works without issues for me.

 Still needs to be split up, tested and benchmarked properly,
 but posting it here in case people want to test drive
 the kernel bits I posted.
 This has a large degree of rejects against qemu-kvm.git/master.  What
 tree does this apply to?

 -Greg

 
 Likely that tree has advanced since.
 This is on top of commit b6bbd41fac4b6fb0efc65e083d2151ce1521f615.
 


Hmm... better, but I still get rejects.  Of particular concern is this
one in net.c:

@@ -1903,7 +1903,7 @@ static TAPState *net_tap_init(VLANState *vlan, const char *model,
 typedef struct RAWState {
 VLANClientState *vc;
 int fd;
-uint8_t buf[4096];
+uint8_t buf[65000];
 int promisc;
 } RAWState;


I do not see any occurrence of RAWState in b6bbd41f (or master, for
that matter).  There is probably an operator error somewhere in here ;),
but any help getting this working is appreciated.  Do you have a git
tree I can pull somewhere?

Kind Regards,
-Greg





Re: [PATCH 0/3] qemu-kvm: vhost net support

2009-08-13 Thread Michael S. Tsirkin
On Thu, Aug 13, 2009 at 07:35:52AM -0400, Gregory Haskins wrote:
 Michael S. Tsirkin wrote:
  On Wed, Aug 12, 2009 at 04:27:44PM -0400, Gregory Haskins wrote:
  Michael S. Tsirkin wrote:
  This adds support for vhost-net virtio kernel backend.
 
  This is RFC, but works without issues for me.
 
  Still needs to be split up, tested and benchmarked properly,
  but posting it here in case people want to test drive
  the kernel bits I posted.
  This has a large degree of rejects against qemu-kvm.git/master.  What
  tree does this apply to?
 
  -Greg
 
  
  Likely that tree has advanced since.
  This is on top of commit b6bbd41fac4b6fb0efc65e083d2151ce1521f615.
  
 
 
 Hmmbetter, but I still get rejects.  Of particular concern is this
 one in net.c:
 
 @@ -1903,7 +1903,7 @@ static TAPState *net_tap_init(VLANState *vlan,
 const char *model,
  typedef struct RAWState {
  VLANClientState *vc;
  int fd;
 -uint8_t buf[4096];
 +uint8_t buf[65000];
  int promisc;
  } RAWState;
 
 
 I do not see any occurrence of RAWState in b6bbd41f (or master, for
 that matter).  There is probably an operator error somewhere in here ;),

Yes. Mine :)

 but any help getting this working is appreciated.

I reposted a clean one which is against latest bits earlier today.
Look for PATCHv2 in your inbox.

 Do you have a git tree I can pull somewhere?
 Kind Regards,
 -Greg
 

Thanks for the patience,

-- 
MST


Re: qemu-kvm segfaults in qemu_del_timer (0.10.5 and 0.10.6)

2009-08-13 Thread Chris Webb
Chris Webb ch...@arachsys.com writes:

 Avi Kivity a...@redhat.com writes:
 
  I understand it's hard, but it's nearly impossible to work out the  
  problem from so little data, so please do make the effort to obtain 
  dumps.
 
 We're trying for this at the moment, but since we can't change the rlimit
 for the running qemu-kvm processes (?), we'll have to wait until one of the
 new ones dies, which may take some time. I'll follow up when I do have
 something.

We've been lucky and relatively quickly got a core dump from one of the new
qemu-kvms with the non-zero core file rlimit. A backtrace looks like this:

  (gdb) bt
  #0  0x004068f7 in qemu_mod_timer (ts=0x30d1f30, expire_time=430489)
      at /packages/qemu-kvm/src-f39tF1/vl.c:1161
  #1  0x00495dd5 in vnc_update_client (opaque=<value optimized out>) at vnc.c:765
  #2  0x004081da in main_loop_wait (timeout=<value optimized out>)
      at /packages/qemu-kvm/src-f39tF1/vl.c:1240
  #3  0x0051613a in kvm_main_loop () at /packages/qemu-kvm/src-f39tF1/qemu-kvm.c:596
  #4  0x0040c7b7 in main (argc=<value optimized out>, argv=<value optimized out>,
      envp=<value optimized out>) at /packages/qemu-kvm/src-f39tF1/vl.c:3850

The segfault appears to be a null pointer dereference. ts->clock is NULL
and line 1161 uses ts->clock->type:

  (gdb) p ts
  $4 = (QEMUTimer *) 0x30d1f30
  (gdb) p ts->clock
  $5 = (QEMUClock *) 0x0

The VncState in vnc_update_client is as follows:

  (gdb) f 1
  #1  0x00495dd5 in vnc_update_client (opaque=<value optimized out>) at vnc.c:765
  765 qemu_mod_timer(vs->timer, qemu_get_clock(rt_clock) + VNC_REFRESH_INTERVAL);
  (gdb) p *vs
  $12 = {timer = 0x30d1f30, csock = -986235208, ds = 0x0, vd = 0x0, need_update = 1,
    dirty_row = {{0, 0, 4294967295, 4294967295} <repeats 768 times>,
      {4294967295, 4294967295, 4294967295, 4294967295} <repeats 1280 times>},
    old_data = 0x7f9b8276f010 <Address 0x7f9b8276f010 out of bounds>, features = 98,
    absolute = 1, last_x = -1, last_y = -1, vnc_encoding = 5, tight_quality = 6 '\006',
    tight_compression = 1 '\001', major = 3, minor = 3,
    challenge = "\032\314i\257\302t1(\320\312\263\024pH\226",
    output = {capacity = 1545078, offset = 684, buffer = 0x3107860 },
    input = {capacity = 5120, offset = 0, buffer = 0x3106450 "\020\220(\003"},
    write_pixels = 0x490b50 <vnc_write_pixels_generic>,
    send_hextile_tile = 0x492030 <send_hextile_tile_generic_32>,
    clientds = {flags = 0 '\0', width = 800, height = 600, linesize = 3200,
      data = 0x7f9b82944010 <Address 0x7f9b82944010 out of bounds>,
      pf = {bits_per_pixel = 32 ' ', bytes_per_pixel = 4 '\004', depth = 24 '\030',
        rmask = 0, gmask = 0, bmask = 0, amask = 0, rshift = 16 '\020', gshift = 8 '\b',
        bshift = 0 '\0', ashift = 24 '\030', rmax = 255 '\377', gmax = 255 '\377',
        bmax = 255 '\377', amax = 255 '\377', rbits = 8 '\b', gbits = 8 '\b',
        bbits = 8 '\b', abits = 8 '\b'}},
    serverds = {flags = 2 '\002', width = 1024, height = 768, linesize = 4096,
      data = 0x7f9b8246e010 ,
      pf = {bits_per_pixel = 32 ' ', bytes_per_pixel = 4 '\004', depth = 24 '\030',
        rmask = 16711680, gmask = 65280, bmask = 255, amask = 0, rshift = 16 '\020',
        gshift = 8 '\b', bshift = 0 '\0', ashift = 24 '\030', rmax = 255 '\377',
        gmax = 255 '\377', bmax = 255 '\377', amax = 255 '\377', rbits = 8 '\b',
        gbits = 8 '\b', bbits = 8 '\b', abits = 8 '\b'}},
    audio_cap = 0x0,
    as = {freq = 44100, nchannels = 2, fmt = AUD_FMT_S16, endianness = 0},
    read_handler = 0x494b40 <protocol_client_msg>, read_handler_expect = 1,
    modifiers_state = '\0' <repeats 255 times>,
    zlib = {capacity = 0, offset = 0, buffer = 0x0},
    zlib_tmp = {capacity = 0, offset = 0, buffer = 0x0},
    zlib_stream = {
      {next_in = 0x0, avail_in = 0, total_in = 0, next_out = 0x0, avail_out = 0,
       total_out = 0, msg = 0x0, state = 0x0, zalloc = 0, zfree = 0, opaque = 0x0,
       data_type = 0, adler = 0, reserved = 0},
      {next_in = 0x0, avail_in = 0, total_in = 0, next_out = 0x0, avail_out = 0,
       total_out = 0, msg = 0x0, state = 0x0, zalloc = 0, zfree = 0, opaque = 0x0,
       data_type = 0, adler = 0, reserved = 0},
      {next_in = 0x0, avail_in = 0, total_in = 0, next_out = 0x0, avail_out = 0,
       total_out = 0, msg = 0x0, state = 0x0, zalloc = 0, zfree = 0, opaque = 0x0,
       data_type = 0, adler = 0, reserved = 0},
      {next_in = 0x0, avail_in = 0, total_in = 0, next_out = 0x0, avail_out = 0,
       total_out = 0, msg = 0x0, state = 0x0, zalloc = 0, zfree = 0, opaque = 0x0,
       data_type = 0, adler = 0, reserved = 0}},
    next = 0x0}

I'm afraid I only have one of these, so I can't say whether the other
segfaults were exactly the same or different (other than knowing the source
line matched), but I'll keep my eye out for more core dumps.

qemu-kvm command line for this guest would have been

  qemu-kvm -m 1024 -smp 1 

Re: qemu-kvm segfaults in qemu_del_timer (0.10.5 and 0.10.6)

2009-08-13 Thread Chris Webb
Chris Webb ch...@arachsys.com writes:

 The segfault appears to be a null pointer dereference. ts->clock is NULL
 and line 1161 uses ts->clock->type:
 
   (gdb) p ts
   $4 = (QEMUTimer *) 0x30d1f30
   (gdb) p ts->clock
   $5 = (QEMUClock *) 0x0

Sorry, meant to paste this too:

  (gdb) p *ts
  $1 = {clock = 0x0, expire_time = 49, cb = 0x2b63630, opaque = 0x30fe000,
    next = 0x495b40}

Cheers,

Chris.


Re: qemu-kvm segfaults in qemu_del_timer (0.10.5 and 0.10.6)

2009-08-13 Thread Avi Kivity

On 08/13/2009 03:23 PM, Chris Webb wrote:

We've been lucky and relatively quickly got a core dump from one of the new
qemu-kvms with the non-zero core file rlimit. A backtrace looks like this:

   (gdb) bt
   #0  0x004068f7 in qemu_mod_timer (ts=0x30d1f30, expire_time=430489)
       at /packages/qemu-kvm/src-f39tF1/vl.c:1161
   #1  0x00495dd5 in vnc_update_client (opaque=<value optimized out>) at vnc.c:765
   #2  0x004081da in main_loop_wait (timeout=<value optimized out>)
       at /packages/qemu-kvm/src-f39tF1/vl.c:1240
   #3  0x0051613a in kvm_main_loop () at /packages/qemu-kvm/src-f39tF1/qemu-kvm.c:596
   #4  0x0040c7b7 in main (argc=<value optimized out>, argv=<value optimized out>,
       envp=<value optimized out>) at /packages/qemu-kvm/src-f39tF1/vl.c:3850

The segfault appears to be a null pointer dereference. ts->clock is NULL
and line 1161 uses ts->clock->type:

   (gdb) p ts
   $4 = (QEMUTimer *) 0x30d1f30
   (gdb) p ts->clock
   $5 = (QEMUClock *) 0x0

The VncState in vnc_update_client is as follows:

   (gdb) f 1
   #1  0x00495dd5 in vnc_update_client (opaque=<value optimized out>) at vnc.c:765
   765 qemu_mod_timer(vs->timer, qemu_get_clock(rt_clock) + VNC_REFRESH_INTERVAL);
   (gdb) p *vs
   $12 = {timer = 0x30d1f30, csock = -986235208,


csock looks corrupted, should be -1 or an fd.  Was a vnc client connected?

Was the guest playing with the display resolution?


ds = 0x0, vd = 0x0, need_update = 1,
 dirty_row = {{0, 0, 4294967295, 4294967295} <repeats 768 times>,
   {4294967295, 4294967295, 4294967295, 4294967295} <repeats 1280 times>},
 old_data = 0x7f9b8276f010 <Address 0x7f9b8276f010 out of bounds>,


old_data is also corrupted according to gdb, though it seems sane.


--
error compiling committee.c: too many arguments to function



Re: qemu-kvm segfaults in qemu_del_timer (0.10.5 and 0.10.6)

2009-08-13 Thread Chris Webb
Avi Kivity a...@redhat.com writes:

 csock looks corrupted, should be -1 or an fd.  Was a vnc client connected?
 Was the guest playing with the display resolution?

Yes, I think in this case there was a vncviewer connected, and the guest had
started booting up into windows, which changes the resolution a couple of
times.

Best wishes,

Chris.


Re: qemu-kvm segfaults in qemu_del_timer (0.10.5 and 0.10.6)

2009-08-13 Thread Chris Webb
Chris Webb ch...@arachsys.com writes:

 Avi Kivity a...@redhat.com writes:
 
  csock looks corrupted, should be -1 or an fd.  Was a vnc client connected?
  Was the guest playing with the display resolution?
 
 Yes, I think in this case there was a vncviewer connected, and the guest had
 started booting up into windows, which changes the resolution a couple of
 times.

Also, I think the vncviewer might actually have been disconnecting at about
the time the segfault happened.

Cheers,

Chris.


Re: qemu-kvm segfaults in qemu_del_timer (0.10.5 and 0.10.6)

2009-08-13 Thread Avi Kivity

On 08/13/2009 03:45 PM, Chris Webb wrote:

Chris Webb ch...@arachsys.com writes:

Avi Kivity a...@redhat.com writes:

csock looks corrupted, should be -1 or an fd.  Was a vnc client connected?
Was the guest playing with the display resolution?

Yes, I think in this case there was a vncviewer connected, and the guest had
started booting up into windows, which changes the resolution a couple of
times.

Also, I think the vncviewer might actually have been disconnecting at about
the time the segfault happened.


master branch has a patch that fixes a use-after-free when 
disconnecting.  Unfortunately it doesn't port cleanly to stable-0.10.


--
error compiling committee.c: too many arguments to function



Re: [PATCH 2/2] vhost_net: a kernel-level virtio server

2009-08-13 Thread Arnd Bergmann
On Wednesday 12 August 2009, Anthony Liguori wrote:
 At any rate, I'd like to see performance results before we consider 
 trying to reuse virtio code.

Yes, I agree. I'd also like to do more work on the macvlan extensions
to see if it works out without involving a socket. Passing the socket
into the vhost_net device is a nice feature of the current implementation
that we'd have to give up for something else (e.g. making the vhost
a real network interface that you can hook up to a bridge) if it were
to use virtio.

Unless I can come up with a solution that is clearly superior, I'm
taking back my objections on that part for now.

Arnd 


Re: [PATCH 2/2] vhost_net: a kernel-level virtio server

2009-08-13 Thread Arnd Bergmann
On Thursday 13 August 2009, Arnd Bergmann wrote:
 Unfortunately, this also implies that you could no longer simply use the
 packet socket interface as you do currently, as I realized only now.
 This obviously has a significant impact on your user space interface.

Also, if we do the copy in the transport, it definitely means that we
can't get to zero-copy RX/TX from guest space any more. The current
vhost_net driver doesn't do that yet, but could be extended in the
same way that I'm hoping to do it for macvtap.

Arnd 


Re: [PATCH 2/2] vhost_net: a kernel-level virtio server

2009-08-13 Thread Arnd Bergmann
On Thursday 13 August 2009, Michael S. Tsirkin wrote:
 On Wed, Aug 12, 2009 at 07:59:47PM +0200, Arnd Bergmann wrote:
  The trick is to swap the virtqueues instead. virtio-net is actually
  mostly symmetric in just the same way that the physical wires on a
  twisted pair ethernet are symmetric (I like how that analogy fits).
 
 You need to really squint hard for it to look symmetric.
 
 For example, for RX, virtio allocates an skb, puts a descriptor on a
 ring and waits for host to fill it in. Host system can not do the same:
 guest does not have access to host memory.
 
 You can do a copy in transport to hide this fact, but it will kill
 performance.

Yes, that is what I was suggesting all along. The actual copy operation
has to be done by the host transport, which is obviously different from
the guest transport that just calls the host using vring_kick().

Right now, the number of copy operations in your code is the same.
You are doing the copy a little bit later in skb_copy_datagram_iovec(),
which is indeed a very nice hack. Changing to a virtqueue based method
would imply that the host needs to add each skb_frag_t to its outbound
virtqueue, which then gets copied into the guests inbound virtqueue.

Unfortunately, this also implies that you could no longer simply use the
packet socket interface as you do currently, as I realized only now.
This obviously has a significant impact on your user space interface.

Arnd 


[PATCH] Documentation: Update KVM list email address

2009-08-13 Thread Amit Shah
The KVM list moved to vger.kernel.org last year

Signed-off-by: Amit Shah amit.s...@redhat.com
---
 Documentation/ioctl/ioctl-number.txt |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt
index 1f779a2..a039cb0 100644
--- a/Documentation/ioctl/ioctl-number.txt
+++ b/Documentation/ioctl/ioctl-number.txt
@@ -189,7 +189,7 @@ CodeSeq#Include FileComments
 0xAD   00  Netfilter devicein development:
mailto:ru...@rustcorp.com.au  
 0xAE   all linux/kvm.h Kernel-based Virtual Machine
-   mailto:kvm-de...@lists.sourceforge.net
+   mailto:kvm@vger.kernel.org
 0xB0   all RATIO devices   in development:
mailto:v...@ratio.de
 0xB1   00-1F   PPPoX   mailto:mostr...@styx.uwaterloo.ca
-- 
1.6.2.5



Re: [PATCH 2/2] vhost_net: a kernel-level virtio server

2009-08-13 Thread Michael S. Tsirkin
On Thu, Aug 13, 2009 at 03:38:43PM +0200, Arnd Bergmann wrote:
 On Thursday 13 August 2009, Michael S. Tsirkin wrote:
  On Wed, Aug 12, 2009 at 07:59:47PM +0200, Arnd Bergmann wrote:
   The trick is to swap the virtqueues instead. virtio-net is actually
   mostly symmetric in just the same way that the physical wires on a
   twisted pair ethernet are symmetric (I like how that analogy fits).
  
  You need to really squint hard for it to look symmetric.
  
  For example, for RX, virtio allocates an skb, puts a descriptor on a
  ring and waits for host to fill it in. Host system can not do the same:
  guest does not have access to host memory.
  
  You can do a copy in transport to hide this fact, but it will kill
  performance.
 
 Yes, that is what I was suggesting all along. The actual copy operation
 has to be done by the host transport, which is obviously different from
 the guest transport that just calls the host using vring_kick().
 
 Right now, the number of copy operations in your code is the same.
 You are doing the copy a little bit later in skb_copy_datagram_iovec(),
 which is indeed a very nice hack. Changing to a virtqueue based method
 would imply that the host needs to add each skb_frag_t to its outbound
 virtqueue, which then gets copied into the guests inbound virtqueue.

Which is a lot more code than just calling skb_copy_datagram_iovec.

 Unfortunately, this also implies that you could no longer simply use the
 packet socket interface as you do currently, as I realized only now.
 This obviously has a significant impact on your user space interface.
 
   Arnd 

And, it will remove our ability to implement zero copy
down the road (when raw sockets support it).

-- 
MST


Re: [PATCH 2/2] vhost_net: a kernel-level virtio server

2009-08-13 Thread Michael S. Tsirkin
On Thu, Aug 13, 2009 at 03:48:35PM +0200, Arnd Bergmann wrote:
 On Thursday 13 August 2009, Arnd Bergmann wrote:
  Unfortunately, this also implies that you could no longer simply use the
  packet socket interface as you do currently, as I realized only now.
  This obviously has a significant impact on your user space interface.
 
 Also, if we do the copy in the transport, it definitely means that we
 can't get to zero-copy RX/TX from guest space any more. The current
 vhost_net driver doesn't do that yet, but could be extended in the
 same way that I'm hoping to do it for macvtap.
 
   Arnd 

The best way to do this IMO would be to add zero copy support to raw
sockets, vhost will then get it basically for free.

-- 
MST


Re: [PATCH 2/2] vhost_net: a kernel-level virtio server

2009-08-13 Thread Arnd Bergmann
On Thursday 13 August 2009, Michael S. Tsirkin wrote:
 The best way to do this IMO would be to add zero copy support to raw
 sockets, vhost will then get it basically for free.

Yes, that would be nice. I wonder if that could lead to security
problems on TX though. I guess it will only bring significant performance
improvements if we leave the data writable in the user space or guest
during the operation. If the user finds the right timing, it could
modify the frame headers after they have been checked using netfilter,
or while the frames are being consumed in the kernel (e.g. an NFS
server running in a guest).

Arnd 


Re: [PATCH 2/2] vhost_net: a kernel-level virtio server

2009-08-13 Thread Arnd Bergmann
On Thursday 13 August 2009, Michael S. Tsirkin wrote:
  Right now, the number of copy operations in your code is the same.
  You are doing the copy a little bit later in skb_copy_datagram_iovec(),
  which is indeed a very nice hack. Changing to a virtqueue based method
  would imply that the host needs to add each skb_frag_t to its outbound
  virtqueue, which then gets copied into the guests inbound virtqueue.
 
 Which is a lot more code than just calling skb_copy_datagram_iovec.

Well, I don't see this part as much of a problem, because the code
already exists in virtio_net. If we really wanted to go down that road,
just using virtio_net would solve the problem of frame handling
entirely, but create new problems elsewhere, as we have mentioned.

Arnd 


Re: [PATCH 2/2] vhost_net: a kernel-level virtio server

2009-08-13 Thread Michael S. Tsirkin
On Thu, Aug 13, 2009 at 04:58:06PM +0200, Arnd Bergmann wrote:
 On Thursday 13 August 2009, Michael S. Tsirkin wrote:
   Right now, the number of copy operations in your code is the same.
   You are doing the copy a little bit later in skb_copy_datagram_iovec(),
   which is indeed a very nice hack. Changing to a virtqueue based method
   would imply that the host needs to add each skb_frag_t to its outbound
   virtqueue, which then gets copied into the guests inbound virtqueue.
  
  Which is a lot more code than just calling skb_copy_datagram_iovec.
 
 Well, I don't see this part as much of a problem, because the code
 already exists in virtio_net.

I am talking about the copying done in low level transport, here.

 If we really wanted to go down that road,
 just using virtio_net would solve the problem of frame handling
 entirely, but create new problems elsewhere, as we have mentioned.
 
   Arnd 


Re: [PATCH v3 7/8] Move IO APIC to its own lock.

2009-08-13 Thread Paul E. McKenney
On Thu, Aug 13, 2009 at 01:44:06PM +0300, Avi Kivity wrote:
 On 08/13/2009 01:09 PM, Gleb Natapov wrote:
 There's also srcu.
  
 What are the disadvantages? There should be some, otherwise why not use
 it all the time.

 I think it incurs an atomic op in the read path, but not much overhead 
 otherwise.  Paul?

There are no atomic operations in srcu_read_lock():

int srcu_read_lock(struct srcu_struct *sp)
{
        int idx;

        preempt_disable();
        idx = sp->completed & 0x1;
        barrier();  /* ensure compiler looks -once- at sp->completed. */
        per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]++;
        srcu_barrier();  /* ensure compiler won't misorder critical section. */
        preempt_enable();
        return idx;
}

There is a preempt_disable() and a preempt_enable(), which
non-atomically manipulate a field in the thread_info structure.
There is a barrier() and an srcu_barrier(), which are just compiler
directives (no code generated).  Other than that, simple arithmetic
and array accesses.  Shouldn't even be any cache misses in the common
case (the uncommon case being where synchronize_srcu() is executing on
some other CPU).

There is even less in srcu_read_unlock():

void srcu_read_unlock(struct srcu_struct *sp, int idx)
{
        preempt_disable();
        srcu_barrier();  /* ensure compiler won't misorder critical section. */
        per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]--;
        preempt_enable();
}

So SRCU should have pretty low overhead.  And, as with other forms
of RCU, legal use of the read-side primitives cannot possibly
participate in deadlocks.

So, to answer the question above, what are the disadvantages?

o   On the update side, synchronize_srcu() does take some time,
    mostly blocking in synchronize_sched().  So, like other
    forms of RCU, you would use SRCU in read-mostly situations.

o   Just as with RCU, reads and updates run concurrently, with
    all the good and bad that this implies.  For an example
    of the good, srcu_read_lock() executes deterministically,
    no blocking or spinning.  For an example of the bad, there
    is no way to shut down SRCU readers.  These are opposite
    sides of the same coin.  ;-)

o   Although srcu_read_lock() and srcu_read_unlock() are
    lightweight, they are expensive compared to other forms of RCU.

o   In contrast to other forms of RCU, SRCU requires that the
    return value from srcu_read_lock() be passed into
    srcu_read_unlock().  Usually not a problem, but does place
    another constraint on the code.

Please keep in mind that I have no idea about what you are thinking of
using SRCU for, so the above advice is necessarily quite generic.  ;-)

Thanx, Paul


Re: [PATCH 2/2] vhost_net: a kernel-level virtio server

2009-08-13 Thread Avi Kivity

On 08/13/2009 05:53 PM, Arnd Bergmann wrote:

On Thursday 13 August 2009, Michael S. Tsirkin wrote:

The best way to do this IMO would be to add zero copy support to raw
sockets, vhost will then get it basically for free.

Yes, that would be nice. I wonder if that could lead to security
problems on TX though. I guess it will only bring significant performance
improvements if we leave the data writable in the user space or guest
during the operation. If the user finds the right timing, it could
modify the frame headers after they have been checked using netfilter,
or while the frames are being consumed in the kernel (e.g. an NFS
server running in a guest).


IIRC when the kernel consumes data it linearizes the skb.  We just need 
to make sure all the zerocopy data is in the nonlinear part, and the 
kernel will copy if/when it needs to access packet data.


--
error compiling committee.c: too many arguments to function



[PATCH] Move irq sharing information to irqchip level.

2009-08-13 Thread Gleb Natapov

This removes the assumption that the maximum number of GSIs is smaller
than the number of pins.  Sharing is now tracked at the pin level, not
the GSI level.

Signed-off-by: Gleb Natapov g...@redhat.com
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index b17d845..4c15bdd 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -413,7 +413,6 @@ struct kvm_arch{
gpa_t ept_identity_map_addr;
 
unsigned long irq_sources_bitmap;
-   unsigned long irq_states[KVM_IOAPIC_NUM_PINS];
u64 vm_init_tsc;
 };
 
diff --git a/arch/x86/kvm/irq.h b/arch/x86/kvm/irq.h
index 7d6058a..c025a23 100644
--- a/arch/x86/kvm/irq.h
+++ b/arch/x86/kvm/irq.h
@@ -71,6 +71,7 @@ struct kvm_pic {
int output; /* intr from master PIC */
struct kvm_io_device dev;
void (*ack_notifier)(void *opaque, int irq);
+   unsigned long irq_states[16];
 };
 
 struct kvm_pic *kvm_create_pic(struct kvm *kvm);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index f814512..beab24b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -121,7 +121,7 @@ struct kvm_kernel_irq_routing_entry {
u32 gsi;
u32 type;
int (*set)(struct kvm_kernel_irq_routing_entry *e,
-   struct kvm *kvm, int level);
+  struct kvm *kvm, int irq_source_id, int level);
union {
struct {
unsigned irqchip;
diff --git a/virt/kvm/ioapic.h b/virt/kvm/ioapic.h
index 7080b71..6e461ad 100644
--- a/virt/kvm/ioapic.h
+++ b/virt/kvm/ioapic.h
@@ -41,6 +41,7 @@ struct kvm_ioapic {
u32 irr;
u32 pad;
union kvm_ioapic_redirect_entry redirtbl[IOAPIC_NUM_PINS];
+   unsigned long irq_states[IOAPIC_NUM_PINS];
struct kvm_io_device dev;
struct kvm *kvm;
void (*ack_notifier)(void *opaque, int irq);
diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
index 001663f..11aa702 100644
--- a/virt/kvm/irq_comm.c
+++ b/virt/kvm/irq_comm.c
@@ -31,20 +31,39 @@
 
 #include "ioapic.h"
 
+static inline int kvm_irq_line_state(unsigned long *irq_state,
+int irq_source_id, int level)
+{
+   /* Logical OR for level trig interrupt */
+   if (level)
+   set_bit(irq_source_id, irq_state);
+   else
+   clear_bit(irq_source_id, irq_state);
+
+   return !!(*irq_state);
+}
+
 static int kvm_set_pic_irq(struct kvm_kernel_irq_routing_entry *e,
-                          struct kvm *kvm, int level)
+                          struct kvm *kvm, int irq_source_id, int level)
 {
 #ifdef CONFIG_X86
-       return kvm_pic_set_irq(pic_irqchip(kvm), e->irqchip.pin, level);
+       struct kvm_pic *pic = pic_irqchip(kvm);
+       level = kvm_irq_line_state(&pic->irq_states[e->irqchip.pin],
+                                  irq_source_id, level);
+       return kvm_pic_set_irq(pic, e->irqchip.pin, level);
 #else
        return -1;
 #endif
 }
 
 static int kvm_set_ioapic_irq(struct kvm_kernel_irq_routing_entry *e,
-                             struct kvm *kvm, int level)
+                             struct kvm *kvm, int irq_source_id, int level)
 {
-       return kvm_ioapic_set_irq(kvm->arch.vioapic, e->irqchip.pin, level);
+       struct kvm_ioapic *ioapic = kvm->arch.vioapic;
+       level = kvm_irq_line_state(&ioapic->irq_states[e->irqchip.pin],
+                                  irq_source_id, level);
+
+       return kvm_ioapic_set_irq(ioapic, e->irqchip.pin, level);
 }
 
 inline static bool kvm_is_dm_lowest_prio(struct kvm_lapic_irq *irq)
@@ -96,10 +115,13 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
 }
 
 static int kvm_set_msi(struct kvm_kernel_irq_routing_entry *e,
-                      struct kvm *kvm, int level)
+                      struct kvm *kvm, int irq_source_id, int level)
 {
        struct kvm_lapic_irq irq;
 
+       if (!level)
+               return -1;
+
        trace_kvm_msi_set_irq(e->msi.address_lo, e->msi.data);
 
        irq.dest_id = (e->msi.address_lo &
@@ -125,34 +147,19 @@ static int kvm_set_msi(struct kvm_kernel_irq_routing_entry *e,
 int kvm_set_irq(struct kvm *kvm, int irq_source_id, int irq, int level)
 {
        struct kvm_kernel_irq_routing_entry *e;
-       unsigned long *irq_state, sig_level;
        int ret = -1;
 
        trace_kvm_set_irq(irq, level, irq_source_id);
 
        WARN_ON(!mutex_is_locked(&kvm->irq_lock));
 
-       if (irq < KVM_IOAPIC_NUM_PINS) {
-               irq_state = (unsigned long *)&kvm->arch.irq_states[irq];
-
-               /* Logical OR for level trig interrupt */
-               if (level)
-                       set_bit(irq_source_id, irq_state);
-               else
-                       clear_bit(irq_source_id, irq_state);
-               sig_level = !!(*irq_state);
-       } else if (!level)
-               return ret;
-       else /* Deal with MSI/MSI-X */
-               sig_level = 1;
-
/* Not possible 

[PATCHv3 0/2] vhost: a kernel-level virtio server

2009-08-13 Thread Michael S. Tsirkin
This implements vhost: a kernel-level backend for virtio.
The main motivation for this work is to reduce virtualization
overhead for virtio by removing system calls on data path,
without guest changes. For virtio-net, this removes up to
4 system calls per packet: vm exit for kick, reentry for kick,
iothread wakeup for packet, interrupt injection for packet.

Some more detailed description attached to the patch itself.

The patches are against 2.6.31-rc4.  I'd like them to go into linux-next
and down the road 2.6.32 if possible.  Please comment.

Changelog from v2:
- Comments on RCU usage
- Compat ioctl support
- Make variable static
- Copied more idiomatic english from Rusty

Changes from v1:
- Move use_mm/unuse_mm from fs/aio.c to mm instead of copying.
- Reorder code to avoid need for forward declarations
- Kill a couple of debugging printks

Michael S. Tsirkin (2):
  mm: export use_mm/unuse_mm to modules
  vhost_net: a kernel-level virtio server

 MAINTAINERS |   10 +
 arch/x86/kvm/Kconfig|1 +
 drivers/Makefile|1 +
 drivers/vhost/Kconfig   |   11 +
 drivers/vhost/Makefile  |2 +
 drivers/vhost/net.c |  429 
 drivers/vhost/vhost.c   |  663 +++
 drivers/vhost/vhost.h   |  108 +++
 fs/aio.c|   47 +---
 include/linux/Kbuild|1 +
 include/linux/miscdevice.h  |1 +
 include/linux/mmu_context.h |9 +
 include/linux/vhost.h   |  100 +++
 mm/Makefile |2 +-
 mm/mmu_context.c|   58 
 15 files changed, 1396 insertions(+), 47 deletions(-)
 create mode 100644 drivers/vhost/Kconfig
 create mode 100644 drivers/vhost/Makefile
 create mode 100644 drivers/vhost/net.c
 create mode 100644 drivers/vhost/vhost.c
 create mode 100644 drivers/vhost/vhost.h
 create mode 100644 include/linux/mmu_context.h
 create mode 100644 include/linux/vhost.h
 create mode 100644 mm/mmu_context.c


[PATCHv3 1/2] mm: export use_mm/unuse_mm to modules

2009-08-13 Thread Michael S. Tsirkin
vhost net module wants to do copy to/from user from a kernel thread,
which needs use_mm (like what fs/aio has).  Move that into mm/ and
export to modules.

Acked-by: Andrew Morton a...@linux-foundation.org
Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 fs/aio.c|   47 +--
 include/linux/mmu_context.h |9 ++
 mm/Makefile |2 +-
 mm/mmu_context.c|   58 +++
 4 files changed, 69 insertions(+), 47 deletions(-)
 create mode 100644 include/linux/mmu_context.h
 create mode 100644 mm/mmu_context.c

diff --git a/fs/aio.c b/fs/aio.c
index d065b2c..fc21c23 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -24,6 +24,7 @@
 #include <linux/file.h>
 #include <linux/mm.h>
 #include <linux/mman.h>
+#include <linux/mmu_context.h>
 #include <linux/slab.h>
 #include <linux/timer.h>
 #include <linux/aio.h>
@@ -34,7 +35,6 @@
 
 #include <asm/kmap_types.h>
 #include <asm/uaccess.h>
-#include <asm/mmu_context.h>
 
 #if DEBUG > 1
 #define dprintk         printk
@@ -595,51 +595,6 @@ static struct kioctx *lookup_ioctx(unsigned long ctx_id)
 }
 
 /*
- * use_mm
- * Makes the calling kernel thread take on the specified
- * mm context.
- * Called by the retry thread execute retries within the
- * iocb issuer's mm context, so that copy_from/to_user
- * operations work seamlessly for aio.
- * (Note: this routine is intended to be called only
- * from a kernel thread context)
- */
-static void use_mm(struct mm_struct *mm)
-{
-   struct mm_struct *active_mm;
-   struct task_struct *tsk = current;
-
-   task_lock(tsk);
-   active_mm = tsk->active_mm;
-   atomic_inc(&mm->mm_count);
-   tsk->mm = mm;
-   tsk->active_mm = mm;
-   switch_mm(active_mm, mm, tsk);
-   task_unlock(tsk);
-
-   mmdrop(active_mm);
-}
-
-/*
- * unuse_mm
- * Reverses the effect of use_mm, i.e. releases the
- * specified mm context which was earlier taken on
- * by the calling kernel thread
- * (Note: this routine is intended to be called only
- * from a kernel thread context)
- */
-static void unuse_mm(struct mm_struct *mm)
-{
-   struct task_struct *tsk = current;
-
-   task_lock(tsk);
-   tsk->mm = NULL;
-   /* active_mm is still 'mm' */
-   enter_lazy_tlb(mm, tsk);
-   task_unlock(tsk);
-}
-
-/*
  * Queue up a kiocb to be retried. Assumes that the kiocb
  * has already been marked as kicked, and places it on
  * the retry run list for the corresponding ioctx, if it
diff --git a/include/linux/mmu_context.h b/include/linux/mmu_context.h
new file mode 100644
index 000..70fffeb
--- /dev/null
+++ b/include/linux/mmu_context.h
@@ -0,0 +1,9 @@
+#ifndef _LINUX_MMU_CONTEXT_H
+#define _LINUX_MMU_CONTEXT_H
+
+struct mm_struct;
+
+void use_mm(struct mm_struct *mm);
+void unuse_mm(struct mm_struct *mm);
+
+#endif
diff --git a/mm/Makefile b/mm/Makefile
index 5e0bd64..46c3892 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -11,7 +11,7 @@ obj-y := bootmem.o filemap.o mempool.o oom_kill.o fadvise.o \
   maccess.o page_alloc.o page-writeback.o pdflush.o \
   readahead.o swap.o truncate.o vmscan.o shmem.o \
   prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \
-  page_isolation.o mm_init.o $(mmu-y)
+  page_isolation.o mm_init.o mmu_context.o $(mmu-y)
 obj-y += init-mm.o
 
 obj-$(CONFIG_PROC_PAGE_MONITOR) += pagewalk.o
diff --git a/mm/mmu_context.c b/mm/mmu_context.c
new file mode 100644
index 000..9989c2f
--- /dev/null
+++ b/mm/mmu_context.c
@@ -0,0 +1,58 @@
+/* Copyright (C) 2009 Red Hat, Inc.
+ *
+ * See ../COPYING for licensing terms.
+ */
+
+#include <linux/mm.h>
+#include <linux/mmu_context.h>
+#include <linux/module.h>
+#include <linux/sched.h>
+
+#include <asm/mmu_context.h>
+
+/*
+ * use_mm
+ * Makes the calling kernel thread take on the specified
+ * mm context.
+ * Called by the retry thread execute retries within the
+ * iocb issuer's mm context, so that copy_from/to_user
+ * operations work seamlessly for aio.
+ * (Note: this routine is intended to be called only
+ * from a kernel thread context)
+ */
+void use_mm(struct mm_struct *mm)
+{
+   struct mm_struct *active_mm;
+   struct task_struct *tsk = current;
+
+   task_lock(tsk);
+   active_mm = tsk->active_mm;
+   atomic_inc(&mm->mm_count);
+   tsk->mm = mm;
+   tsk->active_mm = mm;
+   switch_mm(active_mm, mm, tsk);
+   task_unlock(tsk);
+
+   mmdrop(active_mm);
+}
+EXPORT_SYMBOL_GPL(use_mm);
+
+/*
+ * unuse_mm
+ * Reverses the effect of use_mm, i.e. releases the
+ * specified mm context which was earlier taken on
+ * by the calling kernel thread
+ * (Note: this routine is intended to be called only
+ * from a kernel thread context)
+ */
+void unuse_mm(struct mm_struct *mm)
+{
+   struct 

[PATCHv3 2/2] vhost_net: a kernel-level virtio server

2009-08-13 Thread Michael S. Tsirkin
What it is: vhost net is a character device that can be used to reduce
the number of system calls involved in virtio networking.
Existing virtio net code is used in the guest without modification.

There's similarity with vringfd, with some differences and reduced scope
- uses eventfd for signalling
- structures can be moved around in memory at any time (good for migration)
- support memory table and not just an offset (needed for kvm)

Common virtio-related code has been put in a separate file, vhost.c, and
can be made into a separate module if/when more backends appear.  I used
Rusty's lguest.c as the source for developing this part: this supplied
me with witty comments I wouldn't be able to write myself.

What it is not: vhost net is not a bus, and not a generic new system
call. No assumptions are made on how guest performs hypercalls.
Userspace hypervisors are supported as well as kvm.

How it works: Basically, we connect virtio frontend (configured by
userspace) to a backend. The backend could be a network device, or a
tun-like device. In this version I only support raw socket as a backend,
which can be bound to e.g. SR IOV, or to macvlan device.  Backend is
also configured by userspace, including vlan/mac etc.

Status:
This works for me, and I haven't seen any crashes.
I have not run any benchmarks yet. Compared to userspace, I expect to
see improved latency (as I save up to 4 system calls per packet) but not
bandwidth/CPU (as TSO and interrupt mitigation are not supported).

Features that I plan to look at in the future:
- TSO
- interrupt mitigation
- zero copy

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 MAINTAINERS|   10 +
 arch/x86/kvm/Kconfig   |1 +
 drivers/Makefile   |1 +
 drivers/vhost/Kconfig  |   11 +
 drivers/vhost/Makefile |2 +
 drivers/vhost/net.c|  429 
 drivers/vhost/vhost.c  |  663 
 drivers/vhost/vhost.h  |  108 +++
 include/linux/Kbuild   |1 +
 include/linux/miscdevice.h |1 +
 include/linux/vhost.h  |  100 +++
 11 files changed, 1327 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vhost/Kconfig
 create mode 100644 drivers/vhost/Makefile
 create mode 100644 drivers/vhost/net.c
 create mode 100644 drivers/vhost/vhost.c
 create mode 100644 drivers/vhost/vhost.h
 create mode 100644 include/linux/vhost.h

diff --git a/MAINTAINERS b/MAINTAINERS
index ebc2691..eb0c1da 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6312,6 +6312,16 @@ S:   Maintained
 F: Documentation/filesystems/vfat.txt
 F: fs/fat/
 
+VIRTIO HOST (VHOST)
+P: Michael S. Tsirkin
+M: m...@redhat.com
+L: kvm@vger.kernel.org
+L: virtualizat...@lists.osdl.org
+L: net...@vger.kernel.org
+S: Maintained
+F: drivers/vhost/
+F: include/linux/vhost.h
+
 VIA RHINE NETWORK DRIVER
 P: Roger Luethi
 M: r...@hellgate.ch
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index b84e571..94f44d9 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -64,6 +64,7 @@ config KVM_AMD
 
 # OK, it's a little counter-intuitive to do this, but it puts it neatly under
 # the virtualization menu.
+source drivers/vhost/Kconfig
 source drivers/lguest/Kconfig
 source drivers/virtio/Kconfig
 
diff --git a/drivers/Makefile b/drivers/Makefile
index bc4205d..1551ae1 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -105,6 +105,7 @@ obj-$(CONFIG_HID)   += hid/
 obj-$(CONFIG_PPC_PS3)  += ps3/
 obj-$(CONFIG_OF)   += of/
 obj-$(CONFIG_SSB)  += ssb/
+obj-$(CONFIG_VHOST_NET)+= vhost/
 obj-$(CONFIG_VIRTIO)   += virtio/
 obj-$(CONFIG_VLYNQ)+= vlynq/
 obj-$(CONFIG_STAGING)  += staging/
diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
new file mode 100644
index 000..d955406
--- /dev/null
+++ b/drivers/vhost/Kconfig
@@ -0,0 +1,11 @@
+config VHOST_NET
+   tristate "Host kernel accelerator for virtio net"
+   depends on NET && EVENTFD
+   ---help---
+ This kernel module can be loaded in host kernel to accelerate
+ guest networking with virtio_net. Not to be confused with virtio_net
+ module itself which needs to be loaded in guest kernel.
+
+ To compile this driver as a module, choose M here: the module will
+ be called vhost_net.
+
diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
new file mode 100644
index 000..72dd020
--- /dev/null
+++ b/drivers/vhost/Makefile
@@ -0,0 +1,2 @@
+obj-$(CONFIG_VHOST_NET) += vhost_net.o
+vhost_net-y := vhost.o net.o
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
new file mode 100644
index 000..728094b
--- /dev/null
+++ b/drivers/vhost/net.c
@@ -0,0 +1,429 @@
+/* Copyright (C) 2009 Red Hat, Inc.
+ * Author: Michael S. Tsirkin m...@redhat.com
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ *
+ * 

[ kvm-Bugs-2837083 ] Wrong disk size on exported nbd device

2009-08-13 Thread SourceForge.net
Bugs item #2837083, was opened at 2009-08-13 16:27
Message generated for change (Tracker Item Submitted) made by atilaromero
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2837083&group_id=180599

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: qemu
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Atila (atilaromero)
Assigned to: Nobody/Anonymous (nobody)
Summary: Wrong disk size on exported nbd device

Initial Comment:
kvm-nbd uses a blocksize of 1024 bytes. If the imaged disk has an odd number of 
512-byte sectors, the last sector isn't exported.
Solution: change the blocksize to 512 in nbd.c



Re: Page allocation failures in guest

2009-08-13 Thread Pierre Ossman
On Wed, 12 Aug 2009 15:01:52 +0930
Rusty Russell ru...@rustcorp.com.au wrote:

> On Wed, 12 Aug 2009 12:49:51 pm Rusty Russell wrote:
> > On Tue, 11 Aug 2009 04:22:53 pm Avi Kivity wrote:
> > > On 08/11/2009 09:32 AM, Pierre Ossman wrote:
> > > > I doesn't get out of it though, or at least the virtio net driver
> > > > wedges itself.
> > 
> > There's a fixme to retry when this happens, but this is the first report
> > I've received.  I'll check it out.
> 
> Subject: virtio: net refill on out-of-memory
> 
> If we run out of memory, use keventd to fill the buffer.  There's a
> report of this happening: Page allocation failures in guest,
> Message-ID: 20090713115158.0a489...@mjolnir.ossman.eu
> 
> Signed-off-by: Rusty Russell ru...@rustcorp.com.au
> 

Patch applied. Now we wait. :)

-- 
 -- Pierre Ossman

  WARNING: This correspondence is being monitored by the
  Swedish government. Make sure your server uses encryption
  for SMTP traffic and consider using PGP for end-to-end
  encryption.



[PATCH -tip v14 02/12] x86: x86 instruction decoder build-time selftest

2009-08-13 Thread Masami Hiramatsu
Add a user-space selftest of the x86 instruction decoder at kernel build time.
When CONFIG_X86_DECODER_SELFTEST=y, Kbuild builds a test harness for the x86
instruction decoder and runs it after building vmlinux.
The test compares the results of objdump with those of the x86 instruction
decoder code and checks that there are no differences.

Signed-off-by: Masami Hiramatsu mhira...@redhat.com
Signed-off-by: Jim Keniston jkeni...@us.ibm.com
Cc: Ananth N Mavinakayanahalli ana...@in.ibm.com
Cc: Avi Kivity a...@redhat.com
Cc: Andi Kleen a...@linux.intel.com
Cc: Christoph Hellwig h...@infradead.org
Cc: Frank Ch. Eigler f...@redhat.com
Cc: Frederic Weisbecker fweis...@gmail.com
Cc: H. Peter Anvin h...@zytor.com
Cc: Ingo Molnar mi...@elte.hu
Cc: Jason Baron jba...@redhat.com
Cc: K.Prasad pra...@linux.vnet.ibm.com
Cc: Lai Jiangshan la...@cn.fujitsu.com
Cc: Li Zefan l...@cn.fujitsu.com
Cc: Przemysław Pawełczyk przemys...@pawelczyk.it
Cc: Roland McGrath rol...@redhat.com
Cc: Sam Ravnborg s...@ravnborg.org
Cc: Srikar Dronamraju sri...@linux.vnet.ibm.com
Cc: Steven Rostedt rost...@goodmis.org
Cc: Tom Zanussi tzanu...@gmail.com
Cc: Vegard Nossum vegard.nos...@gmail.com
---

 arch/x86/Kconfig.debug|9 +++
 arch/x86/Makefile |3 +
 arch/x86/tools/Makefile   |   15 +
 arch/x86/tools/distill.awk|   42 +++
 arch/x86/tools/test_get_len.c |  113 +
 5 files changed, 182 insertions(+), 0 deletions(-)
 create mode 100644 arch/x86/tools/Makefile
 create mode 100644 arch/x86/tools/distill.awk
 create mode 100644 arch/x86/tools/test_get_len.c

diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
index d105f29..7d0b681 100644
--- a/arch/x86/Kconfig.debug
+++ b/arch/x86/Kconfig.debug
@@ -186,6 +186,15 @@ config X86_DS_SELFTEST
 config HAVE_MMIOTRACE_SUPPORT
def_bool y
 
+config X86_DECODER_SELFTEST
+ bool "x86 instruction decoder selftest"
+ depends on DEBUG_KERNEL
+   ---help---
+Perform x86 instruction decoder selftests at build time.
+This option is useful for checking the sanity of x86 instruction
+decoder code.
+If unsure, say N.
+
 #
 # IO delay types:
 #
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 1f3851a..f79580c 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -154,6 +154,9 @@ all: bzImage
 KBUILD_IMAGE := $(boot)/bzImage
 
 bzImage: vmlinux
+ifeq ($(CONFIG_X86_DECODER_SELFTEST),y)
+   $(Q)$(MAKE) $(build)=arch/x86/tools posttest
+endif
$(Q)$(MAKE) $(build)=$(boot) $(KBUILD_IMAGE)
$(Q)mkdir -p $(objtree)/arch/$(UTS_MACHINE)/boot
$(Q)ln -fsn ../../x86/boot/bzImage $(objtree)/arch/$(UTS_MACHINE)/boot/$@
diff --git a/arch/x86/tools/Makefile b/arch/x86/tools/Makefile
new file mode 100644
index 000..3dd626b
--- /dev/null
+++ b/arch/x86/tools/Makefile
@@ -0,0 +1,15 @@
+PHONY += posttest
+quiet_cmd_posttest = TEST    $@
+  cmd_posttest = $(OBJDUMP) -d $(objtree)/vmlinux | awk -f $(srctree)/arch/x86/tools/distill.awk | $(obj)/test_get_len
+
+posttest: $(obj)/test_get_len vmlinux
+   $(call cmd,posttest)
+
+hostprogs-y := test_get_len
+
+# -I needed for generated C source and C source which in the kernel tree.
+HOSTCFLAGS_test_get_len.o := -Wall -I$(objtree)/arch/x86/lib/ -I$(srctree)/arch/x86/include/ -I$(srctree)/arch/x86/lib/
+
+# Dependancies are also needed.
+$(obj)/test_get_len.o: $(srctree)/arch/x86/lib/insn.c $(srctree)/arch/x86/lib/inat.c $(srctree)/arch/x86/include/asm/inat_types.h $(srctree)/arch/x86/include/asm/inat.h $(srctree)/arch/x86/include/asm/insn.h $(objtree)/arch/x86/lib/inat-tables.c
+
diff --git a/arch/x86/tools/distill.awk b/arch/x86/tools/distill.awk
new file mode 100644
index 000..d433619
--- /dev/null
+++ b/arch/x86/tools/distill.awk
@@ -0,0 +1,42 @@
+#!/bin/awk -f
+# Usage: objdump -d a.out | awk -f distill.awk | ./test_get_len
+# Distills the disassembly as follows:
+# - Removes all lines except the disassembled instructions.
+# - For instructions that exceed 1 line (7 bytes), crams all the hex bytes
+# into a single line.
+# - Remove bad(or prefix only) instructions
+
+BEGIN {
+   prev_addr = ""
+   prev_hex = ""
+   prev_mnemonic = ""
+   bad_expr = "(\\(bad\\)|^rex|^.byte|^rep(z|nz)$|^lock$|^es$|^cs$|^ss$|^ds$|^fs$|^gs$|^data(16|32)$|^addr(16|32|64))"
+   fwait_expr = "^9b "
+   fwait_str="9b\tfwait"
+}
+
+/^ *[0-9a-f]+:/ {
+   if (split($0, field, "\t") < 3) {
+       # This is a continuation of the same insn.
+       prev_hex = prev_hex field[2]
+   } else {
+       # Skip bad instructions
+       if (match(prev_mnemonic, bad_expr))
+           prev_addr = ""
+       # Split fwait from other f* instructions
+       if (match(prev_hex, fwait_expr) && prev_mnemonic != "fwait") {
+           printf "%s\t%s\n", prev_addr, fwait_str
+           sub(fwait_expr, "", prev_hex)
+       }
+   if (prev_addr 

[PATCH -tip v14 01/12] x86: instruction decoder API

2009-08-13 Thread Masami Hiramatsu
Add an x86 instruction decoder to the arch-specific libraries. This decoder
can decode the x86 instructions used in the kernel into prefix, opcode, modrm,
sib, displacement and immediates. It can also report the length of
instructions.

This version introduces instruction attributes for decoding instructions.
The instruction attribute tables are generated from the opcode map file
(x86-opcode-map.txt) by the generator script (gen-insn-attr-x86.awk).

Currently, the opcode maps are based on the opcode maps in Intel(R) 64 and
IA-32 Architectures Software Developer's Manual Vol.2: Appendix A,
and consist of the following two types of opcode tables.

Tables for 1-byte/2-byte/3-byte opcodes, each of which has 256 elements, are
written as below:

 Table: table-name
 Referrer: escaped-name
 opcode: mnemonic|GrpXXX [operand1[,operand2...]] [(extra1)[,(extra2)...] [| 2nd-mnemonic ...]
  (or)
 opcode: escape # escaped-name
 EndTable

Group opcode tables, each of which has 8 elements, are written as below:

 GrpTable: GrpXXX
 reg:  mnemonic [operand1[,operand2...]] [(extra1)[,(extra2)...] [| 2nd-mnemonic ...]
 EndTable

These opcode maps include a few SSE and FP opcodes (for setup), because
those opcodes are used in the kernel.

Signed-off-by: Masami Hiramatsu mhira...@redhat.com
Signed-off-by: Jim Keniston jkeni...@us.ibm.com
Acked-by: H. Peter Anvin h...@zytor.com
Cc: Ananth N Mavinakayanahalli ana...@in.ibm.com
Cc: Avi Kivity a...@redhat.com
Cc: Andi Kleen a...@linux.intel.com
Cc: Christoph Hellwig h...@infradead.org
Cc: Frank Ch. Eigler f...@redhat.com
Cc: Frederic Weisbecker fweis...@gmail.com
Cc: Ingo Molnar mi...@elte.hu
Cc: Jason Baron jba...@redhat.com
Cc: K.Prasad pra...@linux.vnet.ibm.com
Cc: Lai Jiangshan la...@cn.fujitsu.com
Cc: Li Zefan l...@cn.fujitsu.com
Cc: Przemysław Pawełczyk przemys...@pawelczyk.it
Cc: Roland McGrath rol...@redhat.com
Cc: Sam Ravnborg s...@ravnborg.org
Cc: Srikar Dronamraju sri...@linux.vnet.ibm.com
Cc: Steven Rostedt rost...@goodmis.org
Cc: Tom Zanussi tzanu...@gmail.com
Cc: Vegard Nossum vegard.nos...@gmail.com
---

 arch/x86/include/asm/inat.h  |  188 +
 arch/x86/include/asm/inat_types.h|   29 +
 arch/x86/include/asm/insn.h  |  143 +++
 arch/x86/lib/Makefile|   13 +
 arch/x86/lib/inat.c  |   78 
 arch/x86/lib/insn.c  |  464 ++
 arch/x86/lib/x86-opcode-map.txt  |  719 ++
 arch/x86/tools/gen-insn-attr-x86.awk |  314 +++
 8 files changed, 1948 insertions(+), 0 deletions(-)
 create mode 100644 arch/x86/include/asm/inat.h
 create mode 100644 arch/x86/include/asm/inat_types.h
 create mode 100644 arch/x86/include/asm/insn.h
 create mode 100644 arch/x86/lib/inat.c
 create mode 100644 arch/x86/lib/insn.c
 create mode 100644 arch/x86/lib/x86-opcode-map.txt
 create mode 100644 arch/x86/tools/gen-insn-attr-x86.awk

diff --git a/arch/x86/include/asm/inat.h b/arch/x86/include/asm/inat.h
new file mode 100644
index 000..2866fdd
--- /dev/null
+++ b/arch/x86/include/asm/inat.h
@@ -0,0 +1,188 @@
+#ifndef _ASM_X86_INAT_H
+#define _ASM_X86_INAT_H
+/*
+ * x86 instruction attributes
+ *
+ * Written by Masami Hiramatsu mhira...@redhat.com
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ */
+#include <asm/inat_types.h>
+
+/*
+ * Internal bits. Don't use bitmasks directly, because these bits are
+ * unstable. You should use checking functions.
+ */
+
+#define INAT_OPCODE_TABLE_SIZE 256
+#define INAT_GROUP_TABLE_SIZE 8
+
+/* Legacy instruction prefixes */
+#define INAT_PFX_OPNDSZ    1   /* 0x66 */ /* LPFX1 */
+#define INAT_PFX_REPNE     2   /* 0xF2 */ /* LPFX2 */
+#define INAT_PFX_REPE      3   /* 0xF3 */ /* LPFX3 */
+#define INAT_PFX_LOCK      4   /* 0xF0 */
+#define INAT_PFX_CS        5   /* 0x2E */
+#define INAT_PFX_DS        6   /* 0x3E */
+#define INAT_PFX_ES        7   /* 0x26 */
+#define INAT_PFX_FS        8   /* 0x64 */
+#define INAT_PFX_GS        9   /* 0x65 */
+#define INAT_PFX_SS        10  /* 0x36 */
+#define INAT_PFX_ADDRSZ    11  /* 0x67 */
+
+#define INAT_LPREFIX_MAX   3
+
+/* Immediate size */
+#define INAT_IMM_BYTE  1
+#define INAT_IMM_WORD  2
+#define INAT_IMM_DWORD 3
+#define INAT_IMM_QWORD 4
+#define INAT_IMM_PTR   5
+#define INAT_IMM_VWORD32   6

[PATCH -tip v14 00/12] tracing: kprobe-based event tracer and x86 instruction decoder

2009-08-13 Thread Masami Hiramatsu
Hi,

Here are the patches of the kprobe-based event tracer for x86, version 14,
which allows you to probe various kernel events through the ftrace interface.
The tracer supports per-probe filtering, which allows you to set filters
on each probe and shows the format of each probe.

This version includes the following fixes.
 - Define remove_subsystem_dir() always (patch 6/12)
 - Modify syscall_tracer because of ftrace_event_call change (patch 6/12)
 - Support 'sa' argument for stack address (patch 8/12)
 - Use call-data instead of container_of() macro. (patch 8/12)
 - Assign new event id for each event. (patch 11/12)

Lai, this version still cannot be applied on top of your patch ('use defined
fields to print formats'), since I couldn't update your patch on
the latest -tip tree.

This patchset also includes an x86(-64) instruction decoder which
supports non-SSE/FP opcodes and includes an x86 opcode map. The decoder
is used for finding the instruction boundaries when inserting new
kprobes. I think it will be possible to share this opcode map
with KVM's decoder.
The decoder is tested when building the kernel: the test compares the
results of objdump and the decoder right after building vmlinux.
You can enable that test with CONFIG_X86_DECODER_SELFTEST=y.

This series can be applied on the latest linux-2.6.31-rc5-tip.

This supports only x86(-32/-64) (but porting it to other architectures
just needs kprobes/kretprobes and register and stack access APIs).

I also made two tools for this tracer.
 - A kprobe stress test script which tests kprobes on all kernel symbols to
   find symbols which should be blacklisted.
 - A C expression to kprobes event format converter which helps you to define
   kprobes events by C source code line number or function name, and local
   variable name.

Enhancement ideas to be added after merging:
- .init function tracing support.
- Support primitive types (long, ulong, int, uint, etc.) for args.


Kprobe-based Event Tracer
=========================

Overview
--------
This tracer is similar to the events tracer, which is based on the Tracepoint
infrastructure. Instead of Tracepoints, this tracer is based on kprobes (kprobe
and kretprobe). It can probe anywhere kprobes can probe (that is, all
function bodies except those of __kprobes functions).

Unlike the function tracer, this tracer can probe instructions inside
kernel functions. It allows you to check which instruction has been executed.

Unlike the Tracepoint-based events tracer, this tracer can add new probe points
on the fly.

Similar to the events tracer, this tracer doesn't need to be activated via
current_tracer. Instead, just set probe points via
/sys/kernel/debug/tracing/kprobe_events. You can also set filters on each
probe event via /sys/kernel/debug/tracing/events/kprobes/EVENT/filter.


Synopsis of kprobe_events
-------------------------
  p[:EVENT] SYMBOL[+offs|-offs]|MEMADDR [FETCHARGS] : Set a probe
  r[:EVENT] SYMBOL[+0] [FETCHARGS]                  : Set a return probe

 EVENT               : Event name. If omitted, the event name is generated
                       based on SYMBOL+offs or MEMADDR.
 SYMBOL[+offs|-offs] : Symbol+offset where the probe is inserted.
 MEMADDR             : Address where the probe is inserted.

 FETCHARGS           : Arguments. Each probe can have up to 128 args.
  %REG               : Fetch register REG
  sN                 : Fetch Nth entry of stack (N >= 0)
  sa                 : Fetch stack address.
  @ADDR              : Fetch memory at ADDR (ADDR should be in kernel)
  @SYM[+|-offs]      : Fetch memory at SYM +|- offs (SYM should be a data symbol)
  aN                 : Fetch function argument. (N >= 0)(*)
  rv                 : Fetch return value.(**)
  ra                 : Fetch return address.(**)
  +|-offs(FETCHARG)  : Fetch memory at FETCHARG +|- offs address.(***)

  (*) aN may not be correct on asmlinkage functions or in the middle of
      a function body.
  (**) only for return probes.
  (***) this is useful for fetching a field of a data structure.


Per-Probe Event Filtering
-------------------------
 The per-probe event filtering feature allows you to set a different filter on
each probe and shows which arguments will appear in the trace buffer. If an event
name is specified right after 'p:' or 'r:' in kprobe_events, the tracer adds
an event under tracing/events/kprobes/EVENT. In that directory you can see
'id', 'enabled', 'format' and 'filter'.

enabled:
  You can enable/disable the probe by writing 1 or 0 to it.

format:
  It shows the format of this probe event. It also shows the aliases of the
 arguments which you specified in kprobe_events.

filter:
  You can write filtering rules for this event. You can use both alias
 names and field names when describing filters.


Event Profiling
---------------
 You can check the total number of probe hits and probe miss-hits via
/sys/kernel/debug/tracing/kprobe_profile.
 The first column is event name, the second is the number of probe hits,
the third is the number of probe miss-hits.


Usage examples
--------------
To add a probe as a new event, write 

[PATCH -tip v14 04/12] kprobes: cleanup fix_riprel() using insn decoder on x86

2009-08-13 Thread Masami Hiramatsu
Clean up fix_riprel() in arch/x86/kernel/kprobes.c by using the x86 instruction
decoder.

Signed-off-by: Masami Hiramatsu mhira...@redhat.com
Cc: Ananth N Mavinakayanahalli ana...@in.ibm.com
Cc: Avi Kivity a...@redhat.com
Cc: Andi Kleen a...@linux.intel.com
Cc: Christoph Hellwig h...@infradead.org
Cc: Frank Ch. Eigler f...@redhat.com
Cc: Frederic Weisbecker fweis...@gmail.com
Cc: H. Peter Anvin h...@zytor.com
Cc: Ingo Molnar mi...@elte.hu
Cc: Jason Baron jba...@redhat.com
Cc: Jim Keniston jkeni...@us.ibm.com
Cc: K.Prasad pra...@linux.vnet.ibm.com
Cc: Lai Jiangshan la...@cn.fujitsu.com
Cc: Li Zefan l...@cn.fujitsu.com
Cc: Przemysław Pawełczyk przemys...@pawelczyk.it
Cc: Roland McGrath rol...@redhat.com
Cc: Sam Ravnborg s...@ravnborg.org
Cc: Srikar Dronamraju sri...@linux.vnet.ibm.com
Cc: Steven Rostedt rost...@goodmis.org
Cc: Tom Zanussi tzanu...@gmail.com
Cc: Vegard Nossum vegard.nos...@gmail.com
---

 arch/x86/kernel/kprobes.c |  128 -
 1 files changed, 23 insertions(+), 105 deletions(-)

diff --git a/arch/x86/kernel/kprobes.c b/arch/x86/kernel/kprobes.c
index 80d493f..98f48d0 100644
--- a/arch/x86/kernel/kprobes.c
+++ b/arch/x86/kernel/kprobes.c
@@ -109,50 +109,6 @@ static const u32 twobyte_is_boostable[256 / 32] = {
/*  --- */
/*  0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  */
 };
-static const u32 onebyte_has_modrm[256 / 32] = {
-   /*  0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  */
-   /*  --- */
-   W(0x00, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0) | /* 00 */
-   W(0x10, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0) , /* 10 */
-   W(0x20, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0) | /* 20 */
-   W(0x30, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0) , /* 30 */
-   W(0x40, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) | /* 40 */
-   W(0x50, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) , /* 50 */
-   W(0x60, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0) | /* 60 */
-   W(0x70, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) , /* 70 */
-   W(0x80, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 80 */
-   W(0x90, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) , /* 90 */
-   W(0xa0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) | /* a0 */
-   W(0xb0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) , /* b0 */
-   W(0xc0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0) | /* c0 */
-   W(0xd0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1) , /* d0 */
-   W(0xe0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) | /* e0 */
-   W(0xf0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1)   /* f0 */
-   /*  --- */
-   /*  0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  */
-};
-static const u32 twobyte_has_modrm[256 / 32] = {
-   /*  0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  */
-   /*  --- */
-   W(0x00, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1) | /* 0f */
-   W(0x10, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0) , /* 1f */
-   W(0x20, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1) | /* 2f */
-   W(0x30, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) , /* 3f */
-   W(0x40, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 4f */
-   W(0x50, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 5f */
-   W(0x60, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 6f */
-   W(0x70, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1) , /* 7f */
-   W(0x80, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) | /* 8f */
-   W(0x90, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 9f */
-   W(0xa0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1) | /* af */
-   W(0xb0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1) , /* bf */
-   W(0xc0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0) | /* cf */
-   W(0xd0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* df */
-   W(0xe0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* ef */
-   W(0xf0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0)   /* ff */
-   /*  --- */
-   /*  0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  */
-};
 #undef W
 
 struct kretprobe_blackpoint kretprobe_blacklist[] = {
@@ -345,68 +301,30 @@ static int __kprobes is_IF_modifier(kprobe_opcode_t *insn)
 static void __kprobes fix_riprel(struct kprobe *p)
 {
 #ifdef CONFIG_X86_64
-   u8 *insn = p->ainsn.insn;
-   s64 disp;
-   int need_modrm;
-
-   /* Skip legacy instruction prefixes.  */
-   while (1) {
-   switch (*insn) {
-   case 0x66:
-   case 0x67:

[PATCH -tip v14 06/12] tracing: ftrace dynamic ftrace_event_call support

2009-08-13 Thread Masami Hiramatsu
Add dynamic ftrace_event_call support to ftrace. Trace engines can add new
ftrace_event_calls to ftrace on the fly. Each operator function of a call
takes a ftrace_event_call data structure as an argument, because these
functions may be shared among several ftrace_event_calls.

Changes from v13:
 - Define remove_subsystem_dir() always (revert a2ca5e03), because
   trace_remove_event_call() uses it.
 - Modify syscall tracer because of ftrace_event_call change.

Signed-off-by: Masami Hiramatsu mhira...@redhat.com
Acked-by: Frederic Weisbecker fweis...@gmail.com
Cc: Ananth N Mavinakayanahalli ana...@in.ibm.com
Cc: Avi Kivity a...@redhat.com
Cc: Andi Kleen a...@linux.intel.com
Cc: Christoph Hellwig h...@infradead.org
Cc: Frank Ch. Eigler f...@redhat.com
Cc: H. Peter Anvin h...@zytor.com
Cc: Ingo Molnar mi...@elte.hu
Cc: Jason Baron jba...@redhat.com
Cc: Jim Keniston jkeni...@us.ibm.com
Cc: K.Prasad pra...@linux.vnet.ibm.com
Cc: Lai Jiangshan la...@cn.fujitsu.com
Cc: Li Zefan l...@cn.fujitsu.com
Cc: Przemysław Pawełczyk przemys...@pawelczyk.it
Cc: Roland McGrath rol...@redhat.com
Cc: Sam Ravnborg s...@ravnborg.org
Cc: Srikar Dronamraju sri...@linux.vnet.ibm.com
Cc: Steven Rostedt rost...@goodmis.org
Cc: Tom Zanussi tzanu...@gmail.com
Cc: Vegard Nossum vegard.nos...@gmail.com
---

 include/linux/ftrace_event.h  |   14 +++--
 include/linux/syscalls.h  |4 +
 include/trace/ftrace.h|   19 +++
 include/trace/syscall.h   |8 +--
 kernel/trace/trace_events.c   |  119 +
 kernel/trace/trace_export.c   |   23 
 kernel/trace/trace_syscalls.c |   16 +++---
 7 files changed, 125 insertions(+), 78 deletions(-)

diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
index 189806b..9af68ce 100644
--- a/include/linux/ftrace_event.h
+++ b/include/linux/ftrace_event.h
@@ -112,13 +112,13 @@ struct ftrace_event_call {
struct dentry   *dir;
struct trace_event  *event;
int enabled;
-   int (*regfunc)(void *);
-   void(*unregfunc)(void *);
+   int (*regfunc)(struct ftrace_event_call *);
+   void(*unregfunc)(struct ftrace_event_call *);
int id;
-   int (*raw_init)(void);
-   int (*show_format)(struct ftrace_event_call *call,
-  struct trace_seq *s);
-   int (*define_fields)(void);
+   int (*raw_init)(struct ftrace_event_call *);
+   int (*show_format)(struct ftrace_event_call *,
+  struct trace_seq *);
+   int (*define_fields)(struct ftrace_event_call *);
struct list_headfields;
int filter_active;
struct event_filter *filter;
@@ -142,6 +142,8 @@ extern int filter_current_check_discard(struct ftrace_event_call *call,
 
 extern int trace_define_field(struct ftrace_event_call *call, char *type,
  char *name, int offset, int size, int is_signed);
+extern int trace_add_event_call(struct ftrace_event_call *call);
+extern void trace_remove_event_call(struct ftrace_event_call *call);
 
 #define is_signed_type(type)   (((type)(-1)) < 0)
 
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 87d06c1..be59d22 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -165,7 +165,7 @@ static void prof_sysexit_disable_##sname(struct 
ftrace_event_call *event_call) \
struct trace_event enter_syscall_print_##sname = {  \
.trace  = print_syscall_enter,  \
};  \
-   static int init_enter_##sname(void) \
+   static int init_enter_##sname(struct ftrace_event_call *call)   \
{   \
int num, id;\
num = syscall_name_to_nr("sys"#sname);  \
@@ -201,7 +201,7 @@ static void prof_sysexit_disable_##sname(struct 
ftrace_event_call *event_call) \
struct trace_event exit_syscall_print_##sname = {   \
.trace  = print_syscall_exit,   \
};  \
-   static int init_exit_##sname(void)  \
+   static int init_exit_##sname(struct ftrace_event_call *call)\
{   \
int num, id;\
num = syscall_name_to_nr("sys"#sname);  \
diff --git 

[PATCH -tip v14 07/12] tracing: Introduce TRACE_FIELD_ZERO() macro

2009-08-13 Thread Masami Hiramatsu
Use TRACE_FIELD_ZERO(type, item) instead of TRACE_FIELD_ZERO_CHAR(item), so
that zero-length fields are no longer limited to char.
This also includes a fix for the TRACE_ZERO_CHAR() macro.

Signed-off-by: Masami Hiramatsu mhira...@redhat.com
Cc: Ananth N Mavinakayanahalli ana...@in.ibm.com
Cc: Avi Kivity a...@redhat.com
Cc: Andi Kleen a...@linux.intel.com
Cc: Christoph Hellwig h...@infradead.org
Cc: Frank Ch. Eigler f...@redhat.com
Cc: Frederic Weisbecker fweis...@gmail.com
Cc: H. Peter Anvin h...@zytor.com
Cc: Ingo Molnar mi...@elte.hu
Cc: Jason Baron jba...@redhat.com
Cc: Jim Keniston jkeni...@us.ibm.com
Cc: K.Prasad pra...@linux.vnet.ibm.com
Cc: Lai Jiangshan la...@cn.fujitsu.com
Cc: Li Zefan l...@cn.fujitsu.com
Cc: Przemysław Pawełczyk przemys...@pawelczyk.it
Cc: Roland McGrath rol...@redhat.com
Cc: Sam Ravnborg s...@ravnborg.org
Cc: Srikar Dronamraju sri...@linux.vnet.ibm.com
Cc: Steven Rostedt rost...@goodmis.org
Cc: Tom Zanussi tzanu...@gmail.com
Cc: Vegard Nossum vegard.nos...@gmail.com
---

 kernel/trace/trace_event_types.h |4 ++--
 kernel/trace/trace_export.c  |   16 
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/kernel/trace/trace_event_types.h b/kernel/trace/trace_event_types.h
index 6db005e..e74f090 100644
--- a/kernel/trace/trace_event_types.h
+++ b/kernel/trace/trace_event_types.h
@@ -109,7 +109,7 @@ TRACE_EVENT_FORMAT(bprint, TRACE_BPRINT, bprint_entry, 
ignore,
TRACE_STRUCT(
TRACE_FIELD(unsigned long, ip, ip)
TRACE_FIELD(char *, fmt, fmt)
-   TRACE_FIELD_ZERO_CHAR(buf)
+   TRACE_FIELD_ZERO(char, buf)
),
TP_RAW_FMT("%08lx (%d) fmt:%p %s")
 );
@@ -117,7 +117,7 @@ TRACE_EVENT_FORMAT(bprint, TRACE_BPRINT, bprint_entry, 
ignore,
 TRACE_EVENT_FORMAT(print, TRACE_PRINT, print_entry, ignore,
TRACE_STRUCT(
TRACE_FIELD(unsigned long, ip, ip)
-   TRACE_FIELD_ZERO_CHAR(buf)
+   TRACE_FIELD_ZERO(char, buf)
),
TP_RAW_FMT("%08lx (%d) fmt:%p %s")
 );
diff --git a/kernel/trace/trace_export.c b/kernel/trace/trace_export.c
index 71c8d7f..b0ac92c 100644
--- a/kernel/trace/trace_export.c
+++ b/kernel/trace/trace_export.c
@@ -42,9 +42,9 @@ extern void __bad_type_size(void);
if (!ret)   \
return 0;
 
-#undef TRACE_FIELD_ZERO_CHAR
-#define TRACE_FIELD_ZERO_CHAR(item)\
-   ret = trace_seq_printf(s, "\tfield:char " #item ";\t"   \
+#undef TRACE_FIELD_ZERO
+#define TRACE_FIELD_ZERO(type, item)   \
+   ret = trace_seq_printf(s, "\tfield:" #type " " #item ";\t"  \
   "offset:%u;\tsize:0;\n", \
   (unsigned int)offsetof(typeof(field), item)); \
if (!ret)   \
@@ -92,9 +92,6 @@ ftrace_format_##call(struct ftrace_event_call *unused,
\
 
 #include "trace_event_types.h"
 
-#undef TRACE_ZERO_CHAR
-#define TRACE_ZERO_CHAR(arg)
-
 #undef TRACE_FIELD
 #define TRACE_FIELD(type, item, assign)\
entry->item = assign;
@@ -107,6 +104,9 @@ ftrace_format_##call(struct ftrace_event_call *unused,  
\
 #define TRACE_FIELD_SIGN(type, item, assign, is_signed)\
TRACE_FIELD(type, item, assign)
 
+#undef TRACE_FIELD_ZERO
+#define TRACE_FIELD_ZERO(type, item)
+
 #undef TP_CMD
 #define TP_CMD(cmd...) cmd
 
@@ -178,8 +178,8 @@ __attribute__((section("_ftrace_events"))) event_##call = { 
\
if (ret)\
return ret;
 
-#undef TRACE_FIELD_ZERO_CHAR
-#define TRACE_FIELD_ZERO_CHAR(item)
+#undef TRACE_FIELD_ZERO
+#define TRACE_FIELD_ZERO(type, item)
 
 #undef TRACE_EVENT_FORMAT
 #define TRACE_EVENT_FORMAT(call, proto, args, fmt, tstruct, tpfmt) \


-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America), Inc.
Software Solutions Division

e-mail: mhira...@redhat.com


[PATCH -tip v14 10/12] tracing: Generate names for each kprobe event automatically

2009-08-13 Thread Masami Hiramatsu
Generate names for each kprobe event based on the probe point,
and remove generic k*probe event types because there is no user
of those types.

Signed-off-by: Masami Hiramatsu mhira...@redhat.com
Cc: Ananth N Mavinakayanahalli ana...@in.ibm.com
Cc: Avi Kivity a...@redhat.com
Cc: Andi Kleen a...@linux.intel.com
Cc: Christoph Hellwig h...@infradead.org
Cc: Frank Ch. Eigler f...@redhat.com
Cc: Frederic Weisbecker fweis...@gmail.com
Cc: H. Peter Anvin h...@zytor.com
Cc: Ingo Molnar mi...@elte.hu
Cc: Jason Baron jba...@redhat.com
Cc: Jim Keniston jkeni...@us.ibm.com
Cc: K.Prasad pra...@linux.vnet.ibm.com
Cc: Lai Jiangshan la...@cn.fujitsu.com
Cc: Li Zefan l...@cn.fujitsu.com
Cc: Przemysław Pawełczyk przemys...@pawelczyk.it
Cc: Roland McGrath rol...@redhat.com
Cc: Sam Ravnborg s...@ravnborg.org
Cc: Srikar Dronamraju sri...@linux.vnet.ibm.com
Cc: Steven Rostedt rost...@goodmis.org
Cc: Tom Zanussi tzanu...@gmail.com
Cc: Vegard Nossum vegard.nos...@gmail.com
---

 Documentation/trace/kprobetrace.txt |3 +-
 kernel/trace/trace_event_types.h|   18 --
 kernel/trace/trace_kprobe.c |   64 ++-
 3 files changed, 35 insertions(+), 50 deletions(-)

diff --git a/Documentation/trace/kprobetrace.txt 
b/Documentation/trace/kprobetrace.txt
index c9c09b4..5e59e85 100644
--- a/Documentation/trace/kprobetrace.txt
+++ b/Documentation/trace/kprobetrace.txt
@@ -28,7 +28,8 @@ Synopsis of kprobe_events
   p[:EVENT] SYMBOL[+offs|-offs]|MEMADDR [FETCHARGS]: Set a probe
   r[:EVENT] SYMBOL[+0] [FETCHARGS] : Set a return probe
 
- EVENT : Event name.
+ EVENT : Event name. If omitted, the event name is generated
+ based on SYMBOL+offs or MEMADDR.
  SYMBOL[+offs|-offs]   : Symbol+offset where the probe is inserted.
  MEMADDR   : Address where the probe is inserted.
 
diff --git a/kernel/trace/trace_event_types.h b/kernel/trace/trace_event_types.h
index 186b598..e74f090 100644
--- a/kernel/trace/trace_event_types.h
+++ b/kernel/trace/trace_event_types.h
@@ -175,22 +175,4 @@ TRACE_EVENT_FORMAT(kmem_free, TRACE_KMEM_FREE, 
kmemtrace_free_entry, ignore,
TP_RAW_FMT("type:%u call_site:%lx ptr:%p")
 );
 
-TRACE_EVENT_FORMAT(kprobe, TRACE_KPROBE, kprobe_trace_entry, ignore,
-   TRACE_STRUCT(
-   TRACE_FIELD(unsigned long, ip, ip)
-   TRACE_FIELD(int, nargs, nargs)
-   TRACE_FIELD_ZERO(unsigned long, args)
-   ),
-   TP_RAW_FMT("%08lx: args:0x%lx ...")
-);
-
-TRACE_EVENT_FORMAT(kretprobe, TRACE_KRETPROBE, kretprobe_trace_entry, ignore,
-   TRACE_STRUCT(
-   TRACE_FIELD(unsigned long, func, func)
-   TRACE_FIELD(unsigned long, ret_ip, ret_ip)
-   TRACE_FIELD(int, nargs, nargs)
-   TRACE_FIELD_ZERO(unsigned long, args)
-   ),
-   TP_RAW_FMT("%08lx <- %08lx: args:0x%lx ...")
-);
 #undef TRACE_SYSTEM
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 4704e40..ec137ed 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -34,6 +34,7 @@
 
 #define MAX_TRACE_ARGS 128
 #define MAX_ARGSTR_LEN 63
+#define MAX_EVENT_NAME_LEN 64
 
 /* currently, trace_kprobe only supports X86. */
 
@@ -280,11 +281,11 @@ static struct trace_probe *alloc_trace_probe(const char 
*symbol,
if (!tp->symbol)
goto error;
}
-   if (event) {
-   tp->call.name = kstrdup(event, GFP_KERNEL);
-   if (!tp->call.name)
-   goto error;
-   }
+   if (!event)
+   goto error;
+   tp->call.name = kstrdup(event, GFP_KERNEL);
+   if (!tp->call.name)
+   goto error;
 
INIT_LIST_HEAD(&tp->list);
return tp;
@@ -314,7 +315,7 @@ static struct trace_probe *find_probe_event(const char 
*event)
struct trace_probe *tp;
 
list_for_each_entry(tp, &probe_list, list)
-   if (tp->call.name && !strcmp(tp->call.name, event))
+   if (!strcmp(tp->call.name, event))
return tp;
return NULL;
 }
@@ -330,8 +331,7 @@ static void __unregister_trace_probe(struct trace_probe *tp)
 /* Unregister a trace_probe and probe_event: call with locking probe_lock */
 static void unregister_trace_probe(struct trace_probe *tp)
 {
-   if (tp->call.name)
-   unregister_probe_event(tp);
+   unregister_probe_event(tp);
__unregister_trace_probe(tp);
list_del(&tp->list);
 }
@@ -360,18 +360,16 @@ static int register_trace_probe(struct trace_probe *tp)
goto end;
}
/* register as an event */
-   if (tp->call.name) {
-   old_tp = find_probe_event(tp->call.name);
-   if (old_tp) {
-   /* delete old event */
-   unregister_trace_probe(old_tp);
-   free_trace_probe(old_tp);
-   }
-  

[PATCH -tip v14 12/12] tracing: Add kprobes event profiling interface

2009-08-13 Thread Masami Hiramatsu
Add a profiling interface for each kprobes event. This interface reports
how many times each probe was hit or missed.
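
As a rough sketch of how the three-column output (event name, hits,
miss-hits) could be consumed from a shell, assuming the column layout used
by this patch: the helper names summarize_profile and sample_profile and
the sample numbers are invented for illustration. On a real system the
input would come from /sys/kernel/debug/tracing/kprobe_profile.

```shell
# Summarize kprobe_profile-style input: "EVENT NHIT NMISSED" per line.
# summarize_profile is a hypothetical helper, not part of the patch.
summarize_profile () {
  awk '{ hit += $2; miss += $3; if ($3 > 0) noisy = noisy " " $1 }
       END { printf "hits=%d missed=%d events_with_misses=%s\n",
                    hit, miss, substr(noisy, 2) }'
}

# Invented sample data, laid out like the patch's three columns.
sample_profile () {
  printf '  %-44s %15d %15d\n' myprobe 12 0
  printf '  %-44s %15d %15d\n' myretprobe 7 3
}

sample_profile | summarize_profile
```

An event with a non-zero third column had probe hits that were missed, so
its hit count alone understates how often the probe point was reached.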

Signed-off-by: Masami Hiramatsu mhira...@redhat.com
Cc: Ananth N Mavinakayanahalli ana...@in.ibm.com
Cc: Avi Kivity a...@redhat.com
Cc: Andi Kleen a...@linux.intel.com
Cc: Christoph Hellwig h...@infradead.org
Cc: Frank Ch. Eigler f...@redhat.com
Cc: Frederic Weisbecker fweis...@gmail.com
Cc: H. Peter Anvin h...@zytor.com
Cc: Ingo Molnar mi...@elte.hu
Cc: Jason Baron jba...@redhat.com
Cc: Jim Keniston jkeni...@us.ibm.com
Cc: K.Prasad pra...@linux.vnet.ibm.com
Cc: Lai Jiangshan la...@cn.fujitsu.com
Cc: Li Zefan l...@cn.fujitsu.com
Cc: Przemysław Pawełczyk przemys...@pawelczyk.it
Cc: Roland McGrath rol...@redhat.com
Cc: Sam Ravnborg s...@ravnborg.org
Cc: Srikar Dronamraju sri...@linux.vnet.ibm.com
Cc: Steven Rostedt rost...@goodmis.org
Cc: Tom Zanussi tzanu...@gmail.com
Cc: Vegard Nossum vegard.nos...@gmail.com
---

 Documentation/trace/kprobetrace.txt |8 +++
 kernel/trace/trace_kprobe.c |   43 +++
 2 files changed, 51 insertions(+), 0 deletions(-)

diff --git a/Documentation/trace/kprobetrace.txt 
b/Documentation/trace/kprobetrace.txt
index 5e59e85..3de7517 100644
--- a/Documentation/trace/kprobetrace.txt
+++ b/Documentation/trace/kprobetrace.txt
@@ -70,6 +70,14 @@ filter:
  names and field names for describing filters.
 
 
+Event Profiling
+---
+ You can check the total number of probe hits and probe miss-hits via
+/sys/kernel/debug/tracing/kprobe_profile.
+ The first column is event name, the second is the number of probe hits,
+the third is the number of probe miss-hits.
+
+
 Usage examples
 --
 To add a probe as a new event, write a new definition to kprobe_events
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 0e8498e..0f5d0a6 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -184,6 +184,7 @@ struct trace_probe {
struct kprobe   kp;
struct kretproberp;
};
+   unsigned long   nhit;
const char  *symbol;/* symbol name */
struct ftrace_event_callcall;
struct trace_event  event;
@@ -781,6 +782,37 @@ static const struct file_operations kprobe_events_ops = {
.write  = probes_write,
 };
 
+/* Probes profiling interfaces */
+static int probes_profile_seq_show(struct seq_file *m, void *v)
+{
+   struct trace_probe *tp = v;
+
+   seq_printf(m, "  %-44s %15lu %15lu\n", tp->call.name, tp->nhit,
+  probe_is_return(tp) ? tp->rp.kp.nmissed : tp->kp.nmissed);
+
+   return 0;
+}
+
+static const struct seq_operations profile_seq_op = {
+   .start  = probes_seq_start,
+   .next   = probes_seq_next,
+   .stop   = probes_seq_stop,
+   .show   = probes_profile_seq_show
+};
+
+static int profile_open(struct inode *inode, struct file *file)
+{
+   return seq_open(file, &profile_seq_op);
+}
+
+static const struct file_operations kprobe_profile_ops = {
+   .owner  = THIS_MODULE,
+   .open   = profile_open,
+   .read   = seq_read,
+   .llseek = seq_lseek,
+   .release= seq_release,
+};
+
 /* Kprobe handler */
 static __kprobes int kprobe_trace_func(struct kprobe *kp, struct pt_regs *regs)
 {
@@ -791,6 +823,8 @@ static __kprobes int kprobe_trace_func(struct kprobe *kp, 
struct pt_regs *regs)
unsigned long irq_flags;
struct ftrace_event_call *call = &tp->call;
 
+   tp->nhit++;
+
local_save_flags(irq_flags);
pc = preempt_count();
 
@@ -1143,9 +1177,18 @@ static __init int init_kprobe_trace(void)
entry = debugfs_create_file("kprobe_events", 0644, d_tracer,
NULL, &kprobe_events_ops);
 
+   /* Event list interface */
if (!entry)
pr_warning("Could not create debugfs "
   "'kprobe_events' entry\n");
+
+   /* Profile interface */
+   entry = debugfs_create_file("kprobe_profile", 0444, d_tracer,
+   NULL, &kprobe_profile_ops);
+
+   if (!entry)
+   pr_warning("Could not create debugfs "
+  "'kprobe_profile' entry\n");
return 0;
 }
 fs_initcall(init_kprobe_trace);


-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America), Inc.
Software Solutions Division

e-mail: mhira...@redhat.com


[PATCH -tip v14 08/12] tracing: add kprobe-based event tracer

2009-08-13 Thread Masami Hiramatsu
Add a kprobes-based event tracer on top of ftrace.

This tracer is similar to the events tracer, which is based on the
Tracepoint infrastructure. Instead of Tracepoint, this tracer is based on
kprobes (kprobe and kretprobe), so it can probe anywhere kprobes can:
every function body except those of __kprobes functions.

Like the events tracer, this tracer does not need to be activated via
current_tracer. Instead, just set probe points via
/sys/kernel/debug/tracing/kprobe_events. You can also set a filter on each
probe event via /sys/kernel/debug/tracing/events/kprobes/EVENT/filter.

This tracer supports the following fetch arguments for each probe:

  %REG              : Fetch register REG
  sN                : Fetch Nth entry of stack (N >= 0)
  sa                : Fetch stack address.
  @ADDR             : Fetch memory at ADDR (ADDR should be in kernel)
  @SYM[+|-offs]     : Fetch memory at SYM +|- offs (SYM should be a data symbol)
  aN                : Fetch function argument. (N >= 0)
  rv                : Fetch return value.
  ra                : Fetch return address.
  +|-offs(FETCHARG) : Fetch memory at FETCHARG +|- offs address.

See Documentation/trace/kprobetrace.txt for details.
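
For illustration only, here is a small shell sketch that composes
definition lines in the syntax of this series,
p[:EVENT] SYMBOL[+offs|-offs]|MEMADDR [FETCHARGS] and
r[:EVENT] SYMBOL[+0] [FETCHARGS]. The helper name probe_def is hypothetical,
and do_sys_open is used merely as an example symbol.

```shell
# Compose a kprobe_events definition line.
#   $1 = p (probe) or r (return probe)
#   $2 = event name, may be empty
#   $3 = SYMBOL[+offs] or MEMADDR
#   remaining arguments = fetch arguments
# probe_def is a hypothetical helper, not part of the patch.
probe_def () {
  kind=$1 event=$2 place=$3
  shift 3
  def=$kind
  [ -n "$event" ] && def="$def:$event"
  def="$def $place"
  [ $# -gt 0 ] && def="$def $*"
  echo "$def"
}

probe_def p myprobe do_sys_open a0 a1 a2 a3
probe_def r myretprobe do_sys_open rv ra

# Registering a probe needs root and a kernel with this tracer, e.g.:
#   probe_def p myprobe do_sys_open a0 a1 \
#     > /sys/kernel/debug/tracing/kprobe_events
```

Building the line in a helper keeps the fetch arguments in one place, which
makes it easier to reuse the same list for a probe and its return probe.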

Changes from v13:
 - Support 'sa' for stack address.
 - Use call-data instead of container_of() macro.

Signed-off-by: Masami Hiramatsu mhira...@redhat.com
Acked-by: Ananth N Mavinakayanahalli ana...@in.ibm.com
Cc: Avi Kivity a...@redhat.com
Cc: Andi Kleen a...@linux.intel.com
Cc: Christoph Hellwig h...@infradead.org
Cc: Frank Ch. Eigler f...@redhat.com
Cc: Frederic Weisbecker fweis...@gmail.com
Cc: H. Peter Anvin h...@zytor.com
Cc: Ingo Molnar mi...@elte.hu
Cc: Jason Baron jba...@redhat.com
Cc: Jim Keniston jkeni...@us.ibm.com
Cc: K.Prasad pra...@linux.vnet.ibm.com
Cc: Lai Jiangshan la...@cn.fujitsu.com
Cc: Li Zefan l...@cn.fujitsu.com
Cc: Przemysław Pawełczyk przemys...@pawelczyk.it
Cc: Roland McGrath rol...@redhat.com
Cc: Sam Ravnborg s...@ravnborg.org
Cc: Srikar Dronamraju sri...@linux.vnet.ibm.com
Cc: Steven Rostedt rost...@goodmis.org
Cc: Tom Zanussi tzanu...@gmail.com
Cc: Vegard Nossum vegard.nos...@gmail.com
---

 Documentation/trace/kprobetrace.txt |  139 
 kernel/trace/Kconfig|   12 
 kernel/trace/Makefile   |1 
 kernel/trace/trace.h|   29 +
 kernel/trace/trace_event_types.h|   18 +
 kernel/trace/trace_kprobe.c | 1205 +++
 6 files changed, 1404 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/trace/kprobetrace.txt
 create mode 100644 kernel/trace/trace_kprobe.c

diff --git a/Documentation/trace/kprobetrace.txt 
b/Documentation/trace/kprobetrace.txt
new file mode 100644
index 000..efff6eb
--- /dev/null
+++ b/Documentation/trace/kprobetrace.txt
@@ -0,0 +1,139 @@
+ Kprobe-based Event Tracer
+ =
+
+ Documentation is written by Masami Hiramatsu
+
+
+Overview
+
+This tracer is similar to the events tracer which is based on Tracepoint
+infrastructure. Instead of Tracepoint, this tracer is based on kprobes(kprobe
+and kretprobe). It probes anywhere where kprobes can probe(this means, all
+functions body except for __kprobes functions).
+
+Unlike the function tracer, this tracer can probe instructions inside of
+kernel functions. It allows you to check which instruction has been executed.
+
+Unlike the Tracepoint based events tracer, this tracer can add and remove
+probe points on the fly.
+
+Similar to the events tracer, this tracer doesn't need to be activated via
+current_tracer, instead of that, just set probe points via
+/sys/kernel/debug/tracing/kprobe_events. And you can set filters on each
+probe events via /sys/kernel/debug/tracing/events/kprobes/EVENT/filter.
+
+
+Synopsis of kprobe_events
+-
+  p[:EVENT] SYMBOL[+offs|-offs]|MEMADDR [FETCHARGS]: Set a probe
+  r[:EVENT] SYMBOL[+0] [FETCHARGS] : Set a return probe
+
+ EVENT : Event name.
+ SYMBOL[+offs|-offs]   : Symbol+offset where the probe is inserted.
+ MEMADDR   : Address where the probe is inserted.
+
+ FETCHARGS : Arguments.
+  %REG : Fetch register REG
+  sN   : Fetch Nth entry of stack (N >= 0)
+  sa   : Fetch stack address.
+  @ADDR: Fetch memory at ADDR (ADDR should be in kernel)
+  @SYM[+|-offs]: Fetch memory at SYM +|- offs (SYM should be a data symbol)
+  aN   : Fetch function argument. (N >= 0)(*)
+  rv   : Fetch return value.(**)
+  ra   : Fetch return address.(**)
+  +|-offs(FETCHARG) : fetch memory at FETCHARG +|- offs address.(***)
+
+  (*) aN may not correct on asmlinkaged functions and at the middle of
+  function body.
+  (**) only for return probe.
+  (***) this is useful for fetching a field of data structures.
+
+
+Per-Probe Event Filtering
+-
+ Per-probe event filtering feature allows you to set different filter on each
+probe and gives you what arguments will be shown in trace 

[PATCH -tip v14 09/12] tracing: Kprobe-tracer supports more than 6 arguments

2009-08-13 Thread Masami Hiramatsu
Support up to 128 arguments for each kprobes event.
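
A user-space pre-check along these lines could catch an oversized argument
list before it is written to kprobe_events. This is only a sketch: the
limits are the MAX_TRACE_ARGS (128) and MAX_ARGSTR_LEN (63) values from
this patch, while check_fetchargs is a hypothetical helper and does not
reproduce the kernel's own error handling.

```shell
# Limits taken from the patch; the checking policy itself is an assumption.
MAX_TRACE_ARGS=128
MAX_ARGSTR_LEN=63

# check_fetchargs ARG... : succeed if the list fits within both limits.
check_fetchargs () {
  if [ $# -gt "$MAX_TRACE_ARGS" ]; then
    echo "too many arguments: $#"
    return 1
  fi
  for arg in "$@"; do
    if [ ${#arg} -gt "$MAX_ARGSTR_LEN" ]; then
      echo "argument too long: $arg"
      return 1
    fi
  done
  echo "ok: $# arguments"
}

check_fetchargs a0 a1 '+8(a2)' sa
```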

Signed-off-by: Masami Hiramatsu mhira...@redhat.com
Cc: Ananth N Mavinakayanahalli ana...@in.ibm.com
Cc: Avi Kivity a...@redhat.com
Cc: Andi Kleen a...@linux.intel.com
Cc: Christoph Hellwig h...@infradead.org
Cc: Frank Ch. Eigler f...@redhat.com
Cc: Frederic Weisbecker fweis...@gmail.com
Cc: H. Peter Anvin h...@zytor.com
Cc: Ingo Molnar mi...@elte.hu
Cc: Jason Baron jba...@redhat.com
Cc: Jim Keniston jkeni...@us.ibm.com
Cc: K.Prasad pra...@linux.vnet.ibm.com
Cc: Lai Jiangshan la...@cn.fujitsu.com
Cc: Li Zefan l...@cn.fujitsu.com
Cc: Przemysław Pawełczyk przemys...@pawelczyk.it
Cc: Roland McGrath rol...@redhat.com
Cc: Sam Ravnborg s...@ravnborg.org
Cc: Srikar Dronamraju sri...@linux.vnet.ibm.com
Cc: Steven Rostedt rost...@goodmis.org
Cc: Tom Zanussi tzanu...@gmail.com
Cc: Vegard Nossum vegard.nos...@gmail.com
---

 Documentation/trace/kprobetrace.txt |2 +-
 kernel/trace/trace_kprobe.c |   21 +
 2 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/Documentation/trace/kprobetrace.txt 
b/Documentation/trace/kprobetrace.txt
index efff6eb..c9c09b4 100644
--- a/Documentation/trace/kprobetrace.txt
+++ b/Documentation/trace/kprobetrace.txt
@@ -32,7 +32,7 @@ Synopsis of kprobe_events
  SYMBOL[+offs|-offs]   : Symbol+offset where the probe is inserted.
  MEMADDR   : Address where the probe is inserted.
 
- FETCHARGS : Arguments.
+ FETCHARGS : Arguments. Each probe can have up to 128 args.
   %REG : Fetch register REG
   sN   : Fetch Nth entry of stack (N >= 0)
   sa   : Fetch stack address.
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index d92877a..4704e40 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -32,7 +32,7 @@
 #include "trace.h"
 #include "trace_output.h"
 
-#define TRACE_KPROBE_ARGS 6
+#define MAX_TRACE_ARGS 128
 #define MAX_ARGSTR_LEN 63
 
 /* currently, trace_kprobe only supports X86. */
@@ -184,11 +184,15 @@ struct trace_probe {
struct kretproberp;
};
const char  *symbol;/* symbol name */
-   unsigned intnr_args;
-   struct fetch_func   args[TRACE_KPROBE_ARGS];
struct ftrace_event_callcall;
+   unsigned intnr_args;
+   struct fetch_func   args[];
 };
 
+#define SIZEOF_TRACE_PROBE(n)  \
+   (offsetof(struct trace_probe, args) +   \
+   (sizeof(struct fetch_func) * (n)))
+
 static int kprobe_trace_func(struct kprobe *kp, struct pt_regs *regs);
 static int kretprobe_trace_func(struct kretprobe_instance *ri,
struct pt_regs *regs);
@@ -263,11 +267,11 @@ static DEFINE_MUTEX(probe_lock);
 static LIST_HEAD(probe_list);
 
 static struct trace_probe *alloc_trace_probe(const char *symbol,
-const char *event)
+const char *event, int nargs)
 {
struct trace_probe *tp;
 
-   tp = kzalloc(sizeof(struct trace_probe), GFP_KERNEL);
+   tp = kzalloc(SIZEOF_TRACE_PROBE(nargs), GFP_KERNEL);
if (!tp)
return ERR_PTR(-ENOMEM);
 
@@ -573,9 +577,10 @@ static int create_trace_probe(int argc, char **argv)
if (offset && is_return)
return -EINVAL;
}
+   argc -= 2; argv += 2;
 
/* setup a probe */
-   tp = alloc_trace_probe(symbol, event);
+   tp = alloc_trace_probe(symbol, event, argc);
if (IS_ERR(tp))
return PTR_ERR(tp);
 
@@ -594,8 +599,8 @@ static int create_trace_probe(int argc, char **argv)
kp->addr = addr;
 
/* parse arguments */
-   argc -= 2; argv += 2; ret = 0;
-   for (i = 0; i < argc && i < TRACE_KPROBE_ARGS; i++) {
+   ret = 0;
+   for (i = 0; i < argc && i < MAX_TRACE_ARGS; i++) {
if (strlen(argv[i]) > MAX_ARGSTR_LEN) {
pr_info("Argument%d(%s) is too long.\n", i, argv[i]);
ret = -ENOSPC;


-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America), Inc.
Software Solutions Division

e-mail: mhira...@redhat.com


[PATCH -tip v14 11/12] tracing: Kprobe tracer assigns new event ids for each event

2009-08-13 Thread Masami Hiramatsu
Assign a new event ID to each kprobes event. The ring buffer is not cleared
when a kprobe event is unregistered, so if 'Unknown event' messages bother
you, clear the buffer manually after changing kprobe events.

Signed-off-by: Masami Hiramatsu mhira...@redhat.com
Cc: Ananth N Mavinakayanahalli ana...@in.ibm.com
Cc: Avi Kivity a...@redhat.com
Cc: Andi Kleen a...@linux.intel.com
Cc: Christoph Hellwig h...@infradead.org
Cc: Frank Ch. Eigler f...@redhat.com
Cc: Frederic Weisbecker fweis...@gmail.com
Cc: H. Peter Anvin h...@zytor.com
Cc: Ingo Molnar mi...@elte.hu
Cc: Jason Baron jba...@redhat.com
Cc: Jim Keniston jkeni...@us.ibm.com
Cc: K.Prasad pra...@linux.vnet.ibm.com
Cc: Lai Jiangshan la...@cn.fujitsu.com
Cc: Li Zefan l...@cn.fujitsu.com
Cc: Przemysław Pawełczyk przemys...@pawelczyk.it
Cc: Roland McGrath rol...@redhat.com
Cc: Sam Ravnborg s...@ravnborg.org
Cc: Srikar Dronamraju sri...@linux.vnet.ibm.com
Cc: Steven Rostedt rost...@goodmis.org
Cc: Tom Zanussi tzanu...@gmail.com
Cc: Vegard Nossum vegard.nos...@gmail.com
---

 kernel/trace/trace.h|6 -
 kernel/trace/trace_kprobe.c |   51 +--
 2 files changed, 15 insertions(+), 42 deletions(-)

diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 4ce4525..0b78d76 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -43,8 +43,6 @@ enum trace_type {
TRACE_POWER,
TRACE_BLK,
TRACE_KSYM,
-   TRACE_KPROBE,
-   TRACE_KRETPROBE,
 
__TRACE_LAST_TYPE,
 };
@@ -358,10 +356,6 @@ extern void __ftrace_bad_type(void);
IF_ASSIGN(var, ent, struct kmemtrace_free_entry,\
  TRACE_KMEM_FREE); \
IF_ASSIGN(var, ent, struct ksym_trace_entry, TRACE_KSYM);\
-   IF_ASSIGN(var, ent, struct kprobe_trace_entry,  \
- TRACE_KPROBE);\
-   IF_ASSIGN(var, ent, struct kretprobe_trace_entry,   \
- TRACE_KRETPROBE); \
__ftrace_bad_type();\
} while (0)
 
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index ec137ed..0e8498e 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -186,6 +186,7 @@ struct trace_probe {
};
const char  *symbol;/* symbol name */
struct ftrace_event_callcall;
+   struct trace_event  event;
unsigned intnr_args;
struct fetch_func   args[];
 };
@@ -795,7 +796,7 @@ static __kprobes int kprobe_trace_func(struct kprobe *kp, 
struct pt_regs *regs)
 
size = SIZEOF_KPROBE_TRACE_ENTRY(tp->nr_args);
 
-   event = trace_current_buffer_lock_reserve(TRACE_KPROBE, size,
+   event = trace_current_buffer_lock_reserve(call->id, size,
  irq_flags, pc);
if (!event)
return 0;
@@ -827,7 +828,7 @@ static __kprobes int kretprobe_trace_func(struct 
kretprobe_instance *ri,
 
size = SIZEOF_KRETPROBE_TRACE_ENTRY(tp->nr_args);
 
-   event = trace_current_buffer_lock_reserve(TRACE_KRETPROBE, size,
+   event = trace_current_buffer_lock_reserve(call->id, size,
  irq_flags, pc);
if (!event)
return 0;
@@ -853,7 +854,7 @@ print_kprobe_event(struct trace_iterator *iter, int flags)
struct trace_seq *s = &iter->seq;
int i;
 
-   trace_assign_type(field, iter->ent);
+   field = (struct kprobe_trace_entry *)iter->ent;
 
if (!seq_print_ip_sym(s, field->ip, flags | TRACE_ITER_SYM_OFFSET))
goto partial;
@@ -880,7 +881,7 @@ print_kretprobe_event(struct trace_iterator *iter, int 
flags)
struct trace_seq *s = &iter->seq;
int i;
 
-   trace_assign_type(field, iter->ent);
+   field = (struct kretprobe_trace_entry *)iter->ent;
 
if (!seq_print_ip_sym(s, field->ret_ip, flags | TRACE_ITER_SYM_OFFSET))
goto partial;
@@ -906,16 +907,6 @@ partial:
return TRACE_TYPE_PARTIAL_LINE;
 }
 
-static struct trace_event kprobe_trace_event = {
-   .type   = TRACE_KPROBE,
-   .trace  = print_kprobe_event,
-};
-
-static struct trace_event kretprobe_trace_event = {
-   .type   = TRACE_KRETPROBE,
-   .trace  = print_kretprobe_event,
-};
-
 static int probe_event_enable(struct ftrace_event_call *call)
 {
struct trace_probe *tp = (struct trace_probe *)call->data;
@@ -1107,35 +1098,35 @@ static int register_probe_event(struct trace_probe *tp)
/* Initialize ftrace_event_call */
call->system = "kprobes";
if (probe_is_return(tp)) {
-   call->event = &kretprobe_trace_event;
-   call->id = TRACE_KRETPROBE;
+   tp->event.trace = print_kretprobe_event;
  

[PATCH -tip v14 03/12] kprobes: checks probe address is instruction boundary on x86

2009-08-13 Thread Masami Hiramatsu
Ensure that kprobes are inserted safely by checking that the specified
address is the first byte of an instruction on x86.
This is done by decoding the probed function from its first instruction up
to the probe point.
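
The idea can be shown with a toy model in shell: start at offset 0 of the
function, advance by one instruction length at a time, and accept the probe
offset only if the walk lands on it exactly. Here the instruction lengths
are passed in by the caller and are made up; the kernel gets them from its
x86 instruction decoder. The helper name at_insn_boundary is hypothetical.

```shell
# at_insn_boundary OFFSET LEN... : succeed if OFFSET is the start of an
# instruction, given the instruction lengths from the function's first byte.
at_insn_boundary () {
  target=$1
  shift
  pos=0
  for len in "$@"; do
    # Stop once the walk has reached or passed the probe offset.
    [ "$pos" -ge "$target" ] && break
    pos=$((pos + len))
  done
  [ "$pos" -eq "$target" ]
}

# Invented function body with instructions of 1, 3, 5 and 2 bytes, so
# instructions start at offsets 0, 1, 4 and 9.
for off in 0 1 2 4 9 10; do
  if at_insn_boundary "$off" 1 3 5 2; then
    echo "offset $off: boundary"
  else
    echo "offset $off: inside an instruction"
  fi
done
```

Walking from the function start is needed because x86 instructions have
variable length, so an arbitrary address cannot be classified by looking at
the bytes at that address alone.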

Signed-off-by: Masami Hiramatsu mhira...@redhat.com
Acked-by: Ananth N Mavinakayanahalli ana...@in.ibm.com
Cc: Avi Kivity a...@redhat.com
Cc: Andi Kleen a...@linux.intel.com
Cc: Christoph Hellwig h...@infradead.org
Cc: Frank Ch. Eigler f...@redhat.com
Cc: Frederic Weisbecker fweis...@gmail.com
Cc: H. Peter Anvin h...@zytor.com
Cc: Ingo Molnar mi...@elte.hu
Cc: Jason Baron jba...@redhat.com
Cc: Jim Keniston jkeni...@us.ibm.com
Cc: K.Prasad pra...@linux.vnet.ibm.com
Cc: Lai Jiangshan la...@cn.fujitsu.com
Cc: Li Zefan l...@cn.fujitsu.com
Cc: Przemysław Pawełczyk przemys...@pawelczyk.it
Cc: Roland McGrath rol...@redhat.com
Cc: Sam Ravnborg s...@ravnborg.org
Cc: Srikar Dronamraju sri...@linux.vnet.ibm.com
Cc: Steven Rostedt rost...@goodmis.org
Cc: Tom Zanussi tzanu...@gmail.com
Cc: Vegard Nossum vegard.nos...@gmail.com
---

 arch/x86/kernel/kprobes.c |   69 +
 1 files changed, 69 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/kprobes.c b/arch/x86/kernel/kprobes.c
index b5b1848..80d493f 100644
--- a/arch/x86/kernel/kprobes.c
+++ b/arch/x86/kernel/kprobes.c
@@ -48,6 +48,7 @@
 #include <linux/preempt.h>
 #include <linux/module.h>
 #include <linux/kdebug.h>
+#include <linux/kallsyms.h>
 
 #include <asm/cacheflush.h>
 #include <asm/desc.h>
@@ -55,6 +56,7 @@
 #include <asm/uaccess.h>
 #include <asm/alternative.h>
 #include <asm/debugreg.h>
+#include <asm/insn.h>
 
 void jprobe_return_end(void);
 
@@ -245,6 +247,71 @@ retry:
}
 }
 
+/* Recover the probed instruction at addr for further analysis. */
+static int recover_probed_instruction(kprobe_opcode_t *buf, unsigned long addr)
+{
+   struct kprobe *kp;
+   kp = get_kprobe((void *)addr);
+   if (!kp)
+   return -EINVAL;
+
+   /*
+*  Basically, kp->ainsn.insn has an original instruction.
+*  However, RIP-relative instruction can not do single-stepping
+*  at different place, fix_riprel() tweaks the displacement of
+*  that instruction. In that case, we can't recover the instruction
+*  from the kp->ainsn.insn.
+*
+*  On the other hand, kp->opcode has a copy of the first byte of
+*  the probed instruction, which is overwritten by int3. And
+*  the instruction at kp->addr is not modified by kprobes except
+*  for the first byte, we can recover the original instruction
+*  from it and kp->opcode.
+*/
+   memcpy(buf, kp->addr, MAX_INSN_SIZE * sizeof(kprobe_opcode_t));
+   buf[0] = kp->opcode;
+   return 0;
+}
+
+/* Dummy buffers for kallsyms_lookup */
+static char __dummy_buf[KSYM_NAME_LEN];
+
+/* Check if paddr is at an instruction boundary */
+static int __kprobes can_probe(unsigned long paddr)
+{
+   int ret;
+   unsigned long addr, offset = 0;
+   struct insn insn;
+   kprobe_opcode_t buf[MAX_INSN_SIZE];
+
+   if (!kallsyms_lookup(paddr, NULL, &offset, NULL, __dummy_buf))
+   return 0;
+
+   /* Decode instructions */
+   addr = paddr - offset;
+   while (addr < paddr) {
+   kernel_insn_init(&insn, (void *)addr);
+   insn_get_opcode(&insn);
+
+   /* Check if the instruction has been modified. */
+   if (insn.opcode.bytes[0] == BREAKPOINT_INSTRUCTION) {
+   ret = recover_probed_instruction(buf, addr);
+   if (ret)
+   /*
+* Another debugging subsystem might insert
+* this breakpoint. In that case, we can't
+* recover it.
+*/
+   return 0;
+   kernel_insn_init(&insn, buf);
+   }
+   insn_get_length(&insn);
+   addr += insn.length;
+   }
+
+   return (addr == paddr);
+}
+
 /*
  * Returns non-zero if opcode modifies the interrupt flag.
  */
@@ -360,6 +427,8 @@ static void __kprobes arch_copy_kprobe(struct kprobe *p)
 
 int __kprobes arch_prepare_kprobe(struct kprobe *p)
 {
+   if (!can_probe((unsigned long)p->addr))
+   return -EILSEQ;
/* insn: must be on special executable page on x86. */
p->ainsn.insn = get_insn_slot();
if (!p->ainsn.insn)


-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America), Inc.
Software Solutions Division

e-mail: mhira...@redhat.com


[TOOL] kprobestest : Kprobe stress test tool

2009-08-13 Thread Masami Hiramatsu

This script tests kprobes by probing every symbol in the kernel and finds
the symbols that must be blacklisted.


Usage
-----
  kprobestest [-s SYMLIST] [-b BLACKLIST] [-w WHITELIST]
     Run the stress test. If a SYMLIST file is specified, use it as
     the initial symbol list (this is useful for verifying a whitelist
     after diagnosing all symbols).

  kprobestest cleanup
     Clean up all lists.


How It Works
------------
This tool lists all symbols in the kernel via /proc/kallsyms and sorts
them into groups (64 symbols each by default). It then tests each group
with kprobe-tracer. If the kernel crashes, that group is moved into the
'failed' dir. If the group passes the test, the script moves it into the
'passed' dir and saves kprobe_profile into 'passed/profiles/'.
After all groups are tested, all 'failed' groups are merged and split into
smaller groups (the group size is divided by 4 by default), and those are
tested again. This loop repeats until every group has just 1 symbol.

Finally, the script sorts all 'passed' symbols into 'tested', 'untested',
and 'missed' based on the profiles.
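
The shrinking group size described above can be sketched as follows. This
is only an illustration, not part of kprobestest: the function names
next_group_size and group_size_schedule are made up here, and the sketch
assumes the defaults of 64 symbols per group and a divisor of 4.

```shell
# Hypothetical helper: given the size of the largest failed group and the
# divisor, print the group size to use for the next round.
next_group_size() {
  local nr="$1"
  local div="${2:-4}"
  if [ "$nr" -le "$div" ]; then
    echo 1                      # small enough: test symbols one by one
  else
    echo $((nr / div))          # otherwise shrink the group by the divisor
  fi
}

# Hypothetical helper: print every group size used until groups hold a
# single symbol.
group_size_schedule() {
  local nr="${1:-64}"
  local div="${2:-4}"
  local out="$nr"
  while [ "$nr" -gt 1 ]; do
    nr=$(next_group_size "$nr" "$div")
    out="$out $nr"
  done
  echo "$out"
}

group_size_schedule 64 4
```

With the defaults this prints "64 16 4 1", so a crashing symbol is isolated
after at most four rounds of testing.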


Note
----
 - This script only gives clues about which functions should be
   blacklisted. In some cases a combination of probe points causes a
   problem even though none of them causes it alone.

Thank you,

--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America), Inc.
Software Solutions Division

e-mail: mhira...@redhat.com

#!/bin/bash
#
#  kprobestest: Kprobes stress test tool
#  Written by Masami Hiramatsu mhira...@redhat.com
#
#  Usage:
#     $ kprobestest [-s SYMLIST] [-b BLACKLIST] [-w WHITELIST]
#        Run stress test. If SYMLIST file is specified, use it as
#        an initial symbol list (This is useful for verifying white list
#        after diagnosing all symbols).
#
#     $ kprobestest cleanup
#        Cleanup all lists

# Configurations 
DEBUGFS=/sys/kernel/debug
INITNR=64
DIV=4
SYMFILE=syms.list
FAILFILE=black.list

function do_test () {
  # Do some benchmark
  for i in {1..4} ; do
    sleep 0.5
    echo -n "."
  done
}

function usage () {
  echo "Usage: kprobestest [cleanup] [-s SYMLIST] [-b BLACKLIST] [-w WHITELIST]"
  exit 0
}

function cleanup_test () {
  echo "Cleanup all files"
  rm -rf $SYMFILE failed passed testing unset
  exit 0
}


# Parse arguments
WHITELIST=
BLACKLIST=
SYMLIST=

while [ "$1" ]; do
  case "$1" in
    cleanup)
      cleanup_test
      ;;
    -s)
      SYMLIST=$2
      shift 1
      ;;
    -b)
      BLACKLIST=$2
      shift 1
      ;;
    -w)
      WHITELIST=$2
      shift 1
      ;;
    *)
      usage
      ;;
  esac
  shift 1
done

# Show configurations
echo "Kprobe stress test starting."
[ -f "$BLACKLIST" ] && echo "Blacklist: $BLACKLIST" || BLACKLIST=
[ -f "$WHITELIST" ] && echo "Whitelist: $WHITELIST" || WHITELIST=
[ -f "$SYMLIST" ] && echo "Symlist: $SYMLIST" || SYMLIST=

function make_filter () {
  local EXP=
  if [ -z "$WHITELIST" -a -z "$BLACKLIST" ]; then
    echo "s/^$//g"
  else
    for i in `cat $WHITELIST $BLACKLIST` ;do
      [ -z "$EXP" ] && EXP="^$i\$" || EXP="$EXP\\|^$i\$"
    done ; EXP="s/$EXP//g"
    echo $EXP
  fi
}

function list_allsyms () {
  local sym
  local out=1
  for sym in `sort /proc/kallsyms | egrep '[0-9a-f]+ [Tt] [^[]*$' | cut -d\  -f 3`;do
    [ $sym  = __kprobes_text_start ] && out=0 && continue
    [ $sym  = __kprobes_text_end ] && out=1 && continue
    [ $sym  = _etext ] && break
    [ $out -eq 1 ] && echo $sym
  done
}

function prep_testing () {
  local i=0
  local n=0
  local NR=$1
  local fname=

  echo "Grouping symbols: $NR"

  fname=`printf "list-%03d.%d" $i $NR`
  cat $SYMFILE | while read ln; do
    [ -z "$ln" ] && continue
    echo "$ln" >> testing/$fname
    n=$((n+1))
    if [ $n -eq $NR ]; then
      n=0
      i=$((i+1))
      fname=`printf "list-%03d.%d" $i $NR`
    fi
  done
  sync
}

function init_first () {
  local EXP
  EXP=`make_filter`
  if [ -f "$SYMLIST" ]; then
    cat $SYMLIST | sed "$EXP" > $SYMFILE
  else
    echo -n "Generating symbol list from /proc/kallsyms..."
    list_allsyms | sed "$EXP" > $SYMFILE
    echo "done." `wc -l $SYMFILE | cut -f1 -d\  ` "symbols listed."
  fi
  mkdir -p testing failed unset passed passed/profiles
  prep_testing $INITNR
}

function get_max_nr () {
  wc -l failed/list-* unset/list-* 2>/dev/null |\
  awk '/^ *[0-9]+ .*list.*$/{ if (nr < $1) nr=$1 } BEGIN { nr=0 } END { print nr}'
}

function init_next () {
  local NR
  NR=`get_max_nr`
  [ $NR -eq 0 ] && return 1
  [ $NR -eq 1 ] && return 2
  [ $NR -le $DIV ] && NR=1 || NR=`expr $NR / $DIV`

  cat failed/* unset/* > $SYMFILE
  rm failed/* unset/*

  prep_testing $NR
  return 0
}


# Initialize symbols
if [ ! -d testing ]; then
  init_first
elif [ -z "`ls testing/`" ]; then
  init_next
fi

function set_probes () {
  local s
  for s in `cat $1`; do
    echo "p:$s $s" >> $DEBUGFS/tracing/kprobe_events
    [ $? -ne 0 ] && return -1
  done
  return 0
}

function clear_probes () {
  echo > $DEBUGFS/tracing/kprobe_events
}

function save_profile () {
  cat $DEBUGFS/tracing/kprobe_profile  

[TOOL] c2kpe: C expression to kprobe event format converter

2009-08-13 Thread Masami Hiramatsu

This program converts a probe point written as a C expression into the
kprobe event format used by the kprobe-based event tracer. It lets you
define kprobe events by C source line number or function name, plus local
variable names. Currently it supports only x86 (32/64-bit) kernels.


Compile
-------
Before compiling, please install the libelf and libdwarf development
packages.
(e.g. elfutils-libelf-devel and libdwarf-devel on Fedora)

 $ gcc -Wall c2kpe.c -o c2kpe -ldwarf -lelf


Synopsis
--------
 $ c2kpe [options] function[+off...@src] [VAR [VAR ...]]
 or
 $ c2kpe [options] @SRC:LINE [VAR [VAR ...]]

  FUNCTION: Probing function name.
  OFFS: Offset in bytes.
  SRC:  Source file path.
  LINE: Line number
  VAR:  Local variable name.
  options:
  -r KREL   Kernel release version (e.g. 2.6.31-rc5)
  -m DEBUGINFO  Dwarf-format binary file (vmlinux or kmodule)
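
As an illustration of the @SRC:LINE form above, the following shell sketch
splits such an argument into its source path and line number. The function
name parse_src_line is hypothetical, and c2kpe itself may parse its
arguments differently; this only shows what the two fields are.

```shell
# Hypothetical helper: split "@SRC:LINE" into "SRC LINE".
# Returns non-zero if the argument is not in that form.
parse_src_line() {
  local spec="$1"
  local body src line
  case "$spec" in
    @*:*) ;;                    # looks like @SRC:LINE
    *) return 1 ;;
  esac
  body="${spec#@}"              # drop the leading '@'
  src="${body%:*}"              # everything before the last ':'
  line="${body##*:}"            # everything after the last ':'
  case "$line" in
    ''|*[!0-9]*) return 1 ;;    # LINE must be a non-empty number
  esac
  [ -n "$src" ] || return 1     # SRC must not be empty
  echo "$src $line"
}

parse_src_line "@mm/filemap.c:339"
```

For the example argument this prints "mm/filemap.c 339"; a plain function
name such as sys_read is rejected.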


Example
-------
 $ c2kpe sys_read fd buf count
 sys_read+0 %di %si %dx

 $ c2kpe @mm/filemap.c:339 inode pos
 sync_page_range+125 -48(%bp) %r14


Example with kprobe-tracer
--------------------------
Since one C expression may be converted into multiple results, I recommend
reading the output line by line.

 $ c2kpe sys_read fd buf count | while read i; do \
   echo "p $i" >> $DEBUGFS/tracing/kprobe_events ;\
   done
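
Before writing anything into kprobe_events, it may help to review the
definitions first. The sketch below is a dry run under that assumption: the
function name format_kprobe_events is made up for this example, and the
input line is the sys_read result shown in the Example section.

```shell
# Hypothetical helper: read converted probe points on stdin and print one
# kprobe event definition per line, without touching debugfs.
format_kprobe_events() {
  local line
  while read -r line; do
    [ -z "$line" ] && continue  # skip empty lines
    printf '%s\n' "p $line"
  done
}

printf '%s\n' 'sys_read+0 %di %si %dx' | format_kprobe_events
```

Once the printed lines look right, the same output can be appended to
$DEBUGFS/tracing/kprobe_events one line at a time.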


Note
----
 - This requires a kernel compiled with CONFIG_DEBUG_INFO.
 - Specifying @SRC speeds up c2kpe, because it can skip CUs which don't
   include the specified SRC file.
 - c2kpe doesn't check whether the byte offset falls on an instruction
   boundary. I recommend using the @SRC:LINE expression when tracing
   inside a function body.
 - This tool doesn't search for kernel module files. To probe a module,
   you need to specify its file with -m.


TODO
----
 - Fix bugs.
 - Support multiple probepoints from stdin.
 - Better kmodule support.
 - Use elfutils-libdw?
 - Merge into trace-cmd or perf-tools?

--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America), Inc.
Software Solutions Division

e-mail: mhira...@redhat.com

/*
 * c2kpe : C expression to kprobe event converter
 *
 * Written by Masami Hiramatsu mhira...@redhat.com
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
 *
 */

#include <sys/utsname.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <errno.h>
#include <stdio.h>
#include <unistd.h>
#include <getopt.h>
#include <stdlib.h>
#include <string.h>
#include <libdwarf/dwarf.h>
#include <libdwarf/libdwarf.h>

/* Default vmlinux search paths */
#define NR_SEARCH_PATH 2
const char *default_search_path[NR_SEARCH_PATH] = {
	"/lib/modules/%s/build/vmlinux",		/* Custom build kernel */
	"/usr/lib/debug/lib/modules/%s/vmlinux",	/* Red Hat debuginfo */
};

#define _stringify(n)	#n
#define stringify(n)	_stringify(n)

#ifdef DEBUG
#define debug(fmt ...)	\
	fprintf(stderr, "DBG(" __FILE__ ":" stringify(__LINE__) "): " fmt)
#else
#define debug(fmt ...)	do {} while (0)
#endif

#define ERR_IF(cnd)	\
	do { if (cnd) {	\
		fprintf(stderr, "Error (" __FILE__ ":" stringify(__LINE__) \
			"): " stringify(cnd) "\n");	\
		exit(1);	\
	}} while (0)

#define MAX_PATH_LEN 256

/* Dwarf_Die Linkage to parent Die */
struct die_link {
	struct die_link *parent;	/* Parent die */
	Dwarf_Die die;			/* Current die */
};

#define X86_32_MAX_REGS 8
const char *x86_32_regs_table[X86_32_MAX_REGS] = {
	"%ax",
	"%cx",
	"%dx",
	"%bx",
	"sa",	/* Stack address */
	"%bp",
	"%si",
	"%di",
};

#define X86_64_MAX_REGS 16
const char *x86_64_regs_table[X86_64_MAX_REGS] = {
	"%ax",
	"%dx",
	"%cx",
	"%bx",
	"%si",
	"%di",
	"%bp",
	"%sp",
	"%r8",
	"%r9",
	"%r10",
	"%r11",
	"%r12",
	"%r13",
	"%r14",
	"%r15",
};

/* TODO: switching by dwarf address size */
#ifdef __x86_64__
#define ARCH_MAX_REGS X86_64_MAX_REGS
#define arch_regs_table x86_64_regs_table
#else
#define ARCH_MAX_REGS X86_32_MAX_REGS
#define arch_regs_table x86_32_regs_table
#endif

/* Return architecture dependent register string, or NULL if out of range */
static inline const char *get_arch_regstr(unsigned int n)
{
	/* Valid indexes are 0 .. ARCH_MAX_REGS - 1 */
	return (n < ARCH_MAX_REGS) ? arch_regs_table[n] : NULL;
}


Re: [TOOL] c2kpe: C expression to kprobe event format converter

2009-08-13 Thread Christoph Hellwig
You rock, this is awesome!  I'm a bit busy right now, but I'll play
around with it ASAP and will see how well it works for me.



Guest OpenGL Acceleration

2009-08-13 Thread Gordan Bobic
Is OpenGL Acceleration based on the host's OpenGL capability available 
in KVM?


Thanks.

Gordan


Disk Emulation and Trim Instruction

2009-08-13 Thread Gordan Bobic
With the recent talk of the trim SATA instruction becoming supported in 
the upcoming versions of Windows and claims from Intel that support for 
it in their SSDs is imminent, it occurs to me that this would be equally 
useful in virtual disk emulation.


Since the disk image is a sparse file, it always only grows, and 
eventually it will grow to its full intended size even if the actual 
used space is a small fraction of the container size. Since the trim 
instruction tells the disk that a particular block is no longer used 
(and can thus be scheduled for erasing as and when required), the same 
thing could be used to reclaim space used by sparse files backing the 
VM. It would allow for higher overcommit of disk usage on VM farms.
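
To illustrate the idea on the host side, here is a minimal sketch using an 
ordinary sparse file as a stand-in for a disk image. It assumes a Linux 
host with the util-linux fallocate tool and a filesystem that supports hole 
punching; it is not how KVM or QEMU handles discards, only a picture of 
what reclaiming space in a sparse file means.

```shell
# Illustration only: a plain temporary file stands in for a guest disk image.
img=$(mktemp)

# Create a 1 MiB sparse file by seeking past the end without writing data.
dd if=/dev/zero of="$img" bs=1 count=0 seek=1048576 2>/dev/null

# The "guest" writes 64 KiB, so the host has to allocate those blocks.
dd if=/dev/urandom of="$img" bs=65536 count=1 conv=notrunc 2>/dev/null
echo "allocated after write: $(du -k "$img" | cut -f1) KiB"

# The "guest" discards that range; punching a hole hands the blocks back.
# This step may fail where hole punching is not supported.
if fallocate --punch-hole --offset 0 --length 65536 "$img" 2>/dev/null; then
  echo "allocated after punch: $(du -k "$img" | cut -f1) KiB"
else
  echo "hole punching not supported here"
fi

# The apparent size of the image stays the same throughout.
apparent_size=$(wc -c < "$img")
echo "apparent size: $apparent_size bytes"
rm -f "$img"
```

The apparent size stays at 1 MiB the whole time; only the space actually 
allocated on the host changes.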


Is this feature likely to be available in KVM soon?

Gordan