[COMMIT master] KVM: Adjust makefile for x86_emulate.c rename
From: Avi Kivity <a...@redhat.com>

Signed-off-by: Avi Kivity <a...@redhat.com>

diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index afaaa76..0e7fe78 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -9,7 +9,7 @@ kvm-y += $(addprefix ../../../virt/kvm/, kvm_main.o ioapic.o \
 				coalesced_mmio.o irq_comm.o eventfd.o)
 kvm-$(CONFIG_IOMMU_API)	+= $(addprefix ../../../virt/kvm/, iommu.o)
 
-kvm-y += x86.o mmu.o x86_emulate.o i8259.o irq.o lapic.o \
+kvm-y += x86.o mmu.o emulate.o i8259.o irq.o lapic.o \
 	 i8254.o timer.o
 kvm-intel-y	+= vmx.o
 kvm-amd-y	+= svm.o
--
To unsubscribe from this list: send the line "unsubscribe kvm-commits" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/3] qemu-kvm: vhost net support
On Wed, Aug 12, 2009 at 04:27:44PM -0400, Gregory Haskins wrote:
> Michael S. Tsirkin wrote:
> > This adds support for vhost-net virtio kernel backend.
> >
> > This is RFC, but works without issues for me.
> >
> > Still needs to be split up, tested and benchmarked properly,
> > but posting it here in case people want to test drive the kernel
> > bits I posted.
>
> This has a large degree of rejects against qemu-kvm.git/master. What
> tree does this apply to?
>
> -Greg

Likely that tree has advanced since.
This is on top of commit b6bbd41fac4b6fb0efc65e083d2151ce1521f615.

--
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] vhost_net: a kernel-level virtio server
On Wed, Aug 12, 2009 at 07:59:47PM +0200, Arnd Bergmann wrote:
> The trick is to swap the virtqueues instead. virtio-net is actually
> mostly symmetric in just the same way that the physical wires on a
> twisted pair ethernet are symmetric (I like how that analogy fits).

You need to really squint hard for it to look symmetric.

For example, for RX, virtio allocates an skb, puts a descriptor on a
ring and waits for host to fill it in. Host system can not do the same:
guest does not have access to host memory.

You can do a copy in transport to hide this fact, but it will kill
performance.

--
MST
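For readers following along, here is a rough user-space sketch of the RX flow described above. It rests on loud assumptions: `rx_ring`, `guest_post_buffer` and `host_fill_buffer` are made-up names, not virtio or vhost APIs, and a plain array stands in for the vring. It only illustrates which side owns the buffer memory.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define RING_SIZE 4	/* hypothetical ring depth */

/* One posted receive buffer; the memory always belongs to the guest. */
struct rx_desc {
	unsigned char *addr;	/* guest buffer */
	size_t cap;		/* buffer capacity in bytes */
	size_t len;		/* bytes written by the host */
};

struct rx_ring {
	struct rx_desc desc[RING_SIZE];
	unsigned avail;		/* buffers posted by the guest so far */
	unsigned used;		/* buffers filled by the host so far */
};

/* Guest side: hand an empty buffer to the host by posting a descriptor. */
static int guest_post_buffer(struct rx_ring *r, unsigned char *buf, size_t cap)
{
	struct rx_desc *d;

	assert(r != NULL && buf != NULL);
	if (r->avail - r->used >= RING_SIZE)
		return -1;	/* ring full */
	d = &r->desc[r->avail % RING_SIZE];
	d->addr = buf;
	d->cap = cap;
	d->len = 0;
	r->avail++;
	return 0;
}

/*
 * Host side: write a packet into the oldest posted guest buffer.
 * The memcpy stands for the host writing into guest memory; the host
 * cannot receive anything until the guest has posted a buffer.
 */
static int host_fill_buffer(struct rx_ring *r, const void *pkt, size_t len)
{
	struct rx_desc *d;

	assert(r != NULL && pkt != NULL);
	if (r->used == r->avail)
		return -1;	/* nothing posted */
	d = &r->desc[r->used % RING_SIZE];
	if (len > d->cap)
		return -1;	/* packet does not fit */
	memcpy(d->addr, pkt, len);
	d->len = len;
	r->used++;
	return 0;
}
```

The reverse direction has no equivalent: in this model there is no way for the guest to write into a buffer that lives in host memory, which is the asymmetry being argued here.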
Re: [PATCH 2/2] vhost_net: a kernel-level virtio server
On Wed, Aug 12, 2009 at 02:27:31PM -0500, Anthony Liguori wrote:
> Arnd Bergmann wrote:
> > > As I pointed out earlier, most code in virtio net is asymmetrical:
> > > guest provides buffers, host consumes them. Possibly, one could use
> > > virtio rings in a symmetrical way, but support of existing guest
> > > virtio net means there's almost no shared code.
> >
> > The trick is to swap the virtqueues instead. virtio-net is actually
> > mostly symmetric in just the same way that the physical wires on a
> > twisted pair ethernet are symmetric (I like how that analogy fits).
>
> It's already been done between two guests. See
>
> http://article.gmane.org/gmane.linux.kernel.virtualization/5423
>
> Regards,
>
> Anthony Liguori

Yes, this works by copying data (see PATCH 5/5). Another possibility is
page flipping. Either will kill performance.

--
MST
[RFC] defer skb allocation in virtio_net -- mergable buff part
Guest virtio_net receives packets from its pre-allocated vring buffers, then delivers them to upper-layer protocols as skbs. So it is not necessary to pre-allocate an skb for each mergeable buffer and then free it when it turns out to be unused.

This patch defers skb allocation until a packet is received, which reduces skb pre-allocations and skb frees. It introduces two page lists: freed_pages and used_pages. used_pages tracks the pre-allocated pages and is only needed when removing virtio_net.

This patch has been tested and measured against 2.6.31-rc4 git. I expected it to improve large-packet performance, but I saw netperf TCP_STREAM performance improve for small packets in both the local guest to host and host to local guest cases. It also reduces the UDP packet drop rate from host to local guest. I do not fully understand why.

The netperf results from my laptop are:

mtu=1500
netperf -H xxx -l 120

                 w/o patch      w/ patch (two runs)
guest to host:   3336.84Mb/s    3730.14Mb/s ~ 3582.88Mb/s
host to guest:   3165.10Mb/s    3370.39Mb/s ~ 3407.96Mb/s

Here is the patch for your review. The same approach can apply to non-mergeable buffers too, so the code can be shared. If there is no objection, I will submit the non-mergeable buffers patch later.

Signed-off-by: Shirley Ma <x...@us.ibm.com>
---

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 2a6e81d..e31ebc9 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -17,6 +17,7 @@
  * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
  */
 //#define DEBUG
+#include <linux/list.h>
 #include <linux/netdevice.h>
 #include <linux/etherdevice.h>
 #include <linux/ethtool.h>
@@ -39,6 +40,12 @@ module_param(gso, bool, 0444);
 
 #define VIRTNET_SEND_COMMAND_SG_MAX	2
 
+struct page_list
+{
+	struct page *page;
+	struct list_head list;
+};
+
 struct virtnet_info
 {
 	struct virtio_device *vdev;
@@ -72,6 +79,8 @@ struct virtnet_info
 
 	/* Chain pages by the private ptr. */
 	struct page *pages;
+	struct list_head used_pages;
+	struct list_head freed_pages;
 };
 
 static inline void *skb_vnet_hdr(struct sk_buff *skb)
@@ -106,6 +115,26 @@ static struct page *get_a_page(struct virtnet_info *vi, gfp_t gfp_mask)
 	return p;
 }
 
+static struct page_list *get_a_free_page(struct virtnet_info *vi, gfp_t gfp_mask)
+{
+	struct page_list *plist;
+
+	if (list_empty(&vi->freed_pages)) {
+		plist = kmalloc(sizeof(struct page_list), gfp_mask);
+		if (!plist)
+			return NULL;
+		list_add_tail(&plist->list, &vi->freed_pages);
+		plist->page = alloc_page(gfp_mask);
+	} else {
+		plist = list_first_entry(&vi->freed_pages, struct page_list, list);
+		if (!plist->page)
+			plist->page = alloc_page(gfp_mask);
+	}
+	if (plist->page)
+		list_move_tail(&plist->list, &vi->used_pages);
+	return plist;
+}
+
 static void skb_xmit_done(struct virtqueue *svq)
 {
 	struct virtnet_info *vi = svq->vdev->priv;
@@ -121,14 +150,14 @@ static void skb_xmit_done(struct virtqueue *svq)
 	tasklet_schedule(&vi->tasklet);
 }
 
-static void receive_skb(struct net_device *dev, struct sk_buff *skb,
+static void receive_skb(struct net_device *dev, void *buf,
 			unsigned len)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
-	struct virtio_net_hdr *hdr = skb_vnet_hdr(skb);
 	int err;
 	int i;
-
+	struct sk_buff *skb = NULL;
+	struct virtio_net_hdr *hdr = NULL;
 	if (unlikely(len < sizeof(struct virtio_net_hdr) + ETH_HLEN)) {
 		pr_debug("%s: short packet %i\n", dev->name, len);
 		dev->stats.rx_length_errors++;
@@ -136,15 +165,30 @@ static void receive_skb(struct net_device *dev, struct sk_buff *skb,
 	}
 
 	if (vi->mergeable_rx_bufs) {
-		struct virtio_net_hdr_mrg_rxbuf *mhdr = skb_vnet_hdr(skb);
+		struct virtio_net_hdr_mrg_rxbuf *mhdr;
 		unsigned int copy;
-		char *p = page_address(skb_shinfo(skb)->frags[0].page);
+		skb_frag_t *f;
+		struct page_list *plist = (struct page_list *)buf;
+		char *p = page_address(plist->page);
+
+		skb = netdev_alloc_skb(vi->dev, GOOD_COPY_LEN + NET_IP_ALIGN);
+		if (unlikely(!skb)) {
+			/* drop the packet */
+			dev->stats.rx_dropped++;
+			list_move_tail(&plist->list, &vi->freed_pages);
+			return;
+		}
+
+		skb_reserve(skb, NET_IP_ALIGN);
 
 		if (len > PAGE_SIZE)
 			len = PAGE_SIZE;
 		len -= sizeof(struct virtio_net_hdr_mrg_rxbuf);
 
-		memcpy(hdr, p, sizeof(*mhdr));
+		mhdr = skb_vnet_hdr(skb);
+
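The freed/used list handling in the patch can be modelled in user space roughly as follows. This is only a sketch under stated assumptions: `rx_pool`, `pool_get`, `pool_put` and `rx_receive` are hypothetical names, singly linked lists replace the kernel's `list_head`, `malloc` stands in for `alloc_page` and `netdev_alloc_skb`, and (unlike `get_a_free_page` above) `pool_get` returns NULL when no page can be had. A `hdr_len` of 0 is used to simulate an skb allocation failure.

```c
#include <assert.h>
#include <stdlib.h>

#define PAGE_SZ 4096	/* hypothetical page size */

struct page_node {
	void *page;
	struct page_node *next;
};

struct rx_pool {
	struct page_node *freed;	/* nodes available for reuse */
	struct page_node *used;		/* nodes currently posted for receive */
	unsigned long dropped;		/* packets dropped at receive time */
};

static void push(struct page_node **head, struct page_node *n)
{
	n->next = *head;
	*head = n;
}

static struct page_node *pop(struct page_node **head)
{
	struct page_node *n = *head;

	if (n) {
		*head = n->next;
		n->next = NULL;
	}
	return n;
}

static void unlink_node(struct page_node **head, struct page_node *n)
{
	while (*head && *head != n)
		head = &(*head)->next;
	if (*head) {
		*head = n->next;
		n->next = NULL;
	}
}

/* Take a node with a page attached, preferring a recycled one. */
static struct page_node *pool_get(struct rx_pool *p)
{
	struct page_node *n = pop(&p->freed);

	if (!n) {
		n = calloc(1, sizeof(*n));
		if (!n)
			return NULL;
	}
	if (!n->page)
		n->page = malloc(PAGE_SZ);
	if (!n->page) {
		push(&p->freed, n);	/* keep the node for a later retry */
		return NULL;
	}
	push(&p->used, n);
	return n;
}

/* Move a node from the used list back to the freed list. */
static void pool_put(struct rx_pool *p, struct page_node *n)
{
	unlink_node(&p->used, n);
	push(&p->freed, n);
}

/*
 * Receive path: the small header buffer (the "skb") is allocated only
 * now. If that fails the packet is dropped and the page is recycled;
 * on success the node stays on the used list and the caller owns the
 * returned buffer.
 */
static void *rx_receive(struct rx_pool *p, struct page_node *n, size_t hdr_len)
{
	void *skb;

	assert(p && n && n->page);
	skb = hdr_len ? malloc(hdr_len) : NULL;
	if (!skb) {
		p->dropped++;
		pool_put(p, n);
		return NULL;
	}
	return skb;
}
```

The point of the model is that a failed allocation at receive time costs one dropped packet but no page: the next `pool_get` hands the same page back out.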
Re: [PATCH 2/2] vhost_net: a kernel-level virtio server
On Wed, Aug 12, 2009 at 02:22:38PM -0500, Anthony Liguori wrote:
> Michael S. Tsirkin wrote:
> > > We discussed this before, and I still think this could be directly
> > > derived from struct virtqueue, in the same way that vring_virtqueue
> > > is derived from struct virtqueue.
> >
> > I prefer keeping it simple. Much of abstraction in virtio is due to
> > the fact that it needs to work on top of different hardware
> > emulations: lguest, kvm, possibly others in the future. vhost is
> > always working on real hardware, using eventfd as the interface, so
> > it does not need that.
>
> Actually, vhost may not always be limited to real hardware.

Yes, any ethernet device will do. What I mean is that vhost does not
deal with emulation at all. All setup is done in userspace.

> We may one day use vhost as the basis of a driver domain. There's
> quite a lot of interest in this for networking.

You can use veth for this. This works today.

> At any rate, I'd like to see performance results before we consider
> trying to reuse virtio code.
>
> Regards,
>
> Anthony Liguori
[PATCHv2 0/3] qemu-kvm: vhost net support
This adds support for vhost-net virtio kernel backend.

This is RFC, but works without issues for me.

Still needs to be split up, tested and benchmarked properly, but posting it here in case people want to test drive the kernel bits I posted.

Changes since v1:
- rebased on top of 9dc275d9d660fe1cd64d36102d600885f9fdb88a

Michael S. Tsirkin (3):
  qemu-kvm: move virtio-pci.o to near pci.o
  virtio: move features to an inline function
  qemu-kvm: vhost-net implementation

 Makefile.hw         |    2 +-
 Makefile.target     |    3 +-
 hw/vhost_net.c      |  181 +++
 hw/vhost_net.h      |   30 +
 hw/virtio-balloon.c |    2 +-
 hw/virtio-blk.c     |    2 +-
 hw/virtio-console.c |    2 +-
 hw/virtio-net.c     |   34 +-
 hw/virtio-pci.c     |   43 +++-
 hw/virtio.c         |   19 --
 hw/virtio.h         |   38 ++-
 net.c               |    5 ++
 net.h               |    1 +
 qemu-kvm.h          |    9 +++
 14 files changed, 339 insertions(+), 32 deletions(-)
 create mode 100644 hw/vhost_net.c
 create mode 100644 hw/vhost_net.h
[PATCHv2 1/3] qemu-kvm: move virtio-pci.o to near pci.o
virtio-pci depends, and will always depend, on pci.c so it makes sense to keep it in the same makefile, (unlike the rest of virtio files which should eventually be moved out to Makefile.hw).

Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
---
 Makefile.hw     |    2 +-
 Makefile.target |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/Makefile.hw b/Makefile.hw
index 139412e..6472ec1 100644
--- a/Makefile.hw
+++ b/Makefile.hw
@@ -11,7 +11,7 @@ VPATH=$(SRC_PATH):$(SRC_PATH)/hw
 QEMU_CFLAGS+=-I.. -I$(SRC_PATH)/fpu
 
 obj-y =
-obj-y += virtio.o virtio-pci.o
+obj-y += virtio.o
 obj-y += fw_cfg.o
 obj-y += watchdog.o
 obj-y += nand.o ecc.o
diff --git a/Makefile.target b/Makefile.target
index aeda3fe..f6d9708 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -170,7 +170,7 @@ obj-y = vl.o osdep.o monitor.o pci.o loader.o isa_mmio.o machine.o \
         gdbstub.o gdbstub-xml.o msix.o ioport.o qemu-config.o
 # virtio has to be here due to weird dependency between PCI and virtio-net.
 # need to fix this properly
-obj-y += virtio-blk.o virtio-balloon.o virtio-net.o virtio-console.o
+obj-y += virtio-blk.o virtio-balloon.o virtio-net.o virtio-console.o virtio-pci.o
 obj-$(CONFIG_KVM) += kvm.o kvm-all.o
 
 LIBS+=-lz
--
1.6.2.5
[PATCHv2 2/3] virtio: move features to an inline function
devices should have the final say over which virtio features they support. E.g. indirect entries may or may not make sense in the context of virtio-console. Move the common bits from virtio-pci to an inline function and let each device call it.

No functional changes.

Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
---
 hw/virtio-balloon.c |    2 +-
 hw/virtio-blk.c     |    2 +-
 hw/virtio-console.c |    2 +-
 hw/virtio-net.c     |    2 +-
 hw/virtio-pci.c     |    3 ---
 hw/virtio.h         |   10 ++
 6 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/hw/virtio-balloon.c b/hw/virtio-balloon.c
index 7ca783e..15b50bb 100644
--- a/hw/virtio-balloon.c
+++ b/hw/virtio-balloon.c
@@ -127,7 +127,7 @@ static void virtio_balloon_set_config(VirtIODevice *vdev,
 
 static uint32_t virtio_balloon_get_features(VirtIODevice *vdev)
 {
-    return 0;
+    return virtio_common_features();
 }
 
 static ram_addr_t virtio_balloon_to_target(void *opaque, ram_addr_t target)
diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
index c278d2e..a33eafb 100644
--- a/hw/virtio-blk.c
+++ b/hw/virtio-blk.c
@@ -378,7 +378,7 @@ static uint32_t virtio_blk_get_features(VirtIODevice *vdev)
     if (strcmp(s->serial_str, "0"))
         features |= 1 << VIRTIO_BLK_F_IDENTIFY;
 
-    return features;
+    return features | virtio_common_features();
 }
 
 static void virtio_blk_save(QEMUFile *f, void *opaque)
diff --git a/hw/virtio-console.c b/hw/virtio-console.c
index 663c8b9..ac25499 100644
--- a/hw/virtio-console.c
+++ b/hw/virtio-console.c
@@ -53,7 +53,7 @@ static void virtio_console_handle_input(VirtIODevice *vdev, VirtQueue *vq)
 
 static uint32_t virtio_console_get_features(VirtIODevice *vdev)
 {
-    return 0;
+    return virtio_common_features();
 }
 
 static int vcon_can_read(void *opaque)
diff --git a/hw/virtio-net.c b/hw/virtio-net.c
index ce8e6cb..469c6e3 100644
--- a/hw/virtio-net.c
+++ b/hw/virtio-net.c
@@ -154,7 +154,7 @@ static uint32_t virtio_net_get_features(VirtIODevice *vdev)
     }
 #endif
 
-    return features;
+    return features | virtio_common_features();
 }
 
 static uint32_t virtio_net_bad_features(VirtIODevice *vdev)
diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
index 8b57dfc..ab6e9c4 100644
--- a/hw/virtio-pci.c
+++ b/hw/virtio-pci.c
@@ -230,9 +230,6 @@ static uint32_t virtio_ioport_read(VirtIOPCIProxy *proxy, uint32_t addr)
     switch (addr) {
     case VIRTIO_PCI_HOST_FEATURES:
         ret = vdev->get_features(vdev);
-        ret |= (1 << VIRTIO_F_NOTIFY_ON_EMPTY);
-        ret |= (1 << VIRTIO_RING_F_INDIRECT_DESC);
-        ret |= (1 << VIRTIO_F_BAD_FEATURE);
         break;
     case VIRTIO_PCI_GUEST_FEATURES:
         ret = vdev->features;
diff --git a/hw/virtio.h b/hw/virtio.h
index c441a93..cbf472b 100644
--- a/hw/virtio.h
+++ b/hw/virtio.h
@@ -167,4 +167,14 @@ VirtIODevice *virtio_net_init(DeviceState *dev);
 VirtIODevice *virtio_console_init(DeviceState *dev);
 VirtIODevice *virtio_balloon_init(DeviceState *dev);
 
+static inline uint32_t virtio_common_features(void)
+{
+    uint32_t features = 0;
+    features |= (1 << VIRTIO_F_NOTIFY_ON_EMPTY);
+    features |= (1 << VIRTIO_RING_F_INDIRECT_DESC);
+    features |= (1 << VIRTIO_F_BAD_FEATURE);
+
+    return features;
+}
+
 #endif
--
1.6.2.5
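As a standalone illustration of what the patch above changes, the snippet below composes a device's feature word from its own bits plus the common ones. It is a sketch, not the qemu code: the bit positions 24, 28 and 30 are assumed to match the virtio headers of the time, and `feature_bit`, `EXAMPLE_DEV_F_THING` and `example_dev_get_features` are hypothetical names for a made-up device.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bit positions; check them against the real headers. */
#define VIRTIO_F_NOTIFY_ON_EMPTY	24
#define VIRTIO_RING_F_INDIRECT_DESC	28
#define VIRTIO_F_BAD_FEATURE		30

/* Hypothetical device-specific feature bit. */
#define EXAMPLE_DEV_F_THING		5

static inline uint32_t feature_bit(unsigned bit)
{
	assert(bit < 32);	/* the feature word is 32 bits wide here */
	return UINT32_C(1) << bit;
}

/* Transport-independent bits that each device opts into explicitly. */
static inline uint32_t virtio_common_features(void)
{
	uint32_t features = 0;

	features |= feature_bit(VIRTIO_F_NOTIFY_ON_EMPTY);
	features |= feature_bit(VIRTIO_RING_F_INDIRECT_DESC);
	features |= feature_bit(VIRTIO_F_BAD_FEATURE);
	return features;
}

/* A device now ORs in the common bits itself, as the patch does. */
static uint32_t example_dev_get_features(void)
{
	return feature_bit(EXAMPLE_DEV_F_THING) | virtio_common_features();
}
```

A device that should not advertise, say, indirect descriptors would simply mask that bit out of the value before returning it, which is the flexibility the commit message is after.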
[PATCHv2 3/3] qemu-kvm: vhost-net implementation
This adds support for vhost-net virtio kernel backend.

To enable (assuming device eth2):
1. enable promisc mode or program guest mac in device eth2
2. disable tso, gso, lro on the card
3. add vhost=eth2 to -net flag
4. run with CAP_NET_ADMIN privilege (e.g. root)

This patch is RFC, but works without issues for me.

It still needs to be split up, tested and benchmarked properly, but posting it here in case people want to test drive the kernel bits I posted.

Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
---
 Makefile.target |    3 +-
 hw/vhost_net.c  |  181 +++
 hw/vhost_net.h  |   30 +
 hw/virtio-net.c |   32 ++-
 hw/virtio-pci.c |   40 
 hw/virtio.c     |   19 --
 hw/virtio.h     |   28 -
 net.c           |    5 ++
 net.h           |    1 +
 qemu-kvm.h      |    9 +++
 10 files changed, 324 insertions(+), 24 deletions(-)
 create mode 100644 hw/vhost_net.c
 create mode 100644 hw/vhost_net.h

diff --git a/Makefile.target b/Makefile.target
index f6d9708..e941a36 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -170,7 +170,8 @@ obj-y = vl.o osdep.o monitor.o pci.o loader.o isa_mmio.o machine.o \
         gdbstub.o gdbstub-xml.o msix.o ioport.o qemu-config.o
 # virtio has to be here due to weird dependency between PCI and virtio-net.
 # need to fix this properly
-obj-y += virtio-blk.o virtio-balloon.o virtio-net.o virtio-console.o virtio-pci.o
+obj-y += virtio-blk.o virtio-balloon.o virtio-net.o virtio-console.o virtio-pci.o \
+	vhost_net.o
 obj-$(CONFIG_KVM) += kvm.o kvm-all.o
 
 LIBS+=-lz
diff --git a/hw/vhost_net.c b/hw/vhost_net.c
new file mode 100644
index 000..7d52de0
--- /dev/null
+++ b/hw/vhost_net.c
@@ -0,0 +1,181 @@
+#include <sys/eventfd.h>
+#include <sys/socket.h>
+#include <linux/kvm.h>
+#include <fcntl.h>
+#include <sys/ioctl.h>
+#include <linux/vhost.h>
+#include <linux/virtio_ring.h>
+#include <netpacket/packet.h>
+#include <net/ethernet.h>
+#include <net/if.h>
+#include <netinet/in.h>
+
+#include <stdio.h>
+
+#include "qemu-kvm.h"
+
+#include "vhost_net.h"
+
+const char *vhost_net_device;
+
+static int vhost_virtqueue_init(struct vhost_dev *dev,
+				struct VirtIODevice *vdev,
+				struct vhost_virtqueue *vq,
+				struct VirtQueue *q,
+				unsigned idx)
+{
+	target_phys_addr_t s, l;
+	int r;
+	struct vhost_vring_addr addr = {
+		.index = idx,
+	};
+	struct vhost_vring_file file = {
+		.index = idx,
+	};
+	struct vhost_vring_state size = {
+		.index = idx,
+	};
+
+	size.num = q->vring.num;
+	r = ioctl(dev->control, VHOST_SET_VRING_NUM, &size);
+	if (r)
+		return -errno;
+
+	file.fd = vq->kick = eventfd(0, 0);
+	r = ioctl(dev->control, VHOST_SET_VRING_KICK, &file);
+	if (r)
+		return -errno;
+	file.fd = vq->call = eventfd(0, 0);
+	r = ioctl(dev->control, VHOST_SET_VRING_CALL, &file);
+	if (r)
+		return -errno;
+
+	s = l = sizeof(struct vring_desc) * q->vring.num;
+	vq->desc = cpu_physical_memory_map(q->vring.desc, &l, 0);
+	if (!vq->desc || l != s)
+		return -ENOMEM;
+	addr.user_addr = (u_int64_t)(unsigned long)vq->desc;
+	r = ioctl(dev->control, VHOST_SET_VRING_DESC, &addr);
+	if (r < 0)
+		return -errno;
+	s = l = offsetof(struct vring_avail, ring) +
+		sizeof(u_int64_t) * q->vring.num;
+	vq->avail = cpu_physical_memory_map(q->vring.avail, &l, 0);
+	if (!vq->avail || l != s)
+		return -ENOMEM;
+	addr.user_addr = (u_int64_t)(unsigned long)vq->avail;
+	r = ioctl(dev->control, VHOST_SET_VRING_AVAIL, &addr);
+	if (r < 0)
+		return -errno;
+	s = l = offsetof(struct vring_used, ring) +
+		sizeof(struct vring_used_elem) * q->vring.num;
+	vq->used = cpu_physical_memory_map(q->vring.used, &l, 1);
+	if (!vq->used || l != s)
+		return -ENOMEM;
+	addr.user_addr = (u_int64_t)(unsigned long)vq->used;
+	r = ioctl(dev->control, VHOST_SET_VRING_USED, &addr);
+	if (r < 0)
+		return -errno;
+
+    r = vdev->binding->irqfd(vdev->binding_opaque, q->vector, vq->call);
+    if (r < 0)
+        return -errno;
+
+    r = vdev->binding->queuefd(vdev->binding_opaque, idx, vq->kick);
+    if (r < 0)
+        return -errno;
+
+	return 0;
+}
+
+static int vhost_dev_init(struct vhost_dev *hdev,
+			  VirtIODevice *vdev)
+{
+	int i, r, n = 0;
+	struct vhost_memory *mem;
+	hdev->control = open("/dev/vhost-net", O_RDWR);
+	if (hdev->control < 0)
+		return -errno;
+	r = ioctl(hdev->control, VHOST_SET_OWNER, NULL);
+	if (r < 0)
+		return -errno;
+	for (i = 0; i < KVM_MAX_NUM_MEM_REGIONS;
Re: [PATCH v3 1/8] Do not call ack notifiers on PIC reset.
On Wed, Aug 12, 2009 at 03:17:15PM +0300, Gleb Natapov wrote:
> For assigned devices it may cause a host hang, since the ack notifier
> callback enables the host interrupt and the guest has not necessarily
> cleared the interrupt condition in the assigned device. For PIT we
> should not call the ack notifier here since the interrupt was not
> acked by the guest and should be redelivered.
>
> Signed-off-by: Gleb Natapov <g...@redhat.com>
> ---
>  arch/x86/kvm/i8259.c |   16 
>  1 files changed, 0 insertions(+), 16 deletions(-)
>
> diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c
> index 01f1516..eb2b8b7 100644
> --- a/arch/x86/kvm/i8259.c
> +++ b/arch/x86/kvm/i8259.c
> @@ -225,22 +225,6 @@ int kvm_pic_read_irq(struct kvm *kvm)
>  
>  void kvm_pic_reset(struct kvm_kpic_state *s)
>  {
> -	int irq, irqbase, n;
> -	struct kvm *kvm = s->pics_state->irq_request_opaque;
> -	struct kvm_vcpu *vcpu0 = kvm->bsp_vcpu;
> -
> -	if (s == &s->pics_state->pics[0])
> -		irqbase = 0;
> -	else
> -		irqbase = 8;
> -
> -	for (irq = 0; irq < PIC_NUM_PINS/2; irq++) {
> -		if (vcpu0 && kvm_apic_accept_pic_intr(vcpu0))
> -			if (s->irr & (1 << irq) || s->isr & (1 << irq)) {
> -				n = irq + irqbase;
> -				kvm_notify_acked_irq(kvm, SELECT_PIC(n), n);
> -			}
> -	}
>  	s->last_irr = 0;
>  	s->irr = 0;
>  	s->imr = 0;
> --
> 1.6.3.3

This used to be necessary to clear pending state from i8254.c
irq_acked logic. I think it'll break it.
Re: [PATCH v3 1/8] Do not call ack notifiers on PIC reset.
On Thu, Aug 13, 2009 at 06:11:05AM -0300, Marcelo Tosatti wrote:
> On Wed, Aug 12, 2009 at 03:17:15PM +0300, Gleb Natapov wrote:
> > For assigned devices it may cause a host hang, since the ack notifier
> > callback enables the host interrupt and the guest has not necessarily
> > cleared the interrupt condition in the assigned device. For PIT we
> > should not call the ack notifier here since the interrupt was not
> > acked by the guest and should be redelivered.
> >
> > Signed-off-by: Gleb Natapov <g...@redhat.com>
> > ---
> >  arch/x86/kvm/i8259.c |   16 
> >  1 files changed, 0 insertions(+), 16 deletions(-)
> >
> > diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c
> > index 01f1516..eb2b8b7 100644
> > --- a/arch/x86/kvm/i8259.c
> > +++ b/arch/x86/kvm/i8259.c
> > @@ -225,22 +225,6 @@ int kvm_pic_read_irq(struct kvm *kvm)
> >  
> >  void kvm_pic_reset(struct kvm_kpic_state *s)
> >  {
> > -	int irq, irqbase, n;
> > -	struct kvm *kvm = s->pics_state->irq_request_opaque;
> > -	struct kvm_vcpu *vcpu0 = kvm->bsp_vcpu;
> > -
> > -	if (s == &s->pics_state->pics[0])
> > -		irqbase = 0;
> > -	else
> > -		irqbase = 8;
> > -
> > -	for (irq = 0; irq < PIC_NUM_PINS/2; irq++) {
> > -		if (vcpu0 && kvm_apic_accept_pic_intr(vcpu0))
> > -			if (s->irr & (1 << irq) || s->isr & (1 << irq)) {
> > -				n = irq + irqbase;
> > -				kvm_notify_acked_irq(kvm, SELECT_PIC(n), n);
> > -			}
> > -	}
> >  	s->last_irr = 0;
> >  	s->irr = 0;
> >  	s->imr = 0;
> > --
> > 1.6.3.3
>
> This used to be necessary to clear pending state from i8254.c
> irq_acked logic. I think it'll break it.

This is just a hack then, and it does not exist in ioapic, so if it is
really needed the ioapic+pit combination is broken. But the problem
should be solved inside i8254.c, not somewhere else. Setting irq_acked
to 1 in pit_load_count() seems like the right thing to do. Something
like the patch below. Ideally pending should be scaled instead of
reset. Also, maybe the problem exists because PIC doesn't call mask
notifiers?

diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
index b857ca3..aa7f68e 100644
--- a/arch/x86/kvm/i8254.c
+++ b/arch/x86/kvm/i8254.c
@@ -325,6 +325,9 @@ static void pit_load_count(struct kvm *kvm, int channel, u32 val)
 		return;
 	}
 
+	atomic_set(&pt->pending, 0);
+	ps->irq_ack = 1;
+
 	/* Two types of timer
 	 * mode 1 is one shot, mode 2 is period, otherwise del timer */
 	switch (ps->channels[0].mode) {
--
			Gleb.
Re: [PATCH v3 7/8] Move IO APIC to its own lock.
On Thu, Aug 13, 2009 at 06:13:30AM -0300, Marcelo Tosatti wrote:
> > +++ b/virt/kvm/ioapic.c
> > @@ -182,6 +182,7 @@ int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int level)
> >  	union kvm_ioapic_redirect_entry entry;
> >  	int ret = 1;
> >  
> > +	mutex_lock(&ioapic->lock);
> >  	if (irq >= 0 && irq < IOAPIC_NUM_PINS) {
> >  		entry = ioapic->redirtbl[irq];
> >  		level ^= entry.fields.polarity;
>
> But this is an RCU critical section now, right?

Correct! Forget about that. It was spinlock, but Avi prefers mutexes.

> If so, you can't sleep, must use a spinlock.

Either that or I can collect callbacks in critical section and call
them afterwards.

--
			Gleb.
Re: [PATCH v3 7/8] Move IO APIC to its own lock.
On 08/13/2009 12:48 PM, Gleb Natapov wrote:
> On Thu, Aug 13, 2009 at 06:13:30AM -0300, Marcelo Tosatti wrote:
> > > +++ b/virt/kvm/ioapic.c
> > > @@ -182,6 +182,7 @@ int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int level)
> > >  	union kvm_ioapic_redirect_entry entry;
> > >  	int ret = 1;
> > >  
> > > +	mutex_lock(&ioapic->lock);
> > >  	if (irq >= 0 && irq < IOAPIC_NUM_PINS) {
> > >  		entry = ioapic->redirtbl[irq];
> > >  		level ^= entry.fields.polarity;
> >
> > But this is an RCU critical section now, right?
>
> Correct! Forget about that. It was spinlock, but Avi prefers mutexes.

Well, I prefer correct code to mutexes.

> > If so, you can't sleep, must use a spinlock.
>
> Either that or I can collect callbacks in critical section and call
> them afterwards.

There's also srcu.

--
error compiling committee.c: too many arguments to function
Re: [PATCH v3 7/8] Move IO APIC to its own lock.
On Thu, Aug 13, 2009 at 12:49:45PM +0300, Avi Kivity wrote:
> On 08/13/2009 12:48 PM, Gleb Natapov wrote:
> > On Thu, Aug 13, 2009 at 06:13:30AM -0300, Marcelo Tosatti wrote:
> > > > +++ b/virt/kvm/ioapic.c
> > > > @@ -182,6 +182,7 @@ int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int level)
> > > >  	union kvm_ioapic_redirect_entry entry;
> > > >  	int ret = 1;
> > > >  
> > > > +	mutex_lock(&ioapic->lock);
> > > >  	if (irq >= 0 && irq < IOAPIC_NUM_PINS) {
> > > >  		entry = ioapic->redirtbl[irq];
> > > >  		level ^= entry.fields.polarity;
> > >
> > > But this is an RCU critical section now, right?
> >
> > Correct! Forget about that. It was spinlock, but Avi prefers mutexes.
>
> Well, I prefer correct code to mutexes.
>
> > > If so, you can't sleep, must use a spinlock.
> >
> > Either that or I can collect callbacks in critical section and call
> > them afterwards.
>
> There's also srcu.

What are the disadvantages? There should be some, otherwise why not use
it all the time.

--
			Gleb.
Re: [PATCH v3 7/8] Move IO APIC to its own lock.
On 08/13/2009 01:09 PM, Gleb Natapov wrote:
> > There's also srcu.
>
> What are the disadvantages? There should be some, otherwise why not use
> it all the time.

I think it incurs an atomic op in the read path, but not much overhead
otherwise. Paul?

--
error compiling committee.c: too many arguments to function
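The "collect callbacks in the critical section and call them afterwards" option raised earlier in this thread could look roughly like the user-space sketch below. Everything in it is hypothetical: `critical_enter`/`critical_exit` merely track a flag in place of a spinlock or RCU read-side section, and `set_irq_deferred`, `pending_cbs` and `example_ack` are illustrative names, not KVM functions.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PENDING 8	/* hypothetical cap on callbacks per call */

typedef void (*ack_cb)(int irq);

struct pending_cbs {
	ack_cb fn[MAX_PENDING];
	int irq[MAX_PENDING];
	size_t n;
};

/* Stand-in for a section in which sleeping is not allowed. */
static int in_critical_section;

static void critical_enter(void)
{
	assert(!in_critical_section);
	in_critical_section = 1;
}

static void critical_exit(void)
{
	assert(in_critical_section);
	in_critical_section = 0;
}

/* Bookkeeping so the behaviour can be checked. */
static int calls_made;
static int calls_inside_critical;

/* Example callback; a real one might take a mutex or otherwise sleep. */
static void example_ack(int irq)
{
	(void)irq;
	calls_made++;
	if (in_critical_section)
		calls_inside_critical++;
}

static void set_irq_deferred(int irq, ack_cb cb)
{
	struct pending_cbs pending = { .n = 0 };
	size_t i;

	critical_enter();
	/* ... state updates that need the lock would go here ... */
	if (cb && pending.n < MAX_PENDING) {
		/* Only record the callback; do not invoke it yet. */
		pending.fn[pending.n] = cb;
		pending.irq[pending.n] = irq;
		pending.n++;
	}
	critical_exit();

	/* Outside the critical section the callbacks are free to sleep. */
	for (i = 0; i < pending.n; i++)
		pending.fn[i](pending.irq[i]);
}
```

The cost of this approach is that the callbacks observe state that may already have changed by the time they run, which is one reason SRCU is floated as an alternative.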
Re: [PATCH] Fix Makefile rule for compiling emulate.c
On 08/13/2009 12:42 AM, Mohammed Gamal wrote:
> Signed-off-by: Mohammed Gamal <m.gamal...@gmail.com>
> ---
>  arch/x86/kvm/Makefile |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
> index afaaa76..0e7fe78 100644
> --- a/arch/x86/kvm/Makefile
> +++ b/arch/x86/kvm/Makefile
> @@ -9,7 +9,7 @@ kvm-y += $(addprefix ../../../virt/kvm/, kvm_main.o ioapic.o \
>  				coalesced_mmio.o irq_comm.o eventfd.o)
>  kvm-$(CONFIG_IOMMU_API)	+= $(addprefix ../../../virt/kvm/, iommu.o)
>  
> -kvm-y += x86.o mmu.o x86_emulate.o i8259.o irq.o lapic.o \
> +kvm-y += x86.o mmu.o emulate.o i8259.o irq.o lapic.o \

Already have the same fix in my tree, thanks.

--
error compiling committee.c: too many arguments to function
Re: [PATCH 0/3] qemu-kvm: vhost net support
Michael S. Tsirkin wrote:
> On Wed, Aug 12, 2009 at 04:27:44PM -0400, Gregory Haskins wrote:
> > Michael S. Tsirkin wrote:
> > > This adds support for vhost-net virtio kernel backend.
> > >
> > > This is RFC, but works without issues for me.
> > >
> > > Still needs to be split up, tested and benchmarked properly,
> > > but posting it here in case people want to test drive the kernel
> > > bits I posted.
> >
> > This has a large degree of rejects against qemu-kvm.git/master. What
> > tree does this apply to?
> >
> > -Greg
>
> Likely that tree has advanced since.
> This is on top of commit b6bbd41fac4b6fb0efc65e083d2151ce1521f615.

Hmm...better, but I still get rejects. Of particular concern is this
one in net.c:

@@ -1903,7 +1903,7 @@ static TAPState *net_tap_init(VLANState *vlan, const char *model,
 typedef struct RAWState {
     VLANClientState *vc;
     int fd;
-    uint8_t buf[4096];
+    uint8_t buf[65000];
     int promisc;
 } RAWState;

I do not see any occurrence of RAWState in b6bbd41f (or master, for that
matter).

There is probably an operator error somewhere in here ;), but any help
getting this working is appreciated. Do you have a git tree I can pull
somewhere?

Kind Regards,
-Greg
Re: [PATCH 0/3] qemu-kvm: vhost net support
On Thu, Aug 13, 2009 at 07:35:52AM -0400, Gregory Haskins wrote:
> Michael S. Tsirkin wrote:
> > On Wed, Aug 12, 2009 at 04:27:44PM -0400, Gregory Haskins wrote:
> > > Michael S. Tsirkin wrote:
> > > > This adds support for vhost-net virtio kernel backend.
> > > >
> > > > This is RFC, but works without issues for me.
> > > >
> > > > Still needs to be split up, tested and benchmarked properly,
> > > > but posting it here in case people want to test drive the kernel
> > > > bits I posted.
> > >
> > > This has a large degree of rejects against qemu-kvm.git/master.
> > > What tree does this apply to?
> > >
> > > -Greg
> >
> > Likely that tree has advanced since.
> > This is on top of commit b6bbd41fac4b6fb0efc65e083d2151ce1521f615.
>
> Hmm...better, but I still get rejects. Of particular concern is this
> one in net.c:
>
> @@ -1903,7 +1903,7 @@ static TAPState *net_tap_init(VLANState *vlan, const char *model,
>  typedef struct RAWState {
>      VLANClientState *vc;
>      int fd;
> -    uint8_t buf[4096];
> +    uint8_t buf[65000];
>      int promisc;
>  } RAWState;
>
> I do not see any occurrence of RAWState in b6bbd41f (or master, for
> that matter).
>
> There is probably an operator error somewhere in here ;),

Yes. Mine :)

> but any help getting this working is appreciated.

I reposted a clean one which is against latest bits earlier today. Look
for PATCHv2 in your inbox.

> Do you have a git tree I can pull somewhere?
>
> Kind Regards,
> -Greg

Thanks for the patience,

--
MST
Re: qemu-kvm segfaults in qemu_del_timer (0.10.5 and 0.10.6)
Chris Webb <ch...@arachsys.com> writes:

> Avi Kivity <a...@redhat.com> writes:
>
> > I understand it's hard, but it's nearly impossible to work out the
> > problem from so little data, so please do make the effort to obtain
> > dumps.
>
> We're trying for this at the moment, but since we can't change the
> rlimit for the running qemu-kvm processes (?), we'll have to wait until
> one of the new ones dies, which may take some time. I'll follow up when
> I do have something.

We've been lucky and relatively quickly got a core dump from one of the new qemu-kvms with the non-zero core file rlimit. A backtrace looks like this:

(gdb) bt
#0  0x004068f7 in qemu_mod_timer (ts=0x30d1f30, expire_time=430489) at /packages/qemu-kvm/src-f39tF1/vl.c:1161
#1  0x00495dd5 in vnc_update_client (opaque=<value optimized out>) at vnc.c:765
#2  0x004081da in main_loop_wait (timeout=<value optimized out>) at /packages/qemu-kvm/src-f39tF1/vl.c:1240
#3  0x0051613a in kvm_main_loop () at /packages/qemu-kvm/src-f39tF1/qemu-kvm.c:596
#4  0x0040c7b7 in main (argc=<value optimized out>, argv=<value optimized out>, envp=<value optimized out>) at /packages/qemu-kvm/src-f39tF1/vl.c:3850

The segfault appears to be a null pointer dereference. ts->clock is NULL and line 1161 uses ts->clock->type:

(gdb) p ts
$4 = (QEMUTimer *) 0x30d1f30
(gdb) p ts->clock
$5 = (QEMUClock *) 0x0

The VncState in vnc_update_client is as follows:

(gdb) f 1
#1  0x00495dd5 in vnc_update_client (opaque=<value optimized out>) at vnc.c:765
765	    qemu_mod_timer(vs->timer, qemu_get_clock(rt_clock) + VNC_REFRESH_INTERVAL);
(gdb) p *vs
$12 = {timer = 0x30d1f30, csock = -986235208, ds = 0x0, vd = 0x0, need_update = 1,
  dirty_row = {{0, 0, 4294967295, 4294967295} <repeats 768 times>,
    {4294967295, 4294967295, 4294967295, 4294967295} <repeats 1280 times>},
  old_data = 0x7f9b8276f010 <Address 0x7f9b8276f010 out of bounds>, features = 98,
  absolute = 1, last_x = -1, last_y = -1, vnc_encoding = 5, tight_quality = 6 '\006',
  tight_compression = 1 '\001', major = 3, minor = 3,
  challenge = "\032\314i\257\302t1(\320\312\263\024pH\226",
  output = {capacity = 1545078, offset = 684, buffer = 0x3107860 ""},
  input = {capacity = 5120, offset = 0, buffer = 0x3106450 "\020\220(\003"},
  write_pixels = 0x490b50 <vnc_write_pixels_generic>,
  send_hextile_tile = 0x492030 <send_hextile_tile_generic_32>,
  clientds = {flags = 0 '\0', width = 800, height = 600, linesize = 3200,
    data = 0x7f9b82944010 <Address 0x7f9b82944010 out of bounds>,
    pf = {bits_per_pixel = 32 ' ', bytes_per_pixel = 4 '\004', depth = 24 '\030',
      rmask = 0, gmask = 0, bmask = 0, amask = 0, rshift = 16 '\020', gshift = 8 '\b',
      bshift = 0 '\0', ashift = 24 '\030', rmax = 255 '\377', gmax = 255 '\377',
      bmax = 255 '\377', amax = 255 '\377', rbits = 8 '\b', gbits = 8 '\b',
      bbits = 8 '\b', abits = 8 '\b'}},
  serverds = {flags = 2 '\002', width = 1024, height = 768, linesize = 4096,
    data = 0x7f9b8246e010 "",
    pf = {bits_per_pixel = 32 ' ', bytes_per_pixel = 4 '\004', depth = 24 '\030',
      rmask = 16711680, gmask = 65280, bmask = 255, amask = 0, rshift = 16 '\020',
      gshift = 8 '\b', bshift = 0 '\0', ashift = 24 '\030', rmax = 255 '\377',
      gmax = 255 '\377', bmax = 255 '\377', amax = 255 '\377', rbits = 8 '\b',
      gbits = 8 '\b', bbits = 8 '\b', abits = 8 '\b'}},
  audio_cap = 0x0, as = {freq = 44100, nchannels = 2, fmt = AUD_FMT_S16, endianness = 0},
  read_handler = 0x494b40 <protocol_client_msg>, read_handler_expect = 1,
  modifiers_state = '\0' <repeats 255 times>,
  zlib = {capacity = 0, offset = 0, buffer = 0x0},
  zlib_tmp = {capacity = 0, offset = 0, buffer = 0x0},
  zlib_stream = {{next_in = 0x0, avail_in = 0, total_in = 0, next_out = 0x0,
      avail_out = 0, total_out = 0, msg = 0x0, state = 0x0, zalloc = 0, zfree = 0,
      opaque = 0x0, data_type = 0, adler = 0, reserved = 0},
    {next_in = 0x0, avail_in = 0, total_in = 0, next_out = 0x0,
      avail_out = 0, total_out = 0, msg = 0x0, state = 0x0, zalloc = 0, zfree = 0,
      opaque = 0x0, data_type = 0, adler = 0, reserved = 0},
    {next_in = 0x0, avail_in = 0, total_in = 0, next_out = 0x0,
      avail_out = 0, total_out = 0, msg = 0x0, state = 0x0, zalloc = 0, zfree = 0,
      opaque = 0x0, data_type = 0, adler = 0, reserved = 0},
    {next_in = 0x0, avail_in = 0, total_in = 0, next_out = 0x0,
      avail_out = 0, total_out = 0, msg = 0x0, state = 0x0, zalloc = 0, zfree = 0,
      opaque = 0x0, data_type = 0, adler = 0, reserved = 0}},
  next = 0x0}

I'm afraid I only have one of these, so I can't say whether the other segfaults were exactly the same or different (other than knowing the source line matched), but I'll keep my eye out for more core dumps.

qemu-kvm command line for this guest would have been

qemu-kvm -m 1024 -smp 1
Re: qemu-kvm segfaults in qemu_del_timer (0.10.5 and 0.10.6)
Chris Webb ch...@arachsys.com writes:

> The segfault appears to be a null pointer dereference. ts->clock is NULL and
> line 1161 uses ts->clock->type:
>
> (gdb) p ts
> $4 = (QEMUTimer *) 0x30d1f30
> (gdb) p ts->clock
> $5 = (QEMUClock *) 0x0

Sorry, meant to paste this too:

(gdb) p *ts
$1 = {clock = 0x0, expire_time = 49, cb = 0x2b63630, opaque = 0x30fe000, next = 0x495b40}

Cheers,
Chris.
Re: qemu-kvm segfaults in qemu_del_timer (0.10.5 and 0.10.6)
On 08/13/2009 03:23 PM, Chris Webb wrote:

> We've been lucky and relatively quickly got a core dump from one of the new
> qemu-kvms with the non-zero core file rlimit. A backtrace looks like this:
>
> (gdb) bt
> #0 0x004068f7 in qemu_mod_timer (ts=0x30d1f30, expire_time=430489) at /packages/qemu-kvm/src-f39tF1/vl.c:1161
> #1 0x00495dd5 in vnc_update_client (opaque=<value optimized out>) at vnc.c:765
> #2 0x004081da in main_loop_wait (timeout=<value optimized out>) at /packages/qemu-kvm/src-f39tF1/vl.c:1240
> #3 0x0051613a in kvm_main_loop () at /packages/qemu-kvm/src-f39tF1/qemu-kvm.c:596
> #4 0x0040c7b7 in main (argc=<value optimized out>, argv=<value optimized out>, envp=<value optimized out>) at /packages/qemu-kvm/src-f39tF1/vl.c:3850
>
> The segfault appears to be a null pointer dereference. ts->clock is NULL and
> line 1161 uses ts->clock->type:
>
> (gdb) p ts
> $4 = (QEMUTimer *) 0x30d1f30
> (gdb) p ts->clock
> $5 = (QEMUClock *) 0x0
>
> The VncState in vnc_update_client is as follows:
>
> (gdb) f 1
> #1 0x00495dd5 in vnc_update_client (opaque=<value optimized out>) at vnc.c:765
> 765 qemu_mod_timer(vs->timer, qemu_get_clock(rt_clock) + VNC_REFRESH_INTERVAL);
> (gdb) p *vs
> $12 = {timer = 0x30d1f30, csock = -986235208,

csock looks corrupted; it should be -1 or an fd. Was a vnc client connected? Was the guest playing with the display resolution?

> ds = 0x0, vd = 0x0, need_update = 1, dirty_row = {{0, 0, 4294967295,
> 4294967295} <repeats 768 times>, {4294967295, 4294967295, 4294967295,
> 4294967295} <repeats 1280 times>}, old_data = 0x7f9b8276f010 <Address
> 0x7f9b8276f010 out of bounds>,

old_data is also corrupted according to gdb, though it seems sane.

-- 
error compiling committee.c: too many arguments to function
Re: qemu-kvm segfaults in qemu_del_timer (0.10.5 and 0.10.6)
Avi Kivity a...@redhat.com writes:

> csock looks corrupted, should be -1 or an fd. Was a vnc client connected?
> Was the guest playing with the display resolution?

Yes, I think in this case there was a vncviewer connected, and the guest had started booting up into Windows, which changes the resolution a couple of times.

Best wishes,
Chris.
Re: qemu-kvm segfaults in qemu_del_timer (0.10.5 and 0.10.6)
Chris Webb ch...@arachsys.com writes:

> Avi Kivity a...@redhat.com writes:
>
> > csock looks corrupted, should be -1 or an fd. Was a vnc client connected?
> > Was the guest playing with the display resolution?
>
> Yes, I think in this case there was a vncviewer connected, and the guest had
> started booting up into windows, which changes the resolution a couple of
> times.

Also, I think the vncviewer might actually have been disconnecting at about the time the segfault happened.

Cheers,
Chris.
Re: qemu-kvm segfaults in qemu_del_timer (0.10.5 and 0.10.6)
On 08/13/2009 03:45 PM, Chris Webb wrote:

> Chris Webb ch...@arachsys.com writes:
>
> > Avi Kivity a...@redhat.com writes:
> >
> > > csock looks corrupted, should be -1 or an fd. Was a vnc client
> > > connected? Was the guest playing with the display resolution?
> >
> > Yes, I think in this case there was a vncviewer connected, and the guest
> > had started booting up into windows, which changes the resolution a couple
> > of times.
>
> Also, I think the vncviewer might actually have been disconnecting at about
> the time the segfault happened.

The master branch has a patch that fixes a use-after-free when disconnecting. Unfortunately it doesn't port cleanly to stable-0.10.

-- 
error compiling committee.c: too many arguments to function
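A minimal sketch of the ordering problem discussed in this thread: a periodic timer callback that can fire against client state already torn down on disconnect. This is a toy model and not QEMU's code. `toy_timer`, `toy_client`, `toy_client_new`, `toy_client_disconnect` and `toy_run_timers` are all hypothetical names, and the actual fix in master may look quite different. The only point is that the pending timer is disarmed before the state it points at is freed.

```c
#include <stdlib.h>

/* Toy one-shot timer (hypothetical; stands in for a QEMUTimer-like object). */
struct toy_timer {
    int pending;                 /* non-zero while armed */
    void (*cb)(void *opaque);    /* callback to run on expiry */
    void *opaque;                /* state handed to the callback */
};

/* Toy per-client state, loosely modelled on a VNC client. */
struct toy_client {
    struct toy_timer *timer;
    int updates;                 /* number of refreshes performed */
};

/* Timer callback: do one refresh and re-arm, as a periodic update would. */
static void toy_client_update(void *opaque)
{
    struct toy_client *c = opaque;

    c->updates++;
    c->timer->pending = 1;
}

/* Allocate a client and arm its refresh timer. Returns NULL on allocation failure. */
struct toy_client *toy_client_new(struct toy_timer *t)
{
    struct toy_client *c = calloc(1, sizeof(*c));

    if (!c)
        return NULL;
    c->timer = t;
    t->cb = toy_client_update;
    t->opaque = c;
    t->pending = 1;
    return c;
}

/* Disconnect: disarm the timer first so the callback cannot run on freed memory. */
void toy_client_disconnect(struct toy_client *c)
{
    c->timer->pending = 0;
    c->timer->opaque = NULL;
    free(c);
}

/* One main-loop step: fire the callback only if still armed. Returns 1 if it fired. */
int toy_run_timers(struct toy_timer *t)
{
    if (!t->pending)
        return 0;
    t->pending = 0;
    t->cb(t->opaque);
    return 1;
}
```

If `toy_client_disconnect()` freed the client without clearing `pending`, the next `toy_run_timers()` call would dereference freed memory, which is the general shape of a use-after-free on disconnect.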
Re: [PATCH 2/2] vhost_net: a kernel-level virtio server
On Wednesday 12 August 2009, Anthony Liguori wrote:

> At any rate, I'd like to see performance results before we consider trying
> to reuse virtio code.

Yes, I agree. I'd also like to do more work on the macvlan extensions to see if it works out without involving a socket. Passing the socket into the vhost_net device is a nice feature of the current implementation that we'd have to give up for something else (e.g. making the vhost a real network interface that you can hook up to a bridge) if it were to use virtio.

Unless I can come up with a solution that is clearly superior, I'm taking back my objections on that part for now.

Arnd
Re: [PATCH 2/2] vhost_net: a kernel-level virtio server
On Thursday 13 August 2009, Arnd Bergmann wrote:

> Unfortunately, this also implies that you could no longer simply use the
> packet socket interface as you do currently, as I realized only now. This
> obviously has a significant impact on your user space interface.

Also, if we do the copy in the transport, it definitely means that we can't get to zero-copy RX/TX from guest space any more. The current vhost_net driver doesn't do that yet, but could be extended in the same way that I'm hoping to do it for macvtap.

Arnd
Re: [PATCH 2/2] vhost_net: a kernel-level virtio server
On Thursday 13 August 2009, Michael S. Tsirkin wrote:

> On Wed, Aug 12, 2009 at 07:59:47PM +0200, Arnd Bergmann wrote:
>
> > The trick is to swap the virtqueues instead. virtio-net is actually mostly
> > symmetric in just the same way that the physical wires on a twisted pair
> > ethernet are symmetric (I like how that analogy fits).
>
> You need to really squint hard for it to look symmetric. For example, for
> RX, virtio allocates an skb, puts a descriptor on a ring and waits for host
> to fill it in. Host system can not do the same: guest does not have access
> to host memory. You can do a copy in transport to hide this fact, but it
> will kill performance.

Yes, that is what I was suggesting all along. The actual copy operation has to be done by the host transport, which is obviously different from the guest transport that just calls the host using vring_kick().

Right now, the number of copy operations in your code is the same. You are doing the copy a little bit later in skb_copy_datagram_iovec(), which is indeed a very nice hack. Changing to a virtqueue based method would imply that the host needs to add each skb_frag_t to its outbound virtqueue, which then gets copied into the guest's inbound virtqueue.

Unfortunately, this also implies that you could no longer simply use the packet socket interface as you do currently, as I realized only now. This obviously has a significant impact on your user space interface.

Arnd
[PATCH] Documentation: Update KVM list email address
The KVM list moved to vger.kernel.org last year

Signed-off-by: Amit Shah amit.s...@redhat.com
---
 Documentation/ioctl/ioctl-number.txt | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt
index 1f779a2..a039cb0 100644
--- a/Documentation/ioctl/ioctl-number.txt
+++ b/Documentation/ioctl/ioctl-number.txt
@@ -189,7 +189,7 @@ Code	Seq#	Include File	Comments
 0xAD	00	Netfilter device	in development:
 				mailto:ru...@rustcorp.com.au
 0xAE	all	linux/kvm.h	Kernel-based Virtual Machine
-				mailto:kvm-de...@lists.sourceforge.net
+				mailto:kvm@vger.kernel.org
 0xB0	all	RATIO devices	in development:
 				mailto:v...@ratio.de
 0xB1	00-1F	PPPoX	mailto:mostr...@styx.uwaterloo.ca
-- 
1.6.2.5
Re: [PATCH 2/2] vhost_net: a kernel-level virtio server
On Thu, Aug 13, 2009 at 03:38:43PM +0200, Arnd Bergmann wrote:

> On Thursday 13 August 2009, Michael S. Tsirkin wrote:
>
> > On Wed, Aug 12, 2009 at 07:59:47PM +0200, Arnd Bergmann wrote:
> >
> > > The trick is to swap the virtqueues instead. virtio-net is actually
> > > mostly symmetric in just the same way that the physical wires on a
> > > twisted pair ethernet are symmetric (I like how that analogy fits).
> >
> > You need to really squint hard for it to look symmetric. For example, for
> > RX, virtio allocates an skb, puts a descriptor on a ring and waits for
> > host to fill it in. Host system can not do the same: guest does not have
> > access to host memory. You can do a copy in transport to hide this fact,
> > but it will kill performance.
>
> Yes, that is what I was suggesting all along. The actual copy operation has
> to be done by the host transport, which is obviously different from the
> guest transport that just calls the host using vring_kick().
>
> Right now, the number of copy operations in your code is the same. You are
> doing the copy a little bit later in skb_copy_datagram_iovec(), which is
> indeed a very nice hack. Changing to a virtqueue based method would imply
> that the host needs to add each skb_frag_t to its outbound virtqueue, which
> then gets copied into the guests inbound virtqueue.

Which is a lot more code than just calling skb_copy_datagram_iovec.

> Unfortunately, this also implies that you could no longer simply use the
> packet socket interface as you do currently, as I realized only now. This
> obviously has a significant impact on your user space interface.
>
> Arnd

And it will remove our ability to implement zero copy down the road (when raw sockets support it).

-- 
MST
Re: [PATCH 2/2] vhost_net: a kernel-level virtio server
On Thu, Aug 13, 2009 at 03:48:35PM +0200, Arnd Bergmann wrote:

> On Thursday 13 August 2009, Arnd Bergmann wrote:
>
> > Unfortunately, this also implies that you could no longer simply use the
> > packet socket interface as you do currently, as I realized only now. This
> > obviously has a significant impact on your user space interface.
>
> Also, if we do the copy in the transport, it definitely means that we can't
> get to zero-copy RX/TX from guest space any more. The current vhost_net
> driver doesn't do that yet, but could be extended in the same way that I'm
> hoping to do it for macvtap.
>
> Arnd

The best way to do this IMO would be to add zero copy support to raw sockets; vhost will then get it basically for free.

-- 
MST
Re: [PATCH 2/2] vhost_net: a kernel-level virtio server
On Thursday 13 August 2009, Michael S. Tsirkin wrote:

> The best way to do this IMO would be to add zero copy support to raw
> sockets, vhost will then get it basically for free.

Yes, that would be nice. I wonder if that could lead to security problems on TX though. I guess it will only bring significant performance improvements if we leave the data writable in the user space or guest during the operation. If the user finds the right timing, it could modify the frame headers after they have been checked using netfilter, or while the frames are being consumed in the kernel (e.g. an NFS server running in a guest).

Arnd
Re: [PATCH 2/2] vhost_net: a kernel-level virtio server
On Thursday 13 August 2009, Michael S. Tsirkin wrote:

> > Right now, the number of copy operations in your code is the same. You are
> > doing the copy a little bit later in skb_copy_datagram_iovec(), which is
> > indeed a very nice hack. Changing to a virtqueue based method would imply
> > that the host needs to add each skb_frag_t to its outbound virtqueue,
> > which then gets copied into the guests inbound virtqueue.
>
> Which is a lot more code than just calling skb_copy_datagram_iovec.

Well, I don't see this part as much of a problem, because the code already exists in virtio_net. If we really wanted to go down that road, just using virtio_net would solve the problem of frame handling entirely, but create new problems elsewhere, as we have mentioned.

Arnd
Re: [PATCH 2/2] vhost_net: a kernel-level virtio server
On Thu, Aug 13, 2009 at 04:58:06PM +0200, Arnd Bergmann wrote:

> On Thursday 13 August 2009, Michael S. Tsirkin wrote:
>
> > > Right now, the number of copy operations in your code is the same. You
> > > are doing the copy a little bit later in skb_copy_datagram_iovec(),
> > > which is indeed a very nice hack. Changing to a virtqueue based method
> > > would imply that the host needs to add each skb_frag_t to its outbound
> > > virtqueue, which then gets copied into the guests inbound virtqueue.
> >
> > Which is a lot more code than just calling skb_copy_datagram_iovec.
>
> Well, I don't see this part as much of a problem, because the code already
> exists in virtio_net.

I am talking about the copying done in the low-level transport here.

> If we really wanted to go down that road, just using virtio_net would solve
> the problem of frame handling entirely, but create new problems elsewhere,
> as we have mentioned.
>
> Arnd
Re: [PATCH v3 7/8] Move IO APIC to its own lock.
On Thu, Aug 13, 2009 at 01:44:06PM +0300, Avi Kivity wrote:

> On 08/13/2009 01:09 PM, Gleb Natapov wrote:
>
> > > There's also srcu.
> >
> > What are the disadvantages? There should be some, otherwise why not use
> > it all the time.
>
> I think it incurs an atomic op in the read path, but not much overhead
> otherwise. Paul?

There are no atomic operations in srcu_read_lock():

	int srcu_read_lock(struct srcu_struct *sp)
	{
		int idx;

		preempt_disable();
		idx = sp->completed & 0x1;
		barrier();  /* ensure compiler looks -once- at sp->completed. */
		per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]++;
		srcu_barrier();  /* ensure compiler won't misorder critical section. */
		preempt_enable();
		return idx;
	}

There is a preempt_disable() and a preempt_enable(), which non-atomically manipulate a field in the thread_info structure. There is a barrier() and an srcu_barrier(), which are just compiler directives (no code generated). Other than that, simple arithmetic and array accesses. Shouldn't even be any cache misses in the common case (the uncommon case being where synchronize_srcu() is executing on some other CPU).

There is even less in srcu_read_unlock():

	void srcu_read_unlock(struct srcu_struct *sp, int idx)
	{
		preempt_disable();
		srcu_barrier();  /* ensure compiler won't misorder critical section. */
		per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]--;
		preempt_enable();
	}

So SRCU should have pretty low overhead. And, as with other forms of RCU, legal use of the read-side primitives cannot possibly participate in deadlocks.

So, to answer the question above, what are the disadvantages?

o	On the update side, synchronize_srcu() does take some time, mostly
	blocking in synchronize_sched(). So, like other forms of RCU, you
	would use SRCU in read-mostly situations.

o	Just as with RCU, reads and updates run concurrently, with all the
	good and bad that this implies. For an example of the good,
	srcu_read_lock() executes deterministically, no blocking or spinning.
	For an example of the bad, there is no way to shut down SRCU readers.
	These are opposite sides of the same coin. ;-)

o	Although srcu_read_lock() and srcu_read_unlock() are lightweight,
	they are expensive compared to other forms of RCU.

o	In contrast to other forms of RCU, SRCU requires that the return
	value from srcu_read_lock() be passed into srcu_read_unlock().
	Usually not a problem, but does place another constraint on the code.

Please keep in mind that I have no idea about what you are thinking of using SRCU for, so the above advice is necessarily quite generic. ;-)

							Thanx, Paul
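To make the last point concrete, here is a single-threaded userspace model of the counter scheme in the two functions quoted above. It is a sketch under loud assumptions: `toy_srcu`, `toy_srcu_read_lock`, `toy_srcu_read_unlock`, `toy_srcu_flip` and `toy_srcu_readers_active` are made-up names, there is one counter pair instead of per-CPU counters, and preemption control and barriers are left out entirely, so it says nothing about real concurrency. It only illustrates why the index returned by the lock has to be handed back to the unlock.

```c
/* Toy stand-in for struct srcu_struct: one counter pair instead of per-CPU data. */
struct toy_srcu {
    unsigned long completed;     /* low bit selects the active counter */
    int c[2];                    /* readers currently inside each epoch */
};

/* Enter a read-side section; returns the index the caller must pass to unlock. */
int toy_srcu_read_lock(struct toy_srcu *sp)
{
    int idx = sp->completed & 0x1;

    sp->c[idx]++;
    return idx;
}

/* Leave a read-side section that was entered under the given index. */
void toy_srcu_read_unlock(struct toy_srcu *sp, int idx)
{
    sp->c[idx]--;
}

/* Updater side of the model: flip the epoch so new readers use the other counter. */
int toy_srcu_flip(struct toy_srcu *sp)
{
    int old = sp->completed & 0x1;

    sp->completed++;
    return old;                  /* index whose readers the updater must wait for */
}

/* Non-zero while readers that entered under index idx are still running. */
int toy_srcu_readers_active(const struct toy_srcu *sp, int idx)
{
    return sp->c[idx] != 0;
}
```

A reader that locked before the flip holds index 0, while one that locked after it holds index 1. If the first reader decremented the "current" counter instead of the one it was given, the updater's wait on the old index would never see its count drop to zero.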
Re: [PATCH 2/2] vhost_net: a kernel-level virtio server
On 08/13/2009 05:53 PM, Arnd Bergmann wrote:

> On Thursday 13 August 2009, Michael S. Tsirkin wrote:
>
> > The best way to do this IMO would be to add zero copy support to raw
> > sockets, vhost will then get it basically for free.
>
> Yes, that would be nice. I wonder if that could lead to security problems on
> TX though. I guess it will only bring significant performance improvements
> if we leave the data writable in the user space or guest during the
> operation. If the user finds the right timing, it could modify the frame
> headers after they have been checked using netfilter, or while the frames
> are being consumed in the kernel (e.g. an NFS server running in a guest).

IIRC when the kernel consumes data it linearizes the skb. We just need to make sure all the zerocopy data is in the nonlinear part, and the kernel will copy if/when it needs to access packet data.

-- 
error compiling committee.c: too many arguments to function
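The general idea of copying before inspecting can be sketched in userspace. This is not the kernel's skb code; `TOY_HDR_LEN`, `toy_hdr_allowed` and `toy_accept_frame` are hypothetical names, and the policy check (accept only EtherType 0x0800) is an arbitrary stand-in for a netfilter-style rule. The sketch takes one snapshot of the header bytes from memory the sender can still write to, then checks and uses only the snapshot.

```c
#include <stddef.h>

#define TOY_HDR_LEN 14   /* Ethernet header length in bytes */

/* Hypothetical policy check: accept only IPv4 frames (EtherType 0x0800). */
static int toy_hdr_allowed(const unsigned char *hdr)
{
    return hdr[12] == 0x08 && hdr[13] == 0x00;
}

/*
 * Snapshot the header out of shared memory, then check only the snapshot.
 * Returns 1 and leaves the checked header in hdr_out if the frame is
 * accepted, 0 otherwise. Later writes to 'shared' cannot change hdr_out.
 */
int toy_accept_frame(const volatile unsigned char *shared, size_t len,
                     unsigned char hdr_out[TOY_HDR_LEN])
{
    size_t i;

    if (len < TOY_HDR_LEN)
        return 0;
    for (i = 0; i < TOY_HDR_LEN; i++)
        hdr_out[i] = shared[i];      /* each header byte is fetched once */
    return toy_hdr_allowed(hdr_out);
}
```

Any code that consumes the header afterwards should read `hdr_out`, not `shared`, so a sender who rewrites the shared buffer after the check has no effect on what was validated.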
[PATCH] Move irq sharing information to irqchip level.
This removes the assumption that the maximum GSI is smaller than the number of pins. Sharing is tracked at pin level, not GSI level.

Signed-off-by: Gleb Natapov g...@redhat.com
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index b17d845..4c15bdd 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -413,7 +413,6 @@ struct kvm_arch{
 	gpa_t ept_identity_map_addr;
 
 	unsigned long irq_sources_bitmap;
-	unsigned long irq_states[KVM_IOAPIC_NUM_PINS];
 	u64 vm_init_tsc;
 };
 
diff --git a/arch/x86/kvm/irq.h b/arch/x86/kvm/irq.h
index 7d6058a..c025a23 100644
--- a/arch/x86/kvm/irq.h
+++ b/arch/x86/kvm/irq.h
@@ -71,6 +71,7 @@ struct kvm_pic {
 	int output;		/* intr from master PIC */
 	struct kvm_io_device dev;
 	void (*ack_notifier)(void *opaque, int irq);
+	unsigned long irq_states[16];
 };
 
 struct kvm_pic *kvm_create_pic(struct kvm *kvm);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index f814512..beab24b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -121,7 +121,7 @@ struct kvm_kernel_irq_routing_entry {
 	u32 gsi;
 	u32 type;
 	int (*set)(struct kvm_kernel_irq_routing_entry *e,
-		   struct kvm *kvm, int level);
+		   struct kvm *kvm, int irq_source_id, int level);
 	union {
 		struct {
 			unsigned irqchip;
diff --git a/virt/kvm/ioapic.h b/virt/kvm/ioapic.h
index 7080b71..6e461ad 100644
--- a/virt/kvm/ioapic.h
+++ b/virt/kvm/ioapic.h
@@ -41,6 +41,7 @@ struct kvm_ioapic {
 	u32 irr;
 	u32 pad;
 	union kvm_ioapic_redirect_entry redirtbl[IOAPIC_NUM_PINS];
+	unsigned long irq_states[IOAPIC_NUM_PINS];
 	struct kvm_io_device dev;
 	struct kvm *kvm;
 	void (*ack_notifier)(void *opaque, int irq);
diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
index 001663f..11aa702 100644
--- a/virt/kvm/irq_comm.c
+++ b/virt/kvm/irq_comm.c
@@ -31,20 +31,39 @@
 
 #include "ioapic.h"
 
+static inline int kvm_irq_line_state(unsigned long *irq_state,
+				     int irq_source_id, int level)
+{
+	/* Logical OR for level trig interrupt */
+	if (level)
+		set_bit(irq_source_id, irq_state);
+	else
+		clear_bit(irq_source_id, irq_state);
+
+	return !!(*irq_state);
+}
+
 static int kvm_set_pic_irq(struct kvm_kernel_irq_routing_entry *e,
-			   struct kvm *kvm, int level)
+			   struct kvm *kvm, int irq_source_id, int level)
 {
 #ifdef CONFIG_X86
-	return kvm_pic_set_irq(pic_irqchip(kvm), e->irqchip.pin, level);
+	struct kvm_pic *pic = pic_irqchip(kvm);
+	level = kvm_irq_line_state(&pic->irq_states[e->irqchip.pin],
+				   irq_source_id, level);
+	return kvm_pic_set_irq(pic, e->irqchip.pin, level);
 #else
 	return -1;
 #endif
 }
 
 static int kvm_set_ioapic_irq(struct kvm_kernel_irq_routing_entry *e,
-			      struct kvm *kvm, int level)
+			      struct kvm *kvm, int irq_source_id, int level)
 {
-	return kvm_ioapic_set_irq(kvm->arch.vioapic, e->irqchip.pin, level);
+	struct kvm_ioapic *ioapic = kvm->arch.vioapic;
+	level = kvm_irq_line_state(&ioapic->irq_states[e->irqchip.pin],
+				   irq_source_id, level);
+
+	return kvm_ioapic_set_irq(ioapic, e->irqchip.pin, level);
 }
 
 inline static bool kvm_is_dm_lowest_prio(struct kvm_lapic_irq *irq)
@@ -96,10 +115,13 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
 }
 
 static int kvm_set_msi(struct kvm_kernel_irq_routing_entry *e,
-		       struct kvm *kvm, int level)
+		       struct kvm *kvm, int irq_source_id, int level)
 {
 	struct kvm_lapic_irq irq;
 
+	if (!level)
+		return -1;
+
 	trace_kvm_msi_set_irq(e->msi.address_lo, e->msi.data);
 
 	irq.dest_id = (e->msi.address_lo &
@@ -125,34 +147,19 @@ static int kvm_set_msi(struct kvm_kernel_irq_routing_entry *e,
 int kvm_set_irq(struct kvm *kvm, int irq_source_id, int irq, int level)
 {
 	struct kvm_kernel_irq_routing_entry *e;
-	unsigned long *irq_state, sig_level;
 	int ret = -1;
 
 	trace_kvm_set_irq(irq, level, irq_source_id);
 
 	WARN_ON(!mutex_is_locked(&kvm->irq_lock));
 
-	if (irq < KVM_IOAPIC_NUM_PINS) {
-		irq_state = (unsigned long *)&kvm->arch.irq_states[irq];
-
-		/* Logical OR for level trig interrupt */
-		if (level)
-			set_bit(irq_source_id, irq_state);
-		else
-			clear_bit(irq_source_id, irq_state);
-		sig_level = !!(*irq_state);
-	} else if (!level)
-		return ret;
-	else /* Deal with MSI/MSI-X */
-		sig_level = 1;
-
 	/* Not possible
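The sharing rule that this patch moves into the irqchips is small enough to model in userspace. The sketch below mirrors the logic of kvm_irq_line_state() from the diff, using plain bit operations instead of the kernel's set_bit()/clear_bit(). `toy_irq_line_state` is a hypothetical name, and unlike the kernel helpers this version is not atomic.

```c
/*
 * Userspace model of the per-pin sharing state: each bit of *irq_state
 * is one interrupt source, and the pin stays asserted while any bit is set.
 */
int toy_irq_line_state(unsigned long *irq_state, int irq_source_id, int level)
{
    /* Logical OR across sources for a level-triggered line. */
    if (level)
        *irq_state |= 1UL << irq_source_id;
    else
        *irq_state &= ~(1UL << irq_source_id);

    return *irq_state != 0;
}
```

With two sources sharing one pin, the line only drops once both sources have lowered it, which is why the state has to live per pin rather than per GSI.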
[PATCHv3 0/2] vhost: a kernel-level virtio server
This implements vhost: a kernel-level backend for virtio. The main motivation for this work is to reduce virtualization overhead for virtio by removing system calls on the data path, without guest changes. For virtio-net, this removes up to 4 system calls per packet: vm exit for kick, reentry for kick, iothread wakeup for packet, interrupt injection for packet.

A more detailed description is attached to the patch itself.

The patches are against 2.6.31-rc4. I'd like them to go into linux-next and, down the road, 2.6.32 if possible. Please comment.

Changelog from v2:
- Comments on RCU usage
- Compat ioctl support
- Make variable static
- Copied more idiomatic English from Rusty

Changes from v1:
- Move use_mm/unuse_mm from fs/aio.c to mm instead of copying.
- Reorder code to avoid need for forward declarations
- Kill a couple of debugging printks

Michael S. Tsirkin (2):
  mm: export use_mm/unuse_mm to modules
  vhost_net: a kernel-level virtio server

 MAINTAINERS                 | 10 +
 arch/x86/kvm/Kconfig        | 1 +
 drivers/Makefile            | 1 +
 drivers/vhost/Kconfig       | 11 +
 drivers/vhost/Makefile      | 2 +
 drivers/vhost/net.c         | 429
 drivers/vhost/vhost.c       | 663 +++
 drivers/vhost/vhost.h       | 108 +++
 fs/aio.c                    | 47 +---
 include/linux/Kbuild        | 1 +
 include/linux/miscdevice.h  | 1 +
 include/linux/mmu_context.h | 9 +
 include/linux/vhost.h       | 100 +++
 mm/Makefile                 | 2 +-
 mm/mmu_context.c            | 58
 15 files changed, 1396 insertions(+), 47 deletions(-)
 create mode 100644 drivers/vhost/Kconfig
 create mode 100644 drivers/vhost/Makefile
 create mode 100644 drivers/vhost/net.c
 create mode 100644 drivers/vhost/vhost.c
 create mode 100644 drivers/vhost/vhost.h
 create mode 100644 include/linux/mmu_context.h
 create mode 100644 include/linux/vhost.h
 create mode 100644 mm/mmu_context.c
[PATCHv3 1/2] mm: export use_mm/unuse_mm to modules
vhost net module wants to do copy to/from user from a kernel thread, which needs use_mm (like what fs/aio has). Move that into mm/ and export to modules.

Acked-by: Andrew Morton a...@linux-foundation.org
Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 fs/aio.c                    | 47 +--
 include/linux/mmu_context.h | 9 ++
 mm/Makefile                 | 2 +-
 mm/mmu_context.c            | 58 +++
 4 files changed, 69 insertions(+), 47 deletions(-)
 create mode 100644 include/linux/mmu_context.h
 create mode 100644 mm/mmu_context.c

diff --git a/fs/aio.c b/fs/aio.c
index d065b2c..fc21c23 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -24,6 +24,7 @@
 #include <linux/file.h>
 #include <linux/mm.h>
 #include <linux/mman.h>
+#include <linux/mmu_context.h>
 #include <linux/slab.h>
 #include <linux/timer.h>
 #include <linux/aio.h>
@@ -34,7 +35,6 @@
 
 #include <asm/kmap_types.h>
 #include <asm/uaccess.h>
-#include <asm/mmu_context.h>
 
 #if DEBUG > 1
 #define dprintk		printk
@@ -595,51 +595,6 @@ static struct kioctx *lookup_ioctx(unsigned long ctx_id)
 }
 
 /*
- * use_mm
- *	Makes the calling kernel thread take on the specified
- *	mm context.
- *	Called by the retry thread execute retries within the
- *	iocb issuer's mm context, so that copy_from/to_user
- *	operations work seamlessly for aio.
- *	(Note: this routine is intended to be called only
- *	from a kernel thread context)
- */
-static void use_mm(struct mm_struct *mm)
-{
-	struct mm_struct *active_mm;
-	struct task_struct *tsk = current;
-
-	task_lock(tsk);
-	active_mm = tsk->active_mm;
-	atomic_inc(&mm->mm_count);
-	tsk->mm = mm;
-	tsk->active_mm = mm;
-	switch_mm(active_mm, mm, tsk);
-	task_unlock(tsk);
-
-	mmdrop(active_mm);
-}
-
-/*
- * unuse_mm
- *	Reverses the effect of use_mm, i.e. releases the
- *	specified mm context which was earlier taken on
- *	by the calling kernel thread
- *	(Note: this routine is intended to be called only
- *	from a kernel thread context)
- */
-static void unuse_mm(struct mm_struct *mm)
-{
-	struct task_struct *tsk = current;
-
-	task_lock(tsk);
-	tsk->mm = NULL;
-	/* active_mm is still 'mm' */
-	enter_lazy_tlb(mm, tsk);
-	task_unlock(tsk);
-}
-
-/*
  * Queue up a kiocb to be retried. Assumes that the kiocb
  * has already been marked as kicked, and places it on
  * the retry run list for the corresponding ioctx, if it
diff --git a/include/linux/mmu_context.h b/include/linux/mmu_context.h
new file mode 100644
index 000..70fffeb
--- /dev/null
+++ b/include/linux/mmu_context.h
@@ -0,0 +1,9 @@
+#ifndef _LINUX_MMU_CONTEXT_H
+#define _LINUX_MMU_CONTEXT_H
+
+struct mm_struct;
+
+void use_mm(struct mm_struct *mm);
+void unuse_mm(struct mm_struct *mm);
+
+#endif
diff --git a/mm/Makefile b/mm/Makefile
index 5e0bd64..46c3892 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -11,7 +11,7 @@ obj-y := bootmem.o filemap.o mempool.o oom_kill.o fadvise.o \
 	   maccess.o page_alloc.o page-writeback.o pdflush.o \
 	   readahead.o swap.o truncate.o vmscan.o shmem.o \
 	   prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \
-	   page_isolation.o mm_init.o $(mmu-y)
+	   page_isolation.o mm_init.o mmu_context.o $(mmu-y)
 obj-y += init-mm.o
 
 obj-$(CONFIG_PROC_PAGE_MONITOR) += pagewalk.o
diff --git a/mm/mmu_context.c b/mm/mmu_context.c
new file mode 100644
index 000..9989c2f
--- /dev/null
+++ b/mm/mmu_context.c
@@ -0,0 +1,58 @@
+/* Copyright (C) 2009 Red Hat, Inc.
+ *
+ * See ../COPYING for licensing terms.
+ */
+
+#include <linux/mm.h>
+#include <linux/mmu_context.h>
+#include <linux/module.h>
+#include <linux/sched.h>
+
+#include <asm/mmu_context.h>
+
+/*
+ * use_mm
+ *	Makes the calling kernel thread take on the specified
+ *	mm context.
+ *	Called by the retry thread execute retries within the
+ *	iocb issuer's mm context, so that copy_from/to_user
+ *	operations work seamlessly for aio.
+ *	(Note: this routine is intended to be called only
+ *	from a kernel thread context)
+ */
+void use_mm(struct mm_struct *mm)
+{
+	struct mm_struct *active_mm;
+	struct task_struct *tsk = current;
+
+	task_lock(tsk);
+	active_mm = tsk->active_mm;
+	atomic_inc(&mm->mm_count);
+	tsk->mm = mm;
+	tsk->active_mm = mm;
+	switch_mm(active_mm, mm, tsk);
+	task_unlock(tsk);
+
+	mmdrop(active_mm);
+}
+EXPORT_SYMBOL_GPL(use_mm);
+
+/*
+ * unuse_mm
+ *	Reverses the effect of use_mm, i.e. releases the
+ *	specified mm context which was earlier taken on
+ *	by the calling kernel thread
+ *	(Note: this routine is intended to be called only
+ *	from a kernel thread context)
+ */
+void unuse_mm(struct mm_struct *mm)
+{
+	struct
[PATCHv3 2/2] vhost_net: a kernel-level virtio server
What it is: vhost net is a character device that can be used to reduce the number of system calls involved in virtio networking. Existing virtio net code is used in the guest without modification.

There's similarity with vringfd, with some differences and reduced scope:
- uses eventfd for signalling
- structures can be moved around in memory at any time (good for migration)
- supports a memory table and not just an offset (needed for kvm)

Common virtio-related code has been put in a separate file, vhost.c, and can be made into a separate module if/when more backends appear. I used Rusty's lguest.c as the source for developing this part: this supplied me with witty comments I wouldn't be able to write myself.

What it is not: vhost net is not a bus, and not a generic new system call. No assumptions are made on how the guest performs hypercalls. Userspace hypervisors are supported as well as kvm.

How it works: Basically, we connect a virtio frontend (configured by userspace) to a backend. The backend could be a network device, or a tun-like device. In this version I only support a raw socket as a backend, which can be bound to e.g. SR IOV, or to a macvlan device. The backend is also configured by userspace, including vlan/mac etc.

Status: This works for me, and I haven't seen any crashes. I have not run any benchmarks yet. Compared to userspace, I expect to see improved latency (as I save up to 4 system calls per packet) but not bandwidth/CPU (as TSO and interrupt mitigation are not supported).

Features that I plan to look at in the future:
- TSO
- interrupt mitigation
- zero copy

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 MAINTAINERS                | 10 +
 arch/x86/kvm/Kconfig       | 1 +
 drivers/Makefile           | 1 +
 drivers/vhost/Kconfig      | 11 +
 drivers/vhost/Makefile     | 2 +
 drivers/vhost/net.c        | 429
 drivers/vhost/vhost.c      | 663
 drivers/vhost/vhost.h      | 108 +++
 include/linux/Kbuild       | 1 +
 include/linux/miscdevice.h | 1 +
 include/linux/vhost.h      | 100 +++
 11 files changed, 1327 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vhost/Kconfig
 create mode 100644 drivers/vhost/Makefile
 create mode 100644 drivers/vhost/net.c
 create mode 100644 drivers/vhost/vhost.c
 create mode 100644 drivers/vhost/vhost.h
 create mode 100644 include/linux/vhost.h

diff --git a/MAINTAINERS b/MAINTAINERS
index ebc2691..eb0c1da 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6312,6 +6312,16 @@ S: Maintained
 F: Documentation/filesystems/vfat.txt
 F: fs/fat/
 
+VIRTIO HOST (VHOST)
+P: Michael S. Tsirkin
+M: m...@redhat.com
+L: kvm@vger.kernel.org
+L: virtualizat...@lists.osdl.org
+L: net...@vger.kernel.org
+S: Maintained
+F: drivers/vhost/
+F: include/linux/vhost.h
+
 VIA RHINE NETWORK DRIVER
 P: Roger Luethi
 M: r...@hellgate.ch
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index b84e571..94f44d9 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -64,6 +64,7 @@ config KVM_AMD
 
 # OK, it's a little counter-intuitive to do this, but it puts it neatly under
 # the virtualization menu.
+source drivers/vhost/Kconfig
 source drivers/lguest/Kconfig
 source drivers/virtio/Kconfig
 
diff --git a/drivers/Makefile b/drivers/Makefile
index bc4205d..1551ae1 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -105,6 +105,7 @@ obj-$(CONFIG_HID) += hid/
 obj-$(CONFIG_PPC_PS3) += ps3/
 obj-$(CONFIG_OF) += of/
 obj-$(CONFIG_SSB) += ssb/
+obj-$(CONFIG_VHOST_NET) += vhost/
 obj-$(CONFIG_VIRTIO) += virtio/
 obj-$(CONFIG_VLYNQ) += vlynq/
 obj-$(CONFIG_STAGING) += staging/
diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
new file mode 100644
index 000..d955406
--- /dev/null
+++ b/drivers/vhost/Kconfig
@@ -0,0 +1,11 @@
+config VHOST_NET
+	tristate "Host kernel accelerator for virtio net"
+	depends on NET && EVENTFD
+	---help---
+	  This kernel module can be loaded in host kernel to accelerate
+	  guest networking with virtio_net. Not to be confused with virtio_net
+	  module itself which needs to be loaded in guest kernel.
+
+	  To compile this driver as a module, choose M here: the module will
+	  be called vhost_net.
+
diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
new file mode 100644
index 000..72dd020
--- /dev/null
+++ b/drivers/vhost/Makefile
@@ -0,0 +1,2 @@
+obj-$(CONFIG_VHOST_NET) += vhost_net.o
+vhost_net-y := vhost.o net.o
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
new file mode 100644
index 000..728094b
--- /dev/null
+++ b/drivers/vhost/net.c
@@ -0,0 +1,429 @@
+/* Copyright (C) 2009 Red Hat, Inc.
+ * Author: Michael S. Tsirkin m...@redhat.com
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ *
+ *
[ kvm-Bugs-2837083 ] Wrong disk size on exported nbd device
Bugs item #2837083, was opened at 2009-08-13 16:27
Message generated for change (Tracker Item Submitted) made by atilaromero
You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2837083&group_id=180599

Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update.

Category: qemu
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Atila (atilaromero)
Assigned to: Nobody/Anonymous (nobody)
Summary: Wrong disk size on exported nbd device

Initial Comment:
kvm-nbd uses a blocksize of 1024 bytes. If the imaged disk had an odd number of sectors, the last sector isn't exported.
Solution: change the blocksize to 512 in nbd.c
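The arithmetic behind the report can be sketched as follows. This is only an illustration, assuming the exported length is the image size rounded down to a whole number of blocks; the helper names are hypothetical and are not taken from nbd.c.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SIZE 512u /* bytes per disk sector */

/* Hypothetical helper: bytes exported when the device length is rounded
 * down to a whole number of blocks of block_size bytes. */
uint64_t exported_bytes(uint64_t sectors, uint32_t block_size)
{
	uint64_t bytes = sectors * SECTOR_SIZE;

	assert(block_size != 0);
	return bytes - (bytes % block_size);
}

/* Number of 512-byte sectors that fall off the end because of rounding. */
uint64_t lost_sectors(uint64_t sectors, uint32_t block_size)
{
	return sectors - exported_bytes(sectors, block_size) / SECTOR_SIZE;
}
```

Under that assumption a 7-sector image loses its last sector with a 1024-byte block size and loses nothing with a 512-byte block size, which matches the proposed fix.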
Re: Page allocation failures in guest
On Wed, 12 Aug 2009 15:01:52 +0930 Rusty Russell ru...@rustcorp.com.au wrote:
On Wed, 12 Aug 2009 12:49:51 pm Rusty Russell wrote:
On Tue, 11 Aug 2009 04:22:53 pm Avi Kivity wrote:
On 08/11/2009 09:32 AM, Pierre Ossman wrote:
It doesn't get out of it though, or at least the virtio net driver wedges itself.
There's a fixme to retry when this happens, but this is the first report I've received. I'll check it out.

Subject: virtio: net refill on out-of-memory

If we run out of memory, use keventd to fill the buffer. There's a report of this happening: Page allocation failures in guest, Message-ID: 20090713115158.0a489...@mjolnir.ossman.eu

Signed-off-by: Rusty Russell ru...@rustcorp.com.au

Patch applied. Now we wait. :)

--
-- Pierre Ossman
WARNING: This correspondence is being monitored by the Swedish government. Make sure your server uses encryption for SMTP traffic and consider using PGP for end-to-end encryption.
[PATCH -tip v14 02/12] x86: x86 instruction decoder build-time selftest
Add a user-space selftest of x86 instruction decoder at kernel build time. When CONFIG_X86_DECODER_SELFTEST=y, Kbuild builds a test harness of x86 instruction decoder and performs it after building vmlinux. The test compares the results of objdump and x86 instruction decoder code and check there are no differences. Signed-off-by: Masami Hiramatsu mhira...@redhat.com Signed-off-by: Jim Keniston jkeni...@us.ibm.com Cc: Ananth N Mavinakayanahalli ana...@in.ibm.com Cc: Avi Kivity a...@redhat.com Cc: Andi Kleen a...@linux.intel.com Cc: Christoph Hellwig h...@infradead.org Cc: Frank Ch. Eigler f...@redhat.com Cc: Frederic Weisbecker fweis...@gmail.com Cc: H. Peter Anvin h...@zytor.com Cc: Ingo Molnar mi...@elte.hu Cc: Jason Baron jba...@redhat.com Cc: K.Prasad pra...@linux.vnet.ibm.com Cc: Lai Jiangshan la...@cn.fujitsu.com Cc: Li Zefan l...@cn.fujitsu.com Cc: Przemysław Pawełczyk przemys...@pawelczyk.it Cc: Roland McGrath rol...@redhat.com Cc: Sam Ravnborg s...@ravnborg.org Cc: Srikar Dronamraju sri...@linux.vnet.ibm.com Cc: Steven Rostedt rost...@goodmis.org Cc: Tom Zanussi tzanu...@gmail.com Cc: Vegard Nossum vegard.nos...@gmail.com --- arch/x86/Kconfig.debug|9 +++ arch/x86/Makefile |3 + arch/x86/tools/Makefile | 15 + arch/x86/tools/distill.awk| 42 +++ arch/x86/tools/test_get_len.c | 113 + 5 files changed, 182 insertions(+), 0 deletions(-) create mode 100644 arch/x86/tools/Makefile create mode 100644 arch/x86/tools/distill.awk create mode 100644 arch/x86/tools/test_get_len.c diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug index d105f29..7d0b681 100644 --- a/arch/x86/Kconfig.debug +++ b/arch/x86/Kconfig.debug @@ -186,6 +186,15 @@ config X86_DS_SELFTEST config HAVE_MMIOTRACE_SUPPORT def_bool y +config X86_DECODER_SELFTEST + bool x86 instruction decoder selftest + depends on DEBUG_KERNEL + ---help--- +Perform x86 instruction decoder selftests at build time. +This option is useful for checking the sanity of x86 instruction +decoder code. +If unsure, say N. 
+ # # IO delay types: # diff --git a/arch/x86/Makefile b/arch/x86/Makefile index 1f3851a..f79580c 100644 --- a/arch/x86/Makefile +++ b/arch/x86/Makefile @@ -154,6 +154,9 @@ all: bzImage KBUILD_IMAGE := $(boot)/bzImage bzImage: vmlinux +ifeq ($(CONFIG_X86_DECODER_SELFTEST),y) + $(Q)$(MAKE) $(build)=arch/x86/tools posttest +endif $(Q)$(MAKE) $(build)=$(boot) $(KBUILD_IMAGE) $(Q)mkdir -p $(objtree)/arch/$(UTS_MACHINE)/boot $(Q)ln -fsn ../../x86/boot/bzImage $(objtree)/arch/$(UTS_MACHINE)/boot/$@ diff --git a/arch/x86/tools/Makefile b/arch/x86/tools/Makefile new file mode 100644 index 000..3dd626b --- /dev/null +++ b/arch/x86/tools/Makefile @@ -0,0 +1,15 @@ +PHONY += posttest +quiet_cmd_posttest = TEST$@ + cmd_posttest = $(OBJDUMP) -d $(objtree)/vmlinux | awk -f $(srctree)/arch/x86/tools/distill.awk | $(obj)/test_get_len + +posttest: $(obj)/test_get_len vmlinux + $(call cmd,posttest) + +hostprogs-y:= test_get_len + +# -I needed for generated C source and C source which in the kernel tree. +HOSTCFLAGS_test_get_len.o := -Wall -I$(objtree)/arch/x86/lib/ -I$(srctree)/arch/x86/include/ -I$(srctree)/arch/x86/lib/ + +# Dependancies are also needed. +$(obj)/test_get_len.o: $(srctree)/arch/x86/lib/insn.c $(srctree)/arch/x86/lib/inat.c $(srctree)/arch/x86/include/asm/inat_types.h $(srctree)/arch/x86/include/asm/inat.h $(srctree)/arch/x86/include/asm/insn.h $(objtree)/arch/x86/lib/inat-tables.c + diff --git a/arch/x86/tools/distill.awk b/arch/x86/tools/distill.awk new file mode 100644 index 000..d433619 --- /dev/null +++ b/arch/x86/tools/distill.awk @@ -0,0 +1,42 @@ +#!/bin/awk -f +# Usage: objdump -d a.out | awk -f distill.awk | ./test_get_len +# Distills the disassembly as follows: +# - Removes all lines except the disassembled instructions. +# - For instructions that exceed 1 line (7 bytes), crams all the hex bytes +# into a single line. 
+# - Remove bad(or prefix only) instructions + +BEGIN { + prev_addr = + prev_hex = + prev_mnemonic = + bad_expr = (\\(bad\\)|^rex|^.byte|^rep(z|nz)$|^lock$|^es$|^cs$|^ss$|^ds$|^fs$|^gs$|^data(16|32)$|^addr(16|32|64)) + fwait_expr = ^9b + fwait_str=9b\tfwait +} + +/^ *[0-9a-f]+:/ { + if (split($0, field, \t) 3) { + # This is a continuation of the same insn. + prev_hex = prev_hex field[2] + } else { + # Skip bad instructions + if (match(prev_mnemonic, bad_expr)) + prev_addr = + # Split fwait from other f* instructions + if (match(prev_hex, fwait_expr) prev_mnemonic != fwait) { + printf %s\t%s\n, prev_addr, fwait_str + sub(fwait_expr, , prev_hex) + } + if (prev_addr
[PATCH -tip v14 01/12] x86: instruction decoder API
Add x86 instruction decoder to arch-specific libraries. This decoder can decode x86 instructions used in kernel into prefix, opcode, modrm, sib, displacement and immediates. This can also show the length of instructions. This version introduces instruction attributes for decoding instructions. The instruction attribute tables are generated from the opcode map file (x86-opcode-map.txt) by the generator script(gen-insn-attr-x86.awk). Currently, the opcode maps are based on opcode maps in Intel(R) 64 and IA-32 Architectures Software Developers Manual Vol.2: Appendix.A, and consist of below two types of opcode tables. 1-byte/2-bytes/3-bytes opcodes, which has 256 elements, are written as below; Table: table-name Referrer: escaped-name opcode: mnemonic|GrpXXX [operand1[,operand2...]] [(extra1)[,(extra2)...] [| 2nd-mnemonic ...] (or) opcode: escape # escaped-name EndTable Group opcodes, which has 8 elements, are written as below; GrpTable: GrpXXX reg: mnemonic [operand1[,operand2...]] [(extra1)[,(extra2)...] [| 2nd-mnemonic ...] EndTable These opcode maps include a few SSE and FP opcodes (for setup), because those opcodes are used in the kernel. Signed-off-by: Masami Hiramatsu mhira...@redhat.com Signed-off-by: Jim Keniston jkeni...@us.ibm.com Acked-by: H. Peter Anvin h...@zytor.com Cc: Ananth N Mavinakayanahalli ana...@in.ibm.com Cc: Avi Kivity a...@redhat.com Cc: Andi Kleen a...@linux.intel.com Cc: Christoph Hellwig h...@infradead.org Cc: Frank Ch. 
Eigler f...@redhat.com Cc: Frederic Weisbecker fweis...@gmail.com Cc: Ingo Molnar mi...@elte.hu Cc: Jason Baron jba...@redhat.com Cc: K.Prasad pra...@linux.vnet.ibm.com Cc: Lai Jiangshan la...@cn.fujitsu.com Cc: Li Zefan l...@cn.fujitsu.com Cc: Przemysław Pawełczyk przemys...@pawelczyk.it Cc: Roland McGrath rol...@redhat.com Cc: Sam Ravnborg s...@ravnborg.org Cc: Srikar Dronamraju sri...@linux.vnet.ibm.com Cc: Steven Rostedt rost...@goodmis.org Cc: Tom Zanussi tzanu...@gmail.com Cc: Vegard Nossum vegard.nos...@gmail.com --- arch/x86/include/asm/inat.h | 188 + arch/x86/include/asm/inat_types.h| 29 + arch/x86/include/asm/insn.h | 143 +++ arch/x86/lib/Makefile| 13 + arch/x86/lib/inat.c | 78 arch/x86/lib/insn.c | 464 ++ arch/x86/lib/x86-opcode-map.txt | 719 ++ arch/x86/tools/gen-insn-attr-x86.awk | 314 +++ 8 files changed, 1948 insertions(+), 0 deletions(-) create mode 100644 arch/x86/include/asm/inat.h create mode 100644 arch/x86/include/asm/inat_types.h create mode 100644 arch/x86/include/asm/insn.h create mode 100644 arch/x86/lib/inat.c create mode 100644 arch/x86/lib/insn.c create mode 100644 arch/x86/lib/x86-opcode-map.txt create mode 100644 arch/x86/tools/gen-insn-attr-x86.awk diff --git a/arch/x86/include/asm/inat.h b/arch/x86/include/asm/inat.h new file mode 100644 index 000..2866fdd --- /dev/null +++ b/arch/x86/include/asm/inat.h @@ -0,0 +1,188 @@ +#ifndef _ASM_X86_INAT_H +#define _ASM_X86_INAT_H +/* + * x86 instruction attributes + * + * Written by Masami Hiramatsu mhira...@redhat.com + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. + * + */ +#include asm/inat_types.h + +/* + * Internal bits. Don't use bitmasks directly, because these bits are + * unstable. You should use checking functions. + */ + +#define INAT_OPCODE_TABLE_SIZE 256 +#define INAT_GROUP_TABLE_SIZE 8 + +/* Legacy instruction prefixes */ +#define INAT_PFX_OPNDSZ1 /* 0x66 */ /* LPFX1 */ +#define INAT_PFX_REPNE 2 /* 0xF2 */ /* LPFX2 */ +#define INAT_PFX_REPE 3 /* 0xF3 */ /* LPFX3 */ +#define INAT_PFX_LOCK 4 /* 0xF0 */ +#define INAT_PFX_CS5 /* 0x2E */ +#define INAT_PFX_DS6 /* 0x3E */ +#define INAT_PFX_ES7 /* 0x26 */ +#define INAT_PFX_FS8 /* 0x64 */ +#define INAT_PFX_GS9 /* 0x65 */ +#define INAT_PFX_SS10 /* 0x36 */ +#define INAT_PFX_ADDRSZ11 /* 0x67 */ + +#define INAT_LPREFIX_MAX 3 + +/* Immediate size */ +#define INAT_IMM_BYTE 1 +#define INAT_IMM_WORD 2 +#define INAT_IMM_DWORD 3 +#define INAT_IMM_QWORD 4 +#define INAT_IMM_PTR 5 +#define INAT_IMM_VWORD32 6
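The legacy prefix identifiers listed above can be read as a byte-to-identifier mapping. The following is a minimal sketch of that reading, not the patch's implementation: the patch generates its attribute tables from x86-opcode-map.txt, whereas the switch below and the names legacy_prefix_id() and count_legacy_prefixes() are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Identifiers and byte values copied from the inat.h listing above. */
#define INAT_PFX_OPNDSZ  1  /* 0x66 */
#define INAT_PFX_REPNE   2  /* 0xF2 */
#define INAT_PFX_REPE    3  /* 0xF3 */
#define INAT_PFX_LOCK    4  /* 0xF0 */
#define INAT_PFX_CS      5  /* 0x2E */
#define INAT_PFX_DS      6  /* 0x3E */
#define INAT_PFX_ES      7  /* 0x26 */
#define INAT_PFX_FS      8  /* 0x64 */
#define INAT_PFX_GS      9  /* 0x65 */
#define INAT_PFX_SS      10 /* 0x36 */
#define INAT_PFX_ADDRSZ  11 /* 0x67 */

/* Hypothetical helper: prefix identifier for a byte, 0 if not a legacy prefix. */
int legacy_prefix_id(unsigned char byte)
{
	switch (byte) {
	case 0x66: return INAT_PFX_OPNDSZ;
	case 0xF2: return INAT_PFX_REPNE;
	case 0xF3: return INAT_PFX_REPE;
	case 0xF0: return INAT_PFX_LOCK;
	case 0x2E: return INAT_PFX_CS;
	case 0x3E: return INAT_PFX_DS;
	case 0x26: return INAT_PFX_ES;
	case 0x64: return INAT_PFX_FS;
	case 0x65: return INAT_PFX_GS;
	case 0x36: return INAT_PFX_SS;
	case 0x67: return INAT_PFX_ADDRSZ;
	default:   return 0;
	}
}

/* Hypothetical helper: number of leading legacy prefix bytes in buf. */
size_t count_legacy_prefixes(const unsigned char *buf, size_t len)
{
	size_t n = 0;

	assert(buf != NULL);
	while (n < len && legacy_prefix_id(buf[n]) != 0)
		n++;
	return n;
}
```

For the byte sequence 66 f3 0f 10 the counter stops at 0f, so two prefix bytes are skipped before the opcode starts.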
[PATCH -tip v14 00/12] tracing: kprobe-based event tracer and x86 instruction decoder
Hi,

Here are the patches of the kprobe-based event tracer for x86, version 14, which allows you to probe various kernel events through the ftrace interface. The tracer supports per-probe filtering, which allows you to set filters on each probe and shows the format of each probe.

This version includes the following fixes:
- Define remove_subsystem_dir() always (patch 6/12)
- Modify syscall_tracer because of ftrace_event_call change (patch 6/12)
- Support 'sa' argument for stack address (patch 8/12)
- Use call->data instead of container_of() macro. (patch 8/12)
- Assign new event id for each event. (patch 11/12)

Lai, this version still cannot be applied on top of your patch ('use defined fields to print formats'), since I couldn't update your patch on the latest -tip tree.

This patchset also includes the x86(-64) instruction decoder, which supports non-SSE/FP opcodes and includes an x86 opcode map. The decoder is used for finding the instruction boundaries when inserting new kprobes. I think it will be possible to share this opcode map with KVM's decoder.

The decoder is tested when building the kernel: the test compares the results of objdump and the decoder right after building vmlinux. You can enable that test with CONFIG_X86_DECODER_SELFTEST=y.

This series can be applied on the latest linux-2.6.31-rc5-tip.

This supports only x86(-32/-64) (but porting it to other architectures only needs kprobes/kretprobes and register and stack access APIs).

I also made two tools for this tracer.
- A kprobe stress test script which tests kprobes on all kernel symbols to find symbols which should be blacklisted.
- A C expression to kprobes event format converter which helps you to define kprobes events by C source code line number or function name, and local variable name.

Enhancement ideas will be added after merging:
- .init function tracing support.
- Support primitive types (long, ulong, int, uint, etc) for args.
Kprobe-based Event Tracer
=========================

Overview
--------
This tracer is similar to the events tracer, which is based on the Tracepoint infrastructure. Instead of Tracepoint, this tracer is based on kprobes (kprobe and kretprobe). It can probe anywhere kprobes can probe (that is, all function bodies except for __kprobes functions).

Unlike the function tracer, this tracer can probe instructions inside kernel functions. It allows you to check which instruction has been executed.

Unlike the Tracepoint-based events tracer, this tracer can add new probe points on the fly.

Similar to the events tracer, this tracer doesn't need to be activated via current_tracer; instead, just set probe points via /sys/kernel/debug/tracing/kprobe_events. You can set filters on each probe event via /sys/kernel/debug/tracing/events/kprobes/EVENT/filter.

Synopsis of kprobe_events
-------------------------
  p[:EVENT] SYMBOL[+offs|-offs]|MEMADDR [FETCHARGS]  : Set a probe
  r[:EVENT] SYMBOL[+0] [FETCHARGS]                   : Set a return probe

 EVENT                : Event name. If omitted, the event name is generated
                        based on SYMBOL+offs or MEMADDR.
 SYMBOL[+offs|-offs]  : Symbol+offset where the probe is inserted.
 MEMADDR              : Address where the probe is inserted.

 FETCHARGS            : Arguments. Each probe can have up to 128 args.
  %REG                : Fetch register REG
  sN                  : Fetch Nth entry of stack (N >= 0)
  sa                  : Fetch stack address.
  @ADDR               : Fetch memory at ADDR (ADDR should be in kernel)
  @SYM[+|-offs]       : Fetch memory at SYM +|- offs (SYM should be a data symbol)
  aN                  : Fetch function argument. (N >= 0)(*)
  rv                  : Fetch return value.(**)
  ra                  : Fetch return address.(**)
  +|-offs(FETCHARG)   : Fetch memory at FETCHARG +|- offs address.(***)

  (*) aN may not be correct on asmlinkage functions or in the middle of a function body.
  (**) only for return probe.
  (***) this is useful for fetching a field of data structures.

Per-Probe Event Filtering
-------------------------
The per-probe event filtering feature allows you to set a different filter on each probe and lets you see which arguments will be shown in the trace buffer.
If an event name is specified right after 'p:' or 'r:' in kprobe_events, the tracer adds an event under tracing/events/kprobes/EVENT; in that directory you can see 'id', 'enabled', 'format' and 'filter'.

enabled:
  You can enable/disable the probe by writing 1 or 0 on it.

format:
  It shows the format of this probe event. It also shows aliases of arguments which you specified to kprobe_events.

filter:
  You can write filtering rules of this event. You can use both alias names and field names for describing filters.

Event Profiling
---------------
You can check the total number of probe hits and probe miss-hits via /sys/kernel/debug/tracing/kprobe_profile. The first column is the event name, the second is the number of probe hits, and the third is the number of probe miss-hits.

Usage examples
--------------
To add a probe as a new event, write
[PATCH -tip v14 04/12] kprobes: cleanup fix_riprel() using insn decoder on x86
Cleanup fix_riprel() in arch/x86/kernel/kprobes.c by using x86 instruction decoder. Signed-off-by: Masami Hiramatsu mhira...@redhat.com Cc: Ananth N Mavinakayanahalli ana...@in.ibm.com Cc: Avi Kivity a...@redhat.com Cc: Andi Kleen a...@linux.intel.com Cc: Christoph Hellwig h...@infradead.org Cc: Frank Ch. Eigler f...@redhat.com Cc: Frederic Weisbecker fweis...@gmail.com Cc: H. Peter Anvin h...@zytor.com Cc: Ingo Molnar mi...@elte.hu Cc: Jason Baron jba...@redhat.com Cc: Jim Keniston jkeni...@us.ibm.com Cc: K.Prasad pra...@linux.vnet.ibm.com Cc: Lai Jiangshan la...@cn.fujitsu.com Cc: Li Zefan l...@cn.fujitsu.com Cc: Przemysław Pawełczyk przemys...@pawelczyk.it Cc: Roland McGrath rol...@redhat.com Cc: Sam Ravnborg s...@ravnborg.org Cc: Srikar Dronamraju sri...@linux.vnet.ibm.com Cc: Steven Rostedt rost...@goodmis.org Cc: Tom Zanussi tzanu...@gmail.com Cc: Vegard Nossum vegard.nos...@gmail.com --- arch/x86/kernel/kprobes.c | 128 - 1 files changed, 23 insertions(+), 105 deletions(-) diff --git a/arch/x86/kernel/kprobes.c b/arch/x86/kernel/kprobes.c index 80d493f..98f48d0 100644 --- a/arch/x86/kernel/kprobes.c +++ b/arch/x86/kernel/kprobes.c @@ -109,50 +109,6 @@ static const u32 twobyte_is_boostable[256 / 32] = { /* --- */ /* 0 1 2 3 4 5 6 7 8 9 a b c d e f */ }; -static const u32 onebyte_has_modrm[256 / 32] = { - /* 0 1 2 3 4 5 6 7 8 9 a b c d e f */ - /* --- */ - W(0x00, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0) | /* 00 */ - W(0x10, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0) , /* 10 */ - W(0x20, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0) | /* 20 */ - W(0x30, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0) , /* 30 */ - W(0x40, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) | /* 40 */ - W(0x50, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) , /* 50 */ - W(0x60, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0) | /* 60 */ - W(0x70, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) , /* 70 */ - W(0x80, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 80 
*/ - W(0x90, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) , /* 90 */ - W(0xa0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) | /* a0 */ - W(0xb0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) , /* b0 */ - W(0xc0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0) | /* c0 */ - W(0xd0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1) , /* d0 */ - W(0xe0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) | /* e0 */ - W(0xf0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1) /* f0 */ - /* --- */ - /* 0 1 2 3 4 5 6 7 8 9 a b c d e f */ -}; -static const u32 twobyte_has_modrm[256 / 32] = { - /* 0 1 2 3 4 5 6 7 8 9 a b c d e f */ - /* --- */ - W(0x00, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1) | /* 0f */ - W(0x10, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0) , /* 1f */ - W(0x20, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1) | /* 2f */ - W(0x30, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) , /* 3f */ - W(0x40, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 4f */ - W(0x50, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 5f */ - W(0x60, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 6f */ - W(0x70, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1) , /* 7f */ - W(0x80, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) | /* 8f */ - W(0x90, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 9f */ - W(0xa0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1) | /* af */ - W(0xb0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1) , /* bf */ - W(0xc0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0) | /* cf */ - W(0xd0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* df */ - W(0xe0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* ef */ - W(0xf0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0) /* ff */ - /* --- */ - /* 0 1 2 3 4 5 6 7 8 9 a b c d e f */ -}; #undef W struct kretprobe_blackpoint kretprobe_blacklist[] = { @@ -345,68 +301,30 @@ static int __kprobes is_IF_modifier(kprobe_opcode_t *insn) static void __kprobes fix_riprel(struct kprobe *p) { 
#ifdef CONFIG_X86_64 - u8 *insn = p-ainsn.insn; - s64 disp; - int need_modrm; - - /* Skip legacy instruction prefixes. */ - while (1) { - switch (*insn) { - case 0x66: - case 0x67:
[PATCH -tip v14 06/12] tracing: ftrace dynamic ftrace_event_call support
Add dynamic ftrace_event_call support to ftrace. Trace engines can add new ftrace_event_calls to ftrace on the fly. Each operator function of a call takes a ftrace_event_call data structure as an argument, because these functions may be shared among several ftrace_event_calls.

Changes from v13:
- Define remove_subsystem_dir() always (revert a2ca5e03), because trace_remove_event_call() uses it.
- Modify syscall tracer because of ftrace_event_call change.

Signed-off-by: Masami Hiramatsu mhira...@redhat.com
Acked-by: Frederic Weisbecker fweis...@gmail.com
Cc: Ananth N Mavinakayanahalli ana...@in.ibm.com
Cc: Avi Kivity a...@redhat.com
Cc: Andi Kleen a...@linux.intel.com
Cc: Christoph Hellwig h...@infradead.org
Cc: Frank Ch. Eigler f...@redhat.com
Cc: H. Peter Anvin h...@zytor.com
Cc: Ingo Molnar mi...@elte.hu
Cc: Jason Baron jba...@redhat.com
Cc: Jim Keniston jkeni...@us.ibm.com
Cc: K.Prasad pra...@linux.vnet.ibm.com
Cc: Lai Jiangshan la...@cn.fujitsu.com
Cc: Li Zefan l...@cn.fujitsu.com
Cc: Przemysław Pawełczyk przemys...@pawelczyk.it
Cc: Roland McGrath rol...@redhat.com
Cc: Sam Ravnborg s...@ravnborg.org
Cc: Srikar Dronamraju sri...@linux.vnet.ibm.com
Cc: Steven Rostedt rost...@goodmis.org
Cc: Tom Zanussi tzanu...@gmail.com
Cc: Vegard Nossum vegard.nos...@gmail.com
--- include/linux/ftrace_event.h | 14 +++-- include/linux/syscalls.h |4 + include/trace/ftrace.h| 19 +++ include/trace/syscall.h |8 +-- kernel/trace/trace_events.c | 119 + kernel/trace/trace_export.c | 23 kernel/trace/trace_syscalls.c | 16 +++--- 7 files changed, 125 insertions(+), 78 deletions(-) diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h index 189806b..9af68ce 100644 --- a/include/linux/ftrace_event.h +++ b/include/linux/ftrace_event.h @@ -112,13 +112,13 @@ struct ftrace_event_call { struct dentry *dir; struct trace_event *event; int enabled; - int (*regfunc)(void *); - void(*unregfunc)(void *); + int (*regfunc)(struct ftrace_event_call *); + void(*unregfunc)(struct
ftrace_event_call *); int id; - int (*raw_init)(void); - int (*show_format)(struct ftrace_event_call *call, - struct trace_seq *s); - int (*define_fields)(void); + int (*raw_init)(struct ftrace_event_call *); + int (*show_format)(struct ftrace_event_call *, + struct trace_seq *); + int (*define_fields)(struct ftrace_event_call *); struct list_headfields; int filter_active; struct event_filter *filter; @@ -142,6 +142,8 @@ extern int filter_current_check_discard(struct ftrace_event_call *call, extern int trace_define_field(struct ftrace_event_call *call, char *type, char *name, int offset, int size, int is_signed); +extern int trace_add_event_call(struct ftrace_event_call *call); +extern void trace_remove_event_call(struct ftrace_event_call *call); #define is_signed_type(type) (((type)(-1)) 0) diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 87d06c1..be59d22 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -165,7 +165,7 @@ static void prof_sysexit_disable_##sname(struct ftrace_event_call *event_call) \ struct trace_event enter_syscall_print_##sname = { \ .trace = print_syscall_enter, \ }; \ - static int init_enter_##sname(void) \ + static int init_enter_##sname(struct ftrace_event_call *call) \ { \ int num, id;\ num = syscall_name_to_nr(sys#sname); \ @@ -201,7 +201,7 @@ static void prof_sysexit_disable_##sname(struct ftrace_event_call *event_call) \ struct trace_event exit_syscall_print_##sname = { \ .trace = print_syscall_exit, \ }; \ - static int init_exit_##sname(void) \ + static int init_exit_##sname(struct ftrace_event_call *call)\ { \ int num, id;\ num = syscall_name_to_nr(sys#sname); \ diff --git
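The commit message for this patch says the operator functions take the call structure because they may be shared among several calls. A minimal user-space sketch of that idea follows; struct event_call is a simplified, hypothetical stand-in for struct ftrace_event_call, and none of the function names below come from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for struct ftrace_event_call (hypothetical). */
struct event_call {
	const char *name;
	int enabled;
	int  (*regfunc)(struct event_call *);
	void (*unregfunc)(struct event_call *);
};

/* One register callback serving many calls: it learns which call it is
 * operating on from its argument instead of from a per-call function. */
int shared_reg(struct event_call *call)
{
	assert(call != NULL);
	call->enabled = 1;
	return 0;
}

/* Matching unregister callback, shared in the same way. */
void shared_unreg(struct event_call *call)
{
	assert(call != NULL);
	call->enabled = 0;
}

/* Generic code only ever goes through the function pointers. */
int event_enable(struct event_call *call)
{
	return call->regfunc(call);
}

void event_disable(struct event_call *call)
{
	call->unregfunc(call);
}
```

With the old `void *` or `void` signatures each dynamically created event would need its own callback or some side channel to find its state; passing the call itself lets two events created at run time point at the same pair of functions.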
[PATCH -tip v14 07/12] tracing: Introduce TRACE_FIELD_ZERO() macro
Use TRACE_FIELD_ZERO(type, item) instead of TRACE_FIELD_ZERO_CHAR(item). This also includes a fix of TRACE_ZERO_CHAR() macro. Signed-off-by: Masami Hiramatsu mhira...@redhat.com Cc: Ananth N Mavinakayanahalli ana...@in.ibm.com Cc: Avi Kivity a...@redhat.com Cc: Andi Kleen a...@linux.intel.com Cc: Christoph Hellwig h...@infradead.org Cc: Frank Ch. Eigler f...@redhat.com Cc: Frederic Weisbecker fweis...@gmail.com Cc: H. Peter Anvin h...@zytor.com Cc: Ingo Molnar mi...@elte.hu Cc: Jason Baron jba...@redhat.com Cc: Jim Keniston jkeni...@us.ibm.com Cc: K.Prasad pra...@linux.vnet.ibm.com Cc: Lai Jiangshan la...@cn.fujitsu.com Cc: Li Zefan l...@cn.fujitsu.com Cc: Przemysław Pawełczyk przemys...@pawelczyk.it Cc: Roland McGrath rol...@redhat.com Cc: Sam Ravnborg s...@ravnborg.org Cc: Srikar Dronamraju sri...@linux.vnet.ibm.com Cc: Steven Rostedt rost...@goodmis.org Cc: Tom Zanussi tzanu...@gmail.com Cc: Vegard Nossum vegard.nos...@gmail.com --- kernel/trace/trace_event_types.h |4 ++-- kernel/trace/trace_export.c | 16 2 files changed, 10 insertions(+), 10 deletions(-) diff --git a/kernel/trace/trace_event_types.h b/kernel/trace/trace_event_types.h index 6db005e..e74f090 100644 --- a/kernel/trace/trace_event_types.h +++ b/kernel/trace/trace_event_types.h @@ -109,7 +109,7 @@ TRACE_EVENT_FORMAT(bprint, TRACE_BPRINT, bprint_entry, ignore, TRACE_STRUCT( TRACE_FIELD(unsigned long, ip, ip) TRACE_FIELD(char *, fmt, fmt) - TRACE_FIELD_ZERO_CHAR(buf) + TRACE_FIELD_ZERO(char, buf) ), TP_RAW_FMT(%08lx (%d) fmt:%p %s) ); @@ -117,7 +117,7 @@ TRACE_EVENT_FORMAT(bprint, TRACE_BPRINT, bprint_entry, ignore, TRACE_EVENT_FORMAT(print, TRACE_PRINT, print_entry, ignore, TRACE_STRUCT( TRACE_FIELD(unsigned long, ip, ip) - TRACE_FIELD_ZERO_CHAR(buf) + TRACE_FIELD_ZERO(char, buf) ), TP_RAW_FMT(%08lx (%d) fmt:%p %s) ); diff --git a/kernel/trace/trace_export.c b/kernel/trace/trace_export.c index 71c8d7f..b0ac92c 100644 --- a/kernel/trace/trace_export.c +++ b/kernel/trace/trace_export.c @@ -42,9 +42,9 
@@ extern void __bad_type_size(void); if (!ret) \ return 0; -#undef TRACE_FIELD_ZERO_CHAR -#define TRACE_FIELD_ZERO_CHAR(item)\ - ret = trace_seq_printf(s, \tfield:char #item ;\t \ +#undef TRACE_FIELD_ZERO +#define TRACE_FIELD_ZERO(type, item) \ + ret = trace_seq_printf(s, \tfield: #type #item ;\t \ offset:%u;\tsize:0;\n, \ (unsigned int)offsetof(typeof(field), item)); \ if (!ret) \ @@ -92,9 +92,6 @@ ftrace_format_##call(struct ftrace_event_call *unused, \ #include trace_event_types.h -#undef TRACE_ZERO_CHAR -#define TRACE_ZERO_CHAR(arg) - #undef TRACE_FIELD #define TRACE_FIELD(type, item, assign)\ entry-item = assign; @@ -107,6 +104,9 @@ ftrace_format_##call(struct ftrace_event_call *unused, \ #define TRACE_FIELD_SIGN(type, item, assign, is_signed)\ TRACE_FIELD(type, item, assign) +#undef TRACE_FIELD_ZERO +#define TRACE_FIELD_ZERO(type, item) + #undef TP_CMD #define TP_CMD(cmd...) cmd @@ -178,8 +178,8 @@ __attribute__((section(_ftrace_events))) event_##call = { \ if (ret)\ return ret; -#undef TRACE_FIELD_ZERO_CHAR -#define TRACE_FIELD_ZERO_CHAR(item) +#undef TRACE_FIELD_ZERO +#define TRACE_FIELD_ZERO(type, item) #undef TRACE_EVENT_FORMAT #define TRACE_EVENT_FORMAT(call, proto, args, fmt, tstruct, tpfmt) \ -- Masami Hiramatsu Software Engineer Hitachi Computer Products (America), Inc. Software Solutions Division e-mail: mhira...@redhat.com
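The change from TRACE_FIELD_ZERO_CHAR(item) to TRACE_FIELD_ZERO(type, item) amounts to stringifying the element type instead of hard-coding char. The sketch below illustrates that in user space. The quoting in the flattened diff above is lost, so the exact format string here is an assumption, and struct sketch_entry, SKETCH_FIELD_ZERO and describe_buf() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a trace entry ending in a zero-size field. */
struct sketch_entry {
	unsigned long ip;
	char buf[];	/* flexible array member: adds no size of its own */
};

/*
 * Format a "size:0" field description. The element type is passed in and
 * stringified with #type, so one macro covers char and non-char trailing
 * arrays. Evaluates to snprintf()'s return value.
 */
#define SKETCH_FIELD_ZERO(out, len, stype, type, item)			\
	snprintf((out), (len),						\
		 "\tfield:" #type " " #item ";\toffset:%u;\tsize:0;\n",	\
		 (unsigned int)offsetof(stype, item))

/* Describe sketch_entry.buf; returns 0 on success, -1 on truncation. */
int describe_buf(char *out, size_t len)
{
	int n;

	assert(out != NULL && len > 0);
	n = SKETCH_FIELD_ZERO(out, len, struct sketch_entry, char, buf);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Calling the macro with `unsigned long, args` instead of `char, buf` would describe an argument array the same way, which is the kind of use the later kprobe entries in this series make of it.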
[PATCH -tip v14 10/12] tracing: Generate names for each kprobe event automatically
Generate names for each kprobe event based on the probe point, and remove
generic k*probe event types because there is no user of those types.

Signed-off-by: Masami Hiramatsu mhira...@redhat.com
Cc: Ananth N Mavinakayanahalli ana...@in.ibm.com
Cc: Avi Kivity a...@redhat.com
Cc: Andi Kleen a...@linux.intel.com
Cc: Christoph Hellwig h...@infradead.org
Cc: Frank Ch. Eigler f...@redhat.com
Cc: Frederic Weisbecker fweis...@gmail.com
Cc: H. Peter Anvin h...@zytor.com
Cc: Ingo Molnar mi...@elte.hu
Cc: Jason Baron jba...@redhat.com
Cc: Jim Keniston jkeni...@us.ibm.com
Cc: K.Prasad pra...@linux.vnet.ibm.com
Cc: Lai Jiangshan la...@cn.fujitsu.com
Cc: Li Zefan l...@cn.fujitsu.com
Cc: Przemysław Pawełczyk przemys...@pawelczyk.it
Cc: Roland McGrath rol...@redhat.com
Cc: Sam Ravnborg s...@ravnborg.org
Cc: Srikar Dronamraju sri...@linux.vnet.ibm.com
Cc: Steven Rostedt rost...@goodmis.org
Cc: Tom Zanussi tzanu...@gmail.com
Cc: Vegard Nossum vegard.nos...@gmail.com
---
 Documentation/trace/kprobetrace.txt |    3 +-
 kernel/trace/trace_event_types.h    |   18 --
 kernel/trace/trace_kprobe.c         |   64 ++-
 3 files changed, 35 insertions(+), 50 deletions(-)

diff --git a/Documentation/trace/kprobetrace.txt b/Documentation/trace/kprobetrace.txt
index c9c09b4..5e59e85 100644
--- a/Documentation/trace/kprobetrace.txt
+++ b/Documentation/trace/kprobetrace.txt
@@ -28,7 +28,8 @@ Synopsis of kprobe_events
   p[:EVENT] SYMBOL[+offs|-offs]|MEMADDR [FETCHARGS]	: Set a probe
   r[:EVENT] SYMBOL[+0] [FETCHARGS]			: Set a return probe
 
- EVENT			: Event name.
+ EVENT			: Event name. If omitted, the event name is generated
+			  based on SYMBOL+offs or MEMADDR.
  SYMBOL[+offs|-offs]	: Symbol+offset where the probe is inserted.
  MEMADDR		: Address where the probe is inserted.
 
diff --git a/kernel/trace/trace_event_types.h b/kernel/trace/trace_event_types.h
index 186b598..e74f090 100644
--- a/kernel/trace/trace_event_types.h
+++ b/kernel/trace/trace_event_types.h
@@ -175,22 +175,4 @@ TRACE_EVENT_FORMAT(kmem_free, TRACE_KMEM_FREE, kmemtrace_free_entry, ignore,
 	TP_RAW_FMT("type:%u call_site:%lx ptr:%p")
 );
 
-TRACE_EVENT_FORMAT(kprobe, TRACE_KPROBE, kprobe_trace_entry, ignore,
-	TRACE_STRUCT(
-		TRACE_FIELD(unsigned long, ip, ip)
-		TRACE_FIELD(int, nargs, nargs)
-		TRACE_FIELD_ZERO(unsigned long, args)
-	),
-	TP_RAW_FMT("%08lx: args:0x%lx ...")
-);
-
-TRACE_EVENT_FORMAT(kretprobe, TRACE_KRETPROBE, kretprobe_trace_entry, ignore,
-	TRACE_STRUCT(
-		TRACE_FIELD(unsigned long, func, func)
-		TRACE_FIELD(unsigned long, ret_ip, ret_ip)
-		TRACE_FIELD(int, nargs, nargs)
-		TRACE_FIELD_ZERO(unsigned long, args)
-	),
-	TP_RAW_FMT("%08lx <- %08lx: args:0x%lx ...")
-);
 #undef TRACE_SYSTEM
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 4704e40..ec137ed 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -34,6 +34,7 @@
 
 #define MAX_TRACE_ARGS 128
 #define MAX_ARGSTR_LEN 63
+#define MAX_EVENT_NAME_LEN 64
 
 /* currently, trace_kprobe only supports X86. */
 
@@ -280,11 +281,11 @@ static struct trace_probe *alloc_trace_probe(const char *symbol,
 		if (!tp->symbol)
 			goto error;
 	}
-	if (event) {
-		tp->call.name = kstrdup(event, GFP_KERNEL);
-		if (!tp->call.name)
-			goto error;
-	}
+	if (!event)
+		goto error;
+	tp->call.name = kstrdup(event, GFP_KERNEL);
+	if (!tp->call.name)
+		goto error;
 
 	INIT_LIST_HEAD(&tp->list);
 	return tp;
@@ -314,7 +315,7 @@ static struct trace_probe *find_probe_event(const char *event)
 	struct trace_probe *tp;
 
 	list_for_each_entry(tp, &probe_list, list)
-		if (tp->call.name && !strcmp(tp->call.name, event))
+		if (!strcmp(tp->call.name, event))
 			return tp;
 	return NULL;
 }
@@ -330,8 +331,7 @@ static void __unregister_trace_probe(struct trace_probe *tp)
 /* Unregister a trace_probe and probe_event: call with locking probe_lock */
 static void unregister_trace_probe(struct trace_probe *tp)
 {
-	if (tp->call.name)
-		unregister_probe_event(tp);
+	unregister_probe_event(tp);
 	__unregister_trace_probe(tp);
 	list_del(&tp->list);
 }
@@ -360,18 +360,16 @@ static int register_trace_probe(struct trace_probe *tp)
 		goto end;
 	}
 	/* register as an event */
-	if (tp->call.name) {
-		old_tp = find_probe_event(tp->call.name);
-		if (old_tp) {
-			/* delete old event */
-			unregister_trace_probe(old_tp);
-			free_trace_probe(old_tp);
-		}
-
[PATCH -tip v14 12/12] tracing: Add kprobes event profiling interface
Add profiling interfaces for each kprobes event. This interface reports how
many times each probe was hit or missed.

Signed-off-by: Masami Hiramatsu mhira...@redhat.com
Cc: Ananth N Mavinakayanahalli ana...@in.ibm.com
Cc: Avi Kivity a...@redhat.com
Cc: Andi Kleen a...@linux.intel.com
Cc: Christoph Hellwig h...@infradead.org
Cc: Frank Ch. Eigler f...@redhat.com
Cc: Frederic Weisbecker fweis...@gmail.com
Cc: H. Peter Anvin h...@zytor.com
Cc: Ingo Molnar mi...@elte.hu
Cc: Jason Baron jba...@redhat.com
Cc: Jim Keniston jkeni...@us.ibm.com
Cc: K.Prasad pra...@linux.vnet.ibm.com
Cc: Lai Jiangshan la...@cn.fujitsu.com
Cc: Li Zefan l...@cn.fujitsu.com
Cc: Przemysław Pawełczyk przemys...@pawelczyk.it
Cc: Roland McGrath rol...@redhat.com
Cc: Sam Ravnborg s...@ravnborg.org
Cc: Srikar Dronamraju sri...@linux.vnet.ibm.com
Cc: Steven Rostedt rost...@goodmis.org
Cc: Tom Zanussi tzanu...@gmail.com
Cc: Vegard Nossum vegard.nos...@gmail.com
---
 Documentation/trace/kprobetrace.txt |    8 +++
 kernel/trace/trace_kprobe.c         |   43 +++
 2 files changed, 51 insertions(+), 0 deletions(-)

diff --git a/Documentation/trace/kprobetrace.txt b/Documentation/trace/kprobetrace.txt
index 5e59e85..3de7517 100644
--- a/Documentation/trace/kprobetrace.txt
+++ b/Documentation/trace/kprobetrace.txt
@@ -70,6 +70,14 @@ filter:
 names and field names for describing filters.
 
 
+Event Profiling
+---------------
+ You can check the total number of probe hits and probe miss-hits via
+/sys/kernel/debug/tracing/kprobe_profile.
+ The first column is event name, the second is the number of probe hits,
+the third is the number of probe miss-hits.
+
+
 Usage examples
 --------------
 To add a probe as a new event, write a new definition to kprobe_events
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 0e8498e..0f5d0a6 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -184,6 +184,7 @@ struct trace_probe {
 		struct kprobe		kp;
 		struct kretprobe	rp;
 	};
+	unsigned long		nhit;
 	const char		*symbol;	/* symbol name */
 	struct ftrace_event_call	call;
 	struct trace_event		event;
@@ -781,6 +782,37 @@ static const struct file_operations kprobe_events_ops = {
 	.write		= probes_write,
 };
 
+/* Probes profiling interfaces */
+static int probes_profile_seq_show(struct seq_file *m, void *v)
+{
+	struct trace_probe *tp = v;
+
+	seq_printf(m, "%-44s %15lu %15lu\n", tp->call.name, tp->nhit,
+		   probe_is_return(tp) ? tp->rp.kp.nmissed : tp->kp.nmissed);
+
+	return 0;
+}
+
+static const struct seq_operations profile_seq_op = {
+	.start	= probes_seq_start,
+	.next	= probes_seq_next,
+	.stop	= probes_seq_stop,
+	.show	= probes_profile_seq_show
+};
+
+static int profile_open(struct inode *inode, struct file *file)
+{
+	return seq_open(file, &profile_seq_op);
+}
+
+static const struct file_operations kprobe_profile_ops = {
+	.owner		= THIS_MODULE,
+	.open		= profile_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= seq_release,
+};
+
 /* Kprobe handler */
 static __kprobes int kprobe_trace_func(struct kprobe *kp, struct pt_regs *regs)
 {
@@ -791,6 +823,8 @@ static __kprobes int kprobe_trace_func(struct kprobe *kp, struct pt_regs *regs)
 	unsigned long irq_flags;
 	struct ftrace_event_call *call = &tp->call;
 
+	tp->nhit++;
+
 	local_save_flags(irq_flags);
 	pc = preempt_count();
 
@@ -1143,9 +1177,18 @@ static __init int init_kprobe_trace(void)
 	entry = debugfs_create_file("kprobe_events", 0644, d_tracer,
 				    NULL, &kprobe_events_ops);
 
+	/* Event list interface */
 	if (!entry)
 		pr_warning("Could not create debugfs "
 			   "'kprobe_events' entry\n");
+
+	/* Profile interface */
+	entry = debugfs_create_file("kprobe_profile", 0444, d_tracer,
+				    NULL, &kprobe_profile_ops);
+
+	if (!entry)
+		pr_warning("Could not create debugfs "
+			   "'kprobe_profile' entry\n");
 	return 0;
 }
 fs_initcall(init_kprobe_trace);

--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America), Inc.
Software Solutions Division

e-mail: mhira...@redhat.com
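As a rough illustration of the three-column kprobe_profile format described in this patch (event name, probe hits, probe miss-hits), the user-space sketch below parses one such row. The names parse_profile_line() and struct probe_profile are hypothetical and not part of the patch; the only thing taken from the posting is the "%-44s %15lu %15lu" row layout and the 64-byte event name limit.

```c
#include <stdio.h>

/* Hypothetical user-space view of one kprobe_profile row. */
struct probe_profile {
	char name[64];		/* event name (first column) */
	unsigned long nhit;	/* number of probe hits (second column) */
	unsigned long nmissed;	/* number of probe miss-hits (third column) */
};

/*
 * Parse one row as printed with "%-44s %15lu %15lu\n".
 * Returns 0 on success, -1 if the row does not carry three columns.
 */
static int parse_profile_line(const char *line, struct probe_profile *out)
{
	if (sscanf(line, "%63s %lu %lu", out->name, &out->nhit,
		   &out->nmissed) != 3)
		return -1;
	return 0;
}
```

A caller would typically read /sys/kernel/debug/tracing/kprobe_profile line by line with fgets() and hand each line to parse_profile_line().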
[PATCH -tip v14 08/12] tracing: add kprobe-based event tracer
Add kprobes-based event tracer on ftrace.

This tracer is similar to the events tracer, which is based on the Tracepoint
infrastructure. Instead of Tracepoint, this tracer is based on kprobes (kprobe
and kretprobe). It can probe anywhere kprobes can probe (that is, the whole
body of every function except for __kprobes functions).

Similar to the events tracer, this tracer doesn't need to be activated via
current_tracer; instead, just set probe points via
/sys/kernel/debug/tracing/kprobe_events. You can also set filters on each
probe event via /sys/kernel/debug/tracing/events/kprobes/EVENT/filter.

This tracer supports the following probe arguments for each probe.

  %REG	: Fetch register REG
  sN	: Fetch Nth entry of stack (N >= 0)
  sa	: Fetch stack address.
  @ADDR	: Fetch memory at ADDR (ADDR should be in kernel)
  @SYM[+|-offs]	: Fetch memory at SYM +|- offs (SYM should be a data symbol)
  aN	: Fetch function argument. (N >= 0)
  rv	: Fetch return value.
  ra	: Fetch return address.
  +|-offs(FETCHARG) : fetch memory at FETCHARG +|- offs address.

See Documentation/trace/kprobetrace.txt for details.

Changes from v13:
 - Support 'sa' for stack address.
 - Use call->data instead of container_of() macro.

Signed-off-by: Masami Hiramatsu mhira...@redhat.com
Acked-by: Ananth N Mavinakayanahalli ana...@in.ibm.com
Cc: Avi Kivity a...@redhat.com
Cc: Andi Kleen a...@linux.intel.com
Cc: Christoph Hellwig h...@infradead.org
Cc: Frank Ch. Eigler f...@redhat.com
Cc: Frederic Weisbecker fweis...@gmail.com
Cc: H. Peter Anvin h...@zytor.com
Cc: Ingo Molnar mi...@elte.hu
Cc: Jason Baron jba...@redhat.com
Cc: Jim Keniston jkeni...@us.ibm.com
Cc: K.Prasad pra...@linux.vnet.ibm.com
Cc: Lai Jiangshan la...@cn.fujitsu.com
Cc: Li Zefan l...@cn.fujitsu.com
Cc: Przemysław Pawełczyk przemys...@pawelczyk.it
Cc: Roland McGrath rol...@redhat.com
Cc: Sam Ravnborg s...@ravnborg.org
Cc: Srikar Dronamraju sri...@linux.vnet.ibm.com
Cc: Steven Rostedt rost...@goodmis.org
Cc: Tom Zanussi tzanu...@gmail.com
Cc: Vegard Nossum vegard.nos...@gmail.com
---
 Documentation/trace/kprobetrace.txt |  139
 kernel/trace/Kconfig                |   12
 kernel/trace/Makefile               |    1
 kernel/trace/trace.h                |   29 +
 kernel/trace/trace_event_types.h    |   18 +
 kernel/trace/trace_kprobe.c         | 1205 +++
 6 files changed, 1404 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/trace/kprobetrace.txt
 create mode 100644 kernel/trace/trace_kprobe.c

diff --git a/Documentation/trace/kprobetrace.txt b/Documentation/trace/kprobetrace.txt
new file mode 100644
index 000..efff6eb
--- /dev/null
+++ b/Documentation/trace/kprobetrace.txt
@@ -0,0 +1,139 @@
+                         Kprobe-based Event Tracer
+                         =========================
+
+                 Documentation is written by Masami Hiramatsu
+
+
+Overview
+--------
+This tracer is similar to the events tracer which is based on Tracepoint
+infrastructure. Instead of Tracepoint, this tracer is based on kprobes(kprobe
+and kretprobe). It probes anywhere where kprobes can probe(this means, all
+functions body except for __kprobes functions).
+
+Unlike the function tracer, this tracer can probe instructions inside of
+kernel functions. It allows you to check which instruction has been executed.
+
+Unlike the Tracepoint based events tracer, this tracer can add and remove
+probe points on the fly.
+
+Similar to the events tracer, this tracer doesn't need to be activated via
+current_tracer, instead of that, just set probe points via
+/sys/kernel/debug/tracing/kprobe_events. And you can set filters on each
+probe events via /sys/kernel/debug/tracing/events/kprobes/EVENT/filter.
+
+
+Synopsis of kprobe_events
+-------------------------
+  p[:EVENT] SYMBOL[+offs|-offs]|MEMADDR [FETCHARGS]	: Set a probe
+  r[:EVENT] SYMBOL[+0] [FETCHARGS]			: Set a return probe
+
+ EVENT			: Event name.
+ SYMBOL[+offs|-offs]	: Symbol+offset where the probe is inserted.
+ MEMADDR		: Address where the probe is inserted.
+
+ FETCHARGS		: Arguments.
+  %REG	: Fetch register REG
+  sN	: Fetch Nth entry of stack (N >= 0)
+  sa	: Fetch stack address.
+  @ADDR	: Fetch memory at ADDR (ADDR should be in kernel)
+  @SYM[+|-offs]	: Fetch memory at SYM +|- offs (SYM should be a data symbol)
+  aN	: Fetch function argument. (N >= 0)(*)
+  rv	: Fetch return value.(**)
+  ra	: Fetch return address.(**)
+  +|-offs(FETCHARG) : fetch memory at FETCHARG +|- offs address.(***)
+
+  (*) aN may not correct on asmlinkaged functions and at the middle of
+      function body.
+  (**) only for return probe.
+  (***) this is useful for fetching a field of data structures.
+
+
+Per-Probe Event Filtering
+-------------------------
+ Per-probe event filtering feature allows you to set different filter on each
+probe and gives you what arguments will be shown in trace
[PATCH -tip v14 09/12] tracing: Kprobe-tracer supports more than 6 arguments
Support up to 128 arguments for each kprobes event.

Signed-off-by: Masami Hiramatsu mhira...@redhat.com
Cc: Ananth N Mavinakayanahalli ana...@in.ibm.com
Cc: Avi Kivity a...@redhat.com
Cc: Andi Kleen a...@linux.intel.com
Cc: Christoph Hellwig h...@infradead.org
Cc: Frank Ch. Eigler f...@redhat.com
Cc: Frederic Weisbecker fweis...@gmail.com
Cc: H. Peter Anvin h...@zytor.com
Cc: Ingo Molnar mi...@elte.hu
Cc: Jason Baron jba...@redhat.com
Cc: Jim Keniston jkeni...@us.ibm.com
Cc: K.Prasad pra...@linux.vnet.ibm.com
Cc: Lai Jiangshan la...@cn.fujitsu.com
Cc: Li Zefan l...@cn.fujitsu.com
Cc: Przemysław Pawełczyk przemys...@pawelczyk.it
Cc: Roland McGrath rol...@redhat.com
Cc: Sam Ravnborg s...@ravnborg.org
Cc: Srikar Dronamraju sri...@linux.vnet.ibm.com
Cc: Steven Rostedt rost...@goodmis.org
Cc: Tom Zanussi tzanu...@gmail.com
Cc: Vegard Nossum vegard.nos...@gmail.com
---
 Documentation/trace/kprobetrace.txt |    2 +-
 kernel/trace/trace_kprobe.c         |   21 +
 2 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/Documentation/trace/kprobetrace.txt b/Documentation/trace/kprobetrace.txt
index efff6eb..c9c09b4 100644
--- a/Documentation/trace/kprobetrace.txt
+++ b/Documentation/trace/kprobetrace.txt
@@ -32,7 +32,7 @@ Synopsis of kprobe_events
  SYMBOL[+offs|-offs]	: Symbol+offset where the probe is inserted.
  MEMADDR		: Address where the probe is inserted.
 
- FETCHARGS		: Arguments.
+ FETCHARGS		: Arguments. Each probe can have up to 128 args.
   %REG	: Fetch register REG
   sN	: Fetch Nth entry of stack (N >= 0)
   sa	: Fetch stack address.
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index d92877a..4704e40 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -32,7 +32,7 @@
 #include "trace.h"
 #include "trace_output.h"
 
-#define TRACE_KPROBE_ARGS 6
+#define MAX_TRACE_ARGS 128
 #define MAX_ARGSTR_LEN 63
 
 /* currently, trace_kprobe only supports X86. */
@@ -184,11 +184,15 @@ struct trace_probe {
 		struct kretprobe	rp;
 	};
 	const char		*symbol;	/* symbol name */
-	unsigned int		nr_args;
-	struct fetch_func	args[TRACE_KPROBE_ARGS];
 	struct ftrace_event_call	call;
+	unsigned int		nr_args;
+	struct fetch_func	args[];
 };
 
+#define SIZEOF_TRACE_PROBE(n) \
+	(offsetof(struct trace_probe, args) + \
+	(sizeof(struct fetch_func) * (n)))
+
 static int kprobe_trace_func(struct kprobe *kp, struct pt_regs *regs);
 static int kretprobe_trace_func(struct kretprobe_instance *ri,
 				struct pt_regs *regs);
@@ -263,11 +267,11 @@ static DEFINE_MUTEX(probe_lock);
 static LIST_HEAD(probe_list);
 
 static struct trace_probe *alloc_trace_probe(const char *symbol,
-					     const char *event)
+					     const char *event, int nargs)
 {
 	struct trace_probe *tp;
 
-	tp = kzalloc(sizeof(struct trace_probe), GFP_KERNEL);
+	tp = kzalloc(SIZEOF_TRACE_PROBE(nargs), GFP_KERNEL);
 	if (!tp)
 		return ERR_PTR(-ENOMEM);
 
@@ -573,9 +577,10 @@ static int create_trace_probe(int argc, char **argv)
 		if (offset && is_return)
 			return -EINVAL;
 	}
+	argc -= 2; argv += 2;
 
 	/* setup a probe */
-	tp = alloc_trace_probe(symbol, event);
+	tp = alloc_trace_probe(symbol, event, argc);
 	if (IS_ERR(tp))
 		return PTR_ERR(tp);
 
@@ -594,8 +599,8 @@ static int create_trace_probe(int argc, char **argv)
 	kp->addr = addr;
 
 	/* parse arguments */
-	argc -= 2; argv += 2; ret = 0;
-	for (i = 0; i < argc && i < TRACE_KPROBE_ARGS; i++) {
+	ret = 0;
+	for (i = 0; i < argc && i < MAX_TRACE_ARGS; i++) {
 		if (strlen(argv[i]) > MAX_ARGSTR_LEN) {
 			pr_info("Argument%d(%s) is too long.\n", i, argv[i]);
 			ret = -ENOSPC;

--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America), Inc.
Software Solutions Division

e-mail: mhira...@redhat.com
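The switch from a fixed args[TRACE_KPROBE_ARGS] array to a flexible array member sized by SIZEOF_TRACE_PROBE(n) can be tried out in user space. The sketch below is only an illustration under stated assumptions: struct probe, the layout of struct fetch_func, SIZEOF_PROBE() and alloc_probe() are stand-ins invented here, with calloc() playing the role of kzalloc(). Unlike the patch, this sketch records nr_args at allocation time.

```c
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for the kernel's struct fetch_func (hypothetical layout). */
struct fetch_func {
	unsigned long (*func)(void *regs, void *data);
	void *data;
};

/* Simplified stand-in for struct trace_probe with a flexible array tail. */
struct probe {
	unsigned int nr_args;
	struct fetch_func args[];	/* sized at allocation time */
};

/* Same shape as SIZEOF_TRACE_PROBE(n): header size plus n array slots. */
#define SIZEOF_PROBE(n) \
	(offsetof(struct probe, args) + sizeof(struct fetch_func) * (n))

/* Allocate a zeroed probe with room for nargs fetch functions. */
static struct probe *alloc_probe(unsigned int nargs)
{
	struct probe *p = calloc(1, SIZEOF_PROBE(nargs));

	if (p)
		p->nr_args = nargs;
	return p;
}
```

The point of the offsetof() form is that a probe with two arguments pays for two slots rather than for the 128-slot maximum.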
[PATCH -tip v14 11/12] tracing: Kprobe tracer assigns new event ids for each event
Assign new event ids for each kprobes event. This doesn't clear the
ring_buffer when unregistering each kprobe event. Thus, if you mind
'Unknown event' messages, clear the buffer manually after changing kprobe
events.

Signed-off-by: Masami Hiramatsu mhira...@redhat.com
Cc: Ananth N Mavinakayanahalli ana...@in.ibm.com
Cc: Avi Kivity a...@redhat.com
Cc: Andi Kleen a...@linux.intel.com
Cc: Christoph Hellwig h...@infradead.org
Cc: Frank Ch. Eigler f...@redhat.com
Cc: Frederic Weisbecker fweis...@gmail.com
Cc: H. Peter Anvin h...@zytor.com
Cc: Ingo Molnar mi...@elte.hu
Cc: Jason Baron jba...@redhat.com
Cc: Jim Keniston jkeni...@us.ibm.com
Cc: K.Prasad pra...@linux.vnet.ibm.com
Cc: Lai Jiangshan la...@cn.fujitsu.com
Cc: Li Zefan l...@cn.fujitsu.com
Cc: Przemysław Pawełczyk przemys...@pawelczyk.it
Cc: Roland McGrath rol...@redhat.com
Cc: Sam Ravnborg s...@ravnborg.org
Cc: Srikar Dronamraju sri...@linux.vnet.ibm.com
Cc: Steven Rostedt rost...@goodmis.org
Cc: Tom Zanussi tzanu...@gmail.com
Cc: Vegard Nossum vegard.nos...@gmail.com
---
 kernel/trace/trace.h        |    6 -
 kernel/trace/trace_kprobe.c |   51 +--
 2 files changed, 15 insertions(+), 42 deletions(-)

diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 4ce4525..0b78d76 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -43,8 +43,6 @@ enum trace_type {
 	TRACE_POWER,
 	TRACE_BLK,
 	TRACE_KSYM,
-	TRACE_KPROBE,
-	TRACE_KRETPROBE,
 
 	__TRACE_LAST_TYPE,
 };
@@ -358,10 +356,6 @@ extern void __ftrace_bad_type(void);
 		IF_ASSIGN(var, ent, struct kmemtrace_free_entry,\
 			  TRACE_KMEM_FREE); \
 		IF_ASSIGN(var, ent, struct ksym_trace_entry, TRACE_KSYM);\
-		IF_ASSIGN(var, ent, struct kprobe_trace_entry, \
-			  TRACE_KPROBE);\
-		IF_ASSIGN(var, ent, struct kretprobe_trace_entry, \
-			  TRACE_KRETPROBE); \
 		__ftrace_bad_type();\
 	} while (0)
 
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index ec137ed..0e8498e 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -186,6 +186,7 @@ struct trace_probe {
 	};
 	const char		*symbol;	/* symbol name */
 	struct ftrace_event_call	call;
+	struct trace_event		event;
 	unsigned int		nr_args;
 	struct fetch_func	args[];
 };
@@ -795,7 +796,7 @@ static __kprobes int kprobe_trace_func(struct kprobe *kp, struct pt_regs *regs)
 
 	size = SIZEOF_KPROBE_TRACE_ENTRY(tp->nr_args);
 
-	event = trace_current_buffer_lock_reserve(TRACE_KPROBE, size,
+	event = trace_current_buffer_lock_reserve(call->id, size,
 						  irq_flags, pc);
 	if (!event)
 		return 0;
@@ -827,7 +828,7 @@ static __kprobes int kretprobe_trace_func(struct kretprobe_instance *ri,
 
 	size = SIZEOF_KRETPROBE_TRACE_ENTRY(tp->nr_args);
 
-	event = trace_current_buffer_lock_reserve(TRACE_KRETPROBE, size,
+	event = trace_current_buffer_lock_reserve(call->id, size,
 						  irq_flags, pc);
 	if (!event)
 		return 0;
@@ -853,7 +854,7 @@ print_kprobe_event(struct trace_iterator *iter, int flags)
 	struct trace_seq *s = &iter->seq;
 	int i;
 
-	trace_assign_type(field, iter->ent);
+	field = (struct kprobe_trace_entry *)iter->ent;
 
 	if (!seq_print_ip_sym(s, field->ip, flags | TRACE_ITER_SYM_OFFSET))
 		goto partial;
@@ -880,7 +881,7 @@ print_kretprobe_event(struct trace_iterator *iter, int flags)
 	struct trace_seq *s = &iter->seq;
 	int i;
 
-	trace_assign_type(field, iter->ent);
+	field = (struct kretprobe_trace_entry *)iter->ent;
 
 	if (!seq_print_ip_sym(s, field->ret_ip, flags | TRACE_ITER_SYM_OFFSET))
 		goto partial;
@@ -906,16 +907,6 @@ partial:
 	return TRACE_TYPE_PARTIAL_LINE;
 }
 
-static struct trace_event kprobe_trace_event = {
-	.type		= TRACE_KPROBE,
-	.trace		= print_kprobe_event,
-};
-
-static struct trace_event kretprobe_trace_event = {
-	.type		= TRACE_KRETPROBE,
-	.trace		= print_kretprobe_event,
-};
-
 static int probe_event_enable(struct ftrace_event_call *call)
 {
 	struct trace_probe *tp = (struct trace_probe *)call->data;
@@ -1107,35 +1098,35 @@ static int register_probe_event(struct trace_probe *tp)
 	/* Initialize ftrace_event_call */
 	call->system = "kprobes";
 	if (probe_is_return(tp)) {
-		call->event = &kretprobe_trace_event;
-		call->id = TRACE_KRETPROBE;
+		tp->event.trace = print_kretprobe_event;
[PATCH -tip v14 03/12] kprobes: checks probe address is instruction boundary on x86
Ensure that inserting a kprobe is safe by checking whether the specified
address is at the first byte of an instruction on x86. This is done by
decoding the probed function from its head to the probe point.

Signed-off-by: Masami Hiramatsu mhira...@redhat.com
Acked-by: Ananth N Mavinakayanahalli ana...@in.ibm.com
Cc: Avi Kivity a...@redhat.com
Cc: Andi Kleen a...@linux.intel.com
Cc: Christoph Hellwig h...@infradead.org
Cc: Frank Ch. Eigler f...@redhat.com
Cc: Frederic Weisbecker fweis...@gmail.com
Cc: H. Peter Anvin h...@zytor.com
Cc: Ingo Molnar mi...@elte.hu
Cc: Jason Baron jba...@redhat.com
Cc: Jim Keniston jkeni...@us.ibm.com
Cc: K.Prasad pra...@linux.vnet.ibm.com
Cc: Lai Jiangshan la...@cn.fujitsu.com
Cc: Li Zefan l...@cn.fujitsu.com
Cc: Przemysław Pawełczyk przemys...@pawelczyk.it
Cc: Roland McGrath rol...@redhat.com
Cc: Sam Ravnborg s...@ravnborg.org
Cc: Srikar Dronamraju sri...@linux.vnet.ibm.com
Cc: Steven Rostedt rost...@goodmis.org
Cc: Tom Zanussi tzanu...@gmail.com
Cc: Vegard Nossum vegard.nos...@gmail.com
---
 arch/x86/kernel/kprobes.c |   69 +
 1 files changed, 69 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/kprobes.c b/arch/x86/kernel/kprobes.c
index b5b1848..80d493f 100644
--- a/arch/x86/kernel/kprobes.c
+++ b/arch/x86/kernel/kprobes.c
@@ -48,6 +48,7 @@
 #include <linux/preempt.h>
 #include <linux/module.h>
 #include <linux/kdebug.h>
+#include <linux/kallsyms.h>
 
 #include <asm/cacheflush.h>
 #include <asm/desc.h>
@@ -55,6 +56,7 @@
 #include <asm/uaccess.h>
 #include <asm/alternative.h>
 #include <asm/debugreg.h>
+#include <asm/insn.h>
 
 void jprobe_return_end(void);
 
@@ -245,6 +247,71 @@ retry:
 	}
 }
 
+/* Recover the probed instruction at addr for further analysis. */
+static int recover_probed_instruction(kprobe_opcode_t *buf, unsigned long addr)
+{
+	struct kprobe *kp;
+	kp = get_kprobe((void *)addr);
+	if (!kp)
+		return -EINVAL;
+
+	/*
+	 * Basically, kp->ainsn.insn has an original instruction.
+	 * However, RIP-relative instruction can not do single-stepping
+	 * at different place, fix_riprel() tweaks the displacement of
+	 * that instruction. In that case, we can't recover the instruction
+	 * from the kp->ainsn.insn.
+	 *
+	 * On the other hand, kp->opcode has a copy of the first byte of
+	 * the probed instruction, which is overwritten by int3. And
+	 * the instruction at kp->addr is not modified by kprobes except
+	 * for the first byte, we can recover the original instruction
+	 * from it and kp->opcode.
+	 */
+	memcpy(buf, kp->addr, MAX_INSN_SIZE * sizeof(kprobe_opcode_t));
+	buf[0] = kp->opcode;
+	return 0;
+}
+
+/* Dummy buffers for kallsyms_lookup */
+static char __dummy_buf[KSYM_NAME_LEN];
+
+/* Check if paddr is at an instruction boundary */
+static int __kprobes can_probe(unsigned long paddr)
+{
+	int ret;
+	unsigned long addr, offset = 0;
+	struct insn insn;
+	kprobe_opcode_t buf[MAX_INSN_SIZE];
+
+	if (!kallsyms_lookup(paddr, NULL, &offset, NULL, __dummy_buf))
+		return 0;
+
+	/* Decode instructions */
+	addr = paddr - offset;
+	while (addr < paddr) {
+		kernel_insn_init(&insn, (void *)addr);
+		insn_get_opcode(&insn);
+
+		/* Check if the instruction has been modified. */
+		if (insn.opcode.bytes[0] == BREAKPOINT_INSTRUCTION) {
+			ret = recover_probed_instruction(buf, addr);
+			if (ret)
+				/*
+				 * Another debugging subsystem might insert
+				 * this breakpoint. In that case, we can't
+				 * recover it.
+				 */
+				return 0;
+			kernel_insn_init(&insn, buf);
+		}
+		insn_get_length(&insn);
+		addr += insn.length;
+	}
+
+	return (addr == paddr);
+}
+
 /*
  * Returns non-zero if opcode modifies the interrupt flag.
  */
@@ -360,6 +427,8 @@ static void __kprobes arch_copy_kprobe(struct kprobe *p)
 
 int __kprobes arch_prepare_kprobe(struct kprobe *p)
 {
+	if (!can_probe((unsigned long)p->addr))
+		return -EILSEQ;
 	/* insn: must be on special executable page on x86. */
 	p->ainsn.insn = get_insn_slot();
 	if (!p->ainsn.insn)

--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America), Inc.
Software Solutions Division

e-mail: mhira...@redhat.com
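The boundary check in this patch walks from the start of the function, adding each decoded instruction length, and accepts the probe address only if the walk lands on it exactly. The user-space sketch below shows just that walk. It is not the kernel code: toy_insn_length() is a made-up decoder for an invented encoding in which the first byte of each instruction is its length, and the breakpoint recovery step is left out.

```c
#include <stddef.h>

/*
 * Toy "decoder": in this made-up encoding, the first byte of every
 * instruction is its total length in bytes. A real x86 decoder
 * (insn_get_length() in the patch) is far more involved.
 */
static size_t toy_insn_length(const unsigned char *p)
{
	return p[0];
}

/*
 * Walk from the start of the function, one instruction at a time, and
 * report whether 'offset' is exactly the first byte of an instruction.
 * Returns 1 if it is, 0 otherwise (including malformed input).
 */
static int on_insn_boundary(const unsigned char *func, size_t func_len,
			    size_t offset)
{
	size_t pos = 0;

	if (offset >= func_len)
		return 0;

	while (pos < offset) {
		size_t len = toy_insn_length(func + pos);

		if (len == 0 || len > func_len - pos)
			return 0;	/* refuse to walk past the function */
		pos += len;
	}
	return pos == offset;
}
```

With the byte string { 2, 0xAA, 3, 0xBB, 0xCC, 1, 4, 0, 0, 0 }, the instruction starts are offsets 0, 2, 5 and 6; any other offset overshoots during the walk and is rejected, which is the same reason can_probe() ends with return (addr == paddr).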
[TOOL] kprobestest : Kprobe stress test tool
This script tests kprobes by probing all symbols in the kernel and finds
symbols which must be blacklisted.

Usage
-----
  kprobestest [-s SYMLIST] [-b BLACKLIST] [-w WHITELIST]
     Run stress test. If SYMLIST file is specified, use it as
     an initial symbol list (This is useful for verifying white list
     after diagnosing all symbols).

  kprobestest cleanup
     Cleanup all lists

How to Work
-----------
This tool lists all symbols in the kernel via /proc/kallsyms and sorts
them into groups (each of them including 64 symbols by default).
It then tests each group by using kprobe-tracer. If a kernel crash
occurs, that group is moved into the 'failed' dir. If the group passes
the test, this script moves it into the 'passed' dir and saves
kprobe_profile into 'passed/profiles/'.
After testing all groups, all 'failed' groups are merged and sorted into
smaller groups (divided by 4, by default), and those are tested again.
This loop is repeated until every group has just 1 symbol.

Finally, the script sorts all 'passed' symbols into 'tested', 'untested',
and 'missed' based on the profiles.

Note
----
- This script just gives us some clues to the blacklisted functions.
  In some cases, a combination of probe points will cause a problem, but
  each of them doesn't cause the problem alone.

Thank you,

--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America), Inc.
Software Solutions Division

e-mail: mhira...@redhat.com

#!/bin/bash
#
# kprobestest: Kprobes stress test tool
# Written by Masami Hiramatsu mhira...@redhat.com
#
# Usage:
#   $ kprobestest [-s SYMLIST] [-b BLACKLIST] [-w WHITELIST]
#      Run stress test. If SYMLIST file is specified, use it as
#      an initial symbol list (This is useful for verifying white list
#      after diagnosing all symbols).
#
#   $ kprobestest cleanup
#      Cleanup all lists

# Configurations
DEBUGFS=/sys/kernel/debug
INITNR=64
DIV=4

SYMFILE=syms.list
FAILFILE=black.list

function do_test () {
  # Do some benchmark
  for i in {1..4} ; do
    sleep 0.5
    echo -n "."
  done
}

function usage () {
  echo "Usage: kprobestest [cleanup] [-s SYMLIST] [-b BLACKLIST] [-w WHITELIST]"
  exit 0
}

function cleanup_test () {
  echo "Cleanup all files"
  rm -rf $SYMFILE failed passed testing unset
  exit 0
}

# Parse arguments
WHITELIST=
BLACKLIST=
SYMLIST=

while [ "$1" ]; do
  case $1 in
    cleanup)
      cleanup_test
      ;;
    -s)
      SYMLIST=$2
      shift 1
      ;;
    -b)
      BLACKLIST=$2
      shift 1
      ;;
    -w)
      WHITELIST=$2
      shift 1
      ;;
    *)
      usage
      ;;
  esac
  shift 1
done

# Show configurations
echo "Kprobe stress test starting."
[ -f "$BLACKLIST" ] && echo "Blacklist: $BLACKLIST" || BLACKLIST=
[ -f "$WHITELIST" ] && echo "Whitelist: $WHITELIST" || WHITELIST=
[ -f "$SYMLIST" ] && echo "Symlist: $SYMLIST" || SYMLIST=

# Build a sed expression that blanks out white/blacklisted symbols
function make_filter () {
  local EXP=""
  if [ -z "$WHITELIST" -a -z "$BLACKLIST" ]; then
    echo "s/^$//g"
  else
    for i in `cat $WHITELIST $BLACKLIST` ;do
      [ -z "$EXP" ] && EXP="^$i\$" || EXP="$EXP\\|^$i\$"
    done ; EXP="s/$EXP//g"
    echo $EXP
  fi
}

# List text symbols, skipping the __kprobes section
function list_allsyms () {
  local sym
  local out=1
  for sym in `sort /proc/kallsyms | egrep '[0-9a-f]+ [Tt] [^[]*$' | cut -d' ' -f 3`;do
    [ $sym = __kprobes_text_start ] && out=0 && continue
    [ $sym = __kprobes_text_end ] && out=1 && continue
    [ $sym = _etext ] && break
    [ $out -eq 1 ] && echo $sym
  done
}

# Split $SYMFILE into testing/list-NNN.NR files of NR symbols each
function prep_testing () {
  local i=0
  local n=0
  local NR=$1
  local fname=
  echo "Grouping symbols: $NR"
  fname=`printf "list-%03d.%d" $i $NR`
  cat $SYMFILE | while read ln; do
    [ -z "$ln" ] && continue
    echo "$ln" >> testing/$fname
    n=$((n+1))
    if [ $n -eq $NR ]; then
      n=0
      i=$((i+1))
      fname=`printf "list-%03d.%d" $i $NR`
    fi
  done
  sync
}

function init_first () {
  local EXP
  EXP=`make_filter`
  if [ -f "$SYMLIST" ]; then
    cat $SYMLIST | sed $EXP > $SYMFILE
  else
    echo -n "Generating symbol list from /proc/kallsyms..."
    list_allsyms | sed $EXP > $SYMFILE
    echo "done." `wc -l $SYMFILE | cut -f1 -d' '` "symbols listed."
  fi
  mkdir -p testing failed unset passed passed/profiles
  prep_testing $INITNR
}

# Print the size of the largest failed/unset group
function get_max_nr () {
  wc -l failed/list-* unset/list-* 2>/dev/null |\
  awk '/^ *[0-9]+ .*list.*$/{ if (nr < $1) nr=$1 } BEGIN { nr=0 } END { print nr}'
}

function init_next () {
  local NR
  NR=`get_max_nr`
  [ $NR -eq 0 ] && return 1
  [ $NR -eq 1 ] && return 2
  [ $NR -le $DIV ] && NR=1 || NR=`expr $NR / $DIV`
  cat failed/* unset/* > $SYMFILE
  rm failed/* unset/*
  prep_testing $NR
  return 0
}

# Initialize symbols
if [ ! -d testing ]; then
  init_first
elif [ -z "`ls testing/`" ]; then
  init_next
fi

function set_probes () {
  local s
  for s in `cat $1`; do
    echo "p:$s $s" >> $DEBUGFS/tracing/kprobe_events
    [ $? -ne 0 ] && return -1
  done
  return 0
}

function clear_probes () {
  echo > $DEBUGFS/tracing/kprobe_events
}

function save_profile () {
  cat $DEBUGFS/tracing/kprobe_profile
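The narrowing loop described above (start with groups of 64, regroup failures into groups a quarter of the size, stop once every group holds one symbol) comes down to a small size calculation. The sketch below restates that rule in C for illustration only: next_group_size() is a hypothetical name, and returning 0 for "no further round" is a convention chosen here, not the script's exit codes.

```c
/*
 * Pick the group size for the next round from the largest failed group,
 * following the rule used by init_next() in the script:
 *   - largest <= 1 : nothing left to split (returns 0, no further round)
 *   - largest <= div: fall back to one symbol per group
 *   - otherwise     : shrink groups by a factor of div
 * A zero divisor is treated as "no further round" to avoid dividing by 0.
 */
static unsigned int next_group_size(unsigned int largest, unsigned int div)
{
	if (largest <= 1 || div == 0)
		return 0;
	if (largest <= div)
		return 1;
	return largest / div;
}
```

With the defaults INITNR=64 and DIV=4, successive rounds use group sizes 64, 16, 4 and 1.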
[TOOL] c2kpe: C expression to kprobe event format converter
This program converts a probe point given as a C expression into the kprobe
event format used by the kprobe-based event tracer. This helps to define
kprobes events by C source line number or function name, and local variable
name. Currently, this supports only x86(32/64) kernels.

Compile
-------
Before compilation, please install libelf and libdwarf development packages.
(e.g. elfutils-libelf-devel and libdwarf-devel on Fedora)

 $ gcc -Wall -lelf -ldwarf c2kpe.c -o c2kpe

Synopsis
--------
 $ c2kpe [options] function[+off...@src] [VAR [VAR ...]]
 or
 $ c2kpe [options] @SRC:LINE [VAR [VAR ...]]

 FUNCTION: Probing function name.
 OFFS:     Offset in bytes.
 SRC:      Source file path.
 LINE:     Line number
 VAR:      Local variable name.
 options:
 -r KREL       Kernel release version (e.g. 2.6.31-rc5)
 -m DEBUGINFO  Dwarf-format binary file (vmlinux or kmodule)

Example
-------
 $ c2kpe sys_read fd buf count
 sys_read+0 %di %si %dx

 $ c2kpe @mm/filemap.c:339 inode pos
 sync_page_range+125 -48(%bp) %r14

Example with kprobe-tracer
--------------------------
Since a C expression may be converted into multiple results, I recommend
consuming the output with a read loop.

 $ c2kpe sys_read fd buf count | while read i; do \
   echo "p $i" >> $DEBUGFS/tracing/kprobe_events ;\
   done

Note
----
 - This requires a kernel compiled with CONFIG_DEBUG_INFO.
 - Specifying @SRC speeds up c2kpe, because we can skip CUs which don't
   include the specified SRC file.
 - c2kpe doesn't check whether the offset byte is correctly on an
   instruction boundary. I recommend using the @SRC:LINE expression for
   tracing a function body.
 - This tool doesn't search for kmodule files. You need to specify the
   kmodule file if you want to probe it.

TODO
----
 - Fix bugs.
 - Support multiple probepoints from stdin.
 - Better kmodule support.
 - Use elfutils-libdw?
 - Merge into trace-cmd or perf-tools?

--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America), Inc.
Software Solutions Division

e-mail: mhira...@redhat.com

/*
 * c2kpe : C expression to kprobe event converter
 *
 * Written by Masami Hiramatsu mhira...@redhat.com
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
 *
 */

#include <sys/utsname.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <errno.h>
#include <stdio.h>
#include <unistd.h>
#include <getopt.h>
#include <stdlib.h>
#include <string.h>
#include <libdwarf/dwarf.h>
#include <libdwarf/libdwarf.h>

/* Default vmlinux search paths */
#define NR_SEARCH_PATH 2
const char *default_search_path[NR_SEARCH_PATH] = {
	"/lib/modules/%s/build/vmlinux",		/* Custom build kernel */
	"/usr/lib/debug/lib/modules/%s/vmlinux",	/* Red Hat debuginfo */
};

#define _stringify(n)	#n
#define stringify(n)	_stringify(n)

#ifdef DEBUG
#define debug(fmt ...) \
	fprintf(stderr, "DBG(" __FILE__ ":" stringify(__LINE__) "): " fmt)
#else
#define debug(fmt ...) do {} while (0)
#endif

/* Print the failed condition with its source location, then exit */
#define ERR_IF(cnd) \
	do { if (cnd) { \
		fprintf(stderr, "Error (" __FILE__ ":" stringify(__LINE__) \
			"): " stringify(cnd) "\n"); \
		exit(1); \
	}} while (0)

#define MAX_PATH_LEN 256

/* Dwarf_Die Linkage to parent Die */
struct die_link {
	struct die_link *parent;	/* Parent die */
	Dwarf_Die die;			/* Current die */
};

#define X86_32_MAX_REGS 8
const char *x86_32_regs_table[X86_32_MAX_REGS] = {
	"%ax",
	"%cx",
	"%dx",
	"%bx",
	"sa",	/* Stack address */
	"%bp",
	"%si",
	"%di",
};

#define X86_64_MAX_REGS 16
const char *x86_64_regs_table[X86_64_MAX_REGS] = {
	"%ax",
	"%dx",
	"%cx",
	"%bx",
	"%si",
	"%di",
	"%bp",
	"%sp",
	"%r8",
	"%r9",
	"%r10",
	"%r11",
	"%r12",
	"%r13",
	"%r14",
	"%r15",
};

/* TODO: switching by dwarf address size */
#ifdef __x86_64__
#define ARCH_MAX_REGS X86_64_MAX_REGS
#define arch_regs_table x86_64_regs_table
#else
#define ARCH_MAX_REGS X86_32_MAX_REGS
#define arch_regs_table x86_32_regs_table
#endif

/* Return architecture dependent register string */
static inline const char *get_arch_regstr(unsigned int n)
{
	/* Valid indexes are 0 .. ARCH_MAX_REGS - 1 */
	return (n < ARCH_MAX_REGS) ? arch_regs_table[n] : NULL;
}
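The example output earlier in this posting ("%di", "-48(%bp)") suggests how a register index and an optional offset end up as a fetch argument. The rest of c2kpe.c is cut off in this archive, so the sketch below is a guess at that last formatting step rather than the tool's actual code: regstr() and format_fetcharg() are hypothetical helpers, and only the register names are copied from x86_32_regs_table.

```c
#include <stdio.h>
#include <stddef.h>

#define NR_REGS 8
/* Register names copied from x86_32_regs_table in the posting. */
static const char *regs_table[NR_REGS] = {
	"%ax", "%cx", "%dx", "%bx", "sa", "%bp", "%si", "%di",
};

/* Bounds-checked lookup: valid indexes are 0 .. NR_REGS - 1. */
static const char *regstr(unsigned int n)
{
	return (n < NR_REGS) ? regs_table[n] : NULL;
}

/*
 * Format a fetch argument: a bare register ("%dx") or, when deref is
 * non-zero, a memory fetch relative to it ("-48(%bp)").
 * Returns 0 on success, -1 on an unknown register or a short buffer.
 */
static int format_fetcharg(char *buf, size_t len, unsigned int reg,
			   int deref, long offs)
{
	const char *name = regstr(reg);
	int n;

	if (!name)
		return -1;
	if (deref)
		n = snprintf(buf, len, "%+ld(%s)", offs, name);
	else
		n = snprintf(buf, len, "%s", name);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

The "%+ld" conversion always prints a sign, which matches the +|-offs(FETCHARG) syntax that kprobe_events expects.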
Re: [TOOL] c2kpe: C expression to kprobe event format converter
You rock, this is awesome! I'm a bit busy right now, but I'll play around with it ASAP and will see how well it works for me.
Guest OpenGL Acceleration
Is OpenGL acceleration based on the host's OpenGL capability available in KVM?

Thanks.

Gordan
Disk Emulation and Trim Instruction
With the recent talk of the trim SATA instruction becoming supported in the
upcoming versions of Windows, and claims from Intel that support for it in
their SSDs is imminent, it occurs to me that this would be equally useful in
virtual disk emulation.

Since the disk image is a sparse file, it only ever grows, and eventually it
will grow to its full intended size even if the space actually in use is a
small fraction of the container size. Since the trim instruction tells the
disk that a particular block is no longer used (and can thus be scheduled for
erasing as and when required), the same mechanism could be used to reclaim
space held by the sparse files backing the VM. It would allow for higher
overcommit of disk usage on VM farms.

Is this feature likely to be available in KVM soon?

Gordan