Re: [Qemu-devel] KVM call agenda for July 26

2011-07-26 Thread Stefan Hajnoczi
On Tue, Jul 26, 2011 at 12:30 AM, Juan Quintela quint...@redhat.com wrote:
 Please send in any agenda items you are interested in covering.

0.15.0 release candidate testing
 * http://wiki.qemu.org/Planning/0.15/Testing
 * Please test hosts, targets, subsystems, or features you care about!
 * May just include running the test suite (KVM-Autotest, qemu-iotests, etc)

Stefan


Re: [Qemu-devel] [PATCH] Introduce QEMU_NEW()

2011-07-25 Thread Stefan Hajnoczi
On Mon, Jul 25, 2011 at 9:51 AM, Avi Kivity a...@redhat.com wrote:
 qemu_malloc() is type-unsafe as it returns a void pointer.  Introduce
 QEMU_NEW() (and QEMU_NEWZ()), which return the correct type.

 Signed-off-by: Avi Kivity a...@redhat.com
 ---

 This is part of my memory API patchset, but doesn't really belong there.

  qemu-common.h |    3 +++
  1 files changed, 3 insertions(+), 0 deletions(-)

 diff --git a/qemu-common.h b/qemu-common.h
 index ba55719..66effa3 100644
 --- a/qemu-common.h
 +++ b/qemu-common.h
 @@ -186,6 +186,9 @@ void qemu_free(void *ptr);
  char *qemu_strdup(const char *str);
  char *qemu_strndup(const char *str, size_t size);

 +#define QEMU_NEW(type) ((type *)(qemu_malloc(sizeof(type))))
 +#define QEMU_NEWZ(type) ((type *)(qemu_mallocz(sizeof(type))))

Does this mean we need to duplicate the type name for each allocation?

struct foo *f;

...
f = qemu_malloc(sizeof(*f));

Becomes:

struct foo *f;

...
f = QEMU_NEW(struct foo);

If you ever change the name of the type you have to search-and-replace
these instances.  The idiomatic C way works well; I don't see a reason
to use QEMU_NEW().

Stefan


Re: [PATCH v1 1/1] Submit the codes for QEMU disk I/O limits.

2011-07-25 Thread Stefan Hajnoczi
On Mon, Jul 25, 2011 at 8:08 AM, Zhi Yong Wu zwu.ker...@gmail.com wrote:
 On Fri, Jul 22, 2011 at 6:54 PM, Stefan Hajnoczi stefa...@gmail.com wrote:
 On Fri, Jul 22, 2011 at 10:20 AM, Zhi Yong Wu wu...@linux.vnet.ibm.com 
 wrote:
 +    elapsed_time  = (real_time - bs->slice_start[is_write]) / 1000000000.0;
 +    fprintf(stderr, "real_time = %ld, slice_start = %ld, elapsed_time = %g\n",
 +            real_time, bs->slice_start[is_write], elapsed_time);
 +
 +    bytes_limit        = bps_limit * slice_time;
 +    bytes_disp  = bs->io_disps->bytes[is_write];
 +    if (bs->io_limits->bps[2]) {
 +        bytes_disp += bs->io_disps->bytes[!is_write];
 +    }
 +
 +    bytes_res   = (unsigned) nb_sectors * BDRV_SECTOR_SIZE;
 +    fprintf(stderr, "bytes_limit = %g bytes_disp = %g, bytes_res = %g, elapsed_time = %g\n",
 +            bytes_limit, bytes_disp, bytes_res, elapsed_time);
 +
 +    if (bytes_disp + bytes_res <= bytes_limit) {
 +        if (wait) {
 +            *wait = 0;
 +        }
 +
 +       fprintf(stderr, "bytes_disp + bytes_res <= bytes_limit\n");
 +        return false;
 +    }
 +
 +    /* Calc approx time to dispatch */
 +    wait_time = (bytes_disp + bytes_res - bytes_limit) / bps_limit;
 +    if (!wait_time) {
 +        wait_time = 1;
 +    }
 +
 +    wait_time = wait_time + (slice_time - elapsed_time);
 +    if (wait) {
 +       *wait = wait_time * 1000000000 + 1;
 +    }
 +
 +    return true;
 +}

 After a slice expires all bytes/iops dispatched data is forgotten,
 even if there are still requests queued.  This means that requests
 issued by the guest immediately after a new 100 ms period will be
 issued but existing queued requests will still be waiting.

 And since the queued requests don't get their next chance until later,
 it's possible that they will be requeued because the requests that the
 guest snuck in have brought us to the limit again.

 In order to solve this problem, we need to extend the current slice if
 Extend the current slice? like in-kernel block throttling algorithm?
 Our algorithm seems not to adopt it currently.

I'm not sure if extending the slice is necessary as long as new
requests are queued while previous requests are still queued.  But
extending slices is one way to deal with requests that span across
multiple slices.  See below.

 there are still requests pending.  To prevent extensions from growing
 the slice forever (and keeping too much old data around), it should be
 alright to cull data from 2 slices ago.  The simplest way of doing
 that is to subtract the bps/iops limits from the bytes/iops
 dispatched.
 You mean that the largest value of current_slice_time is not more than
 2 slice_time?

Yes.  If no single request is larger than the I/O limit, then the
timer value for a queued request should always be within the next
slice.  Therefore everything before the last slice should already be
completed and we don't need to keep that history around.
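
To make the culling concrete, here is a minimal sketch of what starting a
new slice could look like (not code from the patch; the helper name and the
0.1 s slice length are my own, the field names are taken from the code
quoted above, and only the per-direction bps limit is shown for brevity):

static void bdrv_start_new_slice(BlockDriverState *bs, bool is_write,
                                 int64_t now)
{
    /* one slice (100 ms) worth of the limit is the most history we keep */
    double max_carry = bs->io_limits->bps[is_write] * 0.1;

    bs->slice_start[is_write] = now;

    /* Subtract instead of zeroing so requests queued near the slice
     * boundary are still accounted for, while the counter cannot grow
     * without bound. */
    if (bs->io_disps->bytes[is_write] > max_carry) {
        bs->io_disps->bytes[is_write] -= max_carry;
    } else {
        bs->io_disps->bytes[is_write] = 0;
    }
}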


 @@ -2129,6 +2341,19 @@ BlockDriverAIOCB *bdrv_aio_readv(BlockDriverState 
 *bs, int64_t sector_num,
     if (bdrv_check_request(bs, sector_num, nb_sectors))
         return NULL;

 +    /* throttling disk read I/O */
 +    if (bs->io_limits != NULL) {
 +       if (bdrv_exceed_io_limits(bs, nb_sectors, false, &wait_time)) {
 +            ret = qemu_block_queue_enqueue(bs->block_queue, bs, bdrv_aio_readv,
 +                               sector_num, qiov, nb_sectors, cb, opaque);

 5 space indent, should be 4.

 +           fprintf(stderr, "bdrv_aio_readv: wait_time = %ld, timer value = %ld\n",
 +                   wait_time, wait_time + qemu_get_clock_ns(vm_clock));
 +           qemu_mod_timer(bs->block_timer, wait_time +
 +                          qemu_get_clock_ns(vm_clock));

 Imagine 3 requests that need to be queued: A, B, and C.  Since the
 timer is unconditionally set each time a request is queued, the timer
 will be set to C's wait_time.  However A or B's wait_time might be
 earlier and we will miss that deadline!
 Yeah, exactly there is this issue.

 We really need a priority queue here.  QEMU's timers solve the same
 problem with a sorted list, which might actually be faster for short
 lists where a fancy data structure has too much overhead.
 You mean the block requests should be handled in FIFO way in order?
 If the block queue is not empty, should this coming request be
 enqueued at first? right?

Yes.  If the limit was previously exceeded, enqueue new requests immediately:

/* If a limit was exceeded, immediately queue this request */
if (!QTAILQ_EMPTY(&queue->requests)) {
    if (limits->bps[IO_LIMIT_TOTAL]) {
        /* queue any rd/wr request */
    } else if (limits->bps[is_write] && another_request_is_queued[is_write]) {
        /* queue if the rd/wr-specific limit was previously exceeded */
    }

    ...same for iops...
}

This way new requests cannot skip ahead of queued requests due to the
lost history when a new slice starts.


 +BlockDriverAIOCB *qemu_block_queue_enqueue(BlockQueue *queue,
 +                       BlockDriverState *bs,
 +                       BlockRequestHandler *handler

Re: virtagent for qemu-kvm ?

2011-07-25 Thread Stefan Hajnoczi
On Tue, Jul 26, 2011 at 6:05 AM, Prateek Sharma prate...@cse.iitb.ac.in wrote:
 Hi all,
    Is there any equivalent of qemu's virtagent in qemu-kvm?
 [http://lists.gnu.org/archive/html/qemu-devel/2011-01/msg02149.html]  .
 In particular, I want to share pages between KVM guests and the host.  Is
 there an appropriate mechanism for this in existence which I could use?
 Another nice feature virtagent provides is the ability to see guest dmesg
 output in the host...

virtagent doesn't share pages between guest and host AFAIK.

Have you looked at hw/ivshmem.c?

Most of the time you don't need to actually share pages between guest
and host.  Use existing mechanisms like networking or serial to
communicate instead.

Stefan


Re: [PATCH v1 1/1] Submit the codes for QEMU disk I/O limits.

2011-07-24 Thread Stefan Hajnoczi
On Mon, Jul 25, 2011 at 5:25 AM, Zhi Yong Wu zwu.ker...@gmail.com wrote:
 On Fri, Jul 22, 2011 at 6:54 PM, Stefan Hajnoczi stefa...@gmail.com wrote:
 On Fri, Jul 22, 2011 at 10:20 AM, Zhi Yong Wu wu...@linux.vnet.ibm.com 
 wrote:
 +static void bdrv_block_timer(void *opaque)
 +{
 +    BlockDriverState *bs = opaque;
 +    BlockQueue *queue = bs->block_queue;
 +    uint64_t intval = 1;
 +
 +    while (!QTAILQ_EMPTY(&queue->requests)) {
 +        BlockIORequest *request;
 +        int ret;
 +
 +        request = QTAILQ_FIRST(&queue->requests);
 +        QTAILQ_REMOVE(&queue->requests, request, entry);
 +
 +        ret = queue->handler(request);

 Please remove the function pointer and call qemu_block_queue_handler()
 directly.  This indirection is not needed and makes it harder to
 follow the code.
 Should it keep the same style as other queue implementations such as the
 network queue? As you know, the network queue has one queue handler.

You mean net/queue.c:queue->deliver?  There are two deliver functions,
qemu_deliver_packet() and qemu_vlan_deliver_packet(), which is why a
function pointer is necessary.

In this case there is only one handler function so the indirection is
not necessary.
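
That is, the loop body can simply call it directly (just restating the
suggestion, not code from the patch):

    ret = qemu_block_queue_handler(request);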


 +        if (ret == 0) {
 +            QTAILQ_INSERT_HEAD(&queue->requests, request, entry);
 +            break;
 +        }
 +
 +        qemu_free(request);
 +    }
 +
 +    qemu_mod_timer(bs->block_timer, (intval * 1000000000) +
 +                   qemu_get_clock_ns(vm_clock));

 intval is always 1.  The timer should be set to the next earliest deadline.
 pls see bdrv_aio_readv/writev:
 +    /* throttling disk read I/O */
 +    if (bs->io_limits != NULL) {
 +       if (bdrv_exceed_io_limits(bs, nb_sectors, false, &wait_time)) {
 +            ret = qemu_block_queue_enqueue(bs->block_queue, bs, bdrv_aio_readv,
 +                               sector_num, qiov, nb_sectors, cb, opaque);
 +           qemu_mod_timer(bs->block_timer, wait_time +
 +                          qemu_get_clock_ns(vm_clock));

 The timer has been modified when the blk request is enqueued.

The current algorithm seems to be:
1. Queue requests that exceed the limit and set the timer.
2. Dequeue all requests when the timer fires.
3. Set 1s periodic timer.

Why is the timer set up as a periodic 1 second timer in bdrv_block_timer()?
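
For illustration, something along these lines at the end of
bdrv_block_timer() would follow the "next earliest deadline" idea (a sketch
only; next_deadline_ns() is a hypothetical helper that recomputes the wait
time of the request left at the head of the queue):

    if (!QTAILQ_EMPTY(&queue->requests)) {
        /* re-arm for the earliest queued deadline, not a fixed 1 s period */
        qemu_mod_timer(bs->block_timer, next_deadline_ns(bs, queue));
    } else {
        /* nothing queued, so no timer is needed until the next enqueue */
        qemu_del_timer(bs->block_timer);
    }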

 +        bs->slice_start[is_write] = real_time;
 +
 +        bs->io_disps->bytes[is_write] = 0;
 +        if (bs->io_limits->bps[2]) {
 +            bs->io_disps->bytes[!is_write] = 0;
 +        }

 All previous data should be discarded when a new time slice starts:
 bs->io_disps->bytes[IO_LIMIT_READ] = 0;
 bs->io_disps->bytes[IO_LIMIT_WRITE] = 0;
 If only bps_rd is specified, bs->io_disps->bytes[IO_LIMIT_WRITE] will
 never be used; I think that it should not be cleared here, right?

I think there is no advantage in keeping slices separate for
read/write.  The code would be simpler and work the same if there is
only one slice and past history is cleared when a new slice starts.
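
A rough sketch of that single-slice variant (the BLOCK_IO_SLICE_TIME
constant and the unindexed slice_start field are assumptions, not code from
the patch):

    if (bs->slice_start + BLOCK_IO_SLICE_TIME <= now) {
        bs->slice_start = now;
        bs->io_disps->bytes[IO_LIMIT_READ]  = 0;
        bs->io_disps->bytes[IO_LIMIT_WRITE] = 0;
    }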

Stefan


Re: [PATCH v1 1/1] Submit the codes for QEMU disk I/O limits.

2011-07-22 Thread Stefan Hajnoczi
On Fri, Jul 22, 2011 at 10:20 AM, Zhi Yong Wu wu...@linux.vnet.ibm.com wrote:
 +static void bdrv_block_timer(void *opaque)
 +{
 +    BlockDriverState *bs = opaque;
 +    BlockQueue *queue = bs->block_queue;
 +    uint64_t intval = 1;
 +
 +    while (!QTAILQ_EMPTY(&queue->requests)) {
 +        BlockIORequest *request;
 +        int ret;
 +
 +        request = QTAILQ_FIRST(&queue->requests);
 +        QTAILQ_REMOVE(&queue->requests, request, entry);
 +
 +        ret = queue->handler(request);

Please remove the function pointer and call qemu_block_queue_handler()
directly.  This indirection is not needed and makes it harder to
follow the code.

 +        if (ret == 0) {
 +            QTAILQ_INSERT_HEAD(&queue->requests, request, entry);
 +            break;
 +        }
 +
 +        qemu_free(request);
 +    }
 +
 +    qemu_mod_timer(bs->block_timer, (intval * 1000000000) +
 +                   qemu_get_clock_ns(vm_clock));

intval is always 1.  The timer should be set to the next earliest deadline.

 +}
 +
  void bdrv_register(BlockDriver *bdrv)
  {
     if (!bdrv->bdrv_aio_readv) {
 @@ -476,6 +508,11 @@ static int bdrv_open_common(BlockDriverState *bs, const 
 char *filename,
         goto free_and_fail;
     }

 +    /* throttling disk I/O limits */
 +    if (bs->io_limits != NULL) {
 +       bs->block_queue = qemu_new_block_queue(qemu_block_queue_handler);
 +    }
 +
  #ifndef _WIN32
     if (bs-is_temporary) {
         unlink(filename);
 @@ -642,6 +679,16 @@ int bdrv_open(BlockDriverState *bs, const char 
 *filename, int flags,
             bs->change_cb(bs->change_opaque, CHANGE_MEDIA);
     }

 +    /* throttling disk I/O limits */
 +    if (bs->io_limits != NULL) {

block_queue is allocated in bdrv_open_common() but these variables are
initialized in bdrv_open().  Can you move them together?  I don't see a
reason to keep them apart.

 +       bs->io_disps    = qemu_mallocz(sizeof(BlockIODisp));
 +       bs->block_timer = qemu_new_timer_ns(vm_clock, bdrv_block_timer, bs);
 +       qemu_mod_timer(bs->block_timer, qemu_get_clock_ns(vm_clock));

Why is the timer being scheduled immediately?  There are no queued requests.

 +
 +       bs->slice_start[0] = qemu_get_clock_ns(vm_clock);
 +       bs->slice_start[1] = qemu_get_clock_ns(vm_clock);
 +    }
 +
     return 0;

  unlink_and_fail:
 @@ -680,6 +727,15 @@ void bdrv_close(BlockDriverState *bs)
         if (bs->change_cb)
             bs->change_cb(bs->change_opaque, CHANGE_MEDIA);
     }
 +
 +    /* throttling disk I/O limits */
 +    if (bs->block_queue) {
 +       qemu_del_block_queue(bs->block_queue);

3 space indent, should be 4 spaces.

 +    }
 +
 +    if (bs->block_timer) {

qemu_free_timer() will only free() the timer memory but it will not
cancel the timer.  Use qemu_del_timer() first to ensure that the timer
is not pending:

qemu_del_timer(bs->block_timer);

 +        qemu_free_timer(bs->block_timer);
 +    }
  }

  void bdrv_close_all(void)
 @@ -1312,6 +1368,13 @@ void bdrv_get_geometry_hint(BlockDriverState *bs,
    *psecs = bs->secs;
  }

 +/* throttling disk io limits */
 +void bdrv_set_io_limits(BlockDriverState *bs,
 +                           BlockIOLimit *io_limits)
 +{
 +    bs->io_limits            = io_limits;

This function takes ownership of io_limits but never frees it.  I
suggest not taking ownership and just copying io_limits into
bs->io_limits so that the caller does not need to malloc() and the
lifecycle of bs->io_limits is completely under our control.

Easiest would be to turn BlockDriverState.io_limits into:

BlockIOLimit io_limits;

and just copy in bdrv_set_io_limits():

bs->io_limits = *io_limits;

bdrv_exceed_io_limits() returns quite quickly if no limits are set, so
I wouldn't worry about optimizing it out yet.
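
A minimal sketch of that by-value version (the io_limits_enabled flag is my
own addition here, standing in for the current NULL-pointer checks):

void bdrv_set_io_limits(BlockDriverState *bs, BlockIOLimit *io_limits)
{
    bs->io_limits         = *io_limits;  /* plain struct copy, no ownership */
    bs->io_limits_enabled = true;        /* replaces bs->io_limits != NULL */
}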

 +}
 +
  /* Recognize floppy formats */
  typedef struct FDFormat {
     FDriveType drive;
 @@ -2111,6 +2174,154 @@ char *bdrv_snapshot_dump(char *buf, int buf_size, 
 QEMUSnapshotInfo *sn)
     return buf;
  }

 +bool bdrv_exceed_bps_limits(BlockDriverState *bs, int nb_sectors,
 +                           bool is_write, uint64_t *wait) {
 +    int64_t  real_time;
 +    uint64_t bps_limit   = 0;
 +    double   bytes_limit, bytes_disp, bytes_res, elapsed_time;
 +    double   slice_time = 0.1, wait_time;
 +
 +    if (bs->io_limits->bps[2]) {
 +        bps_limit = bs->io_limits->bps[2];

Please define a constant (IO_LIMIT_TOTAL?) instead of using magic number 2.

 +    } else if (bs->io_limits->bps[is_write]) {
 +        bps_limit = bs->io_limits->bps[is_write];
 +    } else {
 +        if (wait) {
 +            *wait = 0;
 +        }
 +
 +        return false;
 +    }
 +
 +    real_time = qemu_get_clock_ns(vm_clock);
 +    if (bs->slice_start[is_write] + 100000000 <= real_time) {

Please define a constant for the 100 ms time slice instead of using
100000000 directly.

 +        bs->slice_start[is_write] = real_time;
 +
 +        bs->io_disps->bytes[is_write] = 0;
 +        if (bs->io_limits->bps[2]) {
 +            bs->io_disps->bytes[!is_write] = 0;
 +        }

All previous 

[RFC 1/2] KVM: Record instruction set in all vmexit tracepoints

2011-07-22 Thread Stefan Hajnoczi
The kvm_exit tracepoint recently added the isa argument to aid decoding
exit_reason.  The semantics of exit_reason depend on the instruction set
(vmx or svm) and the isa argument allows traces to be analyzed on other
machines.

Add the isa argument to kvm_nested_vmexit and kvm_nested_vmexit_inject
so these tracepoints can also be self-describing.

Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com
---
 arch/x86/kvm/svm.c   |6 --
 arch/x86/kvm/trace.h |   12 
 2 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 475d1c9..6adb7ba 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -2182,7 +2182,8 @@ static int nested_svm_vmexit(struct vcpu_svm *svm)
   vmcb->control.exit_info_1,
   vmcb->control.exit_info_2,
   vmcb->control.exit_int_info,
-  vmcb->control.exit_int_info_err);
+  vmcb->control.exit_int_info_err,
+  KVM_ISA_SVM);
 
 nested_vmcb = nested_svm_map(svm, svm->nested.vmcb, &page);
if (!nested_vmcb)
@@ -3335,7 +3336,8 @@ static int handle_exit(struct kvm_vcpu *vcpu)
 svm->vmcb->control.exit_info_1,
 svm->vmcb->control.exit_info_2,
 svm->vmcb->control.exit_int_info,
-   svm->vmcb->control.exit_int_info_err);
+   svm->vmcb->control.exit_int_info_err,
+   KVM_ISA_SVM);
 
vmexit = nested_svm_exit_special(svm);
 
diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index 3ff898c..4e1716b 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -486,9 +486,9 @@ TRACE_EVENT(kvm_nested_intercepts,
 TRACE_EVENT(kvm_nested_vmexit,
TP_PROTO(__u64 rip, __u32 exit_code,
 __u64 exit_info1, __u64 exit_info2,
-__u32 exit_int_info, __u32 exit_int_info_err),
+__u32 exit_int_info, __u32 exit_int_info_err, __u32 isa),
TP_ARGS(rip, exit_code, exit_info1, exit_info2,
-   exit_int_info, exit_int_info_err),
+   exit_int_info, exit_int_info_err, isa),
 
TP_STRUCT__entry(
__field(__u64,  rip )
@@ -497,6 +497,7 @@ TRACE_EVENT(kvm_nested_vmexit,
__field(__u64,  exit_info2  )
__field(__u32,  exit_int_info   )
__field(__u32,  exit_int_info_err   )
+   __field(__u32,  isa )
),
 
TP_fast_assign(
@@ -506,6 +507,7 @@ TRACE_EVENT(kvm_nested_vmexit,
__entry->exit_info2 = exit_info2;
__entry->exit_int_info  = exit_int_info;
__entry->exit_int_info_err  = exit_int_info_err;
+   __entry->isa = isa;
),
TP_printk("rip: 0x%016llx reason: %s ext_inf1: 0x%016llx "
  "ext_inf2: 0x%016llx ext_int: 0x%08x ext_int_err: 0x%08x",
@@ -522,9 +524,9 @@ TRACE_EVENT(kvm_nested_vmexit,
 TRACE_EVENT(kvm_nested_vmexit_inject,
TP_PROTO(__u32 exit_code,
 __u64 exit_info1, __u64 exit_info2,
-__u32 exit_int_info, __u32 exit_int_info_err),
+__u32 exit_int_info, __u32 exit_int_info_err, __u32 isa),
TP_ARGS(exit_code, exit_info1, exit_info2,
-   exit_int_info, exit_int_info_err),
+   exit_int_info, exit_int_info_err, isa),
 
TP_STRUCT__entry(
__field(__u32,  exit_code   )
@@ -532,6 +534,7 @@ TRACE_EVENT(kvm_nested_vmexit_inject,
__field(__u64,  exit_info2  )
__field(__u32,  exit_int_info   )
__field(__u32,  exit_int_info_err   )
+   __field(__u32,  isa )
),
 
TP_fast_assign(
@@ -540,6 +543,7 @@ TRACE_EVENT(kvm_nested_vmexit_inject,
__entry->exit_info2 = exit_info2;
__entry->exit_int_info  = exit_int_info;
__entry->exit_int_info_err  = exit_int_info_err;
+   __entry->isa = isa;
),
 
TP_printk("reason: %s ext_inf1: 0x%016llx "
-- 
1.7.5.4



[RFC 0/2] KVM: Fix kvm_exit trace event format

2011-07-22 Thread Stefan Hajnoczi
Currently neither perf nor trace-cmd can parse the kvm:kvm_exit trace event
format.  This patch is an attempt to make formatting work without changing the
kvm:kvm_exit prototype.  Since this event is a core KVM operation, no doubt
there are existing trace analysis scripts that rely on it and I don't want to
break them.

Patch 1 adjusts vmexit-related tracepoints so that they can be fixed too.

Patch 2 replaces ftrace_print_symbols_seq() with __print_symbolic().  This
means all information necessary for formatting the exit_reason field is now
part of the trace event's format.  In theory userspace tools should now work.

In practice both perf and trace-cmd are not happy with the new exit_reason
formatting expression (omitting the details and split across lines for easy
email reading here):

print fmt: "reason %s rip 0x%lx info %llx %llx",
   (REC->isa == 1) ?
   __print_symbolic(REC->exit_reason, { 0, "EXCEPTION_NMI" }, ...) :
   __print_symbolic(REC->exit_reason, { 0x000, "read_cr0" }, ...),
   REC->guest_rip, REC->info1, REC->info2

perf script says:

  Warning: Error: expected type 5 but read 4
  Warning: Error: expected type 5 but read 0
  Warning: unknown op '}'

kvm  2696 [001]   289.850941: kvm_exit: EVENT 'kvm_exit' FAILED TO PARSE

trace-cmd says:

  Error: expected type 5 but read 4
  Error: expected type 5 but read 0
  failed to read event print fmt for kvm_exit

kvm-2696  [000]  1451.564092: kvm_exit: [FAILED TO PARSE] exit_reason=44 
guest_rip=0xc01151a8 isa=1 info1=4272 info2=0

I'd really like to make perf and trace-cmd just work with kvm:kvm_exit.  Any
suggestions other than improving the parsers in the respective tools?

Stefan Hajnoczi (2):
  KVM: Record instruction set in all vmexit tracepoints
  KVM: Use __print_symbolic() for vmexit tracepoints

 arch/x86/include/asm/kvm_host.h |2 -
 arch/x86/kvm/svm.c  |   61 +---
 arch/x86/kvm/trace.h|  118 +++---
 arch/x86/kvm/vmx.c  |   44 --
 4 files changed, 112 insertions(+), 113 deletions(-)

-- 
1.7.5.4



[RFC 2/2] KVM: Use __print_symbolic() for vmexit tracepoints

2011-07-22 Thread Stefan Hajnoczi
The vmexit tracepoints format the exit_reason to make it human-readable.
Since the exit_reason depends on the instruction set (vmx or svm),
formatting is handled with ftrace_print_symbols_seq() by referring to
the appropriate exit reason table.

However, the ftrace_print_symbols_seq() function is not meant to be used
directly in tracepoints since it does not export the formatting table
which userspace tools like trace-cmd and perf use to format traces.

In practice perf dies when formatting vmexit-related events and
trace-cmd falls back to printing the numeric value (with extra
formatting code in the kvm plugin to paper over this limitation).  Other
userspace consumers of vmexit-related tracepoints would be in similar
trouble.

To avoid significant changes to the kvm_exit tracepoint, this patch
moves the vmx and svm exit reason tables into arch/x86/kvm/trace.h and
selects the right table with __print_symbolic() depending on the
instruction set.  Note that __print_symbolic() is designed for exporting
the formatting table to userspace and allows trace-cmd and perf to work.

Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com
---
 arch/x86/include/asm/kvm_host.h |2 -
 arch/x86/kvm/svm.c  |   55 
 arch/x86/kvm/trace.h|  106 --
 arch/x86/kvm/vmx.c  |   44 
 4 files changed, 100 insertions(+), 107 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index dd51c83..582c3b7 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -634,8 +634,6 @@ struct kvm_x86_ops {
int (*check_intercept)(struct kvm_vcpu *vcpu,
   struct x86_instruction_info *info,
   enum x86_intercept_stage stage);
-
-   const struct trace_print_flags *exit_reasons_str;
 };
 
 struct kvm_arch_async_pf {
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 6adb7ba..2b24a88 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -3899,60 +3899,6 @@ static void svm_set_supported_cpuid(u32 func, struct 
kvm_cpuid_entry2 *entry)
}
 }
 
-static const struct trace_print_flags svm_exit_reasons_str[] = {
-   { SVM_EXIT_READ_CR0,    "read_cr0" },
-   { SVM_EXIT_READ_CR3,    "read_cr3" },
-   { SVM_EXIT_READ_CR4,    "read_cr4" },
-   { SVM_EXIT_READ_CR8,    "read_cr8" },
-   { SVM_EXIT_WRITE_CR0,   "write_cr0" },
-   { SVM_EXIT_WRITE_CR3,   "write_cr3" },
-   { SVM_EXIT_WRITE_CR4,   "write_cr4" },
-   { SVM_EXIT_WRITE_CR8,   "write_cr8" },
-   { SVM_EXIT_READ_DR0,    "read_dr0" },
-   { SVM_EXIT_READ_DR1,    "read_dr1" },
-   { SVM_EXIT_READ_DR2,    "read_dr2" },
-   { SVM_EXIT_READ_DR3,    "read_dr3" },
-   { SVM_EXIT_WRITE_DR0,   "write_dr0" },
-   { SVM_EXIT_WRITE_DR1,   "write_dr1" },
-   { SVM_EXIT_WRITE_DR2,   "write_dr2" },
-   { SVM_EXIT_WRITE_DR3,   "write_dr3" },
-   { SVM_EXIT_WRITE_DR5,   "write_dr5" },
-   { SVM_EXIT_WRITE_DR7,   "write_dr7" },
-   { SVM_EXIT_EXCP_BASE + DB_VECTOR,   "DB excp" },
-   { SVM_EXIT_EXCP_BASE + BP_VECTOR,   "BP excp" },
-   { SVM_EXIT_EXCP_BASE + UD_VECTOR,   "UD excp" },
-   { SVM_EXIT_EXCP_BASE + PF_VECTOR,   "PF excp" },
-   { SVM_EXIT_EXCP_BASE + NM_VECTOR,   "NM excp" },
-   { SVM_EXIT_EXCP_BASE + MC_VECTOR,   "MC excp" },
-   { SVM_EXIT_INTR,    "interrupt" },
-   { SVM_EXIT_NMI, "nmi" },
-   { SVM_EXIT_SMI, "smi" },
-   { SVM_EXIT_INIT,    "init" },
-   { SVM_EXIT_VINTR,   "vintr" },
-   { SVM_EXIT_CPUID,   "cpuid" },
-   { SVM_EXIT_INVD,    "invd" },
-   { SVM_EXIT_HLT, "hlt" },
-   { SVM_EXIT_INVLPG,  "invlpg" },
-   { SVM_EXIT_INVLPGA, "invlpga" },
-   { SVM_EXIT_IOIO,    "io" },
-   { SVM_EXIT_MSR, "msr" },
-   { SVM_EXIT_TASK_SWITCH, "task_switch" },
-   { SVM_EXIT_SHUTDOWN,    "shutdown" },
-   { SVM_EXIT_VMRUN,   "vmrun" },
-   { SVM_EXIT_VMMCALL, "hypercall" },
-   { SVM_EXIT_VMLOAD,  "vmload" },
-   { SVM_EXIT_VMSAVE,  "vmsave" },
-   { SVM_EXIT_STGI,    "stgi" },
-   { SVM_EXIT_CLGI,    "clgi" },
-   { SVM_EXIT_SKINIT,  "skinit" },
-   { SVM_EXIT_WBINVD,  wbinvd

Re: [Qemu-devel] [RFC] New thread for the VM migration

2011-07-14 Thread Stefan Hajnoczi
On Thu, Jul 14, 2011 at 9:36 AM, Avi Kivity a...@redhat.com wrote:
 On 07/14/2011 10:14 AM, Umesh Deshpande wrote:
 @@ -260,10 +260,15 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int
 stage, void *opaque)
          return 0;
      }

 +    if (stage != 3)
 +        qemu_mutex_lock_iothread();

 Please read CODING_STYLE, especially the bit about braces.

Please use scripts/checkpatch.pl to check coding style before
submitting patches to the list.

You can also set git's pre-commit hook to automatically run checkpatch.pl:
http://blog.vmsplice.net/2011/03/how-to-automatically-run-checkpatchpl.html

Stefan


Re: [PATCH 0/4] scsi fixes

2011-07-11 Thread Stefan Hajnoczi
On Mon, Jul 11, 2011 at 2:02 PM, Hannes Reinecke h...@suse.de wrote:
 Hi all,

 these are some fixes I found during debugging my megasas HBA emulation.
 This time I've sent them as a separate patchset for inclusion.
 All of them have been acked, so please apply.

Are SCSI patches going through Kevin's tree?

If not, perhaps Paolo or I should keep a tree and start doing some
sanity testing on the subsystem in the future.

Stefan


Re: [Qemu-devel] KVM call agenda for June 28

2011-07-07 Thread Stefan Hajnoczi
On Tue, Jul 5, 2011 at 7:18 PM, Marcelo Tosatti mtosa...@redhat.com wrote:
 On Tue, Jul 05, 2011 at 04:37:08PM +0100, Stefan Hajnoczi wrote:
 On Tue, Jul 5, 2011 at 3:32 PM, Marcelo Tosatti mtosa...@redhat.com wrote:
  On Tue, Jul 05, 2011 at 04:39:06PM +0300, Dor Laor wrote:
  On 07/05/2011 03:58 PM, Marcelo Tosatti wrote:
  On Tue, Jul 05, 2011 at 01:40:08PM +0100, Stefan Hajnoczi wrote:
  On Tue, Jul 5, 2011 at 9:01 AM, Dor Laordl...@redhat.com  wrote:
  I tried to re-arrange all of the requirements and use cases using this 
  wiki
  page: http://wiki.qemu.org/Features/LiveBlockMigration
  
  It would be the best to agree upon the most interesting use cases 
  (while we
  make sure we cover future ones) and agree to them.
  The next step is to set the interface for all the various verbs since 
  the
  implementation seems to be converging.
  
  Live block copy was supposed to support snapshot merge.  I think the
  current favored approach is to make the source image a backing file to
  the destination image and essentially do image streaming.
  
  Using this mechanism for snapshot merge is tricky.  The COW file
  already uses the read-only snapshot base image.  So now we cannot
 trivially copy the COW file contents back into the snapshot base image
  using live block copy.
  
  It never did. Live copy creates a new image where both snapshot and
  current are copied to.
  
  This is similar with image streaming.
 
  Not sure I realize what's bad to do in-place merge:
 
  Let's suppose we have this COW chain:
 
    base -- s1 -- s2
 
  Now a live snapshot is created over s2, s2 becomes RO and s3 is RW:
 
    base -- s1 -- s2 -- s3
 
  Now we've done with s2 (post backup) and like to merge s3 into s2.
 
  With your approach we use live copy of s3 into newSnap:
 
    base -- s1 -- s2 -- s3
    base -- s1 -- newSnap
 
  When it is over s2 and s3 can be erased.
  The down side is the IOs for copying s2 data and the temporary
  storage. I guess temp storage is cheap but excessive IO are
  expensive.
 
  My approach was to collapse s3 into s2 and erase s3 eventually:
 
  before: base -- s1 -- s2 -- s3
  after:  base -- s1 -- s2
 
  If we use live block copy using mirror driver it should be safe as
  long as we keep the ordering of new writes into s3 during the
  execution.
  Even a failure in the middle won't cause harm since the
  management will keep using s3 until it gets success event.
 
  Well, it is more complicated than simply streaming into a new
  image. I'm not entirely sure it is necessary. The common case is:
 
  base - sn-1 - sn-2 - ... - sn-n
 
  When n reaches a limit, you do:
 
  base - merge-1
 
  You're potentially copying a similar amount of data when merging back into
  a single image (and you can't easily merge multiple snapshots).
 
  If the amount of data that's not in 'base' is large, you create
  leave a new external file around:
 
  base - merge-1 - sn-1 - sn-2 ... - sn-n
  to
  base - merge-1 - merge-2
 
  
  It seems like snapshot merge will require dedicated code that reads
  the allocated clusters from the COW file and writes them back into the
  base image.
  
  A very inefficient alternative would be to create a third image, the
  merge image file, which has the COW file as its backing file:
  snapshot (base) -  cow -  merge
 
  Remember there is a 'base' before snapshot, you don't copy the entire
  image.

 One use case I have in mind is the Live Backup approach that Jagane
 has been developing.  Here the backup solution only creates a snapshot
 for the period of time needed to read out the dirty blocks.  Then the
 snapshot is deleted again and probably contains very little new data
 relative to the base image.  The backup solution does this operation
 every day.

 This is the pathological case for any approach that copies the entire
 base into a new file.  We could have avoided a lot of I/O by doing an
 in-place update.

 I want to make sure this works well.

 This use case does not fit the streaming scheme that has come up.  It's a
 completely different operation.

 IMO it should be implemented separately.

Okay, not everything can fit into this one grand unified block
copy/image streaming mechanism :).

Stefan


Re: Resize Hard Disk of VM

2011-07-06 Thread Stefan Hajnoczi
On Wed, Jul 6, 2011 at 10:15 AM, Kaushal Shriyan
kaushalshri...@gmail.com wrote:
 Is there a way to resize the Hard Disk of VM ?

You can use qemu-img resize on a disk image that is currently not in use:
qemu-img resize filename +10G

Or you can use libguestfs to get a parted-style resize:
http://libguestfs.org/virt-resize.1.html

Stefan


Re: KVM autotest tip of the week

2011-07-06 Thread Stefan Hajnoczi
On Wed, Jul 6, 2011 at 5:28 AM, Lucas Meneghel Rodrigues l...@redhat.com 
wrote:
 In an effort to make KVM autotest better and more useful for our target
 users (KVM developers), we are expanding documentation and writing
 articles on how to get some commonly asked test jobs running (I plan on
 getting this going at least for some weeks).

Basic question for a future week:
How do I run a single test on a specific VM configuration (e.g. Fedora
guest, rtl8139, virtio-blk disk)?

Ideally this would use a pre-installed guest image (for userspace
tests).  This would be very handy for reproducing test failures,
developing new tests, or just wanting to make sure a code change
doesn't break a specific test.

Stefan


Re: [Qemu-devel] [PATCH 2/3] Add fno-strict-overflow

2011-07-05 Thread Stefan Hajnoczi
On Tue, Jul 5, 2011 at 6:41 AM, Stefan Hajnoczi stefa...@gmail.com wrote:
 On Mon, Jul 4, 2011 at 11:38 PM, Peter Maydell peter.mayd...@linaro.org 
 wrote:
 On 4 July 2011 23:00, Raghavendra D Prabhu raghu.prabh...@gmail.com wrote:
 This is to avoid gcc optimizing out the comparison in assert,
 due to assumption of signed overflow being undefined by default 
 (-Werror=strict-overflow).

--- a/Makefile.hw
+++ b/Makefile.hw
@@ -9,7 +9,7 @@ include $(SRC_PATH)/rules.mak

 $(call set-vpath, $(SRC_PATH):$(SRC_PATH)/hw)

 -QEMU_CFLAGS+=-I.. -I$(SRC_PATH)/fpu
 +QEMU_CFLAGS+=-I.. -I$(SRC_PATH)/fpu -fno-strict-overflow

 Can you give a more detailed description of the problem this is trying
 to solve? I think it would be nicer if we could remove the assumptions
 about signed overflows instead, if that's practical.

 (Also, if we do want to add this compiler flag then it ought to be
 done in configure I think, as we do for -fno-strict-aliasing.)

 a correct C/C++ program must never generate signed overflow when
 computing an expression. It also means that a compiler may assume that
 a program will never generate signed overflow.

 http://www.airs.com/blog/archives/120

You can check out the warnings that gcc raises with ./configure
--extra-cflags="-Wstrict-overflow -fstrict-overflow" --disable-werror.

Either we'd have to fix the warnings or we could use
-fno-strict-overflow/-fwrapv.

This patch seems reasonable to me.  We're telling gcc not to take
advantage of the undefined behavior of signed overflow.  It also means
QEMU code is assuming two's complement representation and wrapping on
overflow, but that is a common assumption (what QEMU-capable hardware
doesn't?).

Reviewed-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com


Re: [Qemu-devel] KVM call agenda for June 28

2011-07-05 Thread Stefan Hajnoczi
On Tue, Jul 5, 2011 at 9:01 AM, Dor Laor dl...@redhat.com wrote:
 I tried to re-arrange all of the requirements and use cases using this wiki
 page: http://wiki.qemu.org/Features/LiveBlockMigration

 It would be the best to agree upon the most interesting use cases (while we
 make sure we cover future ones) and agree to them.
 The next step is to set the interface for all the various verbs since the
 implementation seems to be converging.

Live block copy was supposed to support snapshot merge.  I think the
current favored approach is to make the source image a backing file to
the destination image and essentially do image streaming.

Using this mechanism for snapshot merge is tricky.  The COW file
already uses the read-only snapshot base image.  So now we cannot
trivially copy the COW file contents back into the snapshot base image
using live block copy.

It seems like snapshot merge will require dedicated code that reads
the allocated clusters from the COW file and writes them back into the
base image.

A very inefficient alternative would be to create a third image, the
merge image file, which has the COW file as its backing file:
snapshot (base) - cow - merge

All data from snapshot and cow is copied into merge and then snapshot
and cow can be deleted.  But this approach results in full data
copying and uses potentially 3x space if cow is close to the size of
snapshot.

Any other ideas that reuse live block copy for snapshot merge?

Stefan


Re: [Qemu-devel] [PATCH 5/5] megasas: LSI Megaraid SAS emulation

2011-07-05 Thread Stefan Hajnoczi
On Tue, Jul 5, 2011 at 12:03 PM, Hannes Reinecke h...@suse.de wrote:
 +static void megasas_unmap_sgl(struct megasas_cmd_t *cmd)
 +{
 +    uint16_t flags = le16_to_cpu(cmd->frame->header.flags);
 +    int i, is_write = (flags & MFI_FRAME_DIR_WRITE) ? 1 : 0;
 +
 +    for (i = 0; i < cmd->frame->header.sge_count; i++) {
 +        cpu_physical_memory_unmap(cmd->iov[i].iov_base, cmd->iov[i].iov_len,
 +                                  is_write, cmd->iov[i].iov_len);
 +    }

We cannot map control structures from guest memory and then treat them
as valid request state later on.

A malicious guest can issue the request, then change the fields of the
control structure while QEMU is processing the I/O, and then this
function will execute with is_write/sge_count no longer the same as
when the request started.

Good practice would be to copy in any request state needed instead of
reaching into guest memory at later points of the request lifecycle.
This way a malicious guest can never cause QEMU to crash or do
something due to inconsistent state.

The particular problem I see here is starting the request with
sge_count=1 and then setting it to sge_count=255.  We will perform
invalid iov[] accesses.
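
A sketch of the copy-in idea (struct and field names are assumptions, not
the actual megasas code): capture the header fields once when the command
is submitted, validate them, and only ever use the copies afterwards.

typedef struct MegasasRequestState {
    uint16_t flags;      /* copied from the frame header at submit time */
    uint8_t  sge_count;  /* copied and bounds-checked at submit time */
} MegasasRequestState;

static int megasas_capture_request_state(MegasasRequestState *state,
                                         uint16_t flags, uint8_t sge_count,
                                         unsigned int max_sge)
{
    state->flags = flags;
    if (sge_count > max_sge) {
        return -1;   /* reject rather than trust guest memory later */
    }
    state->sge_count = sge_count;
    return 0;
}

The unmap path would then loop over state->sge_count, which the guest can
no longer change behind QEMU's back.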

Stefan


Re: [Qemu-devel] KVM call agenda for June 28

2011-07-05 Thread Stefan Hajnoczi
On Tue, Jul 5, 2011 at 3:32 PM, Marcelo Tosatti mtosa...@redhat.com wrote:
 On Tue, Jul 05, 2011 at 04:39:06PM +0300, Dor Laor wrote:
 On 07/05/2011 03:58 PM, Marcelo Tosatti wrote:
 On Tue, Jul 05, 2011 at 01:40:08PM +0100, Stefan Hajnoczi wrote:
 On Tue, Jul 5, 2011 at 9:01 AM, Dor Laordl...@redhat.com  wrote:
 I tried to re-arrange all of the requirements and use cases using this 
 wiki
 page: http://wiki.qemu.org/Features/LiveBlockMigration
 
 It would be the best to agree upon the most interesting use cases (while 
 we
 make sure we cover future ones) and agree to them.
 The next step is to set the interface for all the various verbs since the
 implementation seems to be converging.
 
 Live block copy was supposed to support snapshot merge.  I think the
 current favored approach is to make the source image a backing file to
 the destination image and essentially do image streaming.
 
 Using this mechanism for snapshot merge is tricky.  The COW file
 already uses the read-only snapshot base image.  So now we cannot
 trivially copy the COW file contents back into the snapshot base image
 using live block copy.
 
 It never did. Live copy creates a new image where both snapshot and
 current are copied to.
 
 This is similar with image streaming.

 Not sure I realize what's bad to do in-place merge:

 Let's suppose we have this COW chain:

   base -- s1 -- s2

 Now a live snapshot is created over s2, s2 becomes RO and s3 is RW:

   base -- s1 -- s2 -- s3

 Now we've done with s2 (post backup) and like to merge s3 into s2.

 With your approach we use live copy of s3 into newSnap:

   base -- s1 -- s2 -- s3
   base -- s1 -- newSnap

 When it is over s2 and s3 can be erased.
 The down side is the IOs for copying s2 data and the temporary
 storage. I guess temp storage is cheap but excessive IO are
 expensive.

 My approach was to collapse s3 into s2 and erase s3 eventually:

 before: base -- s1 -- s2 -- s3
 after:  base -- s1 -- s2

 If we use live block copy using mirror driver it should be safe as
 long as we keep the ordering of new writes into s3 during the
 execution.
 Even a failure in the middle won't cause harm since the
 management will keep using s3 until it gets success event.

 Well, it is more complicated than simply streaming into a new
 image. I'm not entirely sure it is necessary. The common case is:

 base - sn-1 - sn-2 - ... - sn-n

 When n reaches a limit, you do:

 base - merge-1

 You're potentially copying a similar amount of data when merging back into
 a single image (and you can't easily merge multiple snapshots).

 If the amount of data that's not in 'base' is large, you create
 leave a new external file around:

 base - merge-1 - sn-1 - sn-2 ... - sn-n
 to
 base - merge-1 - merge-2

 
 It seems like snapshot merge will require dedicated code that reads
 the allocated clusters from the COW file and writes them back into the
 base image.
 
 A very inefficient alternative would be to create a third image, the
 merge image file, which has the COW file as its backing file:
 snapshot (base) -  cow -  merge

 Remember there is a 'base' before snapshot, you don't copy the entire
 image.

One use case I have in mind is the Live Backup approach that Jagane
has been developing.  Here the backup solution only creates a snapshot
for the period of time needed to read out the dirty blocks.  Then the
snapshot is deleted again and probably contains very little new data
relative to the base image.  The backup solution does this operation
every day.

This is the pathological case for any approach that copies the entire
base into a new file.  We could have avoided a lot of I/O by doing an
in-place update.

I want to make sure this works well.

Stefan


Re: [Qemu-devel] [PATCH 2/3] Add fno-strict-overflow

2011-07-05 Thread Stefan Hajnoczi
On Tue, Jul 5, 2011 at 4:36 PM, Raghavendra D Prabhu
raghu.prabh...@gmail.com wrote:
 * On Mon, Jul 04, 2011 at 11:38:30PM +0100, Peter Maydell
 peter.mayd...@linaro.org wrote:

 On 4 July 2011 23:00, Raghavendra D Prabhu raghu.prabh...@gmail.com
 wrote:

 This is to avoid gcc optimizing out the comparison in assert,
 due to assumption of signed overflow being undefined by default
 (-Werror=strict-overflow).

 --- a/Makefile.hw
 +++ b/Makefile.hw
 @@ -9,7 +9,7 @@ include $(SRC_PATH)/rules.mak

 $(call set-vpath, $(SRC_PATH):$(SRC_PATH)/hw)

 -QEMU_CFLAGS+=-I.. -I$(SRC_PATH)/fpu
 +QEMU_CFLAGS+=-I.. -I$(SRC_PATH)/fpu -fno-strict-overflow

 Can you give a more detailed description of the problem this is trying
 to solve? I think it would be nicer if we could remove the assumptions
 about signed overflows instead, if that's practical.

 Following line in pcie.c:pcie_add_capability:505

    assert(offset < offset + size);

 is what the compiler was warning about. The compiler optimizes out that
 comparison without the -fno-strict-overflow flag. More information about it
 is here -  http://www.airs.com/blog/archives/120 -- as already mentioned by
 Stefan.

 (Also, if we do want to add this compiler flag then it ought to be
 done in configure I think, as we do for -fno-strict-aliasing.)

 Globally adding that flag can limit the optimizations of gcc since in
 other places (loops) the undefined behavior can be advantageous, hence
 added only to Makefile.hw.

Doing this on a per-subsystem or per-file basis does not make sense to
me.  This is a general C coding issue that needs to be settled for the
entire codebase.  We will not catch instances of overflow slipping in
during patch review, so limiting the scope of -fno-strict-overflow is
not feasible.

I suggest we cover all of QEMU with -fwrapv instead of worrying about
-fno-strict-overflow.  That way we can get some optimizations and it
reflects the model that we are all assuming:
This option instructs the compiler to assume that signed arithmetic
overflow of addition, subtraction and multiplication wraps around
using twos-complement representation. This flag enables some
optimizations and disables others. This option is enabled by default
for the Java front-end, as required by the Java language
specification.
http://gcc.gnu.org/onlinedocs/gcc-4.4.2/gcc/Code-Gen-Options.html
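
To make the difference concrete, here is a contrived illustration (not QEMU
code): with -fwrapv the following overflow check is well defined and must
work, while under the default rules the compiler may assume the overflow
cannot happen and delete the branch entirely.

    int next = x + 1;          /* wraps to INT_MIN when x == INT_MAX */
    if (next < x) {
        /* reliably reached on wrap-around only when -fwrapv is in effect */
    }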

Stefan


Re: virtio scsi host draft specification, v3

2011-07-04 Thread Stefan Hajnoczi
On Mon, Jul 4, 2011 at 2:38 PM, Hai Dong,Li haido...@linux.vnet.ibm.com wrote:
 So if I understand correctly, virtio-scsi looks like a SCSI transport
 protocol,
 such as iSCSI, FCP and SRP which use tcp/ip, FC and Infiniband RDMA
 respectively as the transfer media, while virtio-scsi uses virtio, a virtual
 I/O channel, as the transfer medium?

Correct.

Stefan


Re: [Qemu-devel] [PATCH 2/3] Add fno-strict-overflow

2011-07-04 Thread Stefan Hajnoczi
On Mon, Jul 4, 2011 at 11:38 PM, Peter Maydell peter.mayd...@linaro.org wrote:
 On 4 July 2011 23:00, Raghavendra D Prabhu raghu.prabh...@gmail.com wrote:
 This is to avoid gcc optimizing out the comparison in assert,
 due to assumption of signed overflow being undefined by default 
 (-Werror=strict-overflow).

--- a/Makefile.hw
+++ b/Makefile.hw
@@ -9,7 +9,7 @@ include $(SRC_PATH)/rules.mak

 $(call set-vpath, $(SRC_PATH):$(SRC_PATH)/hw)

 -QEMU_CFLAGS+=-I.. -I$(SRC_PATH)/fpu
 +QEMU_CFLAGS+=-I.. -I$(SRC_PATH)/fpu -fno-strict-overflow

 Can you give a more detailed description of the problem this is trying
 to solve? I think it would be nicer if we could remove the assumptions
 about signed overflows instead, if that's practical.

 (Also, if we do want to add this compiler flag then it ought to be
 done in configure I think, as we do for -fno-strict-aliasing.)

a correct C/C++ program must never generate signed overflow when
computing an expression. It also means that a compiler may assume that
a program will never generate signed overflow.

http://www.airs.com/blog/archives/120

Stefan


Re: [Qemu-devel] [PATCH 3/3] megasas: LSI Megaraid SAS emulation

2011-07-02 Thread Stefan Hajnoczi
On Fri, Jul 1, 2011 at 4:35 PM, Hannes Reinecke h...@suse.de wrote:
 +static void megasas_mmio_writel(void *opaque, target_phys_addr_t addr,
 +                                uint32_t val)
 +{
 +    MPTState *s = opaque;
 +    target_phys_addr_t frame_addr;
 +    uint32_t frame_count;
 +    int i;
 +
 +    DPRINTF_REG("writel mmio %lx: %x\n", (unsigned long)addr, val);
 +
 +    switch (addr) {
 +    case MFI_IDB:
 +        if (val & MFI_FWINIT_ABORT) {
 +            /* Abort all pending cmds */
 +            for (i = 0; i <= s->fw_cmds; i++) {
 +                megasas_abort_command(&s->frames[i]);
 +            }
 +        }
 +        if (val & MFI_FWINIT_READY) {
 +            /* move to FW READY */
 +            megasas_soft_reset(s);
 +        }
 +        if (val & MFI_FWINIT_MFIMODE) {
 +            /* discard MFIs */
 +        }
 +        break;
 +    case MFI_OMSK:
 +        s->intr_mask = val;
 +        if (!MEGASAS_INTR_ENABLED(s)) {
 +            qemu_irq_lower(s->dev.irq[0]);
 +        }
 +        break;
 +    case MFI_ODCR0:
 +        /* Update reply queue pointer */
 +        DPRINTF_QUEUE("Update reply queue head %x busy %d\n",
 +                      s->reply_queue_index, s->busy);
 +        stl_phys(s->producer_pa, s->reply_queue_index);
 +        s->doorbell = 0;
 +        qemu_irq_lower(s->dev.irq[0]);
 +        break;
 +    case MFI_IQPH:
 +        s->frame_hi = val;
 +        break;
 +    case MFI_IQPL:
 +    case MFI_IQP:
 +        /* Received MFI frame address */
 +        frame_addr = (val & ~0xFF);
 +        /* Add possible 64 bit offset */
 +        frame_addr |= (uint64_t)s->frame_hi;

Is this missing << 32 before ORing the high bits?
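
Presumably the intended line is something like:

    frame_addr |= (uint64_t)s->frame_hi << 32;

assuming frame_hi really holds the upper half of a 64-bit frame address.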

 +static int megasas_scsi_uninit(PCIDevice *d)
 +{
 +    MPTState *s = DO_UPCAST(MPTState, dev, d);
 +
 +    cpu_unregister_io_memory(s->mmio_io_addr);

Need to unregister io_addr and queue_addr.
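
Something along these lines, mirroring the three cpu_register_io_memory()
calls in megasas_scsi_init() below:

    cpu_unregister_io_memory(s->mmio_io_addr);
    cpu_unregister_io_memory(s->io_addr);
    cpu_unregister_io_memory(s->queue_addr);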

 +
 +    return 0;
 +}
 +
 +static const struct SCSIBusOps megasas_scsi_ops = {
 +    .transfer_data = megasas_xfer_complete,
 +    .complete = megasas_command_complete,
 +    .cancel = megasas_command_cancel,
 +};
 +
 +static int megasas_scsi_init(PCIDevice *dev)
 +{
 +    MPTState *s = DO_UPCAST(MPTState, dev, dev);
 +    uint8_t *pci_conf;
 +    int i;
 +
 +    pci_conf = s->dev.config;
 +
 +    /* PCI Vendor ID (word) */
 +    pci_config_set_vendor_id(pci_conf, PCI_VENDOR_ID_LSI_LOGIC);
 +    /* PCI device ID (word) */
 +    pci_config_set_device_id(pci_conf,  PCI_DEVICE_ID_LSI_SAS1078);
 +    /* PCI subsystem ID */
 +    pci_set_word(pci_conf[PCI_SUBSYSTEM_VENDOR_ID], 0x1000);
 +    pci_set_word(pci_conf[PCI_SUBSYSTEM_ID], 0x1013);
 +    /* PCI base class code */
 +    pci_config_set_class(pci_conf, PCI_CLASS_STORAGE_RAID);


PCIDeviceInfo now has vendor_id, device_id, and other fields.  These
values can be set in the megasas_info definition below.
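
Roughly like this (a sketch only; which ID fields PCIDeviceInfo provides
depends on the QEMU version you are based on):

static PCIDeviceInfo megasas_info = {
    .qdev.name = "megasas",
    .qdev.size = sizeof(MPTState),
    .init      = megasas_scsi_init,
    .exit      = megasas_scsi_uninit,
    .vendor_id = PCI_VENDOR_ID_LSI_LOGIC,
    .device_id = PCI_DEVICE_ID_LSI_SAS1078,
    .class_id  = PCI_CLASS_STORAGE_RAID,
    /* the subsystem IDs could stay as explicit pci_set_word() calls if the
     * corresponding PCIDeviceInfo fields are not available */
};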

 +
 +    /* PCI latency timer = 0 */
 +    pci_conf[0x0d] = 0;
 +    /* Interrupt pin 1 */
 +    pci_conf[0x3d] = 0x01;
 +
 +    s->mmio_io_addr = cpu_register_io_memory(megasas_mmio_readfn,
 +                                             megasas_mmio_writefn, s,
 +                                             DEVICE_NATIVE_ENDIAN);
 +    s->io_addr = cpu_register_io_memory(megasas_io_readfn,
 +                                        megasas_io_writefn, s,
 +                                        DEVICE_NATIVE_ENDIAN);
 +    s->queue_addr = cpu_register_io_memory(megasas_queue_readfn,
 +                                           megasas_queue_writefn, s,
 +                                           DEVICE_NATIVE_ENDIAN);

Should these be little-endian?

 +    pci_register_bar((struct PCIDevice *)s, 0, 0x4,
 +                     PCI_BASE_ADDRESS_SPACE_MEMORY, megasas_mmio_mapfunc);
 +    pci_register_bar((struct PCIDevice *)s, 2, 256,
 +                     PCI_BASE_ADDRESS_SPACE_IO, megasas_io_mapfunc);
 +    pci_register_bar((struct PCIDevice *)s, 3, 0x4,
 +                     PCI_BASE_ADDRESS_SPACE_MEMORY, megasas_queue_mapfunc);
 +    if (s->fw_sge >= MEGASAS_MAX_SGE - MFI_PASS_FRAME_SIZE) {
 +        s->fw_sge = MEGASAS_MAX_SGE - MFI_PASS_FRAME_SIZE;
 +    } else if (s->fw_sge >= 128 - MFI_PASS_FRAME_SIZE) {
 +        s->fw_sge = 128 - MFI_PASS_FRAME_SIZE;
 +    } else {
 +        s->fw_sge = 64 - MFI_PASS_FRAME_SIZE;
 +    }
 +    if (s->fw_cmds > MEGASAS_MAX_FRAMES) {
 +        s->fw_cmds = MEGASAS_MAX_FRAMES;
 +    }
 +    if (s->raid_mode_str) {
 +        if (!strcmp(s->raid_mode_str, "jbod")) {
 +            s->is_jbod = 1;
 +        } else {
 +            s->is_jbod = 0;
 +        }
 +    }
 +    DPRINTF("Using %d sges, %d cmds, %s mode\n",
 +            s->fw_sge, s->fw_cmds, s->is_jbod ? "jbod" : "raid");
 +    s->fw_luns = (MFI_MAX_LD > MAX_SCSI_DEVS) ?
 +        MAX_SCSI_DEVS : MFI_MAX_LD;
 +    s->producer_pa = 0;
 +    s->consumer_pa = 0;
 +    for (i = 0; i < s->fw_cmds; i++) {
 +        s->frames[i].index = i;
 +        s->frames[i].context = -1;
 +        s->frames[i].pa = 0;
 +        s->frames[i].state = s;
 +    }

It is not clear to me 

Re: [PATCH v2 00/31] Implement user mode network for kvm tools

2011-07-01 Thread Stefan Hajnoczi
On Fri, Jul 1, 2011 at 12:38 AM, Asias He asias.he...@gmail.com wrote:
 On 06/30/2011 04:56 PM, Stefan Hajnoczi wrote:
 On Thu, Jun 30, 2011 at 9:40 AM, Asias He asias.he...@gmail.com wrote:
 uip stands for user mode {TCP,UDP}/IP. Currently, uip supports ARP, ICMP,
 IPV4, UDP, TCP. So any network protocols above UDP/TCP should work as well,
 e.g., HTTP, FTP, SSH, DNS.

 There is an existing uIP which might cause confusion, not sure if
 you've seen it.  First I thought you were using that :).

 I heard about uIP, but this patchset has nothing to do with uIP ;-)

 At first I was naming the user mode network as UNET which is User mode
 NETwork; however, I thought uip looks better because it is shorter.

 Anyway, if uip does cause confusion, I'd like to change the naming.

It's up to you, but now is the right time to do it.  If another program
ever wants to reuse this code, or if you want to turn it into a library,
it wouldn't help to have a confusing name.

Stefan


Re: [PATCH v2 00/31] Implement user mode network for kvm tools

2011-06-30 Thread Stefan Hajnoczi
On Thu, Jun 30, 2011 at 9:40 AM, Asias He asias.he...@gmail.com wrote:
 uip stands for user mode {TCP,UDP}/IP. Currently, uip supports ARP, ICMP,
 IPV4, UDP, TCP. So any network protocols above UDP/TCP should work as well,
 e.g., HTTP, FTP, SSH, DNS.

There is an existing uIP which might cause confusion, not sure if
you've seen it.  First I thought you were using that :).

http://en.wikipedia.org/wiki/UIP_(micro_IP)

Stefan


Re: KVM call agenda for June 28

2011-06-30 Thread Stefan Hajnoczi
On Wed, Jun 29, 2011 at 4:41 PM, Marcelo Tosatti mtosa...@redhat.com wrote:
 On Wed, Jun 29, 2011 at 11:08:23AM +0100, Stefan Hajnoczi wrote:
 In the future we could add a 'base' argument to block_stream.  If base
 is specified then data contained in the base image will not be copied.

 This is a present requirement.

It's not one that I have had in the past but it is a reasonable requirement.

One interesting thing about this requirement is that it makes
copy-on-read seem like the wrong primitive for image streaming.  If
there is a base image which should not be streamed then a plain loop
that calls bdrv_is_allocated_chain(bs, base, sector, &pnum) and copies
sectors into bs is more straightforward than passing base to a
copy-on-read operation somehow (through a variable that stashes the
base away somewhere?).
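
Roughly (a sketch only; bdrv_is_allocated_chain() is the hypothetical helper
named above, assumed here to report whether the sector's data lives in bs or
in one of the intermediate images above base, with *pnum set to the number
of contiguous sectors sharing that status):

static int stream_one_pass(BlockDriverState *bs, BlockDriverState *base,
                           uint8_t *buf)
{
    int64_t sector, end = bdrv_getlength(bs) / BDRV_SECTOR_SIZE;
    int pnum;

    for (sector = 0; sector < end; sector += pnum) {
        if (!bdrv_is_allocated_chain(bs, base, sector, &pnum)) {
            continue;   /* data comes from base (or is unallocated): skip */
        }
        /* Reading through bs pulls data from the backing chain; writing it
         * back allocates it in bs so the intermediates are no longer
         * needed.  Buffer sizing and error reporting are glossed over. */
        if (bdrv_read(bs, sector, buf, pnum) < 0 ||
            bdrv_write(bs, sector, buf, pnum) < 0) {
            return -1;
        }
    }
    return 0;
}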

  This can be used to merge data from an intermediate image without
 merging the base image.  When streaming completes the backing file
 will be set to the base image.  The backing file relationship would
 typically look like this:

 1. Before block_stream -a -b base.img ide0-hd completion:

 base.img - sn1 - ... - ide0-hd.qed

 2. After streaming completes:

 base.img - ide0-hd.qed

 This describes the image streaming use cases that I, Adam, and Anthony
 propose to support.  In the course of the discussion we've sometimes
 been distracted with the internals of what a unified live block
 copy/image streaming implementation should do.  I wanted to post this
 summary of image streaming to refocus us on the use case and the APIs
 that users will see.

 Stefan

 OK, with an external COW file for formats that do not support it the
 interface can be similar. Also there is no need to mirror writes,
 no switch operation, always use destination image.

Yep.

Stefan


Re: KVM call agenda for June 28

2011-06-30 Thread Stefan Hajnoczi
On Wed, Jun 29, 2011 at 4:41 PM, Marcelo Tosatti mtosa...@redhat.com wrote:
 On Wed, Jun 29, 2011 at 11:08:23AM +0100, Stefan Hajnoczi wrote:
  This can be used to merge data from an intermediate image without
 merging the base image.  When streaming completes the backing file
 will be set to the base image.  The backing file relationship would
 typically look like this:

 1. Before block_stream -a -b base.img ide0-hd completion:

 base.img <- sn1 <- ... <- ide0-hd.qed

 2. After streaming completes:

 base.img <- ide0-hd.qed

 This describes the image streaming use cases that I, Adam, and Anthony
 propose to support.  In the course of the discussion we've sometimes
 been distracted with the internals of what a unified live block
 copy/image streaming implementation should do.  I wanted to post this
 summary of image streaming to refocus us on the use case and the APIs
 that users will see.

 Stefan

 OK, with an external COW file for formats that do not support it the
 interface can be similar. Also there is no need to mirror writes,
 no switch operation, always use destination image.

Marcelo, does this mean you are happy with how management deals with
power failure/crash during streaming?

Are we settled on the approach where the destination file always has
the source file as its backing file?

Here are the components that I can identify:

1. blkmirror - used by live block copy to keep source and destination
in sync.  Already implemented as a block driver by Marcelo.

2. External COW overlay - can be used to add backing file (COW)
support on top of any image, including raw.  Currently unimplemented,
needs to be a block driver.  Kevin, do you want to write this?

3. Unified background copy - an image format-independent mechanism for
copying the contents of a backing file chain into the image file (with
the exception of backing files chained below base).  Needs to play nice
with blkmirror.  Stefan can write this.

4. Live block copy API and high-level control - the main code that
adds the live block copy feature.  Existing patches by Marcelo, can be
restructured to use common core by Marcelo.

5. Image streaming API and high-level control - the main code that
adds the image streaming feature.  Existing patches by Stefan, Adam,
Anthony, can be restructured to use common core by Stefan.

I previously posted a proposed API for the unified background copy
mechanism.  I'm thinking that background copy is not the best name
since it is limited to copying the backing file into the image file.

/**
 * Start a background copy operation
 *
 * Unallocated clusters in the image will be populated with data
 * from its backing file.  This operation runs in the background and a
 * completion function is invoked when it is finished.
 */
BackgroundCopy *background_copy_start(
   BlockDriverState *bs,

   /**
* Note: Kevin suggests we migrate this into BlockDriverState
*   in order to enable copy-on-read.
*
* Base image that both source and destination have as a
* backing file ancestor.  Data will not be copied from base
* since both source and destination will have access to base
* image.  This may be NULL to copy all data.
*/
   BlockDriverState *base,

   BlockDriverCompletionFunc *cb, void *opaque);

/**
 * Cancel a background copy operation
 *
 * This function marks the background copy operation for cancellation and the
 * completion function is invoked once the operation has been cancelled.
 */
void background_copy_cancel(BackgroundCopy *bgc,
BlockDriverCompletionFunc *cb, void *opaque);

/**
 * Get progress of a running background copy operation
 */
void background_copy_get_status(BackgroundCopy *bgc,
BackgroundCopyStatus *status);
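
For completeness, a caller might drive this API like so (illustrative
only; stream_done() and stream_start() are made-up names and not part of
the proposal):

static void stream_done(void *opaque, int ret)
{
    if (ret == 0) {
        /* All data is local now: the caller could clear the image's
         * backing file pointer and report completion to the monitor. */
    }
}

static BackgroundCopy *stream_start(BlockDriverState *bs,
                                    BlockDriverState *base)
{
    /* base may be NULL to copy the whole backing chain into bs */
    return background_copy_start(bs, base, stream_done, bs);
}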

Stefan


Re: virtio scsi host draft specification, v3

2011-06-29 Thread Stefan Hajnoczi
On Wed, Jun 29, 2011 at 9:33 AM, Paolo Bonzini pbonz...@redhat.com wrote:
 On 06/14/2011 10:39 AM, Hannes Reinecke wrote:
 If, however, we decide to expose some details about the backend, we
 could be using the values from the backend directly.
 EG we could be forwarding the SCSI target port identifier here
 (if backed by real hardware) or creating our own SAS-type
 identifier when backed by qemu block. Then we could just query
 the backend via a new command on the controlq
 (eg 'list target ports') and wouldn't have to worry about any protocol
 specific details here.

 Besides the controlq command, which I can certainly add, this is
 actually quite similar to what I had in mind (though my plan likely
 would not have worked because I was expecting hierarchical LUNs used
 uniformly).  So, list target ports would return a set of LUN values to
 which you can send REPORT LUNS, or something like that?

I think we're missing a level of addressing.  We need the ability to
talk to multiple target ports in order for list target ports to make
sense.  Right now there is one implicit target that handles all
commands.  That means there is one fixed I_T Nexus.

If we introduce "list target ports" we also need a way to say "This
CDB is destined for target port #0".  Then it is possible to enumerate
target ports and address targets independently of the LUN field in the
CDB.

I'm pretty sure this is also how SAS and other transports work.  In
their framing they include the target port.

The question is whether we really need to support multiple targets on
a virtio-scsi adapter or not.  If you are selectively mapping LUNs
that the guest may access, then multiple targets are not necessary.
If we want to do pass-through of the entire SCSI bus then we need
multiple targets but I'm not sure if there are other challenges like
dependencies on the transport (Fibre Channel, SAS, etc) which make it
impossible to pass through bus-level access?

 If I understand it correctly, it should remain possible to use a single
 host for both pass-through and emulated targets.

Yes.

 Of course, when doing so we would be lose the ability to freely remap
 LUNs. But then remapping LUNs doesn't gain you much imho.
 Plus you could always use qemu block backend here if you want
 to hide the details.

 And you could always use the QEMU block backend with scsi-generic if you
 want to remap LUNs, instead of true passthrough via the kernel target.

IIUC the in-kernel target always does remapping.  It passes through
individual LUNs rather than entire targets and you pick LU Numbers to
map to the backing storage (which may or may not be a SCSI
pass-through device).  Nicholas Bellinger can confirm whether this is
correct.

Stefan


Re: KVM call agenda for June 28

2011-06-29 Thread Stefan Hajnoczi
On Wed, Jun 29, 2011 at 8:57 AM, Kevin Wolf kw...@redhat.com wrote:
 Am 28.06.2011 21:41, schrieb Marcelo Tosatti:
 stream
 --

 1) base <- remote
 2) base <- remote <- local
 3) base <- local

 local image is always valid. Requires backing file support.

 With the above, this restriction wouldn't apply any more.

 Also I don't think we should mix approaches. Either both block copy and
 image streaming use backing files, or none of them do. Mixing means
 duplicating more code, and even worse, that you can't stop a block copy
 in the middle and continue with streaming (which I believe is a really
 valuable feature to have).

Here is how the image streaming feature is used from HMP/QMP:

The guest is running from an image file with a backing file.  The aim
is to pull the data from the backing file and populate the image file
so that the dependency on the backing file can be eliminated.

1. Start a background streaming operation:

(qemu) block_stream -a ide0-hd

2. Check the status of the operation:

(qemu) info block-stream
Streaming device ide0-hd: Completed 512 of 34359738368 bytes

3. The status changes when the operation completes:

(qemu) info block-stream
No active stream

On completion the image file no longer has a backing file dependency.
When streaming completes QEMU updates the image file metadata to
indicate that no backing file is used.

The QMP interface is similar but provides QMP events to signal
streaming completion and failure.  Polling to query the streaming
status is only used when the management application wishes to refresh
progress information.

If guest execution is interrupted by a power failure or QEMU crash,
then the image file is still valid but streaming may be incomplete.
When QEMU is launched again the block_stream command can be issued to
resume streaming.

In the future we could add a 'base' argument to block_stream.  If base
is specified then data contained in the base image will not be copied.
 This can be used to merge data from an intermediate image without
merging the base image.  When streaming completes the backing file
will be set to the base image.  The backing file relationship would
typically look like this:

1. Before block_stream -a -b base.img ide0-hd completion:

base.img <- sn1 <- ... <- ide0-hd.qed

2. After streaming completes:

base.img <- ide0-hd.qed

This describes the image streaming use cases that I, Adam, and Anthony
propose to support.  In the course of the discussion we've sometimes
been distracted with the internals of what a unified live block
copy/image streaming implementation should do.  I wanted to post this
summary of image streaming to refocus us on the use case and the APIs
that users will see.

Stefan


Re: KVM call agenda for June 28

2011-06-28 Thread Stefan Hajnoczi
On Mon, Jun 27, 2011 at 3:32 PM, Juan Quintela quint...@redhat.com wrote:
 Please send in any agenda items you are interested in covering.

Live block copy and image streaming:
 * The differences between Marcelo and Kevin's approaches
 * Which approach to choose and who can help implement it


Re: KVM call agenda for June 28

2011-06-28 Thread Stefan Hajnoczi
On Tue, Jun 28, 2011 at 8:41 PM, Marcelo Tosatti mtosa...@redhat.com wrote:
 On Tue, Jun 28, 2011 at 02:38:15PM +0100, Stefan Hajnoczi wrote:
 On Mon, Jun 27, 2011 at 3:32 PM, Juan Quintela quint...@redhat.com wrote:
  Please send in any agenda items you are interested in covering.

 Live block copy and image streaming:
  * The differences between Marcelo and Kevin's approaches
  * Which approach to choose and who can help implement it

 After more thinking, I dislike the image metadata approach. Management
 must carry the information anyway, so it's pointless to duplicate it
 inside an image format.

I agree with you.  It would be a significant change for QEMU users to
deal with block state files just in case they want to use live block
copy/image streaming.  Not only would existing management layers need
to be updated but also custom management or provisioning scripts.

 After the discussion today, i think the internal mechanism and interface
 should be different for copy and stream:

 block copy
 --

 With backing files:

 1) base <- sn1 <- sn2
 2) base <- copy

 Without:

 1) source
 2) destination

 Copy is only valid after switch has been performed. Same interface and
 crash recovery characteristics for all image formats.

 If management wants to support continuation, it must specify
 blkcopy:sn2:copy on startup.

 stream
 --

 1) base <- remote
 2) base <- remote <- local
 3) base <- local

 local image is always valid. Requires backing file support.

I agree that the modes of operation are different and we should
provide different HMP/QMP APIs for them.  Internally I still think
they can share code for the source - destination copy operation.

Stefan


Re: [PATCH 09/10] exec: remove unused variable

2011-06-26 Thread Stefan Hajnoczi
On Sun, Jun 26, 2011 at 12:08 PM, Michael S. Tsirkin m...@redhat.com wrote:
 On Tue, Jun 14, 2011 at 08:36:26PM +0300, Michael S. Tsirkin wrote:
 Signed-off-by: Michael S. Tsirkin m...@redhat.com

 Any comments on this one?

Juan Quintela's "exec: last_first_tb was only used in !ONLY_USER case"
patch is equivalent and the trivial-patches pull request containing it
is on qemu-devel.  We're waiting for it to be merged.

http://patchwork.ozlabs.org/patch/101875/

Stefan


Re: [RFC] virtio: Support releasing lock during kick

2011-06-24 Thread Stefan Hajnoczi
On Mon, Jun 20, 2011 at 4:27 PM, Stefan Hajnoczi stefa...@gmail.com wrote:
 On Sun, Jun 19, 2011 at 8:14 AM, Michael S. Tsirkin m...@redhat.com wrote:
 On Wed, Jun 23, 2010 at 10:24:02PM +0100, Stefan Hajnoczi wrote:
 The virtio block device holds a lock during I/O request processing.
 Kicking the virtqueue while the lock is held results in long lock hold
 times and increases contention for the lock.

 This patch modifies virtqueue_kick() to optionally release a lock while
 notifying the host.  Virtio block is modified to pass in its lock.  This
 allows other vcpus to queue I/O requests during the time spent servicing
 the virtqueue notify in the host.

 The virtqueue_kick() function is modified to know about locking because
 it changes the state of the virtqueue and should execute with the lock
 held (it would not be correct for virtio block to release the lock
 before calling virtqueue_kick()).

 Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com

 While the optimization makes sense, the API's pretty hairy IMHO.
 Why don't we split the kick functionality instead?
 E.g.
        /* Report whether host notification is necessary. */
        bool virtqueue_kick_prepare(struct virtqueue *vq)
        /* Can be done in parallel with add_buf/get_buf */
        void virtqueue_kick_notify(struct virtqueue *vq)

 This is a nice idea, it makes the code cleaner.  I am testing patches
 that implement this and after Khoa has measured the performance I will
 send them out.

Just an update that benchmarks are being run.  Will send out patches
and results as soon as they are in.

Stefan


Re: [RFC] virtio: Support releasing lock during kick

2011-06-20 Thread Stefan Hajnoczi
On Sun, Jun 19, 2011 at 8:14 AM, Michael S. Tsirkin m...@redhat.com wrote:
 On Wed, Jun 23, 2010 at 10:24:02PM +0100, Stefan Hajnoczi wrote:
 The virtio block device holds a lock during I/O request processing.
 Kicking the virtqueue while the lock is held results in long lock hold
 times and increases contention for the lock.

 This patch modifies virtqueue_kick() to optionally release a lock while
 notifying the host.  Virtio block is modified to pass in its lock.  This
 allows other vcpus to queue I/O requests during the time spent servicing
 the virtqueue notify in the host.

 The virtqueue_kick() function is modified to know about locking because
 it changes the state of the virtqueue and should execute with the lock
 held (it would not be correct for virtio block to release the lock
 before calling virtqueue_kick()).

 Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com

 While the optimization makes sense, the API's pretty hairy IMHO.
 Why don't we split the kick functionality instead?
 E.g.
        /* Report whether host notification is necessary. */
        bool virtqueue_kick_prepare(struct virtqueue *vq)
        /* Can be done in parallel with add_buf/get_buf */
        void virtqueue_kick_notify(struct virtqueue *vq)

This is a nice idea, it makes the code cleaner.  I am testing patches
that implement this and after Khoa has measured the performance I will
send them out.
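
For reference, the resulting usage in virtio-blk should look roughly like
this (just a sketch; the lock/vq field names are assumptions and the real
patches may differ):

static void virtblk_add_and_kick(struct virtio_blk *vblk)
{
    unsigned long flags;
    bool notify;

    spin_lock_irqsave(&vblk->lock, flags);
    /* ... virtqueue_add_buf() for the new request goes here ... */
    notify = virtqueue_kick_prepare(vblk->vq);
    spin_unlock_irqrestore(&vblk->lock, flags);

    /* The host notification, which may be expensive, now happens without
     * the lock held, so other vcpus can queue requests in the meantime. */
    if (notify)
        virtqueue_kick_notify(vblk->vq);
}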

Stefan


Re: [ANNOUNCE] Native Linux KVM tool v2

2011-06-16 Thread Stefan Hajnoczi
On Thu, Jun 16, 2011 at 8:24 AM, Ingo Molnar mi...@elte.hu wrote:
  - executing AIO in the vcpu thread eats up precious vcpu execution
   time: combined QCOW2 throughput would be limited by a single
   core's performance, and any time spent on QCOW2 processing would
   not be spent running the guest CPU. (In such a model we certainly
   couldn't do more intelligent, CPU-intense storage solutions like on
   the fly compress/decompress of QCOW2 data.)

This has been a problem in qemu-kvm.  io_submit(2) steals time from
the guest (I think it was around 20us on the system I measured last
year).

Add the fact that the guest kernel might be holding a spinlock and it
becomes a scalability problem for SMP guests.

Anything that takes noticable CPU time should be done outside the vcpu thread.

Stefan


Re: [ANNOUNCE] Native Linux KVM tool v2

2011-06-16 Thread Stefan Hajnoczi
On Fri, Jun 17, 2011 at 2:03 AM, Sasha Levin levinsasha...@gmail.com wrote:
 On Thu, 2011-06-16 at 17:50 -0500, Anthony Liguori wrote:
 On 06/16/2011 09:48 AM, Pekka Enberg wrote:
  On Wed, Jun 15, 2011 at 6:53 PM, Pekka Enbergpenb...@kernel.org  wrote:
  - Fast QCOW2 image read-write support beating Qemu in fio benchmarks. See 
  the
    following URL for test result details: https://gist.github.com/1026888
 
  It turns out we were benchmarking the wrong guest kernel version for
  qemu-kvm which is why it performed so much worse. Here's a summary of
  qemu-kvm beating tools/kvm:
 
  https://raw.github.com/gist/1029359/9f9a714ecee64802c08a3455971e410d5029370b/gistfile1.txt
 
  I'd ask for a brown paper bag if I wasn't so busy eating my hat at the 
  moment.

 np, it happens.

 Is that still with QEMU with IDE emulation, cache=writethrough, and
 128MB of guest memory?

 Does your raw driver support multiple parallel requests?  It doesn't
 look like it does from how I read the code.  At some point, I'd be happy
 to help ya'll do some benchmarking against QEMU.


 Each virtio-blk device can process requests regardless of other
 virtio-blk devices, which means that we can do parallel requests for
 devices.

 Within each device, we support parallel requests in the sense that we do
 vectored IO for each head (which may contain multiple blocks) in the
 vring, we don't do multiple heads because when I've tried adding AIO
 I've noticed that at most there are 2-3 possible heads - and since it
 points to the same device it doesn't really help running them in
 parallel.

One thing that QEMU does but I'm a little suspicious of is request
merging.  virtio-blk will submit those 2-3 heads using
bdrv_aio_multiwrite() if they become available in the same virtqueue
notify.  The requests will be merged if possible.

My feeling is that we should already have merged requests coming
through virtio-blk and there should be no need to do any merging -
which could be a workaround for a poor virtio-blk vring configuration
that prevented the guest from sending large requests.  However, this
feature did yield performance improvements with qcow2 image files when
it was introduced, so that would be interesting to look at.

Are you enabling indirect descriptors on the virtio-blk vring?  That
should allow more requests to be made available because you don't run
out of vring descriptors so easily.

Stefan


Re: [ANNOUNCE] Native Linux KVM tool v2

2011-06-16 Thread Stefan Hajnoczi
On Thu, Jun 16, 2011 at 3:48 PM, Pekka Enberg penb...@kernel.org wrote:
 On Wed, Jun 15, 2011 at 6:53 PM, Pekka Enberg penb...@kernel.org wrote:
 - Fast QCOW2 image read-write support beating Qemu in fio benchmarks. See the
  following URL for test result details: https://gist.github.com/1026888

 It turns out we were benchmarking the wrong guest kernel version for
 qemu-kvm which is why it performed so much worse. Here's a summary of
 qemu-kvm beating tools/kvm:

 https://raw.github.com/gist/1029359/9f9a714ecee64802c08a3455971e410d5029370b/gistfile1.txt

Thanks for digging into the results so quickly and rerunning.

Stefan


Re: [PATCH 04/10] lsi53c895a: remove unused variables

2011-06-15 Thread Stefan Hajnoczi
On Tue, Jun 14, 2011 at 08:35:44PM +0300, Michael S. Tsirkin wrote:
 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 ---
  hw/lsi53c895a.c |2 --
  1 files changed, 0 insertions(+), 2 deletions(-)

This one is already in the trivial-patches tree.

Stefan


Re: [ANNOUNCE] Native Linux KVM tool v2

2011-06-15 Thread Stefan Hajnoczi
On Wed, Jun 15, 2011 at 11:04 PM, Anthony Liguori anth...@codemonkey.ws wrote:
 On 06/15/2011 03:13 PM, Prasad Joshi wrote:

 On Wed, Jun 15, 2011 at 6:10 PM, Pekka Enbergpenb...@kernel.org  wrote:

 On Wed, Jun 15, 2011 at 7:30 PM, Avi Kivitya...@redhat.com  wrote:

 On 06/15/2011 06:53 PM, Pekka Enberg wrote:

 - Fast QCOW2 image read-write support beating Qemu in fio benchmarks.
 See
 the
   following URL for test result details:
 https://gist.github.com/1026888

 This is surprising.  How is qemu invoked?

 Prasad will have the details. Please note that the above are with Qemu
 defaults which doesn't use virtio. The results with virtio are little
 better but still in favor of tools/kvm.


 The qcow2 image used for testing was copied on to /dev/shm to avoid
 the disk delays in performance measurement.

 QEMU was invoked with following parameters

 $ qemu-system-x86_64 -hda <disk image on hard disk> -hdb
 /dev/shm/test.qcow2 -m 1024M

 Looking more closely at native KVM tools, you would need to use the
 following invocation to have an apples-to-apples comparison:

 qemu-system-x86_64 -drive file=/dev/shm/test.qcow2,cache=writeback,if=virtio

In addition to this it is important to set identical guest RAM sizes
(QEMU's -m ram_mb) option.

If you are comparing with qemu.git rather than qemu-kvm.git then you
need to ./configure --enable-io-thread and launch with QEMU's
-enable-kvm option.

Stefan


Re: virtio scsi host draft specification, v3

2011-06-14 Thread Stefan Hajnoczi
On Tue, Jun 14, 2011 at 9:39 AM, Hannes Reinecke h...@suse.de wrote:
 On 06/10/2011 04:35 PM, Paolo Bonzini wrote:

 If requests are placed on arbitrary queues you'll inevitably run on
 locking issues to ensure strict request ordering.
 I would add here:

 If a device uses more than one queue it is the responsibility of the
 device to ensure strict request ordering.

 Applied with s/device/guest/g.

 Please do not rely in bus/target/lun here. These are leftovers from
 parallel SCSI and do not have any meaning on modern SCSI
 implementation (eg FC or SAS). Rephrase that to

 The lun field is the Logical Unit Number as defined in SAM.

 Ok.

      The status byte is written by the device to be the SCSI status
      code.

 ?? I doubt that exists. Make that:

 The status byte is written by the device to be the status code as
 defined in SAM.

 Ok.

      The response byte is written by the device to be one of the
      following:

      - VIRTIO_SCSI_S_OK when the request was completed and the
      status byte
        is filled with a SCSI status code (not necessarily GOOD).

      - VIRTIO_SCSI_S_UNDERRUN if the content of the CDB requires
      transferring
        more data than is available in the data buffers.

      - VIRTIO_SCSI_S_ABORTED if the request was cancelled due to a
      reset
        or another task management function.

      - VIRTIO_SCSI_S_FAILURE for other host or guest error. In
      particular,
        if neither dataout nor datain is empty, and the
        VIRTIO_SCSI_F_INOUT
        feature has not been negotiated, the request will be
        immediately
        returned with a response equal to VIRTIO_SCSI_S_FAILURE.

 And, of course:

 VIRTIO_SCSI_S_DISCONNECT if the request could not be processed due
 to a communication failure (eg device was removed or could not be
 reached).

 Ok.

 This specification implies a strict one-to-one mapping between host
 and target. IE there is no way of specifying more than one target
 per host.

 Actually no, the intention is to use hierarchical LUNs to support
 more than one target per host.

 Can't.

 Hierarchical LUNs is a target-internal representation.
 The initiator (ie guest OS) should _not_ try to assume anything about the
 internal structure and just use the LUN as an opaque number.

 Reason being that the LUN addressing is not unique, and there are several
 choices on how to represent a given LUN.
 So the consensus here is that different LUN numbers represent
 different physical devices, regardless on the (internal) LUN representation.
 Which in turn means we cannot use the LUN number to convey anything else
 than a device identification relative to a target.

 Cf SAM-3 paragraph 4.8:

 A logical unit number is a field (see 4.9) containing 64 bits that
 identifies the logical unit within a SCSI target device
 when accessed by a SCSI target port.

 IE the LUN is dependent on the target, but you cannot make assumptions on
 the target.

 Consequently, it's in the hosts' responsibility to figure out the targets in
 the system. After that it invokes the 'scan' function from the SCSI
 midlayer.
 You can't start from a LUN and try to figure out the targets ...

 If you want to support more than on target per host you need some sort of
 enumeration/callback which allows the host to figure out
 the number of available targets.
 But in general the targets are referenced by the target port identifier as
 specified in the appropriate standard (eg FC or SAS).
 Sadly, we don't have any standard to fall back on for this.

 If, however, we decide to expose some details about the backend, we could be
 using the values from the backend directly.
 EG we could be forwarding the SCSI target port identifier here
 (if backed by real hardware) or creating our own SAS-type
 identifier when backed by qemu block. Then we could just query
 the backend via a new command on the controlq
 (eg 'list target ports') and wouldn't have to worry about any protocol
 specific details here.

I think we want to be able to pass through one or more SCSI targets,
so we probably need a 'list target ports' control command.

Stefan


Re: virtio scsi host draft specification, v3

2011-06-10 Thread Stefan Hajnoczi
On Fri, Jun 10, 2011 at 12:33 PM, Rusty Russell ru...@rustcorp.com.au wrote:
 On Thu, 09 Jun 2011 08:59:27 +0200, Paolo Bonzini pbonz...@redhat.com wrote:
 On 06/09/2011 01:28 AM, Rusty Russell wrote:
    after some preliminary discussion on the QEMU mailing list, I present a
    draft specification for a virtio-based SCSI host (controller, HBA, you
    name it).
 
  OK, I'm impressed.  This is very well written and it doesn't make any of
  the obvious mistakes wrt. virtio.

 Thanks very much, and thanks to those who corrected my early mistakes.

  I assume you have an implementation, as well?

 Unfortunately not; we're working on it, which means I should start in
 July when I come back from vacation.

 Do you prefer to wait for one before I make a patch to the LyX source?
 In the meanwhile, can you reserve a subsystem ID for me?

 Paolo

 Sure, you can have the next subsystem ID.

 It's a pain to patch once it's in LyX, so let's get the implementation
 based on what you posted here and see how much it changes first...

Paolo, I'll switch the Linux guest LLD and QEMU virtio-scsi skeleton
that I have to comply with the spec.  Does this sound good or did you
want to write these from scratch?

Stefan


Re: [Qemu-devel] [RFC]QEMU disk I/O limits

2011-06-01 Thread Stefan Hajnoczi
On Wed, Jun 1, 2011 at 2:20 PM, Vivek Goyal vgo...@redhat.com wrote:
 On Tue, May 31, 2011 at 06:30:09PM -0500, Anthony Liguori wrote:

 [..]
 The level of consistency will then depend on whether you overcommit
 your hardware and how you have it configured.

 Agreed.


 Consistency is very hard because at the end of the day, you still
 have shared resources.  Even with blkio, I presume one guest can
 still impact another guest by forcing the disk to do excessive
 seeking or something of that nature.

 So absolutely consistency can't be the requirement for the use-case.
 The use-cases we are interested really are more about providing caps
 than anything else.

 I think both qemu and kenrel can do the job. The only thing which
 seriously favors throttling implementation in qemu is the ability
 to handle wide variety of backend files (NFS, qcow, libcurl based
 devices etc).

 So what I am arguing is that your previous reason that qemu can do
 a better job because it knows effective IOPS of guest, is not
 necessarily a very good reason. To me simplicity of being able to handle
 everything as file and do the throttling is the most compelling reason
 to do this implementation in qemu.

The variety of backends is the reason to go for a QEMU-based approach.
 If there were kernel mechanisms to handle non-block backends that
would be great.  cgroups NFS?

Of course for something like Sheepdog or Ceph it becomes quite hard to
do it in the kernel at all since they are userspace libraries that
speak their protocol over sockets, and you really don't have insight
into what I/O operations they are doing from the kernel.

One issue that concerns me is how effective iops and throughput are as
capping mechanisms.  If you cap throughput then you're likely to
affect sequential I/O but do little against random I/O which can hog
the disk with a seeky I/O pattern.  If you limit iops you can cap
random I/O but artificially limit sequential I/O, which may be able to
perform a high number of iops without hogging the disk due to seek
times at all.  One proposed solution here (I think Christoph Hellwig
suggested it) is to do something like merging sequential I/O counting
so that multiple sequential I/Os only count as 1 iop.
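
As a toy sketch of that counting idea (nothing like this exists in QEMU
today; the struct and function here are invented purely for
illustration):

#include <stdint.h>

typedef struct {
    uint64_t next_sector;   /* first sector after the previous request */
    uint64_t iops;          /* operations charged in the current slice */
} IopsAccount;

/* Charge one request; a run of back-to-back sequential requests is
 * counted as a single I/O operation. */
static void account_iop(IopsAccount *acct, uint64_t sector,
                        uint64_t nb_sectors)
{
    if (sector != acct->next_sector) {
        acct->iops++;            /* a seek happened: charge a new iop */
    }
    acct->next_sector = sector + nb_sectors;
}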

I like the idea of a proportional share of disk utilization but doing
that from QEMU is problematic since we only know when we issued an I/O
to the kernel, not when it's actually being serviced by the disk -
there could be queue wait times in the block layer that we don't know
about - so we end up with a magic number for disk utilization which
may not be a very meaningful number.

So given the constraints and the backends we need to support, disk I/O
limits in QEMU with iops and throughput limits seem like the approach
we need.

Stefan


Re: [Qemu-devel] [RFC]QEMU disk I/O limits

2011-06-01 Thread Stefan Hajnoczi
On Wed, Jun 1, 2011 at 10:42 PM, Vivek Goyal vgo...@redhat.com wrote:
 On Wed, Jun 01, 2011 at 10:15:30PM +0100, Stefan Hajnoczi wrote:
 One issue that concerns me is how effective iops and throughput are as
 capping mechanisms.  If you cap throughput then you're likely to
 affect sequential I/O but do little against random I/O which can hog
 the disk with a seeky I/O pattern.  If you limit iops you can cap
 random I/O but artificially limit sequential I/O, which may be able to
 perform a high number of iops without hogging the disk due to seek
 times at all.  One proposed solution here (I think Christoph Hellwig
 suggested it) is to do something like merging sequential I/O counting
 so that multiple sequential I/Os only count as 1 iop.

 One of the things we at least need to do is allow specifying both
 bps and iops rules together so that random IO with high iops does
 not create havoc and sequential or large size IO with low iops and
 high bps does not overload the system.

 I am not sure how IO shows up in qemu but will elevator in guest
 make sure that lot of sequential IO is merged together? For dependent
 READS, I think counting multiple sequential reads as 1 iops might
 help. I think this is one optimization one can do once throttling
 starts working in qemu and see if it is a real concern.

The guest can use an I/O scheduler, so for Linux guests we see the
typical effects of cfq.  Requests do get merged by the guest before
being submitted to QEMU.

Okay, good idea.  Zhi Yong's test plan includes tests with multiple
VMs and both iops and throughput limits at the same time.  If
workloads turn up that cause issues it would be possible to look at
counting sequential I/Os as 1 iop.


 I like the idea of a proportional share of disk utilization but doing
 that from QEMU is problematic since we only know when we issued an I/O
 to the kernel, not when it's actually being serviced by the disk -
 there could be queue wait times in the block layer that we don't know
 about - so we end up with a magic number for disk utilization which
 may not be a very meaningful number.

 To be able to implement proportional IO one should be able to see
 all IO from all clients at one place. Qemu knows about IO of only
 its guest and not other guests running on the system. So I think
 qemu can't implement proportional IO.

Yeah :(


 So given the constraints and the backends we need to support, disk I/O
 limits in QEMU with iops and throughput limits seem like the approach
 we need.

 For qemu yes. For other non-qemu usages we will still require a kernel
 mechanism of throttling.

Definitely.  In fact I like the idea of using blkio-controller for raw
image files on local file systems or LVM volumes.

Hopefully the end-user API (libvirt interface) through which QEMU disk
I/O limits are exposed will complement the existing blkiotune
(blkio-controller) virsh command.

Stefan


Re: [PATCH 1/4] kvm tools: Add ioeventfd support

2011-05-27 Thread Stefan Hajnoczi
On Fri, May 27, 2011 at 11:36 AM, Sasha Levin levinsasha...@gmail.com wrote:
 ioeventfd is a way provided by KVM to receive notifications about
 reads and writes to PIO and MMIO areas within the guest.

 Such notifications are useful if all we need to know is that
 a specific area of the memory has been changed, and we don't need
 a heavyweight exit to happen.

 The implementation uses epoll to scale to a large number of ioeventfds.

 Signed-off-by: Sasha Levin levinsasha...@gmail.com
 ---
  tools/kvm/Makefile                |    1 +
  tools/kvm/include/kvm/ioeventfd.h |   27 
  tools/kvm/ioeventfd.c             |  127 
 +
  tools/kvm/kvm-run.c               |    4 +
  4 files changed, 159 insertions(+), 0 deletions(-)
  create mode 100644 tools/kvm/include/kvm/ioeventfd.h
  create mode 100644 tools/kvm/ioeventfd.c

Did you run any benchmarks?

Stefan


Re: How to diagnose memory leak in kvm-qemu-0.14.0?

2011-05-20 Thread Stefan Hajnoczi
On Fri, May 20, 2011 at 12:47 PM, Steve Kemp st...@bytemark.co.uk wrote:
 On Fri May 20, 2011 at 12:01:58 +0100, Stefan Hajnoczi wrote:

   wget http://mirror.bytemark.co.uk/misc/test-files/500M
   while true; do cp 500M foo.img; rm foo.img; sleep 2; done
 
   top shows the virt memory growing to 1gb in under two minutes.

 Were you able to track down the culprit?

  Yes, or at least confirm my suspicion.  The virtio block device
  is the source of the leak.

  Host kernel: 2.6.32.15
  Guest Kernel: linux-2.6.32.23

  Leaking case:

  /opt/kvm2/bin/qemu-system-x86_64 -m 500 \
    -drive file=/machines/kvm2/jail/root_fs,if=virtio,cache=off

  Non leaking case:

   /opt/kvm/current/bin/qemu-system-x86_64 -m 500 \
     -drive file=/machines/kvm1/jail/root_fs,cache=off ..

  The leak occurs with both KVM 0.12.5 and 0.14.0.

  I've had a quick read of hw/virtio-blk.c but didn't see anything
  glaringly obvious.  I'll need to trace through the code, drink more
  coffee, or get lucky to narrow it down further.

Enabling the memory allocation trace events and adding the
__builtin_return_address() to them should provide enough information
to catch the caller who is leaking memory.
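
For example, a wrapper along these lines (illustrative only, not the
actual qemu-kvm tracing code; in QEMU the fprintf() would be a trace
event instead) records the caller of each allocation:

#include <stdio.h>
#include <stdlib.h>

/* Log every allocation together with its caller so that unmatched
 * allocations can later be attributed to a code path. */
void *traced_malloc(size_t size)
{
    void *ptr = malloc(size);

    fprintf(stderr, "malloc size=%zu ptr=%p caller=%p\n",
            size, ptr, __builtin_return_address(0));
    return ptr;
}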

Stefan


Re: How to diagnose memory leak in kvm-qemu-0.14.0?

2011-05-20 Thread Stefan Hajnoczi
On Fri, May 20, 2011 at 2:47 PM, Steve Kemp st...@bytemark.co.uk wrote:
 On Fri May 20, 2011 at 14:16:05 +0100, Stefan Hajnoczi wrote:

   I've had a quick read of hw/virtio-blk.c but didn't see anything
   glaringly obvious.  I'll need to trace through the code, drink more
   coffee, or get lucky to narrow it down further.

 Enabling the memory allocation trace events and adding the
 __builtin_return_address() to them should provide enough information
 to catch the caller who is leaking memory.

  I'm trying to do that at the moment.  So far the only thing I've
  done is add a trace on virtio_blk_alloc_request - I'm noticing
  a leak there pretty easily.

  I see *two* request structures be allocated all the time, one
  is used and freed, the other is ignored.  That seems pretty
  conclusively wrong to me, but I'm trying to understand how that
  happens:

  virtio_blk_alloc_request 0.000 req=0x91e08f0  - Allocation 1
  virtio_blk_alloc_request 77.659 req=0x9215650  - Allocation 2

Are you sure this isn't the temporary one that is allocated but freed
immediately once the virtqueue is empty:

static VirtIOBlockReq *virtio_blk_get_request(VirtIOBlock *s)
{
    VirtIOBlockReq *req = virtio_blk_alloc_request(s);

    if (req != NULL) {
        if (!virtqueue_pop(s->vq, &req->elem)) {
            qemu_free(req);  <--- virtqueue empty, we're done
            return NULL;
        }
    }

    return req;
}

  virtio_blk_rw_complete 449.469 req=0x91e08f0 ret=0x0 - First is used.
  virtio_blk_req_complete 1.955 req=0x91e08f0 status=0x0 - First is freed.

  second is never seen again.

Sounds scary 8).

Stefan


Re: How to diagnose memory leak in kvm-qemu-0.14.0?

2011-05-19 Thread Stefan Hajnoczi
On Wed, May 18, 2011 at 5:44 PM, Steve Kemp st...@bytemark.co.uk wrote:

  I'm running the most recent release of KVM, version 0.14.0
  on a host kernel 2.6.32.15, and seem to be able to trigger
  a leak of memory pretty easily.

  Inside a guest the following one-liner will cause the KVM
  process on the host to gradually increase its memory
  consumption:

    while true; do
      wget http://mirror.bytemark.co.uk/misc/test-files/500M; cp 500M new; rm 
 500M new; sleep 10 ;
    done

You are exercising both networking and storage.  Have you cut the test
down to just wget vs cp/rm?  Also why the sleep 10?

If you are building qemu-kvm from source you might like to enable
tracing to track memory allocations in qemu-kvm.  For full information
see qemu-kvm/docs/tracing.txt.  There are several trace events of
interest:
$ cd qemu-kvm
$ $EDITOR trace-events
# qemu-malloc.c
disable qemu_malloc(size_t size, void *ptr) "size %zu ptr %p"
disable qemu_realloc(void *ptr, size_t size, void *newptr) "ptr %p size %zu newptr %p"
disable qemu_free(void *ptr) "ptr %p"

# osdep.c
disable qemu_memalign(size_t alignment, size_t size, void *ptr) "alignment %zu size %zu ptr %p"
disable qemu_vmalloc(size_t size, void *ptr) "size %zu ptr %p"
disable qemu_vfree(void *ptr) "ptr %p"
^--- remove the "disable" property from these memory allocation events
$ ./configure --enable-trace-backend=simple [...]
$ make
$ # run the VM, reproduce the leak, shut the VM down
$ scripts/simpletrace.py trace-events trace-<pid>  # where <pid> was
the process ID

It is fairly easy to write a script that correlates mallocs and frees,
printing out memory allocations that were never freed at the end.
There is a Python API for processing trace files, here is an
explanation of how to use it:
http://blog.vmsplice.net/2011/03/how-to-write-trace-analysis-scripts-for.html

If you have SystemTap installed you may wish to use the dtrace
backend instead of simple.  You can then use SystemTap scripts on
the probes.  SystemTap is more powerful, it should allow you to
extract call stacks when probes are fired but I'm not experienced with
it.

Feel free to contact me on #qemu (oftc) or #kvm (freenode) IRC if you
want some pointers, my nick is stefanha.

Stefan


Re: How to diagnose memory leak in kvm-qemu-0.14.0?

2011-05-19 Thread Stefan Hajnoczi
On Thu, May 19, 2011 at 9:40 AM, Stefan Hajnoczi stefa...@gmail.com wrote:
 On Wed, May 18, 2011 at 5:44 PM, Steve Kemp st...@bytemark.co.uk wrote:
 If you have SystemTap installed you may wish to use the dtrace
 backend instead of simple.  You can then use SystemTap scripts on
 the probes.  SystemTap is more powerful, it should allow you to
 extract call stacks when probes are fired but I'm not experienced with
 it.

Forgot to add that the __builtin_return_address() gcc extension can be
used to collect return addresses even with the simple trace backend:
http://gcc.gnu.org/onlinedocs/gcc-4.4.2/gcc/Return-Address.html#index-g_t_005f_005fbuiltin_005freturn_005faddress-2431

I've used it in the past as a poor man's stack trace when tracking
down memory leaks.

Stefan


Re: [RFC v1] Add declarations for hierarchical memory region API

2011-05-19 Thread Stefan Hajnoczi
On Thu, May 19, 2011 at 3:12 PM, Avi Kivity a...@redhat.com wrote:
 +struct MemoryRegion {
 +    /* All fields are private - violators will be prosecuted */
 +    const MemoryRegionOps *ops;
 +    MemoryRegion *parent;

In the case where a region is aliased (mapped twice into the address
space at different addresses) I need two MemoryRegions?  The
MemoryRegion describes an actual mapping in the parent, addr,
ram_addr tuple, not just the attributes of the region (ops, size,
...).

Stefan


Re: snapmirror functionality using qcow2 and rsync?

2011-05-15 Thread Stefan Hajnoczi
On Sun, May 15, 2011 at 1:16 PM, Fred van Zwieten fvzwie...@gmail.com wrote:
 Background:
 NetApp Snapshot functionality gives you application consistent
 snapshots of data. Just inform the app a snapshot is about to be made;
 depending on the app, it needs to go into some sort of backup mode,
 or just stop and flush outstanding I/O. Next, a snapshot is made and
 everything just runs on. Because of the underlying WAFL filesystem
 design, the snapshot always points to the blocks at the time of the
 creation without needing to do any COW.

 Layered on top op this functionality is SnapMirror, where the delta
 between this snapshot and a previous snapshot (both being static in
 nature), is asynchronously replicated to a second system. There, this
 delta is applied to a copy of the disk as a local snapshot.

 This setup gives you application consistent data disks on a remote
 location as a form of disaster tolerance. The RPO is the
 snapshot/snapmirror frequency.

 KVM:
 My question is rather simple. Could something like this be implemented
 with kvm-img and rsync and/or lvm? I've played with the backing_file
 option, but this means I have to shut down the VM and boot it from the new
 file to let this work.

This recent thread might interest you:
http://lists.gnu.org/archive/html/qemu-devel/2011-05/msg00733.html

Basically you have to cobble it together yourself at the moment but
there is activity around live snapshots and dirty block tracking, so
hopefully they will be available as APIs in the future.

Stefan


Re: Trouble adding kvm clock trace to qemu-kvm

2011-05-09 Thread Stefan Hajnoczi
On Sat, Apr 30, 2011 at 6:00 PM, Chris Thompson cth...@cs.umn.edu wrote:
 I'm trying to add a trace to qemu-kvm that will log the value of the vcpu's
 clock when a specific interrupt gets pushed. I'm working with
 qemu-kvm-0.14.0 on the 2.6.32-31 kernel. I've added the following to
 kvm_arch_try_push_interrupts in qemu-kvm-x86.c:

 if (irq == 41) {
    // Get the VCPU's TSC
    struct kvm_clock_data clock;
    kvm_vcpu_ioctl(env, KVM_GET_CLOCK, &clock);
    uint64_t ticks = clock.clock;
    trace_kvm_clock_at_injection(ticks);
 }

 And here's the trace event I added:

 kvm_clock_at_injection(uint64_t ticks) interrupt 41 at clock %PRIu64

 I have that trace and the virtio_blk_req_complete trace enabled. An excerpt
 from the resulting trace output from simpletrace.py:

 virtio_blk_req_complete 288390365546367 30461.681 req=46972352 status=0
 kvm_clock_at_injection 288390365546578 0.211 ticks=46972352
 virtio_blk_req_complete 288390394870065 29323.487 req=46972352 status=0
 kvm_clock_at_injection 288390394870276 0.211 ticks=46972352

Did you modify simpletrace.py?  The 288390365546367 field should
not be there.  The output format should be:
trace-event-name delta-microseconds [arg0=val0...]

It looks like your simpletrace.py may be pretty-printing trace records
incorrectly.

If you have a public git tree you can link to I'd be happy to check
that simpletrace.py is working.

Stefan


Re: [PATCH 07/18] virtio ring: inline function to check for events

2011-05-05 Thread Stefan Hajnoczi
On Wed, May 4, 2011 at 9:51 PM, Michael S. Tsirkin m...@redhat.com wrote:
 With the new used_event and avail_event and features, both
 host and guest need similar logic to check whether events are
 enabled, so it helps to put the common code in the header.

 Note that Xen has similar logic for notification hold-off
 in include/xen/interface/io/ring.h with req_event and req_prod
 corresponding to event_idx + 1 and new_idx respectively.
 +1 comes from the fact that req_event and req_prod in Xen start at 1,
 while event index in virtio starts at 0.

 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 ---
  include/linux/virtio_ring.h |   14 ++
  1 files changed, 14 insertions(+), 0 deletions(-)

 diff --git a/include/linux/virtio_ring.h b/include/linux/virtio_ring.h
 index f791772..2a3b0ea 100644
 --- a/include/linux/virtio_ring.h
 +++ b/include/linux/virtio_ring.h
 @@ -124,6 +124,20 @@ static inline unsigned vring_size(unsigned int num, 
 unsigned long align)
                + sizeof(__u16) * 3 + sizeof(struct vring_used_elem) * num;
  }

 +/* The following is used with USED_EVENT_IDX and AVAIL_EVENT_IDX */
 +/* Assuming a given event_idx value from the other size, if

s/other size/other side/ ?

Stefan


Re: A Live Backup feature for KVM

2011-04-25 Thread Stefan Hajnoczi
On Mon, Apr 25, 2011 at 9:16 AM, Jagane Sundar jag...@sundar.org wrote:
 The direction that I chose to go is slightly different. In both of the
 proposals you pointed me at, the original virtual disk is made
 read-only and the VM writes to a different COW file. After backup
 of the original virtual disk file is complete, the COW file is merged
 with the original vdisk file.

 Instead, I create an Original-Blocks-COW-file to store the original
 blocks that are overwritten by the VM everytime the VM performs
 a write while the backup is in progress. Livebackup copies these
 underlying blocks from the original virtual disk file before the VM's
 write to the original virtual disk file is scheduled. The advantage of
 this is that there is no merge necessary at the end of the backup, we
 can simply delete the Original-Blocks-COW-file.

The advantage of the approach that redirects writes to a new file
instead is that the heavy work of copying data is done asynchronously
during the merge operation instead of in the write path which will
impact guest performance.

Here's what I understand:

1. User takes a snapshot of the disk, QEMU creates old-disk.img backed
by the current-disk.img.
2. Guest issues a write A.
3. QEMU reads B from current-disk.img.
4. QEMU writes B to old-disk.img.
5. QEMU writes A to current-disk.img.
6. Guest receives write completion A.

The tricky thing is what happens if there is a failure after Step 5.
If writes A and B were unstable writes (no fsync()) then no ordering
is guaranteed and perhaps write A reached current-disk.img but write B
did not reach old-disk.img.  In this case we no longer have a
consistent old-disk.img snapshot - we're left with an updated
current-disk.img and old-disk.img does not have a copy of the old
data.

The solution is to fsync() after Step 4 and before Step 5 but this
will hurt performance.  We now have an extra read, write, and fsync()
on every write.
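
In plain POSIX terms the required ordering looks roughly like this (a
sketch only, with both image files already open as file descriptors;
QEMU's actual I/O path is asynchronous):

#include <sys/types.h>
#include <unistd.h>

static int cow_write(int current_fd, int old_fd, const void *new_data,
                     void *scratch, size_t len, off_t offset)
{
    /* Step 3: read the old data B that is about to be overwritten. */
    if (pread(current_fd, scratch, len, offset) != (ssize_t)len) {
        return -1;
    }
    /* Step 4: preserve B in old-disk.img. */
    if (pwrite(old_fd, scratch, len, offset) != (ssize_t)len) {
        return -1;
    }
    /* Without this fsync(), write A below could reach the disk before B
     * is safe, leaving the snapshot inconsistent after a crash. */
    if (fsync(old_fd) != 0) {
        return -1;
    }
    /* Step 5: now the guest's new data A may update current-disk.img. */
    if (pwrite(current_fd, new_data, len, offset) != (ssize_t)len) {
        return -1;
    }
    return 0;
}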

 I have some reasons to believe that the Original-Blocks-COW-file
 design that I am putting forth might work better. I have listed them
 below. (It's past midnight here, so pardon me if it sounds garbled -- I
 will try to clarify more in a writeup on wiki.qemu.org).
 Let me know what your thoughts are..

 I feel that the livebackup mechanism will impact the running VM
 less. For example, if something goes wrong with the backup process,
 then we can simply delete the Original-Blocks-COW-file and force
 the backup client to do a full backup the next time around. The
 running VM or its virtual disks are not impacted at all.

Abandoning snapshots is not okay.  Snapshots will be used in scenarios
beyond backup and I don't think we can make them
unreliable/throw-away.

 Livebackup includes a rudimentary network protocol to transfer
 the modified blocks to a livebackup_client. It supports incremental
 backups. Also, livebackup treats a backup as containing all the virtual
 disks of a VM. Hence a snapshot in livebackup terms refer to a
 snapshot of all the virtual disks.

 The approximate sequence of operation is as follows:
 1. VM boots up. When bdrv_open_common opens any file backed
    virtual disk, it checks for a file called base_file.livebackupconf.
    If such a file exists, then the virtual disk is part of the backup set,
    and a chunk of memory is allocated to keep track of dirty blocks.
 2. qemu starts up a  livebackup thread that listens on a specified port
    (e.g) port 7900, for connections from the livebackup client.
 3. The livebackup_client connects to qemu at port 7900.
 4. livebackup_client sends a 'do snapshot' command.
 5. qemu waits 30 seconds for outstanding asynchronous I/O to complete.
 6. When there are no more outstanding async I/O requests, qemu
    copies the dirty_bitmap to its snapshot structure and starts a new dirty
    bitmap.
 7. livebackup_client starts iterating through the list of dirty blocks, and
    starts saving these blocks to the backup image
 8. When all blocks have been backed up, then the backup_client sends a
    destroy snapshot command; the server simply deletes the
    Original-Blocks-COW-files for each of the virtual disks and frees the
    calloc'd memory holding the dirty blocks list.

I think there's a benefit to just pointing at
Original-Blocks-COW-files and letting the client access it directly.
This even works with shared storage where the actual backup work is
performed on another host via access to a shared network filesystem or
LUN.  It may not be desirable to send everything over the network.


Perhaps you made a custom network client because you are writing a
full-blown backup solution for KVM?  In that case it's your job to
move the data around and get it backed up.  But from QEMU's point of
view we just need to provide the data and it's up to the backup
software to send it over the network and do its magic.

 I have pushed my code to the following git tree.
 git://github.com/jagane/qemu-kvm-livebackup.git

 It started as a clone of the linux kvm tree 

Re: [Qemu-devel] What's the difference between commands qemu, kvm, and qemu-kvm?

2011-04-24 Thread Stefan Hajnoczi
On Sun, Apr 24, 2011 at 12:38 AM, Ryan Wang openspace.w...@gmail.com wrote:
 I read some writings on the qemu, and found some demo examples use the
 command qemu, some use kvm, and some mention the qemu-kvm?

 I wonder are there any difference between these commands? Or they just
 point to the same executable with different names?

/usr/bin/kvm and /usr/bin/qemu-kvm (or /usr/libexec/qemu-kvm) are the
binary that is built from qemu-kvm.git.  If you want x86
virtualization you should use qemu-kvm.  Some distros have called it
just kvm in the past.

/usr/bin/qemu is the binary that is built from qemu.git.  If you want
to emulate other architectures (e.g. ARM) or run on non-x86 hosts then
you should use qemu.

http://blog.vmsplice.net/2011/03/should-i-use-qemu-or-kvm.html

Stefan


Re: A Live Backup feature for KVM

2011-04-24 Thread Stefan Hajnoczi
On Sun, Apr 24, 2011 at 12:17 AM, Jagane Sundar jag...@sundar.org wrote:
 I would like to get your input on a KVM feature that I am
 currently developing.

 What it does is this - it can perform full and incremental
 disk backups of running KVM VMs, where a backup is defined
 as a snapshot of the disk state of all virtual disks
 configured for the VM.

Great, there is definitely demand for live snapshots and online
backup.  Some efforts are already underway to implement this.

Jes has worked on a live snapshot feature for online backups.  The
snapshot_blkdev QEMU monitor command is available in qemu.git and
works like this:
(qemu) snapshot_blkdev virtio-disk0 /tmp/new-img.qcow2

It will create a new image file backed by the current image file.  It
then switches the VM disk to the new image file.  All writes will go
to the new image file.  The backup software on the host can now read
from the original image file since it will not be modified.

There is no support yet for live merging the new image file back into
the original image file (live commit).

Here are some of the workflows and requirements:

http://wiki.qemu.org/Features/Snapshots
http://wiki.qemu.org/Features/Snapshots2
http://wiki.qemu.org/Features/Block/Merge

It is possible to find the dirty blocks by enumerating allocated
clusters in the new image file - these are the clusters that have been
written to since the snapshot.
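
A sketch of that enumeration using bdrv_is_allocated() might look like
this (simplified, synchronous, and not taken from any actual tree):

/* Walk the active image created by snapshot_blkdev; ranges reported as
 * allocated were written after the snapshot was taken. */
static void report_dirty_blocks(BlockDriverState *bs)
{
    int64_t total = bdrv_getlength(bs) / BDRV_SECTOR_SIZE;
    int64_t sector = 0;
    int num;

    while (sector < total) {
        int chunk = total - sector > 65536 ? 65536 : total - sector;

        if (bdrv_is_allocated(bs, sector, chunk, &num)) {
            /* [sector, sector + num) is dirty relative to the snapshot */
        }
        sector += num;
    }
}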

 My proposal will also eventually need the capability to run an
 agent in the guest for sync'ing the filesystem, flushing
 database caches, etc. I am also unsure whether just sync'ing
 a ext3 or ext4 FS and then snapshotting is adequate for backup
 purposes.

virtagent is being developed by Mike Roth as a guest agent for QEMU.
One of the use cases for virtagent is backup/snapshots and Jes has
submitted patches to add file system freeze.  You can find both
virtagent and fsfreeze on the qemu mailing list.

 Please let me know if you find this feature interesting. I am
 looking forward to feedback on any and all aspects of this
 design. I would like to work with the KVM community to
 contribute this feature to the KVM code base.

Do you have a link to a git repo with your code?

Stefan


Re: [Qemu-devel] Is there any qemu-monitor commands like virsh list ?

2011-04-23 Thread Stefan Hajnoczi
On Sat, Apr 23, 2011 at 1:08 PM, Ryan Wang openspace.w...@gmail.com wrote:
 I'm a newbie to qemu/kvm and reading some docs on them.
 I've learned some 'virsh' commands. Now I want to know is
 there any qemu-monitor commands like 'virsh list'?

 Or if I want to list current VMs on my host, I have to use the
 virsh commands, and cannot use some qemu commands?

qemu and qemu-kvm are local, they only know about the single VM they
are running.  The management tool (libvirt and virt-tools) knows about
all the VMs that are running on a host.

Without libvirt you can just use ps(1) to see which qemu processes are
running on the host.
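
For example:

ps -C qemu-kvm -o pid,user,args   # qemu-kvm processes with their full command lines
ps aux | grep '[q]emu'            # also catches binaries named qemu or kvm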

Stefan


Re: performance of virtual functions compared to virtio

2011-04-21 Thread Stefan Hajnoczi
On Thu, Apr 21, 2011 at 9:07 AM, Avi Kivity a...@redhat.com wrote:
 Note I think in both cases we can make significant improvements:
 - for VFs, steer device interrupts to the cpus which run the vcpus that will
 receive the interrupts eventually (ISTR some work about this, but not sure)
 - for virtio, use a DMA engine to copy data (I think there exists code in
 upstream which does this, but has this been enabled/tuned?)

Which data copy in virtio?  Is this a vhost-net specific thing you're
thinking about?

Stefan


Enhancing qemu-img convert format compatibility

2011-04-18 Thread Stefan Hajnoczi
qemu-img is a pretty good Rosetta stone for image formats but it is
missing support for some format versions.  In order to bring qemu-img
up-to-date with the latest disk image formats we will need to find
specific image files and/or software versions that produce image files
that qemu-img cannot understand today.

If you have image files that qemu-img is unable to manipulate, please
respond with details of the software and version used to produce the
image.  If possible please include a link to a small example image
file.
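
Something like the following is enough to start with (the file name here is
just a placeholder):

qemu-img info unsupported.vmdk
qemu-img convert -O raw unsupported.vmdk /tmp/out.raw

If either command fails, the exact error message plus the name and version
of the tool that produced the image is what we need.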

Stefan


Re: [Qemu-devel] Enhancing qemu-img convert format compatibility

2011-04-18 Thread Stefan Hajnoczi
On Mon, Apr 18, 2011 at 12:03 PM, Richard W.M. Jones rjo...@redhat.com wrote:
 On Mon, Apr 18, 2011 at 11:18:42AM +0100, Stefan Hajnoczi wrote:
 qemu-img is a pretty good Rosetta stone for image formats but it is
 missing support some format versions.  In order to bring qemu-img
 up-to-date with the latest disk image formats we will need to find
 specific image files and/or software versions that produce image files
 that qemu-img cannot understand today.

 If you have image files that qemu-img is unable to manipulate, please
 respond with details of the software and version used to produce the
 image.  If possible please include a link to a small example image
 file.

 Stefan,

 We found that using the vSphere 4.x Export to OVF option would
 produce a VMDK file that qemu-img could not convert to raw.

Excellent, thanks for sharing this.  I hope we can build a picture of
where there is missing support and address this in the Improved Image
Format Compatibility project for Google Summer of Code:
http://wiki.qemu.org/Google_Summer_of_Code_2011#Improved_image_format_compatibility

Stefan


Re: Why QCOW1? (was: [PATCH v2] kvm tool: add QCOW verions 1 read/write support)

2011-04-15 Thread Stefan Hajnoczi
On Fri, Apr 15, 2011 at 7:45 AM, Pekka Enberg penb...@kernel.org wrote:
 On Fri, Apr 15, 2011 at 9:41 AM, Markus Armbruster arm...@redhat.com wrote:
 What hasn't been discussed much is the other half of Kevin's remark: why
 QCOW1?

 QCOW1 was simpler to implement as the first non-raw image format.

Why even use a non-raw image format?  The current implementation only
does sparse files, but POSIX sparse raw files gives you the same
feature.
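
For example, with a plain raw file (GNU du shown, the 10G size is arbitrary):

qemu-img create -f raw sparse.img 10G
du -h --apparent-size sparse.img   # logical size: 10G
du -h sparse.img                   # blocks actually allocated: next to nothing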

Besides, why not use btrfs or device-mapper instead of doing image
formats, which ultimately duplicate file system and volume management
code in userspace?

Stefan


Re: Why QCOW1? (was: [PATCH v2] kvm tool: add QCOW verions 1 read/write support)

2011-04-15 Thread Stefan Hajnoczi
On Fri, Apr 15, 2011 at 12:17 PM, Pekka Enberg penb...@kernel.org wrote:
 On Fri, Apr 15, 2011 at 1:14 PM, Stefan Hajnoczi stefa...@gmail.com wrote:
 Why even use a non-raw image format?  The current implementation only
 does sparse files, but POSIX sparse raw files gives you the same
 feature.

 Because people have existing images they want to boot to?

People don't have existing QCOW1 images they want to boot from :).

They have vmdk, vhd, vdi, or qcow2.  You can use qemu-img to convert
them to raw.  You can use qemu-nbd if you are desperate to boot from
or inspect them in-place.
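
For example (file names are placeholders; qemu-nbd needs the nbd kernel
module loaded on the host):

qemu-img convert -O raw disk.vmdk disk.raw   # one-off conversion
modprobe nbd
qemu-nbd --connect=/dev/nbd0 disk.vmdk       # inspect or boot the image in-place
qemu-nbd --disconnect /dev/nbd0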

But I think the natural path for a native Linux KVM tool is to fully
exploit file systems and block layer features in Linux instead of
implementing a userspace block layer.

Stefan


Re: [PATCH v2] kvm tool: add QCOW verions 1 read/write support

2011-04-14 Thread Stefan Hajnoczi
On Thu, Apr 14, 2011 at 11:26:07AM +0200, Markus Armbruster wrote:
 Kevin Wolf kw...@redhat.com writes:
 
  Am 14.04.2011 10:32, schrieb Pekka Enberg:
  Hi Kevin!
  
  Am 14.04.2011 10:21, schrieb Pekka Enberg:
  On Thu, Apr 14, 2011 at 11:02 AM, Kevin Wolf kw...@redhat.com wrote:
  Have you thought about a way to actually share code with qemu instead of
  repeating Xen's mistake of copying code, modifying it until merges are
  no longer possible and then let it bitrot?
 
  No we haven't and we're not planning to copy QEMU code as-is but
  re-implement support for formats we're interested in.
  
  On Thu, Apr 14, 2011 at 11:31 AM, Kevin Wolf kw...@redhat.com wrote:
  Okay. I might not consider it wise, but in the end it's your decision.
  I'm just curious why you think this is the better way?
  
  Well, how would you go about sharing the code without copying in
  practical terms? We're aiming for standalone tool in tools/kvm of the
  Linux kernel so I don't see how we could do that.
 
  Well, copying in itself is not a big problem as long as the copies are
  kept in sync. It's a bit painful, but manageable. Implementing every
  image format twice (and implementing image formats in a reliable and
  performing way isn't trivial) is much more painful.
 
  If you take the approach of getting inspired by qemu and then writing
  your own code, the code will read pretty much the same, but be different
  enough that a diff between both trees is useless and a patch against one
  tree is meaningless for the other one.
 
  The block drivers are relatively isolated in qemu, so I think they
  wouldn't pull in too many dependencies.
 
 Are you suggesting to turn QEMU's block drivers into a reasonably
 general-purpose library?

This is useful not just for native kvm, but also for people writing
tools or automating their KVM setups.

Stefan


Re: Asynchronous interruption of a compute-intensive guest

2011-04-13 Thread Stefan Hajnoczi
On Wed, Apr 13, 2011 at 1:09 AM, Tommaso Cucinotta
tommaso.cucino...@sssup.it wrote:
 I'd like to intercept from the host the exact times at which an incoming
 network packet directed to a guest VM:
 a) is delivered from the host OS to the KVM process;
 b) is delivered to the CPU thread of the KVM process.

 Specifically, I don't have a clean idea of how b) happens when the CPU
 thread is doing compute-intensive activities within the VM. How is the flow
 of control of such thread asynchronously interrupted so as to hand over
 control to the proper network driver in kvm ? Any pointer to the exact
 points to look at, in the KVM code, are also very well appreciated.

If you are using userspace virtio-net (not in-kernel vhost-net), then
an incoming (rx) packet results in the qemu-kvm iothread's select(2)
system call returning with a readable tap file descriptor:
vl.c:main_loop_wait()

(During this time the vcpu thread may still be executing guest code.)

The iothread runs the tap receive function:
net/tap.c:tap_send()

The iothread places the received packet into the rx virtqueue and
interrupts the guest:
hw/virtio-net.c:virtio_net_receive()
hw/virtio-pci.c:virtio_pci_notify()

The interrupt is injected by the KVM kernel module:
arch/x86/kvm/x86.c:kvm_arch_vm_ioctl() KVM_IRQ_LINE

There is some guest mode exiting logic here to kick the vcpu:
arch/x86/kvm/lapic.c:__apic_accept_irq()

During this whole time the vcpu may be executing guest code.  Only at
the very end has the interrupt been inject and the vcpu notified.

Stefan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [transparent networking] Re: [PATCH] kvm tools: Implement virtio network device

2011-04-13 Thread Stefan Hajnoczi
On Wed, Apr 13, 2011 at 2:02 PM, Ingo Molnar mi...@elte.hu wrote:
 Strictly talking the guest does not need ICMP packets to have working Internet
 connectivity - only passing/tunneling through TCP sockets would be enough.

Don't forget UDP for DNS.

Stefan


Re: [PATCH v2 1/2] rbd: use the higher level librbd instead of just librados

2011-04-12 Thread Stefan Hajnoczi
On Tue, Apr 12, 2011 at 1:18 AM, Josh Durgin josh.dur...@dreamhost.com wrote:
 On 04/08/2011 01:43 AM, Stefan Hajnoczi wrote:

 On Mon, Mar 28, 2011 at 04:15:57PM -0700, Josh Durgin wrote:

 librbd stacks on top of librados to provide access
 to rbd images.

 Using librbd simplifies the qemu code, and allows
 qemu to use new versions of the rbd format
 with few (if any) changes.

 Signed-off-by: Josh Durginjosh.dur...@dreamhost.com
 Signed-off-by: Yehuda Sadehyeh...@hq.newdream.net
 ---
  block/rbd.c       |  785
 +++--
  block/rbd_types.h |   71 -
  configure         |   33 +--
  3 files changed, 221 insertions(+), 668 deletions(-)
  delete mode 100644 block/rbd_types.h

 Hi Josh,
 I have applied your patches onto qemu.git/master and am running
 ceph.git/master.

 Unfortunately qemu-iotests fails for me.


 Test 016 seems to hang in qemu-io -g -c write -P 66 128M 512
 rbd:rbd/t.raw.  I can reproduce this consistently.  Here is the
 backtrace of the hung process (not consuming CPU, probably deadlocked):

 This hung because it wasn't checking the return value of rbd_aio_write.
 I've fixed this in the for-qemu branch of
 http://ceph.newdream.net/git/qemu-kvm.git. Also, the existing rbd
 implementation is not 'growable' - writing to a large offset will not expand
 the rbd image correctly. Should we implement bdrv_truncate to support this
 (librbd has a resize operation)? Is bdrv_truncate useful outside of qemu-img
 and qemu-io?

If librbd has a resize operation then it would be nice to wire up
bdrv_truncate() for completeness.  Note that bdrv_truncate() can also
be called online using the block_resize monitor command.
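
A rough sketch of that wiring, assuming the BDRVRBDState from the patch
keeps the open image handle in a field called image (the callback and field
names are guesses, not a tested patch):

static int qemu_rbd_truncate(BlockDriverState *bs, int64_t offset)
{
    BDRVRBDState *s = bs->opaque;   /* assumed to hold an rbd_image_t "image" */
    int r;

    r = rbd_resize(s->image, offset);
    if (r < 0) {
        return r;
    }
    return 0;
}

/* and in the BlockDriver definition: .bdrv_truncate = qemu_rbd_truncate, */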

Since rbd devices are not growable we should fix qemu-iotests to skip
016 for rbd.

 Test 008 failed with an assertion but succeeded when run again.  I think
 this is a race condition:

 This is likely a use-after-free, but I haven't been able to find the race
 condition yet (or reproduce it). Could you get a backtrace from the core
 file?

Unfortunately I have no core file and wasn't able to reproduce it again.

Is qemu-iotests passing for you now?

Stefan


Re: qemu-kvm monitor

2011-04-12 Thread Stefan Hajnoczi
On Tue, Apr 12, 2011 at 12:26 PM, Onkar Mahajan kern.de...@gmail.com wrote:
 Hi All,

 I have following command line options to qemu-kvm ( apart from others
 - irrelevant here !! )

 -chardev 
 socket,id=charmonitor,path=/var/lib/libvirt/qemu/vm00-SMP.monitor,server,nowait
 -mon chardev=charmonitor,id=monitor,mode=control

 How do I now start the qemu-kvm monitor control session ??

It looks like you're using libvirt.  Here are two methods of getting
at the QEMU monitor through libvirt:

http://blog.vmsplice.net/2011/03/how-to-access-qemu-monitor-through.html

Basically, libvirt is already connected to the vm00-SMP.monitor UNIX
domain socket.  You need to either use the qemu-monitor-command virsh
command or you need to stop libvirtd and connect to the socket
manually (e.g. using netcat).
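
For example (the domain name is taken from the socket path above):

virsh qemu-monitor-command --hmp vm00-SMP 'info block'

or, with libvirtd stopped:

nc -U /var/lib/libvirt/qemu/vm00-SMP.monitor

Note that with mode=control the socket speaks QMP, so after connecting with
netcat you have to send {"execute": "qmp_capabilities"} before any other
command will be accepted.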

Stefan


Re: [PATCH v2 1/2] rbd: use the higher level librbd instead of just librados

2011-04-12 Thread Stefan Hajnoczi
On Tue, Apr 12, 2011 at 4:38 PM, Sage Weil s...@newdream.net wrote:
 On Tue, 12 Apr 2011, Stefan Hajnoczi wrote:
 On Tue, Apr 12, 2011 at 1:18 AM, Josh Durgin josh.dur...@dreamhost.com 
 wrote:
  On 04/08/2011 01:43 AM, Stefan Hajnoczi wrote:
 
  On Mon, Mar 28, 2011 at 04:15:57PM -0700, Josh Durgin wrote:
 
  librbd stacks on top of librados to provide access
  to rbd images.
 
  Using librbd simplifies the qemu code, and allows
  qemu to use new versions of the rbd format
  with few (if any) changes.
 
  Signed-off-by: Josh Durginjosh.dur...@dreamhost.com
  Signed-off-by: Yehuda Sadehyeh...@hq.newdream.net
  ---
   block/rbd.c       |  785
  +++--
   block/rbd_types.h |   71 -
   configure         |   33 +--
   3 files changed, 221 insertions(+), 668 deletions(-)
   delete mode 100644 block/rbd_types.h
 
  Hi Josh,
  I have applied your patches onto qemu.git/master and am running
  ceph.git/master.
 
  Unfortunately qemu-iotests fails for me.
 
 
  Test 016 seems to hang in qemu-io -g -c write -P 66 128M 512
  rbd:rbd/t.raw.  I can reproduce this consistently.  Here is the
  backtrace of the hung process (not consuming CPU, probably deadlocked):
 
  This hung because it wasn't checking the return value of rbd_aio_write.
  I've fixed this in the for-qemu branch of
  http://ceph.newdream.net/git/qemu-kvm.git. Also, the existing rbd
  implementation is not 'growable' - writing to a large offset will not 
  expand
  the rbd image correctly. Should we implement bdrv_truncate to support this
  (librbd has a resize operation)? Is bdrv_truncate useful outside of 
  qemu-img
  and qemu-io?

 If librbd has a resize operation then it would be nice to wire up
 bdrv_truncate() for completeness.  Note that bdrv_truncate() can also
 be called online using the block_resize monitor command.

 Since rbd devices are not growable we should fix qemu-iotests to skip
 016 for rbd.

 There is a resize operation, but it's expected that you'll use it for any
 bdev size change (grow or shrink).  Does qemu grow a device by writing to
 the (new) highest offset, or is there another operation that should be
 wired up?  We want to avoid a situation where RBD isn't aware of the qemu
 bdev resize and has to grow a bit each time we write to a larger offset,
 as resize is a somewhat expensive operation...

Good, it sounds like RBD and QEMU have similar concepts here.  The
bdrv_truncate() operation is a (rare) image resize operation.  It is
not the extend-beyond-EOF grow operation which QEMU simply performs as
a write beyond bdrv_getlength() bytes.

Stefan


Re: EuroSec'11 Presentation

2011-04-11 Thread Stefan Hajnoczi
On Sun, Apr 10, 2011 at 4:19 PM, Kuniyasu Suzaki k.suz...@aist.go.jp wrote:

 From: Avi Kivity a...@redhat.com
 Subject: Re: EuroSec'11 Presentation
 Date: Sun, 10 Apr 2011 17:49:52 +0300

 On 04/10/2011 05:23 PM, Kuniyasu Suzaki wrote:
  Dear,
 
  I made a presentation about memory disclosure attack on SKM (Kernel
  Samepage Merging) with KVM at EuroSec 2011.
  The titile is Memory Deduplication as a Threat to the Guest OS.
     http://www.iseclab.org/eurosec-2011/program.html
 
  The slide is downloadbale.
     http://www.slideshare.net/suzaki/eurosec2011-slide-memory-deduplication
  The paper will be downloadble form ACM Digital Library.
 
  Please tell me, if you have comments. Thank you.

 Very interesting presentation.  It seems every time you share something,
 it become a target for attacks.

 I'm happy to hear your comments.
 The referee's comment was severe. It said there was not brand-new
 point, but there are real attack experiences.  My paper was just
 evaluated the detction on apahce2 and sshd on Linux Guest OS and
 Firefox and IE6 on Windows Guest OS.

If I have a VM on the same physical host as someone else I may be able
to determine which programs and specific versions they are currently
running.

Is there some creative attack using this technique that I'm missing?
I don't see many serious threats.

Stefan


Re: EuroSec'11 Presentation

2011-04-11 Thread Stefan Hajnoczi
On Mon, Apr 11, 2011 at 4:27 PM, Anthony Liguori anth...@codemonkey.ws wrote:
 On 04/11/2011 03:51 AM, Stefan Hajnoczi wrote:

 I'm happy to hear your comments.
 The referee's comment was severe. It said there was not brand-new
 point, but there are real attack experiences.  My paper was just
 evaluated the detction on apahce2 and sshd on Linux Guest OS and
 Firefox and IE6 on Windows Guest OS.

 If I have a VM on the same physical host as someone else I may be able
 to determine which programs and specific versions they are currently
 running.

 Is there some creative attack using this technique that I'm missing?
 I don't see many serious threats.

 It's a deviation of a previously demonstrated attack where memory access
 timing is used to guess memory content.  This has been demonstrated in the
 past to be a viable technique to reduce the keyspace of things like ssh keys
 which makes attack a bit easier.

How can you reduce the key space by determining whether the guest has
arbitrary 4 KB data in physical memory?

Stefan


Re: Slow PXE boot in qemu.git (fast in qemu-kvm.git)

2011-04-09 Thread Stefan Hajnoczi
On Sat, Apr 9, 2011 at 1:50 AM, Anthony Liguori anth...@codemonkey.ws wrote:
 On 04/08/2011 06:25 PM, Luiz Capitulino wrote:

 Hi there,

 Summary:

  - PXE boot in qemu.git (HEAD f124a41) is quite slow, more than 5 minutes.
 Got
    the problem with e1000, virtio and rtl8139. However, pcnet *works*
 (it's
    as fast as qemu-kvm.git)

  - PXE boot in qemu-kvm.git (HEAD df85c051) is fast, less than a minute.
 Tried
    with e1000, virtio and rtl8139 (I don't remember if I tried with pcnet)

 I tried with qemu.git v0.13.0 in order to check if this was a regression,
 but
 I got the same problem...

 Then I inspected qemu-kvm.git under the assumption that it could have a
 fix
 that wasn't commited to qemu.git. Found this:

  - commit 0836b77f0f65d56d08bdeffbac25cd6d78267dc9 which is merge, works

  - commit cc015e9a5dde2f03f123357fa060acbdfcd570a4 does not work (it's
 slow)

 I tried a bisect, but it brakes due to gcc4 vs. gcc3 changes. Then I
 inspected
 commits manually, and found out that commit 64d7e9a4 doesn't work, which
 makes
 me think that the fix could be in the conflict resolution of 0836b77f,
 which
 makes me remember that I'm late for diner, so my conclusions at this point
 are
 not reliable :)

 Can you run kvm_stat to see what the exit rates are?

 Maybe we're missing a coalesced io in qemu.git?  It's also possible that
 gpxe is hitting the apic or pit quite a lot.

In gPXE's main loop it will do real <-> protected mode switches and
poll hardware.  It doesn't handle interrupts itself but sets up the
8254 timer chip.

I once found that polling the keyboard only every couple of gPXE main
loop iterations significantly speeds up network throughput under KVM.
I never got around to auditing the entire main loop and implementing a
clean patch.
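
The idea was roughly this (illustrative C only; poll_nic() and poll_keyboard()
are placeholders, not real gPXE functions):

static void main_loop_step(void)
{
    static unsigned int counter;

    poll_nic();                    /* service the NIC on every iteration */

    if (++counter % 16 == 0) {     /* hit the expensive keyboard path rarely */
        poll_keyboard();
    }
}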

Anyway, kvm_stat is a good idea.  It may be tickling qemu in a way
that qemu-kvm is immune to.

Stefan


Re: [PATCH v2 1/2] rbd: use the higher level librbd instead of just librados

2011-04-08 Thread Stefan Hajnoczi
On Mon, Mar 28, 2011 at 04:15:57PM -0700, Josh Durgin wrote:
 librbd stacks on top of librados to provide access
 to rbd images.
 
 Using librbd simplifies the qemu code, and allows
 qemu to use new versions of the rbd format
 with few (if any) changes.
 
 Signed-off-by: Josh Durgin josh.dur...@dreamhost.com
 Signed-off-by: Yehuda Sadeh yeh...@hq.newdream.net
 ---
  block/rbd.c   |  785 
 +++--
  block/rbd_types.h |   71 -
  configure |   33 +--
  3 files changed, 221 insertions(+), 668 deletions(-)
  delete mode 100644 block/rbd_types.h

Hi Josh,
I have applied your patches onto qemu.git/master and am running
ceph.git/master.

Unfortunately qemu-iotests fails for me.


Test 016 seems to hang in qemu-io -g -c write -P 66 128M 512
rbd:rbd/t.raw.  I can reproduce this consistently.  Here is the
backtrace of the hung process (not consuming CPU, probably deadlocked):

Thread 9 (Thread 0x7f9ded6d6700 (LWP 26049)):
#0  0x7f9def41d16c in pthread_cond_wait@@GLIBC_2.3.2 () from 
/lib/libpthread.so.0
#1  0x7f9dee676d9a in Wait (this=0x2723950) at ./common/Cond.h:46
#2  SimpleMessenger::dispatch_entry (this=0x2723950) at 
msg/SimpleMessenger.cc:362
#3  0x7f9dee66180c in SimpleMessenger::DispatchThread::entry (this=value 
optimized out) at msg/SimpleMessenger.h:533
#4  0x7f9def4188ba in start_thread () from /lib/libpthread.so.0
#5  0x7f9dee14d02d in clone () from /lib/libc.so.6
#6  0x in ?? ()

Thread 8 (Thread 0x7f9deced5700 (LWP 26050)):
#0  0x7f9def41d16c in pthread_cond_wait@@GLIBC_2.3.2 () from 
/lib/libpthread.so.0
#1  0x7f9dee674fab in Wait (this=0x2723950) at ./common/Cond.h:46
#2  SimpleMessenger::reaper_entry (this=0x2723950) at 
msg/SimpleMessenger.cc:2251
#3  0x7f9dee6617ac in SimpleMessenger::ReaperThread::entry (this=0x2723d80) 
at msg/SimpleMessenger.h:485
#4  0x7f9def4188ba in start_thread () from /lib/libpthread.so.0
#5  0x7f9dee14d02d in clone () from /lib/libc.so.6
#6  0x in ?? ()

Thread 7 (Thread 0x7f9dec6d4700 (LWP 26051)):
#0  0x7f9def41d4d9 in pthread_cond_timedwait@@GLIBC_2.3.2 () from 
/lib/libpthread.so.0
#1  0x7f9dee72187a in WaitUntil (this=0x2722c00) at common/Cond.h:60
#2  SafeTimer::timer_thread (this=0x2722c00) at common/Timer.cc:110
#3  0x7f9dee722d7d in SafeTimerThread::entry (this=value optimized out) 
at common/Timer.cc:38
#4  0x7f9def4188ba in start_thread () from /lib/libpthread.so.0
#5  0x7f9dee14d02d in clone () from /lib/libc.so.6
#6  0x in ?? ()

Thread 6 (Thread 0x7f9df07ea700 (LWP 26052)):
#0  0x7f9def41d16c in pthread_cond_wait@@GLIBC_2.3.2 () from 
/lib/libpthread.so.0
#1  0x7f9dee67cae1 in Wait (this=0x2729890) at ./common/Cond.h:46
#2  SimpleMessenger::Pipe::writer (this=0x2729890) at 
msg/SimpleMessenger.cc:1746
#3  0x7f9dee66187d in SimpleMessenger::Pipe::Writer::entry (this=value 
optimized out) at msg/SimpleMessenger.h:204
#4  0x7f9def4188ba in start_thread () from /lib/libpthread.so.0
#5  0x7f9dee14d02d in clone () from /lib/libc.so.6
#6  0x in ?? ()

Thread 5 (Thread 0x7f9debed3700 (LWP 26055)):
#0  0x7f9dee142113 in poll () from /lib/libc.so.6
#1  0x7f9dee66d599 in tcp_read_wait (sd=value optimized out, 
timeout=value optimized out) at msg/tcp.cc:48
#2  0x7f9dee66e89b in tcp_read (sd=3, buf=value optimized out, len=1, 
timeout=90) at msg/tcp.cc:25
#3  0x7f9dee67ffd2 in SimpleMessenger::Pipe::reader (this=0x2729890) at 
msg/SimpleMessenger.cc:1539
#4  0x7f9dee66185d in SimpleMessenger::Pipe::Reader::entry (this=value 
optimized out) at msg/SimpleMessenger.h:196
#5  0x7f9def4188ba in start_thread () from /lib/libpthread.so.0
#6  0x7f9dee14d02d in clone () from /lib/libc.so.6
#7  0x in ?? ()

Thread 4 (Thread 0x7f9debdd2700 (LWP 26056)):
#0  0x7f9def41d4d9 in pthread_cond_timedwait@@GLIBC_2.3.2 () from 
/lib/libpthread.so.0
#1  0x7f9dee72187a in WaitUntil (this=0x2722e58) at common/Cond.h:60
#2  SafeTimer::timer_thread (this=0x2722e58) at common/Timer.cc:110
#3  0x7f9dee722d7d in SafeTimerThread::entry (this=value optimized out) 
at common/Timer.cc:38
#4  0x7f9def4188ba in start_thread () from /lib/libpthread.so.0
#5  0x7f9dee14d02d in clone () from /lib/libc.so.6
#6  0x in ?? ()

Thread 3 (Thread 0x7f9deb2ce700 (LWP 26306)):
#0  0x7f9def41d16c in pthread_cond_wait@@GLIBC_2.3.2 () from 
/lib/libpthread.so.0
#1  0x7f9dee67cae1 in Wait (this=0x272f090) at ./common/Cond.h:46
#2  SimpleMessenger::Pipe::writer (this=0x272f090) at 
msg/SimpleMessenger.cc:1746
#3  0x7f9dee66187d in SimpleMessenger::Pipe::Writer::entry (this=value 
optimized out) at msg/SimpleMessenger.h:204
#4  0x7f9def4188ba in start_thread () from /lib/libpthread.so.0
#5  0x7f9dee14d02d in clone () from /lib/libc.so.6
#6  0x in ?? ()

Thread 2 (Thread 0x7f9deb3cf700 (LWP 26309)):

Re: [PATCH v2 1/2] rbd: use the higher level librbd instead of just librados

2011-04-08 Thread Stefan Hajnoczi
On Fri, Apr 8, 2011 at 9:43 AM, Stefan Hajnoczi stefa...@gmail.com wrote:
 On Mon, Mar 28, 2011 at 04:15:57PM -0700, Josh Durgin wrote:
 librbd stacks on top of librados to provide access
 to rbd images.

 Using librbd simplifies the qemu code, and allows
 qemu to use new versions of the rbd format
 with few (if any) changes.

 Signed-off-by: Josh Durgin josh.dur...@dreamhost.com
 Signed-off-by: Yehuda Sadeh yeh...@hq.newdream.net
 ---
  block/rbd.c       |  785 
 +++--
  block/rbd_types.h |   71 -
  configure         |   33 +--
  3 files changed, 221 insertions(+), 668 deletions(-)
  delete mode 100644 block/rbd_types.h

 Hi Josh,
 I have applied your patches onto qemu.git/master and am running
 ceph.git/master.

 Unfortunately qemu-iotests fails for me.

I forgot to mention that qemu-iotests lives at:

git://git.kernel.org/pub/scm/linux/kernel/git/hch/qemu-iotests.git
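
Basic usage looks like this (the format/protocol selection flags vary between
checkouts, so see ./check -h for what your copy supports):

cd qemu-iotests
./check 008 016    # run individual tests against the default raw-on-file setup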

Stefan


Re: [Qemu-devel] KVM call minutes for Apr 5

2011-04-08 Thread Stefan Hajnoczi
On Fri, Apr 08, 2011 at 09:58:22AM -0300, Lucas Meneghel Rodrigues wrote:
 On Thu, 2011-04-07 at 11:03 +0100, Stefan Hajnoczi wrote:
  On Tue, Apr 5, 2011 at 6:37 PM, Lucas Meneghel Rodrigues l...@redhat.com 
  wrote:
   Perhaps kvm-autotest is a good platform for the automated testing of
   ARM TCG.  Paul is CCed, I recently saw the Jenkins qemu build and boot
   tests he has set up.  Lucas, do you have ideas on how these efforts
   can work together to bring testing to upstream QEMU?
   http://validation.linaro.org/jenkins/job/qemu-boot-images/
  
   I heard about jenkins before and it is indeed a nice project. What they
   do here, from what I could assess browsing at the webpage you provided
   is:
  
   1) Build qemu.git every time there are commits
   2) Boot pre-made 'pristine' images, one is a lenny arm image and the
   other is a linaro arm image.
  
   It is possible to do the same with kvm autotest, just a matter of not
   performing guest install tests and executing only the boot tests with
   pre-made images. What jenkins does here is a even quicker and shorter
   version of our sanity jobs.
  
   About how we can work together, I thought about some possibilities:
  
   1) Modify the jenkins test step to execute a kvm autotest job after the
   build, with the stripped down test set. We might gain some extra debug
   info, that the current test step does not seem to provide
   2) Do the normal test step and if that succeeds, trigger a kvm autotest
   job that does more comprehensive testing, such as migration, time drift,
   block layer, etc
  
   The funny thing is that KVM autotest has infrastructure to do the same
   as jenkins does, but jenkins is highly streamlined for the buildbot use
   case (continuous build and integration), and I see that as a very nice
   advantage. So I'd rather keep use jenkins and have kvm autotest plugged
   into it conveniently.
  
  That sounds good.  I think the benefit of working together is that
  different entities (Linaro, Red Hat, etc) can contribute QEMU tests
  into a single place.  That testing can then cover to both upstream and
  downstream to prevent breakage.
  
  So kvm-autotest can run in single job mode and kicked off from jenkins
  or buildbot?
  
  It sounds like kvm-autotest has or needs its own cron, result
  archiving, etc infrastructure.  Does it make sense to use a harness
  like jenkins or buildbot instead and focus kvm-autotest purely as a
  testing framework?
 
 In the context that there are already jenkins/buildbot servers running
 for qemu, having only the KVM testing part (autotest client + kvm test)
 is a possibility, to make things easier to plug and work with what is
 already deployed.
 
 However, not possible to focus KVM autotest as a 'test framework'. What
 we call KVM autotest is in actuality, a client test of autotest.
 Autotest is a generic, large collection of programs and libraries
 targeted at peforming automated testing on the linux platform, it was
 developed to test the linux kernel itself, and it is used to do
 precisely that. Look at test.kernel.org. All those tests are executed by
 autotest.
 
 So, autotest is much more than KVM testing, and I am one of the autotest
 maintainers, so I am commited to work on all parts of that stack.
 Several testing projects urelated to KVM use our code, and our
 harnessing and infrastructure is already pretty good, we'll keep to
 develop it.
 
 The whole thing was designed in a modular way, so it's doable to use
 parts of it (such as the autotest client and the KVM test) and integrate
 with stuff such as jenkins and buildbot, and if people need and want to
 do that, awesome. But we are going to continue develop autotest as a
 whole framework/automation utilities/API, while developing the KVM test.

I wasn't aware of the scope of autotest and your involvement.  I need to
look into autotest more :).

Stefan


Re: trace-cmd errors on kvm events

2011-04-08 Thread Stefan Hajnoczi
On Fri, Apr 8, 2011 at 7:53 PM, David Ahern dsah...@gmail.com wrote:
 2.6.38.2 kernel with trace-cmd git pulled this morning:

 trace-cmd record -e kvm

 trace-cmd report 21 | less

 trace-cmd: No such file or directory
  function ftrace_print_symbols_seq not defined
  failed to read event print fmt for kvm_nested_vmexit_inject
  function ftrace_print_symbols_seq not defined
  failed to read event print fmt for kvm_nested_vmexit
  function ftrace_print_symbols_seq not defined
  failed to read event print fmt for kvm_exit
  bad op token {
  failed to read event print fmt for kvm_emulate_insn

        qemu-kvm-1864  [002]  2253.714134: kvm_entry:            vcpu 1
        qemu-kvm-1863  [008]  2253.714136: kvm_exit:             [FAILED
 TO PARSE] exit_reason=44 guest_rip=0xc01185ed isa=1 info1=4272 info2=0
        qemu-kvm-1864  [002]  2253.714138: kvm_exit:             [FAILED
 TO PARSE] exit_reason=44 guest_rip=0xc01185ed isa=1 info1=4272 info2=0
        qemu-kvm-1863  [008]  2253.714145: kvm_emulate_insn:     [FAILED
 TO PARSE] rip=3222373869 csbase=0 len=2 insn=89^H]C38B^UEC95KC0U
 89E5]8D^E flags=5 failed=0

 I have not used trace-cmd much, so I am not familiar with the code. Is
 this a known issue? Suggestions on how to debug?

I think there have been issues for a long time.  I've never gotten
perf or trace-cmd to be happy with kvm:* events.  Here is a related
thread from a while back:

https://lkml.org/lkml/2010/5/26/194

When I looked a while back the problem was due to how there is some
preprocessor magic in Linux that ends up exporting C expressions as
strings to userspace and neither perf nor trace-cmd have the parsing
smarts to evaluate the C expressions at runtime.

I ended up using ftrace instead which handles everything inside the
kernel and compiles in those C expressions.
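
For reference, the ftrace route looks like this (with debugfs mounted in the
usual place):

mount -t debugfs none /sys/kernel/debug     # if not already mounted
echo 1 > /sys/kernel/debug/tracing/events/kvm/enable
cat /sys/kernel/debug/tracing/trace_pipe    # events are formatted in the kernel
echo 0 > /sys/kernel/debug/tracing/events/kvm/enable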

Stefan


Re: [PATCH v2 2/2] rbd: allow configuration of rados from the rbd filename

2011-04-07 Thread Stefan Hajnoczi
On Thu, Apr 07, 2011 at 10:14:03AM +0900, Yoshiaki Tamura wrote:
 2011/3/29 Josh Durgin josh.dur...@dreamhost.com:
  The new format is 
  rbd:pool/image[@snapshot][:option1=value1[:option2=value2...]]
  Each option is used to configure rados, and may be any Ceph option, or 
  conf.
  The conf option specifies a Ceph configuration file to read.
 
  This allows rbd volumes from more than one Ceph cluster to be used by
  specifying different monitor addresses, as well as having different
  logging levels or locations for different volumes.
 
  Signed-off-by: Josh Durgin josh.dur...@dreamhost.com
  ---
   block/rbd.c |  119 
  ++
   1 files changed, 102 insertions(+), 17 deletions(-)
 
  diff --git a/block/rbd.c b/block/rbd.c
  index cb76dd3..bc3323d 100644
  --- a/block/rbd.c
  +++ b/block/rbd.c
  @@ -22,13 +22,17 @@
   /*
   * When specifying the image filename use:
   *
  - * rbd:poolname/devicename
  + * 
  rbd:poolname/devicename[@snapshotname][:option1=value1[:option2=value2...]]
 
 I'm not sure IIUC, but currently this @snapshotname seems to be
 meaningless; it doesn't allow you to boot from a snapshot because it's
 read only.  Am I misunderstanding or tested incorrectly?

Read-only block devices are supported by QEMU and can be useful.

Stefan


Re: [Qemu-devel] KVM call minutes for Apr 5

2011-04-07 Thread Stefan Hajnoczi
On Tue, Apr 5, 2011 at 6:37 PM, Lucas Meneghel Rodrigues l...@redhat.com 
wrote:

Thanks for your detailed response!

 On Tue, 2011-04-05 at 16:29 +0100, Stefan Hajnoczi wrote:
 * Public notifications of breakage, qemu.git/master failures to
 qemu-devel mailing list.

 ^ The challenge is to get enough data to determine what is a new
 breakage from a known issue, mainly. More related to have historical
 data from test results than anything else, IMO.

I agree.  Does kvm-autotest currently archive test results?

 * A one-time contributor can get their code tested.  No requirement to
 set up a server because contributors may not have the resources.

 Coming back to the point that many colleagues made: We need a sort of
 'make test' on the qemu trees that would fetch autotest and could setup
 basic tests that people could run, maybe suggest test sets...

 The problem I see is, getting guests up and running using configs that
 actually matter is not trivial (there are things such as ensuring that
 all auxiliary utilities are installed in a distro agnostic fashion,
 having bridges and DHCP server setup on possibly a disconnected work
 laptop, and stuff).

 So, having a 'no brains involved at all' setup is quite a challenge,
 suggestions welcome. Also, downloading isos, waiting for guests to
 install and run thorough tests won't be fast. So J. Random Developer
 might not bother to run tests even if we can provide a fool proof,
 perfectly automated setup, because it'd take a long time at first to get
 the tests run. This is also a challenge.

I'm actually starting to think that there is no one-size-fits-all solution.

Developers need make check-type unit tests for various QEMU
subsystems.  kvm-autotest could also run these unit tests as part of
its execution.

Then there are end-to-end acceptance tests.  They simply require
storage, network, and time resources and there's no way around that.
These tests are more suited to centralized testing infrastructure that
periodically tests qemu.git.

On the community call I was trying to see if there is a lightweight
version of kvm-autotest that could be merged into qemu.git.  But now I
think that this isn't realistic and it would be better to grow unit
tests in qemu.git while covering it with kvm-autotest for acceptance
testing.

 Perhaps kvm-autotest is a good platform for the automated testing of
 ARM TCG.  Paul is CCed, I recently saw the Jenkins qemu build and boot
 tests he has set up.  Lucas, do you have ideas on how these efforts
 can work together to bring testing to upstream QEMU?
 http://validation.linaro.org/jenkins/job/qemu-boot-images/

 I heard about jenkins before and it is indeed a nice project. What they
 do here, from what I could assess browsing at the webpage you provided
 is:

 1) Build qemu.git every time there are commits
 2) Boot pre-made 'pristine' images, one is a lenny arm image and the
 other is a linaro arm image.

 It is possible to do the same with kvm autotest, just a matter of not
 performing guest install tests and executing only the boot tests with
 pre-made images. What jenkins does here is a even quicker and shorter
 version of our sanity jobs.

 About how we can work together, I thought about some possibilities:

 1) Modify the jenkins test step to execute a kvm autotest job after the
 build, with the stripped down test set. We might gain some extra debug
 info, that the current test step does not seem to provide
 2) Do the normal test step and if that succeeds, trigger a kvm autotest
 job that does more comprehensive testing, such as migration, time drift,
 block layer, etc

 The funny thing is that KVM autotest has infrastructure to do the same
 as jenkins does, but jenkins is highly streamlined for the buildbot use
 case (continuous build and integration), and I see that as a very nice
 advantage. So I'd rather keep use jenkins and have kvm autotest plugged
 into it conveniently.

That sounds good.  I think the benefit of working together is that
different entities (Linaro, Red Hat, etc) can contribute QEMU tests
into a single place.  That testing can then cover both upstream and
downstream to prevent breakage.

So kvm-autotest can run in single job mode and kicked off from jenkins
or buildbot?

It sounds like kvm-autotest has or needs its own cron, result
archiving, etc infrastructure.  Does it make sense to use a harness
like jenkins or buildbot instead and focus kvm-autotest purely as a
testing framework?

Stefan


Re: [Qemu-devel] KVM call minutes for Apr 5

2011-04-05 Thread Stefan Hajnoczi
On Tue, Apr 5, 2011 at 4:07 PM, Chris Wright chr...@redhat.com wrote:
 kvm-autotest
 - roadmap...refactor to centralize testing (handle the xen-autotest split off)
 - internally at RH, lmr and cleber maintain autotest server to test
  branches (testing qemu.git daily)
  - have good automation for installs and testing
 - seems more QA focused than developers
  - plenty of benefit for developers, so lack of developer use partly
    cultural/visibility...
  - kvm-autotest team always looking for feedback to improve for
    developer use case
 - kvm-autotest day to have folks use it, write test, give feedback?
  - startup cost is/was steep, the day might be too much handholding
  - install-fest? (to get it installed and up and running)
 - buildbot or autotest for testing patches to verify building and working
 - one goal is to reduce mailing list load (patch resubmission because
  they haven't handled basic cases that buildbot or autotest would have
  caught)
 - fedora-virt test day coming up on April 14th.  lucas will be on hand and
  we can piggy back on that to include kvm-autotest install and virt testing
 - kvm autotest run before qemu pull request and post merge to track
  regressions, more frequent testing helps developers see breakage
  quickly
  - qemu.git daily testing already, only the sanity test subset
    - run more comprehensive stable set of tests on weekends
 - one issue is the large number of known failures, need to make these
  easier to identify (and fix the failures one way or another)
 - create database and verify (regressions) against that
  - red/yellow/green (yellow shows area was already broken)
 - autotest can be run against server, not just on laptop

Features that I think are important for a qemu.git kvm-autotest:
* Public results display (sounds like you're working on this)
* Public notifications of breakage, qemu.git/master failures to
qemu-devel mailing list.
* A one-time contributor can get their code tested.  No requirement to
set up a server because contributors may not have the resources.

Perhaps kvm-autotest is a good platform for the automated testing of
ARM TCG.  Paul is CCed, I recently saw the Jenkins qemu build and boot
tests he has set up.  Lucas, do you have ideas on how these efforts
can work together to bring testing to upstream QEMU?
http://validation.linaro.org/jenkins/job/qemu-boot-images/

Stefan


Re: 2.6.32.x guest dies when trying to run tcpdump

2011-04-02 Thread Stefan Hajnoczi
On Sat, Apr 2, 2011 at 4:23 PM, Nikola Ciprich extmaill...@linuxbox.cz wrote:
 I'm using virtio network channel, and on one of the guests (the one with 
 aborted ext4) I use it also for one of virtual disks.
 One more interesting thing, I can't reproduce this immediately after guest 
 boot, but for example second day after boot, I can reproduce this.
 perhaps this can suggest something?
 Could somebody please help me to find and possibly fix this bug?

Softlockups are a symptom that a guest vcpu hasn't been able to
execute.  Unfortunately I don't see anything that points to a specific
bug in the backtraces.

 If needed, I can provide further debugging information, bisect etc...

It looks like your guests are SMP.  How many vcpus are you running?
How many physical cpus does /proc/cpuinfo list on the host?

Is the host overloaded when this occurs?

Are there any clues in host dmesg?
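
A few commands on the host should answer most of that:

grep -c ^processor /proc/cpuinfo   # logical CPUs on the host
uptime                             # load average around the time of the lockup
dmesg | tail -n 50                 # recent host kernel messages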

Stefan


Re: [PATCH] KVM: Automatic user feedback

2011-04-01 Thread Stefan Hajnoczi
On Fri, Apr 1, 2011 at 9:33 AM, Alexander Graf ag...@suse.de wrote:
 We're constantly developing and improving KVM, implementing new awesome
 features or simply fixing bugs in the existing code.

 But do people actually use that new code? Are we maybe writing it all in
 vain? Wouldn't it be nice to have some feeling for the number of users
 actually using our code?

 This patch enables us to stress test our automated test suite and at
 the same time figure out if we actually have users. When using a new
 kernel with this patch applied, the user will automatically send feedback,
 telling us that there are people out there, trying recent versions!

 Signed-off-by: Alexander Graf ag...@suse.de
 ---
  virt/kvm/kvm_main.c |    2 ++
  1 files changed, 2 insertions(+), 0 deletions(-)

/me migrates to linux-kvm


Re: virtio-blk.c handling of i/o which is not a 512 multiple

2011-03-30 Thread Stefan Hajnoczi
On Wed, Mar 30, 2011 at 9:15 AM, Conor Murphy
conor_murphy_v...@hotmail.com wrote:
 I'm trying to write a virtio-blk driver for Solaris. I've gotten it to the 
 point
 where Solaris can see the device and create a ZFS file system on it.

 However when I try and create a UFS filesystem on the device, the VM crashed
 with the error
 *** glibc detected *** /usr/bin/qemu-kvm: double free or corruption (!prev):
 0x7f2d38000a00 ***

This is a bug in QEMU.  A guest must not be able to trigger a crash.

 I can reproduce the problem with a simple dd, i.e.
 dd if=/dev/zero of=/dev/rdsk/c2d10p0 bs=5000 count=1

I think this is a raw character device, which is why you're even able to
perform non-blocksize accesses?  Have you looked at how other drivers
(like the Xen pv blkfront) handle this?

 My driver will create a virtio-blk request with two elements in the sg list, 
 one
 for the first 4096 byes and the other for the remaining 904.

 From stepping through with gdb, virtio_blk_handle_write will sets n_sectors 
 to 9
 (5000 / 512). Later on the code, n_sectors is used the calculate the size of 
 the
 buffer required but 9 * 512 is too small and so when the request is process it
 ends up writing past the end of the buffer and I guest this triggers the glibc
 error.

We need to validate that (qiov->size % BDRV_SECTOR_SIZE) == 0 and
reject invalid requests.
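
A minimal sketch of that check (standalone here for clarity; in hw/virtio-blk.c
it would run before the byte count is turned into a sector count, and the
helper name is made up):

#include <stdbool.h>
#include <stddef.h>

#define BDRV_SECTOR_SIZE 512    /* QEMU's block layer already defines this */

/* Reject any request whose payload is not a whole number of sectors; the
 * caller would then complete the request with VIRTIO_BLK_S_IOERR instead
 * of submitting it to the block layer. */
static bool virtio_blk_size_ok(size_t qiov_size)
{
    return (qiov_size % BDRV_SECTOR_SIZE) == 0;
}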

 Is there a requirement for virtio-blk guest drivers that all i/o requests are
 sized in multiples of 512 bytes?

There is no strict requirement according to the virtio specification,
but maybe there should be:

http://ozlabs.org/~rusty/virtio-spec/virtio-spec-0.8.9.pdf

Stefan


Re: [PATCH 1/2] rbd: use the higher level librbd instead of just librados

2011-03-28 Thread Stefan Hajnoczi
On Thu, Mar 24, 2011 at 03:51:36PM -0700, Josh Durgin wrote:
You have sent a malformed patch.  Please send patches that follow the
guidelines at http://wiki.qemu.org/Contribute/SubmitAPatch and test that
your mail client is not line wrapping or mangling whitespace.

Stefan


Re: qemu-kvm crash with

2011-03-25 Thread Stefan Hajnoczi
On Thu, Mar 24, 2011 at 1:38 PM, Conor Murphy
conor_murphy_v...@hotmail.com wrote:
 #4  _int_free (av=value optimized out, p=0x7fa24c0009f0, have_lock=0) at
 malloc.c:4795
 #5  0x004a18fe in qemu_vfree (ptr=0x7fa24c000a00) at oslib-posix.c:76
 #6  0x0045af3d in handle_aiocb_rw (aiocb=0x7fa2dc034cd0) at
 posix-aio-compat.c:301

I don't see a way for a double-free to occur so I think something has
overwritten the memory preceeding the allocated buffer.

In gdb you could inspect the aiocb structure to look at its aio_iov[],
aio_niov, and aio_nbytes fields.  They might be invalid or corrupted
somehow.

You could also dump out the memory before 0x7fa24c000a00, specifically
0x7fa24c0009f0, to see if you notice any pattern or printable
characters that give a clue as to what has corrupted the memory here.
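
For example:

(gdb) frame 6
(gdb) print *aiocb
(gdb) print aiocb->aio_niov
(gdb) print aiocb->aio_nbytes
(gdb) print *aiocb->aio_iov@aiocb->aio_niov
(gdb) x/32xb (char *)0x7fa24c000a00 - 32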

Are you running qemu-kvm.git/master?

Stefan


Re: KVM lock contention on 48 core AMD machine

2011-03-18 Thread Stefan Hajnoczi
On Fri, Mar 18, 2011 at 12:02 PM, Ben Nagy b...@iagu.net wrote:
 KVM commandline (using libvirt):
 LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/bin:/usr/sbin:/sbin:/bin
 QEMU_AUDIO_DRV=none /usr/local/bin/kvm-snapshot -S -M pc-0.14
 -enable-kvm -m 1024 -smp 1,sockets=1,cores=1,threads=1 -name fb-0
 -uuid de59229b-eb06-9ecc-758e-d20bc5ddc291 -nodefconfig -nodefaults
 -chardev 
 socket,id=charmonitor,path=/var/lib/libvirt/qemu/fb-0.monitor,server,nowait
 -mon chardev=charmonitor,id=monitor,mode=readline -rtc base=localtime
 -no-acpi -boot cd -drive
 file=/mnt/big/bigfiles/kvm_disks/eax/fb-0.ovl,if=none,id=drive-ide0-0-0,format=qcow2
 -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0
 -drive if=none,media=cdrom,id=drive-ide0-0-1,readonly=on,format=raw
 -device ide-drive,bus=ide.0,unit=1,drive=drive-ide0-0-1,id=ide0-0-1
 -netdev tap,fd=17,id=hostnet0 -device
 virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:d9:09:ef,bus=pci.0,addr=0x3
 -usb -device usb-tablet,id=input0 -vnc 127.0.0.1:0 -k en-us -vga
 cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4

Please try without -usb -device usb-tablet,id=input0.  That is known
to cause increased CPU utilization.  I notice that idr_lock is either
Infiniband or POSIX timers related:
drivers/infiniband/core/sa_query.c
kernel/posix-timers.c

-usb sets up a 1000 Hz timer for each VM.

Stefan


Re: Virtual SCSI disks hangs on heavy IO

2011-03-18 Thread Stefan Hajnoczi
On Fri, Mar 18, 2011 at 4:06 PM, Guido Winkelmann
guido-k...@thisisnotatest.de wrote:
 Am Wednesday 16 March 2011 schrieb Stefan Hajnoczi:
 On Tue, Mar 15, 2011 at 1:20 PM, Guido Winkelmann

 guido-k...@thisisnotatest.de wrote:
  Am Tuesday 15 March 2011 schrieben Sie:
  On Mon, Mar 14, 2011 at 10:57 PM, Guido Winkelmann
 
  guido-k...@thisisnotatest.de wrote:
   On Monday 14 March 2011 20:32:23 Stefan Hajnoczi wrote:
   On Mon, Mar 14, 2011 at 6:05 PM, Guido Winkelmann
  
   guido-k...@thisisnotatest.de wrote:
Does anybody have an idea what might cause this or what might be
done about it?
  
   The lsi_scsi emulation code is incomplete.  It does not handle some
   situations like the ORDERED commands or message 0x0c.
  
   There is a patch to address the message 0xc issue, it has not been
   applied to qemu.git or qemu-kvm.git yet:
   http://patchwork.ozlabs.org/patch/63926/
  
   Basically there is no one actively maintaining or reviewing patches
   for the lsi53c895a SCSI controller.
  
   Does that mean that using the SCSI transport for virtual disks is
   officially unsupported or deprecated or that it should be?
 
  The LSI SCSI emulation in particular has not seen much attention.  As
  for the wider SCSI emulation there has been work over the past few
  months so it's alive and being used.
 
  Well, I cannot find any other HBAs than LSI when I run qemu -device ? -
  or at least nothing I would recognize as a SCSI HBA. As far as I can
  see, that pretty much means I cannot use SCSI disks in KVM at all,
  unless I'm prepared to live with the problems described earlier...

 The LSI controller is the only available PCI SCSI HBA.  Are you able
 to try the patch I linked?
 http://patchwork.ozlabs.org/patch/63926/

 I haven't tried the patch yet. At work, it was decided that we are not going 
 to
 use a manually patch version of qemu-kvm unless absolutely necessary, and at
 home, I'm unlikely to ever want to virtualize an OS without virtio-drivers.

 I can still try the patch on my home machine, if you want me to.

Don't worry about it if you're going with virtio-blk already.

Stefan


Re: Virtual SCSI disks hangs on heavy IO

2011-03-15 Thread Stefan Hajnoczi
On Mon, Mar 14, 2011 at 10:57 PM, Guido Winkelmann
guido-k...@thisisnotatest.de wrote:
 On Monday 14 March 2011 20:32:23 Stefan Hajnoczi wrote:
 On Mon, Mar 14, 2011 at 6:05 PM, Guido Winkelmann

 guido-k...@thisisnotatest.de wrote:
  Does anybody have an idea what might cause this or what might be done
  about it?

 The lsi_scsi emulation code is incomplete.  It does not handle some
 situations like the ORDERED commands or message 0x0c.

 There is a patch to address the message 0xc issue, it has not been
 applied to qemu.git or qemu-kvm.git yet:
 http://patchwork.ozlabs.org/patch/63926/

 Basically there is no one actively maintaining or reviewing patches
 for the lsi53c895a SCSI controller.

 Does that mean that using the SCSI transport for virtual disks is officially
 unsupported or deprecated or that it should be?

The LSI SCSI emulation in particular has not seen much attention.  As
for the wider SCSI emulation there has been work over the past few
months so it's alive and being used.

 Are things better with the IDE driver?

IDE is commonly used for compatibility with guests that do not have
virtio-blk drivers.  It should work fine although performance is poor
due to the IDE interface.

 virtio-blk works very will with Linux guests.  Is there a reason you
 need to use SCSI emulation instead of virtio-blk?

 I can probably use virtio-blk most of the time, I was just hoping to be able
 to virtualize a wider array of operating systems, like the *BSDs,
 (Open)Solaris, Windows, or even just some linux distributions whose installers
 don't anticipate KVM and thus don't support virtio-anything.

Windows virtio-blk drivers are available and should be used:
http://www.linux-kvm.org/page/WindowsGuestDrivers/Download_Drivers

BSD and Solaris don't ship with virtio-blk AFAIK.

Stefan


Re: Virtual SCSI disks hangs on heavy IO

2011-03-15 Thread Stefan Hajnoczi
On Tue, Mar 15, 2011 at 7:47 AM, Alexander Graf ag...@suse.de wrote:

 On 15.03.2011, at 08:09, Stefan Hajnoczi wrote:

 On Mon, Mar 14, 2011 at 10:57 PM, Guido Winkelmann
 guido-k...@thisisnotatest.de wrote:
 On Monday 14 March 2011 20:32:23 Stefan Hajnoczi wrote:
 On Mon, Mar 14, 2011 at 6:05 PM, Guido Winkelmann

 guido-k...@thisisnotatest.de wrote:
 Does anybody have an idea what might cause this or what might be done
 about it?

 The lsi_scsi emulation code is incomplete.  It does not handle some
 situations like the ORDERED commands or message 0x0c.

 There is a patch to address the message 0xc issue, it has not been
 applied to qemu.git or qemu-kvm.git yet:
 http://patchwork.ozlabs.org/patch/63926/

 Basically there is no one actively maintaining or reviewing patches
 for the lsi53c895a SCSI controller.

 Does that mean that using the SCSI transport for virtual disks is officially
 unsupported or deprecated or that it should be?

 The LSI SCSI emulation in particular has not seen much attention.  As
 for the wider SCSI emulation there has been work over the past few
 months so it's alive and being used.

 Are things better with the IDE driver?

 IDE is commonly used for compatibility with guests that do not have
 virtio-blk drivers.  It should work fine although performance is poor
 due to the IDE interface.

 virtio-blk works very well with Linux guests.  Is there a reason you
 need to use SCSI emulation instead of virtio-blk?

 I can probably use virtio-blk most of the time, I was just hoping to be able
 to virtualize a wider array of operating systems, like the *BSDs,
 (Open)Solaris, Windows, or even just some linux distributions whose 
 installers
 don't anticipate KVM and thus don't support virtio-anything.

 Windows virtio-blk drivers are available and should be used:
 http://www.linux-kvm.org/page/WindowsGuestDrivers/Download_Drivers

 BSD and Solaris don't ship with virtio-blk AFAIK.

 This is pretty much the gap that AHCI is trying to fill. It's a 
 well-supported HBA that pretty much every OS supports, but is still simple 
 enough to implement. Unfortunately, 0.14 ships without BIOS support for it, 
 so you can't boot off an AHCI disk yet. But as of 0.15, AHCI is pretty much 
 the adapter of choice for your use case.

 Please keep in mind that I didn't get FreeBSD rolling with AHCI emulation 
 yet. OpenBSD works just fine.

I think one missing AHCI feature was legacy PATA mode?  Perhaps that
is a good GSoC project if you're willing to mentor it, Alex.  I'm
thinking that with complete AHCI and legacy mode it would be a good
choice as the default non-virtio-blk disk interface.
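
Roughly what attaching a disk through the AHCI device looks like on the
command line today (this is from memory, so treat the exact device and
property names as an assumption and check against your QEMU version):

  qemu-kvm ... \
    -drive file=disk.img,if=none,id=disk0 \
    -device ahci,id=ahci0 \
    -device ide-drive,drive=disk0,bus=ahci0.0

A legacy mode would let old guests use the same disk without any of this.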

Stefan


Re: Virtual SCSI disks hangs on heavy IO

2011-03-15 Thread Stefan Hajnoczi
On Tue, Mar 15, 2011 at 9:16 AM, Alexander Graf ag...@suse.de wrote:

 On 15.03.2011, at 10:03, Stefan Hajnoczi wrote:

 On Tue, Mar 15, 2011 at 7:47 AM, Alexander Graf ag...@suse.de wrote:

 On 15.03.2011, at 08:09, Stefan Hajnoczi wrote:

 On Mon, Mar 14, 2011 at 10:57 PM, Guido Winkelmann
 guido-k...@thisisnotatest.de wrote:
 On Monday 14 March 2011 20:32:23 Stefan Hajnoczi wrote:
 On Mon, Mar 14, 2011 at 6:05 PM, Guido Winkelmann

 guido-k...@thisisnotatest.de wrote:
 Does anybody have an idea what might cause this or what might be done
 about it?

 The lsi_scsi emulation code is incomplete.  It does not handle some
 situations like the ORDERED commands or message 0x0c.

 There is a patch to address the message 0xc issue, it has not been
 applied to qemu.git or qemu-kvm.git yet:
 http://patchwork.ozlabs.org/patch/63926/

 Basically there is no one actively maintaining or reviewing patches
 for the lsi53c895a SCSI controller.

 Does that mean that using the SCSI transport for virtual disks is 
 officially
 unsupported or deprecated or that it should be?

 The LSI SCSI emulation in particular has not seen much attention.  As
 for the wider SCSI emulation there has been work over the past few
 months so it's alive and being used.

 Are things better with the IDE driver?

 IDE is commonly used for compatibility with guests that do not have
 virtio-blk drivers.  It should work fine although performance is poor
 due to the IDE interface.

 virtio-blk works very well with Linux guests.  Is there a reason you
 need to use SCSI emulation instead of virtio-blk?

 I can probably use virtio-blk most of the time, I was just hoping to be 
 able
 to virtualize a wider array of operating systems, like the *BSDs,
 (Open)Solaris, Windows, or even just some linux distributions whose 
 installers
 don't anticipate KVM and thus don't support virtio-anything.

 Windows virtio-blk drivers are available and should be used:
 http://www.linux-kvm.org/page/WindowsGuestDrivers/Download_Drivers

 BSD and Solaris don't ship with virtio-blk AFAIK.

 This is pretty much the gap that AHCI is trying to fill. It's a 
 well-supported HBA that pretty much every OS supports, but is still simple 
 enough to implement. Unfortunately, 0.14 ships without BIOS support for it, 
 so you can't boot off an AHCI disk yet. But as of 0.15, AHCI is pretty much 
 the adapter of choice for your use case.

 Please keep in mind that I didn't get FreeBSD rolling with AHCI emulation 
 yet. OpenBSD works just fine.

 I think one missing AHCI feature was legacy PATA mode?  Perhaps that
 is a good GSoC project if you're willing to mentor it, Alex.  I'm
 thinking that with complete AHCI and legacy mode it would be a good
 choice as the default non-virtio-blk disk interface.

 Or to be more precise: There are two different dimensions

 SATA / PATA
 IDE / AHCI

 The first is the link model - the type of connection the disk/cd-rom is 
 connected to the hba with. The second is the OS interface.

 AHCI can handle SATA and PATA devices in AHCI mode. IIUC both link models 
 also work in IDE (legacy) mode.
 The ICH-HBA can be BIOS configured to either operate in AHCI mode or in 
 legacy mode, but not both at the same time. You can split between channels 
 though. So you can have channels 1,2 operate through legacy while channels 
 3,4 go through AHCI. The same disk still can only be accessed either through 
 IDE _or_ AHCI.

 Since we already have properly working PIIX3 IDE emulation code, I don't see 
 the point in implementing ICH-7 AHCI IDE legacy compatibility mode.

 There are SATA controllers out there that apparently can expose the same disk 
 through the legacy IDE interface and a faster SATA interface. I'm not sure 
 any of those is AHCI compatible - the spec doesn't forbid you to do this.

Okay, I was thinking that having just the AHCI device, configurable by
the guest (BIOS?) to work with legacy guests, would be nicer than having
to switch QEMU command-line options on the host.  But then we don't
have non-volatile storage for the BIOS AFAIK, so it currently doesn't
make much difference whether AHCI supports IDE emulation or whether you
explicitly use the piix IDE emulation.

Stefan


Re: Virtual SCSI disks hangs on heavy IO

2011-03-14 Thread Stefan Hajnoczi
On Mon, Mar 14, 2011 at 6:05 PM, Guido Winkelmann
guido-k...@thisisnotatest.de wrote:
 Does anybody have an idea what might cause this or what might be done about 
 it?

The lsi_scsi emulation code is incomplete.  It does not handle some
situations like the ORDERED commands or message 0x0c.

There is a patch to address the message 0xc issue, it has not been
applied to qemu.git or qemu-kvm.git yet:
http://patchwork.ozlabs.org/patch/63926/

Basically there is no one actively maintaining or reviewing patches
for the lsi53c895a SCSI controller.

virtio-blk works very well with Linux guests.  Is there a reason you
need to use SCSI emulation instead of virtio-blk?

Stefan


Re: [PATCH] kvm: ppc: Fix breakage of kvm_arch_pre_run/process_irqchip_events

2011-03-10 Thread Stefan Hajnoczi
On Fri, Mar 11, 2011 at 5:55 AM, Alexander Graf ag...@suse.de wrote:

 On 17.02.2011, at 22:01, Jan Kiszka wrote:

 On 2011-02-07 12:19, Jan Kiszka wrote:
 We do not check them, and the only arch with non-empty implementations
 always returns 0 (this is also true for qemu-kvm).

 Signed-off-by: Jan Kiszka jan.kis...@siemens.com
 CC: Alexander Graf ag...@suse.de
 ---
 kvm.h              |    5 ++---
 target-i386/kvm.c  |    8 ++--
 target-ppc/kvm.c   |    6 ++
 target-s390x/kvm.c |    6 ++
 4 files changed, 8 insertions(+), 17 deletions(-)


 ...

 diff --git a/target-ppc/kvm.c b/target-ppc/kvm.c
 index 93ecc57..bd4012a 100644
 --- a/target-ppc/kvm.c
 +++ b/target-ppc/kvm.c
 @@ -256,14 +256,12 @@ int kvm_arch_pre_run(CPUState *env, struct kvm_run 
 *run)
     return 0;
 }

 -int kvm_arch_post_run(CPUState *env, struct kvm_run *run)
 +void kvm_arch_post_run(CPUState *env, struct kvm_run *run)
 {
 -    return 0;
 }

 -int kvm_arch_process_irqchip_events(CPUState *env)
 +void kvm_arch_process_irqchip_events(CPUState *env)
 {
 -    return 0;
 }

 Oops. Do we already have a buildbot for KVM-enabled PPC (and s390)
 targets somewhere?

 Just before leaving for vacation I prepared a machine for each and gave 
 stefan access to them. Looks like they're not officially running though - 
 will try to look at this asap.

They are in the process of being added to the buildbot by Daniel
Gollub.  However, the ppc box is unable to build qemu.git because it
hits ENOMEM while compiling.  I doubled swap size but that didn't fix
the issue so I need to investigate more.  At least s390 should be good
to go soon and I will send an update when it is up and running.
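
(For reference, the swap was grown with a plain swap file, roughly like
this - the size here is made up:

  dd if=/dev/zero of=/swapfile bs=1M count=2048
  mkswap /swapfile
  swapon /swapfile

but the build still runs out of memory, so something else is going on.)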

Stefan


Re: [PATCH] vnc: threaded server depends on io-thread

2011-03-09 Thread Stefan Hajnoczi
On Wed, Mar 9, 2011 at 10:57 AM, Corentin Chary
corentin.ch...@gmail.com wrote:
 The threaded VNC servers messed up with QEMU fd handlers without
 any kind of locking, and that can cause some nasty race conditions.

 The IO-Thread provides appropriate locking primitives to avoid that.
 This patch makes CONFIG_VNC_THREAD depend on CONFIG_IO_THREAD,
 and adds lock and unlock calls around the two faulty calls.

 qemu-kvm currently doesn't compile with --enable-io-thread. Is there an easy
 fix for this?

 If IO Thread is not available, I'm afraid that --disable-vnc-thread is
 the only fix.
 Or, you can try to define some global mutex acting like iothread
 locks, but that doesn't sound like an easy fix.

Jan or Marcelo can help here but qemu-kvm has an iothread equivalent
built in by default.  It should be possible to use that.
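
If you want to try the threaded VNC server on upstream qemu.git in the
meantime, the build goes roughly like this (option names as I remember
them, so double-check ./configure --help):

  ./configure --enable-io-thread --enable-vnc-thread
  make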

Stefan


Re: default elevator=noop for virtio block devices?

2011-03-09 Thread Stefan Hajnoczi
On Wed, Mar 9, 2011 at 10:01 AM, Avi Kivity a...@redhat.com wrote:
 On 03/09/2011 11:42 AM, Harald Dunkel wrote:

 Hi folks,

 would it make sense to make elevator=noop the default
 for virtio block devices? Or would you recommend to
 set this on the kvm server instead?


 I think leaving the defaults is best.  The elevator on the guest serves to
 schedule I/O among processes in the guest, and the elevator on the host
 partitions I/O among the guests.

It depends on the workload.  Khoa has seen cases where CFQ does not
scale with multi-threaded workloads and deadline is preferred.  But
it's not one-size-fits-all, it depends on your workload and requires
benchmarking.
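
If you do want to benchmark the alternatives, the guest's scheduler can be
switched at runtime without touching the kernel command line (vda is just
an example device name):

  cat /sys/block/vda/queue/scheduler      # current scheduler shown in brackets
  echo deadline > /sys/block/vda/queue/scheduler

elevator=noop on the guest kernel command line is the boot-time equivalent.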

Stefan


Re: [Qemu-devel] KVM call minutes for Mar 8

2011-03-08 Thread Stefan Hajnoczi
On Tue, Mar 8, 2011 at 4:00 PM, Anthony Liguori anth...@codemonkey.ws wrote:
 http://wiki.qemu.org/Features/QAPI/VirtAgent

That page does not exist.  I think you meant this one:
http://wiki.qemu.org/Features/QAPI/GuestAgent

Stefan


Re: Degraded performance with Windows 2008 R2 with applications

2011-03-07 Thread Stefan Hajnoczi
On Sun, Mar 6, 2011 at 10:25 PM, Mathias Klette mkle...@gmail.com wrote:
 I've tested with iozone to compare IO with a linux guest and also to
 verify changes made to improve situation - but nothing really helped.

 TESTS with iozone -s 4G -r 256k -c -e:

Please use the -I option to bypass the page cache, otherwise buffered
I/O will be used and requests may be satisfied from memory rather than
actually accessing the disk.
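
So something like this, keeping your other parameters the same:

  iozone -s 4G -r 256k -c -e -I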

What is the qemu-kvm command-line (ps aux | grep kvm)?

Are you using virtio-blk and the Windows guest drivers from here:

http://www.linux-kvm.org/page/WindowsGuestDrivers/Download_Drivers

Stefan


Re: problem about blocked monitor when disk image on NFS can not be reached.

2011-03-02 Thread Stefan Hajnoczi
On Wed, Mar 2, 2011 at 10:39 AM, ya su suya94...@gmail.com wrote:
 io_thread bt as the following:
 #0  0x7f3086eaa034 in __lll_lock_wait () from /lib64/libpthread.so.0
 #1  0x7f3086ea5345 in _L_lock_870 () from /lib64/libpthread.so.0
 #2  0x7f3086ea5217 in pthread_mutex_lock () from /lib64/libpthread.so.0
 #3  0x00436018 in kvm_mutex_lock () at
 /root/rpmbuild/BUILD/qemu-kvm-0.14/qemu-kvm.c:1730
 #4  qemu_mutex_lock_iothread () at
 /root/rpmbuild/BUILD/qemu-kvm-0.14/qemu-kvm.c:1744
 #5  0x0041ca67 in main_loop_wait (nonblocking=<value optimized out>)
    at /root/rpmbuild/BUILD/qemu-kvm-0.14/vl.c:1377
 #6  0x004363e7 in kvm_main_loop () at
 /root/rpmbuild/BUILD/qemu-kvm-0.14/qemu-kvm.c:1589
 #7  0x0041dc3a in main_loop (argc=<value optimized out>,
 argv=<value optimized out>,
    envp=<value optimized out>) at /root/rpmbuild/BUILD/qemu-kvm-0.14/vl.c:1429
 #8  main (argc=<value optimized out>, argv=<value optimized out>,
 envp=<value optimized out>)
    at /root/rpmbuild/BUILD/qemu-kvm-0.14/vl.c:3201

 cpu thread as the following:
 #0  0x7f3084dff093 in select () from /lib64/libc.so.6
 #1  0x004453ea in qemu_aio_wait () at aio.c:193
 #2  0x00444175 in bdrv_write_em (bs=0x1ec3090, sector_num=2009871,
    buf=0x7f3087532800
 F\b\200u\022\366F$\004u\fPV\350\226\367\377\377\003Ft\353\fPV\350\212\367\377\377\353\003\213Ft^]\302\b,
 nb_sectors=16) at block.c:2577
 #3  0x0059ca13 in ide_sector_write (s=0x215f508) at
 /root/rpmbuild/BUILD/qemu-kvm-0.14/hw/ide/core.c:574
 #4  0x00438ced in kvm_handle_io (env=0x202ef60) at
 /root/rpmbuild/BUILD/qemu-kvm-0.14/kvm-all.c:821
 #5  kvm_run (env=0x202ef60) at 
 /root/rpmbuild/BUILD/qemu-kvm-0.14/qemu-kvm.c:617
 #6  0x00438e09 in kvm_cpu_exec (env=<value optimized out>)
    at /root/rpmbuild/BUILD/qemu-kvm-0.14/qemu-kvm.c:1233
 #7  0x0043a0f7 in kvm_main_loop_cpu (_env=0x202ef60)
    at /root/rpmbuild/BUILD/qemu-kvm-0.14/qemu-kvm.c:1419
 #8  ap_main_loop (_env=0x202ef60) at
 /root/rpmbuild/BUILD/qemu-kvm-0.14/qemu-kvm.c:1466
 #9  0x7f3086ea37e1 in start_thread () from /lib64/libpthread.so.0
 #10 0x7f3084e0653d in clone () from /lib64/libc.so.6

 aio_thread bt as the following:
 #0  0x7f3086eaae83 in pwrite64 () from /lib64/libpthread.so.0
 #1  0x00447501 in handle_aiocb_rw_linear (aiocb=0x21cff10,
    buf=0x7f3087532800
 F\b\200u\022\366F$\004u\fPV\350\226\367\377\377\003Ft\353\fPV\350\212\367\377\377\353\003\213Ft^]\302\b)
 at posix-aio-compat.c:212
 #2  0x00447d48 in handle_aiocb_rw (unused=<value optimized out>)
    at posix-aio-compat.c:247
 #3  aio_thread (unused=<value optimized out>) at posix-aio-compat.c:341
 #4  0x7f3086ea37e1 in start_thread () from /lib64/libpthread.so.0
 #5  0x7f3084e0653d in clone () from /lib64/libc.so.6

 I think the io_thread is blocked by the cpu thread, which takes the
 qemu_mutex first; the cpu thread is waiting for the aio_thread's result
 via the qemu_aio_wait function.  The aio_thread spends a long time in
 pwrite64 (about 5-10s) and then returns an error (it seems like a
 non-blocking timeout call).  After that, the io thread gets a chance to
 receive monitor input, so the monitor appears blocked frequently.  In
 this situation, if I stop the vm, the monitor responds faster.

 The problem is caused by the unavailability of the block layer.  The
 block layer processes the io error in the normal way: it reports the
 error to the ide device, and the error is handled in ide_sector_write.
 The root cause is that the monitor's input and the io operation (the
 pwrite function) must execute serialized (under the qemu_mutex
 semaphore), so a long blocking pwrite will hinder monitor input.

 As stefan says, it seems difficult to take monitor input out of that
 protection, so currently I will stop the vm if the disk image can not
 be reached.

If you switch to -drive if=virtio instead of IDE then the problem
should be greatly reduced.  Virtio-blk uses aio instead of synchronous
calls, which means that the vcpu thread does not run qemu_aio_wait().

Kevin and I have been looking into the limitations imposed by
synchronous calls.  Today there is unfortunately synchronous code in
QEMU and we can hit these NFS hang situations.  qemu_aio_wait() runs a
nested event loop that does a subset of what the full event loop does.
 This is why the monitor does not respond.

If all code was asynchronous then only a top-level event loop would be
necessary and the monitor would continue to function.

In the immediate term I suggest using virtio-blk instead of IDE.
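
As a sketch, the change is just on the -drive option (the path here is made
up), plus virtio-blk drivers in the guest, which recent Linux kernels
already include:

  qemu-kvm ... -drive file=/nfs/images/disk.img,if=virtio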

Stefan


Re: Poor disk IO performance from Windows 2003 guest

2011-03-02 Thread Stefan Hajnoczi
On Wed, Mar 2, 2011 at 10:30 AM, Kevin Clark kevin.cl...@csoft.co.uk wrote:
 The results are much better, with 64MB writes on the system drive coming in 
 at 39MB/s and reads 310MB/s.  The second drive gives me 94MB/s for writes and 
 777MB/s for reads for a 64MB file.  Again, that's wildy different results for 
 two storage devices in the same guest, and it needs further investigation, 
 but now the system is usable and I need to move on.

Good to hear that you're seeing acceptable performance now.

Stefan


Re: problem about blocked monitor when disk image on NFS can not be reached.

2011-03-01 Thread Stefan Hajnoczi
On Tue, Mar 1, 2011 at 5:01 AM, ya su suya94...@gmail.com wrote:
    kvm starts with the disk image on an nfs server; when the nfs server
 can not be reached, the monitor is blocked.  I changed the io_thread to
 the SCHED_RR policy, but it still stalls while waiting for the disk
 read/write timeout.

There are some synchronous disk image reads that can put qemu-kvm to
sleep until NFS responds or errors.  For example, when starting
hw/virtio-blk.c calls bdrv_guess_geometry() which may invoke
bdrv_read().

Once the VM is running and you're using virtio-blk then disk I/O
should be asynchronous.  There are some synchronous cases to do with
migration, snapshotting, etc where we wait for outstanding aio
requests.  Again this can block qemu-kvm.

So in short, there's no easy way to avoid blocking the VM in all cases
today.  You should find, however, that normal read/write operation to
a running VM does not cause qemu-kvm to sleep.

Stefan


Re: Poor disk IO performance from Windows 2003 guest

2011-03-01 Thread Stefan Hajnoczi
On Tue, Mar 1, 2011 at 10:23 AM, Kevin Clark kevin.cl...@csoft.co.uk wrote:
 Any thoughts/ideas?

There are a lot of variables here.  Are you using virtio-blk devices
and Windows guest drivers?  Are you using hardware RAID5 on the NFS
server?  Could it be a network issue (contention during benchmark
runs)?

I'd start by benchmarking NFS on the host without running a virtual
machine.  Make sure you're getting acceptable performance and
repeatable results there first.
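
A quick sanity check directly on the host, bypassing the page cache so you
measure the NFS server and the network rather than RAM (file name and sizes
are only an example):

  dd if=/dev/zero of=/mnt/nfs/testfile bs=1M count=1024 oflag=direct
  dd if=/mnt/nfs/testfile of=/dev/null bs=1M iflag=direct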

Stefan


Re: problem about blocked monitor when disk image on NFS can not be reached.

2011-03-01 Thread Stefan Hajnoczi
On Tue, Mar 1, 2011 at 12:39 PM, ya su suya94...@gmail.com wrote:
    how about moving kvm_handle_io/handle_mmio from the kvm_run function
 into kvm_main_loop?  These are io operations, so this would remove the
 qemu_mutex contention between the 2 threads.  Is this a reasonable
 thought?

    In order to keep the monitor responding to the user more quickly in
 this situation, an easier way is to take monitor io out of qemu_mutex
 protection.  This includes the vnc/serial/telnet io related to the
 monitor; as this io will not affect the running of the vm itself, it
 does not need such strict protection.

The qemu_mutex protects all QEMU global state.  The monitor does some
I/O and parsing which is not necessarily global state but once it
begins actually performing the command you sent, access to global
state will be required (pretty much any monitor command will operate
on global state).

I think there are two options for handling NFS hangs:
1. Ensure that QEMU is never put to sleep by NFS for disk images.  The
guest continues executing, may time out and notice that storage is
unavailable.
2. Pause the VM but keep the monitor running if a timeout error
occurs.  Not sure if there is a timeout from NFS that we can detect.

For I/O errors (e.g. running out of disk space on the host) there is a
configurable policy.  You can choose whether to return an error to the
guest or to pause the VM.  I think we should treat NFS hangs as an
extension to this and as a block layer problem rather than an io
thread problem.
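
The policy is set per-drive on the command line, roughly like this (option
names from memory, so please double-check against your QEMU version):

  -drive file=/nfs/disk.img,if=virtio,werror=stop,rerror=stop

werror=report/rerror=report would instead hand the error straight to the
guest.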

Can you get backtraces when KVM hangs (gdb command: thread apply all
bt)?  It would be interesting to see some of the blocking cases that
you are hitting.
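
Something along these lines against the running process is enough (debug
symbols for qemu-kvm make the traces much more useful):

  gdb -p $(pidof qemu-kvm)
  (gdb) thread apply all bt
  (gdb) detach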

Stefan


Re: [PATCH v3 uq/master 05/22] add win32 qemu-thread implementation

2011-02-28 Thread Stefan Hajnoczi
On Mon, Feb 28, 2011 at 9:10 AM, Paolo Bonzini pbonz...@redhat.com wrote:
 +static unsigned __stdcall win32_start_routine(void *arg)
 +{
 +    struct QemuThreadData data = *(struct QemuThreadData *) arg;
 +    QemuThread *thread = data.thread;
 +
 +    free(arg);

qemu_free(arg);

Stefan


Re: RFH: Windos 7 64 + VirtIO stalls during installation / crashed with qcow2

2011-02-17 Thread Stefan Hajnoczi
On Thu, Feb 17, 2011 at 10:44 AM, Philipp Hahn h...@univention.de wrote:
 Hello,

 I tried to install Windows 7 Professional 64 Bit with VirtIO 1.16 on a
 Debian-based system using AMD64 CPUs. During the install, the system froze
 (progress bar didn't advance) and kvm was slowly eating CPU cycles on the host.

 $ dpkg-query -W libvirt0 qemu-kvm linux-image-`uname -r`
 libvirt0        0.8.7-1.48.201102031226
 linux-image-2.6.32-ucs37-amd64  2.6.32-30.37.201102031101
 qemu-kvm        0.12.4+dfsg-1~bpo50+1.3.201010011432

 It was started using virsh, which generated the following command line:
 /usr/bin/kvm.bin -S \
  -M pc-0.12 \
  -enable-kvm \
  -m 768 \
  -smp 1,sockets=1,cores=1,threads=1 \
  -name 7-Professional_amd64 \
  -uuid 89c82cf9-0797-3da4-62f4-8767e4f59b7e \
  -nodefaults \
  -chardev
 socket,id=monitor,path=/var/lib/libvirt/qemu/7-Professional_amd64.monitor,server,nowait
 \
  -mon chardev=monitor,mode=readline \
  -rtc base=utc \
  -boot dc \
  -drive
 file=/var/lib/libvirt/images/7-Professional_amd64.qcow2,if=none,id=drive-virtio-disk0,boot=on,format=qcow2
 -device
 virtio-blk-pci,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0 \
  -drive
 file=/mnt/omar/vmwares/kvm/iso/windows/win_7_pro_64bit.iso,if=none,media=cdrom,id=drive-ide0-0-1,readonly=on,format=raw
 -device ide-drive,bus=ide.0,unit=1,drive=drive-ide0-0-1,id=ide0-0-1 \
  -drive
 file=/mnt/omar/vmwares/kvm/iso/others/virtio-win-1.1.16.iso,if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw
 -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 \
  -device
 virtio-net-pci,vlan=0,id=net0,mac=52:54:00:f7:da:b5,bus=pci.0,addr=0x3
 \
  -net tap,fd=20,vlan=0,name=hostnet0 \
  -usb \
  -device usb-tablet,id=input0 \
  -vnc 0.0.0.0:0 \
  -k de \
  -vga cirrus \
  -incoming exec:cat \
  -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4 \
  -no-kvm-irqchip

 The -no-kvm-irqchip option was added because we experienced shutdown/resume
 problems with other machines, which either received no interrupts anymore or
 were caught in their interrupt service routine, never being able to
 acknowledge the interrupts. Adding that option solved that problem, but might
 be causing other problems now.

 Using gdb I was able to track down Windows hanging in the following routine,
 which looks like some spin-lock / semaphore acquire() implementation:
 (gdb) x/20i 0xf8000c485a80
 0xf8000c485a80:     mov    %rbx,0x8(%rsp)
 0xf8000c485a85:     push   %rdi
 0xf8000c485a86:     sub    $0x20,%rsp
 0xf8000c485a8a:     mov    %rcx,%rdi
 0xf8000c485a8d:     xor    %ebx,%ebx
 0xf8000c485a8f:     nop
 0xf8000c485a90:     inc    %ebx
 0xf8000c485a92:     test   %ebx,0x274834(%rip)        # 0xf8000c6fa2cc
 0xf8000c485a98:     je     0xf8000c48adad
 0xf8000c485a9e:     pause
 0xf8000c485aa0:     mov    (%rdi),%rcx
 0xf8000c485aa3:     test   %rcx,%rcx
 0xf8000c485aa6:     jne    0xf8000c485a90
 0xf8000c485aa8:     lock btsq $0x0,(%rdi)
 0xf8000c485aae:     jb     0xf8000c485a90
 0xf8000c485ab0:     mov    %ebx,%eax
 0xf8000c485ab2:     mov    0x30(%rsp),%rbx
 0xf8000c485ab7:     add    $0x20,%rsp
 0xf8000c485abb:     pop    %rdi
 0xf8000c485abc:     retq
 (gdb) x/w 0xf8000c6fa2cc
 0xf8000c6fa2cc:     0x
 (gdb) x/w $rdi
 0xfa800131f600:     0x0001

 Did someone experience similar problems or does somebody know if there was a
 fix for such a problem in newer kvm- or Linux-kernel versions?

 We also encountered problems with some Windows Versions when using VirtIO with
 Qcow2 images, which were using backing-files for copy-on-write: they just
 crashed with a blue-screen. Just changing from the CoW-qcow2 to the
 master-qcow2 file fixed the problem, but this isn't satisfactory, since we
 would like to use the CoW-functionality. Not using VirtIO also fixed the
 problem, but has performance penalties.

Vadim: Any suggestions for extracting more relevant information in these cases?

One option might be to set up the Windows debugger in order to
closely monitor what the guest is doing when it hangs or BSODs:
http://etherboot.org/wiki/sanboot/winnt_iscsi_debug
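
Roughly, the guest needs kernel debugging enabled over a serial port and
QEMU needs to expose that port to WinDbg on the host - the bcdedit syntax
below is from memory, the linked page has the exact steps:

  (in the guest, from an elevated command prompt)
  bcdedit /debug on
  bcdedit /dbgsettings serial debugport:1 baudrate:115200

  (on the host, expose COM1 over TCP)
  qemu-kvm ... -serial tcp:127.0.0.1:4445,server,nowait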

Stefan


Re: RFH: Windos 7 64 + VirtIO stalls during installation / crashed with qcow2

2011-02-17 Thread Stefan Hajnoczi
On Thu, Feb 17, 2011 at 12:45 PM, Vadim Rozenfeld vroze...@redhat.com wrote:
 On Thu, 2011-02-17 at 13:41 +0200, Gleb Natapov wrote:
 On Thu, Feb 17, 2011 at 11:30:25AM +, Stefan Hajnoczi wrote:
  On Thu, Feb 17, 2011 at 10:44 AM, Philipp Hahn h...@univention.de wrote:
   Hello,
  
   I tried to install Windows 7 Professional 64 Bit with VirtIO 1.16 on a
   Debian-based system using AMD64 CPUs. During the install, the system froze
   (progress bar didn't advance) and kvm was slowly eating CPU cycles on the
   host.
  
   $ dpkg-query -W libvirt0 qemu-kvm linux-image-`uname -r`
   libvirt0        0.8.7-1.48.201102031226
   linux-image-2.6.32-ucs37-amd64  2.6.32-30.37.201102031101
   qemu-kvm        0.12.4+dfsg-1~bpo50+1.3.201010011432
  
   It was started using virsh, which generated the following command line:
   /usr/bin/kvm.bin -S \
    -M pc-0.12 \
    -enable-kvm \
    -m 768 \
    -smp 1,sockets=1,cores=1,threads=1 \
    -name 7-Professional_amd64 \
    -uuid 89c82cf9-0797-3da4-62f4-8767e4f59b7e \
    -nodefaults \
    -chardev
   socket,id=monitor,path=/var/lib/libvirt/qemu/7-Professional_amd64.monitor,server,nowait
   \
    -mon chardev=monitor,mode=readline \
    -rtc base=utc \
    -boot dc \
    -drive
   file=/var/lib/libvirt/images/7-Professional_amd64.qcow2,if=none,id=drive-virtio-disk0,boot=on,format=qcow2
   -device
   virtio-blk-pci,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0
\
    -drive
   file=/mnt/omar/vmwares/kvm/iso/windows/win_7_pro_64bit.iso,if=none,media=cdrom,id=drive-ide0-0-1,readonly=on,format=raw
   -device ide-drive,bus=ide.0,unit=1,drive=drive-ide0-0-1,id=ide0-0-1 \
    -drive
   file=/mnt/omar/vmwares/kvm/iso/others/virtio-win-1.1.16.iso,if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw
   -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 \
    -device
   virtio-net-pci,vlan=0,id=net0,mac=52:54:00:f7:da:b5,bus=pci.0,addr=0x3
   \
    -net tap,fd=20,vlan=0,name=hostnet0 \
    -usb \
    -device usb-tablet,id=input0 \
    -vnc 0.0.0.0:0 \
    -k de \
    -vga cirrus \
    -incoming exec:cat \
    -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4 \
    -no-kvm-irqchip
  
   The -no-kvm-irqchip option was added because we experienced shutdown/resume
   problems with other machines, which either received no interrupts anymore
   or were caught in their interrupt service routine, never being able to
   acknowledge the interrupts. Adding that option solved that problem, but
   might be causing other problems now.
  
   Using gdb I was able to track down Windows hanging in the following
   routine, which looks like some spin-lock / semaphore acquire()
   implementation:
   (gdb) x/20i 0xf8000c485a80
   0xf8000c485a80:     mov    %rbx,0x8(%rsp)
   0xf8000c485a85:     push   %rdi
   0xf8000c485a86:     sub    $0x20,%rsp
   0xf8000c485a8a:     mov    %rcx,%rdi
   0xf8000c485a8d:     xor    %ebx,%ebx
   0xf8000c485a8f:     nop
   0xf8000c485a90:     inc    %ebx
   0xf8000c485a92:     test   %ebx,0x274834(%rip)        # 
   0xf8000c6fa2cc
   0xf8000c485a98:     je     0xf8000c48adad
   0xf8000c485a9e:     pause
   0xf8000c485aa0:     mov    (%rdi),%rcx
   0xf8000c485aa3:     test   %rcx,%rcx
   0xf8000c485aa6:     jne    0xf8000c485a90
   0xf8000c485aa8:     lock btsq $0x0,(%rdi)
   0xf8000c485aae:     jb     0xf8000c485a90
   0xf8000c485ab0:     mov    %ebx,%eax
   0xf8000c485ab2:     mov    0x30(%rsp),%rbx
   0xf8000c485ab7:     add    $0x20,%rsp
   0xf8000c485abb:     pop    %rdi
   0xf8000c485abc:     retq
   (gdb) x/w 0xf8000c6fa2cc
   0xf8000c6fa2cc:     0x
   (gdb) x/w $rdi
   0xfa800131f600:     0x0001
  
   Did someone experience similar problems or does somebody know if there 
   was a
   fix for such a problem in newer kvm- or Linux-kernel versions?
  
   We also encountered problems with some Windows Versions when using 
   VirtIO with
   Qcow2 images, which were using backing-files for copy-on-write: they just
   crashed with a blue-screen. Just changing from the CoW-qcow2 to the
   master-qcow2 file fixed the problem, but this isn't satisfactory, 
   since we
   would like to use the CoW-functionality. Not using VirtIO also fixed the
   problem, but has performance penalties.
 
  Vadim: Any suggestions for extracting more relevant information in these 
  cases?
 Debugging installation-phase problems on 64-bit platforms is a very
 complicated thing. If the problem is reproducible on x86 platforms, you
 can try printing messages (RhelDbgPrint function) to localize the
 problem. You will need to adjust RhelDbgLevel in virtio_stor.c and build a
 checked (debug) version of the driver.

Is that even possible - I thought these drivers need to be signed on
recent versions of Windows?

Stefan