[Qemu-block] [PATCH 1/2] ignore bdrv_flush operation when no qcow2 cache item is dirty

2015-07-08 Thread Qingshu Chen
qcow2_cache_flush() writes dirty cache entries to disk and invokes bdrv_flush()
to make the data durable. But even if no cache entry is dirty,
qcow2_cache_flush() still invokes bdrv_flush(). bdrv_flush() ends up calling
fdatasync(), which is an expensive operation. With this patch, bdrv_flush() is
skipped when there are no dirty cache entries. The reason I modified the
return value of qcow2_cache_flush() is that qcow2_co_flush_to_os() needs to
know whether the flush operation was actually performed. The patch follows:

From 23f9f83da4178e8fbb53d2cffe128f5a2d3a239a Mon Sep 17 00:00:00 2001
From: Qingshu Chen 
Date: Wed, 1 Jul 2015 14:45:23 +0800
Subject: [PATCH 1/2] ignore bdrv_flush operation when no qcow2 cache item is
 dirty
Signed-off-by: Qingshu Chen 

---
 block/qcow2-cache.c | 9 ++++++++-
 block/qcow2.c       | 2 ++
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/block/qcow2-cache.c b/block/qcow2-cache.c
index ed92a09..57c0601 100644
--- a/block/qcow2-cache.c
+++ b/block/qcow2-cache.c
@@ -174,6 +174,7 @@ int qcow2_cache_flush(BlockDriverState *bs, Qcow2Cache *c)
     int result = 0;
     int ret;
     int i;
+    int flag = 0;
 
     trace_qcow2_cache_flush(qemu_coroutine_self(), c == s->l2_table_cache);
 
@@ -182,15 +183,21 @@ int qcow2_cache_flush(BlockDriverState *bs, Qcow2Cache *c)
         if (ret < 0 && result != -ENOSPC) {
             result = ret;
         }
+        if (c->entries[i].dirty && c->entries[i].offset) {
+            flag = 1;
+        }
     }
 
-    if (result == 0) {
+    if (result == 0 && flag == 1) {
         ret = bdrv_flush(bs->file);
         if (ret < 0) {
             result = ret;
         }
     }
 
+    if (flag == 0 && result >= 0) {
+        result = 1;
+    }
+
     return result;
 }

diff --git a/block/qcow2.c b/block/qcow2.c
index d522ec7..ab4613a 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2504,6 +2504,8 @@ static coroutine_fn int qcow2_co_flush_to_os(BlockDriverState *bs)
     if (ret < 0) {
         qemu_co_mutex_unlock(&s->lock);
         return ret;
+    } else if (ret == 1) {
+        bdrv_flush(bs->file);
     }
 
     if (qcow2_need_accurate_refcounts(s)) {
--
1.9.1





-
Qingshu Chen,
Institute of Parallel and Distributed Systems (IPADS),
School of Software,
Shanghai Jiao Tong University
---


Re: [Qemu-block] [Qemu-devel] [PATCH RFC 0/4] aio: Use epoll_wait in aio_poll

2015-07-08 Thread Christian Borntraeger
Am 08.07.2015 um 03:02 schrieb Fam Zheng:
> On Tue, 07/07 16:54, Christian Borntraeger wrote:
>> Am 30.06.2015 um 15:19 schrieb Fam Zheng:
>>> epoll is more scalable than ppoll. It performs faster than ppoll when the
>>> number of polled fds is high.
>>>
>>> See patch 4 for an example of the scenario and some benchmark data.
>>>
>>> Note: it is only effective on iothread (dataplane), while the main loop 
>>> cannot
>>> benefit from this yet, because the iohandler and chardev GSource's don't 
>>> easily
>>> fit into this epoll interface style (that's why main loop uses qemu_poll_ns
>>> directly instead of aio_poll()).
>>>
>>> There is hardly any timer activity in iothreads for now, as a result the
>>> timeout is always 0 or -1. Therefore, timerfd, or the said nanosecond
>>> epoll_pwait1 interface, which fixes the timeout granularity deficiency is 
>>> not
>>> immediately necessary at this point, but still that will be simple to add.
>>>
>>> Please review!
>>
>> Is there a branch somewhere, so that I could give it a spin?
>>
> 
> Here:
> 
> https://github.com/famz/qemu/tree/aio-posix-epoll
> 
In file included from /home/cborntra/REPOS/qemu/include/qemu/option.h:31:0,
 from /home/cborntra/REPOS/qemu/include/qemu-common.h:44,
 from /home/cborntra/REPOS/qemu/async.c:25:
/home/cborntra/REPOS/qemu/async.c: In function 'aio_context_new':
/home/cborntra/REPOS/qemu/include/qapi/error.h:57:20: error: 'ret' may be used 
uninitialized in this function [-Werror=maybe-uninitialized]
 error_set_errno(errp, os_error, ERROR_CLASS_GENERIC_ERROR, \
^
/home/cborntra/REPOS/qemu/async.c:291:9: note: 'ret' was declared here
 int ret;
 ^
cc1: all warnings being treated as errors

With that fixed, it seems to work. Still looking at the performance.




Re: [Qemu-block] [Qemu-devel] [PATCH 1/2] ahci: Fix CD-ROM signature

2015-07-08 Thread Kevin Wolf
Am 07.07.2015 um 19:19 hat John Snow geschrieben:
> 
> 
> On 07/06/2015 05:49 PM, John Snow wrote:
> > From: Hannes Reinecke 
> > 
> > The CD-ROM signature is 0xeb140101, not 0xeb14.
> > Without this change OVMF/Duet runs into a timeout trying
> > to detect a SATA cdrom.
> > 
> > Signed-off-by: Hannes Reinecke 
> > Signed-off-by: John Snow 
> > ---
> >  hw/ide/ahci.h | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/hw/ide/ahci.h b/hw/ide/ahci.h
> > index 9f5b4d2..68d5074 100644
> > --- a/hw/ide/ahci.h
> > +++ b/hw/ide/ahci.h
> > @@ -166,7 +166,7 @@
> >  #define AHCI_CMD_HDR_CMD_FIS_LEN   0x1f
> >  #define AHCI_CMD_HDR_PRDT_LEN  16
> >  
> > -#define SATA_SIGNATURE_CDROM   0xeb14
> > +#define SATA_SIGNATURE_CDROM   0xeb140101
> >  #define SATA_SIGNATURE_DISK0x0101
> >  
> >  #define AHCI_GENERIC_HOST_CONTROL_REGS_MAX_ADDR 0x20
> > 
> 
> FWIW for review purposes, this is based on ATA8-ACS, Table 184 "Device
> Signatures for Normal Output" and is very straightforward.

It is. The only thing that confuses me about this is that I thought it
had been fixed for a while already. My AHCI driver has a comment "Broken
value in qemu < 2.2" at the workaround, and I seem to remember that I
didn't need it for a newer qemu version indeed. Strange. Or perhaps I
only fixed it locally and then forgot to send the patch?

Anyway, the fix is obviously correct:

Reviewed-by: Kevin Wolf 



[Qemu-block] [PULL v2 01/16] dataplane: fix cross-endian issues

2015-07-08 Thread Michael S. Tsirkin
From: Greg Kurz 

Accesses to vring_avail_event and vring_used_event must honor the queue
endianness.

This patch allows cross-endian setups to use dataplane (tested with ppc64
on ppc64le, and vice-versa).

Suggested-by: Cornelia Huck 
Signed-off-by: Greg Kurz 
Reviewed-by: Michael S. Tsirkin 
Signed-off-by: Michael S. Tsirkin 
Reviewed-by: Cornelia Huck 
---
 hw/virtio/dataplane/vring.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/hw/virtio/dataplane/vring.c b/hw/virtio/dataplane/vring.c
index 3589185..bed9b11 100644
--- a/hw/virtio/dataplane/vring.c
+++ b/hw/virtio/dataplane/vring.c
@@ -153,7 +153,8 @@ bool vring_should_notify(VirtIODevice *vdev, Vring *vring)
 return true;
 }
 
-return vring_need_event(vring_used_event(&vring->vr), new, old);
+return vring_need_event(virtio_tswap16(vdev, vring_used_event(&vring->vr)),
+new, old);
 }
 
 
@@ -407,7 +408,8 @@ int vring_pop(VirtIODevice *vdev, Vring *vring,
 /* On success, increment avail index. */
 vring->last_avail_idx++;
 if (virtio_has_feature(vdev, VIRTIO_RING_F_EVENT_IDX)) {
-vring_avail_event(&vring->vr) = vring->last_avail_idx;
+vring_avail_event(&vring->vr) =
+virtio_tswap16(vdev, vring->last_avail_idx);
 }
 
 return head;
-- 
MST




[Qemu-block] [PULL v2 02/16] Revert "dataplane: allow virtio-1 devices"

2015-07-08 Thread Michael S. Tsirkin
From: Cornelia Huck 

This reverts commit f5a5628cf0b65b223fa0c9031714578dfac4cf04.

This was an old patch that had been already superseded by b0e5d90eb
("dataplane: endianness-aware accesses").

Signed-off-by: Cornelia Huck 
Reviewed-by: Michael S. Tsirkin 
Signed-off-by: Michael S. Tsirkin 
Acked-by: Stefan Hajnoczi 
---
 hw/virtio/dataplane/vring.c | 47 -
 1 file changed, 21 insertions(+), 26 deletions(-)

diff --git a/hw/virtio/dataplane/vring.c b/hw/virtio/dataplane/vring.c
index bed9b11..07fd69c 100644
--- a/hw/virtio/dataplane/vring.c
+++ b/hw/virtio/dataplane/vring.c
@@ -158,18 +158,15 @@ bool vring_should_notify(VirtIODevice *vdev, Vring *vring)
 }
 
 
-static int get_desc(VirtIODevice *vdev, Vring *vring, VirtQueueElement *elem,
+static int get_desc(Vring *vring, VirtQueueElement *elem,
 struct vring_desc *desc)
 {
 unsigned *num;
 struct iovec *iov;
 hwaddr *addr;
 MemoryRegion *mr;
-int is_write = virtio_tswap16(vdev, desc->flags) & VRING_DESC_F_WRITE;
-uint32_t len = virtio_tswap32(vdev, desc->len);
-uint64_t desc_addr = virtio_tswap64(vdev, desc->addr);
 
-if (is_write) {
+if (desc->flags & VRING_DESC_F_WRITE) {
 num = &elem->in_num;
 iov = &elem->in_sg[*num];
 addr = &elem->in_addr[*num];
@@ -193,17 +190,18 @@ static int get_desc(VirtIODevice *vdev, Vring *vring, 
VirtQueueElement *elem,
 }
 
 /* TODO handle non-contiguous memory across region boundaries */
-iov->iov_base = vring_map(&mr, desc_addr, len, is_write);
+iov->iov_base = vring_map(&mr, desc->addr, desc->len,
+  desc->flags & VRING_DESC_F_WRITE);
 if (!iov->iov_base) {
 error_report("Failed to map descriptor addr %#" PRIx64 " len %u",
- (uint64_t)desc_addr, len);
+ (uint64_t)desc->addr, desc->len);
 return -EFAULT;
 }
 
 /* The MemoryRegion is looked up again and unref'ed later, leave the
  * ref in place.  */
-iov->iov_len = len;
-*addr = desc_addr;
+iov->iov_len = desc->len;
+*addr = desc->addr;
 *num += 1;
 return 0;
 }
@@ -225,23 +223,21 @@ static int get_indirect(VirtIODevice *vdev, Vring *vring,
 struct vring_desc desc;
 unsigned int i = 0, count, found = 0;
 int ret;
-uint32_t len = virtio_tswap32(vdev, indirect->len);
-uint64_t addr = virtio_tswap64(vdev, indirect->addr);
 
 /* Sanity check */
-if (unlikely(len % sizeof(desc))) {
+if (unlikely(indirect->len % sizeof(desc))) {
 error_report("Invalid length in indirect descriptor: "
  "len %#x not multiple of %#zx",
- len, sizeof(desc));
+ indirect->len, sizeof(desc));
 vring->broken = true;
 return -EFAULT;
 }
 
-count = len / sizeof(desc);
+count = indirect->len / sizeof(desc);
 /* Buffers are chained via a 16 bit next field, so
  * we can have at most 2^16 of these. */
 if (unlikely(count > USHRT_MAX + 1)) {
-error_report("Indirect buffer length too big: %d", len);
+error_report("Indirect buffer length too big: %d", indirect->len);
 vring->broken = true;
 return -EFAULT;
 }
@@ -252,12 +248,12 @@ static int get_indirect(VirtIODevice *vdev, Vring *vring,
 
 /* Translate indirect descriptor */
 desc_ptr = vring_map(&mr,
- addr + found * sizeof(desc),
+ indirect->addr + found * sizeof(desc),
  sizeof(desc), false);
 if (!desc_ptr) {
 error_report("Failed to map indirect descriptor "
  "addr %#" PRIx64 " len %zu",
- (uint64_t)addr + found * sizeof(desc),
+ (uint64_t)indirect->addr + found * sizeof(desc),
  sizeof(desc));
 vring->broken = true;
 return -EFAULT;
@@ -275,20 +271,19 @@ static int get_indirect(VirtIODevice *vdev, Vring *vring,
 return -EFAULT;
 }
 
-if (unlikely(virtio_tswap16(vdev, desc.flags)
- & VRING_DESC_F_INDIRECT)) {
+if (unlikely(desc.flags & VRING_DESC_F_INDIRECT)) {
 error_report("Nested indirect descriptor");
 vring->broken = true;
 return -EFAULT;
 }
 
-ret = get_desc(vdev, vring, elem, &desc);
+ret = get_desc(vring, elem, &desc);
 if (ret < 0) {
 vring->broken |= (ret == -EFAULT);
 return ret;
 }
-i = virtio_tswap16(vdev, desc.next);
-} while (virtio_tswap16(vdev, desc.flags) & VRING_DESC_F_NEXT);
+i = desc.next;
+} while (desc.flags & VRING_DESC_F_NEXT);
 return 0;
 }
 
@@ -389,7 +384,7 @@ int vring_pop(VirtIODevice *vdev, Vring *vring,
 /* Ensure descriptor is loaded before accessing fields */
  

Re: [Qemu-block] NVMe volatile write cache fixes

2015-07-08 Thread Kevin Wolf
Am 11.06.2015 um 12:01 hat Christoph Hellwig geschrieben:
> The first patch ensures that flush actually flushes data so that
> data integrity is preserved, the second fixes up WCE reporting.

Thanks, applied to the block branch for 2.4-rc1.

Kevin



Re: [Qemu-block] [Qemu-devel] [PATCH v2] block/curl: Don't lose original error when a connection fails.

2015-07-08 Thread Kevin Wolf
Am 03.07.2015 um 14:35 hat Markus Armbruster geschrieben:
> "Richard W.M. Jones"  writes:
> 
> > Currently if qemu is connected to a curl source (eg. web server), and
> > the web server fails / times out / dies, you always see a bogus EIO
> > "Input/output error".
> >
> > For example, choose a large file located on any local webserver which
> > you control:
> >
> >   $ qemu-img convert -p http://example.com/large.iso /tmp/test
> >
> > Once it starts copying the file, stop the webserver and you will see
> > qemu-img fail with:
> >
> >   qemu-img: error while reading sector 61440: Input/output error
> >
> > This patch does two things: Firstly print the actual error from curl
> > so it doesn't get lost.  Secondly, change EIO to EPROTO.  EPROTO is a
> > POSIX.1 compatible errno which more accurately reflects that there was
> > a protocol error, rather than some kind of hardware failure.
> >
> > After this patch is applied, the error changes to:
> >
> >   $ qemu-img convert -p http://example.com/large.iso /tmp/test
> >   qemu-img: curl: transfer closed with 469989 bytes remaining to read
> >   qemu-img: error while reading sector 16384: Protocol error
> >
> > Signed-off-by: Richard W.M. Jones 
> > Reviewed-by: Stefan Hajnoczi 
> > ---
> >  block/curl.c | 9 -
> >  1 file changed, 8 insertions(+), 1 deletion(-)
> >
> > diff --git a/block/curl.c b/block/curl.c
> > index 3a2b63e..2fd7c06 100644
> > --- a/block/curl.c
> > +++ b/block/curl.c
> > @@ -22,6 +22,7 @@
> >   * THE SOFTWARE.
> >   */
> >  #include "qemu-common.h"
> > +#include "qemu/error-report.h"
> >  #include "block/block_int.h"
> >  #include "qapi/qmp/qbool.h"
> >  #include "qapi/qmp/qstring.h"
> > @@ -298,6 +299,12 @@ static void curl_multi_check_completion(BDRVCURLState 
> > *s)
> >  /* ACBs for successful messages get completed in curl_read_cb 
> > */
> >  if (msg->data.result != CURLE_OK) {
> >  int i;
> > +
> > +/* Don't lose the original error message from curl, since
> > + * it contains extra data.
> > + */
> > +error_report("curl: %s", state->errmsg);
> > +
> >  for (i = 0; i < CURL_NUM_ACB; i++) {
> >  CURLAIOCB *acb = state->acb[i];
> >  
> 
> Printing an error message, then returning an error code is problematic.
> 
> It works when the caller is going to print its own error message to the
> same destination.  Callee produces a specific error message devoid of
> context, caller produces an unspecific one with hopefully more context.
> Better than just one of them.  Worse than a single specific error with
> context, but that can't be done with just a "return errno code"
> interface.
> 
> It's kind of wrong when the caller reports its own error somewhere else,
> e.g. to a monitor.  Still, when barfing extra info to stderr is the best
> we can do, it's better than nothing.
> 
> It's more wrong when the caller handles the error quietly.  I guess
> that's never the case here, but I can't be sure without a lot more
> sleuthing.  Perhaps Kevin or Stefan can judge this immediately.

I'm not worried too much about requests made by the monitor or during
startup. I don't like the error_report() there, but having a more
specific error message on stderr is better than having nothing.

The case that bothers me more is guest requests. Depending on the
werror/rerror settings, this may allow the guest to flood the log file
with curl error messages.

> > @@ -305,7 +312,7 @@ static void curl_multi_check_completion(BDRVCURLState 
> > *s)
> >  continue;
> >  }
> >  
> > -acb->common.cb(acb->common.opaque, -EIO);
> > +acb->common.cb(acb->common.opaque, -EPROTO);
> >  qemu_aio_unref(acb);
> >  state->acb[i] = NULL;
> >  }
> 
> To understand impact exactly, we'd need to figure out where the changed
> error code gets consumed.  However, I don't expect consumers to check
> the actual error code.  A quick grep for comparisons with EIO or -EIO
> finds nothing related to block I/O, except for nbd_trip() checking the
> value of nbd_co_receive_request(), and that's unrelated.

Yes, I wouldn't expect any problems caused by this change.

Kevin



Re: [Qemu-block] [PATCH] raw-posix.c: remove raw device access for cdrom

2015-07-08 Thread Kevin Wolf
Am 02.07.2015 um 16:18 hat Laurent Vivier geschrieben:
> 
> 
> On 02/07/2015 16:03, Paolo Bonzini wrote:
> > 
> > 
> > On 02/07/2015 15:58, Laurent Vivier wrote:
> >> Since any /dev entry can be treated as a raw disk image, it is worth
> >> noting which devices can be accessed when and how. /dev/rdisk nodes are
> >> character-special devices, but are "raw" in the BSD sense and force
> >> block-aligned I/O. They are closer to the physical disk than the buffer
> >> cache. /dev/disk nodes, on the other hand, are buffered block-special
> >> devices and are used primarily by the kernel's filesystem code.
> > 
> > So the right thing to do would not be just to set need_alignment, but to
> > probe it like we do on Linux for BDRV_O_NO_CACHE.
> > 
> > I'm okay with doing the simple thing, but it needs a comment for non-BSDers.
> 
> So, what we have to do, in our case, for MacOS X cdrom, is something like:
> 
> ... GetBSDPath ...
> ...
> if (flags & BDRV_O_NOCACHE) {
> strcat(bsdPath, "r");
> }
> ...

I would avoid such magic. What we could do is rejecting /dev/rdisk nodes
without BDRV_O_NOCACHE.

Kevin



Re: [Qemu-block] [PATCH] raw-posix.c: remove raw device access for cdrom

2015-07-08 Thread Laurent Vivier


On 08/07/2015 12:31, Kevin Wolf wrote:
> Am 02.07.2015 um 16:18 hat Laurent Vivier geschrieben:
>>
>>
>> On 02/07/2015 16:03, Paolo Bonzini wrote:
>>>
>>>
>>> On 02/07/2015 15:58, Laurent Vivier wrote:
 Since any /dev entry can be treated as a raw disk image, it is worth
 noting which devices can be accessed when and how. /dev/rdisk nodes are
 character-special devices, but are "raw" in the BSD sense and force
 block-aligned I/O. They are closer to the physical disk than the buffer
 cache. /dev/disk nodes, on the other hand, are buffered block-special
 devices and are used primarily by the kernel's filesystem code.
>>>
>>> So the right thing to do would not be just to set need_alignment, but to
>>> probe it like we do on Linux for BDRV_O_NO_CACHE.
>>>
>>> I'm okay with doing the simple thing, but it needs a comment for non-BSDers.
>>
>> So, what we have to do, in our case, for MacOS X cdrom, is something like:
>>
>> ... GetBSDPath ...
>> ...
>> if (flags & BDRV_O_NOCACHE) {
>> strcat(bsdPath, "r");
>> }
>> ...
> 
> I would avoid such magic. What we could do is rejecting /dev/rdisk nodes
> without BDRV_O_NOCACHE.

It's not how it works...

Look in hdev_open().

If user provides /dev/cdrom on the command line, in the case of MacOS X,
QEMU searches for a cdrom drive in the system and sets filename to
/dev/rdiskX according to the result.

Perhaps this part should be removed.

But if we just want to correct the bug, we must not set filename to
/dev/rdiskX if NOCACHE is not set but to /dev/diskX

It's the aim of this change.

Laurent




Re: [Qemu-block] [PATCH RFC 3/4] aio: Introduce aio_context_setup

2015-07-08 Thread Stefan Hajnoczi
On Wed, Jul 08, 2015 at 09:15:57AM +0800, Fam Zheng wrote:
> On Tue, 07/07 15:35, Stefan Hajnoczi wrote:
> > On Tue, Jun 30, 2015 at 09:19:44PM +0800, Fam Zheng wrote:
> > > diff --git a/async.c b/async.c
> > > index 06971f4..1d70cfd 100644
> > > --- a/async.c
> > > +++ b/async.c
> > > @@ -290,12 +290,17 @@ AioContext *aio_context_new(Error **errp)
> > >  {
> > >  int ret;
> > >  AioContext *ctx;
> > > +Error *local_err = NULL;
> > > +
> > >  ctx = (AioContext *) g_source_new(&aio_source_funcs, 
> > > sizeof(AioContext));
> > > +aio_context_setup(ctx, &local_err);
> > > +if (local_err) {
> > > +error_propagate(errp, local_err);
> > 
> > Is there any reason to introduce local_err?  errp can be passed directly
> > into aio_context_setup().
> 
> It's used for catching failure of aio_context_setup, because the convention is
> errp can be NULL.

You are right, I missed that aio_context_setup() has a void return type.




Re: [Qemu-block] [Qemu-devel] [PATCH RFC 4/4] aio-posix: Use epoll in aio_poll

2015-07-08 Thread Stefan Hajnoczi
On Wed, Jul 08, 2015 at 09:01:27AM +0800, Fam Zheng wrote:
> On Tue, 07/07 16:08, Stefan Hajnoczi wrote:
> > > +#define EPOLL_BATCH 128
> > > +static bool aio_poll_epoll(AioContext *ctx, bool blocking)
> > > +{
> > > +AioHandler *node;
> > > +bool was_dispatching;
> > > +int i, ret;
> > > +bool progress;
> > > +int64_t timeout;
> > > +struct epoll_event events[EPOLL_BATCH];
> > > +
> > > +aio_context_acquire(ctx);
> > > +was_dispatching = ctx->dispatching;
> > > +progress = false;
> > > +
> > > +/* aio_notify can avoid the expensive event_notifier_set if
> > > + * everything (file descriptors, bottom halves, timers) will
> > > + * be re-evaluated before the next blocking poll().  This is
> > > + * already true when aio_poll is called with blocking == false;
> > > + * if blocking == true, it is only true after poll() returns.
> > > + *
> > > + * If we're in a nested event loop, ctx->dispatching might be true.
> > > + * In that case we can restore it just before returning, but we
> > > + * have to clear it now.
> > > + */
> > > +aio_set_dispatching(ctx, !blocking);
> > > +
> > > +ctx->walking_handlers++;
> > > +
> > > +timeout = blocking ? aio_compute_timeout(ctx) : 0;
> > > +
> > > +if (timeout > 0) {
> > > +timeout = DIV_ROUND_UP(timeout, 100);
> > > +}
> > 
> > I think you already posted the timerfd code in an earlier series.  Why
> > degrade to millisecond precision?  It needs to be fixed up anyway if the
> > main loop uses aio_poll() in the future.
> 
> Because of a little complication: timeout here is always -1 for iothread, and
> what is interesting is that -1 actually requires an explicit
> 
> timerfd_settime(timerfd, flags, &(struct itimerspec){{0, 0}}, NULL)
> 
> to disable timerfd for this aio_poll(), which costs somethings. Passing -1 to
> epoll_wait() without this doesn't work because the timerfd is already added to
> the epollfd and may have an unexpected timeout set before.
> 
> Of course we can cache the state and optimize, but I've not reasoned about 
> what
> if another thread happens to call aio_poll() when we're in epoll_wait(), for
> example when the first aio_poll() has a positive timeout but the second one 
> has
> -1.

I'm not sure I understand the threads scenario since aio_poll_epoll()
has a big aio_context_acquire()/release() region that protects it, but I
guess the nested aio_poll() case is similar.  Care needs to be taken so
the extra timerfd state is consistent.

The optimization can be added later unless the timerfd_settime() syscall
is so expensive that it defeats the advantage of epoll().




Re: [Qemu-block] [PATCH] raw-posix.c: remove raw device access for cdrom

2015-07-08 Thread Kevin Wolf
Am 08.07.2015 um 12:47 hat Laurent Vivier geschrieben:
> 
> 
> On 08/07/2015 12:31, Kevin Wolf wrote:
> > Am 02.07.2015 um 16:18 hat Laurent Vivier geschrieben:
> >>
> >>
> >> On 02/07/2015 16:03, Paolo Bonzini wrote:
> >>>
> >>>
> >>> On 02/07/2015 15:58, Laurent Vivier wrote:
>  Since any /dev entry can be treated as a raw disk image, it is worth
>  noting which devices can be accessed when and how. /dev/rdisk nodes are
>  character-special devices, but are "raw" in the BSD sense and force
>  block-aligned I/O. They are closer to the physical disk than the buffer
>  cache. /dev/disk nodes, on the other hand, are buffered block-special
>  devices and are used primarily by the kernel's filesystem code.
> >>>
> >>> So the right thing to do would not be just to set need_alignment, but to
> >>> probe it like we do on Linux for BDRV_O_NO_CACHE.
> >>>
> >>> I'm okay with doing the simple thing, but it needs a comment for 
> >>> non-BSDers.
> >>
> >> So, what we have to do, in our case, for MacOS X cdrom, is something like:
> >>
> >> ... GetBSDPath ...
> >> ...
> >> if (flags & BDRV_O_NOCACHE) {
> >> strcat(bsdPath, "r");
> >> }
> >> ...
> > 
> > I would avoid such magic. What we could do is rejecting /dev/rdisk nodes
> > without BDRV_O_NOCACHE.
> 
> It's not how it works...
> 
> Look in hdev_open().
> 
> If user provides /dev/cdrom on the command line, in the case of MacOS X,
> QEMU searches for a cdrom drive in the system and sets filename to
> /dev/rdiskX according to the result.

Oh, we're already playing such games... I guess you're right then.

It even seems to be not only for '/dev/cdrom', but for everything
starting with this string. Does anyone know what's the reason for that?

Also, I guess before doing strcat() on bsdPath, we should check the
buffer length...

> Perhaps this part should be removed.
> 
> But if we just want to correct the bug, we must not set filename to
> /dev/rdiskX if NOCACHE is not set but to /dev/diskX
> 
> It's the aim of this change.

Yes, that looks right.

Kevin



Re: [Qemu-block] [PATCH] block/mirror: limit qiov to IOV_MAX elements

2015-07-08 Thread Kevin Wolf
Am 01.07.2015 um 16:45 hat Stefan Hajnoczi geschrieben:
> If mirror has more free buffers than IOV_MAX, preadv(2)/pwritev(2)
> EINVAL failures may be encountered.
> 
> It is possible to trigger this by setting granularity to a low value
> like 8192.
> 
> This patch stops appending chunks once IOV_MAX is reached.
> 
> The spurious EINVAL failure can be reproduced with a qcow2 image file
> and the following QMP invocation:
> 
>   qmp.command('drive-mirror', device='virtio0', target='/tmp/r7.s1',
>   granularity=8192, sync='full', mode='absolute-paths',
>   format='raw')
> 
> While the guest is running dd if=/dev/zero of=/var/tmp/foo oflag=direct
> bs=4k.
> 
> Cc: Jeff Cody 
> Signed-off-by: Stefan Hajnoczi 

This looks like a backend-specific problem with raw-posix. Wouldn't it
make more sense to either let raw-posix split requests it can't handle
or introduce a bs->iov_max instead of spreading knowledge about specific
backends into all callers of the block layer?

I'm not objecting to taking this patch for now to fix the bug in 2.4,
but I don't think it's the proper solution.

Kevin

>  block/mirror.c | 4 
>  trace-events   | 1 +
>  2 files changed, 5 insertions(+)
> 
> diff --git a/block/mirror.c b/block/mirror.c
> index 048e452..985ad00 100644
> --- a/block/mirror.c
> +++ b/block/mirror.c
> @@ -241,6 +241,10 @@ static uint64_t coroutine_fn 
> mirror_iteration(MirrorBlockJob *s)
>  trace_mirror_break_buf_busy(s, nb_chunks, s->in_flight);
>  break;
>  }
> +if (IOV_MAX < nb_chunks + added_chunks) {
> +trace_mirror_break_iov_max(s, nb_chunks, added_chunks);
> +break;
> +}
>  
>  /* We have enough free space to copy these sectors.  */
>  bitmap_set(s->in_flight_bitmap, next_chunk, added_chunks);
> diff --git a/trace-events b/trace-events
> index 52b7efa..943cd0c 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -94,6 +94,7 @@ mirror_yield(void *s, int64_t cnt, int buf_free_count, int 
> in_flight) "s %p dirt
>  mirror_yield_in_flight(void *s, int64_t sector_num, int in_flight) "s %p 
> sector_num %"PRId64" in_flight %d"
>  mirror_yield_buf_busy(void *s, int nb_chunks, int in_flight) "s %p requested 
> chunks %d in_flight %d"
>  mirror_break_buf_busy(void *s, int nb_chunks, int in_flight) "s %p requested 
> chunks %d in_flight %d"
> +mirror_break_iov_max(void *s, int nb_chunks, int added_chunks) "s %p 
> requested chunks %d added_chunks %d"
>  
>  # block/backup.c
>  backup_do_cow_enter(void *job, int64_t start, int64_t sector_num, int 
> nb_sectors) "job %p start %"PRId64" sector_num %"PRId64" nb_sectors %d"
> -- 
> 2.4.3
> 
> 



Re: [Qemu-block] [Qemu-devel] [PATCH v2] block/curl: Don't lose original error when a connection fails.

2015-07-08 Thread Richard W.M. Jones
On Wed, Jul 08, 2015 at 12:23:37PM +0200, Kevin Wolf wrote:
> Am 03.07.2015 um 14:35 hat Markus Armbruster geschrieben:
> > "Richard W.M. Jones"  writes:
> > 
> > > Currently if qemu is connected to a curl source (eg. web server), and
> > > the web server fails / times out / dies, you always see a bogus EIO
> > > "Input/output error".
> > >
> > > For example, choose a large file located on any local webserver which
> > > you control:
> > >
> > >   $ qemu-img convert -p http://example.com/large.iso /tmp/test
> > >
> > > Once it starts copying the file, stop the webserver and you will see
> > > qemu-img fail with:
> > >
> > >   qemu-img: error while reading sector 61440: Input/output error
> > >
> > > This patch does two things: Firstly print the actual error from curl
> > > so it doesn't get lost.  Secondly, change EIO to EPROTO.  EPROTO is a
> > > POSIX.1 compatible errno which more accurately reflects that there was
> > > a protocol error, rather than some kind of hardware failure.
> > >
> > > After this patch is applied, the error changes to:
> > >
> > >   $ qemu-img convert -p http://example.com/large.iso /tmp/test
> > >   qemu-img: curl: transfer closed with 469989 bytes remaining to read
> > >   qemu-img: error while reading sector 16384: Protocol error
> > >
> > > Signed-off-by: Richard W.M. Jones 
> > > Reviewed-by: Stefan Hajnoczi 
> > > ---
> > >  block/curl.c | 9 -
> > >  1 file changed, 8 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/block/curl.c b/block/curl.c
> > > index 3a2b63e..2fd7c06 100644
> > > --- a/block/curl.c
> > > +++ b/block/curl.c
> > > @@ -22,6 +22,7 @@
> > >   * THE SOFTWARE.
> > >   */
> > >  #include "qemu-common.h"
> > > +#include "qemu/error-report.h"
> > >  #include "block/block_int.h"
> > >  #include "qapi/qmp/qbool.h"
> > >  #include "qapi/qmp/qstring.h"
> > > @@ -298,6 +299,12 @@ static void 
> > > curl_multi_check_completion(BDRVCURLState *s)
> > >  /* ACBs for successful messages get completed in 
> > > curl_read_cb */
> > >  if (msg->data.result != CURLE_OK) {
> > >  int i;
> > > +
> > > +/* Don't lose the original error message from curl, since
> > > + * it contains extra data.
> > > + */
> > > +error_report("curl: %s", state->errmsg);
> > > +
> > >  for (i = 0; i < CURL_NUM_ACB; i++) {
> > >  CURLAIOCB *acb = state->acb[i];
> > >  
> > 
> > Printing an error message, then returning an error code is problematic.
> > 
> > It works when the caller is going to print its own error message to the
> > same destination.  Callee produces a specific error message devoid of
> > context, caller produces an unspecific one with hopefully more context.
> > Better than just one of them.  Worse than a single specific error with
> > context, but that can't be done with just a "return errno code"
> > interface.
> > 
> > It's kind of wrong when the caller reports its own error somewhere else,
> > e.g. to a monitor.  Still, when barfing extra info to stderr is the best
> > we can do, it's better than nothing.
> > 
> > It's more wrong when the caller handles the error quietly.  I guess
> > that's never the case here, but I can't be sure without a lot more
> > sleuthing.  Perhaps Kevin or Stefan can judge this immediately.
> 
> I'm not worried too much about requests made by the monitor or during
> startup. I don't like the error_report() there, but having a more
> specific error message on stderr is better than having nothing.
> 
> The case that bothers me more is guest requests. Depending on the
> werror/rerror settings, this may allow the guest to flood the log file
> with curl error messages.

Can you expand a bit on how they would do this?  I can see how the
remote web server can cause a curl error (itself possibly a concern),
but not how the guest can do it.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top



Re: [Qemu-block] [PATCH 1/2] block: vpc - prevent overflow if max_table_entries >= 0x40000000

2015-07-08 Thread Kevin Wolf
Am 26.06.2015 um 11:57 hat Stefan Hajnoczi geschrieben:
> On Thu, Jun 25, 2015 at 11:05:20AM -0400, Jeff Cody wrote:
> > On Thu, Jun 25, 2015 at 03:28:35PM +0100, Stefan Hajnoczi wrote:
> > > On Wed, Jun 24, 2015 at 03:54:27PM -0400, Jeff Cody wrote:
> > > > @@ -269,7 +270,9 @@ static int vpc_open(BlockDriverState *bs, QDict 
> > > > *options, int flags,
> > > >  goto fail;
> > > >  }
> > > >  
> > > > -s->pagetable = qemu_try_blockalign(bs->file, 
> > > > s->max_table_entries * 4);
> > > > +pagetable_size = (size_t) s->max_table_entries * 4;
> > > > +
> > > > +s->pagetable = qemu_try_blockalign(bs->file, pagetable_size);
> > > 
> > > On 32-bit hosts size_t is 32-bit so the overflow hasn't been solved.
> > > 
> > > Does it make sense to impose a limit on pagetable_size?
> > 
> > Good point.  Yes, it does.
> > 
> > The VHD spec says that the "Max Table Entries" field should be equal
> > to the disk size / block size.  I don't know if there are images out
> > there that treat that as ">= disk size / block size" rather than "==",
> > however.  But if we assume max size of 2TB for a VHD disk, and a
> > minimal block size of 512 bytes, that would give us a
> > max_table_entries of 0x100000000, which exceeds 32 bits by itself.
> > 
> > For pagetable_size to fit in a 32-bit size_t, that means to support 2TB on a
> > 32-bit host in the current implementation, the minimum block size is
> > 4096.
> > 
> > We could check during open / create that 
> > (disk_size / block_size) * 4 < SIZE_MAX, and refuse to open if this is
> > not true (and also validate max_table_entries to fit in the same).
> 
> Sounds good.

Jeff, I couldn't find a v2 anywhere. Are you still planning to send it?

Kevin
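
The limit check discussed in this thread could be sketched as follows. This is a hedged sketch, not the actual vpc.c patch: `vpc_pagetable_fits` and the explicit `host_size_max` parameter are hypothetical (the real code would compare against the host's SIZE_MAX directly); each page table entry is 4 bytes, and the comparison is arranged so the multiplication can never wrap.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: reject images whose page table would not fit in
 * the host's size_t range.  host_size_max stands in for SIZE_MAX so the
 * 32-bit and 64-bit cases can both be exercised.  Dividing instead of
 * multiplying means the check itself cannot overflow. */
static bool vpc_pagetable_fits(uint64_t max_table_entries,
                               uint64_t host_size_max,
                               uint64_t *pagetable_size)
{
    if (max_table_entries > host_size_max / 4) {
        return false;               /* 4 * entries would overflow size_t */
    }
    *pagetable_size = max_table_entries * 4;
    return true;
}
```

With a 512-byte block size, a 2TB image needs 0x100000000 entries and is rejected on a 32-bit host; with a 4096-byte block size the table is 2GB and fits.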




Re: [Qemu-block] [Qemu-devel] [PATCH v2] block/curl: Don't lose original error when a connection fails.

2015-07-08 Thread Kevin Wolf
Am 08.07.2015 um 13:36 hat Richard W.M. Jones geschrieben:
> On Wed, Jul 08, 2015 at 12:23:37PM +0200, Kevin Wolf wrote:
> > Am 03.07.2015 um 14:35 hat Markus Armbruster geschrieben:
> > > "Richard W.M. Jones"  writes:
> > > 
> > > > Currently if qemu is connected to a curl source (eg. web server), and
> > > > the web server fails / times out / dies, you always see a bogus EIO
> > > > "Input/output error".
> > > >
> > > > For example, choose a large file located on any local webserver which
> > > > you control:
> > > >
> > > >   $ qemu-img convert -p http://example.com/large.iso /tmp/test
> > > >
> > > > Once it starts copying the file, stop the webserver and you will see
> > > > qemu-img fail with:
> > > >
> > > >   qemu-img: error while reading sector 61440: Input/output error
> > > >
> > > > This patch does two things: Firstly print the actual error from curl
> > > > so it doesn't get lost.  Secondly, change EIO to EPROTO.  EPROTO is a
> > > > POSIX.1 compatible errno which more accurately reflects that there was
> > > > a protocol error, rather than some kind of hardware failure.
> > > >
> > > > After this patch is applied, the error changes to:
> > > >
> > > >   $ qemu-img convert -p http://example.com/large.iso /tmp/test
> > > >   qemu-img: curl: transfer closed with 469989 bytes remaining to read
> > > >   qemu-img: error while reading sector 16384: Protocol error
> > > >
> > > > Signed-off-by: Richard W.M. Jones 
> > > > Reviewed-by: Stefan Hajnoczi 
> > > > ---
> > > >  block/curl.c | 9 -
> > > >  1 file changed, 8 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/block/curl.c b/block/curl.c
> > > > index 3a2b63e..2fd7c06 100644
> > > > --- a/block/curl.c
> > > > +++ b/block/curl.c
> > > > @@ -22,6 +22,7 @@
> > > >   * THE SOFTWARE.
> > > >   */
> > > >  #include "qemu-common.h"
> > > > +#include "qemu/error-report.h"
> > > >  #include "block/block_int.h"
> > > >  #include "qapi/qmp/qbool.h"
> > > >  #include "qapi/qmp/qstring.h"
> > > > @@ -298,6 +299,12 @@ static void curl_multi_check_completion(BDRVCURLState *s)
> > > >  /* ACBs for successful messages get completed in curl_read_cb */
> > > >  if (msg->data.result != CURLE_OK) {
> > > >  int i;
> > > > +
> > > > +/* Don't lose the original error message from curl, since
> > > > + * it contains extra data.
> > > > + */
> > > > +error_report("curl: %s", state->errmsg);
> > > > +
> > > >  for (i = 0; i < CURL_NUM_ACB; i++) {
> > > >  CURLAIOCB *acb = state->acb[i];
> > > >  
> > > 
> > > Printing an error message, then returning an error code is problematic.
> > > 
> > > It works when the caller is going to print its own error message to the
> > > same destination.  Callee produces a specific error message devoid of
> > > context, caller produces an unspecific one with hopefully more context.
> > > Better than just one of them.  Worse than a single specific error with
> > > context, but that can't be done with just a "return errno code"
> > > interface.
> > > 
> > > It's kind of wrong when the caller reports its own error somewhere else,
> > > e.g. to a monitor.  Still, when barfing extra info to stderr is the best
> > > we can do, it's better than nothing.
> > > 
> > > It's more wrong when the caller handles the error quietly.  I guess
> > > that's never the case here, but I can't be sure without a lot more
> > > sleuthing.  Perhaps Kevin or Stefan can judge this immediately.
> > 
> > I'm not worried too much about requests made by the monitor or during
> > startup. I don't like the error_report() there, but having a more
> > specific error message on stderr is better than having nothing.
> > 
> > The case that bothers me more is guest requests. Depending on the
> > werror/rerror settings, this may allow the guest to flood the log file
> > with curl error messages.
> 
> Can you expand a bit on how they would do this?  I can see how the
> remote web server can cause a curl error (itself possibly a concern),
> but not how the guest can do it.

The guest can't cause it, but once the connection is down, I expect
every request to fail. You don't have to have a malicious guest for
filling up the log file, it just needs to be careless enough to continue
trying new requests instead of offlining the device.

Kevin



Re: [Qemu-block] [Qemu-devel] [PATCH v2] block/curl: Don't lose original error when a connection fails.

2015-07-08 Thread Richard W.M. Jones
On Wed, Jul 08, 2015 at 02:01:30PM +0200, Kevin Wolf wrote:
> The guest can't cause it, but once the connection is down, I expect
> every request to fail. You don't have to have a malicious guest for
> filling up the log file, it just needs to be careless enough to continue
> trying new requests instead of offlining the device.

How about something along these lines?

[I'm not clear if atomic is necessary here, nor if there is already
some mechanism for suppressing log messages - a cursory grep of the
qemu source didn't find anything.]

diff --git a/block/curl.c b/block/curl.c
index 2fd7c06..33c14d8 100644
--- a/block/curl.c
+++ b/block/curl.c
@@ -299,11 +299,20 @@ static void curl_multi_check_completion(BDRVCURLState *s)
 /* ACBs for successful messages get completed in curl_read_cb */
 if (msg->data.result != CURLE_OK) {
 int i;
+static int errcount = 100;
+int current_errcount;
 
 /* Don't lose the original error message from curl, since
  * it contains extra data.
  */
-error_report("curl: %s", state->errmsg);
+current_errcount = atomic_fetch_add(&errcount, 0);
+if (current_errcount > 0) {
+error_report("curl: %s", state->errmsg);
+if (current_errcount == 1) {
+error_report("curl: further errors suppressed");
+}
+atomic_dec(&errcount);
+}
 
 for (i = 0; i < CURL_NUM_ACB; i++) {
 CURLAIOCB *acb = state->acb[i];

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org



Re: [Qemu-block] [PATCH] block/mirror: limit qiov to IOV_MAX elements

2015-07-08 Thread Stefan Hajnoczi
On Wed, Jul 8, 2015 at 12:14 PM, Kevin Wolf  wrote:
> Am 01.07.2015 um 16:45 hat Stefan Hajnoczi geschrieben:
>> If mirror has more free buffers than IOV_MAX, preadv(2)/pwritev(2)
>> EINVAL failures may be encountered.
>>
>> It is possible to trigger this by setting granularity to a low value
>> like 8192.
>>
>> This patch stops appending chunks once IOV_MAX is reached.
>>
>> The spurious EINVAL failure can be reproduced with a qcow2 image file
>> and the following QMP invocation:
>>
>>   qmp.command('drive-mirror', device='virtio0', target='/tmp/r7.s1',
>>   granularity=8192, sync='full', mode='absolute-paths',
>>   format='raw')
>>
>> While the guest is running dd if=/dev/zero of=/var/tmp/foo oflag=direct
>> bs=4k.
>>
>> Cc: Jeff Cody 
>> Signed-off-by: Stefan Hajnoczi 
>
> This looks like a backend-specific problem with raw-posix. Wouldn't it
> make more sense to either let raw-posix split requests it can't handle
> or introduce a bs->iov_max instead of spreading knowledge about specific
> backends into all callers of the block layer?
>
> I'm not objecting to taking this patch for now to fix the bug in 2.4,
> but I don't think it's the proper solution.

I will send a patch for QEMU 2.5 that introduces an iov max field in
BlockLimits.  It will convert the request merging code which currently
hardcodes IOV_MAX too.

In practice there are probably only two values that get used: IOV_MAX
and INT_MAX, but we already have BlockLimits and it doesn't hurt to
add this.

Stefan
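
The clamping behavior described in this thread can be sketched as below. The names (`build_qiov`, `FAKE_IOV_MAX`, `CHUNK_SIZE`) are hypothetical stand-ins, not the mirror code itself: the point is only that chunk collection stops once the vector reaches the iov limit, so preadv(2)/pwritev(2) never sees more than IOV_MAX elements.

```c
#include <stddef.h>
#include <sys/uio.h>

enum {
    FAKE_IOV_MAX = 1024,    /* stand-in for the host's real IOV_MAX */
    CHUNK_SIZE   = 512      /* stand-in for the mirror granularity */
};

static char mirror_buf[FAKE_IOV_MAX * CHUNK_SIZE];

/* Hypothetical sketch: append fixed-size chunks to an iovec array, but
 * stop once the vector reaches the limit. */
static int build_qiov(struct iovec *iov, int nb_chunks, int max_iov)
{
    int i;

    for (i = 0; i < nb_chunks && i < max_iov; i++) {
        iov[i].iov_base = mirror_buf + (size_t)i * CHUNK_SIZE;
        iov[i].iov_len = CHUNK_SIZE;
    }
    return i;   /* chunks actually appended; the rest wait for the next request */
}
```

Leftover chunks are simply carried over to the next request, which is why the patch only needs to stop appending rather than fail.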



Re: [Qemu-block] [PATCH v6 2/4] qcow2: add option to clean unused cache entries after some time

2015-07-08 Thread Kevin Wolf
Am 05.06.2015 um 18:27 hat Alberto Garcia geschrieben:
> This adds a new 'cache-clean-interval' option that cleans all qcow2
> cache entries that haven't been used in a certain interval, given in
> seconds.
> 
> This allows setting a large L2 cache size so it can handle scenarios
> with lots of I/O and at the same time use little memory during periods
> of inactivity.
> 
> This feature currently relies on MADV_DONTNEED to free that memory, so
> it is not useful in systems that don't follow that behavior.
> 
> Signed-off-by: Alberto Garcia 
> Reviewed-by: Max Reitz 

This patch doesn't apply cleanly any more. Can you please rebase and
change the QAPI documentation to say "since 2.5"?

Kevin



[Qemu-block] [PATCH v3 0/2] block/curl: Don't lose original error when a connection fails.

2015-07-08 Thread Richard W.M. Jones
Since v2:

This adds a generalized anti-flooding mechanism, and then uses it to
ensure that curl errors wouldn't flood the log files.

Rich.




[Qemu-block] [PATCH v3 1/2] Add a simple mechanism to protect against error message floods.

2015-07-08 Thread Richard W.M. Jones
You can now write code like this:

  FLOOD_COUNTER(errcount, 100);
  ...
  NO_FLOOD(errcount,
   error_report("oops, something bad happened"),
   error_report("further errors suppressed"));

which would print the "oops, ..." error message up to 100 times,
followed by the "further errors suppressed" message once, and then
nothing else.

Signed-off-by: Richard W.M. Jones 
---
 include/qemu/no-flood.h | 34 ++
 1 file changed, 34 insertions(+)
 create mode 100644 include/qemu/no-flood.h

diff --git a/include/qemu/no-flood.h b/include/qemu/no-flood.h
new file mode 100644
index 000..90a24da
--- /dev/null
+++ b/include/qemu/no-flood.h
@@ -0,0 +1,34 @@
+/*
+ * Suppress error floods.
+ *
+ * Copyright (C) 2015 Red Hat, Inc.
+ *
+ * Author: Richard W.M. Jones 
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef __QEMU_NO_FLOOD_H
+#define __QEMU_NO_FLOOD_H 1
+
+#include "qemu/atomic.h"
+
+/* Suggested initial value is 100 */
+#define FLOOD_COUNTER(counter_name, initial) \
+  static int counter_name = (initial)
+
+#define NO_FLOOD(counter_name, code, suppress_message) \
+  do { \
+int t_##counter_name = atomic_fetch_add(&counter_name, 0); \
+if (t_##counter_name > 0) {\
+code;  \
+if (t_##counter_name == 1) {   \
+suppress_message;  \
+}  \
+atomic_dec(&counter_name); \
+}  \
+} while (0)
+
+#endif
-- 
2.3.1




[Qemu-block] [PATCH v3 2/2] block/curl: Don't lose original error when a connection fails.

2015-07-08 Thread Richard W.M. Jones
Currently if qemu is connected to a curl source (eg. web server), and
the web server fails / times out / dies, you always see a bogus EIO
"Input/output error".

For example, choose a large file located on any local webserver which
you control:

  $ qemu-img convert -p http://example.com/large.iso /tmp/test

Once it starts copying the file, stop the webserver and you will see
qemu-img fail with:

  qemu-img: error while reading sector 61440: Input/output error

This patch does two things: Firstly print the actual error from curl
so it doesn't get lost.  Secondly, change EIO to EPROTO.  EPROTO is a
POSIX.1 compatible errno which more accurately reflects that there was
a protocol error, rather than some kind of hardware failure.

After this patch is applied, the error changes to:

  $ qemu-img convert -p http://example.com/large.iso /tmp/test
  qemu-img: curl: transfer closed with 469989 bytes remaining to read
  qemu-img: error while reading sector 16384: Protocol error

Signed-off-by: Richard W.M. Jones 
Reviewed-by: Stefan Hajnoczi 
---
 block/curl.c | 13 -
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/block/curl.c b/block/curl.c
index 3a2b63e..b72d505 100644
--- a/block/curl.c
+++ b/block/curl.c
@@ -22,6 +22,8 @@
  * THE SOFTWARE.
  */
 #include "qemu-common.h"
+#include "qemu/error-report.h"
+#include "qemu/no-flood.h"
 #include "block/block_int.h"
 #include "qapi/qmp/qbool.h"
 #include "qapi/qmp/qstring.h"
@@ -298,6 +300,15 @@ static void curl_multi_check_completion(BDRVCURLState *s)
 /* ACBs for successful messages get completed in curl_read_cb */
 if (msg->data.result != CURLE_OK) {
 int i;
+FLOOD_COUNTER(errcount, 100);
+
+/* Don't lose the original error message from curl, since
+ * it contains extra data.
+ */
+NO_FLOOD(errcount,
+ error_report("curl: %s", state->errmsg),
+ error_report("curl: further errors suppressed"));
+
 for (i = 0; i < CURL_NUM_ACB; i++) {
 CURLAIOCB *acb = state->acb[i];
 
@@ -305,7 +316,7 @@ static void curl_multi_check_completion(BDRVCURLState *s)
 continue;
 }
 
-acb->common.cb(acb->common.opaque, -EIO);
+acb->common.cb(acb->common.opaque, -EPROTO);
 qemu_aio_unref(acb);
 state->acb[i] = NULL;
 }
-- 
2.3.1




Re: [Qemu-block] [Qemu-devel] [PATCH 2/2] ahci: fix signature generation

2015-07-08 Thread Stefan Hajnoczi
On Tue, Jul 07, 2015 at 01:15:02PM -0400, John Snow wrote:
> 
> 
> On 07/07/2015 04:49 AM, Stefan Hajnoczi wrote:
> > On Mon, Jul 06, 2015 at 05:49:52PM -0400, John Snow wrote:
> >> The initial register device-to-host FIS no longer needs to specially
> >> set certain fields, as these can be handled generically by setting those
> >> fields explicitly with the signatures we want at port reset time.
> >>
> >> (1) Signatures are decomposed into their four component registers and
> >> set upon (AHCI) port reset.
> >> (2) the signature cache register is no longer set manually per-each
> >> device type, but instead just once during ahci_init_d2h.
> >>
> >> Signed-off-by: John Snow 
> >> ---
> >>  hw/ide/ahci.c | 33 -
> >>  1 file changed, 20 insertions(+), 13 deletions(-)
> > 
> > I see two code paths that call ahci_init_d2h().  Either
> > ahci_reset_port() does it (if a block device is attached) or it's called
> > when the guest writes to the PORT_CMD register.
> > 
> > I'm not sure the latter works.  The signature doesn't seem to be set
> > anywhere.
> > 
> > Any ideas?
...
> So on initial boot, we call ahci_init_d2h and set pr->sig, then call
> ahci_write_fis_d2h. However, since the FIS RX engine (PxFRE) is off, we
> don't actually generate the FIS because there's nowhere to store it.

My question is about the ide_state->blk == NULL case:

ahci_reset_port() is contradictory:

static void ahci_reset_port(AHCIState *s, int port)
{
...
ide_state = &s->dev[port].port.ifs[0];
if (!ide_state->blk) {
return;
}

...

s->dev[port].port_state = STATE_RUN;
if (!ide_state->blk) { <-- deadcode?
pr->sig = 0;
ide_state->status = SEEK_STAT | WRERR_STAT;
}

Does code after the first "if (!ide_state->blk)" in ahci_reset_port()
ever execute in a drive hotplug scenario?

If it doesn't execute then sig is never filled in.

Your patch does not include a regression but either something is broken
here or I don't understand the code.




Re: [Qemu-block] [PATCH] raw-posix.c: remove raw device access for cdrom

2015-07-08 Thread Programmingkid

On Jul 8, 2015, at 7:01 AM, Kevin Wolf wrote:

> Am 08.07.2015 um 12:47 hat Laurent Vivier geschrieben:
>> 
>> 
>> On 08/07/2015 12:31, Kevin Wolf wrote:
>>> Am 02.07.2015 um 16:18 hat Laurent Vivier geschrieben:
 
 
 On 02/07/2015 16:03, Paolo Bonzini wrote:
> 
> 
> On 02/07/2015 15:58, Laurent Vivier wrote:
>> Since any /dev entry can be treated as a raw disk image, it is worth
>> noting which devices can be accessed when and how. /dev/rdisk nodes are
>> character-special devices, but are "raw" in the BSD sense and force
>> block-aligned I/O. They are closer to the physical disk than the buffer
>> cache. /dev/disk nodes, on the other hand, are buffered block-special
>> devices and are used primarily by the kernel's filesystem code.
> 
> So the right thing to do would not be just to set need_alignment, but to
> probe it like we do on Linux for BDRV_O_NO_CACHE.
> 
> I'm okay with doing the simple thing, but it needs a comment for 
> non-BSDers.
 
 So, what we have to do, in our case, for MacOS X cdrom, is something like:
 
 ... GetBSDPath ...
 ...
if (flags & BDRV_O_NOCACHE) {
strcat(bsdPath, "r");
}
 ...
>>> 
>>> I would avoid such magic. What we could do is rejecting /dev/rdisk nodes
>>> without BDRV_O_NOCACHE.
>> 
>> It's not how it works...
>> 
>> Look in hdev_open().
>> 
>> If user provides /dev/cdrom on the command line, in the case of MacOS X,
>> QEMU searches for a cdrom drive in the system and set filename to
>> /dev/rdiskX according to the result.
> 
> Oh, we're already playing such games... I guess you're right then.
> 
> It even seems to be not only for '/dev/cdrom', but for everything
> starting with this string. Does anyone know what's the reason for that?
> 
> Also, I guess before doing strcat() on bsdPath, we should check the
> buffer length...

By buffer, do you mean the bsdPath variable?


Re: [Qemu-block] [PATCH] raw-posix.c: remove raw device access for cdrom

2015-07-08 Thread Laurent Vivier


On 08/07/2015 13:01, Kevin Wolf wrote:
> Am 08.07.2015 um 12:47 hat Laurent Vivier geschrieben:
>>
>>
>> On 08/07/2015 12:31, Kevin Wolf wrote:
>>> Am 02.07.2015 um 16:18 hat Laurent Vivier geschrieben:


 On 02/07/2015 16:03, Paolo Bonzini wrote:
>
>
> On 02/07/2015 15:58, Laurent Vivier wrote:
>> Since any /dev entry can be treated as a raw disk image, it is worth
>> noting which devices can be accessed when and how. /dev/rdisk nodes are
>> character-special devices, but are "raw" in the BSD sense and force
>> block-aligned I/O. They are closer to the physical disk than the buffer
>> cache. /dev/disk nodes, on the other hand, are buffered block-special
>> devices and are used primarily by the kernel's filesystem code.
>
> So the right thing to do would not be just to set need_alignment, but to
> probe it like we do on Linux for BDRV_O_NO_CACHE.
>
> I'm okay with doing the simple thing, but it needs a comment for 
> non-BSDers.

 So, what we have to do, in our case, for MacOS X cdrom, is something like:

 ... GetBSDPath ...
 ...
 if (flags & BDRV_O_NOCACHE) {
 strcat(bsdPath, "r");
 }
 ...
>>>
>>> I would avoid such magic. What we could do is rejecting /dev/rdisk nodes
>>> without BDRV_O_NOCACHE.
>>
>> It's not how it works...
>>
>> Look in hdev_open().
>>
>> If user provides /dev/cdrom on the command line, in the case of MacOS X,
>> QEMU searches for a cdrom drive in the system and set filename to
>> /dev/rdiskX according to the result.
> 
> Oh, we're already playing such games... I guess you're right then.
> 
> It even seems to be not only for '/dev/cdrom', but for everything
> starting with this string. Does anyone know what's the reason for that?


The reason is at least 10 years old:

commit 3b0d4f61c917c4612b561d75b33a11f4da00738b
Author: bellard 
Date:   Sun Oct 30 18:30:10 2005 +

OS X: support for the built in CD-ROM drive (Mike Kronenberg)




> 
> Also, I guess before doing strcat() on bsdPath, we should check the
> buffer length...
> 
>> Perhaps this part should be removed.
>>
>> But if we just want to correct the bug, we must not set filename to
>> /dev/rdiskX if NOCACHE is not set but to /dev/diskX
>>
>> It's the aim of this change.
> 
> Yes, that looks right.
> 
> Kevin
> 



Re: [Qemu-block] [PATCH] raw-posix.c: remove raw device access for cdrom

2015-07-08 Thread Kevin Wolf
Am 08.07.2015 um 14:56 hat Programmingkid geschrieben:
> 
> On Jul 8, 2015, at 7:01 AM, Kevin Wolf wrote:
> 
> > Am 08.07.2015 um 12:47 hat Laurent Vivier geschrieben:
> >> 
> >> 
> >> On 08/07/2015 12:31, Kevin Wolf wrote:
> >>> Am 02.07.2015 um 16:18 hat Laurent Vivier geschrieben:
>  
>  
>  On 02/07/2015 16:03, Paolo Bonzini wrote:
> > 
> > 
> > On 02/07/2015 15:58, Laurent Vivier wrote:
> >> Since any /dev entry can be treated as a raw disk image, it is worth
> >> noting which devices can be accessed when and how. /dev/rdisk nodes are
> >> character-special devices, but are "raw" in the BSD sense and force
> >> block-aligned I/O. They are closer to the physical disk than the buffer
> >> cache. /dev/disk nodes, on the other hand, are buffered block-special
> >> devices and are used primarily by the kernel's filesystem code.
> > 
> > So the right thing to do would not be just to set need_alignment, but to
> > probe it like we do on Linux for BDRV_O_NO_CACHE.
> > 
> > I'm okay with doing the simple thing, but it needs a comment for 
> > non-BSDers.
>  
>  So, what we have to do, in our case, for MacOS X cdrom, is something 
>  like:
>  
>  ... GetBSDPath ...
>  ...
> if (flags & BDRV_O_NOCACHE) {
> strcat(bsdPath, "r");
> }
>  ...
> >>> 
> >>> I would avoid such magic. What we could do is rejecting /dev/rdisk nodes
> >>> without BDRV_O_NOCACHE.
> >> 
> >> It's not how it works...
> >> 
> >> Look in hdev_open().
> >> 
> >> If user provides /dev/cdrom on the command line, in the case of MacOS X,
> >> QEMU searches for a cdrom drive in the system and set filename to
> >> /dev/rdiskX according to the result.
> > 
> > Oh, we're already playing such games... I guess you're right then.
> > 
> > It even seems to be not only for '/dev/cdrom', but for everything
> > starting with this string. Does anyone know what's the reason for that?
> > 
> > Also, I guess before doing strcat() on bsdPath, we should check the
> > buffer length...
> 
> By buffer, do you mean the bsdPath variable?

Yes. In theory, bsdPath could be completely filled with the path
returned by GetBSDPath() because we pass sizeof(bsdPath) as maxPathSize.
Appending "s0" would then overflow the buffer.

I'll admit that this is rather unlikely to happen, but being careful
when dealing with strings can never hurt.

Kevin
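
The bounds check suggested here could be sketched like this (a hypothetical helper, not the actual raw-posix.c code): only append a suffix such as "r" or "s0" when the result, including the terminating NUL, still fits in the destination buffer.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: append suffix to dst only if the combined string
 * plus the NUL terminator fits in dst_size bytes; otherwise leave dst
 * untouched and report failure. */
static bool append_checked(char *dst, size_t dst_size, const char *suffix)
{
    if (strlen(dst) + strlen(suffix) + 1 > dst_size) {
        return false;           /* would overflow the buffer */
    }
    strcat(dst, suffix);
    return true;
}
```

The caller can then fall back to the unsuffixed path (or fail the open) instead of silently overrunning bsdPath.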



Re: [Qemu-block] [PATCH] raw-posix.c: remove raw device access for cdrom

2015-07-08 Thread Programmingkid

On Jul 8, 2015, at 9:11 AM, Kevin Wolf wrote:

> Am 08.07.2015 um 14:56 hat Programmingkid geschrieben:
>> 
>> On Jul 8, 2015, at 7:01 AM, Kevin Wolf wrote:
>> 
>>> Am 08.07.2015 um 12:47 hat Laurent Vivier geschrieben:
 
 
 On 08/07/2015 12:31, Kevin Wolf wrote:
> Am 02.07.2015 um 16:18 hat Laurent Vivier geschrieben:
>> 
>> 
>> On 02/07/2015 16:03, Paolo Bonzini wrote:
>>> 
>>> 
>>> On 02/07/2015 15:58, Laurent Vivier wrote:
 Since any /dev entry can be treated as a raw disk image, it is worth
 noting which devices can be accessed when and how. /dev/rdisk nodes are
 character-special devices, but are "raw" in the BSD sense and force
 block-aligned I/O. They are closer to the physical disk than the buffer
 cache. /dev/disk nodes, on the other hand, are buffered block-special
 devices and are used primarily by the kernel's filesystem code.
>>> 
>>> So the right thing to do would not be just to set need_alignment, but to
>>> probe it like we do on Linux for BDRV_O_NO_CACHE.
>>> 
>>> I'm okay with doing the simple thing, but it needs a comment for 
>>> non-BSDers.
>> 
>> So, what we have to do, in our case, for MacOS X cdrom, is something 
>> like:
>> 
>> ... GetBSDPath ...
>> ...
>>   if (flags & BDRV_O_NOCACHE) {
>>   strcat(bsdPath, "r");
>>   }
>> ...
> 
> I would avoid such magic. What we could do is rejecting /dev/rdisk nodes
> without BDRV_O_NOCACHE.
 
 It's not how it works...
 
 Look in hdev_open().
 
 If user provides /dev/cdrom on the command line, in the case of MacOS X,
 QEMU searches for a cdrom drive in the system and set filename to
 /dev/rdiskX according to the result.
>>> 
>>> Oh, we're already playing such games... I guess you're right then.
>>> 
>>> It even seems to be not only for '/dev/cdrom', but for everything
>>> starting with this string. Does anyone know what's the reason for that?
>>> 
>>> Also, I guess before doing strcat() on bsdPath, we should check the
>>> buffer length...
>> 
>> By buffer, do you mean the bsdPath variable?
> 
> Yes. In theory, bsdPath could be completely filled with the path
> returned by GetBSDPath() because we pass sizeof(bsdPath) as maxPathSize.
> Appending "s0" would then overflow the buffer.
> 
> I'll admit that this is rather unlikely to happen, but being careful
> when dealing with strings can never hurt.
> 
> Kevin

Ok, after my newest patch has been reviewed, I will add the size checking code. 


Re: [Qemu-block] [PATCH v3 1/2] Add a simple mechanism to protect against error message floods.

2015-07-08 Thread Kevin Wolf
Am 08.07.2015 um 14:50 hat Richard W.M. Jones geschrieben:
> You can now write code like this:
> 
>   FLOOD_COUNTER(errcount, 100);
>   ...
>   NO_FLOOD(errcount,
>error_report("oops, something bad happened"),
>error_report("further errors suppressed"));
> 
> which would print the "oops, ..." error message up to 100 times,
> followed by the "further errors suppressed" message once, and then
> nothing else.
> 
> Signed-off-by: Richard W.M. Jones 
> ---
>  include/qemu/no-flood.h | 34 ++
>  1 file changed, 34 insertions(+)
>  create mode 100644 include/qemu/no-flood.h
> 
> diff --git a/include/qemu/no-flood.h b/include/qemu/no-flood.h
> new file mode 100644
> index 000..90a24da
> --- /dev/null
> +++ b/include/qemu/no-flood.h
> @@ -0,0 +1,34 @@
> +/*
> + * Suppress error floods.
> + *
> + * Copyright (C) 2015 Red Hat, Inc.
> + *
> + * Author: Richard W.M. Jones 
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +#ifndef __QEMU_NO_FLOOD_H
> +#define __QEMU_NO_FLOOD_H 1
> +
> +#include "qemu/atomic.h"
> +
> +/* Suggested initial value is 100 */
> +#define FLOOD_COUNTER(counter_name, initial) \
> +  static int counter_name = (initial)
> +
> +#define NO_FLOOD(counter_name, code, suppress_message) \
> +  do { \
> +int t_##counter_name = atomic_fetch_add(&counter_name, 0); \
> +if (t_##counter_name > 0) {\
> +code;  \
> +if (t_##counter_name == 1) {   \
> +suppress_message;  \
> +}  \
> +atomic_dec(&counter_name); \
> +}  \
> +} while (0)
> +
> +#endif

I don't think that you actually achieve thread safety with the atomic
operations you use here. The counter may change between your check and
the atomic_dec().

It doesn't matter for curl because we don't get concurrency there
anyway, but it might confuse others if it looks as if it was thread safe
when it fact it isn't.

So I'd suggest that if you have an easy fix for thread safety, do it,
but otherwise don't bother and just remove the atomics.

Other than that, I like this series.

Kevin



[Qemu-block] [PATCH v4] block/curl: Don't lose original error when a connection fails.

2015-07-08 Thread Richard W.M. Jones
Currently if qemu is connected to a curl source (eg. web server), and
the web server fails / times out / dies, you always see a bogus EIO
"Input/output error".

For example, choose a large file located on any local webserver which
you control:

  $ qemu-img convert -p http://example.com/large.iso /tmp/test

Once it starts copying the file, stop the webserver and you will see
qemu-img fail with:

  qemu-img: error while reading sector 61440: Input/output error

This patch does two things: Firstly print the actual error from curl
so it doesn't get lost.  Secondly, change EIO to EPROTO.  EPROTO is a
POSIX.1 compatible errno which more accurately reflects that there was
a protocol error, rather than some kind of hardware failure.

After this patch is applied, the error changes to:

  $ qemu-img convert -p http://example.com/large.iso /tmp/test
  qemu-img: curl: transfer closed with 469989 bytes remaining to read
  qemu-img: error while reading sector 16384: Protocol error

Signed-off-by: Richard W.M. Jones 
Reviewed-by: Stefan Hajnoczi 
---
 block/curl.c | 15 ++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/block/curl.c b/block/curl.c
index 3a2b63e..032cc8a 100644
--- a/block/curl.c
+++ b/block/curl.c
@@ -22,6 +22,7 @@
  * THE SOFTWARE.
  */
 #include "qemu-common.h"
+#include "qemu/error-report.h"
 #include "block/block_int.h"
 #include "qapi/qmp/qbool.h"
 #include "qapi/qmp/qstring.h"
@@ -298,6 +299,18 @@ static void curl_multi_check_completion(BDRVCURLState *s)
 /* ACBs for successful messages get completed in curl_read_cb */
 if (msg->data.result != CURLE_OK) {
 int i;
+static int errcount = 100;
+
+/* Don't lose the original error message from curl, since
+ * it contains extra data.
+ */
+if (errcount > 0) {
+error_report("curl: %s", state->errmsg);
+if (--errcount == 0) {
+error_report("curl: further errors suppressed");
+}
+}
+
 for (i = 0; i < CURL_NUM_ACB; i++) {
 CURLAIOCB *acb = state->acb[i];
 
@@ -305,7 +318,7 @@ static void curl_multi_check_completion(BDRVCURLState *s)
 continue;
 }
 
-acb->common.cb(acb->common.opaque, -EIO);
+acb->common.cb(acb->common.opaque, -EPROTO);
 qemu_aio_unref(acb);
 state->acb[i] = NULL;
 }
-- 
2.3.1




[Qemu-block] [PATCH v4] block/curl: Don't lose original error when a connection

2015-07-08 Thread Richard W.M. Jones
Since v3:

Ditch the generalized anti-flood protection, and just use a simple variable.

Rich.




Re: [Qemu-block] [PATCH v3 1/2] Add a simple mechanism to protect against error message floods.

2015-07-08 Thread Paolo Bonzini


On 08/07/2015 15:26, Kevin Wolf wrote:
>> +/* Suggested initial value is 100 */
>> +#define FLOOD_COUNTER(counter_name, initial) \
>> +  static int counter_name = (initial)

No "static" here.

>> +#define NO_FLOOD(counter_name, code, suppress_message) \
>> +  do { \
>> +int t_##counter_name = atomic_fetch_add(&counter_name, 0); \
>> +if (t_##counter_name > 0) {\
>> +code;  \
>> +if (t_##counter_name == 1) {   \
>> +suppress_message;  \
>> +}  \
>> +atomic_dec(&counter_name); \
>> +}  \
>> +} while (0)

What you want is a cmpxchg loop like:

   int old;
   do
   old = atomic_read(&counter_name);
   while (old > 0 && atomic_cmpxchg(&counter_name, old, old - 1) != old);
   if (old > 0) {
   code;
   if (old == 1) {
   suppress_message;
   }
   }

but I would wrap it with a simple API like

   void error_report_limited(int *counter, const char *s, ...);
   void error_report_err_limited(int *counter, Error *err);

instead of your complex NO_FLOOD macro.

Paolo
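
The loop above can be sketched with standard C11 atomics (QEMU's own atomic.h wrappers differ slightly; the function name here is hypothetical). Each caller atomically claims one "ticket" from the counter, so two threads can never both observe the same value and over-report.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical sketch of the cmpxchg loop: decrement the counter only
 * if it is still positive.  Returns true if the caller may print its
 * message; *last is set when this was the final permitted message, so
 * the caller can also print "further errors suppressed". */
static bool flood_counter_claim(atomic_int *counter, bool *last)
{
    int old = atomic_load(counter);

    /* On failure, compare_exchange refreshes old with the current
     * value, so the loop retries against fresh state. */
    while (old > 0 &&
           !atomic_compare_exchange_strong(counter, &old, old - 1)) {
    }
    if (old <= 0) {
        return false;           /* budget exhausted, suppress message */
    }
    *last = (old == 1);
    return true;
}
```

Wrapping this in an `error_report_limited(int *counter, const char *fmt, ...)` style API, as suggested, would keep the atomics out of every call site.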



Re: [Qemu-block] [Qemu-devel] [PATCH 2/2] ahci: fix signature generation

2015-07-08 Thread John Snow


On 07/08/2015 08:56 AM, Stefan Hajnoczi wrote:
> On Tue, Jul 07, 2015 at 01:15:02PM -0400, John Snow wrote:
>>
>>
>> On 07/07/2015 04:49 AM, Stefan Hajnoczi wrote:
>>> On Mon, Jul 06, 2015 at 05:49:52PM -0400, John Snow wrote:
 The initial register device-to-host FIS no longer needs to specially
 set certain fields, as these can be handled generically by setting those
 fields explicitly with the signatures we want at port reset time.

 (1) Signatures are decomposed into their four component registers and
 set upon (AHCI) port reset.
 (2) the signature cache register is no longer set manually per-each
 device type, but instead just once during ahci_init_d2h.

 Signed-off-by: John Snow 
 ---
  hw/ide/ahci.c | 33 -
  1 file changed, 20 insertions(+), 13 deletions(-)
>>>
>>> I see two code paths that call ahci_init_d2h().  Either
>>> ahci_reset_port() does it (if a block device is attached) or it's called
>>> when the guest writes to the PORT_CMD register.
>>>
>>> I'm not sure the latter works.  The signature doesn't seem to be set
>>> anywhere.
>>>
>>> Any ideas?
> ...
>> So on initial boot, we call ahci_init_d2h and set pr->sig, then call
>> ahci_write_fis_d2h. However, since the FIS RX engine (PxFRE) is off, we
>> don't actually generate the FIS because there's nowhere to store it.
> 
> My question is about the ide_state->blk == NULL case:
> 
> ahci_reset_port() is contradictory:
> 
> static void ahci_reset_port(AHCIState *s, int port)
> {
> ...
> ide_state = &s->dev[port].port.ifs[0];
> if (!ide_state->blk) {
> return;
> }
> 
> ...
> 
> s->dev[port].port_state = STATE_RUN;
> if (!ide_state->blk) { <-- deadcode?
> pr->sig = 0;
> ide_state->status = SEEK_STAT | WRERR_STAT;
> }
> 
> Does code after the first "if (!ide_state->blk)" in ahci_reset_port()
> ever execute in a drive hotplug scenario?
> 
> If it doesn't execute then sig is never filled in.
> 
> Your patch does not include a regression but either something is broken
> here or I don't understand the code.
> 

I'm sorry, I misunderstood you...

I haven't really played with hotplugging much, so I can't confirm that it
works. I'll add it to the list.

Actually, since I need to start focusing on non-legacy devices, I'll
start a wiki page of all the bugs and quirks I know about and that way
if I forget to get back to it (or my plane disappears over the Bermuda
Triangle) my understanding of existing problems will be documented.

I'll stage patch #1 here (reviewed by Kevin) for inclusion in 2.4. #2 is
also a bug fix, but it's more subtle, isn't known to break anything,
and could benefit from a more comprehensive fix, so I'll leave
it for now.

Thanks,
--js



[Qemu-block] [PATCH for-2.5 2/4] block-backend: add blk_get_max_iov()

2015-07-08 Thread Stefan Hajnoczi
Add a function to query BlockLimits.max_iov.

Signed-off-by: Stefan Hajnoczi 
---
 block/block-backend.c  | 5 +
 include/sysemu/block-backend.h | 1 +
 2 files changed, 6 insertions(+)

diff --git a/block/block-backend.c b/block/block-backend.c
index aee8a12..5108aec 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -777,6 +777,11 @@ int blk_get_max_transfer_length(BlockBackend *blk)
 return blk->bs->bl.max_transfer_length;
 }
 
+int blk_get_max_iov(BlockBackend *blk)
+{
+return blk->bs->bl.max_iov;
+}
+
 void blk_set_guest_block_size(BlockBackend *blk, int align)
 {
 bdrv_set_guest_block_size(blk->bs, align);
diff --git a/include/sysemu/block-backend.h b/include/sysemu/block-backend.h
index 8fc960f..c114440 100644
--- a/include/sysemu/block-backend.h
+++ b/include/sysemu/block-backend.h
@@ -135,6 +135,7 @@ void blk_lock_medium(BlockBackend *blk, bool locked);
 void blk_eject(BlockBackend *blk, bool eject_flag);
 int blk_get_flags(BlockBackend *blk);
 int blk_get_max_transfer_length(BlockBackend *blk);
+int blk_get_max_iov(BlockBackend *blk);
 void blk_set_guest_block_size(BlockBackend *blk, int align);
 void *blk_blockalign(BlockBackend *blk, size_t size);
 bool blk_op_is_blocked(BlockBackend *blk, BlockOpType op, Error **errp);
-- 
2.4.3




[Qemu-block] [PATCH for-2.5 1/4] block: add BlockLimits.max_iov field

2015-07-08 Thread Stefan Hajnoczi
The maximum number of struct iovec elements depends on the
BlockDriverState.  The raw-posix protocol has a maximum of IOV_MAX but
others could have different values.

Instead of assuming raw-posix and hardcoding IOV_MAX in several places,
put the limit into BlockLimits.

Cc: Peter Lieven 
Suggested-by: Kevin Wolf 
Signed-off-by: Stefan Hajnoczi 
---
Peter Lieven: I think the SCSI LUN level does not have a maximum
scatter-gather segments constraint.  That is probably only at the HBA
level.  CCed you anyway in case you think block/iscsi.c should set the
max_iov field.

Kevin: The default is now INT_MAX.  This means non-raw-posix users will
now be able to merge more requests than before.  They were limited to
IOV_MAX previously.  This could expose limits in other BlockDrivers
which we weren't aware of...
---
 block/io.c| 3 +++
 block/raw-posix.c | 1 +
 include/block/block_int.h | 3 +++
 3 files changed, 7 insertions(+)

diff --git a/block/io.c b/block/io.c
index e295992..6750de6 100644
--- a/block/io.c
+++ b/block/io.c
@@ -165,9 +165,11 @@ void bdrv_refresh_limits(BlockDriverState *bs, Error 
**errp)
 bs->bl.max_transfer_length = bs->file->bl.max_transfer_length;
 bs->bl.min_mem_alignment = bs->file->bl.min_mem_alignment;
 bs->bl.opt_mem_alignment = bs->file->bl.opt_mem_alignment;
+bs->bl.max_iov = bs->file->bl.max_iov;
 } else {
 bs->bl.min_mem_alignment = 512;
 bs->bl.opt_mem_alignment = getpagesize();
+bs->bl.max_iov = INT_MAX;
 }
 
 if (bs->backing_hd) {
@@ -188,6 +190,7 @@ void bdrv_refresh_limits(BlockDriverState *bs, Error **errp)
 bs->bl.min_mem_alignment =
 MAX(bs->bl.min_mem_alignment,
 bs->backing_hd->bl.min_mem_alignment);
+bs->bl.max_iov = MIN(bs->bl.max_iov, bs->backing_hd->bl.max_iov);
 }
 
 /* Then let the driver override it */
diff --git a/block/raw-posix.c b/block/raw-posix.c
index cbe6574..faa6ae0 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -735,6 +735,7 @@ static void raw_refresh_limits(BlockDriverState *bs, Error 
**errp)
 raw_probe_alignment(bs, s->fd, errp);
 bs->bl.min_mem_alignment = s->buf_align;
 bs->bl.opt_mem_alignment = MAX(s->buf_align, getpagesize());
+bs->bl.max_iov = IOV_MAX; /* limit comes from preadv()/pwritev()/etc */
 }
 
 static int check_for_dasd(int fd)
diff --git a/include/block/block_int.h b/include/block/block_int.h
index b0476fc..767e83d 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -315,6 +315,9 @@ typedef struct BlockLimits {
 
 /* memory alignment for bounce buffer */
 size_t opt_mem_alignment;
+
+/* maximum number of iovec elements */
+int max_iov;
 } BlockLimits;
 
 typedef struct BdrvOpBlocker BdrvOpBlocker;
-- 
2.4.3




[Qemu-block] [PATCH for-2.5 0/4] block: replace IOV_MAX with BlockLimits.max_iov

2015-07-08 Thread Stefan Hajnoczi
This series depends on "[PATCH] block/mirror: limit qiov to IOV_MAX elements".

IOV_MAX has been hardcoded in several places because preadv()/pwritev()/etc.
refuse to accept more than IOV_MAX iovec elements per call.  That's really an
implementation detail of raw-posix, and we shouldn't spread it across the codebase.

This series adds BlockLimits.max_iov (similar to BlockLimits.max_transfer_len
and other fields).  By default it is INT_MAX but raw-posix sets it to IOV_MAX.

Current IOV_MAX users are converted to blk_get_max_iov().

By the way, the IOV_MAX vs BlockLimits.max_iov naming is slightly confusing but
follows the BlockLimits field naming convention.  Once IOV_MAX is banished no
one will be bothered by the reverse naming anymore.
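The way the series propagates the limit can be illustrated with a stripped-down model of bdrv_refresh_limits(): inherit max_iov from the protocol layer, default to INT_MAX when there is none, and take the minimum with the backing file's limit. The types below are simplified stand-ins for illustration, not the real BlockDriverState:

```c
#include <limits.h>

/* Simplified model of the BlockLimits.max_iov propagation in this
 * series; field and type names are pared down for illustration. */
typedef struct {
    int max_iov;            /* maximum iovec elements; INT_MAX = no limit */
} BlockLimits;

typedef struct BDS {
    BlockLimits bl;
    struct BDS *file;       /* protocol layer, e.g. raw-posix */
    struct BDS *backing_hd; /* backing file, if any */
} BDS;

#define MIN(a, b) ((a) < (b) ? (a) : (b))

static void refresh_max_iov(BDS *bs)
{
    if (bs->file) {
        bs->bl.max_iov = bs->file->bl.max_iov;  /* inherit from protocol */
    } else {
        bs->bl.max_iov = INT_MAX;               /* no driver-imposed limit */
    }
    if (bs->backing_hd) {
        /* a request touching both layers must satisfy the stricter bound */
        bs->bl.max_iov = MIN(bs->bl.max_iov, bs->backing_hd->bl.max_iov);
    }
}
```

raw-posix would then override the inherited value with IOV_MAX in its own refresh hook, exactly as patch 1 does.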

Suggested-by: Kevin Wolf 

Stefan Hajnoczi (4):
  block: add BlockLimits.max_iov field
  block-backend: add blk_get_max_iov()
  block: replace IOV_MAX with BlockLimits.max_iov
  block/mirror: replace IOV_MAX with blk_get_max_iov()

 block/block-backend.c  | 5 +
 block/io.c | 6 +-
 block/mirror.c | 6 --
 block/raw-posix.c  | 1 +
 hw/block/virtio-blk.c  | 2 +-
 include/block/block_int.h  | 3 +++
 include/sysemu/block-backend.h | 1 +
 7 files changed, 20 insertions(+), 4 deletions(-)

-- 
2.4.3




[Qemu-block] [PATCH for-2.5 4/4] block/mirror: replace IOV_MAX with blk_get_max_iov()

2015-07-08 Thread Stefan Hajnoczi
Use blk_get_max_iov() instead of hardcoding IOV_MAX, which may not apply
to all BlockDrivers.

Signed-off-by: Stefan Hajnoczi 
---
 block/mirror.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index 985ad00..0f4445c 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -160,11 +160,13 @@ static void mirror_read_complete(void *opaque, int ret)
 static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
 {
 BlockDriverState *source = s->common.bs;
-int nb_sectors, sectors_per_chunk, nb_chunks;
+int nb_sectors, sectors_per_chunk, nb_chunks, max_iov;
 int64_t end, sector_num, next_chunk, next_sector, hbitmap_next_sector;
 uint64_t delay_ns = 0;
 MirrorOp *op;
 
+max_iov = MIN(source->bl.max_iov, s->target->bl.max_iov);
+
 s->sector_num = hbitmap_iter_next(&s->hbi);
 if (s->sector_num < 0) {
 bdrv_dirty_iter_init(s->dirty_bitmap, &s->hbi);
@@ -241,7 +243,7 @@ static uint64_t coroutine_fn 
mirror_iteration(MirrorBlockJob *s)
 trace_mirror_break_buf_busy(s, nb_chunks, s->in_flight);
 break;
 }
-if (IOV_MAX < nb_chunks + added_chunks) {
+if (max_iov < nb_chunks + added_chunks) {
 trace_mirror_break_iov_max(s, nb_chunks, added_chunks);
 break;
 }
-- 
2.4.3




[Qemu-block] [PATCH for-2.5 3/4] block: replace IOV_MAX with BlockLimits.max_iov

2015-07-08 Thread Stefan Hajnoczi
Request merging must not result in a huge request that exceeds the
maximum number of iovec elements.  Use BlockLimits.max_iov instead of
hardcoding IOV_MAX.

Signed-off-by: Stefan Hajnoczi 
---
 block/io.c| 3 ++-
 hw/block/virtio-blk.c | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/block/io.c b/block/io.c
index 6750de6..baffd02 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1818,7 +1818,8 @@ static int multiwrite_merge(BlockDriverState *bs, 
BlockRequest *reqs,
 merge = 1;
 }
 
-if (reqs[outidx].qiov->niov + reqs[i].qiov->niov + 1 > IOV_MAX) {
+if (reqs[outidx].qiov->niov + reqs[i].qiov->niov + 1 >
+bs->bl.max_iov) {
 merge = 0;
 }
 
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index 6aefda4..a186956 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -407,7 +407,7 @@ void virtio_blk_submit_multireq(BlockBackend *blk, 
MultiReqBuffer *mrb)
 bool merge = true;
 
 /* merge would exceed maximum number of IOVs */
-if (niov + req->qiov.niov > IOV_MAX) {
+if (niov + req->qiov.niov > blk_get_max_iov(blk)) {
 merge = false;
 }
 
-- 
2.4.3
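As an aside, the merge guard shared by both hunks can be sketched in isolation. The `+ 1` headroom mirrors the extra element multiwrite_merge() reserves when combining two requests; the helper name is illustrative, not a QEMU function:

```c
#include <stdbool.h>

/* Two adjacent requests may be merged only if the combined iovec count,
 * plus one element of headroom for the merge itself, stays within the
 * device's max_iov limit (IOV_MAX for raw-posix, INT_MAX otherwise). */
static bool can_merge_requests(int niov_a, int niov_b, int max_iov)
{
    return niov_a + niov_b + 1 <= max_iov;
}
```

The only behavioral change from the old code is the bound itself: `bs->bl.max_iov` (or `blk_get_max_iov()`) instead of the hardcoded IOV_MAX.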




Re: [Qemu-block] [Qemu-devel] [PATCH COLO-BLOCK v7 00/17] Block replication for continuous checkpoints

2015-07-08 Thread Michael R. Hines

On 07/07/2015 08:38 PM, Wen Congyang wrote:

On 07/08/2015 12:56 AM, Michael R. Hines wrote:

On 07/07/2015 04:23 AM, Paolo Bonzini wrote:

On 07/07/2015 11:13, Dr. David Alan Gilbert wrote:

This log is very strange. The NBD client connects to the NBD server, and the
NBD server wants to read data from the NBD client, but reading fails. It seems
that the connection is closed unexpectedly. Can you give me more of the log
and tell me how you use it?

That was the same failure I was getting.   I think it's that the NBD server and 
client are in different
modes, with one of them expecting the export.

nbd_server_add always expects the export.

Paolo


OK, Wen, so your wiki finally does reflect this, but now we're back to the "export 
not found" error.

Again, here's the exact command line:

1. First on the secondary VM:

qemu-system-x86_64 .snip...  -drive 
if=none,driver=qcow2,file=foo.qcow2,id=mc1,cache=none,aio=native -drive 
if=virtio,driver=replication,mode=secondary,throttling.bps-total-max=7000,file.file.filename=active_disk.qcow2,file.driver=qcow2,file.backing.file.filename=hidden_disk.qcow2,file.backing.driver=qcow2,file.backing.allow-write-backing-file=on,file.backing.backing.backing_reference=mc1

2. Then, then HMP commands:

nbd_server_start 0:6262
nbd_server_add -w mc1

3. Then the primary VM:

qemu-system-x86_64 .snip...  -drive 
if=virtio,driver=quorum,read-pattern=fifo,no-connect=on,children.0.file.filename=bar.qcow2,children.0.driver=qcow2,children.1.file.driver=nbd,children.1.file.host=127.0.0.1,children.1.file.port=6262,children.1.driver=replication,children.1.mode=primary,children.1.ignore-errors=on

With the error: Server requires an export name

*but* your wiki has no export name on the primary VM side, so I added back the 
export name from your old wiki:

qemu-system-x86_64 .snip...  -drive 
if=virtio,driver=quorum,read-pattern=fifo,no-connect=on,children.0.file.filename=bar.qcow2,children.0.driver=qcow2,children.1.file.driver=nbd,children.1.file.export=mc1,children.1.file.host=127.0.0.1,children.1.file.port=6262,children.1.driver=replication,children.1.mode=primary,children.1.ignore-errors=on:
 Failed to read export length


Hmm, I think you are using the v7 version. Is that right?
In this version, the correct usage for the primary QEMU is:
1. qemu-system-x86_64 .snip...  -drive 
if=virtio,driver=quorum,read-pattern=fifo,id=disk1,
   
children.0.file.filename=bar.qcow2,children.0.driver=qcow2,

Then run the HMP monitor command (it should be run after nbd_server_add on
the secondary QEMU):
child_add disk1 
child.driver=replication,child.mode=primary,child.file.host=127.0.0.1,child.file.port=6262,child.file.export=mc1,child.file.driver=nbd,child.ignore-errors=on

Thanks
Wen Congyang


And server now says:

nbd.c:nbd_handle_export_name():L416: export not found
nbd.c:nbd_send_negotiate():L562: option negotiation failed

.





OK, I'm totally confused at this point. =)

Maybe it would be easier to just wait for your next clean patchset + 
documentation which is all consistent with each other.


- Michael




[Qemu-block] [PATCH for-2.4 0/5] block: Fix backing file child when modifying graph

2015-07-08 Thread Kevin Wolf
This series is extracted from my work towards removing bdrv_swap(),
which is targeted for 2.5. It contains a fix for dangling pointers when
modifying the BDS graph and its dependencies.

I didn't bother to split patches of which only a part is required, nor
did I remove references to the future bdrv_swap removal (after all, it
will happen, even if the patches will be delayed by the 2.4 freeze).

Specifically, bdrv_open/unref_child() are as yet unused in this series; the
respective patches are included because of bdrv_attach/detach_child().

Kevin Wolf (5):
  block: Move bdrv_attach_child() calls up the call chain
  block: Introduce bdrv_open_child()
  block: Introduce bdrv_unref_child()
  block: Reorder cleanups in bdrv_close()
  block: Fix backing file child when modifying graph

 block.c   | 144 --
 include/block/block.h |   7 +++
 include/block/block_int.h |   1 +
 3 files changed, 108 insertions(+), 44 deletions(-)

-- 
1.8.3.1




[Qemu-block] [PATCH for-2.4 2/5] block: Introduce bdrv_open_child()

2015-07-08 Thread Kevin Wolf
It is the same as bdrv_open_image(), except that it returns not just
success or failure, but also the newly created BdrvChild object for the new
child node.

As the BdrvChild object already contains a BlockDriverState pointer (and
this is supposed to become the only pointer so that bdrv_append() and
friends can just change a single pointer in BdrvChild), the pbs
parameter is removed for bdrv_open_child().

Signed-off-by: Kevin Wolf 
---
 block.c   | 71 +--
 include/block/block.h |  6 +
 2 files changed, 58 insertions(+), 19 deletions(-)

diff --git a/block.c b/block.c
index 0398bff..029feeb 100644
--- a/block.c
+++ b/block.c
@@ -1102,9 +1102,9 @@ static int bdrv_fill_options(QDict **options, const char 
**pfilename,
 return 0;
 }
 
-static void bdrv_attach_child(BlockDriverState *parent_bs,
-  BlockDriverState *child_bs,
-  const BdrvChildRole *child_role)
+static BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
+BlockDriverState *child_bs,
+const BdrvChildRole *child_role)
 {
 BdrvChild *child = g_new(BdrvChild, 1);
 *child = (BdrvChild) {
@@ -1113,6 +1113,8 @@ static void bdrv_attach_child(BlockDriverState *parent_bs,
 };
 
 QLIST_INSERT_HEAD(&parent_bs->children, child, next);
+
+return child;
 }
 
 void bdrv_set_backing_hd(BlockDriverState *bs, BlockDriverState *backing_hd)
@@ -1229,7 +1231,7 @@ free_exit:
  * device's options.
  *
  * If allow_none is true, no image will be opened if filename is false and no
- * BlockdevRef is given. *pbs will remain unchanged and 0 will be returned.
+ * BlockdevRef is given. NULL will be returned, but errp remains unset.
  *
  * bdrev_key specifies the key for the image's BlockdevRef in the options 
QDict.
  * That QDict has to be flattened; therefore, if the BlockdevRef is a QDict
@@ -1237,21 +1239,20 @@ free_exit:
  * BlockdevRef.
  *
  * The BlockdevRef will be removed from the options QDict.
- *
- * To conform with the behavior of bdrv_open(), *pbs has to be NULL.
  */
-int bdrv_open_image(BlockDriverState **pbs, const char *filename,
-QDict *options, const char *bdref_key,
-BlockDriverState* parent, const BdrvChildRole *child_role,
-bool allow_none, Error **errp)
+BdrvChild *bdrv_open_child(const char *filename,
+   QDict *options, const char *bdref_key,
+   BlockDriverState* parent,
+   const BdrvChildRole *child_role,
+   bool allow_none, Error **errp)
 {
+BdrvChild *c = NULL;
+BlockDriverState *bs;
 QDict *image_options;
 int ret;
 char *bdref_key_dot;
 const char *reference;
 
-assert(pbs);
-assert(*pbs == NULL);
 assert(child_role != NULL);
 
 bdref_key_dot = g_strdup_printf("%s.", bdref_key);
@@ -1260,28 +1261,60 @@ int bdrv_open_image(BlockDriverState **pbs, const char 
*filename,
 
 reference = qdict_get_try_str(options, bdref_key);
 if (!filename && !reference && !qdict_size(image_options)) {
-if (allow_none) {
-ret = 0;
-} else {
+if (!allow_none) {
 error_setg(errp, "A block device must be specified for \"%s\"",
bdref_key);
-ret = -EINVAL;
 }
 QDECREF(image_options);
 goto done;
 }
 
-ret = bdrv_open_inherit(pbs, filename, reference, image_options, 0,
+bs = NULL;
+ret = bdrv_open_inherit(&bs, filename, reference, image_options, 0,
 parent, child_role, NULL, errp);
 if (ret < 0) {
 goto done;
 }
 
-bdrv_attach_child(parent, *pbs, child_role);
+c = bdrv_attach_child(parent, bs, child_role);
 
 done:
 qdict_del(options, bdref_key);
-return ret;
+return c;
+}
+
+/*
+ * This is a version of bdrv_open_child() that returns 0/-EINVAL instead of
+ * a BdrvChild object.
+ *
+ * If allow_none is true, no image will be opened if filename is false and no
+ * BlockdevRef is given. *pbs will remain unchanged and 0 will be returned.
+ *
+ * To conform with the behavior of bdrv_open(), *pbs has to be NULL.
+ */
+int bdrv_open_image(BlockDriverState **pbs, const char *filename,
+QDict *options, const char *bdref_key,
+BlockDriverState* parent, const BdrvChildRole *child_role,
+bool allow_none, Error **errp)
+{
+Error *local_err = NULL;
+BdrvChild *c;
+
+assert(pbs);
+assert(*pbs == NULL);
+
+c = bdrv_open_child(filename, options, bdref_key, parent, child_role,
+allow_none, &local_err);
+if (local_err) {
+error_propagate(errp, local_err);
+return -EINVAL;
+}
+
+if (c != NULL) {
+*pbs = c->bs;
+}
+
+return 0;
 }
 
 int bdr

[Qemu-block] [PATCH for-2.4 4/5] block: Reorder cleanups in bdrv_close()

2015-07-08 Thread Kevin Wolf
Block drivers may still want to access their child nodes in their
.bdrv_close handler. If they unref and/or detach a child by themselves,
this should not result in a double free.

There is additional code for backing files, which are just a special
case of child nodes. The same applies for them.

Signed-off-by: Kevin Wolf 
---
 block.c | 14 --
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/block.c b/block.c
index b723cf2..d5c9f03 100644
--- a/block.c
+++ b/block.c
@@ -1901,6 +1901,14 @@ void bdrv_close(BlockDriverState *bs)
 if (bs->drv) {
 BdrvChild *child, *next;
 
+bs->drv->bdrv_close(bs);
+
+if (bs->backing_hd) {
+BlockDriverState *backing_hd = bs->backing_hd;
+bdrv_set_backing_hd(bs, NULL);
+bdrv_unref(backing_hd);
+}
+
 QLIST_FOREACH_SAFE(child, &bs->children, next, next) {
 /* TODO Remove bdrv_unref() from drivers' close function and use
  * bdrv_unref_child() here */
@@ -1910,12 +1918,6 @@ void bdrv_close(BlockDriverState *bs)
 bdrv_detach_child(child);
 }
 
-if (bs->backing_hd) {
-BlockDriverState *backing_hd = bs->backing_hd;
-bdrv_set_backing_hd(bs, NULL);
-bdrv_unref(backing_hd);
-}
-bs->drv->bdrv_close(bs);
 g_free(bs->opaque);
 bs->opaque = NULL;
 bs->drv = NULL;
-- 
1.8.3.1




[Qemu-block] [PATCH for-2.4 5/5] block: Fix backing file child when modifying graph

2015-07-08 Thread Kevin Wolf
This patch moves bdrv_attach_child() from the individual places that add
a backing file to a BDS to bdrv_set_backing_hd(), which is called by all
of them. It also adds bdrv_detach_child() there.

For normal operation (starting with one backing file chain and not
changing it until the topmost image is closed) and live snapshots, this
constitutes no change in behaviour.

For all other cases, this is a fix for the bug that the old backing file
was still referenced as a child, and the new one wasn't referenced.

Signed-off-by: Kevin Wolf 
---
 block.c   | 5 +++--
 include/block/block_int.h | 1 +
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/block.c b/block.c
index d5c9f03..d088ee0 100644
--- a/block.c
+++ b/block.c
@@ -1141,6 +1141,7 @@ void bdrv_set_backing_hd(BlockDriverState *bs, 
BlockDriverState *backing_hd)
 if (bs->backing_hd) {
 assert(bs->backing_blocker);
 bdrv_op_unblock_all(bs->backing_hd, bs->backing_blocker);
+bdrv_detach_child(bs->backing_child);
 } else if (backing_hd) {
 error_setg(&bs->backing_blocker,
"node is used as backing hd of '%s'",
@@ -1151,8 +1152,10 @@ void bdrv_set_backing_hd(BlockDriverState *bs, 
BlockDriverState *backing_hd)
 if (!backing_hd) {
 error_free(bs->backing_blocker);
 bs->backing_blocker = NULL;
+bs->backing_child = NULL;
 goto out;
 }
+bs->backing_child = bdrv_attach_child(bs, backing_hd, &child_backing);
 bs->open_flags &= ~BDRV_O_NO_BACKING;
 pstrcpy(bs->backing_file, sizeof(bs->backing_file), backing_hd->filename);
 pstrcpy(bs->backing_format, sizeof(bs->backing_format),
@@ -1236,7 +1239,6 @@ int bdrv_open_backing_file(BlockDriverState *bs, QDict 
*options, Error **errp)
 goto free_exit;
 }
 
-bdrv_attach_child(bs, backing_hd, &child_backing);
 bdrv_set_backing_hd(bs, backing_hd);
 
 free_exit:
@@ -2171,7 +2173,6 @@ void bdrv_append(BlockDriverState *bs_new, 
BlockDriverState *bs_top)
 /* The contents of 'tmp' will become bs_top, as we are
  * swapping bs_new and bs_top contents. */
 bdrv_set_backing_hd(bs_top, bs_new);
-bdrv_attach_child(bs_top, bs_new, &child_backing);
 }
 
 static void bdrv_delete(BlockDriverState *bs)
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 8996baf..2cc973c 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -379,6 +379,7 @@ struct BlockDriverState {
 char exact_filename[PATH_MAX];
 
 BlockDriverState *backing_hd;
+BdrvChild *backing_child;
 BlockDriverState *file;
 
 NotifierList close_notifiers;
-- 
1.8.3.1




[Qemu-block] [PATCH for-2.4 3/5] block: Introduce bdrv_unref_child()

2015-07-08 Thread Kevin Wolf
This is the counterpart for bdrv_open_child(). It decreases the
reference count of the child BDS and removes it from the list of
children of the given parent BDS.

Signed-off-by: Kevin Wolf 
---
 block.c   | 23 +--
 include/block/block.h |  1 +
 2 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/block.c b/block.c
index 029feeb..b723cf2 100644
--- a/block.c
+++ b/block.c
@@ -1117,6 +1117,24 @@ static BdrvChild *bdrv_attach_child(BlockDriverState 
*parent_bs,
 return child;
 }
 
+static void bdrv_detach_child(BdrvChild *child)
+{
+QLIST_REMOVE(child, next);
+g_free(child);
+}
+
+void bdrv_unref_child(BlockDriverState *parent, BdrvChild *child)
+{
+BlockDriverState *child_bs = child->bs;
+
+if (child->bs->inherits_from == parent) {
+child->bs->inherits_from = NULL;
+}
+
+bdrv_detach_child(child);
+bdrv_unref(child_bs);
+}
+
 void bdrv_set_backing_hd(BlockDriverState *bs, BlockDriverState *backing_hd)
 {
 
@@ -1884,11 +1902,12 @@ void bdrv_close(BlockDriverState *bs)
 BdrvChild *child, *next;
 
 QLIST_FOREACH_SAFE(child, &bs->children, next, next) {
+/* TODO Remove bdrv_unref() from drivers' close function and use
+ * bdrv_unref_child() here */
 if (child->bs->inherits_from == bs) {
 child->bs->inherits_from = NULL;
 }
-QLIST_REMOVE(child, next);
-g_free(child);
+bdrv_detach_child(child);
 }
 
 if (bs->backing_hd) {
diff --git a/include/block/block.h b/include/block/block.h
index 5048772..37916f7 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -513,6 +513,7 @@ void bdrv_disable_copy_on_read(BlockDriverState *bs);
 
 void bdrv_ref(BlockDriverState *bs);
 void bdrv_unref(BlockDriverState *bs);
+void bdrv_unref_child(BlockDriverState *parent, BdrvChild *child);
 
 bool bdrv_op_is_blocked(BlockDriverState *bs, BlockOpType op, Error **errp);
 void bdrv_op_block(BlockDriverState *bs, BlockOpType op, Error *reason);
-- 
1.8.3.1
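The attach/detach/unref bookkeeping introduced across this series can be modeled with a minimal sketch. Note one deliberate simplification: here attach takes the child's reference for symmetry, whereas in the real code the reference comes from bdrv_open_inherit()/bdrv_ref(); types and names are toy stand-ins, not QEMU API:

```c
#include <stdlib.h>

typedef struct Node Node;

/* Toy analogue of BdrvChild: one link in the parent's child list. */
typedef struct Child {
    Node *bs;
    struct Child *next;
} Child;

struct Node {
    int refcnt;
    Child *children;        /* singly linked list head */
};

static Child *attach_child(Node *parent, Node *child_bs)
{
    Child *c = malloc(sizeof(*c));
    c->bs = child_bs;
    c->next = parent->children;
    parent->children = c;
    child_bs->refcnt++;     /* simplification: attach takes the reference */
    return c;
}

/* Counterpart of bdrv_detach_child(): unlink and free the Child object
 * only; the child node's reference count is untouched. */
static void detach_child(Node *parent, Child *child)
{
    Child **pp;
    for (pp = &parent->children; *pp; pp = &(*pp)->next) {
        if (*pp == child) {
            *pp = child->next;
            break;
        }
    }
    free(child);
}

/* Counterpart of bdrv_unref_child(): detach, then also drop the child
 * node's reference (the node itself would be freed at zero). */
static void unref_child(Node *parent, Child *child)
{
    Node *child_bs = child->bs;
    detach_child(parent, child);
    child_bs->refcnt--;
}
```

The split matters for patch 4: .bdrv_close handlers that unref children themselves only need the detach half afterwards, avoiding a double free.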




[Qemu-block] [PATCH for-2.4 1/5] block: Move bdrv_attach_child() calls up the call chain

2015-07-08 Thread Kevin Wolf
Let the callers of bdrv_open_inherit() call bdrv_attach_child(). It
needs to be called in all cases where bdrv_open_inherit() succeeds (i.e.
returns 0) and a child_role is given.

bdrv_attach_child() is moved upwards to avoid a forward declaration.

Signed-off-by: Kevin Wolf 
---
 block.c | 41 +
 1 file changed, 21 insertions(+), 20 deletions(-)

diff --git a/block.c b/block.c
index 5e80336..0398bff 100644
--- a/block.c
+++ b/block.c
@@ -1102,6 +1102,19 @@ static int bdrv_fill_options(QDict **options, const char 
**pfilename,
 return 0;
 }
 
+static void bdrv_attach_child(BlockDriverState *parent_bs,
+  BlockDriverState *child_bs,
+  const BdrvChildRole *child_role)
+{
+BdrvChild *child = g_new(BdrvChild, 1);
+*child = (BdrvChild) {
+.bs = child_bs,
+.role   = child_role,
+};
+
+QLIST_INSERT_HEAD(&parent_bs->children, child, next);
+}
+
 void bdrv_set_backing_hd(BlockDriverState *bs, BlockDriverState *backing_hd)
 {
 
@@ -1202,6 +1215,8 @@ int bdrv_open_backing_file(BlockDriverState *bs, QDict 
*options, Error **errp)
 error_free(local_err);
 goto free_exit;
 }
+
+bdrv_attach_child(bs, backing_hd, &child_backing);
 bdrv_set_backing_hd(bs, backing_hd);
 
 free_exit:
@@ -1237,6 +1252,7 @@ int bdrv_open_image(BlockDriverState **pbs, const char 
*filename,
 
 assert(pbs);
 assert(*pbs == NULL);
+assert(child_role != NULL);
 
 bdref_key_dot = g_strdup_printf("%s.", bdref_key);
 qdict_extract_subqdict(options, &image_options, bdref_key_dot);
@@ -1257,6 +1273,11 @@ int bdrv_open_image(BlockDriverState **pbs, const char 
*filename,
 
 ret = bdrv_open_inherit(pbs, filename, reference, image_options, 0,
 parent, child_role, NULL, errp);
+if (ret < 0) {
+goto done;
+}
+
+bdrv_attach_child(parent, *pbs, child_role);
 
 done:
 qdict_del(options, bdref_key);
@@ -1328,19 +1349,6 @@ out:
 return ret;
 }
 
-static void bdrv_attach_child(BlockDriverState *parent_bs,
-  BlockDriverState *child_bs,
-  const BdrvChildRole *child_role)
-{
-BdrvChild *child = g_new(BdrvChild, 1);
-*child = (BdrvChild) {
-.bs = child_bs,
-.role   = child_role,
-};
-
-QLIST_INSERT_HEAD(&parent_bs->children, child, next);
-}
-
 /*
  * Opens a disk image (raw, qcow2, vmdk, ...)
  *
@@ -1393,9 +1401,6 @@ static int bdrv_open_inherit(BlockDriverState **pbs, 
const char *filename,
 return -ENODEV;
 }
 bdrv_ref(bs);
-if (child_role) {
-bdrv_attach_child(parent, bs, child_role);
-}
 *pbs = bs;
 return 0;
 }
@@ -1540,10 +1545,6 @@ static int bdrv_open_inherit(BlockDriverState **pbs, 
const char *filename,
 goto close_and_fail;
 }
 
-if (child_role) {
-bdrv_attach_child(parent, bs, child_role);
-}
-
 QDECREF(options);
 *pbs = bs;
 return 0;
-- 
1.8.3.1




Re: [Qemu-block] [Qemu-devel] [PATCH COLO-BLOCK v7 00/17] Block replication for continuous checkpoints

2015-07-08 Thread Wen Congyang
On 07/08/2015 11:49 PM, Michael R. Hines wrote:
> On 07/07/2015 08:38 PM, Wen Congyang wrote:
>> On 07/08/2015 12:56 AM, Michael R. Hines wrote:
>>> On 07/07/2015 04:23 AM, Paolo Bonzini wrote:
 On 07/07/2015 11:13, Dr. David Alan Gilbert wrote:
>>> This log is very strange. The NBD client connects to NBD server, and NBD 
>>> server wants to read data
>>> from NBD client, but reading fails. It seems that the connection is 
>>> closed unexpectedly. Can you
>>> give me more log and how do you use it?
> That was the same failure I was getting.   I think it's that the NBD 
> server and client are in different
> modes, with one of them expecting the export.
 nbd_server_add always expects the export.

 Paolo

>>> OK, Wen, so your wiki finally does reflect this, but now we're back to the 
>>> "export not found error".
>>>
>>> Again, here's the exact command line:
>>>
>>> 1. First on the secondary VM:
>>>
>>> qemu-system-x86_64 .snip...  -drive 
>>> if=none,driver=qcow2,file=foo.qcow2,id=mc1,cache=none,aio=native -drive 
>>> if=virtio,driver=replication,mode=secondary,throttling.bps-total-max=7000,file.file.filename=active_disk.qcow2,file.driver=qcow2,file.backing.file.filename=hidden_disk.qcow2,file.backing.driver=qcow2,file.backing.allow-write-backing-file=on,file.backing.backing.backing_reference=mc1
>>>
>>> 2. Then, then HMP commands:
>>>
>>> nbd_server_start 0:6262
>>> nbd_server_add -w mc1
>>>
>>> 3. Then the primary VM:
>>>
>>> qemu-system-x86_64 .snip...  -drive 
>>> if=virtio,driver=quorum,read-pattern=fifo,no-connect=on,children.0.file.filename=bar.qcow2,children.0.driver=qcow2,children.1.file.driver=nbd,children.1.file.host=127.0.0.1,children.1.file.port=6262,children.1.driver=replication,children.1.mode=primary,children.1.ignore-errors=on
>>>
>>> With the error: Server requires an export name
>>>
>>> *but* your wiki has no export name on the primary VM side, so I added the 
>>> export name back which is on your old wiki:
>>>
>>> qemu-system-x86_64 .snip...  -drive 
>>> if=virtio,driver=quorum,read-pattern=fifo,no-connect=on,children.0.file.filename=bar.qcow2,children.0.driver=qcow2,children.1.file.driver=nbd,children.1.file.export=mc1,children.1.file.host=127.0.0.1,children.1.file.port=6262,children.1.driver=replication,children.1.mode=primary,children.1.ignore-errors=on:
>>>  Failed to read export length
>>>
>> Hmm, I think you use v7 version. Is it right?
>> In this version, the correct usage for the primary QEMU is:
>> 1. qemu-system-x86_64 .snip...  -drive 
>> if=virtio,driver=quorum,read-pattern=fifo,id=disk1,
>>
>> children.0.file.filename=bar.qcow2,children.0.driver=qcow2,
>>
>> Then run hmp monitor command(It should be run after you run nbd_server_add 
>> in the secondary qemu):
>> child_add disk1 
>> child.driver=replication,child.mode=primary,child.file.host=127.0.0.1,child.file.port=6262,child.file.export=mc1,child.file.driver=nbd,child.ignore-errors=on
>>
>> Thanks
>> Wen Congyang
>>
>>> And server now says:
>>>
>>> nbd.c:nbd_handle_export_name():L416: export not found
>>> nbd.c:nbd_send_negotiate():L562: option negotiation failed
>>>
>>> .
>>>
>>
> 
> OK, I'm totally confused at this point. =)
> 
> Maybe it would be easier to just wait for your next clean patchset + 
> documentation which is all consistent with each other.

I have sent v8, but the usage is unchanged. You can set up the environment 
according to the wiki.
When the NBD client is opened, it needs to connect to the NBD server, so I 
introduced a new command, child_add, to add the NBD client
as a quorum child once the NBD server is ready.

The NBD server is ready after you run the following commands:
nbd_server_start 0:6262 # the secondary QEMU will listen on host:port
nbd_server_add -w mc1   # export this disk via the NBD server; the export
# name is its id, mc1. -w means writes to this disk are allowed.

Then you can run the following command in the primary qemu:
child_add disk1 
child.driver=replication,child.mode=primary,child.file.host=127.0.0.1,child.file.port=6262,child.file.export=mc1,child.file.driver=nbd,child.ignore-errors=on

After this monitor command, nbd client has connected to the nbd server.

Thanks
Wen Congyang

> 
> - Michael
> 
> .
> 




Re: [Qemu-block] [PATCH COLO-BLOCK v8 09/18] Backup: clear all bitmap when doing block checkpoint

2015-07-08 Thread Dr. David Alan Gilbert
* Wen Congyang (we...@cn.fujitsu.com) wrote:
> Signed-off-by: Wen Congyang 
> Signed-off-by: zhanghailiang 
> Signed-off-by: Gonglei 
> Cc: Jeff Cody 
> ---
>  block/backup.c   | 13 +
>  blockjob.c   | 10 ++
>  include/block/blockjob.h | 12 
>  3 files changed, 35 insertions(+)
> 
> diff --git a/block/backup.c b/block/backup.c
> index d3c7d9f..ebb8a88 100644
> --- a/block/backup.c
> +++ b/block/backup.c
> @@ -211,11 +211,24 @@ static void backup_iostatus_reset(BlockJob *job)
>  bdrv_iostatus_reset(s->target);
>  }
>  
> +static void backup_do_checkpoint(BlockJob *job, Error **errp)
> +{
> +BackupBlockJob *backup_job = container_of(job, BackupBlockJob, common);
> +
> +if (backup_job->sync_mode != MIRROR_SYNC_MODE_NONE) {
> +error_setg(errp, "this feature or command is not currently 
> supported");

You use this text in a few different errors in the block code; we've currently
got one test machine which is producing it and we haven't yet figured out
which one of the exit paths is doing it.  Please make each one unique and state
an identifier; e.g. "checkpoint on block backup in sync ... for blockjob ..."
(I'm not sure what exactly makes sense for block jobs - but something like 
that).
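Dave's request above can be illustrated with a small, QEMU-free sketch. set_job_error(), its parameters, and the exact wording below are hypothetical stand-ins for error_setg(), only meant to show each exit path producing a message that identifies itself:

```c
#include <stdio.h>

/* Hypothetical stand-in for QEMU's error_setg(): each call site passes the
 * operation, the job type and the device id, so the resulting text says
 * exactly which exit path produced it instead of a generic
 * "not currently supported". */
static void set_job_error(char *buf, size_t len, const char *op,
                          const char *job_type, const char *device)
{
    snprintf(buf, len, "%s is not supported for %s job on device '%s'",
             op, job_type, device);
}
```

A backup job would then report e.g. "checkpoint is not supported for backup job on device 'disk1'", which is unique per call site.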

Dave
  
> +return;
> +}
> +
> +hbitmap_reset_all(backup_job->bitmap);
> +}
> +
>  static const BlockJobDriver backup_job_driver = {
>  .instance_size  = sizeof(BackupBlockJob),
>  .job_type   = BLOCK_JOB_TYPE_BACKUP,
>  .set_speed  = backup_set_speed,
>  .iostatus_reset = backup_iostatus_reset,
> +.do_checkpoint  = backup_do_checkpoint,
>  };
>  
>  static BlockErrorAction backup_error_action(BackupBlockJob *job,
> diff --git a/blockjob.c b/blockjob.c
> index ec46fad..cb412d1 100644
> --- a/blockjob.c
> +++ b/blockjob.c
> @@ -400,3 +400,13 @@ void block_job_defer_to_main_loop(BlockJob *job,
>  
>  qemu_bh_schedule(data->bh);
>  }
> +
> +void block_job_do_checkpoint(BlockJob *job, Error **errp)
> +{
> +if (!job->driver->do_checkpoint) {
> +error_setg(errp, "this feature or command is not currently 
> supported");
> +return;
> +}
> +
> +job->driver->do_checkpoint(job, errp);
> +}
> diff --git a/include/block/blockjob.h b/include/block/blockjob.h
> index 57d8ef1..b832dc3 100644
> --- a/include/block/blockjob.h
> +++ b/include/block/blockjob.h
> @@ -50,6 +50,9 @@ typedef struct BlockJobDriver {
>   * manually.
>   */
>  void (*complete)(BlockJob *job, Error **errp);
> +
> +/** Optional callback for job types that support checkpoint. */
> +void (*do_checkpoint)(BlockJob *job, Error **errp);
>  } BlockJobDriver;
>  
>  /**
> @@ -348,4 +351,13 @@ void block_job_defer_to_main_loop(BlockJob *job,
>BlockJobDeferToMainLoopFn *fn,
>void *opaque);
>  
> +/**
> + * block_job_do_checkpoint:
> + * @job: The job.
> + * @errp: Error object.
> + *
> + * Do block checkpoint on the specified job.
> + */
> +void block_job_do_checkpoint(BlockJob *job, Error **errp);
> +
>  #endif
> -- 
> 2.4.3
> 
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-block] [PATCH COLO-BLOCK v8 09/18] Backup: clear all bitmap when doing block checkpoint

2015-07-08 Thread Wen Congyang
On 07/09/2015 09:28 AM, Dr. David Alan Gilbert wrote:
> * Wen Congyang (we...@cn.fujitsu.com) wrote:
>> Signed-off-by: Wen Congyang 
>> Signed-off-by: zhanghailiang 
>> Signed-off-by: Gonglei 
>> Cc: Jeff Cody 
>> ---
>>  block/backup.c   | 13 +
>>  blockjob.c   | 10 ++
>>  include/block/blockjob.h | 12 
>>  3 files changed, 35 insertions(+)
>>
>> diff --git a/block/backup.c b/block/backup.c
>> index d3c7d9f..ebb8a88 100644
>> --- a/block/backup.c
>> +++ b/block/backup.c
>> @@ -211,11 +211,24 @@ static void backup_iostatus_reset(BlockJob *job)
>>  bdrv_iostatus_reset(s->target);
>>  }
>>  
>> +static void backup_do_checkpoint(BlockJob *job, Error **errp)
>> +{
>> +BackupBlockJob *backup_job = container_of(job, BackupBlockJob, common);
>> +
>> +if (backup_job->sync_mode != MIRROR_SYNC_MODE_NONE) {
>> +error_setg(errp, "this feature or command is not currently 
>> supported");
> 
> You use this text in a few different errors in the block code; we've currently
> got one test machine which is producing it and we haven't yet figured out
> which one of the exit paths is doing it.  Please make each one unique and 
> state
> an identifier; e.g. "checkpoint on block backup in sync ... for blockjob ..."
> (I'm not sure what exactly makes sense for block jobs - but something like 
> that).

OK, I will check all error messages.

Thanks
Wen Congyang

> 
> Dave
>   
>> +return;
>> +}
>> +
>> +hbitmap_reset_all(backup_job->bitmap);
>> +}
>> +
>>  static const BlockJobDriver backup_job_driver = {
>>  .instance_size  = sizeof(BackupBlockJob),
>>  .job_type   = BLOCK_JOB_TYPE_BACKUP,
>>  .set_speed  = backup_set_speed,
>>  .iostatus_reset = backup_iostatus_reset,
>> +.do_checkpoint  = backup_do_checkpoint,
>>  };
>>  
>>  static BlockErrorAction backup_error_action(BackupBlockJob *job,
>> diff --git a/blockjob.c b/blockjob.c
>> index ec46fad..cb412d1 100644
>> --- a/blockjob.c
>> +++ b/blockjob.c
>> @@ -400,3 +400,13 @@ void block_job_defer_to_main_loop(BlockJob *job,
>>  
>>  qemu_bh_schedule(data->bh);
>>  }
>> +
>> +void block_job_do_checkpoint(BlockJob *job, Error **errp)
>> +{
>> +if (!job->driver->do_checkpoint) {
>> +error_setg(errp, "this feature or command is not currently 
>> supported");
>> +return;
>> +}
>> +
>> +job->driver->do_checkpoint(job, errp);
>> +}
>> diff --git a/include/block/blockjob.h b/include/block/blockjob.h
>> index 57d8ef1..b832dc3 100644
>> --- a/include/block/blockjob.h
>> +++ b/include/block/blockjob.h
>> @@ -50,6 +50,9 @@ typedef struct BlockJobDriver {
>>   * manually.
>>   */
>>  void (*complete)(BlockJob *job, Error **errp);
>> +
>> +/** Optional callback for job types that support checkpoint. */
>> +void (*do_checkpoint)(BlockJob *job, Error **errp);
>>  } BlockJobDriver;
>>  
>>  /**
>> @@ -348,4 +351,13 @@ void block_job_defer_to_main_loop(BlockJob *job,
>>BlockJobDeferToMainLoopFn *fn,
>>void *opaque);
>>  
>> +/**
>> + * block_job_do_checkpoint:
>> + * @job: The job.
>> + * @errp: Error object.
>> + *
>> + * Do block checkpoint on the specified job.
>> + */
>> +void block_job_do_checkpoint(BlockJob *job, Error **errp);
>> +
>>  #endif
>> -- 
>> 2.4.3
>>
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
> .
> 




Re: [Qemu-block] [Qemu-devel] [PATCH COLO-BLOCK v7 00/17] Block replication for continuous checkpoints

2015-07-08 Thread Dr. David Alan Gilbert
* Wen Congyang (we...@cn.fujitsu.com) wrote:
> On 07/08/2015 11:49 PM, Michael R. Hines wrote:
> > On 07/07/2015 08:38 PM, Wen Congyang wrote:
> >> On 07/08/2015 12:56 AM, Michael R. Hines wrote:
> >>> On 07/07/2015 04:23 AM, Paolo Bonzini wrote:
>  On 07/07/2015 11:13, Dr. David Alan Gilbert wrote:
> >>> This log is very strange. The NBD client connects to NBD server, and 
> >>> NBD server wants to read data
> >>> from NBD client, but reading fails. It seems that the connection is 
> >>> closed unexpectedly. Can you
> >>> give me more log and how do you use it?
> > That was the same failure I was getting.   I think it's that the NBD 
> > server and client are in different
> > modes, with one of them expecting the export.
>  nbd_server_add always expects the export.
> 
>  Paolo
> 
> >>> OK, Wen, so your wiki finally does reflect this, but now we're back to 
> >>> the "export not found error".
> >>>
> >>> Again, here's the exact command line:
> >>>
> >>> 1. First on the secondary VM:
> >>>
> >>> qemu-system-x86_64 .snip...  -drive 
> >>> if=none,driver=qcow2,file=foo.qcow2,id=mc1,cache=none,aio=native -drive 
> >>> if=virtio,driver=replication,mode=secondary,throttling.bps-total-max=7000,file.file.filename=active_disk.qcow2,file.driver=qcow2,file.backing.file.filename=hidden_disk.qcow2,file.backing.driver=qcow2,file.backing.allow-write-backing-file=on,file.backing.backing.backing_reference=mc1
> >>>
> >>> 2. Then, then HMP commands:
> >>>
> >>> nbd_server_start 0:6262
> >>> nbd_server_add -w mc1
> >>>
> >>> 3. Then the primary VM:
> >>>
> >>> qemu-system-x86_64 .snip...  -drive 
> >>> if=virtio,driver=quorum,read-pattern=fifo,no-connect=on,children.0.file.filename=bar.qcow2,children.0.driver=qcow2,children.1.file.driver=nbd,children.1.file.host=127.0.0.1,children.1.file.port=6262,children.1.driver=replication,children.1.mode=primary,children.1.ignore-errors=on
> >>>
> >>> With the error: Server requires an export name
> >>>
> >>> *but*, your wiki has no export name on the primary VM side, so I added 
> >>> the export name back which is on your old wiki:
> >>>
> >>> qemu-system-x86_64 .snip...  -drive 
> >>> if=virtio,driver=quorum,read-pattern=fifo,no-connect=on,children.0.file.filename=bar.qcow2,children.0.driver=qcow2,children.1.file.driver=nbd,children.1.file.export=mc1,children.1.file.host=127.0.0.1,children.1.file.port=6262,children.1.driver=replication,children.1.mode=primary,children.1.ignore-errors=on:
> >>>  Failed to read export length
> >>>
> >> Hmm, I think you are using the v7 version. Is that right?
> >> In this version, the correct usage for the primary qemu is:
> >> 1. qemu-system-x86_64 .snip...  -drive 
> >> if=virtio,driver=quorum,read-pattern=fifo,id=disk1,
> >>
> >> children.0.file.filename=bar.qcow2,children.0.driver=qcow2,
> >>
> >> Then run hmp monitor command(It should be run after you run nbd_server_add 
> >> in the secondary qemu):
> >> child_add disk1 
> >> child.driver=replication,child.mode=primary,child.file.host=127.0.0.1,child.file.port=6262,child.file.export=mc1,child.file.driver=nbd,child.ignore-errors=on
> >>
> >> Thanks
> >> Wen Congyang
> >>
> >>> And server now says:
> >>>
> >>> nbd.c:nbd_handle_export_name():L416: export not found
> >>> nbd.c:nbd_send_negotiate():L562: option negotiation failed
> >>>
> >>> .
> >>>
> >>
> > 
> > OK, I'm totally confused at this point. =)
> > 
> > Maybe it would be easier to just wait for your next clean patchset + 
> > documentation which is all consistent with each other.
> 
> I have sent the v8. But the usage is not changed. You can setup the 
> environment according to the wiki.
> When we open the NBD client, it needs to connect to the NBD server. So I 
> introduce a new command, child_add, to add the NBD client as a quorum child 
> once the NBD server is ready.
> 
> The nbd server is ready after you run the following command:
> nbd_server_start 0:6262 # the secondary qemu will listen to host:port
> nbd_server_add -w mc1   # the NBD server will know this disk is used as an NBD 
> export. The export name is its id, mc1.
> # -w means we allow writes to this disk.
> 
> Then you can run the following command in the primary qemu:
> child_add disk1 
> child.driver=replication,child.mode=primary,child.file.host=127.0.0.1,child.file.port=6262,child.file.export=mc1,child.file.driver=nbd,child.ignore-errors=on
> 
> After this monitor command, nbd client has connected to the nbd server.

Ah! The 'child.file.export=mc1' wasn't there previously; I see Yang added that 
to the wiki yesterday;
that probably explains the problem that we've been having.

Dave

> 
> Thanks
> Wen Congyang
> 
> > 
> > - Michael
> > 
> > .
> > 
> 
> 
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-block] [Qemu-devel] [PATCH COLO-BLOCK v7 00/17] Block replication for continuous checkpoints

2015-07-08 Thread Wen Congyang
On 07/09/2015 09:55 AM, Dr. David Alan Gilbert wrote:
> * Wen Congyang (we...@cn.fujitsu.com) wrote:
>> On 07/08/2015 11:49 PM, Michael R. Hines wrote:
>>> On 07/07/2015 08:38 PM, Wen Congyang wrote:
 On 07/08/2015 12:56 AM, Michael R. Hines wrote:
> On 07/07/2015 04:23 AM, Paolo Bonzini wrote:
>> On 07/07/2015 11:13, Dr. David Alan Gilbert wrote:
> This log is very strange. The NBD client connects to NBD server, and 
> NBD server wants to read data
> from NBD client, but reading fails. It seems that the connection is 
> closed unexpectedly. Can you
> give me more log and how do you use it?
>>> That was the same failure I was getting.   I think it's that the NBD 
>>> server and client are in different
>>> modes, with one of them expecting the export.
>> nbd_server_add always expects the export.
>>
>> Paolo
>>
> OK, Wen, so your wiki finally does reflect this, but now we're back to 
> the "export not found error".
>
> Again, here's the exact command line:
>
> 1. First on the secondary VM:
>
> qemu-system-x86_64 .snip...  -drive 
> if=none,driver=qcow2,file=foo.qcow2,id=mc1,cache=none,aio=native -drive 
> if=virtio,driver=replication,mode=secondary,throttling.bps-total-max=7000,file.file.filename=active_disk.qcow2,file.driver=qcow2,file.backing.file.filename=hidden_disk.qcow2,file.backing.driver=qcow2,file.backing.allow-write-backing-file=on,file.backing.backing.backing_reference=mc1
>
> 2. Then, then HMP commands:
>
> nbd_server_start 0:6262
> nbd_server_add -w mc1
>
> 3. Then the primary VM:
>
> qemu-system-x86_64 .snip...  -drive 
> if=virtio,driver=quorum,read-pattern=fifo,no-connect=on,children.0.file.filename=bar.qcow2,children.0.driver=qcow2,children.1.file.driver=nbd,children.1.file.host=127.0.0.1,children.1.file.port=6262,children.1.driver=replication,children.1.mode=primary,children.1.ignore-errors=on
>
> With the error: Server requires an export name
>
> *but*, your wiki has no export name on the primary VM side, so I added 
> the export name back which is on your old wiki:
>
> qemu-system-x86_64 .snip...  -drive 
> if=virtio,driver=quorum,read-pattern=fifo,no-connect=on,children.0.file.filename=bar.qcow2,children.0.driver=qcow2,children.1.file.driver=nbd,children.1.file.export=mc1,children.1.file.host=127.0.0.1,children.1.file.port=6262,children.1.driver=replication,children.1.mode=primary,children.1.ignore-errors=on:
>  Failed to read export length
>
 Hmm, I think you are using the v7 version. Is that right?
 In this version, the correct usage for the primary qemu is:
 1. qemu-system-x86_64 .snip...  -drive 
 if=virtio,driver=quorum,read-pattern=fifo,id=disk1,

 children.0.file.filename=bar.qcow2,children.0.driver=qcow2,

 Then run hmp monitor command(It should be run after you run nbd_server_add 
 in the secondary qemu):
 child_add disk1 
 child.driver=replication,child.mode=primary,child.file.host=127.0.0.1,child.file.port=6262,child.file.export=mc1,child.file.driver=nbd,child.ignore-errors=on

 Thanks
 Wen Congyang

> And server now says:
>
> nbd.c:nbd_handle_export_name():L416: export not found
> nbd.c:nbd_send_negotiate():L562: option negotiation failed
>
> .
>

>>>
>>> OK, I'm totally confused at this point. =)
>>>
>>> Maybe it would be easier to just wait for your next clean patchset + 
>>> documentation which is all consistent with each other.
>>
>> I have sent the v8. But the usage is not changed. You can setup the 
>> environment according to the wiki.
>> When we open the NBD client, it needs to connect to the NBD server. So I 
>> introduce a new command, child_add, to add the NBD client as a quorum child 
>> once the NBD server is ready.
>>
>> The nbd server is ready after you run the following command:
>> nbd_server_start 0:6262 # the secondary qemu will listen to host:port
>> nbd_server_add -w mc1   # the NBD server will know this disk is used as an NBD 
>> export. The export name is its id, mc1.
>> # -w means we allow writes to this disk.
>>
>> Then you can run the following command in the primary qemu:
>> child_add disk1 
>> child.driver=replication,child.mode=primary,child.file.host=127.0.0.1,child.file.port=6262,child.file.export=mc1,child.file.driver=nbd,child.ignore-errors=on
>>
>> After this monitor command, nbd client has connected to the nbd server.
> 
> Ah! The 'child.file.export=mc1' wasn't there previously; I see Yang added 
> that to the wiki yesterday;
> that probably explains the problem that we've been having.

Sorry for this mistake.

Thanks
Wen Congyang

> 
> Dave
> 
>>
>> Thanks
>> Wen Congyang
>>
>>>
>>> - Michael
>>>
>>> .
>>>
>>
>>
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK

[Qemu-block] [PATCH 1/3] blockjob: Introduce block_job_relax_cpu

2015-07-08 Thread Fam Zheng
block_job_sleep_ns is called by block job coroutines to yield execution to
VCPU threads, the monitor, etc. It is pointless to sleep for 0 or a few
nanoseconds, because that is equivalent to a "yield + enter" with no
intermission in between (the timer fires immediately in the same iteration
of the event loop), which means other code still doesn't get a fair share
of the main loop / BQL.

Introduce block_job_relax_cpu, which will sleep for at least
BLOCK_JOB_RELAX_CPU_NS. Existing block_job_sleep_ns(job, 0) callers can be
replaced by this later.

Reported-by: Alexandre DERUMIER 
Signed-off-by: Fam Zheng 
---
 include/block/blockjob.h | 16 
 1 file changed, 16 insertions(+)

diff --git a/include/block/blockjob.h b/include/block/blockjob.h
index 57d8ef1..53ac4f4 100644
--- a/include/block/blockjob.h
+++ b/include/block/blockjob.h
@@ -157,6 +157,22 @@ void *block_job_create(const BlockJobDriver *driver, 
BlockDriverState *bs,
  */
 void block_job_sleep_ns(BlockJob *job, QEMUClockType type, int64_t ns);
 
+#define BLOCK_JOB_RELAX_CPU_NS 1000L
+
+/**
+ * block_job_relax_cpu:
+ * @job: The job that calls the function.
+ *
+ * Sleep a little to avoid intensive cpu time occupation. Block jobs should
+ * call this or block_job_sleep_ns (for more precision, but note that 0 ns is
+ * usually not enough) periodically, otherwise the QMP and VCPU could starve
+ * on CPU and/or BQL.
+ */
+static inline void block_job_relax_cpu(BlockJob *job)
+{
+return block_job_sleep_ns(job, QEMU_CLOCK_REALTIME, 
BLOCK_JOB_RELAX_CPU_NS);
+}
+
 /**
  * block_job_yield:
  * @job: The job that calls the function.
-- 
2.4.3
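The timing logic the patch relies on can be isolated into a QEMU-free sketch. relax_delay_ns() and SLICE_TIME_NS below are assumptions (mirroring the 100 ms SLICE_TIME used by block/mirror.c), not part of the patch itself:

```c
#include <stdint.h>

#define SLICE_TIME_NS  100000000LL  /* assumed 100 ms slice, as in block/mirror.c */
#define RELAX_CPU_NS   1000LL       /* matches BLOCK_JOB_RELAX_CPU_NS above */

/* Returns how long the job should sleep at this point: 0 while the current
 * time slice is still running, RELAX_CPU_NS once it is used up.  A non-zero
 * sleep makes the timer fire in a later event-loop iteration, so other
 * BQL users actually get scheduled. */
static int64_t relax_delay_ns(int64_t now_ns, int64_t *last_pause_ns)
{
    if (now_ns - *last_pause_ns > SLICE_TIME_NS) {
        *last_pause_ns = now_ns;
        return RELAX_CPU_NS;
    }
    return 0;
}
```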




[Qemu-block] [PATCH 0/3] mirror: Fix guest responsiveness during bitmap scan

2015-07-08 Thread Fam Zheng
This supersedes:

http://patchwork.ozlabs.org/patch/491415/

and [1] which is currently in Jeff's tree.

Although [1] fixed the QMP responsiveness, Alexandre DERUMIER reported that
guest responsiveness still suffers when we are busy in the initial dirty bitmap
scanning loop of the mirror job. That is because 1) we issue many more lseeks
than necessary; 2) we only sleep for 0 ns, which turns out to be quite
ineffective at yielding the BQL to vcpu threads. Both are fixed.

To reproduce: start a guest, attach a 10G raw image, then mirror it.  Your
guest will immediately start to stutter (with patch [1] testing on a local ext4
raw image, and "while echo -n .; do sleep 0.05; done" in guest console).

This series adds block_job_relax_cpu as suggested by Stefan Hajnoczi and uses
it in mirror job; and lets bdrv_is_allocated_above return a larger p_num as
suggested by Paolo Bonzini (although it's done without changing the API).

[1]: http://patchwork.ozlabs.org/patch/471656/ "block/mirror: Sleep
 periodically during bitmap scanning"

Fam Zheng (3):
  blockjob: Introduce block_job_relax_cpu
  mirror: Use block_job_relax_cpu during bitmap scanning
  mirror: Speed up bitmap initial scanning

 block/mirror.c   | 17 +++--
 include/block/blockjob.h | 16 
 2 files changed, 23 insertions(+), 10 deletions(-)

-- 
2.4.3




[Qemu-block] [PATCH 2/3] mirror: Use block_job_relax_cpu during bitmap scanning

2015-07-08 Thread Fam Zheng
Sleeping for 0 seconds may not be as effective as we want; use
block_job_relax_cpu instead.

Signed-off-by: Fam Zheng 
---
 block/mirror.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/mirror.c b/block/mirror.c
index 62db031..ca55578 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -438,7 +438,7 @@ static void coroutine_fn mirror_run(void *opaque)
 
 if (now - last_pause_ns > SLICE_TIME) {
 last_pause_ns = now;
-block_job_sleep_ns(&s->common, QEMU_CLOCK_REALTIME, 0);
+block_job_relax_cpu(&s->common);
 }
 
 if (block_job_is_cancelled(&s->common)) {
-- 
2.4.3




[Qemu-block] [PATCH 3/3] mirror: Speed up bitmap initial scanning

2015-07-08 Thread Fam Zheng
Limiting each bdrv_is_allocated_above call to sectors_per_chunk is slow,
because the underlying protocol driver would issue many more queries than
necessary. We should coalesce the queries.

Signed-off-by: Fam Zheng 
---
 block/mirror.c | 15 ++-
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index ca55578..e8cb592 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -371,7 +371,7 @@ static void coroutine_fn mirror_run(void *opaque)
 MirrorBlockJob *s = opaque;
 MirrorExitData *data;
 BlockDriverState *bs = s->common.bs;
-int64_t sector_num, end, sectors_per_chunk, length;
+int64_t sector_num, end, length;
 uint64_t last_pause_ns;
 BlockDriverInfo bdi;
 char backing_filename[2]; /* we only need 2 characters because we are only
@@ -425,7 +425,6 @@ static void coroutine_fn mirror_run(void *opaque)
 goto immediate_exit;
 }
 
-sectors_per_chunk = s->granularity >> BDRV_SECTOR_BITS;
 mirror_free_init(s);
 
 last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
@@ -433,7 +432,9 @@ static void coroutine_fn mirror_run(void *opaque)
 /* First part, loop on the sectors and initialize the dirty bitmap.  */
 BlockDriverState *base = s->base;
 for (sector_num = 0; sector_num < end; ) {
-int64_t next = (sector_num | (sectors_per_chunk - 1)) + 1;
+/* Just to make sure we are not exceeding int limit. */
+int nb_sectors = MIN(INT_MAX >> BDRV_SECTOR_BITS,
+ end - sector_num);
 int64_t now = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
 
 if (now - last_pause_ns > SLICE_TIME) {
@@ -444,9 +445,7 @@ static void coroutine_fn mirror_run(void *opaque)
 if (block_job_is_cancelled(&s->common)) {
 goto immediate_exit;
 }
-
-ret = bdrv_is_allocated_above(bs, base,
-  sector_num, next - sector_num, &n);
+ret = bdrv_is_allocated_above(bs, base, sector_num, nb_sectors, 
&n);
 
 if (ret < 0) {
 goto immediate_exit;
@@ -455,10 +454,8 @@ static void coroutine_fn mirror_run(void *opaque)
 assert(n > 0);
 if (ret == 1) {
 bdrv_set_dirty_bitmap(s->dirty_bitmap, sector_num, n);
-sector_num = next;
-} else {
-sector_num += n;
 }
+sector_num += n;
 }
 }
 
-- 
2.4.3
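The effect of the change (query as large a range as possible, then advance by however many sectors the answer covered) can be sketched with a toy allocation map. is_allocated() and scan() are illustrative stand-ins for bdrv_is_allocated_above() and the loop in mirror_run(), not the real code:

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for bdrv_is_allocated_above(): reports whether sector_num is
 * allocated and, via *pnum, how many following sectors share that state
 * (capped at nb_sectors) -- one query per contiguous run. */
static bool is_allocated(const bool *map, int64_t end, int64_t sector_num,
                         int64_t nb_sectors, int64_t *pnum)
{
    bool state = map[sector_num];
    int64_t n = 1;
    while (sector_num + n < end && n < nb_sectors &&
           map[sector_num + n] == state) {
        n++;
    }
    *pnum = n;
    return state;
}

/* Scanning loop after the patch: query the whole remaining range and
 * advance by n whether the run was allocated or not.  Returns the number
 * of queries issued. */
static int scan(const bool *map, int64_t end, bool *dirty)
{
    int queries = 0;
    for (int64_t sector_num = 0; sector_num < end; ) {
        int64_t n;
        bool allocated = is_allocated(map, end, sector_num,
                                      end - sector_num, &n);
        queries++;
        if (allocated) {
            for (int64_t i = 0; i < n; i++) {
                dirty[sector_num + i] = true;
            }
        }
        sector_num += n;
    }
    return queries;
}
```

With a map of {2 allocated, 2 free, 3 allocated, 1 free} sectors this issues 4 queries, one per contiguous run, where a per-chunk loop with chunk size 1 would issue 8.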




Re: [Qemu-block] [PATCH for-2.5 1/4] block: add BlockLimits.max_iov field

2015-07-08 Thread Peter Lieven

Am 08.07.2015 um 17:30 schrieb Stefan Hajnoczi:

The maximum number of struct iovec elements depends on the
BlockDriverState.  The raw-posix protocol has a maximum of IOV_MAX but
others could have different values.

Instead of assuming raw-posix and hardcoding IOV_MAX in several places,
put the limit into BlockLimits.

Cc: Peter Lieven 
Suggested-by: Kevin Wolf 
Signed-off-by: Stefan Hajnoczi 
---
Peter Lieven: I think the SCSI LUN level does not have a maximum
scatter-gather segments constraint.  That is probably only at the HBA
level.  CCed you anyway in case you think block/iscsi.c should set the
max_iov field.


libiscsi will send the iovec array straight to readv and writev to
read/write from the TCP socket. So we need IOV_MAX here as well.



Kevin: The default is now INT_MAX.  This means non-raw-posix users will
now be able to merge more requests than before.  They were limited to
IOV_MAX previously.  This could expose limits in other BlockDrivers
which we weren't aware of...


Why raise the default to INT_MAX and not leave it at IOV_MAX?
Is there any case where we expect that many iovecs coming in?

Peter
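Peter's point can be reduced to simple batching arithmetic: if the transport hands the iovec array straight to writev(), a request carrying more than IOV_MAX elements has to be split into multiple calls. iov_batches() below is a hypothetical helper, not QEMU or libiscsi code:

```c
#include <limits.h>
#include <stddef.h>

#ifndef IOV_MAX
#define IOV_MAX 1024  /* POSIX only guarantees _XOPEN_IOV_MAX (16); 1024 is common */
#endif

/* Number of writev() calls needed to submit niov iovec elements when each
 * call may carry at most max_iov of them (ceiling division). */
static size_t iov_batches(size_t niov, size_t max_iov)
{
    return (niov + max_iov - 1) / max_iov;
}
```

A driver that forwards the array verbatim (like libiscsi, per Peter's note) wants max_iov = IOV_MAX, so the block layer never builds a request that would need more than one batch.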