[Xen-devel] [PATCH] block: xen-blkback: don't get/put blkif ref for each queue

2016-09-26 Thread Bob Liu
Taking a xen_blkif_get()/xen_blkif_put() reference for each queue is
unnecessary, and introduces a bug:

If there is I/O in flight when the device is removed, xen_blkif_disconnect()
returns -EBUSY and xen_blkif_put() is not called.
That leaks the references, so even after the I/O completes, xen_blkif_put()
will never call xen_blkif_deferred_free() to free the resources.

Signed-off-by: Bob Liu <bob@oracle.com>
---
 drivers/block/xen-blkback/xenbus.c |2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/block/xen-blkback/xenbus.c 
b/drivers/block/xen-blkback/xenbus.c
index 3cc6d1d..2e1bb6d 100644
--- a/drivers/block/xen-blkback/xenbus.c
+++ b/drivers/block/xen-blkback/xenbus.c
@@ -159,7 +159,6 @@ static int xen_blkif_alloc_rings(struct xen_blkif *blkif)
	init_waitqueue_head(&ring->shutdown_wq);
ring->blkif = blkif;
ring->st_print = jiffies;
-   xen_blkif_get(blkif);
}
 
return 0;
@@ -296,7 +295,6 @@ static int xen_blkif_disconnect(struct xen_blkif *blkif)
BUG_ON(ring->free_pages_num != 0);
BUG_ON(ring->persistent_gnt_c != 0);
WARN_ON(i != (XEN_BLKIF_REQS_PER_PAGE * blkif->nr_ring_pages));
-   xen_blkif_put(blkif);
}
blkif->nr_ring_pages = 0;
/*
-- 
1.7.10.4
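
A simplified sketch of the get/put pair for context (modelled on
drivers/block/xen-blkback/common.h; the exact macro bodies here are an
assumption of this note, not verbatim kernel code). The deferred free is only
scheduled when the reference count drops to zero, which is why a single leaked
per-queue reference keeps the blkif alive forever:

/* Simplified sketch -- macro bodies assumed, not quoted verbatim. */
#define xen_blkif_get(_b) (atomic_inc(&(_b)->refcnt))
#define xen_blkif_put(_b)					\
	do {							\
		if (atomic_dec_and_test(&(_b)->refcnt))		\
			schedule_work(&(_b)->free_work);	\
	} while (0)

/*
 * With a per-queue xen_blkif_get() in xen_blkif_alloc_rings() but the
 * matching xen_blkif_put() skipped on the -EBUSY path of
 * xen_blkif_disconnect(), refcnt can never reach zero, so free_work
 * (which runs xen_blkif_deferred_free()) is never scheduled.
 */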




Re: [Xen-devel] [BUG] kernel BUG at drivers/block/xen-blkfront.c:1711

2016-08-10 Thread Bob Liu

On 08/10/2016 10:54 PM, Evgenii Shatokhin wrote:
> On 10.08.2016 15:49, Bob Liu wrote:
>>
>> On 08/10/2016 08:33 PM, Evgenii Shatokhin wrote:
>>> On 14.07.2016 15:04, Bob Liu wrote:
>>>>
>>>> On 07/14/2016 07:49 PM, Evgenii Shatokhin wrote:
>>>>> On 11.07.2016 15:04, Bob Liu wrote:
>>>>>>
>>>>>>
>>>>>> On 07/11/2016 04:50 PM, Evgenii Shatokhin wrote:
>>>>>>> On 06.06.2016 11:42, Dario Faggioli wrote:
>>>>>>>> Just Cc-ing some Linux, block, and Xen on CentOS people...
>>>>>>>>
>>>>>>>
>>>>>>> Ping.
>>>>>>>
>>>>>>> Any suggestions how to debug this or what might cause the problem?
>>>>>>>
>>>>>>> Obviously, we cannot control Xen on the Amazon's servers. But perhaps 
>>>>>>> there is something we can do at the kernel's side, is it?
>>>>>>>
>>>>>>>> On Mon, 2016-06-06 at 11:24 +0300, Evgenii Shatokhin wrote:
>>>>>>>>> (Resending this bug report because the message I sent last week did
>>>>>>>>> not
>>>>>>>>> make it to the mailing list somehow.)
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> One of our users gets kernel panics from time to time when he tries
>>>>>>>>> to
>>>>>>>>> use his Amazon EC2 instance with CentOS7 x64 in it [1]. Kernel panic
>>>>>>>>> happens within minutes from the moment the instance starts. The
>>>>>>>>> problem
>>>>>>>>> does not show up every time, however.
>>>>>>>>>
>>>>>>>>> The user first observed the problem with a custom kernel, but it was
>>>>>>>>> found later that the stock kernel 3.10.0-327.18.2.el7.x86_64 from
>>>>>>>>> CentOS7 was affected as well.
>>>>>>
>>>>>> Please try this patch:
>>>>>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=7b0767502b5db11cb1f0daef2d01f6d71b1192dc
>>>>>>
>>>>>> Regards,
>>>>>> Bob
>>>>>>
>>>>>
>>>>> Unfortunately, it did not help. The same BUG_ON() in 
>>>>> blkfront_setup_indirect() still triggers in our kernel based on RHEL's 
>>>>> 3.10.0-327.18.2, where I added the patch.
>>>>>
>>>>> As far as I can see, the patch makes sure the indirect pages are added to 
>>>>> the list only if (!info->feature_persistent) holds. I suppose it holds in 
>>>>> our case and the pages are added to the list because the triggered 
>>>>> BUG_ON() is here:
>>>>>
>>>>>   if (!info->feature_persistent && info->max_indirect_segments) {
>>>>>   <...>
>>>>>   BUG_ON(!list_empty(&info->indirect_pages));
>>>>>   <...>
>>>>>   }
>>>>>
>>>>
>>>> That's odd.
>>>> Could you please try to reproduce this issue with a recent upstream kernel?
>>>>
>>>> Thanks,
>>>> Bob
>>>
>>> No luck with the upstream kernel 4.7.0 so far due to unrelated issues (bad 
>>> initrd, I suppose, so the system does not even boot).
>>>
>>> However, the problem reproduced with the stable upstream kernel 3.14.74. 
>>> After the system booted the second time with this kernel, that BUG_ON 
>>> triggered:
>>>   kernel BUG at drivers/block/xen-blkfront.c:1701
>>>
>>
>> Could you please provide more detail on how to reproduce this bug? I'd like 
>> to have a test.
>>
>> Thanks!
>> Bob
> 
> As the user says, he uses an Amazon EC2 instance. Namely: HVM CentOS7 AMI on 
> a c3.large instance with EBS magnetic storage.
> 

Oh, then it would be difficult to debug this issue.
xen-blkfront communicates with xen-blkback (in dom0 or a driver domain), but
that part is a black box when running on Amazon EC2.
We can't see the source code of the backend side!

Can this bug be reproduced in your own environment (Xen + dom0)?

> At least 2 LVM partitions are needed:
> * /, 20-30 Gb should be enough, ext4
> * /vz, 5-10 Gb should be enough, ext4
> 
> Kernel 3.14.74 I was t

Re: [Xen-devel] [BUG] kernel BUG at drivers/block/xen-blkfront.c:1711

2016-08-10 Thread Bob Liu

On 08/10/2016 08:33 PM, Evgenii Shatokhin wrote:
> On 14.07.2016 15:04, Bob Liu wrote:
>>
>> On 07/14/2016 07:49 PM, Evgenii Shatokhin wrote:
>>> On 11.07.2016 15:04, Bob Liu wrote:
>>>>
>>>>
>>>> On 07/11/2016 04:50 PM, Evgenii Shatokhin wrote:
>>>>> On 06.06.2016 11:42, Dario Faggioli wrote:
>>>>>> Just Cc-ing some Linux, block, and Xen on CentOS people...
>>>>>>
>>>>>
>>>>> Ping.
>>>>>
>>>>> Any suggestions how to debug this or what might cause the problem?
>>>>>
>>>>> Obviously, we cannot control Xen on the Amazon's servers. But perhaps 
>>>>> there is something we can do at the kernel's side, is it?
>>>>>
>>>>>> On Mon, 2016-06-06 at 11:24 +0300, Evgenii Shatokhin wrote:
>>>>>>> (Resending this bug report because the message I sent last week did
>>>>>>> not
>>>>>>> make it to the mailing list somehow.)
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> One of our users gets kernel panics from time to time when he tries
>>>>>>> to
>>>>>>> use his Amazon EC2 instance with CentOS7 x64 in it [1]. Kernel panic
>>>>>>> happens within minutes from the moment the instance starts. The
>>>>>>> problem
>>>>>>> does not show up every time, however.
>>>>>>>
>>>>>>> The user first observed the problem with a custom kernel, but it was
>>>>>>> found later that the stock kernel 3.10.0-327.18.2.el7.x86_64 from
>>>>>>> CentOS7 was affected as well.
>>>>
>>>> Please try this patch:
>>>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=7b0767502b5db11cb1f0daef2d01f6d71b1192dc
>>>>
>>>> Regards,
>>>> Bob
>>>>
>>>
>>> Unfortunately, it did not help. The same BUG_ON() in 
>>> blkfront_setup_indirect() still triggers in our kernel based on RHEL's 
>>> 3.10.0-327.18.2, where I added the patch.
>>>
>>> As far as I can see, the patch makes sure the indirect pages are added to 
>>> the list only if (!info->feature_persistent) holds. I suppose it holds in 
>>> our case and the pages are added to the list because the triggered BUG_ON() 
>>> is here:
>>>
>>>  if (!info->feature_persistent && info->max_indirect_segments) {
>>>  <...>
>>>  BUG_ON(!list_empty(&info->indirect_pages));
>>>  <...>
>>>  }
>>>
>>
>> That's odd.
>> Could you please try to reproduce this issue with a recent upstream kernel?
>>
>> Thanks,
>> Bob
> 
> No luck with the upstream kernel 4.7.0 so far due to unrelated issues (bad 
> initrd, I suppose, so the system does not even boot).
> 
> However, the problem reproduced with the stable upstream kernel 3.14.74. 
> After the system booted the second time with this kernel, that BUG_ON 
> triggered:
>  kernel BUG at drivers/block/xen-blkfront.c:1701
> 

Could you please provide more detail on how to reproduce this bug? I'd like
to test it myself.

Thanks!
Bob

>>
>>> So the problem is still out there somewhere, it seems.
>>>
>>> Regards,
>>> Evgenii
>>>
>>>>>>>
>>>>>>> The part of the system log he was able to retrieve is attached. Here
>>>>>>> is
>>>>>>> the bug info, for convenience:
>>>>>>>
>>>>>>> 
>>>>>>> [2.246912] kernel BUG at drivers/block/xen-blkfront.c:1711!
>>>>>>> [2.246912] invalid opcode:  [#1] SMP
>>>>>>> [2.246912] Modules linked in: ata_generic pata_acpi
>>>>>>> crct10dif_pclmul
>>>>>>> crct10dif_common crc32_pclmul crc32c_intel ghash_clmulni_intel
>>>>>>> xen_netfront xen_blkfront(+) aesni_intel lrw ata_piix gf128mul
>>>>>>> glue_helper ablk_helper cryptd libata serio_raw floppy sunrpc
>>>>>>> dm_mirror
>>>>>>> dm_region_hash dm_log dm_mod scsi_transport_iscsi
>>>>>>> [2.246912] CPU: 1 PID: 50 Comm: xenwatch Not tainted
>>>>>>> 3.10.0-327.18.2.el7.x86_64 #1
>>>>>>> [2.246912] Har

[Xen-devel] [PATCH v2] libxl: return any serial tty path in libxl_console_get_tty

2016-08-03 Thread Bob Liu
When specifying a serial list in domain config, users of
libxl_console_get_tty cannot get the tty path of a second specified pty serial,
since right now it always returns the tty path of serial 0.

Signed-off-by: Bob Liu <bob@oracle.com>
---
v2: Rename the patch title.
---
 tools/libxl/libxl.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
index 2cf7451..00af286 100644
--- a/tools/libxl/libxl.c
+++ b/tools/libxl/libxl.c
@@ -1795,7 +1795,7 @@ int libxl_console_get_tty(libxl_ctx *ctx, uint32_t domid, 
int cons_num,
 
 switch (type) {
 case LIBXL_CONSOLE_TYPE_SERIAL:
-tty_path = GCSPRINTF("%s/serial/0/tty", dom_path);
+tty_path = GCSPRINTF("%s/serial/%d/tty", dom_path, cons_num);
 break;
 case LIBXL_CONSOLE_TYPE_PV:
 if (cons_num == 0)
-- 
1.7.10.4
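
A minimal hypothetical caller showing what the fix enables: with cons_num == 1
the call now resolves to <dom_path>/serial/1/tty instead of always returning
serial 0's tty. (The program below is an illustration only; it assumes the
domid is passed on the command line.)

#include <stdio.h>
#include <stdlib.h>
#include <libxl.h>

int main(int argc, char **argv)
{
    libxl_ctx *ctx;
    char *path = NULL;
    uint32_t domid;

    if (argc < 2)
        return 1;
    domid = (uint32_t)strtoul(argv[1], NULL, 10);

    if (libxl_ctx_alloc(&ctx, LIBXL_VERSION, 0, NULL))
        return 1;

    /* cons_num = 1: the second pty serial from the domain config. */
    if (!libxl_console_get_tty(ctx, domid, 1,
                               LIBXL_CONSOLE_TYPE_SERIAL, &path))
        printf("tty: %s\n", path);

    free(path);
    libxl_ctx_free(ctx);
    return 0;
}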




[Xen-devel] [PATCH] tools:libxl: return tty path for all serials

2016-08-02 Thread Bob Liu
When specifying a serial list in domain config, users of
libxl_console_get_tty cannot get the tty path of a second specified pty serial,
since right now it always returns the tty path of serial 0.

Signed-off-by: Bob Liu <bob@oracle.com>
---
 tools/libxl/libxl.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
index 2cf7451..00af286 100644
--- a/tools/libxl/libxl.c
+++ b/tools/libxl/libxl.c
@@ -1795,7 +1795,7 @@ int libxl_console_get_tty(libxl_ctx *ctx, uint32_t domid, 
int cons_num,
 
 switch (type) {
 case LIBXL_CONSOLE_TYPE_SERIAL:
-tty_path = GCSPRINTF("%s/serial/0/tty", dom_path);
+tty_path = GCSPRINTF("%s/serial/%d/tty", dom_path, cons_num);
 break;
 case LIBXL_CONSOLE_TYPE_PV:
 if (cons_num == 0)
-- 
1.7.10.4




Re: [Xen-devel] [PATCH 1/3] xen-blkfront: fix places not updated after introducing 64KB page granularity

2016-07-28 Thread Bob Liu

On 07/28/2016 09:19 AM, Konrad Rzeszutek Wilk wrote:
> On Tue, Jul 26, 2016 at 01:19:35PM +0800, Bob Liu wrote:
>> Two places didn't get updated when 64KB page granularity was introduced, this
>> patch fix them.
>>
>> Signed-off-by: Bob Liu <bob@oracle.com>
>> Acked-by: Roger Pau Monné <roger@citrix.com>
> 
> Could you rebase this on xen-tip/for-linus-4.8 pls?

Done, sent the v2 for you to pick up.

> 
>> ---
>>  drivers/block/xen-blkfront.c |4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
>> index fcc5b4e..032fc94 100644
>> --- a/drivers/block/xen-blkfront.c
>> +++ b/drivers/block/xen-blkfront.c
>> @@ -1321,7 +1321,7 @@ free_shadow:
>>  rinfo->ring_ref[i] = GRANT_INVALID_REF;
>>  }
>>  }
>> -free_pages((unsigned long)rinfo->ring.sring, 
>> get_order(info->nr_ring_pages * PAGE_SIZE));
>> +free_pages((unsigned long)rinfo->ring.sring, 
>> get_order(info->nr_ring_pages * XEN_PAGE_SIZE));
>>  rinfo->ring.sring = NULL;
>>  
>>  if (rinfo->irq)
>> @@ -2013,7 +2013,7 @@ static int blkif_recover(struct blkfront_info *info)
>>  
>>  blkfront_gather_backend_features(info);
>>  segs = info->max_indirect_segments ? : BLKIF_MAX_SEGMENTS_PER_REQUEST;
>> -blk_queue_max_segments(info->rq, segs);
>> +blk_queue_max_segments(info->rq, segs / GRANTS_PER_PSEG);
>>  
>>  for (r_index = 0; r_index < info->nr_rings; r_index++) {
>>  struct blkfront_ring_info *rinfo = &info->rinfo[r_index];
>> -- 
>> 1.7.10.4
>>



[Xen-devel] [PATCH 3/3] xen-blkfront: free resources if xlvbd_alloc_gendisk fails

2016-07-28 Thread Bob Liu
The current code forgets to free resources in the failure path of
xlvbd_alloc_gendisk(); this patch fixes it.

Signed-off-by: Bob Liu <bob@oracle.com>
---
 drivers/block/xen-blkfront.c |7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index d5ed60b..d8429d4 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -2446,7 +2446,7 @@ static void blkfront_connect(struct blkfront_info *info)
if (err) {
xenbus_dev_fatal(info->xbdev, err, "xlvbd_add at %s",
 info->xbdev->otherend);
-   return;
+   goto fail;
}
 
xenbus_switch_state(info->xbdev, XenbusStateConnected);
@@ -2459,6 +2459,11 @@ static void blkfront_connect(struct blkfront_info *info)
add_disk(info->gd);
 
info->is_ready = 1;
+   return;
+
+fail:
+   blkif_free(info, 0);
+   return;
 }
 
 /**
-- 
1.7.10.4




[Xen-devel] [PATCH v2 2/3] xen-blkfront: introduce blkif_set_queue_limits()

2016-07-28 Thread Bob Liu
blk_mq_update_nr_hw_queues() resets all queue limits to their defaults, which
is not what xen-blkfront expects. Introduce blkif_set_queue_limits() to
restore the limits to their correct initial values.

Signed-off-by: Bob Liu <bob@oracle.com>
Acked-by: Roger Pau Monné <roger@citrix.com>
---
 drivers/block/xen-blkfront.c |   87 +++---
 1 file changed, 48 insertions(+), 39 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 36d9a0d..d5ed60b 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -189,6 +189,8 @@ struct blkfront_info
struct mutex mutex;
struct xenbus_device *xbdev;
struct gendisk *gd;
+   u16 sector_size;
+   unsigned int physical_sector_size;
int vdevice;
blkif_vdev_t handle;
enum blkif_state connected;
@@ -910,9 +912,45 @@ static struct blk_mq_ops blkfront_mq_ops = {
.map_queue = blk_mq_map_queue,
 };
 
+static void blkif_set_queue_limits(struct blkfront_info *info)
+{
+   struct request_queue *rq = info->rq;
+   struct gendisk *gd = info->gd;
+   unsigned int segments = info->max_indirect_segments ? :
+   BLKIF_MAX_SEGMENTS_PER_REQUEST;
+
+   queue_flag_set_unlocked(QUEUE_FLAG_VIRT, rq);
+
+   if (info->feature_discard) {
+   queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, rq);
+   blk_queue_max_discard_sectors(rq, get_capacity(gd));
+   rq->limits.discard_granularity = info->discard_granularity;
+   rq->limits.discard_alignment = info->discard_alignment;
+   if (info->feature_secdiscard)
+   queue_flag_set_unlocked(QUEUE_FLAG_SECDISCARD, rq);
+   }
+
+   /* Hard sector size and max sectors impersonate the equiv. hardware. */
+   blk_queue_logical_block_size(rq, info->sector_size);
+   blk_queue_physical_block_size(rq, info->physical_sector_size);
+   blk_queue_max_hw_sectors(rq, (segments * XEN_PAGE_SIZE) / 512);
+
+   /* Each segment in a request is up to an aligned page in size. */
+   blk_queue_segment_boundary(rq, PAGE_SIZE - 1);
+   blk_queue_max_segment_size(rq, PAGE_SIZE);
+
+   /* Ensure a merged request will fit in a single I/O ring slot. */
+   blk_queue_max_segments(rq, segments / GRANTS_PER_PSEG);
+
+   /* Make sure buffer addresses are sector-aligned. */
+   blk_queue_dma_alignment(rq, 511);
+
+   /* Make sure we don't use bounce buffers. */
+   blk_queue_bounce_limit(rq, BLK_BOUNCE_ANY);
+}
+
 static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size,
-   unsigned int physical_sector_size,
-   unsigned int segments)
+   unsigned int physical_sector_size)
 {
struct request_queue *rq;
struct blkfront_info *info = gd->private_data;
@@ -944,37 +982,11 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 
sector_size,
}
 
rq->queuedata = info;
-   queue_flag_set_unlocked(QUEUE_FLAG_VIRT, rq);
-
-   if (info->feature_discard) {
-   queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, rq);
-   blk_queue_max_discard_sectors(rq, get_capacity(gd));
-   rq->limits.discard_granularity = info->discard_granularity;
-   rq->limits.discard_alignment = info->discard_alignment;
-   if (info->feature_secdiscard)
-   queue_flag_set_unlocked(QUEUE_FLAG_SECDISCARD, rq);
-   }
-
-   /* Hard sector size and max sectors impersonate the equiv. hardware. */
-   blk_queue_logical_block_size(rq, sector_size);
-   blk_queue_physical_block_size(rq, physical_sector_size);
-   blk_queue_max_hw_sectors(rq, (segments * XEN_PAGE_SIZE) / 512);
-
-   /* Each segment in a request is up to an aligned page in size. */
-   blk_queue_segment_boundary(rq, PAGE_SIZE - 1);
-   blk_queue_max_segment_size(rq, PAGE_SIZE);
-
-   /* Ensure a merged request will fit in a single I/O ring slot. */
-   blk_queue_max_segments(rq, segments / GRANTS_PER_PSEG);
-
-   /* Make sure buffer addresses are sector-aligned. */
-   blk_queue_dma_alignment(rq, 511);
-
-   /* Make sure we don't use bounce buffers. */
-   blk_queue_bounce_limit(rq, BLK_BOUNCE_ANY);
-
-   gd->queue = rq;
-
+   info->rq = gd->queue = rq;
+   info->gd = gd;
+   info->sector_size = sector_size;
+   info->physical_sector_size = physical_sector_size;
+   blkif_set_queue_limits(info);
return 0;
 }
 
@@ -1139,16 +1151,11 @@ static int xlvbd_alloc_gendisk(blkif_sector_t capacity,
gd->driverfs_dev = &(info->xbdev->dev);
set_capacity(gd, capacity);
 
-   if (xlvbd_init_blk_queue
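
For orientation, a sketch of the recovery-path ordering this change enables,
assumed from the v2 changelog ("Move blkif_set_queue_limits() after
blkfront_gather_backend_features") rather than quoted from the patch:

	/* In blkif_recover() (sketch): blk_mq_update_nr_hw_queues() has just
	 * reset every queue limit to its default, so the limits must be
	 * rebuilt from the values saved in blkfront_info at init time. */
	blkfront_gather_backend_features(info); /* may change max_indirect_segments */
	blkif_set_queue_limits(info);           /* re-apply sector size, segments, ... */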

[Xen-devel] [PATCH v2 1/3] xen-blkfront: fix places not updated after introducing 64KB page granularity

2016-07-28 Thread Bob Liu
Two places didn't get updated when 64KB page granularity was introduced;
this patch fixes them.

Signed-off-by: Bob Liu <bob@oracle.com>
Acked-by: Roger Pau Monné <roger@citrix.com>
---
 drivers/block/xen-blkfront.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index ca0536e..36d9a0d 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -1318,7 +1318,7 @@ free_shadow:
rinfo->ring_ref[i] = GRANT_INVALID_REF;
}
}
-   free_pages((unsigned long)rinfo->ring.sring, 
get_order(info->nr_ring_pages * PAGE_SIZE));
+   free_pages((unsigned long)rinfo->ring.sring, 
get_order(info->nr_ring_pages * XEN_PAGE_SIZE));
rinfo->ring.sring = NULL;
 
if (rinfo->irq)
@@ -2013,7 +2013,7 @@ static int blkif_recover(struct blkfront_info *info)
 
blkfront_gather_backend_features(info);
segs = info->max_indirect_segments ? : BLKIF_MAX_SEGMENTS_PER_REQUEST;
-   blk_queue_max_segments(info->rq, segs);
+   blk_queue_max_segments(info->rq, segs / GRANTS_PER_PSEG);
	bio_list_init(&bio_list);
	INIT_LIST_HEAD(&requests);
 
-- 
1.7.10.4
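
To see why the second hunk divides by GRANTS_PER_PSEG, here is a small
stand-alone illustration. The numbers assume a kernel built with 64KB pages
(e.g. arm64); XEN_PAGE_SIZE is always 4KB, and grants (hence ring segments)
are counted per Xen page:

#include <stdio.h>

#define XEN_PAGE_SIZE   4096U
#define PAGE_SIZE       65536U  /* assumed: 64KB kernel pages */
#define GRANTS_PER_PSEG (PAGE_SIZE / XEN_PAGE_SIZE)

int main(void)
{
	unsigned int segs = 32; /* grants per request, e.g. max_indirect_segments */

	/* Each block-layer segment may span a full kernel page and thus
	 * consumes GRANTS_PER_PSEG grant references, so the value handed
	 * to blk_queue_max_segments() must be scaled down accordingly. */
	printf("grants per request:       %u\n", segs);
	printf("grants per segment:       %u\n", GRANTS_PER_PSEG);
	printf("block-layer max segments: %u\n", segs / GRANTS_PER_PSEG);
	return 0;
}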




Re: [Xen-devel] [PATCH v3] xen-blkfront: dynamic configuration of per-vbd resources

2016-07-27 Thread Bob Liu

On 07/27/2016 10:24 PM, Roger Pau Monné wrote:
> On Wed, Jul 27, 2016 at 07:21:05PM +0800, Bob Liu wrote:
>>
>> On 07/27/2016 06:59 PM, Roger Pau Monné wrote:
>>> On Wed, Jul 27, 2016 at 11:21:25AM +0800, Bob Liu wrote:
>>> [...]
>>>> +static ssize_t dynamic_reconfig_device(struct blkfront_info *info, 
>>>> ssize_t count)
>>>> +{
>>>> +  /*
>>>> +   * Prevent new requests even to software request queue.
>>>> +   */
>>>> +  blk_mq_freeze_queue(info->rq);
>>>> +
>>>> +  /*
>>>> +   * Guarantee no uncompleted reqs.
>>>> +   */
>>>
>>> I'm also wondering, why do you need to guarantee that there are no 
>>> uncompleted requests? The resume procedure is going to call blkif_recover 
>>> that will take care of requeuing any unfinished requests that are on the 
>>> ring.
>>>
>>
>> Because there may have requests in the software request queue with more 
>> segments than
>> we can handle(if info->max_indirect_segments is reduced).
>>
>> The blkif_recover() can't handle this since blk-mq was introduced,
>> because there is no way to iterate the sw-request queues(blk_fetch_request() 
>> can't be used by blk-mq).
>>
>> So there is a bug in blkif_recover(), I was thinking implement the suspend 
>> function of blkfront_driver like:
> 
> Hm, this is a regression and should be fixed ASAP. I'm still not sure I 
> follow, don't blk_queue_max_segments change the number of segments the 
> requests on the queue are going to have? So that you will only have to 
> re-queue the requests already on the ring?
> 

That's not enough: request queues have been split into software queues and
hardware queues since blk-mq was introduced.
We need to consider two more things (see the freeze/unfreeze sketch at the
end of this message):
 * Stop new requests from being added to the software queues before
blk_queue_max_segments() is called (they would still use the old
'max-indirect-segments').
   I don't see any way to do this other than calling blk_mq_freeze_queue().

 * Requests already in the software queues that were built against the old
'max-indirect-segments' also have to be re-queued based on the new
'max-indirect-segments'.
   No blk-mq API can do this either.

> Waiting for the whole queue to be flushed before suspending is IMHO not 
> acceptable, it introduces an unbounded delay during migration if the backend 
> is slow for some reason.
> 

Right, I also hope there is a better solution.

-- 
Regards,
-Bob
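
For reference, a minimal sketch of the quiesce-and-retune sequence discussed
above (assumed flow; new_segs is a placeholder for the renegotiated value).
blk_mq_freeze_queue() waits for in-flight requests to drain and blocks new
submissions, which covers the first point above but not the re-splitting of
requests already queued with the old limits:

	blk_mq_freeze_queue(info->rq);   /* drain in-flight I/O, block new entry */

	/* Safe to change limits now -- nothing is being dispatched. */
	blk_queue_max_segments(info->rq, new_segs / GRANTS_PER_PSEG);

	blk_mq_unfreeze_queue(info->rq); /* resume dispatch with the new limits */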



Re: [Xen-devel] [PATCH v3] xen-blkfront: dynamic configuration of per-vbd resources

2016-07-27 Thread Bob Liu

On 07/27/2016 06:59 PM, Roger Pau Monné wrote:
> On Wed, Jul 27, 2016 at 11:21:25AM +0800, Bob Liu wrote:
> [...]
>> +static ssize_t dynamic_reconfig_device(struct blkfront_info *info, ssize_t 
>> count)
>> +{
>> +/*
>> + * Prevent new requests even to software request queue.
>> + */
>> +blk_mq_freeze_queue(info->rq);
>> +
>> +/*
>> + * Guarantee no uncompleted reqs.
>> + */
> 
> I'm also wondering, why do you need to guarantee that there are no 
> uncompleted requests? The resume procedure is going to call blkif_recover 
> that will take care of requeuing any unfinished requests that are on the 
> ring.
> 

Because there may be requests in the software request queues with more
segments than we can handle (if info->max_indirect_segments is reduced).

blkif_recover() hasn't been able to handle this since blk-mq was introduced,
because there is no way to iterate the software request queues
(blk_fetch_request() can't be used with blk-mq).

So there is a bug in blkif_recover(); I was thinking of implementing the
suspend function of blkfront_driver like:

+static int blkfront_suspend(struct xenbus_device *dev)
+{
+   struct blkfront_info *info = dev_get_drvdata(&dev->dev);
+
+   blk_mq_freeze_queue(info->rq);
+   ..
+}
 static struct xenbus_driver blkfront_driver = {
.ids  = blkfront_ids,
.probe = blkfront_probe,
.remove = blkfront_remove,
+   .suspend = blkfront_suspend,
.resume = blkfront_resume,



Re: [Xen-devel] [PATCH v3] xen-blkfront: dynamic configuration of per-vbd resources

2016-07-27 Thread Bob Liu

On 07/27/2016 04:07 PM, Roger Pau Monné wrote:
..[snip]..
>> @@ -2443,6 +2674,22 @@ static void blkfront_connect(struct blkfront_info 
>> *info)
>>  return;
>>  }
>>  
>> +err = device_create_file(&info->xbdev->dev, 
>> &dev_attr_max_ring_page_order);
>> +if (err)
>> +goto fail;
>> +
>> +err = device_create_file(&info->xbdev->dev, 
>> &dev_attr_max_indirect_segs);
>> +if (err) {
>> +device_remove_file(&info->xbdev->dev, 
>> &dev_attr_max_ring_page_order);
>> +goto fail;
>> +}
>> +
>> +err = device_create_file(&info->xbdev->dev, &dev_attr_max_queues);
>> +if (err) {
>> +device_remove_file(&info->xbdev->dev, 
>> &dev_attr_max_ring_page_order);
>> +device_remove_file(&info->xbdev->dev, 
>> &dev_attr_max_indirect_segs);
>> +goto fail;
>> +}
>>  xenbus_switch_state(info->xbdev, XenbusStateConnected);
>>  
>>  /* Kick pending requests. */
>> @@ -2453,6 +2700,12 @@ static void blkfront_connect(struct blkfront_info 
>> *info)
>>  add_disk(info->gd);
>>  
>>  info->is_ready = 1;
>> +return;
>> +
>> +fail:
>> +blkif_free(info, 0);
>> +xlvbd_release_gendisk(info);
>> +return;
> 
> Hm, I'm not sure whether this chunk should be in a separate patch, it seems 
> like blkfront_connect doesn't properly cleanup on error (if 
> xlvbd_alloc_gendisk fails blkif_free will not be called). Do you think you 
> could send the addition of the 'fail' label as a separate patch and fix the 
> error path of xlvbd_alloc_gendisk?
> 

Sure, will fix all of your comments above.

>>  }
>>  
>>  /**
>> @@ -2500,8 +2753,16 @@ static void blkback_changed(struct xenbus_device *dev,
>>  break;
>>  
>>  case XenbusStateClosed:
>> -if (dev->state == XenbusStateClosed)
>> +if (dev->state == XenbusStateClosed) {
>> +if (info->reconfiguring)
>> +if (blkfront_resume(info->xbdev)) {
> 
> Could you please join those two conditions:
> 
> if (info->reconfiguring && blkfront_resume(info->xbdev)) { ...
> 
> Also, I'm not sure this is correct, if blkfront sees the "Closing" state on 
> blkback it will try to close the frontend and destroy the block device (see 
> blkfront_closing), and this should be avoided. You should call 
> blkfront_resume as soon as you see the backend move to the Closed or Closing 
> states, without calling blkfront_closing.
> 

I don't see how this can happen: the backend state won't be changed to
'Closing' before blkfront_closing() is called.
So I think the current logic is fine.

Btw: could you please ack [PATCH v2 2/3] xen-blkfront: introduce 
blkif_set_queue_limits()?

Thank you!
Bob



[Xen-devel] [PATCH v3] xen-blkfront: dynamic configuration of per-vbd resources

2016-07-26 Thread Bob Liu
The current VBD layer reserves buffer space for each attached device based on
three statically configured settings which are read at boot time.
 * max_indirect_segs: Maximum number of indirect segments.
 * max_ring_page_order: Maximum order of pages to be used for the shared ring.
 * max_queues: Maximum number of queues (rings) to be used.

But the storage backend, workload, and guest memory result in very different
tuning requirements. It's impossible to centrally predict application
characteristics, so it's best to allow the settings to be dynamically
adjusted based on the workload inside the guest.

Usage:
Show current values:
cat /sys/devices/vbd-xxx/max_indirect_segs
cat /sys/devices/vbd-xxx/max_ring_page_order
cat /sys/devices/vbd-xxx/max_queues

Write new values:
echo  > /sys/devices/vbd-xxx/max_indirect_segs
echo  > /sys/devices/vbd-xxx/max_ring_page_order
echo  > /sys/devices/vbd-xxx/max_queues

Signed-off-by: Bob Liu <bob@oracle.com>
--
v3:
 * Remove new_max_indirect_segments.
 * Fix BUG_ON().
v2:
 * Rename to max_ring_page_order.
 * Remove the waiting code suggested by Roger.
---
 drivers/block/xen-blkfront.c |  277 --
 1 file changed, 269 insertions(+), 8 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 1b4c380..57baa54 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -212,6 +212,10 @@ struct blkfront_info
/* Save uncomplete reqs and bios for migration. */
struct list_head requests;
struct bio_list bio_list;
+   /* For dynamic configuration. */
+   unsigned int reconfiguring:1;
+   unsigned int max_ring_page_order;
+   unsigned int max_queues;
 };
 
 static unsigned int nr_minors;
@@ -1350,6 +1354,31 @@ static void blkif_free(struct blkfront_info *info, int 
suspend)
for (i = 0; i < info->nr_rings; i++)
		blkif_free_ring(&info->rinfo[i]);
 
+   /* Remove old xenstore nodes. */
+   if (info->nr_ring_pages > 1)
+   xenbus_rm(XBT_NIL, info->xbdev->nodename, "ring-page-order");
+
+   if (info->nr_rings == 1) {
+   if (info->nr_ring_pages == 1) {
+   xenbus_rm(XBT_NIL, info->xbdev->nodename, "ring-ref");
+   } else {
+   for (i = 0; i < info->nr_ring_pages; i++) {
+   char ring_ref_name[RINGREF_NAME_LEN];
+
+   snprintf(ring_ref_name, RINGREF_NAME_LEN, 
"ring-ref%u", i);
+   xenbus_rm(XBT_NIL, info->xbdev->nodename, 
ring_ref_name);
+   }
+   }
+   } else {
+   xenbus_rm(XBT_NIL, info->xbdev->nodename, 
"multi-queue-num-queues");
+
+   for (i = 0; i < info->nr_rings; i++) {
+   char queuename[QUEUE_NAME_LEN];
+
+   snprintf(queuename, QUEUE_NAME_LEN, "queue-%u", i);
+   xenbus_rm(XBT_NIL, info->xbdev->nodename, queuename);
+   }
+   }
kfree(info->rinfo);
info->rinfo = NULL;
info->nr_rings = 0;
@@ -1763,15 +1792,20 @@ static int talk_to_blkback(struct xenbus_device *dev,
const char *message = NULL;
struct xenbus_transaction xbt;
int err;
-   unsigned int i, max_page_order = 0;
+   unsigned int i, backend_max_order = 0;
unsigned int ring_page_order = 0;
 
err = xenbus_scanf(XBT_NIL, info->xbdev->otherend,
-  "max-ring-page-order", "%u", _page_order);
+  "max-ring-page-order", "%u", _max_order);
if (err != 1)
info->nr_ring_pages = 1;
else {
-   ring_page_order = min(xen_blkif_max_ring_order, max_page_order);
+   if (info->max_ring_page_order)
+   /* Dynamic configured through /sys. */
+   ring_page_order = min(info->max_ring_page_order, 
backend_max_order);
+   else
+   /* Default. */
+   ring_page_order = min(xen_blkif_max_ring_order, 
backend_max_order);
info->nr_ring_pages = 1 << ring_page_order;
}
 
@@ -1894,7 +1928,13 @@ static int negotiate_mq(struct blkfront_info *info)
if (err < 0)
backend_max_queues = 1;
 
-   info->nr_rings = min(backend_max_queues, xen_blkif_max_queues);
+   if (info->max_queues)
+   /* Dynamic configured through /sys */
+   info->nr_rings = min(backend_max_queues, info->max_queues);
+   else
+   /* Default. */
+   info->nr_rings = min(backend_max_queues, xen_blkif_max_queues);
+
/* We need at least one ring
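
As a rough illustration of the shape of one of the sysfs knobs (a hypothetical
sketch: dynamic_reconfig_device() is the helper this series adds; the
attribute names come from the commit message, everything else is assumed):

static ssize_t max_ring_page_order_show(struct device *dev,
					struct device_attribute *attr,
					char *page)
{
	struct blkfront_info *info = dev_get_drvdata(dev);

	return sprintf(page, "%u\n", info->max_ring_page_order);
}

static ssize_t max_ring_page_order_store(struct device *dev,
					 struct device_attribute *attr,
					 const char *buf, size_t count)
{
	struct blkfront_info *info = dev_get_drvdata(dev);
	unsigned int order;

	if (kstrtouint(buf, 10, &order))
		return -EINVAL;
	info->max_ring_page_order = order;
	/* Tear down and reconnect so the new value is renegotiated. */
	return dynamic_reconfig_device(info, count);
}
static DEVICE_ATTR_RW(max_ring_page_order);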

Re: [Xen-devel] [PATCH v2 3/3] xen-blkfront: dynamic configuration of per-vbd resources

2016-07-26 Thread Bob Liu

On 07/26/2016 04:44 PM, Roger Pau Monné wrote:
> On Tue, Jul 26, 2016 at 01:19:37PM +0800, Bob Liu wrote:
>> The current VBD layer reserves buffer space for each attached device based on
>> three statically configured settings which are read at boot time.
>>  * max_indirect_segs: Maximum amount of segments.
>>  * max_ring_page_order: Maximum order of pages to be used for the shared 
>> ring.
>>  * max_queues: Maximum of queues(rings) to be used.
>>
>> But the storage backend, workload, and guest memory result in very different
>> tuning requirements. It's impossible to centrally predict application
>> characteristics so it's best to leave allow the settings can be dynamiclly
>> adjusted based on workload inside the Guest.
>>
>> Usage:
>> Show current values:
>> cat /sys/devices/vbd-xxx/max_indirect_segs
>> cat /sys/devices/vbd-xxx/max_ring_page_order
>> cat /sys/devices/vbd-xxx/max_queues
>>
>> Write new values:
>> echo  > /sys/devices/vbd-xxx/max_indirect_segs
>> echo  > /sys/devices/vbd-xxx/max_ring_page_order
>> echo  > /sys/devices/vbd-xxx/max_queues
>>
>> Signed-off-by: Bob Liu <bob@oracle.com>
>> --
>> v2: Rename to max_ring_page_order and rm the waiting code suggested by Roger.
>> ---
>>  drivers/block/xen-blkfront.c |  275 
>> +-
>>  1 file changed, 269 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
>> index 1b4c380..ff5ebe5 100644
>> --- a/drivers/block/xen-blkfront.c
>> +++ b/drivers/block/xen-blkfront.c
>> @@ -212,6 +212,11 @@ struct blkfront_info
>>  /* Save uncomplete reqs and bios for migration. */
>>  struct list_head requests;
>>  struct bio_list bio_list;
>> +/* For dynamic configuration. */
>> +unsigned int reconfiguring:1;
>> +int new_max_indirect_segments;
> 
> Can't you just use max_indirect_segments? Is it really needed to introduce a 
> new struct member?
> 
>> +int max_ring_page_order;
>> +int max_queues;

Do you mean also get rid of these two new struct members?
I'll think about that.

>>  };
>>  
>>  static unsigned int nr_minors;
>> @@ -1350,6 +1355,31 @@ static void blkif_free(struct blkfront_info *info, 
>> int suspend)
>>  for (i = 0; i < info->nr_rings; i++)
>>  blkif_free_ring(&info->rinfo[i]);
>>  
>> +/* Remove old xenstore nodes. */
>> +if (info->nr_ring_pages > 1)
>> +xenbus_rm(XBT_NIL, info->xbdev->nodename, "ring-page-order");
>> +
>> +if (info->nr_rings == 1) {
>> +if (info->nr_ring_pages == 1) {
>> +xenbus_rm(XBT_NIL, info->xbdev->nodename, "ring-ref");
>> +} else {
>> +for (i = 0; i < info->nr_ring_pages; i++) {
>> +char ring_ref_name[RINGREF_NAME_LEN];
>> +
>> +snprintf(ring_ref_name, RINGREF_NAME_LEN, 
>> "ring-ref%u", i);
>> +xenbus_rm(XBT_NIL, info->xbdev->nodename, 
>> ring_ref_name);
>> +}
>> +}
>> +} else {
>> +xenbus_rm(XBT_NIL, info->xbdev->nodename, 
>> "multi-queue-num-queues");
>> +
>> +for (i = 0; i < info->nr_rings; i++) {
>> +char queuename[QUEUE_NAME_LEN];
>> +
>> +snprintf(queuename, QUEUE_NAME_LEN, "queue-%u", i);
>> +xenbus_rm(XBT_NIL, info->xbdev->nodename, queuename);
>> +}
>> +}
>>  kfree(info->rinfo);
>>  info->rinfo = NULL;
>>  info->nr_rings = 0;
>> @@ -1763,15 +1793,21 @@ static int talk_to_blkback(struct xenbus_device *dev,
>>  const char *message = NULL;
>>  struct xenbus_transaction xbt;
>>  int err;
>> -unsigned int i, max_page_order = 0;
>> +unsigned int i, backend_max_order = 0;
>>  unsigned int ring_page_order = 0;
>>  
>>  err = xenbus_scanf(XBT_NIL, info->xbdev->otherend,
>> -   "max-ring-page-order", "%u", _page_order);
>> +   "max-ring-page-order", "%u", _max_order);
>>  if (err != 1)
>>  info->nr_ring_pages = 1;
>>  else {
>> -ring_page_order = min(xen_blkif_max_ring_order, max_page_ord

[Xen-devel] [PATCH v2 2/3] xen-blkfront: introduce blkif_set_queue_limits()

2016-07-25 Thread Bob Liu
blk_mq_update_nr_hw_queues() resets all queue limits to their defaults, which
is not what xen-blkfront expects. Introduce blkif_set_queue_limits() to
restore the limits to their correct initial values.

Signed-off-by: Bob Liu <bob@oracle.com>
---
v2: Move blkif_set_queue_limits() after blkfront_gather_backend_features.
---
 drivers/block/xen-blkfront.c |   87 +++---
 1 file changed, 48 insertions(+), 39 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 032fc94..1b4c380 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -189,6 +189,8 @@ struct blkfront_info
struct mutex mutex;
struct xenbus_device *xbdev;
struct gendisk *gd;
+   u16 sector_size;
+   unsigned int physical_sector_size;
int vdevice;
blkif_vdev_t handle;
enum blkif_state connected;
@@ -913,9 +915,45 @@ static struct blk_mq_ops blkfront_mq_ops = {
.map_queue = blk_mq_map_queue,
 };
 
+static void blkif_set_queue_limits(struct blkfront_info *info)
+{
+   struct request_queue *rq = info->rq;
+   struct gendisk *gd = info->gd;
+   unsigned int segments = info->max_indirect_segments ? :
+   BLKIF_MAX_SEGMENTS_PER_REQUEST;
+
+   queue_flag_set_unlocked(QUEUE_FLAG_VIRT, rq);
+
+   if (info->feature_discard) {
+   queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, rq);
+   blk_queue_max_discard_sectors(rq, get_capacity(gd));
+   rq->limits.discard_granularity = info->discard_granularity;
+   rq->limits.discard_alignment = info->discard_alignment;
+   if (info->feature_secdiscard)
+   queue_flag_set_unlocked(QUEUE_FLAG_SECDISCARD, rq);
+   }
+
+   /* Hard sector size and max sectors impersonate the equiv. hardware. */
+   blk_queue_logical_block_size(rq, info->sector_size);
+   blk_queue_physical_block_size(rq, info->physical_sector_size);
+   blk_queue_max_hw_sectors(rq, (segments * XEN_PAGE_SIZE) / 512);
+
+   /* Each segment in a request is up to an aligned page in size. */
+   blk_queue_segment_boundary(rq, PAGE_SIZE - 1);
+   blk_queue_max_segment_size(rq, PAGE_SIZE);
+
+   /* Ensure a merged request will fit in a single I/O ring slot. */
+   blk_queue_max_segments(rq, segments / GRANTS_PER_PSEG);
+
+   /* Make sure buffer addresses are sector-aligned. */
+   blk_queue_dma_alignment(rq, 511);
+
+   /* Make sure we don't use bounce buffers. */
+   blk_queue_bounce_limit(rq, BLK_BOUNCE_ANY);
+}
+
 static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size,
-   unsigned int physical_sector_size,
-   unsigned int segments)
+   unsigned int physical_sector_size)
 {
struct request_queue *rq;
struct blkfront_info *info = gd->private_data;
@@ -947,37 +985,11 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 
sector_size,
}
 
rq->queuedata = info;
-   queue_flag_set_unlocked(QUEUE_FLAG_VIRT, rq);
-
-   if (info->feature_discard) {
-   queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, rq);
-   blk_queue_max_discard_sectors(rq, get_capacity(gd));
-   rq->limits.discard_granularity = info->discard_granularity;
-   rq->limits.discard_alignment = info->discard_alignment;
-   if (info->feature_secdiscard)
-   queue_flag_set_unlocked(QUEUE_FLAG_SECDISCARD, rq);
-   }
-
-   /* Hard sector size and max sectors impersonate the equiv. hardware. */
-   blk_queue_logical_block_size(rq, sector_size);
-   blk_queue_physical_block_size(rq, physical_sector_size);
-   blk_queue_max_hw_sectors(rq, (segments * XEN_PAGE_SIZE) / 512);
-
-   /* Each segment in a request is up to an aligned page in size. */
-   blk_queue_segment_boundary(rq, PAGE_SIZE - 1);
-   blk_queue_max_segment_size(rq, PAGE_SIZE);
-
-   /* Ensure a merged request will fit in a single I/O ring slot. */
-   blk_queue_max_segments(rq, segments / GRANTS_PER_PSEG);
-
-   /* Make sure buffer addresses are sector-aligned. */
-   blk_queue_dma_alignment(rq, 511);
-
-   /* Make sure we don't use bounce buffers. */
-   blk_queue_bounce_limit(rq, BLK_BOUNCE_ANY);
-
-   gd->queue = rq;
-
+   info->rq = gd->queue = rq;
+   info->gd = gd;
+   info->sector_size = sector_size;
+   info->physical_sector_size = physical_sector_size;
+   blkif_set_queue_limits(info);
return 0;
 }
 
@@ -1142,16 +1154,11 @@ static int xlvbd_alloc_gendisk(blkif_sector_t capacity,
gd->driverfs_dev = &(info->xbdev->dev);
set_capacity(gd, capacity);
 
-   if (xlvbd_init_blk_queue(gd, s

[Xen-devel] [PATCH 1/3] xen-blkfront: fix places not updated after introducing 64KB page granularity

2016-07-25 Thread Bob Liu
Two places didn't get updated when 64KB page granularity was introduced;
this patch fixes them.

Signed-off-by: Bob Liu <bob@oracle.com>
Acked-by: Roger Pau Monné <roger@citrix.com>
---
 drivers/block/xen-blkfront.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index fcc5b4e..032fc94 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -1321,7 +1321,7 @@ free_shadow:
rinfo->ring_ref[i] = GRANT_INVALID_REF;
}
}
-   free_pages((unsigned long)rinfo->ring.sring, 
get_order(info->nr_ring_pages * PAGE_SIZE));
+   free_pages((unsigned long)rinfo->ring.sring, 
get_order(info->nr_ring_pages * XEN_PAGE_SIZE));
rinfo->ring.sring = NULL;
 
if (rinfo->irq)
@@ -2013,7 +2013,7 @@ static int blkif_recover(struct blkfront_info *info)
 
blkfront_gather_backend_features(info);
segs = info->max_indirect_segments ? : BLKIF_MAX_SEGMENTS_PER_REQUEST;
-   blk_queue_max_segments(info->rq, segs);
+   blk_queue_max_segments(info->rq, segs / GRANTS_PER_PSEG);
 
for (r_index = 0; r_index < info->nr_rings; r_index++) {
		struct blkfront_ring_info *rinfo = &info->rinfo[r_index];
-- 
1.7.10.4




[Xen-devel] [PATCH v2 3/3] xen-blkfront: dynamic configuration of per-vbd resources

2016-07-25 Thread Bob Liu
The current VBD layer reserves buffer space for each attached device based on
three statically configured settings which are read at boot time.
 * max_indirect_segs: Maximum number of indirect segments.
 * max_ring_page_order: Maximum order of pages to be used for the shared ring.
 * max_queues: Maximum number of queues (rings) to be used.

But the storage backend, workload, and guest memory result in very different
tuning requirements. It's impossible to centrally predict application
characteristics, so it's best to allow the settings to be dynamically
adjusted based on the workload inside the guest.

Usage:
Show current values:
cat /sys/devices/vbd-xxx/max_indirect_segs
cat /sys/devices/vbd-xxx/max_ring_page_order
cat /sys/devices/vbd-xxx/max_queues

Write new values:
echo  > /sys/devices/vbd-xxx/max_indirect_segs
echo  > /sys/devices/vbd-xxx/max_ring_page_order
echo  > /sys/devices/vbd-xxx/max_queues

Signed-off-by: Bob Liu <bob@oracle.com>
--
v2: Rename to max_ring_page_order and rm the waiting code suggested by Roger.
---
 drivers/block/xen-blkfront.c |  275 +-
 1 file changed, 269 insertions(+), 6 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 1b4c380..ff5ebe5 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -212,6 +212,11 @@ struct blkfront_info
/* Save uncomplete reqs and bios for migration. */
struct list_head requests;
struct bio_list bio_list;
+   /* For dynamic configuration. */
+   unsigned int reconfiguring:1;
+   int new_max_indirect_segments;
+   int max_ring_page_order;
+   int max_queues;
 };
 
 static unsigned int nr_minors;
@@ -1350,6 +1355,31 @@ static void blkif_free(struct blkfront_info *info, int 
suspend)
for (i = 0; i < info->nr_rings; i++)
		blkif_free_ring(&info->rinfo[i]);
 
+   /* Remove old xenstore nodes. */
+   if (info->nr_ring_pages > 1)
+   xenbus_rm(XBT_NIL, info->xbdev->nodename, "ring-page-order");
+
+   if (info->nr_rings == 1) {
+   if (info->nr_ring_pages == 1) {
+   xenbus_rm(XBT_NIL, info->xbdev->nodename, "ring-ref");
+   } else {
+   for (i = 0; i < info->nr_ring_pages; i++) {
+   char ring_ref_name[RINGREF_NAME_LEN];
+
+   snprintf(ring_ref_name, RINGREF_NAME_LEN, 
"ring-ref%u", i);
+   xenbus_rm(XBT_NIL, info->xbdev->nodename, 
ring_ref_name);
+   }
+   }
+   } else {
+   xenbus_rm(XBT_NIL, info->xbdev->nodename, 
"multi-queue-num-queues");
+
+   for (i = 0; i < info->nr_rings; i++) {
+   char queuename[QUEUE_NAME_LEN];
+
+   snprintf(queuename, QUEUE_NAME_LEN, "queue-%u", i);
+   xenbus_rm(XBT_NIL, info->xbdev->nodename, queuename);
+   }
+   }
kfree(info->rinfo);
info->rinfo = NULL;
info->nr_rings = 0;
@@ -1763,15 +1793,21 @@ static int talk_to_blkback(struct xenbus_device *dev,
const char *message = NULL;
struct xenbus_transaction xbt;
int err;
-   unsigned int i, max_page_order = 0;
+   unsigned int i, backend_max_order = 0;
unsigned int ring_page_order = 0;
 
err = xenbus_scanf(XBT_NIL, info->xbdev->otherend,
-  "max-ring-page-order", "%u", _page_order);
+  "max-ring-page-order", "%u", _max_order);
if (err != 1)
info->nr_ring_pages = 1;
else {
-   ring_page_order = min(xen_blkif_max_ring_order, max_page_order);
+   if (info->max_ring_page_order) {
+   /* Dynamic configured through /sys. */
+   BUG_ON(info->max_ring_page_order > backend_max_order);
+   ring_page_order = info->max_ring_page_order;
+   } else
+   /* Default. */
+   ring_page_order = min(xen_blkif_max_ring_order, 
backend_max_order);
info->nr_ring_pages = 1 << ring_page_order;
}
 
@@ -1894,7 +1930,14 @@ static int negotiate_mq(struct blkfront_info *info)
if (err < 0)
backend_max_queues = 1;
 
-   info->nr_rings = min(backend_max_queues, xen_blkif_max_queues);
+   if (info->max_queues) {
+   /* Dynamic configured through /sys */
+   BUG_ON(info->max_queues > backend_max_queues);
+   info->nr_rings = info->max_queues;
+   } else
+   /* Default. */
+   info->nr_rings = min(backend_max_queue

Re: [Xen-devel] [PATCH 3/3] xen-blkfront: dynamic configuration of per-vbd resources

2016-07-25 Thread Bob Liu

On 07/25/2016 08:11 PM, Roger Pau Monné wrote:
> On Mon, Jul 25, 2016 at 07:08:36PM +0800, Bob Liu wrote:
>>
>> On 07/25/2016 06:53 PM, Roger Pau Monné wrote:
>> ..[snip..]
>>>>>>  * We get the device lock and blk_mq_freeze_queue() in 
>>>>>> dynamic_reconfig_device(),
>>>>>>and then have to release in blkif_recover() asynchronously which 
>>>>>> makes the code more difficult to readable.
>>>>>
>>>>> I'm not sure I'm following here, do you mean that you will pick the lock 
>>>>> in 
>>>>> dynamic_reconfig_device and release it in blkif_recover? Why wouldn't you 
>>>>
>>>> Yes.
>>>>
>>>>> release the lock in dynamic_reconfig_device after doing whatever is 
>>>>> needed?
>>>>>
>>>>
>>>> Both 'dynamic configuration' and migration:xenbus_dev_resume() use { 
>>>> blkfront_resume(); blkif_recover() } and depends on the change of 
>>>> xbdev->state.
>>>> If they happen simultaneously, the State machine of xbdev->state is going 
>>>> to be a mess and very difficult to track.
>>>>
>>>> The lock(actually a mutex) is like a big lock to make sure no race would 
>>>> happen at all.
>>>
>>> So the only thing that you should do is set the frontend state to closed 
>>> and 
>>> wait for the backend to also switch to state closed, and then switch the
>>> frontend state to init again in order to trigger a reconnection.
>>>
>>
>> But if migration:xenbus_dev_resume() is triggered at the same time, any
>> state be set might be changed.
>> =
>> E.g
>> Dynamic_reconfig_device:                Migration:xenbus_dev_resume()
>>
>> Set the frontend state to closed
>>                                         frontend state will be reset to
>>                                         XenbusStateInitialising by
>>                                         xenbus_dev_resume()
>>
>> Wait for the backend to also switch to state closed
> 
> This is not really a race, the backend will not switch to state closed, and 
> instead will appear at state init again, which is what we wanted anyway in 
> order to reconnect, so I don't see an issue with this flow.
> 
>> =
>> Similar situation may happen at any other place regarding set the state.
> 
> Right, but I don't see how your proposed patch also avoids this issues. I 
> see that you pick the device lock in dynamic_reconfig_device, but I don't 
> see xenbus_dev_resume picking the lock at all, and in any case I don't think 

The lock is taken by the power management framework.

Anyway, I'm convinced and will follow your suggestion.
Thank you!

> you should prevent the VM from migrating.
> 
> Depending on the toolstack it might decide to kill the VM because it has not 
> been able to migrate it, in which case the result is not better.
> 



Re: [Xen-devel] [PATCH 3/3] xen-blkfront: dynamic configuration of per-vbd resources

2016-07-25 Thread Bob Liu

On 07/25/2016 06:53 PM, Roger Pau Monné wrote:
..[snip..]
  * We get the device lock and blk_mq_freeze_queue() in 
 dynamic_reconfig_device(),
and then have to release in blkif_recover() asynchronously which makes 
 the code more difficult to readable.
>>>
>>> I'm not sure I'm following here, do you mean that you will pick the lock in 
>>> dynamic_reconfig_device and release it in blkif_recover? Why wouldn't you 
>>
>> Yes.
>>
>>> release the lock in dynamic_reconfig_device after doing whatever is needed?
>>>
>>
>> Both 'dynamic configuration' and migration:xenbus_dev_resume() use { 
>> blkfront_resume(); blkif_recover() } and depends on the change of 
>> xbdev->state.
>> If they happen simultaneously, the State machine of xbdev->state is going to 
>> be a mess and very difficult to track.
>>
>> The lock(actually a mutex) is like a big lock to make sure no race would 
>> happen at all.
> 
> So the only thing that you should do is set the frontend state to closed and 
> wait for the backend to also switch to state closed, and then switch the
> frontend state to init again in order to trigger a reconnection.
> 

But if migration (xenbus_dev_resume()) is triggered at the same time, any
state we set might be changed.
=
E.g.
Dynamic_reconfig_device:                Migration:xenbus_dev_resume()

Set the frontend state to closed
                                        frontend state will be reset to
                                        XenbusStateInitialising by
                                        xenbus_dev_resume()

Wait for the backend to also switch to state closed
=
A similar situation may happen at any other place where we set the state.

> You are right that all this process depends on the state being updated 
> correctly, but I don't see how's that different from a normal connection or 
> resume, and we don't seem to have races there.
> 



Re: [Xen-devel] [PATCH 3/3] xen-blkfront: dynamic configuration of per-vbd resources

2016-07-25 Thread Bob Liu

On 07/25/2016 05:20 PM, Roger Pau Monné wrote:
> On Sat, Jul 23, 2016 at 06:18:23AM +0800, Bob Liu wrote:
>>
>> On 07/22/2016 07:45 PM, Roger Pau Monné wrote:
>>> On Fri, Jul 22, 2016 at 05:43:32PM +0800, Bob Liu wrote:
>>>>
>>>> On 07/22/2016 05:34 PM, Roger Pau Monné wrote:
>>>>> On Fri, Jul 22, 2016 at 04:17:48PM +0800, Bob Liu wrote:
>>>>>>
>>>>>> On 07/22/2016 03:45 PM, Roger Pau Monné wrote:
>>>>>>> On Thu, Jul 21, 2016 at 06:08:05PM +0800, Bob Liu wrote:
>>>>>>>>
>>>>>>>> On 07/21/2016 04:57 PM, Roger Pau Monné wrote:
>>>>>> ..[snip]..
>>>>>>>>>> +
>>>>>>>>>> +static ssize_t dynamic_reconfig_device(struct blkfront_info *info, 
>>>>>>>>>> ssize_t count)
>>>>>>>>>> +{
>>>>>>>>>> +unsigned int i;
>>>>>>>>>> +int err = -EBUSY;
>>>>>>>>>> +
>>>>>>>>>> +/*
>>>>>>>>>> + * Make sure no migration in parallel, device lock is actually a
>>>>>>>>>> + * mutex.
>>>>>>>>>> + */
>>>>>>>>>> +if (!device_trylock(&info->xbdev->dev)) {
>>>>>>>>>> +pr_err("Fail to acquire dev:%s lock, may be in 
>>>>>>>>>> migration.\n",
>>>>>>>>>> +dev_name(&info->xbdev->dev));
>>>>>>>>>> +return err;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +/*
>>>>>>>>>> + * Prevent new requests and guarantee no uncompleted reqs.
>>>>>>>>>> + */
>>>>>>>>>> +blk_mq_freeze_queue(info->rq);
>>>>>>>>>> +if (part_in_flight(&info->gd->part0))
>>>>>>>>>> +goto out;
>>>>>>>>>> +
>>>>>>>>>> +/*
>>>>>>>>>> + * Front                             Backend
>>>>>>>>>> + * Switch to XenbusStateClosed
>>>>>>>>>> + *  frontend_changed():
>>>>>>>>>> + *   case XenbusStateClosed:
>>>>>>>>>> + *  
>>>>>>>>>> xen_blkif_disconnect()
>>>>>>>>>> + *  Switch to 
>>>>>>>>>> XenbusStateClosed
>>>>>>>>>> + * blkfront_resume():
>>>>>>>>>> + *  frontend_changed():
>>>>>>>>>> + *  reconnect
>>>>>>>>>> + * Wait until XenbusStateConnected
>>>>>>>>>> + */
>>>>>>>>>> +info->reconfiguring = true;
>>>>>>>>>> +xenbus_switch_state(info->xbdev, XenbusStateClosed);
>>>>>>>>>> +
>>>>>>>>>> +/* Poll every 100ms, 1 minute timeout. */
>>>>>>>>>> +for (i = 0; i < 600; i++) {
>>>>>>>>>> +/*
>>>>>>>>>> + * Wait backend enter XenbusStateClosed, 
>>>>>>>>>> blkback_changed()
>>>>>>>>>> + * will clear reconfiguring.
>>>>>>>>>> + */
>>>>>>>>>> +if (!info->reconfiguring)
>>>>>>>>>> +goto resume;
>>>>>>>>>> +schedule_timeout_interruptible(msecs_to_jiffies(100));
>>>>>>>>>> +}
>>>>>>>>>
>>>>>>>>> Instead of having this wait, could you just set info->reconfiguring = 
>>>>>>>>> 1, set 
>>>>>>>>> the frontend state to XenbusStateClosed and mimic exactly what a 
>>>>>>>>> resume from 

Re: [Xen-devel] [PATCH 3/3] xen-blkfront: dynamic configuration of per-vbd resources

2016-07-22 Thread Bob Liu

On 07/22/2016 07:45 PM, Roger Pau Monné wrote:
> On Fri, Jul 22, 2016 at 05:43:32PM +0800, Bob Liu wrote:
>>
>> On 07/22/2016 05:34 PM, Roger Pau Monné wrote:
>>> On Fri, Jul 22, 2016 at 04:17:48PM +0800, Bob Liu wrote:
>>>>
>>>> On 07/22/2016 03:45 PM, Roger Pau Monné wrote:
>>>>> On Thu, Jul 21, 2016 at 06:08:05PM +0800, Bob Liu wrote:
>>>>>>
>>>>>> On 07/21/2016 04:57 PM, Roger Pau Monné wrote:
>>>> ..[snip]..
>>>>>>>> +
>>>>>>>> +static ssize_t dynamic_reconfig_device(struct blkfront_info *info, 
>>>>>>>> ssize_t count)
>>>>>>>> +{
>>>>>>>> +  unsigned int i;
>>>>>>>> +  int err = -EBUSY;
>>>>>>>> +
>>>>>>>> +  /*
>>>>>>>> +   * Make sure no migration in parallel, device lock is actually a
>>>>>>>> +   * mutex.
>>>>>>>> +   */
>>>>>>>> +  if (!device_trylock(&info->xbdev->dev)) {
>>>>>>>> +  pr_err("Fail to acquire dev:%s lock, may be in 
>>>>>>>> migration.\n",
>>>>>>>> +  dev_name(&info->xbdev->dev));
>>>>>>>> +  return err;
>>>>>>>> +  }
>>>>>>>> +
>>>>>>>> +  /*
>>>>>>>> +   * Prevent new requests and guarantee no uncompleted reqs.
>>>>>>>> +   */
>>>>>>>> +  blk_mq_freeze_queue(info->rq);
>>>>>>>> +  if (part_in_flight(&info->gd->part0))
>>>>>>>> +  goto out;
>>>>>>>> +
>>>>>>>> +  /*
>>>>>>>> + * Front                              Backend
>>>>>>>> +   * Switch to XenbusStateClosed
>>>>>>>> +   *  frontend_changed():
>>>>>>>> +   *   case XenbusStateClosed:
>>>>>>>> +   *  
>>>>>>>> xen_blkif_disconnect()
>>>>>>>> +   *  Switch to 
>>>>>>>> XenbusStateClosed
>>>>>>>> +   * blkfront_resume():
>>>>>>>> +   *  frontend_changed():
>>>>>>>> +   *  reconnect
>>>>>>>> +   * Wait until XenbusStateConnected
>>>>>>>> +   */
>>>>>>>> +  info->reconfiguring = true;
>>>>>>>> +  xenbus_switch_state(info->xbdev, XenbusStateClosed);
>>>>>>>> +
>>>>>>>> +  /* Poll every 100ms, 1 minute timeout. */
>>>>>>>> +  for (i = 0; i < 600; i++) {
>>>>>>>> +  /*
>>>>>>>> +   * Wait backend enter XenbusStateClosed, 
>>>>>>>> blkback_changed()
>>>>>>>> +   * will clear reconfiguring.
>>>>>>>> +   */
>>>>>>>> +  if (!info->reconfiguring)
>>>>>>>> +  goto resume;
>>>>>>>> +  schedule_timeout_interruptible(msecs_to_jiffies(100));
>>>>>>>> +  }
>>>>>>>
>>>>>>> Instead of having this wait, could you just set info->reconfiguring = 
>>>>>>> 1, set 
>>>>>>> the frontend state to XenbusStateClosed and mimic exactly what a resume 
>>>>>>> from 
>>>>>>> suspension does? blkback_changed would have to set the frontend state 
>>>>>>> to 
>>>>>>> InitWait when it detects that the backend has switched to Closed, and 
>>>>>>> call 
>>>>>>> blkfront_resume.
>>>>>>
>>>>>>
>>>>>> I think that won't work.
>>>>>> In the real "resume" case, the power management system will trigger all 
>>>>>> ->resume() path.
>>>>>> But there is no pl

Re: [Xen-devel] [PATCH 3/3] xen-blkfront: dynamic configuration of per-vbd resources

2016-07-22 Thread Bob Liu

On 07/22/2016 05:34 PM, Roger Pau Monné wrote:
> On Fri, Jul 22, 2016 at 04:17:48PM +0800, Bob Liu wrote:
>>
>> On 07/22/2016 03:45 PM, Roger Pau Monné wrote:
>>> On Thu, Jul 21, 2016 at 06:08:05PM +0800, Bob Liu wrote:
>>>>
>>>> On 07/21/2016 04:57 PM, Roger Pau Monné wrote:
>> ..[snip]..
>>>>>> +
>>>>>> +static ssize_t dynamic_reconfig_device(struct blkfront_info *info, 
>>>>>> ssize_t count)
>>>>>> +{
>>>>>> +unsigned int i;
>>>>>> +int err = -EBUSY;
>>>>>> +
>>>>>> +/*
>>>>>> + * Make sure no migration in parallel, device lock is actually a
>>>>>> + * mutex.
>>>>>> + */
>>>>>> +if (!device_trylock(&info->xbdev->dev)) {
>>>>>> +pr_err("Fail to acquire dev:%s lock, may be in 
>>>>>> migration.\n",
>>>>>> +dev_name(&info->xbdev->dev));
>>>>>> +return err;
>>>>>> +}
>>>>>> +
>>>>>> +/*
>>>>>> + * Prevent new requests and guarantee no uncompleted reqs.
>>>>>> + */
>>>>>> +blk_mq_freeze_queue(info->rq);
>>>>>> +if (part_in_flight(&info->gd->part0))
>>>>>> +goto out;
>>>>>> +
>>>>>> +/*
>>>>>> + * Front                                Backend
>>>>>> + * Switch to XenbusStateClosed
>>>>>> + *  frontend_changed():
>>>>>> + *   case XenbusStateClosed:
>>>>>> + *  
>>>>>> xen_blkif_disconnect()
>>>>>> + *  Switch to 
>>>>>> XenbusStateClosed
>>>>>> + * blkfront_resume():
>>>>>> + *  frontend_changed():
>>>>>> + *  reconnect
>>>>>> + * Wait until XenbusStateConnected
>>>>>> + */
>>>>>> +info->reconfiguring = true;
>>>>>> +xenbus_switch_state(info->xbdev, XenbusStateClosed);
>>>>>> +
>>>>>> +/* Poll every 100ms, 1 minute timeout. */
>>>>>> +for (i = 0; i < 600; i++) {
>>>>>> +/*
>>>>>> + * Wait backend enter XenbusStateClosed, 
>>>>>> blkback_changed()
>>>>>> + * will clear reconfiguring.
>>>>>> + */
>>>>>> +if (!info->reconfiguring)
>>>>>> +goto resume;
>>>>>> +schedule_timeout_interruptible(msecs_to_jiffies(100));
>>>>>> +}
>>>>>
>>>>> Instead of having this wait, could you just set info->reconfiguring = 1, 
>>>>> set 
>>>>> the frontend state to XenbusStateClosed and mimic exactly what a resume 
>>>>> from 
>>>>> suspension does? blkback_changed would have to set the frontend state to 
>>>>> InitWait when it detects that the backend has switched to Closed, and 
>>>>> call 
>>>>> blkfront_resume.
>>>>
>>>>
>>>> I think that won't work.
>>>> In the real "resume" case, the power management system will trigger all 
>>>> ->resume() path.
>>>> But there is no place for dynamic configuration.
>>>
>>> Hello,
>>>
>>> I think it should be possible to set info->reconfiguring and wait for the 
>>> backend to switch to state Closed, at that point we should call 
>>> blkif_resume 
>>> (from blkback_changed) and the backend will follow the reconection.
>>>
>>
>> Okay, I get your point. Yes, that's an option.
>>
>> But this will make 'dynamic configuration' to be async, I'm worry about the 
>> end-user will get panic.
>> E.g
>> A end-user "echo  > /sys/devices/vbd-xxx/max_indirect_segs",
>> but then the device will be Closed and disappeared, the user have to wait 
>> for a random time so that the device can resume.
> 
> That should not happen, AFAICT on migration the device never dissapears. 

Oh, yes.

> alloc_disk and friends should not be called on resume from migration (see 
> the switch in blkfront_connect, you should take the BLKIF_STATE_SUSPENDED 
> path for the reconfiguration).
> 

What if the end-user starts I/O immediately after writing the new value to
/sys, while the resume is still in progress?

-- 
Regards,
-Bob



Re: [Xen-devel] [PATCH 3/3] xen-blkfront: dynamic configuration of per-vbd resources

2016-07-22 Thread Bob Liu

On 07/22/2016 03:45 PM, Roger Pau Monné wrote:
> On Thu, Jul 21, 2016 at 06:08:05PM +0800, Bob Liu wrote:
>>
>> On 07/21/2016 04:57 PM, Roger Pau Monné wrote:
..[snip]..
>>>> +
>>>> +static ssize_t dynamic_reconfig_device(struct blkfront_info *info, 
>>>> ssize_t count)
>>>> +{
>>>> +  unsigned int i;
>>>> +  int err = -EBUSY;
>>>> +
>>>> +  /*
>>>> +   * Make sure no migration in parallel, device lock is actually a
>>>> +   * mutex.
>>>> +   */
>>>> +  if (!device_trylock(&info->xbdev->dev)) {
>>>> +  pr_err("Fail to acquire dev:%s lock, may be in migration.\n",
>>>> +  dev_name(&info->xbdev->dev));
>>>> +  return err;
>>>> +  }
>>>> +
>>>> +  /*
>>>> +   * Prevent new requests and guarantee no uncompleted reqs.
>>>> +   */
>>>> +  blk_mq_freeze_queue(info->rq);
>>>> +  if (part_in_flight(&info->gd->part0))
>>>> +  goto out;
>>>> +
>>>> +  /*
>>>> +   * Front                                Backend
>>>> +   * Switch to XenbusStateClosed
>>>> +   *                                      frontend_changed():
>>>> +   *                                       case XenbusStateClosed:
>>>> +   *                                        xen_blkif_disconnect()
>>>> +   *                                        Switch to XenbusStateClosed
>>>> +   * blkfront_resume():
>>>> +   *  frontend_changed():
>>>> +   *  reconnect
>>>> +   * Wait until XenbusStateConnected
>>>> +   */
>>>> +  info->reconfiguring = true;
>>>> +  xenbus_switch_state(info->xbdev, XenbusStateClosed);
>>>> +
>>>> +  /* Poll every 100ms, 1 minute timeout. */
>>>> +  for (i = 0; i < 600; i++) {
>>>> +  /*
>>>> +   * Wait backend enter XenbusStateClosed, blkback_changed()
>>>> +   * will clear reconfiguring.
>>>> +   */
>>>> +  if (!info->reconfiguring)
>>>> +  goto resume;
>>>> +  schedule_timeout_interruptible(msecs_to_jiffies(100));
>>>> +  }
>>>
>>> Instead of having this wait, could you just set info->reconfiguring = 1, 
>>> set 
>>> the frontend state to XenbusStateClosed and mimic exactly what a resume 
>>> from 
>>> suspension does? blkback_changed would have to set the frontend state to 
>>> InitWait when it detects that the backend has switched to Closed, and call 
>>> blkfront_resume.
>>
>>
>> I think that won't work.
>> In the real "resume" case, the power management system triggers the whole
>> ->resume() path.
>> But there is no such trigger for dynamic configuration.
> 
> Hello,
> 
> I think it should be possible to set info->reconfiguring and wait for the
> backend to switch to state Closed; at that point we should call blkif_resume
> (from blkback_changed) and the backend will follow the reconnection.
> 

Okay, I get your point. Yes, that's an option.

But this will make 'dynamic configuration' asynchronous, and I'm worried the
end-user will panic.
E.g. an end-user runs "echo  > /sys/devices/vbd-xxx/max_indirect_segs",
but then the device will be Closed and disappear; the user has to wait for an
arbitrary time before the device can resume.
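
For what it's worth, a rough sketch of the suggestion above (illustrative
only, not a tested patch; info->reconfiguring and blkfront_resume() are from
this series, but this exact handling does not exist anywhere):

	/* In blkback_changed(): when the backend reaches Closed while a
	 * sysfs-triggered reconfiguration is pending, reuse the resume
	 * path instead of tearing the device down. */
	case XenbusStateClosed:
		if (info->reconfiguring) {
			info->reconfiguring = false;
			blkfront_resume(info->xbdev);	/* reconnect */
			break;
		}
		/* otherwise the existing Closed handling runs */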

-- 
Regards,
-Bob



Re: [Xen-devel] [PATCH 3/3] xen-blkfront: dynamic configuration of per-vbd resources

2016-07-21 Thread Bob Liu

On 07/21/2016 04:57 PM, Roger Pau Monné wrote:
> On Fri, Jul 15, 2016 at 05:31:49PM +0800, Bob Liu wrote:
>> The current VBD layer reserves buffer space for each attached device based on
>> three statically configured settings which are read at boot time.
>>  * max_indirect_segs: Maximum number of segments.
>>  * max_ring_page_order: Maximum order of pages to be used for the shared 
>> ring.
>>  * max_queues: Maximum number of queues (rings) to be used.
>>
>> But the storage backend, workload, and guest memory result in very different
>> tuning requirements. It's impossible to centrally predict application
>> characteristics, so it's best to allow the settings to be dynamically
>> adjusted based on the workload inside the guest.
>>
>> Usage:
>> Show current values:
>> cat /sys/devices/vbd-xxx/max_indirect_segs
>> cat /sys/devices/vbd-xxx/max_ring_page_order
>> cat /sys/devices/vbd-xxx/max_queues
>>
>> Write new values:
>> echo  > /sys/devices/vbd-xxx/max_indirect_segs
>> echo  > /sys/devices/vbd-xxx/max_ring_page_order
>> echo  > /sys/devices/vbd-xxx/max_queues
>>
>> Signed-off-by: Bob Liu <bob@oracle.com>
>> --
>> v2: Add device lock and other comments from Konrad.
>> ---
>>  drivers/block/xen-blkfront.c | 285 
>> ++-
>>  1 file changed, 283 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
>> index 10f46a8..9a5ed22 100644
>> --- a/drivers/block/xen-blkfront.c
>> +++ b/drivers/block/xen-blkfront.c
>> @@ -46,6 +46,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  
>>  #include 
>>  #include 
>> @@ -212,6 +213,11 @@ struct blkfront_info
>>  /* Save uncomplete reqs and bios for migration. */
>>  struct list_head requests;
>>  struct bio_list bio_list;
>> +/* For dynamic configuration. */
>> +unsigned int reconfiguring:1;
>> +int new_max_indirect_segments;
>> +int new_max_ring_page_order;
>> +int new_max_queues;
>>  };
>>  
>>  static unsigned int nr_minors;
>> @@ -1350,6 +1356,31 @@ static void blkif_free(struct blkfront_info *info, 
>> int suspend)
>>  for (i = 0; i < info->nr_rings; i++)
>>  blkif_free_ring(&info->rinfo[i]);
>>  
>> +/* Remove old xenstore nodes. */
>> +if (info->nr_ring_pages > 1)
>> +xenbus_rm(XBT_NIL, info->xbdev->nodename, "ring-page-order");
>> +
>> +if (info->nr_rings == 1) {
>> +if (info->nr_ring_pages == 1) {
>> +xenbus_rm(XBT_NIL, info->xbdev->nodename, "ring-ref");
>> +} else {
>> +for (i = 0; i < info->nr_ring_pages; i++) {
>> +char ring_ref_name[RINGREF_NAME_LEN];
>> +
>> +snprintf(ring_ref_name, RINGREF_NAME_LEN, 
>> "ring-ref%u", i);
>> +xenbus_rm(XBT_NIL, info->xbdev->nodename, 
>> ring_ref_name);
>> +}
>> +}
>> +} else {
>> +xenbus_rm(XBT_NIL, info->xbdev->nodename, 
>> "multi-queue-num-queues");
>> +
>> +for (i = 0; i < info->nr_rings; i++) {
>> +char queuename[QUEUE_NAME_LEN];
>> +
>> +snprintf(queuename, QUEUE_NAME_LEN, "queue-%u", i);
>> +xenbus_rm(XBT_NIL, info->xbdev->nodename, queuename);
>> +}
>> +}
>>  kfree(info->rinfo);
>>  info->rinfo = NULL;
>>  info->nr_rings = 0;
>> @@ -1772,6 +1803,10 @@ static int talk_to_blkback(struct xenbus_device *dev,
>>  info->nr_ring_pages = 1;
>>  else {
>>  ring_page_order = min(xen_blkif_max_ring_order, max_page_order);
>> +if (info->new_max_ring_page_order) {
> 
> Instead of calling this "new_max_ring_page_order", could you just call it 
> max_ring_page_order, initialize it to xen_blkif_max_ring_order by default 


Sure, I can do that.


> and use it everywhere instead of xen_blkif_max_ring_order?


But "xen_blkif_max_ring_order" still have to be used here, this is the only 
place "xen_blkif_max_ring_order" is used(except checking the value of it in 
xlblk_init()).


> 
>> +BUG_ON(info->new_max_ring_page_order > max_page_order);

Re: [Xen-devel] [PATCH 2/3] xen-blkfront: introduce blkif_set_queue_limits()

2016-07-21 Thread Bob Liu

On 07/21/2016 04:29 PM, Roger Pau Monné wrote:
> On Fri, Jul 15, 2016 at 05:31:48PM +0800, Bob Liu wrote:
>> blk_mq_update_nr_hw_queues() resets all queue limits to their defaults, which
>> is not what xen-blkfront expects; introduce blkif_set_queue_limits() to
>> restore the limits to their correct initial values.
> 
> Hm, great, and as usual in Linux there isn't even a comment in the function
> that explains what it is supposed to do, or what the side-effects of
> calling blk_mq_update_nr_hw_queues are.
>  
>> Signed-off-by: Bob Liu <bob@oracle.com>
>>
>>  drivers/block/xen-blkfront.c | 91 
>> 
>>  1 file changed, 50 insertions(+), 41 deletions(-)
>>
>> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
>> index 032fc94..10f46a8 100644
>> --- a/drivers/block/xen-blkfront.c
>> +++ b/drivers/block/xen-blkfront.c
>> @@ -189,6 +189,8 @@ struct blkfront_info
>>  struct mutex mutex;
>>  struct xenbus_device *xbdev;
>>  struct gendisk *gd;
>> +u16 sector_size;
>> +unsigned int physical_sector_size;
>>  int vdevice;
>>  blkif_vdev_t handle;
>>  enum blkif_state connected;
>> @@ -913,9 +915,45 @@ static struct blk_mq_ops blkfront_mq_ops = {
>>  .map_queue = blk_mq_map_queue,
>>  };
>>  
>> +static void blkif_set_queue_limits(struct blkfront_info *info)
>> +{
>> +struct request_queue *rq = info->rq;
>> +struct gendisk *gd = info->gd;
>> +unsigned int segments = info->max_indirect_segments ? :
>> +BLKIF_MAX_SEGMENTS_PER_REQUEST;
>> +
>> +queue_flag_set_unlocked(QUEUE_FLAG_VIRT, rq);
>> +
>> +if (info->feature_discard) {
>> +queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, rq);
>> +blk_queue_max_discard_sectors(rq, get_capacity(gd));
>> +rq->limits.discard_granularity = info->discard_granularity;
>> +rq->limits.discard_alignment = info->discard_alignment;
>> +if (info->feature_secdiscard)
>> +queue_flag_set_unlocked(QUEUE_FLAG_SECDISCARD, rq);
>> +}
> 
> AFAICT, at the point this function is called (in blkfront_resume), the 
> value of info->feature_discard is still from the old backend, maybe this 
> should be called from blkif_recover after blkfront_gather_backend_features?
> 

Thank you for pointing that out, it will be fixed.
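
For reference, a minimal sketch of the fix being agreed on here (illustrative
only, not the final patch): blkif_recover() gathers the new backend's features
first, so blkif_set_queue_limits() picks up the refreshed feature flags.

	/* In blkif_recover(): */
	blkfront_gather_backend_features(info);
	/* info->feature_discard etc. now describe the new backend. */
	blkif_set_queue_limits(info);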

-- 
Regards,
-Bob



Re: [Xen-devel] [RFC Design Doc v2] Add vNVDIMM support for Xen

2016-07-18 Thread Bob Liu
Hey Haozhong,

On 07/18/2016 08:29 AM, Haozhong Zhang wrote:
> Hi,
> 
> Following is version 2 of the design doc for supporting vNVDIMM in

This version is really good, very clear, and includes almost everything I'd like
to know.

> Xen. It's basically the summary of discussion on previous v1 design
> (https://lists.xenproject.org/archives/html/xen-devel/2016-02/msg6.html).
> Any comments are welcome. The corresponding patches are WIP.
> 

So are you (or Intel) going to write all the patches? Are there any tasks the
community can take part in?

[..snip..]
> 3. Usage Example of vNVDIMM in Xen
> 
>  Our design is to provide virtual pmem devices to HVM domains. The
>  virtual pmem devices are backed by host pmem devices.
> 
>  Dom0 Linux kernel can detect the host pmem devices and create
>  /dev/pmemXX for each detected devices. Users in Dom0 can then create
>  DAX file system on /dev/pmemXX and create several pre-allocate files
>  in the DAX file system.
> 
>  After setup the file system on the host pmem, users can add the
>  following lines in the xl configuration files to assign the host pmem
>  regions to domains:
>  vnvdimm = [ 'file=/dev/pmem0' ]
>  or
>  vnvdimm = [ 'file=/mnt/dax/pre_allocated_file' ]
> 

Could you please also consider the case where a driver domain gets involved?
E.g. vnvdimm = [ 'file=/dev/pmem0', backend='xxx' ]?

>   The first type of configuration assigns the entire pmem device
>   (/dev/pmem0) to the domain, while the second assigns the space
>   allocated to /mnt/dax/pre_allocated_file on the host pmem device to
>   the domain.
> 
..[snip..]
> 
> 4.2.2 Detection of Host pmem Devices
> 
>  The detection and initialize host pmem devices require a non-trivial
>  driver to interact with the corresponding ACPI namespace devices,
>  parse namespace labels and make necessary recovery actions. Instead
>  of duplicating the comprehensive Linux pmem driver in Xen hypervisor,
>  our designs leaves it to Dom0 Linux and let Dom0 Linux report
>  detected host pmem devices to Xen hypervisor.
> 
>  Our design takes following steps to detect host pmem devices when Xen
>  boots.
>  (1) As booting on bare metal, host pmem devices are detected by Dom0
>  Linux NVDIMM driver.
> 
>  (2) Our design extends the Linux NVDIMM driver to report the SPAs and sizes
>  of the pmem devices and reserved areas to Xen hypervisor via a
>  new hypercall.
> 
>  (3) Xen hypervisor then checks
>  - whether SPA and size of the newly reported pmem device is overlap
>with any previously reported pmem devices;
>  - whether the reserved area can fit in the pmem device and is
>large enough to hold page_info structs for itself.
> 
>  If any checks fail, the reported pmem device will be ignored by
>  Xen hypervisor and hence will not be used by any
>  guests. Otherwise, Xen hypervisor will record the reported
>  parameters and create page_info structs in the reserved area.
> 
>  (4) Because the reserved area is now used by Xen hypervisor, it
>  should not be accessible by Dom0 any more. Therefore, if a host
>  pmem device is recorded by Xen hypervisor, Xen will unmap its
>  reserved area from Dom0. Our design also needs to extend Linux
>  NVDIMM driver to "balloon out" the reserved area after it
>  successfully reports a pmem device to Xen hypervisor.
> 
> 4.2.3 Get Host Machine Address (SPA) of Host pmem Files
> 
>  Before a pmem file is assigned to a domain, we need to know the host
>  SPA ranges that are allocated to this file. We do this work in xl.
> 
>  If a pmem device /dev/pmem0 is given, xl will read
>  /sys/block/pmem0/device/{resource,size} respectively for the start
>  SPA and size of the pmem device.
> 
>  If a pre-allocated file /mnt/dax/file is given,
>  (1) xl first finds the host pmem device where /mnt/dax/file is. Then
>  it uses the method above to get the start SPA of the host pmem
>  device.
>  (2) xl then uses fiemap ioctl to get the extend mappings of
>  /mnt/dax/file, and adds the corresponding physical offsets and
>  lengths in each mapping entries to above start SPA to get the SPA
>  ranges pre-allocated for this file.
> 

Looks like PMEM can't be passed through to a driver domain directly like e.g.
PCI devices.

So suppose a driver domain is created with vnvdimm = [ 'file=/dev/pmem0' ], and
a DAX file system is made inside that driver domain.

Then new guests are created with vnvdimm = [ 'file=dax file in driver domain',
backend = 'driver domain' ].
Is this going to work? In my understanding, fiemap can only get the GPFN
instead of the real SPA of PMEM in this case.
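
For context, the extent lookup described in 4.2.3 boils down to something like
the following self-contained userspace sketch using the standard FS_IOC_FIEMAP
ioctl (illustrative only, not the actual xl code):

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

int main(int argc, char **argv)
{
	unsigned int i, max_extents = 32;
	struct fiemap *fm;
	int fd;

	if (argc < 2)
		return 1;
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	fm = calloc(1, sizeof(*fm) + max_extents * sizeof(struct fiemap_extent));
	fm->fm_length = FIEMAP_MAX_OFFSET;	/* map the whole file */
	fm->fm_extent_count = max_extents;
	if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
		perror("FS_IOC_FIEMAP");
		return 1;
	}
	/* xl would add each fe_physical to the device start SPA read from
	 * /sys/block/pmemX/device/resource to obtain the file's SPA ranges. */
	for (i = 0; i < fm->fm_mapped_extents; i++)
		printf("extent %u: physical 0x%llx length 0x%llx\n", i,
		       (unsigned long long)fm->fm_extents[i].fe_physical,
		       (unsigned long long)fm->fm_extents[i].fe_length);
	free(fm);
	close(fd);
	return 0;
}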


>  The resulting host SPA ranges will be passed to QEMU which allocates
>  guest address space for vNVDIMM devices and calls Xen hypervisor to
>  map the guest address to the host SPA ranges.
> 

Can Dom0 still access the same SPA range when Xen decides to assign it to a new
domU?
I assume the range will be unmapped automatically from Dom0 in the hypercall?
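
To make the step (2) reporting flow concrete, the hypercall argument might
carry something like the following (a purely hypothetical sketch; none of
these names exist in Xen or Linux):

/* Hypothetical argument for the pmem-reporting hypercall of 4.2.2 step (2);
 * all names and the layout are illustrative only. */
struct xen_pmem_report {
	uint64_t spa;		/* start SPA of the detected pmem region */
	uint64_t size;		/* size of the pmem region, in bytes */
	uint64_t rsv_spa;	/* start SPA of the reserved area */
	uint64_t rsv_size;	/* must fit the page_info structs for the region */
};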


[Xen-devel] [PATCH 1/3] xen-blkfront: fix places not updated after introducing 64KB page granularity

2016-07-15 Thread Bob Liu
Two places didn't get updated when 64KB page granularity was introduced; this
patch fixes them.

Signed-off-by: Bob Liu <bob@oracle.com>
---
 drivers/block/xen-blkfront.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index fcc5b4e..032fc94 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -1321,7 +1321,7 @@ free_shadow:
rinfo->ring_ref[i] = GRANT_INVALID_REF;
}
}
-   free_pages((unsigned long)rinfo->ring.sring, 
get_order(info->nr_ring_pages * PAGE_SIZE));
+   free_pages((unsigned long)rinfo->ring.sring, 
get_order(info->nr_ring_pages * XEN_PAGE_SIZE));
rinfo->ring.sring = NULL;
 
if (rinfo->irq)
@@ -2013,7 +2013,7 @@ static int blkif_recover(struct blkfront_info *info)
 
blkfront_gather_backend_features(info);
segs = info->max_indirect_segments ? : BLKIF_MAX_SEGMENTS_PER_REQUEST;
-   blk_queue_max_segments(info->rq, segs);
+   blk_queue_max_segments(info->rq, segs / GRANTS_PER_PSEG);
 
for (r_index = 0; r_index < info->nr_rings; r_index++) {
struct blkfront_ring_info *rinfo = >rinfo[r_index];
-- 
2.7.4




[Xen-devel] [PATCH 3/3] xen-blkfront: dynamic configuration of per-vbd resources

2016-07-15 Thread Bob Liu
The current VBD layer reserves buffer space for each attached device based on
three statically configured settings which are read at boot time.
 * max_indirect_segs: Maximum number of segments.
 * max_ring_page_order: Maximum order of pages to be used for the shared ring.
 * max_queues: Maximum number of queues (rings) to be used.

But the storage backend, workload, and guest memory result in very different
tuning requirements. It's impossible to centrally predict application
characteristics, so it's best to allow the settings to be dynamically
adjusted based on the workload inside the guest.

Usage:
Show current values:
cat /sys/devices/vbd-xxx/max_indirect_segs
cat /sys/devices/vbd-xxx/max_ring_page_order
cat /sys/devices/vbd-xxx/max_queues

Write new values:
echo  > /sys/devices/vbd-xxx/max_indirect_segs
echo  > /sys/devices/vbd-xxx/max_ring_page_order
echo  > /sys/devices/vbd-xxx/max_queues

Signed-off-by: Bob Liu <bob@oracle.com>
--
v2: Add device lock and other comments from Konrad.
---
 drivers/block/xen-blkfront.c | 285 ++-
 1 file changed, 283 insertions(+), 2 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 10f46a8..9a5ed22 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -46,6 +46,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -212,6 +213,11 @@ struct blkfront_info
/* Save uncomplete reqs and bios for migration. */
struct list_head requests;
struct bio_list bio_list;
+   /* For dynamic configuration. */
+   unsigned int reconfiguring:1;
+   int new_max_indirect_segments;
+   int new_max_ring_page_order;
+   int new_max_queues;
 };
 
 static unsigned int nr_minors;
@@ -1350,6 +1356,31 @@ static void blkif_free(struct blkfront_info *info, int 
suspend)
for (i = 0; i < info->nr_rings; i++)
	blkif_free_ring(&info->rinfo[i]);
 
+   /* Remove old xenstore nodes. */
+   if (info->nr_ring_pages > 1)
+   xenbus_rm(XBT_NIL, info->xbdev->nodename, "ring-page-order");
+
+   if (info->nr_rings == 1) {
+   if (info->nr_ring_pages == 1) {
+   xenbus_rm(XBT_NIL, info->xbdev->nodename, "ring-ref");
+   } else {
+   for (i = 0; i < info->nr_ring_pages; i++) {
+   char ring_ref_name[RINGREF_NAME_LEN];
+
+   snprintf(ring_ref_name, RINGREF_NAME_LEN, 
"ring-ref%u", i);
+   xenbus_rm(XBT_NIL, info->xbdev->nodename, 
ring_ref_name);
+   }
+   }
+   } else {
+   xenbus_rm(XBT_NIL, info->xbdev->nodename, 
"multi-queue-num-queues");
+
+   for (i = 0; i < info->nr_rings; i++) {
+   char queuename[QUEUE_NAME_LEN];
+
+   snprintf(queuename, QUEUE_NAME_LEN, "queue-%u", i);
+   xenbus_rm(XBT_NIL, info->xbdev->nodename, queuename);
+   }
+   }
kfree(info->rinfo);
info->rinfo = NULL;
info->nr_rings = 0;
@@ -1772,6 +1803,10 @@ static int talk_to_blkback(struct xenbus_device *dev,
info->nr_ring_pages = 1;
else {
ring_page_order = min(xen_blkif_max_ring_order, max_page_order);
+   if (info->new_max_ring_page_order) {
+   BUG_ON(info->new_max_ring_page_order > max_page_order);
+   ring_page_order = info->new_max_ring_page_order;
+   }
info->nr_ring_pages = 1 << ring_page_order;
}
 
@@ -1895,6 +1930,10 @@ static int negotiate_mq(struct blkfront_info *info)
backend_max_queues = 1;
 
info->nr_rings = min(backend_max_queues, xen_blkif_max_queues);
+   if (info->new_max_queues) {
+   BUG_ON(info->new_max_queues > backend_max_queues);
+   info->nr_rings = info->new_max_queues;
+   }
/* We need at least one ring. */
if (!info->nr_rings)
info->nr_rings = 1;
@@ -2352,11 +2391,227 @@ static void blkfront_gather_backend_features(struct 
blkfront_info *info)
NULL);
if (err)
info->max_indirect_segments = 0;
-   else
+   else {
info->max_indirect_segments = min(indirect_segments,
  xen_blkif_max_segments);
+   if (info->new_max_indirect_segments) {
+   BUG_ON(info->new_max_indirect_segments > 
indirect_segments);
+   info->max_indirect_segments = 
info->new_max_indirect_segments;
+   }
+   }
+}
+
+static ssize_t m

[Xen-devel] [PATCH 2/3] xen-blkfront: introduce blkif_set_queue_limits()

2016-07-15 Thread Bob Liu
blk_mq_update_nr_hw_queues() resets all queue limits to their defaults, which is
not what xen-blkfront expects; introduce blkif_set_queue_limits() to restore the
limits to their correct initial values.

Signed-off-by: Bob Liu <bob@oracle.com>
---
 drivers/block/xen-blkfront.c | 91 
 1 file changed, 50 insertions(+), 41 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 032fc94..10f46a8 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -189,6 +189,8 @@ struct blkfront_info
struct mutex mutex;
struct xenbus_device *xbdev;
struct gendisk *gd;
+   u16 sector_size;
+   unsigned int physical_sector_size;
int vdevice;
blkif_vdev_t handle;
enum blkif_state connected;
@@ -913,9 +915,45 @@ static struct blk_mq_ops blkfront_mq_ops = {
.map_queue = blk_mq_map_queue,
 };
 
+static void blkif_set_queue_limits(struct blkfront_info *info)
+{
+   struct request_queue *rq = info->rq;
+   struct gendisk *gd = info->gd;
+   unsigned int segments = info->max_indirect_segments ? :
+   BLKIF_MAX_SEGMENTS_PER_REQUEST;
+
+   queue_flag_set_unlocked(QUEUE_FLAG_VIRT, rq);
+
+   if (info->feature_discard) {
+   queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, rq);
+   blk_queue_max_discard_sectors(rq, get_capacity(gd));
+   rq->limits.discard_granularity = info->discard_granularity;
+   rq->limits.discard_alignment = info->discard_alignment;
+   if (info->feature_secdiscard)
+   queue_flag_set_unlocked(QUEUE_FLAG_SECDISCARD, rq);
+   }
+
+   /* Hard sector size and max sectors impersonate the equiv. hardware. */
+   blk_queue_logical_block_size(rq, info->sector_size);
+   blk_queue_physical_block_size(rq, info->physical_sector_size);
+   blk_queue_max_hw_sectors(rq, (segments * XEN_PAGE_SIZE) / 512);
+
+   /* Each segment in a request is up to an aligned page in size. */
+   blk_queue_segment_boundary(rq, PAGE_SIZE - 1);
+   blk_queue_max_segment_size(rq, PAGE_SIZE);
+
+   /* Ensure a merged request will fit in a single I/O ring slot. */
+   blk_queue_max_segments(rq, segments / GRANTS_PER_PSEG);
+
+   /* Make sure buffer addresses are sector-aligned. */
+   blk_queue_dma_alignment(rq, 511);
+
+   /* Make sure we don't use bounce buffers. */
+   blk_queue_bounce_limit(rq, BLK_BOUNCE_ANY);
+}
+
 static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size,
-   unsigned int physical_sector_size,
-   unsigned int segments)
+   unsigned int physical_sector_size)
 {
struct request_queue *rq;
struct blkfront_info *info = gd->private_data;
@@ -947,37 +985,11 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 
sector_size,
}
 
rq->queuedata = info;
-   queue_flag_set_unlocked(QUEUE_FLAG_VIRT, rq);
-
-   if (info->feature_discard) {
-   queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, rq);
-   blk_queue_max_discard_sectors(rq, get_capacity(gd));
-   rq->limits.discard_granularity = info->discard_granularity;
-   rq->limits.discard_alignment = info->discard_alignment;
-   if (info->feature_secdiscard)
-   queue_flag_set_unlocked(QUEUE_FLAG_SECDISCARD, rq);
-   }
-
-   /* Hard sector size and max sectors impersonate the equiv. hardware. */
-   blk_queue_logical_block_size(rq, sector_size);
-   blk_queue_physical_block_size(rq, physical_sector_size);
-   blk_queue_max_hw_sectors(rq, (segments * XEN_PAGE_SIZE) / 512);
-
-   /* Each segment in a request is up to an aligned page in size. */
-   blk_queue_segment_boundary(rq, PAGE_SIZE - 1);
-   blk_queue_max_segment_size(rq, PAGE_SIZE);
-
-   /* Ensure a merged request will fit in a single I/O ring slot. */
-   blk_queue_max_segments(rq, segments / GRANTS_PER_PSEG);
-
-   /* Make sure buffer addresses are sector-aligned. */
-   blk_queue_dma_alignment(rq, 511);
-
-   /* Make sure we don't use bounce buffers. */
-   blk_queue_bounce_limit(rq, BLK_BOUNCE_ANY);
-
-   gd->queue = rq;
-
+   info->rq = gd->queue = rq;
+   info->gd = gd;
+   info->sector_size = sector_size;
+   info->physical_sector_size = physical_sector_size;
+   blkif_set_queue_limits(info);
return 0;
 }
 
@@ -1142,16 +1154,11 @@ static int xlvbd_alloc_gendisk(blkif_sector_t capacity,
gd->driverfs_dev = &(info->xbdev->dev);
set_capacity(gd, capacity);
 
-   if (xlvbd_init_blk_queue(gd, sector_size, physical_sector_size,
-i

Re: [Xen-devel] [BUG] kernel BUG at drivers/block/xen-blkfront.c:1711

2016-07-14 Thread Bob Liu

On 07/14/2016 07:49 PM, Evgenii Shatokhin wrote:
> On 11.07.2016 15:04, Bob Liu wrote:
>>
>>
>> On 07/11/2016 04:50 PM, Evgenii Shatokhin wrote:
>>> On 06.06.2016 11:42, Dario Faggioli wrote:
>>>> Just Cc-ing some Linux, block, and Xen on CentOS people...
>>>>
>>>
>>> Ping.
>>>
>>> Any suggestions how to debug this or what might cause the problem?
>>>
>>> Obviously, we cannot control Xen on the Amazon's servers. But perhaps there 
>>> is something we can do at the kernel's side, is it?
>>>
>>>> On Mon, 2016-06-06 at 11:24 +0300, Evgenii Shatokhin wrote:
>>>>> (Resending this bug report because the message I sent last week did
>>>>> not
>>>>> make it to the mailing list somehow.)
>>>>>
>>>>> Hi,
>>>>>
>>>>> One of our users gets kernel panics from time to time when he tries
>>>>> to
>>>>> use his Amazon EC2 instance with CentOS7 x64 in it [1]. Kernel panic
>>>>> happens within minutes from the moment the instance starts. The
>>>>> problem
>>>>> does not show up every time, however.
>>>>>
>>>>> The user first observed the problem with a custom kernel, but it was
>>>>> found later that the stock kernel 3.10.0-327.18.2.el7.x86_64 from
>>>>> CentOS7 was affected as well.
>>
>> Please try this patch:
>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=7b0767502b5db11cb1f0daef2d01f6d71b1192dc
>>
>> Regards,
>> Bob
>>
> 
> Unfortunately, it did not help. The same BUG_ON() in 
> blkfront_setup_indirect() still triggers in our kernel based on RHEL's 
> 3.10.0-327.18.2, where I added the patch.
> 
> As far as I can see, the patch makes sure the indirect pages are added to the 
> list only if (!info->feature_persistent) holds. I suppose it holds in our 
> case and the pages are added to the list because the triggered BUG_ON() is 
> here:
> 
> if (!info->feature_persistent && info->max_indirect_segments) {
> <...>
> BUG_ON(!list_empty(>indirect_pages));
> <...>
> }
> 

That's odd.
Could you please try to reproduce this issue with a recent upstream kernel?

Thanks,
Bob

> So the problem is still out there somewhere, it seems.
> 
> Regards,
> Evgenii
> 
>>>>>
>>>>> The part of the system log he was able to retrieve is attached. Here
>>>>> is
>>>>> the bug info, for convenience:
>>>>>
>>>>> 
>>>>> [2.246912] kernel BUG at drivers/block/xen-blkfront.c:1711!
>>>>> [2.246912] invalid opcode:  [#1] SMP
>>>>> [2.246912] Modules linked in: ata_generic pata_acpi
>>>>> crct10dif_pclmul
>>>>> crct10dif_common crc32_pclmul crc32c_intel ghash_clmulni_intel
>>>>> xen_netfront xen_blkfront(+) aesni_intel lrw ata_piix gf128mul
>>>>> glue_helper ablk_helper cryptd libata serio_raw floppy sunrpc
>>>>> dm_mirror
>>>>> dm_region_hash dm_log dm_mod scsi_transport_iscsi
>>>>> [2.246912] CPU: 1 PID: 50 Comm: xenwatch Not tainted
>>>>> 3.10.0-327.18.2.el7.x86_64 #1
>>>>> [2.246912] Hardware name: Xen HVM domU, BIOS 4.2.amazon
>>>>> 12/07/2015
>>>>> [2.246912] task: 8800e9fcb980 ti: 8800e98bc000 task.ti:
>>>>> 8800e98bc000
>>>>> [2.246912] RIP: 0010:[]  []
>>>>> blkfront_setup_indirect+0x41f/0x430 [xen_blkfront]
>>>>> [2.246912] RSP: 0018:8800e98bfcd0  EFLAGS: 00010283
>>>>> [2.246912] RAX: 8800353e15c0 RBX: 8800e98c52c8 RCX:
>>>>> 0020
>>>>> [2.246912] RDX: 8800353e15b0 RSI: 8800e98c52b8 RDI:
>>>>> 8800353e15d0
>>>>> [2.246912] RBP: 8800e98bfd20 R08: 8800353e15b0 R09:
>>>>> 8800eb403c00
>>>>> [2.246912] R10: a0155532 R11: ffe8 R12:
>>>>> 8800e98c4000
>>>>> [2.246912] R13: 8800e98c52b8 R14: 0020 R15:
>>>>> 8800353e15c0
>>>>> [2.246912] FS:  () GS:8800efc2()
>>>>> knlGS:
>>>>> [2.246912] CS:  0010 DS:  ES:  CR0: 80050033
>>>>> [2.2469

Re: [Xen-devel] [BUG] kernel BUG at drivers/block/xen-blkfront.c:1711

2016-07-11 Thread Bob Liu


On 07/11/2016 04:50 PM, Evgenii Shatokhin wrote:
> On 06.06.2016 11:42, Dario Faggioli wrote:
>> Just Cc-ing some Linux, block, and Xen on CentOS people...
>>
> 
> Ping.
> 
> Any suggestions how to debug this or what might cause the problem?
> 
> Obviously, we cannot control Xen on the Amazon's servers. But perhaps there 
> is something we can do at the kernel's side, is it?
> 
>> On Mon, 2016-06-06 at 11:24 +0300, Evgenii Shatokhin wrote:
>>> (Resending this bug report because the message I sent last week did
>>> not
>>> make it to the mailing list somehow.)
>>>
>>> Hi,
>>>
>>> One of our users gets kernel panics from time to time when he tries
>>> to
>>> use his Amazon EC2 instance with CentOS7 x64 in it [1]. Kernel panic
>>> happens within minutes from the moment the instance starts. The
>>> problem
>>> does not show up every time, however.
>>>
>>> The user first observed the problem with a custom kernel, but it was
>>> found later that the stock kernel 3.10.0-327.18.2.el7.x86_64 from
>>> CentOS7 was affected as well.

Please try this patch:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=7b0767502b5db11cb1f0daef2d01f6d71b1192dc

Regards,
Bob

>>>
>>> The part of the system log he was able to retrieve is attached. Here
>>> is
>>> the bug info, for convenience:
>>>
>>> 
>>> [2.246912] kernel BUG at drivers/block/xen-blkfront.c:1711!
>>> [2.246912] invalid opcode:  [#1] SMP
>>> [2.246912] Modules linked in: ata_generic pata_acpi
>>> crct10dif_pclmul
>>> crct10dif_common crc32_pclmul crc32c_intel ghash_clmulni_intel
>>> xen_netfront xen_blkfront(+) aesni_intel lrw ata_piix gf128mul
>>> glue_helper ablk_helper cryptd libata serio_raw floppy sunrpc
>>> dm_mirror
>>> dm_region_hash dm_log dm_mod scsi_transport_iscsi
>>> [2.246912] CPU: 1 PID: 50 Comm: xenwatch Not tainted
>>> 3.10.0-327.18.2.el7.x86_64 #1
>>> [2.246912] Hardware name: Xen HVM domU, BIOS 4.2.amazon
>>> 12/07/2015
>>> [2.246912] task: 8800e9fcb980 ti: 8800e98bc000 task.ti:
>>> 8800e98bc000
>>> [2.246912] RIP: 0010:[]  []
>>> blkfront_setup_indirect+0x41f/0x430 [xen_blkfront]
>>> [2.246912] RSP: 0018:8800e98bfcd0  EFLAGS: 00010283
>>> [2.246912] RAX: 8800353e15c0 RBX: 8800e98c52c8 RCX:
>>> 0020
>>> [2.246912] RDX: 8800353e15b0 RSI: 8800e98c52b8 RDI:
>>> 8800353e15d0
>>> [2.246912] RBP: 8800e98bfd20 R08: 8800353e15b0 R09:
>>> 8800eb403c00
>>> [2.246912] R10: a0155532 R11: ffe8 R12:
>>> 8800e98c4000
>>> [2.246912] R13: 8800e98c52b8 R14: 0020 R15:
>>> 8800353e15c0
>>> [2.246912] FS:  () GS:8800efc2()
>>> knlGS:
>>> [2.246912] CS:  0010 DS:  ES:  CR0: 80050033
>>> [2.246912] CR2: 7f1b615ef000 CR3: e2b44000 CR4:
>>> 001406e0
>>> [2.246912] DR0:  DR1:  DR2:
>>> 
>>> [2.246912] DR3:  DR6: 0ff0 DR7:
>>> 0400
>>> [2.246912] Stack:
>>> [2.246912]  0020 0001 0020a0157217
>>> 0100e98bfdbc
>>> [2.246912]  27efa3ef 8800e98bfdbc 8800e98ce000
>>> 8800e98c4000
>>> [2.246912]  8800e98ce040 0001 8800e98bfe08
>>> a0155d4c
>>> [2.246912] Call Trace:
>>> [2.246912]  [] blkback_changed+0x4ec/0xfc8
>>> [xen_blkfront]
>>> [2.246912]  [] ? xenbus_gather+0x170/0x190
>>> [2.246912]  [] ? __slab_free+0x10e/0x277
>>> [2.246912]  []
>>> xenbus_otherend_changed+0xad/0x110
>>> [2.246912]  [] ? xenwatch_thread+0x77/0x180
>>> [2.246912]  [] backend_changed+0x13/0x20
>>> [2.246912]  [] xenwatch_thread+0x66/0x180
>>> [2.246912]  [] ? wake_up_atomic_t+0x30/0x30
>>> [2.246912]  [] ?
>>> unregister_xenbus_watch+0x1f0/0x1f0
>>> [2.246912]  [] kthread+0xcf/0xe0
>>> [2.246912]  [] ?
>>> kthread_create_on_node+0x140/0x140
>>> [2.246912]  [] ret_from_fork+0x58/0x90
>>> [2.246912]  [] ?
>>> kthread_create_on_node+0x140/0x140
>>> [2.246912] Code: e1 48 85 c0 75 ce 49 8d 84 24 40 01 00 00 48 89
>>> 45
>>> b8 e9 91 fd ff ff 4c 89 ff e8 8d ae 06 e1 e9 f2 fc ff ff 31 c0 e9 2e
>>> fe
>>> ff ff <0f> 0b e8 9a 57 f2 e0 0f 0b 0f 1f 84 00 00 00 00 00 0f 1f 44
>>> 00
>>> [2.246912] RIP  []
>>> blkfront_setup_indirect+0x41f/0x430 [xen_blkfront]
>>> [2.246912]  RSP 
>>> [2.491574] ---[ end trace 8a9b992812627c71 ]---
>>> [2.495618] Kernel panic - not syncing: Fatal exception
>>> 
>>>
>>> Xen version 4.2.
>>>
>>> EC2 instance type: c3.large with EBS magnetic storage, if that
>>> matters.
>>>
>>> Here is the code where the BUG_ON triggers (drivers/block/xen-
>>> blkfront.c):
>>> 
>>> if (!info->feature_persistent && info->max_indirect_segments) {
>>>   

Re: [Xen-devel] [PATCH] xen-blkfront: save uncompleted reqs in blkfront_resume()

2016-06-27 Thread Bob Liu

On 06/27/2016 04:33 PM, Bob Liu wrote:
> Uncompleted reqs used to be 'saved and resubmitted' in blkfront_recover() 
> during
> migration, but that's too late now that multi-queue has been introduced.
> 
> After a migrate to another host (which may not have multiqueue support), the
> number of rings (block hardware queues) may be changed and the ring and shadow
> structure will also be reallocated.
> So blkfront_recover() can't 'save and resubmit' the real uncompleted reqs
> because the shadow structure has been reallocated.
> 
> This patch fixes this issue by moving the 'save and resubmit' logic out of

Fix: just moved the 'save' logic to an earlier place, blkfront_resume(); the
'resubmit' logic is unchanged and still in blkfront_recover().

> blkfront_recover() to an earlier place: blkfront_resume().
> 
> Signed-off-by: Bob Liu <bob@oracle.com>
> ---
>  drivers/block/xen-blkfront.c | 91 
> +++-
>  1 file changed, 40 insertions(+), 51 deletions(-)
> 
> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> index 2e6d1e9..fcc5b4e 100644
> --- a/drivers/block/xen-blkfront.c
> +++ b/drivers/block/xen-blkfront.c
> @@ -207,6 +207,9 @@ struct blkfront_info
>   struct blk_mq_tag_set tag_set;
>   struct blkfront_ring_info *rinfo;
>   unsigned int nr_rings;
> + /* Save uncomplete reqs and bios for migration. */
> + struct list_head requests;
> + struct bio_list bio_list;
>  };
>  
>  static unsigned int nr_minors;
> @@ -2002,69 +2005,22 @@ static int blkif_recover(struct blkfront_info *info)
>  {
>   unsigned int i, r_index;
>   struct request *req, *n;
> - struct blk_shadow *copy;
>   int rc;
>   struct bio *bio, *cloned_bio;
> - struct bio_list bio_list, merge_bio;
>   unsigned int segs, offset;
>   int pending, size;
>   struct split_bio *split_bio;
> - struct list_head requests;
>  
>   blkfront_gather_backend_features(info);
>   segs = info->max_indirect_segments ? : BLKIF_MAX_SEGMENTS_PER_REQUEST;
>   blk_queue_max_segments(info->rq, segs);
> - bio_list_init(&bio_list);
> - INIT_LIST_HEAD(&requests);
>  
>   for (r_index = 0; r_index < info->nr_rings; r_index++) {
> - struct blkfront_ring_info *rinfo;
> -
> - rinfo = &info->rinfo[r_index];
> - /* Stage 1: Make a safe copy of the shadow state. */
> - copy = kmemdup(rinfo->shadow, sizeof(rinfo->shadow),
> -GFP_NOIO | __GFP_REPEAT | __GFP_HIGH);
> - if (!copy)
> - return -ENOMEM;
> -
> - /* Stage 2: Set up free list. */
> - memset(&rinfo->shadow, 0, sizeof(rinfo->shadow));
> - for (i = 0; i < BLK_RING_SIZE(info); i++)
> - rinfo->shadow[i].req.u.rw.id = i+1;
> - rinfo->shadow_free = rinfo->ring.req_prod_pvt;
> - rinfo->shadow[BLK_RING_SIZE(info)-1].req.u.rw.id = 0x0fff;
> + struct blkfront_ring_info *rinfo = &info->rinfo[r_index];
>  
>   rc = blkfront_setup_indirect(rinfo);
> - if (rc) {
> - kfree(copy);
> + if (rc)
>   return rc;
> - }
> -
> - for (i = 0; i < BLK_RING_SIZE(info); i++) {
> - /* Not in use? */
> - if (!copy[i].request)
> - continue;
> -
> - /*
> -  * Get the bios in the request so we can re-queue them.
> -  */
> - if (copy[i].request->cmd_flags &
> - (REQ_FLUSH | REQ_FUA | REQ_DISCARD | REQ_SECURE)) {
> - /*
> -  * Flush operations don't contain bios, so
> -  * we need to requeue the whole request
> -  */
> - list_add(&copy[i].request->queuelist, &requests);
> - continue;
> - }
> - merge_bio.head = copy[i].request->bio;
> - merge_bio.tail = copy[i].request->biotail;
> - bio_list_merge(&bio_list, &merge_bio);
> - copy[i].request->bio = NULL;
> - blk_end_request_all(copy[i].request, 0);
> - }
> -
> - kfree(copy);
>   }
>   xenbus_switch_state(info->xbdev, XenbusStateConnected);
>  
> @@ -2079,7 +2035,7 @@ static int blkif_recover(struct blkfront_info *info)
>   kick_pend

[Xen-devel] [PATCH] xen-blkfront: save uncompleted reqs in blkfront_resume()

2016-06-27 Thread Bob Liu
Uncompleted reqs used to be 'saved and resubmitted' in blkfront_recover() during
migration, but that's too late now that multi-queue has been introduced.

After a migrate to another host (which may not have multiqueue support), the
number of rings (block hardware queues) may be changed and the ring and shadow
structure will also be reallocated.
So blkfront_recover() can't 'save and resubmit' the real uncompleted reqs
because the shadow structure has been reallocated.

This patch fixes this issue by moving the 'save and resubmit' logic out of
blkfront_recover() to an earlier place: blkfront_resume().

Signed-off-by: Bob Liu <bob@oracle.com>
---
 drivers/block/xen-blkfront.c | 91 +++-
 1 file changed, 40 insertions(+), 51 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 2e6d1e9..fcc5b4e 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -207,6 +207,9 @@ struct blkfront_info
struct blk_mq_tag_set tag_set;
struct blkfront_ring_info *rinfo;
unsigned int nr_rings;
+   /* Save uncomplete reqs and bios for migration. */
+   struct list_head requests;
+   struct bio_list bio_list;
 };
 
 static unsigned int nr_minors;
@@ -2002,69 +2005,22 @@ static int blkif_recover(struct blkfront_info *info)
 {
unsigned int i, r_index;
struct request *req, *n;
-   struct blk_shadow *copy;
int rc;
struct bio *bio, *cloned_bio;
-   struct bio_list bio_list, merge_bio;
unsigned int segs, offset;
int pending, size;
struct split_bio *split_bio;
-   struct list_head requests;
 
blkfront_gather_backend_features(info);
segs = info->max_indirect_segments ? : BLKIF_MAX_SEGMENTS_PER_REQUEST;
blk_queue_max_segments(info->rq, segs);
-   bio_list_init(&bio_list);
-   INIT_LIST_HEAD(&requests);
 
for (r_index = 0; r_index < info->nr_rings; r_index++) {
-   struct blkfront_ring_info *rinfo;
-
-   rinfo = &info->rinfo[r_index];
-   /* Stage 1: Make a safe copy of the shadow state. */
-   copy = kmemdup(rinfo->shadow, sizeof(rinfo->shadow),
-  GFP_NOIO | __GFP_REPEAT | __GFP_HIGH);
-   if (!copy)
-   return -ENOMEM;
-
-   /* Stage 2: Set up free list. */
-   memset(&rinfo->shadow, 0, sizeof(rinfo->shadow));
-   for (i = 0; i < BLK_RING_SIZE(info); i++)
-   rinfo->shadow[i].req.u.rw.id = i+1;
-   rinfo->shadow_free = rinfo->ring.req_prod_pvt;
-   rinfo->shadow[BLK_RING_SIZE(info)-1].req.u.rw.id = 0x0fff;
+   struct blkfront_ring_info *rinfo = &info->rinfo[r_index];
 
rc = blkfront_setup_indirect(rinfo);
-   if (rc) {
-   kfree(copy);
+   if (rc)
return rc;
-   }
-
-   for (i = 0; i < BLK_RING_SIZE(info); i++) {
-   /* Not in use? */
-   if (!copy[i].request)
-   continue;
-
-   /*
-* Get the bios in the request so we can re-queue them.
-*/
-   if (copy[i].request->cmd_flags &
-   (REQ_FLUSH | REQ_FUA | REQ_DISCARD | REQ_SECURE)) {
-   /*
-* Flush operations don't contain bios, so
-* we need to requeue the whole request
-*/
-   list_add(&copy[i].request->queuelist, &requests);
-   continue;
-   }
-   merge_bio.head = copy[i].request->bio;
-   merge_bio.tail = copy[i].request->biotail;
-   bio_list_merge(&bio_list, &merge_bio);
-   copy[i].request->bio = NULL;
-   blk_end_request_all(copy[i].request, 0);
-   }
-
-   kfree(copy);
}
xenbus_switch_state(info->xbdev, XenbusStateConnected);
 
@@ -2079,7 +2035,7 @@ static int blkif_recover(struct blkfront_info *info)
kick_pending_request_queues(rinfo);
}
 
-   list_for_each_entry_safe(req, n, &requests, queuelist) {
+   list_for_each_entry_safe(req, n, &info->requests, queuelist) {
/* Requeue pending requests (flush or discard) */
list_del_init(>queuelist);
BUG_ON(req->nr_phys_segments > segs);
@@ -2087,7 +2043,7 @@ static int blkif_recover(struct blkfront_info *info)
}
blk_mq_kick_requeue_list(info->rq);
 
-   while ((bio = bio_list_pop(&bio_list)) != NULL) {
+   while ((bio = bio_list_pop(&info->bio_list)) != NULL) {

Re: [Xen-devel] [PATCH 1/2] xen-blkfront: don't call talk_to_blkback when already connected to blkback

2016-06-08 Thread Bob Liu

On 06/07/2016 11:25 PM, Konrad Rzeszutek Wilk wrote:
> On Wed, Jun 01, 2016 at 01:49:23PM +0800, Bob Liu wrote:
>>
>> On 06/01/2016 04:33 AM, Konrad Rzeszutek Wilk wrote:
>>> On Tue, May 31, 2016 at 04:59:16PM +0800, Bob Liu wrote:
>>>> Sometimes blkfront may receive the blkback_changed() notification twice
>>>> after migration; then talk_to_blkback() will be called twice too, confusing
>>>> xen-blkback.
>>>
>>> Could you enlighten the patch description by having some form of
>>> state transition here? I am curious how you got the frontend
>>> to get in XenbusStateConnected (via blkif_recover right) and then
>>> the backend triggering the update once more?
>>>
>>> Or is it just a simple race - the backend moves from XenbusStateConnected->
>>> XenbusStateConnected - which retriggers the frontend to hit in
>>> blkback_changed the XenbusStateConnected state and go in there?
>>> (That would be in connect_ring changing the state). But I don't
>>> see how the frontend_changed code get there as we have:
>>>
>>>  770 /*
>>>  771  * Ensure we connect even when two watches fire in
>>>  772  * close succession and we miss the intermediate value
>>>  773  * of frontend_state.
>>>  774  */
>>>  775 if (dev->state == XenbusStateConnected)
>>>  776 break;
>>>  777 
>>>
>>> ?
>>>
>>> Now what about 'blkfront_connect' being called on the second time?
>>>
>>> Ah, info->connected is probably by then in BLKIF_STATE_CONNECTED
>>> (as blkif_recover changed) and we just reread the size of the disk.
>>>
>>> Is that how about the flow goes?
>>
>>  blkfront                                blkback
>> blkfront_resume()
>>  > talk_to_blkback()
>>   > Set blkfront to XenbusStateInitialised
>>                                      Front changed()
>>                                       > Connect()
>>                                        > Set blkback to XenbusStateConnected
>>
>> blkback_changed()
>>  > Skip talk_to_blkback()
>>    because frontstate == XenbusStateInitialised
>>  > blkfront_connect()
>>   > Set blkfront to XenbusStateConnected
>>
>>
>> --
>> But sometimes blkfront receives the
>> blkback_changed() event more than once!
> 
> I think I know why. The udev scripts that get invoked when
> we attach a disk are a bit custom. As such I think they just
> revalidate the size leading to this.
> 
> And this 'poke-at-XenbusStateConnected' state multiple times
> is allowed. It is used to signal disk changes (or just to revalidate).
> Hence it does not matter why really - we need to deal with this.
> 
> I modified your patch a bit and are testing it:
> 

Looks much better, thank you very much!

Bob

> From e49dc9fc65eda4923b41d903ac51a7ddee182bcd Mon Sep 17 00:00:00 2001
> From: Bob Liu <bob@oracle.com>
> Date: Tue, 7 Jun 2016 10:43:15 -0400
> Subject: [PATCH] xen-blkfront: don't call talk_to_blkback when already
>  connected to blkback
> 
> Sometimes blkfront may receive the blkback_changed() notification
> (XenbusStateConnected) twice after migration, which will cause
> talk_to_blkback() to be called twice too and confuse xen-blkback.
> 
> The flow is as follow:
>    blkfront                                blkback
> blkfront_resume()
>  > talk_to_blkback()
>   > Set blkfront to XenbusStateInitialised
>                                      front changed()
>                                       > Connect()
>                                        > Set blkback to XenbusStateConnected
> 
> blkback_changed()
>  > Skip talk_to_blkback()
>because frontstate == XenbusStateInitialised
>  > blkfront_connect()
>   > Set blkfront to XenbusStateConnected
> 
> -
> And here we get another XenbusStateConnected notification leading
> to:
> -
> blkback_changed()
>  > because now frontstate != XenbusStateInitialised
>    talk_to_blkback() is also called again
>   > blkfront state changed from XenbusStateConnected to XenbusStateInitialised
> (Which is not correct!)
> 
>   front_changed():
>  >

Re: [Xen-devel] [PATCH 1/2] xen-blkfront: don't call talk_to_blkback when already connected to blkback

2016-05-31 Thread Bob Liu

On 06/01/2016 04:33 AM, Konrad Rzeszutek Wilk wrote:
> On Tue, May 31, 2016 at 04:59:16PM +0800, Bob Liu wrote:
>> Sometimes blkfront may receive the blkback_changed() notification twice
>> after migration; then talk_to_blkback() will be called twice too, confusing
>> xen-blkback.
> 
> Could you enlighten the patch description by having some form of
> state transition here? I am curious how you got the frontend
> to get in XenbusStateConnected (via blkif_recover right) and then
> the backend triggering the update once more?
> 
> Or is it just a simple race - the backend moves from XenbusStateConnected->
> XenbusStateConnected - which retriggers the frontend to hit in
> blkback_changed the XenbusStateConnected state and go in there?
> (That would be in connect_ring changing the state). But I don't
> see how the frontend_changed code get there as we have:
> 
>  770 /*
>  771  * Ensure we connect even when two watches fire in
>  772  * close succession and we miss the intermediate value
>  773  * of frontend_state.
>  774  */
>  775 if (dev->state == XenbusStateConnected)
>  776 break;
>  777 
> 
> ?
> 
> Now what about 'blkfront_connect' being called on the second time?
> 
> Ah, info->connected is probably by then in BLKIF_STATE_CONNECTED
> (as blkif_recover changed) and we just reread the size of the disk.
> 
> Is that how about the flow goes?

    blkfront                                blkback
blkfront_resume()
 > talk_to_blkback()
  > Set blkfront to XenbusStateInitialised
                                      Front changed()
                                       > Connect()
                                        > Set blkback to XenbusStateConnected

blkback_changed()
 > Skip talk_to_blkback()
   because frontstate == XenbusStateInitialised
 > blkfront_connect()
  > Set blkfront to XenbusStateConnected


--
But sometimes blkfront receives the
blkback_changed() event more than once!
Not sure why.

blkback_changed()
 > because now frontstate != XenbusStateInitialised
   talk_to_blkback() is also called again
  > blkfront state changed from XenbusStateConnected to XenbusStateInitialised
(Which is not correct!)


                                      Front_changed():
                                       > Do nothing because blkback is
                                         already in XenbusStateConnected


Now blkback is in XenbusStateConnected but blkfront is still in XenbusStateInitialised.

>>
>> Signed-off-by: Bob Liu <bob@oracle.com>
>> ---
>>  drivers/block/xen-blkfront.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
>> index ca13df8..01aa460 100644
>> --- a/drivers/block/xen-blkfront.c
>> +++ b/drivers/block/xen-blkfront.c
>> @@ -2485,7 +2485,8 @@ static void blkback_changed(struct xenbus_device *dev,
>>  break;
>>  
>>  case XenbusStateConnected:
>> -if (dev->state != XenbusStateInitialised) {
>> +if ((dev->state != XenbusStateInitialised) &&
>> +(dev->state != XenbusStateConnected)) {
>>  if (talk_to_blkback(dev, info))
>>  break;
>>  }
>> -- 
>> 2.7.4
>>



[Xen-devel] [PATCH 1/2] xen-blkfront: don't call talk_to_blkback when already connected to blkback

2016-05-31 Thread Bob Liu
Sometimes blkfront may receive the blkback_changed() notification twice
after migration; then talk_to_blkback() will be called twice too, confusing
xen-blkback.

Signed-off-by: Bob Liu <bob@oracle.com>
---
 drivers/block/xen-blkfront.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index ca13df8..01aa460 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -2485,7 +2485,8 @@ static void blkback_changed(struct xenbus_device *dev,
break;
 
case XenbusStateConnected:
-   if (dev->state != XenbusStateInitialised) {
+   if ((dev->state != XenbusStateInitialised) &&
+   (dev->state != XenbusStateConnected)) {
if (talk_to_blkback(dev, info))
break;
}
-- 
2.7.4




[Xen-devel] [PATCH 2/2] xen-blkfront: fix resume issues

2016-05-31 Thread Bob Liu
After migrating to another host, the number of rings (block hardware queues)
may change and the ring info structure will also be reallocated.

This patch fixes two related places:
 * Call blk_mq_update_nr_hw_queues() so that blk-core knows the number
   of hardware queues has changed.
 * Don't store the rinfo pointer in hctx->driver_data, because rinfo may be
   reallocated; use hctx->queue_num to look up the rinfo structure instead.

Signed-off-by: Bob Liu <bob@oracle.com>
---
 drivers/block/xen-blkfront.c | 20 
 1 file changed, 8 insertions(+), 12 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 01aa460..83e36c5 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -874,8 +874,12 @@ static int blkif_queue_rq(struct blk_mq_hw_ctx *hctx,
  const struct blk_mq_queue_data *qd)
 {
unsigned long flags;
-   struct blkfront_ring_info *rinfo = (struct blkfront_ring_info 
*)hctx->driver_data;
+   int qid = hctx->queue_num;
+   struct blkfront_info *info = hctx->queue->queuedata;
+   struct blkfront_ring_info *rinfo = NULL;
 
+   BUG_ON(info->nr_rings <= qid);
+   rinfo = &info->rinfo[qid];
blk_mq_start_request(qd->rq);
	spin_lock_irqsave(&rinfo->ring_lock, flags);
	if (RING_FULL(&rinfo->ring))
@@ -901,20 +905,9 @@ out_busy:
return BLK_MQ_RQ_QUEUE_BUSY;
 }
 
-static int blk_mq_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
-   unsigned int index)
-{
-   struct blkfront_info *info = (struct blkfront_info *)data;
-
-   BUG_ON(info->nr_rings <= index);
-   hctx->driver_data = &info->rinfo[index];
-   return 0;
-}
-
 static struct blk_mq_ops blkfront_mq_ops = {
.queue_rq = blkif_queue_rq,
.map_queue = blk_mq_map_queue,
-   .init_hctx = blk_mq_init_hctx,
 };
 
 static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size,
@@ -950,6 +943,7 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 
sector_size,
return PTR_ERR(rq);
}
 
+   rq->queuedata = info;
queue_flag_set_unlocked(QUEUE_FLAG_VIRT, rq);
 
if (info->feature_discard) {
@@ -2149,6 +2143,8 @@ static int blkfront_resume(struct xenbus_device *dev)
return err;
 
err = talk_to_blkback(dev, info);
+   if (!err)
+   blk_mq_update_nr_hw_queues(&info->tag_set, info->nr_rings);
 
/*
 * We have to wait for the backend to switch to
-- 
2.7.4




Re: [Xen-devel] [RFC PATCH v2] Data integrity extension(DIX) support for xen-block

2016-04-20 Thread Bob Liu

On 04/20/2016 04:59 PM, David Vrabel wrote:
> On 20/04/16 08:26, Bob Liu wrote:
>>
>>  /*
>> + * Recognized only if "feature-data-integrity" is present in backend xenbus 
>> info.
>> + * A request with BLKIF_OP_DIX_FLAG indicates that the following request is
>> + * a special request which only contains the integrity-metadata segments of
>> + * the current request.
>> + *
>> + * If a backend does not recognize BLKIF_OP_DIX_FLAG, it should *not* 
>> create the
>> + * "feature-data-integrity" node!
>> + */
>> +#define BLKIF_OP_DIX_FLAG (0x80)
> 
> This looks fine as a mechanism for actually transferring the data but
> you do need to specify:
> 
> 1. The format of this DIX data.  You may reference external
> specifications for this.
> 

Sure!

> 2. A mechanism for reporting which DIX formats the backend supports and
> a way for the frontend to select one (if multiple are selected).
> 

The "feature-data-integrity" could be extended to "unsigned int" instead of 
"bool",
so as to report all DIX formats backend supports.
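
For instance (a purely hypothetical encoding, not specified anywhere), the
node could carry a bitmask of the T10 protection types the backend handles:

/* Hypothetical bitmask for an extended "feature-data-integrity" node;
 * names and layout are illustrative only. */
#define BLKIF_DIX_PROT_TYPE1  (1U << 0)  /* GUARD + REFERENCE tags checked */
#define BLKIF_DIX_PROT_TYPE2  (1U << 1)
#define BLKIF_DIX_PROT_TYPE3  (1U << 2)  /* GUARD tag only */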

> 3. The behaviour the frontend can expect from the backend.  (e.g., if
> the frontend writes sector S with DIX data D, a read of sector S will
> complete with DIX data D).
> 

Sorry, I didn't get the point of this example.

Thank you for your review!

-- 
Regards,
-Bob



[Xen-devel] [RFC PATCH v2] Data integrity extension(DIX) support for xen-block

2016-04-20 Thread Bob Liu
* What's data integrity extension(DIX) and why?
Modern filesystems feature checksumming of data and metadata to protect against
data corruption.  However, the detection of the corruption is done at read time
which could potentially be months after the data was written.  At that point the
original data that the application tried to write is most likely lost.

The solution in Linux is the data integrity framework which enables protection
information to be pinned to I/Os and sent to/received from controllers that
support it. struct bio has been extended with a pointer to a struct bip which
in turn contains the integrity metadata.
Both raw data and integrity metadata are mapped to two separate scatterlists.

* Issues when xen-block get involved.
xen-blkfront only transmits the raw data-segment scatterlist of each bio
while the integrity-metadata-segment scatterlist has been ignored.

* Proposal for transmitting integrity-metadata-segment scatterlist.
An extra request is added following the normal data request; this extra request
contains integrity-metadata segments only.

The xen-blkback will reconstruct a new bio from the received data and integrity
segments.

Signed-off-by: Bob Liu <bob@oracle.com>
---
 xen/include/public/io/blkif.h |   19 +++
 1 file changed, 19 insertions(+)

diff --git a/xen/include/public/io/blkif.h b/xen/include/public/io/blkif.h
index 99f0326..a0124b2 100644
--- a/xen/include/public/io/blkif.h
+++ b/xen/include/public/io/blkif.h
@@ -182,6 +182,15 @@
  *  backend driver paired with a LIFO queue in the frontend will
  *  allow us to have better performance in this scenario.
  *
+ * feature-data-integrity
+ *  Values: 0/1 (boolean)
+ *  Default Value:  0
+ *
+ *  A value of "1" indicates that the backend can process requests
+ *  containing the BLKIF_OP_DIX_FLAG request opcode.  A request with this
+ *  flag indicates that the following request is a special request which
+ *  only contains the integrity-metadata segments of the current request.
+ *
  *--- Request Transport Parameters 
  *
  * max-ring-page-order
@@ -635,6 +644,16 @@
 #define BLKIF_OP_INDIRECT  6
 
 /*
+ * Recognized only if "feature-data-integrity" is present in backend xenbus 
info.
+ * A request with BLKIF_OP_DIX_FLAG indicates that the following request is a
+ * special request which only contains the integrity-metadata segments of the
+ * current request.
+ *
+ * If a backend does not recognize BLKIF_OP_DIX_FLAG, it should *not* create 
the
+ * "feature-data-integrity" node!
+ */
+#define BLKIF_OP_DIX_FLAG (0x80)
+
+/*
  * Maximum scatter/gather segments per request.
  * This is carefully chosen so that sizeof(blkif_ring_t) <= PAGE_SIZE.
  * NB. This could be 12 if the ring indexes weren't stored in the same page.
-- 
1.7.10.4




Re: [Xen-devel] [RFC PATCH] Data integrity extension support for xen-block

2016-04-13 Thread Bob Liu

On 04/07/2016 06:00 PM, Bob Liu wrote:
> * What's data integrity extension and why?
> Modern filesystems feature checksumming of data and metadata to protect 
> against
> data corruption.  However, the detection of the corruption is done at read 
> time
> which could potentially be months after the data was written.  At that point 
> the
> original data that the application tried to write is most likely lost.
> 
> The solution in Linux is the data integrity framework which enables protection
> information to be pinned to I/Os and sent to/received from controllers that
> support it. struct bio has been extended with a pointer to a struct bip which
> in turn contains the integrity metadata. The bip is essentially a trimmed down
> bio with a bio_vec and some housekeeping.
> 
> * Issues when xen-block get involved.
> xen-blkfront only transmits the normal data of struct bio while the integrity
> metadata buffer(struct bio_integrity_payload in each bio) is ignored.
> 
> * Proposal of transmitting bio integrity payload.
> Adding an extra request following the normal data request, this extra request
> contains the integrity payload.
> The xen-blkback will reconstruct an new bio with both received normal data and
> integrity metadata.
> 
> Welcome any better ideas, thank you!
> 

A simpler possible solution:

bob@boliuliu:~/xen$ git diff xen/include/public/io/blkif.h
diff --git a/xen/include/public/io/blkif.h b/xen/include/public/io/blkif.h
index 3d8d39f..34581a5 100644
--- a/xen/include/public/io/blkif.h
+++ b/xen/include/public/io/blkif.h
@@ -689,6 +689,11 @@ struct blkif_request_segment {
 struct blkif_request {
 uint8_toperation;/* BLKIF_OP_??? */
 uint8_tnr_segments;  /* number of segments   */
+/*
+ * Recording how many segments are data integrity segments.
+ * raw data_segments + dix_segments = nr_segments
+ */
+uint8_t   dix_segments;
 blkif_vdev_t   handle;   /* only for read/write requests */
 uint64_t   id;   /* private guest value, echoed in resp  */
 blkif_sector_t sector_number;/* start sector idx on disk (r/w only)  */
@@ -715,6 +720,11 @@ struct blkif_request_indirect {
 uint8_toperation;/* BLKIF_OP_INDIRECT*/
 uint8_tindirect_op;  /* BLKIF_OP_{READ/WRITE}*/
 uint16_t   nr_segments;  /* number of segments   */
+/*
+ * Records how many of the segments are data integrity segments:
+ * raw data_segments + dix_segments = nr_segments.
+ */
+uint16_t   dix_segments;
 uint64_t   id;   /* private guest value, echoed in resp  */
 blkif_sector_t sector_number;/* start sector idx on disk (r/w only)  */
 blkif_vdev_t   handle;   /* same as for read/write requests  */
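
For illustration, the backend-side interpretation of this field could be as
simple as the following sketch (that the dix_segments entries sit at the tail
of seg[] is my assumption, not something the diff above fixes):

/* Hypothetical helper: split nr_segments into data and DIX parts. */
static void blkif_split_segments(const struct blkif_request *req,
                                 unsigned int *nr_data, unsigned int *nr_dix)
{
        *nr_dix  = req->dix_segments;
        *nr_data = req->nr_segments - req->dix_segments;
        /* seg[0 .. *nr_data - 1] would carry raw data, the remaining
         * *nr_dix entries the integrity metadata. */
}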



Re: [Xen-devel] [RFC PATCH] Data integrity extension support for xen-block

2016-04-11 Thread Bob Liu

On 04/08/2016 10:32 PM, David Vrabel wrote:
> On 08/04/16 15:20, Ian Jackson wrote:
>> David Vrabel writes ("Re: [RFC PATCH] Data integrity extension support for 
>> xen-block"):
>>> You need to read the relevant SCSI specification and find out what
>>> interfaces and behaviour the hardware has so you can specify compatible
>>> interfaces in blkif.
>>>
>>> My (brief) reading around this suggests that the integrity data has a
>>> specific format (a CRC of some form) and the integrity data written for
>>> sector S and retrieved verbatim when sector S is re-read.
>>
>> I think it's this:
>>
>> https://en.wikipedia.org/wiki/Data_Integrity_Field
>> https://www.kernel.org/doc/Documentation/block/data-integrity.txt
>>
>> In which case AFAICT the format is up to the guest (ie the operating
>> system or file system) and it's opaque to the host (the storage) -
>> unless the guest consents, of course.
> 
> I disagree, but I can't work out where to get the relevant T10 PI/DIF
> spec from to provide an authoritative link[1].  The DI metadata has as a
> set of well defined format, most of which include a 16-bit GUARD CRC, a
> 32 bit REFERENCE tag and 16 bit for user defined usage.
> 

Yes.

> The application cannot use all the bits for its own use since the
> hardware may check the GUARD and REFERENCE tags itself.
> 
> David
> 
> [0] Try: https://www.usenix.org/legacy/event/lsf07/tech/petersen.pdf
> 

And https://oss.oracle.com/projects/data-integrity/dist/documentation/dix.pdf
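
To make those sizes concrete, the 8 bytes of protection information carried
per 512-byte sector can be sketched as below (field names are mine, following
the GUARD/REFERENCE/application split just described):

#include <linux/types.h>

/* One T10 DIF tuple: 8 bytes of protection information per 512-byte sector. */
struct t10_dif_tuple {
        __be16 guard_tag;       /* CRC16 of the 512 data bytes */
        __be16 app_tag;         /* 16 bits for application/vendor use */
        __be32 ref_tag;         /* typically the low 32 bits of the LBA */
};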

No matter what the actual format of the integrity metadata looks like, it can
be mapped to a scatterlist by using:
blk_rq_map_integrity_sg(struct request_queue *q, struct bio *bio, struct scatterlist *sglist)
just like blk_rq_map_sg(struct request_queue *q, struct request *rq, struct scatterlist *sglist) for normal data.

The extra scatterlist can be seen as the interface; we just need to find a
good way of transmitting it between blkfront and blkback.
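
As a rough blkfront-side sketch of that idea (grant_map_sg() is a made-up
placeholder for whatever granting scheme we end up with; the block-layer calls
are the stock ones named above):

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/scatterlist.h>

static int map_dix_request(struct request_queue *q, struct request *req,
                           struct scatterlist *data_sg,
                           struct scatterlist *dix_sg)
{
        int data_cnt, dix_cnt = 0;

        data_cnt = blk_rq_map_sg(q, req, data_sg);      /* normal data */

        /* The integrity metadata, if protection information is attached. */
        if (bio_integrity(req->bio))
                dix_cnt = blk_rq_map_integrity_sg(q, req->bio, dix_sg);

        /*
         * Both scatterlists would then be granted to the backend, e.g.
         * grant_map_sg(data_sg, data_cnt) and grant_map_sg(dix_sg, dix_cnt).
         */
        return data_cnt + dix_cnt;
}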

-- 
Regards,
-Bob



Re: [Xen-devel] [RFC PATCH] Data integrity extension support for xen-block

2016-04-08 Thread Bob Liu

On 04/08/2016 05:44 PM, Roger Pau Monné wrote:
> On Fri, 8 Apr 2016, Bob Liu wrote:
>>
>> On 04/07/2016 11:55 PM, Juergen Gross wrote:
>>> On 07/04/16 12:00, Bob Liu wrote:
>>>> * What's data integrity extension and why?
>>>> Modern filesystems feature checksumming of data and metadata to protect 
>>>> against
>>>> data corruption.  However, the detection of the corruption is done at read 
>>>> time
>>>> which could potentially be months after the data was written.  At that 
>>>> point the
>>>> original data that the application tried to write is most likely lost.
>>>>
>>>> The solution in Linux is the data integrity framework which enables 
>>>> protection
>>>> information to be pinned to I/Os and sent to/received from controllers that
>>>> support it. struct bio has been extended with a pointer to a struct bip 
>>>> which
>>>> in turn contains the integrity metadata. The bip is essentially a trimmed 
>>>> down
>>>> bio with a bio_vec and some housekeeping.
>>>>
>>>> * Issues when xen-block get involved.
>>>> xen-blkfront only transmits the normal data of struct bio while the 
>>>> integrity
>>>> metadata buffer(struct bio_integrity_payload in each bio) is ignored.
>>>>
>>>> * Proposal of transmitting bio integrity payload.
>>>> Adding an extra request following the normal data request, this extra 
>>>> request
>>>> contains the integrity payload.
>>>> The xen-blkback will reconstruct an new bio with both received normal data 
>>>> and
>>>> integrity metadata.
>>>>
>>>> Welcome any better ideas, thank you!
>>>>
>>>> [1] http://lwn.net/Articles/280023/
>>>> [2] https://www.kernel.org/doc/Documentation/block/data-integrity.txt
>>>>
>>>> Signed-off-by: Bob Liu <bob@oracle.com>
>>>> ---
>>>>  xen/include/public/io/blkif.h |   50 
>>>> +
>>>>  1 file changed, 50 insertions(+)
>>>>
>>>> diff --git a/xen/include/public/io/blkif.h b/xen/include/public/io/blkif.h
>>>> index 99f0326..3d8d39f 100644
>>>> --- a/xen/include/public/io/blkif.h
>>>> +++ b/xen/include/public/io/blkif.h
>>>> @@ -635,6 +635,28 @@
>>>>  #define BLKIF_OP_INDIRECT  6
>>>>  
>>>>  /*
>>>> + * Recognized only if "feature-extra-request" is present in backend 
>>>> xenbus info.
>>>> + * A request with BLKIF_OP_EXTRA_FLAG indicates an extra request is 
>>>> followed
>>>> + * in the shared ring buffer.
>>>> + *
>>>> + * By this way, extra data like bio integrity payload can be transmitted 
>>>> from
>>>> + * frontend to backend.
>>>> + *
>>>> + * The 'wire' format is like:
>>>> + *  Request 1: xen_blkif_request
>>>> + * [Request 2: xen_blkif_extra_request](only if request 1 has 
>>>> BLKIF_OP_EXTRA_FLAG)
>>>> + *  Request 3: xen_blkif_request
>>>> + *  Request 4: xen_blkif_request
>>>> + * [Request 5: xen_blkif_extra_request](only if request 4 has 
>>>> BLKIF_OP_EXTRA_FLAG)
>>>> + *  ...
>>>> + *  Request N: xen_blkif_request
>>>> + *
>>>> + * If a backend does not recognize BLKIF_OP_EXTRA_FLAG, it should *not* 
>>>> create the
>>>> + * "feature-extra-request" node!
>>>> + */
>>>> +#define BLKIF_OP_EXTRA_FLAG (0x80)
>>>> +
>>>> +/*
>>>>   * Maximum scatter/gather segments per request.
>>>>   * This is carefully chosen so that sizeof(blkif_ring_t) <= PAGE_SIZE.
>>>>   * NB. This could be 12 if the ring indexes weren't stored in the same 
>>>> page.
>>>> @@ -703,6 +725,34 @@ struct blkif_request_indirect {
>>>>  };
>>>>  typedef struct blkif_request_indirect blkif_request_indirect_t;
>>>>  
>>>> +enum blkif_extra_request_type {
>>>> +  BLKIF_EXTRA_TYPE_DIX = 1,   /* Data integrity extension 
>>>> payload.  */
>>>> +};
>>>> +
>>>> +struct bio_integrity_req {
>>>> +  /*
>>>> +   * Grant mapping for transmitting bio integrity payload to backend.
>>>> +   */
>>>> +  grant_ref_t *gref;

Re: [Xen-devel] [RFC PATCH] Data integrity extension support for xen-block

2016-04-07 Thread Bob Liu

On 04/07/2016 11:55 PM, Juergen Gross wrote:
> On 07/04/16 12:00, Bob Liu wrote:
>> * What's data integrity extension and why?
>> Modern filesystems feature checksumming of data and metadata to protect 
>> against
>> data corruption.  However, the detection of the corruption is done at read 
>> time
>> which could potentially be months after the data was written.  At that point 
>> the
>> original data that the application tried to write is most likely lost.
>>
>> The solution in Linux is the data integrity framework which enables 
>> protection
>> information to be pinned to I/Os and sent to/received from controllers that
>> support it. struct bio has been extended with a pointer to a struct bip which
>> in turn contains the integrity metadata. The bip is essentially a trimmed 
>> down
>> bio with a bio_vec and some housekeeping.
>>
>> * Issues when xen-block get involved.
>> xen-blkfront only transmits the normal data of struct bio while the integrity
>> metadata buffer(struct bio_integrity_payload in each bio) is ignored.
>>
>> * Proposal of transmitting bio integrity payload.
>> Adding an extra request following the normal data request, this extra request
>> contains the integrity payload.
>> The xen-blkback will reconstruct an new bio with both received normal data 
>> and
>> integrity metadata.
>>
>> Welcome any better ideas, thank you!
>>
>> [1] http://lwn.net/Articles/280023/
>> [2] https://www.kernel.org/doc/Documentation/block/data-integrity.txt
>>
>> Signed-off-by: Bob Liu <bob@oracle.com>
>> ---
>>  xen/include/public/io/blkif.h |   50 
>> +
>>  1 file changed, 50 insertions(+)
>>
>> diff --git a/xen/include/public/io/blkif.h b/xen/include/public/io/blkif.h
>> index 99f0326..3d8d39f 100644
>> --- a/xen/include/public/io/blkif.h
>> +++ b/xen/include/public/io/blkif.h
>> @@ -635,6 +635,28 @@
>>  #define BLKIF_OP_INDIRECT  6
>>  
>>  /*
>> + * Recognized only if "feature-extra-request" is present in backend xenbus 
>> info.
>> + * A request with BLKIF_OP_EXTRA_FLAG indicates an extra request is followed
>> + * in the shared ring buffer.
>> + *
>> + * By this way, extra data like bio integrity payload can be transmitted 
>> from
>> + * frontend to backend.
>> + *
>> + * The 'wire' format is like:
>> + *  Request 1: xen_blkif_request
>> + * [Request 2: xen_blkif_extra_request](only if request 1 has 
>> BLKIF_OP_EXTRA_FLAG)
>> + *  Request 3: xen_blkif_request
>> + *  Request 4: xen_blkif_request
>> + * [Request 5: xen_blkif_extra_request](only if request 4 has 
>> BLKIF_OP_EXTRA_FLAG)
>> + *  ...
>> + *  Request N: xen_blkif_request
>> + *
>> + * If a backend does not recognize BLKIF_OP_EXTRA_FLAG, it should *not* 
>> create the
>> + * "feature-extra-request" node!
>> + */
>> +#define BLKIF_OP_EXTRA_FLAG (0x80)
>> +
>> +/*
>>   * Maximum scatter/gather segments per request.
>>   * This is carefully chosen so that sizeof(blkif_ring_t) <= PAGE_SIZE.
>>   * NB. This could be 12 if the ring indexes weren't stored in the same page.
>> @@ -703,6 +725,34 @@ struct blkif_request_indirect {
>>  };
>>  typedef struct blkif_request_indirect blkif_request_indirect_t;
>>  
>> +enum blkif_extra_request_type {
>> +BLKIF_EXTRA_TYPE_DIX = 1,   /* Data integrity extension 
>> payload.  */
>> +};
>> +
>> +struct bio_integrity_req {
>> +/*
>> + * Grant mapping for transmitting bio integrity payload to backend.
>> + */
>> +grant_ref_t *gref;
>> +unsigned int nr_grefs;
>> +unsigned int len;
>> +};
> 
> How does the payload look like? It's structure should be defined here
> or a reference to it's definition in case it is a standard should be
> given.
> 

The payload is also described using struct bio_vec (the same as a bio).

/*
 * bio integrity payload
 */
struct bio_integrity_payload {
struct bio  *bip_bio;   /* parent bio */

struct bvec_iterbip_iter;

bio_end_io_t*bip_end_io;/* saved I/O completion fn */

unsigned short  bip_slab;   /* slab the bip came from */
unsigned short  bip_vcnt;   /* # of integrity bio_vecs */
unsigned short  bip_max_vcnt;   /* integrity bio_vec slots */
unsigned short  bip_flags;  /* control flags */

struct work_struct  bip_work;   /* I/O completion */
};

[Xen-devel] [RFC PATCH] Data integrity extension support for xen-block

2016-04-07 Thread Bob Liu
* What's data integrity extension and why?
Modern filesystems feature checksumming of data and metadata to protect against
data corruption.  However, the detection of the corruption is done at read time
which could potentially be months after the data was written.  At that point the
original data that the application tried to write is most likely lost.

The solution in Linux is the data integrity framework which enables protection
information to be pinned to I/Os and sent to/received from controllers that
support it. struct bio has been extended with a pointer to a struct bip which
in turn contains the integrity metadata. The bip is essentially a trimmed down
bio with a bio_vec and some housekeeping.

* Issues when xen-block gets involved.
xen-blkfront only transmits the normal data of struct bio, while the integrity
metadata buffer (struct bio_integrity_payload in each bio) is ignored.

* Proposal for transmitting the bio integrity payload.
Add an extra request following the normal data request; this extra request
contains the integrity payload.
xen-blkback will then reconstruct a new bio from both the received normal data
and the integrity metadata.

Any better ideas are welcome, thank you!

[1] http://lwn.net/Articles/280023/
[2] https://www.kernel.org/doc/Documentation/block/data-integrity.txt

Signed-off-by: Bob Liu <bob@oracle.com>
---
 xen/include/public/io/blkif.h |   50 +
 1 file changed, 50 insertions(+)

diff --git a/xen/include/public/io/blkif.h b/xen/include/public/io/blkif.h
index 99f0326..3d8d39f 100644
--- a/xen/include/public/io/blkif.h
+++ b/xen/include/public/io/blkif.h
@@ -635,6 +635,28 @@
 #define BLKIF_OP_INDIRECT  6
 
 /*
+ * Recognized only if "feature-extra-request" is present in the backend
+ * xenbus info.  A request with BLKIF_OP_EXTRA_FLAG indicates that an extra
+ * request follows in the shared ring buffer.
+ *
+ * In this way, extra data such as the bio integrity payload can be
+ * transmitted from frontend to backend.
+ *
+ * The 'wire' format is like:
+ *  Request 1: xen_blkif_request
+ * [Request 2: xen_blkif_extra_request] (only if request 1 has BLKIF_OP_EXTRA_FLAG)
+ *  Request 3: xen_blkif_request
+ *  Request 4: xen_blkif_request
+ * [Request 5: xen_blkif_extra_request] (only if request 4 has BLKIF_OP_EXTRA_FLAG)
+ *  ...
+ *  Request N: xen_blkif_request
+ *
+ * If a backend does not recognize BLKIF_OP_EXTRA_FLAG, it should *not*
+ * create the "feature-extra-request" node!
+ */
+#define BLKIF_OP_EXTRA_FLAG (0x80)
+
+/*
  * Maximum scatter/gather segments per request.
  * This is carefully chosen so that sizeof(blkif_ring_t) <= PAGE_SIZE.
  * NB. This could be 12 if the ring indexes weren't stored in the same page.
@@ -703,6 +725,34 @@ struct blkif_request_indirect {
 };
 typedef struct blkif_request_indirect blkif_request_indirect_t;
 
+enum blkif_extra_request_type {
+   BLKIF_EXTRA_TYPE_DIX = 1,   /* Data integrity extension 
payload.  */
+};
+
+struct bio_integrity_req {
+   /*
+* Grant mapping for transmitting bio integrity payload to backend.
+*/
+   grant_ref_t *gref;
+   unsigned int nr_grefs;
+   unsigned int len;
+};
+
+/*
+ * Extra request, must follow a normal-request and a normal-request can
+ * only be followed by one extra request.
+ */
+struct blkif_request_extra {
+   uint8_t type;   /* BLKIF_EXTRA_TYPE_* */
+   uint16_t _pad1;
+#ifndef CONFIG_X86_32
+   uint32_t _pad2; /* offsetof(blkif_...,u.extra.id) == 8 */
+#endif
+   uint64_t id;
+   struct bio_integrity_req bi_req;
+} __attribute__((__packed__));
+typedef struct blkif_request_extra blkif_request_extra_t;
+
 struct blkif_response {
 uint64_tid;  /* copied from request */
 uint8_t operation;   /* copied from request */
-- 
1.7.10.4




Re: [Xen-devel] [PATCH] blkif.h: document scsi/0x12/0x<page> node

2016-03-23 Thread Bob Liu

On 03/23/2016 08:33 PM, Roger Pau Monné wrote:
> On Wed, 23 Mar 2016, Bob Liu wrote:
> 
>> This patch documents a xenstore node which is used by XENVBD Windows PV
>> driver.
>>
>> The use case is that XenServer may have OEM specific storage backends and
>> there is requirement to run OEM software in guest which relied on VPD
>> information supplied by the storages.
>> Adding a node to xenstore is the easiest way to get this VPD information from
>> the backend into guest where XENVBD Windows PV driver can get INQUIRY VPD 
>> data
>> from this node and return to OEM software.
>>
>> Signed-off-by: Bob Liu <bob@oracle.com>
>> ---
>>  xen/include/public/io/blkif.h |   24 
>>  1 file changed, 24 insertions(+)
>>
>> diff --git a/xen/include/public/io/blkif.h b/xen/include/public/io/blkif.h
>> index 99f0326..afbcbff 100644
>> --- a/xen/include/public/io/blkif.h
>> +++ b/xen/include/public/io/blkif.h
>> @@ -182,6 +182,30 @@
>>   *  backend driver paired with a LIFO queue in the frontend will
>>   *  allow us to have better performance in this scenario.
>>   *
>> + * scsi/0x12/0x<page>
>> + *  Values: base64 encoded string
>> + *
>> + *  This optional node contains SCSI INQUIRY VPD information.
>> + *  <page> is the hexadecimal representation of the VPD page code.
>> + *  Currently only XENVBD Windows PV driver is using this node.
>> + *
>> + *  A frontend e.g XENVBD Windows PV driver which represents a Xen VBD to
>> + *  its containing operating system as a (virtual) SCSI target may return 
>> the
>> + *  specified data in response to INQUIRY commands from its containing OS.
>> + *
>> + *  A frontend which supports this feature must return the backend-specified
>> + *  data for every INQUIRY command with the EVPD bit set.
>> + *  For EVPD=1 INQUIRY commands where the corresponding xenstore node
>> + *  does not exist, the frontend must report (to its containing OS) an
>> + *  appropriate failure condition.
>> + *
>> + *  A frontend which does not support this feature just disregard these
>> + *  xenstore nodes.
>> + *
>> + *  The data of this string node is base64 encoded. Base64 is a group of
>> + *  similar binary-to-text encoding schemes that represent binary data in an
>> + *  ASCII string format by translating it into a radix-64 representation.
>> + *
> 
> I'm sorry, but I need to raise similar concerns as the ones expressed by 
> other people.
> 
> I understand that those pages that you plan to export to the guest contain 
> some kind of hardware specific information, but how is the guest going to 
> make use of this?
> 
> It can only interact with a Xen virtual block device, and there you can 
> only send read, write, flush and discard requests. Even the block size is 
> hardcoded to 512b by the protocol, so I'm not sure how are you going to 
> use this information.
> 

For this part, there is an ioctl() interface for all block devices.
Looking at virtio-blk in the KVM world, it can accept almost all SCSI commands
via ioctl() too, even though they already have virtio-scsi.
But that's another story.

Thanks,
Bob

> Also, the fact that's implemented in some drivers in some OS isn't an 
> argument in order to have them added. FreeBSD had for a very long time a 
> set of custom extensions, that where never added to blkif.h simply because 
> they were broken and unneeded, so the solution was to remove them from the 
> implementation, and the same could happen here IMHO.
> 
> Roger.
> 



[Xen-devel] [PATCH] blkif.h: document scsi/0x12/0x<page> node

2016-03-23 Thread Bob Liu
This patch documents a xenstore node which is used by the XENVBD Windows PV
driver.

The use case is that XenServer may have OEM-specific storage backends, and
there is a requirement to run OEM software in the guest which relies on VPD
information supplied by the storage.
Adding a node to xenstore is the easiest way to get this VPD information from
the backend into the guest, where the XENVBD Windows PV driver can read
INQUIRY VPD data from this node and return it to the OEM software.

Signed-off-by: Bob Liu <bob@oracle.com>
---
 xen/include/public/io/blkif.h |   24 
 1 file changed, 24 insertions(+)

diff --git a/xen/include/public/io/blkif.h b/xen/include/public/io/blkif.h
index 99f0326..afbcbff 100644
--- a/xen/include/public/io/blkif.h
+++ b/xen/include/public/io/blkif.h
@@ -182,6 +182,30 @@
  *  backend driver paired with a LIFO queue in the frontend will
  *  allow us to have better performance in this scenario.
  *
+ * scsi/0x12/0x<page>
+ * Values: base64 encoded string
+ *
+ * This optional node contains SCSI INQUIRY VPD information.
+ * <page> is the hexadecimal representation of the VPD page code.
+ * Currently only XENVBD Windows PV driver is using this node.
+ *
+ * A frontend, e.g. the XENVBD Windows PV driver, which represents a Xen
+ * VBD to its containing operating system as a (virtual) SCSI target may
+ * return the specified data in response to INQUIRY commands from its
+ * containing OS.
+ *
+ * A frontend which supports this feature must return the backend-specified
+ * data for every INQUIRY command with the EVPD bit set.
+ * For EVPD=1 INQUIRY commands where the corresponding xenstore node
+ * does not exist, the frontend must report (to its containing OS) an
+ * appropriate failure condition.
+ *
+ * A frontend which does not support this feature simply disregards these
+ * xenstore nodes.
+ *
+ * The data of this string node is base64 encoded. Base64 is a group of
+ * similar binary-to-text encoding schemes that represent binary data in an
+ * ASCII string format by translating it into a radix-64 representation.
+ *
  *--- Request Transport Parameters 
  *
  * max-ring-page-order
-- 
1.7.10.4




Re: [Xen-devel] [RFC PATCH] blkif.h: document scsi/0x12/0x83 node

2016-03-22 Thread Bob Liu

On 03/17/2016 07:12 PM, Ian Jackson wrote:
> David Vrabel writes ("Re: [Xen-devel] [RFC PATCH] blkif.h: document 
> scsi/0x12/0x83 node"):
>> On 16/03/16 13:59, Bob Liu wrote:
>>> But we'd like to get the VPD information(of underlying storage device) also 
>>> in Linux blkfront, even blkfront is not a SCSI device.
>>
>> Why does blkback/blkfront need to involved here?  This is just some
>> xenstore keys that can be written by the toolstack and directly read by
>> the relevant application in the guest.
> 

They want a more generic way because the application may run in all kinds of
environments, including bare metal.
So they prefer to just call ioctl(SG_IO) against a storage device.

> I'm getting rather a different picture here than at first.  Previously
> I thought you had some 3rd-party application, not under your control,
> which expected to see this VPD data.
> 
> But now I think that you're saying the application is under your own
> control.  I don't understand why synthetic VPD data is the best way to
> give your application the information it needs.
> 
> What is the application doing with this VPD data ?  I mean,
> which specific application functions, and how do they depend on the
> VPD data ?
> 

From the feedback I just got, they do *not* want the details to be public.

Anyway, I think this is not a blocker for this patch.
In the Windows PV block driver, we already use the same method to get the raw
INQUIRY data:
 * The Windows PV block driver accepts ioctl(SG_IO).
 * Then it reads this /scsi/0x12/0x83 node.
 * Then it returns the raw INQUIRY data back to the ioctl.

Since the Linux guest also wants to do the same thing, let's make this
mechanism a generic interface!
I'll post a patch adding ioctl(SG_IO) support to xen-blkfront, together with an
updated version of this patch, soon.
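
For reference, this is roughly the guest-side call such a frontend has to
satisfy: an EVPD INQUIRY for page 0x83 issued through the standard Linux sg
interface (a minimal sketch, error handling trimmed):

#include <string.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>

static int inquiry_vpd83(int fd, unsigned char *buf, unsigned short len)
{
        /* INQUIRY (0x12), EVPD=1, page code 0x83, 16-bit allocation length. */
        unsigned char cdb[6] = { 0x12, 0x01, 0x83,
                                 (unsigned char)(len >> 8),
                                 (unsigned char)len, 0 };
        unsigned char sense[32];
        struct sg_io_hdr io;

        memset(&io, 0, sizeof(io));
        io.interface_id    = 'S';
        io.dxfer_direction = SG_DXFER_FROM_DEV;
        io.cmd_len         = sizeof(cdb);
        io.cmdp            = cdb;
        io.dxferp          = buf;
        io.dxfer_len       = len;
        io.sbp             = sense;
        io.mx_sb_len       = sizeof(sense);
        io.timeout         = 5000;      /* milliseconds */

        return ioctl(fd, SG_IO, &io);   /* 0 on success */
}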

Thanks,
Bob



Re: [Xen-devel] [RFC PATCH] blkif.h: document scsi/0x12/0x83 node

2016-03-20 Thread Bob Liu

On 03/16/2016 10:32 PM, David Vrabel wrote:
> On 16/03/16 13:59, Bob Liu wrote:
>>
>> But we'd like to get the VPD information(of underlying storage device) also 
>> in Linux blkfront, even blkfront is not a SCSI device.
> 
> Why does blkback/blkfront need to involved here?  This is just some
> xenstore keys that can be written by the toolstack and directly read by
> the relevant application in the guest.
> 

Exactly, let me check if they can directly read this xenstore node.

-- 
Regards,
-Bob



Re: [Xen-devel] [RFC PATCH] blkif.h: document scsi/0x12/0x83 node

2016-03-19 Thread Bob Liu

On 03/16/2016 10:07 PM, Paul Durrant wrote:
>> -Original Message-
>> From: Bob Liu [mailto:bob@oracle.com]
..snip..
>>>
>>
>> But we'd like to get the VPD information(of underlying storage device) also 
>> in
>> Linux blkfront, even blkfront is not a SCSI device.
>>
>> That's because our underlying storage device has some vendor-specific
>> features which can be recognized through informations in VPD pages.
>> And Our applications in guest want to aware of these vendor-specific
>> features.
> 
> I think the missing piece of the puzzle is how the applications get this 
> information. 
> In Windows, since everything is a SCSI LUN (or has to emulate one) 
> applications just send down 'scsi pass-through' IOCTLs and get the raw 
> INQUIRY data back. 
> In Linux there would need to be some alternative scheme that presumably 
> blkfront would have to support.
> 

They plan to send a REQ_TYPE_BLOCK_PC request down to blkfront, hoping
blkfront can handle this request and return the VPD information.
I'll confirm whether they can read the xenstore node directly.

-- 
Regards,
-Bob



Re: [Xen-devel] [RFC PATCH] blkif.h: document scsi/0x12/0x83 node

2016-03-19 Thread Bob Liu

On 03/16/2016 08:36 PM, Ian Jackson wrote:
> Bob Liu writes ("[RFC PATCH] blkif.h: document scsi/0x12/0x83 node"):
>> Sometimes, we need to query VPD page=0x83 data from underlying
>> storage so that vendor supplied software can run inside the VM and
>> believe it's talking to the vendor's own storage.  But different
>> vendors may have different special features, so it's not suitable to
>> export through "feature-".
>>
>> One solution is query the whole VPD page through Xenstore node, which has
>> already been used by windows pv driver.
>> http://xenbits.xen.org/gitweb/?p=pvdrivers/win/xenvbd.git;a=blob;f=src/xenvbd/pdoinquiry.c
> 
> Thanks for your contribution.
> 
> Thanks also to Konrad for decoding the numbers, which really helps me
> understand what is going on here and helped me find the relevant
> references.
> 
> (For background: I have just double-checked the SCSI spec and: INQUIRY
> lets you query either the standard page, or one of a number of `vital
> product data' pages, each identified by an 8-bit page number.  The VPD
> pages are mostly full of vendor-specific data in vendor-specific
> format.)
> 
> I have some qualms about the approach you have adopted.  It is
> difficult to see how this feature could be used safely without
> knowledge specific to the storage vendor.
> 
> But I think it is probably OK to define a specification along these
> lines provided that it is very clear that if you aren't the storage
> vendor and you use this and something breaks, you get to keep all the
> pieces.
> 
>> + * scsi/0x12/0x83
>> + *  Values: string
>> + *  A base64 formatted string providing VPD pages read out from backend
>> + *  device.
> 
> I think this probably isn't the prettiest name for this node or
> necessarily the best format but given that this protocol is already
> deployed, and this syntax will do, I don't want to quibble.
> 
> I would like the base64 encoding to specified much more explicitly.
> Just `base64 formatted' is too vague.
> 
> 
>> + *  The backend driver or the toolstack should write this node with VPD
>> + *  informations when attaching devices.
> 
> I think this is the wrong semantics.  I certainly don't want to
> encourage backends to use this feature.
> 
> Rather, I would prefer something like this:
> 
>  * scsi/0x12/0x<page>
> 
>This optional node contains SCSI INQUIRY VPD information.
>    <page> is the hexadecimal representation of the VPD page code.
> 
>A frontend which represents a Xen VBD to its containing operating
>system as a (virtual) SCSI target may return the specified data in
>response to INQUIRY commands from its containing OS.
> 
>A frontend which supports this feature must return the backend-
>specified data for every INQUIRY command with the EVPD bit set.
>For EVPD=1 INQUIRY commands where the corresponding xenstore node
>does not exist, the frontend must report (to its containing OS) an
>appropriate failure condition.
> 
>A frontend which does not support this feature (ie, which does not
>use these xenstore nodes), and which presents as a SCSI target to
>its containing OS, should support and provide whatever VPD
>information it considers appropriate, and should disregard these
>xenstore nodes.
> 
>A frontend need not - and often will not - present to its
>containing OS as a device addressable with SCSI CDBs.  Such a
>frontend has no use for SCSI INQUIRY VPD information.
> 
>A backend should set this information with caution.  Pages
>containing device-vendor-specific information should not be
>specified without the appropriate device-vendor-specific knowledge.
> 

That's much more clear, thank you very much!

> 
> Also I have two other observations:
> 
> Firstly, AFAICT you have not provided any way to set the standard
> INQUIRY response.  Is it not necessary in your application to provide

If backends are not encouraged to use this node, then we must have the
toolstack write this node with the right VPD information.
Paul mentioned there should be corresponding code in the xapi project, but I
haven't found out where yet.
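
If the toolstack is the one writing the node, a minimal libxenstore sketch
could look like this (write_vpd83() and the backend_dir path are illustrative,
not existing toolstack code):

#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <xenstore.h>

static int write_vpd83(const char *backend_dir, const char *b64_vpd)
{
        struct xs_handle *xs = xs_open(0);
        char path[256];
        bool ok;

        if (!xs)
                return -1;
        snprintf(path, sizeof(path), "%s/scsi/0x12/0x83", backend_dir);
        ok = xs_write(xs, XBT_NULL, path, b64_vpd, strlen(b64_vpd));
        xs_close(xs);
        return ok ? 0 : -1;
}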


> synthetic vendorid and productid, at the very least ?
> 
> Secondly, I think your hope that
> 
>> blkfront in Linux ... can use the same mechanism.
> 
> is I think misguided.  blkfront does not present the disk (to the rest
> of the Linux storage system) as a SCSI device.  Rather, Linux allows
> blkfront to present as a block device, directly, and this is what
> blkfront does.
> 

But we'd like to get the VPD information (of the underlying storage device) in
Linux blkfront as well, even though blkfront is not a SCSI device.

That's because our underlying storage device has some vendor-specific features
which can be recognized through information in the VPD pages.
And our applications in the guest want to be aware of these vendor-specific
features.

Regards,
Bob





[Xen-devel] [RFC PATCH] blkif.h: document scsi/0x12/0x83 node

2016-03-15 Thread Bob Liu
Sometimes we need to query VPD page 0x83 data from the underlying storage so
that vendor-supplied software can run inside the VM and believe it's talking to
the vendor's own storage.
But different vendors may have different special features, so it's not suitable
to export them through "feature-" nodes.

One solution is to query the whole VPD page through a xenstore node, which is
already used by the Windows PV driver:
http://xenbits.xen.org/gitweb/?p=pvdrivers/win/xenvbd.git;a=blob;f=src/xenvbd/pdoinquiry.c

This patch documents the xenstore node in blkif.h, so that blkfront in Linux and
other frontends can use the same mechanism.

Signed-off-by: Bob Liu <bob@oracle.com>
---
 xen/include/public/io/blkif.h |8 
 1 file changed, 8 insertions(+)

diff --git a/xen/include/public/io/blkif.h b/xen/include/public/io/blkif.h
index 99f0326..30a6e46 100644
--- a/xen/include/public/io/blkif.h
+++ b/xen/include/public/io/blkif.h
@@ -182,6 +182,14 @@
  *  backend driver paired with a LIFO queue in the frontend will
  *  allow us to have better performance in this scenario.
  *
+ * scsi/0x12/0x83
+ * Values: string
+ *
+ * A base64 formatted string providing VPD pages read out from the backend
+ * device.
+ * The backend driver or the toolstack should write this node with VPD
+ * information when attaching devices.
+ *
  *--- Request Transport Parameters 
  *
  * max-ring-page-order
-- 
1.7.10.4




Re: [Xen-devel] [RFC PATCH] xen-block: introduces extra request to pass-through SCSI commands

2016-03-02 Thread Bob Liu

On 03/02/2016 07:40 PM, Ian Jackson wrote:
> Bob Liu writes ("Re: [RFC PATCH] xen-block: introduces extra request to 
> pass-through SCSI commands"):
>> Do you know whether pvscsi can work on top of multipath(the device-mapper 
>> framework) or LVMs?
> 
> No, it can't.  devmapper and LVM work with the block device
> abstraction.
> 
> Implicitly you seem to be suggesting that you want to use dm-multipath
> and LVM, but also send other SCSI CDBs from the upper layers through
> to the underlying SCSI storage target.
> 

Exactly!

> I can't see how that could cause anything but pain.  In many cases
> "the underlying SCSI storage target" wouldn't be well defined.  Even
> if it was, these side channel SCSI commands are likely to Go Wrong in
> exciting ways.
> 
> What SCSI commands do you want to send ?
> 

* INQUIRY

* PERSISTENT RESERVE IN
* PERSISTENT RESERVE OUT
These are for Failover Clusters in Windows; I'm not sure whether more commands
are required.
I didn't find a list of required SCSI commands in the failover documentation.

-- 
Regards,
-Bob



Re: [Xen-devel] [RFC PATCH] xen-block: introduces extra request to pass-through SCSI commands

2016-03-01 Thread Bob Liu
Hi Juergen,

On 03/02/2016 03:39 PM, Juergen Gross wrote:
> On 01/03/16 19:08, Ian Jackson wrote:
>> Bob Liu writes ("Re: [RFC PATCH] xen-block: introduces extra request to 
>> pass-through SCSI commands"):
>>> One thing I'm still not sure about PVSCSI is do we have the same security 
>>> issue since LIO can interface to any block device.
>>> E.g when using a partition /dev/sda1 as the PVSCSI-backend, but the 
>>> PVSCSI-frontend may still send SCSI operates on LUN bases (the whole disk).
>>
>> I don't think you can use pvscsi to passthrough a partition such as
>> /dev/sda1.  Such a thing is not a SCSI command target.
> 
> It might be possible via the fileio target backend. In this case LUN
> based SCSI operations are ignored/refused/emulated by LIO.
> 

Do you know whether pvscsi can work on top of multipath (the device-mapper
framework) or LVM?
Thank you!

Bob



Re: [Xen-devel] [RFC PATCH] xen-block: introduces extra request to pass-through SCSI commands

2016-02-29 Thread Bob Liu

On 03/01/2016 12:29 AM, Ian Jackson wrote:
> Ian Jackson writes ("Re: [RFC PATCH] xen-block: introduces extra request to 
> pass-through SCSI commands"):
>> [stuff suggesting use of PVSCSI instead]
> 
> For the avoidance of doubt:
> 
> 1. Thanks very much for bringing this proposal to us at the concept
> stage.  It is much easier to discuss these matters in a constructive
> way before a lot of effort has been put into an implementation.
> 
> 2. I should explain the downsides which I see in your proposal:
> 
> - Your suggestion has bad security properties: previously, the PV
>   block protocol would present only a very simple and narrow
>   interface.  Your SCSI CDB passthrough proposal means that guests
>   would be able to activate features in SCSI targets which would be
>   unexpected and unintended by the host administrator.  Such features
>   would perhaps even be unknown to the host administrator.
> 
>   This could be mitigated by making this feature configurable, of
>   course, defaulting to off, along with clear documentation.  But it's
>   not a desirable property.
> 
> - For similar reasons it will often be difficult to use such a feature
>   safely.  Guest software in particular might expect that it can
>   safely use whatever features it can see, and do all sorts of
>   exciting things.
> 
> - It involves duplicating multiplexing logic which already exists in
>   PVSCSI.
> 

One thing I'm still not sure about with PVSCSI is whether we have the same
security issue, since LIO can interface to any block device.
E.g. when using a partition /dev/sda1 as the PVSCSI backend, the
PVSCSI frontend may still send SCSI operations on a LUN basis (the whole disk).

P.S. Thanks to all of you, it helps a lot!

-- 
Regards,
-Bob



[Xen-devel] [RFC PATCH] xen-block: introduces extra request to pass-through SCSI commands

2016-02-28 Thread Bob Liu
1) What is this patch about?
This patch introduces a new block operation (BLKIF_OP_EXTRA_FLAG).
A request with BLKIF_OP_EXTRA_FLAG set means the following request is an
extra request, used here to pass through SCSI commands.
This is like a simplified version of XEN_NETIF_EXTRA_* in netif.h.
It can easily be extended to transmit other per-request/bio data from frontend
to backend, e.g. a Data Integrity Field per bio.

2) Why do we need this?
Currently only raw data segments are transmitted from blkfront to blkback, which
means some advanced features are lost.
 * The guest knows nothing about the features of the real backend storage.
For example, in a bare-metal environment the INQUIRY SCSI command can be used
to query storage device information. If it's an SSD or flash device, we
have the option of using the device as a fast cache.
But this can't happen in current domU guests, because blkfront only
knows it's a normal virtual disk.

 * Failover Clusters in Windows
Failover clusters require SCSI-3 persistent reservation target disks,
but currently this can't work in domU.

3) Known issues:
 * Security issues: how to 'validate' this extra request payload.
   E.g. SCSI operates on a LUN basis (the whole disk) while we really just
   want to operate on partitions.

 * Can't pass SCSI commands through if the backend storage driver is bio-based
   instead of request-based.

4) Alternative approach: Using PVSCSI instead:
 * It's doubtful PVSCSI can support as many types of backend storage devices
   as xen-block.

 * Much longer path:
   ioctl() -> SCSI upper layer -> Middle layer -> PVSCSI-frontend -> 
PVSCSI-backend -> Target framework(LIO?) ->

   With xen-block we only need:
   ioctl() -> blkfront -> blkback ->

 * xen-block has existed for many years, is widely used and more stable.

Welcome any input, thank you!
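
For illustration, the frontend side of this 'wire' format could queue a pair
roughly as below (a sketch only: fill_rw_request() and fill_scsi_cmd_req() are
hypothetical helpers, the ring macros are the standard blkif ones):

static void queue_request_with_extra(struct blkfront_ring_info *rinfo,
                                     struct request *req)
{
        struct blkif_request *ring_req, *extra;

        /* Reserve two consecutive slots: the data request, then the extra. */
        ring_req = RING_GET_REQUEST(&rinfo->ring, rinfo->ring.req_prod_pvt++);
        extra    = RING_GET_REQUEST(&rinfo->ring, rinfo->ring.req_prod_pvt++);

        fill_rw_request(ring_req, req);                 /* hypothetical */
        ring_req->operation |= BLKIF_OP_EXTRA_FLAG;     /* an extra follows */
        fill_scsi_cmd_req((struct blkif_request_extra *)extra, req);
}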

Signed-off-by: Bob Liu <bob@oracle.com>
---
 xen/include/public/io/blkif.h |   73 +
 1 file changed, 73 insertions(+)

diff --git a/xen/include/public/io/blkif.h b/xen/include/public/io/blkif.h
index 99f0326..7c10bce 100644
--- a/xen/include/public/io/blkif.h
+++ b/xen/include/public/io/blkif.h
@@ -635,6 +635,28 @@
 #define BLKIF_OP_INDIRECT  6
 
 /*
+ * Recognised only if "feature-extra-request" is present in the backend
+ * xenbus info.  A request with BLKIF_OP_EXTRA_FLAG indicates that an extra
+ * request follows in the shared ring buffer.
+ *
+ * In this way, extra data such as a SCSI command, DIF/DIX and other
+ * per-request/bio data can be transmitted from frontend to backend.
+ *
+ * The 'wire' format is like:
+ *  Request 1: xen_blkif_request
+ * [Request 2: xen_blkif_extra_request] (only if request 1 has BLKIF_OP_EXTRA_FLAG)
+ *  Request 3: xen_blkif_request
+ *  Request 4: xen_blkif_request
+ * [Request 5: xen_blkif_extra_request] (only if request 4 has BLKIF_OP_EXTRA_FLAG)
+ *  ...
+ *  Request N: xen_blkif_request
+ *
+ * If a backend does not recognize BLKIF_OP_EXTRA_FLAG, it should *not*
+ * create the "feature-extra-request" node!
+ */
+#define BLKIF_OP_EXTRA_FLAG (0x80)
+
+/*
  * Maximum scatter/gather segments per request.
  * This is carefully chosen so that sizeof(blkif_ring_t) <= PAGE_SIZE.
  * NB. This could be 12 if the ring indexes weren't stored in the same page.
@@ -703,10 +725,61 @@ struct blkif_request_indirect {
 };
 typedef struct blkif_request_indirect blkif_request_indirect_t;
 
+enum blkif_extra_request_type {
+   BLKIF_EXTRA_TYPE_SCSI_CMD = 1,  /* Transmit SCSI command.  */
+};
+
+struct scsi_cmd_req {
+   /*
+* Grant mapping for transmitting the SCSI command to the backend, and
+* also for receiving sense data from the backend.
+* One 4KB page is enough.
+*/
+   grant_ref_t cmd_gref;
+   /* Length of SCSI command in the grant mapped page. */
+   unsigned int cmd_len;
+
+   /*
+* A SCSI command may require transmitting a data segment whose length is
+* less than a sector (512 bytes).
+* Record num_sg and the last segment length in the extra request so that
+* the backend can know about them.
+*/
+   unsigned int num_sg;
+   unsigned int last_sg_len;
+};
+
+/*
+ * Extra request, must follow a normal-request and a normal-request can
+ * only be followed by one extra request.
+ */
+struct blkif_request_extra {
+   uint8_t type;   /* BLKIF_EXTRA_TYPE_* */
+   uint16_t _pad1;
+#ifndef CONFIG_X86_32
+   uint32_t _pad2; /* offsetof(blkif_...,u.extra.id) == 8 */
+#endif
+   uint64_t id;
+   struct scsi_cmd_req scsi_cmd;
+} __attribute__((__packed__));
+typedef struct blkif_request_extra blkif_request_extra_t;
+
+struct scsi_cmd_res {
+   unsigned int resid_len;
+   /* Length of sense data returned in grant mapped page. */
+   unsigned int sense_len;
+};
+
+struct blkif_response_extra {
uint8_t type;  /* BLKIF_EXTRA_TYPE_* */

Re: [Xen-devel] [PATCH v7 0/2] public/io/netif.h: support for toeplitz hashing

2016-02-02 Thread Bob Liu

On 02/02/2016 05:33 PM, Paul Durrant wrote:
>> -Original Message-
>> From: Bob Liu [mailto:bob@oracle.com]
>> Sent: 02 February 2016 05:02
>> To: Paul Durrant
>> Cc: xen-de...@lists.xenproject.org
>> Subject: Re: [Xen-devel] [PATCH v7 0/2] public/io/netif.h: support for
>> toeplitz hashing
>>
>> Hi Paul,
>>
>> On 01/12/2016 05:58 PM, Paul Durrant wrote:
>>> This series documents changes needed to support toeplitz hashing in a
>>> backend, configurable by the frontend.
>>>
>>> Patch #1 adds further clarifications to the receive and transmit wire
>>> formats.
>>>
>>> Patch #2 documents a new 'control ring' for passing bulk data between
>>> frontend and backend. This is needed for passing the hash mapping table
>>> and hash key. It also documents messages to allow a frontend to configure
>>> toeplitz hashing and a new extra info segment that can be used for passing
>>> hash values along with packets on both the transmit and receive side.
>>>
>>
>> I have a question, why not make the "netif_ctrl_request" a part of the extra
>> info segment?
>> So that can reuse the origin transmit shared ring.
> 
> That was an option but I think it's pretty hacky. Also, extra info segments 
> are pretty small. 
> The multicast control protocol uses that mechanism and it's quite inelegant. 
> I also think a dedicated ring for out-of-band control messages is likely to 
> be of further use in the future.
> 

Nod.

One more question: can requests in this control ring work synchronously
with requests in the normal transmit ring?

I'm seeking a proper way to let the xen-block driver transmit some per-request
extra data from frontend to backend.
It seems a mechanism similar to the extra info segments in xen-network would be
more suitable.

Thanks,
Bob



Re: [Xen-devel] [PATCH v7 0/2] public/io/netif.h: support for toeplitz hashing

2016-02-01 Thread Bob Liu
Hi Paul,

On 01/12/2016 05:58 PM, Paul Durrant wrote:
> This series documents changes needed to support toeplitz hashing in a
> backend, configurable by the frontend.
> 
> Patch #1 adds further clarifications to the receive and transmit wire
> formats.
> 
> Patch #2 documents a new 'control ring' for passing bulk data between
> frontend and backend. This is needed for passing the hash mapping table
> and hash key. It also documents messages to allow a frontend to configure
> toeplitz hashing and a new extra info segment that can be used for passing
> hash values along with packets on both the transmit and receive side.
> 

I have a question: why not make the "netif_ctrl_request" a part of the extra
info segment?
That way it could reuse the original transmit shared ring.

-- 
Regards,
-Bob



Re: [Xen-devel] netfront/netback multiqueue exhausting grants

2016-01-22 Thread Bob Liu


On 01/22/2016 03:53 PM, Jan Beulich wrote:
 On 22.01.16 at 04:36,  wrote:
>> By the way, do you think it's possible to make grant table support bigger 
>> page e.g 64K?
>> One grant-ref per 64KB instead of 4KB, this should able to reduce the grant 
>> entry consumption significantly.
> 
> How would that work with an underlying page size of 4k, and pages
> potentially being non-contiguous in machine address space? Besides
> that the grant table hypercall interface isn't prepared to support
> 64k page size, due to its use of uint16_t for the length of copy ops.
> 

Right, and I mean whether we should consider addressing all the places you
mentioned.
With multi-queue xen-block and xen-network, we have received more reports that
grants were exhausted.
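
To make the exhaustion concrete (back-of-the-envelope, assuming the default 32
grant-table frames and 8-byte v1 entries): 32 frames * 4096 / 8 = 16384 grant
entries per guest, while a single netfront with 32 queues of 256-slot tx and rx
rings already pins 32 * (256 + 256) = 16384 grants on its own, before any block
rings or persistent grants are counted; that matches the "only created 31
queues" reports.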

-- 
Regards,
-Bob



Re: [Xen-devel] [RFC] VirtFS support on Xen

2016-01-22 Thread Bob Liu
Hi Wei,

On 01/21/2016 06:59 PM, Wei Liu wrote:
> On Thu, Jan 21, 2016 at 10:50:08AM +, David Vrabel wrote:
>> On 21/01/16 10:28, Wei Liu wrote:
>>> [RFC] VirtFS support on Xen
>>>
>>> # Introduction
>>>
>>> QEMU/KVM supports file system passthrough via an interface called
>>> VirtFS [0]. VirtFS is in turn implemented with 9pfs protocol [1] and
>>> VirtIO transport.
>>>
>>> Xen used to have its own implementation of file system passthrough
>>> called XenFS, but that has been inactive for a few years. The latest
>>> update was in 2009 [2].
>>>
>>> This project aims to add VirtFS support on Xen. This is more
>>> sustainable than inventing our own wheel.#
>>
>> What's the use case for this?  Who wants this feature?
>>
> 
> Anyone who wants file system passthrough.  More specifically, VM-based
> container solutions can share files from host file system.
> 

I'm a bit confused: can't we just use the VirtFS of QEMU?
E.g.:
./configure --with-extra-qemuu-configure-args="--enable-virtfs"

Thanks,
Bob

> http://xendevsummit2015.sched.org/event/3WDg/xen-containers-better-way-to-run-docker-containers-sainath-grandhi-intel?iframe=no==yes=no
> http://xendevsummit2015.sched.org/event/3X1L/hyper-make-vm-run-like-containers-xu-wang-hyper?iframe=no==yes=no
> 
> Wei.
> 



Re: [Xen-devel] netfront/netback multiqueue exhausting grants

2016-01-22 Thread Bob Liu

On 01/22/2016 07:02 PM, Jan Beulich wrote:
 On 22.01.16 at 11:40,  wrote:
>> On 01/22/2016 03:53 PM, Jan Beulich wrote:
>> On 22.01.16 at 04:36,  wrote:
 By the way, do you think it's possible to make grant table support bigger 
 page e.g 64K?
 One grant-ref per 64KB instead of 4KB, this should able to reduce the 
 grant 
 entry consumption significantly.
>>>
>>> How would that work with an underlying page size of 4k, and pages
>>> potentially being non-contiguous in machine address space? Besides
>>> that the grant table hypercall interface isn't prepared to support
>>> 64k page size, due to its use of uint16_t for the length of copy ops.
>>
>> Right, and I mean whether we should consider address all the place as your 
>> mentioned.
> 
> Just from an abstract perspective: How would you envision to avoid
> machine address discontiguity? Or would you want to limit such an

E.g. reserve a page pool with contiguous 64KB pages, or make grant-map support
huge pages (2MB)?
To be honest, I haven't thought much about the details.

Do you think that's unlikely to be implemented?
If so, we have to limit the number of queues, VMs and vdisks/vifs in a proper
way, to make sure guests won't enter a grant-exhausted state.

> improvement to only HVM/PVH/HVMlite guests?
> 
> Jan
> 

-- 
Regards,
-Bob



Re: [Xen-devel] [RFC] VirtFS support on Xen

2016-01-22 Thread Bob Liu

On 01/22/2016 06:50 PM, Wei Liu wrote:
> On Fri, Jan 22, 2016 at 06:45:30PM +0800, Bob Liu wrote:
>> Hi Wei,
>>
>> On 01/21/2016 06:59 PM, Wei Liu wrote:
>>> On Thu, Jan 21, 2016 at 10:50:08AM +, David Vrabel wrote:
>>>> On 21/01/16 10:28, Wei Liu wrote:
>>>>> [RFC] VirtFS support on Xen
>>>>>
>>>>> # Introduction
>>>>>
>>>>> QEMU/KVM supports file system passthrough via an interface called
>>>>> VirtFS [0]. VirtFS is in turn implemented with 9pfs protocol [1] and
>>>>> VirtIO transport.
>>>>>
>>>>> Xen used to have its own implementation of file system passthrough
>>>>> called XenFS, but that has been inactive for a few years. The latest
>>>>> update was in 2009 [2].
>>>>>
>>>>> This project aims to add VirtFS support on Xen. This is more
>>>>> sustainable than inventing our own wheel.#
>>>>
>>>> What's the use case for this?  Who wants this feature?
>>>>
>>>
>>> Anyone who wants file system passthrough.  More specifically, VM-based
>>> container solutions can share files from host file system.
>>>
>>
>> I'm a bit confused, can't we just use the VirtFS of Qemu?
>> E.g
>> ./configure --with-extra-qemuu-configure-args="--enable-virtfs"
>>
> 
> Yes, in theory you can -- with VirtIO transport. But I'm not sure if
> Virtio has been fixed to work with Xen.  That also means you need QEMU
> emulation, which we don't really need (or want) when running in PV or
> PVH mode.
> 

Just to make sure I get it right, in the KVM case:
Linux guest (v9fs-client) -> VirtIO transport -> QEMU (v9fs-server) -> local
file system in the host

And your plan is:
DomU (v9fs-client) -> Xen transport (grant-map based) -> QEMU (v9fs-server) ->
local file system in Dom0

Which means we need to implement a Xen transport in linux/net/9p/, and also
make QEMU recognize this transport, because we need QEMU to run as the
v9fs-server anyway?
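
If so, the Linux side would presumably register a new 9p transport much like
net/9p/trans_virtio.c does; a hedged sketch (all xen_9p_* callbacks are
hypothetical placeholders for the grant/ring plumbing):

#include <linux/module.h>
#include <net/9p/9p.h>
#include <net/9p/client.h>
#include <net/9p/transport.h>

static int xen_9p_create(struct p9_client *client, const char *addr, char *args);
static void xen_9p_close(struct p9_client *client);
static int xen_9p_request(struct p9_client *client, struct p9_req_t *req);
static int xen_9p_cancel(struct p9_client *client, struct p9_req_t *req);

static struct p9_trans_module p9_xen_trans = {
        .name    = "xen",
        .maxsize = 1 << 16,     /* assumption: bounded by the ring/grant setup */
        .create  = xen_9p_create,
        .close   = xen_9p_close,
        .request = xen_9p_request,
        .cancel  = xen_9p_cancel,
        .owner   = THIS_MODULE,
};

static int __init xen_9p_init(void)
{
        v9fs_register_trans(&p9_xen_trans);
        return 0;
}
module_init(xen_9p_init);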

Bob



Re: [Xen-devel] netfront/netback multiqueue exhausting grants

2016-01-21 Thread Bob Liu

On 01/21/2016 08:19 PM, Ian Campbell wrote:
> On Thu, 2016-01-21 at 10:56 +, David Vrabel wrote:
>> On 20/01/16 12:23, Ian Campbell wrote:
>>> There have been a few reports recently[0] which relate to a failure of
>>> netfront to allocate sufficient grant refs for all the queues:
>>>
>>> [0.533589] xen_netfront: can't alloc rx grant refs
>>> [0.533612] net eth0: only created 31 queues
>>>
>>> Which can be worked around by increasing the number of grants on the
>>> hypervisor command line or by limiting the number of queues permitted
>>> by
>>> either back or front using a module param (which was broken but is now
>>> fixed on both sides, but I'm not sure it has been backported everywhere
>>> such that it is a reliable thing to always tell users as a workaround).
>>>
>>> Is there any plan to do anything about the default/out of the box
>>> experience? Either limiting the number of queues or making both ends
>>> cope
>>> more gracefully with failure to create some queues (or both) might be
>>> sufficient?
>>>
>>> I think the crash after the above in the first link at [0] is fixed? I
>>> think that was the purpose of ca88ea1247df "xen-netfront: update
>>> num_queues
>>> to real created" which was in 4.3.
>>
>> I think the correct solution is to increase the default maximum grant
>> table size.
> 
> That could well make sense, but then there will just be another higher
> limit, so we should perhaps do both.
> 
> i.e. factoring in:
>  * performance i.e. ability for N queues to saturate whatever sort of link
>contemporary Linux can saturate these days, plus some headroom, or
>whatever other ceiling seems sensible)
>  * grant table resource consumption i.e. (sensible max number of blks * nr
>gnts per blk + sensible max number of vifs * nr gnts per vif + other
>devs needs) < per guest grant limit) to pick both the default gnttab
>size and the default max queuers.
> 

Agree.
By the way, do you think it's possible to make the grant table support bigger
pages, e.g. 64K?
One grant-ref per 64KB instead of per 4KB should be able to reduce grant entry
consumption significantly.

Bob

> (or s/sensible/supportable/g etc).
> 
>> Although, unless you're using the not-yet-applied per-cpu rwlock patches
>> multiqueue is terrible on many (multisocket) systems and the number of
>> queue should be limited in netback to 4 or even just 2.
> 
> Presumably the guest can't tell, so it can't do this.
> 
> I think when you say "terrible" you don't mean "worse than without mq" but
> rather "not realising the expected gains from a larger nunber of queues",
> right?.
> 
> Ian.



Re: [Xen-devel] [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu

2016-01-20 Thread Bob Liu

On 01/20/2016 11:41 PM, Konrad Rzeszutek Wilk wrote:
> Neither of these are sufficient however.  That gets Qemu a mapping of
> the NVDIMM, not the guest.  Something, one way or another, has to turn
> this into appropriate add-to-phymap hypercalls.
>

 Yes, those hypercalls are what I'm going to add.
>>>
>>> Why?
>>>
>>> What you need (in a rought hand-wave way) is to:
>>>  - mount /dev/pmem0
>>>  - mmap the file on /dev/pmem0 FS
>>>  - walk the VMA for the file - extract the MFN (machien frame numbers)
>>

If I understand right, in this case the MFNs are given by the block layout of
the DAX file?
If we find all the file blocks, then we get all the MFNs.

>> Can this step be done by QEMU? Or does linux kernel provide some
>> approach for the userspace to do the translation?
> 

The ioctl(fd, FIBMAP, &block) call may help, which can get the LBAs that a
given file occupies.
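
A minimal userspace sketch of that FIBMAP lookup for one block (FIBMAP needs
CAP_SYS_RAWIO, and the returned number is in units of the FIGETBSZ block size):

#include <stdio.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/fs.h>           /* FIBMAP, FIGETBSZ */

int main(int argc, char **argv)
{
        int fd, blksz;
        int blk = 0;            /* in: logical block index; out: physical */

        if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0)
                return 1;
        if (ioctl(fd, FIGETBSZ, &blksz) < 0 || ioctl(fd, FIBMAP, &blk) < 0)
                return 1;
        printf("logical block 0 -> physical block %d (block size %d)\n",
               blk, blksz);
        return 0;
}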

-Bob

> I don't know. I would think no - as you wouldn't want the userspace
> application to figure out the physical frames from the virtual
> address (unless they are root). But then if you look in
> /proc//maps and /proc//smaps there are some data there.
> 
> Hm, /proc//pagemaps has something intersting
> 
> See pagemap_read function. That looks to be doing it?
> 
>>
>> Haozhong
>>
>>>  - feed those frame numbers to xc_memory_mapping hypercall. The
>>>guest pfns would be contingous.
>>>Example: say the E820_NVDIMM starts at 8GB->16GB, so an 8GB file on
>>>/dev/pmem0 FS - the guest pfns are 0x20 upward.
>>>
>>>However the MFNs may be discontingous as the NVDIMM could be an
>>>1TB - and the 8GB file is scattered all over.
>>>
>>> I believe that is all you would need to do?



[Xen-devel] [PATCH] xen-blkback: fix two memleaks

2015-12-09 Thread Bob Liu
This patch fixes two memleaks in konrad/xen.git/for-jens-4.5.
  backtrace:
[] kmemleak_alloc+0x28/0x50
[] kmem_cache_alloc+0xbb/0x1d0
[] xen_blkbk_probe+0x58/0x230
[] xenbus_dev_probe+0x76/0x130
[] driver_probe_device+0x166/0x2c0
[] __device_attach_driver+0xac/0xb0
[] bus_for_each_drv+0x67/0x90
[] __device_attach+0xc7/0x120
[] device_initial_probe+0x13/0x20
[] bus_probe_device+0x9a/0xb0
[] device_add+0x3b1/0x5c0
[] device_register+0x1e/0x30
[] xenbus_probe_node+0x158/0x170
[] xenbus_dev_changed+0x1af/0x1c0
[] backend_changed+0x1b/0x20
[] xenwatch_thread+0xb6/0x160
unreferenced object 0x880007ba8ef8 (size 224):

  backtrace:
[] kmemleak_alloc+0x28/0x50
[] __kmalloc+0xd3/0x1e0
[] frontend_changed+0x2c7/0x580
[] xenbus_otherend_changed+0xa2/0xb0
[] frontend_changed+0x10/0x20
[] xenwatch_thread+0xb6/0x160
[] kthread+0xd7/0xf0
[] ret_from_fork+0x3f/0x70
[] 0x
unreferenced object 0x8800048dcd38 (size 224):

The first leak is caused by not putting the be->blkif reference taken in
xen_blkif_alloc(), while the second is caused by not freeing blkif->rings in
the right place.

Signed-off-by: Bob Liu <bob@oracle.com>
---
 drivers/block/xen-blkback/xenbus.c |   13 -
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/drivers/block/xen-blkback/xenbus.c 
b/drivers/block/xen-blkback/xenbus.c
index 44396b8..dabdb18 100644
--- a/drivers/block/xen-blkback/xenbus.c
+++ b/drivers/block/xen-blkback/xenbus.c
@@ -246,6 +246,9 @@ static int xen_blkif_disconnect(struct xen_blkif *blkif)
struct pending_req *req, *n;
unsigned int j, r;
 
+   if (!blkif->rings)
+   goto out;
+
for (r = 0; r < blkif->nr_rings; r++) {
struct xen_blkif_ring *ring = &blkif->rings[r];
unsigned int i = 0;
@@ -299,7 +302,14 @@ static int xen_blkif_disconnect(struct xen_blkif *blkif)
WARN_ON(i != (XEN_BLKIF_REQS_PER_PAGE * blkif->nr_ring_pages));
}
blkif->nr_ring_pages = 0;
+   /*
+* blkif->rings was allocated in connect_ring, so we should free it in
+* disconnect.
+*/
+   kfree(blkif->rings);
+   blkif->rings = NULL;
 
+out:
return 0;
 }
 
@@ -310,7 +320,6 @@ static void xen_blkif_free(struct xen_blkif *blkif)
xen_vbd_free(&blkif->vbd);
 
/* Make sure everything is drained before shutting down */
-   kfree(blkif->rings);
kmem_cache_free(xen_blkif_cachep, blkif);
 }
 
@@ -505,6 +514,8 @@ static int xen_blkbk_remove(struct xenbus_device *dev)
xen_blkif_put(be->blkif);
}
 
+   /* Put the reference got in xen_blkif_alloc(). */
+   xen_blkif_put(be->blkif);
kfree(be->mode);
kfree(be);
return 0;
-- 
1.7.10.4


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] blkback feature announcement

2015-12-08 Thread Bob Liu

On 12/08/2015 07:13 PM, Jan Beulich wrote:
 On 08.12.15 at 12:06,  wrote:
> 
>> On 12/08/2015 04:13 PM, Jan Beulich wrote:
>> On 08.12.15 at 02:08,  wrote:
 On 12/07/2015 08:42 PM, Roger Pau Monné wrote:
> El 07/12/15 a les 13.00, Jan Beulich ha escrit:
>> Hello,
>>
>> is there a particular reason why "max-ring-page-order" gets written in
>> xen_blkbk_probe(), but e.g. "feature-max-indirect-segments" and
>> "feature-persistent" get written only in connect(), despite both having
>> constant values (and hence the node value effectively being known as
>> soon as the device exists)?
>
> No, AFAIK there's no specific reason.
>

 AFAIR, that's for the blkfront resume path.

 We need to get the "max-ring-page-order" in blkfront_resume() in advance,
 so that we can know how many ring pages to be used before setup_blkring().
>>>
>>> I don't follow - the proposal is to have the backend announce the
>>> feature _earlier_, so how could frontend resume be affected?
>>>
>>
>> The frontend resume is like this:
>>
>> blkfront_resume()
>>  > blkif_free()
>>  > talk_to_blkback()
>>  > setup_blkring() etc.
>>
>> blkback_changed()
>>  > blkfront_connect()
>>
>>
>> Sometimes the "max-ring-page-order" of the backend may have changed after the
>> guest (frontend) migrated to a different machine;
>> the frontend must be aware of this change and has to get the new value of
>> "max-ring-page-order" in blkfront_resume().
>>
>> But it would be too late if the backend announces the "max-ring-page-order"
>> in connect(); the situation is like this:
>>
>> blkfront_resume()
>>  > blkif_free()
>>  > talk_to_blkback()
>>   ^^^ Get a wrong "max-ring-page-order"
>>  > setup_blkring() etc. but using the wrong value!!
>>
>> blkback_changed()
>>  > blkfront_connect()
>>^^^ Then connect() in the backend will be called (after the frontend enters
>> XenbusStateConnected) and write the correct "max-ring-page-order", but it's
>> too late.
> 
> Oh, you're arguing for why "max-ring-page-order" is written early,
> but the question really was why others aren't written early too.
> 

Oh, sorry for misunderstanding your point.
Yes, I agree that others can be written early too.

-- 
Regards,
-Bob

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] blkback feature announcement

2015-12-08 Thread Bob Liu

On 12/08/2015 04:13 PM, Jan Beulich wrote:
 On 08.12.15 at 02:08,  wrote:
>> On 12/07/2015 08:42 PM, Roger Pau Monné wrote:
>>> El 07/12/15 a les 13.00, Jan Beulich ha escrit:
 Hello,

 is there a particular reason why "max-ring-page-order" gets written in
 xen_blkbk_probe(), but e.g. "feature-max-indirect-segments" and
 "feature-persistent" get written only in connect(), despite both having
 constant values (and hence the node value effectively being known as
 soon as the device exists)?
>>>
>>> No, AFAIK there's no specific reason.
>>>
>>
>> AFAIR, that's for the blkfront resume path.
>>
>> We need to get the "max-ring-page-order" in blkfront_resume() in advance, so 
>> that we can know how many ring pages to be used before setup_blkring().
> 
> I don't follow - the proposal is to have the backend announce the
> feature _earlier_, so how could frontend resume be affected?
> 

The frontend resume is like this:

blkfront_resume()
 > blkif_free()
 > talk_to_blkback()
 > setup_blkring() etc.

blkback_changed()
 > blkfront_connect()


Sometimes the "max-ring-page-order" of the backend may have changed after the
guest (frontend) migrated to a different machine;
the frontend must be aware of this change and has to get the new value of
"max-ring-page-order" in blkfront_resume().

But it would be too late if the backend announces the "max-ring-page-order" in
connect(); the situation is like this:

blkfront_resume()
 > blkif_free()
 > talk_to_blkback()
 ^^^ Get a wrong "max-ring-page-order"
 > setup_blkring() etc. but using the wrong value!!

blkback_changed()
 > blkfront_connect()
   ^^^ Then connect() in the backend will be called (after the frontend enters
XenbusStateConnected) and write the correct "max-ring-page-order", but it's too
late.
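
For reference, the early read in question is the one talk_to_blkback() does
(the same snippet appears in the multi-queue series later in this archive):

	err = xenbus_scanf(XBT_NIL, info->xbdev->otherend,
			   "max-ring-page-order", "%u", &max_page_order);

so blkfront_resume() has to be able to see the backend's current value before
allocating the rings.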

Bob

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] [PATCH] xen-blkback: make st_ statistics per ring

2015-12-08 Thread Bob Liu
Make the st_* statistics per-ring, and have the VBD sysfs code iterate over all
the rings.

Note: xenvbd_sysfs_delif() is called in xen_blkbk_remove() before all rings
are torn down, so it's safe.
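
For illustration only, a VBD_SHOW-style sysfs attribute would then sum a
per-ring counter across all rings, roughly like this (a sketch using the
fields from this patch, not the literal xenbus.c hunk):

	unsigned long long rd_req = 0;
	unsigned int i;

	for (i = 0; i < be->blkif->nr_rings; i++)
		rd_req += be->blkif->rings[i].st_rd_req;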

This patch was based on konrad/xen.git/for-jens-4.5.

Signed-off-by: Bob Liu <bob@oracle.com>
---
 drivers/block/xen-blkback/blkback.c |   34 +-
 drivers/block/xen-blkback/common.h  |   20 
 drivers/block/xen-blkback/xenbus.c  |   45 +++
 3 files changed, 61 insertions(+), 38 deletions(-)

diff --git a/drivers/block/xen-blkback/blkback.c 
b/drivers/block/xen-blkback/blkback.c
index a00d6c6..ef358d7 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -587,20 +587,18 @@ irqreturn_t xen_blkif_be_int(int irq, void *dev_id)
 
 static void print_stats(struct xen_blkif_ring *ring)
 {
-   struct xen_blkif *blkif = ring->blkif;
-
pr_info("(%s): oo %3llu  |  rd %4llu  |  wr %4llu  |  f %4llu"
 "  |  ds %4llu | pg: %4u/%4d\n",
-current->comm, blkif->st_oo_req,
-blkif->st_rd_req, blkif->st_wr_req,
-blkif->st_f_req, blkif->st_ds_req,
+current->comm, ring->st_oo_req,
+ring->st_rd_req, ring->st_wr_req,
+ring->st_f_req, ring->st_ds_req,
 ring->persistent_gnt_c,
 xen_blkif_max_pgrants);
-   blkif->st_print = jiffies + msecs_to_jiffies(10 * 1000);
-   blkif->st_rd_req = 0;
-   blkif->st_wr_req = 0;
-   blkif->st_oo_req = 0;
-   blkif->st_ds_req = 0;
+   ring->st_print = jiffies + msecs_to_jiffies(10 * 1000);
+   ring->st_rd_req = 0;
+   ring->st_wr_req = 0;
+   ring->st_oo_req = 0;
+   ring->st_ds_req = 0;
 }
 
 int xen_blkif_schedule(void *arg)
@@ -655,7 +653,7 @@ purge_gnt_list:
/* Shrink if we have more than xen_blkif_max_buffer_pages */
shrink_free_pagepool(ring, xen_blkif_max_buffer_pages);
 
-   if (log_stats && time_after(jiffies, ring->blkif->st_print))
+   if (log_stats && time_after(jiffies, ring->st_print))
print_stats(ring);
}
 
@@ -1017,7 +1015,7 @@ static int dispatch_discard_io(struct xen_blkif_ring 
*ring,
preq.sector_number + preq.nr_sects, blkif->vbd.pdevice);
goto fail_response;
}
-   blkif->st_ds_req++;
+   ring->st_ds_req++;
 
secure = (blkif->vbd.discard_secure &&
 (req->u.discard.flag & BLKIF_DISCARD_SECURE)) ?
@@ -1144,7 +1142,7 @@ __do_block_io_op(struct xen_blkif_ring *ring)
 
pending_req = alloc_req(ring);
if (NULL == pending_req) {
-   ring->blkif->st_oo_req++;
+   ring->st_oo_req++;
more_to_do = 1;
break;
}
@@ -1242,17 +1240,17 @@ static int dispatch_rw_block_io(struct xen_blkif_ring 
*ring,
 
switch (req_operation) {
case BLKIF_OP_READ:
-   ring->blkif->st_rd_req++;
+   ring->st_rd_req++;
operation = READ;
break;
case BLKIF_OP_WRITE:
-   ring->blkif->st_wr_req++;
+   ring->st_wr_req++;
operation = WRITE_ODIRECT;
break;
case BLKIF_OP_WRITE_BARRIER:
drain = true;
case BLKIF_OP_FLUSH_DISKCACHE:
-   ring->blkif->st_f_req++;
+   ring->st_f_req++;
operation = WRITE_FLUSH;
break;
default:
@@ -1394,9 +1392,9 @@ static int dispatch_rw_block_io(struct xen_blkif_ring 
*ring,
blk_finish_plug(&plug);
 
if (operation == READ)
-   ring->blkif->st_rd_sect += preq.nr_sects;
+   ring->st_rd_sect += preq.nr_sects;
else if (operation & WRITE)
-   ring->blkif->st_wr_sect += preq.nr_sects;
+   ring->st_wr_sect += preq.nr_sects;
 
return 0;
 
diff --git a/drivers/block/xen-blkback/common.h 
b/drivers/block/xen-blkback/common.h
index 54f126e..2495bb3 100644
--- a/drivers/block/xen-blkback/common.h
+++ b/drivers/block/xen-blkback/common.h
@@ -298,6 +298,16 @@ struct xen_blkif_ring {
atomic_t	persistent_gnt_in_use;
unsigned long   next_lru;
 
+   /* statistics */
+   unsigned long   st_print;
+   unsigned long long  st_rd_req;
+   unsigned long long  st_wr_req;
+   unsigned long long  st_oo_req;
+   unsigned long long  st_f_req;
+   unsigned long long  st_ds_req;

Re: [Xen-devel] blkback feature announcement

2015-12-07 Thread Bob Liu

On 12/07/2015 08:42 PM, Roger Pau Monné wrote:
> El 07/12/15 a les 13.00, Jan Beulich ha escrit:
>> Hello,
>>
>> is there a particular reason why "max-ring-page-order" gets written in
>> xen_blkbk_probe(), but e.g. "feature-max-indirect-segments" and
>> "feature-persistent" get written only in connect(), despite both having
>> constant values (and hence the node value effectively being known as
>> soon as the device exists)?
> 
> No, AFAIK there's no specific reason.
> 

AFAIR, that's for the blkfront resume path.

We need to get the "max-ring-page-order" in blkfront_resume() in advance, so 
that we can know how many ring pages to be used before setup_blkring().

Bob.

>> Or in more general terms: Shouldn't it be well defined at what time
>> a frontend can rely on certain nodes to be available for inspection?
>> And in doing so, I'd expect the determination to be done such that
>> widest flexibility is provided towards the actual implementation, i.e.
>> nodes should be written as early as possible. (Of course this applies
>> to other frontend/backend pairs too.)
> 
> I agree. Regarding blkback the nodes about persistent grants, indirect
> descriptors and the ring page order should be written in
> xen_blkbk_probe, while the specific information about this virtual disk
> (sectors, sector size...) should be written before switching to the
> connected state (ie: after hotplug scripts have run).
> 
> Roger.
> 

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH v5 00/10] xen-block: multi hardware-queues/rings support

2015-11-25 Thread Bob Liu

On 11/26/2015 06:12 AM, Konrad Rzeszutek Wilk wrote:
> On Wed, Nov 25, 2015 at 03:56:03PM -0500, Konrad Rzeszutek Wilk wrote:
>> On Wed, Nov 25, 2015 at 02:25:07PM -0500, Konrad Rzeszutek Wilk wrote:
   xen/blkback: separate ring information out of struct xen_blkif
   xen/blkback: pseudo support for multi hardware queues/rings
   xen/blkback: get the number of hardware queues/rings from blkfront
   xen/blkback: make pool of persistent grants and free pages per-queue
>>>
>>> OK, got to those as well. I have put them in 'devel/for-jens-4.5' and
>>> are going to test them overnight before pushing them out.
>>>
>>> I see two bugs in the code that we MUST deal with:
>>>
>>>  - print_stats () is going to show zero values.
>>>  - the sysfs code (VBD_SHOW) aren't converted over to fetch data
>>>from all the rings.
>>
>> - kthread_run can't handle the two "name, i" arguments. I see:
>>
>> root  5101 2  0 20:47 ?00:00:00 [blkback.3.xvda-]
>> root  5102 2  0 20:47 ?00:00:00 [blkback.3.xvda-]
> 
> And doing save/restore:
> 
> xl save  /tmp/A;
> xl restore /tmp/A;
> 
> ends up with us losing the proper state and not getting the ring setup back.
> I see this is backend:
> 
> [ 2719.448600] vbd vbd-22-51712: -1 guest requested 0 queues, exceeding the 
> maximum of 3.
> 
> And XenStore agrees:
> tool = ""
>  xenstored = ""
> local = ""
>  domain = ""
>   0 = ""
>domid = "0"
>name = "Domain-0"
>device-model = ""
> 0 = ""
>  state = "running"
>error = ""
> backend = ""
>  vbd = ""
>   2 = ""
>51712 = ""
> error = "-1 guest requested 0 queues, exceeding the maximum of 3."
> 
> .. which also leads to a memory leak as xen_blkbk_remove never gets
> called.

I think that was already fixed by your patch:
[PATCH RFC 2/2] xen/blkback: Free resources if connect_ring failed.

P.S. I didn't see your git tree updated with these patches.

-- 
Regards,
-Bob

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH v5 00/10] xen-block: multi hardware-queues/rings support

2015-11-25 Thread Bob Liu

On 11/26/2015 10:57 AM, Konrad Rzeszutek Wilk wrote:
> On Thu, Nov 26, 2015 at 10:28:10AM +0800, Bob Liu wrote:
>>
>> On 11/26/2015 06:12 AM, Konrad Rzeszutek Wilk wrote:
>>> On Wed, Nov 25, 2015 at 03:56:03PM -0500, Konrad Rzeszutek Wilk wrote:
>>>> On Wed, Nov 25, 2015 at 02:25:07PM -0500, Konrad Rzeszutek Wilk wrote:
>>>>>>   xen/blkback: separate ring information out of struct xen_blkif
>>>>>>   xen/blkback: pseudo support for multi hardware queues/rings
>>>>>>   xen/blkback: get the number of hardware queues/rings from blkfront
>>>>>>   xen/blkback: make pool of persistent grants and free pages per-queue
>>>>>
>>>>> OK, got to those as well. I have put them in 'devel/for-jens-4.5' and
>>>>> are going to test them overnight before pushing them out.
>>>>>
>>>>> I see two bugs in the code that we MUST deal with:
>>>>>
>>>>>  - print_stats () is going to show zero values.
>>>>>  - the sysfs code (VBD_SHOW) aren't converted over to fetch data
>>>>>from all the rings.
>>>>
>>>> - kthread_run can't handle the two "name, i" arguments. I see:
>>>>
>>>> root  5101 2  0 20:47 ?00:00:00 [blkback.3.xvda-]
>>>> root  5102 2  0 20:47 ?00:00:00 [blkback.3.xvda-]
>>>
>>> And doing save/restore:
>>>
>>> xl save  /tmp/A;
>>> xl restore /tmp/A;
>>>
>>> ends up with us losing the proper state and not getting the ring setup back.
>>> I see this is backend:
>>>
>>> [ 2719.448600] vbd vbd-22-51712: -1 guest requested 0 queues, exceeding the 
>>> maximum of 3.
>>>
>>> And XenStore agrees:
>>> tool = ""
>>>  xenstored = ""
>>> local = ""
>>>  domain = ""
>>>   0 = ""
>>>domid = "0"
>>>name = "Domain-0"
>>>device-model = ""
>>> 0 = ""
>>>  state = "running"
>>>error = ""
>>> backend = ""
>>>  vbd = ""
>>>   2 = ""
>>>51712 = ""
>>> error = "-1 guest requested 0 queues, exceeding the maximum of 3."
>>>
>>> .. which also leads to a memory leak as xen_blkbk_remove never gets
>>> called.
>>
>> I think that was already fixed by your patch:
>> [PATCH RFC 2/2] xen/blkback: Free resources if connect_ring failed.
> 
> Nope. I get that with or without the patch.
> 

Attached patch should fix this issue. 

-- 
Regards,
-Bob
From f297a05fc27fb0bc9a3ed15407f8cc6ffd5e2a00 Mon Sep 17 00:00:00 2001
From: Bob Liu <bob@oracle.com>
Date: Wed, 25 Nov 2015 14:56:32 -0500
Subject: [PATCH 1/2] xen:blkfront: fix compile error
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Fix this build error:
drivers/block/xen-blkfront.c: In function ‘blkif_free’:
drivers/block/xen-blkfront.c:1234:6: error: ‘struct blkfront_info’ has no
member named ‘ring’
  info->ring = NULL;

Signed-off-by: Bob Liu <bob@oracle.com>
---
 drivers/block/xen-blkfront.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 625604d..ef5ce43 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -1231,7 +1231,7 @@ static void blkif_free(struct blkfront_info *info, int suspend)
		blkif_free_ring(&info->rinfo[i]);

 	kfree(info->rinfo);
-	info->ring = NULL;
+	info->rinfo = NULL;
 	info->nr_rings = 0;
 }

--
1.8.3.1

From aab0bb1690213e665966ea22b021e0eeaacfc717 Mon Sep 17 00:00:00 2001
From: Bob Liu <bob@oracle.com>
Date: Wed, 25 Nov 2015 17:52:55 -0500
Subject: [PATCH 2/2] xen/blkfront: realloc ring info in blkif_resume

Need to reallocate the ring info in the resume path, because info->rinfo was
freed in blkif_free(), and the "multi-queue-max-queues" value the backend
reports may have changed.
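
A minimal sketch of the reallocation this implies (the literal patch body is
truncated below; names are taken from the rest of the series):

	/* After re-negotiating info->nr_rings, reallocate the per-ring array. */
	info->rinfo = kzalloc(sizeof(struct blkfront_ring_info) * info->nr_rings,
			      GFP_KERNEL);
	if (!info->rinfo)
		return -ENOMEM;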

Signed-off-by: Bob Liu <bob@oracle.com>
---
 drivers/block/xen-blkfront.c | 28 +++-
 1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index ef5ce43..9634a65 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -1926,12 +1926,38 @@ static int blkif_recover(struct blkfront_info *info)
 static int blkfront_resume(struct xenbus_device *dev)
 {
	struct blkfront_info *info = dev_get_drvdata(&dev->dev);
-	int err;
+	i

Re: [Xen-devel] [PATCH v5 05/10] xen/blkfront: negotiate number of queues/rings to be used with backend

2015-11-16 Thread Bob Liu

On 11/17/2015 05:27 AM, Konrad Rzeszutek Wilk wrote:
>>  /* Common code used when first setting up, and when resuming. */
>>  static int talk_to_blkback(struct xenbus_device *dev,
>> @@ -1527,10 +1582,9 @@ static int talk_to_blkback(struct xenbus_device *dev,
>>  {
>>  const char *message = NULL;
>>  struct xenbus_transaction xbt;
>> -int err, i;
>> -unsigned int max_page_order = 0;
>> +int err;
>> +unsigned int i, max_page_order = 0;
>>  unsigned int ring_page_order = 0;
>> -struct blkfront_ring_info *rinfo;
> 
> Why? You end up doing the 'struct blkfront_ring_info' decleration
> in two of the loops below?

Oh, that's because Roger suggested we declare rinfo only inside the for loop,
to limit its scope.

>>  
>>  err = xenbus_scanf(XBT_NIL, info->xbdev->otherend,
>> "max-ring-page-order", "%u", _page_order);
>> @@ -1542,7 +1596,8 @@ static int talk_to_blkback(struct xenbus_device *dev,
>>  }
>>  
>>  for (i = 0; i < info->nr_rings; i++) {
>> -rinfo = &info->rinfo[i];
>> +struct blkfront_ring_info *rinfo = &info->rinfo[i];
>> +
> 
> Here..
> 
>> @@ -1617,7 +1677,7 @@ again:
>>  
>>  for (i = 0; i < info->nr_rings; i++) {
>>  int j;
>> -rinfo = &info->rinfo[i];
>> +struct blkfront_ring_info *rinfo = &info->rinfo[i];
> 
> And here?
> 
> It is not a big deal but I am curious of why add this change?
> 
>> @@ -1717,7 +1789,6 @@ static int blkfront_probe(struct xenbus_device *dev,
>>  
>>  mutex_init(>mutex);
>>  spin_lock_init(>dev_lock);
>> -info->xbdev = dev;
> 
> That looks like a spurious change? Ah, I see that we do the same exact
> operation earlier in the blkfront_probe.
> 

The place of this line was changed because:

1738 info->xbdev = dev;
1739 /* Check if backend supports multiple queues. */
1740 err = xenbus_scanf(XBT_NIL, info->xbdev->otherend,
1741                    "multi-queue-max-queues", "%u", &backend_max_queues);
1742 if (err < 0)
1743         backend_max_queues = 1;

We need xbdev to be set in advance.


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] [PATCH v5 07/10] xen/blkback: pseudo support for multi hardware queues/rings

2015-11-13 Thread Bob Liu
Preparatory patch for multiple hardware queues (rings). The number of
rings is unconditionally set to 1; a larger number will be enabled in the next
patch ("xen/blkback: get the number of hardware queues/rings from blkfront") so
as to make every single patch small and readable.

Signed-off-by: Arianna Avanzini <avanzini.aria...@gmail.com>
Signed-off-by: Bob Liu <bob@oracle.com>
---
 drivers/block/xen-blkback/common.h |3 +-
 drivers/block/xen-blkback/xenbus.c |  277 ++--
 2 files changed, 175 insertions(+), 105 deletions(-)

diff --git a/drivers/block/xen-blkback/common.h 
b/drivers/block/xen-blkback/common.h
index f4dfa5b..f2386e3 100644
--- a/drivers/block/xen-blkback/common.h
+++ b/drivers/block/xen-blkback/common.h
@@ -340,7 +340,8 @@ struct xen_blkif {
struct work_struct  free_work;
unsigned int nr_ring_pages;
/* All rings for this device. */
-   struct xen_blkif_ring ring;
+   struct xen_blkif_ring *rings;
+   unsigned int nr_rings;
 };
 
 struct seg_buf {
diff --git a/drivers/block/xen-blkback/xenbus.c 
b/drivers/block/xen-blkback/xenbus.c
index e4bfc92..6c6e048 100644
--- a/drivers/block/xen-blkback/xenbus.c
+++ b/drivers/block/xen-blkback/xenbus.c
@@ -86,9 +86,11 @@ static void xen_update_blkif_status(struct xen_blkif *blkif)
 {
int err;
char name[BLKBACK_NAME_LEN];
+   struct xen_blkif_ring *ring;
+   unsigned int i;
 
/* Not ready to connect? */
-   if (!blkif->ring.irq || !blkif->vbd.bdev)
+   if (!blkif->rings || !blkif->rings[0].irq || !blkif->vbd.bdev)
return;
 
/* Already connected? */
@@ -113,19 +115,55 @@ static void xen_update_blkif_status(struct xen_blkif 
*blkif)
}
invalidate_inode_pages2(blkif->vbd.bdev->bd_inode->i_mapping);
 
-   blkif->ring.xenblkd = kthread_run(xen_blkif_schedule, &blkif->ring, 
"%s", name);
-   if (IS_ERR(blkif->ring.xenblkd)) {
-   err = PTR_ERR(blkif->ring.xenblkd);
-   blkif->ring.xenblkd = NULL;
-   xenbus_dev_error(blkif->be->dev, err, "start xenblkd");
-   return;
+   for (i = 0; i < blkif->nr_rings; i++) {
+   ring = &blkif->rings[i];
+   ring->xenblkd = kthread_run(xen_blkif_schedule, ring, "%s-%d", 
name, i);
+   if (IS_ERR(ring->xenblkd)) {
+   err = PTR_ERR(ring->xenblkd);
+   ring->xenblkd = NULL;
+   xenbus_dev_fatal(blkif->be->dev, err,
+   "start %s-%d xenblkd", name, i);
+   goto out;
+   }
+   }
+   return;
+
+out:
+   while (--i >= 0) {
+   ring = &blkif->rings[i];
+   kthread_stop(ring->xenblkd);
}
+   return;
+}
+
+static int xen_blkif_alloc_rings(struct xen_blkif *blkif)
+{
+   unsigned int r;
+
+   blkif->rings = kzalloc(blkif->nr_rings * sizeof(struct xen_blkif_ring), 
GFP_KERNEL);
+   if (!blkif->rings)
+   return -ENOMEM;
+
+   for (r = 0; r < blkif->nr_rings; r++) {
+   struct xen_blkif_ring *ring = &blkif->rings[r];
+
+   spin_lock_init(&ring->blk_ring_lock);
+   init_waitqueue_head(&ring->wq);
+   INIT_LIST_HEAD(&ring->pending_free);
+
+   spin_lock_init(&ring->pending_free_lock);
+   init_waitqueue_head(&ring->pending_free_wq);
+   init_waitqueue_head(&ring->shutdown_wq);
+   ring->blkif = blkif;
+   xen_blkif_get(blkif);
+   }
+
+   return 0;
 }
 
 static struct xen_blkif *xen_blkif_alloc(domid_t domid)
 {
struct xen_blkif *blkif;
-   struct xen_blkif_ring *ring;
 
BUILD_BUG_ON(MAX_INDIRECT_PAGES > BLKIF_MAX_INDIRECT_PAGES_PER_REQUEST);
 
@@ -143,15 +181,11 @@ static struct xen_blkif *xen_blkif_alloc(domid_t domid)
blkif->st_print = jiffies;
INIT_WORK(&blkif->persistent_purge_work, xen_blkbk_unmap_purged_grants);
 
-   ring = &blkif->ring;
-   ring->blkif = blkif;
-   spin_lock_init(&ring->blk_ring_lock);
-   init_waitqueue_head(&ring->wq);
-
-   INIT_LIST_HEAD(&ring->pending_free);
-   spin_lock_init(&ring->pending_free_lock);
-   init_waitqueue_head(&ring->pending_free_wq);
-   init_waitqueue_head(&ring->shutdown_wq);
+   blkif->nr_rings = 1;
+   if (xen_blkif_alloc_rings(blkif)) {
+   kmem_cache_free(xen_blkif_cachep, blkif);
+   return ERR_PTR(-ENOMEM);
+   }
 
return blkif;
 }
@@ -216,50 +250,54 @@ static int xen_blkif_map(struct xen_blkif_ring *ring, 
grant_ref_t *gref,
 static int xen_blkif_disconnect(struct xen_blkif *blkif)
 {
struct pending_req *req, *n;
-   int i = 0, j;
-   struct xen_blkif_ring *ring = &blkif->ring;
+   uns

[Xen-devel] [PATCH v5 09/10] xen/blkfront: make persistent grants pool per-queue

2015-11-13 Thread Bob Liu
Make persistent grants per-queue/ring instead of per-device, so that we can
drop the 'dev_lock' and get better scalability.

Test was done based on null_blk driver:
dom0: v4.2-rc8 16vcpus 10GB "modprobe null_blk"
domu: v4.2-rc8 16vcpus 10GB

[test]
rw=read
direct=1
ioengine=libaio
bs=4k
time_based
runtime=30
filename=/dev/xvdb
numjobs=16
iodepth=64
iodepth_batch=64
iodepth_batch_complete=64
group_reporting

Queues:          1    4           8           16
Iops orig(k):    810  1064        780         700
Iops patched(k): 810  1230(~20%)  1024(~20%)  850(~20%)

Signed-off-by: Bob Liu <bob@oracle.com>
---
 drivers/block/xen-blkfront.c |  110 +-
 1 file changed, 43 insertions(+), 67 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 84496be..451f852 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -142,6 +142,8 @@ struct blkfront_ring_info {
struct gnttab_free_callback callback;
struct blk_shadow shadow[BLK_MAX_RING_SIZE];
struct list_head indirect_pages;
+   struct list_head grants;
+   unsigned int persistent_gnts_c;
unsigned long shadow_free;
struct blkfront_info *dev_info;
 };
@@ -162,13 +164,6 @@ struct blkfront_info
/* Number of pages per ring buffer. */
unsigned int nr_ring_pages;
struct request_queue *rq;
-   /*
-* Lock to protect info->grants list and persistent_gnts_c shared by all
-* rings.
-*/
-   spinlock_t dev_lock;
-   struct list_head grants;
-   unsigned int persistent_gnts_c;
unsigned int feature_flush;
unsigned int feature_discard:1;
unsigned int feature_secdiscard:1;
@@ -272,9 +267,7 @@ static int fill_grant_buffer(struct blkfront_ring_info 
*rinfo, int num)
}
 
gnt_list_entry->gref = GRANT_INVALID_REF;
-   spin_lock_irq(&info->dev_lock);
-   list_add(&gnt_list_entry->node, &info->grants);
-   spin_unlock_irq(&info->dev_lock);
+   list_add(&gnt_list_entry->node, &rinfo->grants);
i++;
}
 
@@ -282,10 +275,8 @@ static int fill_grant_buffer(struct blkfront_ring_info 
*rinfo, int num)
 
 out_of_memory:
list_for_each_entry_safe(gnt_list_entry, n,
-&info->grants, node) {
-   spin_lock_irq(&info->dev_lock);
+&rinfo->grants, node) {
list_del(&gnt_list_entry->node);
-   spin_unlock_irq(&info->dev_lock);
if (info->feature_persistent)
__free_page(gnt_list_entry->page);
kfree(gnt_list_entry);
@@ -295,20 +286,17 @@ out_of_memory:
return -ENOMEM;
 }
 
-static struct grant *get_free_grant(struct blkfront_info *info)
+static struct grant *get_free_grant(struct blkfront_ring_info *rinfo)
 {
struct grant *gnt_list_entry;
-   unsigned long flags;
 
-   spin_lock_irqsave(&info->dev_lock, flags);
-   BUG_ON(list_empty(&info->grants));
-   gnt_list_entry = list_first_entry(&info->grants, struct grant,
+   BUG_ON(list_empty(&rinfo->grants));
+   gnt_list_entry = list_first_entry(&rinfo->grants, struct grant,
  node);
list_del(&gnt_list_entry->node);
 
if (gnt_list_entry->gref != GRANT_INVALID_REF)
-   info->persistent_gnts_c--;
-   spin_unlock_irqrestore(&info->dev_lock, flags);
+   rinfo->persistent_gnts_c--;
 
return gnt_list_entry;
 }
@@ -324,9 +312,10 @@ static inline void grant_foreign_access(const struct grant 
*gnt_list_entry,
 
 static struct grant *get_grant(grant_ref_t *gref_head,
   unsigned long gfn,
-  struct blkfront_info *info)
+  struct blkfront_ring_info *rinfo)
 {
-   struct grant *gnt_list_entry = get_free_grant(info);
+   struct grant *gnt_list_entry = get_free_grant(rinfo);
+   struct blkfront_info *info = rinfo->dev_info;
 
if (gnt_list_entry->gref != GRANT_INVALID_REF)
return gnt_list_entry;
@@ -347,9 +336,10 @@ static struct grant *get_grant(grant_ref_t *gref_head,
 }
 
 static struct grant *get_indirect_grant(grant_ref_t *gref_head,
-   struct blkfront_info *info)
+   struct blkfront_ring_info *rinfo)
 {
-   struct grant *gnt_list_entry = get_free_grant(info);
+   struct grant *gnt_list_entry = get_free_grant(rinfo);
+   struct blkfront_info *info = rinfo->dev_info;
 
if (gnt_list_entry->gref != GRANT_INVALID_REF)
return gnt_list_entry;
@@ -361,8 +351,8 @@ static struct grant *get_indirect_grant(grant_ref_t 
*gref_head,
struct page *indirect_pa

[Xen-devel] [PATCH v5 02/10] xen/blkfront: separate per ring information out of device info

2015-11-13 Thread Bob Liu
Split the per-ring information into a new structure "blkfront_ring_info".

A ring is the representation of a hardware queue; every vbd device can be
associated with one or more rings depending on how many hardware queues/rings
are to be used.

This patch is a preparation for supporting real multi hardware queues/rings.

Signed-off-by: Arianna Avanzini <avanzini.aria...@gmail.com>
Signed-off-by: Bob Liu <bob@oracle.com>
---
v2: Fix build error.
---
 drivers/block/xen-blkfront.c |  359 +++---
 1 file changed, 197 insertions(+), 162 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 2fee2ee..0c3ad21 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -120,6 +120,23 @@ MODULE_PARM_DESC(max_ring_page_order, "Maximum order of 
pages to be used for the
 #define RINGREF_NAME_LEN (20)
 
 /*
+ *  Per-ring info.
+ *  Every blkfront device can associate with one or more blkfront_ring_info,
+ *  depending on how many hardware queues/rings to be used.
+ */
+struct blkfront_ring_info {
+   struct blkif_front_ring ring;
+   unsigned int ring_ref[XENBUS_MAX_RING_GRANTS];
+   unsigned int evtchn, irq;
+   struct work_struct work;
+   struct gnttab_free_callback callback;
+   struct blk_shadow shadow[BLK_MAX_RING_SIZE];
+   struct list_head indirect_pages;
+   unsigned long shadow_free;
+   struct blkfront_info *dev_info;
+};
+
+/*
  * We have one of these per vbd, whether ide, scsi or 'other'.  They
  * hang in private_data off the gendisk structure. We may end up
  * putting all kinds of interesting stuff here :-)
@@ -133,18 +150,10 @@ struct blkfront_info
int vdevice;
blkif_vdev_t handle;
enum blkif_state connected;
-   int ring_ref[XENBUS_MAX_RING_GRANTS];
unsigned int nr_ring_pages;
-   struct blkif_front_ring ring;
-   unsigned int evtchn, irq;
struct request_queue *rq;
-   struct work_struct work;
-   struct gnttab_free_callback callback;
-   struct blk_shadow shadow[BLK_MAX_RING_SIZE];
struct list_head grants;
-   struct list_head indirect_pages;
unsigned int persistent_gnts_c;
-   unsigned long shadow_free;
unsigned int feature_flush;
unsigned int feature_discard:1;
unsigned int feature_secdiscard:1;
@@ -155,6 +164,7 @@ struct blkfront_info
unsigned int max_indirect_segments;
int is_ready;
struct blk_mq_tag_set tag_set;
+   struct blkfront_ring_info rinfo;
 };
 
 static unsigned int nr_minors;
@@ -198,33 +208,35 @@ static DEFINE_SPINLOCK(minor_lock);
 
 #define GREFS(_psegs)  ((_psegs) * GRANTS_PER_PSEG)
 
-static int blkfront_setup_indirect(struct blkfront_info *info);
+static int blkfront_setup_indirect(struct blkfront_ring_info *rinfo);
 static int blkfront_gather_backend_features(struct blkfront_info *info);
 
-static int get_id_from_freelist(struct blkfront_info *info)
+static int get_id_from_freelist(struct blkfront_ring_info *rinfo)
 {
-   unsigned long free = info->shadow_free;
-   BUG_ON(free >= BLK_RING_SIZE(info));
-   info->shadow_free = info->shadow[free].req.u.rw.id;
-   info->shadow[free].req.u.rw.id = 0x0fee; /* debug */
+   unsigned long free = rinfo->shadow_free;
+
+   BUG_ON(free >= BLK_RING_SIZE(rinfo->dev_info));
+   rinfo->shadow_free = rinfo->shadow[free].req.u.rw.id;
+   rinfo->shadow[free].req.u.rw.id = 0x0fee; /* debug */
return free;
 }
 
-static int add_id_to_freelist(struct blkfront_info *info,
+static int add_id_to_freelist(struct blkfront_ring_info *rinfo,
   unsigned long id)
 {
-   if (info->shadow[id].req.u.rw.id != id)
+   if (rinfo->shadow[id].req.u.rw.id != id)
return -EINVAL;
-   if (info->shadow[id].request == NULL)
+   if (rinfo->shadow[id].request == NULL)
return -EINVAL;
-   info->shadow[id].req.u.rw.id  = info->shadow_free;
-   info->shadow[id].request = NULL;
-   info->shadow_free = id;
+   rinfo->shadow[id].req.u.rw.id  = rinfo->shadow_free;
+   rinfo->shadow[id].request = NULL;
+   rinfo->shadow_free = id;
return 0;
 }
 
-static int fill_grant_buffer(struct blkfront_info *info, int num)
+static int fill_grant_buffer(struct blkfront_ring_info *rinfo, int num)
 {
+   struct blkfront_info *info = rinfo->dev_info;
struct page *granted_page;
struct grant *gnt_list_entry, *n;
int i = 0;
@@ -326,8 +338,8 @@ static struct grant *get_indirect_grant(grant_ref_t 
*gref_head,
struct page *indirect_page;
 
/* Fetch a pre-allocated page to use for indirect grefs */
-   BUG_ON(list_empty(&info->indirect_pages));
-   indirect_page = list_first_entry(&info->indirect_pages,
+ 

[Xen-devel] [PATCH v5 08/10] xen/blkback: get the number of hardware queues/rings from blkfront

2015-11-13 Thread Bob Liu
The backend advertises "multi-queue-max-queues" to the frontend, and also gets
the negotiated number from "multi-queue-num-queues" written by blkfront.

Signed-off-by: Bob Liu <bob@oracle.com>
---
 drivers/block/xen-blkback/blkback.c |   12 
 drivers/block/xen-blkback/common.h  |1 +
 drivers/block/xen-blkback/xenbus.c  |   34 --
 3 files changed, 41 insertions(+), 6 deletions(-)

diff --git a/drivers/block/xen-blkback/blkback.c 
b/drivers/block/xen-blkback/blkback.c
index fb5bfd4..acedc46 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -84,6 +84,15 @@ MODULE_PARM_DESC(max_persistent_grants,
  "Maximum number of grants to map persistently");
 
 /*
+ * Maximum number of rings/queues blkback supports, allow as many queues as 
there
+ * are CPUs if user has not specified a value.
+ */
+unsigned int xenblk_max_queues;
+module_param_named(max_queues, xenblk_max_queues, uint, 0644);
+MODULE_PARM_DESC(max_queues,
+"Maximum number of hardware queues per virtual disk");
+
+/*
  * Maximum order of pages to be used for the shared ring between front and
  * backend, 4KB page granularity is used.
  */
@@ -1483,6 +1492,9 @@ static int __init xen_blkif_init(void)
xen_blkif_max_ring_order = XENBUS_MAX_RING_GRANT_ORDER;
}
 
+   if (xenblk_max_queues == 0)
+   xenblk_max_queues = num_online_cpus();
+
rc = xen_blkif_interface_init();
if (rc)
goto failed_init;
diff --git a/drivers/block/xen-blkback/common.h 
b/drivers/block/xen-blkback/common.h
index f2386e3..0833dc6 100644
--- a/drivers/block/xen-blkback/common.h
+++ b/drivers/block/xen-blkback/common.h
@@ -46,6 +46,7 @@
 #include 
 
 extern unsigned int xen_blkif_max_ring_order;
+extern unsigned int xenblk_max_queues;
 /*
  * This is the maximum number of segments that would be allowed in indirect
  * requests. This value will also be passed to the frontend.
diff --git a/drivers/block/xen-blkback/xenbus.c 
b/drivers/block/xen-blkback/xenbus.c
index 6c6e048..d83b790 100644
--- a/drivers/block/xen-blkback/xenbus.c
+++ b/drivers/block/xen-blkback/xenbus.c
@@ -181,12 +181,6 @@ static struct xen_blkif *xen_blkif_alloc(domid_t domid)
blkif->st_print = jiffies;
INIT_WORK(&blkif->persistent_purge_work, xen_blkbk_unmap_purged_grants);
 
-   blkif->nr_rings = 1;
-   if (xen_blkif_alloc_rings(blkif)) {
-   kmem_cache_free(xen_blkif_cachep, blkif);
-   return ERR_PTR(-ENOMEM);
-   }
-
return blkif;
 }
 
@@ -595,6 +589,12 @@ static int xen_blkbk_probe(struct xenbus_device *dev,
goto fail;
}
 
+   /* Multi-queue: write how many queues are supported by the backend. */
+   err = xenbus_printf(XBT_NIL, dev->nodename,
+   "multi-queue-max-queues", "%u", xenblk_max_queues);
+   if (err)
+   pr_warn("Error writing multi-queue-num-queues\n");
+
/* setup back pointer */
be->blkif->be = be;
 
@@ -980,6 +980,7 @@ static int connect_ring(struct backend_info *be)
char *xspath;
size_t xspathsize;
const size_t xenstore_path_ext_size = 11; /* sufficient for 
"/queue-NNN" */
+   unsigned int requested_num_queues = 0;
 
pr_debug("%s %s\n", __func__, dev->otherend);
 
@@ -1007,6 +1008,27 @@ static int connect_ring(struct backend_info *be)
be->blkif->vbd.feature_gnt_persistent = pers_grants;
be->blkif->vbd.overflow_max_grants = 0;
 
+   /*
+* Read the number of hardware queues from frontend.
+*/
+   err = xenbus_scanf(XBT_NIL, dev->otherend, "multi-queue-num-queues",
+  "%u", _num_queues);
+   if (err < 0) {
+   requested_num_queues = 1;
+   } else {
+   if (requested_num_queues > xenblk_max_queues
+   || requested_num_queues == 0) {
+   /* buggy or malicious guest */
+   xenbus_dev_fatal(dev, err,
+   "guest requested %u queues, exceeding 
the maximum of %u.",
+   requested_num_queues, 
xenblk_max_queues);
+   return -1;
+   }
+   }
+   be->blkif->nr_rings = requested_num_queues;
+   if (xen_blkif_alloc_rings(be->blkif))
+   return -ENOMEM;
+
pr_info("%s: using %d queues, protocol %d (%s) %s\n", dev->nodename,
 be->blkif->nr_rings, be->blkif->blk_protocol, protocol,
 pers_grants ? "persistent grants" : "");
-- 
1.7.10.4


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] [PATCH v5 03/10] xen/blkfront: pseudo support for multi hardware queues/rings

2015-11-13 Thread Bob Liu
Preparatory patch for multiple hardware queues (rings). The number of
rings is unconditionally set to 1; a larger number will be enabled in the next
patch ("xen/blkfront: negotiate number of queues/rings to be used with backend")
so as to make every single patch small and readable.

Signed-off-by: Bob Liu <bob@oracle.com>
---
v2:
 * Fix memleak.
 * Other comments from Konrad.
---
 drivers/block/xen-blkfront.c |  341 --
 1 file changed, 195 insertions(+), 146 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 0c3ad21..d73734f 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -150,6 +150,7 @@ struct blkfront_info
int vdevice;
blkif_vdev_t handle;
enum blkif_state connected;
+   /* Number of pages per ring buffer. */
unsigned int nr_ring_pages;
struct request_queue *rq;
struct list_head grants;
@@ -164,7 +165,8 @@ struct blkfront_info
unsigned int max_indirect_segments;
int is_ready;
struct blk_mq_tag_set tag_set;
-   struct blkfront_ring_info rinfo;
+   struct blkfront_ring_info *rinfo;
+   unsigned int nr_rings;
 };
 
 static unsigned int nr_minors;
@@ -209,7 +211,7 @@ static DEFINE_SPINLOCK(minor_lock);
 #define GREFS(_psegs)  ((_psegs) * GRANTS_PER_PSEG)
 
 static int blkfront_setup_indirect(struct blkfront_ring_info *rinfo);
-static int blkfront_gather_backend_features(struct blkfront_info *info);
+static void blkfront_gather_backend_features(struct blkfront_info *info);
 
 static int get_id_from_freelist(struct blkfront_ring_info *rinfo)
 {
@@ -338,8 +340,8 @@ static struct grant *get_indirect_grant(grant_ref_t 
*gref_head,
struct page *indirect_page;
 
/* Fetch a pre-allocated page to use for indirect grefs */
-   BUG_ON(list_empty(&info->rinfo.indirect_pages));
-   indirect_page = list_first_entry(&info->rinfo.indirect_pages,
+   BUG_ON(list_empty(&info->rinfo->indirect_pages));
+   indirect_page = list_first_entry(&info->rinfo->indirect_pages,
 struct page, lru);
list_del(&indirect_page->lru);
gnt_list_entry->page = indirect_page;
@@ -597,7 +599,6 @@ static int blkif_queue_rw_req(struct request *req, struct 
blkfront_ring_info *ri
 * existing persistent grants, or if we have to get new grants,
 * as there are not sufficiently many free.
 */
-   bool new_persistent_gnts;
struct scatterlist *sg;
int num_sg, max_grefs, num_grant;
 
@@ -609,12 +610,12 @@ static int blkif_queue_rw_req(struct request *req, struct 
blkfront_ring_info *ri
 */
max_grefs += INDIRECT_GREFS(max_grefs);
 
-   /* Check if we have enough grants to allocate a requests */
-   if (info->persistent_gnts_c < max_grefs) {
-   new_persistent_gnts = 1;
-   if (gnttab_alloc_grant_references(
-   max_grefs - info->persistent_gnts_c,
-   &setup.gref_head) < 0) {
+   /*
+* We have to reserve 'max_grefs' grants at first because persistent
+* grants are shared by all rings.
+*/
+   if (max_grefs > 0)
+   if (gnttab_alloc_grant_references(max_grefs, &setup.gref_head) 
< 0) {
gnttab_request_free_callback(
&rinfo->callback,
blkif_restart_queue_callback,
@@ -622,8 +623,6 @@ static int blkif_queue_rw_req(struct request *req, struct 
blkfront_ring_info *ri
max_grefs);
return 1;
}
-   } else
-   new_persistent_gnts = 0;
 
/* Fill out a communications ring structure. */
ring_req = RING_GET_REQUEST(&rinfo->ring, rinfo->ring.req_prod_pvt);
@@ -712,7 +711,7 @@ static int blkif_queue_rw_req(struct request *req, struct 
blkfront_ring_info *ri
/* Keep a private copy so we can reissue requests when recovering. */
rinfo->shadow[id].req = *ring_req;
 
-   if (new_persistent_gnts)
+   if (max_grefs > 0)
gnttab_free_grant_references(setup.gref_head);
 
return 0;
@@ -791,7 +790,8 @@ static int blk_mq_init_hctx(struct blk_mq_hw_ctx *hctx, 
void *data,
 {
struct blkfront_info *info = (struct blkfront_info *)data;
 
-   hctx->driver_data = &info->rinfo;
+   BUG_ON(info->nr_rings <= index);
+   hctx->driver_data = &info->rinfo[index];
return 0;
 }
 
@@ -1050,8 +1050,7 @@ static int xlvbd_alloc_gendisk(blkif_sector_t capacity,
 
 static void xlvbd_release_gendisk(struct blkfront_info *info)
 {
-   unsigned int minor, nr_minors;
-   struct blkfront_ring_info *rinfo = &info->rinfo;
+   unsigned int minor, nr_minors, i;
 
if (info->rq == NULL)
 

[Xen-devel] [PATCH v5 04/10] xen/blkfront: split per device io_lock

2015-11-13 Thread Bob Liu
After commit "xen/blkfront: separate per ring information out of device
info", per-ring data is protected by a per-device lock('io_lock').

This is not a good way and will effect the scalability, so introduces a
per-ring lock('ring_lock').

The old 'io_lock' is renamed to 'dev_lock' which protects the ->grants list and
persistent_gnts_c shared by all rings.

Signed-off-by: Bob Liu <bob@oracle.com>
---
v2:
 * Introduce kick_pending_request_queues_locked().
 * Add comment for 'ring_lock'.
 * Move locks to more suitable place.
---
 drivers/block/xen-blkfront.c |   73 +++---
 1 file changed, 47 insertions(+), 26 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index d73734f..56c9ec6 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -125,6 +125,8 @@ MODULE_PARM_DESC(max_ring_page_order, "Maximum order of 
pages to be used for the
  *  depending on how many hardware queues/rings to be used.
  */
 struct blkfront_ring_info {
+   /* Lock to protect data in every ring buffer. */
+   spinlock_t ring_lock;
struct blkif_front_ring ring;
unsigned int ring_ref[XENBUS_MAX_RING_GRANTS];
unsigned int evtchn, irq;
@@ -143,7 +145,6 @@ struct blkfront_ring_info {
  */
 struct blkfront_info
 {
-   spinlock_t io_lock;
struct mutex mutex;
struct xenbus_device *xbdev;
struct gendisk *gd;
@@ -153,6 +154,11 @@ struct blkfront_info
/* Number of pages per ring buffer. */
unsigned int nr_ring_pages;
struct request_queue *rq;
+   /*
+* Lock to protect info->grants list and persistent_gnts_c shared by all
+* rings.
+*/
+   spinlock_t dev_lock;
struct list_head grants;
unsigned int persistent_gnts_c;
unsigned int feature_flush;
@@ -258,7 +264,9 @@ static int fill_grant_buffer(struct blkfront_ring_info 
*rinfo, int num)
}
 
gnt_list_entry->gref = GRANT_INVALID_REF;
+   spin_lock_irq(&info->dev_lock);
list_add(&gnt_list_entry->node, &info->grants);
+   spin_unlock_irq(&info->dev_lock);
i++;
}
 
@@ -267,7 +275,9 @@ static int fill_grant_buffer(struct blkfront_ring_info 
*rinfo, int num)
 out_of_memory:
list_for_each_entry_safe(gnt_list_entry, n,
 &info->grants, node) {
+   spin_lock_irq(&info->dev_lock);
list_del(&gnt_list_entry->node);
+   spin_unlock_irq(&info->dev_lock);
if (info->feature_persistent)
__free_page(gnt_list_entry->page);
kfree(gnt_list_entry);
@@ -280,7 +290,9 @@ out_of_memory:
 static struct grant *get_free_grant(struct blkfront_info *info)
 {
struct grant *gnt_list_entry;
+   unsigned long flags;
 
+   spin_lock_irqsave(&info->dev_lock, flags);
BUG_ON(list_empty(&info->grants));
gnt_list_entry = list_first_entry(&info->grants, struct grant,
  node);
@@ -288,6 +300,7 @@ static struct grant *get_free_grant(struct blkfront_info 
*info)
 
if (gnt_list_entry->gref != GRANT_INVALID_REF)
info->persistent_gnts_c--;
+   spin_unlock_irqrestore(&info->dev_lock, flags);
 
return gnt_list_entry;
 }
@@ -757,11 +770,11 @@ static inline bool blkif_request_flush_invalid(struct 
request *req,
 static int blkif_queue_rq(struct blk_mq_hw_ctx *hctx,
   const struct blk_mq_queue_data *qd)
 {
+   unsigned long flags;
struct blkfront_ring_info *rinfo = (struct blkfront_ring_info 
*)hctx->driver_data;
-   struct blkfront_info *info = rinfo->dev_info;
 
blk_mq_start_request(qd->rq);
-   spin_lock_irq(&info->io_lock);
+   spin_lock_irqsave(&rinfo->ring_lock, flags);
if (RING_FULL(&rinfo->ring))
goto out_busy;
 
@@ -772,15 +785,15 @@ static int blkif_queue_rq(struct blk_mq_hw_ctx *hctx,
goto out_busy;
 
flush_requests(rinfo);
-   spin_unlock_irq(&info->io_lock);
+   spin_unlock_irqrestore(&rinfo->ring_lock, flags);
return BLK_MQ_RQ_QUEUE_OK;
 
 out_err:
-   spin_unlock_irq(&info->io_lock);
+   spin_unlock_irqrestore(&rinfo->ring_lock, flags);
return BLK_MQ_RQ_QUEUE_ERROR;
 
 out_busy:
-   spin_unlock_irq(&info->io_lock);
+   spin_unlock_irqrestore(&rinfo->ring_lock, flags);
blk_mq_stop_hw_queue(hctx);
return BLK_MQ_RQ_QUEUE_BUSY;
 }
@@ -1082,21 +1095,28 @@ static void xlvbd_release_gendisk(struct blkfront_info 
*info)
info->gd = NULL;
 }
 
-/* Must be called with io_lock holded */
-static void kick_pending_request_queues(struct blkfront_ring_info *rinfo)
+/* Already hold rinfo->ring_lock. */
+static inline void kick_pending_request_queues_locked(struct 
blkfront_ring_info *rinfo)
 {
if (!RING_FULL(&rinfo->ring

[Xen-devel] [PATCH v5 00/10] xen-block: multi hardware-queues/rings support

2015-11-13 Thread Bob Liu
Note: These patches were based on original work of Arianna's internship for
GNOME's Outreach Program for Women.

After switching to the blk-mq API, a guest has more than one (nr_vcpus)
software request queue associated with each block front. These queues can be
mapped over several rings (hardware queues) to the backend, making it very easy
for us to run multiple threads on the backend for a single virtual disk.

By having different threads issuing requests at the same time, guest
performance can be improved significantly.
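
On the frontend side, the key hook-up is simply telling blk-mq how many
hardware queues exist (snippet from patch 05 below):

	info->tag_set.nr_hw_queues = info->nr_rings;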

Test was done based on null_blk driver:
dom0: v4.3-rc7 16vcpus 10GB "modprobe null_blk"
domU: v4.3-rc7 16vcpus 10GB

[test]
rw=read
direct=1
ioengine=libaio
bs=4k
time_based
runtime=30
filename=/dev/xvdb
numjobs=16
iodepth=64
iodepth_batch=64
iodepth_batch_complete=64
group_reporting

Results:
iops1: After commit("xen/blkfront: make persistent grants per-queue").
iops2: After commit("xen/blkback: make persistent grants and free pages pool 
per-queue").

Queues:       1    4           8           16
Iops orig(k): 810  1064        780         700
Iops1(k):     810  1230(~20%)  1024(~20%)  850(~20%)
Iops2(k):     810  1410(~35%)  1354(~75%)  1440(~100%)

With 4 queues, after this series we can get a ~75% increase in IOPS, and
performance won't drop when increasing the number of queues.

Please find the respective chart in this link:
https://www.dropbox.com/s/agrcy2pbzbsvmwv/iops.png?dl=0

---
v5:
 * Rebase to xen/tip.git tags/for-linus-4.4-rc0-tag.
 * Comments from Konrad.

v4:
 * Rebase to v4.3-rc7.
 * Comments from Roger.

v3:
 * Rebased to v4.2-rc8.

Bob Liu (10):
  xen/blkif: document blkif multi-queue/ring extension
  xen/blkfront: separate per ring information out of device info
  xen/blkfront: pseudo support for multi hardware queues/rings
  xen/blkfront: split per device io_lock
  xen/blkfront: negotiate number of queues/rings to be used with backend
  xen/blkback: separate ring information out of struct xen_blkif
  xen/blkback: pseudo support for multi hardware queues/rings
  xen/blkback: get the number of hardware queues/rings from blkfront
  xen/blkfront: make persistent grants per-queue
  xen/blkback: make pool of persistent grants and free pages per-queue

 drivers/block/xen-blkback/blkback.c | 386 ++-
 drivers/block/xen-blkback/common.h  |  78 ++--
 drivers/block/xen-blkback/xenbus.c  | 359 --
 drivers/block/xen-blkfront.c| 718 ++--
 include/xen/interface/io/blkif.h|  48 +++
 5 files changed, 971 insertions(+), 618 deletions(-)

-- 
1.8.3.1


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] [PATCH v5 01/10] xen/blkif: document blkif multi-queue/ring extension

2015-11-13 Thread Bob Liu
Document the multi-queue/ring feature in terms of XenStore keys to be written by
the backend and by the frontend.

Signed-off-by: Bob Liu <bob@oracle.com>
---
v2:
Add descriptions together with multi-page ring buffer.
---
 include/xen/interface/io/blkif.h |   48 ++
 1 file changed, 48 insertions(+)

diff --git a/include/xen/interface/io/blkif.h b/include/xen/interface/io/blkif.h
index c33e1c4..8b8cfad 100644
--- a/include/xen/interface/io/blkif.h
+++ b/include/xen/interface/io/blkif.h
@@ -28,6 +28,54 @@ typedef uint16_t blkif_vdev_t;
 typedef uint64_t blkif_sector_t;
 
 /*
+ * Multiple hardware queues/rings:
+ * If supported, the backend will write the key "multi-queue-max-queues" to
+ * the directory for that vbd, and set its value to the maximum supported
+ * number of queues.
+ * Frontends that are aware of this feature and wish to use it can write the
+ * key "multi-queue-num-queues" with the number they wish to use, which must be
+ * greater than zero, and no more than the value reported by the backend in
+ * "multi-queue-max-queues".
+ *
+ * For frontends requesting just one queue, the usual event-channel and
+ * ring-ref keys are written as before, simplifying the backend processing
+ * to avoid distinguishing between a frontend that doesn't understand the
+ * multi-queue feature, and one that does, but requested only one queue.
+ *
+ * Frontends requesting two or more queues must not write the toplevel
+ * event-channel and ring-ref keys, instead writing those keys under sub-keys
+ * having the name "queue-N" where N is the integer ID of the queue/ring for
+ * which those keys belong. Queues are indexed from zero.
+ * For example, a frontend with two queues must write the following set of
+ * queue-related keys:
+ *
+ * /local/domain/1/device/vbd/0/multi-queue-num-queues = "2"
+ * /local/domain/1/device/vbd/0/queue-0 = ""
+ * /local/domain/1/device/vbd/0/queue-0/ring-ref = "<ring-ref#0>"
+ * /local/domain/1/device/vbd/0/queue-0/event-channel = "<evtchn#0>"
+ * /local/domain/1/device/vbd/0/queue-1 = ""
+ * /local/domain/1/device/vbd/0/queue-1/ring-ref = "<ring-ref#1>"
+ * /local/domain/1/device/vbd/0/queue-1/event-channel = "<evtchn#1>"
+ *
+ * It is also possible to use multiple queues/rings together with the
+ * multi-page ring buffer feature.
+ * For example, a frontend that requests two queues/rings, where each ring
+ * buffer is two pages, must write the following set of related keys:
+ *
+ * /local/domain/1/device/vbd/0/multi-queue-num-queues = "2"
+ * /local/domain/1/device/vbd/0/ring-page-order = "1"
+ * /local/domain/1/device/vbd/0/queue-0 = ""
+ * /local/domain/1/device/vbd/0/queue-0/ring-ref0 = "<ring-ref#0>"
+ * /local/domain/1/device/vbd/0/queue-0/ring-ref1 = "<ring-ref#1>"
+ * /local/domain/1/device/vbd/0/queue-0/event-channel = "<evtchn#0>"
+ * /local/domain/1/device/vbd/0/queue-1 = ""
+ * /local/domain/1/device/vbd/0/queue-1/ring-ref0 = "<ring-ref#2>"
+ * /local/domain/1/device/vbd/0/queue-1/ring-ref1 = "<ring-ref#3>"
+ * /local/domain/1/device/vbd/0/queue-1/event-channel = "<evtchn#1>"
+ *
+ */
+
+/*
  * REQUEST CODES.
  */
 #define BLKIF_OP_READ  0
-- 
1.7.10.4


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] [PATCH v5 05/10] xen/blkfront: negotiate number of queues/rings to be used with backend

2015-11-13 Thread Bob Liu
The max number of hardware queues for xen/blkfront is set by parameter
'max_queues'(default 4), while it is also capped by the max value that the
xen/blkback exposes through XenStore key 'multi-queue-max-queues'.

The negotiated number is the smaller one and would be written back to xenstore
as "multi-queue-num-queues", blkback needs to read this negotiated number.

Signed-off-by: Bob Liu <bob@oracle.com>
---
v2:
 * Make 'i' be an unsigned int.
 * Other comments from Konrad.
---
 drivers/block/xen-blkfront.c |  160 +++---
 1 file changed, 119 insertions(+), 41 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 56c9ec6..84496be 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -99,6 +99,10 @@ static unsigned int xen_blkif_max_segments = 32;
 module_param_named(max, xen_blkif_max_segments, int, S_IRUGO);
 MODULE_PARM_DESC(max, "Maximum amount of segments in indirect requests 
(default is 32)");
 
+static unsigned int xen_blkif_max_queues = 4;
+module_param_named(max_queues, xen_blkif_max_queues, uint, S_IRUGO);
+MODULE_PARM_DESC(max_queues, "Maximum number of hardware queues/rings used per 
virtual disk");
+
 /*
  * Maximum order of pages to be used for the shared ring between front and
  * backend, 4KB page granularity is used.
@@ -118,6 +122,10 @@ MODULE_PARM_DESC(max_ring_page_order, "Maximum order of 
pages to be used for the
  * characters are enough. Define to 20 to keep consist with backend.
  */
 #define RINGREF_NAME_LEN (20)
+/*
+ * queue-%u would take 7 + 10(UINT_MAX) = 17 characters
+ */
+#define QUEUE_NAME_LEN (17)
 
 /*
  *  Per-ring info.
@@ -823,7 +831,7 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 
sector_size,
 
memset(>tag_set, 0, sizeof(info->tag_set));
info->tag_set.ops = _mq_ops;
-   info->tag_set.nr_hw_queues = 1;
+   info->tag_set.nr_hw_queues = info->nr_rings;
info->tag_set.queue_depth =  BLK_RING_SIZE(info);
info->tag_set.numa_node = NUMA_NO_NODE;
info->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
@@ -1520,6 +1528,53 @@ fail:
return err;
 }
 
+/*
+ * Write out per-ring/queue nodes including ring-ref and event-channel, and 
each
+ * ring buffer may have multi pages depending on ->nr_ring_pages.
+ */
+static int write_per_ring_nodes(struct xenbus_transaction xbt,
+   struct blkfront_ring_info *rinfo, const char 
*dir)
+{
+   int err;
+   unsigned int i;
+   const char *message = NULL;
+   struct blkfront_info *info = rinfo->dev_info;
+
+   if (info->nr_ring_pages == 1) {
+   err = xenbus_printf(xbt, dir, "ring-ref", "%u", 
rinfo->ring_ref[0]);
+   if (err) {
+   message = "writing ring-ref";
+   goto abort_transaction;
+   }
+   } else {
+   for (i = 0; i < info->nr_ring_pages; i++) {
+   char ring_ref_name[RINGREF_NAME_LEN];
+
+   snprintf(ring_ref_name, RINGREF_NAME_LEN, "ring-ref%u", 
i);
+   err = xenbus_printf(xbt, dir, ring_ref_name,
+   "%u", rinfo->ring_ref[i]);
+   if (err) {
+   message = "writing ring-ref";
+   goto abort_transaction;
+   }
+   }
+   }
+
+   err = xenbus_printf(xbt, dir, "event-channel", "%u", rinfo->evtchn);
+   if (err) {
+   message = "writing event-channel";
+   goto abort_transaction;
+   }
+
+   return 0;
+
+abort_transaction:
+   xenbus_transaction_end(xbt, 1);
+   if (message)
+   xenbus_dev_fatal(info->xbdev, err, "%s", message);
+
+   return err;
+}
 
 /* Common code used when first setting up, and when resuming. */
 static int talk_to_blkback(struct xenbus_device *dev,
@@ -1527,10 +1582,9 @@ static int talk_to_blkback(struct xenbus_device *dev,
 {
const char *message = NULL;
struct xenbus_transaction xbt;
-   int err, i;
-   unsigned int max_page_order = 0;
+   int err;
+   unsigned int i, max_page_order = 0;
unsigned int ring_page_order = 0;
-   struct blkfront_ring_info *rinfo;
 
err = xenbus_scanf(XBT_NIL, info->xbdev->otherend,
   "max-ring-page-order", "%u", _page_order);
@@ -1542,7 +1596,8 @@ static int talk_to_blkback(struct xenbus_device *dev,
}
 
for (i = 0; i < info->nr_rings; i++) {
-   rinfo = &info->rinfo[i];
+   struct blkfront_ring_info *rinfo = &info->rinfo[i];
+

[Xen-devel] [PATCH v5 10/10] xen/blkback: make pool of persistent grants and free pages per-queue

2015-11-13 Thread Bob Liu
Make pool of persistent grants and free pages per-queue/ring instead of
per-device to get better scalability.

Test was done based on null_blk driver:
dom0: v4.2-rc8 16vcpus 10GB "modprobe null_blk"
domu: v4.2-rc8 16vcpus 10GB

[test]
rw=read
direct=1
ioengine=libaio
bs=4k
time_based
runtime=30
filename=/dev/xvdb
numjobs=16
iodepth=64
iodepth_batch=64
iodepth_batch_complete=64
group_reporting

Results:
iops1: After commit("xen/blkfront: make persistent grants per-queue").
iops2: After this commit.

Queues:       1    4           8           16
Iops orig(k): 810  1064        780         700
Iops1(k):     810  1230(~20%)  1024(~20%)  850(~20%)
Iops2(k):     810  1410(~35%)  1354(~75%)  1440(~100%)

With 4 queues, after this commit we can get a ~75% increase in IOPS, and
performance won't drop when increasing the number of queues.

Please find the respective chart in this link:
https://www.dropbox.com/s/agrcy2pbzbsvmwv/iops.png?dl=0

Signed-off-by: Bob Liu <bob@oracle.com>
---
 drivers/block/xen-blkback/blkback.c |  202 ---
 drivers/block/xen-blkback/common.h  |   32 +++---
 drivers/block/xen-blkback/xenbus.c  |   21 ++--
 3 files changed, 118 insertions(+), 137 deletions(-)

diff --git a/drivers/block/xen-blkback/blkback.c 
b/drivers/block/xen-blkback/blkback.c
index acedc46..0e8a04d 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -122,60 +122,60 @@ module_param(log_stats, int, 0644);
 /* Number of free pages to remove on each call to gnttab_free_pages */
 #define NUM_BATCH_FREE_PAGES 10
 
-static inline int get_free_page(struct xen_blkif *blkif, struct page **page)
+static inline int get_free_page(struct xen_blkif_ring *ring, struct page 
**page)
 {
unsigned long flags;
 
-   spin_lock_irqsave(&blkif->free_pages_lock, flags);
-   if (list_empty(&blkif->free_pages)) {
-   BUG_ON(blkif->free_pages_num != 0);
-   spin_unlock_irqrestore(&blkif->free_pages_lock, flags);
+   spin_lock_irqsave(&ring->free_pages_lock, flags);
+   if (list_empty(&ring->free_pages)) {
+   BUG_ON(ring->free_pages_num != 0);
+   spin_unlock_irqrestore(&ring->free_pages_lock, flags);
return gnttab_alloc_pages(1, page);
}
-   BUG_ON(blkif->free_pages_num == 0);
-   page[0] = list_first_entry(&blkif->free_pages, struct page, lru);
+   BUG_ON(ring->free_pages_num == 0);
+   page[0] = list_first_entry(&ring->free_pages, struct page, lru);
list_del(&page[0]->lru);
-   blkif->free_pages_num--;
-   spin_unlock_irqrestore(&blkif->free_pages_lock, flags);
+   ring->free_pages_num--;
+   spin_unlock_irqrestore(&ring->free_pages_lock, flags);
 
return 0;
 }
 
-static inline void put_free_pages(struct xen_blkif *blkif, struct page **page,
+static inline void put_free_pages(struct xen_blkif_ring *ring, struct page 
**page,
   int num)
 {
unsigned long flags;
int i;
 
-   spin_lock_irqsave(>free_pages_lock, flags);
+   spin_lock_irqsave(>free_pages_lock, flags);
for (i = 0; i < num; i++)
-   list_add([i]->lru, >free_pages);
-   blkif->free_pages_num += num;
-   spin_unlock_irqrestore(>free_pages_lock, flags);
+   list_add([i]->lru, >free_pages);
+   ring->free_pages_num += num;
+   spin_unlock_irqrestore(>free_pages_lock, flags);
 }
 
-static inline void shrink_free_pagepool(struct xen_blkif *blkif, int num)
+static inline void shrink_free_pagepool(struct xen_blkif_ring *ring, int num)
 {
/* Remove requested pages in batches of NUM_BATCH_FREE_PAGES */
struct page *page[NUM_BATCH_FREE_PAGES];
unsigned int num_pages = 0;
unsigned long flags;
 
-   spin_lock_irqsave(>free_pages_lock, flags);
-   while (blkif->free_pages_num > num) {
-   BUG_ON(list_empty(>free_pages));
-   page[num_pages] = list_first_entry(>free_pages,
+   spin_lock_irqsave(>free_pages_lock, flags);
+   while (ring->free_pages_num > num) {
+   BUG_ON(list_empty(>free_pages));
+   page[num_pages] = list_first_entry(>free_pages,
   struct page, lru);
list_del([num_pages]->lru);
-   blkif->free_pages_num--;
+   ring->free_pages_num--;
if (++num_pages == NUM_BATCH_FREE_PAGES) {
-   spin_unlock_irqrestore(>free_pages_lock, flags);
+   spin_unlock_irqrestore(>free_pages_lock, flags);
gnttab_free_pages(num_pages, page);
-   spin_lock_irqsave(>free_pages_lock, flags);
+   spin_lock_irqsave(>free_pages_lock, f

[Xen-devel] [PATCH v5 06/10] xen/blkback: separate ring information out of struct xen_blkif

2015-11-13 Thread Bob Liu
Split per ring information to an new structure "xen_blkif_ring", so that one vbd
device can be associated with one or more rings/hardware queues.

Introduce 'pers_gnts_lock' to protect the pool of persistent grants since we
may have multi backend threads.

This patch is a preparation for supporting multi hardware queues/rings.

Signed-off-by: Arianna Avanzini <avanzini.aria...@gmail.com>
Signed-off-by: Bob Liu <bob@oracle.com>
---
v2:
 * Have an BUG_ON on the holding of the pers_gnts_lock.
---
 drivers/block/xen-blkback/blkback.c |  235 ---
 drivers/block/xen-blkback/common.h  |   54 
 drivers/block/xen-blkback/xenbus.c  |   96 +++---
 3 files changed, 214 insertions(+), 171 deletions(-)

diff --git a/drivers/block/xen-blkback/blkback.c 
b/drivers/block/xen-blkback/blkback.c
index f909994..fb5bfd4 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -173,11 +173,11 @@ static inline void shrink_free_pagepool(struct xen_blkif 
*blkif, int num)
 
 #define vaddr(page) ((unsigned long)pfn_to_kaddr(page_to_pfn(page)))
 
-static int do_block_io_op(struct xen_blkif *blkif);
-static int dispatch_rw_block_io(struct xen_blkif *blkif,
+static int do_block_io_op(struct xen_blkif_ring *ring);
+static int dispatch_rw_block_io(struct xen_blkif_ring *ring,
struct blkif_request *req,
struct pending_req *pending_req);
-static void make_response(struct xen_blkif *blkif, u64 id,
+static void make_response(struct xen_blkif_ring *ring, u64 id,
  unsigned short op, int st);
 
 #define foreach_grant_safe(pos, n, rbtree, node) \
@@ -189,14 +189,8 @@ static void make_response(struct xen_blkif *blkif, u64 id,
 
 
 /*
- * We don't need locking around the persistent grant helpers
- * because blkback uses a single-thread for each backed, so we
- * can be sure that this functions will never be called recursively.
- *
- * The only exception to that is put_persistent_grant, that can be called
- * from interrupt context (by xen_blkbk_unmap), so we have to use atomic
- * bit operations to modify the flags of a persistent grant and to count
- * the number of used grants.
+ * pers_gnts_lock must be used around all the persistent grant helpers
+ * because blkback may use multi-thread/queue for each backend.
  */
 static int add_persistent_gnt(struct xen_blkif *blkif,
   struct persistent_gnt *persistent_gnt)
@@ -204,6 +198,7 @@ static int add_persistent_gnt(struct xen_blkif *blkif,
struct rb_node **new = NULL, *parent = NULL;
struct persistent_gnt *this;
 
+   BUG_ON(!spin_is_locked(>pers_gnts_lock));
if (blkif->persistent_gnt_c >= xen_blkif_max_pgrants) {
if (!blkif->vbd.overflow_max_grants)
blkif->vbd.overflow_max_grants = 1;
@@ -241,6 +236,7 @@ static struct persistent_gnt *get_persistent_gnt(struct 
xen_blkif *blkif,
struct persistent_gnt *data;
struct rb_node *node = NULL;
 
+   BUG_ON(!spin_is_locked(>pers_gnts_lock));
node = blkif->persistent_gnts.rb_node;
while (node) {
data = container_of(node, struct persistent_gnt, node);
@@ -265,6 +261,7 @@ static struct persistent_gnt *get_persistent_gnt(struct 
xen_blkif *blkif,
 static void put_persistent_gnt(struct xen_blkif *blkif,
struct persistent_gnt *persistent_gnt)
 {
+   BUG_ON(!spin_is_locked(>pers_gnts_lock));
if(!test_bit(PERSISTENT_GNT_ACTIVE, persistent_gnt->flags))
pr_alert_ratelimited("freeing a grant already unused\n");
set_bit(PERSISTENT_GNT_WAS_ACTIVE, persistent_gnt->flags);
@@ -286,6 +283,7 @@ static void free_persistent_gnts(struct xen_blkif *blkif, 
struct rb_root *root,
unmap_data.unmap_ops = unmap;
unmap_data.kunmap_ops = NULL;
 
+   BUG_ON(!spin_is_locked(>pers_gnts_lock));
foreach_grant_safe(persistent_gnt, n, root, node) {
BUG_ON(persistent_gnt->handle ==
BLKBACK_INVALID_HANDLE);
@@ -322,11 +320,13 @@ void xen_blkbk_unmap_purged_grants(struct work_struct 
*work)
int segs_to_unmap = 0;
struct xen_blkif *blkif = container_of(work, typeof(*blkif), 
persistent_purge_work);
struct gntab_unmap_queue_data unmap_data;
+   unsigned long flags;
 
unmap_data.pages = pages;
unmap_data.unmap_ops = unmap;
unmap_data.kunmap_ops = NULL;
 
+   spin_lock_irqsave(>pers_gnts_lock, flags);
while(!list_empty(>persistent_purge_list)) {
persistent_gnt = list_first_entry(>persistent_purge_list,
  struct persistent_gnt,
@@ -348,6 +348,7 @@ void xen_blkbk_unmap_purged_grants(struct work_struct *work)
}
kf

Re: [Xen-devel] [PATCH 06/32] xen blkback: prepare for bi_rw split

2015-11-08 Thread Bob Liu

On 11/07/2015 06:17 PM, Christoph Hellwig wrote:
> A little offtopic for this patch, but can some explain this whole
> mess about bios in Xen blkfront?  We can happily do partial completions
> at the request later.
> 
> Also since the blk-mq conversion the call to blk_end_request_all is

This will be fixed after my next blk-mq patch series which also modified the 
recover path.

> completely broken, so it doesn't look like this code path is exactly
> well tested.
>

Thanks,
-Bob

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH v4 07/10] xen/blkback: pseudo support for multi hardware queues/rings

2015-11-04 Thread Bob Liu

On 11/05/2015 10:30 AM, Konrad Rzeszutek Wilk wrote:
> On Mon, Nov 02, 2015 at 12:21:43PM +0800, Bob Liu wrote:
>> Preparatory patch for multiple hardware queues (rings). The number of
>> rings is unconditionally set to 1, larger number will be enabled in next
>> patch so as to make every single patch small and readable.
> 
> Instead of saying 'next patch' - spell out the title of the patch.
> 
> 
>>
>> Signed-off-by: Arianna Avanzini <avanzini.aria...@gmail.com>
>> Signed-off-by: Bob Liu <bob@oracle.com>
>> ---
>>  drivers/block/xen-blkback/common.h |   3 +-
>>  drivers/block/xen-blkback/xenbus.c | 292 
>> +++--
>>  2 files changed, 185 insertions(+), 110 deletions(-)
>>
>> diff --git a/drivers/block/xen-blkback/common.h 
>> b/drivers/block/xen-blkback/common.h
>> index f0dd69a..4de1326 100644
>> --- a/drivers/block/xen-blkback/common.h
>> +++ b/drivers/block/xen-blkback/common.h
>> @@ -341,7 +341,8 @@ struct xen_blkif {
>>  struct work_struct  free_work;
>>  unsigned int nr_ring_pages;
>>  /* All rings for this device */
>> -struct xen_blkif_ring ring;
>> +struct xen_blkif_ring *rings;
>> +unsigned int nr_rings;
>>  };
>>  
>>  struct seg_buf {
>> diff --git a/drivers/block/xen-blkback/xenbus.c 
>> b/drivers/block/xen-blkback/xenbus.c
>> index 7bdd5fd..ac4b458 100644
>> --- a/drivers/block/xen-blkback/xenbus.c
>> +++ b/drivers/block/xen-blkback/xenbus.c
>> @@ -84,11 +84,12 @@ static int blkback_name(struct xen_blkif *blkif, char 
>> *buf)
>>  
>>  static void xen_update_blkif_status(struct xen_blkif *blkif)
>>  {
>> -int err;
>> +int err, i;
> 
> unsigned int.
>>  char name[BLKBACK_NAME_LEN];
>> +struct xen_blkif_ring *ring;
>>  
>>  /* Not ready to connect? */
>> -if (!blkif->ring.irq || !blkif->vbd.bdev)
>> +if (!blkif->rings || !blkif->rings[0].irq || !blkif->vbd.bdev)
>>  return;
>>  
>>  /* Already connected? */
>> @@ -113,19 +114,57 @@ static void xen_update_blkif_status(struct xen_blkif 
>> *blkif)
>>  }
>>  invalidate_inode_pages2(blkif->vbd.bdev->bd_inode->i_mapping);
>>  
>> -blkif->ring.xenblkd = kthread_run(xen_blkif_schedule, >ring, 
>> "%s", name);
>> -if (IS_ERR(blkif->ring.xenblkd)) {
>> -err = PTR_ERR(blkif->ring.xenblkd);
>> -blkif->ring.xenblkd = NULL;
>> -xenbus_dev_error(blkif->be->dev, err, "start xenblkd");
>> -return;
>> +if (blkif->nr_rings == 1) {
>> +blkif->rings[0].xenblkd = kthread_run(xen_blkif_schedule, 
>> >rings[0], "%s", name);
>> +if (IS_ERR(blkif->rings[0].xenblkd)) {
>> +err = PTR_ERR(blkif->rings[0].xenblkd);
>> +blkif->rings[0].xenblkd = NULL;
>> +xenbus_dev_error(blkif->be->dev, err, "start xenblkd");
>> +return;
>> +}
>> +} else {
>> +for (i = 0; i < blkif->nr_rings; i++) {
>> +ring = >rings[i];
>> +ring->xenblkd = kthread_run(xen_blkif_schedule, ring, 
>> "%s-%d", name, i);
>> +if (IS_ERR(ring->xenblkd)) {
>> +err = PTR_ERR(ring->xenblkd);
>> +ring->xenblkd = NULL;
>> +xenbus_dev_error(blkif->be->dev, err,
>> +"start %s-%d xenblkd", name, i);
>> +return;
>> +}
>> +}
> 
> Please collapse this together and just have one loop.
> 
> And just use the same name throughout ('%s-%d')
> 
> Also this does not take care of actually freeing the allocated
> ring->xenblkd if one of them fails to start.
> 
> Hmm, looking at this function .. we seem to forget to change the
> state if something fails.
> 
> As in, connect switches the state to Connected.. And if anything blows
> up after the connect() we don't switch the state. We do report an error
> in the XenBus, but the state is the same.
> 
> We should be using xenbus_dev_fatal I believe - so at least the state
> changes to Closing (and then the frontend can switch itself to
> Closing->Closed) - and we will call 'xen_blkif_disconnect' on Closed. 
&g

Re: [Xen-devel] [PATCH v4 10/10] xen/blkback: make pool of persistent grants and free pages per-queue

2015-11-04 Thread Bob Liu

On 11/05/2015 10:43 AM, Konrad Rzeszutek Wilk wrote:
> On Mon, Nov 02, 2015 at 12:21:46PM +0800, Bob Liu wrote:
>> Make pool of persistent grants and free pages per-queue/ring instead of
>> per-device to get better scalability.
> 
> How much better scalability do we get?
> 

Which already showed in [00/10], I paste them here:

domU(orig)  4 queues8 queues16 queues
iops:690k   1024k(+30%) 800k750k


After patch 9 and 10:
domU(orig)  4 queues8 queues16 queues
iops:690k   1600k(+100%)   1450k1320k

Chart: https://www.dropbox.com/s/agrcy2pbzbsvmwv/iops.png?dl=0

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH v4 03/10] xen/blkfront: pseudo support for multi hardware queues/rings

2015-11-03 Thread Bob Liu

On 11/04/2015 03:44 AM, Konrad Rzeszutek Wilk wrote:
> On Mon, Nov 02, 2015 at 12:21:39PM +0800, Bob Liu wrote:
>> Preparatory patch for multiple hardware queues (rings). The number of
>> rings is unconditionally set to 1, larger number will be enabled in next
>> patch so as to make every single patch small and readable.
> 
> s/next patch/"xen/blkfront: negotiate number of queues/rings to be used with 
> backend"
> 
>>
>> Signed-off-by: Bob Liu <bob@oracle.com>
>> ---
>>  drivers/block/xen-blkfront.c | 327 
>> +--
>>  1 file changed, 188 insertions(+), 139 deletions(-)
>>
>> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
>> index 2a557e4..eab78e7 100644
>> --- a/drivers/block/xen-blkfront.c
>> +++ b/drivers/block/xen-blkfront.c
>> @@ -145,6 +145,7 @@ struct blkfront_info
>>  int vdevice;
>>  blkif_vdev_t handle;
>>  enum blkif_state connected;
>> +/* Number of pages per ring buffer */
> 
> Missing full stop, aka '.'.
> 
>>  unsigned int nr_ring_pages;
>>  struct request_queue *rq;
>>  struct list_head grants;
>> @@ -158,7 +159,8 @@ struct blkfront_info
>>  unsigned int max_indirect_segments;
>>  int is_ready;
>>  struct blk_mq_tag_set tag_set;
>> -struct blkfront_ring_info rinfo;
>> +struct blkfront_ring_info *rinfo;
>> +unsigned int nr_rings;
>>  };
>>  
>>  static unsigned int nr_minors;
>> @@ -190,7 +192,7 @@ static DEFINE_SPINLOCK(minor_lock);
>>  ((_segs + SEGS_PER_INDIRECT_FRAME - 1)/SEGS_PER_INDIRECT_FRAME)
>>  
>>  static int blkfront_setup_indirect(struct blkfront_ring_info *rinfo);
>> -static int blkfront_gather_backend_features(struct blkfront_info *info);
>> +static void blkfront_gather_backend_features(struct blkfront_info *info);
>>  
>>  static int get_id_from_freelist(struct blkfront_ring_info *rinfo)
>>  {
>> @@ -443,12 +445,13 @@ static int blkif_queue_request(struct request *req, 
>> struct blkfront_ring_info *r
>>   */
>>  max_grefs += INDIRECT_GREFS(req->nr_phys_segments);
>>  
>> -/* Check if we have enough grants to allocate a requests */
>> -if (info->persistent_gnts_c < max_grefs) {
>> +/* Check if we have enough grants to allocate a requests, we have to
>> + * reserve 'max_grefs' grants because persistent grants are shared by 
>> all
>> + * rings */
> 
> Missing full stop.
> 
>> +if (0 < max_grefs) {
> 
>  ? 0!?
> 
> max_grefs will at least be BLKIF_MAX_SEGMENTS_PER_REQUEST
> so this will always be true.
> 

No,  max_grefs = req->nr_phys_segments;

It's 0 in some cases(flush req?), and gnttable_alloc_grant_references() can not 
handle 0 as the parameter.

> In which ase why not just ..
>>  new_persistent_gnts = 1;
>>  if (gnttab_alloc_grant_references(
>> -max_grefs - info->persistent_gnts_c,
>> -_head) < 0) {
>> +max_grefs, _head) < 0) {
>>  gnttab_request_free_callback(
>>  >callback,
>>  blkif_restart_queue_callback,
> 
> .. move this whole code down? And get rid of 'new_persistent_gnts'
> since it will always be true?
> 

Unless we fix gnttable_alloc_grant_references(0).

> But more importantly, why do we not check for 'info->persistent_gnts_c' 
> anymore?
> 

Info->persistent_gnts_c is for per-device not per-ring, the persistent grants 
may be taken by other queues/rings after we checked the value here.
Which would make get_grant() fail, so we have to reserved enough grants in 
advance.
Those new-allocated grants will be freed if there are enough grants in 
persistent list.

Will fix all other comments for this patch.

Thanks,
Bob

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH v4 04/10] xen/blkfront: split per device io_lock

2015-11-03 Thread Bob Liu

On 11/04/2015 04:09 AM, Konrad Rzeszutek Wilk wrote:
> On Mon, Nov 02, 2015 at 12:21:40PM +0800, Bob Liu wrote:
>> The per device io_lock became a coarser grained lock after multi-queues/rings
>> was introduced, this patch introduced a fine-grained ring_lock for each ring.
> 
> s/was introduced/was introduced (see commit titled XYZ)/
> 
> s/introdued/introduces/
>>
>> The old io_lock was renamed to dev_lock and only protect the ->grants list
> 
> s/was/is/
> s/protect/protects/
> 
>> which is shared by all rings.
>>
>> Signed-off-by: Bob Liu <bob@oracle.com>
>> ---
>>  drivers/block/xen-blkfront.c | 57 
>> ++--
>>  1 file changed, 34 insertions(+), 23 deletions(-)
>>
>> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
>> index eab78e7..8cc5995 100644
>> --- a/drivers/block/xen-blkfront.c
>> +++ b/drivers/block/xen-blkfront.c
>> @@ -121,6 +121,7 @@ MODULE_PARM_DESC(max_ring_page_order, "Maximum order of 
>> pages to be used for the
>>   */
>>  struct blkfront_ring_info {
>>  struct blkif_front_ring ring;
> 
> Can you add a comment explaining the lock semantic? As in under what 
> conditions
> should it be taken? Like you have it below.
> 
>> +spinlock_t ring_lock;
>>  unsigned int ring_ref[XENBUS_MAX_RING_PAGES];
>>  unsigned int evtchn, irq;
>>  struct work_struct work;
>> @@ -138,7 +139,8 @@ struct blkfront_ring_info {
>>   */
>>  struct blkfront_info
>>  {
>> -spinlock_t io_lock;
>> +/* Lock to proect info->grants list shared by multi rings */
> 
> s/proect/protect/
> 
> Missing full stop.
> 
>> +spinlock_t dev_lock;
> 
> Shouldn't it be right next to what it is protecting?
> 
> That is right below (or above): 'struct list_head grants;'?
> 
>>  struct mutex mutex;
>>  struct xenbus_device *xbdev;
>>  struct gendisk *gd;
>> @@ -224,6 +226,7 @@ static int fill_grant_buffer(struct blkfront_ring_info 
>> *rinfo, int num)
>>  struct grant *gnt_list_entry, *n;
>>  int i = 0;
>>  
>> +spin_lock_irq(>dev_lock);
> 
> Why there? Why not where you add it to the list?
>>  while(i < num) {
>>  gnt_list_entry = kzalloc(sizeof(struct grant), GFP_NOIO);
>>  if (!gnt_list_entry)
>> @@ -242,6 +245,7 @@ static int fill_grant_buffer(struct blkfront_ring_info 
>> *rinfo, int num)
>>  list_add(_list_entry->node, >grants);
> 
> Right here that is?
> 
> You are holding the lock for the duration of 'kzalloc' and 'alloc_page'.
> 
> And more interestingly, GFP_NOIO translates to __GFP_WAIT which means
> it can call 'schedule'. - And you have taken an spinlock. That should
> have thrown lots of warnings?
> 
>>  i++;
>>  }
>> +spin_unlock_irq(>dev_lock);
>>  
>>  return 0;
>>  
>> @@ -254,6 +258,7 @@ out_of_memory:
>>  kfree(gnt_list_entry);
>>  i--;
>>  }
>> +spin_unlock_irq(>dev_lock);
> 
> Just do it around the 'list_del' operation. You are using an
> 'safe'
>>  BUG_ON(i != 0);
>>  return -ENOMEM;
>>  }
>> @@ -265,6 +270,7 @@ static struct grant *get_grant(grant_ref_t *gref_head,
>>  struct grant *gnt_list_entry;
>>  unsigned long buffer_gfn;
>>  
>> +spin_lock(>dev_lock);
>>  BUG_ON(list_empty(>grants));
>>  gnt_list_entry = list_first_entry(>grants, struct grant,
>>node);
>> @@ -272,8 +278,10 @@ static struct grant *get_grant(grant_ref_t *gref_head,
>>  
>>  if (gnt_list_entry->gref != GRANT_INVALID_REF) {
>>  info->persistent_gnts_c--;
>> +spin_unlock(>dev_lock);
>>  return gnt_list_entry;
>>  }
>> +spin_unlock(>dev_lock);
> 
> Just have one spin_unlock. Put it right before the 'if 
> (gnt_list_entry->gref)..'.

That's used to protect info->persistent_gnts_c, will update all other place.

Thanks,
-Bob

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH v4 05/10] xen/blkfront: negotiate number of queues/rings to be used with backend

2015-11-03 Thread Bob Liu

On 11/04/2015 04:40 AM, Konrad Rzeszutek Wilk wrote:
> On Mon, Nov 02, 2015 at 12:21:41PM +0800, Bob Liu wrote:
>> The number of hardware queues for xen/blkfront is set by parameter
>> 'max_queues'(default 4), while the max value xen/blkback supported is 
>> notified
>> through xenstore("multi-queue-max-queues").
> 
> That is not right.
> 
> s/The number/The max number/
> 
> The second part: ",while the max value xen/blkback supported is..". I think
> you are trying to say: "it is also capped by the max value that
> the xen/blkback exposes through XenStore key 'multi-queue-max-queues'.
> 
>>
>> The negotiated number is the smaller one and would be written back to 
>> xenstore
>> as "multi-queue-num-queues", blkback need to read this negotiated number.
> 
> s/blkback need to read/blkback needs to read this/
> 
>>
>> Signed-off-by: Bob Liu <bob@oracle.com>
>> ---
>>  drivers/block/xen-blkfront.c | 166 
>> +++
>>  1 file changed, 120 insertions(+), 46 deletions(-)
>>
>> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
>> index 8cc5995..23096d7 100644
>> --- a/drivers/block/xen-blkfront.c
>> +++ b/drivers/block/xen-blkfront.c
>> @@ -98,6 +98,10 @@ static unsigned int xen_blkif_max_segments = 32;
>>  module_param_named(max, xen_blkif_max_segments, int, S_IRUGO);
>>  MODULE_PARM_DESC(max, "Maximum amount of segments in indirect requests 
>> (default is 32)");
>>  
>> +static unsigned int xen_blkif_max_queues = 4;
>> +module_param_named(max_queues, xen_blkif_max_queues, uint, S_IRUGO);
>> +MODULE_PARM_DESC(max_queues, "Maximum number of hardware queues/rings used 
>> per virtual disk");
>> +
>>  /*
>>   * Maximum order of pages to be used for the shared ring between front and
>>   * backend, 4KB page granularity is used.
>> @@ -113,6 +117,7 @@ MODULE_PARM_DESC(max_ring_page_order, "Maximum order of 
>> pages to be used for the
>>   * characters are enough. Define to 20 to keep consist with backend.
>>   */
>>  #define RINGREF_NAME_LEN (20)
>> +#define QUEUE_NAME_LEN (12)
> 
> Little bit of documentation please. Why 12? Why not 31415 for example?
> I presume it is 'queue-%u' and since so that is 7 + 10 (UINT_MAX is
> 4294967295) = 17!
> 
> 
>>  
>>  /*
>>   *  Per-ring info.
>> @@ -695,7 +700,7 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 
>> sector_size,
>>  
>>  memset(>tag_set, 0, sizeof(info->tag_set));
>>  info->tag_set.ops = _mq_ops;
>> -info->tag_set.nr_hw_queues = 1;
>> +info->tag_set.nr_hw_queues = info->nr_rings;
>>  info->tag_set.queue_depth =  BLK_RING_SIZE(info);
>>  info->tag_set.numa_node = NUMA_NO_NODE;
>>  info->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
>> @@ -1352,6 +1357,51 @@ fail:
>>  return err;
>>  }
>>  
>> +static int write_per_ring_nodes(struct xenbus_transaction xbt,
>> +struct blkfront_ring_info *rinfo, const char 
>> *dir)
>> +{
>> +int err, i;
> 
> Please make 'i' be an unsigned int. Especially as you are using '%u' in the 
> snprintf.
> 
> 
>> +const char *message = NULL;
>> +struct blkfront_info *info = rinfo->dev_info;
>> +
>> +if (info->nr_ring_pages == 1) {
>> +err = xenbus_printf(xbt, dir, "ring-ref", "%u", 
>> rinfo->ring_ref[0]);
>> +if (err) {
>> +message = "writing ring-ref";
>> +goto abort_transaction;
>> +}
>> +pr_info("%s: write ring-ref:%d\n", dir, rinfo->ring_ref[0]);
> 
> Ewww. No.
> 
>> +} else {
>> +for (i = 0; i < info->nr_ring_pages; i++) {
>> +char ring_ref_name[RINGREF_NAME_LEN];
>> +
>> +snprintf(ring_ref_name, RINGREF_NAME_LEN, "ring-ref%u", 
>> i);
>> +err = xenbus_printf(xbt, dir, ring_ref_name,
>> +"%u", rinfo->ring_ref[i]);
>> +if (err) {
>> +message = "writing ring-ref";
>> +goto abort_transaction;
>> +}
>> +pr_info("%s: write ring-ref:%d\n", dir, 
>> rinfo->ring_ref[i]);
> 
&g

[Xen-devel] [PATCH v4 08/10] xen/blkback: get the number of hardware queues/rings from blkfront

2015-11-01 Thread Bob Liu
Backend advertises "multi-queue-max-queues" to front, then get the negotiated
number from "multi-queue-num-queues" wrote by blkfront.

Signed-off-by: Bob Liu <bob@oracle.com>
---
 drivers/block/xen-blkback/blkback.c | 11 +++
 drivers/block/xen-blkback/common.h  |  1 +
 drivers/block/xen-blkback/xenbus.c  | 35 +--
 3 files changed, 41 insertions(+), 6 deletions(-)

diff --git a/drivers/block/xen-blkback/blkback.c 
b/drivers/block/xen-blkback/blkback.c
index eaf7ec0..107cc4a 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -83,6 +83,11 @@ module_param_named(max_persistent_grants, 
xen_blkif_max_pgrants, int, 0644);
 MODULE_PARM_DESC(max_persistent_grants,
  "Maximum number of grants to map persistently");
 
+unsigned int xenblk_max_queues;
+module_param_named(max_queues, xenblk_max_queues, uint, 0644);
+MODULE_PARM_DESC(max_queues,
+"Maximum number of hardware queues per virtual disk");
+
 /*
  * Maximum order of pages to be used for the shared ring between front and
  * backend, 4KB page granularity is used.
@@ -1478,6 +1483,12 @@ static int __init xen_blkif_init(void)
xen_blkif_max_ring_order = XENBUS_MAX_RING_PAGE_ORDER;
}
 
+   /* Allow as many queues as there are CPUs if user has not
+* specified a value.
+*/
+   if (xenblk_max_queues == 0)
+   xenblk_max_queues = num_online_cpus();
+
rc = xen_blkif_interface_init();
if (rc)
goto failed_init;
diff --git a/drivers/block/xen-blkback/common.h 
b/drivers/block/xen-blkback/common.h
index 4de1326..fb28b91 100644
--- a/drivers/block/xen-blkback/common.h
+++ b/drivers/block/xen-blkback/common.h
@@ -45,6 +45,7 @@
 #include 
 
 extern unsigned int xen_blkif_max_ring_order;
+extern unsigned int xenblk_max_queues;
 /*
  * This is the maximum number of segments that would be allowed in indirect
  * requests. This value will also be passed to the frontend.
diff --git a/drivers/block/xen-blkback/xenbus.c 
b/drivers/block/xen-blkback/xenbus.c
index ac4b458..cafbadd 100644
--- a/drivers/block/xen-blkback/xenbus.c
+++ b/drivers/block/xen-blkback/xenbus.c
@@ -181,12 +181,6 @@ static struct xen_blkif *xen_blkif_alloc(domid_t domid)
INIT_LIST_HEAD(>persistent_purge_list);
INIT_WORK(>persistent_purge_work, xen_blkbk_unmap_purged_grants);
 
-   blkif->nr_rings = 1;
-   if (xen_blkif_alloc_rings(blkif)) {
-   kmem_cache_free(xen_blkif_cachep, blkif);
-   return ERR_PTR(-ENOMEM);
-   }
-
return blkif;
 }
 
@@ -606,6 +600,14 @@ static int xen_blkbk_probe(struct xenbus_device *dev,
goto fail;
}
 
+   /* Multi-queue: write how many queues are supported by the backend. */
+   err = xenbus_printf(XBT_NIL, dev->nodename,
+   "multi-queue-max-queues", "%u", xenblk_max_queues);
+   if (err) {
+   pr_warn("Error writing multi-queue-num-queues\n");
+   goto fail;
+   }
+
/* setup back pointer */
be->blkif->be = be;
 
@@ -997,6 +999,7 @@ static int connect_ring(struct backend_info *be)
char *xspath;
size_t xspathsize;
const size_t xenstore_path_ext_size = 11; /* sufficient for 
"/queue-NNN" */
+   unsigned int requested_num_queues = 0;
 
pr_debug("%s %s\n", __func__, dev->otherend);
 
@@ -1024,6 +1027,26 @@ static int connect_ring(struct backend_info *be)
be->blkif->vbd.feature_gnt_persistent = pers_grants;
be->blkif->vbd.overflow_max_grants = 0;
 
+   /*
+* Read the number of hardware queues from frontend.
+*/
+   err = xenbus_scanf(XBT_NIL, dev->otherend, "multi-queue-num-queues", 
"%u", _num_queues);
+   if (err < 0) {
+   requested_num_queues = 1;
+   } else {
+   if (requested_num_queues > xenblk_max_queues
+   || requested_num_queues == 0) {
+   /* buggy or malicious guest */
+   xenbus_dev_fatal(dev, err,
+   "guest requested %u queues, exceeding 
the maximum of %u.",
+   requested_num_queues, 
xenblk_max_queues);
+   return -1;
+   }
+   }
+   be->blkif->nr_rings = requested_num_queues;
+   if (xen_blkif_alloc_rings(be->blkif))
+   return -ENOMEM;
+
pr_info("nr_rings:%d protocol %d (%s) %s\n", be->blkif->nr_rings,
 be->blkif->blk_protocol, protocol,
 pers_grants ? "persistent grants" : "");
-- 
1.8.3.1


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] [PATCH v4 09/10] xen/blkfront: make persistent grants per-queue

2015-11-01 Thread Bob Liu
Make persistent grants per-queue/ring instead of per-device, so that we can
drop the 'dev_lock' and get better scalability.

Signed-off-by: Bob Liu <bob@oracle.com>
---
 drivers/block/xen-blkfront.c | 89 +---
 1 file changed, 34 insertions(+), 55 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 23096d7..eb19f08 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -133,6 +133,8 @@ struct blkfront_ring_info {
struct gnttab_free_callback callback;
struct blk_shadow shadow[BLK_MAX_RING_SIZE];
struct list_head indirect_pages;
+   struct list_head grants;
+   unsigned int persistent_gnts_c;
unsigned long shadow_free;
struct blkfront_info *dev_info;
 };
@@ -144,8 +146,6 @@ struct blkfront_ring_info {
  */
 struct blkfront_info
 {
-   /* Lock to proect info->grants list shared by multi rings */
-   spinlock_t dev_lock;
struct mutex mutex;
struct xenbus_device *xbdev;
struct gendisk *gd;
@@ -155,8 +155,6 @@ struct blkfront_info
/* Number of pages per ring buffer */
unsigned int nr_ring_pages;
struct request_queue *rq;
-   struct list_head grants;
-   unsigned int persistent_gnts_c;
unsigned int feature_flush;
unsigned int feature_discard:1;
unsigned int feature_secdiscard:1;
@@ -231,7 +229,6 @@ static int fill_grant_buffer(struct blkfront_ring_info 
*rinfo, int num)
struct grant *gnt_list_entry, *n;
int i = 0;
 
-   spin_lock_irq(>dev_lock);
while(i < num) {
gnt_list_entry = kzalloc(sizeof(struct grant), GFP_NOIO);
if (!gnt_list_entry)
@@ -247,35 +244,32 @@ static int fill_grant_buffer(struct blkfront_ring_info 
*rinfo, int num)
}
 
gnt_list_entry->gref = GRANT_INVALID_REF;
-   list_add(_list_entry->node, >grants);
+   list_add(_list_entry->node, >grants);
i++;
}
-   spin_unlock_irq(>dev_lock);
 
return 0;
 
 out_of_memory:
list_for_each_entry_safe(gnt_list_entry, n,
->grants, node) {
+>grants, node) {
list_del(_list_entry->node);
if (info->feature_persistent)
__free_page(pfn_to_page(gnt_list_entry->pfn));
kfree(gnt_list_entry);
i--;
}
-   spin_unlock_irq(>dev_lock);
BUG_ON(i != 0);
return -ENOMEM;
 }
 
 static struct grant *get_grant(grant_ref_t *gref_head,
unsigned long pfn,
-   struct blkfront_info *info)
+  struct blkfront_ring_info *info)
 {
struct grant *gnt_list_entry;
unsigned long buffer_gfn;
 
-   spin_lock(>dev_lock);
BUG_ON(list_empty(>grants));
gnt_list_entry = list_first_entry(>grants, struct grant,
  node);
@@ -283,21 +277,19 @@ static struct grant *get_grant(grant_ref_t *gref_head,
 
if (gnt_list_entry->gref != GRANT_INVALID_REF) {
info->persistent_gnts_c--;
-   spin_unlock(>dev_lock);
return gnt_list_entry;
}
-   spin_unlock(>dev_lock);
 
/* Assign a gref to this page */
gnt_list_entry->gref = gnttab_claim_grant_reference(gref_head);
BUG_ON(gnt_list_entry->gref == -ENOSPC);
-   if (!info->feature_persistent) {
+   if (!info->dev_info->feature_persistent) {
BUG_ON(!pfn);
gnt_list_entry->pfn = pfn;
}
buffer_gfn = pfn_to_gfn(gnt_list_entry->pfn);
gnttab_grant_foreign_access_ref(gnt_list_entry->gref,
-   info->xbdev->otherend_id,
+   info->dev_info->xbdev->otherend_id,
buffer_gfn, 0);
return gnt_list_entry;
 }
@@ -559,13 +551,13 @@ static int blkif_queue_request(struct request *req, 
struct blkfront_ring_info *r
list_del(_page->lru);
pfn = page_to_pfn(indirect_page);
}
-   gnt_list_entry = get_grant(_head, pfn, 
info);
+   gnt_list_entry = get_grant(_head, pfn, 
rinfo);
rinfo->shadow[id].indirect_grants[n] = 
gnt_list_entry;
segments = 
kmap_atomic(pfn_to_page(gnt_list_entry->pfn));
ring_req->u.indirect.indirect_grefs[n] = 
gnt_list_entry->gref;
}
 
-   gnt_list_ent

[Xen-devel] [PATCH v4 06/10] xen/blkback: separate ring information out of struct xen_blkif

2015-11-01 Thread Bob Liu
Split per ring information to an new structure "xen_blkif_ring", so that one vbd
device can associate with one or more rings/hardware queues.

Introduce 'pers_gnts_lock' to protect the pool of persistent grants since we
may have multi backend threads.

This patch is a preparation for supporting multi hardware queues/rings.

Signed-off-by: Arianna Avanzini <avanzini.aria...@gmail.com>
Signed-off-by: Bob Liu <bob@oracle.com>
---
 drivers/block/xen-blkback/blkback.c | 233 
 drivers/block/xen-blkback/common.h  |  64 ++
 drivers/block/xen-blkback/xenbus.c  | 107 ++---
 3 files changed, 234 insertions(+), 170 deletions(-)

diff --git a/drivers/block/xen-blkback/blkback.c 
b/drivers/block/xen-blkback/blkback.c
index 6a685ae..eaf7ec0 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -173,11 +173,11 @@ static inline void shrink_free_pagepool(struct xen_blkif 
*blkif, int num)
 
 #define vaddr(page) ((unsigned long)pfn_to_kaddr(page_to_pfn(page)))
 
-static int do_block_io_op(struct xen_blkif *blkif);
-static int dispatch_rw_block_io(struct xen_blkif *blkif,
+static int do_block_io_op(struct xen_blkif_ring *ring);
+static int dispatch_rw_block_io(struct xen_blkif_ring *ring,
struct blkif_request *req,
struct pending_req *pending_req);
-static void make_response(struct xen_blkif *blkif, u64 id,
+static void make_response(struct xen_blkif_ring *ring, u64 id,
  unsigned short op, int st);
 
 #define foreach_grant_safe(pos, n, rbtree, node) \
@@ -189,14 +189,8 @@ static void make_response(struct xen_blkif *blkif, u64 id,
 
 
 /*
- * We don't need locking around the persistent grant helpers
- * because blkback uses a single-thread for each backed, so we
- * can be sure that this functions will never be called recursively.
- *
- * The only exception to that is put_persistent_grant, that can be called
- * from interrupt context (by xen_blkbk_unmap), so we have to use atomic
- * bit operations to modify the flags of a persistent grant and to count
- * the number of used grants.
+ * pers_gnts_lock must be used around all the persistent grant helpers
+ * because blkback may use multi-thread/queue for each backend.
  */
 static int add_persistent_gnt(struct xen_blkif *blkif,
   struct persistent_gnt *persistent_gnt)
@@ -322,11 +316,13 @@ void xen_blkbk_unmap_purged_grants(struct work_struct 
*work)
int segs_to_unmap = 0;
struct xen_blkif *blkif = container_of(work, typeof(*blkif), 
persistent_purge_work);
struct gntab_unmap_queue_data unmap_data;
+   unsigned long flags;
 
unmap_data.pages = pages;
unmap_data.unmap_ops = unmap;
unmap_data.kunmap_ops = NULL;
 
+   spin_lock_irqsave(>pers_gnts_lock, flags);
while(!list_empty(>persistent_purge_list)) {
persistent_gnt = list_first_entry(>persistent_purge_list,
  struct persistent_gnt,
@@ -348,6 +344,7 @@ void xen_blkbk_unmap_purged_grants(struct work_struct *work)
}
kfree(persistent_gnt);
}
+   spin_unlock_irqrestore(>pers_gnts_lock, flags);
if (segs_to_unmap > 0) {
unmap_data.count = segs_to_unmap;
BUG_ON(gnttab_unmap_refs_sync(_data));
@@ -362,16 +359,18 @@ static void purge_persistent_gnt(struct xen_blkif *blkif)
unsigned int num_clean, total;
bool scan_used = false, clean_used = false;
struct rb_root *root;
+   unsigned long flags;
 
+   spin_lock_irqsave(>pers_gnts_lock, flags);
if (blkif->persistent_gnt_c < xen_blkif_max_pgrants ||
(blkif->persistent_gnt_c == xen_blkif_max_pgrants &&
!blkif->vbd.overflow_max_grants)) {
-   return;
+   goto out;
}
 
if (work_busy(>persistent_purge_work)) {
pr_alert_ratelimited("Scheduled work from previous purge is 
still busy, cannot purge list\n");
-   return;
+   goto out;
}
 
num_clean = (xen_blkif_max_pgrants / 100) * LRU_PERCENT_CLEAN;
@@ -379,7 +378,7 @@ static void purge_persistent_gnt(struct xen_blkif *blkif)
num_clean = min(blkif->persistent_gnt_c, num_clean);
if ((num_clean == 0) ||
(num_clean > (blkif->persistent_gnt_c - 
atomic_read(>persistent_gnt_in_use
-   return;
+   goto out;
 
/*
 * At this point, we can assure that there will be no calls
@@ -436,29 +435,35 @@ finished:
}
 
blkif->persistent_gnt_c -= (total - num_clean);
+   spin_unlock_irqrestore(>pers_gnts_lock, flags);
blkif->vbd.overflow_max_grants = 0;
 
/* We can defer this work */
  

[Xen-devel] [PATCH v4 07/10] xen/blkback: pseudo support for multi hardware queues/rings

2015-11-01 Thread Bob Liu
Preparatory patch for multiple hardware queues (rings). The number of
rings is unconditionally set to 1, larger number will be enabled in next
patch so as to make every single patch small and readable.

Signed-off-by: Arianna Avanzini <avanzini.aria...@gmail.com>
Signed-off-by: Bob Liu <bob@oracle.com>
---
 drivers/block/xen-blkback/common.h |   3 +-
 drivers/block/xen-blkback/xenbus.c | 292 +++--
 2 files changed, 185 insertions(+), 110 deletions(-)

diff --git a/drivers/block/xen-blkback/common.h 
b/drivers/block/xen-blkback/common.h
index f0dd69a..4de1326 100644
--- a/drivers/block/xen-blkback/common.h
+++ b/drivers/block/xen-blkback/common.h
@@ -341,7 +341,8 @@ struct xen_blkif {
struct work_struct  free_work;
unsigned int nr_ring_pages;
/* All rings for this device */
-   struct xen_blkif_ring ring;
+   struct xen_blkif_ring *rings;
+   unsigned int nr_rings;
 };
 
 struct seg_buf {
diff --git a/drivers/block/xen-blkback/xenbus.c 
b/drivers/block/xen-blkback/xenbus.c
index 7bdd5fd..ac4b458 100644
--- a/drivers/block/xen-blkback/xenbus.c
+++ b/drivers/block/xen-blkback/xenbus.c
@@ -84,11 +84,12 @@ static int blkback_name(struct xen_blkif *blkif, char *buf)
 
 static void xen_update_blkif_status(struct xen_blkif *blkif)
 {
-   int err;
+   int err, i;
char name[BLKBACK_NAME_LEN];
+   struct xen_blkif_ring *ring;
 
/* Not ready to connect? */
-   if (!blkif->ring.irq || !blkif->vbd.bdev)
+   if (!blkif->rings || !blkif->rings[0].irq || !blkif->vbd.bdev)
return;
 
/* Already connected? */
@@ -113,19 +114,57 @@ static void xen_update_blkif_status(struct xen_blkif 
*blkif)
}
invalidate_inode_pages2(blkif->vbd.bdev->bd_inode->i_mapping);
 
-   blkif->ring.xenblkd = kthread_run(xen_blkif_schedule, >ring, 
"%s", name);
-   if (IS_ERR(blkif->ring.xenblkd)) {
-   err = PTR_ERR(blkif->ring.xenblkd);
-   blkif->ring.xenblkd = NULL;
-   xenbus_dev_error(blkif->be->dev, err, "start xenblkd");
-   return;
+   if (blkif->nr_rings == 1) {
+   blkif->rings[0].xenblkd = kthread_run(xen_blkif_schedule, 
>rings[0], "%s", name);
+   if (IS_ERR(blkif->rings[0].xenblkd)) {
+   err = PTR_ERR(blkif->rings[0].xenblkd);
+   blkif->rings[0].xenblkd = NULL;
+   xenbus_dev_error(blkif->be->dev, err, "start xenblkd");
+   return;
+   }
+   } else {
+   for (i = 0; i < blkif->nr_rings; i++) {
+   ring = >rings[i];
+   ring->xenblkd = kthread_run(xen_blkif_schedule, ring, 
"%s-%d", name, i);
+   if (IS_ERR(ring->xenblkd)) {
+   err = PTR_ERR(ring->xenblkd);
+   ring->xenblkd = NULL;
+   xenbus_dev_error(blkif->be->dev, err,
+   "start %s-%d xenblkd", name, i);
+   return;
+   }
+   }
+   }
+}
+
+static int xen_blkif_alloc_rings(struct xen_blkif *blkif)
+{
+   int r;
+
+   blkif->rings = kzalloc(blkif->nr_rings * sizeof(struct xen_blkif_ring), 
GFP_KERNEL);
+   if (!blkif->rings)
+   return -ENOMEM;
+
+   for (r = 0; r < blkif->nr_rings; r++) {
+   struct xen_blkif_ring *ring = >rings[r];
+
+   spin_lock_init(>blk_ring_lock);
+   init_waitqueue_head(>wq);
+   INIT_LIST_HEAD(>pending_free);
+
+   spin_lock_init(>pending_free_lock);
+   init_waitqueue_head(>pending_free_wq);
+   init_waitqueue_head(>shutdown_wq);
+   ring->blkif = blkif;
+   xen_blkif_get(blkif);
}
+
+   return 0;
 }
 
 static struct xen_blkif *xen_blkif_alloc(domid_t domid)
 {
struct xen_blkif *blkif;
-   struct xen_blkif_ring *ring;
 
BUILD_BUG_ON(MAX_INDIRECT_PAGES > BLKIF_MAX_INDIRECT_PAGES_PER_REQUEST);
 
@@ -136,27 +175,17 @@ static struct xen_blkif *xen_blkif_alloc(domid_t domid)
blkif->domid = domid;
atomic_set(>refcnt, 1);
init_completion(>drain_complete);
-   atomic_set(>drain, 0);
INIT_WORK(>free_work, xen_blkif_deferred_free);
spin_lock_init(>free_pages_lock);
INIT_LIST_HEAD(>free_pages);
-   blkif->free_pages_num = 0;
-   blkif->persistent_gnts.rb_node = NULL;
INIT_LIST_HEAD(>persistent_purge_list);
-   atomic_set(>persistent_gnt_in_use, 0);
INIT_WORK(>persistent_purge_work, xen_b

[Xen-devel] [PATCH v4 10/10] xen/blkback: make pool of persistent grants and free pages per-queue

2015-11-01 Thread Bob Liu
Make pool of persistent grants and free pages per-queue/ring instead of
per-device to get better scalability.

Signed-off-by: Bob Liu <bob@oracle.com>
---
 drivers/block/xen-blkback/blkback.c | 212 +---
 drivers/block/xen-blkback/common.h  |  32 +++---
 drivers/block/xen-blkback/xenbus.c  |  21 ++--
 3 files changed, 124 insertions(+), 141 deletions(-)

diff --git a/drivers/block/xen-blkback/blkback.c 
b/drivers/block/xen-blkback/blkback.c
index 107cc4a..28cbdae 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -118,60 +118,60 @@ module_param(log_stats, int, 0644);
 /* Number of free pages to remove on each call to gnttab_free_pages */
 #define NUM_BATCH_FREE_PAGES 10
 
-static inline int get_free_page(struct xen_blkif *blkif, struct page **page)
+static inline int get_free_page(struct xen_blkif_ring *ring, struct page 
**page)
 {
unsigned long flags;
 
-   spin_lock_irqsave(>free_pages_lock, flags);
-   if (list_empty(>free_pages)) {
-   BUG_ON(blkif->free_pages_num != 0);
-   spin_unlock_irqrestore(>free_pages_lock, flags);
+   spin_lock_irqsave(>free_pages_lock, flags);
+   if (list_empty(>free_pages)) {
+   BUG_ON(ring->free_pages_num != 0);
+   spin_unlock_irqrestore(>free_pages_lock, flags);
return gnttab_alloc_pages(1, page);
}
-   BUG_ON(blkif->free_pages_num == 0);
-   page[0] = list_first_entry(>free_pages, struct page, lru);
+   BUG_ON(ring->free_pages_num == 0);
+   page[0] = list_first_entry(>free_pages, struct page, lru);
list_del([0]->lru);
-   blkif->free_pages_num--;
-   spin_unlock_irqrestore(>free_pages_lock, flags);
+   ring->free_pages_num--;
+   spin_unlock_irqrestore(>free_pages_lock, flags);
 
return 0;
 }
 
-static inline void put_free_pages(struct xen_blkif *blkif, struct page **page,
+static inline void put_free_pages(struct xen_blkif_ring *ring, struct page 
**page,
   int num)
 {
unsigned long flags;
int i;
 
-   spin_lock_irqsave(>free_pages_lock, flags);
+   spin_lock_irqsave(>free_pages_lock, flags);
for (i = 0; i < num; i++)
-   list_add([i]->lru, >free_pages);
-   blkif->free_pages_num += num;
-   spin_unlock_irqrestore(>free_pages_lock, flags);
+   list_add([i]->lru, >free_pages);
+   ring->free_pages_num += num;
+   spin_unlock_irqrestore(>free_pages_lock, flags);
 }
 
-static inline void shrink_free_pagepool(struct xen_blkif *blkif, int num)
+static inline void shrink_free_pagepool(struct xen_blkif_ring *ring, int num)
 {
/* Remove requested pages in batches of NUM_BATCH_FREE_PAGES */
struct page *page[NUM_BATCH_FREE_PAGES];
unsigned int num_pages = 0;
unsigned long flags;
 
-   spin_lock_irqsave(>free_pages_lock, flags);
-   while (blkif->free_pages_num > num) {
-   BUG_ON(list_empty(>free_pages));
-   page[num_pages] = list_first_entry(>free_pages,
+   spin_lock_irqsave(>free_pages_lock, flags);
+   while (ring->free_pages_num > num) {
+   BUG_ON(list_empty(>free_pages));
+   page[num_pages] = list_first_entry(>free_pages,
   struct page, lru);
list_del([num_pages]->lru);
-   blkif->free_pages_num--;
+   ring->free_pages_num--;
if (++num_pages == NUM_BATCH_FREE_PAGES) {
-   spin_unlock_irqrestore(>free_pages_lock, flags);
+   spin_unlock_irqrestore(>free_pages_lock, flags);
gnttab_free_pages(num_pages, page);
-   spin_lock_irqsave(>free_pages_lock, flags);
+   spin_lock_irqsave(>free_pages_lock, flags);
num_pages = 0;
}
}
-   spin_unlock_irqrestore(>free_pages_lock, flags);
+   spin_unlock_irqrestore(>free_pages_lock, flags);
if (num_pages != 0)
gnttab_free_pages(num_pages, page);
 }
@@ -194,22 +194,29 @@ static void make_response(struct xen_blkif_ring *ring, 
u64 id,
 
 
 /*
- * pers_gnts_lock must be used around all the persistent grant helpers
- * because blkback may use multi-thread/queue for each backend.
+ * We don't need locking around the persistent grant helpers
+ * because blkback uses a single-thread for each backed, so we
+ * can be sure that this functions will never be called recursively.
+ *
+ * The only exception to that is put_persistent_grant, that can be called
+ * from interrupt context (by xen_blkbk_unmap), so we have to use atomic
+ * bit operations to modify the flags of a persistent grant and to cou

[Xen-devel] [PATCH v4 03/10] xen/blkfront: pseudo support for multi hardware queues/rings

2015-11-01 Thread Bob Liu
Preparatory patch for multiple hardware queues (rings). The number of
rings is unconditionally set to 1, larger number will be enabled in next
patch so as to make every single patch small and readable.

Signed-off-by: Bob Liu <bob@oracle.com>
---
 drivers/block/xen-blkfront.c | 327 +--
 1 file changed, 188 insertions(+), 139 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 2a557e4..eab78e7 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -145,6 +145,7 @@ struct blkfront_info
int vdevice;
blkif_vdev_t handle;
enum blkif_state connected;
+   /* Number of pages per ring buffer */
unsigned int nr_ring_pages;
struct request_queue *rq;
struct list_head grants;
@@ -158,7 +159,8 @@ struct blkfront_info
unsigned int max_indirect_segments;
int is_ready;
struct blk_mq_tag_set tag_set;
-   struct blkfront_ring_info rinfo;
+   struct blkfront_ring_info *rinfo;
+   unsigned int nr_rings;
 };
 
 static unsigned int nr_minors;
@@ -190,7 +192,7 @@ static DEFINE_SPINLOCK(minor_lock);
((_segs + SEGS_PER_INDIRECT_FRAME - 1)/SEGS_PER_INDIRECT_FRAME)
 
 static int blkfront_setup_indirect(struct blkfront_ring_info *rinfo);
-static int blkfront_gather_backend_features(struct blkfront_info *info);
+static void blkfront_gather_backend_features(struct blkfront_info *info);
 
 static int get_id_from_freelist(struct blkfront_ring_info *rinfo)
 {
@@ -443,12 +445,13 @@ static int blkif_queue_request(struct request *req, 
struct blkfront_ring_info *r
 */
max_grefs += INDIRECT_GREFS(req->nr_phys_segments);
 
-   /* Check if we have enough grants to allocate a requests */
-   if (info->persistent_gnts_c < max_grefs) {
+   /* Check if we have enough grants to allocate a requests, we have to
+* reserve 'max_grefs' grants because persistent grants are shared by 
all
+* rings */
+   if (0 < max_grefs) {
new_persistent_gnts = 1;
if (gnttab_alloc_grant_references(
-   max_grefs - info->persistent_gnts_c,
-   _head) < 0) {
+   max_grefs, _head) < 0) {
gnttab_request_free_callback(
>callback,
blkif_restart_queue_callback,
@@ -665,7 +668,7 @@ static int blk_mq_init_hctx(struct blk_mq_hw_ctx *hctx, 
void *data,
 {
struct blkfront_info *info = (struct blkfront_info *)data;
 
-   hctx->driver_data = >rinfo;
+   hctx->driver_data = >rinfo[index];
return 0;
 }
 
@@ -924,8 +927,7 @@ static int xlvbd_alloc_gendisk(blkif_sector_t capacity,
 
 static void xlvbd_release_gendisk(struct blkfront_info *info)
 {
-   unsigned int minor, nr_minors;
-   struct blkfront_ring_info *rinfo = >rinfo;
+   unsigned int minor, nr_minors, i;
 
if (info->rq == NULL)
return;
@@ -933,11 +935,15 @@ static void xlvbd_release_gendisk(struct blkfront_info 
*info)
/* No more blkif_request(). */
blk_mq_stop_hw_queues(info->rq);
 
-   /* No more gnttab callback work. */
-   gnttab_cancel_free_callback(>callback);
+   for (i = 0; i < info->nr_rings; i++) {
+   struct blkfront_ring_info *rinfo = >rinfo[i];
 
-   /* Flush gnttab callback work. Must be done with no locks held. */
-   flush_work(>work);
+   /* No more gnttab callback work. */
+   gnttab_cancel_free_callback(>callback);
+
+   /* Flush gnttab callback work. Must be done with no locks held. 
*/
+   flush_work(>work);
+   }
 
del_gendisk(info->gd);
 
@@ -970,37 +976,11 @@ static void blkif_restart_queue(struct work_struct *work)
spin_unlock_irq(>dev_info->io_lock);
 }
 
-static void blkif_free(struct blkfront_info *info, int suspend)
+static void blkif_free_ring(struct blkfront_ring_info *rinfo)
 {
struct grant *persistent_gnt;
-   struct grant *n;
+   struct blkfront_info *info = rinfo->dev_info;
int i, j, segs;
-   struct blkfront_ring_info *rinfo = >rinfo;
-
-   /* Prevent new requests being issued until we fix things up. */
-   spin_lock_irq(>io_lock);
-   info->connected = suspend ?
-   BLKIF_STATE_SUSPENDED : BLKIF_STATE_DISCONNECTED;
-   /* No more blkif_request(). */
-   if (info->rq)
-   blk_mq_stop_hw_queues(info->rq);
-
-   /* Remove all persistent grants */
-   if (!list_empty(>grants)) {
-   list_for_each_entry_safe(persistent_gnt, n,
->grants, node) {
-   list_del(_gnt->node);
-   if (persistent_gnt-

[Xen-devel] [PATCH v4 02/10] xen/blkfront: separate per ring information out of device info

2015-11-01 Thread Bob Liu
Split per ring information to an new structure "blkfront_ring_info".

A ring is the representation of a hardware queue, every vbd device can associate
with one or more rings depending on how many hardware queues/rings to be used.

This patch is a preparation for supporting real multi hardware queues/rings.

Signed-off-by: Arianna Avanzini <avanzini.aria...@gmail.com>
Signed-off-by: Bob Liu <bob@oracle.com>
---
 drivers/block/xen-blkfront.c | 321 ---
 1 file changed, 178 insertions(+), 143 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index a69c02d..2a557e4 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -115,6 +115,23 @@ MODULE_PARM_DESC(max_ring_page_order, "Maximum order of 
pages to be used for the
 #define RINGREF_NAME_LEN (20)
 
 /*
+ *  Per-ring info.
+ *  Every blkfront device can associate with one or more blkfront_ring_info,
+ *  depending on how many hardware queues/rings to be used.
+ */
+struct blkfront_ring_info {
+   struct blkif_front_ring ring;
+   unsigned int ring_ref[XENBUS_MAX_RING_PAGES];
+   unsigned int evtchn, irq;
+   struct work_struct work;
+   struct gnttab_free_callback callback;
+   struct blk_shadow shadow[BLK_MAX_RING_SIZE];
+   struct list_head indirect_pages;
+   unsigned long shadow_free;
+   struct blkfront_info *dev_info;
+};
+
+/*
  * We have one of these per vbd, whether ide, scsi or 'other'.  They
  * hang in private_data off the gendisk structure. We may end up
  * putting all kinds of interesting stuff here :-)
@@ -128,18 +145,10 @@ struct blkfront_info
int vdevice;
blkif_vdev_t handle;
enum blkif_state connected;
-   int ring_ref[XENBUS_MAX_RING_PAGES];
unsigned int nr_ring_pages;
-   struct blkif_front_ring ring;
-   unsigned int evtchn, irq;
struct request_queue *rq;
-   struct work_struct work;
-   struct gnttab_free_callback callback;
-   struct blk_shadow shadow[BLK_MAX_RING_SIZE];
struct list_head grants;
-   struct list_head indirect_pages;
unsigned int persistent_gnts_c;
-   unsigned long shadow_free;
unsigned int feature_flush;
unsigned int feature_discard:1;
unsigned int feature_secdiscard:1;
@@ -149,6 +158,7 @@ struct blkfront_info
unsigned int max_indirect_segments;
int is_ready;
struct blk_mq_tag_set tag_set;
+   struct blkfront_ring_info rinfo;
 };
 
 static unsigned int nr_minors;
@@ -179,33 +189,35 @@ static DEFINE_SPINLOCK(minor_lock);
 #define INDIRECT_GREFS(_segs) \
((_segs + SEGS_PER_INDIRECT_FRAME - 1)/SEGS_PER_INDIRECT_FRAME)
 
-static int blkfront_setup_indirect(struct blkfront_info *info);
+static int blkfront_setup_indirect(struct blkfront_ring_info *rinfo);
 static int blkfront_gather_backend_features(struct blkfront_info *info);
 
-static int get_id_from_freelist(struct blkfront_info *info)
+static int get_id_from_freelist(struct blkfront_ring_info *rinfo)
 {
-   unsigned long free = info->shadow_free;
-   BUG_ON(free >= BLK_RING_SIZE(info));
-   info->shadow_free = info->shadow[free].req.u.rw.id;
-   info->shadow[free].req.u.rw.id = 0x0fee; /* debug */
+   unsigned long free = rinfo->shadow_free;
+
+   BUG_ON(free >= BLK_RING_SIZE(rinfo->dev_info));
+   rinfo->shadow_free = rinfo->shadow[free].req.u.rw.id;
+   rinfo->shadow[free].req.u.rw.id = 0x0fee; /* debug */
return free;
 }
 
-static int add_id_to_freelist(struct blkfront_info *info,
+static int add_id_to_freelist(struct blkfront_ring_info *rinfo,
   unsigned long id)
 {
-   if (info->shadow[id].req.u.rw.id != id)
+   if (rinfo->shadow[id].req.u.rw.id != id)
return -EINVAL;
-   if (info->shadow[id].request == NULL)
+   if (rinfo->shadow[id].request == NULL)
return -EINVAL;
-   info->shadow[id].req.u.rw.id  = info->shadow_free;
-   info->shadow[id].request = NULL;
-   info->shadow_free = id;
+   rinfo->shadow[id].req.u.rw.id  = rinfo->shadow_free;
+   rinfo->shadow[id].request = NULL;
+   rinfo->shadow_free = id;
return 0;
 }
 
-static int fill_grant_buffer(struct blkfront_info *info, int num)
+static int fill_grant_buffer(struct blkfront_ring_info *rinfo, int num)
 {
+   struct blkfront_info *info = rinfo->dev_info;
struct page *granted_page;
struct grant *gnt_list_entry, *n;
int i = 0;
@@ -341,8 +353,8 @@ static void xlbd_release_minors(unsigned int minor, 
unsigned int nr)
 
 static void blkif_restart_queue_callback(void *arg)
 {
-   struct blkfront_info *info = (struct blkfront_info *)arg;
-   schedule_work(>work);
+   struct blkfront_ring_info *rinfo = (struct blkfront_ring_info *)arg;
+

[Xen-devel] [PATCH v4 05/10] xen/blkfront: negotiate number of queues/rings to be used with backend

2015-11-01 Thread Bob Liu
The number of hardware queues for xen/blkfront is set by parameter
'max_queues'(default 4), while the max value xen/blkback supported is notified
through xenstore("multi-queue-max-queues").

The negotiated number is the smaller one and would be written back to xenstore
as "multi-queue-num-queues", blkback need to read this negotiated number.

Signed-off-by: Bob Liu <bob@oracle.com>
---
 drivers/block/xen-blkfront.c | 166 +++
 1 file changed, 120 insertions(+), 46 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 8cc5995..23096d7 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -98,6 +98,10 @@ static unsigned int xen_blkif_max_segments = 32;
 module_param_named(max, xen_blkif_max_segments, int, S_IRUGO);
 MODULE_PARM_DESC(max, "Maximum amount of segments in indirect requests 
(default is 32)");
 
+static unsigned int xen_blkif_max_queues = 4;
+module_param_named(max_queues, xen_blkif_max_queues, uint, S_IRUGO);
+MODULE_PARM_DESC(max_queues, "Maximum number of hardware queues/rings used per 
virtual disk");
+
 /*
  * Maximum order of pages to be used for the shared ring between front and
  * backend, 4KB page granularity is used.
@@ -113,6 +117,7 @@ MODULE_PARM_DESC(max_ring_page_order, "Maximum order of 
pages to be used for the
  * characters are enough. Define to 20 to keep consist with backend.
  */
 #define RINGREF_NAME_LEN (20)
+#define QUEUE_NAME_LEN (12)
 
 /*
  *  Per-ring info.
@@ -695,7 +700,7 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 
sector_size,
 
memset(>tag_set, 0, sizeof(info->tag_set));
info->tag_set.ops = _mq_ops;
-   info->tag_set.nr_hw_queues = 1;
+   info->tag_set.nr_hw_queues = info->nr_rings;
info->tag_set.queue_depth =  BLK_RING_SIZE(info);
info->tag_set.numa_node = NUMA_NO_NODE;
info->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
@@ -1352,6 +1357,51 @@ fail:
return err;
 }
 
+static int write_per_ring_nodes(struct xenbus_transaction xbt,
+   struct blkfront_ring_info *rinfo, const char 
*dir)
+{
+   int err, i;
+   const char *message = NULL;
+   struct blkfront_info *info = rinfo->dev_info;
+
+   if (info->nr_ring_pages == 1) {
+   err = xenbus_printf(xbt, dir, "ring-ref", "%u", 
rinfo->ring_ref[0]);
+   if (err) {
+   message = "writing ring-ref";
+   goto abort_transaction;
+   }
+   pr_info("%s: write ring-ref:%d\n", dir, rinfo->ring_ref[0]);
+   } else {
+   for (i = 0; i < info->nr_ring_pages; i++) {
+   char ring_ref_name[RINGREF_NAME_LEN];
+
+   snprintf(ring_ref_name, RINGREF_NAME_LEN, "ring-ref%u", 
i);
+   err = xenbus_printf(xbt, dir, ring_ref_name,
+   "%u", rinfo->ring_ref[i]);
+   if (err) {
+   message = "writing ring-ref";
+   goto abort_transaction;
+   }
+   pr_info("%s: write ring-ref:%d\n", dir, 
rinfo->ring_ref[i]);
+   }
+   }
+
+   err = xenbus_printf(xbt, dir, "event-channel", "%u", rinfo->evtchn);
+   if (err) {
+   message = "writing event-channel";
+   goto abort_transaction;
+   }
+   pr_info("%s: write event-channel:%d\n", dir, rinfo->evtchn);
+
+   return 0;
+
+abort_transaction:
+   xenbus_transaction_end(xbt, 1);
+   if (message)
+   xenbus_dev_fatal(info->xbdev, err, "%s", message);
+
+   return err;
+}
 
 /* Common code used when first setting up, and when resuming. */
 static int talk_to_blkback(struct xenbus_device *dev,
@@ -1362,7 +1412,6 @@ static int talk_to_blkback(struct xenbus_device *dev,
int err, i;
unsigned int max_page_order = 0;
unsigned int ring_page_order = 0;
-   struct blkfront_ring_info *rinfo;
 
err = xenbus_scanf(XBT_NIL, info->xbdev->otherend,
   "max-ring-page-order", "%u", _page_order);
@@ -1374,7 +1423,8 @@ static int talk_to_blkback(struct xenbus_device *dev,
}
 
for (i = 0; i < info->nr_rings; i++) {
-   rinfo = >rinfo[i];
+   struct blkfront_ring_info *rinfo = >rinfo[i];
+
/* Create shared ring, alloc event channel. */
err = setup_blkring(dev, rinfo);
if (err)
@@ -1388,45 +1438,51 @@ again:
goto destroy_blkring;
}
 
-   if (info-&

[Xen-devel] [PATCH v4 00/10] xen-block: multi hardware-queues/rings support

2015-11-01 Thread Bob Liu
Note: These patches were based on original work of Arianna's internship for
GNOME's Outreach Program for Women.

After switching to the blk-mq API, a guest has more than one (nr_vcpus)
software request queue associated with each block front. These queues can be
mapped over several rings (hardware queues) to the backend, making it very
easy to run multiple threads on the backend for a single virtual disk.

By having different threads issue requests at the same time, the I/O
performance of the guest can be improved significantly.
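
As an illustration (a sketch only, not code from this series; my_init_tag_set
and my_mq_ops are made-up names), this is roughly how a blk-mq driver exposes
one hardware queue per backend ring:

#include <linux/blk-mq.h>
#include <linux/numa.h>
#include <linux/string.h>

/* Hypothetical helper: set up a blk-mq tag set with one hardware queue
 * per backend ring, so requests can be dispatched on all rings in
 * parallel. Returns 0 on success. */
static int my_init_tag_set(struct blk_mq_tag_set *set,
			   const struct blk_mq_ops *my_mq_ops,
			   unsigned int nr_rings, unsigned int depth)
{
	memset(set, 0, sizeof(*set));
	set->ops = my_mq_ops;
	set->nr_hw_queues = nr_rings;	/* one hw queue per ring */
	set->queue_depth = depth;
	set->numa_node = NUMA_NO_NODE;
	set->flags = BLK_MQ_F_SHOULD_MERGE;
	return blk_mq_alloc_tag_set(set);
}

The series wires this up in xlvbd_init_blk_queue by setting
tag_set.nr_hw_queues to info->nr_rings.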

The test below was done with the null_blk driver:
dom0: v4.3-rc7 16vcpus 10GB "modprobe null_blk"
domU: v4.3-rc7 16vcpus 10GB

[test]
rw=read
direct=1
ioengine=libaio
bs=4k
time_based
runtime=30
filename=/dev/xvdb
numjobs=16
iodepth=64
iodepth_batch=64
iodepth_batch_complete=64
group_reporting

        domU(orig)  4 queues     8 queues  16 queues
iops:   690k        1024k(+30%)  800k      750k

After patches 9 and 10:
        domU(orig)  4 queues      8 queues  16 queues
iops:   690k        1600k(+100%)  1450k     1320k

Chart: https://www.dropbox.com/s/agrcy2pbzbsvmwv/iops.png?dl=0

Huge improvements were also seen for writes and on real SSD storage.

---
v4:
 * Rebase to v4.3-rc7
 * Comments from Roger

v3:
 * Rebased to v4.2-rc8

Bob Liu (10):
  xen/blkif: document blkif multi-queue/ring extension
  xen/blkfront: separate per ring information out of device info
  xen/blkfront: pseudo support for multi hardware queues/rings
  xen/blkfront: split per device io_lock
  xen/blkfront: negotiate number of queues/rings to be used with backend
  xen/blkback: separate ring information out of struct xen_blkif
  xen/blkback: pseudo support for multi hardware queues/rings
  xen/blkback: get the number of hardware queues/rings from blkfront
  xen/blkfront: make persistent grants per-queue
  xen/blkback: make pool of persistent grants and free pages per-queue

 drivers/block/xen-blkback/blkback.c | 386 ++-
 drivers/block/xen-blkback/common.h  |  78 ++--
 drivers/block/xen-blkback/xenbus.c  | 359 --
 drivers/block/xen-blkfront.c| 718 ++--
 include/xen/interface/io/blkif.h|  48 +++
 5 files changed, 971 insertions(+), 618 deletions(-)

-- 
1.8.3.1




[Xen-devel] [PATCH v4 04/10] xen/blkfront: split per device io_lock

2015-11-01 Thread Bob Liu
The per-device io_lock became a coarse-grained lock after multi queues/rings
were introduced; this patch introduces a fine-grained ring_lock for each ring.

The old io_lock is renamed to dev_lock and now only protects the ->grants list,
which is shared by all rings.
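
As a sketch of the resulting discipline (illustrative only, not part of the
patch; example_put_grant is a made-up helper):

/* Per-ring state is touched under the ring's own ring_lock, while the
 * ->grants list shared by all rings is touched under the per-device
 * dev_lock. */
static void example_put_grant(struct blkfront_ring_info *rinfo,
			      struct grant *gnt)
{
	struct blkfront_info *info = rinfo->dev_info;
	unsigned long flags;

	spin_lock_irqsave(&rinfo->ring_lock, flags);
	/* ... update bookkeeping private to this ring ... */
	spin_unlock_irqrestore(&rinfo->ring_lock, flags);

	spin_lock_irqsave(&info->dev_lock, flags);
	list_add(&gnt->node, &info->grants);	/* shared by all rings */
	spin_unlock_irqrestore(&info->dev_lock, flags);
}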

Signed-off-by: Bob Liu <bob@oracle.com>
---
 drivers/block/xen-blkfront.c | 57 ++--
 1 file changed, 34 insertions(+), 23 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index eab78e7..8cc5995 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -121,6 +121,7 @@ MODULE_PARM_DESC(max_ring_page_order, "Maximum order of pages to be used for the
  */
 struct blkfront_ring_info {
struct blkif_front_ring ring;
+   spinlock_t ring_lock;
unsigned int ring_ref[XENBUS_MAX_RING_PAGES];
unsigned int evtchn, irq;
struct work_struct work;
@@ -138,7 +139,8 @@ struct blkfront_ring_info {
  */
 struct blkfront_info
 {
-   spinlock_t io_lock;
+	/* Lock to protect info->grants list shared by multi rings */
+   spinlock_t dev_lock;
struct mutex mutex;
struct xenbus_device *xbdev;
struct gendisk *gd;
@@ -224,6 +226,7 @@ static int fill_grant_buffer(struct blkfront_ring_info *rinfo, int num)
struct grant *gnt_list_entry, *n;
int i = 0;
 
+	spin_lock_irq(&info->dev_lock);
while(i < num) {
gnt_list_entry = kzalloc(sizeof(struct grant), GFP_NOIO);
if (!gnt_list_entry)
@@ -242,6 +245,7 @@ static int fill_grant_buffer(struct blkfront_ring_info *rinfo, int num)
 	list_add(&gnt_list_entry->node, &info->grants);
i++;
}
+	spin_unlock_irq(&info->dev_lock);
 
return 0;
 
@@ -254,6 +258,7 @@ out_of_memory:
kfree(gnt_list_entry);
i--;
}
+	spin_unlock_irq(&info->dev_lock);
BUG_ON(i != 0);
return -ENOMEM;
 }
@@ -265,6 +270,7 @@ static struct grant *get_grant(grant_ref_t *gref_head,
struct grant *gnt_list_entry;
unsigned long buffer_gfn;
 
+	spin_lock(&info->dev_lock);
 	BUG_ON(list_empty(&info->grants));
 	gnt_list_entry = list_first_entry(&info->grants, struct grant,
 					  node);
@@ -272,8 +278,10 @@ static struct grant *get_grant(grant_ref_t *gref_head,
 
if (gnt_list_entry->gref != GRANT_INVALID_REF) {
info->persistent_gnts_c--;
+		spin_unlock(&info->dev_lock);
return gnt_list_entry;
}
+	spin_unlock(&info->dev_lock);
 
/* Assign a gref to this page */
gnt_list_entry->gref = gnttab_claim_grant_reference(gref_head);
@@ -639,7 +647,7 @@ static int blkif_queue_rq(struct blk_mq_hw_ctx *hctx,
 	struct blkfront_ring_info *rinfo = (struct blkfront_ring_info *)hctx->driver_data;
 
blk_mq_start_request(qd->rq);
-	spin_lock_irq(&info->io_lock);
+	spin_lock_irq(&rinfo->ring_lock);
 	if (RING_FULL(&rinfo->ring))
goto out_busy;
 
@@ -650,15 +658,15 @@ static int blkif_queue_rq(struct blk_mq_hw_ctx *hctx,
goto out_busy;
 
flush_requests(rinfo);
-	spin_unlock_irq(&info->io_lock);
+	spin_unlock_irq(&rinfo->ring_lock);
return BLK_MQ_RQ_QUEUE_OK;
 
 out_err:
-	spin_unlock_irq(&info->io_lock);
+	spin_unlock_irq(&rinfo->ring_lock);
return BLK_MQ_RQ_QUEUE_ERROR;
 
 out_busy:
-	spin_unlock_irq(&info->io_lock);
+	spin_unlock_irq(&rinfo->ring_lock);
blk_mq_stop_hw_queue(hctx);
return BLK_MQ_RQ_QUEUE_BUSY;
 }
@@ -959,21 +967,22 @@ static void xlvbd_release_gendisk(struct blkfront_info *info)
info->gd = NULL;
 }
 
-/* Must be called with io_lock holded */
 static void kick_pending_request_queues(struct blkfront_ring_info *rinfo)
 {
+   unsigned long flags;
+
+	spin_lock_irqsave(&rinfo->ring_lock, flags);
 	if (!RING_FULL(&rinfo->ring))
 		blk_mq_start_stopped_hw_queues(rinfo->dev_info->rq, true);
+	spin_unlock_irqrestore(&rinfo->ring_lock, flags);
 }
 
 static void blkif_restart_queue(struct work_struct *work)
 {
 	struct blkfront_ring_info *rinfo = container_of(work, struct blkfront_ring_info, work);
 
-	spin_lock_irq(&rinfo->dev_info->io_lock);
if (rinfo->dev_info->connected == BLKIF_STATE_CONNECTED)
kick_pending_request_queues(rinfo);
-	spin_unlock_irq(&rinfo->dev_info->io_lock);
 }
 
 static void blkif_free_ring(struct blkfront_ring_info *rinfo)
@@ -1065,7 +1074,7 @@ static void blkif_free(struct blkfront_info *info, int suspend)
int i;
 
/* Prevent new requests being issued until we fix things up. */
-	spin_lock_irq(&info->io_lock);
+	spin_lock_irq(&info->dev_lock);
info->connected = suspend ?
BLKIF_STATE_SUSPENDED : BLKIF_STATE_DISCONNECTED;
 

[Xen-devel] [PATCH v4 01/10] xen/blkif: document blkif multi-queue/ring extension

2015-11-01 Thread Bob Liu
Document the multi-queue/ring feature in terms of XenStore keys to be written by
the backend and by the frontend.
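
As a sketch of the frontend side of this negotiation (illustrative only;
negotiate_nr_queues is a made-up helper, not part of this patch):

#include <linux/kernel.h>
#include <xen/xenbus.h>

/* Hypothetical helper: clamp the number of queues the frontend will use
 * to what the backend advertises. A backend that does not write
 * "multi-queue-max-queues" does not support the extension. */
static unsigned int negotiate_nr_queues(struct xenbus_device *dev,
					unsigned int requested)
{
	unsigned int backend_max = 1;

	if (xenbus_scanf(XBT_NIL, dev->otherend,
			 "multi-queue-max-queues", "%u", &backend_max) != 1)
		backend_max = 1;

	return min(requested, backend_max);	/* must be greater than zero */
}

The frontend then writes the chosen value to "multi-queue-num-queues" before
writing the per-queue keys described in the comment below.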

Signed-off-by: Bob Liu <bob@oracle.com>
--
v2:
Add descriptions of the interaction with the multi-page ring buffer feature.
---
 include/xen/interface/io/blkif.h | 48 
 1 file changed, 48 insertions(+)

diff --git a/include/xen/interface/io/blkif.h b/include/xen/interface/io/blkif.h
index c33e1c4..8b8cfad 100644
--- a/include/xen/interface/io/blkif.h
+++ b/include/xen/interface/io/blkif.h
@@ -28,6 +28,54 @@ typedef uint16_t blkif_vdev_t;
 typedef uint64_t blkif_sector_t;
 
 /*
+ * Multiple hardware queues/rings:
+ * If supported, the backend will write the key "multi-queue-max-queues" to
+ * the directory for that vbd, and set its value to the maximum supported
+ * number of queues.
+ * Frontends that are aware of this feature and wish to use it can write the
+ * key "multi-queue-num-queues" with the number they wish to use, which must be
+ * greater than zero, and no more than the value reported by the backend in
+ * "multi-queue-max-queues".
+ *
+ * For frontends requesting just one queue, the usual event-channel and
+ * ring-ref keys are written as before, simplifying the backend processing
+ * to avoid distinguishing between a frontend that doesn't understand the
+ * multi-queue feature, and one that does, but requested only one queue.
+ *
+ * Frontends requesting two or more queues must not write the toplevel
+ * event-channel and ring-ref keys, instead writing those keys under sub-keys
+ * having the name "queue-N" where N is the integer ID of the queue/ring for
+ * which those keys belong. Queues are indexed from zero.
+ * For example, a frontend with two queues must write the following set of
+ * queue-related keys:
+ *
+ * /local/domain/1/device/vbd/0/multi-queue-num-queues = "2"
+ * /local/domain/1/device/vbd/0/queue-0 = ""
+ * /local/domain/1/device/vbd/0/queue-0/ring-ref = "<ring-ref#0>"
+ * /local/domain/1/device/vbd/0/queue-0/event-channel = "<evtchn#0>"
+ * /local/domain/1/device/vbd/0/queue-1 = ""
+ * /local/domain/1/device/vbd/0/queue-1/ring-ref = "<ring-ref#1>"
+ * /local/domain/1/device/vbd/0/queue-1/event-channel = "<evtchn#1>"
+ *
+ * It is also possible to use multiple queues/rings together with the
+ * multi-page ring buffer feature.
+ * For example, a frontend requesting two queues/rings, where each ring
+ * buffer spans two pages, must write the following set of related keys:
+ *
+ * /local/domain/1/device/vbd/0/multi-queue-num-queues = "2"
+ * /local/domain/1/device/vbd/0/ring-page-order = "1"
+ * /local/domain/1/device/vbd/0/queue-0 = ""
+ * /local/domain/1/device/vbd/0/queue-0/ring-ref0 = "<ring-ref#0>"
+ * /local/domain/1/device/vbd/0/queue-0/ring-ref1 = "<ring-ref#1>"
+ * /local/domain/1/device/vbd/0/queue-0/event-channel = "<evtchn#0>"
+ * /local/domain/1/device/vbd/0/queue-1 = ""
+ * /local/domain/1/device/vbd/0/queue-1/ring-ref0 = "<ring-ref#2>"
+ * /local/domain/1/device/vbd/0/queue-1/ring-ref1 = "<ring-ref#3>"
+ * /local/domain/1/device/vbd/0/queue-1/event-channel = "<evtchn#1>"
+ *
+ */
+
+/*
  * REQUEST CODES.
  */
 #define BLKIF_OP_READ  0
-- 
1.8.3.1



