Re: [Xen-devel] [PATCH 0001/001] xen: multi page ring support for block devices
>>> On 15.03.12 at 09:51, Ian Campbell wrote:
> On Thu, 2012-03-15 at 08:03 +0000, Jan Beulich wrote:
>> >>> On 14.03.12 at 18:01, Justin Gibbs wrote:
>> > While we're talking about fixing ring data structures, can RING_IDX
>> > be defined as "uint32_t" instead of "unsigned int"? The structure
>> > padding in the ring macros assumes RING_IDX is exactly 4 bytes,
>> > so this should be made explicit. ILP64 machines may still be a way
>> > out, but the use of non-fixed-size types in places where size really
>> > matters just isn't clean.
>>
>> Yes, if we're going to rev the interface, then any such flaws should be
>> corrected.
>
> There has been talk of doing something similar for netif too. IIRC the
> netchannel2 work included a new generic ring scheme with support for
> variable-sized req/rsp elements and such.
>
> If we are going to rev the rings then should we try and use a common
> ring mechanism? I think so. If so then we could do worse than to start
> from the netchannel2 ring stuff and/or concepts.
>
> Looks like that is
> http://xenbits.xen.org/ext/netchannel2/linux-2.6.18/log/075f6677a290/include/xen/interface/io/uring.h
> still a bit nc2-specific though.

Taking the concept (and the implementation as a starting point) would
seem like a good idea to me. Separate request and reply rings as well
as variable-size entries would certainly benefit blkif too.

Jan
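For readers unfamiliar with the nc2-style layout being referred to, here
is a rough C sketch of the idea - separate request and reply rings
carrying variable-size, length-prefixed entries. The names and field
layout are illustrative assumptions, not the actual uring.h definitions:

    #include <stdint.h>

    /* One header per message; size is what makes entries variable-length. */
    struct vring_msg_hdr {
        uint8_t  type;      /* request/response opcode */
        uint8_t  flags;
        uint16_t size;      /* total entry length, 8-byte aligned */
    };

    /* Each direction gets its own ring, indexed in bytes rather than
       in fixed-size slots. */
    struct vring {
        uint32_t prod;      /* byte offset of next entry to produce */
        uint32_t cons;      /* byte offset of next entry to consume */
        uint8_t  payload[]; /* vring_msg_hdr-prefixed entries */
    };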
Re: [Xen-devel] [PATCH 0001/001] xen: multi page ring support for block devices
On Thu, 2012-03-15 at 08:03 +0000, Jan Beulich wrote:
> >>> On 14.03.12 at 18:01, Justin Gibbs wrote:
> > While we're talking about fixing ring data structures, can RING_IDX
> > be defined as "uint32_t" instead of "unsigned int"? The structure
> > padding in the ring macros assumes RING_IDX is exactly 4 bytes,
> > so this should be made explicit. ILP64 machines may still be a way
> > out, but the use of non-fixed-size types in places where size really
> > matters just isn't clean.
>
> Yes, if we're going to rev the interface, then any such flaws should be
> corrected.

There has been talk of doing something similar for netif too. IIRC the
netchannel2 work included a new generic ring scheme with support for
variable-sized req/rsp elements and such.

If we are going to rev the rings then should we try and use a common
ring mechanism? I think so. If so then we could do worse than to start
from the netchannel2 ring stuff and/or concepts.

Looks like that is
http://xenbits.xen.org/ext/netchannel2/linux-2.6.18/log/075f6677a290/include/xen/interface/io/uring.h
still a bit nc2-specific though.

Ian.
Re: [Xen-devel] [PATCH 0001/001] xen: multi page ring support for block devices
>>> On 14.03.12 at 18:17, "Justin T. Gibbs" wrote:
> On Mar 6, 2012, at 1:34 AM, Jan Beulich wrote:
>> On 05.03.12 at 22:49, Santosh Jodh wrote:
> …
>>> +       }
>>> +
>>>        /* Create shared ring, alloc event channel. */
>>>        err = setup_blkring(dev, info);
>>>        if (err)
>>> @@ -889,12 +916,35 @@ again:
>>>                goto destroy_blkring;
>>>        }
>>>
>>> -       err = xenbus_printf(xbt, dev->nodename,
>>> -                           "ring-ref", "%u", info->ring_ref);
>>> -       if (err) {
>>> -               message = "writing ring-ref";
>>> -               goto abort_transaction;
>>> +       if (legacy_backend) {
>>
>> Why not use the simpler interface always when info->ring_order == 0?
>
> Because, as I just found out today via a FreeBSD bug report, that's
> not how XenServer works. If the front-end publishes "ring-page-order",
> the backend assumes the "ring-refNN" XenStore nodes are in effect,
> even if the order is 0.

I was certainly implying not to write the ring-page-order and
num-ring-pages nodes in that case.

> I'm working on a documentation update for blkif.h now.
>
> --
> Justin
Re: [Xen-devel] [PATCH 0001/001] xen: multi page ring support for block devices
>>> On 14.03.12 at 18:01, Justin Gibbs wrote:
> While we're talking about fixing ring data structures, can RING_IDX
> be defined as "uint32_t" instead of "unsigned int"? The structure
> padding in the ring macros assumes RING_IDX is exactly 4 bytes,
> so this should be made explicit. ILP64 machines may still be a way
> out, but the use of non-fixed-size types in places where size really
> matters just isn't clean.

Yes, if we're going to rev the interface, then any such flaws should be
corrected.

(Also shrinking the Cc list a little.)

Jan
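A minimal sketch of the suggested change, with a compile-time check that
makes the 4-byte assumption explicit (the check is illustrative, not
part of ring.h):

    #include <stdint.h>

    typedef uint32_t RING_IDX;

    /* Fails to compile if RING_IDX ever stops being exactly 4 bytes,
       which is what the padding arithmetic in the ring macros assumes. */
    typedef char ring_idx_size_check[sizeof(RING_IDX) == 4 ? 1 : -1];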
Re: [Xen-devel] [PATCH 0001/001] xen: multi page ring support for block devices
On Mar 6, 2012, at 1:34 AM, Jan Beulich wrote:
> On 05.03.12 at 22:49, Santosh Jodh wrote:
…
>> +       }
>> +
>>        /* Create shared ring, alloc event channel. */
>>        err = setup_blkring(dev, info);
>>        if (err)
>> @@ -889,12 +916,35 @@ again:
>>                goto destroy_blkring;
>>        }
>>
>> -       err = xenbus_printf(xbt, dev->nodename,
>> -                           "ring-ref", "%u", info->ring_ref);
>> -       if (err) {
>> -               message = "writing ring-ref";
>> -               goto abort_transaction;
>> +       if (legacy_backend) {
>
> Why not use the simpler interface always when info->ring_order == 0?

Because, as I just found out today via a FreeBSD bug report, that's
not how XenServer works. If the front-end publishes "ring-page-order",
the backend assumes the "ring-refNN" XenStore nodes are in effect,
even if the order is 0.

I'm working on a documentation update for blkif.h now.

--
Justin
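For context, a sketch of the two XenStore layouts being distinguished
above, with placeholder grant-reference values; the multi-page node
names follow the "ring-refNN" convention mentioned in the thread, and
the exact spelling (ring-ref0 vs. ring-ref00) may differ per
implementation:

    # Legacy, single-page layout:
    ring-ref = "8"

    # Multi-page layout; per the XenServer behaviour described above,
    # merely publishing ring-page-order selects this form, even when
    # the order is 0:
    ring-page-order = "0"
    ring-ref0 = "8"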
Re: [Xen-devel] [PATCH 0001/001] xen: multi page ring support for block devices
On Mar 14, 2012, at 9:34 AM, Jan Beulich wrote:
> On 14.03.12 at 07:32, Justin Gibbs wrote:
>> There's another problem here that I brought up during the Xen
>> Hack-a-thon. The ring macros require that the ring element count
>> be a power of two. This doesn't mean that the ring will be a power
>> of 2 pages in size. To illustrate this point, I modified the FreeBSD
>> blkback driver to provide negotiated ring stats via sysctl.
>>
>> Here's a connection to a Windows VM running the Citrix PV drivers:
>>
>>     dev.xbbd.2.max_requests: 128
>>     dev.xbbd.2.max_request_segments: 11
>>     dev.xbbd.2.max_request_size: 45056
>>     dev.xbbd.2.ring_elem_size: 108  <= 32-bit ABI
>>     dev.xbbd.2.ring_pages: 4
>>     dev.xbbd.2.ring_elements: 128
>>     dev.xbbd.2.ring_waste: 2496
>>
>> Over half a page is wasted when ring-page-order is 2. I'm sure you
>> can see where this is going. :-)
>
> Having looked a little closer at how the wasted space is progressing,
> I find myself in the odd position that I can't explain the original (and
> still active) definition of BLKIF_MAX_SEGMENTS_PER_REQUEST (11):
> with ring order zero, there are 0x240/0x1c0 bytes (32/64-bit
> respectively) unused. With 32 requests fitting in the ring, and with
> each segment occupying 6 bytes (padded to 8), in the 64-bit variant
> there's enough space for a 12th segment (32-bit would even have
> space for a 13th). Am I missing anything here?

I don't profess to know the real reason, but the only thing I can come
up with is a requirement/desire on some platforms for 16-byte alignment
of the request structures. This would make the largest possible
structure 112 bytes, not the 120 that would allow for more elements.

While we're talking about fixing ring data structures, can RING_IDX
be defined as "uint32_t" instead of "unsigned int"? The structure
padding in the ring macros assumes RING_IDX is exactly 4 bytes,
so this should be made explicit. ILP64 machines may still be a way
out, but the use of non-fixed-size types in places where size really
matters just isn't clean.

>> Here are the limits published by our backend to the XenStore:
>>
>>     max-ring-pages = "113"
>>     max-ring-page-order = "7"
>>     max-requests = "256"
>>     max-request-segments = "129"
>>     max-request-size = "524288"
>
> Oh, so this protocol doesn't require ring-pages (and max-ring-pages)
> to be a power of two? In which case I think it is a mistake to also
> advertise max-ring-page-order, as at least the (Linux) frontend code
> I know of interprets this as being able to set up a ring of (using the
> numbers above) 128 pages (unless, of course, your backend can deal
> with this regardless of the max-ring-pages value it announces).

The advertised max-ring-pages is sufficient to hold the maximum
allowed number of ring elements regardless of ABI. This is then
rounded up to the next power of 2 pages to get max-ring-page-order.
When the front-end negotiates, the backend just verifies that the
maximum number of ring elements in the specified ring size doesn't
exceed the backend's limit.

Fortunately, even with this large a ring, regardless of ABI, a given
page order computes to the same number of ring elements. You just
have more wasted space.

--
Justin
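A quick check of the 16-byte-alignment hypothesis above, assuming the
usual 64-bit blkif layout of a 24-byte request header followed by 8-byte
segment entries (these sizes are assumptions for illustration, not read
from blkif.h):

    #include <stdio.h>

    int main(void)
    {
        const unsigned hdr = 24, seg = 8;   /* assumed 64-bit ABI sizes */
        unsigned nsegs;

        for (nsegs = 10; nsegs <= 13; nsegs++) {
            unsigned size = hdr + nsegs * seg;
            printf("%2u segments -> %3u bytes%s\n", nsegs, size,
                   size % 16 ? "" : "  (16-byte multiple)");
        }
        /* 11 segments give 112 bytes, a 16-byte multiple;
           12 give 120, which is not. */
        return 0;
    }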
Re: [Xen-devel] [PATCH 0001/001] xen: multi page ring support for block devices
On Mar 7, 2012, at 2:33 AM, Jan Beulich wrote:
> On 06.03.12 at 18:20, Konrad Rzeszutek Wilk wrote:
>> -> XENBUS_MAX_RING_PAGES - why 2? Why not 4? What is the optimal
>> default size for SSD usage? 16?
>
> What do SSDs have to do with a XenBus definition? Imo it's wrong (and
> unnecessary) to introduce a limit at the XenBus level at all - each
> driver can do this for itself.
>
> As to the limit for SSDs in the block interface - I don't think the
> number of possibly simultaneous requests has anything to do with this.
> Instead, I'd expect the request number/size/segments extension that
> NetBSD apparently implements to possibly have an effect.
>
> Jan

There's another problem here that I brought up during the Xen
Hack-a-thon. The ring macros require that the ring element count
be a power of two. This doesn't mean that the ring will be a power
of 2 pages in size. To illustrate this point, I modified the FreeBSD
blkback driver to provide negotiated ring stats via sysctl.

Here's a connection to a Windows VM running the Citrix PV drivers:

    dev.xbbd.2.max_requests: 128
    dev.xbbd.2.max_request_segments: 11
    dev.xbbd.2.max_request_size: 45056
    dev.xbbd.2.ring_elem_size: 108  <= 32-bit ABI
    dev.xbbd.2.ring_pages: 4
    dev.xbbd.2.ring_elements: 128
    dev.xbbd.2.ring_waste: 2496

Over half a page is wasted when ring-page-order is 2. I'm sure you
can see where this is going. :-)

Here are the limits published by our backend to the XenStore:

    max-ring-pages = "113"
    max-ring-page-order = "7"
    max-requests = "256"
    max-request-segments = "129"
    max-request-size = "524288"

Because we allow so many concurrent, large requests in our product,
the ring wastage really adds up if the front end doesn't support
the "ring-pages" variant of the extension. However, you only need
a ring-page-order of 3 with this protocol to start seeing pages of
wasted ring space.

You don't really want to negotiate "ring-pages" either. The backends
often need to support multiple ABIs. I can easily construct a set
of limits for the FreeBSD blkback driver which will cause the ring
limits to vary by a page between the 32-bit and 64-bit ABIs.

With all this in mind, the backend must do a dance of rounding up,
taking the max of the ring sizes for the different ABIs, and then
validating the front-end published limits taking its ABI into
account. The front-end does some of this too. It's way too messy
and error-prone because we don't communicate the ring element limit
directly.

"max-ring-element-order" anyone? :-)

--
Justin
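A minimal sketch reproducing the ring_waste figure quoted above,
assuming 4 KiB pages, a 64-byte shared-ring header, and the 108-byte
32-bit element size from the sysctl output (the header size is an
assumption, not read from ring.h):

    #include <stdio.h>

    int main(void)
    {
        const unsigned page = 4096, hdr = 64, elem = 108, order = 2;
        unsigned bytes = (page << order) - hdr;
        unsigned elems = 1;

        /* The ring macros round the element count down to a power of two. */
        while (elems * 2 * elem <= bytes)
            elems *= 2;

        printf("elements: %u, waste: %u bytes\n",
               elems, bytes - elems * elem);
        /* Prints "elements: 128, waste: 2496 bytes", matching the
           dev.xbbd.2 numbers above. */
        return 0;
    }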
Re: [Xen-devel] [PATCH 0001/001] xen: multi page ring support for block devices
On Mon, 2012-03-05 at 21:49 +0000, Santosh Jodh wrote:
> From: Santosh Jodh
>
> Add support for multi page ring for block devices.
> The number of pages is configurable for blkback via module parameter.
> blkback reports max-ring-page-order to blkfront via xenstore.
> blkfront reports its supported ring-page-order to blkback via xenstore.
> blkfront reports multi page ring references via ring-refNN in xenstore.
> The change allows newer blkfront to work with older blkback and
> vice-versa.
> Based on original patch by Paul Durrant.
>
> Signed-off-by: Santosh Jodh

Doesn't the xenbus interface change deserve a separate patch (as a
prerequisite for the block device change)? Or at least please mention
the change in the commit message.

Wei.
Re: [Xen-devel] [PATCH 0001/001] xen: multi page ring support for block devices
>>> On 14.03.12 at 07:32, Justin Gibbs wrote:
> There's another problem here that I brought up during the Xen
> Hack-a-thon. The ring macros require that the ring element count
> be a power of two. This doesn't mean that the ring will be a power
> of 2 pages in size. To illustrate this point, I modified the FreeBSD
> blkback driver to provide negotiated ring stats via sysctl.
>
> Here's a connection to a Windows VM running the Citrix PV drivers:
>
>     dev.xbbd.2.max_requests: 128
>     dev.xbbd.2.max_request_segments: 11
>     dev.xbbd.2.max_request_size: 45056
>     dev.xbbd.2.ring_elem_size: 108  <= 32-bit ABI
>     dev.xbbd.2.ring_pages: 4
>     dev.xbbd.2.ring_elements: 128
>     dev.xbbd.2.ring_waste: 2496
>
> Over half a page is wasted when ring-page-order is 2. I'm sure you
> can see where this is going. :-)

Having looked a little closer at how the wasted space is progressing,
I find myself in the odd position that I can't explain the original (and
still active) definition of BLKIF_MAX_SEGMENTS_PER_REQUEST (11):
with ring order zero, there are 0x240/0x1c0 bytes (32/64-bit
respectively) unused. With 32 requests fitting in the ring, and with
each segment occupying 6 bytes (padded to 8), in the 64-bit variant
there's enough space for a 12th segment (32-bit would even have
space for a 13th). Am I missing anything here?

Plus, all this assumes a page size of 4k, yet ia64 iirc has always
been using 16k pages.

> Here are the limits published by our backend to the XenStore:
>
>     max-ring-pages = "113"
>     max-ring-page-order = "7"
>     max-requests = "256"
>     max-request-segments = "129"
>     max-request-size = "524288"

Oh, so this protocol doesn't require ring-pages (and max-ring-pages)
to be a power of two? In which case I think it is a mistake to also
advertise max-ring-page-order, as at least the (Linux) frontend code
I know of interprets this as being able to set up a ring of (using the
numbers above) 128 pages (unless, of course, your backend can deal
with this regardless of the max-ring-pages value it announces).

Jan
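Jan's arithmetic can be replayed in a few lines of C, assuming a 4 KiB
page, a 64-byte ring header, 32 requests per order-0 ring, and request
headers of 20 (32-bit) / 24 (64-bit) bytes plus 8 bytes per segment -
the header sizes are assumptions for illustration:

    #include <stdio.h>

    int main(void)
    {
        const unsigned ring = 4096 - 64, reqs = 32, seg = 8;
        const unsigned hdr[2] = { 20, 24 };   /* 32-bit, 64-bit ABI */
        int abi;

        for (abi = 0; abi < 2; abi++) {
            unsigned used = reqs * (hdr[abi] + 11 * seg);
            printf("%s: 0x%x bytes unused, room for %u segments\n",
                   abi ? "64-bit" : "32-bit", ring - used,
                   (ring / reqs - hdr[abi]) / seg);
        }
        /* Prints 0x240 unused / room for 13 (32-bit) and
           0x1c0 unused / room for 12 (64-bit), matching the text. */
        return 0;
    }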
Re: [Xen-devel] [PATCH 0001/001] xen: multi page ring support for block devices
>>> On 14.03.12 at 07:32, Justin Gibbs wrote:
> There's another problem here that I brought up during the Xen
> Hack-a-thon. The ring macros require that the ring element count
> be a power of two. This doesn't mean that the ring will be a power
> of 2 pages in size. To illustrate this point, I modified the FreeBSD
> blkback driver to provide negotiated ring stats via sysctl.
>
> Here's a connection to a Windows VM running the Citrix PV drivers:
>
>     dev.xbbd.2.max_requests: 128
>     dev.xbbd.2.max_request_segments: 11
>     dev.xbbd.2.max_request_size: 45056
>     dev.xbbd.2.ring_elem_size: 108  <= 32-bit ABI
>     dev.xbbd.2.ring_pages: 4
>     dev.xbbd.2.ring_elements: 128
>     dev.xbbd.2.ring_waste: 2496
>
> Over half a page is wasted when ring-page-order is 2. I'm sure you
> can see where this is going. :-)
>
> Here are the limits published by our backend to the XenStore:
>
>     max-ring-pages = "113"
>     max-ring-page-order = "7"
>     max-requests = "256"
>     max-request-segments = "129"
>     max-request-size = "524288"
>
> Because we allow so many concurrent, large requests in our product,
> the ring wastage really adds up if the front end doesn't support
> the "ring-pages" variant of the extension. However, you only need
> a ring-page-order of 3 with this protocol to start seeing pages of
> wasted ring space.
>
> You don't really want to negotiate "ring-pages" either. The backends
> often need to support multiple ABIs. I can easily construct a set
> of limits for the FreeBSD blkback driver which will cause the ring
> limits to vary by a page between the 32-bit and 64-bit ABIs.
>
> With all this in mind, the backend must do a dance of rounding up,
> taking the max of the ring sizes for the different ABIs, and then
> validating the front-end published limits taking its ABI into
> account. The front-end does some of this too. It's way too messy
> and error-prone because we don't communicate the ring element limit
> directly.
>
> "max-ring-element-order" anyone? :-)

Interesting observation - yes, I think deprecating both pre-existing
methods in favor of something along those lines would be desirable.
(But I'd favor not using the term "order" here, as it is - at least in
Linux - usually implied to be used on pages. "max-ringent-log2"
perhaps?)

What you say also implies that all the Linux backend patches currently
floating around are flawed in their way of calculating the number of
ring entries, as this number really depends on the protocol the
frontend advertises.

Further, if you're concerned about wasting ring space (and particularly
in the context of your request number/size/segments extension),
shouldn't we bother to define pairs (or larger groups) of struct
blkif_request_segment (as currently a quarter of the space is mere
padding)? Or split grefs from {first,last}_sect altogether?

Finally, while looking at all this again, I stumbled across the use of
blkif_vdev_t in the ring structures: at least Linux's blkback
completely ignores this field - {xen_,}vbd_translate() simply
overwrites what dispatch_rw_block_io() put there (and with this,
struct phys_req's dev and bdev members seem rather pointless too).
Does anyone recall what the original intention behind this request
field was? Allowing I/O on multiple devices over a single ring?

Bottom line - shouldn't we define a blkif2 interface to cleanly
accommodate all the various extensions (and do away with the protocol
variations)?

Jan
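A sketch of the kind of segment packing Jan floats above, assuming
today's entry is a 4-byte gref plus two 1-byte sector fields padded to
8 bytes; the paired struct name and layout are hypothetical:

    #include <stdint.h>

    typedef uint32_t grant_ref_t;

    /* Today (assumed layout): 6 bytes of data padded to 8 per segment. */
    struct blkif_request_segment {
        grant_ref_t gref;
        uint8_t     first_sect, last_sect;
    };

    /* Packing two segments recovers the padding: sizeof() is 12 bytes
       for two entries instead of 16. */
    struct blkif_segment_pair {
        grant_ref_t gref[2];
        uint8_t     first_sect[2];
        uint8_t     last_sect[2];
    };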
Re: [Xen-devel] [PATCH 0001/001] xen: multi page ring support for block devices
On Mar 7, 2012 4:33 AM, "Jan Beulich" wrote:
>
> >>> On 06.03.12 at 18:20, Konrad Rzeszutek Wilk wrote:
> > -> the usage of XenbusStateInitWait? Why do we introduce that? Looks
> > like a fix to something.
>
> No, this is required to get the negotiation working (the frontend must
> not try to read the new nodes until it can be certain that the backend
> populated them). However, as already pointed out in an earlier reply
> to Santosh, the way this is done here doesn't appear to allow for the
> backend to already be in InitWait state when the frontend gets
> invoked.

OK.

> > -> XENBUS_MAX_RING_PAGES - why 2? Why not 4? What is the optimal
> > default size for SSD usage? 16?
>
> What do SSDs have to do with a XenBus definition? Imo it's wrong (and
> unnecessary) to introduce a limit at the XenBus level at all - each
> driver can do this for itself.

The patch should mention what the benefit of a multi-page ring is.

> As to the limit for SSDs in the block interface - I don't think the
> number of possibly simultaneous requests has anything to do with this.
> Instead, I'd expect the request number/size/segments extension that
> NetBSD apparently implements to possibly have an effect.

... which sounds to me like increasing the bandwidth of the protocol.
That should be mentioned somewhere in the git description.

> Jan
Re: [Xen-devel] [PATCH 0001/001] xen: multi page ring support for block devices
>>> On 06.03.12 at 18:20, Konrad Rzeszutek Wilk wrote:
> -> the usage of XenbusStateInitWait? Why do we introduce that? Looks
> like a fix to something.

No, this is required to get the negotiation working (the frontend must
not try to read the new nodes until it can be certain that the backend
populated them). However, as already pointed out in an earlier reply
to Santosh, the way this is done here doesn't appear to allow for the
backend to already be in InitWait state when the frontend gets
invoked.

> -> XENBUS_MAX_RING_PAGES - why 2? Why not 4? What is the optimal
> default size for SSD usage? 16?

What do SSDs have to do with a XenBus definition? Imo it's wrong (and
unnecessary) to introduce a limit at the XenBus level at all - each
driver can do this for itself.

As to the limit for SSDs in the block interface - I don't think the
number of possibly simultaneous requests has anything to do with this.
Instead, I'd expect the request number/size/segments extension that
NetBSD apparently implements to possibly have an effect.

Jan