Re: [Cluster-devel] remove kernel_setsockopt and kernel_getsockopt
> Hi Dave,
>
> this series removes the kernel_setsockopt and kernel_getsockopt
> functions, and instead switches their users to small functions that
> implement setting (or in one case getting) a sockopt directly using a
> normal kernel function call with type safety and all the other benefits
> of not having a function call.
>
> In some cases these functions seem pretty heavy handed as they do a
> lock_sock even for just setting a single variable, but this mirrors the
> real setsockopt implementation - counter to that a few kernel drivers
> just set the fields directly already.
>
> Nevertheless the diffstat looks quite promising:
>
>  42 files changed, 721 insertions(+), 799 deletions(-)

For the nvme-tcp bits,

Acked-by: Sagi Grimberg
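As a rough userspace illustration of the pattern the series describes (one small, type-safe helper per socket option instead of a generic setsockopt-style entry point), the shape of such a helper might look like the sketch below. The helper name `sock_set_nodelay` and its userspace form are assumptions for illustration; the kernel-side helpers additionally take a `struct sock` and do `lock_sock()`.

```c
/* Minimal userspace sketch of the per-option helper pattern described
 * above.  sock_set_nodelay() is a hypothetical name; the point is that
 * the helper owns the option's level, name, type and size, so callers
 * get type safety instead of passing a raw buffer and length. */
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

static int sock_set_nodelay(int fd)
{
	int one = 1;	/* the helper, not the caller, knows the type */

	return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
}
```

Callers then write `sock_set_nodelay(fd)` instead of open-coding the option constants and buffer at every call site.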
Re: [Cluster-devel] [PATCH V14 00/18] block: support multi-page bvec
> V14:
>	- drop patch (patch 4 in V13) for renaming bvec helpers, as
>	  suggested by Jens
>	- use mp_bvec_* as multi-page bvec helper name
>	- fix one build issue, which is caused by one missing conversion
>	  of bio_for_each_segment_all in fs/gfs2
>	- fix one 32bit ARCH specific issue caused by segment boundary
>	  mask overflow

Hey Ming,

So is nvme-tcp also affected here? The only point where I see nvme-tcp
can be affected is when initializing a bvec iter using bio_segments(),
as everywhere else we use iters which should transparently work..

I see that loop was converted; does it mean that nvme-tcp needs to call
something like?
--
	bio_for_each_mp_bvec(bv, bio, iter)
		nr_bvecs++;
--
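The distinction behind the question above is that with multi-page bvecs one bvec entry can cover several pages, so counting bvecs and counting single-page segments (what `bio_segments()` reports) give different numbers. A toy model, with a simplified stand-in for `struct bio_vec` (not the kernel definition, and `TOY_PAGE_SIZE` is an assumed constant):

```c
/* Toy model of multi-page bvec counting: each toy_bvec may span several
 * pages; segment-style counting splits it back into per-page pieces. */
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096u

struct toy_bvec {
	unsigned int bv_len;
	unsigned int bv_offset;
};

/* bio_segments()-style count: single-page segments covered by the table */
static unsigned int toy_count_segments(const struct toy_bvec *bv, size_t n)
{
	unsigned int segs = 0;
	size_t i;

	for (i = 0; i < n; i++)
		segs += (bv[i].bv_offset + bv[i].bv_len + TOY_PAGE_SIZE - 1)
			/ TOY_PAGE_SIZE;
	return segs;
}
```

With two multi-page bvecs of 8 KiB and 4 KiB, the table has 2 entries but 3 single-page segments, which is why an iterator initialized with a bvec count versus a segment count can disagree.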
Re: [Cluster-devel] [PATCH V10 09/19] block: introduce bio_bvecs()
>> Wait, I see that the bvec is still a single array per bio. When you
>> said a table I thought you meant a two-dimensional array...
>
> I mean a new 1-d table A has to be created for the multiple bios in one
> rq, and built in the following way:
>
>	rq_for_each_bvec(tmp, rq, rq_iter)
>		*A++ = tmp;
>
> Then you can pass A to iov_iter_bvec() & send().
>
> Given it is over TCP, I guess it should be doable for you to
> preallocate one 256-bvec table in one page for each request, then set
> the max segment size as (unsigned int)-1 and the max segment number as
> 256; the preallocated table should work anytime.

A 256-bvec table is really a lot to preallocate, especially when it's
not needed; I can easily initialize the bvec_iter on the bio bvec. If
this involves preallocation for the worst case then I don't consider
this to be an improvement.
Re: [Cluster-devel] [PATCH V10 09/19] block: introduce bio_bvecs()
> Yeah, that is the most common example, given merge is enabled in most
> of the cases. If the driver or device doesn't care about merge, you can
> disable it and always get single-bio requests, then the bio's bvec
> table can be reused for send().

Does bvec_iter span bvecs with your patches? I didn't see that change?

Wait, I see that the bvec is still a single array per bio. When you said
a table I thought you meant a two-dimensional array...

Unless I'm mistaken, I think that the change is pretty simple then.
However, nvme-tcp still needs to be bio aware unless we have some
abstraction in place.. which will mean that nvme-tcp will need to
open-code bio_bvecs.
Re: [Cluster-devel] [PATCH V10 09/19] block: introduce bio_bvecs()
>> I would like to avoid growing bvec tables and keep everything
>> preallocated. Plus, a bvec_iter operates on a bvec which means we'll
>> need a table there as well... Not liking it so far...
>
> In the case of multiple bios in one request, we can't know how many
> bvecs there are except by calling rq_bvecs(), so it may not be suitable
> to preallocate the table. If you have to send the IO request in one
> send(), runtime allocation may be inevitable.

I don't want to do that, I want to work on a single bvec at a time like
the current implementation does.

> If you don't require to send the IO request in one send(), you may send
> one bio at a time, and just use the bio's bvec table directly, such as
> the single-bio case in lo_rw_aio().

We'd need some indication that we need to reinit the iter with the new
bvec; today we do:

	static inline void nvme_tcp_advance_req(struct nvme_tcp_request *req,
			int len)
	{
		req->snd.data_sent += len;
		req->pdu_sent += len;
		iov_iter_advance(&req->snd.iter, len);

		if (!iov_iter_count(&req->snd.iter) &&
		    req->snd.data_sent < req->data_len) {
			req->snd.curr_bio = req->snd.curr_bio->bi_next;
			nvme_tcp_init_send_iter(req);
		}
	}

and initialize the send iter. I imagine that now I will need to switch
to the next bvec, and only if I'm on the last one use the next bio...
Do you offer an API for that?

>>> can this way avoid your blocking issue? You may see this example in
>>> branch 'rq->bio != rq->biotail' of lo_rw_aio().
>>
>> This is exactly an example of not ignoring the bios...
>
> Yeah, that is the most common example, given merge is enabled in most
> of the cases. If the driver or device doesn't care about merge, you can
> disable it and always get single-bio requests, then the bio's bvec
> table can be reused for send().

Does bvec_iter span bvecs with your patches? I didn't see that change?
I'm not sure how this helps me either. Unless we can set a bvec_iter to
span bvecs, or have an abstraction for bio crossing when we
re-initialize the bvec_iter, I don't see how I can ignore bios
completely...
> rq_for_each_bvec() will iterate over all bvecs from all bios, so you
> needn't see any bio in this req.

But I don't need this iteration, I need a transparent API like:

	bvec2 = rq_bvec_next(rq, bvec);

This way I can simply always reinit my iter without thinking about how
the request/bios/bvecs are constructed...

> rq_bvecs() will return how many bvecs there are in this request
> (covering all bios in this req)

Still not very useful given that I don't want to use a table...

> So it looks like the nvme-tcp host driver might be the 2nd driver which
> benefits from multi-page bvec directly.
>
> The multi-page bvec V11 has passed my tests and addressed almost all
> the comments during review on V10. I removed bio_vecs() in V11, but it
> won't be a big deal; we can introduce them anytime when there is the
> requirement.

multipage-bvecs and nvme-tcp are going to conflict, so it would be good
to coordinate on this. I think that the nvme-tcp host needs some
adjustments in setting a bvec_iter. I'm under the impression that the
change is rather small and self-contained, but I'm not sure I have the
full picture here.

> I guess I may not get your exact requirement on a block io iterator
> from nvme-tcp too, :-(

They are pretty much listed above. Today nvme-tcp sets an iterator with:

	vec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
	nsegs = bio_segments(bio);
	size = bio->bi_iter.bi_size;
	offset = bio->bi_iter.bi_bvec_done;
	iov_iter_bvec(&req->snd.iter, WRITE, vec, nsegs, size);

and when done, iterates to the next bio and does the same. With
multipage bvec it would be great if we could simply have something like
rq_bvec_next() that would pretty much satisfy the requirements from the
nvme-tcp side...
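The `rq_bvec_next()` helper requested above does not exist; as a hedged sketch of the semantics being asked for, the toy model below chains simplified "bios" (each with a bvec table) and advances one bvec at a time, crossing bio boundaries transparently. All names (`toy_bio`, `toy_cursor`, `rq_bvec_next`) are hypothetical, not kernel structures.

```c
/* Hypothetical sketch of a transparent per-bvec cursor over a chain of
 * bios, modelling the rq_bvec_next() API requested in the thread.  The
 * caller never touches bios: the cursor crosses into the next bio's
 * table by itself and returns NULL at the end of the request. */
#include <assert.h>
#include <stddef.h>

struct toy_bvec {
	unsigned int bv_len;
};

struct toy_bio {
	struct toy_bvec *bi_io_vec;	/* this bio's bvec table */
	unsigned int bi_vcnt;		/* entries in the table */
	struct toy_bio *bi_next;	/* next bio in the request */
};

struct toy_cursor {
	struct toy_bio *bio;
	unsigned int idx;		/* current index in bio's table */
};

/* Advance to the next bvec; return NULL when the request is exhausted. */
static struct toy_bvec *rq_bvec_next(struct toy_cursor *cur)
{
	if (cur->idx + 1 < cur->bio->bi_vcnt)
		return &cur->bio->bi_io_vec[++cur->idx];
	if (cur->bio->bi_next) {	/* bio boundary crossed here */
		cur->bio = cur->bio->bi_next;
		cur->idx = 0;
		return &cur->bio->bi_io_vec[0];
	}
	return NULL;
}
```

With something of this shape, the driver could reinit its iov_iter from each returned bvec without ever looking at `bi_next` itself.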
Re: [Cluster-devel] [PATCH V10 09/19] block: introduce bio_bvecs()
>> Not sure I understand the 'blocking' problem in this case.
>>
>> We can build a bvec table from this req, and send them all in send(),

I would like to avoid growing bvec tables and keep everything
preallocated. Plus, a bvec_iter operates on a bvec which means we'll
need a table there as well... Not liking it so far...

>> can this way avoid your blocking issue? You may see this example in
>> branch 'rq->bio != rq->biotail' of lo_rw_aio().

This is exactly an example of not ignoring the bios...

> If this way is what you need, I think you are right; we may even
> introduce the following helpers:
>
>	rq_for_each_bvec()
>	rq_bvecs()

I'm not sure how this helps me either. Unless we can set a bvec_iter to
span bvecs, or have an abstraction for bio crossing when we
re-initialize the bvec_iter, I don't see how I can ignore bios
completely...

> So it looks like the nvme-tcp host driver might be the 2nd driver which
> benefits from multi-page bvec directly.
>
> The multi-page bvec V11 has passed my tests and addressed almost all
> the comments during review on V10. I removed bio_vecs() in V11, but it
> won't be a big deal; we can introduce them anytime when there is the
> requirement.

multipage-bvecs and nvme-tcp are going to conflict, so it would be good
to coordinate on this. I think that the nvme-tcp host needs some
adjustments in setting a bvec_iter. I'm under the impression that the
change is rather small and self-contained, but I'm not sure I have the
full picture here.
Re: [Cluster-devel] [PATCH V10 09/19] block: introduce bio_bvecs()
>>> The only user in your final tree seems to be the loop driver, and
>>> even that one only uses the helper for read/write bios. I think
>>> something like this would be much simpler in the end:
>>
>> The recently submitted nvme-tcp host driver should also be a user of
>> this. Does it make sense to keep it as a helper then?
>
> I did take a brief look at the code, and I really don't understand why
> the heck it even deals with bios to start with. Like all the other
> nvme transports it is a blk-mq driver and should iterate over segments
> in a request and more or less ignore bios. Something is horribly wrong
> in the design.

Can you explain a little more? I'm more than happy to change that but
I'm not completely clear how...

Before we begin a data transfer, we need to set our own iterator that
will advance with the progression of the data transfer. We also need to
keep in mind that all the data transfers (both send and recv) are
completely non-blocking (and zero-copy when we send). That means that
every data movement needs to be able to suspend and resume
asynchronously, i.e. we cannot use the following pattern:

	rq_for_each_segment(bvec, rq, rq_iter) {
		iov_iter_bvec(&iov_iter, WRITE, &bvec, 1, bvec.bv_len);
		send(sock, iov_iter);
	}

Given that a request can hold more than a single bio, I'm not clear on
how we can achieve that without iterating over the bios in the request
ourselves. Any useful insight?
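The suspend/resume requirement described above boils down to keeping the transfer position in persistent state rather than in a loop variable: a non-blocking send() may move fewer bytes than asked, and the next attempt must resume exactly where the last one stopped. A minimal sketch with toy types (`toy_iter` and `toy_iter_advance` are assumed names, loosely modelling what `iov_iter_advance()` does for the driver):

```c
/* Toy model of a resumable send iterator: progress survives across
 * (possibly partial) non-blocking sends instead of living inside a
 * rq_for_each_segment() loop that cannot be suspended mid-element. */
#include <assert.h>

struct toy_iter {
	unsigned int len;	/* total bytes in the current element */
	unsigned int done;	/* bytes already transferred */
};

/* Called after every send() with the number of bytes it moved;
 * returns the bytes still remaining in the current element. */
static unsigned int toy_iter_advance(struct toy_iter *it, unsigned int sent)
{
	it->done += sent;
	return it->len - it->done;
}
```

When the return value hits zero the driver would switch to the next element (next bvec, or next bio), which is exactly the transition `nvme_tcp_advance_req()` handles in the snippet quoted earlier in the thread.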
Re: [Cluster-devel] [PATCH V10 09/19] block: introduce bio_bvecs()
> The only user in your final tree seems to be the loop driver, and even
> that one only uses the helper for read/write bios. I think something
> like this would be much simpler in the end:

The recently submitted nvme-tcp host driver should also be a user of
this. Does it make sense to keep it as a helper then?
Re: [Cluster-devel] [PATCH] configfs: switch ->default groups to a linked list
On 26/02/2016 14:33, Christoph Hellwig wrote:
> Replace the current NULL-terminated array of default groups with a
> linked list. This gets rid of lots of nasty code to size and/or
> dynamically allocate the array.
>
> While we're at it also provide a convenient helper to remove the
> default groups.
>
> Signed-off-by: Christoph Hellwig <h...@lst.de>
> ---

Nice!

> -As a consequence of this, default_groups cannot be removed directly via
> +As a consequence of this, default groups cannot be removed directly via
>  rmdir(2). They also are not considered when rmdir(2) on the parent
>  group is checking for children.

What's changed here?

Other than that, looks good.

Reviewed-by: Sagi Grimberg <sa...@mellanox.com>
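The gain the patch describes can be seen in a toy model: with a NULL-terminated array the parent must size and allocate the array up front, while with a linked list each group simply links itself in. The structures below (`toy_group`, `toy_parent`) are simplified stand-ins, not the configfs types, which use the kernel's `struct list_head`.

```c
/* Toy illustration of replacing a NULL-terminated default-groups array
 * with an intrusive linked list: adding a group needs no array sizing
 * or reallocation, just a pointer update. */
#include <assert.h>
#include <stddef.h>

struct toy_group {
	const char *name;
	struct toy_group *next;	/* link in the parent's default-group list */
};

struct toy_parent {
	struct toy_group *default_groups;	/* list head */
};

static void toy_add_default_group(struct toy_parent *p, struct toy_group *g)
{
	g->next = p->default_groups;	/* push onto the list */
	p->default_groups = g;
}

static unsigned int toy_count_default_groups(const struct toy_parent *p)
{
	const struct toy_group *g;
	unsigned int n = 0;

	for (g = p->default_groups; g; g = g->next)
		n++;
	return n;
}
```

Removal is the same story in reverse: unlink one node, no array compaction or reallocation needed, which is the "nasty code" the commit message says goes away.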