[PATCH 0/2] (was Re: [ofa-general] fmr pool free_list empty)
[EMAIL PROTECTED] wrote on Mon, 25 Feb 2008 15:02 -0800:
> Ugh.
>
> [pw wrote:]
> > Looking at the FMR dirty list unmapping code in ib_fmr_batch_release(),
> > there is a section that pulls all the dirty entries onto a list that it
> > will later unmap and put back on the free list.  But it also plans to
> > unmap all the free entries that have ever been remapped:
>
> Yes, this came from a3cd7d90 ("IB/fmr_pool: ib_fmr_pool_flush() should
> flush all dirty FMRs").  That solved a real problem for Olaf, because
> otherwise dirty FMRs not at the max map count might never get
> invalidated.  It's not exactly an optimization but rather a correctness
> issue, because RDS relies on killing mappings eventually.
>
> On the other hand, this behavior clearly does lead to the possibility of
> leaving the free list temporarily empty for stupid reasons.  I don't see
> a really good way to fix this at the moment; need to meditate a little.

Adding CCs in case some iSER users are not on the openfabrics list.
Original message is here:

    http://lists.openfabrics.org/pipermail/general/2008-February/047111.html

This quoted commit is a regression for iSER.  Not sure if it causes
problems for the other FMR user, SRP.  It went in after v2.6.24.

Following this mail are two patches: one to revert the change, and one
to attempt to redo Olaf's patch in such a way that it does not cause
problems for other FMR users.

I haven't tested the patches with RDS.  It apparently isn't in the tree
yet.  In fact, there are no users of ib_flush_fmr_pool() in the tree,
which is the only function affected by the second patch.  But iSER is
working again in my scenario.

As a side note, I don't remember seeing this patch on the openfabrics
mailing list.  Perhaps I missed it.  Sometimes these sorts of
interactions can be spotted if proposed changes get wider attention.

		-- Pete
-
To unsubscribe from this list: send the line unsubscribe linux-scsi
in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/2] Revert IB/fmr_pool: ib_fmr_pool_flush() should flush all dirty FMRs
This reverts commit a3cd7d9070be417a21905c997ee32d756d999b38.

The original commit breaks iSER reliably, making it complain:

    iser: iser_reg_page_vec: ib_fmr_pool_map_phys failed: -11

The FMR cleanup thread runs ib_fmr_batch_release() as dirty entries
build up.  This commit causes clean but used FMR entries also to be
purged.  During that process, another thread can see that there are no
free FMRs and fail, even though there should always have been enough
available.

Signed-off-by: Pete Wyckoff [EMAIL PROTECTED]
---
 drivers/infiniband/core/fmr_pool.c |   21 ++++++---------------
 1 files changed, 6 insertions(+), 15 deletions(-)

diff --git a/drivers/infiniband/core/fmr_pool.c b/drivers/infiniband/core/fmr_pool.c
index 7f00347..4044fdf 100644
--- a/drivers/infiniband/core/fmr_pool.c
+++ b/drivers/infiniband/core/fmr_pool.c
@@ -139,7 +139,7 @@ static inline struct ib_pool_fmr *ib_fmr_cache_lookup(struct ib_fmr_pool *pool,
 static void ib_fmr_batch_release(struct ib_fmr_pool *pool)
 {
 	int                 ret;
-	struct ib_pool_fmr *fmr, *next;
+	struct ib_pool_fmr *fmr;
 	LIST_HEAD(unmap_list);
 	LIST_HEAD(fmr_list);
@@ -158,20 +158,6 @@ static void ib_fmr_batch_release(struct ib_fmr_pool *pool)
 #endif
 	}
 
-	/*
-	 * The free_list may hold FMRs that have been put there
-	 * because they haven't reached the max_remap count.
-	 * Invalidate their mapping as well.
-	 */
-	list_for_each_entry_safe(fmr, next, &pool->free_list, list) {
-		if (fmr->remap_count == 0)
-			continue;
-		hlist_del_init(&fmr->cache_node);
-		fmr->remap_count = 0;
-		list_add_tail(&fmr->fmr->list, &fmr_list);
-		list_move(&fmr->list, &unmap_list);
-	}
-
 	list_splice(&pool->dirty_list, &unmap_list);
 	INIT_LIST_HEAD(&pool->dirty_list);
 	pool->dirty_len = 0;
@@ -384,6 +370,11 @@ void ib_destroy_fmr_pool(struct ib_fmr_pool *pool)
 	i = 0;
 	list_for_each_entry_safe(fmr, tmp, &pool->free_list, list) {
+		if (fmr->remap_count) {
+			INIT_LIST_HEAD(&fmr_list);
+			list_add_tail(&fmr->fmr->list, &fmr_list);
+			ib_unmap_fmr(&fmr_list);
+		}
 		ib_dealloc_fmr(fmr->fmr);
 		list_del(&fmr->list);
 		kfree(fmr);
-- 
1.5.4.1
[PATCH 2/2] ib fmr pool: flush used clean entries
Commit a3cd7d9070be417a21905c997ee32d756d999b38 ("IB/fmr_pool:
ib_fmr_pool_flush() should flush all dirty FMRs") caused a regression
for iSER and was reverted in e5507736c6449b3253347eed6f8ea77a28cf688e.

This change attempts to redo the original patch so that all used FMR
entries are flushed when ib_flush_fmr_pool() is called, and other FMR
users are not affected.  Simply move used entries from the clean list
onto the dirty list before letting the cleanup thread do its job.

Signed-off-by: Pete Wyckoff [EMAIL PROTECTED]
---
 drivers/infiniband/core/fmr_pool.c |   17 ++++++++++++++++-
 1 files changed, 16 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/core/fmr_pool.c b/drivers/infiniband/core/fmr_pool.c
index 4044fdf..06d502c 100644
--- a/drivers/infiniband/core/fmr_pool.c
+++ b/drivers/infiniband/core/fmr_pool.c
@@ -398,8 +398,23 @@ EXPORT_SYMBOL(ib_destroy_fmr_pool);
  */
 int ib_flush_fmr_pool(struct ib_fmr_pool *pool)
 {
-	int serial = atomic_inc_return(&pool->req_ser);
+	int serial;
+	struct ib_pool_fmr *fmr, *next;
+
+	/*
+	 * The free_list holds FMRs that may have been used
+	 * but have not been remapped enough times to be dirty.
+	 * Put them on the dirty list now so that the cleanup
+	 * thread will reap them too.
+	 */
+	spin_lock_irq(&pool->pool_lock);
+	list_for_each_entry_safe(fmr, next, &pool->free_list, list) {
+		if (fmr->remap_count > 0)
+			list_move(&fmr->list, &pool->dirty_list);
+	}
+	spin_unlock_irq(&pool->pool_lock);
 
+	serial = atomic_inc_return(&pool->req_ser);
 	wake_up_process(pool->thread);
 
 	if (wait_event_interruptible(pool->force_wait,
-- 
1.5.4.1
Re: [PATCH 0/3] iscsi bidi varlen support
[EMAIL PROTECTED] wrote on Mon, 18 Feb 2008 17:39 +0200:
> On Tue, Feb 12 2008 at 22:17 +0200, Pete Wyckoff [EMAIL PROTECTED] wrote:
> > From: Pete Wyckoff [EMAIL PROTECTED]
> > Subject: [PATCH] iscsi iser: varlen
> >
> > Handle variable-length CDBs in iSER.
> >
> > Signed-off-by: Pete Wyckoff [EMAIL PROTECTED]
> > ---
> >  drivers/infiniband/ulp/iser/iscsi_iser.c     |    5 +++--
> >  drivers/infiniband/ulp/iser/iscsi_iser.h     |    2 +-
> >  drivers/infiniband/ulp/iser/iser_initiator.c |   16 ++++++++++------
> >  3 files changed, 14 insertions(+), 9 deletions(-)
> >
> > diff --git a/drivers/infiniband/ulp/iser/iscsi_iser.c b/drivers/infiniband/ulp/iser/iscsi_iser.c
> > index 5f2284d..9dfc310 100644
> > --- a/drivers/infiniband/ulp/iser/iscsi_iser.c
> > +++ b/drivers/infiniband/ulp/iser/iscsi_iser.c
> > @@ -401,7 +401,8 @@ iscsi_iser_session_create(struct iscsi_transport *iscsit,
> >  		ctask      = session->cmds[i];
> >  		iser_ctask = ctask->dd_data;
> >  		ctask->hdr = (struct iscsi_cmd *) &iser_ctask->desc.iscsi_header;
> > -		ctask->hdr_max = sizeof(iser_ctask->desc.iscsi_header);
> > +		ctask->hdr_max = sizeof(iser_ctask->desc.iscsi_header) +
> > +				 sizeof(iser_ctask->desc.hdrextbuf);
> >  	}
> >
> >  	for (i = 0; i < session->mgmtpool_max; i++) {
> > @@ -604,7 +605,7 @@ static struct iscsi_transport iscsi_iser_transport = {
> >  	.host_template          = &iscsi_iser_sht,
> >  	.conndata_size          = sizeof(struct iscsi_conn),
> >  	.max_lun                = ISCSI_ISER_MAX_LUN,
> > -	.max_cmd_len            = ISCSI_ISER_MAX_CMD_LEN,
> > +	.max_cmd_len            = 260,
>
> Same bug I had.  .max_cmd_len is still a char, before the varlen patch
> to scsi-ml.  So it must be at most 252 until that patch is introduced,
> and then it can return to the correct 260, or better yet
> SCSI_MAX_VARLEN_CDB_SIZE.  That also is only defined in the scsi-ml
> varlen patch.

Ah, that is unfortunate.

> I'm afraid the varlen patches to block and scsi-ml are waiting because
> of me.  There are more things I need to check before they can get
> approved.  Once I do that, and varlen gets accepted, iSER and iscsi_tcp
> can go up to 260 for the .max_cmd_len as they should.

I will sit on these iser changes until we get core varlen resolved, then.
Or you can just sequence it all cleverly through the various maintainers.

		-- Pete
[PATCH 1/3 v2] iscsi iser: remove DMA alignment restriction
Thanks to James for pointing out the problems with BLK_BOUNCE_ANY for IB
devices, this revised patch contains only the DMA alignment fix for
iSER.

Mike, can you take care of this and the other two patches in the series:

    [PATCH 2/3] iscsi iser: increase max_sectors
    [PATCH 3/3] iscsi iser: increase sg_tablesize

		-- Pete

From 255e73b67ec1458af18395981fddebdc958e8fe9 Mon Sep 17 00:00:00 2001
From: Pete Wyckoff [EMAIL PROTECTED]
Date: Thu, 14 Feb 2008 16:09:27 -0500
Subject: [PATCH] iscsi iser: remove DMA alignment restriction

iscsi_iser does not have any particular DMA alignment requirement.  Add
a slave_configure function to set the alignment to zero, allowing the
use of direct IO from arbitrary offsets within a page.

Signed-off-by: Pete Wyckoff [EMAIL PROTECTED]
---
 drivers/infiniband/ulp/iser/iscsi_iser.c |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/ulp/iser/iscsi_iser.c b/drivers/infiniband/ulp/iser/iscsi_iser.c
index be1b9fb..313f102 100644
--- a/drivers/infiniband/ulp/iser/iscsi_iser.c
+++ b/drivers/infiniband/ulp/iser/iscsi_iser.c
@@ -543,6 +543,12 @@ iscsi_iser_ep_disconnect(__u64 ep_handle)
 	iser_conn_terminate(ib_conn);
 }
 
+static int iscsi_iser_slave_configure(struct scsi_device *sdev)
+{
+	blk_queue_dma_alignment(sdev->request_queue, 0);
+	return 0;
+}
+
 static struct scsi_host_template iscsi_iser_sht = {
 	.module                 = THIS_MODULE,
 	.name                   = "iSCSI Initiator over iSER, v." DRV_VER,
@@ -556,6 +562,7 @@ static struct scsi_host_template iscsi_iser_sht = {
 	.eh_device_reset_handler = iscsi_eh_device_reset,
 	.eh_host_reset_handler  = iscsi_eh_host_reset,
 	.use_clustering         = DISABLE_CLUSTERING,
+	.slave_configure        = iscsi_iser_slave_configure,
 	.proc_name              = "iscsi_iser",
 	.this_id                = -1,
 };
-- 
1.5.3.8
Re: [PATCH 1/3] iscsi iser: remove DMA restrictions
[EMAIL PROTECTED] wrote on Tue, 12 Feb 2008 15:57 -0600:
> On Tue, 2008-02-12 at 16:46 -0500, Pete Wyckoff wrote:
> > [EMAIL PROTECTED] wrote on Tue, 12 Feb 2008 15:10 -0600:
> > > On Tue, 2008-02-12 at 15:54 -0500, Pete Wyckoff wrote:
> > > > iscsi_iser does not have any hardware DMA restrictions.  Add a
> > > > slave_configure function to remove any DMA alignment restriction,
> > > > allowing the use of direct IO from arbitrary offsets within a
> > > > page.  Also disable page bouncing; iser has no restrictions on
> > > > which pages it can address.
> > > >
> > > > Signed-off-by: Pete Wyckoff [EMAIL PROTECTED]
> > > > ---
> > > >  drivers/infiniband/ulp/iser/iscsi_iser.c |    8 ++++++++
> > > >  1 files changed, 8 insertions(+), 0 deletions(-)
> > > >
> > > > diff --git a/drivers/infiniband/ulp/iser/iscsi_iser.c b/drivers/infiniband/ulp/iser/iscsi_iser.c
> > > > index be1b9fb..1b272a6 100644
> > > > --- a/drivers/infiniband/ulp/iser/iscsi_iser.c
> > > > +++ b/drivers/infiniband/ulp/iser/iscsi_iser.c
> > > > @@ -543,6 +543,13 @@ iscsi_iser_ep_disconnect(__u64 ep_handle)
> > > >  	iser_conn_terminate(ib_conn);
> > > >  }
> > > >
> > > > +static int iscsi_iser_slave_configure(struct scsi_device *sdev)
> > > > +{
> > > > +	blk_queue_bounce_limit(sdev->request_queue, BLK_BOUNCE_ANY);
> > >
> > > You really don't want to do this.  That signals to the block layer
> > > that we have an iommu; although it's practically the same thing as
> > > a 64 bit DMA mask, I'd just leave it to the DMA mask to set this up
> > > correctly.  Anything else is asking for a subtle bug to turn up
> > > years from now when something causes the mask and the limit to be
> > > mismatched.
> >
> > Oh.  I decided to add that line for symmetry with TCP, and was
> > convinced by the arguments here:
> >
> >     commit b6d44fe9582b9d90a0b16f508ac08a90d899bf56
> >     Author: Mike Christie [EMAIL PROTECTED]
> >     Date:   Thu Jul 26 12:46:47 2007 -0500
> >
> >         [SCSI] iscsi_tcp: Turn off bounce buffers
> >
> >         It was found by LSI that on setups with large amounts of
> >         memory we were bouncing buffers when we did not need to.  If
> >         the iscsi tcp code touches the data buffer (or a helper
> >         does), it will kmap the buffer.  iscsi_tcp also does not
> >         interact with hardware, so it does not have any hw dma
> >         restrictions.  This patch sets the bounce buffer settings for
> >         our device queue so buffers should not be bounced because of
> >         a driver limit.
> >
> > I don't see a convenient place to call back into particular iscsi
> > devices to set the DMA mask per-host.  It has to go on the
> > shost_gendev, right?  But only for TCP and iSER, not qla4xxx, which
> > handles its DMA mask during device probe.
>
> You should be taking your mask from the underlying infiniband device
> as part of the setup, shouldn't you?

I think you're right about this.  All the existing IB HW tries to set a
64-bit dma mask, but that's no reason to disable the mechanism entirely
in iser.  I'll remove the line that disables bouncing in my patch.
Perhaps Mike will know if the iscsi_tcp usage is still appropriate.

		-- Pete
Re: [PATCH 1/3] iscsi iser: remove DMA restrictions
[EMAIL PROTECTED] wrote on Tue, 12 Feb 2008 15:10 -0600:
> On Tue, 2008-02-12 at 15:54 -0500, Pete Wyckoff wrote:
> > iscsi_iser does not have any hardware DMA restrictions.  Add a
> > slave_configure function to remove any DMA alignment restriction,
> > allowing the use of direct IO from arbitrary offsets within a page.
> > Also disable page bouncing; iser has no restrictions on which pages
> > it can address.
> >
> > Signed-off-by: Pete Wyckoff [EMAIL PROTECTED]
> > ---
> >  drivers/infiniband/ulp/iser/iscsi_iser.c |    8 ++++++++
> >  1 files changed, 8 insertions(+), 0 deletions(-)
> >
> > diff --git a/drivers/infiniband/ulp/iser/iscsi_iser.c b/drivers/infiniband/ulp/iser/iscsi_iser.c
> > index be1b9fb..1b272a6 100644
> > --- a/drivers/infiniband/ulp/iser/iscsi_iser.c
> > +++ b/drivers/infiniband/ulp/iser/iscsi_iser.c
> > @@ -543,6 +543,13 @@ iscsi_iser_ep_disconnect(__u64 ep_handle)
> >  	iser_conn_terminate(ib_conn);
> >  }
> >
> > +static int iscsi_iser_slave_configure(struct scsi_device *sdev)
> > +{
> > +	blk_queue_bounce_limit(sdev->request_queue, BLK_BOUNCE_ANY);
>
> You really don't want to do this.  That signals to the block layer that
> we have an iommu; although it's practically the same thing as a 64 bit
> DMA mask, I'd just leave it to the DMA mask to set this up correctly.
> Anything else is asking for a subtle bug to turn up years from now when
> something causes the mask and the limit to be mismatched.

Oh.  I decided to add that line for symmetry with TCP, and was convinced
by the arguments here:

    commit b6d44fe9582b9d90a0b16f508ac08a90d899bf56
    Author: Mike Christie [EMAIL PROTECTED]
    Date:   Thu Jul 26 12:46:47 2007 -0500

        [SCSI] iscsi_tcp: Turn off bounce buffers

        It was found by LSI that on setups with large amounts of memory
        we were bouncing buffers when we did not need to.  If the iscsi
        tcp code touches the data buffer (or a helper does), it will
        kmap the buffer.  iscsi_tcp also does not interact with
        hardware, so it does not have any hw dma restrictions.  This
        patch sets the bounce buffer settings for our device queue so
        buffers should not be bounced because of a driver limit.

I don't see a convenient place to call back into particular iscsi
devices to set the DMA mask per-host.  It has to go on the shost_gendev,
right?  But only for TCP and iSER, not qla4xxx, which handles its DMA
mask during device probe.

		-- Pete
[PATCH 3/3] iscsi iser: increase sg_tablesize
Increase FMR limits from 512 kB to 1 MB.  This matches the limit used by
SRP, which also uses FMRs for memory mapping.  Also correct a bug where
the sg_tablesize was 1 smaller than what was allocated, by folding the
"+ 1" used everywhere into the definition of the constant.

Signed-off-by: Pete Wyckoff [EMAIL PROTECTED]
---
 drivers/infiniband/ulp/iser/iscsi_iser.h |    3 ++-
 drivers/infiniband/ulp/iser/iser_verbs.c |    6 ++----
 2 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/ulp/iser/iscsi_iser.h b/drivers/infiniband/ulp/iser/iscsi_iser.h
index 1ee867b..db8f81a 100644
--- a/drivers/infiniband/ulp/iser/iscsi_iser.h
+++ b/drivers/infiniband/ulp/iser/iscsi_iser.h
@@ -87,7 +87,8 @@
 #define MASK_4K			(~(SIZE_4K-1))
 
-/* support up to 512KB in one RDMA */
-#define ISCSI_ISER_SG_TABLESIZE (0x80000 >> SHIFT_4K)
+/* FMR space for 1 MB of 4k-page transfers, plus 1 if not page aligned */
+#define ISCSI_ISER_SG_TABLESIZE (((1 << 20) >> SHIFT_4K) + 1)
 #define ISCSI_ISER_MAX_LUN		256
 #define ISCSI_ISER_MAX_CMD_LEN		16
diff --git a/drivers/infiniband/ulp/iser/iser_verbs.c b/drivers/infiniband/ulp/iser/iser_verbs.c
index 714b8db..c5b374f 100644
--- a/drivers/infiniband/ulp/iser/iser_verbs.c
+++ b/drivers/infiniband/ulp/iser/iser_verbs.c
@@ -140,7 +140,7 @@ static int iser_create_ib_conn_res(struct iser_conn *ib_conn)
 	device = ib_conn->device;
 
 	ib_conn->page_vec = kmalloc(sizeof(struct iser_page_vec) +
-				    (sizeof(u64) * (ISCSI_ISER_SG_TABLESIZE + 1)),
+				    sizeof(u64) * ISCSI_ISER_SG_TABLESIZE,
 				    GFP_KERNEL);
 	if (!ib_conn->page_vec) {
 		ret = -ENOMEM;
@@ -149,9 +149,7 @@ static int iser_create_ib_conn_res(struct iser_conn *ib_conn)
 	ib_conn->page_vec->pages = (u64 *) (ib_conn->page_vec + 1);
 
 	params.page_shift        = SHIFT_4K;
-	/* when the first/last SG element are not start/end *
-	 * page aligned, the map whould be of N+1 pages */
-	params.max_pages_per_fmr = ISCSI_ISER_SG_TABLESIZE + 1;
+	params.max_pages_per_fmr = ISCSI_ISER_SG_TABLESIZE;
 	/* make the pool size twice the max number of SCSI commands *
 	 * the ML is expected to queue, watermark for unmap at 50%  */
 	params.pool_size	 = ISCSI_DEF_XMIT_CMDS_MAX * 2;
-- 
1.5.3.8
Re: [PATCH 0/3] iscsi bidi varlen support
[EMAIL PROTECTED] wrote on Thu, 31 Jan 2008 20:08 +0200:
> Cheers, after 1.3 years these can go in.
>
> [PATCH 1/3] iscsi: extended cdb support
>
> The varlen support is not yet in mainline for block and scsi-ml.  But
> the API for drivers will not change.  All an LLD needs to do is set
> max_command to its maximum and be ready for bigger commands.  This is
> what's done here.  Once these commands start coming, iscsi will be
> ready for them.
>
> [PATCH 2/3] iscsi: bidi support - libiscsi
> [PATCH 3/3] iscsi: bidi support - iscsi_tcp
>
> Bidirectional commands support in iscsi.  iSER is not yet ready, but it
> will not break.  There is already a mechanism in libiscsi that will
> return an error if bidi commands are sent the iSER way.  Pete, please
> send me the iSER bits so we can port them to this latest version.
>
> Mike, these patches are on top of the iscsi branch of the iscsi git
> tree; they will apply, but for compilation you will need to sync with
> Linus' mainline.  The patches are for the in-tree iscsi code.  I owe
> you the compat patch for the out-of-tree code, but that will only be
> Sunday.  If we do it fast it might get accepted to the 2.6.25 merge
> window.
>
> Everybody is invited to a party at Shila, ben-yhuda 52 Tel-Aviv,
> 9:45 pm.  Drinks and wonderful sea-food on us :)

Here's the patch to add bidi support to iSER too.  It works with my
setup, but could use more testing.  Note that this does rely on the 3
patches quoted above.

		-- Pete

From: Pete Wyckoff [EMAIL PROTECTED]

Support bidirectional SCSI commands in iSCSI iSER.  The handling of data
buffers for each direction was already mostly separated.  It was
necessary to separate the unmapping of data bufs when one was found to
be unaligned.  This was done by adding a direction parameter to
iser_dma_unmap_task_data() and calling it appropriately.  Also
iser_ctask_rdma_finalize() used local variables in such a way that it
assumed a unidirectional command; that was split up to make it cleaner
and order the operations correctly for bidi.
Signed-off-by: Pete Wyckoff [EMAIL PROTECTED]
---
 drivers/infiniband/ulp/iser/iscsi_iser.h     |    3 +-
 drivers/infiniband/ulp/iser/iser_initiator.c |   79 +++++++++++----------------
 drivers/infiniband/ulp/iser/iser_memory.c    |    9 ++-
 3 files changed, 41 insertions(+), 50 deletions(-)

diff --git a/drivers/infiniband/ulp/iser/iscsi_iser.h b/drivers/infiniband/ulp/iser/iscsi_iser.h
index 66905df..f5497b0 100644
--- a/drivers/infiniband/ulp/iser/iscsi_iser.h
+++ b/drivers/infiniband/ulp/iser/iscsi_iser.h
@@ -356,5 +356,6 @@ int  iser_dma_map_task_data(struct iscsi_iser_cmd_task *iser_ctask,
 			    enum iser_data_dir iser_dir,
 			    enum dma_data_direction dma_dir);
 
-void iser_dma_unmap_task_data(struct iscsi_iser_cmd_task *iser_ctask);
+void iser_dma_unmap_task_data(struct iscsi_iser_cmd_task *iser_ctask,
+			      enum iser_data_dir cmd_dir);
 #endif
diff --git a/drivers/infiniband/ulp/iser/iser_initiator.c b/drivers/infiniband/ulp/iser/iser_initiator.c
index ea3f5dc..491ffab 100644
--- a/drivers/infiniband/ulp/iser/iser_initiator.c
+++ b/drivers/infiniband/ulp/iser/iser_initiator.c
@@ -325,9 +325,8 @@ int iser_send_command(struct iscsi_conn *conn,
 {
 	struct iscsi_iser_conn *iser_conn = conn->dd_data;
 	struct iscsi_iser_cmd_task *iser_ctask = ctask->dd_data;
-	struct iser_dto *send_dto = NULL;
-	unsigned long edtl;
-	int err = 0;
+	struct iser_dto *send_dto;
+	int err;
 	struct iser_data_buf *data_buf;
 	struct iscsi_cmd *hdr = ctask->hdr;
@@ -340,37 +339,32 @@ int iser_send_command(struct iscsi_conn *conn,
 	if (iser_check_xmit(conn, ctask))
 		return -ENOBUFS;
 
-	edtl = ntohl(hdr->data_length);
-
 	/* build the tx desc regd header and add it to the tx desc dto */
 	iser_ctask->desc.type = ISCSI_TX_SCSI_COMMAND;
 	send_dto = &iser_ctask->desc.dto;
 	send_dto->ctask = iser_ctask;
 	iser_create_send_desc(iser_conn, &iser_ctask->desc, ctask->hdr_len);
 
-	if (hdr->flags & ISCSI_FLAG_CMD_READ)
-		data_buf = &iser_ctask->data[ISER_DIR_IN];
-	else
-		data_buf = &iser_ctask->data[ISER_DIR_OUT];
-
-	if (scsi_sg_count(sc)) { /* using a scatter list */
-		data_buf->buf  = scsi_sglist(sc);
-		data_buf->size = scsi_sg_count(sc);
-	}
-
-	data_buf->data_len = scsi_bufflen(sc);
-
-	if (hdr->flags & ISCSI_FLAG_CMD_READ) {
-		err = iser_prepare_read_cmd(ctask, edtl);
+	data_buf = &iser_ctask->data[ISER_DIR_IN];
+	data_buf->buf = scsi_in(sc)->table.sgl;
+	data_buf->size = scsi_in(sc)->table.nents;
+	data_buf->data_len = scsi_in(sc)->length;
+	err = iser_prepare_read_cmd(ctask, data_buf->data_len);
 	if (err)
 		goto send_command_error
[PATCH] bsg: bidi bio map failure fix
If blk_rq_map_user requires more than one bio, and fails mapping
somewhere after the first bio, it will return with rq->bio set to
non-NULL, but it will have already unmapped the partial bio.  The "out:"
error exit section will see the non-NULL bio and try to unmap it again,
triggering a mapcount bug via bad_page().

Signed-off-by: Pete Wyckoff [EMAIL PROTECTED]
---
 block/bsg.c |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/block/bsg.c b/block/bsg.c
index 3337125..bba7154 100644
--- a/block/bsg.c
+++ b/block/bsg.c
@@ -295,8 +295,10 @@ bsg_map_hdr(struct bsg_device *bd, struct sg_io_v4 *hdr)
 
 		dxferp = (void*)(unsigned long)hdr->din_xferp;
 		ret = blk_rq_map_user(q, next_rq, dxferp, hdr->din_xfer_len);
-		if (ret)
+		if (ret) {
+			next_rq->bio = NULL; /* do not unmap twice */
 			goto out;
+		}
 	}
 
 	if (hdr->dout_xfer_len) {
-- 
1.5.3.8
Re: [PATCH 0/3] iscsi bidi varlen support
[EMAIL PROTECTED] wrote on Tue, 12 Feb 2008 15:12 -0500:
> [EMAIL PROTECTED] wrote on Thu, 31 Jan 2008 20:08 +0200:
> > Cheers, after 1.3 years these can go in.
> >
> > [PATCH 1/3] iscsi: extended cdb support
> >
> > The varlen support is not yet in mainline for block and scsi-ml.  But
> > the API for drivers will not change.  All an LLD needs to do is set
> > max_command to its maximum and be ready for bigger commands.  This is
> > what's done here.  Once these commands start coming, iscsi will be
> > ready for them.
> >
> > [PATCH 2/3] iscsi: bidi support - libiscsi
> > [PATCH 3/3] iscsi: bidi support - iscsi_tcp
> >
> > Bidirectional commands support in iscsi.  iSER is not yet ready, but
> > it will not break.  There is already a mechanism in libiscsi that
> > will return an error if bidi commands are sent the iSER way.  Pete,
> > please send me the iSER bits so we can port them to this latest
> > version.
> >
> > Mike, these patches are on top of the iscsi branch of the iscsi git
> > tree; they will apply, but for compilation you will need to sync with
> > Linus' mainline.  The patches are for the in-tree iscsi code.  I owe
> > you the compat patch for the out-of-tree code, but that will only be
> > Sunday.
>
> Here's the patch to add bidi support to iSER too.  It works with my
> setup, but could use more testing.  Note that this does rely on the 3
> patches quoted above.

Similar, for varlen support to iSER.  Probably apply this one before the
bidi one I just sent.

		-- Pete

From: Pete Wyckoff [EMAIL PROTECTED]
Subject: [PATCH] iscsi iser: varlen

Handle variable-length CDBs in iSER.
Signed-off-by: Pete Wyckoff [EMAIL PROTECTED]
---
 drivers/infiniband/ulp/iser/iscsi_iser.c     |    5 +++--
 drivers/infiniband/ulp/iser/iscsi_iser.h     |    2 +-
 drivers/infiniband/ulp/iser/iser_initiator.c |   16 ++++++++++------
 3 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/drivers/infiniband/ulp/iser/iscsi_iser.c b/drivers/infiniband/ulp/iser/iscsi_iser.c
index 5f2284d..9dfc310 100644
--- a/drivers/infiniband/ulp/iser/iscsi_iser.c
+++ b/drivers/infiniband/ulp/iser/iscsi_iser.c
@@ -401,7 +401,8 @@ iscsi_iser_session_create(struct iscsi_transport *iscsit,
 		ctask      = session->cmds[i];
 		iser_ctask = ctask->dd_data;
 		ctask->hdr = (struct iscsi_cmd *) &iser_ctask->desc.iscsi_header;
-		ctask->hdr_max = sizeof(iser_ctask->desc.iscsi_header);
+		ctask->hdr_max = sizeof(iser_ctask->desc.iscsi_header) +
+				 sizeof(iser_ctask->desc.hdrextbuf);
 	}
 
 	for (i = 0; i < session->mgmtpool_max; i++) {
@@ -604,7 +605,7 @@ static struct iscsi_transport iscsi_iser_transport = {
 	.host_template          = &iscsi_iser_sht,
 	.conndata_size          = sizeof(struct iscsi_conn),
 	.max_lun                = ISCSI_ISER_MAX_LUN,
-	.max_cmd_len            = ISCSI_ISER_MAX_CMD_LEN,
+	.max_cmd_len            = 260,
 	/* session management */
 	.create_session         = iscsi_iser_session_create,
 	.destroy_session        = iscsi_session_teardown,
diff --git a/drivers/infiniband/ulp/iser/iscsi_iser.h b/drivers/infiniband/ulp/iser/iscsi_iser.h
index db8f81a..66905df 100644
--- a/drivers/infiniband/ulp/iser/iscsi_iser.h
+++ b/drivers/infiniband/ulp/iser/iscsi_iser.h
@@ -90,7 +90,6 @@
 /* FMR space for 1 MB of 4k-page transfers, plus 1 if not page aligned */
 #define ISCSI_ISER_SG_TABLESIZE (((1 << 20) >> SHIFT_4K) + 1)
 #define ISCSI_ISER_MAX_LUN		256
-#define ISCSI_ISER_MAX_CMD_LEN		16
 
 /* QP settings */
 /* Maximal bounds on received asynchronous PDUs */
@@ -217,6 +216,7 @@ enum iser_desc_type {
 struct iser_desc {
 	struct iser_hdr      iser_header;
 	struct iscsi_hdr     iscsi_header;
+	char                 hdrextbuf[ISCSI_MAX_AHS_SIZE];
 	struct iser_regd_buf hdr_regd_buf;
 	void                 *data;		/* used by RX & TX_CONTROL */
 	struct iser_regd_buf data_regd_buf;	/* used by RX & TX_CONTROL */
diff --git a/drivers/infiniband/ulp/iser/iser_initiator.c b/drivers/infiniband/ulp/iser/iser_initiator.c
index 83247f1..ea3f5dc 100644
--- a/drivers/infiniband/ulp/iser/iser_initiator.c
+++ b/drivers/infiniband/ulp/iser/iser_initiator.c
@@ -246,9 +246,13 @@ post_rx_kmalloc_failure:
 	return err;
 }
 
-/* creates a new tx descriptor and adds header regd buffer */
+/**
+ * iser_create_send_desc - creates a new tx descriptor and adds header regd buffer
+ * @iscsi_hdr_len: length of struct iscsi_hdr and any AHSes in hdrextbuf
+ */
 static void iser_create_send_desc(struct iscsi_iser_conn *iser_conn,
-				  struct iser_desc *tx_desc)
+				  struct iser_desc *tx_desc,
+				  int iscsi_hdr_len)
 {
 	struct iser_regd_buf *regd_hdr = &tx_desc->hdr_regd_buf;
 	struct iser_dto *send_dto = &tx_desc
[PATCH 1/3] iscsi iser: remove DMA restrictions
iscsi_iser does not have any hardware DMA restrictions.  Add a
slave_configure function to remove any DMA alignment restriction,
allowing the use of direct IO from arbitrary offsets within a page.
Also disable page bouncing; iser has no restrictions on which pages it
can address.

Signed-off-by: Pete Wyckoff [EMAIL PROTECTED]
---
 drivers/infiniband/ulp/iser/iscsi_iser.c |    8 ++++++++
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/ulp/iser/iscsi_iser.c b/drivers/infiniband/ulp/iser/iscsi_iser.c
index be1b9fb..1b272a6 100644
--- a/drivers/infiniband/ulp/iser/iscsi_iser.c
+++ b/drivers/infiniband/ulp/iser/iscsi_iser.c
@@ -543,6 +543,13 @@ iscsi_iser_ep_disconnect(__u64 ep_handle)
 	iser_conn_terminate(ib_conn);
 }
 
+static int iscsi_iser_slave_configure(struct scsi_device *sdev)
+{
+	blk_queue_bounce_limit(sdev->request_queue, BLK_BOUNCE_ANY);
+	blk_queue_dma_alignment(sdev->request_queue, 0);
+	return 0;
+}
+
 static struct scsi_host_template iscsi_iser_sht = {
 	.module                 = THIS_MODULE,
 	.name                   = "iSCSI Initiator over iSER, v." DRV_VER,
@@ -556,6 +563,7 @@ static struct scsi_host_template iscsi_iser_sht = {
 	.eh_device_reset_handler = iscsi_eh_device_reset,
 	.eh_host_reset_handler  = iscsi_eh_host_reset,
 	.use_clustering         = DISABLE_CLUSTERING,
+	.slave_configure        = iscsi_iser_slave_configure,
 	.proc_name              = "iscsi_iser",
 	.this_id                = -1,
 };
-- 
1.5.3.8
Re: [PATCH 2/3] iscsi: bidi support - libiscsi
[EMAIL PROTECTED] wrote on Thu, 31 Jan 2008 22:29 +0200:
> iscsi bidi support at the generic libiscsi level
> - prepare the additional bidi_read rlength header.
> - access the right scsi_in() and/or scsi_out() side of things.
>   also for resid.
> - Handle BIDI underflow overflow from target
>
> Signed-off-by: Boaz Harrosh [EMAIL PROTECTED]

I see you do this a bit differently than in your previous patch set.  In
particular, the residual handling in libiscsi.c.  (I'm editing in a bit
more context to the patch below.)

+	if (scsi_bidi_cmnd(sc) &&
+	    (rhdr->flags & (ISCSI_FLAG_CMD_BIDI_UNDERFLOW |
+			    ISCSI_FLAG_CMD_BIDI_OVERFLOW))) {
+		int res_count = be32_to_cpu(rhdr->bi_residual_count);
+
+		if (res_count > 0 &&
+		    (rhdr->flags & ISCSI_FLAG_CMD_BIDI_OVERFLOW ||
+		     res_count <= scsi_in(sc)->length))
+			scsi_in(sc)->resid = res_count;
+		else
+			sc->result = (DID_BAD_TARGET << 16) | rhdr->cmd_status;
+	}
+
 	if (rhdr->flags & (ISCSI_FLAG_CMD_UNDERFLOW |
 			   ISCSI_FLAG_CMD_OVERFLOW)) {
 		int res_count = be32_to_cpu(rhdr->residual_count);
 
 		if (res_count > 0 &&
 		    (rhdr->flags & ISCSI_FLAG_CMD_OVERFLOW ||
 		     res_count <= scsi_bufflen(sc)))
+			/* write side for bidi or uni-io set_resid */
 			scsi_set_resid(sc, res_count);
 		else
 			sc->result = (DID_BAD_TARGET << 16) | rhdr->cmd_status;
 	} else if (rhdr->flags & (ISCSI_FLAG_CMD_BIDI_UNDERFLOW |
 				  ISCSI_FLAG_CMD_BIDI_OVERFLOW))
 		sc->result = (DID_BAD_TARGET << 16) | rhdr->cmd_status;

I haven't tested this, but consider a bidi command that results in an
overflow on the read part, and no overflow on the write part.  E.g., the
user supplied a data-in buffer that wasn't big enough to hold the
returned data from the target, but the data-out buffer was just right.

Then this code will set scsi_in(sc)->resid properly, informing the
caller that there were extra bytes that were not transferred.  But the
else clause at the bottom will also set sc->result to be bad.  I don't
think we want this.

Your earlier patch got rid of the second bidi_overflow handler and just
did all the logic for both bidi and non-bidi cases in a single if
clause.  Seemed better.

		-- Pete
Re: [PATCH 2/3] iscsi: bidi support - libiscsi
[EMAIL PROTECTED] wrote on Mon, 11 Feb 2008 18:05 +0200:
> You are most probably right.  I will investigate what happened.  It
> looks like I went back to some old version, right?  Or a merge fallout.
> Thanks for reviewing.
>
> Please also test latest head-of-line code if possible + iscsi patches
> + last varlen I sent.

Are there any new patches I need for 2.6.24 or head-of-line for my
osd-dev tree?  Testing now.

My patch set is actually shrinking (!) thanks to some merges.  Some
rebasing was required to apply your three varlen patches and three bidi
patches, but I'm sure you'll update your tree and push those upstream
soon enough.

I'll take a look at iser bidi, then update my patch list and let you
know soon.

		-- Pete
Re: Integration of SCST in the mainstream Linux kernel
[EMAIL PROTECTED] wrote on Wed, 30 Jan 2008 10:34 -0600: On Wed, 2008-01-30 at 09:38 +0100, Bart Van Assche wrote: On Jan 30, 2008 12:32 AM, FUJITA Tomonori [EMAIL PROTECTED] wrote: iSER has parameters to limit the maximum size of RDMA (it needs to repeat RDMA with a poor configuration)? Please specify which parameters you are referring to. As you know, I had already repeated my tests with ridiculously high values for the following iSER parameters: FirstBurstLength, MaxBurstLength and MaxRecvDataSegmentLength (16 MB, which is more than the 1 MB block size specified to dd). The 1 MB block size is a bit of a red herring. Unless you've specifically increased the max_sector_size and are using an sg_chain converted driver, on x86 the maximum possible transfer accumulation is 0.5 MB. I certainly don't rule out that increasing the transfer size up from 0.5 MB might be the way to improve STGT efficiency, since at a 1 GB/s theoretical peak, that's roughly 2000 context switches per I/O; however, it doesn't look like you've done anything that will overcome the block layer limitations. The MRDSL parameter has no effect on iSER, as the RFC describes. How to transfer data to satisfy a command is solely up to the target. So you would need both big requests from the client, then look at how the target will send the data. I've only used 512 kB for the RDMA transfer size from the target, as it matches the default client size and was enough to get good performance out of my IB gear and minimizes resource consumption on the target. It's currently hard-coded as a #define. There is no provision in the protocol for the client to dictate the value. If others want to spend some effort trying to tune stgt for iSER, there are a fair number of comments in the code, including a big one that explains this RDMA transfer size issue. And I'll answer informed questions as I can. 
But I'm not particularly interested in arguing about which implementation is best, or trying to interpret bandwidth comparison numbers from poorly designed tests. It takes work to understand these issues. -- Pete
Re: [PATCH] use the cmd_type of a leading request for scsi_init_sgtable
[EMAIL PROTECTED] wrote on Fri, 25 Jan 2008 20:05 -0600: On Sat, 2008-01-26 at 09:57 +0900, FUJITA Tomonori wrote: This is against the scsi-bidi tree. We need to use the cmd_type of a leading request for scsi_init_sgtable to set up scsi_data_buffer:length of a bidi request properly. An alternative approach is setting the cmd_type of a leading request and its bidi request (*1). But the block layer and scsi-ml don't expect that the leading request and its sub-requests have different command types. Note that scsi_debug's XDWRITEREAD_10 support is fine without this patch since req->nr_sectors works for it, but req->nr_sectors doesn't work for everyone. (*1) http://www.mail-archive.com/linux-scsi@vger.kernel.org/msg12669.html = From: FUJITA Tomonori [EMAIL PROTECTED] Subject: [PATCH] use the cmd_type of a leading request for scsi_init_sgtable We need to use the cmd_type of a leading request for scsi_init_sgtable to set up scsi_data_buffer:length of its bidi request properly. This seems to be a very convoluted workaround for the fact that we forgot to set the cmd_type on the subordinate request. Wouldn't this be a better fix? James

---
diff --git a/block/bsg.c b/block/bsg.c
index 69b0a9d..8917c51 100644
--- a/block/bsg.c
+++ b/block/bsg.c
@@ -279,6 +279,7 @@ bsg_map_hdr(struct bsg_device *bd, struct sg_io_v4 *hdr)
 		goto out;
 	}
 	rq->next_rq = next_rq;
+	next_rq->cmd_type = rq->cmd_type;
 	dxferp = (void*)(unsigned long)hdr->din_xferp;
 	ret = blk_rq_map_user(q, next_rq, dxferp, hdr->din_xfer_len);

Agree. I did essentially the same thing in: http://marc.info/?l=linux-scsi&m=11983623270 Tomo, you may want to have a look at all the bsg patches I sent back in Dec 2007. Boaz has the minimum required for bidi in his tree. 
There are a few more here too, if you want to see what we need for OSD: http://git.osc.edu/?p=linux.git;a=summary -- Pete
Re: Performance of SCST versus STGT
[EMAIL PROTECTED] wrote on Thu, 17 Jan 2008 19:05 +0900: On Thu, 17 Jan 2008 12:48:28 +0300 Vladislav Bolkhovitin [EMAIL PROTECTED] wrote: FUJITA Tomonori wrote: On Thu, 17 Jan 2008 10:27:08 +0100 Bart Van Assche [EMAIL PROTECTED] wrote: Hello, I have performed a test to compare the performance of SCST and STGT. Apparently the SCST target implementation performed far better than the STGT target implementation. This makes me wonder whether this is due to the design of SCST or whether STGT's performance can be improved to the level of SCST? Test performed: read 2 GB of data in blocks of 1 MB from a target (hot cache -- no disk reads were performed, all reads were from the cache). Test command: time dd if=/dev/sde of=/dev/null bs=1M count=2000

                           STGT read            SCST read
                           performance (MB/s)   performance (MB/s)
Ethernet (1 Gb/s network)      77                   89
IPoIB    (8 Gb/s network)      82                  229
SRP      (8 Gb/s network)     N/A                  600
iSER     (8 Gb/s network)      80                  N/A

These results show that SCST uses the InfiniBand network very well (efficiency of about 88% via SRP), but that the current STGT version is unable to transfer data faster than 82 MB/s. Does this mean that there is a severe bottleneck present in the current STGT implementation? I don't know about the details but Pete said that he can achieve more than 900 MB/s read performance with tgt iSER target using ramdisk. http://www.mail-archive.com/[EMAIL PROTECTED]/msg4.html Please don't confuse a multithreaded, latency-insensitive workload with a single-threaded, hence latency-sensitive one. Seems that he can get good performance with single-threaded workload: http://www.osc.edu/~pw/papers/wyckoff-iser-snapi07-talk.pdf But I don't know about the details so let's wait for Pete to comment on this. Page 16 is pretty straightforward. One command outstanding from the client. It is an OSD read command. Data on tmpfs. 500 MB/s is pretty easy to get on IB. The other graph on page 23 is for block commands. 600 MB/s-ish. Still single command; so essentially a latency test. 
Dominated by the memcpy time from tmpfs to pinned IB buffer, as per page 24. Erez said: We didn't run any real performance test with tgt, so I don't have numbers yet. I know that Pete got ~900 MB/sec by hacking sgp_dd, so all data was read/written to the same block (so it was all done in the cache). Pete - am I right? Yes (actually just 1 thread in sg_dd). This is obviously cheating. Take the pread time to zero in the SCSI Read analysis on page 24 to show the max theoretical. It's IB theoretical minus some initiator and stgt overheads. The other way to get more read throughput is to throw multiple simultaneous commands at the server. There's nothing particularly stunning here. Suspect Bart has configuration issues if not even IPoIB will do 100 MB/s. -- Pete
Re: [PATCH] bsg : Add support for io vectors in bsg
[EMAIL PROTECTED] wrote on Fri, 11 Jan 2008 19:16 -0500: James Bottomley wrote: On Thu, 2008-01-10 at 16:46 -0500, Pete Wyckoff wrote: [EMAIL PROTECTED] wrote on Thu, 10 Jan 2008 14:55 -0600: On Thu, 2008-01-10 at 15:43 -0500, Pete Wyckoff wrote: [EMAIL PROTECTED] wrote on Wed, 09 Jan 2008 09:11 +0900: On Tue, 8 Jan 2008 17:09:18 -0500 Pete Wyckoff [EMAIL PROTECTED] wrote: I took another look at the compat approach, to see if it is feasible to keep the compat handling somewhere else, without the use of #ifdef CONFIG_COMPAT and size-comparison code inside bsg.c. I don't see how. The use of iovec is within a write operation on a char device. It's not amenable to a compat_sys_ or a .compat_ioctl approach. I'm partial to #1 because the use of architecture-independent fields matches the rest of struct sg_io_v4. But if you don't want to have another iovec type in the kernel, could we do #2 but just return -EINVAL if the need for compat is detected? I.e. change dout_iovec_count to dout_iovec_length and do the math? If you are ok with removing the write/read interface and just have ioctl, we can handle compat stuff like others do. But I think that you (OSD people) really want to keep the write/read interface. Sorry, I think that there is no workaround to support iovec in bsg. I don't care about read/write in particular. But we do need some way to launch asynchronous SCSI commands, and currently read/write are the only way to do that in bsg. The reason is to keep multiple spindles busy at the same time. Won't multi-threading the ioctl calls achieve the same effect? Or do you trip over BKL there? There's no BKL on (new) ioctls anymore, at least. A thread per device would be feasible perhaps. But if you want any sort of pipelining out of the device, esp. in the remote iSCSI case, you need to have a good number of commands outstanding to each device. So a thread per command per device. 
Typical iSCSI queue depth of 128 times 16 devices for a small setup is a lot of threads. I was actually thinking of a thread per outstanding command. The pthread/pipe latency overhead is not insignificant for fast storage networks too. How about these new ioctls instead of read/write: SG_IO_SUBMIT - start a new blk_execute_rq_nowait() SG_IO_TEST - complete and return a previous req SG_IO_WAIT - wait for a req to finish, interruptibly Then old write users will instead do ioctl SUBMIT. Read users will do TEST for non-blocking fd, or WAIT for blocking. And SG_IO could be implemented as SUBMIT + WAIT. Then we can do compat_ioctl and convert up iovecs out-of-line before calling the normal functions. Let me know if you want a patch for this. Really, the thought of re-inventing yet another async I/O interface isn't very appealing. I'm fine with read/write, except Tomo is against handling iovecs because of the compat complexity with struct iovec being different on 32- vs 64-bit. There is a standard way to do compat ioctl that hides this handling in a different file (not bsg.c), which is the only reason I'm even considering these ioctls. I don't care about compat setups per se. Is there another async I/O mechanism? Userspace builds the CDBs, just needs some way to drop them in SCSI ML. BSG is almost perfect for this, but doesn't do iovec, leading to lots of memcpy. No, it's just that async interfaces in Linux have a long and fairly unhappy history. The sg driver's async interface has been pretty stable for a long time. The sync SG_IO ioctl is built on top of the async interface. That makes the async interface extremely well tested. The write()/read() async interface in sg does have one problem: when a command is dispatched via a write() it would be very useful to get back a tag but that violates write()'s second argument: 'const void * buf'. That tag could be useful both for identification of the response and by task management functions. 
I was hoping that the 'flags' field in sgv4 could be used to implement the variants: SG_IO_SUBMIT - start a new blk_execute_rq_nowait() SG_IO_TEST - complete and return a previous req SG_IO_WAIT - wait for a req to finish, interruptibly that way the existing SG_IO ioctl is sufficient. And if Tomo doesn't want to do it in the bsg driver, then it could be done in the sg driver. The sg driver already has async via read/write, and it works fine. Perhaps someone wants the ioctl versions too, but that's not my main goal here. I think that sg doesn't bother with compat iovec handling. It just uses SZ_SG_IOVEC (defined as sizeof(sg_iovec_t)) and doesn't check if the userspace iovec happens to be smaller. Sg does have a compat ioctl function; it just doesn't support SG_IO. So, on a 64-bit kernel, read/write from 32-bit userspace with iovec will get undefined results due to the mis-interpretation of the iovec fields, while ioctl from 32-bit will fail with ENOIOCTLCMD.
Re: [PATCH] bsg : Add support for io vectors in bsg
[EMAIL PROTECTED] wrote on Wed, 09 Jan 2008 09:11 +0900: On Tue, 8 Jan 2008 17:09:18 -0500 Pete Wyckoff [EMAIL PROTECTED] wrote: I took another look at the compat approach, to see if it is feasible to keep the compat handling somewhere else, without the use of #ifdef CONFIG_COMPAT and size-comparison code inside bsg.c. I don't see how. The use of iovec is within a write operation on a char device. It's not amenable to a compat_sys_ or a .compat_ioctl approach. I'm partial to #1 because the use of architecture-independent fields matches the rest of struct sg_io_v4. But if you don't want to have another iovec type in the kernel, could we do #2 but just return -EINVAL if the need for compat is detected? I.e. change dout_iovec_count to dout_iovec_length and do the math? If you are ok with removing the write/read interface and just have ioctl, we can handle compat stuff like others do. But I think that you (OSD people) really want to keep the write/read interface. Sorry, I think that there is no workaround to support iovec in bsg. I don't care about read/write in particular. But we do need some way to launch asynchronous SCSI commands, and currently read/write are the only way to do that in bsg. The reason is to keep multiple spindles busy at the same time. How about these new ioctls instead of read/write: SG_IO_SUBMIT - start a new blk_execute_rq_nowait() SG_IO_TEST - complete and return a previous req SG_IO_WAIT - wait for a req to finish, interruptibly Then old write users will instead do ioctl SUBMIT. Read users will do TEST for non-blocking fd, or WAIT for blocking. And SG_IO could be implemented as SUBMIT + WAIT. Then we can do compat_ioctl and convert up iovecs out-of-line before calling the normal functions. Let me know if you want a patch for this. -- Pete
Re: [PATCH] bsg : Add support for io vectors in bsg
[EMAIL PROTECTED] wrote on Thu, 10 Jan 2008 14:55 -0600: On Thu, 2008-01-10 at 15:43 -0500, Pete Wyckoff wrote: [EMAIL PROTECTED] wrote on Wed, 09 Jan 2008 09:11 +0900: On Tue, 8 Jan 2008 17:09:18 -0500 Pete Wyckoff [EMAIL PROTECTED] wrote: I took another look at the compat approach, to see if it is feasible to keep the compat handling somewhere else, without the use of #ifdef CONFIG_COMPAT and size-comparison code inside bsg.c. I don't see how. The use of iovec is within a write operation on a char device. It's not amenable to a compat_sys_ or a .compat_ioctl approach. I'm partial to #1 because the use of architecture-independent fields matches the rest of struct sg_io_v4. But if you don't want to have another iovec type in the kernel, could we do #2 but just return -EINVAL if the need for compat is detected? I.e. change dout_iovec_count to dout_iovec_length and do the math? If you are ok with removing the write/read interface and just have ioctl, we can handle compat stuff like others do. But I think that you (OSD people) really want to keep the write/read interface. Sorry, I think that there is no workaround to support iovec in bsg. I don't care about read/write in particular. But we do need some way to launch asynchronous SCSI commands, and currently read/write are the only way to do that in bsg. The reason is to keep multiple spindles busy at the same time. Won't multi-threading the ioctl calls achieve the same effect? Or do you trip over BKL there? There's no BKL on (new) ioctls anymore, at least. A thread per device would be feasible perhaps. But if you want any sort of pipelining out of the device, esp. in the remote iSCSI case, you need to have a good number of commands outstanding to each device. So a thread per command per device. Typical iSCSI queue depth of 128 times 16 devices for a small setup is a lot of threads. The pthread/pipe latency overhead is not insignificant for fast storage networks too. 
How about these new ioctls instead of read/write: SG_IO_SUBMIT - start a new blk_execute_rq_nowait() SG_IO_TEST - complete and return a previous req SG_IO_WAIT - wait for a req to finish, interruptibly Then old write users will instead do ioctl SUBMIT. Read users will do TEST for non-blocking fd, or WAIT for blocking. And SG_IO could be implemented as SUBMIT + WAIT. Then we can do compat_ioctl and convert up iovecs out-of-line before calling the normal functions. Let me know if you want a patch for this. Really, the thought of re-inventing yet another async I/O interface isn't very appealing. I'm fine with read/write, except Tomo is against handling iovecs because of the compat complexity with struct iovec being different on 32- vs 64-bit. There is a standard way to do compat ioctl that hides this handling in a different file (not bsg.c), which is the only reason I'm even considering these ioctls. I don't care about compat setups per se. Is there another async I/O mechanism? Userspace builds the CDBs, just needs some way to drop them in SCSI ML. BSG is almost perfect for this, but doesn't do iovec, leading to lots of memcpy. -- Pete
Re: [PATCH] bsg : Add support for io vectors in bsg
[EMAIL PROTECTED] wrote on Sat, 05 Jan 2008 14:01 +0900: From: Deepak Colluru [EMAIL PROTECTED] Subject: [PATCH] bsg : Add support for io vectors in bsg Date: Fri, 4 Jan 2008 21:47:34 +0530 (IST) From: Deepak Colluru [EMAIL PROTECTED] Add support for io vectors in bsg. Signed-off-by: Deepak Colluru [EMAIL PROTECTED] --- bsg.c | 52 +--- 1 file changed, 49 insertions(+), 3 deletions(-) Thanks, but I have to NACK this. You can find the discussion about bsg io vector support and a similar patch in the linux-scsi archive. I have no plan to support it since it needs the compat hack. You may recall this is one of the patches I need to use bsg with OSD devices. OSDs overload the SCSI buffer model to put multiple fields in dataout and datain. Some is user data, but some is more logically created by a library. Memcpying in userspace to wedge all the segments into a single buffer is painful, and is required both on outgoing and incoming data buffers. There are two approaches to add iovec to bsg.

1. Define a new sg_iovec_v4 that uses constant width types. Both 32- and 64-bit userspace would hand arrays of this to the kernel.

struct sg_v4_iovec {
	__u64 iov_base;
	__u32 iov_len;
	__u32 __pad1;
};

Old patch here: http://article.gmane.org/gmane.linux.scsi/30461/

2. Do as Deepak has done, using the existing sg_iovec, but then also work around the compat issue. Old v3 sg_iovec is:

typedef struct sg_iovec	/* same structure as used by readv() Linux system */
{			/* call.  It defines one scatter-gather element. */
	void __user *iov_base;	/* Starting address */
	size_t iov_len;		/* Length in bytes */
} sg_iovec_t;

Old patch here: http://article.gmane.org/gmane.linux.scsi/30460/

I took another look at the compat approach, to see if it is feasible to keep the compat handling somewhere else, without the use of #ifdef CONFIG_COMPAT and size-comparison code inside bsg.c. I don't see how. The use of iovec is within a write operation on a char device. 
It's not amenable to a compat_sys_ or a .compat_ioctl approach. I'm partial to #1 because the use of architecture-independent fields matches the rest of struct sg_io_v4. But if you don't want to have another iovec type in the kernel, could we do #2 but just return -EINVAL if the need for compat is detected? I.e. change dout_iovec_count to dout_iovec_length and do the math? -- Pete
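The reason approach #1 sidesteps the compat problem is that its fixed-width layout is identical on every ABI. A quick userspace check of that claim, with uint64_t/uint32_t standing in for the kernel's __u64/__u32 so it compiles outside the kernel (that substitution is the only assumption here):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Approach #1: architecture-independent iovec for sg_io_v4.
 * uint64_t/uint32_t stand in for the kernel's __u64/__u32. */
struct sg_v4_iovec {
	uint64_t iov_base;  /* user pointer carried as a 64-bit integer */
	uint32_t iov_len;   /* length in bytes */
	uint32_t __pad1;    /* keeps the struct padding-free and 8-byte aligned */
};
```

sizeof(struct sg_v4_iovec) is 16 everywhere, whereas sizeof(struct iovec) is 8 under a 32-bit ABI and 16 under a 64-bit one; that size and field-offset mismatch is exactly what the compat layer would otherwise have to translate.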
Re: [PATCH 2/2] relax scsi dma alignment
[EMAIL PROTECTED] wrote on Mon, 31 Dec 2007 16:43 -0600: This patch relaxes the default SCSI DMA alignment from 512 bytes to 4 bytes. I remember from previous discussions that usb and firewire have sector size alignment requirements, so I upped their alignments in the respective slave allocs. The reason for doing this is so that we don't get such a huge amount of copy overhead in bio_copy_user() for udev. (basically all inquiries it issues can now be directly mapped). Great change. Here's another patch. It allows a queue dma_alignment of 0 bytes, for drivers that can move data at byte granularity. Two drivers try to turn off DMA alignment restrictions by setting the dma_alignment to zero:

drivers/scsi/iscsi_tcp.c:	blk_queue_dma_alignment(sdev->request_queue, 0);
drivers/scsi/scsi_tgt_lib.c:	blk_queue_dma_alignment(q, 0);

But they end up doing copies due to the way that queue_dma_alignment() returns 511 in this case. -- Pete

---
From 90ee6d61ad71a024815eee2b416edb40c6b01256 Mon Sep 17 00:00:00 2001
From: Pete Wyckoff [EMAIL PROTECTED]
Date: Tue, 1 Jan 2008 10:23:02 -0500
Subject: [PATCH] block: allow queue dma_alignment of zero

Let queue_dma_alignment return 0 if it was specifically set to 0. This permits devices with no particular alignment restrictions to use arbitrary user space buffers without copying.

Signed-off-by: Pete Wyckoff [EMAIL PROTECTED]
---
 include/linux/blkdev.h | 7 +------
 1 files changed, 1 insertions(+), 6 deletions(-)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index aa3090c..eea31c2 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -860,12 +860,7 @@ static inline int bdev_hardsect_size(struct block_device *bdev)
 static inline int queue_dma_alignment(struct request_queue *q)
 {
-	int retval = 511;
-
-	if (q && q->dma_alignment)
-		retval = q->dma_alignment;
-
-	return retval;
+	return q ? q->dma_alignment : 511;
 }

 /* assumes size > 256 */
--
1.5.3.6
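The behavioral difference the patch makes can be seen by lifting both versions of the accessor into freestanding C (a model for illustration, not the kernel header itself):

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for the kernel's request_queue. */
struct request_queue {
	int dma_alignment;
};

/* Before the patch: an explicit 0 falls back to the 511 default,
 * so drivers cannot actually disable the alignment restriction. */
static int queue_dma_alignment_old(struct request_queue *q)
{
	int retval = 511;

	if (q && q->dma_alignment)
		retval = q->dma_alignment;

	return retval;
}

/* After the patch: an explicit 0 is honored; only a NULL queue
 * yields the 511 default. */
static int queue_dma_alignment_new(struct request_queue *q)
{
	return q ? q->dma_alignment : 511;
}
```

Only the dma_alignment == 0 case changes; nonzero values and the NULL-queue default behave identically before and after.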
Re: [PATCH 2/2] relax scsi dma alignment
[EMAIL PROTECTED] wrote on Tue, 01 Jan 2008 10:27 -0600: On Tue, 2008-01-01 at 11:08 -0500, Pete Wyckoff wrote: [EMAIL PROTECTED] wrote on Mon, 31 Dec 2007 16:43 -0600: This patch relaxes the default SCSI DMA alignment from 512 bytes to 4 bytes. I remember from previous discussions that usb and firewire have sector size alignment requirements, so I upped their alignments in the respective slave allocs. The reason for doing this is so that we don't get such a huge amount of copy overhead in bio_copy_user() for udev. (basically all inquiries it issues can now be directly mapped). Great change. Here's another patch. It allows a queue dma_alignment of 0 bytes, for drivers that can move data at byte granularity. Two drivers try to turn off DMA alignment restrictions by setting the dma_alignment to zero:

drivers/scsi/iscsi_tcp.c:	blk_queue_dma_alignment(sdev->request_queue, 0);
drivers/scsi/scsi_tgt_lib.c:	blk_queue_dma_alignment(q, 0);

But they end up doing copies due to the way that queue_dma_alignment() returns 511 in this case. [..]

From 90ee6d61ad71a024815eee2b416edb40c6b01256 Mon Sep 17 00:00:00 2001
From: Pete Wyckoff [EMAIL PROTECTED]
Date: Tue, 1 Jan 2008 10:23:02 -0500
Subject: [PATCH] block: allow queue dma_alignment of zero

Let queue_dma_alignment return 0 if it was specifically set to 0. This permits devices with no particular alignment restrictions to use arbitrary user space buffers without copying.

Signed-off-by: Pete Wyckoff [EMAIL PROTECTED]
---
 include/linux/blkdev.h | 7 +------
 1 files changed, 1 insertions(+), 6 deletions(-)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index aa3090c..eea31c2 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -860,12 +860,7 @@ static inline int bdev_hardsect_size(struct block_device *bdev)
 static inline int queue_dma_alignment(struct request_queue *q)
 {
-	int retval = 511;
-
-	if (q && q->dma_alignment)
-		retval = q->dma_alignment;
-
-	return retval;
+	return q ? q->dma_alignment : 511;
 }

 /* assumes size > 256 */

This is certainly a possible patch. What assurance do you have that it doesn't upset block devices who set zero assuming they get the default alignment? Code inspection of the initialization and use of this field. I hope anybody who sees a mistake here will point it out. 1. Initialization: Most users call blk_init_queue{,_node} to alloc and initialize a new request_queue. This leads to initialization of dma_alignment to 511 in blk_queue_make_request(). The rest of the callers use blk_alloc_queue{,_node}. Most of those call blk_queue_make_request() with a custom make_request_fn a few lines later, similarly causing dma_alignment to be initialized to non-zero. The loop and pktcdvd drivers require a bit of reading to convince oneself, but also can be seen to call blk_queue_make_request() before using the queue. 2. Use of blk_queue_dma_alignment() to set, queue_dma_alignment() and ->dma_alignment to get: Inspection of the 20-odd spots that match dma_alignment shows that none of them set zero to expect the default. Definitely a valid concern, but it seems to be a safe change, at least for in-tree users. Do you think a new request_queue flag for zero-aware drivers and a WARN_ON that checks for zero and !zero_aware would be worthwhile? Without this change, we can go as low as two-byte alignment on buffer start and length by setting dma_alignment to 1, but will do a full copy if (buffer & 1) or (length & 1). -- Pete
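The last point, that dma_alignment of 1 still forces a bounce-buffer copy for any odd buffer address or odd length, comes down to a mask test. A simplified model of that decision (illustrative only, not the actual block-layer code):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of the block layer's choice: bounce-buffer the I/O when either
 * the user buffer address or the length trips the dma_alignment mask.
 * With the patch above, an alignment of 0 means direct mapping always. */
static int needs_copy(uintptr_t addr, size_t len, unsigned dma_alignment)
{
	return ((addr | len) & dma_alignment) != 0;
}
```

So dma_alignment = 1 still copies for (odd address) or (odd length), while dma_alignment = 0 never copies, which is what iscsi_tcp and scsi_tgt_lib were asking for.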
[PATCH 1/5] iscsi: extended cdb support
From: Boaz Harrosh [EMAIL PROTECTED]

Once varlen cdbs are supported by the block and scsi-ml layers we can apply this patch to support extended CDBs in iscsi.

Signed-off-by: Boaz Harrosh [EMAIL PROTECTED]
Signed-off-by: Pete Wyckoff [EMAIL PROTECTED]
---
 drivers/scsi/iscsi_tcp.c   |  2 +-
 drivers/scsi/libiscsi.c    | 56 ++++++++++++++++++++++++++++++++-----
 include/scsi/iscsi_proto.h |  6 +++-
 3 files changed, 56 insertions(+), 8 deletions(-)

diff --git a/drivers/scsi/iscsi_tcp.c b/drivers/scsi/iscsi_tcp.c
index e5be5fd..9fde5ce 100644
--- a/drivers/scsi/iscsi_tcp.c
+++ b/drivers/scsi/iscsi_tcp.c
@@ -1973,7 +1973,7 @@ static struct iscsi_transport iscsi_tcp_transport = {
 	.host_template		= &iscsi_sht,
 	.conndata_size		= sizeof(struct iscsi_conn),
 	.max_conn		= 1,
-	.max_cmd_len		= 16,
+	.max_cmd_len		= SCSI_MAX_VARLEN_CDB_SIZE,
 	/* session management */
 	.create_session		= iscsi_tcp_session_create,
 	.destroy_session	= iscsi_tcp_session_destroy,
diff --git a/drivers/scsi/libiscsi.c b/drivers/scsi/libiscsi.c
index 553168a..94a8046 100644
--- a/drivers/scsi/libiscsi.c
+++ b/drivers/scsi/libiscsi.c
@@ -137,6 +137,45 @@ static int iscsi_add_hdr(struct iscsi_cmd_task *ctask, unsigned len)
 	return 0;
 }

+/*
+ * make an extended cdb AHS
+ */
+static int iscsi_prep_ecdb_ahs(struct iscsi_cmd_task *ctask)
+{
+	struct scsi_cmnd *cmd = ctask->sc;
+	unsigned rlen, pad_len;
+	unsigned short ahslength;
+	struct iscsi_ecdb_ahdr *ecdb_ahdr;
+	int rc;
+
+	ecdb_ahdr = iscsi_next_hdr(ctask);
+	rlen = cmd->cmd_len - ISCSI_CDB_SIZE;
+
+	BUG_ON(rlen > sizeof(ecdb_ahdr->ecdb));
+	ahslength = rlen + sizeof(ecdb_ahdr->reserved);
+
+	pad_len = iscsi_padding(rlen);
+
+	rc = iscsi_add_hdr(ctask, sizeof(ecdb_ahdr->ahslength) +
+			   sizeof(ecdb_ahdr->ahstype) + ahslength + pad_len);
+	if (rc)
+		return rc;
+
+	if (pad_len)
+		memset(&ecdb_ahdr->ecdb[rlen], 0, pad_len);
+
+	ecdb_ahdr->ahslength = cpu_to_be16(ahslength);
+	ecdb_ahdr->ahstype = ISCSI_AHSTYPE_CDB;
+	ecdb_ahdr->reserved = 0;
+	memcpy(ecdb_ahdr->ecdb, cmd->cmnd + ISCSI_CDB_SIZE, rlen);
+
+	debug_scsi("iscsi_prep_ecdb_ahs: varlen_cdb_len %d "
+		   "rlen %d pad_len %d ahs_length %d iscsi_headers_size %u\n",
+		   cmd->cmd_len, rlen, pad_len, ahslength, ctask->hdr_len);
+
+	return 0;
+}
+
 /**
  * iscsi_prep_scsi_cmd_pdu - prep iscsi scsi cmd pdu
  * @ctask: iscsi cmd task
@@ -150,7 +189,7 @@ static int iscsi_prep_scsi_cmd_pdu(struct iscsi_cmd_task *ctask)
 	struct iscsi_session *session = conn->session;
 	struct iscsi_cmd *hdr = ctask->hdr;
 	struct scsi_cmnd *sc = ctask->sc;
-	unsigned hdrlength;
+	unsigned hdrlength, cmd_len;
 	int rc;

 	ctask->hdr_len = 0;
@@ -165,10 +204,17 @@ static int iscsi_prep_scsi_cmd_pdu(struct iscsi_cmd_task *ctask)
 	hdr->cmdsn = cpu_to_be32(session->cmdsn);
 	session->cmdsn++;
 	hdr->exp_statsn = cpu_to_be32(conn->exp_statsn);
-	memcpy(hdr->cdb, sc->cmnd, sc->cmd_len);
-	if (sc->cmd_len < MAX_COMMAND_SIZE)
-		memset(&hdr->cdb[sc->cmd_len], 0,
-			MAX_COMMAND_SIZE - sc->cmd_len);
+	cmd_len = sc->cmd_len;
+	if (cmd_len < ISCSI_CDB_SIZE)
+		memset(&hdr->cdb[cmd_len], 0,
+			ISCSI_CDB_SIZE - cmd_len);
+	else if (cmd_len > ISCSI_CDB_SIZE) {
+		rc = iscsi_prep_ecdb_ahs(ctask);
+		if (rc)
+			return rc;
+		cmd_len = ISCSI_CDB_SIZE;
+	}
+	memcpy(hdr->cdb, sc->cmnd, cmd_len);

 	ctask->imm_count = 0;
 	if (sc->sc_data_direction == DMA_TO_DEVICE) {
diff --git a/include/scsi/iscsi_proto.h b/include/scsi/iscsi_proto.h
index 318a909..8c591ac 100644
--- a/include/scsi/iscsi_proto.h
+++ b/include/scsi/iscsi_proto.h
@@ -112,6 +112,7 @@ struct iscsi_ahs_hdr {

 #define ISCSI_AHSTYPE_CDB 1
 #define ISCSI_AHSTYPE_RLENGTH 2
+#define ISCSI_CDB_SIZE 16

 /* iSCSI PDU Header */
 struct iscsi_cmd {
@@ -125,7 +126,7 @@ struct iscsi_cmd {
 	__be32 data_length;
 	__be32 cmdsn;
 	__be32 exp_statsn;
-	uint8_t cdb[16];	/* SCSI Command Block */
+	uint8_t cdb[ISCSI_CDB_SIZE];	/* SCSI Command Block */
 	/* Additional Data (Command Dependent) */
 };

@@ -154,7 +155,8 @@ struct iscsi_ecdb_ahdr {
 	__be16 ahslength;	/* CDB length - 15, including reserved byte */
 	uint8_t ahstype;
 	uint8_t reserved;
-	uint8_t ecdb[260 - 16];	/* 4-byte aligned extended CDB spillover */
+	/* 4-byte aligned extended CDB spillover */
+	uint8_t ecdb[260 - ISCSI_CDB_SIZE];
 };

 /* SCSI Response Header */
--
1.5.3.6
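The AHS sizing arithmetic in iscsi_prep_ecdb_ahs() can be checked standalone. The helper below is hypothetical and mirrors only the patch's computation; pad4() stands in for the kernel's iscsi_padding():

```c
#include <assert.h>

#define ISCSI_CDB_SIZE 16

/* Bytes needed to round len up to a 4-byte boundary (stand-in for
 * the kernel's iscsi_padding()). */
static unsigned pad4(unsigned len)
{
	return (4 - (len & 3)) & 3;
}

struct ahs_sizes {
	unsigned rlen;       /* CDB bytes beyond the basic 16 */
	unsigned pad_len;    /* zero-fill to a 4-byte boundary */
	unsigned ahslength;  /* rlen + 1 reserved byte, as in the patch */
};

/* Compute the extended-CDB AHS sizes for a varlen CDB of cmd_len bytes. */
static struct ahs_sizes ecdb_ahs_sizes(unsigned cmd_len)
{
	struct ahs_sizes s;

	s.rlen = cmd_len - ISCSI_CDB_SIZE;
	s.pad_len = pad4(s.rlen);
	s.ahslength = s.rlen + 1;
	return s;
}
```

For a 32-byte varlen CDB this gives rlen 16, no padding, and ahslength 17, matching the header comment "CDB length - 15, including reserved byte" (32 - 15 = 17).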
[PATCH 4/5] bsg: bidi fix
Fixes for bsg to handle bidirectional commands. The next_rq part of a bidi command must be marked REQ_TYPE_BLOCK_PC for scsi_init_sgtable.

Signed-off-by: Pete Wyckoff [EMAIL PROTECTED]
---
 block/bsg.c | 1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/block/bsg.c b/block/bsg.c
index 0d9364d..fc3a024 100644
--- a/block/bsg.c
+++ b/block/bsg.c
@@ -291,6 +291,7 @@ bsg_map_hdr(struct bsg_device *bd, struct sg_io_v4 *hdr)
 		goto out;
 	}
 	rq->next_rq = next_rq;
+	next_rq->cmd_type = REQ_TYPE_BLOCK_PC;
 	dxferp = (void*)(unsigned long)hdr->din_xferp;
 	ret = blk_rq_map_user(q, next_rq, dxferp, hdr->din_xfer_len);
--
1.5.3.6
[PATCH 1/5] iscsi: extended cdb support
[Resend, date was in ancient history.]

From: Boaz Harrosh [EMAIL PROTECTED]

Once varlen cdbs are supported by the block and scsi-ml layers we can apply
this patch to support extended CDBs in iscsi.

Signed-off-by: Boaz Harrosh [EMAIL PROTECTED]
Signed-off-by: Pete Wyckoff [EMAIL PROTECTED]
---
 drivers/scsi/iscsi_tcp.c   |    2 +-
 drivers/scsi/libiscsi.c    |   56 ++++++++++++++++++++++++++++++++++-----
 include/scsi/iscsi_proto.h |    6 +++-
 3 files changed, 56 insertions(+), 8 deletions(-)

diff --git a/drivers/scsi/iscsi_tcp.c b/drivers/scsi/iscsi_tcp.c
index e5be5fd..9fde5ce 100644
--- a/drivers/scsi/iscsi_tcp.c
+++ b/drivers/scsi/iscsi_tcp.c
@@ -1973,7 +1973,7 @@ static struct iscsi_transport iscsi_tcp_transport = {
 	.host_template		= &iscsi_sht,
 	.conndata_size		= sizeof(struct iscsi_conn),
 	.max_conn		= 1,
-	.max_cmd_len		= 16,
+	.max_cmd_len		= SCSI_MAX_VARLEN_CDB_SIZE,
 	/* session management */
 	.create_session		= iscsi_tcp_session_create,
 	.destroy_session	= iscsi_tcp_session_destroy,
diff --git a/drivers/scsi/libiscsi.c b/drivers/scsi/libiscsi.c
index 553168a..94a8046 100644
--- a/drivers/scsi/libiscsi.c
+++ b/drivers/scsi/libiscsi.c
@@ -137,6 +137,45 @@ static int iscsi_add_hdr(struct iscsi_cmd_task *ctask, unsigned len)
 	return 0;
 }
 
+/*
+ * make an extended cdb AHS
+ */
+static int iscsi_prep_ecdb_ahs(struct iscsi_cmd_task *ctask)
+{
+	struct scsi_cmnd *cmd = ctask->sc;
+	unsigned rlen, pad_len;
+	unsigned short ahslength;
+	struct iscsi_ecdb_ahdr *ecdb_ahdr;
+	int rc;
+
+	ecdb_ahdr = iscsi_next_hdr(ctask);
+	rlen = cmd->cmd_len - ISCSI_CDB_SIZE;
+
+	BUG_ON(rlen > sizeof(ecdb_ahdr->ecdb));
+	ahslength = rlen + sizeof(ecdb_ahdr->reserved);
+
+	pad_len = iscsi_padding(rlen);
+
+	rc = iscsi_add_hdr(ctask, sizeof(ecdb_ahdr->ahslength) +
+			   sizeof(ecdb_ahdr->ahstype) + ahslength + pad_len);
+	if (rc)
+		return rc;
+
+	if (pad_len)
+		memset(&ecdb_ahdr->ecdb[rlen], 0, pad_len);
+
+	ecdb_ahdr->ahslength = cpu_to_be16(ahslength);
+	ecdb_ahdr->ahstype = ISCSI_AHSTYPE_CDB;
+	ecdb_ahdr->reserved = 0;
+	memcpy(ecdb_ahdr->ecdb, cmd->cmnd + ISCSI_CDB_SIZE, rlen);
+
+	debug_scsi("iscsi_prep_ecdb_ahs: varlen_cdb_len %d "
+		   "rlen %d pad_len %d ahs_length %d iscsi_headers_size %u\n",
+		   cmd->cmd_len, rlen, pad_len, ahslength, ctask->hdr_len);
+
+	return 0;
+}
+
 /**
  * iscsi_prep_scsi_cmd_pdu - prep iscsi scsi cmd pdu
  * @ctask: iscsi cmd task
@@ -150,7 +189,7 @@ static int iscsi_prep_scsi_cmd_pdu(struct iscsi_cmd_task *ctask)
 	struct iscsi_session *session = conn->session;
 	struct iscsi_cmd *hdr = ctask->hdr;
 	struct scsi_cmnd *sc = ctask->sc;
-	unsigned hdrlength;
+	unsigned hdrlength, cmd_len;
 	int rc;
 
 	ctask->hdr_len = 0;
@@ -165,10 +204,17 @@ static int iscsi_prep_scsi_cmd_pdu(struct iscsi_cmd_task *ctask)
 	hdr->cmdsn = cpu_to_be32(session->cmdsn);
 	session->cmdsn++;
 	hdr->exp_statsn = cpu_to_be32(conn->exp_statsn);
-	memcpy(hdr->cdb, sc->cmnd, sc->cmd_len);
-	if (sc->cmd_len < MAX_COMMAND_SIZE)
-		memset(&hdr->cdb[sc->cmd_len], 0,
-			MAX_COMMAND_SIZE - sc->cmd_len);
+	cmd_len = sc->cmd_len;
+	if (cmd_len < ISCSI_CDB_SIZE)
+		memset(&hdr->cdb[cmd_len], 0,
+			ISCSI_CDB_SIZE - cmd_len);
+	else if (cmd_len > ISCSI_CDB_SIZE) {
+		rc = iscsi_prep_ecdb_ahs(ctask);
+		if (rc)
+			return rc;
+		cmd_len = ISCSI_CDB_SIZE;
+	}
+	memcpy(hdr->cdb, sc->cmnd, cmd_len);
 
 	ctask->imm_count = 0;
 	if (sc->sc_data_direction == DMA_TO_DEVICE) {
diff --git a/include/scsi/iscsi_proto.h b/include/scsi/iscsi_proto.h
index 318a909..8c591ac 100644
--- a/include/scsi/iscsi_proto.h
+++ b/include/scsi/iscsi_proto.h
@@ -112,6 +112,7 @@ struct iscsi_ahs_hdr {
 
 #define ISCSI_AHSTYPE_CDB		1
 #define ISCSI_AHSTYPE_RLENGTH		2
+#define ISCSI_CDB_SIZE			16
 
 /* iSCSI PDU Header */
 struct iscsi_cmd {
@@ -125,7 +126,7 @@ struct iscsi_cmd {
 	__be32 data_length;
 	__be32 cmdsn;
 	__be32 exp_statsn;
-	uint8_t cdb[16];	/* SCSI Command Block */
+	uint8_t cdb[ISCSI_CDB_SIZE];	/* SCSI Command Block */
 	/* Additional Data (Command Dependent) */
 };
 
@@ -154,7 +155,8 @@ struct iscsi_ecdb_ahdr {
 	__be16 ahslength;	/* CDB length - 15, including reserved byte */
 	uint8_t ahstype;
 	uint8_t reserved;
-	uint8_t ecdb[260 - 16];	/* 4-byte aligned extended CDB spillover */
+	/* 4-byte aligned extended CDB spillover */
+	uint8_t ecdb[260 - ISCSI_CDB_SIZE];
 };
 
 /* SCSI Response Header */
--
1.5.3.6
[PATCH 1/3] bsg bidi block pc
Set cmd_type on rq->next_rq to BLOCK_PC so that scsi_init_sgtable knows to look at req->data_len rather than nr_sectors.

Signed-off-by: Pete Wyckoff [EMAIL PROTECTED]
---
 block/bsg.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/block/bsg.c b/block/bsg.c
index ba4353a..eb0aaf4 100644
--- a/block/bsg.c
+++ b/block/bsg.c
@@ -278,6 +278,7 @@ bsg_map_hdr(struct bsg_device *bd, struct sg_io_v4 *hdr)
 			ret = -ENOMEM;
 			goto out;
 		}
+		next_rq->cmd_type = REQ_TYPE_BLOCK_PC;
 		rq->next_rq = next_rq;
 
 		dxferp = (void*)(unsigned long)hdr->din_xferp;
--
1.5.2.4
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/3] bsg bidirectional and variable length commands
There are three patches here to enable using BSG to send SCSI commands across iSCSI TCP that are bidirectional and/or use variable length CDBs.

They sit on top of 2.6.23-rc2 plus Mike's iscsi git plus the 12 core patches that Boaz has for bidirectional support.  They apply to stock 2.6.23-rc2 but I don't think it is worth anybody's time to take them yet.  Instead, I think it makes sense for Boaz to hang onto these and submit them as part of the bigger bidirectional support picture.

		-- Pete
[PATCH 2/3] iscsi tcp queue bidi
Mark queue_flags that bidirectional is acceptable for iscsi_tcp, as BSG will check to make sure this bit is set.

Signed-off-by: Pete Wyckoff [EMAIL PROTECTED]
---
 drivers/scsi/iscsi_tcp.c |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/drivers/scsi/iscsi_tcp.c b/drivers/scsi/iscsi_tcp.c
index 35dd19f..8622930 100644
--- a/drivers/scsi/iscsi_tcp.c
+++ b/drivers/scsi/iscsi_tcp.c
@@ -2205,6 +2205,12 @@ static void iscsi_tcp_session_destroy(struct iscsi_cls_session *cls_session)
 	iscsi_session_teardown(cls_session);
 }
 
+static int iscsi_tcp_slave_alloc(struct scsi_device *sdev)
+{
+	set_bit(QUEUE_FLAG_BIDI, &sdev->request_queue->queue_flags);
+	return 0;
+}
+
 static int iscsi_tcp_slave_configure(struct scsi_device *sdev)
 {
 	blk_queue_bounce_limit(sdev->request_queue, BLK_BOUNCE_ANY);
@@ -2224,6 +2230,7 @@ static struct scsi_host_template iscsi_sht = {
 	.eh_abort_handler	= iscsi_eh_abort,
 	.eh_host_reset_handler	= iscsi_eh_host_reset,
 	.use_clustering		= DISABLE_CLUSTERING,
+	.slave_alloc		= iscsi_tcp_slave_alloc,
 	.slave_configure	= iscsi_tcp_slave_configure,
 	.proc_name		= "iscsi_tcp",
 	.this_id		= -1,
--
1.5.2.4
[PATCH 3/3] varlen bsg submit
Accept variable length SCSI commands through BSG.

Signed-off-by: Pete Wyckoff [EMAIL PROTECTED]
---
 block/bsg.c |   19 +++++++++++++++----
 1 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/block/bsg.c b/block/bsg.c
index eb0aaf4..c72b4f9 100644
--- a/block/bsg.c
+++ b/block/bsg.c
@@ -175,11 +175,22 @@ unlock:
 static int blk_fill_sgv4_hdr_rq(struct request_queue *q, struct request *rq,
 				struct sg_io_v4 *hdr, int has_write_perm)
 {
+	int len = hdr->request_len;
+
 	memset(rq->cmd, 0, BLK_MAX_CDB); /* ATAPI hates garbage after CDB */
 	if (copy_from_user(rq->cmd, (void *)(unsigned long)hdr->request,
-			   hdr->request_len))
+			   min(len, BLK_MAX_CDB)))
 		return -EFAULT;
 
+	if (len > BLK_MAX_CDB) {
+		rq->varlen_cdb_len = len;
+		rq->varlen_cdb = kmalloc(len, GFP_KERNEL);
+		if (rq->varlen_cdb == NULL)
+			return -ENOMEM;
+		if (copy_from_user(rq->varlen_cdb,
+				   (void *)(unsigned long)hdr->request, len))
+			return -EFAULT;
+	}
+
 	if (hdr->subprotocol == BSG_SUB_PROTOCOL_SCSI_CMD) {
 		if (blk_verify_command(rq->cmd, has_write_perm))
@@ -190,7 +201,7 @@ static int blk_fill_sgv4_hdr_rq(struct request_queue *q, struct request *rq,
 	/*
 	 * fill in request structure
 	 */
-	rq->cmd_len = hdr->request_len;
+	rq->cmd_len = min(len, BLK_MAX_CDB);
 	rq->cmd_type = REQ_TYPE_BLOCK_PC;
 	rq->timeout = (hdr->timeout * HZ) / 1000;
@@ -212,8 +223,6 @@ bsg_validate_sgv4_hdr(struct request_queue *q, struct sg_io_v4 *hdr, int *rw)
 	if (hdr->guard != 'Q')
 		return -EINVAL;
-	if (hdr->request_len > BLK_MAX_CDB)
-		return -EINVAL;
 	if (hdr->dout_xfer_len > (q->max_sectors << 9) ||
 	    hdr->din_xfer_len > (q->max_sectors << 9))
 		return -EIO;
@@ -303,6 +312,7 @@ bsg_map_hdr(struct bsg_device *bd, struct sg_io_v4 *hdr)
 	}
 	return rq;
 out:
+	kfree(rq->varlen_cdb);
 	blk_put_request(rq);
 	if (next_rq) {
 		blk_rq_unmap_user(next_rq->bio);
@@ -443,6 +453,7 @@ static int blk_complete_sgv4_hdr_rq(struct request *rq, struct sg_io_v4 *hdr,
 	}
 
 	blk_rq_unmap_user(bio);
+	kfree(rq->varlen_cdb);
 	blk_put_request(rq);
 
 	return ret;
--
1.5.2.4
Announcing new open source iSER (iSCSI/RDMA) target
We are releasing code to add support for iSCSI Extensions for RDMA (iSER) to the existing STGT user space SCSI target.  It uses OpenFabrics libraries and kernel drivers to act as a SCSI target over RDMA-capable devices.  The code has been tested against the existing Linux iSER initiator over InfiniBand cards, but should be specification compliant and work generally.

A bit of documentation is included, and a short technical report is available at http://www.osc.edu/~pw/papers/iser-techreport.pdf .

For performance, a single SCSI client using iSCSI over gigabit ethernet does 100 MB/s.  iSCSI over IPoIB gets 200 MB/s, and iSER over native IB sees 500 MB/s.

More information on STGT is available at http://stgt.berlios.de .  The seven iSER patches can be downloaded from:

    git://git.osc.edu/tgt

or browsed at:

    http://git.osc.edu/?p=tgt.git;a=summary

New and modified files are distributed under a GPLv2 license.  I'll submit individual patches to stgt-devel for review and eventual inclusion in STGT.

		-- Pete
Re: [PATCH 3/4] bsg: add SCSI transport-level request support
[EMAIL PROTECTED] wrote on Sat, 09 Jun 2007 00:12 +0900:
> SCSI transport-level requests such as SAS management protocol (SMP)
> skip blk_verify_command().
>
> Signed-off-by: FUJITA Tomonori [EMAIL PROTECTED]
> ---
>  block/bsg.c         |   27 +++++++++++++++++++++------
>  include/linux/bsg.h |    6 ++++++
>  2 files changed, 27 insertions(+), 6 deletions(-)
>
> diff --git a/block/bsg.c b/block/bsg.c
> index 13ecc95..f2a9979 100644
> --- a/block/bsg.c
> +++ b/block/bsg.c
> @@ -208,7 +208,11 @@ static int blk_fill_sgv4_hdr_rq(request_
>  	if (copy_from_user(rq->cmd, (void *)(unsigned long)hdr->request,
>  			   hdr->request_len))
>  		return -EFAULT;
> -	if (blk_verify_command(rq->cmd, has_write_perm))
> +
> +	if (hdr->subprotocol == BSG_SUB_PROTOCOL_SCSI_CMD) {
> +		if (blk_verify_command(rq->cmd, has_write_perm))
> +			return -EPERM;
> +	} else if (capable(CAP_SYS_RAWIO))
>  		return -EPERM;

Do you mean !capable here?

		-- Pete
[PATCH resend] bsg: return SAM device status code
Use the status codes from the spec, not the shifted-by-one codes that are marked deprecated in scsi.h.  This makes bsg v4 status report the same value as sg v3 status too.

Signed-off-by: Pete Wyckoff [EMAIL PROTECTED]
Acked-by: Douglas Gilbert [EMAIL PROTECTED]
---
 block/bsg.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/block/bsg.c b/block/bsg.c
index f0753c0..92be6fa 100644
--- a/block/bsg.c
+++ b/block/bsg.c
@@ -440,7 +440,7 @@ static int blk_complete_sgv4_hdr_rq(struct request *rq, struct sg_io_v4 *hdr,
 	/*
 	 * fill in all the output members
 	 */
-	hdr->device_status = status_byte(rq->errors);
+	hdr->device_status = rq->errors & 0xff;
 	hdr->transport_status = host_byte(rq->errors);
 	hdr->driver_status = driver_byte(rq->errors);
 	hdr->info = 0;
--
1.5.0.6
[PATCH] bsg: port to bidi
Changes required to make bsg patches work on top of bidi patches. Adds capability to bsg to handle bidirectional commands and extended CDBs. Signed-off-by: Pete Wyckoff [EMAIL PROTECTED] --- block/bsg.c| 106 ++-- block/ll_rw_blk.c | 30 +- include/linux/blkdev.h |1 + 3 files changed, 87 insertions(+), 50 deletions(-) diff --git a/block/bsg.c b/block/bsg.c index 92be6fa..9d09505 100644 --- a/block/bsg.c +++ b/block/bsg.c @@ -95,6 +95,7 @@ struct bsg_command { struct list_head list; struct request *rq; struct bio *bio; + struct bio *bidi_read_bio; int err; struct sg_io_v4 hdr; struct sg_io_v4 __user *uhdr; @@ -225,18 +226,31 @@ static struct bsg_command *bsg_get_command(struct bsg_device *bd) static int blk_fill_sgv4_hdr_rq(request_queue_t *q, struct request *rq, struct sg_io_v4 *hdr, int has_write_perm) { + int len = hdr-request_len; + int cmd_len = min(len, BLK_MAX_CDB); + memset(rq-cmd, 0, BLK_MAX_CDB); /* ATAPI hates garbage after CDB */ if (copy_from_user(rq-cmd, (void *)(unsigned long)hdr-request, - hdr-request_len)) + cmd_len)) return -EFAULT; + if (len BLK_MAX_CDB) { + rq-varlen_cdb_len = len; + rq-varlen_cdb = kmalloc(len, GFP_KERNEL); + if (rq-varlen_cdb == NULL) + return -ENOMEM; + if (copy_from_user(rq-varlen_cdb, + (void *)(unsigned long)hdr-request, len)) + return -EFAULT; + } + if (blk_verify_command(rq-cmd, has_write_perm)) return -EPERM; /* * fill in request structure */ - rq-cmd_len = hdr-request_len; + rq-cmd_len = cmd_len; rq-cmd_type = REQ_TYPE_BLOCK_PC; rq-timeout = (hdr-timeout * HZ) / 1000; @@ -252,12 +266,11 @@ static int blk_fill_sgv4_hdr_rq(request_queue_t *q, struct request *rq, * Check if sg_io_v4 from user is allowed and valid */ static int -bsg_validate_sgv4_hdr(request_queue_t *q, struct sg_io_v4 *hdr, int *rw) +bsg_validate_sgv4_hdr(request_queue_t *q, struct sg_io_v4 *hdr, + enum dma_data_direction *dir) { if (hdr-guard != 'Q') return -EINVAL; - if (hdr-request_len BLK_MAX_CDB) - return -EINVAL; if (hdr-dout_xfer_len (q-max_sectors 9) 
|| hdr-din_xfer_len (q-max_sectors 9)) return -EIO; @@ -266,17 +279,15 @@ bsg_validate_sgv4_hdr(request_queue_t *q, struct sg_io_v4 *hdr, int *rw) if (hdr-protocol || hdr-subprotocol) return -EINVAL; - /* -* looks sane, if no data then it should be fine from our POV -*/ - if (!hdr-dout_xfer_len !hdr-din_xfer_len) - return 0; - - /* not supported currently */ - if (hdr-dout_xfer_len hdr-din_xfer_len) - return -EINVAL; - - *rw = hdr-dout_xfer_len ? WRITE : READ; + if (hdr-dout_xfer_len) { + if (hdr-din_xfer_len) + *dir = DMA_BIDIRECTIONAL; + else + *dir = DMA_TO_DEVICE; + } else if (hdr-din_xfer_len) + *dir = DMA_FROM_DEVICE; + else + *dir = DMA_NONE; return 0; } @@ -289,7 +300,8 @@ bsg_map_hdr(struct bsg_device *bd, struct sg_io_v4 *hdr) { request_queue_t *q = bd-queue; struct request *rq; - int ret, rw = 0; /* shut up gcc */ + enum dma_data_direction dir; + int ret; unsigned int dxfer_len; void *dxferp = NULL; @@ -297,39 +309,45 @@ bsg_map_hdr(struct bsg_device *bd, struct sg_io_v4 *hdr) hdr-dout_xfer_len, (unsigned long long) hdr-din_xferp, hdr-din_xfer_len); - ret = bsg_validate_sgv4_hdr(q, hdr, rw); + ret = bsg_validate_sgv4_hdr(q, hdr, dir); if (ret) return ERR_PTR(ret); /* * map scatter-gather elements seperately and string them to request */ - rq = blk_get_request(q, rw, GFP_KERNEL); + rq = blk_get_request(q, dir, GFP_KERNEL); ret = blk_fill_sgv4_hdr_rq(q, rq, hdr, test_bit(BSG_F_WRITE_PERM, bd-flags)); - if (ret) { - blk_put_request(rq); - return ERR_PTR(ret); - } + if (ret) + goto errout; if (hdr-dout_xfer_len) { dxfer_len = hdr-dout_xfer_len; dxferp = (void*)(unsigned long)hdr-dout_xferp; - } else if (hdr-din_xfer_len) { + ret = blk_rq_map_user_bidi(q, rq, dxferp, dxfer_len, dir); + if (ret) + goto errout
[PATCH 1/2] bsg: compile with bidi
Fixes to common functions added with bsg so that they compile in a kernel that has the bidi patches.

Signed-off-by: Pete Wyckoff [EMAIL PROTECTED]
---
 block/bsg.c        |    6 +++---
 block/scsi_ioctl.c |    6 +++---
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/block/bsg.c b/block/bsg.c
index a333c93..f0753c0 100644
--- a/block/bsg.c
+++ b/block/bsg.c
@@ -368,7 +368,7 @@ static void bsg_add_command(struct bsg_device *bd, request_queue_t *q,
 	 * add bc command to busy queue and submit rq for io
 	 */
 	bc->rq = rq;
-	bc->bio = rq->bio;
+	bc->bio = rq_uni(rq)->bio;
 	bc->hdr.duration = jiffies;
 	spin_lock_irq(&bd->lock);
 	list_add_tail(&bc->list, &bd->busy_list);
@@ -446,7 +446,7 @@ static int blk_complete_sgv4_hdr_rq(struct request *rq, struct sg_io_v4 *hdr,
 	hdr->info = 0;
 	if (hdr->device_status || hdr->transport_status || hdr->driver_status)
 		hdr->info |= SG_INFO_CHECK;
-	hdr->din_resid = rq->data_len;
+	hdr->din_resid = rq_uni(rq)->data_len;
 	hdr->response_len = 0;
 
 	if (rq->sense_len && hdr->response) {
@@ -915,7 +915,7 @@ bsg_ioctl(struct inode *inode, struct file *file, unsigned int cmd,
 		if (IS_ERR(rq))
 			return PTR_ERR(rq);
 
-		bio = rq->bio;
+		bio = rq_uni(rq)->bio;
 		blk_execute_rq(bd->queue, NULL, rq, 0);
 		blk_complete_sgv4_hdr_rq(rq, hdr, bio);
diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c
index cb29ea1..c1cfae9 100644
--- a/block/scsi_ioctl.c
+++ b/block/scsi_ioctl.c
@@ -244,7 +244,7 @@ EXPORT_SYMBOL_GPL(blk_fill_sghdr_rq);
  */
 int blk_unmap_sghdr_rq(struct request *rq, struct sg_io_hdr *hdr)
 {
-	blk_rq_unmap_user(rq->bio);
+	blk_rq_unmap_user(rq_uni(rq)->bio);
 	blk_put_request(rq);
 	return 0;
 }
@@ -266,7 +266,7 @@ int blk_complete_sghdr_rq(struct request *rq, struct sg_io_hdr *hdr,
 	hdr->info = 0;
 	if (hdr->masked_status || hdr->host_status || hdr->driver_status)
 		hdr->info |= SG_INFO_CHECK;
-	hdr->resid = rq->data_len;
+	hdr->resid = rq_uni(rq)->data_len;
 	hdr->sb_len_wr = 0;
 
 	if (rq->sense_len && hdr->sbp) {
@@ -278,7 +278,7 @@ int blk_complete_sghdr_rq(struct request *rq, struct sg_io_hdr *hdr,
 			ret = -EFAULT;
 	}
 
-	rq->bio = bio;
+	rq_uni(rq)->bio = bio;
 	r = blk_unmap_sghdr_rq(rq, hdr);
 	if (ret)
 		r = ret;
--
1.5.0.6
Re: [PATCH] bsg: port to bidi
[EMAIL PROTECTED] wrote on Tue, 24 Apr 2007 13:27 -0400:
> Changes required to make bsg patches work on top of bidi patches.
> Adds capability to bsg to handle bidirectional commands and extended
> CDBs.

Oops.  This is 2/2.  Apply the "compile with bidi" one first.

		-- Pete
Re: [PATCH v2] bsg: iovec support with compat
[EMAIL PROTECTED] wrote on Tue, 20 Mar 2007 03:22 +0900: From: Pete Wyckoff [EMAIL PROTECTED] Subject: [PATCH v2] bsg: iovec support with compat Date: Mon, 19 Mar 2007 14:07:27 -0400 Adding a new bsg_write_compat would be bad. There is lots of parsing and setup before we even get to the possibility of iovec data mapping. Reusing just sg_build_ioctl from compat_ioctl.c is also suboptimal as this function is built around the idea of a contiguous sg_io_hdr + iovec in userspace. The function is small enough that splitting it into a generic + ioctl-specific part would add too much abstraction to be worth it. Here is the patch to use sg_iovec, with its userspace void * and size_t, and the CONFIG_COMPAT code to fixup 32-bit userspace. I'm not fond of having __u64 for non-iovec buffer representations, and void * for iovec buffer representations, but it saves having to build an sg_iovec just to call into the existing blk_rq_map_user_iov. Comments? The compat code should not go to bsg.c. Agreed. I dislike the entire approach of reusing sg_iovec in bsg, but thought I'd work it through to see what it is like. The compat code cannot go in compat_ioctl.c or similar, as explained above. And it seems wrong to leave it out and silently break on compat-requiring setups. Using sg_iovec with bsg is bad from a user perspective too. sg_iovec pointers are (void *). sg_io_v4 pointers are __u64. sg_iovec lengths are size_t (64-bit). sg_io_v4 lengths are __u32. Please consider instead the original proposal of a new iovec type for bsg. It is less complex than using existing sg_iovec. http://article.gmane.org/gmane.linux.scsi/30461 There is exactly one user of blk_rq_map_user_iov() in the tree: sg_io() in scsi_ioctl.c. I could convert that to sg_v4_iovec now too. The helper function bio_map_user_iov() is only used by blk_rq_map_user_iov() and can also be fixed. The only use we have for struct sg_iovec is this one sg_io() caller, and sg.c. 
Let's migrate to sg_v4_iovec at the same time we switch from sgv3 to sgv4. -- Pete - To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2] bsg: iovec support with explicit u64
[EMAIL PROTECTED] wrote on Mon, 19 Mar 2007 14:07 -0400: Here is the patch to use sg_iovec, with its userspace void * and size_t, and the CONFIG_COMPAT code to fixup 32-bit userspace. I'm not fond of having __u64 for non-iovec buffer representations, and void * for iovec buffer representations, but it saves having to build an sg_iovec just to call into the existing blk_rq_map_user_iov. Just for comparison, a cleaned up version of the earlier patch that does not require compat conversion. Instead, use a new struct sg_v4_iovec with explicit u64/u32 representation for user pointer and length. Perhaps later we might change blk_rq_map_user_iov to take sg_v4_iovec and let sgv3 convert its sg_iovec into that. -- Pete Support vectored IO as in SGv3. The iovec structure uses explicit sizes to avoid the need for compat conversion. Signed-off-by: Pete Wyckoff [EMAIL PROTECTED] --- block/bsg.c | 95 ++ include/linux/bsg.h | 14 +++ 2 files changed, 86 insertions(+), 23 deletions(-) diff --git a/block/bsg.c b/block/bsg.c index f1ea258..e334f75 100644 --- a/block/bsg.c +++ b/block/bsg.c @@ -280,6 +280,56 @@ bsg_validate_sgv4_hdr(request_queue_t *q, struct sg_io_v4 *hdr, int *rw) } /* + * Map either the in or out bufs. Convert sg_v4_iovec to sg_iovec + * to use existing blk_rq_map infrastructure. We could do a copy-in-place + * conversion on 64-bit kernels, just by zeroing the top half of sg_iovec's + * iov_len, but do not, for simplicity. 
+ */ +static int bsg_map_data(struct request_queue *q, struct request *rq, + __u64 uaddr, __u32 tot_len, __u32 numiov, + enum dma_data_direction dir) +{ + int i, ret; + void __user *ubuf = (void __user *) (unsigned long) uaddr; + struct sg_iovec fastiov[8], *iov = fastiov; + struct sg_v4_iovec v4_fastiov[8], *v4_iov = v4_fastiov; + + if (numiov == 0) { + ret = blk_rq_map_user(q, rq, ubuf, tot_len); + goto out; + } + + if (numiov = ARRAY_SIZE(fastiov)) { + iov = kmalloc(numiov * (sizeof(*iov) + sizeof(*v4_iov)), + GFP_KERNEL); + if (iov == NULL) { + ret = -ENOMEM; + goto out; + } + v4_iov = (void *) iov[numiov]; + } + + if (copy_from_user(v4_iov, ubuf, numiov * sizeof(*v4_iov))) { + ret = -EFAULT; + goto outfree; + } + + for (i=0; inumiov; i++) { + iov[i].iov_base = (void *)(unsigned long) v4_iov[i].iov_base; + iov[i].iov_len = v4_iov[i].iov_len; + } + + ret = blk_rq_map_user_iov(q, rq, iov, numiov, tot_len); + +outfree: + if (iov != fastiov) + kfree(iov); + +out: + return ret; +} + +/* * map sg_io_v4 to a request. 
*/ static struct request * @@ -288,12 +338,12 @@ bsg_map_hdr(struct bsg_device *bd, struct sg_io_v4 *hdr) request_queue_t *q = bd-queue; struct request *rq; int ret, rw = 0; /* shut up gcc */ - unsigned int dxfer_len; - void *dxferp = NULL; - dprintk(map hdr %llx/%u %llx/%u\n, (unsigned long long) hdr-dout_xferp, - hdr-dout_xfer_len, (unsigned long long) hdr-din_xferp, - hdr-din_xfer_len); + dprintk(map hdr %llx/%u/%u %llx/%u/%u\n, + (unsigned long long) hdr-dout_xferp, hdr-dout_xfer_len, + hdr-dout_iovec_count, + (unsigned long long) hdr-din_xferp, hdr-din_xfer_len, + hdr-din_iovec_count); ret = bsg_validate_sgv4_hdr(q, hdr, rw); if (ret) @@ -305,29 +355,28 @@ bsg_map_hdr(struct bsg_device *bd, struct sg_io_v4 *hdr) rq = blk_get_request(q, rw, GFP_KERNEL); ret = blk_fill_sgv4_hdr_rq(q, rq, hdr, test_bit(BSG_F_WRITE_PERM, bd-flags)); - if (ret) { - blk_put_request(rq); - return ERR_PTR(ret); - } + if (ret) + goto errout; if (hdr-dout_xfer_len) { - dxfer_len = hdr-dout_xfer_len; - dxferp = (void*)(unsigned long)hdr-dout_xferp; + ret = bsg_map_data(q, rq, hdr-dout_xferp, hdr-dout_xfer_len, + hdr-dout_iovec_count, DMA_TO_DEVICE); + if (ret) + goto errout; } else if (hdr-din_xfer_len) { - dxfer_len = hdr-din_xfer_len; - dxferp = (void*)(unsigned long)hdr-din_xferp; - } else - dxfer_len = 0; - - if (dxfer_len) { - ret = blk_rq_map_user(q, rq, dxferp, dxfer_len); - if (ret) { - dprintk(failed map at %d\n, ret); - blk_put_request(rq); - rq = ERR_PTR(ret
Re: SCSI Generic version 4 interface, release 1.2
[EMAIL PROTECTED] wrote on Sat, 17 Mar 2007 18:13 -0500: Pete Wyckoff wrote: [EMAIL PROTECTED] wrote on Wed, 14 Mar 2007 12:18 -0400: After reviewing this post by Pete Wyckoff: http://marc.theaimsgroup.com/?l=linux-scsim=117278879816029w=2 I decided to update my sg v4 interface document originally posted 20061106 which I will now call release 1.1 : http://lwn.net/Articles/208082/ Pete was proposing to put back din_iovec_count and dout_iovec_count that had been dropped out of bsg but had been in release 1.1 . Hmm. Some other items have been picked up from the bsg implementation plus the suggestion from LSF'07 to add dout_resid. See the attachment, comments welcome. Do you want to define the iovec format too? As I commented in my patch, v3 sg_iovec has pointers with 32/64-bit issues. Would be nice to see you declare v4 sg_iovec as pure u64. (By the way, don't use the patch: casting from the new to the old can put junk in the top half of the 64-bit sg_iovec.iov_len). Another issue I wonder about is queue DMA alignment. In bsg, blk_rq_map_user will use a bounce buffer if the user-supplied start and end addresses are not aligned. sg will happily map user pages at any offset without checking, although I haven't checked if Mike's patches change this. ll_rw_blk.c says regarding blk_rq_map_user: Currently and before any of my patches, sg checked the start if (((unsigned long)hp-dxferp queue_dma_alignment(sdev-request_queue)) != 0) It did not check the start and end like blk_rq_map_user. With my updated patches, blk_rq_map_user is just checking the start. There was a thread about not checking the end on linux-scsi. I will try to dig it up. A couple kernels ago the bio map helpers were modified to only check the start. Oh, I missed that. 
There's this thread about USB needing end alignment too, on all but the last iovec entry: http://thread.gmane.org/gmane.linux.usb.devel/16230/focus=16409 -- Pete - To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: SCSI Generic version 4 interface, release 1.2
[EMAIL PROTECTED] wrote on Wed, 14 Mar 2007 12:18 -0400:
> After reviewing this post by Pete Wyckoff:
> http://marc.theaimsgroup.com/?l=linux-scsi&m=117278879816029&w=2
> I decided to update my sg v4 interface document originally posted
> 20061106 which I will now call release 1.1 :
> http://lwn.net/Articles/208082/
> Pete was proposing to put back din_iovec_count and dout_iovec_count
> that had been dropped out of bsg but had been in release 1.1 .  Hmm.
> Some other items have been picked up from the bsg implementation plus
> the suggestion from LSF'07 to add dout_resid.
> See the attachment, comments welcome.

Do you want to define the iovec format too?  As I commented in my patch, v3 sg_iovec has pointers with 32/64-bit issues.  Would be nice to see you declare v4 sg_iovec as pure u64.  (By the way, don't use the patch: casting from the new to the old can put junk in the top half of the 64-bit sg_iovec.iov_len).

Another issue I wonder about is queue DMA alignment.  In bsg, blk_rq_map_user will use a bounce buffer if the user-supplied start and end addresses are not aligned.  sg will happily map user pages at any offset without checking, although I haven't checked if Mike's patches change this.  ll_rw_blk.c says regarding blk_rq_map_user:

	We don't allow misaligned data like bio_map_user() does.  If the
	user is using sg, they're expected to know the alignment constraints
	and respect them accordingly.

Should this still be true for both iovec and non-iovec uses of sgv4?  I modified bsg to use iovec and ignore alignment issues, just like sg, but left in the bounce buffer for non-iovec usage.  Seems awkward.  scsi_ioctl's sg_io has the same odd situation: non-iovec is bounced, iovec must be aligned by user.

		-- Pete
[PATCH] return valid data with sense
Some targets can return both valid data and sense information.  Always update the request data_len from the SCSI command residual.  Callers should interpret sense data to determine what parts of the data are valid in case of a CHECK CONDITION status.

Signed-off-by: Pete Wyckoff [EMAIL PROTECTED]
---
 drivers/scsi/scsi_lib.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 5f95570..be8e655 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -848,8 +848,8 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int good_bytes)
 			memcpy(req->sense, cmd->sense_buffer, len);
 			req->sense_len = len;
 		}
-	} else
-		req->data_len = cmd->resid;
+	}
+	req->data_len = cmd->resid;
 }
 
 /*
--
1.5.0.2
[PATCH] block queue dma alignment zero
Make queue_dma_alignment return 0 if it was specifically set to 0.  Set it to the default 511 to keep the old behavior when it was not explicitly set.  This permits devices with no particular alignment restrictions to use direct IO from arbitrary addresses.

Signed-off-by: Pete Wyckoff [EMAIL PROTECTED]
---
 block/ll_rw_blk.c      |    2 ++
 include/linux/blkdev.h |    7 +------
 2 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
index fb67897..ef1d1a8 100644
--- a/block/ll_rw_blk.c
+++ b/block/ll_rw_blk.c
@@ -1925,6 +1925,8 @@ blk_init_queue_node(request_fn_proc *rfn, spinlock_t *lock, int node_id)
 	blk_queue_max_hw_segments(q, MAX_HW_SEGMENTS);
 	blk_queue_max_phys_segments(q, MAX_PHYS_SEGMENTS);
 
+	blk_queue_dma_alignment(q, 511);
+
 	/*
 	 * all done
 	 */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 36a6eac..f416c2a 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -819,12 +819,7 @@ static inline int bdev_hardsect_size(struct block_device *bdev)
 
 static inline int queue_dma_alignment(request_queue_t *q)
 {
-	int retval = 511;
-
-	if (q && q->dma_alignment)
-		retval = q->dma_alignment;
-
-	return retval;
+	return q ? q->dma_alignment : 511;
 }
 
 /* assumes size > 256 */
--
1.5.0.2
[PATCH] iscsi tcp set queue dma alignment to zero
Add a slave_configure function to iSCSI TCP to remove any DMA alignment restriction.  This permits the use of direct IO from arbitrary addresses.

Signed-off-by: Pete Wyckoff [EMAIL PROTECTED]
---
 drivers/scsi/iscsi_tcp.c |   11 +++++++++++
 1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/drivers/scsi/iscsi_tcp.c b/drivers/scsi/iscsi_tcp.c
index 4376840..f48eedd 100644
--- a/drivers/scsi/iscsi_tcp.c
+++ b/drivers/scsi/iscsi_tcp.c
@@ -2132,6 +2132,16 @@ static void iscsi_tcp_session_destroy(struct iscsi_cls_session *cls_session)
 	iscsi_session_teardown(cls_session);
 }
 
+/*
+ * New device attached.  Turn off the DMA alignment restriction on
+ * the request queue.
+ */
+static int iscsi_tcp_slave_configure(struct scsi_device *sdev)
+{
+	blk_queue_dma_alignment(sdev->request_queue, 0);
+	return 0;
+}
+
 static struct scsi_host_template iscsi_sht = {
 	.name			= "iSCSI Initiator over TCP/IP",
 	.queuecommand		= iscsi_queuecommand,
@@ -2142,6 +2152,7 @@ static struct scsi_host_template iscsi_sht = {
 	.eh_abort_handler	= iscsi_eh_abort,
 	.eh_host_reset_handler	= iscsi_eh_host_reset,
 	.use_clustering		= DISABLE_CLUSTERING,
+	.slave_configure	= iscsi_tcp_slave_configure,
 	.proc_name		= "iscsi_tcp",
 	.this_id		= -1,
 };
--
1.5.0.2
[PATCH] bsg: iovec support
Support vectored IO as in SGv3. The iovec structure uses explicit sizes
to avoid the need for compat conversion.

Signed-off-by: Pete Wyckoff [EMAIL PROTECTED]
---
My application definitely can take advantage of scatter/gather IO, which
is supported in SGv3 but not in the bsg implementation of SGv4.

I understand Tomo's concerns about code bloat and the need for 32/64
compat translations, but this will make things much easier on users of
bsg who read or write out of multiple buffers in a single SCSI operation.

Clearly we want to avoid doing the compat work that sg.c has to do now,
so I went with __u64 for the addresses in the structures that userspace
sees. But to interface with existing bio structures, those must be
converted back to 32-bit pointers in sg_iovec (only on 32-bit
architectures). In the long run, maybe we should have a
bio_map_user_iov() that works on the constant-sized sg_io_v4_vec
proposed here?

		-- Pete

 block/bsg.c         | 132 ++-
 include/linux/bsg.h |  16 ++
 2 files changed, 125 insertions(+), 23 deletions(-)

diff --git a/block/bsg.c b/block/bsg.c
index c85d961..8e3d6c7 100644
--- a/block/bsg.c
+++ b/block/bsg.c
@@ -280,6 +280,95 @@ bsg_validate_sgv4_hdr(request_queue_t *q, struct sg_io_v4 *hdr, int *rw)
 }
 
 /*
+ * Sits around blk_rq_map_user_iov so we can use an iovec type that
+ * does not require compat manipulations.  For now we just clumsily
+ * remap the entire iovec if the types do not match.  Later consider
+ * changing the bio map function.
+ */
+static int bsg_map_user_iovec(request_queue_t *q, struct request *rq,
+			      struct sg_io_v4_vec *vec, int numvec,
+			      size_t tot_len, enum dma_data_direction dir)
+{
+	struct bio *bio;
+	struct sg_iovec *iov;
+	int write_to_vm = (dir == DMA_FROM_DEVICE ? 1 : 0);
+	int must_copy_iovec = (sizeof(*iov) != sizeof(*vec));
+
+	/*
+	 * For 64-bit everywhere, sg_io_v4_vec using __u64 is the same as
+	 * sg_iovec using void *.  For a 64-bit kernel with 32-bit userspace,
+	 * also no translation needed as userspace is forced to use __u64.
+	 * Only in the all 32-bit case will sg_iovec use 32-bit pointers
+	 * and hence we must shrink our 64-bit pointers down into it.
+	 */
+	if (must_copy_iovec) {
+		int i;
+		iov = kmalloc(numvec * sizeof(*iov), GFP_KERNEL);
+		for (i = 0; i < numvec; i++) {
+			iov[i].iov_base = (void __user *) vec[i].iov_base;
+			iov[i].iov_len = vec[i].iov_len;
+		}
+	} else {
+		iov = (struct sg_iovec *) vec;
+	}
+
+	bio = bio_map_user_iov(q, NULL, iov, numvec, write_to_vm);
+
+	if (must_copy_iovec)
+		kfree(iov);
+
+	if (IS_ERR(bio)) {
+		dprintk("bio_map_user_iov err\n");
+		return PTR_ERR(bio);
+	}
+
+	if (bio->bi_size != tot_len) {
+		dprintk("bio->bi_size %u != len %lu\n", bio->bi_size, tot_len);
+		bio_endio(bio, bio->bi_size, 0);
+		bio_unmap_user(bio);
+		return -EINVAL;
+	}
+
+	bio_get(bio);
+	blk_rq_bio_prep_bidi(q, rq, bio, dir);
+	rq->buffer = rq->data = NULL;
+	return 0;
+}
+
+/*
+ * Map either the in or out bufs.
+ */
+static int bsg_map_data(struct request_queue *q, struct request *rq,
+			__u64 uaddr, __u32 tot_len, __u32 numiov,
+			enum dma_data_direction dir)
+{
+	int ret;
+	void __user *ubuf = (void __user *) (unsigned long) uaddr;
+
+	if (numiov) {
+		struct sg_io_v4_vec *vec;
+		size_t len = numiov * sizeof(*vec);
+
+		vec = kmalloc(len, GFP_KERNEL);
+		if (vec == NULL) {
+			ret = -ENOMEM;
+			goto out;
+		}
+		if (copy_from_user(vec, ubuf, len)) {
+			ret = -EFAULT;
+			kfree(vec);
+			goto out;
+		}
+		ret = bsg_map_user_iovec(q, rq, vec, numiov, tot_len, dir);
+		kfree(vec);
+	} else
+		ret = blk_rq_map_user(q, rq, ubuf, tot_len);
+
+out:
+	return ret;
+}
+
+/*
  * map sg_io_v4 to a request.
  */
 static struct request *
@@ -288,12 +377,10 @@ bsg_map_hdr(struct bsg_device *bd, struct sg_io_v4 *hdr)
 	request_queue_t *q = bd->queue;
 	struct request *rq;
 	int ret, rw = 0; /* shut up gcc */
-	unsigned int dxfer_len;
-	void *dxferp = NULL;
 
-	dprintk("map hdr %llx/%u %llx/%u\n", (unsigned long long) hdr->dout_xferp,
-		hdr->dout_xfer_len, (unsigned long long) hdr->din_xferp,
-		hdr->din_xfer_len);
+	dprintk("map hdr %llx/%u %llx/%u\n",
+		(unsigned long long) hdr->dout_xferp, hdr
[PATCH] bsg: return SAM device status code
Use the status codes from the standard, not the shifted-by-one codes
that are marked deprecated in scsi.h.  This makes bsg v4 status report
the same value as sg v3 status too.

Signed-off-by: Pete Wyckoff [EMAIL PROTECTED]
---
 block/bsg.c | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/block/bsg.c b/block/bsg.c
index c85d961..e39a321 100644
--- a/block/bsg.c
+++ b/block/bsg.c
@@ -438,7 +438,7 @@ static int blk_complete_sgv4_hdr_rq(struct request *rq, struct sg_io_v4 *hdr,
 	/*
 	 * fill in all the output members
 	 */
-	hdr->device_status = status_byte(rq->errors);
+	hdr->device_status = rq->errors & 0xff;
 	hdr->transport_status = host_byte(rq->errors);
 	hdr->driver_status = driver_byte(rq->errors);
 	hdr->info = 0;
--
1.4.4.2
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Re: [PATCH] bind bsg to request_queue instead of gendisk
[EMAIL PROTECTED] wrote on Wed, 14 Feb 2007 02:53 +0900:
> It seems that it would be better to bind bsg devices to request_queue
> instead of gendisk. This enables any objects to define their own
> request_handler and create their own bsg device (under sysfs).
>
> Possible enhancements:
>
> - I removed gendisk but it would be better for objects having gendisk
>   to keep it for nice features like disk stats.
>
> - Objects that want to use bsg need to set up a request_queue. Maybe
>   wrapper functions to set up a request_queue for them would be useful.
>
> This patch was tested only with disk drivers.

The only place that bsg_register_rq is called is via blk_register_queue,
which is only called by add_disk. But not all devices have a block
interface or need the gendisk that these functions assume.

This change registers all SCSI devices with bsg, even ones that are not
accessed through a block interface. They already have a request_queue;
this just presents it to bsg at LUN scan time. It uses the bus_id as
the name for sysfs.

Devices that happen to be block SCSI devices are thus registered twice,
which is probably not good, but it preserves the existing sda entry,
e.g., in /sys/class/bsg.

Signed-off-by: Pete Wyckoff [EMAIL PROTECTED]
---
 block/bsg.c               | 16 +++++++++++++---
 drivers/scsi/scsi_scan.c  |  6 ++++++
 drivers/scsi/scsi_sysfs.c |  1 +
 3 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/block/bsg.c b/block/bsg.c
index e261997..e100efe 100644
--- a/block/bsg.c
+++ b/block/bsg.c
@@ -962,6 +962,7 @@ int bsg_register_rq(struct request_queue *q, char *name)
 {
 	struct bsg_class_device *bcd;
 	dev_t dev;
+	int ret = -ENOMEM;
 
 	/*
 	 * we need a proper transport to send commands, not a stacked device
@@ -981,9 +982,18 @@ int bsg_register_rq(struct request_queue *q, char *name)
 	bcd->class_dev = class_device_create(bsg_class, NULL, dev, bcd->dev, "%s", name);
 	if (!bcd->class_dev)
 		goto err;
+
+	/*
+	 * Put a link, e.g., /sys/block/sda/queue/bsg -> /sys/class/bsg/sda,
+	 * if a sysfs entry for the queue exists.
+	 */
+	if (q->kobj.dentry) {
+		ret = sysfs_create_link(&q->kobj, &bcd->class_dev->kobj, "bsg");
+		if (ret)
+			goto err;
+	}
+
 	list_add_tail(&bcd->list, &bsg_class_list);
-	if (sysfs_create_link(&q->kobj, &bcd->class_dev->kobj, "bsg"))
-		goto err;
 	mutex_unlock(&bsg_mutex);
 	return 0;
 err:
@@ -991,7 +1001,7 @@ err:
 	if (bcd->class_dev)
 		class_device_destroy(bsg_class, MKDEV(BSG_MAJOR, bcd->minor));
 	mutex_unlock(&bsg_mutex);
-	return -ENOMEM;
+	return ret;
 }
 
 static int __init bsg_init(void)
diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
index 23ab62e..8aa4c3c 100644
--- a/drivers/scsi/scsi_scan.c
+++ b/drivers/scsi/scsi_scan.c
@@ -895,6 +895,12 @@ static int scsi_add_lun(struct scsi_device *sdev, unsigned char *inq_result,
 	}
 
 	/*
+	 * Register bsg interface for every SCSI device.
+	 */
+	if (bsg_register_rq(sdev->request_queue, sdev->sdev_gendev.bus_id))
+		return SCSI_SCAN_NO_RESPONSE;
+
+	/*
 	 * Ok, the device is now all set up, we can
 	 * register it and tell the rest of the kernel
 	 * about it.
diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
index 259c90c..bc694a5 100644
--- a/drivers/scsi/scsi_sysfs.c
+++ b/drivers/scsi/scsi_sysfs.c
@@ -238,6 +238,7 @@ static void scsi_device_dev_release_usercontext(struct work_struct *work)
 	spin_unlock_irqrestore(sdev->host->host_lock, flags);
 
 	if (sdev->request_queue) {
+		bsg_unregister_rq(sdev->request_queue);
 		sdev->request_queue->queuedata = NULL;
 		/* user context needed to free queue */
 		scsi_free_queue(sdev->request_queue);
--
1.4.4.2