RE: [PATCH] qlogicpti: Return correct error code
Reviewed-by: Shane Seymour
RE: [PATCH] qlogicpti: Return correct error code
Reviewed-by: Shane Seymour
RE: [PATCH] snic: correctly check for array overrun on overly long version number
Reviewed-by: Shane Seymour
RE: [PATCH] snic: correctly check for array overrun on overly long version number
Reviewed-by: Shane Seymour
RE: [PATCH] ipr: fix out-of-bounds null overwrite
> len = snprintf(fname, 99, "%s", buf); > - fname[len-1] = '\0'; Since it appears that's the only time len is actually used in that function can you please remove the variable len completely as part of the patch? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH] ipr: fix out-of-bounds null overwrite
> len = snprintf(fname, 99, "%s", buf); > - fname[len-1] = '\0'; Since it appears that's the only time len is actually used in that function can you please remove the variable len completely as part of the patch? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH 2/2] pci: Update VPD size with correct length
Tested with a HP AE311-60001 PCIe card. It used to repeat the same VPD every 4k for 32k now only the 154 bytes are returned and lspci - reports that the data up to and including the end and that the check sum is good: ... Capabilities: [74] Vital Product Data Product Name: PCI-Express 4Gb Fibre Channel HBA Read-only fields: [PN] Part number: AE311-60001 [EC] Engineering changes: C-4830 [SN] Serial number: CN80847W3U [V0] Vendor specific: PW=15W [V2] Vendor specific: 4847 [MN] Manufacture ID: 50 58 32 35 31 30 34 30 31 2d 37 30 20 20 42 [V1] Vendor specific: 02.22 [V3] Vendor specific: 05.03.15 [V4] Vendor specific: 03.13 [V5] Vendor specific: 02.03 [YA] Asset tag: NA [RV] Reserved: checksum good, 0 byte(s) reserved End ... --- Tested-by: Shane Seymour -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH 1/2] pci: Update VPD definitions
> The 'end' tag is actually 0x0f, it's the representation as a > small resource data type tag that's 0x78 (ie shifted by 3). > This patch also adds helper functions to extract the resource > data type tags for both large and small resource data types. > > Cc: Alexander Duyck > Cc: Bjorn Helgaas > Signed-off-by: Hannes Reinecke Tested-by: Shane Seymour -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH 1/2] pci: Update VPD definitions
> The 'end' tag is actually 0x0f, it's the representation as a > small resource data type tag that's 0x78 (ie shifted by 3). > This patch also adds helper functions to extract the resource > data type tags for both large and small resource data types. > > Cc: Alexander Duyck> Cc: Bjorn Helgaas > Signed-off-by: Hannes Reinecke Tested-by: Shane Seymour -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH 2/2] pci: Update VPD size with correct length
Tested with a HP AE311-60001 PCIe card. It used to repeat the same VPD every 4k for 32k now only the 154 bytes are returned and lspci - reports that the data up to and including the end and that the check sum is good: ... Capabilities: [74] Vital Product Data Product Name: PCI-Express 4Gb Fibre Channel HBA Read-only fields: [PN] Part number: AE311-60001 [EC] Engineering changes: C-4830 [SN] Serial number: CN80847W3U [V0] Vendor specific: PW=15W [V2] Vendor specific: 4847 [MN] Manufacture ID: 50 58 32 35 31 30 34 30 31 2d 37 30 20 20 42 [V1] Vendor specific: 02.22 [V3] Vendor specific: 05.03.15 [V4] Vendor specific: 03.13 [V5] Vendor specific: 02.03 [YA] Asset tag: NA [RV] Reserved: checksum good, 0 byte(s) reserved End ... --- Tested-by: Shane Seymour-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCHv2] pci: Update VPD size with correct length
> The only 'error' cases I've encountered so far is a read of all zeroes (and a > halting the machine once you've read beyond a certain point) or a read of > 0xff throughout the entire area. So that approach would work for both of them. I should add that I'd tested the previous patch and this patch. I'll retest once v3 is available. I have a card that repeats the vpd data every 4k for all 32k. The patch as it stands truncates that to just the valid data and lspci - shows the vpd data correctly. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCHv2] pci: Update VPD size with correct length
> The only 'error' cases I've encountered so far is a read of all zeroes (and a > halting the machine once you've read beyond a certain point) or a read of > 0xff throughout the entire area. So that approach would work for both of them. I should add that I'd tested the previous patch and this patch. I'll retest once v3 is available. I have a card that repeats the vpd data every 4k for all 32k. The patch as it stands truncates that to just the valid data and lspci - shows the vpd data correctly. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH v5 1/2] block: invalidate the page cache when issuing BLKZEROOUT.
Thank you for taking the issues I raised about converting unsigned to signed into account. Reviewed-by: Shane Seymour
RE: [PATCH v5 1/2] block: invalidate the page cache when issuing BLKZEROOUT.
Thank you for taking the issues I raised about converting unsigned to signed into account. Reviewed-by: Shane Seymour
RE: [PATCH v4] block: create ioctl to discard-or-zeroout a range of blocks
> v4: Check the start/len arguments for overflows prior to feeding the page > cache bogus numbers (that it'll ignore anyway). > Signed-off-by: Darrick J. Wong Reviewed-by: Shane Seymour -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH v4] block: create ioctl to discard-or-zeroout a range of blocks
> v4: Check the start/len arguments for overflows prior to feeding the page > cache bogus numbers (that it'll ignore anyway). > Signed-off-by: Darrick J. WongReviewed-by: Shane Seymour -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH] block: create ioctl to discard-or-zeroout a range of blocks
> I don't have a device large enough to test for signedness errors, since > passing > huge values for start and len never make it past the i_size_read check. If you have someone trying to bypass your sanity checks then if start=18446744073709551104 and len=1024 the result of adding them together will be 512 (subtracting an extra 1 in the patched code to get 511 for end). That will pass the i_size_read check won't it? If so that would cause lstart in truncate_inode_pages_range() to be -512. I don't know what truncate_inode_pages_range() will do with a negative lstart value like that but it seems like an unusual value for your code to be willing to pass into truncate_inode_pages_range(). Shane -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH] block: create ioctl to discard-or-zeroout a range of blocks
> I don't have a device large enough to test for signedness errors, since > passing > huge values for start and len never make it past the i_size_read check. If you have someone trying to bypass your sanity checks then if start=18446744073709551104 and len=1024 the result of adding them together will be 512 (subtracting an extra 1 in the patched code to get 511 for end). That will pass the i_size_read check won't it? If so that would cause lstart in truncate_inode_pages_range() to be -512. I don't know what truncate_inode_pages_range() will do with a negative lstart value like that but it seems like an unusual value for your code to be willing to pass into truncate_inode_pages_range(). Shane -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH] block: create ioctl to discard-or-zeroout a range of blocks
> which would make the other checks I suggested to ensure that neither start > or end were more than (uint64_t)LLONG_MAX unnecessary. My apologies I was wrong about what I said above - after thinking about it for longer you still need to make sure that at least len is not greater than (uint64_t)LLONG_MAX because in the calculation: if (start > (uint64_t)LLONG_MAX - len) return -EINVAL; if len was more than (uint64_t)LLONG_MAX it would underflow and become a very large positive number and start would never be greater than that (and I said end when I should have said len). -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH] block: create ioctl to discard-or-zeroout a range of blocks
> I don't have a device large enough to test for signedness errors, since > passing > huge values for start and len never make it past the i_size_read check. > However, I do see that I forgot to check the padding values, so I'll update > that. Then do you want to at least consider converting end to be of type loff_t to match the type of value returned by i_get_size() and the type of the value passed to truncate_inode_pages_range()? Then you only need one additional check, something like: /* Check for an overflow when adding start+len into end */ if (start > (uint64_t)LLONG_MAX - len) return -EINVAL; Looking at the maximum values if we have start=0 and len=(uint64_t)LLONG_MAX we would still continue but len=(uint64_t)LLONG_MAX+1 would return -EINVAL. Similarly for start=(uint64_t)LLONG_MAX and len=0 it would continue but start=(uint64_t)LLONG_MAX+1 would return -EINVAL and that should allow start + len to be safely added and stored into an loff_t without overflow. With this check start+len can never be more than (uint64_t)LLONG_MAX which would make the other checks I suggested to ensure that neither start or end were more than (uint64_t)LLONG_MAX unnecessary. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH] block: create ioctl to discard-or-zeroout a range of blocks
> I don't have a device large enough to test for signedness errors, since > passing > huge values for start and len never make it past the i_size_read check. > However, I do see that I forgot to check the padding values, so I'll update > that. Then do you want to at least consider converting end to be of type loff_t to match the type of value returned by i_get_size() and the type of the value passed to truncate_inode_pages_range()? Then you only need one additional check, something like: /* Check for an overflow when adding start+len into end */ if (start > (uint64_t)LLONG_MAX - len) return -EINVAL; Looking at the maximum values if we have start=0 and len=(uint64_t)LLONG_MAX we would still continue but len=(uint64_t)LLONG_MAX+1 would return -EINVAL. Similarly for start=(uint64_t)LLONG_MAX and len=0 it would continue but start=(uint64_t)LLONG_MAX+1 would return -EINVAL and that should allow start + len to be safely added and stored into an loff_t without overflow. With this check start+len can never be more than (uint64_t)LLONG_MAX which would make the other checks I suggested to ensure that neither start or end were more than (uint64_t)LLONG_MAX unnecessary. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH] block: create ioctl to discard-or-zeroout a range of blocks
> which would make the other checks I suggested to ensure that neither start > or end were more than (uint64_t)LLONG_MAX unnecessary. My apologies I was wrong about what I said above - after thinking about it for longer you still need to make sure that at least len is not greater than (uint64_t)LLONG_MAX because in the calculation: if (start > (uint64_t)LLONG_MAX - len) return -EINVAL; if len was more than (uint64_t)LLONG_MAX it would underflow and become a very large positive number and start would never be greater than that (and I said end when I should have said len). -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH] block: create ioctl to discard-or-zeroout a range of blocks
A quick question about this part of the patch: > + uint64_t end = start + len - 1; > + if (end >= i_size_read(bdev->bd_inode)) return -EINVAL; > + /* Invalidate the page cache, including dirty pages */ > + mapping = bdev->bd_inode->i_mapping; > + truncate_inode_pages_range(mapping, start, end); blk_ioctl_zeroout accepts unsigned values for start and end (uint64_t) but loff_t types are turned from i_size_read() and passed as the 2nd and 3rd values to truncate_inode_pages_range() and loff_t is a signed value. It should be possible to pass in some values would overflow the calculation of end causing the test on the value of end and the result of i_size_read to pass but then end up passing a large unsigned value for in start that would be implicitly converted to signed in truncate_inode_pages_range. I was wondering if you'd tested passing in data that would cause sign conversion issues when passed into truncate_inode_pages_range (does it handle it gracefully?) or should this code: if (start & 511) return -EINVAL; if (len & 511) return -EINVAL; be something more like this (for better sanity checking of your arguments) which will ensure that you don't have implicit conversion issues from unsigned to signed and ensure that the result of adding them together won't either: if ((start & 511) || (start > (uint64_t)LLONG_MAX)) return -EINVAL; if ((len & 511) ) || (len > (uint64_t)LLONG_MAX)) return -EINVAL; if (end > (uint64_t)LLONG_MAX) return -EINVAL; My apologies in advance if I've made a mistake when looking at this and my concerns about unsigned values being implicitly converted to signed are unfounded (I would have hoped for compiler warnings about any implicit conversions though). Thanks Shane -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH] block: create ioctl to discard-or-zeroout a range of blocks
A quick question about this part of the patch: > + uint64_t end = start + len - 1; > + if (end >= i_size_read(bdev->bd_inode)) return -EINVAL; > + /* Invalidate the page cache, including dirty pages */ > + mapping = bdev->bd_inode->i_mapping; > + truncate_inode_pages_range(mapping, start, end); blk_ioctl_zeroout accepts unsigned values for start and end (uint64_t) but loff_t types are turned from i_size_read() and passed as the 2nd and 3rd values to truncate_inode_pages_range() and loff_t is a signed value. It should be possible to pass in some values would overflow the calculation of end causing the test on the value of end and the result of i_size_read to pass but then end up passing a large unsigned value for in start that would be implicitly converted to signed in truncate_inode_pages_range. I was wondering if you'd tested passing in data that would cause sign conversion issues when passed into truncate_inode_pages_range (does it handle it gracefully?) or should this code: if (start & 511) return -EINVAL; if (len & 511) return -EINVAL; be something more like this (for better sanity checking of your arguments) which will ensure that you don't have implicit conversion issues from unsigned to signed and ensure that the result of adding them together won't either: if ((start & 511) || (start > (uint64_t)LLONG_MAX)) return -EINVAL; if ((len & 511) ) || (len > (uint64_t)LLONG_MAX)) return -EINVAL; if (end > (uint64_t)LLONG_MAX) return -EINVAL; My apologies in advance if I've made a mistake when looking at this and my concerns about unsigned values being implicitly converted to signed are unfounded (I would have hoped for compiler warnings about any implicit conversions though). Thanks Shane -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1
I found the manual pages really confusing so I had a go at rewriting them - there were places in the manual page that didn't match the functionality provided by your code as well as I could tell). My apologies for a few formatting issues though. I still don't like parts of epoll_pwait1 but it's less confusing than it was. You are free to take some or all or none of the changes. I did have a question I marked with below about what you describe and what your code does. 1) epoll_ctl_batch -- NAME epoll_ctl_batch - batch control interface for an epoll descriptor SYNOPSIS #include int epoll_ctl_batch(int epfd, int flags, int ncmds, struct epoll_ctl_cmd *cmds); DESCRIPTION This system call is an extension of epoll_ctl(). The primary difference is that this system call allows you to batch multiple operations with the one system call. This provides a more efficient interface for updating events on this epoll file descriptor epfd. The flags argument is reserved and must be 0. The argument ncmds is the number of cmds entries being passed in. This number must be greater than 0. Each operation is specified as an element in the cmds array, defined as: struct epoll_ctl_cmd { /* Reserved flags for future extension, must be 0. */ int flags; /* The same as epoll_ctl() op parameter. */ int op; /* The same as epoll_ctl() fd parameter. */ int fd; /* The same as the "events" field in struct epoll_event. */ uint32_t events; /* The same as the "data" field in struct epoll_event. */ uint64_t data; /* Output field, will be set to the return code after this * command is executed by kernel */ int result; }; This system call is not atomic when updating the epoll descriptor. All entries in cmds are executed in the order provided. If any cmds entry fails to be processed no further entries are processed and the number of successfully processed entries is returned. Each single operation defined by a struct epoll_ctl_cmd has the same semantics as an epoll_ctl(2) call. See the epoll_ctl() manual page for more information about how to correctly setup the members of a struct epoll_ctl_cmd. Upon completion of the call the result member of each struct epoll_ctl_cmd may be set to 0 (sucessfully completed) or an error code depending on the result of the command. If the kernel fails to change the result (for example the location of the cmds argument is fully or partly read only) the result member of each struct epoll_ctl_cmd may be unchanged. RETURN VALUE epoll_ctl_batch() returns a number greater than 0 to indicate the number of cmnd entries processed. If all entries have been processed this will equal the ncmds parameter passed in. If one or more parameters are incorrect the value returned is -1 with errno set appropriately - no cmds entries have been processed when this happens. If processing any entry in the cmds argument results in an error the number returned is the number of the failing entry - this number will be less than ncmds. Since ncmds must be greater than 0 a return value of 0 indicates an error associated with the very first cmds entry. A return value of 0 does not indicate a successful system call. To correctly test the return value from epoll_ctl_batch() use code similar to the following: ret=epoll_ctl_batch(epfd, flags, ncmds, ); if (ret < ncmds) { if (ret == -1) { /* An argument was invalid */ } else { /* ret contains the number of successful entries * processed. If you (mis?)use it as a C index it * will index directly to the failing entry to * get the result use cmds[ret].result which may * contain the errno value associated with the * entry. */ } } else { /* Success */ } ERRORS EINVAL flags is non-zero, or ncmds is less than or equal to zero, or cmds is NULL. ENOMEM There was insufficient memory to handle the requested op control operation. EFAULT The memory area pointed to by cmds is not accessible with read permissions. In the event that the
RE: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1
I found the manual pages really confusing so I had a go at rewriting them - there were places in the manual page that didn't match the functionality provided by your code as well as I could tell). My apologies for a few formatting issues though. I still don't like parts of epoll_pwait1 but it's less confusing than it was. You are free to take some or all or none of the changes. I did have a question I marked with below about what you describe and what your code does. 1) epoll_ctl_batch -- NAME epoll_ctl_batch - batch control interface for an epoll descriptor SYNOPSIS #include sys/epoll.h int epoll_ctl_batch(int epfd, int flags, int ncmds, struct epoll_ctl_cmd *cmds); DESCRIPTION This system call is an extension of epoll_ctl(). The primary difference is that this system call allows you to batch multiple operations with the one system call. This provides a more efficient interface for updating events on this epoll file descriptor epfd. The flags argument is reserved and must be 0. The argument ncmds is the number of cmds entries being passed in. This number must be greater than 0. Each operation is specified as an element in the cmds array, defined as: struct epoll_ctl_cmd { /* Reserved flags for future extension, must be 0. */ int flags; /* The same as epoll_ctl() op parameter. */ int op; /* The same as epoll_ctl() fd parameter. */ int fd; /* The same as the events field in struct epoll_event. */ uint32_t events; /* The same as the data field in struct epoll_event. */ uint64_t data; /* Output field, will be set to the return code after this * command is executed by kernel */ int result; }; This system call is not atomic when updating the epoll descriptor. All entries in cmds are executed in the order provided. If any cmds entry fails to be processed no further entries are processed and the number of successfully processed entries is returned. Each single operation defined by a struct epoll_ctl_cmd has the same semantics as an epoll_ctl(2) call. See the epoll_ctl() manual page for more information about how to correctly setup the members of a struct epoll_ctl_cmd. Upon completion of the call the result member of each struct epoll_ctl_cmd may be set to 0 (sucessfully completed) or an error code depending on the result of the command. If the kernel fails to change the result (for example the location of the cmds argument is fully or partly read only) the result member of each struct epoll_ctl_cmd may be unchanged. RETURN VALUE epoll_ctl_batch() returns a number greater than 0 to indicate the number of cmnd entries processed. If all entries have been processed this will equal the ncmds parameter passed in. If one or more parameters are incorrect the value returned is -1 with errno set appropriately - no cmds entries have been processed when this happens. If processing any entry in the cmds argument results in an error the number returned is the number of the failing entry - this number will be less than ncmds. Since ncmds must be greater than 0 a return value of 0 indicates an error associated with the very first cmds entry. A return value of 0 does not indicate a successful system call. To correctly test the return value from epoll_ctl_batch() use code similar to the following: ret=epoll_ctl_batch(epfd, flags, ncmds, cmds); if (ret ncmds) { if (ret == -1) { /* An argument was invalid */ } else { /* ret contains the number of successful entries * processed. If you (mis?)use it as a C index it * will index directly to the failing entry to * get the result use cmds[ret].result which may * contain the errno value associated with the * entry. */ } } else { /* Success */ } ERRORS EINVAL flags is non-zero, or ncmds is less than or equal to zero, or cmds is NULL. ENOMEM There was insufficient memory to handle the requested op control operation. EFAULT The memory area pointed to by cmds is not accessible with read permissions. In the event
[RFC] Implementing tape statistics
Before you start reading this I want to apologize in advance for the length of this email. The length is important though to make sure all of the arguments and counter-arguments are represented in asking for feedback about how tape statistics would be best implemented. There is some demand for the provision of tape IO statistics by users of the enterprise distributions, in particular those possessing large scale tape libraries. The provision of interfaces for getting statistics about tape I/O for use by utilities such as sar is a feature present in most commercial UNIX distributions. Several patches have been produced and presented to linux-scsi mailing list but it seems that there are differences of opinion that cannot be reconciled and hence currently no acceptance of the proposed solutions. I have therefore decided to post to the wider kernel list to see if we can come to some consensus on what one of these (or other) should be adopted. No patches are presented in this email for the sake of brevity, it's only a summary of the implementation and consequences along with discussion points for each. Note that this is not an attempt to work around the feedback gained on the linux-scsi mailing list but an attempt to get a wider consensus on what would be an acceptable implementation of a tape statistics interface. Option #1 = Provide device based stats vis sysfs: /sys/class/scsi_tape/stNN/stats (where NN is the tape device instance number) The stat file provides the following in a one line entry suitable for a single fgets() and processing by sscanf(): +/* Tape stats */ + u64 read_byte_cnt; /* bytes read since tape open */ + u64 write_byte_cnt; /* bytes written since tape open */ + u64 in_flight; /* Number of I/Os in flight */ + u64 read_cnt; /* Count of read requests since tape open */ + u64 write_cnt; /* Count of write requests since tape open */ + u64 other_cnt; /* Count of other requests since tape open + either implicit (from driver) or from + user space via ioctl. */ + u64 read_ticks; /* Ticks spent completing read requests */ + u64 write_ticks;/* Ticks spent completing write requests */ + u64 io_ticks; /* Ticks spent doing any I/O */ + u64 stamp; /* holds time request was queued */ The file contents are almost the same as the stat file for disks except the merge statistics are always 0 (since tape drives are sequential merged I/Os don't make sense) and the inflight value is almost always either a 0 or 1 since the st module always only has either one read or write outstanding. An additional field is added to the end of the file - a count of other I/Os - this could be commands issued by the driver within the kernel (e.g. rewind) or via an ioctl from user space. For tape drives some commands involving actions like tape movement can take a long time, it's important to keep track of scsi requests sent to the tape drive other than reads and writes so when delays happen they can be explained. With some future patches to iostat this data will be reported, an example set of data is (the extra other_cnt data allows an average wait for all (a_await) and other I/Os per second (oio/s)): tape: wr/s KiB_write/srd/s KiB_read/s r_await w_await a_await oio/s st0 186.50 46.750.000.000.0000.2760.276 0.00 st1 186.00 93.000.000.000.0000.1800.180 0.00 st2 0.00 0.00 181.50 45.500.3470.0000.347 0.00 st3 0.00 0.00 183.00 45.750.2240.0000.224 0.00 ## This is our preferred method of implementation since it is efficient for both kernel and user-space (also requires fewest code changes), it also matches that already presented for the disk block subsys, see for example: # grep . /sys/block/sd*/stat /sys/block/sda/stat: 27351 6890 609272 22812936810 920727 7660304 13339500 556889 1562009 /sys/block/sdb/stat:2369 6762188903900300 000 405939002 ## SCSI maintainers counter-point: "I'm afraid we can't do it the way you're proposing. files in sysfs must conform to the one value per file rule (so we avoid the ABI nastiness that plagues /proc). You can create a stat directory with a bunch of files, but not a single file that gives all values. ## My counter: I can only assume it (sysfs blk_subsys stat file) was implemented this way for the sake of efficiency, eg avoid a huge amount of file open/read/close calls in sar/iostat. It's not unusual for us to see over a thousand block devices on enterprise servers, multiply that by the number of above entries and you would be talking about 9 x block-dev-count per iostat read
[RFC] Implementing tape statistics
Before you start reading this I want to apologize in advance for the length of this email. The length is important though to make sure all of the arguments and counter-arguments are represented in asking for feedback about how tape statistics would be best implemented. There is some demand for the provision of tape IO statistics by users of the enterprise distributions, in particular those possessing large scale tape libraries. The provision of interfaces for getting statistics about tape I/O for use by utilities such as sar is a feature present in most commercial UNIX distributions. Several patches have been produced and presented to linux-scsi mailing list but it seems that there are differences of opinion that cannot be reconciled and hence currently no acceptance of the proposed solutions. I have therefore decided to post to the wider kernel list to see if we can come to some consensus on what one of these (or other) should be adopted. No patches are presented in this email for the sake of brevity, it's only a summary of the implementation and consequences along with discussion points for each. Note that this is not an attempt to work around the feedback gained on the linux-scsi mailing list but an attempt to get a wider consensus on what would be an acceptable implementation of a tape statistics interface. Option #1 = Provide device based stats vis sysfs: /sys/class/scsi_tape/stNN/stats (where NN is the tape device instance number) The stat file provides the following in a one line entry suitable for a single fgets() and processing by sscanf(): +/* Tape stats */ + u64 read_byte_cnt; /* bytes read since tape open */ + u64 write_byte_cnt; /* bytes written since tape open */ + u64 in_flight; /* Number of I/Os in flight */ + u64 read_cnt; /* Count of read requests since tape open */ + u64 write_cnt; /* Count of write requests since tape open */ + u64 other_cnt; /* Count of other requests since tape open + either implicit (from driver) or from + user space via ioctl. */ + u64 read_ticks; /* Ticks spent completing read requests */ + u64 write_ticks;/* Ticks spent completing write requests */ + u64 io_ticks; /* Ticks spent doing any I/O */ + u64 stamp; /* holds time request was queued */ The file contents are almost the same as the stat file for disks except the merge statistics are always 0 (since tape drives are sequential merged I/Os don't make sense) and the inflight value is almost always either a 0 or 1 since the st module always only has either one read or write outstanding. An additional field is added to the end of the file - a count of other I/Os - this could be commands issued by the driver within the kernel (e.g. rewind) or via an ioctl from user space. For tape drives some commands involving actions like tape movement can take a long time, it's important to keep track of scsi requests sent to the tape drive other than reads and writes so when delays happen they can be explained. With some future patches to iostat this data will be reported, an example set of data is (the extra other_cnt data allows an average wait for all (a_await) and other I/Os per second (oio/s)): tape: wr/s KiB_write/srd/s KiB_read/s r_await w_await a_await oio/s st0 186.50 46.750.000.000.0000.2760.276 0.00 st1 186.00 93.000.000.000.0000.1800.180 0.00 st2 0.00 0.00 181.50 45.500.3470.0000.347 0.00 st3 0.00 0.00 183.00 45.750.2240.0000.224 0.00 ## This is our preferred method of implementation since it is efficient for both kernel and user-space (also requires fewest code changes), it also matches that already presented for the disk block subsys, see for example: # grep . /sys/block/sd*/stat /sys/block/sda/stat: 27351 6890 609272 22812936810 920727 7660304 13339500 556889 1562009 /sys/block/sdb/stat:2369 6762188903900300 000 405939002 ## SCSI maintainers counter-point: I'm afraid we can't do it the way you're proposing. files in sysfs must conform to the one value per file rule (so we avoid the ABI nastiness that plagues /proc). You can create a stat directory with a bunch of files, but not a single file that gives all values. ## My counter: I can only assume it (sysfs blk_subsys stat file) was implemented this way for the sake of efficiency, eg avoid a huge amount of file open/read/close calls in sar/iostat. It's not unusual for us to see over a thousand block devices on enterprise servers, multiply that by the number of above entries and you would be talking about 9 x block-dev-count per iostat read