RE: [PATCH] qlogicpti: Return correct error code

2016-02-29 Thread Seymour, Shane M

Reviewed-by: Shane Seymour 



RE: [PATCH] qlogicpti: Return correct error code

2016-02-29 Thread Seymour, Shane M

Reviewed-by: Shane Seymour 



RE: [PATCH] snic: correctly check for array overrun on overly long version number

2016-02-29 Thread Seymour, Shane M

Reviewed-by: Shane Seymour 



RE: [PATCH] snic: correctly check for array overrun on overly long version number

2016-02-29 Thread Seymour, Shane M

Reviewed-by: Shane Seymour 



RE: [PATCH] ipr: fix out-of-bounds null overwrite

2016-01-05 Thread Seymour, Shane M
>   len = snprintf(fname, 99, "%s", buf);
> - fname[len-1] = '\0';

Since it appears that's the only time len is actually used in that function can
you please remove the variable len completely as part of the patch?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH] ipr: fix out-of-bounds null overwrite

2016-01-05 Thread Seymour, Shane M
>   len = snprintf(fname, 99, "%s", buf);
> - fname[len-1] = '\0';

Since it appears that's the only time len is actually used in that function can
you please remove the variable len completely as part of the patch?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH 2/2] pci: Update VPD size with correct length

2015-12-17 Thread Seymour, Shane M
Tested with a HP AE311-60001 PCIe card. It used to repeat the same VPD every 4k
for 32k now only the 154 bytes are returned and lspci - reports that the 
data up
to and including the end and that the check sum is good:

...
Capabilities: [74] Vital Product Data
Product Name: PCI-Express 4Gb Fibre Channel HBA
Read-only fields:
[PN] Part number: AE311-60001
[EC] Engineering changes: C-4830
[SN] Serial number: CN80847W3U
[V0] Vendor specific: PW=15W
[V2] Vendor specific: 4847
[MN] Manufacture ID: 50 58 32 35 31 30 34 30 31 2d 37 
30 20 20 42
[V1] Vendor specific: 02.22
[V3] Vendor specific: 05.03.15
[V4] Vendor specific: 03.13
[V5] Vendor specific: 02.03
[YA] Asset tag: NA
[RV] Reserved: checksum good, 0 byte(s) reserved
End
...
---
Tested-by: Shane Seymour 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH 1/2] pci: Update VPD definitions

2015-12-17 Thread Seymour, Shane M
> The 'end' tag is actually 0x0f, it's the representation as a
> small resource data type tag that's 0x78 (ie shifted by 3).
> This patch also adds helper functions to extract the resource
> data type tags for both large and small resource data types.
>
> Cc: Alexander Duyck 
> Cc: Bjorn Helgaas 
> Signed-off-by: Hannes Reinecke 

Tested-by: Shane Seymour 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH 1/2] pci: Update VPD definitions

2015-12-17 Thread Seymour, Shane M
> The 'end' tag is actually 0x0f, it's the representation as a
> small resource data type tag that's 0x78 (ie shifted by 3).
> This patch also adds helper functions to extract the resource
> data type tags for both large and small resource data types.
>
> Cc: Alexander Duyck 
> Cc: Bjorn Helgaas 
> Signed-off-by: Hannes Reinecke 

Tested-by: Shane Seymour 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH 2/2] pci: Update VPD size with correct length

2015-12-17 Thread Seymour, Shane M
Tested with a HP AE311-60001 PCIe card. It used to repeat the same VPD every 4k
for 32k now only the 154 bytes are returned and lspci - reports that the 
data up
to and including the end and that the check sum is good:

...
Capabilities: [74] Vital Product Data
Product Name: PCI-Express 4Gb Fibre Channel HBA
Read-only fields:
[PN] Part number: AE311-60001
[EC] Engineering changes: C-4830
[SN] Serial number: CN80847W3U
[V0] Vendor specific: PW=15W
[V2] Vendor specific: 4847
[MN] Manufacture ID: 50 58 32 35 31 30 34 30 31 2d 37 
30 20 20 42
[V1] Vendor specific: 02.22
[V3] Vendor specific: 05.03.15
[V4] Vendor specific: 03.13
[V5] Vendor specific: 02.03
[YA] Asset tag: NA
[RV] Reserved: checksum good, 0 byte(s) reserved
End
...
---
Tested-by: Shane Seymour 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCHv2] pci: Update VPD size with correct length

2015-12-16 Thread Seymour, Shane M
> The only 'error' cases I've encountered so far is a read of all zeroes (and a
> halting the machine once you've read beyond a certain point) or a read of
> 0xff throughout the entire area. So that approach would work for both of them.

I should add that I'd tested the previous patch and this patch. I'll retest 
once v3 is available.

I have a card that repeats the vpd data every 4k for all 32k. The patch as it 
stands truncates that to just the valid data and lspci - shows the vpd data 
correctly.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCHv2] pci: Update VPD size with correct length

2015-12-16 Thread Seymour, Shane M
> The only 'error' cases I've encountered so far is a read of all zeroes (and a
> halting the machine once you've read beyond a certain point) or a read of
> 0xff throughout the entire area. So that approach would work for both of them.

I should add that I'd tested the previous patch and this patch. I'll retest 
once v3 is available.

I have a card that repeats the vpd data every 4k for all 32k. The patch as it 
stands truncates that to just the valid data and lspci - shows the vpd data 
correctly.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH v5 1/2] block: invalidate the page cache when issuing BLKZEROOUT.

2015-12-14 Thread Seymour, Shane M
Thank you for taking the issues I raised about converting
unsigned to signed into account.

Reviewed-by: Shane Seymour 



RE: [PATCH v5 1/2] block: invalidate the page cache when issuing BLKZEROOUT.

2015-12-14 Thread Seymour, Shane M
Thank you for taking the issues I raised about converting
unsigned to signed into account.

Reviewed-by: Shane Seymour 



RE: [PATCH v4] block: create ioctl to discard-or-zeroout a range of blocks

2015-11-16 Thread Seymour, Shane M

> v4: Check the start/len arguments for overflows prior to feeding the page
> cache bogus numbers (that it'll ignore anyway).

> Signed-off-by: Darrick J. Wong 

Reviewed-by: Shane Seymour 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH v4] block: create ioctl to discard-or-zeroout a range of blocks

2015-11-16 Thread Seymour, Shane M

> v4: Check the start/len arguments for overflows prior to feeding the page
> cache bogus numbers (that it'll ignore anyway).

> Signed-off-by: Darrick J. Wong 

Reviewed-by: Shane Seymour 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH] block: create ioctl to discard-or-zeroout a range of blocks

2015-11-12 Thread Seymour, Shane M
> I don't have a device large enough to test for signedness errors, since 
> passing 
> huge values for start and len never make it past the i_size_read check.

If you have someone trying to bypass your sanity checks then if 
start=18446744073709551104 and len=1024 the result of adding them together will 
be 512 (subtracting an extra 1 in the patched code to get 511 for end). That 
will pass the i_size_read check won't it? If so that would cause lstart in 
truncate_inode_pages_range() to be -512. I don't know what 
truncate_inode_pages_range() will do with a negative lstart value like that but 
it seems like an unusual value for your code to be willing to pass into 
truncate_inode_pages_range().

Shane
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH] block: create ioctl to discard-or-zeroout a range of blocks

2015-11-12 Thread Seymour, Shane M
> I don't have a device large enough to test for signedness errors, since 
> passing 
> huge values for start and len never make it past the i_size_read check.

If you have someone trying to bypass your sanity checks then if 
start=18446744073709551104 and len=1024 the result of adding them together will 
be 512 (subtracting an extra 1 in the patched code to get 511 for end). That 
will pass the i_size_read check won't it? If so that would cause lstart in 
truncate_inode_pages_range() to be -512. I don't know what 
truncate_inode_pages_range() will do with a negative lstart value like that but 
it seems like an unusual value for your code to be willing to pass into 
truncate_inode_pages_range().

Shane
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH] block: create ioctl to discard-or-zeroout a range of blocks

2015-11-11 Thread Seymour, Shane M
> which would make the other checks I suggested to ensure that neither start
> or end were more than (uint64_t)LLONG_MAX unnecessary.

My apologies I was wrong about what I said above - after thinking about it for 
longer you still need to make sure that at least len is not greater than 
(uint64_t)LLONG_MAX because in the calculation:

if (start > (uint64_t)LLONG_MAX - len)
return -EINVAL;

if len was more than (uint64_t)LLONG_MAX it would underflow and become a very 
large positive number and start would never be greater than that (and I said 
end when I should have said len).
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH] block: create ioctl to discard-or-zeroout a range of blocks

2015-11-11 Thread Seymour, Shane M
> I don't have a device large enough to test for signedness errors, since 
> passing
> huge values for start and len never make it past the i_size_read check.
> However, I do see that I forgot to check the padding values, so I'll update 
> that.

Then do you want to at least consider converting end to be of type loff_t to
match the type of value returned by i_get_size() and the type of the value
passed to truncate_inode_pages_range()? Then you only need one additional
check, something like:

/* Check for an overflow when adding start+len into end */
if (start > (uint64_t)LLONG_MAX - len)
return -EINVAL;

Looking at the maximum values if we have start=0 and
len=(uint64_t)LLONG_MAX we would still continue but
len=(uint64_t)LLONG_MAX+1 would return -EINVAL. Similarly for
start=(uint64_t)LLONG_MAX and len=0 it would continue
but start=(uint64_t)LLONG_MAX+1 would return -EINVAL and that 
should allow start + len to be safely added and stored into an loff_t without
overflow. With this check start+len can never be more than (uint64_t)LLONG_MAX
which would make the other checks I suggested to ensure that neither start or 
end
were more than (uint64_t)LLONG_MAX unnecessary.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH] block: create ioctl to discard-or-zeroout a range of blocks

2015-11-11 Thread Seymour, Shane M
> I don't have a device large enough to test for signedness errors, since 
> passing
> huge values for start and len never make it past the i_size_read check.
> However, I do see that I forgot to check the padding values, so I'll update 
> that.

Then do you want to at least consider converting end to be of type loff_t to
match the type of value returned by i_get_size() and the type of the value
passed to truncate_inode_pages_range()? Then you only need one additional
check, something like:

/* Check for an overflow when adding start+len into end */
if (start > (uint64_t)LLONG_MAX - len)
return -EINVAL;

Looking at the maximum values if we have start=0 and
len=(uint64_t)LLONG_MAX we would still continue but
len=(uint64_t)LLONG_MAX+1 would return -EINVAL. Similarly for
start=(uint64_t)LLONG_MAX and len=0 it would continue
but start=(uint64_t)LLONG_MAX+1 would return -EINVAL and that 
should allow start + len to be safely added and stored into an loff_t without
overflow. With this check start+len can never be more than (uint64_t)LLONG_MAX
which would make the other checks I suggested to ensure that neither start or 
end
were more than (uint64_t)LLONG_MAX unnecessary.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH] block: create ioctl to discard-or-zeroout a range of blocks

2015-11-11 Thread Seymour, Shane M
> which would make the other checks I suggested to ensure that neither start
> or end were more than (uint64_t)LLONG_MAX unnecessary.

My apologies I was wrong about what I said above - after thinking about it for 
longer you still need to make sure that at least len is not greater than 
(uint64_t)LLONG_MAX because in the calculation:

if (start > (uint64_t)LLONG_MAX - len)
return -EINVAL;

if len was more than (uint64_t)LLONG_MAX it would underflow and become a very 
large positive number and start would never be greater than that (and I said 
end when I should have said len).
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH] block: create ioctl to discard-or-zeroout a range of blocks

2015-11-10 Thread Seymour, Shane M
A quick question about this part of the patch:

> + uint64_t end = start + len - 1;

> + if (end >= i_size_read(bdev->bd_inode))
return -EINVAL;
 
> + /* Invalidate the page cache, including dirty pages */
> + mapping = bdev->bd_inode->i_mapping;
> + truncate_inode_pages_range(mapping, start, end);

blk_ioctl_zeroout accepts unsigned values for start and end (uint64_t) but 
loff_t types are turned from i_size_read() and passed as the 2nd and 3rd values 
to truncate_inode_pages_range() and loff_t is a signed value. It should be 
possible to pass in some values would overflow the calculation of end causing 
the test on the value of end and the result of i_size_read to pass but then end 
up passing a large unsigned value for in start that would be implicitly 
converted to signed in truncate_inode_pages_range. I was wondering if you'd 
tested passing in data that would cause sign conversion issues when passed into 
truncate_inode_pages_range (does it handle it gracefully?) or should this code:

if (start & 511)
return -EINVAL;
if (len & 511)
return -EINVAL;

be something more like this (for better sanity checking of your arguments) 
which will ensure that you don't have implicit conversion issues from unsigned 
to signed and ensure that the result of adding them together won't either:

if ((start & 511) || (start > (uint64_t)LLONG_MAX))
return -EINVAL;
if ((len & 511) ) || (len > (uint64_t)LLONG_MAX))
return -EINVAL;
if (end > (uint64_t)LLONG_MAX)
return -EINVAL;

My apologies in advance if I've made a mistake when looking at this and my 
concerns about unsigned values being implicitly converted to signed are 
unfounded (I would have hoped for compiler warnings about any implicit 
conversions though).

Thanks
Shane
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH] block: create ioctl to discard-or-zeroout a range of blocks

2015-11-10 Thread Seymour, Shane M
A quick question about this part of the patch:

> + uint64_t end = start + len - 1;

> + if (end >= i_size_read(bdev->bd_inode))
return -EINVAL;
 
> + /* Invalidate the page cache, including dirty pages */
> + mapping = bdev->bd_inode->i_mapping;
> + truncate_inode_pages_range(mapping, start, end);

blk_ioctl_zeroout accepts unsigned values for start and end (uint64_t) but 
loff_t types are turned from i_size_read() and passed as the 2nd and 3rd values 
to truncate_inode_pages_range() and loff_t is a signed value. It should be 
possible to pass in some values would overflow the calculation of end causing 
the test on the value of end and the result of i_size_read to pass but then end 
up passing a large unsigned value for in start that would be implicitly 
converted to signed in truncate_inode_pages_range. I was wondering if you'd 
tested passing in data that would cause sign conversion issues when passed into 
truncate_inode_pages_range (does it handle it gracefully?) or should this code:

if (start & 511)
return -EINVAL;
if (len & 511)
return -EINVAL;

be something more like this (for better sanity checking of your arguments) 
which will ensure that you don't have implicit conversion issues from unsigned 
to signed and ensure that the result of adding them together won't either:

if ((start & 511) || (start > (uint64_t)LLONG_MAX))
return -EINVAL;
if ((len & 511) ) || (len > (uint64_t)LLONG_MAX))
return -EINVAL;
if (end > (uint64_t)LLONG_MAX)
return -EINVAL;

My apologies in advance if I've made a mistake when looking at this and my 
concerns about unsigned values being implicitly converted to signed are 
unfounded (I would have hoped for compiler warnings about any implicit 
conversions though).

Thanks
Shane
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1

2015-02-15 Thread Seymour, Shane M
I found the manual pages really confusing so I had a go at rewriting
them - there were places in the manual page that didn't match the
functionality provided by your code as well as I could tell).

My apologies for a few formatting issues though. I still don't like
parts of epoll_pwait1 but it's less confusing than it was.

You are free to take some or all or none of the changes.

I did have a question I marked with  below about what you
describe and what your code does.

1) epoll_ctl_batch
--

NAME
   epoll_ctl_batch - batch control interface for an epoll descriptor

SYNOPSIS

   #include 

   int epoll_ctl_batch(int epfd, int flags,
   int ncmds, struct epoll_ctl_cmd *cmds);

DESCRIPTION

   This system call is an extension of epoll_ctl(). The primary
   difference is that this system call allows you to batch multiple
   operations with the one system call. This provides a more efficient
   interface for updating events on this epoll file descriptor epfd.

   The flags argument is reserved and must be 0.

   The argument ncmds is the number of cmds entries being passed in.
   This number must be greater than 0.

   Each operation is specified as an element in the cmds array, defined as:

   struct epoll_ctl_cmd {

  /* Reserved flags for future extension, must be 0. */
  int flags;

  /* The same as epoll_ctl() op parameter. */
  int op;

  /* The same as epoll_ctl() fd parameter. */
  int fd;

  /* The same as the "events" field in struct epoll_event. */
  uint32_t events;

  /* The same as the "data" field in struct epoll_event. */
  uint64_t data;

  /* Output field, will be set to the return code after this
   * command is executed by kernel */
  int result;
   };

   This system call is not atomic when updating the epoll descriptor.
   All entries in cmds are executed in the order provided. If any cmds
   entry fails to be processed no further entries are processed and 
   the number of successfully processed entries is returned.

   Each single operation defined by a struct epoll_ctl_cmd has the same 
   semantics as an epoll_ctl(2) call. See the epoll_ctl() manual page
   for more information about how to correctly setup the members of a
   struct epoll_ctl_cmd.

   Upon completion of the call the result member of each struct 
epoll_ctl_cmd
   may be set to 0 (sucessfully completed) or an error code depending on the
   result of the command. If the kernel fails to change the result (for
   example the location of the cmds argument is fully or partly read only)
   the result member of each struct epoll_ctl_cmd may be unchanged. 

RETURN VALUE

   epoll_ctl_batch() returns a number greater than 0 to indicate the number
   of cmnd entries processed. If all entries have been processed this will
   equal the ncmds parameter passed in.

   If one or more parameters are incorrect the value returned is -1 with
   errno set appropriately - no cmds entries have been processed when this
   happens.

   If processing any entry in the cmds argument results in an error the
   number returned is the number of the failing entry - this number will be
   less than ncmds. Since ncmds must be greater than 0 a return value of
   0 indicates an error associated with the very first cmds entry. 
   A return value of 0 does not indicate a successful system call.

   To correctly test the return value from epoll_ctl_batch() use code 
similar
   to the following:

ret=epoll_ctl_batch(epfd, flags, ncmds, );
if (ret < ncmds) {
if (ret == -1) {
/* An argument was invalid */
} else {
/* ret contains the number of successful entries
 * processed. If you (mis?)use it as a C index 
it
 * will index directly to the failing entry to
 * get the result use cmds[ret].result which 
may 
 * contain the errno value associated with the
 * entry.
 */
}
} else {
/* Success */
}

ERRORS

   EINVAL flags is non-zero, or ncmds is less than or equal to zero, or
  cmds is NULL.

   ENOMEM There was insufficient memory to handle the requested op control
  operation.

   EFAULT The memory area pointed to by cmds is not accessible with read
  permissions.

   In the event that the 

RE: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1

2015-02-15 Thread Seymour, Shane M
I found the manual pages really confusing so I had a go at rewriting
them - there were places in the manual page that didn't match the
functionality provided by your code as well as I could tell).

My apologies for a few formatting issues though. I still don't like
parts of epoll_pwait1 but it's less confusing than it was.

You are free to take some or all or none of the changes.

I did have a question I marked with  below about what you
describe and what your code does.

1) epoll_ctl_batch
--

NAME
   epoll_ctl_batch - batch control interface for an epoll descriptor

SYNOPSIS

   #include sys/epoll.h

   int epoll_ctl_batch(int epfd, int flags,
   int ncmds, struct epoll_ctl_cmd *cmds);

DESCRIPTION

   This system call is an extension of epoll_ctl(). The primary
   difference is that this system call allows you to batch multiple
   operations with the one system call. This provides a more efficient
   interface for updating events on this epoll file descriptor epfd.

   The flags argument is reserved and must be 0.

   The argument ncmds is the number of cmds entries being passed in.
   This number must be greater than 0.

   Each operation is specified as an element in the cmds array, defined as:

   struct epoll_ctl_cmd {

  /* Reserved flags for future extension, must be 0. */
  int flags;

  /* The same as epoll_ctl() op parameter. */
  int op;

  /* The same as epoll_ctl() fd parameter. */
  int fd;

  /* The same as the events field in struct epoll_event. */
  uint32_t events;

  /* The same as the data field in struct epoll_event. */
  uint64_t data;

  /* Output field, will be set to the return code after this
   * command is executed by kernel */
  int result;
   };

   This system call is not atomic when updating the epoll descriptor.
   All entries in cmds are executed in the order provided. If any cmds
   entry fails to be processed no further entries are processed and 
   the number of successfully processed entries is returned.

   Each single operation defined by a struct epoll_ctl_cmd has the same 
   semantics as an epoll_ctl(2) call. See the epoll_ctl() manual page
   for more information about how to correctly setup the members of a
   struct epoll_ctl_cmd.

   Upon completion of the call the result member of each struct 
epoll_ctl_cmd
   may be set to 0 (sucessfully completed) or an error code depending on the
   result of the command. If the kernel fails to change the result (for
   example the location of the cmds argument is fully or partly read only)
   the result member of each struct epoll_ctl_cmd may be unchanged. 

RETURN VALUE

   epoll_ctl_batch() returns a number greater than 0 to indicate the number
   of cmnd entries processed. If all entries have been processed this will
   equal the ncmds parameter passed in.

   If one or more parameters are incorrect the value returned is -1 with
   errno set appropriately - no cmds entries have been processed when this
   happens.

   If processing any entry in the cmds argument results in an error the
   number returned is the number of the failing entry - this number will be
   less than ncmds. Since ncmds must be greater than 0 a return value of
   0 indicates an error associated with the very first cmds entry. 
   A return value of 0 does not indicate a successful system call.

   To correctly test the return value from epoll_ctl_batch() use code 
similar
   to the following:

ret=epoll_ctl_batch(epfd, flags, ncmds, cmds);
if (ret  ncmds) {
if (ret == -1) {
/* An argument was invalid */
} else {
/* ret contains the number of successful entries
 * processed. If you (mis?)use it as a C index 
it
 * will index directly to the failing entry to
 * get the result use cmds[ret].result which 
may 
 * contain the errno value associated with the
 * entry.
 */
}
} else {
/* Success */
}

ERRORS

   EINVAL flags is non-zero, or ncmds is less than or equal to zero, or
  cmds is NULL.

   ENOMEM There was insufficient memory to handle the requested op control
  operation.

   EFAULT The memory area pointed to by cmds is not accessible with read
  permissions.

   In the event 

[RFC] Implementing tape statistics

2013-03-20 Thread Seymour, Shane M
Before you start reading this I want to apologize in advance for the length of 
this email. The length is important though to make sure all of the arguments 
and counter-arguments are represented in asking for feedback about how tape 
statistics would be best implemented.

There is some demand for the provision of tape IO statistics by users of the 
enterprise distributions, in particular those possessing large scale tape 
libraries. The provision of interfaces for getting statistics about tape I/O 
for use by utilities such as sar is a feature present in most commercial UNIX 
distributions.

Several patches have been produced and presented to linux-scsi mailing list but 
it seems that there are differences of opinion that cannot be reconciled and 
hence currently no acceptance of the proposed solutions. I have therefore 
decided to post to the wider kernel list to see if we can come to some 
consensus on what one of these (or other) should be adopted.

No patches are presented in this email for the sake of brevity, it's only a 
summary of the implementation and consequences along with discussion points for 
each. Note that this is not an attempt to work around the feedback gained on 
the linux-scsi mailing list but an attempt to get a wider consensus on what 
would be an acceptable implementation of a tape statistics interface.


Option #1
=

Provide device based stats vis sysfs:

/sys/class/scsi_tape/stNN/stats   (where NN is the tape device instance number)

The stat file provides the following in a one line entry suitable for a single 
fgets() and processing by sscanf():

+/* Tape stats */
+   u64 read_byte_cnt;  /* bytes read since tape open */
+   u64 write_byte_cnt; /* bytes written since tape open */
+   u64 in_flight;  /* Number of I/Os in flight */
+   u64 read_cnt;   /* Count of read requests since tape open */
+   u64 write_cnt;  /* Count of write requests since tape open */
+   u64 other_cnt;  /* Count of other requests since tape open
+  either implicit (from driver) or from
+  user space via ioctl. */
+   u64 read_ticks; /* Ticks spent completing read requests */
+   u64 write_ticks;/* Ticks spent completing write requests */
+   u64 io_ticks;   /* Ticks spent doing any I/O */
+   u64 stamp;  /* holds time request was queued */

The file contents are almost the same as the stat file for disks except the 
merge statistics are always 0 (since tape drives are sequential merged I/Os 
don't make sense) and the inflight value is almost always either a 0 or 1 since 
the st module always only has either one read or write outstanding. An 
additional field is added to the end of the file - a count of other I/Os - this 
could be commands issued by the driver within the kernel (e.g. rewind) or via 
an ioctl from user space. For tape drives some commands involving actions like 
tape movement can take a long time, it's important to keep track of scsi 
requests sent to the tape drive other than reads and writes so when delays 
happen they can be explained.

With some future patches to iostat this data will be reported, an example set 
of data is (the extra other_cnt data allows an average wait for all (a_await) 
and other I/Os per second (oio/s)):

tape:   wr/s   KiB_write/srd/s  KiB_read/s  r_await  w_await  a_await  oio/s
st0   186.50 46.750.000.000.0000.2760.276   0.00
st1   186.00 93.000.000.000.0000.1800.180   0.00
st2 0.00  0.00  181.50   45.500.3470.0000.347   0.00
st3 0.00  0.00  183.00   45.750.2240.0000.224   0.00

## This is our preferred method of implementation since it is efficient for 
both kernel and user-space (also requires fewest code changes), it also matches 
that already presented for the disk block subsys, see for example:

# grep . /sys/block/sd*/stat
/sys/block/sda/stat:   27351 6890   609272   22812936810   920727  
7660304  13339500   556889  1562009
/sys/block/sdb/stat:2369 6762188903900300   
 000 405939002

## SCSI maintainers counter-point: "I'm afraid we can't do it the way you're 
proposing.  files in sysfs must conform to the one value per file rule (so we 
avoid the ABI nastiness that plagues /proc).  You can create a stat directory 
with a bunch of files, but not a single file that gives all values.

## My counter:

I can only assume it (sysfs blk_subsys stat file) was implemented this way for 
the sake of efficiency, eg avoid a huge amount of file open/read/close calls in 
sar/iostat.  It's not unusual for us to see over a thousand block devices on 
enterprise servers, multiply that by the number of above entries and you would 
be talking about 9 x block-dev-count per iostat read 

[RFC] Implementing tape statistics

2013-03-20 Thread Seymour, Shane M
Before you start reading this I want to apologize in advance for the length of 
this email. The length is important though to make sure all of the arguments 
and counter-arguments are represented in asking for feedback about how tape 
statistics would be best implemented.

There is some demand for the provision of tape IO statistics by users of the 
enterprise distributions, in particular those possessing large scale tape 
libraries. The provision of interfaces for getting statistics about tape I/O 
for use by utilities such as sar is a feature present in most commercial UNIX 
distributions.

Several patches have been produced and presented to linux-scsi mailing list but 
it seems that there are differences of opinion that cannot be reconciled and 
hence currently no acceptance of the proposed solutions. I have therefore 
decided to post to the wider kernel list to see if we can come to some 
consensus on what one of these (or other) should be adopted.

No patches are presented in this email for the sake of brevity, it's only a 
summary of the implementation and consequences along with discussion points for 
each. Note that this is not an attempt to work around the feedback gained on 
the linux-scsi mailing list but an attempt to get a wider consensus on what 
would be an acceptable implementation of a tape statistics interface.


Option #1
=

Provide device based stats vis sysfs:

/sys/class/scsi_tape/stNN/stats   (where NN is the tape device instance number)

The stat file provides the following in a one line entry suitable for a single 
fgets() and processing by sscanf():

+/* Tape stats */
+   u64 read_byte_cnt;  /* bytes read since tape open */
+   u64 write_byte_cnt; /* bytes written since tape open */
+   u64 in_flight;  /* Number of I/Os in flight */
+   u64 read_cnt;   /* Count of read requests since tape open */
+   u64 write_cnt;  /* Count of write requests since tape open */
+   u64 other_cnt;  /* Count of other requests since tape open
+  either implicit (from driver) or from
+  user space via ioctl. */
+   u64 read_ticks; /* Ticks spent completing read requests */
+   u64 write_ticks;/* Ticks spent completing write requests */
+   u64 io_ticks;   /* Ticks spent doing any I/O */
+   u64 stamp;  /* holds time request was queued */

The file contents are almost the same as the stat file for disks except the 
merge statistics are always 0 (since tape drives are sequential merged I/Os 
don't make sense) and the inflight value is almost always either a 0 or 1 since 
the st module always only has either one read or write outstanding. An 
additional field is added to the end of the file - a count of other I/Os - this 
could be commands issued by the driver within the kernel (e.g. rewind) or via 
an ioctl from user space. For tape drives some commands involving actions like 
tape movement can take a long time, it's important to keep track of scsi 
requests sent to the tape drive other than reads and writes so when delays 
happen they can be explained.

With some future patches to iostat this data will be reported, an example set 
of data is (the extra other_cnt data allows an average wait for all (a_await) 
and other I/Os per second (oio/s)):

tape:   wr/s   KiB_write/srd/s  KiB_read/s  r_await  w_await  a_await  oio/s
st0   186.50 46.750.000.000.0000.2760.276   0.00
st1   186.00 93.000.000.000.0000.1800.180   0.00
st2 0.00  0.00  181.50   45.500.3470.0000.347   0.00
st3 0.00  0.00  183.00   45.750.2240.0000.224   0.00

## This is our preferred method of implementation since it is efficient for 
both kernel and user-space (also requires fewest code changes), it also matches 
that already presented for the disk block subsys, see for example:

# grep . /sys/block/sd*/stat
/sys/block/sda/stat:   27351 6890   609272   22812936810   920727  
7660304  13339500   556889  1562009
/sys/block/sdb/stat:2369 6762188903900300   
 000 405939002

## SCSI maintainers counter-point: I'm afraid we can't do it the way you're 
proposing.  files in sysfs must conform to the one value per file rule (so we 
avoid the ABI nastiness that plagues /proc).  You can create a stat directory 
with a bunch of files, but not a single file that gives all values.

## My counter:

I can only assume it (sysfs blk_subsys stat file) was implemented this way for 
the sake of efficiency, eg avoid a huge amount of file open/read/close calls in 
sar/iostat.  It's not unusual for us to see over a thousand block devices on 
enterprise servers, multiply that by the number of above entries and you would 
be talking about 9 x block-dev-count per iostat read