Re: "kyber: add tracepoints" causes write beyond size of object

2018-11-14 Thread Kees Cook
On Wed, Nov 14, 2018 at 6:06 PM, Omar Sandoval  wrote:
> On Wed, Nov 14, 2018 at 05:23:06PM -0600, Kees Cook wrote:
>> On Sat, Nov 10, 2018 at 8:15 AM, Jordan Glover
>>  wrote:
>> > Hello,
>> >
>> > Commit 6c3b7af1c975b87b86dcb2af233d1ae21eb05107 ("kyber: add 
>> > tracepoints")[1] causes write beyond size of object. This was detected by 
>> > "FORTIFY_SOURCE intra-object overflow checking"[2] feature which is part 
>> > of linux-hardened out-of-tree patchset designed to catch such errors.
>> >
>> > The specific error is:
>> >
>> > In file included from ./include/linux/bitmap.h:9,
>> >  from ./include/linux/cpumask.h:12,
>> >  from ./arch/x86/include/asm/cpumask.h:5,
>> >  from ./arch/x86/include/asm/msr.h:11,
>> >  from ./arch/x86/include/asm/processor.h:21,
>> >  from ./arch/x86/include/asm/cpufeature.h:8,
>> >  from ./arch/x86/include/asm/thread_info.h:53,
>> >  from ./include/linux/thread_info.h:38,
>> >  from ./arch/x86/include/asm/preempt.h:7,
>> >  from ./include/linux/preempt.h:81,
>> >  from ./include/linux/rcupdate.h:40,
>> >  from ./include/linux/rculist.h:11,
>> >  from ./include/linux/pid.h:5,
>> >  from ./include/linux/sched.h:14,
>> >  from ./include/linux/blkdev.h:5,
>> >  from block/kyber-iosched.c:21:
>> > In function ‘strlcpy’,
>> > inlined from ‘perf_trace_kyber_latency’ at 
>> > ./include/trace/events/kyber.h:14:1:
>> > ./include/linux/string.h:310:4: error: call to ‘__write_overflow’ declared 
>> > with attribute error: detected write beyond size of object passed as 1st 
>> > parameter
>> > __write_overflow();
>> > ^~
>> > In function ‘strlcpy’,
>> > inlined from ‘trace_event_raw_event_kyber_latency’ at 
>> > ./include/trace/events/kyber.h:14:1:
>> > ./include/linux/string.h:310:4: error: call to ‘__write_overflow’ declared 
>> > with attribute error: detected write beyond size of object passed as 1st 
>> > parameter
>> > __write_overflow();
>> > ^~
>> > make[1]: *** [scripts/Makefile.build:293: block/kyber-iosched.o] Error 1
>> > make: *** [Makefile:1063: block] Error 2
>> > make: *** Waiting for unfinished jobs
>> >
>> > Using 'strlcpy' function is generally not recommended[3][4].
>>
>> Due to the macros, this was a little tricky to find, but it looks like
>> a cut/paste typo:
>>
>> #define DOMAIN_LEN  16
>> #define LATENCY_TYPE_LEN8
>>
>> strlcpy(__entry->domain, domain, DOMAIN_LEN);
>> strlcpy(__entry->type, type, DOMAIN_LEN);
>>
>> This should use strscpy() regardless, and should use sizeof(dst)
>> instead of separate literals. The primary bug is using DOMAIN_LEN for
>> __entry->type when it is actually LATENCY_TYPE_LEN bytes.
>>
>> Can you build a patch for this? I'm happy to review.
>>
>> Thanks for finding this!
>
> Sorry, I forgot to reply to this thread, but Jens queued up a fix for
> this already:
>
> http://git.kernel.dk/cgit/linux-block/commit/?h=for-linus=18e962ac0781bcb70d433de3b2a325ff792b4288

Ah! Perfect. Thanks! :) I should be late to threads more often. ;)

-kees

-- 
Kees Cook


Re: "kyber: add tracepoints" causes write beyond size of object

2018-11-14 Thread Kees Cook
On Sat, Nov 10, 2018 at 8:15 AM, Jordan Glover
 wrote:
> Hello,
>
> Commit 6c3b7af1c975b87b86dcb2af233d1ae21eb05107 ("kyber: add tracepoints")[1] 
> causes write beyond size of object. This was detected by "FORTIFY_SOURCE 
> intra-object overflow checking"[2] feature which is part of linux-hardened 
> out-of-tree patchset designed to catch such errors.
>
> The specific error is:
>
> In file included from ./include/linux/bitmap.h:9,
>  from ./include/linux/cpumask.h:12,
>  from ./arch/x86/include/asm/cpumask.h:5,
>  from ./arch/x86/include/asm/msr.h:11,
>  from ./arch/x86/include/asm/processor.h:21,
>  from ./arch/x86/include/asm/cpufeature.h:8,
>  from ./arch/x86/include/asm/thread_info.h:53,
>  from ./include/linux/thread_info.h:38,
>  from ./arch/x86/include/asm/preempt.h:7,
>  from ./include/linux/preempt.h:81,
>  from ./include/linux/rcupdate.h:40,
>  from ./include/linux/rculist.h:11,
>  from ./include/linux/pid.h:5,
>  from ./include/linux/sched.h:14,
>  from ./include/linux/blkdev.h:5,
>  from block/kyber-iosched.c:21:
> In function ‘strlcpy’,
> inlined from ‘perf_trace_kyber_latency’ at 
> ./include/trace/events/kyber.h:14:1:
> ./include/linux/string.h:310:4: error: call to ‘__write_overflow’ declared 
> with attribute error: detected write beyond size of object passed as 1st 
> parameter
> __write_overflow();
> ^~
> In function ‘strlcpy’,
> inlined from ‘trace_event_raw_event_kyber_latency’ at 
> ./include/trace/events/kyber.h:14:1:
> ./include/linux/string.h:310:4: error: call to ‘__write_overflow’ declared 
> with attribute error: detected write beyond size of object passed as 1st 
> parameter
> __write_overflow();
> ^~
> make[1]: *** [scripts/Makefile.build:293: block/kyber-iosched.o] Error 1
> make: *** [Makefile:1063: block] Error 2
> make: *** Waiting for unfinished jobs
>
> Using 'strlcpy' function is generally not recommended[3][4].

Due to the macros, this was a little tricky to find, but it looks like
a cut/paste typo:

#define DOMAIN_LEN  16
#define LATENCY_TYPE_LEN8

strlcpy(__entry->domain, domain, DOMAIN_LEN);
strlcpy(__entry->type, type, DOMAIN_LEN);

This should use strscpy() regardless, and should use sizeof(dst)
instead of separate literals. The primary bug is using DOMAIN_LEN for
__entry->type when it is actually LATENCY_TYPE_LEN bytes.

Can you build a patch for this? I'm happy to review.

Thanks for finding this!

-Kees

>
> Jordan
>
> [1] 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v4.20-rc1=6c3b7af1c975b87b86dcb2af233d1ae21eb05107
>
> [2] 
> https://github.com/anthraxx/linux-hardened/commit/9460692de8eb53fd62d59f564eba215e7c03a34b
>
> [3] https://lwn.net/Articles/763641/
>
> [4] https://outflux.net/slides/2018/lss/danger.pdf



-- 
Kees Cook


Re: [PATCH v6 10/18] x86/power/64: Remove VLA usage

2018-07-25 Thread Kees Cook
On Wed, Jul 25, 2018 at 4:32 AM, Rafael J. Wysocki  wrote:
> On Tue, Jul 24, 2018 at 6:49 PM, Kees Cook  wrote:
>> In the quest to remove all stack VLA usage from the kernel[1], this
>> removes the discouraged use of AHASH_REQUEST_ON_STACK by switching to
>> shash directly and allocating the descriptor in heap memory (which should
>> be fine: the tfm has already been allocated there too).
>>
>> [1] 
>> https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qpxydaacu1rq...@mail.gmail.com
>>
>> Signed-off-by: Kees Cook 
>> Acked-by: Pavel Machek 
>
> I think I can queue this up if there are no objections from others.
>
> Do you want me to do that?

Sure thing. It looks like the other stand-alone patches like this one
are getting taken into the non-crypto trees, so that's fine.

Thanks!

-Kees

-- 
Kees Cook
Pixel Security


Re: [PATCH 3/6] block: Create scsi_sense.h for SCSI and ATAPI

2018-05-24 Thread Kees Cook
On Thu, May 24, 2018 at 1:00 AM, Christoph Hellwig <h...@infradead.org> wrote:
> On Wed, May 23, 2018 at 03:14:19PM -0600, Jens Axboe wrote:
>> Ugh, so that would necessitate a change there too. As I said before,
>> I don't really care where it lives. I know the SCSI folks seem bothered
>> by moving it, but in reality, it's not like this stuff will likely ever
>> really change. Of the two choices (select entire SCSI stack, or just move
>> this little bit), I know what I would consider the saner option...
>
> Oh well.  How about something like this respin of Kees' series?
>
> http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/sense-cleanup

Does the CONFIG_PCMCIA in drivers/scsi/Makefile now get exposed in
weird config cases?

Otherwise, yeah, looks good to me. Thanks!

-Kees

-- 
Kees Cook
Pixel Security


Re: [PATCH 3/6] block: Create scsi_sense.h for SCSI and ATAPI

2018-05-23 Thread Kees Cook
On Wed, May 23, 2018 at 2:06 PM, Martin K. Petersen
<martin.peter...@oracle.com> wrote:
>
> Kees,
>
>> obj-$(CONFIG_SCSI)  += scsi/
>>
>> So: this needs to live in block/ just like CONFIG_BLK_SCSI_REQUEST's
>> scsi_ioctl.c. I will split it into CONFIG_BLK_SCSI_SENSE, but I'll
>> still need to move the code from drivers/scsi/ to block/. Is this
>> okay?
>
> The reason this sucks is that scsi_normalize_sense() is an inherent core
> feature in the SCSI layer. Dealing with sense data for ioctls is just a
> fringe use case.

True, though I'm finding other robustness issues in the CDROM code.
They're probably all insane corner cases, but it seems like it'd be
nice to just fix them:

diff --git a/drivers/cdrom/cdrom.c b/drivers/cdrom/cdrom.c
index 3522d2cae1b6..7726c8618c30 100644
--- a/drivers/cdrom/cdrom.c
+++ b/drivers/cdrom/cdrom.c
@@ -,9 +2223,12 @@ static int cdrom_read_cdda_bpc(struct
cdrom_device_info *cdi, __u8 __user *ubuf,

blk_execute_rq(q, cdi->disk, rq, 0);
if (scsi_req(rq)->result) {
-   struct request_sense *s = req->sense;
+   struct scsi_sense_hdr sshdr;
+
ret = -EIO;
-   cdi->last_sense = s->sense_key;
+   scsi_normalize_sense(req->sense, req->sense_len,
+);
+   cdi->last_sense = sshdr.sense_key;
}

if (blk_rq_unmap_user(bio))

This code wasn't checking req->sense_len, for example. It'll just get
stale data at worst case, but it's still ugly, especially since we
have a solution to do it right.

> I don't want to get too hung up on what goes where. But architecturally
> it really irks me to move a core piece of SCSI state machine
> functionality out of the subsystem to accommodate ioctl handling.

It looks like there is more in block/scsi_ioctl.c than just ioctl
handling, which is why I put the function in there originally.
Honestly, it's almost so small I could make it a static inline. :P

> I'm traveling today so I probably won't get a chance to look closely
> until tomorrow morning.

No worries; thanks for looking at it!

-Kees

-- 
Kees Cook
Pixel Security


Re: [PATCH 6/6] scsi: Check sense buffer size at build time

2018-05-23 Thread Kees Cook
On Wed, May 23, 2018 at 1:25 AM, Sergei Shtylyov
<sergei.shtyl...@cogentembedded.com> wrote:
> Hello!
>
> On 5/22/2018 9:15 PM, Kees Cook wrote:
>
>> To avoid introducing problems like those fixed in commit f7068114d45e
>> ("sr: pass down correctly sized SCSI sense buffer"), this creates a macro
>> wrapper for scsi_execute() that verifies the size of the sense buffer
>> similar to what was done for command string sizes in commit 3756f6401c30
>> ("exec: avoid gcc-8 warning for get_task_comm").
>>
>> Another solution could be to add another argument to scsi_execute(),
>> but this function already takes a lot of arguments and Jens was not fond
>> of that approach. As there was only a pair of dynamically allocated sense
>> buffers, this also moves those 96 bytes onto the stack to avoid triggering
>> the sizeof() check.
>>
>> Signed-off-by: Kees Cook <keesc...@chromium.org>
>> ---
>>   drivers/scsi/scsi_lib.c|  6 +++---
>>   include/scsi/scsi_device.h | 12 +++-
>>   2 files changed, 14 insertions(+), 4 deletions(-)
>>
> [...]
>>
>> diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
>> index 7ae177c8e399..1bb87b6c0ad2 100644
>> --- a/include/scsi/scsi_device.h
>> +++ b/include/scsi/scsi_device.h
>> @@ -426,11 +426,21 @@ extern const char *scsi_device_state_name(enum
>> scsi_device_state);
>>   extern int scsi_is_sdev_device(const struct device *);
>>   extern int scsi_is_target_device(const struct device *);
>>   extern void scsi_sanitize_inquiry_string(unsigned char *s, int len);
>> -extern int scsi_execute(struct scsi_device *sdev, const unsigned char
>> *cmd,
>> +extern int __scsi_execute(struct scsi_device *sdev, const unsigned char
>> *cmd,
>> int data_direction, void *buffer, unsigned
>> bufflen,
>> unsigned char *sense, struct scsi_sense_hdr
>> *sshdr,
>> int timeout, int retries, u64 flags,
>> req_flags_t rq_flags, int *resid);
>> +/* Make sure any sense buffer is the correct size. */
>> +#define scsi_execute(sdev, cmd, data_direction, buffer, bufflen, sense,
>> \
>> +sshdr, timeout, retries, flags, rq_flags, resid)   \
>> +({ \
>> +   BUILD_BUG_ON((sense) != NULL && \
>> +sizeof(sense) != SCSI_SENSE_BUFFERSIZE);   \
>
>
>This would only check the size of the 'sense' pointer, no?

Correct. Passing in a variable that was declared as:

char sense[SCSI_SENSE_BUFFERSIZE];

would pass this check. Dynamically allocated would not (and we don't
have any cases of that left after the prior patch in this series), and
it should cast improper casts.

If people don't like this bit of robustness vs complexity/limit, I'm
happy to leave it off the series.

-Kees

-- 
Kees Cook
Pixel Security


Re: [PATCH 3/6] block: Create scsi_sense.h for SCSI and ATAPI

2018-05-23 Thread Kees Cook
On Wed, May 23, 2018 at 7:31 AM, Jens Axboe <ax...@kernel.dk> wrote:
> On 5/23/18 8:25 AM, Christoph Hellwig wrote:
>> On Wed, May 23, 2018 at 08:13:56AM -0600, Jens Axboe wrote:
>>>> Should I move to code to a new drivers/scsi/scsi_sense.c and add it to
>>>> drivers/scsi/Makefile as:
>>>>
>>>> obj-$(CONFIG_BLK_SCSI_REQUEST)+= scsi_sense.o
>>>>
>>>> Every place I want to use the code is already covered by
>>>> CONFIG_BLK_SCSI_REQUEST, so it seems like I just need to know where to
>>>> put the .c file. :P
>>>
>>> I think this is so much saner than a SCSI select or dependency, so I'll
>>> have to disagree with Martin and Christoph. Just put it in drivers/scsi,
>>> if it's the location they care about.
>>
>> I actually plan to remove CONFIG_BLK_SCSI_REQUEST in a merge window
>> or two.  The only users are scsi and the ide layer, (virtio_blk
>> support has already been accidentally disabled for a while), and getting
>> rid of it allows to to shrink and simply the scsi data structures.
>>
>> But if you want this for now lets keep scsi_sense.c in drivers/scsi
>> but depend on CONFIG_BLK_SCSI_REQUEST, that is easy enough to fix up.
>
> It could be a stand-alone dependency, doesn't have to be BLK_SCSI_REQUEST.
> BLA_SCSI_SENSE or something would do. I don't care too much about that,
> mostly getting rid of the entire stack dependency.

Aaand, I can't do this and leave it in drivers/scsi because of drivers/Makefile:

obj-$(CONFIG_SCSI)  += scsi/

So: this needs to live in block/ just like CONFIG_BLK_SCSI_REQUEST's
scsi_ioctl.c. I will split it into CONFIG_BLK_SCSI_SENSE, but I'll
still need to move the code from drivers/scsi/ to block/. Is this
okay?

-Kees

-- 
Kees Cook
Pixel Security


Re: [PATCH 3/6] block: Create scsi_sense.h for SCSI and ATAPI

2018-05-22 Thread Kees Cook
On Tue, May 22, 2018 at 4:42 PM, Jens Axboe <ax...@kernel.dk> wrote:
> On May 22, 2018, at 5:31 PM, Kees Cook <keesc...@chromium.org> wrote:
>>
>>> On Tue, May 22, 2018 at 12:16 PM, Jens Axboe <ax...@kernel.dk> wrote:
>>>> On 5/22/18 1:13 PM, Christoph Hellwig wrote:
>>>>> On Tue, May 22, 2018 at 01:09:41PM -0600, Jens Axboe wrote:
>>>>> I think Martin and Christoph are objecting to moving the code to
>>>>> block/scsi_ioctl.h. I don't care too much about where the code is, but
>>>>> think it would be nice to have the definitions in a separate header. But
>>>>> if they prefer just pulling in all of SCSI for it, well then I guess
>>>>> it's pointless to move the header bits. Seems very heavy handed to me,
>>>>> though.
>>>>
>>>> It might be heavy handed for the 3 remaining users of drivers/ide,
>>>
>>> Brutal :-)
>>
>> Heh. I noticed a similar sense buffer use in drivers/cdrom/cdrom.c
>> too. Is this okay under the same considerations?
>
> This is going from somewhat crazy to pretty nuts, imho. I guess in practical 
> terms it doesn’t matter that much, since most folks would be using sr. I 
> still think a split would be vastly superior. Just keep the scsi code in 
> drivers/scsi/ and make it independently selectable.

I had originally tied it to BLK_SCSI_REQUEST. Logically speaking,
sense buffers are part of the request, and the CONFIG work is already
done. This is roughly what I tried to do before, since scsi_ioctl.c is
the only code pulled in for CONFIG_BLK_SCSI_REQUEST:

$ git grep BLK_SCSI_REQUEST | grep Makefile
block/Makefile:obj-$(CONFIG_BLK_SCSI_REQUEST)   += scsi_ioctl.o

Should I move to code to a new drivers/scsi/scsi_sense.c and add it to
drivers/scsi/Makefile as:

obj-$(CONFIG_BLK_SCSI_REQUEST)+= scsi_sense.o

Every place I want to use the code is already covered by
CONFIG_BLK_SCSI_REQUEST, so it seems like I just need to know where to
put the .c file. :P

-Kees

-- 
Kees Cook
Pixel Security


Re: [PATCH 3/6] block: Create scsi_sense.h for SCSI and ATAPI

2018-05-22 Thread Kees Cook
On Tue, May 22, 2018 at 12:16 PM, Jens Axboe <ax...@kernel.dk> wrote:
> On 5/22/18 1:13 PM, Christoph Hellwig wrote:
>> On Tue, May 22, 2018 at 01:09:41PM -0600, Jens Axboe wrote:
>>> I think Martin and Christoph are objecting to moving the code to
>>> block/scsi_ioctl.h. I don't care too much about where the code is, but
>>> think it would be nice to have the definitions in a separate header. But
>>> if they prefer just pulling in all of SCSI for it, well then I guess
>>> it's pointless to move the header bits. Seems very heavy handed to me,
>>> though.
>>
>> It might be heavy handed for the 3 remaining users of drivers/ide,
>
> Brutal :-)

Heh. I noticed a similar sense buffer use in drivers/cdrom/cdrom.c
too. Is this okay under the same considerations?

diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index ad9b687a236a..220ff321c102 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -79,7 +79,7 @@ config GDROM
tristate "SEGA Dreamcast GD-ROM drive"
depends on SH_DREAMCAST
select CDROM
-   select BLK_SCSI_REQUEST # only for the generic cdrom code
+   select SCSI
help
  A standard SEGA Dreamcast comes with a modified CD ROM drive called a
  "GD-ROM" by SEGA to signify it is capable of reading special disks
@@ -345,7 +345,7 @@ config CDROM_PKTCDVD
tristate "Packet writing on CD/DVD media (DEPRECATED)"
depends on !UML
select CDROM
-   select BLK_SCSI_REQUEST
+   select SCSI
help
  Note: This driver is deprecated and will be removed from the
  kernel in the near future!
diff --git a/drivers/block/paride/Kconfig b/drivers/block/paride/Kconfig
index f8bd6ef3605a..7fdfcc5eaca5 100644
--- a/drivers/block/paride/Kconfig
+++ b/drivers/block/paride/Kconfig
@@ -27,7 +27,7 @@ config PARIDE_PCD
tristate "Parallel port ATAPI CD-ROMs"
depends on PARIDE
select CDROM
-   select BLK_SCSI_REQUEST # only for the generic cdrom code
+   select SCSI
---help---
  This option enables the high-level driver for ATAPI CD-ROM devices
  connected through a parallel port. If you chose to build PARIDE

>> but as long as that stuff just keeps working I'd rather worry about
>> everyone else, and keep the scsi code where it belongs.
>
> Fine with me then, hopefully we can some day kill it off.

I'll send a v2. I found a few other things to fix up (including the
cdrom.c one).

Thanks!

-Kees

-- 
Kees Cook
Pixel Security


Re: [PATCH 3/6] block: Create scsi_sense.h for SCSI and ATAPI

2018-05-22 Thread Kees Cook
On Tue, May 22, 2018 at 11:50 AM, Martin K. Petersen
<martin.peter...@oracle.com> wrote:
>
> Christoph,
>
>> On Tue, May 22, 2018 at 11:15:09AM -0700, Kees Cook wrote:
>>> Both SCSI and ATAPI share the sense header. In preparation for using the
>>> struct scsi_sense_hdr more widely, move this into a separate header and
>>> move the helper function to scsi_ioctl.c which is linked with CONFIG_IDE
>>> by way of CONFIG_BLK_SCSI_REQUEST.
>>
>> Please keep the code where it is and just depend on SCSI on the legacy
>> ide driver.  No need to do gymnastics just for a legacy case.
>
> Yup, I agree.

Oh, er, this was mainly done at Jens's request. Jens, can you advise?

-Kees

-- 
Kees Cook
Pixel Security


[PATCH 3/6] block: Create scsi_sense.h for SCSI and ATAPI

2018-05-22 Thread Kees Cook
Both SCSI and ATAPI share the sense header. In preparation for using the
struct scsi_sense_hdr more widely, move this into a separate header and
move the helper function to scsi_ioctl.c which is linked with CONFIG_IDE
by way of CONFIG_BLK_SCSI_REQUEST.

Signed-off-by: Kees Cook <keesc...@chromium.org>
---
 block/scsi_ioctl.c | 64 ++
 drivers/scsi/scsi_common.c | 64 --
 include/scsi/scsi_cmnd.h   |  1 -
 include/scsi/scsi_common.h | 32 +--
 include/scsi/scsi_sense.h  | 44 ++
 5 files changed, 109 insertions(+), 96 deletions(-)
 create mode 100644 include/scsi/scsi_sense.h

diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c
index 60b471f8621b..87ff3cc7a364 100644
--- a/block/scsi_ioctl.c
+++ b/block/scsi_ioctl.c
@@ -728,6 +728,70 @@ void scsi_req_init(struct scsi_request *req)
 }
 EXPORT_SYMBOL(scsi_req_init);
 
+/**
+ * scsi_normalize_sense - normalize main elements from either fixed or
+ * descriptor sense data format into a common format.
+ *
+ * @sense_buffer:  byte array containing sense data returned by device
+ * @sb_len:number of valid bytes in sense_buffer
+ * @sshdr: pointer to instance of structure that common
+ * elements are written to.
+ *
+ * Notes:
+ * The "main elements" from sense data are: response_code, sense_key,
+ * asc, ascq and additional_length (only for descriptor format).
+ *
+ * Typically this function can be called after a device has
+ * responded to a SCSI command with the CHECK_CONDITION status.
+ *
+ * Return value:
+ * true if valid sense data information found, else false;
+ */
+bool scsi_normalize_sense(const u8 *sense_buffer, int sb_len,
+ struct scsi_sense_hdr *sshdr)
+{
+   memset(sshdr, 0, sizeof(struct scsi_sense_hdr));
+
+   if (!sense_buffer || !sb_len)
+   return false;
+
+   sshdr->response_code = (sense_buffer[0] & 0x7f);
+
+   if (!scsi_sense_valid(sshdr))
+   return false;
+
+   if (sshdr->response_code >= 0x72) {
+   /*
+* descriptor format
+*/
+   if (sb_len > 1)
+   sshdr->sense_key = (sense_buffer[1] & 0xf);
+   if (sb_len > 2)
+   sshdr->asc = sense_buffer[2];
+   if (sb_len > 3)
+   sshdr->ascq = sense_buffer[3];
+   if (sb_len > 7)
+   sshdr->additional_length = sense_buffer[7];
+   } else {
+   /*
+* fixed format
+*/
+   if (sb_len > 2)
+   sshdr->sense_key = (sense_buffer[2] & 0xf);
+   if (sb_len > 7) {
+   sb_len = (sb_len < (sense_buffer[7] + 8)) ?
+sb_len : (sense_buffer[7] + 8);
+   if (sb_len > 12)
+   sshdr->asc = sense_buffer[12];
+   if (sb_len > 13)
+   sshdr->ascq = sense_buffer[13];
+   }
+   }
+
+   return true;
+}
+EXPORT_SYMBOL(scsi_normalize_sense);
+
 static int __init blk_scsi_ioctl_init(void)
 {
blk_set_cmd_filter_defaults(_default_cmd_filter);
diff --git a/drivers/scsi/scsi_common.c b/drivers/scsi/scsi_common.c
index 90349498f686..8621a07cb967 100644
--- a/drivers/scsi/scsi_common.c
+++ b/drivers/scsi/scsi_common.c
@@ -116,70 +116,6 @@ void int_to_scsilun(u64 lun, struct scsi_lun *scsilun)
 }
 EXPORT_SYMBOL(int_to_scsilun);
 
-/**
- * scsi_normalize_sense - normalize main elements from either fixed or
- * descriptor sense data format into a common format.
- *
- * @sense_buffer:  byte array containing sense data returned by device
- * @sb_len:number of valid bytes in sense_buffer
- * @sshdr: pointer to instance of structure that common
- * elements are written to.
- *
- * Notes:
- * The "main elements" from sense data are: response_code, sense_key,
- * asc, ascq and additional_length (only for descriptor format).
- *
- * Typically this function can be called after a device has
- * responded to a SCSI command with the CHECK_CONDITION status.
- *
- * Return value:
- * true if valid sense data information found, else false;
- */
-bool scsi_normalize_sense(const u8 *sense_buffer, int sb_len,
- struct scsi_sense_hdr *sshdr)
-{
-   memset(sshdr, 0, sizeof(struct scsi_sense_hdr));
-
-   if (!sense_buffer || !sb_len)
-   return false;
-
-   sshdr->response_code = (sense_buffer[0] & 0x7f);
-
-   if (!scsi_sense_valid(sshdr))
-   return false;

[PATCH 2/6] scsi: cxlflash: Drop unused sense buffers

2018-05-22 Thread Kees Cook
This removes the unused sense buffer in read_cap16() and write_same16().

Signed-off-by: Kees Cook <keesc...@chromium.org>
---
 drivers/scsi/cxlflash/superpipe.c | 8 ++--
 drivers/scsi/cxlflash/vlun.c  | 7 ++-
 2 files changed, 4 insertions(+), 11 deletions(-)

diff --git a/drivers/scsi/cxlflash/superpipe.c 
b/drivers/scsi/cxlflash/superpipe.c
index 2fe79df5c73c..59b9f2023748 100644
--- a/drivers/scsi/cxlflash/superpipe.c
+++ b/drivers/scsi/cxlflash/superpipe.c
@@ -324,7 +324,6 @@ static int read_cap16(struct scsi_device *sdev, struct 
llun_info *lli)
struct scsi_sense_hdr sshdr;
u8 *cmd_buf = NULL;
u8 *scsi_cmd = NULL;
-   u8 *sense_buf = NULL;
int rc = 0;
int result = 0;
int retry_cnt = 0;
@@ -333,8 +332,7 @@ static int read_cap16(struct scsi_device *sdev, struct 
llun_info *lli)
 retry:
cmd_buf = kzalloc(CMD_BUFSIZE, GFP_KERNEL);
scsi_cmd = kzalloc(MAX_COMMAND_SIZE, GFP_KERNEL);
-   sense_buf = kzalloc(SCSI_SENSE_BUFFERSIZE, GFP_KERNEL);
-   if (unlikely(!cmd_buf || !scsi_cmd || !sense_buf)) {
+   if (unlikely(!cmd_buf || !scsi_cmd)) {
rc = -ENOMEM;
goto out;
}
@@ -349,7 +347,7 @@ static int read_cap16(struct scsi_device *sdev, struct 
llun_info *lli)
/* Drop the ioctl read semahpore across lengthy call */
up_read(>ioctl_rwsem);
result = scsi_execute(sdev, scsi_cmd, DMA_FROM_DEVICE, cmd_buf,
- CMD_BUFSIZE, sense_buf, , to, CMD_RETRIES,
+ CMD_BUFSIZE, NULL, , to, CMD_RETRIES,
  0, 0, NULL);
down_read(>ioctl_rwsem);
rc = check_state(cfg);
@@ -380,7 +378,6 @@ static int read_cap16(struct scsi_device *sdev, struct 
llun_info *lli)
if (retry_cnt++ < 1) {
kfree(cmd_buf);
kfree(scsi_cmd);
-   kfree(sense_buf);
goto retry;
}
}
@@ -411,7 +408,6 @@ static int read_cap16(struct scsi_device *sdev, struct 
llun_info *lli)
 out:
kfree(cmd_buf);
kfree(scsi_cmd);
-   kfree(sense_buf);
 
dev_dbg(dev, "%s: maxlba=%lld blklen=%d rc=%d\n",
__func__, gli->max_lba, gli->blk_len, rc);
diff --git a/drivers/scsi/cxlflash/vlun.c b/drivers/scsi/cxlflash/vlun.c
index 5deef57a7834..e7e9b2f2ad21 100644
--- a/drivers/scsi/cxlflash/vlun.c
+++ b/drivers/scsi/cxlflash/vlun.c
@@ -425,7 +425,6 @@ static int write_same16(struct scsi_device *sdev,
 {
u8 *cmd_buf = NULL;
u8 *scsi_cmd = NULL;
-   u8 *sense_buf = NULL;
int rc = 0;
int result = 0;
u64 offset = lba;
@@ -439,8 +438,7 @@ static int write_same16(struct scsi_device *sdev,
 
cmd_buf = kzalloc(CMD_BUFSIZE, GFP_KERNEL);
scsi_cmd = kzalloc(MAX_COMMAND_SIZE, GFP_KERNEL);
-   sense_buf = kzalloc(SCSI_SENSE_BUFFERSIZE, GFP_KERNEL);
-   if (unlikely(!cmd_buf || !scsi_cmd || !sense_buf)) {
+   if (unlikely(!cmd_buf || !scsi_cmd)) {
rc = -ENOMEM;
goto out;
}
@@ -456,7 +454,7 @@ static int write_same16(struct scsi_device *sdev,
/* Drop the ioctl read semahpore across lengthy call */
up_read(>ioctl_rwsem);
result = scsi_execute(sdev, scsi_cmd, DMA_TO_DEVICE, cmd_buf,
- CMD_BUFSIZE, sense_buf, NULL, to,
+ CMD_BUFSIZE, NULL, NULL, to,
  CMD_RETRIES, 0, 0, NULL);
down_read(>ioctl_rwsem);
rc = check_state(cfg);
@@ -481,7 +479,6 @@ static int write_same16(struct scsi_device *sdev,
 out:
kfree(cmd_buf);
kfree(scsi_cmd);
-   kfree(sense_buf);
dev_dbg(dev, "%s: returning rc=%d\n", __func__, rc);
return rc;
 }
-- 
2.17.0



[PATCH 1/6] ide-cd: Drop unused sense buffers

2018-05-22 Thread Kees Cook
This drops unused sense buffers from:

cdrom_eject()
cdrom_read_capacity()
cdrom_read_tocentry()
ide_cd_lockdoor()
ide_cd_read_toc()

Signed-off-by: Kees Cook <keesc...@chromium.org>
---
 drivers/ide/ide-cd.c   | 36 +++-
 drivers/ide/ide-cd.h   |  2 +-
 drivers/ide/ide-cd_ioctl.c | 34 --
 3 files changed, 28 insertions(+), 44 deletions(-)

diff --git a/drivers/ide/ide-cd.c b/drivers/ide/ide-cd.c
index 5a8e8e3c22cd..1d5b35312e33 100644
--- a/drivers/ide/ide-cd.c
+++ b/drivers/ide/ide-cd.c
@@ -890,8 +890,7 @@ int cdrom_check_status(ide_drive_t *drive, struct 
request_sense *sense)
 }
 
 static int cdrom_read_capacity(ide_drive_t *drive, unsigned long *capacity,
-  unsigned long *sectors_per_frame,
-  struct request_sense *sense)
+  unsigned long *sectors_per_frame)
 {
struct {
__be32 lba;
@@ -908,7 +907,7 @@ static int cdrom_read_capacity(ide_drive_t *drive, unsigned 
long *capacity,
memset(cmd, 0, BLK_MAX_CDB);
cmd[0] = GPCMD_READ_CDVD_CAPACITY;
 
-   stat = ide_cd_queue_pc(drive, cmd, 0, , , sense, 0,
+   stat = ide_cd_queue_pc(drive, cmd, 0, , , NULL, 0,
   RQF_QUIET);
if (stat)
return stat;
@@ -944,8 +943,7 @@ static int cdrom_read_capacity(ide_drive_t *drive, unsigned 
long *capacity,
 }
 
 static int cdrom_read_tocentry(ide_drive_t *drive, int trackno, int msf_flag,
-   int format, char *buf, int buflen,
-   struct request_sense *sense)
+   int format, char *buf, int buflen)
 {
unsigned char cmd[BLK_MAX_CDB];
 
@@ -962,11 +960,11 @@ static int cdrom_read_tocentry(ide_drive_t *drive, int 
trackno, int msf_flag,
if (msf_flag)
cmd[1] = 2;
 
-   return ide_cd_queue_pc(drive, cmd, 0, buf, , sense, 0, 
RQF_QUIET);
+   return ide_cd_queue_pc(drive, cmd, 0, buf, , NULL, 0, RQF_QUIET);
 }
 
 /* Try to read the entire TOC for the disk into our internal buffer. */
-int ide_cd_read_toc(ide_drive_t *drive, struct request_sense *sense)
+int ide_cd_read_toc(ide_drive_t *drive)
 {
int stat, ntracks, i;
struct cdrom_info *info = drive->driver_data;
@@ -996,14 +994,13 @@ int ide_cd_read_toc(ide_drive_t *drive, struct 
request_sense *sense)
 * Check to see if the existing data is still valid. If it is,
 * just return.
 */
-   (void) cdrom_check_status(drive, sense);
+   (void) cdrom_check_status(drive, NULL);
 
if (drive->atapi_flags & IDE_AFLAG_TOC_VALID)
return 0;
 
/* try to get the total cdrom capacity and sector size */
-   stat = cdrom_read_capacity(drive, >capacity, _per_frame,
-  sense);
+   stat = cdrom_read_capacity(drive, >capacity, _per_frame);
if (stat)
toc->capacity = 0x1f;
 
@@ -1016,7 +1013,7 @@ int ide_cd_read_toc(ide_drive_t *drive, struct 
request_sense *sense)
 
/* first read just the header, so we know how long the TOC is */
stat = cdrom_read_tocentry(drive, 0, 1, 0, (char *) >hdr,
-   sizeof(struct atapi_toc_header), sense);
+   sizeof(struct atapi_toc_header));
if (stat)
return stat;
 
@@ -1036,7 +1033,7 @@ int ide_cd_read_toc(ide_drive_t *drive, struct 
request_sense *sense)
  (char *)>hdr,
   sizeof(struct atapi_toc_header) +
   (ntracks + 1) *
-  sizeof(struct atapi_toc_entry), sense);
+  sizeof(struct atapi_toc_entry));
 
if (stat && toc->hdr.first_track > 1) {
/*
@@ -1056,8 +1053,7 @@ int ide_cd_read_toc(ide_drive_t *drive, struct 
request_sense *sense)
   (char *)>hdr,
   sizeof(struct atapi_toc_header) +
   (ntracks + 1) *
-  sizeof(struct atapi_toc_entry),
-  sense);
+  sizeof(struct atapi_toc_entry));
if (stat)
return stat;
 
@@ -1094,7 +1090,7 @@ int ide_cd_read_toc(ide_drive_t *drive, struct 
request_sense *sense)
if (toc->hdr.first_track != CDROM_LEADOUT) {
/* read the multisession information */
stat = cdrom_read_tocentry(drive, 0, 0, 1, (char *)_tmp,
-  sizeof(ms_tmp), sense);
+  sizeof(ms_tmp));
 

[PATCH 0/6] block: Consolidate scsi sense buffer usage

2018-05-22 Thread Kees Cook
This is a follow-up to commit f7068114d45e ("sr: pass down correctly
sized SCSI sense buffer") which further cleans up and removes needless
sense character array buffers and "struct request_sense" usage in favor
of the common "struct scsi_sense_hdr".

First, drop a bunch of unused sense buffers:
 [PATCH 1/6] ide-cd: Drop unused sense buffers
 [PATCH 2/6] scsi: cxlflash: Drop unused sense buffers

Next, split out struct scsi_sense_hdr:
 [PATCH 3/6] block: Create scsi_sense.h for SCSI and ATAPI

Then move all request_sense usage to scsi_sense_hdr:
 [PATCH 4/6] block: Consolidate scsi sense buffer usage

Finally add a build-time check to make sure we don't pass bad buffer sizes:
 [PATCH 5/6] libata-scsi: Move sense buffers onto stack
 [PATCH 6/6] scsi: Check sense buffer size at build time

-Kees



[PATCH 6/6] scsi: Check sense buffer size at build time

2018-05-22 Thread Kees Cook
To avoid introducing problems like those fixed in commit f7068114d45e
("sr: pass down correctly sized SCSI sense buffer"), this creates a macro
wrapper for scsi_execute() that verifies the size of the sense buffer
similar to what was done for command string sizes in commit 3756f6401c30
("exec: avoid gcc-8 warning for get_task_comm").

Another solution could be to add another argument to scsi_execute(),
but this function already takes a lot of arguments and Jens was not fond
of that approach. As there was only a pair of dynamically allocated sense
buffers, this also moves those 96 bytes onto the stack to avoid triggering
the sizeof() check.

Signed-off-by: Kees Cook <keesc...@chromium.org>
---
 drivers/scsi/scsi_lib.c|  6 +++---
 include/scsi/scsi_device.h | 12 +++-
 2 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index e9b4f279d29c..718c2bec4516 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -238,7 +238,7 @@ void scsi_queue_insert(struct scsi_cmnd *cmd, int reason)
 
 
 /**
- * scsi_execute - insert request and wait for the result
+ * __scsi_execute - insert request and wait for the result
  * @sdev:  scsi device
  * @cmd:   scsi command
  * @data_direction: data direction
@@ -255,7 +255,7 @@ void scsi_queue_insert(struct scsi_cmnd *cmd, int reason)
  * Returns the scsi_cmnd result field if a command was executed, or a negative
  * Linux error code if we didn't get that far.
  */
-int scsi_execute(struct scsi_device *sdev, const unsigned char *cmd,
+int __scsi_execute(struct scsi_device *sdev, const unsigned char *cmd,
 int data_direction, void *buffer, unsigned bufflen,
 unsigned char *sense, struct scsi_sense_hdr *sshdr,
 int timeout, int retries, u64 flags, req_flags_t rq_flags,
@@ -309,7 +309,7 @@ int scsi_execute(struct scsi_device *sdev, const unsigned 
char *cmd,
 
return ret;
 }
-EXPORT_SYMBOL(scsi_execute);
+EXPORT_SYMBOL(__scsi_execute);
 
 /*
  * Function:scsi_init_cmd_errh()
diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
index 7ae177c8e399..1bb87b6c0ad2 100644
--- a/include/scsi/scsi_device.h
+++ b/include/scsi/scsi_device.h
@@ -426,11 +426,21 @@ extern const char *scsi_device_state_name(enum 
scsi_device_state);
 extern int scsi_is_sdev_device(const struct device *);
 extern int scsi_is_target_device(const struct device *);
 extern void scsi_sanitize_inquiry_string(unsigned char *s, int len);
-extern int scsi_execute(struct scsi_device *sdev, const unsigned char *cmd,
+extern int __scsi_execute(struct scsi_device *sdev, const unsigned char *cmd,
int data_direction, void *buffer, unsigned bufflen,
unsigned char *sense, struct scsi_sense_hdr *sshdr,
int timeout, int retries, u64 flags,
req_flags_t rq_flags, int *resid);
+/* Make sure any sense buffer is the correct size. */
+#define scsi_execute(sdev, cmd, data_direction, buffer, bufflen, sense,
\
+sshdr, timeout, retries, flags, rq_flags, resid)   \
+({ \
+   BUILD_BUG_ON((sense) != NULL && \
+sizeof(sense) != SCSI_SENSE_BUFFERSIZE);   \
+   __scsi_execute(sdev, cmd, data_direction, buffer, bufflen,  \
+  sense, sshdr, timeout, retries, flags, rq_flags, \
+  resid);  \
+})
 static inline int scsi_execute_req(struct scsi_device *sdev,
const unsigned char *cmd, int data_direction, void *buffer,
unsigned bufflen, struct scsi_sense_hdr *sshdr, int timeout,
-- 
2.17.0



[PATCH 5/6] libata-scsi: Move sense buffers onto stack

2018-05-22 Thread Kees Cook
Instead of dynamically allocating the sense buffers, put them on the
stack so that future compile-time sizeof() checks will be able to see
their buffer length.

Signed-off-by: Kees Cook <keesc...@chromium.org>
---
 drivers/ata/libata-scsi.c | 18 ++
 1 file changed, 6 insertions(+), 12 deletions(-)

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index 89a9d4a2efc8..3a43d3a1ce2d 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -597,8 +597,9 @@ static int ata_get_identity(struct ata_port *ap, struct 
scsi_device *sdev,
 int ata_cmd_ioctl(struct scsi_device *scsidev, void __user *arg)
 {
int rc = 0;
+   u8 sensebuf[SCSI_SENSE_BUFFERSIZE];
u8 scsi_cmd[MAX_COMMAND_SIZE];
-   u8 args[4], *argbuf = NULL, *sensebuf = NULL;
+   u8 args[4], *argbuf = NULL;
int argsize = 0;
enum dma_data_direction data_dir;
struct scsi_sense_hdr sshdr;
@@ -610,10 +611,7 @@ int ata_cmd_ioctl(struct scsi_device *scsidev, void __user 
*arg)
if (copy_from_user(args, arg, sizeof(args)))
return -EFAULT;
 
-   sensebuf = kzalloc(SCSI_SENSE_BUFFERSIZE, GFP_NOIO);
-   if (!sensebuf)
-   return -ENOMEM;
-
+   memset(sensebuf, 0, sizeof(sensebuf));
memset(scsi_cmd, 0, sizeof(scsi_cmd));
 
if (args[3]) {
@@ -685,7 +683,6 @@ int ata_cmd_ioctl(struct scsi_device *scsidev, void __user 
*arg)
 && copy_to_user(arg + sizeof(args), argbuf, argsize))
rc = -EFAULT;
 error:
-   kfree(sensebuf);
kfree(argbuf);
return rc;
 }
@@ -704,8 +701,9 @@ int ata_cmd_ioctl(struct scsi_device *scsidev, void __user 
*arg)
 int ata_task_ioctl(struct scsi_device *scsidev, void __user *arg)
 {
int rc = 0;
+   u8 sensebuf[SCSI_SENSE_BUFFERSIZE];
u8 scsi_cmd[MAX_COMMAND_SIZE];
-   u8 args[7], *sensebuf = NULL;
+   u8 args[7];
struct scsi_sense_hdr sshdr;
int cmd_result;
 
@@ -715,10 +713,7 @@ int ata_task_ioctl(struct scsi_device *scsidev, void 
__user *arg)
if (copy_from_user(args, arg, sizeof(args)))
return -EFAULT;
 
-   sensebuf = kzalloc(SCSI_SENSE_BUFFERSIZE, GFP_NOIO);
-   if (!sensebuf)
-   return -ENOMEM;
-
+   memset(sensebuf, 0, sizeof(sensebuf));
memset(scsi_cmd, 0, sizeof(scsi_cmd));
scsi_cmd[0]  = ATA_16;
scsi_cmd[1]  = (3 << 1); /* Non-data */
@@ -769,7 +764,6 @@ int ata_task_ioctl(struct scsi_device *scsidev, void __user 
*arg)
}
 
  error:
-   kfree(sensebuf);
return rc;
 }
 
-- 
2.17.0



[PATCH 4/6] block: Consolidate scsi sense buffer usage

2018-05-22 Thread Kees Cook
There is a lot of needless struct request_sense usage in the CDROM
code. These can all be struct scsi_sense_hdr instead, to avoid any
confusion over their respective structure sizes. This patch is a lot
of noise changing "sense" to "sshdr", but the final code is more
readable to distinguish between "sense" meaning "struct request_sense"
and "sshdr" meaning "struct scsi_sense_hdr".

Signed-off-by: Kees Cook <keesc...@chromium.org>
---
 drivers/block/pktcdvd.c| 36 ++--
 drivers/cdrom/cdrom.c  | 22 +++---
 drivers/ide/ide-cd.c   | 11 ++-
 drivers/ide/ide-cd.h   |  5 +++--
 drivers/ide/ide-cd_ioctl.c | 30 +++---
 drivers/scsi/sr_ioctl.c| 22 +-
 include/linux/cdrom.h  |  3 ++-
 7 files changed, 64 insertions(+), 65 deletions(-)

diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index c61d20c9f3f8..f91c9f85e29d 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -748,13 +748,13 @@ static const char *sense_key_string(__u8 index)
 static void pkt_dump_sense(struct pktcdvd_device *pd,
   struct packet_command *cgc)
 {
-   struct request_sense *sense = cgc->sense;
+   struct scsi_sense_hdr *sshdr = cgc->sshdr;
 
-   if (sense)
+   if (sshdr)
pkt_err(pd, "%*ph - sense %02x.%02x.%02x (%s)\n",
CDROM_PACKET_SIZE, cgc->cmd,
-   sense->sense_key, sense->asc, sense->ascq,
-   sense_key_string(sense->sense_key));
+   sshdr->sense_key, sshdr->asc, sshdr->ascq,
+   sense_key_string(sshdr->sense_key));
else
pkt_err(pd, "%*ph - no sense\n", CDROM_PACKET_SIZE, cgc->cmd);
 }
@@ -787,11 +787,11 @@ static noinline_for_stack int pkt_set_speed(struct 
pktcdvd_device *pd,
unsigned write_speed, unsigned read_speed)
 {
struct packet_command cgc;
-   struct request_sense sense;
+   struct scsi_sense_hdr sshdr;
int ret;
 
init_cdrom_command(, NULL, 0, CGC_DATA_NONE);
-   cgc.sense = 
+   cgc.sshdr = 
cgc.cmd[0] = GPCMD_SET_SPEED;
cgc.cmd[2] = (read_speed >> 8) & 0xff;
cgc.cmd[3] = read_speed & 0xff;
@@ -1645,7 +1645,7 @@ static noinline_for_stack int pkt_get_last_written(struct 
pktcdvd_device *pd,
 static noinline_for_stack int pkt_set_write_settings(struct pktcdvd_device *pd)
 {
struct packet_command cgc;
-   struct request_sense sense;
+   struct scsi_sense_hdr sshdr;
write_param_page *wp;
char buffer[128];
int ret, size;
@@ -1656,7 +1656,7 @@ static noinline_for_stack int 
pkt_set_write_settings(struct pktcdvd_device *pd)
 
memset(buffer, 0, sizeof(buffer));
init_cdrom_command(, buffer, sizeof(*wp), CGC_DATA_READ);
-   cgc.sense = 
+   cgc.sshdr = 
if ((ret = pkt_mode_sense(pd, , GPMODE_WRITE_PARMS_PAGE, 0))) {
pkt_dump_sense(pd, );
return ret;
@@ -1671,7 +1671,7 @@ static noinline_for_stack int 
pkt_set_write_settings(struct pktcdvd_device *pd)
 * now get it all
 */
init_cdrom_command(, buffer, size, CGC_DATA_READ);
-   cgc.sense = 
+   cgc.sshdr = 
if ((ret = pkt_mode_sense(pd, , GPMODE_WRITE_PARMS_PAGE, 0))) {
pkt_dump_sense(pd, );
return ret;
@@ -1905,12 +1905,12 @@ static noinline_for_stack int pkt_write_caching(struct 
pktcdvd_device *pd,
int set)
 {
struct packet_command cgc;
-   struct request_sense sense;
+   struct scsi_sense_hdr sshdr;
unsigned char buf[64];
int ret;
 
init_cdrom_command(, buf, sizeof(buf), CGC_DATA_READ);
-   cgc.sense = 
+   cgc.sshdr = 
cgc.buflen = pd->mode_offset + 12;
 
/*
@@ -1950,14 +1950,14 @@ static noinline_for_stack int pkt_get_max_speed(struct 
pktcdvd_device *pd,
unsigned *write_speed)
 {
struct packet_command cgc;
-   struct request_sense sense;
+   struct scsi_sense_hdr sshdr;
unsigned char buf[256+18];
unsigned char *cap_buf;
int ret, offset;
 
cap_buf = [sizeof(struct mode_page_header) + pd->mode_offset];
init_cdrom_command(, buf, sizeof(buf), CGC_DATA_UNKNOWN);
-   cgc.sense = 
+   cgc.sshdr = 
 
ret = pkt_mode_sense(pd, , GPMODE_CAPABILITIES_PAGE, 0);
if (ret) {
@@ -2011,13 +2011,13 @@ static noinline_for_stack int pkt_media_speed(struct 
pktcdvd_device *pd,
unsigned *speed)
 {
struct packet_command cgc;
-   struct request_sense sense;
+   struct 

Re: usercopy whitelist woe in scsi_sense_cache

2018-04-20 Thread Kees Cook
On Thu, Apr 19, 2018 at 2:32 AM, Paolo Valente <paolo.vale...@linaro.org> wrote:
> I'm missing something here.  When the request gets completed in the
> first place, the hook bfq_finish_requeue_request gets called, and that
> hook clears both ->elv.priv elements (as the request has a non-null
> elv.icq).  So, when bfq gets the same request again, those elements
> must be NULL.  What am I getting wrong?
>
> I have some more concern on this point, but I'll stick to this for the
> moment, to not create more confusion.

I don't know the "how", I only found the "what". :) If you want, grab
the reproducer VM linked to earlier in this thread; it'll hit the
problem within about 30 seconds of running the reproducer.

-Kees

-- 
Kees Cook
Pixel Security


Re: usercopy whitelist woe in scsi_sense_cache

2018-04-17 Thread Kees Cook
On Tue, Apr 17, 2018 at 3:57 PM, Jens Axboe <ax...@kernel.dk> wrote:
> On 4/17/18 3:48 PM, Jens Axboe wrote:
>> On 4/17/18 3:47 PM, Kees Cook wrote:
>>> On Tue, Apr 17, 2018 at 2:39 PM, Jens Axboe <ax...@kernel.dk> wrote:
>>>> On 4/17/18 3:25 PM, Kees Cook wrote:
>>>>> On Tue, Apr 17, 2018 at 1:46 PM, Kees Cook <keesc...@chromium.org> wrote:
>>>>>> I see elv.priv[1] assignments made in a few places -- is it possible
>>>>>> there is some kind of uninitialized-but-not-NULL state that can leak
>>>>>> in there?
>>>>>
>>>>> Got it. This fixes it for me:
>>>>>
>>>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>>>> index 0dc9e341c2a7..859df3160303 100644
>>>>> --- a/block/blk-mq.c
>>>>> +++ b/block/blk-mq.c
>>>>> @@ -363,7 +363,7 @@ static struct request *blk_mq_get_request(struct
>>>>> request_queue *q,
>>>>>
>>>>> rq = blk_mq_rq_ctx_init(data, tag, op);
>>>>> if (!op_is_flush(op)) {
>>>>> -   rq->elv.icq = NULL;
>>>>> +   memset(>elv, 0, sizeof(rq->elv));
>>>>> if (e && e->type->ops.mq.prepare_request) {
>>>>> if (e->type->icq_cache && rq_ioc(bio))
>>>>> blk_mq_sched_assign_ioc(rq, bio);
>>>>> @@ -461,7 +461,7 @@ void blk_mq_free_request(struct request *rq)
>>>>> e->type->ops.mq.finish_request(rq);
>>>>> if (rq->elv.icq) {
>>>>> put_io_context(rq->elv.icq->ioc);
>>>>> -   rq->elv.icq = NULL;
>>>>> +   memset(>elv, 0, sizeof(rq->elv));
>>>>> }
>>>>> }
>>>>
>>>> This looks like a BFQ problem, this should not be necessary. Paolo,
>>>> you're calling your own prepare request handler from the insert
>>>> as well, and your prepare request does nothing if rq->elv.icq == NULL.
>>>
>>> I sent the patch anyway, since it's kind of a robustness improvement,
>>> I'd hope. If you fix BFQ also, please add:
>>
>> It's also a memset() in the hot path, would prefer to avoid that...
>> The issue here is really the convoluted bfq usage of insert/prepare,
>> I'm sure Paolo can take it from here.
>
> Does this fix it?
>
> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
> index f0ecd98509d8..d883469a1582 100644
> --- a/block/bfq-iosched.c
> +++ b/block/bfq-iosched.c
> @@ -4934,8 +4934,11 @@ static void bfq_prepare_request(struct request *rq, 
> struct bio *bio)
> bool new_queue = false;
> bool bfqq_already_existing = false, split = false;
>
> -   if (!rq->elv.icq)
> +   if (!rq->elv.icq) {
> +   rq->elv.priv[0] = rq->elv.priv[1] = NULL;
> return;
> +   }
> +
> bic = icq_to_bic(rq->elv.icq);
>
> spin_lock_irq(>lock);

It does! Excellent. :)

-Kees

-- 
Kees Cook
Pixel Security


Re: [PATCH] blk-mq: Clear out elevator private data

2018-04-17 Thread Kees Cook
On Tue, Apr 17, 2018 at 2:45 PM, Jens Axboe <ax...@kernel.dk> wrote:
> On 4/17/18 3:42 PM, Kees Cook wrote:
>> Some elevators may not correctly check rq->rq_flags & RQF_ELVPRIV, and
>> may attempt to read rq->elv fields. When requests got reused, this
>> caused BFQ to think it already had a bfqq (rq->elv.priv[1]) allocated.
>> This could lead to odd behaviors like having the sense buffer address
>> slowly start incrementing. This eventually tripped HARDENED_USERCOPY
>> and KASAN.
>>
>> This patch wipes all of rq->elv instead of just rq->elv.icq. While
>> it shouldn't technically be needed, this ends up being a robustness
>> improvement that should lead to at least finding bugs in elevators faster.
>
> Comments from the other email still apply, we should not need to do this
> full memset() for every request. From a quick look, BFQ needs to straighten
> out its usage of prepare request and interactions with insert_request.

Sure, understood. I would point out, FWIW, that memset() gets unrolled
by the compiler and this is just two more XORs in the same cacheline
(the two words following icq). (And there is SO much more being
cleared during alloc, it didn't seem like hardly any extra cost vs the
robustness it provided.)

-Kees

-- 
Kees Cook
Pixel Security


Re: usercopy whitelist woe in scsi_sense_cache

2018-04-17 Thread Kees Cook
On Tue, Apr 17, 2018 at 2:39 PM, Jens Axboe <ax...@kernel.dk> wrote:
> On 4/17/18 3:25 PM, Kees Cook wrote:
>> On Tue, Apr 17, 2018 at 1:46 PM, Kees Cook <keesc...@chromium.org> wrote:
>>> I see elv.priv[1] assignments made in a few places -- is it possible
>>> there is some kind of uninitialized-but-not-NULL state that can leak
>>> in there?
>>
>> Got it. This fixes it for me:
>>
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index 0dc9e341c2a7..859df3160303 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -363,7 +363,7 @@ static struct request *blk_mq_get_request(struct
>> request_queue *q,
>>
>> rq = blk_mq_rq_ctx_init(data, tag, op);
>> if (!op_is_flush(op)) {
>> -   rq->elv.icq = NULL;
>> +   memset(>elv, 0, sizeof(rq->elv));
>> if (e && e->type->ops.mq.prepare_request) {
>> if (e->type->icq_cache && rq_ioc(bio))
>> blk_mq_sched_assign_ioc(rq, bio);
>> @@ -461,7 +461,7 @@ void blk_mq_free_request(struct request *rq)
>> e->type->ops.mq.finish_request(rq);
>> if (rq->elv.icq) {
>> put_io_context(rq->elv.icq->ioc);
>> -   rq->elv.icq = NULL;
>> +   memset(>elv, 0, sizeof(rq->elv));
>> }
>> }
>
> This looks like a BFQ problem, this should not be necessary. Paolo,
> you're calling your own prepare request handler from the insert
> as well, and your prepare request does nothing if rq->elv.icq == NULL.

I sent the patch anyway, since it's kind of a robustness improvement,
I'd hope. If you fix BFQ also, please add:

Reported-by: Oleksandr Natalenko <oleksa...@natalenko.name>
Root-caused-by: Kees Cook <keesc...@chromium.org>

:) I gotta task-switch to other things!

Thanks for the pointers, and thank you Oleksandr for providing the reproducer!

-Kees

-- 
Kees Cook
Pixel Security


[PATCH] blk-mq: Clear out elevator private data

2018-04-17 Thread Kees Cook
Some elevators may not correctly check rq->rq_flags & RQF_ELVPRIV, and
may attempt to read rq->elv fields. When requests got reused, this
caused BFQ to think it already had a bfqq (rq->elv.priv[1]) allocated.
This could lead to odd behaviors like having the sense buffer address
slowly start incrementing. This eventually tripped HARDENED_USERCOPY
and KASAN.

This patch wipes all of rq->elv instead of just rq->elv.icq. While
it shouldn't technically be needed, this ends up being a robustness
improvement that should lead to at least finding bugs in elevators faster.

Reported-by: Oleksandr Natalenko <oleksa...@natalenko.name>
Fixes: bd166ef183c26 ("blk-mq-sched: add framework for MQ capable IO 
schedulers")
Cc: sta...@vger.kernel.org
Signed-off-by: Kees Cook <keesc...@chromium.org>
---
In theory, BFQ needs to also check the RQF_ELVPRIV flag, but I'll leave that
to Paolo to figure out. Also, my Fixes line is kind of a best-guess. This
is where icq was originally wiped, so it seemed as good a commit as any.
---
 block/blk-mq.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 0dc9e341c2a7..859df3160303 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -363,7 +363,7 @@ static struct request *blk_mq_get_request(struct 
request_queue *q,
 
rq = blk_mq_rq_ctx_init(data, tag, op);
if (!op_is_flush(op)) {
-   rq->elv.icq = NULL;
+   memset(>elv, 0, sizeof(rq->elv));
if (e && e->type->ops.mq.prepare_request) {
if (e->type->icq_cache && rq_ioc(bio))
blk_mq_sched_assign_ioc(rq, bio);
@@ -461,7 +461,7 @@ void blk_mq_free_request(struct request *rq)
e->type->ops.mq.finish_request(rq);
if (rq->elv.icq) {
put_io_context(rq->elv.icq->ioc);
-   rq->elv.icq = NULL;
+       memset(>elv, 0, sizeof(rq->elv));
}
}
 
-- 
2.7.4


-- 
Kees Cook
Pixel Security


Re: usercopy whitelist woe in scsi_sense_cache

2018-04-17 Thread Kees Cook
On Tue, Apr 17, 2018 at 1:46 PM, Kees Cook <keesc...@chromium.org> wrote:
> I see elv.priv[1] assignments made in a few places -- is it possible
> there is some kind of uninitialized-but-not-NULL state that can leak
> in there?

Got it. This fixes it for me:

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 0dc9e341c2a7..859df3160303 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -363,7 +363,7 @@ static struct request *blk_mq_get_request(struct
request_queue *q,

rq = blk_mq_rq_ctx_init(data, tag, op);
if (!op_is_flush(op)) {
-   rq->elv.icq = NULL;
+   memset(>elv, 0, sizeof(rq->elv));
if (e && e->type->ops.mq.prepare_request) {
if (e->type->icq_cache && rq_ioc(bio))
blk_mq_sched_assign_ioc(rq, bio);
@@ -461,7 +461,7 @@ void blk_mq_free_request(struct request *rq)
e->type->ops.mq.finish_request(rq);
if (rq->elv.icq) {
put_io_context(rq->elv.icq->ioc);
-   rq->elv.icq = NULL;
+   memset(>elv, 0, sizeof(rq->elv));
}
}



-- 
Kees Cook
Pixel Security


Re: usercopy whitelist woe in scsi_sense_cache

2018-04-17 Thread Kees Cook
On Tue, Apr 17, 2018 at 1:28 PM, Jens Axboe <ax...@kernel.dk> wrote:
> It has to be the latter bfqq->dispatched increment, as those are
> transient (and bfqd is not).

Yeah, and I see a lot of comments around the lifetime of rq and bfqq,
so I assume something is not being locked correctly.

#define RQ_BFQQ(rq) ((rq)->elv.priv[1])

static struct request *__bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
{
struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
struct request *rq = NULL;
struct bfq_queue *bfqq = NULL;

if (!list_empty(>dispatch)) {
rq = list_first_entry(>dispatch, struct request,
  queuelist);
list_del_init(>queuelist);

bfqq = RQ_BFQQ(rq);

if (bfqq) {
/*
 * Increment counters here, because this
 * dispatch does not follow the standard
 * dispatch flow (where counters are
 * incremented)
 */
bfqq->dispatched++;
...

I see elv.priv[1] assignments made in a few places -- is it possible
there is some kind of uninitialized-but-not-NULL state that can leak
in there?

bfq_prepare_request() assigns elv.priv[1], and bfq_insert_request()
only checks that it's non-NULL (if at all) in one case. Can
bfq_insert_request() get called without bfq_prepare_request() being
called first?

-Kees

-- 
Kees Cook
Pixel Security


Re: usercopy whitelist woe in scsi_sense_cache

2018-04-17 Thread Kees Cook
On Tue, Apr 17, 2018 at 1:20 PM, Kees Cook <keesc...@chromium.org> wrote:
> On Tue, Apr 17, 2018 at 1:03 PM, Kees Cook <keesc...@chromium.org> wrote:
>> The above bfq_dispatch_request+0x99/0xad0 is still
>> __bfq_dispatch_request at block/bfq-iosched.c:3902, just with KASAN
>> removed. 0x99 is 153 decimal:
>>
>> (gdb) disass bfq_dispatch_request
>> Dump of assembler code for function bfq_dispatch_request:
>> ...
>>0x8134b2ad <+141>:   test   %rax,%rax
>>0x8134b2b0 <+144>:   je 0x8134b2bd
>> <bfq_dispatch_request+157>
>>0x8134b2b2 <+146>:   addl   $0x1,0x100(%rax)
>>0x8134b2b9 <+153>:   addl   $0x1,0x3c(%rbx)
>>0x8134b2bd <+157>:   orl$0x2,0x18(%r12)
>>0x8134b2c3 <+163>:   test   %ebp,%ebp
>>0x8134b2c5 <+165>:   je 0x8134b2ce
>> <bfq_dispatch_request+174>
>>0x8134b2c7 <+167>:   mov0x108(%r14),%rax
>>0x8134b2ce <+174>:   mov%r15,%rdi
>>0x8134b2d1 <+177>:   callq  0x81706f90 
>> <_raw_spin_unlock_irq>
>>
>> Just as a sanity-check, at +157 %r12 should be rq, rq_flags is 0x18
>> offset from, $0x2 is RQF_STARTED, so that maps to "rq->rq_flags |=
>> RQF_STARTED", the next C statement. I don't know what +146 is, though?
>> An increment of something 256 bytes offset? There's a lot of inline
>> fun and reordering happening here, so I'm ignoring that for the
>> moment.
>
> No -- I'm reading this wrong. The RIP is the IP _after_ the trap, so
> +146 is the offender.
>
> [   29.284746] watchpoint @ 95d41a0fe580 triggered
> [   29.285349] sense before:95d41f45f700 after:95d41f45f701 
> (@95d41a
> 0fe580)
> [   29.286176] elevator before:95d419419c00 after:95d419419c00
> [   29.286847] elevator_data before:95d419418c00 after:95d419418c00
> ...
> [   29.295069] RIP: 0010:bfq_dispatch_request+0x99/0xbb0
> [   29.295622] RSP: 0018:b26e01707a40 EFLAGS: 0002
> [   29.296181] RAX: 95d41a0fe480 RBX: 95d419418c00 RCX: 
> 95d419418c08
>
> RAX is 95d41a0fe480 and sense is stored at 95d41a0fe580,
> exactly 0x100 away.
>
> WTF is this addl?

What are the chances? :P Two ++ statements in a row separate by a
collapsed goto. FML. :)

...
bfqq->dispatched++;
goto inc_in_driver_start_rq;
...
inc_in_driver_start_rq:
bfqd->rq_in_driver++;
...

And there's the 0x100 (256):

struct bfq_queue {
...
intdispatched;   /*   256 4 */

So bfqq is corrupted somewhere... I'll keep digging. I hope you're all
enjoying my live debugging transcript. ;)

-Kees

-- 
Kees Cook
Pixel Security


Re: usercopy whitelist woe in scsi_sense_cache

2018-04-17 Thread Kees Cook
On Tue, Apr 17, 2018 at 1:03 PM, Kees Cook <keesc...@chromium.org> wrote:
> The above bfq_dispatch_request+0x99/0xad0 is still
> __bfq_dispatch_request at block/bfq-iosched.c:3902, just with KASAN
> removed. 0x99 is 153 decimal:
>
> (gdb) disass bfq_dispatch_request
> Dump of assembler code for function bfq_dispatch_request:
> ...
>0x8134b2ad <+141>:   test   %rax,%rax
>0x8134b2b0 <+144>:   je 0x8134b2bd
> <bfq_dispatch_request+157>
>0x8134b2b2 <+146>:   addl   $0x1,0x100(%rax)
>0x8134b2b9 <+153>:   addl   $0x1,0x3c(%rbx)
>0x8134b2bd <+157>:   orl$0x2,0x18(%r12)
>0x8134b2c3 <+163>:   test   %ebp,%ebp
>0x8134b2c5 <+165>:   je 0x8134b2ce
> <bfq_dispatch_request+174>
>0x8134b2c7 <+167>:   mov0x108(%r14),%rax
>0x8134b2ce <+174>:   mov%r15,%rdi
>0x8134b2d1 <+177>:   callq  0x81706f90 
> <_raw_spin_unlock_irq>
>
> Just as a sanity-check, at +157 %r12 should be rq, rq_flags is 0x18
> offset from, $0x2 is RQF_STARTED, so that maps to "rq->rq_flags |=
> RQF_STARTED", the next C statement. I don't know what +146 is, though?
> An increment of something 256 bytes offset? There's a lot of inline
> fun and reordering happening here, so I'm ignoring that for the
> moment.

No -- I'm reading this wrong. The RIP is the IP _after_ the trap, so
+146 is the offender.

[   29.284746] watchpoint @ 95d41a0fe580 triggered
[   29.285349] sense before:95d41f45f700 after:95d41f45f701 (@95d41a
0fe580)
[   29.286176] elevator before:95d419419c00 after:95d419419c00
[   29.286847] elevator_data before:95d419418c00 after:95d419418c00
...
[   29.295069] RIP: 0010:bfq_dispatch_request+0x99/0xbb0
[   29.295622] RSP: 0018:b26e01707a40 EFLAGS: 0002
[   29.296181] RAX: 95d41a0fe480 RBX: 95d419418c00 RCX: 95d419418c08

RAX is 95d41a0fe480 and sense is stored at 95d41a0fe580,
exactly 0x100 away.

WTF is this addl?

-Kees

-- 
Kees Cook
Pixel Security


Re: usercopy whitelist woe in scsi_sense_cache

2018-04-17 Thread Kees Cook
On Mon, Apr 16, 2018 at 8:12 PM, Kees Cook <keesc...@chromium.org> wrote:
> With a hardware watchpoint, I've isolated the corruption to here:
>
> bfq_dispatch_request+0x2be/0x1610:
> __bfq_dispatch_request at block/bfq-iosched.c:3902
> 3900if (rq) {
> 3901inc_in_driver_start_rq:
> 3902bfqd->rq_in_driver++;
> 3903start_rq:
> 3904rq->rq_flags |= RQF_STARTED;
> 3905}

This state continues to appear to be "impossible".

[   68.845979] watchpoint triggered
[   68.846462] sense before:8b8f9f6aae00 after:8b8f9f6aae01
[   68.847196] elevator before:8b8f9a2c2000 after:8b8f9a2c2000
[   68.847905] elevator_data before:8b8f9a2c0400 after:8b8f9a2c0400
...
[   68.856925] RIP: 0010:bfq_dispatch_request+0x99/0xad0
[   68.857553] RSP: 0018:900280c63a58 EFLAGS: 0082
[   68.858253] RAX: 8b8f9aefbe80 RBX: 8b8f9a2c0400 RCX: 8b8f9a2c0408
[   68.859201] RDX: 8b8f9a2c0408 RSI: 900280c63b34 RDI: 0001
[   68.860147] RBP:  R08: 000f0204 R09: 
[   68.861122] R10: 900280c63af0 R11: 0040 R12: 8b8f9aefbe00
[   68.862089] R13: 8b8f9a221950 R14:  R15: 8b8f9a2c0770

Here we can see that sense buffer has, as we've seen, been
incremented. However, the "before" values for elevator and
elevator_data still match their expected values. As such, this should
be "impossible", since:

static struct request *__bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
{
struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
...
if (rq) {
inc_in_driver_start_rq:
bfqd->rq_in_driver++;
start_rq:
rq->rq_flags |= RQF_STARTED;
}
exit:
return rq;
}

For rq_in_driver++ to touch sense, bfqd must be equal to scsi_request
- 12 bytes (rq_in_driver is 60 byte offset from struct bfq_data, and
sense is 48 bytes offset from struct scsi_request).

The above bfq_dispatch_request+0x99/0xad0 is still
__bfq_dispatch_request at block/bfq-iosched.c:3902, just with KASAN
removed. 0x99 is 153 decimal:

(gdb) disass bfq_dispatch_request
Dump of assembler code for function bfq_dispatch_request:
...
   0x8134b2ad <+141>:   test   %rax,%rax
   0x8134b2b0 <+144>:   je 0x8134b2bd
<bfq_dispatch_request+157>
   0x8134b2b2 <+146>:   addl   $0x1,0x100(%rax)
   0x8134b2b9 <+153>:   addl   $0x1,0x3c(%rbx)
   0x8134b2bd <+157>:   orl$0x2,0x18(%r12)
   0x8134b2c3 <+163>:   test   %ebp,%ebp
   0x8134b2c5 <+165>:   je 0x8134b2ce
<bfq_dispatch_request+174>
   0x8134b2c7 <+167>:   mov0x108(%r14),%rax
   0x8134b2ce <+174>:   mov%r15,%rdi
   0x8134b2d1 <+177>:   callq  0x81706f90 <_raw_spin_unlock_irq>

Just as a sanity-check, at +157 %r12 should be rq, rq_flags is 0x18
offset from, $0x2 is RQF_STARTED, so that maps to "rq->rq_flags |=
RQF_STARTED", the next C statement. I don't know what +146 is, though?
An increment of something 256 bytes offset? There's a lot of inline
fun and reordering happening here, so I'm ignoring that for the
moment.

So at +153 %rbx should be bfqd (i.e. elevator_data), but this is the
tripped instruction. The watchpoint dump shows RBX as 8b8f9a2c0400
... which matches elevator_data.

So, what can this be? Some sort of cache issue? By the time the
watchpoint handler captures the register information, it's already
masked the problem? I'm stumped again. :(

-Kees

-- 
Kees Cook
Pixel Security


Re: usercopy whitelist woe in scsi_sense_cache

2018-04-17 Thread Kees Cook
On Mon, Apr 16, 2018 at 8:12 PM, Kees Cook <keesc...@chromium.org> wrote:
> With a hardware watchpoint, I've isolated the corruption to here:
>
> bfq_dispatch_request+0x2be/0x1610:
> __bfq_dispatch_request at block/bfq-iosched.c:3902
> 3900if (rq) {
> 3901inc_in_driver_start_rq:
> 3902bfqd->rq_in_driver++;
> 3903start_rq:
> 3904rq->rq_flags |= RQF_STARTED;
> 3905}

FWIW, the stacktrace here (removing the ? lines) is:

[   34.311980] RIP: 0010:bfq_dispatch_request+0x2be/0x1610
[   34.452491]  blk_mq_do_dispatch_sched+0x1d9/0x260
[   34.454561]  blk_mq_sched_dispatch_requests+0x3da/0x4b0
[   34.458789]  __blk_mq_run_hw_queue+0xae/0x130
[   34.460001]  __blk_mq_delay_run_hw_queue+0x192/0x280
[   34.460823]  blk_mq_run_hw_queue+0x10b/0x1b0
[   34.463240]  blk_mq_sched_insert_request+0x3bd/0x4d0
[   34.467342]  blk_execute_rq+0xcf/0x140
[   34.468483]  sg_io+0x2f7/0x730

Can anyone tell me more about the memory allocation layout of the
various variables here? It looks like struct request is a header in
front of struct scsi_request? How do struct elevator_queue, struct
blk_mq_ctx, and struct blk_mq_hw_ctx overlap these?

Regardless, I'll check for elevator data changing too...

-Kees

-- 
Kees Cook
Pixel Security


Re: usercopy whitelist woe in scsi_sense_cache

2018-04-17 Thread Kees Cook
On Tue, Apr 17, 2018 at 3:02 AM, James Bottomley
<j...@linux.vnet.ibm.com> wrote:
> On Mon, 2018-04-16 at 20:12 -0700, Kees Cook wrote:
>> I still haven't figured this out, though... any have a moment to look
>> at this?
>
> Just to let you know you're not alone ... but I can't make any sense of
> this either.  The bfdq is the elevator_data, which is initialised when
> the scheduler is attached, so it shouldn't change.  Is it possible to
> set a data break point on elevator_data after it's initialised and see
> if it got changed by something?

Yeah, it seems like some pointer chain is getting overwritten outside
of a lock or rcu or ?. I don't know this code well enough to guess at
where to check, though. What I find so strange is that the structure
offsets are different between bfpd's rq_in_driver field and
scsi_request's sense field, so even THAT doesn't look to be clear-cut
either:

struct bfq_data {
struct request_queue * queue;/* 0 8 */
struct list_head   dispatch; /* 816 */
struct bfq_group * root_group;   /*24 8 */
struct rb_root queue_weights_tree;   /*32 8 */
struct rb_root group_weights_tree;   /*40 8 */
intbusy_queues;  /*48 4 */
intwr_busy_queues;   /*52 4 */
intqueued;   /*56 4 */
intrq_in_driver; /*60 4 */
...

struct scsi_request {
unsigned char  __cmd[16];/* 016 */
unsigned char *cmd;  /*16 8 */
short unsigned int cmd_len;  /*24 2 */

/* XXX 2 bytes hole, try to pack */

intresult;   /*28 4 */
unsigned int   sense_len;/*32 4 */
unsigned int   resid_len;/*36 4 */
intretries;  /*40 4 */

/* XXX 4 bytes hole, try to pack */

void * sense;/*48 8 */
...

This is _so_ weird. :P

-Kees

-- 
Kees Cook
Pixel Security


Re: usercopy whitelist woe in scsi_sense_cache

2018-04-17 Thread Kees Cook
On Tue, Apr 17, 2018 at 2:19 AM, Oleksandr Natalenko
<oleksa...@natalenko.name> wrote:
> By any chance, have you tried to simplify the reproducer environment, or it
> still needs my complex layout to trigger things even with KASAN?

I haven't tried minimizing the reproducer yet, no. Now that I have a
specific place to watch in the kernel for the corruption, though, that
might help. If I get stuck again today, I'll try it.

-Kees

-- 
Kees Cook
Pixel Security


Re: usercopy whitelist woe in scsi_sense_cache

2018-04-16 Thread Kees Cook
On Mon, Apr 16, 2018 at 1:44 PM, Kees Cook <keesc...@chromium.org> wrote:
> On Thu, Apr 12, 2018 at 8:02 PM, Kees Cook <keesc...@chromium.org> wrote:
>> On Thu, Apr 12, 2018 at 3:47 PM, Kees Cook <keesc...@chromium.org> wrote:
>>> After fixing up some build issues in the middle of the 4.16 cycle, I
>>> get an unhelpful bisect result of commit 0a4b6e2f80aa ("Merge branch
>>> 'for-4.16/block'"). Instead of letting the test run longer, I'm going
>>> to switch to doing several shorter test boots per kernel and see if
>>> that helps. One more bisect coming...
>>
>> Okay, so I can confirm the bisect points at the _merge_ itself, not a
>> specific patch. I'm not sure how to proceed here. It looks like some
>> kind of interaction between separate trees? Jens, do you have
>> suggestions on how to track this down?
>
> Turning off HARDENED_USERCOPY and turning on KASAN, I see the same report:
>
> [   38.274106] BUG: KASAN: slab-out-of-bounds in _copy_to_user+0x42/0x60
> [   38.274841] Read of size 22 at addr 8800122b8c4b by task smartctl/1064
> [   38.275630]
> [   38.275818] CPU: 2 PID: 1064 Comm: smartctl Not tainted 4.17.0-rc1-ARCH+ 
> #266
> [   38.276631] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
> BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
> [   38.277690] Call Trace:
> [   38.277988]  dump_stack+0x71/0xab
> [   38.278397]  ? _copy_to_user+0x42/0x60
> [   38.278833]  print_address_description+0x6a/0x270
> [   38.279368]  ? _copy_to_user+0x42/0x60
> [   38.279800]  kasan_report+0x243/0x360
> [   38.280221]  _copy_to_user+0x42/0x60
> [   38.280635]  sg_io+0x459/0x660
> ...
>
> Though we get slightly more details (some we already knew):
>
> [   38.301330] Allocated by task 329:
> [   38.301734]  kmem_cache_alloc_node+0xca/0x220
> [   38.302239]  scsi_mq_init_request+0x64/0x130 [scsi_mod]
> [   38.302821]  blk_mq_alloc_rqs+0x2cf/0x370
> [   38.303265]  blk_mq_sched_alloc_tags.isra.4+0x7d/0xb0
> [   38.303820]  blk_mq_init_sched+0xf0/0x220
> [   38.304268]  elevator_switch+0x17a/0x2c0
> [   38.304705]  elv_iosched_store+0x173/0x220
> [   38.305171]  queue_attr_store+0x72/0xb0
> [   38.305602]  kernfs_fop_write+0x188/0x220
> [   38.306049]  __vfs_write+0xb6/0x330
> [   38.306436]  vfs_write+0xe9/0x240
> [   38.306804]  ksys_write+0x98/0x110
> [   38.307181]  do_syscall_64+0x6d/0x1d0
> [   38.307590]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [   38.308142]
> [   38.308316] Freed by task 0:
> [   38.308652] (stack is not available)
> [   38.309060]
> [   38.309243] The buggy address belongs to the object at 8800122b8c00
> [   38.309243]  which belongs to the cache scsi_sense_cache of size 96
> [   38.310625] The buggy address is located 75 bytes inside of
> [   38.310625]  96-byte region [8800122b8c00, 8800122b8c60)

With a hardware watchpoint, I've isolated the corruption to here:

bfq_dispatch_request+0x2be/0x1610:
__bfq_dispatch_request at block/bfq-iosched.c:3902
3900if (rq) {
3901inc_in_driver_start_rq:
3902bfqd->rq_in_driver++;
3903start_rq:
3904rq->rq_flags |= RQF_STARTED;
3905}

Through some race condition(?), rq_in_driver is also sense_buffer, and
it can get incremented. Specifically, I am doing:

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 25c14c58385c..f50d5053d732 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -6,6 +6,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 

 #include 

@@ -428,6 +430,18 @@ void blk_mq_sched_restart(struct blk_mq_hw_ctx *const hctx)
}
 }

+static void sample_hbp_handler(struct perf_event *bp,
+   struct perf_sample_data *data,
+   struct pt_regs *regs)
+{
+printk(KERN_INFO "sense_buffer value is changed\n");
+dump_stack();
+printk(KERN_INFO "Dump stack from sample_hbp_handler\n");
+}
+
+struct perf_event * __percpu *sample_hbp;
+int ok_to_go;
+
 void blk_mq_sched_insert_request(struct request *rq, bool at_head,
 bool run_queue, bool async)
 {
@@ -435,6 +449,16 @@ void blk_mq_sched_insert_request(struct request
*rq, bool at_head,
struct elevator_queue *e = q->elevator;
struct blk_mq_ctx *ctx = rq->mq_ctx;
struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
+   struct perf_event_attr attr;
+   struct scsi_request *req = scsi_req(rq);
+
+   if (ok_to_go) {
+   hw_breakpoint_init();
+   attr.bp_addr = (uintptr_t)&(req->sense);
+   attr.bp_len = HW_BREAKPOINT_LEN_8;
+   attr.bp_type = HW_BREAKPOINT_W;
+   sample_hbp = register_wide_hw_bre

Re: usercopy whitelist woe in scsi_sense_cache

2018-04-16 Thread Kees Cook
On Thu, Apr 12, 2018 at 8:02 PM, Kees Cook <keesc...@chromium.org> wrote:
> On Thu, Apr 12, 2018 at 3:47 PM, Kees Cook <keesc...@chromium.org> wrote:
>> After fixing up some build issues in the middle of the 4.16 cycle, I
>> get an unhelpful bisect result of commit 0a4b6e2f80aa ("Merge branch
>> 'for-4.16/block'"). Instead of letting the test run longer, I'm going
>> to switch to doing several shorter test boots per kernel and see if
>> that helps. One more bisect coming...
>
> Okay, so I can confirm the bisect points at the _merge_ itself, not a
> specific patch. I'm not sure how to proceed here. It looks like some
> kind of interaction between separate trees? Jens, do you have
> suggestions on how to track this down?

Turning off HARDENED_USERCOPY and turning on KASAN, I see the same report:

[   38.274106] BUG: KASAN: slab-out-of-bounds in _copy_to_user+0x42/0x60
[   38.274841] Read of size 22 at addr 8800122b8c4b by task smartctl/1064
[   38.275630]
[   38.275818] CPU: 2 PID: 1064 Comm: smartctl Not tainted 4.17.0-rc1-ARCH+ #266
[   38.276631] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[   38.277690] Call Trace:
[   38.277988]  dump_stack+0x71/0xab
[   38.278397]  ? _copy_to_user+0x42/0x60
[   38.278833]  print_address_description+0x6a/0x270
[   38.279368]  ? _copy_to_user+0x42/0x60
[   38.279800]  kasan_report+0x243/0x360
[   38.280221]  _copy_to_user+0x42/0x60
[   38.280635]  sg_io+0x459/0x660
...

Though we get slightly more details (some we already knew):

[   38.301330] Allocated by task 329:
[   38.301734]  kmem_cache_alloc_node+0xca/0x220
[   38.302239]  scsi_mq_init_request+0x64/0x130 [scsi_mod]
[   38.302821]  blk_mq_alloc_rqs+0x2cf/0x370
[   38.303265]  blk_mq_sched_alloc_tags.isra.4+0x7d/0xb0
[   38.303820]  blk_mq_init_sched+0xf0/0x220
[   38.304268]  elevator_switch+0x17a/0x2c0
[   38.304705]  elv_iosched_store+0x173/0x220
[   38.305171]  queue_attr_store+0x72/0xb0
[   38.305602]  kernfs_fop_write+0x188/0x220
[   38.306049]  __vfs_write+0xb6/0x330
[   38.306436]  vfs_write+0xe9/0x240
[   38.306804]  ksys_write+0x98/0x110
[   38.307181]  do_syscall_64+0x6d/0x1d0
[   38.307590]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   38.308142]
[   38.308316] Freed by task 0:
[   38.308652] (stack is not available)
[   38.309060]
[   38.309243] The buggy address belongs to the object at 8800122b8c00
[   38.309243]  which belongs to the cache scsi_sense_cache of size 96
[   38.310625] The buggy address is located 75 bytes inside of
[   38.310625]  96-byte region [8800122b8c00, 8800122b8c60)


-Kees

-- 
Kees Cook
Pixel Security


Re: usercopy whitelist woe in scsi_sense_cache

2018-04-12 Thread Kees Cook
On Thu, Apr 12, 2018 at 3:47 PM, Kees Cook <keesc...@chromium.org> wrote:
> After fixing up some build issues in the middle of the 4.16 cycle, I
> get an unhelpful bisect result of commit 0a4b6e2f80aa ("Merge branch
> 'for-4.16/block'"). Instead of letting the test run longer, I'm going
> to switch to doing several shorter test boots per kernel and see if
> that helps. One more bisect coming...

Okay, so I can confirm the bisect points at the _merge_ itself, not a
specific patch. I'm not sure how to proceed here. It looks like some
kind of interaction between separate trees? Jens, do you have
suggestions on how to track this down?

-Kees

-- 
Kees Cook
Pixel Security


Re: usercopy whitelist woe in scsi_sense_cache

2018-04-12 Thread Kees Cook
On Thu, Apr 12, 2018 at 3:01 PM, Kees Cook <keesc...@chromium.org> wrote:
> On Thu, Apr 12, 2018 at 12:04 PM, Oleksandr Natalenko
> <oleksa...@natalenko.name> wrote:
>> Hi.
>>
>> On čtvrtek 12. dubna 2018 20:44:37 CEST Kees Cook wrote:
>>> My first bisect attempt gave me commit 5448aca41cd5 ("null_blk: wire
>>> up timeouts"), which seems insane given that null_blk isn't even built
>>> in the .config. I managed to get the testing automated now for a "git
>>> bisect run ...", so I'm restarting again to hopefully get a better
>>> answer. Normally the Oops happens in the first minute of running. I've
>>> set the timeout to 10 minutes for a "good" run...

After fixing up some build issues in the middle of the 4.16 cycle, I
get an unhelpful bisect result of commit 0a4b6e2f80aa ("Merge branch
'for-4.16/block'". Instead of letting the test run longer, I'm going
to switch to doing several shorter test boots per kernel and see if
that helps. One more bisect coming...

>> Apparently, you do this on Linus' tree, right? If so, I won't compile it here
>> then.
>
> Actually, I didn't test Linus's tree, but I can do that after the most
> recent bisect finishes... I'm just trying to find where it went wrong
> in 4.16.

FWIW, I see an Oops under Linus's latest tree.

-Kees

-- 
Kees Cook
Pixel Security


Re: usercopy whitelist woe in scsi_sense_cache

2018-04-12 Thread Kees Cook
On Thu, Apr 12, 2018 at 12:04 PM, Oleksandr Natalenko
<oleksa...@natalenko.name> wrote:
> Hi.
>
> On čtvrtek 12. dubna 2018 20:44:37 CEST Kees Cook wrote:
>> My first bisect attempt gave me commit 5448aca41cd5 ("null_blk: wire
>> up timeouts"), which seems insane given that null_blk isn't even built
>> in the .config. I managed to get the testing automated now for a "git
>> bisect run ...", so I'm restarting again to hopefully get a better
>> answer. Normally the Oops happens in the first minute of running. I've
>> set the timeout to 10 minutes for a "good" run...
>
> Apparently, you do this on Linus' tree, right? If so, I won't compile it here
> then.

Actually, I didn't test Linus's tree, but I can do that after the most
recent bisect finishes... I'm just trying to find where it went wrong
in 4.16.

> Thanks for taking care of this.

Thanks for building the reproducer! :)

-Kees

-- 
Kees Cook
Pixel Security


Re: usercopy whitelist woe in scsi_sense_cache

2018-04-12 Thread Kees Cook
On Wed, Apr 11, 2018 at 5:03 PM, Kees Cook <keesc...@chromium.org> wrote:
> On Wed, Apr 11, 2018 at 3:47 PM, Kees Cook <keesc...@chromium.org> wrote:
>> On Tue, Apr 10, 2018 at 8:13 PM, Kees Cook <keesc...@chromium.org> wrote:
>>> I'll see about booting with my own kernels, etc, and try to narrow this 
>>> down. :)
>>
>> If I boot kernels I've built, I no longer hit the bug in this VM
>> (though I'll keep trying). What compiler are you using?
>
> Ignore that: I've reproduced it with my kernels now. I think I messed
> up the initramfs initially. But with an exact copy of your .config,
> booting under Arch grub with initramfs, I see it. I'll start removing
> variables now... :P

My first bisect attempt gave me commit 5448aca41cd5 ("null_blk: wire
up timeouts"), which seems insane given that null_blk isn't even built
in the .config. I managed to get the testing automated now for a "git
bisect run ...", so I'm restarting again to hopefully get a better
answer. Normally the Oops happens in the first minute of running. I've
set the timeout to 10 minutes for a "good" run...

-Kees

-- 
Kees Cook
Pixel Security


Re: usercopy whitelist woe in scsi_sense_cache

2018-04-11 Thread Kees Cook
On Wed, Apr 11, 2018 at 3:47 PM, Kees Cook <keesc...@chromium.org> wrote:
> On Tue, Apr 10, 2018 at 8:13 PM, Kees Cook <keesc...@chromium.org> wrote:
>> I'll see about booting with my own kernels, etc, and try to narrow this 
>> down. :)
>
> If I boot kernels I've built, I no longer hit the bug in this VM
> (though I'll keep trying). What compiler are you using?

Ignore that: I've reproduced it with my kernels now. I think I messed
up the initramfs initially. But with an exact copy of your .config,
booting under Arch grub with initramfs, I see it. I'll start removing
variables now... :P

-Kees

-- 
Kees Cook
Pixel Security


Re: usercopy whitelist woe in scsi_sense_cache

2018-04-11 Thread Kees Cook
On Tue, Apr 10, 2018 at 8:13 PM, Kees Cook <keesc...@chromium.org> wrote:
> I'll see about booting with my own kernels, etc, and try to narrow this down. 
> :)

If I boot kernels I've built, I no longer hit the bug in this VM
(though I'll keep trying). What compiler are you using?

-Kees

-- 
Kees Cook
Pixel Security


Re: usercopy whitelist woe in scsi_sense_cache

2018-04-10 Thread Kees Cook
On Tue, Apr 10, 2018 at 10:16 AM, Oleksandr Natalenko
<oleksa...@natalenko.name> wrote:
> Hi, Kees, Paolo et al.
>
> 10.04.2018 08:53, Kees Cook wrote:
>>
>> Unfortunately I only had a single hang with no dumps. I haven't been
>> able to reproduce it since. :(
>
>
> For your convenience I've prepared a VM that contains a reproducer.

Awesome. :)

> Under the /root folder there is a reproducer script (reproducer.sh). It does
> trivial things like enabling sysrq, opening LUKS device, mounting a volume,
> running a background I/O (this is an important part, actually, since I
> wasn't able to trigger the issue without the background I/O) and, finally,
> running the smartctl in a loop. If you are lucky, within a minute or two
> you'll get the first warning followed shortly by subsequent bugs and I/O
> stall (htop is pre-installed for your convenience too).

Yup!

[   27.729498] Bad or missing usercopy whitelist? Kernel memory
exposure attempt detected from SLUB object 'scsi_sense_cache' (offset
76, size 22)!

I'll see about booting with my own kernels, etc, and try to narrow this down. :)

-Kees

-- 
Kees Cook
Pixel Security


Re: usercopy whitelist woe in scsi_sense_cache

2018-04-10 Thread Kees Cook
On Mon, Apr 9, 2018 at 11:35 PM, Oleksandr Natalenko
<oleksa...@natalenko.name> wrote:
> Did your system hang on smartctl hammering too? Have you got some stack
> traces to compare with mine ones?

Unfortunately I only had a single hang with no dumps. I haven't been
able to reproduce it since. :(

-Kees

-- 
Kees Cook
Pixel Security


Re: usercopy whitelist woe in scsi_sense_cache

2018-04-09 Thread Kees Cook
On Mon, Apr 9, 2018 at 1:30 PM, Kees Cook <keesc...@chromium.org> wrote:
> Ah! dm-crypt too. I'll see if I can get that added easily to my tests.

Quick update: I added dm-crypt (with XFS on top) and it hung my system
almost immediately. I got no warnings at all, though.

-Kees

-- 
Kees Cook
Pixel Security


Re: limits->max_sectors is getting set to 0, why/where? [was: Re: dm: kernel oops by divide error on v4.16+]

2018-04-09 Thread Kees Cook
On Mon, Apr 9, 2018 at 3:32 PM, Jens Axboe <ax...@kernel.dk> wrote:
> That's bad, for sure, but my worry was bigger than an oops or crash,
> we could have had corruption due to this.
>
> The resulting min/max and friends would have been trivial to test, but
> clearly they weren't.

Yeah, that was bad luck and my fault: I tested min(), max(), min_t(),
and max_t(). My assumption was that since the others were built from
them, they'd be fine. Not true in this shadow variable case, though.
:(  We could do something like this, which would have caught it:


diff --git a/init/main.c b/init/main.c
index e4a3160991ea..ce46afc53b8b 100644
--- a/init/main.c
+++ b/init/main.c
@@ -993,10 +993,32 @@ static inline void mark_readonly(void)
 }
 #endif

+static inline void compiletime_sanity_checks(void)
+{
+   /* Sanity-check min()/max() family of macros. */
+   BUILD_BUG_ON(min(5, 50) != 5);
+   BUILD_BUG_ON(max(5, 50) != 50);
+   BUILD_BUG_ON(min_t(int, (size_t)-1 , 50) != -1);
+   BUILD_BUG_ON(max_t(size_t, -1 , 50) != (size_t)-1);
+   BUILD_BUG_ON(min3(-50, 0, 1000) != -50);
+   BUILD_BUG_ON(max3(-50, 0, 1000) != 1000);
+   BUILD_BUG_ON(min_not_zero(0, 20) != 20);
+   BUILD_BUG_ON(min_not_zero(30, 0) != 30);
+   BUILD_BUG_ON(min_not_zero(150, 40) != 40);
+   BUILD_BUG_ON(clamp(20, 1, 7) != 7);
+   BUILD_BUG_ON(clamp(40, 20, 100) != 40);
+   BUILD_BUG_ON(clamp(1, 20, 100) != 20);
+   BUILD_BUG_ON(clamp_t(int, -5, (size_t)-1, 100) != -1);
+   BUILD_BUG_ON(clamp_t(int, -1, (size_t)-5, 100) != -1);
+   BUILD_BUG_ON(clamp_t(size_t, -10, 1, -50) != -50);
+}
+
 static int __ref kernel_init(void *unused)
 {
int ret;

+   compiletime_sanity_checks();
+
kernel_init_freeable();
/* need to finish all async __init code before freeing the memory */
async_synchronize_full();



-- 
Kees Cook
Pixel Security


Re: limits->max_sectors is getting set to 0, why/where? [was: Re: dm: kernel oops by divide error on v4.16+]

2018-04-09 Thread Kees Cook
On Mon, Apr 9, 2018 at 2:56 PM, Jens Axboe <ax...@kernel.dk> wrote:
> On 4/9/18 3:26 PM, Jens Axboe wrote:
>> On 4/9/18 1:32 PM, Jens Axboe wrote:
>>> On 4/9/18 12:38 PM, Mike Snitzer wrote:
>>>> On Mon, Apr 09 2018 at 11:51am -0400,
>>>> Mike Snitzer <snit...@redhat.com> wrote:
>>>>
>>>>> On Sun, Apr 08 2018 at 12:00am -0400,
>>>>> Ming Lei <ming@redhat.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> The following kernel oops(divide error) is triggered when running
>>>>>> xfstest(generic/347) on ext4.
>>>>>>
>>>>>> [  442.632954] run fstests generic/347 at 2018-04-07 18:06:44
>>>>>> [  443.839480] divide error:  [#1] PREEMPT SMP PTI
>>>>>> [  443.840201] Dumping ftrace buffer:
>>>>>> [  443.840692](ftrace buffer empty)
>>>> ...
>>>>>> [  443.845756] CPU: 1 PID: 29607 Comm: dmsetup Not tainted 
>>>>>> 4.16.0_f605ba97fb80_master+ #1
>>>>>> [  443.846968] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
>>>>>> 1.10.2-2.fc27 04/01/2014
>>>>>> [  443.848147] RIP: 0010:pool_io_hints+0x77/0x153 [dm_thin_pool]
>>>>
>>>> ...
>>>>
>>>>> I was able to reproduce (in my case RIP was pool_io_hints+0x45)
>>>>>
>>>>> Which on my kernel, is:
>>>>>
>>>>> crash> dis -l pool_io_hints+0x45
>>>>> /root/snitm/git/linux/drivers/md/dm-thin.c: 2748
>>>>> 0xc0765165 <pool_io_hints+69>:  div%rdi
>>>>>
>>>>> Which is drivers/md/dm-thin.c:is_factor()'s return
>>>>> !sector_div(block_size, n);
>>>>>
>>>>> SO looking at pool_io_hints() it would seem limits->max_sectors is 0 for
>>>>> this xfstests device... why would that be!?
>>>>>
>>>>> Clearly pool_io_hints() could stand to be more defensive with a
>>>>> !limits->max_sectors negative check but is it ever really valid for
>>>>> max_sectors to be 0?
>>>>>
>>>>> Pretty sure the ultimate bug is outside DM (but not seeing an obvious
>>>>> place where block core would set max_sectors to 0, all blk-settings.c
>>>>> uses min_not_zero(), etc).
>>>>
>>>> I successfully ran this test against the linux-dm.git
>>>> "for-4.17/dm-changes" tag that Linus merged after the block changes:
>>>>  git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git 
>>>> tags/for-4.17/dm-changes
>>>>
>>>> # ./check tests/generic/347
>>>> FSTYP -- ext4
>>>> PLATFORM  -- Linux/x86_64 thegoat 4.16.0-rc5.snitm
>>>> MKFS_OPTIONS  -- /dev/mapper/test-xfstests_scratch
>>>> MOUNT_OPTIONS -- -o acl,user_xattr /dev/mapper/test-xfstests_scratch 
>>>> /scratch
>>>>
>>>> generic/347  65s
>>>> Ran: generic/347
>>>> Passed all 1 tests
>>>>
>>>> SO this would seem to implicate some regression in the 4.17 block layer
>>>> changes.
>>>
>>> No immediate ideas come to mind, we didn't have a lot of changes and I
>>> don't see anything that looks problematic. Maybe you can try and
>>> bisect it and see what you come up with?
>>
>> I ran it, problematic commit is:
>>
>> commit 3c8ba0d61d04ced9f8d9ff93977995a9e4e96e91
>> Author: Kees Cook <keesc...@chromium.org>
>> Date:   Fri Mar 30 18:52:36 2018 -0700
>>
>> kernel.h: Retain constant expression output for max()/min()
>>
>
> The fun continues. Thinking I'd try a userspace repro and thinking it
> would be difficult to reproduce, try the attached min.c that just copies
> all the bits from include/linux/kernel.h
>
> axboe@x1:~ $ gcc -Wall -O2 -o min min.c
> axboe@x1:~ $ ./min 128 256
> min_not_zero(128, 256) = 0

This should be fixed with e9092d0d9796 ("Fix subtle macro variable
shadowing in min_not_zero()").

-Kees

-- 
Kees Cook
Pixel Security


Re: usercopy whitelist woe in scsi_sense_cache

2018-04-09 Thread Kees Cook
On Mon, Apr 9, 2018 at 12:02 PM, Oleksandr Natalenko
<oleksa...@natalenko.name> wrote:
>
> Hi.
>
> (fancy details for linux-block and BFQ people go below)
>
> 09.04.2018 20:32, Kees Cook wrote:
>>
>> Ah, this detail I didn't have. I've changed my environment to
>>
>> build with:
>>
>> CONFIG_BLK_MQ_PCI=y
>> CONFIG_BLK_MQ_VIRTIO=y
>> CONFIG_IOSCHED_BFQ=y
>>
>> boot with scsi_mod.use_blk_mq=1
>>
>> and select BFQ in the scheduler:
>>
>> # cat /sys/block/sd?/queue/scheduler
>> mq-deadline kyber [bfq] none
>> mq-deadline kyber [bfq] none
>>
>> Even with this, I'm not seeing anything yet...
>
>
> Thanks for looking into it anyway. I was experimenting today a little bit, 
> and for me it looks like setting queue_depth and nr_requests to minimal 
> values speeds up the reproducing. Could you please try it too? Something like:
>
> echo 1 | tee /sys/block/sd*/queue/nr_requests

I can't get this below "4".

> echo 1 | tee /sys/block/sd*/device/queue_depth

I've got this now too.

> Also, now I have a more solid proof that this is related to I/O scheduling.
>
> I was hammering my VM, and after a couple of usercopy warnings/bugs I can 
> reliably trigger I/O hang. I was able to obtain the stack traces of tasks in 
> D state. Listing them here below. dmcrypt_write:

Ah! dm-crypt too. I'll see if I can get that added easily to my tests.

> ===
> [ 1615.409622] dmcrypt_write   D0   236  2 0x8000
> [ 1615.413422] Call Trace:
> [ 1615.415428]  ? __schedule+0x336/0xf40
> [ 1615.417824]  ? blk_mq_sched_dispatch_requests+0x117/0x190
> [ 1615.421423]  ? __sbitmap_get_word+0x2a/0x80
> [ 1615.424202]  schedule+0x32/0xc0
> [ 1615.426521]  io_schedule+0x12/0x40
> [ 1615.432414]  blk_mq_get_tag+0x181/0x2a0
> [ 1615.434881]  ? elv_merge+0x79/0xe0
> [ 1615.437447]  ? wait_woken+0x80/0x80
> [ 1615.439553]  blk_mq_get_request+0xf9/0x400
> [ 1615.444653]  blk_mq_make_request+0x10b/0x640
> [ 1615.448025]  generic_make_request+0x124/0x2d0
> [ 1615.450716]  ? raid10_unplug+0xfb/0x180 [raid10]
> [ 1615.454069]  raid10_unplug+0xfb/0x180 [raid10]
> [ 1615.456729]  blk_flush_plug_list+0xc1/0x250
> [ 1615.460276]  blk_finish_plug+0x27/0x40
> [ 1615.464103]  dmcrypt_write+0x233/0x240 [dm_crypt]
> [ 1615.467443]  ? wake_up_process+0x20/0x20
> [ 1615.470845]  ? crypt_iv_essiv_dtr+0x60/0x60 [dm_crypt]
> [ 1615.475272]  ? kthread+0x113/0x130
> [ 1615.477652]  kthread+0x113/0x130
> [ 1615.480567]  ? kthread_create_on_node+0x70/0x70
> [ 1615.483268]  ret_from_fork+0x35/0x40
> ===
>
> One of XFS threads, too:

And XFS! You love your corner cases. ;)

>
> ===
> [ 1616.133298] xfsaild/dm-7D0   316  2 0x8000
> [ 1616.136679] Call Trace:
> [ 1616.138845]  ? __schedule+0x336/0xf40
> [ 1616.141581]  ? preempt_count_add+0x68/0xa0
> [ 1616.147214]  ? _raw_spin_unlock+0x16/0x30
> [ 1616.149813]  ? _raw_spin_unlock_irqrestore+0x20/0x40
> [ 1616.152478]  ? try_to_del_timer_sync+0x4d/0x80
> [ 1616.154734]  schedule+0x32/0xc0
> [ 1616.156579]  _xfs_log_force+0x146/0x290 [xfs]
> [ 1616.159322]  ? wake_up_process+0x20/0x20
> [ 1616.162175]  xfsaild+0x1a9/0x820 [xfs]
> [ 1616.164695]  ? xfs_trans_ail_cursor_first+0x80/0x80 [xfs]
> [ 1616.167567]  ? kthread+0x113/0x130
> [ 1616.169722]  kthread+0x113/0x130
> [ 1616.171908]  ? kthread_create_on_node+0x70/0x70
> [ 1616.174073]  ? do_syscall_64+0x74/0x190
> [ 1616.179008]  ? SyS_exit_group+0x10/0x10
> [ 1616.182306]  ret_from_fork+0x35/0x40
> ===
>
> journald is another victim:
>
> ===
> [ 1616.184343] systemd-journal D0   354  1 0x0104
> [ 1616.187282] Call Trace:
> [ 1616.189464]  ? __schedule+0x336/0xf40
> [ 1616.191781]  ? call_function_single_interrupt+0xa/0x20
> [ 1616.194788]  ? _raw_spin_lock_irqsave+0x25/0x50
> [ 1616.197592]  schedule+0x32/0xc0
> [ 1616.200171]  schedule_timeout+0x202/0x470
> [ 1616.202851]  ? preempt_count_add+0x49/0xa0
> [ 1616.206227]  wait_for_common+0xbb/0x180
> [ 1616.209877]  ? wake_up_process+0x20/0x20
> [ 1616.212511]  do_coredump+0x335/0xea0
> [ 1616.214861]  ? schedule+0x3c/0xc0
> [ 1616.216775]  ? futex_wait_queue_me+0xbb/0x110
> [ 1616.218894]  ? _raw_spin_unlock+0x16/0x30
> [ 1616.220868]  ? futex_wait+0x143/0x240
> [ 1616.223450]  get_signal+0x250/0x5c0
> [ 1616.225965]  do_signal+0x36/0x610
> [ 1616.228246]  ? __seccomp_filter+0x3b/0x260
> [ 1616.231000]  ? __check_object_size+0x9f/0x1a0
> [ 1616.233716]  exit_to_usermode_loop+0x85/0xa0
> [ 1616.238413]  do_syscall_64+0x18b/0x190
> [ 1616.240798]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> [ 1616.244401] RIP: 0033:0x7f78fc53e45d
> [ 1616.246690] RSP: 002b:7ffd40330d20

Re: usercopy whitelist woe in scsi_sense_cache

2018-04-09 Thread Kees Cook
On Sun, Apr 8, 2018 at 12:07 PM, Oleksandr Natalenko
<oleksa...@natalenko.name> wrote:
> So far, I wasn't able to trigger this with mq-deadline (or without blk-mq).
> Maybe, this has something to do with blk-mq+BFQ re-queuing, or it's just me
> not being persistent enough.

Ah, this detail I didn't have. I've changed my environment to

build with:

CONFIG_BLK_MQ_PCI=y
CONFIG_BLK_MQ_VIRTIO=y
CONFIG_IOSCHED_BFQ=y

boot with scsi_mod.use_blk_mq=1

and select BFQ in the scheduler:

# cat /sys/block/sd?/queue/scheduler
mq-deadline kyber [bfq] none
mq-deadline kyber [bfq] none

Even with this, I'm not seeing anything yet...

> It looks like this code path was re-written completely with 17cb960f29c2, but
> it went merged for the upcoming v4.17 only, and thus I haven't tried it yet.
>
> Kees took a brief look at it already: [1]. This is what smartctl does [2]
> (just a usual strace capture when the bug is not triggered).
>
> Christoph, do you have some idea on why this can happen?
>
> Thanks.
>
> Regards,
>   Oleksandr
>
> [1] https://marc.info/?l=linux-scsi=152287333013845=2
> [2] https://gist.github.com/pfactum/6f58f8891468aeba1ab2cc9f45668735

The thing I can't figure out is how req->sense is slipping forward in
(and even beyond!) the allocation.

-Kees

-- 
Kees Cook
Pixel Security


[PATCH] lightnvm: Convert timers to use timer_setup()

2017-10-24 Thread Kees Cook
In preparation for unconditionally passing the struct timer_list pointer to
all timer callbacks, switch to using the new timer_setup() and from_timer()
to pass the timer pointer explicitly.

Cc: Matias Bjorling <m...@lightnvm.io>
Cc: linux-block@vger.kernel.org
Signed-off-by: Kees Cook <keesc...@chromium.org>
---
 drivers/lightnvm/pblk-core.c | 4 ++--
 drivers/lightnvm/pblk-gc.c   | 6 +++---
 drivers/lightnvm/pblk-init.c | 2 +-
 drivers/lightnvm/pblk-rl.c   | 6 +++---
 drivers/lightnvm/pblk.h  | 2 +-
 drivers/lightnvm/rrpc.c  | 6 +++---
 6 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index 81501644fb15..8841eb66c902 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -229,9 +229,9 @@ static void pblk_write_kick(struct pblk *pblk)
mod_timer(>wtimer, jiffies + msecs_to_jiffies(1000));
 }
 
-void pblk_write_timer_fn(unsigned long data)
+void pblk_write_timer_fn(struct timer_list *t)
 {
-   struct pblk *pblk = (struct pblk *)data;
+   struct pblk *pblk = from_timer(pblk, t, wtimer);
 
/* kick the write thread every tick to flush outstanding data */
pblk_write_kick(pblk);
diff --git a/drivers/lightnvm/pblk-gc.c b/drivers/lightnvm/pblk-gc.c
index 6090d28f7995..85ff64fe5ba2 100644
--- a/drivers/lightnvm/pblk-gc.c
+++ b/drivers/lightnvm/pblk-gc.c
@@ -428,9 +428,9 @@ void pblk_gc_kick(struct pblk *pblk)
mod_timer(>gc_timer, jiffies + msecs_to_jiffies(GC_TIME_MSECS));
 }
 
-static void pblk_gc_timer(unsigned long data)
+static void pblk_gc_timer(struct timer_list *t)
 {
-   struct pblk *pblk = (struct pblk *)data;
+   struct pblk *pblk = from_timer(pblk, t, gc.gc_timer);
 
pblk_gc_kick(pblk);
 }
@@ -569,7 +569,7 @@ int pblk_gc_init(struct pblk *pblk)
goto fail_free_writer_kthread;
}
 
-   setup_timer(>gc_timer, pblk_gc_timer, (unsigned long)pblk);
+   timer_setup(>gc_timer, pblk_gc_timer, 0);
mod_timer(>gc_timer, jiffies + msecs_to_jiffies(GC_TIME_MSECS));
 
gc->gc_active = 0;
diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index 1b0f61233c21..c0ae85b514f5 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -833,7 +833,7 @@ static int pblk_lines_init(struct pblk *pblk)
 
 static int pblk_writer_init(struct pblk *pblk)
 {
-   setup_timer(>wtimer, pblk_write_timer_fn, (unsigned long)pblk);
+   timer_setup(>wtimer, pblk_write_timer_fn, 0);
mod_timer(>wtimer, jiffies + msecs_to_jiffies(100));
 
pblk->writer_ts = kthread_create(pblk_write_ts, pblk, "pblk-writer-t");
diff --git a/drivers/lightnvm/pblk-rl.c b/drivers/lightnvm/pblk-rl.c
index 2e6a5361baf0..1210deeb6f43 100644
--- a/drivers/lightnvm/pblk-rl.c
+++ b/drivers/lightnvm/pblk-rl.c
@@ -178,9 +178,9 @@ int pblk_rl_sysfs_rate_show(struct pblk_rl *rl)
return rl->rb_user_max;
 }
 
-static void pblk_rl_u_timer(unsigned long data)
+static void pblk_rl_u_timer(struct timer_list *t)
 {
-   struct pblk_rl *rl = (struct pblk_rl *)data;
+   struct pblk_rl *rl = from_timer(rl, t, u_timer);
 
/* Release user I/O state. Protect from GC */
smp_store_release(>rb_user_active, 0);
@@ -221,7 +221,7 @@ void pblk_rl_init(struct pblk_rl *rl, int budget)
atomic_set(>rb_gc_cnt, 0);
atomic_set(>rb_space, -1);
 
-   setup_timer(>u_timer, pblk_rl_u_timer, (unsigned long)rl);
+   timer_setup(>u_timer, pblk_rl_u_timer, 0);
 
rl->rb_user_active = 0;
rl->rb_gc_active = 0;
diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h
index 67e623bd5c2d..e097dea4a1ea 100644
--- a/drivers/lightnvm/pblk.h
+++ b/drivers/lightnvm/pblk.h
@@ -789,7 +789,7 @@ void pblk_map_rq(struct pblk *pblk, struct nvm_rq *rqd, 
unsigned int sentry,
  * pblk write thread
  */
 int pblk_write_ts(void *data);
-void pblk_write_timer_fn(unsigned long data);
+void pblk_write_timer_fn(struct timer_list *t);
 void pblk_write_should_kick(struct pblk *pblk);
 
 /*
diff --git a/drivers/lightnvm/rrpc.c b/drivers/lightnvm/rrpc.c
index 267f01ae87e4..0993c14be860 100644
--- a/drivers/lightnvm/rrpc.c
+++ b/drivers/lightnvm/rrpc.c
@@ -267,9 +267,9 @@ static void rrpc_gc_kick(struct rrpc *rrpc)
 /*
  * timed GC every interval.
  */
-static void rrpc_gc_timer(unsigned long data)
+static void rrpc_gc_timer(struct timer_list *t)
 {
-   struct rrpc *rrpc = (struct rrpc *)data;
+   struct rrpc *rrpc = from_timer(rrpc, t, gc_timer);
 
rrpc_gc_kick(rrpc);
mod_timer(>gc_timer, jiffies + msecs_to_jiffies(10));
@@ -1063,7 +1063,7 @@ static int rrpc_gc_init(struct rrpc *rrpc)
if (!rrpc->kgc_wq)
return -ENOMEM;
 
-   setup_timer(>gc_timer, rrpc_gc_timer, (unsigned long)rrpc);
+   timer_setup(>gc_timer, rrpc_gc_timer, 0);
 
return 0;
 }
-- 
2.7.4


-- 
Kees Cook
Pixel Security


Re: [PATCH v2] block/aoe: Convert timers to use timer_setup()

2017-10-16 Thread Kees Cook
On Fri, Oct 6, 2017 at 7:19 AM, Jens Axboe <ax...@kernel.dk> wrote:
> On 10/05/2017 05:13 PM, Kees Cook wrote:
>> In preparation for unconditionally passing the struct timer_list pointer to
>> all timer callbacks, switch to using the new timer_setup() and from_timer()
>> to pass the timer pointer explicitly.
>
> Applied to for-4.15/timer

Hi,

I just wanted to check what your timer plans were for merging this
into -next (I'm doing rebasing to find out which maintainers I need to
resend patches to, and I noticed block hasn't appeared in -next, but I
know you've pulled patches...)

Thanks!

-Kees

-- 
Kees Cook
Pixel Security


[PATCH v2] block/laptop_mode: Convert timers to use timer_setup()

2017-10-05 Thread Kees Cook
In preparation for unconditionally passing the struct timer_list pointer to
all timer callbacks, switch to using the new timer_setup() and from_timer()
to pass the timer pointer explicitly.

Cc: Jens Axboe <ax...@kernel.dk>
Cc: Michal Hocko <mho...@suse.com>
Cc: Andrew Morton <a...@linux-foundation.org>
Cc: Jan Kara <j...@suse.cz>
Cc: Johannes Weiner <han...@cmpxchg.org>
Cc: Nicholas Piggin <npig...@gmail.com>
Cc: Vladimir Davydov <vdavydov@gmail.com>
Cc: Matthew Wilcox <mawil...@microsoft.com>
Cc: Jeff Layton <jlay...@redhat.com>
Cc: linux-block@vger.kernel.org
Cc: linux...@kvack.org
Cc: Thomas Gleixner <t...@linutronix.de>
Signed-off-by: Kees Cook <keesc...@chromium.org>
---
v2: Rebased to linux-block.git/for-4.15/timer
---
 block/blk-core.c  | 10 +-
 include/linux/writeback.h |  2 +-
 mm/page-writeback.c   |  7 ---
 3 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 14f7674fa0b1..596255822d7d 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -803,9 +803,9 @@ static void blk_queue_usage_counter_release(struct 
percpu_ref *ref)
wake_up_all(>mq_freeze_wq);
 }
 
-static void blk_rq_timed_out_timer(unsigned long data)
+static void blk_rq_timed_out_timer(struct timer_list *t)
 {
-   struct request_queue *q = (struct request_queue *)data;
+   struct request_queue *q = from_timer(q, t, timeout);
 
kblockd_schedule_work(>timeout_work);
 }
@@ -841,9 +841,9 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, 
int node_id)
q->backing_dev_info->name = "block";
q->node = node_id;
 
-   setup_timer(>backing_dev_info->laptop_mode_wb_timer,
-   laptop_mode_timer_fn, (unsigned long) q);
-   setup_timer(>timeout, blk_rq_timed_out_timer, (unsigned long) q);
+   timer_setup(>backing_dev_info->laptop_mode_wb_timer,
+   laptop_mode_timer_fn, 0);
+   timer_setup(>timeout, blk_rq_timed_out_timer, 0);
INIT_LIST_HEAD(>queue_head);
INIT_LIST_HEAD(>timeout_list);
INIT_LIST_HEAD(>icq_list);
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index dd1d2c23f743..80944e6bf484 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -309,7 +309,7 @@ static inline void cgroup_writeback_umount(void)
 void laptop_io_completion(struct backing_dev_info *info);
 void laptop_sync_completion(void);
 void laptop_mode_sync(struct work_struct *work);
-void laptop_mode_timer_fn(unsigned long data);
+void laptop_mode_timer_fn(struct timer_list *t);
 #else
 static inline void laptop_sync_completion(void) { }
 #endif
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 8d1fc593bce8..138b8016f759 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1977,11 +1977,12 @@ int dirty_writeback_centisecs_handler(struct ctl_table 
*table, int write,
 }
 
 #ifdef CONFIG_BLOCK
-void laptop_mode_timer_fn(unsigned long data)
+void laptop_mode_timer_fn(struct timer_list *t)
 {
-   struct request_queue *q = (struct request_queue *)data;
+   struct backing_dev_info *backing_dev_info =
+   from_timer(backing_dev_info, t, laptop_mode_wb_timer);
 
-   wakeup_flusher_threads_bdi(q->backing_dev_info, WB_REASON_LAPTOP_TIMER);
+   wakeup_flusher_threads_bdi(backing_dev_info, WB_REASON_LAPTOP_TIMER);
 }
 
 /*
-- 
2.7.4


-- 
Kees Cook
Pixel Security


[PATCH v2] block/aoe: Convert timers to use timer_setup()

2017-10-05 Thread Kees Cook
In preparation for unconditionally passing the struct timer_list pointer to
all timer callbacks, switch to using the new timer_setup() and from_timer()
to pass the timer pointer explicitly.

Cc: Jens Axboe <ax...@kernel.dk>
Cc: "Ed L. Cashin" <ed.cas...@acm.org>
Cc: linux-block@vger.kernel.org
Cc: Thomas Gleixner <t...@linutronix.de>
Signed-off-by: Kees Cook <keesc...@chromium.org>
---
v2: Rebased to linux-block.git/for-4.15/timer
---
 drivers/block/aoe/aoecmd.c | 6 +++---
 drivers/block/aoe/aoedev.c | 9 +++--
 2 files changed, 6 insertions(+), 9 deletions(-)

diff --git a/drivers/block/aoe/aoecmd.c b/drivers/block/aoe/aoecmd.c
index dc43254e05a4..55ab25f79a08 100644
--- a/drivers/block/aoe/aoecmd.c
+++ b/drivers/block/aoe/aoecmd.c
@@ -744,7 +744,7 @@ count_targets(struct aoedev *d, int *untainted)
 }
 
 static void
-rexmit_timer(ulong vp)
+rexmit_timer(struct timer_list *timer)
 {
struct aoedev *d;
struct aoetgt *t;
@@ -758,7 +758,7 @@ rexmit_timer(ulong vp)
int utgts;  /* number of aoetgt descriptors (not slots) */
int since;
 
-   d = (struct aoedev *) vp;
+   d = from_timer(d, timer, timer);
 
spin_lock_irqsave(>lock, flags);
 
@@ -1429,7 +1429,7 @@ aoecmd_ata_id(struct aoedev *d)
 
d->rttavg = RTTAVG_INIT;
d->rttdev = RTTDEV_INIT;
-   d->timer.function = rexmit_timer;
+   d->timer.function = (TIMER_FUNC_TYPE)rexmit_timer;
 
skb = skb_clone(skb, GFP_ATOMIC);
if (skb) {
diff --git a/drivers/block/aoe/aoedev.c b/drivers/block/aoe/aoedev.c
index b28fefb90391..697f735b07a4 100644
--- a/drivers/block/aoe/aoedev.c
+++ b/drivers/block/aoe/aoedev.c
@@ -15,7 +15,6 @@
 #include 
 #include "aoe.h"
 
-static void dummy_timer(ulong);
 static void freetgt(struct aoedev *d, struct aoetgt *t);
 static void skbpoolfree(struct aoedev *d);
 
@@ -146,11 +145,11 @@ aoedev_put(struct aoedev *d)
 }
 
 static void
-dummy_timer(ulong vp)
+dummy_timer(struct timer_list *t)
 {
struct aoedev *d;
 
-   d = (struct aoedev *)vp;
+   d = from_timer(d, t, timer);
if (d->flags & DEVFL_TKILL)
return;
d->timer.expires = jiffies + HZ;
@@ -466,9 +465,7 @@ aoedev_by_aoeaddr(ulong maj, int min, int do_alloc)
INIT_WORK(>work, aoecmd_sleepwork);
spin_lock_init(>lock);
skb_queue_head_init(>skbpool);
-   init_timer(>timer);
-   d->timer.data = (ulong) d;
-   d->timer.function = dummy_timer;
+   timer_setup(>timer, dummy_timer, 0);
d->timer.expires = jiffies + HZ;
add_timer(>timer);
d->bufpool = NULL;  /* defer to aoeblk_gdalloc */
-- 
2.7.4


-- 
Kees Cook
Pixel Security


Re: [PATCH] block/laptop_mode: Convert timers to use timer_setup()

2017-10-05 Thread Kees Cook
On Thu, Oct 5, 2017 at 3:07 PM, Jens Axboe <ax...@kernel.dk> wrote:
> Yes, it's not impossible, I just usually prefer not to. For this case, I
> just setup a for-4.15/timer, that is the current block branch with -rc3
> pulled in. I applied the two patches for floppy and amiflop, I'm
> assuming Kees will respin the writeback/laptop version and I can shove
> that in there too.

Thanks for setting this up! I'll respin and send it out again.

-Kees

-- 
Kees Cook
Pixel Security


Re: [PATCH] block/laptop_mode: Convert timers to use timer_setup()

2017-10-05 Thread Kees Cook
On Thu, Oct 5, 2017 at 11:26 AM, Jens Axboe <ax...@kernel.dk> wrote:
> Honestly, I think the change should have waited for 4.15 in that case.
> Why the rush? It wasn't ready for the merge window.

My understanding from my discussion with tglx was that if the API
change waiting for 4.15, then the conversions would have to wait until
4.16. With most conversions only depending on the new API, it seemed
silly to wait another whole release when the work is just waiting to
be merged.

But yes, timing was not ideal. I did try to get it in earlier, but I
think tglx was busy with other concerns.

-Kees

-- 
Kees Cook
Pixel Security


Re: [PATCH] block/laptop_mode: Convert timers to use timer_setup()

2017-10-05 Thread Kees Cook
On Thu, Oct 5, 2017 at 7:56 AM, Jens Axboe <ax...@kernel.dk> wrote:
> On 10/04/2017 06:49 PM, Kees Cook wrote:
>> In preparation for unconditionally passing the struct timer_list pointer to
>> all timer callbacks, switch to using the new timer_setup() and from_timer()
>> to pass the timer pointer explicitly.
>>
>> Cc: Jens Axboe <ax...@kernel.dk>
>> Cc: Michal Hocko <mho...@suse.com>
>> Cc: Andrew Morton <a...@linux-foundation.org>
>> Cc: Jan Kara <j...@suse.cz>
>> Cc: Johannes Weiner <han...@cmpxchg.org>
>> Cc: Nicholas Piggin <npig...@gmail.com>
>> Cc: Vladimir Davydov <vdavydov@gmail.com>
>> Cc: Matthew Wilcox <mawil...@microsoft.com>
>> Cc: Jeff Layton <jlay...@redhat.com>
>> Cc: linux-block@vger.kernel.org
>> Cc: linux...@kvack.org
>> Cc: Thomas Gleixner <t...@linutronix.de>
>> Signed-off-by: Kees Cook <keesc...@chromium.org>
>> ---
>> This requires commit 686fef928bba ("timer: Prepare to change timer
>> callback argument type") in v4.14-rc3, but should be otherwise
>> stand-alone.
>
> My only complaint about this is the use of a from_timer() macro instead
> of just using container_of() at the call sites to actually show that is
> happening. I'm generally opposed to obfuscation like that. It just means
> you have to look up what is going on, instead of it being readily
> apparent to the reader/reviewer.

Yeah, this got discussed a bit with tglx and hch. Ultimately, this
seems to be the least bad of several options. Specifically with regard
to container_of(), it just gets to be huge, and makes things harder to
read (almost always requires a line break, needlessly repeats the
variable type definition, etc). Since there is precedent of both using
wrappers on container_of() and for adding from_foo() helpers, I chose
the resulting from_timer().

> I guess I do have a a second complaint as well - that it landed in -rc3,
> which is rather late considering subsystem trees are usually forked
> earlier than that. Had this been in -rc1, I would have had an easier
> time applying the block bits for 4.15.

Yes, totally true. tglx and I ended up meeting face-to-face at the
Kernel Recipes conference and we solved some outstanding design issues
with the conversion. The timing meant the new API went into -rc3,
which seemed better than missing an entire release cycle, or carrying
deltas against maintainer trees that would drift. (This is actually my
second massive refactoring of these changes...)

If you don't want to deal with the -rc3 issue, would you want these
changes to get carried in the timer tree instead?

Thanks!

-Kees

-- 
Kees Cook
Pixel Security


Re: [PATCH 0/5] v3 block subsystem refcounter conversions

2017-06-27 Thread Kees Cook
On Tue, Jun 27, 2017 at 6:26 AM, Jens Axboe <ax...@kernel.dk> wrote:
> On 06/27/2017 05:39 AM, Elena Reshetova wrote:
>> Changes in v3:
>> No changes in patches apart from trivial rebases, but now by
>> default refcount_t = atomic_t and uses all atomic standard operations
>> unless CONFIG_REFCOUNT_FULL is enabled. This is a compromize for the
>> systems that are critical on performance and cannot accept even
>> slight delay on the refcounter operations.
>
> Is that true in 4.12-rc, or is that true in a later release once
> Linus has pulled those changes in? If the latter, please resend
> this when those changes are in, thanks.

It's in -next currently ("locking/refcount: Create unchecked atomic_t
implementation")

-Kees

-- 
Kees Cook
Pixel Security


Re: [PATCH 0/5] v2: block subsystem refcounter conversions

2017-04-21 Thread Kees Cook
On Fri, Apr 21, 2017 at 2:27 PM, James Bottomley
<james.bottom...@hansenpartnership.com> wrote:
> On Fri, 2017-04-21 at 13:22 -0700, Kees Cook wrote:
>> On Fri, Apr 21, 2017 at 12:55 PM, Eric Biggers <ebigge...@gmail.com>
>> wrote:
>> > > > Of course, having extra checks behind a debug option is fine.
>> > > >  But they should not be part of the base feature; the base
>> > > > feature should just be mitigation of reference count
>> > > > *overflows*.  It would be nice to do more, of
>> > > > course; but when the extra stuff prevents people from using
>> > > > refcount_t for performance reasons, it defeats the point of the
>> > > > feature in the first place.
>> > >
>> > > Sure, but as I said above, I think the smaller tricks and fixes
>> > > won't be convincing enough, so their value is questionable.
>> >
>> > This makes no sense, as the main point of the feature is supposed
>> > to be the security improvement.  As-is, the extra debugging stuff
>> > is actually preventing the security improvement from being adopted,
>> > which is unfortunate.
>>
>> We've been trying to handle the conflicting desires of those wanting
>> very precise refcounting implementation and gaining the security
>> protections. Ultimately, the best way forward seemed to be to first
>> land the precise refcounting implementation, and start conversion
>> until we ran into concerns over performance. Now, since we're here,
>> we can move forward with getting a fast implementation that provides
>> the desired security protections without too greatly messing with the
>> refcount API.
>
> But that's not what it really looks like.  What it looks like is
> someone came up with a new API and is now intent on forcing it through
> the kernel in about 500 patches using security as the hammer.

The intent is to split refcounting and statistical counters away from
atomics, since they have distinct APIs. This will let us audit the
remaining atomic uses much more easily.

> If we were really concerned about security first, we'd have fixed the
> atomic overflow problem in the atomics and then built the precise
> refcounting on that and tried to persuade people of the merits.

I agree, but this approach was NAKed by multiple atomics maintainers.

> Why can't we still do this?  It looks like the overflow protection will
> add only a couple of cycles to the atomics.  We can move the current
> version to an unchecked variant which can be used either in truly
> performance critical regions with no security implications or if
> someone really needs the atomic to overflow.  From my point of view it
> would give us the 90% case (security) and stop this endless argument
> over the fast paths.  Subsystems which have already moved to refcount
> would stay there and the rest get to evaluate a migration on the merits
> of the operational utility.

-Kees

-- 
Kees Cook
Pixel Security


Re: [PATCH 0/5] v2: block subsystem refcounter conversions

2017-04-21 Thread Kees Cook
On Fri, Apr 21, 2017 at 12:55 PM, Eric Biggers <ebigge...@gmail.com> wrote:
> Hi Elena,
>
> On Fri, Apr 21, 2017 at 10:55:29AM +, Reshetova, Elena wrote:
>> >
>> > At the very least, what is there now could probably be made about twice as 
>> > fast
>> > by removing the checks that don't actually help mitigate refcount overflow 
>> > bugs,
>> > specifically all the checks in refcount_dec(), and the part of 
>> > refcount_inc()
>> > where it doesn't allow incrementing a 0 refcount.  Hint: if a refcount is 
>> > 0, the
>> > object has already been freed, so the attacker will have had the 
>> > opportunity to
>> > allocate it with contents they control already.
>>
>> refcount_dec() is used very little through the code actually, it is more 
>> like an exception
>> case since in order to use it one must really be sure that refcounter 
>> doesn't drop to zero.
>> Removing the warn around it wouldn't likely affect much overall and thus it 
>> is better to
>> stay to discourage people of API itself :)
>>
>> refcount_inc() is of course a different story, it is extensively used. I 
>> guess the perf issue
>> on checking increment from zero might only come from WARNs being printed,
>> but not really from an additional check here for zero since it is trivial 
>> and part of
>> the needed loop anyway. So, I think only removing the
>> WARNs might have any visible impact, but even this is probably not really 
>> that big.
>>
>> So, I think these changes won't really help adoption of interface if 
>> arguments against
>> is performance. If we do have a performance issue, I think arch. specific 
>> implementation
>> is likely the only way to considerably speed it up.
>
> I should have used refcount_dec_and_test() as the example, as the same applies
> to both refcount_dec() and refcount_dec_and_test().  There is negligible
> security benefit to have these refcount release operations checked vs. just
> calling through to atomic_dec() and atomic_dec_and_test().  It's unfortunate,
> but there is no known way to detect ahead of time (i.e. before exploitation) 
> if
> there are too many refcount releases, only too many refcount acquires.
>
> The WARN is only executed if there is a bug, so naturally it's only a problem 
> if
> the functions are to be inlined, creating bloat.  The actual performance 
> problem
> is the overhead associated with using comparisons and cmpxchg's to avoid
> changing a refcount that is 0 or UINT_MAX.  The more efficient approach would 
> be
> to (a) drop the check for 0, and (b) don't require full operation to be 
> atomic,
> but rather do something like "lock incl %[counter]; js ".  Yes
> it's not "atomic", and people have complained about this, but there is no
> technical reason why it needs to be atomic.  Attackers do *not* care whether
> your exploit mitigation is theoretically "atomic" or not, they only care 
> whether
> it works or not.  And besides, it's not even "atomic_t" anymore, it's
> "refcount_t".
>
>> > Of course, having extra checks behind a debug option is fine.  But they 
>> > should
>> > not be part of the base feature; the base feature should just be 
>> > mitigation of
>> > reference count *overflows*.  It would be nice to do more, of course; but 
>> > when
>> > the extra stuff prevents people from using refcount_t for performance 
>> > reasons,
>> > it defeats the point of the feature in the first place.
>>
>> Sure, but as I said above, I think the smaller tricks and fixes won't be 
>> convincing enough,
>> so their value is questionable.
>
> This makes no sense, as the main point of the feature is supposed to be the
> security improvement.  As-is, the extra debugging stuff is actually preventing
> the security improvement from being adopted, which is unfortunate.

We've been trying to handle the conflicting desires of those wanting
very precise refcounting implementation and gaining the security
protections. Ultimately, the best way forward seemed to be to first
land the precise refcounting implementation, and start conversion
until we ran into concerns over performance. Now, since we're here, we
can move forward with getting a fast implementation that provides the
desired security protections without too greatly messing with the
refcount API.

-Kees

-- 
Kees Cook
Pixel Security


Re: [PATCH 0/5] v2: block subsystem refcounter conversions

2017-04-21 Thread Kees Cook
On Fri, Apr 21, 2017 at 3:55 AM, Reshetova, Elena
<elena.reshet...@intel.com> wrote:
> On Thu, Apr 20, 2017 at 11:33 AM, Eric Biggers <ebigge...@gmail.com> wrote:
>> Of course, having extra checks behind a debug option is fine.  But they 
>> should
>> not be part of the base feature; the base feature should just be mitigation 
>> of
>> reference count *overflows*.  It would be nice to do more, of course; but 
>> when
>> the extra stuff prevents people from using refcount_t for performance 
>> reasons,
>> it defeats the point of the feature in the first place.
>
> Sure, but as I said above, I think the smaller tricks and fixes won't be 
> convincing enough,
> so their value is questionable.
>
>> I strongly encourage anyone who has been involved in refcount_t to experiment
>> with removing a reference count decrement somewhere in their kernel, then 
>> trying
>> to exploit it to gain code execution.  If you don't know what you're trying 
>> to
>> protect against, you will not know which defences work and which don't.
>
> Well, we had successful CVEs and even exploits on this in the past.
> @Kees, do you store a list of them in the project?

Here are two from last year:
http://perception-point.io/2016/01/14/analysis-and-exploitation-of-a-linux-kernel-vulnerability-cve-2016-0728/
http://cyseclabs.com/page?n=02012016

I don't disagree with Eric on the threat model: the primary concern
for reference count protection is the overflow condition. Detecting a
prior use-after-free is certainly nice to have, but the more important
case is the initial overflow.

How about we introduce something like CONFIG_HAVE_ARCH_FAST_REFCOUNT_T
for per-arch implementations and CONFIG_FAST_REFCOUNT_T that trades
coverage for speed, and checks only the overflow condition. This gets
us the critical coverage without the changes in performance. This is
basically what PaX/grsecurity already did: there is a tiny change to
the atomic inc functions to detect the wrap.

-Kees

-- 
Kees Cook
Pixel Security