Re: [block regression] kernel oops triggered by removing scsi device dring IO

2018-04-09 Thread Bart Van Assche
On Tue, 2018-04-10 at 09:30 +0800, Ming Lei wrote:
> Also is it possible to see queue freed here?
I think the caller should keep a reference on the request queue. Otherwise we have a much bigger problem than a race between submitting a bio and removing a request queue from the cgroup controller

Re: [block regression] kernel oops triggered by removing scsi device dring IO

2018-04-09 Thread Ming Lei
On Mon, Apr 09, 2018 at 10:54:57PM +, Bart Van Assche wrote: > On Mon, 2018-04-09 at 14:54 +0800, Joseph Qi wrote: > > The oops happens during generic_make_request_checks(), in > > blk_throtl_bio() exactly. > > So if we want to bypass dying queue, we have to check this before > >

Re: [block regression] kernel oops triggered by removing scsi device dring IO

2018-04-09 Thread Bart Van Assche
On Mon, 2018-04-09 at 16:58 -0600, Jens Axboe wrote:
> This ends up being nutty in the generic_make_request() case, where we
> do the exact same enter/exit logic right after. That needs to get unified.
> Maybe move the queue enter into generic_make_request_checks(), and exit
> in the caller?
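The enter/exit unification suggested in the quoted text can be pictured with a small userspace model. This is only a sketch under stated assumptions: `struct model_queue` and every `model_*` function are hypothetical stand-ins invented for illustration, not the kernel's `request_queue`, `blk_queue_enter()` or `blk_queue_exit()`, and the real code uses a percpu reference rather than a plain counter.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a request queue: a usage count and a dying flag. */
struct model_queue {
    int usage;
    bool dying;
};

/* Take a usage reference unless the queue is already dying. */
bool model_queue_enter(struct model_queue *q)
{
    if (q->dying)
        return false;
    q->usage++;
    return true;
}

/* Drop a usage reference taken by model_queue_enter(). */
void model_queue_exit(struct model_queue *q)
{
    assert(q->usage > 0);
    q->usage--;
}

/*
 * Model of the checks step: it enters the queue itself and, on success,
 * hands the reference to its caller.
 */
bool model_request_checks(struct model_queue *q)
{
    if (!model_queue_enter(q))
        return false;
    /* Throttling and other checks would run here, under the reference. */
    return true;
}

/* Model of the caller: returns 0 on submission, -1 if the queue is dying. */
int model_make_request(struct model_queue *q)
{
    if (!model_request_checks(q))
        return -1;
    /* Submission would happen here, still under the same reference. */
    model_queue_exit(q);
    return 0;
}
```

In this model the queue is entered once and exited once per submission, and a dying queue is rejected before any check touches per-queue state.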

Re: [block regression] kernel oops triggered by removing scsi device dring IO

2018-04-09 Thread Jens Axboe
On 4/9/18 4:54 PM, Bart Van Assche wrote:
> On Mon, 2018-04-09 at 14:54 +0800, Joseph Qi wrote:
>> The oops happens during generic_make_request_checks(), in
>> blk_throtl_bio() exactly.
>> So if we want to bypass dying queue, we have to check this before
>> generic_make_request_checks(), I think.

Re: [block regression] kernel oops triggered by removing scsi device dring IO

2018-04-09 Thread Bart Van Assche
On Mon, 2018-04-09 at 14:54 +0800, Joseph Qi wrote:
> The oops happens during generic_make_request_checks(), in
> blk_throtl_bio() exactly.
> So if we want to bypass dying queue, we have to check this before
> generic_make_request_checks(), I think.
How about something like the patch below?

Re: [block regression] kernel oops triggered by removing scsi device dring IO

2018-04-09 Thread Joseph Qi
Hi Bart, On 18/4/9 12:47, Bart Van Assche wrote: > On Sun, 2018-04-08 at 12:21 +0800, Ming Lei wrote: >> The following kernel oops is triggered by 'removing scsi device' during >> heavy IO. > > Is the below patch sufficient to fix this? > > Thanks, > > Bart. > > > Subject: blk-mq: Avoid that

Re: [block regression] kernel oops triggered by removing scsi device dring IO

2018-04-08 Thread Bart Van Assche
On Sun, 2018-04-08 at 12:21 +0800, Ming Lei wrote:
> The following kernel oops is triggered by 'removing scsi device' during
> heavy IO.
Is the below patch sufficient to fix this?
Thanks,
Bart.
Subject: blk-mq: Avoid that submitting a bio concurrently with device removal triggers a crash

Re: [block regression] kernel oops triggered by removing scsi device dring IO

2018-04-08 Thread Ming Lei
On Mon, Apr 09, 2018 at 09:33:08AM +0800, Joseph Qi wrote: > Hi Bart, > > On 18/4/8 22:50, Bart Van Assche wrote: > > On Sun, 2018-04-08 at 12:21 +0800, Ming Lei wrote: > >> The following kernel oops is triggered by 'removing scsi device' during > >> heavy IO. > > > > How did you trigger this

Re: [block regression] kernel oops triggered by removing scsi device dring IO

2018-04-08 Thread Joseph Qi
Hi Bart, On 18/4/8 22:50, Bart Van Assche wrote: > On Sun, 2018-04-08 at 12:21 +0800, Ming Lei wrote: >> The following kernel oops is triggered by 'removing scsi device' during >> heavy IO. > > How did you trigger this oops? > I can reproduce this oops by the following steps: 1) start a fio

Re: [block regression] kernel oops triggered by removing scsi device dring IO

2018-04-08 Thread Bart Van Assche
On Sun, 2018-04-08 at 16:11 +0800, Joseph Qi wrote: > This is because scsi_remove_device() will call blk_cleanup_queue(), and > then all blkgs have been destroyed and root_blkg is NULL. > Thus tg is NULL and trigger NULL pointer dereference when get td from > tg (tg->td). > It seems that we cannot

Re: [block regression] kernel oops triggered by removing scsi device dring IO

2018-04-08 Thread Bart Van Assche
On Sun, 2018-04-08 at 12:21 +0800, Ming Lei wrote:
> The following kernel oops is triggered by 'removing scsi device' during
> heavy IO.
How did you trigger this oops?
Bart.

Re: [block regression] kernel oops triggered by removing scsi device dring IO

2018-04-08 Thread Ming Lei
On Sun, Apr 08, 2018 at 05:25:42PM +0800, Ming Lei wrote: > On Sun, Apr 08, 2018 at 04:11:51PM +0800, Joseph Qi wrote: > > This is because scsi_remove_device() will call blk_cleanup_queue(), and > > then all blkgs have been destroyed and root_blkg is NULL. > > Thus tg is NULL and trigger NULL

Re: [block regression] kernel oops triggered by removing scsi device dring IO

2018-04-08 Thread Ming Lei
On Sun, Apr 08, 2018 at 04:11:51PM +0800, Joseph Qi wrote: > This is because scsi_remove_device() will call blk_cleanup_queue(), and > then all blkgs have been destroyed and root_blkg is NULL. > Thus tg is NULL and trigger NULL pointer dereference when get td from > tg (tg->td). > It seems that we

Re: [block regression] kernel oops triggered by removing scsi device dring IO

2018-04-08 Thread Joseph Qi
This is because scsi_remove_device() will call blk_cleanup_queue(), and then all blkgs have been destroyed and root_blkg is NULL. Thus tg is NULL and trigger NULL pointer dereference when get td from tg (tg->td). It seems that we cannot simply move blkcg_exit_queue() up to blk_cleanup_queue().
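The failure chain described above (queue cleanup destroys the blkgs, root_blkg becomes NULL, so tg is NULL and tg->td faults) can be reproduced in a small userspace model. This is a hedged sketch only: the struct layouts and all `model_*` functions are simplified, hypothetical stand-ins, not the kernel definitions, and the NULL check shown is an illustration of where the dereference goes wrong, not the fix discussed in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins that reuse kernel names for readability only. */
struct throtl_data { int nr_queued; };
struct throtl_grp { struct throtl_data *td; };
struct blkcg_gq { struct throtl_grp *tg; };
struct request_queue { struct blkcg_gq *root_blkg; bool dying; };

/* Models queue teardown: the groups are gone, so root_blkg becomes NULL. */
void model_cleanup_queue(struct request_queue *q)
{
    q->dying = true;
    q->root_blkg = NULL;
}

/* Models the throttle-group lookup, falling back to the root group. */
struct throtl_grp *model_lookup_tg(const struct request_queue *q)
{
    return q->root_blkg ? q->root_blkg->tg : NULL;
}

/* Returns false when the bio has to be failed instead of throttled. */
bool model_throtl_bio(const struct request_queue *q)
{
    struct throtl_grp *tg = model_lookup_tg(q);

    if (!tg) /* Without this check, tg->td below dereferences NULL. */
        return false;
    return tg->td != NULL;
}

/* Small self-check: throttling works before cleanup and bails out after. */
int model_demo(void)
{
    struct throtl_data td = { 0 };
    struct throtl_grp tg = { &td };
    struct blkcg_gq blkg = { &tg };
    struct request_queue q = { &blkg, false };

    assert(model_throtl_bio(&q));
    model_cleanup_queue(&q);
    assert(!model_throtl_bio(&q));
    return 0;
}
```

The model only shows why a bio submitted after cleanup finds no throttle group; it says nothing about where blkcg_exit_queue() should be called.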