Re: 4.15.14 crash with iscsi target and dvd
On Fri, 2018-04-06 at 21:03 -0400, Wakko Warner wrote:
> Bart Van Assche wrote:
> > On Thu, 2018-04-05 at 22:06 -0400, Wakko Warner wrote:
> > > I know now why scsi_print_command isn't doing anything. cmd->cmnd is null.
> > > I added a dev_printk in scsi_print_command where the 2 if statements return.
> > > Logs:
> > > [ 29.866415] sr 3:0:0:0: cmd->cmnd is NULL
> >
> > That's something that should never happen. As one can see in scsi_setup_scsi_cmnd() and scsi_setup_fs_cmnd(), both functions initialize that pointer. Since I have not yet been able to reproduce what you reported myself, would it be possible for you to bisect this issue? You will need to follow something like the following procedure (see also https://git-scm.com/docs/git-bisect):
> >
> > git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> > git bisect start
> > git bisect bad v4.10
> > git bisect good v4.9
> >
> > and then build the kernel, install it, boot the kernel and test it. Depending on the result, run either git bisect bad or git bisect good and keep going until git bisect comes to a conclusion. This can take an hour or more.
>
> I have 1 question. Should make clean be done between tests? My box compiles the whole kernel in 2 minutes.

If you trust that the build system will figure out all dependencies then running make clean is not necessary. Personally I always run make clean during a bisect before rebuilding the kernel, because if a header file has changed in e.g. the block layer, a huge number of files have to be rebuilt anyway.

Bart.
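[Editor's sketch: one iteration of the bisect-and-test cycle Bart describes. The install and reboot steps are illustrative assumptions that depend on the tester's setup; they are not prescribed by the thread.]

	# repeat until "git bisect" names the first bad commit
	make clean                     # optional, per the discussion above
	make -j"$(nproc)"
	sudo make modules_install install
	sudo reboot                    # boot the new kernel, run the reproducer
	# ...then record the outcome:
	git bisect good                # or: git bisect bad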
Re: 4.15.14 crash with iscsi target and dvd
Bart Van Assche wrote:
> On Thu, 2018-04-05 at 22:06 -0400, Wakko Warner wrote:
> > I know now why scsi_print_command isn't doing anything. cmd->cmnd is null.
> > I added a dev_printk in scsi_print_command where the 2 if statements return.
> > Logs:
> > [ 29.866415] sr 3:0:0:0: cmd->cmnd is NULL
>
> That's something that should never happen. As one can see in scsi_setup_scsi_cmnd() and scsi_setup_fs_cmnd() both functions initialize that pointer. Since I have not yet been able to reproduce myself what you reported, would it be possible for you to bisect this issue? You will need to follow something like the following procedure (see also https://git-scm.com/docs/git-bisect):
>
> git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> git bisect start
> git bisect bad v4.10
> git bisect good v4.9
>
> and then build the kernel, install it, boot the kernel and test it. Depending on the result, run either git bisect bad or git bisect good and keep going until git bisect comes to a conclusion. This can take an hour or more.

I have 1 question. Should make clean be done between tests? My box compiles the whole kernel in 2 minutes.
--
Microsoft has beaten Volkswagen's world record. Volkswagen only created 22 million bugs.
Re: 4.15.14 crash with iscsi target and dvd
Bart Van Assche wrote:
> On Thu, 2018-04-05 at 22:06 -0400, Wakko Warner wrote:
> > I know now why scsi_print_command isn't doing anything. cmd->cmnd is null.
> > I added a dev_printk in scsi_print_command where the 2 if statements return.
> > Logs:
> > [ 29.866415] sr 3:0:0:0: cmd->cmnd is NULL
>
> That's something that should never happen. As one can see in scsi_setup_scsi_cmnd() and scsi_setup_fs_cmnd() both functions initialize that pointer. Since I have not yet been able to reproduce myself what you reported, would it be possible for you to bisect this issue? You will need to follow something like the following procedure (see also https://git-scm.com/docs/git-bisect):

I don't know how relevant it is, but this machine boots from nfs and exports its dvd drives over iscsi with the target modules. My scsi_target.lio is at the end. I removed the iqn name. The options are default except for a few; the non-default options I tabbed over.

eth0 is the nfs/localnet nic and eth1 is the nic that iscsi goes over.
eth0 is onboard pci 8086:1502 (subsystem 1028:05d3)
eth1 is pci 8086:107d (subsystem 8086:1084)
Both use the e1000e driver.

The initiator is running 4.4.107. When running on the initiator, /dev/sr1 is the target's /dev/sr0. Therefore:

cat /dev/sr1 > /dev/null seems to work.
mount /dev/sr1 /cdrom works.
find /cdrom -type f | xargs cat > /dev/null immediately crashes the target.

Burning to /dev/sr1 seems to work. I have another nic that uses igb instead; I'll see if that makes a difference.

> git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> git bisect start
> git bisect bad v4.10
> git bisect good v4.9
>
> and then build the kernel, install it, boot the kernel and test it. Depending on the result, run either git bisect bad or git bisect good and keep going until git bisect comes to a conclusion. This can take an hour or more.

I'll try this.
Here's my scsi_target.lio:

storage pscsi {
	disk dvd0 {
		path /dev/sr0
		attribute {
			emulate_3pc yes
			emulate_caw yes
			emulate_dpo no
			emulate_fua_read no
			emulate_model_alias no
			emulate_rest_reord no
			emulate_tas yes
			emulate_tpu no
			emulate_tpws no
			emulate_ua_intlck_ctrl no
			emulate_write_cache no
			enforce_pr_isids yes
			fabric_max_sectors 8192
			is_nonrot yes
			max_unmap_block_desc_count 0
			max_unmap_lba_count 0
			max_write_same_len 65535
			queue_depth 128
			unmap_granularity 0
			unmap_granularity_alignment 0
		}
	}
	disk dvd1 {
		path /dev/sr1
		attribute {
			emulate_3pc yes
			emulate_caw yes
			emulate_dpo no
			emulate_fua_read no
			emulate_model_alias no
			emulate_rest_reord no
			emulate_tas yes
			emulate_tpu no
			emulate_tpws no
			emulate_ua_intlck_ctrl no
			emulate_write_cache no
			enforce_pr_isids yes
			fabric_max_sectors 8192
			is_nonrot yes
			max_unmap_block_desc_count 0
			max_unmap_lba_count 0
			max_write_same_len 65535
			queue_depth 128
			unmap_granularity 0
			unmap_granularity_alignment 0
		}
	}
	disk dvd2 {
		path /dev/sr2
		attribute {
			emulate_3pc yes
			emulate_caw yes
			emulate_dpo no
			emulate_fua_read no
			emulate_model_alias no
			emulate_rest_reord no
			emulate_tas yes
			emulate_tpu no
			emulate_tpws no
			emulate_ua_intlck_ctrl no
			emulate_write_cache no
			enforce_pr_isids yes
			fabric_max_sectors 8192
			is_nonrot yes
			max_unmap_block_desc_count 0
			max_unmap_lba_count 0
			max_write_same_len 65535
			queue_depth 128
			unmap_granularity 0
			unmap_granularity_alignment 0
		}
	}
}
fabric iscsi {
	discovery_auth {
		enable no
		mutual_password ""
		mutual_userid ""
		password ""
		userid ""
	}
	target iqn.:dvd tpgt 1 {
		enable yes
		attribute {
			authentication no
			cache_dynamic_acls yes
			default_cmdsn_depth 64
			default_erl 0
			demo_mode_discovery yes
			demo_mode_write_protect no
			fabric_prot_type 0
			generate_node_acls yes
			login_timeout 15
			netif_timeout 2
			prod_mode_write_protect no
			t10_pi 0
			tpg_enabled_sendtargets 1
		}
		auth {
			password ""
			password_mutual ""
			userid ""
			userid_mutual ""
		}
		parameter {
			AuthMethod "CHAP,None"
			DataDigest "CRC32C,None"
			DataPDUInOrder yes
			DataSequenceInOrder yes
			DefaultTime2Retain 20
			DefaultTime2Wait 2
			ErrorRecoveryLevel no
			FirstBurstLength 65536
			HeaderDigest "CRC32C,None"
			IFMarkInt Reject
			IFMarker no
			ImmediateData yes
			InitialR2T yes
			MaxBurstLength 262144
			MaxConnections 1
			MaxOutstandingR2T 1
			MaxRecvDataSegmentLength 8192
			MaxXmitDataSegmentLength 262144
			OFMarkInt Reject
			OFMarker no
			TargetAlias "LIO Target"
		}
		lun 0 backend pscsi:dvd0
		lun 1 backend pscsi:dvd1
		lun 2 backend pscsi:dvd2
		portal 0.0.0.0:3260
	}
}
--
Microsoft has beaten Volkswagen's world record. Volkswagen only created 22 million bugs.
Re: [PATCH] blk-cgroup: remove entries in blkg_tree before queue release
Hi Alexandru,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on block/for-next]
[also build test ERROR on v4.16 next-20180406]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/Alexandru-Moise/blk-cgroup-remove-entries-in-blkg_tree-before-queue-release/20180407-035957
base: https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-next
config: c6x-evmc6678_defconfig (attached as .config)
compiler: c6x-elf-gcc (GCC) 7.2.0
reproduce:
	wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
	chmod +x ~/bin/make.cross
	# save the attached .config to linux build tree
	make.cross ARCH=c6x

All errors (new ones prefixed by >>):

   block/blk-sysfs.c: In function '__blk_release_queue':
>> block/blk-sysfs.c:820:2: error: implicit declaration of function 'blkg_destroy_all'; did you mean 'irq_destroy_ipi'? [-Werror=implicit-function-declaration]
     blkg_destroy_all(q);
     ^~~~
     irq_destroy_ipi
   cc1: some warnings being treated as errors

vim +820 block/blk-sysfs.c

   770	
   771	/**
   772	 * __blk_release_queue - release a request queue when it is no longer needed
   773	 * @work: pointer to the release_work member of the request queue to be released
   774	 *
   775	 * Description:
   776	 *     blk_release_queue is the counterpart of blk_init_queue(). It should be
   777	 *     called when a request queue is being released; typically when a block
   778	 *     device is being de-registered. Its primary task it to free the queue
   779	 *     itself.
   780	 *
   781	 * Notes:
   782	 *     The low level driver must have finished any outstanding requests first
   783	 *     via blk_cleanup_queue().
   784	 *
   785	 *     Although blk_release_queue() may be called with preemption disabled,
   786	 *     __blk_release_queue() may sleep.
   787	 */
   788	static void __blk_release_queue(struct work_struct *work)
   789	{
   790		struct request_queue *q = container_of(work, typeof(*q), release_work);
   791	
   792		if (test_bit(QUEUE_FLAG_POLL_STATS, &q->queue_flags))
   793			blk_stat_remove_callback(q, q->poll_cb);
   794		blk_stat_free_callback(q->poll_cb);
   795	
   796		blk_free_queue_stats(q->stats);
   797	
   798		blk_exit_rl(q, &q->root_rl);
   799	
   800		if (q->queue_tags)
   801			__blk_queue_free_tags(q);
   802	
   803		if (!q->mq_ops) {
   804			if (q->exit_rq_fn)
   805				q->exit_rq_fn(q, q->fq->flush_rq);
   806			blk_free_flush_queue(q->fq);
   807		} else {
   808			blk_mq_release(q);
   809		}
   810	
   811		blk_trace_shutdown(q);
   812	
   813		if (q->mq_ops)
   814			blk_mq_debugfs_unregister(q);
   815	
   816		if (q->bio_split)
   817			bioset_free(q->bio_split);
   818	
   819		spin_lock_irq(q->queue_lock);
 > 820		blkg_destroy_all(q);
   821		spin_unlock_irq(q->queue_lock);
   822	
   823		ida_simple_remove(&blk_queue_ida, q->id);
   824		call_rcu(&q->rcu_head, blk_free_queue_rcu);
   825	}
   826	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
[PATCH] block/compat_ioctl: fix range check in BLKGETSIZE
The kernel's unsigned long and compat_ulong_t may not be the same width. Use the compat type directly in the range check to eliminate the mismatch; otherwise, for large disks in 32-bit compat mode, the result is silent truncation rather than -EFBIG.

Signed-off-by: Khazhismel Kumykov
---
 block/compat_ioctl.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/compat_ioctl.c b/block/compat_ioctl.c
index 6ca015f92766..3a2c77f07da8 100644
--- a/block/compat_ioctl.c
+++ b/block/compat_ioctl.c
@@ -388,7 +388,7 @@ long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg)
 		return 0;
 	case BLKGETSIZE:
 		size = i_size_read(bdev->bd_inode);
-		if ((size >> 9) > ~0UL)
+		if ((size >> 9) > ~((compat_ulong_t)0UL))
 			return -EFBIG;
 		return compat_put_ulong(arg, size >> 9);
-- 
2.17.0.484.g0c8726318c-goog
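[Editor's sketch: a userspace illustration of the check being fixed, not kernel code. On an LP64 kernel, ~0UL is a 64-bit all-ones value, so "(size >> 9) > ~0UL" can never be true and the sector count handed to a 32-bit caller silently truncates; comparing against the 32-bit type's maximum restores the -EFBIG path. compat_ulong_t is modeled here as uint32_t.]

	#include <stdio.h>
	#include <stdint.h>

	typedef uint32_t compat_ulong_t;   /* stand-in for the kernel type */

	int main(void)
	{
		uint64_t size = 4ULL << 40;    /* a 4 TiB block device */
		uint64_t sectors = size >> 9;  /* 512-byte sector count: 2^33 */

		/* old check: never fires when unsigned long is 64 bits wide */
		printf("old check fires: %d\n", sectors > ~0UL);
		/* new check: fires, so the ioctl returns -EFBIG
		 * instead of truncating the value */
		printf("new check fires: %d\n", sectors > ~(compat_ulong_t)0);
		return 0;
	}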
Re: [PATCH V3 4/4] genirq/affinity: irq vector spread among online CPUs as far as possible
On Fri, 6 Apr 2018, Thomas Gleixner wrote:
> On Fri, 6 Apr 2018, Ming Lei wrote:
> > I will post V4 soon by using cpu_present_mask in the 1st stage irq spread. And it should work fine for Kashyap's case in normal cases.
>
> No need to resend. I've changed it already and will push it out after lunch.

No. Lunch did not last that long :)

I pushed out the lot to

  git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git irq/core

Please double check the modifications I did. The first related commit fixes an existing error handling bug.

Thanks,

	tglx
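[Editor's sketch: fetching that branch for local review is standard git usage; the commands below are illustrative and not from the message.]

	git fetch git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git irq/core
	git log --oneline FETCH_HEAD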
Re: [PATCH] blk-cgroup: remove entries in blkg_tree before queue release
Hi Alexandru,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on block/for-next]
[also build test ERROR on v4.16 next-20180406]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/Alexandru-Moise/blk-cgroup-remove-entries-in-blkg_tree-before-queue-release/20180407-035957
base: https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-next
config: x86_64-randconfig-x001-201813 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
	# save the attached .config to linux build tree
	make ARCH=x86_64

All errors (new ones prefixed by >>):

   block/blk-sysfs.c: In function '__blk_release_queue':
>> block/blk-sysfs.c:820:2: error: implicit declaration of function 'blkg_destroy_all'; did you mean 'blk_stat_add'? [-Werror=implicit-function-declaration]
     blkg_destroy_all(q);
     ^~~~
     blk_stat_add
   cc1: some warnings being treated as errors

vim +820 block/blk-sysfs.c

   770	
   771	/**
   772	 * __blk_release_queue - release a request queue when it is no longer needed
   773	 * @work: pointer to the release_work member of the request queue to be released
   774	 *
   775	 * Description:
   776	 *     blk_release_queue is the counterpart of blk_init_queue(). It should be
   777	 *     called when a request queue is being released; typically when a block
   778	 *     device is being de-registered. Its primary task it to free the queue
   779	 *     itself.
   780	 *
   781	 * Notes:
   782	 *     The low level driver must have finished any outstanding requests first
   783	 *     via blk_cleanup_queue().
   784	 *
   785	 *     Although blk_release_queue() may be called with preemption disabled,
   786	 *     __blk_release_queue() may sleep.
   787	 */
   788	static void __blk_release_queue(struct work_struct *work)
   789	{
   790		struct request_queue *q = container_of(work, typeof(*q), release_work);
   791	
   792		if (test_bit(QUEUE_FLAG_POLL_STATS, &q->queue_flags))
   793			blk_stat_remove_callback(q, q->poll_cb);
   794		blk_stat_free_callback(q->poll_cb);
   795	
   796		blk_free_queue_stats(q->stats);
   797	
   798		blk_exit_rl(q, &q->root_rl);
   799	
   800		if (q->queue_tags)
   801			__blk_queue_free_tags(q);
   802	
   803		if (!q->mq_ops) {
   804			if (q->exit_rq_fn)
   805				q->exit_rq_fn(q, q->fq->flush_rq);
   806			blk_free_flush_queue(q->fq);
   807		} else {
   808			blk_mq_release(q);
   809		}
   810	
   811		blk_trace_shutdown(q);
   812	
   813		if (q->mq_ops)
   814			blk_mq_debugfs_unregister(q);
   815	
   816		if (q->bio_split)
   817			bioset_free(q->bio_split);
   818	
   819		spin_lock_irq(q->queue_lock);
 > 820		blkg_destroy_all(q);
   821		spin_unlock_irq(q->queue_lock);
   822	
   823		ida_simple_remove(&blk_queue_ida, q->id);
   824		call_rcu(&q->rcu_head, blk_free_queue_rcu);
   825	}
   826	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
Re: [PATCH] loop: fix LOOP_GET_STATUS lock imbalance
On 4/6/18 10:57 AM, Omar Sandoval wrote:
> From: Omar Sandoval
>
> Commit 2d1d4c1e591f made loop_get_status() drop lo_ctl_mutex before returning, but the loop_get_status_old(), loop_get_status64(), and loop_get_status_compat() wrappers don't call loop_get_status() if the passed argument is NULL. The callers expect that the lock is dropped, so make sure we drop it in that case, too.

Applied, thanks Omar.

--
Jens Axboe
Re: Multi-Actuator SAS HDD First Look
On 2018-04-06 02:42 AM, Christoph Hellwig wrote:
> On Fri, Apr 06, 2018 at 08:24:18AM +0200, Hannes Reinecke wrote:
> > Ah. Far better. What about delegating FORMAT UNIT to the control LUN, and not implementing it for the individual disk LUNs? That would make an even stronger case for having a control LUN; with that there wouldn't be any problem with having to synchronize across LUNs etc.
>
> It sounds to me like NVMe might be a much better model for this drive than SCSI, btw :)

So you found a document that outlines NVMe's architecture! Could you share the url (no marketing BS, please)?

And a serious question ... How would you map NVMe's (in Linux) subsystem number, controller device minor number, CNTLID field (Identify ctl response) and namespace id onto the SCSI subsystem's h:c:t:l ?

Doug Gilbert
Re: [PATCH] loop: fix LOOP_GET_STATUS lock imbalance
On Fri, Apr 06, 2018 at 09:57:03AM -0700, Omar Sandoval wrote:
> From: Omar Sandoval
>
> Commit 2d1d4c1e591f made loop_get_status() drop lo_ctl_mutex before returning, but the loop_get_status_old(), loop_get_status64(), and loop_get_status_compat() wrappers don't call loop_get_status() if the passed argument is NULL. The callers expect that the lock is dropped, so make sure we drop it in that case, too.
>
> Reported-by: syzbot+31e8daa8b3fc129e7...@syzkaller.appspotmail.com
> Fixes: 2d1d4c1e591f ("loop: don't call into filesystem while holding lo_ctl_mutex")
> Signed-off-by: Omar Sandoval
> ---
> Based on Linus' tree.
>
>  drivers/block/loop.c | 33 +++++++++++++++++---------------
>  1 file changed, 18 insertions(+), 15 deletions(-)

Also just pushed a regression test to blktests:

140ee15de9f3 ("loop: add ioctl lock imbalance regression test")
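[Editor's sketch: running that regression test with blktests (https://github.com/osandov/blktests). The test's number within the "loop" group isn't given in this message, so this runs the whole group; assumes root and loop device support.]

	git clone https://github.com/osandov/blktests.git
	cd blktests
	sudo ./check loop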
[PATCH] loop: fix LOOP_GET_STATUS lock imbalance
From: Omar Sandoval

Commit 2d1d4c1e591f made loop_get_status() drop lo_ctl_mutex before returning, but the loop_get_status_old(), loop_get_status64(), and loop_get_status_compat() wrappers don't call loop_get_status() if the passed argument is NULL. The callers expect that the lock is dropped, so make sure we drop it in that case, too.

Reported-by: syzbot+31e8daa8b3fc129e7...@syzkaller.appspotmail.com
Fixes: 2d1d4c1e591f ("loop: don't call into filesystem while holding lo_ctl_mutex")
Signed-off-by: Omar Sandoval
---
Based on Linus' tree.

 drivers/block/loop.c | 33 +++++++++++++++++---------------
 1 file changed, 18 insertions(+), 15 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 264abaaff662..9b476fd2bc41 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1283,12 +1283,13 @@ loop_get_status_old(struct loop_device *lo, struct loop_info __user *arg)
 {
 	struct loop_info info;
 	struct loop_info64 info64;
-	int err = 0;
+	int err;
 
-	if (!arg)
-		err = -EINVAL;
-	if (!err)
-		err = loop_get_status(lo, &info64);
+	if (!arg) {
+		mutex_unlock(&lo->lo_ctl_mutex);
+		return -EINVAL;
+	}
+	err = loop_get_status(lo, &info64);
 	if (!err)
 		err = loop_info64_to_old(&info64, &info);
 	if (!err && copy_to_user(arg, &info, sizeof(info)))
@@ -1300,12 +1301,13 @@ loop_get_status_old(struct loop_device *lo, struct loop_info __user *arg)
 static int
 loop_get_status64(struct loop_device *lo, struct loop_info64 __user *arg)
 {
 	struct loop_info64 info64;
-	int err = 0;
+	int err;
 
-	if (!arg)
-		err = -EINVAL;
-	if (!err)
-		err = loop_get_status(lo, &info64);
+	if (!arg) {
+		mutex_unlock(&lo->lo_ctl_mutex);
+		return -EINVAL;
+	}
+	err = loop_get_status(lo, &info64);
 	if (!err && copy_to_user(arg, &info64, sizeof(info64)))
 		err = -EFAULT;
 
@@ -1529,12 +1531,13 @@ loop_get_status_compat(struct loop_device *lo, struct compat_loop_info __user *arg)
 {
 	struct loop_info64 info64;
-	int err = 0;
+	int err;
 
-	if (!arg)
-		err = -EINVAL;
-	if (!err)
-		err = loop_get_status(lo, &info64);
+	if (!arg) {
+		mutex_unlock(&lo->lo_ctl_mutex);
+		return -EINVAL;
+	}
+	err = loop_get_status(lo, &info64);
 	if (!err)
 		err = loop_info64_to_compat(&info64, arg);
 	return err;
-- 
2.17.0
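[Editor's sketch: a minimal userspace reproduction of the case this patch fixes, essentially what the syzkaller reproducer exercises. The device path is illustrative and assumes a configured loop device; passing a NULL argument takes the early -EINVAL path, which before this patch returned with lo_ctl_mutex still held.]

	#include <stdio.h>
	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <linux/loop.h>

	int main(void)
	{
		int fd = open("/dev/loop0", O_RDWR);  /* path is illustrative */
		if (fd < 0) { perror("open"); return 1; }

		/* NULL arg: returns -EINVAL, and previously leaked
		 * lo_ctl_mutex, which lockdep flagged on the return
		 * to user space */
		if (ioctl(fd, LOOP_GET_STATUS64, 0) < 0)
			perror("LOOP_GET_STATUS64(NULL)");

		close(fd);
		return 0;
	}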
Re: [PATCH] blk-mq: only run mapped hw queues in blk_mq_run_hw_queues()
On Fri, Apr 06, 2018 at 05:11:53PM +0200, Christian Borntraeger wrote:
> On 04/06/2018 04:58 PM, Ming Lei wrote:
>> On Fri, Apr 06, 2018 at 04:26:49PM +0200, Christian Borntraeger wrote:
>>> On 04/06/2018 03:41 PM, Ming Lei wrote:
>>>> On Fri, Apr 06, 2018 at 12:19:19PM +0200, Christian Borntraeger wrote:
>>>>> On 04/06/2018 11:23 AM, Ming Lei wrote:
>>>>>> On Fri, Apr 06, 2018 at 10:51:28AM +0200, Christian Borntraeger wrote:
>>>>>>> On 04/06/2018 10:41 AM, Ming Lei wrote:
>>>>>>>> On Thu, Apr 05, 2018 at 07:39:56PM +0200, Christian Borntraeger wrote:
>>>>>>>>> On 04/05/2018 06:11 PM, Ming Lei wrote:
>>>>>>>>>>> Could you please apply the following patch and provide the dmesg boot log?
>>>>>>>>>>
>>>>>>>>>> And please post out the 'lscpu' log together from the test machine too.
>>>>>>>>>
>>>>>>>>> attached.
>>>>>>>>>
>>>>>>>>> As I said before this seems to go away with CONFIG_NR_CPUS=64 or smaller. We have 282 nr_cpu_ids here (max 141 CPUs on that z13 with SMT2) but only 8 Cores == 16 threads.
>>>>>>>>
>>>>>>>> OK, thanks!
>>>>>>>>
>>>>>>>> The most weird thing is that hctx->next_cpu is computed as 512 since nr_cpu_id is 282, and hctx->next_cpu should have pointed to one of possible CPU.
>>>>>>>>
>>>>>>>> Looks like it is a s390 specific issue, since I can setup one queue which has same mapping with yours:
>>>>>>>>
>>>>>>>> - nr_cpu_id is 282
>>>>>>>> - CPU 0~15 is online
>>>>>>>> - 64 queues null_blk
>>>>>>>> - still run all hw queues in .complete handler
>>>>>>>>
>>>>>>>> But can't reproduce this issue at all.
>>>>>>>>
>>>>>>>> So please test the following patch, which may tell us why hctx->next_cpu is computed wrong:
>>>>>>>
>>>>>>> I see things like
>>>>>>>
>>>>>>> [8.196907] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>>>>>>> [8.196910] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>>>>>>> [8.196912] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>>>>>>> [8.196913] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>>>>>>> [8.196914] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>>>>>>> [8.196915] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>>>>>>> [8.196916] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>>>>>>> [8.196916] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>>>>>>> [8.196917] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>>>>>>> [8.196918] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>>>>>>>
>>>>>>> which is exactly what happens if the find and and operation fails (returns size of bitmap).
>>>>>>
>>>>>> Given both 'cpu_online_mask' and 'hctx->cpumask' are shown as correct in your previous debug log, it means the following function returns totally wrong result on S390.
>>>>>>
>>>>>> cpumask_first_and(hctx->cpumask, cpu_online_mask);
>>>>>>
>>>>>> The debugfs log shows that each hctx->cpumask includes one online CPU(0~15).
>>>>>
>>>>> Really? the last log (with the latest patch applied) shows a lot of contexts that do not have CPUs in 0-15:
>>>>>
>>>>> e.g.
>>>>> [4.049828] dump CPUs mapped to this hctx:
>>>>> [4.049829] 18
>>>>> [4.049829] 82
>>>>> [4.049830] 146
>>>>> [4.049830] 210
>>>>> [4.049831] 274
>>>>
>>>> That won't be an issue, since no IO can be submitted from these offline CPUs, then these hctx shouldn't have been run at all.
>>>>
>>>> But hctx->next_cpu can be set as 512 for these inactive hctx in blk_mq_map_swqueue(), then please test the attached patch, and if hctx->next_cpu is still set as 512, something is still wrong.
>>>
>>> With this patch I no longer see the "run queue from wrong CPU x, hctx active" messages. Your debug code still triggers, though.
>>>
>>> wrong next_cpu 512, blk_mq_hctx_next_cpu, first_and
>>> wrong next_cpu 512, blk_mq_hctx_next_cpu, next_and
>>>
>>> If we would remove the debug code then dmesg would be clean it seems.
>>
>> That is still a bit strange, since for any inactive hctx (without online CPU mapped), blk_mq_run_hw_queue() will check blk_mq_hctx_has_pending()
>
> I think for next_and it is reasonable to see this, as the next_and will return 512 after we have used the last one. In fact the code does call first_and in that case for a reason, no?

It is possible for dumping 'first_and' when there isn't any online CPUs mapped to this hctx.

But my question is that for this case, there shouldn't be any IO queued for this hctx, and blk_mq_hctx_has_pending() has been called to check that, so blk_mq_hctx_next_cpu() should only have been called when blk_mq_hctx_has_pending() in blk_mq_run_hw_queue() is true.

Thanks,
Ming
Re: [PATCH] blk-mq: only run mapped hw queues in blk_mq_run_hw_queues()
On 04/06/2018 04:58 PM, Ming Lei wrote:
> On Fri, Apr 06, 2018 at 04:26:49PM +0200, Christian Borntraeger wrote:
>> On 04/06/2018 03:41 PM, Ming Lei wrote:
>>> On Fri, Apr 06, 2018 at 12:19:19PM +0200, Christian Borntraeger wrote:
>>>> On 04/06/2018 11:23 AM, Ming Lei wrote:
>>>>> On Fri, Apr 06, 2018 at 10:51:28AM +0200, Christian Borntraeger wrote:
>>>>>> On 04/06/2018 10:41 AM, Ming Lei wrote:
>>>>>>> On Thu, Apr 05, 2018 at 07:39:56PM +0200, Christian Borntraeger wrote:
>>>>>>>> On 04/05/2018 06:11 PM, Ming Lei wrote:
>>>>>>>>>> Could you please apply the following patch and provide the dmesg boot log?
>>>>>>>>>
>>>>>>>>> And please post out the 'lscpu' log together from the test machine too.
>>>>>>>>
>>>>>>>> attached.
>>>>>>>>
>>>>>>>> As I said before this seems to go away with CONFIG_NR_CPUS=64 or smaller. We have 282 nr_cpu_ids here (max 141 CPUs on that z13 with SMT2) but only 8 Cores == 16 threads.
>>>>>>>
>>>>>>> OK, thanks!
>>>>>>>
>>>>>>> The most weird thing is that hctx->next_cpu is computed as 512 since nr_cpu_id is 282, and hctx->next_cpu should have pointed to one of possible CPU.
>>>>>>>
>>>>>>> Looks like it is a s390 specific issue, since I can setup one queue which has same mapping with yours:
>>>>>>>
>>>>>>> - nr_cpu_id is 282
>>>>>>> - CPU 0~15 is online
>>>>>>> - 64 queues null_blk
>>>>>>> - still run all hw queues in .complete handler
>>>>>>>
>>>>>>> But can't reproduce this issue at all.
>>>>>>>
>>>>>>> So please test the following patch, which may tell us why hctx->next_cpu is computed wrong:
>>>>>>
>>>>>> I see things like
>>>>>>
>>>>>> [8.196907] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>>>>>> [8.196910] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>>>>>> [8.196912] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>>>>>> [8.196913] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>>>>>> [8.196914] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>>>>>> [8.196915] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>>>>>> [8.196916] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>>>>>> [8.196916] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>>>>>> [8.196917] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>>>>>> [8.196918] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>>>>>>
>>>>>> which is exactly what happens if the find and and operation fails (returns size of bitmap).
>>>>>
>>>>> Given both 'cpu_online_mask' and 'hctx->cpumask' are shown as correct in your previous debug log, it means the following function returns totally wrong result on S390.
>>>>>
>>>>> cpumask_first_and(hctx->cpumask, cpu_online_mask);
>>>>>
>>>>> The debugfs log shows that each hctx->cpumask includes one online CPU(0~15).
>>>>
>>>> Really? the last log (with the latest patch applied) shows a lot of contexts that do not have CPUs in 0-15:
>>>>
>>>> e.g.
>>>> [4.049828] dump CPUs mapped to this hctx:
>>>> [4.049829] 18
>>>> [4.049829] 82
>>>> [4.049830] 146
>>>> [4.049830] 210
>>>> [4.049831] 274
>>>
>>> That won't be an issue, since no IO can be submitted from these offline CPUs, then these hctx shouldn't have been run at all.
>>>
>>> But hctx->next_cpu can be set as 512 for these inactive hctx in blk_mq_map_swqueue(), then please test the attached patch, and if hctx->next_cpu is still set as 512, something is still wrong.
>>
>> With this patch I no longer see the "run queue from wrong CPU x, hctx active" messages. Your debug code still triggers, though.
>>
>> wrong next_cpu 512, blk_mq_hctx_next_cpu, first_and
>> wrong next_cpu 512, blk_mq_hctx_next_cpu, next_and
>>
>> If we would remove the debug code then dmesg would be clean it seems.
>
> That is still a bit strange, since for any inactive hctx (without online CPU mapped), blk_mq_run_hw_queue() will check blk_mq_hctx_has_pending()

I think for next_and it is reasonable to see this, as the next_and will return 512 after we have used the last one. In fact the code does call first_and in that case for a reason, no?

> first. And there shouldn't be any pending IO for all inactive hctx in your case, so looks blk_mq_hctx_next_cpu() shouldn't be called for inactive hctx.
>
> I will prepare one patchset and post out soon, and hope all these issues can be covered.
>
> Thanks,
> Ming
[PATCH] blk-cgroup: remove entries in blkg_tree before queue release
The q->id is used as an index within the blkg_tree radix tree. If the entry is not released before the blk_queue_ida id is reclaimed, a later blkcg_init_queue() in a different driver that is handed the same id can fail, because an entry already exists at that index of the radix tree.

Signed-off-by: Alexandru Moise <00moses.alexande...@gmail.com>
---
 block/blk-cgroup.c         | 2 +-
 block/blk-sysfs.c          | 4 ++++
 include/linux/blk-cgroup.h | 2 ++
 3 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 1c16694ae145..224e937dbb59 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -369,7 +369,7 @@ static void blkg_destroy(struct blkcg_gq *blkg)
  *
  * Destroy all blkgs associated with @q.
  */
-static void blkg_destroy_all(struct request_queue *q)
+void blkg_destroy_all(struct request_queue *q)
 {
 	struct blkcg_gq *blkg, *n;
 
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index d00d1b0ec109..a72866458f22 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -816,6 +816,10 @@ static void __blk_release_queue(struct work_struct *work)
 	if (q->bio_split)
 		bioset_free(q->bio_split);
 
+	spin_lock_irq(q->queue_lock);
+	blkg_destroy_all(q);
+	spin_unlock_irq(q->queue_lock);
+
 	ida_simple_remove(&blk_queue_ida, q->id);
 	call_rcu(&q->rcu_head, blk_free_queue_rcu);
 }
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 6c666fd7de3c..4470fdb6ea8f 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -179,6 +179,8 @@ struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
 int blkcg_init_queue(struct request_queue *q);
 void blkcg_drain_queue(struct request_queue *q);
 void blkcg_exit_queue(struct request_queue *q);
+void blkg_destroy_all(struct request_queue *q);
+
 
 /* Blkio controller policy registration */
 int blkcg_policy_register(struct blkcg_policy *pol);
-- 
2.16.3
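[Editor's sketch: the kbuild robot failures earlier in this digest ("implicit declaration of function 'blkg_destroy_all'") come from configs with CONFIG_BLK_CGROUP disabled, where blk-cgroup.h only provides stubs. Assuming the header's existing #ifdef layout, the prototype added above would need a matching no-op in the !CONFIG_BLK_CGROUP branch so that __blk_release_queue() still compiles there.]

	/* include/linux/blk-cgroup.h (sketch only) */
	#ifdef CONFIG_BLK_CGROUP
	void blkg_destroy_all(struct request_queue *q);
	#else
	/* no block cgroups configured: nothing to tear down */
	static inline void blkg_destroy_all(struct request_queue *q) { }
	#endif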
Re: WARNING: lock held when returning to user space!
On 4/6/18 8:57 AM, Dmitry Vyukov wrote:
> On Fri, Apr 6, 2018 at 4:27 PM, Jens Axboe wrote:
>> On 4/6/18 7:02 AM, syzbot wrote:
>>> Hello,
>>>
>>> syzbot hit the following crash on upstream commit
>>> 38c23685b273cfb4ccf31a199feccce3bdcb5d83 (Fri Apr 6 04:29:35 2018 +0000)
>>> Merge tag 'armsoc-drivers' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
>>> syzbot dashboard link: https://syzkaller.appspot.com/bug?extid=31e8daa8b3fc129e75f2
>>>
>>> So far this crash happened 9 times on upstream.
>>> C reproducer: https://syzkaller.appspot.com/x/repro.c?id=6407930337296384
>>> syzkaller reproducer: https://syzkaller.appspot.com/x/repro.syz?id=4942413340606464
>>> Raw console output: https://syzkaller.appspot.com/x/log.txt?id=4764483918495744
>>> Kernel config: https://syzkaller.appspot.com/x/.config?id=-5813481738265533882
>>> compiler: gcc (GCC) 8.0.1 20180301 (experimental)
>>>
>>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>>> Reported-by: syzbot+31e8daa8b3fc129e7...@syzkaller.appspotmail.com
>>> It will help syzbot understand when the bug is fixed. See footer for details.
>>> If you forward the report, please keep this part and the footer.
>>>
>>> WARNING: lock held when returning to user space!
>>> 4.16.0+ #3 Not tainted
>>>
>>> syzkaller433111/4462 is leaving the kernel with locks still held!
>>> 1 lock held by syzkaller433111/4462:
>>>  #0: 03a06fae (&lo->lo_ctl_mutex/1){+.+.}, at: lo_ioctl+0x8d/0x1ec0 drivers/block/loop.c:1363
>>
>> Is this a new regression? Omar did just fiddle with the locking a bit, seems suspicious.
>
> Looking at: https://syzkaller.appspot.com/bug?extid=31e8daa8b3fc129e75f2
> It first happened 4 hours ago and 9 times since then, so probably a just introduced regression.

After writing that, I saw the discussion in another thread ("INFO: task hung in lo_ioctl"), so I think we can definitely say that it's a recently introduced regression in loop due to the killable lock changes.

--
Jens Axboe
Re: [PATCH] blk-mq: only run mapped hw queues in blk_mq_run_hw_queues()
On Fri, Apr 06, 2018 at 04:26:49PM +0200, Christian Borntraeger wrote:
> On 04/06/2018 03:41 PM, Ming Lei wrote:
>> On Fri, Apr 06, 2018 at 12:19:19PM +0200, Christian Borntraeger wrote:
>>> On 04/06/2018 11:23 AM, Ming Lei wrote:
>>>> On Fri, Apr 06, 2018 at 10:51:28AM +0200, Christian Borntraeger wrote:
>>>>> On 04/06/2018 10:41 AM, Ming Lei wrote:
>>>>>> On Thu, Apr 05, 2018 at 07:39:56PM +0200, Christian Borntraeger wrote:
>>>>>>> On 04/05/2018 06:11 PM, Ming Lei wrote:
>>>>>>>>> Could you please apply the following patch and provide the dmesg boot log?
>>>>>>>>
>>>>>>>> And please post out the 'lscpu' log together from the test machine too.
>>>>>>>
>>>>>>> attached.
>>>>>>>
>>>>>>> As I said before this seems to go away with CONFIG_NR_CPUS=64 or smaller. We have 282 nr_cpu_ids here (max 141 CPUs on that z13 with SMT2) but only 8 Cores == 16 threads.
>>>>>>
>>>>>> OK, thanks!
>>>>>>
>>>>>> The most weird thing is that hctx->next_cpu is computed as 512 since nr_cpu_id is 282, and hctx->next_cpu should have pointed to one of possible CPU.
>>>>>>
>>>>>> Looks like it is a s390 specific issue, since I can setup one queue which has same mapping with yours:
>>>>>>
>>>>>> - nr_cpu_id is 282
>>>>>> - CPU 0~15 is online
>>>>>> - 64 queues null_blk
>>>>>> - still run all hw queues in .complete handler
>>>>>>
>>>>>> But can't reproduce this issue at all.
>>>>>>
>>>>>> So please test the following patch, which may tell us why hctx->next_cpu is computed wrong:
>>>>>
>>>>> I see things like
>>>>>
>>>>> [8.196907] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>>>>> [8.196910] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>>>>> [8.196912] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>>>>> [8.196913] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>>>>> [8.196914] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>>>>> [8.196915] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>>>>> [8.196916] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>>>>> [8.196916] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>>>>> [8.196917] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>>>>> [8.196918] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>>>>>
>>>>> which is exactly what happens if the find and and operation fails (returns size of bitmap).
>>>>
>>>> Given both 'cpu_online_mask' and 'hctx->cpumask' are shown as correct in your previous debug log, it means the following function returns totally wrong result on S390.
>>>>
>>>> cpumask_first_and(hctx->cpumask, cpu_online_mask);
>>>>
>>>> The debugfs log shows that each hctx->cpumask includes one online CPU(0~15).
>>>
>>> Really? the last log (with the latest patch applied) shows a lot of contexts that do not have CPUs in 0-15:
>>>
>>> e.g.
>>> [4.049828] dump CPUs mapped to this hctx:
>>> [4.049829] 18
>>> [4.049829] 82
>>> [4.049830] 146
>>> [4.049830] 210
>>> [4.049831] 274
>>
>> That won't be an issue, since no IO can be submitted from these offline CPUs, then these hctx shouldn't have been run at all.
>>
>> But hctx->next_cpu can be set as 512 for these inactive hctx in blk_mq_map_swqueue(), then please test the attached patch, and if hctx->next_cpu is still set as 512, something is still wrong.
>
> With this patch I no longer see the "run queue from wrong CPU x, hctx active" messages. Your debug code still triggers, though.
>
> wrong next_cpu 512, blk_mq_hctx_next_cpu, first_and
> wrong next_cpu 512, blk_mq_hctx_next_cpu, next_and
>
> If we would remove the debug code then dmesg would be clean it seems.

That is still a bit strange, since for any inactive hctx (without online CPU mapped), blk_mq_run_hw_queue() will check blk_mq_hctx_has_pending() first. And there shouldn't be any pending IO for all inactive hctx in your case, so looks blk_mq_hctx_next_cpu() shouldn't be called for inactive hctx.

I will prepare one patchset and post out soon, and hope all these issues can be covered.

Thanks,
Ming
Re: WARNING: lock held when returning to user space!
On Fri, Apr 6, 2018 at 4:27 PM, Jens Axboe wrote:
> On 4/6/18 7:02 AM, syzbot wrote:
>> Hello,
>>
>> syzbot hit the following crash on upstream commit
>> 38c23685b273cfb4ccf31a199feccce3bdcb5d83 (Fri Apr 6 04:29:35 2018 +0000)
>> Merge tag 'armsoc-drivers' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
>> syzbot dashboard link: https://syzkaller.appspot.com/bug?extid=31e8daa8b3fc129e75f2
>>
>> So far this crash happened 9 times on upstream.
>> C reproducer: https://syzkaller.appspot.com/x/repro.c?id=6407930337296384
>> syzkaller reproducer: https://syzkaller.appspot.com/x/repro.syz?id=4942413340606464
>> Raw console output: https://syzkaller.appspot.com/x/log.txt?id=4764483918495744
>> Kernel config: https://syzkaller.appspot.com/x/.config?id=-5813481738265533882
>> compiler: gcc (GCC) 8.0.1 20180301 (experimental)
>>
>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>> Reported-by: syzbot+31e8daa8b3fc129e7...@syzkaller.appspotmail.com
>> It will help syzbot understand when the bug is fixed. See footer for details.
>> If you forward the report, please keep this part and the footer.
>>
>> WARNING: lock held when returning to user space!
>> 4.16.0+ #3 Not tainted
>>
>> syzkaller433111/4462 is leaving the kernel with locks still held!
>> 1 lock held by syzkaller433111/4462:
>>  #0: 03a06fae (&lo->lo_ctl_mutex/1){+.+.}, at: lo_ioctl+0x8d/0x1ec0 drivers/block/loop.c:1363
>
> Is this a new regression? Omar did just fiddle with the locking a bit, seems suspicious.

Looking at: https://syzkaller.appspot.com/bug?extid=31e8daa8b3fc129e75f2
It first happened 4 hours ago and 9 times since then, so probably a just introduced regression.

> --
> Jens Axboe
Re: WARNING: lock held when returning to user space!
On 4/6/18 7:02 AM, syzbot wrote:
> Hello,
>
> syzbot hit the following crash on upstream commit
> 38c23685b273cfb4ccf31a199feccce3bdcb5d83 (Fri Apr 6 04:29:35 2018 +0000)
> Merge tag 'armsoc-drivers' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
> syzbot dashboard link: https://syzkaller.appspot.com/bug?extid=31e8daa8b3fc129e75f2
>
> So far this crash happened 9 times on upstream.
> C reproducer: https://syzkaller.appspot.com/x/repro.c?id=6407930337296384
> syzkaller reproducer: https://syzkaller.appspot.com/x/repro.syz?id=4942413340606464
> Raw console output: https://syzkaller.appspot.com/x/log.txt?id=4764483918495744
> Kernel config: https://syzkaller.appspot.com/x/.config?id=-5813481738265533882
> compiler: gcc (GCC) 8.0.1 20180301 (experimental)
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+31e8daa8b3fc129e7...@syzkaller.appspotmail.com
> It will help syzbot understand when the bug is fixed. See footer for details.
> If you forward the report, please keep this part and the footer.
>
> WARNING: lock held when returning to user space!
> 4.16.0+ #3 Not tainted
>
> syzkaller433111/4462 is leaving the kernel with locks still held!
> 1 lock held by syzkaller433111/4462:
>  #0: 03a06fae (&lo->lo_ctl_mutex/1){+.+.}, at: lo_ioctl+0x8d/0x1ec0 drivers/block/loop.c:1363

Is this a new regression? Omar did just fiddle with the locking a bit, seems suspicious.

--
Jens Axboe
Re: [PATCH] blk-mq: only run mapped hw queues in blk_mq_run_hw_queues()
On 04/06/2018 03:41 PM, Ming Lei wrote:
> On Fri, Apr 06, 2018 at 12:19:19PM +0200, Christian Borntraeger wrote:
>> On 04/06/2018 11:23 AM, Ming Lei wrote:
>>> On Fri, Apr 06, 2018 at 10:51:28AM +0200, Christian Borntraeger wrote:
>>>> On 04/06/2018 10:41 AM, Ming Lei wrote:
>>>>> On Thu, Apr 05, 2018 at 07:39:56PM +0200, Christian Borntraeger wrote:
>>>>>> On 04/05/2018 06:11 PM, Ming Lei wrote:
>>>>>>>> Could you please apply the following patch and provide the dmesg boot log?
>>>>>>>
>>>>>>> And please post out the 'lscpu' log together from the test machine too.
>>>>>>
>>>>>> attached.
>>>>>>
>>>>>> As I said before this seems to go away with CONFIG_NR_CPUS=64 or smaller. We have 282 nr_cpu_ids here (max 141 CPUs on that z13 with SMT2) but only 8 Cores == 16 threads.
>>>>>
>>>>> OK, thanks!
>>>>>
>>>>> The most weird thing is that hctx->next_cpu is computed as 512 since nr_cpu_id is 282, and hctx->next_cpu should have pointed to one of possible CPU.
>>>>>
>>>>> Looks like it is a s390 specific issue, since I can setup one queue which has same mapping with yours:
>>>>>
>>>>> - nr_cpu_id is 282
>>>>> - CPU 0~15 is online
>>>>> - 64 queues null_blk
>>>>> - still run all hw queues in .complete handler
>>>>>
>>>>> But can't reproduce this issue at all.
>>>>>
>>>>> So please test the following patch, which may tell us why hctx->next_cpu is computed wrong:
>>>>
>>>> I see things like
>>>>
>>>> [8.196907] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>>>> [8.196910] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>>>> [8.196912] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>>>> [8.196913] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>>>> [8.196914] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>>>> [8.196915] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>>>> [8.196916] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>>>> [8.196916] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>>>> [8.196917] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>>>> [8.196918] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>>>>
>>>> which is exactly what happens if the find and and operation fails (returns size of bitmap).
>>>
>>> Given both 'cpu_online_mask' and 'hctx->cpumask' are shown as correct in your previous debug log, it means the following function returns totally wrong result on S390.
>>>
>>> cpumask_first_and(hctx->cpumask, cpu_online_mask);
>>>
>>> The debugfs log shows that each hctx->cpumask includes one online CPU(0~15).
>>
>> Really? the last log (with the latest patch applied) shows a lot of contexts that do not have CPUs in 0-15:
>>
>> e.g.
>> [4.049828] dump CPUs mapped to this hctx:
>> [4.049829] 18
>> [4.049829] 82
>> [4.049830] 146
>> [4.049830] 210
>> [4.049831] 274
>
> That won't be an issue, since no IO can be submitted from these offline CPUs, then these hctx shouldn't have been run at all.
>
> But hctx->next_cpu can be set as 512 for these inactive hctx in blk_mq_map_swqueue(), then please test the attached patch, and if hctx->next_cpu is still set as 512, something is still wrong.

With this patch I no longer see the "run queue from wrong CPU x, hctx active" messages. Your debug code still triggers, though.

wrong next_cpu 512, blk_mq_hctx_next_cpu, first_and
wrong next_cpu 512, blk_mq_hctx_next_cpu, next_and

If we would remove the debug code then dmesg would be clean it seems.

> ---
>
> diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
> index 9f8cffc8a701..638ab5c11b3c 100644
> --- a/block/blk-mq-cpumap.c
> +++ b/block/blk-mq-cpumap.c
> @@ -14,13 +14,12 @@
>  #include "blk.h"
>  #include "blk-mq.h"
>  
> +/*
> + * Given there isn't CPU hotplug handler in blk-mq, map all CPUs to
> + * queues even it isn't present yet.
> + */
>  static int cpu_to_queue_index(unsigned int nr_queues, const int cpu)
>  {
> -	/*
> -	 * Non present CPU will be mapped to queue index 0.
> -	 */
> -	if (!cpu_present(cpu))
> -		return 0;
>  	return cpu % nr_queues;
>  }
>  
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 90838e998f66..1a834d96a718 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1343,6 +1343,13 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
>  	hctx_unlock(hctx, srcu_idx);
>  }
>  
> +static void check_next_cpu(int next_cpu, const char *str1, const char *str2)
> +{
> +	if (next_cpu > nr_cpu_ids)
> +		printk_ratelimited("wrong next_cpu %d, %s, %s\n",
> +				next_cpu, str1, str2);
> +}
> +
>  /*
>   * It'd be great if the workqueue API had a way to pass
>   * in a mask and had some smarts for more clever placement.
> @@ -1352,26 +1359,29 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
>  static int blk_mq_hctx_next_cpu(struct
Re: [PATCH] blk-mq: only run mapped hw queues in blk_mq_run_hw_queues()
On Fri, Apr 06, 2018 at 12:19:19PM +0200, Christian Borntraeger wrote:
> On 04/06/2018 11:23 AM, Ming Lei wrote:
>> On Fri, Apr 06, 2018 at 10:51:28AM +0200, Christian Borntraeger wrote:
>>> On 04/06/2018 10:41 AM, Ming Lei wrote:
>>>> On Thu, Apr 05, 2018 at 07:39:56PM +0200, Christian Borntraeger wrote:
>>>>> On 04/05/2018 06:11 PM, Ming Lei wrote:
>>>>>>> Could you please apply the following patch and provide the dmesg boot log?
>>>>>>
>>>>>> And please post out the 'lscpu' log together from the test machine too.
>>>>>
>>>>> attached.
>>>>>
>>>>> As I said before this seems to go away with CONFIG_NR_CPUS=64 or smaller. We have 282 nr_cpu_ids here (max 141 CPUs on that z13 with SMT2) but only 8 Cores == 16 threads.
>>>>
>>>> OK, thanks!
>>>>
>>>> The most weird thing is that hctx->next_cpu is computed as 512 since nr_cpu_id is 282, and hctx->next_cpu should have pointed to one of possible CPU.
>>>>
>>>> Looks like it is a s390 specific issue, since I can setup one queue which has same mapping with yours:
>>>>
>>>> - nr_cpu_id is 282
>>>> - CPU 0~15 is online
>>>> - 64 queues null_blk
>>>> - still run all hw queues in .complete handler
>>>>
>>>> But can't reproduce this issue at all.
>>>>
>>>> So please test the following patch, which may tell us why hctx->next_cpu is computed wrong:
>>>
>>> I see things like
>>>
>>> [8.196907] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>>> [8.196910] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>>> [8.196912] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>>> [8.196913] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>>> [8.196914] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>>> [8.196915] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>>> [8.196916] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>>> [8.196916] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>>> [8.196917] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>>> [8.196918] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>>>
>>> which is exactly what happens if the find and and operation fails (returns size of bitmap).
>>
>> Given both 'cpu_online_mask' and 'hctx->cpumask' are shown as correct in your previous debug log, it means the following function returns totally wrong result on S390.
>>
>> cpumask_first_and(hctx->cpumask, cpu_online_mask);
>>
>> The debugfs log shows that each hctx->cpumask includes one online CPU(0~15).
>
> Really? the last log (with the latest patch applied) shows a lot of contexts that do not have CPUs in 0-15:
>
> e.g.
> [4.049828] dump CPUs mapped to this hctx:
> [4.049829] 18
> [4.049829] 82
> [4.049830] 146
> [4.049830] 210
> [4.049831] 274

That won't be an issue, since no IO can be submitted from these offline CPUs, then these hctx shouldn't have been run at all.

But hctx->next_cpu can be set as 512 for these inactive hctx in blk_mq_map_swqueue(), then please test the attached patch, and if hctx->next_cpu is still set as 512, something is still wrong.

---

diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index 9f8cffc8a701..638ab5c11b3c 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -14,13 +14,12 @@
 #include "blk.h"
 #include "blk-mq.h"
 
+/*
+ * Given there isn't CPU hotplug handler in blk-mq, map all CPUs to
+ * queues even it isn't present yet.
+ */
 static int cpu_to_queue_index(unsigned int nr_queues, const int cpu)
 {
-	/*
-	 * Non present CPU will be mapped to queue index 0.
-	 */
-	if (!cpu_present(cpu))
-		return 0;
 	return cpu % nr_queues;
 }
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 90838e998f66..1a834d96a718 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1343,6 +1343,13 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 	hctx_unlock(hctx, srcu_idx);
 }
 
+static void check_next_cpu(int next_cpu, const char *str1, const char *str2)
+{
+	if (next_cpu > nr_cpu_ids)
+		printk_ratelimited("wrong next_cpu %d, %s, %s\n",
+				next_cpu, str1, str2);
+}
+
 /*
  * It'd be great if the workqueue API had a way to pass
  * in a mask and had some smarts for more clever placement.
@@ -1352,26 +1359,29 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
 {
 	bool tried = false;
+	int next_cpu = hctx->next_cpu;
 
 	if (hctx->queue->nr_hw_queues == 1)
 		return WORK_CPU_UNBOUND;
 
 	if (--hctx->next_cpu_batch <= 0) {
-		int next_cpu;
 select_cpu:
-		next_cpu = cpumask_next_and(hctx->next_cpu, hctx->cpumask,
+		next_cpu = cpumask_next_and(next_cpu, hctx->cpumask,
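[Editor's sketch: the wrap-around idiom being debugged above, shown in self-contained userspace C. next_and() here merely mirrors the semantics of the kernel's cpumask_next_and(), it is not the kernel implementation: the lookup returns the bitmap size when nothing is left, and the caller is expected to retry from the beginning before treating that as an error.]

	#include <stdio.h>

	#define NBITS 16

	/* first bit > prev set in (a & b), or NBITS if none is left */
	static int next_and(unsigned a, unsigned b, int prev)
	{
		for (int i = prev + 1; i < NBITS; i++)
			if ((a & b) & (1u << i))
				return i;
		return NBITS;
	}

	int main(void)
	{
		unsigned hctx_mask = 0x12; /* CPUs 1 and 4 mapped to this hctx */
		unsigned online = 0x1f;    /* CPUs 0-4 online */
		int cpu = 4;               /* we were on the last mapped CPU */

		cpu = next_and(hctx_mask, online, cpu);
		if (cpu >= NBITS)          /* valid: wrap to the first set bit */
			cpu = next_and(hctx_mask, online, -1);

		printf("next cpu: %d\n", cpu);  /* prints 1 */
		return 0;
	}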
WARNING: lock held when returning to user space!
Hello,

syzbot hit the following crash on upstream commit
38c23685b273cfb4ccf31a199feccce3bdcb5d83 (Fri Apr 6 04:29:35 2018 +0000)
Merge tag 'armsoc-drivers' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
syzbot dashboard link: https://syzkaller.appspot.com/bug?extid=31e8daa8b3fc129e75f2

So far this crash happened 9 times on upstream.
C reproducer: https://syzkaller.appspot.com/x/repro.c?id=6407930337296384
syzkaller reproducer: https://syzkaller.appspot.com/x/repro.syz?id=4942413340606464
Raw console output: https://syzkaller.appspot.com/x/log.txt?id=4764483918495744
Kernel config: https://syzkaller.appspot.com/x/.config?id=-5813481738265533882
compiler: gcc (GCC) 8.0.1 20180301 (experimental)

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+31e8daa8b3fc129e7...@syzkaller.appspotmail.com
It will help syzbot understand when the bug is fixed. See footer for details.
If you forward the report, please keep this part and the footer.

WARNING: lock held when returning to user space!
4.16.0+ #3 Not tainted

syzkaller433111/4462 is leaving the kernel with locks still held!
1 lock held by syzkaller433111/4462:
 #0: 03a06fae (&lo->lo_ctl_mutex/1){+.+.}, at: lo_ioctl+0x8d/0x1ec0 drivers/block/loop.c:1363

---
This bug is generated by a dumb bot. It may contain errors.
See https://goo.gl/tpsmEJ for details.
Direct all questions to syzkal...@googlegroups.com.

syzbot will keep track of this bug report.
If you forgot to add the Reported-by tag, once the fix for this bug is merged into any tree, please reply to this email with:
#syz fix: exact-commit-title
If you want to test a patch for this bug, please reply with:
#syz test: git://repo/address.git branch
and provide the patch inline or as an attachment.
To mark this as a duplicate of another syzbot report, please reply with:
#syz dup: exact-subject-of-another-report
If it's a one-off invalid bug report, please reply with:
#syz invalid
Note: if the crash happens again, it will cause creation of a new bug report.
Note: all commands must start from beginning of the line in the email body.
Re: [PATCH] blk-mq: only run mapped hw queues in blk_mq_run_hw_queues()
On 04/06/2018 11:23 AM, Ming Lei wrote:
> On Fri, Apr 06, 2018 at 10:51:28AM +0200, Christian Borntraeger wrote:
>> On 04/06/2018 10:41 AM, Ming Lei wrote:
>>> On Thu, Apr 05, 2018 at 07:39:56PM +0200, Christian Borntraeger wrote:
>>>> On 04/05/2018 06:11 PM, Ming Lei wrote:
>>>>>> Could you please apply the following patch and provide the dmesg boot log?
>>>>>
>>>>> And please post out the 'lscpu' log together from the test machine too.
>>>>
>>>> attached.
>>>>
>>>> As I said before this seems to go away with CONFIG_NR_CPUS=64 or smaller. We have 282 nr_cpu_ids here (max 141 CPUs on that z13 with SMT2) but only 8 Cores == 16 threads.
>>>
>>> OK, thanks!
>>>
>>> The most weird thing is that hctx->next_cpu is computed as 512 since nr_cpu_id is 282, and hctx->next_cpu should have pointed to one of possible CPU.
>>>
>>> Looks like it is a s390 specific issue, since I can setup one queue which has same mapping with yours:
>>>
>>> - nr_cpu_id is 282
>>> - CPU 0~15 is online
>>> - 64 queues null_blk
>>> - still run all hw queues in .complete handler
>>>
>>> But can't reproduce this issue at all.
>>>
>>> So please test the following patch, which may tell us why hctx->next_cpu is computed wrong:
>>
>> I see things like
>>
>> [8.196907] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>> [8.196910] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>> [8.196912] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>> [8.196913] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>> [8.196914] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>> [8.196915] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>> [8.196916] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>> [8.196916] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>> [8.196917] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>> [8.196918] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>>
>> which is exactly what happens if the find and and operation fails (returns size of bitmap).
>
> Given both 'cpu_online_mask' and 'hctx->cpumask' are shown as correct in your previous debug log, it means the following function returns totally wrong result on S390.
>
> cpumask_first_and(hctx->cpumask, cpu_online_mask);
>
> The debugfs log shows that each hctx->cpumask includes one online CPU(0~15).
>
> So looks it isn't one issue in block MQ core.

So I checked further and printed the masks. I think I can ignore the next_and cases: it is totally valid to get 512 there, as we might start with an offset that is already the last cpu and then need to wrap with first_and. So the first_and and the first cases are really the interesting ones, and I think the code is perfectly right; there is no bit set after the and for these cases:

[3.220021] wrong next_cpu 512, blk_mq_map_swqueue, first_and
[3.220023] 1: 0001 0001 0001 0001 0001
[3.220025] 2:
[3.220027] wrong next_cpu 512, blk_mq_map_swqueue, first_and
[3.220028] 1: 0002 0002 0002 0002 0002
[3.220030] wrong next_cpu 512, blk_mq_map_swqueue, first_and
[3.220032] 1: 0001 0001 0001 0001 0001
[3.220033] 2:
[3.220035] wrong next_cpu 512, blk_mq_map_swqueue, first_and
[3.220036] 2:
[3.220037] wrong next_cpu 512, blk_mq_map_swqueue, first_and
[3.220039] 1: 0004 0004 0004 0004 0004
[3.220040] 2:
[3.220042] wrong next_cpu 512, blk_mq_map_swqueue, first_and
[3.220062] 1: 0002 0002 0002 0002 0002
[3.220063] 2:
[3.220064] wrong next_cpu 512, blk_mq_map_swqueue, first_and
[3.220066] 1: 0008 0008 0008 0008 0008
Re: [PATCH] blk-mq: only run mapped hw queues in blk_mq_run_hw_queues()
On 04/06/2018 11:23 AM, Ming Lei wrote:
> On Fri, Apr 06, 2018 at 10:51:28AM +0200, Christian Borntraeger wrote:
>> On 04/06/2018 10:41 AM, Ming Lei wrote:
>>> On Thu, Apr 05, 2018 at 07:39:56PM +0200, Christian Borntraeger wrote:
>>>> On 04/05/2018 06:11 PM, Ming Lei wrote:
>>>>>> Could you please apply the following patch and provide the dmesg boot log?
>>>>>
>>>>> And please post out the 'lscpu' log together from the test machine too.
>>>>
>>>> attached.
>>>>
>>>> As I said before this seems to go away with CONFIG_NR_CPUS=64 or smaller. We have 282 nr_cpu_ids here (max 141 CPUs on that z13 with SMT2) but only 8 Cores == 16 threads.
>>>
>>> OK, thanks!
>>>
>>> The most weird thing is that hctx->next_cpu is computed as 512 since nr_cpu_id is 282, and hctx->next_cpu should have pointed to one of possible CPU.
>>>
>>> Looks like it is a s390 specific issue, since I can setup one queue which has same mapping with yours:
>>>
>>> - nr_cpu_id is 282
>>> - CPU 0~15 is online
>>> - 64 queues null_blk
>>> - still run all hw queues in .complete handler
>>>
>>> But can't reproduce this issue at all.
>>>
>>> So please test the following patch, which may tell us why hctx->next_cpu is computed wrong:
>>
>> I see things like
>>
>> [8.196907] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>> [8.196910] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>> [8.196912] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>> [8.196913] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>> [8.196914] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>> [8.196915] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>> [8.196916] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>> [8.196916] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>> [8.196917] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>> [8.196918] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>>
>> which is exactly what happens if the find and and operation fails (returns size of bitmap).
>
> Given both 'cpu_online_mask' and 'hctx->cpumask' are shown as correct in your previous debug log, it means the following function returns totally wrong result on S390.
>
> cpumask_first_and(hctx->cpumask, cpu_online_mask);
>
> The debugfs log shows that each hctx->cpumask includes one online CPU(0~15).

Really? the last log (with the latest patch applied) shows a lot of contexts that do not have CPUs in 0-15:

e.g.
[4.049828] dump CPUs mapped to this hctx:
[4.049829] 18
[4.049829] 82
[4.049830] 146
[4.049830] 210
[4.049831] 274

> So looks it isn't one issue in block MQ core.
>
> Thanks,
> Ming
Re: [PATCH V3 4/4] genirq/affinity: irq vector spread among online CPUs as far as possible
On Fri, 6 Apr 2018, Ming Lei wrote:
> I will post V4 soon by using cpu_present_mask in the 1st stage irq spread.
> And it should work fine for Kashyap's case in normal cases.

No need to resend. I've changed it already and will push it out after lunch.

Thanks,

	tglx
Re: [PATCH] blk-mq: only run mapped hw queues in blk_mq_run_hw_queues()
On Fri, Apr 06, 2018 at 10:51:28AM +0200, Christian Borntraeger wrote:
> On 04/06/2018 10:41 AM, Ming Lei wrote:
>> On Thu, Apr 05, 2018 at 07:39:56PM +0200, Christian Borntraeger wrote:
>>> On 04/05/2018 06:11 PM, Ming Lei wrote:
>>>>> Could you please apply the following patch and provide the dmesg boot
>>>>> log?
>>>>
>>>> And please post out the 'lscpu' log together from the test machine too.
>>>
>>> attached.
>>>
>>> As I said before this seems to go away with CONFIG_NR_CPUS=64 or smaller.
>>> We have 282 nr_cpu_ids here (max 141 CPUs on that z13 with SMT2) but only
>>> 8 cores == 16 threads.
>>
>> OK, thanks!
>>
>> The most weird thing is that hctx->next_cpu is computed as 512 since
>> nr_cpu_id is 282, and hctx->next_cpu should have pointed to one of the
>> possible CPUs.
>>
>> Looks like it is a s390 specific issue, since I can set up one queue
>> which has the same mapping as yours:
>>
>> - nr_cpu_id is 282
>> - CPU 0~15 is online
>> - 64 queues null_blk
>> - still run all hw queues in .complete handler
>>
>> But I can't reproduce this issue at all.
>>
>> So please test the following patch, which may tell us why hctx->next_cpu
>> is computed wrong:
>
> I see things like
>
> [8.196907] wrong next_cpu 512, blk_mq_map_swqueue, first_and
> [8.196910] wrong next_cpu 512, blk_mq_map_swqueue, first_and
> [8.196912] wrong next_cpu 512, blk_mq_map_swqueue, first_and
> [8.196913] wrong next_cpu 512, blk_mq_map_swqueue, first_and
> [8.196914] wrong next_cpu 512, blk_mq_map_swqueue, first_and
> [8.196915] wrong next_cpu 512, blk_mq_map_swqueue, first_and
> [8.196916] wrong next_cpu 512, blk_mq_map_swqueue, first_and
> [8.196916] wrong next_cpu 512, blk_mq_map_swqueue, first_and
> [8.196917] wrong next_cpu 512, blk_mq_map_swqueue, first_and
> [8.196918] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>
> which is exactly what happens if the find-and operation fails (returns the
> size of the bitmap).

Given both 'cpu_online_mask' and 'hctx->cpumask' are shown as correct
in your previous debug log, it means the following function returns a
totally wrong result on S390:

	cpumask_first_and(hctx->cpumask, cpu_online_mask);

The debugfs log shows that each hctx->cpumask includes one online
CPU (0~15).

So it looks like it isn't an issue in the block MQ core.

Thanks,
Ming
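If one wanted to rule the helper itself in or out directly on the affected
machine, a throwaway init-time check could be patched in. A sketch only: the
function name, message, and placement are made up for illustration, but the
cpumask API calls are the standard ones:

	/* sketch: one-shot sanity check for cpumask_first_and() */
	#include <linux/cpumask.h>
	#include <linux/errno.h>
	#include <linux/gfp.h>
	#include <linux/init.h>
	#include <linux/printk.h>

	static int __init cpumask_first_and_sanity(void)
	{
		cpumask_var_t mask;
		unsigned int first_online, got;

		if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
			return -ENOMEM;

		/* put exactly one online CPU into the test mask */
		first_online = cpumask_first(cpu_online_mask);
		cpumask_set_cpu(first_online, mask);

		/* must hand back that CPU, never anything >= nr_cpu_ids */
		got = cpumask_first_and(mask, cpu_online_mask);
		pr_info("cpumask_first_and sanity: got %u, expected %u (nr_cpu_ids=%u)\n",
			got, first_online, nr_cpu_ids);

		free_cpumask_var(mask);
		return 0;
	}
	late_initcall(cpumask_first_and_sanity);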
Re: [PATCH V3 4/4] genirq/affinity: irq vector spread among online CPUs as far as possible
Hi Thomas,

On Wed, Apr 04, 2018 at 09:38:26PM +0200, Thomas Gleixner wrote:
> On Wed, 4 Apr 2018, Ming Lei wrote:
>> On Wed, Apr 04, 2018 at 10:25:16AM +0200, Thomas Gleixner wrote:
>>> In the example above:
>>>
>>>>>> irq 39, cpu list 0,4
>>>>>> irq 40, cpu list 1,6
>>>>>> irq 41, cpu list 2,5
>>>>>> irq 42, cpu list 3,7
>>>
>>> and assumed that at driver init time only CPU 0-3 are online then the
>>> hotplug of CPU 4-7 will not result in any interrupt delivered to CPU 4-7.
>>
>> Indeed, and I just tested this case, and found that no interrupts are
>> delivered to CPU 4-7.
>>
>> In theory, the affinity has been assigned to these irq vectors, and
>> programmed into the interrupt controller, so I understand it should work.
>>
>> Could you explain a bit why interrupts aren't delivered to CPU 4-7?
>
> As I explained before:
>
> "If the device is already in use when the offline CPUs get hot plugged, then
> the interrupts still stay on cpu 0-3 because the effective affinity of
> interrupts on X86 (and other architectures) is always a single CPU."
>
> IOW, if you set the affinity mask so it contains more than one CPU, then the
> kernel selects a single CPU as target. The selected CPU must be online, and
> if there is more than one online CPU in the mask then the kernel picks the
> one which has the least number of interrupts targeted at it. This selected
> CPU target is programmed into the corresponding interrupt chip
> (IOAPIC/MSI/MSI-X) and it stays that way until the selected target CPU
> goes offline or the affinity mask changes.
>
> The reasons why we use single-target delivery on X86 are:
>
> 1) Not all X86 systems support multi-target delivery.
>
> 2) If a system supports multi-target delivery then the interrupt is
>    preferably delivered to the CPU with the lowest APIC ID (which
>    usually corresponds to the lowest CPU number) due to hardware magic,
>    and only a very small percentage of interrupts are delivered to the
>    other CPUs in the multi-target set. So the benefit is rather dubious
>    and extensive performance testing did not show any significant
>    difference.
>
> 3) The management of multi-targets on the software side is painful as
>    the same low-level vector number has to be allocated on all possible
>    target CPUs. That's making a lot of things including hotplug more
>    complex for very little - if at all - benefit.
>
> So at some point we ripped out the multi-target support on X86 and moved
> everything to single-target delivery mode.
>
> Other architectures never supported multi-target delivery either, due to
> hardware restrictions or for similar reasons to why X86 dropped it. There
> might be a few architectures which support it, but I have no overview at
> the moment.
>
> The information is in procfs:
>
> # cat /proc/irq/9/smp_affinity_list
> 0-3
> # cat /proc/irq/9/effective_affinity_list
> 1
>
> # cat /proc/irq/10/smp_affinity_list
> 0-3
> # cat /proc/irq/10/effective_affinity_list
> 2
>
> smp_affinity[_list] is the affinity which is set either by the kernel or by
> writing to /proc/irq/$N/smp_affinity[_list].
>
> effective_affinity[_list] is the affinity which is effective, i.e. the
> single target CPU to which the interrupt is affine at this point.
>
> As you can see in the above examples the target CPU is selected from the
> given possible target set, and the internal spreading of the low-level x86
> vector allocation code picks a CPU which has the lowest number of
> interrupts targeted at it.
>
> Let's assume for the example below
>
> # cat /proc/irq/10/smp_affinity_list
> 0-3
> # cat /proc/irq/10/effective_affinity_list
> 2
>
> that CPU 3 was offline when the device was initialized. So there was no way
> to select it, and when CPU 3 comes online there is no reason to change the
> affinity of that interrupt, at least not from the kernel POV. Actually we
> don't even have a mechanism to do so automagically.
>
> If I offline CPU 2 after onlining CPU 3 then the kernel has to move the
> interrupt away from CPU 2, so it selects CPU 3 as it's the one with the
> lowest number of interrupts targeted at it.
>
> Now this is a bit different if you use affinity-managed interrupts like
> NVMe and other devices do.
>
> Many of these devices create one queue per possible CPU, so the spreading
> is simple; one interrupt per possible CPU. Pretty boring.
>
> When the device has fewer queues than possible CPUs, then stuff gets more
> interesting. The queues and therefore the interrupts must be targeted at
> multiple CPUs. There is some logic which spreads them over the NUMA nodes
> and takes siblings into account when Hyperthreading is enabled.
>
> In both cases the managed interrupts are handled over CPU soft
> hotplug/unplug:
>
> 1) If a CPU is soft unplugged and an interrupt is targeted at the CPU
>    then the interrupt is either moved to a still online CPU in
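The two procfs attributes described above can be compared system-wide with a
short loop. A minimal sketch, assuming a kernel new enough (v4.13+) to expose
effective_affinity_list:

	# configured affinity mask vs. the single effective target CPU
	for d in /proc/irq/[0-9]*; do
		[ -f "$d/effective_affinity_list" ] || continue
		printf 'irq %s: affinity=%s effective=%s\n' "${d##*/}" \
			"$(cat "$d/smp_affinity_list")" \
			"$(cat "$d/effective_affinity_list")"
	done

On an architecture with single-target delivery, the effective column always
names exactly one CPU out of the configured mask.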
Re: [PATCH] blk-mq: only run mapped hw queues in blk_mq_run_hw_queues()
On 04/06/2018 10:51 AM, Christian Borntraeger wrote:
> On 04/06/2018 10:41 AM, Ming Lei wrote:
>> On Thu, Apr 05, 2018 at 07:39:56PM +0200, Christian Borntraeger wrote:
>>> On 04/05/2018 06:11 PM, Ming Lei wrote:
>>>>> Could you please apply the following patch and provide the dmesg boot
>>>>> log?
>>>>
>>>> And please post out the 'lscpu' log together from the test machine too.
>>>
>>> attached.
>>>
>>> As I said before this seems to go away with CONFIG_NR_CPUS=64 or smaller.
>>> We have 282 nr_cpu_ids here (max 141 CPUs on that z13 with SMT2) but only
>>> 8 cores == 16 threads.
>>
>> OK, thanks!
>>
>> The most weird thing is that hctx->next_cpu is computed as 512 since
>> nr_cpu_id is 282, and hctx->next_cpu should have pointed to one of the
>> possible CPUs.
>>
>> Looks like it is a s390 specific issue, since I can set up one queue
>> which has the same mapping as yours:
>>
>> - nr_cpu_id is 282
>> - CPU 0~15 is online
>> - 64 queues null_blk
>> - still run all hw queues in .complete handler
>>
>> But I can't reproduce this issue at all.
>>
>> So please test the following patch, which may tell us why hctx->next_cpu
>> is computed wrong:
>
> I see things like
>
> [8.196907] wrong next_cpu 512, blk_mq_map_swqueue, first_and
> [8.196910] wrong next_cpu 512, blk_mq_map_swqueue, first_and
> [8.196912] wrong next_cpu 512, blk_mq_map_swqueue, first_and
> [8.196913] wrong next_cpu 512, blk_mq_map_swqueue, first_and
> [8.196914] wrong next_cpu 512, blk_mq_map_swqueue, first_and
> [8.196915] wrong next_cpu 512, blk_mq_map_swqueue, first_and
> [8.196916] wrong next_cpu 512, blk_mq_map_swqueue, first_and
> [8.196916] wrong next_cpu 512, blk_mq_map_swqueue, first_and
> [8.196917] wrong next_cpu 512, blk_mq_map_swqueue, first_and
> [8.196918] wrong next_cpu 512, blk_mq_map_swqueue, first_and

There are more:

# dmesg | grep "wrong next" | cut -d "]" -f 2- | uniq -c
     10  wrong next_cpu 512, blk_mq_map_swqueue, first_and
     72  wrong next_cpu 512, blk_mq_hctx_next_cpu, next_and
      1  wrong next_cpu 512, blk_mq_hctx_next_cpu, first_and
      1  wrong next_cpu 512, blk_mq_hctx_next_cpu, next_and
      1  wrong next_cpu 512, blk_mq_hctx_next_cpu, first_and
      1  wrong next_cpu 512, blk_mq_hctx_next_cpu, next_and
      1  wrong next_cpu 512, blk_mq_hctx_next_cpu, first_and
      1  wrong next_cpu 512, blk_mq_hctx_next_cpu, next_and
      1  wrong next_cpu 512, blk_mq_hctx_next_cpu, first_and
      1  wrong next_cpu 512, blk_mq_hctx_next_cpu, next_and
      1  wrong next_cpu 512, blk_mq_hctx_next_cpu, first_and
      1  wrong next_cpu 512, blk_mq_hctx_next_cpu, next_and
      1  wrong next_cpu 512, blk_mq_hctx_next_cpu, first_and
      7  wrong next_cpu 512, blk_mq_hctx_next_cpu, next_and
      1  wrong next_cpu 512, blk_mq_hctx_next_cpu, first_and
      1  wrong next_cpu 512, blk_mq_hctx_next_cpu, next_and
      1  wrong next_cpu 512, blk_mq_hctx_next_cpu, first_and
     10  wrong next_cpu 512, blk_mq_hctx_next_cpu, next_and

> which is exactly what happens if the find-and operation fails (returns the
> size of the bitmap).
> FWIW, I added a dump stack for the case when we run unbound before I tested
> your patch:
>
> Apr 06 10:47:41 s38lp39 kernel: CPU: 15 PID: 86 Comm: ksoftirqd/15 Not tainted 4.16.0-07249-g864f9fc031e4-dirty #2
> Apr 06 10:47:41 s38lp39 kernel: Hardware name: IBM 2964 NC9 704 (LPAR)
> Apr 06 10:47:41 s38lp39 kernel: Call Trace:
> Apr 06 10:47:41 s38lp39 kernel: ([<00113946>] show_stack+0x56/0x80)
> Apr 06 10:47:41 s38lp39 kernel:  [<009d8132>] dump_stack+0x82/0xb0
> Apr 06 10:47:41 s38lp39 kernel:  [<006a05de>] blk_mq_hctx_next_cpu+0x12e/0x138
> Apr 06 10:47:41 s38lp39 kernel:  [<006a084c>] __blk_mq_delay_run_hw_queue+0x94/0xd8
> Apr 06 10:47:41 s38lp39 kernel:  [<006a097a>] blk_mq_run_hw_queue+0x82/0x180
> Apr 06 10:47:41 s38lp39 kernel:  [<006a0ae0>] blk_mq_run_hw_queues+0x68/0x88
> Apr 06 10:47:41 s38lp39 kernel:  [<0069fc4e>] __blk_mq_complete_request+0x11e/0x1d8
> Apr 06 10:47:41 s38lp39 kernel:  [<0069fd94>] blk_mq_complete_request+0x8c/0xc8
> Apr 06 10:47:41 s38lp39 kernel:  [<00824c50>] dasd_block_tasklet+0x158/0x490
> Apr 06 10:47:41 s38lp39 kernel:  [<0014a952>] tasklet_action_common.isra.5+0x7a/0x100
> Apr 06 10:47:41 s38lp39 kernel:  [<009f8248>] __do_softirq+0x98/0x368
> Apr 06 10:47:41 s38lp39 kernel:  [<0014a322>] run_ksoftirqd+0x4a/0x68
> Apr 06 10:47:41 s38lp39 kernel:  [<0016dc20>] smpboot_thread_fn+0x108/0x1b0
> Apr 06 10:47:41 s38lp39 kernel:  [<00168e70>] kthread+0x148/0x160
> Apr 06 10:47:41 s38lp39 kernel:  [<009f727a>] kernel_thread_starter+0x6/0xc
> Apr 06 10:47:41 s38lp39 kernel:  [<009f7274>] kernel_thread_starter+0x0/0xc
Re: [PATCH] blk-mq: only run mapped hw queues in blk_mq_run_hw_queues()
On 04/06/2018 10:41 AM, Ming Lei wrote:
> On Thu, Apr 05, 2018 at 07:39:56PM +0200, Christian Borntraeger wrote:
>> On 04/05/2018 06:11 PM, Ming Lei wrote:
>>>> Could you please apply the following patch and provide the dmesg boot log?
>>>
>>> And please post out the 'lscpu' log together from the test machine too.
>>
>> attached.
>>
>> As I said before this seems to go away with CONFIG_NR_CPUS=64 or smaller.
>> We have 282 nr_cpu_ids here (max 141 CPUs on that z13 with SMT2) but only
>> 8 cores == 16 threads.
>
> OK, thanks!
>
> The most weird thing is that hctx->next_cpu is computed as 512 since
> nr_cpu_id is 282, and hctx->next_cpu should have pointed to one of the
> possible CPUs.
>
> Looks like it is a s390 specific issue, since I can set up one queue
> which has the same mapping as yours:
>
> - nr_cpu_id is 282
> - CPU 0~15 is online
> - 64 queues null_blk
> - still run all hw queues in .complete handler
>
> But I can't reproduce this issue at all.
>
> So please test the following patch, which may tell us why hctx->next_cpu
> is computed wrong:

I see things like

[8.196907] wrong next_cpu 512, blk_mq_map_swqueue, first_and
[8.196910] wrong next_cpu 512, blk_mq_map_swqueue, first_and
[8.196912] wrong next_cpu 512, blk_mq_map_swqueue, first_and
[8.196913] wrong next_cpu 512, blk_mq_map_swqueue, first_and
[8.196914] wrong next_cpu 512, blk_mq_map_swqueue, first_and
[8.196915] wrong next_cpu 512, blk_mq_map_swqueue, first_and
[8.196916] wrong next_cpu 512, blk_mq_map_swqueue, first_and
[8.196916] wrong next_cpu 512, blk_mq_map_swqueue, first_and
[8.196917] wrong next_cpu 512, blk_mq_map_swqueue, first_and
[8.196918] wrong next_cpu 512, blk_mq_map_swqueue, first_and

which is exactly what happens if the find-and operation fails (returns the
size of the bitmap).

FWIW, I added a dump stack for the case when we run unbound before I tested
your patch:

Apr 06 10:47:41 s38lp39 kernel: CPU: 15 PID: 86 Comm: ksoftirqd/15 Not tainted 4.16.0-07249-g864f9fc031e4-dirty #2
Apr 06 10:47:41 s38lp39 kernel: Hardware name: IBM 2964 NC9 704 (LPAR)
Apr 06 10:47:41 s38lp39 kernel: Call Trace:
Apr 06 10:47:41 s38lp39 kernel: ([<00113946>] show_stack+0x56/0x80)
Apr 06 10:47:41 s38lp39 kernel:  [<009d8132>] dump_stack+0x82/0xb0
Apr 06 10:47:41 s38lp39 kernel:  [<006a05de>] blk_mq_hctx_next_cpu+0x12e/0x138
Apr 06 10:47:41 s38lp39 kernel:  [<006a084c>] __blk_mq_delay_run_hw_queue+0x94/0xd8
Apr 06 10:47:41 s38lp39 kernel:  [<006a097a>] blk_mq_run_hw_queue+0x82/0x180
Apr 06 10:47:41 s38lp39 kernel:  [<006a0ae0>] blk_mq_run_hw_queues+0x68/0x88
Apr 06 10:47:41 s38lp39 kernel:  [<0069fc4e>] __blk_mq_complete_request+0x11e/0x1d8
Apr 06 10:47:41 s38lp39 kernel:  [<0069fd94>] blk_mq_complete_request+0x8c/0xc8
Apr 06 10:47:41 s38lp39 kernel:  [<00824c50>] dasd_block_tasklet+0x158/0x490
Apr 06 10:47:41 s38lp39 kernel:  [<0014a952>] tasklet_action_common.isra.5+0x7a/0x100
Apr 06 10:47:41 s38lp39 kernel:  [<009f8248>] __do_softirq+0x98/0x368
Apr 06 10:47:41 s38lp39 kernel:  [<0014a322>] run_ksoftirqd+0x4a/0x68
Apr 06 10:47:41 s38lp39 kernel:  [<0016dc20>] smpboot_thread_fn+0x108/0x1b0
Apr 06 10:47:41 s38lp39 kernel:  [<00168e70>] kthread+0x148/0x160
Apr 06 10:47:41 s38lp39 kernel:  [<009f727a>] kernel_thread_starter+0x6/0xc
Apr 06 10:47:41 s38lp39 kernel:  [<009f7274>] kernel_thread_starter+0x0/0xc
Re: [PATCH] blk-mq: only run mapped hw queues in blk_mq_run_hw_queues()
On Thu, Apr 05, 2018 at 07:39:56PM +0200, Christian Borntraeger wrote:
> On 04/05/2018 06:11 PM, Ming Lei wrote:
>>> Could you please apply the following patch and provide the dmesg boot log?
>>
>> And please post out the 'lscpu' log together from the test machine too.
>
> attached.
>
> As I said before this seems to go away with CONFIG_NR_CPUS=64 or smaller.
> We have 282 nr_cpu_ids here (max 141 CPUs on that z13 with SMT2) but only
> 8 cores == 16 threads.

OK, thanks!

The most weird thing is that hctx->next_cpu is computed as 512 since
nr_cpu_id is 282, and hctx->next_cpu should have pointed to one of the
possible CPUs.

Looks like it is a s390 specific issue, since I can set up one queue
which has the same mapping as yours:

- nr_cpu_id is 282
- CPU 0~15 is online
- 64 queues null_blk
- still run all hw queues in .complete handler

But I can't reproduce this issue at all.

So please test the following patch, which may tell us why hctx->next_cpu
is computed wrong:

---
diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index 9f8cffc8a701..638ab5c11b3c 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -14,13 +14,12 @@
 #include "blk.h"
 #include "blk-mq.h"
 
+/*
+ * Given there isn't CPU hotplug handler in blk-mq, map all CPUs to
+ * queues even it isn't present yet.
+ */
 static int cpu_to_queue_index(unsigned int nr_queues, const int cpu)
 {
-	/*
-	 * Non present CPU will be mapped to queue index 0.
-	 */
-	if (!cpu_present(cpu))
-		return 0;
 	return cpu % nr_queues;
 }
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 90838e998f66..9b130e4b87df 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1343,6 +1343,13 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 	hctx_unlock(hctx, srcu_idx);
 }
 
+static void check_next_cpu(int next_cpu, const char *str1, const char *str2)
+{
+	if (next_cpu > nr_cpu_ids)
+		printk_ratelimited("wrong next_cpu %d, %s, %s\n",
+				next_cpu, str1, str2);
+}
+
 /*
  * It'd be great if the workqueue API had a way to pass
  * in a mask and had some smarts for more clever placement.
@@ -1352,26 +1359,29 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
 {
 	bool tried = false;
+	int next_cpu = hctx->next_cpu;
 
 	if (hctx->queue->nr_hw_queues == 1)
 		return WORK_CPU_UNBOUND;
 
 	if (--hctx->next_cpu_batch <= 0) {
-		int next_cpu;
 select_cpu:
-		next_cpu = cpumask_next_and(hctx->next_cpu, hctx->cpumask,
+		next_cpu = cpumask_next_and(next_cpu, hctx->cpumask,
 				cpu_online_mask);
-		if (next_cpu >= nr_cpu_ids)
+		check_next_cpu(next_cpu, __func__, "next_and");
+		if (next_cpu >= nr_cpu_ids) {
 			next_cpu = cpumask_first_and(hctx->cpumask, cpu_online_mask);
+			check_next_cpu(next_cpu, __func__, "first_and");
+		}
 
 		/*
 		 * No online CPU is found, so have to make sure hctx->next_cpu
 		 * is set correctly for not breaking workqueue.
 		 */
-		if (next_cpu >= nr_cpu_ids)
-			hctx->next_cpu = cpumask_first(hctx->cpumask);
-		else
-			hctx->next_cpu = next_cpu;
+		if (next_cpu >= nr_cpu_ids) {
+			next_cpu = cpumask_first(hctx->cpumask);
+			check_next_cpu(next_cpu, __func__, "first");
+		}
 		hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
 	}
 
@@ -1379,7 +1389,7 @@ static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
 	 * Do unbound schedule if we can't find a online CPU for this hctx,
 	 * and it should only happen in the path of handling CPU DEAD.
 	 */
-	if (!cpu_online(hctx->next_cpu)) {
+	if (!cpu_online(next_cpu)) {
 		if (!tried) {
 			tried = true;
 			goto select_cpu;
@@ -1392,7 +1402,9 @@ static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
 		hctx->next_cpu_batch = 1;
 		return WORK_CPU_UNBOUND;
 	}
-	return hctx->next_cpu;
+
+	hctx->next_cpu = next_cpu;
+	return next_cpu;
 }
 
 static void __blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async,
@@ -2408,6 +2420,8 @@ static void blk_mq_map_swqueue(struct request_queue *q)
 	mutex_unlock(&q->sysfs_lock);
 
 	queue_for_each_hw_ctx(q, hctx, i) {
+		int next_cpu;
+
 		/*
 		 * If no software queues are mapped to this hardware queue,
 		 * disable it and free the request entries.
@@ -2437,8 +2451,10 @@ static void blk_mq_map_swqueue(struct request_queue
Re: [PATCH] blk-mq: only run mapped hw queues in blk_mq_run_hw_queues()
On 04/05/2018 06:05 PM, Ming Lei wrote:
[...]
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 90838e998f66..996f8a963026 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1324,9 +1324,18 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
>  	 */
>  	if (!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) &&
>  		cpu_online(hctx->next_cpu)) {
> -		printk(KERN_WARNING "run queue from wrong CPU %d, hctx %s\n",
> -			raw_smp_processor_id(),
> +		int cpu;
> +		printk(KERN_WARNING "run queue from wrong CPU %d/%d, hctx-%d %s\n",
> +			raw_smp_processor_id(), hctx->next_cpu,
> +			hctx->queue_num,
>  			cpumask_empty(hctx->cpumask) ? "inactive": "active");
> +		printk("dump CPUs mapped to this hctx:\n");
> +		for_each_cpu(cpu, hctx->cpumask)
> +			printk("%d ", cpu);
> +		printk("\n");
> +		printk("nr_cpu_ids is %d, and dump online cpus:\n", nr_cpu_ids);
> +		for_each_cpu(cpu, cpu_online_mask)
> +			printk("%d ", cpu);
>  		dump_stack();
>  	}

FWIW, with things like

[4.049828] dump CPUs mapped to this hctx:
[4.049829] 18
[4.049829] 82
[4.049830] 146
[4.049830] 210
[4.049831] 274
[4.049832] nr_cpu_ids is 282, and dump online cpus:
[4.049833] 0
[4.049833] 1
[4.049834] 2
[4.049834] 3
[4.049835] 4
[4.049835] 5
[4.049836] 6
[4.049836] 7
[4.049837] 8
[4.049837] 9
[4.049838] 10
[4.049839] 11
[4.049839] 12
[4.049840] 13
[4.049840] 14
[4.049841] 15

the hctx has only "possible" CPUs, but all of them are offline. Doesn't that
always make this run unbound? See blk_mq_hctx_next_cpu below.

/*
 * It'd be great if the workqueue API had a way to pass
 * in a mask and had some smarts for more clever placement.
 * For now we just round-robin here, switching for every
 * BLK_MQ_CPU_WORK_BATCH queued items.
 */
static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
{
	bool tried = false;

	if (hctx->queue->nr_hw_queues == 1)
		return WORK_CPU_UNBOUND;

	if (--hctx->next_cpu_batch <= 0) {
		int next_cpu;
select_cpu:
		next_cpu = cpumask_next_and(hctx->next_cpu, hctx->cpumask,
				cpu_online_mask);
		if (next_cpu >= nr_cpu_ids)
			next_cpu = cpumask_first_and(hctx->cpumask, cpu_online_mask);

		/*
		 * No online CPU is found, so have to make sure hctx->next_cpu
		 * is set correctly for not breaking workqueue.
		 */
		if (next_cpu >= nr_cpu_ids)
			hctx->next_cpu = cpumask_first(hctx->cpumask);
		else
			hctx->next_cpu = next_cpu;
		hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
	}

	/*
	 * Do unbound schedule if we can't find a online CPU for this hctx,
	 * and it should only happen in the path of handling CPU DEAD.
	 */
	if (!cpu_online(hctx->next_cpu)) {
		if (!tried) {
			tried = true;
			goto select_cpu;
		}

		/*
		 * Make sure to re-select CPU next time once after CPUs
		 * in hctx->cpumask become online again.
		 */
		hctx->next_cpu_batch = 1;
		return WORK_CPU_UNBOUND;
	}
	return hctx->next_cpu;
}
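To answer the "always unbound?" question in isolation: a small userspace model
of the selection logic above, seeded with the hctx dump from this mail (CPUs
18/82/146/210/274 mapped, CPUs 0-15 online), takes the WORK_CPU_UNBOUND branch
every time. Sketch only, not kernel code; the constant value is a stand-in:

	#include <stdbool.h>
	#include <stdio.h>

	#define NR_CPU_IDS 282
	#define WORK_CPU_UNBOUND (-1)	/* stand-in for the kernel's constant */

	static const int hctx_cpus[] = { 18, 82, 146, 210, 274 };

	static bool cpu_online(int cpu) { return cpu >= 0 && cpu < 16; }

	static bool in_hctx_mask(int cpu)
	{
		for (unsigned int i = 0; i < sizeof(hctx_cpus) / sizeof(hctx_cpus[0]); i++)
			if (hctx_cpus[i] == cpu)
				return true;
		return false;
	}

	/* model of cpumask_next_and(prev, hctx->cpumask, cpu_online_mask) */
	static int next_and(int prev)
	{
		for (int cpu = prev + 1; cpu < NR_CPU_IDS; cpu++)
			if (in_hctx_mask(cpu) && cpu_online(cpu))
				return cpu;
		return NR_CPU_IDS;	/* empty intersection */
	}

	int main(void)
	{
		int next_cpu = next_and(-1);		/* first_and equivalent */

		if (next_cpu >= NR_CPU_IDS)
			next_cpu = hctx_cpus[0];	/* cpumask_first() fallback */

		/* every mapped CPU is offline, so this branch is always taken */
		if (!cpu_online(next_cpu))
			printf("-> WORK_CPU_UNBOUND (%d)\n", WORK_CPU_UNBOUND);
		return 0;
	}

So with an all-offline hctx mask, the retry via select_cpu cannot find anything
either, and the work item is queued unbound on every dispatch, which is
consistent with the trace above.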
Re: Multi-Actuator SAS HDD First Look
On Fri, Apr 06, 2018 at 08:24:18AM +0200, Hannes Reinecke wrote:
> Ah. Far better.
> What about delegating FORMAT UNIT to the control LUN, and not
> implementing it for the individual disk LUNs?
> That would make an even stronger case for having a control LUN;
> with that there wouldn't be any problem with having to synchronize
> across LUNs etc.

It sounds to me like NVMe might be a much better model for this drive
than SCSI, btw :)
Re: Multi-Actuator SAS HDD First Look
On Thu, 5 Apr 2018 17:43:46 -0600, Tim Walker wrote:
> On Tue, Apr 3, 2018 at 1:46 AM, Christoph Hellwig wrote:
>> On Sat, Mar 31, 2018 at 01:03:46PM +0200, Hannes Reinecke wrote:
>>> Actually I would propose to have a 'management' LUN at LUN0, which
>>> could handle all the device-wide commands (eg things like START
>>> STOP UNIT, firmware update, or even SMART commands), and ignore
>>> them for the remaining LUNs.
>>
>> That is in fact the only workable option at all. Everything else
>> completely breaks the SCSI architecture.
>
> Here's an update: Seagate will eliminate the inter-LU actions from
> FORMAT UNIT and SANITIZE. Probably SANITIZE will be per-LUN, but
> FORMAT UNIT is trickier due to internal drive architecture, and how
> FORMAT UNIT initializes on-disk metadata. Likely it will require some
> sort of synchronization across LUNs, such as the command being sent to
> both LUNs sequentially or something similar. We are also considering
> not supporting FORMAT UNIT at all - would anybody object? Any other
> suggestions?

Ah. Far better.
What about delegating FORMAT UNIT to the control LUN, and not
implementing it for the individual disk LUNs?
That would make an even stronger case for having a control LUN;
with that there wouldn't be any problem with having to synchronize
across LUNs etc.

Cheers,

Hannes
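For what it's worth, once each actuator shows up as its own LUN, per-LUN
SANITIZE (and FORMAT UNIT, if it survives) is straightforward to exercise from
Linux with sg3_utils. A rough sketch; the device nodes are examples, and the
exact option spellings should be checked against the installed sg3_utils
version:

	# sanitize each actuator's LUN independently (overwrite with zeros)
	sg_sanitize --overwrite --zero /dev/sg2    # LUN 0
	sg_sanitize --overwrite --zero /dev/sg3    # LUN 1

	# a per-LUN FORMAT UNIT, if the command is kept, would look like
	sg_format --format /dev/sg2

Routing these to a control LUN instead, as proposed above, would only change
which device node the command is sent to, not the initiator-side tooling.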