Re: 4.15.14 crash with iscsi target and dvd

2018-04-06 Thread Bart Van Assche
On Fri, 2018-04-06 at 21:03 -0400, Wakko Warner wrote:
> Bart Van Assche wrote:
> > On Thu, 2018-04-05 at 22:06 -0400, Wakko Warner wrote:
> > > I know now why scsi_print_command isn't doing anything.  cmd->cmnd is 
> > > null.
> > > I added a dev_printk in scsi_print_command where the 2 if statements 
> > > return.
> > > Logs:
> > > [  29.866415] sr 3:0:0:0: cmd->cmnd is NULL
> > 
> > That's something that should never happen. As one can see in
> > scsi_setup_scsi_cmnd() and scsi_setup_fs_cmnd() both functions initialize
> > that pointer. Since I have not yet been able to reproduce myself what you
> > reported, would it be possible for you to bisect this issue? You will need
> > to follow something like the following procedure (see also
> > https://git-scm.com/docs/git-bisect):
> > 
> > git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> > git bisect start
> > git bisect bad v4.10
> > git bisect good v4.9
> > 
> > and then build the kernel, install it, boot the kernel and test it.
> > Depending on the result, run either git bisect bad or git bisect good and
> > keep going until git bisect comes to a conclusion. This can take an hour or
> > more.
> 
> I have 1 question.  Should make clean be done between tests?  My box
> compiles the whole kernel in 2 minutes.

If you trust that the build system will figure out all dependencies then
running make clean is not necessary. Personally, I always run make clean
during a bisect before rebuilding the kernel, because if a header file has
changed in e.g. the block layer, a huge number of files have to be rebuilt
anyway.

Bart.




Re: 4.15.14 crash with iscsi target and dvd

2018-04-06 Thread Wakko Warner
Bart Van Assche wrote:
> On Thu, 2018-04-05 at 22:06 -0400, Wakko Warner wrote:
> > I know now why scsi_print_command isn't doing anything.  cmd->cmnd is null.
> > I added a dev_printk in scsi_print_command where the 2 if statements return.
> > Logs:
> > [  29.866415] sr 3:0:0:0: cmd->cmnd is NULL
> 
> That's something that should never happen. As one can see in
> scsi_setup_scsi_cmnd() and scsi_setup_fs_cmnd() both functions initialize
> that pointer. Since I have not yet been able to reproduce myself what you
> reported, would it be possible for you to bisect this issue? You will need
> to follow something like the following procedure (see also
> https://git-scm.com/docs/git-bisect):
> 
> git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> git bisect start
> git bisect bad v4.10
> git bisect good v4.9
> 
> and then build the kernel, install it, boot the kernel and test it.
> Depending on the result, run either git bisect bad or git bisect good and
> keep going until git bisect comes to a conclusion. This can take an hour or
> more.

I have 1 question.  Should make clean be done between tests?  My box
compiles the whole kernel in 2 minutes.

-- 
 Microsoft has beaten Volkswagen's world record.  Volkswagen only created 22
 million bugs.


Re: 4.15.14 crash with iscsi target and dvd

2018-04-06 Thread Wakko Warner
Bart Van Assche wrote:
> On Thu, 2018-04-05 at 22:06 -0400, Wakko Warner wrote:
> > I know now why scsi_print_command isn't doing anything.  cmd->cmnd is null.
> > I added a dev_printk in scsi_print_command where the 2 if statements return.
> > Logs:
> > [  29.866415] sr 3:0:0:0: cmd->cmnd is NULL
> 
> That's something that should never happen. As one can see in
> scsi_setup_scsi_cmnd() and scsi_setup_fs_cmnd() both functions initialize
> that pointer. Since I have not yet been able to reproduce myself what you
> reported, would it be possible for you to bisect this issue? You will need
> to follow something like the following procedure (see also
> https://git-scm.com/docs/git-bisect):

I don't know how relevant it is, but this machine boots nfs and exports its
dvd drives over iscsi with the target modules.  My scsi_target.lio is at the
end.  I removed the iqn name.  The options are default except for a few;
the non-default options are tabbed over.
eth0 is the nfs/localnet nic and eth1 is the nic that iscsi goes over.
eth0 is onboard pci 8086:1502 (subsystem 1028:05d3)
eth1 is pci 8086:107d (subsystem 8086:1084)
Both use the e1000e driver

The initiator is running 4.4.107.
When running on the initiator, /dev/sr1 is the target's /dev/sr0.  Therefore
cat /dev/sr1 > /dev/null seems to work.
mount /dev/sr1 /cdrom works
find /cdrom -type f | xargs cat > /dev/null immediately crashes the target.
Burning to /dev/sr1 seems to work.

I have another nic that uses igb instead, I'll see if that makes a
difference.

> git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> git bisect start
> git bisect bad v4.10
> git bisect good v4.9
> 
> and then build the kernel, install it, boot the kernel and test it.
> Depending on the result, run either git bisect bad or git bisect good and
> keep going until git bisect comes to a conclusion. This can take an hour or
> more.

I'll try this.

Here's my scsi_target.lio:
storage pscsi {
disk dvd0 {
path /dev/sr0 
attribute {
emulate_3pc yes 
emulate_caw yes 
emulate_dpo no 
emulate_fua_read no 
emulate_model_alias no 
emulate_rest_reord no 
emulate_tas yes 
emulate_tpu no 
emulate_tpws no 
emulate_ua_intlck_ctrl no 
emulate_write_cache no 
enforce_pr_isids yes 
fabric_max_sectors 8192 
is_nonrot yes 
max_unmap_block_desc_count 0 
max_unmap_lba_count 0 
max_write_same_len 65535 
queue_depth 128 
unmap_granularity 0 
unmap_granularity_alignment 0 
}
}
disk dvd1 {
path /dev/sr1 
attribute {
emulate_3pc yes 
emulate_caw yes 
emulate_dpo no 
emulate_fua_read no 
emulate_model_alias no 
emulate_rest_reord no 
emulate_tas yes 
emulate_tpu no 
emulate_tpws no 
emulate_ua_intlck_ctrl no 
emulate_write_cache no 
enforce_pr_isids yes 
fabric_max_sectors 8192 
is_nonrot yes 
max_unmap_block_desc_count 0 
max_unmap_lba_count 0 
max_write_same_len 65535 
queue_depth 128 
unmap_granularity 0 
unmap_granularity_alignment 0 
}
}
disk dvd2 {
path /dev/sr2 
attribute {
emulate_3pc yes 
emulate_caw yes 
emulate_dpo no 
emulate_fua_read no 
emulate_model_alias no 
emulate_rest_reord no 
emulate_tas yes 
emulate_tpu no 
emulate_tpws no 
emulate_ua_intlck_ctrl no 
emulate_write_cache no 
enforce_pr_isids yes 
fabric_max_sectors 8192 
is_nonrot yes 
max_unmap_block_desc_count 0 
max_unmap_lba_count 0 
max_write_same_len 65535 
queue_depth 128 
unmap_granularity 0 
unmap_granularity_alignment 0 
}
}
}
fabric iscsi {
discovery_auth {
enable no 
mutual_password "" 
mutual_userid "" 
password "" 
userid "" 
}
target iqn.:dvd tpgt 1 {
enable yes 
attribute {
authentication no 
cache_dynamic_acls yes 
default_cmdsn_depth 64 
default_erl 0 
demo_mode_discovery yes 
demo_mode_write_protect no 
fabric_prot_type 0 
generate_node_acls yes 
login_timeout 15 
netif_timeout 2 
prod_mode_write_protect no 
t10_pi 0 
tpg_enabled_sendtargets 1 
}
auth {
password "" 
password_mutual "" 
userid "" 
userid_mutual "" 
}
parameter {
AuthMethod "CHAP,None" 
DataDigest "CRC32C,None" 
DataPDUInOrder yes 
DataSequenceInOrder yes 
DefaultTime2Retain 20 
DefaultTime2Wait 2 
ErrorRecoveryLevel no 
FirstBurstLength 65536 
HeaderDigest "CRC32C,None" 
IFMarkInt Reject 
IFMarker no 
ImmediateData yes 
InitialR2T yes 
MaxBurstLength 262144 
MaxConnections 1 
MaxOutstandingR2T 1 
MaxRecvDataSegmentLength 8192 
MaxXmitDataSegmentLength 262144 
OFMarkInt Reject 
OFMarker no 
TargetAlias "LIO Target" 
}
lun 0 backend pscsi:dvd0 
lun 1 backend pscsi:dvd1 
lun 2 backend pscsi:dvd2 
portal 0.0.0.0:3260 
}
}


-- 
 Microsoft has beaten Volkswagen's world record.  Volkswagen only created 22
 million bugs.


Re: [PATCH] blk-cgroup: remove entries in blkg_tree before queue release

2018-04-06 Thread kbuild test robot
Hi Alexandru,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on block/for-next]
[also build test ERROR on v4.16 next-20180406]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Alexandru-Moise/blk-cgroup-remove-entries-in-blkg_tree-before-queue-release/20180407-035957
base:   https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git 
for-next
config: c6x-evmc6678_defconfig (attached as .config)
compiler: c6x-elf-gcc (GCC) 7.2.0
reproduce:
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=c6x 

All errors (new ones prefixed by >>):

   block/blk-sysfs.c: In function '__blk_release_queue':
>> block/blk-sysfs.c:820:2: error: implicit declaration of function 
>> 'blkg_destroy_all'; did you mean 'irq_destroy_ipi'? 
>> [-Werror=implicit-function-declaration]
 blkg_destroy_all(q);
 ^~~~
 irq_destroy_ipi
   cc1: some warnings being treated as errors

vim +820 block/blk-sysfs.c

   770  
   771  /**
   772   * __blk_release_queue - release a request queue when it is no longer 
needed
   773   * @work: pointer to the release_work member of the request queue to be 
released
   774   *
   775   * Description:
   776   * blk_release_queue is the counterpart of blk_init_queue(). It 
should be
   777   * called when a request queue is being released; typically when a 
block
   778   * device is being de-registered. Its primary task it to free the 
queue
   779   * itself.
   780   *
   781   * Notes:
   782   * The low level driver must have finished any outstanding requests 
first
   783   * via blk_cleanup_queue().
   784   *
   785   * Although blk_release_queue() may be called with preemption 
disabled,
   786   * __blk_release_queue() may sleep.
   787   */
   788  static void __blk_release_queue(struct work_struct *work)
   789  {
   790  struct request_queue *q = container_of(work, typeof(*q), 
release_work);
   791  
   792  if (test_bit(QUEUE_FLAG_POLL_STATS, &q->queue_flags))
   793  blk_stat_remove_callback(q, q->poll_cb);
   794  blk_stat_free_callback(q->poll_cb);
   795  
   796  blk_free_queue_stats(q->stats);
   797  
   798  blk_exit_rl(q, &q->root_rl);
   799  
   800  if (q->queue_tags)
   801  __blk_queue_free_tags(q);
   802  
   803  if (!q->mq_ops) {
   804  if (q->exit_rq_fn)
   805  q->exit_rq_fn(q, q->fq->flush_rq);
   806  blk_free_flush_queue(q->fq);
   807  } else {
   808  blk_mq_release(q);
   809  }
   810  
   811  blk_trace_shutdown(q);
   812  
   813  if (q->mq_ops)
   814  blk_mq_debugfs_unregister(q);
   815  
   816  if (q->bio_split)
   817  bioset_free(q->bio_split);
   818  
   819  spin_lock_irq(q->queue_lock);
 > 820  blkg_destroy_all(q);
   821  spin_unlock_irq(q->queue_lock);
   822  
   823  ida_simple_remove(&blk_queue_ida, q->id);
   824  call_rcu(&q->rcu_head, blk_free_queue_rcu);
   825  }
   826  

---
0-DAY kernel test infrastructure            Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


[PATCH] block/compat_ioctl: fix range check in BLKGETSIZE

2018-04-06 Thread Khazhismel Kumykov
kernel ulong and compat_ulong_t may not be the same width. Use the type
directly to eliminate mismatches.

This would result in truncation rather than -EFBIG in 32-bit mode for
large disks.

Signed-off-by: Khazhismel Kumykov 
---
 block/compat_ioctl.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/compat_ioctl.c b/block/compat_ioctl.c
index 6ca015f92766..3a2c77f07da8 100644
--- a/block/compat_ioctl.c
+++ b/block/compat_ioctl.c
@@ -388,7 +388,7 @@ long compat_blkdev_ioctl(struct file *file, unsigned cmd, 
unsigned long arg)
return 0;
case BLKGETSIZE:
size = i_size_read(bdev->bd_inode);
-   if ((size >> 9) > ~0UL)
+   if ((size >> 9) > ~((compat_ulong_t)0UL))
return -EFBIG;
return compat_put_ulong(arg, size >> 9);
 
-- 
2.17.0.484.g0c8726318c-goog
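
As a side note, a minimal userspace sketch of the arithmetic (illustrative
only; compat_ulong_t is typedef'd by hand here and this is not the kernel
code path) showing why the old ~0UL comparison can never fire on a 64-bit
kernel, while comparing against the 32-bit maximum catches sector counts
that compat_put_ulong() would otherwise truncate:

#include <stdio.h>
#include <stdint.h>

typedef uint32_t compat_ulong_t;	/* what a 32-bit userspace ulong holds */

int main(void)
{
	/* assume a 64-bit kernel: unsigned long is 64 bits wide */
	unsigned long long size = 3ULL << 40;	/* hypothetical 3 TiB disk, in bytes */
	unsigned long long sectors = size >> 9;	/* 512-byte sectors */

	/* old check: ~0UL is 2^64-1 here, so this never triggers */
	printf("old check fires: %d\n", sectors > ~0UL);
	/* new check: compare against the largest value a compat ulong can hold */
	printf("new check fires: %d\n",
	       sectors > (unsigned long long)~(compat_ulong_t)0);
	return 0;
}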



smime.p7s
Description: S/MIME Cryptographic Signature


Re: [PATCH V3 4/4] genirq/affinity: irq vector spread among online CPUs as far as possible

2018-04-06 Thread Thomas Gleixner
On Fri, 6 Apr 2018, Thomas Gleixner wrote:

> On Fri, 6 Apr 2018, Ming Lei wrote:
> > 
> > I will post V4 soon by using cpu_present_mask in the 1st stage irq spread.
> > And it should work fine for Kashyap's case in normal cases.
> 
> No need to resend. I've changed it already and will push it out after
> lunch.

No. Lunch did not last that long :)

I pushed out the lot to

  git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git irq/core

Please double check the modifications I did. The first related commit fixes
an existing error handling bug.

Thanks,

tglx


Re: [PATCH] blk-cgroup: remove entries in blkg_tree before queue release

2018-04-06 Thread kbuild test robot
Hi Alexandru,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on block/for-next]
[also build test ERROR on v4.16 next-20180406]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Alexandru-Moise/blk-cgroup-remove-entries-in-blkg_tree-before-queue-release/20180407-035957
base:   https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git 
for-next
config: x86_64-randconfig-x001-201813 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64 

All errors (new ones prefixed by >>):

   block/blk-sysfs.c: In function '__blk_release_queue':
>> block/blk-sysfs.c:820:2: error: implicit declaration of function 
>> 'blkg_destroy_all'; did you mean 'blk_stat_add'? 
>> [-Werror=implicit-function-declaration]
 blkg_destroy_all(q);
 ^~~~
 blk_stat_add
   cc1: some warnings being treated as errors

vim +820 block/blk-sysfs.c

   770  
   771  /**
   772   * __blk_release_queue - release a request queue when it is no longer 
needed
   773   * @work: pointer to the release_work member of the request queue to be 
released
   774   *
   775   * Description:
   776   * blk_release_queue is the counterpart of blk_init_queue(). It 
should be
   777   * called when a request queue is being released; typically when a 
block
   778   * device is being de-registered. Its primary task it to free the 
queue
   779   * itself.
   780   *
   781   * Notes:
   782   * The low level driver must have finished any outstanding requests 
first
   783   * via blk_cleanup_queue().
   784   *
   785   * Although blk_release_queue() may be called with preemption 
disabled,
   786   * __blk_release_queue() may sleep.
   787   */
   788  static void __blk_release_queue(struct work_struct *work)
   789  {
   790  struct request_queue *q = container_of(work, typeof(*q), 
release_work);
   791  
   792  if (test_bit(QUEUE_FLAG_POLL_STATS, &q->queue_flags))
   793  blk_stat_remove_callback(q, q->poll_cb);
   794  blk_stat_free_callback(q->poll_cb);
   795  
   796  blk_free_queue_stats(q->stats);
   797  
   798  blk_exit_rl(q, &q->root_rl);
   799  
   800  if (q->queue_tags)
   801  __blk_queue_free_tags(q);
   802  
   803  if (!q->mq_ops) {
   804  if (q->exit_rq_fn)
   805  q->exit_rq_fn(q, q->fq->flush_rq);
   806  blk_free_flush_queue(q->fq);
   807  } else {
   808  blk_mq_release(q);
   809  }
   810  
   811  blk_trace_shutdown(q);
   812  
   813  if (q->mq_ops)
   814  blk_mq_debugfs_unregister(q);
   815  
   816  if (q->bio_split)
   817  bioset_free(q->bio_split);
   818  
   819  spin_lock_irq(q->queue_lock);
 > 820  blkg_destroy_all(q);
   821  spin_unlock_irq(q->queue_lock);
   822  
   823  ida_simple_remove(&blk_queue_ida, q->id);
   824  call_rcu(&q->rcu_head, blk_free_queue_rcu);
   825  }
   826  

---
0-DAY kernel test infrastructure            Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


Re: [PATCH] loop: fix LOOP_GET_STATUS lock imbalance

2018-04-06 Thread Jens Axboe
On 4/6/18 10:57 AM, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> Commit 2d1d4c1e591f made loop_get_status() drop lo_ctl_mutex before
> returning, but the loop_get_status_old(), loop_get_status64(), and
> loop_get_status_compat() wrappers don't call loop_get_status() if the
> passed argument is NULL. The callers expect that the lock is dropped, so
> make sure we drop it in that case, too.

Applied, thanks Omar.

-- 
Jens Axboe



Re: Multi-Actuator SAS HDD First Look

2018-04-06 Thread Douglas Gilbert

On 2018-04-06 02:42 AM, Christoph Hellwig wrote:
> On Fri, Apr 06, 2018 at 08:24:18AM +0200, Hannes Reinecke wrote:
>> Ah. Far better.
>> What about delegating FORMAT UNIT to the control LUN, and not
>> implementing it for the individual disk LUNs?
>> That would make an even stronger case for having a control LUN;
>> with that there wouldn't be any problem with having to synchronize
>> across LUNs etc.
>
> It sounds to me like NVMe might be a much better model for this drive
> than SCSI, btw :)


So you found a document that outlines NVMe's architecture! Could you
share the url (no marketing BS, please)?


And a serious question ... How would you map NVMe's (in Linux)
subsystem number, controller device minor number, CNTLID field
(Identify ctl response) and namespace id onto the SCSI subsystem's
h:c:t:l ?

Doug Gilbert



Re: [PATCH] loop: fix LOOP_GET_STATUS lock imbalance

2018-04-06 Thread Omar Sandoval
On Fri, Apr 06, 2018 at 09:57:03AM -0700, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> Commit 2d1d4c1e591f made loop_get_status() drop lo_ctl_mutex before
> returning, but the loop_get_status_old(), loop_get_status64(), and
> loop_get_status_compat() wrappers don't call loop_get_status() if the
> passed argument is NULL. The callers expect that the lock is dropped, so
> make sure we drop it in that case, too.
> 
> Reported-by: syzbot+31e8daa8b3fc129e7...@syzkaller.appspotmail.com
> Fixes: 2d1d4c1e591f ("loop: don't call into filesystem while holding 
> lo_ctl_mutex")
> Signed-off-by: Omar Sandoval 
> ---
> Based on Linus' tree.
> 
>  drivers/block/loop.c | 33 ++---
>  1 file changed, 18 insertions(+), 15 deletions(-)

Also just pushed a regression test to blktests:

140ee15de9f3 ("loop: add ioctl lock imbalance regression test")


[PATCH] loop: fix LOOP_GET_STATUS lock imbalance

2018-04-06 Thread Omar Sandoval
From: Omar Sandoval 

Commit 2d1d4c1e591f made loop_get_status() drop lo_ctl_mutex before
returning, but the loop_get_status_old(), loop_get_status64(), and
loop_get_status_compat() wrappers don't call loop_get_status() if the
passed argument is NULL. The callers expect that the lock is dropped, so
make sure we drop it in that case, too.

Reported-by: syzbot+31e8daa8b3fc129e7...@syzkaller.appspotmail.com
Fixes: 2d1d4c1e591f ("loop: don't call into filesystem while holding 
lo_ctl_mutex")
Signed-off-by: Omar Sandoval 
---
Based on Linus' tree.

 drivers/block/loop.c | 33 ++---
 1 file changed, 18 insertions(+), 15 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 264abaaff662..9b476fd2bc41 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1283,12 +1283,13 @@ static int
 loop_get_status_old(struct loop_device *lo, struct loop_info __user *arg) {
struct loop_info info;
struct loop_info64 info64;
-   int err = 0;
+   int err;
 
-   if (!arg)
-   err = -EINVAL;
-   if (!err)
-   err = loop_get_status(lo, &info64);
+   if (!arg) {
+   mutex_unlock(&lo->lo_ctl_mutex);
+   return -EINVAL;
+   }
+   err = loop_get_status(lo, &info64);
if (!err)
err = loop_info64_to_old(&info64, &info);
if (!err && copy_to_user(arg, &info, sizeof(info)))
@@ -1300,12 +1301,13 @@ loop_get_status_old(struct loop_device *lo, struct 
loop_info __user *arg) {
 static int
 loop_get_status64(struct loop_device *lo, struct loop_info64 __user *arg) {
struct loop_info64 info64;
-   int err = 0;
+   int err;
 
-   if (!arg)
-   err = -EINVAL;
-   if (!err)
-   err = loop_get_status(lo, &info64);
+   if (!arg) {
+   mutex_unlock(&lo->lo_ctl_mutex);
+   return -EINVAL;
+   }
+   err = loop_get_status(lo, &info64);
if (!err && copy_to_user(arg, &info64, sizeof(info64)))
err = -EFAULT;
 
@@ -1529,12 +1531,13 @@ loop_get_status_compat(struct loop_device *lo,
   struct compat_loop_info __user *arg)
 {
struct loop_info64 info64;
-   int err = 0;
+   int err;
 
-   if (!arg)
-   err = -EINVAL;
-   if (!err)
-   err = loop_get_status(lo, &info64);
+   if (!arg) {
+   mutex_unlock(&lo->lo_ctl_mutex);
+   return -EINVAL;
+   }
+   err = loop_get_status(lo, &info64);
if (!err)
err = loop_info64_to_compat(&info64, arg);
return err;
-- 
2.17.0
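
As an aside, a tiny pthreads sketch (hypothetical and standalone, not the
loop driver itself) of the lock-imbalance pattern being fixed here: when
the callee is documented to drop the lock before returning, every
early-return path in the wrapper has to drop it too, otherwise the outer
caller ends up back in userspace with the mutex still held:

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t ctl_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Callee: consumes the lock, i.e. always unlocks before returning. */
static int get_status(int *out)
{
	*out = 42;
	pthread_mutex_unlock(&ctl_mutex);
	return 0;
}

/* Wrapper: must also unlock on its own early-return path. */
static int get_status_wrapper(int *arg)
{
	int val;

	if (!arg) {
		pthread_mutex_unlock(&ctl_mutex);	/* keep the lock balanced */
		return -1;
	}
	if (get_status(&val))
		return -1;
	*arg = val;
	return 0;
}

int main(void)
{
	int v;

	pthread_mutex_lock(&ctl_mutex);		/* models the ioctl entry taking the mutex */
	printf("ret=%d\n", get_status_wrapper(&v));	/* lock released inside get_status() */

	pthread_mutex_lock(&ctl_mutex);		/* succeeds only because it was released */
	printf("ret=%d\n", get_status_wrapper(NULL));	/* lock released on the early-return path */
	return 0;
}

Build with -pthread; if the early-return unlock in the wrapper were
missing, the mutex taken before the NULL call would stay held after the
wrapper returned -- the userspace analogue of the "lock held when
returning to user space" report.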



Re: [PATCH] blk-mq: only run mapped hw queues in blk_mq_run_hw_queues()

2018-04-06 Thread Ming Lei
On Fri, Apr 06, 2018 at 05:11:53PM +0200, Christian Borntraeger wrote:
> 
> 
> On 04/06/2018 04:58 PM, Ming Lei wrote:
> > On Fri, Apr 06, 2018 at 04:26:49PM +0200, Christian Borntraeger wrote:
> >>
> >>
> >> On 04/06/2018 03:41 PM, Ming Lei wrote:
> >>> On Fri, Apr 06, 2018 at 12:19:19PM +0200, Christian Borntraeger wrote:
> 
> 
>  On 04/06/2018 11:23 AM, Ming Lei wrote:
> > On Fri, Apr 06, 2018 at 10:51:28AM +0200, Christian Borntraeger wrote:
> >>
> >>
> >> On 04/06/2018 10:41 AM, Ming Lei wrote:
> >>> On Thu, Apr 05, 2018 at 07:39:56PM +0200, Christian Borntraeger wrote:
> 
> 
>  On 04/05/2018 06:11 PM, Ming Lei wrote:
> >>
> >> Could you please apply the following patch and provide the dmesg 
> >> boot log?
> >
> > And please post out the 'lscpu' log together from the test machine 
> > too.
> 
>  attached.
> 
>  As I said before this seems to go way with CONFIG_NR_CPUS=64 or 
>  smaller.
>  We have 282 nr_cpu_ids here (max 141CPUs on that z13 with SMT2) but 
>  only 8 Cores
>  == 16 threads.
> >>>
> >>> OK, thanks!
> >>>
> >>> The most weird thing is that hctx->next_cpu is computed as 512 since
> >>> nr_cpu_id is 282, and hctx->next_cpu should have pointed to one of
> >>> possible CPU.
> >>>
> >>> Looks like it is a s390 specific issue, since I can setup one queue
> >>> which has same mapping with yours:
> >>>
> >>>   - nr_cpu_id is 282
> >>>   - CPU 0~15 is online
> >>>   - 64 queues null_blk
> >>>   - still run all hw queues in .complete handler
> >>>
> >>> But can't reproduce this issue at all.
> >>>
> >>> So please test the following patch, which may tell us why 
> >>> hctx->next_cpu
> >>> is computed wrong:
> >>
> >> I see things like
> >>
> >> [8.196907] wrong next_cpu 512, blk_mq_map_swqueue, first_and
> >> [8.196910] wrong next_cpu 512, blk_mq_map_swqueue, first_and
> >> [8.196912] wrong next_cpu 512, blk_mq_map_swqueue, first_and
> >> [8.196913] wrong next_cpu 512, blk_mq_map_swqueue, first_and
> >> [8.196914] wrong next_cpu 512, blk_mq_map_swqueue, first_and
> >> [8.196915] wrong next_cpu 512, blk_mq_map_swqueue, first_and
> >> [8.196916] wrong next_cpu 512, blk_mq_map_swqueue, first_and
> >> [8.196916] wrong next_cpu 512, blk_mq_map_swqueue, first_and
> >> [8.196917] wrong next_cpu 512, blk_mq_map_swqueue, first_and
> >> [8.196918] wrong next_cpu 512, blk_mq_map_swqueue, first_and
> >>
> >> which is exactly what happens if the find and and operation fails 
> >> (returns size of bitmap).
> >
> > Given both 'cpu_online_mask' and 'hctx->cpumask' are shown as correct
> > in your previous debug log, it means the following function returns
> > totally wrong result on S390.
> >
> > cpumask_first_and(hctx->cpumask, cpu_online_mask);
> >
> > The debugfs log shows that each hctx->cpumask includes one online
> > CPU(0~15).
> 
>  Really? the last log (with the latest patch applied  shows a lot of 
>  contexts
>  that do not have CPUs in 0-15:
> 
>  e.g. 
>  [4.049828] dump CPUs mapped to this hctx:
>  [4.049829] 18 
>  [4.049829] 82 
>  [4.049830] 146 
>  [4.049830] 210 
>  [4.049831] 274 
> >>>
> >>> That won't be an issue, since no IO can be submitted from these offline
> >>> CPUs, then these hctx shouldn't have been run at all.
> >>>
> >>> But hctx->next_cpu can be set as 512 for these inactive hctx in
> >>> blk_mq_map_swqueue(), then please test the attached patch, and if
> >>> hctx->next_cpu is still set as 512, something is still wrong.
> >>
> >>
> >> With this patch I no longer see the "run queue from wrong CPU x, hctx 
> >> active" messages.
> >> your debug code still triggers, though.
> >>
> >> wrong next_cpu 512, blk_mq_hctx_next_cpu, first_and
> >> wrong next_cpu 512, blk_mq_hctx_next_cpu, next_and
> >>
> >> If we would remove the debug code then dmesg would be clean it seems.
> > 
> > That is still a bit strange, since for any inactive hctx(without online
> > CPU mapped), blk_mq_run_hw_queue() will check blk_mq_hctx_has_pending()
> 
> I think for next_and it is reasonable to see this, as the next_and will return
> 512 after we have used the last one. In fact the code does call first_and in
> that case for a reason, no?

It is possible to hit the 'first_and' dump when there aren't any online CPUs
mapped to this hctx.

But my question is that for this case, there shouldn't be any IO queued
for this hctx, and blk_mq_hctx_has_pending() has been called to check
that, so blk_mq_hctx_next_cpu() should only have been called when
blk_mq_hctx_has_pending() in blk_mq_run_hw_queue() is true.


Thanks,
Ming


Re: [PATCH] blk-mq: only run mapped hw queues in blk_mq_run_hw_queues()

2018-04-06 Thread Christian Borntraeger


On 04/06/2018 04:58 PM, Ming Lei wrote:
> On Fri, Apr 06, 2018 at 04:26:49PM +0200, Christian Borntraeger wrote:
>>
>>
>> On 04/06/2018 03:41 PM, Ming Lei wrote:
>>> On Fri, Apr 06, 2018 at 12:19:19PM +0200, Christian Borntraeger wrote:


 On 04/06/2018 11:23 AM, Ming Lei wrote:
> On Fri, Apr 06, 2018 at 10:51:28AM +0200, Christian Borntraeger wrote:
>>
>>
>> On 04/06/2018 10:41 AM, Ming Lei wrote:
>>> On Thu, Apr 05, 2018 at 07:39:56PM +0200, Christian Borntraeger wrote:


 On 04/05/2018 06:11 PM, Ming Lei wrote:
>>
>> Could you please apply the following patch and provide the dmesg 
>> boot log?
>
> And please post out the 'lscpu' log together from the test machine 
> too.

 attached.

 As I said before this seems to go way with CONFIG_NR_CPUS=64 or 
 smaller.
 We have 282 nr_cpu_ids here (max 141CPUs on that z13 with SMT2) but 
 only 8 Cores
 == 16 threads.
>>>
>>> OK, thanks!
>>>
>>> The most weird thing is that hctx->next_cpu is computed as 512 since
>>> nr_cpu_id is 282, and hctx->next_cpu should have pointed to one of
>>> possible CPU.
>>>
>>> Looks like it is a s390 specific issue, since I can setup one queue
>>> which has same mapping with yours:
>>>
>>> - nr_cpu_id is 282
>>> - CPU 0~15 is online
>>> - 64 queues null_blk
>>> - still run all hw queues in .complete handler
>>>
>>> But can't reproduce this issue at all.
>>>
>>> So please test the following patch, which may tell us why hctx->next_cpu
>>> is computed wrong:
>>
>> I see things like
>>
>> [8.196907] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>> [8.196910] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>> [8.196912] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>> [8.196913] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>> [8.196914] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>> [8.196915] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>> [8.196916] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>> [8.196916] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>> [8.196917] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>> [8.196918] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>>
>> which is exactly what happens if the find and and operation fails 
>> (returns size of bitmap).
>
> Given both 'cpu_online_mask' and 'hctx->cpumask' are shown as correct
> in your previous debug log, it means the following function returns
> totally wrong result on S390.
>
>   cpumask_first_and(hctx->cpumask, cpu_online_mask);
>
> The debugfs log shows that each hctx->cpumask includes one online
> CPU(0~15).

 Really? the last log (with the latest patch applied  shows a lot of 
 contexts
 that do not have CPUs in 0-15:

 e.g. 
 [4.049828] dump CPUs mapped to this hctx:
 [4.049829] 18 
 [4.049829] 82 
 [4.049830] 146 
 [4.049830] 210 
 [4.049831] 274 
>>>
>>> That won't be an issue, since no IO can be submitted from these offline
>>> CPUs, then these hctx shouldn't have been run at all.
>>>
>>> But hctx->next_cpu can be set as 512 for these inactive hctx in
>>> blk_mq_map_swqueue(), then please test the attached patch, and if
>>> hctx->next_cpu is still set as 512, something is still wrong.
>>
>>
>> With this patch I no longer see the "run queue from wrong CPU x, hctx 
>> active" messages.
>> your debug code still triggers, though.
>>
>> wrong next_cpu 512, blk_mq_hctx_next_cpu, first_and
>> wrong next_cpu 512, blk_mq_hctx_next_cpu, next_and
>>
>> If we would remove the debug code then dmesg would be clean it seems.
> 
> That is still a bit strange, since for any inactive hctx(without online
> CPU mapped), blk_mq_run_hw_queue() will check blk_mq_hctx_has_pending()

I think for next_and it is reasonable to see this, as the next_and will return
512 after we have used the last one. In fact the code does call first_and in
that case for a reason, no?


> first. And there shouldn't be any pending IO for all inactive hctx
> in your case, so looks blk_mq_hctx_next_cpu() shouldn't be called for
> inactive hctx.
> 
> I will prepare one patchset and post out soon, and hope all these issues
> can be covered.
> 
> Thanks,
> Ming
> 



[PATCH] blk-cgroup: remove entries in blkg_tree before queue release

2018-04-06 Thread Alexandru Moise
The q->id is used as an index within the blkg_tree radix tree.

If the entry is not released before the id is reclaimed via blk_queue_ida,
blkcg_init_queue() in a different driver that is later handed the same id
can fail, because an entry already exists at that index in the radix tree.

Signed-off-by: Alexandru Moise <00moses.alexande...@gmail.com>
---
 block/blk-cgroup.c | 2 +-
 block/blk-sysfs.c  | 4 
 include/linux/blk-cgroup.h | 2 ++
 3 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 1c16694ae145..224e937dbb59 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -369,7 +369,7 @@ static void blkg_destroy(struct blkcg_gq *blkg)
  *
  * Destroy all blkgs associated with @q.
  */
-static void blkg_destroy_all(struct request_queue *q)
+void blkg_destroy_all(struct request_queue *q)
 {
struct blkcg_gq *blkg, *n;
 
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index d00d1b0ec109..a72866458f22 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -816,6 +816,10 @@ static void __blk_release_queue(struct work_struct *work)
if (q->bio_split)
bioset_free(q->bio_split);
 
+   spin_lock_irq(q->queue_lock);
+   blkg_destroy_all(q);
+   spin_unlock_irq(q->queue_lock);
+
ida_simple_remove(&blk_queue_ida, q->id);
call_rcu(&q->rcu_head, blk_free_queue_rcu);
 }
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 6c666fd7de3c..4470fdb6ea8f 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -179,6 +179,8 @@ struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
 int blkcg_init_queue(struct request_queue *q);
 void blkcg_drain_queue(struct request_queue *q);
 void blkcg_exit_queue(struct request_queue *q);
+void blkg_destroy_all(struct request_queue *q);
+
 
 /* Blkio controller policy registration */
 int blkcg_policy_register(struct blkcg_policy *pol);
-- 
2.16.3
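
For illustration, a self-contained userspace sketch (hypothetical; plain
arrays stand in for the blkg_tree radix tree and for blk_queue_ida, this is
not the kernel code) of the ordering problem the patch describes: if the id
is returned to the allocator while a stale tree entry keyed by that id is
left behind, the next queue can be handed the same id and its insertion
fails:

#include <stdio.h>

#define MAX_IDS 4

static int tree[MAX_IDS];	/* stands in for the blkg_tree radix tree */
static int id_in_use[MAX_IDS];	/* stands in for blk_queue_ida */

static int alloc_id(void)
{
	for (int i = 0; i < MAX_IDS; i++) {
		if (!id_in_use[i]) {
			id_in_use[i] = 1;
			return i;
		}
	}
	return -1;
}

static int tree_insert(int id)
{
	if (tree[id])
		return -1;	/* slot already occupied: init fails */
	tree[id] = 1;
	return 0;
}

int main(void)
{
	int id = alloc_id();
	tree_insert(id);	/* models blkcg_init_queue() */

	/* Buggy release order: give the id back while the tree entry remains. */
	id_in_use[id] = 0;	/* models ida_simple_remove() */
	/* tree[id] is never cleared: the blkg_destroy_all() step was skipped */

	int id2 = alloc_id();	/* a different driver's queue gets the same id */
	printf("new id %d: insert %s\n", id2,
	       tree_insert(id2) ? "fails" : "succeeds");
	return 0;
}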



Re: WARNING: lock held when returning to user space!

2018-04-06 Thread Jens Axboe
On 4/6/18 8:57 AM, Dmitry Vyukov wrote:
> On Fri, Apr 6, 2018 at 4:27 PM, Jens Axboe  wrote:
>> On 4/6/18 7:02 AM, syzbot wrote:
>>> Hello,
>>>
>>> syzbot hit the following crash on upstream commit
>>> 38c23685b273cfb4ccf31a199feccce3bdcb5d83 (Fri Apr 6 04:29:35 2018 +0000)
>>> Merge tag 'armsoc-drivers' of
>>> git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
>>> syzbot dashboard link:
>>> https://syzkaller.appspot.com/bug?extid=31e8daa8b3fc129e75f2
>>>
>>> So far this crash happened 9 times on upstream.
>>> C reproducer: https://syzkaller.appspot.com/x/repro.c?id=6407930337296384
>>> syzkaller reproducer:
>>> https://syzkaller.appspot.com/x/repro.syz?id=4942413340606464
>>> Raw console output:
>>> https://syzkaller.appspot.com/x/log.txt?id=4764483918495744
>>> Kernel config:
>>> https://syzkaller.appspot.com/x/.config?id=-5813481738265533882
>>> compiler: gcc (GCC) 8.0.1 20180301 (experimental)
>>>
>>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>>> Reported-by: syzbot+31e8daa8b3fc129e7...@syzkaller.appspotmail.com
>>> It will help syzbot understand when the bug is fixed. See footer for
>>> details.
>>> If you forward the report, please keep this part and the footer.
>>>
>>>
>>> 
>>> WARNING: lock held when returning to user space!
>>> 4.16.0+ #3 Not tainted
>>> 
>>> syzkaller433111/4462 is leaving the kernel with locks still held!
>>> 1 lock held by syzkaller433111/4462:
>>>   #0: 03a06fae (&lo->lo_ctl_mutex/1){+.+.}, at: lo_ioctl+0x8d/0x1ec0
>>> drivers/block/loop.c:1363
>>
>> Is this a new regression? Omar did just fiddle with the locking a bit,
>> seems suspicious.
> 
> Looking at:
> https://syzkaller.appspot.com/bug?extid=31e8daa8b3fc129e75f2
> It first happened 4 hours ago and 9 times since then, so probably a
> just introduced regression.

After writing that, I saw the discussion in another
thread ("INFO: task hung in lo_ioctl"), so I think we can definitely say
that it's a recently introduced regression in loop due to the killable
lock changes.

-- 
Jens Axboe



Re: [PATCH] blk-mq: only run mapped hw queues in blk_mq_run_hw_queues()

2018-04-06 Thread Ming Lei
On Fri, Apr 06, 2018 at 04:26:49PM +0200, Christian Borntraeger wrote:
> 
> 
> On 04/06/2018 03:41 PM, Ming Lei wrote:
> > On Fri, Apr 06, 2018 at 12:19:19PM +0200, Christian Borntraeger wrote:
> >>
> >>
> >> On 04/06/2018 11:23 AM, Ming Lei wrote:
> >>> On Fri, Apr 06, 2018 at 10:51:28AM +0200, Christian Borntraeger wrote:
> 
> 
>  On 04/06/2018 10:41 AM, Ming Lei wrote:
> > On Thu, Apr 05, 2018 at 07:39:56PM +0200, Christian Borntraeger wrote:
> >>
> >>
> >> On 04/05/2018 06:11 PM, Ming Lei wrote:
> 
>  Could you please apply the following patch and provide the dmesg 
>  boot log?
> >>>
> >>> And please post out the 'lscpu' log together from the test machine 
> >>> too.
> >>
> >> attached.
> >>
> >> As I said before this seems to go way with CONFIG_NR_CPUS=64 or 
> >> smaller.
> >> We have 282 nr_cpu_ids here (max 141CPUs on that z13 with SMT2) but 
> >> only 8 Cores
> >> == 16 threads.
> >
> > OK, thanks!
> >
> > The most weird thing is that hctx->next_cpu is computed as 512 since
> > nr_cpu_id is 282, and hctx->next_cpu should have pointed to one of
> > possible CPU.
> >
> > Looks like it is a s390 specific issue, since I can setup one queue
> > which has same mapping with yours:
> >
> > - nr_cpu_id is 282
> > - CPU 0~15 is online
> > - 64 queues null_blk
> > - still run all hw queues in .complete handler
> >
> > But can't reproduce this issue at all.
> >
> > So please test the following patch, which may tell us why hctx->next_cpu
> > is computed wrong:
> 
>  I see things like
> 
>  [8.196907] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>  [8.196910] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>  [8.196912] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>  [8.196913] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>  [8.196914] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>  [8.196915] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>  [8.196916] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>  [8.196916] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>  [8.196917] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>  [8.196918] wrong next_cpu 512, blk_mq_map_swqueue, first_and
> 
>  which is exactly what happens if the find and and operation fails 
>  (returns size of bitmap).
> >>>
> >>> Given both 'cpu_online_mask' and 'hctx->cpumask' are shown as correct
> >>> in your previous debug log, it means the following function returns
> >>> totally wrong result on S390.
> >>>
> >>>   cpumask_first_and(hctx->cpumask, cpu_online_mask);
> >>>
> >>> The debugfs log shows that each hctx->cpumask includes one online
> >>> CPU(0~15).
> >>
> >> Really? the last log (with the latest patch applied  shows a lot of 
> >> contexts
> >> that do not have CPUs in 0-15:
> >>
> >> e.g. 
> >> [4.049828] dump CPUs mapped to this hctx:
> >> [4.049829] 18 
> >> [4.049829] 82 
> >> [4.049830] 146 
> >> [4.049830] 210 
> >> [4.049831] 274 
> > 
> > That won't be an issue, since no IO can be submitted from these offline
> > CPUs, then these hctx shouldn't have been run at all.
> > 
> > But hctx->next_cpu can be set as 512 for these inactive hctx in
> > blk_mq_map_swqueue(), then please test the attached patch, and if
> > hctx->next_cpu is still set as 512, something is still wrong.
> 
> 
> With this patch I no longer see the "run queue from wrong CPU x, hctx active" 
> messages.
> your debug code still triggers, though.
> 
> wrong next_cpu 512, blk_mq_hctx_next_cpu, first_and
> wrong next_cpu 512, blk_mq_hctx_next_cpu, next_and
> 
> If we would remove the debug code then dmesg would be clean it seems.

That is still a bit strange, since for any inactive hctx(without online
CPU mapped), blk_mq_run_hw_queue() will check blk_mq_hctx_has_pending()
first. And there shouldn't be any pending IO for all inactive hctx
in your case, so looks blk_mq_hctx_next_cpu() shouldn't be called for
inactive hctx.

I will prepare one patchset and post out soon, and hope all these issues
can be covered.

Thanks,
Ming


Re: WARNING: lock held when returning to user space!

2018-04-06 Thread Dmitry Vyukov
On Fri, Apr 6, 2018 at 4:27 PM, Jens Axboe  wrote:
> On 4/6/18 7:02 AM, syzbot wrote:
>> Hello,
>>
>> syzbot hit the following crash on upstream commit
>> 38c23685b273cfb4ccf31a199feccce3bdcb5d83 (Fri Apr 6 04:29:35 2018 +0000)
>> Merge tag 'armsoc-drivers' of
>> git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
>> syzbot dashboard link:
>> https://syzkaller.appspot.com/bug?extid=31e8daa8b3fc129e75f2
>>
>> So far this crash happened 9 times on upstream.
>> C reproducer: https://syzkaller.appspot.com/x/repro.c?id=6407930337296384
>> syzkaller reproducer:
>> https://syzkaller.appspot.com/x/repro.syz?id=4942413340606464
>> Raw console output:
>> https://syzkaller.appspot.com/x/log.txt?id=4764483918495744
>> Kernel config:
>> https://syzkaller.appspot.com/x/.config?id=-5813481738265533882
>> compiler: gcc (GCC) 8.0.1 20180301 (experimental)
>>
>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>> Reported-by: syzbot+31e8daa8b3fc129e7...@syzkaller.appspotmail.com
>> It will help syzbot understand when the bug is fixed. See footer for
>> details.
>> If you forward the report, please keep this part and the footer.
>>
>>
>> 
>> WARNING: lock held when returning to user space!
>> 4.16.0+ #3 Not tainted
>> 
>> syzkaller433111/4462 is leaving the kernel with locks still held!
>> 1 lock held by syzkaller433111/4462:
>>   #0: 03a06fae (&lo->lo_ctl_mutex/1){+.+.}, at: lo_ioctl+0x8d/0x1ec0
>> drivers/block/loop.c:1363
>
> Is this a new regression? Omar did just fiddle with the locking a bit,
> seems suspicious.

Looking at:
https://syzkaller.appspot.com/bug?extid=31e8daa8b3fc129e75f2
It first happened 4 hours ago and 9 times since then, so probably a
just introduced regression.


> --
> Jens Axboe
>
> --
> You received this message because you are subscribed to the Google Groups 
> "syzkaller-bugs" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to syzkaller-bugs+unsubscr...@googlegroups.com.
> To view this discussion on the web visit 
> https://groups.google.com/d/msgid/syzkaller-bugs/0e998b77-14f0-aee0-8d32-bc1dd96fcc4c%40kernel.dk.
> For more options, visit https://groups.google.com/d/optout.


Re: WARNING: lock held when returning to user space!

2018-04-06 Thread Jens Axboe
On 4/6/18 7:02 AM, syzbot wrote:
> Hello,
> 
> syzbot hit the following crash on upstream commit
> 38c23685b273cfb4ccf31a199feccce3bdcb5d83 (Fri Apr 6 04:29:35 2018 +0000)
> Merge tag 'armsoc-drivers' of  
> git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
> syzbot dashboard link:  
> https://syzkaller.appspot.com/bug?extid=31e8daa8b3fc129e75f2
> 
> So far this crash happened 9 times on upstream.
> C reproducer: https://syzkaller.appspot.com/x/repro.c?id=6407930337296384
> syzkaller reproducer:  
> https://syzkaller.appspot.com/x/repro.syz?id=4942413340606464
> Raw console output:  
> https://syzkaller.appspot.com/x/log.txt?id=4764483918495744
> Kernel config:  
> https://syzkaller.appspot.com/x/.config?id=-5813481738265533882
> compiler: gcc (GCC) 8.0.1 20180301 (experimental)
> 
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+31e8daa8b3fc129e7...@syzkaller.appspotmail.com
> It will help syzbot understand when the bug is fixed. See footer for  
> details.
> If you forward the report, please keep this part and the footer.
> 
> 
> 
> WARNING: lock held when returning to user space!
> 4.16.0+ #3 Not tainted
> 
> syzkaller433111/4462 is leaving the kernel with locks still held!
> 1 lock held by syzkaller433111/4462:
>   #0: 03a06fae (&lo->lo_ctl_mutex/1){+.+.}, at: lo_ioctl+0x8d/0x1ec0  
> drivers/block/loop.c:1363

Is this a new regression? Omar did just fiddle with the locking a bit,
seems suspicious.

-- 
Jens Axboe



Re: [PATCH] blk-mq: only run mapped hw queues in blk_mq_run_hw_queues()

2018-04-06 Thread Christian Borntraeger


On 04/06/2018 03:41 PM, Ming Lei wrote:
> On Fri, Apr 06, 2018 at 12:19:19PM +0200, Christian Borntraeger wrote:
>>
>>
>> On 04/06/2018 11:23 AM, Ming Lei wrote:
>>> On Fri, Apr 06, 2018 at 10:51:28AM +0200, Christian Borntraeger wrote:


 On 04/06/2018 10:41 AM, Ming Lei wrote:
> On Thu, Apr 05, 2018 at 07:39:56PM +0200, Christian Borntraeger wrote:
>>
>>
>> On 04/05/2018 06:11 PM, Ming Lei wrote:

 Could you please apply the following patch and provide the dmesg boot 
 log?
>>>
>>> And please post out the 'lscpu' log together from the test machine too.
>>
>> attached.
>>
>> As I said before this seems to go way with CONFIG_NR_CPUS=64 or smaller.
>> We have 282 nr_cpu_ids here (max 141CPUs on that z13 with SMT2) but only 
>> 8 Cores
>> == 16 threads.
>
> OK, thanks!
>
> The most weird thing is that hctx->next_cpu is computed as 512 since
> nr_cpu_id is 282, and hctx->next_cpu should have pointed to one of
> possible CPU.
>
> Looks like it is a s390 specific issue, since I can setup one queue
> which has same mapping with yours:
>
>   - nr_cpu_id is 282
>   - CPU 0~15 is online
>   - 64 queues null_blk
>   - still run all hw queues in .complete handler
>
> But can't reproduce this issue at all.
>
> So please test the following patch, which may tell us why hctx->next_cpu
> is computed wrong:

 I see things like

 [8.196907] wrong next_cpu 512, blk_mq_map_swqueue, first_and
 [8.196910] wrong next_cpu 512, blk_mq_map_swqueue, first_and
 [8.196912] wrong next_cpu 512, blk_mq_map_swqueue, first_and
 [8.196913] wrong next_cpu 512, blk_mq_map_swqueue, first_and
 [8.196914] wrong next_cpu 512, blk_mq_map_swqueue, first_and
 [8.196915] wrong next_cpu 512, blk_mq_map_swqueue, first_and
 [8.196916] wrong next_cpu 512, blk_mq_map_swqueue, first_and
 [8.196916] wrong next_cpu 512, blk_mq_map_swqueue, first_and
 [8.196917] wrong next_cpu 512, blk_mq_map_swqueue, first_and
 [8.196918] wrong next_cpu 512, blk_mq_map_swqueue, first_and

 which is exactly what happens if the find and and operation fails (returns 
 size of bitmap).
>>>
>>> Given both 'cpu_online_mask' and 'hctx->cpumask' are shown as correct
>>> in your previous debug log, it means the following function returns
>>> totally wrong result on S390.
>>>
>>> cpumask_first_and(hctx->cpumask, cpu_online_mask);
>>>
>>> The debugfs log shows that each hctx->cpumask includes one online
>>> CPU(0~15).
>>
>> Really? the last log (with the latest patch applied  shows a lot of contexts
>> that do not have CPUs in 0-15:
>>
>> e.g. 
>> [4.049828] dump CPUs mapped to this hctx:
>> [4.049829] 18 
>> [4.049829] 82 
>> [4.049830] 146 
>> [4.049830] 210 
>> [4.049831] 274 
> 
> That won't be an issue, since no IO can be submitted from these offline
> CPUs, then these hctx shouldn't have been run at all.
> 
> But hctx->next_cpu can be set as 512 for these inactive hctx in
> blk_mq_map_swqueue(), then please test the attached patch, and if
> hctx->next_cpu is still set as 512, something is still wrong.


With this patch I no longer see the "run queue from wrong CPU x, hctx active" 
messages.
your debug code still triggers, though.

wrong next_cpu 512, blk_mq_hctx_next_cpu, first_and
wrong next_cpu 512, blk_mq_hctx_next_cpu, next_and

If we would remove the debug code then dmesg would be clean it seems.


> ---
> 
> diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
> index 9f8cffc8a701..638ab5c11b3c 100644
> --- a/block/blk-mq-cpumap.c
> +++ b/block/blk-mq-cpumap.c
> @@ -14,13 +14,12 @@
>  #include "blk.h"
>  #include "blk-mq.h"
> 
> +/*
> + * Given there isn't CPU hotplug handler in blk-mq, map all CPUs to
> + * queues even it isn't present yet.
> + */
>  static int cpu_to_queue_index(unsigned int nr_queues, const int cpu)
>  {
> - /*
> -  * Non present CPU will be mapped to queue index 0.
> -  */
> - if (!cpu_present(cpu))
> - return 0;
>   return cpu % nr_queues;
>  }
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 90838e998f66..1a834d96a718 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1343,6 +1343,13 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx 
> *hctx)
>   hctx_unlock(hctx, srcu_idx);
>  }
> 
> +static void check_next_cpu(int next_cpu, const char *str1, const char *str2)
> +{
> + if (next_cpu > nr_cpu_ids)
> + printk_ratelimited("wrong next_cpu %d, %s, %s\n",
> + next_cpu, str1, str2);
> +}
> +
>  /*
>   * It'd be great if the workqueue API had a way to pass
>   * in a mask and had some smarts for more clever placement.
> @@ -1352,26 +1359,29 @@ static void __blk_mq_run_hw_queue(struct 
> blk_mq_hw_ctx *hctx)
>  static int blk_mq_hctx_next_cpu(struct 

Re: [PATCH] blk-mq: only run mapped hw queues in blk_mq_run_hw_queues()

2018-04-06 Thread Ming Lei
On Fri, Apr 06, 2018 at 12:19:19PM +0200, Christian Borntraeger wrote:
> 
> 
> On 04/06/2018 11:23 AM, Ming Lei wrote:
> > On Fri, Apr 06, 2018 at 10:51:28AM +0200, Christian Borntraeger wrote:
> >>
> >>
> >> On 04/06/2018 10:41 AM, Ming Lei wrote:
> >>> On Thu, Apr 05, 2018 at 07:39:56PM +0200, Christian Borntraeger wrote:
> 
> 
>  On 04/05/2018 06:11 PM, Ming Lei wrote:
> >>
> >> Could you please apply the following patch and provide the dmesg boot 
> >> log?
> >
> > And please post out the 'lscpu' log together from the test machine too.
> 
>  attached.
> 
>  As I said before this seems to go way with CONFIG_NR_CPUS=64 or smaller.
>  We have 282 nr_cpu_ids here (max 141CPUs on that z13 with SMT2) but only 
>  8 Cores
>  == 16 threads.
> >>>
> >>> OK, thanks!
> >>>
> >>> The most weird thing is that hctx->next_cpu is computed as 512 since
> >>> nr_cpu_id is 282, and hctx->next_cpu should have pointed to one of
> >>> possible CPU.
> >>>
> >>> Looks like it is a s390 specific issue, since I can setup one queue
> >>> which has same mapping with yours:
> >>>
> >>>   - nr_cpu_id is 282
> >>>   - CPU 0~15 is online
> >>>   - 64 queues null_blk
> >>>   - still run all hw queues in .complete handler
> >>>
> >>> But can't reproduce this issue at all.
> >>>
> >>> So please test the following patch, which may tell us why hctx->next_cpu
> >>> is computed wrong:
> >>
> >> I see things like
> >>
> >> [8.196907] wrong next_cpu 512, blk_mq_map_swqueue, first_and
> >> [8.196910] wrong next_cpu 512, blk_mq_map_swqueue, first_and
> >> [8.196912] wrong next_cpu 512, blk_mq_map_swqueue, first_and
> >> [8.196913] wrong next_cpu 512, blk_mq_map_swqueue, first_and
> >> [8.196914] wrong next_cpu 512, blk_mq_map_swqueue, first_and
> >> [8.196915] wrong next_cpu 512, blk_mq_map_swqueue, first_and
> >> [8.196916] wrong next_cpu 512, blk_mq_map_swqueue, first_and
> >> [8.196916] wrong next_cpu 512, blk_mq_map_swqueue, first_and
> >> [8.196917] wrong next_cpu 512, blk_mq_map_swqueue, first_and
> >> [8.196918] wrong next_cpu 512, blk_mq_map_swqueue, first_and
> >>
> >> which is exactly what happens if the find and and operation fails (returns 
> >> size of bitmap).
> > 
> > Given both 'cpu_online_mask' and 'hctx->cpumask' are shown as correct
> > in your previous debug log, it means the following function returns
> > totally wrong result on S390.
> > 
> > cpumask_first_and(hctx->cpumask, cpu_online_mask);
> > 
> > The debugfs log shows that each hctx->cpumask includes one online
> > CPU(0~15).
> 
> Really? the last log (with the latest patch applied  shows a lot of contexts
> that do not have CPUs in 0-15:
> 
> e.g. 
> [4.049828] dump CPUs mapped to this hctx:
> [4.049829] 18 
> [4.049829] 82 
> [4.049830] 146 
> [4.049830] 210 
> [4.049831] 274 

That won't be an issue, since no IO can be submitted from these offline
CPUs, then these hctx shouldn't have been run at all.

But hctx->next_cpu can be set as 512 for these inactive hctx in
blk_mq_map_swqueue(), then please test the attached patch, and if
hctx->next_cpu is still set as 512, something is still wrong.

---

diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index 9f8cffc8a701..638ab5c11b3c 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -14,13 +14,12 @@
 #include "blk.h"
 #include "blk-mq.h"
 
+/*
+ * Given there isn't CPU hotplug handler in blk-mq, map all CPUs to
+ * queues even it isn't present yet.
+ */
 static int cpu_to_queue_index(unsigned int nr_queues, const int cpu)
 {
-   /*
-* Non present CPU will be mapped to queue index 0.
-*/
-   if (!cpu_present(cpu))
-   return 0;
return cpu % nr_queues;
 }
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 90838e998f66..1a834d96a718 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1343,6 +1343,13 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx 
*hctx)
hctx_unlock(hctx, srcu_idx);
 }
 
+static void check_next_cpu(int next_cpu, const char *str1, const char *str2)
+{
+   if (next_cpu > nr_cpu_ids)
+   printk_ratelimited("wrong next_cpu %d, %s, %s\n",
+   next_cpu, str1, str2);
+}
+
 /*
  * It'd be great if the workqueue API had a way to pass
  * in a mask and had some smarts for more clever placement.
@@ -1352,26 +1359,29 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx 
*hctx)
 static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
 {
bool tried = false;
+   int next_cpu = hctx->next_cpu;
 
if (hctx->queue->nr_hw_queues == 1)
return WORK_CPU_UNBOUND;
 
if (--hctx->next_cpu_batch <= 0) {
-   int next_cpu;
 select_cpu:
-   next_cpu = cpumask_next_and(hctx->next_cpu, hctx->cpumask,
+   next_cpu = cpumask_next_and(next_cpu, hctx->cpumask,

WARNING: lock held when returning to user space!

2018-04-06 Thread syzbot

Hello,

syzbot hit the following crash on upstream commit
38c23685b273cfb4ccf31a199feccce3bdcb5d83 (Fri Apr 6 04:29:35 2018 +0000)
Merge tag 'armsoc-drivers' of  
git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
syzbot dashboard link:  
https://syzkaller.appspot.com/bug?extid=31e8daa8b3fc129e75f2


So far this crash happened 9 times on upstream.
C reproducer: https://syzkaller.appspot.com/x/repro.c?id=6407930337296384
syzkaller reproducer:  
https://syzkaller.appspot.com/x/repro.syz?id=4942413340606464
Raw console output:  
https://syzkaller.appspot.com/x/log.txt?id=4764483918495744
Kernel config:  
https://syzkaller.appspot.com/x/.config?id=-5813481738265533882

compiler: gcc (GCC) 8.0.1 20180301 (experimental)

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+31e8daa8b3fc129e7...@syzkaller.appspotmail.com
It will help syzbot understand when the bug is fixed. See footer for  
details.

If you forward the report, please keep this part and the footer.



WARNING: lock held when returning to user space!
4.16.0+ #3 Not tainted

syzkaller433111/4462 is leaving the kernel with locks still held!
1 lock held by syzkaller433111/4462:
 #0: 03a06fae (&lo->lo_ctl_mutex/1){+.+.}, at: lo_ioctl+0x8d/0x1ec0  
drivers/block/loop.c:1363



---
This bug is generated by a dumb bot. It may contain errors.
See https://goo.gl/tpsmEJ for details.
Direct all questions to syzkal...@googlegroups.com.

syzbot will keep track of this bug report.
If you forgot to add the Reported-by tag, once the fix for this bug is  
merged

into any tree, please reply to this email with:
#syz fix: exact-commit-title
If you want to test a patch for this bug, please reply with:
#syz test: git://repo/address.git branch
and provide the patch inline or as an attachment.
To mark this as a duplicate of another syzbot report, please reply with:
#syz dup: exact-subject-of-another-report
If it's a one-off invalid bug report, please reply with:
#syz invalid
Note: if the crash happens again, it will cause creation of a new bug  
report.

Note: all commands must start from beginning of the line in the email body.


Re: [PATCH] blk-mq: only run mapped hw queues in blk_mq_run_hw_queues()

2018-04-06 Thread Christian Borntraeger


On 04/06/2018 11:23 AM, Ming Lei wrote:
> On Fri, Apr 06, 2018 at 10:51:28AM +0200, Christian Borntraeger wrote:
>>
>>
>> On 04/06/2018 10:41 AM, Ming Lei wrote:
>>> On Thu, Apr 05, 2018 at 07:39:56PM +0200, Christian Borntraeger wrote:


 On 04/05/2018 06:11 PM, Ming Lei wrote:
>>
>> Could you please apply the following patch and provide the dmesg boot 
>> log?
>
> And please post out the 'lscpu' log together from the test machine too.

 attached.

 As I said before this seems to go way with CONFIG_NR_CPUS=64 or smaller.
 We have 282 nr_cpu_ids here (max 141CPUs on that z13 with SMT2) but only 8 
 Cores
 == 16 threads.
>>>
>>> OK, thanks!
>>>
>>> The most weird thing is that hctx->next_cpu is computed as 512 since
>>> nr_cpu_id is 282, and hctx->next_cpu should have pointed to one of
>>> possible CPU.
>>>
>>> Looks like it is a s390 specific issue, since I can setup one queue
>>> which has same mapping with yours:
>>>
>>> - nr_cpu_id is 282
>>> - CPU 0~15 is online
>>> - 64 queues null_blk
>>> - still run all hw queues in .complete handler
>>>
>>> But can't reproduce this issue at all.
>>>
>>> So please test the following patch, which may tell us why hctx->next_cpu
>>> is computed wrong:
>>
>> I see things like
>>
>> [8.196907] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>> [8.196910] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>> [8.196912] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>> [8.196913] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>> [8.196914] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>> [8.196915] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>> [8.196916] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>> [8.196916] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>> [8.196917] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>> [8.196918] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>>
>> which is exactly what happens if the find and and operation fails (returns 
>> size of bitmap).
> 
> Given both 'cpu_online_mask' and 'hctx->cpumask' are shown as correct
> in your previous debug log, it means the following function returns
> totally wrong result on S390.
> 
>   cpumask_first_and(hctx->cpumask, cpu_online_mask);
> 
> The debugfs log shows that each hctx->cpumask includes one online
> CPU(0~15).
> 
> So looks it isn't one issue in block MQ core.

So I checked further and printed the mask. I think I can ignore the next_and
cases. It is totally valid to get 512 here (as we might start with an offset
that is already the last cpu and then need to wrap with first_and).
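
For reference, a standalone sketch (illustrative only, not the kernel's
cpumask code; the 512-bit mask size is an assumption that matches the value
seen in the logs) of the semantics being discussed: a first_and/next_and
style scan simply runs off the end and reports the size of the bitmap when
the two masks share no set bit at or after the starting offset, which is
exactly the situation for an hctx whose mapped CPUs are all offline:

#include <stdio.h>
#include <stdint.h>

#define NBITS 512	/* plays the role of the cpumask size in this sketch */

static int first_and(const uint64_t *a, const uint64_t *b, int nbits)
{
	for (int i = 0; i < nbits; i++) {
		int bit_a = (a[i / 64] >> (i % 64)) & 1;
		int bit_b = (b[i / 64] >> (i % 64)) & 1;
		if (bit_a && bit_b)
			return i;	/* first CPU set in both masks */
	}
	return nbits;		/* nothing in common: return the bitmap size */
}

int main(void)
{
	uint64_t hctx_mask[NBITS / 64] = { 0 };
	uint64_t online_mask[NBITS / 64] = { 0 };

	hctx_mask[0] = 1ULL << 18;	/* hctx mapped only to (offline) CPU 18 */
	online_mask[0] = 0xffffULL;	/* CPUs 0-15 online */

	/* prints 512, the analogue of the "wrong next_cpu 512" messages */
	printf("%d\n", first_and(hctx_mask, online_mask, NBITS));
	return 0;
}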

So the first_and and the first cases are really the interesting ones. And I
think the code is perfectly right: there is no bit left after the and for
these cases:


[3.220021] wrong next_cpu 512, blk_mq_map_swqueue, first_and
[3.220023] 1: 0001 0001 0001 
0001 0001   

[3.220025] 2:    
    

[3.220027] wrong next_cpu 512, blk_mq_map_swqueue, first_and
[3.220028] 1: 0002 0002 0002 
0002 0002   

[3.220030] wrong next_cpu 512, blk_mq_map_swqueue, first_and
[3.220032] 1: 0001 0001 0001 
0001 0001   

[3.220033] 2:    
    

[3.220035] wrong next_cpu 512, blk_mq_map_swqueue, first_and
[3.220036] 2:    
    

[3.220037] wrong next_cpu 512, blk_mq_map_swqueue, first_and
[3.220039] 1: 0004 0004 0004 
0004 0004   

[3.220040] 2:    
    

[3.220042] wrong next_cpu 512, blk_mq_map_swqueue, first_and
[3.220062] 1: 0002 0002 0002 
0002 0002   

[3.220063] 2:    
    

[3.220064] wrong next_cpu 512, blk_mq_map_swqueue, first_and
[3.220066] 1: 0008 0008 0008 
0008 0008  

Re: [PATCH] blk-mq: only run mapped hw queues in blk_mq_run_hw_queues()

2018-04-06 Thread Christian Borntraeger


On 04/06/2018 11:23 AM, Ming Lei wrote:
> On Fri, Apr 06, 2018 at 10:51:28AM +0200, Christian Borntraeger wrote:
>>
>>
>> On 04/06/2018 10:41 AM, Ming Lei wrote:
>>> On Thu, Apr 05, 2018 at 07:39:56PM +0200, Christian Borntraeger wrote:


 On 04/05/2018 06:11 PM, Ming Lei wrote:
>>
>> Could you please apply the following patch and provide the dmesg boot 
>> log?
>
> And please post out the 'lscpu' log together from the test machine too.

 attached.

 As I said before this seems to go away with CONFIG_NR_CPUS=64 or smaller.
 We have 282 nr_cpu_ids here (max 141 CPUs on that z13 with SMT2) but only
 8 cores == 16 threads.
>>>
>>> OK, thanks!
>>>
>>> The most weird thing is that hctx->next_cpu is computed as 512 since
>>> nr_cpu_id is 282, and hctx->next_cpu should have pointed to one of
>>> possible CPU.
>>>
>>> Looks like it is a s390 specific issue, since I can setup one queue
>>> which has same mapping with yours:
>>>
>>> - nr_cpu_id is 282
>>> - CPU 0~15 is online
>>> - 64 queues null_blk
>>> - still run all hw queues in .complete handler
>>>
>>> But can't reproduce this issue at all.
>>>
>>> So please test the following patch, which may tell us why hctx->next_cpu
>>> is computed wrong:
>>
>> I see things like
>>
>> [8.196907] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>> [8.196910] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>> [8.196912] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>> [8.196913] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>> [8.196914] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>> [8.196915] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>> [8.196916] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>> [8.196916] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>> [8.196917] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>> [8.196918] wrong next_cpu 512, blk_mq_map_swqueue, first_and
>>
>> which is exactly what happens if the find-and operation fails (returns the
>> size of the bitmap).
> 
> Given both 'cpu_online_mask' and 'hctx->cpumask' are shown as correct
> in your previous debug log, it means the following function returns
> totally wrong result on S390.
> 
>   cpumask_first_and(hctx->cpumask, cpu_online_mask);
> 
> The debugfs log shows that each hctx->cpumask includes one online
> CPU(0~15).

Really? The last log (with the latest patch applied) shows a lot of contexts
that do not have any CPU in 0-15:

e.g. 
[4.049828] dump CPUs mapped to this hctx:
[4.049829] 18 
[4.049829] 82 
[4.049830] 146 
[4.049830] 210 
[4.049831] 274 

> 
> So looks it isn't one issue in block MQ core.
> 
> Thanks,
> Ming
> 



Re: [PATCH V3 4/4] genirq/affinity: irq vector spread among online CPUs as far as possible

2018-04-06 Thread Thomas Gleixner
On Fri, 6 Apr 2018, Ming Lei wrote:
> 
> I will post V4 soon by using cpu_present_mask in the 1st stage irq spread.
> And it should work fine for Kashyap's case in normal cases.

No need to resend. I've changed it already and will push it out after
lunch.

Thanks,

tglx


Re: [PATCH] blk-mq: only run mapped hw queues in blk_mq_run_hw_queues()

2018-04-06 Thread Ming Lei
On Fri, Apr 06, 2018 at 10:51:28AM +0200, Christian Borntraeger wrote:
> 
> 
> On 04/06/2018 10:41 AM, Ming Lei wrote:
> > On Thu, Apr 05, 2018 at 07:39:56PM +0200, Christian Borntraeger wrote:
> >>
> >>
> >> On 04/05/2018 06:11 PM, Ming Lei wrote:
> 
>  Could you please apply the following patch and provide the dmesg boot 
>  log?
> >>>
> >>> And please post out the 'lscpu' log together from the test machine too.
> >>
> >> attached.
> >>
> >> As I said before this seems to go away with CONFIG_NR_CPUS=64 or smaller.
> >> We have 282 nr_cpu_ids here (max 141 CPUs on that z13 with SMT2) but only
> >> 8 cores == 16 threads.
> > 
> > OK, thanks!
> > 
> > The most weird thing is that hctx->next_cpu is computed as 512 since
> > nr_cpu_id is 282, and hctx->next_cpu should have pointed to one of
> > possible CPU.
> > 
> > Looks like it is a s390 specific issue, since I can setup one queue
> > which has same mapping with yours:
> > 
> > - nr_cpu_id is 282
> > - CPU 0~15 is online
> > - 64 queues null_blk
> > - still run all hw queues in .complete handler
> > 
> > But can't reproduce this issue at all.
> > 
> > So please test the following patch, which may tell us why hctx->next_cpu
> > is computed wrong:
> 
> I see things like
> 
> [8.196907] wrong next_cpu 512, blk_mq_map_swqueue, first_and
> [8.196910] wrong next_cpu 512, blk_mq_map_swqueue, first_and
> [8.196912] wrong next_cpu 512, blk_mq_map_swqueue, first_and
> [8.196913] wrong next_cpu 512, blk_mq_map_swqueue, first_and
> [8.196914] wrong next_cpu 512, blk_mq_map_swqueue, first_and
> [8.196915] wrong next_cpu 512, blk_mq_map_swqueue, first_and
> [8.196916] wrong next_cpu 512, blk_mq_map_swqueue, first_and
> [8.196916] wrong next_cpu 512, blk_mq_map_swqueue, first_and
> [8.196917] wrong next_cpu 512, blk_mq_map_swqueue, first_and
> [8.196918] wrong next_cpu 512, blk_mq_map_swqueue, first_and
> 
> which is exactly what happens if the find-and operation fails (returns the
> size of the bitmap).

Given both 'cpu_online_mask' and 'hctx->cpumask' are shown as correct
in your previous debug log, it means the following function returns a
totally wrong result on s390.

cpumask_first_and(hctx->cpumask, cpu_online_mask);

The debugfs log shows that each hctx->cpumask includes one online
CPU (0~15).

So it looks like this isn't an issue in the blk-mq core.

Thanks,
Ming


Re: [PATCH V3 4/4] genirq/affinity: irq vector spread among online CPUs as far as possible

2018-04-06 Thread Ming Lei
Hi Thomas,

On Wed, Apr 04, 2018 at 09:38:26PM +0200, Thomas Gleixner wrote:
> On Wed, 4 Apr 2018, Ming Lei wrote:
> > On Wed, Apr 04, 2018 at 10:25:16AM +0200, Thomas Gleixner wrote:
> > > In the example above:
> > > 
> > > > > > irq 39, cpu list 0,4
> > > > > > irq 40, cpu list 1,6
> > > > > > irq 41, cpu list 2,5
> > > > > > irq 42, cpu list 3,7
> > > 
> > > and assumed that at driver init time only CPU 0-3 are online then the
> > > hotplug of CPU 4-7 will not result in any interrupt delivered to CPU 4-7.
> > 
> > Indeed, and I just tested this case, and found that no interrupts are
> > delivered to CPU 4-7.
> > 
> > In theory, the affinity has been assigned to these irq vectors, and
> > programmed to interrupt controller, I understand it should work.
> > 
> > Could you explain it a bit why interrupts aren't delivered to CPU 4-7?
> 
> As I explained before:
> 
> "If the device is already in use when the offline CPUs get hot plugged, then
>  the interrupts still stay on cpu 0-3 because the effective affinity of
>  interrupts on X86 (and other architectures) is always a single CPU."
> 
> IOW. If you set the affinity mask so it contains more than one CPU then the
> kernel selects a single CPU as target. The selected CPU must be online and
> if there is more than one online CPU in the mask then the kernel picks the
> one which has the least number of interrupts targeted at it. This selected
> CPU target is programmed into the corresponding interrupt chip
> (IOAPIC/MSI/MSIX) and it stays that way until the selected target CPU
> goes offline or the affinity mask changes.
> 
> The reasons why we use single target delivery on X86 are:
> 
>1) Not all X86 systems support multi target delivery
> 
>2) If a system supports multi target delivery then the interrupt is
>   preferably delivered to the CPU with the lowest APIC ID (which
>   usually corresponds to the lowest CPU number) due to hardware magic
>   and only a very small percentage of interrupts are delivered to the
>   other CPUs in the multi target set. So the benefit is rather dubious
>   and extensive performance testing did not show any significant
>   difference.
> 
>3) The management of multi targets on the software side is painful as
>   the same low level vector number has to be allocated on all possible
>   target CPUs. That's making a lot of things including hotplug more
>   complex for very little - if at all - benefit.
> 
> So at some point we ripped out the multi target support on X86 and moved
> everything to single target delivery mode.
> 
> Other architectures never supported multi target delivery either due to
> hardware restrictions or for similar reasons why X86 dropped it. There
> might be a few architectures which support it, but I have no overview at
> the moment.
> 
> The information is in procfs
> 
> # cat /proc/irq/9/smp_affinity_list 
> 0-3
> # cat /proc/irq/9/effective_affinity_list 
> 1
> 
> # cat /proc/irq/10/smp_affinity_list 
> 0-3
> # cat /proc/irq/10/effective_affinity_list 
> 2
> 
> smp_affinity[_list] is the affinity which is set either by the kernel or by
> writing to /proc/irq/$N/smp_affinity[_list]
> 
> effective_affinity[_list] is the affinity which is effective, i.e. the
> single target CPU to which the interrupt is affine at this point.
> 
> As you can see in the above examples the target CPU is selected from the
> given possible target set and the internal spreading of the low level x86
> vector allocation code picks a CPU which has the lowest number of
> interrupts targeted at it.
> 
> Let's assume for the example below
> 
> # cat /proc/irq/10/smp_affinity_list 
> 0-3
> # cat /proc/irq/10/effective_affinity_list 
> 2
> 
> that CPU 3 was offline when the device was initialized. So there was no way
> to select it and when CPU 3 comes online there is no reason to change the
> affinity of that interrupt, at least not from the kernel POV. Actually we
> don't even have a mechanism to do so automagically.
> 
> If I offline CPU 2 after onlining CPU 3 then the kernel has to move the
> interrupt away from CPU 2, so it selects CPU 3 as it's the one with the
> lowest number of interrupts targeted at it.
> 
> Now this is a bit different if you use affinity managed interrupts like
> NVME and other devices do.
> 
> Many of these devices create one queue per possible CPU, so the spreading
> is simple; One interrupt per possible cpu. Pretty boring.
> 
> When the device has less queues than possible CPUs, then stuff gets more
> interesting. The queues and therefore the interrupts must be targeted at
> multiple CPUs. There is some logic which spreads them over the numa nodes
> and takes siblings into account when Hyperthreading is enabled.
> 
> In both cases the managed interrupts are handled over CPU soft
> hotplug/unplug:
> 
>   1) If a CPU is soft unplugged and an interrupt is targeted at the CPU
>  then the interrupt is either moved to a still online CPU in 
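
A rough sketch of the selection rule described above (purely illustrative,
not the actual x86 vector-allocation code; irq_load[] is a made-up stand-in
for the real per-CPU vector accounting):

#include <linux/cpumask.h>
#include <linux/kernel.h>

static unsigned int irq_load[NR_CPUS]; /* hypothetical per-CPU interrupt count */

/* Pick the online CPU in the affinity mask with the fewest interrupts. */
static int pick_effective_cpu(const struct cpumask *affinity)
{
        unsigned int best_load = UINT_MAX;
        int cpu, best = -1;

        for_each_cpu_and(cpu, affinity, cpu_online_mask) {
                if (irq_load[cpu] < best_load) {
                        best_load = irq_load[cpu];
                        best = cpu;
                }
        }
        return best;    /* -1 if the mask contains no online CPU */
}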

Re: [PATCH] blk-mq: only run mapped hw queues in blk_mq_run_hw_queues()

2018-04-06 Thread Christian Borntraeger


On 04/06/2018 10:51 AM, Christian Borntraeger wrote:
> 
> 
> On 04/06/2018 10:41 AM, Ming Lei wrote:
>> On Thu, Apr 05, 2018 at 07:39:56PM +0200, Christian Borntraeger wrote:
>>>
>>>
>>> On 04/05/2018 06:11 PM, Ming Lei wrote:
>
> Could you please apply the following patch and provide the dmesg boot log?

 And please post out the 'lscpu' log together from the test machine too.
>>>
>>> attached.
>>>
>>> As I said before this seems to go away with CONFIG_NR_CPUS=64 or smaller.
>>> We have 282 nr_cpu_ids here (max 141 CPUs on that z13 with SMT2) but only
>>> 8 cores == 16 threads.
>>
>> OK, thanks!
>>
>> The most weird thing is that hctx->next_cpu is computed as 512 since
>> nr_cpu_id is 282, and hctx->next_cpu should have pointed to one of
>> possible CPU.
>>
>> Looks like it is a s390 specific issue, since I can setup one queue
>> which has same mapping with yours:
>>
>>  - nr_cpu_id is 282
>>  - CPU 0~15 is online
>>  - 64 queues null_blk
>>  - still run all hw queues in .complete handler
>>
>> But can't reproduce this issue at all.
>>
>> So please test the following patch, which may tell us why hctx->next_cpu
>> is computed wrong:
> 
> I see things like
> 
> [8.196907] wrong next_cpu 512, blk_mq_map_swqueue, first_and
> [8.196910] wrong next_cpu 512, blk_mq_map_swqueue, first_and
> [8.196912] wrong next_cpu 512, blk_mq_map_swqueue, first_and
> [8.196913] wrong next_cpu 512, blk_mq_map_swqueue, first_and
> [8.196914] wrong next_cpu 512, blk_mq_map_swqueue, first_and
> [8.196915] wrong next_cpu 512, blk_mq_map_swqueue, first_and
> [8.196916] wrong next_cpu 512, blk_mq_map_swqueue, first_and
> [8.196916] wrong next_cpu 512, blk_mq_map_swqueue, first_and
> [8.196917] wrong next_cpu 512, blk_mq_map_swqueue, first_and
> [8.196918] wrong next_cpu 512, blk_mq_map_swqueue, first_and

There are more

# dmesg  | grep "wrong next"  | cut -d "]" -f 2- | uniq -c
 10  wrong next_cpu 512, blk_mq_map_swqueue, first_and
 72  wrong next_cpu 512, blk_mq_hctx_next_cpu, next_and
  1  wrong next_cpu 512, blk_mq_hctx_next_cpu, first_and
  1  wrong next_cpu 512, blk_mq_hctx_next_cpu, next_and
  1  wrong next_cpu 512, blk_mq_hctx_next_cpu, first_and
  1  wrong next_cpu 512, blk_mq_hctx_next_cpu, next_and
  1  wrong next_cpu 512, blk_mq_hctx_next_cpu, first_and
  1  wrong next_cpu 512, blk_mq_hctx_next_cpu, next_and
  1  wrong next_cpu 512, blk_mq_hctx_next_cpu, first_and
  1  wrong next_cpu 512, blk_mq_hctx_next_cpu, next_and
  1  wrong next_cpu 512, blk_mq_hctx_next_cpu, first_and
  1  wrong next_cpu 512, blk_mq_hctx_next_cpu, next_and
  1  wrong next_cpu 512, blk_mq_hctx_next_cpu, first_and
  7  wrong next_cpu 512, blk_mq_hctx_next_cpu, next_and
  1  wrong next_cpu 512, blk_mq_hctx_next_cpu, first_and
  1  wrong next_cpu 512, blk_mq_hctx_next_cpu, next_and
  1  wrong next_cpu 512, blk_mq_hctx_next_cpu, first_and
 10  wrong next_cpu 512, blk_mq_hctx_next_cpu, next_and


> 
> which is exactly what happens if the find-and operation fails (returns the
> size of the bitmap).
> 
> FWIW, I added a dump stack for the case when we run unbound before I tested 
> your patch:
> 
> Apr 06 10:47:41 s38lp39 kernel: CPU: 15 PID: 86 Comm: ksoftirqd/15 Not 
> tainted 4.16.0-07249-g864f9fc031e4-dirty #2
> Apr 06 10:47:41 s38lp39 kernel: Hardware name: IBM 2964 NC9 704 (LPAR)
> Apr 06 10:47:41 s38lp39 kernel: Call Trace:
> Apr 06 10:47:41 s38lp39 kernel: ([<00113946>] show_stack+0x56/0x80)
> Apr 06 10:47:41 s38lp39 kernel:  [<009d8132>] dump_stack+0x82/0xb0 
> Apr 06 10:47:41 s38lp39 kernel:  [<006a05de>] 
> blk_mq_hctx_next_cpu+0x12e/0x138 
> Apr 06 10:47:41 s38lp39 kernel:  [<006a084c>] 
> __blk_mq_delay_run_hw_queue+0x94/0xd8 
> Apr 06 10:47:41 s38lp39 kernel:  [<006a097a>] 
> blk_mq_run_hw_queue+0x82/0x180 
> Apr 06 10:47:41 s38lp39 kernel:  [<006a0ae0>] 
> blk_mq_run_hw_queues+0x68/0x88 
> Apr 06 10:47:41 s38lp39 kernel:  [<0069fc4e>] 
> __blk_mq_complete_request+0x11e/0x1d8 
> Apr 06 10:47:41 s38lp39 kernel:  [<0069fd94>] 
> blk_mq_complete_request+0x8c/0xc8 
> Apr 06 10:47:41 s38lp39 kernel:  [<00824c50>] 
> dasd_block_tasklet+0x158/0x490 
> Apr 06 10:47:41 s38lp39 kernel:  [<0014a952>] 
> tasklet_action_common.isra.5+0x7a/0x100 
> Apr 06 10:47:41 s38lp39 kernel:  [<009f8248>] __do_softirq+0x98/0x368 
> Apr 06 10:47:41 s38lp39 kernel:  [<0014a322>] run_ksoftirqd+0x4a/0x68 
> Apr 06 10:47:41 s38lp39 kernel:  [<0016dc20>] 
> smpboot_thread_fn+0x108/0x1b0 
> Apr 06 10:47:41 s38lp39 kernel:  [<00168e70>] kthread+0x148/0x160 
> Apr 06 10:47:41 s38lp39 kernel:  [<009f727a>] 
> kernel_thread_starter+0x6/0xc 
> Apr 06 10:47:41 s38lp39 kernel:  [<009f7274>] 
> kernel_thread_starter+0x0/0xc
> 



Re: [PATCH] blk-mq: only run mapped hw queues in blk_mq_run_hw_queues()

2018-04-06 Thread Christian Borntraeger


On 04/06/2018 10:41 AM, Ming Lei wrote:
> On Thu, Apr 05, 2018 at 07:39:56PM +0200, Christian Borntraeger wrote:
>>
>>
>> On 04/05/2018 06:11 PM, Ming Lei wrote:

 Could you please apply the following patch and provide the dmesg boot log?
>>>
>>> And please post out the 'lscpu' log together from the test machine too.
>>
>> attached.
>>
>> As I said before this seems to go away with CONFIG_NR_CPUS=64 or smaller.
>> We have 282 nr_cpu_ids here (max 141 CPUs on that z13 with SMT2) but only
>> 8 cores == 16 threads.
> 
> OK, thanks!
> 
> The most weird thing is that hctx->next_cpu is computed as 512 since
> nr_cpu_id is 282, and hctx->next_cpu should have pointed to one of
> possible CPU.
> 
> Looks like it is a s390 specific issue, since I can setup one queue
> which has same mapping with yours:
> 
>   - nr_cpu_id is 282
>   - CPU 0~15 is online
>   - 64 queues null_blk
>   - still run all hw queues in .complete handler
> 
> But can't reproduce this issue at all.
> 
> So please test the following patch, which may tell us why hctx->next_cpu
> is computed wrong:

I see things like

[8.196907] wrong next_cpu 512, blk_mq_map_swqueue, first_and
[8.196910] wrong next_cpu 512, blk_mq_map_swqueue, first_and
[8.196912] wrong next_cpu 512, blk_mq_map_swqueue, first_and
[8.196913] wrong next_cpu 512, blk_mq_map_swqueue, first_and
[8.196914] wrong next_cpu 512, blk_mq_map_swqueue, first_and
[8.196915] wrong next_cpu 512, blk_mq_map_swqueue, first_and
[8.196916] wrong next_cpu 512, blk_mq_map_swqueue, first_and
[8.196916] wrong next_cpu 512, blk_mq_map_swqueue, first_and
[8.196917] wrong next_cpu 512, blk_mq_map_swqueue, first_and
[8.196918] wrong next_cpu 512, blk_mq_map_swqueue, first_and

which is exactly what happens if the find-and operation fails (returns the
size of the bitmap).

FWIW, I added a dump stack for the case when we run unbound before I tested 
your patch:

Apr 06 10:47:41 s38lp39 kernel: CPU: 15 PID: 86 Comm: ksoftirqd/15 Not tainted 
4.16.0-07249-g864f9fc031e4-dirty #2
Apr 06 10:47:41 s38lp39 kernel: Hardware name: IBM 2964 NC9 704 (LPAR)
Apr 06 10:47:41 s38lp39 kernel: Call Trace:
Apr 06 10:47:41 s38lp39 kernel: ([<00113946>] show_stack+0x56/0x80)
Apr 06 10:47:41 s38lp39 kernel:  [<009d8132>] dump_stack+0x82/0xb0 
Apr 06 10:47:41 s38lp39 kernel:  [<006a05de>] 
blk_mq_hctx_next_cpu+0x12e/0x138 
Apr 06 10:47:41 s38lp39 kernel:  [<006a084c>] 
__blk_mq_delay_run_hw_queue+0x94/0xd8 
Apr 06 10:47:41 s38lp39 kernel:  [<006a097a>] 
blk_mq_run_hw_queue+0x82/0x180 
Apr 06 10:47:41 s38lp39 kernel:  [<006a0ae0>] 
blk_mq_run_hw_queues+0x68/0x88 
Apr 06 10:47:41 s38lp39 kernel:  [<0069fc4e>] 
__blk_mq_complete_request+0x11e/0x1d8 
Apr 06 10:47:41 s38lp39 kernel:  [<0069fd94>] 
blk_mq_complete_request+0x8c/0xc8 
Apr 06 10:47:41 s38lp39 kernel:  [<00824c50>] 
dasd_block_tasklet+0x158/0x490 
Apr 06 10:47:41 s38lp39 kernel:  [<0014a952>] 
tasklet_action_common.isra.5+0x7a/0x100 
Apr 06 10:47:41 s38lp39 kernel:  [<009f8248>] __do_softirq+0x98/0x368 
Apr 06 10:47:41 s38lp39 kernel:  [<0014a322>] run_ksoftirqd+0x4a/0x68 
Apr 06 10:47:41 s38lp39 kernel:  [<0016dc20>] 
smpboot_thread_fn+0x108/0x1b0 
Apr 06 10:47:41 s38lp39 kernel:  [<00168e70>] kthread+0x148/0x160 
Apr 06 10:47:41 s38lp39 kernel:  [<009f727a>] 
kernel_thread_starter+0x6/0xc 
Apr 06 10:47:41 s38lp39 kernel:  [<009f7274>] 
kernel_thread_starter+0x0/0xc



Re: [PATCH] blk-mq: only run mapped hw queues in blk_mq_run_hw_queues()

2018-04-06 Thread Ming Lei
On Thu, Apr 05, 2018 at 07:39:56PM +0200, Christian Borntraeger wrote:
> 
> 
> On 04/05/2018 06:11 PM, Ming Lei wrote:
> >>
> >> Could you please apply the following patch and provide the dmesg boot log?
> > 
> > And please post out the 'lscpu' log together from the test machine too.
> 
> attached.
> 
> As I said before this seems to go away with CONFIG_NR_CPUS=64 or smaller.
> We have 282 nr_cpu_ids here (max 141 CPUs on that z13 with SMT2) but only
> 8 cores == 16 threads.

OK, thanks!

The weirdest thing is that hctx->next_cpu is computed as 512 even though
nr_cpu_ids is 282, while hctx->next_cpu should have pointed to one of the
possible CPUs.

Looks like it is an s390-specific issue, since I can set up one queue
which has the same mapping as yours:

- nr_cpu_ids is 282
- CPUs 0~15 are online
- 64 null_blk queues
- still run all hw queues in the .complete handler

But I can't reproduce this issue at all.

So please test the following patch, which may tell us why hctx->next_cpu
is computed wrong:

---
diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index 9f8cffc8a701..638ab5c11b3c 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -14,13 +14,12 @@
 #include "blk.h"
 #include "blk-mq.h"
 
+/*
+ * Given there isn't CPU hotplug handler in blk-mq, map all CPUs to
+ * queues even it isn't present yet.
+ */
 static int cpu_to_queue_index(unsigned int nr_queues, const int cpu)
 {
-   /*
-* Non present CPU will be mapped to queue index 0.
-*/
-   if (!cpu_present(cpu))
-   return 0;
return cpu % nr_queues;
 }
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 90838e998f66..9b130e4b87df 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1343,6 +1343,13 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx 
*hctx)
hctx_unlock(hctx, srcu_idx);
 }
 
+static void check_next_cpu(int next_cpu, const char *str1, const char *str2)
+{
+   if (next_cpu > nr_cpu_ids)
+   printk_ratelimited("wrong next_cpu %d, %s, %s\n",
+   next_cpu, str1, str2);
+}
+
 /*
  * It'd be great if the workqueue API had a way to pass
  * in a mask and had some smarts for more clever placement.
@@ -1352,26 +1359,29 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx 
*hctx)
 static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
 {
bool tried = false;
+   int next_cpu = hctx->next_cpu;
 
if (hctx->queue->nr_hw_queues == 1)
return WORK_CPU_UNBOUND;
 
if (--hctx->next_cpu_batch <= 0) {
-   int next_cpu;
 select_cpu:
-   next_cpu = cpumask_next_and(hctx->next_cpu, hctx->cpumask,
+   next_cpu = cpumask_next_and(next_cpu, hctx->cpumask,
cpu_online_mask);
-   if (next_cpu >= nr_cpu_ids)
+   check_next_cpu(next_cpu, __func__, "next_and");
+   if (next_cpu >= nr_cpu_ids) {
next_cpu = 
cpumask_first_and(hctx->cpumask,cpu_online_mask);
+   check_next_cpu(next_cpu, __func__, "first_and");
+   }
 
/*
 * No online CPU is found, so have to make sure hctx->next_cpu
 * is set correctly for not breaking workqueue.
 */
-   if (next_cpu >= nr_cpu_ids)
-   hctx->next_cpu = cpumask_first(hctx->cpumask);
-   else
-   hctx->next_cpu = next_cpu;
+   if (next_cpu >= nr_cpu_ids) {
+   next_cpu = cpumask_first(hctx->cpumask);
+   check_next_cpu(next_cpu, __func__, "first");
+   }
hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
}
 
@@ -1379,7 +1389,7 @@ static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx 
*hctx)
 * Do unbound schedule if we can't find a online CPU for this hctx,
 * and it should only happen in the path of handling CPU DEAD.
 */
-   if (!cpu_online(hctx->next_cpu)) {
+   if (!cpu_online(next_cpu)) {
if (!tried) {
tried = true;
goto select_cpu;
@@ -1392,7 +1402,9 @@ static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx 
*hctx)
hctx->next_cpu_batch = 1;
return WORK_CPU_UNBOUND;
}
-   return hctx->next_cpu;
+
+   hctx->next_cpu = next_cpu;
+   return next_cpu;
 }
 
 static void __blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async,
@@ -2408,6 +2420,8 @@ static void blk_mq_map_swqueue(struct request_queue *q)
mutex_unlock(&q->sysfs_lock);
 
queue_for_each_hw_ctx(q, hctx, i) {
+   int next_cpu;
+
/*
 * If no software queues are mapped to this hardware queue,
 * disable it and free the request entries.
@@ -2437,8 +2451,10 @@ static void blk_mq_map_swqueue(struct request_queue 

Re: [PATCH] blk-mq: only run mapped hw queues in blk_mq_run_hw_queues()

2018-04-06 Thread Christian Borntraeger


On 04/05/2018 06:05 PM, Ming Lei wrote:
[...]
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 90838e998f66..996f8a963026 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1324,9 +1324,18 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx 
> *hctx)
>*/
>   if (!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) &&
>   cpu_online(hctx->next_cpu)) {
> - printk(KERN_WARNING "run queue from wrong CPU %d, hctx %s\n",
> - raw_smp_processor_id(),
> + int cpu;
> + printk(KERN_WARNING "run queue from wrong CPU %d/%d, hctx-%d 
> %s\n",
> + raw_smp_processor_id(), hctx->next_cpu,
> + hctx->queue_num,
>   cpumask_empty(hctx->cpumask) ? "inactive": "active");
> + printk("dump CPUs mapped to this hctx:\n");
> + for_each_cpu(cpu, hctx->cpumask)
> + printk("%d ", cpu);
> + printk("\n");
> + printk("nr_cpu_ids is %d, and dump online cpus:\n", nr_cpu_ids);
> + for_each_cpu(cpu, cpu_online_mask)
> + printk("%d ", cpu);
>   dump_stack();
>   }
> 

FWIW, with things like

[4.049828] dump CPUs mapped to this hctx:
[4.049829] 18 
[4.049829] 82 
[4.049830] 146 
[4.049830] 210 
[4.049831] 274 

[4.049832] nr_cpu_ids is 282, and dump online cpus:
[4.049833] 0 
[4.049833] 1 
[4.049834] 2 
[4.049834] 3 
[4.049835] 4 
[4.049835] 5 
[4.049836] 6 
[4.049836] 7 
[4.049837] 8 
[4.049837] 9 
[4.049838] 10 
[4.049839] 11 
[4.049839] 12 
[4.049840] 13 
[4.049840] 14 
[4.049841] 15 

So the hctx has only "possible CPUs", but all are offline.

Doesn't that always make this run unbound? See blk_mq_hctx_next_cpu below.

/*
 * It'd be great if the workqueue API had a way to pass
 * in a mask and had some smarts for more clever placement.
 * For now we just round-robin here, switching for every
 * BLK_MQ_CPU_WORK_BATCH queued items.
 */
static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
{
bool tried = false;

if (hctx->queue->nr_hw_queues == 1)
return WORK_CPU_UNBOUND;

if (--hctx->next_cpu_batch <= 0) {
int next_cpu;
select_cpu: 
next_cpu = cpumask_next_and(hctx->next_cpu, hctx->cpumask,
cpu_online_mask);
if (next_cpu >= nr_cpu_ids)
next_cpu = 
cpumask_first_and(hctx->cpumask,cpu_online_mask);

/*
 * No online CPU is found, so have to make sure hctx->next_cpu
 * is set correctly for not breaking workqueue.
 */
if (next_cpu >= nr_cpu_ids)
hctx->next_cpu = cpumask_first(hctx->cpumask);
else
hctx->next_cpu = next_cpu;
hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
}

/*
 * Do unbound schedule if we can't find a online CPU for this hctx,
 * and it should only happen in the path of handling CPU DEAD.
 */
if (!cpu_online(hctx->next_cpu)) {
if (!tried) {
tried = true;
goto select_cpu;
}

/*
 * Make sure to re-select CPU next time once after CPUs
 * in hctx->cpumask become online again.
 */
hctx->next_cpu_batch = 1;
return WORK_CPU_UNBOUND;
}
return hctx->next_cpu;
}
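
For what it's worth, a hypothetical walk-through of those two lookups with the
masks from the dump above (hctx->cpumask = {18, 82, 146, 210, 274}, online
CPUs 0-15) suggests the answer is yes, it always ends up unbound:

/* Assumed values taken from the dump above; this is a trace, not real code. */
next_cpu = cpumask_next_and(hctx->next_cpu, hctx->cpumask, cpu_online_mask);
        /* no online bit in the AND -> returns the bitmap size (the 512 in the logs) */
next_cpu = cpumask_first_and(hctx->cpumask, cpu_online_mask);
        /* the masks do not intersect -> bitmap size again */
hctx->next_cpu = cpumask_first(hctx->cpumask);
        /* falls back to CPU 18, which is offline */
/* !cpu_online(18) -> one retry via select_cpu, then return WORK_CPU_UNBOUND */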



Re: Multi-Actuator SAS HDD First Look

2018-04-06 Thread Christoph Hellwig
On Fri, Apr 06, 2018 at 08:24:18AM +0200, Hannes Reinecke wrote:
> Ah. Far better.
> What about delegating FORMAT UNIT to the control LUN, and not
> implementing it for the individual disk LUNs?
> That would make an even stronger case for having a control LUN;
> with that there wouldn't be any problem with having to synchronize
> across LUNs etc.

It sounds to me like NVMe might be a much better model for this drive
than SCSI, btw :)


Re: Multi-Actuator SAS HDD First Look

2018-04-06 Thread Hannes Reinecke
On Thu, 5 Apr 2018 17:43:46 -0600
Tim Walker  wrote:

> On Tue, Apr 3, 2018 at 1:46 AM, Christoph Hellwig 
> wrote:
> > On Sat, Mar 31, 2018 at 01:03:46PM +0200, Hannes Reinecke wrote:  
> >> Actually I would propose to have a 'management' LUN at LUN0, who
> >> could handle all the device-wide commands (eg things like START
> >> STOP UNIT, firmware update, or even SMART commands), and ignoring
> >> them for the remaining LUNs.  
> >
> > That is in fact the only workable option at all.  Everything else
> > completely breaks the SCSI architecture.  
> 
> Here's an update: Seagate will eliminate the inter-LU actions from
> FORMAT UNIT and SANITIZE. Probably SANITIZE will be per-LUN, but
> FORMAT UNIT is trickier due to internal drive architecture, and how
> FORMAT UNIT initializes on-disk metadata. Likely it will require some
> sort of synchronization across LUNs, such as the command being sent to
> both LUNs sequentially or something similar. We are also considering
> not supporting FORMAT UNIT at all - would anybody object? Any other
> suggestions?
> 

Ah. Far better.
What about delegating FORMAT UNIT to the control LUN, and not
implementing it for the individual disk LUNs?
That would make an even stronger case for having a control LUN;
with that there wouldn't be any problem with having to synchronize
across LUNs etc.

Cheers,

Hannes