date:20201022

Re: [PATCH 1/3] md: align superblock writes to physical blocks

2020-10-22 Thread Song Liu

On Thu, Oct 22, 2020 at 8:31 PM Christopher Unkel  wrote:
>
> Writes of the md superblock are aligned to the logical blocks of the
> containing device, but no attempt is made to align them to physical
> block boundaries.  This means that on a "512e" device (4k physical, 512
> logical) every superblock update hits the 512-byte emulation and the
> possible associated performance penalty.
>
> Respect the physical block alignment when possible.
>
> Signed-off-by: Christopher Unkel 
> ---
>  drivers/md/md.c | 15 +++
>  1 file changed, 15 insertions(+)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 98bac4f304ae..2b42850acfb3 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -1732,6 +1732,21 @@ static int super_1_load(struct md_rdev *rdev, struct md_rdev *refdev, int minor_
> && rdev->new_data_offset < sb_start + (rdev->sb_size/512))
> return -EINVAL;
>
> +   /* Respect physical block size if feasible. */
> +   bmask = queue_physical_block_size(rdev->bdev->bd_disk->queue)-1;
> +   if (!((rdev->sb_start * 512) & bmask) && (rdev->sb_size & bmask)) {
> +   int candidate_size = (rdev->sb_size | bmask) + 1;
> +
> +   if (minor_version) {
> +   int sectors = candidate_size / 512;
> +
> +   if (rdev->data_offset >= sb_start + sectors
> +   && rdev->new_data_offset >= sb_start + sectors)
> +   rdev->sb_size = candidate_size;
> +   } else if (bmask <= 4095)
> +   rdev->sb_size = candidate_size;
> +   }
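The rounding in the quoted hunk can be tried out in isolation. The sketch below is a hypothetical standalone helper (`padded_sb_size` is not an md function). It assumes `sb_start` is in 512-byte sectors, `sb_size` is in bytes, and the physical block size is a power of two. It deliberately omits the `data_offset` overlap check and the `bmask <= 4095` cap that the patch applies.

```c
#include <assert.h>

/*
 * Hypothetical helper mirroring the padding arithmetic in the quoted
 * super_1_load() hunk. sb_start is in 512-byte sectors, sb_size is in
 * bytes, pbs is the physical block size in bytes (a power of two).
 * The overlap checks against data_offset are intentionally omitted.
 */
static int padded_sb_size(unsigned long long sb_start, int sb_size, int pbs)
{
	int bmask = pbs - 1;

	/* The mask trick below only works for power-of-two block sizes. */
	assert(pbs > 0 && (pbs & bmask) == 0);

	/*
	 * Pad only if the superblock starts on a physical block boundary
	 * and its size is not already a whole number of physical blocks.
	 */
	if (!((sb_start * 512) & bmask) && (sb_size & bmask))
		return (sb_size | bmask) + 1;	/* round up to the next block */

	return sb_size;
}
```

For a 512-byte superblock at sector 2056 (as in the trace in the cover letter), `padded_sb_size(2056, 512, 4096)` gives 4096, i.e. one full 4k write instead of a single 512-byte sector. On a 512n disk (`pbs == 512`) the size is left unchanged.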

In super_1_load() and super_1_sync(), we have

bmask = queue_logical_block_size(rdev->bdev->bd_disk->queue)-1;

I think we should replace it with queue_physical_block_size() so the logic is
cleaner. Would this work?

Thanks,
Song

Re: [PATCH 0/3] mdraid sb and bitmap write alignment on 512e drives

2020-10-22 Thread Song Liu

On Thu, Oct 22, 2020 at 8:31 PM Christopher Unkel  wrote:
>
> Hello all,
>
> While investigating some performance issues on mdraid 10 volumes
> formed with "512e" disks (4k native/physical sector size but with 512
> byte sector emulation), I've found two cases where mdraid will
> needlessly issue writes that start on a 4k byte boundary but are
> shorter than 4k:
>
> 1. writes of the raid superblock; and
> 2. writes of the last page of the write-intent bitmap.
>
> The following is an excerpt of a blocktrace of one of the component
> members of a mdraid 10 volume during a 4k write near the end of the
> array:
>
>   8,32  11    2 0.01687       711  D  WS 2064 + 8 [kworker/11:1H]
> * 8,32  11    5 0.001454119   711  D  WS 2056 + 1 [kworker/11:1H]
> * 8,32  11    8 0.002847204   711  D  WS 2080 + 7 [kworker/11:1H]
>   8,32  11   11 0.003700545  3094  D  WS 11721043920 + 8 [md127_raid1]
>   8,32  11   14 0.308785692   711  D  WS 2064 + 8 [kworker/11:1H]
> * 8,32  11   17 0.310201697   711  D  WS 2056 + 1 [kworker/11:1H]
>   8,32  11   20 5.500799245   711  D  WS 2064 + 8 [kworker/11:1H]
> * 8,32  11   23 5.740923558   711  D  WS 2080 + 7 [kworker/11:1H]
>
> Note the starred transactions, which each start on a 4k boundary, but
> are less than 4k in length, and so will use the 512-byte emulation.
> Sector 2056 holds the superblock, and is written as a single 512-byte
> write.  Sector 2086 holds the bitmap bit relevant to the written
> sector.  When it is written the active bits of the last page of the
> bitmap are written, starting at sector 2080, padded out to the end of
> the 512-byte logical sector as required.  This results in a 3.5 KiB
> write (7 sectors), again using the 512-byte emulation.
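Whether a request falls back to the 512-byte emulation follows from simple arithmetic on the blktrace `sector + count` fields. The sketch below is a hypothetical helper. It assumes 512-byte logical sectors and a physical block of `pbs_sectors` sectors (8 for a 4k 512e drive), and flags a request unless it both starts and ends on a physical block boundary.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns true if a request of nr_sectors 512-byte
 * sectors starting at `sector` only partially covers some physical block,
 * which on a 512e drive means the drive may have to read-modify-write.
 */
static bool partial_physical_write(uint64_t sector, uint32_t nr_sectors,
				   uint32_t pbs_sectors)
{
	uint64_t end = sector + nr_sectors;	/* first sector past the I/O */

	return (sector % pbs_sectors) != 0 || (end % pbs_sectors) != 0;
}
```

Applied to the first trace, it flags exactly the starred `2056 + 1` and `2080 + 7` requests. In the patched trace every request is `+ 8` on an 8-sector boundary, so none is flagged.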
>
> Note that in some arrays the last page of the bitmap may be full
> enough that the bitmap write is not affected by this issue.
>
> As there can be a substantial penalty to using the 512-byte sector
> emulation (turning writes into read-modify-write cycles if the relevant
> sector is not in the drive's cache), I believe it makes sense to pad
> these writes out to a 4k boundary.  The writes are already padded out
> for "4k native" drives, where the short access is illegal.
>
> The following patch set changes the superblock and bitmap writes to
> respect the physical block size (e.g. 4k for today's 512e drives) when
> possible.  In each case there is already logic for padding out to the
> underlying logical sector size.  I reuse or repeat the logic for
> padding out to the physical sector size, but treat the padding out as
> optional rather than mandatory.
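The "mandatory logical, optional physical" padding policy described above can be sketched as follows. `pad_write_len` and its `limit` parameter (the first byte the padded write must not reach, e.g. the start of the next on-disk structure) are hypothetical. The actual patches express the overlap check differently in md.c and md-bitmap.c. Block sizes are assumed to be powers of two.

```c
/* Round x up to a multiple of align (align must be a power of two). */
static unsigned int round_up_pow2(unsigned int x, unsigned int align)
{
	return (x + align - 1) & ~(align - 1);
}

/*
 * Hypothetical sketch: pad a write of len bytes at byte offset off.
 * Padding to the logical block size is mandatory; padding further to
 * the physical block size is applied only when the write starts on a
 * physical boundary and the padded write stays within limit.
 */
static unsigned int pad_write_len(unsigned long long off, unsigned int len,
				  unsigned int lbs, unsigned int pbs,
				  unsigned long long limit)
{
	unsigned int mandatory = round_up_pow2(len, lbs);
	unsigned int optional = round_up_pow2(len, pbs);

	if (off % pbs == 0 && off + optional <= limit)
		return optional;
	return mandatory;
}
```

With `lbs = 512` and `pbs = 4096`, a 3500-byte tail at offset 0 becomes a 4096-byte write when there is room, and falls back to the mandatory 3584 bytes (7 sectors) when the next structure is too close.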
>
> The corresponding block trace with these patches is:
>
>   8,32   1    2 0.03410       694  D  WS 2064 + 8 [kworker/1:1H]
>   8,32   1    5 0.001368788   694  D  WS 2056 + 8 [kworker/1:1H]
>   8,32   1    8 0.002727981   694  D  WS 2080 + 8 [kworker/1:1H]
>   8,32   1   11 0.003533831  3063  D  WS 11721043920 + 8 [md127_raid1]
>   8,32   1   14 0.253952321   694  D  WS 2064 + 8 [kworker/1:1H]
>   8,32   1   17 0.255354215   694  D  WS 2056 + 8 [kworker/1:1H]
>   8,32   1   20 5.337938486   694  D  WS 2064 + 8 [kworker/1:1H]
>   8,32   1   23 5.577963062   694  D  WS 2080 + 8 [kworker/1:1H]
>
> I do notice that the code for bitmap writes has a more sophisticated
> and thorough check for overlap than the code for superblock writes.
> (Compare write_sb_page in md-bitmap.c vs. super_1_load in md.c.) From
> As far as I know, since the various structures have always started on
> 4k-aligned offsets anyway, it is always safe to pad the superblock write
> out to 4k (as already happens on 4k native drives), but not necessarily further.
>
> Feedback appreciated.
>
>   --Chris

Thanks for the patches. Do you have performance numbers before/after these
changes? Some micro benchmarks results would be great motivation.

Thanks,
Song


>
>
> Christopher Unkel (3):
>   md: align superblock writes to physical blocks
>   md: factor sb write alignment check into function
>   md: pad writes to end of bitmap to physical blocks
>
>  drivers/md/md-bitmap.c | 80 +-
>  drivers/md/md.c        | 15 
>  2 files changed, 63 insertions(+), 32 deletions(-)
>
> --
> 2.17.1
>

Re: [PATCH 1/4] MAINTAINERS: move Kamil Debski to credits

2020-10-22 Thread Mauro Carvalho Chehab

On Thu, 22 Oct 2020 22:09:25 +0200
Krzysztof Kozlowski  wrote:

> On Thu, Oct 22, 2020 at 09:13:14PM +0200, Uwe Kleine-König wrote:
> > Hello,
> > 
> > this series doesn't seem to have been applied, and looking at the list of people
> > this mail was sent "To:", it's not obvious who is expected to take it. I
> > assume it is not for us linux-pwm guys and will tag it as
> > "not-applicable" in our patchwork.
> 
> Hi Uwe,
> 
> All of the patches, including the one here, touch actually multiple
> subsystems, so if this is OK with you, I could take them through
> Samsung SoC.

Acked-by: Mauro Carvalho Chehab 
> 
> Best regards,
> Krzysztof
> 



Thanks,
Mauro

[PATCH v3] i2c: designware: call i2c_dw_read_clear_intrbits_slave() once

2020-10-22 Thread Michael Wu

If some bits were cleared by i2c_dw_read_clear_intrbits_slave() in
i2c_dw_isr_slave() and not handled immediately, those cleared bits would
not be reported again by later i2c_dw_read_clear_intrbits_slave() calls,
so they were never handled.

i2c_dw_read_clear_intrbits_slave() should be called only once per ISR, and
its returned status should be used for all subsequent handling.
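The failure mode is the usual one for read-to-clear status registers: any read whose result is discarded silently consumes events. The following is a hypothetical user-space model (`fake_dev` and its helpers are illustrative, not driver code). It contrasts a handler that reads twice with one that takes a single snapshot, which is the pattern the patch switches to.

```c
#include <stdint.h>

/* Hypothetical model of a device with a read-to-clear status register. */
struct fake_dev {
	uint32_t pending;	/* bits latched by the "hardware" */
	uint32_t handled;	/* bits the handler has processed */
};

/* Reading returns the latched bits and clears them, like the real helper. */
static uint32_t read_clear_intrbits(struct fake_dev *d)
{
	uint32_t stat = d->pending;

	d->pending = 0;
	return stat;
}

/* Buggy pattern: the first read's result is thrown away, losing its bits. */
static void isr_read_twice(struct fake_dev *d)
{
	read_clear_intrbits(d);
	d->handled |= read_clear_intrbits(d);
}

/* Fixed pattern: read once and act on that snapshot. */
static void isr_read_once(struct fake_dev *d)
{
	uint32_t stat = read_clear_intrbits(d);

	d->handled |= stat;
}
```

With `pending = 0x24`, the read-twice handler ends with `handled == 0`, while the read-once handler ends with `handled == 0x24`.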

Signed-off-by: Michael Wu 
---

Change in v3:
 - revert deleted braces of 'else' branch in v2

Change in v2:
 - revert moving I2C_SLAVE_WRITE_REQUESTED reporting in v1

 drivers/i2c/busses/i2c-designware-slave.c | 7 +--
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/drivers/i2c/busses/i2c-designware-slave.c b/drivers/i2c/busses/i2c-designware-slave.c
index 44974b53a626..13de01a0f75f 100644
--- a/drivers/i2c/busses/i2c-designware-slave.c
+++ b/drivers/i2c/busses/i2c-designware-slave.c
@@ -159,7 +159,6 @@ static int i2c_dw_irq_handler_slave(struct dw_i2c_dev *dev)
u32 raw_stat, stat, enabled, tmp;
u8 val = 0, slave_activity;
 
-   regmap_read(dev->map, DW_IC_INTR_STAT, );
regmap_read(dev->map, DW_IC_ENABLE, );
regmap_read(dev->map, DW_IC_RAW_INTR_STAT, _stat);
regmap_read(dev->map, DW_IC_STATUS, );
@@ -168,6 +167,7 @@ static int i2c_dw_irq_handler_slave(struct dw_i2c_dev *dev)
if (!enabled || !(raw_stat & ~DW_IC_INTR_ACTIVITY) || !dev->slave)
return 0;
 
+   stat = i2c_dw_read_clear_intrbits_slave(dev);
dev_dbg(dev->dev,
"%#x STATUS SLAVE_ACTIVITY=%#x : RAW_INTR_STAT=%#x : INTR_STAT=%#x\n",
enabled, slave_activity, raw_stat, stat);
@@ -188,11 +188,9 @@ static int i2c_dw_irq_handler_slave(struct dw_i2c_dev *dev)
 val);
}
regmap_read(dev->map, DW_IC_CLR_RD_REQ, );
-   stat = i2c_dw_read_clear_intrbits_slave(dev);
} else {
regmap_read(dev->map, DW_IC_CLR_RD_REQ, );
regmap_read(dev->map, DW_IC_CLR_RX_UNDER, );
-   stat = i2c_dw_read_clear_intrbits_slave(dev);
}
if (!i2c_slave_event(dev->slave,
 I2C_SLAVE_READ_REQUESTED,
@@ -207,7 +205,6 @@ static int i2c_dw_irq_handler_slave(struct dw_i2c_dev *dev)
regmap_read(dev->map, DW_IC_CLR_RX_DONE, );
 
i2c_slave_event(dev->slave, I2C_SLAVE_STOP, );
-   stat = i2c_dw_read_clear_intrbits_slave(dev);
return 1;
}
 
@@ -219,7 +216,6 @@ static int i2c_dw_irq_handler_slave(struct dw_i2c_dev *dev)
dev_vdbg(dev->dev, "Byte %X acked!", val);
} else {
i2c_slave_event(dev->slave, I2C_SLAVE_STOP, );
-   stat = i2c_dw_read_clear_intrbits_slave(dev);
}
 
return 1;
@@ -230,7 +226,6 @@ static irqreturn_t i2c_dw_isr_slave(int this_irq, void *dev_id)
struct dw_i2c_dev *dev = dev_id;
int ret;
 
-   i2c_dw_read_clear_intrbits_slave(dev);
ret = i2c_dw_irq_handler_slave(dev);
if (ret > 0)
complete(>cmd_complete);
-- 
2.17.1

Re: [PATCH 2/2] cpufreq: Drop restore_freq from struct cpufreq_policy

2020-10-22 Thread Viresh Kumar

On 22-10-20, 13:57, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki 
> 
> The restore_freq field in struct cpufreq_policy is only used by
> __target_index() in one place and a local variable in that function
> may as well be used instead of it, so drop it and modify
> __target_index() accordingly.
> 
> Signed-off-by: Rafael J. Wysocki 
> ---
>  drivers/cpufreq/cpufreq.c |   10 +-
>  include/linux/cpufreq.h   |5 -
>  2 files changed, 5 insertions(+), 10 deletions(-)

Acked-by: Viresh Kumar 

-- 
viresh

Re: [LKP] Re: [sched] bdfcae1140: will-it-scale.per_thread_ops -37.0% regression

2020-10-22 Thread Xing Zhengjun





On 10/22/2020 9:19 PM, Mathieu Desnoyers wrote:

- On Oct 21, 2020, at 9:54 PM, Xing Zhengjun zhengjun.x...@linux.intel.com wrote:
[...]

In fact, 0-day just copies the will-it-scale benchmark from GitHub. If
you think the will-it-scale benchmark has some issues, you can
contribute your ideas and help improve it; we will then update our copy
of will-it-scale to the new version.


This is why I CC'd the maintainer of the will-it-scale GitHub project, Anton Blanchard.
My main intent is to report this issue to him, but I have not heard back from him yet.
Is this project maintained? Let me try adding his ozlabs.org address to CC.


For this test case, if we bind the workload to a specific CPU, it
will hide the scheduler balance issue. In the real world, workloads
are seldom bound to a CPU...


When you say that you bind the workload to a specific CPU, is that done
outside of the will-it-scale testsuite, thus limiting the entire testsuite
to a single CPU, or do you expect the will-it-scale context-switch1 test
to be affined internally to a single specific CPU/core/hardware thread
through the use of hwloc?


The latter.


Thanks,

Mathieu



--
Zhengjun Xing

Re: [PATCH V3 2/3] vhost: vdpa: report iova range

2020-10-22 Thread kernel test robot

Hi Jason,

I love your patch! Perhaps something to improve:

[auto build test WARNING on vhost/linux-next]
[also build test WARNING on linus/master v5.9 next-20201023]
[cannot apply to linux/master]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting a patch, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Jason-Wang/vDPA-API-for-reporting-IOVA-range/20201023-102708
base:   https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git linux-next
config: m68k-randconfig-r034-20201022 (attached as .config)
compiler: m68k-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/446e7b97838ebf87f1acd61580137716fdad104a
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Jason-Wang/vDPA-API-for-reporting-IOVA-range/20201023-102708
        git checkout 446e7b97838ebf87f1acd61580137716fdad104a
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=m68k

If you fix the issue, kindly add the following tag as appropriate
Reported-by: kernel test robot 

All warnings (new ones prefixed by >>):

   drivers/vhost/vdpa.c: In function 'vhost_vdpa_setup_vq_irq':
   drivers/vhost/vdpa.c:94:6: warning: variable 'ret' set but not used [-Wunused-but-set-variable]
  94 |  int ret, irq;
     |  ^~~
   drivers/vhost/vdpa.c: In function 'vhost_vdpa_unlocked_ioctl':
>> drivers/vhost/vdpa.c:483:5: warning: this statement may fall through [-Wimplicit-fallthrough=]
 483 |   r = copy_to_user(featurep, , sizeof(features));
     |   ~~^
   drivers/vhost/vdpa.c:484:2: note: here
 484 |  case VHOST_VDPA_GET_IOVA_RANGE:
     |  ^~~~
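`-Wimplicit-fallthrough` fires when a `case` body can run into the next label without a `break`, `return`, or an explicit fall-through annotation. Here the statement at line 483 can continue into `case VHOST_VDPA_GET_IOVA_RANGE` at line 484. A minimal sketch of the usual fix pattern follows, with hypothetical command names standing in for the vhost ioctls.

```c
/* Hypothetical stand-ins for two ioctl commands. */
enum demo_cmd { DEMO_GET_FEATURES, DEMO_GET_RANGE };

static long demo_ioctl(enum demo_cmd cmd, int *out)
{
	long r = 0;

	switch (cmd) {
	case DEMO_GET_FEATURES:
		*out = 1;
		break;	/* without this, execution would continue into DEMO_GET_RANGE */
	case DEMO_GET_RANGE:
		*out = 2;
		break;
	default:
		r = -1;	/* unknown command */
		break;
	}
	return r;
}
```

If the first `break` were missing, `DEMO_GET_FEATURES` would end with `*out == 2`, which is the class of bug the warning guards against.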

vim +483 drivers/vhost/vdpa.c

4c8cf31885f69e8 Tiwei Bie    2020-03-26  426  
4c8cf31885f69e8 Tiwei Bie    2020-03-26  427  static long vhost_vdpa_unlocked_ioctl(struct file *filep,
4c8cf31885f69e8 Tiwei Bie    2020-03-26  428  				      unsigned int cmd, unsigned long arg)
4c8cf31885f69e8 Tiwei Bie    2020-03-26  429  {
4c8cf31885f69e8 Tiwei Bie    2020-03-26  430  	struct vhost_vdpa *v = filep->private_data;
4c8cf31885f69e8 Tiwei Bie    2020-03-26  431  	struct vhost_dev *d = >vdev;
4c8cf31885f69e8 Tiwei Bie    2020-03-26  432  	void __user *argp = (void __user *)arg;
a127c5bbb6a8eee Jason Wang   2020-09-07  433  	u64 __user *featurep = argp;
a127c5bbb6a8eee Jason Wang   2020-09-07  434  	u64 features;
4c8cf31885f69e8 Tiwei Bie    2020-03-26  435  	long r;
4c8cf31885f69e8 Tiwei Bie    2020-03-26  436  
a127c5bbb6a8eee Jason Wang   2020-09-07  437  	if (cmd == VHOST_SET_BACKEND_FEATURES) {
a127c5bbb6a8eee Jason Wang   2020-09-07  438  		r = copy_from_user(, featurep, sizeof(features));
a127c5bbb6a8eee Jason Wang   2020-09-07  439  		if (r)
a127c5bbb6a8eee Jason Wang   2020-09-07  440  			return r;
a127c5bbb6a8eee Jason Wang   2020-09-07  441  		if (features & ~VHOST_VDPA_BACKEND_FEATURES)
a127c5bbb6a8eee Jason Wang   2020-09-07  442  			return -EOPNOTSUPP;
a127c5bbb6a8eee Jason Wang   2020-09-07  443  		vhost_set_backend_features(>vdev, features);
a127c5bbb6a8eee Jason Wang   2020-09-07  444  		return 0;
a127c5bbb6a8eee Jason Wang   2020-09-07  445  	}
a127c5bbb6a8eee Jason Wang   2020-09-07  446  
4c8cf31885f69e8 Tiwei Bie    2020-03-26  447  	mutex_lock(>mutex);
4c8cf31885f69e8 Tiwei Bie    2020-03-26  448  
4c8cf31885f69e8 Tiwei Bie    2020-03-26  449  	switch (cmd) {
4c8cf31885f69e8 Tiwei Bie    2020-03-26  450  	case VHOST_VDPA_GET_DEVICE_ID:
4c8cf31885f69e8 Tiwei Bie    2020-03-26  451  		r = vhost_vdpa_get_device_id(v, argp);
4c8cf31885f69e8 Tiwei Bie    2020-03-26  452  		break;
4c8cf31885f69e8 Tiwei Bie    2020-03-26  453  	case VHOST_VDPA_GET_STATUS:
4c8cf31885f69e8 Tiwei Bie    2020-03-26  454  		r = vhost_vdpa_get_status(v, argp);
4c8cf31885f69e8 Tiwei Bie    2020-03-26  455  		break;
4c8cf31885f69e8 Tiwei Bie    2020-03-26  456  	case VHOST_VDPA_SET_STATUS:
4c8cf31885f69e8 Tiwei Bie    2020-03-26  457  		r = vhost_vdpa_set_status(v, argp);
4c8cf31885f69e8 Tiwei Bie    2020-03-26  458  		break;
4c8cf31885f69e8 Tiwei Bie    2020-03-26  459  	case VHOST_VDPA_GET_CONFIG:
4c8cf31885f69e8 Tiwei Bie    2020-03-26  460  		r = vhost_vdpa_get_config(v, argp);
4c8cf31885f69e8 Tiwei Bie    2020-03-26  461  		break;
4c8cf31885f69e8 Tiwei Bie    2020-03-26  462  	case VHOST_VDPA_SET_CONFIG:
4c8cf31885f69e8 Tiwei Bie    2020-03-26  463  		r = vhost_vdpa_se

Re: [PATCH v8 -tip 02/26] sched: Introduce sched_class::pick_task()

2020-10-22 Thread Li, Aubrey

On 2020/10/22 23:25, Joel Fernandes wrote:
> On Thu, Oct 22, 2020 at 12:59 AM Li, Aubrey  wrote:
>>
>> On 2020/10/20 9:43, Joel Fernandes (Google) wrote:
>>> From: Peter Zijlstra 
>>>
>>> Because sched_class::pick_next_task() also implies
>>> sched_class::set_next_task() (and possibly put_prev_task() and
>>> newidle_balance) it is not state invariant. This makes it unsuitable
>>> for remote task selection.
>>>
>>> Tested-by: Julien Desfossez 
>>> Signed-off-by: Peter Zijlstra (Intel) 
>>> Signed-off-by: Vineeth Remanan Pillai 
>>> Signed-off-by: Julien Desfossez 
>>> Signed-off-by: Joel Fernandes (Google) 
>>> ---
>>>  kernel/sched/deadline.c  | 16 ++--
>>>  kernel/sched/fair.c  | 32 +++-
>>>  kernel/sched/idle.c  |  8 
>>>  kernel/sched/rt.c| 14 --
>>>  kernel/sched/sched.h |  3 +++
>>>  kernel/sched/stop_task.c | 13 +++--
>>>  6 files changed, 79 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
>>> index 814ec49502b1..0271a7848ab3 100644
>>> --- a/kernel/sched/deadline.c
>>> +++ b/kernel/sched/deadline.c
>>> @@ -1848,7 +1848,7 @@ static struct sched_dl_entity *pick_next_dl_entity(struct rq *rq,
>>>   return rb_entry(left, struct sched_dl_entity, rb_node);
>>>  }
>>>
>>> -static struct task_struct *pick_next_task_dl(struct rq *rq)
>>> +static struct task_struct *pick_task_dl(struct rq *rq)
>>>  {
>>>   struct sched_dl_entity *dl_se;
>>>   struct dl_rq *dl_rq = >dl;
>>> @@ -1860,7 +1860,18 @@ static struct task_struct *pick_next_task_dl(struct rq *rq)
>>>   dl_se = pick_next_dl_entity(rq, dl_rq);
>>>   BUG_ON(!dl_se);
>>>   p = dl_task_of(dl_se);
>>> - set_next_task_dl(rq, p, true);
>>> +
>>> + return p;
>>> +}
>>> +
>>> +static struct task_struct *pick_next_task_dl(struct rq *rq)
>>> +{
>>> + struct task_struct *p;
>>> +
>>> + p = pick_task_dl(rq);
>>> + if (p)
>>> + set_next_task_dl(rq, p, true);
>>> +
>>>   return p;
>>>  }
>>>
>>> @@ -2517,6 +2528,7 @@ const struct sched_class dl_sched_class
>>>
>>>  #ifdef CONFIG_SMP
>>>   .balance= balance_dl,
>>> + .pick_task  = pick_task_dl,
>>>   .select_task_rq = select_task_rq_dl,
>>>   .migrate_task_rq= migrate_task_rq_dl,
>>>   .set_cpus_allowed   = set_cpus_allowed_dl,
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index dbd9368a959d..bd6aed63f5e3 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -4450,7 +4450,7 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
>>>* Avoid running the skip buddy, if running something else can
>>>* be done without getting too unfair.
>>>*/
>>> - if (cfs_rq->skip == se) {
>>> + if (cfs_rq->skip && cfs_rq->skip == se) {
>>>   struct sched_entity *second;
>>>
>>>   if (se == curr) {
>>> @@ -6976,6 +6976,35 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
>>>   set_last_buddy(se);
>>>  }
>>>
>>> +#ifdef CONFIG_SMP
>>> +static struct task_struct *pick_task_fair(struct rq *rq)
>>> +{
>>> + struct cfs_rq *cfs_rq = >cfs;
>>> + struct sched_entity *se;
>>> +
>>> + if (!cfs_rq->nr_running)
>>> + return NULL;
>>> +
>>> + do {
>>> + struct sched_entity *curr = cfs_rq->curr;
>>> +
>>> + se = pick_next_entity(cfs_rq, NULL);
>>> +
>>> + if (curr) {
>>> + if (se && curr->on_rq)
>>> + update_curr(cfs_rq);
>>> +
>>> + if (!se || entity_before(curr, se))
>>> + se = curr;
>>> + }
>>> +
>>> + cfs_rq = group_cfs_rq(se);
>>> + } while (cfs_rq);
>>> ++
>>> + return task_of(se);
>>> +}
>>> +#endif
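The split above separates a side-effect-free selection step (`pick_task`) from the step that commits the choice (`set_next_task`). Below is a toy model of that idea. `toy_rq`, `toy_pick_task` and `toy_pick_next_task` are hypothetical and far simpler than the real scheduler classes (no `update_curr()`, no buddies), but they show why a pure picker can be called on a remote runqueue without disturbing it.

```c
#define TOY_NR_TASKS 4

/* Hypothetical runqueue: runnable flags, an ordering key, the current task. */
struct toy_rq {
	int runnable[TOY_NR_TASKS];
	unsigned long key[TOY_NR_TASKS];	/* lower key runs first */
	int curr;				/* index of current task, -1 if none */
};

/* Pure selection: returns the best runnable index or -1, changes nothing. */
static int toy_pick_task(const struct toy_rq *rq)
{
	int best = -1;

	for (int i = 0; i < TOY_NR_TASKS; i++) {
		if (!rq->runnable[i])
			continue;
		if (best < 0 || rq->key[i] < rq->key[best])
			best = i;
	}
	return best;
}

/* Selection plus commit, analogous to pick_task_dl() + set_next_task_dl(). */
static int toy_pick_next_task(struct toy_rq *rq)
{
	int p = toy_pick_task(rq);

	if (p >= 0)
		rq->curr = p;
	return p;
}
```

Calling `toy_pick_task()` repeatedly returns the same answer and leaves `curr` alone. Only `toy_pick_next_task()` changes state.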
>>
>> One of my machines hangs when I run uperf with only one message:
>> [  719.034962] BUG: kernel NULL pointer dereference, address: 0050
>>
>> Then I reproduced the problem on another machine (no serial console);
>> here is the stack, copied by hand.
>>
>> Call Trace:
>>  pick_next_entity+0xb0/0x160
>>  pick_task_fair+0x4b/0x90
>>  __schedule+0x59b/0x12f0
>>  schedule_idle+0x1e/0x40
>>  do_idle+0x193/0x2d0
>>  cpu_startup_entry+0x19/0x20
>>  start_secondary+0x110/0x150
>>  secondary_startup_64_no_verify+0xa6/0xab
> 
> Interesting. Wondering if we screwed something up in the rebase.
> 
> Questions:
> 1. Does the issue happen if you just apply only up until this patch,
> or the entire series?

I applied the entire series and just picked the related patch to report the
issue against.

> 2. Do you see the issue in v7? Not much if at all has changed in this
> part of the code from v7 -> v8 but could be something in the newer
> kernel.
> 

IIRC, I can run uperf successfully on v7.
I'm on tip/master 2d3e8c9424c9

[PATCH net] net: hns3: Clear the CMDQ registers before unmapping BAR region

2020-10-22 Thread Zenghui Yu

When unbinding the hns3 driver with the HNS3 VF, I got the following
kernel panic:

[  265.709989] Unable to handle kernel paging request at virtual address 800054627000
[  265.717928] Mem abort info:
[  265.720740]   ESR = 0x9647
[  265.723810]   EC = 0x25: DABT (current EL), IL = 32 bits
[  265.729126]   SET = 0, FnV = 0
[  265.732195]   EA = 0, S1PTW = 0
[  265.735351] Data abort info:
[  265.738227]   ISV = 0, ISS = 0x0047
[  265.742071]   CM = 0, WnR = 1
[  265.745055] swapper pgtable: 4k pages, 48-bit VAs, pgdp=09b54000
[  265.751753] [800054627000] pgd=202ff003, p4d=202ff003, pud=2020020eb003, pmd=0020a0dfc003, pte=
[  265.764314] Internal error: Oops: 9647 [#1] SMP
[  265.830357] CPU: 61 PID: 20319 Comm: bash Not tainted 5.9.0+ #206
[  265.836423] Hardware name: Huawei TaiShan 2280 V2/BC82AMDDA, BIOS 1.05 09/18/2019
[  265.843873] pstate: 8049 (Nzcv daif +PAN -UAO -TCO BTYPE=--)
[  265.843890] pc : hclgevf_cmd_uninit+0xbc/0x300
[  265.861988] lr : hclgevf_cmd_uninit+0xb0/0x300
[  265.861992] sp : 80004c983b50
[  265.881411] pmr_save: 00e0
[  265.884453] x29: 80004c983b50 x28: 20280bbce500
[  265.889744] x27:  x26: 
[  265.895034] x25: 800011a1f000 x24: 800011a1fe90
[  265.900325] x23: 0020ce9b00d8 x22: 0020ce9b0150
[  265.905616] x21: 800010d70e90 x20: 800010d70e90
[  265.910906] x19: 0020ce9b0080 x18: 0004
[  265.916198] x17:  x16: 800011ae32e8
[  265.916201] x15: 0028 x14: 0002
[  265.916204] x13: 800011ae32e8 x12: 00012ad8
[  265.946619] x11: 80004c983b50 x10: 
[  265.951911] x9 : 8000115d0888 x8 : 
[  265.951914] x7 : 800011890b20 x6 : c0007fff
[  265.951917] x5 : 80004c983930 x4 : 0001
[  265.951919] x3 : a027eec1b000 x2 : 2b78ccbbff369100
[  265.964487] x1 :  x0 : 800054627000
[  265.964491] Call trace:
[  265.964494]  hclgevf_cmd_uninit+0xbc/0x300
[  265.964496]  hclgevf_uninit_ae_dev+0x9c/0xe8
[  265.964501]  hnae3_unregister_ae_dev+0xb0/0x130
[  265.964516]  hns3_remove+0x34/0x88 [hns3]
[  266.009683]  pci_device_remove+0x48/0xf0
[  266.009692]  device_release_driver_internal+0x114/0x1e8
[  266.030058]  device_driver_detach+0x28/0x38
[  266.034224]  unbind_store+0xd4/0x108
[  266.037784]  drv_attr_store+0x40/0x58
[  266.041435]  sysfs_kf_write+0x54/0x80
[  266.045081]  kernfs_fop_write+0x12c/0x250
[  266.049076]  vfs_write+0xc4/0x248
[  266.052378]  ksys_write+0x74/0xf8
[  266.055677]  __arm64_sys_write+0x24/0x30
[  266.059584]  el0_svc_common.constprop.3+0x84/0x270
[  266.064354]  do_el0_svc+0x34/0xa0
[  266.067658]  el0_svc+0x38/0x40
[  266.070700]  el0_sync_handler+0x8c/0xb0
[  266.074519]  el0_sync+0x140/0x180

It looks like the BAR memory region has already been unmapped by the time we
start clearing the CMDQ registers in it, which is pretty bad: the register
writes hit an unmapped address and the kernel happily kills itself with a
Current EL Data Abort (on arm64).

Moving the CMDQ uninitialization ahead of hclgevf_pci_uninit() fixes the issue for me.
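The ordering constraint is the general teardown rule: release resources in the reverse order of acquisition, so anything that still touches the BAR must run before the BAR is unmapped. The hypothetical model below is not hns3 code (`toy_hdev`, `toy_cmd_uninit` and `toy_pci_uninit` are illustrative). It makes the dependency explicit by failing the register write after unmap instead of faulting.

```c
#include <stdbool.h>

/* Hypothetical device: `mapped` models whether the BAR is still mapped. */
struct toy_hdev {
	bool mapped;
	unsigned int cmdq_regs;	/* stands in for the CMDQ registers in the BAR */
};

/* Models clearing the CMDQ registers; returns -1 where real code would fault. */
static int toy_cmd_uninit(struct toy_hdev *hdev)
{
	if (!hdev->mapped)
		return -1;
	hdev->cmdq_regs = 0;
	return 0;
}

/* Models unmapping the BAR. */
static void toy_pci_uninit(struct toy_hdev *hdev)
{
	hdev->mapped = false;
}

/* Order used by the patch: clear the registers first, then unmap. */
static int toy_uninit_fixed(struct toy_hdev *hdev)
{
	int ret = toy_cmd_uninit(hdev);

	toy_pci_uninit(hdev);
	return ret;
}

/* Original order: unmap first, so the register clear hits unmapped memory. */
static int toy_uninit_buggy(struct toy_hdev *hdev)
{
	toy_pci_uninit(hdev);
	return toy_cmd_uninit(hdev);
}
```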

Signed-off-by: Zenghui Yu 
---

I have almost zero knowledge about the hns3 driver. You can regard this
as a report and make a better fix if possible.

I can't even figure out how we have lived with this issue for so long... It
should have existed since commit 34f81f049e35 ("net: hns3: clear command
queue's registers when unloading VF driver"), where we started writing
something into the unmapped area.

 drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
index 50c84c5e65d2..c8e3fdd5999c 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
@@ -3262,8 +3262,8 @@ static void hclgevf_uninit_hdev(struct hclgevf_dev *hdev)
hclgevf_uninit_msi(hdev);
}
 
-   hclgevf_pci_uninit(hdev);
hclgevf_cmd_uninit(hdev);
+   hclgevf_pci_uninit(hdev);
hclgevf_uninit_mac_list(hdev);
 }
 
-- 
2.19.1

Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core

2020-10-22 Thread Viresh Kumar

On 22-10-20, 17:55, Vincent Guittot wrote:
> On Thu, 22 Oct 2020 at 17:45, A L  wrote:
> >
> >
> >
> >  From: Peter Zijlstra  -- Sent: 2020-10-22 - 14:29
> >
> > > On Thu, Oct 22, 2020 at 02:19:29PM +0200, Rafael J. Wysocki wrote:
> > >> > However I do want to retire ondemand, conservative and also very much
> > >> > intel_pstate/active mode.
> > >>
> > >> I agree in general, but IMO it would not be prudent to do that without
> > >> making schedutil provide the same level of performance in all of the
> > >> relevant use cases.
> > >
> > > Agreed; I thought I had understood we were there already.
> >
> > Hi,
> >
> >
> > Currently schedutil does not populate all the stats that ondemand does,
> > which can be a problem for some monitoring software.
> >
> > On my AMD 3000G CPU with kernel-5.9.1:
> >
> >
> > grep . /sys/devices/system/cpu/cpufreq/policy0/stats/*
> >
> > With ondemand:
> > time_in_state:390 145179
> > time_in_state:160 9588482
> > total_trans:177565
> > trans_table:   From  :To
> > trans_table: :   390   160
> > trans_table:  390: 0 88783
> > trans_table:  160: 88782 0
> >
> > With schedutil, only two files exist:
> > reset:
> > total_trans:216609
> >
> >
> > I'd really like to have these stats populated with schedutil, if that's possible.
> 
> Your problem might have been fixed with
> commit 96f60cddf7a1 ("cpufreq: stats: Enable stats for fast-switch as well")

Thanks Vincent. Right, I have already fixed that for everyone.

-- 
viresh

Re: [LTP] mmstress[1309]: segfault at 7f3d71a36ee8 ip 00007f3d77132bdf sp 00007f3d71a36ee8 error 4 in libc-2.27.so[7f3d77058000+1aa000]

2020-10-22 Thread Sean Christopherson

On Thu, Oct 22, 2020 at 08:05:05PM -0700, Linus Torvalds wrote:
> On Thu, Oct 22, 2020 at 6:36 PM Daniel Díaz  wrote:
> >
> > The kernel Naresh originally referred to is here:
> >   https://builds.tuxbuild.com/SCI7Xyjb7V2NbfQ2lbKBZw/
> 
> Thanks.
> 
> And when I started looking at it, I realized that my original idea
> ("just look for __put_user_nocheck_X calls, there aren't so many of
> those") was garbage, and that I was just being stupid.
> 
> Yes, the commit that broke was about __put_user(), but in order to not
> duplicate all the code, it re-used the regular put_user()
> infrastructure, and so all the normal put_user() calls are potential
> problem spots too if this is about the compiler interaction with KASAN
> and the asm changes.
> 
> So it's not just a couple of special cases to look at, it's all the
> normal cases too.
> 
> Ok, back to the drawing board, but I think reverting it is probably
> the right thing to do if I can't think of something smart.
> 
> That said, since you see this on x86-64, where the whole ugly trick with that
> 
>register asm("%"_ASM_AX)
> 
> is unnecessary (because the 8-byte case is still just a single
> register, no %eax:%edx games needed), it would be interesting to hear
> if the attached patch fixes it. That would confirm that the problem
> really is due to some register allocation issue interaction (or,
> alternatively, it would tell me that there's something else going on).

I haven't reproduced the crash, but I did find a smoking gun that confirms the
"register shenanigans are evil shenanigans" theory.  I ran into a similar thing
recently where a seemingly innocuous line of code after loading a value into a
register variable wreaked havoc because it clobbered the input register.

This put_user() in schedule_tail():

   if (current->set_child_tid)
   put_user(task_pid_vnr(current), current->set_child_tid);

generates the following assembly with KASAN out-of-line:

   0x810dccc9 <+73>:	xor    %edx,%edx
   0x810dcccb <+75>:	xor    %esi,%esi
   0x810dcccd <+77>:	mov    %rbp,%rdi
   0x810dccd0 <+80>:	callq  0x810bf5e0 <__task_pid_nr_ns>
   0x810dccd5 <+85>:	mov    %r12,%rdi
   0x810dccd8 <+88>:	callq  0x81388c60 <__asan_load8>
   0x810dccdd <+93>:	mov    0x590(%rbp),%rcx
   0x810dcce4 <+100>:	callq  0x817708a0 <__put_user_4>
   0x810dcce9 <+105>:	pop    %rbx
   0x810dccea <+106>:	pop    %rbp
   0x810dcceb <+107>:	pop    %r12

__task_pid_nr_ns() returns the pid in %rax, which gets clobbered by
__asan_load8()'s check on current for the current->set_child_tid dereference.

[git pull] Input updates for v5.10-rc0

2020-10-22 Thread Dmitry Torokhov

Hi Linus,

Please pull from:

git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input.git for-linus

to receive updates for the input subsystem. You will get:

- a new driver for ADC driven joysticks
- a new Zinitix touchscreen driver
- enhancements to Intel SoC button array driver
- support for F3A "function" in Synaptics RMI4 driver
- assorted driver fixups

Changelog:
-

Artur Rojek (2):
  dt-bindings: input: Add docs for ADC driven joystick
  Input: joystick - add ADC attached joystick driver.

Dan Carpenter (1):
  Input: imx6ul_tsc - clean up some errors in imx6ul_tsc_resume()

Dmitry Torokhov (1):
  Input: imx6ul_tsc - unify open/close and PM paths

Furquan Shaikh (1):
  Input: raydium_i2c_ts - use single i2c_transfer transaction when using RM_CMD_BANK_SWITCH

Hans de Goede (8):
  Input: allocate keycodes for notification-center, pickup-phone and hangup-phone
  Input: allocate keycode for Fn + right shift
  platform/x86: thinkpad_acpi: Add support for new hotkeys found on X1C8 / T14
  platform/x86: thinkpad_acpi: Map Clipping tool hotkey to KEY_SELECTIVE_SCREENSHOT
  Input: soc_button_array - add active_low setting to soc_button_info
  Input: soc_button_array - add support for INT33D3 tablet-mode switch devices
  Input: soc_button_array - work around DSDTs which modify the irqflags
  Input: synaptics - enable InterTouch for ThinkPad T14 Gen 1

Jason A. Donenfeld (2):
  Input: synaptics-rmi4 - support bootloader v8 in f34v7
  Input: synaptics - enable InterTouch for ThinkPad P1/X1E gen 2

Joe Perches (1):
  Input: MT - avoid comma separated statements

Johnny Chuang (2):
  Input: elants_i2c - report resolution of ABS_MT_TOUCH_MAJOR by FW information.
  Input: elants_i2c - fix typo for an attribute to show calibration count

Kenny Levinsen (1):
  Input: evdev - per-client waitgroups

Krzysztof Kozlowski (4):
  Input: ep93xx_keypad - fix handling of platform_get_irq() error
  Input: omap4-keypad - fix handling of platform_get_irq() error
  Input: twl4030_keypad - fix handling of platform_get_irq() error
  Input: sun4i-ps2 - fix handling of platform_get_irq() error

Michael Srba (2):
  dt-bindings: input/touchscreen: add bindings for zinitix
  Input: add zinitix touchscreen driver

Mika Penttilä (1):
  Input: Add MAINTAINERS entry for SiS i2c touch input driver

Vincent Huang (2):
  Input: synaptics-rmi4 - rename f30_data to gpio_data
  Input: synaptics-rmi4 - add support for F3A

YueHaibing (1):
  Input: stmfts - fix a & vs && typo

Diffstat:


 .../devicetree/bindings/input/adc-joystick.yaml| 121 +
 .../bindings/input/touchscreen/zinitix.txt |  40 ++
 .../devicetree/bindings/vendor-prefixes.yaml   |   2 +
 MAINTAINERS|   7 +
 drivers/hid/hid-rmi.c  |   2 +-
 drivers/input/evdev.c  |  19 +-
 drivers/input/input-mt.c   |  11 +-
 drivers/input/joystick/Kconfig |  10 +
 drivers/input/joystick/Makefile|   1 +
 drivers/input/joystick/adc-joystick.c  | 264 ++
 drivers/input/keyboard/ep93xx_keypad.c |   4 +-
 drivers/input/keyboard/omap4-keypad.c  |   6 +-
 drivers/input/keyboard/twl4030_keypad.c|   8 +-
 drivers/input/misc/soc_button_array.c  | 100 +++-
 drivers/input/mouse/synaptics.c|   6 +-
 drivers/input/rmi4/Kconfig |   8 +
 drivers/input/rmi4/Makefile|   1 +
 drivers/input/rmi4/rmi_bus.c   |   3 +
 drivers/input/rmi4/rmi_driver.h|   1 +
 drivers/input/rmi4/rmi_f30.c   |  14 +-
 drivers/input/rmi4/rmi_f34v7.c |   9 +-
 drivers/input/rmi4/rmi_f3a.c   | 241 +
 drivers/input/serio/sun4i-ps2.c|   9 +-
 drivers/input/touchscreen/Kconfig  |  12 +
 drivers/input/touchscreen/Makefile |   1 +
 drivers/input/touchscreen/elants_i2c.c |   8 +-
 drivers/input/touchscreen/imx6ul_tsc.c |  47 +-
 drivers/input/touchscreen/raydium_i2c_ts.c | 131 ++---
 drivers/input/touchscreen/stmfts.c |   2 +-
 drivers/input/touchscreen/zinitix.c| 581 +
 drivers/platform/x86/thinkpad_acpi.c   |  18 +-
 include/linux/rmi.h|  11 +-
 include/uapi/linux/input-event-codes.h |   4 +
 33 files changed, 1531 insertions(+), 171 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/input/adc-joystick.yaml
 create mode 100644 Documentation/devicetree/bindings/input/touchscreen/zinitix.txt
 create mode 100644 drivers/input/joystick/adc-joystick.c
 create mode 100644

ERROR: modpost: "has_transparent_hugepage" undefined!

2020-10-22 Thread kernel test robot

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
head:   f9893351acaecf0a414baf9942b48d5bb5c688c6
commit: 6d82120f41561426dd67c86380d779b4599d070d device-dax: add an 'align' attribute
date:   9 days ago
config: mips-randconfig-m031-20201022 (attached as .config)
compiler: mips64-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6d82120f41561426dd67c86380d779b4599d070d
git remote add linus https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
git fetch --no-tags linus master
git checkout 6d82120f41561426dd67c86380d779b4599d070d
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=mips

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot 

All errors (new ones prefixed by >>, old ones prefixed by <<):

>> ERROR: modpost: "has_transparent_hugepage" [drivers/dax/dax.ko] undefined!
ERROR: modpost: "spurious_interrupt" [drivers/mfd/ioc3.ko] undefined!
ERROR: modpost: "pci_find_host_bridge" [drivers/mfd/ioc3.ko] undefined!

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org



Re: [PATCH] PM / s2idle: Export s2idle_set_ops

2020-10-22 Thread claude yen

On Thu, 2020-10-22 at 08:02 +0100, Sudeep Holla wrote:
> On Thu, Oct 22, 2020 at 02:17:48PM +0800, Claude Yen wrote:
> > As suspend_set_ops is exported in commit a5e4fd8783a2
> > ("PM / Suspend: Export suspend_set_ops, suspend_valid_only_mem"),
> > exporting s2idle_set_ops to make kernel module setup s2idle ops too.
> > 
> > In this way, kernel module can hook platform suspend
> > functions regardless of Suspend-to-Ram(S2R) or
> > Suspend-to-Idle(S2I)
> >
> 
> If this is for arm64 platform, then NACK. You must use PSCI and it will
> set the ops and it can't be module.
> 

PSCI uses suspend_set_ops instead. And suspend_set_ops has been
exported years ago.

Suspend-to-Idle (S2I) is another suspend method supported by the Linux
kernel. The underlying platform can hook in its s2idle_ops by calling
s2idle_set_ops. For example, S2I is now being introduced on MediaTek
SoC platforms, and the power management driver there is built as a
kernel module.

Mobile platforms now call for kernel drivers to be built as kernel
modules, which makes it easier to migrate drivers to newer Linux kernels.
Ref: https://linuxplumbersconf.org/event/7/contributions/790/

Regards,
Claude

[PATCHv4 net-next] dropwatch: Support monitoring of dropped frames

2020-10-22 Thread izabela . bakollari

From: Izabela Bakollari 

Dropwatch is a utility that monitors dropped frames by having userspace
record them over the dropwatch protocol to a file. This change allows
live monitoring of dropped frames using tools like tcpdump.

With this feature, dropwatch gains two additional commands (start and
stop interface) that allow a net_device to be assigned to the dropwatch
protocol. Once an interface is assigned, dropwatch clones dropped frames
and receives them on the assigned interface, allowing tools like tcpdump
to monitor them.

To use this feature, create a dummy Ethernet interface (ip link add dev
dummy0 type dummy), assign it to the dropwatch kernel subsystem using
these new commands, and then monitor dropped frames in real time by
running tcpdump -i dummy0.

Signed-off-by: Izabela Bakollari 
---
 include/uapi/linux/net_dropmon.h |   3 +
 net/core/drop_monitor.c  | 120 +++
 2 files changed, 123 insertions(+)

diff --git a/include/uapi/linux/net_dropmon.h b/include/uapi/linux/net_dropmon.h
index 67e31f329190..e8e861e03a8a 100644
--- a/include/uapi/linux/net_dropmon.h
+++ b/include/uapi/linux/net_dropmon.h
@@ -58,6 +58,8 @@ enum {
NET_DM_CMD_CONFIG_NEW,
NET_DM_CMD_STATS_GET,
NET_DM_CMD_STATS_NEW,
+   NET_DM_CMD_START_IFC,
+   NET_DM_CMD_STOP_IFC,
_NET_DM_CMD_MAX,
 };
 
@@ -93,6 +95,7 @@ enum net_dm_attr {
NET_DM_ATTR_SW_DROPS,   /* flag */
NET_DM_ATTR_HW_DROPS,   /* flag */
NET_DM_ATTR_FLOW_ACTION_COOKIE, /* binary */
+   NET_DM_ATTR_IFNAME, /* string */
 
__NET_DM_ATTR_MAX,
NET_DM_ATTR_MAX = __NET_DM_ATTR_MAX - 1
diff --git a/net/core/drop_monitor.c b/net/core/drop_monitor.c
index 8e33cec9fc4e..dea85291808b 100644
--- a/net/core/drop_monitor.c
+++ b/net/core/drop_monitor.c
@@ -30,6 +30,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -46,6 +47,7 @@
  */
 static int trace_state = TRACE_OFF;
 static bool monitor_hw;
+struct net_device *interface;
 
 /* net_dm_mutex
  *
@@ -54,6 +56,8 @@ static bool monitor_hw;
  */
 static DEFINE_MUTEX(net_dm_mutex);
 
+static DEFINE_SPINLOCK(interface_lock);
+
 struct net_dm_stats {
u64 dropped;
struct u64_stats_sync syncp;
@@ -217,6 +221,7 @@ static void trace_drop_common(struct sk_buff *skb, void *location)
struct nlattr *nla;
int i;
struct sk_buff *dskb;
+   struct sk_buff *nskb = NULL;
struct per_cpu_dm_data *data;
unsigned long flags;
 
@@ -255,6 +260,20 @@ static void trace_drop_common(struct sk_buff *skb, void *location)
 
 out:
spin_unlock_irqrestore(&data->lock, flags);
+   spin_lock_irqsave(&interface_lock, flags);
+   if (interface && interface != skb->dev) {
+   nskb = skb_clone(skb, GFP_ATOMIC);
+   if (!nskb)
+   goto free;
+   nskb->dev = interface;
+   }
+   spin_unlock_irqrestore(&interface_lock, flags);
+   if (nskb)
+   netif_receive_skb(nskb);
+
+free:
+   spin_unlock_irqrestore(&interface_lock, flags);
+   return;
 }
 
 static void trace_kfree_skb_hit(void *ignore, struct sk_buff *skb, void *location)
@@ -1315,6 +1334,89 @@ static int net_dm_cmd_trace(struct sk_buff *skb,
return -EOPNOTSUPP;
 }
 
+static bool is_dummy_dev(struct net_device *dev)
+{
+   struct ethtool_drvinfo drvinfo;
+
+   if (dev->ethtool_ops && dev->ethtool_ops->get_drvinfo) {
+   memset(&drvinfo, 0, sizeof(drvinfo));
+   dev->ethtool_ops->get_drvinfo(dev, &drvinfo);
+
+   if (strcmp(drvinfo.driver, "dummy"))
+   return false;
+   return true;
+   }
+   return false;
+}
+
+static int net_dm_interface_start(struct net *net, const char *ifname)
+{
+   struct net_device *dev = dev_get_by_name(net, ifname);
+   unsigned long flags;
+   int rc = -EBUSY;
+
+   if (!dev)
+   return -ENODEV;
+
+   if (!is_dummy_dev(dev)) {
+   rc = -EOPNOTSUPP;
+   goto out;
+   }
+
+   spin_lock_irqsave(&interface_lock, flags);
+   if (!interface) {
+   interface = dev;
+   rc = 0;
+   }
+   spin_unlock_irqrestore(&interface_lock, flags);
+
+   goto out;
+
+out:
+   dev_put(dev);
+   return rc;
+}
+
+static int net_dm_interface_stop(struct net *net, const char *ifname)
+{
+   unsigned long flags;
+   int rc = -ENODEV;
+
+   spin_lock_irqsave(&interface_lock, flags);
+   if (interface && interface->name == ifname) {
+   dev_put(interface);
+   interface = NULL;
+   rc = 0;
+   }
+   spin_unlock_irqrestore(&interface_lock, flags);
+
+   return rc;
+}
+
+static int net_dm_cmd_ifc_trace(struct sk_buff *skb, struct genl_info *info)
+{
+   struct net *net = sock_net(skb->sk);
+   char ifname[IFNAMSIZ];
+
+   if (net_dm_is_monitoring())
+
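A reviewer's note on the trace_drop_common() hunk above: as written, interface_lock appears to be released twice on every path except skb_clone() failure, because execution falls through from the first spin_unlock_irqrestore() into the `free:` label. Separately, net_dm_interface_stop() compares `interface->name == ifname` by pointer rather than with strcmp(). The userspace sketch below is not the patch author's code. It shows one shape that releases the lock exactly once and compares names by content. struct model_dev, struct model_frame, model_lock()/model_unlock(), forward_dropped_frame() and interface_name_matches() are hypothetical stand-ins. The fake lock simply asserts on a double lock or unlock, and device reference counting (dev_hold()/dev_put()) is not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for struct net_device and struct sk_buff. */
struct model_dev   { char name[16]; };
struct model_frame { const struct model_dev *dev; };

/* Fake lock standing in for interface_lock: trips on double lock/unlock. */
static bool interface_lock_held;
static void model_lock(void)   { assert(!interface_lock_held); interface_lock_held = true; }
static void model_unlock(void) { assert(interface_lock_held);  interface_lock_held = false; }

static const struct model_dev *interface;  /* monitoring device, or NULL */

/*
 * Decide under the lock whether to forward a copy of a dropped frame,
 * then release the lock exactly once on every path (including clone
 * failure). clone_ok simulates skb_clone(..., GFP_ATOMIC) succeeding.
 */
static bool forward_dropped_frame(const struct model_frame *frame,
                                  bool clone_ok, struct model_frame *copy)
{
	bool have_copy = false;

	model_lock();
	if (interface && interface != frame->dev && clone_ok) {
		copy->dev = interface;   /* like nskb->dev = interface */
		have_copy = true;
	}
	model_unlock();                  /* the only unlock */

	/* Delivery (netif_receive_skb() in the patch) would happen here. */
	return have_copy;
}

/* Match the monitoring device by name contents, not by pointer. */
static bool interface_name_matches(const struct model_dev *dev,
                                   const char *ifname)
{
	return dev != NULL && strcmp(dev->name, ifname) == 0;
}
```

Keeping delivery outside the lock mirrors the patch's intent of not calling into the receive path while holding a spinlock.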

Re: Question on io-wq

2020-10-22 Thread Zhang, Qiang




From: Zhang, Qiang 
Sent: October 23, 2020 11:55
To: Jens Axboe
Cc: v...@zeniv.linux.org.uk; io-ur...@vger.kernel.org; linux-kernel@vger.kernel.org; linux-fsde...@vger.kernel.org
Subject: Re: Question on io-wq




From: Jens Axboe 
Sent: October 22, 2020 22:08
To: Zhang, Qiang
Cc: v...@zeniv.linux.org.uk; io-ur...@vger.kernel.org; linux-kernel@vger.kernel.org; linux-fsde...@vger.kernel.org
Subject: Re: Question on io-wq

On 10/22/20 3:02 AM, Zhang,Qiang wrote:
>
> Hi Jens Axboe
>
> There are some problem in 'io_wqe_worker' thread, when the
> 'io_wqe_worker' be create and  Setting the affinity of CPUs in NUMA
> nodes, due to CPU hotplug, When the last CPU going down, the
> 'io_wqe_worker' thread will run anywhere. when the CPU in the node goes
> online again, we should restore their cpu bindings?

>Something like the below should help in ensuring affinities are
>always correct - trigger an affinity set for an online CPU event. We
>should not need to do it for offlining. Can you test it?


>diff --git a/fs/io-wq.c b/fs/io-wq.c
>index 4012ff541b7b..3bf029d1170e 100644
>--- a/fs/io-wq.c
>+++ b/fs/io-wq.c
>@@ -19,6 +19,7 @@
 >#include 
 >#include 
 >#include 
>+#include 

 >#include "io-wq.h"
>
>@@ -123,9 +124,13 @@ struct io_wq {
 >   refcount_t refs;
  >  struct completion done;
>
>+   struct hlist_node cpuhp_node;
>+
 >   refcount_t use_refs;
 >};
>
>+static enum cpuhp_state io_wq_online;
>+
 >static bool io_worker_get(struct io_worker *worker)
 >{
   > return refcount_inc_not_zero(&worker->ref);
>@@ -1096,6 +1101,13 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
 >   return ERR_PTR(-ENOMEM);
  >  }
>
>+   ret = cpuhp_state_add_instance_nocalls(io_wq_online, &wq->cpuhp_node);
>+   if (ret) {
>+   kfree(wq->wqes);
>+   kfree(wq);
>+   return ERR_PTR(ret);
>+   }
>+
>wq->free_work = data->free_work;
>wq->do_work = data->do_work;
>
>@@ -1145,6 +1157,7 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
 >   ret = PTR_ERR(wq->manager);
 >   complete(&wq->done);
 >err:
>+   cpuhp_state_remove_instance_nocalls(io_wq_online, &wq->cpuhp_node);
  >  for_each_node(node)
 >   kfree(wq->wqes[node]);
 >   kfree(wq->wqes);
>@@ -1164,6 +1177,8 @@ static void __io_wq_destroy(struct io_wq *wq)
 >{
 >   int node;
>
>+   cpuhp_state_remove_instance_nocalls(io_wq_online, &wq->cpuhp_node);
>+
   > set_bit(IO_WQ_BIT_EXIT, &wq->state);
  >  if (wq->manager)
 >   kthread_stop(wq->manager);
>@@ -1191,3 +1206,40 @@ struct task_struct *io_wq_get_task(struct io_wq *wq)
 >{
 >  return wq->manager;
 >}
>+
>+static bool io_wq_worker_affinity(struct io_worker *worker, void *data)
>+{
>+   struct task_struct *task = worker->task;
>+   unsigned long flags;
>+
   struct rq_flags rf;
   struct rq *rq;
   rq = task_rq_lock(task, &rf);

---   raw_spin_lock_irqsave(&task->pi_lock, flags);
>+   do_set_cpus_allowed(task, cpumask_of_node(worker->wqe->node));
>+   task->flags |= PF_NO_SETAFFINITY;
---  raw_spin_unlock_irqrestore(&task->pi_lock, flags);

  task_rq_unlock(rq, task, &rf);

>+   return false;
>+}
>+
>+static int io_wq_cpu_online(unsigned int cpu, struct hlist_node *node)
>+{
>+   struct io_wq *wq = hlist_entry_safe(node, struct io_wq, cpuhp_node);
>+   int i;
>+
>+   rcu_read_lock();
>+   for_each_node(i)
>+   io_wq_for_each_worker(wq->wqes[i], io_wq_worker_affinity, NULL);
>+   rcu_read_unlock();
>+   return 0;
>+}
>+
>+static __init int io_wq_init(void)
>+{
>+   int ret;
>+
>+   ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN, "io-wq/online",
>+   io_wq_cpu_online, NULL);
>+   if (ret < 0)
>+   return ret;
>+   io_wq_online = ret;
>+   return 0;
>+}
>+subsys_initcall(io_wq_init);
>
>--
>Jens Axboe
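The problem Zhang describes and the online-callback fix Jens proposes can be illustrated with a small userspace model that uses 64-bit masks in place of struct cpumask. When every CPU in a worker's node goes offline, the worker's affinity is widened. Nothing narrows it again unless a CPU-online callback re-applies the node mask. The types and helpers (cpumask_model, struct worker_model, model_cpu_offline(), model_cpu_online()) are hypothetical. The widening to all possible CPUs is only a rough approximation of the scheduler's fallback behavior as I understand it, not a copy of it.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t cpumask_model;   /* hypothetical: bit i set == CPU i */

struct worker_model {
	cpumask_model node_mask;  /* CPUs of the worker's NUMA node */
	cpumask_model allowed;    /* worker's current affinity */
};

/*
 * A CPU went offline. If none of the worker's allowed CPUs is still
 * online, widen its affinity to every possible CPU (rough stand-in for
 * the scheduler's fallback when a bound task loses its last CPU).
 */
static void model_cpu_offline(struct worker_model *w, cpumask_model online,
                              cpumask_model possible)
{
	assert((online & ~possible) == 0);   /* online must be a subset */
	if ((w->allowed & online) == 0)
		w->allowed = possible;
}

/*
 * A CPU came online: re-apply the node mask, as the proposed
 * io_wq_worker_affinity() callback does with cpumask_of_node().
 */
static void model_cpu_online(struct worker_model *w)
{
	w->allowed = w->node_mask;
}
```

Without the online callback, a worker whose node lost all its CPUs keeps running anywhere even after the node's CPUs return.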

Re: Question on io-wq

2020-10-22 Thread Zhang, Qiang




From: Jens Axboe 
Sent: October 22, 2020 22:08
To: Zhang, Qiang
Cc: v...@zeniv.linux.org.uk; io-ur...@vger.kernel.org; linux-kernel@vger.kernel.org; linux-fsde...@vger.kernel.org
Subject: Re: Question on io-wq

On 10/22/20 3:02 AM, Zhang,Qiang wrote:
>
> Hi Jens Axboe
>
> There are some problem in 'io_wqe_worker' thread, when the
> 'io_wqe_worker' be create and  Setting the affinity of CPUs in NUMA
> nodes, due to CPU hotplug, When the last CPU going down, the
> 'io_wqe_worker' thread will run anywhere. when the CPU in the node goes
> online again, we should restore their cpu bindings?

>Something like the below should help in ensuring affinities are
>always correct - trigger an affinity set for an online CPU event. We
>should not need to do it for offlining. Can you test it?


>diff --git a/fs/io-wq.c b/fs/io-wq.c
>index 4012ff541b7b..3bf029d1170e 100644
>--- a/fs/io-wq.c
>+++ b/fs/io-wq.c
>@@ -19,6 +19,7 @@
 >#include 
 >#include 
 >#include 
>+#include 

 >#include "io-wq.h"
>
>@@ -123,9 +124,13 @@ struct io_wq {
 >   refcount_t refs;
  >  struct completion done;
>
>+   struct hlist_node cpuhp_node;
>+
 >   refcount_t use_refs;
 >};
>
>+static enum cpuhp_state io_wq_online;
>+
 >static bool io_worker_get(struct io_worker *worker)
 >{
   > return refcount_inc_not_zero(&worker->ref);
>@@ -1096,6 +1101,13 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
 >   return ERR_PTR(-ENOMEM);
  >  }
>
>+   ret = cpuhp_state_add_instance_nocalls(io_wq_online, &wq->cpuhp_node);
>+   if (ret) {
>+   kfree(wq->wqes);
>+   kfree(wq);
>+   return ERR_PTR(ret);
>+   }
>+
>wq->free_work = data->free_work;
>wq->do_work = data->do_work;
>
>@@ -1145,6 +1157,7 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
 >   ret = PTR_ERR(wq->manager);
 >   complete(&wq->done);
 >err:
>+   cpuhp_state_remove_instance_nocalls(io_wq_online, &wq->cpuhp_node);
  >  for_each_node(node)
 >   kfree(wq->wqes[node]);
 >   kfree(wq->wqes);
>@@ -1164,6 +1177,8 @@ static void __io_wq_destroy(struct io_wq *wq)
 >{
 >   int node;
>
>+   cpuhp_state_remove_instance_nocalls(io_wq_online, &wq->cpuhp_node);
>+
   > set_bit(IO_WQ_BIT_EXIT, &wq->state);
  >  if (wq->manager)
 >   kthread_stop(wq->manager);
>@@ -1191,3 +1206,40 @@ struct task_struct *io_wq_get_task(struct io_wq *wq)
 >{
 >  return wq->manager;
 >}
>+
>+static bool io_wq_worker_affinity(struct io_worker *worker, void *data)
>+{
>+   struct task_struct *task = worker->task;
>+   unsigned long flags;
>+
   struct rq_flags rf;


>+   raw_spin_lock_irqsave(&task->pi_lock, flags);
>+   do_set_cpus_allowed(task, cpumask_of_node(worker->wqe->node));
>+   task->flags |= PF_NO_SETAFFINITY;
>+   raw_spin_unlock_irqrestore(&task->pi_lock, flags);


>+   return false;
>+}
>+
>+static int io_wq_cpu_online(unsigned int cpu, struct hlist_node *node)
>+{
>+   struct io_wq *wq = hlist_entry_safe(node, struct io_wq, cpuhp_node);
>+   int i;
>+
>+   rcu_read_lock();
>+   for_each_node(i)
>+   io_wq_for_each_worker(wq->wqes[i], io_wq_worker_affinity, NULL);
>+   rcu_read_unlock();
>+   return 0;
>+}
>+
>+static __init int io_wq_init(void)
>+{
>+   int ret;
>+
>+   ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN, "io-wq/online",
>+   io_wq_cpu_online, NULL);
>+   if (ret < 0)
>+   return ret;
>+   io_wq_online = ret;
>+   return 0;
>+}
>+subsys_initcall(io_wq_init);
>
>--
>Jens Axboe

Re: [PATCH v7 1/4] powerpc: Refactor kexec functions to move arch independent code to kernel

2020-10-22 Thread Thiago Jung Bauermann



Hello Lakshmi,

Lakshmi Ramasubramanian  writes:

> On 10/20/20 8:17 PM, Mimi Zohar wrote:
>> On Tue, 2020-10-20 at 19:25 -0700, Lakshmi Ramasubramanian wrote:
>>> On 10/20/20 1:00 PM, Mimi Zohar wrote:
 Hi Lakshmi,

 On Wed, 2020-09-30 at 13:59 -0700, Lakshmi Ramasubramanian wrote:
> The functions remove_ima_buffer() and delete_fdt_mem_rsv() that handle
> carrying forward the IMA measurement logs on kexec for powerpc do not
> have architecture specific code, but they are currently defined for
> powerpc only.
>
> remove_ima_buffer() and delete_fdt_mem_rsv() are used to remove
> the IMA log entry from the device tree and free the memory reserved
> for the log. These functions need to be defined even if the current
> kernel does not support carrying forward IMA log across kexec since
> the previous kernel could have supported that and therefore the current
> kernel needs to free the allocation.
>
> Rename remove_ima_buffer() to remove_ima_kexec_buffer().
> Define remove_ima_kexec_buffer() and delete_fdt_mem_rsv() in kernel.
> A later patch in this series will use these functions to free
> the allocation, if any, made by the previous kernel for ARM64.
>
> Define FDT_PROP_IMA_KEXEC_BUFFER for the chosen node, namely
> "linux,ima-kexec-buffer", that is added to the DTB to hold
> the address and the size of the memory reserved to carry
> the IMA measurement log.

> Co-developed-by: Prakhar Srivastava 
> Signed-off-by: Prakhar Srivastava 
> Signed-off-by: Lakshmi Ramasubramanian 
> Reported-by: kernel test robot  error: implicit declaration of function 'delete_fdt_mem_rsv' [-Werror,-Wimplicit-function-declaration]

 Much better!  This version limits unnecessarily changing the existing
 code to adding a couple of debugging statements, but that looks to be
 about it.
>>> Yes Mimi - that's correct.
>>>

 Based on Chester Lin's "ima_arch" support for arm64 discussion, the IMA
 generic EFI support will be defined in ima/ima-efi.c.  Similarly, I think it
 would make sense to put the generic device tree support in ima/ima_kexec_fdt.c
 or ima/ima_fdt.c, as opposed to kernel/.  (Refer to my comments on 2/4
 about the new file named ima_kexec_fdt.c.)
>>>
>>> The functions remove_ima_kexec_buffer() and delete_fdt_mem_rsv(), which
>>> are defined in kernel/ima_kexec.c and kernel/kexec_file_fdt.c
>>> respectively, are needed even when CONFIG_IMA is not defined. These
>>> functions need to be called by the current kernel to free the ima kexec
>>> buffer resources allocated by the previous kernel. This is the reason,
>>> these functions are defined under "kernel" instead of
>>> "security/integrity/ima".
>>>
>>> If there is a better location to move the above C files, please let me
>>> know. I'll move them.
>> Freeing the previous kernel measurement list is currently called from
>> ima_load_kexec_buffer(), only after the measurement list has been
>> restored.  The only other time the memory is freed is when the
>> allocated memory size isn't sufficient to hold the measurement list,
>> which could happen if there is a delay between loading and executing
>> the kexec.
>> 
>
> There are two "free" operations we need to perform with respect to ima buffer
> on kexec:
>
> 1, The ima_free_kexec_buffer() called from ima_load_kexec_buffer() - the one
> you have stated above.
>
> Here we remove the "ima buffer" node from the "OF" tree and free the memory
> pages that were allocated for the measurement list.
>
> This one is already present in ima and there's no change in that in my 
> patches.
>
> 2, The other one is remove_ima_kexec_buffer() called from setup_ima_buffer()
> defined in "arch/powerpc/kexec/ima.c"
>
>  This function removes the "ima buffer" node from the "FDT" and also frees the
> physical memory reserved for the "ima measurement list" by the previous 
> kernel.
>
>  This "free" operation needs to be performed even if the current kernel does
> not support IMA kexec since the previous kernel could have passed the IMA
> measurement list (in FDT and reserved physical memory).
>
> For this reason, remove_ima_kexec_buffer() cannot be defined in "ima" but some
> other place which will be built even if ima is not enabled. I chose to define
> this function in "kernel" since that is guaranteed to be always built.
>
> thanks,
>  -lakshmi

That is true. I believe a more fitting place for these functions is
drivers/of/fdt.c rather than these new files in kernel/. Both CONFIG_PPC
and CONFIG_ARM64 select CONFIG_OF and CONFIG_OF_FLATTREE (indirectly,
via CONFIG_OF_EARLY_FLATTREE) so they will both build that file.

-- 
Thiago Jung Bauermann
IBM Linux Technology Center

Re: [PATCH v2] mm,thp,shmem: limit shmem THP alloc gfp_mask

2020-10-22 Thread Rik van Riel

On Thu, 2020-10-22 at 19:54 -0700, Hugh Dickins wrote:
> On Thu, 22 Oct 2020, Rik van Riel wrote:
> 
> > The allocation flags of anonymous transparent huge pages can be
> > controlled through the files in /sys/kernel/mm/transparent_hugepage/defrag,
> > which can help the system from getting bogged down in the page reclaim and
> > compaction code when many THPs are getting allocated simultaneously.
> > 
> > However, the gfp_mask for shmem THP allocations were not limited by
> > those configuration settings, and some workloads ended up with all CPUs
> > stuck on the LRU lock in the page reclaim code, trying to allocate dozens
> > of THPs simultaneously.
> > 
> > This patch applies the same configurated limitation of THPs to
> > shmem hugepage allocations, to prevent that from happening.
> > 
> > This way a THP defrag setting of "never" or "defer+madvise" will
> > result in quick allocation failures without direct reclaim when no 2MB
> > free pages are available.
> > 
> > Signed-off-by: Rik van Riel 
> 
> NAK in its present untested form: see below.

Oops. That issue is easy to fix, but indeed let's figure
out what the desired behavior is.

> I'm open to change here, particularly to Yu Xu's point (in other
> mail) about direct reclaim - we avoid that here in Google too: though it's
> not so much to avoid the direct reclaim, as to avoid the latencies of
> direct compaction, which __GFP_DIRECT_RECLAIM allows as a side-effect.
> 
> > @@ -1887,7 +1888,8 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
> >   }
> >  
> >  alloc_huge:
> > - page = shmem_alloc_and_acct_page(gfp, inode, index, true);
> > + huge_gfp = alloc_hugepage_direct_gfpmask(vma);
> 
> Still looks nice: but what about the crash when vma is NULL?

That's a one line fix, but I suppose we should get the
discussion on what the code behavior should be out of
the way first :)

> Michal is right to remember pushback before, because tmpfs is a
> filesystem, and "huge=" is a mount option: in using a huge=always
> filesystem, the user has already declared a preference for huge pages.
> Whereas the original anon THP had to deduce that preference from sys
> tunables and vma madvice.

...

> But it's likely that they have accumulated some defrag wisdom, which
> tmpfs can take on board - but please accept that in using a huge mount,
> the preference for huge has already been expressed, so I don't expect
> anon THP alloc_hugepage_direct_gfpmask() choices will map one to one.

In my mind, the huge= mount options for tmpfs corresponded
to the "enabled" anon THP options, denoting a desired end
state, not necessarily how much we will stall allocations
to get there immediately.

The underlying allocation behavior has been changed repeatedly,
with changes to the direct reclaim code and the compaction
deferral code.

The shmem THP gfp_mask never tried really hard anyway,
with __GFP_NORETRY being the default, which matches what
is used for non-VM_HUGEPAGE anon VMAs.

Likewise, the direct reclaim done from the opportunistic
THP allocations done by the shmem code limited itself to
reclaiming 32 4kB pages per THP allocation.

In other words, mounting with huge=always has never behaved
the same as the more aggressive allocations done for
MADV_HUGEPAGE VMAs.

This patch would leave shmem THP allocations for non-MADV_HUGEPAGE
mapped files opportunistic like today, and make shmem THP
allocations for files mapped with MADV_HUGEPAGE more aggressive
than today.
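For readers following the defrag discussion, here is a simplified userspace model of the decision table that, as far as I can tell, alloc_hugepage_direct_gfpmask() implemented in kernels of this period. All names (enum defrag_mode, struct thp_policy, struct vma_model, thp_alloc_policy()) are hypothetical, and real gfp flags are collapsed into three booleans. Mapping a NULL vma to "not madvised" is only one possible answer to Hugh's crash question, not what the thread settled on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Values of /sys/kernel/mm/transparent_hugepage/defrag (model names). */
enum defrag_mode {
	DEFRAG_ALWAYS, DEFRAG_DEFER, DEFRAG_DEFER_MADVISE, DEFRAG_MADVISE, DEFRAG_NEVER
};

/* gfp bits reduced to the behaviours discussed in the thread. */
struct thp_policy {
	bool direct_reclaim;  /* may stall in direct reclaim/compaction */
	bool kswapd_reclaim;  /* only wakes kswapd/kcompactd */
	bool noretry;         /* __GFP_NORETRY: give up early */
};

struct vma_model { bool madvised; };  /* hypothetical: VM_HUGEPAGE set */

static struct thp_policy thp_alloc_policy(enum defrag_mode mode,
                                          const struct vma_model *vma)
{
	/* Assumption: a NULL vma (e.g. read()/write() on tmpfs) is not madvised. */
	bool madvised = vma != NULL && vma->madvised;
	struct thp_policy p = { false, false, false };

	switch (mode) {
	case DEFRAG_ALWAYS:          /* synchronous compaction */
		p.direct_reclaim = true;
		p.noretry = !madvised;
		break;
	case DEFRAG_DEFER:           /* kick kcompactd, fail quickly */
		p.kswapd_reclaim = true;
		break;
	case DEFRAG_DEFER_MADVISE:   /* direct if madvised, else kick kcompactd */
		p.direct_reclaim = madvised;
		p.kswapd_reclaim = !madvised;
		break;
	case DEFRAG_MADVISE:         /* direct only if madvised */
		p.direct_reclaim = madvised;
		break;
	case DEFRAG_NEVER:           /* no reclaim at all */
		break;
	default:
		assert(!"unknown defrag mode");
	}
	return p;
}
```

Under this model, "never" and non-madvised "defer+madvise" allocations never enter direct reclaim, which matches the quick-failure behavior the patch description aims for.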

However, I would like to know what people think the shmem
huge= mount options should do, and how things should behave
when memory gets low, before pushing in a patch just because
it makes the system run smoother "without changing current
behavior too much".

What do people want tmpfs THP allocations to do?

-- 
All Rights Reversed.


[PATCH] net: ucc_geth: Drop extraneous parentheses in comparison

2020-10-22 Thread Michael Ellerman

Clang warns about the extra parentheses in this comparison:

  drivers/net/ethernet/freescale/ucc_geth.c:1361:28:
  warning: equality comparison with extraneous parentheses
if ((ugeth->phy_interface == PHY_INTERFACE_MODE_SGMII))
 ~^~~

It seems clear the intent here is to do a comparison not an
assignment, so drop the extra parentheses to avoid any confusion.

Signed-off-by: Michael Ellerman 
---
 drivers/net/ethernet/freescale/ucc_geth.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/freescale/ucc_geth.c b/drivers/net/ethernet/freescale/ucc_geth.c
index db791f60b884..d8ad478a0a13 100644
--- a/drivers/net/ethernet/freescale/ucc_geth.c
+++ b/drivers/net/ethernet/freescale/ucc_geth.c
@@ -1358,7 +1358,7 @@ static int adjust_enet_interface(struct ucc_geth_private *ugeth)
(ugeth->phy_interface == PHY_INTERFACE_MODE_RTBI)) {
upsmr |= UCC_GETH_UPSMR_TBIM;
}
-   if ((ugeth->phy_interface == PHY_INTERFACE_MODE_SGMII))
+   if (ugeth->phy_interface == PHY_INTERFACE_MODE_SGMII)
upsmr |= UCC_GETH_UPSMR_SGMM;
 
out_be32(_regs->upsmr, upsmr);
-- 
2.25.1

[PATCH 3/3] md: pad writes to end of bitmap to physical blocks

2020-10-22 Thread Christopher Unkel

Writes of the last page of the bitmap are padded out to the next logical
block boundary.  However, they are not padded out to the next physical
block boundary, so the writes may be less than a physical block.  On a
"512e" disk (512-byte logical blocks, 4k physical blocks), if the last
page of the bitmap is 3584 bytes or less, writes of the last bitmap page
hit the 512-byte emulation.

Respect the physical block boundary as long as the resulting write
doesn't run into other data and is no longer than a page.  (If the
physical block size is larger than a page, no bitmap write will respect
the physical block boundaries.)
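The size calculation described above can be modeled in userspace as a sketch. The logical-block padding is mandatory. The physical-block padding is applied only if it grows the write, stays within one page, and passes the overlap check. last_page_write_size(), roundup_to() and MODEL_PAGE_SIZE are hypothetical names, 4 KiB pages are assumed, and the `fits` flag stands in for the sb_write_alignment_ok() check rather than reimplementing it.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_PAGE_SIZE 4096   /* assumption: 4 KiB pages */

/* Same rounding as the kernel's roundup(x, y) for positive ints. */
static int roundup_to(int x, int y)
{
	return ((x + y - 1) / y) * y;
}

/*
 * Size of the write for the last bitmap page. bytes is the total bitmap
 * size (store->bytes); fits stands in for sb_write_alignment_ok().
 */
static int last_page_write_size(int bytes, int logical_bs, int physical_bs,
                                bool fits)
{
	int last_page_size = bytes & (MODEL_PAGE_SIZE - 1);
	int size, pb_aligned_size;

	assert(logical_bs > 0 && physical_bs >= logical_bs);
	if (last_page_size == 0)
		last_page_size = MODEL_PAGE_SIZE;

	size = roundup_to(last_page_size, logical_bs);             /* mandatory */
	pb_aligned_size = roundup_to(last_page_size, physical_bs); /* optional */
	if (pb_aligned_size > size && pb_aligned_size <= MODEL_PAGE_SIZE && fits)
		size = pb_aligned_size;
	return size;
}
```

For example, a 3584-byte last page on a 512e disk becomes a full 4096-byte write when the overlap check passes, and stays at 3584 bytes when it does not.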

Signed-off-by: Christopher Unkel 
---
 drivers/md/md-bitmap.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
index 600b89d5a3ad..21af5f94d495 100644
--- a/drivers/md/md-bitmap.c
+++ b/drivers/md/md-bitmap.c
@@ -264,10 +264,18 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
 
if (page->index == store->file_pages-1) {
int last_page_size = store->bytes & (PAGE_SIZE-1);
+   int pb_aligned_size;
if (last_page_size == 0)
last_page_size = PAGE_SIZE;
size = roundup(last_page_size,
   bdev_logical_block_size(bdev));
+   pb_aligned_size = roundup(last_page_size,
+ bdev_physical_block_size(bdev));
+   if (pb_aligned_size > size
+   && pb_aligned_size <= PAGE_SIZE
+   && sb_write_alignment_ok(mddev, rdev, page, offset,
+pb_aligned_size))
+   size = pb_aligned_size;
}
/* Just make sure we aren't corrupting data or
 * metadata
-- 
2.17.1

[PATCH 2/3] md: factor sb write alignment check into function

2020-10-22 Thread Christopher Unkel

Refactor in preparation for a second use of the logic.

Signed-off-by: Christopher Unkel 
---
 drivers/md/md-bitmap.c | 72 +++---
 1 file changed, 40 insertions(+), 32 deletions(-)

diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
index 200c5d0f08bf..600b89d5a3ad 100644
--- a/drivers/md/md-bitmap.c
+++ b/drivers/md/md-bitmap.c
@@ -209,6 +209,44 @@ static struct md_rdev *next_active_rdev(struct md_rdev *rdev, struct mddev *mdde
return NULL;
 }
 
+static int sb_write_alignment_ok(struct mddev *mddev, struct md_rdev *rdev,
+struct page *page, int offset, int size)
+{
+   if (mddev->external) {
+   /* Bitmap could be anywhere. */
+   if (rdev->sb_start + offset + (page->index
+  * (PAGE_SIZE/512))
+   > rdev->data_offset
+   &&
+   rdev->sb_start + offset
+   < (rdev->data_offset + mddev->dev_sectors
++ (PAGE_SIZE/512)))
+   return 0;
+   } else if (offset < 0) {
+   /* DATA  BITMAP METADATA  */
+   if (offset
+   + (long)(page->index * (PAGE_SIZE/512))
+   + size/512 > 0)
+   /* bitmap runs in to metadata */
+   return 0;
+   if (rdev->data_offset + mddev->dev_sectors
+   > rdev->sb_start + offset)
+   /* data runs in to bitmap */
+   return 0;
+   } else if (rdev->sb_start < rdev->data_offset) {
+   /* METADATA BITMAP DATA */
+   if (rdev->sb_start
+   + offset
+   + page->index*(PAGE_SIZE/512) + size/512
+   > rdev->data_offset)
+   /* bitmap runs in to data */
+   return 0;
+   } else {
+   /* DATA METADATA BITMAP - no problems */
+   }
+   return 1;
+}
+
 static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
 {
struct md_rdev *rdev;
@@ -234,38 +272,8 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
/* Just make sure we aren't corrupting data or
 * metadata
 */
-   if (mddev->external) {
-   /* Bitmap could be anywhere. */
-   if (rdev->sb_start + offset + (page->index
-  * (PAGE_SIZE/512))
-   > rdev->data_offset
-   &&
-   rdev->sb_start + offset
-   < (rdev->data_offset + mddev->dev_sectors
-+ (PAGE_SIZE/512)))
-   goto bad_alignment;
-   } else if (offset < 0) {
-   /* DATA  BITMAP METADATA  */
-   if (offset
-   + (long)(page->index * (PAGE_SIZE/512))
-   + size/512 > 0)
-   /* bitmap runs in to metadata */
-   goto bad_alignment;
-   if (rdev->data_offset + mddev->dev_sectors
-   > rdev->sb_start + offset)
-   /* data runs in to bitmap */
-   goto bad_alignment;
-   } else if (rdev->sb_start < rdev->data_offset) {
-   /* METADATA BITMAP DATA */
-   if (rdev->sb_start
-   + offset
-   + page->index*(PAGE_SIZE/512) + size/512
-   > rdev->data_offset)
-   /* bitmap runs in to data */
-   goto bad_alignment;
-   } else {
-   /* DATA METADATA BITMAP - no problems */
-   }
+   if (!sb_write_alignment_ok(mddev, rdev, page, offset, size))
+   goto bad_alignment;
md_super_write(mddev, rdev,
   rdev->sb_start + offset
   + page->index * (PAGE_SIZE/512),
-- 
2.17.1

[PATCH 0/3] mdraid sb and bitmap write alignment on 512e drives

2020-10-22 Thread Christopher Unkel

Hello all,

While investigating some performance issues on mdraid 10 volumes
formed with "512e" disks (4k native/physical sector size but with 512
byte sector emulation), I've found two cases where mdraid will
needlessly issue writes that start on a 4k boundary but are
shorter than 4k:

1. writes of the raid superblock; and
2. writes of the last page of the write-intent bitmap.

The following is an excerpt of a blocktrace of one of the component
members of a mdraid 10 volume during a 4k write near the end of the
array:

  8,32  11    2  0.01687       711  D  WS 2064 + 8 [kworker/11:1H]
* 8,32  11    5  0.001454119   711  D  WS 2056 + 1 [kworker/11:1H]
* 8,32  11    8  0.002847204   711  D  WS 2080 + 7 [kworker/11:1H]
  8,32  11   11  0.003700545  3094  D  WS 11721043920 + 8 [md127_raid1]
  8,32  11   14  0.308785692   711  D  WS 2064 + 8 [kworker/11:1H]
* 8,32  11   17  0.310201697   711  D  WS 2056 + 1 [kworker/11:1H]
  8,32  11   20  5.500799245   711  D  WS 2064 + 8 [kworker/11:1H]
* 8,32  11   23 15.740923558   711  D  WS 2080 + 7 [kworker/11:1H]

Note the starred transactions, which each start on a 4k boundary but
are less than 4k in length, and so will use the 512-byte emulation.
Sector 2056 holds the superblock, and is written as a single 512-byte
write.  Sector 2086 holds the bitmap bit relevant to the written
sector.  When it is written, the active bits of the last page of the
bitmap are written, starting at sector 2080 and padded out to the end of
the 512-byte logical sector as required.  This results in a 3.5 KiB
(7-sector) write, again using the 512-byte emulation.
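The starred requests can be picked out mechanically. Here is a small sketch, assuming blktrace's usual convention that offsets and lengths are reported in 512-byte sectors. The name is_short_aligned_write() is hypothetical. It flags a request that starts on a physical-block boundary but covers less than one physical block, which is the case the cover letter describes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SECTOR_BYTES 512u  /* blktrace sector unit */

/*
 * True when a request starts on a physical-block boundary but is shorter
 * than one physical block, i.e. a 512e drive must emulate it.
 */
static bool is_short_aligned_write(uint64_t sector, uint32_t nr_sectors,
                                   uint32_t phys_block_bytes)
{
	uint32_t sectors_per_pblock = phys_block_bytes / SECTOR_BYTES;

	assert(phys_block_bytes % SECTOR_BYTES == 0);
	return sector % sectors_per_pblock == 0 &&
	       nr_sectors < sectors_per_pblock;
}
```

Applied to the trace above with a 4096-byte physical block, "2056 + 1" and "2080 + 7" are flagged, while "2064 + 8" is not.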

Note that in some arrays the last page of the bitmap may be
sufficiently full that the bitmap write is not affected by the
issue.

As there can be a substantial penalty to using the 512-byte sector
emulation (turning writes into read-modify-writes if the relevant
sector is not in the drive's cache), I believe it makes sense to pad
these writes out to a 4k boundary.  The writes are already padded out
for "4k native" drives, where the short access is illegal.

The following patch set changes the superblock and bitmap writes to
respect the physical block size (e.g. 4k for today's 512e drives) when
possible.  In each case there is already logic for padding out to the
underlying logical sector size.  I reuse or repeat the logic for
padding out to the physical sector size, but treat the padding out as
optional rather than mandatory.
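A minimal userspace sketch of that policy: round up to the logical block size unconditionally, then up to the physical block size only if the result still fits in the available space. pad_write_len() and its room argument are hypothetical; the patches express "room" differently for the superblock (data_offset checks) and for the bitmap (the overlap checks in write_sb_page). Block sizes are assumed to be powers of two.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not from the patch set.
 * lbs/pbs: logical/physical block size in bytes (powers of two).
 * room: bytes that may be written without overlapping other data.
 */
static size_t pad_write_len(size_t len, size_t lbs, size_t pbs, size_t room)
{
	size_t lpad = (len + lbs - 1) & ~(lbs - 1);	/* mandatory */
	size_t ppad = (lpad + pbs - 1) & ~(pbs - 1);	/* optional */

	return ppad <= room ? ppad : lpad;
}
```

With the bitmap example from the trace, a 3584-byte (7-sector) write becomes 4096 bytes when there is room for it and stays at 3584 bytes otherwise.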

The corresponding block trace with these patches is:

   8,32   12 0.03410   694  D  WS 2064 + 8 [kworker/1:1H]
   8,32   15 0.001368788   694  D  WS 2056 + 8 [kworker/1:1H]
   8,32   18 0.002727981   694  D  WS 2080 + 8 [kworker/1:1H]
   8,32   1   11 0.003533831  3063  D  WS 11721043920 + 8 [md127_raid1]
   8,32   1   14 0.253952321   694  D  WS 2064 + 8 [kworker/1:1H]
   8,32   1   17 0.255354215   694  D  WS 2056 + 8 [kworker/1:1H]
   8,32   1   20 5.337938486   694  D  WS 2064 + 8 [kworker/1:1H]
   8,32   1   2315.577963062   694  D  WS 2080 + 8 [kworker/1:1H]

I do notice that the code for bitmap writes has a more sophisticated
and thorough check for overlap than the code for superblock writes.
(Compare write_sb_page in md-bitmap.c vs. super_1_load in md.c.)  As far
as I know, the starts of the various structures have always been 4k
aligned anyway, so it is always safe to pad the superblock write out to
4k (as occurs on 4k native drives), but not necessarily further.

Feedback appreciated.

  --Chris


Christopher Unkel (3):
  md: align superblock writes to physical blocks
  md: factor sb write alignment check into function
  md: pad writes to end of bitmap to physical blocks

 drivers/md/md-bitmap.c | 80 +-
 drivers/md/md.c| 15 
 2 files changed, 63 insertions(+), 32 deletions(-)

-- 
2.17.1

[PATCH 1/3] md: align superblock writes to physical blocks

2020-10-22 Thread Christopher Unkel

Writes of the md superblock are aligned to the logical blocks of the
containing device, but no attempt is made to align them to physical
block boundaries.  This means that on a "512e" device (4k physical, 512
logical) every superblock update hits the 512-byte emulation and the
possible associated performance penalty.

Respect the physical block alignment when possible.

Signed-off-by: Christopher Unkel 
---
 drivers/md/md.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 98bac4f304ae..2b42850acfb3 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1732,6 +1732,21 @@ static int super_1_load(struct md_rdev *rdev, struct md_rdev *refdev, int minor_
&& rdev->new_data_offset < sb_start + (rdev->sb_size/512))
return -EINVAL;
 
+   /* Respect physical block size if feasible. */
+   bmask = queue_physical_block_size(rdev->bdev->bd_disk->queue)-1;
+   if (!((rdev->sb_start * 512) & bmask) && (rdev->sb_size & bmask)) {
+   int candidate_size = (rdev->sb_size | bmask) + 1;
+
+   if (minor_version) {
+   int sectors = candidate_size / 512;
+
+   if (rdev->data_offset >= sb_start + sectors
+   && rdev->new_data_offset >= sb_start + sectors)
+   rdev->sb_size = candidate_size;
+   } else if (bmask <= 4095)
+   rdev->sb_size = candidate_size;
+   }
+
if (sb->level == cpu_to_le32(LEVEL_MULTIPATH))
rdev->desc_nr = -1;
else
-- 
2.17.1
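To make the new check in super_1_load() easier to follow, here is a userspace model of it. struct rdev_model and respect_physical_block() are hypothetical stand-ins for the md_rdev fields and the hunk above. Units follow md: offsets are in 512-byte sectors, sb_size is in bytes, and the physical block size is assumed to be a power of two.

```c
#include <assert.h>

/* Hypothetical stand-in for the md_rdev fields the hunk touches. */
struct rdev_model {
	unsigned long long sb_start;		/* sectors */
	unsigned long long data_offset;		/* sectors */
	unsigned long long new_data_offset;	/* sectors */
	int sb_size;				/* bytes */
};

static void respect_physical_block(struct rdev_model *rdev, int pbs,
				   int minor_version)
{
	int bmask = pbs - 1;

	/* Pad only if the sb starts on a physical block boundary and its
	 * size is not already a multiple of the physical block size. */
	if (!((rdev->sb_start * 512) & bmask) && (rdev->sb_size & bmask)) {
		int candidate_size = (rdev->sb_size | bmask) + 1;

		if (minor_version) {
			int sectors = candidate_size / 512;

			/* v1.1/v1.2: the padded sb must not reach the data. */
			if (rdev->data_offset >= rdev->sb_start + sectors &&
			    rdev->new_data_offset >= rdev->sb_start + sectors)
				rdev->sb_size = candidate_size;
		} else if (bmask <= 4095) {
			/* v1.0 (sb near the end): pad at most to 4 KiB. */
			rdev->sb_size = candidate_size;
		}
	}
}
```

For example, a 1.2 superblock at sector 8 with sb_size 512 on a disk with 4096-byte physical blocks grows to 4096 bytes when data_offset is 2048, but stays at 512 with a hypothetical data_offset of 12, because the padded superblock would then overlap the data.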

[PATCH 3/3] net: better handling for network busy poll

2020-10-22 Thread Josh Don

Use the new functions prepare_to_busy_poll() and friends in
napi_busy_loop(). The busy polling cpu will then be considered an idle
target during wake up balancing.

Suggested-by: Xi Wang 
Signed-off-by: Josh Don 
Signed-off-by: Xi Wang 
---
 net/core/dev.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 266073e300b5..4fb4ae4b27fc 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6476,7 +6476,7 @@ void napi_busy_loop(unsigned int napi_id,
if (!napi)
goto out;
 
-   preempt_disable();
+   prepare_to_busy_poll(); /* disables preemption */
for (;;) {
int work = 0;
 
@@ -6509,10 +6509,10 @@ void napi_busy_loop(unsigned int napi_id,
if (!loop_end || loop_end(loop_end_arg, start_time))
break;
 
-   if (unlikely(need_resched())) {
+   if (unlikely(!continue_busy_poll())) {
if (napi_poll)
busy_poll_stop(napi, have_poll_lock);
-   preempt_enable();
+   end_busy_poll(true);
rcu_read_unlock();
cond_resched();
if (loop_end(loop_end_arg, start_time))
@@ -6523,7 +6523,7 @@ void napi_busy_loop(unsigned int napi_id,
}
if (napi_poll)
busy_poll_stop(napi, have_poll_lock);
-   preempt_enable();
+   end_busy_poll(true);
 out:
rcu_read_unlock();
 }
-- 
2.29.0.rc1.297.gfa9743e501-goog
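The calling convention napi_busy_loop() follows after this series (prepare once, check continue_busy_poll() on every iteration, and always finish with end_busy_poll()) can be modelled in userspace. The sim_* functions and counters are hypothetical stand-ins for preempt_count, rq->busy_polling and the run queue. The model's end function takes no argument because the body of end_busy_poll(allow_resched) is cut off in the sched patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for scheduler state. */
static int sim_preempt_count;
static int sim_busy_polling;
static int sim_nr_running = 1;	/* only the polling task is runnable */
static bool sim_need_resched;

static void sim_prepare_to_busy_poll(void)
{
	sim_preempt_count++;	/* models preempt_disable() */
	sim_busy_polling++;	/* models rq->busy_polling++ */
}

/* Same two conditions as continue_busy_poll() in the sched patch. */
static int sim_continue_busy_poll(void)
{
	if (sim_nr_running != 1)
		return 0;
	if (sim_need_resched)
		return 0;
	return 1;
}

static void sim_end_busy_poll(void)
{
	sim_busy_polling--;
	sim_preempt_count--;	/* models preempt_enable() */
}

/*
 * Poll until "work" appears at iteration work_at, another task becomes
 * runnable at iteration wakeup_at, or the budget runs out.
 * Returns the iteration where work was found, or -1.
 */
static int sim_busy_loop(int work_at, int wakeup_at, int budget)
{
	int i, found = -1;

	sim_prepare_to_busy_poll();
	for (i = 0; i < budget; i++) {
		if (i == wakeup_at)
			sim_nr_running = 2;
		if (i == work_at) {
			found = i;
			break;
		}
		if (!sim_continue_busy_poll())
			break;
	}
	sim_end_busy_poll();	/* every exit path goes through here */
	sim_nr_running = 1;
	return found;
}
```

Whichever way the loop exits, the preemption and busy-polling counts return to zero, which is the pairing the REQUIRES comment in the sched patch asks for.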

[PATCH 2/3] kvm: better handling for kvm halt polling

2020-10-22 Thread Josh Don

Use the new functions prepare_to_busy_poll() and friends in
kvm_vcpu_block(). The busy polling cpu will then be considered an
idle target during wake up balancing.

cpu_relax() is also added to the polling loop to improve the performance
of other hardware threads sharing the busy polling core.

Suggested-by: Xi Wang 
Signed-off-by: Josh Don 
Signed-off-by: Xi Wang 
---
 virt/kvm/kvm_main.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index cf88233b819a..8f818f0fc979 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2772,7 +2772,9 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
ktime_t stop = ktime_add_ns(ktime_get(), vcpu->halt_poll_ns);
 
++vcpu->stat.halt_attempted_poll;
+   prepare_to_busy_poll(); /* also disables preemption */
do {
+   cpu_relax();
/*
 * This sets KVM_REQ_UNHALT if an interrupt
 * arrives.
@@ -2781,10 +2783,12 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
++vcpu->stat.halt_successful_poll;
if (!vcpu_valid_wakeup(vcpu))
++vcpu->stat.halt_poll_invalid;
+   end_busy_poll(false);
goto out;
}
poll_end = cur = ktime_get();
-   } while (single_task_running() && ktime_before(cur, stop));
+   } while (continue_busy_poll() && ktime_before(cur, stop));
+   end_busy_poll(false);
}
 
prepare_to_rcuwait(&vcpu->wait);
-- 
2.29.0.rc1.297.gfa9743e501-goog

[PATCH 1/3] sched: better handling for busy polling loops

2020-10-22 Thread Josh Don

Busy polling loops in the kernel such as network socket poll and kvm
halt polling have performance problems related to process scheduler load
accounting.

Both of the busy polling examples are opportunistic - they relinquish
the cpu if another thread is ready to run. This design, however, doesn't
extend to multiprocessor load balancing very well. The scheduler still
sees the busy polling cpu as 100% busy and will be less likely to put
another thread on that cpu. In other words, if all cores are 100%
utilized and some of them are running real workloads and some others are
running busy polling loops, newly woken up threads will not prefer the
busy polling cpus. System wide throughput and latency may suffer.

This change allows the scheduler to detect busy polling cpus so that they
are considered more often as targets during wake up balancing.

This change also disables preemption for the duration of the busy
polling loop. This is important, as it ensures that if a polling thread
decides to end its poll to relinquish cpu to another thread, the polling
thread will actually exit the busy loop and potentially block. When it
later becomes runnable, it will have the opportunity to find an idle cpu
via wakeup cpu selection.

Suggested-by: Xi Wang 
Signed-off-by: Josh Don 
Signed-off-by: Xi Wang 
---
 include/linux/sched.h |  5 +++
 kernel/sched/core.c   | 94 +++
 kernel/sched/fair.c   | 25 
 kernel/sched/sched.h  |  2 +
 4 files changed, 119 insertions(+), 7 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index afe01e232935..80ef477e5a87 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1651,6 +1651,7 @@ extern int can_nice(const struct task_struct *p, const int nice);
 extern int task_curr(const struct task_struct *p);
 extern int idle_cpu(int cpu);
 extern int available_idle_cpu(int cpu);
+extern int polling_cpu(int cpu);
 extern int sched_setscheduler(struct task_struct *, int, const struct sched_param *);
 extern int sched_setscheduler_nocheck(struct task_struct *, int, const struct sched_param *);
 extern void sched_set_fifo(struct task_struct *p);
@@ -2048,4 +2049,8 @@ int sched_trace_rq_nr_running(struct rq *rq);
 
 const struct cpumask *sched_trace_rd_span(struct root_domain *rd);
 
+extern void prepare_to_busy_poll(void);
+extern int continue_busy_poll(void);
+extern void end_busy_poll(bool allow_resched);
+
 #endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2d95dc3f4644..2783191d0bd4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5107,6 +5107,24 @@ int available_idle_cpu(int cpu)
return 1;
 }
 
+/**
+ * polling_cpu - is a given CPU currently running a thread in a busy polling
+ * loop that could be preempted if a new thread were to be scheduled?
+ * @cpu: the CPU in question.
+ *
+ * Return: 1 if the CPU is currently polling. 0 otherwise.
+ */
+int polling_cpu(int cpu)
+{
+#ifdef CONFIG_SMP
+   struct rq *rq = cpu_rq(cpu);
+
+   return unlikely(rq->busy_polling);
+#else
+   return 0;
+#endif
+}
+
 /**
  * idle_task - return the idle task for a given CPU.
  * @cpu: the processor in question.
@@ -7191,6 +7209,7 @@ void __init sched_init(void)
 
rq_csd_init(rq, &rq->nohz_csd, nohz_csd_func);
 #endif
+   rq->busy_polling = 0;
 #endif /* CONFIG_SMP */
hrtick_rq_init(rq);
atomic_set(&rq->nr_iowait, 0);
@@ -7417,6 +7436,81 @@ void ia64_set_curr_task(int cpu, struct task_struct *p)
 
 #endif
 
+/*
+ * Calling this function before entering a preemptible busy polling loop will
+ * help the scheduler make better load balancing decisions. Wake up balance
+ * will treat the polling cpu as idle.
+ *
+ * Preemption is disabled inside this function and re-enabled in
+ * end_busy_poll(), thus the polling loop must periodically check
+ * continue_busy_poll().
+ *
+ * REQUIRES: prepare_to_busy_poll(), continue_busy_poll(), and end_busy_poll()
+ * must be used together.
+ */
+void prepare_to_busy_poll(void)
+{
+   struct rq __maybe_unused *rq = this_rq();
+   unsigned long __maybe_unused flags;
+
+   /* Preemption will be reenabled by end_busy_poll() */
+   preempt_disable();
+
+#ifdef CONFIG_SMP
+   raw_spin_lock_irqsave(&rq->lock, flags);
+   /* preemption disabled; only one thread can poll at a time */
+   WARN_ON_ONCE(rq->busy_polling);
+   rq->busy_polling++;
+   raw_spin_unlock_irqrestore(&rq->lock, flags);
+#endif
+}
+EXPORT_SYMBOL(prepare_to_busy_poll);
+
+int continue_busy_poll(void)
+{
+   if (!single_task_running())
+   return 0;
+
+   /* Important that we check this, since preemption is disabled */
+   if (need_resched())
+   return 0;
+
+   return 1;
+}
+EXPORT_SYMBOL(continue_busy_poll);
+
+/*
+ * Restore any state modified by prepare_to_busy_poll(), including re-enabling
+ * preemption.
+ *
+ * @allow_resched: If true, this potentially

Re: [PATCH] serial: pmac_zilog: don't init if zilog is not available

2020-10-22 Thread Finn Thain

On Thu, 22 Oct 2020, Geert Uytterhoeven wrote:

> 
> Thanks for your patch...
> 

You're welcome.

> I can't say I'm a fan of this...
> 

Sorry.

> 
> The real issue is this "extern struct platform_device scc_a_pdev, 
> scc_b_pdev", circumventing the driver framework.
> 
> Can we get rid of that?
> 

Is there a better alternative?

pmz_probe() is called by console_initcall(pmz_console_init) when 
CONFIG_SERIAL_PMACZILOG_CONSOLE=y because this has to happen earlier than 
the normal platform bus probing which takes place later as a typical 
module_initcall.

Re: [PATCH v2 2/6] crypto: lib/sha256 - Don't clear temporary variables

2020-10-22 Thread Arvind Sankar

On Wed, Oct 21, 2020 at 09:58:50PM -0700, Eric Biggers wrote:
> On Tue, Oct 20, 2020 at 04:39:53PM -0400, Arvind Sankar wrote:
> > The assignments to clear a through h and t1/t2 are optimized out by the
> > compiler because they are unused after the assignments.
> > 
> > These variables shouldn't be very sensitive: t1/t2 can be calculated
> > from a through h, so they don't reveal any additional information.
> > Knowing a through h is equivalent to knowing one 64-byte block's SHA256
> > hash (with non-standard initial value) which, assuming SHA256 is secure,
> > doesn't reveal any information about the input.
> > 
> > Signed-off-by: Arvind Sankar 
> 
> I don't entirely buy the second paragraph.  It could be the case that the input
> is less than or equal to one SHA-256 block (64 bytes), in which case leaking
> 'a' through 'h' would reveal the final SHA-256 hash if the input length is
> known.  And note that callers might consider either the input, the resulting
> hash, or both to be sensitive information -- it depends.

The "non-standard initial value" was just parenthetical -- my thinking
was that revealing the hash, whether the real SHA hash or an
intermediate one starting at some other initial value, shouldn't reveal
the input; not that you get any additional security from being an
intermediate block. But if the hash itself could be sensitive, yeah then
a-h are sensitive anyway.

> 
> > ---
> >  lib/crypto/sha256.c | 1 -
> >  1 file changed, 1 deletion(-)
> > 
> > diff --git a/lib/crypto/sha256.c b/lib/crypto/sha256.c
> > index d43bc39ab05e..099cd11f83c1 100644
> > --- a/lib/crypto/sha256.c
> > +++ b/lib/crypto/sha256.c
> > @@ -202,7 +202,6 @@ static void sha256_transform(u32 *state, const u8 *input)
> > state[4] += e; state[5] += f; state[6] += g; state[7] += h;
> >  
> > /* clear any sensitive info... */
> > -   a = b = c = d = e = f = g = h = t1 = t2 = 0;
> > memzero_explicit(W, 64 * sizeof(u32));
> >  }
> 
> Your change itself is fine, though.  As you mentioned, these assignments get
> optimized out, so they weren't accomplishing anything.
> 
> The fact is, there just isn't any way to guarantee in C code that all sensitive
> variables get cleared.
> 
> So we shouldn't (and generally don't) bother trying to clear individual u32's,
> ints, etc. like this, but rather only structs and arrays, as clearing those is
> more likely to work as intended.
> 
> - Eric

Ok, I'll just drop the second paragraph from the commit message then.
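For reference, the approach Eric describes (clear arrays with a barrier-backed memset, and don't bother with individual scalars) can be sketched in userspace. zero_explicit() and finish_transform() are hypothetical; zero_explicit() mirrors the kernel's memzero_explicit(), which is memset() followed by barrier_data(), and relies on GCC/Clang inline asm.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * memset() followed by an empty asm statement that takes the pointer as
 * an input and clobbers memory: the compiler must assume the zeroed bytes
 * may be read, so it cannot drop the memset() as a dead store.
 */
static void zero_explicit(void *s, size_t count)
{
	memset(s, 0, count);
	__asm__ __volatile__("" : : "r"(s) : "memory");
}

/* Hypothetical stand-in for the end of sha256_transform(). */
static void finish_transform(uint32_t W[64])
{
	/* Assignments like a = b = ... = 0 would be dead stores; clear the array. */
	zero_explicit(W, 64 * sizeof(uint32_t));
}
```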

Re: [PATCH v2 4/6] crypto: lib/sha256 - Unroll SHA256 loop 8 times instead of 64

2020-10-22 Thread Herbert Xu

On Thu, Oct 22, 2020 at 11:12:36PM -0400, Arvind Sankar wrote:
>
> I was aiming for 8 columns per line to match all the other groupings by
> eight. It does slightly exceed 100 columns but can this be an exception,
> or should I maybe make it 4 columns per line?

Please limit it to 4 columns.

Thanks,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH v3 7/9] KVM: VMX: Add guest physical address check in EPT violation and misconfig

2020-10-22 Thread Sean Christopherson

On Wed, Oct 14, 2020 at 04:44:57PM -0700, Jim Mattson wrote:
> On Fri, Oct 9, 2020 at 9:17 AM Jim Mattson  wrote:
> >
> > On Fri, Jul 10, 2020 at 8:48 AM Mohammed Gamal  wrote:
> > > @@ -5308,6 +5314,18 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
> > >PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK;
> > >
> > > vcpu->arch.exit_qualification = exit_qualification;
> > > +
> > > +   /*
> > > +    * Check that the GPA doesn't exceed physical memory limits, as that is
> > > +    * a guest page fault.  We have to emulate the instruction here, because
> > > +    * if the illegal address is that of a paging structure, then
> > > +    * EPT_VIOLATION_ACC_WRITE bit is set.  Alternatively, if supported we
> > > +    * would also use advanced VM-exit information for EPT violations to
> > > +    * reconstruct the page fault error code.
> > > +    */
> > > +   if (unlikely(kvm_mmu_is_illegal_gpa(vcpu, gpa)))
> > > +   return kvm_emulate_instruction(vcpu, 0);
> > > +
> > > +
> >
> > Is kvm's in-kernel emulator up to the task? What if the instruction in
> > question is AVX-512, or one of the myriad instructions that the
> > in-kernel emulator can't handle? Ice Lake must support the advanced
> > VM-exit information for EPT violations, so that would seem like a
> > better choice.
> >
> Anyone?

Using "advanced info" if it's supported seems like the way to go.  Outright
requiring it is probably overkill; if userspace wants to risk having to kill a
(likely broken) guest, so be it.

Re: [PATCH v2 4/6] crypto: lib/sha256 - Unroll SHA256 loop 8 times instead of 64

2020-10-22 Thread Arvind Sankar

On Wed, Oct 21, 2020 at 10:02:19PM -0700, Eric Biggers wrote:
> On Tue, Oct 20, 2020 at 04:39:55PM -0400, Arvind Sankar wrote:
> > This reduces code size substantially (on x86_64 with gcc-10 the size of
> > sha256_update() goes from 7593 bytes to 1952 bytes including the new
> > SHA256_K array), and on x86 is slightly faster than the full unroll
> > (tesed on Broadwell Xeon).
> 
> tesed => tested
> 
> > 
> > Signed-off-by: Arvind Sankar 
> > ---
> >  lib/crypto/sha256.c | 166 
> >  1 file changed, 30 insertions(+), 136 deletions(-)
> > 
> > diff --git a/lib/crypto/sha256.c b/lib/crypto/sha256.c
> > index c6bfeacc5b81..5efd390706c6 100644
> > --- a/lib/crypto/sha256.c
> > +++ b/lib/crypto/sha256.c
> > @@ -18,6 +18,17 @@
> >  #include 
> >  #include 
> >  
> > +static const u32 SHA256_K[] = {
> > +   0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
> > +   0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
> > +   0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
> > +   0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
> > +   0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
> > +   0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
> > +   0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
> > +   0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2,
> > +};
> 
> Limit this to 80 columns?

I was aiming for 8 columns per line to match all the other groupings by
eight. It does slightly exceed 100 columns but can this be an exception,
or should I maybe make it 4 columns per line?

> 
> Otherwise this looks good.
> 
> - Eric
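The technique in this patch's subject (an 8-round loop body executed 8 times instead of 64 fully unrolled rounds, rotating the roles of a..h between calls rather than shuffling data) can be sketched as a standalone single-block compression function. This is not the patch's code, whose loop body is not shown here; sha256_block_sketch() and the macro names are hypothetical, and only the SHA256_K constants match the table above.

```c
#include <assert.h>
#include <stdint.h>

static const uint32_t SHA256_K[64] = {
	0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
	0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
	0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
	0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
	0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
	0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
	0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
	0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2,
};

#define ROR32(x, n)	(((x) >> (n)) | ((x) << (32 - (n))))
#define CH(x, y, z)	((z) ^ ((x) & ((y) ^ (z))))
#define MAJ(x, y, z)	(((x) & (y)) | ((z) & ((x) | (y))))
#define BSIG0(x)	(ROR32(x, 2) ^ ROR32(x, 13) ^ ROR32(x, 22))
#define BSIG1(x)	(ROR32(x, 6) ^ ROR32(x, 11) ^ ROR32(x, 25))
#define SSIG0(x)	(ROR32(x, 7) ^ ROR32(x, 18) ^ ((x) >> 3))
#define SSIG1(x)	(ROR32(x, 17) ^ ROR32(x, 19) ^ ((x) >> 10))

/*
 * One round.  Only d and h are written; the caller passes the variables in
 * rotated order, so after 8 calls every variable is back in its own role.
 */
#define ROUND(a, b, c, d, e, f, g, h, n) do {				\
	uint32_t t1 = (h) + BSIG1(e) + CH(e, f, g) + SHA256_K[n] + W[n];\
	uint32_t t2 = BSIG0(a) + MAJ(a, b, c);				\
	(d) += t1;							\
	(h) = t1 + t2;							\
} while (0)

static void sha256_block_sketch(uint32_t state[8], const uint8_t in[64])
{
	uint32_t W[64];
	uint32_t a, b, c, d, e, f, g, h;
	int i;

	/* Message schedule: 16 big-endian words, expanded to 64. */
	for (i = 0; i < 16; i++)
		W[i] = (uint32_t)in[4 * i] << 24 | (uint32_t)in[4 * i + 1] << 16 |
		       (uint32_t)in[4 * i + 2] << 8 | (uint32_t)in[4 * i + 3];
	for (i = 16; i < 64; i++)
		W[i] = SSIG1(W[i - 2]) + W[i - 7] + SSIG0(W[i - 15]) + W[i - 16];

	a = state[0]; b = state[1]; c = state[2]; d = state[3];
	e = state[4]; f = state[5]; g = state[6]; h = state[7];

	/* 8 passes over an 8-round body instead of 64 unrolled rounds. */
	for (i = 0; i < 64; i += 8) {
		ROUND(a, b, c, d, e, f, g, h, i + 0);
		ROUND(h, a, b, c, d, e, f, g, i + 1);
		ROUND(g, h, a, b, c, d, e, f, i + 2);
		ROUND(f, g, h, a, b, c, d, e, i + 3);
		ROUND(e, f, g, h, a, b, c, d, i + 4);
		ROUND(d, e, f, g, h, a, b, c, i + 5);
		ROUND(c, d, e, f, g, h, a, b, i + 6);
		ROUND(b, c, d, e, f, g, h, a, i + 7);
	}

	state[0] += a; state[1] += b; state[2] += c; state[3] += d;
	state[4] += e; state[5] += f; state[6] += g; state[7] += h;
}
```

Padding "abc" into one block (0x80 terminator, bit length 24 in the last byte) and starting from the standard SHA-256 initial hash values gives the well-known digest beginning ba7816bf 8f01cfea.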

Re: [LTP] mmstress[1309]: segfault at 7f3d71a36ee8 ip 00007f3d77132bdf sp 00007f3d71a36ee8 error 4 in libc-2.27.so[7f3d77058000+1aa000]

2020-10-22 Thread Linus Torvalds

On Thu, Oct 22, 2020 at 6:36 PM Daniel Díaz  wrote:
>
> The kernel Naresh originally referred to is here:
>   https://builds.tuxbuild.com/SCI7Xyjb7V2NbfQ2lbKBZw/

Thanks.

And when I started looking at it, I realized that my original idea
("just look for __put_user_nocheck_X calls, there aren't so many of
those") was garbage, and that I was just being stupid.

Yes, the commit that broke was about __put_user(), but in order to not
duplicate all the code, it re-used the regular put_user()
infrastructure, and so all the normal put_user() calls are potential
problem spots too if this is about the compiler interaction with KASAN
and the asm changes.

So it's not just a couple of special cases to look at, it's all the
normal cases too.

Ok, back to the drawing board, but I think reverting it is probably
the right thing to do if I can't think of something smart.

That said, since you see this on x86-64, where the whole ugly trick with that

   register asm("%"_ASM_AX)

is unnecessary (because the 8-byte case is still just a single
register, no %eax:%edx games needed), it would be interesting to hear
if the attached patch fixes it. That would confirm that the problem
really is due to some register allocation issue interaction (or,
alternatively, it would tell me that there's something else going on).

  Linus


Re: [PATCH 1/2] fs:regfs: add register easy filesystem

2020-10-22 Thread zc


Hi viro:

  Although regfs is very simple, I think it is interesting. Could you
give some suggestions?



Regards,

zc

On 2020/10/20 at 2:30 PM, Zou Cao wrote:

The register filesystem maps registers to file dentries and uses
I/O reads to get the register values. A DTB file is used to
describe the register tree; you can use it as follows:

mount -t regfs -o dtb=test.dtb none /mnt

test.dts:
/ {
	compatible = "hisilicon,hi6220-hikey", "hisilicon,hi6220";
	#address-cells = <0x2>;
	#size-cells = <0x2>;
	model = "HiKey Development Board";

	gic-v3-dist {
		reg = <0x0 0x800 0x0 0x1>;
		GIC_CTRL {
			offset = <0x0>;
		};
		GICD_TYPER {
			offset = <0x4>;
		};
	};
};

It will create a file for every register under /mnt.

Signed-off-by: Zou Cao 
---
  fs/Kconfig |   1 +
  fs/Makefile|   1 +
  fs/regfs/Kconfig   |   7 +
  fs/regfs/Makefile  |   8 ++
  fs/regfs/file.c| 107 +++
  fs/regfs/inode.c   | 354 +
  fs/regfs/internal.h|  32 +
  fs/regfs/regfs_inode.h |  32 +
  fs/regfs/supper.c  |  71 ++
  9 files changed, 613 insertions(+)
  create mode 100644 fs/regfs/Kconfig
  create mode 100644 fs/regfs/Makefile
  create mode 100644 fs/regfs/file.c
  create mode 100644 fs/regfs/inode.c
  create mode 100644 fs/regfs/internal.h
  create mode 100644 fs/regfs/regfs_inode.h
  create mode 100644 fs/regfs/supper.c

diff --git a/fs/Kconfig b/fs/Kconfig
index a88aa3a..d95acaf 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -324,6 +324,7 @@ endif # NETWORK_FILESYSTEMS
  source "fs/nls/Kconfig"
  source "fs/dlm/Kconfig"
  source "fs/unicode/Kconfig"
+source "fs/regfs/Kconfig"
  
  config IO_WQ

bool
diff --git a/fs/Makefile b/fs/Makefile
index 2ce5112..24f3878 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -136,3 +136,4 @@ obj-$(CONFIG_EFIVAR_FS) += efivarfs/
  obj-$(CONFIG_EROFS_FS)+= erofs/
  obj-$(CONFIG_VBOXSF_FS)   += vboxsf/
  obj-$(CONFIG_ZONEFS_FS)   += zonefs/
+obj-$(CONFIG_REGFS_FS) += regfs/
diff --git a/fs/regfs/Kconfig b/fs/regfs/Kconfig
new file mode 100644
index 000..74ba85b
--- /dev/null
+++ b/fs/regfs/Kconfig
@@ -0,0 +1,7 @@
+config REGFS_FS
+   tristate "registers filesystem support"
+   depends on ARM64
+   help
+ regfs supports reading and writing device registers through
+ files in a filesystem, which makes BSP debugging easier. It can
+ also print register values on panic.
diff --git a/fs/regfs/Makefile b/fs/regfs/Makefile
new file mode 100644
index 000..26d5eef
--- /dev/null
+++ b/fs/regfs/Makefile
@@ -0,0 +1,8 @@
+# SPDX-License-Identifier: GPL-2.0-only
+#
+#Makefile for the linux regfs routines.
+#
+
+obj-$(CONFIG_REGFS_FS) += regfs.o
+
+regfs-objs += inode.o file.o supper.o
diff --git a/fs/regfs/file.c b/fs/regfs/file.c
new file mode 100644
index 000..6cd9f3d
--- /dev/null
+++ b/fs/regfs/file.c
@@ -0,0 +1,107 @@
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "regfs_inode.h"
+#include "internal.h"
+
+ssize_t regfs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
+{
+   struct file *file = iocb->ki_filp;
+   struct inode *inode = file->f_mapping->host;
+   ssize_t ret;
+
+   inode_lock(inode);
+   ret = generic_write_checks(iocb, from);
+   if (ret > 0)
+   ret = __generic_file_write_iter(iocb, from);
+   inode_unlock(inode);
+
+   if (ret > 0)
+   ret = generic_write_sync(iocb, ret);
+   return ret;
+}
+
+static ssize_t regfs_file_read(struct file *file, char __user *buf, size_t len, loff_t *ppos)
+{
+   struct address_space *mapping = file->f_mapping;
+   struct regfs_inode_info  *info = REGFS_I(mapping->host);
+   char str[64];
+   unsigned long val;
+
+   val = readl_relaxed(info->base + info->offset);
+
+   loc_debug("name:%s base:%p val:%lx\n"
+   , file->f_path.dentry->d_iname
+   , info->base + info->offset
+   , val);
+
+   snprintf(str, 64, "%lx", val);
+
+   return simple_read_from_buffer(buf, len, ppos, str, strlen(str));
+}
+
+static ssize_t regfs_file_write(struct file *file, const char __user *buf, size_t len, loff_t *ppos)
+{
+   struct address_space *mapping = file->f_mapping;
+   struct regfs_inode_info  *info = REGFS_I(mapping->host);
+   char str[67];
+   unsigned long val = 0;
+   loff_t pos = *ppos;
+   size_t res;
+
+   if (pos < 0)
+   return -EINVAL;
+   if (pos >= len || len > 66)
+

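regfs_file_read() above formats the register value as hex and leaves the file-position handling to simple_read_from_buffer(). A userspace model of that read path, assuming memcpy() in place of copy_to_user() (so the -EFAULT path is omitted) and a plain unsigned long in place of readl_relaxed(); the model_* names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Models simple_read_from_buffer(): copy from *ppos, advance it, 0 at EOF. */
static ssize_t model_read_from_buffer(char *to, size_t count, off_t *ppos,
				      const char *from, size_t available)
{
	off_t pos = *ppos;

	if (pos < 0)
		return -EINVAL;
	if ((size_t)pos >= available || !count)
		return 0;
	if (count > available - pos)
		count = available - pos;
	memcpy(to, from + pos, count);
	*ppos = pos + count;
	return count;
}

/* Models regfs_file_read(): format the register value, then copy it out. */
static ssize_t model_regfs_read(unsigned long val, char *buf, size_t len,
				off_t *ppos)
{
	char str[64];

	snprintf(str, sizeof(str), "%lx", val);
	return model_read_from_buffer(buf, len, ppos, str, strlen(str));
}
```

Reading "deadbeef" three bytes at a time returns "dea", then the remaining five bytes, then 0 at end of file.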
Re: [PATCH v2] mm,thp,shmem: limit shmem THP alloc gfp_mask

2020-10-22 Thread Hugh Dickins

On Thu, 22 Oct 2020, Rik van Riel wrote:

> The allocation flags of anonymous transparent huge pages can be controlled
> through the files in /sys/kernel/mm/transparent_hugepage/defrag, which can
> help keep the system from getting bogged down in the page reclaim and compaction
> code when many THPs are getting allocated simultaneously.
> 
> However, the gfp_mask for shmem THP allocations was not limited by those
> configuration settings, and some workloads ended up with all CPUs stuck
> on the LRU lock in the page reclaim code, trying to allocate dozens of
> THPs simultaneously.
> 
> This patch applies the same configured limitation of THPs to shmem
> hugepage allocations, to prevent that from happening.
> 
> This way a THP defrag setting of "never" or "defer+madvise" will result
> in quick allocation failures without direct reclaim when no 2MB free
> pages are available.
> 
> Signed-off-by: Rik van Riel 

NAK in its present untested form: see below.

I'm open to change here, particularly to Yu Xu's point (in other mail)
about direct reclaim - we avoid that here in Google too: though it's
not so much to avoid the direct reclaim, as to avoid the latencies of
direct compaction, which __GFP_DIRECT_RECLAIM allows as a side-effect.

> --- 
> v2: move gfp calculation to shmem_getpage_gfp as suggested by Yu Xu
> 
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index c603237e006c..0a5b164a26d9 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -614,6 +614,8 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask);
>  extern void pm_restrict_gfp_mask(void);
>  extern void pm_restore_gfp_mask(void);
>  
> +extern gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma);
> +
>  #ifdef CONFIG_PM_SLEEP
>  extern bool pm_suspended_storage(void);
>  #else
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 9474dbc150ed..9b08ce5cc387 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -649,7 +649,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
>   *   available
>   * never: never stall for any thp allocation
>   */
> -static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
> +gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
>  {
>   const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
>  
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 537c137698f8..9710b9df91e9 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1545,8 +1545,8 @@ static struct page *shmem_alloc_hugepage(gfp_t gfp,
>   return NULL;
>  
>   shmem_pseudo_vma_init(&pvma, info, hindex);
> - page = alloc_pages_vma(gfp | __GFP_COMP | __GFP_NORETRY | __GFP_NOWARN,
> - HPAGE_PMD_ORDER, &pvma, 0, numa_node_id(), true);
> + page = alloc_pages_vma(gfp, HPAGE_PMD_ORDER, &pvma, 0, numa_node_id(),
> +true);

Commendably neat so far.

>   shmem_pseudo_vma_destroy(&pvma);
>   if (page)
>   prep_transhuge_page(page);
> @@ -1802,6 +1802,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
>   struct page *page;
>   enum sgp_type sgp_huge = sgp;
>   pgoff_t hindex = index;
> + gfp_t huge_gfp;
>   int error;
>   int once = 0;
>   int alloced = 0;
> @@ -1887,7 +1888,8 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
>   }
>  
>  alloc_huge:
> - page = shmem_alloc_and_acct_page(gfp, inode, index, true);
> + huge_gfp = alloc_hugepage_direct_gfpmask(vma);

Still looks nice: but what about the crash when vma is NULL?

It may work for shmem_fault() (though I'll probably disagree on the
details): but tmpfs is a filesystem, so most if not all of the system
calls which arrive here have no vma to offer.

Michal is right to remember pushback before, because tmpfs is a
filesystem, and "huge=" is a mount option: in using a huge=always
filesystem, the user has already declared a preference for huge pages.
Whereas the original anon THP had to deduce that preference from sys
tunables and vma madvice.

I certainly found it a lot easier to ignore all the shifting sandmaze
of the anon THP tunables, and I think Kirill followed me on that.

But it's likely that they have accumulated some defrag wisdom, which
tmpfs can take on board - but please accept that in using a huge mount,
the preference for huge has already been expressed, so I don't expect
anon THP alloc_hugepage_direct_gfpmask() choices will map one to one.

> + page = shmem_alloc_and_acct_page(huge_gfp, inode, index, true);
>   if (IS_ERR(page)) {
>  alloc_nohuge:
>   page = shmem_alloc_and_acct_page(gfp, inode,
> 

Hugh

Re: [PATCH] perf trace: Segfault when trying to trace events by cgroup

2020-10-22 Thread Namhyung Kim

Hello,

On Tue, Oct 20, 2020 at 5:48 AM Stanislav Ivanichkin
 wrote:
>
> Hi,
>
> +linux-perf-users@
>
> Gentle ping for this patch
>
> Many Thanks
>
> --
> Stanislav Ivanichkin
>
> > On 9 Oct 2020, at 09:45, Stanislav Ivanichkin  
> > wrote:
> >
> > # ./perf trace -e sched:sched_switch -G test -a sleep 1
> > perf: Segmentation fault
> > Obtained 11 stack frames.
> > ./perf(sighandler_dump_stack+0x43) [0x55cfdc636db3]
> > /lib/x86_64-linux-gnu/libc.so.6(+0x3efcf) [0x7fd23eecafcf]
> > ./perf(parse_cgroups+0x36) [0x55cfdc673f36]
> > ./perf(+0x3186ed) [0x55cfdc70d6ed]
> > ./perf(parse_options_subcommand+0x629) [0x55cfdc70e999]
> > ./perf(cmd_trace+0x9c2) [0x55cfdc5ad6d2]
> > ./perf(+0x1e8ae0) [0x55cfdc5ddae0]
> > ./perf(+0x1e8ded) [0x55cfdc5ddded]
> > ./perf(main+0x370) [0x55cfdc556f00]
> > /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe6) [0x7fd23eeadb96]
> > ./perf(_start+0x29) [0x55cfdc557389]
> > Segmentation fault
> >
> > It happens because "struct trace" in option->value is passed to
> > parse_cgroups function instead of "struct evlist".
> >
> > Signed-off-by: Stanislav Ivanichkin 
> > Reviewed-by: Dmitry Monakhov 

It seems we should add this too:
Fixes: 9ea42ba4411ac ("perf trace: Support setting cgroups as targets")

> > ---
> > tools/perf/builtin-trace.c | 9 ++---
> > 1 file changed, 6 insertions(+), 3 deletions(-)
> >
> > diff --git a/tools/perf/builtin-trace.c b/tools/perf/builtin-trace.c
> > index bea461b6f937..cbc4de6840db 100644
> > --- a/tools/perf/builtin-trace.c
> > +++ b/tools/perf/builtin-trace.c
> > @@ -4651,9 +4651,12 @@ static int trace__parse_cgroups(const struct option *opt, const char *str, int u
> > {
> >   struct trace *trace = opt->value;
> >
> > - if (!list_empty(&trace->evlist->core.entries))
> > - return parse_cgroups(opt, str, unset);
> > -
> > + if (!list_empty(&trace->evlist->core.entries)) {
> > + struct option o = OPT_CALLBACK('G', "cgroup", &trace->evlist,
> > + "name", "monitor event in cgroup name only",
> > + parse_cgroups);

Just make it simple and clear what parse_cgroups() expects:

struct option o = {
.value = &trace->evlist,
};

Or else, we can change parse_cgroups() to take evlist directly.
But it needs to change other callsites too.

Either is fine to me.

Thanks
Namhyung

> > + return parse_cgroups(&o, str, unset);
> > + }
> >   trace->cgroup = evlist__findnew_cgroup(trace->evlist, str);
> >
> >   return 0;
> > --
> > 2.17.1
> >
>
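The crash comes from handing parse_cgroups() an option whose value is a struct trace *, while parse_cgroups() dereferences opt->value as a struct evlist ** (which is what Namhyung's suggested `.value = &trace->evlist` provides). A userspace model of the fix, with hypothetical minimal structs and model_* functions standing in for the perf ones:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical minimal stand-ins for perf's types. */
struct evlist { int nr_cgroups; };
struct trace { int other_state; struct evlist *evlist; };
struct option { void *value; };

/* Like parse_cgroups(): opt->value must point at a struct evlist *. */
static int model_parse_cgroups(const struct option *opt, const char *str)
{
	struct evlist *evlist = *(struct evlist **)opt->value;

	(void)str;
	evlist->nr_cgroups++;
	return 0;
}

/* Fixed callback: wrap &trace->evlist in a temporary option. */
static int model_trace_parse_cgroups(const struct option *opt, const char *str)
{
	struct trace *trace = opt->value;
	struct option o = { .value = &trace->evlist };

	return model_parse_cgroups(&o, str);
}
```

Passing opt straight through instead would make model_parse_cgroups() reinterpret the start of struct trace as an evlist pointer, which is the same kind of bad dereference as in the reported segfault.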

[PATCHv2] selftests/powerpc/eeh: disable kselftest timeout setting for eeh-basic

2020-10-22 Thread Po-Hsu Lin

The eeh-basic test has its own 60-second timeout per breakable device
(defined in commit 414f50434aa2 "selftests/eeh: Bump EEH wait time to
60s").

We have also discovered that the number of breakable devices varies
on different hardware, and the device recovery time ranges from 0 to 35
seconds. In our test pool the test takes about 30 seconds to run on a
Power8 system with 5 breakable devices, and 60 seconds on a Power9
system with 4 breakable devices.

Extend the timeout for this test in the kselftest framework from the
default 45 seconds to 5 minutes to give it a chance to finish.

Signed-off-by: Po-Hsu Lin 
---
 tools/testing/selftests/powerpc/eeh/Makefile | 2 +-
 tools/testing/selftests/powerpc/eeh/settings | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/powerpc/eeh/settings

diff --git a/tools/testing/selftests/powerpc/eeh/Makefile b/tools/testing/selftests/powerpc/eeh/Makefile
index b397bab..ae963eb 100644
--- a/tools/testing/selftests/powerpc/eeh/Makefile
+++ b/tools/testing/selftests/powerpc/eeh/Makefile
@@ -3,7 +3,7 @@ noarg:
$(MAKE) -C ../
 
 TEST_PROGS := eeh-basic.sh
-TEST_FILES := eeh-functions.sh
+TEST_FILES := eeh-functions.sh settings
 
 top_srcdir = ../../../../..
 include ../../lib.mk
diff --git a/tools/testing/selftests/powerpc/eeh/settings b/tools/testing/selftests/powerpc/eeh/settings
new file mode 100644
index 000..694d707
--- /dev/null
+++ b/tools/testing/selftests/powerpc/eeh/settings
@@ -0,0 +1 @@
+timeout=300
-- 
2.7.4
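A rough way to reason about the budget: if each breakable device can use up the full 60-second EEH wait from commit 414f50434aa2, the worst case grows linearly with the device count (observed recovery times of 0 to 35 s are usually well below that cap). The helpers below are a hypothetical sketch of that arithmetic, not part of the selftest.

```c
#include <assert.h>
#include <stdbool.h>

#define EEH_MAX_WAIT_S 60u	/* per-device wait cap from commit 414f50434aa2 */

/* Worst case if every breakable device uses its full EEH wait (hypothetical). */
static unsigned int eeh_worst_case_s(unsigned int nr_devices)
{
	return nr_devices * EEH_MAX_WAIT_S;
}

/* Does a kselftest "timeout=" budget cover that worst case? */
static bool timeout_covers(unsigned int timeout_s, unsigned int nr_devices)
{
	return eeh_worst_case_s(nr_devices) <= timeout_s;
}
```

With the default 45 s, even one device hitting the full wait overruns the budget; with `timeout=300`, up to five devices fit in the worst case.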

[PATCH/RFC net v2] net: dec: tulip: de2104x: Add shutdown handler to stop NIC

2020-10-22 Thread Moritz Fischer

The driver does not implement a shutdown handler, which leads to issues
when using kexec in certain scenarios: the NIC keeps fetching
descriptors, which the IOMMU flags with errors like this:

DMAR: DMAR:[DMA read] Request device [5e:00.0]fault addr f000
DMAR: DMAR:[DMA read] Request device [5e:00.0]fault addr f000
DMAR: DMAR:[DMA read] Request device [5e:00.0]fault addr f000
DMAR: DMAR:[DMA read] Request device [5e:00.0]fault addr f000
DMAR: DMAR:[DMA read] Request device [5e:00.0]fault addr f000

Signed-off-by: Moritz Fischer 
---

Changes from v1:
- Replace call to de_remove_one with de_shutdown() function
  as suggested by James.

---
 drivers/net/ethernet/dec/tulip/de2104x.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/drivers/net/ethernet/dec/tulip/de2104x.c b/drivers/net/ethernet/dec/tulip/de2104x.c
index f1a2da15dd0a..6de0cd6cf4ca 100644
--- a/drivers/net/ethernet/dec/tulip/de2104x.c
+++ b/drivers/net/ethernet/dec/tulip/de2104x.c
@@ -2180,11 +2180,19 @@ static int de_resume (struct pci_dev *pdev)
 
 #endif /* CONFIG_PM */
 
+static void de_shutdown(struct pci_dev *pdev)
+{
+   struct net_device *dev = pci_get_drvdata (pdev);
+
+   de_close(dev);
+}
+
 static struct pci_driver de_driver = {
.name   = DRV_NAME,
.id_table   = de_pci_tbl,
.probe  = de_init_one,
.remove = de_remove_one,
+   .shutdown   = de_shutdown,
 #ifdef CONFIG_PM
.suspend= de_suspend,
.resume = de_resume,
-- 
2.28.0
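For context, the PCI core calls a driver's `.shutdown` callback, when one is set, as part of device shutdown on reboot and kexec; without it the device is left running. The userspace model below is a hypothetical sketch of that contract (the `fake_*` names are invented and this is not the kernel's `struct pci_driver`): the core calls `shutdown` only if it is present, and the handler stops DMA the way de_shutdown() does by calling de_close().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a NIC whose DMA engine keeps fetching descriptors. */
struct fake_nic {
	bool dma_running;
	unsigned int fetches;
};

struct fake_driver {
	void (*shutdown)(struct fake_nic *nic);	/* optional, may be NULL */
};

/* Models the core: invoke ->shutdown only when the driver provides one. */
static void fake_core_shutdown(const struct fake_driver *drv, struct fake_nic *nic)
{
	if (drv->shutdown)
		drv->shutdown(nic);
}

/* One "tick" after handing over to the new kernel: a live NIC keeps fetching,
 * roughly analogous to the stray reads behind the DMAR faults above. */
static void fake_nic_tick(struct fake_nic *nic)
{
	if (nic->dma_running)
		nic->fetches++;
}

/* Like de_shutdown() -> de_close(): quiesce the hardware. */
static void fake_shutdown(struct fake_nic *nic)
{
	nic->dma_running = false;
}
```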

RE: [PATCH] scsi: megaraid_sas: use spin_lock() in hard IRQ

2020-10-22 Thread Finn Thain

On Thu, 22 Oct 2020, Tianxianting wrote:

> I see. If we add this patch, we need to identify all CPU architectures
> that support nested interrupts.
> 

I was just calling into question 1. the benefit (does it improve 
performance?) and 2. the code style (is it less portable?).

It's really the style question that mostly interests me, because I've had 
to code around the nested interrupt situation before, and every time it 
comes up it makes me wonder about the necessity.

I was not trying to veto your patch. It is not my position to do that. If 
Broadcom likes the patch, that's great.

Re: [PATCH] selftests/powerpc/eeh: disable kselftest timeout setting for eeh-basic

2020-10-22 Thread Po-Hsu Lin

On Fri, Oct 23, 2020 at 10:07 AM Michael Ellerman  wrote:
>
> Po-Hsu Lin  writes:
> > The eeh-basic test has its own 60-second timeout per breakable device
> > (defined in commit 414f50434aa2 "selftests/eeh: Bump EEH wait time to
> > 60s").
> >
> > We have found that the number of breakable devices varies across
> > hardware, and the device recovery time ranges from 0 to 35 seconds. In
> > our test pool, the test takes about 30 seconds on a Power8 system with
> > 5 breakable devices and 60 seconds on a Power9 system with 4 breakable
> > devices.
> >
> > Thus it's better to disable the default 45-second timeout in the
> > kselftest framework to give the test a chance to finish, and let the
> > test handle timeout control itself.
>
> I'd prefer if we still had some timeout, maybe 5 or 10 minutes? Just in
> case the test goes completely bonkers.
>
OK, let's go for 5 minutes.
Will send V2 later.
Thanks for your suggestion!

> cheers
>
> > diff --git a/tools/testing/selftests/powerpc/eeh/Makefile b/tools/testing/selftests/powerpc/eeh/Makefile
> > index b397bab..ae963eb 100644
> > --- a/tools/testing/selftests/powerpc/eeh/Makefile
> > +++ b/tools/testing/selftests/powerpc/eeh/Makefile
> > @@ -3,7 +3,7 @@ noarg:
> >   $(MAKE) -C ../
> >
> >  TEST_PROGS := eeh-basic.sh
> > -TEST_FILES := eeh-functions.sh
> > +TEST_FILES := eeh-functions.sh settings
> >
> >  top_srcdir = ../../../../..
> >  include ../../lib.mk
> > diff --git a/tools/testing/selftests/powerpc/eeh/settings b/tools/testing/selftests/powerpc/eeh/settings
> > new file mode 100644
> > index 000..e7b9417
> > --- /dev/null
> > +++ b/tools/testing/selftests/powerpc/eeh/settings
> > @@ -0,0 +1 @@
> > +timeout=0
> > --
> > 2.7.4

[GIT PULL] ARC fix for 5.10-rc1

2020-10-22 Thread Vineet Gupta

Hi Linus,

This is an unusual second pull request for the merge window. I found a snafu in
the perf driver which made it into 5.9-rc4, so the fix could go in now rather
than wait for 5.10-rc2. Sorry for the trouble.

Thx,
-Vineet
->
The following changes since commit 6364d1b41cc382db3b03cf33c57b6007ee8f09cf:

  arc: include/asm: fix typos of "themselves" (2020-10-05 21:02:29 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc.git/ tags/arc-5.10-rc1-fixes

for you to fetch changes up to 8c42a5c02bec6c7eccf08957be3c6c8fccf9790b:

  ARC: perf: redo the pct irq missing in device-tree handling (2020-10-22 10:57:58 -0700)


Urgent perf ARC fix


Vineet Gupta (1):
  ARC: perf: redo the pct irq missing in device-tree handling

 arch/arc/kernel/perf_event.c | 27 ++-
 1 file changed, 18 insertions(+), 9 deletions(-)

linux-next: Tree for Oct 23

2020-10-22 Thread Stephen Rothwell

Hi all,

Since the merge window is open, please do not add any v5.11 material to
your linux-next included branches until after v5.10-rc1 has been released.

Changes since 20201022:

Non-merge commits (relative to Linus' tree): 1952
 2322 files changed, 329767 insertions(+), 37681 deletions(-)



I have created today's linux-next tree at
git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
(patches at http://www.kernel.org/pub/linux/kernel/next/ ).  If you
are tracking the linux-next tree using git, you should not use "git pull"
to do so as that will try to merge the new linux-next release with the
old one.  You should use "git fetch" and checkout or reset to the new
master.

You can see which trees have been included by looking in the Next/Trees
file in the source.  There are also quilt-import.log and merge.log
files in the Next directory.  Between each merge, the tree was built
with a ppc64_defconfig for powerpc, an allmodconfig for x86_64, a
multi_v7_defconfig for arm and a native build of tools/perf. After
the final fixups (if any), I do an x86_64 modules_install followed by
builds for x86_64 allnoconfig, powerpc allnoconfig (32 and 64 bit),
ppc44x_defconfig, allyesconfig and pseries_le_defconfig and i386, sparc
and sparc64 defconfig and htmldocs. And finally, a simple boot test
of the powerpc pseries_le_defconfig kernel in qemu (with and without
kvm enabled).

Below is a summary of the state of the merge.

I am currently merging 329 trees (counting Linus' and 86 trees of bug
fix patches pending for the current merge release).

Stats about the size of the tree over time can be seen at
http://neuling.org/linux-next-size.html .

Status of my local build tests will be at
http://kisskb.ellerman.id.au/linux-next .  If maintainers want to give
advice about cross compilers/configs that work, we are always open to add
more builds.

Thanks to Randy Dunlap for doing many randconfig builds.  And to Paul
Gortmaker for triage and bug fixes.

-- 
Cheers,
Stephen Rothwell

$ git checkout master
$ git reset --hard stable
Merging origin/master (f9893351acae Merge tag 'kconfig-v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild)
Merging fixes/fixes (9123e3a74ec7 Linux 5.9-rc1)
Merging kbuild-current/fixes (e30d694c3381 Documentation/llvm: Fix clang target examples)
Merging arc-current/for-curr (6364d1b41cc3 arc: include/asm: fix typos of "themselves")
Merging arm-current/fixes (9123e3a74ec7 Linux 5.9-rc1)
Merging arm64-fixes/for-next/fixes (39e4716caa59 crypto: arm64: Use x16 with indirect branch to bti_c)
Merging arm-soc-fixes/arm/fixes (6869f774b1cd Merge tag 'omap-for-v5.9/fixes-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap into arm/fixes)
Merging uniphier-fixes/fixes (48778464bb7d Linux 5.8-rc2)
Merging drivers-memory-fixes/fixes (7ff3a2a626f7 memory: jz4780_nemc: Fix an error pointer vs NULL check in probe())
Merging m68k-current/for-linus (50c5feeea0af ide/macide: Convert Mac IDE driver to platform driver)
Merging powerpc-fixes/fixes (4ff753feab02 powerpc/pseries: Avoid using addr_to_pfn in real mode)
Merging s390-fixes/fixes (549738f15da0 Linux 5.9-rc8)
Merging sparc/master (0a95a6d1a4cd sparc: use for_each_child_of_node() macro)
Merging fscrypt-current/for-stable (2b4eae95c736 fscrypt: don't evict dirty inodes after removing key)
Merging net/master (18ded910b589 tcp: fix to update snd_wl1 in bulk receiver fast path)
Merging bpf/master (18ded910b589 tcp: fix to update snd_wl1 in bulk receiver fast path)
Merging ipsec/master (7fe94612dd4c xfrm: interface: fix the priorities for ipip and ipv6 tunnels)
Merging netfilter/master (c77761c8a594 netfilter: nf_fwd_netdev: clear timestamp in forwarding path)
Merging ipvs/master (48d072c4e8cd selftests: netfilter: add time counter check)
Merging wireless-drivers/master (df41c19abbea drivers/net/wan/hdlc_fr: Move the skb_headroom check out of fr_hard_header)
Merging mac80211/master (9ff9b0d392ea Merge tag 'net-next-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next)
Merging rdma-fixes/for-rc (a1b8638ba132 Linux 5.9-rc7)
Merging sound-current/for-linus (033e4040d453 ALSA: hda - Fix the return value if cb func is already registered)
Merging sound-asoc-fixes/for-linus (8101e3024d76 Merge remote-tracking branch 'asoc/for-5.10' into asoc-linus)
Merging regmap-fixes/for-linus (549738f15da0 Linux 5.9-rc8)
Merging regulator-fixes/for-linus (b7c11f48ff81 Merge remote-tracking branch 'regulator/for-5.10' into regulator-linus)
Merging spi-fixes/for-linus (d4f3a651ab82 Merge remote-tracking branch 'spi/for-5.9' into spi-linus)
Merging pci-current/for-linus (76a6b0b90d53 MAINTAINERS: Add Pali Rohár as aardvark PCI maintainer)
Merging driver-core.current/driver-core-linus (270315b8235e Merge tag 'riscv-for-linus-5.10-mw0' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/

[PATCH V3 2/3] vhost: vdpa: report iova range

2020-10-22 Thread Jason Wang

This patch introduces a new ioctl for the vhost-vdpa device to report
the IOVA range supported by the device.

For a device that implements the get_iova_range() method, we fetch the
range from the vDPA device. If the device doesn't implement
get_iova_range() but depends on the platform IOMMU, we query the IOMMU
via DOMAIN_ATTR_GEOMETRY and use its aperture when the IOMMU forces one;
otherwise [0, ULLONG_MAX] is assumed.

For safety, this patch also rejects map requests that are not within
the valid range.

Signed-off-by: Jason Wang 
---
 drivers/vhost/vdpa.c | 40 
 include/uapi/linux/vhost.h   |  4 
 include/uapi/linux/vhost_types.h |  9 +++
 3 files changed, 53 insertions(+)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index a2dbc85e0b0d..562ed99116d1 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -47,6 +47,7 @@ struct vhost_vdpa {
int minor;
struct eventfd_ctx *config_ctx;
int in_batch;
+   struct vdpa_iova_range range;
 };
 
 static DEFINE_IDA(vhost_vdpa_ida);
@@ -337,6 +338,16 @@ static long vhost_vdpa_set_config_call(struct vhost_vdpa *v, u32 __user *argp)
return 0;
 }
 
+static long vhost_vdpa_get_iova_range(struct vhost_vdpa *v, u32 __user *argp)
+{
+   struct vhost_vdpa_iova_range range = {
+   .first = v->range.first,
+   .last = v->range.last,
+   };
+
+   return copy_to_user(argp, &range, sizeof(range));
+}
+
 static long vhost_vdpa_vring_ioctl(struct vhost_vdpa *v, unsigned int cmd,
   void __user *argp)
 {
@@ -470,6 +481,8 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep,
case VHOST_GET_BACKEND_FEATURES:
features = VHOST_VDPA_BACKEND_FEATURES;
r = copy_to_user(featurep, &features, sizeof(features));
+   case VHOST_VDPA_GET_IOVA_RANGE:
+   r = vhost_vdpa_get_iova_range(v, argp);
break;
default:
r = vhost_dev_ioctl(&v->vdev, cmd, argp);
@@ -597,6 +610,10 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
long pinned;
int ret = 0;
 
+   if (msg->iova < v->range.first ||
+   msg->iova + msg->size - 1 > v->range.last)
+   return -EINVAL;
+
if (vhost_iotlb_itree_first(iotlb, msg->iova,
msg->iova + msg->size - 1))
return -EEXIST;
@@ -783,6 +800,27 @@ static void vhost_vdpa_free_domain(struct vhost_vdpa *v)
v->domain = NULL;
 }
 
+static void vhost_vdpa_set_iova_range(struct vhost_vdpa *v)
+{
+   struct vdpa_iova_range *range = &v->range;
+   struct iommu_domain_geometry geo;
+   struct vdpa_device *vdpa = v->vdpa;
+   const struct vdpa_config_ops *ops = vdpa->config;
+
+   if (ops->get_iova_range) {
+   *range = ops->get_iova_range(vdpa);
+   } else if (v->domain &&
+  !iommu_domain_get_attr(v->domain,
+  DOMAIN_ATTR_GEOMETRY, &geo) &&
+  geo.force_aperture) {
+   range->first = geo.aperture_start;
+   range->last = geo.aperture_end;
+   } else {
+   range->first = 0;
+   range->last = ULLONG_MAX;
+   }
+}
+
 static int vhost_vdpa_open(struct inode *inode, struct file *filep)
 {
struct vhost_vdpa *v;
@@ -823,6 +861,8 @@ static int vhost_vdpa_open(struct inode *inode, struct file *filep)
if (r)
goto err_init_iotlb;
 
+   vhost_vdpa_set_iova_range(v);
+
filep->private_data = v;
 
return 0;
diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
index 75232185324a..c998860d7bbc 100644
--- a/include/uapi/linux/vhost.h
+++ b/include/uapi/linux/vhost.h
@@ -146,4 +146,8 @@
 
 /* Set event fd for config interrupt*/
 #define VHOST_VDPA_SET_CONFIG_CALL _IOW(VHOST_VIRTIO, 0x77, int)
+
+/* Get the valid iova range */
+#define VHOST_VDPA_GET_IOVA_RANGE  _IOR(VHOST_VIRTIO, 0x78, \
+struct vhost_vdpa_iova_range)
 #endif
diff --git a/include/uapi/linux/vhost_types.h b/include/uapi/linux/vhost_types.h
index 9a269a88a6ff..f7f6a3a28977 100644
--- a/include/uapi/linux/vhost_types.h
+++ b/include/uapi/linux/vhost_types.h
@@ -138,6 +138,15 @@ struct vhost_vdpa_config {
__u8 buf[0];
 };
 
+/* vhost vdpa IOVA range
+ * @first: First address that can be mapped by vhost-vDPA
+ * @last: Last address that can be mapped by vhost-vDPA
+ */
+struct vhost_vdpa_iova_range {
+   __u64 first;
+   __u64 last;
+};
+
 /* Feature bits */
 /* Log all write descriptors. Can be changed while device is active. */
 #define VHOST_F_LOG_ALL 26
-- 
2.20.1
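The check added to vhost_vdpa_process_iotlb_update() accepts a mapping only if [iova, iova + size - 1] lies inside [first, last]. Below is a standalone sketch of that predicate (the function name is hypothetical). It additionally rejects `size == 0` and ranges whose end wraps past `UINT64_MAX`. The posted check does not test these two edge cases explicitly, so treat those extra conditions as an assumption rather than part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct iova_range { uint64_t first, last; };	/* mirrors vhost_vdpa_iova_range */

/* True if [iova, iova + size - 1] lies entirely inside [r->first, r->last]. */
static bool iova_map_in_range(const struct iova_range *r, uint64_t iova, uint64_t size)
{
	uint64_t end;

	if (size == 0)
		return false;		/* empty map: an assumption, not in the patch */
	end = iova + size - 1;
	if (end < iova)
		return false;		/* wrapped past UINT64_MAX */
	return iova >= r->first && end <= r->last;
}
```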

Re: [PATCH net RFC] net: Clear IFF_TX_SKB_SHARING for all Ethernet devices using skb_padto

2020-10-22 Thread Xie He

On Thu, Oct 22, 2020 at 6:56 PM Xie He  wrote:
>
> My patch isn't complete. Because there are so many drivers with this
> problem, I feel it's hard to solve them all at once. So I only grepped
> "skb_padto" under "drivers/net/ethernet". There are other drivers
> under "ethernet" using "skb_pad", "skb_put_padto" or "eth_skb_pad".
> There are also (fake) Ethernet drivers under "drivers/net/wireless". I
> feel it'd take a long time and also be error-prone to solve them all,
> so I feel it'd be the best if there are other solutions.

BTW, I also see some Ethernet drivers calling skb_push to prepend
strange headers to the skbs. For example,

drivers/net/ethernet/mellanox/mlxsw/switchx2.c prepends a header of
MLXSW_TXHDR_LEN (16).

We can't send shared skbs to these drivers either because they modify the skbs.

It seems to me that many drivers have always assumed that they can
modify the skb whenever needed; they've never considered that there might
be shared skbs. I guess adding IFF_TX_SKB_SHARING to ether_setup was a
bad idea: it not only made the code less clean, but also didn't match
the actual behavior of the drivers.
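To make the hazard concrete: a transmit path that pads or prepends in place is only safe if it owns the buffer. The userspace sketch below models a refcounted packet buffer. The `pkt_buf` type and helpers are hypothetical, loosely analogous to skb_shared() plus copy-before-write. The sketch shows that padding a shared buffer in place changes what the other holder sees, while padding a private copy does not.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical refcounted packet buffer, loosely modelled on an skb. */
struct pkt_buf {
	int users;			/* like skb->users */
	size_t len;
	unsigned char data[64];
};

static int pkt_shared(const struct pkt_buf *b) { return b->users > 1; }

/* Unsafe when shared: the padding is visible to every holder of 'b'. */
static void pad_in_place(struct pkt_buf *b, size_t min_len)
{
	assert(min_len <= sizeof(b->data));
	if (b->len < min_len) {
		memset(b->data + b->len, 0, min_len - b->len);
		b->len = min_len;
	}
}

/* Safe: if shared, drop our reference and pad a private copy instead. */
static struct pkt_buf *pad_private(struct pkt_buf *b, size_t min_len)
{
	if (pkt_shared(b)) {
		struct pkt_buf *copy = malloc(sizeof(*copy));

		if (!copy)
			return NULL;
		memcpy(copy, b, sizeof(*copy));
		copy->users = 1;
		b->users--;
		b = copy;
	}
	pad_in_place(b, min_len);
	return b;
}
```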

[PATCH V3 3/3] vdpa_sim: implement get_iova_range()

2020-10-22 Thread Jason Wang

This implements a sample get_iova_range() for the simulator, which
advertises [0, ULLONG_MAX] as the valid range.

Signed-off-by: Jason Wang 
---
 drivers/vdpa/vdpa_sim/vdpa_sim.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c
index 62d640327145..ff6c9fd8d879 100644
--- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
+++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
@@ -574,6 +574,16 @@ static u32 vdpasim_get_generation(struct vdpa_device *vdpa)
return vdpasim->generation;
 }
 
+static struct vdpa_iova_range vdpasim_get_iova_range(struct vdpa_device *vdpa)
+{
+   struct vdpa_iova_range range = {
+   .first = 0ULL,
+   .last = ULLONG_MAX,
+   };
+
+   return range;
+}
+
 static int vdpasim_set_map(struct vdpa_device *vdpa,
   struct vhost_iotlb *iotlb)
 {
@@ -657,6 +667,7 @@ static const struct vdpa_config_ops vdpasim_net_config_ops = {
.get_config = vdpasim_get_config,
.set_config = vdpasim_set_config,
.get_generation = vdpasim_get_generation,
+   .get_iova_range = vdpasim_get_iova_range,
.dma_map= vdpasim_dma_map,
.dma_unmap  = vdpasim_dma_unmap,
.free   = vdpasim_free,
@@ -683,6 +694,7 @@ static const struct vdpa_config_ops vdpasim_net_batch_config_ops = {
.get_config = vdpasim_get_config,
.set_config = vdpasim_set_config,
.get_generation = vdpasim_get_generation,
+   .get_iova_range = vdpasim_get_iova_range,
.set_map= vdpasim_set_map,
.free   = vdpasim_free,
 };
-- 
2.20.1

[PATCH V3 0/3] vDPA: API for reporting IOVA range

2020-10-22 Thread Jason Wang

Hi All:

This series introduces an API for reporting the IOVA range. This is
required for userspace to work correctly:

- for a process that uses vhost-vDPA directly, the IOVA must be
  allocated from this range.
- for a VM (QEMU), when vIOMMU is not enabled, fail early if the GPA is
  out of range.
- for a VM (QEMU), when vIOMMU is enabled, determine a valid guest
  address width so that the guest IOVA allocator behaves correctly.

Please review.

Changes from V2:
- silence build warnings

Changes from V1:

- do not mandate get_iova_range() for devices with their own DMA
  translation logic, and assume a [0, ULLONG_MAX] range
- mandate the IOVA range only for an IOMMU that forces an aperture
- forbid maps that are outside the IOVA range in vhost-vDPA

Jason Wang (3):
  vdpa: introduce config op to get valid iova range
  vhost: vdpa: report iova range
  vdpa_sim: implement get_iova_range()

 drivers/vdpa/vdpa_sim/vdpa_sim.c | 12 ++
 drivers/vhost/vdpa.c | 40 
 include/linux/vdpa.h | 15 
 include/uapi/linux/vhost.h   |  4 
 include/uapi/linux/vhost_types.h |  9 +++
 5 files changed, 80 insertions(+)

-- 
2.20.1
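For the first use case above (a process using vhost-vDPA directly), userspace has to carve its IOVAs out of the reported [first, last] window. Here is a minimal bump-allocator sketch under that assumption. The allocator and its names are hypothetical. The bounds would come from the new VHOST_VDPA_GET_IOVA_RANGE ioctl, which this sketch does not call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bump allocator confined to a device-reported IOVA window. */
struct iova_alloc {
	uint64_t first, last;	/* inclusive bounds, e.g. from the ioctl */
	uint64_t next;		/* next candidate address, never below first */
	bool exhausted;		/* set once the window's top address is used */
};

static void iova_alloc_init(struct iova_alloc *a, uint64_t first, uint64_t last)
{
	a->first = first;
	a->last = last;
	a->next = first;
	a->exhausted = false;
}

/* Stores an aligned IOVA in *out; align must be a power of two. */
static bool iova_alloc(struct iova_alloc *a, uint64_t size, uint64_t align, uint64_t *out)
{
	uint64_t start, end;

	assert(align != 0 && (align & (align - 1)) == 0);
	if (a->exhausted || size == 0)
		return false;
	start = (a->next + align - 1) & ~(align - 1);
	if (start < a->next)		/* rounding up wrapped around */
		return false;
	end = start + size - 1;
	if (end < start || end > a->last)
		return false;		/* would leave the device's window */
	*out = start;
	if (end == UINT64_MAX)
		a->exhausted = true;	/* end + 1 would wrap to 0 */
	else
		a->next = end + 1;
	return true;
}
```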

[PATCH V3 1/3] vdpa: introduce config op to get valid iova range

2020-10-22 Thread Jason Wang

This patch introduces a config op to get the valid IOVA range from the vDPA
device.

Signed-off-by: Jason Wang 
---
 include/linux/vdpa.h | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
index eae0bfd87d91..30bc7a7223bb 100644
--- a/include/linux/vdpa.h
+++ b/include/linux/vdpa.h
@@ -52,6 +52,16 @@ struct vdpa_device {
int nvqs;
 };
 
+/**
+ * vDPA IOVA range - the IOVA range supported by the device
+ * @first: start of the IOVA range
+ * @last: end of the IOVA range
+ */
+struct vdpa_iova_range {
+   u64 first;
+   u64 last;
+};
+
 /**
  * vDPA_config_ops - operations for configuring a vDPA device.
  * Note: vDPA device drivers are required to implement all of the
@@ -151,6 +161,10 @@ struct vdpa_device {
  * @get_generation:Get device config generation (optional)
  * @vdev: vdpa device
  * Returns u32: device generation
+ * @get_iova_range:Get supported iova range (optional)
+ * @vdev: vdpa device
+ * Returns the iova range supported by
+ * the device.
  * @set_map:   Set device memory mapping (optional)
  * Needed for device that using device
  * specific DMA translation (on-chip IOMMU)
@@ -216,6 +230,7 @@ struct vdpa_config_ops {
void (*set_config)(struct vdpa_device *vdev, unsigned int offset,
   const void *buf, unsigned int len);
u32 (*get_generation)(struct vdpa_device *vdev);
+   struct vdpa_iova_range (*get_iova_range)(struct vdpa_device *vdev);
 
/* DMA ops */
int (*set_map)(struct vdpa_device *vdev, struct vhost_iotlb *iotlb);
-- 
2.20.1
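Because the new op is optional, patch 2/3 (vhost_vdpa_set_iova_range()) resolves the range in a fixed order: the device's get_iova_range() if present, else the IOMMU aperture when the domain forces one, else [0, ULLONG_MAX]. The userspace model below restates that precedence. It uses hypothetical stand-in types (`model_dev`, `model_geometry`) and `UINT64_MAX`, which has the same value as ULLONG_MAX on Linux. It is a sketch of the decision logic, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct model_range { uint64_t first, last; };

/* Stand-in for the parts of iommu_domain_geometry the patch looks at. */
struct model_geometry {
	bool valid;			/* the geometry query succeeded */
	bool force_aperture;
	uint64_t aperture_start, aperture_end;
};

struct model_dev {
	/* Optional op, like vdpa_config_ops.get_iova_range; may be NULL. */
	struct model_range (*get_iova_range)(const struct model_dev *dev);
	const struct model_geometry *geo;	/* NULL: no IOMMU domain */
};

static struct model_range resolve_iova_range(const struct model_dev *dev)
{
	struct model_range r = { 0, UINT64_MAX };	/* default: everything */

	if (dev->get_iova_range) {
		r = dev->get_iova_range(dev);
	} else if (dev->geo && dev->geo->valid && dev->geo->force_aperture) {
		r.first = dev->geo->aperture_start;
		r.last = dev->geo->aperture_end;
	}
	return r;
}

/* Example op for a device with a limited window (values are arbitrary). */
static struct model_range small_window(const struct model_dev *dev)
{
	(void)dev;
	return (struct model_range){ .first = 0x1000, .last = 0x1fff };
}
```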

Re: Question on io-wq

2020-10-22 Thread Jens Axboe

On 10/22/20 8:05 PM, Hillf Danton wrote:
> On Thu, 22 Oct 2020 08:08:09 -0600 Jens Axboe wrote:
>> On 10/22/20 3:02 AM, Zhang,Qiang wrote:
>>>
>>> Hi Jens Axboe
>>>
>>> There is a problem with the 'io_wqe_worker' thread: when an
>>> 'io_wqe_worker' is created with its CPU affinity set to a NUMA node and,
>>> due to CPU hotplug, the last CPU of that node goes down, the
>>> 'io_wqe_worker' thread will run anywhere. When a CPU in the node comes
>>> online again, should we restore its CPU bindings?
>>
>> Something like the below should help in ensuring affinities are
>> always correct - trigger an affinity set for an online CPU event. We
>> should not need to do it for offlining. Can you test it?
> 
> CPU affinity stays intact because nothing is done on offline, and the
> scheduler will move the stray workers onto the correct NUMA node if any
> CPU comes online, so it's a bit hard to see what is going to be tested.

Test it yourself:

- Boot with > 1 NUMA node
- Start an io_uring, you now get 2 workers, each affinitized to a node
- Now offline all CPUs in one node
- Online one or more of the CPUs in that same node

The end result is that the worker on the node that was offlined now
has a mask of the other node, plus the newly added CPU.

So your last statement isn't correct, which is what the original
reporter stated.

-- 
Jens Axboe
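The scenario above can be modelled with plain CPU bitmasks. In the sketch below, masks are `uint64_t` and the helper names are hypothetical. The fallback rule is a simplification of the scheduler's, and the model reproduces only the behaviour described in this thread, not io-wq's code. Offlining every CPU of a node leaves the worker on the other node's mask, onlining a CPU without a hook only widens that stale mask, and re-applying the node mask on the online event restores the intended binding.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t cpumask_model;		/* bit n set = CPU n allowed/online */

/* All allowed CPUs went offline: the task is allowed on the online CPUs. */
static cpumask_model fallback_on_offline(cpumask_model allowed, cpumask_model online)
{
	return (allowed & online) ? allowed : online;
}

/* No hotplug hook: onlining a CPU only widens the stale mask (as reported). */
static cpumask_model online_without_hook(cpumask_model allowed, unsigned int cpu)
{
	return allowed | ((cpumask_model)1 << cpu);
}

/* Suggested hook: if the onlined CPU belongs to the worker's node, rebind. */
static cpumask_model online_with_hook(cpumask_model allowed, cpumask_model node_mask,
				      unsigned int cpu)
{
	if (node_mask & ((cpumask_model)1 << cpu))
		return node_mask;
	return allowed;
}
```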

Re: [PATCHSET v6] Add support for TIF_NOTIFY_SIGNAL

2020-10-22 Thread Jens Axboe

On 10/16/20 9:45 AM, Jens Axboe wrote:
> Hi,
> 
> The goal of this patch series is to decouple TWA_SIGNAL based task_work
> from real signals and signal delivery. The motivation is speeding up
> TWA_SIGNAL based task_work, particularly for threaded setups where
> ->sighand is shared across threads. See the last patch for numbers.
> 
> Cleanups in this series, see changelog. But the arch and cleanup
> series that goes after this series is much simpler now that we handle
> TIF_NOTIFY_SIGNAL generically for !CONFIG_GENERIC_ENTRY.

Any objections to this one? I just rebased this one and the full arch
series that sits on top for -git, but apart from that, no changes.

Thomas, it would be nice to know if you're good with patches 2+3 at this
point. Once we get outside of the merge window next week, I'll post
the updated series, since we get a few conflicts at this point, and it
would be great if you could carry this for 5.11.

-- 
Jens Axboe

[PATCH] drm: Add the missed device_unregister() in drm_sysfs_connector_add()

2020-10-22 Thread Jing Xiangfeng

drm_sysfs_connector_add() fails to call device_unregister() when
sysfs_create_link() fails. Add the missing call to fix it.

Fixes: e1a29c6c5955 ("drm: Add ddc link in sysfs created by drm_connector")
Signed-off-by: Jing Xiangfeng 
---
 drivers/gpu/drm/drm_sysfs.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/drm_sysfs.c b/drivers/gpu/drm/drm_sysfs.c
index f0336c804639..39e173e10cf7 100644
--- a/drivers/gpu/drm/drm_sysfs.c
+++ b/drivers/gpu/drm/drm_sysfs.c
@@ -274,6 +274,7 @@ static const struct attribute_group *connector_dev_groups[] = {
 int drm_sysfs_connector_add(struct drm_connector *connector)
 {
struct drm_device *dev = connector->dev;
+   int ret = 0;
 
if (connector->kdev)
return 0;
@@ -291,10 +292,16 @@ int drm_sysfs_connector_add(struct drm_connector *connector)
return PTR_ERR(connector->kdev);
}
 
-   if (connector->ddc)
-   return sysfs_create_link(&connector->kdev->kobj,
+   if (connector->ddc) {
+   ret = sysfs_create_link(&connector->kdev->kobj,
 &connector->ddc->dev.kobj, "ddc");
-   return 0;
+   if (ret) {
+   device_unregister(connector->kdev);
+   connector->kdev = NULL;
+   }
+   }
+
+   return ret;
 }
 
 void drm_sysfs_connector_remove(struct drm_connector *connector)
-- 
2.17.1

Re: [PATCH] perf trace beauty: Allow header files in a different path

2020-10-22 Thread Ian Rogers

On Thu, Oct 22, 2020 at 7:06 PM Namhyung Kim  wrote:
>
> The current script that generates the mmap flags and prot tables checks for
> headers included from the uapi/asm-generic directory, but they might come
> from a different directory in some environments.  So change the pattern to
> accept that.
>
> Signed-off-by: Namhyung Kim 

Acked-by: Ian Rogers 

Thanks,
Ian

> ---
>  tools/perf/trace/beauty/mmap_flags.sh | 4 ++--
>  tools/perf/trace/beauty/mmap_prot.sh  | 2 +-
>  2 files changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/tools/perf/trace/beauty/mmap_flags.sh b/tools/perf/trace/beauty/mmap_flags.sh
> index 39eb2595983b..76825710c725 100755
> --- a/tools/perf/trace/beauty/mmap_flags.sh
> +++ b/tools/perf/trace/beauty/mmap_flags.sh
> @@ -28,12 +28,12 @@ egrep -q $regex ${linux_mman} && \
> egrep -vw 'MAP_(UNINITIALIZED|TYPE|SHARED_VALIDATE)' | \
> sed -r "s/$regex/\2 \1 \1 \1 \2/g" | \
> xargs printf "\t[ilog2(%s) + 1] = \"%s\",\n#ifndef MAP_%s\n#define MAP_%s %s\n#endif\n")
> -([ ! -f ${arch_mman} ] || egrep -q '#[[:space:]]*include[[:space:]]+<uapi/asm-generic/mman.*' ${arch_mman}) &&
> +([ ! -f ${arch_mman} ] || egrep -q '#[[:space:]]*include[[:space:]]+.*uapi/asm-generic/mman.*' ${arch_mman}) &&
>  (egrep $regex ${header_dir}/mman-common.h | \
> egrep -vw 'MAP_(UNINITIALIZED|TYPE|SHARED_VALIDATE)' | \
> sed -r "s/$regex/\2 \1 \1 \1 \2/g"  | \
> xargs printf "\t[ilog2(%s) + 1] = \"%s\",\n#ifndef MAP_%s\n#define MAP_%s %s\n#endif\n")
> -([ ! -f ${arch_mman} ] || egrep -q '#[[:space:]]*include[[:space:]]+<uapi/asm-generic/mman.h>.*' ${arch_mman}) &&
> +([ ! -f ${arch_mman} ] || egrep -q '#[[:space:]]*include[[:space:]]+.*uapi/asm-generic/mman.h>.*' ${arch_mman}) &&
>  (egrep $regex ${header_dir}/mman.h | \
> sed -r "s/$regex/\2 \1 \1 \1 \2/g"  | \
> xargs printf "\t[ilog2(%s) + 1] = \"%s\",\n#ifndef MAP_%s\n#define MAP_%s %s\n#endif\n")
> diff --git a/tools/perf/trace/beauty/mmap_prot.sh b/tools/perf/trace/beauty/mmap_prot.sh
> index 28f638f8d216..664d8d534a50 100755
> --- a/tools/perf/trace/beauty/mmap_prot.sh
> +++ b/tools/perf/trace/beauty/mmap_prot.sh
> @@ -17,7 +17,7 @@ prefix="PROT"
>
>  printf "static const char *mmap_prot[] = {\n"
>  regex=`printf '^[[:space:]]*#[[:space:]]*define[[:space:]]+%s_([[:alnum:]_]+)[[:space:]]+(0x[[:xdigit:]]+)[[:space:]]*.*' ${prefix}`
> -([ ! -f ${arch_mman} ] || egrep -q '#[[:space:]]*include[[:space:]]+<uapi/asm-generic/mman.*' ${arch_mman}) &&
> +([ ! -f ${arch_mman} ] || egrep -q '#[[:space:]]*include[[:space:]]+.*uapi/asm-generic/mman.*' ${arch_mman}) &&
>  (egrep $regex ${common_mman} | \
> egrep -vw PROT_NONE | \
> sed -r "s/$regex/\2 \1 \1 \1 \2/g"  | \
> --
> 2.29.0.rc1.297.gfa9743e501-goog
>
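The effect of the pattern change can be checked with POSIX extended regular expressions, the dialect egrep uses. The sketch below compiles the old and new include patterns with regcomp()/regexec() and tests them against two include lines. The second line's path prefix is a made-up example of a header that is not included as `<uapi/...>`.

```c
#include <assert.h>
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

#define OLD_PAT "#[[:space:]]*include[[:space:]]+<uapi/asm-generic/mman.*"
#define NEW_PAT "#[[:space:]]*include[[:space:]]+.*uapi/asm-generic/mman.*"

/* Returns true if 'line' contains a match for the extended regex 'pat'. */
static bool ere_matches(const char *pat, const char *line)
{
	regex_t re;
	bool hit;

	if (regcomp(&re, pat, REG_EXTENDED | REG_NOSUB) != 0)
		return false;
	hit = regexec(&re, line, 0, NULL, 0) == 0;
	regfree(&re);
	return hit;
}
```

The old pattern requires `<uapi` right after `include`, so any extra path prefix defeats it. The new one accepts anything before `uapi/asm-generic/mman`.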

Re: [PATCH] selftests/powerpc/eeh: disable kselftest timeout setting for eeh-basic

2020-10-22 Thread Michael Ellerman

Po-Hsu Lin  writes:
> The eeh-basic test has its own 60-second timeout per breakable device
> (defined in commit 414f50434aa2 "selftests/eeh: Bump EEH wait time to
> 60s").
>
> We have found that the number of breakable devices varies across
> hardware, and the device recovery time ranges from 0 to 35 seconds. In
> our test pool, the test takes about 30 seconds on a Power8 system with
> 5 breakable devices and 60 seconds on a Power9 system with 4 breakable
> devices.
>
> Thus it's better to disable the default 45-second timeout in the
> kselftest framework to give the test a chance to finish, and let the
> test handle timeout control itself.

I'd prefer if we still had some timeout, maybe 5 or 10 minutes? Just in
case the test goes completely bonkers.

cheers

> diff --git a/tools/testing/selftests/powerpc/eeh/Makefile b/tools/testing/selftests/powerpc/eeh/Makefile
> index b397bab..ae963eb 100644
> --- a/tools/testing/selftests/powerpc/eeh/Makefile
> +++ b/tools/testing/selftests/powerpc/eeh/Makefile
> @@ -3,7 +3,7 @@ noarg:
>   $(MAKE) -C ../
>  
>  TEST_PROGS := eeh-basic.sh
> -TEST_FILES := eeh-functions.sh
> +TEST_FILES := eeh-functions.sh settings
>  
>  top_srcdir = ../../../../..
>  include ../../lib.mk
> diff --git a/tools/testing/selftests/powerpc/eeh/settings b/tools/testing/selftests/powerpc/eeh/settings
> new file mode 100644
> index 000..e7b9417
> --- /dev/null
> +++ b/tools/testing/selftests/powerpc/eeh/settings
> @@ -0,0 +1 @@
> +timeout=0
> -- 
> 2.7.4

Re: [External] Re: [PATCH] nvme-rdma: handle nvme completion data length

2020-10-22 Thread Chao Leng





On 2020/10/22 18:05, zhenwei pi wrote:

On 10/22/20 5:55 PM, Chao Leng wrote:



On 2020/10/22 16:38, zhenwei pi wrote:

Hit a kernel warning:
refcount_t: underflow; use-after-free.
WARNING: CPU: 0 PID: 0 at lib/refcount.c:28

RIP: 0010:refcount_warn_saturate+0xd9/0xe0
Call Trace:
  
  nvme_rdma_recv_done+0xf3/0x280 [nvme_rdma]
  __ib_process_cq+0x76/0x150 [ib_core]
  ...

The reason is that a zero-byte message is received from the target and
the host continues processing it without checking its length, so the
previous CQE is processed twice.

Check the data length: ignore zero-byte messages, and try to recover in
the corrupted-CQE case.

Signed-off-by: zhenwei pi 
---
  drivers/nvme/host/rdma.c | 11 +++
  1 file changed, 11 insertions(+)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 9e378d0a0c01..9f5112040d43 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1767,6 +1767,17 @@ static void nvme_rdma_recv_done(struct ib_cq *cq, struct 
ib_wc *wc)
  return;
  }
+    if (unlikely(!wc->byte_len)) {
+    /* zero bytes message could be ignored */
+    return;

Resource leak: this path still needs nvme_rdma_post_recv().

+    } else if (unlikely(wc->byte_len < len)) {
+    /* Corrupted completion, try to recover */
+    dev_err(queue->ctrl->ctrl.device,
+    "Unexpected nvme completion length(%d)\n", wc->byte_len);
+    nvme_rdma_error_recovery(queue->ctrl);
+    return;
+    }

!wc->byte_len and wc->byte_len < len may be the same type of anomaly,
so why handle them differently?
In which scenario is a zero-byte message received from the target: a
fault-injection test or a normal test/run?


A zero-byte message could be used as a transport-layer keep-alive mechanism
(I'm also developing a target-side transport-layer keep-alive now: to
reclaim resources, the target side needs to close dead connections even
when kato is set to 0).

The NVMe over Fabrics protocol does not define this.
Maybe an async event is an option for target keep-alive (if kato is set to 0).



+
  ib_dma_sync_single_for_cpu(ibdev, qe->dma, len, DMA_FROM_DEVICE);
  /*
   * AEN requests are special as they don't time out and can
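The review thread comes down to three outcomes per receive completion. The sketch below is a hypothetical classifier with invented names. `MODEL_CQE_LEN` stands for sizeof(struct nvme_completion), a 16-byte NVMe CQE. The sketch encodes Chao Leng's point that even the ignored zero-byte case must still repost the receive buffer (nvme_rdma_post_recv() in the driver); otherwise each such message leaks one receive entry.

```c
#include <assert.h>
#include <stdint.h>

enum recv_action {
	RECV_PROCESS,		/* full CQE: handle it, then repost */
	RECV_REPOST_ONLY,	/* zero-byte message: ignore, but repost */
	RECV_ERROR_RECOVERY,	/* truncated CQE: treat the queue as broken */
};

#define MODEL_CQE_LEN 16u	/* size of an NVMe completion queue entry */

/* Hypothetical classification of wc->byte_len for a receive completion. */
static enum recv_action classify_recv(uint32_t byte_len)
{
	if (byte_len == 0)
		return RECV_REPOST_ONLY;
	if (byte_len < MODEL_CQE_LEN)
		return RECV_ERROR_RECOVERY;
	return RECV_PROCESS;
}

/* Posted receive buffers after handling one completion. */
static unsigned int posted_after(unsigned int posted, enum recv_action a)
{
	assert(posted > 0);
	posted--;			/* the completion consumed one buffer */
	if (a != RECV_ERROR_RECOVERY)
		posted++;		/* every non-recovery path puts one back */
	return posted;
}
```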

Re: [PATCH v3 2/6] docs: lockdep-design: fix some warning issues

2020-10-22 Thread Boqun Feng

On Wed, Oct 21, 2020 at 02:17:23PM +0200, Mauro Carvalho Chehab wrote:
> There are several warnings caused by a recent change
> 224ec489d3cd ("lockdep/Documention: Recursive read lock detection reasoning")
> 
> Those are reported by htmldocs build:
> 
> Documentation/locking/lockdep-design.rst:429: WARNING: Definition list ends without a blank line; unexpected unindent.
> Documentation/locking/lockdep-design.rst:452: WARNING: Block quote ends without a blank line; unexpected unindent.
> Documentation/locking/lockdep-design.rst:453: WARNING: Unexpected indentation.
> Documentation/locking/lockdep-design.rst:453: WARNING: Blank line required after table.
> Documentation/locking/lockdep-design.rst:454: WARNING: Block quote ends without a blank line; unexpected unindent.
> Documentation/locking/lockdep-design.rst:455: WARNING: Unexpected indentation.
> Documentation/locking/lockdep-design.rst:455: WARNING: Blank line required after table.
> Documentation/locking/lockdep-design.rst:456: WARNING: Block quote ends without a blank line; unexpected unindent.
> Documentation/locking/lockdep-design.rst:457: WARNING: Unexpected indentation.
> Documentation/locking/lockdep-design.rst:457: WARNING: Blank line required after table.
> 
> Besides the reported issues, there are some missing blank
> lines that ended up producing wrong html output, and some
> literals are not properly identified.
> 
> Also, the symbols used in the irq enabled/disabled table
> are not displayed as expected, as they're not literals.
> Also, another table uses a different notation.
> 
> Fixes: 224ec489d3cd ("lockdep/Documention: Recursive read lock detection 
> reasoning")
> Signed-off-by: Mauro Carvalho Chehab 

Acked-by: Boqun Feng 

Regards,
Boqun

> ---
>  Documentation/locking/lockdep-design.rst | 51 ++--
>  1 file changed, 31 insertions(+), 20 deletions(-)
> 
> diff --git a/Documentation/locking/lockdep-design.rst 
> b/Documentation/locking/lockdep-design.rst
> index cec03bd1294a..9f3cfca9f8a4 100644
> --- a/Documentation/locking/lockdep-design.rst
> +++ b/Documentation/locking/lockdep-design.rst
> @@ -42,6 +42,7 @@ The validator tracks lock-class usage history and divides 
> the usage into
>  (4 usages * n STATEs + 1) categories:
>  
>  where the 4 usages can be:
> +
>  - 'ever held in STATE context'
>  - 'ever held as readlock in STATE context'
>  - 'ever held with STATE enabled'
> @@ -49,10 +50,12 @@ where the 4 usages can be:
>  
>  where the n STATEs are coded in kernel/locking/lockdep_states.h and as of
>  now they include:
> +
>  - hardirq
>  - softirq
>  
>  where the last 1 category is:
> +
>  - 'ever used'   [ == !unused]
>  
>  When locking rules are violated, these usage bits are presented in the
> @@ -96,9 +99,9 @@ exact case is for the lock as of the reporting time.
>+--+-+--+
>|  | irq enabled | irq disabled |
>+--+-+--+
> -  | ever in irq  |  ?  |   -  |
> +  | ever in irq  | '?' |  '-' |
>+--+-+--+
> -  | never in irq |  +  |   .  |
> +  | never in irq | '+' |  '.' |
>+--+-+--+
>  
>  The character '-' suggests irq is disabled because if otherwise the
> @@ -216,7 +219,7 @@ looks like this::
> BD_MUTEX_PARTITION
>};
>  
> -mutex_lock_nested(>bd_contains->bd_mutex, BD_MUTEX_PARTITION);
> +  mutex_lock_nested(>bd_contains->bd_mutex, BD_MUTEX_PARTITION);
>  
>  In this case the locking is done on a bdev object that is known to be a
>  partition.
> @@ -334,7 +337,7 @@ Troubleshooting:
>  
>  
>  The validator tracks a maximum of MAX_LOCKDEP_KEYS number of lock classes.
> -Exceeding this number will trigger the following lockdep warning:
> +Exceeding this number will trigger the following lockdep warning::
>  
>   (DEBUG_LOCKS_WARN_ON(id >= MAX_LOCKDEP_KEYS))
>  
> @@ -420,7 +423,8 @@ the critical section of another reader of the same lock 
> instance.
>  
>  The difference between recursive readers and non-recursive readers is 
> because:
>  recursive readers get blocked only by a write lock *holder*, while 
> non-recursive
> -readers could get blocked by a write lock *waiter*. Considering the follow 
> example:
> +readers could get blocked by a write lock *waiter*. Considering the follow
> +example::
>  
>   TASK A: TASK B:
>  
> @@ -448,20 +452,22 @@ There are simply four block conditions:
>  
>  Block condition matrix, Y means the row blocks the column, and N means 
> otherwise.
>  
> - | E | r | R |
>   +---+---+---+---+
> -   E | Y | Y | Y |
> + |   | E | r | R |
>   +---+---+---+---+
> -   r | Y | Y | N |
> + | E | Y | Y | Y |
> + +---+---+---+---+
> + | r
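The irq usage table that this patch reformats packs two facts about a lock class into a single character. A minimal C sketch of that mapping, using a hypothetical helper name (irq_usage_char is not the lockdep function that prints these characters):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Mirror of the "ever in irq" x "irq enabled/disabled" table in
 * Documentation/locking/lockdep-design.rst. Hypothetical helper.
 */
static inline char irq_usage_char(bool ever_in_irq, bool ever_irq_enabled)
{
	if (ever_in_irq)
		return ever_irq_enabled ? '?' : '-';
	return ever_irq_enabled ? '+' : '.';
}
```

So '-' means the lock was taken in irq context (and therefore with irqs disabled there), and '?' flags the dangerous combination of both usages.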

[PATCH] perf trace beauty: Allow header files in a different path

2020-10-22 Thread Namhyung Kim

The current scripts that generate the mmap flags and prot tables check
for headers included from the uapi/asm-generic directory, but the header
might come from a different directory in some environments.  So change
the pattern to accept that.

Signed-off-by: Namhyung Kim 
---
 tools/perf/trace/beauty/mmap_flags.sh | 4 ++--
 tools/perf/trace/beauty/mmap_prot.sh  | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/tools/perf/trace/beauty/mmap_flags.sh 
b/tools/perf/trace/beauty/mmap_flags.sh
index 39eb2595983b..76825710c725 100755
--- a/tools/perf/trace/beauty/mmap_flags.sh
+++ b/tools/perf/trace/beauty/mmap_flags.sh
@@ -28,12 +28,12 @@ egrep -q $regex ${linux_mman} && \
egrep -vw 'MAP_(UNINITIALIZED|TYPE|SHARED_VALIDATE)' | \
sed -r "s/$regex/\2 \1 \1 \1 \2/g" | \
xargs printf "\t[ilog2(%s) + 1] = \"%s\",\n#ifndef MAP_%s\n#define 
MAP_%s %s\n#endif\n")
-([ ! -f ${arch_mman} ] || egrep -q 
'#[[:space:]]*include[[:space:]]+.*' ${arch_mman}) &&
+([ ! -f ${arch_mman} ] || egrep -q 
'#[[:space:]]*include[[:space:]]+.*uapi/asm-generic/mman.h>.*' ${arch_mman}) &&
 (egrep $regex ${header_dir}/mman.h | \
sed -r "s/$regex/\2 \1 \1 \1 \2/g"  | \
xargs printf "\t[ilog2(%s) + 1] = \"%s\",\n#ifndef MAP_%s\n#define 
MAP_%s %s\n#endif\n")
diff --git a/tools/perf/trace/beauty/mmap_prot.sh 
b/tools/perf/trace/beauty/mmap_prot.sh
index 28f638f8d216..664d8d534a50 100755
--- a/tools/perf/trace/beauty/mmap_prot.sh
+++ b/tools/perf/trace/beauty/mmap_prot.sh
@@ -17,7 +17,7 @@ prefix="PROT"
 
 printf "static const char *mmap_prot[] = {\n"
 regex=`printf 
'^[[:space:]]*#[[:space:]]*define[[:space:]]+%s_([[:alnum:]_]+)[[:space:]]+(0x[[:xdigit:]]+)[[:space:]]*.*'
 ${prefix}`
-([ ! -f ${arch_mman} ] || egrep -q 
'#[[:space:]]*include[[:space:]]+
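The new mmap_flags.sh pattern only requires that the include line end in `uapi/asm-generic/mman.h>`, with any prefix before it. A minimal sketch using POSIX regcomp()/regexec(), which use the same extended-regex syntax as egrep and likewise search for a match anywhere in the line; the function name is hypothetical:

```c
#include <assert.h>
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Does this arch mman.h line include the generic uapi header? */
static inline bool includes_generic_mman(const char *line)
{
	static const char *pattern =
		"#[[:space:]]*include[[:space:]]+.*uapi/asm-generic/mman.h>.*";
	regex_t re;
	bool match;

	if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
		return false;
	match = regexec(&re, line, 0, NULL, 0) == 0;
	regfree(&re);
	return match;
}
```

For example, `#include <uapi/asm-generic/mman.h>` and a hypothetical prefixed path such as `#include <../../include/uapi/asm-generic/mman.h>` both match, while `#include <asm-generic/mman.h>` does not.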

[GIT PULL] Arch/task_work cleanup

2020-10-22 Thread Jens Axboe

Hi Linus,

Two cleanups that don't fit other categories:

- Finally get the task_work_add() cleanup done properly, so we don't
  have random 0/1/false/true/TWA_SIGNAL confusing use cases. Updates all
  callers, and also fixes up the documentation for task_work_add().

- While working on some TIF related changes for 5.11, this
  TIF_NOTIFY_RESUME cleanup fell out of that. Remove some arch
  duplication for how that is handled.

Please pull!


The following changes since commit 324bcf54c449c7b5b7024c9fa4549fbaaae1935d:

  mm: use limited read-ahead to satisfy read (2020-10-17 13:49:08 -0600)

are available in the Git repository at:

  git://git.kernel.dk/linux-block.git tags/arch-cleanup-2020-10-22

for you to fetch changes up to 91989c707884ecc7cd537281ab1a4b8fb7219da3:

  task_work: cleanup notification modes (2020-10-17 15:05:30 -0600)


arch-cleanup-2020-10-22


Jens Axboe (2):
  tracehook: clear TIF_NOTIFY_RESUME in tracehook_notify_resume()
  task_work: cleanup notification modes

 arch/alpha/kernel/signal.c |  1 -
 arch/arc/kernel/signal.c   |  2 +-
 arch/arm/kernel/signal.c   |  1 -
 arch/arm64/kernel/signal.c |  1 -
 arch/c6x/kernel/signal.c   |  4 +---
 arch/csky/kernel/signal.c  |  1 -
 arch/h8300/kernel/signal.c |  4 +---
 arch/hexagon/kernel/process.c  |  1 -
 arch/ia64/kernel/process.c |  2 +-
 arch/m68k/kernel/signal.c  |  2 +-
 arch/microblaze/kernel/signal.c|  2 +-
 arch/mips/kernel/signal.c  |  1 -
 arch/nds32/kernel/signal.c |  4 +---
 arch/nios2/kernel/signal.c |  2 +-
 arch/openrisc/kernel/signal.c  |  1 -
 arch/parisc/kernel/signal.c|  4 +---
 arch/powerpc/kernel/signal.c   |  1 -
 arch/riscv/kernel/signal.c |  4 +---
 arch/s390/kernel/signal.c  |  1 -
 arch/sh/kernel/signal_32.c |  4 +---
 arch/sparc/kernel/signal_32.c  |  4 +---
 arch/sparc/kernel/signal_64.c  |  4 +---
 arch/um/kernel/process.c   |  2 +-
 arch/x86/kernel/cpu/mce/core.c |  2 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c |  2 +-
 arch/xtensa/kernel/signal.c|  2 +-
 drivers/acpi/apei/ghes.c   |  2 +-
 drivers/android/binder.c   |  2 +-
 fs/file_table.c|  2 +-
 fs/io_uring.c  | 13 +++--
 fs/namespace.c |  2 +-
 include/linux/task_work.h  | 11 ---
 include/linux/tracehook.h  |  4 ++--
 kernel/entry/common.c  |  1 -
 kernel/entry/kvm.c |  4 +---
 kernel/events/uprobes.c|  2 +-
 kernel/irq/manage.c|  2 +-
 kernel/sched/fair.c|  2 +-
 kernel/task_work.c | 30 --
 security/keys/keyctl.c |  2 +-
 security/yama/yama_lsm.c   |  2 +-
 41 files changed, 64 insertions(+), 76 deletions(-)

-- 
Jens Axboe

Re: [PATCH net RFC] net: Clear IFF_TX_SKB_SHARING for all Ethernet devices using skb_padto

2020-10-22 Thread Xie He

On Thu, Oct 22, 2020 at 5:44 PM Jakub Kicinski  wrote:
>
> On Thu, 22 Oct 2020 12:59:45 -0700 Xie He wrote:
> >
> > But I also see some drivers that want to pad the skb to a strange
> > length, and don't set their special min_mtu to match this length. For
> > example:
> >
> > drivers/net/ethernet/packetengines/yellowfin.c wants to pad the skb to
> > a dynamically calculated value.
> >
> > drivers/net/ethernet/ti/cpsw.c, cpsw_new.c and tlan.c want to pad the
> > skb to macro defined values.
> >
> > drivers/net/ethernet/intel/iavf/iavf_txrx.c wants to pad the skb to
> > IAVF_MIN_TX_LEN (17).
> >
> > drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c wants to pad the skb to 
> > 17.
>
> Hm, I see, that would be a slight loss of functionality if we started
> requiring 64B, for example, while the driver could in practice xmit
> 17B frames (would matter only to VFs, but nonetheless).

I think requiring the length to be at least some value won't solve the
problem for all drivers. For example:

drivers/net/ethernet/packetengines/yellowfin.c pads the skb to 32-byte
boundaries in the memory (no matter how long the length is).

drivers/net/ethernet/adaptec/starfire.c pads the skb so that the
length is multiples of 4.

drivers/net/ethernet/sun/cassini.c pads the skb to cp->min_frame_size,
which may be 255, 60, or 97.
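The point can be made concrete: not every padding rule is "at least N bytes", so a single minimum frame length cannot describe them all. The two hypothetical helpers below only compute padded lengths (real drivers call skb_padto() and related helpers, which may modify the skb); the values used are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Pad up to a fixed minimum frame length, e.g. a min_frame_size. */
static inline size_t pad_to_min(size_t len, size_t min_len)
{
	return len < min_len ? min_len : len;
}

/* Round the length up to the next multiple of align, e.g. 4. */
static inline size_t pad_to_multiple(size_t len, size_t align)
{
	assert(align > 0);
	return (len + align - 1) / align * align;
}
```

With a rounding rule, a 61-byte frame is padded to 64 bytes even though it already exceeds a 60-byte minimum, so such a driver could still modify a shared skb whatever minimum length the core enforces.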

> > Another solution I can think of is to add a "skb_shared" check to
> > "__skb_pad", so that if __skb_pad encounters a shared skb, it just
> > returns an error. The driver would think this is a memory allocation
> > failure. This way we can ensure shared skbs are not modified.
>
> I'm not sure if we want to be adding checks to __skb_pad() to handle
> what's effectively a pktgen specific condition.
>
> We could create a new field in struct netdevice for min_frame_len, but I
> think your patch is the simplest solution. Let's see if anyone objects.
>
> BTW it seems like there is more drivers which will need the flag
> cleared, e.g. drivers/net/ethernet/broadcom/bnxt/bnxt.c?

My patch isn't complete. Because there are so many drivers with this
problem, I feel it's hard to solve them all at once. So I only grepped
"skb_padto" under "drivers/net/ethernet". There are other drivers
under "ethernet" using "skb_pad", "skb_put_padto" or "eth_skb_pad".
There are also (fake) Ethernet drivers under "drivers/net/wireless". I
feel it would take a long time and be error-prone to fix them all,
so it would be best to find another solution.

RE: [PATCH v2 tty] tty: serial: fsl_lpuart: LS1021A has a FIFO size of 16 words, like LS1028A

2020-10-22 Thread Andy Duan

From: Vladimir Oltean  Sent: Friday, October 23, 2020 9:34 AM
> Prior to the commit that this one fixes, the FIFO size was derived from the
> read-only register LPUARTx_FIFO[TXFIFOSIZE] using the following
> formula:
> 
> TX FIFO size = 2 ^ (LPUARTx_FIFO[TXFIFOSIZE] - 1)
> 
> The documentation for LS1021A is a mess. Under chapter 26.1.3 LS1021A
> LPUART module special consideration, it mentions TXFIFO_SZ and RXFIFO_SZ
> being equal to 4, and in the register description for LPUARTx_FIFO, it shows 
> the
> out-of-reset value of TXFIFOSIZE and RXFIFOSIZE fields as "011", even though
> these registers read as "101" in reality.
> 
> And when LPUART on LS1021A was working, the "101" value did correspond to
> "16 datawords", by applying the formula above, even though the
> documentation is wrong again () and says that "101" means 64 datawords
> (hint: it doesn't).
> 
> So the "new" formula created by commit f77ebb241ce0 has all the premises of
> being wrong for LS1021A, because it relied only on false data and no actual
> experimentation.
> 
> Interestingly, in commit c2f448cff22a ("tty: serial: fsl_lpuart: add LS1028A
> support"), Michael Walle applied a workaround to this by manually setting the
> FIFO widths for LS1028A. It looks like the same values are used by LS1021A as
> well, in fact.
> 
> When the driver thinks that it has a deeper FIFO than it really has, getty 
> (user
> space) output gets truncated.
> 
> Many thanks to Michael for pointing out where to look.
> 
> Fixes: f77ebb241ce0 ("tty: serial: fsl_lpuart: correct the FIFO depth size")
> Suggested-by: Michael Walle 
> Signed-off-by: Vladimir Oltean 
> ---
> Changes in v2:
> Reworded commit message.

For the v2 with commit message change: 
Reviewed-by: Fugang Duan 
> 
>  drivers/tty/serial/fsl_lpuart.c | 13 +++--
>  1 file changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/tty/serial/fsl_lpuart.c 
> b/drivers/tty/serial/fsl_lpuart.c index
> ff4b88c637d0..bd047e1f9bea 100644
> --- a/drivers/tty/serial/fsl_lpuart.c
> +++ b/drivers/tty/serial/fsl_lpuart.c
> @@ -314,9 +314,10 @@ MODULE_DEVICE_TABLE(of, lpuart_dt_ids);
>  /* Forward declare this for the dma callbacks*/  static void
> lpuart_dma_tx_complete(void *arg);
> 
> -static inline bool is_ls1028a_lpuart(struct lpuart_port *sport)
> +static inline bool is_layerscape_lpuart(struct lpuart_port *sport)
>  {
> - return sport->devtype == LS1028A_LPUART;
> + return (sport->devtype == LS1021A_LPUART ||
> + sport->devtype == LS1028A_LPUART);
>  }
> 
>  static inline bool is_imx8qxp_lpuart(struct lpuart_port *sport) @@ -1701,11
> +1702,11 @@ static int lpuart32_startup(struct uart_port *port)
>   UARTFIFO_FIFOSIZE_MASK);
> 
>   /*
> -  * The LS1028A has a fixed length of 16 words. Although it supports the
> -  * RX/TXSIZE fields their encoding is different. Eg the reference manual
> -  * states 0b101 is 16 words.
> +  * The LS1021A and LS1028A have a fixed FIFO depth of 16 words.
> +  * Although they support the RX/TXSIZE fields, their encoding is
> +  * different. Eg the reference manual states 0b101 is 16 words.
>*/
> - if (is_ls1028a_lpuart(sport)) {
> + if (is_layerscape_lpuart(sport)) {
>   sport->rxfifo_size = 16;
>   sport->txfifo_size = 16;
>   sport->port.fifosize = sport->txfifo_size;
> --
> 2.25.1

RE: [EXT] [PATCH] tty: serial: fsl_lpuart: LS1021A has a FIFO size of 32 datawords

2020-10-22 Thread Andy Duan

From: Vladimir Oltean  Sent: Thursday, October 22, 2020 11:13 PM
> From: Vladimir Oltean 
> 
> Similar to the workaround applied by Michael Walle in commit c2f448cff22a
> ("tty: serial: fsl_lpuart: add LS1028A support"), it turns out that the
> LPUARTx_FIFO encoding for fields TXFIFOSIZE and RXFIFOSIZE is the same for
> LS1028A as for LS1021A.
> 
> The RXFIFOSIZE in the Layerscape SoCs is fixed at this value:
> 101 Receive FIFO/Buffer depth = 32 datawords.
> 
> When Andy Duan wrote the commit in Fixes: below, he assumed that the 101
> encoding means 64 datawords. But this is not true for Layerscape. So that
> commit broke LS1021A, and this patch is extending the workaround for LS1028A
> which appeared in the meantime, to fix that breakage.
> 
> When the driver thinks that it has a deeper FIFO than it really has, getty 
> (user
> space) output gets truncated.
> 
> Many thanks to Michael for suggesting this!
> 
> Fixes: f77ebb241ce0 ("tty: serial: fsl_lpuart: correct the FIFO depth size")
> Suggested-by: Michael Walle 
> Signed-off-by: Vladimir Oltean 
Layerscape has a different definition for the FIFO size.

Reviewed-by: Fugang Duan 
> ---
>  drivers/tty/serial/fsl_lpuart.c | 13 +++--
>  1 file changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/tty/serial/fsl_lpuart.c 
> b/drivers/tty/serial/fsl_lpuart.c index
> ff4b88c637d0..bd047e1f9bea 100644
> --- a/drivers/tty/serial/fsl_lpuart.c
> +++ b/drivers/tty/serial/fsl_lpuart.c
> @@ -314,9 +314,10 @@ MODULE_DEVICE_TABLE(of, lpuart_dt_ids);
>  /* Forward declare this for the dma callbacks*/  static void
> lpuart_dma_tx_complete(void *arg);
> 
> -static inline bool is_ls1028a_lpuart(struct lpuart_port *sport)
> +static inline bool is_layerscape_lpuart(struct lpuart_port *sport)
>  {
> -   return sport->devtype == LS1028A_LPUART;
> +   return (sport->devtype == LS1021A_LPUART ||
> +   sport->devtype == LS1028A_LPUART);
>  }
> 
>  static inline bool is_imx8qxp_lpuart(struct lpuart_port *sport) @@ -1701,11
> +1702,11 @@ static int lpuart32_startup(struct uart_port *port)
> 
> UARTFIFO_FIFOSIZE_MASK);
> 
> /*
> -* The LS1028A has a fixed length of 16 words. Although it supports
> the
> -* RX/TXSIZE fields their encoding is different. Eg the reference
> manual
> -* states 0b101 is 16 words.
> +* The LS1021A and LS1028A have a fixed FIFO depth of 16 words.
> +* Although they support the RX/TXSIZE fields, their encoding is
> +* different. Eg the reference manual states 0b101 is 16 words.
>  */
> -   if (is_ls1028a_lpuart(sport)) {
> +   if (is_layerscape_lpuart(sport)) {
> sport->rxfifo_size = 16;
> sport->txfifo_size = 16;
> sport->port.fifosize = sport->txfifo_size;
> --
> 2.25.1

[tip:auto-latest] BUILD SUCCESS 65609b26b21a169a05d1482db6c1b52d8a4abe0d

2020-10-22 Thread kernel test robot

omega2p_defconfig
ia64     gensparse_defconfig
arm      ebsa110_defconfig
powerpc  mvme5100_defconfig
arm      rpc_defconfig
powerpc  ppc64e_defconfig
ia64     allmodconfig
ia64     defconfig
ia64     allyesconfig
m68k     allmodconfig
m68k     defconfig
m68k     allyesconfig
nios2    defconfig
nds32    allnoconfig
c6x      allyesconfig
nds32    defconfig
nios2    allyesconfig
csky     defconfig
alpha    defconfig
alpha    allyesconfig
xtensa   allyesconfig
h8300    allyesconfig
s390     allyesconfig
parisc   allyesconfig
s390     defconfig
i386     allyesconfig
sparc    defconfig
i386     defconfig
mips     allyesconfig
mips     allmodconfig
powerpc  allyesconfig
powerpc  allmodconfig
powerpc  allnoconfig
i386     randconfig-a002-20201022
i386     randconfig-a005-20201022
i386     randconfig-a003-20201022
i386     randconfig-a001-20201022
i386     randconfig-a006-20201022
i386     randconfig-a004-20201022
i386     randconfig-a002-20201023
i386     randconfig-a005-20201023
i386     randconfig-a003-20201023
i386     randconfig-a001-20201023
i386     randconfig-a006-20201023
i386     randconfig-a004-20201023
x86_64   randconfig-a011-20201022
x86_64   randconfig-a013-20201022
x86_64   randconfig-a016-20201022
x86_64   randconfig-a015-20201022
x86_64   randconfig-a012-20201022
x86_64   randconfig-a014-20201022
i386     randconfig-a016-20201022
i386     randconfig-a014-20201022
i386     randconfig-a015-20201022
i386     randconfig-a012-20201022
i386     randconfig-a013-20201022
i386     randconfig-a011-20201022
riscv    nommu_k210_defconfig
riscv    allyesconfig
riscv    nommu_virt_defconfig
riscv    allnoconfig
riscv    defconfig
riscv    rv32_defconfig
riscv    allmodconfig
x86_64   rhel
x86_64   allyesconfig
x86_64   rhel-7.6-kselftests
x86_64   defconfig
x86_64   rhel-8.3
x86_64   kexec

clang tested configs:
x86_64   randconfig-a001-20201022
x86_64   randconfig-a002-20201022
x86_64   randconfig-a003-20201022
x86_64   randconfig-a006-20201022
x86_64   randconfig-a004-20201022
x86_64   randconfig-a005-20201022

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org

Re: [LTP] mmstress[1309]: segfault at 7f3d71a36ee8 ip 00007f3d77132bdf sp 00007f3d71a36ee8 error 4 in libc-2.27.so[7f3d77058000+1aa000]

2020-10-22 Thread Daniel Díaz

Hello!

On Thu, 22 Oct 2020 at 19:11, Linus Torvalds
 wrote:
> On Thu, Oct 22, 2020 at 4:43 PM Linus Torvalds
> Would you mind sending me the problematic vmlinux file in private (or,
> likely better - a pointer to some place I can download it, it's going
> to be huge).

The kernel Naresh originally referred to is here:
  https://builds.tuxbuild.com/SCI7Xyjb7V2NbfQ2lbKBZw/

Greetings!

Daniel Díaz
daniel.d...@linaro.org

[PATCH v2 tty] tty: serial: fsl_lpuart: LS1021A has a FIFO size of 16 words, like LS1028A

2020-10-22 Thread Vladimir Oltean

Prior to the commit that this one fixes, the FIFO size was derived from
the read-only register LPUARTx_FIFO[TXFIFOSIZE] using the following
formula:

TX FIFO size = 2 ^ (LPUARTx_FIFO[TXFIFOSIZE] - 1)

The documentation for LS1021A is a mess. Under chapter 26.1.3 LS1021A
LPUART module special consideration, it mentions TXFIFO_SZ and RXFIFO_SZ
being equal to 4, and in the register description for LPUARTx_FIFO, it
shows the out-of-reset value of TXFIFOSIZE and RXFIFOSIZE fields as "011",
even though these registers read as "101" in reality.

And when LPUART on LS1021A was working, the "101" value did correspond
to "16 datawords", by applying the formula above, even though the
documentation is wrong again () and says that "101" means 64 datawords
(hint: it doesn't).

So the "new" formula created by commit f77ebb241ce0 has all the premises
of being wrong for LS1021A, because it relied only on false data and no
actual experimentation.

Interestingly, in commit c2f448cff22a ("tty: serial: fsl_lpuart: add
LS1028A support"), Michael Walle applied a workaround to this by manually
setting the FIFO widths for LS1028A. It looks like the same values are
used by LS1021A as well, in fact.

When the driver thinks that it has a deeper FIFO than it really has,
getty (user space) output gets truncated.

Many thanks to Michael for pointing out where to look.

Fixes: f77ebb241ce0 ("tty: serial: fsl_lpuart: correct the FIFO depth size")
Suggested-by: Michael Walle 
Signed-off-by: Vladimir Oltean 
---
Changes in v2:
Reworded commit message.

 drivers/tty/serial/fsl_lpuart.c | 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/drivers/tty/serial/fsl_lpuart.c b/drivers/tty/serial/fsl_lpuart.c
index ff4b88c637d0..bd047e1f9bea 100644
--- a/drivers/tty/serial/fsl_lpuart.c
+++ b/drivers/tty/serial/fsl_lpuart.c
@@ -314,9 +314,10 @@ MODULE_DEVICE_TABLE(of, lpuart_dt_ids);
 /* Forward declare this for the dma callbacks*/
 static void lpuart_dma_tx_complete(void *arg);
 
-static inline bool is_ls1028a_lpuart(struct lpuart_port *sport)
+static inline bool is_layerscape_lpuart(struct lpuart_port *sport)
 {
-   return sport->devtype == LS1028A_LPUART;
+   return (sport->devtype == LS1021A_LPUART ||
+   sport->devtype == LS1028A_LPUART);
 }
 
 static inline bool is_imx8qxp_lpuart(struct lpuart_port *sport)
@@ -1701,11 +1702,11 @@ static int lpuart32_startup(struct uart_port *port)
UARTFIFO_FIFOSIZE_MASK);
 
/*
-* The LS1028A has a fixed length of 16 words. Although it supports the
-* RX/TXSIZE fields their encoding is different. Eg the reference manual
-* states 0b101 is 16 words.
+* The LS1021A and LS1028A have a fixed FIFO depth of 16 words.
+* Although they support the RX/TXSIZE fields, their encoding is
+* different. Eg the reference manual states 0b101 is 16 words.
 */
-   if (is_ls1028a_lpuart(sport)) {
+   if (is_layerscape_lpuart(sport)) {
sport->rxfifo_size = 16;
sport->txfifo_size = 16;
sport->port.fifosize = sport->txfifo_size;
-- 
2.25.1
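The bug hinges on how the 3-bit TXFIFOSIZE value 0b101 (5) is decoded. A minimal sketch comparing the two readings: the first is the formula quoted in the commit message above; the second is an assumed decoding chosen to reproduce the "0b101 means 64 datawords" reading, not code copied from fsl_lpuart.c. Both function names are hypothetical.

```c
#include <assert.h>

/* Formula from the commit message: depth = 2 ^ (TXFIFOSIZE - 1) datawords. */
static inline unsigned int fifo_depth_pre_fix(unsigned int enc)
{
	assert(enc >= 1 && enc <= 7);
	return 1u << (enc - 1);
}

/* Assumed documented encoding: 0 -> 1 dataword, otherwise 2 ^ (enc + 1). */
static inline unsigned int fifo_depth_doc(unsigned int enc)
{
	assert(enc <= 7);
	return enc ? 1u << (enc + 1) : 1u;
}
```

Decoding 5 gives 16 datawords with the first formula, matching what the hardware actually has, and 64 with the second. A driver that believes its FIFO is four times deeper than it is would be consistent with the truncated getty output described in the commit message.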

Re: [PATCH ghak90 V9 05/13] audit: log container info of syscalls

2020-10-22 Thread Paul Moore

On Wed, Oct 21, 2020 at 12:39 PM Richard Guy Briggs  wrote:
> Here is an example I was able to generate after updating the testsuite
> script to include a signalling example of a nested audit container
> identifier:
>
> 
> type=PROCTITLE msg=audit(2020-10-21 10:31:16.655:6731) : 
> proctitle=/usr/bin/perl -w containerid/test
> type=CONTAINER_ID msg=audit(2020-10-21 10:31:16.655:6731) : 
> contid=7129731255799087104^941723245477888
> type=OBJ_PID msg=audit(2020-10-21 10:31:16.655:6731) : opid=115583 oauid=root 
> ouid=root oses=1 obj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 
> ocomm=perl
> type=CONTAINER_ID msg=audit(2020-10-21 10:31:16.655:6731) : 
> contid=941723245477888
> type=OBJ_PID msg=audit(2020-10-21 10:31:16.655:6731) : opid=115580 oauid=root 
> ouid=root oses=1 obj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 
> ocomm=perl
> type=CONTAINER_ID msg=audit(2020-10-21 10:31:16.655:6731) : 
> contid=8098399240850112512^941723245477888
> type=OBJ_PID msg=audit(2020-10-21 10:31:16.655:6731) : opid=115582 oauid=root 
> ouid=root oses=1 obj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 
> ocomm=perl
> type=SYSCALL msg=audit(2020-10-21 10:31:16.655:6731) : arch=x86_64 
> syscall=kill success=yes exit=0 a0=0xfffe3c84 a1=SIGTERM a2=0x4d524554 a3=0x0 
> items=0 ppid=115564 pid=115567 auid=root uid=root gid=root euid=root 
> suid=root fsuid=root egid=root sgid=root fsgid=root tty=ttyS0 ses=1 comm=perl 
> exe=/usr/bin/perl subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 
> key=testsuite-1603290671-AcLtUulY
> 
>
> There are three CONTAINER_ID records which need some way of associating with 
> OBJ_PID records.  An additional CONTAINER_ID record would be present if the 
> killing process itself had an audit container identifier.  I think the most 
> obvious way to connect them is with a pid= field in the CONTAINER_ID record.

Using a "pid=" field as a way to link CONTAINER_ID records to other
records raises a few questions.  What happens if/when we need to
represent those PIDs in the context of a namespace?  Are we ever going
to need to link to records which don't have a "pid=" field?  I haven't
done the homework to know if either of these are a concern right now,
but I worry that this might become a problem in the future.

The idea of using something like "item=" is interesting.  As you
mention, the "item=" field does present some overlap problems with the
PATH record, but perhaps we can do something similar.  What if we
added a "record=" (or similar, I'm not worried about names at this
point) to each record, reset to 0/1 at the start of each event, and
when we needed to link records somehow we could add a "related=1,..,N"
field.  This would potentially be useful beyond just the audit
container ID work.
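A minimal sketch of the "record=" / "related=" idea, assuming one counter per event that restarts at 1 and a free-form related list. The struct and function names are hypothetical and, as the proposal itself says, the field names are placeholders; nothing here is an existing audit API.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Per-event state: the number to give the next record in this event. */
struct audit_event_sketch {
	unsigned int next_record;
};

/* Reset numbering at the start of each event. */
static inline void event_begin(struct audit_event_sketch *ev)
{
	ev->next_record = 1;
}

/* Format one record header; related may be NULL when nothing is linked. */
static inline unsigned int emit_record(struct audit_event_sketch *ev,
				       const char *type, const char *related,
				       char *buf, size_t len)
{
	unsigned int rec = ev->next_record++;

	if (related)
		snprintf(buf, len, "type=%s record=%u related=%s",
			 type, rec, related);
	else
		snprintf(buf, len, "type=%s record=%u", type, rec);
	return rec;
}
```

In the example event above, each CONTAINER_ID record could then carry a related= value naming the OBJ_PID record it describes, without depending on a pid= value.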

-- 
paul moore
www.paul-moore.com

Re: [PATCH] perf vendor events: Fix DRAM_BW_Use 0 issue for CLX/SKX

2020-10-22 Thread Ian Rogers

On Thu, Oct 22, 2020 at 5:54 PM Jin Yao  wrote:
>
> Ian reports an issue that the metric DRAM_BW_Use often remains 0.
>
> The metric expression for DRAM_BW_Use on CLX/SKX:
>
> "( 64 * ( uncore_imc@cas_count_read@ + uncore_imc@cas_count_write@ ) / 
> 10 ) / duration_time"
>
> The counts of uncore_imc/cas_count_read/ and uncore_imc/cas_count_write/
> are scaled up by 64, that is to turn a count of cache lines into bytes,
> the count is then divided by 10 to give GB.
>
> However, the counts of uncore_imc/cas_count_read/ and
> uncore_imc/cas_count_write/ have already been scaled.
>
> The scale values are from sysfs, such as
> /sys/devices/uncore_imc_0/events/cas_count_read.scale.
> It's 6.103515625e-5 (64 / 1024.0 / 1024.0).
>
> So if we use the original metric expression, the result is not correct.
>
> But the difficulty is, for SKL client, the counts are not scaled.
>
> The metric expression for DRAM_BW_Use on SKL:
>
> "64 * ( arb@event\\=0x81\\,umask\\=0x1@ + arb@event\\=0x84\\,umask\\=0x1@ ) / 
> 100 / duration_time / 1000"
>
> root@kbl-ppc:~# perf stat -M DRAM_BW_Use -a -- sleep 1
>
>  Performance counter stats for 'system wide':
>
>190  arb/event=0x84,umask=0x1/ # 1.86 DRAM_BW_Use
> 29,093,178  arb/event=0x81,umask=0x1/
>  1,000,703,287 ns   duration_time
>
>1.000703287 seconds time elapsed
>
> The result is expected.
>
> So the easy fix is just to change the metric expression for CLX/SKX.
> This patch changes the metric expression to:
>
> "( ( ( uncore_imc@cas_count_read@ + uncore_imc@cas_count_write@ ) * 1048576 ) 
> / 10 ) / duration_time"
>
> 1048576 = 1024 * 1024.
>
> Before (tested on CLX):
>
> root@lkp-csl-2sp5 ~# perf stat -M DRAM_BW_Use -a -- sleep 1
>
>  Performance counter stats for 'system wide':
>
> 765.35 MiB  uncore_imc/cas_count_read/ # 0.00 DRAM_BW_Use
>   5.42 MiB  uncore_imc/cas_count_write/
> 1001515088 ns   duration_time
>
>1.001515088 seconds time elapsed
>
> After:
>
> root@lkp-csl-2sp5 ~# perf stat -M DRAM_BW_Use -a -- sleep 1
>
>  Performance counter stats for 'system wide':
>
> 767.95 MiB  uncore_imc/cas_count_read/ # 0.80 DRAM_BW_Use

Nit, using ScaleUnit would allow this to be 0.80GB/s.

>   5.02 MiB  uncore_imc/cas_count_write/
> 1001900010 ns   duration_time
>
>1.001900010 seconds time elapsed
>
> Fixes: 038d3b53c284 ("perf vendor events intel: Update CascadelakeX events to 
> v1.08")
> Fixes: b5ff7f2799a4 ("perf vendor events: Update SkylakeX events to v1.21")
> Signed-off-by: Jin Yao 

Acked-by: Ian Rogers 

Thanks,
Ian

> ---
>  tools/perf/pmu-events/arch/x86/cascadelakex/clx-metrics.json | 2 +-
>  tools/perf/pmu-events/arch/x86/skylakex/skx-metrics.json | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/tools/perf/pmu-events/arch/x86/cascadelakex/clx-metrics.json 
> b/tools/perf/pmu-events/arch/x86/cascadelakex/clx-metrics.json
> index de3193552277..00f4fcffa815 100644
> --- a/tools/perf/pmu-events/arch/x86/cascadelakex/clx-metrics.json
> +++ b/tools/perf/pmu-events/arch/x86/cascadelakex/clx-metrics.json
> @@ -329,7 +329,7 @@
>  },
>  {
>  "BriefDescription": "Average external Memory Bandwidth Use for reads 
> and writes [GB / sec]",
> -"MetricExpr": "( 64 * ( uncore_imc@cas_count_read@ + 
> uncore_imc@cas_count_write@ ) / 10 ) / duration_time",
> +"MetricExpr": "( ( ( uncore_imc@cas_count_read@ + 
> uncore_imc@cas_count_write@ ) * 1048576 ) / 10 ) / duration_time",
>  "MetricGroup": "Memory_BW;SoC",
>  "MetricName": "DRAM_BW_Use"
>  },
> diff --git a/tools/perf/pmu-events/arch/x86/skylakex/skx-metrics.json 
> b/tools/perf/pmu-events/arch/x86/skylakex/skx-metrics.json
> index f31794d3b926..0dd8b13b5cfb 100644
> --- a/tools/perf/pmu-events/arch/x86/skylakex/skx-metrics.json
> +++ b/tools/perf/pmu-events/arch/x86/skylakex/skx-metrics.json
> @@ -323,7 +323,7 @@
>  },
>  {
>  "BriefDescription": "Average external Memory Bandwidth Use for reads 
> and writes [GB / sec]",
> -"MetricExpr": "( 64 * ( uncore_imc@cas_count_read@ + 
> uncore_imc@cas_count_write@ ) / 10 ) / duration_time",
> +"MetricExpr": "( ( ( uncore_imc@cas_count_read@ + 
> uncore_imc@cas_count_write@ ) * 1048576 ) / 10 ) / duration_time",
>  "MetricGroup": "Memory_BW;SoC",
>  "MetricName": "DRAM_BW_Use"
>  },
> --
> 2.17.1
>

Re: [PATCH v1 0/2] mm: cma: introduce a non-blocking version of cma_release()

2020-10-22 Thread Zi Yan

On 22 Oct 2020, at 20:47, Roman Gushchin wrote:

> On Thu, Oct 22, 2020 at 07:42:45PM -0400, Zi Yan wrote:
>> On 22 Oct 2020, at 18:53, Roman Gushchin wrote:
>>
>>> This small patchset introduces a non-blocking version of cma_release()
>>> and simplifies the code in hugetlbfs, where previously we had to
>>> temporarily drop hugetlb_lock around the cma_release() call.
>>>
>>> It should help Zi Yan on his work on 1 GB THPs: splitting a gigantic
>>> THP under a memory pressure requires a cma_release() call. If it's
>>
>> Thanks for the patch. But during 1GB THP split, we only clear
>> the bitmaps without releasing the pages. Also in cma_release_nowait(),
>> the first page in the allocated CMA region is reused to store
>> struct cma_clear_bitmap_work, but the same method cannot be used
>> during THP split, since the first page is still in-use. We might
>> need to allocate some new memory for struct cma_clear_bitmap_work,
>> which might not be successful under memory pressure. Any suggestion
>> on where to store struct cma_clear_bitmap_work when I only want to
>> clear bitmap without releasing the pages?
>
> It means we can't use cma_release() there either, because it does clear
> individual pages. We need to clear the cma bitmap without touching pages.
>
> Can you handle an error there?
>
> If so, we can introduce something like int cma_schedule_bitmap_clearance(),
> which will allocate a work structure and will be able to return -ENOMEM
> in the unlikely case of error.
>
> Will it work for you?

Yes, it works. Thanks.
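A userspace sketch of the proposed cma_schedule_bitmap_clearance() contract, assuming the work item is allocated separately (it cannot live in the first page of the range, which is still in use after a THP split), so the call can fail with -ENOMEM and the caller must handle it. All names and the list standing in for a workqueue are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Describes a CMA range whose bitmap bits should be cleared later. */
struct bitmap_clear_work {
	unsigned long pfn;
	unsigned long count;
	struct bitmap_clear_work *next;
};

static struct bitmap_clear_work *pending;	/* stand-in for a workqueue */

static inline int schedule_bitmap_clearance(unsigned long pfn,
					    unsigned long count)
{
	struct bitmap_clear_work *w = malloc(sizeof(*w));

	if (!w)
		return -ENOMEM;	/* caller must cope: the range stays allocated */
	w->pfn = pfn;
	w->count = count;
	w->next = pending;
	pending = w;
	return 0;
}
```

The design choice is the one discussed above: trading a non-failing interface for one that never touches the pages being split.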

—
Best Regards,
Yan Zi



[PATCH] perf vendor events: Fix DRAM_BW_Use 0 issue for CLX/SKX

2020-10-22 Thread Jin Yao

Ian reports an issue that the metric DRAM_BW_Use often remains 0.

The metric expression for DRAM_BW_Use on CLX/SKX:

"( 64 * ( uncore_imc@cas_count_read@ + uncore_imc@cas_count_write@ ) / 1000000000 ) / duration_time"

The counts of uncore_imc/cas_count_read/ and uncore_imc/cas_count_write/
are multiplied by 64 to turn a count of cache lines into bytes, and the
result is then divided by 1000000000 to give GB.

However, the counts of uncore_imc/cas_count_read/ and
uncore_imc/cas_count_write/ have already been scaled.

The scale values are from sysfs, such as
/sys/devices/uncore_imc_0/events/cas_count_read.scale.
It's 6.103515625e-5 (64 / 1024.0 / 1024.0).

So with the original metric expression, the result is incorrect.

The difficulty is that on SKL client, the counts are not scaled.

The metric expression for DRAM_BW_Use on SKL:

"64 * ( arb@event\\=0x81\\,umask\\=0x1@ + arb@event\\=0x84\\,umask\\=0x1@ ) / 1000000 / duration_time / 1000"

root@kbl-ppc:~# perf stat -M DRAM_BW_Use -a -- sleep 1

 Performance counter stats for 'system wide':

   190  arb/event=0x84,umask=0x1/ # 1.86 DRAM_BW_Use
29,093,178  arb/event=0x81,umask=0x1/
 1,000,703,287 ns   duration_time

   1.000703287 seconds time elapsed

The result is as expected.

So the easy fix is to change only the metric expression for CLX/SKX.
This patch changes the metric expression to:

"( ( ( uncore_imc@cas_count_read@ + uncore_imc@cas_count_write@ ) * 1048576 ) / 1000000000 ) / duration_time"

1048576 = 1024 * 1024.
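To make the unit arithmetic concrete, here is a small C sketch (not part of the patch) that evaluates the expressions by hand. It assumes, as described above, that perf has already multiplied each CAS count by the sysfs scale 6.103515625e-5 (64 / 1024 / 1024), so the CLX/SKX metric sees MiB, while the SKL arb counts are raw cache-line counts. It reads the GB divisor as 1000000000 and the SKL divisor as 1000000, which is what the "to give GB" description and the printed results imply. The function names are hypothetical.

```c
#include <assert.h>

/* sysfs scale for cas_count_read/write: 64 bytes per line, reported in MiB */
static const double CAS_SCALE = 64.0 / 1024.0 / 1024.0; /* == 6.103515625e-5 */

/* Convert a raw CAS count (cache lines) into the MiB value perf reports. */
static double cas_lines_to_mib(unsigned long long lines)
{
	return (double)lines * CAS_SCALE;
}

/* Original CLX/SKX expression applied to counts perf has already scaled. */
static double clx_bw_old(double cas_mib, double seconds)
{
	assert(seconds > 0.0);
	return (64.0 * cas_mib / 1000000000.0) / seconds;
}

/* Fixed CLX/SKX expression: MiB -> bytes (* 1048576) -> GB (/ 1e9). */
static double clx_bw_new(double cas_mib, double seconds)
{
	assert(seconds > 0.0);
	return ((cas_mib * 1048576.0) / 1000000000.0) / seconds;
}

/* SKL client expression: the arb counts are raw, unscaled cache lines. */
static double skl_bw(unsigned long long arb81, unsigned long long arb84,
		     double seconds)
{
	assert(seconds > 0.0);
	return 64.0 * (double)(arb81 + arb84) / 1000000.0 / seconds / 1000.0;
}
```

With the numbers from the runs in this mail, clx_bw_old(767.95 + 5.02, 1.0019) comes out around 5e-5 (printed as 0.00), clx_bw_new() with the same inputs around 0.8, and skl_bw(29093178, 190, 1.0007) around 1.86. Multiplying by 1048576 simply undoes the MiB scaling before converting bytes to GB.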

Before (tested on CLX):

root@lkp-csl-2sp5 ~# perf stat -M DRAM_BW_Use -a -- sleep 1

 Performance counter stats for 'system wide':

765.35 MiB  uncore_imc/cas_count_read/ # 0.00 DRAM_BW_Use
  5.42 MiB  uncore_imc/cas_count_write/
1001515088 ns   duration_time

   1.001515088 seconds time elapsed

After:

root@lkp-csl-2sp5 ~# perf stat -M DRAM_BW_Use -a -- sleep 1

 Performance counter stats for 'system wide':

767.95 MiB  uncore_imc/cas_count_read/ # 0.80 DRAM_BW_Use
  5.02 MiB  uncore_imc/cas_count_write/
1001900010 ns   duration_time

   1.001900010 seconds time elapsed

Fixes: 038d3b53c284 ("perf vendor events intel: Update CascadelakeX events to v1.08")
Fixes: b5ff7f2799a4 ("perf vendor events: Update SkylakeX events to v1.21")
Signed-off-by: Jin Yao 
---
 tools/perf/pmu-events/arch/x86/cascadelakex/clx-metrics.json | 2 +-
 tools/perf/pmu-events/arch/x86/skylakex/skx-metrics.json | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/perf/pmu-events/arch/x86/cascadelakex/clx-metrics.json b/tools/perf/pmu-events/arch/x86/cascadelakex/clx-metrics.json
index de3193552277..00f4fcffa815 100644
--- a/tools/perf/pmu-events/arch/x86/cascadelakex/clx-metrics.json
+++ b/tools/perf/pmu-events/arch/x86/cascadelakex/clx-metrics.json
@@ -329,7 +329,7 @@
 },
 {
 "BriefDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]",
-"MetricExpr": "( 64 * ( uncore_imc@cas_count_read@ + uncore_imc@cas_count_write@ ) / 1000000000 ) / duration_time",
+"MetricExpr": "( ( ( uncore_imc@cas_count_read@ + uncore_imc@cas_count_write@ ) * 1048576 ) / 1000000000 ) / duration_time",
 "MetricGroup": "Memory_BW;SoC",
 "MetricName": "DRAM_BW_Use"
 },
diff --git a/tools/perf/pmu-events/arch/x86/skylakex/skx-metrics.json b/tools/perf/pmu-events/arch/x86/skylakex/skx-metrics.json
index f31794d3b926..0dd8b13b5cfb 100644
--- a/tools/perf/pmu-events/arch/x86/skylakex/skx-metrics.json
+++ b/tools/perf/pmu-events/arch/x86/skylakex/skx-metrics.json
@@ -323,7 +323,7 @@
 },
 {
 "BriefDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]",
-"MetricExpr": "( 64 * ( uncore_imc@cas_count_read@ + uncore_imc@cas_count_write@ ) / 1000000000 ) / duration_time",
+"MetricExpr": "( ( ( uncore_imc@cas_count_read@ + uncore_imc@cas_count_write@ ) * 1048576 ) / 1000000000 ) / duration_time",
 "MetricGroup": "Memory_BW;SoC",
 "MetricName": "DRAM_BW_Use"
 },
-- 
2.17.1

Re: [PATCH v2 5/5] scsi: ufs: fix clkgating on/off correctly

2020-10-22 Thread Jaegeuk Kim

On 10/21, Can Guo wrote:
> On 2020-10-21 12:52, jaeg...@kernel.org wrote:
> > On 10/21, Can Guo wrote:
> > > On 2020-10-21 03:52, Jaegeuk Kim wrote:
> > > > The below call stack prevents clk_gating at every IO completion.
> > > > We can remove the condition, ufshcd_any_tag_in_use(), since
> > > > clkgating_work will check it again.
> > > >
> > > 
> > > I think checking ufshcd_any_tag_in_use() in either ufshcd_release() or
> > > gate_work() can break UFS clk gating's functionality.
> > > 
> > > ufshcd_any_tag_in_use() was introduced to replace hba->lrb_in_use.
> > > However, they are not exactly the same: ufshcd_any_tag_in_use() returns
> > > true if any tag assigned by the block layer is still in use, but tags
> > > are released asynchronously (through block softirq), meaning it does not
> > > reflect the real occupation of the UFS host. That is, after the UFS host
> > > finishes all tasks, ufshcd_any_tag_in_use() can still return true.
> > > 
> > > This change only removes the check of ufshcd_any_tag_in_use() in
> > > ufshcd_release(), but having the check in gate_work() can still prevent
> > > gating from happening. The current change may work for you because the
> > > tags are released before hba->clk_gating.delay_ms expires, but if
> > > hba->clk_gating.delay_ms is shorter or block softirq is somehow delayed,
> > > gate_work() may see ufshcd_any_tag_in_use() return true. What do you
> > > think?
> > 
> > I don't think this breaks clkgating; it fixes the wrong condition check
> > which prevented gate_work from running at all. As you mentioned, even if
> > this schedules gate_work under racy conditions, gate_work will handle it
> > as a last resort.
> > 
> 
> If clocks cannot be gated after the last task is cleared from the UFS host,
> then clk gating is broken, no? Assume UFS has completed the last task in its
> queue. As this change says, ufshcd_any_tag_in_use() is preventing
> ufshcd_release() from invoking gate_work(). Similarly,
> ufshcd_any_tag_in_use() can prevent gate_work() from doing its real work,
> disabling the clocks. Do you agree?
> 
> if (hba->clk_gating.active_reqs
> || hba->ufshcd_state != UFSHCD_STATE_OPERATIONAL
> || ufshcd_any_tag_in_use(hba) || hba->outstanding_tasks
> || hba->active_uic_cmd || hba->uic_async_done)
> goto rel_lock;

I see the point, but this happens only when clkgate_delay_ms is too short
to give enough time for releasing the tag. If it's set correctly, I think
there'd be no problem, unless the softirq was delayed by other RT threads,
which is just a corner case though.

> 
> Thanks,
> 
> Can Guo.
> 
> > > 
> > > Thanks,
> > > 
> > > Can Guo.
> > > 
> > > In __ufshcd_transfer_req_compl(), hba->lrb_in_use is cleared
> > > immediately when the UFS driver finishes all tasks.
> > > 
> > > > ufshcd_complete_requests(struct ufs_hba *hba)
> > > >   ufshcd_transfer_req_compl()
> > > > __ufshcd_transfer_req_compl()
> > > >   __ufshcd_release(hba)
> > > > if (ufshcd_any_tag_in_use() == 1)
> > > >return;
> > > >   ufshcd_tmc_handler(hba);
> > > > blk_mq_tagset_busy_iter();
> > > >
> > > > Cc: Alim Akhtar 
> > > > Cc: Avri Altman 
> > > > Cc: Can Guo 
> > > > Signed-off-by: Jaegeuk Kim 
> > > > ---
> > > >  drivers/scsi/ufs/ufshcd.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
> > > > index b5ca0effe636..cecbd4ace8b4 100644
> > > > --- a/drivers/scsi/ufs/ufshcd.c
> > > > +++ b/drivers/scsi/ufs/ufshcd.c
> > > > @@ -1746,7 +1746,7 @@ static void __ufshcd_release(struct ufs_hba *hba)
> > > >
> > > > if (hba->clk_gating.active_reqs || hba->clk_gating.is_suspended ||
> > > > hba->ufshcd_state != UFSHCD_STATE_OPERATIONAL ||
> > > > -   ufshcd_any_tag_in_use(hba) || hba->outstanding_tasks ||
> > > > +   hba->outstanding_tasks ||
> > > > hba->active_uic_cmd || hba->uic_async_done)
> > > > return;

Re: [PATCH v1 0/2] mm: cma: introduce a non-blocking version of cma_release()

2020-10-22 Thread Roman Gushchin

On Thu, Oct 22, 2020 at 07:42:45PM -0400, Zi Yan wrote:
> On 22 Oct 2020, at 18:53, Roman Gushchin wrote:
> 
> > This small patchset introduces a non-blocking version of cma_release()
> > and simplifies the code in hugetlbfs, where previously we had to
> > temporarily drop hugetlb_lock around the cma_release() call.
> >
> > It should help Zi Yan with his work on 1 GB THPs: splitting a gigantic
> > THP under memory pressure requires a cma_release() call. If it's
> 
> Thanks for the patch. But during 1GB THP split, we only clear
> the bitmaps without releasing the pages. Also in cma_release_nowait(),
> the first page in the allocated CMA region is reused to store
> struct cma_clear_bitmap_work, but the same method cannot be used
> during THP split, since the first page is still in use. We might
> need to allocate some new memory for struct cma_clear_bitmap_work,
> which might not be successful under memory pressure. Any suggestion
> on where to store struct cma_clear_bitmap_work when I only want to
> clear bitmap without releasing the pages?

It means we can't use cma_release() there either, because it does clear
individual pages. We need to clear the cma bitmap without touching pages.

Can you handle an error there?

If so, we can introduce something like int cma_schedule_bitmap_clearance(),
which will allocate a work structure and will be able to return -ENOMEM
in the unlikely case of error.
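The suggestion is essentially to allocate the deferred-work descriptor separately and let the caller handle -ENOMEM, instead of borrowing the first page of the CMA area the way cma_release_nowait() does. A userspace C sketch of that contract follows. cma_schedule_bitmap_clearance() is only the name floated in this thread; everything below (model_cma, model_clear_work, model_schedule_bitmap_clearance(), model_run_pending()) is a hypothetical model that uses a plain byte array and a pending list in place of the real CMA bitmap and a workqueue.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define MODEL_CMA_BITS 64

/* One queued clearance: a range of bitmap entries to clear later. */
struct model_clear_work {
	unsigned long first;
	unsigned long count;
	struct model_clear_work *next;
};

struct model_cma {
	unsigned char bitmap[MODEL_CMA_BITS]; /* 1 == chunk allocated */
	struct model_clear_work *pending;     /* queued bitmap clearances */
};

/*
 * Queue a bitmap clearance without touching the pages themselves.
 * The work item is allocated separately, so the call can fail with -ENOMEM.
 */
static int model_schedule_bitmap_clearance(struct model_cma *cma,
					   unsigned long first,
					   unsigned long count)
{
	struct model_clear_work *work;

	assert(first + count <= MODEL_CMA_BITS);
	work = malloc(sizeof(*work));
	if (!work)
		return -ENOMEM; /* caller must cope, e.g. by retrying later */
	work->first = first;
	work->count = count;
	work->next = cma->pending;
	cma->pending = work;
	return 0;
}

/* Stand-in for the worker: clear the queued ranges and free the items. */
static void model_run_pending(struct model_cma *cma)
{
	while (cma->pending) {
		struct model_clear_work *work = cma->pending;
		unsigned long i;

		for (i = 0; i < work->count; i++)
			cma->bitmap[work->first + i] = 0;
		cma->pending = work->next;
		free(work);
	}
}
```

The point of the design is that the pages stay untouched and in use by the split THP; only the bookkeeping is deferred, at the cost of a separate allocation that may fail.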

Will it work for you?

Thanks!

Re: [PATCH v17 1/4] Add flags option to get xattr method paired to __vfs_getxattr

2020-10-22 Thread Paul Moore

On Wed, Oct 21, 2020 at 8:07 AM Mark Salyzyn  wrote:
> On 10/20/20 6:17 PM, Paul Moore wrote:
> > On Tue, Oct 20, 2020 at 3:17 PM Mark Salyzyn  wrote:
> >> Add a flag option to get xattr method that could have a bit flag of
> >> XATTR_NOSECURITY passed to it.  XATTR_NOSECURITY is generally then
> >> set in the __vfs_getxattr path when called by security
> >> infrastructure.
> >>
> >> This handles the case of a union filesystem driver that is being
> >> requested by the security layer to report back the xattr data.
> >>
> >> For the use case where access is to be blocked by the security layer.
> >>
> >> The path then could be security(dentry) ->
> >> __vfs_getxattr(dentry...XATTR_NOSECURITY) ->
> >> handler->get(dentry...XATTR_NOSECURITY) ->
> >> __vfs_getxattr(lower_dentry...XATTR_NOSECURITY) ->
> >> lower_handler->get(lower_dentry...XATTR_NOSECURITY)
> >> which would report back through the chain data and success as
> >> expected, the logging security layer at the top would have the
> >> data to determine the access permissions and report back the target
> >> context that was blocked.
> >>
> >> Without the get handler flag, the path on a union filesystem would be
> >> the errant security(dentry) -> __vfs_getxattr(dentry) ->
> >> handler->get(dentry) -> vfs_getxattr(lower_dentry) -> nested ->
> >> security(lower_dentry, log off) -> lower_handler->get(lower_dentry)
> >> which would report back through the chain no data, and -EACCES.
> >>
> >> For selinux for both cases, this would translate to a correctly
> >> determined blocked access. In the first case with this change a correct avc
> >> log would be reported, in the second legacy case an incorrect avc log
> >> would be reported against an uninitialized u:object_r:unlabeled:s0
> >> context making the logs cosmetically useless for audit2allow.
> >>
> >> This patch series is inert and is the wide-spread addition of the
> >> flags option for xattr functions, and a replacement of __vfs_getxattr
> >> with __vfs_getxattr(...XATTR_NOSECURITY).
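The two call chains described above can be modelled in a few lines of userspace C. This is a hypothetical sketch, not the patchset's code: MODEL_XATTR_NOSECURITY stands in for XATTR_NOSECURITY (its real value is not shown here), the security hook is a stub that always denies, and the context string is made up. It only shows why forwarding the caller's flag to the lower layer keeps a nested lookup off the security-checked path.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define MODEL_XATTR_NOSECURITY 0x4 /* hypothetical flag value */

typedef int (*model_get_fn)(const char *name, char *buf, size_t size,
			    int flags);

/* Stub LSM hook: pretend policy denies the nested lookup. */
static int model_security_check(const char *name)
{
	assert(name != NULL);
	return -EACCES;
}

/* Lower filesystem handler: copies out a fixed context string. */
static int lower_get(const char *name, char *buf, size_t size, int flags)
{
	static const char ctx[] = "u:object_r:system_file:s0";

	(void)name;
	(void)flags;
	if (size < sizeof(ctx))
		return -ERANGE;
	memcpy(buf, ctx, sizeof(ctx));
	return (int)sizeof(ctx); /* includes the terminating NUL */
}

/* Model of vfs_getxattr(): security check first, then the handler. */
static int model_vfs_getxattr(model_get_fn get, const char *name, char *buf,
			      size_t size)
{
	int err = model_security_check(name);

	if (err)
		return err;
	return get(name, buf, size, 0);
}

/* Model of __vfs_getxattr(): no security check, flags passed through. */
static int model_raw_getxattr(model_get_fn get, const char *name, char *buf,
			      size_t size, int flags)
{
	return get(name, buf, size, flags);
}

/* Union-fs handler: keeps the caller's flag when calling the lower layer. */
static int union_get(const char *name, char *buf, size_t size, int flags)
{
	if (flags & MODEL_XATTR_NOSECURITY)
		return model_raw_getxattr(lower_get, name, buf, size, flags);
	return model_vfs_getxattr(lower_get, name, buf, size);
}
```

With the flag, the lookup reaches the lower handler and returns the context; without it, the nested security check fires and the caller sees -EACCES and no data, matching the "errant" chain described above.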
> >>
> >> Signed-off-by: Mark Salyzyn 
> >> Reviewed-by: Jan Kara 
> >> Acked-by: Jan Kara 
> >> Acked-by: Jeff Layton 
> >> Acked-by: David Sterba 
> >> Acked-by: Darrick J. Wong 
> >> Acked-by: Mike Marshall 
> >> To: linux-fsde...@vger.kernel.org
> >> To: linux-unio...@vger.kernel.org
> >> Cc: Stephen Smalley 
> >> Cc: linux-kernel@vger.kernel.org
> >> Cc: linux-security-mod...@vger.kernel.org
> >> Cc: kernel-t...@android.com
> > ...
> >
> >> 
> > [NOTE: added the SELinux list to the CC line]
>
>
> Thanks and 
>
> >
> > I'm looking at this patchset in earnest for the first time and I'm a
> > little uncertain about the need for the new XATTR_NOSECURITY flag;
> > perhaps you can help me understand it better.  Looking over this
> > patch, and quickly looking at the others in the series, it seems as
> > though XATTR_NOSECURITY is basically used whenever a filesystem has to
> > call back into the vfs layer (e.g. overlayfs, ecryptfs, etc).  Am I
> > understanding that correctly?  If that assumption is correct, I'm not
> > certain why the new XATTR_NOSECURITY flag is needed; why couldn't
> > _vfs_getxattr() be used by all of the callers that need to bypass
> > DAC/MAC with vfs_getxattr() continuing to perform the DAC/MAC checks?
> > If for some reason _vfs_getxattr() can't be used, would it make more
> > sense to create a new stripped/special getxattr function for use by
> > nested filesystems?  Based on the number of revisions to this
> > patchset, I'm sure it can't be that simple so please educate me :)
> >
> It is hard to please everyone :-}
>
> Yes, calling back through the vfs layer.
>
> I was told not to change or remove the __vfs_getxattr default behaviour,
> but to use the flag to pass through the new behavior. Security concerns
> required the _key_ of the flag to be passed through rather than a
> blanket bypass. The same security reasoning argued against adding a
> special getxattr call.
>
> [TL;DR]
>
> history and details
>
> When the call goes down through the layers again and into the underlying
> filesystems to get the xattr, and the xattrs are blocked, the selinux
> _context_ is not copied into the buffer, leaving the caller effectively
> looking at u:r:unknown:s0. Well, they were blocked, so from the security
> standpoint that part was accurate, but the context is evaluated using the
> wrong rules and produces a (cosmetically) incorrect avc report. This also
> poisons the cache layers that may hold on to the context for future calls
> (+/- bugs), disturbing future decisions (we saw that in 4.14 and earlier
> vintage kernels without this patch; later kernels appeared to clear up the
> cache bug).
>
> XATTR_NOSECURITY is used in the overlayfs driver for a substantial
> majority of the getxattr calls, but only if the data is private (i.e. on
> the stack, not returned to the caller), as a simplification. A _real_
> getxattr is performed when the data is returned to the caller. I expect
> that

Re: [PATCH net RFC] net: Clear IFF_TX_SKB_SHARING for all Ethernet devices using skb_padto

2020-10-22 Thread Jakub Kicinski

On Thu, 22 Oct 2020 12:59:45 -0700 Xie He wrote:
> On Thu, Oct 22, 2020 at 8:22 AM Jakub Kicinski  wrote:
> >
> > Are most of these drivers using skb_padto()? Is that the reason they
> > can't be sharing the SKB?  
> 
> Yes, I think if a driver calls skb_pad / skb_padto / skb_put_padto /
> eth_skb_pad, the driver can't accept shared skbs because it may modify
> the skbs.
> 
> > I think the IFF_TX_SKB_SHARING flag is only used by pktgen, so perhaps
> > we can make sure pktgen doesn't generate skbs < dev->min_mtu, and then
> > the drivers won't pad?  
> 
> Yes, I see a lot of drivers just want to pad the skb to ETH_ZLEN, or
> just call eth_skb_pad. In this case, requiring the shared skb to be at
> least dev->min_mtu long can solve the problem for these drivers.
> 
> But I also see some drivers that want to pad the skb to a strange
> length, and don't set their special min_mtu to match this length. For
> example:
> 
> drivers/net/ethernet/packetengines/yellowfin.c wants to pad the skb to
> a dynamically calculated value.
> 
> drivers/net/ethernet/ti/cpsw.c, cpsw_new.c and tlan.c want to pad the
> skb to macro defined values.
> 
> drivers/net/ethernet/intel/iavf/iavf_txrx.c wants to pad the skb to
> IAVF_MIN_TX_LEN (17).
> 
> drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c wants to pad the skb to 17.

Hm, I see, that would be a slight loss of functionality if we started
requiring 64B, for example, while the driver could in practice xmit
17B frames (would matter only to VFs, but nonetheless).

> Another solution I can think of is to add a "skb_shared" check to
> "__skb_pad", so that if __skb_pad encounters a shared skb, it just
> returns an error. The driver would think this is a memory allocation
> failure. This way we can ensure shared skbs are not modified.
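The proposed __skb_pad() check can be sketched with a toy buffer type. This is a userspace model under stated assumptions, not the kernel's sk_buff or __skb_pad(): model_skb, model_skb_shared() and model_skb_pad() are hypothetical, "shared" means more than one user (as with an skb that pktgen reuses for several transmissions), and a shared buffer is refused with -ENOMEM so a driver would treat it like an allocation failure.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define MODEL_SKB_CAP 128

/* Toy stand-in for an sk_buff: a reference count and a flat data area. */
struct model_skb {
	int users;                         /* like refcount_read(&skb->users) */
	unsigned int len;                  /* bytes currently in use */
	unsigned char data[MODEL_SKB_CAP];
};

static int model_skb_shared(const struct model_skb *skb)
{
	return skb->users != 1;
}

/* Zero-pad to at least min_len bytes, refusing to modify a shared buffer. */
static int model_skb_pad(struct model_skb *skb, unsigned int min_len)
{
	assert(skb != NULL);
	if (skb->len >= min_len)
		return 0;                  /* nothing to do */
	if (model_skb_shared(skb))
		return -ENOMEM;            /* proposed: shared skb is not padded */
	if (min_len > MODEL_SKB_CAP)
		return -ENOMEM;            /* no tailroom in this toy model */
	memset(skb->data + skb->len, 0, min_len - skb->len);
	skb->len = min_len;
	return 0;
}
```

For example, padding a private 17-byte buffer to ETH_ZLEN (60) succeeds, while the same call on a buffer with two users fails and leaves it untouched. The trade-off raised in the reply is that every caller pays for a check that only matters for pktgen.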

I'm not sure if we want to be adding checks to __skb_pad() to handle
what's effectively a pktgen specific condition.

We could create a new field in struct netdevice for min_frame_len, but I
think your patch is the simplest solution. Let's see if anyone objects.

BTW it seems like there are more drivers which will need the flag
cleared, e.g. drivers/net/ethernet/broadcom/bnxt/bnxt.c?

Re: [PATCH v4] mm: memcg/slab: Stop reparented obj_cgroups from charging root

2020-10-22 Thread Roman Gushchin

On Thu, Oct 22, 2020 at 04:59:56PM -0700, Shakeel Butt wrote:
> On Thu, Oct 22, 2020 at 10:25 AM Roman Gushchin  wrote:
> >
> [snip]
> > >
> > > Since bf4f059954dc ("mm: memcg/slab: obj_cgroup API") is in 5.9, I
> > > think we can take this patch for 5.9 and 5.10 but keep Roman's cleanup
> > > for 5.11.
> > >
> > > What does everyone think?
> >
> > I think we should use the link to the root approach both for stable
> > backports and for 5.11+, to keep them in sync. The cleanup (always charging
> > the root cgroup) is not directly related to this problem, and we can keep it
> > for 5.11+ only.
> >
> > Thanks!
> 
> Roman, can you send the signed-off patch for the root linking for
> use_hierarchy=0?

Sure, here we are.

Thanks!

--

>From 19d66695f0ef1bf1ef7c51073ab91d67daa91362 Mon Sep 17 00:00:00 2001
From: Roman Gushchin 
Date: Thu, 22 Oct 2020 17:12:32 -0700
Subject: [PATCH] mm: memcg: link page counters to root if use_hierarchy is false

Richard reported a warning which can be reproduced by running the LTP
madvise6 test (cgroup v1 in the non-hierarchical mode should be used):

[9.841552] [ cut here ]
[9.841788] WARNING: CPU: 0 PID: 12 at mm/page_counter.c:57 page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
[9.841982] Modules linked in:
[9.842072] CPU: 0 PID: 12 Comm: kworker/0:1 Not tainted 5.9.0-rc7-22-default #77
[9.842266] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812d-rebuilt.opensuse.org 04/01/2014
[9.842571] Workqueue: events drain_local_stock
[9.842750] RIP: 0010:page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
[ 9.842894] Code: 0f c1 45 00 4c 29 e0 48 89 ef 48 89 c3 48 89 c6 e8 2a fe ff ff 48 85 db 78 10 48 8b 6d 28 48 85 ed 75 d8 5b 5d 41 5c 41 5d c3 <0f> 0b eb ec 90 e8 4b f9 88 2a 48 8b 17 48 39 d6 72 41 41 54 49 89
[9.843438] RSP: 0018:b1c18006be28 EFLAGS: 00010086
[9.843585] RAX:  RBX:  RCX: 94803bc2cae0
[9.843806] RDX: 0001 RSI:  RDI: 948007d2b248
[9.844026] RBP: 948007d2b248 R08: 948007c58eb0 R09: 948007da05ac
[9.844248] R10: 0018 R11: 0018 R12: 0001
[9.844477] R13:  R14:  R15: 94803bc2cac0
[9.844696] FS:  () GS:94803bc0() knlGS:
[9.844915] CS:  0010 DS:  ES:  CR0: 80050033
[9.845096] CR2: 7f0579ee0384 CR3: 2cc0a000 CR4: 06f0
[9.845319] Call Trace:
[9.845429] __memcg_kmem_uncharge (mm/memcontrol.c:3022)
[9.845582] drain_obj_stock (./include/linux/rcupdate.h:689 mm/memcontrol.c:3114)
[9.845684] drain_local_stock (mm/memcontrol.c:2255)
[9.845789] process_one_work (./arch/x86/include/asm/jump_label.h:25 ./include/linux/jump_label.h:200 ./include/trace/events/workqueue.h:108 kernel/workqueue.c:2274)
[9.845898] worker_thread (./include/linux/list.h:282 kernel/workqueue.c:2416)
[9.846034] ? process_one_work (kernel/workqueue.c:2358)
[9.846162] kthread (kernel/kthread.c:292)
[9.846271] ? __kthread_bind_mask (kernel/kthread.c:245)
[9.846420] ret_from_fork (arch/x86/entry/entry_64.S:300)
[9.846531] ---[ end trace 8b5647c1eba9d18a ]---

The problem occurs because in the non-hierarchical mode non-root page
counters are not linked to root page counters, so the charge is not
propagated to the root memory cgroup.

After the removal of the original memory cgroup and reparenting of the
object cgroup, the root cgroup might be uncharged by draining an objcg
stock, for example. This leads to an eventual underflow of the charge
and triggers a warning.

Fix it by linking all page counters to corresponding root page
counters in the non-hierarchical mode.
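The failure mode and the fix can be reproduced with a toy page counter hierarchy. In this hypothetical userspace model, charging and uncharging walk from a counter up through its parent pointers, as the kernel's page counters do, and an uncharge that drives any counter negative is reported, standing in for the WARN at mm/page_counter.c:57. "Linked to root" simply means the child's parent pointer is the root counter; the scenario charges through the child and later uncharges through root, as happens after the objcg is reparented.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_counter {
	long usage;
	struct model_counter *parent;
};

/* Charge nr_pages to the counter and every ancestor. */
static void model_charge(struct model_counter *c, long nr_pages)
{
	assert(nr_pages > 0);
	for (; c; c = c->parent)
		c->usage += nr_pages;
}

/* Uncharge nr_pages up the chain; return false if any counter underflows. */
static bool model_uncharge(struct model_counter *c, long nr_pages)
{
	bool ok = true;

	assert(nr_pages > 0);
	for (; c; c = c->parent) {
		c->usage -= nr_pages;
		if (c->usage < 0)
			ok = false; /* the kernel would WARN here */
	}
	return ok;
}

/* Charge via the child, then uncharge via root after reparenting. */
static bool model_reparent_scenario(bool link_to_root)
{
	struct model_counter root = { 0, NULL };
	struct model_counter child = { 0, link_to_root ? &root : NULL };

	model_charge(&child, 1);
	return model_uncharge(&root, 1);
}
```

Without the link, root never saw the charge, so the later uncharge takes it to -1; with the link, the charge propagated to root first and the uncharge balances it.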

The patch doesn't affect how the hierarchical mode is working,
which is the only sane and truly supported mode now.

Thanks to Richard for reporting, debugging and providing an
alternative version of the fix!

Reported-by: l...@lists.linux.it
Debugged-by: Richard Palethorpe 
Fixes: bf4f059954dc ("mm: memcg/slab: obj_cgroup API")
Signed-off-by: Roman Gushchin 
Cc: sta...@vger.kernel.org
---
 mm/memcontrol.c | 15 ++-
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2636f8bad908..009297017c87 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5339,17 +5339,22 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
memcg->swappiness = mem_cgroup_swappiness(parent);
memcg->oom_kill_disable = parent->oom_kill_disable;
}
-   if (parent && parent->use_hierarchy) {
+   if (!parent) {
+   page_counter_init(&memcg->memory, NULL);
+   page_counter_init(&memcg->swap, NULL);
+   page_counter_init(&memcg->kmem, NULL);
+

[PATCH] x86/mm/KASLR: Account for minimum padding when calculating entropy

2020-10-22 Thread Junaid Shahid

Subtract the minimum padding between regions from the initial
remain_entropy. Without this, the last region could potentially
overflow past vaddr_end if we happen to get a specific sequence
of random numbers (although extremely unlikely in practice).
The bug can be demonstrated by replacing the prandom_bytes_state
call with "rand = entropy;"
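The overflow can be seen with a simplified model of the placement loop. This C sketch is not kernel code: it assumes the loop in kernel_randomize_memory() (not shown in this hunk) gives each region a share remain_entropy / (regions left), takes rand % (share + 1) as the offset, skips the region's own size (what get_padding() returns), and then rounds vaddr + 1 up to the next PUD_SIZE boundary, which is where one PUD of minimum padding per gap comes from. Sizes are in hypothetical units of one PUD, so PUD_MASK alignment is a no-op, and the draw is forced to the maximum, mirroring the "rand = entropy;" experiment.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_REGIONS 3
#define MODEL_PUD_SIZE 1UL /* model unit: 1 == one PUD */

static unsigned long round_up_ul(unsigned long x, unsigned long a)
{
	assert(a != 0);
	return ((x + a - 1) / a) * a;
}

/* Return the end of the last region when every draw takes its full share. */
static unsigned long last_region_end(unsigned long vaddr_start,
				     unsigned long vaddr_end,
				     const unsigned long region_size[NR_REGIONS],
				     bool account_min_padding)
{
	unsigned long remain_entropy = vaddr_end - vaddr_start;
	unsigned long vaddr = vaddr_start;
	unsigned long base = vaddr_start;
	int i;

	if (account_min_padding) /* the fix: reserve one PUD per gap */
		remain_entropy -= (NR_REGIONS - 1) * MODEL_PUD_SIZE;
	for (i = 0; i < NR_REGIONS; i++)
		remain_entropy -= region_size[i];

	for (i = 0; i < NR_REGIONS; i++) {
		unsigned long entropy = remain_entropy / (NR_REGIONS - i);
		unsigned long rand = entropy; /* worst-case draw */

		entropy = rand % (entropy + 1); /* == entropy */
		vaddr += entropy;
		base = vaddr;
		/* jump the region, then at least one PUD of padding */
		vaddr += region_size[i];
		vaddr = round_up_ul(vaddr + 1, MODEL_PUD_SIZE);
		remain_entropy -= entropy;
	}
	return base + region_size[NR_REGIONS - 1];
}
```

With three regions of size 2 in a 10-unit window, the unpatched calculation places the last region ending at 12, past vaddr_end, while reserving (3 - 1) PUDs up front makes it end exactly at 10.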

Signed-off-by: Junaid Shahid 
---
 arch/x86/mm/kaslr.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/kaslr.c b/arch/x86/mm/kaslr.c
index 6e6b39710e5f..fe3eec30f736 100644
--- a/arch/x86/mm/kaslr.c
+++ b/arch/x86/mm/kaslr.c
@@ -109,7 +109,8 @@ void __init kernel_randomize_memory(void)
kaslr_regions[2].size_tb = DIV_ROUND_UP(vmemmap_size, 1UL << TB_SHIFT);
 
/* Calculate entropy available between regions */
-   remain_entropy = vaddr_end - vaddr_start;
+   remain_entropy = vaddr_end - vaddr_start -
+(ARRAY_SIZE(kaslr_regions) - 1) * PUD_SIZE;
for (i = 0; i < ARRAY_SIZE(kaslr_regions); i++)
remain_entropy -= get_padding(&kaslr_regions[i]);
 
-- 
2.29.0.rc2.309.g374f81d7ae-goog

[PATCH v3 03/10] ASoC: SOF: Create client driver for IPC test

2020-10-22 Thread Dave Ertman

From: Ranjani Sridharan 

Create an SOF client driver for IPC flood test. This
driver is used to set up the debugfs entries and the
read/write ops for initiating the IPC flood test that
would be used to measure the min/max/avg response times
for sending IPCs to the DSP. The debugfs ops definitions
in the driver are existing code that has been copied
from the core. They will be removed from the SOF core,
making it less monolithic and easier to maintain.
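The measurement itself is simple bookkeeping. Below is a hedged userspace sketch of the min/max/average accounting and the two stop conditions (count-based or duration-based) described for sof_debug_ipc_flood_test(). The struct and function names are hypothetical, the real driver times each IPC with ktime_get() rather than taking sample values as input, and it assumes the average is the total divided by the number of IPCs sent, since the quoted hunk is cut off before that step.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct flood_stats {
	uint64_t min_ns;
	uint64_t max_ns;
	uint64_t total_ns;
	unsigned long count;
};

static void flood_stats_init(struct flood_stats *s)
{
	s->min_ns = UINT64_MAX;
	s->max_ns = 0;
	s->total_ns = 0;
	s->count = 0;
}

/* Record one IPC round-trip time. */
static void flood_stats_add(struct flood_stats *s, uint64_t ns)
{
	if (ns < s->min_ns)
		s->min_ns = ns;
	if (ns > s->max_ns)
		s->max_ns = ns;
	s->total_ns += ns;
	s->count++;
}

static uint64_t flood_stats_avg(const struct flood_stats *s)
{
	assert(s->count > 0);
	return s->total_ns / s->count;
}

/*
 * Exactly one of ipc_count / ipc_duration_ms is non-zero: stop after
 * ipc_count messages, or once now_ns reaches end_ns for a duration test.
 */
static bool flood_should_stop(unsigned long sent, unsigned long ipc_count,
			      uint64_t now_ns, uint64_t end_ns)
{
	if (ipc_count)
		return sent >= ipc_count;
	return now_ns >= end_ns;
}
```

Keeping a running total instead of an array of samples means the flood can run for a long time with constant memory, which suits a debugfs-triggered stress test.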

Reviewed-by: Pierre-Louis Bossart 
Signed-off-by: Ranjani Sridharan 
Co-developed-by: Fred Oh 
Signed-off-by: Fred Oh 
Signed-off-by: Dave Ertman 
---
 sound/soc/sof/Kconfig   |  10 +
 sound/soc/sof/Makefile  |   4 +
 sound/soc/sof/sof-ipc-test-client.c | 321 
 3 files changed, 335 insertions(+)
 create mode 100644 sound/soc/sof/sof-ipc-test-client.c

diff --git a/sound/soc/sof/Kconfig b/sound/soc/sof/Kconfig
index 31e9911098fc..13bde36cc5d7 100644
--- a/sound/soc/sof/Kconfig
+++ b/sound/soc/sof/Kconfig
@@ -190,6 +190,16 @@ config SND_SOC_SOF_DEBUG_IPC_FLOOD_TEST
  Say Y if you want to enable IPC flood test.
  If unsure, select "N".
 
+config SND_SOC_SOF_DEBUG_IPC_FLOOD_TEST_CLIENT
+   tristate "SOF enable IPC flood test client"
+   depends on SND_SOC_SOF_CLIENT
+   help
+ This option enables a separate client device for IPC flood test
+ which can be used to flood the DSP with test IPCs and gather stats
+ about response times.
+ Say Y if you want to enable IPC flood test.
+ If unsure, select "N".
+
 config SND_SOC_SOF_DEBUG_RETAIN_DSP_CONTEXT
bool "SOF retain DSP context on any FW exceptions"
help
diff --git a/sound/soc/sof/Makefile b/sound/soc/sof/Makefile
index 5e46f25a3851..baa93fe2cc9a 100644
--- a/sound/soc/sof/Makefile
+++ b/sound/soc/sof/Makefile
@@ -9,6 +9,8 @@ snd-sof-pci-objs := sof-pci-dev.o
 snd-sof-acpi-objs := sof-acpi-dev.o
 snd-sof-of-objs := sof-of-dev.o
 
+snd-sof-ipc-test-objs := sof-ipc-test-client.o
+
 snd-sof-nocodec-objs := nocodec.o
 
 obj-$(CONFIG_SND_SOC_SOF) += snd-sof.o
@@ -21,6 +23,8 @@ obj-$(CONFIG_SND_SOC_SOF_PCI) += snd-sof-pci.o
 
 obj-$(CONFIG_SND_SOC_SOF_CLIENT) += snd-sof-client.o
 
+obj-$(CONFIG_SND_SOC_SOF_DEBUG_IPC_FLOOD_TEST_CLIENT) += snd-sof-ipc-test.o
+
 obj-$(CONFIG_SND_SOC_SOF_INTEL_TOPLEVEL) += intel/
 obj-$(CONFIG_SND_SOC_SOF_IMX_TOPLEVEL) += imx/
 obj-$(CONFIG_SND_SOC_SOF_XTENSA) += xtensa/
diff --git a/sound/soc/sof/sof-ipc-test-client.c 
b/sound/soc/sof/sof-ipc-test-client.c
new file mode 100644
index ..b4d803b9139b
--- /dev/null
+++ b/sound/soc/sof/sof-ipc-test-client.c
@@ -0,0 +1,321 @@
+// SPDX-License-Identifier: GPL-2.0-only
+//
+// Copyright(c) 2020 Intel Corporation. All rights reserved.
+//
+// Author: Ranjani Sridharan 
+//
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "sof-client.h"
+
+#define MAX_IPC_FLOOD_DURATION_MS 1000
+#define MAX_IPC_FLOOD_COUNT 1
+#define IPC_FLOOD_TEST_RESULT_LEN 512
+#define SOF_IPC_CLIENT_SUSPEND_DELAY_MS 3000
+
+struct sof_ipc_client_data {
+   struct dentry *dfs_root;
+   char *buf;
+};
+
+/*
+ * helper function to perform the flood test. Only one of the two params, ipc_duration_ms
+ * or ipc_count, will be non-zero and will determine the type of test
+ */
+static int sof_debug_ipc_flood_test(struct sof_client_dev *cdev, unsigned long ipc_duration_ms,
+   unsigned long ipc_count)
+{
+   struct sof_ipc_client_data *ipc_client_data = cdev->data;
+   struct device *dev = &cdev->auxdev.dev;
+   struct sof_ipc_cmd_hdr hdr;
+   struct sof_ipc_reply reply;
+   u64 min_response_time = U64_MAX;
+   u64 avg_response_time = 0;
+   u64 max_response_time = 0;
+   ktime_t cur;
+   ktime_t test_end;
+   int i = 0;
+   int ret = 0;
+   bool end_test = false;
+
+   /* configure test IPC */
+   hdr.cmd = SOF_IPC_GLB_TEST_MSG | SOF_IPC_TEST_IPC_FLOOD;
+   hdr.size = sizeof(hdr);
+
+   /* set test end time for duration flood test */
+   test_end = ktime_get_ns() + ipc_duration_ms * NSEC_PER_MSEC;
+
+   /* send test IPC's */
+   do {
+   ktime_t start;
+   u64 ipc_response_time;
+
+   start = ktime_get();
+   ret = sof_client_ipc_tx_message(cdev, hdr.cmd, &hdr, hdr.size, &reply,
+   sizeof(reply));
+   if (ret < 0)
+   break;
+   cur = ktime_get();
+
+   i++;
+
+   /* compute min and max response times */
+   ipc_response_time = ktime_to_ns(ktime_sub(cur, start));
+   min_response_time = min(min_response_time, ipc_response_time);
+   max_response_time = max(max_response_time, ipc_response_time);
+
+   /* sum up response times */
+

[PATCH v3 08/10] ASoC: SOF: compress: move and export sof_probe_compr_ops

2020-10-22 Thread Dave Ertman

From: Ranjani Sridharan 

sof_probe_compr_ops is not platform-specific, so move
it to the common compress code and export the symbol. The
common compress code is only compiled when
CONFIG_SND_SOC_SOF_DEBUG_PROBES is selected, so there is no
need to check that Kconfig option again when defining
sof_probe_compr_ops.

Reviewed-by: Pierre-Louis Bossart 
Tested-by: Fred Oh 
Signed-off-by: Ranjani Sridharan 
Signed-off-by: Dave Ertman 
---
 sound/soc/sof/compress.c  |  9 +
 sound/soc/sof/compress.h  |  1 +
 sound/soc/sof/intel/hda-dai.c | 12 
 3 files changed, 10 insertions(+), 12 deletions(-)

diff --git a/sound/soc/sof/compress.c b/sound/soc/sof/compress.c
index 2d4969c705a4..0443f171b4e7 100644
--- a/sound/soc/sof/compress.c
+++ b/sound/soc/sof/compress.c
@@ -13,6 +13,15 @@
 #include "ops.h"
 #include "probe.h"
 
+struct snd_soc_cdai_ops sof_probe_compr_ops = {
+   .startup= sof_probe_compr_open,
+   .shutdown   = sof_probe_compr_free,
+   .set_params = sof_probe_compr_set_params,
+   .trigger= sof_probe_compr_trigger,
+   .pointer= sof_probe_compr_pointer,
+};
+EXPORT_SYMBOL(sof_probe_compr_ops);
+
 struct snd_compress_ops sof_probe_compressed_ops = {
.copy   = sof_probe_compr_copy,
 };
diff --git a/sound/soc/sof/compress.h b/sound/soc/sof/compress.h
index ca8790bd4b13..689c83ac8ffc 100644
--- a/sound/soc/sof/compress.h
+++ b/sound/soc/sof/compress.h
@@ -13,6 +13,7 @@
 
 #include 
 
+extern struct snd_soc_cdai_ops sof_probe_compr_ops;
 extern struct snd_compress_ops sof_probe_compressed_ops;
 
 int sof_probe_compr_open(struct snd_compr_stream *cstream,
diff --git a/sound/soc/sof/intel/hda-dai.c b/sound/soc/sof/intel/hda-dai.c
index c6cb8c212eca..1acec1176986 100644
--- a/sound/soc/sof/intel/hda-dai.c
+++ b/sound/soc/sof/intel/hda-dai.c
@@ -400,18 +400,6 @@ static const struct snd_soc_dai_ops hda_link_dai_ops = {
.prepare = hda_link_pcm_prepare,
 };
 
-#if IS_ENABLED(CONFIG_SND_SOC_SOF_HDA_PROBES)
-#include "../compress.h"
-
-static struct snd_soc_cdai_ops sof_probe_compr_ops = {
-   .startup= sof_probe_compr_open,
-   .shutdown   = sof_probe_compr_free,
-   .set_params = sof_probe_compr_set_params,
-   .trigger= sof_probe_compr_trigger,
-   .pointer= sof_probe_compr_pointer,
-};
-
-#endif
 #endif
 
 /*
-- 
2.26.2

[PATCH v3 02/10] ASoC: SOF: Introduce descriptors for SOF client

2020-10-22 Thread Dave Ertman

From: Ranjani Sridharan 

A client in the SOF (Sound Open Firmware) context is a
device that needs to communicate with the DSP via IPC
messages. The SOF core is responsible for serializing the
IPC messages to the DSP from the different clients. One
example of an SOF client would be an IPC test client that
floods the DSP with test IPC messages to validate if the
serialization works as expected. Multi-client support will
also add the ability to split the existing audio cards
into multiple ones, e.g. to deal with HDMI through a
dedicated client instead of adding HDMI to all cards.

This patch introduces descriptors for SOF client driver
and SOF client device along with APIs for registering
and unregistering a SOF client driver, sending IPCs from
a client device and accessing the SOF core debugfs root entry.

Along with this, add a couple of new members to struct
snd_sof_dev that will be used for maintaining the list of
clients.
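As a rough illustration of the bookkeeping this patch adds (a client_list and a client_mutex in struct snd_sof_dev), here is a userspace sketch of registering and unregistering clients on a per-core list. The types and function names are hypothetical stand-ins, not the API added in sof-client.c. Locking is only indicated in comments; the assumption is that a real implementation holds client_mutex around each list operation.

```c
#include <assert.h>
#include <stddef.h>

struct model_client {
	const char *name;
	struct model_client *next;
};

struct model_sof_dev {
	struct model_client *client_list; /* kernel: list_head + client_mutex */
};

/* Add a client at the head of the list (kernel: under client_mutex). */
static void model_client_register(struct model_sof_dev *sdev,
				  struct model_client *client)
{
	assert(client->next == NULL);
	client->next = sdev->client_list;
	sdev->client_list = client;
}

/* Remove a client if present; returns 0 on success, -1 if not found. */
static int model_client_unregister(struct model_sof_dev *sdev,
				   struct model_client *client)
{
	struct model_client **pp;

	for (pp = &sdev->client_list; *pp; pp = &(*pp)->next) {
		if (*pp == client) {
			*pp = client->next;
			client->next = NULL;
			return 0;
		}
	}
	return -1;
}

/* Count registered clients (kernel: under client_mutex). */
static size_t model_client_count(const struct model_sof_dev *sdev)
{
	const struct model_client *c;
	size_t n = 0;

	for (c = sdev->client_list; c; c = c->next)
		n++;
	return n;
}
```

For example, registering an IPC test client and a probes client yields a count of 2, and unregistering the same client twice fails the second time.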

Reviewed-by: Pierre-Louis Bossart 
Signed-off-by: Ranjani Sridharan 
Co-developed-by: Fred Oh 
Signed-off-by: Fred Oh 
Signed-off-by: Dave Ertman 
---
 sound/soc/sof/Kconfig  |  19 ++
 sound/soc/sof/Makefile |   3 +
 sound/soc/sof/core.c   |   2 +
 sound/soc/sof/sof-client.c | 115 +
 sound/soc/sof/sof-client.h |  66 +
 sound/soc/sof/sof-priv.h   |   9 +++
 6 files changed, 214 insertions(+)
 create mode 100644 sound/soc/sof/sof-client.c
 create mode 100644 sound/soc/sof/sof-client.h

diff --git a/sound/soc/sof/Kconfig b/sound/soc/sof/Kconfig
index 8c1f0829de40..31e9911098fc 100644
--- a/sound/soc/sof/Kconfig
+++ b/sound/soc/sof/Kconfig
@@ -50,6 +50,24 @@ config SND_SOC_SOF_DEBUG_PROBES
  Say Y if you want to enable probes.
  If unsure, select "N".
 
+config SND_SOC_SOF_CLIENT
+   tristate
+   select AUXILIARY_BUS
+   help
+ This option is not user-selectable but automagically handled by
+ 'select' statements at a higher level.
+
+config SND_SOC_SOF_CLIENT_SUPPORT
+   bool "SOF enable clients"
+   depends on SND_SOC_SOF
+   help
+ This adds support for auxiliary client devices to separate out the debug
+ functionality for IPC tests, probes etc. into separate devices. This
+ option would also allow adding client devices based on DSP firmware
+ capabilities and ACPI/OF device information.
+ Say Y if you want to enable clients with SOF.
+ If unsure select "N".
+
 config SND_SOC_SOF_DEVELOPER_SUPPORT
bool "SOF developer options support"
depends on EXPERT
@@ -186,6 +204,7 @@ endif ## SND_SOC_SOF_DEVELOPER_SUPPORT
 
 config SND_SOC_SOF
tristate
+   select SND_SOC_SOF_CLIENT if SND_SOC_SOF_CLIENT_SUPPORT
select SND_SOC_TOPOLOGY
select SND_SOC_SOF_NOCODEC if SND_SOC_SOF_NOCODEC_SUPPORT
help
diff --git a/sound/soc/sof/Makefile b/sound/soc/sof/Makefile
index 05718dfe6cd2..5e46f25a3851 100644
--- a/sound/soc/sof/Makefile
+++ b/sound/soc/sof/Makefile
@@ -2,6 +2,7 @@
 
 snd-sof-objs := core.o ops.o loader.o ipc.o pcm.o pm.o debug.o topology.o\
control.o trace.o utils.o sof-audio.o
+snd-sof-client-objs := sof-client.o
 snd-sof-$(CONFIG_SND_SOC_SOF_DEBUG_PROBES) += probe.o compress.o
 
 snd-sof-pci-objs := sof-pci-dev.o
@@ -18,6 +19,8 @@ obj-$(CONFIG_SND_SOC_SOF_ACPI) += snd-sof-acpi.o
 obj-$(CONFIG_SND_SOC_SOF_OF) += snd-sof-of.o
 obj-$(CONFIG_SND_SOC_SOF_PCI) += snd-sof-pci.o
 
+obj-$(CONFIG_SND_SOC_SOF_CLIENT) += snd-sof-client.o
+
 obj-$(CONFIG_SND_SOC_SOF_INTEL_TOPLEVEL) += intel/
 obj-$(CONFIG_SND_SOC_SOF_IMX_TOPLEVEL) += imx/
 obj-$(CONFIG_SND_SOC_SOF_XTENSA) += xtensa/
diff --git a/sound/soc/sof/core.c b/sound/soc/sof/core.c
index adc7c37145d6..72a97219395f 100644
--- a/sound/soc/sof/core.c
+++ b/sound/soc/sof/core.c
@@ -314,8 +314,10 @@ int snd_sof_device_probe(struct device *dev, struct snd_sof_pdata *plat_data)
INIT_LIST_HEAD(&sdev->widget_list);
INIT_LIST_HEAD(&sdev->dai_list);
INIT_LIST_HEAD(&sdev->route_list);
+   INIT_LIST_HEAD(&sdev->client_list);
spin_lock_init(&sdev->ipc_lock);
spin_lock_init(&sdev->hw_lock);
+   mutex_init(&sdev->client_mutex);
 
if (IS_ENABLED(CONFIG_SND_SOC_SOF_PROBE_WORK_QUEUE))
INIT_WORK(&sdev->probe_work, sof_probe_work);
diff --git a/sound/soc/sof/sof-client.c b/sound/soc/sof/sof-client.c
new file mode 100644
index ..dd75a0ba4c28
--- /dev/null
+++ b/sound/soc/sof/sof-client.c
@@ -0,0 +1,115 @@
+// SPDX-License-Identifier: GPL-2.0-only
+//
+// Copyright(c) 2020 Intel Corporation. All rights reserved.
+//
+// Author: Ranjani Sridharan 
+//
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "sof-client.h"
+#include "sof-priv.h"
+
+static void sof_client_auxdev_release(struct device *dev)
+{
+   struct auxiliary_device *auxdev = to_auxiliary_dev(dev);
+   struct sof_client_dev *cdev = auxiliary_dev_to_sof_client_dev(auxdev);
+
+

[PATCH v3 09/10] ASoC: SOF: Add new client driver for probes support

2020-10-22 Thread Dave Ertman

From: Ranjani Sridharan 

Add a new client driver for probes support and move
all the probes-related code from the core to the
client driver.

The probes client driver registers a component driver
with one CPU DAI driver for extraction and creates a
new sound card with one DUMMY DAI link with a dummy codec
that will be used for extracting audio data from specific
points in the audio pipeline.

The probes debugfs ops are based on the initial
implementation by Cezary Rojewski and have been moved
out of the SOF core into the client driver making it
easier to maintain. This change will make it easier
for the probes functionality to be added for all platforms
without the need to modify the existing (15+) machine
drivers.

Reviewed-by: Pierre-Louis Bossart 
Tested-by: Fred Oh 
Signed-off-by: Ranjani Sridharan 
Signed-off-by: Dave Ertman 
---
 sound/soc/sof/Kconfig |  18 +-
 sound/soc/sof/Makefile|   3 +-
 sound/soc/sof/compress.c  |  51 ++--
 sound/soc/sof/core.c  |   6 -
 sound/soc/sof/debug.c | 227 
 sound/soc/sof/intel/hda-dai.c |  15 --
 sound/soc/sof/intel/hda.h |   6 -
 sound/soc/sof/pcm.c   |  11 -
 sound/soc/sof/probe.c | 124 -
 sound/soc/sof/probe.h |  41 +--
 sound/soc/sof/sof-priv.h  |   4 -
 sound/soc/sof/sof-probes-client.c | 414 ++
 12 files changed, 545 insertions(+), 375 deletions(-)
 create mode 100644 sound/soc/sof/sof-probes-client.c

diff --git a/sound/soc/sof/Kconfig b/sound/soc/sof/Kconfig
index a0f9474b8143..9fa00780c842 100644
--- a/sound/soc/sof/Kconfig
+++ b/sound/soc/sof/Kconfig
@@ -42,13 +42,11 @@ config SND_SOC_SOF_OF
  Say Y if you need this option. If unsure select "N".
 
 config SND_SOC_SOF_DEBUG_PROBES
-   bool "SOF enable data probing"
+   bool
select SND_SOC_COMPRESS
help
- This option enables the data probing feature that can be used to
- gather data directly from specific points of the audio pipeline.
- Say Y if you want to enable probes.
- If unsure, select "N".
+ This option is not user-selectable but automagically handled by
+ 'select' statements at a higher level.
 
 config SND_SOC_SOF_CLIENT
tristate
@@ -192,6 +190,15 @@ config SND_SOC_SOF_DEBUG_IPC_FLOOD_TEST_CLIENT
  Say Y if you want to enable IPC flood test.
  If unsure, select "N".
 
+config SND_SOC_SOF_DEBUG_PROBES_CLIENT
+   tristate "SOF enable data probing"
+   depends on SND_SOC_SOF_CLIENT
+   help
+ This option enables the data probing feature that can be used to
+ gather data directly from specific points of the audio pipeline.
+ Say Y if you want to enable probes.
+ If unsure, select "N".
+
 config SND_SOC_SOF_DEBUG_RETAIN_DSP_CONTEXT
bool "SOF retain DSP context on any FW exceptions"
help
@@ -207,6 +214,7 @@ endif ## SND_SOC_SOF_DEVELOPER_SUPPORT
 config SND_SOC_SOF
tristate
select SND_SOC_SOF_CLIENT if SND_SOC_SOF_CLIENT_SUPPORT
+   select SND_SOC_SOF_DEBUG_PROBES if SND_SOC_SOF_DEBUG_PROBES_CLIENT
select SND_SOC_TOPOLOGY
select SND_SOC_SOF_NOCODEC if SND_SOC_SOF_NOCODEC_SUPPORT
help
diff --git a/sound/soc/sof/Makefile b/sound/soc/sof/Makefile
index baa93fe2cc9a..cf49466f7910 100644
--- a/sound/soc/sof/Makefile
+++ b/sound/soc/sof/Makefile
@@ -3,7 +3,7 @@
 snd-sof-objs := core.o ops.o loader.o ipc.o pcm.o pm.o debug.o topology.o\
control.o trace.o utils.o sof-audio.o
 snd-sof-client-objs := sof-client.o
-snd-sof-$(CONFIG_SND_SOC_SOF_DEBUG_PROBES) += probe.o compress.o
+snd-sof-probes-objs := probe.o compress.o sof-probes-client.o
 
 snd-sof-pci-objs := sof-pci-dev.o
 snd-sof-acpi-objs := sof-acpi-dev.o
@@ -24,6 +24,7 @@ obj-$(CONFIG_SND_SOC_SOF_PCI) += snd-sof-pci.o
 obj-$(CONFIG_SND_SOC_SOF_CLIENT) += snd-sof-client.o
 
 obj-$(CONFIG_SND_SOC_SOF_DEBUG_IPC_FLOOD_TEST_CLIENT) += snd-sof-ipc-test.o
+obj-$(CONFIG_SND_SOC_SOF_DEBUG_PROBES_CLIENT) += snd-sof-probes.o
 
 obj-$(CONFIG_SND_SOC_SOF_INTEL_TOPLEVEL) += intel/
 obj-$(CONFIG_SND_SOC_SOF_IMX_TOPLEVEL) += imx/
diff --git a/sound/soc/sof/compress.c b/sound/soc/sof/compress.c
index 0443f171b4e7..bbb77f028e74 100644
--- a/sound/soc/sof/compress.c
+++ b/sound/soc/sof/compress.c
@@ -10,8 +10,8 @@
 
 #include 
 #include "compress.h"
-#include "ops.h"
 #include "probe.h"
+#include "sof-client.h"
 
 struct snd_soc_cdai_ops sof_probe_compr_ops = {
.startup= sof_probe_compr_open,
@@ -30,17 +30,18 @@ EXPORT_SYMBOL(sof_probe_compressed_ops);
 int sof_probe_compr_open(struct snd_compr_stream *cstream,
struct snd_soc_dai *dai)
 {
-   struct snd_sof_dev *sdev =
-   snd_soc_component_get_drvdata(dai->component);
+   struct snd_soc_card *card = snd_soc_component_get_drvdata(dai->component);
+   struct

[PATCH v3 06/10] ASoC: SOF: Intel: Remove IPC flood test support in SOF core

2020-10-22 Thread Dave Ertman

From: Ranjani Sridharan 

Remove the IPC flood test support in the SOF core as it is
now added in the IPC flood test client.

Reviewed-by: Pierre-Louis Bossart 
Signed-off-by: Fred Oh 
Signed-off-by: Ranjani Sridharan 
Signed-off-by: Dave Ertman 
---
 sound/soc/sof/Kconfig|   8 --
 sound/soc/sof/debug.c| 230 ---
 sound/soc/sof/sof-priv.h |   6 +-
 3 files changed, 1 insertion(+), 243 deletions(-)

diff --git a/sound/soc/sof/Kconfig b/sound/soc/sof/Kconfig
index 13bde36cc5d7..a0f9474b8143 100644
--- a/sound/soc/sof/Kconfig
+++ b/sound/soc/sof/Kconfig
@@ -182,14 +182,6 @@ config SND_SOC_SOF_DEBUG_ENABLE_FIRMWARE_TRACE
  module parameter (similar to dynamic debug)
  If unsure, select "N".
 
-config SND_SOC_SOF_DEBUG_IPC_FLOOD_TEST
-   bool "SOF enable IPC flood test"
-   help
- This option enables the IPC flood test which can be used to flood
- the DSP with test IPCs and gather stats about response times.
- Say Y if you want to enable IPC flood test.
- If unsure, select "N".
-
 config SND_SOC_SOF_DEBUG_IPC_FLOOD_TEST_CLIENT
tristate "SOF enable IPC flood test client"
depends on SND_SOC_SOF_CLIENT
diff --git a/sound/soc/sof/debug.c b/sound/soc/sof/debug.c
index 9419a99bab53..d224641768da 100644
--- a/sound/soc/sof/debug.c
+++ b/sound/soc/sof/debug.c
@@ -232,120 +232,10 @@ static int snd_sof_debugfs_probe_item(struct snd_sof_dev *sdev,
 }
 #endif
 
-#if IS_ENABLED(CONFIG_SND_SOC_SOF_DEBUG_IPC_FLOOD_TEST)
-#define MAX_IPC_FLOOD_DURATION_MS 1000
-#define MAX_IPC_FLOOD_COUNT 1
-#define IPC_FLOOD_TEST_RESULT_LEN 512
-
-static int sof_debug_ipc_flood_test(struct snd_sof_dev *sdev,
-   struct snd_sof_dfsentry *dfse,
-   bool flood_duration_test,
-   unsigned long ipc_duration_ms,
-   unsigned long ipc_count)
-{
-   struct sof_ipc_cmd_hdr hdr;
-   struct sof_ipc_reply reply;
-   u64 min_response_time = U64_MAX;
-   ktime_t start, end, test_end;
-   u64 avg_response_time = 0;
-   u64 max_response_time = 0;
-   u64 ipc_response_time;
-   int i = 0;
-   int ret;
-
-   /* configure test IPC */
-   hdr.cmd = SOF_IPC_GLB_TEST_MSG | SOF_IPC_TEST_IPC_FLOOD;
-   hdr.size = sizeof(hdr);
-
-   /* set test end time for duration flood test */
-   if (flood_duration_test)
-   test_end = ktime_get_ns() + ipc_duration_ms * NSEC_PER_MSEC;
-
-   /* send test IPC's */
-   while (1) {
-   start = ktime_get();
-   ret = sof_ipc_tx_message(sdev->ipc, hdr.cmd, &hdr, hdr.size,
-&reply, sizeof(reply));
-   end = ktime_get();
-
-   if (ret < 0)
-   break;
-
-   /* compute min and max response times */
-   ipc_response_time = ktime_to_ns(ktime_sub(end, start));
-   min_response_time = min(min_response_time, ipc_response_time);
-   max_response_time = max(max_response_time, ipc_response_time);
-
-   /* sum up response times */
-   avg_response_time += ipc_response_time;
-   i++;
-
-   /* test complete? */
-   if (flood_duration_test) {
-   if (ktime_to_ns(end) >= test_end)
-   break;
-   } else {
-   if (i == ipc_count)
-   break;
-   }
-   }
-
-   if (ret < 0)
-   dev_err(sdev->dev,
-   "error: ipc flood test failed at %d iterations\n", i);
-
-   /* return if the first IPC fails */
-   if (!i)
-   return ret;
-
-   /* compute average response time */
-   do_div(avg_response_time, i);
-
-   /* clear previous test output */
-   memset(dfse->cache_buf, 0, IPC_FLOOD_TEST_RESULT_LEN);
-
-   if (flood_duration_test) {
-   dev_dbg(sdev->dev, "IPC Flood test duration: %lums\n",
-   ipc_duration_ms);
-   snprintf(dfse->cache_buf, IPC_FLOOD_TEST_RESULT_LEN,
-"IPC Flood test duration: %lums\n", ipc_duration_ms);
-   }
-
-   dev_dbg(sdev->dev,
-   "IPC Flood count: %d, Avg response time: %lluns\n",
-   i, avg_response_time);
-   dev_dbg(sdev->dev, "Max response time: %lluns\n",
-   max_response_time);
-   dev_dbg(sdev->dev, "Min response time: %lluns\n",
-   min_response_time);
-
-   /* format output string */
-   snprintf(dfse->cache_buf + strlen(dfse->cache_buf),
-IPC_FLOOD_TEST_RESULT_LEN - strlen(dfse->cache_buf),
-"IPC Flood count: %d\nAvg response time: %lluns\n",
-i, avg_response_time);
-
-   snprintf(dfse->cache_buf + strlen(dfse->cache_buf),

[PATCH v3 07/10] ASoC: SOF: sof-client: Add client APIs to access probes ops

2020-10-22 Thread Dave Ertman

From: Ranjani Sridharan 

Add client APIs to invoke the platform-specific DSP probes
ops. Also, add a new API to get the SOF core device pointer
which will be used for DMA buffer allocation.

Reviewed-by: Pierre-Louis Bossart 
Tested-by: Fred Oh 
Signed-off-by: Ranjani Sridharan 
Signed-off-by: Dave Ertman 
---
 sound/soc/sof/sof-client.c | 55 ++
 sound/soc/sof/sof-client.h | 25 +
 2 files changed, 80 insertions(+)

diff --git a/sound/soc/sof/sof-client.c b/sound/soc/sof/sof-client.c
index dd75a0ba4c28..838aaa5ea179 100644
--- a/sound/soc/sof/sof-client.c
+++ b/sound/soc/sof/sof-client.c
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include "ops.h"
 #include "sof-client.h"
 #include "sof-priv.h"
 
@@ -112,4 +113,58 @@ struct dentry *sof_client_get_debugfs_root(struct sof_client_dev *cdev)
 }
 EXPORT_SYMBOL_NS_GPL(sof_client_get_debugfs_root, SND_SOC_SOF_CLIENT);
 
+#if IS_ENABLED(CONFIG_SND_SOC_SOF_DEBUG_PROBES_CLIENT)
+int sof_client_probe_compr_assign(struct sof_client_dev *cdev,
+ struct snd_compr_stream *cstream,
+ struct snd_soc_dai *dai)
+{
+   return snd_sof_probe_compr_assign(cdev->sdev, cstream, dai);
+}
+EXPORT_SYMBOL_NS_GPL(sof_client_probe_compr_assign, SND_SOC_SOF_CLIENT);
+
+int sof_client_probe_compr_free(struct sof_client_dev *cdev,
+   struct snd_compr_stream *cstream,
+   struct snd_soc_dai *dai)
+{
+   return snd_sof_probe_compr_free(cdev->sdev, cstream, dai);
+}
+EXPORT_SYMBOL_NS_GPL(sof_client_probe_compr_free, SND_SOC_SOF_CLIENT);
+
+int sof_client_probe_compr_set_params(struct sof_client_dev *cdev,
+ struct snd_compr_stream *cstream,
+ struct snd_compr_params *params,
+ struct snd_soc_dai *dai)
+{
+   return snd_sof_probe_compr_set_params(cdev->sdev, cstream, params, dai);
+}
+EXPORT_SYMBOL_NS_GPL(sof_client_probe_compr_set_params, SND_SOC_SOF_CLIENT);
+
+int sof_client_probe_compr_trigger(struct sof_client_dev *cdev,
+  struct snd_compr_stream *cstream, int cmd,
+  struct snd_soc_dai *dai)
+{
+   return snd_sof_probe_compr_trigger(cdev->sdev, cstream, cmd, dai);
+}
+EXPORT_SYMBOL_NS_GPL(sof_client_probe_compr_trigger, SND_SOC_SOF_CLIENT);
+
+int sof_client_probe_compr_pointer(struct sof_client_dev *cdev,
+  struct snd_compr_stream *cstream,
+  struct snd_compr_tstamp *tstamp,
+  struct snd_soc_dai *dai)
+{
+   return snd_sof_probe_compr_pointer(cdev->sdev, cstream, tstamp, dai);
+}
+EXPORT_SYMBOL_NS_GPL(sof_client_probe_compr_pointer, SND_SOC_SOF_CLIENT);
+#endif
+
+/*
+ * DMA buffer alloc fails when using the client device. Use the SOF core device instead.
+ * This will be needed for clients other than the probes client device as well.
+ */
+struct device *sof_client_get_dma_dev(struct sof_client_dev *cdev)
+{
+   return cdev->sdev->dev;
+}
+EXPORT_SYMBOL_NS_GPL(sof_client_get_dma_dev, SND_SOC_SOF_CLIENT);
+
 MODULE_LICENSE("GPL");
diff --git a/sound/soc/sof/sof-client.h b/sound/soc/sof/sof-client.h
index 429282df9f65..be80053068c9 100644
--- a/sound/soc/sof/sof-client.h
+++ b/sound/soc/sof/sof-client.h
@@ -7,6 +7,10 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
+#include 
 
 #define SOF_CLIENT_PROBE_TIMEOUT_MS 2000
 
@@ -50,6 +54,27 @@ int sof_client_ipc_tx_message(struct sof_client_dev *cdev, u32 header, void *msg
  size_t msg_bytes, void *reply_data, size_t reply_bytes);
 
 struct dentry *sof_client_get_debugfs_root(struct sof_client_dev *cdev);
+struct device *sof_client_get_dma_dev(struct sof_client_dev *cdev);
+
+#if IS_ENABLED(CONFIG_SND_SOC_SOF_DEBUG_PROBES_CLIENT)
+int sof_client_probe_compr_assign(struct sof_client_dev *cdev,
+ struct snd_compr_stream *cstream,
+ struct snd_soc_dai *dai);
+int sof_client_probe_compr_free(struct sof_client_dev *cdev,
+   struct snd_compr_stream *cstream,
+   struct snd_soc_dai *dai);
+int sof_client_probe_compr_set_params(struct sof_client_dev *cdev,
+ struct snd_compr_stream *cstream,
+ struct snd_compr_params *params,
+ struct snd_soc_dai *dai);
+int sof_client_probe_compr_trigger(struct sof_client_dev *cdev,
+  struct snd_compr_stream *cstream, int cmd,
+  struct snd_soc_dai *dai);
+int sof_client_probe_compr_pointer(struct sof_client_dev *cdev,
+  struct snd_compr_stream *cstream,
+

[PATCH v3 10/10] ASoC: SOF: Intel: CNL: register probes client

2020-10-22 Thread Dave Ertman

From: Ranjani Sridharan 

Register the client device for probes support on the
CNL platform. Creating this client device alleviates the
need for modifying the sound card definitions in the existing
machine drivers to add support for the new probes feature in
the FW. This will result in the creation of a separate sound
card that can be used for audio data extraction from
user-specified points in the audio pipeline.

Reviewed-by: Pierre-Louis Bossart 
Tested-by: Fred Oh 
Signed-off-by: Ranjani Sridharan 
Signed-off-by: Dave Ertman 
---
 sound/soc/sof/intel/cnl.c | 18 +-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/sound/soc/sof/intel/cnl.c b/sound/soc/sof/intel/cnl.c
index 20afb622c315..6d15b871dc17 100644
--- a/sound/soc/sof/intel/cnl.c
+++ b/sound/soc/sof/intel/cnl.c
@@ -19,6 +19,7 @@
 #include "hda.h"
 #include "hda-ipc.h"
 #include "../sof-audio.h"
+#include "../sof-client.h"
 #include "intel-client.h"
 
 static const struct snd_sof_debugfs_map cnl_dsp_debugfs[] = {
@@ -233,12 +234,26 @@ void cnl_ipc_dump(struct snd_sof_dev *sdev)
 
 static int cnl_register_clients(struct snd_sof_dev *sdev)
 {
-   return intel_register_ipc_test_clients(sdev);
+   int ret;
+
+   ret = intel_register_ipc_test_clients(sdev);
+   if (ret < 0)
+   return ret;
+
+#if IS_ENABLED(CONFIG_SND_SOC_SOF_HDA_PROBES)
+   return sof_client_dev_register(sdev, "probes", 0);
+#endif
+
+   return 0;
 }
 
 static void cnl_unregister_clients(struct snd_sof_dev *sdev)
 {
intel_unregister_ipc_test_clients(sdev);
+
+#if IS_ENABLED(CONFIG_SND_SOC_SOF_HDA_PROBES)
+   sof_client_dev_unregister(sdev, "probes", 0);
+#endif
 }
 
 /* cannonlake ops */
@@ -409,3 +424,4 @@ const struct sof_intel_dsp_desc jsl_chip_info = {
 };
 EXPORT_SYMBOL_NS(jsl_chip_info, SND_SOC_SOF_INTEL_HDA_COMMON);
 MODULE_IMPORT_NS(SND_SOC_SOF_INTEL_CLIENT);
+MODULE_IMPORT_NS(SND_SOC_SOF_CLIENT);
-- 
2.26.2

[PATCH v3 04/10] ASoC: SOF: ops: Add ops for client registration

2020-10-22 Thread Dave Ertman

From: Ranjani Sridharan 

Add new ops for registering/unregistering clients based
on DSP capabilities and/or DT information.

Reviewed-by: Pierre-Louis Bossart 
Signed-off-by: Ranjani Sridharan 
Signed-off-by: Dave Ertman 
---
 sound/soc/sof/core.c | 10 ++
 sound/soc/sof/ops.h  | 14 ++
 sound/soc/sof/sof-priv.h |  4 
 3 files changed, 28 insertions(+)

diff --git a/sound/soc/sof/core.c b/sound/soc/sof/core.c
index 72a97219395f..ddb9a12d5aac 100644
--- a/sound/soc/sof/core.c
+++ b/sound/soc/sof/core.c
@@ -246,8 +246,17 @@ static int sof_probe_continue(struct snd_sof_dev *sdev)
if (plat_data->sof_probe_complete)
plat_data->sof_probe_complete(sdev->dev);
 
+   /* If registering certain clients fails, unregister the previously registered clients. */
+   ret = snd_sof_register_clients(sdev);
+   if (ret < 0) {
+   dev_err(sdev->dev, "error: failed to register clients %d\n", ret);
+   goto client_reg_err;
+   }
+
return 0;
 
+client_reg_err:
+   snd_sof_unregister_clients(sdev);
 fw_trace_err:
snd_sof_free_trace(sdev);
 fw_run_err:
@@ -356,6 +365,7 @@ int snd_sof_device_remove(struct device *dev)
dev_warn(dev, "error: %d failed to prepare DSP for device removal",
 ret);
 
+   snd_sof_unregister_clients(sdev);
snd_sof_fw_unload(sdev);
snd_sof_ipc_free(sdev);
snd_sof_free_debug(sdev);
diff --git a/sound/soc/sof/ops.h b/sound/soc/sof/ops.h
index b21632f5511a..00370f8bcd75 100644
--- a/sound/soc/sof/ops.h
+++ b/sound/soc/sof/ops.h
@@ -470,6 +470,20 @@ snd_sof_set_mach_params(const struct snd_soc_acpi_mach *mach,
sof_ops(sdev)->set_mach_params(mach, dev);
 }
 
+static inline int snd_sof_register_clients(struct snd_sof_dev *sdev)
+{
+   if (sof_ops(sdev) && sof_ops(sdev)->register_clients)
+   return sof_ops(sdev)->register_clients(sdev);
+
+   return 0;
+}
+
+static inline void snd_sof_unregister_clients(struct snd_sof_dev *sdev)
+{
+   if (sof_ops(sdev) && sof_ops(sdev)->unregister_clients)
+   sof_ops(sdev)->unregister_clients(sdev);
+}
+
 static inline const struct snd_sof_dsp_ops
 *sof_get_ops(const struct sof_dev_desc *d,
 const struct sof_ops_table mach_ops[], int asize)
diff --git a/sound/soc/sof/sof-priv.h b/sound/soc/sof/sof-priv.h
index dceac73b858f..cca239c09d0e 100644
--- a/sound/soc/sof/sof-priv.h
+++ b/sound/soc/sof/sof-priv.h
@@ -252,6 +252,10 @@ struct snd_sof_dsp_ops {
void (*set_mach_params)(const struct snd_soc_acpi_mach *mach,
struct device *dev); /* optional */
 
+   /* client ops */
+   int (*register_clients)(struct snd_sof_dev *sdev); /* optional */
+   void (*unregister_clients)(struct snd_sof_dev *sdev); /* optional */
+
/* DAI ops */
struct snd_soc_dai_driver *drv;
int num_drv;
-- 
2.26.2
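
The optional-op helpers snd_sof_register_clients()/snd_sof_unregister_clients() and the new client_reg_err unwinding in sof_probe_continue() can be modelled in userspace C. This is a minimal sketch, not kernel code: struct demo_ops, struct demo_dev, demo_probe_continue() and the failing platform op are hypothetical stand-ins for struct snd_sof_dsp_ops, struct snd_sof_dev and a platform's register_clients callback.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

struct demo_dev;

/* Hypothetical stand-in for the client ops added to struct snd_sof_dsp_ops. */
struct demo_ops {
	int (*register_clients)(struct demo_dev *dev);    /* optional */
	void (*unregister_clients)(struct demo_dev *dev); /* optional */
};

struct demo_dev {
	const struct demo_ops *ops;
	int clients;     /* number of clients currently registered */
	int trace_ready; /* models state set up by an earlier probe step */
};

/* Mirrors snd_sof_register_clients(): a missing op is not an error. */
static int demo_register_clients(struct demo_dev *dev)
{
	if (dev->ops && dev->ops->register_clients)
		return dev->ops->register_clients(dev);
	return 0;
}

/* Mirrors snd_sof_unregister_clients(): a missing op is a no-op. */
static void demo_unregister_clients(struct demo_dev *dev)
{
	if (dev->ops && dev->ops->unregister_clients)
		dev->ops->unregister_clients(dev);
}

/* Models the tail of sof_probe_continue(): unwind in reverse order on error. */
static int demo_probe_continue(struct demo_dev *dev)
{
	int ret;

	assert(dev != NULL);
	dev->trace_ready = 1; /* earlier step, undone on failure */

	ret = demo_register_clients(dev);
	if (ret < 0) {
		fprintf(stderr, "error: failed to register clients %d\n", ret);
		goto client_reg_err;
	}
	return 0;

client_reg_err:
	/* drop any clients registered before the failing one */
	demo_unregister_clients(dev);
	dev->trace_ready = 0; /* then fall through to earlier cleanup */
	return ret;
}

/* Hypothetical platform op: the first client registers, the second fails. */
static int demo_partial_register(struct demo_dev *dev)
{
	dev->clients = 1;
	return -ENODEV;
}

static void demo_unregister_all(struct demo_dev *dev)
{
	dev->clients = 0;
}

static const struct demo_ops demo_failing_ops = {
	.register_clients = demo_partial_register,
	.unregister_clients = demo_unregister_all,
};
```

A device without client ops probes normally. A platform op that fails part-way is cleaned up by the core calling unregister_clients(). The sketch assumes, as the patch appears to, that the platform's unregister op is safe to call after a partial registration.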

[PATCH v3 05/10] ASoC: SOF: Intel: Define ops for client registration

2020-10-22 Thread Dave Ertman

From: Ranjani Sridharan 

Define client ops for Intel platforms. For now, we only add
2 IPC test clients that will be used to run tandem IPC flood
tests.

For ACPI platforms, change the Kconfig to select
SND_SOC_SOF_PROBE_WORK_QUEUE to allow the auxiliary driver
to probe when the client is registered.

Reviewed-by: Pierre-Louis Bossart 
Signed-off-by: Ranjani Sridharan 
Co-developed-by: Fred Oh 
Signed-off-by: Fred Oh 
Signed-off-by: Dave Ertman 
---
 sound/soc/sof/intel/Kconfig|  9 +++
 sound/soc/sof/intel/Makefile   |  3 +++
 sound/soc/sof/intel/apl.c  | 16 
 sound/soc/sof/intel/bdw.c  | 16 
 sound/soc/sof/intel/byt.c  | 20 +++
 sound/soc/sof/intel/cnl.c  | 16 
 sound/soc/sof/intel/intel-client.c | 40 ++
 sound/soc/sof/intel/intel-client.h | 26 +++
 8 files changed, 146 insertions(+)
 create mode 100644 sound/soc/sof/intel/intel-client.c
 create mode 100644 sound/soc/sof/intel/intel-client.h

diff --git a/sound/soc/sof/intel/Kconfig b/sound/soc/sof/intel/Kconfig
index a066e08860cb..b449fa2f8005 100644
--- a/sound/soc/sof/intel/Kconfig
+++ b/sound/soc/sof/intel/Kconfig
@@ -13,6 +13,8 @@ config SND_SOC_SOF_INTEL_ACPI
def_tristate SND_SOC_SOF_ACPI
select SND_SOC_SOF_BAYTRAIL  if SND_SOC_SOF_BAYTRAIL_SUPPORT
select SND_SOC_SOF_BROADWELL if SND_SOC_SOF_BROADWELL_SUPPORT
+   select SND_SOC_SOF_PROBE_WORK_QUEUE if SND_SOC_SOF_CLIENT
+   select SND_SOC_SOF_INTEL_CLIENT if SND_SOC_SOF_CLIENT
help
  This option is not user-selectable but automagically handled by
  'select' statements at a higher level
@@ -29,6 +31,7 @@ config SND_SOC_SOF_INTEL_PCI
select SND_SOC_SOF_TIGERLAKE   if SND_SOC_SOF_TIGERLAKE_SUPPORT
select SND_SOC_SOF_ELKHARTLAKE if SND_SOC_SOF_ELKHARTLAKE_SUPPORT
select SND_SOC_SOF_JASPERLAKE  if SND_SOC_SOF_JASPERLAKE_SUPPORT
+   select SND_SOC_SOF_INTEL_CLIENT if SND_SOC_SOF_CLIENT
help
  This option is not user-selectable but automagically handled by
  'select' statements at a higher level
@@ -57,6 +60,12 @@ config SND_SOC_SOF_INTEL_COMMON
  This option is not user-selectable but automagically handled by
  'select' statements at a higher level
 
+config SND_SOC_SOF_INTEL_CLIENT
+   tristate
+   help
+ This option is not user-selectable but automagically handled by
+ 'select' statements at a higher level.
+
 if SND_SOC_SOF_INTEL_ACPI
 
 config SND_SOC_SOF_BAYTRAIL_SUPPORT
diff --git a/sound/soc/sof/intel/Makefile b/sound/soc/sof/intel/Makefile
index 72d85b25df7d..683e64c627c1 100644
--- a/sound/soc/sof/intel/Makefile
+++ b/sound/soc/sof/intel/Makefile
@@ -5,6 +5,8 @@ snd-sof-intel-bdw-objs := bdw.o
 
 snd-sof-intel-ipc-objs := intel-ipc.o
 
+snd-sof-intel-client-objs := intel-client.o
+
 snd-sof-intel-hda-common-objs := hda.o hda-loader.o hda-stream.o hda-trace.o \
 hda-dsp.o hda-ipc.o hda-ctrl.o hda-pcm.o \
 hda-dai.o hda-bus.o \
@@ -18,3 +20,4 @@ obj-$(CONFIG_SND_SOC_SOF_BROADWELL) += snd-sof-intel-bdw.o
 obj-$(CONFIG_SND_SOC_SOF_INTEL_HIFI_EP_IPC) += snd-sof-intel-ipc.o
 obj-$(CONFIG_SND_SOC_SOF_HDA_COMMON) += snd-sof-intel-hda-common.o
 obj-$(CONFIG_SND_SOC_SOF_HDA) += snd-sof-intel-hda.o
+obj-$(CONFIG_SND_SOC_SOF_INTEL_CLIENT) += snd-sof-intel-client.o
diff --git a/sound/soc/sof/intel/apl.c b/sound/soc/sof/intel/apl.c
index 4eeade2e77f7..ce2dcd6aa7de 100644
--- a/sound/soc/sof/intel/apl.c
+++ b/sound/soc/sof/intel/apl.c
@@ -18,6 +18,7 @@
 #include "../sof-priv.h"
 #include "hda.h"
 #include "../sof-audio.h"
+#include "intel-client.h"
 
 static const struct snd_sof_debugfs_map apl_dsp_debugfs[] = {
{"hda", HDA_DSP_HDA_BAR, 0, 0x4000, SOF_DEBUGFS_ACCESS_ALWAYS},
@@ -25,6 +26,16 @@ static const struct snd_sof_debugfs_map apl_dsp_debugfs[] = {
{"dsp", HDA_DSP_BAR,  0, 0x1, SOF_DEBUGFS_ACCESS_ALWAYS},
 };
 
+static int apl_register_clients(struct snd_sof_dev *sdev)
+{
+   return intel_register_ipc_test_clients(sdev);
+}
+
+static void apl_unregister_clients(struct snd_sof_dev *sdev)
+{
+   intel_unregister_ipc_test_clients(sdev);
+}
+
 /* apollolake ops */
 const struct snd_sof_dsp_ops sof_apl_ops = {
/* probe and remove */
@@ -101,6 +112,10 @@ const struct snd_sof_dsp_ops sof_apl_ops = {
.trace_release = hda_dsp_trace_release,
.trace_trigger = hda_dsp_trace_trigger,
 
+   /* client ops */
+   .register_clients = apl_register_clients,
+   .unregister_clients = apl_unregister_clients,
+
/* DAI drivers */
.drv= skl_dai,
.num_drv= SOF_SKL_NUM_DAIS,
@@ -140,3 +155,4 @@ const struct sof_intel_dsp_desc apl_chip_info = {
.ssp_base_offset = APL_SSP_BASE_OFFSET,
 };
 EXPORT_SYMBOL_NS(apl_chip_info, SND_SOC_SOF_INTEL_HDA_COMMON);

[PATCH v3 01/10] Add auxiliary bus support

2020-10-22 Thread Dave Ertman

Add support for the Auxiliary Bus, auxiliary_device and auxiliary_driver.
It enables drivers to create an auxiliary_device and bind an
auxiliary_driver to it.

The bus supports probe/remove, shutdown and suspend/resume callbacks.
Each auxiliary_device has a unique string-based id; a driver binds to
an auxiliary_device based on this id through the bus.

Co-developed-by: Kiran Patil 
Signed-off-by: Kiran Patil 
Co-developed-by: Ranjani Sridharan 
Signed-off-by: Ranjani Sridharan 
Co-developed-by: Fred Oh 
Signed-off-by: Fred Oh 
Co-developed-by: Leon Romanovsky 
Signed-off-by: Leon Romanovsky 
Reviewed-by: Pierre-Louis Bossart 
Reviewed-by: Shiraz Saleem 
Reviewed-by: Parav Pandit 
Reviewed-by: Dan Williams 
Signed-off-by: Dave Ertman 
---
 Documentation/driver-api/auxiliary_bus.rst | 228 ++
 Documentation/driver-api/index.rst |   1 +
 drivers/base/Kconfig   |   3 +
 drivers/base/Makefile  |   1 +
 drivers/base/auxiliary.c   | 267 +
 include/linux/auxiliary_bus.h  |  78 ++
 include/linux/mod_devicetable.h|   8 +
 scripts/mod/devicetable-offsets.c  |   3 +
 scripts/mod/file2alias.c   |   8 +
 9 files changed, 597 insertions(+)
 create mode 100644 Documentation/driver-api/auxiliary_bus.rst
 create mode 100644 drivers/base/auxiliary.c
 create mode 100644 include/linux/auxiliary_bus.h

diff --git a/Documentation/driver-api/auxiliary_bus.rst b/Documentation/driver-api/auxiliary_bus.rst
new file mode 100644
index ..500f29692c81
--- /dev/null
+++ b/Documentation/driver-api/auxiliary_bus.rst
@@ -0,0 +1,228 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+
+=============
+Auxiliary Bus
+=============
+
+In some subsystems, the functionality of the core device (PCI/ACPI/other) is
+too complex for a single device to be managed as a monolithic block or a part of
+the functionality needs to be exposed to a different subsystem.  Splitting the
+functionality into smaller orthogonal devices would make it easier to manage
+data, power management and domain-specific interaction with the hardware. A key
+requirement for such a split is that there is no dependency on a physical bus,
+device, register accesses or regmap support. These individual devices split from
+the core cannot live on the platform bus as they are not physical devices that
+are controlled by DT/ACPI. The same argument applies for not using MFD in this
+scenario as MFD relies on individual function devices being physical devices.
+
+An example for this kind of requirement is the audio subsystem where a single
+IP is handling multiple entities such as HDMI, SoundWire, local devices such as
+mics/speakers etc. The split for the core's functionality can be arbitrary or
+be defined by the DSP firmware topology and include hooks for test/debug. This
+allows for the audio core device to be minimal and focused on hardware-specific
+control and communication.
+
+The auxiliary bus is intended to be minimal, generic and avoid domain-specific
+assumptions. Each auxiliary_device represents a part of its parent
+functionality. The generic behavior can be extended and specialized as needed
+by encapsulating an auxiliary_device within other domain-specific structures and
+the use of .ops callbacks. Devices on the auxiliary bus do not share any
+structures and the use of a communication channel with the parent is
+domain-specific.
+
+When Should the Auxiliary Bus Be Used
+=====================================
+
+The auxiliary bus is to be used when a driver and one or more kernel modules,
+which share a common header file with the driver, need a mechanism to connect and
+provide access to a shared object allocated by the auxiliary_device's
+registering driver.  The registering driver for the auxiliary_device(s) and the
+kernel module(s) registering auxiliary_drivers can be from the same subsystem,
+or from multiple subsystems.
+
+The emphasis here is on a common generic interface that keeps subsystem
+customization out of the bus infrastructure.
+
+One example could be a multi-port PCI network device that is rdma-capable and
+needs to export this functionality and attach to an rdma driver in another
+subsystem.  The PCI driver will allocate and register an auxiliary_device for
+each physical function on the NIC.  The rdma driver will register an
+auxiliary_driver that will be matched with and probed for each of these
+auxiliary_devices.  This will give the rdma driver access to the shared data/ops
+in the PCI driver's shared object to establish a connection with the PCI driver.
+
+Another use case is for the PCI device to be split out into multiple sub
+functions.  For each sub function an auxiliary_device will be created.  A PCI
+sub function driver will bind to such devices that will create its own one or
+more class devices.  A PCI sub function auxiliary device will likely be
+contained in a struct with additional

[PATCH v3 00/10] Auxiliary bus implementation and SOF multi-client support

2020-10-22 Thread Dave Ertman

Brief history of Auxiliary Bus
==============================
The auxiliary bus code was originally submitted upstream as virtual
bus, and was submitted through the netdev tree.  This process generated
up to v4.  This discussion can be found here:
https://lore.kernel.org/netdev/2019192219.30259-1-jeffrey.t.kirs...@intel.com/#t

At this point, GregKH requested that we take the review and revision
process to an internal mailing list and garner the buy-in of a respected
kernel contributor.

The auxiliary bus (then known as virtual bus) was originally submitted
along with implementation code for the ice driver and irdma driver,
causing the complication of also having dependencies in the rdma tree.
This new submission uses an auxiliary bus consumer only in the
sound driver tree to create the initial implementation and a single
user.

Since implementation work started on this patch set, there have been
multiple inquiries about the time frame of its completion.  It appears
that there will be numerous consumers of this functionality.

The process of internal review and implementation using the sound
drivers generated 19 internal versions.  The changes, including the name
change from virtual bus to auxiliary bus, from these versions can be
summarized as the following:

- Fixed compilation and checkpatch errors
- Improved documentation to address the motivation for virtual bus.
- Renamed virtual bus to auxiliary bus
- increased maximum device name size
- Correct order in Kconfig and Makefile
- removed the mid-layer adev->release layer for device unregister
- pushed adev->id management to parent driver
- all error paths out of ancillary_device_register return error code
- all error paths out of ancillary_device_register use put_device
- added adev->name element
- modname in register cannot be NULL
- added KBUILD_MODNAME as prefix for match_name
- push adev->id responsibility to registering driver
- uevent now parses adev->dev name
- match_id function now parses adev->dev name
- changed driver's probe function to also take an ancillary_device_id param
- split ancillary_device_register into device_initialize and device_add
- adjusted what is done in device_initialize and device_add
- change adev to ancildev and adrv to ancildrv
- change adev to ancildev in documentation
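
The list above says the match function now parses the auxiliary device's dev name and that KBUILD_MODNAME is used as a prefix. The following is a minimal userspace sketch of such a rule. It assumes device names of the form "modname.devname.id" and driver id entries of the form "modname.devname", so only the part before the last '.' is compared. demo_aux_match(), demo_aux_match_table() and the "snd_sof.*" names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical model of auxiliary-bus matching: a device named
 * "modname.devname.id" matches a driver id entry "modname.devname".
 * Only the prefix before the last '.' is compared, so every instance
 * id of the same device type binds to the same driver.
 */
static bool demo_aux_match(const char *dev_name, const char *id_name)
{
	const char *last_dot;
	size_t prefix_len;

	assert(dev_name != NULL && id_name != NULL);

	last_dot = strrchr(dev_name, '.');
	if (!last_dot)
		return false; /* no instance id suffix: cannot match */

	prefix_len = (size_t)(last_dot - dev_name);
	return strlen(id_name) == prefix_len &&
	       strncmp(dev_name, id_name, prefix_len) == 0;
}

/* Walks a NULL-terminated id table, as a driver's id table would be walked. */
static const char *demo_aux_match_table(const char *dev_name,
					const char *const *table)
{
	for (; *table; table++)
		if (demo_aux_match(dev_name, *table))
			return *table;
	return NULL;
}
```

Comparing only the prefix lets one driver bind to every instance of a device type, for example two IPC test clients registered with ids 0 and 1. The exact length check stops "snd_sof.ipc" from matching "snd_sof.ipc_test.0".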

==

This series introduces the auxiliary bus implementation along with an
example usage in the Sound Open Firmware (SOF) audio driver.

In some subsystems, the functionality of the core device
(PCI/ACPI/other) may be too complex for a single device to be managed as
a monolithic block or a part of the functionality might need to be
exposed to a different subsystem.  Splitting the functionality into
smaller orthogonal devices makes it easier to manage data, power
management and domain-specific communication with the hardware.  Also,
common auxiliary_device functionality across primary devices can be
handled by a common auxiliary_device. A key requirement for such a split
is that there is no dependency on a physical bus, device, register
accesses or regmap support. These individual devices split from the core
cannot live on the platform bus as they are not physical devices that
are controlled by DT/ACPI. The same argument applies for not using MFD
in this scenario as it relies on individual function devices being
physical devices that are DT enumerated.

An example for this kind of requirement is the audio subsystem where a
single IP handles multiple entities such as HDMI, Soundwire, local
devices such as mics/speakers etc. The split for the core's
functionality can be arbitrary or be defined by the DSP firmware
topology and include hooks for test/debug. This allows for the audio
core device to be minimal and tightly coupled with handling the
hardware-specific logic and communication.

The auxiliary bus is intended to be minimal, generic and avoid
domain-specific assumptions. Each auxiliary bus device represents a part
of its parent functionality. The generic behavior can be extended and
specialized as needed by encapsulating an auxiliary bus device within
other domain-specific structures and the use of .ops callbacks.
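The encapsulation pattern described above (a generic device embedded in a domain-specific structure, plus an ops table) can be sketched in plain userspace C. This is an illustration of the idiom only, not the auxiliary bus API: every struct and function name below is hypothetical, and container_of() is redefined locally with offsetof() because the kernel header is not available in userspace.

```c
#include <stddef.h>

/* Userspace stand-in for the kernel's container_of() macro. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical generic device: the only part a bus core would see. */
struct demo_aux_device {
	const char *name;
	unsigned int id;
};

struct demo_client;

/* Hypothetical domain-specific callbacks supplied by the parent driver. */
struct demo_client_ops {
	int (*send_ipc)(struct demo_client *client, int msg);
};

/* Domain-specific wrapper that embeds the generic device. */
struct demo_client {
	struct demo_aux_device adev;       /* generic part registered on the bus */
	const struct demo_client_ops *ops; /* specialized behavior */
	int last_msg;                      /* state private to this domain */
};

static int demo_send_ipc(struct demo_client *client, int msg)
{
	client->last_msg = msg; /* a real client would talk to the DSP here */
	return 0;
}

static const struct demo_client_ops demo_ops = {
	.send_ipc = demo_send_ipc,
};

/*
 * Code that is handed only the generic device recovers the enclosing
 * wrapper with container_of() and dispatches through its ops table.
 */
static int demo_client_send(struct demo_aux_device *adev, int msg)
{
	struct demo_client *client =
		container_of(adev, struct demo_client, adev);

	return client->ops->send_ipc(client, msg);
}
```

The generic layer never needs to know the wrapper's layout; only the code that owns the wrapper performs the container_of() step, which keeps the generic part minimal.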

The SOF driver adopts the auxiliary bus for implementing the
multi-client support. A client in the context of the SOF driver
represents a part of the core device's functionality. It is not a
physical device but rather an auxiliary device that needs to communicate
with the DSP via IPCs. With multi-client support, the sound card can be
separated into multiple orthogonal auxiliary devices for local devices
(mic/speakers etc), HDMI, sensing, probes, debug etc.  In this series,
we demonstrate the usage of the auxiliary bus with the help of the IPC
test client which is used for testing the serialization of IPCs when
multiple clients talk to the DSP at the same time.

v3 changes:
rename to auxiliary bus
move .c file to drivers/base/
split auxdev unregister flow into uninitialize and delete steps
update kernel-doc on

Re: [RFC 1/2] printk: Add kernel parameter: mute_console

2020-10-22 Thread Sergey Senozhatsky

On (20/10/22 13:42), Petr Mladek wrote:
> +static bool mute_console;
> +
> +static int __init mute_console_setup(char *str)
> +{
> + mute_console = true;
> + pr_info("All consoles muted.\n");
> +
> + return 0;
> +}

First of all, thanks a lot for picking this up and for the patch set!

I've several thoughts and comments below.

>  static bool suppress_message_printing(int level)
>  {
> - return (level >= console_loglevel && !ignore_loglevel);
> + if (unlikely(mute_console))
> + return true;
> +
> + if (unlikely(ignore_loglevel))
> + return false;
> +
> + return (level >= console_loglevel);
>  }
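The order of the checks in the quoted change matters: mute_console wins even when ignore_loglevel is set. A minimal userspace model of that decision, assuming plain global flags in place of the kernel variables, dropping the unlikely() hints, and assuming the common default console_loglevel of 7:

```c
#include <stdbool.h>

/* Plain flags standing in for the kernel variables in the quoted patch. */
static bool mute_console;
static bool ignore_loglevel;
static int console_loglevel = 7; /* assumption: typical default */

/* Returns true when a message at @level should not reach the consoles. */
static bool suppress_message_printing(int level)
{
	if (mute_console)    /* muting overrides everything, even ignore_loglevel */
		return true;
	if (ignore_loglevel) /* print every level */
		return false;
	/* Lower numbers are more severe (0 == KERN_EMERG). */
	return level >= console_loglevel;
}
```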

This is one way of doing it. Another one is to clear the CON_ENABLED bit
from all consoles (upon registration); one upside of this is that we
will signal to user-space that consoles are disabled/muted (by removing
the E flag from /proc/consoles).

But, unless I'm mistaken, this mutes only the printk side; consoles still
have the uart running:
printf -> tty -> uart -> serial_driver_IRQ() -> TX
serial_driver_IRQ() -> RX -> uart -> tty

so user space, in theory, still can push messages to slow consoles,
AFAIU.

Thinking more about it: we are still relying on there being something
registered as a console driver, which is not necessarily the case; the
console drivers list can be NULL. So how about having a dummy struct
console in printk, with NOP read/write, a NOP tty_driver and NOP
tty_operations? Then, when init calls filp_open("/dev/console") and
we can't give tty anything but NULL, we'd just give tty back the dummy
NOP device.
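The dummy-console fallback proposed here can be modeled in a few lines of userspace C. This is a sketch under heavy simplification: demo_console, nop_write() and console_for_open() are hypothetical and much smaller than the real struct console and tty plumbing. The point is only that the lookup never returns NULL.

```c
#include <stddef.h>

/* Hypothetical, heavily simplified console descriptor. */
struct demo_console {
	const char *name;
	void (*write)(struct demo_console *con, const char *s, unsigned int n);
	struct demo_console *next;
};

/* NOP write: accepts output and discards it. */
static void nop_write(struct demo_console *con, const char *s, unsigned int n)
{
	(void)con;
	(void)s;
	(void)n;
}

static struct demo_console dummy_console = {
	.name = "dummy",
	.write = nop_write,
	.next = NULL,
};

/* Head of the registered console list; may legitimately be empty. */
static struct demo_console *console_list;

/*
 * What an open of /dev/console would bind to: the first registered
 * console, or the NOP dummy instead of NULL when nothing is registered.
 */
static struct demo_console *console_for_open(void)
{
	return console_list ? console_list : &dummy_console;
}
```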

-ss

[tip:x86/urgent] BUILD SUCCESS abee7c494d8c41bb388839bccc47e06247f0d7de

2020-10-22 Thread kernel test robot

allyesconfig
powerpc     socrates_defconfig
c6x         evmc6678_defconfig
mips        omega2p_defconfig
ia64        gensparse_defconfig
arm         ebsa110_defconfig
powerpc     mvme5100_defconfig
arm         rpc_defconfig
powerpc     ppc64e_defconfig
ia64        allmodconfig
ia64        defconfig
ia64        allyesconfig
m68k        allmodconfig
m68k        defconfig
m68k        allyesconfig
nios2       defconfig
nds32       allnoconfig
c6x         allyesconfig
nds32       defconfig
nios2       allyesconfig
csky        defconfig
alpha       defconfig
alpha       allyesconfig
xtensa      allyesconfig
h8300       allyesconfig
s390        allyesconfig
parisc      allyesconfig
s390        defconfig
i386        allyesconfig
sparc       defconfig
i386        defconfig
mips        allyesconfig
mips        allmodconfig
powerpc     allyesconfig
powerpc     allmodconfig
powerpc     allnoconfig
i386        randconfig-a002-20201022
i386        randconfig-a005-20201022
i386        randconfig-a003-20201022
i386        randconfig-a001-20201022
i386        randconfig-a006-20201022
i386        randconfig-a004-20201022
i386        randconfig-a002-20201023
i386        randconfig-a005-20201023
i386        randconfig-a003-20201023
i386        randconfig-a001-20201023
i386        randconfig-a006-20201023
i386        randconfig-a004-20201023
x86_64      randconfig-a011-20201022
x86_64      randconfig-a013-20201022
x86_64      randconfig-a016-20201022
x86_64      randconfig-a015-20201022
x86_64      randconfig-a012-20201022
x86_64      randconfig-a014-20201022
i386        randconfig-a016-20201022
i386        randconfig-a014-20201022
i386        randconfig-a015-20201022
i386        randconfig-a012-20201022
i386        randconfig-a013-20201022
i386        randconfig-a011-20201022
riscv       allyesconfig
riscv       nommu_virt_defconfig
riscv       allnoconfig
riscv       defconfig
riscv       rv32_defconfig
riscv       allmodconfig
x86_64      rhel
x86_64      allyesconfig
x86_64      rhel-7.6-kselftests
x86_64      defconfig
x86_64      rhel-8.3
x86_64      kexec

clang tested configs:
x86_64   randconfig-a001-20201022
x86_64   randconfig-a002-20201022
x86_64   randconfig-a003-20201022
x86_64   randconfig-a006-20201022
x86_64   randconfig-a004-20201022
x86_64   randconfig-a005-20201022

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org

Re: mmstress[1309]: segfault at 7f3d71a36ee8 ip 00007f3d77132bdf sp 00007f3d71a36ee8 error 4 in libc-2.27.so[7f3d77058000+1aa000]

2020-10-22 Thread Linus Torvalds

On Thu, Oct 22, 2020 at 5:11 PM Linus Torvalds
 wrote:
>
> In particular, I wonder if it's that KASAN causes some reload pattern,
> and the whole
>
>  register __typeof__(*(ptr)) __val_pu asm("%"_ASM_AX);
> ..
>  asm volatile(.. "r" (__val_pu) ..)
>
> thing causes problems.

That pattern isn't new (see the same pattern and the comment above get_user).

But our previous use of that pattern had it as an output of the asm,
and the new use is as an input. That obviously shouldn't matter, but
if it's some odd compiler code generation interaction, all bets are
off..

Linus

Re: linux-next: build warning after merge of the block tree

2020-10-22 Thread Jens Axboe

On 10/22/20 5:48 PM, Stephen Rothwell wrote:
> Hi all,
> 
> After merging the block tree, today's linux-next build (KCONFIG_NAME)
> produced this warning:
> 
> fs/io_uring.c: In function 'loop_rw_iter':
> fs/io_uring.c:3141:21: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
>  3141 |iovec.iov_base = (void __user *) req->rw.addr;
>   | ^
> 
> Introduced by commit
> 
>   a5371db1e38d ("io_uring: make loop_rw_iter() use original user supplied 
> pointers")

Thanks, not sure why I didn't use u64_to_user_ptr() in the first
place - updated now.
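For context on the fix: the kernel's u64_to_user_ptr() macro casts through uintptr_t, so narrowing a 64-bit value to a 32-bit pointer is explicit and -Wint-to-pointer-cast stays quiet. A minimal userspace analogue, with hypothetical function names and without the kernel's __user annotation and typecheck():

```c
#include <stdint.h>

/*
 * Going through uintptr_t makes the (possible) narrowing on 32-bit
 * targets explicit, so the compiler does not warn about the cast.
 */
static inline void *u64_to_ptr(uint64_t v)
{
	return (void *)(uintptr_t)v;
}

/* The reverse direction, as used when userspace fills in a u64 field. */
static inline uint64_t ptr_to_u64(const void *p)
{
	return (uint64_t)(uintptr_t)p;
}
```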

-- 
Jens Axboe

Re: [PATCH] KVM: X86: Expose KVM_HINTS_REALTIME in KVM_GET_SUPPORTED_CPUID

2020-10-22 Thread Wanpeng Li

On Thu, 22 Oct 2020 at 21:02, Paolo Bonzini  wrote:
>
> On 22/10/20 03:34, Wanpeng Li wrote:
> > From: Wanpeng Li 
> >
> > Per KVM_GET_SUPPORTED_CPUID ioctl documentation:
> >
> > This ioctl returns x86 cpuid features which are supported by both the
> > hardware and kvm in its default configuration.
> >
> > A well-behaved userspace should not set the bit if it is not supported.
> >
> > Suggested-by: Jim Mattson 
> > Signed-off-by: Wanpeng Li 
>
> It's common for userspace to copy all supported CPUID bits to
> KVM_SET_CPUID2; I don't think this is the right behavior for
> KVM_HINTS_REALTIME.
>
> (But maybe this was discussed already; if so, please point me to the
> previous discussion).

The discussion is here. :) https://www.spinics.net/lists/kvm/msg227265.html

Wanpeng

Re: [PATCH/RFC net] net: dec: tulip: de2104x: Add shutdown handler to stop NIC

2020-10-22 Thread Moritz Fischer

On Thu, Oct 22, 2020 at 04:04:16PM -0700, James Bottomley wrote:
> On Thu, 2020-10-22 at 15:06 -0700, Moritz Fischer wrote:
> > The driver does not implement a shutdown handler, which leads to
> > issues when using kexec in certain scenarios. The NIC keeps on
> > fetching descriptors, which get flagged by the IOMMU with errors
> > like this:
> > 
> > DMAR: DMAR:[DMA read] Request device [5e:00.0]fault addr f000
> > DMAR: DMAR:[DMA read] Request device [5e:00.0]fault addr f000
> > DMAR: DMAR:[DMA read] Request device [5e:00.0]fault addr f000
> > DMAR: DMAR:[DMA read] Request device [5e:00.0]fault addr f000
> > DMAR: DMAR:[DMA read] Request device [5e:00.0]fault addr f000
> > 
> > Signed-off-by: Moritz Fischer 
> > ---
> > 
> > Hi all,
> > 
> > I'm not sure if this is the proper way to implement a shutdown
> > handler. I've tried to look at a bunch of examples and couldn't find
> > a specific solution; in my tests on hardware this works, though.
> > 
> > Open to suggestions.
> > 
> > Thanks,
> > Moritz
> > 
> > ---
> >  drivers/net/ethernet/dec/tulip/de2104x.c | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/drivers/net/ethernet/dec/tulip/de2104x.c b/drivers/net/ethernet/dec/tulip/de2104x.c
> > index f1a2da15dd0a..372c62c7e60f 100644
> > --- a/drivers/net/ethernet/dec/tulip/de2104x.c
> > +++ b/drivers/net/ethernet/dec/tulip/de2104x.c
> > @@ -2185,6 +2185,7 @@ static struct pci_driver de_driver = {
> > .id_table   = de_pci_tbl,
> > .probe  = de_init_one,
> > .remove = de_remove_one,
> > +   .shutdown   = de_remove_one,
> 
> This doesn't look right: shutdown is supposed to turn off the device
> without disturbing the tree or causing any knock on effects (I think
> that rule is mostly because you don't want anything in userspace
> triggering since it's likely to be nearly dead).  Remove removes the
> device from the tree and cleans up everything.  I think the function
> you want that's closest to what shutdown needs is de_close().  That
> basically just turns off the chip and frees the interrupt ... you'll
> have to wrapper it to call it from the pci_driver, though.

Thanks for the suggestion, I like that better. I'll send a v2 after
testing.
I think anything that hits on de_stop_hw() will keep the NIC from
fetching further descriptors.

Cheers,
Moritz

Re: mmstress[1309]: segfault at 7f3d71a36ee8 ip 00007f3d77132bdf sp 00007f3d71a36ee8 error 4 in libc-2.27.so[7f3d77058000+1aa000]

2020-10-22 Thread Linus Torvalds

On Thu, Oct 22, 2020 at 4:43 PM Linus Torvalds
 wrote:
>
> Thanks. Very funky, but thanks. I've been running that commit on my
> machine for over half a year, and it still looks "trivially correct"
> to me, but let me go look at it one more time. Can't argue with a
> reliable bisect and revert..

Hmm. The fact that it only happens with KASAN makes me suspect it's
some bad interaction with the inline asm syntax change (and explains
why I've run with this for half a year without issues).

In particular, I wonder if it's that KASAN causes some reload pattern,
and the whole

 register __typeof__(*(ptr)) __val_pu asm("%"_ASM_AX);
..
 asm volatile(.. "r" (__val_pu) ..)

thing causes problems. That's an ugly pattern, but it's written that
way to get gcc to handle the 64-bit case properly (with the value in
%rax:%rdx).

It turns out that the decode of the user-mode SIGSEGV code is a
variation of system calls, ie

   0:  b8 18 00 00 00      mov    $0x18,%eax
   5:  0f 05               syscall
   7:  48 3d 01 f0 ff ff   cmp    $0xf001,%rax
   d:  73 01               jae    0x10
   f:* c3                  retq    <-- trapping instruction

or

   0:  41 52               push   %r10
   2:  52                  push   %rdx
   3:  4d 31 d2            xor    %r10,%r10
   6:  ba 02 00 00 00      mov    $0x2,%edx
   b:  be 80 00 00 00      mov    $0x80,%esi
  10:  39 d0               cmp    %edx,%eax
  12:  75 07               jne    0x1b
  14:  b8 ca 00 00 00      mov    $0xca,%eax
  19:  0f 05               syscall
  1b:  89 d0               mov    %edx,%eax
  1d:  87 07               xchg   %eax,(%rdi)
  1f:  85 c0               test   %eax,%eax
  21:  75 f1               jne    0x14
  23:* 5a                  pop    %rdx    <-- trapping instruction
  24:  41 5a               pop    %r10
  26:  c3                  retq

so in both cases it looks like 'syscall' returned with a bad stack pointer.

Which is certainly a sign of some code generation issue.

Very annoying, because it probably means that it's compiler-specific
too. And that "syscall 018" looks very odd. I think that's
sched_yield() on x86-64, which doesn't have any __put_user() cases at
all..

Would you mind sending me the problematic vmlinux file in private (or,
likely better - a pointer to some place I can download it, it's going
to be huge).

  Linus

[git pull] drm fixes part 2 for 5.10-rc1

2020-10-22 Thread Dave Airlie

Hi Linus,

This should be the last round of things for rc1, a bunch of i915
fixes, some amdgpu, more font OOB fixes and one ttm fix just found
reading code.

Dave.

drm-next-2020-10-23:
drm fixes (round two) for 5.10-rc1

fbcon/fonts:
- Two patches to prevent OOB access

ttm:
- fix for eviction value range check

amdgpu:
- Sienna Cichlid fixes
- MST manager resource leak fix
- GPU reset fix

amdkfd:
- Luxmark fix for Navi1x

i915:
- Tweak initial DPCD backlight.enabled value (Sean)
- Initialize reserved MOCS indices (Ayaz)
- Mark initial fb obj as WT on eLLC machines to avoid rcu lockup (Ville)
- Support parsing of oversize batches (Chris)
- Delay execlists processing for TGL (Chris)
- Use the active reference on the vma during error capture (Chris)
- Widen CSB pointer (Chris)
- Wait for CSB entries on TGL (Chris)
- Fix unwind for scratch page allocation (Chris)
- Exclude low pages of stolen memory (Chris)
- Force VT'd workarounds when running as a guest OS (Chris)
- Drop runtime-pm assert from vgpu io accessors (Chris)

The following changes since commit 40b99050455b9a6cb8faf15dcd41888312184720:

  Merge tag 'drm-intel-next-fixes-2020-10-15' of
git://anongit.freedesktop.org/drm/drm-intel into drm-next (2020-10-19
09:21:59 +1000)

are available in the Git repository at:

  git://anongit.freedesktop.org/drm/drm tags/drm-next-2020-10-23

for you to fetch changes up to b45b6fbc671c60f56fd119c443e5570f83175928:

  Merge tag 'drm-intel-next-fixes-2020-10-22' of
git://anongit.freedesktop.org/drm/drm-intel into drm-next (2020-10-23
09:52:18 +1000)


drm fixes (round two) for 5.10-rc1

fbcon/fonts:
- Two patches to prevent OOB access

ttm:
- fix for eviction value range check

amdgpu:
- Sienna Cichlid fixes
- MST manager resource leak fix
- GPU reset fix

amdkfd:
- Luxmark fix for Navi1x

i915:
- Tweak initial DPCD backlight.enabled value (Sean)
- Initialize reserved MOCS indices (Ayaz)
- Mark initial fb obj as WT on eLLC machines to avoid rcu lockup (Ville)
- Support parsing of oversize batches (Chris)
- Delay execlists processing for TGL (Chris)
- Use the active reference on the vma during error capture (Chris)
- Widen CSB pointer (Chris)
- Wait for CSB entries on TGL (Chris)
- Fix unwind for scratch page allocation (Chris)
- Exclude low pages of stolen memory (Chris)
- Force VT'd workarounds when running as a guest OS (Chris)
- Drop runtime-pm assert from vgpu io accessors (Chris)


Andrey Grodzovsky (3):
  drm/amd/display: Revert "drm/amd/display: Fix a list corruption"
  drm/amd/display: Avoid MST manager resource leak.
  drm/amd/psp: Fix sysfs: cannot create duplicate filename

Ayaz A Siddiqui (1):
  drm/i915/gt: Initialize reserved and unspecified MOCS indices

Chris Wilson (10):
  drm/i915/gem: Support parsing of oversize batches
  drm/i915/gt: Delay execlist processing for tgl
  drm/i915/gt: Undo forced context restores after trivial preemptions
  drm/i915: Use the active reference on the vma while capturing
  drm/i915/gt: Widen CSB pointer to u64 for the parsers
  drm/i915/gt: Wait for CSB entries on Tigerlake
  drm/i915/gt: Onion unwind for scratch page allocation failure
  drm/i915: Exclude low pages (128KiB) of stolen from use
  drm/i915: Force VT'd workarounds when running as a guest OS
  drm/i915: Drop runtime-pm assert from vgpu io accessors

Dave Airlie (4):
  Merge tag 'drm-misc-next-fixes-2020-10-20' of
git://anongit.freedesktop.org/drm/drm-misc into drm-next
  drm/ttm: fix eviction valuable range check.
  Merge tag 'amd-drm-fixes-5.10-2020-10-21' of
git://people.freedesktop.org/~agd5f/linux into drm-next
  Merge tag 'drm-intel-next-fixes-2020-10-22' of
git://anongit.freedesktop.org/drm/drm-intel into drm-next

Evan Quan (1):
  drm/amdgpu: correct the gpu reset handling for job != NULL case

Jay Cornwall (1):
  drm/amdkfd: Use same SQ prefetch setting as amdgpu

John Clements (1):
  Revert drm/amdgpu: disable sienna chichlid UMC RAS

Kenneth Feng (2):
  drm/amd/pm: fix pp_dpm_fclk
  drm/amd/pm: remove the average clock value in sysfs

Kevin Wang (2):
  drm/amd/swsmu: add missing feature map for sienna_cichlid
  drm/amd/swsmu: correct wrong feature bit mapping

Likun Gao (5):
  drm/amdgpu: add function to program pbb mode for sienna cichlid
  drm/amdgpu: add rlc iram and dram firmware support
  drm/amdgpu: update golden setting for sienna_cichlid
  drm/amd/pm: fix pcie information for sienna cichlid
  drm/amdgpu: correct the cu and rb info for sienna cichlid

Peilin Ye (2):
  docs: fb: Add font_6x8 to available built-in fonts
  Fonts: Support FONT_EXTRA_WORDS macros for font_6x8

Sean Paul (1):
  drm/i915/dp: Tweak initial dpcd backlight.enabled value

Ville Syrjälä (1):
  drm/i915: Mark ininitial fb obj as WT on eLLC machines to

Re: [PATCH v2 2/2] net: phy: adin: implement cable-test support

2020-10-22 Thread Jakub Kicinski

On Thu, 22 Oct 2020 10:45:51 +0300 Alexandru Ardelean wrote:
> The ADIN1300/ADIN1200 support cable diagnostics using TDR.
> 
> The cable fault detection is automatically run on all four pairs looking at
> all combinations of pair faults by first putting the PHY in standby (clear
> the LINK_EN bit, PHY_CTRL_3 register, Address 0x0017) and then enabling the
> diagnostic clock (set the DIAG_CLK_EN bit, PHY_CTRL_1 register, Address
> 0x0012).
> 
> Cable diagnostics can then be run (set the CDIAG_RUN bit in the
> CDIAG_RUN register, Address 0xBA1B). The results are reported for each pair
> in the cable diagnostics results registers (CDIAG_DTLD_RSLTS_0,
> CDIAG_DTLD_RSLTS_1, CDIAG_DTLD_RSLTS_2, and CDIAG_DTLD_RSLTS_3, Address
> 0xBA1D to Address 0xBA20).
> 
> The distance to the first fault for each pair is reported in the cable
> fault distance registers (CDIAG_FLT_DIST_0, CDIAG_FLT_DIST_1,
> CDIAG_FLT_DIST_2, and CDIAG_FLT_DIST_3, Address 0xBA21 to Address 0xBA24).
> 
> This change implements support for this using phylib's cable-test support.
> 
> Signed-off-by: Alexandru Ardelean 

# Form letter - net-next is closed

We have already sent a pull request for 5.10 and therefore net-next 
is closed for new drivers, features, and code refactoring.

Please repost when net-next reopens after 5.10-rc1 is cut.

(http://vger.kernel.org/~davem/net-next.html will not be up to date 
 this time around, sorry about that).

RFC patches sent for review only are obviously welcome at any time.
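The TDR sequence described in the quoted commit message can be modeled against a mock register file. This is a sketch only: the register addresses are the ones quoted above, but the bit masks are placeholders that the thread does not state (check the ADIN1300 datasheet), reg_read()/reg_write() are hypothetical stand-ins for the real MDIO/extended-register accessors, and waiting for the diagnostic run to finish is omitted.

```c
#include <stdint.h>

/* Register addresses as listed in the quoted commit message. */
#define ADIN_PHY_CTRL_1         0x0012
#define ADIN_PHY_CTRL_3         0x0017
#define ADIN_CDIAG_RUN          0xBA1B
#define ADIN_CDIAG_DTLD_RSLTS_0 0xBA1D /* ..._3 at 0xBA20 */
#define ADIN_CDIAG_FLT_DIST_0   0xBA21 /* ..._3 at 0xBA24 */

/* PLACEHOLDER bit masks: not taken from the thread; verify in the datasheet. */
#define DEMO_LINK_EN     (1u << 13)
#define DEMO_DIAG_CLK_EN (1u << 2)
#define DEMO_CDIAG_RUN   (1u << 0)

/* Mock register file standing in for the PHY. */
static uint16_t regs[0x10000];

static uint16_t reg_read(uint16_t addr) { return regs[addr]; }
static void reg_write(uint16_t addr, uint16_t val) { regs[addr] = val; }

static void reg_clear_bits(uint16_t addr, uint16_t mask)
{
	reg_write(addr, reg_read(addr) & ~mask);
}

static void reg_set_bits(uint16_t addr, uint16_t mask)
{
	reg_write(addr, reg_read(addr) | mask);
}

/* Order of operations described in the commit message. */
static void demo_cable_test_start(void)
{
	reg_clear_bits(ADIN_PHY_CTRL_3, DEMO_LINK_EN);   /* PHY to standby */
	reg_set_bits(ADIN_PHY_CTRL_1, DEMO_DIAG_CLK_EN); /* diagnostic clock on */
	reg_set_bits(ADIN_CDIAG_RUN, DEMO_CDIAG_RUN);    /* start the TDR run */
}

/* Per-pair results and first-fault distances sit in four consecutive regs. */
static void demo_cable_test_read(uint16_t rslt[4], uint16_t dist[4])
{
	for (int pair = 0; pair < 4; pair++) {
		rslt[pair] = reg_read(ADIN_CDIAG_DTLD_RSLTS_0 + pair);
		dist[pair] = reg_read(ADIN_CDIAG_FLT_DIST_0 + pair);
	}
}
```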

Re: [PATCH v4] mm: memcg/slab: Stop reparented obj_cgroups from charging root

2020-10-22 Thread Shakeel Butt

On Thu, Oct 22, 2020 at 10:25 AM Roman Gushchin  wrote:
>
[snip]
> >
> > Since bf4f059954dc ("mm: memcg/slab: obj_cgroup API") is in 5.9, I
> > think we can take this patch for 5.9 and 5.10 but keep Roman's cleanup
> > for 5.11.
> >
> > What does everyone think?
>
> I think we should use the link to the root approach both for stable backports
> and for 5.11+, to keep them in sync. The cleanup (always charging the root
> cgroup) is not directly related to this problem, and we can keep it for
> 5.11+ only.
>
> Thanks!

Roman, can you send the signed-off patch for the root linking for
use_hierarchy=0?

Re: [PATCH] ext: EXT4_KUNIT_TESTS should depend on EXT4_FS instead of selecting it

2020-10-22 Thread Brendan Higgins

On Wed, Oct 21, 2020 at 3:36 PM Theodore Y. Ts'o  wrote:
>
> On Wed, Oct 21, 2020 at 02:16:56PM -0700, Randy Dunlap wrote:
> > On 10/21/20 2:15 PM, Brendan Higgins wrote:
> > > On Tue, Oct 20, 2020 at 12:37 AM Geert Uytterhoeven
> > >  wrote:
> > >>
> > >> EXT4_KUNIT_TESTS selects EXT4_FS, thus enabling an optional feature the
> > >> user may not want to enable.  Fix this by making the test depend on
> > >> EXT4_FS instead.
> > >>
> > >> Fixes: 1cbeab1b242d16fd ("ext4: add kunit test for decoding extended 
> > >> timestamps")
> > >> Signed-off-by: Geert Uytterhoeven 
> > >
> > > If I remember correctly, having EXT4_KUNIT_TESTS select EXT4_FS was
> > > something that Ted specifically requested, but I don't have any strong
> > > feelings on it either way.
> >
> > omg, please No. depends on is the right fix here.
>
> So my requirement which led to that particular request is to keep what
> needs to be placed in .kunitconfig to a small and reasonable set.
>
> Per Documentation/dev-tools/kunit, we start by:
>
> cd $PATH_TO_LINUX_REPO
> cp arch/um/configs/kunit_defconfig .kunitconfig
>
> we're then supposed to add whatever Kunit tests we want to enable, to wit:
>
> CONFIG_EXT4_KUNIT_TESTS=y
>
> so that .kunitconfig would look like this:
>
> CONFIG_KUNIT=y
> CONFIG_KUNIT_TEST=y
> CONFIG_KUNIT_EXAMPLE_TEST=y
> CONFIG_EXT4_KUNIT_TESTS=y
>
> ... and then you should be able to run:
>
> ./tools/testing/kunit/kunit.py run
>
> ... and have the kunit tests run.  I would *not* like to have to put a
> huge long list of CONFIG_* dependencies into the .kunitconfig file.
>
> I don't particularly care how this gets achieved, but please think
> about how to make it easy for a kernel developer to run a specific set
> of subsystem unit tests.  (In fact, being able to do something like
> "kunit.py run fs/ext4 fs/jbd2" or maybe "kunit.py run fs/..." would be
> *great*.  No need to fuss with hand editing the .kunitconfig file at
> all would be **wonderful**.)

So you, me, Luis, David, and a whole bunch of other people have been
thinking about this problem for a while. What if we just put
kunitconfig fragments in directories alongside the test files they
enable?

For example, we could add a file to fs/ext4/kunitconfig which contains:

CONFIG_EXT4_FS=y
CONFIG_EXT4_KUNIT_TESTS=y

We could do something similar in fs/jbd2, etc.

Obviously some logically separate KUnit tests (different maintainers,
different Kconfig symbols, etc.) reside in the same directory; for
these we could name the kunitconfig file something like
lib/list-test.kunitconfig (not a great example because lists are
always built into Linux), but you get the idea.

Then, as Ted suggested, if you call kunit.py run foo/bar:

- if bar is a directory, kunit.py will look for foo/bar/kunitconfig;
- if bar is a file ending with .kunitconfig, like foo/bar.kunitconfig,
  kunit.py will use that kunitconfig;
- if bar is '...' (foo/...), kunit.py will look for all kunitconfigs
  underneath foo.
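A minimal sketch of the three resolution rules, written in C to match the rest of this thread (kunit.py itself is Python). resolve_kunitconfig() and the KCFG_* names are hypothetical; the function only classifies the argument and builds a path. It treats anything that is neither "foo/..." nor "*.kunitconfig" as a directory without checking the filesystem.

```c
#include <stdio.h>
#include <string.h>

enum kcfg_kind { KCFG_DIR, KCFG_FILE, KCFG_RECURSIVE };

/*
 * Writes the fragment path (or, for foo/..., the directory to search)
 * into @out. Returns the kind of match, or -1 if @out is too small.
 */
static int resolve_kunitconfig(const char *arg, char *out, size_t outsz)
{
	const char *suffix = ".kunitconfig";
	size_t len = strlen(arg);
	size_t slen = strlen(suffix);
	int n;

	if (len >= 4 && strcmp(arg + len - 4, "/...") == 0) {
		/* foo/... : search every kunitconfig below foo */
		n = snprintf(out, outsz, "%.*s", (int)(len - 4), arg);
		return (n < 0 || (size_t)n >= outsz) ? -1 : KCFG_RECURSIVE;
	}
	if (len > slen && strcmp(arg + len - slen, suffix) == 0) {
		/* foo/bar.kunitconfig : use the named fragment as-is */
		n = snprintf(out, outsz, "%s", arg);
		return (n < 0 || (size_t)n >= outsz) ? -1 : KCFG_FILE;
	}
	/* foo/bar : assume a directory holding foo/bar/kunitconfig */
	n = snprintf(out, outsz, "%s/kunitconfig", arg);
	return (n < 0 || (size_t)n >= outsz) ? -1 : KCFG_DIR;
}
```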

Once all the kunitconfigs have been resolved, they will be merged into
the .kunitconfig. If they can be successfully merged together, the new
.kunitconfig will then continue to function as it currently does.
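One way the merge step could decide whether fragments "can be successfully merged" is to accept a repeated option only when every fragment agrees on its value. The sketch below is hypothetical throughout (types, merge_fragment(), fixed size limits), and it simply skips '#' lines, which ignores the real Kconfig meaning of "# CONFIG_FOO is not set".

```c
#include <stdbool.h>
#include <string.h>

#define MAX_OPTS 64
#define MAX_LEN  128

/* One CONFIG_FOO=value line, split at the first '='. */
struct kopt {
	char name[MAX_LEN];
	char value[MAX_LEN];
};

struct kconfig_set {
	struct kopt opts[MAX_OPTS];
	int n;
};

/*
 * Fold one newline-separated fragment into @set. Blank lines, '#' lines
 * and lines without '=' are skipped. Returns false on a conflicting
 * duplicate (same option, different value) or when a limit is exceeded.
 */
static bool merge_fragment(struct kconfig_set *set, const char *text)
{
	while (*text) {
		size_t len = strcspn(text, "\n");
		const char *eq = memchr(text, '=', len);

		if (len > 0 && text[0] != '#' && eq) {
			size_t nlen = (size_t)(eq - text);
			size_t vlen = len - nlen - 1;
			int i;

			if (nlen >= MAX_LEN || vlen >= MAX_LEN)
				return false;
			for (i = 0; i < set->n; i++)
				if (strlen(set->opts[i].name) == nlen &&
				    strncmp(set->opts[i].name, text, nlen) == 0)
					break;
			if (i < set->n) {
				/* Already present: only an identical value merges. */
				if (strlen(set->opts[i].value) != vlen ||
				    strncmp(set->opts[i].value, eq + 1, vlen) != 0)
					return false;
			} else {
				if (set->n == MAX_OPTS)
					return false;
				memcpy(set->opts[i].name, text, nlen);
				set->opts[i].name[nlen] = '\0';
				memcpy(set->opts[i].value, eq + 1, vlen);
				set->opts[i].value[vlen] = '\0';
				set->n++;
			}
		}
		text += len;
		if (*text == '\n')
			text++;
	}
	return true;
}
```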

What do people think about this?

Re: mmstress[1309]: segfault at 7f3d71a36ee8 ip 00007f3d77132bdf sp 00007f3d71a36ee8 error 4 in libc-2.27.so[7f3d77058000+1aa000]

2020-10-22 Thread Linus Torvalds

On Thu, Oct 22, 2020 at 1:55 PM Naresh Kamboju
 wrote:
>
> The bad commit points to,
>
> commit d55564cfc222326e944893eff0c4118353e349ec
> x86: Make __put_user() generate an out-of-line call
>
> I have reverted this single patch and confirmed the reported
> problem is not seen anymore.

Thanks. Very funky, but thanks. I've been running that commit on my
machine for over half a year, and it still looks "trivially correct"
to me, but let me go look at it one more time. Can't argue with a
reliable bisect and revert..

Linus

linux-next: build warning after merge of the block tree

2020-10-22 Thread Stephen Rothwell

Hi all,

After merging the block tree, today's linux-next build (KCONFIG_NAME)
produced this warning:

fs/io_uring.c: In function 'loop_rw_iter':
fs/io_uring.c:3141:21: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
 3141 |iovec.iov_base = (void __user *) req->rw.addr;
  | ^

Introduced by commit

  a5371db1e38d ("io_uring: make loop_rw_iter() use original user supplied 
pointers")

-- 
Cheers,
Stephen Rothwell



Re: [PATCH v1 0/2] mm: cma: introduce a non-blocking version of cma_release()

2020-10-22 Thread Zi Yan

On 22 Oct 2020, at 18:53, Roman Gushchin wrote:

> This small patchset introduces a non-blocking version of cma_release()
> and simplifies the code in hugetlbfs, where previously we had to
> temporarily drop hugetlb_lock around the cma_release() call.
>
> It should help Zi Yan on his work on 1 GB THPs: splitting a gigantic
> THP under a memory pressure requires a cma_release() call. If it's

Thanks for the patch. But during 1GB THP split, we only clear
the bitmaps without releasing the pages. Also in cma_release_nowait(),
the first page in the allocated CMA region is reused to store
struct cma_clear_bitmap_work, but the same method cannot be used
during THP split, since the first page is still in-use. We might
need to allocate some new memory for struct cma_clear_bitmap_work,
which might not be successful under memory pressure. Any suggestion
on where to store struct cma_clear_bitmap_work when I only want to
clear bitmap without releasing the pages?

Thanks.

—
Best Regards,
Yan Zi


Re: [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache

2020-10-22 Thread YiFei Zhu

On Thu, Oct 22, 2020 at 5:32 PM Kees Cook  wrote:
> I've been going back and forth on this, and I think what I've settled
> on is I'd like to avoid new CONFIG dependencies just for this feature.
> Instead, how about we just fill in SECCOMP_NATIVE and SECCOMP_COMPAT
> for all the HAVE_ARCH_SECCOMP_FILTER architectures, and then the
> cache reporting can be cleanly tied to CONFIG_SECCOMP_FILTER? It
> should be relatively simple to extract those details and make
> SECCOMP_ARCH_{NATIVE,COMPAT}_NAME part of the per-arch enabling patches?

Hmm. So I could enable the cache logic for every architecture (one
patch per arch) that does not have the sparse syscall numbers, and
then have the proc reporting after the arch patches? I could do that.
I don't have test machines to run anything other than x86_64 or ia32,
so they will need a closer look by people more familiar with those
arches.

> I'd still like to get more specific workload performance numbers too.
> The microbenchmark is nice, but getting things like build times under
> docker's default seccomp filter, etc would be lovely. I've almost gotten
> there, but my benchmarks are still really noisy and CPU isolation
> continues to frustrate me. :)

Ok, let me know if I can help.

YiFei Zhu

Re: [PATCH 3/6] fs: Convert block_read_full_page to be synchronous

2020-10-22 Thread Eric Biggers

On Thu, Oct 22, 2020 at 10:22:25PM +0100, Matthew Wilcox (Oracle) wrote:
> +static int readpage_submit_bhs(struct page *page, struct blk_completion *cmpl,
> + unsigned int nr, struct buffer_head **bhs)
> +{
> + struct bio *bio = NULL;
> + unsigned int i;
> + int err;
> +
> + blk_completion_init(cmpl, nr);
> +
> + for (i = 0; i < nr; i++) {
> + struct buffer_head *bh = bhs[i];
> + sector_t sector = bh->b_blocknr * (bh->b_size >> 9);
> + bool same_page;
> +
> + if (buffer_uptodate(bh)) {
> + end_buffer_async_read(bh, 1);
> + blk_completion_sub(cmpl, BLK_STS_OK, 1);
> + continue;
> + }
> + if (bio) {
> + if (bio_end_sector(bio) == sector &&
> + __bio_try_merge_page(bio, bh->b_page, bh->b_size,
> + bh_offset(bh), &same_page))
> + continue;
> + submit_bio(bio);
> + }
> + bio = bio_alloc(GFP_NOIO, 1);
> + bio_set_dev(bio, bh->b_bdev);
> + bio->bi_iter.bi_sector = sector;
> + bio_add_page(bio, bh->b_page, bh->b_size, bh_offset(bh));
> + bio->bi_end_io = readpage_end_bio;
> + bio->bi_private = cmpl;
> + /* Take care of bh's that straddle the end of the device */
> + guard_bio_eod(bio);
> + }

The following is needed to set the bio encryption context for the
'-o inlinecrypt' case on ext4:

diff --git a/fs/buffer.c b/fs/buffer.c
index 95c338e2b99c..546a08c5003b 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2237,6 +2237,7 @@ static int readpage_submit_bhs(struct page *page, struct blk_completion *cmpl,
submit_bio(bio);
}
bio = bio_alloc(GFP_NOIO, 1);
+   fscrypt_set_bio_crypt_ctx_bh(bio, bh, GFP_NOIO);
bio_set_dev(bio, bh->b_bdev);
bio->bi_iter.bi_sector = sector;
bio_add_page(bio, bh->b_page, bh->b_size, bh_offset(bh));
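The merge-or-submit decision in the quoted readpage_submit_bhs() loop can be modeled in userspace to see how many bios a given buffer layout produces. This is a sketch under stated assumptions: demo_bh and demo_count_bios() are hypothetical, only the sector-contiguity check is modeled (the real __bio_try_merge_page() can refuse a merge for other reasons), and the code after the quoted excerpt is assumed to submit the last bio.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the buffer_head fields the loop reads. */
struct demo_bh {
	uint64_t blocknr; /* b_blocknr */
	uint32_t size;    /* b_size in bytes */
	bool uptodate;    /* buffer_uptodate() */
};

/* Returns the number of bios the loop would submit for @bhs. */
static int demo_count_bios(const struct demo_bh *bhs, unsigned int nr)
{
	bool have_bio = false;
	uint64_t bio_end = 0; /* like bio_end_sector() */
	int submitted = 0;

	for (unsigned int i = 0; i < nr; i++) {
		/* Same conversion as the patch: block number to 512-byte sector. */
		uint64_t sector = bhs[i].blocknr * (bhs[i].size >> 9);

		if (bhs[i].uptodate)
			continue; /* already valid, no I/O needed */
		if (have_bio && bio_end == sector) {
			bio_end += bhs[i].size >> 9; /* contiguous: merged */
			continue;
		}
		if (have_bio)
			submitted++; /* submit_bio() on the previous bio */
		have_bio = true;     /* bio_alloc() + bio_add_page() */
		bio_end = sector + (bhs[i].size >> 9);
	}
	if (have_bio)
		submitted++; /* assumed final submission after the loop */
	return submitted;
}
```

An up-to-date buffer in the middle of the page leaves a gap, so the next buffer no longer starts at the current bio's end sector and a second bio is started.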

Donation: 2 million euros

2020-10-22 Thread Manuel Franco





--

My name is Manuel Franco. I am the winner of the $768 million Powerball
mega jackpot from New Jersey, USA, and I am pleased to congratulate you
on being randomly picked among the 5 lucky people to whom I am donating
2 million euros (€2,000,000.00) each. Contact my email below to claim
the money.

Email: bmosth...@gmail.com
