Re: [PATCH 1/3] md: align superblock writes to physical blocks
On Thu, Oct 22, 2020 at 8:31 PM Christopher Unkel wrote:
>
> Writes of the md superblock are aligned to the logical blocks of the
> containing device, but no attempt is made to align them to physical
> block boundaries. This means that on a "512e" device (4k physical, 512
> logical) every superblock update hits the 512-byte emulation and the
> possible associated performance penalty.
>
> Respect the physical block alignment when possible.
>
> Signed-off-by: Christopher Unkel
> ---
>  drivers/md/md.c | 15 +++
>  1 file changed, 15 insertions(+)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 98bac4f304ae..2b42850acfb3 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -1732,6 +1732,21 @@ static int super_1_load(struct md_rdev *rdev, struct md_rdev *refdev, int minor_
>  	    && rdev->new_data_offset < sb_start + (rdev->sb_size/512))
>  		return -EINVAL;
>
> +	/* Respect physical block size if feasible. */
> +	bmask = queue_physical_block_size(rdev->bdev->bd_disk->queue)-1;
> +	if (!((rdev->sb_start * 512) & bmask) && (rdev->sb_size & bmask)) {
> +		int candidate_size = (rdev->sb_size | bmask) + 1;
> +
> +		if (minor_version) {
> +			int sectors = candidate_size / 512;
> +
> +			if (rdev->data_offset >= sb_start + sectors
> +			    && rdev->new_data_offset >= sb_start + sectors)
> +				rdev->sb_size = candidate_size;
> +		} else if (bmask <= 4095)
> +			rdev->sb_size = candidate_size;
> +	}

In super_1_load() and super_1_sync(), we have

	bmask = queue_logical_block_size(rdev->bdev->bd_disk->queue)-1;

I think we should replace it with queue_physical_block_size() so the logic
is cleaner. Would this work?

Thanks,
Song
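
For readers following the arithmetic in the quoted hunk, here is a minimal
user-space sketch of the rounding step for the minor_version != 0 branch.
It is not kernel code: pad_sb_size() and its parameters are hypothetical
names, and it assumes the physical block size is a nonzero power of two.
The "(size | bmask) + 1" trick only rounds up correctly when size is not
already a multiple of the block size, which is why the check on
"sb_size & bmask" comes first.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not kernel code: round sb_size (bytes) up to the
 * next multiple of phys_bs, mirroring "(rdev->sb_size | bmask) + 1" from
 * the quoted hunk. sb_start, data_offset and new_data_offset are in
 * 512-byte sectors. Returns the padded size, or the original size if the
 * superblock does not start on a physical block boundary or the padded
 * superblock would run into the data area.
 */
static uint32_t pad_sb_size(uint64_t sb_start, uint32_t sb_size,
			    uint64_t data_offset, uint64_t new_data_offset,
			    uint32_t phys_bs)
{
	uint32_t bmask = phys_bs - 1;
	uint32_t candidate;

	/* The mask trick needs a nonzero power-of-two block size. */
	assert(phys_bs != 0 && (phys_bs & bmask) == 0);

	/* Start is not on a physical block boundary: padding cannot help. */
	if ((sb_start * 512) & bmask)
		return sb_size;

	/* Already a whole number of physical blocks. */
	if (!(sb_size & bmask))
		return sb_size;

	candidate = (sb_size | bmask) + 1;

	/* Only pad if the data area starts after the padded superblock. */
	if (data_offset >= sb_start + candidate / 512 &&
	    new_data_offset >= sb_start + candidate / 512)
		return candidate;

	return sb_size;
}
```

With a 4096-byte physical block, a 512-byte superblock at sector 8 is padded
to 4096 bytes as long as the data offsets leave room for 8 sectors; the same
superblock at sector 1 is left alone because its start is not 4k aligned.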
Re: [PATCH 0/3] mdraid sb and bitmap write alignment on 512e drives
On Thu, Oct 22, 2020 at 8:31 PM Christopher Unkel wrote:
>
> Hello all,
>
> While investigating some performance issues on mdraid 10 volumes
> formed with "512e" disks (4k native/physical sector size but with 512
> byte sector emulation), I've found two cases where mdraid will
> needlessly issue writes that start on a 4k byte boundary, but are
> shorter than 4k:
>
> 1. writes of the raid superblock; and
> 2. writes of the last page of the write-intent bitmap.
>
> The following is an excerpt of a blocktrace of one of the component
> members of a mdraid 10 volume during a 4k write near the end of the
> array:
>
>   8,32 112 0.01687 711 D WS 2064 + 8 [kworker/11:1H]
> * 8,32 115 0.001454119 711 D WS 2056 + 1 [kworker/11:1H]
> * 8,32 118 0.002847204 711 D WS 2080 + 7 [kworker/11:1H]
>   8,32 11 11 0.003700545 3094 D WS 11721043920 + 8 [md127_raid1]
>   8,32 11 14 0.308785692 711 D WS 2064 + 8 [kworker/11:1H]
> * 8,32 11 17 0.310201697 711 D WS 2056 + 1 [kworker/11:1H]
>   8,32 11 20 5.500799245 711 D WS 2064 + 8 [kworker/11:1H]
> * 8,32 11 2315.740923558 711 D WS 2080 + 7 [kworker/11:1H]
>
> Note the starred transactions, which each start on a 4k boundary, but
> are less than 4k in length, and so will use the 512-byte emulation.
> Sector 2056 holds the superblock, and is written as a single 512-byte
> write. Sector 2086 holds the bitmap bit relevant to the written
> sector. When it is written the active bits of the last page of the
> bitmap are written, starting at sector 2080, padded out to the end of
> the 512-byte logical sector as required. This results in a 3.5kb
> write, again using the 512-byte emulation.
>
> Note that in some arrays the last page of the bitmap may be
> sufficiently full that they are not affected by the issue with the
> bitmap write.
>
> As there can be a substantial penalty to using the 512-byte sector
> emulation (turning writes into read-modify writes if the relevant
> sector is not in the drive's cache) I believe it makes sense to pad
> these writes out to a 4k boundary. The writes are already padded out
> for "4k native" drives, where the short access is illegal.
>
> The following patch set changes the superblock and bitmap writes to
> respect the physical block size (e.g. 4k for today's 512e drives) when
> possible. In each case there is already logic for padding out to the
> underlying logical sector size. I reuse or repeat the logic for
> padding out to the physical sector size, but treat the padding out as
> optional rather than mandatory.
>
> The corresponding block trace with these patches is:
>
>   8,32 12 0.03410 694 D WS 2064 + 8 [kworker/1:1H]
>   8,32 15 0.001368788 694 D WS 2056 + 8 [kworker/1:1H]
>   8,32 18 0.002727981 694 D WS 2080 + 8 [kworker/1:1H]
>   8,32 1 11 0.003533831 3063 D WS 11721043920 + 8 [md127_raid1]
>   8,32 1 14 0.253952321 694 D WS 2064 + 8 [kworker/1:1H]
>   8,32 1 17 0.255354215 694 D WS 2056 + 8 [kworker/1:1H]
>   8,32 1 20 5.337938486 694 D WS 2064 + 8 [kworker/1:1H]
>   8,32 1 2315.577963062 694 D WS 2080 + 8 [kworker/1:1H]
>
> I do notice that the code for bitmap writes has a more sophisticated
> and thorough check for overlap than the code for superblock writes.
> (Compare write_sb_page in md-bitmap.c vs. super_1_load in md.c.) From
> what I know, since the starts of the various structures have always
> been 4k aligned anyway, it is always safe to pad the superblock write
> out to 4k (as occurs on 4k native drives) but not necessarily further.
>
> Feedback appreciated.
>
> --Chris

Thanks for the patches. Do you have performance numbers before/after these
changes? Some microbenchmark results would be great motivation.

Thanks,
Song

>
>
> Christopher Unkel (3):
>   md: align superblock writes to physical blocks
>   md: factor sb write alignment check into function
>   md: pad writes to end of bitmap to physical blocks
>
>  drivers/md/md-bitmap.c | 80 +-
>  drivers/md/md.c        | 15
>  2 files changed, 63 insertions(+), 32 deletions(-)
>
> --
> 2.17.1
>
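
The "starred transaction" criterion from the cover letter can be written
down as a small check. This is only a sketch under stated assumptions:
write_covers_whole_blocks() is a hypothetical helper (not part of md or
blktrace), sectors are 512 bytes, and the physical block size is a power of
two no smaller than one sector. A write avoids the 512-byte emulation only
if both its start and its length are multiples of the physical block.

```c
#include <stdbool.h>
#include <stdint.h>

#define SECTOR_SIZE 512u

/*
 * Hypothetical helper: true if a write of nr_sectors starting at
 * start_sector covers only whole physical blocks of phys_bs bytes.
 * Assumes phys_bs is a power of two and at least SECTOR_SIZE.
 */
static bool write_covers_whole_blocks(uint64_t start_sector,
				      uint32_t nr_sectors, uint32_t phys_bs)
{
	uint64_t spb = phys_bs / SECTOR_SIZE;	/* sectors per physical block */

	return nr_sectors != 0 &&
	       start_sector % spb == 0 &&
	       nr_sectors % spb == 0;
}
```

Applied to the traces above with a 4096-byte physical block, "2056 + 1" and
"2080 + 7" fail the check (the starred lines), while "2064 + 8" and the
patched "2056 + 8" and "2080 + 8" pass.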
Re: [PATCH 1/4] MAINTAINERS: move Kamil Debski to credits
On Thu, 22 Oct 2020 22:09:25 +0200, Krzysztof Kozlowski wrote:

> On Thu, Oct 22, 2020 at 09:13:14PM +0200, Uwe Kleine-König wrote:
> > Hello,
> >
> > this series doesn't seem to be applied and looking at the list of people
> > this mail was sent "To:" it's not obvious who is expected to take it. I
> > assume it is not for us linux-pwm guys and will tag it as
> > "not-applicable" in our patchwork.
>
> Hi Uwe,
>
> All of the patches, including the one here, touch actually multiple
> subsystems, so if this is OK with you, I could take them through
> Samsung SoC.

Acked-by: Mauro Carvalho Chehab

>
> Best regards,
> Krzysztof
>

Thanks,
Mauro
[PATCH v3] i2c: designware: call i2c_dw_read_clear_intrbits_slave() once
If some bits were cleared by i2c_dw_read_clear_intrbits_slave() in
i2c_dw_isr_slave() and not handled immediately, those cleared bits would
not be reported again by a later i2c_dw_read_clear_intrbits_slave() call.
They were therefore never handled.

i2c_dw_read_clear_intrbits_slave() should be called once per ISR, and its
returned state should be used for all later handling.

Signed-off-by: Michael Wu
---
Change in v3:
 - revert deleted braces of 'else' branch in v2

Change in v2:
 - revert moving I2C_SLAVE_WRITE_REQUESTED reporting in v1

 drivers/i2c/busses/i2c-designware-slave.c | 7 +--
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/drivers/i2c/busses/i2c-designware-slave.c b/drivers/i2c/busses/i2c-designware-slave.c
index 44974b53a626..13de01a0f75f 100644
--- a/drivers/i2c/busses/i2c-designware-slave.c
+++ b/drivers/i2c/busses/i2c-designware-slave.c
@@ -159,7 +159,6 @@ static int i2c_dw_irq_handler_slave(struct dw_i2c_dev *dev)
 	u32 raw_stat, stat, enabled, tmp;
 	u8 val = 0, slave_activity;
 
-	regmap_read(dev->map, DW_IC_INTR_STAT, &stat);
 	regmap_read(dev->map, DW_IC_ENABLE, &enabled);
 	regmap_read(dev->map, DW_IC_RAW_INTR_STAT, &raw_stat);
 	regmap_read(dev->map, DW_IC_STATUS, &tmp);
@@ -168,6 +167,7 @@ static int i2c_dw_irq_handler_slave(struct dw_i2c_dev *dev)
 	if (!enabled || !(raw_stat & ~DW_IC_INTR_ACTIVITY) || !dev->slave)
 		return 0;
 
+	stat = i2c_dw_read_clear_intrbits_slave(dev);
 	dev_dbg(dev->dev,
 		"%#x STATUS SLAVE_ACTIVITY=%#x : RAW_INTR_STAT=%#x : INTR_STAT=%#x\n",
 		enabled, slave_activity, raw_stat, stat);
@@ -188,11 +188,9 @@ static int i2c_dw_irq_handler_slave(struct dw_i2c_dev *dev)
 						val);
 			}
 			regmap_read(dev->map, DW_IC_CLR_RD_REQ, &tmp);
-			stat = i2c_dw_read_clear_intrbits_slave(dev);
 		} else {
 			regmap_read(dev->map, DW_IC_CLR_RD_REQ, &tmp);
 			regmap_read(dev->map, DW_IC_CLR_RX_UNDER, &tmp);
-			stat = i2c_dw_read_clear_intrbits_slave(dev);
 		}
 
 		if (!i2c_slave_event(dev->slave, I2C_SLAVE_READ_REQUESTED,
@@ -207,7 +205,6 @@ static int i2c_dw_irq_handler_slave(struct dw_i2c_dev *dev)
 			regmap_read(dev->map, DW_IC_CLR_RX_DONE, &tmp);
 
 		i2c_slave_event(dev->slave, I2C_SLAVE_STOP, &val);
-		stat = i2c_dw_read_clear_intrbits_slave(dev);
 		return 1;
 	}
 
@@ -219,7 +216,6 @@ static int i2c_dw_irq_handler_slave(struct dw_i2c_dev *dev)
 		dev_vdbg(dev->dev, "Byte %X acked!", val);
 	} else {
 		i2c_slave_event(dev->slave, I2C_SLAVE_STOP, &val);
-		stat = i2c_dw_read_clear_intrbits_slave(dev);
 	}
 
 	return 1;
@@ -230,7 +226,6 @@ static irqreturn_t i2c_dw_isr_slave(int this_irq, void *dev_id)
 	struct dw_i2c_dev *dev = dev_id;
 	int ret;
 
-	i2c_dw_read_clear_intrbits_slave(dev);
 	ret = i2c_dw_irq_handler_slave(dev);
 	if (ret > 0)
 		complete(&dev->cmd_complete);
-- 
2.17.1
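
The failure mode described in the commit message can be illustrated with a
toy model of a read-to-clear status register. Everything below is a
hypothetical user-space sketch: struct fake_dev, the EV_* bits, and the
helper names are invented for illustration and are not the DesignWare
driver's API. The point is only that a second read-and-clear returns
nothing, so any bit not handled from the first snapshot is lost.

```c
#include <stdint.h>

#define EV_RD_REQ	(1u << 0)	/* hypothetical event bits */
#define EV_STOP		(1u << 1)

/* Toy model of a device with a read-to-clear interrupt status. */
struct fake_dev {
	uint32_t pending;	/* bits the "hardware" has latched */
};

/* Returns the latched bits and clears them, like a read-to-clear register. */
static uint32_t read_clear_intrbits(struct fake_dev *dev)
{
	uint32_t stat = dev->pending;

	dev->pending = 0;
	return stat;
}

/* Buggy pattern: re-read between handlers, so earlier bits are lost. */
static uint32_t handle_rereading(struct fake_dev *dev)
{
	uint32_t handled = 0;
	uint32_t stat = read_clear_intrbits(dev);

	if (stat & EV_RD_REQ)
		handled |= EV_RD_REQ;

	stat = read_clear_intrbits(dev);	/* already cleared: returns 0 */
	if (stat & EV_STOP)
		handled |= EV_STOP;

	return handled;
}

/* Pattern from the patch description: read once, reuse the snapshot. */
static uint32_t handle_single_read(struct fake_dev *dev)
{
	uint32_t handled = 0;
	uint32_t stat = read_clear_intrbits(dev);

	if (stat & EV_RD_REQ)
		handled |= EV_RD_REQ;
	if (stat & EV_STOP)
		handled |= EV_STOP;

	return handled;
}
```

With both bits latched, handle_rereading() only ever sees EV_RD_REQ, while
handle_single_read() handles both from the one snapshot.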
Re: [PATCH 2/2] cpufreq: Drop restore_freq from struct cpufreq_policy
On 22-10-20, 13:57, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki
>
> The restore_freq field in struct cpufreq_policy is only used by
> __target_index() in one place and a local variable in that function
> may as well be used instead of it, so drop it and modify
> __target_index() accordingly.
>
> Signed-off-by: Rafael J. Wysocki
> ---
>  drivers/cpufreq/cpufreq.c | 10 +-
>  include/linux/cpufreq.h   |  5 -
>  2 files changed, 5 insertions(+), 10 deletions(-)

Acked-by: Viresh Kumar

-- 
viresh
Re: [LKP] Re: [sched] bdfcae1140: will-it-scale.per_thread_ops -37.0% regression
On 10/22/2020 9:19 PM, Mathieu Desnoyers wrote:
> ----- On Oct 21, 2020, at 9:54 PM, Xing Zhengjun zhengjun.x...@linux.intel.com wrote:
>
> [...]
>> In fact, 0-day just copies the will-it-scale benchmark from GitHub. If
>> you think the will-it-scale benchmark has some issues, you can
>> contribute your idea and help to improve it; later we will update the
>> will-it-scale benchmark to the new version.
>
> This is why I CC'd the maintainer of the will-it-scale github project,
> Anton Blanchard. My main intent is to report this issue to him, but I
> have not heard back from him yet. Is this project maintained? Let me try
> to add his ozlabs.org address in CC.
>
>> For this test case, if we bind the workload to a specific CPU, then it
>> will hide the scheduler balance issue. In the real world, we seldom
>> bind the CPU...
>
> When you say that you bind the workload to a specific CPU, is that done
> outside of the will-it-scale testsuite, thus limiting the entire
> testsuite to a single CPU, or you expect that internally the
> will-it-scale context-switch1 test gets affined to a single specific
> CPU/core/hardware thread through use of hwloc?

The latter one.

> Thanks,
>
> Mathieu

-- 
Zhengjun Xing
Re: [PATCH V3 2/3] vhost: vdpa: report iova range
Hi Jason, I love your patch! Perhaps something to improve: [auto build test WARNING on vhost/linux-next] [also build test WARNING on linus/master v5.9 next-20201023] [cannot apply to linux/master] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch] url: https://github.com/0day-ci/linux/commits/Jason-Wang/vDPA-API-for-reporting-IOVA-range/20201023-102708 base: https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git linux-next config: m68k-randconfig-r034-20201022 (attached as .config) compiler: m68k-linux-gcc (GCC) 9.3.0 reproduce (this is a W=1 build): wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # https://github.com/0day-ci/linux/commit/446e7b97838ebf87f1acd61580137716fdad104a git remote add linux-review https://github.com/0day-ci/linux git fetch --no-tags linux-review Jason-Wang/vDPA-API-for-reporting-IOVA-range/20201023-102708 git checkout 446e7b97838ebf87f1acd61580137716fdad104a # save the attached .config to linux build tree COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=m68k If you fix the issue, kindly add following tag as appropriate Reported-by: kernel test robot All warnings (new ones prefixed by >>): drivers/vhost/vdpa.c: In function 'vhost_vdpa_setup_vq_irq': drivers/vhost/vdpa.c:94:6: warning: variable 'ret' set but not used [-Wunused-but-set-variable] 94 | int ret, irq; | ^~~ drivers/vhost/vdpa.c: In function 'vhost_vdpa_unlocked_ioctl': >> drivers/vhost/vdpa.c:483:5: warning: this statement may fall through >> [-Wimplicit-fallthrough=] 483 | r = copy_to_user(featurep, , sizeof(features)); | ~~^ drivers/vhost/vdpa.c:484:2: note: here 484 | case VHOST_VDPA_GET_IOVA_RANGE: | ^~~~ vim +483 drivers/vhost/vdpa.c 4c8cf31885f69e8 Tiwei Bie2020-03-26 426 4c8cf31885f69e8 Tiwei Bie2020-03-26 427 static long 
vhost_vdpa_unlocked_ioctl(struct file *filep, 4c8cf31885f69e8 Tiwei Bie2020-03-26 428 unsigned int cmd, unsigned long arg) 4c8cf31885f69e8 Tiwei Bie2020-03-26 429 { 4c8cf31885f69e8 Tiwei Bie2020-03-26 430struct vhost_vdpa *v = filep->private_data; 4c8cf31885f69e8 Tiwei Bie2020-03-26 431struct vhost_dev *d = >vdev; 4c8cf31885f69e8 Tiwei Bie2020-03-26 432void __user *argp = (void __user *)arg; a127c5bbb6a8eee Jason Wang 2020-09-07 433u64 __user *featurep = argp; a127c5bbb6a8eee Jason Wang 2020-09-07 434u64 features; 4c8cf31885f69e8 Tiwei Bie2020-03-26 435long r; 4c8cf31885f69e8 Tiwei Bie2020-03-26 436 a127c5bbb6a8eee Jason Wang 2020-09-07 437if (cmd == VHOST_SET_BACKEND_FEATURES) { a127c5bbb6a8eee Jason Wang 2020-09-07 438r = copy_from_user(, featurep, sizeof(features)); a127c5bbb6a8eee Jason Wang 2020-09-07 439if (r) a127c5bbb6a8eee Jason Wang 2020-09-07 440return r; a127c5bbb6a8eee Jason Wang 2020-09-07 441if (features & ~VHOST_VDPA_BACKEND_FEATURES) a127c5bbb6a8eee Jason Wang 2020-09-07 442return -EOPNOTSUPP; a127c5bbb6a8eee Jason Wang 2020-09-07 443 vhost_set_backend_features(>vdev, features); a127c5bbb6a8eee Jason Wang 2020-09-07 444return 0; a127c5bbb6a8eee Jason Wang 2020-09-07 445} a127c5bbb6a8eee Jason Wang 2020-09-07 446 4c8cf31885f69e8 Tiwei Bie2020-03-26 447mutex_lock(>mutex); 4c8cf31885f69e8 Tiwei Bie2020-03-26 448 4c8cf31885f69e8 Tiwei Bie2020-03-26 449switch (cmd) { 4c8cf31885f69e8 Tiwei Bie2020-03-26 450case VHOST_VDPA_GET_DEVICE_ID: 4c8cf31885f69e8 Tiwei Bie2020-03-26 451r = vhost_vdpa_get_device_id(v, argp); 4c8cf31885f69e8 Tiwei Bie2020-03-26 452break; 4c8cf31885f69e8 Tiwei Bie2020-03-26 453case VHOST_VDPA_GET_STATUS: 4c8cf31885f69e8 Tiwei Bie2020-03-26 454r = vhost_vdpa_get_status(v, argp); 4c8cf31885f69e8 Tiwei Bie2020-03-26 455break; 4c8cf31885f69e8 Tiwei Bie2020-03-26 456case VHOST_VDPA_SET_STATUS: 4c8cf31885f69e8 Tiwei Bie2020-03-26 457r = vhost_vdpa_set_status(v, argp); 4c8cf31885f69e8 Tiwei Bie2020-03-26 458break; 4c8cf31885f69e8 Tiwei 
Bie2020-03-26 459case VHOST_VDPA_GET_CONFIG: 4c8cf31885f69e8 Tiwei Bie2020-03-26 460r = vhost_vdpa_get_config(v, argp); 4c8cf31885f69e8 Tiwei Bie2020-03-26 461break; 4c8cf31885f69e8 Tiwei Bie2020-03-26 462case VHOST_VDPA_SET_CONFIG: 4c8cf31885f69e8 Tiwei Bie2020-03-26 463r = vhost_vdpa_se
Re: [PATCH v8 -tip 02/26] sched: Introduce sched_class::pick_task()
On 2020/10/22 23:25, Joel Fernandes wrote: > On Thu, Oct 22, 2020 at 12:59 AM Li, Aubrey wrote: >> >> On 2020/10/20 9:43, Joel Fernandes (Google) wrote: >>> From: Peter Zijlstra >>> >>> Because sched_class::pick_next_task() also implies >>> sched_class::set_next_task() (and possibly put_prev_task() and >>> newidle_balance) it is not state invariant. This makes it unsuitable >>> for remote task selection. >>> >>> Tested-by: Julien Desfossez >>> Signed-off-by: Peter Zijlstra (Intel) >>> Signed-off-by: Vineeth Remanan Pillai >>> Signed-off-by: Julien Desfossez >>> Signed-off-by: Joel Fernandes (Google) >>> --- >>> kernel/sched/deadline.c | 16 ++-- >>> kernel/sched/fair.c | 32 +++- >>> kernel/sched/idle.c | 8 >>> kernel/sched/rt.c| 14 -- >>> kernel/sched/sched.h | 3 +++ >>> kernel/sched/stop_task.c | 13 +++-- >>> 6 files changed, 79 insertions(+), 7 deletions(-) >>> >>> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c >>> index 814ec49502b1..0271a7848ab3 100644 >>> --- a/kernel/sched/deadline.c >>> +++ b/kernel/sched/deadline.c >>> @@ -1848,7 +1848,7 @@ static struct sched_dl_entity >>> *pick_next_dl_entity(struct rq *rq, >>> return rb_entry(left, struct sched_dl_entity, rb_node); >>> } >>> >>> -static struct task_struct *pick_next_task_dl(struct rq *rq) >>> +static struct task_struct *pick_task_dl(struct rq *rq) >>> { >>> struct sched_dl_entity *dl_se; >>> struct dl_rq *dl_rq = >dl; >>> @@ -1860,7 +1860,18 @@ static struct task_struct *pick_next_task_dl(struct >>> rq *rq) >>> dl_se = pick_next_dl_entity(rq, dl_rq); >>> BUG_ON(!dl_se); >>> p = dl_task_of(dl_se); >>> - set_next_task_dl(rq, p, true); >>> + >>> + return p; >>> +} >>> + >>> +static struct task_struct *pick_next_task_dl(struct rq *rq) >>> +{ >>> + struct task_struct *p; >>> + >>> + p = pick_task_dl(rq); >>> + if (p) >>> + set_next_task_dl(rq, p, true); >>> + >>> return p; >>> } >>> >>> @@ -2517,6 +2528,7 @@ const struct sched_class dl_sched_class >>> >>> #ifdef CONFIG_SMP >>> .balance= 
balance_dl, >>> + .pick_task = pick_task_dl, >>> .select_task_rq = select_task_rq_dl, >>> .migrate_task_rq= migrate_task_rq_dl, >>> .set_cpus_allowed = set_cpus_allowed_dl, >>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >>> index dbd9368a959d..bd6aed63f5e3 100644 >>> --- a/kernel/sched/fair.c >>> +++ b/kernel/sched/fair.c >>> @@ -4450,7 +4450,7 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct >>> sched_entity *curr) >>>* Avoid running the skip buddy, if running something else can >>>* be done without getting too unfair. >>>*/ >>> - if (cfs_rq->skip == se) { >>> + if (cfs_rq->skip && cfs_rq->skip == se) { >>> struct sched_entity *second; >>> >>> if (se == curr) { >>> @@ -6976,6 +6976,35 @@ static void check_preempt_wakeup(struct rq *rq, >>> struct task_struct *p, int wake_ >>> set_last_buddy(se); >>> } >>> >>> +#ifdef CONFIG_SMP >>> +static struct task_struct *pick_task_fair(struct rq *rq) >>> +{ >>> + struct cfs_rq *cfs_rq = >cfs; >>> + struct sched_entity *se; >>> + >>> + if (!cfs_rq->nr_running) >>> + return NULL; >>> + >>> + do { >>> + struct sched_entity *curr = cfs_rq->curr; >>> + >>> + se = pick_next_entity(cfs_rq, NULL); >>> + >>> + if (curr) { >>> + if (se && curr->on_rq) >>> + update_curr(cfs_rq); >>> + >>> + if (!se || entity_before(curr, se)) >>> + se = curr; >>> + } >>> + >>> + cfs_rq = group_cfs_rq(se); >>> + } while (cfs_rq); >>> ++ >>> + return task_of(se); >>> +} >>> +#endif >> >> One of my machines hangs when I run uperf with only one message: >> [ 719.034962] BUG: kernel NULL pointer dereference, address: >> 0050 >> >> Then I replicated the problem on my another machine(no serial console), >> here is the stack by manual copy. >> >> Call Trace: >> pick_next_entity+0xb0/0x160 >> pick_task_fair+0x4b/0x90 >> __schedule+0x59b/0x12f0 >> schedule_idle+0x1e/0x40 >> do_idle+0x193/0x2d0 >> cpu_startup_entry+0x19/0x20 >> start_secondary+0x110/0x150 >> secondary_startup_64_no_verify+0xa6/0xab > > Interesting. 
Wondering if we screwed something up in the rebase. > > Questions: > 1. Does the issue happen if you just apply only up until this patch, > or the entire series? I applied the entire series and just find a related patch to report the issue. > 2. Do you see the issue in v7? Not much if at all has changed in this > part of the code from v7 -> v8 but could be something in the newer > kernel. > IIRC, I can run uperf successfully on v7. I'm on tip/master 2d3e8c9424c9
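
The split described in the quoted commit message (a side-effect-free
pick_task() that pick_next_task() wraps together with set_next_task()) can
be sketched in user space as follows. This is a simplified illustration
only: struct rq and struct task here are toy types, not the scheduler's,
and set_next_calls is an invented counter that makes the state change
visible.

```c
#include <stddef.h>

struct task {
	int prio;
};

struct rq {
	struct task *queued;	/* best runnable task, or NULL */
	struct task *curr;	/* task selected to run next */
	int set_next_calls;	/* counts state changes, for illustration */
};

/* Pure selection: inspects the runqueue and changes nothing. */
static struct task *pick_task(const struct rq *rq)
{
	return rq->queued;
}

/* The state change, kept separate from selection. */
static void set_next_task(struct rq *rq, struct task *p)
{
	rq->curr = p;
	rq->set_next_calls++;
}

/* Local pick: selection plus the state change, as in the quoted hunk. */
static struct task *pick_next_task(struct rq *rq)
{
	struct task *p = pick_task(rq);

	if (p)
		set_next_task(rq, p);
	return p;
}
```

Because pick_task() leaves the runqueue untouched, a caller can ask "what
would this runqueue run?" without committing to it, which is the property
the commit message calls state invariance.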
[PATCH net] net: hns3: Clear the CMDQ registers before unmapping BAR region
When unbinding the hns3 driver with the HNS3 VF, I got the following kernel panic: [ 265.709989] Unable to handle kernel paging request at virtual address 800054627000 [ 265.717928] Mem abort info: [ 265.720740] ESR = 0x9647 [ 265.723810] EC = 0x25: DABT (current EL), IL = 32 bits [ 265.729126] SET = 0, FnV = 0 [ 265.732195] EA = 0, S1PTW = 0 [ 265.735351] Data abort info: [ 265.738227] ISV = 0, ISS = 0x0047 [ 265.742071] CM = 0, WnR = 1 [ 265.745055] swapper pgtable: 4k pages, 48-bit VAs, pgdp=09b54000 [ 265.751753] [800054627000] pgd=202ff003, p4d=202ff003, pud=2020020eb003, pmd=0020a0dfc003, pte= [ 265.764314] Internal error: Oops: 9647 [#1] SMP [ 265.830357] CPU: 61 PID: 20319 Comm: bash Not tainted 5.9.0+ #206 [ 265.836423] Hardware name: Huawei TaiShan 2280 V2/BC82AMDDA, BIOS 1.05 09/18/2019 [ 265.843873] pstate: 8049 (Nzcv daif +PAN -UAO -TCO BTYPE=--) [ 265.843890] pc : hclgevf_cmd_uninit+0xbc/0x300 [ 265.861988] lr : hclgevf_cmd_uninit+0xb0/0x300 [ 265.861992] sp : 80004c983b50 [ 265.881411] pmr_save: 00e0 [ 265.884453] x29: 80004c983b50 x28: 20280bbce500 [ 265.889744] x27: x26: [ 265.895034] x25: 800011a1f000 x24: 800011a1fe90 [ 265.900325] x23: 0020ce9b00d8 x22: 0020ce9b0150 [ 265.905616] x21: 800010d70e90 x20: 800010d70e90 [ 265.910906] x19: 0020ce9b0080 x18: 0004 [ 265.916198] x17: x16: 800011ae32e8 [ 265.916201] x15: 0028 x14: 0002 [ 265.916204] x13: 800011ae32e8 x12: 00012ad8 [ 265.946619] x11: 80004c983b50 x10: [ 265.951911] x9 : 8000115d0888 x8 : [ 265.951914] x7 : 800011890b20 x6 : c0007fff [ 265.951917] x5 : 80004c983930 x4 : 0001 [ 265.951919] x3 : a027eec1b000 x2 : 2b78ccbbff369100 [ 265.964487] x1 : x0 : 800054627000 [ 265.964491] Call trace: [ 265.964494] hclgevf_cmd_uninit+0xbc/0x300 [ 265.964496] hclgevf_uninit_ae_dev+0x9c/0xe8 [ 265.964501] hnae3_unregister_ae_dev+0xb0/0x130 [ 265.964516] hns3_remove+0x34/0x88 [hns3] [ 266.009683] pci_device_remove+0x48/0xf0 [ 266.009692] device_release_driver_internal+0x114/0x1e8 [ 266.030058] 
device_driver_detach+0x28/0x38 [ 266.034224] unbind_store+0xd4/0x108 [ 266.037784] drv_attr_store+0x40/0x58 [ 266.041435] sysfs_kf_write+0x54/0x80 [ 266.045081] kernfs_fop_write+0x12c/0x250 [ 266.049076] vfs_write+0xc4/0x248 [ 266.052378] ksys_write+0x74/0xf8 [ 266.055677] __arm64_sys_write+0x24/0x30 [ 266.059584] el0_svc_common.constprop.3+0x84/0x270 [ 266.064354] do_el0_svc+0x34/0xa0 [ 266.067658] el0_svc+0x38/0x40 [ 266.070700] el0_sync_handler+0x8c/0xb0 [ 266.074519] el0_sync+0x140/0x180 It looks like the BAR memory region had already been unmapped before we start clearing CMDQ registers in it, which is pretty bad and the kernel happily kills itself because of a Current EL Data Abort (on arm64). Moving the CMDQ uninitialization a bit early fixes the issue for me. Signed-off-by: Zenghui Yu --- I have almost zero knowledge about the hns3 driver. You can regard this as a report and make a better fix if possible. I can't even figure out that how can we live with this issue for a long time... It should exists since commit 34f81f049e35 ("net: hns3: clear command queue's registers when unloading VF driver"), where we start writing something into the unmapped area. drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c index 50c84c5e65d2..c8e3fdd5999c 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c @@ -3262,8 +3262,8 @@ static void hclgevf_uninit_hdev(struct hclgevf_dev *hdev) hclgevf_uninit_msi(hdev); } - hclgevf_pci_uninit(hdev); hclgevf_cmd_uninit(hdev); + hclgevf_pci_uninit(hdev); hclgevf_uninit_mac_list(hdev); } -- 2.19.1
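
The ordering problem in the patch above can be shown with a toy model in
which heap memory stands in for the mapped BAR. This is a hedged sketch,
not hns3 code: struct toy_dev and the toy_* functions are hypothetical, and
where the real driver takes a data abort the model simply returns false.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy device: "regs" stands in for an ioremap()ed BAR. */
struct toy_dev {
	uint32_t *regs;		/* NULL once "unmapped" */
};

static bool toy_map(struct toy_dev *d)
{
	d->regs = calloc(4, sizeof(*d->regs));
	return d->regs != NULL;
}

/* Clears the command-queue registers; fails if the BAR is already gone. */
static bool toy_cmd_uninit(struct toy_dev *d)
{
	if (!d->regs)
		return false;	/* the real driver would fault here instead */
	d->regs[0] = 0;
	d->regs[1] = 0;
	return true;
}

/* Releases the mapping, like the PCI uninit step. */
static void toy_pci_uninit(struct toy_dev *d)
{
	free(d->regs);
	d->regs = NULL;
}

/* Order proposed in the patch: touch registers first, then unmap. */
static bool toy_uninit_fixed(struct toy_dev *d)
{
	bool ok = toy_cmd_uninit(d);

	toy_pci_uninit(d);
	return ok;
}

/* Original order: unmap first, so the register write has no target. */
static bool toy_uninit_broken(struct toy_dev *d)
{
	toy_pci_uninit(d);
	return toy_cmd_uninit(d);
}
```

The general rule the sketch illustrates is that teardown should mirror
setup in reverse: anything that writes through a mapping has to be shut
down before the mapping itself is released.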
Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core
On 22-10-20, 17:55, Vincent Guittot wrote:
> On Thu, 22 Oct 2020 at 17:45, A L wrote:
> >
> > From: Peter Zijlstra -- Sent: 2020-10-22 - 14:29
> >
> > > On Thu, Oct 22, 2020 at 02:19:29PM +0200, Rafael J. Wysocki wrote:
> > >> > However I do want to retire ondemand, conservative and also very much
> > >> > intel_pstate/active mode.
> > >>
> > >> I agree in general, but IMO it would not be prudent to do that without
> > >> making schedutil provide the same level of performance in all of the
> > >> relevant use cases.
> > >
> > > Agreed; I thought I had understood we were there already.
> >
> > Hi,
> >
> > Currently schedutil does not populate all stats like ondemand does, which
> > can be a problem for some monitoring software.
> >
> > On my AMD 3000G CPU with kernel-5.9.1:
> >
> > grep . /sys/devices/system/cpu/cpufreq/policy0/stats/*
> >
> > With ondemand:
> > time_in_state:390 145179
> > time_in_state:160 9588482
> > total_trans:177565
> > trans_table: From :To
> > trans_table: : 390 160
> > trans_table: 390: 0 88783
> > trans_table: 160: 88782 0
> >
> > With schedutil only two files exist:
> > reset:
> > total_trans:216609
> >
> > I'd really like to have these stats populated with schedutil, if that's
> > possible.
>
> Your problem might have been fixed with
> commit 96f60cddf7a1 ("cpufreq: stats: Enable stats for fast-switch as well")

Thanks Vincent. Right, I have already fixed that for everyone.

-- 
viresh
Re: [LTP] mmstress[1309]: segfault at 7f3d71a36ee8 ip 00007f3d77132bdf sp 00007f3d71a36ee8 error 4 in libc-2.27.so[7f3d77058000+1aa000]
On Thu, Oct 22, 2020 at 08:05:05PM -0700, Linus Torvalds wrote:
> On Thu, Oct 22, 2020 at 6:36 PM Daniel Díaz wrote:
> >
> > The kernel Naresh originally referred to is here:
> > https://builds.tuxbuild.com/SCI7Xyjb7V2NbfQ2lbKBZw/
>
> Thanks.
>
> And when I started looking at it, I realized that my original idea
> ("just look for __put_user_nocheck_X calls, there aren't so many of
> those") was garbage, and that I was just being stupid.
>
> Yes, the commit that broke was about __put_user(), but in order to not
> duplicate all the code, it re-used the regular put_user()
> infrastructure, and so all the normal put_user() calls are potential
> problem spots too if this is about the compiler interaction with KASAN
> and the asm changes.
>
> So it's not just a couple of special cases to look at, it's all the
> normal cases too.
>
> Ok, back to the drawing board, but I think reverting it is probably
> the right thing to do if I can't think of something smart.
>
> That said, since you see this on x86-64, where the whole ugly trick with that
>
>    register asm("%"_ASM_AX)
>
> is unnecessary (because the 8-byte case is still just a single
> register, no %eax:%edx games needed), it would be interesting to hear
> if the attached patch fixes it. That would confirm that the problem
> really is due to some register allocation issue interaction (or,
> alternatively, it would tell me that there's something else going on).

I haven't reproduced the crash, but I did find a smoking gun that confirms
the "register shenanigans are evil shenanigans" theory. I ran into a similar
thing recently where a seemingly innocuous line of code after loading a value
into a register variable wreaked havoc because it clobbered the input register.

This put_user() in schedule_tail():

	if (current->set_child_tid)
		put_user(task_pid_vnr(current), current->set_child_tid);

generates the following assembly with KASAN out-of-line:

   0x810dccc9 <+73>:	xor    %edx,%edx
   0x810dcccb <+75>:	xor    %esi,%esi
   0x810dcccd <+77>:	mov    %rbp,%rdi
   0x810dccd0 <+80>:	callq  0x810bf5e0 <__task_pid_nr_ns>
   0x810dccd5 <+85>:	mov    %r12,%rdi
   0x810dccd8 <+88>:	callq  0x81388c60 <__asan_load8>
   0x810dccdd <+93>:	mov    0x590(%rbp),%rcx
   0x810dcce4 <+100>:	callq  0x817708a0 <__put_user_4>
   0x810dcce9 <+105>:	pop    %rbx
   0x810dccea <+106>:	pop    %rbp
   0x810dcceb <+107>:	pop    %r12

__task_pid_nr_ns() returns the pid in %rax, which gets clobbered by
__asan_load8()'s check on current for the current->set_child_tid dereference.
[git pull] Input updates for v5.10-rc0
Hi Linus, Please pull from: git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input.git for-linus to receive updates for the input subsystem. You will get: - a new driver for ADC driven joysticks - a new Zintix touchscreen driver - enhancements to Intel SoC button array driver - support for F3A "function" in Synaptics RMI4 driver - assorted driver fixups Changelog: - Artur Rojek (2): dt-bindings: input: Add docs for ADC driven joystick Input: joystick - add ADC attached joystick driver. Dan Carpenter (1): Input: imx6ul_tsc - clean up some errors in imx6ul_tsc_resume() Dmitry Torokhov (1): Input: imx6ul_tsc - unify open/close and PM paths Furquan Shaikh (1): Input: raydium_i2c_ts - use single i2c_transfer transaction when using RM_CMD_BANK_SWITCH Hans de Goede (8): Input: allocate keycodes for notification-center, pickup-phone and hangup-phone Input: allocate keycode for Fn + right shift platform/x86: thinkpad_acpi: Add support for new hotkeys found on X1C8 / T14 platform/x86: thinkpad_acpi: Map Clipping tool hotkey to KEY_SELECTIVE_SCREENSHOT Input: soc_button_array - add active_low setting to soc_button_info Input: soc_button_array - add support for INT33D3 tablet-mode switch devices Input: soc_button_array - work around DSDTs which modify the irqflags Input: synaptics - enable InterTouch for ThinkPad T14 Gen 1 Jason A. Donenfeld (2): Input: synaptics-rmi4 - support bootloader v8 in f34v7 Input: synaptics - enable InterTouch for ThinkPad P1/X1E gen 2 Joe Perches (1): Input: MT - avoid comma separated statements Johnny Chuang (2): Input: elants_i2c - report resolution of ABS_MT_TOUCH_MAJOR by FW information. 
Input: elants_i2c - fix typo for an attribute to show calibration count Kenny Levinsen (1): Input: evdev - per-client waitgroups Krzysztof Kozlowski (4): Input: ep93xx_keypad - fix handling of platform_get_irq() error Input: omap4-keypad - fix handling of platform_get_irq() error Input: twl4030_keypad - fix handling of platform_get_irq() error Input: sun4i-ps2 - fix handling of platform_get_irq() error Michael Srba (2): dt-bindings: input/touchscreen: add bindings for zinitix Input: add zinitix touchscreen driver Mika Penttilä (1): Input: Add MAINTAINERS entry for SiS i2c touch input driver Vincent Huang (2): Input: synaptics-rmi4 - rename f30_data to gpio_data Input: synaptics-rmi4 - add support for F3A YueHaibing (1): Input: stmfts - fix a & vs && typo Diffstat: .../devicetree/bindings/input/adc-joystick.yaml| 121 + .../bindings/input/touchscreen/zinitix.txt | 40 ++ .../devicetree/bindings/vendor-prefixes.yaml | 2 + MAINTAINERS| 7 + drivers/hid/hid-rmi.c | 2 +- drivers/input/evdev.c | 19 +- drivers/input/input-mt.c | 11 +- drivers/input/joystick/Kconfig | 10 + drivers/input/joystick/Makefile| 1 + drivers/input/joystick/adc-joystick.c | 264 ++ drivers/input/keyboard/ep93xx_keypad.c | 4 +- drivers/input/keyboard/omap4-keypad.c | 6 +- drivers/input/keyboard/twl4030_keypad.c| 8 +- drivers/input/misc/soc_button_array.c | 100 +++- drivers/input/mouse/synaptics.c| 6 +- drivers/input/rmi4/Kconfig | 8 + drivers/input/rmi4/Makefile| 1 + drivers/input/rmi4/rmi_bus.c | 3 + drivers/input/rmi4/rmi_driver.h| 1 + drivers/input/rmi4/rmi_f30.c | 14 +- drivers/input/rmi4/rmi_f34v7.c | 9 +- drivers/input/rmi4/rmi_f3a.c | 241 + drivers/input/serio/sun4i-ps2.c| 9 +- drivers/input/touchscreen/Kconfig | 12 + drivers/input/touchscreen/Makefile | 1 + drivers/input/touchscreen/elants_i2c.c | 8 +- drivers/input/touchscreen/imx6ul_tsc.c | 47 +- drivers/input/touchscreen/raydium_i2c_ts.c | 131 ++--- drivers/input/touchscreen/stmfts.c | 2 +- drivers/input/touchscreen/zinitix.c| 581 + 
drivers/platform/x86/thinkpad_acpi.c | 18 +- include/linux/rmi.h| 11 +- include/uapi/linux/input-event-codes.h | 4 + 33 files changed, 1531 insertions(+), 171 deletions(-) create mode 100644 Documentation/devicetree/bindings/input/adc-joystick.yaml create mode 100644 Documentation/devicetree/bindings/input/touchscreen/zinitix.txt create mode 100644 drivers/input/joystick/adc-joystick.c create mode 100644
ERROR: modpost: "has_transparent_hugepage" undefined!
tree: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master head: f9893351acaecf0a414baf9942b48d5bb5c688c6 commit: 6d82120f41561426dd67c86380d779b4599d070d device-dax: add an 'align' attribute date: 9 days ago config: mips-randconfig-m031-20201022 (attached as .config) compiler: mips64-linux-gcc (GCC) 9.3.0 reproduce (this is a W=1 build): wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6d82120f41561426dd67c86380d779b4599d070d git remote add linus https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git git fetch --no-tags linus master git checkout 6d82120f41561426dd67c86380d779b4599d070d # save the attached .config to linux build tree COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=mips If you fix the issue, kindly add following tag as appropriate Reported-by: kernel test robot All errors (new ones prefixed by >>, old ones prefixed by <<): >> ERROR: modpost: "has_transparent_hugepage" [drivers/dax/dax.ko] undefined! ERROR: modpost: "spurious_interrupt" [drivers/mfd/ioc3.ko] undefined! ERROR: modpost: "pci_find_host_bridge" [drivers/mfd/ioc3.ko] undefined! --- 0-DAY CI Kernel Test Service, Intel Corporation https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org .config.gz Description: application/gzip
Re: [PATCH] PM / s2idle: Export s2idle_set_ops
On Thu, 2020-10-22 at 08:02 +0100, Sudeep Holla wrote:
> On Thu, Oct 22, 2020 at 02:17:48PM +0800, Claude Yen wrote:
> > As suspend_set_ops is exported in commit a5e4fd8783a2
> > ("PM / Suspend: Export suspend_set_ops, suspend_valid_only_mem"),
> > exporting s2idle_set_ops to make kernel module setup s2idle ops too.
> >
> > In this way, kernel module can hook platform suspend
> > functions regardless of Suspend-to-Ram(S2R) or
> > Suspend-to-Idle(S2I)
> >
>
> If this is for arm64 platform, then NACK. You must use PSCI and it will
> set the ops and it can't be module.
>

PSCI uses suspend_set_ops instead, and suspend_set_ops was exported years ago.

Suspend-to-Idle (S2I) is another suspend method supported by the Linux kernel. The underlying platform can hook the corresponding s2idle_ops via s2idle_set_ops. For example, S2I is now being introduced on MediaTek SoC platforms. In addition, the power management driver there is built as a kernel module.

Mobile platforms are now calling for kernel drivers to be built as kernel modules, which makes it easier to migrate drivers to newer Linux kernels.

Ref: https://linuxplumbersconf.org/event/7/contributions/790/

Regards,
Claude
[PATCHv4 net-next] dropwatch: Support monitoring of dropped frames
From: Izabela Bakollari

Dropwatch is a utility that monitors dropped frames by having userspace record them over the dropwatch protocol over a file. This augmentation allows live monitoring of dropped frames using tools like tcpdump.

With this feature, dropwatch allows two additional commands (start and stop interface) which allow the assignment of a net_device to the dropwatch protocol. When assigned, dropwatch will clone dropped frames, and receive them on the assigned interface, allowing tools like tcpdump to monitor for them.

With this feature, create a dummy ethernet interface (ip link add dev dummy0 type dummy), assign it to the dropwatch kernel subsystem by using these new commands, and then monitor dropped frames in real time by running tcpdump -i dummy0.

Signed-off-by: Izabela Bakollari
---
 include/uapi/linux/net_dropmon.h | 3 +
 net/core/drop_monitor.c | 120 +++
 2 files changed, 123 insertions(+)

diff --git a/include/uapi/linux/net_dropmon.h b/include/uapi/linux/net_dropmon.h index 67e31f329190..e8e861e03a8a 100644 --- a/include/uapi/linux/net_dropmon.h +++ b/include/uapi/linux/net_dropmon.h @@ -58,6 +58,8 @@ enum { NET_DM_CMD_CONFIG_NEW, NET_DM_CMD_STATS_GET, NET_DM_CMD_STATS_NEW, + NET_DM_CMD_START_IFC, + NET_DM_CMD_STOP_IFC, _NET_DM_CMD_MAX, }; @@ -93,6 +95,7 @@ enum net_dm_attr { NET_DM_ATTR_SW_DROPS, /* flag */ NET_DM_ATTR_HW_DROPS, /* flag */ NET_DM_ATTR_FLOW_ACTION_COOKIE, /* binary */ + NET_DM_ATTR_IFNAME, /* string */ __NET_DM_ATTR_MAX, NET_DM_ATTR_MAX = __NET_DM_ATTR_MAX - 1 diff --git a/net/core/drop_monitor.c b/net/core/drop_monitor.c index 8e33cec9fc4e..dea85291808b 100644 --- a/net/core/drop_monitor.c +++ b/net/core/drop_monitor.c @@ -30,6 +30,7 @@ #include #include #include +#include #include #include @@ -46,6 +47,7 @@ */ static int trace_state = TRACE_OFF; static bool monitor_hw; +struct net_device *interface; /* net_dm_mutex * @@ -54,6 +56,8 @@ static bool monitor_hw; */ static DEFINE_MUTEX(net_dm_mutex); +static 
DEFINE_SPINLOCK(interface_lock); + struct net_dm_stats { u64 dropped; struct u64_stats_sync syncp; @@ -217,6 +221,7 @@ static void trace_drop_common(struct sk_buff *skb, void *location) struct nlattr *nla; int i; struct sk_buff *dskb; + struct sk_buff *nskb = NULL; struct per_cpu_dm_data *data; unsigned long flags; @@ -255,6 +260,20 @@ static void trace_drop_common(struct sk_buff *skb, void *location) out: spin_unlock_irqrestore(>lock, flags); + spin_lock_irqsave(_lock, flags); + if (interface && interface != skb->dev) { + nskb = skb_clone(skb, GFP_ATOMIC); + if (!nskb) + goto free; + nskb->dev = interface; + } + spin_unlock_irqrestore(_lock, flags); + if (nskb) + netif_receive_skb(nskb); + +free: + spin_unlock_irqrestore(_lock, flags); + return; } static void trace_kfree_skb_hit(void *ignore, struct sk_buff *skb, void *location) @@ -1315,6 +1334,89 @@ static int net_dm_cmd_trace(struct sk_buff *skb, return -EOPNOTSUPP; } +static bool is_dummy_dev(struct net_device *dev) +{ + struct ethtool_drvinfo drvinfo; + + if (dev->ethtool_ops && dev->ethtool_ops->get_drvinfo) { + memset(, 0, sizeof(drvinfo)); + dev->ethtool_ops->get_drvinfo(dev, ); + + if (strcmp(drvinfo.driver, "dummy")) + return false; + return true; + } + return false; +} + +static int net_dm_interface_start(struct net *net, const char *ifname) +{ + struct net_device *dev = dev_get_by_name(net, ifname); + unsigned long flags; + int rc = -EBUSY; + + if (!dev) + return -ENODEV; + + if (!is_dummy_dev(dev)) { + rc = -EOPNOTSUPP; + goto out; + } + + spin_lock_irqsave(_lock, flags); + if (!interface) { + interface = dev; + rc = 0; + } + spin_unlock_irqrestore(_lock, flags); + + goto out; + +out: + dev_put(dev); + return rc; +} + +static int net_dm_interface_stop(struct net *net, const char *ifname) +{ + unsigned long flags; + int rc = -ENODEV; + + spin_lock_irqsave(_lock, flags); + if (interface && interface->name == ifname) { + dev_put(interface); + interface = NULL; + rc = 0; + } + 
spin_unlock_irqrestore(_lock, flags); + + return rc; +} + +static int net_dm_cmd_ifc_trace(struct sk_buff *skb, struct genl_info *info) +{ + struct net *net = sock_net(skb->sk); + char ifname[IFNAMSIZ]; + + if (net_dm_is_monitoring()) +
Re: Question on io-wq
From: Zhang, Qiang
Sent: October 23, 2020 11:55
To: Jens Axboe
Cc: v...@zeniv.linux.org.uk; io-ur...@vger.kernel.org; linux-kernel@vger.kernel.org; linux-fsde...@vger.kernel.org
Subject: Re: Question on io-wq

From: Jens Axboe
Sent: October 22, 2020 22:08
To: Zhang, Qiang
Cc: v...@zeniv.linux.org.uk; io-ur...@vger.kernel.org; linux-kernel@vger.kernel.org; linux-fsde...@vger.kernel.org
Subject: Re: Question on io-wq

On 10/22/20 3:02 AM, Zhang,Qiang wrote: > > Hi Jens Axboe > > There are some problem in 'io_wqe_worker' thread, when the > 'io_wqe_worker' be create and Setting the affinity of CPUs in NUMA > nodes, due to CPU hotplug, When the last CPU going down, the > 'io_wqe_worker' thread will run anywhere. when the CPU in the node goes > online again, we should restore their cpu bindings? >Something like the below should help in ensuring affinities are >always correct - trigger an affinity set for an online CPU event. We >should not need to do it for offlining. Can you test it? >diff --git a/fs/io-wq.c b/fs/io-wq.c >index 4012ff541b7b..3bf029d1170e 100644 >--- a/fs/io-wq.c >+++ b/fs/io-wq.c >@@ -19,6 +19,7 @@ >#include >#include >#include >+#include >#include "io-wq.h" > >@@ -123,9 +124,13 @@ struct io_wq { > refcount_t refs; > struct completion done; > >+ struct hlist_node cpuhp_node; >+ > refcount_t use_refs; >}; > >+static enum cpuhp_state io_wq_online; >+ >static bool io_worker_get(struct io_worker *worker) >{ > return refcount_inc_not_zero(>ref); >@@ -1096,6 +1101,13 @@ struct io_wq *io_wq_create(unsigned bounded, >struct >io_wq_data *data) > return ERR_PTR(-ENOMEM); > } > >+ ret = cpuhp_state_add_instance_nocalls(io_wq_online, >>cpuhp_node); >+ if (ret) { >+ kfree(wq->wqes); >+ kfree(wq); >+ return ERR_PTR(ret); >+ } >+ >wq->free_work = data->free_work; >wq->do_work = data->do_work; > >@@ -1145,6 +1157,7 @@ struct io_wq *io_wq_create(unsigned bounded, >struct >io_wq_data *data) > ret = PTR_ERR(wq->manager); > complete(>done); >err: >+ 
cpuhp_state_remove_instance_nocalls(io_wq_online, >>cpuhp_node); > for_each_node(node) > kfree(wq->wqes[node]); > kfree(wq->wqes); >@@ -1164,6 +1177,8 @@ static void __io_wq_destroy(struct io_wq *wq) >{ > int node; > >+ cpuhp_state_remove_instance_nocalls(io_wq_online, >>cpuhp_node); >+ > set_bit(IO_WQ_BIT_EXIT, >state); > if (wq->manager) > kthread_stop(wq->manager); >@@ -1191,3 +1206,40 @@ struct task_struct *io_wq_get_task(struct io_wq >*wq) >{ > return wq->manager; >} >+ >+static bool io_wq_worker_affinity(struct io_worker *worker, void *data) >+{ >+ struct task_struct *task = worker->task; >+ unsigned long flags; >+ struct rq_flags rf; struct rq *rq; rq = task_rq_lock(task, ); --- raw_spin_lock_irqsave(>pi_lock, flags); >+ do_set_cpus_allowed(task, cpumask_of_node(worker->wqe->node)); >+ task->flags |= PF_NO_SETAFFINITY; --- raw_spin_unlock_irqrestore(>pi_lock, flags); task_rq_unlock(rq, task, ); >+ return false; >+} >+ >+static int io_wq_cpu_online(unsigned int cpu, struct hlist_node *node) >+{ >+ struct io_wq *wq = hlist_entry_safe(node, struct io_wq, cpuhp_node); >+ int i; >+ >+ rcu_read_lock(); >+ for_each_node(i) >+ io_wq_for_each_worker(wq->wqes[i], io_wq_worker_affinity, >>NULL); >+ rcu_read_unlock(); >+ return 0; >+} >+ >+static __init int io_wq_init(void) >+{ >+ int ret; >+ >+ ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN, >"io->wq/online", >+ io_wq_cpu_online, NULL); >+ if (ret < 0) >+ return ret; >+ io_wq_online = ret; >+ return 0; >+} >+subsys_initcall(io_wq_init); > >-- >Jens Axboe
Re: Question on io-wq
From: Jens Axboe
Sent: October 22, 2020 22:08
To: Zhang, Qiang
Cc: v...@zeniv.linux.org.uk; io-ur...@vger.kernel.org; linux-kernel@vger.kernel.org; linux-fsde...@vger.kernel.org
Subject: Re: Question on io-wq

On 10/22/20 3:02 AM, Zhang,Qiang wrote: > > Hi Jens Axboe > > There are some problem in 'io_wqe_worker' thread, when the > 'io_wqe_worker' be create and Setting the affinity of CPUs in NUMA > nodes, due to CPU hotplug, When the last CPU going down, the > 'io_wqe_worker' thread will run anywhere. when the CPU in the node goes > online again, we should restore their cpu bindings? >Something like the below should help in ensuring affinities are >always correct - trigger an affinity set for an online CPU event. We >should not need to do it for offlining. Can you test it? >diff --git a/fs/io-wq.c b/fs/io-wq.c >index 4012ff541b7b..3bf029d1170e 100644 >--- a/fs/io-wq.c >+++ b/fs/io-wq.c >@@ -19,6 +19,7 @@ >#include >#include >#include >+#include >#include "io-wq.h" > >@@ -123,9 +124,13 @@ struct io_wq { > refcount_t refs; > struct completion done; > >+ struct hlist_node cpuhp_node; >+ > refcount_t use_refs; >}; > >+static enum cpuhp_state io_wq_online; >+ >static bool io_worker_get(struct io_worker *worker) >{ > return refcount_inc_not_zero(>ref); >@@ -1096,6 +1101,13 @@ struct io_wq *io_wq_create(unsigned bounded, >struct >io_wq_data *data) > return ERR_PTR(-ENOMEM); > } > >+ ret = cpuhp_state_add_instance_nocalls(io_wq_online, >>cpuhp_node); >+ if (ret) { >+ kfree(wq->wqes); >+ kfree(wq); >+ return ERR_PTR(ret); >+ } >+ >wq->free_work = data->free_work; >wq->do_work = data->do_work; > >@@ -1145,6 +1157,7 @@ struct io_wq *io_wq_create(unsigned bounded, >struct >io_wq_data *data) > ret = PTR_ERR(wq->manager); > complete(>done); >err: >+ cpuhp_state_remove_instance_nocalls(io_wq_online, >>cpuhp_node); > for_each_node(node) > kfree(wq->wqes[node]); > kfree(wq->wqes); >@@ -1164,6 +1177,8 @@ static void __io_wq_destroy(struct io_wq *wq) >{ > int node; > >+ 
cpuhp_state_remove_instance_nocalls(io_wq_online, >>cpuhp_node); >+ > set_bit(IO_WQ_BIT_EXIT, >state); > if (wq->manager) > kthread_stop(wq->manager); >@@ -1191,3 +1206,40 @@ struct task_struct *io_wq_get_task(struct io_wq >*wq) >{ > return wq->manager; >} >+ >+static bool io_wq_worker_affinity(struct io_worker *worker, void *data) >+{ >+ struct task_struct *task = worker->task; >+ unsigned long flags; >+ struct rq_flags rf; >+ raw_spin_lock_irqsave(>pi_lock, flags); >+ do_set_cpus_allowed(task, cpumask_of_node(worker->wqe->node)); >+ task->flags |= PF_NO_SETAFFINITY; >+ raw_spin_unlock_irqrestore(>pi_lock, flags); >+ return false; >+} >+ >+static int io_wq_cpu_online(unsigned int cpu, struct hlist_node *node) >+{ >+ struct io_wq *wq = hlist_entry_safe(node, struct io_wq, cpuhp_node); >+ int i; >+ >+ rcu_read_lock(); >+ for_each_node(i) >+ io_wq_for_each_worker(wq->wqes[i], io_wq_worker_affinity, >>NULL); >+ rcu_read_unlock(); >+ return 0; >+} >+ >+static __init int io_wq_init(void) >+{ >+ int ret; >+ >+ ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN, >"io->wq/online", >+ io_wq_cpu_online, NULL); >+ if (ret < 0) >+ return ret; >+ io_wq_online = ret; >+ return 0; >+} >+subsys_initcall(io_wq_init); > >-- >Jens Axboe
Re: [PATCH v7 1/4] powerpc: Refactor kexec functions to move arch independent code to kernel
Hello Lakshmi, Lakshmi Ramasubramanian writes: > On 10/20/20 8:17 PM, Mimi Zohar wrote: >> On Tue, 2020-10-20 at 19:25 -0700, Lakshmi Ramasubramanian wrote: >>> On 10/20/20 1:00 PM, Mimi Zohar wrote: Hi Lakshmi, On Wed, 2020-09-30 at 13:59 -0700, Lakshmi Ramasubramanian wrote: > The functions remove_ima_buffer() and delete_fdt_mem_rsv() that handle > carrying forward the IMA measurement logs on kexec for powerpc do not > have architecture specific code, but they are currently defined for > powerpc only. > > remove_ima_buffer() and delete_fdt_mem_rsv() are used to remove > the IMA log entry from the device tree and free the memory reserved > for the log. These functions need to be defined even if the current > kernel does not support carrying forward IMA log across kexec since > the previous kernel could have supported that and therefore the current > kernel needs to free the allocation. > > Rename remove_ima_buffer() to remove_ima_kexec_buffer(). > Define remove_ima_kexec_buffer() and delete_fdt_mem_rsv() in kernel. > A later patch in this series will use these functions to free > the allocation, if any, made by the previous kernel for ARM64. > > Define FDT_PROP_IMA_KEXEC_BUFFER for the chosen node, namely > "linux,ima-kexec-buffer", that is added to the DTB to hold > the address and the size of the memory reserved to carry > the IMA measurement log. > Co-developed-by: Prakhar Srivastava > Signed-off-by: Prakhar Srivastava > Signed-off-by: Lakshmi Ramasubramanian > Reported-by: kernel test robot error: implicit > declaration of function 'delete_fdt_mem_rsv' > [-Werror,-Wimplicit-function-declaration] Much better! This version limits unnecessarily changing the existing code to adding a couple of debugging statements, but that looks to be about it. >>> Yes Mimi - that's correct. >>> Based on Chester Lin's "ima_arch" support for arm64 discussion, the IMA generic EFI support will be defined in ima/ima-efi.c. 
Similarly, I think it would make sense to put the generic device tree support in ima/ima_kexec_fdt.c or ima/ima_fdt.c, as opposed to kernel/. (Refer to my comments on 2/4 about the new file named ima_kexec_fdt.c.) >>> >>> The functions remove_ima_kexec_buffer() and delete_fdt_mem_rsv(), which >>> are defined in kernel/ima_kexec.c and kernel/kexec_file_fdt.c >>> respectively, are needed even when CONFIG_IMA is not defined. These >>> functions need to be called by the current kernel to free the ima kexec >>> buffer resources allocated by the previous kernel. This is the reason, >>> these functions are defined under "kernel" instead of >>> "security/integrity/ima". >>> >>> If there is a better location to move the above C files, please let me >>> know. I'll move them. >> Freeing the previous kernel measurement list is currently called from >> ima_load_kexec_buffer(), only after the measurement list has been >> restored. The only other time the memory is freed is when the >> allocated memory size isn't sufficient to hold the measurement list, >> which could happen if there is a delay between loading and executing >> the kexec. >> > > There are two "free" operations we need to perform with respect to ima buffer > on > kexec: > > 1, The ima_free_kexec_buffer() called from ima_load_kexec_buffer() - the one > you > have stated above. > > Here we remove the "ima buffer" node from the "OF" tree and free the memory > pages that were allocated for the measurement list. > > This one is already present in ima and there's no change in that in my > patches. > > 2, The other one is remove_ima_kexec_buffer() called from setup_ima_buffer() > defined in "arch/powerpc/kexec/ima.c" > > This function removes the "ima buffer" node from the "FDT" and also frees the > physical memory reserved for the "ima measurement list" by the previous > kernel. 
> > This "free" operation needs to be performed even if the current kernel does > not > support IMA kexec since the previous kernel could have passed the IMA > measurement list (in FDT and reserved physical memory). > > For this reason, remove_ima_kexec_buffer() cannot be defined in "ima" but some > other place which will be built even if ima is not enabled. I chose to define > this function in "kernel" since that is guaranteed to be always built. > > thanks, > -lakshmi That is true. I believe a more fitting place for these functions is drivers/of/fdt.c rather than these new files in kernel/. Both CONFIG_PPC and CONFIG_ARM64 select CONFIG_OF and CONFIG_OF_FLATTREE (indirectly, via CONFIG_OF_EARLY_FLATTREE) so they will both build that file. -- Thiago Jung Bauermann IBM Linux Technology Center
Re: [PATCH v2] mm,thp,shmem: limit shmem THP alloc gfp_mask
On Thu, 2020-10-22 at 19:54 -0700, Hugh Dickins wrote: > On Thu, 22 Oct 2020, Rik van Riel wrote: > > > The allocation flags of anonymous transparent huge pages can be > controlled > > through the files in /sys/kernel/mm/transparent_hugepage/defrag, > which can > > help the system from getting bogged down in the page reclaim and > compaction > > code when many THPs are getting allocated simultaneously. > > > > However, the gfp_mask for shmem THP allocations were not limited by > those > > configuration settings, and some workloads ended up with all CPUs > stuck > > on the LRU lock in the page reclaim code, trying to allocate dozens > of > > THPs simultaneously. > > > > This patch applies the same configurated limitation of THPs to > shmem > > hugepage allocations, to prevent that from happening. > > > > This way a THP defrag setting of "never" or "defer+madvise" will > result > > in quick allocation failures without direct reclaim when no 2MB > free > > pages are available. > > > > Signed-off-by: Rik van Riel > > NAK in its present untested form: see below. Oops. That issue is easy to fix, but indeed lets figure out what the desired behavior is. > I'm open to change here, particularly to Yu Xu's point (in other > mail) > about direct reclaim - we avoid that here in Google too: though it's > not so much to avoid the direct reclaim, as to avoid the latencies of > direct compaction, which __GFP_DIRECT_RECLAIM allows as a side- > effect. > > > @@ -1887,7 +1888,8 @@ static int shmem_getpage_gfp(struct inode > *inode, pgoff_t index, > > } > > > > alloc_huge: > > - page = shmem_alloc_and_acct_page(gfp, inode, index, true); > > + huge_gfp = alloc_hugepage_direct_gfpmask(vma); > > Still looks nice: but what about the crash when vma is NULL? 
That's a one line fix, but I suppose we should get the discussion on what the code behavior should be out of the way first :) > Michal is right to remember pushback before, because tmpfs is a > filesystem, and "huge=" is a mount option: in using a huge=always > filesystem, the user has already declared a preference for huge > pages. > Whereas the original anon THP had to deduce that preference from sys > tunables and vma madvice. ... > But it's likely that they have accumulated some defrag wisdom, which > tmpfs can take on board - but please accept that in using a huge > mount, > the preference for huge has already been expressed, so I don't expect > anon THP alloc_hugepage_direct_gfpmask() choices will map one to one. In my mind, the huge= mount options for tmpfs corresponded to the "enabled" anon THP options, denoting a desired end state, not necessarily how much we will stall allocations to get there immediately. The underlying allocation behavior has been changed repeatedly, with changes to the direct reclaim code and the compaction deferral code. The shmem THP gfp_mask never tried really hard anyway, with __GFP_NORETRY being the default, which matches what is used for non-VM_HUGEPAGE anon VMAs. Likewise, the direct reclaim done from the opportunistic THP allocations done by the shmem code limited itself to reclaiming 32 4kB pages per THP allocation. In other words, mounting with huge=always has never behaved the same as the more aggressive allocations done for MADV_HUGEPAGE VMAs. This patch would leave shmem THP allocations for non-MADV_HUGEPAGE mapped files opportunistic like today, and make shmem THP allocations for files mapped with MADV_HUGEPAGE more aggressive than today. However, I would like to know what people think the shmem huge= mount options should do, and how things should behave when memory gets low, before pushing in a patch just because it makes the system run smoother "without changing current behavior too much". 
What do people want tmpfs THP allocations to do? -- All Rights Reversed.
[PATCH] net: ucc_geth: Drop extraneous parentheses in comparison
Clang warns about the extra parentheses in this comparison:

  drivers/net/ethernet/freescale/ucc_geth.c:1361:28: warning: equality comparison with extraneous parentheses
          if ((ugeth->phy_interface == PHY_INTERFACE_MODE_SGMII))
  ~^~~

It seems clear the intent here is to do a comparison not an assignment, so drop the extra parentheses to avoid any confusion.

Signed-off-by: Michael Ellerman
---
 drivers/net/ethernet/freescale/ucc_geth.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/freescale/ucc_geth.c b/drivers/net/ethernet/freescale/ucc_geth.c
index db791f60b884..d8ad478a0a13 100644
--- a/drivers/net/ethernet/freescale/ucc_geth.c
+++ b/drivers/net/ethernet/freescale/ucc_geth.c
@@ -1358,7 +1358,7 @@ static int adjust_enet_interface(struct ucc_geth_private *ugeth)
 	    (ugeth->phy_interface == PHY_INTERFACE_MODE_RTBI)) {
 		upsmr |= UCC_GETH_UPSMR_TBIM;
 	}
-	if ((ugeth->phy_interface == PHY_INTERFACE_MODE_SGMII))
+	if (ugeth->phy_interface == PHY_INTERFACE_MODE_SGMII)
 		upsmr |= UCC_GETH_UPSMR_SGMM;

 	out_be32(_regs->upsmr, upsmr);
--
2.25.1
[PATCH 3/3] md: pad writes to end of bitmap to physical blocks
Writes of the last page of the bitmap are padded out to the next logical block boundary. However, they are not padded out to the next physical block boundary, so the writes may be less than a physical block. On a "512e" disk (logical block 512 bytes, physical block 4k), if the last page of the bitmap holds no more than 3584 bytes, writes of the last bitmap page hit the 512-byte emulation.

Respect the physical block boundary as long as the resulting write doesn't run into other data, and is no longer than a page. (If the physical block size is larger than a page, no bitmap write will respect the physical block boundaries.)

Signed-off-by: Christopher Unkel
---
 drivers/md/md-bitmap.c | 8
 1 file changed, 8 insertions(+)

diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
index 600b89d5a3ad..21af5f94d495 100644
--- a/drivers/md/md-bitmap.c
+++ b/drivers/md/md-bitmap.c
@@ -264,10 +264,18 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)

 		if (page->index == store->file_pages-1) {
 			int last_page_size = store->bytes & (PAGE_SIZE-1);
+			int pb_aligned_size;
 			if (last_page_size == 0)
 				last_page_size = PAGE_SIZE;
 			size = roundup(last_page_size,
 				       bdev_logical_block_size(bdev));
+			pb_aligned_size = roundup(last_page_size,
+						  bdev_physical_block_size(bdev));
+			if (pb_aligned_size > size
+			    && pb_aligned_size <= PAGE_SIZE
+			    && sb_write_alignment_ok(mddev, rdev, page, offset,
+						     pb_aligned_size))
+				size = pb_aligned_size;
 		}
 		/* Just make sure we aren't corrupting data or
 		 * metadata
--
2.17.1
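
The size selection described in this patch can be illustrated with a small standalone sketch. It is not kernel code: `round_up_to`, `last_page_write_size` and `PAGE_SIZE_BYTES` are hypothetical names, a 4096-byte page is assumed, and the `room_ok` flag merely stands in for the result of `sb_write_alignment_ok()`.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE_BYTES 4096u /* assumption: 4k pages */

/* Round n up to the next multiple of `multiple` (any non-zero value). */
static uint32_t round_up_to(uint32_t n, uint32_t multiple)
{
    assert(multiple != 0);
    return ((n + multiple - 1) / multiple) * multiple;
}

/*
 * Choose the write size for the last bitmap page. Padding to the logical
 * block size is mandatory; padding to the physical block size is taken
 * only if it is larger, still fits in one page, and room_ok is non-zero.
 * room_ok is a stand-in for the overlap check done in the real patch.
 */
static uint32_t last_page_write_size(uint32_t last_page_size,
                                     uint32_t logical_block,
                                     uint32_t physical_block,
                                     int room_ok)
{
    uint32_t size = round_up_to(last_page_size, logical_block);
    uint32_t pb_aligned = round_up_to(last_page_size, physical_block);

    if (pb_aligned > size && pb_aligned <= PAGE_SIZE_BYTES && room_ok)
        size = pb_aligned;
    return size;
}
```

With 3500 bytes in the last page on a 512e device, the sketch pads the write from 3584 to 4096 bytes when there is room, and keeps the 3584-byte write otherwise.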
[PATCH 2/3] md: factor sb write alignment check into function
Refactor in preparation for a second use of the logic. Signed-off-by: Christopher Unkel --- drivers/md/md-bitmap.c | 72 +++--- 1 file changed, 40 insertions(+), 32 deletions(-) diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c index 200c5d0f08bf..600b89d5a3ad 100644 --- a/drivers/md/md-bitmap.c +++ b/drivers/md/md-bitmap.c @@ -209,6 +209,44 @@ static struct md_rdev *next_active_rdev(struct md_rdev *rdev, struct mddev *mdde return NULL; } +static int sb_write_alignment_ok(struct mddev *mddev, struct md_rdev *rdev, +struct page *page, int offset, int size) +{ + if (mddev->external) { + /* Bitmap could be anywhere. */ + if (rdev->sb_start + offset + (page->index + * (PAGE_SIZE/512)) + > rdev->data_offset + && + rdev->sb_start + offset + < (rdev->data_offset + mddev->dev_sectors ++ (PAGE_SIZE/512))) + return 0; + } else if (offset < 0) { + /* DATA BITMAP METADATA */ + if (offset + + (long)(page->index * (PAGE_SIZE/512)) + + size/512 > 0) + /* bitmap runs in to metadata */ + return 0; + if (rdev->data_offset + mddev->dev_sectors + > rdev->sb_start + offset) + /* data runs in to bitmap */ + return 0; + } else if (rdev->sb_start < rdev->data_offset) { + /* METADATA BITMAP DATA */ + if (rdev->sb_start + + offset + + page->index*(PAGE_SIZE/512) + size/512 + > rdev->data_offset) + /* bitmap runs in to data */ + return 0; + } else { + /* DATA METADATA BITMAP - no problems */ + } + return 1; +} + static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait) { struct md_rdev *rdev; @@ -234,38 +272,8 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait) /* Just make sure we aren't corrupting data or * metadata */ - if (mddev->external) { - /* Bitmap could be anywhere. 
*/ - if (rdev->sb_start + offset + (page->index - * (PAGE_SIZE/512)) - > rdev->data_offset - && - rdev->sb_start + offset - < (rdev->data_offset + mddev->dev_sectors -+ (PAGE_SIZE/512))) - goto bad_alignment; - } else if (offset < 0) { - /* DATA BITMAP METADATA */ - if (offset - + (long)(page->index * (PAGE_SIZE/512)) - + size/512 > 0) - /* bitmap runs in to metadata */ - goto bad_alignment; - if (rdev->data_offset + mddev->dev_sectors - > rdev->sb_start + offset) - /* data runs in to bitmap */ - goto bad_alignment; - } else if (rdev->sb_start < rdev->data_offset) { - /* METADATA BITMAP DATA */ - if (rdev->sb_start - + offset - + page->index*(PAGE_SIZE/512) + size/512 - > rdev->data_offset) - /* bitmap runs in to data */ - goto bad_alignment; - } else { - /* DATA METADATA BITMAP - no problems */ - } + if (!sb_write_alignment_ok(mddev, rdev, page, offset, size)) + goto bad_alignment; md_super_write(mddev, rdev, rdev->sb_start + offset + page->index * (PAGE_SIZE/512), -- 2.17.1
[PATCH 0/3] mdraid sb and bitmap write alignment on 512e drives
Hello all,

While investigating some performance issues on mdraid 10 volumes formed with "512e" disks (4k native/physical sector size but with 512 byte sector emulation), I've found two cases where mdraid will needlessly issue writes that start on a 4k byte boundary, but are shorter than 4k:

1. writes of the raid superblock; and
2. writes of the last page of the write-intent bitmap.

The following is an excerpt of a blocktrace of one of the component members of a mdraid 10 volume during a 4k write near the end of the array:

  8,32 112 0.01687 711 D WS 2064 + 8 [kworker/11:1H]
* 8,32 115 0.001454119 711 D WS 2056 + 1 [kworker/11:1H]
* 8,32 118 0.002847204 711 D WS 2080 + 7 [kworker/11:1H]
  8,32 11 11 0.003700545 3094 D WS 11721043920 + 8 [md127_raid1]
  8,32 11 14 0.308785692 711 D WS 2064 + 8 [kworker/11:1H]
* 8,32 11 17 0.310201697 711 D WS 2056 + 1 [kworker/11:1H]
  8,32 11 20 5.500799245 711 D WS 2064 + 8 [kworker/11:1H]
* 8,32 11 2315.740923558 711 D WS 2080 + 7 [kworker/11:1H]

Note the starred transactions, which each start on a 4k boundary, but are less than 4k in length, and so will use the 512-byte emulation. Sector 2056 holds the superblock, and is written as a single 512-byte write. Sector 2086 holds the bitmap bit relevant to the written sector. When it is written the active bits of the last page of the bitmap are written, starting at sector 2080, padded out to the end of the 512-byte logical sector as required. This results in a 3.5 KB (3584-byte) write, again using the 512-byte emulation.

Note that in some arrays the last page of the bitmap may be sufficiently full that they are not affected by the issue with the bitmap write.

As there can be a substantial penalty to using the 512-byte sector emulation (turning writes into read-modify-writes if the relevant sector is not in the drive's cache) I believe it makes sense to pad these writes out to a 4k boundary. The writes are already padded out for "4k native" drives, where the short access is illegal.

The following patch set changes the superblock and bitmap writes to respect the physical block size (e.g. 4k for today's 512e drives) when possible. In each case there is already logic for padding out to the underlying logical sector size. I reuse or repeat the logic for padding out to the physical sector size, but treat the padding out as optional rather than mandatory.

The corresponding block trace with these patches is:

  8,32 12 0.03410 694 D WS 2064 + 8 [kworker/1:1H]
  8,32 15 0.001368788 694 D WS 2056 + 8 [kworker/1:1H]
  8,32 18 0.002727981 694 D WS 2080 + 8 [kworker/1:1H]
  8,32 1 11 0.003533831 3063 D WS 11721043920 + 8 [md127_raid1]
  8,32 1 14 0.253952321 694 D WS 2064 + 8 [kworker/1:1H]
  8,32 1 17 0.255354215 694 D WS 2056 + 8 [kworker/1:1H]
  8,32 1 20 5.337938486 694 D WS 2064 + 8 [kworker/1:1H]
  8,32 1 2315.577963062 694 D WS 2080 + 8 [kworker/1:1H]

I do notice that the code for bitmap writes has a more sophisticated and thorough check for overlap than the code for superblock writes. (Compare write_sb_page in md-bitmap.c vs. super_1_load in md.c.) As far as I know, since the starts of the various structures have always been 4k aligned anyway, it is always safe to pad the superblock write out to 4k (as occurs on 4k native drives) but not necessarily further.

Feedback appreciated.

--Chris

Christopher Unkel (3):
  md: align superblock writes to physical blocks
  md: factor sb write alignment check into function
  md: pad writes to end of bitmap to physical blocks

 drivers/md/md-bitmap.c | 80 +-
 drivers/md/md.c| 15
 2 files changed, 63 insertions(+), 32 deletions(-)

--
2.17.1
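
As a rough illustration of which writes in the traces above fall back to the 512-byte emulation, the check can be sketched in standalone C. This is not part of the posted series; `write_is_partial_block` and `SECTOR_SIZE` are hypothetical names, and the sketch assumes trace offsets and lengths are in 512-byte sectors.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SECTOR_SIZE 512u /* assumption: trace units are 512-byte sectors */

/*
 * Return true if a write of nr_sectors starting at start_sector does not
 * cover whole physical blocks, so a 512e drive may need a read-modify-write.
 * phys_block_size is in bytes and must be a power of two >= SECTOR_SIZE.
 */
static bool write_is_partial_block(uint64_t start_sector, uint32_t nr_sectors,
                                   uint32_t phys_block_size)
{
    uint64_t mask, start, end;

    assert(phys_block_size >= SECTOR_SIZE &&
           (phys_block_size & (phys_block_size - 1)) == 0);

    mask = (uint64_t)phys_block_size - 1;
    start = start_sector * SECTOR_SIZE;
    end = start + (uint64_t)nr_sectors * SECTOR_SIZE;

    /* Either end of the write landing inside a physical block is partial. */
    return (start & mask) != 0 || (end & mask) != 0;
}
```

For a 4096-byte physical block, `2056 + 1` and `2080 + 7` are reported as partial, while `2064 + 8`, `2056 + 8` and `2080 + 8` are not.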
[PATCH 1/3] md: align superblock writes to physical blocks
Writes of the md superblock are aligned to the logical blocks of the containing device, but no attempt is made to align them to physical block boundaries. This means that on a "512e" device (4k physical, 512 logical) every superblock update hits the 512-byte emulation and the possible associated performance penalty. Respect the physical block alignment when possible. Signed-off-by: Christopher Unkel --- drivers/md/md.c | 15 +++ 1 file changed, 15 insertions(+) diff --git a/drivers/md/md.c b/drivers/md/md.c index 98bac4f304ae..2b42850acfb3 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -1732,6 +1732,21 @@ static int super_1_load(struct md_rdev *rdev, struct md_rdev *refdev, int minor_ && rdev->new_data_offset < sb_start + (rdev->sb_size/512)) return -EINVAL; + /* Respect physical block size if feasible. */ + bmask = queue_physical_block_size(rdev->bdev->bd_disk->queue)-1; + if (!((rdev->sb_start * 512) & bmask) && (rdev->sb_size & bmask)) { + int candidate_size = (rdev->sb_size | bmask) + 1; + + if (minor_version) { + int sectors = candidate_size / 512; + + if (rdev->data_offset >= sb_start + sectors + && rdev->new_data_offset >= sb_start + sectors) + rdev->sb_size = candidate_size; + } else if (bmask <= 4095) + rdev->sb_size = candidate_size; + } + if (sb->level == cpu_to_le32(LEVEL_MULTIPATH)) rdev->desc_nr = -1; else -- 2.17.1
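
The `(rdev->sb_size | bmask) + 1` expression in this patch rounds the superblock size up to the physical block size. A minimal standalone sketch of just that arithmetic follows; `pad_to_physical_block` is a hypothetical name, and the sketch deliberately leaves out the patch's other conditions (the `sb_start` alignment test and the check that the padded superblock stays clear of `data_offset` and `new_data_offset`).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Round sb_size (bytes) up to the next multiple of phys_block_size using
 * the bitmask form (size | bmask) + 1. A size that is already a multiple
 * is returned unchanged, mirroring the (sb_size & bmask) test that guards
 * the padding. phys_block_size must be a power of two.
 */
static uint32_t pad_to_physical_block(uint32_t sb_size, uint32_t phys_block_size)
{
    uint32_t bmask;

    assert(phys_block_size != 0 &&
           (phys_block_size & (phys_block_size - 1)) == 0);

    bmask = phys_block_size - 1;
    if ((sb_size & bmask) == 0)
        return sb_size; /* already aligned, nothing to pad */
    return (sb_size | bmask) + 1;
}
```

On a 512e device this turns a 512-byte superblock write into a 4096-byte one, while on a device whose physical and logical block sizes are both 512 bytes the size is left alone.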
[PATCH 3/3] net: better handling for network busy poll
Add the new functions prepare_to_busy_poll() and friends to napi_busy_loop(). The busy polling cpu will be considered an idle target during wake up balancing. Suggested-by: Xi Wang Signed-off-by: Josh Don Signed-off-by: Xi Wang --- net/core/dev.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/net/core/dev.c b/net/core/dev.c index 266073e300b5..4fb4ae4b27fc 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -6476,7 +6476,7 @@ void napi_busy_loop(unsigned int napi_id, if (!napi) goto out; - preempt_disable(); + prepare_to_busy_poll(); /* disables preemption */ for (;;) { int work = 0; @@ -6509,10 +6509,10 @@ void napi_busy_loop(unsigned int napi_id, if (!loop_end || loop_end(loop_end_arg, start_time)) break; - if (unlikely(need_resched())) { + if (unlikely(!continue_busy_poll())) { if (napi_poll) busy_poll_stop(napi, have_poll_lock); - preempt_enable(); + end_busy_poll(true); rcu_read_unlock(); cond_resched(); if (loop_end(loop_end_arg, start_time)) @@ -6523,7 +6523,7 @@ void napi_busy_loop(unsigned int napi_id, } if (napi_poll) busy_poll_stop(napi, have_poll_lock); - preempt_enable(); + end_busy_poll(true); out: rcu_read_unlock(); } -- 2.29.0.rc1.297.gfa9743e501-goog
[PATCH 2/3] kvm: better handling for kvm halt polling
Add the new functions prepare_to_busy_poll() and friends to kvm_vcpu_block. The busy polling cpu will be considered an idle target during wake up balancing. cpu_relax is also added to the polling loop to improve the performance of other hw threads sharing the busy polling core. Suggested-by: Xi Wang Signed-off-by: Josh Don Signed-off-by: Xi Wang --- virt/kvm/kvm_main.c | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index cf88233b819a..8f818f0fc979 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -2772,7 +2772,9 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu) ktime_t stop = ktime_add_ns(ktime_get(), vcpu->halt_poll_ns); ++vcpu->stat.halt_attempted_poll; + prepare_to_busy_poll(); /* also disables preemption */ do { + cpu_relax(); /* * This sets KVM_REQ_UNHALT if an interrupt * arrives. @@ -2781,10 +2783,12 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu) ++vcpu->stat.halt_successful_poll; if (!vcpu_valid_wakeup(vcpu)) ++vcpu->stat.halt_poll_invalid; + end_busy_poll(false); goto out; } poll_end = cur = ktime_get(); - } while (single_task_running() && ktime_before(cur, stop)); + } while (continue_busy_poll() && ktime_before(cur, stop)); + end_busy_poll(false); } prepare_to_rcuwait(&vcpu->wait); -- 2.29.0.rc1.297.gfa9743e501-goog
[PATCH 1/3] sched: better handling for busy polling loops
Busy polling loops in the kernel such as network socket poll and kvm halt polling have performance problems related to process scheduler load accounting. Both of the busy polling examples are opportunistic - they relinquish the cpu if another thread is ready to run. This design, however, doesn't extend to multiprocessor load balancing very well. The scheduler still sees the busy polling cpu as 100% busy and will be less likely to put another thread on that cpu. In other words, if all cores are 100% utilized and some of them are running real workloads and some others are running busy polling loops, newly woken up threads will not prefer the busy polling cpus. System wide throughput and latency may suffer. This change allows the scheduler to detect busy polling cpus in order to allow them to be more frequently considered for wake up balancing. This change also disables preemption for the duration of the busy polling loop. This is important, as it ensures that if a polling thread decides to end its poll to relinquish cpu to another thread, the polling thread will actually exit the busy loop and potentially block. When it later becomes runnable, it will have the opportunity to find an idle cpu via wakeup cpu selection. 
Suggested-by: Xi Wang Signed-off-by: Josh Don Signed-off-by: Xi Wang --- include/linux/sched.h | 5 +++ kernel/sched/core.c | 94 +++ kernel/sched/fair.c | 25 kernel/sched/sched.h | 2 + 4 files changed, 119 insertions(+), 7 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index afe01e232935..80ef477e5a87 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1651,6 +1651,7 @@ extern int can_nice(const struct task_struct *p, const int nice); extern int task_curr(const struct task_struct *p); extern int idle_cpu(int cpu); extern int available_idle_cpu(int cpu); +extern int polling_cpu(int cpu); extern int sched_setscheduler(struct task_struct *, int, const struct sched_param *); extern int sched_setscheduler_nocheck(struct task_struct *, int, const struct sched_param *); extern void sched_set_fifo(struct task_struct *p); @@ -2048,4 +2049,8 @@ int sched_trace_rq_nr_running(struct rq *rq); const struct cpumask *sched_trace_rd_span(struct root_domain *rd); +extern void prepare_to_busy_poll(void); +extern int continue_busy_poll(void); +extern void end_busy_poll(bool allow_resched); + #endif diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 2d95dc3f4644..2783191d0bd4 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5107,6 +5107,24 @@ int available_idle_cpu(int cpu) return 1; } +/** + * polling_cpu - is a given CPU currently running a thread in a busy polling + * loop that could be preempted if a new thread were to be scheduled? + * @cpu: the CPU in question. + * + * Return: 1 if the CPU is currently polling. 0 otherwise. + */ +int polling_cpu(int cpu) +{ +#ifdef CONFIG_SMP + struct rq *rq = cpu_rq(cpu); + + return unlikely(rq->busy_polling); +#else + return 0; +#endif +} + /** * idle_task - return the idle task for a given CPU. * @cpu: the processor in question. 
@@ -7191,6 +7209,7 @@ void __init sched_init(void) rq_csd_init(rq, &rq->nohz_csd, nohz_csd_func); #endif + rq->busy_polling = 0; #endif /* CONFIG_SMP */ hrtick_rq_init(rq); atomic_set(&rq->nr_iowait, 0); @@ -7417,6 +7436,81 @@ void ia64_set_curr_task(int cpu, struct task_struct *p) #endif +/* + * Calling this function before entering a preemptible busy polling loop will + * help the scheduler make better load balancing decisions. Wake up balance + * will treat the polling cpu as idle. + * + * Preemption is disabled inside this function and re-enabled in + * end_busy_poll(), thus the polling loop must periodically check + * continue_busy_poll(). + * + * REQUIRES: prepare_to_busy_poll(), continue_busy_poll(), and end_busy_poll() + * must be used together. + */ +void prepare_to_busy_poll(void) +{ + struct rq __maybe_unused *rq = this_rq(); + unsigned long __maybe_unused flags; + + /* Preemption will be reenabled by end_busy_poll() */ + preempt_disable(); + +#ifdef CONFIG_SMP + raw_spin_lock_irqsave(&rq->lock, flags); + /* preemption disabled; only one thread can poll at a time */ + WARN_ON_ONCE(rq->busy_polling); + rq->busy_polling++; + raw_spin_unlock_irqrestore(&rq->lock, flags); +#endif +} +EXPORT_SYMBOL(prepare_to_busy_poll); + +int continue_busy_poll(void) +{ + if (!single_task_running()) + return 0; + + /* Important that we check this, since preemption is disabled */ + if (need_resched()) + return 0; + + return 1; +} +EXPORT_SYMBOL(continue_busy_poll); + +/* + * Restore any state modified by prepare_to_busy_poll(), including re-enabling + * preemption. + * + * @allow_resched: If true, this potentially
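The intended calling pattern for the three helpers can be modeled in user space roughly as follows. This is only a sketch under stated assumptions: the globals stand in for kernel state (the preempt count, rq->busy_polling, need_resched(), the runqueue length), poll_for_event() and ready_after_n_polls() are hypothetical names, and the allow_resched argument is accepted but ignored because its semantics are cut off in the text above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space stand-ins for the kernel state the patch touches. */
static int preempt_count;	/* models preempt_disable()/preempt_enable() */
static int busy_polling;	/* models rq->busy_polling */
static bool resched_needed;	/* models need_resched() */
static int nr_running = 1;	/* models what single_task_running() inspects */
static int polls_left;		/* drives the example event source below */

static void prepare_to_busy_poll(void)
{
	preempt_count++;		/* "preemption disabled" */
	assert(busy_polling == 0);	/* only one poller per CPU at a time */
	busy_polling++;
}

static int continue_busy_poll(void)
{
	if (nr_running != 1)
		return 0;
	/* Must be checked explicitly, since preemption is off. */
	if (resched_needed)
		return 0;
	return 1;
}

static void end_busy_poll(bool allow_resched)
{
	(void)allow_resched;	/* semantics not modeled in this sketch */
	busy_polling--;
	preempt_count--;	/* "preemption re-enabled" */
}

/* Example event source: reports ready on the polls_left-th call. */
static bool ready_after_n_polls(void)
{
	return --polls_left <= 0;
}

/* Poll until the event fires, the budget runs out, or the CPU is wanted. */
static bool poll_for_event(bool (*event_ready)(void), int budget)
{
	bool found = false;

	prepare_to_busy_poll();
	while (budget-- > 0 && continue_busy_poll()) {
		if (event_ready()) {
			found = true;
			break;
		}
	}
	end_busy_poll(false);
	return found;
}
```

The point of the shape is that every exit from the loop goes through end_busy_poll(), so the "polling" mark and the preemption state are always restored together.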
Re: [PATCH] serial: pmac_zilog: don't init if zilog is not available
On Thu, 22 Oct 2020, Geert Uytterhoeven wrote:

> Thanks for your patch...

You're welcome.

> I can't say I'm a fan of this...

Sorry.

> The real issue is this "extern struct platform_device scc_a_pdev,
> scc_b_pdev", circumventing the driver framework.
>
> Can we get rid of that?

Is there a better alternative? pmz_probe() is called by console_initcall(pmz_console_init) when CONFIG_SERIAL_PMACZILOG_CONSOLE=y because this has to happen earlier than the normal platform bus probing which takes place later as a typical module_initcall.
Re: [PATCH v2 2/6] crypto: lib/sha256 - Don't clear temporary variables
On Wed, Oct 21, 2020 at 09:58:50PM -0700, Eric Biggers wrote: > On Tue, Oct 20, 2020 at 04:39:53PM -0400, Arvind Sankar wrote: > > The assignments to clear a through h and t1/t2 are optimized out by the > > compiler because they are unused after the assignments. > > > > These variables shouldn't be very sensitive: t1/t2 can be calculated > > from a through h, so they don't reveal any additional information. > > Knowing a through h is equivalent to knowing one 64-byte block's SHA256 > > hash (with non-standard initial value) which, assuming SHA256 is secure, > > doesn't reveal any information about the input. > > > > Signed-off-by: Arvind Sankar > > I don't entirely buy the second paragraph. It could be the case that the > input > is less than or equal to one SHA-256 block (64 bytes), in which case leaking > 'a' through 'h' would reveal the final SHA-256 hash if the input length is > known. And note that callers might consider either the input, the resulting > hash, or both to be sensitive information -- it depends. The "non-standard initial value" was just parenthetical -- my thinking was that revealing the hash, whether the real SHA hash or an intermediate one starting at some other initial value, shouldn't reveal the input; not that you get any additional security from being an intermediate block. But if the hash itself could be sensitive, yeah then a-h are sensitive anyway. > > > --- > > lib/crypto/sha256.c | 1 - > > 1 file changed, 1 deletion(-) > > > > diff --git a/lib/crypto/sha256.c b/lib/crypto/sha256.c > > index d43bc39ab05e..099cd11f83c1 100644 > > --- a/lib/crypto/sha256.c > > +++ b/lib/crypto/sha256.c > > @@ -202,7 +202,6 @@ static void sha256_transform(u32 *state, const u8 > > *input) > > state[4] += e; state[5] += f; state[6] += g; state[7] += h; > > > > /* clear any sensitive info... */ > > - a = b = c = d = e = f = g = h = t1 = t2 = 0; > > memzero_explicit(W, 64 * sizeof(u32)); > > } > > Your change itself is fine, though. 
As you mentioned, these assignments get > optimized out, so they weren't accomplishing anything. > > The fact is, there just isn't any way to guarantee in C code that all > sensitive > variables get cleared. > > So we shouldn't (and generally don't) bother trying to clear individual u32's, > ints, etc. like this, but rather only structs and arrays, as clearing those is > more likely to work as intended. > > - Eric Ok, I'll just drop the second paragraph from the commit message then.
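For illustration, one common way to keep a buffer wipe from being optimized away in user-space C is a memset() followed by a compiler barrier, similar in spirit to what memzero_explicit() does for the W array in the hunk above. A minimal sketch, assuming GCC or Clang (the inline asm is a compiler extension); zero_explicit() and is_all_zero() are names invented here, and nothing comparable exists for scalars the compiler keeps in registers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Zero a buffer and then tell the compiler that the memory may be observed
 * afterwards, so the memset() cannot be treated as a dead store.
 */
static void zero_explicit(void *buf, size_t len)
{
	assert(buf != NULL);
	memset(buf, 0, len);
	/* Empty asm with a "memory" clobber acts as a compiler barrier. */
	__asm__ __volatile__("" : : "r"(buf) : "memory");
}

/* Helper for checking the result in tests. */
static bool is_all_zero(const unsigned char *buf, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		if (buf[i] != 0)
			return false;
	return true;
}
```

This only helps for objects that live in memory, such as arrays and structs, which matches the point made above about not bothering with individual u32 locals.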
Re: [PATCH v2 4/6] crypto: lib/sha256 - Unroll SHA256 loop 8 times instead of 64
On Thu, Oct 22, 2020 at 11:12:36PM -0400, Arvind Sankar wrote:
>
> I was aiming for 8 columns per line to match all the other groupings by
> eight. It does slightly exceed 100 columns but can this be an exception,
> or should I maybe make it 4 columns per line?

Please limit it to 4 columns.

Thanks,
--
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH v3 7/9] KVM: VMX: Add guest physical address check in EPT violation and misconfig
On Wed, Oct 14, 2020 at 04:44:57PM -0700, Jim Mattson wrote: > On Fri, Oct 9, 2020 at 9:17 AM Jim Mattson wrote: > > > > On Fri, Jul 10, 2020 at 8:48 AM Mohammed Gamal wrote: > > > @@ -5308,6 +5314,18 @@ static int handle_ept_violation(struct kvm_vcpu > > > *vcpu) > > >PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK; > > > > > > vcpu->arch.exit_qualification = exit_qualification; > > > + > > > + /* > > > +* Check that the GPA doesn't exceed physical memory limits, as > > > that is > > > +* a guest page fault. We have to emulate the instruction here, > > > because > > > +* if the illegal address is that of a paging structure, then > > > +* EPT_VIOLATION_ACC_WRITE bit is set. Alternatively, if > > > supported we > > > +* would also use advanced VM-exit information for EPT violations > > > to > > > +* reconstruct the page fault error code. > > > +*/ > > > + if (unlikely(kvm_mmu_is_illegal_gpa(vcpu, gpa))) > > > + return kvm_emulate_instruction(vcpu, 0); > > > + > > > > Is kvm's in-kernel emulator up to the task? What if the instruction in > > question is AVX-512, or one of the myriad instructions that the > > in-kernel emulator can't handle? Ice Lake must support the advanced > > VM-exit information for EPT violations, so that would seem like a > > better choice. > > > Anyone? Using "advanced info" if it's supported seems like the way to go. Outright requiring it is probably overkill; if userspace wants to risk having to kill a (likely broken) guest, so be it.
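For readers following along, the "GPA exceeds physical memory limits" condition from the quoted comment can be pictured as a simple bit test. This is a user-space sketch of what such a check presumably looks like, not the actual kvm_mmu_is_illegal_gpa() implementation; the function name and the explicit maxphyaddr parameter are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * A guest physical address is treated as illegal when any bit at or above
 * the guest's MAXPHYADDR is set.
 */
static bool gpa_exceeds_maxphyaddr(uint64_t gpa, unsigned int maxphyaddr)
{
	/* Shifting a 64-bit value by 64 or more would be undefined. */
	assert(maxphyaddr > 0 && maxphyaddr < 64);
	return (gpa >> maxphyaddr) != 0;
}
```

With a guest MAXPHYADDR of 36, for example, address 1 << 36 is the first one the check would reject.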
Re: [PATCH v2 4/6] crypto: lib/sha256 - Unroll SHA256 loop 8 times instead of 64
On Wed, Oct 21, 2020 at 10:02:19PM -0700, Eric Biggers wrote: > On Tue, Oct 20, 2020 at 04:39:55PM -0400, Arvind Sankar wrote: > > This reduces code size substantially (on x86_64 with gcc-10 the size of > > sha256_update() goes from 7593 bytes to 1952 bytes including the new > > SHA256_K array), and on x86 is slightly faster than the full unroll > > (tesed on Broadwell Xeon). > > tesed => tested > > > > > Signed-off-by: Arvind Sankar > > --- > > lib/crypto/sha256.c | 166 > > 1 file changed, 30 insertions(+), 136 deletions(-) > > > > diff --git a/lib/crypto/sha256.c b/lib/crypto/sha256.c > > index c6bfeacc5b81..5efd390706c6 100644 > > --- a/lib/crypto/sha256.c > > +++ b/lib/crypto/sha256.c > > @@ -18,6 +18,17 @@ > > #include > > #include > > > > +static const u32 SHA256_K[] = { > > + 0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, > > 0x923f82a4, 0xab1c5ed5, > > + 0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, > > 0x9bdc06a7, 0xc19bf174, > > + 0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, > > 0x5cb0a9dc, 0x76f988da, > > + 0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, > > 0x06ca6351, 0x14292967, > > + 0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, > > 0x81c2c92e, 0x92722c85, > > + 0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, > > 0xf40e3585, 0x106aa070, > > + 0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, > > 0x5b9cca4f, 0x682e6ff3, > > + 0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, > > 0xbef9a3f7, 0xc67178f2, > > +}; > > Limit this to 80 columns? I was aiming for 8 columns per line to match all the other groupings by eight. It does slightly exceed 100 columns but can this be an exception, or should I maybe make it 4 columns per line? > > Otherwise this looks good. > > - Eric
Re: [LTP] mmstress[1309]: segfault at 7f3d71a36ee8 ip 00007f3d77132bdf sp 00007f3d71a36ee8 error 4 in libc-2.27.so[7f3d77058000+1aa000]
On Thu, Oct 22, 2020 at 6:36 PM Daniel Díaz wrote: > > The kernel Naresh originally referred to is here: > https://builds.tuxbuild.com/SCI7Xyjb7V2NbfQ2lbKBZw/ Thanks. And when I started looking at it, I realized that my original idea ("just look for __put_user_nocheck_X calls, there aren't so many of those") was garbage, and that I was just being stupid. Yes, the commit that broke was about __put_user(), but in order to not duplicate all the code, it re-used the regular put_user() infrastructure, and so all the normal put_user() calls are potential problem spots too if this is about the compiler interaction with KASAN and the asm changes. So it's not just a couple of special cases to look at, it's all the normal cases too. Ok, back to the drawing board, but I think reverting it is probably the right thing to do if I can't think of something smart. That said, since you see this on x86-64, where the whole ugly trick with that register asm("%"_ASM_AX) is unnecessary (because the 8-byte case is still just a single register, no %eax:%edx games needed), it would be interesting to hear if the attached patch fixes it. That would confirm that the problem really is due to some register allocation issue interaction (or, alternatively, it would tell me that there's something else going on). Linus
Re: [PATCH 1/2] fs:regfs: add register easy filesystem
Hi Viro, Although regfs is very simple and easy, I think it is interesting. Could you give some suggestions? Regards, zc On 2020/10/20 at 2:30 PM, Zou Cao wrote: The register filesystem maps registers into file dentries; it uses I/O reads to get the register value. A DTB file is used to describe the register tree. You can use it as follows: mount -t regfs -o dtb=test.dtb none /mnt test.dts: / { compatible = "hisilicon,hi6220-hikey", "hisilicon,hi6220"; #address-cells = <0x2>; #size-cells = <0x2>; model = "HiKey Development Board"; gic-v3-dist{ reg = <0x0 0x800 0x0 0x1>; GIC_CTRL { offset = <0x0>; }; GICD_TYPER { offset = <0x4>; }; }; }; It will create all register dentry files in /mnt. Signed-off-by: Zou Cao --- fs/Kconfig | 1 + fs/Makefile| 1 + fs/regfs/Kconfig | 7 + fs/regfs/Makefile | 8 ++ fs/regfs/file.c| 107 +++ fs/regfs/inode.c | 354 + fs/regfs/internal.h| 32 + fs/regfs/regfs_inode.h | 32 + fs/regfs/supper.c | 71 ++ 9 files changed, 613 insertions(+) create mode 100644 fs/regfs/Kconfig create mode 100644 fs/regfs/Makefile create mode 100644 fs/regfs/file.c create mode 100644 fs/regfs/inode.c create mode 100644 fs/regfs/internal.h create mode 100644 fs/regfs/regfs_inode.h create mode 100644 fs/regfs/supper.c diff --git a/fs/Kconfig b/fs/Kconfig index a88aa3a..d95acaf 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -324,6 +324,7 @@ endif # NETWORK_FILESYSTEMS source "fs/nls/Kconfig" source "fs/dlm/Kconfig" source "fs/unicode/Kconfig" +source "fs/regfs/Kconfig" config IO_WQ bool diff --git a/fs/Makefile b/fs/Makefile index 2ce5112..24f3878 100644 --- a/fs/Makefile +++ b/fs/Makefile @@ -136,3 +136,4 @@ obj-$(CONFIG_EFIVAR_FS) += efivarfs/ obj-$(CONFIG_EROFS_FS)+= erofs/ obj-$(CONFIG_VBOXSF_FS) += vboxsf/ obj-$(CONFIG_ZONEFS_FS) += zonefs/ +obj-$(CONFIG_REGFS_FS) += zonefs/ diff --git a/fs/regfs/Kconfig b/fs/regfs/Kconfig new file mode 100644 index 000..74ba85b --- /dev/null +++ b/fs/regfs/Kconfig @@ -0,0 +1,7 @@ +config REGFS_FS + tristate "registers filesystem support" +
depends on ARM64 + help + regfs support the read and write register of device resource by + dentry filesystem, it is more easy to support bsp debug. it also + support to printk the register val when panic diff --git a/fs/regfs/Makefile b/fs/regfs/Makefile new file mode 100644 index 000..26d5eef --- /dev/null +++ b/fs/regfs/Makefile @@ -0,0 +1,8 @@ +# SPDX-License-Identifier: GPL-2.0-only +# +#Makefile for the linux ramfs routines. +# + +obj-y += regfs.o + +regfs-objs += inode.o file.o supper.o diff --git a/fs/regfs/file.c b/fs/regfs/file.c new file mode 100644 index 000..6cd9f3d --- /dev/null +++ b/fs/regfs/file.c @@ -0,0 +1,107 @@ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "regfs_inode.h" +#include "internal.h" + +ssize_t regfs_file_write_iter(struct kiocb *iocb, struct iov_iter *from) +{ + struct file *file = iocb->ki_filp; + struct inode *inode = file->f_mapping->host; + ssize_t ret; + + inode_lock(inode); + ret = generic_write_checks(iocb, from); + if (ret > 0) + ret = __generic_file_write_iter(iocb, from); + inode_unlock(inode); + + if (ret > 0) + ret = generic_write_sync(iocb, ret); + return ret; +} + +static ssize_t regfs_file_read(struct file *file, char __user *buf, size_t len, loff_t *ppos) +{ + struct address_space *mapping = file->f_mapping; + struct regfs_inode_info *info = REGFS_I(mapping->host); + char str[64]; + unsigned long val; + + val = readl_relaxed(info->base + info->offset); + + loc_debug("name:%s base:%p val:%lx\n" + , file->f_path.dentry->d_iname + , info->base + info->offset + , val); + + snprintf(str, 64, "%lx", val); + + return simple_read_from_buffer(buf, len, ppos, str, strlen(str)); +} + +static ssize_t regfs_file_write(struct file *file, const char __user *buf, size_t len, loff_t *ppos) +{ + struct address_space *mapping = file->f_mapping; + struct regfs_inode_info *info = 
REGFS_I(mapping->host); + char str[67]; + unsigned long val = 0; + loff_t pos = *ppos; + size_t res; + + if (pos < 0) + return -EINVAL; + if (pos >= len || len > 66) +
Re: [PATCH v2] mm,thp,shmem: limit shmem THP alloc gfp_mask
On Thu, 22 Oct 2020, Rik van Riel wrote: > The allocation flags of anonymous transparent huge pages can be controlled > through the files in /sys/kernel/mm/transparent_hugepage/defrag, which can > help the system from getting bogged down in the page reclaim and compaction > code when many THPs are getting allocated simultaneously. > > However, the gfp_mask for shmem THP allocations were not limited by those > configuration settings, and some workloads ended up with all CPUs stuck > on the LRU lock in the page reclaim code, trying to allocate dozens of > THPs simultaneously. > > This patch applies the same configurated limitation of THPs to shmem > hugepage allocations, to prevent that from happening. > > This way a THP defrag setting of "never" or "defer+madvise" will result > in quick allocation failures without direct reclaim when no 2MB free > pages are available. > > Signed-off-by: Rik van Riel NAK in its present untested form: see below. I'm open to change here, particularly to Yu Xu's point (in other mail) about direct reclaim - we avoid that here in Google too: though it's not so much to avoid the direct reclaim, as to avoid the latencies of direct compaction, which __GFP_DIRECT_RECLAIM allows as a side-effect. 
> --- > v2: move gfp calculation to shmem_getpage_gfp as suggested by Yu Xu > > diff --git a/include/linux/gfp.h b/include/linux/gfp.h > index c603237e006c..0a5b164a26d9 100644 > --- a/include/linux/gfp.h > +++ b/include/linux/gfp.h > @@ -614,6 +614,8 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask); > extern void pm_restrict_gfp_mask(void); > extern void pm_restore_gfp_mask(void); > > +extern gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma); > + > #ifdef CONFIG_PM_SLEEP > extern bool pm_suspended_storage(void); > #else > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index 9474dbc150ed..9b08ce5cc387 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -649,7 +649,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct > vm_fault *vmf, > * available > * never: never stall for any thp allocation > */ > -static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma) > +gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma) > { > const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE); > > diff --git a/mm/shmem.c b/mm/shmem.c > index 537c137698f8..9710b9df91e9 100644 > --- a/mm/shmem.c > +++ b/mm/shmem.c > @@ -1545,8 +1545,8 @@ static struct page *shmem_alloc_hugepage(gfp_t gfp, > return NULL; > > shmem_pseudo_vma_init(&pvma, info, hindex); > - page = alloc_pages_vma(gfp | __GFP_COMP | __GFP_NORETRY | __GFP_NOWARN, > - HPAGE_PMD_ORDER, &pvma, 0, numa_node_id(), true); > + page = alloc_pages_vma(gfp, HPAGE_PMD_ORDER, &pvma, 0, numa_node_id(), > +true); Commendably neat so far. 
> shmem_pseudo_vma_destroy(&pvma); > if (page) > prep_transhuge_page(page); > @@ -1802,6 +1802,7 @@ static int shmem_getpage_gfp(struct inode *inode, > pgoff_t index, > struct page *page; > enum sgp_type sgp_huge = sgp; > pgoff_t hindex = index; > + gfp_t huge_gfp; > int error; > int once = 0; > int alloced = 0; > @@ -1887,7 +1888,8 @@ static int shmem_getpage_gfp(struct inode *inode, > pgoff_t index, > } > > alloc_huge: > - page = shmem_alloc_and_acct_page(gfp, inode, index, true); > + huge_gfp = alloc_hugepage_direct_gfpmask(vma); Still looks nice: but what about the crash when vma is NULL? It may work for shmem_fault() (though I'll probably disagree on the details): but tmpfs is a filesystem, so most if not all of the system calls which arrive here have no vma to offer. Michal is right to remember pushback before, because tmpfs is a filesystem, and "huge=" is a mount option: in using a huge=always filesystem, the user has already declared a preference for huge pages. Whereas the original anon THP had to deduce that preference from sys tunables and vma madvise. I certainly found it a lot easier to ignore all the shifting sandmaze of the anon THP tunables, and I think Kirill followed me on that. But it's likely that they have accumulated some defrag wisdom, which tmpfs can take on board - but please accept that in using a huge mount, the preference for huge has already been expressed, so I don't expect anon THP alloc_hugepage_direct_gfpmask() choices will map one to one. > + page = shmem_alloc_and_acct_page(huge_gfp, inode, index, true); > if (IS_ERR(page)) { > alloc_nohuge: > page = shmem_alloc_and_acct_page(gfp, inode, > Hugh
Re: [PATCH] perf trace: Segfault when trying to trace events by cgroup
Hello, On Tue, Oct 20, 2020 at 5:48 AM Stanislav Ivanichkin wrote: > > Hi, > > +linux-perf-users@ > > Gentle ping for this patch > > Many Thanks > > -- > Stanislav Ivanichkin > > > On 9 Oct 2020, at 09:45, Stanislav Ivanichkin > > wrote: > > > > # ./perf trace -e sched:sched_switch -G test -a sleep 1 > > perf: Segmentation fault > > Obtained 11 stack frames. > > ./perf(sighandler_dump_stack+0x43) [0x55cfdc636db3] > > /lib/x86_64-linux-gnu/libc.so.6(+0x3efcf) [0x7fd23eecafcf] > > ./perf(parse_cgroups+0x36) [0x55cfdc673f36] > > ./perf(+0x3186ed) [0x55cfdc70d6ed] > > ./perf(parse_options_subcommand+0x629) [0x55cfdc70e999] > > ./perf(cmd_trace+0x9c2) [0x55cfdc5ad6d2] > > ./perf(+0x1e8ae0) [0x55cfdc5ddae0] > > ./perf(+0x1e8ded) [0x55cfdc5ddded] > > ./perf(main+0x370) [0x55cfdc556f00] > > /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe6) [0x7fd23eeadb96] > > ./perf(_start+0x29) [0x55cfdc557389] > > Segmentation fault > > > > It happens because "struct trace" in option->value is passed to > > parse_cgroups function instead of "struct evlist". 
> > > > Signed-off-by: Stanislav Ivanichkin > > Reviewed-by: Dmitry Monakhov It seems we should add this too: Fixes: 9ea42ba4411ac ("perf trace: Support setting cgroups as targets") > > --- > > tools/perf/builtin-trace.c | 9 ++--- > > 1 file changed, 6 insertions(+), 3 deletions(-) > > > > diff --git a/tools/perf/builtin-trace.c b/tools/perf/builtin-trace.c > > index bea461b6f937..cbc4de6840db 100644 > > --- a/tools/perf/builtin-trace.c > > +++ b/tools/perf/builtin-trace.c > > @@ -4651,9 +4651,12 @@ static int trace__parse_cgroups(const struct option > > *opt, const char *str, int u > > { > > struct trace *trace = opt->value; > > > > - if (!list_empty(&trace->evlist->core.entries)) > > - return parse_cgroups(opt, str, unset); > > - > > + if (!list_empty(&trace->evlist->core.entries)) { > > + struct option o = OPT_CALLBACK('G', "cgroup", &trace->evlist, > > + "name", "monitor event in cgroup name only", > > + parse_cgroups); Just make it simple and clear what parse_cgroups() expects: struct option o = { .value = &trace->evlist, }; Or else, we can change parse_cgroups() to take evlist directly. But it needs to change other callsites too. Either is fine to me. Thanks Namhyung > > + return parse_cgroups(&o, str, unset); > > + } > > trace->cgroup = evlist__findnew_cgroup(trace->evlist, str); > > > > return 0; > > -- > > 2.17.1 > > >
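To make the type mismatch concrete, here is a small user-space model of the wrapper approach suggested above. The structs are cut-down, hypothetical stand-ins for perf's real types, and the assumption that parse_cgroups() reads opt->value as a pointer to a "struct evlist *" is inferred from the fix passing &trace->evlist; the *_model function names are invented for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, cut-down stand-ins for the perf types involved. */
struct evlist { const char *cgroup_name; };
struct trace { int some_state; struct evlist *evlist; };
struct option { void *value; };

/* Models parse_cgroups(): opt->value must point at a "struct evlist *". */
static int parse_cgroups_model(const struct option *opt, const char *str)
{
	struct evlist *evlist = *(struct evlist **)opt->value;

	assert(evlist != NULL);
	evlist->cgroup_name = str;
	return 0;
}

/* Models trace__parse_cgroups(): here opt->value is a "struct trace *". */
static int trace_parse_cgroups_model(const struct option *opt, const char *str)
{
	struct trace *trace = opt->value;
	/* Wrap the evlist pointer so the callee sees the type it expects. */
	struct option o = { .value = &trace->evlist };

	return parse_cgroups_model(&o, str);
}
```

Passing the outer opt straight through would make the callee reinterpret the start of struct trace as an evlist pointer, which is the kind of mix-up the report describes.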
[PATCHv2] selftests/powerpc/eeh: disable kselftest timeout setting for eeh-basic
The eeh-basic test has its own 60-second timeout (defined in commit 414f50434aa2 "selftests/eeh: Bump EEH wait time to 60s") per breakable device. We have discovered that the number of breakable devices varies across hardware, and that the device recovery time ranges from 0 to 35 seconds. In our test pool it takes about 30 seconds to run on a Power8 system with 5 breakable devices, and 60 seconds to run on a Power9 system with 4 breakable devices. Extend the timeout setting in the kselftest framework to 5 minutes to give it a chance to finish. Signed-off-by: Po-Hsu Lin --- tools/testing/selftests/powerpc/eeh/Makefile | 2 +- tools/testing/selftests/powerpc/eeh/settings | 1 + 2 files changed, 2 insertions(+), 1 deletion(-) create mode 100644 tools/testing/selftests/powerpc/eeh/settings diff --git a/tools/testing/selftests/powerpc/eeh/Makefile b/tools/testing/selftests/powerpc/eeh/Makefile index b397bab..ae963eb 100644 --- a/tools/testing/selftests/powerpc/eeh/Makefile +++ b/tools/testing/selftests/powerpc/eeh/Makefile @@ -3,7 +3,7 @@ noarg: $(MAKE) -C ../ TEST_PROGS := eeh-basic.sh -TEST_FILES := eeh-functions.sh +TEST_FILES := eeh-functions.sh settings top_srcdir = ../../../../.. include ../../lib.mk diff --git a/tools/testing/selftests/powerpc/eeh/settings b/tools/testing/selftests/powerpc/eeh/settings new file mode 100644 index 000..694d707 --- /dev/null +++ b/tools/testing/selftests/powerpc/eeh/settings @@ -0,0 +1 @@ +timeout=300 -- 2.7.4
[PATCH/RFC net v2] net: dec: tulip: de2104x: Add shutdown handler to stop NIC
The driver does not implement a shutdown handler which leads to issues when using kexec in certain scenarios. The NIC keeps on fetching descriptors which gets flagged by the IOMMU with errors like this: DMAR: DMAR:[DMA read] Request device [5e:00.0]fault addr f000 DMAR: DMAR:[DMA read] Request device [5e:00.0]fault addr f000 DMAR: DMAR:[DMA read] Request device [5e:00.0]fault addr f000 DMAR: DMAR:[DMA read] Request device [5e:00.0]fault addr f000 DMAR: DMAR:[DMA read] Request device [5e:00.0]fault addr f000 Signed-off-by: Moritz Fischer --- Changes from v1: - Replace call to de_remove_one with de_shutdown() function as suggested by James. --- drivers/net/ethernet/dec/tulip/de2104x.c | 8 1 file changed, 8 insertions(+) diff --git a/drivers/net/ethernet/dec/tulip/de2104x.c b/drivers/net/ethernet/dec/tulip/de2104x.c index f1a2da15dd0a..6de0cd6cf4ca 100644 --- a/drivers/net/ethernet/dec/tulip/de2104x.c +++ b/drivers/net/ethernet/dec/tulip/de2104x.c @@ -2180,11 +2180,19 @@ static int de_resume (struct pci_dev *pdev) #endif /* CONFIG_PM */ +static void de_shutdown(struct pci_dev *pdev) +{ + struct net_device *dev = pci_get_drvdata (pdev); + + de_close(dev); +} + static struct pci_driver de_driver = { .name = DRV_NAME, .id_table = de_pci_tbl, .probe = de_init_one, .remove = de_remove_one, + .shutdown = de_shutdown, #ifdef CONFIG_PM .suspend= de_suspend, .resume = de_resume, -- 2.28.0
RE: [PATCH] scsi: megaraid_sas: use spin_lock() in hard IRQ
On Thu, 22 Oct 2020, Tianxianting wrote:

> I see, If we add this patch, we need to get all cpu arch that support
> nested interrupts.

I was just calling into question 1. the benefit (does it improve performance?) and 2. the code style (is it less portable?).

It's really the style question that mostly interests me because I've had to code around the nested interrupt situation before, and every time it comes up it makes me wonder about the necessity.

I was not trying to veto your patch. It is not my position to do that. If Broadcom likes the patch, that's great.
Re: [PATCH] selftests/powerpc/eeh: disable kselftest timeout setting for eeh-basic
On Fri, Oct 23, 2020 at 10:07 AM Michael Ellerman wrote: > > Po-Hsu Lin writes: > > The eeh-basic test got its own 60 seconds timeout (defined in commit > > 414f50434aa2 "selftests/eeh: Bump EEH wait time to 60s") per breakable > > device. > > > > And we have discovered that the number of breakable devices varies > > on different hardware. The device recovery time ranges from 0 to 35 > > seconds. In our test pool it will take about 30 seconds to run on a > > Power8 system that with 5 breakable devices, 60 seconds to run on a > > Power9 system that with 4 breakable devices. > > > > Thus it's better to disable the default 45 seconds timeout setting in > > the kselftest framework to give it a chance to finish. And let the > > test to take care of the timeout control. > > I'd prefer if we still had some timeout, maybe 5 or 10 minutes? Just in > case the test goes completely bonkers. > OK, let's go for 5 minutes. Will send V2 later. Thanks for your suggestion! > cheers > > > diff --git a/tools/testing/selftests/powerpc/eeh/Makefile > > b/tools/testing/selftests/powerpc/eeh/Makefile > > index b397bab..ae963eb 100644 > > --- a/tools/testing/selftests/powerpc/eeh/Makefile > > +++ b/tools/testing/selftests/powerpc/eeh/Makefile > > @@ -3,7 +3,7 @@ noarg: > > $(MAKE) -C ../ > > > > TEST_PROGS := eeh-basic.sh > > -TEST_FILES := eeh-functions.sh > > +TEST_FILES := eeh-functions.sh settings > > > > top_srcdir = ../../../../.. > > include ../../lib.mk > > diff --git a/tools/testing/selftests/powerpc/eeh/settings > > b/tools/testing/selftests/powerpc/eeh/settings > > new file mode 100644 > > index 000..e7b9417 > > --- /dev/null > > +++ b/tools/testing/selftests/powerpc/eeh/settings > > @@ -0,0 +1 @@ > > +timeout=0 > > -- > > 2.7.4
[GIT PULL] ARC fix for 5.10-rc1
Hi Linus,

This is an unusual 2nd pull request for the merge window. I found a snafu in the perf driver which made it into 5.9-rc4, and thus the fix could go in now rather than wait for 5.10-rc2. Sorry for the trouble.

Thx,
-Vineet
->
The following changes since commit 6364d1b41cc382db3b03cf33c57b6007ee8f09cf:

  arc: include/asm: fix typos of "themselves" (2020-10-05 21:02:29 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc.git/ tags/arc-5.10-rc1-fixes

for you to fetch changes up to 8c42a5c02bec6c7eccf08957be3c6c8fccf9790b:

  ARC: perf: redo the pct irq missing in device-tree handling (2020-10-22 10:57:58 -0700)

Urgent perf ARC fix

Vineet Gupta (1):
      ARC: perf: redo the pct irq missing in device-tree handling

 arch/arc/kernel/perf_event.c | 27 ++-
 1 file changed, 18 insertions(+), 9 deletions(-)
linux-next: Tree for Oct 23
Hi all, Since the merge window is open, please do not add any v5.11 material to your linux-next included branches until after v5.10-rc1 has been released. Changes since 20201022: Non-merge commits (relative to Linus' tree): 1952 2322 files changed, 329767 insertions(+), 37681 deletions(-) I have created today's linux-next tree at git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git (patches at http://www.kernel.org/pub/linux/kernel/next/ ). If you are tracking the linux-next tree using git, you should not use "git pull" to do so as that will try to merge the new linux-next release with the old one. You should use "git fetch" and checkout or reset to the new master. You can see which trees have been included by looking in the Next/Trees file in the source. There are also quilt-import.log and merge.log files in the Next directory. Between each merge, the tree was built with a ppc64_defconfig for powerpc, an allmodconfig for x86_64, a multi_v7_defconfig for arm and a native build of tools/perf. After the final fixups (if any), I do an x86_64 modules_install followed by builds for x86_64 allnoconfig, powerpc allnoconfig (32 and 64 bit), ppc44x_defconfig, allyesconfig and pseries_le_defconfig and i386, sparc and sparc64 defconfig and htmldocs. And finally, a simple boot test of the powerpc pseries_le_defconfig kernel in qemu (with and without kvm enabled). Below is a summary of the state of the merge. I am currently merging 329 trees (counting Linus' and 86 trees of bug fix patches pending for the current merge release). Stats about the size of the tree over time can be seen at http://neuling.org/linux-next-size.html . Status of my local build tests will be at http://kisskb.ellerman.id.au/linux-next . If maintainers want to give advice about cross compilers/configs that work, we are always open to add more builds. Thanks to Randy Dunlap for doing many randconfig builds. And to Paul Gortmaker for triage and bug fixes. 
-- Cheers, Stephen Rothwell $ git checkout master $ git reset --hard stable Merging origin/master (f9893351acae Merge tag 'kconfig-v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild) Merging fixes/fixes (9123e3a74ec7 Linux 5.9-rc1) Merging kbuild-current/fixes (e30d694c3381 Documentation/llvm: Fix clang target examples) Merging arc-current/for-curr (6364d1b41cc3 arc: include/asm: fix typos of "themselves") Merging arm-current/fixes (9123e3a74ec7 Linux 5.9-rc1) Merging arm64-fixes/for-next/fixes (39e4716caa59 crypto: arm64: Use x16 with indirect branch to bti_c) Merging arm-soc-fixes/arm/fixes (6869f774b1cd Merge tag 'omap-for-v5.9/fixes-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap into arm/fixes) Merging uniphier-fixes/fixes (48778464bb7d Linux 5.8-rc2) Merging drivers-memory-fixes/fixes (7ff3a2a626f7 memory: jz4780_nemc: Fix an error pointer vs NULL check in probe()) Merging m68k-current/for-linus (50c5feeea0af ide/macide: Convert Mac IDE driver to platform driver) Merging powerpc-fixes/fixes (4ff753feab02 powerpc/pseries: Avoid using addr_to_pfn in real mode) Merging s390-fixes/fixes (549738f15da0 Linux 5.9-rc8) Merging sparc/master (0a95a6d1a4cd sparc: use for_each_child_of_node() macro) Merging fscrypt-current/for-stable (2b4eae95c736 fscrypt: don't evict dirty inodes after removing key) Merging net/master (18ded910b589 tcp: fix to update snd_wl1 in bulk receiver fast path) Merging bpf/master (18ded910b589 tcp: fix to update snd_wl1 in bulk receiver fast path) Merging ipsec/master (7fe94612dd4c xfrm: interface: fix the priorities for ipip and ipv6 tunnels) Merging netfilter/master (c77761c8a594 netfilter: nf_fwd_netdev: clear timestamp in forwarding path) Merging ipvs/master (48d072c4e8cd selftests: netfilter: add time counter check) Merging wireless-drivers/master (df41c19abbea drivers/net/wan/hdlc_fr: Move the skb_headroom check out of fr_hard_header) Merging mac80211/master (9ff9b0d392ea Merge tag 
'net-next-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next) Merging rdma-fixes/for-rc (a1b8638ba132 Linux 5.9-rc7) Merging sound-current/for-linus (033e4040d453 ALSA: hda - Fix the return value if cb func is already registered) Merging sound-asoc-fixes/for-linus (8101e3024d76 Merge remote-tracking branch 'asoc/for-5.10' into asoc-linus) Merging regmap-fixes/for-linus (549738f15da0 Linux 5.9-rc8) Merging regulator-fixes/for-linus (b7c11f48ff81 Merge remote-tracking branch 'regulator/for-5.10' into regulator-linus) Merging spi-fixes/for-linus (d4f3a651ab82 Merge remote-tracking branch 'spi/for-5.9' into spi-linus) Merging pci-current/for-linus (76a6b0b90d53 MAINTAINERS: Add Pali Rohár as aardvark PCI maintainer) Merging driver-core.current/driver-core-linus (270315b8235e Merge tag 'riscv-for-linus-5.10-mw0' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/
[PATCH V3 2/3] vhost: vdpa: report iova range
This patch introduces a new ioctl for the vhost-vdpa device that reports the IOVA range supported by the device. For a device that implements the get_iova_range() method, we fetch the range from the vDPA device. If the device doesn't implement get_iova_range() but depends on the platform IOMMU, we query it via DOMAIN_ATTR_GEOMETRY; otherwise [0, ULLONG_MAX] is assumed. For safety, this patch also rejects map requests that fall outside the valid range. Signed-off-by: Jason Wang --- drivers/vhost/vdpa.c | 40 include/uapi/linux/vhost.h | 4 include/uapi/linux/vhost_types.h | 9 +++ 3 files changed, 53 insertions(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index a2dbc85e0b0d..562ed99116d1 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -47,6 +47,7 @@ struct vhost_vdpa { int minor; struct eventfd_ctx *config_ctx; int in_batch; + struct vdpa_iova_range range; }; static DEFINE_IDA(vhost_vdpa_ida); @@ -337,6 +338,16 @@ static long vhost_vdpa_set_config_call(struct vhost_vdpa *v, u32 __user *argp) return 0; } +static long vhost_vdpa_get_iova_range(struct vhost_vdpa *v, u32 __user *argp) +{ + struct vhost_vdpa_iova_range range = { + .first = v->range.first, + .last = v->range.last, + }; + + return copy_to_user(argp, &range, sizeof(range)); +} + static long vhost_vdpa_vring_ioctl(struct vhost_vdpa *v, unsigned int cmd, void __user *argp) { @@ -470,6 +481,8 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, case VHOST_GET_BACKEND_FEATURES: features = VHOST_VDPA_BACKEND_FEATURES; r = copy_to_user(featurep, &features, sizeof(features)); + case VHOST_VDPA_GET_IOVA_RANGE: + r = vhost_vdpa_get_iova_range(v, argp); break; default: r = vhost_dev_ioctl(&v->vdev, cmd, argp); @@ -597,6 +610,10 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v, long pinned; int ret = 0; + if (msg->iova < v->range.first || + msg->iova + msg->size - 1 > v->range.last) + return -EINVAL; + if (vhost_iotlb_itree_first(iotlb, msg->iova, msg->iova + msg->size - 1)) return -EEXIST; @@ -783,6 +800,27 @@ static void vhost_vdpa_free_domain(struct vhost_vdpa *v) v->domain = NULL; } +static void vhost_vdpa_set_iova_range(struct vhost_vdpa *v) +{ + struct vdpa_iova_range *range = &v->range; + struct iommu_domain_geometry geo; + struct vdpa_device *vdpa = v->vdpa; + const struct vdpa_config_ops *ops = vdpa->config; + + if (ops->get_iova_range) { + *range = ops->get_iova_range(vdpa); + } else if (v->domain && + !iommu_domain_get_attr(v->domain, + DOMAIN_ATTR_GEOMETRY, &geo) && + geo.force_aperture) { + range->first = geo.aperture_start; + range->last = geo.aperture_end; + } else { + range->first = 0; + range->last = ULLONG_MAX; + } +} + static int vhost_vdpa_open(struct inode *inode, struct file *filep) { struct vhost_vdpa *v; @@ -823,6 +861,8 @@ static int vhost_vdpa_open(struct inode *inode, struct file *filep) if (r) goto err_init_iotlb; + vhost_vdpa_set_iova_range(v); + filep->private_data = v; return 0; diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h index 75232185324a..c998860d7bbc 100644 --- a/include/uapi/linux/vhost.h +++ b/include/uapi/linux/vhost.h @@ -146,4 +146,8 @@ /* Set event fd for config interrupt*/ #define VHOST_VDPA_SET_CONFIG_CALL _IOW(VHOST_VIRTIO, 0x77, int) + +/* Get the valid iova range */ +#define VHOST_VDPA_GET_IOVA_RANGE _IOR(VHOST_VIRTIO, 0x78, \ +struct vhost_vdpa_iova_range) #endif diff --git a/include/uapi/linux/vhost_types.h b/include/uapi/linux/vhost_types.h index 9a269a88a6ff..f7f6a3a28977 100644 --- a/include/uapi/linux/vhost_types.h +++ b/include/uapi/linux/vhost_types.h @@ -138,6 +138,15 @@ struct vhost_vdpa_config { __u8 buf[0]; }; +/* vhost vdpa IOVA range + * @first: First address that can be mapped by vhost-vDPA + * @last: Last address that can be mapped by vhost-vDPA + */ +struct vhost_vdpa_iova_range { + __u64 first; + __u64 last; +}; + /* Feature bits */ /* Log all write descriptors. Can be changed while device is active. */ #define VHOST_F_LOG_ALL 26 -- 2.20.1
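For readers who want to try the map-request check outside the kernel, here is a minimal userspace sketch of the same comparison. The names struct iova_range and iova_map_in_range() are hypothetical and exist only for this illustration. The zero-size and wrap-around guards are additions of the sketch, not something the patch does.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for the [first, last] pair in the patch. */
struct iova_range {
	uint64_t first;	/* first mappable address */
	uint64_t last;	/* last mappable address, inclusive */
};

/*
 * Return true if [iova, iova + size - 1] lies inside [first, last].
 * The size == 0 and wrap-around checks are assumptions of this sketch;
 * the patch only compares iova and iova + size - 1 against the range.
 */
static bool iova_map_in_range(const struct iova_range *r,
			      uint64_t iova, uint64_t size)
{
	uint64_t end;

	if (size == 0)
		return false;
	end = iova + size - 1;
	if (end < iova)		/* iova + size - 1 wrapped past UINT64_MAX */
		return false;
	return iova >= r->first && end <= r->last;
}
```

A caller would fill the range from whatever the device reports and reject a mapping before issuing it whenever the helper returns false.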
Re: [PATCH net RFC] net: Clear IFF_TX_SKB_SHARING for all Ethernet devices using skb_padto
On Thu, Oct 22, 2020 at 6:56 PM Xie He wrote: > > My patch isn't complete. Because there are so many drivers with this > problem, I feel it's hard to solve them all at once. So I only grepped > "skb_padto" under "drivers/net/ethernet". There are other drivers > under "ethernet" using "skb_pad", "skb_put_padto" or "eth_skb_pad". > There are also (fake) Ethernet drivers under "drivers/net/wireless". I > feel it'd take a long time and also be error-prone to solve them all, > so I feel it'd be the best if there are other solutions. BTW, I also see some Ethernet drivers calling skb_push to prepend strange headers to the skbs. For example, drivers/net/ethernet/mellanox/mlxsw/switchx2.c prepends a header of MLXSW_TXHDR_LEN (16). We can't send shared skbs to these drivers either because they modify the skbs. It seems to me that many drivers have always assumed that they can modify the skb whenever needed. They've never considered there might be shared skbs. I guess adding IFF_TX_SKB_SHARING to ether_setup was a bad idea. It not only made the code less clean, but also didn't agree with the actual situations of the drivers.
[PATCH V3 3/3] vdpa_sim: implement get_iova_range()
This implements a sample get_iova_range() for the simulator which advertises [0, ULLONG_MAX] as the valid range. Signed-off-by: Jason Wang --- drivers/vdpa/vdpa_sim/vdpa_sim.c | 12 1 file changed, 12 insertions(+) diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c index 62d640327145..ff6c9fd8d879 100644 --- a/drivers/vdpa/vdpa_sim/vdpa_sim.c +++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c @@ -574,6 +574,16 @@ static u32 vdpasim_get_generation(struct vdpa_device *vdpa) return vdpasim->generation; } +static struct vdpa_iova_range vdpasim_get_iova_range(struct vdpa_device *vdpa) +{ + struct vdpa_iova_range range = { + .first = 0ULL, + .last = ULLONG_MAX, + }; + + return range; +} + static int vdpasim_set_map(struct vdpa_device *vdpa, struct vhost_iotlb *iotlb) { @@ -657,6 +667,7 @@ static const struct vdpa_config_ops vdpasim_net_config_ops = { .get_config = vdpasim_get_config, .set_config = vdpasim_set_config, .get_generation = vdpasim_get_generation, + .get_iova_range = vdpasim_get_iova_range, .dma_map= vdpasim_dma_map, .dma_unmap = vdpasim_dma_unmap, .free = vdpasim_free, @@ -683,6 +694,7 @@ static const struct vdpa_config_ops vdpasim_net_batch_config_ops = { .get_config = vdpasim_get_config, .set_config = vdpasim_set_config, .get_generation = vdpasim_get_generation, + .get_iova_range = vdpasim_get_iova_range, .set_map= vdpasim_set_map, .free = vdpasim_free, }; -- 2.20.1
[PATCH V3 0/3] vDPA: API for reporting IOVA range
Hi All:

This series introduces an API for reporting the IOVA range. This is a must for userspace to work correctly:

- for a process that uses vhost-vDPA directly, the IOVA must be allocated from this range.
- for a VM (qemu), when vIOMMU is not enabled, fail early if the GPA is out of range
- for a VM (qemu), when vIOMMU is enabled, determine a valid guest address width so that the guest IOVA allocator can behave correctly.

Please review.

Changes from V2:
- silence build warnings

Changes from V1:
- do not mandate get_iova_range() for a device with its own DMA translation logic, and assume a [0, ULLONG_MAX] range
- mandate the IOVA range only for an IOMMU that forces an aperture
- forbid maps that are out of the IOVA range in vhost-vDPA

Jason Wang (3):
  vdpa: introduce config op to get valid iova range
  vhost: vdpa: report iova range
  vdpa_sim: implement get_iova_range()

 drivers/vdpa/vdpa_sim/vdpa_sim.c | 12 ++
 drivers/vhost/vdpa.c | 40
 include/linux/vdpa.h | 15
 include/uapi/linux/vhost.h | 4
 include/uapi/linux/vhost_types.h | 9 +++
 5 files changed, 80 insertions(+)

--
2.20.1
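As a rough illustration of the vIOMMU case in the cover letter, the sketch below derives an address width from a reported range. It is only one plausible way for userspace to do this, not what qemu is known to do. usable_addr_bits() is a hypothetical name, and the sketch assumes the range starts at 0.

```c
#include <stdint.h>

/*
 * Largest width w (in bits) such that every address in [0, 2^w - 1]
 * is <= last. Returns 0 when the range does not start at 0, because
 * this sketch only handles zero-based ranges (an assumption made here).
 */
static unsigned int usable_addr_bits(uint64_t first, uint64_t last)
{
	unsigned int bits = 0;

	if (first != 0)
		return 0;

	while (bits < 64) {
		/* All-ones mask covering bits + 1 address bits. */
		uint64_t mask = (bits == 63) ? UINT64_MAX
					     : (UINT64_C(1) << (bits + 1)) - 1;

		if (last < mask)
			break;
		bits++;
	}
	return bits;
}
```

With a reported last address of 0x7fffffffff this yields 39 bits. A last address that is not of the form 2^w - 1 is rounded down to the widest width that fits entirely below it.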
[PATCH V3 1/3] vdpa: introduce config op to get valid iova range
This patch introduces a config op to get the valid iova range from the vDPA device. Signed-off-by: Jason Wang --- include/linux/vdpa.h | 15 +++ 1 file changed, 15 insertions(+) diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h index eae0bfd87d91..30bc7a7223bb 100644 --- a/include/linux/vdpa.h +++ b/include/linux/vdpa.h @@ -52,6 +52,16 @@ struct vdpa_device { int nvqs; }; +/** + * vDPA IOVA range - the IOVA range support by the device + * @first: start of the IOVA range + * @last: end of the IOVA range + */ +struct vdpa_iova_range { + u64 first; + u64 last; +}; + /** * vDPA_config_ops - operations for configuring a vDPA device. * Note: vDPA device drivers are required to implement all of the @@ -151,6 +161,10 @@ struct vdpa_device { * @get_generation:Get device config generation (optional) * @vdev: vdpa device * Returns u32: device generation + * @get_iova_range:Get supported iova range (optional) + * @vdev: vdpa device + * Returns the iova range supported by + * the device. * @set_map: Set device memory mapping (optional) * Needed for device that using device * specific DMA translation (on-chip IOMMU) @@ -216,6 +230,7 @@ struct vdpa_config_ops { void (*set_config)(struct vdpa_device *vdev, unsigned int offset, const void *buf, unsigned int len); u32 (*get_generation)(struct vdpa_device *vdev); + struct vdpa_iova_range (*get_iova_range)(struct vdpa_device *vdev); /* DMA ops */ int (*set_map)(struct vdpa_device *vdev, struct vhost_iotlb *iotlb); -- 2.20.1
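Because the new op is marked optional, a caller has to cope with a NULL callback. The following is a small userspace sketch of that calling pattern. All fake_* names are hypothetical stand-ins rather than kernel APIs, and the only fallback modelled is the full [0, UINT64_MAX] range.

```c
#include <stddef.h>
#include <stdint.h>

struct iova_range {
	uint64_t first;
	uint64_t last;
};

struct fake_dev;

/* Hypothetical ops table with one optional callback. */
struct fake_config_ops {
	/* May be NULL when the device has no range of its own to report. */
	struct iova_range (*get_iova_range)(struct fake_dev *dev);
};

struct fake_dev {
	const struct fake_config_ops *ops;
};

/* Example implementation a device with a limited range might provide. */
static struct iova_range fake_limited_range(struct fake_dev *dev)
{
	struct iova_range range = { .first = 0x1000, .last = 0xffff };

	(void)dev;
	return range;
}

/* Use the device's op when present, otherwise assume the full range. */
static struct iova_range query_iova_range(struct fake_dev *dev)
{
	struct iova_range full = { .first = 0, .last = UINT64_MAX };

	if (dev->ops && dev->ops->get_iova_range)
		return dev->ops->get_iova_range(dev);
	return full;
}
```

Returning the struct by value, as the op in the patch does, keeps the callback free of output pointers and error codes.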
Re: Question on io-wq
On 10/22/20 8:05 PM, Hillf Danton wrote:
> On Thu, 22 Oct 2020 08:08:09 -0600 Jens Axboe wrote:
>> On 10/22/20 3:02 AM, Zhang,Qiang wrote:
>>>
>>> Hi Jens Axboe
>>>
>>> There are some problem in 'io_wqe_worker' thread, when the
>>> 'io_wqe_worker' be create and Setting the affinity of CPUs in NUMA
>>> nodes, due to CPU hotplug, When the last CPU going down, the
>>> 'io_wqe_worker' thread will run anywhere. when the CPU in the node goes
>>> online again, we should restore their cpu bindings?
>>
>> Something like the below should help in ensuring affinities are
>> always correct - trigger an affinity set for an online CPU event. We
>> should not need to do it for offlining. Can you test it?
>
> CPU affinity is intact because of nothing to do on offline, and scheduler
> will move the stray workers on to the correct NUMA node if any CPU goes
> online, so it's a bit hard to see what is going to be tested.

Test it yourself:

- Boot with > 1 NUMA node
- Start an io_uring, you now get 2 workers, each affinitized to a node
- Now offline all CPUs in one node
- Online one or more of the CPU in that same node

The end result is that the worker on the node that was offlined now has a mask of the other node, plus the newly added CPU.

So your last statement isn't correct, which is what the original reporter stated.

--
Jens Axboe
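To observe the mask described in the steps above, one option is to read a worker's affinity from userspace before and after the hotplug sequence. The helpers below are a sketch with hypothetical names. They assume glibc's sched_getaffinity() wrapper and a system with at most CPU_SETSIZE CPUs, and they leave finding the worker's thread ID to the reader.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <sys/types.h>

/*
 * Count the CPUs that thread `tid` may run on (tid 0 means the calling
 * thread). Returns -1 if the mask cannot be read.
 */
static int allowed_cpu_count(pid_t tid)
{
	cpu_set_t mask;

	CPU_ZERO(&mask);
	if (sched_getaffinity(tid, sizeof(mask), &mask) != 0)
		return -1;
	return CPU_COUNT(&mask);
}

/* Return 1 if `cpu` is in the affinity mask of `tid`, 0 if not, -1 on error. */
static int cpu_allowed(pid_t tid, int cpu)
{
	cpu_set_t mask;

	if (cpu < 0 || cpu >= CPU_SETSIZE)
		return -1;
	CPU_ZERO(&mask);
	if (sched_getaffinity(tid, sizeof(mask), &mask) != 0)
		return -1;
	return CPU_ISSET(cpu, &mask) ? 1 : 0;
}
```

Calling cpu_allowed() for each CPU of both nodes after the last step would show whether the worker's mask matches the "other node plus the newly added CPU" result reported in the message.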
Re: [PATCHSET v6] Add support for TIF_NOTIFY_SIGNAL
On 10/16/20 9:45 AM, Jens Axboe wrote: > Hi, > > The goal is this patch series is to decouple TWA_SIGNAL based task_work > from real signals and signal delivery. The motivation is speeding up > TWA_SIGNAL based task_work, particularly for threaded setups where > ->sighand is shared across threads. See the last patch for numbers. > > Cleanups in this series, see changelog. But the arch and cleanup > series that goes after this series is much simpler now that we handle > TIF_NOTIFY_SIGNAL generically for !CONFIG_GENERIC_ENTRY. Any objections to this one? I just rebased this one and the full arch series that sits on top for -git, but apart from that, no changes. Thomas, would be nice to know if you're good with patch 2+3 at this point. Once we get outside of the merge window next week, I'll post the updated series since we get a few conflicts at this point, and would be great if you could carry this for 5.11. -- Jens Axboe
[PATCH] drm: Add the missed device_unregister() in drm_sysfs_connector_add()
drm_sysfs_connector_add() fails to call device_unregister() when sysfs_create_link() fails. Add the missing call to fix it. Fixes: e1a29c6c5955 ("drm: Add ddc link in sysfs created by drm_connector") Signed-off-by: Jing Xiangfeng --- drivers/gpu/drm/drm_sysfs.c | 13 ++--- 1 file changed, 10 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/drm_sysfs.c b/drivers/gpu/drm/drm_sysfs.c index f0336c804639..39e173e10cf7 100644 --- a/drivers/gpu/drm/drm_sysfs.c +++ b/drivers/gpu/drm/drm_sysfs.c @@ -274,6 +274,7 @@ static const struct attribute_group *connector_dev_groups[] = { int drm_sysfs_connector_add(struct drm_connector *connector) { struct drm_device *dev = connector->dev; + int ret = 0; if (connector->kdev) return 0; @@ -291,10 +292,16 @@ int drm_sysfs_connector_add(struct drm_connector *connector) return PTR_ERR(connector->kdev); } - if (connector->ddc) - return sysfs_create_link(&connector->kdev->kobj, + if (connector->ddc) { + ret = sysfs_create_link(&connector->kdev->kobj, &connector->ddc->dev.kobj, "ddc"); - return 0; + if (ret) { + device_unregister(connector->kdev); + connector->kdev = NULL; + } + } + + return ret; } void drm_sysfs_connector_remove(struct drm_connector *connector) -- 2.17.1
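The control flow of the fix can be modelled in plain C without any kernel dependencies. In the sketch below every fake_* name is a hypothetical stand-in and none of them are DRM or driver-core APIs. It only shows the pattern "undo the first step when the second one fails".

```c
#include <errno.h>

/* Hypothetical model of a connector; not a DRM structure. */
struct fake_connector {
	int registered;		/* models connector->kdev being non-NULL */
	int has_ddc;		/* models connector->ddc being non-NULL */
	int link_should_fail;	/* test knob: force the link step to fail */
};

/* Models the sysfs link step, which may fail. */
static int fake_create_link(const struct fake_connector *c)
{
	return c->link_should_fail ? -ENOMEM : 0;
}

static int fake_connector_add(struct fake_connector *c)
{
	int ret = 0;

	if (c->registered)
		return 0;

	c->registered = 1;		/* models the device creation step */

	if (c->has_ddc) {
		ret = fake_create_link(c);
		if (ret)
			c->registered = 0;	/* models unregister + kdev = NULL */
	}

	return ret;
}
```

Clearing the registered state on failure lets a later retry start from scratch instead of returning early through the "already registered" check.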
Re: [PATCH] perf trace beauty: Allow header files in a different path
On Thu, Oct 22, 2020 at 7:06 PM Namhyung Kim wrote: > > Current script to generate mmap flags and prot checks headers from the > uapi/asm-generic directory but it might come from a different > directory in some environment. So change the pattern to accept it. > > Signed-off-by: Namhyung Kim Acked-by: Ian Rogers Thanks, Ian > --- > tools/perf/trace/beauty/mmap_flags.sh | 4 ++-- > tools/perf/trace/beauty/mmap_prot.sh | 2 +- > 2 files changed, 3 insertions(+), 3 deletions(-) > > diff --git a/tools/perf/trace/beauty/mmap_flags.sh > b/tools/perf/trace/beauty/mmap_flags.sh > index 39eb2595983b..76825710c725 100755 > --- a/tools/perf/trace/beauty/mmap_flags.sh > +++ b/tools/perf/trace/beauty/mmap_flags.sh > @@ -28,12 +28,12 @@ egrep -q $regex ${linux_mman} && \ > egrep -vw 'MAP_(UNINITIALIZED|TYPE|SHARED_VALIDATE)' | \ > sed -r "s/$regex/\2 \1 \1 \1 \2/g" | \ > xargs printf "\t[ilog2(%s) + 1] = \"%s\",\n#ifndef MAP_%s\n#define > MAP_%s %s\n#endif\n") > -([ ! -f ${arch_mman} ] || egrep -q > '#[[:space:]]*include[[:space:]]+ +([ ! -f ${arch_mman} ] || egrep -q > '#[[:space:]]*include[[:space:]]+.*uapi/asm-generic/mman.*' ${arch_mman}) && > (egrep $regex ${header_dir}/mman-common.h | \ > egrep -vw 'MAP_(UNINITIALIZED|TYPE|SHARED_VALIDATE)' | \ > sed -r "s/$regex/\2 \1 \1 \1 \2/g" | \ > xargs printf "\t[ilog2(%s) + 1] = \"%s\",\n#ifndef MAP_%s\n#define > MAP_%s %s\n#endif\n") > -([ ! -f ${arch_mman} ] || egrep -q > '#[[:space:]]*include[[:space:]]+.*' ${arch_mman}) && > +([ ! 
-f ${arch_mman} ] || egrep -q > '#[[:space:]]*include[[:space:]]+.*uapi/asm-generic/mman.h>.*' ${arch_mman}) > && > (egrep $regex ${header_dir}/mman.h | \ > sed -r "s/$regex/\2 \1 \1 \1 \2/g" | \ > xargs printf "\t[ilog2(%s) + 1] = \"%s\",\n#ifndef MAP_%s\n#define > MAP_%s %s\n#endif\n") > diff --git a/tools/perf/trace/beauty/mmap_prot.sh > b/tools/perf/trace/beauty/mmap_prot.sh > index 28f638f8d216..664d8d534a50 100755 > --- a/tools/perf/trace/beauty/mmap_prot.sh > +++ b/tools/perf/trace/beauty/mmap_prot.sh > @@ -17,7 +17,7 @@ prefix="PROT" > > printf "static const char *mmap_prot[] = {\n" > regex=`printf > '^[[:space:]]*#[[:space:]]*define[[:space:]]+%s_([[:alnum:]_]+)[[:space:]]+(0x[[:xdigit:]]+)[[:space:]]*.*' > ${prefix}` > -([ ! -f ${arch_mman} ] || egrep -q > '#[[:space:]]*include[[:space:]]+ +([ ! -f ${arch_mman} ] || egrep -q > '#[[:space:]]*include[[:space:]]+.*uapi/asm-generic/mman.*' ${arch_mman}) && > (egrep $regex ${common_mman} | \ > egrep -vw PROT_NONE | \ > sed -r "s/$regex/\2 \1 \1 \1 \2/g" | \ > -- > 2.29.0.rc1.297.gfa9743e501-goog >
Re: [PATCH] selftests/powerpc/eeh: disable kselftest timeout setting for eeh-basic
Po-Hsu Lin writes: > The eeh-basic test got its own 60 seconds timeout (defined in commit > 414f50434aa2 "selftests/eeh: Bump EEH wait time to 60s") per breakable > device. > > And we have discovered that the number of breakable devices varies > on different hardware. The device recovery time ranges from 0 to 35 > seconds. In our test pool it will take about 30 seconds to run on a > Power8 system that with 5 breakable devices, 60 seconds to run on a > Power9 system that with 4 breakable devices. > > Thus it's better to disable the default 45 seconds timeout setting in > the kselftest framework to give it a chance to finish. And let the > test to take care of the timeout control. I'd prefer if we still had some timeout, maybe 5 or 10 minutes? Just in case the test goes completely bonkers. cheers > diff --git a/tools/testing/selftests/powerpc/eeh/Makefile > b/tools/testing/selftests/powerpc/eeh/Makefile > index b397bab..ae963eb 100644 > --- a/tools/testing/selftests/powerpc/eeh/Makefile > +++ b/tools/testing/selftests/powerpc/eeh/Makefile > @@ -3,7 +3,7 @@ noarg: > $(MAKE) -C ../ > > TEST_PROGS := eeh-basic.sh > -TEST_FILES := eeh-functions.sh > +TEST_FILES := eeh-functions.sh settings > > top_srcdir = ../../../../.. > include ../../lib.mk > diff --git a/tools/testing/selftests/powerpc/eeh/settings > b/tools/testing/selftests/powerpc/eeh/settings > new file mode 100644 > index 000..e7b9417 > --- /dev/null > +++ b/tools/testing/selftests/powerpc/eeh/settings > @@ -0,0 +1 @@ > +timeout=0 > -- > 2.7.4
Re: [External] Re: [PATCH] nvme-rdma: handle nvme completion data length
On 2020/10/22 18:05, zhenwei pi wrote: On 10/22/20 5:55 PM, Chao Leng wrote: On 2020/10/22 16:38, zhenwei pi wrote: Hit a kernel warning: refcount_t: underflow; use-after-free. WARNING: CPU: 0 PID: 0 at lib/refcount.c:28 RIP: 0010:refcount_warn_saturate+0xd9/0xe0 Call Trace: nvme_rdma_recv_done+0xf3/0x280 [nvme_rdma] __ib_process_cq+0x76/0x150 [ib_core] ... The reason is that a zero bytes message received from target, and the host side continues to process without length checking, then the previous CQE is processed twice. Handle data length, ignore zero bytes message, and try to recovery for corrupted CQE case. Signed-off-by: zhenwei pi --- drivers/nvme/host/rdma.c | 11 +++ 1 file changed, 11 insertions(+) diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c index 9e378d0a0c01..9f5112040d43 100644 --- a/drivers/nvme/host/rdma.c +++ b/drivers/nvme/host/rdma.c @@ -1767,6 +1767,17 @@ static void nvme_rdma_recv_done(struct ib_cq *cq, struct ib_wc *wc) return; } + if (unlikely(!wc->byte_len)) { + /* zero bytes message could be ignored */ + return; Resource leak, need nvme_rdma_post_recv. + } else if (unlikely(wc->byte_len < len)) { + /* Corrupted completion, try to recovry */ + dev_err(queue->ctrl->ctrl.device, + "Unexpected nvme completion length(%d)\n", wc->byte_len); + nvme_rdma_error_recovery(queue->ctrl); + return; + } !wc->byte_len and wc->byte_len < len may be the same type of anomaly. Why do different error handling? In which scenario zero bytes message received from target? fault inject test or normal test/run? Zero bytes message could be used as transport layer keep alive mechanism (I's also developing target side transport layer keep alive now. To reclaim resource, target side needs to close dead connections even kato is set as 0). nvme over fabric protocol do not define this. May be async event is a option for target keep alive(if kato set as 0). 
+ ib_dma_sync_single_for_cpu(ibdev, qe->dma, len, DMA_FROM_DEVICE); /* * AEN requests are special as they don't time out and can
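The two length checks in the patch, together with the review remark that the zero-byte path must still repost the receive buffer, can be summarised as a small decision function. This is a sketch under assumptions: the enum and function names are hypothetical, and the thread has not settled whether the two short cases should be handled differently at all.

```c
#include <stddef.h>

/* Hypothetical outcome labels for this sketch; not names from the driver. */
enum recv_action {
	RECV_IGNORE_AND_REPOST,	/* zero-byte message: skip it, repost the buffer */
	RECV_ERROR_RECOVERY,	/* shorter than one completion entry: corrupted */
	RECV_PROCESS,		/* at least one full completion entry */
};

/*
 * Classify a receive completion by its byte count. `cqe_len` is the size
 * of one completion entry (sizeof(struct nvme_completion) in the driver).
 */
static enum recv_action classify_recv(size_t byte_len, size_t cqe_len)
{
	if (byte_len == 0)
		return RECV_IGNORE_AND_REPOST;
	if (byte_len < cqe_len)
		return RECV_ERROR_RECOVERY;
	return RECV_PROCESS;
}
```

Keeping the classification separate from the handlers makes it easy to merge the first two outcomes later if the reviewers decide they are the same kind of anomaly.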
Re: [PATCH v3 2/6] docs: lockdep-design: fix some warning issues
On Wed, Oct 21, 2020 at 02:17:23PM +0200, Mauro Carvalho Chehab wrote: > There are several warnings caused by a recent change > 224ec489d3cd ("lockdep/Documention: Recursive read lock detection reasoning") > > Those are reported by htmldocs build: > > Documentation/locking/lockdep-design.rst:429: WARNING: Definition list > ends without a blank line; unexpected unindent. > Documentation/locking/lockdep-design.rst:452: WARNING: Block quote ends > without a blank line; unexpected unindent. > Documentation/locking/lockdep-design.rst:453: WARNING: Unexpected > indentation. > Documentation/locking/lockdep-design.rst:453: WARNING: Blank line > required after table. > Documentation/locking/lockdep-design.rst:454: WARNING: Block quote ends > without a blank line; unexpected unindent. > Documentation/locking/lockdep-design.rst:455: WARNING: Unexpected > indentation. > Documentation/locking/lockdep-design.rst:455: WARNING: Blank line > required after table. > Documentation/locking/lockdep-design.rst:456: WARNING: Block quote ends > without a blank line; unexpected unindent. > Documentation/locking/lockdep-design.rst:457: WARNING: Unexpected > indentation. > Documentation/locking/lockdep-design.rst:457: WARNING: Blank line > required after table. > > Besides the reported issues, there are some missing blank > lines that ended producing wrong html output, and some > literals are not properly identified. > > Also, the symbols used at the irq enabled/disable table > are not displayed as expected, as they're not literals. > Also, on another table they're using a different notation. 
> > Fixes: 224ec489d3cd ("lockdep/Documention: Recursive read lock detection > reasoning") > Signed-off-by: Mauro Carvalho Chehab Acked-by: Boqun Feng Regards, Boqun > --- > Documentation/locking/lockdep-design.rst | 51 ++-- > 1 file changed, 31 insertions(+), 20 deletions(-) > > diff --git a/Documentation/locking/lockdep-design.rst > b/Documentation/locking/lockdep-design.rst > index cec03bd1294a..9f3cfca9f8a4 100644 > --- a/Documentation/locking/lockdep-design.rst > +++ b/Documentation/locking/lockdep-design.rst > @@ -42,6 +42,7 @@ The validator tracks lock-class usage history and divides > the usage into > (4 usages * n STATEs + 1) categories: > > where the 4 usages can be: > + > - 'ever held in STATE context' > - 'ever held as readlock in STATE context' > - 'ever held with STATE enabled' > @@ -49,10 +50,12 @@ where the 4 usages can be: > > where the n STATEs are coded in kernel/locking/lockdep_states.h and as of > now they include: > + > - hardirq > - softirq > > where the last 1 category is: > + > - 'ever used' [ == !unused] > > When locking rules are violated, these usage bits are presented in the > @@ -96,9 +99,9 @@ exact case is for the lock as of the reporting time. >+--+-+--+ >| | irq enabled | irq disabled | >+--+-+--+ > - | ever in irq | ? | - | > + | ever in irq | '?' | '-' | >+--+-+--+ > - | never in irq | + | . | > + | never in irq | '+' | '.' | >+--+-+--+ > > The character '-' suggests irq is disabled because if otherwise the > @@ -216,7 +219,7 @@ looks like this:: > BD_MUTEX_PARTITION >}; > > -mutex_lock_nested(>bd_contains->bd_mutex, BD_MUTEX_PARTITION); > + mutex_lock_nested(>bd_contains->bd_mutex, BD_MUTEX_PARTITION); > > In this case the locking is done on a bdev object that is known to be a > partition. > @@ -334,7 +337,7 @@ Troubleshooting: > > > The validator tracks a maximum of MAX_LOCKDEP_KEYS number of lock classes. 
> -Exceeding this number will trigger the following lockdep warning: > +Exceeding this number will trigger the following lockdep warning:: > > (DEBUG_LOCKS_WARN_ON(id >= MAX_LOCKDEP_KEYS)) > > @@ -420,7 +423,8 @@ the critical section of another reader of the same lock > instance. > > The difference between recursive readers and non-recursive readers is > because: > recursive readers get blocked only by a write lock *holder*, while > non-recursive > -readers could get blocked by a write lock *waiter*. Considering the follow > example: > +readers could get blocked by a write lock *waiter*. Considering the follow > +example:: > > TASK A: TASK B: > > @@ -448,20 +452,22 @@ There are simply four block conditions: > > Block condition matrix, Y means the row blocks the column, and N means > otherwise. > > - | E | r | R | > +---+---+---+---+ > - E | Y | Y | Y | > + | | E | r | R | > +---+---+---+---+ > - r | Y | Y | N | > + | E | Y | Y | Y | > + +---+---+---+---+ > + | r
[PATCH] perf trace beauty: Allow header files in a different path
The current scripts that generate the mmap flags and prot tables check for headers included from the uapi/asm-generic directory, but the include might come from a different directory in some environments. So change the pattern to accept it. Signed-off-by: Namhyung Kim --- tools/perf/trace/beauty/mmap_flags.sh | 4 ++-- tools/perf/trace/beauty/mmap_prot.sh | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/tools/perf/trace/beauty/mmap_flags.sh b/tools/perf/trace/beauty/mmap_flags.sh index 39eb2595983b..76825710c725 100755 --- a/tools/perf/trace/beauty/mmap_flags.sh +++ b/tools/perf/trace/beauty/mmap_flags.sh @@ -28,12 +28,12 @@ egrep -q $regex ${linux_mman} && \ egrep -vw 'MAP_(UNINITIALIZED|TYPE|SHARED_VALIDATE)' | \ sed -r "s/$regex/\2 \1 \1 \1 \2/g" | \ xargs printf "\t[ilog2(%s) + 1] = \"%s\",\n#ifndef MAP_%s\n#define MAP_%s %s\n#endif\n") -([ ! -f ${arch_mman} ] || egrep -q '#[[:space:]]*include[[:space:]]+<uapi/asm-generic/mman.*' ${arch_mman}) && +([ ! -f ${arch_mman} ] || egrep -q '#[[:space:]]*include[[:space:]]+.*uapi/asm-generic/mman.*' ${arch_mman}) && (egrep $regex ${header_dir}/mman-common.h | \ egrep -vw 'MAP_(UNINITIALIZED|TYPE|SHARED_VALIDATE)' | \ sed -r "s/$regex/\2 \1 \1 \1 \2/g" | \ xargs printf "\t[ilog2(%s) + 1] = \"%s\",\n#ifndef MAP_%s\n#define MAP_%s %s\n#endif\n") -([ ! -f ${arch_mman} ] || egrep -q '#[[:space:]]*include[[:space:]]+<uapi/asm-generic/mman.h>.*' ${arch_mman}) && +([ ! -f ${arch_mman} ] || egrep -q '#[[:space:]]*include[[:space:]]+.*uapi/asm-generic/mman.h>.*' ${arch_mman}) && (egrep $regex ${header_dir}/mman.h | \ sed -r "s/$regex/\2 \1 \1 \1 \2/g" | \ xargs printf "\t[ilog2(%s) + 1] = \"%s\",\n#ifndef MAP_%s\n#define MAP_%s %s\n#endif\n") diff --git a/tools/perf/trace/beauty/mmap_prot.sh b/tools/perf/trace/beauty/mmap_prot.sh index 28f638f8d216..664d8d534a50 100755 --- a/tools/perf/trace/beauty/mmap_prot.sh +++ b/tools/perf/trace/beauty/mmap_prot.sh @@ -17,7 +17,7 @@ prefix="PROT" printf "static const char *mmap_prot[] = {\n" regex=`printf '^[[:space:]]*#[[:space:]]*define[[:space:]]+%s_([[:alnum:]_]+)[[:space:]]+(0x[[:xdigit:]]+)[[:space:]]*.*' ${prefix}` -([ ! -f ${arch_mman} ] || egrep -q '#[[:space:]]*include[[:space:]]+<uapi/asm-generic/mman.*' ${arch_mman}) && +([ ! -f ${arch_mman} ] || egrep -q '#[[:space:]]*include[[:space:]]+.*uapi/asm-generic/mman.*' ${arch_mman}) && (egrep $regex ${common_mman} | \ egrep -vw PROT_NONE | \ sed -r "s/$regex/\2 \1 \1 \1 \2/g" | \ -- 2.29.0.rc1.297.gfa9743e501-goog
[GIT PULL] Arch/task_work cleanup
Hi Linus,

Two cleanups that don't fit other categories:

- Finally get the task_work_add() cleanup done properly, so we don't have
  random 0/1/false/true/TWA_SIGNAL confusing use cases. Updates all
  callers, and also fixes up the documentation for task_work_add().

- While working on some TIF related changes for 5.11, this
  TIF_NOTIFY_RESUME cleanup fell out of that. Remove some arch
  duplication for how that is handled.

Please pull!

The following changes since commit 324bcf54c449c7b5b7024c9fa4549fbaaae1935d:

  mm: use limited read-ahead to satisfy read (2020-10-17 13:49:08 -0600)

are available in the Git repository at:

  git://git.kernel.dk/linux-block.git tags/arch-cleanup-2020-10-22

for you to fetch changes up to 91989c707884ecc7cd537281ab1a4b8fb7219da3:

  task_work: cleanup notification modes (2020-10-17 15:05:30 -0600)

arch-cleanup-2020-10-22

Jens Axboe (2):
      tracehook: clear TIF_NOTIFY_RESUME in tracehook_notify_resume()
      task_work: cleanup notification modes

 arch/alpha/kernel/signal.c             | 1 -
 arch/arc/kernel/signal.c               | 2 +-
 arch/arm/kernel/signal.c               | 1 -
 arch/arm64/kernel/signal.c             | 1 -
 arch/c6x/kernel/signal.c               | 4 +---
 arch/csky/kernel/signal.c              | 1 -
 arch/h8300/kernel/signal.c             | 4 +---
 arch/hexagon/kernel/process.c          | 1 -
 arch/ia64/kernel/process.c             | 2 +-
 arch/m68k/kernel/signal.c              | 2 +-
 arch/microblaze/kernel/signal.c        | 2 +-
 arch/mips/kernel/signal.c              | 1 -
 arch/nds32/kernel/signal.c             | 4 +---
 arch/nios2/kernel/signal.c             | 2 +-
 arch/openrisc/kernel/signal.c          | 1 -
 arch/parisc/kernel/signal.c            | 4 +---
 arch/powerpc/kernel/signal.c           | 1 -
 arch/riscv/kernel/signal.c             | 4 +---
 arch/s390/kernel/signal.c              | 1 -
 arch/sh/kernel/signal_32.c             | 4 +---
 arch/sparc/kernel/signal_32.c          | 4 +---
 arch/sparc/kernel/signal_64.c          | 4 +---
 arch/um/kernel/process.c               | 2 +-
 arch/x86/kernel/cpu/mce/core.c         | 2 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 2 +-
 arch/xtensa/kernel/signal.c            | 2 +-
 drivers/acpi/apei/ghes.c               | 2 +-
 drivers/android/binder.c               | 2 +-
 fs/file_table.c                        | 2 +-
 fs/io_uring.c                          | 13 +++--
 fs/namespace.c                         | 2 +-
 include/linux/task_work.h              | 11 ---
 include/linux/tracehook.h              | 4 ++--
 kernel/entry/common.c                  | 1 -
 kernel/entry/kvm.c                     | 4 +---
 kernel/events/uprobes.c                | 2 +-
 kernel/irq/manage.c                    | 2 +-
 kernel/sched/fair.c                    | 2 +-
 kernel/task_work.c                     | 30 --
 security/keys/keyctl.c                 | 2 +-
 security/yama/yama_lsm.c               | 2 +-
 41 files changed, 64 insertions(+), 76 deletions(-)

-- 
Jens Axboe
Re: [PATCH net RFC] net: Clear IFF_TX_SKB_SHARING for all Ethernet devices using skb_padto
On Thu, Oct 22, 2020 at 5:44 PM Jakub Kicinski wrote:
>
> On Thu, 22 Oct 2020 12:59:45 -0700 Xie He wrote:
> >
> > But I also see some drivers that want to pad the skb to a strange
> > length, and don't set their special min_mtu to match this length. For
> > example:
> >
> > drivers/net/ethernet/packetengines/yellowfin.c wants to pad the skb to
> > a dynamically calculated value.
> >
> > drivers/net/ethernet/ti/cpsw.c, cpsw_new.c and tlan.c want to pad the
> > skb to macro defined values.
> >
> > drivers/net/ethernet/intel/iavf/iavf_txrx.c wants to pad the skb to
> > IAVF_MIN_TX_LEN (17).
> >
> > drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c wants to pad the skb to
> > 17.
>
> Hm, I see, that would be a slight loss of functionality if we started
> requiring 64B, for example, while the driver could in practice xmit
> 17B frames (would matter only to VFs, but nonetheless).

I think requiring the length to be at least some value won't solve the
problem for all drivers. For example:

drivers/net/ethernet/packetengines/yellowfin.c pads the skb to 32-byte
boundaries in memory (no matter how long the length is).

drivers/net/ethernet/adaptec/starfire.c pads the skb so that the
length is a multiple of 4.

drivers/net/ethernet/sun/cassini.c pads the skb to cp->min_frame_size,
which may be 255, 60, or 97.

> > Another solution I can think of is to add a "skb_shared" check to
> > "__skb_pad", so that if __skb_pad encounters a shared skb, it just
> > returns an error. The driver would think this is a memory allocation
> > failure. This way we can ensure shared skbs are not modified.
>
> I'm not sure if we want to be adding checks to __skb_pad() to handle
> what's effectively a pktgen specific condition.
>
> We could create a new field in struct netdevice for min_frame_len, but I
> think your patch is the simplest solution. Let's see if anyone objects.
>
> BTW it seems like there is more drivers which will need the flag
> cleared, e.g. drivers/net/ethernet/broadcom/bnxt/bnxt.c?

My patch isn't complete. Because there are so many drivers with this
problem, I feel it's hard to solve them all at once. So I only grepped
"skb_padto" under "drivers/net/ethernet". There are other drivers
under "ethernet" using "skb_pad", "skb_put_padto" or "eth_skb_pad".
There are also (fake) Ethernet drivers under "drivers/net/wireless". I
feel it'd take a long time and also be error-prone to solve them all,
so I feel it'd be best if there were other solutions.
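
To illustrate why a single minimum length does not cover drivers that pad
to a multiple, here is a minimal userspace sketch (not kernel code). The
helper names pad_to_min() and pad_to_multiple() are hypothetical and exist
only for this example; the lengths are picked to match the cases listed
above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: bytes of padding needed to reach a minimum length. */
size_t pad_to_min(size_t len, size_t min_len)
{
	return len < min_len ? min_len - len : 0;
}

/*
 * Hypothetical helper: bytes of padding needed so that the length becomes
 * a multiple of 'multiple' (for example 4, as described for starfire.c).
 */
size_t pad_to_multiple(size_t len, size_t multiple)
{
	assert(multiple != 0);
	return (multiple - len % multiple) % multiple;
}
```

A 61-byte frame already satisfies a 60-byte minimum, so pad_to_min(61, 60)
is 0, yet pad_to_multiple(61, 4) is 3: such a driver would still write
padding into the skb even if a minimum length were enforced.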
RE: [PATCH v2 tty] tty: serial: fsl_lpuart: LS1021A has a FIFO size of 16 words, like LS1028A
From: Vladimir Oltean Sent: Friday, October 23, 2020 9:34 AM
> Prior to the commit that this one fixes, the FIFO size was derived from the
> read-only register LPUARTx_FIFO[TXFIFOSIZE] using the following
> formula:
>
> TX FIFO size = 2 ^ (LPUARTx_FIFO[TXFIFOSIZE] - 1)
>
> The documentation for LS1021A is a mess. Under chapter 26.1.3 LS1021A
> LPUART module special consideration, it mentions TXFIFO_SZ and RXFIFO_SZ
> being equal to 4, and in the register description for LPUARTx_FIFO, it
> shows the out-of-reset value of TXFIFOSIZE and RXFIFOSIZE fields as "011",
> even though these registers read as "101" in reality.
>
> And when LPUART on LS1021A was working, the "101" value did correspond to
> "16 datawords", by applying the formula above, even though the
> documentation is wrong again and says that "101" means 64 datawords
> (hint: it doesn't).
>
> So the "new" formula created by commit f77ebb241ce0 has all the premises of
> being wrong for LS1021A, because it relied only on false data and no actual
> experimentation.
>
> Interestingly, in commit c2f448cff22a ("tty: serial: fsl_lpuart: add LS1028A
> support"), Michael Walle applied a workaround to this by manually setting the
> FIFO widths for LS1028A. It looks like the same values are used by LS1021A as
> well, in fact.
>
> When the driver thinks that it has a deeper FIFO than it really has, getty
> (user space) output gets truncated.
>
> Many thanks to Michael for pointing out where to look.
>
> Fixes: f77ebb241ce0 ("tty: serial: fsl_lpuart: correct the FIFO depth size")
> Suggested-by: Michael Walle
> Signed-off-by: Vladimir Oltean
> ---
> Changes in v2:
> Reworded commit message.

For the v2 with commit message change:

Reviewed-by: Fugang Duan

>
>  drivers/tty/serial/fsl_lpuart.c | 13 +++--
>  1 file changed, 7 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/tty/serial/fsl_lpuart.c b/drivers/tty/serial/fsl_lpuart.c
> index ff4b88c637d0..bd047e1f9bea 100644
> --- a/drivers/tty/serial/fsl_lpuart.c
> +++ b/drivers/tty/serial/fsl_lpuart.c
> @@ -314,9 +314,10 @@ MODULE_DEVICE_TABLE(of, lpuart_dt_ids);
>  /* Forward declare this for the dma callbacks*/
>  static void lpuart_dma_tx_complete(void *arg);
>
> -static inline bool is_ls1028a_lpuart(struct lpuart_port *sport)
> +static inline bool is_layerscape_lpuart(struct lpuart_port *sport)
>  {
> -       return sport->devtype == LS1028A_LPUART;
> +       return (sport->devtype == LS1021A_LPUART ||
> +               sport->devtype == LS1028A_LPUART);
>  }
>
>  static inline bool is_imx8qxp_lpuart(struct lpuart_port *sport)
> @@ -1701,11 +1702,11 @@ static int lpuart32_startup(struct uart_port *port)
>                             UARTFIFO_FIFOSIZE_MASK);
>
>         /*
> -        * The LS1028A has a fixed length of 16 words. Although it supports the
> -        * RX/TXSIZE fields their encoding is different. Eg the reference manual
> -        * states 0b101 is 16 words.
> +        * The LS1021A and LS1028A have a fixed FIFO depth of 16 words.
> +        * Although they support the RX/TXSIZE fields, their encoding is
> +        * different. Eg the reference manual states 0b101 is 16 words.
>          */
> -       if (is_ls1028a_lpuart(sport)) {
> +       if (is_layerscape_lpuart(sport)) {
>                 sport->rxfifo_size = 16;
>                 sport->txfifo_size = 16;
>                 sport->port.fifosize = sport->txfifo_size;
> --
> 2.25.1
RE: [EXT] [PATCH] tty: serial: fsl_lpuart: LS1021A has a FIFO size of 32 datawords
From: Vladimir Oltean Sent: Thursday, October 22, 2020 11:13 PM
> From: Vladimir Oltean
>
> Similar to the workaround applied by Michael Walle in commit c2f448cff22a
> ("tty: serial: fsl_lpuart: add LS1028A support"), it turns out that the
> LPUARTx_FIFO encoding for fields TXFIFOSIZE and RXFIFOSIZE is the same for
> LS1028A as for LS1021A.
>
> The RXFIFOSIZE in the Layerscape SoCs is fixed at this value:
> 101 Receive FIFO/Buffer depth = 32 datawords.
>
> When Andy Duan wrote the commit in Fixes: below, he assumed that the 101
> encoding means 64 datawords. But this is not true for Layerscape. So that
> commit broke LS1021A, and this patch is extending the workaround for LS1028A
> which appeared in the meantime, to fix that breakage.
>
> When the driver thinks that it has a deeper FIFO than it really has, getty
> (user space) output gets truncated.
>
> Many thanks to Michael for suggesting this!
>
> Fixes: f77ebb241ce0 ("tty: serial: fsl_lpuart: correct the FIFO depth size")
> Suggested-by: Michael Walle
> Signed-off-by: Vladimir Oltean

Layerscape has a different definition for the FIFO size.

Reviewed-by: Fugang Duan

> ---
>  drivers/tty/serial/fsl_lpuart.c | 13 +++--
>  1 file changed, 7 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/tty/serial/fsl_lpuart.c b/drivers/tty/serial/fsl_lpuart.c
> index ff4b88c637d0..bd047e1f9bea 100644
> --- a/drivers/tty/serial/fsl_lpuart.c
> +++ b/drivers/tty/serial/fsl_lpuart.c
> @@ -314,9 +314,10 @@ MODULE_DEVICE_TABLE(of, lpuart_dt_ids);
>  /* Forward declare this for the dma callbacks*/
>  static void lpuart_dma_tx_complete(void *arg);
>
> -static inline bool is_ls1028a_lpuart(struct lpuart_port *sport)
> +static inline bool is_layerscape_lpuart(struct lpuart_port *sport)
>  {
> -       return sport->devtype == LS1028A_LPUART;
> +       return (sport->devtype == LS1021A_LPUART ||
> +               sport->devtype == LS1028A_LPUART);
>  }
>
>  static inline bool is_imx8qxp_lpuart(struct lpuart_port *sport)
> @@ -1701,11 +1702,11 @@ static int lpuart32_startup(struct uart_port *port)
>                             UARTFIFO_FIFOSIZE_MASK);
>
>         /*
> -        * The LS1028A has a fixed length of 16 words. Although it supports the
> -        * RX/TXSIZE fields their encoding is different. Eg the reference manual
> -        * states 0b101 is 16 words.
> +        * The LS1021A and LS1028A have a fixed FIFO depth of 16 words.
> +        * Although they support the RX/TXSIZE fields, their encoding is
> +        * different. Eg the reference manual states 0b101 is 16 words.
>          */
> -       if (is_ls1028a_lpuart(sport)) {
> +       if (is_layerscape_lpuart(sport)) {
>                 sport->rxfifo_size = 16;
>                 sport->txfifo_size = 16;
>                 sport->port.fifosize = sport->txfifo_size;
> --
> 2.25.1
[tip:auto-latest] BUILD SUCCESS 65609b26b21a169a05d1482db6c1b52d8a4abe0d
                        omega2p_defconfig
ia64                    gensparse_defconfig
arm                     ebsa110_defconfig
powerpc                 mvme5100_defconfig
arm                     rpc_defconfig
powerpc                 ppc64e_defconfig
ia64                    allmodconfig
ia64                    defconfig
ia64                    allyesconfig
m68k                    allmodconfig
m68k                    defconfig
m68k                    allyesconfig
nios2                   defconfig
nds32                   allnoconfig
c6x                     allyesconfig
nds32                   defconfig
nios2                   allyesconfig
csky                    defconfig
alpha                   defconfig
alpha                   allyesconfig
xtensa                  allyesconfig
h8300                   allyesconfig
s390                    allyesconfig
parisc                  allyesconfig
s390                    defconfig
i386                    allyesconfig
sparc                   defconfig
i386                    defconfig
mips                    allyesconfig
mips                    allmodconfig
powerpc                 allyesconfig
powerpc                 allmodconfig
powerpc                 allnoconfig
i386                    randconfig-a002-20201022
i386                    randconfig-a005-20201022
i386                    randconfig-a003-20201022
i386                    randconfig-a001-20201022
i386                    randconfig-a006-20201022
i386                    randconfig-a004-20201022
i386                    randconfig-a002-20201023
i386                    randconfig-a005-20201023
i386                    randconfig-a003-20201023
i386                    randconfig-a001-20201023
i386                    randconfig-a006-20201023
i386                    randconfig-a004-20201023
x86_64                  randconfig-a011-20201022
x86_64                  randconfig-a013-20201022
x86_64                  randconfig-a016-20201022
x86_64                  randconfig-a015-20201022
x86_64                  randconfig-a012-20201022
x86_64                  randconfig-a014-20201022
i386                    randconfig-a016-20201022
i386                    randconfig-a014-20201022
i386                    randconfig-a015-20201022
i386                    randconfig-a012-20201022
i386                    randconfig-a013-20201022
i386                    randconfig-a011-20201022
riscv                   nommu_k210_defconfig
riscv                   allyesconfig
riscv                   nommu_virt_defconfig
riscv                   allnoconfig
riscv                   defconfig
riscv                   rv32_defconfig
riscv                   allmodconfig
x86_64                  rhel
x86_64                  allyesconfig
x86_64                  rhel-7.6-kselftests
x86_64                  defconfig
x86_64                  rhel-8.3
x86_64                  kexec

clang tested configs:
x86_64                  randconfig-a001-20201022
x86_64                  randconfig-a002-20201022
x86_64                  randconfig-a003-20201022
x86_64                  randconfig-a006-20201022
x86_64                  randconfig-a004-20201022
x86_64                  randconfig-a005-20201022

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org
Re: [LTP] mmstress[1309]: segfault at 7f3d71a36ee8 ip 00007f3d77132bdf sp 00007f3d71a36ee8 error 4 in libc-2.27.so[7f3d77058000+1aa000]
Hello!

On Thu, 22 Oct 2020 at 19:11, Linus Torvalds wrote:
> On Thu, Oct 22, 2020 at 4:43 PM Linus Torvalds
> Would you mind sending me the problematic vmlinux file in private (or,
> likely better - a pointer to some place I can download it, it's going
> to be huge).

The kernel Naresh originally referred to is here:
  https://builds.tuxbuild.com/SCI7Xyjb7V2NbfQ2lbKBZw/

Greetings!

Daniel Díaz
daniel.d...@linaro.org
[PATCH v2 tty] tty: serial: fsl_lpuart: LS1021A has a FIFO size of 16 words, like LS1028A
Prior to the commit that this one fixes, the FIFO size was derived from
the read-only register LPUARTx_FIFO[TXFIFOSIZE] using the following
formula:

TX FIFO size = 2 ^ (LPUARTx_FIFO[TXFIFOSIZE] - 1)

The documentation for LS1021A is a mess. Under chapter 26.1.3 LS1021A
LPUART module special consideration, it mentions TXFIFO_SZ and RXFIFO_SZ
being equal to 4, and in the register description for LPUARTx_FIFO, it
shows the out-of-reset value of TXFIFOSIZE and RXFIFOSIZE fields as "011",
even though these registers read as "101" in reality.

And when LPUART on LS1021A was working, the "101" value did correspond
to "16 datawords", by applying the formula above, even though the
documentation is wrong again and says that "101" means 64 datawords
(hint: it doesn't).

So the "new" formula created by commit f77ebb241ce0 has all the premises
of being wrong for LS1021A, because it relied only on false data and no
actual experimentation.

Interestingly, in commit c2f448cff22a ("tty: serial: fsl_lpuart: add
LS1028A support"), Michael Walle applied a workaround to this by manually
setting the FIFO widths for LS1028A. It looks like the same values are
used by LS1021A as well, in fact.

When the driver thinks that it has a deeper FIFO than it really has,
getty (user space) output gets truncated.

Many thanks to Michael for pointing out where to look.

Fixes: f77ebb241ce0 ("tty: serial: fsl_lpuart: correct the FIFO depth size")
Suggested-by: Michael Walle
Signed-off-by: Vladimir Oltean
---
Changes in v2:
Reworded commit message.

 drivers/tty/serial/fsl_lpuart.c | 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/drivers/tty/serial/fsl_lpuart.c b/drivers/tty/serial/fsl_lpuart.c
index ff4b88c637d0..bd047e1f9bea 100644
--- a/drivers/tty/serial/fsl_lpuart.c
+++ b/drivers/tty/serial/fsl_lpuart.c
@@ -314,9 +314,10 @@ MODULE_DEVICE_TABLE(of, lpuart_dt_ids);
 /* Forward declare this for the dma callbacks*/
 static void lpuart_dma_tx_complete(void *arg);

-static inline bool is_ls1028a_lpuart(struct lpuart_port *sport)
+static inline bool is_layerscape_lpuart(struct lpuart_port *sport)
 {
-       return sport->devtype == LS1028A_LPUART;
+       return (sport->devtype == LS1021A_LPUART ||
+               sport->devtype == LS1028A_LPUART);
 }

 static inline bool is_imx8qxp_lpuart(struct lpuart_port *sport)
@@ -1701,11 +1702,11 @@ static int lpuart32_startup(struct uart_port *port)
                            UARTFIFO_FIFOSIZE_MASK);

        /*
-        * The LS1028A has a fixed length of 16 words. Although it supports the
-        * RX/TXSIZE fields their encoding is different. Eg the reference manual
-        * states 0b101 is 16 words.
+        * The LS1021A and LS1028A have a fixed FIFO depth of 16 words.
+        * Although they support the RX/TXSIZE fields, their encoding is
+        * different. Eg the reference manual states 0b101 is 16 words.
         */
-       if (is_ls1028a_lpuart(sport)) {
+       if (is_layerscape_lpuart(sport)) {
                sport->rxfifo_size = 16;
                sport->txfifo_size = 16;
                sport->port.fifosize = sport->txfifo_size;
--
2.25.1
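
As a quick worked check of the two readings of the 3-bit TXFIFOSIZE field
discussed in the commit message, here is a minimal userspace sketch (not
driver code). The function names are made up for illustration, and
fifo_depth_documented() assumes a 2^(field + 1) encoding, which is only
inferred from the statement that the documentation reads "101" as
64 datawords.

```c
#include <assert.h>

/*
 * Formula quoted in the commit message as the one used before commit
 * f77ebb241ce0: TX FIFO size = 2 ^ (TXFIFOSIZE - 1).
 */
unsigned int fifo_depth_old_formula(unsigned int field)
{
	/* 3-bit field; 0 is not covered by the quoted formula. */
	assert(field >= 1 && field <= 7);
	return 1u << (field - 1);
}

/*
 * ASSUMPTION: encoding implied by the documentation's reading of 0b101
 * as 64 datawords, i.e. 2 ^ (field + 1). Not taken from the driver.
 */
unsigned int fifo_depth_documented(unsigned int field)
{
	assert(field >= 1 && field <= 7);
	return 1u << (field + 1);
}
```

With the value 0b101 that the registers actually read, the old formula
gives 16 datawords while the assumed documented encoding gives 64, a
factor of four deeper than the hardware FIFO described above.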
Re: [PATCH ghak90 V9 05/13] audit: log container info of syscalls
On Wed, Oct 21, 2020 at 12:39 PM Richard Guy Briggs wrote:
> Here is an example I was able to generate after updating the testsuite
> script to include a signalling example of a nested audit container
> identifier:
>
> type=PROCTITLE msg=audit(2020-10-21 10:31:16.655:6731) : proctitle=/usr/bin/perl -w containerid/test
> type=CONTAINER_ID msg=audit(2020-10-21 10:31:16.655:6731) : contid=7129731255799087104^941723245477888
> type=OBJ_PID msg=audit(2020-10-21 10:31:16.655:6731) : opid=115583 oauid=root ouid=root oses=1 obj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 ocomm=perl
> type=CONTAINER_ID msg=audit(2020-10-21 10:31:16.655:6731) : contid=941723245477888
> type=OBJ_PID msg=audit(2020-10-21 10:31:16.655:6731) : opid=115580 oauid=root ouid=root oses=1 obj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 ocomm=perl
> type=CONTAINER_ID msg=audit(2020-10-21 10:31:16.655:6731) : contid=8098399240850112512^941723245477888
> type=OBJ_PID msg=audit(2020-10-21 10:31:16.655:6731) : opid=115582 oauid=root ouid=root oses=1 obj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 ocomm=perl
> type=SYSCALL msg=audit(2020-10-21 10:31:16.655:6731) : arch=x86_64 syscall=kill success=yes exit=0 a0=0xfffe3c84 a1=SIGTERM a2=0x4d524554 a3=0x0 items=0 ppid=115564 pid=115567 auid=root uid=root gid=root euid=root suid=root fsuid=root egid=root sgid=root fsgid=root tty=ttyS0 ses=1 comm=perl exe=/usr/bin/perl subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 key=testsuite-1603290671-AcLtUulY
>
> There are three CONTAINER_ID records which need some way of associating with
> OBJ_PID records. An additional CONTAINER_ID record would be present if the
> killing process itself had an audit container identifier. I think the most
> obvious way to connect them is with a pid= field in the CONTAINER_ID record.

Using a "pid=" field as a way to link CONTAINER_ID records to other
records raises a few questions. What happens if/when we need to
represent those PIDs in the context of a namespace? Are we ever going
to need to link to records which don't have a "pid=" field? I haven't
done the homework to know if either of these are a concern right now,
but I worry that this might become a problem in the future.

The idea of using something like "item=" is interesting. As you
mention, the "item=" field does present some overlap problems with the
PATH record, but perhaps we can do something similar. What if we
added a "record=" (or similar, I'm not worried about names at this
point) to each record, reset to 0/1 at the start of each event, and
when we needed to link records somehow we could add a "related=1,..,N"
field. This would potentially be useful beyond just the audit
container ID work.

-- 
paul moore
www.paul-moore.com
Re: [PATCH] perf vendor events: Fix DRAM_BW_Use 0 issue for CLX/SKX
On Thu, Oct 22, 2020 at 5:54 PM Jin Yao wrote:
>
> Ian reports an issue that the metric DRAM_BW_Use often remains 0.
>
> The metric expression for DRAM_BW_Use on CLX/SKX:
>
> "( 64 * ( uncore_imc@cas_count_read@ + uncore_imc@cas_count_write@ ) / 1000000000 ) / duration_time"
>
> The counts of uncore_imc/cas_count_read/ and uncore_imc/cas_count_write/
> are scaled up by 64, that is to turn a count of cache lines into bytes,
> the count is then divided by 1000000000 to give GB.
>
> However, the counts of uncore_imc/cas_count_read/ and
> uncore_imc/cas_count_write/ have already been scaled.
>
> The scale values are from sysfs, such as
> /sys/devices/uncore_imc_0/events/cas_count_read.scale.
> It's 6.103515625e-5 (64 / 1024.0 / 1024.0).
>
> So if we use original metric expression, the result is not correct.
>
> But the difficulty is, for SKL client, the counts are not scaled.
>
> The metric expression for DRAM_BW_Use on SKL:
>
> "64 * ( arb@event\\=0x81\\,umask\\=0x1@ + arb@event\\=0x84\\,umask\\=0x1@ ) / 1000000 / duration_time / 1000"
>
> root@kbl-ppc:~# perf stat -M DRAM_BW_Use -a -- sleep 1
>
>  Performance counter stats for 'system wide':
>
>                190      arb/event=0x84,umask=0x1/ #     1.86 DRAM_BW_Use
>         29,093,178      arb/event=0x81,umask=0x1/
>      1,000,703,287 ns   duration_time
>
>        1.000703287 seconds time elapsed
>
> The result is expected.
>
> So the easy way is just change the metric expression for CLX/SKX.
> This patch changes the metric expression to:
>
> "( ( ( uncore_imc@cas_count_read@ + uncore_imc@cas_count_write@ ) * 1048576 ) / 1000000000 ) / duration_time"
>
> 1048576 = 1024 * 1024.
>
> Before (tested on CLX):
>
> root@lkp-csl-2sp5 ~# perf stat -M DRAM_BW_Use -a -- sleep 1
>
>  Performance counter stats for 'system wide':
>
>             765.35 MiB  uncore_imc/cas_count_read/ #     0.00 DRAM_BW_Use
>               5.42 MiB  uncore_imc/cas_count_write/
>         1001515088 ns   duration_time
>
>        1.001515088 seconds time elapsed
>
> After:
>
> root@lkp-csl-2sp5 ~# perf stat -M DRAM_BW_Use -a -- sleep 1
>
>  Performance counter stats for 'system wide':
>
>             767.95 MiB  uncore_imc/cas_count_read/ #     0.80 DRAM_BW_Use

Nit, using ScaleUnit would allow this to be 0.80GB/s.

>               5.02 MiB  uncore_imc/cas_count_write/
>         1001900010 ns   duration_time
>
>        1.001900010 seconds time elapsed
>
> Fixes: 038d3b53c284 ("perf vendor events intel: Update CascadelakeX events to v1.08")
> Fixes: b5ff7f2799a4 ("perf vendor events: Update SkylakeX events to v1.21")
> Signed-off-by: Jin Yao

Acked-by: Ian Rogers

Thanks,
Ian

> ---
>  tools/perf/pmu-events/arch/x86/cascadelakex/clx-metrics.json | 2 +-
>  tools/perf/pmu-events/arch/x86/skylakex/skx-metrics.json     | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/tools/perf/pmu-events/arch/x86/cascadelakex/clx-metrics.json b/tools/perf/pmu-events/arch/x86/cascadelakex/clx-metrics.json
> index de3193552277..00f4fcffa815 100644
> --- a/tools/perf/pmu-events/arch/x86/cascadelakex/clx-metrics.json
> +++ b/tools/perf/pmu-events/arch/x86/cascadelakex/clx-metrics.json
> @@ -329,7 +329,7 @@
>      },
>      {
>          "BriefDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]",
> -        "MetricExpr": "( 64 * ( uncore_imc@cas_count_read@ + uncore_imc@cas_count_write@ ) / 1000000000 ) / duration_time",
> +        "MetricExpr": "( ( ( uncore_imc@cas_count_read@ + uncore_imc@cas_count_write@ ) * 1048576 ) / 1000000000 ) / duration_time",
>          "MetricGroup": "Memory_BW;SoC",
>          "MetricName": "DRAM_BW_Use"
>      },
> diff --git a/tools/perf/pmu-events/arch/x86/skylakex/skx-metrics.json b/tools/perf/pmu-events/arch/x86/skylakex/skx-metrics.json
> index f31794d3b926..0dd8b13b5cfb 100644
> --- a/tools/perf/pmu-events/arch/x86/skylakex/skx-metrics.json
> +++ b/tools/perf/pmu-events/arch/x86/skylakex/skx-metrics.json
> @@ -323,7 +323,7 @@
>      },
>      {
>          "BriefDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]",
> -        "MetricExpr": "( 64 * ( uncore_imc@cas_count_read@ + uncore_imc@cas_count_write@ ) / 1000000000 ) / duration_time",
> +        "MetricExpr": "( ( ( uncore_imc@cas_count_read@ + uncore_imc@cas_count_write@ ) * 1048576 ) / 1000000000 ) / duration_time",
>          "MetricGroup": "Memory_BW;SoC",
>          "MetricName": "DRAM_BW_Use"
>      },
> --
> 2.17.1
>
Re: [PATCH v1 0/2] mm: cma: introduce a non-blocking version of cma_release()
On 22 Oct 2020, at 20:47, Roman Gushchin wrote:

> On Thu, Oct 22, 2020 at 07:42:45PM -0400, Zi Yan wrote:
>> On 22 Oct 2020, at 18:53, Roman Gushchin wrote:
>>
>>> This small patchset introduces a non-blocking version of cma_release()
>>> and simplifies the code in hugetlbfs, where previously we had to
>>> temporarily drop hugetlb_lock around the cma_release() call.
>>>
>>> It should help Zi Yan on his work on 1 GB THPs: splitting a gigantic
>>> THP under a memory pressure requires a cma_release() call. If it's
>>
>> Thanks for the patch. But during 1GB THP split, we only clear
>> the bitmaps without releasing the pages. Also in cma_release_nowait(),
>> the first page in the allocated CMA region is reused to store
>> struct cma_clear_bitmap_work, but the same method cannot be used
>> during THP split, since the first page is still in-use. We might
>> need to allocate some new memory for struct cma_clear_bitmap_work,
>> which might not be successful under memory pressure. Any suggestion
>> on where to store struct cma_clear_bitmap_work when I only want to
>> clear bitmap without releasing the pages?
>
> It means we can't use cma_release() there either, because it does clear
> individual pages. We need to clear the cma bitmap without touching pages.
>
> Can you handle an error there?
>
> If so, we can introduce something like int cma_schedule_bitmap_clearance(),
> which will allocate a work structure and will be able to return -ENOMEM
> in the unlikely case of error.
>
> Will it work for you?

Yes, it works. Thanks.

--
Best Regards,
Yan Zi
[PATCH] perf vendor events: Fix DRAM_BW_Use 0 issue for CLX/SKX
Ian reports an issue that the metric DRAM_BW_Use often remains 0.

The metric expression for DRAM_BW_Use on CLX/SKX:

"( 64 * ( uncore_imc@cas_count_read@ + uncore_imc@cas_count_write@ ) / 1000000000 ) / duration_time"

The counts of uncore_imc/cas_count_read/ and uncore_imc/cas_count_write/
are scaled up by 64, that is to turn a count of cache lines into bytes,
the count is then divided by 1000000000 to give GB.

However, the counts of uncore_imc/cas_count_read/ and
uncore_imc/cas_count_write/ have already been scaled.

The scale values are from sysfs, such as
/sys/devices/uncore_imc_0/events/cas_count_read.scale.
It's 6.103515625e-5 (64 / 1024.0 / 1024.0).

So if we use original metric expression, the result is not correct.

But the difficulty is, for SKL client, the counts are not scaled.

The metric expression for DRAM_BW_Use on SKL:

"64 * ( arb@event\\=0x81\\,umask\\=0x1@ + arb@event\\=0x84\\,umask\\=0x1@ ) / 1000000 / duration_time / 1000"

root@kbl-ppc:~# perf stat -M DRAM_BW_Use -a -- sleep 1

 Performance counter stats for 'system wide':

               190      arb/event=0x84,umask=0x1/ #     1.86 DRAM_BW_Use
        29,093,178      arb/event=0x81,umask=0x1/
     1,000,703,287 ns   duration_time

       1.000703287 seconds time elapsed

The result is expected.

So the easy way is just change the metric expression for CLX/SKX.
This patch changes the metric expression to:

"( ( ( uncore_imc@cas_count_read@ + uncore_imc@cas_count_write@ ) * 1048576 ) / 1000000000 ) / duration_time"

1048576 = 1024 * 1024.

Before (tested on CLX):

root@lkp-csl-2sp5 ~# perf stat -M DRAM_BW_Use -a -- sleep 1

 Performance counter stats for 'system wide':

            765.35 MiB  uncore_imc/cas_count_read/ #     0.00 DRAM_BW_Use
              5.42 MiB  uncore_imc/cas_count_write/
        1001515088 ns   duration_time

       1.001515088 seconds time elapsed

After:

root@lkp-csl-2sp5 ~# perf stat -M DRAM_BW_Use -a -- sleep 1

 Performance counter stats for 'system wide':

            767.95 MiB  uncore_imc/cas_count_read/ #     0.80 DRAM_BW_Use
              5.02 MiB  uncore_imc/cas_count_write/
        1001900010 ns   duration_time

       1.001900010 seconds time elapsed

Fixes: 038d3b53c284 ("perf vendor events intel: Update CascadelakeX events to v1.08")
Fixes: b5ff7f2799a4 ("perf vendor events: Update SkylakeX events to v1.21")
Signed-off-by: Jin Yao
---
 tools/perf/pmu-events/arch/x86/cascadelakex/clx-metrics.json | 2 +-
 tools/perf/pmu-events/arch/x86/skylakex/skx-metrics.json     | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/perf/pmu-events/arch/x86/cascadelakex/clx-metrics.json b/tools/perf/pmu-events/arch/x86/cascadelakex/clx-metrics.json
index de3193552277..00f4fcffa815 100644
--- a/tools/perf/pmu-events/arch/x86/cascadelakex/clx-metrics.json
+++ b/tools/perf/pmu-events/arch/x86/cascadelakex/clx-metrics.json
@@ -329,7 +329,7 @@
     },
     {
         "BriefDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]",
-        "MetricExpr": "( 64 * ( uncore_imc@cas_count_read@ + uncore_imc@cas_count_write@ ) / 1000000000 ) / duration_time",
+        "MetricExpr": "( ( ( uncore_imc@cas_count_read@ + uncore_imc@cas_count_write@ ) * 1048576 ) / 1000000000 ) / duration_time",
         "MetricGroup": "Memory_BW;SoC",
         "MetricName": "DRAM_BW_Use"
     },
diff --git a/tools/perf/pmu-events/arch/x86/skylakex/skx-metrics.json b/tools/perf/pmu-events/arch/x86/skylakex/skx-metrics.json
index f31794d3b926..0dd8b13b5cfb 100644
--- a/tools/perf/pmu-events/arch/x86/skylakex/skx-metrics.json
+++ b/tools/perf/pmu-events/arch/x86/skylakex/skx-metrics.json
@@ -323,7 +323,7 @@
     },
     {
         "BriefDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]",
-        "MetricExpr": "( 64 * ( uncore_imc@cas_count_read@ + uncore_imc@cas_count_write@ ) / 1000000000 ) / duration_time",
+        "MetricExpr": "( ( ( uncore_imc@cas_count_read@ + uncore_imc@cas_count_write@ ) * 1048576 ) / 1000000000 ) / duration_time",
         "MetricGroup": "Memory_BW;SoC",
         "MetricName": "DRAM_BW_Use"
     },
--
2.17.1
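
The arithmetic behind the before/after numbers can be checked with a small
userspace sketch. This is not perf code: the function names are invented
for illustration, and it assumes that the byte-to-GB divisor in both
expressions is 1000000000 and that the counter values arrive already
scaled to MiB, as in the perf stat output above.

```c
#include <assert.h>

/*
 * Original CLX/SKX expression applied to counts that are already in MiB:
 * the extra factor of 64 is far too small to undo the sysfs scaling, so
 * the result rounds to 0.00.
 */
double dram_bw_old(double read_mib, double write_mib, double seconds)
{
	assert(seconds > 0.0);
	return (64.0 * (read_mib + write_mib) / 1000000000.0) / seconds;
}

/* Patched expression: MiB -> bytes (x 1048576), bytes -> GB, per second. */
double dram_bw_new(double read_mib, double write_mib, double seconds)
{
	assert(seconds > 0.0);
	return (((read_mib + write_mib) * 1048576.0) / 1000000000.0) / seconds;
}
```

Feeding in the "Before" sample (765.35 MiB, 5.42 MiB, 1.001515088 s) the
old expression yields roughly 0.00005, which prints as 0.00. Feeding the
"After" sample (767.95 MiB, 5.02 MiB, 1.001900010 s) into the patched
expression yields roughly 0.8 GB/s.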
Re: [PATCH v2 5/5] scsi: ufs: fix clkgating on/off correctly
On 10/21, Can Guo wrote:
> On 2020-10-21 12:52, jaeg...@kernel.org wrote:
> > On 10/21, Can Guo wrote:
> > > On 2020-10-21 03:52, Jaegeuk Kim wrote:
> > > > The below call stack prevents clk_gating at every IO completion.
> > > > We can remove the condition, ufshcd_any_tag_in_use(), since
> > > > clkgating_work will check it again.
> > > >
> > >
> > > I think checking ufshcd_any_tag_in_use() in either ufshcd_release() or
> > > gate_work() can break UFS clk gating's functionality.
> > >
> > > ufshcd_any_tag_in_use() was introduced to replace hba->lrb_in_use.
> > > However, they are not exactly same - ufshcd_any_tag_in_use() returns
> > > true if any tag assigned from block layer is still in use, but tags
> > > are released asynchronously (through block softirq), meaning it does
> > > not reflect the real occupation of UFS host.
> > > That is after UFS host finishes all tasks, ufshcd_any_tag_in_use()
> > > can still return true.
> > >
> > > This change only removes the check of ufshcd_any_tag_in_use() in
> > > ufshcd_release(), but having the check of it in gate_work() can still
> > > prevent gating from happening.
> > > The current change works for you maybe because the tags are release
> > > before hba->clk_gating.delay_ms expires, but if
> > > hba->clk_gating.delay_ms is shorter or somehow block softirq is
> > > retarded, gate_work() may have chance to see ufshcd_any_tag_in_use()
> > > returns true. What do you think?
> >
> > I don't think this breaks clkgating, but fix the wrong condition check
> > which prevented gate_work at all. As you mentioned, even if this
> > schedules gate_work by racy conditions, gate_work will handle it as a
> > last resort.
> >
>
> If clocks cannot be gated after the last task is cleared from UFS host,
> then clk gating is broken, no? Assume UFS has completed the last task in
> its queue, as this change says, ufshcd_any_tag_in_use() is preventing
> ufshcd_release() from invoking gate_work().
> Similarly, ufshcd_any_tag_in_use() can prevent gate_work() from doing its
> real work - disabling the clocks. Do you agree?
>
>         if (hba->clk_gating.active_reqs
>                 || hba->ufshcd_state != UFSHCD_STATE_OPERATIONAL
>                 || ufshcd_any_tag_in_use(hba) || hba->outstanding_tasks
>                 || hba->active_uic_cmd || hba->uic_async_done)
>                 goto rel_lock;

I see the point, but this happens only when clkgate_delay_ms is too short
to give enough time for releasing tag. If it's correctly set, I think
there'd be no problem, unless softirq was delayed by other RT threads which
is just a corner case tho.

>
> Thanks,
>
> Can Guo.
>
> > >
> > > Thanks,
> > >
> > > Can Guo.
> > >
> > > In __ufshcd_transfer_req_compl
> > > Ihba->lrb_in_use is cleared immediately when UFS driver finishes all
> > > tasks
> > >
> > > > ufshcd_complete_requests(struct ufs_hba *hba)
> > > >   ufshcd_transfer_req_compl()
> > > >     __ufshcd_transfer_req_compl()
> > > >       __ufshcd_release(hba)
> > > >         if (ufshcd_any_tag_in_use() == 1)
> > > >            return;
> > > >   ufshcd_tmc_handler(hba);
> > > >     blk_mq_tagset_busy_iter();
> > > >
> > > > Cc: Alim Akhtar
> > > > Cc: Avri Altman
> > > > Cc: Can Guo
> > > > Signed-off-by: Jaegeuk Kim
> > > > ---
> > > >  drivers/scsi/ufs/ufshcd.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
> > > > index b5ca0effe636..cecbd4ace8b4 100644
> > > > --- a/drivers/scsi/ufs/ufshcd.c
> > > > +++ b/drivers/scsi/ufs/ufshcd.c
> > > > @@ -1746,7 +1746,7 @@ static void __ufshcd_release(struct ufs_hba *hba)
> > > >
> > > >         if (hba->clk_gating.active_reqs || hba->clk_gating.is_suspended ||
> > > >             hba->ufshcd_state != UFSHCD_STATE_OPERATIONAL ||
> > > > -           ufshcd_any_tag_in_use(hba) || hba->outstanding_tasks ||
> > > > +           hba->outstanding_tasks ||
> > > >             hba->active_uic_cmd || hba->uic_async_done)
> > > >                 return;
Re: [PATCH v1 0/2] mm: cma: introduce a non-blocking version of cma_release()
On Thu, Oct 22, 2020 at 07:42:45PM -0400, Zi Yan wrote:
> On 22 Oct 2020, at 18:53, Roman Gushchin wrote:
>
> > This small patchset introduces a non-blocking version of cma_release()
> > and simplifies the code in hugetlbfs, where previously we had to
> > temporarily drop hugetlb_lock around the cma_release() call.
> >
> > It should help Zi Yan on his work on 1 GB THPs: splitting a gigantic
> > THP under a memory pressure requires a cma_release() call. If it's
>
> Thanks for the patch. But during 1GB THP split, we only clear
> the bitmaps without releasing the pages. Also in cma_release_nowait(),
> the first page in the allocated CMA region is reused to store
> struct cma_clear_bitmap_work, but the same method cannot be used
> during THP split, since the first page is still in-use. We might
> need to allocate some new memory for struct cma_clear_bitmap_work,
> which might not be successful under memory pressure. Any suggestion
> on where to store struct cma_clear_bitmap_work when I only want to
> clear bitmap without releasing the pages?

It means we can't use cma_release() there either, because it does clear
individual pages. We need to clear the cma bitmap without touching pages.

Can you handle an error there?

If so, we can introduce something like int cma_schedule_bitmap_clearance(),
which will allocate a work structure and will be able to return -ENOMEM
in the unlikely case of error.

Will it work for you?

Thanks!
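
The shape of the proposal (allocate a separate work descriptor, report
-ENOMEM if that fails, and later clear only the bitmap) can be sketched in
userspace C. Everything here is a hypothetical stand-in: struct clear_work,
schedule_bitmap_clearance() and run_pending_clearance() are invented for
this example, malloc() stands in for a kernel allocation, and a plain list
stands in for a workqueue. It is not the CMA implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct cma_clear_bitmap_work. */
struct clear_work {
	unsigned long start;		/* first bit to clear */
	unsigned long count;		/* number of bits to clear */
	struct clear_work *next;
};

static struct clear_work *pending;	/* LIFO list of deferred work */

/*
 * Queue a deferred clear. The descriptor is allocated separately instead
 * of being stored in the first page of the region, so the call can fail.
 */
int schedule_bitmap_clearance(unsigned long start, unsigned long count)
{
	struct clear_work *w = malloc(sizeof(*w));

	if (!w)
		return -ENOMEM;
	w->start = start;
	w->count = count;
	w->next = pending;
	pending = w;
	return 0;
}

/* Run all queued work: only bitmap bits are cleared, no pages are touched. */
void run_pending_clearance(unsigned char *bitmap)
{
	assert(bitmap != NULL);
	while (pending) {
		struct clear_work *w = pending;
		unsigned long bit;

		pending = w->next;
		for (bit = w->start; bit < w->start + w->count; bit++)
			bitmap[bit / 8] &= (unsigned char)~(1u << (bit % 8));
		free(w);
	}
}
```

The caller has to be able to cope with the -ENOMEM return, which is exactly
the question raised above.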
Re: [PATCH v17 1/4] Add flags option to get xattr method paired to __vfs_getxattr
On Wed, Oct 21, 2020 at 8:07 AM Mark Salyzyn wrote: > On 10/20/20 6:17 PM, Paul Moore wrote: > > On Tue, Oct 20, 2020 at 3:17 PM Mark Salyzyn wrote: > >> Add a flag option to get xattr method that could have a bit flag of > >> XATTR_NOSECURITY passed to it. XATTR_NOSECURITY is generally then > >> set in the __vfs_getxattr path when called by security > >> infrastructure. > >> > >> This handles the case of a union filesystem driver that is being > >> requested by the security layer to report back the xattr data. > >> > >> For the use case where access is to be blocked by the security layer. > >> > >> The path then could be security(dentry) -> > >> __vfs_getxattr(dentry...XATTR_NOSECURITY) -> > >> handler->get(dentry...XATTR_NOSECURITY) -> > >> __vfs_getxattr(lower_dentry...XATTR_NOSECURITY) -> > >> lower_handler->get(lower_dentry...XATTR_NOSECURITY) > >> which would report back through the chain data and success as > >> expected, the logging security layer at the top would have the > >> data to determine the access permissions and report back the target > >> context that was blocked. > >> > >> Without the get handler flag, the path on a union filesystem would be > >> the errant security(dentry) -> __vfs_getxattr(dentry) -> > >> handler->get(dentry) -> vfs_getxattr(lower_dentry) -> nested -> > >> security(lower_dentry, log off) -> lower_handler->get(lower_dentry) > >> which would report back through the chain no data, and -EACCES. > >> > >> For selinux for both cases, this would translate to a correctly > >> determined blocked access. In the first case with this change a correct avc > >> log would be reported, in the second legacy case an incorrect avc log > >> would be reported against an uninitialized u:object_r:unlabeled:s0 > >> context making the logs cosmetically useless for audit2allow. 
> >> > >> This patch series is inert and is the wide-spread addition of the > >> flags option for xattr functions, and a replacement of __vfs_getxattr > >> with __vfs_getxattr(...XATTR_NOSECURITY). > >> > >> Signed-off-by: Mark Salyzyn > >> Reviewed-by: Jan Kara > >> Acked-by: Jan Kara > >> Acked-by: Jeff Layton > >> Acked-by: David Sterba > >> Acked-by: Darrick J. Wong > >> Acked-by: Mike Marshall > >> To: linux-fsde...@vger.kernel.org > >> To: linux-unio...@vger.kernel.org > >> Cc: Stephen Smalley > >> Cc: linux-kernel@vger.kernel.org > >> Cc: linux-security-mod...@vger.kernel.org > >> Cc: kernel-t...@android.com > > ... > > > >> > > [NOTE: added the SELinux list to the CC line] > > > Thanks and > > > > > I'm looking at this patchset in earnest for the first time and I'm a > > little uncertain about the need for the new XATTR_NOSECURITY flag; > > perhaps you can help me understand it better. Looking over this > > patch, and quickly looking at the others in the series, it seems as > > though XATTR_NOSECURITY is basically used whenever a filesystem has to > > call back into the vfs layer (e.g. overlayfs, ecryptfs, etc). Am I > > understanding that correctly? If that assumption is correct, I'm not > > certain why the new XATTR_NOSECURITY flag is needed; why couldn't > > _vfs_getxattr() be used by all of the callers that need to bypass > > DAC/MAC with vfs_getxattr() continuing to perform the DAC/MAC checks? > > If for some reason _vfs_getxattr() can't be used, would it make more > > sense to create a new stripped/special getxattr function for use by > > nested filesystems? Based on the number of revisions to this > > patchset, I'm sure it can't be that simple so please educate me :) > > > It is hard to please everyone :-} > > Yes, calling back through the vfs layer. > > I was told not to change or remove the __vfs_getxattr default behaviour, > but use the flag to pass through the new behavior. 
Security concerns > requiring the _key_ of the flag to be passed through rather than a > blanket bypass. This was also the similar security reasoning not to have > a special getxattr call. > > [TL;DR] > > history and details > > When it goes down through the layers again, and into the underlying > filesystems, to get the getxattr, the xattributes are blocked, then the > selinux _context_ will not be copied into the buffer leaving the caller > looking at effectively u:r:unknown:s0. Well, they were blocked, so from > the security standpoint that part was accurate, but the evaluation of > the context is using the wrong rules and an (cosmetically) incorrect avc > report. This also poisons the cache layers that may hold on to the > context for future calls (+/- bugs) disturbing the future decisions (we > saw that in 4.14 and earlier vintage kernels without this patch, later > kernels appeared to clear up the cache bug). > > The XATTR_NOSECURITY is used in the overlayfs driver for a substantial > majority of the calls for getxattr only if the data is private (ie: on > the stack, not returned to the caller) as simplification. A _real_ > getxattr is performed when the data is returned to the caller. I expect > that
Re: [PATCH net RFC] net: Clear IFF_TX_SKB_SHARING for all Ethernet devices using skb_padto
On Thu, 22 Oct 2020 12:59:45 -0700 Xie He wrote:
> On Thu, Oct 22, 2020 at 8:22 AM Jakub Kicinski wrote:
> >
> > Are most of these drivers using skb_padto()? Is that the reason they
> > can't be sharing the SKB?
>
> Yes, I think if a driver calls skb_pad / skb_padto / skb_put_padto /
> eth_skb_pad, the driver can't accept shared skbs because it may modify
> the skbs.
>
> > I think the IFF_TX_SKB_SHARING flag is only used by pktgen, so perhaps
> > we can make sure pktgen doesn't generate skbs < dev->min_mtu, and then
> > the drivers won't pad?
>
> Yes, I see a lot of drivers just want to pad the skb to ETH_ZLEN, or
> just call eth_skb_pad. In this case, requiring the shared skb to be at
> least dev->min_mtu long can solve the problem for these drivers.
>
> But I also see some drivers that want to pad the skb to a strange
> length, and don't set their special min_mtu to match this length. For
> example:
>
> drivers/net/ethernet/packetengines/yellowfin.c wants to pad the skb to
> a dynamically calculated value.
>
> drivers/net/ethernet/ti/cpsw.c, cpsw_new.c and tlan.c want to pad the
> skb to macro defined values.
>
> drivers/net/ethernet/intel/iavf/iavf_txrx.c wants to pad the skb to
> IAVF_MIN_TX_LEN (17).
>
> drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c wants to pad the skb to 17.

Hm, I see, that would be a slight loss of functionality if we started
requiring 64B, for example, while the driver could in practice xmit 17B
frames (would matter only to VFs, but nonetheless).

> Another solution I can think of is to add a "skb_shared" check to
> "__skb_pad", so that if __skb_pad encounters a shared skb, it just
> returns an error. The driver would think this is a memory allocation
> failure. This way we can ensure shared skbs are not modified.

I'm not sure if we want to be adding checks to __skb_pad() to handle
what's effectively a pktgen specific condition.

We could create a new field in struct net_device for min_frame_len, but I
think your patch is the simplest solution. Let's see if anyone objects.

BTW it seems like there are more drivers which will need the flag
cleared, e.g. drivers/net/ethernet/broadcom/bnxt/bnxt.c?
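As a rough user-space illustration of the two options discussed in this mail, the sketch below models a buffer with a reference count and a pad helper that refuses to modify a shared buffer. struct model_skb and both helpers are hypothetical stand-ins, not the kernel's struct sk_buff, skb_shared() or skb_put_padto(). The -ENOMEM return is chosen only because the proposal says the driver would see the failure as an allocation failure.

```c
#include <errno.h>
#include <string.h>

#define MODEL_SKB_CAP 64

/* Hypothetical stand-in for an skb; not the kernel structure. */
struct model_skb {
	int users;			/* reference count; > 1 means shared */
	unsigned int len;		/* bytes of valid data */
	unsigned char data[MODEL_SKB_CAP];
};

static int model_skb_shared(const struct model_skb *skb)
{
	return skb->users != 1;
}

/*
 * Pad the buffer with zeros up to len and extend its length.
 * A buffer that is already long enough is left untouched, shared or not.
 * A shared buffer that would need padding is rejected, never modified.
 */
static int model_skb_put_padto(struct model_skb *skb, unsigned int len)
{
	if (skb->len >= len)
		return 0;		/* nothing to modify */
	if (len > MODEL_SKB_CAP)
		return -ENOMEM;		/* no room in this fixed-size model */
	if (model_skb_shared(skb))
		return -ENOMEM;		/* looks like an allocation failure */

	memset(skb->data + skb->len, 0, len - skb->len);
	skb->len = len;
	return 0;
}
```

The first early return is the case the pktgen-side idea relies on: if the generator never produces a frame shorter than what the driver pads to, the pad helper never needs to write to a shared buffer. The drivers that pad to 17 bytes are why a single global minimum is awkward to pick.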
Re: [PATCH v4] mm: memcg/slab: Stop reparented obj_cgroups from charging root
On Thu, Oct 22, 2020 at 04:59:56PM -0700, Shakeel Butt wrote: > On Thu, Oct 22, 2020 at 10:25 AM Roman Gushchin wrote: > > > [snip] > > > > > > Since bf4f059954dc ("mm: memcg/slab: obj_cgroup API") is in 5.9, I > > > think we can take this patch for 5.9 and 5.10 but keep Roman's cleanup > > > for 5.11. > > > > > > What does everyone think? > > > > I think we should use the link to the root approach both for stable > > backports > > and for 5.11+, to keep them in sync. The cleanup (always charging the root > > cgroup) > > is not directly related to this problem, and we can keep it for 5.11+ only. > > > > Thanks! > > Roman, can you send the signed-off patch for the root linking for > use_hierarchy=0? Sure, here we are. Thanks! -- >From 19d66695f0ef1bf1ef7c51073ab91d67daa91362 Mon Sep 17 00:00:00 2001 From: Roman Gushchin Date: Thu, 22 Oct 2020 17:12:32 -0700 Subject: [PATCH] mm: memcg: link page counters to root if use_hierarchy is false Richard reported a warning which can be reproduced by running the LTP madvise6 test (cgroup v1 in the non-hierarchical mode should be used): [9.841552] [ cut here ] [9.841788] WARNING: CPU: 0 PID: 12 at mm/page_counter.c:57 page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156) [9.841982] Modules linked in: [9.842072] CPU: 0 PID: 12 Comm: kworker/0:1 Not tainted 5.9.0-rc7-22-default #77 [9.842266] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812d-rebuilt.opensuse.org 04/01/2014 [9.842571] Workqueue: events drain_local_stock [9.842750] RIP: 0010:page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156) [ 9.842894] Code: 0f c1 45 00 4c 29 e0 48 89 ef 48 89 c3 48 89 c6 e8 2a fe ff ff 48 85 db 78 10 48 8b 6d 28 48 85 ed 75 d8 5b 5d 41 5c 41 5d c3 <0f> 0b eb ec 90 e8 4b f9 88 2a 48 8b 17 48 39 d6 72 41 41 54 49 89 [9.843438] RSP: 0018:b1c18006be28 EFLAGS: 00010086 [9.843585] RAX: RBX: RCX: 94803bc2cae0 [9.843806] RDX: 0001 RSI: RDI: 
948007d2b248 [9.844026] RBP: 948007d2b248 R08: 948007c58eb0 R09: 948007da05ac [9.844248] R10: 0018 R11: 0018 R12: 0001 [9.844477] R13: R14: R15: 94803bc2cac0 [9.844696] FS: () GS:94803bc0() knlGS: [9.844915] CS: 0010 DS: ES: CR0: 80050033 [9.845096] CR2: 7f0579ee0384 CR3: 2cc0a000 CR4: 06f0 [9.845319] Call Trace: [9.845429] __memcg_kmem_uncharge (mm/memcontrol.c:3022) [9.845582] drain_obj_stock (./include/linux/rcupdate.h:689 mm/memcontrol.c:3114) [9.845684] drain_local_stock (mm/memcontrol.c:2255) [9.845789] process_one_work (./arch/x86/include/asm/jump_label.h:25 ./include/linux/jump_label.h:200 ./include/trace/events/workqueue.h:108 kernel/workqueue.c:2274) [9.845898] worker_thread (./include/linux/list.h:282 kernel/workqueue.c:2416) [9.846034] ? process_one_work (kernel/workqueue.c:2358) [9.846162] kthread (kernel/kthread.c:292) [9.846271] ? __kthread_bind_mask (kernel/kthread.c:245) [9.846420] ret_from_fork (arch/x86/entry/entry_64.S:300) [9.846531] ---[ end trace 8b5647c1eba9d18a ]--- The problem occurs because in the non-hierarchical mode non-root page counters are not linked to root page counters, so the charge is not propagated to the root memory cgroup. After the removal of the original memory cgroup and reparenting of the object cgroup, the root cgroup might be uncharged by draining a objcg stock, for example. It leads to an eventual underflow of the charge and triggers a warning. Fix it by linking all page counters to corresponding root page counters in the non-hierarchical mode. The patch doesn't affect how the hierarchical mode is working, which is the only sane and truly supported mode now. Thanks to Richard for reporting, debugging and providing an alternative version of the fix! 
Reported-by: l...@lists.linux.it
Debugged-by: Richard Palethorpe
Fixes: bf4f059954dc ("mm: memcg/slab: obj_cgroup API")
Signed-off-by: Roman Gushchin
Cc: sta...@vger.kernel.org
---
 mm/memcontrol.c | 15 ++-
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2636f8bad908..009297017c87 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5339,17 +5339,22 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 		memcg->swappiness = mem_cgroup_swappiness(parent);
 		memcg->oom_kill_disable = parent->oom_kill_disable;
 	}
-	if (parent && parent->use_hierarchy) {
+	if (!parent) {
+		page_counter_init(&memcg->memory, NULL);
+		page_counter_init(&memcg->swap, NULL);
+		page_counter_init(&memcg->kmem, NULL);
+
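The underflow described in the commit message can be reproduced in a small user-space model. This is only a sketch: struct pc_model and the pc_* helpers are hypothetical and far simpler than the kernel's struct page_counter, and the reparenting step is reduced to "the later uncharge targets the root counter".

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for a hierarchical page counter. */
struct pc_model {
	long usage;
	struct pc_model *parent;
	int underflowed;	/* set where the kernel would WARN */
};

static void pc_init(struct pc_model *c, struct pc_model *parent)
{
	c->usage = 0;
	c->parent = parent;
	c->underflowed = 0;
}

/* Charge this counter and every ancestor it is linked to. */
static void pc_charge(struct pc_model *c, long nr)
{
	for (; c; c = c->parent)
		c->usage += nr;
}

/* Uncharge this counter and every ancestor, flagging underflow. */
static void pc_uncharge(struct pc_model *c, long nr)
{
	for (; c; c = c->parent) {
		c->usage -= nr;
		if (c->usage < 0)
			c->underflowed = 1;
	}
}

/* Returns 1 if the root counter underflows in the reparenting scenario. */
static int root_underflows(int link_child_to_root)
{
	struct pc_model root, child;

	pc_init(&root, NULL);
	pc_init(&child, link_child_to_root ? &root : NULL);

	/* An object is charged while the child cgroup is alive. */
	pc_charge(&child, 1);

	/*
	 * The child cgroup is removed and its object cgroup is reparented,
	 * so the eventual uncharge is applied to the root counter.
	 */
	pc_uncharge(&root, 1);

	return root.underflowed;
}
```

With the child unlinked (the old non-hierarchical behaviour) the charge never reaches the root, so the later uncharge drives the root below zero. With the child linked to the root, as the fix does, the two operations balance.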
[PATCH] x86/mm/KASLR: Account for minimum padding when calculating entropy
Subtract the minimum padding between regions from the initial
remain_entropy. Without this, the last region could potentially
overflow past vaddr_end if we happen to get a specific sequence
of random numbers (although extremely unlikely in practice).
The bug can be demonstrated by replacing the prandom_bytes_state
call with "rand = entropy;"

Signed-off-by: Junaid Shahid
---
 arch/x86/mm/kaslr.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/kaslr.c b/arch/x86/mm/kaslr.c
index 6e6b39710e5f..fe3eec30f736 100644
--- a/arch/x86/mm/kaslr.c
+++ b/arch/x86/mm/kaslr.c
@@ -109,7 +109,8 @@ void __init kernel_randomize_memory(void)
 	kaslr_regions[2].size_tb = DIV_ROUND_UP(vmemmap_size, 1UL << TB_SHIFT);
 
 	/* Calculate entropy available between regions */
-	remain_entropy = vaddr_end - vaddr_start;
+	remain_entropy = vaddr_end - vaddr_start -
+			 (ARRAY_SIZE(kaslr_regions) - 1) * PUD_SIZE;
 	for (i = 0; i < ARRAY_SIZE(kaslr_regions); i++)
 		remain_entropy -= get_padding(&kaslr_regions[i]);
 
-- 
2.29.0.rc2.309.g374f81d7ae-goog
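The worst case mentioned in the changelog ("rand = entropy") can be worked through with a user-space model of the placement loop. The sketch below is an approximation written for this note, not the kernel function: worst_case_end() and round_up_pud() are hypothetical names, the address range and region sizes in the usage note are made up, and the loop body is a paraphrase of how kernel_randomize_memory() is understood to advance vaddr (random offset, then the region, then rounding up to the next PUD boundary).

```c
#include <stdint.h>

#define TB_SHIFT	40
#define PUD_SHIFT	30
#define PUD_SIZE	(UINT64_C(1) << PUD_SHIFT)
#define PUD_MASK	(~(PUD_SIZE - 1))
#define NR_REGIONS	3

/* Round x up to a multiple of PUD_SIZE (x must be non-zero). */
static uint64_t round_up_pud(uint64_t x)
{
	return ((x - 1) | (PUD_SIZE - 1)) + 1;
}

/*
 * Model of the region placement loop with every random draw replaced by
 * its maximum. Returns the end address of the last region. When
 * account_for_padding is non-zero, the inter-region padding is subtracted
 * from the entropy budget up front, as in the patch.
 */
static uint64_t worst_case_end(uint64_t vaddr_start, uint64_t vaddr_end,
			       const uint64_t size_tb[NR_REGIONS],
			       int account_for_padding)
{
	uint64_t remain_entropy = vaddr_end - vaddr_start;
	uint64_t vaddr = vaddr_start;
	uint64_t end = vaddr_start;
	int i;

	if (account_for_padding)
		remain_entropy -= (NR_REGIONS - 1) * PUD_SIZE;
	for (i = 0; i < NR_REGIONS; i++)
		remain_entropy -= size_tb[i] << TB_SHIFT;

	for (i = 0; i < NR_REGIONS; i++) {
		uint64_t entropy = remain_entropy / (NR_REGIONS - i);
		uint64_t rand = entropy;	/* worst case, no PRNG */

		entropy = (rand % (entropy + 1)) & PUD_MASK;
		vaddr += entropy;

		/* Skip the region, then pad to the next PUD boundary. */
		end = vaddr + (size_tb[i] << TB_SHIFT);
		vaddr = round_up_pud(end + 1);
		remain_entropy -= entropy;
	}
	return end;
}
```

With a 16 TB range starting at 0 and region sizes of 4, 2 and 1 TB, this model places the end of the last region 2 * PUD_SIZE past vaddr_end when the padding is not accounted for, and exactly at vaddr_end when it is. The overshoot equals the two roundings between the three regions, which is the (ARRAY_SIZE(kaslr_regions) - 1) * PUD_SIZE term the patch subtracts.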
[PATCH v3 03/10] ASoC: SOF: Create client driver for IPC test
From: Ranjani Sridharan Create an SOF client driver for IPC flood test. This driver is used to set up the debugfs entries and the read/write ops for initiating the IPC flood test that would be used to measure the min/max/avg response times for sending IPCs to the DSP. The debugfs ops definitions in the driver is existing code that has been copied from the core. These will be removed from the SOF core making is less monolithic and easier to maintain. Reviewed-by: Pierre-Louis Bossart Signed-off-by: Ranjani Sridharan Co-developed-by: Fred Oh Signed-off-by: Fred Oh Signed-off-by: Dave Ertman --- sound/soc/sof/Kconfig | 10 + sound/soc/sof/Makefile | 4 + sound/soc/sof/sof-ipc-test-client.c | 321 3 files changed, 335 insertions(+) create mode 100644 sound/soc/sof/sof-ipc-test-client.c diff --git a/sound/soc/sof/Kconfig b/sound/soc/sof/Kconfig index 31e9911098fc..13bde36cc5d7 100644 --- a/sound/soc/sof/Kconfig +++ b/sound/soc/sof/Kconfig @@ -190,6 +190,16 @@ config SND_SOC_SOF_DEBUG_IPC_FLOOD_TEST Say Y if you want to enable IPC flood test. If unsure, select "N". +config SND_SOC_SOF_DEBUG_IPC_FLOOD_TEST_CLIENT + tristate "SOF enable IPC flood test client" + depends on SND_SOC_SOF_CLIENT + help + This option enables a separate client device for IPC flood test + which can be used to flood the DSP with test IPCs and gather stats + about response times. + Say Y if you want to enable IPC flood test. + If unsure, select "N". 
+ config SND_SOC_SOF_DEBUG_RETAIN_DSP_CONTEXT bool "SOF retain DSP context on any FW exceptions" help diff --git a/sound/soc/sof/Makefile b/sound/soc/sof/Makefile index 5e46f25a3851..baa93fe2cc9a 100644 --- a/sound/soc/sof/Makefile +++ b/sound/soc/sof/Makefile @@ -9,6 +9,8 @@ snd-sof-pci-objs := sof-pci-dev.o snd-sof-acpi-objs := sof-acpi-dev.o snd-sof-of-objs := sof-of-dev.o +snd-sof-ipc-test-objs := sof-ipc-test-client.o + snd-sof-nocodec-objs := nocodec.o obj-$(CONFIG_SND_SOC_SOF) += snd-sof.o @@ -21,6 +23,8 @@ obj-$(CONFIG_SND_SOC_SOF_PCI) += snd-sof-pci.o obj-$(CONFIG_SND_SOC_SOF_CLIENT) += snd-sof-client.o +obj-$(CONFIG_SND_SOC_SOF_DEBUG_IPC_FLOOD_TEST_CLIENT) += snd-sof-ipc-test.o + obj-$(CONFIG_SND_SOC_SOF_INTEL_TOPLEVEL) += intel/ obj-$(CONFIG_SND_SOC_SOF_IMX_TOPLEVEL) += imx/ obj-$(CONFIG_SND_SOC_SOF_XTENSA) += xtensa/ diff --git a/sound/soc/sof/sof-ipc-test-client.c b/sound/soc/sof/sof-ipc-test-client.c new file mode 100644 index ..b4d803b9139b --- /dev/null +++ b/sound/soc/sof/sof-ipc-test-client.c @@ -0,0 +1,321 @@ +// SPDX-License-Identifier: GPL-2.0-only +// +// Copyright(c) 2020 Intel Corporation. All rights reserved. +// +// Author: Ranjani Sridharan +// + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include "sof-client.h" + +#define MAX_IPC_FLOOD_DURATION_MS 1000 +#define MAX_IPC_FLOOD_COUNT 1 +#define IPC_FLOOD_TEST_RESULT_LEN 512 +#define SOF_IPC_CLIENT_SUSPEND_DELAY_MS 3000 + +struct sof_ipc_client_data { + struct dentry *dfs_root; + char *buf; +}; + +/* + * helper function to perform the flood test. 
Only one of the two params, ipc_duration_ms + * or ipc_count, will be non-zero and will determine the type of test + */ +static int sof_debug_ipc_flood_test(struct sof_client_dev *cdev, unsigned long ipc_duration_ms, + unsigned long ipc_count) +{ + struct sof_ipc_client_data *ipc_client_data = cdev->data; + struct device *dev = >auxdev.dev; + struct sof_ipc_cmd_hdr hdr; + struct sof_ipc_reply reply; + u64 min_response_time = U64_MAX; + u64 avg_response_time = 0; + u64 max_response_time = 0; + ktime_t cur; + ktime_t test_end; + int i = 0; + int ret = 0; + bool end_test = false; + + /* configure test IPC */ + hdr.cmd = SOF_IPC_GLB_TEST_MSG | SOF_IPC_TEST_IPC_FLOOD; + hdr.size = sizeof(hdr); + + /* set test end time for duration flood test */ + test_end = ktime_get_ns() + ipc_duration_ms * NSEC_PER_MSEC; + + /* send test IPC's */ + do { + ktime_t start; + u64 ipc_response_time; + + start = ktime_get(); + ret = sof_client_ipc_tx_message(cdev, hdr.cmd, , hdr.size, , + sizeof(reply)); + if (ret < 0) + break; + cur = ktime_get(); + + i++; + + /* compute min and max response times */ + ipc_response_time = ktime_to_ns(ktime_sub(cur, start)); + min_response_time = min(min_response_time, ipc_response_time); + max_response_time = max(max_response_time, ipc_response_time); + + /* sum up response times */ +
[PATCH v3 08/10] ASoC: SOF: compress: move and export sof_probe_compr_ops
From: Ranjani Sridharan

sof_probe_compr_ops are not platform-specific. So move
it to common compress code and export the symbol.
The compilation of the common compress code is already
dependent on the selection of CONFIG_SND_SOC_SOF_DEBUG_PROBES,
so no need to check the Kconfig section for defining
sof_probe_compr_ops again.

Reviewed-by: Pierre-Louis Bossart
Tested-by: Fred Oh
Signed-off-by: Ranjani Sridharan
Signed-off-by: Dave Ertman
---
 sound/soc/sof/compress.c      |  9 +
 sound/soc/sof/compress.h      |  1 +
 sound/soc/sof/intel/hda-dai.c | 12 
 3 files changed, 10 insertions(+), 12 deletions(-)

diff --git a/sound/soc/sof/compress.c b/sound/soc/sof/compress.c
index 2d4969c705a4..0443f171b4e7 100644
--- a/sound/soc/sof/compress.c
+++ b/sound/soc/sof/compress.c
@@ -13,6 +13,15 @@
 #include "ops.h"
 #include "probe.h"
 
+struct snd_soc_cdai_ops sof_probe_compr_ops = {
+	.startup	= sof_probe_compr_open,
+	.shutdown	= sof_probe_compr_free,
+	.set_params	= sof_probe_compr_set_params,
+	.trigger	= sof_probe_compr_trigger,
+	.pointer	= sof_probe_compr_pointer,
+};
+EXPORT_SYMBOL(sof_probe_compr_ops);
+
 struct snd_compress_ops sof_probe_compressed_ops = {
 	.copy		= sof_probe_compr_copy,
 };
diff --git a/sound/soc/sof/compress.h b/sound/soc/sof/compress.h
index ca8790bd4b13..689c83ac8ffc 100644
--- a/sound/soc/sof/compress.h
+++ b/sound/soc/sof/compress.h
@@ -13,6 +13,7 @@
 
 #include
 
+extern struct snd_soc_cdai_ops sof_probe_compr_ops;
 extern struct snd_compress_ops sof_probe_compressed_ops;
 
 int sof_probe_compr_open(struct snd_compr_stream *cstream,
diff --git a/sound/soc/sof/intel/hda-dai.c b/sound/soc/sof/intel/hda-dai.c
index c6cb8c212eca..1acec1176986 100644
--- a/sound/soc/sof/intel/hda-dai.c
+++ b/sound/soc/sof/intel/hda-dai.c
@@ -400,18 +400,6 @@ static const struct snd_soc_dai_ops hda_link_dai_ops = {
 	.prepare = hda_link_pcm_prepare,
 };
 
-#if IS_ENABLED(CONFIG_SND_SOC_SOF_HDA_PROBES)
-#include "../compress.h"
-
-static struct snd_soc_cdai_ops sof_probe_compr_ops = {
-	.startup	= sof_probe_compr_open,
-	.shutdown	= sof_probe_compr_free,
-	.set_params	= sof_probe_compr_set_params,
-	.trigger	= sof_probe_compr_trigger,
-	.pointer	= sof_probe_compr_pointer,
-};
-
-#endif
 #endif
 
 /*
-- 
2.26.2
[PATCH v3 02/10] ASoC: SOF: Introduce descriptors for SOF client
From: Ranjani Sridharan A client in the SOF (Sound Open Firmware) context is a device that needs to communicate with the DSP via IPC messages. The SOF core is responsible for serializing the IPC messages to the DSP from the different clients. One example of an SOF client would be an IPC test client that floods the DSP with test IPC messages to validate if the serialization works as expected. Multi-client support will also add the ability to split the existing audio cards into multiple ones, so as to e.g. to deal with HDMI with a dedicated client instead of adding HDMI to all cards. This patch introduces descriptors for SOF client driver and SOF client device along with APIs for registering and unregistering a SOF client driver, sending IPCs from a client device and accessing the SOF core debugfs root entry. Along with this, add a couple of new members to struct snd_sof_dev that will be used for maintaining the list of clients. Reviewed-by: Pierre-Louis Bossart Signed-off-by: Ranjani Sridharan Co-developed-by: Fred Oh Signed-off-by: Fred Oh Signed-off-by: Dave Ertman --- sound/soc/sof/Kconfig | 19 ++ sound/soc/sof/Makefile | 3 + sound/soc/sof/core.c | 2 + sound/soc/sof/sof-client.c | 115 + sound/soc/sof/sof-client.h | 66 + sound/soc/sof/sof-priv.h | 9 +++ 6 files changed, 214 insertions(+) create mode 100644 sound/soc/sof/sof-client.c create mode 100644 sound/soc/sof/sof-client.h diff --git a/sound/soc/sof/Kconfig b/sound/soc/sof/Kconfig index 8c1f0829de40..31e9911098fc 100644 --- a/sound/soc/sof/Kconfig +++ b/sound/soc/sof/Kconfig @@ -50,6 +50,24 @@ config SND_SOC_SOF_DEBUG_PROBES Say Y if you want to enable probes. If unsure, select "N". +config SND_SOC_SOF_CLIENT + tristate + select AUXILIARY_BUS + help + This option is not user-selectable but automagically handled by + 'select' statements at a higher level. 
+ +config SND_SOC_SOF_CLIENT_SUPPORT + bool "SOF enable clients" + depends on SND_SOC_SOF + help + This adds support for auxiliary client devices to separate out the debug + functionality for IPC tests, probes etc. into separate devices. This + option would also allow adding client devices based on DSP firmware + capabilities and ACPI/OF device information. + Say Y if you want to enable clients with SOF. + If unsure select "N". + config SND_SOC_SOF_DEVELOPER_SUPPORT bool "SOF developer options support" depends on EXPERT @@ -186,6 +204,7 @@ endif ## SND_SOC_SOF_DEVELOPER_SUPPORT config SND_SOC_SOF tristate + select SND_SOC_SOF_CLIENT if SND_SOC_SOF_CLIENT_SUPPORT select SND_SOC_TOPOLOGY select SND_SOC_SOF_NOCODEC if SND_SOC_SOF_NOCODEC_SUPPORT help diff --git a/sound/soc/sof/Makefile b/sound/soc/sof/Makefile index 05718dfe6cd2..5e46f25a3851 100644 --- a/sound/soc/sof/Makefile +++ b/sound/soc/sof/Makefile @@ -2,6 +2,7 @@ snd-sof-objs := core.o ops.o loader.o ipc.o pcm.o pm.o debug.o topology.o\ control.o trace.o utils.o sof-audio.o +snd-sof-client-objs := sof-client.o snd-sof-$(CONFIG_SND_SOC_SOF_DEBUG_PROBES) += probe.o compress.o snd-sof-pci-objs := sof-pci-dev.o @@ -18,6 +19,8 @@ obj-$(CONFIG_SND_SOC_SOF_ACPI) += snd-sof-acpi.o obj-$(CONFIG_SND_SOC_SOF_OF) += snd-sof-of.o obj-$(CONFIG_SND_SOC_SOF_PCI) += snd-sof-pci.o +obj-$(CONFIG_SND_SOC_SOF_CLIENT) += snd-sof-client.o + obj-$(CONFIG_SND_SOC_SOF_INTEL_TOPLEVEL) += intel/ obj-$(CONFIG_SND_SOC_SOF_IMX_TOPLEVEL) += imx/ obj-$(CONFIG_SND_SOC_SOF_XTENSA) += xtensa/ diff --git a/sound/soc/sof/core.c b/sound/soc/sof/core.c index adc7c37145d6..72a97219395f 100644 --- a/sound/soc/sof/core.c +++ b/sound/soc/sof/core.c @@ -314,8 +314,10 @@ int snd_sof_device_probe(struct device *dev, struct snd_sof_pdata *plat_data) INIT_LIST_HEAD(>widget_list); INIT_LIST_HEAD(>dai_list); INIT_LIST_HEAD(>route_list); + INIT_LIST_HEAD(>client_list); spin_lock_init(>ipc_lock); spin_lock_init(>hw_lock); + mutex_init(>client_mutex); if 
(IS_ENABLED(CONFIG_SND_SOC_SOF_PROBE_WORK_QUEUE)) INIT_WORK(>probe_work, sof_probe_work); diff --git a/sound/soc/sof/sof-client.c b/sound/soc/sof/sof-client.c new file mode 100644 index ..dd75a0ba4c28 --- /dev/null +++ b/sound/soc/sof/sof-client.c @@ -0,0 +1,115 @@ +// SPDX-License-Identifier: GPL-2.0-only +// +// Copyright(c) 2020 Intel Corporation. All rights reserved. +// +// Author: Ranjani Sridharan +// + +#include +#include +#include +#include +#include +#include +#include "sof-client.h" +#include "sof-priv.h" + +static void sof_client_auxdev_release(struct device *dev) +{ + struct auxiliary_device *auxdev = to_auxiliary_dev(dev); + struct sof_client_dev *cdev = auxiliary_dev_to_sof_client_dev(auxdev); + +
[PATCH v3 09/10] ASoC: SOF: Add new client driver for probes support
From: Ranjani Sridharan Add a new client driver for probes support and move all the probes-related code from the core to the client driver. The probes client driver registers a component driver with one CPU DAI driver for extraction and creates a new sound card with one DUMMY DAI link with a dummy codec that will be used for extracting audio data from specific points in the audio pipeline. The probes debugfs ops are based on the initial implementation by Cezary Rojewski and have been moved out of the SOF core into the client driver making it easier to maintain. This change will make it easier for the probes functionality to be added for all platforms without having the need to modify the existing(15+) machine drivers. Reviewed-by: Pierre-Louis Bossart Tested-by: Fred Oh Signed-off-by: Ranjani Sridharan Signed-off-by: Dave Ertman --- sound/soc/sof/Kconfig | 18 +- sound/soc/sof/Makefile| 3 +- sound/soc/sof/compress.c | 51 ++-- sound/soc/sof/core.c | 6 - sound/soc/sof/debug.c | 227 sound/soc/sof/intel/hda-dai.c | 15 -- sound/soc/sof/intel/hda.h | 6 - sound/soc/sof/pcm.c | 11 - sound/soc/sof/probe.c | 124 - sound/soc/sof/probe.h | 41 +-- sound/soc/sof/sof-priv.h | 4 - sound/soc/sof/sof-probes-client.c | 414 ++ 12 files changed, 545 insertions(+), 375 deletions(-) create mode 100644 sound/soc/sof/sof-probes-client.c diff --git a/sound/soc/sof/Kconfig b/sound/soc/sof/Kconfig index a0f9474b8143..9fa00780c842 100644 --- a/sound/soc/sof/Kconfig +++ b/sound/soc/sof/Kconfig @@ -42,13 +42,11 @@ config SND_SOC_SOF_OF Say Y if you need this option. If unsure select "N". config SND_SOC_SOF_DEBUG_PROBES - bool "SOF enable data probing" + bool select SND_SOC_COMPRESS help - This option enables the data probing feature that can be used to - gather data directly from specific points of the audio pipeline. - Say Y if you want to enable probes. - If unsure, select "N". + This option is not user-selectable but automagically handled by + 'select' statements at a higher level. 
config SND_SOC_SOF_CLIENT tristate @@ -192,6 +190,15 @@ config SND_SOC_SOF_DEBUG_IPC_FLOOD_TEST_CLIENT Say Y if you want to enable IPC flood test. If unsure, select "N". +config SND_SOC_SOF_DEBUG_PROBES_CLIENT + tristate "SOF enable data probing" + depends on SND_SOC_SOF_CLIENT + help + This option enables the data probing feature that can be used to + gather data directly from specific points of the audio pipeline. + Say Y if you want to enable probes. + If unsure, select "N". + config SND_SOC_SOF_DEBUG_RETAIN_DSP_CONTEXT bool "SOF retain DSP context on any FW exceptions" help @@ -207,6 +214,7 @@ endif ## SND_SOC_SOF_DEVELOPER_SUPPORT config SND_SOC_SOF tristate select SND_SOC_SOF_CLIENT if SND_SOC_SOF_CLIENT_SUPPORT + select SND_SOC_SOF_DEBUG_PROBES if SND_SOC_SOF_DEBUG_PROBES_CLIENT select SND_SOC_TOPOLOGY select SND_SOC_SOF_NOCODEC if SND_SOC_SOF_NOCODEC_SUPPORT help diff --git a/sound/soc/sof/Makefile b/sound/soc/sof/Makefile index baa93fe2cc9a..cf49466f7910 100644 --- a/sound/soc/sof/Makefile +++ b/sound/soc/sof/Makefile @@ -3,7 +3,7 @@ snd-sof-objs := core.o ops.o loader.o ipc.o pcm.o pm.o debug.o topology.o\ control.o trace.o utils.o sof-audio.o snd-sof-client-objs := sof-client.o -snd-sof-$(CONFIG_SND_SOC_SOF_DEBUG_PROBES) += probe.o compress.o +snd-sof-probes-objs := probe.o compress.o sof-probes-client.o snd-sof-pci-objs := sof-pci-dev.o snd-sof-acpi-objs := sof-acpi-dev.o @@ -24,6 +24,7 @@ obj-$(CONFIG_SND_SOC_SOF_PCI) += snd-sof-pci.o obj-$(CONFIG_SND_SOC_SOF_CLIENT) += snd-sof-client.o obj-$(CONFIG_SND_SOC_SOF_DEBUG_IPC_FLOOD_TEST_CLIENT) += snd-sof-ipc-test.o +obj-$(CONFIG_SND_SOC_SOF_DEBUG_PROBES_CLIENT) += snd-sof-probes.o obj-$(CONFIG_SND_SOC_SOF_INTEL_TOPLEVEL) += intel/ obj-$(CONFIG_SND_SOC_SOF_IMX_TOPLEVEL) += imx/ diff --git a/sound/soc/sof/compress.c b/sound/soc/sof/compress.c index 0443f171b4e7..bbb77f028e74 100644 --- a/sound/soc/sof/compress.c +++ b/sound/soc/sof/compress.c @@ -10,8 +10,8 @@ #include #include "compress.h" -#include "ops.h" 
#include "probe.h" +#include "sof-client.h" struct snd_soc_cdai_ops sof_probe_compr_ops = { .startup= sof_probe_compr_open, @@ -30,17 +30,18 @@ EXPORT_SYMBOL(sof_probe_compressed_ops); int sof_probe_compr_open(struct snd_compr_stream *cstream, struct snd_soc_dai *dai) { - struct snd_sof_dev *sdev = - snd_soc_component_get_drvdata(dai->component); + struct snd_soc_card *card = snd_soc_component_get_drvdata(dai->component); + struct
[PATCH v3 06/10] ASoC: SOF: Intel: Remove IPC flood test support in SOF core
From: Ranjani Sridharan Remove the IPC flood test support in the SOF core as it is now added in the IPC flood test client. Reviewed-by: Pierre-Louis Bossart Signed-off-by: Fred Oh Signed-off-by: Ranjani Sridharan Signed-off-by: Dave Ertman --- sound/soc/sof/Kconfig| 8 -- sound/soc/sof/debug.c| 230 --- sound/soc/sof/sof-priv.h | 6 +- 3 files changed, 1 insertion(+), 243 deletions(-) diff --git a/sound/soc/sof/Kconfig b/sound/soc/sof/Kconfig index 13bde36cc5d7..a0f9474b8143 100644 --- a/sound/soc/sof/Kconfig +++ b/sound/soc/sof/Kconfig @@ -182,14 +182,6 @@ config SND_SOC_SOF_DEBUG_ENABLE_FIRMWARE_TRACE module parameter (similar to dynamic debug) If unsure, select "N". -config SND_SOC_SOF_DEBUG_IPC_FLOOD_TEST - bool "SOF enable IPC flood test" - help - This option enables the IPC flood test which can be used to flood - the DSP with test IPCs and gather stats about response times. - Say Y if you want to enable IPC flood test. - If unsure, select "N". - config SND_SOC_SOF_DEBUG_IPC_FLOOD_TEST_CLIENT tristate "SOF enable IPC flood test client" depends on SND_SOC_SOF_CLIENT diff --git a/sound/soc/sof/debug.c b/sound/soc/sof/debug.c index 9419a99bab53..d224641768da 100644 --- a/sound/soc/sof/debug.c +++ b/sound/soc/sof/debug.c @@ -232,120 +232,10 @@ static int snd_sof_debugfs_probe_item(struct snd_sof_dev *sdev, } #endif -#if IS_ENABLED(CONFIG_SND_SOC_SOF_DEBUG_IPC_FLOOD_TEST) -#define MAX_IPC_FLOOD_DURATION_MS 1000 -#define MAX_IPC_FLOOD_COUNT 1 -#define IPC_FLOOD_TEST_RESULT_LEN 512 - -static int sof_debug_ipc_flood_test(struct snd_sof_dev *sdev, - struct snd_sof_dfsentry *dfse, - bool flood_duration_test, - unsigned long ipc_duration_ms, - unsigned long ipc_count) -{ - struct sof_ipc_cmd_hdr hdr; - struct sof_ipc_reply reply; - u64 min_response_time = U64_MAX; - ktime_t start, end, test_end; - u64 avg_response_time = 0; - u64 max_response_time = 0; - u64 ipc_response_time; - int i = 0; - int ret; - - /* configure test IPC */ - hdr.cmd = SOF_IPC_GLB_TEST_MSG | 
SOF_IPC_TEST_IPC_FLOOD; - hdr.size = sizeof(hdr); - - /* set test end time for duration flood test */ - if (flood_duration_test) - test_end = ktime_get_ns() + ipc_duration_ms * NSEC_PER_MSEC; - - /* send test IPC's */ - while (1) { - start = ktime_get(); - ret = sof_ipc_tx_message(sdev->ipc, hdr.cmd, , hdr.size, -, sizeof(reply)); - end = ktime_get(); - - if (ret < 0) - break; - - /* compute min and max response times */ - ipc_response_time = ktime_to_ns(ktime_sub(end, start)); - min_response_time = min(min_response_time, ipc_response_time); - max_response_time = max(max_response_time, ipc_response_time); - - /* sum up response times */ - avg_response_time += ipc_response_time; - i++; - - /* test complete? */ - if (flood_duration_test) { - if (ktime_to_ns(end) >= test_end) - break; - } else { - if (i == ipc_count) - break; - } - } - - if (ret < 0) - dev_err(sdev->dev, - "error: ipc flood test failed at %d iterations\n", i); - - /* return if the first IPC fails */ - if (!i) - return ret; - - /* compute average response time */ - do_div(avg_response_time, i); - - /* clear previous test output */ - memset(dfse->cache_buf, 0, IPC_FLOOD_TEST_RESULT_LEN); - - if (flood_duration_test) { - dev_dbg(sdev->dev, "IPC Flood test duration: %lums\n", - ipc_duration_ms); - snprintf(dfse->cache_buf, IPC_FLOOD_TEST_RESULT_LEN, -"IPC Flood test duration: %lums\n", ipc_duration_ms); - } - - dev_dbg(sdev->dev, - "IPC Flood count: %d, Avg response time: %lluns\n", - i, avg_response_time); - dev_dbg(sdev->dev, "Max response time: %lluns\n", - max_response_time); - dev_dbg(sdev->dev, "Min response time: %lluns\n", - min_response_time); - - /* format output string */ - snprintf(dfse->cache_buf + strlen(dfse->cache_buf), -IPC_FLOOD_TEST_RESULT_LEN - strlen(dfse->cache_buf), -"IPC Flood count: %d\nAvg response time: %lluns\n", -i, avg_response_time); - - snprintf(dfse->cache_buf + strlen(dfse->cache_buf),
[PATCH v3 07/10] ASoC: SOF: sof-client: Add client APIs to access probes ops
From: Ranjani Sridharan Add client APIs to invoke the platform-specific DSP probes ops. Also, add a new API to get the SOF core device pointer which will be used for DMA buffer allocation. Reviewed-by: Pierre-Louis Bossart Tested-by: Fred Oh Signed-off-by: Ranjani Sridharan Signed-off-by: Dave Ertman --- sound/soc/sof/sof-client.c | 55 ++ sound/soc/sof/sof-client.h | 25 + 2 files changed, 80 insertions(+) diff --git a/sound/soc/sof/sof-client.c b/sound/soc/sof/sof-client.c index dd75a0ba4c28..838aaa5ea179 100644 --- a/sound/soc/sof/sof-client.c +++ b/sound/soc/sof/sof-client.c @@ -11,6 +11,7 @@ #include #include #include +#include "ops.h" #include "sof-client.h" #include "sof-priv.h" @@ -112,4 +113,58 @@ struct dentry *sof_client_get_debugfs_root(struct sof_client_dev *cdev) } EXPORT_SYMBOL_NS_GPL(sof_client_get_debugfs_root, SND_SOC_SOF_CLIENT); +#if IS_ENABLED(CONFIG_SND_SOC_SOF_DEBUG_PROBES_CLIENT) +int sof_client_probe_compr_assign(struct sof_client_dev *cdev, + struct snd_compr_stream *cstream, + struct snd_soc_dai *dai) +{ + return snd_sof_probe_compr_assign(cdev->sdev, cstream, dai); +} +EXPORT_SYMBOL_NS_GPL(sof_client_probe_compr_assign, SND_SOC_SOF_CLIENT); + +int sof_client_probe_compr_free(struct sof_client_dev *cdev, + struct snd_compr_stream *cstream, + struct snd_soc_dai *dai) +{ + return snd_sof_probe_compr_free(cdev->sdev, cstream, dai); +} +EXPORT_SYMBOL_NS_GPL(sof_client_probe_compr_free, SND_SOC_SOF_CLIENT); + +int sof_client_probe_compr_set_params(struct sof_client_dev *cdev, + struct snd_compr_stream *cstream, + struct snd_compr_params *params, + struct snd_soc_dai *dai) +{ + return snd_sof_probe_compr_set_params(cdev->sdev, cstream, params, dai); +} +EXPORT_SYMBOL_NS_GPL(sof_client_probe_compr_set_params, SND_SOC_SOF_CLIENT); + +int sof_client_probe_compr_trigger(struct sof_client_dev *cdev, + struct snd_compr_stream *cstream, int cmd, + struct snd_soc_dai *dai) +{ + return snd_sof_probe_compr_trigger(cdev->sdev, cstream, cmd, dai); +} 
+EXPORT_SYMBOL_NS_GPL(sof_client_probe_compr_trigger, SND_SOC_SOF_CLIENT); + +int sof_client_probe_compr_pointer(struct sof_client_dev *cdev, + struct snd_compr_stream *cstream, + struct snd_compr_tstamp *tstamp, + struct snd_soc_dai *dai) +{ + return snd_sof_probe_compr_pointer(cdev->sdev, cstream, tstamp, dai); +} +EXPORT_SYMBOL_NS_GPL(sof_client_probe_compr_pointer, SND_SOC_SOF_CLIENT); +#endif + +/* + * DMA buffer alloc fails when using the client device. Use the SOF core device instead. + * This will be needed for clients other than the probes client device as well. + */ +struct device *sof_client_get_dma_dev(struct sof_client_dev *cdev) +{ + return cdev->sdev->dev; +} +EXPORT_SYMBOL_NS_GPL(sof_client_get_dma_dev, SND_SOC_SOF_CLIENT); + MODULE_LICENSE("GPL"); diff --git a/sound/soc/sof/sof-client.h b/sound/soc/sof/sof-client.h index 429282df9f65..be80053068c9 100644 --- a/sound/soc/sof/sof-client.h +++ b/sound/soc/sof/sof-client.h @@ -7,6 +7,10 @@ #include #include #include +#include +#include +#include +#include #define SOF_CLIENT_PROBE_TIMEOUT_MS 2000 @@ -50,6 +54,27 @@ int sof_client_ipc_tx_message(struct sof_client_dev *cdev, u32 header, void *msg size_t msg_bytes, void *reply_data, size_t reply_bytes); struct dentry *sof_client_get_debugfs_root(struct sof_client_dev *cdev); +struct device *sof_client_get_dma_dev(struct sof_client_dev *cdev); + +#if IS_ENABLED(CONFIG_SND_SOC_SOF_DEBUG_PROBES_CLIENT) +int sof_client_probe_compr_assign(struct sof_client_dev *cdev, + struct snd_compr_stream *cstream, + struct snd_soc_dai *dai); +int sof_client_probe_compr_free(struct sof_client_dev *cdev, + struct snd_compr_stream *cstream, + struct snd_soc_dai *dai); +int sof_client_probe_compr_set_params(struct sof_client_dev *cdev, + struct snd_compr_stream *cstream, + struct snd_compr_params *params, + struct snd_soc_dai *dai); +int sof_client_probe_compr_trigger(struct sof_client_dev *cdev, + struct snd_compr_stream *cstream, int cmd, + struct snd_soc_dai *dai); +int 
sof_client_probe_compr_pointer(struct sof_client_dev *cdev, + struct snd_compr_stream *cstream, +
[PATCH v3 10/10] ASoC: SOF: Intel: CNL: register probes client
From: Ranjani Sridharan Register the client device for probes support on the CNL platform. Creating this client device alleviates the need for modifying the sound card definitions in the existing machine drivers to add support for the new probes feature in the FW. This will result in the creation of a separate sound card that can be used for audio data extraction from user specified points in the audio pipeline. Reviewed-by: Pierre-Louis Bossart Tested-by: Fred Oh Signed-off-by: Ranjani Sridharan Signed-off-by: Dave Ertman --- sound/soc/sof/intel/cnl.c | 18 +- 1 file changed, 17 insertions(+), 1 deletion(-) diff --git a/sound/soc/sof/intel/cnl.c b/sound/soc/sof/intel/cnl.c index 20afb622c315..6d15b871dc17 100644 --- a/sound/soc/sof/intel/cnl.c +++ b/sound/soc/sof/intel/cnl.c @@ -19,6 +19,7 @@ #include "hda.h" #include "hda-ipc.h" #include "../sof-audio.h" +#include "../sof-client.h" #include "intel-client.h" static const struct snd_sof_debugfs_map cnl_dsp_debugfs[] = { @@ -233,12 +234,26 @@ void cnl_ipc_dump(struct snd_sof_dev *sdev) static int cnl_register_clients(struct snd_sof_dev *sdev) { - return intel_register_ipc_test_clients(sdev); + int ret; + + ret = intel_register_ipc_test_clients(sdev); + if (ret < 0) + return ret; + +#if IS_ENABLED(CONFIG_SND_SOC_SOF_HDA_PROBES) + return sof_client_dev_register(sdev, "probes", 0); +#endif + + return 0; } static void cnl_unregister_clients(struct snd_sof_dev *sdev) { intel_unregister_ipc_test_clients(sdev); + +#if IS_ENABLED(CONFIG_SND_SOC_SOF_HDA_PROBES) + sof_client_dev_unregister(sdev, "probes", 0); +#endif } /* cannonlake ops */ @@ -409,3 +424,4 @@ const struct sof_intel_dsp_desc jsl_chip_info = { }; EXPORT_SYMBOL_NS(jsl_chip_info, SND_SOC_SOF_INTEL_HDA_COMMON); MODULE_IMPORT_NS(SND_SOC_SOF_INTEL_CLIENT); +MODULE_IMPORT_NS(SND_SOC_SOF_CLIENT); -- 2.26.2
[PATCH v3 04/10] ASoC: SOF: ops: Add ops for client registration
From: Ranjani Sridharan Add new ops for registering/unregistering clients based on DSP capabilities and/or DT information. Reviewed-by: Pierre-Louis Bossart Signed-off-by: Ranjani Sridharan Signed-off-by: Dave Ertman --- sound/soc/sof/core.c | 10 ++ sound/soc/sof/ops.h | 14 ++ sound/soc/sof/sof-priv.h | 4 3 files changed, 28 insertions(+) diff --git a/sound/soc/sof/core.c b/sound/soc/sof/core.c index 72a97219395f..ddb9a12d5aac 100644 --- a/sound/soc/sof/core.c +++ b/sound/soc/sof/core.c @@ -246,8 +246,17 @@ static int sof_probe_continue(struct snd_sof_dev *sdev) if (plat_data->sof_probe_complete) plat_data->sof_probe_complete(sdev->dev); + /* If registering certain clients fails, unregister the previously registered clients. */ + ret = snd_sof_register_clients(sdev); + if (ret < 0) { + dev_err(sdev->dev, "error: failed to register clients %d\n", ret); + goto client_reg_err; + } + return 0; +client_reg_err: + snd_sof_unregister_clients(sdev); fw_trace_err: snd_sof_free_trace(sdev); fw_run_err: @@ -356,6 +365,7 @@ int snd_sof_device_remove(struct device *dev) dev_warn(dev, "error: %d failed to prepare DSP for device removal", ret); + snd_sof_unregister_clients(sdev); snd_sof_fw_unload(sdev); snd_sof_ipc_free(sdev); snd_sof_free_debug(sdev); diff --git a/sound/soc/sof/ops.h b/sound/soc/sof/ops.h index b21632f5511a..00370f8bcd75 100644 --- a/sound/soc/sof/ops.h +++ b/sound/soc/sof/ops.h @@ -470,6 +470,20 @@ snd_sof_set_mach_params(const struct snd_soc_acpi_mach *mach, sof_ops(sdev)->set_mach_params(mach, dev); } +static inline int snd_sof_register_clients(struct snd_sof_dev *sdev) +{ + if (sof_ops(sdev) && sof_ops(sdev)->register_clients) + return sof_ops(sdev)->register_clients(sdev); + + return 0; +} + +static inline void snd_sof_unregister_clients(struct snd_sof_dev *sdev) +{ + if (sof_ops(sdev) && sof_ops(sdev)->unregister_clients) + sof_ops(sdev)->unregister_clients(sdev); +} + static inline const struct snd_sof_dsp_ops *sof_get_ops(const struct sof_dev_desc *d, 
const struct sof_ops_table mach_ops[], int asize) diff --git a/sound/soc/sof/sof-priv.h b/sound/soc/sof/sof-priv.h index dceac73b858f..cca239c09d0e 100644 --- a/sound/soc/sof/sof-priv.h +++ b/sound/soc/sof/sof-priv.h @@ -252,6 +252,10 @@ struct snd_sof_dsp_ops { void (*set_mach_params)(const struct snd_soc_acpi_mach *mach, struct device *dev); /* optional */ + /* client ops */ + int (*register_clients)(struct snd_sof_dev *sdev); /* optional */ + void (*unregister_clients)(struct snd_sof_dev *sdev); /* optional */ + /* DAI ops */ struct snd_soc_dai_driver *drv; int num_drv; -- 2.26.2
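The optional-callback dispatch added in ops.h can be illustrated outside the kernel. A minimal userspace sketch, assuming stand-in types: `demo_dev`, `demo_ops` and the `plat_*` callbacks are hypothetical and only mirror the shape of the wrappers in the patch.

```c
#include <assert.h>
#include <stddef.h>

struct demo_dev;

/* Hypothetical ops table; both callbacks are optional, as in the patch. */
struct demo_ops {
	int (*register_clients)(struct demo_dev *dev);
	void (*unregister_clients)(struct demo_dev *dev);
};

struct demo_dev {
	const struct demo_ops *ops;
	int nr_clients;
};

/* Returns 0 when the platform provides no callback. */
static int demo_register_clients(struct demo_dev *dev)
{
	assert(dev != NULL);
	if (dev->ops && dev->ops->register_clients)
		return dev->ops->register_clients(dev);

	return 0;
}

/* Silently does nothing when the platform provides no callback. */
static void demo_unregister_clients(struct demo_dev *dev)
{
	assert(dev != NULL);
	if (dev->ops && dev->ops->unregister_clients)
		dev->ops->unregister_clients(dev);
}

/* Example platform that registers two clients. */
static int plat_register(struct demo_dev *dev)
{
	dev->nr_clients = 2;
	return 0;
}

static void plat_unregister(struct demo_dev *dev)
{
	dev->nr_clients = 0;
}

static const struct demo_ops plat_ops = {
	.register_clients = plat_register,
	.unregister_clients = plat_unregister,
};

/* Example platform that implements neither callback. */
static const struct demo_ops empty_ops = { NULL, NULL };
```

Because the wrappers treat a missing callback as success, platforms without clients need no changes, which seems to be the intent of marking both ops optional.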
[PATCH v3 05/10] ASoC: SOF: Intel: Define ops for client registration
From: Ranjani Sridharan Define client ops for Intel platforms. For now, we only add 2 IPC test clients that will be used to run tandem IPC flood tests. For ACPI platforms, change the Kconfig to select SND_SOC_SOF_PROBE_WORK_QUEUE to allow the auxiliary driver to probe when the client is registered. Reviewed-by: Pierre-Louis Bossart Signed-off-by: Ranjani Sridharan Co-developed-by: Fred Oh Signed-off-by: Fred Oh Signed-off-by: Dave Ertman --- sound/soc/sof/intel/Kconfig| 9 +++ sound/soc/sof/intel/Makefile | 3 +++ sound/soc/sof/intel/apl.c | 16 sound/soc/sof/intel/bdw.c | 16 sound/soc/sof/intel/byt.c | 20 +++ sound/soc/sof/intel/cnl.c | 16 sound/soc/sof/intel/intel-client.c | 40 ++ sound/soc/sof/intel/intel-client.h | 26 +++ 8 files changed, 146 insertions(+) create mode 100644 sound/soc/sof/intel/intel-client.c create mode 100644 sound/soc/sof/intel/intel-client.h diff --git a/sound/soc/sof/intel/Kconfig b/sound/soc/sof/intel/Kconfig index a066e08860cb..b449fa2f8005 100644 --- a/sound/soc/sof/intel/Kconfig +++ b/sound/soc/sof/intel/Kconfig @@ -13,6 +13,8 @@ config SND_SOC_SOF_INTEL_ACPI def_tristate SND_SOC_SOF_ACPI select SND_SOC_SOF_BAYTRAIL if SND_SOC_SOF_BAYTRAIL_SUPPORT select SND_SOC_SOF_BROADWELL if SND_SOC_SOF_BROADWELL_SUPPORT + select SND_SOC_SOF_PROBE_WORK_QUEUE if SND_SOC_SOF_CLIENT + select SND_SOC_SOF_INTEL_CLIENT if SND_SOC_SOF_CLIENT help This option is not user-selectable but automagically handled by 'select' statements at a higher level @@ -29,6 +31,7 @@ config SND_SOC_SOF_INTEL_PCI select SND_SOC_SOF_TIGERLAKE if SND_SOC_SOF_TIGERLAKE_SUPPORT select SND_SOC_SOF_ELKHARTLAKE if SND_SOC_SOF_ELKHARTLAKE_SUPPORT select SND_SOC_SOF_JASPERLAKE if SND_SOC_SOF_JASPERLAKE_SUPPORT + select SND_SOC_SOF_INTEL_CLIENT if SND_SOC_SOF_CLIENT help This option is not user-selectable but automagically handled by 'select' statements at a higher level @@ -57,6 +60,12 @@ config SND_SOC_SOF_INTEL_COMMON This option is not user-selectable but automagically handled by 
'select' statements at a higher level +config SND_SOC_SOF_INTEL_CLIENT + tristate + help + This option is not user-selectable but automagically handled by + 'select' statements at a higher level. + if SND_SOC_SOF_INTEL_ACPI config SND_SOC_SOF_BAYTRAIL_SUPPORT diff --git a/sound/soc/sof/intel/Makefile b/sound/soc/sof/intel/Makefile index 72d85b25df7d..683e64c627c1 100644 --- a/sound/soc/sof/intel/Makefile +++ b/sound/soc/sof/intel/Makefile @@ -5,6 +5,8 @@ snd-sof-intel-bdw-objs := bdw.o snd-sof-intel-ipc-objs := intel-ipc.o +snd-sof-intel-client-objs := intel-client.o + snd-sof-intel-hda-common-objs := hda.o hda-loader.o hda-stream.o hda-trace.o \ hda-dsp.o hda-ipc.o hda-ctrl.o hda-pcm.o \ hda-dai.o hda-bus.o \ @@ -18,3 +20,4 @@ obj-$(CONFIG_SND_SOC_SOF_BROADWELL) += snd-sof-intel-bdw.o obj-$(CONFIG_SND_SOC_SOF_INTEL_HIFI_EP_IPC) += snd-sof-intel-ipc.o obj-$(CONFIG_SND_SOC_SOF_HDA_COMMON) += snd-sof-intel-hda-common.o obj-$(CONFIG_SND_SOC_SOF_HDA) += snd-sof-intel-hda.o +obj-$(CONFIG_SND_SOC_SOF_INTEL_CLIENT) += snd-sof-intel-client.o diff --git a/sound/soc/sof/intel/apl.c b/sound/soc/sof/intel/apl.c index 4eeade2e77f7..ce2dcd6aa7de 100644 --- a/sound/soc/sof/intel/apl.c +++ b/sound/soc/sof/intel/apl.c @@ -18,6 +18,7 @@ #include "../sof-priv.h" #include "hda.h" #include "../sof-audio.h" +#include "intel-client.h" static const struct snd_sof_debugfs_map apl_dsp_debugfs[] = { {"hda", HDA_DSP_HDA_BAR, 0, 0x4000, SOF_DEBUGFS_ACCESS_ALWAYS}, @@ -25,6 +26,16 @@ static const struct snd_sof_debugfs_map apl_dsp_debugfs[] = { {"dsp", HDA_DSP_BAR, 0, 0x1, SOF_DEBUGFS_ACCESS_ALWAYS}, }; +static int apl_register_clients(struct snd_sof_dev *sdev) +{ + return intel_register_ipc_test_clients(sdev); +} + +static void apl_unregister_clients(struct snd_sof_dev *sdev) +{ + intel_unregister_ipc_test_clients(sdev); +} + /* apollolake ops */ const struct snd_sof_dsp_ops sof_apl_ops = { /* probe and remove */ @@ -101,6 +112,10 @@ const struct snd_sof_dsp_ops sof_apl_ops = { .trace_release 
= hda_dsp_trace_release, .trace_trigger = hda_dsp_trace_trigger, + /* client ops */ + .register_clients = apl_register_clients, + .unregister_clients = apl_unregister_clients, + /* DAI drivers */ .drv= skl_dai, .num_drv= SOF_SKL_NUM_DAIS, @@ -140,3 +155,4 @@ const struct sof_intel_dsp_desc apl_chip_info = { .ssp_base_offset = APL_SSP_BASE_OFFSET, }; EXPORT_SYMBOL_NS(apl_chip_info, SND_SOC_SOF_INTEL_HDA_COMMON);
[PATCH v3 01/10] Add auxiliary bus support
Add support for the Auxiliary Bus, auxiliary_device and auxiliary_driver. It enables drivers to create an auxiliary_device and bind an auxiliary_driver to it. The bus supports probe/remove shutdown and suspend/resume callbacks. Each auxiliary_device has a unique string based id; driver binds to an auxiliary_device based on this id through the bus. Co-developed-by: Kiran Patil Signed-off-by: Kiran Patil Co-developed-by: Ranjani Sridharan Signed-off-by: Ranjani Sridharan Co-developed-by: Fred Oh Signed-off-by: Fred Oh Co-developed-by: Leon Romanovsky Signed-off-by: Leon Romanovsky Reviewed-by: Pierre-Louis Bossart Reviewed-by: Shiraz Saleem Reviewed-by: Parav Pandit Reviewed-by: Dan Williams Signed-off-by: Dave Ertman --- Documentation/driver-api/auxiliary_bus.rst | 228 ++ Documentation/driver-api/index.rst | 1 + drivers/base/Kconfig | 3 + drivers/base/Makefile | 1 + drivers/base/auxiliary.c | 267 + include/linux/auxiliary_bus.h | 78 ++ include/linux/mod_devicetable.h| 8 + scripts/mod/devicetable-offsets.c | 3 + scripts/mod/file2alias.c | 8 + 9 files changed, 597 insertions(+) create mode 100644 Documentation/driver-api/auxiliary_bus.rst create mode 100644 drivers/base/auxiliary.c create mode 100644 include/linux/auxiliary_bus.h diff --git a/Documentation/driver-api/auxiliary_bus.rst b/Documentation/driver-api/auxiliary_bus.rst new file mode 100644 index ..500f29692c81 --- /dev/null +++ b/Documentation/driver-api/auxiliary_bus.rst @@ -0,0 +1,228 @@ +.. SPDX-License-Identifier: GPL-2.0-only + += +Auxiliary Bus += + +In some subsystems, the functionality of the core device (PCI/ACPI/other) is +too complex for a single device to be managed as a monolithic block or a part of +the functionality needs to be exposed to a different subsystem. Splitting the +functionality into smaller orthogonal devices would make it easier to manage +data, power management and domain-specific interaction with the hardware. 
A key +requirement for such a split is that there is no dependency on a physical bus, +device, register accesses or regmap support. These individual devices split from +the core cannot live on the platform bus as they are not physical devices that +are controlled by DT/ACPI. The same argument applies for not using MFD in this +scenario as MFD relies on individual function devices being physical devices. + +An example for this kind of requirement is the audio subsystem where a single +IP is handling multiple entities such as HDMI, Soundwire, local devices such as +mics/speakers etc. The split for the core's functionality can be arbitrary or +be defined by the DSP firmware topology and include hooks for test/debug. This +allows for the audio core device to be minimal and focused on hardware-specific +control and communication. + +The auxiliary bus is intended to be minimal, generic and avoid domain-specific +assumptions. Each auxiliary_device represents a part of its parent +functionality. The generic behavior can be extended and specialized as needed +by encapsulating an auxiliary_device within other domain-specific structures and +the use of .ops callbacks. Devices on the auxiliary bus do not share any +structures and the use of a communication channel with the parent is +domain-specific. + +When Should the Auxiliary Bus Be Used += + +The auxiliary bus is to be used when a driver and one or more kernel modules, +who share a common header file with the driver, need a mechanism to connect and +provide access to a shared object allocated by the auxiliary_device's +registering driver. The registering driver for the auxiliary_device(s) and the +kernel module(s) registering auxiliary_drivers can be from the same subsystem, +or from multiple subsystems. + +The emphasis here is on a common generic interface that keeps subsystem +customization out of the bus infrastructure. 
+ +One example could be a multi-port PCI network device that is rdma-capable and +needs to export this functionality and attach to an rdma driver in another +subsystem. The PCI driver will allocate and register an auxiliary_device for +each physical function on the NIC. The rdma driver will register an +auxiliary_driver that will be matched with and probed for each of these +auxiliary_devices. This will give the rdma driver access to the shared data/ops +in the PCI drivers shared object to establish a connection with the PCI driver. + +Another use case is for the PCI device to be split out into multiple sub +functions. For each sub function an auxiliary_device will be created. A PCI +sub function driver will bind to such devices that will create its own one or +more class devices. A PCI sub function auxiliary device will likely be +contained in a struct with additional
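The binding "based on this id through the bus" can be pictured as a string-prefix comparison. The sketch below assumes a naming scheme in which the device name is `<modname>.<devname>.<id>` and the driver's id-table entry is `<modname>.<devname>`; that scheme, the function name `demo_aux_match` and the example strings are assumptions for illustration, not the bus's confirmed implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical match helper: a driver entry matches when it equals the
 * device name with the trailing ".<id>" instance suffix stripped.
 */
static bool demo_aux_match(const char *id_name, const char *dev_name)
{
	const char *last_dot;
	size_t prefix_len;

	assert(id_name != NULL && dev_name != NULL);

	/* Everything before the last '.' is the part used for matching. */
	last_dot = strrchr(dev_name, '.');
	if (!last_dot)
		return false;

	prefix_len = (size_t)(last_dot - dev_name);

	/* Lengths must agree so that "a.probe" does not match "a.probes.0". */
	return strlen(id_name) == prefix_len &&
	       strncmp(dev_name, id_name, prefix_len) == 0;
}
```

Under that assumption, `demo_aux_match("snd_sof.probes", "snd_sof.probes.0")` would succeed, while a driver entry of `"snd_sof.probe"` would not bind to the same device.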
[PATCH v3 00/10] Auxiliary bus implementation and SOF multi-client support
Brief history of Auxiliary Bus
==============================

The auxiliary bus code was originally submitted upstream as virtual bus, and was submitted through the netdev tree. This process generated up to v4. This discussion can be found here:
https://lore.kernel.org/netdev/2019192219.30259-1-jeffrey.t.kirs...@intel.com/#t

At this point, GregKH requested that we take the review and revision process to an internal mailing list and garner the buy-in of a respected kernel contributor.

The auxiliary bus (then known as virtual bus) was originally submitted along with implementation code for the ice driver and irdma driver, causing the complication of also having dependencies in the rdma tree. This new submission is utilizing an auxiliary bus consumer in only the sound driver tree to create the initial implementation and a single user.

Since implementation work has started on this patch set, there have been multiple inquiries about the time frame of its completion. It appears that there will be numerous consumers of this functionality.

The process of internal review and implementation using the sound drivers generated 19 internal versions. The changes, including the name change from virtual bus to auxiliary bus, from these versions can be summarized as the following:

- Fixed compilation and checkpatch errors
- Improved documentation to address the motivation for virtual bus.
- Renamed virtual bus to auxiliary bus
- increased maximum device name size
- Correct order in Kconfig and Makefile
- removed the mid-layer adev->release layer for device unregister
- pushed adev->id management to parent driver
- all error paths out of ancillary_device_register return error code
- all error paths out of ancillary_device_register use put_device
- added adev->name element
- modname in register cannot be NULL
- added KBUILD_MODNAME as prefix for match_name
- push adev->id responsibility to registering driver
- uevent now parses adev->dev name
- match_id function now parses adev->dev name
- changed drivers probe function to also take an ancillary_device_id param
- split ancillary_device_register into device_initialize and device_add
- adjusted what is done in device_initialize and device_add
- change adev to ancildev and adrv to ancildrv
- change adev to ancildev in documentation

==============================

Introduces the auxiliary bus implementation along with the example usage in the Sound Open Firmware (SOF) audio driver.

In some subsystems, the functionality of the core device (PCI/ACPI/other) may be too complex for a single device to be managed as a monolithic block or a part of the functionality might need to be exposed to a different subsystem. Splitting the functionality into smaller orthogonal devices makes it easier to manage data, power management and domain-specific communication with the hardware. Also, common auxiliary_device functionality across primary devices can be handled by a common auxiliary_device. A key requirement for such a split is that there is no dependency on a physical bus, device, register accesses or regmap support. These individual devices split from the core cannot live on the platform bus as they are not physical devices that are controlled by DT/ACPI. The same argument applies for not using MFD in this scenario as it relies on individual function devices being physical devices that are DT enumerated.
An example for this kind of requirement is the audio subsystem where a single IP handles multiple entities such as HDMI, Soundwire, local devices such as mics/speakers etc. The split for the core's functionality can be arbitrary or be defined by the DSP firmware topology and include hooks for test/debug. This allows for the audio core device to be minimal and tightly coupled with handling the hardware-specific logic and communication.

The auxiliary bus is intended to be minimal, generic and avoid domain-specific assumptions. Each auxiliary bus device represents a part of its parent functionality. The generic behavior can be extended and specialized as needed by encapsulating an auxiliary bus device within other domain-specific structures and the use of .ops callbacks.

The SOF driver adopts the auxiliary bus for implementing the multi-client support. A client in the context of the SOF driver represents a part of the core device's functionality. It is not a physical device but rather an auxiliary device that needs to communicate with the DSP via IPCs. With multi-client support, the sound card can be separated into multiple orthogonal auxiliary devices for local devices (mic/speakers etc), HDMI, sensing, probes, debug etc. In this series, we demonstrate the usage of the auxiliary bus with the help of the IPC test client which is used for testing the serialization of IPCs when multiple clients talk to the DSP at the same time.

v3 changes:
rename to auxiliary bus
move .c file to drivers/base/
split auxdev unregister flow into uninitialize and delete steps
update kernel-doc on
Re: [RFC 1/2] printk: Add kernel parameter: mute_console
On (20/10/22 13:42), Petr Mladek wrote:
> +static bool mute_console;
> +
> +static int __init mute_console_setup(char *str)
> +{
> +	mute_console = true;
> +	pr_info("All consoles muted.\n");
> +
> +	return 0;
> +}

First of all, thanks a lot for picking this up and for the patch set! I've several thoughts and comments below.

>  static bool suppress_message_printing(int level)
>  {
> -	return (level >= console_loglevel && !ignore_loglevel);
> +	if (unlikely(mute_console))
> +		return true;
> +
> +	if (unlikely(ignore_loglevel))
> +		return false;
> +
> +	return (level >= console_loglevel);
>  }

This is one way of doing it. Another one is to clear the CON_ENABLED bit from all consoles (upon registration); one upside of this is that we will signal user-space that consoles are disabled/muted (by removing the E flag from /proc/consoles).

But, if I'm not mistaken, this mutes only the printk side; consoles still have the uart running:

	printf -> tty -> uart -> serial_driver_IRQ() -> TX
	serial_driver_IRQ() -> RX -> uart -> tty

so user space, in theory, can still push messages to slow consoles, AFAIU.

Thinking more about it. We are still relying on the fact that there is anything registered as a console driver, which is not necessarily the case; we can have a NULL console drivers list. So how about having a dummy struct console in printk, with NOP read/write and NOP tty_driver and NOP tty_operations. So that when init calls filp_open("/dev/console") and we can't give tty anything but NULL, we'd just give tty back the dummy NOP device.

	-ss
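The order of the checks in the quoted suppress_message_printing() matters: mute wins over ignore_loglevel, which wins over the level comparison. A small userspace model of that precedence, assuming a hypothetical `console_policy` struct in place of the kernel's global variables:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bundle of the three knobs the quoted function consults. */
struct console_policy {
	bool mute_console;
	bool ignore_loglevel;
	int console_loglevel;
};

/* Returns true when a message at 'level' should not reach the consoles. */
static bool suppress_message(const struct console_policy *p, int level)
{
	assert(p != NULL);

	if (p->mute_console)
		return true;	/* muted: nothing is printed */

	if (p->ignore_loglevel)
		return false;	/* not muted: everything is printed */

	return level >= p->console_loglevel;
}
```

With both flags set the message is suppressed, which is the behavioural change relative to the old single-expression version.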
[tip:x86/urgent] BUILD SUCCESS abee7c494d8c41bb388839bccc47e06247f0d7de
                          allyesconfig
powerpc                   socrates_defconfig
c6x                       evmc6678_defconfig
mips                      omega2p_defconfig
ia64                      gensparse_defconfig
arm                       ebsa110_defconfig
powerpc                   mvme5100_defconfig
arm                       rpc_defconfig
powerpc                   ppc64e_defconfig
ia64                      allmodconfig
ia64                      defconfig
ia64                      allyesconfig
m68k                      allmodconfig
m68k                      defconfig
m68k                      allyesconfig
nios2                     defconfig
nds32                     allnoconfig
c6x                       allyesconfig
nds32                     defconfig
nios2                     allyesconfig
csky                      defconfig
alpha                     defconfig
alpha                     allyesconfig
xtensa                    allyesconfig
h8300                     allyesconfig
s390                      allyesconfig
parisc                    allyesconfig
s390                      defconfig
i386                      allyesconfig
sparc                     defconfig
i386                      defconfig
mips                      allyesconfig
mips                      allmodconfig
powerpc                   allyesconfig
powerpc                   allmodconfig
powerpc                   allnoconfig
i386                      randconfig-a002-20201022
i386                      randconfig-a005-20201022
i386                      randconfig-a003-20201022
i386                      randconfig-a001-20201022
i386                      randconfig-a006-20201022
i386                      randconfig-a004-20201022
i386                      randconfig-a002-20201023
i386                      randconfig-a005-20201023
i386                      randconfig-a003-20201023
i386                      randconfig-a001-20201023
i386                      randconfig-a006-20201023
i386                      randconfig-a004-20201023
x86_64                    randconfig-a011-20201022
x86_64                    randconfig-a013-20201022
x86_64                    randconfig-a016-20201022
x86_64                    randconfig-a015-20201022
x86_64                    randconfig-a012-20201022
x86_64                    randconfig-a014-20201022
i386                      randconfig-a016-20201022
i386                      randconfig-a014-20201022
i386                      randconfig-a015-20201022
i386                      randconfig-a012-20201022
i386                      randconfig-a013-20201022
i386                      randconfig-a011-20201022
riscv                     allyesconfig
riscv                     nommu_virt_defconfig
riscv                     allnoconfig
riscv                     defconfig
riscv                     rv32_defconfig
riscv                     allmodconfig
x86_64                    rhel
x86_64                    allyesconfig
x86_64                    rhel-7.6-kselftests
x86_64                    defconfig
x86_64                    rhel-8.3
x86_64                    kexec

clang tested configs:
x86_64                    randconfig-a001-20201022
x86_64                    randconfig-a002-20201022
x86_64                    randconfig-a003-20201022
x86_64                    randconfig-a006-20201022
x86_64                    randconfig-a004-20201022
x86_64                    randconfig-a005-20201022

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org
Re: mmstress[1309]: segfault at 7f3d71a36ee8 ip 00007f3d77132bdf sp 00007f3d71a36ee8 error 4 in libc-2.27.so[7f3d77058000+1aa000]
On Thu, Oct 22, 2020 at 5:11 PM Linus Torvalds wrote:
>
> In particular, I wonder if it's that KASAN causes some reload pattern,
> and the whole
>
>      register __typeof__(*(ptr)) __val_pu asm("%"_ASM_AX);
>      ..
>      asm volatile(.. "r" (__val_pu) ..)
>
> thing causes problems.

That pattern isn't new (see the same pattern and the comment above get_user).

But our previous use of that pattern had it as an output of the asm, and the new use is as an input. That obviously shouldn't matter, but if it's some odd compiler code generation interaction, all bets are off..

          Linus
Re: linux-next: build warning after merge of the block tree
On 10/22/20 5:48 PM, Stephen Rothwell wrote:
> Hi all,
>
> After merging the block tree, today's linux-next build (KCONFIG_NAME)
> produced this warning:
>
> fs/io_uring.c: In function 'loop_rw_iter':
> fs/io_uring.c:3141:21: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
>  3141 |    iovec.iov_base = (void __user *) req->rw.addr;
>       |                     ^
>
> Introduced by commit
>
>   a5371db1e38d ("io_uring: make loop_rw_iter() use original user supplied pointers")

Thanks, not sure why I didn't use u64_to_user_pointer() in the first place - updated now.

-- 
Jens Axboe
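The warning arises because a 64-bit integer is cast straight to a pointer on a build where pointers are narrower. The usual remedy is to go through a pointer-sized integer first. A userspace sketch of that idea, where `demo_u64_to_ptr` is a hypothetical stand-in and not the kernel's u64_to_user_pointer() itself:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert a 64-bit address value to a pointer via uintptr_t, so the
 * final integer-to-pointer cast is always between same-sized types.
 * On a 32-bit build the upper half is dropped by the explicit
 * narrowing cast rather than by the pointer cast.
 */
static void *demo_u64_to_ptr(uint64_t address)
{
	return (void *)(uintptr_t)address;
}
```

The intermediate cast makes the narrowing explicit, which is what silences -Wint-to-pointer-cast on 32-bit targets.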
Re: [PATCH] KVM: X86: Expose KVM_HINTS_REALTIME in KVM_GET_SUPPORTED_CPUID
On Thu, 22 Oct 2020 at 21:02, Paolo Bonzini wrote:
>
> On 22/10/20 03:34, Wanpeng Li wrote:
> > From: Wanpeng Li
> >
> > Per KVM_GET_SUPPORTED_CPUID ioctl documentation:
> >
> > This ioctl returns x86 cpuid features which are supported by both the
> > hardware and kvm in its default configuration.
> >
> > A well-behaved userspace should not set the bit if it is not supported.
> >
> > Suggested-by: Jim Mattson
> > Signed-off-by: Wanpeng Li
>
> It's common for userspace to copy all supported CPUID bits to
> KVM_SET_CPUID2, I don't think this is the right behavior for
> KVM_HINTS_REALTIME.
>
> (But maybe this was discussed already; if so, please point me to the
> previous discussion).

The discussion is here. :) https://www.spinics.net/lists/kvm/msg227265.html

    Wanpeng
Re: [PATCH/RFC net] net: dec: tulip: de2104x: Add shutdown handler to stop NIC
On Thu, Oct 22, 2020 at 04:04:16PM -0700, James Bottomley wrote:
> On Thu, 2020-10-22 at 15:06 -0700, Moritz Fischer wrote:
> > The driver does not implement a shutdown handler which leads to issues
> > when using kexec in certain scenarios. The NIC keeps on fetching
> > descriptors which gets flagged by the IOMMU with errors like this:
> >
> > DMAR: DMAR:[DMA read] Request device [5e:00.0]fault addr f000
> > DMAR: DMAR:[DMA read] Request device [5e:00.0]fault addr f000
> > DMAR: DMAR:[DMA read] Request device [5e:00.0]fault addr f000
> > DMAR: DMAR:[DMA read] Request device [5e:00.0]fault addr f000
> > DMAR: DMAR:[DMA read] Request device [5e:00.0]fault addr f000
> >
> > Signed-off-by: Moritz Fischer
> > ---
> >
> > Hi all,
> >
> > I'm not sure if this is the proper way for a shutdown handler,
> > I've tried to look at a bunch of examples and couldn't find a specific
> > solution, in my tests on hardware this works, though.
> >
> > Open to suggestions.
> >
> > Thanks,
> > Moritz
> >
> > ---
> >  drivers/net/ethernet/dec/tulip/de2104x.c | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/drivers/net/ethernet/dec/tulip/de2104x.c b/drivers/net/ethernet/dec/tulip/de2104x.c
> > index f1a2da15dd0a..372c62c7e60f 100644
> > --- a/drivers/net/ethernet/dec/tulip/de2104x.c
> > +++ b/drivers/net/ethernet/dec/tulip/de2104x.c
> > @@ -2185,6 +2185,7 @@ static struct pci_driver de_driver = {
> >  	.id_table = de_pci_tbl,
> >  	.probe = de_init_one,
> >  	.remove = de_remove_one,
> > +	.shutdown = de_remove_one,
>
> This doesn't look right: shutdown is supposed to turn off the device
> without disturbing the tree or causing any knock on effects (I think
> that rule is mostly because you don't want anything in userspace
> triggering since it's likely to be nearly dead). Remove removes the
> device from the tree and cleans up everything. I think the function
> you want that's closest to what shutdown needs is de_close(). That
> basically just turns off the chip and frees the interrupt ... you'll
> have to wrapper it to call it from the pci_driver, though.

Thanks for the suggestion, I like that better. I'll send a v2 after testing. I think anything that hits on de_stop_hw() will keep the NIC from fetching further descriptors.

Cheers,
Moritz
Re: mmstress[1309]: segfault at 7f3d71a36ee8 ip 00007f3d77132bdf sp 00007f3d71a36ee8 error 4 in libc-2.27.so[7f3d77058000+1aa000]
On Thu, Oct 22, 2020 at 4:43 PM Linus Torvalds wrote:
>
> Thanks. Very funky, but thanks. I've been running that commit on my
> machine for over half a year, and it still looks "trivially correct"
> to me, but let me go look at it one more time. Can't argue with a
> reliable bisect and revert..

Hmm. The fact that it only happens with KASAN makes me suspect it's some bad interaction with the inline asm syntax change (and explains why I've run with this for half a year without issues).

In particular, I wonder if it's that KASAN causes some reload pattern, and the whole

     register __typeof__(*(ptr)) __val_pu asm("%"_ASM_AX);
     ..
     asm volatile(.. "r" (__val_pu) ..)

thing causes problems. That's an ugly pattern, but it's written that way to get gcc to handle the 64-bit case properly (with the value in %rax:%rdx).

It turns out that the decode of the user-mode SIGSEGV code is a variation of system calls, ie

   0:   b8 18 00 00 00          mov    $0x18,%eax
   5:   0f 05                   syscall
   7:   48 3d 01 f0 ff ff       cmp    $0xf001,%rax
   d:   73 01                   jae    0x10
   f:*  c3                      retq            <-- trapping instruction

or

   0:   41 52                   push   %r10
   2:   52                      push   %rdx
   3:   4d 31 d2                xor    %r10,%r10
   6:   ba 02 00 00 00          mov    $0x2,%edx
   b:   be 80 00 00 00          mov    $0x80,%esi
  10:   39 d0                   cmp    %edx,%eax
  12:   75 07                   jne    0x1b
  14:   b8 ca 00 00 00          mov    $0xca,%eax
  19:   0f 05                   syscall
  1b:   89 d0                   mov    %edx,%eax
  1d:   87 07                   xchg   %eax,(%rdi)
  1f:   85 c0                   test   %eax,%eax
  21:   75 f1                   jne    0x14
  23:*  5a                      pop    %rdx     <-- trapping instruction
  24:   41 5a                   pop    %r10
  26:   c3                      retq

so in both cases it looks like 'syscall' returned with a bad stack pointer.

Which is certainly a sign of some code generation issue.

Very annoying, because it probably means that it's compiler-specific too.

And that "syscall 018" looks very odd. I think that's sched_yield() on x86-64, which doesn't have any __put_user() cases at all..

Would you mind sending me the problematic vmlinux file in private (or, likely better - a pointer to some place I can download it, it's going to be huge).

          Linus
[git pull] drm fixes part 2 for 5.10-rc1
Hi Linus,

This should be the last round of things for rc1, a bunch of i915 fixes, some amdgpu, more font OOB fixes and one ttm fix just found reading code.

Dave.

drm-next-2020-10-23:
drm fixes (round two) for 5.10-rc1

fbcon/fonts:
- Two patches to prevent OOB access

ttm:
- fix for eviction value range check

amdgpu:
- Sienna Cichlid fixes
- MST manager resource leak fix
- GPU reset fix

amdkfd:
- Luxmark fix for Navi1x

i915:
- Tweak initial DPCD backlight.enabled value (Sean)
- Initialize reserved MOCS indices (Ayaz)
- Mark initial fb obj as WT on eLLC machines to avoid rcu lockup (Ville)
- Support parsing of oversize batches (Chris)
- Delay execlists processing for TGL (Chris)
- Use the active reference on the vma during error capture (Chris)
- Widen CSB pointer (Chris)
- Wait for CSB entries on TGL (Chris)
- Fix unwind for scratch page allocation (Chris)
- Exclude low pages of stolen memory (Chris)
- Force VT'd workarounds when running as a guest OS (Chris)
- Drop runtime-pm assert from vgpu io accessors (Chris)

The following changes since commit 40b99050455b9a6cb8faf15dcd41888312184720:

  Merge tag 'drm-intel-next-fixes-2020-10-15' of git://anongit.freedesktop.org/drm/drm-intel into drm-next (2020-10-19 09:21:59 +1000)

are available in the Git repository at:

  git://anongit.freedesktop.org/drm/drm tags/drm-next-2020-10-23

for you to fetch changes up to b45b6fbc671c60f56fd119c443e5570f83175928:

  Merge tag 'drm-intel-next-fixes-2020-10-22' of git://anongit.freedesktop.org/drm/drm-intel into drm-next (2020-10-23 09:52:18 +1000)

Andrey Grodzovsky (3):
      drm/amd/display: Revert "drm/amd/display: Fix a list corruption"
      drm/amd/display: Avoid MST manager resource leak.
      drm/amd/psp: Fix sysfs: cannot create duplicate filename

Ayaz A Siddiqui (1):
      drm/i915/gt: Initialize reserved and unspecified MOCS indices

Chris Wilson (10):
      drm/i915/gem: Support parsing of oversize batches
      drm/i915/gt: Delay execlist processing for tgl
      drm/i915/gt: Undo forced context restores after trivial preemptions
      drm/i915: Use the active reference on the vma while capturing
      drm/i915/gt: Widen CSB pointer to u64 for the parsers
      drm/i915/gt: Wait for CSB entries on Tigerlake
      drm/i915/gt: Onion unwind for scratch page allocation failure
      drm/i915: Exclude low pages (128KiB) of stolen from use
      drm/i915: Force VT'd workarounds when running as a guest OS
      drm/i915: Drop runtime-pm assert from vgpu io accessors

Dave Airlie (4):
      Merge tag 'drm-misc-next-fixes-2020-10-20' of git://anongit.freedesktop.org/drm/drm-misc into drm-next
      drm/ttm: fix eviction valuable range check.
      Merge tag 'amd-drm-fixes-5.10-2020-10-21' of git://people.freedesktop.org/~agd5f/linux into drm-next
      Merge tag 'drm-intel-next-fixes-2020-10-22' of git://anongit.freedesktop.org/drm/drm-intel into drm-next

Evan Quan (1):
      drm/amdgpu: correct the gpu reset handling for job != NULL case

Jay Cornwall (1):
      drm/amdkfd: Use same SQ prefetch setting as amdgpu

John Clements (1):
      Revert drm/amdgpu: disable sienna chichlid UMC RAS

Kenneth Feng (2):
      drm/amd/pm: fix pp_dpm_fclk
      drm/amd/pm: remove the average clock value in sysfs

Kevin Wang (2):
      drm/amd/swsmu: add missing feature map for sienna_cichlid
      drm/amd/swsmu: correct wrong feature bit mapping

Likun Gao (5):
      drm/amdgpu: add function to program pbb mode for sienna cichlid
      drm/amdgpu: add rlc iram and dram firmware support
      drm/amdgpu: update golden setting for sienna_cichlid
      drm/amd/pm: fix pcie information for sienna cichlid
      drm/amdgpu: correct the cu and rb info for sienna cichlid

Peilin Ye (2):
      docs: fb: Add font_6x8 to available built-in fonts
      Fonts: Support FONT_EXTRA_WORDS macros for font_6x8

Sean Paul (1):
      drm/i915/dp: Tweak initial dpcd backlight.enabled value

Ville Syrjälä (1):
      drm/i915: Mark ininitial fb obj as WT on eLLC machines to
Re: [PATCH v2 2/2] net: phy: adin: implement cable-test support
On Thu, 22 Oct 2020 10:45:51 +0300 Alexandru Ardelean wrote:
> The ADIN1300/ADIN1200 support cable diagnostics using TDR.
>
> The cable fault detection is automatically run on all four pairs looking at
> all combinations of pair faults by first putting the PHY in standby (clear
> the LINK_EN bit, PHY_CTRL_3 register, Address 0x0017) and then enabling the
> diagnostic clock (set the DIAG_CLK_EN bit, PHY_CTRL_1 register, Address
> 0x0012).
>
> Cable diagnostics can then be run (set the CDIAG_RUN bit in the
> CDIAG_RUN register, Address 0xBA1B). The results are reported for each pair
> in the cable diagnostics results registers (CDIAG_DTLD_RSLTS_0,
> CDIAG_DTLD_RSLTS_1, CDIAG_DTLD_RSLTS_2, and CDIAG_DTLD_RSLTS_3, Address
> 0xBA1D to Address 0xBA20).
>
> The distance to the first fault for each pair is reported in the cable
> fault distance registers (CDIAG_FLT_DIST_0, CDIAG_FLT_DIST_1,
> CDIAG_FLT_DIST_2, and CDIAG_FLT_DIST_3, Address 0xBA21 to Address 0xBA24).
>
> This change implements support for this using phylib's cable-test support.
>
> Signed-off-by: Alexandru Ardelean

# Form letter - net-next is closed

We have already sent a pull request for 5.10 and therefore net-next is closed for new drivers, features, and code refactoring.

Please repost when net-next reopens after 5.10-rc1 is cut.

(http://vger.kernel.org/~davem/net-next.html will not be up to date this time around, sorry about that).

RFC patches sent for review only are obviously welcome at any time.
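
The register sequence described in the quoted commit message can be sketched in plain C against a simulated register file. This is only an illustration, not the driver code from the patch: the register addresses come from the message, while the bit positions (LINK_EN, DIAG_CLK_EN, CDIAG_RUN_EN), the `sim_regs` array and every function name are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Register addresses taken from the quoted commit message. */
#define PHY_CTRL_1          0x0012
#define PHY_CTRL_3          0x0017
#define CDIAG_RUN           0xBA1B
#define CDIAG_DTLD_RSLTS_0  0xBA1D  /* CDIAG_DTLD_RSLTS_3 is at 0xBA20 */
#define CDIAG_FLT_DIST_0    0xBA21  /* CDIAG_FLT_DIST_3 is at 0xBA24 */

/* ASSUMPTION: bit positions chosen for illustration only. */
#define LINK_EN       (1u << 13)
#define DIAG_CLK_EN   (1u << 2)
#define CDIAG_RUN_EN  (1u << 0)

/* Hypothetical simulated register file standing in for real PHY access. */
static uint16_t sim_regs[0x10000];

static uint16_t reg_read(uint16_t addr) { return sim_regs[addr]; }
static void reg_write(uint16_t addr, uint16_t val) { sim_regs[addr] = val; }

struct pair_result {
    uint16_t status;    /* raw CDIAG_DTLD_RSLTS_n value */
    uint16_t distance;  /* raw CDIAG_FLT_DIST_n value */
};

/* Standby (clear LINK_EN), enable the diagnostic clock, start the run. */
void cable_test_start(void)
{
    reg_write(PHY_CTRL_3, reg_read(PHY_CTRL_3) & (uint16_t)~LINK_EN);
    reg_write(PHY_CTRL_1, reg_read(PHY_CTRL_1) | DIAG_CLK_EN);
    reg_write(CDIAG_RUN, reg_read(CDIAG_RUN) | CDIAG_RUN_EN);
}

/* Read the raw result and fault-distance registers for all four pairs. */
void cable_test_collect(struct pair_result out[4])
{
    assert(out != NULL);
    for (int pair = 0; pair < 4; pair++) {
        out[pair].status = reg_read(CDIAG_DTLD_RSLTS_0 + pair);
        out[pair].distance = reg_read(CDIAG_FLT_DIST_0 + pair);
    }
}
```

The sketch leaves out waiting for the run to finish and decoding the result bits, since the message does not describe either.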
Re: [PATCH v4] mm: memcg/slab: Stop reparented obj_cgroups from charging root
On Thu, Oct 22, 2020 at 10:25 AM Roman Gushchin wrote:
> [snip]
> >
> > Since bf4f059954dc ("mm: memcg/slab: obj_cgroup API") is in 5.9, I
> > think we can take this patch for 5.9 and 5.10 but keep Roman's cleanup
> > for 5.11.
> >
> > What does everyone think?
>
> I think we should use the link to the root approach both for stable backports
> and for 5.11+, to keep them in sync. The cleanup (always charging the root
> cgroup) is not directly related to this problem, and we can keep it for
> 5.11+ only.
>
> Thanks!

Roman, can you send the signed-off patch for the root linking for use_hierarchy=0?
Re: [PATCH] ext: EXT4_KUNIT_TESTS should depend on EXT4_FS instead of selecting it
On Wed, Oct 21, 2020 at 3:36 PM Theodore Y. Ts'o wrote:
>
> On Wed, Oct 21, 2020 at 02:16:56PM -0700, Randy Dunlap wrote:
> > On 10/21/20 2:15 PM, Brendan Higgins wrote:
> > > On Tue, Oct 20, 2020 at 12:37 AM Geert Uytterhoeven wrote:
> > >>
> > >> EXT4_KUNIT_TESTS selects EXT4_FS, thus enabling an optional feature the
> > >> user may not want to enable. Fix this by making the test depend on
> > >> EXT4_FS instead.
> > >>
> > >> Fixes: 1cbeab1b242d16fd ("ext4: add kunit test for decoding extended timestamps")
> > >> Signed-off-by: Geert Uytterhoeven
> > >
> > > If I remember correctly, having EXT4_KUNIT_TESTS select EXT4_FS was
> > > something that Ted specifically requested, but I don't have any strong
> > > feelings on it either way.
> >
> > omg, please No. depends on is the right fix here.
>
> So my requirement which led to that particular request is to keep what
> needs to be placed in .kunitconfig to a small and reasonable set.
>
> Per Documentation/dev-tools/kunit, we start by:
>
>     cd $PATH_TO_LINUX_REPO
>     cp arch/um/configs/kunit_defconfig .kunitconfig
>
> we're then supposed to add whatever Kunit tests we want to enable, to wit:
>
>     CONFIG_EXT4_KUNIT_TESTS=y
>
> so that .kunitconfig would look like this:
>
>     CONFIG_KUNIT=y
>     CONFIG_KUNIT_TEST=y
>     CONFIG_KUNIT_EXAMPLE_TEST=y
>     CONFIG_EXT4_KUNIT_TESTS=y
>
> ... and then you should be able to run:
>
>     ./tools/testing/kunit/kunit.py run
>
> ... and have the kunit tests run. I would *not* like to have to put a
> huge long list of CONFIG_* dependencies into the .kunitconfig file.
>
> I don't particularly care how this gets achieved, but please think
> about how to make it easy for a kernel developer to run a specific set
> of subsystem unit tests. (In fact, being able to do something like
> "kunit.py run fs/ext4 fs/jbd2" or maybe "kunit.py run fs/..." would be
> *great*. No need to fuss with hand editing the .kunitconfig file at
> all would be **wonderful**.

So you, me, Luis, David, and a whole bunch of other people have been thinking about this problem for a while. What if we just put kunitconfig fragments in directories alongside the test files they enable?

For example, we could add a file to fs/ext4/kunitconfig which contains:

CONFIG_EXT4_FS=y
CONFIG_EXT4_KUNIT_TESTS=y

We could do something similar in fs/jbd2, etc.

Obviously some logically separate KUnit tests (different maintainers, different Kconfig symbols, etc) reside in the same directory; for these we could name the kunitconfig file something like lib/list-test.kunitconfig (not a great example because lists are always built into Linux), but you get the idea.

Then like Ted suggested, if you call kunit.py run foo/bar, then

- if bar is a directory, then kunit.py will look for foo/bar/kunitconfig
- if bar is a file ending with .kunitconfig like foo/bar.kunitconfig, then it will use that kunitconfig
- if bar is '...' (foo/...) then kunit.py will look for all kunitconfigs underneath foo.

Once all the kunitconfigs have been resolved, they will be merged into the .kunitconfig. If they can be successfully merged together, the new .kunitconfig will then continue to function as it currently does.

What do people think about this?
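
The three lookup rules proposed above can be sketched as a small classifier. This is a hypothetical illustration in C, not code from kunit.py: the names `kc_kind` and `kunitconfig_resolve` are invented, and the sketch assumes the argument is classified purely by its spelling (a real tool would presumably check the filesystem). Searching the tree and merging the fragments are not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

enum kc_kind {
    KC_DIRECTORY,   /* foo/bar             -> foo/bar/kunitconfig        */
    KC_FILE,        /* foo/bar.kunitconfig -> used as given              */
    KC_RECURSIVE    /* foo/...             -> all kunitconfigs under foo */
};

/* Return 1 if s ends with suffix, else 0. */
static int ends_with(const char *s, const char *suffix)
{
    size_t ls = strlen(s), lx = strlen(suffix);

    return ls >= lx && strcmp(s + ls - lx, suffix) == 0;
}

/*
 * Classify one argument and write the path to inspect into out: the
 * fragment file itself, or the search root for KC_RECURSIVE.
 */
enum kc_kind kunitconfig_resolve(const char *arg, char *out, size_t outlen)
{
    assert(arg != NULL && out != NULL && outlen > 0);

    if (ends_with(arg, "/...")) {
        /* Drop the trailing "/..." to obtain the directory to search. */
        snprintf(out, outlen, "%.*s", (int)(strlen(arg) - 4), arg);
        return KC_RECURSIVE;
    }
    if (ends_with(arg, ".kunitconfig")) {
        snprintf(out, outlen, "%s", arg);
        return KC_FILE;
    }
    /* Anything else is assumed to be a directory holding a fragment. */
    snprintf(out, outlen, "%s/kunitconfig", arg);
    return KC_DIRECTORY;
}
```

With this sketch, "fs/ext4" maps to fs/ext4/kunitconfig, "lib/list-test.kunitconfig" is used as given, and "fs/..." yields the search root fs.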
Re: mmstress[1309]: segfault at 7f3d71a36ee8 ip 00007f3d77132bdf sp 00007f3d71a36ee8 error 4 in libc-2.27.so[7f3d77058000+1aa000]
On Thu, Oct 22, 2020 at 1:55 PM Naresh Kamboju wrote:
>
> The bad commit points to,
>
> commit d55564cfc222326e944893eff0c4118353e349ec
> x86: Make __put_user() generate an out-of-line call
>
> I have reverted this single patch and confirmed the reported
> problem is not seen anymore.

Thanks. Very funky, but thanks. I've been running that commit on my machine for over half a year, and it still looks "trivially correct" to me, but let me go look at it one more time. Can't argue with a reliable bisect and revert..

Linus
linux-next: build warning after merge of the block tree
Hi all,

After merging the block tree, today's linux-next build (KCONFIG_NAME) produced this warning:

fs/io_uring.c: In function 'loop_rw_iter':
fs/io_uring.c:3141:21: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
 3141 |    iovec.iov_base = (void __user *) req->rw.addr;
      |                     ^

Introduced by commit

  a5371db1e38d ("io_uring: make loop_rw_iter() use original user supplied pointers")

--
Cheers,
Stephen Rothwell
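
For readers unfamiliar with this warning: it appears when a 64-bit integer is cast directly to a pointer on a build where pointers are 32 bits wide. A minimal user-space sketch of the usual remedy follows, assuming a hypothetical `struct fake_rw` in place of the real request structure. Casting through `uintptr_t` makes the width change explicit; the kernel's `u64_to_user_ptr()` helper wraps the same idea. This is not presented as the fix that was applied to fs/io_uring.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a request that stores a user address as u64. */
struct fake_rw {
    uint64_t addr;
};

/*
 * Convert the stored 64-bit value to a pointer. Going through uintptr_t
 * makes the possible narrowing explicit, so a 32-bit build does not warn
 * with -Wint-to-pointer-cast.
 */
void *rw_addr_to_ptr(const struct fake_rw *rw)
{
    assert(rw != NULL);
    return (void *)(uintptr_t)rw->addr;
}

/* Store a pointer in the 64-bit field; round-trips with the helper above. */
void rw_set_addr(struct fake_rw *rw, void *p)
{
    assert(rw != NULL);
    rw->addr = (uint64_t)(uintptr_t)p;
}
```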
Re: [PATCH v1 0/2] mm: cma: introduce a non-blocking version of cma_release()
On 22 Oct 2020, at 18:53, Roman Gushchin wrote:

> This small patchset introduces a non-blocking version of cma_release()
> and simplifies the code in hugetlbfs, where previously we had to
> temporarily drop hugetlb_lock around the cma_release() call.
>
> It should help Zi Yan on his work on 1 GB THPs: splitting a gigantic
> THP under a memory pressure requires a cma_release() call. If it's

Thanks for the patch. But during 1GB THP split, we only clear the bitmaps without releasing the pages. Also in cma_release_nowait(), the first page in the allocated CMA region is reused to store struct cma_clear_bitmap_work, but the same method cannot be used during THP split, since the first page is still in-use. We might need to allocate some new memory for struct cma_clear_bitmap_work, which might not be successful under memory pressure. Any suggestion on where to store struct cma_clear_bitmap_work when I only want to clear bitmap without releasing the pages?

Thanks.

--
Best Regards,
Yan Zi
Re: [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache
On Thu, Oct 22, 2020 at 5:32 PM Kees Cook wrote:
> I've been going back and forth on this, and I think what I've settled
> on is I'd like to avoid new CONFIG dependencies just for this feature.
> Instead, how about we just fill in SECCOMP_NATIVE and SECCOMP_COMPAT
> for all the HAVE_ARCH_SECCOMP_FILTER architectures, and then the
> cache reporting can be cleanly tied to CONFIG_SECCOMP_FILTER? It
> should be relatively simple to extract those details and make
> SECCOMP_ARCH_{NATIVE,COMPAT}_NAME part of the per-arch enabling patches?

Hmm. So I could enable the cache logic to every architecture (one patch per arch) that does not have the sparse syscall numbers, and then have the proc reporting after the arch patches? I could do that. I don't have test machines to run anything other than x86_64 or ia32, so they will need a closer look by people more familiar with those arches.

> I'd still like to get more specific workload performance numbers too.
> The microbenchmark is nice, but getting things like build times under
> docker's default seccomp filter, etc would be lovely. I've almost gotten
> there, but my benchmarks are still really noisy and CPU isolation
> continues to frustrate me. :)

Ok, let me know if I can help.

YiFei Zhu
Re: [PATCH 3/6] fs: Convert block_read_full_page to be synchronous
On Thu, Oct 22, 2020 at 10:22:25PM +0100, Matthew Wilcox (Oracle) wrote:
> +static int readpage_submit_bhs(struct page *page, struct blk_completion *cmpl,
> +		unsigned int nr, struct buffer_head **bhs)
> +{
> +	struct bio *bio = NULL;
> +	unsigned int i;
> +	int err;
> +
> +	blk_completion_init(cmpl, nr);
> +
> +	for (i = 0; i < nr; i++) {
> +		struct buffer_head *bh = bhs[i];
> +		sector_t sector = bh->b_blocknr * (bh->b_size >> 9);
> +		bool same_page;
> +
> +		if (buffer_uptodate(bh)) {
> +			end_buffer_async_read(bh, 1);
> +			blk_completion_sub(cmpl, BLK_STS_OK, 1);
> +			continue;
> +		}
> +		if (bio) {
> +			if (bio_end_sector(bio) == sector &&
> +			    __bio_try_merge_page(bio, bh->b_page, bh->b_size,
> +					bh_offset(bh), &same_page))
> +				continue;
> +			submit_bio(bio);
> +		}
> +		bio = bio_alloc(GFP_NOIO, 1);
> +		bio_set_dev(bio, bh->b_bdev);
> +		bio->bi_iter.bi_sector = sector;
> +		bio_add_page(bio, bh->b_page, bh->b_size, bh_offset(bh));
> +		bio->bi_end_io = readpage_end_bio;
> +		bio->bi_private = cmpl;
> +		/* Take care of bh's that straddle the end of the device */
> +		guard_bio_eod(bio);
> +	}

The following is needed to set the bio encryption context for the '-o inlinecrypt' case on ext4:

diff --git a/fs/buffer.c b/fs/buffer.c
index 95c338e2b99c..546a08c5003b 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2237,6 +2237,7 @@ static int readpage_submit_bhs(struct page *page, struct blk_completion *cmpl,
 			submit_bio(bio);
 		}
 		bio = bio_alloc(GFP_NOIO, 1);
+		fscrypt_set_bio_crypt_ctx_bh(bio, bh, GFP_NOIO);
 		bio_set_dev(bio, bh->b_bdev);
 		bio->bi_iter.bi_sector = sector;
 		bio_add_page(bio, bh->b_page, bh->b_size, bh_offset(bh));
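
The sector arithmetic and the contiguity test in the quoted loop can be looked at in isolation. The following is a rough user-space sketch, not kernel code: `struct fake_bh`, `bh_first_sector()` and `count_requests()` are hypothetical names, sectors are assumed to be 512 bytes (matching the `>> 9` shift), and it assumes a merge always succeeds when the sectors are contiguous, which the real `__bio_try_merge_page()` call does not guarantee. Skipping of already up-to-date buffers is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the two buffer_head fields the loop uses. */
struct fake_bh {
    uint64_t blocknr;   /* block number, counted in blocks of 'size' bytes */
    uint32_t size;      /* block size in bytes, a multiple of 512 */
};

/* First 512-byte sector of the buffer: b_blocknr * (b_size >> 9). */
uint64_t bh_first_sector(const struct fake_bh *bh)
{
    assert(bh != NULL);
    return bh->blocknr * (bh->size >> 9);
}

/*
 * Count the separate requests that would be built if a buffer joins the
 * current request whenever it starts exactly where that request ends.
 */
unsigned int count_requests(const struct fake_bh *bhs, size_t nr)
{
    unsigned int requests = 0;
    uint64_t end_sector = 0;

    assert(bhs != NULL || nr == 0);
    for (size_t i = 0; i < nr; i++) {
        uint64_t sector = bh_first_sector(&bhs[i]);

        if (requests == 0 || sector != end_sector)
            requests++;     /* not contiguous: start a new request */
        end_sector = sector + (bhs[i].size >> 9);
    }
    return requests;
}
```

For example, three 1 KiB buffers at blocks 10, 11 and 13 start at sectors 20, 22 and 26, so the first two are contiguous and the third begins a second request.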