Re: [PATCH 0/10] psi: pressure stall information for CPU, memory, and IO v2
On Mon, Jul 30, 2018 at 07:59:36PM +0200, Pavel Machek wrote:
> Its true I have no interest in psi. But I'm trying to use same kernel
> you are trying to "improve" and I was confused enough by seeing
> "CONFIG_PSI". And yes, my association was "pounds per square inch" and
> "what is it doing here".

Read the help message. If that's not enough, we sure can improve it.

> So I'm asking you to change the name.
>
> USB is well known acronym, so it is okay to have CONFIG_USB. PSI is
> also well known -- but means something else.
>
> And the code kind-of acknowledges that acronym is unknown, by having
> /proc/pressure.

Your momentary confusion isn't the only criterion.

Thanks.

--
tejun
Re: [PATCH 0/10] psi: pressure stall information for CPU, memory, and IO v2
Hello,

On Mon, Jul 30, 2018 at 10:54:05AM -0700, Randy Dunlap wrote:
> I'd say he's trying to make something that is readable and easier to
> understand for users.

Sure, it's perfectly fine to make those suggestions and discuss them, but
the counterpoints have already been discussed (e.g. PSI is a known acronym
associated with pressure, and the internal symbols all use it for brevity
and uniqueness). There's no clear technically winning choice here, and
it's a decision of relatively low importance given that it's confined to
the kernel config. I can't see any merit in turning it into a last-word
match.

Thanks.

--
tejun
Re: [PATCH 0/10] psi: pressure stall information for CPU, memory, and IO v2
Hello,

On Mon, Jul 30, 2018 at 07:39:40PM +0200, Pavel Machek wrote:
> > I'd rather have the internal config symbol match the naming scheme in
> > the code, where psi is a shorter, unique token as compared to e.g.
> > pressure, press, prsr, etc.
>
> I'd do "pressure", really. Yes, psi is shorter, but I'd say that
> length is not really important there.

This is extreme bikeshedding without any relevance. You can make
suggestions, but please lay it to rest. There isn't any general consensus
against the current name, and you're just trying to push your favorite
name without proper justification after contributing nothing to the
project. Please stop.

Thanks.

--
tejun
Re: [PATCH 0/2] ata: libahci: devslp fixes
On Mon, Jul 30, 2018 at 05:26:45PM +0200, Hans de Goede wrote:
> Hi,
>
> On 30-07-18 17:22, Tejun Heo wrote:
> > On Mon, Jul 30, 2018 at 08:15:47AM -0700, Srinivas Pandruvada wrote:
> > > Hi Tejan,
> > >
> > > On Mon, 2018-07-02 at 12:01 -0700, Srinivas Pandruvada wrote:
> > > > Some minor fixes to be able to correctly set devslp register
> > > > to optimize power.
> > > >
> > > > Srinivas Pandruvada (2):
> > > >   ata: libahci: Correct setting of DEVSLP register
> > > >   ata: libahci: Allow reconfigure of DEVSLP register
> > > >
> > > Are you applying this series?
> >
> > I was waiting for Hans's reviews. Hans, what do you think?
>
> Ah I missed that this was another series. With the caveat that
> I do not really know that much about devslp, both patches
> seem sensible to me, so both are:
>
> Reviewed-by: Hans de Goede

Applied 1-2 to libata/for-4.19.

Thanks.

--
tejun
Re: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().
Hello,

On Tue, Jul 31, 2018 at 12:25:04AM +0900, Tetsuo Handa wrote:
> WQ_MEM_RECLAIM guarantees that "struct task_struct" is preallocated. But
> WQ_MEM_RECLAIM does not guarantee that the pending work is started as soon
> as an item was queued. Same rule applies to both WQ_MEM_RECLAIM workqueues
> and !WQ_MEM_RECLAIM workqueues regarding when to start a pending work (i.e.
> when schedule_timeout_*() is called).
>
> Is this correct?

WQ_MEM_RECLAIM guarantees that there's always gonna be at least one
kworker running the workqueue. But all per-cpu kworkers are subject to
concurrency-limited execution - ie. if any per-cpu kworker is actively
running on a cpu, no further kworkers will be scheduled on that cpu.

> > We can add timeout mechanism to workqueue so that it
> > kicks off other kworkers if one of them is in running state for too
> > long, but idk, if there's an indefinite busy loop condition in kernel
> > threads, we really should get rid of them and hung task watchdog is
> > pretty effective at finding these cases (at least with preemption
> > disabled).
>
> Currently the page allocator has a path which can loop forever with
> only cond_resched().

Yeah, workqueue can choke on things like that, and a kthread
busy-looping indefinitely doesn't do anybody any good.

Thanks.

--
tejun
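The concurrency-limiting behavior described above can be modeled in plain
userspace C. This is a toy sketch, not kernel code: `cpu_pool`,
`queue_work`, and `run_pool` are illustrative stand-ins for the real
workqueue machinery. The point it demonstrates is the one in the mail: as
long as one per-cpu worker keeps running without yielding, nothing queued
behind it ever executes.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of a per-cpu worker pool with one concurrency slot and a
 * FIFO queue. In the real workqueue code, another kworker is only woken
 * when the running one sleeps; here a work item either completes or
 * "busy-loops", in which case the pool makes no further progress. */
#define MAX_WORK 8

struct work {
	int (*fn)(void);	/* returns 1 when done, 0 if it would busy-loop */
	int ran;
};

struct cpu_pool {
	struct work *queue[MAX_WORK];
	int head, tail;
};

static void queue_work(struct cpu_pool *pool, struct work *w)
{
	pool->queue[pool->tail++ % MAX_WORK] = w;
}

/* Run queued items in order; stop at the first one that never yields,
 * mirroring how one busy-looping kworker stalls the whole per-cpu pool. */
static void run_pool(struct cpu_pool *pool)
{
	while (pool->head < pool->tail) {
		struct work *w = pool->queue[pool->head % MAX_WORK];

		if (!w->fn())
			return;	/* busy loop: nothing behind it ever runs */
		w->ran = 1;
		pool->head++;
	}
}

static int finishes(void)   { return 1; }
static int busy_loops(void) { return 0; }
```

Queue a well-behaved item, a busy-looping one, and another well-behaved
one: the first runs, the second stalls the pool, and the third never gets
a chance - which is why the mail argues such loops should be fixed rather
than worked around with a workqueue timeout.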
Re: [PATCH 0/2] ata: libahci: devslp fixes
On Mon, Jul 30, 2018 at 08:15:47AM -0700, Srinivas Pandruvada wrote:
> Hi Tejan,
>
> On Mon, 2018-07-02 at 12:01 -0700, Srinivas Pandruvada wrote:
> > Some minor fixes to be able to correctly set devslp register
> > to optimize power.
> >
> > Srinivas Pandruvada (2):
> >   ata: libahci: Correct setting of DEVSLP register
> >   ata: libahci: Allow reconfigure of DEVSLP register
> >
> Are you applying this series?

I was waiting for Hans's reviews. Hans, what do you think?

Thanks.

--
tejun
Re: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().
Hello,

On Mon, Jul 30, 2018 at 04:46:47PM +0200, Michal Hocko wrote:
> On Mon 30-07-18 23:34:23, Tetsuo Handa wrote:
> > On 2018/07/30 18:32, Michal Hocko wrote:
> [...]
> > > This one is waiting for draining and we are in mm_percpu_wq WQ context
> > > which has its rescuer so no other activity can block us for ever. So
> > > this certainly shouldn't deadlock. It can be dead slow but well, this is
> > > what you will get when you shoot your system to death.
> >
> > We need schedule_timeout_*() to allow such WQ_MEM_RECLAIM workqueues to
> > wake up. (Tejun, is my understanding correct?) Lack of schedule_timeout_*()
> > does block WQ_MEM_RECLAIM workqueues forever.
>
> Hmm. This doesn't match my understanding of what WQ_MEM_RECLAIM actually
> guarantees. If you are right then the whole thing sounds quite fragile
> to me TBH.

Workqueue doesn't think the cpu is stalled as long as one of the per-cpu
kworkers is running. The assumption is that kernel threads are not
supposed to busy-loop indefinitely (and they really shouldn't).

We can add a timeout mechanism to workqueue so that it kicks off other
kworkers if one of them stays in the running state for too long, but idk,
if there's an indefinite busy-loop condition in kernel threads, we really
should get rid of it, and the hung task watchdog is pretty effective at
finding these cases (at least with preemption disabled).

Thanks.

--
tejun
Re: [PATCH v2 0/2] ata: ahci: Enable DEVSLP by default on SLP_S0 support
On Fri, Jul 27, 2018 at 01:47:01PM -0700, Srinivas Pandruvada wrote:
> One of the requirement for modern x86 system to enter lowest power mode
> (SLP_S0) is SATA IP block to be off. This is true even when
> platform is suspended to idle and not only in opportunistic (runtime)
> suspend.
> This series is to enable DEVSLP by default.

Applied 1-2 to libata/for-4.19.

Thanks.

--
tejun
Re: cgroups iptables-restor: vmalloc: allocation failure
On Wed, Jul 25, 2018 at 02:50:01PM +0300, Georgi Nikolov wrote:
> I posted a kernel bug https://bugzilla.kernel.org/show_bug.cgi?id=200651 and
> i hope this is the correct place to discuss this.

That's just a NORETRY vmalloc allocation failing under memory pressure.
Doesn't have much to do with cgroups except that there are also a bunch
of cgroups consuming memory.

Thanks.

--
tejun
Re: [PATCH] delayacct: Fix crash in delayacct_blkio_end() after delayacct init failure
Hello, Andrew.

On Tue, Jul 24, 2018 at 12:29:13PM -0700, Andrew Morton wrote:
> How did you make this happen, btw? Fault injection, or did a small
> GFP_KERNEL allocation fail?

We have a group of machines which are pushing memory really hard and
this actually triggered in prod on several of them.

Thanks.

--
tejun
[PATCH] delayacct: Fix crash in delayacct_blkio_end() after delayacct init failure
While forking, if delayacct init fails due to memory shortage, it
continues, expecting all delayacct users to check the task->delays
pointer against NULL before dereferencing it, which all of them used to
do.

c96f5471ce7d ("delayacct: Account blkio completion on the correct
task"), while updating delayacct_blkio_end() to take the target task
instead of always using %current, made the function test NULL on
%current->delays and then continue to operate on @p->delays. If
%current succeeded init while @p didn't, it leads to the following
crash.

  BUG: unable to handle kernel NULL pointer dereference at 0004
  IP: __delayacct_blkio_end+0xc/0x40
  PGD 801fd07e1067 P4D 801fd07e1067 PUD 1fcffbb067 PMD 0
  Oops: [#1] SMP PTI
  CPU: 4 PID: 25774 Comm: QIOThread0 Not tainted 4.16.0-9_fbk1_rc2_1180_g6b593215b4d7 #9
  Hardware name: Quanta Leopard ORv2-DDR4/Leopard ORv2-DDR4, BIOS F06_3B12 08/17/2017
  RIP: 0010:__delayacct_blkio_end+0xc/0x40
  RSP: :881fff703bf8 EFLAGS: 00010086
  RAX: 881f1ec8b800 RBX: 8804f735cd54 RCX: 881fff703cb0
  RDX: 0002 RSI: 0003 RDI:
  RBP: R08: R09: 881fff703cc0
  R10: 1000 R11: 881fd3f73d00 R12: 8804f735c600
  R13: R14: 001d R15: 881fff703cb0
  FS: 7f5003f7d700() GS:881fff70() knlGS:
  CS: 0010 DS: ES: CR0: 80050033
  CR2: 0004 CR3: 001f401a6006 CR4: 003606e0
  DR0: DR1: DR2: DR3:
  DR6: fffe0ff0 DR7: 0400
  Call Trace:
   try_to_wake_up+0x2c0/0x600
   autoremove_wake_function+0xe/0x30
   __wake_up_common+0x74/0x120
   wake_up_page_bit+0x9c/0xe0
   mpage_end_io+0x27/0x70
   blk_update_request+0x78/0x2c0
   scsi_end_request+0x2c/0x1e0
   scsi_io_completion+0x20b/0x5f0
   blk_mq_complete_request+0xa2/0x100
   ata_scsi_qc_complete+0x79/0x400
   ata_qc_complete_multiple+0x86/0xd0
   ahci_handle_port_interrupt+0xc9/0x5c0
   ahci_handle_port_intr+0x54/0xb0
   ahci_single_level_irq_intr+0x3b/0x60
   __handle_irq_event_percpu+0x43/0x190
   handle_irq_event_percpu+0x20/0x50
   handle_irq_event+0x2a/0x50
   handle_edge_irq+0x80/0x1c0
   handle_irq+0xaf/0x120
   do_IRQ+0x41/0xc0
   common_interrupt+0xf/0xf

Fix it by updating delayacct_blkio_end() to check @p->delays instead.

Signed-off-by: Tejun Heo
Reported-and-debugged-by: Dave Jones
Cc: Josh Snyder
Fixes: c96f5471ce7d ("delayacct: Account blkio completion on the correct task")
Cc: sta...@vger.kernel.org # v4.15+
---
 include/linux/delayacct.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/delayacct.h b/include/linux/delayacct.h
index e6c0448ebcc7..31c865d1842e 100644
--- a/include/linux/delayacct.h
+++ b/include/linux/delayacct.h
@@ -124,7 +124,7 @@ static inline void delayacct_blkio_start(void)
 static inline void delayacct_blkio_end(struct task_struct *p)
 {
-	if (current->delays)
+	if (p->delays)
 		__delayacct_blkio_end(p);
 	delayacct_clear_flag(DELAYACCT_PF_BLKIO);
 }
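The bug pattern - guarding a dereference of one pointer with a NULL
check on a *different* pointer - is easy to reproduce in plain userspace
C. The structures and helper names below are illustrative stand-ins for
the kernel code, not the real definitions:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for task_struct and task_delay_info; names are illustrative. */
struct task_delay_info { long blkio_delay; };
struct task { struct task_delay_info *delays; };

static long observed_delay;

/* Unconditionally dereferences p->delays, like __delayacct_blkio_end():
 * crashes if p->delays is NULL, so the caller must check *p*, not current. */
static void __delayacct_blkio_end_model(struct task *p)
{
	observed_delay = p->delays->blkio_delay;
}

/* Fixed version of the inline wrapper: test the same task that gets
 * dereferenced. The buggy version tested current_task->delays here,
 * which proves nothing about p. */
static void delayacct_blkio_end_fixed(struct task *current_task, struct task *p)
{
	(void)current_task;
	if (p->delays)
		__delayacct_blkio_end_model(p);
}
```

With a `current` that passed delayacct init and a target task that did
not, the fixed wrapper simply skips the accounting instead of
dereferencing NULL - exactly the failure mode in the oops above.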
Re: [PATCH] Revert "ata: ahci_platform: convert kcalloc to devm_kcalloc"
On Tue, Jul 17, 2018 at 10:49:35AM +, Corentin Labbe wrote:
> Since ahci_platform_put_resources() use target_pwrs after "devm_" freed
> it, we cannot use devm_kcalloc for allocating target_pwrs.
>
> This reverts commit bd0038b1b4f499d814d8f33a55b1df5ea6cf3b85.
>
> Reported-by: Mikko Perttunen
> Signed-off-by: Corentin Labbe

Applied to libata/for-4.19. Sorry about the delay.

Thanks.

--
tejun
Re: [PATCH v2 08/12] sched/core: uclamp: extend cpu's cgroup controller
Hello, Patrick.

On Mon, Jul 23, 2018 at 06:22:15PM +0100, Patrick Bellasi wrote:
> However, the "best effort" bandwidth control we have for CFS and RT
> can be further improved if, instead of just looking at time spent on
> CPUs, we provide some more hints to the scheduler to know at which
> min/max "MIPS" we want to consume the (best effort) time we have been
> allocated on a CPU.
>
> Such a simple extension is still quite useful to satisfy many use-cases
> we have, mainly on mobile systems, like the ones I've described in the
> "Newcomer's Short Abstract (Updated)" section of the cover letter:
>
> https://lore.kernel.org/lkml/20180716082906.6061-1-patrick.bell...@arm.com/T/#u

So, that's all completely fine, but then let's please not give it a name
which doesn't quite match what it does. We can just call it e.g. cpufreq
range control.

> > So, there are fundamental discrepancies between
> > description+interface vs. what it actually does.
>
> Perhaps then I should just change the description to make it less
> generic...

I think so, along with the interface itself.

> > I really don't think that's something we can fix up later.
>
> ... since, really, I don't think we can get to the point to extend
> later this interface to provide the strict bandwidth enforcement you
> are thinking about.

That's completely fine. The interface just has to match what's
implemented.

...

> > and what you're describing inherently breaks the delegation model.
>
> What I describe here is just an additional hint to the scheduler which
> enriches the above described model. Provided A and B are already
> satisfied, when a task gets a chance to run it will be executed at a
> min/max configured frequency. That's really all... there is no
> additional impact on "resource allocation".

So, if it's a cpufreq range controller, it'd have something like
cpu.freq.min and cpu.freq.max, where min defines the maximum minimum
cpufreq its descendants can get and max defines the maximum cpufreq
allowed in the subtree. For an example, please refer to how memory.min
and memory.max are defined.

Thanks.

--
tejun
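The hierarchical semantics being suggested can be sketched in a few
lines of C. Note that cpu.freq.min and cpu.freq.max are only names
proposed in this mail, not an existing kernel interface, and the
propagation rule below is an illustrative model of the
memory.min/memory.max-style behavior described, not actual kernel code:

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the proposed clamping: a child's effective min can never
 * exceed any ancestor's min ("maximum minimum"), and its effective max
 * can never exceed any ancestor's max. */
struct freq_cgroup {
	const struct freq_cgroup *parent;	/* NULL for the root */
	int freq_min;				/* requested cpu.freq.min (kHz) */
	int freq_max;				/* requested cpu.freq.max (kHz) */
};

static int effective_min(const struct freq_cgroup *cg)
{
	int min = cg->freq_min;

	for (const struct freq_cgroup *p = cg->parent; p; p = p->parent)
		if (p->freq_min < min)
			min = p->freq_min;	/* ancestors cap the minimum */
	return min;
}

static int effective_max(const struct freq_cgroup *cg)
{
	int max = cg->freq_max;

	for (const struct freq_cgroup *p = cg->parent; p; p = p->parent)
		if (p->freq_max < max)
			max = p->freq_max;	/* ancestors cap the maximum */
	return max;
}
```

A child asking for a higher min or max than its parent simply gets the
parent's value, so delegation stays safe: a delegated subtree can only
tighten the clamps it was given, never escape them.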
Re: [PATCH v2 08/12] sched/core: uclamp: extend cpu's cgroup controller
Hello,

On Mon, Jul 16, 2018 at 09:29:02AM +0100, Patrick Bellasi wrote:
> The cgroup's CPU controller allows to assign a specified (maximum)
> bandwidth to the tasks of a group. However this bandwidth is defined and
> enforced only on a temporal base, without considering the actual
> frequency a CPU is running on. Thus, the amount of computation completed
> by a task within an allocated bandwidth can be very different depending
> on the actual frequency the CPU is running that task.
> The amount of computation can be affected also by the specific CPU a
> task is running on, especially when running on asymmetric capacity
> systems like Arm's big.LITTLE.

One basic problem I have with this patchset is that what's being
described is way more generic than what actually got implemented.
What's described is computation bandwidth control but what's implemented
is just frequency clamping. So, there are fundamental discrepancies
between description+interface vs. what it actually does. I really don't
think that's something we can fix up later.

> These attributes:
>
> a) are available only for non-root nodes, both on default and legacy
>    hierarchies
> b) do not enforce any constraints and/or dependency between the parent
>    and its child nodes, thus relying on the delegation model and
>    permission settings defined by the system management software

cgroup does host attributes which only concern the cgroup itself and
thus don't need any hierarchical behaviors on their own, but what's
being implemented does control resource allocation, and what you're
describing inherently breaks the delegation model.

> c) allow to (eventually) further restrict task-specific clamps defined
>    via sched_setattr(2)

Thanks.

--
tejun
Re: [PATCH kselftest-next] Add cgroup core selftests
On Wed, Jul 18, 2018 at 07:33:58PM +0200, Claudio wrote:
> This commit adds tests for some of the core functionalities
> of cgroups v2.
>
> The commit adds tests for some core principles of the cgroup v2 API:

Acked-by: Tejun Heo

Thanks, Claudio.

--
tejun
Re: [PATCH 09/10] psi: cgroup support
On Thu, Jul 12, 2018 at 01:29:41PM -0400, Johannes Weiner wrote:
> On a system that executes multiple cgrouped jobs and independent
> workloads, we don't just care about the health of the overall system,
> but also that of individual jobs, so that we can ensure individual job
> health, fairness between jobs, or prioritize some jobs over others.
>
> This patch implements pressure stall tracking for cgroups. In kernels
> with CONFIG_PSI=y, cgroup2 groups will have cpu.pressure,
> memory.pressure, and io.pressure files that track aggregate pressure
> stall times for only the tasks inside the cgroup.
>
> Signed-off-by: Johannes Weiner

Acked-by: Tejun Heo

Please feel free to route with the rest of the patchset.

Thanks.

--
tejun
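The pressure files mentioned above export, per resource, lines of the
form `some avg10=0.00 avg60=0.00 avg300=0.00 total=0`. That exact line
format is an assumption here, based on how the PSI interface eventually
landed upstream, so treat this as a sketch of a userspace consumer
rather than a definitive parser:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Parsed form of one line from a PSI file such as cpu.pressure or
 * /proc/pressure/memory. */
struct psi_line {
	char kind[8];			/* "some" or "full" */
	double avg10, avg60, avg300;	/* recent stall percentages */
	unsigned long long total;	/* cumulative stall time, usecs */
};

/* Returns 0 on success, -1 if the line doesn't match the expected
 * format. Assumed format: "some avg10=X avg60=Y avg300=Z total=N". */
static int parse_psi_line(const char *line, struct psi_line *out)
{
	if (sscanf(line, "%7s avg10=%lf avg60=%lf avg300=%lf total=%llu",
		   out->kind, &out->avg10, &out->avg60, &out->avg300,
		   &out->total) != 5)
		return -1;
	if (strcmp(out->kind, "some") != 0 && strcmp(out->kind, "full") != 0)
		return -1;
	return 0;
}
```

A monitoring daemon would read each cgroup's pressure files periodically
and feed the `total` deltas (or the averages directly) into its
job-health and prioritization logic.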
Re: [PATCH] ata: Only output sg element mapped number in verbose debug
On Thu, Jul 12, 2018 at 06:31:03PM +0200, Paul Menzel wrote:
> Date: Sun, 8 Jul 2018 09:18:21 +0200
>
> Defining `ATA_DEBUG` there are a lot of messages like below in the log.
>
> [ 16.345472] ata_sg_setup: 1 sg elements mapped
>
> As that is too verbose, only output these messages in verbose debug.
>
> Signed-off-by: Paul Menzel

Applied to libata/for-4.19.

Thanks.

--
tejun
Re: [PATCH] ata: Guard ata_scsi_dump_cdb() by ATA_VERBOSE_DEBUG
On Thu, Jul 12, 2018 at 06:29:44PM +0200, Paul Menzel wrote: > Date: Sun, 8 Jul 2018 09:11:34 +0200 > > Defining `ATA_DEBUG` nothing can be really seen, as the log is spammed > with CDB messages. > > Therefore, guard the print by `ATA_VERBOSE_DEBUG`. > > Signed-off-by: Paul Menzel Applied to libata/for-4.19. Thanks. -- tejun
Re: [PATCH 0/3] ata: ahci_platform: minor fixes
On Thu, Jul 12, 2018 at 11:41:28AM +, Corentin Labbe wrote: > Hello > > This patchset fixes some minor problems found when working on supporting > allwinner R40 AHCI. > > Regards > > Corentin Labbe (3): > ata: ahci_platform: correct parameter documentation for > ahci_platform_shutdown > ata: ahci_platform: convert kzallloc to kcalloc > ata: ahci_platform: convert kcalloc to devm_kcalloc Applied all three to libata/for-4.19 with Hans's ack added. Thanks. -- tejun
[GIT PULL] libata fixes for v4.18-rc4
Hello, Linus.

* Jens's patches to expand the usable command depth from 31 to 32 broke
  sata_fsl due to a subtle command iteration bug. Fixed by introducing
  explicit iteration helpers and using the correct variant.

* On some laptops, enabling LPM by default reportedly led to occasional
  hard hangs. Blacklist the affected cases.

* Other misc fixes / changes.

Thanks.

The following changes since commit 9ffc59d57228d74809700be6f7ecb1db10292f05:

  Merge tag '4.18-rc1-more-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6 (2018-06-18 14:28:19 +0900)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata.git for-4.18-fixes

for you to fetch changes up to d55bac2754476624f23bdf3b908d117f3cdf469b:

  ata: Remove depends on HAS_DMA in case of platform dependency (2018-07-02 13:18:22 -0700)

----------------------------------------------------------------
Colin Ian King (1):
      sata_nv: remove redundant pointers sdev0 and sdev1

Damien Le Moal (2):
      ata: Fix ZBC_OUT command block check
      ata: Fix ZBC_OUT all bit handling

Geert Uytterhoeven (1):
      ata: Remove depends on HAS_DMA in case of platform dependency

Hans de Goede (1):
      ahci: Disable LPM on Lenovo 50 series laptops with a too old BIOS

Jens Axboe (4):
      libata: add command iterator helpers
      libata: convert eh to command iterators
      sata_fsl: convert to command iterator
      sata_fsl: remove dead code in tag retrieval

John Garry (1):
      libahci: Fix possible Spectre-v1 pmp indexing in ahci_led_store()

Mika Westerberg (1):
      ahci: Add Intel Ice Lake LP PCI ID

Wei Yongjun (1):
      ata: ahci_mvebu: ahci_mvebu_stop_engine() can be static

 drivers/ata/Kconfig       |  2 --
 drivers/ata/ahci.c        | 60 +++
 drivers/ata/ahci_mvebu.c  |  2 +-
 drivers/ata/libahci.c     |  7 --
 drivers/ata/libata-core.c |  3 +++
 drivers/ata/libata-eh.c   | 41 +---
 drivers/ata/libata-scsi.c | 18 +-
 drivers/ata/sata_fsl.c    |  9 +--
 drivers/ata/sata_nv.c     |  3 ---
 include/linux/libata.h    | 24 +++

 10 files changed, 122 insertions(+), 47 deletions(-)

-- tejun
Re: [PATCH] cgroup/tracing: Move taking of spin lock out of trace event handlers
On Mon, Jul 09, 2018 at 05:48:54PM -0400, Steven Rostedt wrote: > From: Steven Rostedt (VMware) > > It is unwise to take spin locks from the handlers of trace events. > Mainly, because they can introduce lockups, because it introduces locks > in places that are normally not tested. Worse yet, because trace events > are tucked away in the include/trace/events/ directory, locks that are > taken there are forgotten about. > > As a general rule, I tell people never to take any locks in a trace > event handler. > > Several cgroup trace event handlers call cgroup_path() which eventually > takes the kernfs_rename_lock spinlock. This injects the spinlock in the > code without people realizing it. It also can cause issues for the > PREEMPT_RT patch, as the spinlock becomes a mutex, and the trace event > handlers are called with preemption disabled. > > By moving the calculation of the cgroup_path() out of the trace event > handlers and into a macro (surrounded by a > trace_cgroup_##type##_enabled()), then we could place the cgroup_path > into a string, and pass that to the trace event. Not only does this > remove the taking of the spinlock out of the trace event handler, but > it also means that the cgroup_path() only needs to be called once (it > is currently called twice, once to get the length to reserve the > buffer for, and once again to get the path itself). Now it only needs to > be done once. > > Reported-by: Sebastian Andrzej Siewior > Signed-off-by: Steven Rostedt (VMware) Applied to cgroup/for-4.19. Thanks. -- tejun
Re: [PATCH] cgroup: use irqsave in cgroup_rstat_flush_locked()
Hello, Sebastian. On Wed, Jul 11, 2018 at 01:05:13PM +0200, Sebastian Andrzej Siewior wrote: > > > We at least used to do this in the kernel - manipulating irqsafe locks > > > with spin_lock/unlock() if the irq state is known, whether enabled or > > > disabled, and ISTR lockdep being smart enough to track actual irq > > > state to determine irq safety. Am I misremembering or is this > > > different on RT kernels? > > > > No, this is correct. So on !RT kernels the spin_lock_irq() disables > > interrupts and the raw_spin_lock() has the interrupts already disabled, > > everything is good. On RT kernels the spin_lock_irq() does not disable > > interrupts and the raw_spin_lock() acquires the lock with enabled > > interrupts and lockdep complains properly. > > lockdep sees the hardirq path via: I feel wary about applying a patch which isn't needed in mainline, especially without annotations or at least comments. I suppose it may not be too common but this can't be the only place which needs this and using irqsave/restore spuriously in all those sites doesn't sound like a good solution. Is there any other way of handling this? Thanks. -- tejun
Re: [PATCH] cgroup: use irqsave in cgroup_rstat_flush_locked()
(cc'ing Peter and Ingo for lockdep) Hello, Sebastian. On Tue, Jul 03, 2018 at 06:45:44PM +0200, Sebastian Andrzej Siewior wrote: > All callers of cgroup_rstat_flush_locked() acquire cgroup_rstat_lock > either with spin_lock_irq() or spin_lock_irqsave(). So, irq is always disabled in cgroup_rstat_flush_locked(). > cgroup_rstat_flush_locked() itself acquires cgroup_rstat_cpu_lock which > is a raw_spin_lock. This lock is also acquired in cgroup_rstat_updated() > in IRQ context and therefore requires _irqsave() locking suffix in > cgroup_rstat_flush_locked(). Yes, the cpu locks should be irqsafe too; however, as irq is always disabled in that function, save/restore is redundant, no? > Since there is no difference between spin_lock_t and raw_spin_lock_t > on !RT lockdep does not complain here. On RT lockdep complains because > the interrupts were not disabled here and a deadlock is possible. We at least used to do this in the kernel - manipulating irqsafe locks with spin_lock/unlock() if the irq state is known, whether enabled or disabled, and ISTR lockdep being smart enough to track actual irq state to determine irq safety. Am I misremembering or is this different on RT kernels? Thanks. -- tejun
Re: WARN_ON_ONCE() in process_one_work()?
Hello, Paul. On Tue, Jul 03, 2018 at 09:40:44AM -0700, Paul E. McKenney wrote: > > I will apply this, but be advised that I have not seen that WARN_ON_ONCE() > > trigger since. :-/ > > But I get a build error: > > kernel/workqueue.o: In function `worker_attach_to_pool': > workqueue.c:(.text+0x63c): undefined reference to `cpuhp_target_state' > workqueue.c:(.text+0x647): undefined reference to `cpuhp_current_state' > /home/paulmck/public_git/linux-rcu/Makefile:1015: recipe for target 'vmlinux' > failed > > My guess is that the worker_attach_to_pool() code needs to change so > as to invoke cpuhp_current_state() and cpuhp_target_state() only in > CONFIG_HOTPLUG_CPU=y kernels, but I figured I should defer to you. Just to clarify, the bug happened on a kernel with cpuhp enabled, right? Thanks. -- tejun
Re: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier
Hello,

On Tue, Jul 03, 2018 at 03:22:47PM +1000, Benjamin Herrenschmidt wrote:
> +bool kernfs_has_children(struct kernfs_node *kn)
> +{
> +	bool has_children = false;
> +	struct kernfs_node *pos;
> +
> +	/* Lockless shortcut */
> +	if (RB_EMPTY_NODE(&kn->rb))
> +		return false;

Hmm... shouldn't it be testing !rb_first(&kn->dir.children)? The above would test whether @kn itself is unlinked.

> +
> +	/* Now check for active children */
> +	mutex_lock(&kernfs_mutex);
> +	pos = NULL;
> +	while ((pos = kernfs_next_descendant_post(pos, kn)) && pos != kn) {
> +		if (kernfs_active(pos)) {
> +			has_children = true;
> +			break;
> +		}
> +	}
> +	mutex_unlock(&kernfs_mutex);
> +
> +	return has_children;

The active ref is there to synchronize removal against in-flight reads so that kernfs_remove() can wait for them to drain. On return from kernfs_remove(), the node is guaranteed to be off the rbtree and the above test isn't necessary. !rb_first() test should be enough (provided that there's external synchronization, of course).

Thanks.

-- tejun
Re: printk() from NMI backtrace can delay a lot
Hello, Sergey.

On Tue, Jul 03, 2018 at 01:30:21PM +0900, Sergey Senozhatsky wrote:
> Cc-ing Linus, Tejun, Andrew
> [I'll keep the entire lockdep report]
>
> On (07/02/18 19:26), Tetsuo Handa wrote:
> [..]
> > 2018-07-02 12:13:13 192.168.159.129: [ 151.606834] swapper/0/0 is trying to acquire lock:
> > 2018-07-02 12:13:13 192.168.159.129: [ 151.606835] 316e1432 (console_owner){-.-.}, at: console_unlock+0x1ce/0x8b0
> > 2018-07-02 12:13:13 192.168.159.129: [ 151.606840]
> > 2018-07-02 12:13:13 192.168.159.129: [ 151.606841] but task is already holding lock:
> > 2018-07-02 12:13:13 192.168.159.129: [ 151.606842] 9b45dcb4 (&(&pool->lock)->rlock){-.-.}, at: show_workqueue_state+0x3b2/0x900
> > 2018-07-02 12:13:13 192.168.159.129: [ 151.606847]
> > 2018-07-02 12:13:13 192.168.159.129: [ 151.606848] which lock already depends on the new lock.

...

> But anyway. So we can have [but I'm not completely sure. Maybe lockdep has
> something else on its mind] something like this:
>
>  CPU1                                 CPU0
>
>  #IRQ                                 #soft irq
>  serial8250_handle_irq()              wq_watchdog_timer_fn()
>   spin_lock(&uart_port->lock)          show_workqueue_state()
>   serial8250_rx_chars()                 spin_lock(&pool->lock)
>    tty_flip_buffer_push()               printk()
>     tty_schedule_flip()                  serial8250_console_write()
>      queue_work()                         spin_lock(&uart_port->lock)
>       __queue_work()
>        spin_lock(&pool->lock)
>
> We need to break the pool->lock -> uart_port->lock chain.
>
> - use printk_deferred() to show WQs states [show_workqueue_state() is
>   a timer callback, so local IRQs are enabled]. But show_workqueue_state()
>   is also available via sysrq.
>
> - what Alan Cox suggested: use spin_trylock() in serial8250_console_write()
>   and just discard (do not print anything on console) console->writes() that
>   can deadlock us [uart_port->lock is already locked]. This basically means
>   that sometimes there will be no output on a serial console, or there
>   will be missing line. Which kind of contradicts the purpose of print
>   out.
>
> > We are facing the risk of no output on serial consoles in both case. Thus
> > there must be some other way out of this.

show_workqueue_state() is only used when something is already horribly broken or when invoked through sysrq. I'm not sure it's worthwhile to make invasive changes to avoid lockdep warnings. If anything, we should make show_workqueue_state() avoid grabbing pool->lock (e.g. use trylock and fallback to probe_kernel_reads if that fails). I'm a bit skeptical how actually useful that'd be tho.

Thanks.

-- tejun
Re: [PATCH 14/14] skip readahead if the cgroup is congested
On Fri, Jun 29, 2018 at 03:25:42PM -0400, Josef Bacik wrote: > From: Josef Bacik > > We noticed in testing we'd get pretty bad latency stalls under heavy > pressure because read ahead would try to do its thing while the cgroup > was under severe pressure. If we're under this much pressure we want to > do as little IO as possible so we can still make progress on real work > if we're a throttled cgroup, so just skip readahead if our group is > under pressure. > > Signed-off-by: Josef Bacik Acked-by: Tejun Heo Thanks. -- tejun
Re: WARN_ON_ONCE() in process_one_work()?
Hello, Paul. Sorry about the late reply.

On Wed, Jun 20, 2018 at 12:29:01PM -0700, Paul E. McKenney wrote:
> I have hit this WARN_ON_ONCE() in process_one_work:
>
>	WARN_ON_ONCE(!(pool->flags & POOL_DISASSOCIATED) &&
>		     raw_smp_processor_id() != pool->cpu);
>
> This looks like it is my rcu_gp workqueue (see splat below), and it
> appears to be intermittent. This happens on rcutorture scenario SRCU-N,
> which does random CPU-hotplug operations (in case that helps).
>
> Is this related to the recent addition of WQ_MEM_RECLAIM? Either way,
> what should I do to further debug this?

Hmm... I checked the code paths but couldn't spot anything suspicious. Can you please apply the following patch and see whether it triggers before hitting the warn and if so report what it says?

Thanks.

diff --git a/kernel/cpu.c b/kernel/cpu.c
index 0db8938fbb23..81caab9643b2 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -79,6 +79,15 @@ static struct lockdep_map cpuhp_state_up_map =
 static struct lockdep_map cpuhp_state_down_map =
 	STATIC_LOCKDEP_MAP_INIT("cpuhp_state-down", &cpuhp_state_down_map);
 
+int cpuhp_current_state(int cpu)
+{
+	return per_cpu_ptr(&cpuhp_state, cpu)->state;
+}
+
+int cpuhp_target_state(int cpu)
+{
+	return per_cpu_ptr(&cpuhp_state, cpu)->target;
+}
 
 static inline void cpuhp_lock_acquire(bool bringup)
 {
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 78b192071ef7..365cf6342808 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1712,6 +1712,9 @@ static struct worker *alloc_worker(int node)
 	return worker;
 }
 
+int cpuhp_current_state(int cpu);
+int cpuhp_target_state(int cpu);
+
 /**
  * worker_attach_to_pool() - attach a worker to a pool
  * @worker: worker to be attached
@@ -1724,13 +1727,20 @@ static struct worker *alloc_worker(int node)
 static void worker_attach_to_pool(struct worker *worker,
 				  struct worker_pool *pool)
 {
+	int ret;
+
 	mutex_lock(&wq_pool_attach_mutex);
 
 	/*
	 * set_cpus_allowed_ptr() will fail if the cpumask doesn't have any
	 * online CPUs.  It'll be re-applied when any of the CPUs come up.
	 */
-	set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
+	ret = set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
+	if (ret && pool->cpu >= 0 && worker->rescue_wq)
+		printk("XXX rescuer failed to attach: ret=%d pool=%d this_cpu=%d target_cpu=%d cpuhp_state=%d chuhp_target=%d\n",
+		       ret, pool->id, raw_smp_processor_id(), pool->cpu,
+		       cpuhp_current_state(pool->cpu),
+		       cpuhp_target_state(pool->cpu));
 
 	/*
	 * The wq_pool_attach_mutex ensures %POOL_DISASSOCIATED remains
Re: [RFC PATCH] ata: ahci: Enable DEVSLP by default on x86 modern standby platform
On Mon, Jul 02, 2018 at 12:08:45PM -0700, Srinivas Pandruvada wrote: > From: Srinivas > > One of the requirements for a modern x86 system to enter lowest power mode > (SLP_S0) is SATA IP block to be off. This is true even during when > platform is suspended to idle and not only in opportunistic (runtime) > suspend. > > Several of these systems don't have traditional ACPI S3, so it is > important that they enter SLP_S0 state, to avoid draining battery > during suspend even with out of the box Linux installation. > > SATA IP block doesn't get turned off till SATA is in DEVSLP mode. Here > user has to either use scsi-host sysfs or tools like powertop to set > the sata-host link_power_management_policy to min_power. > > This change sets by default link power management policy to min_power > for some platforms. To avoid regressions, the following conditions > are used: > - The kernel config is already set to use med_power_with_dipm or deeper > - System is a modern standby system using ACPI low power idle flag > - The platform is not blacklisted for Suspend to Idle and suspend > to idle is used instead of S3 > This combination will make sure that systems are fairly recent and > since getting shipped with modern standby, the DEVSLP function > is already validated. > > Signed-off-by: Srinivas Pandruvada Seems sane to me. Hans, what do you think? Thanks. -- tejun
Re: [PATCH v4] ata: Remove depends on HAS_DMA in case of platform dependency
On Fri, Jun 22, 2018 at 01:03:07PM +0200, Geert Uytterhoeven wrote: > Remove dependencies on HAS_DMA where a Kconfig symbol depends on another > symbol that implies HAS_DMA, and, optionally, on "|| COMPILE_TEST". > In most cases this other symbol is an architecture or platform specific > symbol, or PCI. > > Generic symbols and drivers without platform dependencies keep their > dependencies on HAS_DMA, to prevent compiling subsystems or drivers that > cannot work anyway. > > This simplifies the dependencies, and allows to improve compile-testing. > > Signed-off-by: Geert Uytterhoeven > Reviewed-by: Mark Brown > Acked-by: Robin Murphy Applied to libata/for-4.18-fixes. Thanks. -- tejun
Re: [PATCH v4] ata: Remove depends on HAS_DMA in case of platform dependency
On Fri, Jun 22, 2018 at 01:03:07PM +0200, Geert Uytterhoeven wrote: > Remove dependencies on HAS_DMA where a Kconfig symbol depends on another > symbol that implies HAS_DMA, and, optionally, on "|| COMPILE_TEST". > In most cases this other symbol is an architecture or platform specific > symbol, or PCI. > > Generic symbols and drivers without platform dependencies keep their > dependencies on HAS_DMA, to prevent compiling subsystems or drivers that > cannot work anyway. > > This simplifies the dependencies, and allows to improve compile-testing. > > Signed-off-by: Geert Uytterhoeven > Reviewed-by: Mark Brown > Acked-by: Robin Murphy This looks fine to me but how do you wanna route it? Should I apply it to for-4.18-fixes or should it go with arch changes in some other tree? Thanks. -- tejun
Re: [PATCH] writeback: update stale account_page_redirty() comment
On Mon, Jun 25, 2018 at 10:15:26AM -0700, Greg Thelen wrote: > commit 93f78d882865 ("writeback: move backing_dev_info->bdi_stat[] into > bdi_writeback") replaced BDI_DIRTIED with WB_DIRTIED in > account_page_redirty(). Update comment to track that change. > BDI_DIRTIED => WB_DIRTIED > BDI_WRITTEN => WB_WRITTEN > > Signed-off-by: Greg Thelen Acked-by: Tejun Heo Thanks. -- tejun
Re: [PATCH] sched/cputime: Ensure correct utime and stime proportion
Hello, Peter. On Tue, Jun 26, 2018 at 05:49:08PM +0200, Peter Zijlstra wrote: > Well, no, because the Changelog is incomprehensible and the patch > doesn't really have useful comments, so I'll have to reverse engineer > the entire thing, and I've just not had time for that. Just as an additional data point, we also sometimes see artifacts from cpu_adjust_time() in the form of per-task user or sys time getting stuck for some period (in extreme cases for over a minute) while the application isn't doing anything differently. We're telling the users that it's an inherent sampling artifact but it'd be nice to improve it if possible without adding noticeable overhead. No idea whether this patch's approach is a good one tho. Thanks. -- tejun
Re: [PATCH] ahci: Disable LPM on Lenovo 50 series laptops with a too old BIOS
On Sun, Jul 01, 2018 at 12:15:46PM +0200, Hans de Goede wrote: > There have been several reports of LPM related hard freezes about once > a day on multiple Lenovo 50 series models. Strange enough these reports > where not disk model specific as LPM issues usually are and some users > with the exact same disk + laptop where seeing them while other users > where not seeing these issues. > > It turns out that enabling LPM triggers a firmware bug somewhere, which > has been fixed in later BIOS versions. > > This commit adds a new ahci_broken_lpm() function and a new ATA_FLAG_NO_LPM > for dealing with this. > > The ahci_broken_lpm() function contains DMI match info for the 4 models > which are known to be affected by this and the DMI BIOS date field for > known good BIOS versions. If the BIOS date is older then the one in the > table LPM will be disabled and a warning will be printed. > > Note the BIOS dates are for known good versions, some older versions may > work too, but we don't know for sure, the table is using dates from BIOS > versions for which users have confirmed that upgrading to that version > makes the problem go away. > > Unfortunately I've been unable to get hold of the reporter who reported > that BIOS version 2.35 fixed the problems on the W541 for him. I've been > able to verify the DMI_SYS_VENDOR and DMI_PRODUCT_VERSION from an older > dmidecode, but I don't know the exact BIOS date as reported in the DMI. > Lenovo keeps a changelog with dates in their release notes, but the > dates there are the release dates not the build dates which are in DMI. > So I've chosen to set the date to which we compare to one day past the > release date of the 2.34 BIOS. I plan to fix this with a follow up > commit once I've the necessary info. > > Cc: sta...@vger.kernel.org > Signed-off-by: Hans de Goede Applied to libata/for-4.18-fixes. Thanks. -- tejun
Re: [PATCH] kernfs: Replace strncpy with memcpy
On Sun, Jul 01, 2018 at 01:57:13PM -0700, Guenter Roeck wrote: > gcc 8.1.0 complains: > > fs/kernfs/symlink.c:91:3: warning: > 'strncpy' output truncated before terminating nul copying > as many bytes from a string as its length > fs/kernfs/symlink.c: In function 'kernfs_iop_get_link': > fs/kernfs/symlink.c:88:14: note: length computed here > > Using strncpy() is indeed less than perfect since the length of data to > be copied has already been determined with strlen(). Replace strncpy() > with memcpy() to address the warning and optimize the code a little. > > Signed-off-by: Guenter Roeck Acked-by: Tejun Heo Thanks. -- tejun
Re: [PATCH] sata_nv: remove redundant pointers sdev0 and sdev1
On Mon, Jul 02, 2018 at 08:30:22AM +0100, Colin King wrote: > From: Colin Ian King > > Pointers sdev0 and sdev1 are being assigned but are never used hence they > are redundant and can be removed. > > Cleans up clang warnings: > warning: variable 'sdev0' set but not used [-Wunused-but-set-variable] > warning: variable 'sdev1' set but not used [-Wunused-but-set-variable] > > Signed-off-by: Colin Ian King Applied to libata/for-4.18-fixes. Thanks. -- tejun
Re: [PATCH] libahci: Fix possible Spectre-v1 pmp indexing in ahci_led_store()
On Fri, Jun 08, 2018 at 06:26:33PM +0800, John Garry wrote: > Currently smatch warns of possible Spectre-V1 issue in ahci_led_store(): > drivers/ata/libahci.c:1150 ahci_led_store() warn: potential spectre issue > 'pp->em_priv' (local cap) > > Userspace controls @pmp from following callchain: > em_message->store() > ->ata_scsi_em_message_store() > -->ap->ops->em_store() > --->ahci_led_store() > > After the mask+shift @pmp is effectively an 8b value, which is used to > index into an array of length 8, so sanitize the array index. > > Signed-off-by: John Garry Applied to libata/for-4.18-fixes. Thanks. -- tejun
Re: Cgroups "pids" controller does not update "pids.current" count immediately
Hello, Ivan. On Fri, Jun 15, 2018 at 08:40:02PM +0300, Ivan Zahariev wrote: > The lazy pids accounting + modern fast CPUs makes the "pids.current" > metric practically unusable for resource limiting in our case. For a > test, when we started and ended one single process very quickly, we > saw "pids.current" equal up to 185 (while the correct value at all > time is either 0 or 1). If we want that a "cgroup" can spawn maximum > 50 processes, we should use some high value like 300 for "pids.max", > in order to compensate the pids uncharge lag (and this depends on > the speed of the CPU and how busy the system is). Yeah, that actually makes a lot of sense. We can't keep everything synchronous for obvious performance reasons but we definitely can wait for RCU grace period before failing. Forking might become a bit slower while pids are draining but shouldn't fail and that shouldn't incur any performance overhead in normal conditions when pids aren't constrained. Thanks. -- tejun
Re: Cgroups "pids" controller does not update "pids.current" count immediately
Hello, On Fri, Jun 15, 2018 at 07:07:27PM +0300, Ivan Zahariev wrote: > I understand all concerns and design decisions. However, having > RLIMIT_NPROC support combined with "cgroups" hierarchy would be very > handy. > > Does it make sense that you introduce "nproc.current" and > "nproc.max" metrics which work in the same atomic, real-time way > like RLIMIT_NPROC? Or make this in a new "nproc" controller? I'm skeptical for two reasons. 1. That doesn't sound much like a resource control problem but more of a policy enforcement problem. 2. and it's difficult to see why such policies would need to be that strict. Where is the requirement coming from? Thanks. -- tejun
Re: Cgroups "pids" controller does not update "pids.current" count immediately
Hello, On Fri, Jun 15, 2018 at 05:26:04PM +0300, Ivan Zahariev wrote: > The standard RLIMIT_NPROC does not suffer from such accounting > discrepancies at any time. RLIMIT_NPROC uses a dedicated atomic counter which is updated when the process is getting reaped; however, that doesn't actually coincide with the pid being freed. The base pid ref is put then but there can be other refs and even after that it has to go through RCU grace period to be actually freed. They seem equivalent but serve a bit different purposes. RLIMIT_NPROC is primarily about limiting what the user can do and doesn't guarantee that that actually matches resource (pid here) consumption. pid controller's primary role is limiting pid consumption - ie. no matter what happens the cgroup must not be able to take away more than the specified number from the available pool, which has to account for the lazy release and draining refs and stuff. > The "memory" cgroups controller also does > not suffer from any discrepancies -- it accounts memory usage in > real time without any lag on process start or exit. The "tasks" file > list is also always up-to-date. The memory controller does the same thing, actually way more extensively. It's just less noticeable because people generally don't try to control at individual page level. > Is it really technically not possible to make "pids.current" do > accounting properly like RLIMIT_NPROC does? We were hoping to > replace RLIMIT_NPROC with the "pids" controller. It is of course possible but at a cost. The cost (getting rid of lazy release optimizations) is just not justifiable for most cases. Thanks. -- tejun
Re: [PATCH] bdi: Fix another oops in wb_workfn()
Hello,

On Mon, Jun 11, 2018 at 11:12:48AM +0200, Jan Kara wrote:
> However this is wrong and so is the patch. The problem is in
> cgwb_bdi_unregister() which does cgwb_kill() and thus drops bdi's
> reference to wb structures before going through the list of wbs again and
> calling wb_shutdown() on each of them. The writeback structures we are
> accessing at this point can be already freed in principle like:
>
> CPU1                                  CPU2
> cgwb_bdi_unregister()
>   cgwb_kill(*slot);
>
>                                       cgwb_release()
>                                         queue_work(cgwb_release_wq, &wb->release_work);
>                                       cgwb_release_workfn()
>   wb = list_first_entry(&bdi->wb_list, ...)
>   spin_unlock_irq(&cgwb_lock);
>                                         wb_shutdown(wb);
>                                         ...
>                                         kfree_rcu(wb, rcu);
>   wb_shutdown(wb); -> oops use-after-free
>
> I'm not 100% sure how to fix this. wb structures can be at various phases of
> shutdown (or there may be other external references still existing) when we
> enter cgwb_bdi_unregister() so I think adding a way for cgwb_bdi_unregister()
> to wait for standard wb shutdown path to finish is the most robust way.
> What do you think about attached patch Tejun? So far only compile tested...
>
> Possible problem with it is that now cgwb_bdi_unregister() will wait for
> all wb references to be dropped so it adds some implicit dependencies to
> bdi shutdown path.

Would something like the following work or am I missing the point entirely?

Thanks.

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 347cc83..359cacd 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -715,14 +715,19 @@ static void cgwb_bdi_unregister(struct backing_dev_info *bdi)
 	WARN_ON(test_bit(WB_registered, &bdi->wb.state));
 
 	spin_lock_irq(&cgwb_lock);
-	radix_tree_for_each_slot(slot, &bdi->cgwb_tree, &iter, 0)
-		cgwb_kill(*slot);
+	radix_tree_for_each_slot(slot, &bdi->cgwb_tree, &iter, 0) {
+		struct bdi_writeback *wb = *slot;
+
+		wb_get(wb);
+		cgwb_kill(wb);
+	}
 
 	while (!list_empty(&bdi->wb_list)) {
 		wb = list_first_entry(&bdi->wb_list, struct bdi_writeback,
 				      bdi_node);
 		spin_unlock_irq(&cgwb_lock);
 		wb_shutdown(wb);
+		wb_put(wb);
 		spin_lock_irq(&cgwb_lock);
 	}
 	spin_unlock_irq(&cgwb_lock);

-- tejun
Re: [PATCH] percpu_counter: `percpu_counter_read_positive' should not return negative number
On Wed, Jun 06, 2018 at 09:39:40PM +0200, Mathieu Malaterre wrote: > Since function `percpu_counter_add' may result in a signed integer overflow > the result stored in `fbc->count' could be negative. Make sure that > function `percpu_counter_read_positive' does not return a negative number > in this case. This will match behavior when CONFIG_SMP=y. > > Detected wth CONFIG_UBSAN=y > > [76404.888450] > > [76404.888477] UBSAN: Undefined behaviour in > ../include/linux/percpu_counter.h:136:13 > [76404.888485] signed integer overflow: > [76404.888490] 9223308017647617321 + 76624449492175 cannot be represented in > type 'long long int' > [76404.888501] CPU: 0 PID: 0 Comm: swapper Not tainted 4.17.0+ #50 > [76404.888506] Call Trace: > [76404.888523] [dffedd30] [c0478b90] ubsan_epilogue+0x18/0x4c (unreliable) > [76404.888533] [dffedd40] [c0479530] handle_overflow+0xbc/0xdc > [76404.888548] [dffeddc0] [c0439044] cfq_completed_request+0x560/0x1234 > [76404.888557] [dffede40] [c03f3fc4] __blk_put_request+0xb0/0x2dc > [76404.888570] [dffede80] [c05a81d0] scsi_end_request+0x19c/0x344 > [76404.888579] [dffedeb0] [c05a9954] scsi_io_completion+0x4b4/0x854 > [76404.888592] [dffedf10] [c04046b4] blk_done_softirq+0xe4/0x1e0 > [76404.888605] [dffedf60] [c07ec1d4] __do_softirq+0x16c/0x5f0 > [76404.888617] [dffedfd0] [c0065160] irq_exit+0x110/0x1a8 > [76404.888629] [dffedff0] [c001646c] call_do_irq+0x24/0x3c > [76404.888641] [c0cdbe80] [c0009a2c] do_IRQ+0x98/0x1a0 > [76404.888649] [c0cdbeb0] [c001b93c] ret_from_except+0x0/0x14 > [76404.888659] --- interrupt: 501 at arch_cpu_idle+0x30/0x78 >LR = arch_cpu_idle+0x30/0x78 > [76404.888667] [c0cdbf70] [c0cda000] 0xc0cda000 (unreliable) > [76404.888679] [c0cdbf80] [c00a3844] do_idle+0xc4/0x158 > [76404.888687] [c0cdbfb0] [c00a3a90] cpu_startup_entry+0x24/0x28 > [76404.888696] [c0cdbfc0] [c097f820] start_kernel+0x47c/0x490 > [76404.888703] [c0cdbff0] [3444] 0x3444 So, the _positive versions are there to deal with small negative reads coming 
from percpu propagation delays. It's not there to protect against actual counter overflow, although it can't detect that on SMP. It's just outright wrong to report 0 when the counter actually overflowed, and applying the suggested patch would make a real problem undetectable. I think the right thing to do is actually understanding what's going on (why is a 64bit counter overflowing?) and fixing the underlying issue. Thanks. -- tejun
Re: [GIT PULL] cgroup changes for v4.18-rc1
On Wed, Jun 06, 2018 at 08:14:27AM -0700, Linus Torvalds wrote: > On Wed, Jun 6, 2018 at 8:04 AM Tejun Heo wrote: > > > > The notification implementation isn't super light weight, so the patch > > ratelimits the notifications by capping minimum notification interval > > interval to 10ms. > > Yeah, I looked at the patch (and the code) to make sense of the explanation. > > My reaction to that was that it might be a better idea to simply not > notify if a notification was already pending, rather than have the > timeout. Or perhaps in addition to. The path _to_ the fsnotify code > looked quite messy, though. Yeah, getting to the inode is currently quite involved. Some of the complications come from the fact that a kernfs node can be associated with multiple inodes for sysfs namespace support. The right thing to do could be linking the inodes to the kernfs node and protect it with a spinlock so that we don't have to punt and walk them directly from notification path. For now, I think the 10ms thing is an acceptable workaround but this likely needs more work in the future. Thanks. -- tejun
Re: [GIT PULL] cgroup changes for v4.18-rc1
Hello, Linus.

On Tue, Jun 05, 2018 at 05:13:55PM -0700, Linus Torvalds wrote:
> On Tue, Jun 5, 2018 at 12:22 PM Tejun Heo wrote:
> >
> > * cgroup uses file modified events to notify certain events. A rate
> >   limiting mechanism is added.
>
> This "explanation" didn't really parse for me at all.
>
> I edited the merge message to something that I think is correct and
> made more sense to me.

Heh, yeah, yours reads a lot better. Sorry about that.

The background is that the mechanism was first added to generate
notifications on events like a cgroup becoming empty, which are very
low frequency. Since then, we've expanded its usage to cover things
like memory.high breach events, which can be a lot higher frequency.
The notification implementation isn't super light weight, so the patch
ratelimits the notifications by capping the minimum notification
interval to 10ms.

Thanks.

-- 
tejun
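The ratelimit described here, capping the minimum interval between
notifications at 10ms, amounts to something like this sketch (toy names
and a caller-supplied clock; the kernel uses its own timekeeping):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy sketch of interval-capped ratelimiting: a notification is only
 * generated if at least NOTIFY_MIN_INTERVAL_NS has passed since the
 * previous one.  Names and the time parameter are illustrative only.
 */
#define NOTIFY_MIN_INTERVAL_NS (10ULL * 1000 * 1000)	/* 10ms */

static unsigned long long last_notify_ns;

static bool should_notify(unsigned long long now_ns)
{
	if (now_ns - last_notify_ns < NOTIFY_MIN_INTERVAL_NS)
		return false;		/* too soon, suppress */
	last_notify_ns = now_ns;
	return true;
}
```

This keeps the worst-case notification rate at 100/s per file no matter
how often memory.high is breached.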
[GIT PULL] workqueue changes for v4.18-rc1
Hello, Linus.

* Patches to make kworkers report the workqueue they are executing, or
  have executed most recently, in /proc/PID/comm.

* CONFIG_SMP shuffle to move stuff which isn't necessary for UP builds
  inside CONFIG_SMP.

Thanks.

The following changes since commit e6506eb241871d68647c53cb6d0a16299550ae97:

  Merge tag 'trace-v4.17-rc4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace (2018-05-16 16:45:23 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git for-4.18

for you to fetch changes up to 66448bc274cadedb71fda7d914e7c29d8dead217:

  workqueue: move function definitions within CONFIG_SMP block (2018-05-23 11:16:58 -0700)

Mathieu Malaterre (1):
      workqueue: move function definitions within CONFIG_SMP block

Tejun Heo (6):
      workqueue: Replace pool->attach_mutex with global wq_pool_attach_mutex
      workqueue: Make worker_attach/detach_pool() update worker->pool
      workqueue: Set worker->desc to workqueue name by default
      proc: Consolidate task->comm formatting into proc_task_name()
      workqueue: Show the latest workqueue name in /proc/PID/{comm,stat,status}
      workqueue: Make sure struct worker is accessible for wq_worker_comm()

 fs/proc/array.c             |  33 ++-
 fs/proc/base.c              |   5 +-
 fs/proc/internal.h          |   2 +
 include/linux/workqueue.h   |   1 +
 kernel/workqueue.c          | 139 +---
 kernel/workqueue_internal.h |   3 +-
 6 files changed, 119 insertions(+), 64 deletions(-)

-- 
tejun
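What the /proc/PID/comm change buys can be illustrated with a toy
version of the name formatting (struct and function names below are
made up for illustration; in the series the real logic lives in
proc_task_name() and wq_worker_comm()):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Toy sketch: a kworker's comm is suffixed with the workqueue it is
 * running (or last ran), so "kworker/u2:0" becomes something like
 * "kworker/u2:0+events_unbound".  The '+'/'-' convention used here
 * (currently executing vs. most recent) is an assumption about the
 * output format, not a quote of the kernel code.
 */
struct fake_worker {
	const char *comm;	/* task comm, e.g. "kworker/u2:0" */
	const char *desc;	/* workqueue name, NULL if none yet */
	int running;		/* currently executing work? */
};

static void format_worker_comm(const struct fake_worker *w,
			       char *buf, size_t len)
{
	if (w->desc)
		snprintf(buf, len, "%s%c%s", w->comm,
			 w->running ? '+' : '-', w->desc);
	else
		snprintf(buf, len, "%s", w->comm);
}
```

With this, a `ps` or `cat /proc/PID/comm` immediately tells you which
workqueue a busy kworker is serving, instead of an opaque pool name.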
[GIT PULL] cgroup changes for v4.18-rc1
Hello, Linus.

cgroup changes for v4.18.

* For cpustat, cgroup has a percpu hierarchical stat mechanism which
  propagates up the hierarchy lazily. This pull request contains
  commits to factor out and generalize the mechanism so that it can be
  used for other cgroup stats too. The original intention was to
  update memcg stats to use it but memcg went for a different
  approach, so cpustat is still the only user. The factoring out and
  generalization still make sense and it's likely that this can be
  used for other purposes in the future.

* cgroup uses file modified events to notify certain events. A rate
  limiting mechanism is added.

* Other misc changes.

Thanks.

The following changes since commit fe03a7594d86e0754f05e604cd803a6a9aae3c1c:

  Merge tag 'acpi-4.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm (2018-04-26 11:06:36 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-4.18

for you to fetch changes up to d8742e22902186e30c346b1ba881cb52942ae3e4:

  cgroup: css_set_lock should nest inside tasklist_lock (2018-05-23 11:04:54 -0700)

Andy Shevchenko (1):
      rdmacg: Convert to use match_string() helper

Tejun Heo (12):
      cgroup: Explicitly remove core interface files
      cgroup: Limit event generation frequency
      cgroup: Rename kernel/cgroup/stat.c to kernel/cgroup/rstat.c
      cgroup: Rename stat to rstat
      cgroup: Distinguish base resource stat implementation from rstat
      cgroup: Reorganize kernel/cgroup/rstat.c
      cgroup: Factor out and expose cgroup_rstat_*() interface functions
      cgroup: Replace cgroup_rstat_mutex with a spinlock
      cgroup: Add cgroup_subsys->css_rstat_flush()
      cgroup: Add memory barriers to plug cgroup_rstat_updated() race window
      cgroup: Make cgroup_rstat_updated() ready for root cgroup usage
      cgroup: css_set_lock should nest inside tasklist_lock

 include/linux/cgroup-defs.h     |  52 +++--
 include/linux/cgroup.h          |  12 +-
 kernel/cgroup/Makefile          |   2 +-
 kernel/cgroup/cgroup-internal.h |  11 +-
 kernel/cgroup/cgroup.c          | 105 +++---
 kernel/cgroup/rdma.c            |  35 ++--
 kernel/cgroup/rstat.c           | 416 
 kernel/cgroup/stat.c            | 338 
 8 files changed, 554 insertions(+), 417 deletions(-)
 create mode 100644 kernel/cgroup/rstat.c
 delete mode 100644 kernel/cgroup/stat.c

-- 
tejun
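The lazy propagation scheme in the first bullet can be sketched with a
toy, single-CPU model (names are invented; the real rstat code is
percpu, lock-protected, and only walks cgroups that were actually
updated):

```c
#include <assert.h>

/*
 * Toy model of rstat-style lazy stat propagation: the update path is
 * O(1) and only touches the local cgroup; the expensive upward walk
 * happens on the (much rarer) read path.
 */
struct toy_cgroup {
	struct toy_cgroup *parent;
	long long pending;	/* updated but not yet propagated */
	long long total;	/* flushed hierarchical total */
};

/* hot path: no hierarchy walk */
static void toy_rstat_updated(struct toy_cgroup *cg, long long delta)
{
	cg->pending += delta;
}

/* read path: fold pending deltas into this cgroup and all its ancestors */
static void toy_rstat_flush(struct toy_cgroup *cg)
{
	long long delta = cg->pending;

	cg->pending = 0;
	for (struct toy_cgroup *c = cg; c; c = c->parent)
		c->total += delta;
}
```

The point of the split is that frequent stat updates (e.g. per context
switch for cpustat) stay cheap, and the hierarchy cost is only paid
when someone reads the numbers.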
[GIT PULL 2/2] libata changes for v4.18-rc1
Hello, Linus.

This is the second part of the libata pull request.

* libata has always limited the maximum queue depth to 31, with one
  tag set aside mostly for historical reasons. This didn't use to make
  much difference but Jens found out that modern hard drives can
  actually perform measurably better with the extra one queue depth.
  Jens updated libata core so that it can make use of the full 32
  queue depth.

* Damien updated command retry logic in error handling so that it
  doesn't unnecessarily retry when the upper layer (SCSI) is gonna
  handle them.

* A couple misc changes.

Thanks.

The following changes since commit 75bc37fefc4471e718ba8e651aa74673d4e0a9eb:

  Linux 4.17-rc4 (2018-05-06 16:57:38 -1000)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata.git for-4.18

for you to fetch changes up to 88e10092f6a623b88808f782b6637162b5b658fb:

  sata_fsl: use the right type for tag bitshift (2018-05-14 08:14:21 -0700)

Andy Shevchenko (1):
      ata: hpt37x: Convert to use match_string() helper

Christoph Hellwig (1):
      sata_nv: don't use block layer bounce buffer

Damien Le Moal (5):
      libata: Fix comment typo in ata_eh_analyze_tf()
      libata: Fix ata_err_string()
      libata: Make ata_dev_set_mode() less verbose
      libata: Honor RQF_QUIET flag
      libata: Fix command retry decision

Jens Axboe (10):
      libata: introduce notion of separate hardware tags
      libata: convert core and drivers to ->hw_tag usage
      libata: bump ->qc_active to a 64-bit type
      libata: use ata_tag_internal() consistently
      libata: remove assumption that ATA_MAX_QUEUE - 1 is the max
      sata_nv: set host can_queue count appropriately
      libata: add extra internal command
      libata: don't clamp queue depth to ATA_MAX_QUEUE - 1
      ahci: enable full queue depth of 32
      sata_fsl: use the right type for tag bitshift

 drivers/ata/acard-ahci.c     |   4 +-
 drivers/ata/ahci.h           |   2 +-
 drivers/ata/libahci.c        |   8 ++--
 drivers/ata/libata-core.c    |  61 ---
 drivers/ata/libata-eh.c      |  56 -
 drivers/ata/libata-scsi.c    |  19 +
 drivers/ata/pata_hpt37x.c    |  13 +++---
 drivers/ata/sata_dwc_460ex.c |  14 +++
 drivers/ata/sata_fsl.c       |  10 ++---
 drivers/ata/sata_mv.c        |  26 ++--
 drivers/ata/sata_nv.c        |  98 +++-
 drivers/ata/sata_sil24.c     |   6 +--
 include/linux/libata.h       |  22 +-
 13 files changed, 177 insertions(+), 162 deletions(-)

-- 
tejun
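The queue depth change in the first bullet boils down to this: when the
internal command shares the 32-tag space, one tag has to stay reserved,
so only 31 normal commands can be queued. A toy tag allocator shows the
difference (names and the bitmap are illustrative only; the series
instead gives the internal command its own separate tag):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy NCQ tag allocator over a 32-bit bitmap.  With the old scheme,
 * callers could only allocate from [0, 31) because tag 31 was kept
 * for the internal command; with a separate internal slot, the full
 * [0, 32) range is available.
 */
#define TOY_MAX_QUEUE 32

static uint32_t tags_in_use;	/* one bit per hardware tag */

/* returns an unused tag in [0, max_depth), or -1 if the queue is full */
static int toy_alloc_tag(int max_depth)
{
	for (int tag = 0; tag < max_depth; tag++) {
		if (!(tags_in_use & (1u << tag))) {
			tags_in_use |= 1u << tag;
			return tag;
		}
	}
	return -1;
}
```

One extra in-flight command per device sounds marginal, but as the
message notes, Jens measured it to matter on modern drives.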
[GIT PULL 1/2] libata changes for v4.18-rc1
Hello, Linus.

These two are fixes which missed v4.17. I tried to merge these into
libata/for-4.18 but couldn't make "git request-pull" not generate a
huge spurious diffstat without merging v4.17 in the middle, so I'm
sending them out as two separate pull requests. Pulling into master
should be clean for both.

One is to remove an incorrect power management blacklist entry and the
other to fix a cdb buffer overrun which has been there for a very long
time.

The following changes since commit 4544e403eb25552aed7f0ee181a7a506b8800403:

  ahci: Add PCI ID for Cannon Lake PCH-LP AHCI (2018-05-24 07:03:32 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata.git for-4.17-fixes

for you to fetch changes up to 2cfce3a86b64b53f0a70e92a6a659c720c319b45:

  libata: Drop SanDisk SD7UB3Q*G1001 NOLPM quirk (2018-05-31 08:45:37 -0700)

Dan Carpenter (1):
      libata: zpodd: small read overflow in eject_tray()

Hans de Goede (1):
      libata: Drop SanDisk SD7UB3Q*G1001 NOLPM quirk

 drivers/ata/libata-core.c  | 3 ---
 drivers/ata/libata-zpodd.c | 2 +-
 2 files changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index 346b163..9bfd2f7 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -4557,9 +4557,6 @@ static const struct ata_blacklist_entry ata_device_blacklist [] = {
 	{ "SAMSUNG MZMPC128HBFU-000MV", "CXM14M1Q", ATA_HORKAGE_NOLPM, },
 	{ "SAMSUNG SSD PM830 mSATA *",  "CXM13D1Q", ATA_HORKAGE_NOLPM, },
 
-	/* Sandisk devices which are known to not handle LPM well */
-	{ "SanDisk SD7UB3Q*G1001", NULL, ATA_HORKAGE_NOLPM, },
-
 	/* devices that don't properly handle queued TRIM commands */
 	{ "Micron_M500IT_*", "MU01", ATA_HORKAGE_NO_NCQ_TRIM |
 					ATA_HORKAGE_ZERO_AFTER_TRIM, },
diff --git a/drivers/ata/libata-zpodd.c b/drivers/ata/libata-zpodd.c
index de4ddd0..b3ed8f9 100644
--- a/drivers/ata/libata-zpodd.c
+++ b/drivers/ata/libata-zpodd.c
@@ -35,7 +35,7 @@ struct zpodd {
 static int eject_tray(struct ata_device *dev)
 {
 	struct ata_taskfile tf;
-	static const char cdb[] = { GPCMD_START_STOP_UNIT,
+	static const char cdb[ATAPI_CDB_LEN] = { GPCMD_START_STOP_UNIT,
 		0, 0, 0,
 		0x02,	/* LoEj */
 		0, 0, 0, 0, 0, 0, 0,

-- 
tejun
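The zpodd fix is a classic short-initializer overrun: `cdb[]` was
declared with only its 12 initialized bytes, while ATAPI consumers copy
ATAPI_CDB_LEN bytes from it, reading past the end of the array. A
standalone illustration, with simplified stand-in constants (the
kernel's ATAPI_CDB_LEN is 16):

```c
#include <assert.h>

/*
 * Toy reproduction of the bug shape: the "before" array is only as
 * long as its initializer, so copying TOY_ATAPI_CDB_LEN bytes from it
 * overruns; the "after" array is sized explicitly, so the compiler
 * zero-fills the tail.  Constants are stand-ins, not the kernel's.
 */
#define TOY_ATAPI_CDB_LEN 16
#define TOY_GPCMD_START_STOP_UNIT 0x1b

static const char short_cdb[] = { TOY_GPCMD_START_STOP_UNIT,
	0, 0, 0,
	0x02,	/* LoEj */
	0, 0, 0, 0, 0, 0, 0 };		/* 12 bytes: 4 short of CDB_LEN */

static const char fixed_cdb[TOY_ATAPI_CDB_LEN] = { TOY_GPCMD_START_STOP_UNIT,
	0, 0, 0,
	0x02,	/* LoEj */
	0, 0, 0, 0, 0, 0, 0 };		/* remaining bytes zero-filled */
```

The one-character patch above (`cdb[ATAPI_CDB_LEN]`) is exactly the
second form: same initializer, explicit length.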
Re: [PATCH] Smack: Fix memory leak in smack_inode_getsecctx
On Mon, Jun 04, 2018 at 02:01:34PM -0700, Casey Schaufler wrote:
> On 6/1/2018 10:45 AM, Casey Schaufler wrote:
> > Fix memory leak in smack_inode_getsecctx
> >
> > The implementation of smack_inode_getsecctx() made
> > incorrect assumptions about how Smack presents a security
> > context. Smack does not need to allocate memory to support
> > security contexts, so "releasing" a Smack context is a no-op.
> > The code made an unnecessary copy and returned that as a
> > context, which was never freed. The revised implementation
> > returns the context correctly.
> >
> > Signed-off-by: Casey Schaufler
>
> Tejun, does this pass your tests?

Oh, I'm not the one who reported it. Chandan?

-- 
tejun
Re: [RFC][PATCH 1/2] memcg: Ensure every task that uses an mm is in the same memory cgroup
Hello, Michal.

On Mon, Jun 04, 2018 at 03:01:19PM +0200, Michal Hocko wrote:
> > On Fri, Jun 01, 2018 at 01:11:59PM -0500, Eric W. Biederman wrote:
> > > Widening the definition of a process sounds good. The memory control
> > > group code would still need a way to forbid these in cgroup v1 mode,
> > > when someone uses the task file.
> >
> > Yeap, you're right. We'll need memcg's can_attach rejecting for v1.
>
> Do we really need? I mean, do we know about any existing usecase that
> would need this weird threading concept and depend on memory migration
> which doesn't really work?

I thought the requirement is from the ->owner change so that the
association doesn't become 1:N, right?

Thanks.

-- 
tejun
Re: [RFC][PATCH 1/2] memcg: Ensure every task that uses an mm is in the same memory cgroup
Hello,

On Fri, Jun 01, 2018 at 01:11:59PM -0500, Eric W. Biederman wrote:
> Widening the definition of a process sounds good. The memory control
> group code would still need a way to forbid these in cgroup v1 mode,
> when someone uses the task file.

Yeap, you're right. We'll need memcg's can_attach rejecting for v1.

> Using widening instead of denying should reduce the risk of introducing
> a regression.
>
> The only reason I support the crazy case in my earlier patch is so that
> we can have this discussion and in case we do cause a regression with
> this change the previous algorithmic cleanup won't have to be reverted
> as well.

Yeah, sure thing.

Thanks a lot.

-- 
tejun