Re: [PATCH 0/10] psi: pressure stall information for CPU, memory, and IO v2

2018-07-30 Thread Tejun Heo
On Mon, Jul 30, 2018 at 07:59:36PM +0200, Pavel Machek wrote:
> It's true I have no interest in psi. But I'm trying to use the same kernel
> you are trying to "improve" and I was confused enough by seeing
> "CONFIG_PSI". And yes, my association was "pounds per square inch" and
> "what is it doing here".

Read the help message.  If that's not enough, we sure can improve it.

> So I'm asking you to change the name.
> 
> USB is well known acronym, so it is okay to have CONFIG_USB. PSI is
> also well known -- but means something else.
> 
> And the code kind-of acknowledges that acronym is unknown, by having
> /proc/pressure.

Your momentary confusion isn't the only criterion.

Thanks.

-- 
tejun


Re: [PATCH 0/10] psi: pressure stall information for CPU, memory, and IO v2

2018-07-30 Thread Tejun Heo
Hello,

On Mon, Jul 30, 2018 at 10:54:05AM -0700, Randy Dunlap wrote:
> I'd say he's trying to make something that is readable and easier to
> understand for users.

Sure, it's perfectly fine to make those suggestions and discuss them, but
the counterpoints have already been covered (e.g. PSI is a known
acronym associated with pressure, and the internal symbols all use it for
brevity and uniqueness).  There's no clear technically winning choice
here, and it's a decision of relatively low importance given that
it's confined to the kernel config.  I can't see any merit in turning it
into a contest for the last word.

Thanks.

-- 
tejun


Re: [PATCH 0/10] psi: pressure stall information for CPU, memory, and IO v2

2018-07-30 Thread Tejun Heo
Hello,

On Mon, Jul 30, 2018 at 07:39:40PM +0200, Pavel Machek wrote:
> > I'd rather have the internal config symbol match the naming scheme in
> > the code, where psi is a shorter, unique token as compared to e.g.
> > pressure, press, prsr, etc.
> 
> I'd do "pressure", really. Yes, psi is shorter, but I'd say that
> length is not really important there.

This is extreme bikeshedding without any relevance.  You can make
suggestions, but please lay it to rest.  There isn't any general
consensus against the current name, and you're just trying to push your
favorite name without proper justification after contributing nothing
to the project.  Please stop.

Thanks.

-- 
tejun


Re: [PATCH 0/2] ata: libahci: devslp fixes

2018-07-30 Thread Tejun Heo
On Mon, Jul 30, 2018 at 05:26:45PM +0200, Hans de Goede wrote:
> Hi,
> 
> On 30-07-18 17:22, Tejun Heo wrote:
> >On Mon, Jul 30, 2018 at 08:15:47AM -0700, Srinivas Pandruvada wrote:
> >>Hi Tejun,
> >>
> >>On Mon, 2018-07-02 at 12:01 -0700, Srinivas Pandruvada wrote:
> >>>Some minor fixes to be able to correctly set devslp register
> >>>to optimize power.
> >>>
> >>>Srinivas Pandruvada (2):
> >>>   ata: libahci: Correct setting of DEVSLP register
> >>>   ata: libahci: Allow reconfigure of DEVSLP register
> >>>
> >>Are you applying this series?
> >
> >I was waiting for Hans's reviews.  Hans, what do you think?
> 
> Ah I missed that this was another series. With the caveat that
> I do not really know that much about devslp, both patches
> seem sensible to me, so both are:
> 
> Reviewed-by: Hans de Goede 

Applied 1-2 to libata/for-4.19.

Thanks.

-- 
tejun


Re: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().

2018-07-30 Thread Tejun Heo
Hello,

On Tue, Jul 31, 2018 at 12:25:04AM +0900, Tetsuo Handa wrote:
> WQ_MEM_RECLAIM guarantees that "struct task_struct" is preallocated. But
> WQ_MEM_RECLAIM does not guarantee that the pending work is started as soon
> as an item was queued. Same rule applies to both WQ_MEM_RECLAIM workqueues 
> and !WQ_MEM_RECLAIM workqueues regarding when to start a pending work (i.e.
> when schedule_timeout_*() is called).
> 
> Is this correct?

WQ_MEM_RECLAIM guarantees that there's always going to be at least one
kworker running the workqueue.  But all per-cpu kworkers are subject
to concurrency-limited execution - ie. if any per-cpu kworker is
actively running on a cpu, no further kworkers will be scheduled there.

> >  We can add timeout mechanism to workqueue so that it
> > kicks off other kworkers if one of them is in running state for too
> > long, but idk, if there's an indefinite busy loop condition in kernel
> > threads, we really should get rid of them and hung task watchdog is
> > pretty effective at finding these cases (at least with preemption
> > disabled).
> 
> Currently the page allocator has a path which can loop forever with
> only cond_resched().

Yeah, workqueue can choke on things like that, and a kthread busy-looping
indefinitely doesn't do anybody any good.
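
For illustration, a minimal sketch - not code from this thread - of the
difference being discussed, assuming a work item stuck in a reclaim-style
retry loop; try_to_make_progress() is a hypothetical helper:

#include <linux/workqueue.h>
#include <linux/sched.h>

static bool try_to_make_progress(void);	/* hypothetical */

static void reclaim_retry_work_fn(struct work_struct *work)
{
	while (!try_to_make_progress()) {
		/*
		 * cond_resched() never blocks, so the concurrency-managed
		 * pool keeps seeing this kworker as running and will not
		 * wake another kworker to run other pending work items.
		 */
		cond_resched();
		/*
		 * Actually sleeping, even for one jiffy, lets the pool kick
		 * off another kworker - which is why PF_WQ_WORKER threads
		 * should sleep in should_reclaim_retry().
		 */
		schedule_timeout_uninterruptible(1);
	}
}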

Thanks.

-- 
tejun


Re: [PATCH 0/2] ata: libahci: devslp fixes

2018-07-30 Thread Tejun Heo
On Mon, Jul 30, 2018 at 08:15:47AM -0700, Srinivas Pandruvada wrote:
> Hi Tejun,
> 
> On Mon, 2018-07-02 at 12:01 -0700, Srinivas Pandruvada wrote:
> > Some minor fixes to be able to correctly set devslp register
> > to optimize power.
> > 
> > Srinivas Pandruvada (2):
> >   ata: libahci: Correct setting of DEVSLP register
> >   ata: libahci: Allow reconfigure of DEVSLP register
> > 
> Are you applying this series?

I was waiting for Hans's reviews.  Hans, what do you think?

Thanks.

-- 
tejun


Re: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().

2018-07-30 Thread Tejun Heo
Hello,

On Mon, Jul 30, 2018 at 04:46:47PM +0200, Michal Hocko wrote:
> On Mon 30-07-18 23:34:23, Tetsuo Handa wrote:
> > On 2018/07/30 18:32, Michal Hocko wrote:
> [...]
> > > This one is waiting for draining and we are in mm_percpu_wq WQ context
> > > which has its rescuer so no other activity can block us for ever. So
> > > this certainly shouldn't deadlock. It can be dead slow but well, this is
> > > what you will get when you shoot your system to death.
> > 
> > We need schedule_timeout_*() to allow such WQ_MEM_RECLAIM workqueues to 
> > wake up. (Tejun,
> > is my understanding correct?) Lack of schedule_timeout_*() does block 
> > WQ_MEM_RECLAIM
> > workqueues forever.
> 
> Hmm. This doesn't match my understanding of what WQ_MEM_RECLAIM actually
> guarantees. If you are right then the whole thing sounds quite fragile
> to me TBH.

Workqueue doesn't think the cpu is stalled as long as one of the
per-cpu kworkers is running.  The assumption is that kernel threads
are not supposed to be busy-looping indefinitely (and they really
shouldn't).  We can add a timeout mechanism to workqueue so that it
kicks off other kworkers if one of them stays in the running state for
too long, but idk - if there's an indefinite busy-loop condition in
kernel threads, we really should get rid of it, and the hung task
watchdog is pretty effective at finding these cases (at least with
preemption disabled).

Thanks.

-- 
tejun


Re: [PATCH v2 0/2] ata: ahci: Enable DEVSLP by default on SLP_S0 support

2018-07-30 Thread Tejun Heo
On Fri, Jul 27, 2018 at 01:47:01PM -0700, Srinivas Pandruvada wrote:
> One of the requirement for modern x86 system to enter lowest power mode
> (SLP_S0) is SATA IP block to be off. This is true even during when
> platform is suspended to idle and not only in opportunistic (runtime)
> suspend.
> This series is to enable DEVSLP by default.

Applied 1-2 to libata/for-4.19.

Thanks.

-- 
tejun


Re: cgroups iptables-restor: vmalloc: allocation failure

2018-07-25 Thread Tejun Heo
On Wed, Jul 25, 2018 at 02:50:01PM +0300, Georgi Nikolov wrote:
> I posted a kernel bug https://bugzilla.kernel.org/show_bug.cgi?id=200651 and
> I hope this is the correct place to discuss this.

That's just NORETRY vmalloc allocation failing under memory pressure.
Doesn't have much to do with cgroups except that there are also a
bunch of cgroups consuming memory.

Thanks.

-- 
tejun


Re: [PATCH] delayacct: Fix crash in delayacct_blkio_end() after delayacct init failure

2018-07-24 Thread Tejun Heo
Hello, Andrew.

On Tue, Jul 24, 2018 at 12:29:13PM -0700, Andrew Morton wrote:
> How did you make this happen, btw?  Fault injection, or did a small
> GFP_KERNEL allocation fail?

We have a group of machines which are pushing memory really hard and
this actually triggered in prod on several of them.

Thanks.

-- 
tejun


[PATCH] delayacct: Fix crash in delayacct_blkio_end() after delayacct init failure

2018-07-24 Thread Tejun Heo
While forking, if delayacct init fails due to memory shortage, it
continues, expecting all delayacct users to check the task->delays pointer
against NULL before dereferencing it, which all of them used to do.

c96f5471ce7d ("delayacct: Account blkio completion on the correct
task"), while updating delayacct_blkio_end() to take the target task
instead of always using %current, made the function test NULL on
%current->delays and then continue to operate on @p->delays.  If
%current succeeded init while @p didn't, this leads to the following
crash.

 BUG: unable to handle kernel NULL pointer dereference at 0004
 IP: __delayacct_blkio_end+0xc/0x40
 PGD 801fd07e1067 P4D 801fd07e1067 PUD 1fcffbb067 PMD 0 
 Oops:  [#1] SMP PTI
 CPU: 4 PID: 25774 Comm: QIOThread0 Not tainted 
4.16.0-9_fbk1_rc2_1180_g6b593215b4d7 #9
 Hardware name: Quanta Leopard ORv2-DDR4/Leopard ORv2-DDR4, BIOS F06_3B12 
08/17/2017
 RIP: 0010:__delayacct_blkio_end+0xc/0x40
 RSP: :881fff703bf8 EFLAGS: 00010086
 RAX: 881f1ec8b800 RBX: 8804f735cd54 RCX: 881fff703cb0
 RDX: 0002 RSI: 0003 RDI: 
 RBP:  R08:  R09: 881fff703cc0
 R10: 1000 R11: 881fd3f73d00 R12: 8804f735c600
 R13:  R14: 001d R15: 881fff703cb0
 FS:  7f5003f7d700() GS:881fff70() knlGS:
 CS:  0010 DS:  ES:  CR0: 80050033
 CR2: 0004 CR3: 001f401a6006 CR4: 003606e0
 DR0:  DR1:  DR2: 
 DR3:  DR6: fffe0ff0 DR7: 0400
 Call Trace:
  
  try_to_wake_up+0x2c0/0x600
  autoremove_wake_function+0xe/0x30
  __wake_up_common+0x74/0x120
  wake_up_page_bit+0x9c/0xe0
  mpage_end_io+0x27/0x70
  blk_update_request+0x78/0x2c0
  scsi_end_request+0x2c/0x1e0
  scsi_io_completion+0x20b/0x5f0
  blk_mq_complete_request+0xa2/0x100
  ata_scsi_qc_complete+0x79/0x400
  ata_qc_complete_multiple+0x86/0xd0
  ahci_handle_port_interrupt+0xc9/0x5c0
  ahci_handle_port_intr+0x54/0xb0
  ahci_single_level_irq_intr+0x3b/0x60
  __handle_irq_event_percpu+0x43/0x190
  handle_irq_event_percpu+0x20/0x50
  handle_irq_event+0x2a/0x50
  handle_edge_irq+0x80/0x1c0
  handle_irq+0xaf/0x120
  do_IRQ+0x41/0xc0
  common_interrupt+0xf/0xf
  

Fix it by updating delayacct_blkio_end() to check @p->delays instead.

Signed-off-by: Tejun Heo 
Reported-and-debugged-by: Dave Jones 
Cc: Josh Snyder 
Fixes: c96f5471ce7d ("delayacct: Account blkio completion on the correct task")
Cc: sta...@vger.kernel.org # v4.15+
---
 include/linux/delayacct.h |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/delayacct.h b/include/linux/delayacct.h
index e6c0448ebcc7..31c865d1842e 100644
--- a/include/linux/delayacct.h
+++ b/include/linux/delayacct.h
@@ -124,7 +124,7 @@ static inline void delayacct_blkio_start(void)
 
 static inline void delayacct_blkio_end(struct task_struct *p)
 {
-	if (current->delays)
+	if (p->delays)
 		__delayacct_blkio_end(p);
 	delayacct_clear_flag(DELAYACCT_PF_BLKIO);
 }


Re: [PATCH] Revert "ata: ahci_platform: convert kcalloc to devm_kcalloc"

2018-07-24 Thread Tejun Heo
On Tue, Jul 17, 2018 at 10:49:35AM +, Corentin Labbe wrote:
> Since ahci_platform_put_resources() use target_pwrs after "devm_" freed
> it, we cannot use devm_kcalloc for allocating target_pwrs.
> 
> This reverts commit bd0038b1b4f499d814d8f33a55b1df5ea6cf3b85.
> 
> Reported-by: Mikko Perttunen 
> Signed-off-by: Corentin Labbe 

Applied to libata/for-4.19.  Sorry about the delay.

Thanks.

-- 
tejun


Re: [PATCH v2 08/12] sched/core: uclamp: extend cpu's cgroup controller

2018-07-24 Thread Tejun Heo
Hello, Patrick.

On Mon, Jul 23, 2018 at 06:22:15PM +0100, Patrick Bellasi wrote:
> However, the "best effort" bandwidth control we have for CFS and RT
> can be further improved if, instead of just looking at time spent on
> CPUs, we provide some more hints to the scheduler to know at which
> min/max "MIPS" we want to consume the (best effort) time we have been
> allocated on a CPU.
> 
> Such a simple extension is still quite useful to satisfy many use-case
> we have, mainly on mobile systems, like the ones I've described in the
>"Newcomer's Short Abstract (Updated)"
> section of the cover letter:
>
> https://lore.kernel.org/lkml/20180716082906.6061-1-patrick.bell...@arm.com/T/#u

So, that's all completely fine but then let's please not give it a
name which doesn't quite match what it does.  We can just call it
e.g. cpufreq range control.

> > So, there are fundamental discrepancies between
> > description+interface vs. what it actually does.
> 
> Perhaps then I should just change the description to make it less
> generic...

I think so, along with the interface itself.

> > I really don't think that's something we can fix up later.
> 
> ... since, really, I don't think we can get to the point to extend
> later this interface to provide the strict bandwidth enforcement you
> are thinking about.

That's completely fine.  The interface just has to match what's
implemented.

...
> > and what you're describing inherently breaks the delegation model.
> 
> What I describe here is just an additional hint to the scheduler which
> enriches the above described model. Provided A and B are already
> satisfied, when a task gets a chance to run it will be executed at a
> min/max configured frequency. That's really all... there is no
> additional impact on "resources allocation".

So, if it's a cpufreq range controller, it'd have something like
cpu.freq.min and cpu.freq.max, where min defines the maximum minimum
cpufreq its descendants can get and max defines the maximum cpufreq
allowed in the subtree.  For an example, please refer to how
memory.min and memory.max are defined.
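
To make the hierarchical semantics concrete, here is a sketch of how such
clamps could compose; cpu.freq.min and cpu.freq.max are only the
hypothetical knobs named above, and the struct and field names are made up
for illustration, not an existing interface:

struct cg {
	struct cg *parent;
	unsigned int freq_min;	/* requested minimum cpufreq (kHz) */
	unsigned int freq_max;	/* requested maximum cpufreq (kHz) */
};

/*
 * An ancestor's max bounds the whole subtree, and its min bounds how
 * much minimum frequency a descendant may request, roughly mirroring
 * how memory.min/memory.max compose.
 */
static void effective_freq_clamp(const struct cg *cg,
				 unsigned int *min, unsigned int *max)
{
	*min = cg->freq_min;
	*max = cg->freq_max;

	for (cg = cg->parent; cg; cg = cg->parent) {
		if (cg->freq_min < *min)
			*min = cg->freq_min;
		if (cg->freq_max < *max)
			*max = cg->freq_max;
	}
	if (*min > *max)
		*min = *max;
}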

Thanks.

-- 
tejun


Re: [PATCH v2 08/12] sched/core: uclamp: extend cpu's cgroup controller

2018-07-23 Thread Tejun Heo
Hello,

On Mon, Jul 16, 2018 at 09:29:02AM +0100, Patrick Bellasi wrote:
> The cgroup's CPU controller allows assigning a specified (maximum)
> bandwidth to the tasks of a group. However, this bandwidth is defined and
> enforced only on a temporal basis, without considering the actual
> frequency a CPU is running on. Thus, the amount of computation completed
> by a task within an allocated bandwidth can be very different depending
> on the actual frequency at which the CPU is running that task.
> The amount of computation can also be affected by the specific CPU a
> task is running on, especially when running on asymmetric capacity
> systems like Arm's big.LITTLE.

One basic problem I have with this patchset is that what's being
described is way more generic than what actually got implemented.
What's described is computation bandwidth control, but what's
implemented is just frequency clamping.  So, there are fundamental
discrepancies between the description+interface and what it actually does.

I really don't think that's something we can fix up later.

> These attributes:
> 
> a) are available only for non-root nodes, both on default and legacy
>hierarchies
> b) do not enforce any constraints and/or dependency between the parent
>and its child nodes, thus relying on the delegation model and
>permission settings defined by the system management software

cgroup does host attributes which only concern the cgroup itself and
thus don't need any hierarchical behaviors on their own, but what's
being implemented does control resource allocation, and what you're
describing inherently breaks the delegation model.

> c) allow to (eventually) further restrict task-specific clamps defined
>via sched_setattr(2)

Thanks.

-- 
tejun


Re: [PATCH kselftest-next] Add cgroup core selftests

2018-07-19 Thread Tejun Heo
On Wed, Jul 18, 2018 at 07:33:58PM +0200, Claudio wrote:
> This commit adds tests for some of the core functionalities
> of cgroups v2.
> 
> The commit adds tests for some core principles of the cgroup v2 API:

Acked-by: Tejun Heo 

Thanks, Claudio.

-- 
tejun


Re: [PATCH 09/10] psi: cgroup support

2018-07-12 Thread Tejun Heo
On Thu, Jul 12, 2018 at 01:29:41PM -0400, Johannes Weiner wrote:
> On a system that executes multiple cgrouped jobs and independent
> workloads, we don't just care about the health of the overall system,
> but also that of individual jobs, so that we can ensure individual job
> health, fairness between jobs, or prioritize some jobs over others.
> 
> This patch implements pressure stall tracking for cgroups. In kernels
> with CONFIG_PSI=y, cgroup2 groups will have cpu.pressure,
> memory.pressure, and io.pressure files that track aggregate pressure
> stall times for only the tasks inside the cgroup.
> 
> Signed-off-by: Johannes Weiner 

Acked-by: Tejun Heo 

Please feel free to route with the rest of the patchset.
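
As context for the interface described above, such a per-cgroup pressure
file can be read like any other cgroup2 interface file; a minimal
user-space sketch, where the mount point and group name are only examples:

#include <stdio.h>

int main(void)
{
	char line[256];
	FILE *f = fopen("/sys/fs/cgroup/mygroup/memory.pressure", "r");

	if (!f) {
		perror("fopen");
		return 1;
	}
	while (fgets(line, sizeof(line), f))
		fputs(line, stdout);	/* aggregate stall times for this group */
	fclose(f);
	return 0;
}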

Thanks.

-- 
tejun


Re: [PATCH] ata: Only output sg element mapped number in verbose debug

2018-07-12 Thread Tejun Heo
On Thu, Jul 12, 2018 at 06:31:03PM +0200, Paul Menzel wrote:
> Date: Sun, 8 Jul 2018 09:18:21 +0200
> 
> Defining `ATA_DEBUG`, there are a lot of messages like the one below in the log.
> 
> [   16.345472] ata_sg_setup: 1 sg elements mapped
> 
> As that is too verbose, only output these messages in verbose debug.
> 
> Signed-off-by: Paul Menzel 

Applied to libata/for-4.19.  Thanks.

-- 
tejun


Re: [PATCH] ata: Guard ata_scsi_dump_cdb() by ATA_VERBOSE_DEBUG

2018-07-12 Thread Tejun Heo
On Thu, Jul 12, 2018 at 06:29:44PM +0200, Paul Menzel wrote:
> Date: Sun, 8 Jul 2018 09:11:34 +0200
> 
> Defining `ATA_DEBUG`, nothing can really be seen, as the log is spammed
> with CDB messages.
> 
> Therefore, guard the print by `ATA_VERBOSE_DEBUG`.
> 
> Signed-off-by: Paul Menzel 

Applied to libata/for-4.19.

Thanks.

-- 
tejun


Re: [PATCH 0/3] ata: ahci_platform: minor fixes

2018-07-12 Thread Tejun Heo
On Thu, Jul 12, 2018 at 11:41:28AM +, Corentin Labbe wrote:
> Hello
> 
> This patchset fixes some minor problems found when working on supporting
> the Allwinner R40 AHCI.
> 
> Regards
> 
> Corentin Labbe (3):
>   ata: ahci_platform: correct parameter documentation for
> ahci_platform_shutdown
>   ata: ahci_platform: convert kzallloc to kcalloc
>   ata: ahci_platform: convert kcalloc to devm_kcalloc

Applied all three to libata/for-4.19 with Hans's ack added.

Thanks.

-- 
tejun


[GIT PULL] libata fixes for v4.18-rc4

2018-07-11 Thread Tejun Heo
Hello, Linus.

* Jens's patches to expand the usable command depth from 31 to 32
  broke sata_fsl due to a subtle command iteration bug.  Fixed by
  introducing explicit iteration helpers and using the correct
  variant.

* On some laptops, enabling LPM by default reportedly led to
  occasional hard hangs.  Blacklist the affected cases.

* Other misc fixes / changes.

Thanks.

The following changes since commit 9ffc59d57228d74809700be6f7ecb1db10292f05:

  Merge tag '4.18-rc1-more-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6 
(2018-06-18 14:28:19 +0900)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata.git for-4.18-fixes

for you to fetch changes up to d55bac2754476624f23bdf3b908d117f3cdf469b:

  ata: Remove depends on HAS_DMA in case of platform dependency (2018-07-02 
13:18:22 -0700)


Colin Ian King (1):
  sata_nv: remove redundant pointers sdev0 and sdev1

Damien Le Moal (2):
  ata: Fix ZBC_OUT command block check
  ata: Fix ZBC_OUT all bit handling

Geert Uytterhoeven (1):
  ata: Remove depends on HAS_DMA in case of platform dependency

Hans de Goede (1):
  ahci: Disable LPM on Lenovo 50 series laptops with a too old BIOS

Jens Axboe (4):
  libata: add command iterator helpers
  libata: convert eh to command iterators
  sata_fsl: convert to command iterator
  sata_fsl: remove dead code in tag retrieval

John Garry (1):
  libahci: Fix possible Spectre-v1 pmp indexing in ahci_led_store()

Mika Westerberg (1):
  ahci: Add Intel Ice Lake LP PCI ID

Wei Yongjun (1):
  ata: ahci_mvebu: ahci_mvebu_stop_engine() can be static

 drivers/ata/Kconfig   |  2 --
 drivers/ata/ahci.c| 60 +++
 drivers/ata/ahci_mvebu.c  |  2 +-
 drivers/ata/libahci.c |  7 --
 drivers/ata/libata-core.c |  3 +++
 drivers/ata/libata-eh.c   | 41 +---
 drivers/ata/libata-scsi.c | 18 +-
 drivers/ata/sata_fsl.c|  9 +--
 drivers/ata/sata_nv.c |  3 ---
 include/linux/libata.h| 24 +++
 10 files changed, 122 insertions(+), 47 deletions(-)

-- 
tejun


Re: [PATCH] cgroup/tracing: Move taking of spin lock out of trace event handlers

2018-07-11 Thread Tejun Heo
On Mon, Jul 09, 2018 at 05:48:54PM -0400, Steven Rostedt wrote:
> From: Steven Rostedt (VMware) 
> 
> It is unwise to take spin locks from the handlers of trace events.
> Mainly, because they can introduce lockups, because it introduces locks
> in places that are normally not tested. Worse yet, because trace events
> are tucked away in the include/trace/events/ directory, locks that are
> taken there are forgotten about.
> 
> As a general rule, I tell people never to take any locks in a trace
> event handler.
> 
> Several cgroup trace event handlers call cgroup_path() which eventually
> takes the kernfs_rename_lock spinlock. This injects the spinlock in the
> code without people realizing it. It also can cause issues for the
> PREEMPT_RT patch, as the spinlock becomes a mutex, and the trace event
> handlers are called with preemption disabled.
> 
> By moving the calculation of the cgroup_path() out of the trace event
> handlers and into a macro (surrounded by a
> trace_cgroup_##type##_enabled()), then we could place the cgroup_path
> into a string, and pass that to the trace event. Not only does this
> remove the taking of the spinlock out of the trace event handler, but
> it also means that the cgroup_path() only needs to be called once (it
> is currently called twice, once to get the length to reserve the
> buffer for, and once again to get the path itself). Now it only needs to
> be done once.
> 
> Reported-by: Sebastian Andrzej Siewior 
> Signed-off-by: Steven Rostedt (VMware) 

Applied to cgroup/for-4.19.
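
For illustration, the pattern described in the commit message looks
roughly like the sketch below; the event name, buffer handling, and
helper are hypothetical, not the actual macro from the patch:

static void report_cgroup_event(struct cgroup *cgrp)
{
	/*
	 * trace_<event>_enabled() is generated for every trace event, so
	 * cgroup_path() - and the spinlock it takes - is only paid for
	 * when the event is actually being traced.
	 */
	if (trace_cgroup_example_enabled()) {
		char path[256];		/* hypothetical fixed-size buffer */

		cgroup_path(cgrp, path, sizeof(path));
		/* the handler only copies the prebuilt string */
		trace_cgroup_example(cgrp, path);
	}
}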

Thanks.

-- 
tejun


Re: [PATCH] cgroup: use irqsave in cgroup_rstat_flush_locked()

2018-07-11 Thread Tejun Heo
Hello, Sebastian.

On Wed, Jul 11, 2018 at 01:05:13PM +0200, Sebastian Andrzej Siewior wrote:
> > > We at least used to do this in the kernel - manipulating irqsafe locks
> > > with spin_lock/unlock() if the irq state is known, whether enabled or
> > > disabled, and ISTR lockdep being smart enough to track actual irq
> > > state to determine irq safety.  Am I misremembering or is this
> > > different on RT kernels?
> > 
> > No, this is correct. So on !RT kernels the spin_lock_irq() disables
> > interrupts and the raw_spin_lock() has the interrupts already disabled,
> > everything is good. On RT kernels the spin_lock_irq() does not disable
> > interrupts and the raw_spin_lock() acquires the lock with enabled
> > interrupts and lockdep complains properly.
> > lockdep sees the hardirq path via:

I feel wary about applying a patch which isn't needed in mainline,
especially without annotations or at least comments.  I suppose it may
not be too common, but this can't be the only place which needs this,
and using irqsave/restore spuriously in all those sites doesn't sound
like a good solution.  Is there any other way of handling this?

Thanks.

-- 
tejun


Re: [PATCH] cgroup: use irqsave in cgroup_rstat_flush_locked()

2018-07-03 Thread Tejun Heo
(cc'ing Peter and Ingo for lockdep)

Hello, Sebastian.

On Tue, Jul 03, 2018 at 06:45:44PM +0200, Sebastian Andrzej Siewior wrote:
> All callers of cgroup_rstat_flush_locked() acquire cgroup_rstat_lock
> either with spin_lock_irq() or spin_lock_irqsave().

So, irq is always disabled in cgroup_rstat_flush_locked().

> cgroup_rstat_flush_locked() itself acquires cgroup_rstat_cpu_lock which
> is a raw_spin_lock. This lock is also acquired in cgroup_rstat_updated()
> in IRQ context and therefore requires _irqsave() locking suffix in
> cgroup_rstat_flush_locked().

Yes, the cpu locks should be irqsafe too; however, as irq is always
disabled in that function, save/restore is redundant, no?

> Since there is no difference between spin_lock_t and raw_spin_lock_t
> on !RT lockdep does not complain here. On RT lockdep complains because
> the interrupts were not disabled here and a deadlock is possible.

We at least used to do this in the kernel - manipulating irqsafe locks
with spin_lock/unlock() if the irq state is known, whether enabled or
disabled, and ISTR lockdep being smart enough to track actual irq
state to determine irq safety.  Am I misremembering or is this
different on RT kernels?
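
For reference, a simplified sketch of the locking shape in question; the
names loosely mirror the rstat code, but the bodies and initialization
are elided for brevity:

#include <linux/spinlock.h>
#include <linux/percpu.h>
#include <linux/cpumask.h>

static DEFINE_SPINLOCK(outer_lock);	/* irq-safe lock taken by all callers */
static DEFINE_PER_CPU(raw_spinlock_t, cpu_lock);  /* raw_spin_lock_init() at boot */

static void flush_one_cpu(int cpu)
{
	raw_spinlock_t *lock = per_cpu_ptr(&cpu_lock, cpu);

	/*
	 * On !RT, the caller's spin_lock_irq() already disabled irqs, so a
	 * plain raw_spin_lock() is irq-safe here and save/restore would be
	 * redundant.  On RT, spin_lock_irq() does not disable irqs, which
	 * is what lockdep ends up complaining about.
	 */
	raw_spin_lock(lock);
	/* fold this cpu's counters into the global state */
	raw_spin_unlock(lock);
}

static void flush_all(void)
{
	int cpu;

	spin_lock_irq(&outer_lock);
	for_each_possible_cpu(cpu)
		flush_one_cpu(cpu);
	spin_unlock_irq(&outer_lock);
}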

Thanks.

-- 
tejun


Re: WARN_ON_ONCE() in process_one_work()?

2018-07-03 Thread Tejun Heo
Hello, Paul.

On Tue, Jul 03, 2018 at 09:40:44AM -0700, Paul E. McKenney wrote:
> > I will apply this, but be advised that I have not seen that WARN_ON_ONCE()
> > trigger since.  :-/
> 
> But I get a build error:
> 
> kernel/workqueue.o: In function `worker_attach_to_pool':
> workqueue.c:(.text+0x63c): undefined reference to `cpuhp_target_state'
> workqueue.c:(.text+0x647): undefined reference to `cpuhp_current_state'
> /home/paulmck/public_git/linux-rcu/Makefile:1015: recipe for target 'vmlinux' 
> failed
> 
> My guess is that the worker_attach_to_pool() code needs to change so
> as to invoke cpuhp_current_state() and cpuhp_target_state() only in
> CONFIG_HOTPLUG_CPU=y kernels, but I figured I should defer to you.

Just to clarify, the bug happened on a kernel with cpuhp enabled,
right?

Thanks.

-- 
tejun


Re: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier

2018-07-03 Thread Tejun Heo
Hello,

On Tue, Jul 03, 2018 at 03:22:47PM +1000, Benjamin Herrenschmidt wrote:
> +bool kernfs_has_children(struct kernfs_node *kn)
> +{
> + bool has_children = false;
> + struct kernfs_node *pos;
> +
> + /* Lockless shortcut */
> + if (RB_EMPTY_NODE(&kn->rb))
> + return false;

Hmm... shouldn't it be testing !rb_first(kn->dir.children)?  The above
would test whether @kn itself is unlinked.

> +
> + /* Now check for active children */
> + mutex_lock(&kernfs_mutex);
> + pos = NULL;
> + while ((pos = kernfs_next_descendant_post(pos, kn)) && pos != kn) {
> + if (kernfs_active(pos)) {
> + has_children = true;
> + break;
> + }
> + }
> + mutex_unlock(&kernfs_mutex);
> +
> + return has_children;

The active ref is there to synchronize removal against in-flight reads
so that kernfs_remove() can wait for them to drain.  On return from
kernfs_remove(), the node is guaranteed to be off the rbtree and the
above test isn't necessary.  !rb_first() test should be enough
(provided that there's external synchronization, of course).
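
A sketch of the simpler check suggested above; it assumes the caller
already serializes against concurrent additions and removals, so no lock
is taken:

static bool kernfs_has_children(struct kernfs_node *kn)
{
	/* children were unlinked by kernfs_remove() before it returned */
	return !!rb_first(&kn->dir.children);
}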

Thanks.

-- 
tejun


Re: printk() from NMI backtrace can delay a lot

2018-07-03 Thread Tejun Heo
Hello, Sergey.

On Tue, Jul 03, 2018 at 01:30:21PM +0900, Sergey Senozhatsky wrote:
> Cc-ing Linus, Tejun, Andrew
> [I'll keep the entire lockdep report]
> 
> On (07/02/18 19:26), Tetsuo Handa wrote:
> [..]
> > 2018-07-02 12:13:13 192.168.159.129: [  151.606834] swapper/0/0 is 
> > trying to acquire lock:
> > 2018-07-02 12:13:13 192.168.159.129: [  151.606835] 316e1432 
> > (console_owner){-.-.}, at: console_unlock+0x1ce/0x8b0
> > 2018-07-02 12:13:13 192.168.159.129: [  151.606840] 
> > 2018-07-02 12:13:13 192.168.159.129: [  151.606841] but task is already 
> > holding lock:
> > 2018-07-02 12:13:13 192.168.159.129: [  151.606842] 9b45dcb4 
> > (&(&pool->lock)->rlock){-.-.}, at: show_workqueue_state+0x3b2/0x900
> > 2018-07-02 12:13:13 192.168.159.129: [  151.606847] 
> > 2018-07-02 12:13:13 192.168.159.129: [  151.606848] which lock already 
> > depends on the new lock.
...
> But anyway. So we can have [but I'm not completely sure. Maybe lockdep has
> something else on its mind] something like this:
> 
>   CPU1                              CPU0
> 
>   #IRQ                              #soft irq
>   serial8250_handle_irq()           wq_watchdog_timer_fn()
>    spin_lock(&uart_port->lock)       show_workqueue_state()
>     serial8250_rx_chars()             spin_lock(&pool->lock)
>      tty_flip_buffer_push()            printk()
>       tty_schedule_flip()               serial8250_console_write()
>        queue_work()                      spin_lock(&uart_port->lock)
>         __queue_work()
>          spin_lock(&pool->lock)
> 
> We need to break the pool->lock -> uart_port->lock chain.
> 
> - use printk_deferred() to show WQs states [show_workqueue_state() is
>   a timer callback, so local IRQs are enabled]. But show_workqueue_state()
>   is also available via sysrq.
> 
> - what Alan Cox suggested: use spin_trylock() in serial8250_console_write()
>   and just discard (do not print anything on console) console->writes() that
>   can deadlock us [uart_port->lock is already locked]. This basically means
>   that sometimes there will be no output on a serial console, or there
>   will be missing line. Which kind of contradicts the purpose of print
>   out.
> 
> We are facing the risk of no output on serial consoles in both case. Thus
> there must be some other way out of this.

show_workqueue_state() is only used when something is already horribly
broken or when invoked through sysrq.  I'm not sure it's worthwhile to
make invasive changes to avoid lockdep warnings.  If anything, we
should make show_workqueue_state() avoid grabbing pool->lock (e.g. use
trylock and fall back to probe_kernel_read() if that fails).  I'm a bit
skeptical about how useful that'd actually be, though.

Thanks.

-- 
tejun


Re: [PATCH 14/14] skip readahead if the cgroup is congested

2018-07-02 Thread Tejun Heo
On Fri, Jun 29, 2018 at 03:25:42PM -0400, Josef Bacik wrote:
> From: Josef Bacik 
> 
> We noticed in testing we'd get pretty bad latency stalls under heavy
> pressure because read ahead would try to do its thing while the cgroup
> was under severe pressure.  If we're under this much pressure we want to
> do as little IO as possible so we can still make progress on real work
> if we're a throttled cgroup, so just skip readahead if our group is
> under pressure.
> 
> Signed-off-by: Josef Bacik 
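
The shape of the change is roughly the following early-out in the
readahead path (a schematic fragment, assuming a congestion test helper
like blk_cgroup_congested() from earlier in this series):

	/*
	 * The issuing cgroup is being throttled; skip readahead entirely
	 * so the group only spends its limited IO budget on real work.
	 */
	if (blk_cgroup_congested())
		return;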

Acked-by: Tejun Heo 

Thanks.

-- 
tejun


Re: WARN_ON_ONCE() in process_one_work()?

2018-07-02 Thread Tejun Heo
Hello, Paul.

Sorry about the late reply.

On Wed, Jun 20, 2018 at 12:29:01PM -0700, Paul E. McKenney wrote:
> I have hit this WARN_ON_ONCE() in process_one_work:
> 
>   WARN_ON_ONCE(!(pool->flags & POOL_DISASSOCIATED) &&
>raw_smp_processor_id() != pool->cpu);
> 
> This looks like it is my rcu_gp workqueue (see splat below), and it
> appears to be intermittent.  This happens on rcutorture scenario SRCU-N,
> which does random CPU-hotplug operations (in case that helps).
> 
> Is this related to the recent addition of WQ_MEM_RECLAIM?  Either way,
> what should I do to further debug this?

Hmm... I checked the code paths but couldn't spot anything suspicious.
Can you please apply the following patch and see whether it triggers
before hitting the warn and if so report what it says?

Thanks.

diff --git a/kernel/cpu.c b/kernel/cpu.c
index 0db8938fbb23..81caab9643b2 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -79,6 +79,15 @@ static struct lockdep_map cpuhp_state_up_map =
 static struct lockdep_map cpuhp_state_down_map =
	STATIC_LOCKDEP_MAP_INIT("cpuhp_state-down", &cpuhp_state_down_map);
 
+int cpuhp_current_state(int cpu)
+{
+   return per_cpu_ptr(&cpuhp_state, cpu)->state;
+}
+
+int cpuhp_target_state(int cpu)
+{
+   return per_cpu_ptr(&cpuhp_state, cpu)->target;
+}
 
 static inline void cpuhp_lock_acquire(bool bringup)
 {
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 78b192071ef7..365cf6342808 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1712,6 +1712,9 @@ static struct worker *alloc_worker(int node)
return worker;
 }
 
+int cpuhp_current_state(int cpu);
+int cpuhp_target_state(int cpu);
+
 /**
  * worker_attach_to_pool() - attach a worker to a pool
  * @worker: worker to be attached
@@ -1724,13 +1727,20 @@ static struct worker *alloc_worker(int node)
 static void worker_attach_to_pool(struct worker *worker,
   struct worker_pool *pool)
 {
+   int ret;
+
	mutex_lock(&wq_pool_attach_mutex);
 
/*
 * set_cpus_allowed_ptr() will fail if the cpumask doesn't have any
 * online CPUs.  It'll be re-applied when any of the CPUs come up.
 */
-   set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
+   ret = set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
+   if (ret && pool->cpu >= 0 && worker->rescue_wq)
+   printk("XXX rescuer failed to attach: ret=%d pool=%d this_cpu=%d target_cpu=%d cpuhp_state=%d chuhp_target=%d\n",
+  ret, pool->id, raw_smp_processor_id(), pool->cpu,
+  cpuhp_current_state(pool->cpu),
+  cpuhp_target_state(pool->cpu));
 
/*
 * The wq_pool_attach_mutex ensures %POOL_DISASSOCIATED remains


Re: [RFC PATCH] ata: ahci: Enable DEVSLP by default on x86 modern standby platform

2018-07-02 Thread Tejun Heo
On Mon, Jul 02, 2018 at 12:08:45PM -0700, Srinivas Pandruvada wrote:
> From: Srinivas 
> 
> One of the requirements for a modern x86 system to enter its lowest power
> mode (SLP_S0) is for the SATA IP block to be off. This is true even when
> the platform is suspended to idle and not only in opportunistic (runtime)
> suspend.
> 
> Several of these systems don't have traditional ACPI S3, so it is
> important that they enter the SLP_S0 state, to avoid draining the battery
> during suspend even with an out-of-the-box Linux installation.
> 
> SATA IP block doesn't get turned off till SATA is in DEVSLP mode. Here
> user has to either use scsi-host sysfs or tools like powertop to set
> the sata-host link_power_management_policy to min_power.
> 
> This change sets by default link power management policy to min_power
> for some platforms.  To avoid regressions, the following conditions
> are used:
> - The kernel config is already set to use med_power_with_dipm or deeper
> - System is a modern standby system using ACPI low power idle flag
> - The platform is not blacklisted for Suspend to Idle and suspend
> to idle is used instead of S3
> This combination will make sure that systems are fairly recent and
> since getting shipped with modern standby, the DEVSLP function
> is already validated.
> 
> Signed-off-by: Srinivas Pandruvada 

Seems sane to me.  Hans, what do you think?

Thanks.

-- 
tejun


Re: [PATCH v4] ata: Remove depends on HAS_DMA in case of platform dependency

2018-07-02 Thread Tejun Heo
On Fri, Jun 22, 2018 at 01:03:07PM +0200, Geert Uytterhoeven wrote:
> Remove dependencies on HAS_DMA where a Kconfig symbol depends on another
> symbol that implies HAS_DMA, and, optionally, on "|| COMPILE_TEST".
> In most cases this other symbol is an architecture or platform specific
> symbol, or PCI.
> 
> Generic symbols and drivers without platform dependencies keep their
> dependencies on HAS_DMA, to prevent compiling subsystems or drivers that
> cannot work anyway.
> 
> This simplifies the dependencies, and allows to improve compile-testing.
> 
> Signed-off-by: Geert Uytterhoeven 
> Reviewed-by: Mark Brown 
> Acked-by: Robin Murphy 

Applied to libata/for-4.18-fixes.

Thanks.

-- 
tejun


Re: [PATCH v4] ata: Remove depends on HAS_DMA in case of platform dependency

2018-07-02 Thread Tejun Heo
On Fri, Jun 22, 2018 at 01:03:07PM +0200, Geert Uytterhoeven wrote:
> Remove dependencies on HAS_DMA where a Kconfig symbol depends on another
> symbol that implies HAS_DMA, and, optionally, on "|| COMPILE_TEST".
> In most cases this other symbol is an architecture or platform specific
> symbol, or PCI.
> 
> Generic symbols and drivers without platform dependencies keep their
> dependencies on HAS_DMA, to prevent compiling subsystems or drivers that
> cannot work anyway.
> 
> This simplifies the dependencies, and allows to improve compile-testing.
> 
> Signed-off-by: Geert Uytterhoeven 
> Reviewed-by: Mark Brown 
> Acked-by: Robin Murphy 

This looks fine to me but how do you wanna route it?  Should I apply
it to for-4.18-fixes or should it go with arch changes in some other
tree?

Thanks.

-- 
tejun


Re: [PATCH] writeback: update stale account_page_redirty() comment

2018-07-02 Thread Tejun Heo
On Mon, Jun 25, 2018 at 10:15:26AM -0700, Greg Thelen wrote:
> commit 93f78d882865 ("writeback: move backing_dev_info->bdi_stat[] into
> bdi_writeback") replaced BDI_DIRTIED with WB_DIRTIED in
> account_page_redirty().  Update comment to track that change.
>   BDI_DIRTIED => WB_DIRTIED
>   BDI_WRITTEN => WB_WRITTEN
> 
> Signed-off-by: Greg Thelen 

Acked-by: Tejun Heo 

Thanks.

-- 
tejun


Re: [PATCH] sched/cputime: Ensure correct utime and stime proportion

2018-07-02 Thread Tejun Heo
Hello, Peter.

On Tue, Jun 26, 2018 at 05:49:08PM +0200, Peter Zijlstra wrote:
> Well, no, because the Changelog is incomprehensible and the patch
> doesn't really have useful comments, so I'll have to reverse engineer
> the entire thing, and I've just not had time for that.

Just as an additional data point, we also sometimes see artifacts from
cpu_adjust_time() in the form of per-task user or sys time getting
stuck for some period (in extreme cases for over a minute) while the
application isn't doing anything differently.  We're telling the users
that it's an inherent sampling artifact but it'd be nice to improve it
if possible without adding noticeable overhead.  No idea whether this
patch's approach is a good one tho.

Thanks.

-- 
tejun


Re: [PATCH] ahci: Disable LPM on Lenovo 50 series laptops with a too old BIOS

2018-07-02 Thread Tejun Heo
On Sun, Jul 01, 2018 at 12:15:46PM +0200, Hans de Goede wrote:
> There have been several reports of LPM related hard freezes about once
> a day on multiple Lenovo 50 series models. Strange enough these reports
> where not disk model specific as LPM issues usually are and some users
> with the exact same disk + laptop where seeing them while other users
> where not seeing these issues.
> 
> It turns out that enabling LPM triggers a firmware bug somewhere, which
> has been fixed in later BIOS versions.
> 
> This commit adds a new ahci_broken_lpm() function and a new ATA_FLAG_NO_LPM
> for dealing with this.
> 
> The ahci_broken_lpm() function contains DMI match info for the 4 models
> which are known to be affected by this and the DMI BIOS date field for
> known good BIOS versions. If the BIOS date is older than the one in the
> table, LPM will be disabled and a warning will be printed.
> 
> Note the BIOS dates are for known good versions, some older versions may
> work too, but we don't know for sure, the table is using dates from BIOS
> versions for which users have confirmed that upgrading to that version
> makes the problem go away.
> 
> Unfortunately I've been unable to get hold of the reporter who reported
> that BIOS version 2.35 fixed the problems on the W541 for him. I've been
> able to verify the DMI_SYS_VENDOR and DMI_PRODUCT_VERSION from an older
> dmidecode, but I don't know the exact BIOS date as reported in the DMI.
> Lenovo keeps a changelog with dates in their release notes, but the
> dates there are the release dates not the build dates which are in DMI.
> So I've chosen to set the date to which we compare to one day past the
> release date of the 2.34 BIOS. I plan to fix this with a follow up
> commit once I've the necessary info.
> 
> Cc: sta...@vger.kernel.org
> Signed-off-by: Hans de Goede 
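
For reference, the DMI date check described above can be sketched as
follows (the model and cutoff date are placeholders, not the actual
table from the patch):

static bool ahci_broken_lpm_sketch(void)
{
	static const struct dmi_system_id sysids[] = {
		{
			.matches = {
				DMI_MATCH(DMI_SYS_VENDOR, "LENOVO"),
				DMI_MATCH(DMI_PRODUCT_VERSION, "ThinkPad W541"),
			},
			/* placeholder known-good BIOS build date, YYYYMMDD */
			.driver_data = "20180409",
		},
		{ }	/* terminator */
	};
	const struct dmi_system_id *dmi = dmi_first_match(sysids);
	int year, month, date;
	char buf[9];

	if (!dmi)
		return false;

	/* DMI BIOS date older than the known-good cutoff => disable LPM */
	dmi_get_date(DMI_BIOS_DATE, &year, &month, &date);
	snprintf(buf, sizeof(buf), "%04d%02d%02d", year, month, date);
	return strcmp(buf, dmi->driver_data) < 0;
}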

Applied to libata/for-4.18-fixes.

Thanks.

-- 
tejun


Re: [PATCH] kernfs: Replace strncpy with memcpy

2018-07-02 Thread Tejun Heo
On Sun, Jul 01, 2018 at 01:57:13PM -0700, Guenter Roeck wrote:
> gcc 8.1.0 complains:
> 
> fs/kernfs/symlink.c:91:3: warning:
>   'strncpy' output truncated before terminating nul copying
>   as many bytes from a string as its length
> fs/kernfs/symlink.c: In function 'kernfs_iop_get_link':
> fs/kernfs/symlink.c:88:14: note: length computed here
> 
> Using strncpy() is indeed less than perfect since the length of data to
> be copied has already been determined with strlen(). Replace strncpy()
> with memcpy() to address the warning and optimize the code a little.
> 
> Signed-off-by: Guenter Roeck 
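
Schematically, the change is the pattern below: once the length is known
from strlen(), memcpy() of exactly that many bytes says what is meant and
avoids the truncation warning (simplified, not the exact kernfs code):

	len = strlen(kn->name);
	/* was: strncpy(s, kn->name, len); -- gcc 8 warns, no NUL copied */
	memcpy(s, kn->name, len);
	s += len;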

Acked-by: Tejun Heo 

Thanks.

-- 
tejun


Re: [PATCH] sata_nv: remove redundant pointers sdev0 and sdev1

2018-07-02 Thread Tejun Heo
On Mon, Jul 02, 2018 at 08:30:22AM +0100, Colin King wrote:
> From: Colin Ian King 
> 
> Pointers sdev0 and sdev1 are being assigned but are never used hence they
> are redundant and can be removed.
> 
> Cleans up clang warnings:
> warning: variable 'sdev0' set but not used [-Wunused-but-set-variable]
> warning: variable 'sdev1' set but not used [-Wunused-but-set-variable]
> 
> Signed-off-by: Colin Ian King 

Applied to libata/for-4.18-fixes.

Thanks.

-- 
tejun


Re: [PATCH] libahci: Fix possible Spectre-v1 pmp indexing in ahci_led_store()

2018-06-18 Thread Tejun Heo
On Fri, Jun 08, 2018 at 06:26:33PM +0800, John Garry wrote:
> Currently smatch warns of possible Spectre-V1 issue in ahci_led_store():
> drivers/ata/libahci.c:1150 ahci_led_store() warn: potential spectre issue 
> 'pp->em_priv' (local cap)
> 
> Userspace controls @pmp from following callchain:
> em_message->store()
> ->ata_scsi_em_message_store()
> -->ap->ops->em_store()
> --->ahci_led_store()
> 
> After the mask+shift @pmp is effectively an 8b value, which is used to
> index into an array of length 8, so sanitize the array index.
> 
> Signed-off-by: John Garry 
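
The sanitizing pattern referred to above is, schematically, the
following (assuming EM_MAX_SLOTS is the bound of the em_priv[] array):

#include <linux/nospec.h>

	/* clamp the user-controlled index before it is used under speculation */
	pmp = array_index_nospec(pmp, EM_MAX_SLOTS);
	emp = &pp->em_priv[pmp];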

Applied to libata/for-4.18-fixes.

Thanks.

-- 
tejun


Re: Cgroups "pids" controller does not update "pids.current" count immediately

2018-06-15 Thread Tejun Heo
Hello, Ivan.

On Fri, Jun 15, 2018 at 08:40:02PM +0300, Ivan Zahariev wrote:
> The lazy pids accounting + modern fast CPUs makes the "pids.current"
> metric practically unusable for resource limiting in our case. For a
> test, when we started and ended one single process very quickly, we
> saw "pids.current" equal up to 185 (while the correct value at all
> time is either 0 or 1). If we want that a "cgroup" can spawn maximum
> 50 processes, we should use some high value like 300 for "pids.max",
> in order to compensate the pids uncharge lag (and this depends on
> the speed of the CPU and how busy the system is).

Yeah, that actually makes a lot of sense.  We can't keep everything
synchronous for obvious performance reasons but we definitely can wait
for RCU grace period before failing.  Forking might become a bit
slower while pids are draining but shouldn't fail and that shouldn't
incur any performance overhead in normal conditions when pids aren't
constrained.
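
Something along these lines, purely as a sketch (the helper names are
made up, not the actual pids controller code):

static int pids_try_charge_sketch(struct pids_cgroup *pids, int num)
{
	if (!charge_if_below_limit(pids, num))
		return 0;

	/*
	 * Over the limit, but recently exited tasks may still hold pids
	 * which are only uncharged after an RCU grace period.  Wait once
	 * and retry before failing the fork.
	 */
	synchronize_rcu();
	return charge_if_below_limit(pids, num);
}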

Thanks.

-- 
tejun


Re: Cgroups "pids" controller does not update "pids.current" count immediately

2018-06-15 Thread Tejun Heo
Hello,

On Fri, Jun 15, 2018 at 07:07:27PM +0300, Ivan Zahariev wrote:
> I understand all concerns and design decisions. However, having
> RLIMIT_NPROC support combined with "cgroups" hierarchy would be very
> handy.
> 
> Does it make sense that you introduce "nproc.current" and
> "nproc.max" metrics which work in the same atomic, real-time way
> like RLIMIT_NPROC? Or make this in a new "nproc" controller?

I'm skeptical for two reasons.

1. That doesn't sound much like a resource control problem but more of
   a policy enforcement problem.

2. and it's difficult to see why such policies would need to be that
   strict.  Where is the requirement coming from?

Thanks.

-- 
tejun


Re: Cgroups "pids" controller does not update "pids.current" count immediately

2018-06-15 Thread Tejun Heo
Hello,

On Fri, Jun 15, 2018 at 05:26:04PM +0300, Ivan Zahariev wrote:
> The standard RLIMIT_NPROC does not suffer from such accounting
> discrepancies at any time.

RLIMIT_NPROC uses a dedicated atomic counter which is updated when the
process is getting reaped; however, that doesn't actually coincide
with the pid being freed.  The base pid ref is put then but there can
be other refs and even after that it has to go through RCU grace
period to be actually freed.

They seem equivalent but serve a bit different purposes.  RLIMIT_NPROC
is primarily about limiting what the user can do and doesn't guarantee
that that actually matches resource (pid here) consumption.  pid
controller's primary role is limiting pid consumption - ie. no matter
what happens the cgroup must not be able to take away more than the
specified number from the available pool, which has to account for the
lazy release and draining refs and stuff.

> The "memory" cgroups controller also does
> not suffer from any discrepancies -- it accounts memory usage in
> real time without any lag on process start or exit. The "tasks" file
> list is also always up-to-date.

The memory controller does the same thing, actually way more
extensively.  It's just less noticeable because people generally don't
try to control at individual page level.

> Is it really technically not possible to make "pids.current" do
> accounting properly like RLIMIT_NPROC does? We were hoping to
> replace RLIMIT_NPROC with the "pids" controller.

It is of course possible but at a cost.  The cost (getting rid of lazy
release optimizations) is just not justifiable for most cases.

Thanks.

-- 
tejun


Re: [PATCH] bdi: Fix another oops in wb_workfn()

2018-06-11 Thread Tejun Heo
Hello,

On Mon, Jun 11, 2018 at 11:12:48AM +0200, Jan Kara wrote:
> However this is wrong and so is the patch. The problem is in
> cgwb_bdi_unregister() which does cgwb_kill() and thus drops bdi's
> reference to wb structures before going through the list of wbs again and
> calling wb_shutdown() on each of them. The writeback structures we are
> accessing at this point can be already freed in principle like:
> 
> CPU1  CPU2
>   cgwb_bdi_unregister()
> cgwb_kill(*slot);
> 
> cgwb_release()
>   queue_work(cgwb_release_wq, &wb->release_work);
> cgwb_release_workfn()
> wb = list_first_entry(&bdi->wb_list, ...)
> spin_unlock_irq(&cgwb_lock);
>   wb_shutdown(wb);
>   ... 
>   kfree_rcu(wb, rcu);
> wb_shutdown(wb); -> oops use-after-free
> 
> I'm not 100% sure how to fix this. wb structures can be at various phases of
> shutdown (or there may be other external references still existing) when we
> enter cgwb_bdi_unregister() so I think adding a way for cgwb_bdi_unregister()
> to wait for standard wb shutdown path to finish is the most robust way.
> What do you think about attached patch Tejun? So far only compile tested...
> 
> Possible problem with it is that now cgwb_bdi_unregister() will wait for
> all wb references to be dropped so it adds some implicit dependencies to
> bdi shutdown path. 

Would something like the following work or am I missing the point
entirely?

Thanks.

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 347cc83..359cacd 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -715,14 +715,19 @@ static void cgwb_bdi_unregister(struct backing_dev_info *bdi)
	WARN_ON(test_bit(WB_registered, &bdi->wb.state));
 
	spin_lock_irq(&cgwb_lock);
-	radix_tree_for_each_slot(slot, &bdi->cgwb_tree, &iter, 0)
-		cgwb_kill(*slot);
+	radix_tree_for_each_slot(slot, &bdi->cgwb_tree, &iter, 0) {
+		struct bdi_writeback *wb = *slot;
+
+		wb_get(wb);
+		cgwb_kill(wb);
+	}
 
	while (!list_empty(&bdi->wb_list)) {
		wb = list_first_entry(&bdi->wb_list, struct bdi_writeback,
				      bdi_node);
		spin_unlock_irq(&cgwb_lock);
		wb_shutdown(wb);
+		wb_put(wb);
		spin_lock_irq(&cgwb_lock);
	}
	spin_unlock_irq(&cgwb_lock);


-- 
tejun


Re: [PATCH] percpu_counter: `percpu_counter_read_positive' should not return negative number

2018-06-06 Thread Tejun Heo
On Wed, Jun 06, 2018 at 09:39:40PM +0200, Mathieu Malaterre wrote:
> Since function `percpu_counter_add' may result in a signed integer overflow
> the result stored in `fbc->count' could be negative. Make sure that
> function `percpu_counter_read_positive' does not return a negative number
> in this case. This will match behavior when CONFIG_SMP=y.
> 
> Detected with CONFIG_UBSAN=y
> 
> [76404.888450] 
> 
> [76404.888477] UBSAN: Undefined behaviour in 
> ../include/linux/percpu_counter.h:136:13
> [76404.888485] signed integer overflow:
> [76404.888490] 9223308017647617321 + 76624449492175 cannot be represented in 
> type 'long long int'
> [76404.888501] CPU: 0 PID: 0 Comm: swapper Not tainted 4.17.0+ #50
> [76404.888506] Call Trace:
> [76404.888523] [dffedd30] [c0478b90] ubsan_epilogue+0x18/0x4c (unreliable)
> [76404.888533] [dffedd40] [c0479530] handle_overflow+0xbc/0xdc
> [76404.888548] [dffeddc0] [c0439044] cfq_completed_request+0x560/0x1234
> [76404.888557] [dffede40] [c03f3fc4] __blk_put_request+0xb0/0x2dc
> [76404.888570] [dffede80] [c05a81d0] scsi_end_request+0x19c/0x344
> [76404.888579] [dffedeb0] [c05a9954] scsi_io_completion+0x4b4/0x854
> [76404.888592] [dffedf10] [c04046b4] blk_done_softirq+0xe4/0x1e0
> [76404.888605] [dffedf60] [c07ec1d4] __do_softirq+0x16c/0x5f0
> [76404.888617] [dffedfd0] [c0065160] irq_exit+0x110/0x1a8
> [76404.888629] [dffedff0] [c001646c] call_do_irq+0x24/0x3c
> [76404.888641] [c0cdbe80] [c0009a2c] do_IRQ+0x98/0x1a0
> [76404.888649] [c0cdbeb0] [c001b93c] ret_from_except+0x0/0x14
> [76404.888659] --- interrupt: 501 at arch_cpu_idle+0x30/0x78
>LR = arch_cpu_idle+0x30/0x78
> [76404.888667] [c0cdbf70] [c0cda000] 0xc0cda000 (unreliable)
> [76404.888679] [c0cdbf80] [c00a3844] do_idle+0xc4/0x158
> [76404.888687] [c0cdbfb0] [c00a3a90] cpu_startup_entry+0x24/0x28
> [76404.888696] [c0cdbfc0] [c097f820] start_kernel+0x47c/0x490
> [76404.888703] [c0cdbff0] [3444] 0x3444

So, the _positive versions are there to deal with small negative reads
coming from percpu propagation delays.  It's not there to protect
against actual counter overflow although it can't detect that on SMP.
It's just outright wrong to report 0 when the counter actually
overflowed, and applying the suggested patch makes a real problem
undetectable.

I think the right thing to do is to actually understand what's going
on (why is a 64bit counter overflowing?) and fix the underlying issue.
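
For reference, the clamp being discussed is roughly the SMP-side read
below; the per-cpu deltas not yet folded into fbc->count can make it dip
slightly below zero even when the true sum is non-negative (simplified
sketch, not the exact header code):

static inline s64 read_positive_sketch(struct percpu_counter *fbc)
{
	s64 ret = READ_ONCE(fbc->count);	/* may lag the per-cpu deltas */

	return ret >= 0 ? ret : 0;		/* small negative means "about 0" */
}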

Thanks.

-- 
tejun


Re: [GIT PULL] cgroup changes for v4.18-rc1

2018-06-06 Thread Tejun Heo
On Wed, Jun 06, 2018 at 08:14:27AM -0700, Linus Torvalds wrote:
> On Wed, Jun 6, 2018 at 8:04 AM Tejun Heo  wrote:
> >
> > The notification implementation isn't super light weight, so the patch
> > ratelimits the notifications by capping the minimum notification
> > interval to 10ms.
> 
> Yeah, I looked at the patch (and the code) to make sense of the explanation.
> 
> My reaction to that was that it might be a better idea to simply not
> notify if a notification was already pending, rather than have the
> timeout. Or perhaps in addition to. The path _to_ the fsnotify code
> looked quite messy, though.

Yeah, getting to the inode is currently quite involved.  Some of the
complications come from the fact that a kernfs node can be associated
with multiple inodes for sysfs namespace support.

The right thing to do could be linking the inodes to the kernfs node
and protect it with a spinlock so that we don't have to punt and walk
them directly from notification path.

For now, I think the 10ms thing is an acceptable workaround but this
likely needs more work in the future.

Thanks.

-- 
tejun


Re: [GIT PULL] cgroup changes for v4.18-rc1

2018-06-06 Thread Tejun Heo
Hello, Linus.

On Tue, Jun 05, 2018 at 05:13:55PM -0700, Linus Torvalds wrote:
> On Tue, Jun 5, 2018 at 12:22 PM Tejun Heo  wrote:
> >
> > * cgroup uses file modified events to notify certain events.  A rate
> >   limiting mechanism is added.
> 
> This "explanation" didn't really parse for me at all.
>
> I edited the merge message to something that I think is correct and
> made more sense to me.

Heh, yeah, yours reads a lot better.  Sorry about that.  The
background is that the mechanism was first added to generate
notifications on events like cgroup becoming empty, which is very low
frequency.  Since then, we've expanded its usage to cover things like
memory.high breached events which can be a lot higher frequency.

The notification implementation isn't super light weight, so the patch
ratelimits the notifications by capping the minimum notification
interval to 10ms.
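
The gist of the ratelimit, heavily simplified (field names are
approximate and the real code keeps more state, and may defer rather
than drop a notification):

void cgroup_file_notify_sketch(struct cgroup_file *cfile)
{
	unsigned long now = jiffies;

	/* cap notifications to at most one per 10ms per file */
	if (time_before(now, cfile->notified_at + msecs_to_jiffies(10)))
		return;

	cfile->notified_at = now;
	kernfs_notify(cfile->kn);
}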

Thanks.

-- 
tejun


[GIT PULL] workqueue changes for v4.18-rc1

2018-06-05 Thread Tejun Heo
Hello, Linus.

* Patches to make kworkers report the workqueue they are executing or
  have executed most recently in /proc/PID/comm.

* CONFIG_SMP shuffle to move stuff which isn't necessary for UP builds
  inside CONFIG_SMP.

Thanks.

The following changes since commit e6506eb241871d68647c53cb6d0a16299550ae97:

  Merge tag 'trace-v4.17-rc4-2' of 
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace (2018-05-16 
16:45:23 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git for-4.18

for you to fetch changes up to 66448bc274cadedb71fda7d914e7c29d8dead217:

  workqueue: move function definitions within CONFIG_SMP block (2018-05-23 
11:16:58 -0700)


Mathieu Malaterre (1):
  workqueue: move function definitions within CONFIG_SMP block

Tejun Heo (6):
  workqueue: Replace pool->attach_mutex with global wq_pool_attach_mutex
  workqueue: Make worker_attach/detach_pool() update worker->pool
  workqueue: Set worker->desc to workqueue name by default
  proc: Consolidate task->comm formatting into proc_task_name()
  workqueue: Show the latest workqueue name in /proc/PID/{comm,stat,status}
  workqueue: Make sure struct worker is accessible for wq_worker_comm()

 fs/proc/array.c |  33 ++-
 fs/proc/base.c  |   5 +-
 fs/proc/internal.h  |   2 +
 include/linux/workqueue.h   |   1 +
 kernel/workqueue.c  | 139 +---
 kernel/workqueue_internal.h |   3 +-
 6 files changed, 119 insertions(+), 64 deletions(-)

-- 
tejun


[GIT PULL] cgroup changes for v4.18-rc1

2018-06-05 Thread Tejun Heo
Hello, Linus.

cgroup changes for v4.18.

* For cpustat, cgroup has a percpu hierarchical stat mechanism which
  propagates up the hierarchy lazily.  This pull request contains
  commits to factor out and generalize the mechanism so that it can be
  used for other cgroup stats too.  The original intention was to
  update memcg stats to use it but memcg went for a different
  approach, so still the only user is cpustat.  The factoring out and
  generalization still make sense and it's likely that this can be
  used for other purposes in the future.

* cgroup uses file modified events to notify certain events.  A rate
  limiting mechanism is added.

* Other misc changes.

Thanks.

The following changes since commit fe03a7594d86e0754f05e604cd803a6a9aae3c1c:

  Merge tag 'acpi-4.17-rc3' of 
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm (2018-04-26 
11:06:36 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-4.18

for you to fetch changes up to d8742e22902186e30c346b1ba881cb52942ae3e4:

  cgroup: css_set_lock should nest inside tasklist_lock (2018-05-23 11:04:54 
-0700)


Andy Shevchenko (1):
  rdmacg: Convert to use match_string() helper

Tejun Heo (12):
  cgroup: Explicitly remove core interface files
  cgroup: Limit event generation frequency
  cgroup: Rename kernel/cgroup/stat.c to kernel/cgroup/rstat.c
  cgroup: Rename stat to rstat
  cgroup: Distinguish base resource stat implementation from rstat
  cgroup: Reorganize kernel/cgroup/rstat.c
  cgroup: Factor out and expose cgroup_rstat_*() interface functions
  cgroup: Replace cgroup_rstat_mutex with a spinlock
  cgroup: Add cgroup_subsys->css_rstat_flush()
  cgroup: Add memory barriers to plug cgroup_rstat_updated() race window
  cgroup: Make cgroup_rstat_updated() ready for root cgroup usage
  cgroup: css_set_lock should nest inside tasklist_lock

 include/linux/cgroup-defs.h |  52 +++--
 include/linux/cgroup.h  |  12 +-
 kernel/cgroup/Makefile  |   2 +-
 kernel/cgroup/cgroup-internal.h |  11 +-
 kernel/cgroup/cgroup.c  | 105 +++---
 kernel/cgroup/rdma.c|  35 ++--
 kernel/cgroup/rstat.c   | 416 
 kernel/cgroup/stat.c| 338 
 8 files changed, 554 insertions(+), 417 deletions(-)
 create mode 100644 kernel/cgroup/rstat.c
 delete mode 100644 kernel/cgroup/stat.c

-- 
tejun


[GIT PULL 2/2] libata changes for v4.18-rc1

2018-06-05 Thread Tejun Heo
Hello, Linus.

This is the second part of libata pull request.

* libata has always limited the maximum queue depth to 31, with one tag
  set aside mostly for historical reasons.  This didn't use to make much
  difference but Jens found out that modern hard drives can actually
  perform measurably better with the extra queue slot.  Jens updated
  libata core so that it can make use of the full queue depth of 32.

* Damien updated command retry logic in error handling so that it
  doesn't unnecessarily retry when upper layer (SCSI) is gonna handle
  them.

* A couple misc changes.

Thanks.

The following changes since commit 75bc37fefc4471e718ba8e651aa74673d4e0a9eb:

  Linux 4.17-rc4 (2018-05-06 16:57:38 -1000)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata.git for-4.18

for you to fetch changes up to 88e10092f6a623b88808f782b6637162b5b658fb:

  sata_fsl: use the right type for tag bitshift (2018-05-14 08:14:21 -0700)


Andy Shevchenko (1):
  ata: hpt37x: Convert to use match_string() helper

Christoph Hellwig (1):
  sata_nv: don't use block layer bounce buffer

Damien Le Moal (5):
  libata: Fix comment typo in ata_eh_analyze_tf()
  libata: Fix ata_err_string()
  libata: Make ata_dev_set_mode() less verbose
  libata: Honor RQF_QUIET flag
  libata: Fix command retry decision

Jens Axboe (10):
  libata: introduce notion of separate hardware tags
  libata: convert core and drivers to ->hw_tag usage
  libata: bump ->qc_active to a 64-bit type
  libata: use ata_tag_internal() consistently
  libata: remove assumption that ATA_MAX_QUEUE - 1 is the max
  sata_nv: set host can_queue count appropriately
  libata: add extra internal command
  libata: don't clamp queue depth to ATA_MAX_QUEUE - 1
  ahci: enable full queue depth of 32
  sata_fsl: use the right type for tag bitshift

 drivers/ata/acard-ahci.c |  4 +-
 drivers/ata/ahci.h   |  2 +-
 drivers/ata/libahci.c|  8 ++--
 drivers/ata/libata-core.c| 61 ---
 drivers/ata/libata-eh.c  | 56 -
 drivers/ata/libata-scsi.c| 19 +
 drivers/ata/pata_hpt37x.c| 13 +++---
 drivers/ata/sata_dwc_460ex.c | 14 +++
 drivers/ata/sata_fsl.c   | 10 ++---
 drivers/ata/sata_mv.c| 26 ++--
 drivers/ata/sata_nv.c| 98 +++-
 drivers/ata/sata_sil24.c |  6 +--
 include/linux/libata.h   | 22 +-
 13 files changed, 177 insertions(+), 162 deletions(-)

--
tejun


[GIT PULL 1/2] libata changes for v4.18-rc1

2018-06-05 Thread Tejun Heo
Hello, Linus.

These two are fixes which missed v4.17.  I tried to merge these into
libata/for-4.18 but couldn't make "git request-pull" not generate a
huge spurious diffstat without merging v4.17 in the middle, so I'm
sending them out as two separate pull requests.  Pulling into master
should be clean for both.

One is to remove an incorrect power management blacklist entry and the
other to fix a cdb buffer overrun which has been there for a very long
time.

The following changes since commit 4544e403eb25552aed7f0ee181a7a506b8800403:

  ahci: Add PCI ID for Cannon Lake PCH-LP AHCI (2018-05-24 07:03:32 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata.git for-4.17-fixes

for you to fetch changes up to 2cfce3a86b64b53f0a70e92a6a659c720c319b45:

  libata: Drop SanDisk SD7UB3Q*G1001 NOLPM quirk (2018-05-31 08:45:37 -0700)


Dan Carpenter (1):
  libata: zpodd: small read overflow in eject_tray()

Hans de Goede (1):
  libata: Drop SanDisk SD7UB3Q*G1001 NOLPM quirk

 drivers/ata/libata-core.c  | 3 ---
 drivers/ata/libata-zpodd.c | 2 +-
 2 files changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index 346b163..9bfd2f7 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -4557,9 +4557,6 @@ static const struct ata_blacklist_entry ata_device_blacklist [] = {
{ "SAMSUNG MZMPC128HBFU-000MV", "CXM14M1Q", ATA_HORKAGE_NOLPM, },
{ "SAMSUNG SSD PM830 mSATA *",  "CXM13D1Q", ATA_HORKAGE_NOLPM, },
 
-   /* Sandisk devices which are known to not handle LPM well */
-   { "SanDisk SD7UB3Q*G1001",  NULL,   ATA_HORKAGE_NOLPM, },
-
/* devices that don't properly handle queued TRIM commands */
{ "Micron_M500IT_*","MU01", ATA_HORKAGE_NO_NCQ_TRIM |
ATA_HORKAGE_ZERO_AFTER_TRIM, },
diff --git a/drivers/ata/libata-zpodd.c b/drivers/ata/libata-zpodd.c
index de4ddd0..b3ed8f9 100644
--- a/drivers/ata/libata-zpodd.c
+++ b/drivers/ata/libata-zpodd.c
@@ -35,7 +35,7 @@ struct zpodd {
 static int eject_tray(struct ata_device *dev)
 {
struct ata_taskfile tf;
-   static const char cdb[] = {  GPCMD_START_STOP_UNIT,
+   static const char cdb[ATAPI_CDB_LEN] = {  GPCMD_START_STOP_UNIT,
0, 0, 0,
0x02, /* LoEj */
0, 0, 0, 0, 0, 0, 0,
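
The cdb hunk above deserves a closer look.  A rough userspace
illustration of the overrun it fixes (simplified: 16 stands in for
ATAPI_CDB_LEN and 0x1b for GPCMD_START_STOP_UNIT; this is not the
driver code itself):

#include <stdio.h>
#include <string.h>

#define CDB_LEN 16			/* stand-in for ATAPI_CDB_LEN */

static const char cdb_short[] = {	/* implicit size: only 12 bytes */
	0x1b, 0, 0, 0, 0x02, 0, 0, 0, 0, 0, 0, 0
};
static const char cdb_full[CDB_LEN] = {	/* 16 bytes, tail zero-filled */
	0x1b, 0, 0, 0, 0x02, 0, 0, 0, 0, 0, 0, 0
};

int main(void)
{
	char buf[CDB_LEN];

	printf("short: %zu bytes, full: %zu bytes\n",
	       sizeof(cdb_short), sizeof(cdb_full));

	/* memcpy(buf, cdb_short, CDB_LEN);  <-- reads 4 bytes past the end */
	memcpy(buf, cdb_full, CDB_LEN);		/* well-defined */
	return 0;
}
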
-- 
tejun


Re: [PATCH] Smack: Fix memory leak in smack_inode_getsecctx

2018-06-04 Thread Tejun Heo
On Mon, Jun 04, 2018 at 02:01:34PM -0700, Casey Schaufler wrote:
> On 6/1/2018 10:45 AM, Casey Schaufler wrote:
> > Fix memory leak in smack_inode_getsecctx
> >
> > The implementation of smack_inode_getsecctx() made
> > incorrect assumptions about how Smack presents a security
> > context. Smack does not need to allocate memory to support
> > security contexts, so "releasing" a Smack context is a no-op.
> > The code made an unnecessary copy and returned that as a
> > context, which was never freed. The revised implementation
> > returns the context correctly.
> >
> > Signed-off-by: Casey Schaufler 
> 
> Tejun, does this pass your tests?

Oh, I'm not the one who reported.  Chandan?

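For anyone following along, the leak pattern described in the quoted
changelog is easy to model.  A simplified userspace sketch of the
before/after behaviour (illustrative only, not the actual Smack code):

/* Smack labels are long-lived; "releasing" a context must stay a no-op. */
#include <stdlib.h>
#include <string.h>

static const char *inode_label = "System";	/* long-lived label */

/* buggy shape: hands out an allocated copy that release() never frees */
static int getsecctx_buggy(void **ctx, unsigned int *len)
{
	char *copy = strdup(inode_label);

	if (!copy)
		return -1;
	*ctx = copy;			/* leaked on every call */
	*len = strlen(copy);
	return 0;
}

/* fixed shape: just point at the label that already exists */
static int getsecctx_fixed(void **ctx, unsigned int *len)
{
	*ctx = (void *)inode_label;
	*len = strlen(inode_label);
	return 0;
}

static void release_secctx(void *ctx, unsigned int len)
{
	/* intentionally empty: nothing is allocated in the fixed variant */
	(void)ctx;
	(void)len;
}

int main(void)
{
	void *ctx;
	unsigned int len;

	getsecctx_buggy(&ctx, &len);	/* one allocation leaked per call */
	release_secctx(ctx, len);

	getsecctx_fixed(&ctx, &len);	/* nothing to leak */
	release_secctx(ctx, len);
	return 0;
}
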
-- 
tejun


Re: [RFC][PATCH 1/2] memcg: Ensure every task that uses an mm is in the same memory cgroup

2018-06-04 Thread Tejun Heo
Hello, Michal.

On Mon, Jun 04, 2018 at 03:01:19PM +0200, Michal Hocko wrote:
> > On Fri, Jun 01, 2018 at 01:11:59PM -0500, Eric W. Biederman wrote:
> > > Widening the definition of a process sounds good.  The memory control
> > > group code would still need a way to forbid these in cgroup v1 mode,
> > > when someone uses the task file.
> > 
> > Yeap, you're right.  We'll need memcg's can_attach rejecting for v1.
> 
> Do we really need? I mean, do we know about any existing usecase that
> would need this weird threading concept and depend on memory migration
> which doesn't really work?

I thought the requirement is from the ->owner change so that the
association doesn't become 1:N, right?

Thanks.

-- 
tejun


Re: [RFC][PATCH 1/2] memcg: Ensure every task that uses an mm is in the same memory cgroup

2018-06-01 Thread Tejun Heo
Hello,

On Fri, Jun 01, 2018 at 01:11:59PM -0500, Eric W. Biederman wrote:
> Widening the definition of a process sounds good.  The memory control
> group code would still need a way to forbid these in cgroup v1 mode,
> when someone uses the task file.

Yeap, you're right.  We'll need memcg's can_attach rejecting for v1.

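Concretely, such a hook would walk the tasks being migrated and refuse
v1 moves that would split an mm's users across memory cgroups.  A
deliberately conservative, hypothetical sketch of that idea (assumed
to live in mm/memcontrol.c; not Eric's patch, and the exact rule is
debatable):

static int memcg_v1_can_attach(struct cgroup_taskset *tset)
{
	struct cgroup_subsys_state *css;
	struct task_struct *task;
	int ret = 0;

	/* v2 migrates whole processes, so there is nothing to reject */
	if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
		return 0;

	cgroup_taskset_for_each(task, css, tset) {
		struct mm_struct *mm = get_task_mm(task);

		if (!mm)
			continue;
		/*
		 * Conservative rule: a task that is not the canonical
		 * owner of its mm is sharing it, so refuse to move it
		 * on its own via the v1 "tasks" interface.
		 */
		rcu_read_lock();
		if (mm->owner != task)
			ret = -EINVAL;
		rcu_read_unlock();
		mmput(mm);
		if (ret)
			break;
	}
	return ret;
}
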
> Using widening instead of denying should reduce the risk of introducing
> a regression.
> 
> The only reason I support the crazy case in my earlier patch is so that
> we can have this discussion and in case we do cause a regression with
> this change the previous algorithmic cleanup won't have to be reverted
> as well.

Yeah, sure thing.

Thanks a lot.

-- 
tejun

