Re: [EXT] Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to housekeeping CPUs

2021-01-29 Thread Alex Belits

On 1/29/21 06:23, Marcelo Tosatti wrote:

On Fri, Jan 29, 2021 at 08:55:20AM -0500, Nitesh Narayan Lal wrote:


On 1/28/21 3:01 PM, Thomas Gleixner wrote:

On Thu, Jan 28 2021 at 13:59, Marcelo Tosatti wrote:

The whole pile wants to be reverted. It's simply broken in several ways.

I was asking for your comments on interaction with CPU hotplug :-)

Which I answered in a separate mail :)


So housekeeping_cpumask has multiple meanings. In this case:

...


So as long as the meaning of the flags is respected, this seems
alright.

Yes. Stuff like the managed interrupts preference for housekeeping CPUs
when an affinity mask spans housekeeping and isolated CPUs is perfectly
fine. It's well thought out and has no limitations.


Nitesh, is there anything preventing this from being fixed
in userspace? (as Thomas suggested previously).

Everything which is not managed can be steered by user space.

Thanks,

 tglx



So, I think the conclusion here would be to revert the change made in
cpumask_local_spread via the patch:
  - lib: Restrict cpumask_local_spread to housekeeping CPUs

Also, a similar case can be made for the rps patch that went in with
this:
 - net: Restrict receive packets queuing to housekeeping CPUs


Yes, this is the userspace solution:

https://lkml.org/lkml/2021/1/22/815

Should have a kernel document with this info and examples
(the network queue configuration as well). Will
send something.


 + net: accept an empty mask in /sys/class/net/*/queues/rx-*/rps_cpus

I am not sure about the PCI patch, as I don't think we can control that
from userspace, or maybe I am wrong?


You mean "lib: Restrict cpumask_local_spread to housekeeping CPUs" ?



If we want to do it from userspace, we should have something that 
triggers it in userspace. Should we use udev for this purpose?


--
Alex


Re: [EXT] Re: [PATCH v5 7/9] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu()

2020-12-03 Thread Alex Belits

On Wed, 2020-12-02 at 14:20 +0000, Mark Rutland wrote:
> On Mon, Nov 23, 2020 at 05:58:22PM +0000, Alex Belits wrote:
> > From: Yuri Norov 
> > 
> > For nohz_full CPUs the desirable behavior is to receive interrupts
> > generated by tick_nohz_full_kick_cpu(). But for hard isolation it's
> > obviously not desirable because it breaks isolation.
> > 
> > This patch adds a check for it.
> > 
> > Signed-off-by: Yuri Norov 
> > [abel...@marvell.com: updated, only exclude CPUs running isolated
> > tasks]
> > Signed-off-by: Alex Belits 
> > ---
> >  kernel/time/tick-sched.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> > 
> > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > index a213952541db..6c8679e200f0 100644
> > --- a/kernel/time/tick-sched.c
> > +++ b/kernel/time/tick-sched.c
> > @@ -20,6 +20,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include 
> >  #include 
> >  #include 
> > @@ -268,7 +269,8 @@ static void tick_nohz_full_kick(void)
> >   */
> >  void tick_nohz_full_kick_cpu(int cpu)
> >  {
> > -   if (!tick_nohz_full_cpu(cpu))
> > +   smp_rmb();
> 
> What does this barrier pair with? The commit message doesn't mention
> it,
> and it's not clear in-context.

With barriers in task_isolation_kernel_enter()
and task_isolation_exit_to_user_mode().
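
For reference, a minimal sketch of the intended pairing, with made-up
flag and function names (the real series keeps this state in the
per-CPU ll_isol_flags variable; this is only meant to show which
barrier is supposed to pair with which, not the actual patch code):

#include <linux/types.h>
#include <linux/percpu.h>
#include <asm/barrier.h>

#define LL_ISOL_ISOLATED        0x1UL   /* hypothetical bit layout */

static DEFINE_PER_CPU(unsigned long, ll_isol_flags);

/*
 * Writer side: the isolated CPU itself, early in kernel entry (and,
 * with the opposite polarity, late in exit to user mode).
 */
static inline void sketch_isolation_break(void)
{
        this_cpu_and(ll_isol_flags, ~LL_ISOL_ISOLATED);
        smp_mb();       /* make the flag change visible before any later work */
}

/*
 * Reader side: any CPU deciding whether it may skip kicking @cpu.
 */
static inline bool sketch_task_isolation_on_cpu(int cpu)
{
        smp_rmb();      /* intended to pair with the smp_mb() above */
        return per_cpu(ll_isol_flags, cpu) & LL_ISOL_ISOLATED;
}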

-- 
Alex


Re: [EXT] Re: [PATCH v5 5/9] task_isolation: Add driver-specific hooks

2020-12-03 Thread Alex Belits

On Wed, 2020-12-02 at 14:18 +0000, Mark Rutland wrote:
> On Mon, Nov 23, 2020 at 05:57:42PM +0000, Alex Belits wrote:
> > Some drivers don't call functions that call
> > task_isolation_kernel_enter() in interrupt handlers. Call it
> > directly.
> 
> I don't think putting this in drivers is the right approach. IIUC we
> only need to track user<->kernel transitions, and we can do that
> within
> the architectural entry code before we ever reach irqchip code. I
> suspect the current approach is an artifact of that being difficult
> in
> the old structure of the arch code; recent rework should address
> that,
> and we can restructure things further in future.

I agree completely. This patch only covers irqchip drivers with unusual
entry procedures.

-- 
Alex


Re: [EXT] Re: [PATCH v5 0/9] "Task_isolation" mode

2020-12-03 Thread Alex Belits

> On Wed, 2020-12-02 at 14:02 +0000, Mark Rutland wrote:
> On Tue, Nov 24, 2020 at 05:40:49PM +0000, Alex Belits wrote:
> > 
> > > I am having problems applying the patchset to today's linux-next.
> > > 
> > > Which kernel should I be using ?
> > 
> > The patches are against Linus' tree, in particular, commit
> > a349e4c659609fd20e4beea89e5c4a4038e33a95
> 
> Is there any reason to base on that commit in particular?

No specific reason for that particular commit.

> Generally it's preferred that a series is based on a tag (so either a
> release or an -rc kernel), and that the cover letter explains what
> the
> base is. If you can do that in future it'll make the series much
> easier
> to work with.

Ok.

-- 
Alex


Re: [EXT] Re: [PATCH v5 6/9] task_isolation: arch/arm64: enable task isolation functionality

2020-12-03 Thread Alex Belits

On Wed, 2020-12-02 at 13:59 +0000, Mark Rutland wrote:
> Hi Alex,
> 
> On Mon, Nov 23, 2020 at 05:58:06PM +0000, Alex Belits wrote:
> > In do_notify_resume(), call
> > task_isolation_before_pending_work_check()
> > first, to report isolation breaking, then after handling all
> > pending
> > work, call task_isolation_start() for TIF_TASK_ISOLATION tasks.
> > 
> > Add _TIF_TASK_ISOLATION to _TIF_WORK_MASK, and _TIF_SYSCALL_WORK,
> > define local NOTIFY_RESUME_LOOP_FLAGS to check in the loop, since
> > we
> > don't clear _TIF_TASK_ISOLATION in the loop.
> > 
> > Early kernel entry code calls task_isolation_kernel_enter(). In
> > particular:
> > 
> > Vectors:
> > el1_sync -> el1_sync_handler() -> task_isolation_kernel_enter()
> > el1_irq -> asm_nmi_enter(), handle_arch_irq()
> > el1_error -> do_serror()
> > el0_sync -> el0_sync_handler()
> > el0_irq -> handle_arch_irq()
> > el0_error -> do_serror()
> > el0_sync_compat -> el0_sync_compat_handler()
> > el0_irq_compat -> handle_arch_irq()
> > el0_error_compat -> do_serror()
> > 
> > SDEI entry:
> > __sdei_asm_handler -> __sdei_handler() -> nmi_enter()
> 
> As a heads-up, the arm64 entry code is changing, as we found that our
> lockdep, RCU, and context-tracking management wasn't quite right. I
> have
> a series of patches:
> 
> https://lore.kernel.org/r/20201130115950.22492-1-mark.rutl...@arm.com
> 
> ... which are queued in the arm64 for-next/fixes branch. I intend to
> have some further rework ready for the next cycle.

Thanks!

>  I'd appreciate if you
> could Cc me on any patches altering the arm64 entry code, as I have a
> vested interest.

I will do that.

> 
> That was quite obviously broken if PROVE_LOCKING and NO_HZ_FULL were
> chosen and context tracking was in use (e.g. with
> CONTEXT_TRACKING_FORCE),

I am not yet sure about TRACE_IRQFLAGS, however NO_HZ_FULL and
CONTEXT_TRACKING have to be enabled for it to do anything.

I will check it with PROVE_LOCKING and your patches.

Entry code only adds an inline function that, if task isolation is
enabled, uses raw_local_irq_save() / raw_local_irq_restore(), low-level
operations, and accesses per-CPU variables by offset, so at the very
least it should not add any problems. Even raw_local_irq_save() /
raw_local_irq_restore() probably should be removed, however I wanted to
have something that can be safely called if for whatever reason
interrupts were enabled before the kernel was fully entered.
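
Roughly, the hook has the following shape (a sketch only, reusing the
made-up LL_ISOL_ISOLATED bit from the earlier sketch, not the code from
this patch):

static inline void sketch_kernel_entry_hook(void)
{
        unsigned long flags;

        /* Defensive: safe even if interrupts were somehow still enabled. */
        raw_local_irq_save(flags);
        if (raw_cpu_read(ll_isol_flags) & LL_ISOL_ISOLATED) {
                /* Leave isolated state; only per-CPU data is touched. */
                raw_cpu_and(ll_isol_flags, ~LL_ISOL_ISOLATED);
                smp_mb();       /* order the flag update against later kernel work */
        }
        raw_local_irq_restore(flags);
}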

>  so I'm assuming that this series has not been
> tested in that configuration. What sort of testing has this seen?
> 

On various available arm64 hardware, with the following options enabled:

CONFIG_TASK_ISOLATION
CONFIG_NO_HZ_FULL
CONFIG_HIGH_RES_TIMERS

and the following options disabled:

CONFIG_HZ_PERIODIC
CONFIG_NO_HZ_IDLE
CONFIG_NO_HZ

> It would be very helpful for the next posting if you could provide
> any
> instructions on how to test this series (e.g. with pointers to any
> test
> suite that you have), since it's very easy to introduce subtle
> breakage
> in this area without realising it.

I will. Currently libtmc ( https://github.com/abelits/libtmc ) contains
all userspace code used for testing, however I should document the
testing procedures.
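
In the meantime, a minimal illustration of the userspace side, based
only on the prctl() interface described in patch 3/9 of this series
(the PR_TASK_ISOLATION* constants come from the patched uapi headers;
the CPU number, retry policy and workload below are placeholders, not
a recommendation):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/prctl.h>  /* PR_TASK_ISOLATION* from the patched kernel headers */

int main(void)
{
        cpu_set_t set;

        /* Pin to a CPU from the isolcpus=nohz,domain,... set beforehand. */
        CPU_ZERO(&set);
        CPU_SET(2, &set);       /* CPU 2 is an arbitrary example */
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
                perror("sched_setaffinity");
                return 1;
        }

        /* The prctl() may fail if quiescing did not succeed; retry. */
        while (prctl(PR_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE, 0, 0, 0) != 0)
                usleep(1000);

        /* Isolated section: avoid syscalls, page faults and other kernel entries. */
        for (volatile unsigned long i = 0; i < 100000000UL; i++)
                ;       /* placeholder for the latency-critical loop */

        /* exit()/exit_group() end isolation without a pointless signal. */
        return 0;
}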

> 
> > Functions called from there:
> > asm_nmi_enter() -> nmi_enter() -> task_isolation_kernel_enter()
> > asm_nmi_exit() -> nmi_exit() -> task_isolation_kernel_return()
> > 
> > Handlers:
> > do_serror() -> nmi_enter() -> task_isolation_kernel_enter()
> >   or task_isolation_kernel_enter()
> > el1_sync_handler() -> task_isolation_kernel_enter()
> > el0_sync_handler() -> task_isolation_kernel_enter()
> > el0_sync_compat_handler() -> task_isolation_kernel_enter()
> > 
> > handle_arch_irq() is irqchip-specific, most call
> > handle_domain_irq()
> > There is a separate patch for irqchips that do not follow this
> > rule.
> > 
> > handle_domain_irq() -> task_isolation_kernel_enter()
> > do_handle_IPI() -> task_isolation_kernel_enter() (may be redundant)
> > nmi_enter() -> task_isolation_kernel_enter()
> 
> The IRQ cases look very odd to me. With the rework I've just done for
> arm64, we'll do the regular context tracking accounting before we
> ever
> get into handle_domain_irq() or similar, so I suspect that's not
> necessary at all?

The goal is to call task_isolation_kernel_enter() before anything that
depends on a CPU state, including pipeline, that could remain un-
synchronized when the rest of the kernel was send

Re: [EXT] Re: [PATCH v5 9/9] task_isolation: kick_all_cpus_sync: don't kick isolated cpus

2020-11-24 Thread Alex Belits

On Tue, 2020-11-24 at 00:21 +0100, Frederic Weisbecker wrote:
> On Mon, Nov 23, 2020 at 10:39:34PM +0000, Alex Belits wrote:
> > 
> > This is different from timers. The original design was based on the
> > idea that every CPU should be able to enter the kernel at any time
> > and run
> > kernel code with no additional preparation. Then the only solution
> > is
> > to always do full broadcast and require all CPUs to process it.
> > 
> > What I am trying to introduce is the idea of a CPU that is not
> > likely to run kernel code anytime soon, and can afford to go
> > through an additional synchronization procedure on the next entry
> > into the kernel. The
> > synchronization is not skipped, it simply happens later, early in
> > kernel entry code.
> 
> Ah I see, this is ordered that way:
> 
> ll_isol_flags = ISOLATED
> 
>  CPU 0                                 CPU 1
> --                                     -
>                                        // kernel entry
> data_to_sync = 1                       ll_isol_flags = ISOLATED_BROKEN
> smp_mb()                               smp_mb()
> if ll_isol_flags(CPU 1) == ISOLATED    READ data_to_sync
>      smp_call(CPU 1)
> 

The check for ll_isol_flags(CPU 1) is reversed, and it's a bit more
complex. In terms of scenarios, on entry from isolation the following
can happen:

1. Kernel entry happens simultaneously with operation that requires
synchronization, kernel entry processing happens before the check for
isolation on the sender side:

ll_isol_flags(CPU 1) = ISOLATED

 CPU 0                                  CPU 1
--                                      -
                                        // kernel entry
                                        if (ll_isol_flags == ISOLATED) {
                                            ll_isol_flags = ISOLATED_BROKEN
data_to_sync = 1                            smp_mb()
                                            // data_to_sync undetermined
smp_mb()                                }
// ll_isol_flags(CPU 1) updated
if ll_isol_flags(CPU 1) != ISOLATED
                                        // interrupts enabled
    smp_call(CPU 1)                     // kernel entry again
                                        if (ll_isol_flags == ISOLATED)
                                            // nothing happens
                                        // explicit or implied barriers
                                        // data_to_sync updated
                                        // kernel exit
// CPU 0 assumes CPU 1 will see         READ data_to_sync
// data_to_sync = 1 when in kernel

2. Kernel entry happens simultaneously with operation that requires
synchronization, kernel entry processing happens after the check for
isolation on the sender side:

ll_isol_flags(CPU 1) = ISOLATED

 CPU 0                                  CPU 1
--                                      -
data_to_sync = 1                        // kernel entry
smp_mb()                                // data_to_sync undetermined
                                        // should not access data_to_sync here
                                        if (ll_isol_flags == ISOLATED) {
                                            ll_isol_flags = ISOLATED_BROKEN
// ll_isol_flags(CPU 1) undetermined        smp_mb()
                                            // data_to_sync updated
if ll_isol_flags(CPU 1) != ISOLATED     }
    // possibly nothing happens
// CPU 0 assumes CPU 1 will see         READ data_to_sync
// data_to_sync = 1 when in kernel

3. Kernel entry processing completed before the check for isolation on
the sender side:

ll_isol_flags(CPU 1) = ISOLATED

 CPU 0                                  CPU 1
--                                      -
                                        // kernel entry
                                        if (ll_isol_flags == ISOLATED) {
                                            ll_isol_flags = ISOLATED_BROKEN
                                            smp_mb()
                                        }
                                        // interrupts are enabled at some
data_to_sync = 1                        // point here, data_to_sync value
smp_mb()                                // is undetermined, CPU 0 makes no
// ll_isol_flags(CPU 1) updated         // assumptions about it
if ll_isol_flags(CPU 1) != ISOLATED     //
    smp_call(CPU 1)                     // kernel entry again

Re: [EXT] Re: [PATCH v5 0/9] "Task_isolation" mode

2020-11-24 Thread Alex Belits

On Tue, 2020-11-24 at 08:36 -0800, Tom Rix wrote:
> 
> On 11/23/20 9:42 AM, Alex Belits wrote:
> > This is an update of task isolation work that was originally done
> > by
> > Chris Metcalf  and maintained by him until
> > November 2017. It is adapted to the current kernel and cleaned up
> > to
> > implement its functionality in a more complete and cleaner manner.
> 
> I am having problems applying the patchset to today's linux-next.
> 
> Which kernel should I be using ?

The patches are against Linus' tree, in particular, commit
a349e4c659609fd20e4beea89e5c4a4038e33a95

-- 
Alex


Re: [EXT] Re: [PATCH v5 9/9] task_isolation: kick_all_cpus_sync: don't kick isolated cpus

2020-11-23 Thread Alex Belits

On Mon, 2020-11-23 at 23:29 +0100, Frederic Weisbecker wrote:
> On Mon, Nov 23, 2020 at 05:58:42PM +0000, Alex Belits wrote:
> > From: Yuri Norov 
> > 
> > Make sure that kick_all_cpus_sync() does not call CPUs that are
> > running
> > isolated tasks.
> > 
> > Signed-off-by: Yuri Norov 
> > [abel...@marvell.com: use safe task_isolation_cpumask()
> > implementation]
> > Signed-off-by: Alex Belits 
> > ---
> >  kernel/smp.c | 14 +-
> >  1 file changed, 13 insertions(+), 1 deletion(-)
> > 
> > diff --git a/kernel/smp.c b/kernel/smp.c
> > index 4d17501433be..b2faecf58ed0 100644
> > --- a/kernel/smp.c
> > +++ b/kernel/smp.c
> > @@ -932,9 +932,21 @@ static void do_nothing(void *unused)
> >   */
> >  void kick_all_cpus_sync(void)
> >  {
> > +   struct cpumask mask;
> > +
> > /* Make sure the change is visible before we kick the cpus */
> > smp_mb();
> > -   smp_call_function(do_nothing, NULL, 1);
> > +
> > +   preempt_disable();
> > +#ifdef CONFIG_TASK_ISOLATION
> > +   cpumask_clear(&mask);
> > +   task_isolation_cpumask(&mask);
> > +   cpumask_complement(&mask, &mask);
> > +#else
> > +   cpumask_setall(&mask);
> > +#endif
> > +   smp_call_function_many(&mask, do_nothing, NULL, 1);
> > +   preempt_enable();
> 
> Same comment about IPIs here.

This is different from timers. The original design was based on the
idea that every CPU should be able to enter the kernel at any time and
run kernel code with no additional preparation. Then the only solution
is to always do a full broadcast and require all CPUs to process it.

What I am trying to introduce is the idea of a CPU that is not likely
to run kernel code anytime soon, and can afford to go through an
additional synchronization procedure on the next entry into the
kernel. The synchronization is not skipped, it simply happens later,
early in kernel entry code.

-- 
Alex


Re: [PATCH v5 7/9] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu()

2020-11-23 Thread Alex Belits

On Mon, 2020-11-23 at 23:13 +0100, Frederic Weisbecker wrote:
> Hi Alex,
> 
> On Mon, Nov 23, 2020 at 05:58:22PM +0000, Alex Belits wrote:
> > From: Yuri Norov 
> > 
> > For nohz_full CPUs the desirable behavior is to receive interrupts
> > generated by tick_nohz_full_kick_cpu(). But for hard isolation it's
> > obviously not desirable because it breaks isolation.
> > 
> > This patch adds a check for it.
> > 
> > Signed-off-by: Yuri Norov 
> > [abel...@marvell.com: updated, only exclude CPUs running isolated
> > tasks]
> > Signed-off-by: Alex Belits 
> > ---
> >  kernel/time/tick-sched.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> > 
> > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > index a213952541db..6c8679e200f0 100644
> > --- a/kernel/time/tick-sched.c
> > +++ b/kernel/time/tick-sched.c
> > @@ -20,6 +20,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include 
> >  #include 
> >  #include 
> > @@ -268,7 +269,8 @@ static void tick_nohz_full_kick(void)
> >   */
> >  void tick_nohz_full_kick_cpu(int cpu)
> >  {
> > -   if (!tick_nohz_full_cpu(cpu))
> > +   smp_rmb();
> > +   if (!tick_nohz_full_cpu(cpu) || task_isolation_on_cpu(cpu))
> > return;
> 
> Like I said in subsequent reviews, we are not going to ignore IPIs.
> We must fix the sources of these IPIs instead.

This is what I am working on right now. This is made with the
assumption that a CPU running an isolated task has no reason to be
kicked, because nothing else is supposed to be there. Usually this is
true, and when it is not, it is still safe as long as everything else
is behaving right. For this version I have kept the original
implementation with minimal changes, to make it possible to use task
isolation at all.

I agree that it's a much better idea to determine whether the CPU
should be kicked. If it really should, that will be a legitimate cause
to break isolation there, because a CPU running an isolated task has
no legitimate reason to have timers running. Right now I am trying to
determine the origin of timers that _still_ show up as running in the
current kernel version, so I think this is a rather large chunk of
work that I have to do separately.

-- 
Alex


[PATCH v5 8/9] task_isolation: ringbuffer: don't interrupt CPUs running isolated tasks on buffer resize

2020-11-23 Thread Alex Belits
From: Yuri Norov 

CPUs running isolated tasks are in userspace, so they don't have to
perform ring buffer updates immediately. If ring_buffer_resize()
schedules the update on those CPUs, isolation is broken. To prevent
that, updates for CPUs running isolated tasks are performed locally,
like for offline CPUs.

A race condition between this update and isolation breaking is avoided
at the cost of disabling per_cpu buffer writing for the duration of the update
when it coincides with isolation breaking.

Signed-off-by: Yuri Norov 
[abel...@marvell.com: updated to prevent race with isolation breaking]
Signed-off-by: Alex Belits 
---
 kernel/trace/ring_buffer.c | 63 ++
 1 file changed, 57 insertions(+), 6 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index dc83b3fa9fe7..9e4fb3ed2af0 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1939,6 +1940,38 @@ static void update_pages_handler(struct work_struct *work)
complete(&cpu_buffer->update_done);
 }
 
+static bool update_if_isolated(struct ring_buffer_per_cpu *cpu_buffer,
+  int cpu)
+{
+   bool rv = false;
+
+   smp_rmb();
+   if (task_isolation_on_cpu(cpu)) {
+   /*
+* CPU is running isolated task. Since it may lose
+* isolation and re-enter kernel simultaneously with
+* this update, disable recording until it's done.
+*/
+   atomic_inc(&cpu_buffer->record_disabled);
+   /* Make sure, update is done, and isolation state is current */
+   smp_mb();
+   if (task_isolation_on_cpu(cpu)) {
+   /*
+* If CPU is still running isolated task, we
+* can be sure that breaking isolation will
+* happen while recording is disabled, and CPU
+* will not touch this buffer until the update
+* is done.
+*/
+   rb_update_pages(cpu_buffer);
+   cpu_buffer->nr_pages_to_update = 0;
+   rv = true;
+   }
+   atomic_dec(&cpu_buffer->record_disabled);
+   }
+   return rv;
+}
+
 /**
  * ring_buffer_resize - resize the ring buffer
  * @buffer: the buffer to resize.
@@ -2028,13 +2061,22 @@ int ring_buffer_resize(struct trace_buffer *buffer, unsigned long size,
if (!cpu_buffer->nr_pages_to_update)
continue;
 
-   /* Can't run something on an offline CPU. */
+   /*
+* Can't run something on an offline CPU.
+*
+* CPUs running isolated tasks don't have to
+* update ring buffers until they exit
+* isolation because they are in
+* userspace. Use the procedure that prevents
+* race condition with isolation breaking.
+*/
if (!cpu_online(cpu)) {
rb_update_pages(cpu_buffer);
cpu_buffer->nr_pages_to_update = 0;
} else {
-   schedule_work_on(cpu,
-   &cpu_buffer->update_pages_work);
+   if (!update_if_isolated(cpu_buffer, cpu))
+   schedule_work_on(cpu,
+   &cpu_buffer->update_pages_work);
}
}
 
@@ -2083,13 +2125,22 @@ int ring_buffer_resize(struct trace_buffer *buffer, unsigned long size,
 
get_online_cpus();
 
-   /* Can't run something on an offline CPU. */
+   /*
+* Can't run something on an offline CPU.
+*
+* CPUs running isolated tasks don't have to update
+* ring buffers until they exit isolation because they
+* are in userspace. Use the procedure that prevents
+* race condition with isolation breaking.
+*/
if (!cpu_online(cpu_id))
rb_update_pages(cpu_buffer);
else {
-   schedule_work_on(cpu_id,
+   if (!update_if_isolated(cpu_buffer, cpu_id))
+   schedule_work_on(cpu_id,
 &cpu_buffer->update_pages_work);
-   wait_for_completion(&cpu_buf

[PATCH v5 9/9] task_isolation: kick_all_cpus_sync: don't kick isolated cpus

2020-11-23 Thread Alex Belits
From: Yuri Norov 

Make sure that kick_all_cpus_sync() does not call CPUs that are running
isolated tasks.

Signed-off-by: Yuri Norov 
[abel...@marvell.com: use safe task_isolation_cpumask() implementation]
Signed-off-by: Alex Belits 
---
 kernel/smp.c | 14 +-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index 4d17501433be..b2faecf58ed0 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -932,9 +932,21 @@ static void do_nothing(void *unused)
  */
 void kick_all_cpus_sync(void)
 {
+   struct cpumask mask;
+
/* Make sure the change is visible before we kick the cpus */
smp_mb();
-   smp_call_function(do_nothing, NULL, 1);
+
+   preempt_disable();
+#ifdef CONFIG_TASK_ISOLATION
+   cpumask_clear(&mask);
+   task_isolation_cpumask(&mask);
+   cpumask_complement(&mask, &mask);
+#else
+   cpumask_setall(&mask);
+#endif
+   smp_call_function_many(&mask, do_nothing, NULL, 1);
+   preempt_enable();
 }
 EXPORT_SYMBOL_GPL(kick_all_cpus_sync);
 
-- 
2.20.1



[PATCH v5 7/9] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu()

2020-11-23 Thread Alex Belits
From: Yuri Norov 

For nohz_full CPUs the desirable behavior is to receive interrupts
generated by tick_nohz_full_kick_cpu(). But for hard isolation it's
obviously not desirable because it breaks isolation.

This patch adds a check for it.

Signed-off-by: Yuri Norov 
[abel...@marvell.com: updated, only exclude CPUs running isolated tasks]
Signed-off-by: Alex Belits 
---
 kernel/time/tick-sched.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index a213952541db..6c8679e200f0 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -268,7 +269,8 @@ static void tick_nohz_full_kick(void)
  */
 void tick_nohz_full_kick_cpu(int cpu)
 {
-   if (!tick_nohz_full_cpu(cpu))
+   smp_rmb();
+   if (!tick_nohz_full_cpu(cpu) || task_isolation_on_cpu(cpu))
return;
 
irq_work_queue_on(&per_cpu(nohz_full_kick_work, cpu), cpu);
-- 
2.20.1



[PATCH v5 6/9] task_isolation: arch/arm64: enable task isolation functionality

2020-11-23 Thread Alex Belits
In do_notify_resume(), call task_isolation_before_pending_work_check()
first, to report isolation breaking, then after handling all pending
work, call task_isolation_start() for TIF_TASK_ISOLATION tasks.

Add _TIF_TASK_ISOLATION to _TIF_WORK_MASK, and _TIF_SYSCALL_WORK,
define local NOTIFY_RESUME_LOOP_FLAGS to check in the loop, since we
don't clear _TIF_TASK_ISOLATION in the loop.

Early kernel entry code calls task_isolation_kernel_enter(). In
particular:

Vectors:
el1_sync -> el1_sync_handler() -> task_isolation_kernel_enter()
el1_irq -> asm_nmi_enter(), handle_arch_irq()
el1_error -> do_serror()
el0_sync -> el0_sync_handler()
el0_irq -> handle_arch_irq()
el0_error -> do_serror()
el0_sync_compat -> el0_sync_compat_handler()
el0_irq_compat -> handle_arch_irq()
el0_error_compat -> do_serror()

SDEI entry:
__sdei_asm_handler -> __sdei_handler() -> nmi_enter()

Functions called from there:
asm_nmi_enter() -> nmi_enter() -> task_isolation_kernel_enter()
asm_nmi_exit() -> nmi_exit() -> task_isolation_kernel_return()

Handlers:
do_serror() -> nmi_enter() -> task_isolation_kernel_enter()
  or task_isolation_kernel_enter()
el1_sync_handler() -> task_isolation_kernel_enter()
el0_sync_handler() -> task_isolation_kernel_enter()
el0_sync_compat_handler() -> task_isolation_kernel_enter()

handle_arch_irq() is irqchip-specific; most call handle_domain_irq().
There is a separate patch for irqchips that do not follow this rule.

handle_domain_irq() -> task_isolation_kernel_enter()
do_handle_IPI() -> task_isolation_kernel_enter() (may be redundant)
nmi_enter() -> task_isolation_kernel_enter()

Signed-off-by: Chris Metcalf 
[abel...@marvell.com: simplified to match kernel 5.10]
Signed-off-by: Alex Belits 
---
 arch/arm64/Kconfig   |  1 +
 arch/arm64/include/asm/barrier.h |  1 +
 arch/arm64/include/asm/thread_info.h |  7 +--
 arch/arm64/kernel/entry-common.c |  7 +++
 arch/arm64/kernel/ptrace.c   | 10 ++
 arch/arm64/kernel/signal.c   | 13 -
 arch/arm64/kernel/smp.c  |  3 +++
 7 files changed, 39 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 1515f6f153a0..fc958d8d8945 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -141,6 +141,7 @@ config ARM64
select HAVE_ARCH_PREL32_RELOCATIONS
select HAVE_ARCH_SECCOMP_FILTER
select HAVE_ARCH_STACKLEAK
+   select HAVE_ARCH_TASK_ISOLATION
select HAVE_ARCH_THREAD_STRUCT_WHITELIST
select HAVE_ARCH_TRACEHOOK
select HAVE_ARCH_TRANSPARENT_HUGEPAGE
diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
index c3009b0e5239..ad5a6dd380cf 100644
--- a/arch/arm64/include/asm/barrier.h
+++ b/arch/arm64/include/asm/barrier.h
@@ -49,6 +49,7 @@
 #define dma_rmb()  dmb(oshld)
 #define dma_wmb()  dmb(oshst)
 
+#define instr_sync()   isb()
 /*
  * Generate a mask for array_index__nospec() that is ~0UL when 0 <= idx < sz
  * and 0 otherwise.
diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
index 1fbab854a51b..3321c69c46fe 100644
--- a/arch/arm64/include/asm/thread_info.h
+++ b/arch/arm64/include/asm/thread_info.h
@@ -68,6 +68,7 @@ void arch_release_task_struct(struct task_struct *tsk);
 #define TIF_UPROBE 4   /* uprobe breakpoint or singlestep */
 #define TIF_FSCHECK5   /* Check FS is USER_DS on return */
 #define TIF_MTE_ASYNC_FAULT6   /* MTE Asynchronous Tag Check Fault */
+#define TIF_TASK_ISOLATION 7   /* task isolation enabled for task */
 #define TIF_SYSCALL_TRACE  8   /* syscall trace active */
 #define TIF_SYSCALL_AUDIT  9   /* syscall auditing */
 #define TIF_SYSCALL_TRACEPOINT 10  /* syscall tracepoint for ftrace */
@@ -87,6 +88,7 @@ void arch_release_task_struct(struct task_struct *tsk);
 #define _TIF_NEED_RESCHED  (1 << TIF_NEED_RESCHED)
 #define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME)
 #define _TIF_FOREIGN_FPSTATE   (1 << TIF_FOREIGN_FPSTATE)
+#define _TIF_TASK_ISOLATION(1 << TIF_TASK_ISOLATION)
 #define _TIF_SYSCALL_TRACE (1 << TIF_SYSCALL_TRACE)
 #define _TIF_SYSCALL_AUDIT (1 << TIF_SYSCALL_AUDIT)
 #define _TIF_SYSCALL_TRACEPOINT(1 << TIF_SYSCALL_TRACEPOINT)
@@ -101,11 +103,12 @@ void arch_release_task_struct(struct task_struct *tsk);
 
 #define _TIF_WORK_MASK (_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
 _TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE | \
-_TIF_UPROBE | _TIF_FSCHECK | _TIF_MTE_ASYNC_FAULT)
+_TIF_UPROBE | _TIF_FSCHECK | \
+_TIF_MTE_ASYNC_FAULT | _TIF_TASK_ISOLATION)
 
 #define _TIF_SYSCALL_WORK  (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | \

[PATCH v5 5/9] task_isolation: Add driver-specific hooks

2020-11-23 Thread Alex Belits
Some drivers don't call functions that call
task_isolation_kernel_enter() in interrupt handlers. Call it
directly.

Signed-off-by: Alex Belits 
---
 drivers/irqchip/irq-armada-370-xp.c | 6 ++
 drivers/irqchip/irq-gic-v3.c| 3 +++
 drivers/irqchip/irq-gic.c   | 3 +++
 drivers/s390/cio/cio.c  | 3 +++
 4 files changed, 15 insertions(+)

diff --git a/drivers/irqchip/irq-armada-370-xp.c b/drivers/irqchip/irq-armada-370-xp.c
index d7eb2e93db8f..4ac7babe1abe 100644
--- a/drivers/irqchip/irq-armada-370-xp.c
+++ b/drivers/irqchip/irq-armada-370-xp.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -572,6 +573,7 @@ static const struct irq_domain_ops armada_370_xp_mpic_irq_ops = {
 static void armada_370_xp_handle_msi_irq(struct pt_regs *regs, bool is_chained)
 {
u32 msimask, msinr;
+   int isol_entered = 0;
 
msimask = readl_relaxed(per_cpu_int_base +
ARMADA_370_XP_IN_DRBEL_CAUSE_OFFS)
@@ -588,6 +590,10 @@ static void armada_370_xp_handle_msi_irq(struct pt_regs *regs, bool is_chained)
continue;
 
if (is_chained) {
+   if (!isol_entered) {
+   task_isolation_kernel_enter();
+   isol_entered = 1;
+   }
irq = irq_find_mapping(armada_370_xp_msi_inner_domain,
   msinr - PCI_MSI_DOORBELL_START);
generic_handle_irq(irq);
diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
index 16fecc0febe8..ded26dd4da0f 100644
--- a/drivers/irqchip/irq-gic-v3.c
+++ b/drivers/irqchip/irq-gic-v3.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -646,6 +647,8 @@ static asmlinkage void __exception_irq_entry gic_handle_irq(struct pt_regs *regs
 {
u32 irqnr;
 
+   task_isolation_kernel_enter();
+
irqnr = gic_read_iar();
 
if (gic_supports_nmi() &&
diff --git a/drivers/irqchip/irq-gic.c b/drivers/irqchip/irq-gic.c
index 6053245a4754..bb482b4ae218 100644
--- a/drivers/irqchip/irq-gic.c
+++ b/drivers/irqchip/irq-gic.c
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -337,6 +338,8 @@ static void __exception_irq_entry gic_handle_irq(struct pt_regs *regs)
struct gic_chip_data *gic = &gic_data[0];
void __iomem *cpu_base = gic_data_cpu_base(gic);
 
+   task_isolation_kernel_enter();
+
do {
irqstat = readl_relaxed(cpu_base + GIC_CPU_INTACK);
irqnr = irqstat & GICC_IAR_INT_ID_MASK;
diff --git a/drivers/s390/cio/cio.c b/drivers/s390/cio/cio.c
index 6d716db2a46a..beab1b6d 100644
--- a/drivers/s390/cio/cio.c
+++ b/drivers/s390/cio/cio.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -584,6 +585,8 @@ void cio_tsch(struct subchannel *sch)
struct irb *irb;
int irq_context;
 
+   task_isolation_kernel_enter();
+
irb = this_cpu_ptr(&cio_irb);
/* Store interrupt response block to lowcore. */
if (tsch(sch->schid, irb) != 0)
-- 
2.20.1



[PATCH v5 4/9] task_isolation: Add task isolation hooks to arch-independent code

2020-11-23 Thread Alex Belits
Kernel entry and exit functions for task isolation are added to context
tracking and common entry points. Common handling of pending work on exit
to userspace now processes isolation breaking, cleanup and start.

Signed-off-by: Chris Metcalf 
[abel...@marvell.com: adapted for kernel 5.10]
Signed-off-by: Alex Belits 
---
 include/linux/hardirq.h   |  2 ++
 include/linux/sched.h |  2 ++
 kernel/context_tracking.c |  5 +
 kernel/entry/common.c | 10 +-
 kernel/irq/irqdesc.c  |  5 +
 5 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index 754f67ac4326..b9e604ae6a0d 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -7,6 +7,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 extern void synchronize_irq(unsigned int irq);
@@ -115,6 +116,7 @@ extern void rcu_nmi_exit(void);
do {\
lockdep_off();  \
arch_nmi_enter();   \
+   task_isolation_kernel_enter();  \
printk_nmi_enter(); \
BUG_ON(in_nmi() == NMI_MASK);   \
__preempt_count_add(NMI_OFFSET + HARDIRQ_OFFSET);   \
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5d8b17aa544b..51c2d774250b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -34,6 +34,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /* task_struct member predeclarations (sorted alphabetically): */
 struct audit_context;
@@ -1762,6 +1763,7 @@ extern char *__get_task_comm(char *to, size_t len, struct task_struct *tsk);
 #ifdef CONFIG_SMP
 static __always_inline void scheduler_ipi(void)
 {
+   task_isolation_kernel_enter();
/*
 * Fold TIF_NEED_RESCHED into the preempt_count; anybody setting
 * TIF_NEED_RESCHED remotely (for the first time) will also send
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 36a98c48aedc..379a48fd0e65 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define CREATE_TRACE_POINTS
 #include 
@@ -100,6 +101,8 @@ void noinstr __context_tracking_enter(enum ctx_state state)
__this_cpu_write(context_tracking.state, state);
}
context_tracking_recursion_exit();
+
+   task_isolation_exit_to_user_mode();
 }
 EXPORT_SYMBOL_GPL(__context_tracking_enter);
 
@@ -148,6 +151,8 @@ void noinstr __context_tracking_exit(enum ctx_state state)
if (!context_tracking_recursion_enter())
return;
 
+   task_isolation_kernel_enter();
+
if (__this_cpu_read(context_tracking.state) == state) {
if (__this_cpu_read(context_tracking.active)) {
/*
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index e9e2df3f3f9e..10a520894105 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -4,6 +4,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define CREATE_TRACE_POINTS
 #include 
@@ -183,13 +184,20 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 
 static void exit_to_user_mode_prepare(struct pt_regs *regs)
 {
-   unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
+   unsigned long ti_work;
 
lockdep_assert_irqs_disabled();
 
+   task_isolation_before_pending_work_check();
+
+   ti_work = READ_ONCE(current_thread_info()->flags);
+
if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
ti_work = exit_to_user_mode_loop(regs, ti_work);
 
+   if (unlikely(ti_work & _TIF_TASK_ISOLATION))
+   task_isolation_start();
+
arch_exit_to_user_mode_prepare(regs, ti_work);
 
/* Ensure that the address limit is intact and no locks are held */
diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index 1a7723604399..b8f0a7574f55 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "internals.h"
 
@@ -669,6 +670,8 @@ int __handle_domain_irq(struct irq_domain *domain, unsigned int hwirq,
unsigned int irq = hwirq;
int ret = 0;
 
+   task_isolation_kernel_enter();
+
irq_enter();
 
 #ifdef CONFIG_IRQ_DOMAIN
@@ -710,6 +713,8 @@ int handle_domain_nmi(struct irq_domain *domain, unsigned int hwirq,
unsigned int irq;
int ret = 0;
 
+   task_isolation_kernel_enter();
+
/*
 * NMI context needs to be setup earlier in order to deal with tracing.
 */
-- 
2.20.1



[PATCH v5 3/9] task_isolation: userspace hard isolation from kernel

2020-11-23 Thread Alex Belits
The existing nohz_full mode is designed as a "soft" isolation mode
that makes tradeoffs to minimize userspace interruptions while
still attempting to avoid overheads in the kernel entry/exit path,
to provide 100% kernel semantics, etc.

However, some applications require a "hard" commitment from the
kernel to avoid interruptions, in particular userspace device driver
style applications, such as high-speed networking code.

This change introduces a framework to allow applications
to elect to have the "hard" semantics as needed, specifying
prctl(PR_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE, 0, 0, 0) to do
so.

The kernel must be built with the new TASK_ISOLATION Kconfig flag
to enable this mode, and the kernel booted with an appropriate
"isolcpus=nohz,domain,CPULIST" boot argument to enable
nohz_full and isolcpus. The "task_isolation" state is then indicated
by setting a new task struct field, task_isolation_flag, to the
value passed by prctl(), and also setting a TIF_TASK_ISOLATION
bit in the thread_info flags. When the kernel is returning to
userspace from the prctl() call and sees TIF_TASK_ISOLATION set,
it calls the new task_isolation_start() routine to arrange for
the task to avoid being interrupted in the future.

With interrupts disabled, task_isolation_start() ensures that kernel
subsystems that might cause a future interrupt are quiesced. If it
doesn't succeed, it adjusts the syscall return value to indicate that
fact, and userspace can retry as desired. In addition to stopping
the scheduler tick, the code takes any actions that might avoid
a future interrupt to the core, such as a worker thread being
scheduled that could be quiesced now (e.g. the vmstat worker)
or a future IPI to the core to clean up some state that could be
cleaned up now (e.g. the mm lru per-cpu cache).

The last stage of enabling task isolation happens in
task_isolation_exit_to_user_mode(), which runs last before returning
to userspace and changes ll_isol_flags (see below) to prevent other
CPUs from interfering with the isolated task.

Once the task has returned to userspace after issuing the prctl(),
if it enters the kernel again via system call, page fault, or any
other exception or irq, the kernel will send it a signal to indicate
isolation loss. In addition to sending a signal, the code supports a
kernel command-line "task_isolation_debug" flag which causes a stack
backtrace to be generated whenever a task loses isolation.

To allow the state to be entered and exited, the syscall checking
test ignores the prctl(PR_TASK_ISOLATION) syscall so that we can
clear the bit again later, and ignores exit/exit_group to allow
exiting the task without a pointless signal being delivered.

The prctl() API allows for specifying a signal number to use instead
of the default SIGKILL, to allow for catching the notification
signal; for example, in a production environment, it might be
helpful to log information to the application logging mechanism
before exiting. Or, the signal handler might choose to reset the
program counter back to the code segment intended to be run isolated
via prctl() to continue execution.

Isolation also disables CPU state synchronization mechanisms that are
normally done by IPI. In the future, more synchronization mechanisms,
such as TLB flushes, may be disabled for isolated tasks. This requires
careful handling of kernel entry from an isolated task -- remote
synchronization requests must be re-enabled and the synchronization
procedure triggered before anything other than low-level kernel entry
code is called. The same applies to exiting from the kernel to
userspace after isolation is enabled.

For this purpose, per-CPU low-level flags ll_isol_flags are used to
indicate isolation state, and task_isolation_kernel_enter() is used
to safely clear them early in kernel entry. The CPU mask corresponding
to the isolation bit in ll_isol_flags is visible to userspace as
/sys/devices/system/cpu/isolation_running, and can be used for
monitoring.
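
For monitoring, a hedged sketch of reading that mask from userspace
(the path is the one stated above; the output format is only
illustrative, and this program is not part of the patch itself):

#include <stdio.h>

int main(void)
{
        char buf[256];
        FILE *f = fopen("/sys/devices/system/cpu/isolation_running", "r");

        if (!f) {
                perror("isolation_running");
                return 1;
        }
        if (fgets(buf, sizeof(buf), f))
                printf("CPUs running isolated tasks: %s", buf);
        fclose(f);
        return 0;
}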

Signed-off-by: Chris Metcalf 
Signed-off-by: Alex Belits 
---
 .../admin-guide/kernel-parameters.txt |   6 +
 drivers/base/cpu.c|  23 +
 include/linux/hrtimer.h   |   4 +
 include/linux/isolation.h | 326 
 include/linux/sched.h |   5 +
 include/linux/tick.h  |   3 +
 include/uapi/linux/prctl.h|   6 +
 init/Kconfig  |  27 +
 kernel/Makefile   |   2 +
 kernel/isolation.c| 714 ++
 kernel/signal.c   |   2 +
 kernel/sys.c  |   6 +
 kernel/time/hrtimer.c |  27 +
 kernel/time/tick-sched.c  |  18 +
 14 files changed, 1169 insertions(+)
 create mode 100644 include/linux/isolation.h
 create mode 100644 

[PATCH v5 2/9] task_isolation: vmstat: add vmstat_idle function

2020-11-23 Thread Alex Belits
From: Chris Metcalf 

This function checks to see if a vmstat worker is not running,
and the vmstat diffs don't require an update.  The function is
called from the task-isolation code to see if we need to
actually do some work to quiet vmstat.

Signed-off-by: Chris Metcalf 
Signed-off-by: Alex Belits 
---
 include/linux/vmstat.h |  2 ++
 mm/vmstat.c| 10 ++
 2 files changed, 12 insertions(+)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 300ce6648923..24392a957cfc 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -285,6 +285,7 @@ extern void __dec_node_state(struct pglist_data *, enum node_stat_item);
 
 void quiet_vmstat(void);
 void quiet_vmstat_sync(void);
+bool vmstat_idle(void);
 void cpu_vm_stats_fold(int cpu);
 void refresh_zone_stat_thresholds(void);
 
@@ -393,6 +394,7 @@ static inline void refresh_zone_stat_thresholds(void) { }
 static inline void cpu_vm_stats_fold(int cpu) { }
 static inline void quiet_vmstat(void) { }
 static inline void quiet_vmstat_sync(void) { }
+static inline bool vmstat_idle(void) { return true; }
 
 static inline void drain_zonestat(struct zone *zone,
struct per_cpu_pageset *pset) { }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 43999caf47a4..5b0ad7ed65f7 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1945,6 +1945,16 @@ void quiet_vmstat_sync(void)
refresh_cpu_vm_stats(false);
 }
 
+/*
+ * Report on whether vmstat processing is quiesced on the core currently:
+ * no vmstat worker running and no vmstat updates to perform.
+ */
+bool vmstat_idle(void)
+{
+   return !delayed_work_pending(this_cpu_ptr(&vmstat_work)) &&
+   !need_update(smp_processor_id());
+}
+
 /*
  * Shepherd worker thread that checks the
  * differentials of processors that have their worker
-- 
2.20.1



[PATCH v5 1/9] task_isolation: vmstat: add quiet_vmstat_sync function

2020-11-23 Thread Alex Belits
From: Chris Metcalf 

In commit f01f17d3705b ("mm, vmstat: make quiet_vmstat lighter")
the quiet_vmstat() function became asynchronous, in the sense that
the vmstat work was still scheduled to run on the core when the
function returned.  For task isolation, we need a synchronous
version of the function that guarantees that the vmstat worker
will not run on the core on return from the function.  Add a
quiet_vmstat_sync() function with that semantic.

Signed-off-by: Chris Metcalf 
Signed-off-by: Alex Belits 
---
 include/linux/vmstat.h | 2 ++
 mm/vmstat.c| 9 +
 2 files changed, 11 insertions(+)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 322dcbfcc933..300ce6648923 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -284,6 +284,7 @@ extern void __dec_zone_state(struct zone *, enum zone_stat_item);
 extern void __dec_node_state(struct pglist_data *, enum node_stat_item);
 
 void quiet_vmstat(void);
+void quiet_vmstat_sync(void);
 void cpu_vm_stats_fold(int cpu);
 void refresh_zone_stat_thresholds(void);
 
@@ -391,6 +392,7 @@ static inline void __dec_node_page_state(struct page *page,
 static inline void refresh_zone_stat_thresholds(void) { }
 static inline void cpu_vm_stats_fold(int cpu) { }
 static inline void quiet_vmstat(void) { }
+static inline void quiet_vmstat_sync(void) { }
 
 static inline void drain_zonestat(struct zone *zone,
struct per_cpu_pageset *pset) { }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 698bc0bc18d1..43999caf47a4 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1936,6 +1936,15 @@ void quiet_vmstat(void)
refresh_cpu_vm_stats(false);
 }
 
+/*
+ * Synchronously quiet vmstat so the work is guaranteed not to run on return.
+ */
+void quiet_vmstat_sync(void)
+{
+   cancel_delayed_work_sync(this_cpu_ptr(&vmstat_work));
+   refresh_cpu_vm_stats(false);
+}
+
 /*
  * Shepherd worker thread that checks the
  * differentials of processors that have their worker
-- 
2.20.1



[PATCH v5 0/9] "Task_isolation" mode

2020-11-23 Thread Alex Belits
This is an update of task isolation work that was originally done by
Chris Metcalf  and maintained by him until
November 2017. It is adapted to the current kernel and cleaned up to
implement its functionality in a more complete and cleaner manner.

Previous version is at
https://lore.kernel.org/netdev/04be044c1bcd76b7438b7563edc35383417f12c8.ca...@marvell.com/

The last version by Chris Metcalf (now obsolete but may be relevant
for comparison and understanding the origin of the changes) is at
https://lore.kernel.org/lkml/1509728692-10460-1-git-send-email-cmetc...@mellanox.com

Supported architectures

This version includes only architecture-independent code and arm64
support. x86 and arm support, and everything related to virtualization,
will be re-added later when the new kernel entry/exit implementation is
accommodated. Support for other architectures can be added in a
somewhat modular manner, however it heavily depends on the details of
kernel entry/exit support on each particular architecture. Development
of common entry/exit code and conversion to it should simplify that
task. For now, this is the version that is currently being developed
on arm64.

Major changes since v4

The goal was to make isolation-breaking detection as generic as
possible, and remove everything related to determining _why_ isolation
was broken. Originally, reporting isolation breaking was done with a
large number of hooks in specific code (hardware interrupts, syscalls,
IPIs, page faults, etc.), and it was necessary to cover all possible
such events to have a reliable notification of a task about its
isolation being broken. To avoid such a fragile mechanism, this
version relies on the mere fact of the kernel being entered in
isolation mode. As a result, reporting happens later in kernel code,
however it covers everything.

This means that now there is no specific reporting, in the kernel log
or elsewhere, about the reasons for breaking isolation. Information
about that may be valuable at runtime, so a separate mechanism for
generic reporting of "why did the CPU enter the kernel" (with isolation
or under other conditions) may be a good thing. That can be done later;
however, at this point it's important that task isolation does not
require it, and such a mechanism will not be developed with the limited
purpose of supporting isolation alone.

General description

This is the result of development and maintenance of task isolation
functionality that originally started based on task isolation patch
v15 and was later updated to include v16. It provided a predictable
environment for userspace tasks running on arm64 processors alongside
a full-featured Linux environment. It is intended to provide a
reliable, interruption-free environment from the point when a userspace
task enters isolation until the moment it leaves isolation or receives
a signal intentionally sent to it, and was successfully used for this
purpose. While CPU isolation with nohz provides an environment that is
close to this requirement, the remaining IPIs and other disturbances
keep it from being usable for tasks that require complete
predictability of CPU timing.

This set of patches only covers the implementation of task isolation,
however additional functionality, such as selective TLB flushes, may
be implemented to avoid other kinds of disturbances that affect
latency and performance of isolated tasks.

The userspace support and test program is now at
https://github.com/abelits/libtmc . It was originally developed for an
earlier implementation, so it has some checks that may be redundant
now but are kept for compatibility.

My thanks to Chris Metcalf for design and maintenance of the original
task isolation patch, Francis Giraldeau 
and Yuri Norov  for various contributions to this
work, Frederic Weisbecker  for his work on CPU
isolation and housekeeping that made it possible to remove some less
elegant solutions that I had to devise for earlier, <4.17 kernels, and
Nitesh Narayan Lal  for adapting earlier patches
related to interrupt and work distribution in the presence of CPU
isolation.

-- 
Alex


Re: [EXT] Re: [PATCH v4 03/13] task_isolation: userspace hard isolation from kernel

2020-10-17 Thread Alex Belits

On Sat, 2020-10-17 at 18:08 +0200, Thomas Gleixner wrote:
> On Sat, Oct 17 2020 at 01:08, Alex Belits wrote:
> > On Mon, 2020-10-05 at 14:52 -0400, Nitesh Narayan Lal wrote:
> > > On 10/4/20 7:14 PM, Frederic Weisbecker wrote:
> > I think that the goal of "finding source of disturbance" interface
> > is
> > different from what can be accomplished by tracing in two ways:
> > 
> > 1. "Source of disturbance" should provide some useful information
> > about
> > the category of the event and its cause, as opposed to determining all
> > precise
> > details about things being called that resulted or could result in
> > disturbance. It should not depend on the user's knowledge about
> > details
> 
> Tracepoints already give you selectively useful information.

Carefully placed tracepoints can also give the user information about
failures of open(), write(), execve() or mmap(). However, syscalls still
provide an error code instead of returning a generic failure and letting
the user debug the cause.

-- 
Alex


Re: [EXT] Re: [PATCH v4 03/13] task_isolation: userspace hard isolation from kernel

2020-10-16 Thread Alex Belits

On Mon, 2020-10-05 at 14:52 -0400, Nitesh Narayan Lal wrote:
> On 10/4/20 7:14 PM, Frederic Weisbecker wrote:
> > On Sun, Oct 04, 2020 at 02:44:39PM +0000, Alex Belits wrote:
> > > On Thu, 2020-10-01 at 15:56 +0200, Frederic Weisbecker wrote:
> > > > On Wed, Jul 22, 2020 at 02:49:49PM +0000, Alex Belits wrote:
> > > > > +/*
> > > > > + * Description of the last two tasks that ran isolated on a
> > > > > given
> > > > > CPU.
> > > > > + * This is intended only for messages about isolation
> > > > > breaking. We
> > > > > + * don't want any references to actual task while accessing
> > > > > this
> > > > > from
> > > > > + * CPU that caused isolation breaking -- we know nothing
> > > > > about
> > > > > timing
> > > > > + * and don't want to use locking or RCU.
> > > > > + */
> > > > > +struct isol_task_desc {
> > > > > + atomic_t curr_index;
> > > > > + atomic_t curr_index_wr;
> > > > > + boolwarned[2];
> > > > > + pid_t   pid[2];
> > > > > + pid_t   tgid[2];
> > > > > + charcomm[2][TASK_COMM_LEN];
> > > > > +};
> > > > > +static DEFINE_PER_CPU(struct isol_task_desc,
> > > > > isol_task_descs);
> > > > So that's quite a huge patch that would have needed to be split
> > > > up.
> > > > Especially this tracing engine.
> > > > 
> > > > Speaking of which, I agree with Thomas that it's unnecessary.
> > > > It's
> > > > too much
> > > > code and complexity. We can use the existing trace events and
> > > > perform
> > > > the
> > > > analysis from userspace to find the source of the disturbance.
> > > The idea behind this is that isolation breaking events are
> > > supposed to
> > > be known to the applications while applications run normally, and
> > > they
> > > should not require any analysis or human intervention to be
> > > handled.
> > Sure but you can use trace events for that. Just trace interrupts,
> > workqueues,
> > timers, syscalls, exceptions and scheduler events and you get all
> > the local
> > disturbance. You might want to tune a few filters but that's pretty
> > much it.
> > 
> > As for the source of the disturbances, if you really need that
> > information,
> > you can trace the workqueue and timer queue events and just filter
> > those that
> > target your isolated CPUs.
> > 
> 
> I agree that we can do all those things with tracing.
> However, IMHO having a simplified logging mechanism to gather the
> source of
> violation may help in reducing the manual effort.
> 
> Although, I am not sure how easy will it be to maintain such an
> interface
> over time.

I think that the goal of "finding source of disturbance" interface is
different from what can be accomplished by tracing in two ways:

1. "Source of disturbance" should provide some useful information about
the category of the event and its cause, as opposed to determining all precise
details about things being called that resulted or could result in
disturbance. It should not depend on the user's knowledge about the
details of implementations; it should provide some definite answer
about what happened (with whatever amount of detail can be given in a
generic mechanism), even if the user has no idea how those things
happen and what part of the kernel is responsible for either causing
or processing them. Then, if the user needs further details, they can
be obtained with tracing.

2. It should be usable as a runtime error handling mechanism, so the
information it provides should be suitable for application use and
logging. It should be usable when applications are running on a system
in production, and no specific tracing or monitoring mechanism can be
in use. If, say, thousands of devices are controlling neutrino
detectors on an ocean floor, and in a month of work one of them got one
isolation breaking event, it should be able to report that isolation
was broken by an interrupt from a network interface, so the users will
be able to track it down to some userspace application reconfiguring
those interrupts.

It will be a good idea to make such mechanism optional and suitable for
tracking things on conditions other than "always enabled" and "enabled
with task isolation". However in my opinion, there should be something
in kernel entry procedure that, if enabled, prepared something to be
filled by the cause data, and we know at least one such situation when
this kernel entry procedure should be triggered -- when task isolation
is on.

-- 
Alex


Re: [EXT] Re: [PATCH v4 03/13] task_isolation: userspace hard isolation from kernel

2020-10-16 Thread Alex Belits

On Tue, 2020-10-06 at 12:35 +0200, Frederic Weisbecker wrote:
> On Mon, Oct 05, 2020 at 02:52:49PM -0400, Nitesh Narayan Lal wrote:
> > On 10/4/20 7:14 PM, Frederic Weisbecker wrote:
> > > On Sun, Oct 04, 2020 at 02:44:39PM +0000, Alex Belits wrote:
> > > 
> > > > The idea behind this is that isolation breaking events are
> > > > supposed to
> > > > be known to the applications while applications run normally,
> > > > and they
> > > > should not require any analysis or human intervention to be
> > > > handled.
> > > Sure but you can use trace events for that. Just trace
> > > interrupts, workqueues,
> > > timers, syscalls, exceptions and scheduler events and you get all
> > > the local
> > > disturbance. You might want to tune a few filters but that's
> > > pretty much it.
> > > As for the source of the disturbances, if you really need that
> > > information,
> > > you can trace the workqueue and timer queue events and just
> > > filter those that
> > > target your isolated CPUs.
> > > 
> > 
> > I agree that we can do all those things with tracing.
> > However, IMHO having a simplified logging mechanism to gather the
> > source of
> > violation may help in reducing the manual effort.
> > 
> > Although, I am not sure how easy will it be to maintain such an
> > interface
> > over time.
> 
> The thing is: tracing is your simplified logging mechanism here. You
> can achieve
> the same in userspace with _way_ less code, no race, and you can do
> it in
> bash.

The idea is that this mechanism should be usable when no one is there
to run things in bash, or when there is no information about what might
happen. It should be able to report rare events in production when
users may not be able to reproduce them.

-- 
Alex


Re: [EXT] Re: [PATCH v4 10/13] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu()

2020-10-16 Thread Alex Belits

On Tue, 2020-10-06 at 23:41 +0200, Frederic Weisbecker wrote:
> On Sun, Oct 04, 2020 at 03:22:09PM +0000, Alex Belits wrote:
> > On Thu, 2020-10-01 at 16:44 +0200, Frederic Weisbecker wrote:
> > > > @@ -268,7 +269,8 @@ static void tick_nohz_full_kick(void)
> > > >   */
> > > >  void tick_nohz_full_kick_cpu(int cpu)
> > > >  {
> > > > -   if (!tick_nohz_full_cpu(cpu))
> > > > +   smp_rmb();
> > > 
> > > What is it ordering?
> > 
> > ll_isol_flags will be read in task_isolation_on_cpu(), and that access
> > should be ordered against writing in
> > task_isolation_kernel_enter(), fast_task_isolation_cpu_cleanup()
> > and task_isolation_start().
> > 
> > Since task_isolation_on_cpu() is often called for multiple CPUs in
> > a
> > sequence, it would be wasteful to include a barrier inside it.
> 
> Then I think you meant a full barrier: smp_mb()

For read-only operation? task_isolation_on_cpu() is the only place
where per-cpu ll_isol_flags is accessed, read-only, from multiple CPUs.
All other access to ll_isol_flags is done from the local CPU, and
writes are followed by smp_mb(). There are no other dependencies here,
except operations that depend on the value returned from
task_isolation_on_cpu().

If/when more flags will be added, those rules will be still followed,
because the intention is to store the state of isolation and phases of
entering/breaking/reporting it that can only be updated from the local
CPUs.
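
For reference, a minimal sketch of the access pattern described above
(simplified; ll_isol_flags, FLAG_LL_TASK_ISOLATION and
task_isolation_on_cpu() follow the series, while the bit value and the
write helper around them are purely illustrative):

#include <linux/bits.h>
#include <linux/percpu.h>
#include <linux/smp.h>

#define FLAG_LL_TASK_ISOLATION	BIT(0)	/* illustrative bit value */

DEFINE_PER_CPU(unsigned long, ll_isol_flags);

/* Remote, read-only check; callers issue a single smp_rmb() before
 * scanning a whole range of CPUs instead of paying a barrier per call.
 */
static inline bool task_isolation_on_cpu(int cpu)
{
	return per_cpu(ll_isol_flags, cpu) & FLAG_LL_TASK_ISOLATION;
}

/* Writes are local-only and always published with a full barrier. */
static void ll_isol_flags_set(unsigned long flags)	/* illustrative helper */
{
	this_cpu_write(ll_isol_flags, flags);
	smp_mb();	/* pairs with the callers' smp_rmb() above */
}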

> 
> > > > +   if (!tick_nohz_full_cpu(cpu) ||
> > > > task_isolation_on_cpu(cpu))
> > > > return;
> > > 
> > > You can't simply ignore an IPI. There is always a reason for a
> > > nohz_full CPU
> > > to be kicked. Something triggered a tick dependency. It can be
> > > posix
> > > cpu timers
> > > for example, or anything.

This was added some time ago, when timers appeared and CPUs were kicked
seemingly out of nowhere. At that point, breaking posix timers when
running tasks that are not supposed to rely on posix timers was the
least problematic solution. From the user's point of view, in this case
entering isolation had an effect on the timer similar to the task
exiting while the timer is running.

Right now, there are still sources of superfluous calls to this, when
tick_nohz_full_kick_all() is used. If I will be able to confirm that
this is the only problematic place, I would rather fix calls to it, and
make this condition produce a warning.

This gives me an idea: if there will be a mechanism specifically
for reporting kernel entry and isolation breaking, it should be
possible to add a distinction between the following (a possible
encoding is sketched below):

1. isolation breaking that already happened upon kernel entry;
2. performing operation that will immediately and synchronously cause
isolation breaking;
3. operations or conditions that will eventually or asynchronously
cause isolation breaking (having timers running, possibly sending
signals should be in the same category).

This will be (2).
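
Purely for illustration, the three categories could be encoded along
these lines (not an interface anyone has posted; all names here are
hypothetical):

/* Hypothetical classification of isolation-breaking events. */
enum isol_break_kind {
	ISOL_BREAK_ON_ENTRY,	/* (1) already happened upon kernel entry */
	ISOL_BREAK_SYNC,	/* (2) breaks isolation immediately and
				 *     synchronously, e.g. the kick above */
	ISOL_BREAK_ASYNC,	/* (3) will break isolation eventually:
				 *     armed timers, pending signals */
};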

I assume that when reporting of isolation breaking will be separated
from the isolation implementation, it will be implemented as a runtime
error condition reporting mechanism. Then it can be focused on
providing information about category of events and their sources, and
have internal logic designed for that purpose, as opposed to designed
entirely for debugging, providing flexibility and obtaining maximum
details about internals involved.

> > 
> > I realize that this is unusual, however the idea is that while the
> > task
> > is running in isolated mode in userspace, we assume that from this
> > CPUs
> > point of view whatever is happening in kernel, can wait until CPU
> > is
> > back in kernel and when it first enters kernel from this mode, it
> > should "catch up" with everything that happened in its absence.
> > task_isolation_kernel_enter() is supposed to do that, so by the
> > time
> > anything should be done involving the rest of the kernel, CPU is
> > back
> > to normal.
> 
> You can't assume that. If something needs the tick, this can't wait.
> If the user did something wrong, such as setting a posix cpu timer
> to an isolated task, that's his fault and the kernel has to stick
> with
> correctness and kick that task out of isolation mode.

That would be true if not for the multiple "let's just tell all other
CPUs that they should check if they have to update something"
situations like the one above.

In case of timers it's possible that I will be able to eliminate all
specific instances when this is done, however I think that as a general
approach we have to establish some distinction between things that must
cause IPI (and 

Re: [EXT] Re: [PATCH v4 03/13] task_isolation: userspace hard isolation from kernel

2020-10-06 Thread Alex Belits
On Mon, 2020-10-05 at 01:14 +0200, Frederic Weisbecker wrote:
> Speaking of which, I agree with Thomas that it's unnecessary.
> > > It's
> > > too much
> > > code and complexity. We can use the existing trace events and
> > > perform
> > > the
> > > analysis from userspace to find the source of the disturbance.
> > 
> > The idea behind this is that isolation breaking events are supposed
> > to
> > be known to the applications while applications run normally, and
> > they
> > should not require any analysis or human intervention to be
> > handled.
> 
> Sure but you can use trace events for that. Just trace interrupts,
> workqueues,
> timers, syscalls, exceptions and scheduler events and you get all the
> local
> disturbance. You might want to tune a few filters but that's pretty
> much it.

And keep all tracing enabled all the time, just to be able to figure
out that a disturbance happened at all?

Or do you mean that we can use kernel entry mechanism to reliably
determine that isolation breaking event happened (so the isolation-
breaking procedure can be triggered as early as possible), yet avoid
trying to determine why exactly it happened, and use tracing if we want
to know?

Original patch did the opposite, it triggered any isolation-breaking
procedure only once it was known specifically, what kind of event
happened -- a hardware interrupt, IPI, syscall, page fault, or any
other kind of exception, possibly something architecture-specific.
This, of course, always had a potential problem with coverage -- if
handling of something is missing, isolation breaking is not handled at
all, and there is no obvious way of finding if we covered everything.
This also made the patch large and somewhat ugly.

When I added a mechanism for low-level isolation-breaking handling
on kernel entry, it also partially improved the problem with
completeness. Only partially, because I have not yet added handling of
an "unknown cause" before returning to userspace; however, that would
be a logical thing to do. Then, if we entered the kernel from
isolation, did something, and are returning to userspace still not
knowing what kind of isolation-breaking event happened, we can still
trigger isolation breaking (see the sketch below).
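
Something along these lines is what I have in mind for the return path
(a sketch only; the per-CPU flags and the helper name are made up for
illustration):

#include <linux/percpu.h>
#include <linux/printk.h>
#include <linux/smp.h>

static DEFINE_PER_CPU(bool, ll_isol_break_pending);	/* hypothetical */
static DEFINE_PER_CPU(bool, ll_isol_break_cause_known);	/* hypothetical */

/* Called late on the exit-to-userspace path of a formerly isolated task. */
static void task_isolation_check_unknown_break(void)	/* hypothetical */
{
	/*
	 * If kernel entry flagged a break but no handler claimed a
	 * cause, still report the break instead of silently returning
	 * to userspace in a supposedly isolated state.
	 */
	if (this_cpu_read(ll_isol_break_pending) &&
	    !this_cpu_read(ll_isol_break_cause_known))
		pr_warn("task_isolation: unknown isolation-breaking event on CPU %d\n",
			smp_processor_id());
}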

Did I get it right, and you mean that we can remove all specific
handling of isolation breaking causes, except for syscall that exits
isolation, and report isolation breaking instead of normally returning
to userspace? Then isolation breaking will be handled reliably without
knowing the cause, and we can leave determining the cause to the
tracing mechanism (if enabled)?

This does make sense. However, to me it looks somewhat strange, because
I assume isolation breaking to be a kind of runtime error that
userspace software is supposed to get some basic information about --
like signals distinguishing between, say, SIGSEGV and SIGPIPE, or
write() being able to set errno to ENOSPC or EIO. Then userspace
receives basic information about the cause of the exception or error, and
can do some meaningful reporting, or decide if the error should be
fatal for the application or handled differently, based on its internal
logic. To get those distinctions, application does not have to be aware
of anything internal to the kernel.

Similarly distinguishing between, say, a page fault, device interrupt
and a timer may be important for a logic implemented in userspace, and
I think, it may be nice to allow userspace to get this information
immediately and without being aware of any additional details of kernel
implementation. The current patch doesn't do this yet, however the
intention is to implement reliable isolation breaking by checking on
userspace re-entry, plus make reporting of causes, if any were found,
visible to the userspace in some convenient way.

The part that determines the cause can be implemented separately from
isolation breaking mechanism. Then we can have isolation breaking on
kernel entry (or potentially some other condition on kernel entry that
requires logging the cause) enable reporting, then reporting mechanism,
if it exists will fill the blanks, and once either cause is known, or
it's time to return to userspace, notification will be done with
whatever information is available. For some in-depth analysis, if
necessary for debugging the kernel, we can have tracing check if we are
in this "suspicious kernel entry" mode, and log things that otherwise
would not be.

> As for the source of the disturbances, if you really need that
> information,
> you can trace the workqueue and timer queue events and just filter
> those that
> target your isolated CPUs.

For the purpose of a human debugging the kernel or an application, more
information is (usually) better, so the only concern here is that the
user is now responsible for the completeness of the things he is
tracing. However, from the application's point of view, or for logging
in a production environment, it's usually more important to get the
general type of events, so it's possible to, say, confirm that no

Re: [EXT] Re: [PATCH v4 11/13] task_isolation: net: don't flush backlog on CPUs running isolated tasks

2020-10-04 Thread Alex Belits

On Thu, 2020-10-01 at 16:47 +0200, Frederic Weisbecker wrote:
> On Wed, Jul 22, 2020 at 02:58:24PM +0000, Alex Belits wrote:
> > From: Yuri Norov 
> > 
> > If CPU runs isolated task, there's no any backlog on it, and
> > so we don't need to flush it.
> 
> What guarantees that we have no backlog on it?

I believe the logic was that the CPU is not supposed to have a backlog
because one could not be produced while the CPU was in userspace: one
has to enter the kernel to receive (by interrupt) or send (by syscall)
anything.

Now, looking at this patch, I don't think it can be guaranteed that
there was no backlog before the CPU entered userspace. In that case
backlog processing will be delayed until exit from isolation. The work
won't be queued, and flush_work() will not wait when no worker is
assigned, so there won't be a deadlock; however, this delay may not be
such a great idea.

So it may be better to flush the backlog before entering isolation, and
in flush_all_backlogs(), instead of skipping all CPUs in isolated mode,
check whether their per-CPU softnet_data->input_pkt_queue and
softnet_data->process_queue are empty, and if they are not, flush the
backlog anyway (see the sketch below). Then, if for whatever reason a
backlog appears after flushing (we can't guarantee that nothing
preempted us), it will cause one isolation-breaking event, and if
nothing is queued before re-entering isolation, there will be no
backlog until exiting isolation.
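
Roughly, the loop in flush_all_backlogs() would become something like
the following (untested sketch; task_isolation_on_cpu() is from this
series, the rest relies on the existing softnet_data fields and
skb_queue_empty_lockless()):

	for_each_online_cpu(cpu) {
		struct softnet_data *sd = &per_cpu(softnet_data, cpu);

		/*
		 * Skip an isolated CPU only when both of its backlog
		 * queues are already empty; otherwise queue the flush
		 * work anyway and accept the isolation-breaking event.
		 */
		if (task_isolation_on_cpu(cpu) &&
		    skb_queue_empty_lockless(&sd->input_pkt_queue) &&
		    skb_queue_empty_lockless(&sd->process_queue))
			continue;

		queue_work_on(cpu, system_highpri_wq,
			      per_cpu_ptr(&flush_works, cpu));
	}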

> 
> > Currently flush_all_backlogs()
> > enqueues corresponding work on all CPUs including ones that run
> > isolated tasks. It leads to breaking task isolation for nothing.
> > 
> > In this patch, backlog flushing is enqueued only on non-isolated
> > CPUs.
> > 
> > Signed-off-by: Yuri Norov 
> > [abel...@marvell.com: use safe task_isolation_on_cpu()
> > implementation]
> > Signed-off-by: Alex Belits 
> > ---
> >  net/core/dev.c | 7 ++-
> >  1 file changed, 6 insertions(+), 1 deletion(-)
> > 
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 90b59fc50dc9..83a282f7453d 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -74,6 +74,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include 
> >  #include 
> >  #include 
> > @@ -5624,9 +5625,13 @@ static void flush_all_backlogs(void)
> >  
> > get_online_cpus();
> >  
> > -   for_each_online_cpu(cpu)
> > +   smp_rmb();
> 
> What is it ordering?

Same as with other calls to task_isolation_on_cpu(cpu), it orders
access to ll_isol_flags.

> > +   for_each_online_cpu(cpu) {
> > +   if (task_isolation_on_cpu(cpu))
> > +   continue;
> > queue_work_on(cpu, system_highpri_wq,
> >   per_cpu_ptr(&flush_works, cpu));
> > +   }
> >  
> > for_each_online_cpu(cpu)
> > flush_work(per_cpu_ptr(&flush_works, cpu));
> 
> Thanks.



Re: [EXT] Re: [PATCH v4 10/13] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu()

2020-10-04 Thread Alex Belits

On Thu, 2020-10-01 at 16:44 +0200, Frederic Weisbecker wrote:
> On Wed, Jul 22, 2020 at 02:57:33PM +0000, Alex Belits wrote:
> > From: Yuri Norov 
> > 
> > For nohz_full CPUs the desirable behavior is to receive interrupts
> > generated by tick_nohz_full_kick_cpu(). But for hard isolation it's
> > obviously not desirable because it breaks isolation.
> > 
> > This patch adds check for it.
> > 
> > Signed-off-by: Yuri Norov 
> > [abel...@marvell.com: updated, only exclude CPUs running isolated
> > tasks]
> > Signed-off-by: Alex Belits 
> > ---
> >  kernel/time/tick-sched.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> > 
> > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > index 6e4cd8459f05..2f82a6daf8fc 100644
> > --- a/kernel/time/tick-sched.c
> > +++ b/kernel/time/tick-sched.c
> > @@ -20,6 +20,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include 
> >  #include 
> >  #include 
> > @@ -268,7 +269,8 @@ static void tick_nohz_full_kick(void)
> >   */
> >  void tick_nohz_full_kick_cpu(int cpu)
> >  {
> > -   if (!tick_nohz_full_cpu(cpu))
> > +   smp_rmb();
> 
> What is it ordering?

ll_isol_flags will be read in task_isolation_on_cpu(); that access
should be ordered against the writes in
task_isolation_kernel_enter(), fast_task_isolation_cpu_cleanup()
and task_isolation_start().

Since task_isolation_on_cpu() is often called for multiple CPUs in a
sequence, it would be wasteful to include a barrier inside it.

> > +   if (!tick_nohz_full_cpu(cpu) || task_isolation_on_cpu(cpu))
> > return;
> 
> You can't simply ignore an IPI. There is always a reason for a
> nohz_full CPU
> to be kicked. Something triggered a tick dependency. It can be posix
> cpu timers
> for example, or anything.

I realize that this is unusual, however the idea is that while the task
is running in isolated mode in userspace, we assume that, from this
CPU's point of view, whatever is happening in the kernel can wait until
the CPU is back in the kernel, and when it first enters the kernel from
this mode, it should "catch up" with everything that happened in its
absence. task_isolation_kernel_enter() is supposed to do that, so by
the time anything should be done involving the rest of the kernel, the
CPU is back to normal.

It is the application's responsibility to avoid triggering things that
break its isolation, so the application assumes that everything that
involves entering the kernel will not be available while it is
isolated. If isolation is broken, or the application requests a return
from isolation, everything goes back to the normal environment with all
functionality available.

> >  
> > irq_work_queue_on(&per_cpu(nohz_full_kick_work, cpu), cpu);
> > -- 
> > 2.26.2
> > 



Re: [EXT] Re: [PATCH v4 03/13] task_isolation: userspace hard isolation from kernel

2020-10-04 Thread Alex Belits

On Thu, 2020-10-01 at 16:40 +0200, Frederic Weisbecker wrote:
> On Wed, Jul 22, 2020 at 02:49:49PM +0000, Alex Belits wrote:
> > +/**
> > + * task_isolation_kernel_enter() - clear low-level task isolation
> > flag
> > + *
> > + * This should be called immediately after entering kernel.
> > + */
> > +static inline void task_isolation_kernel_enter(void)
> > +{
> > +   unsigned long flags;
> > +
> > +   /*
> > +* This function runs on a CPU that ran isolated task.
> > +*
> > +* We don't want this CPU running code from the rest of kernel
> > +* until other CPUs know that it is no longer isolated.
> > +* When CPU is running isolated task until this point anything
> > +* that causes an interrupt on this CPU must end up calling
> > this
> > +* before touching the rest of kernel. That is, this function
> > or
> > +* fast_task_isolation_cpu_cleanup() or stop_isolation()
> > calling
> > +* it. If any interrupt, including scheduling timer, arrives,
> > it
> > +* will still end up here early after entering kernel.
> > +* From this point interrupts are disabled until all CPUs will
> > see
> > +* that this CPU is no longer running isolated task.
> > +*
> > +* See also fast_task_isolation_cpu_cleanup().
> > +*/
> > +   smp_rmb();
> 
> I'm a bit confused what this read memory barrier is ordering. Also
> against
> what it pairs.

My bad, I have kept it even after no write accesses from other CPUs
were left.

> 
> > +   if((this_cpu_read(ll_isol_flags) & FLAG_LL_TASK_ISOLATION) ==
> > 0)
> > +   return;
> > +
> > +   local_irq_save(flags);
> > +
> > +   /* Clear low-level flags */
> > +   this_cpu_write(ll_isol_flags, 0);
> > +
> > +   /*
> > +* If something happened that requires a barrier that would
> > +* otherwise be called from remote CPUs by CPU kick procedure,
> > +* this barrier runs instead of it. After this barrier, CPU
> > +* kick procedure would see the updated ll_isol_flags, so it
> > +* will run its own IPI to trigger a barrier.
> > +*/
> > +   smp_mb();
> > +   /*
> > +* Synchronize instructions -- this CPU was not kicked while
> > +* in isolated mode, so it might require synchronization.
> > +* There might be an IPI if kick procedure happened and
> > +* ll_isol_flags was already updated while it assembled a CPU
> > +* mask. However if this did not happen, synchronize everything
> > +* here.
> > +*/
> > +   instr_sync();
> 
> It's the first time I meet an instruction barrier. I should get
> information
> about that but what is it ordering here?

Against barriers in instruction cache flushing (flush_icache_range()
and such). 

> > +   local_irq_restore(flags);
> > +}
> 
> Thanks.



Re: [EXT] Re: [PATCH v4 03/13] task_isolation: userspace hard isolation from kernel

2020-10-04 Thread Alex Belits
On Thu, 2020-10-01 at 15:56 +0200, Frederic Weisbecker wrote:
> On Wed, Jul 22, 2020 at 02:49:49PM +0000, Alex Belits wrote:
> > +/*
> > + * Description of the last two tasks that ran isolated on a given
> > CPU.
> > + * This is intended only for messages about isolation breaking. We
> > + * don't want any references to actual task while accessing this
> > from
> > + * CPU that caused isolation breaking -- we know nothing about
> > timing
> > + * and don't want to use locking or RCU.
> > + */
> > +struct isol_task_desc {
> > +   atomic_t curr_index;
> > +   atomic_t curr_index_wr;
> > +   boolwarned[2];
> > +   pid_t   pid[2];
> > +   pid_t   tgid[2];
> > +   charcomm[2][TASK_COMM_LEN];
> > +};
> > +static DEFINE_PER_CPU(struct isol_task_desc, isol_task_descs);
> 
> So that's quite a huge patch that would have needed to be split up.
> Especially this tracing engine.
> 
> Speaking of which, I agree with Thomas that it's unnecessary. It's
> too much
> code and complexity. We can use the existing trace events and perform
> the
> analysis from userspace to find the source of the disturbance.

The idea behind this is that isolation breaking events are supposed to
be known to the applications while applications run normally, and they
should not require any analysis or human intervention to be handled.

A process may exit isolation because some leftover delayed work, for
example a timer or a workqueue item, is still present on a CPU, or
because a page fault or some other exception, normally handled
silently, is caused by the task. It is also possible to direct an
interrupt to a CPU that is running an isolated task -- currently it's
perfectly valid to set interrupt SMP affinity to a CPU running an
isolated task, and then the interrupt will break isolation. While it's
probably not the best way of handling interrupts, I would rather not
prohibit this explicitly.

There is also the matter of avoiding race conditions on entering
isolation. Once a CPU has entered isolation, other CPUs should avoid
disturbing it when they know that the CPU is running a task in isolated
mode. However, for a short time after entering isolation other CPUs may
be unaware of this, and will still send IPIs to it. Preventing this
scenario completely would be very costly in terms of what other CPUs
would have to do before notifying others, so, similar to how EINTR
works, we can simply specify that this is allowed, and the task is
supposed to re-enter isolation after it (a userspace sketch follows
below). It's still a bad idea to specify that isolation breaking can
keep happening while the application is running in isolated mode,
however allowing some "grace period" after entering is acceptable as
long as the application is aware of it happening.
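
From userspace this would look like an EINTR-style retry loop, roughly
(a sketch assuming the prctl() interface of this series; the exact flag
names and errno convention may differ):

	#include <sys/prctl.h>
	#include <errno.h>

	/* PR_SET_TASK_ISOLATION and PR_TASK_ISOLATION_ENABLE are assumed
	 * to come from the series' uapi header.
	 */
	static int enter_isolation(void)
	{
		for (;;) {
			if (prctl(PR_SET_TASK_ISOLATION,
				  PR_TASK_ISOLATION_ENABLE, 0, 0, 0) == 0)
				return 0;	/* isolated; only now start the fast path */
			if (errno != EAGAIN)	/* assumption: EAGAIN means "retry" */
				return -1;
		}
	}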

In libtmc I have moved this handling of isolation breaking into a
separate thread, intended to become a separate daemon if necessary. In
part it was done because initial implementation of isolation made it
very difficult to avoid repeating delayed work on isolated CPUs, so
something had to watch for it from non-isolated CPU. It's possible that
now, when delayed work does not appear on isolated CPUs out of nowhere,
the need in isolation manager thread will disappear, and task itself
will be able to handle all isolation breaking, like original
implementation by Chris was supposed to.

However in either case it's still useful for the task, or isolation
manager, to get a description of the isolation-breaking event. This is
what those things are intended for. Now they only produce log messages
because this is where initially all description of isolation-breaking
events went, however I would prefer to make logging optional but always
let applications read those event descriptions, regardless of any
tracing mechanism being used. I was more focused on making the
reporting mechanism properly detect the cause of isolation breaking
because that functionality was not quite working in earlier work by
Chris and Yuri, so I have kept logging as the only output, but made it
suitable for producing events that applications will be able to
receive. Application, or isolation manager, will receive clear and
unambiguous reporting, so there will be no need for any additional
analysis or guesswork.

After adding a proper "low-level" isolation flags, I got the idea that
we might have a better yet reporting mechanism. Early isolation
breaking detection on kernel entry may set a flag that says that
isolation breaking happened, however its cause is unknown. Or, more
likely, only some general information about isolation breaking is
available, like a type of exception. Then, once a known isolation-
breaking reporting mechanism is called from interrupt, syscall

Re: [EXT] Re: [PATCH v4 00/13] "Task_isolation" mode

2020-07-23 Thread Alex Belits

On Thu, 2020-07-23 at 23:44 +0200, Thomas Gleixner wrote:
> Alex Belits  writes:
> > On Thu, 2020-07-23 at 17:49 +0200, Peter Zijlstra wrote:
> > > 'What does noinstr mean? and why do we have it" -- don't dare
> > > touch
> > > the
> > > entry code until you can answer that.
> > 
> > noinstr disables instrumentation, so there would not be calls and
> > dependencies on other parts of the kernel when it's not yet safe to
> > call them. Relevant functions already have it, and I add an inline
> > call
> > to perform flags update and synchronization. Unless something else
> > is
> > involved, those operations are safe, so I am not adding anything
> > that
> > can break those.
> 
> Sure.
> 
>  1) That inline function can be put out of line by the compiler and
> placed into the regular text section which makes it subject to
> instrumentation
> 
>  2) That inline function invokes local_irq_save() which is subject to
> instrumentation _before_ the entry state for the instrumentation
> mechanisms is established.
> 
>  3) That inline function invokes sync_core() before important state
> has
> been established, which is especially interesting in NMI like
> exceptions.
> 
> As you clearly documented why all of the above is safe and does not
> cause any problems, it's just me and Peter being silly, right?
> 
> Try again.

I don't think accusations and mockery are really necessary here.

I am trying to do the right thing here. In particular, I am trying to
port the code that was developed on platforms that have not yet
implemented those useful instrumentation safety features of x86 arch
support. For most of the development time I had to figure out where
the synchronization can be safely inserted into kernel entry code on
three platforms and tens of interrupt controller drivers, with some of
those presenting unusual exceptions (forgive me the pun) to platform-
wide conventions. I really appreciate the work you did cleaning up
kernel entry procedures, my 5.6 version of this patch had to follow a
much more complex and I would say, convoluted entry handling on x86,
and now I don't have to do that, thanks to you.

Unfortunately, most of my mental effort recently had to be spent on
three things:

1. (small): finding a way to safely enable events and synchronize state
on kernel entry, so it will not have a race condition between
isolation-breaking kernel entry and an event that was disabled while
the task was isolated.

2. (big): trying to derive any useful rules applicable to kernel entry
in various architectures, finding that there is very little consistency
across architectures, and whatever exists, can be broken by interrupt
controller drivers that don't all follow the same rules as the rest of
the platform.

3. (medium): introducing calls to synchronization on all kernel entry
procedures, in places where it is guaranteed to not normally yet have
done any calls to parts of the kernel that may be affected by "stale"
state, and do it in a manner as consistent and generalized as possible.

The current state of kernel entry handling on arm and arm64
architectures has significant differences from x86 and from each other.
There is also a matter of interrupt controllers. As can be seen in
interrupt controller-specific patch, I had to accommodate some variety
of custom interrupt entry code. What can not be seen, is that I had to
check that all other interrupt controller drivers and architecture-
specific entry procedures, and find that they _do_ follow some
understandable rules -- unfortunately architecture-specific and not
documented in any manner.

I have no valid reasons for complaining about it. I could not expect
that authors of all kernel entry procedures would have any
foreknowledge that someone at some point may have a reason to establish
any kind of synchronization point for CPU cores. And this is why I had
to do my research by manually drawing call trees and sequences,
separately for every entry on every supported architecture, and across
two or three versions of kernel, as those were changing along the way.

The result of this may be not a "design" per se, but an understanding
of how things are implemented, and what rules are being followed, so I
could add my code in a manner consistent with what is done, and
document the whole thing. Then there will be some written rules to
check for, when anything of this kind will be necessary again (say,
with TLB, but considering how much now is done in userspace, possibly
to accommodate more exotic CPU features that may have state messed up
by userspace). I am afraid, this task, kernel entry documentation,
would take me some

Re: [PATCH v4 00/13] "Task_isolation" mode

2020-07-23 Thread Alex Belits

On Thu, 2020-07-23 at 17:49 +0200, Peter Zijlstra wrote:
> 
> 'What does noinstr mean? and why do we have it" -- don't dare touch
> the
> entry code until you can answer that.

noinstr disables instrumentation, so there would not be calls and
dependencies on other parts of the kernel when it's not yet safe to
call them. Relevant functions already have it, and I add an inline call
to perform flags update and synchronization. Unless something else is
involved, those operations are safe, so I am not adding anything that
can break those.

-- 
Alex


Re: [EXT] Re: [PATCH v4 00/13] "Task_isolation" mode

2020-07-23 Thread Alex Belits

On Thu, 2020-07-23 at 17:48 +0200, Peter Zijlstra wrote:
> On Thu, Jul 23, 2020 at 03:41:46PM +0000, Alex Belits wrote:
> > On Thu, 2020-07-23 at 16:29 +0200, Peter Zijlstra wrote:
> > > .
> > > 
> > > This.. as presented it is an absolutely unreviewable pile of
> > > junk. It
> > > presents code witout any coherent problem description and
> > > analysis.
> > > And
> > > the patches are not split sanely either.
> > 
> > There is a more complete and slightly outdated description in the
> > previous version of the patch at 
> > https://lore.kernel.org/lkml/07c25c246c55012981ec0296eee23e68c719333a.camel@marvell.com/
> 
> Not the point, you're mixing far too many things in one go. You also
> have the patches split like 'generic / arch-1 / arch-2' which is
> wrong
> per definition, as patches should be split per change and not care
> about
> sily boundaries.

This follows the original patch by Chris Metcalf. There is a reason for
that -- per-architecture changes are independent from each other and
affect not just code but functionality that was implemented per-
architecture. To support more architectures, it will be necessary to do
it separately for each, and mark them supported with
HAVE_ARCH_TASK_ISOLATION. Having only some architectures supported does
not break anything for the rest -- architectures that are not covered,
would not have this functionality.

> 
> Also, if you want generic entry code, there's patches for that here:
> 
>   https://lkml.kernel.org/r/20200722215954.464281930@linutronix.de
> 
> 

That looks useful. Why didn't Thomas Gleixner mention it in his
criticism of my approach if he already solved that exact problem, at
least for x86?

-- 
Alex


Re: [EXT] Re: [PATCH v4 00/13] "Task_isolation" mode

2020-07-23 Thread Alex Belits

On Thu, 2020-07-23 at 16:29 +0200, Peter Zijlstra wrote:
> .
> 
> This.. as presented it is an absolutely unreviewable pile of junk. It
> presents code witout any coherent problem description and analysis.
> And
> the patches are not split sanely either.

There is a more complete and slightly outdated description in the
previous version of the patch at 
https://lore.kernel.org/lkml/07c25c246c55012981ec0296eee23e68c719333a.ca...@marvell.com/
 .

It allows a userspace application to take a CPU core for itself and run
completely isolated, with no disturbances. There is work in progress
that also disables and re-enables TLB flushes, and depending on the CPU
it may be possible to also pre-allocate cache, so it would not be
affected by the rest of the system. Events that cause interaction with
the isolated task break isolation, turning the task into a regular
userspace task that can continue running normally and enter the
isolated state again if necessary.

To make this feature suitable for any practical use, many mechanisms
that would normally cause events on a CPU should exclude CPU cores in
this state, and synchronization should happen later, at the time of
isolation breaking.

Three architectures are supported (x86, arm and arm64), and it should
be possible to extend this to others. Unfortunately, kernel entry
procedures are neither unified nor straightforward, so introducing a
new feature into them inevitably looks like a mess.

-- 
Alex


Re: [PATCH v4 00/13] "Task_isolation" mode

2020-07-23 Thread Alex Belits
On Thu, 2020-07-23 at 15:17 +0200, Thomas Gleixner wrote:
> 
> Without going into details of the individual patches, let me give you a
> high level view of this series:
> 
>   1) Entry code handling:
> 
>  That's completely broken vs. the careful ordering and instrumentation
>  protection of the entry code. You can't just slap stuff randomly
>  into places which you think are safe w/o actually trying to understand
>  why this code is ordered in the way it is.
> 
>  This clearly was never built and tested with any of the relevant
>  debug options enabled. Both build and boot would have told you.

This is intended to avoid a race condition when entry to or exit from isolation
happens at the same time as an event that requires synchronization. The idea
is that it is possible to insulate the core from all events while it is running
an isolated task in userspace; it will receive those calls normally after
breaking isolation and entering the kernel, and it will synchronize itself on
kernel entry.

This has two potential problems that I am trying to solve:

1. Without careful ordering, there will be a race condition with events that
happen at the same time as kernel entry or exit.

2. The CPU runs some kernel code after entering but before synchronization. This
code should be restricted to early entry code that is not affected by the "stale"
state, similar to how the IPI code that receives synchronization events normally
does it.

I can't say that I am completely happy with the amount of kernel entry
handling that had to be added. The problem is, I am trying to introduce a
feature that allows CPU cores to go into a "de-synchronized" state while running
isolated tasks, not receiving synchronization events that normally would
reach them. This means that some point on kernel entry has to be established at
which it is safe for the core to catch up with the rest of the kernel. It may be
useful for other purposes, however at this point task isolation is the first
to need it, so I had to determine where such a point is for every supported
architecture and method of kernel entry.

I have found that each architecture has its own way of handling this,
and sometimes individual interrupt controller drivers vary in their
sequence of calls on early kernel entry. For x86 I also have an
implementation for kernel 5.6, before your changes to IDT macros.
That version is much less straightforward, so I am grateful for those
relatively recent improvements.

Nevertheless, I believe that the goal of finding those points and using
them for synchronization is valid. If you can recommend me a better way
for at least x86, I will be happy to follow your advice. I have tried to
cover kernel entry in a generic way while making the changes least
disruptive, and this is why it looks simple and spread over multiple
places. I also had to do the same for arm and arm64 (that I use for
development), and for each architecture I had to produce sequences of
entry points and function calls to determine the correct placement of
task_isolation_enter() calls in them. It is not random, however it does
reflect the complex nature of kernel entry code. I believe the RCU
implementation faced somewhat similar requirements for calls on kernel
entry, however it is not completely unified either.

>  2) Instruction synchronization
> Trying to do instruction synchronization delayed is a clear recipe
> for hard to diagnose failures. Just because it blew not up in your
> face does not make it correct in any way. It's broken by design and
> violates _all_ rules of safe instruction patching and introduces a
> complete trainwreck in x86 NMI processing.

The idea is that, just as synchronization events are handled by a regular IPI,
we already have some code that runs with the assumption that it is safe to enter
it in a "stale" state before synchronization. I have extended this to allow
synchronization points on all kernel entry points.

> If you really think that this is correct, then please have at least
> the courtesy to come up with a detailed and precise argumentation
> why this is a valid approach.
>
> While writing that up you surely will find out why it is not.
> 

I had to document a sequence of calls for every entry point on three supported
architectures, to determine the points for synchronization. It is possible that
I have somehow missed something, however I don't see a better approach, save
for establishing a kernel-wide infrastructure for this. And even if we did just
that, it would be possible to implement this kind of synchronization point
calls first, and convert them to something more generic later.

> 
>   3) Debug calls
> 
>  Sprinkling debug calls around the codebase randomly is not going to
>  happen. That's an unmaintainable mess.

Those report isolation breaking causes, and are intended for application and
system debugging.

> 
>  Aside of that none of these dmesg based debug things is necessary.
>  This can simply be monito

[PATCH 13/13] task_isolation: kick_all_cpus_sync: don't kick isolated cpus

2020-07-22 Thread Alex Belits
From: Yuri Norov 

Make sure that kick_all_cpus_sync() does not call CPUs that are running
isolated tasks.

Signed-off-by: Yuri Norov 
[abel...@marvell.com: use safe task_isolation_cpumask() implementation]
Signed-off-by: Alex Belits 
---
 kernel/smp.c | 14 +-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index 6a6849783948..ff0d95db33b3 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -803,9 +803,21 @@ static void do_nothing(void *unused)
  */
 void kick_all_cpus_sync(void)
 {
+   struct cpumask mask;
+
/* Make sure the change is visible before we kick the cpus */
smp_mb();
-   smp_call_function(do_nothing, NULL, 1);
+
+   preempt_disable();
+#ifdef CONFIG_TASK_ISOLATION
+   cpumask_clear(&mask);
+   task_isolation_cpumask(&mask);
+   cpumask_complement(&mask, &mask);
+#else
+   cpumask_setall(&mask);
+#endif
+   smp_call_function_many(&mask, do_nothing, NULL, 1);
+   preempt_enable();
 }
 EXPORT_SYMBOL_GPL(kick_all_cpus_sync);
 
-- 
2.26.2



[PATCH v4 12/13] task_isolation: ringbuffer: don't interrupt CPUs running isolated tasks on buffer resize

2020-07-22 Thread Alex Belits
From: Yuri Norov 

CPUs running isolated tasks are in userspace, so they don't have to
perform ring buffer updates immediately. If ring_buffer_resize()
schedules the update on those CPUs, isolation is broken. To prevent
that, updates for CPUs running isolated tasks are performed locally,
like for offline CPUs.

A race condition between this update and isolation breaking is avoided
at the cost of disabling per_cpu buffer writing for the time of update
when it coincides with isolation breaking.

Signed-off-by: Yuri Norov 
[abel...@marvell.com: updated to prevent race with isolation breaking]
Signed-off-by: Alex Belits 
---
 kernel/trace/ring_buffer.c | 63 ++
 1 file changed, 57 insertions(+), 6 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 00867ff82412..22d4731f0def 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1705,6 +1706,38 @@ static void update_pages_handler(struct work_struct 
*work)
complete(&cpu_buffer->update_done);
 }
 
+static bool update_if_isolated(struct ring_buffer_per_cpu *cpu_buffer,
+  int cpu)
+{
+   bool rv = false;
+
+   smp_rmb();
+   if (task_isolation_on_cpu(cpu)) {
+   /*
+* CPU is running isolated task. Since it may lose
+* isolation and re-enter kernel simultaneously with
+* this update, disable recording until it's done.
+*/
+   atomic_inc(&cpu_buffer->record_disabled);
+   /* Make sure, update is done, and isolation state is current */
+   smp_mb();
+   if (task_isolation_on_cpu(cpu)) {
+   /*
+* If CPU is still running isolated task, we
+* can be sure that breaking isolation will
+* happen while recording is disabled, and CPU
+* will not touch this buffer until the update
+* is done.
+*/
+   rb_update_pages(cpu_buffer);
+   cpu_buffer->nr_pages_to_update = 0;
+   rv = true;
+   }
+   atomic_dec(&cpu_buffer->record_disabled);
+   }
+   return rv;
+}
+
 /**
  * ring_buffer_resize - resize the ring buffer
  * @buffer: the buffer to resize.
@@ -1794,13 +1827,22 @@ int ring_buffer_resize(struct trace_buffer *buffer, 
unsigned long size,
if (!cpu_buffer->nr_pages_to_update)
continue;
 
-   /* Can't run something on an offline CPU. */
+   /*
+* Can't run something on an offline CPU.
+*
+* CPUs running isolated tasks don't have to
+* update ring buffers until they exit
+* isolation because they are in
+* userspace. Use the procedure that prevents
+* race condition with isolation breaking.
+*/
if (!cpu_online(cpu)) {
rb_update_pages(cpu_buffer);
cpu_buffer->nr_pages_to_update = 0;
} else {
-   schedule_work_on(cpu,
-   &cpu_buffer->update_pages_work);
+   if (!update_if_isolated(cpu_buffer, cpu))
+   schedule_work_on(cpu,
+   &cpu_buffer->update_pages_work);
}
}
 
@@ -1849,13 +1891,22 @@ int ring_buffer_resize(struct trace_buffer *buffer, 
unsigned long size,
 
get_online_cpus();
 
-   /* Can't run something on an offline CPU. */
+   /*
+* Can't run something on an offline CPU.
+*
+* CPUs running isolated tasks don't have to update
+* ring buffers until they exit isolation because they
+* are in userspace. Use the procedure that prevents
+* race condition with isolation breaking.
+*/
if (!cpu_online(cpu_id))
rb_update_pages(cpu_buffer);
else {
-   schedule_work_on(cpu_id,
+   if (!update_if_isolated(cpu_buffer, cpu_id))
+   schedule_work_on(cpu_id,
 &cpu_buffer->update_pages_work);
-   wait_for_completion(&cpu_buf

[PATCH v4 11/13] task_isolation: net: don't flush backlog on CPUs running isolated tasks

2020-07-22 Thread Alex Belits
From: Yuri Norov 

If CPU runs isolated task, there's no any backlog on it, and
so we don't need to flush it. Currently flush_all_backlogs()
enqueues corresponding work on all CPUs including ones that run
isolated tasks. It leads to breaking task isolation for nothing.

In this patch, backlog flushing is enqueued only on non-isolated CPUs.

Signed-off-by: Yuri Norov 
[abel...@marvell.com: use safe task_isolation_on_cpu() implementation]
Signed-off-by: Alex Belits 
---
 net/core/dev.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 90b59fc50dc9..83a282f7453d 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -74,6 +74,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -5624,9 +5625,13 @@ static void flush_all_backlogs(void)
 
get_online_cpus();
 
-   for_each_online_cpu(cpu)
+   smp_rmb();
+   for_each_online_cpu(cpu) {
+   if (task_isolation_on_cpu(cpu))
+   continue;
queue_work_on(cpu, system_highpri_wq,
  per_cpu_ptr(&flush_works, cpu));
+   }
 
for_each_online_cpu(cpu)
flush_work(per_cpu_ptr(&flush_works, cpu));
-- 
2.26.2



[PATCH v4 09/13] task_isolation: arch/arm: enable task isolation functionality

2020-07-22 Thread Alex Belits
From: Francis Giraldeau 

This patch is a port of the task isolation functionality to the arm 32-bit
architecture. Task isolation needs an additional thread flag, which
requires changing the entry assembly code to accept a bitfield larger than
one byte. The constants _TIF_SYSCALL_WORK and _TIF_WORK_MASK are now
defined in the literal pool. The rest of the patch is straightforward and
reflects what is done on other architectures.

To avoid problems with the tst instruction in the v7m build, we renumber
TIF_SECCOMP to bit 8 and let TIF_TASK_ISOLATION use bit 7.

Early kernel entry relies on task_isolation_kernel_enter().

vector_swi to label __sys_trace
  -> syscall_trace_enter() when task isolation is enabled,
  -> task_isolation_kernel_enter()

nvic_handle_irq()
  -> handle_IRQ() -> __handle_domain_irq() -> task_isolation_kernel_enter()

__fiq_svc, __fiq_abt __fiq_usr
  -> handle_fiq_as_nmi() -> uses nmi_enter() / nmi_exit()

__irq_svc -> irq_handler
__irq_usr -> irq_handler
  irq_handler
-> (handle_arch_irq or
(arch_irq_handler_default -> (asm_do_IRQ() -> __handle_domain_irq())
  or do_IPI() -> handle_IPI())
  asm_do_IRQ()
-> __handle_domain_irq() -> task_isolation_kernel_enter()
  do_IPI()
-> handle_IPI() -> task_isolation_kernel_enter()

handle_arch_irq for arm-specific controllers calls
  (handle_IRQ() -> __handle_domain_irq() -> task_isolation_kernel_enter())
or (handle_domain_irq() -> __handle_domain_irq()
  -> task_isolation_kernel_enter())

Not covered:
__dabt_svc -> dabt_helper
__dabt_usr -> dabt_helper
  dabt_helper -> CPU_DABORT_HANDLER (cpu-specific)
-> do_DataAbort or PROCESSOR_DABT_FUNC
-> _data_abort (cpu-specific) -> do_DataAbort

__pabt_svc -> pabt_helper
__pabt_usr -> pabt_helper
  pabt_helper -> CPU_PABORT_HANDLER (cpu-specific)
-> do_PrefetchAbort or PROCESSOR_PABT_FUNC
-> _prefetch_abort (cpu-specific) -> do_PrefetchAbort

Signed-off-by: Francis Giraldeau 
Signed-off-by: Chris Metcalf  [with modifications]
[abel...@marvell.com: modified for kernel 5.6, added isolation cleanup]
Signed-off-by: Alex Belits 
---
 arch/arm/Kconfig   |  1 +
 arch/arm/include/asm/barrier.h |  2 ++
 arch/arm/include/asm/thread_info.h | 10 +++---
 arch/arm/kernel/entry-common.S | 15 ++-
 arch/arm/kernel/ptrace.c   | 12 
 arch/arm/kernel/signal.c   | 13 -
 arch/arm/kernel/smp.c  |  6 ++
 arch/arm/mm/fault.c|  8 +++-
 8 files changed, 57 insertions(+), 10 deletions(-)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 2ac74904a3ce..f06d0e0e4fe9 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -67,6 +67,7 @@ config ARM
select HAVE_ARCH_KGDB if !CPU_ENDIAN_BE32 && MMU
select HAVE_ARCH_MMAP_RND_BITS if MMU
select HAVE_ARCH_SECCOMP_FILTER if AEABI && !OABI_COMPAT
+   select HAVE_ARCH_TASK_ISOLATION
select HAVE_ARCH_THREAD_STRUCT_WHITELIST
select HAVE_ARCH_TRACEHOOK
select HAVE_ARM_SMCCC if CPU_V7
diff --git a/arch/arm/include/asm/barrier.h b/arch/arm/include/asm/barrier.h
index 83ae97c049d9..3c603df6c290 100644
--- a/arch/arm/include/asm/barrier.h
+++ b/arch/arm/include/asm/barrier.h
@@ -66,12 +66,14 @@ extern void arm_heavy_mb(void);
 #define wmb()  __arm_heavy_mb(st)
 #define dma_rmb()  dmb(osh)
 #define dma_wmb()  dmb(oshst)
+#define instr_sync()   isb()
 #else
 #define mb()   barrier()
 #define rmb()  barrier()
 #define wmb()  barrier()
 #define dma_rmb()  barrier()
 #define dma_wmb()  barrier()
+#define instr_sync()   barrier()
 #endif
 
 #define __smp_mb() dmb(ish)
diff --git a/arch/arm/include/asm/thread_info.h 
b/arch/arm/include/asm/thread_info.h
index 3609a6980c34..ec0f11e1bb4c 100644
--- a/arch/arm/include/asm/thread_info.h
+++ b/arch/arm/include/asm/thread_info.h
@@ -139,7 +139,8 @@ extern int vfp_restore_user_hwstate(struct user_vfp *,
 #define TIF_SYSCALL_TRACE  4   /* syscall trace active */
 #define TIF_SYSCALL_AUDIT  5   /* syscall auditing active */
 #define TIF_SYSCALL_TRACEPOINT 6   /* syscall tracepoint instrumentation */
-#define TIF_SECCOMP7   /* seccomp syscall filtering active */
+#define TIF_TASK_ISOLATION 7   /* task isolation enabled for task */
+#define TIF_SECCOMP8   /* seccomp syscall filtering active */
 
 #define TIF_USING_IWMMXT   17
 #define TIF_MEMDIE 18  /* is terminating due to OOM killer */
@@ -152,18 +153,21 @@ extern int vfp_restore_user_hwstate(struct user_vfp *,
 #define _TIF_SYSCALL_TRACE (1 << TIF_SYSCALL_TRACE)
 #define _TIF_SYSCALL_AUDIT (1 << TIF_SYSCALL_AUDIT)
 #define _TIF_SYSCALL_TRACEPOINT(1 << TIF_SYSCALL_TRACEPOINT)
+#define _TIF_TASK_ISOLATION(1 << TIF_

[PATCH v4 10/13] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu()

2020-07-22 Thread Alex Belits
From: Yuri Norov 

For nohz_full CPUs the desirable behavior is to receive interrupts
generated by tick_nohz_full_kick_cpu(). But for hard isolation it's
obviously not desirable because it breaks isolation.

This patch adds check for it.

Signed-off-by: Yuri Norov 
[abel...@marvell.com: updated, only exclude CPUs running isolated tasks]
Signed-off-by: Alex Belits 
---
 kernel/time/tick-sched.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 6e4cd8459f05..2f82a6daf8fc 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -268,7 +269,8 @@ static void tick_nohz_full_kick(void)
  */
 void tick_nohz_full_kick_cpu(int cpu)
 {
-   if (!tick_nohz_full_cpu(cpu))
+   smp_rmb();
+   if (!tick_nohz_full_cpu(cpu) || task_isolation_on_cpu(cpu))
return;
 
irq_work_queue_on(&per_cpu(nohz_full_kick_work, cpu), cpu);
-- 
2.26.2



[PATCH 08/13] task_isolation: arch/arm64: enable task isolation functionality

2020-07-22 Thread Alex Belits
From: Chris Metcalf 

In do_notify_resume(), call task_isolation_start() for
TIF_TASK_ISOLATION tasks. Add _TIF_TASK_ISOLATION to _TIF_WORK_MASK,
and define a local NOTIFY_RESUME_LOOP_FLAGS to check in the loop,
since we don't clear _TIF_TASK_ISOLATION in the loop.

We instrument the smp_send_reschedule() routine so that it checks for
isolated tasks and generates a suitable warning if needed.

Finally, report on page faults in task-isolation processes in
do_page_faults().

Early kernel entry code calls task_isolation_kernel_enter(). In
particular:

Vectors:
el1_sync -> el1_sync_handler() -> task_isolation_kernel_enter()
el1_irq -> asm_nmi_enter(), handle_arch_irq()
el1_error -> do_serror()
el0_sync -> el0_sync_handler()
el0_irq -> handle_arch_irq()
el0_error -> do_serror()
el0_sync_compat -> el0_sync_compat_handler()
el0_irq_compat -> handle_arch_irq()
el0_error_compat -> do_serror()

SDEI entry:
__sdei_asm_handler -> __sdei_handler() -> nmi_enter()

Functions called from there:
asm_nmi_enter() -> nmi_enter() -> task_isolation_kernel_enter()
asm_nmi_exit() -> nmi_exit() -> task_isolation_kernel_return()

Handlers:
do_serror() -> nmi_enter() -> task_isolation_kernel_enter()
  or task_isolation_kernel_enter()
el1_sync_handler() -> task_isolation_kernel_enter()
el0_sync_handler() -> task_isolation_kernel_enter()
el0_sync_compat_handler() -> task_isolation_kernel_enter()

handle_arch_irq() is irqchip-specific, most call handle_domain_irq()
  or handle_IPI()
There is a separate patch for irqchips that do not follow this rule.

handle_domain_irq() -> task_isolation_kernel_enter()
handle_IPI() -> task_isolation_kernel_enter()
nmi_enter() -> task_isolation_kernel_enter()

Signed-off-by: Chris Metcalf 
[abel...@marvell.com: simplified to match kernel 5.6]
Signed-off-by: Alex Belits 
---
 arch/arm64/Kconfig   |  1 +
 arch/arm64/include/asm/barrier.h |  2 ++
 arch/arm64/include/asm/thread_info.h |  5 -
 arch/arm64/kernel/entry-common.c |  7 +++
 arch/arm64/kernel/ptrace.c   | 16 +++-
 arch/arm64/kernel/sdei.c |  2 ++
 arch/arm64/kernel/signal.c   | 13 -
 arch/arm64/kernel/smp.c  |  9 +
 arch/arm64/mm/fault.c|  5 +
 9 files changed, 57 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 66dc41fd49f2..96fefabfa10f 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -137,6 +137,7 @@ config ARM64
select HAVE_ARCH_PREL32_RELOCATIONS
select HAVE_ARCH_SECCOMP_FILTER
select HAVE_ARCH_STACKLEAK
+   select HAVE_ARCH_TASK_ISOLATION
select HAVE_ARCH_THREAD_STRUCT_WHITELIST
select HAVE_ARCH_TRACEHOOK
select HAVE_ARCH_TRANSPARENT_HUGEPAGE
diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
index fb4c27506ef4..bf4a2adabd5b 100644
--- a/arch/arm64/include/asm/barrier.h
+++ b/arch/arm64/include/asm/barrier.h
@@ -48,6 +48,8 @@
 #define dma_rmb()  dmb(oshld)
 #define dma_wmb()  dmb(oshst)
 
+#define instr_sync()   isb()
+
 /*
  * Generate a mask for array_index__nospec() that is ~0UL when 0 <= idx < sz
  * and 0 otherwise.
diff --git a/arch/arm64/include/asm/thread_info.h 
b/arch/arm64/include/asm/thread_info.h
index 5e784e16ee89..73269bb8a57d 100644
--- a/arch/arm64/include/asm/thread_info.h
+++ b/arch/arm64/include/asm/thread_info.h
@@ -67,6 +67,7 @@ void arch_release_task_struct(struct task_struct *tsk);
 #define TIF_FOREIGN_FPSTATE3   /* CPU's FP state is not current's */
 #define TIF_UPROBE 4   /* uprobe breakpoint or singlestep */
 #define TIF_FSCHECK5   /* Check FS is USER_DS on return */
+#define TIF_TASK_ISOLATION 6   /* task isolation enabled for task */
 #define TIF_SYSCALL_TRACE  8   /* syscall trace active */
 #define TIF_SYSCALL_AUDIT  9   /* syscall auditing */
 #define TIF_SYSCALL_TRACEPOINT 10  /* syscall tracepoint for ftrace */
@@ -86,6 +87,7 @@ void arch_release_task_struct(struct task_struct *tsk);
 #define _TIF_NEED_RESCHED  (1 << TIF_NEED_RESCHED)
 #define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME)
 #define _TIF_FOREIGN_FPSTATE   (1 << TIF_FOREIGN_FPSTATE)
+#define _TIF_TASK_ISOLATION(1 << TIF_TASK_ISOLATION)
 #define _TIF_SYSCALL_TRACE (1 << TIF_SYSCALL_TRACE)
 #define _TIF_SYSCALL_AUDIT (1 << TIF_SYSCALL_AUDIT)
 #define _TIF_SYSCALL_TRACEPOINT(1 << TIF_SYSCALL_TRACEPOINT)
@@ -99,7 +101,8 @@ void arch_release_task_struct(struct task_struct *tsk);
 
 #define _TIF_WORK_MASK (_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
 _TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE | \
-_TIF_UPROBE | _TIF_FSCHECK)
+_TIF_UPROBE | _TIF_FSCHE

[PATCH v4 07/13] task_isolation: arch/x86: enable task isolation functionality

2020-07-22 Thread Alex Belits
In prepare_exit_to_usermode(), run cleanup for tasks that exited from
isolation and call task_isolation_start() for tasks that entered
TIF_TASK_ISOLATION.

In syscall_trace_enter(), add the necessary support for reporting
syscalls for task-isolation processes.

Add task_isolation_remote() calls for the kernel exception types
that do not result in signals, namely non-signalling page faults.

Add task_isolation_kernel_enter() calls to interrupt and syscall
entry handlers.

This mechanism relies on calls to functions that call
task_isolation_kernel_enter() early after entry into kernel. Those
functions are:

enter_from_user_mode()
  called from do_syscall_64(), do_int80_syscall_32(),
  do_fast_syscall_32(), idtentry_enter_user(),
  idtentry_enter_cond_rcu()
idtentry_enter_cond_rcu()
  called from non-raw IDT macros and other entry points
idtentry_enter_user()
nmi_enter()
xen_call_function_interrupt()
xen_call_function_single_interrupt()
xen_irq_work_interrupt()

Signed-off-by: Chris Metcalf 
[abel...@marvell.com: adapted for kernel 5.8]
Signed-off-by: Alex Belits 
---
 arch/x86/Kconfig   |  1 +
 arch/x86/entry/common.c| 20 +++-
 arch/x86/include/asm/barrier.h |  2 ++
 arch/x86/include/asm/thread_info.h |  4 +++-
 arch/x86/kernel/apic/ipi.c |  2 ++
 arch/x86/mm/fault.c|  4 
 arch/x86/xen/smp.c |  3 +++
 arch/x86/xen/smp_pv.c  |  2 ++
 8 files changed, 36 insertions(+), 2 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 883da0abf779..3a80142f85c8 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -149,6 +149,7 @@ config X86
select HAVE_ARCH_COMPAT_MMAP_BASES  if MMU && COMPAT
select HAVE_ARCH_PREL32_RELOCATIONS
select HAVE_ARCH_SECCOMP_FILTER
+   select HAVE_ARCH_TASK_ISOLATION
select HAVE_ARCH_THREAD_STRUCT_WHITELIST
select HAVE_ARCH_STACKLEAK
select HAVE_ARCH_TRACEHOOK
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index f09288431f28..ab94d90a2bd5 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #ifdef CONFIG_XEN_PV
 #include 
@@ -86,6 +87,7 @@ static noinstr void enter_from_user_mode(void)
 {
enum ctx_state state = ct_state();
 
+   task_isolation_kernel_enter();
lockdep_hardirqs_off(CALLER_ADDR0);
user_exit_irqoff();
 
@@ -97,6 +99,7 @@ static noinstr void enter_from_user_mode(void)
 #else
 static __always_inline void enter_from_user_mode(void)
 {
+   task_isolation_kernel_enter();
lockdep_hardirqs_off(CALLER_ADDR0);
instrumentation_begin();
trace_hardirqs_off_finish();
@@ -161,6 +164,15 @@ static long syscall_trace_enter(struct pt_regs *regs)
return -1L;
}
 
+   /*
+* In task isolation mode, we may prevent the syscall from
+* running, and if so we also deliver a signal to the process.
+*/
+   if (work & _TIF_TASK_ISOLATION) {
+   if (task_isolation_syscall(regs->orig_ax) == -1)
+   return -1L;
+   work &= ~_TIF_TASK_ISOLATION;
+   }
 #ifdef CONFIG_SECCOMP
/*
 * Do seccomp after ptrace, to catch any tracer changes.
@@ -263,6 +275,8 @@ static void __prepare_exit_to_usermode(struct pt_regs *regs)
lockdep_assert_irqs_disabled();
lockdep_sys_exit();
 
+   task_isolation_check_run_cleanup();
+
cached_flags = READ_ONCE(ti->flags);
 
if (unlikely(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
@@ -278,6 +292,9 @@ static void __prepare_exit_to_usermode(struct pt_regs *regs)
if (unlikely(cached_flags & _TIF_NEED_FPU_LOAD))
switch_fpu_return();
 
+   if (cached_flags & _TIF_TASK_ISOLATION)
+   task_isolation_start();
+
 #ifdef CONFIG_COMPAT
/*
 * Compat syscalls set TS_COMPAT.  Make sure we clear it before
@@ -597,7 +614,8 @@ bool noinstr idtentry_enter_cond_rcu(struct pt_regs *regs)
check_user_regs(regs);
enter_from_user_mode();
return false;
-   }
+   } else
+   task_isolation_kernel_enter();
 
/*
 * If this entry hit the idle task invoke rcu_irq_enter() whether
diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
index 7f828fe49797..5be6ca0519fc 100644
--- a/arch/x86/include/asm/barrier.h
+++ b/arch/x86/include/asm/barrier.h
@@ -4,6 +4,7 @@
 
 #include 
 #include 
+#include 
 
 /*
  * Force strict CPU ordering.
@@ -53,6 +54,7 @@ static inline unsigned long array_index_mask_nospec(unsigned 
long index,
 
 #define dma_rmb()  barrier()
 #define dma_wmb()  barrier()
+#define instr_sync()   sync_core()
 
 #ifdef CONFIG_X86_32
#define __smp_mb() asm volatile("lock; addl $0,-4(%%esp)" ::: "memory", "cc")

[PATCH 06/13] task_isolation: Add driver-specific hooks

2020-07-22 Thread Alex Belits
Some drivers don't call functions that call
task_isolation_kernel_enter() in interrupt handlers. Call it
directly.

Signed-off-by: Alex Belits 
---
 drivers/irqchip/irq-armada-370-xp.c | 6 ++
 drivers/irqchip/irq-gic-v3.c| 3 +++
 drivers/irqchip/irq-gic.c   | 3 +++
 drivers/s390/cio/cio.c  | 3 +++
 4 files changed, 15 insertions(+)

diff --git a/drivers/irqchip/irq-armada-370-xp.c b/drivers/irqchip/irq-armada-370-xp.c
index c9bdc5221b82..df7f2cce3a54 100644
--- a/drivers/irqchip/irq-armada-370-xp.c
+++ b/drivers/irqchip/irq-armada-370-xp.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -473,6 +474,7 @@ static const struct irq_domain_ops armada_370_xp_mpic_irq_ops = {
 static void armada_370_xp_handle_msi_irq(struct pt_regs *regs, bool is_chained)
 {
u32 msimask, msinr;
+   int isol_entered = 0;
 
msimask = readl_relaxed(per_cpu_int_base +
ARMADA_370_XP_IN_DRBEL_CAUSE_OFFS)
@@ -489,6 +491,10 @@ static void armada_370_xp_handle_msi_irq(struct pt_regs *regs, bool is_chained)
continue;
 
if (is_chained) {
+   if (!isol_entered) {
+   task_isolation_kernel_enter();
+   isol_entered = 1;
+   }
irq = irq_find_mapping(armada_370_xp_msi_inner_domain,
   msinr - PCI_MSI_DOORBELL_START);
generic_handle_irq(irq);
diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
index cc46bc2d634b..be0e0ffa0fb7 100644
--- a/drivers/irqchip/irq-gic-v3.c
+++ b/drivers/irqchip/irq-gic-v3.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -629,6 +630,8 @@ static asmlinkage void __exception_irq_entry gic_handle_irq(struct pt_regs *regs
 {
u32 irqnr;
 
+   task_isolation_kernel_enter();
+
irqnr = gic_read_iar();
 
if (gic_supports_nmi() &&
diff --git a/drivers/irqchip/irq-gic.c b/drivers/irqchip/irq-gic.c
index c17fabd6741e..fde547a31566 100644
--- a/drivers/irqchip/irq-gic.c
+++ b/drivers/irqchip/irq-gic.c
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -353,6 +354,8 @@ static void __exception_irq_entry gic_handle_irq(struct pt_regs *regs)
struct gic_chip_data *gic = &gic_data[0];
void __iomem *cpu_base = gic_data_cpu_base(gic);
 
+   task_isolation_kernel_enter();
+
do {
irqstat = readl_relaxed(cpu_base + GIC_CPU_INTACK);
irqnr = irqstat & GICC_IAR_INT_ID_MASK;
diff --git a/drivers/s390/cio/cio.c b/drivers/s390/cio/cio.c
index 6d716db2a46a..beab1b6d 100644
--- a/drivers/s390/cio/cio.c
+++ b/drivers/s390/cio/cio.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -584,6 +585,8 @@ void cio_tsch(struct subchannel *sch)
struct irb *irb;
int irq_context;
 
+   task_isolation_kernel_enter();
+
irb = this_cpu_ptr(&cio_irb);
/* Store interrupt response block to lowcore. */
if (tsch(sch->schid, irb) != 0)
-- 
2.26.2



[PATCH v4 05/13] task_isolation: Add xen-specific hook

2020-07-22 Thread Alex Belits
xen_evtchn_do_upcall() should call task_isolation_kernel_enter()
to indicate that isolation is broken and perform synchronization.

Signed-off-by: Alex Belits 
---
 drivers/xen/events/events_base.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index 140c7bf33a98..4c16cd58f36b 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -33,6 +33,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #ifdef CONFIG_X86
 #include 
@@ -1236,6 +1237,8 @@ void xen_evtchn_do_upcall(struct pt_regs *regs)
 {
struct pt_regs *old_regs = set_irq_regs(regs);
 
+   task_isolation_kernel_enter();
+
irq_enter();
 
__xen_evtchn_do_upcall();
-- 
2.26.2



[PATCH v4 04/13] task_isolation: Add task isolation hooks to arch-independent code

2020-07-22 Thread Alex Belits
This commit adds task isolation hooks as follows:

- __handle_domain_irq() and handle_domain_nmi() generate an
  isolation warning for the local task

- irq_work_queue_on() generates an isolation warning for the remote
  task being interrupted for irq_work (through
  __smp_call_single_queue())

- generic_exec_single() generates a remote isolation warning for
  the remote cpu being IPI'd (through __smp_call_single_queue())

- smp_call_function_many() generates a remote isolation warning for
  the set of remote cpus being IPI'd (through
  smp_call_function_many_cond())

- on_each_cpu_cond_mask() generates a remote isolation warning for
  the set of remote cpus being IPI'd (through
  smp_call_function_many_cond())

- __ttwu_queue_wakelist() generates a remote isolation warning for
  the remote cpu being IPI'd (through __smp_call_single_queue())

- nmi_enter(), __context_tracking_exit(), __handle_domain_irq(),
  handle_domain_nmi() and scheduler_ipi() clear low-level flags and
  synchronize CPUs by calling task_isolation_kernel_enter()

Calls to task_isolation_remote() or task_isolation_interrupt() can
be placed in the platform-independent code like this when doing so
results in fewer lines of code changes, as for example is true of
the users of the arch_send_call_function_*() APIs. Or, they can be
placed in the per-architecture code when there are many callers,
as for example is true of the smp_send_reschedule() call.

A further cleanup might be to create an intermediate layer, so that
for example smp_send_reschedule() is a single generic function that
just calls arch_smp_send_reschedule(), allowing generic code to be
called every time smp_send_reschedule() is invoked. But for now, we
just update either callers or callees as makes most sense.

Calls to task_isolation_kernel_enter() are intended for early
kernel entry code. They may be called in platform-independent or
platform-specific code.

It may be possible to clean up low-level entry code and somehow
organize calls to task_isolation_kernel_enter() to avoid multiple
per-architecture or driver-specific calls to it. RCU initialization
may be a good reference point for those places in the kernel
(task_isolation_kernel_enter() should precede it); however, right now
it is not unified between architectures.

Signed-off-by: Chris Metcalf 
[abel...@marvell.com: adapted for kernel 5.8, added low-level flags handling]
Signed-off-by: Alex Belits 
---
 include/linux/hardirq.h   |  2 ++
 include/linux/sched.h |  2 ++
 kernel/context_tracking.c |  4 
 kernel/irq/irqdesc.c  | 13 +
 kernel/smp.c  |  6 +-
 5 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index 03c9fece7d43..5aab1d0a580e 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -7,6 +7,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 extern void synchronize_irq(unsigned int irq);
@@ -114,6 +115,7 @@ extern void rcu_nmi_exit(void);
 #define nmi_enter()\
do {\
arch_nmi_enter();   \
+   task_isolation_kernel_enter();  \
printk_nmi_enter(); \
lockdep_off();  \
BUG_ON(in_nmi() == NMI_MASK);   \
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7fb7bb3fddaa..cacfa415dc59 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -32,6 +32,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /* task_struct member predeclarations (sorted alphabetically): */
 struct audit_context;
@@ -1743,6 +1744,7 @@ extern char *__get_task_comm(char *to, size_t len, struct task_struct *tsk);
 #ifdef CONFIG_SMP
 static __always_inline void scheduler_ipi(void)
 {
+   task_isolation_kernel_enter();
/*
 * Fold TIF_NEED_RESCHED into the preempt_count; anybody setting
 * TIF_NEED_RESCHED remotely (for the first time) will also send
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 36a98c48aedc..481a722ddbce 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define CREATE_TRACE_POINTS
 #include 
@@ -148,6 +149,8 @@ void noinstr __context_tracking_exit(enum ctx_state state)
if (!context_tracking_recursion_enter())
return;
 
+   task_isolation_kernel_enter();
+
if (__this_cpu_read(context_tracking.state) == state) {
if (__this_cpu_read(context_tracking.active)) {
/*
@@ -159,6 +162,7 @@ void noinstr __context_tracking_exit(enum ctx_state state)
instrumentation_begin();

[PATCH v4 03/13] task_isolation: userspace hard isolation from kernel

2020-07-22 Thread Alex Belits
The existing nohz_full mode is designed as a "soft" isolation mode
that makes tradeoffs to minimize userspace interruptions while
still attempting to avoid overheads in the kernel entry/exit path,
to provide 100% kernel semantics, etc.

However, some applications require a "hard" commitment from the
kernel to avoid interruptions, in particular userspace device driver
style applications, such as high-speed networking code.

This change introduces a framework to allow applications
to elect to have the "hard" semantics as needed, specifying
prctl(PR_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) to do so.

The kernel must be built with the new TASK_ISOLATION Kconfig flag
to enable this mode, and the kernel booted with an appropriate
"isolcpus=nohz,domain,CPULIST" boot argument to enable
nohz_full and isolcpus. The "task_isolation" state is then indicated
by setting a new task struct field, task_isolation_flag, to the
value passed by prctl(), and also setting a TIF_TASK_ISOLATION
bit in the thread_info flags. When the kernel is returning to
userspace from the prctl() call and sees TIF_TASK_ISOLATION set,
it calls the new task_isolation_start() routine to arrange for
the task to avoid being interrupted in the future.

With interrupts disabled, task_isolation_start() ensures that kernel
subsystems that might cause a future interrupt are quiesced. If it
doesn't succeed, it adjusts the syscall return value to indicate that
fact, and userspace can retry as desired. In addition to stopping
the scheduler tick, the code takes any actions that might avoid
a future interrupt to the core, such as a worker thread being
scheduled that could be quiesced now (e.g. the vmstat worker)
or a future IPI to the core to clean up some state that could be
cleaned up now (e.g. the mm lru per-cpu cache).

Once the task has returned to userspace after issuing the prctl(),
if it enters the kernel again via system call, page fault, or any
other exception or irq, the kernel will send it a signal to indicate
isolation loss. In addition to sending a signal, the code supports a
kernel command-line "task_isolation_debug" flag which causes a stack
backtrace to be generated whenever a task loses isolation.

To allow the state to be entered and exited, the syscall checking
test ignores the prctl(PR_TASK_ISOLATION) syscall so that we can
clear the bit again later, and ignores exit/exit_group to allow
exiting the task without a pointless signal being delivered.

The prctl() API allows for specifying a signal number to use instead
of the default SIGKILL, to allow for catching the notification
signal; for example, in a production environment, it might be
helpful to log information to the application logging mechanism
before exiting. Or, the signal handler might choose to reset the
program counter back to the code segment intended to be run isolated
via prctl() to continue execution.

In a number of cases we can tell on a remote cpu that we are
going to be interrupting the cpu, e.g. via an IPI or a TLB flush.
In that case we generate the diagnostic (and optional stack dump)
on the remote core to be able to deliver better diagnostics.
If the interrupt is not something caught by Linux (e.g. a
hypervisor interrupt) we can also request a reschedule IPI to
be sent to the remote core so it can be sure to generate a
signal to notify the process.

Isolation also disables CPU state synchronization mechanisms that
are normally done by IPI. In the future, more synchronization
mechanisms, such as TLB flushes, may be disabled for isolated tasks.
This requires careful handling of kernel entry from an isolated task --
remote synchronization requests must be re-enabled and the
synchronization procedure triggered before anything other than
low-level kernel entry code is called. The same applies to exiting from
the kernel to userspace after isolation is enabled -- either the code
should not depend on synchronization, or isolation should be broken.

For this purpose, per-CPU low-level flags ll_isol_flags are used to
indicate isolation state, and task_isolation_kernel_enter() is used
to safely clear them early in kernel entry. The CPU mask corresponding
to the isolation bit in ll_isol_flags is visible to userspace as
/sys/devices/system/cpu/isolation_running, and can be used for
monitoring.

Separate patches that follow provide these changes for x86, arm,
and arm64 architectures, xen and irqchip drivers.

Signed-off-by: Alex Belits 
---
 .../admin-guide/kernel-parameters.txt |   6 +
 drivers/base/cpu.c|  23 +
 include/linux/hrtimer.h   |   4 +
 include/linux/isolation.h | 295 ++
 include/linux/sched.h |   5 +
 include/linux/tick.h  |   3 +
 include/uapi/linux/prctl.h|   6 +
 init/Kconfig  |  28 +
 kernel/Makefile  
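
To make the interface described above concrete, here is a minimal userspace
sketch of the intended flow: enable isolation via prctl(), retry while the
kernel cannot quiesce the CPU, and catch the loss-of-isolation signal. The
numeric constants and the PR_TASK_ISOLATION_SET_SIG() encoding are assumptions
taken from the series' uapi header and may differ; the task is assumed to be
already pinned to a nohz_full/isolated CPU.

---8<---
#include <signal.h>
#include <unistd.h>
#include <sys/prctl.h>

#ifndef PR_TASK_ISOLATION
#define PR_TASK_ISOLATION		48		/* assumed value */
#define PR_TASK_ISOLATION_ENABLE	(1 << 0)
#define PR_TASK_ISOLATION_SET_SIG(sig)	(((sig) & 0x7f) << 8)	/* assumed encoding */
#endif

static void lost_isolation(int sig)
{
	/* Log the event and bail out; a real application might re-enter. */
	write(STDERR_FILENO, "isolation lost\n", 15);
	_exit(1);
}

int main(void)
{
	signal(SIGUSR1, lost_isolation);

	/* Retry until the kernel reports the CPU as quiesced. */
	while (prctl(PR_TASK_ISOLATION,
		     PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_SET_SIG(SIGUSR1),
		     0, 0, 0) != 0)
		usleep(1000);	/* e.g. vmstat worker still pending */

	/* Isolated section: no syscalls, no page faults from here on. */
	for (;;)
		;	/* userspace polling loop, e.g. on a device ring */
}
--->8---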

[PATCH v4 02/13] task_isolation: vmstat: add vmstat_idle function

2020-07-22 Thread Alex Belits
From 7823be8cd3ba2e66308f334a2e47f60ba7829e0b Mon Sep 17 00:00:00 2001
From: Chris Metcalf 
Date: Sat, 1 Feb 2020 08:05:45 +
Subject: [PATCH 02/13] task_isolation: vmstat: add vmstat_idle function

This function checks that no vmstat worker is running and that
the vmstat diffs don't require an update.  The function is
called from the task-isolation code to see if we need to
actually do some work to quiet vmstat.

Signed-off-by: Chris Metcalf 
Signed-off-by: Alex Belits 
---
 include/linux/vmstat.h |  2 ++
 mm/vmstat.c| 10 ++
 2 files changed, 12 insertions(+)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index ded16dfd21fa..97bc9ed92036 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -273,6 +273,7 @@ extern void __dec_node_state(struct pglist_data *, enum node_stat_item);
 
 void quiet_vmstat(void);
 void quiet_vmstat_sync(void);
+bool vmstat_idle(void);
 void cpu_vm_stats_fold(int cpu);
 void refresh_zone_stat_thresholds(void);
 
@@ -376,6 +377,7 @@ static inline void refresh_zone_stat_thresholds(void) { }
 static inline void cpu_vm_stats_fold(int cpu) { }
 static inline void quiet_vmstat(void) { }
 static inline void quiet_vmstat_sync(void) { }
+static inline bool vmstat_idle(void) { return true; }
 
 static inline void drain_zonestat(struct zone *zone,
struct per_cpu_pageset *pset) { }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 93534f8537ca..f3693ef0a958 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1898,6 +1898,16 @@ void quiet_vmstat_sync(void)
refresh_cpu_vm_stats(false);
 }
 
+/*
+ * Report on whether vmstat processing is quiesced on the core currently:
+ * no vmstat worker running and no vmstat updates to perform.
+ */
+bool vmstat_idle(void)
+{
+   return !delayed_work_pending(this_cpu_ptr(&vmstat_work)) &&
+   !need_update(smp_processor_id());
+}
+
 /*
  * Shepherd worker thread that checks the
  * differentials of processors that have their worker
-- 
2.26.2
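
Together with quiet_vmstat_sync() from patch 01/13, the task-isolation quiesce
path can use these helpers roughly as in the sketch below. This is illustrative
only: the function name is made up, and the real task_isolation_start() does
considerably more than quieting vmstat.

---8<---
#include <linux/vmstat.h>
#include <linux/errno.h>

/* Quiesce vmstat on the local CPU before entering isolation (sketch). */
static int task_isolation_quiet_vmstat(void)
{
	if (!vmstat_idle())
		quiet_vmstat_sync();

	/* If the worker could not be silenced, ask the caller to retry. */
	return vmstat_idle() ? 0 : -EAGAIN;
}
--->8---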



[PATCH v4 01/13] task_isolation: vmstat: add quiet_vmstat_sync function

2020-07-22 Thread Alex Belits
In commit f01f17d3705b ("mm, vmstat: make quiet_vmstat lighter")
the quiet_vmstat() function became asynchronous, in the sense that
the vmstat work was still scheduled to run on the core when the
function returned.  For task isolation, we need a synchronous
version of the function that guarantees that the vmstat worker
will not run on the core on return from the function.  Add a
quiet_vmstat_sync() function with that semantic.

Signed-off-by: Chris Metcalf 
Signed-off-by: Alex Belits 
---
 include/linux/vmstat.h | 2 ++
 mm/vmstat.c| 9 +
 2 files changed, 11 insertions(+)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index aa961088c551..ded16dfd21fa 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -272,6 +272,7 @@ extern void __dec_zone_state(struct zone *, enum zone_stat_item);
extern void __dec_node_state(struct pglist_data *, enum node_stat_item);
 
 void quiet_vmstat(void);
+void quiet_vmstat_sync(void);
 void cpu_vm_stats_fold(int cpu);
 void refresh_zone_stat_thresholds(void);
 
@@ -374,6 +375,7 @@ static inline void __dec_node_page_state(struct page *page,
 static inline void refresh_zone_stat_thresholds(void) { }
 static inline void cpu_vm_stats_fold(int cpu) { }
 static inline void quiet_vmstat(void) { }
+static inline void quiet_vmstat_sync(void) { }
 
 static inline void drain_zonestat(struct zone *zone,
struct per_cpu_pageset *pset) { }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 3fb23a21f6dd..93534f8537ca 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1889,6 +1889,15 @@ void quiet_vmstat(void)
refresh_cpu_vm_stats(false);
 }
 
+/*
+ * Synchronously quiet vmstat so the work is guaranteed not to run on return.
+ */
+void quiet_vmstat_sync(void)
+{
+   cancel_delayed_work_sync(this_cpu_ptr(&vmstat_work));
+   refresh_cpu_vm_stats(false);
+}
+
 /*
  * Shepherd worker thread that checks the
  * differentials of processors that have their worker
-- 
2.26.2



[PATCH v4 00/13] "Task_isolation" mode

2020-07-22 Thread Alex Belits
This is a new version of the task isolation implementation. The previous version is at
https://lore.kernel.org/lkml/07c25c246c55012981ec0296eee23e68c719333a.ca...@marvell.com/

Mostly this covers prevention of race conditions when breaking isolation. Early
after kernel entry, task_isolation_enter() is called to update flags visible to
other CPU cores and to perform synchronization if necessary. Before this call
only "safe" operations happen, as long as CONFIG_TRACE_IRQFLAGS is not enabled.

This is also intended for future TLB handling -- the idea is to also isolate
those CPU cores from TLB flushes while they are running an isolated task in
userspace, and to do one flush on exiting, before any code is called that may
touch anything updated.

The functionality and interface are unchanged, except for
/sys/devices/system/cpu/isolation_running containing the list of CPUs running
isolated tasks. This should be useful for userspace helper libraries.
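
For example, a userspace helper can read the new file directly; a minimal
sketch (output format handling is left to the caller):

---8<---
#include <stdio.h>

int main(void)
{
	char buf[256];
	FILE *f = fopen("/sys/devices/system/cpu/isolation_running", "r");

	if (!f) {
		perror("isolation_running");
		return 1;
	}
	if (fgets(buf, sizeof(buf), f))
		printf("CPUs running isolated tasks: %s", buf);
	fclose(f);
	return 0;
}
--->8---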


[tip: sched/core] lib: Restrict cpumask_local_spread to houskeeping CPUs

2020-07-09 Thread tip-bot2 for Alex Belits
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 1abdfe706a579a702799fce465bceb9fb01d407c
Gitweb:
https://git.kernel.org/tip/1abdfe706a579a702799fce465bceb9fb01d407c
Author:Alex Belits 
AuthorDate:Thu, 25 Jun 2020 18:34:41 -04:00
Committer: Peter Zijlstra 
CommitterDate: Wed, 08 Jul 2020 11:39:01 +02:00

lib: Restrict cpumask_local_spread to houskeeping CPUs

The current implementation of cpumask_local_spread() does not respect the
isolated CPUs, i.e., even if a CPU has been isolated for Real-Time task,
it will return it to the caller for pinning of its IRQ threads. Having
these unwanted IRQ threads on an isolated CPU adds up to a latency
overhead.

Restrict the CPUs that are returned for spreading IRQs only to the
available housekeeping CPUs.

Signed-off-by: Alex Belits 
Signed-off-by: Nitesh Narayan Lal 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://lkml.kernel.org/r/20200625223443.2684-2-nit...@redhat.com
---
 lib/cpumask.c | 16 +++-
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/lib/cpumask.c b/lib/cpumask.c
index fb22fb2..85da6ab 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -6,6 +6,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /**
  * cpumask_next - get the next cpu in a cpumask
@@ -205,22 +206,27 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask)
  */
 unsigned int cpumask_local_spread(unsigned int i, int node)
 {
-   int cpu;
+   int cpu, hk_flags;
+   const struct cpumask *mask;
 
+   hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
+   mask = housekeeping_cpumask(hk_flags);
/* Wrap: we always want a cpu. */
-   i %= num_online_cpus();
+   i %= cpumask_weight(mask);
 
if (node == NUMA_NO_NODE) {
-   for_each_cpu(cpu, cpu_online_mask)
+   for_each_cpu(cpu, mask) {
if (i-- == 0)
return cpu;
+   }
} else {
/* NUMA first. */
-   for_each_cpu_and(cpu, cpumask_of_node(node), cpu_online_mask)
+   for_each_cpu_and(cpu, cpumask_of_node(node), mask) {
if (i-- == 0)
return cpu;
+   }
 
-   for_each_cpu(cpu, cpu_online_mask) {
+   for_each_cpu(cpu, mask) {
/* Skip NUMA nodes, done above. */
if (cpumask_test_cpu(cpu, cpumask_of_node(node)))
continue;


[tip: sched/core] net: Restrict receive packets queuing to housekeeping CPUs

2020-07-09 Thread tip-bot2 for Alex Belits
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 07bbecb3410617816a99e76a2df7576507a0c8ad
Gitweb:
https://git.kernel.org/tip/07bbecb3410617816a99e76a2df7576507a0c8ad
Author:Alex Belits 
AuthorDate:Thu, 25 Jun 2020 18:34:43 -04:00
Committer: Peter Zijlstra 
CommitterDate: Wed, 08 Jul 2020 11:39:02 +02:00

net: Restrict receive packets queuing to housekeeping CPUs

With the existing implementation of store_rps_map(), packets are queued
in the receive path on the backlog queues of other CPUs irrespective of
whether they are isolated or not. This could add a latency overhead to
any RT workload that is running on the same CPU.

Ensure that store_rps_map() only uses available housekeeping CPUs for
storing the rps_map.

Signed-off-by: Alex Belits 
Signed-off-by: Nitesh Narayan Lal 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://lkml.kernel.org/r/20200625223443.2684-4-nit...@redhat.com
---
 net/core/net-sysfs.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index e353b82..677868f 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -741,7 +742,7 @@ static ssize_t store_rps_map(struct netdev_rx_queue *queue,
 {
struct rps_map *old_map, *map;
cpumask_var_t mask;
-   int err, cpu, i;
+   int err, cpu, i, hk_flags;
static DEFINE_MUTEX(rps_map_mutex);
 
if (!capable(CAP_NET_ADMIN))
@@ -756,6 +757,13 @@ static ssize_t store_rps_map(struct netdev_rx_queue *queue,
return err;
}
 
+   hk_flags = HK_FLAG_DOMAIN | HK_FLAG_WQ;
+   cpumask_and(mask, mask, housekeeping_cpumask(hk_flags));
+   if (cpumask_empty(mask)) {
+   free_cpumask_var(mask);
+   return -EINVAL;
+   }
+
map = kzalloc(max_t(unsigned int,
RPS_MAP_SIZE(cpumask_weight(mask)), L1_CACHE_BYTES),
  GFP_KERNEL);


[tip: sched/core] PCI: Restrict probe functions to housekeeping CPUs

2020-07-09 Thread tip-bot2 for Alex Belits
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 69a18b18699b59654333651d95f8ca09d01048f8
Gitweb:
https://git.kernel.org/tip/69a18b18699b59654333651d95f8ca09d01048f8
Author:Alex Belits 
AuthorDate:Thu, 25 Jun 2020 18:34:42 -04:00
Committer: Peter Zijlstra 
CommitterDate: Wed, 08 Jul 2020 11:39:01 +02:00

PCI: Restrict probe functions to housekeeping CPUs

pci_call_probe() prevents the nesting of work_on_cpu() for a scenario
where a VF device is probed from work_on_cpu() of the PF.

Replace the cpumask used in pci_call_probe() from all online CPUs to only
housekeeping CPUs. This is to ensure that there are no additional latency
overheads caused due to the pinning of jobs on isolated CPUs.

Signed-off-by: Alex Belits 
Signed-off-by: Nitesh Narayan Lal 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Frederic Weisbecker 
Acked-by: Bjorn Helgaas 
Link: https://lkml.kernel.org/r/20200625223443.2684-3-nit...@redhat.com
---
 drivers/pci/pci-driver.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index da6510a..449466f 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -333,6 +334,7 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
  const struct pci_device_id *id)
 {
int error, node, cpu;
+   int hk_flags = HK_FLAG_DOMAIN | HK_FLAG_WQ;
struct drv_dev_and_id ddi = { drv, dev, id };
 
/*
@@ -353,7 +355,8 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
pci_physfn_is_probed(dev))
cpu = nr_cpu_ids;
else
-   cpu = cpumask_any_and(cpumask_of_node(node), cpu_online_mask);
+   cpu = cpumask_any_and(cpumask_of_node(node),
+ housekeeping_cpumask(hk_flags));
 
if (cpu < nr_cpu_ids)
error = work_on_cpu(cpu, local_pci_probe, &ddi);


Re: how to look for source code in kernel

2012-12-27 Thread Alex Belits

On Fri, 28 Dec 2012, anish singh wrote:


have source insight. We can use wine in linux but that sucks.

Funny you say that!
Never heard of cscope, ctags ?

It is not as convenient as source insight or is it?


There is also LXR.

If it's not good enough for you, then don't look at it.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Âèçû.Ïàñïîðòà.Ãðàæäàíñòâà.Ïðîáëåìû.=?ISO-8859-1?Q?=D0=E5=F8=E5

2001-06-15 Thread Alex Belits

On Fri, 15 Jun 2001, Dan Hollis wrote:

> Received: from [195.161.132.168] ([195.161.132.168]:38150 "HELO 777")
> by vger.kernel.org with SMTP id ;
> Fri, 15 Jun 2001 17:19:32 -0400
>
> inetnum:  195.161.132.0 - 195.161.132.255
> netname:  RT-CLNT-MMTEL
> descr:Moscow Long Distance and International Telephone
>
> Anyone want to fire the nuclear larts?

Me!

1. It's a spam.
2. It's in the dreaded windows-1251 charset.
3. The text in the header is mis-identified as ISO 8859-1

-- 
Alex

--
 Excellent.. now give users the option to cut your hair you hippie!
  -- Anonymous Coward

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Serial device with very large buffer

2001-02-11 Thread Alex Belits

On Fri, 9 Feb 2001, Pavel Machek wrote:

> > >   I also propose to increase the size of flip buffer to 640 bytes (so the
> > > flipping won't occur every time in the middle of the full buffer), however
> > > I understand that it's a rather drastic change for such a simple goal, and
> > > not everyone will agree that it's worth the trouble:
> > 
> > Going to a 1K flip buffer would make sense IMHO for high speed devices too
> 
> Actually bigger flipbufs are needed for highspeed serials and
> irda. Tytso received patch to make flipbuf size settable by the
> driver. (Setting it to 1K is not easy, you need to change allocation
> mechanism of buffers.)

  The need for changes in allocation mechanism was the reason why I have
limited the buffer increase to 640 bytes. If changes already exist, and
there is no some hidden overhead associated with them, I am all for it.

  Still it's not a replacement for the change in the serial driver that I have
posted -- the assumption that the hardware is slower than we are, that it has a
limited buffer in the way, and that it's OK to discard all the data beyond
our buffer's size is, to say the least, silly.

-- 
Alex

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Serial device with very large buffer

2001-02-01 Thread Alex Belits

On Thu, 1 Feb 2001, Joe deBlaquiere wrote:

> >>I'm a little confused here... why are we overrunning? This thing is 
> >> running externally at 19200 at best, even if it does all come in as a 
> >> packet.
> > 
> > 
> >   Different Merlin -- original Merlin is 19200, "Merlin for Ricochet" is
> > 128Kbps (or faster), and uses Metricom/Ricochet network.
> 
> so can you still limit the mru?

  No. And even if I could, there is no guarantee that it won't fill the
whole buffer anyway by attaching the head of the second packet after the tail
of the first one -- this thing treats the interface as asynchronous and
ignores PPP packet boundaries.

-- 
Alex

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Serial device with very large buffer

2001-02-01 Thread Alex Belits

On Thu, 1 Feb 2001, Joe deBlaquiere wrote:

> Hi Alex!
> 
>   I'm a little confused here... why are we overrunning? This thing is 
> running externally at 19200 at best, even if it does all come in as a 
> packet.

  Different Merlin -- original Merlin is 19200, "Merlin for Ricochet" is
128Kbps (or faster), and uses Metricom/Ricochet network.

-- 
Alex

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Serial device with very large buffer

2001-02-01 Thread Alex Belits

On Thu, 1 Feb 2001, Alan Cox wrote:

> >   I also propose to increase the size of flip buffer to 640 bytes (so the
> > flipping won't occur every time in the middle of the full buffer), however
> > I understand that it's a rather drastic change for such a simple goal, and
> > not everyone will agree that it's worth the trouble:
> 
> Going to a 1K flip buffer would make sense IMHO for high speed devices too

A 1K flip buffer makes the tty_struct exceed 4096 bytes, and I don't think
it's a good idea to change the allocation mechanism for it.

-- 
Alex

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Serial device with very large buffer

2001-02-01 Thread Alex Belits


  Greg Pomerantz <[EMAIL PROTECTED]> and I have found that the Novatel Merlin
for Ricochet PCMCIA card, while looking like an otherwise ordinary serial
PCMCIA device, has a receive buffer 576 bytes long. When the regular serial
driver reads the arrived data, it often runs out of the 512-byte flip buffer
and discards the rest of the data, with rather disastrous consequences for
whatever is expecting it.

  We made a fix that changes the behavior of the driver, so that when it fills
the flip buffer while characters are still being read from the UART, it flips
the buffer if possible or, if that is impossible, finishes the loop
without reading the remaining characters.

The patch is:
---8<---
--- linux-2.4.1-orig/drivers/char/serial.c  Wed Dec  6 12:06:18 2000
+++ linux/drivers/char/serial.c Thu Feb  1 13:14:05 2001
@@ -569,9 +569,16 @@
 
icount = &info->state->icount;
do {
+   /*
+* Check if flip buffer is full -- if it is, try to flip,
+* and if flipping got queued, return immediately
+*/
+   if (tty->flip.count >= TTY_FLIPBUF_SIZE) {
+   tty->flip.tqueue.routine((void *) tty);
+   if (tty->flip.count >= TTY_FLIPBUF_SIZE)
+   return;
+   }
ch = serial_inp(info, UART_RX);
-   if (tty->flip.count >= TTY_FLIPBUF_SIZE)
-   goto ignore_char;
*tty->flip.char_buf_ptr = ch;
icount->rx++;

--->8---

  I also propose to increase the size of the flip buffer to 640 bytes (so the
flipping won't occur every time in the middle of the full buffer); however,
I understand that it's a rather drastic change for such a simple goal, and
not everyone will agree that it's worth the trouble:

---8<---
--- linux-2.4.1-orig/include/linux/tty.hMon Jan 29 23:24:56 2001
+++ linux/include/linux/tty.h   Wed Jan 31 13:06:42 2001
@@ -134,7 +134,7 @@
  * located in the tty structure, and is used as a high speed interface
  * between the tty driver and the tty line discipline.
  */
-#define TTY_FLIPBUF_SIZE 512
+#define TTY_FLIPBUF_SIZE 640
 
 struct tty_flip_buffer {
struct tq_struct tqueue;
--->8---

-- 
Alex

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: PPP broken in Kernel 2.4.1?

2001-01-31 Thread Alex Belits

On Mon, 29 Jan 2001, Michael B. Trausch wrote:

> I'm having a weird problem with 2.4.1, and I am *not* having this problem
> with 2.4.0.  When I attempt to connect to the Internet using Kernel 2.4.1,
> I get errors about PPP something-or-another, invalid argument.  I've tried

  Upgrade ppp to 2.4.0b1 or later -- it's documented in 
Documentation/Changes.

-- 
Alex

--
 Excellent.. now give users the option to cut your hair you hippie!
  -- Anonymous Coward

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Journaling: Surviving or allowing unclean shutdown?

2001-01-04 Thread Alex Belits

On Thu, 4 Jan 2001, Daniel Phillips wrote:

> >   A lot of applications always rely on their file i/o being done in some
> > manner that has atomic (from the application's point of view) operations
> > other than system calls -- heck, even make(1) does that.
> 
> Nobody is forcing you to hit the power switch in the middle of a build. 
> But now that you mention it, you've provided a good example of a broken
> application.  Make with its reliance on timestamps for determining build
> status is both painfully slow and unreliable.

  Actually I mean its reliance on files being deleted if a problem or
SIGTERM happened in the middle of building them.

>  What happens if you
> adjust your system clock?

  Don't adjust the system clock in the middle of the build. Adjusting the
clock backward by more than a second is a much rarer operation than a
shutdown.

>  That said, Tux2 can preserve the per-write
> atomicity quite easily, or better, make could take advantage of the new
> journal-oriented transaction api that's being cooked up and specify its
> requirement for atomicity in a precise way.

  I have already said that programs don't use syscalls as the only atomic
operations on files -- yes, it may be a good idea to add a transactions API
on top of this (and it will have a lot of uses), but then it should be
made in a way that makes it easy to add to existing applications.

> Do you have any other examples of programs that would be hurt by sudden
> termination?  Certainly we'd consider a desktop gui broken if it failed
> to come up again just because you bailed out with the power switch
> instead of logging out nicely.

  Any application that writes multiple times over the same files and has
any data consistency requirements beyond the piece of data in the chunk
sent in one write().

-- 
Alex

--
 Excellent.. now give users the option to cut your hair you hippie!
  -- Anonymous Coward

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Journaling: Surviving or allowing unclean shutdown?

2001-01-03 Thread Alex Belits

On Wed, 3 Jan 2001, Daniel Phillips wrote:

> I don't doubt that if the 'power switch' method of shutdown becomes
> popular we will discover some applications that have windows where they
> can be hurt by sudden shutdown, even will full filesystem data state
> being preserved.  Such applications are arguably broken because they
> will behave badly in the event of accidental shutdown anyway, and we
> should fix them.  Well-designed applications are explicitly 'serially
> reuseable', in other words, you can interrupt at any point and start
> again from the beginning with valid and expected results.

  I strongly disagree. All valid ways to shut down the system involve
sending SIGTERM to running applications -- only broken ones would
live long enough after that to be killed by subsequent SIGKILL.

  A lot of applications always rely on their file i/o being done in some
manner that has atomic (from the application's point of view) operations
other than system calls -- heck, even make(1) does that.

-- 
Alex

--
 Excellent.. now give users the option to cut your hair you hippie!
  -- Anonymous Coward

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: OS & Games Software

2000-12-23 Thread Alex Belits

On Sat, 23 Dec 2000 [EMAIL PROTECTED] wrote:

> Subject: OS & Games Software
> 
> Are you still using an old operating system? Why not upgrade to a 
> newer and
> more reliable version? You'll enjoy greater features and more 
> stability. 
> 
> Microsoft Dos 6.22$15
> Microsoft Windows 3.11$15
> Microsoft Windows 95  $15
> Microsoft Windows 98 SE   $20
> Microsoft Windows Millenium   $20
> Microsoft Windows 2000 Pro$20
> Microsoft Windows 2000 Server $50
> Microsoft Windows 2000 Advanced Server (25CAL)$65
> 

  Is this a desperate attempt by Microsoft to slow Linux development by
insulting developers? ;-))

  I mean, what other purpose can this possibly have? Unless, of course,
some unintelligent person got linux-kernel address in a list of
prepackaged "n millions email addresses for sale" (and then he must be not
moron*2, or moron^2, but at least e^moron).

-- 
Alex

--
 Excellent.. now give users the option to cut your hair you hippie!
  -- Anonymous Coward


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: The NSA's Security-Enhanced Linux (fwd)

2000-12-22 Thread Alex Belits

On Fri, 22 Dec 2000, James Lewis Nance wrote:

> > benefits from and which may help cut down computer crime beyond government.
> > (and which of course actually is part of the NSA's real job)
> 
> I often wonder how many people know that a whole bunch of the Linux
> networking code is Copyrighted by the NSA.

  Not exactly by NSA itself. A bunch of files have this in their copyright comment:

---8<---
Written 1992-94 by Donald Becker.

Copyright 1993 United States Government as represented by the
Director, National Security Agency.

This software may be used and distributed according to the terms
of the GNU Public License, incorporated herein by reference.

The author may be reached as [EMAIL PROTECTED], or C/O
Center of Excellence in Space Data and Information Sciences
Code 930.5, Goddard Space Flight Center, Greenbelt MD 20771

--->8---

  ...so this is the result of Becker's employment at NASA and the government's
legal weirdness (no, I have no idea why, of all possible choices,
"Director, National Security Agency" must represent the US government for
copyright purposes).

>  I'm always waiting to
> hear someone come up with a conspiracy theory about it on slashdot,
> but I have never heard anyone mention it.

  Actually I have seen it mentioned there today -- maybe a conspiracy
theory is being developed right now ;-)

-- 
Alex

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: OS Software?

2000-12-21 Thread Alex Belits

On Thu, 21 Dec 2000 [EMAIL PROTECTED] wrote:

> Are you interested in Office 2000? I am selling perfectly working 
> copies 
> of Microsoft Office 2000 SR-1 Premium Edition for a flat price of 
> $50 USD.
> The suite contains 4 discs and includes: 
> 
> Word 
> Excel 
> Outlook 
> PowerPoint 
> Access 
> FrontPage 
> Publisher 
> Small Business Tools 
> PhotoDraw 

  Is it a new tradition among spammers -- spamming the linux-kernel ML with
offers of the software most hated among the subscribers? Can't they offer
something less offensive?

-- 
Alex

--
 Excellent.. now give users the option to cut your hair you hippie!
  -- Anonymous Coward

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: uname

2000-11-22 Thread Alex Belits

On Thu, 23 Nov 2000, J . A . Magallon wrote:

> Little question about 'uname'. Does it read data from kernel, /proc or
> get its data from other source ?

The uname(1) utility calls the uname(2) syscall.
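
For illustration, the same data can be read directly from C through the
uname(2) wrapper:

---8<---
#include <stdio.h>
#include <sys/utsname.h>

int main(void)
{
	struct utsname u;

	if (uname(&u) != 0) {
		perror("uname");
		return 1;
	}
	printf("%s %s %s %s %s\n",
	       u.sysname, u.nodename, u.release, u.version, u.machine);
	return 0;
}
--->8---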

-- 
Alex

--
 Excellent.. now give users the option to cut your hair you hippie!
  -- Anonymous Coward

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



[PATCH] STRIP support for new Metricom modems

2000-10-06 Thread Alex Belits


  I have made changes in STRIP address handling to accommodate the new 128Kbps
Ricochet GS "modems" that Metricom makes now. There is no official
maintainer of the STRIP code (maybe I should become one; however, the folks at
Stanford who work on the original project would probably be more
appropriate), so I am only sending it here.

  The patch was tested on 2.2.17 (original tested with a Metricom modem over a
serial link, and patched for USB to test the same modem with a USB link) and
2.4.0-test9 (unchanged); in all tests I had old modems (Original
Ricochet and Ricochet SE) talking to each other and old modems talking
with a Ricochet GS.

Explanations are at http://phobos.illtel.denver.co.us/~abelits/metricom/

diff -u linux-2.2.17-orig/drivers/net/strip.c linux-2.2.17/drivers/net/strip.c
--- linux-2.2.17-orig/drivers/net/strip.c   Sun Nov  8 13:48:06 1998
+++ linux-2.2.17/drivers/net/strip.cThu Sep 28 11:06:46 2000
@@ -14,7 +14,7 @@
  * for kernel-based devices like TTY.  It interfaces between a
  * raw TTY, and the kernel's INET protocol layers (via DDI).
  *
- * Version:@(#)strip.c 1.3 July 1997
+ * Version:@(#)strip.c 1.4 September 2000
  *
  * Author: Stuart Cheshire <[EMAIL PROTECTED]>
  *
@@ -66,12 +66,15 @@
  *  It is no longer necessarily to manually set the radio's
  *  rate permanently to 115200 -- the driver handles setting
  *  the rate automatically.
+ *
+ *  v1.4 September 2000 (AB)
+ *  Added support for long serial numbers.
  */
 
 #ifdef MODULE
-static const char StripVersion[] = "1.3-STUART.CHESHIRE-MODULAR";
+static const char StripVersion[] = "1.4-STUART.CHESHIRE-MODULAR";
 #else
-static const char StripVersion[] = "1.3-STUART.CHESHIRE";
+static const char StripVersion[] = "1.4-STUART.CHESHIRE";
 #endif
 
 #define TICKLE_TIMERS 0
@@ -897,20 +900,37 @@
  * Convert a string to a Metricom Address.
  */
 
-#define IS_RADIO_ADDRESS(p) ( \
+#define IS_RADIO_ADDRESS_1(p) (   \
   isdigit((p)[0]) && isdigit((p)[1]) && isdigit((p)[2]) && isdigit((p)[3]) && \
   (p)[4] == '-' &&\
   isdigit((p)[5]) && isdigit((p)[6]) && isdigit((p)[7]) && isdigit((p)[8]))
 
+#define IS_RADIO_ADDRESS_2(p) (   \
+  isdigit((p)[0]) && isdigit((p)[1]) &&   \
+  (p)[2] == '-' &&\
+  isdigit((p)[3]) && isdigit((p)[4]) && isdigit((p)[5]) && isdigit((p)[6]) && \
+  (p)[7] == '-' &&\
+  isdigit((p)[8]) && isdigit((p)[9]) && isdigit((p)[10]) && isdigit((p)[11])  )
+
 static int string_to_radio_address(MetricomAddress *addr, __u8 *p)
 {
-if (!IS_RADIO_ADDRESS(p)) return(1);
+if (IS_RADIO_ADDRESS_2(p))
+{
+addr->c[0] = 0;
+addr->c[1] = (READHEX(p[0]) << 4 | READHEX(p[1])) ^ 0xFF;
+addr->c[2] = READHEX(p[3]) << 4 | READHEX(p[4]);
+addr->c[3] = READHEX(p[5]) << 4 | READHEX(p[6]);
+addr->c[4] = READHEX(p[8]) << 4 | READHEX(p[9]);
+addr->c[5] = READHEX(p[10]) << 4 | READHEX(p[11]);
+}else{
+if(!IS_RADIO_ADDRESS_1(p)) return(1);
 addr->c[0] = 0;
 addr->c[1] = 0;
 addr->c[2] = READHEX(p[0]) << 4 | READHEX(p[1]);
 addr->c[3] = READHEX(p[2]) << 4 | READHEX(p[3]);
 addr->c[4] = READHEX(p[5]) << 4 | READHEX(p[6]);
 addr->c[5] = READHEX(p[7]) << 4 | READHEX(p[8]);
+}
 return(0);
 }
 
@@ -920,6 +940,9 @@
 
static __u8 *radio_address_to_string(const MetricomAddress *addr, MetricomAddressString *p)
 {
+if(addr->c[1])
+sprintf(p->c, "%02X-%02X%02X-%02X%02X", addr->c[1] ^ 0xFF, addr->c[2], 
+addr->c[3], addr->c[4], addr->c[5]);
+else
sprintf(p->c, "%02X%02X-%02X%02X", addr->c[2], addr->c[3], addr->c[4], addr->c[5]);
 return(p->c);
 }
@@ -1481,6 +1504,12 @@
 
 *ptr++ = 0x0D;
 *ptr++ = '*';
+if(haddr.c[1])
+{
+*ptr++ = hextable[(haddr.c[1] >> 4) ^ 0xF];
+*ptr++ = hextable[(haddr.c[1] & 0xF) ^ 0xF];
+*ptr++ = '-';
+}
 *ptr++ = hextable[haddr.c[2] >> 4];
 *ptr++ = hextable[haddr.c[2] & 0xF];
 *ptr++ = hextable[haddr.c[3] >> 4];

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/