Re: [CFT][PATCH v3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-06-14 Thread Oleg Nesterov
On 06/11, Eric W. Biederman wrote:
>
> Oleg Nesterov  writes:
>
> > Could you spell out why this can't work (again, in this simple case)?
> >
> > My current (and I know, very poor) understanding is that .release() should
> > roughly do the following:
> >
> > 1. Ensure that vhost_work_queue() can't add the new callbacks
> >
> > 2. Call vhost_dev_flush() to ensure that worker->work_list is empty
> >
> > 3. Call vhost_task_stop()
>
> At least in the case of exec by the time the final fput happens
> from close_on_exec the task has already changed its mm.

Of course you are right.

But I can't resist pointing out that I only meant "this simple case", which
doesn't include exec/etc.

Nevermind. As Mike explains there are more problems even in this particular
"simple" case, and I am not surprised.

Sorry for noise,

Oleg.



Re: [CFT][PATCH v3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-06-06 Thread Oleg Nesterov
On 06/06, Mike Christie wrote:
>
> On 6/6/23 7:16 AM, Oleg Nesterov wrote:
> > On 06/05, Mike Christie wrote:
> >
> >> So it works like if we were using a kthread still:
> >>
> >> 1. Userspace thread0 opens /dev/vhost-$something.
> >> 2. thread0 does VHOST_SET_OWNER ioctl. This calls vhost_task_create() to
> >> create the task_struct which runs the vhost_worker() function which handles
> >> the work->fns.
> >> 3. If userspace now does a SIGKILL or just exits without doing a close() on
> >> /dev/vhost-$something, then when thread0 does exit_files() that will do the
> >> fput that does vhost-$something's file_operations->release.
> >
> > So, at least in this simple case vhost_worker() can just exit after SIGKILL,
> > and thread0 can flush the outstanding commands when it calls vhost_dev_flush()
> > rather than wait for vhost_worker().
> >
> > Right?
>
> With the current code, the answer is no. We would hang like I mentioned here:
>
> https://lore.kernel.org/lkml/ae250076-7d55-c407-1066-86b37014c...@oracle.com/

If only I could fully understand this email ;)

Could you spell out why this can't work (again, in this simple case)?

My current (and I know, very poor) understanding is that .release() should
roughly do the following:

1. Ensure that vhost_work_queue() can't add the new callbacks

2. Call vhost_dev_flush() to ensure that worker->work_list is empty

3. Call vhost_task_stop()

so why can't this sequence work if we turn vhost_dev_flush() into something like

void vhost_dev_flush(struct vhost_dev *dev)
{
        struct vhost_flush_struct flush;

        if (dev->worker) {
                // this assumes that vhost_task_create() uses CLONE_THREAD
                if (same_thread_group(current, dev->worker->vtsk->task)) {
                        ... run the pending callbacks ...
                        return;
                }

                // this is what we currently have

                init_completion(&flush.wait_event);
                vhost_work_init(&flush.work, vhost_flush_work);

                vhost_work_queue(dev, &flush.work);
                wait_for_completion(&flush.wait_event);
        }
}

?

Mike, I am just trying to understand what exactly vhost_worker() should do.

> We need to add code like I mentioned in that reply because we don't have a
> way to call into the layers below us to flush those commands.

This tells me nothing, but this is my fault, not yours. Again, again, I know
nothing about drivers/vhost.

Oleg.



Re: [CFT][PATCH v3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-06-06 Thread Oleg Nesterov
On 06/05, Mike Christie wrote:
>
> On 6/5/23 10:10 AM, Oleg Nesterov wrote:
> > On 06/03, michael.chris...@oracle.com wrote:
> >>
> >> On 6/2/23 11:15 PM, Eric W. Biederman wrote:
> >> The problem is that as part of the flush the drivers/vhost/scsi.c code
> >> will wait for outstanding commands, because we can't free the device and
> >> it's resources before the commands complete or we will hit the accessing
> >> freed memory bug.
> >
> > ignoring send-fd/clone issues, can we assume that the final fput/release
> > should always come from vhost_worker's sub-thread (which shares mm/etc) ?
>
> I think I'm misunderstanding the sub-thread term.
>
> - Is it the task_struct's context that we did the
> kernel/vhost_task.c:vhost_task_create() from? Below it would be the
> thread we did VHOST_SET_OWNER from.

Yes,

> So it works like if we were using a kthread still:
>
> 1. Userspace thread0 opens /dev/vhost-$something.
> 2. thread0 does VHOST_SET_OWNER ioctl. This calls vhost_task_create() to
> create the task_struct which runs the vhost_worker() function which handles
> the work->fns.
> 3. If userspace now does a SIGKILL or just exits without doing a close() on
> /dev/vhost-$something, then when thread0 does exit_files() that will do the
> fput that does vhost-$something's file_operations->release.

So, at least in this simple case vhost_worker() can just exit after SIGKILL,
and thread0 can flush the outstanding commands when it calls vhost_dev_flush()
rather than wait for vhost_worker().

Right?

not that I think this can help in the general case ...

Oleg.



Re: [CFT][PATCH v3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-06-05 Thread Oleg Nesterov
On 06/03, michael.chris...@oracle.com wrote:
>
> On 6/2/23 11:15 PM, Eric W. Biederman wrote:
> The problem is that as part of the flush the drivers/vhost/scsi.c code
> will wait for outstanding commands, because we can't free the device and
> it's resources before the commands complete or we will hit the accessing
> freed memory bug.

ignoring send-fd/clone issues, can we assume that the final fput/release
should always come from vhost_worker's sub-thread (which shares mm/etc) ?

Oleg.



Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-06-05 Thread Oleg Nesterov
On 06/02, Linus Torvalds wrote:
>
> On Fri, Jun 2, 2023 at 1:59 PM Oleg Nesterov  wrote:
> >
> > As I said from the very beginning, this code is fine on x86 because
> > atomic ops are fully serialised on x86.
>
> Yes. Other architectures require __smp_mb__{before,after}_atomic for
> the bit setting ops to actually be memory barriers.
>
> We *should* probably have acquire/release versions of the bit test/set
> helpers, but we don't, so they end up being full memory barriers with
> those things. Which isn't optimal, but I doubt it matters on most
> architectures.
>
> So maybe we'll some day have a "test_bit_acquire()" and a
> "set_bit_release()" etc.

In this particular case we need clear_bit_release() and, iiuc, it is
already there, it is just named clear_bit_unlock().

So do you agree that vhost_worker() needs smp_mb__before_atomic()
before clear_bit() or just clear_bit_unlock() to avoid the race with
vhost_work_queue() ?

Let me provide a simplified example:

struct item {
        struct llist_node       llist;
        unsigned long           flags;
};

struct llist_head HEAD = {};    // global

void queue(struct item *item)
{
        // ensure this item was already flushed
        if (!test_and_set_bit(0, &item->flags))
                llist_add(&item->llist, &HEAD);
}

void flush(void)
{
        struct llist_node *head = llist_del_all(&HEAD);
        struct item *item, *next;

        llist_for_each_entry_safe(item, next, head, llist)
                clear_bit(0, &item->flags);
}

I think this code is buggy in that flush() can race with queue(), the same
way as vhost_worker() and vhost_work_queue().

Once flush() clears bit 0, queue() can come on another CPU and re-queue
this item and change item->llist.next. We need a barrier before clear_bit()
to ensure that next = llist_entry(item->next) in llist_for_each_entry_safe()
completes before the result of clear_bit() is visible to queue().
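
IOW, a sketch of the fix for the simplified example above (clear_bit_unlock()
should work as well):

        void flush(void)
        {
                struct llist_node *head = llist_del_all(&HEAD);
                struct item *item, *next;

                llist_for_each_entry_safe(item, next, head, llist) {
                        /* complete the load of item->llist.next in _safe()
                         * before queue() can observe the cleared bit */
                        smp_mb__before_atomic();
                        clear_bit(0, &item->flags);
                }
        }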

And, I do not think we can rely on control dependency because... because
I fail to see the load-store control dependency in this code,
llist_for_each_entry_safe() loads item->llist.next but doesn't check the
result until the next iteration.

No?

Oleg.


Re: [PATCH 1/1] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-06-05 Thread Oleg Nesterov
On 06/01, Mike Christie wrote:
>
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -1368,7 +1368,9 @@ int zap_other_threads(struct task_struct *p)
>  
>   while_each_thread(p, t) {
>   task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
> - count++;
> + /* Don't require de_thread to wait for the vhost_worker */
> + if ((t->flags & (PF_IO_WORKER | PF_USER_WORKER)) != PF_USER_WORKER)
> + count++;

Well if you do this, then you should also change __exit_signal() to
not decrement sig->notify_count. Otherwise de_thread() can succeed
before the "normal" sub-threads exit.

But this can't be right anyway... If nothing else, suppose we have
a process with 3 threads:

M   - the main thread, group leader
T   - sub-thread
V   - vhost worker

T does exec and calls de_thread().

M exits. T takes the leadership and does release_task()

V is still running but V->group_leader points to already freed M.

Or unshare_sighand() after that... If nothing else this means that
lock_task_sighand(T) and lock_task_sighand(V) will take different
locks.

Oleg.



Re: [PATCH 1/1] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-06-05 Thread Oleg Nesterov
On 06/02, Eric W. Biederman wrote:
>
> Oleg Nesterov  writes:
>
> > Hi Mike,
> >
> > sorry, but somehow I can't understand this patch...
> >
> > I'll try to read it with a fresh head over the weekend, but for example,
> >
> > On 06/01, Mike Christie wrote:
> >>
> >>  static int vhost_task_fn(void *data)
> >>  {
> >>struct vhost_task *vtsk = data;
> >> -  int ret;
> >> +  bool dead = false;
> >> +
> >> +  for (;;) {
> >> +  bool did_work;
> >> +
> >> +  /* mb paired w/ vhost_task_stop */
> >> +  if (test_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags))
> >> +  break;
> >> +
> >> +  if (!dead && signal_pending(current)) {
> >> +  struct ksignal ksig;
> >> +  /*
> >> +   * Calling get_signal will block in SIGSTOP,
> >> +   * or clear fatal_signal_pending, but remember
> >> +   * what was set.
> >> +   *
> >> +   * This thread won't actually exit until all
> >> +   * of the file descriptors are closed, and
> >> +   * the release function is called.
> >> +   */
> >> +  dead = get_signal(&ksig);
> >> +  if (dead)
> >> +  clear_thread_flag(TIF_SIGPENDING);
> >
> > this can't be right or I am totally confused.
> >
> > Another signal_wake_up() can come right after clear(SIGPENDING).
>
> Technically yes.

...

> Beyond that clearing TIF_SIGPENDING is just an optimization so
> the thread can sleep in schedule and not spin.

Yes. So if another signal_wake_up() comes after clear(SIGPENDING)
this code will spin in a busy-wait loop waiting for VHOST_TASK_FLAGS_STOP.
Obviously not good and even deadlockable on UP && !PREEMPT.

Oleg.



Re: [CFT][PATCH v3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-06-05 Thread Oleg Nesterov
On 06/02, Eric W. Biederman wrote:
>
>  static int vhost_task_fn(void *data)
>  {
>   struct vhost_task *vtsk = data;
> - int ret;
> + bool dead = false;
> +
> + for (;;) {
> + bool did_work;
> +
> + if (!dead && signal_pending(current)) {
> + struct ksignal ksig;
> + /*
> +  * Calling get_signal can block in SIGSTOP,
> +  * and the freezer.  Or it can clear
> +  * fatal_signal_pending and return non-zero.
> +  */
> + dead = get_signal(&ksig);
> + if (dead)
> + set_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags);
> + }
> +
> + /* mb paired w/ kthread_stop */
> + set_current_state(TASK_INTERRUPTIBLE);
> +
> + did_work = vtsk->fn(vtsk->data);

I don't understand why you set TASK_INTERRUPTIBLE before vtsk->fn(),
it seems that you could do this before the test_bit(FLAGS_STOP) below.
But probably I missed something and this is minor anyway...

> + if (!did_work) {
> + if (test_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags)) {
> + __set_current_state(TASK_RUNNING);
> + break;

What if VHOST_TASK_FLAGS_STOP was set by us after get_signal() above ?
We need to ensure that in this case vhost_work_queue() can't add a new work,
because nobody will flush it.

In fact, unless I missed something this can even race with vhost_dev_flush().

        vhost_dev_flush:                        vhost_task_fn:

        checks FLAGS_STOP, not set,
        vhost_task_flush() returns false
                                                gets SIGKILL, sets FLAGS_STOP

                                                vtsk->fn() returns false

                                                vhost_task_fn() exits.

        vhost_work_queue();
        wait_for_completion(&flush.wait_event);


and the last wait_for_completion() will hang forever.

Oleg.



Re: [PATCH 1/1] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-06-02 Thread Oleg Nesterov
Hi Mike,

sorry, but somehow I can't understand this patch...

I'll try to read it with a fresh head over the weekend, but for example,

On 06/01, Mike Christie wrote:
>
>  static int vhost_task_fn(void *data)
>  {
>   struct vhost_task *vtsk = data;
> - int ret;
> + bool dead = false;
> +
> + for (;;) {
> + bool did_work;
> +
> + /* mb paired w/ vhost_task_stop */
> + if (test_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags))
> + break;
> +
> + if (!dead && signal_pending(current)) {
> + struct ksignal ksig;
> + /*
> +  * Calling get_signal will block in SIGSTOP,
> +  * or clear fatal_signal_pending, but remember
> +  * what was set.
> +  *
> +  * This thread won't actually exit until all
> +  * of the file descriptors are closed, and
> +  * the release function is called.
> +  */
> + dead = get_signal(&ksig);
> + if (dead)
> + clear_thread_flag(TIF_SIGPENDING);

this can't be right or I am totally confused.

Another signal_wake_up() can come right after clear(SIGPENDING).


Again, I'll try to re-read this patch, but let me ask anyway...

Do we have a plan B? I mean... iirc you have mentioned that you can
change these code paths to do something like

if (killed)
tell_the_drivers_that_all_callbacks_will_fail();


so that vhost_worker() can exit after get_signal() returns SIGKILL.

Probably I misunderstood you, but it would be nice to avoid the changes
in coredump/etc code just to add a temporary (iiuc!) fix.

Oleg.



Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-06-02 Thread Oleg Nesterov
On 06/02, Jason Wang wrote:
>
> On Thu, Jun 1, 2023 at 3:43 PM Oleg Nesterov  wrote:
> >
> > and the final rewrite:
> >
> >         if (work->node) {
> >                 work_next = work->node->next;
> >                 if (true)
> >                         clear_bit(&work->flags);
> >         }
> >
> > so again, I do not see the load-store control dependency.
>
> This kind of optimization is suspicious. Especially considering it's
> the control expression of the loop but not a condition.

It is not about optimization,

> Looking at the assembly (x86):
>
>0x81d46c5b <+75>:callq  0x81689ac0 
> 
>0x81d46c60 <+80>:mov%rax,%r15
>0x81d46c63 <+83>:test   %rax,%rax
>0x81d46c66 <+86>:je 0x81d46c3a 
>0x81d46c68 <+88>:mov%r15,%rdi
>0x81d46c6b <+91>:mov(%r15),%r15
>0x81d46c6e <+94>:lock andb $0xfd,0x10(%rdi)
>0x81d46c73 <+99>:movl   $0x0,0x18(%rbx)
>0x81d46c7a <+106>:   mov0x8(%rdi),%rax
>0x81d46c7e <+110>:   callq  0x821b39a0
> <__x86_indirect_thunk_array>
>0x81d46c83 <+115>:   callq  0x821b4d10 
> <__SCT__cond_resched>
> ...
>
> I can see:
>
> 1) The code read node->next (+91) before clear_bit (+94)

The code does, but what about the CPU?

> 2) And the it uses a lock prefix to guarantee the execution order

As I said from the very beginning, this code is fine on x86 because
atomic ops are fully serialised on x86.

OK, we can't convince each other. I'll try to write another email when
I have time.

If this code is correct, then my understanding of memory barriers is even
worse than I think. I wouldn't be surprised, but I'd like to understand
what I have missed.

Oleg.


Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-06-01 Thread Oleg Nesterov
On 06/01, Jason Wang wrote:
>
> On Wed, May 31, 2023 at 5:14 PM Oleg Nesterov  wrote:
> >
> > > > I don't understand you. OK, to simplify, suppose we have 2 global vars
> > > >
> > > > void *PTR = something_non_null;
> > > > unsigned long FLAGS = -1ul;
> > > >
> > > > Now I think this code
> > > >
> > > >         CPU_0                           CPU_1
> > > >
> > > >         void *ptr = PTR;                if (!test_and_set_bit(0, FLAGS))
> > > >         clear_bit(0, FLAGS);            PTR = NULL;
> > > >         BUG_ON(!ptr);
> > > >
> > > > is racy and can hit the BUG_ON(!ptr).
> > >
> > > This seems different to the above case?
> >
> > not sure,
> >
> > > And you can hit BUG_ON with
> > > the following execution sequence:
> > >
> > > [cpu 0] clear_bit(0, FLAGS);
> > > [cpu 1] if (!test_and_set_bit(0, FLAGS))
> > > [cpu 1] PTR = NULL;
> > > [cpu 0] BUG_ON(!ptr)
> >
> > I don't understand this part... yes, we can hit this BUG_ON() without mb in
> > between, this is what I tried to say.
>
> I may miss something,

Or me... note that CPU_0 loads the global "PTR" into the local "ptr" before
clear_bit. Since you have mentioned the program order: yes, this lacks READ_ONCE()
or barrier(), but the same is true for the code in vhost_worker(). So I still
don't understand.

> but the above is the sequence that is executed
> by the processor (for each CPU, it's just the program order). So where
> do you expect to place an mb can help?

before clear_bit:

CPU_0

void *ptr = PTR;
mb();   // implies compiler barrier as well
clear_bit(0, FLAGS);
BUG_ON(!ptr);

just in case... mb() in the code above is only for illustration, we can use
smp_mb__before_atomic() + clear_bit(). Or just clear_bit_unlock(), iiuc the
one-way barrier is fine in this case.
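
Or, the clear_bit_unlock() variant (again, just a sketch):

        CPU_0

        void *ptr = PTR;
        clear_bit_unlock(0, FLAGS);     // release: the load of PTR above can't be reordered past the clear
        BUG_ON(!ptr);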


> > > In vhost code, there's a condition before the clear_bit() which sits
> > > inside llist_for_each_entry_safe():
> > >
> > > #define llist_for_each_entry_safe(pos, n, node, member)                        \
> > >         for (pos = llist_entry((node), typeof(*pos), member);                  \
> > >              member_address_is_nonnull(pos, member) &&                         \
> > >                 (n = llist_entry(pos->member.next, typeof(*n), member), true); \
> > >              pos = n)
> > >
> > > The clear_bit() is a store which is not speculated, so there's a
> > > control dependency, the store can't be executed until the condition
> > > expression is evaluated which requires pos->member.next
> > > (work->node.next) to be loaded.
> >
> > But llist_for_each_entry_safe() doesn't check "n", I mean, it is not that 
> > we have
> > something like
> >
> > n = llist_entry(...);
> > if (n)
> > clear_bit(...);
> >
> > so I do not see how can we rely on the load-store control dependency.
>
> Just to make sure we are on the same page, the condition expression is
>
> member_address_is_nonnull(pos, member) && (n =
> llist_entry(pos->member.next, typeof(*n), member), true)
>
> So it's something like:
>
>         if (work->node && (work_next = work->node->next, true))
>                 clear_bit(&work->flags);
>
> So two loads from both work->node and work->node->next, and there's a
> store which is clear_bit, then it's a load-store control dependencies?

I guess you missed the comma expression... Let me rewrite your pseudo-code
above, it is equivalent to

        if (work->node) {
                if ((work_next = work->node->next, true))
                        clear_bit(&work->flags);
        }

another rewrite:

        if (work->node) {
                work_next = work->node->next;
                if ((work, true))
                        clear_bit(&work->flags);
        }

and the final rewrite:

        if (work->node) {
                work_next = work->node->next;
                if (true)
                        clear_bit(&work->flags);
        }

so again, I do not see the load-store control dependency. Not to mention this
code lacks READ_ONCE().


If we had something like

        if (work->node) {
                work_next = READ_ONCE(work->node->next);
                if (work_next)
                        clear_bit(&work->flags);
        }

instead, then yes, we could rely on the LOAD-STORE dependency.

Oleg.


Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-05-31 Thread Oleg Nesterov
On 05/31, Jason Wang wrote:
>
> On Wed, May 31, 2023 at 3:25 PM Oleg Nesterov  wrote:
> >
> > On 05/31, Jason Wang wrote:
> > >
> > > 在 2023/5/23 20:15, Oleg Nesterov 写道:
> > > >
> > > > /* make sure flag is seen after deletion */
> > > > smp_wmb();
> > > > llist_for_each_entry_safe(work, work_next, node, node) {
> > > > clear_bit(VHOST_WORK_QUEUED, &work->flags);
> > > >
> > > >I am not sure about smp_wmb + clear_bit. Once we clear VHOST_WORK_QUEUED,
> > > >vhost_work_queue() can add this work again and change work->node->next.
> > > >
> > > >That is why we use _safe, but we need to ensure that 
> > > >llist_for_each_safe()
> > > >completes LOAD(work->node->next) before VHOST_WORK_QUEUED is cleared.
> > >
> > > This should be fine since store is not speculated, so work->node->next 
> > > needs
> > > to be loaded before VHOST_WORK_QUEUED is cleared to meet the loop 
> > > condition.
> >
> > I don't understand you. OK, to simplify, suppose we have 2 global vars
> >
> > void *PTR = something_non_null;
> > unsigned long FLAGS = -1ul;
> >
> > Now I think this code
> >
> >         CPU_0                           CPU_1
> >
> >         void *ptr = PTR;                if (!test_and_set_bit(0, FLAGS))
> >         clear_bit(0, FLAGS);            PTR = NULL;
> >         BUG_ON(!ptr);
> >
> > is racy and can hit the BUG_ON(!ptr).
>
> This seems different to the above case?

not sure,

> And you can hit BUG_ON with
> the following execution sequence:
>
> [cpu 0] clear_bit(0, FLAGS);
> [cpu 1] if (!test_and_set_bit(0, FLAGS))
> [cpu 1] PTR = NULL;
> [cpu 0] BUG_ON(!ptr)

I don't understand this part... yes, we can hit this BUG_ON() without mb in
between, this is what I tried to say.

> In vhost code, there's a condition before the clear_bit() which sits
> inside llist_for_each_entry_safe():
>
> #define llist_for_each_entry_safe(pos, n, node, member)                        \
>         for (pos = llist_entry((node), typeof(*pos), member);                  \
>              member_address_is_nonnull(pos, member) &&                         \
>                 (n = llist_entry(pos->member.next, typeof(*n), member), true); \
>              pos = n)
>
> The clear_bit() is a store which is not speculated, so there's a
> control dependency, the store can't be executed until the condition
> expression is evaluated which requires pos->member.next
> (work->node.next) to be loaded.

But llist_for_each_entry_safe() doesn't check "n", I mean, it is not that we have
something like

n = llist_entry(...);
if (n)
clear_bit(...);

so I do not see how can we rely on the load-store control dependency.

Oleg.


Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-05-31 Thread Oleg Nesterov
On 05/31, Jason Wang wrote:
>
> 在 2023/5/23 20:15, Oleg Nesterov 写道:
> >
> > /* make sure flag is seen after deletion */
> > smp_wmb();
> > llist_for_each_entry_safe(work, work_next, node, node) {
> > clear_bit(VHOST_WORK_QUEUED, &work->flags);
> >
> >I am not sure about smp_wmb + clear_bit. Once we clear VHOST_WORK_QUEUED,
> >vhost_work_queue() can add this work again and change work->node->next.
> >
> >That is why we use _safe, but we need to ensure that llist_for_each_safe()
> >completes LOAD(work->node->next) before VHOST_WORK_QUEUED is cleared.
>
> This should be fine since store is not speculated, so work->node->next needs
> to be loaded before VHOST_WORK_QUEUED is cleared to meet the loop condition.

I don't understand you. OK, to simplify, suppose we have 2 global vars

void *PTR = something_non_null;
unsigned long FLAGS = -1ul;

Now I think this code

        CPU_0                           CPU_1

        void *ptr = PTR;                if (!test_and_set_bit(0, FLAGS))
        clear_bit(0, FLAGS);            PTR = NULL;
        BUG_ON(!ptr);

is racy and can hit the BUG_ON(!ptr).

I guess it is fine on x86, but in general you need smp_mb__before_atomic()
before clear_bit(), or clear_bit_unlock().

> > __set_current_state(TASK_RUNNING);
> >
> >Why do we set TASK_RUNNING inside the loop? Does this mean that work->fn()
> >can return with current->state != RUNNING ?
>
> It is because the state was set to TASK_INTERRUPTIBLE in the beginning of
> the loop, otherwise it might be a side effect while executing work->fn().

Again, I don't understand you. So let me repeat: can work->fn() return with
current->_state != TASK_RUNNING ? If not (and I'd say it should not), you can
do __set_current_state(TASK_RUNNING) once, before llist_for_each_entry_safe().
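
IOW, a rough sketch of what I mean:

        node = llist_reverse_order(node);
        __set_current_state(TASK_RUNNING);      /* once, before the loop */

        llist_for_each_entry_safe(work, work_next, node, node) {
                clear_bit(VHOST_WORK_QUEUED, &work->flags);
                work->fn(work);
                cond_resched();
        }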

> >Now the main question. Whatever we do, SIGKILL/SIGSTOP/etc can come right
> >before we call work->fn(). Is it "safe" to run this callback with
> >signal_pending() or fatal_signal_pending() ?
>
> It looks safe since:
>
> 1) vhost hold refcnt of the mm
> 2) release will sync with the worker

Well, that's not what I asked... nevermind, please forget.

Thanks.

Oleg.


Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-05-30 Thread Oleg Nesterov
On 05/30, Christian Brauner wrote:
>
> On Mon, May 29, 2023 at 01:19:39PM +0200, Oleg Nesterov wrote:
> >
> > If we want CLONE_THREAD, I think vhost_worker() should exit after 
> > get_signal()
> > returns SIGKILL. Perhaps it should "disable" vhost_work_queue() somehow and
>
> Yes, and that's what I proposed at the beginning of this tread.

Yes. And you know, I misunderstood you even though I had the same feeling from the
very beginning too (except I wasn't, and still am not, sure CLONE_THREAD is a good
idea). Because... OK, I think it doesn't matter now ;)

Mike, Eric, et al.

I'll try to (at least) read your emails tomorrow. Another day spent on redhat
bugzillas.

Oleg.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-05-29 Thread Oleg Nesterov
On 05/29, Oleg Nesterov wrote:
>
> Mike, sorry, I don't understand your email.
>
> Just in case, let me remind I know nothing about drivers/vhost/

And... this is slightly off-topic but you didn't answer my previous
question and I am just curious. Do you agree that, even if we forget
about CLONE_THREAD/signals, vhost_worker() needs fixes anyway because
it can race with vhost_work_queue() ?

Oleg.



Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-05-29 Thread Oleg Nesterov
Mike, sorry, I don't understand your email.

Just in case, let me remind I know nothing about drivers/vhost/

On 05/29, michael.chris...@oracle.com wrote:
>
> On 5/29/23 6:19 AM, Oleg Nesterov wrote:
> > On 05/27, Eric W. Biederman wrote:
> >>
> >> Looking forward I don't see not asking the worker threads to stop
> >> for the coredump right now causing any problems in the future.
> >> So I think we can use this to resolve the coredump issue I spotted.
> >
> > But we have almost the same problem with exec.
> >
> > Execing thread will wait for vhost_worker() while vhost_worker will wait for
> > .release -> vhost_task_stop().
>
> For this type of case, what is the goal or correct behavior in the end?
>
> When get_signal returns true we can code things like you mention below

and you have mentioned in the next email that you have already coded something
like this, so perhaps we can delay the further discussions until you send the
new code?

> and
> clean up the task_struct.

Hmm... If we use CLONE_THREAD the exiting vhost_worker() will auto-reap itself,

> However, we now have a non-functioning vhost device
> open and just sitting around taking up memory and it can't do any IO.

can't comment, see above.

> For this type of case, do we expect just not to crash/hang, or was this new
> exec'd thread suppose to be able to use the vhost device?

I just tried to point out that (unless I missed something) there are more corner
cases, not just coredump.

> > Or suppose that vhost_worker's sub-thread forks a child with CLONE_FILES...
>
> You mean the vhost_task's task/thread doing a function that does a 
> copy_process
> right?

I meant that the vhost_task's sub-thread can do sys_clone(CLONE_FILES) from
userspace. Yes, this implies copy_process() but I still can't understand you.

> That type of thing is not needed.

Do you mean that userspace should never do this? But this doesn't matter, the
kernel should handle this case anyway.

Or what?

In short let me repeat that I don't understand you and - of course! - quite
possibly I missed something.

Oleg.



Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-05-29 Thread Oleg Nesterov
On 05/27, Eric W. Biederman wrote:
>
> Looking forward I don't see not asking the worker threads to stop
> for the coredump right now causing any problems in the future.
> So I think we can use this to resolve the coredump issue I spotted.

But we have almost the same problem with exec.

Execing thread will wait for vhost_worker() while vhost_worker will wait for
.release -> vhost_task_stop().

And even O_CLOEXEC won't help, do_close_on_exec() is called after de_thread().

Or suppose that vhost_worker's sub-thread forks a child with CLONE_FILES...

If we want CLONE_THREAD, I think vhost_worker() should exit after get_signal()
returns SIGKILL. Perhaps it should "disable" vhost_work_queue() somehow and
flush the pending works on ->work_list before exit, I dunno. But imo it should
not wait for the final fput().
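
Very roughly, just to illustrate the idea (the flag name and the "disable"
mechanism below are made up, this is not a real patch):

        /* in vhost_worker(), after get_signal() returns SIGKILL */
        set_bit(VHOST_WORKER_KILLED, &worker->flags);   /* hypothetical flag,
                                                           vhost_work_queue() would
                                                           check it and refuse to queue */
        /* flush whatever was already queued */
        node = llist_del_all(&worker->work_list);
        llist_for_each_entry_safe(work, work_next, node, node) {
                clear_bit(VHOST_WORK_QUEUED, &work->flags);
                work->fn(work);
        }
        /* and exit without waiting for the final fput() */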

Oleg.



Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-05-25 Thread Oleg Nesterov
On 05/24, Eric W. Biederman wrote:
>
> Oleg Nesterov  writes:
>
> > Yes, but probably SIGABRT/exit doesn't really differ from SIGKILL wrt
> > vhost_worker().
>
> Actually I think it reveals that exiting with SIGABRT will cause
> a deadlock.
>
> coredump_wait will wait for all of the threads to reach
> coredump_task_exit.  Meanwhile vhost_worker is waiting for
> all of the other threads to reach exit_files to close their
> file descriptors.

Indeed, I didn't think about this.


So why do we actually need CLONE_THREAD? Can't vhost_worker() be a kernel thread?

kthread_create() won't be convenient, but how about kernel_thread()? It inherits
mm/cgroups/rlimits/etc, kthread_stop() should work just fine.

Oleg.



Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-05-24 Thread Oleg Nesterov
On 05/23, Eric W. Biederman wrote:
>
> I want to point out that we need to consider not just SIGKILL, but
> SIGABRT that causes a coredump, as well as the process performing
> an ordinary exit(2).  All of which will cause get_signal to return
> SIGKILL in this context.

Yes, but probably SIGABRT/exit doesn't really differ from SIGKILL wrt
vhost_worker().

> It is probably not the worst thing in the world, but what this means
> is now if you pass a copy of the vhost file descriptor to another
> process the vhost_worker will persist, and thus the process will persist
> until that copy of the file descriptor is closed.

Hadn't thought about it.

I am fighting with internal bugzillas today, will try to write another
email tomorrow.

But before that, I would like to have an answer to my "main" question in
my previous email. Otherwise I am still not sure I understand what exactly
we need to fix.

Oleg.



Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-05-23 Thread Oleg Nesterov
On 05/22, Oleg Nesterov wrote:
>
> Right now I think that "int dead" should die,

No, probably we shouldn't call get_signal() if we have already dequeued SIGKILL.

> but let me think tomorrow.

Maybe something like this... I don't like it but I can't suggest anything better
right now.

        bool killed = false;

        for (;;) {
                ...

                node = llist_del_all(&worker->work_list);
                if (!node) {
                        schedule();
                        /*
                         * When we get a SIGKILL our release function will
                         * be called. That will stop new IOs from being queued
                         * and check for outstanding cmd responses. It will then
                         * call vhost_task_stop to tell us to return and exit.
                         */
                        if (signal_pending(current)) {
                                struct ksignal ksig;

                                if (!killed)
                                        killed = get_signal(&ksig);

                                clear_thread_flag(TIF_SIGPENDING);
                        }

                        continue;
                }

---
But let me ask a couple of questions. Let's forget this patch, let's look at the
current code:

        node = llist_del_all(&worker->work_list);
        if (!node)
                schedule();

        node = llist_reverse_order(node);
        ... process works ...

To me this looks a bit confusing. Shouldn't we do

        if (!node) {
                schedule();
                continue;
        }

just to make the code a bit more clear? If node == NULL then
llist_reverse_order() and llist_for_each_entry_safe() will do nothing.
But this is minor.



        /* make sure flag is seen after deletion */
        smp_wmb();
        llist_for_each_entry_safe(work, work_next, node, node) {
                clear_bit(VHOST_WORK_QUEUED, &work->flags);

I am not sure about smp_wmb + clear_bit. Once we clear VHOST_WORK_QUEUED,
vhost_work_queue() can add this work again and change work->node->next.

That is why we use _safe, but we need to ensure that llist_for_each_safe()
completes LOAD(work->node->next) before VHOST_WORK_QUEUED is cleared.

So it seems that smp_wmb() can't help and should be removed, instead we need

        llist_for_each_entry_safe(...) {
                smp_mb__before_atomic();
                clear_bit(VHOST_WORK_QUEUED, &work->flags);

Also, if the work->fn pointer is not stable, we should read it before
smp_mb__before_atomic() as well.

No?
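
IOW, a sketch (assuming ->fn can indeed change once VHOST_WORK_QUEUED is cleared):

        llist_for_each_entry_safe(work, work_next, node, node) {
                vhost_work_fn_t fn = work->fn;  /* read ->fn before the barrier */

                smp_mb__before_atomic();
                clear_bit(VHOST_WORK_QUEUED, &work->flags);
                fn(work);
                ...
        }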


__set_current_state(TASK_RUNNING);

Why do we set TASK_RUNNING inside the loop? Does this mean that work->fn()
can return with current->state != RUNNING ?


work->fn(work);

Now the main question. Whatever we do, SIGKILL/SIGSTOP/etc can come right
before we call work->fn(). Is it "safe" to run this callback with
signal_pending() or fatal_signal_pending() ?


Finally. I never looked into drivers/vhost/ before so I don't understand
this code at all, but let me ask anyway... Can we change vhost_dev_flush()
to run the pending callbacks rather than wait for vhost_worker() ?
I guess we can't, ->mm won't be correct, but can you confirm?

Oleg.



Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-05-22 Thread Oleg Nesterov
On 05/22, Mike Christie wrote:
>
> On 5/22/23 7:30 AM, Oleg Nesterov wrote:
> >> +  /*
> >> +   * When we get a SIGKILL our release function will
> >> +   * be called. That will stop new IOs from being queued
> >> +   * and check for outstanding cmd responses. It will then
> >> +   * call vhost_task_stop to tell us to return and exit.
> >> +   */
> >
> > But who will call the release function / vhost_task_stop() and when this
> > will happen after this thread gets SIGKILL ?
>
> When we get a SIGKILL, the thread that owns the device/vhost_task will
> also exit since it's the same thread group and it does:
>
> do_exit -> exit_files -> put_files_struct -> close_files -> fput

Ah. thanks. I confused CLONE_FS in vhost_task_create() with CLONE_FILES.

> > Also. Suppose that vhost_worker() dequeues SIGKILL and clears 
> > TIF_SIGPENDING.
> >
> > SIGSTOP, PTRACE_INTERRUPT, freezer can come and set TIF_SIGPENDING again.
> > In this case the main for (;;) loop will spin without sleeping until
> > vhost_task_should_stop() becomes true?
>
> I see. So I either have to be able to call get_signal after SIGKILL or
> at this time work like a kthread and ignore signals like a
>
> if (dead && signal_pending())
>   flush_signals()
> ?

Right now I think that "int dead" should die, and you should simply do
get_signal() + clear(SIGPENDING) if signal_pending() == T , but let me
think tomorrow.

Oleg.



Re: [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set

2023-05-22 Thread Oleg Nesterov
On 05/18, Eric W. Biederman wrote:
>
>  void recalc_sigpending(void)
>  {
> -   if (!recalc_sigpending_tsk(current) && !freezing(current))
> +   if ((!recalc_sigpending_tsk(current) && !freezing(current)) ||
> +   ((current->signal->flags & SIGNAL_GROUP_EXIT) &&
> +   !__fatal_signal_pending(current)))
> clear_thread_flag(TIF_SIGPENDING);
>  
>  }
> @@ -1043,6 +1045,13 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
>  * This signal will be fatal to the whole group.
>  */
> if (!sig_kernel_coredump(sig)) {
> +   /*
> +* The signal is being short circuit delivered
> +* don't it pending.
> +*/
> +   if (type != PIDTYPE_PID) {
> +   sigdelset(&p->signal->shared_pending, sig);
> +
> /*
>  * Start a group exit and wake everybody up.
>  * This way we don't have other threads

Eric, sorry. I fail to understand this patch.

How can it help? And whom?

Perhaps we can discuss it in the context of the new series from Mike?

Oleg.



Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-05-22 Thread Oleg Nesterov
Confused, please help...

On 05/21, Mike Christie wrote:
>
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -338,6 +338,7 @@ static int vhost_worker(void *data)
>   struct vhost_worker *worker = data;
>   struct vhost_work *work, *work_next;
>   struct llist_node *node;
> + bool dead = false;
>
>   for (;;) {
>   /* mb paired w/ kthread_stop */
> @@ -349,8 +350,22 @@ static int vhost_worker(void *data)
>   }
>
> node = llist_del_all(&worker->work_list);
> - if (!node)
> + if (!node) {
>   schedule();
> + /*
> +  * When we get a SIGKILL our release function will
> +  * be called. That will stop new IOs from being queued
> +  * and check for outstanding cmd responses. It will then
> +  * call vhost_task_stop to tell us to return and exit.
> +  */

But who will call the release function / vhost_task_stop() and when this
will happen after this thread gets SIGKILL ?

> + if (!dead && signal_pending(current)) {
> + struct ksignal ksig;
> +
> + dead = get_signal(&ksig);
> + if (dead)
> + clear_thread_flag(TIF_SIGPENDING);

If you do clear_thread_flag(TIF_SIGPENDING), then why do we need 1/3 ?


Also. Suppose that vhost_worker() dequeues SIGKILL and clears TIF_SIGPENDING.

SIGSTOP, PTRACE_INTERRUPT, freezer can come and set TIF_SIGPENDING again.
In this case the main for (;;) loop will spin without sleeping until
vhost_task_should_stop() becomes true?

Oleg.



Re: [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set

2023-05-18 Thread Oleg Nesterov
On 05/18, Christian Brauner wrote:
>
> On Thu, May 18, 2023 at 08:08:10PM +0200, Oleg Nesterov wrote:
> > On 05/18, Christian Brauner wrote:
> > >
> > > Yeah, but these are issues that exist with PF_IO_WORKER then too
> >
> > This was my thought too but I am starting to think I was wrong.
> >
> > Of course I don't understand the code in io_uring/ but it seems
> > that it always breaks the IO loops if get_signal() returns SIGKILL.
>
> Yeah, it does and I think Mike has a point that vhost could be running
> into an issue here that io_uring currently does avoid. But I don't think
> we should rely on that.

So what do you propose?

Unless (quite possibly) I am confused again, unlike io_uring vhost can't
tolerate signal_pending() == T in the cleanup-after-SIGKILL paths?

Oleg.



Re: [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set

2023-05-18 Thread Oleg Nesterov
On 05/18, Christian Brauner wrote:
>
> Yeah, but these are issues that exist with PF_IO_WORKER then too

This was my thought too but I am starting to think I was wrong.

Of course I don't understand the code in io_uring/ but it seems
that it always breaks the IO loops if get_signal() returns SIGKILL.

Oleg.



Re: [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set

2023-05-18 Thread Oleg Nesterov
On 05/18, Mike Christie wrote:
>
> On 5/18/23 11:25 AM, Oleg Nesterov wrote:
> > I too do not understand the 1st change in this patch ...
> >
> > On 05/18, Mike Christie wrote:
> >>
> >> In the other patches we do:
> >>
> >> if (get_signal(ksig))
> >>start_exit_cleanup_by_stopping_newIO()
> >>flush running IO()
> >>exit()
> >>
> >> But to do the flush running IO() part of this I need to wait for it so
> >> that's why I wanted to be able to dequeue the SIGKILL and clear the
> >> TIF_SIGPENDING bit.
> >
> > But get_signal() will do what you need, dequeue SIGKILL and clear 
> > SIGPENDING ?
> >
> > if ((signal->flags & SIGNAL_GROUP_EXIT) ||
> >  signal->group_exec_task) {
> > clear_siginfo(&ksig->info);
> > ksig->info.si_signo = signr = SIGKILL;
> > sigdelset(&current->pending.signal, SIGKILL);
> >
> > this "dequeues" SIGKILL,

OOPS. this doesn't remove SIGKILL from current->signal->shared_pending

> >
> > trace_signal_deliver(SIGKILL, SEND_SIG_NOINFO,
> > &sighand->action[SIGKILL - 1]);
> > recalc_sigpending();
> >
> > this clears TIF_SIGPENDING.

No, I was wrong, recalc_sigpending() won't clear TIF_SIGPENDING if
SIGKILL is in signal->shared_pending

> I see what you guys meant. TIF_SIGPENDING isn't getting cleared.
> I'll dig into why.

See above, sorry for confusion.



And again, there is another problem with SIGSTOP. To simplify, suppose
a PF_IO_WORKER thread does something like

while (signal_pending(current))
get_signal(...);

this will loop forever if (SIGNAL_GROUP_EXIT || group_exec_task) and
SIGSTOP is pending.

Oleg.



Re: [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set

2023-05-18 Thread Oleg Nesterov
I too do not understand the 1st change in this patch ...

On 05/18, Mike Christie wrote:
>
> In the other patches we do:
>
> if (get_signal(ksig))
>   start_exit_cleanup_by_stopping_newIO()
>   flush running IO()
>   exit()
>
> But to do the flush running IO() part of this I need to wait for it so
> that's why I wanted to be able to dequeue the SIGKILL and clear the
> TIF_SIGPENDING bit.

But get_signal() will do what you need, dequeue SIGKILL and clear SIGPENDING ?

if ((signal->flags & SIGNAL_GROUP_EXIT) ||
 signal->group_exec_task) {
clear_siginfo(&ksig->info);
ksig->info.si_signo = signr = SIGKILL;
sigdelset(&current->pending.signal, SIGKILL);

this "dequeues" SIGKILL,

trace_signal_deliver(SIGKILL, SEND_SIG_NOINFO,
&sighand->action[SIGKILL - 1]);
recalc_sigpending();

this clears TIF_SIGPENDING.

> Or I don't need this specifically. In patch 0/8 I said I knew you guys
> would not like it :) If I just have a:
>
> if (fatal_signal())
>   clear_fatal_signal()

see above...


Well... I think this code is actually wrong if SIGSTOP is pending and
the task is PF_IO_WORKER, but this is also true for io-threads so we can
discuss this separately.

Oleg.



Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads

2023-05-17 Thread Oleg Nesterov
On 05/16, Eric W. Biederman wrote:
>
> Oleg Nesterov  writes:
>
> >> There is this bit in complete_signal when SIGKILL is delivered to any
> >> thread in the process.
> >>
> >>t = p;
> >>do {
> >>task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
> >>sigaddset(&t->pending.signal, SIGKILL);
> >>signal_wake_up(t, 1);
> >>} while_each_thread(p, t);
> >
> > That is why the latest version adds try_set_pending_sigkill(). No, no,
> > it is not that I think this is a good idea.
>
> I see that try_set_pending_sigkill in the patch now.
>
> That try_set_pending_sigkill just keeps the process from reporting
> that it has exited, and extend the process exit indefinitely.
>
> SIGNAL_GROUP_EXIT has already been set, so the KILL signal was
> already delivered and the process is exiting.

Agreed, that is why I said I don't think try_set_pending_sigkill() is
a good idea.

And again, the same is true for the threads created by
create_io_thread(). get_signal() from io_uring/ can dequeue a pending
SIGKILL and return, but that is all.

> >> For clarity that sigaddset(>pending.signal, SIGKILL);  Really isn't
> >> setting SIGKILL pending,
> >
> > Hmm. it does? Nevermind.
>
> The point is that what try_set_pending_sigkill in the patch is doing is
> keeping the "you are dead exit now" flag, from being set.
>
> That flag is what fatal_signal_pending always tests, because we can only
> know if a fatal signal is pending if we have performed short circuit
> delivery on the signal.
>
> The result is the effects of the change are mostly what people expect.
> The difference the semantics being changed aren't what people think they
> are.
>
> AKA process exit is being ignored for the thread, not that SIGKILL is
> being blocked.

Sorry, I don't understand. I just tried to say that
sigaddset(>pending.signal, SIGKILL) really sets SIGKILL pending.
Nevermind.

> > Although I never understood this logic.

I meant I never really liked how io-threads play with signals,

> > I can't even understand the usage
> > of lower_32_bits() in create_io_thread().
>
> As far as I can tell lower_32_bits(flags) is just defensive programming

Cough. but this is ugly. Or I missed something.

> or have just populated .flags directly.

Exactly,

> Then .exit_signal
> could have been set to 0.

Exactly.

---
OK. It doesn't matter. I tried to read the whole thread and got lost.

IIUC, Mike is going to send the next version? So I think we can delay
the further discussions until then.

Oleg.



Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads

2023-05-16 Thread Oleg Nesterov
On 05/16, Eric W. Biederman wrote:
>
> A kernel thread can block SIGKILL and that is supported.
>
> For a thread that is part of a process you can't block SIGKILL when the
> task is part of a user mode process.

Or SIGSTOP. Another thread can call do_signal_stop()->signal_wake_up/etc.

> There is this bit in complete_signal when SIGKILL is delivered to any
> thread in the process.
>
>   t = p;
>   do {
>   task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
>   sigaddset(&t->pending.signal, SIGKILL);
>   signal_wake_up(t, 1);
>   } while_each_thread(p, t);

That is why the latest version adds try_set_pending_sigkill(). No, no,
it is not that I think this is a good idea.

> For clarity that sigaddset(>pending.signal, SIGKILL);  Really isn't
> setting SIGKILL pending,

Hmm. it does? Nevermind.

> The important part of that code is that SIGNAL_GROUP_EXIT gets set.
> That indicates the entire process is being torn down.

Yes. and the same is true for io-thread even if it calls get_signal()
and dequeues SIGKILL and clears TIF_SIGPENDING.

> but in that case the vhost logic needs to act like a process, just
> like io_uring does.

confused... create_io_thread() creates a sub-thread too?

Although I never understood this logic. I can't even understand the usage
of lower_32_bits() in create_io_thread().

Oleg.



Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads

2023-05-16 Thread Oleg Nesterov
On 05/15, Mike Christie wrote:
>
> --- a/kernel/vhost_task.c
> +++ b/kernel/vhost_task.c
> @@ -75,13 +75,13 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), 
> void *arg,
>const char *name)
>  {
>   struct kernel_clone_args args = {
> - .flags  = CLONE_FS | CLONE_UNTRACED | CLONE_VM,
> + .flags  = CLONE_FS | CLONE_UNTRACED | CLONE_VM |
> +   CLONE_THREAD | CLONE_SIGHAND,

I am looking at 6/8 on https://lore.kernel.org/lkml/ ...

with this change kernel_wait4() in vhost_task_stop() won't work?

Oleg.



Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads

2023-05-16 Thread Oleg Nesterov
On 05/15, Mike Christie wrote:
>
> Oleg and Christian,
>
>
> Below is an updated patch that doesn't check for PF_USER_WORKER in the
> signal.c code and instead will check for if we have blocked the signal.

Looks like I need to read the whole series... will try tomorrow.

> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2336,8 +2336,15 @@ __latent_entropy struct task_struct *copy_process(
>   p->flags &= ~PF_KTHREAD;
>   if (args->kthread)
>   p->flags |= PF_KTHREAD;
> - if (args->user_worker)
> + if (args->user_worker) {
> + /*
> +  * User worker are similar to io_threads but they do not
> +  * support signals and cleanup is driven via another kernel
> +  * interface so even SIGKILL is blocked.
> +  */
>   p->flags |= PF_USER_WORKER;
> + siginitsetinv(&p->blocked, 0);

I never liked the fact that io-threads block the signals, this adds
another precedent... OK, this needs another discussion.

> +static void try_set_pending_sigkill(struct task_struct *t)
> +{
> + /*
> +  * User workers don't support signals and their exit is driven through
> +  * their kernel layer, so by default block even SIGKILL.
> +  */
> + if (sigismember(&t->blocked, SIGKILL))
> + return;
> +
> + sigaddset(&t->pending.signal, SIGKILL);
> + signal_wake_up(t, 1);
> +}

so why do you need this? to avoid fatal_signal_pending() or signal_pending() ?

In the latter case this change is not enough.

Oleg.



Re: [PATCH V5] x86 spinlock: Fix memory corruption on completing completions

2015-02-15 Thread Oleg Nesterov
Well, I regret I mentioned the lack of barrier after enter_slowpath ;)

On 02/15, Raghavendra K T wrote:

 @@ -46,7 +46,8 @@ static __always_inline bool static_key_false(struct 
 static_key *key);

  static inline void __ticket_enter_slowpath(arch_spinlock_t *lock)
  {
 - set_bit(0, (volatile unsigned long *)&lock->tickets.tail);
 + set_bit(0, (volatile unsigned long *)&lock->tickets.head);
 + barrier();
  }

Because this barrier() looks really confusing.

Firstly, it is equally unneeded on x86. At the same time, it cannot help.
We need a memory barrier() between set_bit(SLOWPATH) and READ_ONCE(head)
to avoid the race with spin_unlock().

So I think you should replace it with smp_mb__after_atomic() or remove it.
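
IOW, a sketch (if the barrier is wanted at all):

        static inline void __ticket_enter_slowpath(arch_spinlock_t *lock)
        {
                set_bit(0, (volatile unsigned long *)&lock->tickets.head);
                /* order set_bit(SLOWPATH) before the subsequent READ_ONCE(head) */
                smp_mb__after_atomic();
        }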



Other than that I believe this version is correct. So I won't insist, this
is cosmetic after all.

Oleg.



Re: [PATCH V4] x86 spinlock: Fix memory corruption on completing completions

2015-02-15 Thread Oleg Nesterov
On 02/15, Raghavendra K T wrote:

 On 02/13/2015 09:02 PM, Oleg Nesterov wrote:

 @@ -772,7 +773,8 @@ __visible void kvm_lock_spinning(struct arch_spinlock *lock, __ticket_t want)
  * check again make sure it didn't become free while
  * we weren't looking.
  */
 -   if (ACCESS_ONCE(lock->tickets.head) == want) {
 +   head = READ_ONCE(lock->tickets.head);
 +   if (__tickets_equal(head, want)) {
 add_stats(TAKEN_SLOW_PICKUP, 1);
 goto out;

 This is off-topic, but with or without this change perhaps it makes sense
 to add smp_mb__after_atomic(). It is nop on x86, just to make this code
 more understandable for those (for me ;) who can never remember even the
 x86 rules.

 Hope you meant it for add_stat.

No, no. We need a barrier between set_bit(SLOWPATH) and tickets_equal().

Yes, on x86 set_bit() can't be reordered so smp_mb_*_atomic() is nop, but
it can make the code more understandable.

 yes  smp_mb__after_atomic() would be
 harmless barrier() in x86. Did not add this in V5 as you thought, but this
 made me look at slowpath_enter code and added an explicit barrier()
 there :).

Well. it looks even more confusing than a lack of barrier ;)

Oleg.



Re: [PATCH V5] x86 spinlock: Fix memory corruption on completing completions

2015-02-15 Thread Oleg Nesterov
On 02/15, Raghavendra K T wrote:

 * Raghavendra K T raghavendra...@linux.vnet.ibm.com [2015-02-15 11:25:44]:

 Resending the V5 with smp_mb__after_atomic() change without bumping up
 revision

Reviewed-by: Oleg Nesterov o...@redhat.com


Of course, this needs the acks from maintainers. And I agree that SLOWPATH
in .head makes xadd() in unlock() unavoidable. However I do not see how we
can avoid the locked inc if we want to eliminate read-after-unlock.



Re: [PATCH V4] x86 spinlock: Fix memory corruption on completing completions

2015-02-13 Thread Oleg Nesterov
On 02/13, Raghavendra K T wrote:

 @@ -164,7 +161,7 @@ static inline int arch_spin_is_locked(arch_spinlock_t *lock)
  {
   struct __raw_tickets tmp = READ_ONCE(lock->tickets);

 - return tmp.tail != tmp.head;
 + return tmp.tail != (tmp.head & ~TICKET_SLOWPATH_FLAG);
  }

Well, this can probably use __tickets_equal() too. But this is cosmetic.

It seems that arch_spin_is_contended() should be fixed with this change,

(__ticket_t)(tmp.tail - tmp.head) > TICKET_LOCK_INC

can be true because of TICKET_SLOWPATH_FLAG in .head, even if it is actually
unlocked. And the (__ticket_t) typecast looks unnecessary, it only adds more
confusion, but this is cosmetic too.



 @@ -772,7 +773,8 @@ __visible void kvm_lock_spinning(struct arch_spinlock *lock, __ticket_t want)
   * check again make sure it didn't become free while
   * we weren't looking.
   */
 - if (ACCESS_ONCE(lock->tickets.head) == want) {
 + head = READ_ONCE(lock->tickets.head);
 + if (__tickets_equal(head, want)) {
   add_stats(TAKEN_SLOW_PICKUP, 1);
   goto out;

This is off-topic, but with or without this change perhaps it makes sense
to add smp_mb__after_atomic(). It is nop on x86, just to make this code
more understandable for those (for me ;) who can never remember even the
x86 rules.

Oleg.



Re: [PATCH V4] x86 spinlock: Fix memory corruption on completing completions

2015-02-13 Thread Oleg Nesterov
On 02/13, Oleg Nesterov wrote:

 On 02/13, Raghavendra K T wrote:
 
  @@ -772,7 +773,8 @@ __visible void kvm_lock_spinning(struct arch_spinlock *lock, __ticket_t want)
   * check again make sure it didn't become free while
   * we weren't looking.
   */
  -   if (ACCESS_ONCE(lock->tickets.head) == want) {
  +   head = READ_ONCE(lock->tickets.head);
  +   if (__tickets_equal(head, want)) {
  add_stats(TAKEN_SLOW_PICKUP, 1);
  goto out;

 This is off-topic, but with or without this change perhaps it makes sense
 to add smp_mb__after_atomic(). It is nop on x86, just to make this code
 more understandable for those (for me ;) who can never remember even the
 x86 rules.

Not that I think you should do this in v5, so please ignore.

Oleg.



Re: [PATCH V3] x86 spinlock: Fix memory corruption on completing completions

2015-02-12 Thread Oleg Nesterov
Damn, sorry for noise, forgot to mention...

On 02/12, Raghavendra K T wrote:

 +static inline void __ticket_check_and_clear_slowpath(arch_spinlock_t *lock,
 + __ticket_t head)
 +{
 + if (head & TICKET_SLOWPATH_FLAG) {
 + arch_spinlock_t old, new;
 +
 + old.tickets.head = head;
 + new.tickets.head = head & ~TICKET_SLOWPATH_FLAG;
 + old.tickets.tail = new.tickets.head + TICKET_LOCK_INC;
 + new.tickets.tail = old.tickets.tail;
 +
 + /* try to clear slowpath flag when there are no contenders */
 + cmpxchg(&lock->head_tail, old.head_tail, new.head_tail);
 + }
 +}

...

 +clear_slowpath:
 + if (TICKET_SLOWPATH_FLAG)
 + __ticket_check_and_clear_slowpath(lock, inc.head);

I think you can remove this if (TICKET_SLOWPATH_FLAG) check. If it is
defined as zero, gcc should optimize out the code under if (head & 0).

But this is purely cosmetic and subjective, feel free to ignore.

Oleg.



Re: [PATCH V3] x86 spinlock: Fix memory corruption on completing completions

2015-02-12 Thread Oleg Nesterov
On 02/12, Raghavendra K T wrote:

 @@ -191,8 +189,7 @@ static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
* We need to check unlocked in a loop, tmp.head == head
* can be false positive because of overflow.
*/
 - if (tmp.head == (tmp.tail & ~TICKET_SLOWPATH_FLAG) ||
 - tmp.head != head)
 + if (__tickets_equal(tmp.head, tmp.tail) || tmp.head != head)
   break;

Ah, it seems that tmp.head != head should be turned into
!__tickets_equal(), no?

Suppose that TICKET_SLOWPATH_FLAG is set after the first ACCESS_ONCE(head),
then tmp.head != head will be true before the first unlock we are waiting
for.

And perhaps you can turn these ACCESS_ONCE into READ_ONCE as well.
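IOW, the wait loop could look like this (only to illustrate what I mean, not a
patch):

	for (;;) {
		struct __raw_tickets tmp = READ_ONCE(lock->tickets);
		/*
		 * We need to check "unlocked" in a loop, tmp.head == head
		 * can be false positive because of overflow.
		 */
		if (__tickets_equal(tmp.head, tmp.tail) ||
		    !__tickets_equal(tmp.head, head))
			break;

		cpu_relax();
	}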

Oleg.



Re: [PATCH V3] x86 spinlock: Fix memory corruption on completing completions

2015-02-12 Thread Oleg Nesterov
On 02/12, Raghavendra K T wrote:

 @@ -772,7 +773,8 @@ __visible void kvm_lock_spinning(struct arch_spinlock *lock, __ticket_t want)
* check again make sure it didn't become free while
* we weren't looking.
*/
 - if (ACCESS_ONCE(lock->tickets.head) == want) {
 + head = ACCESS_ONCE(lock->tickets.head);
 + if (__tickets_equal(head, want)) {
   add_stats(TAKEN_SLOW_PICKUP, 1);

While at it, perhaps it makes sense to s/ACCESS_ONCE/READ_ONCE/ but this
is cosmetic.

We also need to change another user of enter_slow_path, xen_lock_spinning()
in arch/x86/xen/spinlock.c.

Other than that looks correct at first glance... but this is up to
maintainers.

Oleg.



Re: [PATCH] x86 spinlock: Fix memory corruption on completing completions

2015-02-12 Thread Oleg Nesterov
On 02/11, Jeremy Fitzhardinge wrote:

 On 02/11/2015 09:24 AM, Oleg Nesterov wrote:
  I agree, and I have to admit I am not sure I fully understand why
  unlock uses the locked add. Except we need a barrier to avoid the race
  with the enter_slowpath() users, of course. Perhaps this is the only
  reason?

 Right now it needs to be a locked operation to prevent read-reordering.
 x86 memory ordering rules state that all writes are seen in a globally
 consistent order, and are globally ordered wrt reads *on the same
 addresses*, but reads to different addresses can be reordered wrt writes.

 So, if the unlocking add were not a locked operation:

 __add(&lock->tickets.head, TICKET_LOCK_INC);  /* not locked */

 if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG))
 __ticket_unlock_slowpath(lock, prev);

 Then the read of lock->tickets.tail can be reordered before the unlock,
 which introduces a race:

Yes, yes, thanks, but this is what I meant. We need a barrier. Even if
"every store is a release" as Linus mentioned.

 This *might* be OK, but I think it's on dubious ground:

 __add(&lock->tickets.head, TICKET_LOCK_INC);  /* not locked */

   /* read overlaps write, and so is ordered */
 if (unlikely(lock->head_tail & (TICKET_SLOWPATH_FLAG << TICKET_SHIFT)))
 __ticket_unlock_slowpath(lock, prev);

 because I think Intel and AMD differed in interpretation about how
 overlapping but different-sized reads & writes are ordered (or it simply
 isn't architecturally defined).

can't comment, I simply do not know how the hardware works.

 If the slowpath flag is moved to head, then it would always have to be
 locked anyway, because it needs to be atomic against other CPU's RMW
 operations setting the flag.

Yes, this is true.

But again, if we want to avoid the read-after-unlock, we need to update
this lock and read SLOWPATH atomically, it seems that we can't avoid the
locked insn.
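To spell it out (purely illustrative, not meant as real code):

	/* if the release and the SLOWPATH check were two separate steps ... */
	head = READ_ONCE(lock->tickets.head);	/* a waiter can set SLOWPATH
						   right after this read */
	__add(&lock->tickets.head, TICKET_LOCK_INC);	/* not locked */
	if (head & TICKET_SLOWPATH_FLAG)	/* stale: the kick is missed */
		__ticket_unlock_kick(lock, ...);

	/* ... and reading ->tickets.head after the __add() instead would be the
	 * read-after-unlock we want to get rid of. Hence the xadd(): the release
	 * and the read of the flag in a single locked insn. */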

Oleg.



Re: [PATCH] x86 spinlock: Fix memory corruption on completing completions

2015-02-11 Thread Oleg Nesterov
On 02/10, Jeremy Fitzhardinge wrote:

 On 02/10/2015 05:26 AM, Oleg Nesterov wrote:
  On 02/10, Raghavendra K T wrote:
  Unfortunately xadd could result in head overflow as tail is high.
 
  The other option was repeated cmpxchg which is bad I believe.
  Any suggestions?
  Stupid question... what if we simply move SLOWPATH from .tail to .head?
  In this case arch_spin_unlock() could do xadd(tickets.head) and check
  the result

 Well, right now, tail is manipulated by locked instructions by CPUs
 who are contending for the ticketlock, but head can be manipulated
 unlocked by the CPU which currently owns the ticketlock. If SLOWPATH
 moved into head, then non-owner CPUs would be touching head, requiring
 everyone to use locked instructions on it.

 That's the theory, but I don't see much (any?) code which depends on that.

 Ideally we could find a way so that pv ticketlocks could use a plain
 unlocked add for the unlock like the non-pv case, but I just don't see a
 way to do it.

I agree, and I have to admit I am not sure I fully understand why unlock
uses the locked add. Except we need a barrier to avoid the race with the
enter_slowpath() users, of course. Perhaps this is the only reason?

Anyway, I suggested this to avoid the overflow if we use xadd(), and I
guess we need the locked insn anyway if we want to eliminate the unsafe
read-after-unlock...

  BTW. If we move clear slowpath into lock path, then probably trylock
  should be changed too? Something like below, we just need to clear SLOWPATH
  before cmpxchg.

 How important / widely used is trylock these days?

I am not saying this is that important. Just this looks more consistent imo
and we can do this for free.

Oleg.



Re: [PATCH] x86 spinlock: Fix memory corruption on completing completions

2015-02-11 Thread Oleg Nesterov
On 02/11, Raghavendra K T wrote:

 On 02/10/2015 06:56 PM, Oleg Nesterov wrote:

 In this case __ticket_check_and_clear_slowpath() really needs to cmpxchg
 the whole .head_tail. Plus obviously more boring changes. This needs a
 separate patch even _if_ this can work.

 Correct, but apart from this, before doing xadd in unlock,
 we would have to make sure lsb bit is cleared so that we can live with 1
 bit overflow to tail which is unused. now either or both of head,tail
 lsb bit may be set after unlock.

Sorry, can't understand... could you spell?

If TICKET_SLOWPATH_FLAG lives in .head arch_spin_unlock() could simply do

head = xadd(&lock->tickets.head, TICKET_LOCK_INC);

if (head & TICKET_SLOWPATH_FLAG)
__ticket_unlock_kick(head);

so it can't overflow to .tail?

But probably I missed your concern.



And if we do this, probably it makes sense to add something like

bool tickets_equal(__ticket_t one, __ticket_t two)
{
return !((one ^ two) & ~TICKET_SLOWPATH_FLAG);
}

and change kvm_lock_spinning() to use tickets_equal(tickets.head, want), plus
it can have more users in asm/spinlock.h.
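For example (a guess at what such a user could look like, not taken from the
patch):

	static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
	{
		return tickets_equal(lock.tickets.head, lock.tickets.tail);
	}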

Oleg.



Re: [PATCH] x86 spinlock: Fix memory corruption on completing completions

2015-02-10 Thread Oleg Nesterov
On 02/10, Denys Vlasenko wrote:

 # define HEAD_MASK (TICKET_SLOWPATH_FLAG-1)

 ...
 unlock_again:

 val = xadd(&lock->ticket.head_tail, TICKET_LOCK_INC);
 if (unlikely(!(val & HEAD_MASK))) {
 /* overflow. we inadvertently incremented the tail word.
  * tail's lsb is TICKET_SLOWPATH_FLAG.
  * Increment inverted this bit, fix it up.
  * (inc _may_ have messed up tail counter too,
  * will deal with it after kick.)
 */
 val ^= TICKET_SLOWPATH_FLAG;
 }

 if (unlikely(val & TICKET_SLOWPATH_FLAG)) {
 ...kick the waiting task...

val -= TICKET_SLOWPATH_FLAG;
if (unlikely(!(val & HEAD_MASK))) {
 /* overflow. we inadvertently incremented tail word, *and*
  * TICKET_SLOWPATH_FLAG was set, increment overflowed
  * that bit too and incremented tail counter.
  * This means we (inadvertently) taking the lock again!
  * Oh well. Take it, and unlock it again...
  */
 while (1) {
 if (READ_ONCE(lock->tickets.head) != TICKET_TAIL(val))
 cpu_relax();
 }
 goto unlock_again;
 }


 Granted, this looks ugly.

complicated ;)

But "Take it, and unlock it again" simply can't work, this can deadlock.
Note that unlock() can be called after a successful try_lock(). And there are
other problems with lock-ordering, like

	lock(X);
	lock(Y);

	unlock(X);

here "taking X again" in unlock(X) means spinning until all the earlier
waiters get and release X, but a waiter which gets X can then block on Y
which we still hold, so nobody makes progress.

Oleg.



Re: [PATCH] x86 spinlock: Fix memory corruption on completing completions

2015-02-10 Thread Oleg Nesterov
On 02/10, Raghavendra K T wrote:

 On 02/10/2015 06:23 AM, Linus Torvalds wrote:

 add_smp(&lock->tickets.head, TICKET_LOCK_INC);
 if (READ_ONCE(lock->tickets.tail) & TICKET_SLOWPATH_FLAG) ..

 into something like

  val = xadd(&lock->ticket.head_tail, TICKET_LOCK_INC << TICKET_SHIFT);
  if (unlikely(val & TICKET_SLOWPATH_FLAG)) ...

 would be the right thing to do. Somebody should just check that I got
 that shift right, and that the tail is in the high bytes (head really
 needs to be high to work, if it's in the low byte(s) the xadd would
 overflow from head into tail which would be wrong).

 Unfortunately xadd could result in head overflow as tail is high.

 The other option was repeated cmpxchg which is bad I believe.
 Any suggestions?

Stupid question... what if we simply move SLOWPATH from .tail to .head?
In this case arch_spin_unlock() could do xadd(tickets.head) and check
the result

In this case __ticket_check_and_clear_slowpath() really needs to cmpxchg
the whole .head_tail. Plus obviously more boring changes. This needs a
separate patch even _if_ this can work.



BTW. If we move clear slowpath into lock path, then probably trylock
should be changed too? Something like below, we just need to clear SLOWPATH
before cmpxchg.

Oleg.

--- x/arch/x86/include/asm/spinlock.h
+++ x/arch/x86/include/asm/spinlock.h
@@ -109,7 +109,8 @@ static __always_inline int arch_spin_try
	if (old.tickets.head != (old.tickets.tail & ~TICKET_SLOWPATH_FLAG))
		return 0;
 
-	new.head_tail = old.head_tail + (TICKET_LOCK_INC << TICKET_SHIFT);
+	new.tickets.head = old.tickets.head;
+	new.tickets.tail = (old.tickets.tail & ~TICKET_SLOWPATH_FLAG) + TICKET_LOCK_INC;
 
	/* cmpxchg is a full barrier, so nothing can move before it */
	return cmpxchg(&lock->head_tail, old.head_tail, new.head_tail) == old.head_tail;



Re: [PATCH V2] x86 spinlock: Fix memory corruption on completing completions

2015-02-09 Thread Oleg Nesterov
On 02/09, Raghavendra K T wrote:

 +static inline void __ticket_check_and_clear_slowpath(arch_spinlock_t *lock)
 +{
 + arch_spinlock_t old, new;
 + __ticket_t diff;
 +
 + old.tickets = READ_ONCE(lock->tickets);
 + diff = (old.tickets.tail & ~TICKET_SLOWPATH_FLAG) - old.tickets.head;
 +
 + /* try to clear slowpath flag when there are no contenders */
 + if ((old.tickets.tail & TICKET_SLOWPATH_FLAG) &&
 + (diff == TICKET_LOCK_INC)) {
 + new = old;
 + new.tickets.tail &= ~TICKET_SLOWPATH_FLAG;
 + cmpxchg(&lock->head_tail, old.head_tail, new.head_tail);
 + }
 +}

Can't we simplify it? We own .head, and we already know it. We only need
to clear TICKET_SLOWPATH_FLAG in .tail atomically?

IOW,

static inline void __ticket_check_and_clear_slowpath(arch_spinlock_t *lock, __ticket_t head)
{
__ticket_t old_tail, new_tail;

new_tail = head + TICKET_LOCK_INC;
old_tail = new_tail | TICKET_SLOWPATH_FLAG;

	if (READ_ONCE(lock->tickets.tail) == old_tail)
		cmpxchg(&lock->tickets.tail, old_tail, new_tail);
}

Plus

-   __ticket_check_and_clear_slowpath(lock);
+   __ticket_check_and_clear_slowpath(lock, inc.tail);

Or I missed something?

And I think it would be better to avoid ifdef(CONFIG_PARAVIRT_SPINLOCKS),
we can just do

if (TICKET_SLOWPATH_FLAG)
__ticket_check_and_clear_slowpath();

Oleg.



Re: [PATCH] x86 spinlock: Fix memory corruption on completing completions

2015-02-08 Thread Oleg Nesterov
On 02/06, Sasha Levin wrote:

 Can we modify it slightly to avoid potentially accessing invalid memory:
 
 diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
 index 5315887..cd22d73 100644
 --- a/arch/x86/include/asm/spinlock.h
 +++ b/arch/x86/include/asm/spinlock.h
 @@ -144,13 +144,13 @@ static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
 	if (TICKET_SLOWPATH_FLAG &&
 			static_key_false(&paravirt_ticketlocks_enabled)) {
 __ticket_t prev_head;
 -
 +   bool needs_kick = lock->tickets.tail & TICKET_SLOWPATH_FLAG;
 	prev_head = lock->tickets.head;
 	add_smp(&lock->tickets.head, TICKET_LOCK_INC);
 
 /* add_smp() is a full mb() */
 
 -   if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG)) {
 +   if (unlikely(needs_kick)) {

This doesn't look right too...

We need to guarantee that either unlock() sees TICKET_SLOWPATH_FLAG, or
the caller of __ticket_enter_slowpath() sees the result of add_smp().

Suppose that kvm_lock_spinning() is called right before add_smp() and it
sets SLOWPATH. It will block then because .head != want, and it needs
__ticket_unlock_kick().
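Roughly (my reading of the race, just to illustrate):

	CPU0 (arch_spin_unlock)                 CPU1 (kvm_lock_spinning)
	-------------------------------------   -------------------------------------
	needs_kick = tail & SLOWPATH;  /* 0 */
	                                        __ticket_enter_slowpath(lock);
	                                        head = READ_ONCE(...);  /* != want */
	                                        halt();  /* blocks until kicked */
	add_smp(&lock->tickets.head, ...);
	if (needs_kick)  /* false, no kick */
	                                        /* CPU1 is never woken up */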

Oleg.



Re: [PATCH v5 1/8] qspinlock: Introducing a 4-byte queue spinlock implementation

2014-03-02 Thread Oleg Nesterov
On 02/26, Waiman Long wrote:

 +void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval)
 +{
 + unsigned int cpu_nr, qn_idx;
 + struct qnode *node, *next;
 + u32 prev_qcode, my_qcode;
 +
 + /*
 +  * Get the queue node
 +  */
 + cpu_nr = smp_processor_id();
 + node   = get_qnode(&qn_idx);
 +
 + /*
 +  * It should never happen that all the queue nodes are being used.
 +  */
 + BUG_ON(!node);
 +
 + /*
 +  * Set up the new cpu code to be exchanged
 +  */
 + my_qcode = queue_encode_qcode(cpu_nr, qn_idx);
 +
 + /*
 +  * Initialize the queue node
 +  */
 + node->wait = true;
 + node->next = NULL;
 +
 + /*
 +  * The lock may be available at this point, try again if no task was
 +  * waiting in the queue.
 +  */
 + if (!(qsval >> _QCODE_OFFSET) && queue_spin_trylock(lock)) {
 + put_qnode();
 + return;
 + }

Cosmetic, but probably "goto release_node" would be more consistent.

And I am wondering how much this qsval >> _QCODE_OFFSET check can help.
Note that this is the only usage of this arg, perhaps it would be better
to simply remove it and shrink the caller's code a bit? It is also used
in 3/8, but we can read the fresh value of ->qlcode (trylock does this
anyway), and perhaps it can actually help if it is already unlocked.

 + prev_qcode = atomic_xchg(&lock->qlcode, my_qcode);
 + /*
 +  * It is possible that we may accidentally steal the lock. If this is
 +  * the case, we need to either release it if not the head of the queue
 +  * or get the lock and be done with it.
 +  */
 + if (unlikely(!(prev_qcode & _QSPINLOCK_LOCKED))) {
 + if (prev_qcode == 0) {
 + /*
 +  * Got the lock since it is at the head of the queue
 +  * Now try to atomically clear the queue code.
 +  */
 + if (atomic_cmpxchg(&lock->qlcode, my_qcode,
 +   _QSPINLOCK_LOCKED) == my_qcode)
 + goto release_node;
 + /*
 +  * The cmpxchg fails only if one or more tasks
 +  * are added to the queue. In this case, we need to
 +  * notify the next one to be the head of the queue.
 +  */
 + goto notify_next;
 + }
 + /*
 +  * Accidentally steal the lock, release the lock and
 +  * let the queue head get it.
 +  */
 + queue_spin_unlock(lock);
 + } else
 + prev_qcode &= ~_QSPINLOCK_LOCKED;   /* Clear the lock bit */

You know, actually I started this email because I thought that "goto notify_next"
is wrong, I misread the patch as if this goto can happen even if prev_qcode != 0.

So feel free to ignore, all my comments are cosmetic/subjective, but to me it
would be more clean/clear to rewrite the code above as

	if (prev_qcode == 0) {
		if (atomic_cmpxchg(..., _QSPINLOCK_LOCKED) == my_qcode)
			goto release_node;
		goto notify_next;
	}

	if (prev_qcode & _QSPINLOCK_LOCKED)
		prev_qcode &= ~_QSPINLOCK_LOCKED;
	else
		queue_spin_unlock(lock);


 + while (true) {
 + u32 qcode;
 + int retval;
 +
 + retval = queue_get_lock_qcode(lock, &qcode, my_qcode);
 + if (retval > 0)
 + ;   /* Lock not available yet */
 + else if (retval < 0)
 + /* Lock taken, can release the node & return */
 + goto release_node;

I guess this is for 3/8 which adds the optimized version of
queue_get_lock_qcode(), so perhaps this "retval < 0" block can go into 3/8
as well.

 + else if (qcode != my_qcode) {
 + /*
 +  * Just get the lock with other spinners waiting
 +  * in the queue.
 +  */
 + if (queue_spin_setlock(lock))
 + goto notify_next;

OTOH, at least the generic (non-optimized) version of queue_spin_setlock()
could probably accept qcode and avoid atomic_read() + _QSPINLOCK_LOCKED
check.
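Something like this, I mean (a sketch only):

	static __always_inline int queue_spin_setlock(struct qspinlock *lock, u32 qcode)
	{
		/*
		 * The caller has just seen qcode with the lock bit clear, so try
		 * the cmpxchg() directly instead of re-reading ->qlcode first.
		 */
		return atomic_cmpxchg(&lock->qlcode, qcode,
				      qcode | _QSPINLOCK_LOCKED) == qcode;
	}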

But once again, please feel free to ignore.

Oleg.



Re: [PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

2014-03-02 Thread Oleg Nesterov
On 02/26, Waiman Long wrote:

  @@ -144,7 +317,7 @@ static __always_inline int queue_spin_setlock(struct qspinlock *lock)
	int qlcode = atomic_read(&lock->qlcode);
 
	if (!(qlcode & _QSPINLOCK_LOCKED) && (atomic_cmpxchg(&lock->qlcode,
 - qlcode, qlcode|_QSPINLOCK_LOCKED) == qlcode))
 + qlcode, code|_QSPINLOCK_LOCKED) == qlcode))

Hmm. I didn't read the patch, but this change looks like an accidental typo...

Oleg.



Re: [PATCH v5 1/8] qspinlock: Introducing a 4-byte queue spinlock implementation

2014-03-02 Thread Oleg Nesterov
Forgot to ask...

On 02/26, Waiman Long wrote:

 +notify_next:
 + /*
 +  * Wait, if needed, until the next one in queue set up the next field
 +  */
 + while (!(next = ACCESS_ONCE(node->next)))
 + arch_mutex_cpu_relax();
 + /*
 +  * The next one in queue is now at the head
 +  */
 + smp_store_release(&next->wait, false);

Do we really need smp_store_release()? It seems that we can rely on the
control dependency here. And afaics there is no need to serialise this
store with other changes in *lock, plus they all have mb's anyway.
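IOW, something like this should be enough, if I am right about the control
dependency:

	/*
	 * The store below depends on the loop condition, so it cannot be
	 * speculated before the load of node->next observes a non-NULL value.
	 */
	while (!(next = ACCESS_ONCE(node->next)))
		arch_mutex_cpu_relax();
	ACCESS_ONCE(next->wait) = false;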

Oleg.



Re: [PATCH v5 2/8] qspinlock, x86: Enable x86-64 to use queue spinlock

2014-03-02 Thread Oleg Nesterov
On 02/26, Waiman Long wrote:

 +#define _ARCH_SUPPORTS_ATOMIC_8_16_BITS_OPS
 +
 +/*
 + * x86-64 specific queue spinlock union structure
 + */
 +union arch_qspinlock {
 + struct qspinlock slock;
 + u8   lock;  /* Lock bit */
 +};

And this enables the optimized version of queue_spin_setlock().

But why does it check ACCESS_ONCE(qlock->lock) == 0 ? This is called
right after queue_get_lock_qcode() returns 0, so this lock should
likely be unlocked.
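For context, my understanding of the x86 version this union enables (quoting
from memory, so it may not match 2/8 exactly):

	static __always_inline int queue_spin_setlock(struct qspinlock *lock)
	{
		union arch_qspinlock *qlock = (union arch_qspinlock *)lock;

		if (!ACCESS_ONCE(qlock->lock) &&
		    (cmpxchg(&qlock->lock, 0, _QSPINLOCK_LOCKED) == 0))
			return 1;
		return 0;
	}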

Oleg.
