Re: [Qemu-block] [Qemu-devel] [PATCH v3] util/async: use atomic_mb_set in qemu_bh_cancel

2017-11-08 Thread Stefan Hajnoczi
On Wed, Nov 08, 2017 at 05:32:23PM +0300, Pavel Butsykin wrote:
> On 08.11.2017 17:24, Sergio Lopez wrote:
> > On Wed, Nov 8, 2017 at 3:15 PM, Paolo Bonzini  wrote:
> > > On 08/11/2017 15:10, Sergio Lopez wrote:
> > > > > I'm not quite sure that prefetching is involved in this issue,
> > > > > because a prefetched read of an address should be invalidated by a
> > > > > write on another core to the same address. In our case the write
> > > > > req->state = THREAD_DONE should invalidate the read
> > > > > req->state == THREAD_DONE.
> > > > > I am inclined to think that this is a read reordered with a write.
> > > > > That's a very real case on x86, and I don't see anything that would
> > > > > prevent it:
> > > > > 
> > > > Yes, you're right. This is actually a memory reordering issue. I'm
> > > > going to rewrite that paragraph.
> > > 
> > > Well, memory reordering _is_ caused by speculative prefetching, delayed
> > > cache invalidation (store buffers), and so on.
> > > 
> > > But it's probably better indeed to replace "pre-fetched" with
> > > "outdated".  Whoever commits the patch can do the substitution (I can 
> > > too).
> > > 
> > 
> > Alternatively, if we want to explicitly mention the memory barrier, we
> > can replace the third paragraph with something like this:
> > 
> > 
> > This was considered to be safe, as the completion function restarts the
> > loop just after the call to qemu_bh_cancel. But, as this loop lacks a HW
> > memory barrier, the read of req->state may actually happen _before_ the
> > call, seeing it still as THREAD_QUEUED, and ending the completion
> > function without having processed a pending TPE linked at pool->head:
> > 
> 
> Yes, that's better. Thank you.

I have updated the commit description and sent an updated pull request
for QEMU 2.11-rc1.

Stefan




Re: [Qemu-block] [Qemu-devel] [PATCH v3] util/async: use atomic_mb_set in qemu_bh_cancel

2017-11-08 Thread Pavel Butsykin



On 08.11.2017 17:15, Paolo Bonzini wrote:
> On 08/11/2017 15:10, Sergio Lopez wrote:
>>> I'm not quite sure that prefetching is involved in this issue,
>>> because a prefetched read of an address should be invalidated by a
>>> write on another core to the same address. In our case the write
>>> req->state = THREAD_DONE should invalidate the read
>>> req->state == THREAD_DONE.
>>> I am inclined to think that this is a read reordered with a write.
>>> That's a very real case on x86, and I don't see anything that would
>>> prevent it:
>>>
>> Yes, you're right. This is actually a memory reordering issue. I'm
>> going to rewrite that paragraph.
>
> Well, memory reordering _is_ caused by speculative prefetching, delayed
> cache invalidation (store buffers), and so on.

What do you mean?

If we are speaking about x86, then a write on another core (like
req->state = THREAD_DONE in this issue) should invalidate a prefetched
read of req->state, and that is handled in hardware. The prefetch is
tied to the L1 cache: when another CPU invalidates the cache line, the
prefetched value is invalidated as well (as far as I understand it).

> But it's probably better indeed to replace "pre-fetched" with
> "outdated".  Whoever commits the patch can do the substitution (I can
> too).
>
> Paolo





Re: [Qemu-block] [Qemu-devel] [PATCH v3] util/async: use atomic_mb_set in qemu_bh_cancel

2017-11-08 Thread Pavel Butsykin

On 08.11.2017 17:24, Sergio Lopez wrote:
> On Wed, Nov 8, 2017 at 3:15 PM, Paolo Bonzini  wrote:
>> On 08/11/2017 15:10, Sergio Lopez wrote:
>>>> I'm not quite sure that prefetching is involved in this issue,
>>>> because a prefetched read of an address should be invalidated by a
>>>> write on another core to the same address. In our case the write
>>>> req->state = THREAD_DONE should invalidate the read
>>>> req->state == THREAD_DONE.
>>>> I am inclined to think that this is a read reordered with a write.
>>>> That's a very real case on x86, and I don't see anything that would
>>>> prevent it:
>>>>
>>> Yes, you're right. This is actually a memory reordering issue. I'm
>>> going to rewrite that paragraph.
>>
>> Well, memory reordering _is_ caused by speculative prefetching, delayed
>> cache invalidation (store buffers), and so on.
>>
>> But it's probably better indeed to replace "pre-fetched" with
>> "outdated".  Whoever commits the patch can do the substitution (I can
>> too).
>>
>
> Alternatively, if we want to explicitly mention the memory barrier, we
> can replace the third paragraph with something like this:
>
> This was considered to be safe, as the completion function restarts the
> loop just after the call to qemu_bh_cancel. But, as this loop lacks a HW
> memory barrier, the read of req->state may actually happen _before_ the
> call, seeing it still as THREAD_QUEUED, and ending the completion
> function without having processed a pending TPE linked at pool->head:

Yes, that's better. Thank you.





Re: [Qemu-block] [Qemu-devel] [PATCH v3] util/async: use atomic_mb_set in qemu_bh_cancel

2017-11-08 Thread Sergio Lopez
On Wed, Nov 8, 2017 at 3:15 PM, Paolo Bonzini  wrote:
> On 08/11/2017 15:10, Sergio Lopez wrote:
>>> I'm not quite sure that prefetching is involved in this issue,
>>> because a prefetched read of an address should be invalidated by a
>>> write on another core to the same address. In our case the write
>>> req->state = THREAD_DONE should invalidate the read
>>> req->state == THREAD_DONE.
>>> I am inclined to think that this is a read reordered with a write.
>>> That's a very real case on x86, and I don't see anything that would
>>> prevent it:
>>>
>> Yes, you're right. This is actually a memory reordering issue. I'm
>> going to rewrite that paragraph.
>
> Well, memory reordering _is_ caused by speculative prefetching, delayed
> cache invalidation (store buffers), and so on.
>
> But it's probably better indeed to replace "pre-fetched" with
> "outdated".  Whoever commits the patch can do the substitution (I can too).
>

Alternatively, if we want to explicitly mention the memory barrier, we
can replace the third paragraph with something like this:


This was considered to be safe, as the completion function restarts the
loop just after the call to qemu_bh_cancel. But, as this loop lacks a HW
memory barrier, the read of req->state may actually happen _before_ the
call, seeing it still as THREAD_QUEUED, and ending the completion
function without having processed a pending TPE linked at pool->head:


---
Sergio



Re: [Qemu-block] [Qemu-devel] [PATCH v3] util/async: use atomic_mb_set in qemu_bh_cancel

2017-11-08 Thread Paolo Bonzini
On 08/11/2017 15:10, Sergio Lopez wrote:
>> I'm not quite sure that prefetching is involved in this issue,
>> because a prefetched read of an address should be invalidated by a
>> write on another core to the same address. In our case the write
>> req->state = THREAD_DONE should invalidate the read
>> req->state == THREAD_DONE.
>> I am inclined to think that this is a read reordered with a write.
>> That's a very real case on x86, and I don't see anything that would
>> prevent it:
>>
> Yes, you're right. This is actually a memory reordering issue. I'm
> going to rewrite that paragraph.

Well, memory reordering _is_ caused by speculative prefetching, delayed
cache invalidation (store buffers), and so on.

But it's probably better indeed to replace "pre-fetched" with
"outdated".  Whoever commits the patch can do the substitution (I can too).

Paolo



Re: [Qemu-block] [Qemu-devel] [PATCH v3] util/async: use atomic_mb_set in qemu_bh_cancel

2017-11-08 Thread Sergio Lopez
On Wed, Nov 8, 2017 at 2:50 PM, Pavel Butsykin  wrote:
> On 08.11.2017 09:34, Sergio Lopez wrote:
>> This was considered to be safe, as the completion function restarts the
>> loop just after the call to qemu_bh_cancel. But, under certain access
>> patterns and scheduling conditions, the loop may wrongly use a
>> pre-fetched elem->state value, reading it as THREAD_QUEUED, and ending
>> the completion function without having processed a pending TPE linked at
>> pool->head:
>
>
> I'm not quite sure that prefetching is involved in this issue,
> because a prefetched read of an address should be invalidated by a
> write on another core to the same address. In our case the write
> req->state = THREAD_DONE should invalidate the read
> req->state == THREAD_DONE.
> I am inclined to think that this is a read reordered with a write.
> That's a very real case on x86, and I don't see anything that would
> prevent it:
>
>

Yes, you're right. This is actually a memory reordering issue. I'm
going to rewrite that paragraph.

Thanks Pavel.



Re: [Qemu-block] [Qemu-devel] [PATCH v3] util/async: use atomic_mb_set in qemu_bh_cancel

2017-11-08 Thread Pavel Butsykin

On 08.11.2017 09:34, Sergio Lopez wrote:
> Commit b7a745d added a qemu_bh_cancel call to the completion function
> as an optimization to prevent it from unnecessarily rescheduling itself.
>
> This completion function is scheduled from worker_thread, after setting
> the state of a ThreadPoolElement to THREAD_DONE.

Great! We are seeing the same problem, and I was describing my fix
when I came across your patch :)

> This was considered to be safe, as the completion function restarts the
> loop just after the call to qemu_bh_cancel. But, under certain access
> patterns and scheduling conditions, the loop may wrongly use a
> pre-fetched elem->state value, reading it as THREAD_QUEUED, and ending
> the completion function without having processed a pending TPE linked at
> pool->head:

I'm not quite sure that prefetching is involved in this issue,
because a prefetched read of an address should be invalidated by a
write on another core to the same address. In our case the write
req->state = THREAD_DONE should invalidate the read
req->state == THREAD_DONE.
I am inclined to think that this is a read reordered with a write.
That's a very real case on x86, and I don't see anything that would
prevent it:

.text:0060E21E loc_60E21E:             ; CODE XREF: .text:0060E2F4j
.text:0060E21E     mov     rbx, [r12+98h]
.text:0060E226     test    rbx, rbx
.text:0060E229     jnz     short loc_60E238
.text:0060E22B     jmp     short exit_0
.text:0060E22B ; ---------------------------------------------------------
.text:0060E22D     align 10h
.text:0060E230 loc_60E230:             ; CODE XREF: .text:0060E240j
.text:0060E230     test    rbp, rbp
.text:0060E233     jz      short exit_0
.text:0060E235 loc_60E235:             ; CODE XREF: .text:0060E289j
.text:0060E235     mov     rbx, rbp
.text:0060E238 loc_60E238:             ; CODE XREF: .text:0060E229j
.text:0060E238     cmp     [rbx+ThreadPoolElement.state], 2 ; THREAD_DONE
.text:0060E23C     mov     rbp, [rbx+ThreadPoolElement.all.link_next]
.text:0060E240     jnz     short loc_60E230
.text:0060E242     mov     r15d, [rbx+ThreadPoolElement.ret]
.text:0060E246     mov     r13, [rbx+ThreadPoolElement.common.opaque]
.text:0060E24A     nop
.text:0060E24B     lea     rax, trace_events_enabled_count
.text:0060E252     mov     eax, [rax]
.text:0060E254     test    eax, eax
.text:0060E256     mov     rax, rbp
.text:0060E259     jnz     loc_60E2F9
 ...
.text:0060E2BC loc_60E2BC:             ; CODE XREF: .text:0060E27Cj
.text:0060E2BC     mov     rdi, [r12+8]
.text:0060E2C1     call    qemu_bh_schedule
.text:0060E2C6     mov     rdi, [r12]
.text:0060E2CA     call    aio_context_release
.text:0060E2CF     mov     esi, [rbx+44h]
.text:0060E2D2     mov     rdi, [rbx+18h]
.text:0060E2D6     call    qword ptr [rbx+10h]
.text:0060E2D9     mov     rdi, [r12]
.text:0060E2DD     call    aio_context_acquire
.text:0060E2E2     mov     rdi, [r12+8]
.text:0060E2E7     call    qemu_bh_cancel
.text:0060E2EC     mov     rdi, rbx
.text:0060E2EF     call    qemu_aio_unref
.text:0060E2F4     jmp     loc_60E21E


The read (req->state == THREAD_DONE) can be reordered with the write in
qemu_bh_cancel(p->completion_bh), and then we get the following picture:

        worker thread                |          I/O thread
  -----------------------------------+-------------------------------------
                                     | reordered read req->state
  req->state = THREAD_DONE;          |
  qemu_bh_schedule(p->completion_bh) |
    bh->scheduled = 1;               |
                                     | qemu_bh_cancel(p->completion_bh)
                                     |   bh->scheduled = 0;
                                     | if (req->state == THREAD_DONE)