Re: [Qemu-devel] [PATCH v1 2/4] s390x/tcg: Introduce probe_read_access()

2019-08-26 Thread David Hildenbrand
On 22.08.19 09:01, David Hildenbrand wrote:
> On 22.08.19 00:31, Richard Henderson wrote:
>> On 8/21/19 2:33 PM, David Hildenbrand wrote:
 NOTDIRTY cannot fault at all.  The associated rcu critical section is ugly
 enough to make me not want to do anything except continue to go through the
 regular MMIO path.

In any case, so long as we eliminate *access* faults by probing the page
table, then falling back to the byte-by-byte loop is, AFAICS, sufficient
to implement the instructions correctly.
>>>
>>> "In any case, so long as we eliminate *access* faults by probing the
>>> page table" - that's what I'm doing in this patch (and even more correct
>>> in the prototype patch I shared), no? (besides the watchpoint madness below)
>>
>> Correct.
>>
>> My main objection to your current patch is that you perform the access checks
>> within MVC, and then do some more tlb lookups in fast_memmove.
> 
> Right, however I think splitting up both steps is nicer (especially if
> we mix and match memmove and memset in MVCLE later). Maybe combining a
> tuned-up probe_write() with something similar to (but different from)
> access_prepare().
> 
> General changes:
> 
> 1. Check watchpoints in probe_write()
> 
> 2. Make probe_write() bail out if it would cross a page boundary - the
> caller has to ensure the access falls into a single page
> (TARGET_PAGE_SIZE?).
> 
> 3. Return a pointer from probe_write()
> -> If NULL is returned, reads/writes have to go via memops, e.g., single
> byte access
> 
> 4. Make probe_write() return host addresses even for TLB entries that
> were immediately invalidated again (-> LAP)
> 
> 
> What I propose for s390x:
> 
> 1. access_prepare(): Collect the pointers using probe_write() - maximum
> is two pages. If we get NULL, there is no fault but we have to fall back
> to read/write using memops.
> 
> 2. access_memset() / access_set() / access_move() ... pass the access
> information we collected. The prototype I shared can be heavily simplified -
> don't fall back to paddrs but instead do single-byte access, don't allow
> flexible array of pages, etc.
> 
> 3. First convert MVC to the new access mechanism, later fix + convert
> the others.
> 
>>
>> I think that fast_memmove is where the access checks should live.  That
>> allows incremental improvement to combine access checks + host address
>> lookup, which cannot currently be done in one step with existing interfaces.
>>
>> I guess you would still want access checks within MVC for the case in
>> which you must fall back to byte-by-byte because of destructive overlap.
> 
> That's why I think introducing a new set of helpers makes sense. Once
> all users of fast_memmove() etc are gone, we can then throw them away.
> 
>>
>>> "falling back to the byte-by-byte loop is, AFAICS, sufficient"
>>>
>>> I don't think this is sufficient. E.g., LAP protected pages
>>> (PAGE_WRITE_INV which immediately requires a new MMU walk on the next
>>> access) will trigger a new MMU walk on every byte access (that's why I
>>> chose to pre-translate in my prototype).
>>
>> LAP protected pages are exactly why probe_write should return the host
>> address, so that we can do the access check + host address lookup in
>> one step.
>>
> 
> Yes, this should also be done in tlb_vaddr_to_host() - but as we'll be
> converting to a probe_write() thingy, it might not be required to handle
> the special "TLB_INVALID_MASK set after tlb_fill()" case.
> 
>> But in the meantime...
>>
>>> If another CPU modified the
>>> page tables in between, we could suddenly get a fault - although we
>>> checked early. What am I missing?
>>
>> You're concerned with a bare write to the page table by cpu B, while cpu A is
>> executing, and before cpu B issues the cross-cpu tlb flush?
> 
> Not bare writes, these are invalid. Take IPTE for example. Set the
> invalid bit, then do a sync CPU flush.
> 
> If we re-translate a vaddr after probe_write() - which would currently
> happen in case of LAP - then we could suddenly run into the invalid bit,
> triggering a fault. The sync CPU flush still wasn't processed, but we
> see the "invalid" bit early because we decided to re-translate after
> already modifying memory. This re-translation really has to be avoided.

FWIW, real HW has a dedicated lock for that as far as I remember, the
IPTE lock. While holding the IPTE lock, no modifications of page tables
are possible.

So e.g., IPTE/IDTE/CSP take the lock while working on the page tables.
MMU walkers in return take the lock while translating vaddrs.
Especially, in KVM guest access code we hold the same HW lock to make
sure no IPTE/IDTE can modify the page tables while we walk them.
Something like that (reader/writer lock) would also be possible in our MMU.
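
A minimal sketch of that idea (plain pthreads purely for illustration -
none of these names exist in QEMU today):

/* Hypothetical IPTE-lock analogue: walkers shared, updaters exclusive. */
#include <pthread.h>

static pthread_rwlock_t ipte_lock = PTHREAD_RWLOCK_INITIALIZER;

/* MMU walkers translate vaddrs while holding the lock shared. */
static void mmu_walk_lock(void)   { pthread_rwlock_rdlock(&ipte_lock); }
static void mmu_walk_unlock(void) { pthread_rwlock_unlock(&ipte_lock); }

/* IPTE/IDTE/CSP take the lock exclusively while editing page tables,
 * so a walker never observes a half-updated entry. */
static void pte_update_lock(void)   { pthread_rwlock_wrlock(&ipte_lock); }
static void pte_update_unlock(void) { pthread_rwlock_unlock(&ipte_lock); }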

-- 

Thanks,

David / dhildenb



Re: [Qemu-devel] [PATCH v1 2/4] s390x/tcg: Introduce probe_read_access()

2019-08-22 Thread David Hildenbrand
On 22.08.19 00:31, Richard Henderson wrote:
> On 8/21/19 2:33 PM, David Hildenbrand wrote:
>>> NOTDIRTY cannot fault at all.  The associated rcu critical section is ugly
>>> enough to make me not want to do anything except continue to go through the
>>> regular MMIO path.
>>>
>>> In any case, so long as we eliminate *access* faults by probing the page
>>> table, then falling back to the byte-by-byte loop is, AFAICS, sufficient
>>> to implement the instructions correctly.
>>
>> "In any case, so long as we eliminate *access* faults by probing the
>> page table" - that's what I'm doing in this patch (and even more correct
>> in the prototype patch I shared), no? (besides the watchpoint madness below)
> 
> Correct.
> 
> My main objection to your current patch is that you perform the access checks
> within MVC, and then do some more tlb lookups in fast_memmove.

Right, however I think splitting up both steps is nicer (especially if
we mix and match memmove and memset in MVCLE later). Maybe combining a
tuned-up probe_write() with something similar to (but different from)
access_prepare().

General changes:

1. Check watchpoints in probe_write()

2. Make probe_write() bail out if it would cross a page boundary - the
caller has to ensure the access falls into a single page (TARGET_PAGE_SIZE?).

3. Return a pointer from probe_write()
-> If NULL is returned, reads/writes have to go via memops, e.g., single
byte access

4. Make probe_write() return host addresses even for TLB entries that were
immediately invalidated again (-> LAP) - see the sketch below.
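
Roughly this contract - a sketch of the assumed future interface, not
today's probe_write():

/*
 * Sketch only: a tuned-up probe_write() per points 1-4 above; the
 * signature and semantics are assumptions for discussion.
 */
void *probe_write(CPUArchState *env, target_ulong addr, int size,
                  int mmu_idx, uintptr_t retaddr);
/*
 * - Checks watchpoints, taking the debug exit early if one matches.
 * - Requires [addr, addr + size) to fit in one page; the caller splits
 *   larger accesses (size <= TARGET_PAGE_SIZE).
 * - Raises the architectural fault if the page is inaccessible.
 * - Returns the host address, also for TLB entries that were immediately
 *   invalidated again (LAP), or NULL if the access has to go through
 *   memops (e.g., MMIO, NOTDIRTY), i.e., single-byte access.
 */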


What I propose for s390x:

1. access_prepare(): Collect the pointers using probe_write() - maximum
is two pages. If we get NULL, there is no fault but we have to fall back
to read/write using memops (sketched below, after this list).

2. access_memset() / access_set() / access_move() ... pass the access
information we collected. The prototype I shared can be heavily simplified -
don't fall back to paddrs but instead do single-byte access, don't allow
flexible array of pages, etc.

3. First convert MVC to the new access mechanism, later fix + convert
the others.
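
For illustration, the helper shapes this implies - names, fields and
signatures here are assumptions, not a final interface:

/* Sketch: at most two pages per access, collected up front. */
typedef struct S390Access {
    void *haddr1;       /* first page; NULL -> access via memops */
    void *haddr2;       /* second page if crossing; NULL -> memops */
    uint16_t size1;     /* bytes on the first page */
    uint16_t size2;     /* bytes on the second page (0 if none) */
} S390Access;

/* Probes via probe_write()/probe_read(); any fault is raised here. */
S390Access access_prepare(CPUS390XState *env, uint64_t vaddr,
                          uint16_t size, MMUAccessType access_type,
                          int mmu_idx, uintptr_t ra);

/* Perform the prepared accesses; guaranteed not to fault anymore. */
void access_memset(CPUS390XState *env, S390Access *desc, uint8_t byte);
void access_move(CPUS390XState *env, S390Access *dst, S390Access *src);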

> 
> I think that fast_memmove is where the access checks should live.  That allows
> incremental improvement to combine access checks + host address lookup, which
> cannot currently be done in one step with existing interfaces.
> 
> I guess you would still want access checks within MVC for the case in
> which you must fall back to byte-by-byte because of destructive overlap.

That's why I think introducing a new set of helpers makes sense. Once
all users of fast_memmove() etc are gone, we can then throw them away.

> 
>> "falling back to the byte-by-byte loop is, AFAICS, sufficient"
>>
>> I don't think this is sufficient. E.g., LAP protected pages
>> (PAGE_WRITE_INV which immediately requires a new MMU walk on the next
>> access) will trigger a new MMU walk on every byte access (that's why I
>> chose to pre-translate in my prototype).
> 
> LAP protected pages are exactly why probe_write should return the host address,
> so that we can do the access check + host address lookup in one step.
> 

Yes, this should also be done in tlb_vaddr_to_host() - but as we'll be
converting to a probe_write() thingy, it might not be required to handle
the special "TLB_INVALID_MASK set after tlb_fill()" case.

> But in the meantime...
> 
>> If another CPU modified the
>> page tables in between, we could suddenly get a fault - although we
>> checked early. What am I missing?
> 
> You're concerned with a bare write to the page table by cpu B, while cpu A is
> executing, and before cpu B issues the cross-cpu tlb flush?

Not bare writes, these are invalid. Take IPTE for example. Set the
invalid bit, then do a sync CPU flush.

If we re-translate a vaddr after probe_write() - which would currently
happen in case of LAP - then we could suddenly run into the invalid bit,
triggering a fault. The sync CPU flush still wasn't processed, but we
see the "invalid" bit early because we decided to re-translate after
already modifying memory. This re-translation really has to be avoided.

> 
> The tlb victim cache should prevent having to re-read a tlb entry from memory,
> at least for MVC.  The unlimited size we currently support for MVCL and MVCLE
> could act weird, but would be fixed by limiting the execution as discussed.

Yes, MVCL/MVCLE need a major overhaul. But good to know that a second
tlb_fill() will not result in the other TLB entry suddenly becoming
invalid and requiring a new MMU translation (-> victim cache). That's
another concern I had.

> 
> Honestly, the os has to make sure that the page remains valid until after
> the flush completes, otherwise it's an os bug.  The cross-cpu tlb flush
> happens via async_run_on_cpu, and of course never occurs while we are
> executing a TB.  Yet another reason to limit the amount of work any one
> instruction does.  ;-)

In my example, the page would still be valid, however the invalid bit
has already been set. The next MMU translation would choke on that and
trigger a fault.


Re: [Qemu-devel] [PATCH v1 2/4] s390x/tcg: Introduce probe_read_access()

2019-08-22 Thread David Hildenbrand
On 22.08.19 00:43, Richard Henderson wrote:
> On 8/21/19 3:31 PM, Richard Henderson wrote:
>>> Yes, that's what I mean, TARGET_PAGE_SIZE, but possibly crossing a
>>> page boundary. The longer I stare at the MVCL code, the more broken it
>>> is. There are more nice things buried in the PoP. MVCL does not detect
>>> access exceptions beyond the next 2k. So we have to limit it there
>>> differently.
>> That language is indeed odd.
>>
>> The only reading of that paragraph that makes sense to me is that the
>> hardware *must* interrupt MVCL after every 2k bytes processed.  The idea
>> that the user can magically write to a read-only page simply by providing
>> length = 2MB and a page that is initially writable is dumb.  I cannot
>> imagine that is a correct reading.
>>
>> Getting clarification from an IBM engineer on that would be good; otherwise I
>> would just ignore that and proceed as if all access checks are performed.
>>
> 
> FWIW, splitting the operation at every aligned 2k boundary is exactly what the
> Hercules emulator does:
> 
> len3 = NOCROSS2KL(addr1,len1) ? len1 : (int)(0x800 - (addr1 & 0x7FF));
> len4 = NOCROSS2KL(addr2,len2) ? len2 : (int)(0x800 - (addr2 & 0x7FF));
> len = len3 < len4 ? len3 : len4;
> /* Use concpy to ensure Concurrent block update consistency */
> concpy (regs, dest, source, len);
> 
> After this it writes back the lengths and addresses to the
> register file, and then if necessary loops back to the address
> translation step.

That's almost exactly how I planned to fix MVCL :)

-- 

Thanks,

David / dhildenb



Re: [Qemu-devel] [PATCH v1 2/4] s390x/tcg: Introduce probe_read_access()

2019-08-21 Thread Richard Henderson
On 8/21/19 3:31 PM, Richard Henderson wrote:
>> Yes, that's what I mean, TARGET_PAGE_SIZE, but possibly crossing a
>> page boundary. The longer I stare at the MVCL code, the more broken it
>> is. There are more nice things buried in the PoP. MVCL does not detect
>> access exceptions beyond the next 2k. So we have to limit it there
>> differently.
> That language is indeed odd.
> 
> The only reading of that paragraph that makes sense to me is that the hardware
> *must* interrupt MVCL after every 2k bytes processed.  The idea that the user
> can magically write to a read-only page simply by providing length = 2MB and
> a page that is initially writable is dumb.  I cannot imagine that is a correct
> reading.
> 
> Getting clarification from an IBM engineer on that would be good; otherwise I
> would just ignore that and proceed as if all access checks are performed.
> 

FWIW, splitting the operation at every aligned 2k boundary is exactly what the
Hercules emulator does:

len3 = NOCROSS2KL(addr1,len1) ? len1 : (int)(0x800 - (addr1 & 0x7FF));
len4 = NOCROSS2KL(addr2,len2) ? len2 : (int)(0x800 - (addr2 & 0x7FF));
len = len3 < len4 ? len3 : len4;
/* Use concpy to ensure Concurrent block update consistency */
concpy (regs, dest, source, len);

After this it writes back the lengths and addresses to the
register file, and then if necessary loops back to the address
translation step.
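
In QEMU terms the cropping is plain arithmetic; a self-contained sketch
(assuming NOCROSS2KL tests whether the operand stays within its aligned
2K block; crop_to_2k() is an invented name):

#include <assert.h>
#include <stdint.h>

/* Crop len so that [addr, addr + len) stays in addr's aligned 2K block. */
static uint64_t crop_to_2k(uint64_t addr, uint64_t len)
{
    uint64_t remaining = 0x800 - (addr & 0x7ff);
    return len < remaining ? len : remaining;
}

int main(void)
{
    assert(crop_to_2k(0x1ff8, 32) == 8);   /* 8 bytes to the 2K boundary */
    assert(crop_to_2k(0x2000, 32) == 32);  /* fits entirely in one block */
    return 0;
}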


r~



Re: [Qemu-devel] [PATCH v1 2/4] s390x/tcg: Introduce probe_read_access()

2019-08-21 Thread Richard Henderson
On 8/21/19 2:33 PM, David Hildenbrand wrote:
>> NOTDIRTY cannot fault at all.  The associated rcu critical section is ugly
>> enough to make me not want to do anything except continue to go through the
>> regular MMIO path.
>>
>> In any case, so long as we eliminate *access* faults by probing the page
>> table, then falling back to the byte-by-byte loop is, AFAICS, sufficient
>> to implement the instructions correctly.
> 
> "In any case, so long as we eliminate *access* faults by probing the
> page table" - that's what I'm doing in this patch (and even more correct
> in the prototype patch I shared), no? (besides the watchpoint madness below)

Correct.

My main objection to your current patch is that you perform the access checks
within MVC, and then do some more tlb lookups in fast_memmove.

I think that fast_memmove is where the access checks should live.  That allows
incremental improvement to combine access checks + host address lookup, which
cannot currently be done in one step with existing interfaces.

I guess you would still want access checks within MVC for the case in which you
must fall back to byte-by-byte because of destructive overlap.

> "falling back to the byte-by-byte loop is, AFAICS, sufficient"
> 
> I don't think this is sufficient. E.g., LAP protected pages
> (PAGE_WRITE_INV which immediately requires a new MMU walk on the next
> access) will trigger a new MMU walk on every byte access (that's why I
> chose to pre-translate in my prototype).

LAP protected pages are exactly why probe_write should return the host address,
so that we can do the access check + host address lookup in one step.

But in the meantime...

> If another CPU modified the
> page tables in between, we could suddenly get a fault - although we
> checked early. What am I missing?

You're concerned with a bare write to the page table by cpu B, while cpu A is
executing, and before cpu B issues the cross-cpu tlb flush?

The tlb victim cache should prevent having to re-read a tlb entry from memory,
at least for MVC.  The unlimited size we currently support for MVCL and MVCLE
could act weird, but would be fixed by limiting the execution as discussed.

Honestly, the os has to make sure that the page remains valid until after the
flush completes, otherwise it's an os bug.  The cross-cpu tlb flush happens via
async_run_on_cpu, and of course never occurs while we are executing a TB.  Yet
another reason to limit the amount of work any one instruction does.  ;-)


> I see that we use BP_STOP_BEFORE_ACCESS for PER (Program Event
> Recording) on s390x. I don't think that's correct. We want to get
> notified after the values were changed.
> 
> "A storage-alteration event occurs whenever a CPU,
> by using a logical or virtual address, makes a store
> access without an access exception to the storage
> area designated by control registers 10 and 11. ..."
> 
> "For a PER instruction-fetching nullification event, the
> unit of operation is nullified. For other PER events,
> the unit of operation is completed"
> 
> Oh man, why is everything I take a look at broken.

Heh.

>> In the latter case, if the instruction has had any side effects prior to the
>> longjmp, they will be re-done when we re-start the current instruction.
>>
>> To me this seems like a rather large bug in our implementation of
>> watchpoints, as it only really works properly for simple
>> load/store/load-op-store type
>> instructions.  Anything that works on many addresses and doesn't delay side
>> effects until all accesses are complete will Do The Wrong Thing.
>>
>> The fix, AFAICS, is for probe_write to call check_watchpoint(), so that we
>> take the debug exit early.
> 
> Indeed. I see what you mean now. (I was ignoring the "before access"
> because I was assuming we don't need it on s390x)
> 
> probe_write() would have to check for all BP_STOP_BEFORE_ACCESS watchpoints.

!BP_STOP_BEFORE_ACCESS watchpoints exit to the main loop as well, so that it
can restart and then single-step the current instruction.

We need the check in probe_write for all cases.

> Yes, that's what I mean, TARGET_PAGE_SIZE, but possibly crossing a
> page boundary. The longer I stare at the MVCL code, the more broken it
> is. There are more nice things buried in the PoP. MVCL does not detect
> access exceptions beyond the next 2k. So we have to limit it there
> differently.

That language is indeed odd.

The only reading of that paragraph that makes sense to me is that the hardware
*must* interrupt MVCL after every 2k bytes processed.  The idea that the user
can magically write to a read-only page simply by providing length = 2MB and
a page that is initially writable is dumb.  I cannot imagine that is a correct
reading.

Getting clarification from an IBM engineer on that would be good; otherwise I
would just ignore that and proceed as if all access checks are performed.

> So what I understand is that
> 
> - we should handle watchpoints in probe_write()
> - not bypass IO memory 

Re: [Qemu-devel] [PATCH v1 2/4] s390x/tcg: Introduce probe_read_access()

2019-08-21 Thread David Hildenbrand
On 21.08.19 22:38, Richard Henderson wrote:
> On 8/21/19 12:36 PM, David Hildenbrand wrote:
 There are certain cases where we can't get access to the raw host
page. Namely, cpu watchpoints, LAP, NODIRTY. In summary: this won't work
as you describe. (my first approach did exactly this)
>>>
>>> NODIRTY and LAP are automatically handled via probe_write
>>> faulting instead of returning the address.  I think there
>>> may be a bug in probe_write at present not checking
>>
>> For LAP pages we immediately set TLB_INVALID_MASK again, to trigger a
>> new fault on the next write access (only). This could be handled in
>> tlb_vaddr_to_host(), simply returning the address to the page after
>> trying to fill the tlb and succeeding (I implemented that, that's the
>> easy part).
> 
> Yes.
> 
>> TLB_NOTDIRTY and TLB_MMIO are the real issue. We don't want to refault,
>> we want to treat that memory like IO memory and route it via
>> MemoryRegionOps() - e.g., watch_mem_ops() in qemu/exec.c. So it's not a
>> fault but IO memory.
> 
> Watchpoints are not *really* i/o memory (unless of course it's a watchpoint on
> a device, which is a different matter), that's merely how we've chosen to
> implement it to force the memory operation through the slow path.  We can, and
> probably should, implement this differently wrt probe_write.

Yes, I agree wrt probe_write.

> 
> Real MMIO can only fault via cc->transaction_failed(), for some sort of bus
> error.  Which s390x does not currently implement.  In the meantime, a
> probe_write proves that the page is at least mapped correctly, so we have
> eliminated the normal access fault.

Yes, and that's all we care about on s390x.

> 
> NOTDIRTY cannot fault at all.  The associated rcu critical section is ugly
> enough to make me not want to do anything except continue to go through the
> regular MMIO path.
> 
> In any case, so long as we eliminate *access* faults by probing the page
> table, then falling back to the byte-by-byte loop is, AFAICS, sufficient
> to implement the instructions correctly.

"In any case, so long as we eliminate *access* faults by probing the
page table" - that's what I'm doing in this patch (and even more correct
in the prototype patch I shared), no? (besides the watchpoint madness below)

"falling back to the byte-by-byte loop is, AFAICS, sufficient"

I don't think this is sufficient. E.g., LAP protected pages
(PAGE_WRITE_INV which immediately requires a new MMU walk on the next
access) will trigger a new MMU walk on every byte access (that's why I
chose to pre-translate in my prototype). If another CPU modified the
page tables in between, we could suddenly get a fault - although we
checked early. What am I missing?

> 
>> probe_write() performs the MMU translation. If that succeeds, there is
>> no fault. If there are watchpoints, the memory is treated like IO and
>> memory access is rerouted. I think this works as designed.
> 
> Well, if BP_STOP_BEFORE_ACCESS, then we want to raise a debug exception before
> any changes are made.  If !BP_STOP_BEFORE_ACCESS, then we longjmp back to the
> main cpu loop and single-step the current instruction.

I see that we use BP_STOP_BEFORE_ACCESS for PER (Program Event
Recording) on s390x. I don't think that's correct. We want to get
notified after the values were changed.

"A storage-alteration event occurs whenever a CPU,
by using a logical or virtual address, makes a store
access without an access exception to the storage
area designated by control registers 10 and 11. ..."

"For a PER instruction-fetching nullification event, the
unit of operation is nullified. For other PER events,
the unit of operation is completed"

Oh man, why is everything I take a look at broken.
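
For reference, the CR10/CR11 range test the quoted text describes boils
down to something like this (a sketch; the wrap-around case when the
start is above the end follows the PoP's definition of PER ranges):

#include <stdbool.h>
#include <stdint.h>

/* Does addr fall into the PER storage area designated by CR10/CR11? */
static bool per_address_in_range(uint64_t start, uint64_t end,
                                 uint64_t addr)
{
    return start <= end ? (addr >= start && addr <= end)
                        : (addr >= start || addr <= end); /* wraps */
}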

> 
> In the latter case, if the instruction has had any side effects prior to the
> longjmp, they will be re-done when we re-start the current instruction.
> 
> To me this seems like a rather large bug in our implementation of watchpoints,
> as it only really works properly for simple load/store/load-op-store type
> instructions.  Anything that works on many addresses and doesn't delay side
> effects until all accesses are complete will Do The Wrong Thing.
> 
> The fix, AFAICS, is for probe_write to call check_watchpoint(), so that we
> take the debug exit early.

Indeed. I see what you mean now. (I was ignoring the "before access"
because I was assuming we don't need it on s390x)

probe_write() would have to check for all BP_STOP_BEFORE_ACCESS watchpoints.

> 
>> You mean two pages I assume. Yeah, I could certainly simplify the
>> prototype patch I have here quite a lot. 2 pages should be enough for
>> everybody ;)
> 
> Heh.  But, seriously, TARGET_PAGE_SIZE bytes is enough at one go, without
> releasing control so that interrupts etc may be recognized.

Yes, that's what I mean, TARGET_PAGE_SIZE, but possibly crossing a
page boundary. The longer I stare at the MVCL code, the more broken it
is. There are more nice things buried in the PoP. MVCL 

Re: [Qemu-devel] [PATCH v1 2/4] s390x/tcg: Introduce probe_read_access()

2019-08-21 Thread Richard Henderson
On 8/21/19 12:36 PM, David Hildenbrand wrote:
>>> There are certain cases where we can't get access to the raw host
>>> page. Namely, cpu watchpoints, LAP, NODIRTY. In summary: this won't work
>>> as you describe. (my first approach did exactly this)
>>
>> NODIRTY and LAP are automatically handled via probe_write
>> faulting instead of returning the address.  I think there
>> may be a bug in probe_write at present not checking
> 
> For LAP pages we immediately set TLB_INVALID_MASK again, to trigger a
> new fault on the next write access (only). This could be handled in
> tlb_vaddr_to_host(), simply returning the address to the page after
> trying to fill the tlb and succeeding (I implemented that, that's the
> easy part).

Yes.

> TLB_NOTDIRTY and TLB_MMIO are the real issue. We don't want to refault,
> we want to treat that memory like IO memory and route it via
> MemoryRegionOps() - e.g., watch_mem_ops() in qemu/exec.c. So it's not a
> fault but IO memory.

Watchpoints are not *really* i/o memory (unless of course it's a watchpoint on
a device, which is a different matter), that's merely how we've chosen to
implement it to force the memory operation through the slow path.  We can, and
probably should, implement this differently wrt probe_write.

Real MMIO can only fault via cc->transaction_failed(), for some sort of bus
error.  Which s390x does not currently implement.  In the meantime, a
probe_write proves that the page is at least mapped correctly, so we have
eliminated the normal access fault.

NOTDIRTY cannot fault at all.  The associated rcu critical section is ugly
enough to make me not want to do anything except continue to go through the
regular MMIO path.

In any case, so long as we eliminate *access* faults by probing the page table,
then falling back to the byte-by-byte loop is, AFAICS, sufficient to implement
the instructions correctly.

> probe_write() performs the MMU translation. If that succeeds, there is
> no fault. If there are watchpoints, the memory is treated like IO and
> memory access is rerouted. I think this works as designed.

Well, if BP_STOP_BEFORE_ACCESS, then we want to raise a debug exception before
any changes are made.  If !BP_STOP_BEFORE_ACCESS, then we longjmp back to the
main cpu loop and single-step the current instruction.

In the latter case, if the instruction has had any side effects prior to the
longjmp, they will be re-done when we re-start the current instruction.

To me this seems like a rather large bug in our implementation of watchpoints,
as it only really works properly for simple load/store/load-op-store type
instructions.  Anything that works on many addresses and doesn't delay side
effects until all accesses are complete will Do The Wrong Thing.

The fix, AFAICS, is for probe_write to call check_watchpoint(), so that we
take the debug exit early.

> You mean two pages I assume. Yeah, I could certainly simplify the
> prototype patch I have here quite a lot. 2 pages should be enough for
> everybody ;)

Heh.  But, seriously, TARGET_PAGE_SIZE bytes is enough at one go, without
releasing control so that interrupts etc may be recognized.


r~



Re: [Qemu-devel] [PATCH v1 2/4] s390x/tcg: Introduce probe_read_access()

2019-08-21 Thread David Hildenbrand
On 21.08.19 21:19, Richard Henderson wrote:
> On 8/21/19 10:37 AM, David Hildenbrand wrote:
>> Hah, guess what, I implemented a similar variant of "fetch all
>> of the host addresses" *but* it is not as easy as you might
>> think (sorry for the bad news).
> 
> I think it is, because I didn't think it *that* easy.  :-)

:) hehe

> 
>> There are certain cases where we can't get access to the raw host
>> page. Namely, cpu watchpoints, LAP, NODIRTY. In summary: this won't work
>> as you describe. (my first approach did exactly this)
> 
> NODIRTY and LAP are automatically handled via probe_write
> faulting instead of returning the address.  I think there
> may be a bug in probe_write at present not checking

For LAP pages we immediately set TLB_INVALID_MASK again, to trigger a
new fault on the next write access (only). This could be handled in
tlb_vaddr_to_host(), simply returning the address to the page after
trying to fill the tlb and succeeding (I implemented that, that's the
easy part).

TLB_NOTDIRTY and TLB_MMIO are the real issue. We don't want to refault,
we want to treat that memory like IO memory and route it via
MemoryRegionOps() - e.g., watch_mem_ops() in qemu/exec.c. So it's not a
fault but IO memory.

That's why we don't expose that memory via tlb_vaddr_to_host(). Faulting
in case of TLB_NOTDIRTY or TLB_MMIO would be bad.

> 
> Watchpoints could be handled the same way, if we were to
> export check_watchpoint from exec.c.  Indeed, I see no way
> to handle watchpoints correctly if we don't.  I think that's
> an outstanding bug with probe_write.

probe_write() performs the MMU translation. If that succeeds, there is
no fault. If there are watchpoints, the memory is treated like IO and
memory access is rerouted. I think this works as designed.

> 
> Any other objections?  I certainly think that restricting the
> size of such operations to one page is a large simplification
> over the S390Access array thing that you create in this patch.

You mean two pages I assume. Yeah, I could certainly simplify the
prototype patch I have here quite a lot. 2 pages should be enough for
everybody ;)

The basic question is: Should we try to somehow work around IO memory
access (including NOTDIRTY and watchpoints) in tlb_vaddr_to_host() or
perform access in these cases via cpu_physical_memory_write()?

It feels somewhat wrong to me to tune tlb_vaddr_to_host() to always
return the address of a page although we are dealing with
MemoryRegionOps() that want a more controlled access mechanism.
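
For illustration, the paddr fallback could look like this (a sketch
reusing the S390Access union from the prototype patch quoted in this
thread; access_write_byte() is an invented name):

/* Write one byte of a prepared access, without a new MMU translation. */
static void access_write_byte(S390Access *ac, uint16_t off, uint8_t byte)
{
    if (ac->isHaddr) {
        ac->haddr[off] = byte;            /* direct host memory access */
    } else {
        /* MMIO/NOTDIRTY/watchpoints: go through the memory API */
        cpu_physical_memory_write(ac->paddr + off, &byte, 1);
    }
}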

> 
> 
> r~
> 
>>
>> The following patch requires another re-factoring
>> (tcg_s390_cpu_mmu_translate), but you should get the idea.
>>
>>
>>
>> From 0cacd2aea3dbc25e93492cca04f6c866b86d7f8a Mon Sep 17 00:00:00 2001
>> From: David Hildenbrand 
>> Date: Tue, 20 Aug 2019 09:37:11 +0200
>> Subject: [PATCH v1] s390x/tcg: Fault-safe MVC (MOVE) implementation
>>
>> MVC can cross page boundaries. In case we fault on the second page, we
>> already partially copied data. If we have overlaps, we would
>> trigger a fault after having partially moved data, eventually having
>> our original data already overwritten. When continuing after the fault,
>> we would try to move already modified data, not the original data -
>> very bad.
>>
>> glibc started to use MVC for forward memmove() and is able to trigger
>> exactly this corruption (via rpmbuild and rpm). Fedora 31 (rawhide)
>> currently fails to install as we trigger rpm database corruptions due to
>> this bug.
>>
>> We need a way to translate a virtual address range to individual pages that
>> we can access later on without triggering faults. Probing all virtual
>> addresses once before the read/write is not sufficient - the guest could
>> have modified the page tables (e.g., write-protect, map out) in between,
>> so we could fault on any new tlb_fill() - we have to skip any new MMU
>> translations.
>>
>> Unfortunately, there are TLB entries for which we cannot get a host
>> address (esp., watchpoints, LAP, NOTDIRTY) - in these cases we cannot
>> avoid a new MMU translation using the ordinary ld/st helpers. Let's fall
>> back to guest physical addresses in these cases, which we access via
>> cpu_physical_memory_(read|write).
>>
>> This change reduced the boottime for s390x guests (to prompt) from ~1:29
>> min to ~1:19 min in my tests. For example, LAP protected pages are now only
>> translated once when writing to them using MVC and we don't always fall
>> back to byte-based copies.
>>
>> We will want to use the same mechanism for other accesses as well (e.g.,
>> mvcl), prepare for that right away.
>>
>> Signed-off-by: David Hildenbrand 
>> ---
>>  target/s390x/mem_helper.c | 213 +++---
>>  1 file changed, 200 insertions(+), 13 deletions(-)
>>
>> diff --git a/target/s390x/mem_helper.c b/target/s390x/mem_helper.c
>> index 91ba2e03d9..1ca293e00d 100644
>> --- a/target/s390x/mem_helper.c
>> +++ b/target/s390x/mem_helper.c
>> @@ -24,8 +24,10 @@
>>  #include 

Re: [Qemu-devel] [PATCH v1 2/4] s390x/tcg: Introduce probe_read_access()

2019-08-21 Thread Richard Henderson
On 8/21/19 10:37 AM, David Hildenbrand wrote:
> Hah, guess what, I implemented a similar variant of "fetch all
> of the host addresses" *but* it is not as easy as you might
> think (sorry for the bad news).

I think it is, because I didn't think it *that* easy.  :-)

> There are certain cases where we can't get access to the raw host
> page. Namely, cpu watchpoints, LAP, NODIRTY. In summary: this won't work
> as you describe. (my first approach did exactly this)

NODIRTY and LAP are automatically handled via probe_write
faulting instead of returning the address.  I think there
may be a bug in probe_write at present not checking

Watchpoints could be handled the same way, if we were to
export check_watchpoint from exec.c.  Indeed, I see no way
to handle watchpoints correctly if we don't.  I think that's
an outstanding bug with probe_write.

Any other objections?  I certainly think that restricting the
size of such operations to one page is a large simplification
over the S390Access array thing that you create in this patch.


r~

> 
> The following patch requires another re-factoring
> (tcg_s390_cpu_mmu_translate), but you should get the idea.
> 
> 
> 
> From 0cacd2aea3dbc25e93492cca04f6c866b86d7f8a Mon Sep 17 00:00:00 2001
> From: David Hildenbrand 
> Date: Tue, 20 Aug 2019 09:37:11 +0200
> Subject: [PATCH v1] s390x/tcg: Fault-safe MVC (MOVE) implementation
> 
> MVC can cross page boundaries. In case we fault on the second page, we
> already partially copied data. If we have overlaps, we would
> trigger a fault after having partially moved data, eventually having
> our original data already overwritten. When continuing after the fault,
> we would try to move already modified data, not the original data -
> very bad.
> 
> glibc started to use MVC for forward memmove() and is able to trigger
> exactly this corruption (via rpmbuild and rpm). Fedora 31 (rawhide)
> currently fails to install as we trigger rpm database corruptions due to
> this bug.
> 
> We need a way to translate a virtual address range to individual pages that
> we can access later on without triggering faults. Probing all virtual
> addresses once before the read/write is not sufficient - the guest could
> have modified the page tables (e.g., write-protect, map out) in between,
> so we could fault on any new tlb_fill() - we have to skip any new MMU
> translations.
> 
> Unfortunately, there are TLB entries for which we cannot get a host
> address (esp., watchpoints, LAP, NOTDIRTY) - in these cases we cannot
> avoid a new MMU translation using the ordinary ld/st helpers. Let's fall
> back to guest physical addresses in these cases, which we access via
> cpu_physical_memory_(read|write).
> 
> This change reduced the boottime for s390x guests (to prompt) from ~1:29
> min to ~1:19 min in my tests. For example, LAP protected pages are now only
> translated once when writing to them using MVC and we don't always fall
> back to byte-based copies.
> 
> We will want to use the same mechanism for other accesses as well (e.g.,
> mvcl), prepare for that right away.
> 
> Signed-off-by: David Hildenbrand 
> ---
>  target/s390x/mem_helper.c | 213 +++---
>  1 file changed, 200 insertions(+), 13 deletions(-)
> 
> diff --git a/target/s390x/mem_helper.c b/target/s390x/mem_helper.c
> index 91ba2e03d9..1ca293e00d 100644
> --- a/target/s390x/mem_helper.c
> +++ b/target/s390x/mem_helper.c
> @@ -24,8 +24,10 @@
>  #include "exec/helper-proto.h"
>  #include "exec/exec-all.h"
>  #include "exec/cpu_ldst.h"
> +#include "exec/cpu-common.h"
>  #include "qemu/int128.h"
>  #include "qemu/atomic128.h"
> +#include "tcg_s390x.h"
>  
>  #if !defined(CONFIG_USER_ONLY)
>  #include "hw/s390x/storage-keys.h"
> @@ -104,6 +106,181 @@ static inline void cpu_stsize_data_ra(CPUS390XState 
> *env, uint64_t addr,
>  }
>  }
>  
> +/*
> + * An access covers one page, except for the start/end of the translated
> + * virtual address range.
> + */
> +typedef struct S390Access {
> +union {
> +char *haddr;
> +hwaddr paddr;
> +};
> +uint16_t size;
> +bool isHaddr;
> +} S390Access;
> +
> +/*
> + * Prepare access to a virtual address range, guaranteeing we won't trigger
> + * faults during the actual access. Sometimes we can't get access to the
> + * host address (e.g., LAP, cpu watchpoints/PER, clean pages, ...). Then, we
> + * translate to guest physical addresses instead. We'll have to perform
> + * slower, indirect, access to these physical addresses then.
> + */
> +static void access_prepare_idx(CPUS390XState *env, S390Access access[],
> +   int nb_access, vaddr vaddr, target_ulong size,
> +   MMUAccessType access_type, int mmu_idx,
> +   uintptr_t ra)
> +{
> +int i = 0;
> +int cur_size;
> +
> +/*
> + * After we obtained the host address of a TLB entry that entry might
> + * be invalidated again - e.g., via tlb_set_dirty(), via 

Re: [Qemu-devel] [PATCH v1 2/4] s390x/tcg: Introduce probe_read_access()

2019-08-21 Thread David Hildenbrand
On 21.08.19 19:26, Richard Henderson wrote:
> On 8/21/19 2:22 AM, David Hildenbrand wrote:
>> +/*
>> + * Make sure the read access is permitted and TLB entries are created. In
>> + * very rare cases it might happen that the actual accesses might need
>> + * new MMU translations. If the page tables were changed in between, we
>> + * might still trigger a fault. However, this seems to barely happen, so we
>> + * can ignore this for now.
>> + */
>> +void probe_read_access(CPUS390XState *env, uint64_t addr, uint64_t len,
>> +   uintptr_t ra)
>> +{
>> +#ifdef CONFIG_USER_ONLY
>> +if (!guest_addr_valid(addr) || !guest_addr_valid(addr + len - 1) ||
>> +page_check_range(addr, len, PAGE_READ) < 0) {
>> +s390_program_interrupt(env, PGM_ADDRESSING, ILEN_AUTO, ra);
>> +}
>> +#else
>> +while (len) {
>> +const uint64_t pagelen = -(addr | -TARGET_PAGE_MASK);
>> +const uint64_t curlen = MIN(pagelen, len);
>> +
>> +cpu_ldub_data_ra(env, addr, ra);
>> +addr = wrap_address(env, addr + curlen);
>> +len -= curlen;
>> +}
>> +#endif
>> +}
> 
> I don't think this is really the right approach, precisely because of the
> comment above.
> 
> I think we should
> 
> (1) Modify the generic probe_write to return the host address,
> akin to tlb_vaddr_to_host except it *will* fault.
> 
> (2) Create a generic version of probe_write for CONFIG_USER_ONLY,
> much like the one you have done for target/s390x.
> 
> (3) Create generic version of probe_read that does the same.
> 
> (4) Rewrite fast_memset and fast_memmove to fetch all of the host
> addresses before doing any modifications.  The functions are
> currently written as if len can be very large, handling any
> number of pages.  Except that's not true.  While there are
> several kinds of users apart from MVC, two pages are sufficient
> for all users.
> 
> Well, should be.  We would need to adjust do_mvcl to limit the
> operation to TARGET_PAGE_SIZE (CC=3, cpu-determined number of
> bytes moved without reaching end of first operand).
> Which is probably a good idea anyway.  System mode should not
> spend forever executing one instruction, as it would if you
> pass in a 64-bit length from MVCLE.
> 

Related to that: yes, that's what I mentioned in the cover letter, the
MOVE variants are full of issues. MVCLE should be limited to 4096 bytes,
and we should return cc=3. However, MVCL (which also uses do_mvcl) is
also semi-broken (it's an interruptible instruction - not cc=3, we have
to update the register step by step). MVCL should return cc=3 in case
it's a destructive copy - also not implemented properly. mem_helper.c
is full of issues.
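
For MVCL's cc=3 case, the missing test is roughly this (a sketch,
simplified: it ignores address wrap-around in the 24/31-bit modes):

#include <stdbool.h>
#include <stdint.h>

/* Destructive overlap: the destination would overwrite source bytes
 * that have not been read yet, i.e., dst starts inside (src, src+len). */
static bool mvcl_is_destructive(uint64_t dst, uint64_t src, uint64_t len)
{
    return dst > src && dst < src + len;
}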

-- 

Thanks,

David / dhildenb



Re: [Qemu-devel] [PATCH v1 2/4] s390x/tcg: Introduce probe_read_access()

2019-08-21 Thread David Hildenbrand
On 21.08.19 19:26, Richard Henderson wrote:
> On 8/21/19 2:22 AM, David Hildenbrand wrote:
>> +/*
>> + * Make sure the read access is permitted and TLB entries are created. In
>> + * very rare cases it might happen that the actual accesses might need
>> + * new MMU translations. If the page tables were changed in between, we
>> + * might still trigger a fault. However, this seems to barely happen, so we
>> + * can ignore this for now.
>> + */
>> +void probe_read_access(CPUS390XState *env, uint64_t addr, uint64_t len,
>> +   uintptr_t ra)
>> +{
>> +#ifdef CONFIG_USER_ONLY
>> +if (!guest_addr_valid(addr) || !guest_addr_valid(addr + len - 1) ||
>> +page_check_range(addr, len, PAGE_READ) < 0) {
>> +s390_program_interrupt(env, PGM_ADDRESSING, ILEN_AUTO, ra);
>> +}
>> +#else
>> +while (len) {
>> +const uint64_t pagelen = -(addr | -TARGET_PAGE_MASK);
>> +const uint64_t curlen = MIN(pagelen, len);
>> +
>> +cpu_ldub_data_ra(env, addr, ra);
>> +addr = wrap_address(env, addr + curlen);
>> +len -= curlen;
>> +}
>> +#endif
>> +}
> 
> I don't think this is really the right approach, precisely because of the
> comment above.
> 
> I think we should
> 
> (1) Modify the generic probe_write to return the host address,
> akin to tlb_vaddr_to_host except it *will* fault.
> 
> (2) Create a generic version of probe_write for CONFIG_USER_ONLY,
> much like the one you have done for target/s390x.
> 
> (3) Create generic version of probe_read that does the same.
> 
> (4) Rewrite fast_memset and fast_memmove to fetch all of the host
> addresses before doing any modifications.  The functions are
> currently written as if len can be very large, handling any
> number of pages.  Except that's not true.  While there are
> several kinds of users apart from MVC, two pages are sufficient
> for all users.
> 
> Well, should be.  We would need to adjust do_mvcl to limit the
> operation to TARGET_PAGE_SIZE (CC=3, cpu-determined number of
> bytes moved without reaching end of first operand).
> Which is probably a good idea anyway.  System mode should not
> spend forever executing one instruction, as it would if you
> pass in a 64-bit length from MVCLE.


Hah, guess what, I implemented a similar variant of "fetch all
of the host addresses" *but* it is not as easy as you might
think (sorry for the bad news).

There are certain cases where we can't get access to the raw host
page. Namely, cpu watchpoints, LAP, NODIRTY. In summary: this won't work
as you describe. (my first approach did exactly this)

The following patch requires another re-factoring
(tcg_s390_cpu_mmu_translate), but you should get the idea.



From 0cacd2aea3dbc25e93492cca04f6c866b86d7f8a Mon Sep 17 00:00:00 2001
From: David Hildenbrand 
Date: Tue, 20 Aug 2019 09:37:11 +0200
Subject: [PATCH v1] s390x/tcg: Fault-safe MVC (MOVE) implementation

MVC can cross page boundaries. In case we fault on the second page, we
already partially copied data. If we have overlaps, we would
trigger a fault after having partially moved data, eventually having
our original data already overwritten. When continuing after the fault,
we would try to move already modified data, not the original data -
very bad.

glibc started to use MVC for forward memmove() and is able to trigger
exactly this corruption (via rpmbuild and rpm). Fedora 31 (rawhide)
currently fails to install as we trigger rpm database corruptions due to
this bug.

We need a way to translate a virtual address range to individual pages that
we can access later on without triggering faults. Probing all virtual
addresses once before the read/write is not sufficient - the guest could
have modified the page tables (e.g., write-protect, map out) in between,
so we could fault on any new tlb_fill() - we have to skip any new MMU
translations.

Unfortunately, there are TLB entries for which we cannot get a host
address (esp., watchpoints, LAP, NOTDIRTY) - in these cases we cannot
avoid a new MMU translation using the ordinary ld/st helpers. Let's fall
back to guest physical addresses in these cases, which we access via
cpu_physical_memory_(read|write).

This change reduced the boottime for s390x guests (to prompt) from ~1:29
min to ~1:19 min in my tests. For example, LAP protected pages are now only
translated once when writing to them using MVC and we don't always fall
back to byte-based copies.

We will want to use the same mechanism for other accesses as well (e.g.,
mvcl), prepare for that right away.

Signed-off-by: David Hildenbrand 
---
 target/s390x/mem_helper.c | 213 +++---
 1 file changed, 200 insertions(+), 13 deletions(-)

diff --git a/target/s390x/mem_helper.c b/target/s390x/mem_helper.c
index 91ba2e03d9..1ca293e00d 100644
--- a/target/s390x/mem_helper.c
+++ b/target/s390x/mem_helper.c
@@ -24,8 +24,10 @@
 #include "exec/helper-proto.h"
 #include "exec/exec-all.h"
 

Re: [Qemu-devel] [PATCH v1 2/4] s390x/tcg: Introduce probe_read_access()

2019-08-21 Thread Richard Henderson
On 8/21/19 2:22 AM, David Hildenbrand wrote:
> +/*
> + * Make sure the read access is permitted and TLB entries are created. In
> + * very rare cases it might happen that the actual accesses might need
> + * new MMU translations. If the page tables were changed in between, we
> + * might still trigger a fault. However, this seems to barely happen, so we
> + * can ignore this for now.
> + */
> +void probe_read_access(CPUS390XState *env, uint64_t addr, uint64_t len,
> +   uintptr_t ra)
> +{
> +#ifdef CONFIG_USER_ONLY
> +if (!guest_addr_valid(addr) || !guest_addr_valid(addr + len - 1) ||
> +page_check_range(addr, len, PAGE_READ) < 0) {
> +s390_program_interrupt(env, PGM_ADDRESSING, ILEN_AUTO, ra);
> +}
> +#else
> +while (len) {
> +const uint64_t pagelen = -(addr | -TARGET_PAGE_MASK);
> +const uint64_t curlen = MIN(pagelen, len);
> +
> +cpu_ldub_data_ra(env, addr, ra);
> +addr = wrap_address(env, addr + curlen);
> +len -= curlen;
> +}
> +#endif
> +}

I don't think this is really the right approach, precisely because of the
comment above.

I think we should

(1) Modify the generic probe_write to return the host address,
akin to tlb_vaddr_to_host except it *will* fault.

(2) Create a generic version of probe_write for CONFIG_USER_ONLY,
much like the one you have done for target/s390x.

(3) Create generic version of probe_read that does the same.

(4) Rewrite fast_memset and fast_memmove to fetch all of the host
addresses before doing any modifications.  The functions are
currently written as if len can be very large, handling any
number of pages.  Except that's not true.  While there are
several kinds of users apart from MVC, two pages are sufficient
for all users.

Well, should be.  We would need to adjust do_mvcl to limit the
operation to TARGET_PAGE_SIZE (CC=3, cpu-determined number of
bytes moved without reaching end of first operand).
Which is probably a good idea anyway.  System mode should not
spend forever executing one instruction, as it would if you
pass in a 64-bit length from MVCLE.


Something like

static void fast_memmove_idx(CPUS390XState *env, uint64_t dst,
 uint64_t src, uint32_t len,
 int dst_idx, int src_idx,
 uintptr_t ra)
{
void *dst1, *dst2, *dst3;
void *src1, *src2, *src3;
uint32_t len1, len2, lenr;
uint64_t dstr, srcr;

if (unlikely(len == 0)) {
return;
}
assert(len <= TARGET_PAGE_SIZE);

dst1 = probe_write(env, dst, 1, dst_idx, ra);
src1 = probe_read(env, src, 1, src_idx, ra);

if (dst1 == NULL || src1 == NULL) {
goto io_memmove;
}

/* Crop length so that neither SRC+LEN nor DST+LEN crosses a page. */
len1 = adj_len_to_page(adj_len_to_page(len, src), dst);
lenr = len - len1;

if (likely(lenr == 0)) {
memmove(dst1, src1, len);
return;
}

/* Probe for the second page and range.  */
dstr = dst + len1;
srcr = src + len1;
dst2 = probe_write(env, dstr, 1, dst_idx, ra);
src2 = probe_read(env, srcr, 1, src_idx, ra);

len2 = adj_len_to_page(adj_len_to_page(lenr, srcr), dstr);
lenr -= len2;

if (likely(lenr == 0)) {
memmove(dst1, src1, len1);
if (dst2 != NULL && src2 != NULL) {
memmove(dst2, src2, len2);
return;
}
dst = dstr;
src = srcr;
len = len2;
goto io_memmove;
}

/*
 * There is a chance for a third page and range.  E.g.
 *   dst = 0xaaaff0, src = 0xbbbff8, len = 32
 * len1 = 8, bringing src to its page crossing, then
 * len2 = 8, bringing dst to its page crossing, then
 * lenr = 16, finishing the rest of the operation.
 */
dstr += len2;
srcr += len2;
dst3 = probe_write(env, dstr, 1, dst_idx, ra);
src3 = probe_read(env, srcr, 1, src_idx, ra);

memmove(dst1, src1, len1);
memmove(dst2, src2, len2);
if (dst3 != NULL && src3 != NULL) {
memmove(dst3, src3, lenr);
return;
}
dst = dstr;
src = srcr;
len = lenr;

 io_memmove:
#ifdef CONFIG_USER_ONLY
/*
 * There is no I/O space, so probe_{read,write} raise exceptions
 * for unmapped pages and never return NULL.
 */
g_assert_not_reached();
#else
    {
        /* Fall back to a byte-wise copy through the cpu ld/st helpers. */
        TCGMemOpIdx oi_dst = make_memop_idx(MO_UB, dst_idx);
        TCGMemOpIdx oi_src = make_memop_idx(MO_UB, src_idx);
        do {
            uint8_t x = helper_ret_ldub_mmu(env, src, oi_src, ra);
            helper_ret_stb_mmu(env, dst, x, oi_dst, ra);
            src++;
            dst++;
        } while (--len != 0);
    }
#endif
}

static void fast_memmove(CPUS390XState *env, uint64_t dst, uint64_t src,
 uint32_t len, uintptr_t ra)
{
int mmu_idx = cpu_mmu_index(env, false);
fast_memmove_idx(env, dst, src, len, mmu_idx, mmu_idx, ra);
}


r~



[Qemu-devel] [PATCH v1 2/4] s390x/tcg: Introduce probe_read_access()

2019-08-21 Thread David Hildenbrand
Let's introduce a helper to probe read access (by actually reading a
piece of data of every page) and add a comment why this might not be
100% safe in all scenarios. Once we actually run into that issue, we'll
have to think of something else.

Signed-off-by: David Hildenbrand 
---
 target/s390x/internal.h   |  2 ++
 target/s390x/mem_helper.c | 34 ++
 2 files changed, 36 insertions(+)

diff --git a/target/s390x/internal.h b/target/s390x/internal.h
index c243fa725b..bdb833c525 100644
--- a/target/s390x/internal.h
+++ b/target/s390x/internal.h
@@ -354,6 +354,8 @@ void ioinst_handle_sal(S390CPU *cpu, uint64_t reg1, 
uintptr_t ra);
 
 /* mem_helper.c */
 target_ulong mmu_real2abs(CPUS390XState *env, target_ulong raddr);
+void probe_read_access(CPUS390XState *env, uint64_t addr, uint64_t len,
+   uintptr_t ra);
 void probe_write_access(CPUS390XState *env, uint64_t addr, uint64_t len,
 uintptr_t ra);
 
diff --git a/target/s390x/mem_helper.c b/target/s390x/mem_helper.c
index 7819aca15d..4e9d126e2c 100644
--- a/target/s390x/mem_helper.c
+++ b/target/s390x/mem_helper.c
@@ -2612,6 +2612,40 @@ uint32_t HELPER(cu42)(CPUS390XState *env, uint32_t r1, 
uint32_t r2, uint32_t m3)
decode_utf32, encode_utf16);
 }
 
+/*
+ * Make sure the read access is permitted and TLB entries are created. In
+ * very rare cases it might happen that the actual accesses might need
+ * new MMU translations. If the page tables were changed in between, we
+ * might still trigger a fault. However, this seems to barely happen, so we
+ * can ignore this for now.
+ */
+void probe_read_access(CPUS390XState *env, uint64_t addr, uint64_t len,
+   uintptr_t ra)
+{
+#ifdef CONFIG_USER_ONLY
+if (!guest_addr_valid(addr) || !guest_addr_valid(addr + len - 1) ||
+page_check_range(addr, len, PAGE_READ) < 0) {
+s390_program_interrupt(env, PGM_ADDRESSING, ILEN_AUTO, ra);
+}
+#else
+while (len) {
+const uint64_t pagelen = -(addr | -TARGET_PAGE_MASK);
+const uint64_t curlen = MIN(pagelen, len);
+
+cpu_ldub_data_ra(env, addr, ra);
+addr = wrap_address(env, addr + curlen);
+len -= curlen;
+}
+#endif
+}
+
+/*
+ * Make sure the write access is permitted and TLB entries are created. In
+ * very rare cases it might happen that the actual accesses might need
+ * new MMU translations - especially, on LAP protected pages. If the page
+ * tables were changed in between, we might still trigger a fault. However,
+ * this seems to barely happen, so we can ignore this for now.
+ */
 void probe_write_access(CPUS390XState *env, uint64_t addr, uint64_t len,
 uintptr_t ra)
 {
-- 
2.21.0