[issue32561] Add API to io objects for cache-only reads/writes

2020-07-04 Thread Joshua Bronson


Change by Joshua Bronson:


--
nosy: +jab

[issue32561] Add API to io objects for cache-only reads/writes

2019-10-11 Thread STINNER Victor


STINNER Victor added the comment:

I suggest that we not attempt to put "async" or "await" in the io module, to 
keep it as "simple" as possible, but fix bpo-13322 (in the io and _pyio modules).

Instead, I suggest writing a new module that only provides asynchronous methods, 
maybe even for open() and close(). This new module can reuse the existing io 
module to avoid having to rewrite complex algorithms like read-ahead, buffering, 
etc.
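
Purely for illustration, the surface of such a module might look something 
like this (every name below is invented, nothing here exists yet):

async def aopen(path, mode="rb", **kwargs):
    # Would reuse io's buffering machinery internally, but never block
    # the caller, not even for the open() syscall itself.
    ...

class AsyncBufferedReader:
    async def read(self, size=-1): ...
    async def readline(self): ...
    async def close(self): ...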

Well, I'm not 100% sure that it's doable: io hides many implementation details, 
and there are complex issues like multithreading, locks, interlaced read and 
write operations, etc.

Note: the io module doesn't fully support interlaced read and write :-) See 
bpo-12215 for example.

[issue32561] Add API to io objects for cache-only reads/writes

2019-10-11 Thread STINNER Victor


STINNER Victor added the comment:

Here is a proof-of-concept of an asynchronous io module reusing the existing 
blocking io module: it implements AsyncBufferedReader.readline() using the 
existing _pyio.BufferedReader.readline().

The approach seems to work, but only if bpo-13322 is fixed first: 
BufferedReader, TextIOWrapper & friends must return None if the underlying 
object (the "raw" or "buffer" object) returns None.

--

My PoC uses 3 classes:

* AsyncFileIO: similar to io.FileIO, but uses "async def"
* AsyncBufferedReader: similar to io.BufferedReader, but uses "async def"
* FileIOSandwich: glue between the asynchronous AsyncFileIO and the blocking 
io.BufferedReader

At the first read, the FileIOSandwich.read(n) call returns None, but it stores 
the requested read size (n).

If AsyncBufferedReader gets None, it calls FileIOSandwich._prepare_read() 
*asynchronously*.

Then FileIOSandwich.read(n) is called again, and this time it no longer blocks, 
since the data has already been read.
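
A minimal sketch of how the pieces might fit together (the names follow the 
description above, but the details are assumptions; the real code is in the 
attached poc_aio.py):

import os
import _pyio

class FileIOSandwich:
    # Looks like a blocking raw file to BufferedReader, but defers the
    # actual read to an explicit asynchronous step.
    closed = False  # plus whatever other IOBase boilerplate is needed

    def __init__(self, fd):
        self._fd = fd
        self._pending_size = None  # read size requested by BufferedReader
        self._data = None          # filled in by _prepare_read()

    def readable(self):
        return True

    def read(self, n):
        if self._data is not None:
            data, self._data = self._data, None
            return data
        # First call: remember the requested size and return None;
        # BufferedReader must propagate that None (this is bpo-13322).
        self._pending_size = n
        return None

    async def _prepare_read(self):
        # The only step that really blocks; a real implementation would
        # run it in a thread or use a non-blocking primitive.
        self._data = os.read(self._fd, self._pending_size)

class AsyncBufferedReader:
    def __init__(self, raw):
        self._raw = raw                              # a FileIOSandwich
        self._buffered = _pyio.BufferedReader(raw)   # reuse io's logic

    async def readline(self):
        while True:
            line = self._buffered.readline()  # needs the bpo-13322-style
            if line is not None:              # fix shown below
                return line
            await self._raw._prepare_read()   # fill the cache, then retry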

--

Since bpo-13322 is not fixed, my PoC uses _pyio since it's easier to fix. It 
needs the following fix for _pyio.BufferedReader.readline():

diff --git a/Lib/_pyio.py b/Lib/_pyio.py
index c1bdac7913..e90742ec43 100644
--- a/Lib/_pyio.py
+++ b/Lib/_pyio.py
@@ -557,6 +557,8 @@ class IOBase(metaclass=abc.ABCMeta):
         res = bytearray()
         while size < 0 or len(res) < size:
             b = self.read(nreadahead())
+            if b is None and not res:
+                return None
             if not b:
                 break
             res += b


--

Example:

$ ./python poc_aio.py poc_aio.py 
data: b'import asyncio\n'

Internally, _prepare_read() reads 8 KiB.

--
Added file: https://bugs.python.org/file48656/poc_aio.py

[issue32561] Add API to io objects for cache-only reads/writes

2019-10-10 Thread Nathaniel Smith

Nathaniel Smith added the comment:

> If you wanted to keep async disk access separate from the io module, then 
> what we'd have to do is to create a fork of all the code in the io module, 
> and add this feature to it.

Thinking about this again today, I realized there *might* be another option.

The tricky thing about supporting async file I/O is that users want the whole 
io module interface, and we don't want to have to reimplement all the 
functionality in TextIOWrapper, BufferedReader, BufferedWriter, etc. And we 
still need the blocking functionality too, for when we fall back to threads.

But, here's a possible hack. We could implement our own version of 'FileIO' 
that wraps around a real FileIO. Every operation just delegates to the 
underlying FileIO – but with a twist. Something like:

def wrapped_op(self, *args):
    # 'op' stands for whichever method is being wrapped; CachedOp is a
    # placeholder for a simple (key, result) record.
    if self._cached_op is not None and self._cached_op.key == (op, args):
        return self._cached_op.result
    if MAGIC_THREAD_LOCAL.io_is_forbidden:
        def cache_filler():
            # Runs in a worker thread, where blocking I/O is allowed.
            MAGIC_THREAD_LOCAL.io_is_forbidden = False
            # Store the key with the result so the retry hits the cache.
            self._cached_op = CachedOp((op, args), self._real_file.op(*args))
        raise IOForbiddenError(cache_filler)
    return self._real_file.op(*args)

And then in order to implement an async operation, we do something like:

async def op(self, *args):
    while True:
        try:
            # First try fulfilling the operation from cache
            MAGIC_THREAD_LOCAL.io_is_forbidden = True
            return self._io_obj.op(*args)
        except IOForbiddenError as exc:
            # We have to actually hit the disk: run the real I/O
            # operation in a thread, then try again.
            await in_thread(exc.cache_filler)
        finally:
            del MAGIC_THREAD_LOCAL.io_is_forbidden
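
For this sketch to hang together, the exception would also have to carry the 
cache-filling callback; something like this (again, invented for illustration):

class IOForbiddenError(Exception):
    def __init__(self, cache_filler):
        super().__init__("blocking I/O is forbidden in this context")
        # Callable that performs the real I/O and fills the cache.
        self.cache_filler = cache_filler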

This is pretty convoluted: we keep trying the operation on the outer "buffered" 
object, seeing which low-level I/O operation it gets stuck on, doing that I/O 
operation, and trying again. There's all kinds of tricky non-local state here; 
like for example, there isn't any formal guarantee that the next time we try 
the "outer" I/O operation it will end up making exactly the same request to the 
"inner" RawIO object. If you try performing I/O operations on the same file 
from multiple tasks concurrently then you'll get all kinds of havoc. But if it 
works, then it does have two advantages:

First, it doesn't require changes to the io module, which is at least nice for 
experimentation.

And second, it's potentially compatible with the io_uring style of async disk 
I/O API. I don't actually know if this matters; if you look at the io_uring 
docs, the only reason they say they're more efficient than a thread pool is 
that they can do the equivalent of preadv(RWF_NOWAIT), and it's a lot easier to 
add preadv(RWF_NOWAIT) to a thread pool than it is to implement io_uring. But 
still, this approach is potentially more flexible than my original idea.
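
For reference, the preadv(RWF_NOWAIT) trick is already reachable from Python 
on Linux >= 4.14 via os.preadv() and os.RWF_NOWAIT (both added in Python 3.7). 
A rough sketch, where in_thread() again stands in for the event loop's 
run-in-a-thread primitive:

import os

async def read_at(fd, offset, size):
    buf = bytearray(size)
    try:
        # Serve the read from the page cache without blocking, if possible.
        # (A real implementation would also handle short reads.)
        n = os.preadv(fd, [buf], offset, os.RWF_NOWAIT)
        return bytes(buf[:n])
    except BlockingIOError:
        # Data not cached: fall back to a blocking read in a worker thread.
        return await in_thread(os.pread, fd, size, offset)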

We'd still have to reimplement open() in order to set up our weird custom IO 
stacks, but hopefully that's not *too* bad, since it's mostly just a bunch of 
if statements to decide which wrappers to stick around the raw IO object.

My big concern is that I'm not actually sure if this works :-).

The thing is, for this to work, we need 
TextIOWrapper/BufferedReader/BufferedWriter to be very well-behaved when the 
underlying operation raises an exception. In particular, if they're doing a 
complex operation that requires multiple calls to the underlying object, and 
the second call raises an exception, they need to keep the first call's results 
in their buffer so that next time they can pick up where they left off. And I 
have no idea if that's true.

I guess if you squint this is kind of like the non-blocking support in the io 
module – IOForbiddenError is like BlockingIOError. The big difference is that 
here, we don't have any "partial success" state at the low level; either we do 
the operation immediately, or we punt and do the operation in a thread. Either 
way it completes as a single indivisible unit. So that might simplify things? 
From a quick skim of bpo-13322 it sounds like a lot of the problems with the 
current "non-blocking" mode come from these partial-success states, but I 
haven't read it in detail.

--
title: Add API to io objects for non-blocking reads/writes -> Add API to io 
objects for cache-only reads/writes
