Re: [Python-Dev] Deprecate codecs.open() and StreamWriter/StreamReader
Victor Stinner writes:

> It's difficult for a user to choose between open() and
> codecs.open().

Is it? How about the following decision process?

If writing code for Python 3.x only, use open().
If writing code which has to work under both Python 2.x and 3.x, use codecs.open().

BTW I have written code using StreamReader and StreamWriter in the past, though it may not have been published on the Internet. Python is used a lot by companies for internal systems. Such code is seldom published on the Internet, so there's no real way of knowing how much StreamReader/StreamWriter are used.

When looking at porting projects to Python 3.x, I've always adopted a single code-base approach for 2.x and 3.x, as I feel it's the path of least ongoing maintenance and hence (in my experience) the path of least resistance to providing 3.x support.

Though of course I've no objection to implementing their functionality in the most efficient way possible (which may well be TextIOWrapper), IMO deprecating StreamReader/StreamWriter will make 2.x/3.x portability harder to achieve, and so seems a step too far.

Regards,

Vinay Sajip
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
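To make the decision process above concrete, here is a minimal sketch that reads the same file through both APIs. The helper names (read_text_portable, read_text_py3, demo) are made up for illustration and are not from any message in this thread; the warning filter is an assumption that some newer Python 3 releases may warn on codecs.open().

```python
import codecs
import os
import tempfile
import warnings


def read_text_portable(path, encoding="utf-8"):
    """Read a whole text file via codecs.open(), available on 2.x and 3.x."""
    with warnings.catch_warnings():
        # Assumption: newer Python 3 releases may emit a DeprecationWarning.
        warnings.simplefilter("ignore", DeprecationWarning)
        f = codecs.open(path, "r", encoding=encoding)
    try:
        return f.read()
    finally:
        f.close()


def read_text_py3(path, encoding="utf-8"):
    """Read a whole text file via the Python 3 builtin open()."""
    # newline="" turns off newline translation, like codecs.open().
    with open(path, "r", encoding=encoding, newline="") as f:
        return f.read()


def demo():
    """Write a small UTF-8 file and read it back through both APIs."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as raw:
            raw.write(u"caf\u00e9\n".encode("utf-8"))
        return read_text_portable(path), read_text_py3(path)
    finally:
        os.remove(path)
```

Both helpers should return the same text for the same file; only read_text_portable() is written so that it could also run under Python 2.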
Re: [Python-Dev] Deprecate codecs.open() and StreamWriter/StreamReader
On Sat, May 28, 2011 at 6:30 AM, Terry Reedy wrote:
> On 5/27/2011 11:08 AM, Victor Stinner wrote:
>
>> Tell me if I am wrong, but only Marc-Andre is against deprecating
>> StreamReader
>
> While I am, in general, in favor of removing some duplication, I was and am
> against doing this change precipitously. So I was for the reversion (noted),
> at least temporarily. Given the disagreement, I think there should be a PEP
> with pro and con arguments.

Indeed.

I'm also against any deprecation in this area, since that just means needless work for anyone who *does* use these APIs (even if those people are few and far between). If we can refactor to remove the duplication of functionality, that's a *much* better solution.

If we can carry optparse style argument parsing and 2.x style string formatting, we can carry a couple of legacy codec interface definitions.

Cheers,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
Re: [Python-Dev] Deprecate codecs.open() and StreamWriter/StreamReader
On 5/27/2011 11:08 AM, Victor Stinner wrote:

> Tell me if I am wrong, but only Marc-Andre is against deprecating
> StreamReader

While I am, in general, in favor of removing some duplication, I was and am against doing this change precipitously. So I was for the reversion (noted), at least temporarily. Given the disagreement, I think there should be a PEP with pro and con arguments.

--
Terry Jan Reedy
Re: [Python-Dev] Deprecate codecs.open() and StreamWriter/StreamReader
Victor Stinner wrote:
> On Friday 27 May 2011 15:42:10, M.-A. Lemburg wrote:
>> If we'd go by your reasoning for deprecating and eventually
>> removing parts of the stdlib or Python's subsystems, we'll end
>> up with a barebone version of Python. That's not what we want
>> and it's not what our users want.
>
> I don't want to deprecate the whole stdlib, just duplicate old API, to follow
> "import this" mantra:
>
> "There should be one-- and preferably only one --obvious way to do it."

What people tend to miss in this mantra is the last part: "obvious".

It doesn't say: there should only be one way to do it. There can be many ways, but there should preferably be only one *obvious* way.

Using codecs.open() is not obvious in Python 3, since the standard open() already provides a way to access an encoded stream. Using a builtin is the obvious way to go.

It is obvious in Python 2, where the standard open() doesn't provide a way to define an encoding, so the user has to explicitly look for this kind of API and then find it in the "obvious" (to some extent) codecs module, since that's where encodings happen in Python 2.

Having multiple ways to do things is the most natural thing on earth and it's good that way. Python does not and should not force people into doing things in one dictated "right" way. It should, however, provide natural choices and obvious hints to find a good solution. And that's what the Zen mantra is all about.

> It's difficult for a user to choose between open() and codecs.open().

As I mentioned on the ticket and in my replies: I'm not against changing codecs.open() to use a variant that is based on TextIOWrapper, provided there are no user noticeable compatibility issues.

Thanks for reverting the patch.

Have a nice weekend,
--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source (#1, May 27 2011)
Re: [Python-Dev] Deprecate codecs.open() and StreamWriter/StreamReader
On Friday 27 May 2011 15:42:10, M.-A. Lemburg wrote:
> If we'd go by your reasoning for deprecating and eventually
> removing parts of the stdlib or Python's subsystems, we'll end
> up with a barebone version of Python. That's not what we want
> and it's not what our users want.

I don't want to deprecate the whole stdlib, just the duplicate old API, to follow the "import this" mantra:

"There should be one-- and preferably only one --obvious way to do it."

It's difficult for a user to choose between open() and codecs.open().

Victor
Re: [Python-Dev] Deprecate codecs.open() and StreamWriter/StreamReader
On Friday 27 May 2011 16:01:14, Nick Coghlan wrote:
> On Fri, May 27, 2011 at 11:42 PM, M.-A. Lemburg wrote:
> > Wrong order: first write a PEP, then discuss, then get approval,
> > then patch.
>
> Indeed.
>
> If another committer says "please revert and better justify this
> change" then we revert it. We don't get into commit wars.

I reverted my controversial commit.

> Something does need to be done to resolve the duplication of
> functionality between the io and codecs modules, but it is *far* from
> clear that deprecating chunks of the longer standing API is the right
> way to go about it.

Yes, StreamReader & friends have been present in Python since Python 2.0.

> This is especially true given Guido's explicit
> direction following the issues with the PyCObject removal in 3.2 that
> we be *very* conservative about introducing additional
> incompatibilities between Python 2 and Python 3.

I did search for usage of these classes on the Internet, and apart from projects implementing their own codecs (and so implementing their own StreamReader/StreamWriter classes, even if they don't use them), I only found one project using StreamReader directly: Pygments (*). I searched quickly, so don't trust these results :-)

StreamReader & friends are used indirectly through codecs.open(). My patch changes codecs.open() to make it reuse open() (io.TextIOWrapper), so the deprecation of StreamReader would not be noticed by most users.

I think that there are many more users of PyCObject than users of the StreamReader API directly (not through codecs.open()).

(*) I also found Sphinx, but I was wrong: it doesn't use StreamReader, it just has a full copy of the UTF-8-SIG codec, which has a StreamReader class. I don't think that the class is used.

Victor
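For readers who have not used the io layer directly, the idea of building text access on top of io.TextIOWrapper can be sketched as follows. This is only an illustration and not the reverted patch itself; open_text() and demo() are hypothetical names, and newline="" is an assumption chosen to mimic the lack of newline translation in codecs.open().

```python
import io


def open_text(binary_stream, encoding):
    """Wrap an already-open binary stream in a decoding text layer."""
    # newline="" keeps line endings untranslated in the returned text.
    return io.TextIOWrapper(binary_stream, encoding=encoding, newline="")


def demo():
    """Decode a small UTF-16 payload line by line through TextIOWrapper."""
    raw = io.BytesIO(u"z\u00fcrich\r\nbern\n".encode("utf-16"))
    with open_text(raw, "utf-16") as text:
        return text.readlines()
```

In this sketch the caller never touches a StreamReader; the codec is reached only through the encoding name.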
Re: [Python-Dev] Deprecate codecs.open() and StreamWriter/StreamReader
2011/5/27 Victor Stinner:
> On Friday 27 May 2011 15:33:07, Benjamin Peterson wrote:
>> 2011/5/27 Victor Stinner:
>> > You have until the release of Python 3.3 to prove that StreamReader
>> > and/or StreamWriter can be faster than TextIOWrapper. If you can prove
>> > it using a patch and a benchmark, I will be ok to revert my commit.
>>
>> Please don't hold commits over someone's head.
>
> Tell me if I am wrong, but only Marc-Andre is against deprecating StreamReader
> and StreamWriter. Walter and Antoine are in favor of using TextIOWrapper
> instead of StreamReader/StreamWriter.

I am too. There does, however, seem to be significant disagreement, and it shouldn't be a race to see who can commit first.

>
> Different people would like to be able to call codecs.open() in Python 2 and 3,
> so I kept the function with its API unchanged, and I documented that open()
> should be preferred (but I did not deprecate codecs.open).

--
Regards,
Benjamin
Re: [Python-Dev] Deprecate codecs.open() and StreamWriter/StreamReader
On Friday 27 May 2011 15:33:07, Benjamin Peterson wrote:
> 2011/5/27 Victor Stinner:
> > You have until the release of Python 3.3 to prove that StreamReader
> > and/or StreamWriter can be faster than TextIOWrapper. If you can prove
> > it using a patch and a benchmark, I will be ok to revert my commit.
>
> Please don't hold commits over someone's head.

Tell me if I am wrong, but only Marc-Andre is against deprecating StreamReader and StreamWriter. Walter and Antoine are in favor of using TextIOWrapper instead of StreamReader/StreamWriter.

Different people would like to be able to call codecs.open() in Python 2 and 3, so I kept the function with its API unchanged, and I documented that open() should be preferred (but I did not deprecate codecs.open).

Victor
Re: [Python-Dev] Deprecate codecs.open() and StreamWriter/StreamReader
On Fri, May 27, 2011 at 11:42 PM, M.-A. Lemburg wrote:
>
> Wrong order: first write a PEP, then discuss, then get approval,
> then patch.

Indeed.

If another committer says "please revert and better justify this change" then we revert it. We don't get into commit wars.

Something does need to be done to resolve the duplication of functionality between the io and codecs modules, but it is *far* from clear that deprecating chunks of the longer standing API is the right way to go about it. This is especially true given Guido's explicit direction following the issues with the PyCObject removal in 3.2 that we be *very* conservative about introducing additional incompatibilities between Python 2 and Python 3.

Cheers,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
Re: [Python-Dev] Deprecate codecs.open() and StreamWriter/StreamReader
Victor Stinner wrote:
> On Friday 27 May 2011 10:17:29, M.-A. Lemburg wrote:
>> I am still -1 on deprecating the StreamReader/Writer parts of
>> the codec APIs. I've given numerous reasons on why these are
>> useful, what their intention is, why they were added to Python 1.6.
>
> codecs.open() now uses TextIOWrapper, so there is no good reason to keep
> StreamReader or StreamWriter. You did not give me any use case where
> StreamReader or StreamWriter should be used instead of TextIOWrapper. You only
> listed theoretical optimizations.
>
> You have until the release of Python 3.3 to prove that StreamReader and/or
> StreamWriter can be faster than TextIOWrapper. If you can prove it using a
> patch and a benchmark, I will be ok to revert my commit.

Victor, please revert the change. It has *not* been approved!

If we'd go by your reasoning for deprecating and eventually removing parts of the stdlib or Python's subsystems, we'll end up with a barebone version of Python. That's not what we want and it's not what our users want.

I have tried to explain the design decisions and reasons for those codec APIs at great length. You've pretty much used up my patience.

If you are not going to revert the patch, I will.

>> Since such a deprecation would change an important documented API,
>> please write a PEP outlining your reasoning, including my comments,
>> use cases and possibilities for optimizations.
>
> Ok, I will write a PEP explaining why StreamReader and StreamWriter are
> deprecated.

Wrong order: first write a PEP, then discuss, then get approval, then patch.

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source (#1, May 27 2011)
Re: [Python-Dev] Deprecate codecs.open() and StreamWriter/StreamReader
2011/5/27 Victor Stinner:
> You have until the release of Python 3.3 to prove that StreamReader and/or
> StreamWriter can be faster than TextIOWrapper. If you can prove it using a
> patch and a benchmark, I will be ok to revert my commit.

Please don't hold commits over someone's head.

--
Regards,
Benjamin
Re: [Python-Dev] Deprecate codecs.open() and StreamWriter/StreamReader
On Friday 27 May 2011 10:17:29, M.-A. Lemburg wrote:
> > I think that the readahead algorithm is much faster than trying to
> > avoid partial input, and it's not a problem to have partial input if you
> > use an incremental decoder.
>
> Depends on where you're coming from. For non-seekable streams
> such as sockets or pipes, readahead is not going to work.

I don't see how StreamReader/StreamWriter can do a better job than TextIOWrapper for non-seekable streams.

> > TextIOWrapper implements this optimization using its readahead
> > algorithm.
>
> It does yes, but the above was an optimization specific
> to single character encodings, not all encodings and
> TextIOWrapper doesn't know anything about specific characteristics
> of the underlying encodings (except perhaps a few special
> cases).

Please give me numbers: how fast are your suggested optimizations? Are they faster than readahead? All of your arguments are based on theoretical facts.

> > Do you mean that you would like to reimplement codecs in C?
>
> As use of Unicode codecs increases in Python applications,
> this would certainly be an approach to consider, yes.

I am not sure that StreamReader is/can be faster than TextIOWrapper if it is reimplemented in C (see the updated benchmark below, codecs vs _pyio).

> > test_io.read(): 3991.0 ms
> > test_codecs.read(): 1736.9 ms
> > -> codecs 130% FASTER than io
>
> No surprise here. It's also a very common use case
> to read the whole file in one go and the bigger
> the file, the more impact this has.

Oh, I understand now why codecs is always faster than _pyio (or even io): it's because of IncrementalNewlineDecoder. To be fair, read(-1) should be tested without IncrementalNewlineDecoder, e.g. with newline='\n'.

newline='' cannot be used for the read(-1) test, because even if newline='' indicates that we don't want to translate newlines, read(-1) uses the IncrementalNewlineDecoder (which is slower than not calling it at all). We may optimize this specific case in TextIOWrapper.

> > (3) Decode Lib/test/cjkencodings/gb18030.txt (501 characters) from
> > gb18030
> >
> > test_io.readline(): 38.9 ms
> > test_codecs.readline(): 15.1 ms
> > -> codecs 157% FASTER than io
> >
> > test_io.read(1): 369.8 ms
> > test_codecs.read(1): 302.2 ms
> > -> codecs 22% FASTER than io
> >
> > test_io.read(100): 258.2 ms
> > test_codecs.read(100): 155.1 ms
> > -> codecs 67% FASTER than io
> >
> > test_io.read(): 1803.2 ms
> > test_codecs.read(): 1002.9 ms
> > -> codecs 80% FASTER than io
>
> These results are interesting since gb18030 is a shift
> encoding which keeps state in the encoded data stream, so
> the strategy chosen by TextIOWrapper doesn't work out that
> well.

In the 4 tests, TextIOWrapper only calls the decoder *once*, on the whole content of the file. The file size is 864 bytes, which is smaller than the TextIOWrapper chunk size (2048 bytes).

The StreamReader of the gb18030 codec is implemented in C, not in Python (using multibytecodec.c). So to be fair, the test on this encoding should be done using io, not _pyio.

Moreover, the multibytecodec module doesn't support universal newlines! It only supports '\n' newlines. So to be more fair, the test should use the '\n' newline. It's one more reason to use TextIOWrapper instead of StreamReader: it has the same behaviour (universal newlines) for all encodings. Or is it yet another bug in StreamReader?

> I am still -1 on deprecating the StreamReader/Writer parts of
> the codec APIs. I've given numerous reasons on why these are
> useful, what their intention is, why they were added to Python 1.6.

codecs.open() now uses TextIOWrapper, so there is no good reason to keep StreamReader or StreamWriter. You did not give me any use case where StreamReader or StreamWriter should be used instead of TextIOWrapper. You only listed theoretical optimizations.

You have until the release of Python 3.3 to prove that StreamReader and/or StreamWriter can be faster than TextIOWrapper. If you can prove it using a patch and a benchmark, I will be ok to revert my commit.

> Since such a deprecation would change an important documented API,
> please write a PEP outlining your reasoning, including my comments,
> use cases and possibilities for optimizations.

Ok, I will write a PEP explaining why StreamReader and StreamWriter are deprecated.

---

I wrote a new benchmarking script which tries to compare codecs more closely to io/_pyio (it changes the newline value and uses io for gb18030). It should be a little bit more reliable because each test now runs 5 times (taking the smallest time), but it's not really reliable... The script is attached to this mail.

(1) Decode Objects/unicodeobject.c (317334 characters) from utf-8

_pyio.readline(): 1078.4 ms (8 loops, newline: '')
codecs.readline(): 983.0 ms (8 loops, newline: '')
-> codecs 10% FASTER than _pyio

_pyio.read(1): 3503.5 ms (2 loops, newline: '')
codecs.read(1): 66
Re: [Python-Dev] Deprecate codecs.open() and StreamWriter/StreamReader
Victor Stinner wrote:
> On Wednesday 25 May 2011 at 15:43 +0200, M.-A. Lemburg wrote:
>> For UTF-16 it would e.g. make sense to always read data in blocks
>> with even sizes, removing the trial-and-error decoding and extra
>> buffering currently done by the base classes. For UTF-32, the
>> blocks should have size % 4 == 0.
>>
>> For UTF-8 (and other variable length encodings) it would make
>> sense looking at the end of the (bytes) data read from the
>> stream to see whether a complete code point was read or not,
>> rather than simply running the decoder on the complete data
>> set, only to find that a few bytes at the end are missing.
>
> I think that the readahead algorithm is much faster than trying to
> avoid partial input, and it's not a problem to have partial input if you
> use an incremental decoder.

Depends on where you're coming from. For non-seekable streams such as sockets or pipes, readahead is not going to work. For seekable streams, I agree that readahead is the better strategy.

And of course, it also makes sense to use incremental decoders for these encodings.

>> For single character encodings, it would make sense to prefetch
>> data in big chunks and skip all the trial and error decoding
>> implemented by the base classes to address the above problem
>> with variable length encodings.
>
> TextIOWrapper implements this optimization using its readahead
> algorithm.

It does yes, but the above was an optimization specific to single character encodings, not all encodings, and TextIOWrapper doesn't know anything about specific characteristics of the underlying encodings (except perhaps a few special cases).

>> That's somewhat unfair: TextIOWrapper is implemented in C,
>> whereas the StreamReader/Writer subclasses used by the
>> codecs are written in Python.
>>
>> A fair comparison would use the Python implementation of
>> TextIOWrapper.
>
> Do you mean that you would like to reimplement codecs in C?

As use of Unicode codecs increases in Python applications, this would certainly be an approach to consider, yes.

Looking at the current situation, it is better to use TextIOWrapper as it provides better performance, but since TextIOWrapper cannot (per design) provide per-codec optimizations, this is likely to change with a rewrite in C of those codecs that benefit a lot from such specific optimizations.

> It is not
> relevant to compare codecs and _pyio, because codecs reuses
> BufferedReader (of the io module, not of the _pyio module), and io is
> the main I/O module of Python 3.

They both use whatever stream you pass in as parameter, so your TextIOWrapper benchmark will also use the BufferedReader of the io module.

The point here is to compare Python to Python, not Python to C.

> But well, as you want, here is a benchmark comparing:
>     _pyio.TextIOWrapper(io.open(filename, 'rb'), encoding)
> and
>     codecs.open(filename, encoding)
>
> The only change with my previous bench.py script is the test_io()
> function:
>
> def test_io(test_func, chunk_size):
>     with open(FILENAME, 'rb') as buffered:
>         f = _pyio.TextIOWrapper(buffered, ENCODING)
>         test_file(f, test_func, chunk_size)
>         f.close()

Thanks for running those tests.

> (1) Decode Objects/unicodeobject.c (317336 characters) from utf-8
>
> test_io.readline(): 1193.4 ms
> test_codecs.readline(): 1267.9 ms
> -> codecs 6% slower than io
>
> test_io.read(1): 21696.4 ms
> test_codecs.read(1): 36027.2 ms
> -> codecs 66% slower than io
>
> test_io.read(100): 3080.7 ms
> test_codecs.read(100): 3901.7 ms
> -> codecs 27% slower than io

This shows that StreamReader/Writer could benefit quite a bit from using incremental encoders/decoders.

> test_io.read(): 3991.0 ms
> test_codecs.read(): 1736.9 ms
> -> codecs 130% FASTER than io

No surprise here. It's also a very common use case to read the whole file in one go and the bigger the file, the more impact this has.

> (2) Decode README (6613 characters) from ascii
>
> test_io.readline(): 678.1 ms
> test_codecs.readline(): 760.5 ms
> -> codecs 12% slower than io
>
> test_io.read(1): 13533.2 ms
> test_codecs.read(1): 21900.0 ms
> -> codecs 62% slower than io
>
> test_io.read(100): 2663.1 ms
> test_codecs.read(100): 3270.1 ms
> -> codecs 23% slower than io
>
> test_io.read(): 6769.1 ms
> test_codecs.read(): 3919.6 ms
> -> codecs 73% FASTER than io

See above.

> (3) Decode Lib/test/cjkencodings/gb18030.txt (501 characters) from
> gb18030
>
> test_io.readline(): 38.9 ms
> test_codecs.readline(): 15.1 ms
> -> codecs 157% FASTER than io
>
> test_io.read(1): 369.8 ms
> test_codecs.read(1): 302.2 ms
> -> codecs 22% FASTER than io
>
> test_io.read(100): 258.2 ms
> test_codecs.read(100): 155.1 ms
> -> codecs 67% FASTER than io
>
> test_io.read(): 1803.2 ms
> test_codecs.read(): 1002.9 ms
> -> codecs 80% FASTER than io

These results are interesting since gb18030 is a shift encoding which keeps state in the encoded data stream, so the strategy chosen by TextI
Re: [Python-Dev] Deprecate codecs.open() and StreamWriter/StreamReader
On Wednesday 25 May 2011 at 13:10 +0200, Victor Stinner wrote:
> codecs is always faster (between 1.07 and 1.15 times faster than io) to
> read the whole content of file using read(-1). Something should maybe be
> optimized in TextIOWrapper.read() ;-)

Oh, I understand: maybe it's because the universal newline mode of TextIOWrapper was enabled. If you disable it using open(..., newline='\n'), io and codecs run at the same speed when reading the whole content of the file (f.read()).

Victor
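As a small illustration of what the newline argument changes (behaviour only, this is not a timing test), the sketch below reads the same bytes with different newline settings. DATA and read_all() are made-up names for this example.

```python
import io

# Mixed line endings: CRLF, LF and a lone CR.
DATA = b"one\r\ntwo\nthree\rfour"


def read_all(newline):
    """Decode DATA through TextIOWrapper with the given newline setting."""
    stream = io.BytesIO(DATA)
    with io.TextIOWrapper(stream, encoding="ascii", newline=newline) as f:
        return f.read()


universal = read_all(None)    # universal newlines: endings become "\n"
untranslated = read_all("\n")  # only "\n" ends a line, nothing is translated
```

With newline=None every line ending is rewritten to "\n", which is the extra work discussed in the message above; with newline='\n' the text comes back exactly as decoded.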
Re: [Python-Dev] Deprecate codecs.open() and StreamWriter/StreamReader
On Wednesday 25 May 2011 at 15:43 +0200, M.-A. Lemburg wrote:
> For UTF-16 it would e.g. make sense to always read data in blocks
> with even sizes, removing the trial-and-error decoding and extra
> buffering currently done by the base classes. For UTF-32, the
> blocks should have size % 4 == 0.
>
> For UTF-8 (and other variable length encodings) it would make
> sense looking at the end of the (bytes) data read from the
> stream to see whether a complete code point was read or not,
> rather than simply running the decoder on the complete data
> set, only to find that a few bytes at the end are missing.

I think that the readahead algorithm is much faster than trying to avoid partial input, and it's not a problem to have partial input if you use an incremental decoder.

> For single character encodings, it would make sense to prefetch
> data in big chunks and skip all the trial and error decoding
> implemented by the base classes to address the above problem
> with variable length encodings.

TextIOWrapper implements this optimization using its readahead algorithm.

> That's somewhat unfair: TextIOWrapper is implemented in C,
> whereas the StreamReader/Writer subclasses used by the
> codecs are written in Python.
>
> A fair comparison would use the Python implementation of
> TextIOWrapper.

Do you mean that you would like to reimplement codecs in C? It is not relevant to compare codecs and _pyio, because codecs reuses BufferedReader (of the io module, not of the _pyio module), and io is the main I/O module of Python 3.

But well, as you want, here is a benchmark comparing:
    _pyio.TextIOWrapper(io.open(filename, 'rb'), encoding)
and
    codecs.open(filename, encoding)

The only change with my previous bench.py script is the test_io() function:

def test_io(test_func, chunk_size):
    with open(FILENAME, 'rb') as buffered:
        f = _pyio.TextIOWrapper(buffered, ENCODING)
        test_file(f, test_func, chunk_size)
        f.close()

(1) Decode Objects/unicodeobject.c (317336 characters) from utf-8

test_io.readline(): 1193.4 ms
test_codecs.readline(): 1267.9 ms
-> codecs 6% slower than io

test_io.read(1): 21696.4 ms
test_codecs.read(1): 36027.2 ms
-> codecs 66% slower than io

test_io.read(100): 3080.7 ms
test_codecs.read(100): 3901.7 ms
-> codecs 27% slower than io

test_io.read(): 3991.0 ms
test_codecs.read(): 1736.9 ms
-> codecs 130% FASTER than io

(2) Decode README (6613 characters) from ascii

test_io.readline(): 678.1 ms
test_codecs.readline(): 760.5 ms
-> codecs 12% slower than io

test_io.read(1): 13533.2 ms
test_codecs.read(1): 21900.0 ms
-> codecs 62% slower than io

test_io.read(100): 2663.1 ms
test_codecs.read(100): 3270.1 ms
-> codecs 23% slower than io

test_io.read(): 6769.1 ms
test_codecs.read(): 3919.6 ms
-> codecs 73% FASTER than io

(3) Decode Lib/test/cjkencodings/gb18030.txt (501 characters) from gb18030

test_io.readline(): 38.9 ms
test_codecs.readline(): 15.1 ms
-> codecs 157% FASTER than io

test_io.read(1): 369.8 ms
test_codecs.read(1): 302.2 ms
-> codecs 22% FASTER than io

test_io.read(100): 258.2 ms
test_codecs.read(100): 155.1 ms
-> codecs 67% FASTER than io

test_io.read(): 1803.2 ms
test_codecs.read(): 1002.9 ms
-> codecs 80% FASTER than io

_pyio.TextIOWrapper is faster than codecs.StreamReader for readline(), read(1) and read(100), with ASCII and UTF-8. It is slower for gb18030.

As in the io vs codecs benchmark, codecs.StreamReader is always faster than _pyio.TextIOWrapper for read().

Victor
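The remark that partial input is not a problem with an incremental decoder can be illustrated with the stdlib incremental decoder API. The following is a minimal sketch; decode_in_chunks() is a hypothetical helper, not code from bench.py or from TextIOWrapper.

```python
import codecs


def decode_in_chunks(data, encoding, chunk_size):
    """Decode bytes piecewise; the decoder buffers incomplete sequences."""
    decoder = codecs.getincrementaldecoder(encoding)()
    pieces = []
    for start in range(0, len(data), chunk_size):
        # A chunk may end in the middle of a multi-byte character;
        # the decoder keeps the leftover bytes until the next call.
        pieces.append(decoder.decode(data[start:start + chunk_size]))
    # final=True flushes the decoder and reports truncated input as an error.
    pieces.append(decoder.decode(b"", final=True))
    return u"".join(pieces)
```

Even with chunk_size=1, where almost every non-ASCII character is split across calls, the result should equal a one-shot data.decode(encoding).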
Re: [Python-Dev] Deprecate codecs.open() and StreamWriter/StreamReader
Victor Stinner wrote: > Le mercredi 25 mai 2011 à 11:38 +0200, M.-A. Lemburg a écrit : >> You are missing the point: we have StreamReader and StreamWriter APIs >> on codecs to allow each codecs to implement more efficient ways of >> encoding and decoding streams. >> >> Examples of such optimizations are reading the stream in >> chunks that can be decoded in one piece, or writing to the stream >> in a way that doesn't generate encoding state problems on the >> receiving end by ending transmission half-way through a >> shift block. >> >> ... >> >> We don't have many such specialized implementations in the stdlib, >> but this doesn't mean that there's no use for them. It >> just means that developers and users are simply unaware of the >> possibilities opened by these stateful stream APIs. > > Does at least one codec implement such implementation in its > StreamReader or StreamWriter class? And can't we implement such > optimization in incremental encoders and decoders (or in TextIOWrapper)? I don't see how, since you need control over the file API methods in order to implement such optimizations. OTOH, adding lots of special cases to TextIOWrapper isn't a good either, since these optimizations would then only trigger for a small number of codecs and completely leave out 3rd party codecs. > I checked all multibyte codecs (UTF and CJK codecs) and I don't see any > of such optimization. UTF codecs handle the BOM, but don't have anything > looking like an optimization. CJK codecs use multibytecodec, > MultibyteStreamReader and MultibyteStreamWriter, which don't look to be > optimized. But I missed maybe something? No, you haven't missed such per-codec optimizations. The base classes implement general purpose support for reading from streams in chunks, but the support isn't optimized per codec. For UTF-16 it would e.g. make sense to always read data in blocks with even sizes, removing the trial-and-error decoding and extra buffering currently done by the base classes. 
For UTF-32, the blocks should have size % 4 == 0.

For UTF-8 (and other variable-length encodings) it would make sense to look at the end of the (bytes) data read from the stream to see whether a complete code point was read, rather than simply running the decoder on the complete data set only to find that a few bytes at the end are missing.

For single-byte encodings, it would make sense to prefetch data in big chunks and skip all the trial-and-error decoding that the base classes implement to address the above problem with variable-length encodings.

Finally, all this could be implemented in C, reducing the Python call overhead dramatically.

> TextIOWrapper has an advanced buffer algorithm to prefetch (readahead)
> some bytes at each read to speed up small read. It is difficult to
> implement such algorithm, but it's done and it works.
>
> --
>
> Ok, let's stop to speak about theoretical optimizations, and let's do a
> benchmark to compare codecs and the io modules on reading files!

That's somewhat unfair: TextIOWrapper is implemented in C, whereas the StreamReader/Writer subclasses used by the codecs are written in Python.

A fair comparison would use the Python implementation of TextIOWrapper.

-- 
Marc-Andre Lemburg
eGenix.com
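The per-codec chunking ideas described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from the codecs module: `aligned_size` and `split_complete_utf8` are hypothetical helper names, and a real StreamReader would still need its decoder to buffer leftovers (an even-sized UTF-16 block can still end between the two halves of a surrogate pair).

```python
def aligned_size(size, unit):
    """Round a requested read size down to a multiple of the code unit size
    (2 for UTF-16, 4 for UTF-32), but never below one unit.
    Hypothetical helper, not part of the codecs module."""
    return max(unit, size - size % unit)


def split_complete_utf8(data):
    """Split bytes into (complete, tail), where tail is a trailing,
    incomplete UTF-8 sequence (possibly empty).
    Hypothetical helper: it only inspects the last few bytes and leaves
    validation of everything else to the real decoder."""
    # A UTF-8 sequence is at most 4 bytes long, so look back at most 4 bytes.
    for back in range(1, min(4, len(data)) + 1):
        byte = data[-back]
        if byte & 0xC0 == 0x80:
            continue  # continuation byte: keep looking for the lead byte
        if byte >> 5 == 0b110:
            needed = 2
        elif byte >> 4 == 0b1110:
            needed = 3
        elif byte >> 3 == 0b11110:
            needed = 4
        else:
            needed = 1  # ASCII, or an invalid lead byte left to the decoder
        if needed > back:
            # The last sequence is cut short: hold it back for the next read.
            return data[:-back], data[-back:]
        return data, b""
    return data, b""
```

A reader built on this would decode `complete` immediately and prepend `tail` to the next chunk it reads, instead of decoding the whole chunk and retrying on failure.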
Re: [Python-Dev] Deprecate codecs.open() and StreamWriter/StreamReader
On Wednesday 25 May 2011 at 11:38 +0200, M.-A. Lemburg wrote:
> You are missing the point: we have StreamReader and StreamWriter APIs
> on codecs to allow each codecs to implement more efficient ways of
> encoding and decoding streams.
>
> Examples of such optimizations are reading the stream in
> chunks that can be decoded in one piece, or writing to the stream
> in a way that doesn't generate encoding state problems on the
> receiving end by ending transmission half-way through a
> shift block.
>
> ...
>
> We don't have many such specialized implementations in the stdlib,
> but this doesn't mean that there's no use for them. It
> just means that developers and users are simply unaware of the
> possibilities opened by these stateful stream APIs.

Does at least one codec implement such an optimization in its StreamReader or StreamWriter class? And can't we implement such optimizations in the incremental encoders and decoders (or in TextIOWrapper)?

I checked all multibyte codecs (UTF and CJK codecs) and I don't see any such optimization. The UTF codecs handle the BOM, but don't have anything that looks like an optimization. The CJK codecs use multibytecodec, MultibyteStreamReader and MultibyteStreamWriter, which don't look optimized. But maybe I missed something?

TextIOWrapper has an advanced buffer algorithm to prefetch (read ahead) some bytes at each read to speed up small reads. Such an algorithm is difficult to implement, but it's done and it works.

--

Ok, let's stop talking about theoretical optimizations, and let's do a benchmark to compare the codecs and io modules on reading files!

I tested Python 3.3 (70370:178d367c9733) compiled in release mode (gcc -O3) on a Pentium4 @ 3 GHz with 2 GB of memory. I manually tuned the number of loops to ensure that the fastest test takes at least one second. I only ran my benchmark once. See the attached bench.py file.
(1) Decode Objects/unicodeobject.c (317336 characters) from utf-8

test_io.readline(): 89.6 ms
test_codecs.readline(): 1272.8 ms
-> codecs 1320% slower than io

test_io.read(1): 1728.9 ms
test_codecs.read(1): 36395.0 ms
-> codecs 2005% slower than io

test_io.read(100): 460.7 ms
test_codecs.read(100): 3897.0 ms
-> codecs 746% slower than io

test_io.read(-1): 1911.7 ms
test_codecs.read(-1): 1740.7 ms
-> codecs 10% FASTER than io

(2) Decode README (6613 characters) from ascii

test_io.readline(): 109.9 ms
test_codecs.readline(): 1023.8 ms
-> codecs 832% slower than io

test_io.read(1): 1560.4 ms
test_codecs.read(1): 29402.6 ms
-> codecs 1784% slower than io

test_io.read(100): 866.9 ms
test_codecs.read(100): 3699.5 ms
-> codecs 327% slower than io

test_io.read(-1): 5140.2 ms
test_codecs.read(-1): 4817.9 ms
-> codecs 7% FASTER than io

(3) Decode Lib/test/cjkencodings/gb18030.txt (501 characters) from gb18030

test_io.readline(): 1193.7 ms
test_codecs.readline(): 1474.3 ms
-> codecs 24% slower than io

test_io.read(1): 3847.7 ms
test_codecs.read(1): 27103.9 ms
-> codecs 604% slower than io

test_io.read(100): 12839.5 ms
test_codecs.read(100): 13444.2 ms
-> codecs 5% slower than io

test_io.read(-1): 2183.3 ms
test_codecs.read(-1): 1906.1 ms
-> codecs 15% FASTER than io

The readahead code really does help read(1): io is between 6 and 20 times faster than codecs. It also helps a more common use case, readline(): io is between 1.2 and 13 times faster than codecs.

codecs is always faster (between 1.07 and 1.15 times faster than io) when reading the whole content of a file using read(-1). Something should maybe be optimized in TextIOWrapper.read() ;-) But the gain is minor if you compare it to the gain on read(1) and readline()!

Please check my bench.py script and redo the benchmark on your own computer!
Victor

import codecs
import sys
import time

# Select the input file. Each assignment overrides the previous one, so
# only the last line (README) is active as written.
FILENAME = "Objects/unicodeobject.c"; FILESIZE = 317336; ENCODING = 'utf-8'; LOOPS=10
FILENAME = "Lib/test/cjkencodings/gb18030.txt"; FILESIZE = 501; ENCODING = 'gb18030'; LOOPS=200
FILENAME = "README"; FILESIZE = 6613; ENCODING = 'ascii'; LOOPS=400

def bench(loops, func, *args):
    # args is (test_func, chunk_size); unpack it so the names used in the
    # label below are defined.
    test_func, chunk_size = args
    t0 = time.time()
    for loop in range(loops):
        func(*args)
    dt = time.time() - t0
    text = "%s.%s" % (func.__name__, test_func)
    if chunk_size is not None:
        text += "(%s)" % chunk_size
    else:
        text += "()"
    print("%s: %.1f ms" % (text, dt * 1000))
    return dt

def test_file(f, test_func, chunk_size):
    # Call f.<test_func>() repeatedly until EOF and check the total size.
    size = 0
    func = getattr(f, test_func)
    while True:
        if chunk_size is not None:
            c = func(chunk_size)
        else:
            c = func()
        if not c:
            break
        size += len(c)
    assert size == FILESIZE, "%s != %s" % (size, FILESIZE)

def test_io(test_func, chunk_size):
    with open(FILENAME, encoding=ENCODING) as f:
        test_file(f, test_func, chunk_size)

def test_codecs(test_func, chunk_size):
    with codecs.open(FILENAME, 'r', encoding=ENCODING) as f:
        test_file(f, test_func, chunk_size)

print("Python %s" % sys.versi
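For readers who want to repeat the comparison without the CPython source tree, here is a minimal, self-contained sketch of the same kind of measurement. It is not the original bench.py: the helper names (`consume_lines`, `time_readline`, `compare`, `make_sample`) are made up for this example, it only covers readline(), and it uses `codecs.getreader()` to obtain the codec's StreamReader directly rather than going through codecs.open().

```python
import codecs
import os
import tempfile
import time


def consume_lines(stream):
    """Read the stream line by line; return the number of characters read."""
    size = 0
    while True:
        line = stream.readline()
        if not line:
            break
        size += len(line)
    return size


def time_readline(opener, loops):
    """Open and fully read the file `loops` times; return (size, seconds)."""
    size = 0
    t0 = time.perf_counter()
    for _ in range(loops):
        with opener() as stream:
            size = consume_lines(stream)
    return size, time.perf_counter() - t0


def compare(path, encoding, loops=5):
    """Time readline() through the io stack and through a codecs StreamReader."""
    def open_io():
        # newline="" keeps line endings untranslated, like the codecs reader.
        return open(path, encoding=encoding, newline="")

    def open_codecs():
        # The codec's StreamReader class, wrapped around a binary file.
        return codecs.getreader(encoding)(open(path, "rb"))

    return {
        "io": time_readline(open_io, loops),
        "codecs": time_readline(open_codecs, loops),
    }


def make_sample(text, encoding):
    """Write `text` to a temporary file and return its path (caller deletes it)."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(text.encode(encoding))
    return path
```

Calling `compare(path, "utf-8")` returns a dict with a `(characters, seconds)` pair per implementation. Absolute timings depend heavily on the machine and Python version, so no numbers are suggested here.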
Re: [Python-Dev] Deprecate codecs.open() and StreamWriter/StreamReader
Walter Dörwald wrote: > On 24.05.11 12:58, Victor Stinner wrote: >> Le mardi 24 mai 2011 à 12:42 +0200, Łukasz Langa a écrit : >>> Wiadomość napisana przez Walter Dörwald w dniu 2011-05-24, o godz. 12:16: >>> > I don't see which usecase is not covered by TextIOWrapper. But I know > some cases which are not supported by StreamReader/StreamWriter. This could be be partially fixed by implementing generic StreamReader/StreamWriter classes that reuse the incremental codecs, but I don't think thats worth it. >>> >>> Why not? >> >> We have already an implementation of this idea, it is called >> io.TextIOWrapper. > > Exactly. > > From another post by Victor: > >> As I wrote, codecs.open() is useful in Python 2. But I don't know any >> program or library using directly StreamReader or StreamWriter. > > So: implementing this is a lot of work, duplicates existing > functionality and is mostly unused. You are missing the point: we have StreamReader and StreamWriter APIs on codecs to allow each codecs to implement more efficient ways of encoding and decoding streams. Examples of such optimizations are reading the stream in chunks that can be decoded in one piece, or writing to the stream in a way that doesn't generate encoding state problems on the receiving end by ending transmission half-way through a shift block. Of course, you won't find many direct uses of these APIs, since most of the time, applications will simply use codecs.open() to automatically benefit from these optimizations. OTOH, TextIOWrapper doesn't know anything about specific encodings and thus does not allow for such optimizations to be implemented by codecs. We don't have many such specialized implementations in the stdlib, but this doesn't mean that there's no use for them. It just means that developers and users are simply unaware of the possibilities opened by these stateful stream APIs. 
-- 
Marc-Andre Lemburg
eGenix.com
Re: [Python-Dev] Deprecate codecs.open() and StreamWriter/StreamReader
On 24/05/2011, Victor Stinner wrote:
>
> In Python 2, codecs.open() is the best way to read and/or write files
> using Unicode. But in Python 3, open() is preferred with its fast io
> module. I would like to deprecate codecs.open() because it can be
> replaced by open() and io.TextIOWrapper. I would like your opinion and
> that's why I'm writing this email.

There are some modules that try to stay compatible with Python 2 and 3 without a source translation step. Removing the codecs classes would mean they'd have to add a few more compatibility hacks, but it could be done.

As an aside, I'm still not sure how the io module should be used. For example, a simple task I've used StreamWriter classes for is wrapping stdout. If stdout.encoding can't represent a character, using "replace" means you can write any unicode string without raising a UnicodeEncodeError. With the io module, it seems you need to construct a new TextIOWrapper object, passing the attributes of the old one as parameters, and as soon as someone passes something that's not a TextIOWrapper (say, a StringIO object) your code breaks. Is the intention that code dealing with streams needs to be covered in isinstance checks in Python 3?

Martin
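The stdout-wrapping task described above can be illustrated with both APIs side by side. This is a sketch, not a recommendation from the thread: the function names are invented for the example, and both helpers assume they are handed a binary stream (an `io.BytesIO` in tests, or something like `sys.stdout.buffer` in a real program).

```python
import codecs
import io


def lossy_writer_codecs(binary_stream, encoding):
    """codecs approach: a StreamWriter that replaces unencodable characters."""
    return codecs.getwriter(encoding)(binary_stream, errors="replace")


def lossy_writer_io(binary_stream, encoding):
    """io approach: a TextIOWrapper over the same kind of binary stream."""
    return io.TextIOWrapper(binary_stream, encoding=encoding, errors="replace")


if __name__ == "__main__":
    for factory in (lossy_writer_codecs, lossy_writer_io):
        raw = io.BytesIO()
        writer = factory(raw, "ascii")
        writer.write("caf\u00e9")
        writer.flush()  # TextIOWrapper buffers; StreamWriter writes through
        print(factory.__name__, raw.getvalue())
```

Both helpers need the underlying bytes stream. An in-memory text stream such as `io.StringIO` has no such layer to re-wrap, which is the situation the question above is about.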
Re: [Python-Dev] Deprecate codecs.open() and StreamWriter/StreamReader
On 5/24/2011 6:14 AM, M.-A. Lemburg wrote:

> I have no idea why TextIOWrapper was added to the stdlib
> instead of making StreamReaderWriter more capable,
> since StreamReaderWriter had already been available in Python
> since Python 1.6 (and this is being used by codecs.open()).

As I understand it, you (and others) wrote codecs long ago and recently other people wrote the new i/o stack, which sometimes uses codecs, and when they needed to add a few details, they 'naturally' added them to the module they were working on and understood (and planned to rewrite in C) rather than to the older module that they maybe did not completely understand and which is only in Python.

Then Victor comes along to do maintenance on some of the Asian codecs and discovers that he needs to make changes in two (or more?) places rather than one, which he naturally finds unsatisfactory.

> Perhaps we should deprecate TextIOWrapper instead and
> replace it with codecs.StreamReaderWriter ? ;-)

I think we should separate two issues: removing internal implementation duplication and removing external API duplication. I should think that the former should not be too controversial. The latter, I know, is more contentious.

One problem is that stdlib changes that perhaps 'should' have been made in 3.0/1 could not be discovered until the moratorium and greater focus on the stdlib.

-- 
Terry Jan Reedy
Re: [Python-Dev] Deprecate codecs.open() and StreamWriter/StreamReader
On 24.05.11 12:58, Victor Stinner wrote:
> On Tuesday 24 May 2011 at 12:42 +0200, Łukasz Langa wrote:
>> Message written by Walter Dörwald on 2011-05-24, at 12:16:
>>
>>>> I don't see which usecase is not covered by TextIOWrapper. But I know
>>>> some cases which are not supported by StreamReader/StreamWriter.
>>>
>>> This could be partially fixed by implementing generic
>>> StreamReader/StreamWriter classes that reuse the incremental codecs, but
>>> I don't think that's worth it.
>>
>> Why not?
>
> We have already an implementation of this idea, it is called
> io.TextIOWrapper.

Exactly.

From another post by Victor:

> As I wrote, codecs.open() is useful in Python 2. But I don't know any
> program or library using directly StreamReader or StreamWriter.

So: implementing this is a lot of work, duplicates existing functionality and is mostly unused.

Servus,
   Walter
Re: [Python-Dev] Deprecate codecs.open() and StreamWriter/StreamReader
On Tuesday 24 May 2011 at 12:42 +0200, Łukasz Langa wrote:
> Message written by Walter Dörwald on 2011-05-24, at 12:16:
>
> >> I don't see which usecase is not covered by TextIOWrapper. But I know
> >> some cases which are not supported by StreamReader/StreamWriter.
> >
> > This could be partially fixed by implementing generic
> > StreamReader/StreamWriter classes that reuse the incremental codecs, but
> > I don't think that's worth it.
>
> Why not?

We already have an implementation of this idea: it is called io.TextIOWrapper.

Victor
Re: [Python-Dev] Deprecate codecs.open() and StreamWriter/StreamReader
On Tue, 24 May 2011 12:16:49 +0200 Walter Dörwald wrote:
>
> > and so it's not possible to write a generic fix for
> > all child classes in the codecs module. Each stateful codec has to
> > handle special cases like seek() problems.
>
> Yes, which in theory makes it possible to implement shortcuts for
> certain codecs (e.g. the UTF-32-BE/LE codecs could simply multiply the
> character position by 4 to get the byte position). However AFAICR none
> of the readers/writers does that.

And in practice, TextIOWrapper.tell() does a similar optimization in a generic way. I'm linking to the Python implementation for readability:
http://hg.python.org/cpython/file/5c716437a83a/Lib/_pyio.py#l1741

TextIOWrapper.seek() is straightforward due to the structure of the integer "cookie" returned by TextIOWrapper.tell().

In practice, TextIOWrapper gets much more love than Stream{Reader,Writer} because it's an essential part of the new I/O stack. As Victor said, problems which Stream* have had for years are solved neatly in TextIOWrapper.

Therefore, leaving Stream{Reader,Writer} in is not a matter of "choice" and "freedom given to users". It's giving people the misleading possibility of using non-optimized, poorly debugged, less featureful implementations of the same basic idea (a unicode stream abstraction).

Regards

Antoine.
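The tell()/seek() "cookie" mentioned above can be exercised with a short example. This is a minimal sketch using an in-memory stream; `reread_from_cookie` is a name made up for illustration. The only thing it relies on is that the integer returned by TextIOWrapper.tell() can be passed back to seek() on the same stream, not that the integer is a character index.

```python
import io


def reread_from_cookie(raw_bytes, encoding, skip_chars):
    """Read `skip_chars` characters, remember the position, read the rest,
    then seek back to the remembered position and read the rest again."""
    text_stream = io.TextIOWrapper(io.BytesIO(raw_bytes), encoding=encoding)
    text_stream.read(skip_chars)
    cookie = text_stream.tell()  # opaque integer: treat it as a token
    first = text_stream.read()
    text_stream.seek(cookie)     # restores byte position and decoder state
    second = text_stream.read()
    return cookie, first, second


if __name__ == "__main__":
    sample = "h\u00e9llo w\u00f6rld \u20ac"
    for enc in ("utf-8", "utf-16"):
        cookie, first, second = reread_from_cookie(sample.encode(enc), enc, 3)
        print(enc, cookie, first == second)
```

Because the cookie carries decoder state as well as a byte offset, the same code works for a variable-length encoding and for one with a BOM, with no per-codec logic in the caller.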
Re: [Python-Dev] Deprecate codecs.open() and StreamWriter/StreamReader
On Tuesday 24 May 2011 at 08:16, Vinay Sajip wrote:
> > I opened an issue for this idea. Brett and Marc-Andree Lemburg don't
> > want to deprecate codecs.open() & friends because they want to be able
> > to write code working on Python 2 and on Python 3 without any change. I
> > don't think it's realistic: nontrivial programs require at least the six
> > module, and most likely the 2to3 program. The six module can have its
> > "codecs.open" function if codecs.open is removed from Python 3.4.
>
> What's "non-trivial"? Both pip and virtualenv (widely used programs) were
> ported to Python 3 using a single codebase for 2.x and 3.x, because it
> seemed to involve the least ongoing maintenance burden. Though these
> particular programs don't use codecs.open, I don't see much value in making
> it harder to write programs which can run under both 2.x and 3.x; that's
> not going to speed adoption of 3.x.

pip has a pip.backwardcompat module which is very similar to six. If codecs.open() is deprecated or removed, it will be trivial to add a wrapper for codecs.open() or open() to six and pip.backwardcompat.

virtualenv.py also starts with a thin compatibility layer.

But yes, each program using a compatibility layer/module will have to be updated.

Victor
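A compatibility wrapper of the kind mentioned above could be as small as the following sketch. It is not taken from six or pip.backwardcompat: `open_text` and `roundtrip` are hypothetical names, and the example assumes that io.open() (available in Python 2.6+ and identical to open() in Python 3) is an acceptable replacement for codecs.open() in the calling code.

```python
import io
import os
import tempfile


def open_text(filename, mode="r", encoding="utf-8", errors="strict"):
    """Hypothetical stand-in for codecs.open() built on the io module.

    newline="" turns off newline translation, which matches codecs.open()
    (it opens the underlying file in binary mode).
    """
    return io.open(filename, mode, encoding=encoding, errors=errors, newline="")


def roundtrip(text, encoding="utf-8"):
    """Write text to a temporary file and read it back through open_text()."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open_text(path, "w", encoding=encoding) as f:
            f.write(text)
        with open_text(path, "r", encoding=encoding) as f:
            return f.read()
    finally:
        os.remove(path)
```

One difference to keep in mind on Python 2: streams returned by io.open() accept only unicode objects in write(), whereas codecs.open() streams are more lenient with byte strings.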
Re: [Python-Dev] Deprecate codecs.open() and StreamWriter/StreamReader
Message written by Walter Dörwald on 2011-05-24, at 12:16:

>> I don't see which usecase is not covered by TextIOWrapper. But I know
>> some cases which are not supported by StreamReader/StreamWriter.
>
> This could be partially fixed by implementing generic
> StreamReader/StreamWriter classes that reuse the incremental codecs, but
> I don't think that's worth it.

Why not?

-- 
Best regards,
Łukasz Langa
Senior Systems Architecture Engineer
IT Infrastructure Department
Grupa Allegro Sp. z o.o.
Re: [Python-Dev] Deprecate codecs.open() and StreamWriter/StreamReader
On Tue, 24 May 2011 20:25:11 +1000 Nick Coghlan wrote:
>
> Just as PEP 302 defines how module importers should be written, PEP
> 100 defines how text codecs should be written (i.e. in terms of
> StreamReader and StreamWriter).
>
> PEP 3116 then defines how such codecs can be used as part of the
> overall I/O stack as redesigned for Python 3.

The I/O stack doesn't use StreamReader and StreamWriter. That's the whole point. Stream* have been made useless by the new I/O stack.

> Now, there may be an opportunity here to rationalise things a bit and
> re-use the *new* io module interfaces as the basis for an updated
> codec API PEP, but we shouldn't be hasty in deprecating an old API
> that is about "how to write codecs" just because it is similar to a
> shiny new one that is about "how to process I/O data".

Ok, can you explain the difference to us, concretely?

Thanks

Antoine.
Re: [Python-Dev] Deprecate codecs.open() and StreamWriter/StreamReader
On 24.05.11 02:08, Victor Stinner wrote: > [...] > codecs.open() and StreamReader, StreamWriter and StreamReaderWriter > classes of the codecs module don't support universal newlines, still > have some issues with stateful codecs (like UTF-16/32 BOMs), and each > codec has to implement a StreamReader and a StreamWriter class. > > StreamReader and StreamWriter are stateless codecs (no reset() or > setstate() method), They *are* stateful, they just don't expose their state to the public. > and so it's not possible to write a generic fix for > all child classes in the codecs module. Each stateful codec has to > handle special cases like seek() problems. Yes, which in theory makes it possible to implement shortcuts for certain codecs (e.g. the UTF-32-BE/LE codecs could simply multiply the character position by 4 to get the byte position). However AFAICR none of the readers/writers does that. > For example, UTF-16 codec > duplicates some IncrementalEncoder/IncrementalDecoder code into its > StreamWriter/StreamReader class. Actually it's the other way round: When I implemented the incremental codecs, I copied code from the StreamReader/StreamWriter classes. > The io module is well tested, supports non-seekable streams, handles > correctly corner-cases (like UTF-16/32 BOMs) and supports any kind of > newlines including an "universal newline" mode. TextIOWrapper reuses > incremental encoders and decoders, so BOM issues were fixed only once, > in TextIOWrapper. > > It's trivial to replace a call to codecs.open() by a call to open(), > because the two API are very close. The main different is that > codecs.open() doesn't support universal newline, so you have to use > open(..., newline='') to keep the same behaviour (keep newlines > unchanged). This task can be done by 2to3. But I suppose that most > people will be happy with the universal newline mode. > > I don't see which usecase is not covered by TextIOWrapper. 
> But I know some cases which are not supported by
> StreamReader/StreamWriter.

This could be partially fixed by implementing generic StreamReader/StreamWriter classes that reuse the incremental codecs, but I don't think that's worth it.

> [...]

Servus,
   Walter
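To make the "generic reader that reuses the incremental codecs" idea concrete, here is a deliberately tiny sketch. It is not a proposal from the thread and not how io.TextIOWrapper is implemented: `IncrementalReader` is a hypothetical class that supports only a read-everything call, with no readline(), seek() or newline handling.

```python
import codecs
import io


class IncrementalReader:
    """Hypothetical generic reader: works for any codec that provides an
    incremental decoder, without a per-codec StreamReader subclass."""

    def __init__(self, stream, encoding, errors="strict", chunk_size=4096):
        self.stream = stream
        self.decoder = codecs.getincrementaldecoder(encoding)(errors)
        self.chunk_size = chunk_size

    def read(self):
        """Decode the rest of the stream chunk by chunk and return the text."""
        parts = []
        while True:
            chunk = self.stream.read(self.chunk_size)
            if not chunk:
                # final=True makes the decoder report a truncated sequence.
                parts.append(self.decoder.decode(b"", final=True))
                break
            # The decoder keeps any incomplete trailing bytes internally.
            parts.append(self.decoder.decode(chunk))
        return "".join(parts)


if __name__ == "__main__":
    data = "h\u00e9llo w\u00f6rld \u20ac".encode("utf-8")
    # chunk_size=1 forces every multi-byte character to be split.
    print(IncrementalReader(io.BytesIO(data), "utf-8", chunk_size=1).read())
```

The buffering of partial sequences and the BOM handling live entirely in the incremental decoder, which is why a wrapper like this does not need to know anything about the encoding.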
Re: [Python-Dev] Deprecate codecs.open() and StreamWriter/StreamReader
On Tue, May 24, 2011 at 6:58 PM, Victor Stinner wrote:
> StreamReader, StreamWriter, TextIOWrapper and StreamReaderWriter all
> have a file-like API: tell(), seek(), read(), readline(), write(), etc.
> The implementation is maybe different, but the API is just the same, and
> so the usecases are just the same.
>
> I don't see in which case I should use StreamReader or StreamWriter
> instead TextIOWrapper. I thought that TextIOWrapper is specific to files
> on disk, but TextIOWrapper is already used for other usages like
> sockets.

Back up a step here. It's important to remember that the codecs module *long* predates the existence of the Python 3 I/O model and the io module in particular.

Just as PEP 302 defines how module importers should be written, PEP 100 defines how text codecs should be written (i.e. in terms of StreamReader and StreamWriter).

PEP 3116 then defines how such codecs can be used as part of the overall I/O stack as redesigned for Python 3.

Now, there may be an opportunity here to rationalise things a bit and re-use the *new* io module interfaces as the basis for an updated codec API PEP, but we shouldn't be hasty in deprecating an old API that is about "how to write codecs" just because it is similar to a shiny new one that is about "how to process I/O data".

Cheers,
Nick.

-- 
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
Re: [Python-Dev] Deprecate codecs.open() and StreamWriter/StreamReader
Victor Stinner wrote: > Le mardi 24 mai 2011 à 10:03 +0200, M.-A. Lemburg a écrit : >> Please read PEP 100 regarding StreamReader and StreamWriter. >> Those codecs parts were explicitly designed to be stateful, >> unlike the stateless encoder/decoder methods. > > Yes, it is possible to implement stateful StreamReader and StreamWriter > classes and we have such codecs (I gave the example of UTF-16), but the > state is not exposed (getstate / setstate), and so it's not possible to > write generic code to handle the codec state in the base StreamReader > and StreamWriter classes. io.TextIOWrapper requires encoder.setstate(0) > for example. So instead of always suggesting to deprecate everything, how about you come up with a proposal to add meaningful new methods to those base classes ? >> Each codec can, however, implement variants which are optimized >> for the specific encoding or intercept certain stream methods >> to add functionality or improve the encoding/decoding >> performance. > > Can you give me some examples? See the UTF-16 codec in the stdlib for example. This uses some of the available possibilities to interpret the BOM mark and then switches the encoder/decoder methods accordingly. A lot more could be done for other variable length encoding codecs, e.g. UTF-8, since these often have problems near the end of a read due to missing bytes. The base class implementation provides a general purpose implementation to cover the case, but it's not efficient, since it doesn't know anything about the encoding characteristics. Such an implementation would have to be done per codec and that's why we have per codec StreamReader/Writer APIs. >> TextIOWrapper and StreamReaderWriter are merely wrappers >> around streams that make use of the codecs. They don't >> provide any codec logic themselves. That's the conceptual >> difference. >> ... >> StreamReader and StreamWriters ... work efficiently and >> directly on streams rather than buffers. 
> > StreamReader, StreamWriter, TextIOWrapper and StreamReaderWriter all > have a file-like API: tell(), seek(), read(), readline(), write(), etc. > The implementation is maybe different, but the API is just the same, and > so the usecases are just the same. > > I don't see in which case I should use StreamReader or StreamWriter > instead TextIOWrapper. I thought that TextIOWrapper is specific to files > on disk, but TextIOWrapper is already used for other usages like > sockets. I have no idea why TextIOWrapper was added to the stdlib instead of making StreamReaderWriter more capable, since StreamReaderWriter had already been available in Python since Python 1.6 (and this is being used by codecs.open()). Perhaps we should deprecate TextIOWrapper instead and replace it with codecs.StreamReaderWriter ? ;-) Seriously, I don't see use of TextIOWrapper as an argument for removing StreamReader/Writer parts of the codecs API. >> Here's my reply from the ticket regarding using incremental >> encoders/decoders for the StreamReader/Writer parts of the >> codec set of APIs: >> >> """ >> The point about having them use incremental codecs for encoding and >> decoding is a good one and would >> need to be investigated. If possible, we could use incremental >> encoders/decoders for the standard >> StreamReader/Writer base classes or add new >> IncrementalStreamReader/Writer classes which then use >> the IncrementalEncode/Decoder per default. > > Why do you want to write a duplicate feature? TextIOWrapper is already > here, it's working and widely used. See above and please also try to understand why we have per-codec implementations for streams. I'm tired of repeating myself. I would much prefer to see the codec-specific functionality in TextIOWrapper added back to the codecs where it belongs. > I am working on codec issues (like CJK encodings, see #12100, #12057, > #12016) and I would like to remove StreamReader and StreamWriter to have > *less* code to maintain. 
>
> If you want to add more code, will be available to maintain it? It looks
> like you are busy, some people (not me ;-)) are still
> waiting .transform()/.untransform()!

I dropped the ball on the idea after the strong wave of comments against those methods. People will simply have to use codecs.encode() and codecs.decode().

-- 
Marc-Andre Lemburg
eGenix.com
Re: [Python-Dev] Deprecate codecs.open() and StreamWriter/StreamReader
Le mardi 24 mai 2011 à 10:03 +0200, M.-A. Lemburg a écrit : > Please read PEP 100 regarding StreamReader and StreamWriter. > Those codecs parts were explicitly designed to be stateful, > unlike the stateless encoder/decoder methods. Yes, it is possible to implement stateful StreamReader and StreamWriter classes and we have such codecs (I gave the example of UTF-16), but the state is not exposed (getstate / setstate), and so it's not possible to write generic code to handle the codec state in the base StreamReader and StreamWriter classes. io.TextIOWrapper requires encoder.setstate(0) for example. > Each codec can, however, implement variants which are optimized > for the specific encoding or intercept certain stream methods > to add functionality or improve the encoding/decoding > performance. Can you give me some examples? > TextIOWrapper and StreamReaderWriter are merely wrappers > around streams that make use of the codecs. They don't > provide any codec logic themselves. That's the conceptual > difference. > ... > StreamReader and StreamWriters ... work efficiently and > directly on streams rather than buffers. StreamReader, StreamWriter, TextIOWrapper and StreamReaderWriter all have a file-like API: tell(), seek(), read(), readline(), write(), etc. The implementation is maybe different, but the API is just the same, and so the usecases are just the same. I don't see in which case I should use StreamReader or StreamWriter instead TextIOWrapper. I thought that TextIOWrapper is specific to files on disk, but TextIOWrapper is already used for other usages like sockets. > Here's my reply from the ticket regarding using incremental > encoders/decoders for the StreamReader/Writer parts of the > codec set of APIs: > > """ > The point about having them use incremental codecs for encoding and > decoding is a good one and would > need to be investigated. 
If possible, we could use incremental > encoders/decoders for the standard > StreamReader/Writer base classes or add new > IncrementalStreamReader/Writer classes which then use > the IncrementalEncode/Decoder per default. Why do you want to write a duplicate feature? TextIOWrapper is already here, it's working and widely used. I am working on codec issues (like CJK encodings, see #12100, #12057, #12016) and I would like to remove StreamReader and StreamWriter to have *less* code to maintain. If you want to add more code, will be available to maintain it? It looks like you are busy, some people (not me ;-)) are still waiting .transform()/.untransform()! Victor ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
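The remark about exposed codec state (getstate/setstate, and encoder.setstate(0)) can be illustrated with the UTF-16 incremental encoder. This is a small sketch with invented function names. It assumes the behaviour of the stdlib UTF-16 incremental encoder, where state 0 means "the BOM has already been written", which is what lets a generic wrapper resume writing in the middle of a stream.

```python
import codecs
import sys

# The "utf-16" codec writes in the platform's native byte order after the BOM.
NATIVE_UTF16 = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"


def encode_fresh(text):
    """Encode with a brand-new encoder: the output starts with a BOM."""
    encoder = codecs.getincrementalencoder("utf-16")()
    return encoder.encode(text, final=True)


def encode_resumed(text):
    """Encode as if continuing an existing stream: setstate(0) tells the
    encoder that the BOM was already emitted, so none is written."""
    encoder = codecs.getincrementalencoder("utf-16")()
    encoder.setstate(0)
    return encoder.encode(text, final=True)


if __name__ == "__main__":
    print(encode_fresh("a"))
    print(encode_resumed("a"))
```

A StreamWriter has no such public hook, so code that appends to an existing UTF-16 file through it cannot tell the writer to skip the BOM in a generic way.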
Re: [Python-Dev] Deprecate codecs.open() and StreamWriter/StreamReader
On Tuesday 24 May 2011 at 08:16, Vinay Sajip wrote:
> So I would also want to keep codecs.open() and friends, at least for now

Well, I would agree to keep codecs.open() (if we patch it to reuse TextIOWrapper and add a note saying that it is kept for backward compatibility and that open() should be preferred in Python 3), but deprecate StreamReader, StreamWriter and EncodedFile.

As I wrote, codecs.open() is useful in Python 2. But I don't know of any program or library that uses StreamReader or StreamWriter directly.

I found some projects (e.g. twisted-mail, feeds2imap, pyflag, pygsm, ...) implementing their own Python codec (cool!), and each codec has its own StreamReader and StreamWriter class, but I don't think that these classes are used.

Victor
Re: [Python-Dev] Deprecate codecs.open() and StreamWriter/StreamReader
Victor Stinner writes:

> I opened an issue for this idea. Brett and Marc-Andree Lemburg don't
> want to deprecate codecs.open() & friends because they want to be able
> to write code working on Python 2 and on Python 3 without any change. I
> don't think it's realistic: nontrivial programs require at least the six
> module, and most likely the 2to3 program. The six module can have its
> "codecs.open" function if codecs.open is removed from Python 3.4.

What's "non-trivial"? Both pip and virtualenv (widely used programs) were ported to Python 3 using a single codebase for 2.x and 3.x, because it seemed to involve the least ongoing maintenance burden. Though these particular programs don't use codecs.open, I don't see much value in making it harder to write programs which can run under both 2.x and 3.x; that's not going to speed adoption of 3.x.

I find 2to3 very useful indeed for showing where changes may need to be made for 2.x/3.x portability, but I do not use it as an automatic conversion tool. The six module is very useful, too, but some projects won't necessarily want to add it as an additional dependency, and will reimplement just the parts they need from that bag of tricks.

So I would also want to keep codecs.open() and friends, at least for now, though it seems to make sense to implement them as wrappers (as Nick suggested).

Regards,

Vinay Sajip
Re: [Python-Dev] Deprecate codecs.open() and StreamWriter/StreamReader
Victor Stinner wrote:
> Hi,
>
> In Python 2, codecs.open() is the best way to read and/or write files
> using Unicode. But in Python 3, open() is preferred with its fast io
> module. I would like to deprecate codecs.open() because it can be
> replaced by open() and io.TextIOWrapper. I would like your opinion and
> that's why I'm writing this email.

I think you should have moved this part of your email further up, since it explains the reason why this idea was rejected for now:

> I opened an issue for this idea. Brett and Marc-Andre Lemburg don't
> want to deprecate codecs.open() & friends because they want to be able
> to write code working on Python 2 and on Python 3 without any change. I
> don't think it's realistic: nontrivial programs require at least the six
> module, and most likely the 2to3 program. The six module can have its
> "codecs.open" function if codecs.open is removed from Python 3.4.

And now for something completely different:

> codecs.open() and StreamReader, StreamWriter and StreamReaderWriter
> classes of the codecs module don't support universal newlines, still
> have some issues with stateful codecs (like UTF-16/32 BOMs), and each
> codec has to implement a StreamReader and a StreamWriter class.
>
> StreamReader and StreamWriter are stateless codecs (no reset() or
> setstate() method), and so it's not possible to write a generic fix for
> all child classes in the codecs module. Each stateful codec has to
> handle special cases like seek() problems. For example, UTF-16 codec
> duplicates some IncrementalEncoder/IncrementalDecoder code into its
> StreamWriter/StreamReader class.

Please read PEP 100 regarding StreamReader and StreamWriter. Those parts of the codecs were explicitly designed to be stateful, unlike the stateless encoder/decoder methods.

Please read my reply on the ticket:

"""
StreamReader and StreamWriter classes provide the base codec implementations for stateful interaction with streams.
They define the interface and provide a working implementation for those codecs that choose not to implement their own variants.

Each codec can, however, implement variants which are optimized for the specific encoding or intercept certain stream methods to add functionality or improve the encoding/decoding performance.

Both are essential parts of the codec interface.

TextIOWrapper and StreamReaderWriter are merely wrappers around streams that make use of the codecs. They don't provide any codec logic themselves. That's the conceptual difference.
"""

> The io module is well tested, supports non-seekable streams, correctly
> handles corner cases (like UTF-16/32 BOMs) and supports any kind of
> newlines including a "universal newline" mode. TextIOWrapper reuses
> incremental encoders and decoders, so BOM issues were fixed only once,
> in TextIOWrapper.
>
> It's trivial to replace a call to codecs.open() by a call to open(),
> because the two APIs are very close. The main difference is that
> codecs.open() doesn't support universal newline, so you have to use
> open(..., newline='') to keep the same behaviour (keep newlines
> unchanged). This task can be done by 2to3. But I suppose that most
> people will be happy with the universal newline mode.
>
> I don't see which usecase is not covered by TextIOWrapper. But I know
> some cases which are not supported by StreamReader/StreamWriter.

This is a misunderstanding of the concepts behind the two. StreamReaders and StreamWriters are implemented by the codecs; they are part of the API that each codec has to provide in order to register in the Python codecs system. Their purpose is to provide a stateful interface and work efficiently and directly on streams rather than buffers.
Here's my reply from the ticket regarding using incremental encoders/decoders for the StreamReader/Writer parts of the codec set of APIs:

"""
The point about having them use incremental codecs for encoding and decoding is a good one and would need to be investigated. If possible, we could use incremental encoders/decoders for the standard StreamReader/Writer base classes or add new IncrementalStreamReader/Writer classes which then use the IncrementalEncoder/Decoder by default.

Please open a new ticket for this.
"""

> StreamReader, StreamWriter, StreamReaderEncoder and EncodedFile are not
> used in the Python 3 standard library. I tried removing them: except
> tests of test_codecs which test them directly, the full test suite passes.
>
> Read the issue for more information: http://bugs.python.org/issue8796

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source (#1, May 24 2011)
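To make the distinction in this message concrete, the sketch below looks up one codec and decodes the same UTF-16 bytes (including a BOM) three ways: through the codec's own StreamReader, through its incremental decoder fed one byte at a time, and through io.TextIOWrapper. The variable names are arbitrary; the attributes on the object returned by codecs.lookup() are the standard CodecInfo fields. It is an illustration only and does not settle which layer ought to own the stateful logic.

```python
import codecs
import io

# Every registered codec exposes its stream and incremental classes here.
info = codecs.lookup("utf-16")

raw = "h\xe9llo".encode("utf-16")  # the generic utf-16 codec prepends a BOM

# 1. The codec's own StreamReader, working directly on a byte stream.
reader = info.streamreader(io.BytesIO(raw))
via_stream_reader = reader.read()

# 2. The codec's incremental decoder, fed one byte at a time; it has to
#    remember the BOM and any half-received code unit between calls.
decoder = info.incrementaldecoder()
pieces = [decoder.decode(raw[i:i + 1]) for i in range(len(raw))]
pieces.append(decoder.decode(b"", final=True))
via_incremental = "".join(pieces)

# 3. TextIOWrapper, a generic wrapper around a byte stream that decodes
#    using the named codec.
wrapper = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-16", newline="")
via_text_wrapper = wrapper.read()
```

All three paths should yield the same text for this input; the disagreement in the thread is about which of these interfaces a codec must provide, not about the decoded result.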
Re: [Python-Dev] Deprecate codecs.open() and StreamWriter/StreamReader
On Tuesday, 24 May 2011 at 15:24 +1000, Nick Coghlan wrote:
> On Tue, May 24, 2011 at 10:08 AM, Victor Stinner wrote:
> > It's trivial to replace a call to codecs.open() by a call to open(),
> > because the two APIs are very close. The main difference is that
> > codecs.open() doesn't support universal newline, so you have to use
> > open(..., newline='') to keep the same behaviour (keep newlines
> > unchanged). This task can be done by 2to3. But I suppose that most
> > people will be happy with the universal newline mode.
>
> Is there any reason that codecs.open() can't become a thin wrapper
> around builtin open in 3.3?

Yes, it's trivial to implement codecs.open using:

    def open(filename, mode='rb', encoding=None, errors='strict', buffering=1):
        return builtins.open(filename, mode, buffering,
                             encoding, errors, newline='')

But do we really need two ways to open a file? An extract from "import this": "There should be one-- and preferably only one --obvious way to do it."

Another example: Python 3.2 has subprocess.Popen, os.popen and platform.popen to open a subprocess. platform.popen is now deprecated in Python 3.3. Well, it's already better than Python 2.5, which has os.popen(), os.popen2(), os.popen3(), os.popen4(), os.spawnl(), os.spawnle(), os.spawnlp(), os.spawnlpe(), os.spawnv(), os.spawnve(), os.spawnvp(), os.spawnvpe(), subprocess.Popen, platform.popen and maybe others :-)

> How API compatible is TextIOWrapper with StreamReader/StreamWriter?

It's fully compatible.

> How hard would it be to change them to be adapters over the main IO
> machinery rather than independent classes?

I don't understand your proposal. We don't need StreamReader and StreamWriter to open a stream as a text file, only incremental decoders and encoders. Why do you want to keep them?

Victor
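A slightly fuller sketch of the thin-wrapper idea follows. It is not the implementation proposed in the thread: the name open_like_codecs is hypothetical, and the handling of the mode string is an assumption added here because the builtin open() raises ValueError when a binary mode is combined with encoding, errors or newline arguments. Like codecs.open(), it returns an ordinary file object when no encoding is given.

```python
import builtins
import os
import tempfile


def open_like_codecs(filename, mode="r", encoding=None, errors="strict",
                     buffering=-1):
    """Hypothetical codecs.open()-style wrapper over the builtin open()."""
    if encoding is None:
        # No codec requested: hand back a plain file object.
        return builtins.open(filename, mode, buffering)
    # Text layers cannot be combined with a binary mode, so drop any "b".
    text_mode = mode.replace("b", "")
    # newline="" disables newline translation in both directions, which
    # matches the "keep newlines unchanged" behaviour described above.
    return builtins.open(filename, text_mode, buffering, encoding, errors,
                         newline="")


# Round-trip a string with mixed line endings through a temporary file.
fd, path = tempfile.mkstemp()
os.close(fd)
try:
    with open_like_codecs(path, "wb", encoding="utf-8") as f:
        f.write("one\r\ntwo\n")
    with open_like_codecs(path, "rb", encoding="utf-8") as f:
        text = f.read()
    with builtins.open(path, "rb") as f:
        on_disk = f.read()
finally:
    os.remove(path)
```

The default buffering of -1 is a choice made for this sketch; the snippet in the message uses buffering=1, which in text mode selects line buffering.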
Re: [Python-Dev] Deprecate codecs.open() and StreamWriter/StreamReader
On Tue, May 24, 2011 at 10:08 AM, Victor Stinner wrote:
> It's trivial to replace a call to codecs.open() by a call to open(),
> because the two APIs are very close. The main difference is that
> codecs.open() doesn't support universal newline, so you have to use
> open(..., newline='') to keep the same behaviour (keep newlines
> unchanged). This task can be done by 2to3. But I suppose that most
> people will be happy with the universal newline mode.

Is there any reason that codecs.open() can't become a thin wrapper around builtin open in 3.3?

> I don't see which usecase is not covered by TextIOWrapper. But I know
> some cases which are not supported by StreamReader/StreamWriter.

How API compatible is TextIOWrapper with StreamReader/StreamWriter? How hard would it be to change them to be adapters over the main IO machinery rather than independent classes?

Rather than deprecating them, that seems like a more profitable direction to take them.

Cheers,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
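The newline difference mentioned in the quoted text can be seen without touching the filesystem. The sketch below decodes the same bytes with a codec StreamReader and with io.TextIOWrapper, first in its default universal-newline mode and then with newline=''. The sample bytes and variable names are arbitrary choices for this illustration.

```python
import codecs
import io

raw = b"first\r\nsecond\rthird\n"  # three different line endings

# StreamReader: decodes the bytes and leaves the line endings untouched.
stream_reader = codecs.getreader("utf-8")(io.BytesIO(raw))
from_stream_reader = stream_reader.read()

# TextIOWrapper with the default newline=None: universal newlines, so
# "\r\n" and "\r" are both translated to "\n" on input.
universal = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8").read()

# TextIOWrapper with newline="": no translation, same text as StreamReader.
untranslated = io.TextIOWrapper(
    io.BytesIO(raw), encoding="utf-8", newline=""
).read()
```

This only covers read(); how closely the rest of the two interfaces match is the question Nick raises above.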