[issue20132] Many incremental codecs don’t handle fragmented data

2016-08-19 Thread Martin Panter

Martin Panter added the comment:

Split off Issue 27799 for the base-64 incremental decoder

--
dependencies: +Fix base64-codec and bz2-codec incremental decoders

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue20132] Many incremental codecs don’t handle fragmented data

2015-07-22 Thread Martin Panter

Changes by Martin Panter :


--
dependencies: +Fix codecs.iterencode/decode() by allowing data parameter to be 
omitted, Stream encoder for zlib_codec doesn't use the incremental encoder, 
quopri module differences in quoted-printable text with whitespace, 
quopri_codec newline handling

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue20132] Many incremental codecs don’t handle fragmented data

2015-03-21 Thread Serhiy Storchaka

Changes by Serhiy Storchaka :


--
nosy: +serhiy.storchaka
stage:  -> patch review
versions: +Python 3.5 -Python 3.3

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue20132] Many incremental codecs don’t handle fragmented data

2015-01-24 Thread Martin Panter

Martin Panter added the comment:

Here is a new patch which fixes the bytes-to-bytes incremental codecs. It 
depends on my patches for these other issues being applied first:

* Issue 23231: Bytes-to-bytes support for iteren/decode()
* Issue 13881: Generic StreamWriter from IncrementalEncoder
* Issue 16473: Clarify and test quopri-codec implementation

In addition, without the fix for Issue 20121 (Quoted printable soft line 
breaking), the patch will apply, but there will be test failures.

Summary of the changes in this patch:

* Fix simple bz2-codec bug uncovered by new tests
* Implement stateful hex-codec IncrementalDecoder
* Add helpers for getstate() and setstate() to bijectively convert between 
arbitrary-length byte strings and integers
* Implement stateful base64-codec IncrementalEncoder; use it for the 
StreamWriter
* base64-codec IncrementalDecoder
* quopri-codec IncrementalEncoder, StreamWriter, IncrementalDecoder
* uu-codec IncrementalEncoder, StreamWriter, IncrementalDecoder
* Document that bytes-to-bytes StreamReader is not supported
* Document stateful raw-/unicode-escape decoding not supported
* Warn that stateful UTF-7 encoding may not be optimal, and that stateful UTF-7 
decoding may buffer unlimited input

This patch is actually generated from a series of patches folded together. If 
anyone is interested, I could try pushing the original series to a Bitbucket 
repository or something.

--
Added file: http://bugs.python.org/file37846/inc-codecs.diff

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue20132] Many incremental codecs don’t handle fragmented data

2015-01-16 Thread Martin Panter

Martin Panter added the comment:

There is a flaw with inheriting the readline() method in my patch, and I have 
decided to give up fixing the StreamReader classes. I did update the 
documentation in my copy of the patch based on Marc-Andre Lemburg’s feedback if 
anyone is interested in it, otherwise I am going to focus on some of the 
incremental codecs now.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue20132] Many incremental codecs don’t handle fragmented data

2015-01-15 Thread Martin Panter

Martin Panter added the comment:

My “master plan” is basically to make most of the bytes-to-bytes codecs work as 
documented in the incremental (stateful) modes. I’m less interested in fixing 
the text codecs, and the quopri and uu codecs might be too hard, so I was going 
to propose some documentation warnings for those.

If you have a suggestion on how to go about this better, let me know.

With my doc change to StreamReader, I wanted to document the different modes 
that I saw in the base codecs.StreamReader.read() implementation:

* read() or read(-1) reads everything
* read(size) returns an arbitrary amount of data
* read(size, chars) returns exactly *chars* length of data (unless EOF or 
similar)

Previously the case of read(-1, chars) was ambiguous. Also I did not find the 
description “an approximate maximum number of decoded bytes” very helpful, 
considering more than this maximum was read if necessary to get enough *chars*.

Regarding the end-of-stream behaviour, I made an assumption but I now realize 
it was wrong. Experimenting with the UTF-8 codec shows that its 
StreamReader.read() keeps returning "" when the underlying stream returns no 
data. But if it was in the middle of a multi-byte sequence, no “end of data” 
error is raised, and the multi-byte sequence can be completed if the underlying 
stream later returns more data. I think the lack of end-of-data checking should 
be documented.

The different cases of ValueError and UnicodeError that you describe make 
sense. I think the various references to ValueError and UnicodeError should be 
updated (or replaced with pointers) to match.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue20132] Many incremental codecs don’t handle fragmented data

2015-01-15 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

This addition is wrong as well:

The *stream* argument must be a file-like object open for reading
-   text or binary data, as appropriate for the specific codec.
+   text or binary data, as appropriate for the specific codec. This stream is
+   assumed to have ended as soon as a read from it returns no data.

Where did you get this idea from ?

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue20132] Many incremental codecs don’t handle fragmented data

2015-01-15 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

On 15.01.2015 23:46, Martin Panter wrote:
> 
> I opened Issue 23231 about fixing iterencode() and iterdecode() in the 
> general case. I added a patch to Issue 13881 to fix StreamWriter for zlib and 
> bz2, and to fix StreamWriter.writelines() in general.
> 
> I am adding a patch here to clarify the StreamReader API and fix the 
> StreamReader for the zlib-codec.

Martin, I must be missing some context: what's your master plan with
all these codec patches and open tickets ?

They are very hard to follow and you're making design changes which
need more review and discussion than can easily be done on a few tickets.

Regarding the patch:

The doc patch seems to just change ordering of sentences and paragraphs.
Without additional explanation, it's difficult to determine whether
you are changing semantics or not.

The "strict" error change is not correct. Most codecs raise a UnicodeError
(which is a ValueError subclass) and code expects codecs that work
on Unicode to return a UnicodeError, not a ValueError. Only codecs that
do not work on Unicode are allowed to raise ValueErrors.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue20132] Many incremental codecs don’t handle fragmented data

2015-01-15 Thread Martin Panter

Martin Panter added the comment:

I opened Issue 23231 about fixing iterencode() and iterdecode() in the general 
case. I added a patch to Issue 13881 to fix StreamWriter for zlib and bz2, and 
to fix StreamWriter.writelines() in general.

I am adding a patch here to clarify the StreamReader API and fix the 
StreamReader for the zlib-codec.

Plan of other things to do:

bz2 StreamReader: Should be implemented similar to the zlib patch above, after 
Issue 15955 is resolved and we have a max_length parameter to use. Or could be 
based on Bz2File now.

hex decoder: Shouldn’t be too hard to hack a stateful IncrementalDecoder that 
saves the leftover digit if given an odd number of digits. Create a generic 
codecs._IncrementalStreamReader class that uses an IncrementalDecoder and 
buffers unread decoded data, similar to my _IncrementalStreamWriter for Issue 
13881.

base64 encoder: IncrementalEncoder could encode in base64.MAXBINSIZE chunks

base64 decoder: IncrementalDecoder could strip non-alphabet characters using 
regular expressions, decode in multiples of four characters

quopri encoder: would require new implementation or major refactoring of quopri 
module

quopri decoder: check for incomplete trailing escape codes (=, =, =\r) and newlines (\r)

uu encoder: write header and trailer via uu module; encode using b2a_uu()

uu decoder: factor out header parsing from uu module; buffer and decode line by 
line based on encoded length

unicode-escape, raw-unicode-escape: Stateful decoding would probably require a 
new -Stateful() function at the C level, though it might be easy to build from 
the existing implementation. I suggest documenting that stateful decoding is 
not supported for the time being.

utf-7: As Walter said, proper stateful codec is not supported by the C API, 
despite PyUnicode_DecodeUTF7Stateful(); doing so would probably require 
significant changes. I suggest documenting a warning for stateful mode 
(including with TextIOWrapper) about suboptimal encoding and unlimited data 
buffering for the time being.

punycode, unicode_internal: According to test_codecs.py, these also don’t work 
in stateful mode. Not sure on the details though.

--
keywords: +patch
Added file: http://bugs.python.org/file37718/zlib-reader.patch

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue20132] Many incremental codecs don’t handle fragmented data

2014-12-17 Thread Martin Panter

Martin Panter added the comment:

The “unicode-escape” and “utf-7” cases affect the more widely-used 
TextIOWrapper interface:

>>> TextIOWrapper(BytesIO(br"\u2013" * 2000), "unicode-escape").read(1)
Traceback (most recent call last):
  File "", line 1, in 
  File "/usr/lib/python3.4/encodings/unicode_escape.py", line 26, in decode
return codecs.unicode_escape_decode(input, self.errors)[0]
UnicodeDecodeError: 'unicodeescape' codec can't decode bytes in position 
8190-8191: truncated \u escape
>>> w = TextIOWrapper(BytesIO(), "utf-7")
>>> w.writelines("\xA9\xA9")  # Write one character at a time
>>> w.detach().getvalue()
b'+AKk-+AKk-'
>>> r = TextIOWrapper(BytesIO(b"+" + b"" * 10 + b"-"), "utf-7")
>>> r.read(1)  # Short delay as all 800 kB are decoded to read one character
'\x00'
>>> r.buffer.tell()
82

For UTF-7 decoding to work optimally I think the amount of data buffering 
necessary would be limited to only a few bytes.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue20132] Many incremental codecs don’t handle fragmented data

2014-09-09 Thread Martin Panter

Martin Panter added the comment:

Stream reader interfaces suffer the same problem. Test cases:

codecs.getreader("unicode-escape")(BytesIO(br"\u2013")).read(1)
codecs.getreader("hex-codec")(BytesIO(b"33")).read(1)
codecs.getreader("base64-codec")(BytesIO(b"AA==")).read(1)
TestCase().assertEqual(b"=", 
codecs.getreader("quopri-codec")(BytesIO(b"=3D")).read(1))

Even though the “zlib” incremental decoder is okay, its stream reader still 
suffers this problem. I was going to check for zip bomb behaviour when you call 
read() with a limited number of input “characters”, but got this instead:

>>> codecs.getreader("zlib-codec")(bomb).read(1, 1)
zlib.error: Error -5 while decompressing data: incomplete or truncated stream

Similar problems with stream writers, for instance:

>>> w = codecs.getwriter("base64-codec")(BytesIO())
>>> w.write(b"1"); w.write(b"2"); w.reset(); w.getvalue()
b'MQ==\nMg==\n'  # Should be b"MTI=\n"

I also noticed StreamWriter.writelines() assumes it is encoding text, and does 
not work at least with “base64-codec”:

 codecs.getwriter("base64-codec")(BytesIO()).writelines((b"1", b"2"))
TypeError: sequence item 0: expected str instance, bytes found

--
versions: +Python 3.4

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue20132] Many incremental codecs don’t handle fragmented data

2014-01-10 Thread Martin Panter

Martin Panter added the comment:

I think calling iterencode() with an empty iterator is a side issue. Even with 
a non-empty iterator, it tries to encode an empty _text_ string to finalise the 
encoder:

>>> bytes().join(codecs.iterencode(iter((b"data",)), "base64-codec"))
Traceback (most recent call last):
  File "", line 1, in 
  File "/usr/lib/python3.3/codecs.py", line 1014, in iterencode
output = encoder.encode("", True)
  File "/usr/lib/python3.3/encodings/base64_codec.py", line 31, in encode
return base64.encodebytes(input)
  File "/usr/lib/python3.3/base64.py", line 343, in encodebytes
raise TypeError("expected bytes, not %s" % s.__class__.__name__)
TypeError: expected bytes, not str

Similarly, iterdecode(encoding="rot-13") doesn’t work. I agree it would be good 
to document that iterencode() is limited to text encoders and iterdecode() is 
limited to byte decoders.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue20132] Many incremental codecs don’t handle fragmented data

2014-01-10 Thread Walter Dörwald

Walter Dörwald added the comment:

The best solution IMHO would be to implement real incremental codecs for all of 
those.

Maybe iterencode() with an empty iterator should never call encode()? (But IMHO 
it would be better to document that iterencode()/iterdecode() should only be 
used with "real" codecs.)

Note that the comment before PyUnicode_DecodeUTF7Stateful() in unicodeobject.c 
reads:

/* The decoder.  The only state we preserve is our read position,
 * i.e. how many characters we have consumed.  So if we end in the
 * middle of a shift sequence we have to back off the read position
 * and the output to the beginning of the sequence, otherwise we lose
 * all the shift state (seen bits, number of bits seen, high
 * surrogate). */

Changing that would have to introduce a state object that the codec updates and 
from which it can be restarted.

Also the encoder does not buffer anything. To implement the suggested 
behaviour, the encoder might have to buffer unlimited data.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue20132] Many incremental codecs don’t handle fragmented data

2014-01-05 Thread Serhiy Storchaka

Changes by Serhiy Storchaka :


--
nosy: +doerwalter, haypo, lemburg, loewis, ncoghlan

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue20132] Many incremental codecs don’t handle fragmented data

2014-01-05 Thread Martin Panter

New submission from Martin Panter:

Many of the incremental codecs do not handle fragmented data very well. In the 
past I think I was interested in using the Base-64 and Quoted-printable codecs, 
and playing with other codecs today reveals many more issues. A lot of the 
issues reflect missing functionality, so maybe the simplest solution would be 
to document the codecs that don’t work.

Incremental decoding issues:

>>> str().join(codecs.iterdecode(iter((b"\\", b"u2013")), "unicode-escape"))
UnicodeDecodeError: 'unicodeescape' codec can't decode byte 0x5c in position 0: 
\ at end of string
# Same deal for raw-unicode-escape.

>>> bytes().join(codecs.iterdecode(iter((b"3", b"3")), "hex-codec"))
binascii.Error: Odd-length string

>>> bytes().join(codecs.iterdecode(iter((b"A", b"A==")), "base64-codec"))
binascii.Error: Incorrect padding

>>> bytes().join(codecs.iterdecode(iter((b"=", b"3D")), "quopri-codec"))
b'3D'  # Should return b"="

>>> codecs.getincrementaldecoder("uu-codec")().decode(b"begin ")
ValueError: Truncated input data

Incremental encoding issues:

>>> e = codecs.getincrementalencoder("base64-codec")(); 
>>> codecs.decode(e.encode(b"1") + e.encode(b"2", final=True), "base64-codec")
b'1'  # Should be b"12"

>>> e = codecs.getincrementalencoder("quopri-codec")(); e.encode(b"1" * 50) + 
>>> e.encode(b"2" * 50, final=True)
b'1122'
# I suspect the line should have been split in two

>>> e = codecs.getincrementalencoder("uu-codec")(); 
>>> codecs.decode(e.encode(b"1") + e.encode(b"2", final=True), "uu-codec")
b'1'  # Should be b"12"

I also noticed iterdecode() does not work with “uu-codec”:

>>> bytes().join(codecs.iterdecode(iter((b"begin 666 \n \nend\n",)), 
>>> "uu-codec"))
Traceback (most recent call last):
  File "", line 1, in 
  File "/usr/lib/python3.3/codecs.py", line 1032, in iterdecode
output = decoder.decode(b"", True)
  File "/usr/lib/python3.3/encodings/uu_codec.py", line 80, in decode
return uu_decode(input, self.errors)[0]
  File "/usr/lib/python3.3/encodings/uu_codec.py", line 45, in uu_decode
raise ValueError('Missing "begin" line in input data')
ValueError: Missing "begin" line in input data

And iterencode() does not work with any of the byte encoders, because it does 
not know what kind of empty string to pass to 
IncrementalEncoder.encode(final=True):

>>> bytes().join(codecs.iterencode(iter(()), "base64-codec"))
Traceback (most recent call last):
  File "", line 1, in 
  File "/usr/lib/python3.3/codecs.py", line 1014, in iterencode
output = encoder.encode("", True)
  File "/usr/lib/python3.3/encodings/base64_codec.py", line 31, in encode
return base64.encodebytes(input)
  File "/usr/lib/python3.3/base64.py", line 343, in encodebytes
raise TypeError("expected bytes, not %s" % s.__class__.__name__)
TypeError: expected bytes, not str

Finally, incremental UTF-7 encoding is suboptimal, and the decoder seems to 
buffer unlimited data, both defeating the purpose of using an incremental codec:

>>> bytes().join(codecs.iterencode("\xA9" * 2, "utf-7"))
b'+AKk-+AKk-'  # b"+AKkAqQ-" would be better

>>> d = codecs.getincrementaldecoder("utf-7")()
>>> d.decode(b"+")
b''
>>> any(d.decode(b"" * 10) for _ in range(10))
False  # No data returned: everything must be buffered
>>> d.decode(b"-") == "\x00" * 300
True  # Returned all buffered data in one go

--
components: Library (Lib)
messages: 207374
nosy: vadmium
priority: normal
severity: normal
status: open
title: Many incremental codecs don’t handle fragmented data
type: behavior
versions: Python 3.3

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com