[issue41671] inspect.getdoc/.cleandoc doesn't always remove trailing blank lines
New submission from RalfM:

Python 3.8.5 (tags/v3.8.5:580fbb0, Jul 20 2020, 15:57:54) [MSC v.1924 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import inspect
>>> def func1():
...     """This is func1.
...     """
...     pass
...
>>> inspect.getdoc(func1)
'This is func1.\n'
>>>
>>> def func2():
...     """Line1
...        Line2 
...            
...        """
...
>>> inspect.getdoc(func2)
'Line1\nLine2 \n\n'

Note: The blank line between "Line2 " and the closing """ contains 11 spaces.

The algorithm given in PEP 257 returns what I would expect, i.e. 'This is func1.' and 'Line1\nLine2' respectively.

Strictly speaking, inspect.cleandoc doesn't claim to implement PEP 257. However, the code of inspect.cleandoc contains the comment "# Remove any trailing or leading blank lines.", and this is obviously not done.

Looking at the code, the reason seems to be twofold:

1. When removing the indentation, PEP 257 also does a .rstrip() on the lines; inspect.cleandoc doesn't. As a consequence, in inspect.cleandoc trailing lines with many spaces still contain spaces after the indentation has been removed, so they are not empty and the "while lines and not lines[-1]" loop doesn't remove them. That explains func2 above.

2. If all lines but the first are blank (as in func1 above), indent / margin will be sys.maxint / sys.maxsize and no indentation will be removed. PEP 257 copies dedented lines to a new list. If no indentation needs to be removed, nothing but the first line is copied, and so the trailing lines are gone. inspect.cleandoc dedents lines in place. If no indentation needs to be removed, the trailing lines with spaces remain and, as they contain spaces, the "while lines and not lines[-1]" loop doesn't remove them.

There is another difference between PEP 257 and inspect.cleandoc: PEP 257 removes trailing whitespace on every line, whereas inspect.cleandoc preserves it. I don't know whether that's intentional.
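The two points above can be sketched in code. The function below is adapted from the trimming example in PEP 257 (the name pep257_trim and the use of sys.maxsize are choices made here), and doc1 / doc2 are hypothetical raw docstrings standing in for func1 and func2; their exact indentation widths are assumptions. inspect.cleandoc is printed only for comparison, since its result may differ between Python versions.

```python
import inspect
import sys


def pep257_trim(docstring):
    """Trim a docstring along the lines of the PEP 257 example."""
    if not docstring:
        return ""
    lines = docstring.expandtabs().splitlines()
    # Smallest indentation among the non-blank lines after the first one.
    indent = sys.maxsize
    for line in lines[1:]:
        stripped = line.lstrip()
        if stripped:
            indent = min(indent, len(line) - len(stripped))
    # Copy into a NEW list: later lines are dedented and right-stripped,
    # and are not copied at all when there is nothing to dedent.
    trimmed = [lines[0].strip()]
    if indent < sys.maxsize:
        for line in lines[1:]:
            trimmed.append(line[indent:].rstrip())
    # Drop trailing and leading blank lines.
    while trimmed and not trimmed[-1]:
        trimmed.pop()
    while trimmed and not trimmed[0]:
        trimmed.pop(0)
    return "\n".join(trimmed)


# Assumed raw docstrings (indentation widths are illustrative only).
doc1 = "This is func1.\n    "
doc2 = "Line1\n       Line2 \n           \n       "

for doc in (doc1, doc2):
    print("PEP 257 :", repr(pep257_trim(doc)))
    # Version-dependent; shown for comparison only.
    print("cleandoc:", repr(inspect.cleandoc(doc)))
```

Because the dedented lines are right-stripped and copied to a new list, whitespace-only trailing lines become empty (or are never copied) and the final loop can drop them.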
I see this behaviour in 3.7 and 3.8, and the inspect.cleandoc code is unchanged in 3.9.0rc1.

--
components: Library (Lib)
messages: 376136
nosy: RalfM
priority: normal
severity: normal
status: open
title: inspect.getdoc/.cleandoc doesn't always remove trailing blank lines
type: behavior
versions: Python 3.7, Python 3.8, Python 3.9

___
Python tracker <https://bugs.python.org/issue41671>
___
___
Python-bugs-list mailing list
Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue24214] Exception with utf-8, surrogatepass and incremental decoding
RalfM added the comment:

I just tested Python 3.6.0a3, and that (mis)behaves exactly like 3.4.3.

--
versions: +Python 3.6

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue24214>
___
[issue24214] Exception with utf-8, surrogatepass and incremental decoding
New submission from RalfM:

I have a UTF-8 encoded file containing single surrogates.
Reading this file, specifying surrogatepass, works fine when I read the whole file with .read(), but raises a UnicodeDecodeError when I read the file line by line:

- start of demo -
Python 3.4.3 (v3.4.3:9b73f1c3e601, Feb 24 2015, 22:44:40) [MSC v.1600 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> with open("Demo.txt", encoding="utf-8", errors="surrogatepass") as f:
...     s = f.read()
...
>>> "\ud900" in s
True
>>> with open("Demo.txt", encoding="utf-8", errors="surrogatepass") as f:
...     for line in f:
...         pass
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "C:\Python\34x64\lib\codecs.py", line 319, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 8190: invalid continuation byte
- end of demo -

I attached the file used for the demo so that you can reproduce the problem.

If I change all 0xED bytes in the file to 0xEC (i.e. effectively change all surrogates to non-surrogates), the problem disappears.

The original file I noticed the problem with was 73 MB. The demo file was derived from the original by removing data around the critical section, keeping the alignment to 16-k blocks, and then replacing all printable ASCII characters with x.

If I change the file length by adding or removing 16 bytes to / from the beginning of the demo file, the problem disappears, so alignment seems to be an issue.

All this seems to indicate that the utf-8 decoder has problems when it is used for incremental decoding and a single surrogate appears around the block boundary.
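A minimal sketch of the suspected situation, without the attached file: a short string containing a lone surrogate is encoded with surrogatepass and then fed to an incremental UTF-8 decoder in two chunks, with the chunk boundary placed inside the surrogate's three-byte sequence. The helper name decode_in_chunks, the sample text and the split position are assumptions made for illustration; whether the chunked decode raises depends on the Python version in use.

```python
import codecs


def decode_in_chunks(data, split_at, errors="surrogatepass"):
    """Decode data in two chunks, as a file read block by block would."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors=errors)
    first = decoder.decode(data[:split_at])
    rest = decoder.decode(data[split_at:], final=True)
    return first + rest


text = "x\ud900y"                               # lone (high) surrogate
data = text.encode("utf-8", "surrogatepass")    # b'x\xed\xa4\x80y'

# Decoding the whole buffer at once, like f.read().
whole = data.decode("utf-8", "surrogatepass")

# Chunk boundary after b'\xed\xa4', i.e. inside the surrogate's bytes.
try:
    chunked = decode_in_chunks(data, split_at=3)
except UnicodeDecodeError as exc:
    chunked = None
    print("incremental decoding failed:", exc)

print(repr(whole), repr(chunked))
```

Moving split_at so that the boundary no longer falls inside the three-byte sequence corresponds to shifting the file contents by a few bytes, as described above.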
--
components: Unicode
files: Demo.txt
messages: 243376
nosy: RalfM, ezio.melotti, haypo
priority: normal
severity: normal
status: open
title: Exception with utf-8, surrogatepass and incremental decoding
type: behavior
versions: Python 3.4
Added file: http://bugs.python.org/file39400/Demo.txt

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue24214
___
[issue20413] Errors in documentation of standard codec error handlers
New submission from RalfM:

The standard library documentation lists the standard codec error handlers in three places:
(a) 2. Built-in Functions, section open()
(b) 7.2 codecs - Codec registry and base classes
(c) 7.2.1 Codec Base Classes

As far as I can judge these lists, (c) looks ok, but (a) and (b) contain two errors:

1. 'surrogatepass' is not mentioned.

2. 'surrogateescape' is described as:
'on decoding, replace with code points in the Unicode Private Use Area ranging from U+DC80 to U+DCFF. These private code points will ...'
This is incorrect in that U+DC80 to U+DCFF are not private code points, but (low-)surrogate code points. This is correctly explained in (c) and in PEP 383 (and, of course, in the Unicode standard, chapter 16).

I suggest correcting (a) and (b) by
* adding 'surrogatepass' with the description given in (c),
* changing the description of 'surrogateescape' to something like:
'on decoding, replace with surrogate code points ranging from U+DC80 to U+DCFF. These surrogate code points will ...'.

These errors are present in the documentation (more precisely, the .chm files) of at least
- Python 3.3.3
- Python 3.3.4rc1
- Python 3.4.0b3.

--
assignee: docs@python
components: Documentation
messages: 209477
nosy: RalfM, docs@python
priority: normal
severity: normal
status: open
title: Errors in documentation of standard codec error handlers
type: enhancement
versions: Python 3.3, Python 3.4

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue20413
___
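The point about surrogateescape in issue20413 can be checked from Python itself. The sketch below decodes two bytes that are invalid in UTF-8 and looks up the Unicode general category of the resulting code points; the sample bytes and variable names are chosen here for illustration. Category "Cs" means surrogate, while "Co" is the category of Private Use Area code points such as U+E000.

```python
import unicodedata

raw = b"abc\x80\xff"  # 0x80 and 0xFF cannot be decoded as UTF-8

# surrogateescape maps each undecodable byte 0xNN to U+DCNN.
text = raw.decode("utf-8", errors="surrogateescape")
escaped = text[3:]

categories = [unicodedata.category(ch) for ch in escaped]
private_use = unicodedata.category("\ue000")  # a genuine private-use code point

# Encoding with the same handler restores the original bytes.
roundtrip = text.encode("utf-8", errors="surrogateescape")

print([hex(ord(ch)) for ch in escaped], categories, private_use)
```

The escaped characters fall in U+DC80 to U+DCFF and are reported as surrogates, not as private-use code points, which is the wording change the report asks for.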