[issue41671] inspect.getdoc/.cleandoc doesn't always remove trailing blank lines

2020-08-30 Thread RalfM


New submission from RalfM :

Python 3.8.5 (tags/v3.8.5:580fbb0, Jul 20 2020, 15:57:54) [MSC v.1924 64 bit 
(AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import inspect
>>> def func1():
... """This is func1.
... """
... pass
...
>>> inspect.getdoc(func1)
'This is func1.\n'
>>>
>>> def func2():
... """Line1
...Line2 
...
..."""
...
>>> inspect.getdoc(func2)
'Line1\nLine2 \n\n'

Note: The blank line between "Line2 " and the closing """ contains 11 spaces.

The algorithm given in PEP 257 returns what I would expect, i.e. 
'This is func1.'
and
'Line1\nLine2'
respectively.

Strictly speaking, inspect.cleandoc doesn't claim to implement PEP 257.
However, there is a comment "# Remove any trailing or leading blank lines." in 
the code of inspect.cleandoc, and this is obviously not done.

Looking at the code, the reason seems to be twofold:

1. When removing the indentation, PEP 257 also calls .rstrip() on each line; 
inspect.cleandoc doesn't.
As a consequence, in inspect.cleandoc a trailing line with more spaces than the 
common indentation still contains spaces after the indentation has been removed, 
so it is not empty and the "while lines and not lines[-1]" loop doesn't remove it.
That explains func2 above.

2. If all lines but the first are blank (as in func1 above), indent / margin 
stays at sys.maxint / sys.maxsize and no indentation is removed.
PEP 257 copies the dedented lines to a new list. If no indentation needs to be 
removed, nothing but the first line is copied, and so the trailing lines 
are gone.
inspect.cleandoc dedents the lines in place. If no indentation needs to be 
removed, the trailing lines with spaces remain and, as they contain spaces, the 
"while lines and not lines[-1]" loop doesn't remove them.

There is another difference between PEP 257 and inspect.cleandoc: PEP 257 
removes trailing whitespace on every line; inspect.cleandoc preserves it.
I don't know whether that's intentional.
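
The comparison above can be reproduced with a small script. This is only a sketch: pep257_trim is a Python 3 adaptation of the trim() function from PEP 257 (sys.maxint replaced by sys.maxsize), and the two docstrings assume 4 spaces of indentation inside the function bodies, which the transcript above does not show. What inspect.cleandoc prints depends on the Python version in use, so no output is claimed for it here.

```python
import inspect
import sys


def pep257_trim(docstring):
    """Python 3 adaptation of the trim() function shown in PEP 257."""
    if not docstring:
        return ''
    lines = docstring.expandtabs().splitlines()
    # Smallest indentation of the non-blank lines after the first one.
    indent = sys.maxsize  # PEP 257 itself uses sys.maxint (Python 2)
    for line in lines[1:]:
        stripped = line.lstrip()
        if stripped:
            indent = min(indent, len(line) - len(stripped))
    # Copy into a new list; dedented lines are also rstrip()ped.
    trimmed = [lines[0].strip()]
    if indent < sys.maxsize:
        for line in lines[1:]:
            trimmed.append(line[indent:].rstrip())
    # Remove trailing and leading blank lines.
    while trimmed and not trimmed[-1]:
        trimmed.pop()
    while trimmed and not trimmed[0]:
        trimmed.pop(0)
    return '\n'.join(trimmed)


# Assumed raw docstrings: 4 spaces of indentation, 11 spaces on the blank line.
doc1 = "This is func1.\n    "
doc2 = "Line1\n    Line2 \n" + " " * 11 + "\n    "

for doc in (doc1, doc2):
    print(repr(inspect.cleandoc(doc)), "vs", repr(pep257_trim(doc)))
```

With these inputs, pep257_trim returns 'This is func1.' and 'Line1\nLine2', matching the expectation stated earlier in this report.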

I see this behaviour in 3.7 and 3.8, and the inspect.cleandoc code is unchanged 
in 3.9.0rc1.

--
components: Library (Lib)
messages: 376136
nosy: RalfM
priority: normal
severity: normal
status: open
title: inspect.getdoc/.cleandoc doesn't always remove trailing blank lines
type: behavior
versions: Python 3.7, Python 3.8, Python 3.9

___
Python tracker 
<https://bugs.python.org/issue41671>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue24214] Exception with utf-8, surrogatepass and incremental decoding

2016-07-26 Thread RalfM

RalfM added the comment:

I just tested Python 3.6.0a3, and that (mis)behaves exactly like 3.4.3.

--
versions: +Python 3.6

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue24214>
___



[issue24214] Exception with utf-8, surrogatepass and incremental decoding

2015-05-16 Thread RalfM

New submission from RalfM:

I have a UTF-8 encoded file containing single surrogates. Reading this file, 
specifying surrogatepass, works fine when I read the whole file with .read(), 
but raises a UnicodeDecodeError when I read the file line by line:

- start of demo -
Python 3.4.3 (v3.4.3:9b73f1c3e601, Feb 24 2015, 22:44:40) [MSC v.1600 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> with open("Demo.txt", encoding="utf-8", errors="surrogatepass") as f:
...   s = f.read()
...
>>> "\ud900" in s
True
>>> with open("Demo.txt", encoding="utf-8", errors="surrogatepass") as f:
...   for line in f:
...     pass
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "C:\Python\34x64\lib\codecs.py", line 319, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 8190: invalid continuation byte

- end of demo -

I attached the file used for the demo such that you can reproduce the problem.

If I change all 0xED bytes in the file to 0xEC (i.e. effectively change all 
surrogates to non-surrogates), the problem disappears.

The original file I noticed the problem with was 73 MB. The demo file was 
derived from the original by removing data around the critical section, keeping 
the alignment to 16-k-blocks, and then replacing all printable ASCII characters 
with "x".

If I change the file length by adding or removing 16 bytes to / from the 
beginning of the demo file, the problem disappears, so alignment seems to be an 
issue.

All this seems to indicate that the utf-8 decoder has problems when used for 
incremental decoding and a single surrogate appears around the block boundary.
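
The suspected mechanism can be probed without the 73 MB file by feeding an incremental decoder two chunks whose boundary falls at every possible offset around an encoded lone surrogate. This is a minimal sketch, not the attached demo: the sample string and the helper name decode_in_chunks are made up for illustration, and whether any split raises depends on the Python version, so the script only reports what it observes.

```python
import codecs


def decode_in_chunks(data, split_at):
    """Decode data as two chunks; return the text or the exception raised."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="surrogatepass")
    try:
        text = decoder.decode(data[:split_at])
        text += decoder.decode(data[split_at:], final=True)
        return text
    except UnicodeDecodeError as exc:
        return exc


# Hypothetical sample: a lone surrogate surrounded by ASCII.
expected = "abc\ud900def"
data = expected.encode("utf-8", "surrogatepass")

# Decoding the whole buffer at once round-trips.
assert data.decode("utf-8", "surrogatepass") == expected

# Try every chunk boundary, including those inside the 3-byte sequence.
for split_at in range(len(data) + 1):
    outcome = decode_in_chunks(data, split_at)
    status = "ok" if outcome == expected else "failed: %r" % (outcome,)
    print(split_at, status)
```

The surrogate U+D900 is encoded as the three bytes ED A4 80, so the splits at offsets 4 and 5 are the ones that cut through it.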

--
components: Unicode
files: Demo.txt
messages: 243376
nosy: RalfM, ezio.melotti, haypo
priority: normal
severity: normal
status: open
title: Exception with utf-8, surrogatepass and incremental decoding
type: behavior
versions: Python 3.4
Added file: http://bugs.python.org/file39400/Demo.txt

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue24214
___



[issue20413] Errors in documentation of standard codec error handlers

2014-01-27 Thread RalfM

New submission from RalfM:

The standard library documentation lists the standard codec error handlers in 
three places:

(a) 2. Built-in Functions, section open()
(b) 7.2 codecs - Codec registry and base classes
(c) 7.2.1 Codec Base Classes

As far as I can judge these lists, (c) looks ok, but (a) and (b) contain two 
errors:
1. 'surrogatepass' is not mentioned.
2. 'surrogateescape' is described as: 
   'on decoding, replace with code points in the Unicode Private
   Use Area ranging from U+DC80 to U+DCFF. These private code points
   will ...' 
   This is incorrect insofar as U+DC80 to U+DCFF are not private 
   code points, but (low-)surrogate code points. This is correctly
   explained in (c) and in PEP 383 (and, of course, in the Unicode 
   standard, chapter 16).

I suggest correcting (a) and (b) by
* adding 'surrogatepass' with the description given in (c),
* changing the description of 'surrogateescape' to something like: 
  'on decoding, replace with surrogate code points ranging from 
  U+DC80 to U+DCFF. These surrogate code points will ...'.
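
The distinction between surrogate and private-use code points can be checked from Python itself. The following is a small illustration, with byte values chosen arbitrarily: it decodes two bytes that are invalid in UTF-8 with 'surrogateescape' and asks unicodedata for the general category of the result.

```python
import unicodedata

# Two bytes that are not valid UTF-8 on their own.
raw = b"\x80\xff"

# 'surrogateescape' maps each undecodable byte 0xNN to U+DCNN.
text = raw.decode("utf-8", "surrogateescape")
print([hex(ord(ch)) for ch in text])

# General category: "Cs" is Surrogate, "Co" is Private Use.
print(unicodedata.category(text[0]))    # escaped byte
print(unicodedata.category("\ue000"))   # first private-use code point, for contrast

# Encoding with the same handler restores the original bytes.
print(text.encode("utf-8", "surrogateescape") == raw)
```

The escaped characters report category "Cs", whereas a real Private Use Area code point such as U+E000 reports "Co".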

These errors are present in the documentation (more precisely, the .chm files) 
of at least 
- Python 3.3.3
- Python 3.3.4rc1
- Python 3.4.0b3.

--
assignee: docs@python
components: Documentation
messages: 209477
nosy: RalfM, docs@python
priority: normal
severity: normal
status: open
title: Errors in documentation of standard codec error handlers
type: enhancement
versions: Python 3.3, Python 3.4

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue20413
___