[issue18291] codecs.open interprets FS, RS, GS as line ends

2018-10-05 Thread Marc-Andre Lemburg
Marc-Andre Lemburg added the comment: On 05.10.2018 14:06, Serhiy Storchaka wrote: > > Then this particularity of codecs streams should be explicitly documented. Yes, probably. Such extensions of scope for different character types in Unicode vs. ASCII are a common gotcha when moving from

[issue18291] codecs.open interprets FS, RS, GS as line ends

2018-10-05 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: Then this particularity of codecs streams should be explicitly documented. codecs.open() was advertised as a way of writing portable code for Python 2 and 3, and it can still be used in many old programs. --

[issue18291] codecs.open interprets FS, RS, GS as line ends

2018-10-05 Thread Marc-Andre Lemburg
Marc-Andre Lemburg added the comment: Sorry, I probably wasn't clear: the codecs interface is a direct interface to the Unicode codecs and thus has to work according to what Unicode defines. Your PR changes this to be non-compliant and does this for all codecs. That's a major backwards and

[issue18291] codecs.open interprets FS, RS, GS as line ends

2018-10-05 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: PR 9711 splits lines using regular expressions. This fixes this issue without changing str.splitlines(). After adding a new option in str.splitlines() the code in master can be simplified. -- resolution: wont fix -> stage: resolved -> patch
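
For illustration only, here is a minimal sketch of regex-based line splitting that treats just \r\n, \r and \n as line ends. It is not the code from PR 9711; the names _LINE_RE and split_ascii_lines are hypothetical and exist only in this example.

```python
import re

# Hypothetical helper, not taken from PR 9711.
# A line is either: any run of non-terminator characters followed by one
# terminator (\r\n, \r or \n), or a final non-empty run with no terminator.
_LINE_RE = re.compile(r"[^\r\n]*(?:\r\n|\r|\n)|[^\r\n]+")


def split_ascii_lines(text, keepends=False):
    """Split text on \\r\\n, \\r and \\n only, leaving FS/GS/RS untouched."""
    lines = _LINE_RE.findall(text)
    if not keepends:
        # Each match ends with at most one terminator, so rstrip removes
        # exactly that terminator and nothing from the line content.
        lines = [line.rstrip("\r\n") for line in lines]
    return lines


sample = "a\x1cb\r\nc\rd\ne"
print(split_ascii_lines(sample))
print(split_ascii_lines(sample, keepends=True))
```

On this sample the helper returns the same pieces that bytes.splitlines() gives for the encoded data, whereas str.splitlines() would also break at the FS character.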

[issue18291] codecs.open interprets FS, RS, GS as line ends

2018-10-05 Thread Serhiy Storchaka
Change by Serhiy Storchaka: pull_requests: +9094

[issue18291] codecs.open interprets FS, RS, GS as line ends

2018-10-05 Thread Marc-Andre Lemburg
Marc-Andre Lemburg added the comment: The Unicode .splitlines() splits strings on what Unicode defines as linebreak characters (all code points with character properties Zl or bidirectional property B). This is different than what typical CSV file parsers or other parsers built for the
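
As a rough check of the properties mentioned in this comment, the sketch below uses the standard unicodedata module to look up the general category and bidirectional class of a few characters, and records whether str.splitlines() breaks on each. The candidates table and the describe helper are names made up for this example.

```python
import unicodedata

# Characters to inspect; the labels are only for readability.
candidates = {
    "FS": "\x1c",
    "GS": "\x1d",
    "RS": "\x1e",
    "LINE SEPARATOR": "\u2028",
    "LF": "\n",
    "SPACE": " ",
}


def describe(ch):
    """Return Unicode properties of ch and whether str.splitlines splits on it."""
    return {
        "category": unicodedata.category(ch),
        "bidi": unicodedata.bidirectional(ch),
        "splits": len(("a" + ch + "b").splitlines()) == 2,
    }


report = {name: describe(ch) for name, ch in candidates.items()}
for name, props in report.items():
    print(name, props)
```

FS, GS and RS are control characters with bidirectional class B, which is why they fall under the rule described here, while U+2028 qualifies through its Zl category.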

[issue18291] codecs.open interprets FS, RS, GS as line ends

2018-10-05 Thread Neil Schemenauer
Neil Schemenauer added the comment: I just found bug #22232 myself but thanks for pointing it out. > changing the behavior unconditionally is not an option At this point, I disagree. If I do a search on the web, lots of pages referring to str.splitlines() seem to imply that it splits only

[issue18291] codecs.open interprets FS, RS, GS as line ends

2018-10-04 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: There is an open issue for changing str.splitlines(): issue22232. It would help to fix this issue. The only problem is that we don't have agreement about the new parameter name (and changing the behavior unconditionally is not an option). --

[issue18291] codecs.open interprets FS, RS, GS as line ends

2018-10-04 Thread Neil Schemenauer
Neil Schemenauer added the comment: New patch that changes str.splitlines to work like Python 2 str.splitlines and like Python 3 bytes.splitlines. Surprisingly, only a few cases in the unit test suite fail. I've fixed them in my patch. -- Added file:

[issue18291] codecs.open interprets FS, RS, GS as line ends

2018-10-04 Thread Karthikeyan Singaravelan
Change by Karthikeyan Singaravelan: nosy: +xtreak

[issue18291] codecs.open interprets FS, RS, GS as line ends

2018-10-04 Thread Neil Schemenauer
Neil Schemenauer added the comment: Some further progress on this. My patch slows down reading files with the codecs module very significantly. So, I think it could never be merged as is. Maybe we would need to implement an alternative str.splitlines that behaves as we want, implemented

[issue18291] codecs.open interprets FS, RS, GS as line ends

2018-10-04 Thread Neil Schemenauer
Neil Schemenauer added the comment: Attached is a rough patch that tries to fix this problem. I changed the behavior in that unicode char 0x2028 is no longer treated as a line separator. It would be trivial to change the regex to support that too, if we want to preserve backwards

[issue18291] codecs.open interprets FS, RS, GS as line ends

2018-10-04 Thread Neil Schemenauer
Neil Schemenauer added the comment: I think one bug here is that codecs readers use str.splitlines() internally. The splitlines method treats a bunch of different characters as line separators, unlike io..readlines(). So, you end up with different behavior between doing
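
The difference described in this comment can be reproduced with a small sketch. It reads the same bytes through the io text layer and through a codecs stream reader (the kind of object that codecs.open() wraps), and also calls str.splitlines() directly. The variable names are chosen for this example, and the codecs result is assumed to follow str.splitlines() as the comment states; it may differ on Python versions where this issue has been resolved.

```python
import codecs
import io

# One logical line containing FS (0x1C), GS (0x1D) and RS (0x1E),
# followed by a second ordinary line.
raw = "a\x1cb\x1dc\x1ed\nsecond\n".encode("utf-8")

# io text layer: only \n, \r and \r\n terminate a line.
io_lines = io.TextIOWrapper(
    io.BytesIO(raw), encoding="utf-8", newline=""
).readlines()

# codecs stream reader for the same bytes.
codecs_lines = codecs.getreader("utf-8")(io.BytesIO(raw)).readlines()

# str.splitlines() applied directly to the decoded text.
split_lines = raw.decode("utf-8").splitlines(keepends=True)

print(io_lines)      # ['a\x1cb\x1dc\x1ed\n', 'second\n']
print(codecs_lines)  # expected to match split_lines while the issue is open
print(split_lines)   # ['a\x1c', 'b\x1d', 'c\x1e', 'd\n', 'second\n']
```

The io reader yields two lines, while str.splitlines() yields five pieces because it also breaks at FS, GS and RS.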

[issue18291] codecs.open interprets FS, RS, GS as line ends

2015-07-10 Thread Martin Panter
Changes by Martin Panter vadmium...@gmail.com: title: codecs.open interprets space as line ends -> codecs.open interprets FS, RS, GS as line ends. http://bugs.python.org/issue18291