Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-11 Thread Nick Coghlan
On 12 January 2018 at 14:55, Steve Dower wrote: > On 12Jan2018 0342, Random832 wrote: >> >> On Thu, Jan 11, 2018, at 04:55, Serhiy Storchaka wrote: >>> >>> The way of solving this issue in Python is using an error handler. The >>> "surrogateescape" error handler is specially designed for lossless

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-11 Thread Stephen J. Turnbull
Executive summary: we already do. Nathaniel suggests we should conform to the WHAT-WG standard. But AFAGCT[1], there is no such thing as "WHATWG versions of legacy encodings". The document at https://encoding.spec.whatwg.org/ has the following normative specifications (capitalized words are pres

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-11 Thread Steve Dower
On 12Jan2018 0342, Random832 wrote: On Thu, Jan 11, 2018, at 04:55, Serhiy Storchaka wrote: The way of solving this issue in Python is using an error handler. The "surrogateescape" error handler is specially designed for lossless reversible decoding. It maps every unassigned byte in the range 0x
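
A minimal sketch of the error-handler approach quoted above, using only the standard "surrogateescape" handler. It assumes Python's cp1252 codec, where byte 0x81 is unassigned; the sample bytes are invented for illustration and are not taken from the thread.

```python
# Byte 0x81 has no mapping in Python's cp1252 codec, so strict decoding fails.
raw = b"caf\xe9 \x81"

try:
    raw.decode("cp1252")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# surrogateescape carries the undecodable byte through as a lone surrogate
# (0x81 becomes U+DC81) instead of raising.
text = raw.decode("cp1252", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = text.encode("cp1252", errors="surrogateescape")
```

The resulting string contains a lone surrogate, so it round-trips losslessly but cannot be encoded to UTF-8 without the same handler.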

Re: [Python-ideas] Make functions, methods and descriptor types living in the types module

2018-01-11 Thread Steve Dower
I certainly have code that joins __module__ with __name__ to create a fully-qualified name (with special handling for those builtins that are not in builtins), and IIUC __qualname__ doesn't normally include the module name either (it's intended for nested types/functions). Can we make it visib
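
The kind of helper described above might look roughly like the following. This is a sketch only: the function name `full_name` and the `Outer`/`Inner` classes are hypothetical, and the treatment of "builtins" is a simplification rather than the special handling mentioned in the message.

```python
import collections


def full_name(obj):
    """Join __module__ and __qualname__ into a dotted name (hypothetical helper)."""
    module = getattr(obj, "__module__", None)
    qualname = obj.__qualname__  # covers nesting, but never includes the module
    if module in (None, "builtins"):
        # Simplification: real builtins are reachable by their bare name.
        return qualname
    return f"{module}.{qualname}"


class Outer:
    class Inner:
        pass


int_name = full_name(int)
ordered_dict_name = full_name(collections.OrderedDict)
inner_name = full_name(Outer.Inner)
```

The `Inner` case shows why `__qualname__` alone is not enough: it records the nesting path but still has to be joined with `__module__` by hand.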

Re: [Python-ideas] Make functions, methods and descriptor types living in the types module

2018-01-11 Thread Victor Stinner
I like the idea of having a fully qualified name that "works" (can be resolved). I don't think that repr() should change, right? Can this change break backward compatibility somehow? Victor On 11 Jan 2018 at 21:00, "Serhiy Storchaka" wrote: > Currently the classes of functions (implemen

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-11 Thread MRAB
On 2018-01-11 19:42, Rob Speer wrote: > The question is rather: how often does web-XXX mojibake happen? Very often. Particularly web-1252 mixed up with UTF-8. My ftfy library is tested on data from Twitter and the Common Crawl, both prime sources of mojibake. One common mojibake sequence is w

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-11 Thread Random832
On Thu, Jan 11, 2018, at 14:55, Rob Speer wrote: > There is one more difference I have found between Python's encodings and > WHATWG's. In Python's codepage 1255, b'\xca' is undefined. In WHATWG's, it > maps to U+05BA HEBREW POINT HOLAM HASER FOR VAV. I haven't tracked down > what the Unicode Conso

[Python-ideas] Make functions, methods and descriptor types living in the types module

2018-01-11 Thread Serhiy Storchaka
Currently the classes of functions (implemented in Python and builtin), methods, and different types of descriptors, generators, etc. have the __module__ attribute equal to "builtins" and a name that can't be used for accessing the class. >>> def f(): pass ... >>> type(f) <class 'function'> >>> type(f).__modul
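
The mismatch described above can be reproduced with a short check. This is a sketch that assumes a CPython version in which the behavior has not changed; the variable names are illustrative only.

```python
import builtins
import types


def f():
    pass


cls = type(f)

# The name the class advertises for itself: module plus class name.
advertised = f"{cls.__module__}.{cls.__name__}"

# That name cannot be resolved: the builtins module has no "function" attribute.
resolvable = hasattr(builtins, cls.__name__)

# The class is actually reachable through the types module.
same_class = types.FunctionType is cls
```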

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-11 Thread Rob Speer
On Thu, 11 Jan 2018 at 11:43 Random832 wrote: > Maybe we need a new error handler that maps unassigned bytes in the range > 0x80-0x9f to a single character in the range U+0080-U+009F. Do any of the > encodings being discussed have behavior other than the "normal" version of > the encoding plus wh
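
The handler suggested in the quoted question could be sketched with `codecs.register_error`. The handler name "c1_passthrough" is made up for this example and is not an existing or proposed stdlib name; the sketch covers decoding only.

```python
import codecs


def c1_passthrough(exc):
    """Map undecodable bytes 0x80-0x9F to U+0080-U+009F (hypothetical handler)."""
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        if all(0x80 <= byte <= 0x9F for byte in bad):
            # Return the replacement text and the position to resume decoding at.
            return "".join(chr(byte) for byte in bad), exc.end
    # Anything else (encoding errors, bytes outside the C1 range) stays an error.
    raise exc


codecs.register_error("c1_passthrough", c1_passthrough)

# 0x93 and 0x94 are assigned in cp1252 (curly quotes); 0x81 and 0x8D are not.
decoded = b"\x93hi\x94 \x81\x8d".decode("cp1252", errors="c1_passthrough")
```

Assigned bytes still go through the codec's normal table, so only the gaps in the 0x80-0x9F range are affected.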

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-11 Thread Rob Speer
> The question is rather: how often does web-XXX mojibake happen? Very often. Particularly web-1252 mixed up with UTF-8. My ftfy library is tested on data from Twitter and the Common Crawl, both prime sources of mojibake. One common mojibake sequence is when a right curly quote is encoded as UTF-
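
The general mechanism behind this kind of mojibake can be shown in a few lines. The preview above is cut off, so this sketch illustrates the generic UTF-8/windows-1252 mix-up using Python's cp1252 codec and may not be the exact sequence the message goes on to describe.

```python
original = "\u2019"  # RIGHT SINGLE QUOTATION MARK

# Correctly encoded as UTF-8, the quote becomes three bytes: E2 80 99.
utf8_bytes = original.encode("utf-8")

# Misread as windows-1252, each byte turns into its own character.
mojibake = utf8_bytes.decode("cp1252")

# Reversing the two steps recovers the original character.
repaired = mojibake.encode("cp1252").decode("utf-8")
```

This reversal works here because all three bytes are assigned in Python's cp1252; sequences containing an unassigned byte such as 0x9D are where the stricter codec gets in the way.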

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-11 Thread Random832
On Thu, Jan 11, 2018, at 03:58, M.-A. Lemburg wrote: > There's a problem with these encodings: they are mostly meant > for decoding (broken) data, but as soon as we have them in the stdlib, > people will also start using them for encoding data, producing more > corrupted data. Is it really corrupt

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-11 Thread Random832
On Thu, Jan 11, 2018, at 04:55, Serhiy Storchaka wrote: > The way of solving this issue in Python is using an error handler. The > "surrogateescape" error handler is specially designed for lossless > reversible decoding. It maps every unassigned byte in the range > 0x80-0xff to a single characte

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-11 Thread Antoine Pitrou
On Thu, 11 Jan 2018 05:18:43 -0800 Nathaniel Smith wrote: > I'm not an expert here or anything, but from what we've been hearing it > sounds like it must be used by all standard-compliant HTML parsers. I don't > *like* the standard much, but I don't think that the stdlib should refuse > to handle

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-11 Thread Nathaniel Smith
On Jan 11, 2018 4:05 AM, "Antoine Pitrou" wrote: Define "widely used". If web-XXX is a superset of windows-XXX, then perhaps web-XXX is "used" in the sense of "used to decode valid windows-XXX data" (but windows-XXX could be used just as well to decode the same data). The question is rather: ho

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-11 Thread Antoine Pitrou
On Wed, 10 Jan 2018 16:24:33 -0800 Chris Barker wrote: > On Wed, Jan 10, 2018 at 11:04 AM, M.-A. Lemburg wrote: > > > I don't believe it's a good strategy to create the confusion that > > WHATWG is introducing by using the same names for non-standard > > encodings. > > > > agreed. > > > > P

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-11 Thread Stephan Houben
On 11 Jan 2018 at 10:56, "Serhiy Storchaka" wrote: On 09.01.18 23:15, Rob Speer wrote: > > > For the sake of discussion, let's call this encoding "web-1252". WHATWG > calls it "windows-1252", I'd suggest naming it "whatwg-windows-1252" then, and in general "whatwg-" + whatwgs_name_of_encoding S

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-11 Thread Serhiy Storchaka
On 09.01.18 23:15, Rob Speer wrote: There is an encoding with no name of its own. It's supported by every current web browser and standardized by WHATWG. It's so prevalent that if you ask a Web browser to decode "iso-8859-1" or "windows-1252", you will get this encoding _instead_. It is probably th

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-11 Thread M.-A. Lemburg
On 11.01.2018 10:01, Chris Angelico wrote: > On Thu, Jan 11, 2018 at 7:58 PM, M.-A. Lemburg wrote: >> On 11.01.2018 01:22, Nick Coghlan wrote: >>> On 11 January 2018 at 05:04, M.-A. Lemburg wrote: For the stdlib, I think we should stick to standards and not go for spreading non-standard

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-11 Thread Chris Angelico
On Thu, Jan 11, 2018 at 7:58 PM, M.-A. Lemburg wrote: > On 11.01.2018 01:22, Nick Coghlan wrote: >> On 11 January 2018 at 05:04, M.-A. Lemburg wrote: >>> For the stdlib, I think we should stick to standards and >>> not go for spreading non-standard ones. >>> >>> So -1 on adding WHATWG encodings t

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-11 Thread M.-A. Lemburg
On 11.01.2018 01:22, Nick Coghlan wrote: > On 11 January 2018 at 05:04, M.-A. Lemburg wrote: >> For the stdlib, I think we should stick to standards and >> not go for spreading non-standard ones. >> >> So -1 on adding WHATWG encodings to the stdlib. > > We already support HTML5 in the standard li