Re: [Python-Dev] Python3 "complexity" - 2 use cases

2014-01-10 Thread Ben Finney
"Jim J. Jewett"  writes:

>  
> > Steven D'Aprano wrote:
> >> I think that heuristics to guess the encoding have their role to play,
> >> if the caller understands the risks.
>
> Ben Finney wrote:
> > In my opinion, content-type guessing heuristics certainly don't belong
> > in the standard library.
>
> It would be great if there were never any need to guess.  But in the
> real world, there is -- and often the user won't know any more than
> Python does.

That's why I think it's great to have heuristic guessing code available
as a third-party library.

> So when it is time to guess, a source of good guesses is an important
> battery to include.

Why is it important enough to deserve that privilege, over the thousands
of other candidates for the standard library? The barrier for entry to
the standard library is higher than mere usefulness.

> We should explicitly treat autodetection like time zone data --
> there is no promise that the "right answer" (or at least the "best
> guess") won't change, even within a release.

But there is exactly one set of authoritative time zones at any
particular point in time. That's why it makes sense to have that set of
authoritative values available in the standard library.

Heuristic guesses about content types do not have the property of
exactly one authoritative source, so your analogy is not compelling.

-- 
 \     “Unix is an operating system, OS/2 is half an operating system, |
  `\    Windows is a shell, and DOS is a boot partition virus.” —Peter |
_o__)                                                        H. Coffin |
Ben Finney

___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


[Python-Dev] Python3 "complexity" - 2 use cases

2014-01-10 Thread Jim J. Jewett

 
> Steven D'Aprano wrote:
>> I think that heuristics to guess the encoding have their role to play,
>> if the caller understands the risks.

Ben Finney wrote:
> In my opinion, content-type guessing heuristics certainly don't belong
> in the standard library.

It would be great if there were never any need to guess.  But in the
real world, there is -- and often the user won't know any more than
Python does.  So when it is time to guess, a source of good guesses
is an important battery to include.

The HTML5 specifications go through some fairly extreme contortions
to document what browsers actually do, as opposed to what previous
standards have mandated.  They don't currently specify how to guess
(though I think a draft once tried, since the major browsers all do
it, and at the time did it similarly), but the specs do explicitly
support such a step, and do provide an implementation note
encouraging user-agents to do at least minimal auto-detection.  

http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#determining-the-character-encoding

My own opinion is therefore that Python SHOULD provide better support
for both of the following use cases:

(1)  Treat this file like it came from the web -- including
     autodetection and even overriding explicit charset
     declarations for certain charsets.

We should explicitly treat autodetection like time zone data --
there is no promise that the "right answer" (or at least the
"best guess") won't change, even within a release.

I offer no opinion on whether chardet in particular is still
too volatile, but the docs should warn that the API is driven
by possibly changing external data.
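
A minimal sketch of what "at least minimal auto-detection" could
look like, using only codecs that ship with Python.  The function
name guess_encoding and its fallback order (BOM, then the declared
charset, then UTF-8, then latin-1) are assumptions made for this
illustration; this is not chardet's algorithm and not the procedure
in the HTML5 spec.

```python
import codecs

# Hypothetical table for this sketch.  UTF-32 is listed before UTF-16
# because the UTF-32-LE BOM begins with the UTF-16-LE BOM bytes.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(data, declared=None):
    """Return a best-guess codec name for the bytes in `data`.

    This is only a guess: the answer can be wrong, and a smarter
    heuristic could legitimately return something different.
    """
    # 1. A byte order mark is the strongest hint available.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # 2. Trust an explicit declaration only if the data decodes with it.
    if declared is not None:
        try:
            data.decode(declared)
            return declared
        except (LookupError, UnicodeDecodeError):
            pass
    # 3. Valid UTF-8 is rarely an accident for non-ASCII input.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # 4. latin-1 maps every byte to a code point, so it never fails.
        return "latin-1"
```

A caller would then decode with data.decode(guess_encoding(data)),
accepting that the result is a guess rather than a fact.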

(2)  Treat this file as "ASCII+", where anything non-ASCII
     will (at most) be written back out unchanged; it doesn't
     even need to be converted to text.

At this time, I don't know whether the right answer is making it
easy to default to surrogate-escape for all error-handling,
adding more bytes methods, encouraging use of Python's latin-1
variant, offering a dedicated (new?) codec, or some new suggestion.
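
For reference, a short sketch of two of the options named above as
they can be spelled in Python 3 today: the surrogateescape error
handler and a latin-1 round trip.  The helper names and the sample
bytes are made up for this illustration; only the codec names and
the error handler come from Python itself.

```python
def edit_ascii_plus(raw, old, new):
    """Replace ASCII text in `raw` and pass non-ASCII bytes through."""
    # Each non-ASCII byte becomes a lone surrogate (U+DC80..U+DCFF)
    # instead of raising UnicodeDecodeError.
    text = raw.decode("ascii", errors="surrogateescape")
    text = text.replace(old, new)
    # Encoding with the same handler turns the surrogates back into
    # the original bytes.
    return text.encode("ascii", errors="surrogateescape")


def latin1_round_trip(raw):
    """Decode as latin-1 and encode again; every byte survives."""
    # latin-1 maps bytes 0x00..0xFF to code points U+0000..U+00FF,
    # so this never fails, but non-ASCII characters may be wrong.
    return raw.decode("latin-1").encode("latin-1")


# Made-up sample data: ASCII headers with bytes of unknown encoding.
sample = b"Name: Ren\xe9\r\nX-Note: \xff\xfe mystery bytes\r\n"
edited = edit_ascii_plus(sample, "Name:", "Full-Name:")
```

In both cases the non-ASCII bytes come back out unchanged, but the
intermediate str is not meaningful text for those bytes, and the
surrogateescape form cannot be encoded to UTF-8 without the same
handler.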

I do know that this use case is important, and that Python 3
currently looks clumsy compared to Python 2.


-jJ

-- 

If there are still threading problems with my replies, please 
email me with details, so that I can try to resolve them.  -jJ
