Re: [Python-Dev] Bytes path support

2014-08-25 Thread Oleg Broytman
Hi! Thank you very much, Nick, for the long and detailed explanation!

On Sun, Aug 24, 2014 at 01:27:55PM +1000, Nick Coghlan ncogh...@gmail.com 
wrote:
 On 24 August 2014 04:37, Oleg Broytman p...@phdru.name wrote:
  On Sat, Aug 23, 2014 at 06:40:37PM +0100, Paul Moore p.f.mo...@gmail.com 
  wrote:
  Generally, it seems to be mostly a reaction to the repeated claims
  that Python, or Windows, or whatever, is broken.
 
 Ah, if that's the only problem I certainly can live with that. My
  problem is that it *seems* this anti-Unix attitude infiltrates Python
  core development. I very much hope I'm wrong and it really isn't.
 
 The POSIX locale based approach to handling encodings is genuinely
 broken - it's almost as broken as code pages are on Windows. The
 fundamental flaw is that locales encourage *bilingual* computing:
 handling English plus one other language correctly. Given a global
 internet, bilingual computing *is a fundamentally broken approach*. We
 need multilingual computing (any human language, all the time), and
 that means Unicode.
 
 As some examples of where bilingual computing breaks down:
 
 * My NFS client and server may have different locale settings
 * My FTP client and server may have different locale settings
 * My SSH client and server may have different locale settings
 * I save a file locally and send it to someone with a different locale setting
 * I attempt to access a Windows share from a Linux client (or vice-versa)
 * I clone my POSIX hosted git or Mercurial repository on a Windows client
 * I have to connect my Linux client to a Windows Active Directory
 domain (or vice-versa)
 * I have to interoperate between native code and JVM code
 
 The entire computing industry is currently struggling with this
 monolingual (ASCII/Extended ASCII/EBCDIC/etc) -> bilingual (locale
 encoding/code pages) -> multilingual (Unicode) transition. It's been
 going on for decades, and it's still going to be quite some time
 before we're done.
 
 The POSIX world is slowly clawing its way towards a multilingual model
 that actually works: UTF-8
 Windows (including the CLR) and the JVM adopted a different
 multilingual model, but still one that actually works: UTF-16-LE
 
 POSIX is hampered by legacy ASCII defaults in various subsystems (most
 notably the default locale) and the assumption that system metadata is
 just bytes (an assumption that breaks down as soon as you have to
 hand that metadata over to another machine that may have different
 locale settings)
 Windows is hampered by the fact they kept the old 8-bit APIs around
 for backwards compatibility purposes, so applications using those APIs
 are still only bilingual (at best) rather than multilingual.
 JVM and CLR applications will at least handle the Basic Multilingual
 Plane (UCS-2) correctly, but may not correctly handle code points
 beyond the 16-bit boundary (this is the "Python narrow builds don't
 handle Unicode correctly" problem that was resolved for Python 3.3+ by
 PEP 393)
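
As a small illustration of the 16-bit boundary mentioned above (a sketch added for clarity, not part of the original message; the helper name and the sample characters are arbitrary choices): a code point outside the Basic Multilingual Plane needs two 16-bit code units in UTF-16, which is where UCS-2-minded code tends to miscount, while Python 3.3+ counts it as one code point.

```python
# Illustrative sketch: BMP vs. non-BMP code points in UTF-16.
# utf16_code_units is a hypothetical helper written for this example.

def utf16_code_units(text):
    """Return the number of 16-bit code units needed to encode text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2

bmp_char = "\u00e9"         # LATIN SMALL LETTER E WITH ACUTE, inside the BMP
astral_char = "\U0001F40D"  # SNAKE, beyond the 16-bit boundary

for ch in (bmp_char, astral_char):
    # len() counts code points on Python 3.3+ (PEP 393)
    print(hex(ord(ch)), "code points:", len(ch),
          "UTF-16 code units:", utf16_code_units(ch))
```

Code that indexes or slices by 16-bit unit can split the second character in half, which is the failure mode described above.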
 
 Individual users (including some organisations) may have the luxury of
 saying "well, all my clients and all my servers are POSIX, so I don't
 care about interoperability with other platforms". As the providers of
 a cross-platform runtime environment, we don't have that luxury - we
 need to figure out how to get *all* the major platforms playing nice
 with each other, regardless of whether they chose UTF-8 or UTF-16-LE
 as the basis for their approach towards providing multilingual
 computing environments.
 
 Historically, that question of cross platform interoperability for
 open source software has been handled in a few different ways:
 
 * Don't really interoperate with anybody, reinvent all the wheels (the JVM way)
 * Emulate POSIX on Windows (the Cygwin/MinGW way)
 * Let the application developer figure it out (the Python 2 way)
 
 The first approach is inordinately expensive - it took the resources
 of Sun in its heyday to make it possible, and it effectively locks the
 JVM out of certain kinds of computing (e.g. it's hard to do array
 oriented programming in JVM languages, because the CPU and GPU
 vectorisation features aren't readily accessible).
 
 The second approach prevents the creation of truly native Windows
 applications, which makes it uncompelling as a way of attracting
 Windows users - it sends a clear signal that the project doesn't
 *really* care about supporting Windows as a platform, but instead only
 grudgingly accepts that there are Windows users out there that might
 like to use their software.
 
 The third approach is the one we tried for a long time with Python 2,
 and essentially found to be an "experts only" solution. Yes, you can
 *make* it work, but the runtime isn't set up so it works *by default*.
 
 The Unicode changes in Python 3 are a result of the Python core
 development team saying it really shouldn't be this hard for
 application developers to get cross-platform interoperability between
 correctly configured systems when 

Re: [Python-Dev] Bytes path support

2014-08-25 Thread R. David Murray
On Sat, 23 Aug 2014 19:33:06 +0300, Marko Rauhamaa ma...@pacujo.net wrote:
 R. David Murray rdmur...@bitdance.com:
 
  The same problem existed in python2 if your goal was to produce a stream
  with a consistent encoding, but now python3 treats that as an error.
 
 I have a different interpretation of the situation: as a rule, use byte
 strings in Python3. Text strings are a special corner case for
 applications that have to deal with human languages.

Clearly, then, you are writing unix (or perhaps posix)-only programs.

Also, as has been discussed in this thread previously, any program that
deals with filenames is dealing with human readable languages, even
if posix itself treats the filenames as bytes.

--David
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Bytes path support

2014-08-25 Thread Isaac Morland

On Sat, 23 Aug 2014, Marko Rauhamaa wrote:


Isaac Morland ijmor...@uwaterloo.ca:


 HTTP/1.1 200 OK
 Content-Type: text/html; charset=ISO-8859-1

 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
 <html>
 <head>
 <meta http-equiv="Content-Type" content="text/html; charset=utf-16">


For HTML it's not quite so bad.  According to the HTML 4 standard:
[...]

The Content-Type header takes precedence over a meta element. I
thought I read once that the reason was to allow proxy servers to
transcode documents but I don't have a cite for that. Also, the meta
element must only be used when the character encoding is organized
such that ASCII-valued bytes stand for ASCII characters so the
initial UTF-16 example wouldn't be conformant in HTML.


That's not how I read it:

  The META declaration must only be used when the character encoding is
  organized such that ASCII characters stand for themselves (at least
  until the META element is parsed). META declarations should appear as
  early as possible in the HEAD element.

  URL: http://www.w3.org/TR/1998/REC-html40-19980424/charset.html#doc-char-set

IOW, you must obey the HTTP character encoding until you have parsed a
conflicting META content-type declaration.



From the same document:


--
To sum up, conforming user agents must observe the following priorities
when determining a document's character encoding (from highest priority to
lowest):

  * An HTTP charset parameter in a Content-Type field.
  * A META declaration with http-equiv set to Content-Type and a value
    set for charset.
  * The charset attribute set on an element that designates an external
    resource.
--


(In the original they are numbered)

This is a priority list - if the Content-Type header gives a charset, it 
takes precedence, and all other sources for the encoding are ignored.  The 
charset= on an img or similar is only used if it is the only source 
for the encoding.
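
The priority order can be sketched as a few lines of Python. This is only an illustration: the function and parameter names are made up for the example, and only the ordering comes from the quoted text.

```python
# Hypothetical helper: the function and parameter names are invented for
# this sketch; only the priority order comes from the quoted HTML 4 text.

def pick_charset(http_charset=None, meta_charset=None, element_charset=None):
    """Return the first charset available, from highest priority to lowest."""
    for candidate in (http_charset, meta_charset, element_charset):
        if candidate:
            return candidate
    return None  # no source gave an encoding; the caller needs its own fallback

# The example from earlier in the thread: header says ISO-8859-1, META says utf-16.
print(pick_charset(http_charset="ISO-8859-1", meta_charset="utf-16"))
```

With the header present, the META value is never consulted, which is the point of the list being a priority list.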


The "at least until the META element is parsed" bit allows for the use of 
encodings which make use of shifting.  So maybe they start out 
ASCII-compatible, but after a particular shift byte is seen those bytes 
now stand for Japanese Kanji characters until another shift byte is seen. 
This is allowed by the specification, as long as none of the 
non-ASCII-compatible stuff is seen before the META element.
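
A rough way to see this in Python, assuming ISO-2022-JP as the shifting encoding (my choice for the sketch; there the shift is an escape sequence rather than a single byte, and the sample string is arbitrary):

```python
# Sketch using ISO-2022-JP, one encoding that works by shifting.
# The sample string is an arbitrary choice for the example.

text = "meta \u65e5\u672c"           # ASCII prefix followed by two kanji
encoded = text.encode("iso2022_jp")

print(encoded)
# Every byte is 7-bit, so byte values alone cannot tell ASCII from kanji.
print(all(byte < 0x80 for byte in encoded))
# The ASCII prefix is encoded as itself, before any shift sequence.
print(encoded.startswith(b"meta "))
# An escape byte introduces the shift into the kanji character set.
print(b"\x1b" in encoded)
```

After the escape sequence, bytes that look like ASCII letters and punctuation stand for kanji until the encoder shifts back.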



The author of the standard keeps a straight face and continues:


I like your way of putting this - "straight face" indeed.  The third 
option really is a hack to allow working around nonsensical situations 
(and even the META tag is pretty questionable).  All this complexity 
because people can't be bothered to do things properly.



  For cases where neither the HTTP protocol nor the META element
  provides information about the character encoding of a document, HTML
  also provides the charset attribute on several elements. By combining
  these mechanisms, an author can greatly improve the chances that,
  when the user retrieves a resource, the user agent will recognize the
  character encoding.


Isaac Morland   CSCF Web Guru
DC 2554C, x36650   WWW Software Specialist


Re: [Python-Dev] Bytes path related questions for Guido

2014-08-25 Thread Stephen J. Turnbull
Nick Coghlan writes:

  purge_surrogate_escapes was the other term that occurred to me.

"purge" suggests removal, not replacement.  That may be useful too.

neutralize_surrogate_escapes(s, remove=False, replacement='\uFFFD')

maybe?  (Of course the "remove" argument is feature creep, so I'm only
about +0.5 myself.  And the name is long, but I can't think of any
better synonyms for "make safe" in English right now).

  Either way, my use case is to filter them out when I *don't* want to
  pass them along to other software, but would prefer the Unicode
  replacement character to the ASCII question mark created by using the
  replace filter when encoding.

I think it would be preferable to be unicodely correct here by
default, since this is a str -> str function.
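
For concreteness, here is one way such a function could behave. It is a sketch only: the name and signature are the ones floated above, but the body is a guess at the intended semantics, and nothing like this is claimed to exist in the stdlib.

```python
# Sketch only: this name and signature were floated in the thread, and the
# body below is one guess at the behaviour, not an existing stdlib function.

def neutralize_surrogate_escapes(s, remove=False, replacement="\ufffd"):
    """Replace (or drop) the lone surrogates produced by surrogateescape."""
    substitute = "" if remove else replacement
    return "".join(
        substitute if "\udc80" <= ch <= "\udcff" else ch
        for ch in s
    )

# b"\xe9" is not valid UTF-8, so surrogateescape smuggles it through as U+DCE9.
name = b"caf\xe9".decode("utf-8", "surrogateescape")

print(ascii(neutralize_surrogate_escapes(name)))
print(ascii(neutralize_surrogate_escapes(name, remove=True)))
# For comparison, the "replace" handler when encoding yields an ASCII question mark.
print(name.encode("ascii", "replace"))
```

The range U+DC80..U+DCFF is what the surrogateescape error handler uses for undecodable bytes 0x80..0xFF, so ordinary text passes through untouched.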


Re: [Python-Dev] Bytes path support

2014-08-25 Thread Stephen J. Turnbull
R. David Murray writes:

  Also, as has been discussed in this thread previously, any program that
  deals with filenames is dealing with human readable languages, even
  if posix itself treats the filenames as bytes.

That's a bit extreme.  I can name two interesting applications
offhand: git's object database and the Coda filesystem's containers.

It's true that for debugging purposes bytestrings representing largish
numbers are readably encoded (in hexadecimal and decimal,
respectively), but they're clearly not human readable in the sense
you mean.

Nevertheless, these are the applications that prove your rule.  You
don't need the power of pathlib to conveniently (for the programmer)
and efficiently handle the file structures these programs use.
os.path is plenty.
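
A minimal sketch of what I mean, using a git-style loose object path built entirely from bytes (the blob content is an arbitrary example value, and this is an illustration rather than git's actual code):

```python
import hashlib
import os.path

# Sketch: a git-style object path built purely from bytes with os.path.
# The blob content is an arbitrary example value.

content = b"hello\n"
# git hashes a "blob <size>\0" header followed by the content.
digest = hashlib.sha1(b"blob %d\x00" % len(content) + content).hexdigest()
hex_name = digest.encode("ascii")

# Loose objects live under objects/<first 2 hex digits>/<remaining 38>.
path = os.path.join(b".git", b"objects", hex_name[:2], hex_name[2:])

print(path)
print(os.path.basename(path) == hex_name[2:])
```

Nothing here ever needs to become a str: the names are hexadecimal digits, and os.path accepts and returns bytes throughout.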




Re: [Python-Dev] Bytes path support

2014-08-25 Thread R. David Murray
On Tue, 26 Aug 2014 11:25:19 +0900, Stephen J. Turnbull step...@xemacs.org 
wrote:
 R. David Murray writes:
 
   Also, as has been discussed in this thread previously, any program that
   deals with filenames is dealing with human readable languages, even
   if posix itself treats the filenames as bytes.
 
 That's a bit extreme.  I can name two interesting applications
 offhand: git's object database and the Coda filesystem's containers.

As soon as I hit send I realized there were a few counterexamples :)
So, replace "any" with "most".

--David


Re: [Python-Dev] Bytes path support

2014-08-25 Thread Stephen J. Turnbull
Isaac Morland writes:

  I like your way of putting this - straight face indeed.  The third 
  option really is a hack to allow working around nonsensical situations 
  (and even the META tag is pretty questionable).  All this complexity 
  because people can't be bothered to do things properly.

At least in Japan and Russia, "doing things properly" in your sense in
heterogeneous distributed systems is really hard, requiring use of
rather fragile encoding detection heuristics that break at the
slightest whiff of encodings that are unusual in the particular
locale, and in Japan requiring equally fragile transcoding programs
that break on vendor charset variations.  The META charset attribute
is useful in those contexts, and the charset attribute for external
elements may have been useful in the past as well, although I've never
needed it.
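
To illustrate why detection is fragile (a sketch with an arbitrary sample word, not a real detector): the same bytes decode without error under two common Cyrillic encodings, so "it decoded cleanly" tells you nothing about which one was meant.

```python
# Sketch: the same bytes decode "successfully" under two Cyrillic encodings.
# The sample word is an arbitrary choice for the example.

original = "\u041f\u0440\u0438\u0432\u0435\u0442"   # Russian for "hello"
data = original.encode("koi8_r")

as_koi8 = data.decode("koi8_r")
as_cp1251 = data.decode("cp1251")   # no error raised, but the text is wrong

print(as_koi8 == original)
print(as_cp1251 == original)
print(ascii(as_cp1251))
```

A heuristic has to fall back on letter frequencies or similar statistics to choose between the two, and short or unusual inputs defeat that easily.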

I agree that an environment where charset attributes on META and
other elements are needed kinda sucks, but the prerequisite for "doing
things properly" is basically Unicode[1], and that just wasn't going
to happen until at least the 1990s.  To make the transition in less
than several decades would have required a degree of monopoly in
software production that I shudder to contemplate.  Even today there
are programmers around the world grumbling about having to deal with
the Unicode coded character set.


Footnotes: 
[1]  More precisely, a universal coded character set.  TRON code or
MULE code would have done (but yuck!)  ISO 2022 won't do!
